Releases: openzim/python-scraperlib
5.0.0
5.0.0rc4
5.0.0rc3
5.0.0rc2
5.0.0rc1
This is a major release with many breaking changes, though most of them are easy to fix.
It focuses on type safety with the introduction of runtime checks: any call to the zimscraperlib API must match the type definition, or an exception will be raised.
Documentation is available as docstrings and at https://python-scraperlib.readthedocs.io
Main changes include:
- ZIM metadata handling has completely changed, with new types for each kind of metadata.
- The `i18n` module has been redesigned around a single main class, `Language`.
- New `rewriting` module for HTML/CSS/JS (JS rewriting being done at runtime via Wombat).
- Only Python 3.12 is now supported.
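To give a feel for what "runtime checks" means for callers, here is a minimal, stdlib-only sketch. It does not use zimscraperlib or its actual checking mechanism: the `checked` decorator and the `make_title` function are hypothetical stand-ins, and only plain class annotations are verified. It merely shows the general behavior described above, where a call that does not match the annotations raises instead of silently proceeding.

```python
import functools
import inspect
from typing import get_type_hints


def checked(func):
    """Hypothetical, simplified runtime check (plain classes only, no generics)."""
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            # This sketch only verifies annotations that are plain classes
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name}={value!r} is not an instance of {expected.__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@checked
def make_title(name: str, max_length: int) -> str:
    """Hypothetical function standing in for any annotated API call."""
    return name[:max_length]
```

With this sketch, `make_title("Wikipedia", 4)` succeeds, while `make_title("Wikipedia", "4")` raises a `TypeError` because `max_length` is annotated as `int`.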
Added
- Documentation using `mkdocs`, published on readthedocs.com (#92)
- `rewriting` module to rewrite URLs in content for generic scrapers
  - `rewriting.css` to rewrite URLs in CSS files
  - `rewriting.html` to rewrite URLs in HTML files
  - `rewriting.js` to rewrite URLs in JS files (at runtime, using `wombat`)
  - `wombat-setup` javascript module in `javascript/`
- `typing` module with custom types:
  - `Callback` to use where we expect callbacks
  - `SupportsWrite`, `SupportsRead`, `SupportsSeeking`, `SupportsSeekableRead` and `SupportsSeekableWrite`: protocols for IO type annotations
- `zim.metadata` module with a type-based approach for each kind of metadata and helpers for custom ones
  - [`zim.metadata`] `APPLY_RECOMMENDATIONS`: general flag to toggle openZIM-recommended constraints
  - [`zim.metadata`] Type-based classes: `Metadata`, `TextBasedMetadata`, `TextListBasedMetadata`, `DateBasedMetadata`, `IllustrationBasedMetadata`
  - [`zim.metadata`] Usage-based classes: `NameMetadata`, `LanguageMetadata`, `DefaultIllustrationMetadata`, etc.
  - [`zim.metadata`] `StandardMetadataList` to package the standard metadata
  - See details for additional API endpoints and variables
- [`constants`] `DEFAULT_WEB_REQUESTS_TIMEOUT` exposed for `download` module
- [`download`] `stream_file()` now accepts `timeout: int` param (defaults to constant timeout) (#222)
- [`filesystem`] `path_from` context manager to acquire a pathlib `Path` from `Path` or `TemporaryDirectory`
- [`i18n`] `Language`, `get_language()` and `get_language_or_none()`. See breaking changes
- [`image.optimization`] `OptimizePngOptions` dataclass to store PNG options
- [`image.optimization`] `OptimizeJpgOptions` dataclass to store JPEG options
- [`image.optimization`] `OptimizeGifOptions` dataclass to store WebP options
- [`image.optimization`] `OptimizeOptions` dataclass to store cross-formats options
- [`inputs`] `unique_values()` to deduplicate a list while preserving order
- [`logging`] `DEFAULT_FORMAT_WITH_THREADS` as many scrapers use threads
- [`video.encoding`] `reencode()`'s `existing_tmp_path` param
- [`zim.filesystem`] `validate_folder_writable()` to ensure one can write into a folder (#200)
- [`zim.creator`] `Creator._get_first_language_metadata_value()` to retrieve first language from metadata
- [`zim.items`] `no_indexing_indexdata()` to get an IndexData that disables indexing
- [`zim.items`] `URLItem.get_mimetype()` now only returning `str`
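Two of the helpers listed above, `unique_values()` and `path_from`, describe behaviors that are easy to picture with the standard library. The sketch below is not the library's code: the `*_sketch` names are hypothetical, and the cleanup behavior on exit is an assumption of this example, not something stated in these notes.

```python
import contextlib
import tempfile
from collections.abc import Iterator
from pathlib import Path
from typing import Union


def unique_values_sketch(items: list) -> list:
    """Deduplicate hashable items, keeping the first occurrence of each.

    Relies on dicts preserving insertion order.
    """
    return list(dict.fromkeys(items))


@contextlib.contextmanager
def path_from_sketch(
    path: Union[Path, tempfile.TemporaryDirectory],
) -> Iterator[Path]:
    """Yield a pathlib.Path whether given a Path or a TemporaryDirectory."""
    if isinstance(path, tempfile.TemporaryDirectory):
        # Entering a TemporaryDirectory returns its name; in this sketch the
        # directory is removed when the with-block exits (assumption)
        with path as tmp_name:
            yield Path(tmp_name)
    else:
        yield Path(path)
```

For instance, `unique_values_sketch(["en", "fr", "en", "de", "fr"])` gives `["en", "fr", "de"]`, and `with path_from_sketch(tempfile.TemporaryDirectory()) as folder:` exposes `folder` as a regular `Path` inside the block.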
Changed (Breaking)
- Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
- [`constants`] `MANDATORY_ZIM_METADATA_KEYS` and `DEFAULT_DEV_ZIM_METADATA` moved to `zim/metadata`
- [`download`] `YoutubeDownloader.download`'s `options` parameter now expects a `dict[str, Any]` instead of `dict`
- [`download`] `YoutubeConfig` options now limited to `str | bool | int | None`
- [`download`] `_get_retry_adapter()` now exposed as `get_retry_adapter()`
- [`download`] `stream_file`'s `byte_stream` param now more flexible, accepting `SupportsWrite[bytes] | SupportsSeekableWrite[bytes]`
- [`download`] `stream_file`'s `proxies` param now accepting `dict[str, str]` instead of `dict`
- [`filesystem`] `delete_callback()` is now a simple callback accepting an `fpath` and deleting it (doesn't chain other callbacks anymore).
- [`filesystem`] `delete_callback()` doesn't fail on missing file (#192)
- [`i18n`] Redesigned API around a single object: `Language`, which is inited with any acceptable code. Raises `NotFoundError` on 639-3 matching failure
  - `find_language_names()` is retained but only accepts a `query: str`
  - added `get_language()` and `get_language_or_none()` as shortcuts around `Language`
  - `is_valid_iso_639_3()` is retained
- [`image.conversion`] `convert_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.conversion`] `convert_svg2png()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src` and `dst`.
- [`image.optimization`] `optimize_png()` now accepts `options: OptimizePngOptions` instead of individual params.
- [`image.optimization`] `optimize_jpeg()` now accepts `options: OptimizeJpgOptions` instead of individual params.
- [`image.optimization`] `optimize_webp()` now accepts `options: OptimizeWebpOptions` instead of individual params.
- [`image.optimization`] `optimize_gif()` now accepts `options: OptimizeGifOptions` instead of individual params.
- [`image.presets`] All presets now use the new options dataclass instead of ClassVar dict
- [`image.probing`] `format_for()` now accepts `io.BytesIO` in place of `IO[bytes]` for `src`.
- [`image.probing`] `is_valid_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `image`.
- [`image.utils`] `save_image()` now accepts `io.BytesIO` in place of `IO[bytes]` for `dst`.
- [`video.config`] `Config` was mostly not using type annotations.
- [`video.config`] `Config` options only expecting `str | None`
- [`video.presets`] All options only expecting `str | None`
- [`video.encoding`] `reencode()` now always returning a `tuple[bool, CompletedProcess]`
- [`zim._libkiwix`] `MimetypeAndCounter` now expects specific types for `mimetype: str` and `value: int`
- [`zim.filesystem`] `make_zim_file()` `publisher` param now properly expects an `str`
- [`zim.filesystem`] `IncorrectZIMPathError` renamed to `IncorrectPathError`
- [`zim.filesystem`] `MissingZIMFolderError` renamed to `MissingFolderError`
- [`zim.filesystem`] `NotADirectoryZIMFolderError` renamed to `NotADirectoryFolderError`
- [`zim.filesystem`] `NotWritableZIMFolderError` renamed to `NotWritableFolderError`
- [`zim.filesystem`] `IncorrectZIMFilenameError` renamed to `IncorrectFilenameError`
- [`zim.filesystem`] `validate_zimfile_creatable()` renamed to `validate_file_creatable()`
- [`zim.items`] `Item` and `StaticItem` now expecting `hints` as `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `Item.get_hints()` now returning `dict[libzim.writer.Hint, int]` instead of `dict`
- [`zim.items`] `URLItem.download_for_size()` now specifying type annotations and reordered params
- [`zim.providers`] `FileLikeProvider.gen_blob()` and `URLProvider.gen_blob()` now properly annotate return type (`Generator[libzim.writer.Blob, None, None]`)
- [`zim.providers`] `URLProvider.get_size_of()` param `url` now explicitly expects an `str`
- [`zim.creator`] `Creator.config_metadata()` signature changed, now mainly accepting a `StandardMetadataList`
- [`zim.creator`] `Creator.config_dev_metadata()` signature changed to accept new metadata types
- [`zim.creator`] `Creator.add_item_for()`'s `callback` renamed to `callbacks` and accepting `Callback`
- [`zim.creator`] `Creator.add_item()`'s `callback` renamed to `callbacks` and accepting `Callback`
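The new `delete_callback()` behavior noted above (take an `fpath`, delete it, don't fail if the file is already gone) can be pictured with a small standard-library sketch. This is not the library's implementation; `delete_callback_sketch` is a hypothetical name used only to show the "missing file is not an error" semantics.

```python
import tempfile
from pathlib import Path


def delete_callback_sketch(fpath: Path) -> None:
    """Delete fpath; a file that is already gone is not treated as an error."""
    Path(fpath).unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = Path(tmp_dir) / "video.tmp"
    target.write_bytes(b"data")
    delete_callback_sketch(target)  # removes the file
    delete_callback_sketch(target)  # file is now missing: no exception raised
    still_there = target.exists()
```

Code that previously relied on `delete_callback()` chaining to another callback would need to invoke that other callback separately.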
Changed
- [deps] `iso639-lang` now requires at least v2.4.0
- [`download`] `stream_file()` now returns `tuple[int, requests.structures.CaseInsensitiveDict[str]]` instead of `tuple[int, requests.structures.CaseInsensitiveDict]`
- [`download`] `stream_file()` now accepts both `fpath` and `byte_stream` params (writes to both)
- [`image.utils`] `save_image()` now accepts `Any` `**params`.
- [`zim.archive`] `Archive.counters` now returning `CounterMap` (compatible with previous `dict[str, int]`)
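The "writes to both" behavior of `stream_file()` mentioned above can be illustrated without any network access. The sketch below is an assumption-laden stand-in, not the library's code: `write_chunks` is a hypothetical helper, and the list of byte chunks plays the role of a streamed HTTP response body.

```python
import io
import tempfile
from collections.abc import Iterable
from pathlib import Path
from typing import BinaryIO, Optional


def write_chunks(
    chunks: Iterable[bytes],
    fpath: Optional[Path] = None,
    byte_stream: Optional[BinaryIO] = None,
) -> int:
    """Write each chunk to fpath and/or byte_stream; return total bytes written."""
    if fpath is None and byte_stream is None:
        raise ValueError("need fpath, byte_stream, or both")
    total = 0
    file_handle = open(fpath, "wb") if fpath is not None else None
    try:
        for chunk in chunks:
            if file_handle is not None:
                file_handle.write(chunk)
            if byte_stream is not None:
                byte_stream.write(chunk)
            total += len(chunk)
    finally:
        if file_handle is not None:
            file_handle.close()
    return total


with tempfile.TemporaryDirectory() as tmp_dir:
    destination = Path(tmp_dir) / "payload.bin"
    buffer = io.BytesIO()
    size = write_chunks([b"hello ", b"world"], fpath=destination, byte_stream=buffer)
    on_disk = destination.read_bytes()
    in_memory = buffer.getvalue()
```

After this runs, the file on disk and the in-memory buffer hold the same bytes, which is the point of accepting both destinations at once.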
Fixed
- Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (#226)
- [`download`] `YoutubeDownloader.download` now respects its return type (`bool | Future[Any]`)
- [`image.conversion`] `convert_image()` `**params` properly declared as accepting `None`.
- [`logging`] `getLogger()`'s `console` now properly accepting `TextIO | io.StringIO | None`
- [`video.probing`] `get_media_info()` type annotation for `src_path`
- [`zim.archive`] `Archive.get_item()` return type (`libzim.reader.Item`)
Removed
- Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
- [`i18n`] `Lang` (See breaking changes)
- [`i18n`] `get_iso_lang_data()` (See breaking changes)
- [`i18n`] `update_with_macro()` (See breaking changes)
- [`i18n`] `get_language_details()` (See breaking changes)
- [`uri`] `rebuild_uri` `failsafe` param (was only handling incorrect types)
- [`video.encoding`] `reencode()`'s `with_process` param
- [`zim.creator`] `Creator.validate_metadata()`
- [`zim.creator`] `Creator.convert_and_check_metadata()`
4.0.0
Added
- Add utility function to compute ZIM Tags #164, including deduplication #156
- Metadata does not automatically drops control characters #159
- New `indexing.IndexData` class to hold title, content and keywords to pass to libzim to index an item
- Automatically index PDF documents content #167
- Automatically set proper title on PDF documents #168
- Expose new `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- Add `optimization.get_optimization_method` to get the proper optimization method to call for a given image format
- New `creator.Creator.convert_and_check_metadata` to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
- Add svg2png image conversion function #113
- Add `conversion.convert_svg2png` image conversion function + support for SVG in `probing.format_for` #113
- Add `i18n.Lang` class used as typed result of i18n operations #151
Changed
- BREAKING Renamed `zimscraperlib.image.convertion` to `zimscraperlib.image.conversion` to fix typo
- BREAKING Many changes in type hints to match the real underlying code
- BREAKING Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
- Prefer `IO[bytes]` over `io.BytesIO` when possible since it is more generic
- BREAKING `i18n.NotFound` renamed `i18n.NotFoundError`
- BREAKING `types.get_mime_for_name` now returns `str | None`
- BREAKING `creator.Creator.add_metadata` and `creator.Creator.validate_metadata` now only accept `bytes | str` as value (it must have been converted before call)
- BREAKING second argument of `creator.Creator.add_metadata` has been renamed to `value` instead of `content` to align with other methods
- When a type issue arises in metadata checks, wrong value type is displayed in exception
- BREAKING `i18n.get_language_details()`, `i18n.get_iso_lang_data()`, `i18n.find_language_names()` and `i18n.update_with_macro` now process / return a new typed `Lang` class #151
- BREAKING Rename `i18n.NotFound` to `i18n.NotFoundError`
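For readers unfamiliar with keyword-only parameters (the change referencing ruff rule FBT002 above), here is a generic Python illustration. The `save_page` function and its parameters are hypothetical and not part of zimscraperlib; the sketch only shows how a bare `*` in a signature forces boolean flags to be named at the call site.

```python
def save_page(
    content: str, *, compress: bool = False, overwrite: bool = False
) -> tuple[int, bool, bool]:
    """Hypothetical function: every parameter after `*` must be passed by keyword."""
    return (len(content), compress, overwrite)


# Flags are named explicitly, so the call site documents itself
result = save_page("hello", compress=True)

# Passing the flag positionally is rejected by Python itself
try:
    save_page("hello", True)
    positional_rejected = False
except TypeError:
    positional_rejected = True
```

Callers upgrading across this change mostly need to turn positional booleans such as `func(path, True)` into named ones such as `func(path, some_flag=True)`, using whatever the real parameter name is.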
Removed
- BREAKING Remove translation features in `i18n`: `Locale` class + `_` and `setlocale` functions #134
Fixed
v3.4.0
Added
- `zim.creator.Creator._log_metadata()` to log (DEBUG) all metadata set on `_metadata` (prior to start()) #155
- New utility function to confirm ZIM can be created at given location / name #163
Changed
- Migrate the VideoWebmLow and VideoWebmHigh presets to VP9 for smaller file size #79
- New preset versions are v3 and v2 respectively
- Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity #150
- Calling `Creator._log_metadata()` on `Creator.start()` if running in DEBUG #155
Fixed
- Add back the `--runinstalled` flag for test execution to allow smooth testing on other build chains #139
3.3.2
3.3.1
3.3.0
Added
- New `disable_metadata_checks` parameter in `zimscraperlib.zim.creator.Creator` initializer, allowing metadata checks to be disabled at startup (assuming users will validate metadata on their own) #119
Changed
- Rework the VideoWebmLow preset for faster encoding and smaller file size #122
- preset has been bumped to version 2
- when using an S3 cache, all videos using this preset will be reencoded and uploaded to cache again (it will replace the same file encoded with preset version 1)
- When reencoding a video, ffmpeg now uses only 1 CPU thread by default (a new arg to `reencode` allows overriding this default value)
- Using openZIM Python bootstrap conventions (including hatch-openzim plugin) #120
- Add support for Python 3.12, drop Python 3.7 support #118
- Replace "iso-639" by "iso639-lang" library
- Replace "file-magic" by "python-magic" library for Alpine Linux support and better maintenance
Fixed
- Fixed type hints of `zimscraperlib.zim.Item` and subclasses, and `zimscraperlib.image.optimization:convert_image`