Releases: openzim/python-scraperlib

5.0.0

14 Jan 15:55
af21c01

This is a major release with a lot of breaking changes but most changes are easy to fix.

In addition to the change below, see the release candidate notes (5.0.0rc1 to 5.0.0rc4) for all changes since 4.x.

Changed

  • Add support for urllib3 2.3.x #243

5.0.0rc4

09 Jan 10:49
4944025
Pre-release

Changed

  • Mark library as typed and fix sdist content (#241)

5.0.0rc3

07 Jan 16:08
136b4dd
Pre-release

Changed

  • Upgrade wombat to 3.8.7 (#239)

5.0.0rc2

07 Jan 12:54
671e9e2
Pre-release

Fixed

  • Fix wombatSetup.js location in wheel (#236)

5.0.0rc1

07 Jan 10:33
9ecdb28
Pre-release

This is a major release with a lot of breaking changes but most changes are easy to fix.

It focuses on type safety with the introduction of runtime checks: any call to zimscraperlib API must match the type definition or an exception will be raised.
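
To illustrate what a runtime type check means in practice, here is a minimal standard-library sketch. It is not how zimscraperlib implements its checks: `checked` and `set_title` are hypothetical names, and the decorator only handles plain classes (such as `str` or `int`) as annotations.

```python
import functools
import inspect
from typing import get_type_hints


def checked(func):
    """Raise TypeError when an argument does not match its annotation.

    Hypothetical helper: only plain classes (str, int, ...) are checked,
    any other kind of annotation is silently skipped.
    """
    hints = get_type_hints(func)
    signature = inspect.signature(func)

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        bound = signature.bind(*args, **kwargs)
        for name, value in bound.arguments.items():
            expected = hints.get(name)
            if isinstance(expected, type) and not isinstance(value, expected):
                raise TypeError(
                    f"{name} must be {expected.__name__}, "
                    f"got {type(value).__name__}"
                )
        return func(*args, **kwargs)

    return wrapper


@checked
def set_title(title: str, max_length: int = 30) -> str:
    # Hypothetical function standing in for any annotated API call.
    return title[:max_length]
```

With this sketch, `set_title("Hello")` returns normally while `set_title(123)` raises at call time instead of failing later in unrelated code.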

Documentation is available as docstrings and on https://python-scraperlib.readthedocs.io

Main changes include:

  • ZIM metadata handling has completely changed with new types for each kind of metadata.
  • i18n module has been redesigned around a single main class Language
  • New rewriting module for HTML/CSS/JS (JS rewriting being done at runtime via Wombat)
  • Now supporting only Python 3.12

Added

  • Documentation using mkdocs, published on Read the Docs (#92)
  • rewriting module to rewrite URLs in content for generic scrapers
    • rewriting.css to rewrite URLs in CSS files
    • rewriting.html to rewrite URLs in HTML files
    • rewriting.js to rewrite URLs in JS files (at runtime, using wombat)
      • wombat-setup javascript module in javascript/
  • typing module with custom types:
    • Callback to use where we expect callbacks
    • SupportsWrite, SupportsRead, SupportsSeeking, SupportsSeekableRead and SupportsSeekableWrite: protocols for IO type annotations
  • zim.metadata module with a type-based approach for each kind of metadata and helpers for custom ones
    • [zim.metadata] APPLY_RECOMMENDATIONS: general flag to toggle openZIM-recommended constraints
    • [zim.metadata] Type-based classes: Metadata, TextBasedMetadata, TextListBasedMetadata, DateBasedMetadata, IllustrationBasedMetadata
    • [zim.metadata] Usage-based classes: NameMetadata, LanguageMetadata, DefaultIllustrationMetadata, etc.
    • [zim.metadata] StandardMetadataList to package the standard metadata
    • See details for additional API endpoints and variables
  • [constants] DEFAULT_WEB_REQUESTS_TIMEOUT exposed for download module
  • [download] stream_file() now accepts timeout: int param (defaults to constant timeout) (#222)
  • [filesystem] path_from context manager to acquire a pathlib Path from Path or TemporaryDirectory
  • [i18n] Language, get_language() and get_language_or_none(). See breaking changes
  • [image.optimization] OptimizePngOptions dataclass to store PNG options
  • [image.optimization] OptimizeJpgOptions dataclass to store JPEG options
  • [image.optimization] OptimizeWebpOptions dataclass to store WebP options
  • [image.optimization] OptimizeGifOptions dataclass to store GIF options
  • [image.optimization] OptimizeOptions dataclass to store cross-formats options
  • [inputs] unique_values() to deduplicate a list while preserving order
  • [logging] DEFAULT_FORMAT_WITH_THREADS as many scrapers use threads
  • [video.encoding] reencode()'s existing_tmp_path param
  • [zim.filesystem] validate_folder_writable() to ensure one can write into a folder (#200)
  • [zim.creator] Creator._get_first_language_metadata_value() to retrieve first language from metadata
  • [zim.items] no_indexing_indexdata() to get an IndexData that disables indexing
  • [zim.items] URLItem.get_mimetype() now only returning str
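
As an illustration of the order-preserving deduplication that the `unique_values()` entry describes, here is a minimal sketch. It is an assumption about the technique, not the library's implementation; `unique_values_sketch` is a hypothetical name and the approach requires hashable items.

```python
def unique_values_sketch(items: list) -> list:
    """Return items without duplicates, keeping first-seen order.

    Hypothetical stand-in: items must be hashable.
    """
    # dict keys keep insertion order, so fromkeys() retains the first
    # occurrence of each value and drops later repeats.
    return list(dict.fromkeys(items))


languages = unique_values_sketch(["en", "fr", "en", "de", "fr"])
```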

Changed (Breaking)

  • Entire API is now type-protected using beartype. Any call to scraperlib that doesn't satisfy the annotated types will raise an exception
  • [constants] MANDATORY_ZIM_METADATA_KEYS and DEFAULT_DEV_ZIM_METADATA moved to zim/metadata
  • [download] YoutubeDownloader.download's options parameter now expects a dict[str, Any] instead of dict
  • [download] YoutubeConfig options now limited to str | bool | int | None
  • [download] _get_retry_adapter() now exposed as get_retry_adapter()
  • [download] stream_file's byte_stream param now more flexible, accepting SupportsWrite[bytes] | SupportsSeekableWrite[bytes]
  • [download] stream_file's proxies param now accepting dict[str, str] instead of dict
  • [filesystem] delete_callback() is now a simple callback accepting an fpath and deleting it (doesn't chain other callback anymore).
  • [filesystem] delete_callback() doesn't fail on missing file (#192)
  • [i18n] Redesigned API around a single object:
    • Language, which is initialized with any acceptable code. Raises NotFoundError on 639-3 matching failure
    • find_language_names() is retained but only accepts a query: str
    • added get_language() and get_language_or_none() as shortcuts around Language
    • is_valid_iso_639_3() is retained
  • [image.conversion] convert_image() now accepts io.BytesIO in place of IO[bytes] for src and dst.
  • [image.conversion] convert_svg2png() now accepts io.BytesIO in place of IO[bytes] for src and dst.
  • [image.optimization] optimize_png() now accepts options: OptimizePngOptions instead of individual params.
  • [image.optimization] optimize_jpeg() now accepts options: OptimizeJpgOptions instead of individual params.
  • [image.optimization] optimize_webp() now accepts options: OptimizeWebpOptions instead of individual params.
  • [image.optimization] optimize_gif() now accepts options: OptimizeGifOptions instead of individual params.
  • [image.presets] All presets now use the new options dataclass instead of ClassVar dict
  • [image.probing] format_for() now accepts io.BytesIO in place of IO[bytes] for src.
  • [image.probing] is_valid_image() now accepts io.BytesIO in place of IO[bytes] for image.
  • [image.utils] save_image() now accepts io.BytesIO in place of IO[bytes] for dst.
  • [video.config] Config now uses type annotations (it mostly did not before).
  • [video.config] Config options only expecting str | None
  • [video.presets] All options only expecting str | None
  • [video.encoding] reencode() now always returning a tuple[bool, CompletedProcess]
  • [zim._libkiwix] MimetypeAndCounter now expects specific types for mimetype: str and value: int
  • [zim.filesystem] make_zim_file()'s publisher param now properly expects a str
  • [zim.filesystem] IncorrectZIMPathError renamed to IncorrectPathError
  • [zim.filesystem] MissingZIMFolderError renamed to MissingFolderError
  • [zim.filesystem] NotADirectoryZIMFolderError renamed to NotADirectoryFolderError
  • [zim.filesystem] NotWritableZIMFolderError renamed to NotWritableFolderError
  • [zim.filesystem] IncorrectZIMFilenameError renamed to IncorrectFilenameError
  • [zim.filesystem] validate_zimfile_creatable() renamed to validate_file_creatable()
  • [zim.items] Item and StaticItem now expecting hints as dict[libzim.writer.Hint, int] instead of dict
  • [zim.items] Item.get_hints() now returning dict[libzim.writer.Hint, int] instead of dict
  • [zim.items] URLItem.download_for_size() now specifying type annotations and reordered params
  • [zim.providers] FileLikeProvider.gen_blob() and URLProvider.gen_blob() now properly annotate their return type (Generator[libzim.writer.Blob, None, None])
  • [zim.providers] URLProvider.get_size_of() param url now explicitly expects a str
  • [zim.creator] Creator.config_metadata() signature changed, now mainly accepting a StandardMetadataList
  • [zim.creator] Creator.config_dev_metadata() signature changed to accept new metadata types
  • [zim.creator] Creator.add_item_for()'s callback renamed to callbacks and accepting Callback
  • [zim.creator] Creator.add_item()'s callback renamed to callbacks and accepting Callback
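
The new `delete_callback()` behavior listed above (delete the given path, do not fail when the file is already missing) can be approximated with the standard library. This is a sketch under assumptions, not the library's code: `delete_path_sketch` is a hypothetical name.

```python
import pathlib
import tempfile


def delete_path_sketch(fpath: pathlib.Path) -> None:
    """Delete fpath; do nothing if it is already gone (hypothetical helper)."""
    fpath.unlink(missing_ok=True)


with tempfile.TemporaryDirectory() as tmp_dir:
    target = pathlib.Path(tmp_dir) / "video.webm"
    target.write_bytes(b"data")
    delete_path_sketch(target)  # removes the file
    delete_path_sketch(target)  # file is missing: no exception is raised
    still_exists = target.exists()
```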

Changed

  • [deps] iso639-lang now requires at least v2.4.0
  • [download] stream_file() now returns tuple[int, requests.structures.CaseInsensitiveDict[str]] instead of tuple[int, requests.structures.CaseInsensitiveDict]
  • [download] stream_file() now accepts both fpath and byte_stream params (writes to both)
  • [image.utils] save_image() now accepts Any **params.
  • [zim.archive] Archive.counters now returning CounterMap (compatible with previous dict[str, int])
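
To make the "writes to both" behavior of `stream_file()` concrete, the sketch below writes each chunk to a file path and to any object exposing `write(bytes)`. It only illustrates the idea with the standard library: `SupportsWriteBytes` and `write_chunks_to_both` are hypothetical names, and no network request is made.

```python
import io
import pathlib
import tempfile
from typing import Iterable, Protocol


class SupportsWriteBytes(Protocol):
    """Anything with a write(bytes) method (hypothetical local protocol)."""

    def write(self, data: bytes, /) -> int: ...


def write_chunks_to_both(
    chunks: Iterable[bytes],
    fpath: pathlib.Path,
    byte_stream: SupportsWriteBytes,
) -> int:
    """Write every chunk to the file and to the stream; return total bytes."""
    total = 0
    with open(fpath, "wb") as fh:
        for chunk in chunks:
            fh.write(chunk)
            byte_stream.write(chunk)
            total += len(chunk)
    return total


with tempfile.TemporaryDirectory() as tmp_dir:
    destination = pathlib.Path(tmp_dir) / "download.bin"
    buffer = io.BytesIO()
    size = write_chunks_to_both([b"abc", b"def"], destination, buffer)
    on_disk = destination.read_bytes()
```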

Fixed

  • Direct dependencies are now properly referenced: pillow, urllib3, piexif, idna (#226)
  • [download] YoutubeDownloader.download now respects its return type (bool | Future[Any])
  • [image.conversion] convert_image() **params properly declared as accepting None.
  • [logging] getLogger()'s console param now properly accepting TextIO | io.StringIO | None
  • [video.probing] get_media_info() type annotation for src_path
  • [zim.archive] Archive.get_item() return type (libzim.reader.Item)

Removed

  • Support for Python 3.8/3.9/3.10/3.11. Only Python 3.12 is supported now.
  • [i18n] Lang (See breaking changes)
  • [i18n] get_iso_lang_data() (See breaking changes)
  • [i18n] update_with_macro() (See breaking changes)
  • [i18n] get_language_details() (See breaking changes)
  • [uri] rebuild_uri failsafe param (was only handling incorrect types)
  • [video.encoding] reencode()'s with_process param
  • [zim.creator] Creator.validate_metadata()
  • [zim.creator] Creator.convert_and_check_metadata()

4.0.0

05 Aug 09:33
6489f2a

Added

  • Add utility function to compute ZIM Tags #164, including deduplication #156
  • Metadata now automatically drops control characters #159
  • New indexing.IndexData class to hold title, content and keywords to pass to libzim to index an item
  • Automatically index PDF documents content #167
  • Automatically set proper title on PDF documents #168
  • Expose new optimization.get_optimization_method to get the proper optimization method to call for a given image format
  • New creator.Creator.convert_and_check_metadata to convert metadata to bytes or str for known use cases and check proper type is passed to libzim
  • Add conversion.convert_svg2png image conversion function + support for SVG in probing.format_for #113
  • Add i18n.Lang class used as typed result of i18n operations #151
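
The tag computation with deduplication mentioned in the first entry can be pictured as follows. This is a sketch under assumptions rather than the library's function: `compute_tags_sketch` is a hypothetical name, and the only external fact relied on is that the ZIM `Tags` metadata is a semicolon-separated list.

```python
def compute_tags_sketch(tags: list[str]) -> str:
    """Strip, deduplicate (keeping order) and join tags with ';'.

    Hypothetical helper, not the zimscraperlib function.
    """
    # Drop surrounding whitespace and ignore empty entries.
    cleaned = [tag.strip() for tag in tags if tag.strip()]
    # dict.fromkeys() keeps the first occurrence of each tag.
    return ";".join(dict.fromkeys(cleaned))


tags_value = compute_tags_sketch(["wikipedia", " science", "wikipedia", ""])
```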

Changed

  • BREAKING Renamed zimscraperlib.image.convertion to zimscraperlib.image.conversion to fix typo
  • BREAKING Many changes in type hints to match the real underlying code
  • BREAKING Force all boolean arguments (and some other non-obvious parameters) to be keyword-only in function calls for clarity / disambiguation (see ruff rule FBT002)
  • Prefer IO[bytes] over io.BytesIO when possible since it is more generic
  • BREAKING types.get_mime_for_name now returns str | None
  • BREAKING creator.Creator.add_metadata and creator.Creator.validate_metadata now only accept bytes | str as value (it must be converted before the call)
  • BREAKING second argument of creator.Creator.add_metadata has been renamed to value instead of content to align with other methods
  • When a type issue arises in metadata checks, wrong value type is displayed in exception
  • BREAKING i18n.get_language_details(), i18n.get_iso_lang_data(), i18n.find_language_names() and i18n.update_with_macro now process / return a new typed Lang class #151
  • BREAKING Rename i18n.NotFound to i18n.NotFoundError
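
For readers unfamiliar with the keyword-only change listed above (ruff rule FBT002), the sketch below shows the pattern on a hypothetical function. `make_archive_sketch` and its `overwrite` parameter are illustrative names, not part of the library.

```python
def make_archive_sketch(name: str, *, overwrite: bool = False) -> str:
    """Hypothetical function: everything after `*` must be passed by keyword."""
    return f"{name} (overwrite={overwrite})"


# Passing the boolean by keyword makes the call site self-explanatory.
result = make_archive_sketch("demo.zim", overwrite=True)
```

Calling `make_archive_sketch("demo.zim", True)` raises a `TypeError`, which is the kind of breakage callers have to fix when upgrading.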

Removed

  • BREAKING Remove translation features in i18n: Locale class + _ and setlocale functions #134

Fixed

  • Metadata length validation is buggy for unicode strings #158
  • Pillow 10.4.0 reveals improper type hints for image probing functions #177
  • Enhance error when locale fails to set up #157

v3.4.0

21 Jun 11:26
8b040a6

Added

  • zim.creator.Creator._log_metadata() to log (DEBUG) all metadata set on _metadata (prior to start()) #155
  • New utility function to confirm ZIM can be created at given location / name #163

Changed

  • Migrate the VideoWebmLow and VideoWebmHigh presets to VP9 for smaller file size #79
    • New preset versions are v3 and v2 respectively
  • Simplify type annotations by replacing Union and Optional with pipe character ("|") for improved readability and clarity #150
  • Calling Creator._log_metadata() on Creator.start() if running in DEBUG #155

Fixed

  • Add back the --runinstalled flag for test execution to allow smooth testing on other build chains #139

3.3.2

25 Mar 09:16
2923655

Added

  • Add support for disable_metadata_checks and ignore_duplicates arguments in make_zim_file function ("zimwritefs-mode")

Changed

  • Relaxed constraints on Python dependencies
  • Upgraded optional dependencies used for test and QA

3.3.1

27 Feb 14:21
e31f5ed

Added

  • Set a user-agent for handle_user_provided_file #103

Changed

  • Migrate to generic syntax in all std collections #140

Fixed

  • Do not modify the ffmpeg_args in reencode function #144
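
The class of bug fixed here (a function mutating a list passed in by the caller) can be avoided by building a new list. The sketch below illustrates the idea only: `build_command_sketch` is a hypothetical name and does not reflect the actual `reencode()` code.

```python
def build_command_sketch(src: str, dst: str, ffmpeg_args: list[str]) -> list[str]:
    """Build an ffmpeg command line without touching the caller's list."""
    # A new list is created, so ffmpeg_args is left unchanged and the same
    # preset arguments can be reused for the next file.
    return ["ffmpeg", "-y", "-i", src, *ffmpeg_args, dst]


preset_args = ["-codec:v", "libvpx-vp9", "-codec:a", "libvorbis"]
command = build_command_sketch("in.mp4", "out.webm", preset_args)
```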

3.3.0

14 Feb 09:42
04181e9

Added

  • New disable_metadata_checks parameter in the zimscraperlib.zim.creator.Creator initializer to disable metadata checks at startup (assuming the user validates metadata on their own) #119

Changed

  • Rework the VideoWebmLow preset for faster encoding and smaller file size #122
    • preset has been bumped to version 2
    • when using an S3 cache, all videos using this preset will be reencoded and uploaded to cache again (it will replace the same file encoded with preset version 1)
  • When reencoding a video, ffmpeg now uses only 1 CPU thread by default (new arg to reencode allows to override this default value)
  • Using openZIM Python bootstrap conventions (including hatch-openzim plugin) #120
  • Add support for Python 3.12, drop Python 3.7 support #118
  • Replace "iso-639" by "iso639-lang" library
  • Replace "file-magic" by "python-magic" library for Alpine Linux support and better maintenance

Fixed

  • Fixed type hints of zimscraperlib.zim.Item and subclasses, and zimscraperlib.image.optimization:convert_image