Releases: datajuicer/data-juicer
Release v1.4.4: NeurIPS 2025 Spotlight; New Video & Multimodal Ops; Repo Reorganization; S3 I/O Support
Major Updates
- 🎉 NeurIPS 2025 news: our Data-Juicer 2.0 paper has been accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)! Two of our other works have also been accepted by NeurIPS'25. #788
- 🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. #817 #827 #830
- 🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. #806
New OPs
- detect_main_character_mapper: Extract all main character names based on the given image and its caption. #795
- detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) #795
- detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. #795
- vggt_mapper: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. #804
- video_whole_body_pose_estimation_mapper: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. #812
- video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. #818
Enhancements
- Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. #778 #819
- Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. #790
- Optimized the build_op_doc hook for more reliable documentation generation. #794
- Improved auto num_proc calculation in Ray mode for better resource utilization across operators. #789 #825
- Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. #803
- Updated repository URLs and links across the project for consistency and correctness. #805
- Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. #826 #829
- Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. #798
Fixed Bugs
- Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. #791
- Fixed a Ray connection error by properly passing the config parameter through resource utility functions. #808
- Fixed several CUDA-based operators to use internal resource monitor. #809
- Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. #803
- Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. #814
- Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. #816
- Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. #828
- Fixed several typos in video_split_by_scene_mapper. #744
Acknowledgement
- @kyo-tom helps to fix the ray connection error in #808
- @liuyuhanalex helps to fix several small typos in #744
Full Changelog: v1.4.3...v1.4.4
Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox
Major Updates
- 🤝 OP Document Updates: Optimized multi-version docs; Doc strings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
- 💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize calculate_np for ray mode. #679 #774 #782 #786
- 🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
- 📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763
New OPs
- image_detection_yolo_mapper: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
- optimize_prompt_mapper: optimize prompts based on the existing ones. #757
Enhancements
- Support shard_size and extra args for write methods in export_extra_args for RayExporter. #739
- Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
- Support API models for existing optimize_qa_mapper. #771
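The interval options for Filters described above can be pictured with a minimal sketch. The argument names min_closed_interval, max_closed_interval, and reversed_range come from the release note. The helper `keep_sample`, its defaults, and the sample values are hypothetical and are not Data-Juicer's implementation.

```python
def keep_sample(value, min_value, max_value,
                min_closed_interval=True,
                max_closed_interval=True,
                reversed_range=False):
    """Hypothetical helper: decide whether a sample's stat value passes the filter."""
    # A closed bound includes the boundary value; an open bound excludes it.
    above_min = value >= min_value if min_closed_interval else value > min_value
    below_max = value <= max_value if max_closed_interval else value < max_value
    in_range = above_min and below_max
    # With reversed_range, keep the samples that fall outside the range instead.
    return not in_range if reversed_range else in_range


lengths = [5, 10, 50, 100, 200]  # made-up stat values
kept_closed = [v for v in lengths if keep_sample(v, 10, 100)]
kept_open = [
    v for v in lengths
    if keep_sample(v, 10, 100, min_closed_interval=False, max_closed_interval=False)
]
kept_outside = [v for v in lengths if keep_sample(v, 10, 100, reversed_range=True)]
```

In this sketch the closed interval keeps the boundary values 10 and 100, the open interval drops them, and the reversed range keeps only 5 and 200.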
Fixed Bugs
- Fix and re-enable the disabled op_list_to_trace argument. #766
- Add missing skip tag to several API-based test cases for forked repos. #767
- Limit the version of transformers to "<4.55.0" to avoid computing on None value. #781
- Fix out-of-date invoking methods in several tools. #785 (from issue #750)
- Fix 500 error in API service. #785 (from issue #777)
- Remove specified_xxx_filter from NON_STATS_FILTER. #785 (from issue #783)
Full Changelog: v1.4.2...v1.4.3
Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"
Major Updates
- 💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
- 🧩 5 OPs for data attribution are added. #735
- 🤝 Data-Juicer now supports registering and applying custom OPs from external paths using the argument custom_operator_paths. #758
- 🔧 "uv" is now the recommended way to install Data-Juicer because it can resolve dependency conflicts. #760
New Operators
Filter
- Validation-free
  - llm_perplexity_filter: Filter to keep samples whose perplexity score, computed using a specified LLM, falls within a specific range. #735
  - instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
- Validation-based
  - in_context_influence_filter: Filter to keep texts whose in-context influence upon the validation set falls within a specific range. #735
  - llm_task_relevance_filter: Filter to keep samples with a high relevance score to validation tasks, as estimated by an LLM. #735
  - text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
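As a rough illustration of the idea behind text_embd_similarity_filter, the sketch below averages a similarity score over a set of validation embeddings and keeps a sample when the average falls within a range. Cosine similarity, the function names, and the toy two-dimensional embeddings are all assumptions made for illustration; the operator's actual model, metric, and parameters may differ.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_similarity(embedding, validation_embeddings):
    """Mean similarity of one embedding to every validation embedding."""
    sims = [cosine_similarity(embedding, v) for v in validation_embeddings]
    return sum(sims) / len(sims)


def keep_by_similarity(embedding, validation_embeddings, min_score, max_score):
    """Keep the sample if its average similarity lies within [min_score, max_score]."""
    score = average_similarity(embedding, validation_embeddings)
    return min_score <= score <= max_score


# Toy embeddings (hypothetical); real embeddings would come from a model.
validation = [[1.0, 0.0], [0.0, 1.0]]
candidates = {"close": [1.0, 1.0], "far": [-1.0, -1.0]}
kept = [
    name for name, emb in candidates.items()
    if keep_by_similarity(emb, validation, 0.5, 1.0)
]
```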
Enhancements
- A new environment variable DATA_JUICER_EXTERNAL_MODELS_HOME is added so that users can specify private or read-only paths for storing external and extra models. #740
- Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
- Support custom save_dir for OPs that produce extra multimodal data. #751
- Add official and detailed docs about Data-Juicer Agent. #759
- Enhance unit tests: show the name of the current test case; recycle resources after each test case in ray mode. #749
- Refine the developer guide for better practice on building new OPs. #760
Bugs Fixed
- Move the update of special tokens for multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
- Fix some test cases. #754
Acknowledgement
- @ShenQianli made their first contribution to 5 new OPs. #735
Full Changelog: v1.4.1...v1.4.2
Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.
Major Updates
- 🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
- 💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
- 🤝 GPU-based Minhash deduplication is supported, in collaboration with developers from Nvidia. #694 #644
- 🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
- 🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
New Operators
- download_file_mapper: downloads data from URLs to local files or specified fields. #709
Enhancements
- New analysis method: correlation analysis among stats is added. #663
- Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
- The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
- Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
- Support storing and processing image bytes data in the dataset. #725
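The correlation analysis among stats mentioned in the list above can be pictured with a small sketch. It assumes Pearson correlation computed over per-sample stat columns; the release note does not say which correlation method is used, and the stat names and values here are made up.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_matrix(stats):
    """Pairwise correlations between all stat columns, as a nested dict."""
    names = list(stats)
    return {a: {b: pearson(stats[a], stats[b]) for b in names} for a in names}


# Hypothetical per-sample stats; each list holds one value per sample.
stats = {
    "text_len": [10, 20, 30, 40],
    "num_words": [2, 4, 6, 8],
    "special_char_ratio": [0.4, 0.3, 0.2, 0.1],
}
matrix = correlation_matrix(stats)
```

A strongly positive or negative entry suggests two stats carry overlapping information, which can help when deciding which filters to combine.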
Bugs Fixed
- The wheel & docker image building bug is fixed. #706
- Fix bugs in log_summarization. #710
- Fix "no module named data_juicer" error after installing from the wheel file. #727
Acknowledgement
- @fanronghai helps to fix the param error in dataset_splitting_by_language tool. #713
- @ayushdg helps to support a GPU-version Minhash deduplicator. #644
- @ricksun2023 helps to fix the bugs when there are more than one same-name OPs in the configs. #730
Full Changelog: v1.4.0...v1.4.1
v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)
Summary: 200+ files changed with 18,535 additions and 3,720 deletions.
🔧 Major Refactors & Improvements
- 🔄 Sandbox Usability (#686):
  - Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
  - Includes the InternVL example as a showcase.
- 📘 DJ-Doc Redesign (#675):
  - Now with multilingual support (English / Chinese) and a modernized style.
- 📦 Dependency Management Update (#660, #680):
  - Migrated to uv for faster dependency resolution.
  - Added sub-groups for better organization.
🌍 New Features & Integrations (#683, #688, #692)
- 🆕 Additional Repo Supported:
  - Trinity-RFT now supported by Data-Juicer.
- 📜 DJ-Awesome-List:
  - A survey paper accepted by TPAMI'25!
- 🧪 Synthetic Benchmark Added:
  - DetailMaster: a new benchmark for synthetic data evaluation.
- 🛠️ New Operators Introduced (#673, #701):
  - llm_analysis_filter
  - general_field_filter
🚀 Core Optimizations & Bug Fixes
- ✅ Ray Executor Enhancements (#697):
  - File extension detection added.
  - Support for more data formats.
- ⏱️ Startup Time Optimization:
  - Improved startup performance. (#684)
- 🧠 Text Embedding Support:
  - Added support for text embedding via API and local model. (#681)
- 🐳 Docker Build Improvement:
  - Ignore installed distutils libraries during Docker image building. (#668)
- 🛠️ Mapper Module Fix:
  - Fixed issue with module initialization. (#700)
- 🗑️ Warning Suppression:
  - Suppressed unnecessary warnings from fasttext. (#696)
📚 Full Changelog
Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.
Major Updates
- 🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
- Add new OPs and recipes for Img-Diff. #658
Enhancements
- Support HF llm for two llm_xxx_score_filter OPs. #655
- Sync the docker image to Aliyun OSS so it can be downloaded when Docker Hub is not accessible. #657
- Split standalone and distributed unit tests to save time when re-running failed ones. #666
Bugs Fixed
- Address possibly missing cfg in unify_format. #653
- Improve clarity & fix bad links for some docs. #659
Acknowledgement
Full Changelog: v1.3.2...v1.3.3
Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes
What's Changed
- Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
- OP efficiency optimization of document_minhash_deduplicator, in #639
- set temp_parser.usage to argparse.SUPPRESS to skip excessive help logs, in #643
- fix date typo in #648
- Fix docker building failure in #650
- Fix StreamToLoguru compatibility issue with torch._dynamo in #651
- add init file for annotation module, fix dj-process command error in #652
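The argparse.SUPPRESS change listed above relies on standard-library behaviour: when a parser's usage attribute is set to argparse.SUPPRESS, the help formatter omits the usage line. The sketch below shows that behaviour on a hypothetical parser; the program name, the argument, and the `build_parser` helper are made up and are not Data-Juicer's temp_parser.

```python
import argparse


def build_parser(suppress_usage):
    """Build a small demo parser, optionally with its usage line suppressed."""
    parser = argparse.ArgumentParser(prog="demo-tool", description="Temporary parser.")
    parser.add_argument("--config", help="Path to a config file.")
    if suppress_usage:
        # argparse.SUPPRESS tells the help formatter to omit the usage line.
        parser.usage = argparse.SUPPRESS
    return parser


default_help = build_parser(suppress_usage=False).format_help()
quiet_help = build_parser(suppress_usage=True).format_help()
```

Here default_help contains the usual "usage: ..." line, while quiet_help still lists the arguments but without it.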
New Contributor
Release v1.3.1: added HumanOPs & fixed some bugs
Major Updates
- 💥 prototype Implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops
New OPs
- extract_tables_from_html_mapper: extract tables from html texts. #634
- general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
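The idea of running several sequential OPs on the same batch, as general_fused_op is described above, can be sketched in plain Python. The `fuse_ops` helper and the two toy operations are hypothetical stand-ins and do not reflect the operator's real interface.

```python
def fuse_ops(ops):
    """Return one callable that applies each op in order to the same batch."""
    def fused(batch):
        for op in ops:
            batch = op(batch)
        return batch
    return fused


def lowercase(batch):
    # Toy op: lowercase every text in the batch.
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def strip_whitespace(batch):
    # Toy op: trim leading and trailing whitespace.
    return {**batch, "text": [t.strip() for t in batch["text"]]}


fused = fuse_ops([lowercase, strip_whitespace])
result = fused({"text": ["  Hello ", "WORLD  "]})
```

Fusing this way means the batch is passed from one op to the next in memory, rather than making a separate pass over the dataset for each op.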
Bug Fixed
- fix dataset builder initialization failure #630
- update Executor references from Executor to DefaultExecutor #632 #633
- switch the backend of plt to avoid sub-process/thread error #633
- fix some boundary condition bugs in several deduplicators #635 #637
Others
- check the dataset when loading, to support passing a dataset to the DefaultExecutor.run method. #633
- update docs to highlight the light env installation part. #636
Acknowledgement
- @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635
Full Changelog: v1.3.0...v1.3.1
Release v1.3.0: Refactor of dataset builder and executor!
The Big Change 🚀
Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.
Others 💡
🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)
Release v1.2.2
Major Updates
- 🧪 Add documentation for the API service. Add parameter transmission using json.dumps to support API calls for arbitrary registered functions and classes. #613
- 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
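One way to picture the json.dumps parameter transmission mentioned above is the following sketch. It assumes that parameters travel as URL query parameters and that each value is JSON-encoded so that nested structures survive the round trip. The helper names and the example parameters are hypothetical, and the actual API service may encode requests differently.

```python
import json
from urllib.parse import parse_qs, urlencode


def encode_params(params):
    """JSON-encode each value so lists, dicts, and numbers keep their types."""
    return urlencode({key: json.dumps(value) for key, value in params.items()})


def decode_params(query):
    """Reverse of encode_params: parse the query string and JSON-decode values."""
    return {key: json.loads(values[0]) for key, values in parse_qs(query).items()}


# Hypothetical arguments for some registered function or class.
params = {
    "text_key": "text",
    "num_proc": 4,
    "ops": [{"name": "some_filter", "min_score": 0.5}],
}
query = encode_params(params)
decoded = decode_params(query)
```

Because every value is serialized with json.dumps, the receiving side can recover integers, lists, and nested dicts instead of getting everything back as plain strings.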
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.
New OPs
- llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
- llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM, supporting both API calls and local vLLM calls. #606 #614 #620
