01 Dec 04:06

HYLcool

deb99e5

Release v1.4.4: NeurIPS 2025 Spotlight; New Video & Multimodal Ops; Repo Reorganization; S3 I/O Support

Major Updates

🎉 NeurIPS 2025 news: our Data-Juicer 2.0 paper was accepted as a NeurIPS'25 Spotlight (top 3.1% of all submissions)! Two of our other works were also accepted at NeurIPS'25. #788
🧩 The sandbox component, data-juicer recipes, and data-juicer agents have been officially split from the main repository as data-juicer-sandbox/hub/agents respectively, to enable independent development and faster iteration. #817 #827 #830
🤝 S3 I/O support: Added S3 support in data loader and exporter for seamless cloud storage integration. #806

New OPs

detect_main_character_mapper: Extract all main character names based on the given image and its caption. #795
detect_character_locations_mapper: Given an image and a list of main character names, extract the bounding boxes for each present character. (YOLOE + MLLM) #795
detect_character_attributes_mapper: Takes an image, a caption, and main character names as input to extract the characters' attributes. #795
vggt_mapper: Input a video of a single scene, and use VGGT to extract information including Camera Pose, Depth Maps, Point Maps, and 3D Point Tracks. #804
video_whole_body_pose_estimation_mapper: Input a video containing people, and use the DWPose model to extract the body, hand, feet, and face keypoints of the human subjects in the video, i.e., 2D Whole-body Pose Estimation. #812
video_hand_reconstruction_mapper: Use the WiLoR model for hand localization and reconstruction. #818

Enhancements

Enhanced documentation for operator details, significantly expanding coverage of effect demonstrations and usage examples, and improved homepage styling for better readability. #778 #819
Added notebook detection and auto-redirect in logger setup for better user experience in Jupyter environments. #790
Optimized the build_op_doc hook for more reliable documentation generation. #794
Improved auto num_proc calculation in Ray mode for better resource utilization across operators. #789 #825
Enabled support for videos and audios in WebDataset I/O, expanding multimodal data handling capabilities. #803
Updated repository URLs and links across the project for consistency and correctness. #805
Added support for FFmpeg and Decord backends in video data processing, improving flexibility and performance. #826 #829
Added an MCP server CLI entry point to facilitate modular service deployment and updated the MCP documentation. #798

Fixed Bugs

Fixed the Auto Prompt pipeline in sandbox to restore correct prompt generation behavior. #791
Fixed a Ray connection error by properly passing the config parameter through resource utility functions. #808
Fixed several CUDA-based operators to use internal resource monitor. #809
Fixed custom op module loading issues and optimized video_extract_frames_mapper for saving extracted frames. #803
Reset num_proc for vLLM and set default batch_size to 10 for CUDA operators to improve stability. #814
Fixed Sphinx autodoc compatibility issue in the SpecialTokens metaclass to restore documentation build. #816
Resolved a bug in trace_filter by excluding the __dj_stats__ column during dataset comparison. #828
Fixed several typos in video_split_by_scene_mapper. #744
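
To illustrate the trace_filter fix listed above (#828): a filter attaches a stats column to each sample, so a before/after comparison is only meaningful once that column is ignored. The following is a minimal standalone sketch using plain Python dicts as dataset rows. The helper names strip_stats and filtered_out are hypothetical and are not Data-Juicer's actual implementation.

```python
# Illustrative only: plain dicts stand in for dataset rows.
STATS_KEY = "__dj_stats__"  # column name taken from the release note


def strip_stats(sample: dict) -> dict:
    """Return a copy of the sample without the stats column."""
    return {k: v for k, v in sample.items() if k != STATS_KEY}


def filtered_out(before: list, after: list) -> list:
    """Samples present in `before` but missing from `after`, ignoring stats."""
    kept = [strip_stats(s) for s in after]
    return [s for s in before if strip_stats(s) not in kept]


before = [{"text": "short"}, {"text": "a much longer sample"}]
# After a filter runs, surviving rows carry an extra stats column.
after = [{"text": "a much longer sample", STATS_KEY: {"text_len": 20}}]

removed = filtered_out(before, after)
print(removed)  # [{'text': 'short'}]
```

Without the strip_stats step, the surviving row would also look "removed", because its dict no longer equals the original one.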

Acknowledgement

@kyo-tom helped fix the Ray connection error in #808
@liuyuhanalex helped fix several small typos in #744

Full Changelog: v1.4.3...v1.4.4

Contributors

liuyuhanalex and kyo-tom


11 Sep 09:11

HYLcool

v1.4.3

613882b

Release v1.4.3: OP Doc Enhancement; Optimized Auto Parallelism; Optimized Sandbox

Major Updates

🤝 OP Document Updates: Optimized multi-version docs; Doc strings are rewritten and enhanced by qwen-max. #755 #765 #768 #769 #787
💪🏻 Auto Parallelism Optimization: support cpu/gpu/mem requirement specification for each OP; optimize calculate_np for ray mode. #679 #774 #782 #786
🛠️ Sandbox Optimization: support iterative pipelines and early-stop targets; refactor the context infos; a new example on auto prompt optimization and several related hooks are added. #757
📈 Upgrade spacy from 3.8.0 to 3.8.7 because the previous version was yanked. #763
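
As a rough sketch of how per-OP cpu/gpu/mem requirements can bound parallelism: the worker count is limited by whichever resource runs out first. This is an assumption-based illustration, not the project's calculate_np logic. The function name estimate_num_proc and all of its parameters are hypothetical.

```python
def estimate_num_proc(
    total_cpus: int,
    total_mem_gb: float,
    total_gpus: int,
    cpu_required: float = 1.0,
    mem_required_gb: float = 0.0,
    gpu_required: float = 0.0,
) -> int:
    """Upper bound on parallel workers given per-worker requirements.

    Hypothetical helper; cpu_required must be greater than zero.
    """
    limits = [int(total_cpus // cpu_required)]
    if mem_required_gb > 0:
        limits.append(int(total_mem_gb // mem_required_gb))
    if gpu_required > 0:
        limits.append(int(total_gpus // gpu_required))
    # Always run at least one worker, even on a constrained machine.
    return max(1, min(limits))


# 16 CPUs allow 16 workers, 64 GB at 8 GB each allow 8,
# and 2 GPUs at 0.5 GPU each allow 4, so the GPU is the bottleneck.
print(estimate_num_proc(16, 64, 2, cpu_required=1, mem_required_gb=8, gpu_required=0.5))  # 4
```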

New OPs

image_detection_yolo_mapper: perform object detection (with YOLO) on images and return the bounding box values and class labels. #764
optimize_prompt_mapper: optimize prompts based on the existing ones. #757

Enhancements

Support shard_size and extra args for write methods in export_extra_args for RayExporter. #739
Support min/max_closed_interval args to control filtering with open/closed intervals and reversed_range arg to allow keeping samples outside a specified range for Filters. #741
Support API models for existing optimize_qa_mapper. #771
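
The interval options for Filters described above can be read as follows. This is a minimal sketch of the semantics only, assuming "min/max_closed_interval" expands to two boolean arguments named min_closed_interval and max_closed_interval. The function in_range is hypothetical and is not the Filter base-class code.

```python
def in_range(
    value: float,
    min_val: float,
    max_val: float,
    min_closed_interval: bool = True,
    max_closed_interval: bool = True,
    reversed_range: bool = False,
) -> bool:
    """Decide whether a sample with stat `value` should be kept."""
    # A closed bound includes the endpoint; an open bound excludes it.
    lower_ok = value >= min_val if min_closed_interval else value > min_val
    upper_ok = value <= max_val if max_closed_interval else value < max_val
    inside = lower_ok and upper_ok
    # reversed_range keeps samples that fall OUTSIDE the interval instead.
    return not inside if reversed_range else inside


print(in_range(10, 10, 20))                             # True: [10, 20] includes 10
print(in_range(10, 10, 20, min_closed_interval=False))  # False: (10, 20] excludes 10
print(in_range(25, 10, 20, reversed_range=True))        # True: 25 is outside [10, 20]
```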

Fixed Bugs

Fix and re-enable the disabled op_list_to_trace argument. #766
Add missing skip tag to several API-based test cases for forked repos. #767
Limit the version of transformers to "<4.55.0" to avoid computing on None value. #781
Fix out-of-date invoking methods in several tools. #785 (from issue #750)
Fix 500 error in API service. #785 (from issue #777)
Remove specified_xxx_filter from NON_STATS_FILTER. #785 (from issue #783)

Full Changelog: v1.4.2...v1.4.3


18 Aug 03:22

HYLcool

v1.4.2

14f6594

Release v1.4.2: Python > 3.10 are supported; Data Attribution OPs; External OPs are supported; Install with "uv"

Major Updates

💪🏻 Data-Juicer is now compatible with Python 3.11 & 3.12. #749
🧩 5 OPs for data attribution are added. #735
🤝 Data-Juicer now supports registering and applying custom OPs from external paths via the custom_operator_paths argument. #758
🔧 "uv" is now the first choice for installing Data-Juicer because of its ability to resolve dependency conflicts. #760

New Operators

Filter

Validation-free
- llm_perplexity_filter: Filter to keep samples with perplexity score, computed using a specified llm, within a specific range. #735
- instruction_following_difficulty_filter: Filter to keep texts whose instruction-following difficulty (IFD, https://arxiv.org/abs/2308.12032) falls within a specific range. #735
Validation-based
- in_context_influence_filter: Filter to keep texts whose in-context influence on the validation set falls within a specific range. #735
- llm_task_relevance_filter: Filter to keep samples with a high relevance score to the validation tasks, as estimated by an LLM. #735
- text_embd_similarity_filter: Filter to keep texts whose average embedding similarity to a set of given validation texts falls within a specific range. #735
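
To make the validation-based idea behind text_embd_similarity_filter concrete, here is a small standalone sketch: score a sample by its average similarity to a set of validation embeddings, then keep it if the score falls in a range. Cosine similarity, the toy 2-D vectors, and the names cosine, avg_similarity, and keep are all assumptions for illustration. The real OP obtains embeddings from a model and may differ in detail.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def avg_similarity(embedding: list, validation_embeddings: list) -> float:
    """Mean cosine similarity between one sample and all validation samples."""
    scores = [cosine(embedding, v) for v in validation_embeddings]
    return sum(scores) / len(scores)


def keep(embedding, validation_embeddings, min_score=0.1, max_score=1.0) -> bool:
    """Keep the sample if its average similarity lies in [min_score, max_score]."""
    return min_score <= avg_similarity(embedding, validation_embeddings) <= max_score


# Toy vectors standing in for real text embeddings.
validation = [[1.0, 0.0], [0.0, 1.0]]
sample = [1.0, 0.0]
print(avg_similarity(sample, validation))   # 0.5
print(keep(sample, validation, min_score=0.6))  # False
```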

Enhancements

A new environment variable, DATA_JUICER_EXTERNAL_MODELS_HOME, is added to specify private or read-only paths where external and extra models are stored. #740
Optimize the video link transformation and multi-version maintenance in the docs. Update demo videos with higher-resolution versions. #746
Support custom save_dir for OPs that produce extra multimodal data. #751
Add official and detailed docs about Data-Juicer Agent. #759
Enhance unit tests: show the name of the current test case; recycle resources after each test case in Ray mode. #749
Refine the developer guide with better practices for building new OPs. #760

Bugs Fixed

Move the update of special tokens for multimodal data into the initialization of base_op, which fixes a bug where special tokens might not be synced with the main process when processing data in parallel. #752
Fix some test cases. #754

Acknowledgement

@ShenQianli made their first contribution to 5 new OPs. #735

Full Changelog: v1.4.1...v1.4.2

Contributors

ShenQianli


16 Jul 13:05

HYLcool

v1.4.1

7505686

Release v1.4.1: MCP server; GPU-based Minhash deduplicator; Improved unit test coverage.

Major Updates

🔧 Introduce Data-Juicer MCP server. Users can make use of the data processing capabilities in the MCP way conveniently. #690 #737
💪🏻 Unit test coverage rate is improved to 85%+ and several bugs in test cases are resolved (OOM, encoding error, and so on), which makes Data-Juicer more reliable. #698 #717 #720 #727
🤝 GPU-based Minhash deduplication is supported, in collaboration with developers from NVIDIA. #694 #644
🧩 RayExporter supports more formats to export a ray dataset in addition to json/jsonl. #687
🎥 Two demo videos are added to introduce the Data-Juicer core functions, agentic usages, and sandbox. #738
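
For readers unfamiliar with the technique behind the Minhash deduplicator mentioned above, the following is a minimal CPU-only sketch of MinHash signatures and Jaccard estimation using the standard library. It is not the GPU implementation from #694/#644. The function names, the word-level shingling, and the choice of blake2b as the hash are assumptions made for illustration.

```python
import hashlib


def shingles(text: str, n: int = 3) -> set:
    """Word-level n-gram shingles of a text."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(max(1, len(tokens) - n + 1))}


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash; each seed acts as one 'permutation'."""
    digest = hashlib.blake2b(f"{seed}:{item}".encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set, num_perm: int = 64) -> list:
    """For each seed, keep the minimum hash value over all shingles."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog"
doc_b = "completely unrelated sentence about data processing pipelines"

sig_a = minhash_signature(shingles(doc_a))
sig_a_copy = minhash_signature(shingles(doc_a))
sig_b = minhash_signature(shingles(doc_b))

print(estimated_jaccard(sig_a, sig_a_copy))  # 1.0 for identical documents
print(estimated_jaccard(sig_a, sig_b))
```

A deduplicator would then treat document pairs whose estimated similarity exceeds a chosen threshold as near-duplicates.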

New Operators

download_file_mapper downloads data from URLs to local files or specified fields. #709

Enhancements

New analysis method: correlation analysis among stats is added. #663
Several core dependencies are updated and fixed to a newer version, and dependency conflicts are resolved. #715 #717 #723
The EasyAnimate pipelines in the sandbox are updated to follow the refactoring of sandbox. #710
Apply more reliable pre-commit tools to improve the code style of Data-Juicer. #714
Support storing and processing image bytes data in the dataset. #725

Bugs Fixed

The wheel & docker image building bug is fixed. #706
Fix bugs in log_summarization. #710
Fix "no module named data_juicer" error after installing from the wheel file. #727

Acknowledgement

@fanronghai helped fix the param error in the dataset_splitting_by_language tool. #713
@ayushdg helped add a GPU version of the Minhash deduplicator. #644
@ricksun2023 helped fix the bugs that occur when the configs contain more than one OP with the same name. #730

Full Changelog: v1.4.0...v1.4.1

Contributors

ayushdg, fanronghai, and ricksun2023


13 Jun 11:43

yxdyc

v1.4.0

714df97

v1.4.0 Major Refactor for Env Management, Doc, Sandbox; Derivative Works (TPAMI Survey; Trinity-RFT & DetailMaster)

Summary: 200+ files changed with 18,535 additions and 3,720 deletions.

🔧 Major Refactors & Improvements

🔄 Sandbox Usability (#686):
- Support for multiple pipelines, context info, and an environment manager to run different commands in various environments.
- Includes the InternVL example as a showcase.
📘 DJ-Doc Redesign (#675):
- Now with multilingual support (English / Chinese) and a modernized style.
📦 Dependency Management Update (#660, #680):
- Migrated to uv for faster dependency resolution.
- Added sub-groups for better organization.

🌍 New Features & Integrations (#683, #688, #692)

🆕 Additional Repo Supported:
- Trinity-RFT now supported by Data-Juicer.
📜 DJ-Awesome-List:
- A survey paper accepted by TPAMI'25!
🧪 Synthetic Benchmark Added:
- DetailMaster – a new benchmark for synthetic data evaluation.
🛠️ New Operators Introduced (#673, #701):
- llm_analysis_filter
- general_field_filter

🚀 Core Optimizations & Bug Fixes

✅ Ray Executor Enhancements (#697):
- File extension detection added.
- Support for more data formats.
⏱️ Startup Time Optimization:
- Improved startup performance. (#684)
🧠 Text Embedding Support:
- Added support for text embedding via API and local model. (#681)
🐳 Docker Build Improvement:
- Ignore installed distutils libraries during Docker image building. (#668)
🛠️ Mapper Module Fix:
- Fixed issue with module initialization. (#700)
🗑️ Warning Suppression:
- Suppressed unnecessary warnings from fasttext. (#696)

📚 Full Changelog

View all changes since v1.3.3 →


09 May 10:20

HYLcool

v1.3.3

444537e

Release v1.3.3: Sandbox is accepted as Spotlight by ICML 2025; Add Img-Diff recipes.

Major Updates

🎉 Our work of Data-Juicer Sandbox has been accepted as a Spotlight by ICML 2025 (top 2.6% of all submissions)!
Add new OPs and recipes for Img-Diff. #658

Enhancements

Support HF llm for two llm_xxx_score_filter OPs. #655
Sync the docker image to Aliyun OSS so it can be downloaded when Docker Hub is not accessible. #657
Split standalone and distributed unit tests to save time when re-running failed ones. #666

Bugs Fixed

Address possibly missing cfg in unify_format. #653
Improve clarity & fix bad links for some docs. #659

Acknowledgement

@co63oc helped fix some typos. #654

Full Changelog: v1.3.2...v1.3.3

Contributors

co63oc


25 Apr 11:17

yxdyc

v1.3.2

2172698

Release v1.3.2: Enhancements on usability & two OPs; some bugs fixes

What's Changed

Human OP enhancements, in #642 #645
- update label-studio version
- make service script more robust
- add documentation
- optimizing fields mapping
OP efficiency optimization of document_minhash_deduplicator, in #639
set temp_parser.usage to argparse.SUPPRESS to avoid excessive help output, in #643
fix date typo in #648
Fix docker building failure in #650
Fix StreamToLoguru compatibility issue with torch._dynamo in #651
add init file for annotation module, fix dj-process command error in #652

New Contributor

@cmgzn made their first contribution in #651

Contributors

cmgzn


11 Apr 09:48

HYLcool

v1.3.1

e90a759

Release v1.3.1: added HumanOPs & fixed some bugs

Major Updates

💥 Prototype implementation for HumanOps (annotation). #617 Included features:
- boilerplate code for supporting label studio powered human annotation ops
- a human preference annotation reference implementation is provided
- label studio service script; can start up local instance using docker or pip, whichever is available
- reference configs and data
- event driven and notification mixins framework for ops

New OPs

extract_tables_from_html_mapper: extract tables from html texts. #634
general_fused_op: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626
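
The idea behind general_fused_op, running several OPs in sequence on the same batch, can be sketched as below. This is a simplified illustration with plain functions on a column-oriented batch dict. The names run_fused, lowercase_mapper, and min_length_filter are hypothetical and do not reflect the operator's real interface.

```python
from typing import Callable, Dict, List

Batch = Dict[str, list]  # column name -> list of values, one per sample


def run_fused(batch: Batch, ops: List[Callable[[Batch], Batch]]) -> Batch:
    """Apply each op to the batch in order, passing the result along."""
    for op in ops:
        batch = op(batch)
    return batch


def lowercase_mapper(batch: Batch) -> Batch:
    """Example mapper: lowercase the text column."""
    return {**batch, "text": [t.lower() for t in batch["text"]]}


def min_length_filter(batch: Batch, min_len: int = 5) -> Batch:
    """Example filter: drop rows whose text is shorter than min_len."""
    keep = [len(t) >= min_len for t in batch["text"]]
    return {k: [v for v, flag in zip(vals, keep) if flag] for k, vals in batch.items()}


result = run_fused(
    {"text": ["Hello World", "Hi"], "id": [1, 2]},
    [lowercase_mapper, min_length_filter],
)
print(result)  # {'text': ['hello world'], 'id': [1]}
```

Because every op sees the same in-memory batch, the intermediate results never need to be written back to the dataset between steps.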

Bugs Fixed

fix dataset builder initialization failure #630
update Executor references from Executor to DefaultExecutor #632 #633
switch the backend of plt to avoid sub-process/thread error #633
fix some boundary condition bugs in several deduplicators #635 #637

Others

check the dataset when loading so that a dataset can be passed to the DefaultExecutor.run method. #633
update docs to highlight light env installation part. #636

Acknowledgement

@liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. #634 #635

Full Changelog: v1.3.0...v1.3.1

Contributors

liuyuhanalex


28 Mar 12:08

yxdyc

v1.3.0

1b9afd1

Release v1.3.0: Refactor of dataset builder and executor!

The Big Change 🚀

Refactor of dataset builder and executor, see #537, @cyruszhang
📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.
🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.
🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters/downloaders.
🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.
🔍 Add data format validation to ensure consistency and correctness.
🌐 Expanded data source support:
a. 📦 ModelScope integration.
b. 📚 ArXiv dataset import (download, decompress, ingest).
c. 📚 Wikipedia dataset support (download, decompress, ingest).
d. 🌐 Common Crawl integration (download, decompress, ingest).
🔗 Backward compatibility with existing dataset_path command-line syntax.
🔀 Support for data mixtures to combine multiple datasets dynamically.
🔧 Support for empty formatters/generated datasets without pre-defined config files.

Others 💡

🔊 New audio processing operator: audio_add_gaussian_noise (PR #622), @liuyuhanalex
📊 Added dynamic coverage rate badge to the README for transparency (PR #625)

Contributors

cyruszhang and liuyuhanalex


14 Mar 09:58

BeachWang

v1.2.2

8d09410

Release v1.2.2

Major Updates

🧪 Add document for API service. Add parameter transmission using json.dumps to support API calls for arbitrary registration functions and classes. #613
🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616
A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., 16% gain on MathVision using only 400 samples). See more details in MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions.

New OPs

llm_quality_score_filter: Filter to keep samples with a high quality score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620
llm_difficulty_score_filter: Filter to keep samples with a high difficulty score estimated by an LLM; supports both API calls and local vLLM calls. #606 #614 #620

Others

Fix config in LLaVa pretrain recipe. #610
Update news for MindGYM and fix doc. #615
Fix decode error through UTF-8 decoding. #618


Releases: datajuicer/data-juicer
