Release v1.0.0: Refactor DJ-Dataset & DJ-Operator, Sandbox, and more exciting features!

@yxdyc yxdyc released this 22 Nov 02:50
· 27 commits to main since this release
9caaaa9

Major Updates

  • 🚀 Refactor Data-Juicer Operator & Dataset for better usability! We combine our two backends, HuggingFace Dataset and Ray Dataset, into a unified DJ-Dataset, and we unify the existing invoking interfaces while introducing new ones. Building on this, we add a fault-tolerant strategy for data processing that helps users identify the actual reasons for processing failures. #359 #366
  • 🧪 [Experimental] Data-Juicer Sandbox toolkit is now available! With the highly customizable Sandbox, users can co-develop datasets and models to obtain better performance. For more details, please refer to the docs. #273 #291 #312 #332 #364
  • 🚀 A basic API server based on FastAPI is now available in Data-Juicer! Users can now access the capabilities of OPs through an API service. #468
  • 🚀 Support adaptive resource management:
    • Adaptive number of processors for model-based OPs according to the GPU memory and other types of resource utilization. #270 #329 #354
    • Adaptive batch size for batched OPs according to their resource utilization to maximize the OP speed. #429
  • 💥 We presented a tutorial, Multi-modal Data Processing for Foundation Models: Practical Guidance and Use Cases, at KDD'24. #310
  • 💥 A lot of additions and improvements were made to OPs, DJ-Engine, and CI/CD. See more details below~
  • 🛝 A playground for Data-Juicer is now open for users to try. #277 #368
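
The adaptive resource management item above can be illustrated with a minimal sketch. This is not Data-Juicer's implementation: the helper name `estimate_num_proc` and its parameters are hypothetical, and the memory figures are passed in by the caller instead of being queried from a GPU.

```python
def estimate_num_proc(free_mem_gb: float, mem_per_proc_gb: float, max_proc: int) -> int:
    """Estimate how many worker processes fit into the free memory.

    Hypothetical helper for illustration only, not part of the Data-Juicer API.
    """
    if mem_per_proc_gb <= 0:
        raise ValueError("mem_per_proc_gb must be positive")
    # Number of workers whose models fit into the free memory at once.
    fit = int(free_mem_gb // mem_per_proc_gb)
    # Keep at least one worker and never exceed the requested cap.
    return max(1, min(max_proc, fit))
```

For example, with 24 GB free and a model needing about 5 GB per process, the sketch allows 4 processes even if 8 were requested.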

OPs

Text

  • ray_document_deduplicator: supports Ray-based distributed exact-match deduplication for text-only datasets. #263
  • Support sentencepiece tokenizer for MinHash deduplicators. #269
  • generate_qa_from_text_mapper: generates question and answer pairs from input texts. #333 #454
  • generate_qa_from_examples_mapper: generates question and answer pairs based on examples. #338 #454
  • optimize_qa_mapper: optimizes the question-answer pairs in question-answering samples. #338 #454
  • optimize_query_mapper: optimizes the query in question-answering samples. #338 #454
  • optimize_response_mapper: optimizes the response in question-answering samples. #454
  • calibrate_qa_mapper: calibrates question-answer pairs based on reference text. #463
  • calibrate_query_mapper: calibrates query in question-answer pairs based on reference text. #463
  • calibrate_response_mapper: calibrates response in question-answer pairs based on reference text. #463
  • text_chunk_mapper: splits input text into chunks. #481
  • extract_entity_attribute_mapper: extracts attributes for given entities from the text. #481
  • extract_entity_relation_mapper: extracts entities and relations in the text for knowledge graph. #481
  • extract_event_mapper: extracts events and relevant characters in the text. #481
  • extract_keyword_mapper: generates keywords for the text. #481
  • extract_nickname_mapper: extracts nickname relationships in the text. #481
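
As a rough illustration of what exact-match deduplication means for text samples, here is a single-process sketch. It is an assumption-laden stand-in, not the code of ray_document_deduplicator: the function name `dedup_exact` and the `text_key` parameter are hypothetical, and a distributed version would keep the set of seen hashes in shared storage instead of a local `set`.

```python
import hashlib


def dedup_exact(samples, text_key="text"):
    """Keep the first occurrence of each distinct text (hypothetical helper)."""
    seen = set()
    kept = []
    for sample in samples:
        # Hash the raw text so only byte-identical texts count as duplicates.
        digest = hashlib.sha256(sample[text_key].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(sample)
    return kept
```

Hashing keeps the memory footprint independent of document length, which matters when the seen-set has to be shared across workers.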

Image

  • image_face_blur_mapper: blurs faces detected in images. #249
  • image_nsfw_filter: keeps samples containing images with NSFW scores below the threshold. #252
  • image_watermark_filter: keeps samples containing images with predicted watermark probabilities below the threshold. #256
  • ray_image_deduplicator: supports Ray-based distributed exact-match deduplication for image or image-text datasets. #263
  • image_pair_similarity_filter: keeps image pairs with image feature cosine similarity within the specified range based on a CLIP model. #393
  • image_tagging_mapper: generates image tags from the input images. #423
  • image_face_count_filter: keeps samples containing images with face counts within the specified range. #446
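
The range check described for image_pair_similarity_filter can be sketched on plain feature vectors. The example below assumes the image features have already been extracted (for instance by a CLIP model) and uses hypothetical names (`cosine_similarity`, `keep_pair`, `min_score`, `max_score`); it does not reproduce the OP's actual parameters.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        # Treat a zero vector as having no similarity to anything.
        return 0.0
    return dot / (norm_a * norm_b)


def keep_pair(feat_a, feat_b, min_score=0.1, max_score=1.0):
    """Keep the pair only if its similarity lies within [min_score, max_score]."""
    return min_score <= cosine_similarity(feat_a, feat_b) <= max_score
```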

Video

  • video_face_blur_mapper: blurs faces detected in videos. #253
  • video_remove_watermark_mapper: removes the watermarks in given regions from the videos. #236
  • video_nsfw_filter: keeps samples containing videos with NSFW scores below the threshold. #252
  • video_watermark_filter: keeps samples containing videos with predicted watermark probabilities below the threshold. #256
  • ray_video_deduplicator: supports Ray-based distributed exact-match deduplication for video or video-text datasets. #263
  • video_tagging_from_frames_filter: keeps samples containing videos with given tags. #260
  • video_captioning_from_frames_mapper: generates samples whose captions are generated based on an image-to-text model and sampled video frames. Captions from different frames will be concatenated into a single string. #257
  • video_captioning_from_summarizer_mapper: generates video captions by summarizing several kinds of generated texts (captions from video/audio/frames, tags from audio/frames, ...). #250
  • video_motion_score_raft_filter: keeps samples with video motion scores (based on RAFT model) within a specific range. #478
  • Enhance the video_motion_score_filter to support float sampling FPS, frame resizing, optical flow magnitude normalization, and so on. #361

Misc.

  • Switch face detection used in 3 OPs (image_face_ratio_filter, image_face_blur_mapper, video_face_blur_mapper) from dlib to OpenCV to avoid dependency problems. #320
  • Deduplicators for multimodal datasets are allowed to consider text information as well. #313
  • Support batched processing for some OPs. #406 #435

Others (Engine, Job Control and Tools)

  • Support more multimodal (video) dataset conversion tools: MSR-VTT. #248
  • Support distributed processing script for Slurm. #242
  • Support MinHash-LSH deduplication tools based on Spark. #290
  • Enable GPU usage for Ray executor. #274
  • Add debug mode for Data-Juicer. #303
  • Add video generation tools for several metrics. #273 #312
  • Deploy a self-hosted runner for unit tests and enable unit tests for Ray mode. #304
  • Add sampled frames from videos for video OPs to support OP fusion. #271
  • Allow saving stats for each OP separately by specifying an export path for each of them. #309
  • Add a new field to record the source files of multimodal data when they are augmented or regenerated by some OPs, so it's convenient to trace back. #317
  • Support turbo mode, which disables some functions unrelated to processing to maximize processing speed and reduce resource usage. #402
  • Update type annotations from jsonargparse to Pydantic. #422
  • Add a Monitor module to monitor the resource utilization during data processing for each OP. #429
  • Allow lazy importing of third-party libraries and installing dependencies when they are missing. #414 #443
  • Allow batched processing for all OPs by building on the single-sample versions of the compute_stats/process methods, so they do not have to be rewritten as batched versions manually. #448
  • Enable unit test coverage report. #460
  • Support invoking API models for interaction with OpenAI-compatible APIs. #463 #479
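
The idea of deriving batched processing from a single-sample method, mentioned above, can be sketched as a small wrapper. This is only an illustration under assumptions: `batched_from_single` is a hypothetical name, and the batch layout (a dict mapping each column name to a list of values) is assumed here, not taken from the release notes.

```python
def batched_from_single(process_single):
    """Wrap a per-sample function so it accepts a column-oriented batch.

    Hypothetical helper for illustration, not the Data-Juicer implementation.
    """
    def process_batched(batch):
        keys = list(batch.keys())
        num_samples = len(batch[keys[0]]) if keys else 0
        # Rebuild each sample as a dict, then apply the single-sample function.
        results = [
            process_single({k: batch[k][i] for k in keys})
            for i in range(num_samples)
        ]
        # Convert the list of result dicts back into a dict of lists.
        out_keys = list(results[0].keys()) if results else keys
        return {k: [r[k] for r in results] for k in out_keys}

    return process_batched
```

With such a wrapper, an OP author only has to maintain the single-sample logic, and the batched entry point follows mechanically.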

Document Updates

  • Refine documentation system based on Sphinx. #245
  • Regular document updates. #234 #246
  • Update the class importing and document building logics for better automation. #299
  • Reorganize the operator documents for better reading. #472

Bugs Fixed

  • Fix the bug where the video splitting function returned non-existent videos when given a short duration. #243
  • Fix the bug where the multimodal data produced by different OPs was stored in nested directories. #247
  • Fix some problems in demos. #244
  • Fix "Undefined punctuation_pattern" error in two OPs. #301
  • Exceptions and errors are now re-raised to the upper level, and the status code is returned to the system correctly. #287
  • Fix the bug where type hint checking for config files did not work. #302
  • Fix the bug where parameters in the base classes could not be parsed in some OPs. #311
  • Fix the memory leak in video OPs. #374
  • Fix the bug where two OPs (video_aesthetics_filter and image_diffusion_mapper) could not make use of GPUs. #389
  • Fix the bug of checkpoints not being restored correctly when the current process list has fewer OPs than the previous one. #391

Acknowledgment

Here we thank public contributors for their PRs to make Data-Juicer better!

  • @chg0901 helps to fix typos in documents. #237
  • @lingzhq helps to update the paper list in Awesome Data-Model Co-Development of MLLMs. #289
  • @shiweijiezero helps to fix the bugs in updating the data keys. #300
  • @seanzhang-zhichen helps to support multiple patterns for replace_content_mapper. #319
  • @simplaj helps to fix a bug of a non-predefined attribute for video_captioning_from_summarizer_mapper. #343
  • @zhenqincn helps to reorganize the paper list and add more papers from our survey in Awesome Data-Model Co-Development of MLLMs. #352 #381 #456 #461
  • @2108038773 helps to add trust_remote_code argument for some public models on HuggingFace. #382 #385
  • @TobyJasper helps to fix typos in documents and contributes a new OP, image_face_count_filter. #392 #452
  • @co63oc helps to fix some typos in documents and code. #427