⚙️ Process

The process module extracts and standardizes text and images from the diverse file formats listed below, making it ideal for building datasets for applications such as RAG, multimodal content generation, and preprocessing data for LLMs and multimodal LLMs.

🔨 Quick Start

🧑‍💻 Global installation

Set up the project on each device you want to use, either by running our setup script or by checking what it does and performing the steps manually:

```
bash scripts/setup.sh
```

💻 Running locally

Specify the input folder by modifying the config file; you can also tweak the other parameters to your needs. Once ready, run the process with the following commands:

```
source .venv/bin/activate
python run_process.py --config_file examples/process_config.yaml
```
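The config is a YAML file. A minimal sketch follows; the path key names are illustrative guesses, while `use_fast_processors` and `distributed` are the documented flags covered later in this guide:

```yaml
# Sketch of a process config; see examples/process_config.yaml for the real keys.
input_path: /path/to/my/input/folder   # hypothetical name for the input-folder key
output_path: /path/to/output           # root of the output tree shown below
use_fast_processors: false             # see "Fast mode" under Optimization
distributed: false                     # see "Distributed mode" under Optimization
```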

The output of the pipeline has the following structure:

```
output_path
├── processors
│   ├── Processor_type_1
│   │   └── results.jsonl
│   ├── Processor_type_2
│   │   └── results.jsonl
│   └── ...
├── merged
│   └── merged_results.jsonl
└── images
```

🚀 Running on distributed nodes

We provide a simple bash script to run the process in distributed mode. Call it with your arguments:

```
bash scripts/process_distributed.sh -f /path/to/my/input/folder
```

📜 Examples

You can find more example scripts in the /examples directory.

⚡ Optimization

🏎️ Fast mode

For some file types, we provide a fast mode that processes files faster using a different method. To use it, set `use_fast_processors` to `true` in the config file.
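For example:

```yaml
use_fast_processors: true  # trades some accuracy for speed on supported types
```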

Be aware that the fast mode might not be as accurate as the default mode, especially for scanned non-native PDFs, which may require Optical Character Recognition (OCR) for more accurate extraction.

🚀 Distributed mode

The project is designed to scale easily to a multi-GPU / multi-node environment. To use it, set `distributed` to `true` in the config file and follow the steps described in the 🚀 Running on distributed nodes section above.
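For example, in the config file:

```yaml
distributed: true  # scale the run across multiple GPUs / nodes
```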

🔧 File type parameters tuning

Many parameters are hardware-dependent and can be customized to suit your needs. For example, you can adjust the processor batch size, dispatcher batch size, and the number of threads per worker to optimize performance.

You can configure these parameters by providing a custom config file; you can find an example in the examples folder.
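For illustration, such knobs could appear in the config as follows; these key names are assumptions, not confirmed ones:

```yaml
# Illustrative key names only; consult examples/process_config.yaml for the real ones.
processor_batch_size: 16    # files handled per processor call
dispatcher_batch_size: 64   # files sent to each worker per dispatch
threads_per_worker: 4       # CPU threads each worker may use
```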

🚨 Not all parameters are configurable yet 😉

📜 More information on what's under the hood

🚧 Pipeline architecture

Our pipeline is a three-step process (sketched in code after the list below):

  • Crawling: We first crawl over the file/folder to list all the files we need to process.
  • Dispatching: We then dispatch the files to the workers in batches via a dispatcher. This stage also handles load balancing across nodes when the project runs in a distributed environment.
  • Processing: The workers then process the files, using the appropriate tools for each file type. They extract the text, images, audio, and video frames, and send them to the next step. Our goal is to provide an easy way to add new processors for new file types, or even other types of processing for existing file types.
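Below is a minimal, illustrative Python sketch of these three stages; the function names, batch handling, and output here are ours for illustration, not the project's actual API:

```python
from pathlib import Path

def crawl(root: str) -> list[Path]:
    """Step 1: list every file under the input folder."""
    return [p for p in Path(root).rglob("*") if p.is_file()]

def dispatch(files: list[Path], batch_size: int = 8):
    """Step 2: group files into batches for the workers.
    In distributed mode, load balancing between nodes happens at this stage."""
    for i in range(0, len(files), batch_size):
        yield files[i : i + batch_size]

def process(batch: list[Path]) -> None:
    """Step 3: hand each file to the processor that matches its type."""
    for path in batch:
        print(f"processing {path} with the processor for '{path.suffix}' files")

for batch in dispatch(crawl("my_input_folder")):
    process(batch)
```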

🛠️ Used tools

The project supports multiple file types and utilizes various AI-based tools for processing. Below is a table summarizing the supported file types and the corresponding tools (N/A means there is no separate fast-mode tool):

| File Type | Default Mode Tool(s) | Fast Mode Tool(s) |
| --- | --- | --- |
| DOCX | python-docx to extract the text and images | N/A |
| MD | markdown for text extraction, markdownify for HTML conversion | N/A |
| PPTX | python-pptx to extract the text and images | N/A |
| XLSX | openpyxl to extract the text and images | N/A |
| TXT | Python built-in library | N/A |
| EML | Python built-in library | N/A |
| MP4, MOV, AVI, MKV, MP3, WAV, AAC | moviepy for video frame extraction; whisper-large-v3-turbo for transcription | whisper-tiny |
| PDF | marker-pdf for OCR and structured data extraction | PyMuPDF for text and image extraction |
| Webpages (TBD) | selenium to navigate the webpage; requests for images; surya for OCR and extraction | selenium to navigate the webpage and extract content; requests for images; trafilatura for content extraction |

We also use Dask distributed to manage the distributed environment.
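For reference, here is the general Dask Distributed usage pattern with a local cluster; this is standard dask.distributed API, not the project's actual wiring:

```python
from dask.distributed import Client

# Start a local cluster; in production the workers would live on other nodes.
client = Client(n_workers=2)

# Scatter work to the workers and gather the results back.
futures = client.map(lambda name: f"processed {name}", ["a.pdf", "b.docx"])
print(client.gather(futures))  # ['processed a.pdf', 'processed b.docx']

client.close()
```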

🔧 Customization

The system is designed to be extensible, allowing you to register custom processors for new file types or specialized processing. To implement a new processor, inherit from the Processor class and implement at least three methods:

  • accepts: defines the file types your processor supports
  • process_implementation: processes a single file
  • require_gpu: reports whether the processor requires a GPU

See TextProcessor in src/process/processors/text_processor.py for a minimal example.
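As a sketch, a hypothetical processor for .csv files could look like the following; the import path, method signatures, and return shape are assumptions, so check TextProcessor for the real interface:

```python
from src.process.processors.processor import Processor  # assumed import path

class CsvProcessor(Processor):  # hypothetical processor for .csv files
    def accepts(self) -> list[str]:
        # File extensions this processor handles.
        return [".csv"]

    def require_gpu(self) -> bool:
        # Plain-text parsing needs no GPU.
        return False

    def process_implementation(self, file_path: str) -> dict:
        # Assumed record shape: one dict per file, as in the results.jsonl rows.
        with open(file_path, encoding="utf-8") as f:
            text = f.read()
        return {"file": file_path, "text": text}
```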

The power of this architecture is that it easily accommodates any new file type or processing method! Feel free to add your own processors and share them with the community.