The process module enables the extraction and standardization of text and images from diverse file formats (listed below), making it ideal for building datasets for applications such as RAG, multimodal content generation, and preprocessing data for LLMs and multimodal LLMs.
Set up the project on each device you want to use, either with our setup script or by reading what it does and performing the steps manually.
bash scripts/setup.sh
You need to specify the input folder by modifying the config file. You can also tweak the parameters to your needs. Once ready, you can run the process using the following commands:
source .venv/bin/activate
python run_process.py --config_file examples/process_config.yaml
The output of the pipeline has the following structure:
output_path
├── processors
│   ├── Processor_type_1
│   │   └── results.jsonl
│   ├── Processor_type_2
│   │   └── results.jsonl
│   └── ...
├── merged
│   └── merged_results.jsonl
└── images
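Once a run completes, the merged results can be consumed line by line. The sketch below loads `merged/merged_results.jsonl` from an output folder; the fields inside each record are whatever your processors emit, so the loader makes no assumptions about the schema:

```python
import json
from pathlib import Path


def load_merged_results(output_path):
    """Load every record from merged/merged_results.jsonl under output_path."""
    merged_file = Path(output_path) / "merged" / "merged_results.jsonl"
    records = []
    with merged_file.open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines between records
                records.append(json.loads(line))
    return records
```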
We provide a simple bash script to run the process in distributed mode. Call it with your arguments:
bash scripts/process_distributed.sh -f /path/to/my/input/folder
You can find more example scripts in the /examples directory.
For some file types, we provide a fast mode that processes files faster using a different method. To use it, set `use_fast_processors` to `true` in the config file.
Be aware that the fast mode might not be as accurate as the default mode, especially for scanned non-native PDFs, which may require Optical Character Recognition (OCR) for more accurate extraction.
The project is designed to scale easily to a multi-GPU / multi-node environment. To use it, set `distributed` to `true` in the config file and follow the steps described in the distributed section above.
Many parameters are hardware-dependent and can be customized to suit your needs. For example, you can adjust the processor batch size, dispatcher batch size, and the number of threads per worker to optimize performance.
You can configure parameters by providing a custom config file. An example config file is available in the examples folder.
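As an illustration, a config might look like the following. Apart from `use_fast_processors` and `distributed`, which are mentioned above, all key names here are hypothetical; check `examples/process_config.yaml` for the actual schema:

```yaml
# Illustrative config sketch -- key names below (other than
# use_fast_processors and distributed) are assumptions, not the real schema.
input_path: /path/to/my/input/folder
output_path: /path/to/output
use_fast_processors: false
distributed: false
# Hardware-dependent tuning knobs (names illustrative)
processor_batch_size: 8
dispatcher_batch_size: 64
threads_per_worker: 4
```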
🚨 Not all parameters are configurable yet 😉
Our pipeline is a 3-step process:
- Crawling: We first crawl over the file/folder to list all the files we need to process.
- Dispatching: We then dispatch the files to the workers, using a dispatcher that will send the files to the workers in batches. This part is in charge of the load balancing between different nodes if the project is running in a distributed environment.
- Processing: The workers then process the files, using the appropriate tools for each file type. They extract the text, images, audio, and video frames, and send them to the next step. Our goal is to provide an easy way to add new processors for new file types, or even other types of processing for existing file types.
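In miniature, the three steps look roughly like this. All names here are illustrative, not the project's actual API:

```python
from itertools import islice
from pathlib import Path


def crawl(root):
    """Step 1: list every file under the input folder."""
    return sorted(p for p in Path(root).rglob("*") if p.is_file())


def dispatch(files, batch_size=4):
    """Step 2: yield batches of files. In a distributed run, each batch
    would be load-balanced across worker nodes."""
    it = iter(files)
    while batch := list(islice(it, batch_size)):
        yield batch


def process(batch):
    """Step 3: apply the right extractor per file type. Here we only
    'extract' the file suffix as a stand-in for real processing."""
    return [{"file": str(p), "type": p.suffix.lstrip(".")} for p in batch]
```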
The project supports multiple file types and utilizes various AI-based tools for processing. Below is a table summarizing the supported file types and corresponding tools (N/A means no fast-mode alternative is available):
File Type | Default Mode Tool(s) | Fast Mode Tool(s)
---|---|---
DOCX | python-docx to extract the text and images | N/A
MD | markdown for text extraction, markdownify for HTML conversion | N/A
PPTX | python-pptx to extract the text and images | N/A
XLSX | openpyxl to extract the text and images | N/A
TXT | Python built-in library | N/A
EML | Python built-in library | N/A
MP4, MOV, AVI, MKV, MP3, WAV, AAC | moviepy for video frame extraction; whisper-large-v3-turbo for transcription | whisper-tiny
PDF | marker-pdf for OCR and structured data extraction | PyMuPDF for text and image extraction
Webpages (TBD) | selenium to navigate the webpage; requests for images; surya for OCR and extraction | selenium to navigate the webpage and extract content; requests for images; trafilatura for content extraction
We also use Dask distributed to manage the distributed environment.
The system is designed to be extensible, allowing you to register custom processors for handling new file types or specialized processing. To implement a new processor, inherit from the `Processor` class and implement at least 3 methods:
- `accepts`: defines the file types your processor supports
- `process_implementation`: processes a single file
- `require_gpu`: reports whether the processor requires a GPU
See `TextProcessor` in src/process/processors/text_processor.py for a minimal example.
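A custom processor might look like the sketch below. The `Processor` base class shown here is a stand-in containing only the three required methods; check the actual class in the repository before subclassing:

```python
class Processor:
    """Stand-in for the project's Processor base class (illustrative only)."""

    def accepts(self):
        raise NotImplementedError

    def process_implementation(self, file_path):
        raise NotImplementedError

    def require_gpu(self):
        raise NotImplementedError


class CsvProcessor(Processor):
    """Hypothetical processor for .csv files."""

    def accepts(self):
        # File extensions this processor handles
        return [".csv"]

    def process_implementation(self, file_path):
        # Read the file and return its extracted text
        with open(file_path, encoding="utf-8") as f:
            return {"file": file_path, "text": f.read()}

    def require_gpu(self):
        # Plain text reading needs no GPU
        return False
```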
The power of this architecture is that it easily accommodates any new file type or processing method! Feel free to add your own processors and share them with the community.