PDF Parallel Processing with Python API (code sharing) #1375

relic-yuexi · 2024-12-28T11:21:12Z

PDF Parallel Processing with Python API (tool sharing)

English Description

Overview

I recently wrote a Python script, built on the MinerU library, that processes PDF files in parallel and converts them to Markdown. It is aimed at workloads with a large number of PDF files, and it can spread the work across multiple GPUs to improve throughput.

Key Features

- Multi-GPU support: runs worker processes on multiple GPUs so that PDF files are processed in parallel.
- Logging: logs to both the console and a file for debugging and monitoring.
- Directory structure preservation: the generated Markdown files mirror the directory structure of the input PDF files.
- Force processing: can re-process every PDF file, even if its target file already exists.
- Dynamic file distribution: automatically distributes PDF files across the worker processes to balance the load.
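The directory-preservation and force-processing behaviour listed above could be sketched roughly as follows. This is only an illustration, not the code from the Gist: the function name `plan_outputs` and the one-`.md`-file-per-PDF layout are assumptions, and the real script may organise its output differently.

```python
from pathlib import Path


def plan_outputs(input_dir, output_dir, force=False):
    """Return (pdf_path, markdown_path) pairs that still need processing.

    Hypothetical helper: assumes each PDF maps to a single .md file.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    jobs = []
    for pdf in sorted(input_dir.rglob("*.pdf")):
        # Mirror the PDF's position relative to the input root.
        target = (output_dir / pdf.relative_to(input_dir)).with_suffix(".md")
        # Skip files that were already converted unless a re-run is forced.
        if target.exists() and not force:
            continue
        jobs.append((pdf, target))
    return jobs
```

With `force=False` a second run only picks up PDFs whose Markdown target is missing; with `force=True` every PDF under the input directory is returned again.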

How to Use

Install dependencies:
- Follow the installation instructions for the MinerU library.
- Make sure the GPU build of PyTorch is installed.
Download the script:
- The script is available as a GitHub Gist and can be downloaded directly.

Run the script:

python pdf_processor.py --input-dir /path/to/pdf --output-dir /path/to/output --num-gpus 2 --num-workers 4

Example

python pdf_processor.py --input-dir ./pdfs --output-dir ./output --num-gpus 2 --num-workers 4 --log-level DEBUG --log-file ./log.txt --force
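For readers who want to see how a command line like the ones above might be wired up, here is a minimal sketch using `argparse` and the standard `logging` module. The flag names are taken from the commands shown; the defaults, the `choices` list, the log format, and the helper names `build_parser` and `setup_logging` are assumptions, not the Gist's actual code.

```python
import argparse
import logging


def build_parser():
    """Parser for the flags used in the commands above (defaults are assumed)."""
    parser = argparse.ArgumentParser(description="Convert PDFs to Markdown in parallel")
    parser.add_argument("--input-dir", required=True, help="root directory containing PDFs")
    parser.add_argument("--output-dir", required=True, help="root directory for Markdown output")
    parser.add_argument("--num-gpus", type=int, default=1)
    parser.add_argument("--num-workers", type=int, default=1)
    parser.add_argument("--log-level", default="INFO",
                        choices=["DEBUG", "INFO", "WARNING", "ERROR"])
    parser.add_argument("--log-file", default=None, help="optional path of a log file")
    parser.add_argument("--force", action="store_true",
                        help="re-process PDFs even if the target already exists")
    return parser


def setup_logging(level, log_file=None):
    """Log to the console and, if a path is given, to a file as well."""
    handlers = [logging.StreamHandler()]
    if log_file:
        handlers.append(logging.FileHandler(log_file))
    logging.basicConfig(
        level=getattr(logging, level),
        format="%(asctime)s %(processName)s %(levelname)s %(message)s",
        handlers=handlers,
        force=True,  # replace any handlers configured earlier
    )
```

Including `%(processName)s` in the format is a small design choice that makes it easier to tell which worker process produced a given log line.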

Future Plans

Multi-CPU Support: I plan to extend the script so that it can also process PDF files in parallel in multi-CPU environments.

If you have any questions or suggestions, feel free to leave a comment. I hope this tool is useful to you!

xcvil · 2024-12-28T14:54:57Z

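This is great. I took a look and it is implemented directly with Python mp (multiprocessing). Could you compare its pros and cons against litserver? Also, are there any errors or warnings in your logs that you could post for comparison?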

2 replies

relic-yuexi Dec 29, 2024
Author

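I haven't used that one; it doesn't look customizable enough. My script lets you set both the number of processes and the number of GPUs. On A100s I can use 3 GPUs with 3 processes per GPU to work through the target directories in batches, and with a few code changes it can run on CPU as well.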

xcvil Dec 29, 2024

That one can do this too. You can set how many processes run on each GPU.
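The layout discussed in the replies above (3 GPUs with 3 processes each) could be sketched with plain `multiprocessing` as shown here. This is an assumption-laden illustration rather than the script from the Gist: the names `gpu_for_worker`, `worker`, and `run`, the `procs_per_gpu` parameter, and the round-robin mapping are all hypothetical, and the actual PDF conversion call is left as a placeholder.

```python
import multiprocessing as mp
import os


def gpu_for_worker(worker_index, num_gpus):
    """Round-robin mapping: worker indices cycle through the GPU ids."""
    return worker_index % num_gpus


def worker(worker_index, num_gpus, jobs, results):
    # Restrict this process to one GPU. This has to happen before any CUDA
    # library (e.g. torch) is imported or initialized in the process.
    gpu_id = gpu_for_worker(worker_index, num_gpus)
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_id)
    while True:
        pdf_path = jobs.get()
        if pdf_path is None:  # sentinel: no more work for this worker
            break
        # Hypothetical placeholder: the real PDF-to-Markdown call goes here.
        results.put((pdf_path, gpu_id))


def run(pdf_paths, num_gpus=3, procs_per_gpu=3):
    """Process pdf_paths with num_gpus * procs_per_gpu worker processes."""
    num_workers = num_gpus * procs_per_gpu
    ctx = mp.get_context("spawn")  # avoid inheriting CUDA state through fork
    jobs, results = ctx.Queue(), ctx.Queue()
    for path in pdf_paths:
        jobs.put(path)
    for _ in range(num_workers):
        jobs.put(None)  # one sentinel per worker
    procs = [
        ctx.Process(target=worker, args=(i, num_gpus, jobs, results))
        for i in range(num_workers)
    ]
    for proc in procs:
        proc.start()
    done = [results.get() for _ in pdf_paths]  # drain results before joining
    for proc in procs:
        proc.join()
    return done
```

Because the sketch uses the `spawn` start method, `run(...)` should be called from under an `if __name__ == "__main__":` guard. Each worker pulls the next path from the shared queue as soon as it is free, so faster workers simply take more files; a production version would also need error handling for workers that crash.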
