GitHub - venkatmanish/OCR-PDF-to-DOCX: A Python project that allows you to extract paragraphs from PDF files using Optical Character Recognition (OCR). Leveraging popular libraries such as pytesseract, pdf2image, and python-docx, this project facilitates the conversion of PDFs into images, performing OCR on the images, and generating a DOCX file containing the extracted paragraphs.

#OCR-PDF-to-DOCX OCR-PDF-to-DOCX is a Python project that allows you to extract paragraphs from PDF files using Optical Character Recognition (OCR) technology and generate a DOCX file while preserving the original formatting. It leverages popular libraries such as pytesseract, pdf2image, and python-docx to perform the conversion seamlessly.

Features Extracts paragraphs from a PDF file using OCR. Supports different types of lists (numbered, alphabets, bullets). Preserves the formatting and structure of the original PDF content. Process annotation JSON files to extract relevant paragraph information. Generates a DOCX file with the extracted paragraphs. How to Use Install the required dependencies: json, pytesseract, pdf2image, python-docx. Clone this repository or download the project files. Ensure that you have a PDF file and its corresponding annotation JSON file. Replace the paths in the code with the paths to your PDF file, annotation JSON file, and desired output file. Run the script and wait for the execution to complete. The result will be saved in the specified output file as a JSON format, containing the extracted paragraphs and their associated information. You can open the generated DOCX file to view the extracted paragraphs with preserved formatting. Feel free to customize and modify the code to suit your specific needs. You can also explore additional functionalities and enhancements to extend the project further.

Name		Name	Last commit message	Last commit date
Latest commit History 5 Commits
OCR-PDF-to-DOCX.ipynb		OCR-PDF-to-DOCX.ipynb
README.md		README.md
annotation.json		annotation.json
output.docx		output.docx
output.json		output.json
single page.pdf		single page.pdf

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

About

Releases

Packages

Languages

venkatmanish/OCR-PDF-to-DOCX

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages