Convert pdf timetable to csv
It works in 3 steps
Step 1 - Conversion of PDF to PNG
Step 2 - Conversion of PNG to TXT
Step 3 - Formatting of TXT to CSV
-
You will need python 3 installed, latest version will do.
-
Download poppler for your operating system.
-
Download Tesseract-OCR for your system and install it.
-
In cmd/terminal,
pip install pytesseract
-
Install pdf2image and pillow by
pip install pdf2image
pip install pillow
-
Clone or download the files in this repo and extract them.
-
Done!
-
Open the
config.ini
file in your favourite text editor and enter in it, the path of 'poppler' bin directory and 'Tesseract' and save the file. -
In cmd/terminal/Power shell enter
python pdf2csv.py -i filetoconvert.pdf
to run for whole file
python pdf2csv.py -i filetoconvert.pdf -fp 1 -lp 5
to run for 5 pages of the file
-
DONE!
Tip - If you want to convert a file in different folder, in cmd, put full address, eg.
python pdf2.csv.py -i "C:\Users\USERNAME\someRandomFolder\samplefile.pdf"
It is advised to put the address in double inverted commas because it avoids error due to folder with spaces in their name.
PS - type python pdf2csv.py -h
to see all available arguements in command line.
- PIL or pillow
- csv
- pytesseract
- os
- pdf2image
- argeparse
- configparser
- pathlib
See original code: https://github.com/SahilChoudhary22/pdf2csv-converter