Minimal example of crawling any kind of web page, including a dynamic single-page app
-
You'll need to have Python 3 installed on your system. Be sure to check "Add Python environment variables" if asked during installation.
-
Now either download this project or clone with git:
git clone https://github.com/gittyeric/one-page-app-web-crawler.git
-
Download the Selenium web driver for Chrome (ChromeDriver), choosing the release that matches your installed Chrome version. You can use Firefox too, but you'll have to change the code a bit. The driver lets you control your browser with Python code.
-
Drop the downloaded web driver file into the root of this project.
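
To make the role of the driver file concrete, here is a minimal sketch of how a script could locate it in the project root and hand it to Selenium. It is not the code in crawler.py: the helper names `find_driver` and `open_browser` are hypothetical, the file name `chromedriver` (or `chromedriver.exe` on Windows) is an assumption about what you downloaded, and the `Service` call assumes Selenium 4 or later.

```python
import sys
from pathlib import Path


def find_driver(project_root="."):
    """Return the path to the ChromeDriver binary in project_root, or None."""
    # Assumption: the downloaded file keeps its default name.
    name = "chromedriver.exe" if sys.platform.startswith("win") else "chromedriver"
    path = Path(project_root) / name
    return path if path.is_file() else None


def open_browser(project_root="."):
    """Start a Chrome session controlled by the driver found in project_root."""
    # Imported here so find_driver() still works before selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver_path = find_driver(project_root)
    if driver_path is None:
        raise FileNotFoundError("ChromeDriver not found in " + str(project_root))
    # Assumption: Selenium 4+, where the driver path is passed via Service.
    return webdriver.Chrome(service=Service(str(driver_path)))
```

Calling `open_browser()` from the project root should open a Chrome window that Python controls; call `quit()` on the returned driver when you are done. If `find_driver()` returns None, the driver file is probably in the wrong folder or has a different name.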
-
From the command line, install the selenium library for Python:
pip install selenium
Now run the Python script from the command line:
cd path/to/project
python clark_crawl.py
This is a simple example written for the Clark County PD. It runs through an inmate database, pulls out all the records page by page, and prints them at the end. You can change the print statement in clark_crawl.py to save the result however you'd like, and you can follow the code in crawler.py to see how it's done.
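
As one way to replace the print statement, the sketch below writes the results to a CSV file using only the standard library. It assumes the crawl result is a list of dictionaries, one per record. That shape and the `save_records` name are assumptions for illustration, so adapt them to whatever clark_crawl.py actually produces.

```python
import csv


def save_records(records, path):
    """Write a list of dict records to a CSV file and return the row count."""
    # Collect every key seen, in first-seen order, so records with
    # different fields still fit under one header row.
    fieldnames = list(dict.fromkeys(key for row in records for key in row))
    with open(path, "w", newline="", encoding="utf-8") as handle:
        if fieldnames:
            writer = csv.DictWriter(handle, fieldnames=fieldnames)
            writer.writeheader()
            # Fields missing from a record are written as empty cells.
            writer.writerows(records)
    return len(records)
```

With this in place, the final print call could become something like `save_records(results, "records.csv")`, where `results` stands for whatever variable holds the crawled records.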