not-quite-daily-scraper

A simple scraper for simple webcomics.

Download comic images and author comments with ease, utilizing a simple config file.

Output either just images or HTML pages with built-in navigation.

Several example configurations are included.

Currently under development. Developed for Windows.

Setup

Download repo

Install the following dependencies:

  • selenium (pip3 install selenium)

Download the following dependencies and install them to PATH (Windows):

Usage

Create and fill out a config file for each comic you want to scrape, using config.txt as a template. (If you use the HTML option, it is advisable to start at the current page and iterate backwards. This allows the script to link each page to the next in the correct order.)

Settings/Configurables include:

  • Output Path (leave blank to use project directory)
  • Create Subfolder? (if enabled, creates subdir based on comic name)
  • Download Comments?
  • Download Image?
  • Image Naming (choose to base on the comic title, the original image name, or both)
  • Run Headless? (run without showing the browser window; disable this for troubleshooting)
  • Comic Name
  • comicStartPage (XPATH, relative or absolute)
  • imageTitlePath (XPATH, relative or absolute)
  • nextButtonPath (XPATH, relative or absolute)
  • nextButtonType (accepts either link or javaClick, representing whether the button has an href or must be clicked)
  • imagePath (XPATH, relative or absolute)
  • commentPath (XPATH, relative or absolute)
  • initialClick (XPATH, relative or absolute; runs once at start, and can be used to start at the homepage and then click the latest page link)

A good guide to finding the XPath of a page element: https://stackoverflow.com/a/42194160

Then run:

python scraper.py My-Config.txt
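
The first things the script does with that argument (check that the config file exists, then gather its values) might look roughly like the sketch below. The `key = value` line format, the `#` comment marker, and the function name `load_config` are assumptions made for illustration; the real layout is whatever config.txt in the repo uses.

```python
from pathlib import Path


def load_config(path):
    """Read a hypothetical 'key = value' config file into a dict."""
    config_file = Path(path)
    if not config_file.is_file():
        raise FileNotFoundError(f"Config file not found: {config_file}")

    settings = {}
    for raw_line in config_file.read_text(encoding="utf-8").splitlines():
        line = raw_line.strip()
        # Skip blank lines and comments (the comment syntax is an assumption).
        if not line or line.startswith("#"):
            continue
        # Split on the first '=' only, so XPath values such as
        # //a[@rel='prev'] keep their own '=' characters.
        key, sep, value = line.partition("=")
        if not sep:
            continue  # Ignore lines without a key/value separator.
        settings[key.strip()] = value.strip()
    return settings
```

Calling `load_config("My-Config.txt")` would then return a dictionary of setting names to strings, or raise `FileNotFoundError` if the file is missing.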

Notes

  • This tool is intended to be flexible, such that it can be used in a variety of ways.
  • My primary usage is as follows:
    • Set comicStartPage to the homepage of the comic
    • Use initialClick to navigate to the newest page, via the link on the homepage
    • Set nextButtonPath to the 'previous page' button so the script iterates backwards over the comic
    • Choose comments and images+pages to generate nice HTML pages with author comments and navigation
  • Alternatively, you could set comicStartPage to page one and nextButtonPath to the next button, then iterate forward and save only images.
  • If you are having trouble with page element xpaths, try both relative and absolute.
  • If the comic pages have a date block, it can be used in conjunction with the image filename for nice naming.
  • Some sites (such as Ava's Demon) have JavaScript buttons; that is what nextButtonType is for.
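
The difference between the two nextButtonType values can be sketched as follows. This is not the script's actual code: the function name `advance` and the handling of a missing button are assumptions. The calls `find_elements`, `get_attribute`, `get`, and `click` are standard Selenium WebDriver methods, and `"xpath"` is the string value of Selenium's `By.XPATH` constant. The sketch does not import Selenium itself, so it works with any driver-like object exposing those methods.

```python
def advance(driver, next_button_path, next_button_type):
    """Move the browser to the next page; return False if there is none."""
    # "xpath" is the locator strategy string behind selenium's By.XPATH.
    buttons = driver.find_elements("xpath", next_button_path)
    if not buttons:
        return False  # No next button found: assume the comic has ended.

    button = buttons[0]
    if next_button_type == "link":
        # The button carries an href, so load that URL directly.
        href = button.get_attribute("href")
        if not href:
            return False
        driver.get(href)
    elif next_button_type == "javaClick":
        # The button is driven by JavaScript and has to be clicked.
        button.click()
    else:
        raise ValueError(f"Unknown nextButtonType: {next_button_type!r}")
    return True
```

Following the href avoids depending on the element being visible or clickable, which is one plausible reason to prefer `link` whenever the button has one.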

Pseudocode

  • Check if config file exists
  • Gather variables from config file
  • Build output directory path, and create it if it does not exist
  • Configure and open (headless) Firefox instance with Selenium
  • Check if starting page is valid
  • If user requested it, click the provided initialClick element
  • Loop through pages, setting each new page based on the 'next' link
    • Gather relevant page elements
    • Extract data from page elements
    • Compose output paths and names
    • Sanitise paths and names
    • Save image (if requested)
    • Save comment (if requested)
    • Build html page with navigation links (if requested)
    • Record the current/previous page
    • Click the next button to proceed to new page
    • Check against list of visited pages to confirm we aren't stuck in a loop
  • ....Profit!
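
Two of the steps above, sanitising names and checking visited pages to avoid getting stuck in a loop, can be illustrated with a small standalone sketch. The names `sanitise_name` and `walk_pages`, the `max_pages` cap, and the `get_next_url` callback (a stand-in for the Selenium navigation step) are all hypothetical and not taken from the script.

```python
import re

# Characters that Windows does not allow in file names.
_INVALID_CHARS = re.compile(r'[<>:"/\\|?*\x00-\x1f]')


def sanitise_name(name, replacement="_"):
    """Replace characters that are invalid in Windows file names."""
    # Windows also rejects names ending in a space or a dot, so trim those.
    cleaned = _INVALID_CHARS.sub(replacement, name).strip(" .")
    return cleaned or "untitled"


def walk_pages(start_url, get_next_url, max_pages=10000):
    """Yield page URLs until there is no next page or a page repeats."""
    visited = set()
    url = start_url
    while url and url not in visited and len(visited) < max_pages:
        visited.add(url)
        yield url
        # get_next_url returns the next page's URL, or None at the end.
        url = get_next_url(url)


# Example: the last page links back to the first, so the walk stops there.
links = {"page3": "page2", "page2": "page1", "page1": "page3"}
print(list(walk_pages("page3", links.get)))  # ['page3', 'page2', 'page1']
print(sanitise_name('Page 1: "Hello?"'))     # Page 1_ _Hello__
```

Keeping the visited pages in a set makes the loop check cheap even for long-running comics.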