InternetArchiveCrawler
This is an Internet Archive crawling script written by Bardia Doosti. If you use this code, please cite our paper:

@inproceedings{Doosti:2017:DSH:3091478.3091503,
 author = {Doosti, Bardia and Crandall, David J. and Su, Norman Makoto},
 title = {A Deep Study into the History of Web Design},
 booktitle = {Proceedings of the 2017 ACM on Web Science Conference},
 series = {WebSci '17},
 year = {2017},
 isbn = {978-1-4503-4896-6},
 location = {Troy, New York, USA},
 pages = {329--338},
 numpages = {10},
 url = {http://doi.acm.org/10.1145/3091478.3091503},
 doi = {10.1145/3091478.3091503},
 acmid = {3091503},
 publisher = {ACM},
 address = {New York, NY, USA},
 keywords = {convolutional neural networks, cultural analytics, deep learning, web design},
}

This crawler is based on PhantomJS, so remember to download it from http://phantomjs.org/ and place the pre-compiled phantomjs binary next to render.py and snap-archive-api.js.

render.py lists the websites you want to crawl and passes them to PhantomJS; snap-archive-api.js then renders each page and saves it as a PNG file. A sketch of how such a driver can work is shown below.
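
For reference, here is a minimal Python sketch of a render.py-style driver. The site list, the Wayback Machine URL pattern, and the argument order expected by snap-archive-api.js (snapshot URL, then output path) are illustrative assumptions; check the actual scripts for the exact interface.

 import subprocess

 # Hypothetical list of sites to crawl; render.py keeps its own list.
 SITES = ["cnn.com", "yahoo.com", "bbc.co.uk"]

 def snap(site, timestamp, out_png):
     # Wayback Machine snapshot URLs embed a YYYYMMDDhhmmss timestamp;
     # the archive redirects to the closest available capture.
     url = "http://web.archive.org/web/%s/http://%s/" % (timestamp, site)
     # Assumed CLI: phantomjs <script.js> <url> <output.png>
     subprocess.call(["./phantomjs", "snap-archive-api.js", url, out_png])

 for site in SITES:
     snap(site, "20000101000000", "%s_20000101.png" % site)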

Note that render.py crawls snapshots from 1996 to 2013 at a rate of two snapshots per month, the settings we used in our research. You can change the range and rate to whatever suits your needs.
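
As a sketch, a two-snapshots-per-month schedule over 1996 to 2013 can be generated as follows; the specific days (the 1st and the 15th) are an assumption for illustration, and the original script may pick different dates.

 def timestamps(start_year=1996, end_year=2013):
     # Two snapshots per month, assumed here to fall on the 1st and 15th.
     for year in range(start_year, end_year + 1):
         for month in range(1, 13):
             for day in (1, 15):
                 yield "%04d%02d%02d000000" % (year, month, day)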
