This repository contains:
- A Jupyter notebook (leg_extract.ipynb) for extracting sections and sub-sections from XML-tagged legislation, as available from http://www.legislation.gov.uk/
- A Python script (LIscrape.py) implementing a Scrapy spider for extracting contractual clauses from material contracts filed with the SEC, as available from https://www.lawinsider.com/
- A sample dataset of extracted legislation (Leg_data160718.csv)
- A sample dataset of extracted contract clauses (LIdata160718.csv) and a list of the scraped URLs, before the addition of suffixes (TopDomains.txt)

Please see https://richardbatstone.github.io/ for a discussion and further background.
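
As a rough illustration of the kind of section and sub-section extraction described above, here is a minimal sketch using Python's standard library. It is not the code from leg_extract.ipynb: the element names (`act`, `section`, `subsection`, `title`), the `number` attribute, and the sample text are all hypothetical, and the real legislation.gov.uk XML uses namespaces and a different schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified markup for illustration only; the real
# legislation.gov.uk schema is namespaced and uses other element names.
SAMPLE_XML = """
<act>
  <section number="1">
    <title>Short title</title>
    <subsection number="1">This Act may be cited as the Example Act.</subsection>
    <subsection number="2">It extends to England and Wales.</subsection>
  </section>
  <section number="2">
    <title>Commencement</title>
    <subsection number="1">This Act comes into force on Royal Assent.</subsection>
  </section>
</act>
"""


def extract_subsections(xml_text):
    """Return one tuple per sub-section:
    (section number, section title, sub-section number, text)."""
    root = ET.fromstring(xml_text)
    rows = []
    for section in root.iter("section"):
        title = section.findtext("title", default="").strip()
        for sub in section.iter("subsection"):
            text = (sub.text or "").strip()
            rows.append((section.get("number"), title, sub.get("number"), text))
    return rows


rows = extract_subsections(SAMPLE_XML)
for row in rows:
    print(row)
```

Flattening to one row per sub-section makes the result straightforward to write out as a CSV file, though the layout of the sample datasets in this repository may differ.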