CKAN general scraper #246

Merged Nov 13, 2024 (25 commits)

Commits
* 1d22426 Cleanup and test new methodology (EvilDrPurple, Oct 18, 2024)
* 816e4ea Basic prototype for package retrieval (EvilDrPurple, Oct 18, 2024)
* b5f0b35 Implement CKAN package search method (EvilDrPurple, Oct 19, 2024)
* 3280a2d Add template (EvilDrPurple, Oct 19, 2024)
* 1e59794 Update README (EvilDrPurple, Oct 19, 2024)
* fd2c36f Add return section (EvilDrPurple, Oct 19, 2024)
* 00a3b76 Fix return value (EvilDrPurple, Oct 20, 2024)
* 87797df Add requirements.txt (EvilDrPurple, Oct 24, 2024)
* b2af6f2 README updates (EvilDrPurple, Oct 24, 2024)
* 9200909 Add ckan_group_package_show() (EvilDrPurple, Oct 25, 2024)
* aba11e6 Add infrastructure for scraping multiple data portals (EvilDrPurple, Oct 26, 2024)
* 901c491 Add ckan_collection_search() (EvilDrPurple, Oct 26, 2024)
* 4013b80 Updates to README (EvilDrPurple, Oct 26, 2024)
* f61f604 Add pagination to ckan_collection_search() (EvilDrPurple, Oct 26, 2024)
* 239ebbc Fix error in collection search (EvilDrPurple, Oct 26, 2024)
* 62295f1 Format with Black (EvilDrPurple, Oct 26, 2024)
* 9893aeb Add threading to collection search (EvilDrPurple, Oct 27, 2024)
* d77781a Add ckan_package_search_from_organization() (EvilDrPurple, Nov 4, 2024)
* 4360a4a Further search updates (EvilDrPurple, Nov 4, 2024)
* f6c2af1 Parse returned data (EvilDrPurple, Nov 11, 2024)
* 93538c5 Output data to CSV, other edge case handling (EvilDrPurple, Nov 11, 2024)
* 7d1aba9 Delete results.csv (EvilDrPurple, Nov 11, 2024)
* 2adda16 Update file structure (EvilDrPurple, Nov 11, 2024)
* 842b915 Update README (EvilDrPurple, Nov 11, 2024)
* 2fd5a78 activate venv instructions (josh-chamberlain, Nov 13, 2024)
5 changes: 0 additions & 5 deletions scrapers_library/data_portals/__init__.py

This file was deleted.

94 changes: 94 additions & 0 deletions scrapers_library/data_portals/ckan/README.md
@@ -0,0 +1,94 @@
# CKAN Scraper

## Introduction

This scraper retrieves package information from data portals built on CKAN, an open-source data catalog platform used by sites such as <https://data.gov/>.

The scraper's functions can be found in `ckan_scraper.py`.

A template can be found in the `template` folder.

## Definitions

* `Package` - Also called a dataset; a page containing relevant information about a dataset. For example, this page is a package: <https://catalog.data.gov/dataset/electric-vehicle-population-data>.
* `Collection` - A grouping of child packages related to a parent package. This is separate from a group.
* `Group` - Also called a topic; a grouping of packages. Packages in a group do not have a parent package. Groups can also contain subgroups.
* `Organization` - The entity a package's data belongs to, such as "City of Austin" or "Department of Energy". Organization types group together organizations that share something in common.

## Setup

1. In a terminal, navigate to the CKAN scraper folder
```cmd
cd scrapers_library/data_portals/ckan/
```
2. Create a Python virtual environment
```cmd
python -m venv venv
```
3. Activate the virtual environment (on macOS/Linux, use `source venv/bin/activate` instead)
```cmd
venv\Scripts\activate
```
4. Install the requirements
```cmd
pip install -r requirements.txt
```
5. Copy the template script to another desired directory and edit it as needed. Then, run the scraper
```cmd
python [script name]
```

## How can I tell if a website I want to scrape is hosted using CKAN?

There's no easy way to tell: some websites reference CKAN or link back to the CKAN documentation, while others do not, and there doesn't seem to be a database of all CKAN instances either.

The best way to determine if a data catalog is using CKAN is to attempt to query its API. To do this:

1. In a web browser, navigate to the website's data catalog (e.g. for data.gov this is at <https://catalog.data.gov/dataset/>)
2. Copy the first part of the link (e.g. <https://catalog.data.gov/>)
3. Paste it in the browser's URL bar and add `api/3/action/package_search` to the end (e.g. <https://catalog.data.gov/api/3/action/package_search>)

*NOTE: Some hosts use a different base URL for API requests. For example, Canada's Open Government Portal can be found at <https://search.open.canada.ca/opendata/> while the API access link is <https://open.canada.ca/data/en/api/3/action/package_search> as described in their [Access our API](https://open.canada.ca/en/access-our-application-programming-interface-api) page*
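The check above can be scripted. A minimal sketch, assuming the portal uses the standard `/api/3/action/` path; the `is_ckan_portal` helper is illustrative and not part of this PR:

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def is_ckan_portal(base_url: str, timeout: float = 10.0) -> bool:
    """Heuristic check: a CKAN portal answers package_search with {"success": true}."""
    # rows=0 asks for the result count only, keeping the probe cheap
    api_url = urljoin(base_url, "api/3/action/package_search?rows=0")
    try:
        with urlopen(api_url, timeout=timeout) as response:
            return json.load(response).get("success") is True
    except Exception:
        # Non-CKAN sites typically return HTML, a 404, or refuse the request
        return False
```

For example, `is_ckan_portal("https://catalog.data.gov/")` would probe `https://catalog.data.gov/api/3/action/package_search?rows=0`.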

## Documentation

`ckan_package_search(base_url: str, query: Optional[str], rows: Optional[int], start: Optional[int], **kwargs) -> list[dict[str, Any]]`

Searches for packages (datasets) in a CKAN data portal that satisfy a given search criterion.

### Parameters

* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
* **query (optional)** - The keyword string to search for. e.g. "police". Leaving empty will return all packages in the package list.
* **rows (optional)** - The maximum number of results to return. Leaving empty will return all results.
* **start (optional)** - Which result number to start at. Leaving empty will start at the first result.
* **kwargs (optional)** - Additional keyword arguments. For more information on acceptable keyword arguments and their function see <https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search>

### Return

The function returns a list of dictionaries containing matching package results.
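A sketch of post-processing the returned list; the sample data below is illustrative and only shaped like real CKAN output, which carries many more keys:

```python
# Illustrative sample shaped like CKAN package_search results
results = [
    {"title": "Crime Incidents", "organization": {"title": "City of Austin"}},
    {"title": "Police Budget", "organization": None},
]

# Keep only packages that belong to an organization and summarize them
summaries = [
    f'{pkg["title"]} ({pkg["organization"]["title"]})'
    for pkg in results
    if pkg.get("organization")
]
print(summaries)  # ['Crime Incidents (City of Austin)']
```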

---

`ckan_group_package_show(base_url: str, id: str, limit: Optional[int]) -> list[dict[str, Any]]`

Returns a list of CKAN packages that belong to a particular group.

### Parameters

* **base_url** - The base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
* **id** - The group's ID. This can be retrieved by searching for a package and finding the "id" key in the "groups" key.
* **limit** - The maximum number of results to return, leaving empty will return all results.

### Return

The function returns a list of dictionaries representing the packages associated with the group.
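As noted above, group IDs can be pulled from a package's "groups" key. A sketch using an illustrative package dictionary (the IDs are placeholders, not real CKAN identifiers):

```python
# Illustrative package dictionary; real packages carry many more keys
package = {
    "title": "Electric Vehicle Population Data",
    "groups": [
        {"id": "group-id-1", "name": "energy"},
        {"id": "group-id-2", "name": "transportation"},
    ],
}

# Collect every group ID the package belongs to
group_ids = [group["id"] for group in package.get("groups", [])]
print(group_ids)  # ['group-id-1', 'group-id-2']
```

Each ID could then be passed to `ckan_group_package_show()` to fetch that group's packages.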

---

`ckan_collection_search(base_url: str, collection_id: str) -> list[Package]`

Returns a list of CKAN package information for the packages that belong to a collection. When queried through the API, CKAN data portals are supposed to return package relationships along with the rest of the data; in practice, not all portals are set up this way. Because child packages cannot be queried directly, they do not show up in any search results. To get around this, the function manually scrapes the information of all child packages related to the given parent.

*NOTE: This function has only been tested on <https://catalog.data.gov/>. It is likely it will not work properly on other platforms.*

### Parameters

* **base_url** - The base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
* **collection_id** - The ID of the parent package. This can be found by querying the parent package and using the "id" key, or by navigating to the list of child packages and looking in the URL. e.g. In <https://catalog.data.gov/dataset/?collection_package_id=7b1d1941-b255-4596-89a6-99e1a33cc2d8> the collection_id is "7b1d1941-b255-4596-89a6-99e1a33cc2d8"

### Return

List of Package objects representing the child packages associated with the collection.
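catalog.data.gov lists 20 datasets per page, and `ckan_collection_search()` derives its page count from the result total. That calculation can be sketched in isolation (the `page_count` helper is illustrative, not part of the PR):

```python
import math

RESULTS_PER_PAGE = 20  # catalog.data.gov shows 20 datasets per listing page


def page_count(num_results: int) -> int:
    """Number of listing pages needed to cover all results."""
    return math.ceil(num_results / RESULTS_PER_PAGE)


print(page_count(47))  # 3 pages: 20 + 20 + 7
print(page_count(40))  # 2 pages exactly
```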
141 changes: 141 additions & 0 deletions scrapers_library/data_portals/ckan/ckan_scraper.py
@@ -0,0 +1,141 @@
from concurrent.futures import as_completed, ThreadPoolExecutor
from dataclasses import dataclass
import math
import sys
import time
from typing import Any, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from ckanapi import RemoteCKAN
import requests


@dataclass
class Package:
    url: str = ""
    title: str = ""
    agency_name: str = ""
    description: str = ""


def ckan_package_search(
    base_url: str,
    query: Optional[str] = None,
    rows: Optional[int] = sys.maxsize,
    start: Optional[int] = 0,
    **kwargs,
) -> list[dict[str, Any]]:
    """Performs a CKAN package (dataset) search from a CKAN data catalog URL.

    :param base_url: Base URL to search from. e.g. "https://catalog.data.gov/"
    :param query: Search string, defaults to None. None will return all packages.
    :param rows: Maximum number of results to return, defaults to maximum integer.
    :param start: Offsets the results, defaults to 0.
    :param kwargs: See https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search for additional arguments.
    :return: List of dictionaries representing the CKAN package search results.
    """
    remote = RemoteCKAN(base_url, get_only=True)
    results = []
    offset = start
    # CKAN's package search returns at most 1000 packages per request by default
    rows_max = 1000

    while start < rows:
        num_rows = rows - start + offset
        packages = remote.action.package_search(
            q=query, rows=num_rows, start=start, **kwargs
        )
        results += packages["results"]

        total_results = packages["count"]
        if rows > total_results:
            rows = total_results

        result_len = len(packages["results"])
        # Check if the website has a different rows_max value than CKAN's default
        if result_len != rows_max and start + rows_max < total_results:
            rows_max = result_len

        start += rows_max

    return results


def ckan_group_package_show(
    base_url: str, id: str, limit: Optional[int] = sys.maxsize
) -> list[dict[str, Any]]:
    """Returns a list of CKAN packages from a group.

    :param base_url: Base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
    :param id: The group's ID.
    :param limit: Maximum number of results to return, defaults to maximum integer.
    :return: List of dictionaries representing the packages associated with the group.
    """
    remote = RemoteCKAN(base_url, get_only=True)
    result = remote.action.group_package_show(id=id, limit=limit)
    return result


def ckan_collection_search(base_url: str, collection_id: str) -> list[Package]:
    """Returns a list of CKAN packages from a collection.

    :param base_url: Base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
    :param collection_id: The ID of the parent package.
    :return: List of Package objects representing the packages associated with the collection.
    """
    packages = []
    url = f"{base_url}?collection_package_id={collection_id}"
    soup = get_soup(url)

    # Calculate the total number of pages of packages (20 results per page)
    num_results = int(soup.find(class_="new-results").text.split()[0].replace(",", ""))
    pages = math.ceil(num_results / 20)

    for page in range(1, pages + 1):
        url = f"{base_url}?collection_package_id={collection_id}&page={page}"
        soup = get_soup(url)

        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [
                executor.submit(
                    collection_search_get_package_data, dataset_heading, base_url
                )
                for dataset_heading in soup.find_all(class_="dataset-heading")
            ]

            # as_completed() yields Future objects; unwrap each result
            for future in as_completed(futures):
                packages.append(future.result())

        # Take a break to avoid being timed out
        if len(futures) >= 15:
            time.sleep(10)

    return packages


def collection_search_get_package_data(dataset_heading, base_url: str) -> Package:
    package = Package()
    joined_url = urljoin(base_url, dataset_heading.a.get("href"))
    dataset_soup = get_soup(joined_url)
    # Determine if the dataset URL should point to an external site or the current site
    resources = dataset_soup.find("section", id="dataset-resources").find_all(
        class_="resource-item"
    )
    button = resources[0].find(class_="btn-group")
    if len(resources) == 1 and button is not None and button.a.text == "Visit page":
        package.url = button.a.get("href")
    else:
        package.url = joined_url

    package.title = dataset_soup.find(itemprop="name").text.strip()
    package.agency_name = dataset_soup.find("h1", class_="heading").text.strip()
    package.description = dataset_soup.find(class_="notes").p.text

    return package


def get_soup(url: str) -> BeautifulSoup:
    """Returns a BeautifulSoup object for the given URL."""
    time.sleep(1)
    response = requests.get(url, timeout=30)
    return BeautifulSoup(response.content, "lxml")
@@ -0,0 +1,57 @@
from itertools import chain
import json
import sys

from from_root import from_root
from tqdm import tqdm

# Make the repository root importable so scrapers_library can be found
p = from_root("CONTRIBUTING.md").parent
sys.path.insert(1, str(p))

from scrapers_library.data_portals.ckan.ckan_scraper import (
ckan_package_search,
ckan_group_package_show,
ckan_collection_search,
)
from search_terms import package_search, group_search


def main():
    results = []

    for search in package_search:
        results += [
            ckan_package_search(search["url"], query=query) for query in search["terms"]
        ]

    flat_list = list(chain(*results))
    # Deduplicate entries
    flat_list = [i for n, i in enumerate(flat_list) if i not in flat_list[n + 1 :]]

    print("Retrieving collections...")
    # Replace any result flagged as collection metadata with its child packages.
    # Non-collection results are wrapped in a list so chain() flattens uniformly
    # instead of iterating over the dictionary's keys.
    flat_list = [
        (
            [
                ckan_collection_search(
                    base_url="https://catalog.data.gov/dataset/",
                    collection_id=result["id"],
                )
                for extra in result["extras"]
                if extra["key"] == "collection_metadata" and extra["value"] == "true"
            ]
            if "extras" in result.keys()
            else [result]
        )
        for result in tqdm(flat_list)
    ]
    flat_list = list(chain(*flat_list))

    # print(json.dumps(flat_list, indent=4))
    for result in flat_list:
        print(type(result))
    print(len(flat_list))
    print(flat_list[0])
    print(flat_list[-1])


if __name__ == "__main__":
    main()
    # results = ckan_collection_search(base_url="https://catalog.data.gov/dataset/", collection_id="7b1d1941-b255-4596-89a6-99e1a33cc2d8")
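The template above deduplicates search results with a quadratic scan over the list. For larger result sets, a linear-time alternative is to hash a canonical JSON form of each dictionary; the `dedupe_dicts` helper below is a sketch, not part of this PR:

```python
import json
from typing import Any


def dedupe_dicts(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Remove duplicate dicts while preserving order.

    Dicts are unhashable, so each one is serialized to a canonical
    (key-sorted) JSON string that can live in a set.
    """
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


print(dedupe_dicts([{"a": 1}, {"a": 1}, {"b": 2}]))  # [{'a': 1}, {'b': 2}]
```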
@@ -0,0 +1,17 @@
package_search = [
    {
        "url": "https://catalog.data.gov/",
        "terms": [
            "police",
            "crime",
            "tags:(court courts court-cases criminal-justice-system law-enforcement law-enforcement-agencies)",
        ],
    },
    {"url": "https://data.boston.gov/", "terms": ["police"]},
]

group_search = [
    {
        "url": "https://data.birminghamal.gov/",
        "ids": [""],
    }
]
66 changes: 0 additions & 66 deletions scrapers_library/data_portals/ckan/package_list/main.py

This file was deleted.

8 changes: 0 additions & 8 deletions scrapers_library/data_portals/ckan/package_list/readme.md

This file was deleted.
