CKAN general scraper #246

Merged Nov 13, 2024 (25 commits)

Commits
* 1d22426 Cleanup and test new methodology (EvilDrPurple, Oct 18, 2024)
* 816e4ea Basic prototype for package retrieval (EvilDrPurple, Oct 18, 2024)
* b5f0b35 Implement CKAN package search method (EvilDrPurple, Oct 19, 2024)
* 3280a2d Add template (EvilDrPurple, Oct 19, 2024)
* 1e59794 Update README (EvilDrPurple, Oct 19, 2024)
* fd2c36f Add return section (EvilDrPurple, Oct 19, 2024)
* 00a3b76 Fix return value (EvilDrPurple, Oct 20, 2024)
* 87797df Add requirements.txt (EvilDrPurple, Oct 24, 2024)
* b2af6f2 README updates (EvilDrPurple, Oct 24, 2024)
* 9200909 Add ckan_group_package_show() (EvilDrPurple, Oct 25, 2024)
* aba11e6 Add infrastructure for scraping multiple data portals (EvilDrPurple, Oct 26, 2024)
* 901c491 Add ckan_collection_search() (EvilDrPurple, Oct 26, 2024)
* 4013b80 Updates to README (EvilDrPurple, Oct 26, 2024)
* f61f604 Add pagination to ckan_collection_search() (EvilDrPurple, Oct 26, 2024)
* 239ebbc Fix error in collection search (EvilDrPurple, Oct 26, 2024)
* 62295f1 Format with Black (EvilDrPurple, Oct 26, 2024)
* 9893aeb Add threading to collection search (EvilDrPurple, Oct 27, 2024)
* d77781a Add ckan_package_search_from_organization() (EvilDrPurple, Nov 4, 2024)
* 4360a4a Further search updates (EvilDrPurple, Nov 4, 2024)
* f6c2af1 Parse returned data (EvilDrPurple, Nov 11, 2024)
* 93538c5 Output data to CSV, other edge case handling (EvilDrPurple, Nov 11, 2024)
* 7d1aba9 Delete results.csv (EvilDrPurple, Nov 11, 2024)
* 2adda16 Update file structure (EvilDrPurple, Nov 11, 2024)
* 842b915 Update README (EvilDrPurple, Nov 11, 2024)
* 2fd5a78 activate venv instructions (josh-chamberlain, Nov 13, 2024)
5 changes: 0 additions & 5 deletions scrapers_library/data_portals/__init__.py

This file was deleted.

94 changes: 94 additions & 0 deletions scrapers_library/data_portals/ckan/README.md
@@ -0,0 +1,94 @@
# CKAN Scraper

## Introduction

This scraper retrieves package information from data portals built on CKAN, an open-source data catalog platform used by sites such as <https://data.gov/>.

The scraper's functions can be found in `ckan_scraper.py`.

A template can be found in the `template` folder.

## Definitions

* `Package` - Also called a dataset; a page containing relevant information about a dataset. For example, this page is a package: <https://catalog.data.gov/dataset/electric-vehicle-population-data>.
* `Collection` - A grouping of child packages related to a parent package. This is separate from a group.
* `Group` - Also called a topic; a grouping of packages. Packages in a group do not have a parent package. Groups can also contain subgroups.
* `Organization` - The entity a package's data belongs to, such as "City of Austin" or "Department of Energy". Organization types group together organizations that share something in common.

## Setup

1. In a terminal, navigate to the CKAN scraper folder
```cmd
cd scrapers_library/data_portals/ckan/
```
2. Create a Python virtual environment
```cmd
python -m venv venv
```
3. Activate the virtual environment (on macOS/Linux, use `source venv/bin/activate` instead)
```cmd
venv\Scripts\activate
```
4. Install the requirements
```cmd
pip install -r requirements.txt
```
5. Copy the template script to another desired directory and edit it as needed. Then, run the scraper
```cmd
python [script name]
```

## How can I tell if a website I want to scrape is hosted using CKAN?

There's no easy way to tell: some websites reference CKAN or link back to the CKAN documentation, while others do not, and there doesn't seem to be a database of all CKAN instances either.

The best way to determine if a data catalog is using CKAN is to attempt to query its API. To do this:

1. In a web browser, navigate to the website's data catalog (e.g. for data.gov this is at <https://catalog.data.gov/dataset/>)
2. Copy the first part of the link (e.g. <https://catalog.data.gov/>)
3. Paste it in the browser's URL bar and add `api/3/action/package_search` to the end (e.g. <https://catalog.data.gov/api/3/action/package_search>)

*NOTE: Some hosts use a different base URL for API requests. For example, Canada's Open Government Portal can be found at <https://search.open.canada.ca/opendata/> while the API access link is <https://open.canada.ca/data/en/api/3/action/package_search> as described in their [Access our API](https://open.canada.ca/en/access-our-application-programming-interface-api) page*
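The check above can be scripted. A minimal sketch, assuming the portal uses the standard `/api/3/action/` path; the `is_ckan_portal` helper is illustrative and not part of this PR:

```python
import json
from urllib.parse import urljoin
from urllib.request import urlopen


def is_ckan_portal(base_url: str, timeout: float = 10.0) -> bool:
    """Heuristic check: a CKAN portal answers package_search with {"success": true}."""
    # rows=0 asks for the result count only, keeping the probe cheap
    api_url = urljoin(base_url, "api/3/action/package_search?rows=0")
    try:
        with urlopen(api_url, timeout=timeout) as response:
            return json.load(response).get("success") is True
    except Exception:
        # Non-CKAN sites typically return HTML, a 404, or refuse the request
        return False
```

For example, `is_ckan_portal("https://catalog.data.gov/")` would probe `https://catalog.data.gov/api/3/action/package_search?rows=0`.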

## Documentation

`ckan_package_search(base_url: str, query: Optional[str], rows: Optional[int], start: Optional[int], **kwargs) -> list[dict[str, Any]]`

Searches for packages (datasets) in a CKAN data portal that satisfy a given search criterion.

### Parameters

* **base_url** - The base URL to search from. e.g. "https://catalog.data.gov/"
* **query (optional)** - The keyword string to search for. e.g. "police". Leaving empty will return all packages in the package list.
* **rows (optional)** - The maximum number of results to return. Leaving empty will return all results.
* **start (optional)** - Which result number to start at. Leaving empty will start at the first result.
* **kwargs (optional)** - Additional keyword arguments. For more information on acceptable keyword arguments and their function see <https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search>

### Return

The function returns a list of dictionaries containing matching package results.
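A sketch of post-processing the returned list; the sample data below is illustrative and only shaped like real CKAN output, which carries many more keys:

```python
# Illustrative sample shaped like CKAN package_search results
results = [
    {"title": "Crime Incidents", "organization": {"title": "City of Austin"}},
    {"title": "Police Budget", "organization": None},
]

# Keep only packages that belong to an organization and summarize them
summaries = [
    f'{pkg["title"]} ({pkg["organization"]["title"]})'
    for pkg in results
    if pkg.get("organization")
]
print(summaries)  # ['Crime Incidents (City of Austin)']
```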

---

`ckan_group_package_show(base_url: str, id: str, limit: Optional[int]) -> list[dict[str, Any]]`

Returns a list of CKAN packages that belong to a particular group.

### Parameters

* **base_url** - The base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
* **id** - The group's ID. This can be retrieved by searching for a package and finding the "id" key in the "groups" key.
* **limit** - The maximum number of results to return, leaving empty will return all results.

### Return

The function returns a list of dictionaries representing the packages associated with the group.
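As noted above, group IDs can be pulled from a package's "groups" key. A sketch using an illustrative package dictionary (the IDs are placeholders, not real CKAN identifiers):

```python
# Illustrative package dictionary; real packages carry many more keys
package = {
    "title": "Electric Vehicle Population Data",
    "groups": [
        {"id": "group-id-1", "name": "energy"},
        {"id": "group-id-2", "name": "transportation"},
    ],
}

# Collect every group ID the package belongs to
group_ids = [group["id"] for group in package.get("groups", [])]
print(group_ids)  # ['group-id-1', 'group-id-2']
```

Each ID could then be passed to `ckan_group_package_show()` to fetch that group's packages.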

---

`ckan_collection_search(base_url: str, collection_id: str) -> list[Package]`

Returns a list of CKAN package information for the packages that belong to a collection. When queried through the API, CKAN data portals are supposed to return package relationships along with the rest of the data; in practice, not all portals are set up this way. Because child packages cannot be queried directly, they do not show up in any search results. To get around this, the function manually scrapes the information of all child packages related to the given parent.

*NOTE: This function has only been tested on <https://catalog.data.gov/>. It is likely it will not work properly on other platforms.*

### Parameters

* **base_url** - The base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
* **collection_id** - The ID of the parent package. This can be found by querying the parent package and using the "id" key, or by navigating to the list of child packages and looking in the URL. e.g. In <https://catalog.data.gov/dataset/?collection_package_id=7b1d1941-b255-4596-89a6-99e1a33cc2d8> the collection_id is "7b1d1941-b255-4596-89a6-99e1a33cc2d8"

### Return

List of Package objects representing the child packages associated with the collection.
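catalog.data.gov lists 20 datasets per page, and `ckan_collection_search()` derives its page count from the result total. That calculation can be sketched in isolation (the `page_count` helper is illustrative, not part of the PR):

```python
import math

RESULTS_PER_PAGE = 20  # catalog.data.gov shows 20 datasets per listing page


def page_count(num_results: int) -> int:
    """Number of listing pages needed to cover all results."""
    return math.ceil(num_results / RESULTS_PER_PAGE)


print(page_count(47))  # 3 pages: 20 + 20 + 7
print(page_count(40))  # 2 pages exactly
```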
141 changes: 141 additions & 0 deletions scrapers_library/data_portals/ckan/ckan_scraper.py
@@ -0,0 +1,141 @@
from concurrent.futures import as_completed, ThreadPoolExecutor
from dataclasses import dataclass
import math
import sys
import time
from typing import Any, Optional
from urllib.parse import urljoin

from bs4 import BeautifulSoup
from ckanapi import RemoteCKAN
import requests


@dataclass
class Package:
    url: str = ""
    title: str = ""
    agency_name: str = ""
    description: str = ""


def ckan_package_search(
    base_url: str,
    query: Optional[str] = None,
    rows: Optional[int] = sys.maxsize,
    start: Optional[int] = 0,
    **kwargs,
) -> list[dict[str, Any]]:
    """Performs a CKAN package (dataset) search from a CKAN data catalog URL.

    :param base_url: Base URL to search from. e.g. "https://catalog.data.gov/"
    :param query: Search string, defaults to None. None will return all packages.
    :param rows: Maximum number of results to return, defaults to maximum integer.
    :param start: Offsets the results, defaults to 0.
    :param kwargs: See https://docs.ckan.org/en/2.10/api/index.html#ckan.logic.action.get.package_search for additional arguments.
    :return: List of dictionaries representing the CKAN package search results.
    """
    remote = RemoteCKAN(base_url, get_only=True)
    results = []
    offset = start
    # CKAN's package search returns at most 1000 packages per request by default
    rows_max = 1000

    while start < rows:
        num_rows = rows - start + offset
        packages = remote.action.package_search(
            q=query, rows=num_rows, start=start, **kwargs
        )
        results += packages["results"]

        total_results = packages["count"]
        if rows > total_results:
            rows = total_results

        result_len = len(packages["results"])
        # Check if the website has a different rows_max value than CKAN's default
        if result_len != rows_max and start + rows_max < total_results:
            rows_max = result_len

        start += rows_max

    return results


def ckan_group_package_show(
    base_url: str, id: str, limit: Optional[int] = sys.maxsize
) -> list[dict[str, Any]]:
    """Returns a list of CKAN packages from a group.

    :param base_url: Base URL of the CKAN portal. e.g. "https://catalog.data.gov/"
    :param id: The group's ID.
    :param limit: Maximum number of results to return, defaults to maximum integer.
    :return: List of dictionaries representing the packages associated with the group.
    """
    remote = RemoteCKAN(base_url, get_only=True)
    result = remote.action.group_package_show(id=id, limit=limit)
    return result


def ckan_collection_search(base_url: str, collection_id: str) -> list[Package]:
    """Returns a list of CKAN packages from a collection.

    :param base_url: Base URL of the CKAN portal before the collection ID. e.g. "https://catalog.data.gov/dataset/"
    :param collection_id: The ID of the parent package.
    :return: List of Package objects representing the packages associated with the collection.
    """
    packages = []
    url = f"{base_url}?collection_package_id={collection_id}"
    soup = get_soup(url)

    # Calculate the total number of pages of packages (20 results per page)
    num_results = int(soup.find(class_="new-results").text.split()[0].replace(",", ""))
    pages = math.ceil(num_results / 20)

    for page in range(1, pages + 1):
        url = f"{base_url}?collection_package_id={collection_id}&page={page}"
        soup = get_soup(url)

        with ThreadPoolExecutor(max_workers=10) as executor:
            futures = [
                executor.submit(
                    collection_search_get_package_data, dataset_heading, base_url
                )
                for dataset_heading in soup.find_all(class_="dataset-heading")
            ]

            # as_completed() yields Future objects; unwrap each result
            for future in as_completed(futures):
                packages.append(future.result())

        # Take a break to avoid being timed out
        if len(futures) >= 15:
            time.sleep(10)

    return packages


def collection_search_get_package_data(dataset_heading, base_url: str) -> Package:
    package = Package()
    joined_url = urljoin(base_url, dataset_heading.a.get("href"))
    dataset_soup = get_soup(joined_url)
    # Determine if the dataset URL should point to an external site or the current site
    resources = dataset_soup.find("section", id="dataset-resources").find_all(
        class_="resource-item"
    )
    button = resources[0].find(class_="btn-group")
    if len(resources) == 1 and button is not None and button.a.text == "Visit page":
        package.url = button.a.get("href")
    else:
        package.url = joined_url

    package.title = dataset_soup.find(itemprop="name").text.strip()
    package.agency_name = dataset_soup.find("h1", class_="heading").text.strip()
    package.description = dataset_soup.find(class_="notes").p.text

    return package


def get_soup(url: str) -> BeautifulSoup:
    """Returns a BeautifulSoup object for the given URL."""
    time.sleep(1)
    response = requests.get(url, timeout=30)
    return BeautifulSoup(response.content, "lxml")
@@ -0,0 +1,57 @@
from itertools import chain
import json
import sys

from from_root import from_root
from tqdm import tqdm

# Make the repository root importable so scrapers_library can be found
p = from_root("CONTRIBUTING.md").parent
sys.path.insert(1, str(p))

from scrapers_library.data_portals.ckan.ckan_scraper import (
ckan_package_search,
ckan_group_package_show,
ckan_collection_search,
)
from search_terms import package_search, group_search


def main():
    results = []

    for search in package_search:
        results += [
            ckan_package_search(search["url"], query=query) for query in search["terms"]
        ]

    flat_list = list(chain(*results))
    # Deduplicate entries
    flat_list = [i for n, i in enumerate(flat_list) if i not in flat_list[n + 1 :]]

    print("Retrieving collections...")
    # Replace any result flagged as collection metadata with its child packages.
    # Non-collection results are wrapped in a list so chain() flattens uniformly
    # instead of iterating over the dictionary's keys.
    flat_list = [
        (
            [
                ckan_collection_search(
                    base_url="https://catalog.data.gov/dataset/",
                    collection_id=result["id"],
                )
                for extra in result["extras"]
                if extra["key"] == "collection_metadata" and extra["value"] == "true"
            ]
            if "extras" in result.keys()
            else [result]
        )
        for result in tqdm(flat_list)
    ]
    flat_list = list(chain(*flat_list))

    # print(json.dumps(flat_list, indent=4))
    for result in flat_list:
        print(type(result))
    print(len(flat_list))
    print(flat_list[0])
    print(flat_list[-1])


if __name__ == "__main__":
    main()
    # results = ckan_collection_search(base_url="https://catalog.data.gov/dataset/", collection_id="7b1d1941-b255-4596-89a6-99e1a33cc2d8")
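The template above deduplicates search results with a quadratic scan over the list. For larger result sets, a linear-time alternative is to hash a canonical JSON form of each dictionary; the `dedupe_dicts` helper below is a sketch, not part of this PR:

```python
import json
from typing import Any


def dedupe_dicts(items: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Remove duplicate dicts while preserving order.

    Dicts are unhashable, so each one is serialized to a canonical
    (key-sorted) JSON string that can live in a set.
    """
    seen = set()
    unique = []
    for item in items:
        key = json.dumps(item, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


print(dedupe_dicts([{"a": 1}, {"a": 1}, {"b": 2}]))  # [{'a': 1}, {'b': 2}]
```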
@@ -0,0 +1,17 @@
package_search = [
    {
        "url": "https://catalog.data.gov/",
        "terms": [
            "police",
            "crime",
            "tags:(court courts court-cases criminal-justice-system law-enforcement law-enforcement-agencies)",
        ],
    },
    {"url": "https://data.boston.gov/", "terms": ["police"]},
]

group_search = [
    {
        "url": "https://data.birminghamal.gov/",
        "ids": [""],
    }
]
66 changes: 0 additions & 66 deletions scrapers_library/data_portals/ckan/package_list/main.py

This file was deleted.

8 changes: 0 additions & 8 deletions scrapers_library/data_portals/ckan/package_list/readme.md

This file was deleted.
