CKAN general scraper #246
Conversation
Hello @EvilDrPurple, sorry for the delay! I was able to run it after installing requirements. I think after making this beginner friendly, we could call it approved:
- could you please add a requirements.txt file?
- it would also be good to have, right after the introduction, a header for "Running this scraper" with instructions for setting up a venv and installing the requirements there
- give an example command for running the appropriate script, including with the args
Once that's done, do you want to try using it to find some data sources? Below are some portals which purport to use CKAN in our database. Can you successfully get something like a submission to our db using the scraper / this list?
https://opendata.cabq.gov/
https://data.boston.gov/
https://data.birminghamal.gov/
https://open.jacksonms.gov/
https://data.houstontx.gov/
https://data.milwaukee.gov/
https://data.ci.newark.nj.us/
https://data.ok.gov/
https://data.sanantonio.gov/
https://data.sanjoseca.gov/
https://data.vbgov.com/
@josh-chamberlain Requested changes made, I'll work on the next part now
@EvilDrPurple thanks for the updates! Scraping multiple portals will be important, so it's good to see you added that. However, I'm still a bit confused (as the resident code dummy) about how to use this. It's a little complex. It seems part of the current complexity is due to the different options we have for approaching what's there; a couple of things that might help:
- simpler tree:
- here's a snippet to illustrate what I mean about letting people set args in the config:
Do you think this simpler approach is possible? Any roadblocks or major things I'm missing that we would be leaving out this way?
To piggyback off of what @josh-chamberlain is saying, I think the four Python files would each benefit from having a module-level docstring at the top of the file -- this way, it's easier to determine at a glance, for example, what each file does.

Ideally, everyone would read the README thoroughly and then engage with each section, but considering I didn't even do that, we should probably aim to stupid-proof where possible. Or Max-proof, which may or may not be a subset of stupid-proof.*

*Note that I absolutely do not do this in all of my code, so this is very much a "Not as I do" sort of recommendation.
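For illustration, a module-level docstring along these lines would do the trick (the described behavior and wording here are invented placeholders, not the PR's actual code):

```python
"""Collect law-enforcement-related package metadata from a CKAN data portal.

Placeholder docstring for illustration only: it summarizes, in a sentence or
two, what the module does and how it is meant to be invoked, so a reader can
orient themselves without opening the README.
"""
```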
If I'm reading into this correctly, that is what this does. You call the
The way that it is set up is the way we've been doing it in other scraper portals. It's for a scraper writer to copy the template and move to one of the folders such as
The use of the template is optional, but yes. The args are passed to the
Something like this is good and what I would envision being added to the current template. Just nitpicky, but I'm not sure kwargs would work with a dict; there could be a way I don't know of to make it work (besides, kwargs is more for advanced users anyway).
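For what it's worth, a plain dict can be unpacked into keyword arguments with `**`, so passing a config dict that way should work; a minimal sketch with invented names, not the scraper's actual interface:

```python
# Illustrative only: the function and config fields are invented for this example.
def run_scraper(url: str, terms=None, pages: int = 1) -> None:
    print(f"Scraping {url} for {terms} across {pages} page(s)")

config = {
    "url": "https://data.example.gov",
    "terms": ["police", "arrest"],
    "pages": 2,
}

# ** unpacks the dict's keys as keyword arguments.
run_scraper(**config)
```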
@maxachis Yeah this makes a lot of sense, I'll be sure to add this |
@EvilDrPurple @josh-chamberlain As an aside, and without dipping myself too deep into this particular debate, I will say my tendency in my own code is to find a middle ground between Python modules vs. configuration files and have configuration files that are Python modules. This way, I can document the code, properly establish its relationships to other components, and not have to worry about an alternative file format. For example:

```python
from enum import Enum
from typing import Optional, Type

# Schema, DTOTypes, and the concrete schema/DTO classes referenced below
# are project-specific imports, omitted here.


class EndpointSchemaConfig:
    def __init__(
        self,
        input_schema: Optional[Schema] = None,
        output_schema: Optional[Schema] = None,
        input_dto_class: Optional[Type[DTOTypes]] = None,
    ):
        """
        :param input_schema: Describes the schema to be input on a request
        :param output_schema: Describes the schema to be output on a successful request
        :param input_dto_class: Describes the DTO which will be populated based on the input schema.
        """
        self.input_schema = input_schema
        self.output_schema = output_schema
        self.input_dto_class = input_dto_class

    # ...


class SchemaConfigs(Enum):
    # region Data Requests
    DATA_REQUESTS_GET_MANY = EndpointSchemaConfig(
        input_schema=GetManyDataRequestsRequestsSchema(),
        output_schema=GetManyDataRequestsResponseSchema(),
        input_dto_class=GetManyDataRequestsRequestsDTO,
    )
    DATA_REQUESTS_BY_ID_GET = EndpointSchemaConfig(
        input_schema=None, output_schema=GetByIDDataRequestsResponseSchema()
    )
    # endregion
    # ...
```

(In retrospect, I could have made

That aside, er, aside: there is something to be said about whether our scrapers, as they exist, are easy to use, standardized, and relatively low on redundant code from scraper to scraper, but it's hard to determine that when zoomed into a single scraper. That may require a separate discussion.
@maxachis I do like this approach. The config implementation I thought was being asked for is an older pattern that a previous programmer with limited experience set up; I don't like this particular implementation very much:

```python
import os

# crimegraphics_clery is the scraper function defined elsewhere in the repo.

configs = {
    "url": "",
    "department_code": "",
    "list_header": [
        "ChargeDescription",
        "CaseNum",
        "ReportDate",
        "OffenseDate",
        "Location",
        "ChargeDisposition",
    ],
}
save_dir = "./data/"
data = []

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

crimegraphics_clery(configs, save_dir)
```
Ahh, now I see. Interesting. I don't think I ever understood that that was the goal of the template. I can see how you got there! I would say that the geographic structure is for scrapers which apply to specific geographic sources; these portal scrapers cover

When we run the scraper, we might choose to focus on just one geography, which is possible with configuration. But I think the most important use case is someone dedicating some time (or using GitHub Actions) to run the CKAN scraper on all our known regions at once. I think we can do this by checking our API for sources which have a

Is this making sense?
Sounds good to me. I'm not too picky about the format, but I'm most comfortable with the idea of running one scraper that is configurable in a clear way rather than having files for different use cases scattered about.
Most of these scrapers are to be used and run independently. We are not planning to automate the whole repo or orchestrate a scraper farm. If we do one day, we could refactor—but through tough experience we have learned that the closer we tie scrapers to each other, the harder it is to write individual scrapers. If the data-source-collecting scrapers share common code, that would make some sense—because they are designed to collect data sources and send them to our API (or Label Studio). If we have a consistent way of talking to our API or Label Studio's, that makes sense.
Oooh, that might be cause to create some kind of mini-SDK to handle common calls to the API or Label Studio or whatever else we have in mind. Let's make note of that for later.
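A very rough sketch of what one call in such a mini-SDK could look like (the endpoint, fields, and function name below are entirely hypothetical, not the project's actual API):

```python
import requests

API_BASE = "https://api.example.org"  # hypothetical base URL


def submit_data_source(name: str, source_url: str, agency: str, api_key: str) -> dict:
    """Submit a candidate data source to the shared API (illustrative only)."""
    response = requests.post(
        f"{API_BASE}/data-sources",  # hypothetical endpoint
        json={"name": name, "source_url": source_url, "agency": agency},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```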
@josh-chamberlain
@EvilDrPurple hmm, interesting. The sources db is sort of the foundation here, so it'll be important to try to improve the accuracy of the

Could you provide an example of what you mean by parity issues? Does "package" not mean the same thing from one instance to the other? We're just trying to document the existence of a source, so I'm hoping we can find a solution.
@josh-chamberlain A package always means the same thing across CKAN instances; however, the way packages are sorted is what varies most. Some instances will have all police data related packages in a
@EvilDrPurple @josh-chamberlain The question then is, how and to what degree can we anticipate and document the different forms which CKAN instances can take?
@maxachis, forgive the long paste, but below is a sample package returned from ckanapi. Link to host page: https://catalog.data.gov/dataset/nypd-arrest-data-year-to-date

From what I can tell, the layout is completely the same across instances with the exception of the

```json
{
"author": null,
"author_email": null,
"creator_user_id": "2b785922-9f13-491b-a3c2-2a40acbd80c2",
"id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"isopen": false,
"license_id": "notspecified",
"license_title": "License not specified",
"maintainer": "NYC OpenData",
"maintainer_email": "no-reply@data.cityofnewyork.us",
"metadata_created": "2020-11-10T17:05:36.995577",
"metadata_modified": "2024-10-25T20:28:59.948113",
"name": "nypd-arrest-data-year-to-date",
"notes": "This is a breakdown of every arrest effected in NYC by the NYPD during the current year.\n This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. \n Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement. \nIn addition, information related to suspect demographics is also included. \nThis data can be used by the public to explore the nature of police enforcement activity. \nPlease refer to the attached data footnotes for additional information about this dataset.",
"num_resources": 4,
"num_tags": 5,
"organization": {
"id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
"name": "city-of-new-york",
"title": "City of New York",
"type": "organization",
"description": "",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Seal_of_New_York.svg/175px-Seal_of_New_York.svg.png",
"created": "2020-11-10T15:26:23.896065",
"is_organization": true,
"approval_status": "approved",
"state": "active"
},
"owner_org": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
"private": false,
"state": "active",
"title": "NYPD Arrest Data (Year to Date)",
"type": "dataset",
"url": null,
"version": null,
"extras": [
{
"key": "resource-type",
"value": "Dataset"
},
{
"key": "source_hash",
"value": "5489c9a695b7ef8e8b45680d5b5c81cf24b2cc9d2f66497c1da7da7cf1993bc2"
},
{
"key": "source_datajson_identifier",
"value": true
},
{
"key": "source_schema_version",
"value": "1.1"
},
{
"key": "accessLevel",
"value": "public"
},
{
"key": "identifier",
"value": "https://data.cityofnewyork.us/api/views/uip8-fykc"
},
{
"key": "issued",
"value": "2020-10-28"
},
{
"key": "landingPage",
"value": "https://data.cityofnewyork.us/d/uip8-fykc"
},
{
"key": "modified",
"value": "2024-10-21"
},
{
"key": "publisher",
"value": "data.cityofnewyork.us"
},
{
"key": "theme",
"value": [
"Public Safety"
]
},
{
"key": "catalog_@context",
"value": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
},
{
"key": "catalog_@id",
"value": "https://data.cityofnewyork.us/data.json"
},
{
"key": "catalog_conformsTo",
"value": "https://project-open-data.cio.gov/v1.1/schema"
},
{
"key": "catalog_describedBy",
"value": "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
},
{
"key": "harvest_object_id",
"value": "cf1f3514-4b33-4196-b5ee-347dc395ffbb"
},
{
"key": "harvest_source_id",
"value": "1696593e-c691-4f61-a696-5dcb9e4c9b4c"
},
{
"key": "harvest_source_title",
"value": "NYC JSON"
}
],
"groups": [
{
"description": "Local Government Topic - for all datasets with state, local, county organizations",
"display_name": "Local Government",
"id": "7d625e66-9e91-4b47-badd-44ec6f16b62b",
"image_display_url": "",
"name": "local",
"title": "Local Government"
}
],
"resources": [
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001960",
"description": "",
"format": "CSV",
"hash": "",
"id": "c48f1a1a-5efb-4266-9572-769ed1c9b472",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001960",
"mimetype": "text/csv",
"mimetype_inner": null,
"name": "Comma Separated Values File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 0,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.csv?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001970",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.rdf",
"describedByType": "application/rdf+xml",
"description": "",
"format": "RDF",
"hash": "",
"id": "5c137f71-4e20-49c5-bd45-a562952195fe",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001970",
"mimetype": "application/rdf+xml",
"mimetype_inner": null,
"name": "RDF File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 1,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.rdf?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001976",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.json",
"describedByType": "application/json",
"description": "",
"format": "JSON",
"hash": "",
"id": "036b090c-488e-41fd-89f2-126fead8cda7",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001976",
"mimetype": "application/json",
"mimetype_inner": null,
"name": "JSON File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 2,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.json?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001981",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.xml",
"describedByType": "application/xml",
"description": "",
"format": "XML",
"hash": "",
"id": "8bbb0d22-bf80-407f-bb8d-e72e2cd1fbd9",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001981",
"mimetype": "application/xml",
"mimetype_inner": null,
"name": "XML File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 3,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.xml?accessType=DOWNLOAD",
"url_type": null
}
],
"tags": [
{
"display_name": "arrest",
"id": "a76dff3f-cba8-42b4-ab51-1aceb059d16f",
"name": "arrest",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "crime",
"id": "df442823-c823-4890-8fca-805427bd8dd9",
"name": "crime",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "law-enforcement",
"id": "c7de583e-e2ca-4e1a-99f9-d0bdb8acd910",
"name": "law-enforcement",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "nypd",
"id": "f3f11364-96d9-4bda-b4f4-5d6111c39108",
"name": "nypd",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "public-safety",
"id": "66326f22-35c0-46a7-941b-4b7690f8b199",
"name": "public-safety",
"state": "active",
"vocabulary_id": null
}
],
"relationships_as_subject": [],
"relationships_as_object": []
}
```
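For reference, a package like the one above can be fetched with the ckanapi client roughly like this; the portal URL and dataset id come from the sample, but treat the snippet as an illustrative sketch rather than the scraper's actual code:

```python
from ckanapi import RemoteCKAN

# catalog.data.gov is itself a CKAN instance
portal = RemoteCKAN("https://catalog.data.gov", user_agent="ckan-scraper-example")
package = portal.action.package_show(id="nypd-arrest-data-year-to-date")

print(package["title"])  # "NYPD Arrest Data (Year to Date)"
print([resource["format"] for resource in package["resources"]])  # ["CSV", "RDF", "JSON", "XML"]
```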
@EvilDrPurple So that would indicate that we have some standardization in terms of format, but not in how that format is populated. And some parts use completely original formatting, but only in defined areas like

That's useful to know for future reference, in that presumably it can guide people on a higher level with designing these scrapers, even if it can't help as much in the lower-level implementation.
@maxachis Would this be something to outline in the README, then? |
It would be! And I'd defer to you as to how you think that would make the most sense to outline, since you're much more familiar with the peculiarities of CKAN than I am.
@maxachis I think this is ready for another look. I made some changes to the README and file structure. Also the
The README is great, and it works like a charm! I was amazed at how much data I was able to pull with just one run.
My main recommendations involve revising or adding some documentation, and extracting some logic into separate functions to enhance legibility. Other than that, I think it's good, but will defer to @josh-chamberlain for final judgment 🔨
```python
url = f"{base_url}?collection_package_id={collection_id}&page={page}"
soup = _get_soup(url)
# ...
with ThreadPoolExecutor(max_workers=10) as executor:
```
With the amount of indentation and the size of this function, I think it's a candidate for extracting some of the logic to a separate function! I think the `with ThreadPoolExecutor` part in particular is useful to extract, both because it's quite complex and involves some low-level components, and because an additional docstring would help provide some additional information.
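A minimal sketch of the kind of extraction being suggested, with the helper name and parameters invented for illustration rather than taken from the PR:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, items, max_workers: int = 10) -> list:
    """Run fetch_one over every item in a thread pool and return results in input order.

    fetch_one is any callable that takes a single item (e.g. a package id or URL)
    and returns its parsed result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_one, items))
```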
```python
for result in tqdm(results):
    if "extras" in result.keys():
        collections = [
```
This collections definition could benefit from being extracted and spread out onto multiple lines. As is, if someone's reading this, they have to keep track of:
- a for loop
- an if conditional to check whether extra["key"] is "collection_metadata"
- an if conditional to check whether extra["value"] is "true"
- an if conditional to check whether result["resources"] is null

If someone was running this in a debugger, I imagine it would get a little messy.
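A rough sketch of one way to pull that check out, based only on the conditions described above (the helper name and exact semantics are guesses, not the PR's code):

```python
def is_collection(result: dict) -> bool:
    """Return True if any 'extras' entry marks this package as collection metadata.

    Illustrative helper only; the real filter may also need to handle the
    empty-resources case mentioned above.
    """
    return any(
        extra["key"] == "collection_metadata" and extra["value"] == "true"
        for extra in result.get("extras", [])
    )


# e.g. collections = [result for result in results if is_collection(result)]
```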
```python
# If there are no resources available
elif len(result["resources"]) == 0:
    # Use the dataset's external landing page
    package.url = [
```
Recommend extracting this to a small helper function due to the density of the logic!
```python
    :return: Updated list of results.
    """
    key = list(search_terms[0].keys())[1]
    for search in tqdm(search_terms):
```
The use of tqdm is an excellent touch, especially since it can take a little while for some of these to load 👌
Yes! It used to take ~30 minutes to run before I limited the number of collections being searched for, but I figured I'd keep the progress bar so that anyone running the program doesn't think it's frozen.
```python
# Remove packages with no data or landing page
if len(result["resources"]) == 0:
    landing_page = next(
```
Let's extract some of this logic! This is a next, a for loop, two key-value accesses, and a boolean conditional condensed in a very small amount of real estate!
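A small sketch of one way to pull that apart, keyed off the "landingPage" extra seen in the sample package (the helper name and exact semantics are guesses, not the PR's code):

```python
from typing import Optional


def get_landing_page(result: dict) -> Optional[str]:
    """Return the dataset's external landing page URL from its extras, if present."""
    return next(
        (extra["value"] for extra in result.get("extras", []) if extra["key"] == "landingPage"),
        None,
    )
```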
For this part here, I'm curious if this is providing anything that's not already in the docstrings for these functions.
Either way, I think it's best to migrate whatever is unique into the docstrings of the functions themselves and then either remove this section or condense it to a higher level.
Having this documentation in two places raises the risk that it will become misaligned, and if folks can easily access this logic in the docstrings themselves, there's not as much need to have it in the README.
I'm wondering if this page overall might benefit from some additional documentation describing each of the unique parameters for each section. For example: what does `url` mean? What about `terms`? What do the ids represent, and is this the same for both `group_search` and `organization_search`? Some of these can be inferred, but there's uncertainty in there that I think would benefit from being disambiguated.
@EvilDrPurple nice, I was able to run this. Gosh, that's a lot of potential sources! It looks like many of them would be relevant.
I made tiny tweaks to the README but I think this is really good!