
CKAN general scraper #246

Merged · 25 commits merged into main on Nov 13, 2024

Conversation

EvilDrPurple
Contributor

Fixes

Description

  • Adds a general scraper for scraping from CKAN instances

Docs

  • See README.md

Contributor

@josh-chamberlain josh-chamberlain left a comment

Hello @EvilDrPurple, sorry for the delay! I was able to run it after installing requirements. I think after making this beginner friendly, we could call it approved:

  • could you please add a requirements.txt file?
  • it would also be good to have, right after the introduction, a header for "Running this scraper" with instructions for setting up a venv and installing the requirements there
  • give an example command for running the appropriate script, including the args

Once that's done, do you want to try using it to find some data sources? Below are some portals in our database which purport to use CKAN. Can you successfully get something like a submission to our db using the scraper and this list?

https://opendata.cabq.gov/
https://data.boston.gov/
https://data.birminghamal.gov/
https://open.jacksonms.gov/
https://data.houstontx.gov/
https://data.milwaukee.gov/
https://data.ci.newark.nj.us/
https://data.ok.gov/
https://data.sanantonio.gov/
https://data.sanjoseca.gov/
https://data.vbgov.com/

@EvilDrPurple
Contributor Author

@josh-chamberlain Requested changes made, I'll work on the next part now

@josh-chamberlain
Contributor

josh-chamberlain commented Oct 27, 2024

@EvilDrPurple thanks for the updates! scraping multiple portals will be important, so it's good to see you added that. However, I'm still a bit confused (as the resident code dummy) about how to use this. It's a little complex. It seems part of the current complexity is due to the different options we have for approaching what's there; a couple things that might help:

  • Zooming out, the point of this is to identify the location of useful packages so that we can put their URLs in our database. How about, for now, we just let people pass in a list of one or more organizations or base_urls which we comb through for packages, as well as search terms?
  • The "template" looks like a configuration which you edit before running it, and it calls the main scraper; I think it'd make more sense to call the template config or setup, and instead of asking user to make copy, just leave it where it is. Then, people can just run the main scraper which references the config.
  • Is the current template where you add the args? The README doesn't spell out where to actually put the args, and like I said, I'm a code dummy; if we think of it as a config file, we could make it more obvious how to use them.

simpler tree:

ckan/
  README.md
  ckan_scraper.py
  ckan_scraper_config.py

here's a snippet to illustrate what I mean about letting people set args in the config:

base_url = "https://catalog.data.gov/"  # The base URL to search from
query = "calls"  # Optional: Keyword string to search for
rows = 10  # Optional: Max number of results to return
start = 0  # Optional: Starting result number
kwargs = {}  # Optional: Additional keyword arguments

Do you think this simpler approach is possible? Any roadblocks or major things I'm missing that we would be leaving out this way?

@maxachis maxachis self-requested a review October 28, 2024 17:11
@maxachis
Contributor

To piggyback off of what @josh-chamberlain is saying, I think the four python files would each benefit from having a module-level docstring at the top of the file -- this way, it's easier to determine at a glance, for example, how ckan_scraper_template.py relates to everything else, and how users are expected to use it.

Ideally, everyone would read the README thoroughly and then engage with each section, but considering I didn't even do that, we should probably aim to stupid-proof where possible. Or Max-proof, which may or may not be a subset of stupid-proof.*

*Note that I absolutely do not do this in all of my code, so this is very much a "Not as I do" sort of recommendation.

@EvilDrPurple
Contributor Author

EvilDrPurple commented Oct 28, 2024

@josh-chamberlain

* Zooming out, the point of this is to identify the location of useful `packages` so that we can put their URLs in our database. How about, for now, we just let people pass in a list of one or more `organizations` or `base_urls` which we comb through for packages, as well as `search terms`?

If I'm reading this correctly, that is what this does. You call the ckan_package_search() function with the base_url and the search terms, and it returns the ckanapi response, which is a list of relevant datasets.
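
For illustration, here's roughly what that looks like through the ckanapi library (a minimal sketch; the wrapper name and defaults in the scraper may differ):

from ckanapi import RemoteCKAN

def search_packages(base_url: str, query: str, rows: int = 10, start: int = 0) -> list[dict]:
    # Minimal sketch: the scraper's ckan_package_search() presumably wraps
    # a package_search action call like this one.
    ckan = RemoteCKAN(base_url)
    response = ckan.action.package_search(q=query, rows=rows, start=start)
    return response["results"]  # each result is a package (dataset) dict

packages = search_packages("https://catalog.data.gov/", "police arrests", rows=5)
for package in packages:
    print(package["title"])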

* The "template" looks like a configuration which you edit before running it, and it calls the main scraper; I think it'd make more sense to call the template `config` or `setup`, and instead of asking user to make copy, just leave it where it is. Then, people can just run the main scraper which references the config.

The way it's set up is the way we've been doing it for other portal scrapers: a scraper writer copies the template, moves it to one of the folders such as PA/allegheny_county/..., and uses it to scrape data. If we want to change this approach, either for just this one or for all of them, we can.

* Is the current template where you add the args? The README doesn't spell out where to actually put the args, and like I said, I'm a code dummy; if we think of it as a config file, we could make it more obvious how to use them.

The use of the template is optional, but yes. The args are passed to the ckan_package_search() function. Another option is that I could rewrite the template to include the optional args so that it's clearer what to replace and how to use it. In my opinion, a config file obfuscates the code, makes each argument's purpose and proper input harder to see, and makes things more complicated for programmers to follow. The upside is that it's more user-friendly for a non-coder. I do think a good middle ground here would be to simply add the optional args as additional variables in the template to make things clearer, but let me know what you think.

base_url = "https://catalog.data.gov/"  # The base URL to search from
query = "calls"  # Optional: Keyword string to search for
rows = 10  # Optional: Max number of results to return
start = 0  # Optional: Starting result number
kwargs = {}  # Optional: Additional keyword arguments

Do you think this simpler approach is possible? Any roadblocks or major things I'm missing that we would be leaving out this way?

Something like this is good and what I would envision adding to the current template. One nitpick: I'm not sure kwargs would work as a dict, though there could be a way I don't know of to make it work (besides, kwargs is more for advanced users anyway).
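
For reference, the usual mechanism would be ** unpacking at the call site; a quick sketch (the signature and option names shown are illustrative, not the scraper's actual ones):

kwargs = {"fq": 'tags:"police"', "sort": "metadata_modified desc"}  # extra package_search options

def ckan_package_search(base_url, query=None, rows=10, start=0, **extra):
    # Illustrative stub: `extra` would be forwarded on to the CKAN package_search action
    print(base_url, query, rows, start, extra)

ckan_package_search("https://catalog.data.gov/", query="calls", **kwargs)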

@EvilDrPurple
Contributor Author

To piggyback off of what @josh-chamberlain is saying, I think the four python files would each benefit from having a module-level docstring at the top of the file -- this way, it's easier to determine at a glance, for example, how ckan_scraper_template.py relates to everything else, and how users are expected to use it.

Ideally, everyone would read the README thoroughly and then engage with each section, but considering I didn't even do that, we should probably aim to stupid-proof where possible. Or Max-proof, which may or may not be a subset of stupid-proof.*

*Note that I absolutely do not do this in all of my code, so this is very much a "Not as I do" sort of recommendation.

@maxachis Yeah this makes a lot of sense, I'll be sure to add this

@maxachis
Contributor

maxachis commented Oct 28, 2024

In my opinion, a config file obfuscates the code, makes each argument's purpose and proper input harder to see, and makes things more complicated for programmers to follow. The upside is that it's more user-friendly for a non-coder.

@EvilDrPurple @josh-chamberlain

As an aside, and without dipping myself too deep into this particular debate, I will say my tendency in my own code is to find a middle ground between Python modules and configuration files: configuration files that are Python modules. This way, I can document the code, properly establish its relationships to other components, and not have to worry about an alternative file format. For example:

class EndpointSchemaConfig:
    def __init__(
        self,
        input_schema: Optional[Schema] = None,
        output_schema: Optional[Schema] = None,
        input_dto_class: Optional[Type[DTOTypes]] = None,
    ):
        """

        :param input_schema: Describes the schema to be input on a request
        :param output_schema: Describes the schema to be output on a successful request
        :param input_dto_class: Describes the DTO which will be populated based on the input schema.
        """
        self.input_schema = input_schema
        self.output_schema = output_schema
        self.input_dto_class = input_dto_class
        # ...

class SchemaConfigs(Enum):
    # region Data Requests
    DATA_REQUESTS_GET_MANY = EndpointSchemaConfig(
        input_schema=GetManyDataRequestsRequestsSchema(),
        output_schema=GetManyDataRequestsResponseSchema(),
        input_dto_class=GetManyDataRequestsRequestsDTO,
    )
    DATA_REQUESTS_BY_ID_GET = EndpointSchemaConfig(
        input_schema=None, output_schema=GetByIDDataRequestsResponseSchema()
    )
    # endregion
    # ...

(In retrospect, I could have made EndpointSchemaConfig a dataclass, but oh well)
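
For comparison, a dataclass version would look roughly like this (a sketch only; Schema and DTOTypes are types from that codebase and are referenced here as forward references):

from dataclasses import dataclass
from typing import Optional, Type

@dataclass
class EndpointSchemaConfig:
    """Bundles the input schema, output schema, and input DTO class for one endpoint."""
    input_schema: Optional["Schema"] = None              # schema expected on a request
    output_schema: Optional["Schema"] = None             # schema returned on a successful request
    input_dto_class: Optional[Type["DTOTypes"]] = None   # DTO populated from the input schema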

That aside, er, aside: There is something to be said about whether our scrapers, as they exist, are easy to use, standardized, and relatively low on redundant code from scraper to scraper, but it's hard to determine that when zoomed into a single scraper. That may require a separate discussion.

@EvilDrPurple
Contributor Author

As an aside, and without dipping myself too deep into this particular debate, I will say my tendency in my own code is to find a middle ground between Python modules and configuration files: configuration files that are Python modules. This way, I can document the code, properly establish its relationships to other components, and not have to worry about an alternative file format. For example:

@maxachis I do like this approach. The config implementation I thought was being asked for is an older pattern that a previous programmer with limited experience set up; I don't like that particular implementation very much:

import os

configs = {
    "url": "",
    "department_code": "",
    "list_header": [
        "ChargeDescription",
        "CaseNum",
        "ReportDate",
        "OffenseDate",
        "Location",
        "ChargeDisposition",
    ],
}

save_dir = "./data/"
data = []

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

crimegraphics_clery(configs, save_dir)  # scraper function defined in the crimegraphics module linked below

You can see this at https://github.com/Police-Data-Accessibility-Project/scrapers/blob/c2b0fdc80b8b9fa50d21af3e3f9c76389137bd32/scrapers_library/data_portals/crimegraphics/template/crimegraphics_clery_template.py

@josh-chamberlain
Contributor

josh-chamberlain commented Oct 29, 2024

@EvilDrPurple

The way it's set up is the way we've been doing it for other portal scrapers: a scraper writer copies the template, moves it to one of the folders such as PA/allegheny_county/..., and uses it to scrape data. If we want to change this approach, either for just this one or for all of them, we can.

Ahh, now I see. Interesting. I don't think I ever understood that that was the goal of the template. I can see how you got there! I would say the geographic structure is for scrapers which apply to specific geographic sources; these portal scrapers cover the USA and don't need to be hemmed in by the geographic structure of the repo.

When we run the scraper, we might choose to focus on just one geography, which is possible with configuration. But I think the most important use case is someone dedicating some time (or using github actions) to run the CKAN scraper on all our known regions at once. I think we can do this by checking our API for sources which have a data_portal_type of CKAN, and even filter geographically there. No need to sprinkle files through the repo.

Is this making sense?

@maxachis

a middle ground between python modules vs. configuration files and have configuration files that are python modules

Sounds good to me. I'm not too picky about the format, but I'm most comfortable with the idea of running one scraper that is configurable in a clear way rather than having files for different use cases scattered about.

There is something to be said about whether our scrapers, as they exist, are easy to use, standardized, and relatively low on redundant code from scraper to scraper, but it's hard to determine that when zoomed into a single scraper. That may require a separate discussion.

Most of these scrapers are meant to be used and run independently. We are not planning to automate the whole repo or orchestrate a scraper farm. If we do one day, we could refactor—but through tough experience we have learned that the closer we tie scrapers to each other, the harder it is to write individual scrapers.

If the data source collecting scrapers share common code, that would make some sense—because they are designed to collect data sources and send them to our API (or label studio). If we have a consistent way of talking to our API or label studio's, that makes sense.

@maxachis
Contributor

If the data source collecting scrapers share common code, that would make some sense—because they are designed to collect data sources and send them to our API (or label studio). If we have a consistent way of talking to our API or label studio's, that makes sense.

Oooh, that might be cause to create some kind of mini-SDK to handle common calls to the API or label studio or whatever else we have in mind. Let's make note of that for later.

@EvilDrPurple
Contributor Author

EvilDrPurple commented Nov 4, 2024

When we run the scraper, we might choose to focus on just one geography, which is possible with configuration. But I think the most important use case is someone dedicating some time (or using github actions) to run the CKAN scraper on all our known regions at once. I think we can do this by checking our API for sources which have a data_portal_type of CKAN, and even filter geographically there. No need to sprinkle files through the repo.

@josh-chamberlain
There is one issue with this that I noticed while looking through Airtable: some data sources are labeled CKAN that aren't actually CKAN. There's also an issue where the CKAN API query URL is not always the same as the URL used for general internet user access, which is what we'd store in our database. There are also a lot of parity issues from one CKAN instance to another, meaning that to ensure relevant data is returned we will unfortunately probably have to go case by case when a new CKAN portal is added to the database.

@josh-chamberlain
Contributor

@EvilDrPurple hmm, interesting. The sources db is sort of the foundation here, so it'll be important to try to improve the accuracy of the data_portal_type property.

Could you provide an example of what you mean by parity issues? Does "package" not mean the same thing from one instance to the other? We're just trying to document the existence of a source, so I'm hoping we can find a solution.

@EvilDrPurple
Contributor Author

EvilDrPurple commented Nov 5, 2024

Could you provide an example of what you mean by parity issues? Does "package" not mean the same thing from one instance to the other? We're just trying to document the existence of a source, so I'm hoping we can find a solution.

@josh-chamberlain A package always has the same meaning across CKAN instances; what varies most is how packages are organized. Some instances put all police-related packages in a group, while others use tags to group packages, tagging a package with something like "police". With some instances, a general keyword search for "police" will return additional unrelated packages that merely mention "police" somewhere in the description, or will leave some relevant packages out.
My point is, performing the same search on every CKAN instance will return a different quality of response from each. At the very least, the general layout of the important information contained in the API response is pretty much the same between instances.
Another issue I've found is that the "organization" returned in a package's response is not always the proper agency_described. For example, an organization can be "Houston Police Department" (nice!) or "District of Columbia" (not as nice). With some instances the publishing entity is listed under the "author" key, while others only put it under extras, where key == "publisher".
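
A minimal sketch of coping with that variation when extracting the publishing entity (field names are the standard CKAN package fields; the fallback order is an assumption):

def guess_publisher(package: dict) -> str | None:
    # Best-effort guess at the publishing entity for a CKAN package:
    # check the organization title, then the author field, then the
    # "publisher" entry in extras. The fallback order is an assumption.
    organization = package.get("organization") or {}
    if organization.get("title"):
        return organization["title"]
    if package.get("author"):
        return package["author"]
    for extra in package.get("extras", []):
        if extra.get("key") == "publisher":
            return extra.get("value")
    return None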

@maxachis
Contributor

maxachis commented Nov 5, 2024

@EvilDrPurple @josh-chamberlain The question then is, how and to what degree can we anticipate and document the different forms which CKAN instances can take?

  1. Are there commonly recurring structures, or are they so diverse that any attempt at standardization is doomed to failure?
  2. Can we document general rules or best practices for people when they're probing this stuff?
  3. Basically, how do we make it so that the next person who touches this stuff is less surprised and confused than we were?

@EvilDrPurple
Contributor Author

1. Are there commonly recurring structures, or are they so diverse that any attempt at standardization is doomed to failure?

@maxachis, forgive the long paste, but below is a sample package returned from ckanapi. Link to host page: https://catalog.data.gov/dataset/nypd-arrest-data-year-to-date

From what I can tell, the layout is completely the same across instances, with the exception of the extras key and its content; this is where an instance may define its own custom keys. Other than that, the only real difference is the quality of the data filled out. For example, data.gov seems to always have the author key set to null and instead opts to use the custom publisher key under extras, while others just use the author key. Another annoyance specific to data.gov is that the relationships_as_object key is never used, when presumably it should be used to link a parent package with its children.

{
      "author": null,
      "author_email": null,
      "creator_user_id": "2b785922-9f13-491b-a3c2-2a40acbd80c2",
      "id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
      "isopen": false,
      "license_id": "notspecified",
      "license_title": "License not specified",
      "maintainer": "NYC OpenData",
      "maintainer_email": "no-reply@data.cityofnewyork.us",
      "metadata_created": "2020-11-10T17:05:36.995577",
      "metadata_modified": "2024-10-25T20:28:59.948113",
      "name": "nypd-arrest-data-year-to-date",
      "notes": "This is a breakdown of every arrest effected in NYC by the NYPD during the current year.\n This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. \n Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement. \nIn addition, information related to suspect demographics is also included. \nThis data can be used by the public to explore the nature of police enforcement activity. \nPlease refer to the attached data footnotes for additional information about this dataset.",
      "num_resources": 4,
      "num_tags": 5,
      "organization": {
          "id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
          "name": "city-of-new-york",
          "title": "City of New York",
          "type": "organization",
          "description": "",
          "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Seal_of_New_York.svg/175px-Seal_of_New_York.svg.png",
          "created": "2020-11-10T15:26:23.896065",
          "is_organization": true,
          "approval_status": "approved",
          "state": "active"
      },
      "owner_org": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
      "private": false,
      "state": "active",
      "title": "NYPD Arrest Data (Year to Date)",
      "type": "dataset",
      "url": null,
      "version": null,
      "extras": [
          {
              "key": "resource-type",
              "value": "Dataset"
          },
          {
              "key": "source_hash",
              "value": "5489c9a695b7ef8e8b45680d5b5c81cf24b2cc9d2f66497c1da7da7cf1993bc2"
          },
          {
              "key": "source_datajson_identifier",
              "value": true
          },
          {
              "key": "source_schema_version",
              "value": "1.1"
          },
          {
              "key": "accessLevel",
              "value": "public"
          },
          {
              "key": "identifier",
              "value": "https://data.cityofnewyork.us/api/views/uip8-fykc"
          },
          {
              "key": "issued",
              "value": "2020-10-28"
          },
          {
              "key": "landingPage",
              "value": "https://data.cityofnewyork.us/d/uip8-fykc"
          },
          {
              "key": "modified",
              "value": "2024-10-21"
          },
          {
              "key": "publisher",
              "value": "data.cityofnewyork.us"
          },
          {
              "key": "theme",
              "value": [
                  "Public Safety"
              ]
          },
          {
              "key": "catalog_@context",
              "value": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
          },
          {
              "key": "catalog_@id",
              "value": "https://data.cityofnewyork.us/data.json"
          },
          {
              "key": "catalog_conformsTo",
              "value": "https://project-open-data.cio.gov/v1.1/schema"
          },
          {
              "key": "catalog_describedBy",
              "value": "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
          },
          {
              "key": "harvest_object_id",
              "value": "cf1f3514-4b33-4196-b5ee-347dc395ffbb"
          },
          {
              "key": "harvest_source_id",
              "value": "1696593e-c691-4f61-a696-5dcb9e4c9b4c"
          },
          {
              "key": "harvest_source_title",
              "value": "NYC JSON"
          }
      ],
      "groups": [
          {
              "description": "Local Government Topic - for all datasets with state, local, county organizations",
              "display_name": "Local Government",
              "id": "7d625e66-9e91-4b47-badd-44ec6f16b62b",
              "image_display_url": "",
              "name": "local",
              "title": "Local Government"
          }
      ],
      "resources": [
          {
              "cache_last_updated": null,
              "cache_url": null,
              "created": "2020-11-10T17:05:37.001960",
              "description": "",
              "format": "CSV",
              "hash": "",
              "id": "c48f1a1a-5efb-4266-9572-769ed1c9b472",
              "last_modified": null,
              "metadata_modified": "2020-11-10T17:05:37.001960",
              "mimetype": "text/csv",
              "mimetype_inner": null,
              "name": "Comma Separated Values File",
              "no_real_name": true,
              "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
              "position": 0,
              "resource_type": null,
              "size": null,
              "state": "active",
              "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.csv?accessType=DOWNLOAD",
              "url_type": null
          },
          {
              "cache_last_updated": null,
              "cache_url": null,
              "created": "2020-11-10T17:05:37.001970",
              "describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.rdf",
              "describedByType": "application/rdf+xml",
              "description": "",
              "format": "RDF",
              "hash": "",
              "id": "5c137f71-4e20-49c5-bd45-a562952195fe",
              "last_modified": null,
              "metadata_modified": "2020-11-10T17:05:37.001970",
              "mimetype": "application/rdf+xml",
              "mimetype_inner": null,
              "name": "RDF File",
              "no_real_name": true,
              "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
              "position": 1,
              "resource_type": null,
              "size": null,
              "state": "active",
              "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.rdf?accessType=DOWNLOAD",
              "url_type": null
          },
          {
              "cache_last_updated": null,
              "cache_url": null,
              "created": "2020-11-10T17:05:37.001976",
              "describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.json",
              "describedByType": "application/json",
              "description": "",
              "format": "JSON",
              "hash": "",
              "id": "036b090c-488e-41fd-89f2-126fead8cda7",
              "last_modified": null,
              "metadata_modified": "2020-11-10T17:05:37.001976",
              "mimetype": "application/json",
              "mimetype_inner": null,
              "name": "JSON File",
              "no_real_name": true,
              "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
              "position": 2,
              "resource_type": null,
              "size": null,
              "state": "active",
              "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.json?accessType=DOWNLOAD",
              "url_type": null
          },
          {
              "cache_last_updated": null,
              "cache_url": null,
              "created": "2020-11-10T17:05:37.001981",
              "describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.xml",
              "describedByType": "application/xml",
              "description": "",
              "format": "XML",
              "hash": "",
              "id": "8bbb0d22-bf80-407f-bb8d-e72e2cd1fbd9",
              "last_modified": null,
              "metadata_modified": "2020-11-10T17:05:37.001981",
              "mimetype": "application/xml",
              "mimetype_inner": null,
              "name": "XML File",
              "no_real_name": true,
              "package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
              "position": 3,
              "resource_type": null,
              "size": null,
              "state": "active",
              "url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.xml?accessType=DOWNLOAD",
              "url_type": null
          }
      ],
      "tags": [
          {
              "display_name": "arrest",
              "id": "a76dff3f-cba8-42b4-ab51-1aceb059d16f",
              "name": "arrest",
              "state": "active",
              "vocabulary_id": null
          },
          {
              "display_name": "crime",
              "id": "df442823-c823-4890-8fca-805427bd8dd9",
              "name": "crime",
              "state": "active",
              "vocabulary_id": null
          },
          {
              "display_name": "law-enforcement",
              "id": "c7de583e-e2ca-4e1a-99f9-d0bdb8acd910",
              "name": "law-enforcement",
              "state": "active",
              "vocabulary_id": null
          },
          {
              "display_name": "nypd",
              "id": "f3f11364-96d9-4bda-b4f4-5d6111c39108",
              "name": "nypd",
              "state": "active",
              "vocabulary_id": null
          },
          {
              "display_name": "public-safety",
              "id": "66326f22-35c0-46a7-941b-4b7690f8b199",
              "name": "public-safety",
              "state": "active",
              "vocabulary_id": null
          }
      ],
      "relationships_as_subject": [],
      "relationships_as_object": []
}

@maxachis
Contributor

maxachis commented Nov 9, 2024

@EvilDrPurple So that would indicate that we have some standardization in terms of format, but not in how that format is populated. And some parts use completely original formatting, but only in defined areas like extras.

That's useful to know for future reference, in that presumably it can guide people on a higher level with designing these scrapers, even if it can't help as much in the lower level implementation.

@EvilDrPurple
Contributor Author

@maxachis Would this be something to outline in the README, then?

@maxachis
Contributor

@maxachis Would this be something to outline in the README, then?

It would be! And I'd defer to you as to how you think that would make the most sense to outline, since you're much more familiar with the peculiarities of CKAN than I am.

@EvilDrPurple
Contributor Author

@maxachis I think this is ready for another look. I made some changes to the README and file structure. Also, scrape_ckan_data_portals.py is now fully functional and outputs the data to a CSV file. As for having the user modify the search terms, they can be found in search_terms.py; they should be fairly simple to modify for anyone looking to do a different search.

maxachis
maxachis previously approved these changes Nov 12, 2024
Contributor

@maxachis maxachis left a comment

The README is great, and it works like a charm! I was amazed at how much data I was able to pull with just one run.

My main recommendations involve revising or adding some documentation, and extracting some logic into separate functions to enhance legibility. Other than that, I think it's good, but I will defer to @josh-chamberlain for final judgment 🔨

url = f"{base_url}?collection_package_id={collection_id}&page={page}"
soup = _get_soup(url)

with ThreadPoolExecutor(max_workers=10) as executor:
Contributor

With the amount of indentation and the size of this function, I think it's a candidate for extracting some of the logic into a separate function! The with ThreadPoolExecutor part in particular is worth extracting, both because it's quite complex and involves some low-level components, and because an additional docstring there would provide helpful information.
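
The kind of extraction meant here might look roughly like this (a hypothetical sketch; the fetch parameter stands in for the scraper's existing _get_soup function):

from concurrent.futures import ThreadPoolExecutor, as_completed
from typing import Callable

def fetch_pages(urls: list[str], fetch: Callable[[str], object], max_workers: int = 10) -> list:
    # Hypothetical helper: fetch and parse each page concurrently,
    # where `fetch` would be the scraper's existing _get_soup function.
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = [executor.submit(fetch, url) for url in urls]
        for future in as_completed(futures):
            results.append(future.result())
    return results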


for result in tqdm(results):
if "extras" in result.keys():
collections = [
Contributor

This collections definition could benefit from being extracted and spread out onto multiple lines. As is, if someone's reading this, they have to keep track of

  1. A for loop
  2. An if conditional to check if extra["key"] is "collection_metadata"
  3. An if conditional to check if extra["value"] is "true"
  4. An if conditional to check if result["resources"] is null.

If someone was running this in a debugger, I imagine it would get a little messy.
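
Something like the following is what that extraction might look like (a hypothetical sketch; the key names come from the checks listed above):

def is_collection(result: dict) -> bool:
    # Hypothetical helper: a result counts as a collection if its extras
    # contain a "collection_metadata" entry whose value is "true".
    for extra in result.get("extras", []):
        if extra.get("key") == "collection_metadata" and extra.get("value") == "true":
            return True
    return False

def find_collections(results: list[dict]) -> list[dict]:
    # Hypothetical helper: keep only collection packages with no attached resources.
    return [
        result for result in results
        if is_collection(result) and not result.get("resources")
    ]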

# If there are no resources available
elif len(result["resources"]) == 0:
# Use the dataset's external landing page
package.url = [
Contributor

Recommend extracting this to a small helper function due to the density of the logic!

:return: Updated list of results.
"""
key = list(search_terms[0].keys())[1]
for search in tqdm(search_terms):
Contributor

The use of tqdm is an excellent touch, especially since it can take a little while for some of these to load 👌

Contributor Author

Yes! It used to take ~30 minutes to run before I limited the number of collections being searched for, but I figured I'd keep it so that anyone running the program doesn't think it's frozen.


# Remove packages with no data or landing page
if len(result["resources"]) == 0:
landing_page = next(
Contributor

Let's extract some of this logic! This is a next, a for loop, two key-value accesses, and a boolean conditional condensed in a very small amount of real estate!
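
For instance, the next(...) expression could become a small named helper along these lines (hypothetical; assumes the landing page is stored in extras under "landingPage", as in the sample package earlier in this thread):

def get_landing_page(result: dict) -> str | None:
    # Hypothetical helper: return the dataset's external landing page from
    # extras, if one is listed; assumes the "landingPage" key.
    return next(
        (extra["value"] for extra in result.get("extras", []) if extra.get("key") == "landingPage"),
        None,
    )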

}
```

---
Contributor

For this part here, I'm curious if this is providing anything that's not already in the docstrings for these functions.

Either way, I think it's best to migrate whatever is unique into the docstrings of the functions themselves and then either remove this section or condense it to a higher level.

Having this documentation in two places raises the risk that it will become misaligned, and if folks can easily access this logic in the docstrings themselves, there's not as much need to have it in the README.

Contributor

I'm wondering if this page overall might benefit from some additional documentation describing each of the unique parameters for each section. For example: what does url mean? What about terms? What do the ids represent, and is that the same for both group_search and organization_search? Some of these can be inferred, but there's uncertainty there that I think would benefit from being disambiguated.

@josh-chamberlain
Contributor

@EvilDrPurple nice, I was able to run this. Gosh, that's a lot of potential sources! It looks like many of them would be relevant.

Contributor

@josh-chamberlain josh-chamberlain left a comment

I made tiny tweaks to the README but I think this is really good!

@josh-chamberlain josh-chamberlain merged commit a81c8e7 into main Nov 13, 2024
1 check passed
@josh-chamberlain josh-chamberlain deleted the ckan-scraper branch November 13, 2024 20:15