CKAN general scraper #246
Conversation
Hello @EvilDrPurple, sorry for the delay! I was able to run it after installing requirements. I think after making this beginner friendly, we could call it approved:
- could you please add a requirements.txt file?
- it would also be good to have, right after the introduction, a header for "Running this scraper" with instructions for setting up a venv and installing the requirements there
- give an example command for running the appropriate script, including with the args
Once that's done, do you want to try using it to find some data sources? Below are some portals which purport to use CKAN in our database. Can you successfully get something like a submission to our db using the scraper / this list?
https://opendata.cabq.gov/
https://data.boston.gov/
https://data.birminghamal.gov/
https://open.jacksonms.gov/
https://data.houstontx.gov/
https://data.milwaukee.gov/
https://data.ci.newark.nj.us/
https://data.ok.gov/
https://data.sanantonio.gov/
https://data.sanjoseca.gov/
https://data.vbgov.com/
@josh-chamberlain Requested changes made, I'll work on the next part now
@EvilDrPurple thanks for the updates! Scraping multiple portals will be important, so it's good to see you added that. However, I'm still a bit confused (as the resident code dummy) about how to use this. It's a little complex. It seems part of the current complexity is due to the different options we have for approaching what's there; a couple of things that might help:
- simpler tree:
- here's a snippet to illustrate what I mean about letting people set args in the config:
Do you think this simpler approach is possible? Any roadblocks or major things I'm missing that we would be leaving out this way?
To piggyback off of what @josh-chamberlain is saying, I think the four Python files would each benefit from having a module-level docstring at the top of the file -- this way, it's easier to determine at a glance, for example, what each file does.

Ideally, everyone would read the README thoroughly and then engage with each section, but considering I didn't even do that, we should probably aim to stupid-proof where possible. Or Max-proof, which may or may not be a subset of stupid-proof.*

*Note that I absolutely do not do this in all of my code, so this is very much a "Not as I do" sort of recommendation.
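For illustration, a module-level docstring along these lines would do the trick (the described behavior and wording here are invented placeholders, not the PR's actual code):

```python
"""Collect law-enforcement-related package metadata from a CKAN data portal.

Placeholder docstring for illustration only: it summarizes, in a sentence or
two, what the module does and how it is meant to be invoked, so a reader can
orient themselves without opening the README.
"""
```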
If I'm reading into this correctly, that is what this does. You call the
The way that it is set up is the way we've been doing it in other scraper portals. It's for a scraper writer to copy the template and move to one of the folders such as
The use of the template is optional, but yes. The args are passed to the
Something like this is good and what I would envision being added to the current template. Just nitpicky, but I'm not sure kwargs would work with a dict; there could be a way I don't know of to make it work (besides, kwargs is more for advanced users anyway).
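For what it's worth, a plain dict can be unpacked into keyword arguments with `**`, so passing a config dict that way should work; a minimal sketch with invented names, not the scraper's actual interface:

```python
# Illustrative only: the function and config fields are invented for this example.
def run_scraper(url: str, terms=None, pages: int = 1) -> None:
    print(f"Scraping {url} for {terms} across {pages} page(s)")

config = {
    "url": "https://data.example.gov",
    "terms": ["police", "arrest"],
    "pages": 2,
}

# ** unpacks the dict's keys as keyword arguments.
run_scraper(**config)
```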
@maxachis Yeah this makes a lot of sense, I'll be sure to add this |
@EvilDrPurple @josh-chamberlain As an aside, and without dipping myself too deep into this particular debate, I will say my tendency in my own code is to find a middle ground between Python modules vs. configuration files and have configuration files that are Python modules. This way, I can document the code, properly establish its relationships to other components, and not have to worry about an alternative file format. For example:

```python
from enum import Enum
from typing import Optional, Type

# Schema, DTOTypes, and the concrete schema/DTO classes referenced below
# are project-specific imports, omitted here.


class EndpointSchemaConfig:
    def __init__(
        self,
        input_schema: Optional[Schema] = None,
        output_schema: Optional[Schema] = None,
        input_dto_class: Optional[Type[DTOTypes]] = None,
    ):
        """
        :param input_schema: Describes the schema to be input on a request
        :param output_schema: Describes the schema to be output on a successful request
        :param input_dto_class: Describes the DTO which will be populated based on the input schema.
        """
        self.input_schema = input_schema
        self.output_schema = output_schema
        self.input_dto_class = input_dto_class

    # ...


class SchemaConfigs(Enum):
    # region Data Requests
    DATA_REQUESTS_GET_MANY = EndpointSchemaConfig(
        input_schema=GetManyDataRequestsRequestsSchema(),
        output_schema=GetManyDataRequestsResponseSchema(),
        input_dto_class=GetManyDataRequestsRequestsDTO,
    )
    DATA_REQUESTS_BY_ID_GET = EndpointSchemaConfig(
        input_schema=None, output_schema=GetByIDDataRequestsResponseSchema()
    )
    # endregion
    # ...
```

(In retrospect, I could have made

That aside, er, aside: there is something to be said about whether our scrapers, as they exist, are easy to use, standardized, and relatively low on redundant code from scraper to scraper, but it's hard to determine that when zoomed into a single scraper. That may require a separate discussion.
@maxachis I do like this approach. The config implementation I thought was being asked for is an older pattern that a previous programmer with limited experience set up; I don't like this particular implementation very much:

```python
import os

# crimegraphics_clery is the scraper function defined elsewhere in the repo.

configs = {
    "url": "",
    "department_code": "",
    "list_header": [
        "ChargeDescription",
        "CaseNum",
        "ReportDate",
        "OffenseDate",
        "Location",
        "ChargeDisposition",
    ],
}
save_dir = "./data/"
data = []

if not os.path.exists(save_dir):
    os.makedirs(save_dir)

crimegraphics_clery(configs, save_dir)
```
Ahh, now I see. Interesting. I don't think I ever understood that that was the goal of the template. I can see how you got there! I would say that the geographic structure is for scrapers which apply to specific geographic sources; these portal scrapers cover

When we run the scraper, we might choose to focus on just one geography, which is possible with configuration. But I think the most important use case is someone dedicating some time (or using GitHub Actions) to run the CKAN scraper on all our known regions at once. I think we can do this by checking our API for sources which have a

Is this making sense?
Sounds good to me. I'm not too picky about the format, but I'm most comfortable with the idea of running one scraper that is configurable in a clear way rather than having files for different use cases scattered about.
Most of these scrapers are to be used and run independently. We are not planning to automate the whole repo or orchestrate a scraper farm. If we do one day, we could refactor—but through tough experience we have learned that the closer we tie scrapers to each other, the harder it is to write individual scrapers. If the data-source-collecting scrapers share common code, that would make some sense—because they are designed to collect data sources and send them to our API (or Label Studio). If we have a consistent way of talking to our API or Label Studio's, that makes sense.
Oooh, that might be cause to create some kind of mini-SDK to handle common calls to the API or Label Studio or whatever else we have in mind. Let's make note of that for later.
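A very rough sketch of what one call in such a mini-SDK could look like (the endpoint, fields, and function name below are entirely hypothetical, not the project's actual API):

```python
import requests

API_BASE = "https://api.example.org"  # hypothetical base URL


def submit_data_source(name: str, source_url: str, agency: str, api_key: str) -> dict:
    """Submit a candidate data source to the shared API (illustrative only)."""
    response = requests.post(
        f"{API_BASE}/data-sources",  # hypothetical endpoint
        json={"name": name, "source_url": source_url, "agency": agency},
        headers={"Authorization": f"Bearer {api_key}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()
```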
@josh-chamberlain
@EvilDrPurple hmm, interesting. The sources db is sort of the foundation here, so it'll be important to try to improve the accuracy of the

Could you provide an example of what you mean by parity issues? Does "package" not mean the same thing from one instance to the other? We're just trying to document the existence of a source, so I'm hoping we can find a solution.
@josh-chamberlain A package always means the same thing across CKAN instances; however, the way packages are sorted is what varies most. Some instances will have all police data related packages in a
@EvilDrPurple @josh-chamberlain The question then is, how and to what degree can we anticipate and document the different forms which CKAN instances can take?
@maxachis, forgive the long paste, but below is a sample package returned from ckanapi. Link to host page: https://catalog.data.gov/dataset/nypd-arrest-data-year-to-date

From what I can tell, the layout is completely the same across instances with the exception of the

```json
{
"author": null,
"author_email": null,
"creator_user_id": "2b785922-9f13-491b-a3c2-2a40acbd80c2",
"id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"isopen": false,
"license_id": "notspecified",
"license_title": "License not specified",
"maintainer": "NYC OpenData",
"maintainer_email": "no-reply@data.cityofnewyork.us",
"metadata_created": "2020-11-10T17:05:36.995577",
"metadata_modified": "2024-10-25T20:28:59.948113",
"name": "nypd-arrest-data-year-to-date",
"notes": "This is a breakdown of every arrest effected in NYC by the NYPD during the current year.\n This data is manually extracted every quarter and reviewed by the Office of Management Analysis and Planning. \n Each record represents an arrest effected in NYC by the NYPD and includes information about the type of crime, the location and time of enforcement. \nIn addition, information related to suspect demographics is also included. \nThis data can be used by the public to explore the nature of police enforcement activity. \nPlease refer to the attached data footnotes for additional information about this dataset.",
"num_resources": 4,
"num_tags": 5,
"organization": {
"id": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
"name": "city-of-new-york",
"title": "City of New York",
"type": "organization",
"description": "",
"image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/c/ca/Seal_of_New_York.svg/175px-Seal_of_New_York.svg.png",
"created": "2020-11-10T15:26:23.896065",
"is_organization": true,
"approval_status": "approved",
"state": "active"
},
"owner_org": "1149ee63-2fff-494e-82e5-9aace9d3b3bf",
"private": false,
"state": "active",
"title": "NYPD Arrest Data (Year to Date)",
"type": "dataset",
"url": null,
"version": null,
"extras": [
{
"key": "resource-type",
"value": "Dataset"
},
{
"key": "source_hash",
"value": "5489c9a695b7ef8e8b45680d5b5c81cf24b2cc9d2f66497c1da7da7cf1993bc2"
},
{
"key": "source_datajson_identifier",
"value": true
},
{
"key": "source_schema_version",
"value": "1.1"
},
{
"key": "accessLevel",
"value": "public"
},
{
"key": "identifier",
"value": "https://data.cityofnewyork.us/api/views/uip8-fykc"
},
{
"key": "issued",
"value": "2020-10-28"
},
{
"key": "landingPage",
"value": "https://data.cityofnewyork.us/d/uip8-fykc"
},
{
"key": "modified",
"value": "2024-10-21"
},
{
"key": "publisher",
"value": "data.cityofnewyork.us"
},
{
"key": "theme",
"value": [
"Public Safety"
]
},
{
"key": "catalog_@context",
"value": "https://project-open-data.cio.gov/v1.1/schema/catalog.jsonld"
},
{
"key": "catalog_@id",
"value": "https://data.cityofnewyork.us/data.json"
},
{
"key": "catalog_conformsTo",
"value": "https://project-open-data.cio.gov/v1.1/schema"
},
{
"key": "catalog_describedBy",
"value": "https://project-open-data.cio.gov/v1.1/schema/catalog.json"
},
{
"key": "harvest_object_id",
"value": "cf1f3514-4b33-4196-b5ee-347dc395ffbb"
},
{
"key": "harvest_source_id",
"value": "1696593e-c691-4f61-a696-5dcb9e4c9b4c"
},
{
"key": "harvest_source_title",
"value": "NYC JSON"
}
],
"groups": [
{
"description": "Local Government Topic - for all datasets with state, local, county organizations",
"display_name": "Local Government",
"id": "7d625e66-9e91-4b47-badd-44ec6f16b62b",
"image_display_url": "",
"name": "local",
"title": "Local Government"
}
],
"resources": [
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001960",
"description": "",
"format": "CSV",
"hash": "",
"id": "c48f1a1a-5efb-4266-9572-769ed1c9b472",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001960",
"mimetype": "text/csv",
"mimetype_inner": null,
"name": "Comma Separated Values File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 0,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.csv?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001970",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.rdf",
"describedByType": "application/rdf+xml",
"description": "",
"format": "RDF",
"hash": "",
"id": "5c137f71-4e20-49c5-bd45-a562952195fe",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001970",
"mimetype": "application/rdf+xml",
"mimetype_inner": null,
"name": "RDF File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 1,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.rdf?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001976",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.json",
"describedByType": "application/json",
"description": "",
"format": "JSON",
"hash": "",
"id": "036b090c-488e-41fd-89f2-126fead8cda7",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001976",
"mimetype": "application/json",
"mimetype_inner": null,
"name": "JSON File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 2,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.json?accessType=DOWNLOAD",
"url_type": null
},
{
"cache_last_updated": null,
"cache_url": null,
"created": "2020-11-10T17:05:37.001981",
"describedBy": "https://data.cityofnewyork.us/api/views/uip8-fykc/columns.xml",
"describedByType": "application/xml",
"description": "",
"format": "XML",
"hash": "",
"id": "8bbb0d22-bf80-407f-bb8d-e72e2cd1fbd9",
"last_modified": null,
"metadata_modified": "2020-11-10T17:05:37.001981",
"mimetype": "application/xml",
"mimetype_inner": null,
"name": "XML File",
"no_real_name": true,
"package_id": "f468fe8a-a319-464f-9374-f77128ffc9dc",
"position": 3,
"resource_type": null,
"size": null,
"state": "active",
"url": "https://data.cityofnewyork.us/api/views/uip8-fykc/rows.xml?accessType=DOWNLOAD",
"url_type": null
}
],
"tags": [
{
"display_name": "arrest",
"id": "a76dff3f-cba8-42b4-ab51-1aceb059d16f",
"name": "arrest",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "crime",
"id": "df442823-c823-4890-8fca-805427bd8dd9",
"name": "crime",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "law-enforcement",
"id": "c7de583e-e2ca-4e1a-99f9-d0bdb8acd910",
"name": "law-enforcement",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "nypd",
"id": "f3f11364-96d9-4bda-b4f4-5d6111c39108",
"name": "nypd",
"state": "active",
"vocabulary_id": null
},
{
"display_name": "public-safety",
"id": "66326f22-35c0-46a7-941b-4b7690f8b199",
"name": "public-safety",
"state": "active",
"vocabulary_id": null
}
],
"relationships_as_subject": [],
"relationships_as_object": []
}
```
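For reference, a package like the one above can be fetched with the ckanapi client roughly like this; the portal URL and dataset id come from the sample, but treat the snippet as an illustrative sketch rather than the scraper's actual code:

```python
from ckanapi import RemoteCKAN

# catalog.data.gov is itself a CKAN instance
portal = RemoteCKAN("https://catalog.data.gov", user_agent="ckan-scraper-example")
package = portal.action.package_show(id="nypd-arrest-data-year-to-date")

print(package["title"])  # "NYPD Arrest Data (Year to Date)"
print([resource["format"] for resource in package["resources"]])  # ["CSV", "RDF", "JSON", "XML"]
```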
@EvilDrPurple So that would indicate that we have some standardization in terms of format, but not in how that format is populated. And some parts use completely original formatting, but only in defined areas like

That's useful to know for future reference, in that presumably it can guide people on a higher level with designing these scrapers, even if it can't help as much in the lower-level implementation.
@maxachis Would this be something to outline in the README, then? |
It would be! And I'd defer to you as to how you think that would make the most sense to outline, since you're much more familiar with the peculiarities of CKAN than I am.
@maxachis I think this is ready for another look. I made some changes to the README and file structure. Also the
The README is great, and it works like a charm! I was amazed at how much data I was able to pull with just one run.
My main recommendations involve revising or adding some documentation, and extracting some logic into separate functions to enhance legibility. Other than that, I think it's good, but will defer to @josh-chamberlain for final judgment 🔨
```python
url = f"{base_url}?collection_package_id={collection_id}&page={page}"
soup = _get_soup(url)
# ...
with ThreadPoolExecutor(max_workers=10) as executor:
```
With the amount of indentation and the size of this function, I think it's a candidate for extracting some of the logic to a separate function! I think the `with ThreadPoolExecutor` part in particular is useful to extract, both because it's quite complex and involves some low-level components, and because an additional docstring would help provide some additional information.
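A minimal sketch of the kind of extraction being suggested, with the helper name and parameters invented for illustration rather than taken from the PR:

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch_one, items, max_workers: int = 10) -> list:
    """Run fetch_one over every item in a thread pool and return results in input order.

    fetch_one is any callable that takes a single item (e.g. a package id or URL)
    and returns its parsed result.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch_one, items))
```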
```python
for result in tqdm(results):
    if "extras" in result.keys():
        collections = [
```
This collections definition could benefit from being extracted and spread out onto multiple lines. As is, if someone's reading this, they have to keep track of:
- a for loop
- an if conditional to check whether extra["key"] is "collection_metadata"
- an if conditional to check whether extra["value"] is "true"
- an if conditional to check whether result["resources"] is null

If someone was running this in a debugger, I imagine it would get a little messy.
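A rough sketch of one way to pull that check out, based only on the conditions described above (the helper name and exact semantics are guesses, not the PR's code):

```python
def is_collection(result: dict) -> bool:
    """Return True if any 'extras' entry marks this package as collection metadata.

    Illustrative helper only; the real filter may also need to handle the
    empty-resources case mentioned above.
    """
    return any(
        extra["key"] == "collection_metadata" and extra["value"] == "true"
        for extra in result.get("extras", [])
    )


# e.g. collections = [result for result in results if is_collection(result)]
```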
```python
# If there are no resources available
elif len(result["resources"]) == 0:
    # Use the dataset's external landing page
    package.url = [
```
Recommend extracting this to a small helper function due to the density of the logic!
```python
    :return: Updated list of results.
    """
    key = list(search_terms[0].keys())[1]
    for search in tqdm(search_terms):
```
The use of tqdm is an excellent touch, especially since it can take a little while for some of these to load 👌
Yes! It used to take ~30 minutes to run before I limited the number of collections being searched for, but I figured I'd keep the progress bar so that anyone running the program doesn't think it's frozen.
```python
# Remove packages with no data or landing page
if len(result["resources"]) == 0:
    landing_page = next(
```
Let's extract some of this logic! This is a next, a for loop, two key-value accesses, and a boolean conditional condensed in a very small amount of real estate!
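A small sketch of one way to pull that apart, keyed off the "landingPage" extra seen in the sample package (the helper name and exact semantics are guesses, not the PR's code):

```python
from typing import Optional


def get_landing_page(result: dict) -> Optional[str]:
    """Return the dataset's external landing page URL from its extras, if present."""
    return next(
        (extra["value"] for extra in result.get("extras", []) if extra["key"] == "landingPage"),
        None,
    )
```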
For this part here, I'm curious if this is providing anything that's not already in the docstrings for these functions.
Either way, I think it's best to migrate whatever is unique into the docstrings of the functions themselves and then either remove this section or condense it to a higher level.
Having this documentation in two places raises the risk that it will become misaligned, and if folks can easily access this logic in the docstrings themselves, there's not as much need to have it in the README.
I'm wondering if this page overall might benefit from some additional documentation describing each of the unique parameters for each section. For example: what does `url` mean? What about `terms`? What do the ids represent, and is this the same for both `group_search` and `organization_search`? Some of these can be inferred, but there's uncertainty in there that I think would benefit from being disambiguated.
@EvilDrPurple nice, I was able to run this. Gosh, that's a lot of potential sources! It looks like many of them would be relevant.
I made tiny tweaks to the README but I think this is really good!