Endpoint to match agencies for submission #550

Closed
1 of 6 tasks
josh-chamberlain opened this issue Nov 13, 2024 · 2 comments
Labels: api, fixed_in_dev (This is merged into the dev environment and waiting to be merged into main)

Comments

@josh-chamberlain
Contributor

josh-chamberlain commented Nov 13, 2024

Context

We have tools which generate batches of Data Sources to submit to our db. The biggest hurdle is typically that agencies out in the world aren't named exactly as they are in our database; we need an agency_described which matches our database. There are 4 possible cases:

  • * happy: the name is exactly the same in our database, and we get a single match. Hooray!
  • ~ different: the agency exists in our database under a different name.
  • + new: the agency does not exist in our database and we need to add it.
  • x unknown: we don't even have an external name for the agency, so we have nothing to guess from. That's out of scope here, but work has begun in "Identify agencies" (data-source-identification#15).

Requirements

  • make an agency-match endpoint which we can use to find proper names for our agencies
  • It should accept a single request with
    • external name
    • county
    • state
    • locality
  • It should return
    • agency name: great for CSV submissions
    • agency id: good for hitting the API
  • It should return one of 3 possible entries:
    • exact match, if one confident match is detected; this covers the happy and different paths
    • possible match, if an exact match is not found but one or more agencies in the same location might match
    • no match, if no agencies are found for that location or none meet the threshold for being a possible match
  • Determine if this should be an endpoint vs. an internal tool used by the scrapers. For example, a read-only SQLite database of agencies (which could minimize endpoint calls).
    • determined that this should be an endpoint because of the broad usage
  • Agencies may have the same name across different states or counties. We should disambiguate using counties or states.
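
A minimal sketch of what the request and response for this endpoint could look like, assuming Python dataclasses. The names `AgencyMatchRequest`, `MatchedAgency`, `AgencyMatchResponse`, `MatchStatus`, and the status strings are hypothetical and chosen only for illustration; the requirements above fix only the fields (external name, county, state, locality in; agency name and id out) and the three possible outcomes. Treating county and locality as optional is also an assumption.

```python
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import List, Optional


class MatchStatus(Enum):
    """The three outcomes listed in the requirements (string values are assumed)."""
    EXACT = "exact_match"
    POSSIBLE = "possible_match"
    NONE = "no_match"


@dataclass
class AgencyMatchRequest:
    external_name: str
    state: str
    # Assumption: county and locality may be missing for some agencies.
    county: Optional[str] = None
    locality: Optional[str] = None


@dataclass
class MatchedAgency:
    agency_id: int      # good for hitting the API
    agency_name: str    # great for CSV submissions


@dataclass
class AgencyMatchResponse:
    status: MatchStatus
    agencies: List[MatchedAgency] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Convert to a plain dict that can be serialized as JSON."""
        payload = asdict(self)
        payload["status"] = self.status.value
        return payload
```

With this shape, an exact match would carry a single entry in `agencies`, a possible match one or more, and a no match an empty list.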

Open questions

How would we do this internally?

We have several possible ways of approaching this, depending on our preferences and how thorough we want to be. This includes:

josh-chamberlain transferred this issue from Police-Data-Accessibility-Project/scrapers on Nov 13, 2024
josh-chamberlain changed the title from "Utility for getting PDAP agencies for data source collectors" to "Endpoint to match agencies for submission" on Dec 4, 2024
josh-chamberlain transferred this issue from Police-Data-Accessibility-Project/data-source-identification on Dec 4, 2024
@maxachis
Contributor

Looking at this now!

@josh-chamberlain When we say it should accept a batch, are we thinking of providing that batch via CSV, JSON, or both?

@maxachis
Contributor

maxachis commented Dec 10, 2024

In terms of implementation, here's my initial thinking:

Steps

  1. Get all agencies whose location id matches the location id derived from the state/county/locality.
    a. This assumes exact matching.
  2. Of the agencies retrieved, perform fuzzy matching on names. Exact matches are labeled as such, while possible matches (i.e., those whose similarity is above some threshold) are included. Anything below the threshold is not returned.
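
The two steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: it uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for whatever fuzzy matcher ends up being used, the `0.7` threshold is arbitrary, and the function names, the dict-based agency records, and the status strings are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and trim a name so comparison ignores case and outer whitespace."""
    return name.strip().lower()


def similarity(a: str, b: str) -> float:
    """Similarity ratio in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_agency(external_name, location_id, agencies, threshold=0.7):
    """agencies: iterable of dicts with 'id', 'name', 'location_id' (assumed shape)."""
    # Step 1: keep only agencies at exactly this location id.
    candidates = [a for a in agencies if a["location_id"] == location_id]

    # Step 2: an identical (normalized) name is an exact match if it is unique.
    exact = [a for a in candidates
             if normalize(a["name"]) == normalize(external_name)]
    if len(exact) == 1:
        return {"status": "exact_match", "agencies": exact}

    # Otherwise, return everything at or above the threshold, best match first.
    scored = [(similarity(external_name, a["name"]), a) for a in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    possible = [a for score, a in scored if score >= threshold]
    if possible:
        return {"status": "possible_match", "agencies": possible}

    # Nothing at this location, or nothing similar enough.
    return {"status": "no_match", "agencies": []}
```

Because candidates are filtered by location first, two agencies with the same name in different counties or states do not collide; only the one at the requested location can be returned.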

Possible Outcomes

A. Exact match
B. Possible matches
C. No matches

Other Considerations

  • This will be premised on locations being exactly correct. Misspellings or alternative names (e.g., "St." instead of "Saint") won't be considered.
  • If successful, this could be extended to other components, such as data sources.
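
To make the first consideration concrete, here is a small sketch of what exact location matching implies. The lookup table, its id value, and the `resolve_location_id` helper are hypothetical and exist only to illustrate the limitation; they are not the actual schema.

```python
from typing import Dict, Optional, Tuple

LocationKey = Tuple[str, str, str]  # (state, county, locality)

# Hypothetical table of known locations; the id is made up for illustration.
LOCATION_IDS: Dict[LocationKey, int] = {
    ("Minnesota", "Ramsey", "Saint Paul"): 1,
}


def resolve_location_id(state: str, county: str, locality: str,
                        table: Dict[LocationKey, int] = LOCATION_IDS) -> Optional[int]:
    """Exact lookup only: any spelling difference yields no location id."""
    return table.get((state, county, locality))


found = resolve_location_id("Minnesota", "Ramsey", "Saint Paul")
missed = resolve_location_id("Minnesota", "Ramsey", "St. Paul")
```

Here `found` holds the location id, while `missed` is `None` because "St. Paul" is not spelled the same as the stored "Saint Paul". In that case no agencies would be retrieved in step 1, and the result would be a no match regardless of how similar the agency names are.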

Projects
Status: Done