Link parallel cites #76
Comments
This comment was marked as off-topic.
I had a lengthy comment above that was mostly off topic, but it looks like as we've added parentheticals to CourtListener, we've run headlong into this. (Happy to give beta access to anybody interested.)

I'm not sure what the fix is. Right now it's double counting the depth, like you say, @jcushman, and it's adding the parentheticals to our DB multiple times as well. Not great!

I guess we could overhaul the API of eyecite to return citations that link these kinds of things together. For example, instead of returning a flat list of citations, we could return a list of lists of citations. Something like:

    [
      "citationGroup1": [citations: [{reporter, page, volume}, {reporter, page, volume}], parenthetical: "Sky is blue"],
      "citationGroup2": [...],
    ]

That'd be a pretty big overhaul. Another approach would be to add a linkage attribute to subsequent citations, allowing our flat list to remain. Something like:

    [
      {reporter, page, volume, parenthetical, parallel_to: null},
      {reporter, page, volume, parenthetical, parallel_to: 0},
    ]

If that's not clear, the

Hm. The first approach feels more correct, but the second one sure seems simpler. @mattdahl, maybe you have thoughts too? |
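To make the two shapes above concrete, here is a minimal Python sketch. It is not eyecite's actual API: the `Cite` and `CitationGroup` classes, their fields, and the `unique_parentheticals` helper are hypothetical names invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Cite:
    """Hypothetical citation record (not an eyecite class)."""
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    # Approach 2 only: index of the cite this one is parallel to, or None.
    parallel_to: Optional[int] = None


@dataclass
class CitationGroup:
    """Approach 1: one group per cited case, parenthetical stored once."""
    citations: List[Cite] = field(default_factory=list)
    parenthetical: Optional[str] = None


# Approach 1: a list of groups.
grouped = [
    CitationGroup(
        citations=[Cite("1", "U.S.", "1"), Cite("2", "S. Ct.", "2")],
        parenthetical="Sky is blue",
    )
]

# Approach 2: the flat list stays, with a linkage attribute on later cites.
flat = [
    Cite("1", "U.S.", "1", "Sky is blue", parallel_to=None),
    Cite("2", "S. Ct.", "2", "Sky is blue", parallel_to=0),
]


def unique_parentheticals(cites: List[Cite]) -> List[str]:
    """Collect parentheticals from the flat list, skipping linked parallel cites."""
    return [
        c.parenthetical
        for c in cites
        if c.parallel_to is None and c.parenthetical
    ]
```

With the flat shape, a caller has to know to filter on `parallel_to` to avoid double counting; with the grouped shape, every caller pays for the extra layer even when no cites are parallel.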
Quick answer with the brain I have available ... :)

Data representation

I like Mike's second approach fine. I think the first one centers an edge case too much; you shouldn't have to wade through that layer when most cites aren't parallel. A third option that I would strongly consider would be to only include the first cite in the output list, and attach the parallel cites to that:

This has the benefit that if you're a naive caller of the library who has no idea about parallel cites, you'll do a reasonable default thing of ignoring them instead of double counting anything. I'm not sure what data exactly belongs in (Note that besides doubled parentheticals, there are a couple of other weird things about the current situation, like the first cite has a

Implementation

OK, the underlying scenario we're trying to detect is if you have tokens like

So where do we detect that? A place I can think of to handle that is in add_post_citation: Line 76 in aaf0a20

If we get to

This answer is a little fiddly because we don't know what tokens

Since citations are annoying, some special cases to consider ...
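The third option described in this comment (only the first cite appears in the output, with parallel cites attached to it) might look roughly like the sketch below. The original snippet did not survive in this copy of the thread, so `ParsedCite`, `parallel_cites`, and `fold_parallel` are hypothetical names, and the sketch assumes the run of parallel cites has already been detected by some other step.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParsedCite:
    """Hypothetical citation record with attached parallel cites."""
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    parallel_cites: List["ParsedCite"] = field(default_factory=list)


def fold_parallel(run: List[ParsedCite]) -> ParsedCite:
    """Keep the first cite of a parallel run and attach the rest to it.

    The parenthetical is kept once, on the first cite, and cleared on
    the attached cites so it cannot be counted twice.
    """
    first, *rest = run
    for other in rest:
        if first.parenthetical is None:
            first.parenthetical = other.parenthetical
        other.parenthetical = None
        first.parallel_cites.append(other)
    return first


# "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" as currently extracted:
run = [
    ParsedCite("1", "U.S.", "1", "overruling ..."),
    ParsedCite("2", "S. Ct.", "2", "overruling ..."),
]
output = [fold_parallel(run)]
```

A caller that only iterates over `output` sees one citation and one parenthetical, which is the "reasonable default" behavior the comment argues for.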
|
My take: The identification of parallel citations cannot occur before resolution occurs. Otherwise, we run into the problem that Jack mentioned -- citations that are right next to each other but that refer to different cases. The only way to logically differentiate them is to wait until the citations are resolved, and then check whether they resolve to the same resource to determine whether they're actually parallel citations.

So I would argue that the return output of

If the user needs to deal with parallel citations, they should then run

The downside to this approach is that it depends on the user having a competent backend for doing resolutions. This is not a problem for CL and CAP, but the default implementation of

Happy to discuss further! |
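As a rough illustration of the resolve-first idea, the sketch below marks two adjacent citations as parallel only when a resolver maps them to the same resource. Everything here is an assumption made for the example: `parallel_pairs` is a hypothetical helper, citations are plain strings, and `TOY_DB` stands in for a real resolution backend.

```python
from typing import Callable, Hashable, List, Optional, Tuple


def parallel_pairs(
    cites: List[str],
    resolve: Callable[[str], Optional[Hashable]],
) -> List[Tuple[int, int]]:
    """Return index pairs of adjacent cites that resolve to the same resource.

    Cites that fail to resolve (None) are never paired, so the result is
    only as good as the resolver behind it.
    """
    pairs = []
    for i in range(len(cites) - 1):
        left = resolve(cites[i])
        right = resolve(cites[i + 1])
        if left is not None and left == right:
            pairs.append((i, i + 1))
    return pairs


# Toy lookup table standing in for a resolution backend (hypothetical data).
TOY_DB = {
    "1 U.S. 1": "case-A",
    "2 S. Ct. 2": "case-A",
    "5 F.3d 9": "case-B",
}

cites = ["1 U.S. 1", "2 S. Ct. 2", "5 F.3d 9"]
found = parallel_pairs(cites, TOY_DB.get)
```

The dependence on the resolver is visible here: remove `"2 S. Ct. 2"` from the table and the pair is no longer detected.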
I think I agree with Matt in terms of when this should happen. That's sort of what we've got right now when searching for parentheticals. The way it works now is that if two citations have the same parenthetical, they're the same and we call that good enough, but we could do better, I'm sure. |
I don't want this to be the answer, though that doesn't mean it isn't. :) I don't like it because it makes our parsing process so different from the intuitive order for extracting metadata about a cite, which will lead to lots of messiness. Like if the source text is:
A human would see a single citation with title="Foo v. Bar", cite="12 Mass. 34", parenthetical="holding blah", etc., and would just see "56 N.E.2d 78, 79" as a bit of extra metadata for the cite. We currently come out with two complete cites with overlapping parses, leading to messy metadata:
The second cite in particular is messy (

Thinking about alternatives, there probably aren't that many legitimate reporter pairs that appear as parallel cites -- could we list them all and add them to reporters-db? If FLP does have pretty good parallel cite info, maybe one could dump all the allowed patterns (Mass. + N.E.2d, Mass. + Tyng etc.) and only treat pairs as parallel if they're on the list. I'd think |
Just wanted to jump in to note that, however we decide to handle parallel case citations, we should probably mirror that treatment for parallel statutory (freelawproject/reporters-db#116 and Ind. R16.1.5) and treaty citations (freelawproject/reporters-db#48 and BB R21.4). |
Reading this again a year later, I now think I agree with @jcushman. I think in practice my previous proposal is intractable because resolution is too inaccurate (especially for the user running locally without e.g. the FLP database). I also like the idea of using a list of allowed parallel citation combinations as a detection heuristic. Relatedly, is there any way to create a mapping between reporter volumes and years using the FLP data? If we knew that two citations are (1) physically close to each other, (2) are from reporters that are known to be a valid combination, and (3) are published in volumes with nearby years, I think we could be pretty confident that the citations are indeed parallel citations and we could collapse them appropriately. |
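The three-part heuristic in this comment (proximity, an allowed reporter combination, nearby years) could be sketched as follows. This is only an illustration under stated assumptions: `CiteSpan`, `ALLOWED_PAIRS`, `looks_parallel`, and the thresholds are hypothetical, the pair list contains just two combinations mentioned in this thread, and a missing year is treated as unknown rather than disqualifying.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CiteSpan:
    """Hypothetical citation with its character span in the source text."""
    reporter: str
    start: int
    end: int
    year: Optional[int] = None


# Illustrative only; a real list would have to come from reporter data.
ALLOWED_PAIRS = {
    frozenset({"Mass.", "N.E.2d"}),
    frozenset({"U.S.", "S. Ct."}),
}


def looks_parallel(
    a: CiteSpan,
    b: CiteSpan,
    max_gap: int = 3,
    max_year_diff: int = 1,
) -> bool:
    """Apply the three checks to two cites, with a appearing before b."""
    # (1) Physically close: only a short separator such as ", " between them.
    close = 0 <= b.start - a.end <= max_gap
    # (2) Reporters form a known valid combination.
    allowed = frozenset({a.reporter, b.reporter}) in ALLOWED_PAIRS
    # (3) Nearby years; an unknown year does not rule the pair out.
    years_ok = (
        a.year is None
        or b.year is None
        or abs(a.year - b.year) <= max_year_diff
    )
    return close and allowed and years_ok
```

The `max_gap` and `max_year_diff` defaults are arbitrary starting points and would need tuning against real text.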
Well, maybe. We have a lot of parallel cites in the DB, but not all of them. We could probably export that and you'd have something.
We don't always have years for reporters, unfortunately, but we could probably get them for the most important ones.
Yeah, me too. I think it makes sense that the whole string is really a citation object, even if it references multiple books. Let us know if you want to try to get a database export, Matt. |
I’m a little late to this discussion, but I think trying to map legitimate pair reporters is a fool’s errand. Just to confirm that, I visualized the connections in one data source for parallel citations in state courts: Turns out there are lots and lots of interconnected reporters. I think @jcushman got it right. We already have a solid track record of identifying parallel citations in the extra field. With a small adjustment, we could parse these and incorporate them directly into a unified FULL CASE CITATION. I think the benefits are
|
I just caught up on this and re-read everything. I think we've got three decent approaches, but Jack's approach wins. Let's do it. My other note I'll make is that if we're smart about this, we can start tallying which citations go together and use this process to add citations to the database. For example, if you see |
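A minimal sketch of the tallying idea, assuming parallel pairs have already been detected and are available as pairs of citation strings. `tally_pairs`, `candidates`, and the `min_seen` threshold are hypothetical names and values chosen for the example.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def tally_pairs(observed: Iterable[Tuple[str, str]]) -> Counter:
    """Count how often two citations are seen together, ignoring order."""
    counts: Counter = Counter()
    for a, b in observed:
        counts[frozenset((a, b))] += 1
    return counts


def candidates(counts: Counter, min_seen: int = 2) -> List[Tuple[str, ...]]:
    """Pairs seen at least min_seen times, as sorted tuples."""
    return [tuple(sorted(pair)) for pair, n in counts.items() if n >= min_seen]


observed = [
    ("1 U.S. 1", "2 S. Ct. 2"),
    ("2 S. Ct. 2", "1 U.S. 1"),
    ("3 U.S. 3", "4 S. Ct. 4"),
]
likely = candidates(tally_pairs(observed))
```

Pairs that clear the threshold would be candidates for review before being added to a database, not automatic additions.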
Exactly! |
For a cite like "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" we extract "1 U.S. 1" and "2 S. Ct. 2" as separate cites that both have the parenthetical "overruling ...". If you later report the parentheticals, you double up; if you use a resolver that knows those are the same case, you double-count the weight of that citation. It would be good if we detected this and linked the two cites as parallel to each other, so that the weight and parenthetical are counted only once.