Link parallel cites #76

Open
jcushman opened this issue May 24, 2021 · 12 comments

@jcushman
Contributor

For a cite like "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)" we extract "1 U.S. 1" and "2 S. Ct. 2" as separate cites that both have the parenthetical "overruling ...". If you later report the parentheticals, you get duplicates, and if you use a resolver that knows those are the same case, you double-count the weight of that citation. It would be good if we detected this and linked the two cites as parallel to each other, so that the weight and parenthetical are counted only once.
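
To make the double counting concrete, here is a minimal sketch. The dicts and the `resolver` mapping are hypothetical stand-ins for illustration only; they are not eyecite's actual citation objects or resolver.

```python
from collections import Counter

# Hypothetical flat output for "1 U.S. 1, 2 S. Ct. 2 (1999) (overruling ...)":
# both cites carry the same parenthetical.
extracted = [
    {"cite": "1 U.S. 1", "parenthetical": "overruling ..."},
    {"cite": "2 S. Ct. 2", "parenthetical": "overruling ..."},
]

# Assumed resolver that knows both cites refer to the same case.
resolver = {"1 U.S. 1": "case-A", "2 S. Ct. 2": "case-A"}

# Counting one unit of weight per extracted cite counts the case twice.
weights = Counter(resolver[c["cite"]] for c in extracted)

# Collecting parentheticals per extracted cite stores the same text twice.
parentheticals = [c["parenthetical"] for c in extracted]
```

Here a single citation in the text contributes a weight of 2 and two copies of the parenthetical.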

@mlissner

This comment was marked as off-topic.

@mlissner
Member

mlissner commented Feb 18, 2022

I had a lengthy comment above that was mostly off topic, but it looks like as we've added parentheticals to CourtListener, we've run headlong into this. (Happy to give beta access to anybody interested.)

I'm not sure what the fix is. Right now it's double counting the depth, like you say, @jcushman, and it's adding the parentheticals to our DB multiple times as well. Not great!

I guess we could overhaul the API of eyecite to return citations that link these kinds of things together. For example, instead of returning a flat list of citations, we could return a list of lists of citations. Something like:

[
    {citations: [{reporter, page, volume}, {reporter, page, volume}], parenthetical: "Sky is blue"},
    {...},
]

That'd be a pretty big overhaul. Another approach would be to add a linkage attribute to subsequent citations, allowing our flat list to remain. Something like:

[
    {reporter, page, volume, parenthetical, parallel_to: null},
    {reporter, page, volume, parenthetical, parallel_to: 0},
]

If that's not clear, the parallel_to attribute would just point to the thing it's parallel to in the list of citations.
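
A minimal sketch of this flat-list approach, assuming a simplified `Cite` dataclass (hypothetical, not eyecite's classes) where `parallel_to` holds the list index of the cite it is parallel to:

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cite:
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    # Index of the cite this one is parallel to, or None for a primary cite.
    parallel_to: Optional[int] = None


def primary_cites(cites: List[Cite]) -> List[Cite]:
    """Return only the cites that are not marked as parallel to another cite."""
    return [c for c in cites if c.parallel_to is None]


cites = [
    Cite("1", "U.S.", "1", "overruling ..."),
    Cite("2", "S. Ct.", "2", "overruling ...", parallel_to=0),
]
```

A caller that wants to count weight or store parentheticals once would iterate over `primary_cites(cites)` rather than the full list; a caller unaware of `parallel_to` would still double count.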

Hm. The first approach feels more correct, but the second one sure seems simpler.

@mattdahl, maybe you have thoughts too?

@jcushman
Contributor Author

Quick answer with the brain I have available ... :)

Data representation

I like Mike's second approach fine. I think the first one centers an edge case too much: you shouldn't have to wade through that layer when most cites aren't parallel. A third option that I would strongly consider would be to only include the first cite in the output list, and attach the parallel cites to that:

[
  {reporter, page, volume, parenthetical, parallel_cites: [{reporter, page, volume}]}
]

This has the benefit that if you're a naive caller of the library who has no idea about parallel cites, you'll do a reasonable default thing of ignoring them instead of double counting anything. I'm not sure what data exactly belongs in parallel_cites but hopefully the answer would jump out during implementation.
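
A rough sketch of this third option, under the assumption that a parallel cite only needs volume, reporter, and page (the thread leaves open what exactly belongs in `parallel_cites`). All class and function names here are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ParallelCite:
    volume: str
    reporter: str
    page: str


@dataclass
class Cite:
    volume: str
    reporter: str
    page: str
    parenthetical: Optional[str] = None
    parallel_cites: List[ParallelCite] = field(default_factory=list)


def all_cite_strings(cite: Cite) -> List[str]:
    """Return the primary cite followed by its parallel cites, as strings."""
    entries = [cite] + cite.parallel_cites
    return [f"{c.volume} {c.reporter} {c.page}" for c in entries]


cites = [
    Cite("1", "U.S.", "1", "overruling ...", [ParallelCite("2", "S. Ct.", "2")]),
]
```

A naive caller iterating over `cites` sees one cite and one parenthetical, while a caller that cares about parallel cites can still reach them through `all_cite_strings()` or the `parallel_cites` attribute.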

(Note that besides doubled parentheticals, there are a couple of other weird things about the current situation, like the first cite has a pin_cite that contains the second cite, and the second cite has a title that contains the first cite. So throwing away the second cite entirely rather than just tagging the second one as special somehow is appealing.)

Implementation

OK, the underlying scenario we're trying to detect is this: if you have tokens like <CitationToken>, <optional pin cite>, <CitationToken>, <optional pin cite>, <CitationToken>..., the subsequent citation tokens are parallel cites and should be attached to the first one somehow instead of processed from scratch. (At least we hope that's true; if someone writes "see cases such as 1 Foo 1, 2 Bar 2, and others", this'll misfire.)

So where do we detect that? A place I can think of to handle that is in add_post_citation:

def add_post_citation(citation: CaseCitation, words: Tokens) -> None:

If we get to citation.metadata.pin_cite = clean_pin_cite(m["pin_cite"]) or None and the m["pin_cite"] was built from tokens that include CitationTokens, then that's our special case. Assuming we can detect it somehow at that point, we can then set citation.metadata.pin_cite to just the part up to the first CitationToken, and set some temporary data on citation to let us know downstream to consume the next CitationTokens as a parallel cite instead.

This answer is a little fiddly because we don't know what tokens m["pin_cite"] was built from. Not sure what's right here -- maybe re-tokenize m["pin_cite"] if that's reasonably fast. Or, before you even do the match_on_tokens(), scan out for (say) 10 tokens and if you find a CitationToken, see if the tokens up to the citation token are a pin cite, and if so that's the special case. I don't love any of that but one of those should work.
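
The look-ahead variant could be sketched roughly as follows. This is not eyecite's tokenizer: the `CitationToken` class below is a simplified stand-in, plain words are modeled as strings, and the pin-cite test (`PIN_WORD`) is an assumed, deliberately narrow pattern. As noted above, a list of different cases separated only by commas would still be treated as parallel.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class CitationToken:
    """Stand-in for a tokenizer's citation token (hypothetical, simplified)."""
    text: str


Token = Union[CitationToken, str]

# A gap word counts as pin-cite material if, after stripping commas, it is
# empty, the word "at", or a page number / page range.
PIN_WORD = re.compile(r"^(?:at|\d+(?:-\d+)?)?$")


def find_parallel_cite(
    tokens: List[Token], start: int, max_lookahead: int = 10
) -> Optional[int]:
    """Return the index of a citation token that follows tokens[start] and is
    separated from it only by an optional pin cite; otherwise return None."""
    window = tokens[start + 1 : start + 1 + max_lookahead]
    for index, token in enumerate(window, start=start + 1):
        if isinstance(token, CitationToken):
            return index
        if not PIN_WORD.match(token.strip(",")):
            return None
    return None


tokens = [
    CitationToken("12 Mass. 34"), ",", "35,",
    CitationToken("56 N.E.2d 78"), ",", "79", "(Mass.", "1999)",
]
```

With these tokens, scanning from index 0 finds the second citation token at index 3, while scanning from index 3 stops at "(Mass." and reports no further parallel cite.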

Since citations are annoying, some special cases to consider ...

  • As mentioned there can of course be more than one parallel cite, such as in SCOTUS cites.
  • And some of the numbers can be just underscores.
  • I can't remember how we handle nominative cites with a special parallel format like 1 Foo (2 Bar) 3 right now, but that might need its own thing if we aren't doing something with it yet.
  • How do parallel cites work for short cites? If it's like 1 Foo at 1, 2 Foo at 2 (parenthetical), they'd ideally be joined as well, but it's a different thing to detect.

@mattdahl
Contributor

My take: The identification of parallel citations cannot occur before resolution occurs. Otherwise, we run into the problem that Jack mentioned -- citations that are right next to each other but that refer to different cases. The only way to logically differentiate them is to wait until the citations are resolved, and then check whether they resolve to the same resource or not to determine whether they're actually parallel citations or not.

So I would argue that the return output of get_citations() should not change at all. get_citations() should be unopinionated on this point -- it should just return a list of every citation in an opinion without attempting to aggregate any possible duplicates.

If the user needs to deal with parallel citations, they should then run resolve_citations() as normal. Then I would propose that we implement a new function, prune_parallel_citations() or something (this could also be configured as a step in the resolve_citations() function itself), that goes through the list after resolutions have been made and deletes/aggregates the parallel citations in each resource group. To detect parallelness at this stage, we can just check whether each citation's span() is within some (configurable) distance of another citation's span() in the same resource group. Because resolutions have already been made, we'll then know for sure that those citations are in fact parallel, and not just nearby different citations.
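
A minimal sketch of what such a pruning step might look like. It assumes the resolution output is a dict mapping each resource to its list of citations, and uses a hypothetical `ResolvedCite` class with a plain `span` tuple in place of real citation objects; `max_gap` is the configurable distance.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class ResolvedCite:
    text: str
    span: Tuple[int, int]  # (start, end) character offsets in the source text


def prune_parallel_citations(
    resolutions: Dict[str, List[ResolvedCite]], max_gap: int = 10
) -> Dict[str, List[ResolvedCite]]:
    """Within each resource group, drop any cite that starts within max_gap
    characters of the end of the previous cite in the same group."""
    pruned: Dict[str, List[ResolvedCite]] = {}
    for resource, cites in resolutions.items():
        kept: List[ResolvedCite] = []
        prev_end = None
        for cite in sorted(cites, key=lambda c: c.span[0]):
            is_parallel = prev_end is not None and cite.span[0] - prev_end <= max_gap
            if not is_parallel:
                kept.append(cite)
            # Track the end of the last cite seen so chains of three or more
            # parallel cites collapse into the first one.
            prev_end = cite.span[1]
        pruned[resource] = kept
    return pruned


resolutions = {
    "case-A": [
        ResolvedCite("12 Mass. 34", (20, 31)),
        ResolvedCite("56 N.E.2d 78", (37, 49)),
        ResolvedCite("12 Mass. 34", (300, 311)),
    ]
}
```

In this example the second cite is dropped as parallel to the first, while the later, separate reference to the same case is kept.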

The downside to this approach is that it depends on the user having a competent backend for doing resolutions. This is not a problem for CL and CAP, but the default implementation of resolve_citations() will obviously not recognize citations from different reporters as resolving to the same resource (even if they're true parallel citations). However, I'm okay with this, because I think the user ultimately needs to grapple with the fact that the identification of parallel citations cannot conceptually happen before the citations are resolved, so they need to deal with making sure their resolutions are sane first.

Happy to discuss further!

@mlissner
Member

I think I agree with Matt in terms of when this should happen. That's sort of what we've got right now when searching for parentheticals. The way it works now is that if two citations have the same parenthetical, they're the same and we call that good enough, but we could do better, I'm sure.

@jcushman
Contributor Author

My take: The identification of parallel citations cannot occur before resolution occurs.

I don't want this to be the answer, though that doesn't mean it isn't. :)

I don't like it because it makes our parsing process so different from the intuitive order for extracting metadata about a cite, which will lead to lots of messiness. Like if the source text is:

... blah blah blah. Foo v. Bar, 12 Mass. 34, 35, 56 N.E.2d 78, 79 (Mass. 1999) (holding blah). blah blah blah ...

A human would see a single citation with title="Foo v. Bar", cite="12 Mass. 34", parenthetical="holding blah", etc., and would just see "56 N.E.2d 78, 79" as a bit of extra metadata for the cite.

We currently come out with two complete cites with overlapping parses, leading to messy metadata:

[
  FullCaseCitation('12 Mass. 34', ... metadata=(parenthetical='holding blah', pin_cite='35', year='1999', plaintiff='Foo', defendant='Bar', extra='56 N.E.2d 78, 79', ...)),
  FullCaseCitation('56 N.E.2d 78', metadata=(parenthetical='holding blah', pin_cite='79', year='1999', plaintiff='Foo', defendant='Bar, 12 Mass. 34, 35', ...))
]

The second cite in particular is messy (defendant='Bar, 12 Mass. 34, 35') because it doesn't know that it isn't really a complete cite on its own, just a bit of metadata for a previous cite, so it can't hope to work out what's going on around it. If we rely on resolution for deleting these, we'll end up with these messy confused cites any time we don't know all of the cites for a case, which for CAP at least is common. FLP might be better on knowing parallel cites for each case, not sure how close it is to 100%.

Thinking about alternatives, there probably aren't that many legitimate reporter pairs that appear as parallel cites -- could we list them all and add them to reporters-db? If FLP does have pretty good parallel cite info, maybe one could dump all the allowed patterns (Mass. + N.E.2d, Mass. + Tyng etc.) and only treat pairs as parallel if they're on the list. I'd think 12 Mass. 34, 56 N.E.2d 78 would rarely be two separate cases, particularly because it'd be an odd mix of citation styles if it were.
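
As a sketch of the allowlist idea: the set below contains only the two pairs named above and is purely illustrative, not data from reporters-db, and `is_allowed_pair` is a hypothetical helper.

```python
# Unordered reporter pairs that are allowed to appear as parallel cites.
ALLOWED_PARALLEL_PAIRS = {
    frozenset({"Mass.", "N.E.2d"}),
    frozenset({"Mass.", "Tyng"}),
}


def is_allowed_pair(reporter_a: str, reporter_b: str) -> bool:
    """Return True if the two reporters are a known parallel combination."""
    return frozenset({reporter_a, reporter_b}) in ALLOWED_PARALLEL_PAIRS
```

Using `frozenset` makes the check independent of the order in which the two reporters appear in the text.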

@mlissner mlissner moved this from Needs Discussion to Done in Opinion-Summarizing Parentheticals Mar 16, 2022
@mlissner mlissner moved this from Done to Needs Discussion in Opinion-Summarizing Parentheticals Aug 1, 2022
@bbernicker
Contributor

Just wanted to jump in to note that, however we decide to handle parallel case citations, we should probably mirror that treatment for parallel statutory (freelawproject/reporters-db#116 and Ind. R16.1.5) and treaty citations (freelawproject/reporters-db#48 and BB R21.4).

@mattdahl
Contributor

Reading this again a year later, I now think I agree with @jcushman. I think in practice my previous proposal is intractable because resolution is too inaccurate (especially for the user running locally without e.g. the FLP database).

I also like the idea of using a list of allowed parallel citation combinations as a detection heuristic. Relatedly, is there any way to create a mapping between reporter volumes and years using the FLP data? If we knew that two citations are (1) physically close to each other, (2) from reporters that are known to be a valid combination, and (3) published in volumes with nearby years, I think we could be pretty confident that the citations are indeed parallel citations and we could collapse them appropriately.
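
The three conditions could be combined roughly like this. Everything here is hypothetical: the `Cite` fields, the thresholds, and in particular the choice to let a missing year pass the year check (an assumption made because years may not be known for every volume).

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set


@dataclass
class Cite:
    reporter: str
    start: int  # character offset where the cite begins
    end: int    # character offset where the cite ends
    year: Optional[int] = None  # year of the volume, if known


def likely_parallel(
    first: Cite,
    second: Cite,
    allowed_pairs: Set[FrozenSet[str]],
    max_gap: int = 10,
    max_year_diff: int = 2,
) -> bool:
    """Apply the three checks: proximity, known reporter pair, nearby years."""
    close = 0 <= second.start - first.end <= max_gap
    allowed = frozenset({first.reporter, second.reporter}) in allowed_pairs
    if first.year is None or second.year is None:
        years_ok = True  # assumption: an unknown year does not veto the match
    else:
        years_ok = abs(first.year - second.year) <= max_year_diff
    return close and allowed and years_ok


allowed = {frozenset({"Mass.", "N.E.2d"})}
first = Cite("Mass.", 20, 31, 1999)
second = Cite("N.E.2d", 37, 49, 1999)
```

Here `likely_parallel(first, second, allowed)` holds; moving the second cite far away, or giving it a year decades apart, makes the check fail.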

@mlissner
Member

Relatedly, is there any way to create a mapping between reporter volumes and years using the FLP data?

Well, maybe. We have a lot of parallel cites in the DB, but not all of them. We could probably export that and you'd have something.

are published in volumes with nearby years

We don't always have years for reporters, unfortunately, but we could probably get them for the most important ones.

Reading this again a year later, I now think I agree with @jcushman.

Yeah, me too. I think it makes sense that the whole string is really a citation object, even if it references multiple books.

Let us know if you want to try to get a database export, Matt.

@flooie flooie moved this from Backlog Nov 19 - 29 to General Backlog in Case Law Sprint Nov 19, 2024
@flooie flooie moved this from General Backlog to Backlog Dec 16 - Dec 27th in Case Law Sprint Dec 16, 2024
@flooie flooie moved this from Backlog Dec 16 - Dec 27th to To Do in Case Law Sprint Dec 17, 2024
@flooie flooie assigned grossir and flooie and unassigned grossir Dec 17, 2024
@flooie
Contributor

flooie commented Dec 17, 2024

I’m a little late to this discussion, but I think trying to map legitimate pair reporters is a fool’s errand. Just to confirm that, I visualized the connections in one data source for parallel citations in state courts:

[Image: graph of parallel-citation connections between state-court reporters]

Turns out there are lots and lots of interconnected reporters.


I think @jcushman got it right. We already have a solid track record of identifying parallel citations in the extra field. With a small adjustment, we could parse these and incorporate them directly into a unified FULL CASE CITATION.

I think the benefits are:

  1. A more natural data model.
  2. The ability to inherit information from other parallel citations. For example, if one parallel citation is a volume_year neutral citation, we could ID the year correctly for all citations.
  3. Here is an example I came across today that would benefit: 2015 WI App 13, 359 Wis.2d 675, 859 N.W.2d 628 (unpublished opinion).
  4. It fixes the parsing of the defendant name and other issues for complex citations like the ones @jcushman ID'd above.
  5. It should be easy to integrate into CL, and since we have identified them as part of the same citation, it should improve and simplify the matching-to-CL part of the process.

I had another thought but it slipped out of my mind.
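
To illustrate the year-inheritance benefit with the Wisconsin example, here is a small sketch. It assumes each cite in a unified citation is a (volume, reporter, page) tuple and that the caller supplies a set of neutral reporters whose volume is a year; both the `infer_year` helper and that set are hypothetical.

```python
from typing import List, Optional, Set, Tuple

CiteParts = Tuple[str, str, str]  # (volume, reporter, page)


def infer_year(cites: List[CiteParts], neutral_reporters: Set[str]) -> Optional[int]:
    """Return the year from the first neutral cite whose volume is a
    four-digit year, or None if no such cite is present."""
    for volume, reporter, _page in cites:
        if reporter in neutral_reporters and len(volume) == 4 and volume.isdigit():
            return int(volume)
    return None


unified = [
    ("2015", "WI App", "13"),
    ("359", "Wis.2d", "675"),
    ("859", "N.W.2d", "628"),
]
```

With `{"WI App"}` passed as the neutral reporters, the year 2015 can be attached to the whole citation, including the Wis.2d and N.W.2d cites that carry no year of their own.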

@mlissner

@mlissner
Member

I just caught up on this and re-read everything. I think we've got three decent approaches, but Jack's approach wins. Let's do it.

The other note I'll make is that if we're smart about this, we can start tallying which citations go together and use this process to add citations to the database. For example, if you see 2 Bar 3, 4 baz 5 together enough times, you can probably safely assume that those two citations are parallel, even if you don't have one or the other in the DB. In other words, this could be used to harvest citations.
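
The tallying idea might look something like this sketch. It assumes each detected parallel group arrives as a list of cite strings, and the `min_count` threshold for "enough times" is an arbitrary placeholder; `harvest_parallel_pairs` is a hypothetical name.

```python
from collections import Counter
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set


def harvest_parallel_pairs(
    groups: Iterable[List[str]], min_count: int = 3
) -> Set[FrozenSet[str]]:
    """Count how often each pair of cites appears in the same parallel group
    and return the pairs seen at least min_count times."""
    counts: Counter = Counter()
    for group in groups:
        # Deduplicate within a group so one group counts a pair only once.
        for a, b in combinations(sorted(set(group)), 2):
            counts[frozenset({a, b})] += 1
    return {pair for pair, n in counts.items() if n >= min_count}


observed = [["2 Bar 3", "4 baz 5"]] * 3 + [["2 Bar 3", "9 Qux 9"]]
```

Pairs that clear the threshold could then be proposed as new parallel citations for the database, while one-off co-occurrences are ignored.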

@flooie
Contributor

flooie commented Dec 17, 2024

Exactly!
