Getting full citation span #135
Comments
Hey @overmode, thanks for the write-up. There is a method for It returns the full normalized string.
When you run it, it provides the full citation including names, but I believe it has a bug when it uses dates and courts, if you wanted to take a look at
for the example above. |
Is the idea, @overmode, to remove all citations to make it better training data? |
One other thing to know, @overmode, is that the way we identify the name of the case is very sloppy. It just uses heuristics around where it finds a |
Hey, thanks for the quick reply. I took note of your method. It's OK if the recall of citation extraction is not excellent, because I have many documents anyway, but I will need a way to tell whether the parsing went well so that I at least get good precision. @flooie I tried the
This is not exactly what I would like, though, because it is not the exact text that was matched (notice the added comma between page numbers). Also, is this the bug you pointed out? I'm open to a PR in case there is no better workaround, so I would appreciate it if you have insights to share already. [UPDATE] The parenthesis is not closed because in
I assume that a parenthesis is missing at the end. |
Just chiming in here since I saw your PR (#136) and was surprised that this wasn't already possible! Thanks for implementing it! Separate from your changes in the PR, I was also curious about the court issue. It seems that the |
@mattdahl Thanks! |
@overmode every time I see the words |
I understand, regexes are powerful but scale badly. |
For the court issue, the question is essentially, "What bad things will happen if we broaden how we match court strings against the text?" Honestly, I don't think anybody knows. Right now we do two things. We:
If we went a step further and matched with regexes or by taking out whitespace, would we have false matches? I don't know, but I know how to check! If we want to run this down, I think the trick is to look at the |
Here's a gist doing that collision test: https://gist.github.com/mattdahl/a563a48ac512275d893907dd19acd4ae It doesn't seem that removing whitespace causes any additional collisions, so I think we can safely do that. However, the fact that there are so many existing collisions also suggests that we probably shouldn't just be uncritically accepting the first match, as currently implemented. |
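For readers who don't want to open the gist, here is a rough sketch of what a collision test of this kind could look like. It is not the code from the gist: the `find_collisions` helper, the court IDs, and the sample strings are all made up for illustration, and the toy data is deliberately built so that stripping whitespace does introduce one extra collision (on the real data the gist reports none).

```python
from collections import defaultdict
from typing import Dict, List


def find_collisions(
    courts: Dict[str, str], strip_whitespace: bool = False
) -> Dict[str, List[str]]:
    """Group court IDs by citation string and keep only groups with more than one ID.

    `courts` maps a (hypothetical) court ID to its citation string.
    With strip_whitespace=True, all whitespace is removed before comparing.
    """
    groups = defaultdict(list)
    for court_id, citation_string in courts.items():
        key = "".join(citation_string.split()) if strip_whitespace else citation_string
        groups[key].append(court_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


# Made-up sample data, not taken from the real courts database.
courts = {
    "court_a": "N.Y. Cty. Ct.",
    "court_b": "N.Y. Cty. Ct.",
    "court_c": "Pa. Super. Ct.",
    "court_d": "Pa. Super.Ct.",
}

baseline = find_collisions(courts)
stripped = find_collisions(courts, strip_whitespace=True)

# Collisions that only appear once whitespace is removed.
new_keys = set(stripped) - {"".join(key.split()) for key in baseline}
print(new_keys)
```

If `new_keys` comes back empty on the real data, looser whitespace matching adds no ambiguity beyond what already exists. The `baseline` collisions are a separate problem, since they mean the first match is not necessarily the right one.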
Yeah, that jumped out at me too. @flooie what's your take on that? |
@mattdahl we had imported a lot of low-level county and town courts, and in NY a few of the courts had been generated with the parent citation string. For example, New York County Court has something like 50+ county courts, and they were generated with N.Y. Cty. Ct. as the citation string instead of NY Cty. Ct., Suffolk Cty. ... etc. I went through and fixed the 100 or so collisions.
Nice!! The only duplicate left is |
No, ha, that's just a duplicate court. I'll strip that in a second. I have a few things to add about courts and citation strings. I'll add them momentarily.
Hi, thank you for the great library!
Problem description
I am preparing a dataset, in which I would like to mask some citations, e.g. replacing them by "[CITATION]".
I could not find a way to get the full span of the citation. Indeed, only the normalized part is covered by the built-in span() function (see below).
Output:
One can see that the span only partially covers the citation text.
If possible, I would like to avoid using regex for recovering the full span.
Concatenating the lengths of the citation's attributes (plaintiff, defendant, etc.) does not seem to be a viable solution either, because the second example misses the "Pa. Super" text.
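To make the masking goal concrete, here is a minimal sketch of the replacement step, assuming the full (start, end) character span of each citation is already known. The `mask_spans` helper is hypothetical and not part of eyecite, the sample sentence and citation are made up, and the span is computed by hand here rather than by the library.

```python
from typing import Iterable, List, Tuple

Span = Tuple[int, int]


def mask_spans(text: str, spans: Iterable[Span], placeholder: str = "[CITATION]") -> str:
    """Replace each (start, end) character span in `text` with `placeholder`."""
    merged: List[Span] = []
    for start, end in sorted(spans):
        if merged and start < merged[-1][1]:
            # Overlapping spans are merged so they yield a single placeholder.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    # Replace from right to left so that earlier offsets stay valid.
    for start, end in reversed(merged):
        text = text[:start] + placeholder + text[end:]
    return text


# Made-up sentence and citation; the span stands in for what a full-span lookup would return.
text = "See Foo v. Bar, 1 U.S. 1 (1900), for details."
citation = "Foo v. Bar, 1 U.S. 1 (1900)"
start = text.index(citation)
masked = mask_spans(text, [(start, start + len(citation))])
print(masked)  # See [CITATION], for details.
```

With spans that only cover the normalized part of the citation, the same helper would leave the case name and the parenthetical in the text, which is the problem described above.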
Desired behavior
It would be nice to have a 'full_span()' function such that, if I use it instead of span() in the above example, I get
Specs
eyecite version: 2.4.0