Handling white spaces in journal names. #119

Open
bbernicker opened this issue Aug 9, 2022 · 3 comments
@bbernicker
Contributor

While testing Eyecite today, I noticed that some law review citations in my dataset are missing a space inside "L.Rev." and/or between the name of the law review and "L.Rev." or "L. Rev." For example, Strickland v. Washington cites 58 N.Y.U.L.Rev. 299; 83 Colum.L.Rev. 1544; 93 Harv.L.Rev. 752; and 50 U.Chi.L.Rev. 138.

I was curious whether ignoring white space in the names of journals (and maybe reporters and laws, for that matter) would help improve detection, especially with OCR'd files. Alternatively, does it make sense to specify alternative versions of journal names without some or all of their white space? Or else to change "L.Rev." to "L. Rev." in Eyecite's clean module?
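To make the last option concrete, here is a minimal sketch of a standalone text-cleaning function that rewrites "L.Rev." as " L. Rev." before citation extraction. The function name `normalize_law_review_spacing` is hypothetical and is not part of Eyecite; whether it could be plugged into Eyecite's clean module as a custom step is an assumption that would need checking against the library.

```python
import re

# Hypothetical helper, not part of Eyecite.
# Matches "L." followed by "Rev." with any amount of white space (including
# none) before "L." and between the two parts. The lookbehind requires a
# preceding non-space character so a leading "L. Rev." is left alone, and
# \b keeps names that merely end in "L" (e.g. "XL.Rev.") from matching.
_LAW_REVIEW = re.compile(r"(?<=\S)\s*\bL\.\s*Rev\.")


def normalize_law_review_spacing(text: str) -> str:
    """Rewrite 'L.Rev.' / ' L.Rev.' / 'L. Rev.' as ' L. Rev.'."""
    return _LAW_REVIEW.sub(" L. Rev.", text)


if __name__ == "__main__":
    for cite in ("58 N.Y.U.L.Rev. 299", "83 Colum.L.Rev. 1544"):
        print(normalize_law_review_spacing(cite))
```

This only repairs the spacing around "L. Rev."; a name such as "U.Chi." keeps its internal spacing, so it addresses just one of the cases raised above.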

@mlissner
Member

I haven't looked at the code for this specifically, but yeah, some sort of solution is needed. I forget how journal names are identified (I think a regex?). In general, it's easier to tweak our journal/statute/citation-specific regex than it is to do things like whitespace stripping (which tends to be less granular).

@bbernicker
Contributor Author

bbernicker commented Aug 12, 2022

Maybe I could go through the regex and replace " L. Rev." with "\s?L.\s*Rev." This would allow a match whether or not there is a space before the "L.", and whenever "L." and "Rev." are separated by nothing or only by white space. It would not help with journal names that are missing spaces unless they have "L. Rev." in them (e.g. "Admin. L.J. Am. U." would match, but not "Admin.L.J.Am.U."), but it would at least be a step in the right direction.
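A rough sketch of that substitution, assuming journal names are available as plain strings that get turned into regexes. The helper `loosen_law_review` is hypothetical, not Eyecite's actual code, and the dots are escaped here so they match literal periods only.

```python
import re

SUFFIX = " L. Rev."
# Optional single space before "L.", any white space (or none) before "Rev."
LOOSE_SUFFIX = r"\s?L\.\s*Rev\."


def loosen_law_review(name: str) -> str:
    """Build a regex for a journal name, relaxing spacing around ' L. Rev.'.

    Hypothetical helper for illustration only. Names without ' L. Rev.'
    are escaped unchanged, so they still require their exact spacing.
    """
    if SUFFIX not in name:
        return re.escape(name)
    head, _, tail = name.partition(SUFFIX)
    return re.escape(head) + LOOSE_SUFFIX + re.escape(tail)


if __name__ == "__main__":
    pattern = loosen_law_review("Harv. L. Rev.")
    for text in ("Harv. L. Rev.", "Harv.L.Rev.", "Harv. L.Rev."):
        print(text, bool(re.fullmatch(pattern, text)))

    # A name without "L. Rev." is not loosened, as noted in the comment.
    other = loosen_law_review("Admin. L.J. Am. U.")
    print(bool(re.fullmatch(other, "Admin.L.J.Am.U.")))
```

As the comment says, this is a partial fix: the spacing inside the rest of the journal name is still matched literally.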

@mlissner
Member

@flooie Can you take over review on this one, please? (Sorry @bbernicker I just know he'll have better opinions on this codebase.)

@mlissner moved this to 🆕 New in @flooie's backlog Aug 12, 2022
@mlissner moved this from 🆕 New to 📋 Backlog in @flooie's backlog Aug 12, 2022
@flooie self-assigned this Aug 12, 2022
@flooie moved this to General Backlog in Case Law Sprint Nov 19, 2024