change uuid to paper_id (and other naming changes) #16

Open
jameshowison opened this issue Nov 9, 2024 · 8 comments
@jameshowison

For the parquet files, can we standardize the column names a little?

e.g., mentions uses uuid, but papers uses paperId

Prefer all column names to be snake_case as well.

And can we change columns like documentContextAttributes.created.value to names like purpose_created_doc_context and purpose_created_mention_context?
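A minimal sketch of how the renaming could be automated, assuming column names are plain strings. The `mentionContextAttributes.created.value` key is an assumed source name (only the document-context column is named above), and `to_snake_case` is a hypothetical helper, not part of the project.

```python
import re

# Explicit renames that a mechanical camelCase -> snake_case pass cannot produce.
EXPLICIT_RENAMES = {
    # From this issue: the mentions table's uuid becomes paper_id.
    "uuid": "paper_id",
    "documentContextAttributes.created.value": "purpose_created_doc_context",
    # Assumed source column name; not confirmed in this thread.
    "mentionContextAttributes.created.value": "purpose_created_mention_context",
}


def to_snake_case(name: str) -> str:
    """Return the standardized snake_case form of a column name."""
    if name in EXPLICIT_RENAMES:
        return EXPLICIT_RENAMES[name]
    # Nested attribute separators become underscores.
    name = name.replace(".", "_")
    # Insert an underscore at each lowercase-or-digit to uppercase boundary.
    name = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name)
    return name.lower()


if __name__ == "__main__":
    for column in ("paperId", "uuid", "documentContextAttributes.created.value"):
        print(column, "->", to_snake_case(column))
```

Keeping the irregular renames in an explicit table makes the exceptions easy to review, while everything else follows one rule.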

Although if we are going to follow the ERD, we would have a new table:

purpose_assessments(
  mention_id,
  purpose[used,created,shared],
  context[local|document],
  certainty_percent)

This raises the question of what the mention_id is; I guess a combination of paper_id and mention_index.
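To make the proposed table concrete, here is a rough sketch of unpivoting the wide purpose columns into purpose_assessments rows. It assumes the used and shared columns follow the same naming pattern as the created ones, that mention_id is paper_id and mention_index joined by a colon, and that values are already percentages. None of these are settled in this thread.

```python
from dataclasses import dataclass
from itertools import product
from typing import Dict, List

PURPOSES = ("used", "created", "shared")
CONTEXTS = ("local", "document")

# Assumed mapping from the context value to the wide-column suffix.
_CONTEXT_SUFFIX = {"local": "mention_context", "document": "doc_context"}


@dataclass(frozen=True)
class PurposeAssessment:
    mention_id: str
    purpose: str
    context: str
    certainty_percent: float


def make_mention_id(paper_id: str, mention_index: int) -> str:
    # Assumed format: the two key parts joined by a colon.
    return f"{paper_id}:{mention_index}"


def unpivot(paper_id: str, mention_index: int,
            wide_row: Dict[str, float]) -> List[PurposeAssessment]:
    """Turn one wide mention row into one row per (purpose, context) pair."""
    mention_id = make_mention_id(paper_id, mention_index)
    rows = []
    for purpose, context in product(PURPOSES, CONTEXTS):
        column = f"purpose_{purpose}_{_CONTEXT_SUFFIX[context]}"
        if column in wide_row:
            rows.append(
                PurposeAssessment(mention_id, purpose, context, wide_row[column])
            )
    return rows
```

A mention with all three purposes assessed in both contexts yields six rows, all sharing the same mention_id.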

@willbeason
Member

Done in data model spec:

Proposed tables are in light blue. Other boxes are potential tables, but for now are essentially just primary keys.

[Image: SoftCite Data Model(3)]

@willbeason
Member

Actually ... accommodating parse_type:
[Image: SoftCite Data Model(4)]

@willbeason
Member

Now part of the spec; closing this. Will export and upload once done with #18.

@willbeason
Member

The "Parse" table above is debatable... won't include for now.

@willbeason
Member

willbeason commented Dec 11, 2024

Looking at options, we could create a nested type for PurposeAssessment. Not going to pursue for now.

@willbeason willbeason self-assigned this Dec 12, 2024
@willbeason willbeason reopened this Dec 12, 2024
@jameshowison
Author

jameshowison commented Dec 12, 2024

A Paper can have multiple Parses, each for a different source representation (pdf, jats, xml, etc). A Parse results in identifying multiple SoftwareMentions (with full text context, raw and normalized name, version, url), each of which has a number of PurposeAssessments (used, created, etc. in both a local and a document context).

Paper has_many SourceFile
SourceFile has_many Parse
Parse has_many SoftwareMention
SoftwareMention has_many PurposeAssessment
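The has_many chain above could be sketched as nested dataclasses. This is only an illustration of the relationships. The field names are assumptions drawn from the descriptions in this comment, and `iter_mentions` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple


@dataclass
class PurposeAssessment:
    purpose: str            # used, created, or shared
    context: str            # local or document
    certainty_percent: float


@dataclass
class SoftwareMention:
    raw_name: str
    normalized_name: str
    purpose_assessments: List[PurposeAssessment] = field(default_factory=list)


@dataclass
class Parse:
    software_mentions: List[SoftwareMention] = field(default_factory=list)


@dataclass
class SourceFile:
    file_type: str          # e.g. pdf, jats, xml
    parses: List[Parse] = field(default_factory=list)


@dataclass
class Paper:
    doi: str
    title: str
    source_files: List[SourceFile] = field(default_factory=list)


def iter_mentions(paper: Paper) -> Iterator[Tuple[str, SoftwareMention]]:
    """Walk Paper -> SourceFile -> Parse -> SoftwareMention."""
    for source_file in paper.source_files:
        for parse in source_file.parses:
            for mention in parse.software_mentions:
                yield source_file.file_type, mention
```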

A Paper can have multiple SourceFiles (e.g., pdf, jats, xml), each of which was parsed, resulting in multiple SoftwareMentions (with full text context, raw and normalized name, version, url). Each SoftwareMention has a number of PurposeAssessments, resulting from assessments about whether the software in the mention was used, created, etc. Some assessments draw on only the local context of the single mention, while others draw on all the mentions of that piece of software in the document.

A Paper (doi, title, ...) has many identified SoftwareMentions (software_name, full_text_where_found, ...), each of which has six PurposeAssessments (used, created, shared in local or document context).

Papers have metadata like doi, title, journal name.
SoftwareMentions have metadata like raw and normalized software name, url, version, as well as full text context snippet in which the mention was identified.
PurposeAssessments are about whether the software in the mention was assessed to be used, created or shared by the paper. Each of these three was assessed in two ways: using just the local context of the single mention, and across the document, drawing on all the mentions of that piece of software in the paper.

(used, created, shared, in both a local mention context and a document context)

@willbeason
Member

We should likely have descriptions of the columns stored as metadata. Need to look into how languages other than Go access this.

For ID columns, use the form [table]_id.

A difficult bit here is communicating what the primary key is for a table, as all of these have composite primary keys. We're going to merge these into a single column but keep the original columns as well, so that we have a nice single primary key column. We're keeping the other columns (e.g. paper_id), even though they duplicate information in the new software_mention_id column, as they'll be a common GROUP_BY target.
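As an illustration of the merged-key idea, here is a small sketch over rows held as plain dicts. The key components (paper_id and mention_index) and the colon separator are assumptions, since the exact composite key is not spelled out here, and both function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def add_software_mention_id(rows: Iterable[Dict]) -> List[Dict]:
    """Add a single-column primary key while keeping the component columns."""
    out = []
    for row in rows:
        merged = dict(row)  # copy so the input rows are left untouched
        # Assumed key format: component columns joined by a colon.
        merged["software_mention_id"] = f"{row['paper_id']}:{row['mention_index']}"
        out.append(merged)
    return out


def mentions_per_paper(rows: Iterable[Dict]) -> Counter:
    """Grouping by paper_id stays simple because the column is retained."""
    return Counter(row["paper_id"] for row in rows)


if __name__ == "__main__":
    mentions = add_software_mention_id([
        {"paper_id": "p1", "mention_index": 0},
        {"paper_id": "p1", "mention_index": 1},
        {"paper_id": "p2", "mention_index": 0},
    ])
    ids = [m["software_mention_id"] for m in mentions]
    assert len(ids) == len(set(ids))  # the merged key must be unique
    print(mentions_per_paper(mentions))
```

Retaining paper_id avoids having to split the merged key back apart whenever rows are grouped per paper.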

Some open questions:

  • What is the default parse type? I have been identifying it as the one that lacks a specific label.
  • Is there a consistent threshold at which certainty_percent becomes "true" or "false" for "is_purpose"? If so, don't need to keep is_purpose. Otherwise, do need to keep as its own column.
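One way the threshold question could be checked empirically, sketched under the assumption that each assessment is available as a (certainty_percent, is_purpose) pair. The function name is hypothetical.

```python
from typing import Iterable, Optional, Tuple


def find_consistent_threshold(
    rows: Iterable[Tuple[float, bool]],
) -> Optional[Tuple[float, float]]:
    """Return the (highest_false, lowest_true) gap if one cutoff separates
    the is_purpose values, or None if no single threshold explains them."""
    rows = list(rows)
    true_values = [certainty for certainty, is_purpose in rows if is_purpose]
    false_values = [certainty for certainty, is_purpose in rows if not is_purpose]
    if not true_values or not false_values:
        return None  # a threshold cannot be inferred from a single class
    highest_false = max(false_values)
    lowest_true = min(true_values)
    if highest_false < lowest_true:
        # Any cutoff inside this gap reproduces is_purpose exactly.
        return (highest_false, lowest_true)
    return None
```

If this returns a gap for the full dataset, is_purpose is derivable from certainty_percent and the column could be dropped. If it returns None, the column carries independent information and needs to stay.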

@willbeason
Member

Change "parse_type" to "source_file_type" since it's not really different parses, but different source files which happen to represent the same paper.
