change uuid to paper_id (and other naming changes) #16
Comments
Now part of the spec; closing this and will export and upload once done with #18.
The "Parse" table above is debatable... won't include for now.
Looking at options, we could create a nested type for PurposeAssessment. Not going to pursue for now.
A Paper can have multiple Parses, each for a different source representation (pdf, jats, xml, etc.). A Parse results in identifying multiple SoftwareMentions (with full text context, raw and normalized name, version, url), each of which has a number of PurposeAssessments (used, created, etc. in both a local and a document context).
Paper has_many SourceFile.
A Paper can have multiple SourceFiles (e.g., pdf, jats, xml), each of which was parsed, resulting in multiple SoftwareMentions (with full text context, raw and normalized name, version, url). Each SoftwareMention has a number of PurposeAssessments, resulting from assessments about whether the software in the mention was used, created, etc. Some assessments draw on only the local context of the single mention, while others draw on all the mentions of that piece of software in the document.
A Paper (doi, title, ...) has many identified SoftwareMentions (software_name, full_text_where_found, ...), each of which has six PurposeAssessments (used, created, shared in local or document context). Papers have metadata like doi, title, journal name.
(used, created, shared in both a local mention and document context)
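One way to read the description above is as a nesting of record types. The following is a minimal Python sketch, not the project's schema: the class names follow the wording in this thread, but the exact field names, the boolean `value` field, the helper `empty_assessments`, and the sample paper are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import List, Optional

# Three purposes in two contexts give the six assessments per mention.
PURPOSES = ("used", "created", "shared")
CONTEXTS = ("mention", "document")


@dataclass
class PurposeAssessment:
    purpose: str   # one of PURPOSES
    context: str   # one of CONTEXTS
    value: bool    # assumed: the outcome of the assessment


@dataclass
class SoftwareMention:
    mention_index: int
    raw_name: str
    normalized_name: str
    full_text_context: str
    version: Optional[str] = None
    url: Optional[str] = None
    purpose_assessments: List[PurposeAssessment] = field(default_factory=list)


@dataclass
class SourceFile:
    source_file_type: str  # e.g. "pdf", "jats", "xml"
    software_mentions: List[SoftwareMention] = field(default_factory=list)


@dataclass
class Paper:
    paper_id: str
    doi: str
    title: str
    source_files: List[SourceFile] = field(default_factory=list)


def empty_assessments() -> List[PurposeAssessment]:
    """Return one (purpose, context) assessment per combination, all False."""
    return [PurposeAssessment(p, c, False) for p, c in product(PURPOSES, CONTEXTS)]


# Hypothetical example data.
mention = SoftwareMention(
    mention_index=0,
    raw_name="R (v4.0)",
    normalized_name="R",
    full_text_context="Analyses were run in R (v4.0).",
    version="4.0",
    purpose_assessments=empty_assessments(),
)
paper = Paper(
    paper_id="paper-a",
    doi="10.0000/example",
    title="An example paper",
    source_files=[SourceFile("pdf", [mention])],
)
```

Nesting the assessments under each mention mirrors the "each of which has six PurposeAssessments" wording; a flat table keyed by mention would carry the same information.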
We should likely have descriptions of the columns stored as metadata. Need to look into how languages other than Go access this.
For ID columns, use the form [table]_id.
A difficult bit here is communicating what the primary key is for a table, as all of these have composite primary keys. We're going to merge these into a single column but keep the original columns as well; that way we have a nice single primary key column. We're keeping the other columns (e.g. paper_id), even though they duplicate information in the new software_mention_id column, as they'll be a common GROUP BY target.
Some open questions:
Change "parse_type" to "source_file_type" since it's not really different parses, but different source files which happen to represent the same paper.
For the parquet files, can we standardize the column names a little?
e.g., mentions uses uuid, but papers uses paperId.
Prefer all column names to be snake_case as well.
And can we change columns like documentContextAttributes.created.value to purpose_created_doc_context and purpose_created_mention_context?
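The renaming requested here could be sketched as a small mapping function. This is only an illustration in Python: the prefix `mentionContextAttributes` is an assumed counterpart to `documentContextAttributes` (only the latter appears in this issue), and the function names are hypothetical.

```python
import re

# Explicit renames that a case conversion alone cannot produce.
EXPLICIT_RENAMES = {"uuid": "paper_id"}

# Maps a nested attribute prefix to the suffix used in the new column name.
CONTEXT_SUFFIXES = {
    "documentContextAttributes": "doc_context",
    "mentionContextAttributes": "mention_context",  # assumed source name
}

# Matches names such as "documentContextAttributes.created.value".
_NESTED = re.compile(r"^(?P<prefix>\w+)\.(?P<purpose>\w+)\.value$")


def to_snake_case(name: str) -> str:
    """Convert camelCase to snake_case, e.g. paperId -> paper_id."""
    return re.sub(r"(?<=[a-z0-9])([A-Z])", r"_\1", name).lower()


def rename_column(name: str) -> str:
    """Return the standardized column name for an existing column."""
    if name in EXPLICIT_RENAMES:
        return EXPLICIT_RENAMES[name]
    match = _NESTED.match(name)
    if match and match.group("prefix") in CONTEXT_SUFFIXES:
        purpose = to_snake_case(match.group("purpose"))
        suffix = CONTEXT_SUFFIXES[match.group("prefix")]
        return f"purpose_{purpose}_{suffix}"
    return to_snake_case(name)
```

Keeping the `uuid` rename in an explicit table makes it obvious which changes are naming-convention fixes and which change the meaning of the column name.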
Although if we are going to follow the ERD, we would have a new table, which raises the question of what the mention_id is; I guess a combination of paper_id and mention_index.
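As a rough sketch of that combination, the snippet below derives a single ID column from paper_id and mention_index while keeping both original columns. The underscore separator, the function name `make_mention_id`, and the sample rows are assumptions for illustration only; the thread does not settle on a format.

```python
from collections import Counter


def make_mention_id(paper_id: str, mention_index: int) -> str:
    """Combine the composite key parts into one string ID (assumed format)."""
    return f"{paper_id}_{mention_index}"


# Hypothetical rows from a mentions table.
rows = [
    {"paper_id": "paper-a", "mention_index": 0, "software_name": "R"},
    {"paper_id": "paper-a", "mention_index": 1, "software_name": "SPSS"},
    {"paper_id": "paper-b", "mention_index": 0, "software_name": "R"},
]

# Add the merged key without dropping the columns it was built from.
for row in rows:
    row["software_mention_id"] = make_mention_id(row["paper_id"], row["mention_index"])

# The merged column is unique per row, so it can act as a single primary key.
ids = [row["software_mention_id"] for row in rows]
assert len(set(ids)) == len(ids)

# paper_id stays available as a grouping column.
mentions_per_paper = Counter(row["paper_id"] for row in rows)
```

Because the original columns are retained, grouping by paper needs no parsing of the merged ID.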