Workflow: Given a list of DOIs, output the subset of data for those papers
Input format:
Newline-delimited list of strings representing DOIs
Output format:
JSONL or Parquet tables (initially just .jsonl.gz)
Internal logic requirements:
Translate DOIs in their various forms to the bare 10.NNNN.N format and match against that, since there is no guarantee which form we'll receive
Support the common valid DOI forms (URL, doi: prefix, bare 10.NNNN.N, and so on), particularly any seen in the dataset
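A minimal sketch of the normalization step, assuming the bare form is `10.<registrant>/<suffix>` and that lowercasing is acceptable for matching (DOIs are case-insensitive). The function name `normalize_doi` and the list of accepted prefixes are illustrative assumptions, not taken from the existing code; the prefix list would need extending with whatever forms actually appear in the dataset.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Prefixes assumed to precede a bare DOI: resolver URLs and "doi:" labels.
_PREFIX_RE = re.compile(r"^(?:https?://(?:dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)

# Bare DOI: "10.", a 4-9 digit registrant code (optionally subdivided), "/", a suffix.
_BARE_DOI_RE = re.compile(r"^10\.\d{4,9}(?:\.\d+)*/\S+$")


def normalize_doi(raw: str) -> Optional[str]:
    """Return the bare, lowercased DOI, or None if the input is not recognizable."""
    text = unquote(raw.strip())       # undo percent-encoding from URL forms
    text = _PREFIX_RE.sub("", text)   # drop a resolver URL or "doi:" prefix
    text = text.lower()               # DOIs compare case-insensitively
    return text if _BARE_DOI_RE.match(text) else None
```

Returning `None` for unrecognized input lets the caller decide whether to skip the line or report it, rather than silently matching nothing.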
New components:
Add command for list of DOIs -> UUIDs
Add command for list of UUIDs -> subset of data
Should keep DOI -> UUID and UUIDs -> subset commands separate so other ways of cutting the data are possible later. Overhead for keeping this separate should be minimal.
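The separation described above could look roughly like the following sketch. Everything here is hypothetical: the function names, the in-memory `doi_index` mapping, and the assumption that each record is a dict carrying a `"uuid"` key. The real commands would read from the dataset's own storage.

```python
import gzip
import json
from typing import BinaryIO, Dict, Iterable, Iterator


def dois_to_uuids(dois: Iterable[str], doi_index: Dict[str, str]) -> Iterator[str]:
    """Stage 1: map already-normalized DOIs to UUIDs, skipping unknown DOIs."""
    for doi in dois:
        key = doi.strip().lower()
        if key in doi_index:
            yield doi_index[key]


def uuids_to_subset(uuids: Iterable[str], records: Iterable[dict]) -> Iterator[dict]:
    """Stage 2: keep only records whose UUID is in the requested set."""
    wanted = {u.strip() for u in uuids if u.strip()}
    for record in records:
        if record.get("uuid") in wanted:
            yield record


def write_jsonl_gz(records: Iterable[dict], fileobj: BinaryIO) -> int:
    """Write records as gzip-compressed JSON Lines; return the number written."""
    count = 0
    with gzip.open(fileobj, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Because stage 2 only consumes an iterable of UUID strings, any other selector (by year, by journal, and so on) can feed it without touching the subset logic.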
Intermediate format should probably be newline-delimited UUID strings. The worst case for that is a result file of ~828 MB (uncompressed), while the best case for a binary format is ~351 MB (uncompressed), so optimizing the format is unnecessary.
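As a rough sanity check on that comparison, the per-identifier cost of each encoding can be measured directly. This sketch assumes the text form is the canonical hyphenated UUID plus a newline and the binary form is the raw 16 bytes. It does not attempt to reproduce the totals quoted above, which depend on the record count.

```python
import uuid

# Arbitrary example value; any UUID has the same encoded lengths.
example = uuid.UUID("12345678-1234-5678-1234-567812345678")

# Text form: 36 characters of canonical UUID plus one newline.
text_bytes_per_id = len(f"{example}\n".encode("ascii"))

# Binary form: the raw 128-bit value.
binary_bytes_per_id = len(example.bytes)

ratio = text_bytes_per_id / binary_bytes_per_id
print(text_bytes_per_id, binary_bytes_per_id, round(ratio, 2))
```

The text form costs a little over twice as much per identifier, which is in the same range as the ratio between the two figures above and supports keeping the simpler format.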