Make tool for creating subset of SoftCite by list of DOIs #14

willbeason · 2024-10-14T20:44:41Z

Workflow: Given a list of DOIs, output the subset of data for those papers

Input format:

Newline-delimited list of strings representing DOIs

Output format:

JSONL or Parquet tables (initially just .jsonl.gz)

Internal logic requirements:

Translate DOIs in their various forms to the 10.NNNN.N format and match against that, since there's no guarantee which specific format we'll get
Support common valid DOI formats (URL, doi:, 10.NNNN.N, and so on), particularly any seen in the dataset
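
A minimal sketch of the normalization step described above. The function name `normalize_doi`, the prefix list, and the regular expression are illustrative assumptions, not the project's code; the prefix list in particular would need extending to cover whatever forms actually appear in the dataset. DOIs are case-insensitive, so lowercasing gives a stable match key.

```python
import re
from typing import Optional
from urllib.parse import unquote

# Prefixes to strip (assumed, non-exhaustive). Compared against lowercased input.
_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi.org/",
    "dx.doi.org/",
    "doi:",
)

# Bare DOI shape assumed here: "10.", a 4-9 digit registrant code,
# optional dotted subdivisions, "/", then a non-empty suffix without whitespace.
_BARE_DOI = re.compile(r"^10\.\d{4,9}(\.\d+)*/\S+$")


def normalize_doi(raw: str) -> Optional[str]:
    """Return the bare, lowercased DOI, or None if the input does not look like one."""
    # Decode percent-escapes (e.g. %2F) that show up in URL forms, then lowercase.
    text = unquote(raw.strip()).lower()
    for prefix in _PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
            break
    return text if _BARE_DOI.match(text) else None
```

Returning `None` for unrecognized input lets the caller decide whether to skip, log, or fail on lines that are not DOIs.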

New components:

Add command for list of DOIs -> UUIDs
Add command for list of UUIDs -> subset of data

Should keep DOI -> UUID and UUIDs -> subset commands separate so other ways of cutting the data are possible later. Overhead for keeping this separate should be minimal.
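
The separation of the two commands could be sketched roughly as below. Everything here is hypothetical: the function names, the in-memory `doi_index` mapping, and the `paper_id` field name are stand-ins, since the issue does not specify how the dataset is indexed or what its records look like. The point is only that the second step depends on a set of UUIDs and nothing else, so other selectors can feed it later.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator, List, Set


def dois_to_uuids(dois: Iterable[str], doi_index: Dict[str, str]) -> List[str]:
    """Step 1: map normalized DOIs to paper UUIDs, skipping DOIs not in the index."""
    return [doi_index[d] for d in dois if d in doi_index]


def select_records(
    records: Iterable[dict], wanted: Set[str], key: str = "paper_id"
) -> Iterator[dict]:
    """Step 2: yield only the records whose UUID field (assumed name) is wanted."""
    for record in records:
        if record.get(key) in wanted:
            yield record


def write_jsonl_gz(records: Iterable[dict], path: str) -> int:
    """Write records as gzip-compressed JSON Lines; return the number written."""
    count = 0
    with gzip.open(path, "wt", encoding="utf-8") as out:
        for record in records:
            out.write(json.dumps(record) + "\n")
            count += 1
    return count
```

Because `select_records` is a generator, the subset is streamed to the output file rather than held in memory.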

willbeason · 2024-10-14T20:48:46Z

Intermediate format should probably be newline-delimited UUID strings. Worst case is a result file of ~828 MB (uncompressed), and best case for a binary format is ~351 MB (uncompressed) so format optimization is unnecessary.
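
For illustration, reading and writing that intermediate format could look like the sketch below, using Python's standard `uuid` module. The function names are assumptions. The canonical text form of a UUID is 36 characters plus a newline per line, compared with 16 bytes in raw binary.

```python
import uuid
from typing import Iterable, Iterator, TextIO


def write_uuids(ids: Iterable[uuid.UUID], out: TextIO) -> None:
    """Write one canonical hyphenated UUID string per line."""
    for item in ids:
        out.write(f"{item}\n")


def read_uuids(src: TextIO) -> Iterator[uuid.UUID]:
    """Parse newline-delimited UUID strings, skipping blank lines.

    uuid.UUID raises ValueError on a malformed line, so bad input fails loudly.
    """
    for line in src:
        line = line.strip()
        if line:
            yield uuid.UUID(line)
```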

willbeason self-assigned this Oct 14, 2024
