How to load bulk data #3173

quevon24 · 2023-09-22T23:25:04Z

quevon24
Sep 22, 2023
Maintainer

In this discussion i will explain in more detail how to load the bulk data and the problems you may face while loading them.

The first thing you need to do is to load the schema, this will automatically create the required database and tables. To do that simply run:

psql -f /opt/bulk/schema-2023-08-31.sql -p 5432 -U postgres

In my case my user is postgres and it is running on port 5432

You may see some errors like:

ERROR:  role "django" does not exist
ERROR:  function public.pgtrigger_update_or_delete_snapshot_update_ef1b8() does not exist
psql:/opt/bulk/schema-2023-08-31.sql:7954: 
ERROR:  trigger "pgtrigger_update_or_delete_snapshot_update_ef1b8" for table "people_db_retentionevent" does not exist

These errors are normal because some data is not dumped in the schema(functions, triggers, roles, etc), we don't pass any additional argument to psql to ignore any errors that may have occurred during the restoration process.

Now to load bulk data, first you need to load some data, this data is required to import the dockets, clusters, opinions and financial disclosures, people needs to be loaded in the order presented:

courts
people-db-people
people-db-school
people-db-positions
originating-court-information
fjc-integrated-database

Courts:
COPY public.search_court (id, pacer_court_id, pacer_has_rss_feed, pacer_rss_entry_types, fjc_court_id, date_modified, in_use, has_opinion_scraper, has_oral_argument_scraper, position, citation_string, short_name, full_name, url, start_date, end_date, jurisdiction, notes) FROM '/opt/bulk/courts-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

people-db-people:
COPY public.people_db_person (id, date_created, date_modified, date_completed, fjc_id, slug, name_first, name_middle, name_last, name_suffix, date_dob, date_granularity_dob, date_dod, date_granularity_dod, dob_city, dob_state, dob_country, dod_city, dod_state, dod_country, gender, religion, ftm_total_received, ftm_eid, has_photo, is_alias_of_id) FROM '/opt/bulk/people-db-people-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

people-db-school
COPY public.people_db_school (id, date_created, date_modified, name, ein, is_alias_of_id) FROM '/opt/bulk/people-db-schools-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

people-db-positions
COPY public.people_db_position (id, date_created, date_modified, position_type, job_title,sector, organization_name, location_city, location_state,date_nominated, date_elected, date_recess_appointment,date_referred_to_judicial_committee, date_judicial_committee_action,judicial_committee_action, date_hearing, date_confirmation, date_start,date_granularity_start, date_termination, termination_reason,date_granularity_termination, date_retirement, nomination_process, vote_type,voice_vote, votes_yes, votes_no, votes_yes_percent, votes_no_percent, how_selected, has_inferred_values, appointer_id, court_id, person_id, predecessor_id, school_id, supervisor_id) FROM '/opt/bulk/people-db-positions-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

originating-court-information
COPY search_originatingcourtinformation (id, date_created, date_modified, docket_number, assigned_to_str, ordering_judge_str, court_reporter, date_disposed, date_filed, date_judgment, date_judgment_eod, date_filed_noa, date_received_coa, assigned_to_id, ordering_judge_id) FROM '/opt/bulk/originating-court-information-2023-09-22.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)

fjc-integrated-database
COPY public.recap_fjcintegrateddatabase (id, date_created, date_modified, dataset_source, office, docket_number, origin, date_filed, jurisdiction, nature_of_suit, title, section, subsection, diversity_of_residence, class_action, monetary_demand, county_of_residence, arbitration_at_filing, arbitration_at_termination, multidistrict_litigation_docket_number, plaintiff, defendant, date_transfer, transfer_office, transfer_docket_number, transfer_origin, date_terminated, termination_class_action_status, procedural_progress, disposition, nature_of_judgement, amount_received, judgment, pro_se,year_of_tape, nature_of_offense, version, circuit_id, district_id) FROM '/opt/bulk/fjc-integrated-database-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

Now you can start loading dockets, clusters, oral arguments, opinions and financial disclosures.

Docket:
COPY public.search_docket (id, date_created, date_modified, source, appeal_from_str, assigned_to_str, referred_to_str, panel_str, date_last_index, date_cert_granted, date_cert_denied, date_argued, date_reargued, date_reargument_denied, date_filed, date_terminated, date_last_filing, case_name_short, case_name, case_name_full, slug, docket_number, docket_number_core, pacer_case_id, cause, nature_of_suit, jury_demand, jurisdiction_type, appellate_fee_status, appellate_case_type_information, mdl_status, filepath_local, filepath_ia, filepath_ia_json, ia_upload_failure_count, ia_needs_upload, ia_date_first_change, view_count, date_blocked, blocked, appeal_from_id, assigned_to_id, court_id, idb_data_id, originating_court_information_id, referred_to_id) FROM '/opt/bulk/dockets-2023-08-31_0.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

Clusters: (Requires the dockets to already be loaded)
COPY public.search_opinioncluster (id, date_created, date_modified, judges, date_filed, date_filed_is_approximate, slug, case_name_short, case_name, case_name_full, scdb_id, scdb_decision_direction, scdb_votes_majority, scdb_votes_minority, source, procedural_history, attorneys, nature_of_suit, posture, syllabus, headnotes, summary, disposition, history, other_dates, cross_reference, correction, citation_count, precedential_status, date_blocked, blocked, filepath_json_harvard, docket_id, arguments, headmatter) FROM '/opt/bulk/opinion-clusters-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)

oral-arguments (Requires the dockets to already be loaded)
COPY public.audio_audio (id, date_created, date_modified, source, case_name_short, case_name, case_name_full, judges, sha1, download_url, local_path_mp3, local_path_original_file, filepath_ia, ia_upload_failure_count, duration, processing_complete, date_blocked, blocked, stt_status, stt_google_response, docket_id) FROM '/opt/bulk/oral-arguments-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'

Opinions: (Requires the clusters to already be loaded)
COPY public.search_opinion (id, date_created, date_modified, author_str, per_curiam, joined_by_str, type, sha1, page_count, download_url, local_path, plain_text, html, html_lawbox, html_columbia, html_anon_2020, xml_harvard, html_with_citations, extracted_by_ocr, author_id, cluster_id) FROM '/opt/bulk/opinions-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

Financial disclosures
COPY public.disclosures_financialdisclosure (id, date_created, date_modified, year, download_filepath, filepath, thumbnail, thumbnail_status, page_count, sha1, report_type, is_amended, addendum_content_raw, addendum_redacted, has_been_extracted, person_id) FROM '/opt/bulk/financial-disclosures-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)

Now from here we can load the remaining data:

Requires clusters to already be loaded:

citations
COPY public.search_citation (id, volume, reporter, page, type, cluster_id) FROM '/opt/bulk/citations-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'

Requires opinions to already be loaded:

citation-map
COPY public.search_opinionscited (id, depth, cited_opinion_id, citing_opinion_id) FROM '/opt/bulk/citation-map-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'
parentheticals
COPY public.search_parenthetical (id, text, score, described_opinion_id, describing_opinion_id, group_id) FROM '/opt/bulk/parentheticals-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'

Requires people to already be loaded:

people-db-retention-events
COPY public.people_db_retentionevent (id, date_created, date_modified, retention_type, date_retention, votes_yes, votes_no, votes_yes_percent, votes_no_percent, unopposed, won, position_id) FROM '/opt/bulk/people-db-retention-events-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'
people-db-educations
COPY people_db_education (id, date_created, date_modified, degree_level, degree_detail, degree_year, person_id, school_id) FROM '/opt/bulk/people-db-educations-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)

Requires financial disclosures to already be loaded:

financial-disclosure-investments
COPY public.disclosures_investment (id, date_created, date_modified, page_number, description, redacted, income_during_reporting_period_code, income_during_reporting_period_type, gross_value_code, gross_value_method, transaction_during_reporting_period, transaction_date_raw, transaction_date, transaction_value_code, transaction_gain_code, transaction_partner, has_inferred_values, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-investments-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)'
financial-disclosures-positions
COPY public.disclosures_position (id, date_created, date_modified, position, organization_name,redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-positions-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-agreements
COPY public.disclosures_agreement (id, date_created, date_modified, date_raw, parties_and_terms,redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-agreements-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-non-investment-income
COPY public.disclosures_noninvestmentincome (id, date_created, date_modified, date_raw, source_type,income_amount, redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-non-investment-income-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-spousal-income
COPY public.disclosures_spouseincome (id, date_created, date_modified, source_type, date_raw, redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-spousal-income-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-reimbursements
COPY public.disclosures_reimbursement (id, date_created, date_modified, source, date_raw, location, purpose, items_paid_or_provided, redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-reimbursements-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-gifts
COPY public.disclosures_gift (id, date_created, date_modified, source, description, value, redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-gifts-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)
financial-disclosures-debts
COPY public.disclosures_debt (id, date_created, date_modified, creditor_name, description,value_code, redacted, financial_disclosure_id) FROM '/opt/bulk/financial-disclosure-debts-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER)

NOTES

These commands might change over time because new fields are added to the bulk script or maybe new tables, you can use the list of fields for each table from the script: make_bulk_data.sh
The reason why psql \copy is not used to load the data is described in this issue: Error importing opinions.csv into Postgres: unterminated CSV quoted field #3152
Tables are exported one at a time, outside of a transaction block. This means that the data can change in between table exports, causing occasional problems. You can read more about this in issue: Missing data in search_opinion #3162

Any comments are welcome 😄

mlissner · 2023-09-23T18:00:00Z

mlissner
Sep 23, 2023
Maintainer

This is great, thanks Kevin. Could we generate and export a script each month that has these commands in it? I'm guessing that wouldn't be too hard, and then people could just run import_bulk_data.sh after downloading the right files? How's your bash scripting?? :)

0 replies

LexTrex · 2023-12-03T19:56:58Z

LexTrex
Dec 3, 2023

Hi - This is incredibly helpful. I was able to follow these instructions up to "originating-court-information." For some reason I'm not seeing that file listed on the bulk data page for court.listener:
https://com-courtlistener-storage.s3-us-west-2.amazonaws.com/list.html?prefix=bulk-data/

Am I looking in the wrong place or was that not exported?

1 reply

mlissner Dec 4, 2023
Maintainer

No, you were looking in the right place, but the export was broken (bad docker build and old dependencies). It's fixed now and I'm doing a data export by hand now. Expect fresh data in about 24 hours.

legaltextai · 2024-03-20T14:48:23Z

legaltextai
Mar 20, 2024
Collaborator

some sql commands need to be updated as the files have extra columns, e.g. 'courts' has a new 'parent_court_id' column

1 reply

mlissner Mar 20, 2024
Maintainer

Thanks. Want to do a PR?

legaltextai · 2024-03-20T17:13:47Z

legaltextai
Mar 20, 2024
Collaborator

I 'd be happy to but I don't know what PR is

5 replies

mlissner Mar 20, 2024
Maintainer

Ha, a pull request. In other words, if the code is no good, you could tweak it and suggest the fix for it?

legaltextai Mar 20, 2024
Collaborator

wait, there is a code? I only suggested to add 'parent_court_id' to the text above:

Courts:
COPY public.search_court (id, pacer_court_id, pacer_has_rss_feed, pacer_rss_entry_types, fjc_court_id, date_modified, in_use, has_opinion_scraper, has_oral_argument_scraper, position, citation_string, short_name, full_name, url, start_date, end_date, jurisdiction, notes, parent_court_id) FROM '/opt/bulk/courts-2023-08-31.csv' WITH (FORMAT csv, ENCODING utf8, HEADER);

do you want me to review all other files to see if the headers are still correct?

mlissner Mar 20, 2024
Maintainer

Well, it's in the repo, you could use the code search here on Github to look it up or I can just assign it to somebody. Sounds like the latter might be easiest, but if you're interested in making an open source code contribution, a PR would be the way to do that.

legaltextai Mar 20, 2024
Collaborator

mlissner Mar 20, 2024
Maintainer

Yeah, there's some wonkiness: https://github.com/freelawproject/courtlistener/blob/main/scripts/make_bulk_data.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to load bulk data #3173

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 4 comments 7 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{title}}

Select a reply

How to load bulk data #3173

quevon24 Sep 22, 2023 Maintainer

Replies: 4 comments · 7 replies

mlissner Sep 23, 2023 Maintainer

LexTrex Dec 3, 2023

mlissner Dec 4, 2023 Maintainer

legaltextai Mar 20, 2024 Collaborator

mlissner Mar 20, 2024 Maintainer

legaltextai Mar 20, 2024 Collaborator

mlissner Mar 20, 2024 Maintainer

legaltextai Mar 20, 2024 Collaborator

mlissner Mar 20, 2024 Maintainer

legaltextai Mar 20, 2024 Collaborator

mlissner Mar 20, 2024 Maintainer

quevon24
Sep 22, 2023
Maintainer

Replies: 4 comments 7 replies

mlissner
Sep 23, 2023
Maintainer

LexTrex
Dec 3, 2023

mlissner Dec 4, 2023
Maintainer

legaltextai
Mar 20, 2024
Collaborator

mlissner Mar 20, 2024
Maintainer

legaltextai
Mar 20, 2024
Collaborator

mlissner Mar 20, 2024
Maintainer

legaltextai Mar 20, 2024
Collaborator

mlissner Mar 20, 2024
Maintainer

legaltextai Mar 20, 2024
Collaborator

mlissner Mar 20, 2024
Maintainer