Merge pull request #1448 from MIT-LCP/mimiciii_postgres_concepts

Include postgres MIMIC-III concepts
MIT-LCP · Dec 9, 2022 · 1ff562b · 1ff562b
2 parents 79eb377 + f8ceb8b
commit 1ff562b
Show file tree

Hide file tree

Showing 92 changed files with 13,912 additions and 350 deletions.
diff --git a/mimic-iii/concepts/README.md b/mimic-iii/concepts/README.md
@@ -10,76 +10,34 @@ You can read about cloud access to MIMIC-III, including via Google BigQuery, on
 The rest of this README describes:
 
 * [Generating the concepts in BigQuery](#generating-the-concepts-in-bigquery)
-* [Generating the concepts in PostgreSQL (\*nix/Mac OS X)](#generating-the-concepts-in-postgresql-nix-mac-os-x)
-* [Generating the concepts in PostgreSQL (Windows)](#generating-the-concepts-in-postgresql-windows)
+* [Generating the concepts in PostgreSQL](#generating-the-concepts-in-postgresql)
 
 ## Generating the concepts in BigQuery
 
 You do not need to generate the concepts if you are using BigQuery! They have already been generated for you. If you have access to MIMIC-III on BigQuery, look under `physionet-data.mimic_derived`. If you would like to generate the concepts again, for example on your own dataset, you must modify the `TARGET_DATASET` variable within the [make-concepts.sh](/concepts/make-concepts.sh) script. The script assumes you have installed and configured the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).
 
-## Generating the concepts in PostgreSQL (\*nix/Mac OS X)
+## Generating the concepts in PostgreSQL
+
+### Quickstart
+
+Go to the [concepts_postgres](../concepts_postgres) folder, run the [postgres-functions.sql](../concepts_postgres/postgres-make-concepts.sql) and [postgres-make-concepts.sql](../concepts_postgres/postgres-make-concepts.sql) scripts, in that order.
+
+### In more detail
 
 While the SQL scripts here are written in BigQuery's Standard SQL syntax, there are many BigQuery specific functions which do not carry over to PostgreSQL. Nevertheless, with only a few changes, the scripts can be made compatible. In order to generate the concepts on a PostgreSQL database, one must:
 
 * create postgres functions which emulate BigQuery functions
 * modify SQL scripts for incompatible syntax
 * run the modified SQL scripts and direct the output into tables in the PostgreSQL database
 
-This can be done as follows:
-
-1. Open a terminal in the `concepts` folder.
-2. Run [postgres-functions.sql](postgres-functions.sql).
-    * e.g. `psql -f postgres-functions.sql`
-    * This script creates functions which emulate BigQuery syntax.
-3. Run [postgres_make_concepts.sh](postgres_make_concepts.sh).
-    * e.g. `bash postgres_make_concepts.sh`
-    * This file runs the scripts after applying a few regular expressions which convert table references and date calculations appropriately.
-    * This file generates all concepts on the `public` schema.
-    * Exporting DBCONNEXTRA before calling this script will add this to the
-        connection string.  For example, running:
-        `DBCONNEXTRA="user=mimic password=mimic" bash postgres_make_concepts.sh`
-        will add these settings to all of the psql calls.  (Note that "dbname"
-        and "search_path" do not need to be set.)
-
-If you do not have access to a PostgreSQL database with MIMIC, you can read more about building the data within one in the [buildmimic/postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/postgres) folder.
-
-## Generating the concepts in PostgreSQL (Windows)
-
-On Windows, it is a bit more complex to generate the concepts in the PostgreSQL database. The approach relies on using \*nix command line tools which are not available by default in a Windows installation. Instead, we have adapted the script into a `.bat` file which relies on the Windows Subsystem for Linux in order to run the shell commands. The steps are:
-
-1. Install the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
-    * If you don't have a preference, follow the steps to install a Ubuntu system. The bat file was tested with Ubuntu, though the commands should work with any flavor of \*nix since we rely on the utils rather than the kernel.
-2. Verify you can use the wsl.exe utilities in command prompt.
-    * Go to run and type `cmd`, or type "command prompt" in the search.
-    * Run `wsl.exe echo "hi"` - this should print out `hi` back to you
-3. Change to your local folder where these concepts are stored
-    * e.g. `cd C:\Tools\mimic-code-master\concepts`
-4. Modify the .bat file: update the `CONNSTR` and `PSQL_PATH` variables.
-    * Replace `INSERT_PASSWORD_HERE` in `CONNSTR` with your password; or remove it if you have a `.pgpass` file or other form of authentication. If you have a different username or database location, be sure to update those as well.
-    * Change `PSQL_PATH` to point to your `psql.exe` file. It is currently set to the default location for a PostgreSQL 13 installation.
-5. Run the .bat file
-    * In the command prompt, type `postgres_make_concepts_windows.bat`
-
-The script echos the commands and the outputs as they run. If it is running successfully, you should see a `SELECT` statement after each command, with the number of rows generated in the table.
-
-### Can I just do the above manually without WSL?
-
-Of course! And this might be more informative.
-
-First, generate the necessary functions as above, by running `postgres-functions.sql` in the SQL shell.
-Once that's done, you need to do the following text replacements in all the SQL files:
-
-1. Replace ````physionet-data.mimiciii_clinical.<table_name>````, ````physionet-data.mimiciii_derived.<table_name>```` , and ````physionet-data.mimiciii_notes.<table_name>````  with just `<table_name>`.
-    * This is done by the `REGEX_SCHEMA` variable in the `postgres_make_concepts.sh` script.
-    * Ideally you should set your search path with `set search_path to public,mimiciii;`. This will create the concepts on `public`, and read data from `mimiciii`. This distinction isn't strictly necessary, but many find it useful.
-2. Replace `DATETIME_DIFF(date1, date2, DATE_PART)` with `DATETIME_DIFF(date1, date2, 'DATE_PART')`.
-    * This adds single quotes around any `DATE_PART`, which is required by PostgreSQL.
-    * This is done by the `REGEX_DATETIME_DIFF ` variable in the `postgres_make_concepts.sh` script.
-3. Add a create table statement at the top of the file, e.g. if the file is named `echo_data.sql`, add `CREATE TABLE echo_data AS` at the top of the file.
-    * This is done by the `echo` calls in the shell script.
-4. Run each file individually in the order specified by the make concepts script.
-
-The above steps replicate what is done in the shell script (postgres_make_concepts.sh).
+The bash script [convert_mimiciii_concepts_bq_to_psql.sh](/convert_mimiciii_concepts_bq_to_psql.sh) has done most of this for you. To generate concepts in PostgreSQL, simply go to the [concepts_postgres](../concepts_postgres) folder and run:
+
+```sh
+\i postgres-functions.sql
+\i postgres-make-concepts.sql
+```
+
+You can also read more about building the data within PostgreSQL in the [buildmimic/postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/postgres) folder.
 
 ## List of concepts
 
@@ -190,4 +148,4 @@ Useful snippets of SQL implementing common functions. For example, the `auroc.sq
 
 ## other-languages
 
-Scripts in flavours of SQL which are not necessarily compatible with PostgreSQL.
+Scripts in flavours of SQL which are not compatible with BigQuery/PostgreSQL.
diff --git a/mimic-iii/concepts/convert_mimiciii_concepts_bq_to_psql.sh b/mimic-iii/concepts/convert_mimiciii_concepts_bq_to_psql.sh
@@ -0,0 +1,158 @@
+#!/bin/bash
+# This shell script converts BigQuery .sql files into PostgreSQL .sql files.
+
+# path in which we create the postgres concepts
+TARGET_PATH='../concepts_postgres'
+mkdir -p $TARGET_PATH
+
+# String replacements are necessary for some queries.
+
+# Schema replacement: change `physionet-data.<dataset>.<table>` to just <table> (with no backticks)
+export REGEX_SCHEMA='s/`physionet-data.(mimiciii_clinical|mimiciii_derived|mimiciii_notes).([A-Za-z0-9_-]+)`/\2/g'
+# Note that these queries are very senstive to changes, e.g. adding whitespaces after comma can already change the behavior.
+export REGEX_DATETIME_DIFF="s/DATETIME_DIFF\(([^,]+), ?(.*), ?(DAY|MINUTE|SECOND|HOUR|YEAR)\)/DATETIME_DIFF(\1, \2, '\3')/g"
+export REGEX_DATETIME_TRUNC="s/DATETIME_TRUNC\(([^,]+), ?(DAY|MINUTE|SECOND|HOUR|YEAR)\)/DATE_TRUNC('\2', \1)/g"
+# Add necessary quotes to INTERVAL, e.g. "INTERVAL 5 hour" to "INTERVAL '5' hour"
+export REGEX_INTERVAL="s/interval ([[:digit:]]+) (hour|day|month|year)/INTERVAL '\1' \2/gI"
+# Specific queries for some problems that arose with some files.
+export REGEX_INT="s/CAST\(hr AS INT64\)/CAST\(hr AS bigint\)/g"
+export REGEX_ARRAY="s/GENERATE_ARRAY\(-24, CEIL\(DATETIME\_DIFF\(it\.outtime_hr, it\.intime_hr, HOUR\)\)\)/ARRAY\(SELECT \* FROM generate\_series\(-24, CEIL\(DATETIME\_DIFF\(it\.outtime_hr, it\.intime_hr, HOUR\)\)\)\)/g"
+export REGEX_HOUR_INTERVAL="s/INTERVAL CAST\(hr AS INT64\) HOUR/interval \'1\' hour * CAST\(hr AS bigint\)/g"
+export REGEX_SECONDS="s/SECOND\)/\'SECOND\'\)/g"
+
+# tables we want to run before all other concepts
+# usually because they are used as dependencies
+DIR_AND_TABLES_TO_PREBUILD='demographics.icustay_times demographics.icustay_hours .echo_data .code_status .rrt durations.weight_durations fluid_balance.urine_output organfailure.kdigo_uo'
+
+# tables which are written directly in postgresql and source code controlled
+# this is usually because there is no trivial conversion between bq/psql syntax
+DIR_AND_TABLES_ALREADY_IN_PSQL='demographics.icustay_times demographics.icustay_hours demographics.note_counts diagnosis.ccs_dx'
+
+# tables which we want to run after all other concepts
+# usually because they depend on one or more other queries
+DIR_AND_TABLES_TO_SKIP=''
+
+# First, we re-create the postgres-make-concepts.sql file.
+echo "\echo ''" > $TARGET_PATH/postgres-make-concepts.sql
+
+# Now we add some preamble for the user running the script.
+echo "\echo '==='" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "\echo 'Beginning to create materialized views for MIMIC database.'" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "\echo '"'Any notices of the form  "NOTICE: materialized view "XXXXXX" does not exist" can be ignored.'"'" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "\echo 'The scripts drop views before creating them, and these notices indicate nothing existed prior to creating the view.'" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "\echo '==='" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "\echo ''" >> $TARGET_PATH/postgres-make-concepts.sql
+
+# ======================================== #
+# === CONCEPTS WHICH WE MUST RUN FIRST === #
+# ======================================== #
+echo -n "Dependencies:"
+
+# output table creation calls to the make-concepts script
+echo "" >> $TARGET_PATH/postgres-make-concepts.sql
+echo "-- dependencies" >> $TARGET_PATH/postgres-make-concepts.sql
+
+for dir_and_table in $DIR_AND_TABLES_TO_PREBUILD;
+do
+  d=`echo ${dir_and_table} | cut -d. -f1`
+  tbl=`echo ${dir_and_table} | cut -d. -f2`
+
+  if [[ $d == '' ]]; then
+    d='.'
+  fi
+
+  # make the sub-folder for postgres if it does not exist
+  mkdir -p "$TARGET_PATH/${d}"
+
+  # convert the bigquery script to psql and output it to the appropriate subfolder
+  echo -n " ${d}.${tbl} .."
+
+  # re-write the script into psql using regex
+  # the if statement ensures we do not overwrite tables which are already written in psql
+  if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
+    echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
+    echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
+    cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${tbl}.sql"
+  fi
+
+  # write out a call to this script in the make concepts file
+  echo "\i ${d}/${tbl}.sql" >> $TARGET_PATH/postgres-make-concepts.sql
+done
+echo " done!"
+
+# ================================== #
+# === MAIN LOOP FOR ALL CONCEPTS === #
+# ================================== #
+
+# Iterate through each concept subfolder, and:
+# (1) apply the above regular expressions to update the script
+# (2) output to the postgres subfolder
+# (3) add a line to the postgres-make-concepts.sql script to generate this table
+
+#  organfailure.kdigo_stages firstday.first_day_sofa sepsis.sepsis3 medication.vasoactive_agent medication.norepinephrine_equivalent_dose
+
+# the order *only* matters during the conversion step because our loop is
+# inserting table build commands into the postgres-make-concepts.sql file
+for d in durations comorbidity demographics firstday fluid_balance sepsis diagnosis organfailure severityscores;
+do
+    mkdir -p "$TARGET_PATH/${d}"
+    echo -n "${d}:"
+    echo "" >> $TARGET_PATH/postgres-make-concepts.sql
+    echo "-- ${d}" >> $TARGET_PATH/postgres-make-concepts.sql
+    for fn in `ls $d`;
+    do
+        # only run SQL queries
+        if [[ "${fn: -4}" == ".sql" ]]; then
+            # table name is file name minus extension
+            tbl="${fn%????}"
+            echo -n " ${tbl} "
+
+            if [[ "$DIR_AND_TABLES_TO_PREBUILD" =~ "$d.$tbl" ]]; then
+              echo -n "(exists!) .."
+              continue
+            elif [[ "$DIR_AND_TABLES_TO_SKIP" =~ "$d.$tbl" ]]; then
+              echo -n "(skipping!) .."
+              continue
+            else
+              echo -n ".."
+            fi
+
+            # re-write the script into psql using regex
+            # the if statement ensures we do not overwrite tables which are already written in psql
+            if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
+              echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
+              echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
+              cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${fn}"
+            fi
+
+            # add statement to generate this table to make concepts script
+            echo "\i ${d}/${fn}" >> ${TARGET_PATH}/postgres-make-concepts.sql
+        fi
+    done
+    echo " done!"
+done
+
+# finally generate first_day_sofa which depends on concepts in firstday folder
+echo "" >> ${TARGET_PATH}/postgres-make-concepts.sql
+echo "-- final tables which were dependent on one or more prior tables" >> ${TARGET_PATH}/postgres-make-concepts.sql
+
+echo -n "final:"
+for dir_and_table in $DIR_AND_TABLES_TO_SKIP
+do
+  d=`echo ${dir_and_table} | cut -d. -f1`
+  tbl=`echo ${dir_and_table} | cut -d. -f2`
+
+  # make the sub-folder for postgres if it does not exist
+  mkdir -p "$TARGET_PATH/${d}"
+
+  # convert the bigquery script to psql and output it to the appropriate subfolder
+  echo -n " ${d}.${tbl} .."
+  if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
+    echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
+    echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
+    cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${fn}"
+  fi
+  # write out a call to this script in the make concepts file
+  echo "\i ${d}/${tbl}.sql" >> $TARGET_PATH/postgres-make-concepts.sql
+done
+echo " done!"
diff --git a/mimic-iii/concepts/diagnosis/README.md b/mimic-iii/concepts/diagnosis/README.md
@@ -1,48 +1,4 @@
-The Clinical Classification Software (CCS) categorizes ICD-9 coded diagnoses into clinically meaningful groups. The categorization was developed by the Agency for Healthcare Research and Quality (AHRQ). More detail can be found on the AHRQ website: https://www.hcup-us.ahrq.gov/tools_software.jsp
+# ccs_dx
 
-This folder contains:
-
-* `ccs_diagnosis_table.sql` - Creates two tables: `ccs_single_level_dx` and `ccs_multi_level_dx`. These two tables are loaded from `ccs_single_level_dx.csv.gz` and `ccs_multi_level_dx.csv.gz`. Note that the script assumes you are using PostgreSQL v9.4 or later, and you must execute the script from this directory.
-
-## Creation of the ccs_multi_level file
-
-Download the original file from CCS: 
-
-```
-wget https://www.hcup-us.ahrq.gov/toolssoftware/ccs/Multi_Level_CCS_2015.zip
-```
-
-Unzip to a folder.
-
-```
-unzip Multi_Level_CCS_2015.zip
-```
-
-Use Python to convert all apostrophes in `ccs_multi_dx_tool_2015.csv` into double quotes (the file mixed apostrophes/double quotes as field encapsulators):
-
-```python
-import pandas as pd
-df = pd.read_csv('ccs_multi_dx_tool_2015.csv.gz')
-# remove apostrophes from header names and relabel
-df.rename(columns={"'ICD-9-CM CODE'": "icd9_code", "'CCS LVL 1'": "ccs_level1", "'CCS LVL 1 LABEL'": "ccs_group1", "'CCS LVL 2'": "ccs_level2", "'CCS LVL 2 LABEL'": "ccs_group2", "'CCS LVL 3'": "ccs_level3", "'CCS LVL 3 LABEL'": "ccs_group3", "'CCS LVL 4'": "ccs_level4", "'CCS LVL 4 LABEL'": "ccs_group4", }, inplace=True)
-
-def remove_surrounding_apostrophes(x):
-    if x[0] == "'":
-        x = x[1:]
-    if x[-1] == "'":
-        x = x[:-1]
-    return x
-
-for c in df.columns:
-    df[c] = df[c].map(remove_surrounding_apostrophes)
-    idxRemove = df[c].str.strip() == ''
-    if idxRemove.any():
-        df.loc[idxRemove, c] = None
-
-# write to file
-df.to_csv('ccs_multi_dx.csv.gz', index=False, compression='gzip')
-```
-
-(above run with Python 3.7 and pandas 0.23.2).
-
-Now the SQL script can be run (`ccs_diagnosis_table.sql`). The `ccs_multi_dx.csv.gz` file generated by this process is available in the repo, so the above process is just for documentation, and does not necessarily have to be re-run.
+The `ccs_multi_dx.csv.gz` data file must be uploaded to `physionet-data.mimiciii_derived.ccs_multi_dx`.
+The data file is available in [../concepts_postgres/diagnosis](../concepts_postgres/diagnosis). The BigQuery schema definition is available in this folder as [ccs_multi_dx.json](/ccs_multi_dx.json).
diff --git a/mimic-iii/concepts/durations/neuroblock_dose.sql b/mimic-iii/concepts/durations/neuroblock_dose.sql
@@ -7,11 +7,11 @@ with drugmv as
 (
   select
       icustay_id, orderid
-    , rate as vaso_rate
-    , amount as vaso_amount
+    , rate as drug_rate
+    , amount as drug_amount
     , starttime
     , endtime
-  from inputevents_mv
+  from `physionet-data.mimiciii_clinical.inputevents_mv`
   where itemid in
   (
       222062 -- Vecuronium (664 rows, 154 infusion rows)
@@ -41,7 +41,7 @@ with drugmv as
             when itemid >= 40000 then coalesce(rate, amount)
           else rate end) as drug_rate
     , max(amount) as drug_amount
-  from inputevents_cv
+  from `physionet-data.mimiciii_clinical.inputevents_cv`
   where itemid in
   (
       30114 -- Cisatracurium (63994 rows)

diff --git a/mimic-iii/concepts/durations/vasopressin_dose.sql b/mimic-iii/concepts/durations/vasopressin_dose.sql
@@ -235,7 +235,7 @@ and
 (
   select
     icustay_id, linkorderid
-    , CASE WHEN valueuom = 'units/min' THEN rate*60.0 ELSE rate END as vaso_rate
+    , CASE WHEN rateuom = 'units/min' THEN rate*60.0 ELSE rate END as vaso_rate
     , amount as vaso_amount
     , starttime
     , endtime

diff --git a/mimic-iii/concepts/fluid_balance/crystalloid_bolus.sql b/mimic-iii/concepts/fluid_balance/crystalloid_bolus.sql
@@ -11,7 +11,7 @@ with t1 as
       when mv.amountuom = 'ml'
         then mv.amount
     else null end) as amount
-  from inputevents_mv mv
+  from `physionet-data.mimiciii_clinical.inputevents_mv` mv
   where mv.itemid in
   (
     -- 225943 Solution
@@ -47,7 +47,7 @@ with t1 as
   , cv.charttime
   -- carevue always has units in millilitres
   , round(cv.amount) as amount
-  from inputevents_cv cv
+  from `physionet-data.mimiciii_clinical.inputevents_cv` cv
   where cv.itemid in
   (
     30015 -- "D5/.45NS" -- mixed colloids and crystalloids
@@ -155,4 +155,4 @@ select
   , sum(amount) as crystalloid_bolus
 from t2
 group by t2.icustay_id, t2.charttime
-order by icustay_id, charttime;
+;