Skip to content

Commit

Permalink
Merge pull request #1448 from MIT-LCP/mimiciii_postgres_concepts
Browse files Browse the repository at this point in the history
Include postgres MIMIC-III concepts
  • Loading branch information
alistairewj authored Dec 9, 2022
2 parents 79eb377 + f8ceb8b commit 1ff562b
Show file tree
Hide file tree
Showing 92 changed files with 13,912 additions and 350 deletions.
76 changes: 17 additions & 59 deletions mimic-iii/concepts/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -10,76 +10,34 @@ You can read about cloud access to MIMIC-III, including via Google BigQuery, on
The rest of this README describes:

* [Generating the concepts in BigQuery](#generating-the-concepts-in-bigquery)
* [Generating the concepts in PostgreSQL (\*nix/Mac OS X)](#generating-the-concepts-in-postgresql-nix-mac-os-x)
* [Generating the concepts in PostgreSQL (Windows)](#generating-the-concepts-in-postgresql-windows)
* [Generating the concepts in PostgreSQL](#generating-the-concepts-in-postgresql)

## Generating the concepts in BigQuery

You do not need to generate the concepts if you are using BigQuery! They have already been generated for you. If you have access to MIMIC-III on BigQuery, look under `physionet-data.mimic_derived`. If you would like to generate the concepts again, for example on your own dataset, you must modify the `TARGET_DATASET` variable within the [make-concepts.sh](/concepts/make-concepts.sh) script. The script assumes you have installed and configured the [Google Cloud SDK](https://cloud.google.com/sdk/docs/install).

## Generating the concepts in PostgreSQL (\*nix/Mac OS X)
## Generating the concepts in PostgreSQL

### Quickstart

Go to the [concepts_postgres](../concepts_postgres) folder, run the [postgres-functions.sql](../concepts_postgres/postgres-make-concepts.sql) and [postgres-make-concepts.sql](../concepts_postgres/postgres-make-concepts.sql) scripts, in that order.

### In more detail

While the SQL scripts here are written in BigQuery's Standard SQL syntax, there are many BigQuery specific functions which do not carry over to PostgreSQL. Nevertheless, with only a few changes, the scripts can be made compatible. In order to generate the concepts on a PostgreSQL database, one must:

* create postgres functions which emulate BigQuery functions
* modify SQL scripts for incompatible syntax
* run the modified SQL scripts and direct the output into tables in the PostgreSQL database

This can be done as follows:

1. Open a terminal in the `concepts` folder.
2. Run [postgres-functions.sql](postgres-functions.sql).
* e.g. `psql -f postgres-functions.sql`
* This script creates functions which emulate BigQuery syntax.
3. Run [postgres_make_concepts.sh](postgres_make_concepts.sh).
* e.g. `bash postgres_make_concepts.sh`
* This file runs the scripts after applying a few regular expressions which convert table references and date calculations appropriately.
* This file generates all concepts on the `public` schema.
* Exporting DBCONNEXTRA before calling this script will add this to the
connection string. For example, running:
`DBCONNEXTRA="user=mimic password=mimic" bash postgres_make_concepts.sh`
will add these settings to all of the psql calls. (Note that "dbname"
and "search_path" do not need to be set.)

If you do not have access to a PostgreSQL database with MIMIC, you can read more about building the data within one in the [buildmimic/postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/postgres) folder.

## Generating the concepts in PostgreSQL (Windows)

On Windows, it is a bit more complex to generate the concepts in the PostgreSQL database. The approach relies on using \*nix command line tools which are not available by default in a Windows installation. Instead, we have adapted the script into a `.bat` file which relies on the Windows Subsystem for Linux in order to run the shell commands. The steps are:

1. Install the [Windows Subsystem for Linux](https://docs.microsoft.com/en-us/windows/wsl/install-win10).
* If you don't have a preference, follow the steps to install a Ubuntu system. The bat file was tested with Ubuntu, though the commands should work with any flavor of \*nix since we rely on the utils rather than the kernel.
2. Verify you can use the wsl.exe utilities in command prompt.
* Go to run and type `cmd`, or type "command prompt" in the search.
* Run `wsl.exe echo "hi"` - this should print out `hi` back to you
3. Change to your local folder where these concepts are stored
* e.g. `cd C:\Tools\mimic-code-master\concepts`
4. Modify the .bat file: update the `CONNSTR` and `PSQL_PATH` variables.
* Replace `INSERT_PASSWORD_HERE` in `CONNSTR` with your password; or remove it if you have a `.pgpass` file or other form of authentication. If you have a different username or database location, be sure to update those as well.
* Change `PSQL_PATH` to point to your `psql.exe` file. It is currently set to the default location for a PostgreSQL 13 installation.
5. Run the .bat file
* In the command prompt, type `postgres_make_concepts_windows.bat`

The script echos the commands and the outputs as they run. If it is running successfully, you should see a `SELECT` statement after each command, with the number of rows generated in the table.

### Can I just do the above manually without WSL?

Of course! And this might be more informative.

First, generate the necessary functions as above, by running `postgres-functions.sql` in the SQL shell.
Once that's done, you need to do the following text replacements in all the SQL files:

1. Replace ````physionet-data.mimiciii_clinical.<table_name>````, ````physionet-data.mimiciii_derived.<table_name>```` , and ````physionet-data.mimiciii_notes.<table_name>```` with just `<table_name>`.
* This is done by the `REGEX_SCHEMA` variable in the `postgres_make_concepts.sh` script.
* Ideally you should set your search path with `set search_path to public,mimiciii;`. This will create the concepts on `public`, and read data from `mimiciii`. This distinction isn't strictly necessary, but many find it useful.
2. Replace `DATETIME_DIFF(date1, date2, DATE_PART)` with `DATETIME_DIFF(date1, date2, 'DATE_PART')`.
* This adds single quotes around any `DATE_PART`, which is required by PostgreSQL.
* This is done by the `REGEX_DATETIME_DIFF ` variable in the `postgres_make_concepts.sh` script.
3. Add a create table statement at the top of the file, e.g. if the file is named `echo_data.sql`, add `CREATE TABLE echo_data AS` at the top of the file.
* This is done by the `echo` calls in the shell script.
4. Run each file individually in the order specified by the make concepts script.

The above steps replicate what is done in the shell script (postgres_make_concepts.sh).
The bash script [convert_mimiciii_concepts_bq_to_psql.sh](/convert_mimiciii_concepts_bq_to_psql.sh) has done most of this for you. To generate concepts in PostgreSQL, simply go to the [concepts_postgres](../concepts_postgres) folder and run:

```sh
\i postgres-functions.sql
\i postgres-make-concepts.sql
```

You can also read more about building the data within PostgreSQL in the [buildmimic/postgres](https://github.com/MIT-LCP/mimic-code/tree/main/mimic-iii/buildmimic/postgres) folder.

## List of concepts

Expand Down Expand Up @@ -190,4 +148,4 @@ Useful snippets of SQL implementing common functions. For example, the `auroc.sq

## other-languages

Scripts in flavours of SQL which are not necessarily compatible with PostgreSQL.
Scripts in flavours of SQL which are not compatible with BigQuery/PostgreSQL.
158 changes: 158 additions & 0 deletions mimic-iii/concepts/convert_mimiciii_concepts_bq_to_psql.sh
Original file line number Diff line number Diff line change
@@ -0,0 +1,158 @@
#!/bin/bash
# This shell script converts BigQuery .sql files into PostgreSQL .sql files.

# path in which we create the postgres concepts
TARGET_PATH='../concepts_postgres'
mkdir -p $TARGET_PATH

# String replacements are necessary for some queries.

# Schema replacement: change `physionet-data.<dataset>.<table>` to just <table> (with no backticks)
export REGEX_SCHEMA='s/`physionet-data.(mimiciii_clinical|mimiciii_derived|mimiciii_notes).([A-Za-z0-9_-]+)`/\2/g'
# Note that these queries are very senstive to changes, e.g. adding whitespaces after comma can already change the behavior.
export REGEX_DATETIME_DIFF="s/DATETIME_DIFF\(([^,]+), ?(.*), ?(DAY|MINUTE|SECOND|HOUR|YEAR)\)/DATETIME_DIFF(\1, \2, '\3')/g"
export REGEX_DATETIME_TRUNC="s/DATETIME_TRUNC\(([^,]+), ?(DAY|MINUTE|SECOND|HOUR|YEAR)\)/DATE_TRUNC('\2', \1)/g"
# Add necessary quotes to INTERVAL, e.g. "INTERVAL 5 hour" to "INTERVAL '5' hour"
export REGEX_INTERVAL="s/interval ([[:digit:]]+) (hour|day|month|year)/INTERVAL '\1' \2/gI"
# Specific queries for some problems that arose with some files.
export REGEX_INT="s/CAST\(hr AS INT64\)/CAST\(hr AS bigint\)/g"
export REGEX_ARRAY="s/GENERATE_ARRAY\(-24, CEIL\(DATETIME\_DIFF\(it\.outtime_hr, it\.intime_hr, HOUR\)\)\)/ARRAY\(SELECT \* FROM generate\_series\(-24, CEIL\(DATETIME\_DIFF\(it\.outtime_hr, it\.intime_hr, HOUR\)\)\)\)/g"
export REGEX_HOUR_INTERVAL="s/INTERVAL CAST\(hr AS INT64\) HOUR/interval \'1\' hour * CAST\(hr AS bigint\)/g"
export REGEX_SECONDS="s/SECOND\)/\'SECOND\'\)/g"

# tables we want to run before all other concepts
# usually because they are used as dependencies
DIR_AND_TABLES_TO_PREBUILD='demographics.icustay_times demographics.icustay_hours .echo_data .code_status .rrt durations.weight_durations fluid_balance.urine_output organfailure.kdigo_uo'

# tables which are written directly in postgresql and source code controlled
# this is usually because there is no trivial conversion between bq/psql syntax
DIR_AND_TABLES_ALREADY_IN_PSQL='demographics.icustay_times demographics.icustay_hours demographics.note_counts diagnosis.ccs_dx'

# tables which we want to run after all other concepts
# usually because they depend on one or more other queries
DIR_AND_TABLES_TO_SKIP=''

# First, we re-create the postgres-make-concepts.sql file.
echo "\echo ''" > $TARGET_PATH/postgres-make-concepts.sql

# Now we add some preamble for the user running the script.
echo "\echo '==='" >> $TARGET_PATH/postgres-make-concepts.sql
echo "\echo 'Beginning to create materialized views for MIMIC database.'" >> $TARGET_PATH/postgres-make-concepts.sql
echo "\echo '"'Any notices of the form "NOTICE: materialized view "XXXXXX" does not exist" can be ignored.'"'" >> $TARGET_PATH/postgres-make-concepts.sql
echo "\echo 'The scripts drop views before creating them, and these notices indicate nothing existed prior to creating the view.'" >> $TARGET_PATH/postgres-make-concepts.sql
echo "\echo '==='" >> $TARGET_PATH/postgres-make-concepts.sql
echo "\echo ''" >> $TARGET_PATH/postgres-make-concepts.sql

# ======================================== #
# === CONCEPTS WHICH WE MUST RUN FIRST === #
# ======================================== #
echo -n "Dependencies:"

# output table creation calls to the make-concepts script
echo "" >> $TARGET_PATH/postgres-make-concepts.sql
echo "-- dependencies" >> $TARGET_PATH/postgres-make-concepts.sql

for dir_and_table in $DIR_AND_TABLES_TO_PREBUILD;
do
d=`echo ${dir_and_table} | cut -d. -f1`
tbl=`echo ${dir_and_table} | cut -d. -f2`

if [[ $d == '' ]]; then
d='.'
fi

# make the sub-folder for postgres if it does not exist
mkdir -p "$TARGET_PATH/${d}"

# convert the bigquery script to psql and output it to the appropriate subfolder
echo -n " ${d}.${tbl} .."

# re-write the script into psql using regex
# the if statement ensures we do not overwrite tables which are already written in psql
if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${tbl}.sql"
fi

# write out a call to this script in the make concepts file
echo "\i ${d}/${tbl}.sql" >> $TARGET_PATH/postgres-make-concepts.sql
done
echo " done!"

# ================================== #
# === MAIN LOOP FOR ALL CONCEPTS === #
# ================================== #

# Iterate through each concept subfolder, and:
# (1) apply the above regular expressions to update the script
# (2) output to the postgres subfolder
# (3) add a line to the postgres-make-concepts.sql script to generate this table

# organfailure.kdigo_stages firstday.first_day_sofa sepsis.sepsis3 medication.vasoactive_agent medication.norepinephrine_equivalent_dose

# the order *only* matters during the conversion step because our loop is
# inserting table build commands into the postgres-make-concepts.sql file
for d in durations comorbidity demographics firstday fluid_balance sepsis diagnosis organfailure severityscores;
do
mkdir -p "$TARGET_PATH/${d}"
echo -n "${d}:"
echo "" >> $TARGET_PATH/postgres-make-concepts.sql
echo "-- ${d}" >> $TARGET_PATH/postgres-make-concepts.sql
for fn in `ls $d`;
do
# only run SQL queries
if [[ "${fn: -4}" == ".sql" ]]; then
# table name is file name minus extension
tbl="${fn%????}"
echo -n " ${tbl} "

if [[ "$DIR_AND_TABLES_TO_PREBUILD" =~ "$d.$tbl" ]]; then
echo -n "(exists!) .."
continue
elif [[ "$DIR_AND_TABLES_TO_SKIP" =~ "$d.$tbl" ]]; then
echo -n "(skipping!) .."
continue
else
echo -n ".."
fi

# re-write the script into psql using regex
# the if statement ensures we do not overwrite tables which are already written in psql
if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${fn}"
fi

# add statement to generate this table to make concepts script
echo "\i ${d}/${fn}" >> ${TARGET_PATH}/postgres-make-concepts.sql
fi
done
echo " done!"
done

# finally generate first_day_sofa which depends on concepts in firstday folder
echo "" >> ${TARGET_PATH}/postgres-make-concepts.sql
echo "-- final tables which were dependent on one or more prior tables" >> ${TARGET_PATH}/postgres-make-concepts.sql

echo -n "final:"
for dir_and_table in $DIR_AND_TABLES_TO_SKIP
do
d=`echo ${dir_and_table} | cut -d. -f1`
tbl=`echo ${dir_and_table} | cut -d. -f2`

# make the sub-folder for postgres if it does not exist
mkdir -p "$TARGET_PATH/${d}"

# convert the bigquery script to psql and output it to the appropriate subfolder
echo -n " ${d}.${tbl} .."
if ! [[ "$DIR_AND_TABLES_ALREADY_IN_PSQL" =~ "$d.$tbl" ]]; then
echo "-- THIS SCRIPT IS AUTOMATICALLY GENERATED. DO NOT EDIT IT DIRECTLY." > "${TARGET_PATH}/${d}/${tbl}.sql"
echo "DROP TABLE IF EXISTS ${tbl}; CREATE TABLE ${tbl} AS " >> "${TARGET_PATH}/${d}/${tbl}.sql"
cat "${d}/${tbl}.sql" | sed -r -e "${REGEX_ARRAY}" | sed -r -e "${REGEX_HOUR_INTERVAL}" | sed -r -e "${REGEX_INT}" | sed -r -e "${REGEX_DATETIME_DIFF}" | sed -r -e "${REGEX_DATETIME_TRUNC}" | sed -r -e "${REGEX_SCHEMA}" | sed -r -e "${REGEX_INTERVAL}" >> "${TARGET_PATH}/${d}/${fn}"
fi
# write out a call to this script in the make concepts file
echo "\i ${d}/${tbl}.sql" >> $TARGET_PATH/postgres-make-concepts.sql
done
echo " done!"
50 changes: 3 additions & 47 deletions mimic-iii/concepts/diagnosis/README.md
Original file line number Diff line number Diff line change
@@ -1,48 +1,4 @@
The Clinical Classification Software (CCS) categorizes ICD-9 coded diagnoses into clinically meaningful groups. The categorization was developed by the Agency for Healthcare Research and Quality (AHRQ). More detail can be found on the AHRQ website: https://www.hcup-us.ahrq.gov/tools_software.jsp
# ccs_dx

This folder contains:

* `ccs_diagnosis_table.sql` - Creates two tables: `ccs_single_level_dx` and `ccs_multi_level_dx`. These two tables are loaded from `ccs_single_level_dx.csv.gz` and `ccs_multi_level_dx.csv.gz`. Note that the script assumes you are using PostgreSQL v9.4 or later, and you must execute the script from this directory.

## Creation of the ccs_multi_level file

Download the original file from CCS:

```
wget https://www.hcup-us.ahrq.gov/toolssoftware/ccs/Multi_Level_CCS_2015.zip
```

Unzip to a folder.

```
unzip Multi_Level_CCS_2015.zip
```

Use Python to convert all apostrophes in `ccs_multi_dx_tool_2015.csv` into double quotes (the file mixed apostrophes/double quotes as field encapsulators):

```python
import pandas as pd
df = pd.read_csv('ccs_multi_dx_tool_2015.csv.gz')
# remove apostrophes from header names and relabel
df.rename(columns={"'ICD-9-CM CODE'": "icd9_code", "'CCS LVL 1'": "ccs_level1", "'CCS LVL 1 LABEL'": "ccs_group1", "'CCS LVL 2'": "ccs_level2", "'CCS LVL 2 LABEL'": "ccs_group2", "'CCS LVL 3'": "ccs_level3", "'CCS LVL 3 LABEL'": "ccs_group3", "'CCS LVL 4'": "ccs_level4", "'CCS LVL 4 LABEL'": "ccs_group4", }, inplace=True)

def remove_surrounding_apostrophes(x):
if x[0] == "'":
x = x[1:]
if x[-1] == "'":
x = x[:-1]
return x

for c in df.columns:
df[c] = df[c].map(remove_surrounding_apostrophes)
idxRemove = df[c].str.strip() == ''
if idxRemove.any():
df.loc[idxRemove, c] = None

# write to file
df.to_csv('ccs_multi_dx.csv.gz', index=False, compression='gzip')
```

(above run with Python 3.7 and pandas 0.23.2).

Now the SQL script can be run (`ccs_diagnosis_table.sql`). The `ccs_multi_dx.csv.gz` file generated by this process is available in the repo, so the above process is just for documentation, and does not necessarily have to be re-run.
The `ccs_multi_dx.csv.gz` data file must be uploaded to `physionet-data.mimiciii_derived.ccs_multi_dx`.
The data file is available in [../concepts_postgres/diagnosis](../concepts_postgres/diagnosis). The BigQuery schema definition is available in this folder as [ccs_multi_dx.json](/ccs_multi_dx.json).
8 changes: 4 additions & 4 deletions mimic-iii/concepts/durations/neuroblock_dose.sql
Original file line number Diff line number Diff line change
Expand Up @@ -7,11 +7,11 @@ with drugmv as
(
select
icustay_id, orderid
, rate as vaso_rate
, amount as vaso_amount
, rate as drug_rate
, amount as drug_amount
, starttime
, endtime
from inputevents_mv
from `physionet-data.mimiciii_clinical.inputevents_mv`
where itemid in
(
222062 -- Vecuronium (664 rows, 154 infusion rows)
Expand Down Expand Up @@ -41,7 +41,7 @@ with drugmv as
when itemid >= 40000 then coalesce(rate, amount)
else rate end) as drug_rate
, max(amount) as drug_amount
from inputevents_cv
from `physionet-data.mimiciii_clinical.inputevents_cv`
where itemid in
(
30114 -- Cisatracurium (63994 rows)
Expand Down
2 changes: 1 addition & 1 deletion mimic-iii/concepts/durations/vasopressin_dose.sql
Original file line number Diff line number Diff line change
Expand Up @@ -235,7 +235,7 @@ and
(
select
icustay_id, linkorderid
, CASE WHEN valueuom = 'units/min' THEN rate*60.0 ELSE rate END as vaso_rate
, CASE WHEN rateuom = 'units/min' THEN rate*60.0 ELSE rate END as vaso_rate
, amount as vaso_amount
, starttime
, endtime
Expand Down
6 changes: 3 additions & 3 deletions mimic-iii/concepts/fluid_balance/crystalloid_bolus.sql
Original file line number Diff line number Diff line change
Expand Up @@ -11,7 +11,7 @@ with t1 as
when mv.amountuom = 'ml'
then mv.amount
else null end) as amount
from inputevents_mv mv
from `physionet-data.mimiciii_clinical.inputevents_mv` mv
where mv.itemid in
(
-- 225943 Solution
Expand Down Expand Up @@ -47,7 +47,7 @@ with t1 as
, cv.charttime
-- carevue always has units in millilitres
, round(cv.amount) as amount
from inputevents_cv cv
from `physionet-data.mimiciii_clinical.inputevents_cv` cv
where cv.itemid in
(
30015 -- "D5/.45NS" -- mixed colloids and crystalloids
Expand Down Expand Up @@ -155,4 +155,4 @@ select
, sum(amount) as crystalloid_bolus
from t2
group by t2.icustay_id, t2.charttime
order by icustay_id, charttime;
;
Loading

0 comments on commit 1ff562b

Please sign in to comment.