The objectives of this workshop are:
1. To demonstrate how to fetch clinical EHR data using Patient Explorer
2. To demonstrate how to use a REST API to fetch the SPOKEsig embedding vectors corresponding to the obtained EHR data
3. To demonstrate how to use SPOKEsig vectors to classify a patient population with a machine learning model
Follow the steps below to complete the workshop.
Let us first make sure Python 3 is installed on your local machine:
- Open a terminal (Linux/MacOS) or Command Prompt (Windows)
- Copy and paste one of the following commands
(Note: you can copy a command by clicking the icon that pops up at the right end of its line):
python --version
or
python3 --version
If you see Python 3.x.x you should be good to go.
Note: We have tested and verified that all code in this repo works on Python 3.6.x, Python 3.7.x and Python 3.8.x.
If Python 3 is not installed on your machine, please install one of the above-mentioned versions before proceeding.
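The same check can be made from inside Python. The following is a minimal sketch, not part of the repo: `is_tested_version` is a hypothetical helper that compares the running interpreter against the three versions listed above.

```python
import sys

# Minor versions the workshop text reports as tested (Python 3.6.x, 3.7.x, 3.8.x).
TESTED_VERSIONS = {(3, 6), (3, 7), (3, 8)}


def is_tested_version(version_info=None):
    """Return True if the (major, minor) pair is one of the tested versions."""
    if version_info is None:
        version_info = sys.version_info
    return tuple(version_info[:2]) in TESTED_VERSIONS


if __name__ == "__main__":
    major, minor = sys.version_info[:2]
    if is_tested_version():
        print(f"Python {major}.{minor} is one of the tested versions.")
    else:
        print(f"Python {major}.{minor} is not among the tested versions (3.6, 3.7, 3.8).")
```

A version outside this set may still work; it has simply not been verified for this workshop.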
- Open a terminal (Linux/MacOS) or Command Prompt (Windows)
- On the command line, copy and paste the following:
git clone https://github.com/BaranziniLab/SPOKEsig-Workshop.git
- Enter your GitHub credentials if asked.
- Check that a folder named "SPOKEsig-Workshop" has been created. If so, Step 1 is complete.
First, check whether virtualenv is installed on your local machine by typing the following:
virtualenv --version
If you see a version number on your screen, then you are good to go. Otherwise, type the following:
pip install virtualenv
- In the terminal (Linux/MacOS), open the "SPOKEsig-Workshop" folder that you just cloned. You can copy-paste the following:
cd SPOKEsig-Workshop
- Once you are inside the folder, let us create a virtual environment. For that, copy-paste the following:
virtualenv -p $(which python3) venv
- Activate the created virtual environment. For that, copy-paste the following:
source venv/bin/activate
- To install the required python modules for this workshop, copy-paste the following:
pip install -r requirements.txt
- In Command Prompt (Windows), open the "SPOKEsig-Workshop" folder that you just cloned. You can copy-paste the following:
cd SPOKEsig-Workshop
- Once you are inside the folder, let us create a virtual environment.
For that, use the following syntax (NB: do not copy-paste it as is).
virtualenv --python "\path\to\python.exe" venv
Change "\path\to\python.exe" to the path of the Python executable on your local machine.
The path usually looks like this (assuming you are using Python 3.6.x):
C:\Users\<user_name>\AppData\Local\Programs\Python\Python36\python.exe
- Activate the created virtual environment. For that, copy-paste the following:
.\venv\Scripts\activate
- To install the required python modules for this workshop, copy-paste the following:
pip install -r requirements.txt
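On either platform, you can confirm that the environment is active before continuing. The sketch below is an illustration rather than part of the repo: `in_virtual_environment` is a hypothetical helper that relies on the standard `sys.prefix` and `sys.base_prefix` attributes, plus the `sys.real_prefix` attribute that older virtualenv releases set.

```python
import sys


def in_virtual_environment():
    """Return True when the running interpreter belongs to a virtual environment."""
    # Older virtualenv releases set sys.real_prefix inside an environment.
    if hasattr(sys, "real_prefix"):
        return True
    # venv and newer virtualenv releases make sys.prefix differ from sys.base_prefix.
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Inside a virtual environment:", in_virtual_environment())
```

If this reports False, activate the environment again before running pip install, so that the modules are installed into venv rather than into the system Python.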
In this workshop, we will run all code in Jupyter Notebook. To start it, type the following in your terminal (Linux/MacOS) or Command Prompt (Windows):
jupyter-notebook
This will start a Jupyter Notebook instance on your local machine and open it in your browser.
To use this instance, you will be prompted to enter a token. The token is printed in your terminal (Linux/MacOS) or Command Prompt (Windows).
Once you enter the token, you can see the contents of the SPOKEsig-Workshop directory in your browser.
Note:
If the notebook doesn't open automatically in your browser, copy and paste the link that appears in your terminal (Linux/MacOS) or Command Prompt (Windows). The link will look like the following:
http://localhost:8888/?token=
Congratulations! You have successfully started Jupyter Notebook on your local machine!
Now, let us start running the code.
To run the code successfully, we first need to add API credentials to the workshop.conf file.
- Open the file workshop.conf in the SPOKEsig-Workshop directory
- Change <API USERNAME> and <API PASSWORD> in the [API] section of the config file to the username and password provided to you
- Save the config file
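A forgotten placeholder is an easy mistake to make here. The sketch below checks for one, assuming workshop.conf is an INI-style file that Python's `configparser` can read (the `[API]` section name comes from the step above; the option names inside it are not assumed). `unfilled_api_options` is a hypothetical helper, not part of the repo.

```python
import configparser

# Placeholder strings that the workshop asks you to replace.
PLACEHOLDERS = ("<API USERNAME>", "<API PASSWORD>")


def unfilled_api_options(config_path):
    """Return the names of [API] options that still contain a placeholder."""
    # interpolation=None so that characters such as '%' in a password are read literally.
    parser = configparser.ConfigParser(interpolation=None)
    with open(config_path) as handle:
        parser.read_file(handle)
    if not parser.has_section("API"):
        raise ValueError("no [API] section found in " + str(config_path))
    return [
        name
        for name, value in parser.items("API")
        if any(placeholder in value for placeholder in PLACEHOLDERS)
    ]
```

Calling `unfilled_api_options("workshop.conf")` from the SPOKEsig-Workshop directory should return an empty list once both values have been replaced.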
From this step onwards, you need to be connected to the UCSF VPN.
In this step, we will create SPOKE signatures (a.k.a. SPOKEsigs) of patients from their clinical EHR data. We use PatientExplorer to import the patients' clinical data.
Please follow the instructions in the presentation to learn how to import clinical data from PatientExplorer.
Once the data is imported to your local machine:
- You should have downloaded three csv files from PatientExplorer to your local machine: a condition file, a medication file and a measurement file. As you might have noticed, these data correspond to three diseases: breast_cancer, colon_cancer and irritable_bowel_syndrome.
Rename these files as:
condition csv file --> breast_colon_ibd_conditions.csv
medication csv file --> breast_colon_ibd_drugs.csv
measurement csv file --> breast_colon_ibd_measurements_w_ab.csv
- Copy these renamed files to the folder /data in the SPOKEsig-Workshop directory
- Open the folder /code in the SPOKEsig-Workshop directory
- Open the notebook named get_patient_spoke_sig_using_patient_explorer_and_API.ipynb
- As the name indicates, this notebook creates SPOKEsigs for the patients imported from PatientExplorer. The notebook is well commented, so you can see that SPOKEsigs are obtained by making API calls using the credentials saved in Step 4.
- Once you run all sections of this notebook, it saves three files to the /data folder: a SPOKEsig numpy file (named random_patient_spokesigs.npy) and two flat files that provide information on the patients (random_patient_info.tsv and example_cohort.tsv)
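The rename-and-copy steps at the start of this list can also be scripted. The sketch below is one possible way to do it, not part of the repo: `stage_exports` is a hypothetical helper, and because the names PatientExplorer gives the downloaded files are not stated here, you pass their paths in yourself. Only the three target file names are taken from the instructions above.

```python
import shutil
from pathlib import Path

# Target names required by the workshop, keyed by the kind of export.
TARGET_NAMES = {
    "condition": "breast_colon_ibd_conditions.csv",
    "medication": "breast_colon_ibd_drugs.csv",
    "measurement": "breast_colon_ibd_measurements_w_ab.csv",
}


def stage_exports(downloads, data_dir):
    """Copy each downloaded export into data_dir under its workshop name.

    downloads maps "condition", "medication" and "measurement" to the path
    of the corresponding downloaded csv file.
    """
    missing = set(TARGET_NAMES) - set(downloads)
    if missing:
        raise ValueError("missing export(s): " + ", ".join(sorted(missing)))
    data_dir = Path(data_dir)
    data_dir.mkdir(parents=True, exist_ok=True)
    staged = []
    for kind, target_name in TARGET_NAMES.items():
        destination = data_dir / target_name
        # Copy rather than move, so the original downloads stay untouched.
        shutil.copyfile(str(downloads[kind]), str(destination))
        staged.append(destination)
    return staged
```

Run it from the SPOKEsig-Workshop directory with `data_dir="data"` before opening the notebook.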
Congratulations! You have now successfully created and saved SPOKEsigs of patients from PatientExplorer.
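To take a quick look at the two flat files without opening them in an editor, something like the following can help. It is a minimal sketch that assumes each .tsv file starts with a header row, which is not confirmed here; `summarize_tsv` is a hypothetical helper, not part of the repo.

```python
import csv


def summarize_tsv(path):
    """Return (column_names, row_count) for a tab-separated file with a header row."""
    with open(path, newline="") as handle:
        reader = csv.reader(handle, delimiter="\t")
        # The first row is treated as the header (an assumption about the file layout).
        header = next(reader, [])
        row_count = sum(1 for _ in reader)
    return header, row_count
```

For example, `summarize_tsv("data/random_patient_info.tsv")` returns the column names and the number of data rows when called from the SPOKEsig-Workshop directory. The SPOKEsig array itself can be loaded with `numpy.load`.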
We have now created SPOKEsigs of patients from three different disease categories: breast_cancer, colon_cancer and irritable_bowel_syndrome.
Let us see whether a machine learning model can use these SPOKEsigs to classify patients into these three disease categories.
- Open the notebook named patient_spoke_sig_analysis.ipynb
- This notebook is also well commented. However, unlike the previous notebook, this one leaves some sections for you to fill in.
Don't worry! Useful links are provided in the relevant sections to help you fill them in.
If you still cannot figure it out, you can always go to the wiki section of this repo and find the missing portions of the code there.
- Once you run all sections of this notebook, you can see the performance of a machine learning model (random forest classifier) in classifying patient data into three disease categories using SPOKEsigs as its input.
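When reading the results, it helps to look at each disease category separately, not only at a single overall score. The sketch below is not the notebook's code, and the notebook may report different metrics. It is a plain-Python tally, using hypothetical helpers `confusion_counts` and `per_category_recall`, of how often each true category received each predicted label.

```python
from collections import Counter

# The three disease categories used in this workshop.
CATEGORIES = ["breast_cancer", "colon_cancer", "irritable_bowel_syndrome"]


def confusion_counts(true_labels, predicted_labels):
    """Count how often each (true, predicted) label pair occurs."""
    if len(true_labels) != len(predicted_labels):
        raise ValueError("label lists must have the same length")
    return Counter(zip(true_labels, predicted_labels))


def per_category_recall(true_labels, predicted_labels):
    """Return, per category, the fraction of its patients that were labeled correctly."""
    counts = confusion_counts(true_labels, predicted_labels)
    totals = Counter(true_labels)
    return {
        category: counts[(category, category)] / totals[category]
        for category in CATEGORIES
        if totals[category] > 0
    }
```

Pass in the true labels of the held-out patients and the labels predicted by the classifier; a category with low recall is one the model tends to confuse with the others.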
Congratulations! You have successfully completed the hands-on session of this workshop.