To make a structure-based prediction of the bio-activity of molecules, a list of features is generated with a KNIME workflow. This list is used as input for either a Neural Network or a Random Forest Predictor. In both scripts the input data is split into training and test data: 70% of the data is used to train the predictor and the remaining 30% is held out for testing. Furthermore, the parameters of the predictors are tuned with GridSearchCV: the predictor is trained multiple times with different combinations of the candidate parameter values, and the best-performing predictor is then used to predict the bio-activity.
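The split and grid search described above can be sketched as follows. This is a minimal illustration, not the code of the actual scripts: the synthetic data, the choice of `RandomForestClassifier` (rather than a regressor), the parameter grid and the 3-fold cross-validation are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the feature table (assumption: binary activity labels).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# 70% of the data is used for training, the remaining 30% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=0.7, random_state=0
)

# Hypothetical parameter grid; the real scripts may search other parameters.
param_grid = {"n_estimators": [10, 50], "max_depth": [3, None]}

# GridSearchCV trains one model per parameter combination and CV fold,
# then refits the best combination on the whole training set.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)

# The refitted best estimator is used to predict the held-out test data.
best_model = search.best_estimator_
predictions = best_model.predict(X_test)

print("best parameters:", search.best_params_)
print("test accuracy:", best_model.score(X_test, y_test))
```

Fixing `random_state` only makes the sketch reproducible; it is not required for the approach itself.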
The KNIME workflow featureGeneration.knar receives a comma-separated csv file containing the SMILES and the predicted bio-activity of each molecule. It generates a list of features for the molecules and outputs a comma-separated file containing the activity, the SMILES structure and the corresponding features of each molecule.
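As a rough sketch of how such a feature file could be read in Python with pandas: the column names, the column order (activity first, SMILES second, features afterwards) and the values below are hypothetical, since the exact layout of the workflow output is not specified here.

```python
import io

import pandas as pd

# Hypothetical excerpt of a feature file; names and values are made up.
csv_text = """activity,smiles,feature_1,feature_2
1,CCO,0.5,2
0,c1ccccc1,1.2,0
"""


def load_features(source):
    """Split a comma-separated feature table into features, labels and SMILES.

    Assumes the activity is in the first column and the SMILES in the second.
    """
    table = pd.read_csv(source, sep=",")
    y = table.iloc[:, 0]       # activity
    smiles = table.iloc[:, 1]  # SMILES strings, kept for the output file
    X = table.iloc[:, 2:]      # all remaining columns are features
    return X, y, smiles


X, y, smiles = load_features(io.StringIO(csv_text))
print(X.shape)
```

In practice `load_features` would be called with the path of the csv file written by the workflow instead of an in-memory string.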
To run either script, specify the following options:
-t Path of the input csv file generated by the KNIME workflow
-o Destination path of the resulting prediction csv
randomForest_GridSearch.py -t trainingData_Features.csv -o rfc_GridSearch_res.csv
neuronalNetwork_GridSearch.py -t trainingData_Features.csv -o rfc_GridSearch_res.csv
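The two options could be parsed with Python's standard `argparse` module, for example. This is only a sketch under assumptions: the attribute names `training_file` and `output_file` and the help texts are chosen for illustration, and the actual scripts may handle their arguments differently.

```python
import argparse


def build_parser():
    """Build a parser for the -t and -o options used in the calls above."""
    parser = argparse.ArgumentParser(
        description="Train a predictor with grid search on a feature csv."
    )
    parser.add_argument(
        "-t", dest="training_file", required=True,
        help="path of the input csv file generated by the KNIME workflow",
    )
    parser.add_argument(
        "-o", dest="output_file", required=True,
        help="destination path of the resulting prediction csv",
    )
    return parser


# Parsing an explicit list mimics the first example call; in a script,
# parse_args() without arguments reads the real command line instead.
args = build_parser().parse_args(
    ["-t", "trainingData_Features.csv", "-o", "rfc_GridSearch_res.csv"]
)
print(args.training_file, args.output_file)
```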
- KNIME - Analytics Platform (3.7)
- RDKit - Software Package to read and analyse SMILES data (3.4.0v)
- Python - Python programming language (3.6)
- scikit-learn - Software Package for Machine Learning (v0.20.1)
- keras - Open Source Deep Learning Library (2.24)
- matplotlib - 2D Plotting Library (2.2.2)
- pandas - Datastructures and Dataframes (v0.23.4)
- numpy - Scientific computing with Python (v1.15.2)
Jennifer Bödker, Tobias Nietsch