A sort of project documentation; will be substituted with Wiki/Sphinx.
The ChemputerOptimizer requires the following third-party libraries to operate:

- `numpy` - general Python library for array manipulation.
- `scikit-learn>=0.20,<0.24` - machine learning library in Python.
- `scikit-optimize==0.8.1` - Bayesian optimization library based on `scikit-learn` algorithm implementations.
- `pyDOE2` - library for Design of Experiments in Python.
Since ChemputerOptimizer targets the ChemPU as its hardware platform and XDL as its chemical programming language, the following Cronin group libraries are also required:

- `ChemputerXDL`
- `XDL`
- `chempiler`
- `AnalyticalLabware`
- Clone the repo into the desired folder. The master branch always contains the most recent updates, although these are not extensively tested (the same goes for all versions tagged alpha, e.g. `v0.3.0a0`).

```shell
git clone https://gitlab.com/croningroup/chemputer/chemputeroptimizer.git && cd chemputeroptimizer
```
- Install the Chemputer requirements if needed (XDL, ChemputerXDL and AnalyticalLabware).

```shell
pip install -r requirements.txt
```
- Install ChemputerOptimizer.

```shell
pip install .
```
- To use ChemputerOptimizer, prepare an XDL procedure with the steps whose parameters should be tweaked during optimization wrapped in `OptimizeStep`, e.g.:

```xml
<OptimizeStep
    id="0"
    optimize_properties="{'mass': {'max_value': 0.24, 'min_value': 0.16}}">
    <Add
        reagent="copper(I)iodide"
        vessel="filter"
        mass="0.19 g"
        stir="False" />
</OptimizeStep>
```
Here `optimize_properties` contains the dictionary of the step's properties to optimize. Only continuous variables are supported, therefore `max_value` and `min_value` must be given.
Additionally, add a `FinalAnalysis` step, indicating the analytical method and the parameters used for reaction outcome quantification, e.g.:

```xml
<FinalAnalysis
    vessel="reactor"
    method="NMR"
    sample_volume="2.5"
    method_props="{'protocol': '1D FLUORINE HDEC', 'protocol_options': {'centerFrequency': -85, 'PulselengthScale': 1, 'decouplePower': 0, 'AcquisitionTime': 1.64, 'RepetitionTime': 15, 'PulseAngle': 90, 'Number': 64}}"
    force_shimming="True"
    cleaning_solvent="THF" />
```

In this example a sample of the reaction mixture (2.5 mL) is transferred to the benchtop NMR to perform 19F analysis and calculate the reaction outcome. Additional parameters of the `FinalAnalysis` step are described [[#Final Analysis|here]].
- Prepare a configuration file that specifies the optimization objective, the algorithm name and its additional parameters, the reference for the spectral analysis, constraints (see below) and the parameters for the control experiment. For example:

```json
{
    "max_iterations": 10,
    "target": {
        "spectrum_peak_area_42": "inf"
    },
    "algorithm": {
        "name": "doe"
    },
    "reference": 43,
    "batch_size": 1,
    "constraints": null,
    "control": {
        "n_runs": 0,
        "every": 5
    }
}
```

A full description of the parameters is given [[#Configuration|below]].
- Import the necessary modules and run the optimization.

```python
from chemputeroptimizer import ChemputerOptimizer

co = ChemputerOptimizer('xdl.xdl', 'graph.json')
co.prepare_for_optimization('configuration.json')

# ChemPU related
from chempiler import Chempiler
from AnalyticalLabware.devices import chemputer_devices
import ChemputerAPI

c = Chempiler('experiment_name', 'graph.json', 'output_dir', False, device_modules=[ChemputerAPI, chemputer_devices])

# Run optimization
co.optimize(c)
```
Under development
The optimization workflow is managed by the `OptimizeDynamicStep` (ODS), which inherits from the abstract `AbstractDynamicStep` and is responsible for the dynamic execution of its child steps. In the case of the ODS, the child steps are taken from the given XDL procedure (i.e., from the XDL object generated from that procedure). Each iteration is performed as follows:
- A new copy of the original XDL object is created.
- The parameters of the indicated XDL steps (i.e., those tweaked during optimization) are updated according to the algorithm's suggestion.
- In the case of parallel optimization, several XDL objects are created with the corresponding step parameters and scheduled for parallel execution. Otherwise, a single procedure is prepared for execution and stored as an ODS attribute (called `working_xdl`).
- For single-batch optimization, an additional iterator is created from the list of `working_xdl` steps. This iterator is then used to keep track of the executed steps, allowing the procedure to be resumed in case of failure.
- Callbacks for updating the reaction outcome are added to the `FinalAnalysis` step.
- Special steps (i.e., `ConstrainedStep`, see [[#Constrained Step|below]]) are updated.
Execution is managed by the `on_continue` method, which outputs a list of steps to be executed. In the case of parallel execution, this list is simply the steps from `working_xdl`. For single-batch optimization, each step is queried from the steps iterator.
- When `FinalAnalysis` is executed, the spectrum is measured and passed to the callback function, where the reaction outcome (e.g., peak integration) is calculated.
- The outcome of each optimization batch is saved in the corresponding folder, including the state of the optimization, the list of all parameters and an individual XDL file for that particular batch.
- When the list of steps to execute is exhausted, the ODS checks the reagent flask and waste container volumes, loads the data (parameters and reaction outcome) into the algorithm class and updates the XDL object (see [[#1 Preparation|Preparation]] above).
- If any termination criterion is met, the optimization stops.
- If a control experiment is required, its parameters are taken from one of the previously performed experiments and the result is analyzed [[#Result processing|specially]].
Each batch is considered completed when the corresponding `FinalAnalysis` step has been performed. Each iteration is considered completed when no steps are left to execute.

Upon batch completion the following data is saved: the current step parameters, the current XDL procedure and the current spectrum.

Upon iteration completion the following data is saved: the full table of optimization results, the schedule (if running in parallel) and the current optimization state.
The `Analyze` or `FinalAnalysis` step performs the reaction analysis when executed. It contains all the parameters necessary to run the analysis on Raman, HPLC or NMR instruments, including reaction sampling, sample treatment (e.g., dilution) and instrument control. During compilation the step's attributes are additionally checked for consistency (e.g., `dilution_volume` if dilution is required) and the necessary callback functions (i.e., to update the optimization state upon iteration completion) are assigned. The step should be inserted in the procedure after the reaction is complete. The full list of properties is given below:
- Step related:
  - `vessel` - name of the vessel to perform the analysis on and/or take a sample from.
  - `method` - name of the analytical method.
  - `sample_volume` - volume of the sample to transfer to the analytical instrument.
  - `instrument` - name of the analytical instrument (given internally).
  - `on_finish` - callback function (assigned during procedure preparation).
  - `reference_step` - previous step in the procedure, with its properties. Used for additional sample preparation (e.g., dissolving a solid after a `Filter` step). Reserved for future use (given internally).
  - `method_props` - properties of the analytical instrument, e.g., the protocol for the NMR. Given as a JSON_PROP_TYPE and passed to the specific analysis step as is.
  - `batch_id` - batch id, used for referencing.
- Method related:
  - `cleaning_solvent` - name of the solvent used to clean the instrument. Must be given, but is ignored for Raman and other non-sampling analytical methods.
  - `cleaning_solvent_vessel` - name of the cleaning solvent vessel (given internally).
  - `priming_waste` - name of the waste container for disposing of the sample after priming (given internally).
- Sample related:
  - `sample_pump` - closest pump to `vessel` (given internally).
  - `injection_pump` - closest pump to `instrument` (given internally).
  - `sample_excess_volume` - excess sample volume used to prime the pump, 2 mL by default. This excess is then returned to the reactor.
  - `dilution_vessel` - vessel used for dilution.
  - `dilution_volume` - volume of the solvent used to dilute the sample.
  - `dilution_solvent` - name of the solvent used to dilute the sample.
  - `dilution_solvent_vessel` - name of the vessel holding the dilution solvent (given internally).
  - `distribution_valve` - name of the distribution valve used to inject the sample into the HPLC (given internally).
  - `injection_waste` - name of the waste container for disposing of the remainder of the sample after injection (given internally).
- NMR specific:
  - `force_shimming` - True if shimming is required before each analysis. Otherwise, shimming is tracked and performed at least every 24 hours.
  - `shimming_solvent_flask` - name of the flask containing the shimming solvent (given internally). The solvent is selected from the list of allowed solvents; if none is found, an exception is raised during compilation.
  - `shimming_reference_peak` - position of the peak of the solvent used for shimming (given internally).
In some cases, additional constraints beyond upper and lower bounds are imposed on the optimization variables. For example, all factors in mixture experiments must sum up to 100%. The `ConstrainedStep` wraps a child step and adjusts the value of a given parameter as a function of the values of the selected `OptimizeStep` steps and the desired target value. In the example below, the volume of the toluene `Add` step is calculated as the target value minus the volumes suggested for the `OptimizeStep` steps with ids 1 and 2, so that the three volumes always sum to 50 mL (the placeholder `99999 mL` is overwritten at runtime); a sketch of the arithmetic follows the example:
```xml
<OptimizeStep
    id="1"
    optimize_properties="
        {'volume': {'max_value': 20.0, 'min_value': 10.0}}">
    <Add
        vessel="reactor"
        reagent="cyclohexenone"
        volume="10 mL" />
</OptimizeStep>
<ConstrainedStep
    ids="[1, 2]"
    parameter="volume"
    target="50">
    <Add
        reagent="toluene"
        vessel="reactor"
        volume="99999 mL" />
</ConstrainedStep>
<OptimizeStep
    id="2"
    optimize_properties="
        {'volume': {'max_value': 20.0, 'min_value': 5.0}}">
    <Add
        vessel="reactor"
        reagent="pyrrolidine"
        volume="10 mL" />
</OptimizeStep>
```
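The underlying arithmetic is straightforward; a minimal sketch (the function name is illustrative, not the actual implementation):

```python
# Sketch: the wrapped step's parameter is set so that it, together with the
# values suggested for the referenced OptimizeStep steps, sums to the target.
def constrained_value(target, optimized_values):
    return target - sum(optimized_values)

# E.g., with target=50 and suggested volumes of 15 mL and 12 mL for steps
# 1 and 2, the toluene Add volume becomes 23 mL, keeping the total at 50 mL.
print(constrained_value(50.0, [15.0, 12.0]))  # 23.0
```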
The configuration file contains, in JSON format, all the information necessary to run the optimization and calculate the reaction outcome. The description of each parameter is given below:

- `max_iterations` - maximum number of experiments to perform within the optimization run.
- `target` - indicates the calculation of the reaction outcome. Available:
  - `spectrum_peak_area_XXX` - area of the peak at position XXX (x coordinate).
  - `spectrum_integration_area_LLL..RRR` - area under the curve between the LLL (left) and RRR (right) limits on the x axis.
  - `novelty` - novelty, by default the number of peaks identified. A special implementation exists for NMR spectra (see [[#Novelty|below]]).
- `algorithm` - indicates the algorithm to be used and its respective attributes.
- `reference` - indicates the peak position of the reference compound (i.e., the internal reference for the analysis). Used to reference the spectrum and calculate the reaction outcome.
- `constraints` - lists regions of the reaction spectrum to consider when calculating the final reaction outcome. In the current implementation, the intermediate result (e.g., the area under the curve for the target peak) is divided by the sum of the integration areas of all listed regions (see the sketch after this list).
- `batch_size` - number of batches in which to perform the optimization in parallel.
- `control` - parameters for the control experiment:
  - `n_runs` - number of control experiments to perform after every `every` iterations.
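For intuition, the way `target` and `constraints` combine amounts to the following sketch (illustrative names, not the library's code):

```python
# Sketch: with constraints given, the target integration is normalized by the
# summed integration areas of the listed side-product peaks/regions.
def reaction_outcome(target_area, constraint_areas):
    if not constraint_areas:        # corresponds to "constraints": null
        return target_area
    return target_area / sum(constraint_areas)
```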
The `AlgorithmAPI` class provides a unified interface between the algorithm classes and the dynamic XDL execution. It contains methods to parse the reaction parameters into parameter-agnostic data arrays suitable for numerical optimization algorithms, initialize the algorithm classes, query the data for the next experimental setup and save the intermediate results.
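For intuition, the parameter parsing amounts to roughly the following (hypothetical data layout; the real class also handles batching and bookkeeping):

```python
import numpy as np

# Sketch: flatten a nested {step: {parameter: value}} setup into the
# parameter-agnostic array an algorithm consumes, and map a suggestion back.
setup = {'Add_1': {'volume': 12.5}, 'HeatChill_1': {'temp': 65.0}}
keys = [(step, prop) for step, props in setup.items() for prop in props]
x = np.array([setup[step][prop] for step, prop in keys])  # array([12.5, 65.])

suggestion = np.array([14.0, 70.0])  # new point returned by the algorithm
new_setup = {step: {} for step, _ in keys}
for (step, prop), value in zip(keys, suggestion):
    new_setup[step][prop] = float(value)
print(new_setup)  # {'Add_1': {'volume': 14.0}, 'HeatChill_1': {'temp': 70.0}}
```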
1. Random search

The random search algorithm relies on the `random` submodule of the NumPy library. Its suggestion is a random number array (sampled from a uniform distribution) within the parameter constraints. The following parameter is available to construct the algorithm's class:

- `random_state` - a seed to initialize the low-level random number generator, used to preserve reproducibility across multiple optimization runs.
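Conceptually, the suggestion boils down to (a sketch, not the library's code):

```python
import numpy as np

# Sketch: uniform random sampling within the parameter constraints.
def random_suggest(bounds, random_state=None):
    """bounds: list of (min, max) tuples, one per parameter."""
    rng = np.random.default_rng(random_state)
    lows, highs = zip(*bounds)
    return rng.uniform(lows, highs)

print(random_suggest([(10.0, 20.0), (5.0, 20.0)], random_state=42))
```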
2. Genetic algorithm

A basic genetic algorithm was written and adapted for the sequential optimization of chemical reactions. It uses truncation selection, single-point crossover and random reset mutation, while preserving the best solution (elitism). In case of premature convergence, the population is reinitialized. The key hyperparameters to tune are:

- `pop_size` - number of individuals in the population.
- `mutation_rate` - probability (0 to 1) of mutating a gene.
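A compact sketch of one generation with these operators (illustrative, not the library's implementation):

```python
import numpy as np

def next_generation(pop, fitness, bounds, mutation_rate=0.1, rng=None):
    """One GA generation: truncation selection, single-point crossover,
    random reset mutation and elitism. pop is a (pop_size, n_genes) array."""
    rng = rng or np.random.default_rng()
    order = np.argsort(fitness)[::-1]                  # maximization
    elite = pop[order[0]].copy()                       # elitism: keep the best
    parents = pop[order[: max(2, len(pop) // 2)]]      # truncation selection
    children = [elite]
    while len(children) < len(pop):
        a, b = parents[rng.choice(len(parents), 2, replace=False)]
        point = rng.integers(1, pop.shape[1]) if pop.shape[1] > 1 else 0
        child = np.concatenate([a[:point], b[point:]])  # single-point crossover
        for gene, (low, high) in enumerate(bounds):     # random reset mutation
            if rng.random() < mutation_rate:
                child[gene] = rng.uniform(low, high)
        children.append(child)
    return np.array(children)
```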
3. Sequential model-based optimization (SMBO)

The algorithm implementation is based on the scikit-optimize Python library, which provides a simple and efficient way to minimize expensive and noisy black-box functions. The parameters of the underlying algorithm are listed in the official documentation.
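Since the implementation builds on scikit-optimize, the underlying ask/tell loop looks like this (standard `skopt` usage, shown for orientation; note that `skopt` minimizes, so a yield-like objective must be negated):

```python
from skopt import Optimizer

# Two continuous parameters, e.g. two volumes in mL.
opt = Optimizer(dimensions=[(10.0, 20.0), (5.0, 20.0)], base_estimator="GP")

for _ in range(5):
    x = opt.ask()            # next parameter set to try
    y = -(x[0] + x[1])       # negated "reaction outcome" (skopt minimizes)
    opt.tell(x, y)           # feed the observation back to the model

print(opt.get_result().x)    # best parameters seen so far
```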
4. Design of Experiments

Several experimental designs are available via a wrapper for pyDOE2. Available designs include full and fractional factorial designs, response-surface designs and randomized designs. The wrapper calls the appropriate pyDOE2 functions to generate the design matrix and maps the levels onto the appropriate parameter values based on the search space bounds. For details, check the original pyDOE documentation.
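For example, a two-level full factorial design mapped onto the search space bounds (the `ff2n` call is standard pyDOE2; the mapping helper mirrors what the wrapper does):

```python
import numpy as np
from pyDOE2 import ff2n

design = ff2n(2)                       # 2-factor full factorial, levels -1/+1

# Map the coded levels onto real parameter values from the bounds.
bounds = np.array([(10.0, 20.0), (5.0, 20.0)])
low, high = bounds[:, 0], bounds[:, 1]
experiments = low + (design + 1) / 2 * (high - low)
print(experiments)                     # rows = experiments, cols = parameters
```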
5. Dummy methods

The following "dummy" methods are given to the user to run the given reaction iteratively, using the respective setup on each iteration:

- `fromcsv` - the next setup is read from the given csv file.
- `reproduce` - the next setup is equal to the initial setup from the procedure. Used to check the experimental outcome for reproducibility.
The `SpectraAnalyzer` (SA) class contains methods to manage the spectra and perform the corresponding analysis for the quantification of the reaction outcome. The class is used inside the FA step to load the measured spectrum (`load_spectrum`) and to call `final_analysis` to obtain the iteration result. The latter method is called with the reference, target and constraints arguments obtained from the optimization configuration.

Additionally, SA contains methods to load previously measured spectra (e.g., for novelty search with NMR feedback) and to process control experiments (currently under development).
The functions for the analysis are collected in a special class, `LossFunctionCollection`, and queried by the name of the spectrum class and the name of the optimization objective, e.g.:

```python
>>> lfc = LossFunctionCollection()
>>> loss_function = lfc[('SpinsolveNMRSpectrum', 'spectrum_peak_area_42')]
>>> loss_function
<bound method LossFunctionCollection._spinsolvenmrspectrum_spectrum_peak_area ...>
```
The implemented functions should follow the naming convention `_spectrumclassname_objective_name` and accept 3 arguments: `reference`, `target` and `constraints`. If no loss function is found for the corresponding spectrum class, the lookup falls back to a generic one (`_generic_objectivename`). If no loss function is found for the corresponding objective, it falls back to `generic_loss_function`, which returns `NaN` just to keep the optimization going.
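The lookup can be pictured as the following fallback chain (a sketch; the exact parsing of parameterized objective names is glossed over here):

```python
# Sketch of the (spectrum class, objective) lookup in LossFunctionCollection.
# objective_name is assumed already stripped of its value,
# e.g. 'spectrum_peak_area' rather than 'spectrum_peak_area_42'.
def resolve_loss_function(lfc, spectrum_class, objective_name):
    candidates = (
        f'_{spectrum_class.lower()}_{objective_name}',  # class-specific
        f'_generic_{objective_name}',                   # generic per objective
    )
    for name in candidates:
        if hasattr(lfc, name):
            return getattr(lfc, name)
    return lfc.generic_loss_function  # last resort, returns NaN
```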
If an analysis function requires more than just the reference, target and constraints (e.g., previously measured spectra, as in the novelty analysis for NMR), that function should raise `NotImplementedFunctionError` and be implemented in the SA class under the corresponding name.
The generic function for novelty search simply counts the number of peaks identified in the spectrum. A special treatment is implemented for NMR spectra, where 1) the incoming spectrum is analyzed for "information"; 2) the information score is multiplied by the "novelty coefficient", which is the number of new peaks (with respect to previously measured spectra) in the incoming spectrum. For details check the SI of the Optimizer paper. TODO: transfer the novelty description.
The control experiment is a way to monitor the reproducibility of the automated experiments. It can be run "manually", i.e., by executing a previously saved XDL procedure and comparing the result and output spectrum by hand. However, since `v0.3.6a0` the control experiment can be executed automatically, using the `control` parameter in the configuration file. Such an experiment is executed with the same parameters as one of the recently performed ones (selected randomly) and its result is saved in the optimization state as `control_result`. For the moment, special methods to quantify the control experiment are given in SA (`control_analysis`) and in AlgorithmAPI (`validate_control`). The SA method outputs the mean difference between two spectra, while the validate method is reserved for future use.
Under development
The special processing of the control experiment result is under development. As of now (`v0.4.1`) the control result is saved as the mean of the difference between the two spectra (i.e., from the control and the original experiments).
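In other words, something like (a sketch, assuming both spectra share the same x grid):

```python
import numpy as np

# Sketch: control result as the mean difference between the two spectra.
def control_result(control_y, original_y):
    return float(np.mean(control_y - original_y))
```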
This section is under development
The `OptimizerClient` is the class for communication with the `SummitServer`. The communication is based on TCP/IP sockets and managed via the `selectors` and `socket` modules of the Python standard library. The messages are encoded JSON-like dictionaries containing information about the current procedure, the steps and parameters subjected to optimization, and the optimization configuration. The class is initialized if the requested algorithm is one of those available in the Summit framework, and it is managed via `AlgorithmAPI`.
If you know your reaction mixture spectrum, i.e., the peaks that belong to your product and potential side products, you can "constrain" your optimization to include purity in the objective calculation. To do so, include the `constraints` parameter in the optimization configuration with the list of peaks (for the `spectrum_peak_area` objective) or regions (as `XX..YY` for `spectrum_integration_area`) of potential side products. Those peaks/regions will be integrated and summed up, and the integration area of the product will be divided by that sum.
Note
It is highly recommended to use an integration area (not a peak) when constraining the optimization for purity, as the current peak integration method will integrate the peak closest to the one listed, even if the listed peak is not found.
By default, the optimization algorithms are set to guide the optimization towards maximization of the objective. If you need to minimize your target parameter (e.g., the starting material peak), you can prepend the objective name with `neg`, for example `neg_spectrum_peak_area`. The integration result in this case will be negated.
If you get a new analytical instrument, e.g. `my_awesome_spectrometer`, you have to add the following classes and methods for it to be integrated into the ChemputerOptimizer workflow.

- `MyAwesomeSpectrometer` will be your main communication class, located in the `AnalyticalLabware.devices.myawesomespectrometermanufacturer.myawesomespectrometer` module. This class doesn't inherit anything by design, however it is required to have a `get_spectrum` method, which may or may not return the spectral data, but must save it in the corresponding attribute, i.e., `MyAwesomeSpectrometer.spectrum`.
- `MyAwesomeSpectrometerSpectrum` will be your class for manipulating the spectral data from your awesome spectrometer. It inherits from `AbstractSpectrum` and should implement a custom `load_spectrum` method to load the data and save it in the `x` and `y` attributes as numpy arrays. The inherited processing methods (e.g., `correct_baseline`, `integrate_peak`, etc.) must be supported or rewritten. A sketch of these two classes is given after this list.
- To be integrated into the ChemPU workflow, a specific device class, `ChemputerMyAwesomeSpectrometer`, should be implemented. It inherits from `ChemputerDevice` to support the `capabilities` property (i.e., capabilities for liquid transfer) and should list all the necessary attributes in its constructor for graph support.
- Additionally, simulated instrument classes should be prepared to allow simulating the optimization and running tests.
- Next, you have to prepare the XDL step to run your awesome spectrometer, `RunMyAwesomeSpectrometer`. Nothing complicated here: follow the general guidelines of XDL step development, declare the necessary arguments and their types and write a custom `execute` method to perform the acquisition. Here your `get_spectrum` method will come in handy. Additionally, the `on_finish` callback, if given, should be called with a copy of your spectrum. This is required to allow the reaction outcome calculation at the end of an optimization iteration.
- Next, you have to integrate the `RunMyAwesomeSpectrometer` step into the ChemputerOptimizer FA pipeline, adding the necessary sample preparations, sampling, cleaning, etc. Just check the existing analytical methods for examples.
- Last, but not least, if you want a special loss function for your spectral analysis (i.e., you are not happy with the generic peak integration of the `spectrum_peak_area` objective), you should prepare a method for the `LossFunctionCollection` class, following the naming convention, i.e. `_myawesomespectrometerspectrum_spectrum_peak_area`.
- Declare your awesome spectrometer in `SUPPORTED_ANALYTICAL_METHODS` and `ANALYTICAL_INSTRUMENTS` in the `chemputeroptimizer.constants` module.
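As promised above, a minimal sketch of the first two classes (the import path and all method bodies are assumptions for illustration; only the class/method names and attributes follow the stated conventions):

```python
import numpy as np

# Import path assumed; locate AbstractSpectrum in your AnalyticalLabware copy.
from AnalyticalLabware.analysis.base_spectrum import AbstractSpectrum


class MyAwesomeSpectrometer:
    """Main communication class; inherits nothing by design."""

    def __init__(self, port):
        self.port = port
        self.spectrum = MyAwesomeSpectrometerSpectrum()

    def get_spectrum(self):
        # Illustrative: acquire the raw data from the instrument here.
        raw = np.random.default_rng().random(1024)
        self.spectrum.load_spectrum(raw)  # must end up in .spectrum


class MyAwesomeSpectrometerSpectrum(AbstractSpectrum):
    """Spectral data class; x and y must be stored as numpy arrays."""

    def load_spectrum(self, raw):
        # Illustrative parsing of the raw data into x/y arrays; the real
        # AbstractSpectrum constructor/abstract methods impose more structure.
        self.x = np.arange(raw.size, dtype=float)
        self.y = np.asarray(raw, dtype=float)
```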
That's it! Now your awesome spectrometer can be used as an analytical method to analyze your reaction during the `FinalAnalysis` step in the optimization workflow.
To implement your custom analysis (loss) function, you should prepare the following methods.

- A generic method for all `AbstractSpectrum` spectra classes, as `_generic_my_reaction_analysis` in the `LossFunctionCollection` class. This method should accept the following arguments: `spectrum` - instance of the analyzed spectrum class; `target` - name of the target as `target_name_targetvalue` (if that target implies a value, e.g. the peak position for the peak integration target); `reference` - position of the reference (aka internal standard) on the spectrum; `constraints` - list of constraints. The output of your function must be of float type. A sketch is given after this list.
- If any special processing is required for a specific spectrum class, a corresponding method should be prepared, as `_spectrumclassname_my_reaction_analysis`. The signature is the same as for the generic method.
- If your analysis function needs access to more than one spectrum, the corresponding method in `LossFunctionCollection` should raise `NotImplementedFunctionError` and a method with exactly the same name should be implemented in the `SpectraAnalyzer` class. However, that method should not take the spectrum argument, since all spectra are saved in the SA class as `spectraanalyzer.spectra`.
- Declare your loss function in the `TARGET_PARAMETERS` list in the `chemputeroptimizer.constants` module.
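As referenced in the first item, a sketch of such a generic method (the signature follows the description above; the body is illustrative):

```python
import numpy as np

# To be added as a method of LossFunctionCollection; the body is illustrative.
def _generic_my_reaction_analysis(self, spectrum, target, reference, constraints):
    # target arrives as 'my_reaction_analysis_<value>'; recover the value.
    target_value = float(target.rsplit('_', 1)[-1])
    # Illustrative outcome: y value of the point closest to the target x
    # position on the (already referenced) spectrum. Must return a float.
    idx = int(np.argmin(np.abs(spectrum.x - target_value)))
    return float(spectrum.y[idx])
```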
Note
A small note on the naming convention. If your loss function requires a parameter, it should be declared as `my_loss_function_name_mylossfunctionparameter`, since it will be parsed to obtain the corresponding method in the `LossFunctionCollection` class. If, however, your loss function does not need any parameters, you can declare it as is (i.e. `my_loss_function_name`) in the `constants` module; its name should, however, be patched in the `chemputeroptimizer.validation` module to append an underscore (`_`) for correct parsing.
All algorithms in the ChemputerOptimizer framework are managed by the `AlgorithmAPI` class. In most cases it is just responsible for converting parameters from dictionaries to arrays and back, with some additional "magic" for special methods (e.g., for novelty, where all previous experiments should be updated when a new result comes in). So, to implement a new algorithm, you will need to inherit from `chemputeroptimizer.algorithms.base_algorithm.AbstractAlgorithm` and implement a single method, `suggest`. This method should accept the following arguments:

- `parameters` - (n x i) size matrix, where n is the number of experiments and i is the number of experimental parameters.
- `results` - (n x j) size matrix, where j is the number of target parameters.
- `constraints` - tuple of min/max values for the parameters.
- `n_batches` - number of the latest executed experiments (batches). `-1` means that all data should be taken into account, e.g. when loading previous experiments.
- `n_returns` - number of new parameter sets to return (typically equal to the number of batches).

The method should return a numpy array of size (`n_returns` by `i`) with the new set(s) of experimental input parameters.
Note
The method should run even if called with all arguments as `None`.
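Putting it together, a minimal new algorithm might look like this (a random-suggestion body, shown only to illustrate the required interface; the exact `suggest` signature should be checked against `AbstractAlgorithm`):

```python
import numpy as np

from chemputeroptimizer.algorithms.base_algorithm import AbstractAlgorithm


class MyRandomAlgorithm(AbstractAlgorithm):
    """Minimal example: uniform random suggestions within the constraints."""

    def suggest(self, parameters=None, results=None, constraints=None,
                n_batches=1, n_returns=1):
        rng = np.random.default_rng()
        if constraints is None:
            # Must still run when called with all arguments as None.
            return np.zeros((n_returns or 1, 1))
        lows, highs = zip(*constraints)
        return rng.uniform(lows, highs, size=(n_returns, len(constraints)))
```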
This section is under development

Several tests are currently implemented for the ChemputerOptimizer library:

- Integration: simply runs all possible analytical instruments with all possible algorithms and objectives. This is pretty exhaustive (260+ tests as of `v0.3.6`), so a smaller test is provided, `test_at_random.py`, which performs N tests with (surprise) a random configuration.
- Unit: these tests are mainly used to check specific functionality. Prepare and/or run them if you've made very specific changes to the framework.

If you've added a new analytical instrument and/or loss function, don't forget to add the corresponding graph/xdl/configuration to the test files directory, so that it is picked up during the integration tests.
- Categorical variables.
- Include algorithm testing in the overall test pipeline.
- Export "not-only-optimization-related" XDL steps to ChemputerXDL.
- Retire the optimization configuration file in favor of an optimization description inside the XDL file.
- Use Blueprints inside OptimizeDynamicStep to control procedure parameter updates.
- Retire OptimizeDynamicStep in favor of an OptimizerPlatform controlling the execution pipeline.
- Advanced spectra comparison for quantitative control experiment validation.
- Custom algorithm for optimization: batch-constrained Bayesian optimization with GP, RF and Bayesian NN for surrogate modeling.
- Advanced parallelization support.
- Multi-objective optimization.