Retrosynthetic analysis and reaction prediction are fundamental for efficient chemical synthesis and drug discovery. Consequently, computer-assisted synthesis planning (CASP) tools have been at the forefront of research and development, striving to computationally identify the optimal sequence of chemical reaction steps that transform commercially viable starting materials into desired chemical compounds. 1
A leading CASP tool, AiZynthFinder, achieves retrosynthesis prediction by generating a retrosynthetic search tree using a template-based, feedforward neural network (FNN) model known as the expansion policy to give a ranked list of reaction templates. This process is followed by another neural network, the filter policy, which removes unfeasible reactions. Once the retrosynthesis search tree is constructed, a Monte Carlo Tree Search (MCTS) algorithm traverses the search tree to identify the best synthetic routes.
To enhance AiZynthFinder, this study integrates SMILES-based sequence-to-sequence (Seq2Seq) and transformer models into its expansion policy. By leveraging these advanced neural network architectures with SMILES-based chemical representations, we aim to overcome the inherent limitations of template-based retrosynthetic methods. The integration seeks to broaden accurate predictions beyond the rule-based knowledge base, and ensure that predictions consider the entire molecular environment and account for stereochemistry.
This study is ongoing, involving continuous model optimisations and research. The results and discussion for the latest Seq2Seq model are available here. Development of the transformer model is currently in progress.
1. Retrosynthesis with AiZynthFinder - Overview
1.1 Basics of Retrosynthesis
1.2 Retrosynthetic Search Tree
1.3 AiZynthFinder Template-Based Retrosynthesis Model (Define Disconnection Rules)
1.4 Monte Carlo Tree Search: Finding the Best Routes (Traverse the Retrosynthesis Search Tree Efficiently)
1.4.1 Heuristic Search Algorithms
1.4.2 Monte Carlo Tree Search in AiZynthFinder
1.5 AiZynthFinder Post-Processing Tools - Route Scoring
1.6 Route Clustering
1.7 References
2. AiZynthFinder's Expansion Policy Neural Network
2.1 What is AiZynthFinder's Expansion Policy Neural Network?
2.2 Neural Networks Overview
2.3 Feedforward Neural Networks (FNNs)
2.4 Recurrent Neural Networks (RNNs)
2.4.1 Recurrent Neural Network Architecture
2.4.2 Backpropagation vs Backpropagation Through Time
2.4.3 Recurrent Neural Network Training
2.4.4 Types of Recurrent Neural Networks
i. Standard RNNs
ii. Bidirectional Recurrent Neural Networks (BRNNs)
iii. Long Short-Term Memory (LSTM)
iv. Gated Recurrent Units (GRUs)
v. Encoder-Decoder RNN
2.5 References
3. Sequence-to-Sequence Expansion Policy
3.1 Limitations of Template-Based Retrosynthetic Methods
3.2 Alternative SMILES-Based Retrosynthetic Methods
3.3 Sequence-to-Sequence Model
3.4 Architecture of Sequence-to-Sequence Models
3.4.1 Encoder
3.4.2 Decoder
3.4.3 Attention Mechanism
3.5 References
4. Retrosynthesis Sequence-to-Sequence Model Literature Review
4.1 Britz et al. Analysis of Neural Machine Translation Architecture Hyperparameters
4.1.1 Embedding Dimensionality
4.1.2 Encoder and Decoder Recurrent Neural Network (RNN) Cell Variant
4.1.3 Encoder and Decoder Depth
4.1.4 Unidirectional vs. Bidirectional Encoder
4.1.5 Attention Mechanism
4.1.6 Beam Search Strategies
4.2 Liu et al. Sequence-to-Sequence Model
4.2.1 Data Preparation
i. Training
ii. Testing
4.2.2 Model Architecture
4.3 References
5. Project Retrosynthesis Sequence-to-Sequence Model
5.1 Data Preparation
5.2 Model Optimisation
5.2.1 Deterministic Training Environment
5.2.2 Data Tokenization and Preprocessing Optimisation
i. DeepChem Tokenizer
ii. TensorFlow TextVectorisation
5.2.3 Loss Function Optimisation
i. Categorical Cross-Entropy vs Sparse Categorical Cross-Entropy
ii. Optimiser - Adam
iii. Weight Decay (L2 Regularisation)
5.2.4 Callbacks Optimisation
i. EarlyStopping
ii. Dynamic Learning Rate (ReduceLROnPlateau)
iii. Checkpoints (ModelCheckpoint)
iv. Visualisation in TensorBoard (TensorBoard)
v. Validation Metrics (ValidationMetricsCallback)
5.2.5 Metrics Optimisation
i. Perplexity
ii. BLEU Score
iii. SMILES String Metrics (Exact Match, Chemical Validity, Tanimoto Similarity, Levenshtein Distance)
5.2.6 Encoder Optimisation
i. Residual Connections
ii. Layer Normalisation
5.2.7 Decoder Optimisation
i. Residual Connections and Layer Normalisation
5.2.8 Attention Mechanism Optimisation
i. Bahdanau Attention Mechanism
ii. Residual Connections and Layer Normalisation
5.2.9 Inference Optimisation
i. Greedy Decoding vs Beam Search
5.3 Model Architecture
5.3.1 Optimised Encoder Architecture
5.3.2 Optimised Decoder Architecture
5.3.3 Optimised Attention Mechanism (Bahdanau Attention) Architecture
5.4 Model Documentation
5.4.1 Model Training Pipeline
5.4.2 Model Data Flow - ONNX Graph
i. Flow of Data Through Encoder
ii. Flow of Data Through Decoder
iii. Flow of Data Through Attention Mechanism
5.4.3 Model Debugging
i. Data Tokenization and Preprocessing Debugging
ii. General TensorFlow Debugging
5.5 Results and Discussion
5.5.1 Evaluation of Current Optimal Model Architecture
i. Analysis of Performance Metrics
ii. Analysis of Sample Predictions
5.5.2 Integrating Sequence-to-Sequence Model into AiZynthFinder
i. Simple Drug Retrosynthesis - Aspirin
ii. Complex Chiral Drug Retrosynthesis - Rivaroxaban
iii. AiZynthFinder Expansion Policy Performance Analysis
5.6 Future Model Optimisations
5.6.1 Increased Training Dataset Size and Diversity
5.6.2 Layer-wise Learning Rate Decay
5.6.3 Scheduled Sampling
5.6.4 High Throughput Testing of Model Expansion Policy Performance
5.7 References
6. Transformer Expansion Policy
7. Retrosynthesis Transformer Model Literature Review
8. Project Retrosynthesis Transformer Model
AiZynthFinder is a computer-assisted synthesis planning (CASP) tool developed by AstraZeneca's MolecularAI department. It seeks to identify the optimal sequence of chemical reaction steps capable of transforming a set of commercially available starting materials into a desired chemical compound. 1 2
AiZynthFinder leverages recent advancements in machine learning techniques, specifically deep neural networks, to predict synthetic pathways via retrosynthetic analysis with minimal human intervention. 1 3
Retrosynthetic analysis involves the deconstruction of a target molecule into simpler precursor structures in order to probe different synthetic routes to the target molecule and compare the different routes in terms of synthetic viability.
Retrosynthesis involves:
- Disconnection:
- The breaking of a chemical bond to give a possible starting material. This can be thought of as the reverse of a synthetic reaction.
- Synthons:
- These are the fragments produced by the disconnection.
- Usually, a single bond disconnection will give a negatively charged, nucleophilic synthon, and a positively charged, electrophilic synthon.
- However, other times the disconnection will give neutral fragments. Classical examples of this are pericyclic reactions, such as Diels-Alder reactions.
- Synthetic Equivalents:
- Synthons are too reactive to exist as real species, so a synthetic equivalent is a reagent that carries out the function of a synthon in the synthesis.
- Functional Group Interconversion (FGI):
- If a disconnection is not possible at a given site, FGI can be used.
- An FGI is an operation whereby one functional group is converted into another so that a disconnection becomes possible.
- A common FGI is the oxidation of an alcohol to a carbonyl, or of an amine to a nitro group.
- Functional Group Addition (FGA):
- Similar to FGI, FGA is the addition of a functional group to a molecule to make it suitable for disconnection.
Typically, the retrosynthetic analysis of a target molecule is an iterative process whereby the subsequent fragments are themselves broken down until we reach a stop criterion. This stop criterion is typically when we reach precursors that are commercially available/in stock.
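The iterative process described above can be sketched as a small recursive function. This is only an illustration under stated assumptions: the `DISCONNECTIONS` lookup table, the `STOCK` set, and the `max_depth` limit are hypothetical stand-ins for a real set of disconnection rules and a real stock database, and the single entry uses the disconnection of aspirin into salicylic acid and acetic anhydride.

```python
from typing import Dict, List, Set

# Hypothetical one-step disconnection table (product SMILES -> precursor SMILES).
DISCONNECTIONS: Dict[str, List[str]] = {
    # aspirin -> salicylic acid + acetic anhydride
    "CC(=O)Oc1ccccc1C(=O)O": ["O=C(O)c1ccccc1O", "CC(=O)OC(C)=O"],
}

# Hypothetical stock of commercially available building blocks.
STOCK: Set[str] = {"O=C(O)c1ccccc1O", "CC(=O)OC(C)=O"}


def decompose(smiles: str, max_depth: int = 6, depth: int = 0) -> dict:
    """Recursively break a molecule down until every leaf meets a stop criterion."""
    if smiles in STOCK:
        # Stop criterion 1: the precursor is in stock.
        return {"smiles": smiles, "status": "in stock", "children": []}
    if depth >= max_depth or smiles not in DISCONNECTIONS:
        # Stop criterion 2: depth limit reached, or no known disconnection.
        return {"smiles": smiles, "status": "unsolved", "children": []}
    children = [decompose(p, max_depth, depth + 1) for p in DISCONNECTIONS[smiles]]
    return {"smiles": smiles, "status": "expanded", "children": children}


def is_solved(node: dict) -> bool:
    """A route is solved when every leaf of the tree is in stock."""
    if not node["children"]:
        return node["status"] == "in stock"
    return all(is_solved(child) for child in node["children"])


route = decompose("CC(=O)Oc1ccccc1C(=O)O")
print(is_solved(route))
```

A molecule with no entry in the table (for example `decompose("CCO")`) ends up as an unsolved leaf, which is the toy equivalent of a route that never reaches purchasable precursors.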
This iterative process results in a retrosynthesis tree that is extremely broad but relatively shallow. In comparison to the search trees for games such as chess and Go (Fig 6), the breadth of a retrosynthesis search tree is very large because, in theory, any bond in the target molecule and in each subsequent fragment could be broken. This leads to an explosion in the number of child nodes within the first few levels of the tree.
The depth of a retrosynthesis search tree, on the other hand, is small, as it only takes a few disconnections before viable precursors are found. This is ideal, since linear synthetic routes with an excessive number of steps are undesirable.
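To give a feel for why breadth dominates, the short calculation below counts the nodes in a full tree with a fixed branching factor. The branching factors (5 and 50) and the depth of 6 are illustrative assumptions, not measured values for any real retrosynthesis search.

```python
def tree_size(branching_factor: int, depth: int) -> int:
    """Total number of nodes in a full tree: the sum of b**i for i = 0..depth."""
    return sum(branching_factor ** level for level in range(depth + 1))


# Compare a narrow tree with a broad one over the same (shallow) depths.
for b in (5, 50):
    print(b, [tree_size(b, d) for d in range(1, 7)])
```

Even at a shallow depth, a tenfold increase in breadth increases the number of nodes by several orders of magnitude, which is why reducing the number of candidate disconnections per node matters so much.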
For effective retrosynthetic analysis, a retrosynthesis program must:
- Define the disconnection rules clearly and efficiently in order to reduce the breadth of the retrosynthesis search tree.
- Traverse the retrosynthesis search tree efficiently using an effective search algorithm.
AiZynthFinder uses a template-based retrosynthesis model to define the disconnection rules. This approach uses a curated database of transformation rules, which are extracted from external reaction databases and then encoded computationally into reaction templates as SMIRKS.
- SMIRKS is a form of linear notation used for molecular reaction representation. It was developed by Daylight and can be thought of as a hybrid between SMILES and SMARTS.
These reaction templates can then be used as the disconnection rules for decomposing the target molecule into simpler, commercially available precursors.
However, before they are used, AiZynthFinder uses a simple neural network (the expansion policy) to predict the probability of each template given a molecule 6 (Fig 8).
This expansion policy neural network template ranking works as follows:
- Encoding of query molecule/target molecule: The query molecule/target molecule is encoded as an extended-connectivity fingerprint (ECFP) bit string, 7 specifically an ECFP4 bit string.
- Expansion policy neural network: The ECFP4 fingerprints are then fed into a simple neural network, called an expansion policy. The output of this neural network is a ranked list of templates.
- Keep top-ranked templates and apply to target molecule: The top-ranked templates are kept (typically the top 50) and are applied to the target molecule, producing different sets of precursors.
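The ranking step above can be sketched as follows. This is a minimal illustration, not AiZynthFinder's implementation: the raw scores stand in for the output of the expansion policy network (which in practice takes an ECFP4 fingerprint as input), and the template names are hypothetical placeholders.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Convert raw network scores into probabilities that sum to 1."""
    shift = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_templates(
    scores: Sequence[float], templates: Sequence[str], k: int = 50
) -> List[Tuple[str, float]]:
    """Rank templates by predicted probability and keep the k most probable."""
    probs = softmax(scores)
    ranked = sorted(zip(templates, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical template identifiers and raw scores for one query molecule.
templates = ["template_A", "template_B", "template_C", "template_D"]
raw_scores = [0.2, 2.5, -1.0, 1.1]

for name, prob in top_k_templates(raw_scores, templates, k=2):
    print(f"{name}: {prob:.3f}")
```

Only the templates that survive this cut are applied to the target molecule, which is how the expansion policy limits the breadth of the search tree.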
However, the expansion policy doesn't know much about chemistry and doesn't take all of the reaction environment into consideration. As a result, it can rank unfeasible reactions highly.
Therefore, AiZynthFinder has another trained neural network, called the filter policy, that is used to filter and remove unfeasible reactions (Fig 9). 8
Fig 9 a) An example suggested route from the expansion policy without the filter neural network. However, this single-step route would not be practical in the wet laboratory due to selectivity issues. b) The suggested route when the filter neural network is applied. Although not perfect, it is a much more feasible route. 8
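Conceptually, the filter step amounts to discarding candidate reactions whose predicted feasibility is too low. The sketch below assumes the filter network returns a feasibility probability per reaction; the cutoff value, the function name, and the reaction labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Illustrative cutoff; an assumption, not a documented AiZynthFinder setting.
FEASIBILITY_CUTOFF = 0.05


def apply_filter_policy(
    feasibility: Dict[str, float], cutoff: float = FEASIBILITY_CUTOFF
) -> List[str]:
    """Return the candidate reactions whose predicted feasibility meets the cutoff."""
    return [reaction for reaction, prob in feasibility.items() if prob >= cutoff]


# Hypothetical feasibility probabilities for three candidate disconnections.
candidates = {"reaction_1": 0.92, "reaction_2": 0.01, "reaction_3": 0.40}
print(apply_filter_policy(candidates))
```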
1.4 Monte Carlo Tree Search: Finding the Best Routes (Traverse the Retrosynthesis Search Tree Efficiently)
Monte Carlo Tree Search (MCTS) 9 is a powerful search algorithm that uses heuristics (i.e., rules of thumb) for decision-making processes, particularly in complex search spaces.
- Like with other heuristic search algorithms, the goal of MCTS is to find a good enough solution within a reasonable amount of time, rather than guaranteeing the optimal solution by examining all possible outcomes.
- Heuristic search algorithms like MCTS are guided by a heuristic function, which is a mathematical function used to estimate the cost, distance or likelihood of reaching the goal from a given state or node. This function helps prioritise which paths or options to explore, based on their likelihood of leading to an optimal or near-optimal solution.
- Heuristic search algorithms aim to reduce the search space, making them more efficient than exhaustive search methods. By focusing on promising paths, they can often find solutions faster, especially in complex or large problem spaces. Although these solutions may not be optimal, they are usually good enough for practical purposes.
In AiZynthFinder, MCTS plays a crucial role in effectively navigating the vast search space of possible synthetic routes to find the best synthesis pathway for a target molecule.
To recap, the retrosynthesis tree structure representation consists of:
- Nodes: Each node in the tree represents a state of the retrosynthesis problem. In AiZynthFinder, a node corresponds to a set of one or more intermediate molecules that can be used to synthesise the molecule(s) in the current node's parent node.
- Edges: The edges between the nodes represent the application of a specific reaction template (disconnection rule) to decompose the molecule set in the parent node into simpler precursor molecules in the child node.
In AiZynthFinder, MCTS uses iterative/sequential Monte Carlo simulations 10 to explore potential synthetic routes as follows:
1. Selection
- Starting at the root node (target molecule), the MCTS algorithm selects the most promising node for expansion based on a balance of exploration (trying new reactions) and exploitation (further exploring known good reactions).
- This is governed by an Upper Confidence Bound (UCB) score formula (Fig 10) and is how AiZynthFinder selects and scores routes.
Fig 10 AiZynthFinder Upper Confidence Bound (UCB) score formula for selecting and scoring synthetic routes in a retrosynthesis tree 5
2. Expansion
- Once a node has been selected, the MCTS algorithm expands it by applying new reaction templates suggested by the expansion policy, generating new precursor molecules (i.e., new child nodes).
- At each expansion step, the filter policy is used to remove unfeasible reactions, ensuring that the search focuses on viable synthetic routes.
3. Rollout/Iteration
- This process of selection and expansion is then repeated for each resulting precursor molecule until the stop criterion is met and we reach a terminal node.
- The stop criterion is usually either when the search reaches commercially available precursors, or when it reaches a pre-defined tree depth/number of disconnections.
4. Update/Backpropagation
- Once the terminal node is reached, the Monte Carlo simulation is complete and a complete synthetic route is generated. This completed simulation/synthetic route is known as a playout.
- The score of the terminal node (and hence the score of the playout/synthetic route) is then propagated up through the tree.
- This score of the terminal node is the accumulated reward (Q) in the UCB Score formula (Fig 10), which is a function of the tree depth at the terminal node (i.e., how many synthesis steps between it and the target molecule), and the fraction of the precursor molecules in that route that are in stock.
- This gives a quantitative analysis of the quality of the synthetic route.
Steps 1 - 4 are then repeated in iterative Monte Carlo simulations. The number of iterations is governed by a predefined limit, or a predefined search time. This iterative process is illustrated in Fig 11.
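The four steps can be sketched on a toy problem as shown below. This is a simplified illustration under several assumptions, not AiZynthFinder's code: the `TEMPLATES` and `STOCK` data are hypothetical, the selection score uses the common UCB1-style form `Q/N + C * sqrt(2 * ln(N_parent) / N)` (the exact formula in Fig 10 is not reproduced in the text and may differ), the rollout picks children at random, and the reward weights (0.95 for the in-stock fraction, 0.05 for depth) are illustrative choices.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Hypothetical toy data: each molecule maps to alternative precursor sets.
TEMPLATES: Dict[str, List[Tuple[str, ...]]] = {
    "target": [("A", "B"), ("C",)],
    "C": [("D",), ("A",)],
}
STOCK = {"A", "B"}
MAX_DEPTH = 3
C = 1.4  # exploration constant (assumed value)


@dataclass(eq=False)
class Node:
    molecules: Tuple[str, ...]
    depth: int = 0
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)
    visits: int = 0
    value: float = 0.0  # accumulated reward Q

    def expansions(self) -> List[Tuple[str, ...]]:
        """Child states from disconnecting the first expandable molecule."""
        if self.depth >= MAX_DEPTH:
            return []
        for mol in self.molecules:
            if mol not in STOCK and mol in TEMPLATES:
                rest = tuple(m for m in self.molecules if m != mol)
                return [rest + precursors for precursors in TEMPLATES[mol]]
        return []


def ucb(child: Node, parent_visits: int) -> float:
    """UCB1-style score; unvisited children are always tried first."""
    if child.visits == 0:
        return math.inf
    exploit = child.value / child.visits
    explore = C * math.sqrt(2 * math.log(parent_visits) / child.visits)
    return exploit + explore


def reward(node: Node) -> float:
    """Score a terminal node from its in-stock fraction and its depth."""
    in_stock = sum(m in STOCK for m in node.molecules) / len(node.molecules)
    return 0.95 * in_stock + 0.05 * (1 - node.depth / MAX_DEPTH)


def expand(node: Node) -> None:
    if not node.children:
        node.children = [Node(m, node.depth + 1, node) for m in node.expansions()]


def run_iteration(root: Node, rng: random.Random) -> None:
    node = root
    while node.children:  # 1. Selection
        parent = node
        node = max(parent.children, key=lambda child: ucb(child, parent.visits))
    expand(node)  # 2. Expansion
    while node.children:  # 3. Rollout until a terminal node is reached
        node = rng.choice(node.children)
        expand(node)
    score = reward(node)  # 4. Update/backpropagation
    while node is not None:
        node.visits += 1
        node.value += score
        node = node.parent


def search(iterations: int = 200, seed: int = 0) -> Node:
    root, rng = Node(("target",)), random.Random(seed)
    for _ in range(iterations):
        run_iteration(root, rng)
    return root


def best_route(root: Node) -> List[Tuple[str, ...]]:
    """Follow the visited child with the highest mean reward at each level."""
    route, node = [root.molecules], root
    while any(child.visits for child in node.children):
        visited = [child for child in node.children if child.visits]
        node = max(visited, key=lambda child: child.value / child.visits)
        route.append(node.molecules)
    return route


print(best_route(search()))
```

In this toy tree, the one-step route to the in-stock precursors `A` and `B` scores higher than any two-step route through `C`, so it is the route returned.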
AiZynthFinder also uses a number of scoring algorithms to score routes during post-processing (Fig 12).
AiZynthFinder also has the ability to cluster routes in order to perform a cluster analysis via hierarchical clustering. 11
The specific type of hierarchical clustering that AiZynthFinder uses is agglomerative ("bottom-up") hierarchical clustering 9. This involves:
- Creating a dendrogram (a common visualisation tool from the `scipy.cluster.hierarchy` package) to represent the hierarchy of clusters of routes formed at different levels of distance.
- Using a linkage matrix (`linkage_matrix`) to calculate the Euclidean distance between the clusters at each step of the clustering process. This gives a measure of similarity or dissimilarity between the clusters of routes. This `linkage_matrix` is generated by the `ClusterHelper` class, which uses the agglomerative clustering algorithm implemented in scikit-learn.
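To illustrate the "bottom-up" idea, the sketch below implements agglomerative clustering in plain Python on a precomputed distance matrix. It is not the scikit-learn implementation that AiZynthFinder relies on: the choice of single linkage, the function name, and the route distances are assumptions made for this example.

```python
from typing import List, Sequence


def agglomerative(
    distances: Sequence[Sequence[float]], n_clusters: int
) -> List[List[int]]:
    """Single-linkage agglomerative clustering on a precomputed distance matrix."""
    # Start with every route in its own cluster.
    clusters = [[i] for i in range(len(distances))]
    while len(clusters) > max(n_clusters, 1):
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single linkage: distance between the two closest members.
                d = min(distances[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        # Merge the closest pair of clusters and continue.
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return clusters


# Hypothetical pairwise distances between four routes.
route_distances = [
    [0.0, 1.0, 6.0, 7.0],
    [1.0, 0.0, 5.0, 6.0],
    [6.0, 5.0, 0.0, 2.0],
    [7.0, 6.0, 2.0, 0.0],
]
print(agglomerative(route_distances, n_clusters=2))
```

Recording the distance at which each merge happens is what produces the linkage information that a dendrogram visualises.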
[1] Saigiridharan, L. et al. (2024) ‘AiZynthFinder 4.0: Developments based on learnings from 3 years of industrial application’, Journal of Cheminformatics, 16(1).
[2] Coley, C.W. et al. (2017) ‘Prediction of organic reaction outcomes using machine learning’, ACS Central Science, 3(5), pp. 434–443.
[3] Ishida, S. et al. (2022) ‘AI-driven synthetic route design incorporated with Retrosynthesis Knowledge’, Journal of Chemical Information and Modeling, 62(6), pp. 1357–1367.
[4] Zhao, D., Tu, S. and Xu, L. (2024) ‘Efficient retrosynthetic planning with MCTS Exploration Enhanced A* search’, Communications Chemistry, 7(1).
[5] Genheden, S. (2022) 'AiZynthFinder', AstraZeneca R&D Presentation. Available at: https://www.youtube.com/watch?v=r9Dsxm-mcgA (Accessed: 22 August 2024).
[6] Thakkar, A. et al. (2020) ‘Datasets and their influence on the development of computer assisted synthesis planning tools in the pharmaceutical domain’, Chemical Science, 11(1), pp. 154–168.
[7] David, L. et al. (2020) ‘Molecular representations in AI-Driven Drug Discovery: A review and practical guide’, Journal of Cheminformatics, 12(1).
[8] Genheden, S., Engkvist, O. and Bjerrum, E.J. (2020) A quick policy to filter reactions based on feasibility in AI-guided retrosynthetic planning.
[9] Coulom, R. (2007) ‘Efficient selectivity and backup operators in Monte-Carlo Tree Search’, Lecture Notes in Computer Science, pp. 72–83.
[10] Kroese, D.P. et al. (2014) ‘Why the Monte Carlo method is so important today’, WIREs Computational Statistics, 6(6), pp. 386–392.
[11] Genheden, S., Engkvist, O. and Bjerrum, E. (2021) ‘Clustering of synthetic routes using tree edit distance’, Journal of Chemical Information and Modeling, 61(8), pp. 3899–3907.