Luca Naef (VantAI)
🔥 What are the most important advancements in the field you noticed in 2023?
1️⃣ Increasing multi-modality & modularity, as shown by the emergence of initial co-folding methods for both proteins & small molecules, diffusion and non-diffusion-based, to extend on AF2's success: DiffusionProteinLigand in the last days of 2022 and RFDiffusion, AlphaFold2 and Umol by the end of 2023. We are also seeing models that have sequence & structure co-trained: SAProt, ProstT5, and sequence, structure & surface co-trained with ProteinINR. There may be a general revival of surface-based methods after a quieter 2021 and 2022: DiffMasif, SurfDock, and ShapeProt.
2️⃣ Datasets and benchmarks. Datasets, especially synthetic/computationally derived: ATLAS and the MDDB for protein dynamics. MISATO, SPICE, Splinter for protein-ligand complexes, QM1B for molecular properties. PINDER: a large protein-protein docking dataset with matched apo/predicted pairs and a benchmark suite with retrained docking models. CryoET data portal for CryoET. And a whole host of welcome benchmarks: PINDER, PoseBusters, and PoseCheck, with a focus on more rigorous and practically relevant settings.
3️⃣ Creative pre-training strategies to get around the sparsity of diverse protein-ligand complexes. Van-der-mers training (DockGen) & sidechain training strategies in RF-AA and pre-training on ligand-only complexes in CCD in RF-AA. Multi-task pre-training in Unimol and others.
🏋️ What are the open challenges that researchers might overlook?
1️⃣ Generalization. DockGen showed that current state-of-the-art protein-ligand docking models completely lose predictability when asked to generalise towards novel protein domains. We see the same phenomenon in the AlphaFold-latest report, where performance on novel proteins & ligands drops heavily to below biophysics-based baselines (which have access to holo structures), despite very generous definitions of novel protein & ligand. This means that existing approaches might still largely depend on memorization, an observation that has been extensively argued over the years.
2️⃣ The curse of (simple) baselines. A recurring topic over the years, 2023 has again shown what industry practitioners have long known: in many practical problems such as molecular generation, property prediction, docking, and conformer prediction, simple baselines or classical approaches often still outperform ML-based approaches in practice. This has been documented increasingly in 2023 by Tripp et al., Yu et al., Zhou et al.
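One way to act on the baselines point is to report every model next to trivial reference predictors on the same split. The sketch below is a minimal illustration, not taken from any of the cited papers: the data, the scalar descriptor, and the helper names (`mean_baseline`, `nearest_neighbour_baseline`, `report`) are all hypothetical.

```python
from statistics import mean

def mae(y_true, y_pred):
    """Mean absolute error between two equal-length sequences."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

def mean_baseline(y_train, n_test):
    """Trivial baseline: always predict the training-set mean."""
    mu = mean(y_train)
    return [mu] * n_test

def nearest_neighbour_baseline(x_train, y_train, x_test):
    """1-nearest-neighbour baseline on a single scalar descriptor."""
    preds = []
    for x in x_test:
        idx = min(range(len(x_train)), key=lambda i: abs(x_train[i] - x))
        preds.append(y_train[idx])
    return preds

def report(y_test, model_preds, baselines):
    """Return {name: MAE} for the model and for every baseline."""
    scores = {"model": mae(y_test, model_preds)}
    for name, preds in baselines.items():
        scores[name] = mae(y_test, preds)
    return scores

# Toy data, made up purely for illustration.
x_train = [1.0, 2.0, 3.0, 4.0]
y_train = [1.1, 1.9, 3.2, 3.8]
x_test = [1.5, 3.5]
y_test = [1.5, 3.5]
model_preds = [2.5, 2.5]  # stand-in for the output of an ML model

scores = report(y_test, model_preds, {
    "mean": mean_baseline(y_train, len(x_test)),
    "1-NN": nearest_neighbour_baseline(x_train, y_train, x_test),
})
print(scores)
```

In this toy setup the stand-in model does no better than predicting the mean, and the 1-NN baseline beats both. That kind of result only becomes visible when the baselines are evaluated on the same split as the model.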
🔮 Predictions for 2024!
“In 2024, data sparsity will remain top of mind and we will see a lot of smart ways to use models to generate synthetic training data. Self-distillation in AlphaFold2 served as a big inspiration, as did Confidence Bootstrapping in DockGen, which leverages the insight that we now have sufficiently powerful models that can score poses but not always generate them, first realised in 2022.” - Luca Naef (VantAI)
2️⃣ We will see more biological/chemical assays purpose-built for ML or only making sense in a machine learning context (i.e., they will not lead to biological insight by themselves but will be primarily useful for training models). An example from 2023 is the large-scale protein folding experiments by Tsuboyama et al. This trend is perhaps driven by techbio startups, where we have seen the first foundation models built on such ML-purpose-built assays for structural biology, e.g. ATOM-1.
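The synthetic-data idea in the quote above (a scorer that can rank poses better than the generator can produce them) can be sketched as a filter loop. This is a toy illustration under strong assumptions and is not the DockGen algorithm: `generate_poses` and `confidence` are hypothetical stand-ins, and a "pose" is reduced to a single number.

```python
import random

def generate_poses(rng, n):
    """Hypothetical generator: proposes n candidate 'poses' (here: scalars)."""
    return [rng.uniform(-3.0, 3.0) for _ in range(n)]

def confidence(pose):
    """Hypothetical scorer: higher is better, peaked at pose == 0."""
    return 1.0 / (1.0 + pose * pose)

def bootstrap_round(rng, n_candidates, threshold):
    """Sample candidates and keep only those the scorer is confident about."""
    candidates = generate_poses(rng, n_candidates)
    return [p for p in candidates if confidence(p) >= threshold]

rng = random.Random(0)  # fixed seed so the run is reproducible
synthetic_set = []
for _ in range(3):
    synthetic_set.extend(bootstrap_round(rng, n_candidates=100, threshold=0.8))

# In a real pipeline the generator would now be fine-tuned on synthetic_set
# and the loop repeated; that training step is omitted here.
print(len(synthetic_set), "high-confidence samples kept out of 300")
```

The design choice to note is that the scorer, not the generator, decides what enters the training set, so the quality of the synthetic data is bounded by how well the confidence model is calibrated.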
Andreas Loukas (Prescient Design, part of Genentech)
🔥 What are the most important advancements in the field you noticed in 2023?
“In 2023, we started to see some of the challenges of equivariant generation and representation for proteins being resolved through diffusion models.” - Andreas Loukas (Prescient Design)
1️⃣ We also noticed a shift towards approaches that model and generate molecular systems at higher fidelity. For instance, the most recent models adopt a fully end-to-end approach by generating backbone, sequence and side-chains jointly (AbDiffuser, dyMEAN) or at least solve the problem in two steps but with a partially joint model (Chroma), as compared to backbone generation followed by inverse folding as in RFDiffusion and FrameDiff. Other attempts to improve the modelling fidelity can be found in the latest updates to co-folding tools like AlphaFold2 and RFDiffusion, which render them sensitive to non-protein components (ligands, prosthetic groups, cofactors), as well as in papers that attempt to account for conformational dynamics (see discussion above). In my view, this line of work is essential because the binding behaviour of molecular systems can be very sensitive to how atoms are placed, move, and interact.
2️⃣ In 2023, many works also attempted to get a handle on binding affinity by learning to predict the effect of mutations of a known crystal structure by pre-training on large corpora, such as computationally predicted mutations (graphinity), and on side-tasks, such as rotamer density estimation. The obtained results are encouraging as they can significantly outperform semi-empirical baselines like Rosetta and FoldX. However, there is still significant work to be done to render these models reliable for binding affinity prediction.
3️⃣ I have further observed a growing recognition of protein Language Models (pLMs), and specifically ESM, as valuable tools, even among those who primarily favour geometric deep learning. These embeddings are used to help docking models, allow the construction of simple yet competitive predictive models for binding affinity prediction (Li et al 2023), and can generally offer an efficient way to create residue representations for GNNs that are informed by the extensive proteome data without the need for extensive pretraining (Jamasb et al 2023). However, I do maintain a concern regarding the use of pLMs: it is unclear whether their effectiveness is due to data leakage or genuine generalisation. This is particularly pertinent when evaluating models on tasks like amino-acid recovery in inverse folding and conditional CDR design, where distinguishing between these two factors is crucial.
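A first sanity check for the leakage concern is to flag test sequences that closely resemble something in the training data. The sketch below is a crude illustration only: it uses Python's generic `difflib.SequenceMatcher` ratio as a stand-in for a proper alignment-based sequence identity, and the sequences, the `flag_leaky` helper, and the 0.3 cut-off are assumptions made for the example.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude string similarity in [0, 1]; NOT a true alignment identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()

def flag_leaky(train_seqs, test_seqs, max_similarity=0.3):
    """Return (test_sequence, best_similarity) pairs above the cut-off."""
    leaky = []
    for t in test_seqs:
        best = max(similarity(t, s) for s in train_seqs)
        if best > max_similarity:
            leaky.append((t, best))
    return leaky

# Toy sequences, made up for illustration.
train = ["MKTAYIAKQR", "GSHMLEDPVA"]
test = ["MKTAYIAKQK", "WWWWWWWWWW"]

leaky = flag_leaky(train, test)
for seq, score in leaky:
    print(f"{seq} resembles a training sequence (similarity {score:.2f})")
```

Reporting metrics separately on the flagged and unflagged subsets gives a rough picture of how much of a model's performance depends on near-duplicates of its training data.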
🏋️ What are the open challenges that researchers might overlook?
1️⃣ Working with energetically relaxed crystal structures (and, even worse, folded structures) can significantly affect the performance of downstream predictive models. This is particularly true for the prediction of protein-protein interactions (PPIs). In my experience, the performance of PPI predictors severely deteriorates when they are given a relaxed structure rather than the bound (holo) crystallised structure.
2️⃣ Though successful in silico antibody design has the potential to revolutionise drug design, general protein models are not (yet?) as good at folding, docking or generating antibodies as antibody-specific models are. This is perhaps due to the low conformational variability of the antibody fold and the distinct binding mode between antibodies and antigens (loop-mediated interactions that can involve a non-negligible entropic component). Perhaps for the same reasons, the de novo design of antibody binders (which I define as 0-shot generation of an antibody that binds to a previously unseen epitope) remains an open problem. Currently, experimentally confirmed cases of de novo binders involve mostly stable proteins, like alpha-helical bundles, that are common in the PDB and harbour interfaces that differ substantially from epitope-paratope interactions.
3️⃣ We are still lacking a general-purpose proxy for binding free energy. The main issue here is the lack of high-quality data of sufficient size and diversity (esp. co-crystal structures). We should therefore be cognizant of the limitations of any such learned proxy for any model evaluation: although predicted binding scores that are out of distribution of known binders are a clear signal that something is off, we should avoid the typical pitfall of trying to demonstrate the superiority of our model in an empirical evaluation by showing how it leads to even higher scores.
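The out-of-distribution signal described in the last point can be checked with a simple z-score against the scores of known binders. This is a minimal sketch under stated assumptions: the score values, the `ood_flags` helper, and the cut-off of three standard deviations are all illustrative, and a real analysis would need a distribution-aware test.

```python
from statistics import mean, stdev

def ood_flags(known_scores, predicted_scores, z_max=3.0):
    """Flag predictions lying more than z_max standard deviations
    away from the mean score of known binders."""
    mu = mean(known_scores)
    sigma = stdev(known_scores)
    return [abs(s - mu) / sigma > z_max for s in predicted_scores]

# Hypothetical proxy scores (lower = stronger predicted binding).
known_binders = [-9.0, -8.5, -10.0, -9.5, -8.0]
designed = [-9.2, -15.0]

flags = ood_flags(known_binders, designed)
print(flags)  # the second design is flagged as out of distribution
```

A flagged design should be treated as suspicious rather than superior: an extreme score more likely reflects the proxy being exploited than a genuinely better binder.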
Dominique Beaini (Valence Labs, part of Recursion)
“I’m excited to see a very large community being built around the problem of drug discovery, and I feel we are on the brink of a new revolution in the speed and efficiency of discovering drugs.” - Dominique Beaini (Valence Labs)
What work got me excited in 2023?
I’m confident that machine learning will allow us to tackle rare diseases quickly, stop the next COVID-X pandemic before it can spread, and live longer and healthier. But there’s a lot of work to be done and there are a lot of challenges ahead, some bumps in the road, and some canyons on the way. Speaking of communities, you can visit the Valence Portal to keep up-to-date with the 🔥 latest in ML for drug discovery.
What are the hard questions for 2024?
⚛️ A new generation of quantum mechanics. Machine learning force-fields, often based on equivariant and invariant GNNs, have been promising us a treasure: the precision of density functional theory, but hundreds of times faster and at the scale of entire proteins. Although some steps were made in this direction with Allegro and MACE-MP, current models don’t generalize well to unseen settings and very large molecules, and they are still too slow to be applicable on the timescale that is required 🐢. For the generalization, I believe that larger and more diverse datasets are the most important stepping stones. For the computation time, I believe we will see models that are less strict in enforcing equivariance, such as FAENet. But efficient sampling methods will play a bigger role: spatial sampling, such as using DiffDock to get more interesting starting points, and time sampling, such as TimeWarp to avoid simulating every frame. I’m really excited by the big STEBS 👣 awaiting us in 2024: Spatio-temporal equivariant Boltzmann samplers.
🕸️ Everything is connected. Biology is inherently multimodal 🙋🐁 🧫🧬🧪. One cannot simply decouple the molecule from the rest of the biological system. Of course, that’s how ML for drug discovery was done in the past: simply build a model of the molecular graph and fit it to experimental data. But we have reached a critical point 🛑, no matter how many trillion parameters are in the GNN model, how much data is used to train it, and how many experts are mixed together. It’s time to bring biology into the mix, and the most straightforward way is with multi-modal models. One method is to condition the output of the GNNs on the target protein sequences, as in MocFormer. Another is to use microscopy images or transcriptomics to better inform the model of the biological signature of molecules, as in TranSiGen. Yet another is to use LLMs to embed contextual information about the tasks, as in TwinBooster. Or even better, combining all of these together 🤯, but this could take years. The main issue for the broader community appears to be the availability of large amounts of quality and standardized data, but fortunately, this is not an issue for Valence.
🔬 Relating biological knowledge and observables. Humans have been trying to map biology for a long time, building relational maps for genes 🧬, protein-protein interactions 🔄, metabolic pathways 🔀, etc. I invite you to read this review of knowledge graphs for drug discovery. But all this information often sits unused and ignored by the ML community. I feel that this is an area where GNNs for knowledge graphs could prove very useful, especially in 2024, and it could provide another modality for the 🕸️ point above. Considering that human knowledge is incomplete, we can instead recover relational maps from foundation models. This is the route taken by Phenom1 when attempting to recall known genetic relationships. However, having to deal with various knowledge databases is an extremely complex task that we can’t expect most ML scientists to be able to tackle alone. But with the help of artificial assistants like LOWE, this can be done in a matter of seconds.
🏆 Benchmarks, benchmarks, benchmarks. I can’t repeat the word benchmark enough. Alas, benchmarks will stay the unloved kid on the ML block 🫥. But if the word benchmark is uncool, its cousin competition is way cooler 😎! Just as the OGB-LSC competition and Open Catalyst challenge played a major role for the GNN community, it is now time for a new series of competitions 🥇. We even got the TGB (Temporal Graph Benchmark) recently. If you were at NeurIPS’23, then you probably heard of Polaris coming up in early 2024 ✨. Polaris is a consortium of multiple pharma and academic groups trying to improve the quality of available molecular benchmarks to better represent real drug discovery. Perhaps we’ll even see a benchmark suitable for molecular graph generation instead of optimizing QED and cLogP, but I wouldn’t hold my breath, I have been waiting for years. What kind of new, crazy competition will light up the GDL community this year 🤔?