Dr. Serafim Batzoglou, Chief Data Officer at Seer – Interview Series

Serafim Batzoglou is Chief Data Officer at Seer. Prior to joining Seer, Serafim served as Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. Prior to Insitro, he served as VP of Applied and Computational Biology at Illumina, leading research and technology development of AI and molecular assays for making genomic data more interpretable in human health.

What initially attracted you to the field of genomics?

I became interested in the field of computational biology at the start of my PhD in computer science at MIT, when I took a class on the topic taught by Bonnie Berger, who became my PhD advisor, and David Gifford. The Human Genome Project was picking up pace during my PhD. Eric Lander, who was heading the Genome Center at MIT, became my PhD co-advisor and involved me in the project. Motivated by the Human Genome Project, I worked on whole-genome assembly and comparative genomics of human and mouse DNA.

I then moved to Stanford University as faculty in the Computer Science department, where I spent 15 years and was privileged to have advised about 30 incredibly talented PhD students and many postdoctoral researchers and undergraduates. My team’s focus has been the application of algorithms, machine learning, and software tool development to the analysis of large-scale genomic and biomolecular data. I left Stanford in 2016 to lead a research and technology development team at Illumina. Since then, I have enjoyed leading R&D teams in industry. I find that teamwork, the business aspect, and a more direct impact on society are characteristic of industry compared with academia. I have worked at innovative companies throughout my career: DNAnexus, which I co-founded in 2009, Illumina, insitro, and now Seer. Computation and machine learning are essential across the technology chain in biotech, from technology development, to data acquisition, to biological data interpretation and translation to human health.

During the last 20 years, sequencing the human genome has become vastly cheaper and faster. This has led to dramatic growth in the genome sequencing market and broader adoption in the life sciences industry. We are now on the cusp of having population genomic, multi-omic, and phenotypic data of sufficient size to meaningfully revolutionize healthcare, including prevention, diagnosis, treatment, and drug discovery. We can increasingly identify the molecular underpinnings of disease in individuals through computational analysis of genomic data, and patients have the chance to receive treatments that are personalized and targeted, especially in the areas of cancer and rare genetic disease. Beyond the obvious uses in medicine, machine learning coupled with genomic information allows us to gain insights into other areas of our lives, such as genealogy and nutrition. The next several years will see the adoption of personalized, data-driven healthcare, first for select groups of individuals, such as rare disease patients, and increasingly for the broad public.

Prior to your current role you were Chief Data Officer at Insitro, leading machine learning and data science in their approach to drug discovery. What were some of your key takeaways from that time about how machine learning can be used to speed up drug discovery?

The traditional drug discovery and development “trial-and-error” paradigm is plagued with inefficiencies and extremely lengthy timelines. For one drug to get to market, it can take upwards of $1 billion and over a decade. By incorporating machine learning into these efforts, we can dramatically reduce costs and timeframes at several steps along the way. One step is target identification, where a gene or set of genes that modulate a disease phenotype or revert a diseased cellular state to a healthier state can be identified through large-scale genetic and chemical perturbations and phenotypic readouts such as imaging and functional genomics. Another step is compound identification and optimization, where a small molecule or other modality can be designed by machine learning-driven in silico prediction as well as in vitro screening, and where desired properties of a drug such as solubility, permeability, specificity, and non-toxicity can be optimized. The hardest as well as most important aspect is perhaps translation to humans. Here, the choice of the right model (induced pluripotent stem cell-derived lines versus primary patient cell lines and tissue samples versus animal models) for the right disease poses an incredibly important set of tradeoffs that ultimately determine the ability of the resulting data, plus machine learning, to translate to patients.

Seer Bio is pioneering new ways to decode the secrets of the proteome to improve human health. For readers who are unfamiliar with this term, what is the proteome?

The proteome is the changing set of proteins produced or modified by an organism over time and in response to environment, nutrition, and health state. Proteomics is the study of the proteome within a given cell type or tissue sample. The genome of a human or other organism is static: with the important exception of somatic mutations, the genome one has at birth is the genome one has for one’s entire life, copied exactly in each cell of the body. The proteome is dynamic and changes over time spans of years, days, and even minutes. As such, proteomes are vastly closer to phenotype, and ultimately to health status, than genomes are, and consequently more informative for monitoring health and understanding disease.

At Seer, we have developed a new way to access the proteome that provides deeper insights into proteins and proteoforms in complex samples such as plasma, a highly accessible sample that unfortunately has, to date, posed a great challenge for conventional mass spectrometry proteomics.

What is Seer’s Proteograph™ platform, and how does it offer a new view of the proteome?

Seer’s Proteograph platform leverages a library of proprietary engineered nanoparticles, powered by a simple, rapid, and automated workflow, enabling deep and scalable interrogation of the proteome.

The Proteograph platform shines at interrogating plasma and other complex samples that exhibit a large dynamic range (many orders of magnitude of difference in the abundance of various proteins in the sample), where conventional mass spectrometry methods are unable to detect the low-abundance part of the proteome. Seer’s nanoparticles are engineered with tunable physicochemical properties that gather proteins across the dynamic range in an unbiased manner. In typical plasma samples, our technology enables detection of 5x to 8x more proteins than when processing neat plasma without the Proteograph. As a result, from sample prep to instrumentation to data analysis, our Proteograph Product Suite helps scientists find proteome disease signatures that might otherwise be undetectable. We like to say that at Seer, we are opening up a new gateway to the proteome.

Moreover, we are allowing scientists to easily perform large-scale proteogenomic studies. Proteogenomics is the combination of genomic and proteomic data to identify and quantify protein variants, link genomic variants with protein abundance levels, and ultimately link the genome and the proteome to phenotype and disease, and begin disentangling the causal and downstream genetic pathways associated with disease.

Can you discuss some of the machine learning technology that is currently used at Seer Bio?

Seer is leveraging machine learning at all steps, from technology development to downstream data analysis. These steps include: (1) design of our proprietary nanoparticles, where machine learning helps us determine which physicochemical properties and combinations of nanoparticles will work with specific product lines and assays; (2) detection and quantification of peptides, proteins, variants, and proteoforms from the readout data produced by the MS instruments; and (3) downstream proteomic and proteogenomic analyses in large-scale population cohorts.

Last year, we published a paper in Advanced Materials combining proteomics methods, nanoengineering, and machine learning to improve our understanding of the mechanisms of protein corona formation. This paper uncovered nano-bio interactions and is informing Seer in the creation of improved future nanoparticles and products.

Beyond nanoparticle development, we have been developing novel algorithms to identify variant peptides and post-translational modifications (PTMs). We recently developed a method for the detection of protein quantitative trait loci (pQTLs) that is robust to protein variants, which are a known confounder for affinity-based proteomics. We are extending this work to directly identify these peptides from the raw spectra using deep learning-based de novo sequencing methods, allowing search without inflating the size of spectral libraries.

Our team is also developing methods that enable scientists without deep machine learning expertise to optimally tune and utilize machine learning models in their discovery work. This is achieved via a Seer ML framework based on the AutoML tool, which allows efficient hyperparameter tuning via Bayesian optimization.
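Seer’s framework itself is not public, but the Bayesian optimization it relies on can be sketched generically: fit a probabilistic surrogate to the hyperparameter-to-loss mapping, then pick the next trial by maximizing expected improvement. The sketch below tunes a single hypothetical hyperparameter (the learning rate) against a made-up validation loss, using a Gaussian-process surrogate:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(log_lr):
    # Hypothetical validation loss as a function of log10(learning rate);
    # stands in for an expensive model-training run. Minimum at log_lr = -2.
    return (log_lr + 2.0) ** 2 + 0.1

def expected_improvement(candidates, gp, best_y):
    # EI acquisition for minimization: how much each candidate is expected
    # to improve on the best loss seen so far.
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (best_y - mu) / sigma
    return (best_y - mu) * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-5.0, 0.0, size=(3, 1))       # three random initial trials
y = np.array([objective(x[0]) for x in X])

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), alpha=1e-6, normalize_y=True)
for _ in range(15):
    gp.fit(X, y)                               # refit surrogate to all trials
    cand = np.linspace(-5.0, 0.0, 200).reshape(-1, 1)
    x_next = cand[np.argmax(expected_improvement(cand, gp, y.min()))]
    X = np.vstack([X, x_next])                 # evaluate the most promising point
    y = np.append(y, objective(x_next[0]))

best_log_lr = X[np.argmin(y), 0]
print(f"best learning rate ~ {10 ** best_log_lr:.1e}")
```

The appeal for non-experts is that the loop needs only a search range and a budget of trials; the surrogate decides where to sample next.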

Finally, we are developing methods to reduce batch effects and increase the quantitative accuracy of the mass spec readout by modeling the measured quantitative values to maximize expected metrics such as the correlation of intensity values across peptides within a protein group.
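Seer’s actual models are not described here, but the simplest form of the correction being discussed can be illustrated: if one MS run carries a multiplicative batch effect, median-centering each run in log space makes peptide intensities within a protein group comparable across runs. A toy sketch:

```python
import numpy as np

# Toy data: rows = peptides of one protein group, columns = MS runs;
# run 2 carries a roughly 2x multiplicative batch effect.
intensities = np.array([
    [1000.0, 2100.0,  980.0],
    [ 500.0, 1040.0,  490.0],
    [ 200.0,  430.0,  196.0],
])

log_i = np.log2(intensities)
# Median-center each run, then add back the grand median so values
# stay on the original log2 scale.
corrected = log_i - np.median(log_i, axis=0, keepdims=True) + np.median(log_i)

# After correction, every run shares the same median, so intensity
# differences reflect peptides rather than the batch.
print(np.round(np.median(corrected, axis=0), 3))
```

Real pipelines model far more structure than a per-run median, but the goal is the same: remove run-level artifacts so within-protein-group intensities correlate across samples.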

Hallucinations are a common issue with LLMs. What are some of the solutions to prevent or mitigate them?

LLMs are generative methods that are given a large corpus and trained to generate similar text. They capture the underlying statistical properties of the text they are trained on, from simple local properties, such as how often certain combinations of words (or tokens) appear together, to higher-level properties that emulate understanding of context and meaning.

However, LLMs are not primarily trained to be correct. Reinforcement learning with human feedback (RLHF) and other techniques help train them for desirable properties, including correctness, but are not fully successful. Given a prompt, LLMs will generate text that most closely resembles the statistical properties of the training data. Often, this text is also correct. For example, if asked “when was Alexander the Great born,” the correct answer is 356 BC (or BCE), and an LLM is likely to give that answer because in the training data Alexander the Great’s birth frequently appears with this value. However, when asked “when was Empress Reginella born,” a fictional character not present in the training corpus, the LLM is likely to hallucinate and invent a story of her birth. Similarly, when asked a question for which the LLM cannot retrieve a correct answer (either because a correct answer does not exist, or for other statistical reasons), it is likely to hallucinate and answer as if it knows. This creates hallucinations that are an obvious problem for serious applications, such as “how can such and such cancer be treated.”

There are no perfect solutions yet for hallucinations; they are endemic to the design of LLMs. One partial solution is proper prompting, such as asking the LLM to “think carefully, step by step,” and so on. This reduces the likelihood that the LLM will concoct stories. A more sophisticated approach being developed is the use of knowledge graphs. Knowledge graphs provide structured data: entities in a knowledge graph are connected to other entities in a predefined, logical manner. Constructing a knowledge graph for a given domain is of course a challenging task, but doable with a combination of automated and statistical methods and curation. With a built-in knowledge graph, LLMs can cross-check the statements they generate against the structured set of known facts, and can be constrained to not generate a statement that contradicts or is not supported by the knowledge graph.
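As a minimal sketch of this cross-checking, suppose generated statements are reduced to (subject, relation, object) triples; each triple can then be looked up in the graph and classified as supported, contradicted, or unverifiable. All entities, relations, and dates below are illustrative:

```python
# A tiny illustrative knowledge graph of (subject, relation, object) triples.
knowledge_graph = {
    ("Alexander the Great", "born", "356 BC"),
    ("Alexander the Great", "birthplace", "Pella"),
}

# Index facts by (subject, relation) so a generated claim can be looked up.
facts = {(s, r): o for s, r, o in knowledge_graph}

def check_claim(subject, relation, obj):
    """Classify a generated claim as supported, contradicted, or unverifiable."""
    known = facts.get((subject, relation))
    if known is None:
        return "unverifiable"   # not in the graph: flag rather than assert
    return "supported" if known == obj else "contradicted"

print(check_claim("Alexander the Great", "born", "356 BC"))   # supported
print(check_claim("Alexander the Great", "born", "323 BC"))   # contradicted
print(check_claim("Empress Reginella", "born", "400 BC"))     # unverifiable
```

A generation-time constraint would then suppress or regenerate any statement classified as contradicted, and hedge or flag the unverifiable ones; extracting reliable triples from free text is itself the hard part in practice.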

Because of the fundamental issue of hallucinations, and arguably because of their lack of sufficient reasoning and judgment abilities, LLMs today are powerful for retrieving, connecting, and distilling information, but cannot replace human experts in serious applications such as medical diagnosis or legal advice. Still, they can tremendously enhance the efficiency and capability of human experts in these domains.

Can you share your vision for a future where biology is steered by data rather than hypotheses?

The traditional hypothesis-driven approach, in which researchers find patterns, develop hypotheses, perform experiments or studies to test them, and then refine theories based on the data, is being supplanted by a new paradigm based on data-driven modeling.

In this emerging paradigm, researchers start with hypothesis-free, large-scale data generation. Then, they train a machine learning model, such as an LLM, with the objective of accurate reconstruction of occluded data, or strong regression or classification performance on a number of downstream tasks. Once the machine learning model can accurately predict the data, achieving fidelity comparable to the similarity between experimental replicates, researchers can interrogate the model to extract insight about the biological system and discern the underlying biological principles.
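The “reconstruction of occluded data” objective can be shown in miniature: hide a fraction of tokens, predict them from what remains, and score the loss only on the hidden positions. The “model” below is deliberately trivial (a marginal-frequency baseline over a hypothetical 20-letter amino acid alphabet), chosen only to make the objective concrete:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size = 20                                     # e.g. 20 amino acids
seqs = rng.integers(0, vocab_size, size=(100, 40))  # toy training "sequences"

# Occlude about 15% of tokens, the usual masked-modeling recipe.
mask = rng.random(seqs.shape) < 0.15

# A trivial "model": predict each token's marginal frequency,
# estimated from the visible (unmasked) positions.
visible = seqs[~mask]
probs = np.bincount(visible, minlength=vocab_size) / visible.size

# The training objective: cross-entropy on the occluded positions only.
loss = -np.mean(np.log(probs[seqs[mask]]))
print(f"masked cross-entropy: {loss:.3f}")
```

On uniform random data this baseline sits near log(20); a real model earns a lower loss only by learning sequence context, which is exactly the learned structure researchers later interrogate.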

LLMs are proving to be especially good at modeling biomolecular data, and are poised to fuel a shift from hypothesis-driven to data-driven biological discovery. This shift will become increasingly pronounced over the next 10 years and will allow accurate modeling of biomolecular systems at a granularity that goes well beyond human capability.

What is the potential impact on disease diagnosis and drug discovery?

I believe LLMs and generative AI will lead to significant changes in the life sciences industry. One area that can benefit greatly from LLMs is clinical diagnosis, specifically for rare, difficult-to-diagnose diseases and cancer subtypes. There are tremendous amounts of comprehensive patient information that we can tap into, from genomic profiles, treatment responses, medical records, and family history, to drive accurate and timely diagnosis. If we can find a way to compile all this data such that it is easily accessible, and not siloed by individual health organizations, we can dramatically improve diagnostic precision. This is not to imply that machine learning models, including LLMs, will be able to operate autonomously in diagnosis. Due to their technical limitations, in the foreseeable future they will not be autonomous, but instead will augment human experts. They will be powerful tools to help the doctor provide superbly informed assessments and diagnoses in a fraction of the time needed in the past, and to properly document and communicate their diagnoses to the patient as well as to the entire network of health providers connected through the machine learning system.

The industry is already leveraging machine learning for drug discovery and development, touting its ability to reduce costs and timelines compared with the traditional paradigm. LLMs further add to the available toolbox, and are providing excellent frameworks for modeling large-scale biomolecular data, including genomes, proteomes, functional genomic and epigenomic data, single-cell data, and more. In the foreseeable future, foundation LLMs will undoubtedly connect across all these data modalities and across large cohorts of individuals whose genomic, proteomic, and health information is collected. Such LLMs will aid in the generation of promising drug targets, identify likely pockets of activity of proteins associated with biological function and disease, or suggest pathways and more complex cellular functions that can be modulated in a specific way with small molecules or other drug modalities. We can also tap into LLMs to identify drug responders and non-responders based on genetic susceptibility, or to repurpose drugs for other disease indications. Many of today’s innovative AI-based drug discovery companies are undoubtedly already starting to think and develop in this direction, and we should expect to see the formation of additional companies, as well as public efforts, aimed at the deployment of LLMs in human health and drug discovery.

