Researchers from the University of Toronto Introduce scGPT: A Foundation Model for Single-Cell Biology based on Generative Pre-Trained Transformer Across a Repository of Over 33 Million Cells

Natural language processing and computer vision are just two of the fields where generative pre-trained models have succeeded remarkably. In particular, a viable strategy for constructing foundation models is to combine diverse large-scale datasets with pre-trained transformers. This study investigates the feasibility of foundation models for advancing research in cellular biology and genetics by drawing connections between language and biological constructs (texts are composed of words and, analogously, cells are characterized by genes). Drawing on the growing body of single-cell sequencing data, the researchers built scGPT, a foundation model for single-cell biology based on a generative pre-trained transformer spanning a repository of over 33 million cells. Results show that scGPT, a pre-trained generative transformer, efficiently extracts key biological insights related to genes and cells. The model can be adapted to a variety of applications through transfer learning; these applications include gene network inference, genetic perturbation prediction, and multi-batch integration. View the scGPT source code.

By facilitating detailed characterization of individual cell types and enhancing our knowledge of disease pathogenesis, single-cell RNA sequencing (scRNA-seq) paves the way for the investigation of cellular heterogeneity, the tracking of lineages, the elucidation of pathogenic mechanisms, and the development of patient-specific therapeutic approaches.

Given the exponential growth of sequencing data, it is urgent to create methods that can effectively leverage, enhance, and adapt to these developments. The generative pre-training of foundation models is an efficient strategy for overcoming this difficulty. Learning from massive datasets, generative pre-training has recently seen extraordinary success in various domains; popular uses include natural language generation (NLG) and computer vision. These baseline models, including DALL-E 2 and GPT-4, are built on the principle of pre-training transformers on large-scale heterogeneous datasets so that they can be easily adapted to specific downstream tasks and scenarios. Moreover, these pre-trained generative models consistently outperform their custom-trained counterparts.


The researchers take cues from the self-supervised pre-training methods of NLG to improve the modeling of massive amounts of single-cell sequencing data. The self-attention transformer has proven to be a useful and efficient framework for modeling input text tokens.

Using generative pre-training on more than 33 million cells, the researchers offer the first attempt to build a single-cell foundation model, dubbed scGPT. They present novel approaches for pre-training on massive amounts of single-cell omic data, addressing both the methodological and engineering issues that arise. They employ an in-memory data structure with fast access to store hundreds of datasets, allowing them to handle massive amounts of data. They modify the transformer architecture to learn cell and gene representations concurrently and construct a unified generative pre-training approach tailored to non-sequential omic data. To enable use of the pre-trained model in various downstream tasks, they also supply standard pipelines with task-specific objectives for model fine-tuning.
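The idea of learning gene and cell representations concurrently can be illustrated with a minimal sketch. This is not the official scGPT implementation: the class name, layer sizes, the use of binned expression values, and the `<cls>` summary token are illustrative assumptions about how such a model could be wired up in PyTorch.

```python
# Minimal sketch (not the official scGPT code) of a transformer that embeds genes
# and cells jointly: each position carries a gene-identity token plus a binned
# expression value, and a special <cls> token summarizes the cell.
import torch
import torch.nn as nn

class CellGeneTransformer(nn.Module):
    def __init__(self, n_genes: int, n_bins: int = 51, d_model: int = 512,
                 n_heads: int = 8, n_layers: int = 12):
        super().__init__()
        self.gene_emb = nn.Embedding(n_genes + 1, d_model)   # +1 for a <cls> token
        self.expr_emb = nn.Embedding(n_bins, d_model)         # binned expression values
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, gene_ids: torch.Tensor, expr_bins: torch.Tensor):
        # gene_ids, expr_bins: (batch, seq_len) integer tensors; position 0 is the <cls> token
        x = self.gene_emb(gene_ids) + self.expr_emb(expr_bins)
        h = self.encoder(x)
        cell_embedding = h[:, 0]      # the <cls> position summarizes the whole cell
        gene_embeddings = h[:, 1:]    # contextual per-gene representations
        return cell_embedding, gene_embeddings
```

The cell embedding can then feed cell-level objectives such as cell type annotation or batch integration, while the per-gene representations support gene-level objectives such as expression prediction.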

Through these three components, the scGPT model highlights the transformative potential of the single-cell foundation-model concept. First, scGPT is the first large-scale generative foundation model that supports transfer learning to numerous downstream tasks. The researchers demonstrate the efficacy of the "pre-training universally, fine-tuning on demand" approach as a generalist solution for computational applications in single-cell omics by achieving state-of-the-art performance on cell type annotation, genetic perturbation prediction, batch correction, and multi-omic integration.

Notably, scGPT is the only base model capable of incorporating scATAC-seq data and other single-cell omics. Second, scGPT reveals important biological insights into condition-specific gene-gene interactions by comparing gene embeddings and attention weights between the fine-tuned and raw pre-trained models. Third, the results show a scaling law: better pre-trained embeddings and better performance on downstream tasks result from using more data in the pre-training phase. This discovery underlines the promising possibility that foundation models can steadily improve as more and more sequencing data becomes available to the research community. In light of these results, the authors hypothesize that pre-trained foundation models will significantly increase our knowledge of cell biology and lay the groundwork for future advancements in the field. Making the scGPT models and workflow publicly available allows research in these and related fields to be strengthened and accelerated.

scGPT is a novel generative pre-trained foundation model that uses pre-trained transformers to make sense of a large volume of single-cell data, as described by the study's authors. Self-supervised pre-training has proven effective in language models such as ChatGPT and GPT-4. In the study of single cells, the researchers used the same technique to decipher intricate biological connections. To better model different facets of cellular processes, scGPT uses transformers to learn gene and cell embeddings concurrently. scGPT captures gene-to-gene interactions at the single-cell level, adding a new degree of interpretability by using the attention mechanism of transformers.
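As a rough illustration of how attention weights could be turned into gene-gene interaction scores, the sketch below aggregates per-head attention maps into a symmetric score matrix. The tensor shape and the convention that position 0 is a `<cls>` token follow the sketch above; this is an assumed post-processing step, not the paper's exact procedure.

```python
# Sketch of turning per-head attention maps into gene-gene "interaction" scores.
import torch

def gene_gene_attention(attn: torch.Tensor) -> torch.Tensor:
    """attn: (batch, heads, seq_len, seq_len) attention maps from one transformer
    layer, with position 0 reserved for the <cls> token as in the sketch above."""
    gene_attn = attn[:, :, 1:, 1:]          # drop the <cls> row and column
    scores = gene_attn.mean(dim=(0, 1))     # average over cells and attention heads
    return 0.5 * (scores + scores.T)        # symmetrize into an interaction score matrix

# Example: rank the ten strongest putative partners of the gene at index 0
# scores = gene_gene_attention(attn)
# top_partners = torch.topk(scores[0], k=10).indices
```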

The researchers used extensive studies in zero-shot and fine-tuning scenarios to demonstrate the value of pre-training. The trained model already serves as a feature extractor for any dataset: it shows impressive extrapolation ability, producing meaningful cell clustering in zero-shot studies. In addition, there is a high degree of congruence between the gene networks learned by scGPT and previously established functional relationships. Because the model captures gene-gene interactions and reflects known biological knowledge effectively, it can be expected to surface relevant discoveries in single-cell biology. Furthermore, with some fine-tuning, the knowledge learned by the pre-trained model can be applied to various downstream tasks. The fine-tuned scGPT model consistently beats models trained from scratch on tasks like cell type annotation, multi-batch integration, and multi-omic integration. This shows how the pre-trained model benefits downstream tasks by improving accuracy and biological relevance. Overall, the tests reveal the usefulness of pre-training in scGPT, demonstrating its capacity to generalize, capture gene networks, and enhance performance on downstream tasks via transfer learning.
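A zero-shot feature-extraction workflow of the kind described above might look as follows. The `extract_cell_embeddings` helper is a hypothetical stand-in (here it returns random features so the snippet runs); in practice it would run the frozen pre-trained model. The scanpy calls for neighbor-graph construction, Leiden clustering, and UMAP are standard.

```python
# Sketch of zero-shot use of a frozen pre-trained model as a feature extractor,
# followed by clustering of the resulting cell embeddings with scanpy.
import numpy as np
import scanpy as sc

def extract_cell_embeddings(adata):
    """Hypothetical stand-in for running the frozen pre-trained model;
    random features are used here only so that the snippet is runnable."""
    return np.random.default_rng(0).normal(size=(adata.n_obs, 512))

adata = sc.read_h5ad("my_dataset.h5ad")                  # any scRNA-seq dataset in AnnData format
adata.obsm["X_scGPT"] = extract_cell_embeddings(adata)   # (n_cells, d_model) cell embeddings

sc.pp.neighbors(adata, use_rep="X_scGPT")                # neighbor graph on the embeddings
sc.tl.leiden(adata, key_added="scgpt_clusters")          # zero-shot clustering
sc.tl.umap(adata)
sc.pl.umap(adata, color="scgpt_clusters")
```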

Key Features

  • The generalist strategy allows integrated multi-omic analysis and perturbation prediction to be performed with a single model in a single-cell study.
  • Condition-specific gene-gene interactions can be discovered using learned attention weights and gene embeddings.
  • The study identified a scaling law demonstrating the continual improvement of model performance with increasing amounts of pre-training data.
  • Many pre-trained foundation models for various solid organs, along with a comprehensive pan-cancer model, are now available in the scGPT model zoo (see GitHub). Start digging into your data from the best possible starting-point model.

Pre-training is expected to happen on a much larger dataset that includes multi-omic data, spatial omics, and a wide range of disease states. The model can learn causal linkages and estimate how genes and cells respond over time if perturbation and temporal data are included in the pre-training phase. To better comprehend and interpret what the pre-trained model has learned, validating the model on a broader set of biologically significant tasks would be ideal. Moreover, the authors aim to investigate context-aware learning for single-cell data: the pre-trained model should grasp and adapt to new tasks and environments without additional fine-tuning, in a zero-shot configuration. They can improve scGPT's utility and applicability in numerous study contexts by teaching it to understand the subtleties and unique needs of different studies. They expect the pre-training paradigm to be easily implemented in single-cell research and to lay the groundwork for capitalizing on the accumulated knowledge in the rapidly expanding cell atlases.


Check out the Paper and GitHub link.





Dhanshree Shenwai is a Computer Science Engineer with solid experience in FinTech companies covering the Financial, Cards & Payments, and Banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements in today's evolving world, making everyone's life easier.


