The rapid accumulation of biological big data, chiefly omics data from genomics, transcriptomics, and proteomics, together with the swift advance of artificial intelligence technologies led by deep learning, has given rise to many kinds of biological large models. Large models are characterized by complex deep-learning architectures, very large parameter counts and computational demands, and massive pre-training corpora. The types of pre-training data and the number of parameters determine, to a certain extent, how capable a model is, while different model architectures support different classes of downstream tasks. Over the past two years, a variety of general-purpose and specialized large models have emerged for application scenarios that include the analysis and mining of omics data such as biological sequences (DNA, RNA, and protein) and single-cell expression atlases, macromolecular structure prediction, novel drug design, and the elucidation of functional mechanisms. These models have shown great potential in biomedical research and translational applications. Taking into account the characteristics of different types of biological data and the needs of research and application, this paper outlines the features of biological data and the technical methods used to train biological large models on them. It then reviews the progress of existing large models in biomedical research and in disease diagnosis and treatment, offering new ideas for improving the capabilities of biological large models and broadening their range of application.