SHI Jin-Long, ZHANG Zhe, DAI An-Lin, LIN Kai, HE Kun-Lun
The rapid accumulation of biological big data, spanning genomics, transcriptomics, proteomics, and other omics, together with the swift advancement of artificial intelligence technologies, notably deep learning, has given rise to a variety of biological large models. These models are characterized by complex deep-learning architectures, massive parameter counts, high computational power requirements, and vast amounts of pre-training data; their capabilities are largely dictated by the types and volumes of pre-training data, while different model architectures support different downstream tasks. Over the past two years, a range of general-purpose and specialized large models has emerged across multiple application scenarios, including the analysis and mining of DNA, RNA, and protein sequences, single-cell expression atlases, structure prediction of biomacromolecules, de novo drug design, and the interpretation of biological mechanisms. These models have demonstrated significant potential in biomedical research and translational applications. This paper provides an overview of the characteristics of biological data and the technical methods used to train biological large models, taking into account the distinct features and research needs of different data types. It further reviews the progress of existing models in biomedical research and in disease diagnosis and treatment, offering new insights for enhancing model capabilities and expanding their application scope.