As genome-wide experimental data on protein essentiality accumulate and protein-protein interaction (PPI) data become abundant, the study of essential proteins at the network level is attracting growing attention. However, currently available PPI networks are incomplete and inaccurate for every species, so identifying essential proteins from network topology alone remains challenging. In particular, the identification of essential proteins across distantly related species has not been studied in depth. Taking PPI networks as its starting point, this project will analyze the complementarity between PPI data and multi-omics data from genomics, proteomics, transcriptomics, and metabolomics, and will mine combined features that integrate multi-omics information and represent protein essentiality more strongly. By comparing the shared and species-specific characteristics of essential proteins in multiple model organisms, we will explore the mechanism by which protein essentiality transfers between species, and how the representational capacity of both the multi-omics-derived features and the combined features changes as species evolve. On this basis, we will study methods based on transfer learning and meta-ensemble learning for identifying essential proteins across distantly related species. These methods are intended to make accurate identification of essential proteins possible in under-studied organisms and in organisms that lack experimental validation. The results will provide a theoretical foundation for finding new drug targets and will accelerate the design and development of new drugs.
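Taking the protein-protein interaction network as its starting point, this project studied methods for identifying essential proteins by integrating multi-omics information. We obtained the following results: 1) we proposed a framework that integrates PPI networks with gene expression profiles, which substantially improves the ability of topological features to identify essential proteins; 2) we proposed an essential protein identification method that integrates PPI networks, gene expression profiles, and homology information, and it outperforms a number of other algorithms; 3) we examined how well features extracted from multi-omics data, including PPI networks, homology, protein complexes, protein domains, and subcellular localization, identify human essential proteins, and built a random-forest-based identification model together with a feature selection method; 4) we studied how the representational capacity of the features extracted from each omics source changes with species evolution, and used this to guide feature selection for machine-learning-based identification models. The methods we propose are applicable to different species. We found that neither a single feature nor a centrality measure that integrates multi-omics information is sufficient to identify all essential proteins within a species, so it is important to study machine-learning-based models that can integrate multi-omics features effectively. In addition, the ability of a single feature to identify essential proteins is not proportional to its importance in building the classification model, so features must be selected comprehensively to build a high-accuracy identification model. A considerable fraction of human essential genes are essential only under specific conditions, and such conditionally essential genes have been found to be disease-causing genes for a variety of human diseases. The results of this project can greatly advance the identification of human essential proteins and disease genes and provide a basis for discovering new drug targets.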
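To make the idea of combining PPI topology with gene expression profiles concrete, here is a minimal sketch. It is not the project's framework, whose details are not given here. It assumes one simple weighting scheme: each interaction is weighted by the positive Pearson correlation of the two proteins' expression profiles, and a protein's score is the sum of the weights of its edges. The function names (`pearson`, `expression_weighted_degree`, `rank_proteins`) and the toy data are hypothetical and chosen only for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def expression_weighted_degree(edges, expression):
    """Score each protein by the summed co-expression weight of its interactions.

    Assumption for this sketch: an edge counts only as much as its two
    endpoints are positively co-expressed; negative or undefined
    correlations contribute nothing.
    """
    scores = {}
    for u, v in edges:
        scores.setdefault(u, 0.0)
        scores.setdefault(v, 0.0)
        if u in expression and v in expression:
            weight = max(0.0, pearson(expression[u], expression[v]))
        else:
            weight = 0.0  # no expression data: the edge is ignored
        scores[u] += weight
        scores[v] += weight
    return scores


def rank_proteins(scores):
    """Sort proteins by descending score, breaking ties by name."""
    return sorted(scores, key=lambda p: (-scores[p], p))


# Toy PPI network and expression profiles (hypothetical data).
edges = [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("D", "E")]
expression = {
    "A": [1, 2, 3, 4],
    "B": [2, 4, 6, 8],
    "C": [1, 3, 2, 5],
    "D": [4, 3, 2, 1],
    "E": [1, 1, 1, 1],
}

scores = expression_weighted_degree(edges, expression)
ranking = rank_proteins(scores)
print(ranking)
```

In this toy network, C and D both have two interaction partners, yet D scores zero because neither of its interactions is supported by positive co-expression. That is the sense in which expression data can refine a purely topological ranking.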
Data last updated: 2023-05-31