m6A-seq, which combines immunoprecipitation with next-generation sequencing, maps N6-methyladenosine (m6A) across the transcriptome. This proposal addresses the low quality of existing m6A-seq data analysis methods. The research comprises three parts: (1) Because signal (m6A fold enrichment) is confounded with noise in the raw data, the noise level of the raw data will be quantified, and a denoising model will be built with a long short-term memory network combined with a convolutional neural network. (2) Because existing methods cannot specifically recognize the read-enrichment regions (peaks) of m6A, the key features that distinguish true m6A peaks from false-positive peaks will be identified, and an m6A-specific peak recognition model will be built using the deep forest method. (3) Because m6A-seq cannot locate individual m6A sites, the correspondence between peaks and m6A sites will be clarified, and an m6A site localization model will be built using an ensemble multiple-instance forest (MIForests) method. Together these will yield a model for improving the quality of m6A-seq data analysis, laying the foundation for subsequent studies of the role of m6A in the onset and progression of complex diseases.
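As an illustration of the "signal" referred to in part (1), the sketch below computes a per-window fold enrichment of immunoprecipitated (IP) reads over input reads. The abstract does not define the formula, so the library-size scaling, the pseudocount, and the function name `fold_enrichment` are assumptions made for this example, not the project's method.

```python
def fold_enrichment(ip_counts, input_counts, ip_total, input_total, pseudocount=1.0):
    """Return IP-over-input fold enrichment for each window.

    Assumed definition (not taken from the abstract): read counts are
    smoothed with a pseudocount, scaled by library size, and divided.
    """
    if len(ip_counts) != len(input_counts):
        raise ValueError("IP and input must cover the same windows")
    scores = []
    for ip, inp in zip(ip_counts, input_counts):
        ip_rate = (ip + pseudocount) / ip_total          # scaled IP coverage
        input_rate = (inp + pseudocount) / input_total   # scaled input coverage
        scores.append(ip_rate / input_rate)
    return scores


# Two windows with equal library sizes: the first is enriched, the second is not.
print(fold_enrichment([9, 1], [1, 1], ip_total=100, input_total=100))
```

Windows with low input coverage make this ratio unstable, which is one way noise can be mistaken for enrichment in raw data.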
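m6A-seq is widely used to map m6A modifications, but it has a high false-positive rate and cannot determine the exact positions or number of m6A sites. This project therefore addressed three aspects: differential analysis of peaks, removal of false-positive peaks, and localization of m6A sites within peaks. First, matched m6A-seq, miCLIP-seq, and YTHDF2 RIP-seq data were collected. Analysis showed that both m6A and YTHDF2 are most strongly enriched in exons other than the first exon. YTHDF2-binding regions were also found to lie relatively close to transcription start sites, suggesting that YTHDF2 may be closely related to transcription. On this basis the m6ABRP software was built; its model reaches an AUC of 0.920 and accurately predicts m6A-YTHDF2 binding regions. Second, to reduce false-positive peaks and further distinguish peaks produced by different epigenetic modifications, site data for m6Am, m7G, and f5C were collected to construct training datasets. The key features of each modification were then mined and the software tools m6Aminer, f5Cfinder, and m7GPredictor were built, with model AUCs of 0.913, 0.851, and 0.945 respectively; these tools can be used for modification-specific annotation of peaks. Finally, to determine the number and positions of m6A sites in a peak, each peak was treated as a positive bag containing at least one positive instance, and a bag-level classifier was built on this basis to identify the most reliable m6A instances within peaks. To verify its reliability, the model was validated with single-base-resolution m6A site data; the results show that it accurately identifies m6A sites from low-resolution peak data. The models proposed in this project effectively reduce false positives in m6A-seq data and improve the precision of m6A site localization, laying the foundation for further study of the role of m6A in biological processes such as neural development, immune response, DNA damage response, tumorigenesis and tumor progression, and plant stress response.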
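The bag formulation described above can be sketched as follows. This is a minimal illustration of the standard multiple-instance assumption (a bag is positive if at least one instance is positive, so the bag score is the maximum instance score), not the project's MIForests implementation. The instance scorer `toy_score`, the feature vectors, the peak identifiers, and the 0.5 threshold are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# One instance = (candidate position, feature vector); one bag = one peak.
Instance = Tuple[int, Sequence[float]]


def most_reliable_instance(bag: List[Instance],
                           score: Callable[[Sequence[float]], float]) -> Tuple[int, float]:
    """Return (position, score) of the highest-scoring instance in a bag."""
    if not bag:
        raise ValueError("a bag must contain at least one instance")
    scored = [(score(features), pos) for pos, features in bag]
    best_score, best_pos = max(scored)
    return best_pos, best_score


def classify_bags(bags: Dict[str, List[Instance]],
                  score: Callable[[Sequence[float]], float],
                  threshold: float = 0.5) -> Dict[str, Tuple[bool, int, float]]:
    """Label each peak by its best instance: (is_positive, position, score)."""
    results = {}
    for peak_id, bag in bags.items():
        pos, best = most_reliable_instance(bag, score)
        results[peak_id] = (best >= threshold, pos, best)
    return results


def toy_score(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a trained instance scorer: mean of the features."""
    return sum(features) / len(features)


bags = {
    "peak1": [(100, [0.1, 0.2]), (120, [0.9, 0.7])],
    "peak2": [(300, [0.2, 0.1])],
}
print(classify_bags(bags, toy_score))
```

Taking the maximum over instances gives both a bag-level label for the peak and a candidate site position from the same score, which is why this formulation suits resolving low-resolution peaks into individual sites.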
Data last updated: 2023-05-31