Current speech technology works well for certain controlled tasks and domains, but it has limited ability to acquire knowledge about people or situations, to adapt, and to generalize. In reality, the quantity of audio and text data available is vast and growing rapidly. Sources range from traditional broadcast media, such as television, radio and their websites, to content produced by individuals. The quality of this audio, its editing, and the associated metadata (such as orthographic transcriptions, speaker identity, and location) is highly variable and poor compared with traditional speech recognition databases. These factors significantly degrade the performance of both speaker recognition and speech recognition. We therefore start from such natural, non-homogeneous data, e.g. Multi-Genre Broadcast data (the MGB challenge), and focus on the integrated theoretical development of new joint models for speaker diarization and speech recognition. These deep-learning-based models will allow us to: 1) build speaker diarization that can precisely detect “who spoke what, when, and how”; 2) embrace human-like perception by using articulatory features alongside acoustic features; 3) incorporate knowledge about the speakers, the environment, the communication context and the task, and learn and adapt from real-world data in an online, unsupervised manner; 4) design decoding with situation awareness to reduce uncertainty before a single word is said; and 5) use feedback to learn from mistakes, with instantaneous and continuous learning in specific situations. Combining these theoretical investigations with state-of-the-art methods in practical applications will bring a naturalness to speech technology that is not currently attainable. The project aims to provide a feasible solution to multi-channel, multi-speaker, multi-lingual speaker diarization and speech recognition, which is a central direction of natural speech processing.
Addressing the variability, complexity and multi-layered nature of multi-source heterogeneous audio data, this project studies target speaker recognition and speech-to-text transcription in complex environments, focusing on the bottlenecks that limit speaker and speech recognition performance: far-field noise modelling, adaptive learning, decoding strategies and error feedback. The specific research content includes: exploring accurate speaker segmentation and clustering algorithms for complex audio streams, to provide effective speech segmentation and high-quality speaker recognition; using the latest deep learning models to denoise noisy speech and to jointly optimize the denoising and recognition models; exploring modelling architectures based on human multimodal perception to improve the far-field noise robustness of the recognition system; and exploring unsupervised adaptive learning, proposing new and efficient decoding schemes for speech recognition, integrating error-detection and error-tolerance mechanisms for spoken-language environments, and achieving fast training and recognition on massive data. Targeting the internationally leading Multi-Genre Broadcast (MGB) task on massive multi-source data, the project will establish a complete framework for target speaker recognition and transcription, and deliver fast and effective processing of cross-channel, multi-lingual, multi-speaker speech data; this is an important direction in speech information processing and also meets major national needs in counter-terrorism, stability maintenance and public security.
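One common way to realise the speaker segmentation and clustering step described above is to map short audio segments to speaker embeddings and group them by agglomerative clustering. The following is a minimal sketch of that generic recipe, not the project's specific algorithm: `segment_embeddings` is a hypothetical placeholder for a neural embedding extractor (e.g. an x-vector or d-vector network), and the distance threshold is purely illustrative.

```python
# Minimal sketch: embedding-based speaker clustering for diarization.
# `segment_embeddings` stands in for any neural speaker-embedding extractor;
# here it only returns random vectors so the script is self-contained.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def segment_embeddings(n_segments: int, dim: int = 128) -> np.ndarray:
    """Placeholder for a neural speaker-embedding extractor (hypothetical)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(n_segments, dim))

def cluster_speakers(embeddings: np.ndarray, threshold: float = 0.6) -> np.ndarray:
    """Group segment embeddings into speakers by average-link clustering."""
    # Length-normalise so cosine distance behaves consistently.
    embeddings = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    dists = pdist(embeddings, metric="cosine")
    tree = linkage(dists, method="average")
    # Segments whose cosine distance stays below the threshold share a label.
    return fcluster(tree, t=threshold, criterion="distance")

if __name__ == "__main__":
    emb = segment_embeddings(20)
    labels = cluster_speakers(emb)
    print("speaker label per segment:", labels)
```

In a real diarization pipeline the segments would come from voice activity detection or a change-point detector, and the clustering threshold would be tuned on held-out labelled data.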
As planned, this project carried out research on target speaker recognition and speech-to-text transcription in complex environments, studying multi-lingual continuous speech recognition (including Mandarin Chinese and English) and multi-target-speaker recognition on cross-channel, multi-lingual, multi-speaker speech data. Specifically: 1) for accurate speaker segmentation and clustering of complex data, we proposed speaker representation and recognition based on deep embedding learning, robust single-speaker DOA estimation, and detection of multi-speaker overlapped speech; 2) for speaker recognition, we proposed a robust target speaker verification method based on multimodal information fusion and an end-to-end framework for large-scale speaker recognition, improving system accuracy; 3) for noise-robust recognition of multi-source heterogeneous speech data, we proposed an innovative bimodal noise-robust modelling method based on the characteristics of human auditory perception and speech production, applied to speech from complex environments (far-field, non-stationary noise, etc.); 4) for adaptation, we proposed an unsupervised deep-learning-based method (BLHUC, Bayesian learning of hidden unit contributions) and strategies for adaptive learning across diverse acoustic conditions (see the sketch after this summary); 5) for recognition and evaluation, we proposed noise-robust speech recognition methods based on very deep convolutional residual networks and adaptation.
The project produced a series of results in the separation and denoising of complex acoustic data, large-scale continuous speech recognition, and speaker recognition: 89 papers were published (22 SCI-indexed, 67 EI-indexed), 11 Chinese invention patent applications were filed, and 4 invention patents were granted. The proposed computational modelling of speech production was published in flagship journals of the speech field such as IEEE Trans. on ASLP, Speech Communication, PLoS One and Frontiers, as well as at influential international conferences such as ICASSP and InterSpeech. The proposed speaker adaptation technique based on learning hidden unit contributions (LHUC) was published at ICASSP 2019 and received the Best Student Paper Award; the proposed method for separating and recognising overlapped multi-speaker speech was published at IEEE ASRU 2019 and received the Best Paper Award.
The project team applied the proposed algorithms to forensic voice identification for public security and built a very large-scale speaker recognition platform, including a million-speaker voiceprint database with about 30,000 hours of speech in total, achieving a top-5 identification accuracy of 97.45%.
The principal investigator and key members served on the organizing committees of international conferences such as InterSpeech 2020 and presented the project's results at international speech conferences on many occasions. The project trained 10 master's students and 5 doctoral students. The principal investigator was selected for the fifth cohort of the national Ten Thousand Talents Program, and Associate Professor Qian Yanmin (钱彦旻) of the collaborating institution Shanghai Jiao Tong University received a 2021 NSFC Excellent Young Scientists Fund.
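The LHUC/BLHUC adaptation mentioned in item 4 rescales each hidden unit of a trained acoustic model by a small set of speaker-dependent parameters, which are the only weights updated during adaptation. Below is a minimal PyTorch sketch of plain LHUC-style scaling under assumed toy dimensions (40-dimensional features, 500 output classes); the Bayesian treatment of the scaling parameters used in BLHUC and the project's exact model are not reproduced here.

```python
# Minimal sketch of LHUC-style speaker adaptation (toy dimensions, not the
# project's actual model): each speaker gets a small vector r that rescales
# the hidden units of a frozen acoustic model; only r is updated.

import torch
import torch.nn as nn

class LHUCLayer(nn.Module):
    """Hidden layer whose outputs are rescaled by speaker-specific amplitudes."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)
        # One free parameter per hidden unit; 2*sigmoid keeps the scale in (0, 2).
        self.r = nn.Parameter(torch.zeros(out_dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = torch.relu(self.linear(x))
        return 2.0 * torch.sigmoid(self.r) * h

# Toy acoustic model: two LHUC layers followed by an output classifier.
model = nn.Sequential(LHUCLayer(40, 256), LHUCLayer(256, 256), nn.Linear(256, 500))

# Speaker adaptation: freeze everything except the LHUC scaling vectors.
for name, p in model.named_parameters():
    p.requires_grad = name.endswith(".r")

optimizer = torch.optim.SGD([p for p in model.parameters() if p.requires_grad], lr=0.1)
features = torch.randn(32, 40)            # a batch of acoustic frames (random stand-in)
targets = torch.randint(0, 500, (32,))    # (pseudo-)labels for adaptation
loss = nn.functional.cross_entropy(model(features), targets)
loss.backward()
optimizer.step()
```

Because only a few hundred parameters per speaker are updated, this style of adaptation can use small amounts of automatically labelled data, which matches the unsupervised adaptation setting described in the summary above.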