In recent years, data annotation tasks have become easy to complete by ordinary Internet users through crowdsourcing platforms. Because labeling quality varies from one crowdsourced labeler to another, supervised learning from crowdsourced labeled data is challenging. This proposal addresses ground truth inference and supervised learning for crowdsourcing from a machine learning perspective, and aims to provide new theories and methods for building supervised classification systems from crowdsourced labeled data. First, starting from classic statistical query learning theory and taking the specific characteristics of crowdsourcing into account, we study the relationships among sample features, classifiers, labeling quality, the performance of ground truth inference algorithms, and the quality of learned models, and establish a set of fundamental theories that can guide ground truth inference and model training. Second, to improve the quality of integrated labels, we study new multi-class ground truth inference algorithms based on the fusion of concept-level and physical-level features of examples, as well as new methods for correcting examples whose integrated labels are wrong. Third, to improve the performance of learned models under an active learning paradigm, we study finer-grained sampling strategies, label optimization methods, and labeler selection strategies based on temporal modeling of labeling quality. Finally, we build a prototype system for ground truth inference and supervised classification on crowdsourced data to verify the practical value of the research outcomes.
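In recent years, data annotation tasks have become easy to complete by ordinary Internet users through crowdsourcing platforms. Because the labeling quality of crowdsourced labelers is uneven, training supervised learning models on crowdsourced labeled data is challenging. This project studies truth inference and supervised classification in machine learning from crowdsourced labeled data, and provides new theories and methods for building supervised classification systems on such data. First, starting from classic statistical query learning theory and taking the characteristics of crowdsourced labeling into account, it studies, under budget constraints, the relationships among sample features, classifiers, and labeler quality on one side and truth inference performance and learned-model quality on the other, and establishes fundamental theories to guide truth inference and model training. Second, it studies multi-class truth inference algorithms based on the fusion of "concept-level" and "physical-level" sample features, together with label correction methods for "integrated mislabeled" samples, to improve the quality of integrated labels. Third, it studies finer-grained sample selection strategies, label optimization methods, and labeler selection strategies based on temporal modeling of labeling quality under the active learning paradigm, to improve the performance of learned models. Finally, it builds a prototype system for truth inference and supervised classification on crowdsourced labels to verify the practical value of the research outcomes.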
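As a point of reference for the term "integrated label", the simplest truth inference baseline is plain majority voting over the labels each example receives. The sketch below is a minimal illustration under stated assumptions, not one of the project's algorithms: the function name `majority_vote`, the dictionary input format, and the tie-breaking rule are all hypothetical choices made for this example.

```python
from collections import Counter


def majority_vote(labels_per_example):
    """Return one integrated label per example by plain majority voting.

    labels_per_example maps an example id to the list of crowd labels it
    received. Ties go to the label that sorts first; this rule is an
    arbitrary assumption made only to keep the result deterministic.
    """
    integrated = {}
    for example_id, labels in labels_per_example.items():
        counts = Counter(labels)
        top = max(counts.values())
        # Among the labels that reached the top count, pick the smallest.
        integrated[example_id] = min(
            label for label, count in counts.items() if count == top
        )
    return integrated
```

Majority voting treats every labeler as equally reliable, which is exactly the assumption that uneven labeling quality breaks; inference methods that model labeler quality are meant to improve on it.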
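The uncertainty of the crowdsourcing environment makes machine learning from crowdsourced labeled data challenging. The project focused on key problems in truth inference and supervised classification for crowdsourced labels. First, it studied truth inference algorithms for crowdsourced labels. For biased labeling, it proposed an adaptive weighted majority voting inference algorithm that balances labelers' voting weights on the two classes of samples. For multi-class multi-label tasks, it proposed an inference algorithm based on a mixture of multinoulli distributions to discover and exploit correlations among labels. For sample and label sparsity, it proposed a robust inference model that uses a single parameter to model labeler quality and sample difficulty. These algorithms significantly improved the accuracy of truth inference. Second, it studied label integration methods based on label noise correction. It proposed an iterative bi-layer clustering label integration algorithm, which performs cross-clustering analysis on concept-level and physical-level features to find and correct mislabels among the concept-level integrated labels. It also proposed a label integration algorithm based on model-predicted label noise correction, in which an ensemble learning model built from samples with high label quality finds and corrects mislabels among samples with low label quality. Third, it studied fine-grained methods for learning predictive models from crowdsourced labeled data. It proposed four methods for utilizing noisy crowdsourced labels and an ensemble learning algorithm based on sample duplication, which improved the generalization performance of predictive models, and it proposed three active learning sample selection strategies, which reduced labeling cost. Finally, the core algorithms and data of the project's prototype system were released to the research community as open-source software. The research outcomes advance human-machine collaborative artificial intelligence and have broad application prospects.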
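To illustrate what balancing voting weights between two classes can mean, the following sketch applies a class-dependent weight to each vote in a binary task. It is only an illustration under stated assumptions: the function `weighted_binary_vote` and its parameters `w_pos` and `w_neg` are hypothetical, and the way the project's adaptive algorithm actually chooses its weights is not described in this summary.

```python
def weighted_binary_vote(votes, w_pos=1.0, w_neg=1.0):
    """Integrate binary crowd labels (1 = positive, 0 = negative).

    Each positive vote counts w_pos and each negative vote counts w_neg.
    With equal weights this reduces to plain majority voting. The weights
    are supplied by the caller; no adaptive rule is implemented here.
    Ties are resolved as negative, an arbitrary assumption.
    """
    positive_mass = w_pos * sum(1 for v in votes if v == 1)
    negative_mass = w_neg * sum(1 for v in votes if v == 0)
    return 1 if positive_mass > negative_mass else 0
```

If labelers tend to under-report the positive class, a single positive vote among three may still be the more credible signal: with `votes = [1, 0, 0]`, equal weights yield the negative label, while `w_pos=3.0` yields the positive one.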
Data last updated: 2023-05-31
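A study of theme park evaluation based on public sentiment orientation: the case of Volga Manor in Harbin
Application of collaborative-representation-based graph embedding discriminant analysis to face recognition
An improved multi-objective sine cosine optimization algorithm
Classification-based prediction of bus passenger flow with a multi-source data-driven CNN-GRU model
An unsupervised domain adaptation method for workpiece surface defects
Mechanism of immune regulation mediated by TGF-β1/Treg-related cytokines in the osteogenic differentiation of bone marrow mesenchymal stem cells in an inflammatory microenvironment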
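Research on task management methods for crowdsourced semantic annotation of massive data
Research on multi-label learning based on crowdsourced annotation
Research on truth inference and model acquisition for multi-class incompatible annotations
Research on crowdsourced active learning methods for multi-class image classification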