Image sentence annotation means annotating an image with a group of sentences that carry rich combined semantic information (CSI). It is a new research challenge that spans computer vision (CV) and natural language generation (NLG). Because sentences carry CSI, they extend and broaden the semantic expressiveness of traditional image annotation, and they can describe an image's key content more accurately and more completely than single words. However, current work still suffers from two key problems: inaccurate measurement of the semantic relevance between words, and heavy noise interference. This project presents improvements for both. On the one hand, noisy semantic correlations between images and words are cut off using a sparse bipartite graph (SBG) model that describes the key semantic correlations between images and words. After a random walk on the SBG, the key words that accurately describe an image's content are collected for sentence generation. Our sentences are built from these words, so the heavy noise interference is suppressed effectively. On the other hand, the multimodal distributional semantics (MDS) shared between heterogeneous media are mined to strengthen the descriptive power of traditional word embeddings (WE). A new kind of word embedding, the multimodal distributional word embedding (MDWE), is generated by a modified distributional semantic model (DSM). The model absorbs mid-level semantic information from images, such as visual attributes and the visual structure of objects or scenes.
Therefore, the new word embeddings contain more accurate and richer semantic information than traditional WE, which helps improve the measurement of semantic relevance between words. Finally, based on the sparse bipartite graph model and the multimodal distributional semantics, a previously introduced natural language generation (NLG) model is modified to generate more coherent sentences, with abundant semantic information and coherent syntactic structure. We hope this work can further improve annotation performance, especially BLEU-3 scores. More importantly, we hope it makes new cross-media retrieval services, varied in form and rich in content, conveniently available to users, and thereby narrows the "semantic gap" between high-level cognition and low-level image features.
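The key-word selection step described above (sparsify the image-word graph, then run a random walk) can be illustrated with a minimal sketch. This is not the project's SBG model: the top-k sparsification rule, the restart probability, the toy weights, and the names `sparsify` and `random_walk_words` are all assumptions made for illustration.

```python
def sparsify(weights, k):
    """Keep only the k largest image-word weights per image.
    (Hypothetical rule; the abstract does not say how the graph is sparsified.)"""
    sparse = []
    for row in weights:
        keep = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        sparse.append([row[j] if j in keep else 0.0 for j in range(len(row))])
    return sparse


def normalize(vec):
    """Scale a non-negative vector so that it sums to 1 (if possible)."""
    total = sum(vec)
    return [v / total for v in vec] if total > 0 else list(vec)


def random_walk_words(weights, query_image, restart=0.3, steps=50):
    """Random walk with restart on the bipartite graph, started at one image.
    Returns one relevance score per word."""
    n_images, n_words = len(weights), len(weights[0])
    image_mass = [0.0] * n_images
    image_mass[query_image] = 1.0
    word_mass = [0.0] * n_words
    for _ in range(steps):
        # Image -> word step: each image spreads its mass over its words.
        word_mass = [0.0] * n_words
        for i in range(n_images):
            row = normalize(weights[i])
            for j in range(n_words):
                word_mass[j] += image_mass[i] * row[j]
        # Word -> image step: each word spreads its mass over its images.
        new_image = [0.0] * n_images
        for j in range(n_words):
            col = normalize([weights[i][j] for i in range(n_images)])
            for i in range(n_images):
                new_image[i] += word_mass[j] * col[i]
        # Restart: return part of the mass to the query image.
        image_mass = [(1 - restart) * m for m in new_image]
        image_mass[query_image] += restart
    return word_mass


# Toy image-word relevance weights (made up for this example).
words = ["dog", "grass", "ball", "sky"]
weights = [
    [0.90, 0.80, 0.10, 0.05],  # image 0
    [0.10, 0.70, 0.90, 0.20],  # image 1
    [0.05, 0.10, 0.20, 0.90],  # image 2
]

scores = random_walk_words(sparsify(weights, k=2), query_image=0)
ranked = sorted(zip(words, scores), key=lambda pair: pair[1], reverse=True)
print(ranked[:2])  # the two highest-scoring candidate key words for image 0
```

In this sketch, weak image-word links are dropped before the walk, so words connected to the query image only through noisy edges receive little or no score. The top-ranked words would then be passed on to sentence generation.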
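Image sentence annotation means annotating an image with a group of sentences that carry rich combined semantic information. It is an emerging research area formed by the intersection and fusion of machine vision and natural language generation, and it extends the semantic expressiveness of traditional image annotation. Sentences can therefore describe image content more accurately, more completely, and with less ambiguity than single words. However, existing work suffers from problems such as "severe textual noise interference" and "low accuracy in measuring the semantic relevance between words". To address these problems, on the one hand, a sparse bipartite graph model is used to weaken the noisy semantic associations between text and images, so as to select a group of key words that most accurately characterize the image content; these words are the basic units from which sentences are built. On the other hand, the multimodal distributional semantics in the corpus are mined in depth to generate multimodal distributional word vectors, which carry rich semantic information, including mid-level semantic content such as visual attributes and object structure; this helps improve the accuracy of measuring semantic relevance between words. Finally, the natural language generation model is optimized based on the sparse bipartite graph and the multimodal distributional semantics to generate coherent, fluent sentences for annotating images. This research is expected to further improve image sentence annotation performance and thereby narrow the "semantic gap" between high-level human cognition and low-level image features.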
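Image sentence annotation means annotating an image with a group of sentences that carry rich combined semantic information. It is an emerging research area formed by the deep intersection and fusion of several disciplines, including machine vision, natural language generation, cognitive science, and deep learning, and it extends the semantic expressiveness of traditional word-based image annotation. In recent years it has attracted considerable attention from researchers, because phrases or sentences carrying combined semantic information describe image content more accurately, more completely, and with less ambiguity than isolated words, and thereby narrow the "gap" between high-level cognition and low-level representations. Existing research has shortcomings in areas such as feature learning and interpretability, and image sentence annotation includes more vertical or fine-grained research directions that connect better to practical applications. This project therefore studies one vertical domain of image sentence annotation, interpretable diagnostic models, that is, describing medical images in natural language. Achieving this requires first completing image attribute annotation at multiple levels and then generating coherent natural language from the attribute annotation results. Accordingly, the project first designs high-quality tumor image recognition models from the perspectives of multi-feature fusion, multi-stage feature refinement, and weakly supervised multiple-instance learning; these models accurately output negative or positive tumor image classification results and lay an important foundation for sentence generation. Experiments show that the feature fusion or selection models proposed in this project, including DE-Ada*, an improved effective range based gene selection (ERGS), hierarchical fusion, deep contrastive mutual learning (DCML), and embedding fusion mutual learning (EFML), are effective and robust; they make full use of the complementarity between heterogeneous features to improve classification accuracy. Second, while accurately recognizing tumor image grades, a model based on multiple attention mechanisms (called MA-MIDN) precisely localizes lesion regions in medical images. It combines key features at the instance level and the whole-image level to describe image content, outputs the tumor image grade, and uses attention to focus on key lesion regions, so that it accurately describes the pathological semantics contained in medical images. Experimental results show that MA-MIDN can localize the key lesions in pathological images, laying an important foundation for subsequent diagnosis. Natural language with strong interpretability can then be generated from the tumor image classification and region localization described above. These interpretable results can narrow the semantic "gap" between high-level cognition and low-level image features, assist physicians' clinical diagnosis accurately and efficiently, and strongly promote the deep integration of medicine and engineering. In addition, this research can serve as a useful reference for traditional image sentence annotation.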
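The idea of combining instance-level features into an image-level description while using attention to highlight key regions can be sketched with generic attention-based multiple-instance pooling. This is not the MA-MIDN model itself, which the text describes only at a high level: the single attention head, the fixed `scoring_vector` (standing in for learned parameters), and the toy patch features are all assumptions for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(instances, scoring_vector):
    """Pool instance (patch) features into one bag (image) feature.

    instances: list of equal-length feature vectors, one per image patch.
    scoring_vector: hypothetical stand-in for learned attention parameters.
    Returns (bag_feature, attention_weights).
    """
    # One scalar relevance score per patch (dot product with the scoring vector).
    scores = [sum(a * x for a, x in zip(scoring_vector, inst)) for inst in instances]
    weights = softmax(scores)
    dim = len(instances[0])
    # The bag feature is the attention-weighted average of the patch features.
    bag = [sum(w * inst[d] for w, inst in zip(weights, instances)) for d in range(dim)]
    return bag, weights


# Toy patch features (made up); patch 2 plays the role of a lesion-like region.
patches = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.0, 0.2]]
scoring_vector = [1.0, 1.0]

bag, attn = attention_pool(patches, scoring_vector)
most_attended = max(range(len(attn)), key=lambda i: attn[i])
print(most_attended, [round(w, 3) for w in attn])
```

In a setup like this, the attention weights double as a coarse localization map: patches with high weight are the ones that drive the image-level prediction. That is the property the text relies on for interpretability.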
Data last updated: 2023-05-31
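Research on direct brain control of robot direction and speed based on SSVEP
Small-UAV remote sensing image registration with inlier maximization and redundant point control
Research on a crime prediction algorithm based on multimodal information feature fusion
Theme park evaluation based on public sentiment orientation: a case study of Volga Manor in Harbin
Linear complexity of a class of quaternary generalized cyclotomic sequences of period 2p^2 over F_q
Research on key technologies for web image annotation incorporating the social semantic context
Research on key theories and technologies of multimedia semantic analysis based on multimodal features
Research on key technologies for RGB-D image semantic annotation based on a coupled random forest and deep learning model
Real-time bimodal automatic soft image annotation and multi-keyword retrieval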