As data quality becomes a key issue in practice, metric distance constraints are often deployed to improve it, for example by detecting violations, analyzing consistency, and repairing dirty data. In this proposal, we focus on the automatic discovery of metric distance constraints and on their application to the important problem of data repairing. First, to find metric distance constraints automatically, we propose parameter-free mining of distance thresholds, together with carefully designed pruning techniques that optimize the discovery process. Once the constraints have been mined, we investigate the foundations and techniques for applying them to data quality problems. In particular, we study their application in data repairing. We first analyze the complexity and hardness of the repairing problem with theoretical proofs. Given this hardness, we then develop a safe-contraction-based algorithm for approximate repairing. All proposed approaches are evaluated through extensive experiments. To the best of our knowledge, this is the first work on mining and repairing with respect to metric distance constraints. We believe that our proposal can improve the quality and reliability of data and contribute to national progress in developing reliable software systems.
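As the demand for data quality grows increasingly urgent, distance-based data constraints play an important role in data quality applications such as conflict detection, consistency analysis, and data repairing. This project proposes to study mechanisms for automatically mining distance constraints and to explore practical methods for applying them in data repairing. For the mining problem, we propose a parameter-free method for determining distance thresholds and design performance optimization techniques for the threshold computation algorithm. Studying methods for mining distance constraints can provide a theoretical and technical foundation for applications in the data quality field. In particular, this project focuses on the practical application of distance constraints in data repairing. Through theoretical analysis, we examine the complexity and technical difficulties of data repairing under distance constraints, and propose an effective approximate repairing method based on safe contraction. The results will be validated through experiments. Techniques for automatically mining distance constraints and repairing data will improve the quality and trustworthiness of data and promote the deployment and development of trusted software in China.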
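Addressing increasingly pressing data quality problems, this project studied distance-based data constraints, a topic of both theoretical and practical significance. The project investigated mechanisms for automatically mining distance constraints and explored practical methods for applying them in data repairing. Specific outcomes include: measures for constraints based on support and confidence; algorithms for mining distance constraints; a data repairing model based on distance relationships; and efficient data repairing algorithms for multiple data types such as sequences, spatial data, and graphs. The work resulted in 14 published papers, including 3 SCI-indexed papers, 7 EI-indexed papers, and 11 papers in venues ranked Class A by the China Computer Federation (CCF), as well as 1 patent application.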
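To make the support and confidence measures mentioned above more concrete, the following is a minimal sketch, not the project's actual method. It assumes a constraint of the hypothetical form "if two tuples differ by at most `lhs_threshold` on attribute X, they should differ by at most `rhs_threshold` on attribute Y", with absolute difference as the metric. The function names and the pairwise definitions of support and confidence are illustrative assumptions.

```python
from itertools import combinations


def support_confidence(rows, lhs_threshold, rhs_threshold):
    """Pairwise support and confidence of an assumed distance constraint.

    rows: list of (x, y) numeric pairs (hypothetical two-attribute relation).
    support    = pairs satisfying both sides / all pairs
    confidence = pairs satisfying both sides / pairs satisfying the left side
    """
    total = matched_lhs = matched_both = 0
    for (x1, y1), (x2, y2) in combinations(rows, 2):
        total += 1
        if abs(x1 - x2) <= lhs_threshold:
            matched_lhs += 1
            if abs(y1 - y2) <= rhs_threshold:
                matched_both += 1
    support = matched_both / total if total else 0.0
    confidence = matched_both / matched_lhs if matched_lhs else 0.0
    return support, confidence


def violations(rows, lhs_threshold, rhs_threshold):
    """Index pairs that are close on X but too far apart on Y."""
    return [
        (i, j)
        for (i, (x1, y1)), (j, (x2, y2)) in combinations(enumerate(rows), 2)
        if abs(x1 - x2) <= lhs_threshold and abs(y1 - y2) > rhs_threshold
    ]


# Toy data: the third tuple has a suspicious Y value.
rows = [(1, 10), (2, 11), (3, 50), (10, 12)]
print(support_confidence(rows, lhs_threshold=1, rhs_threshold=2))
print(violations(rows, lhs_threshold=1, rhs_threshold=2))
```

In this toy data, a threshold-mining procedure would compare such scores across candidate thresholds, and a repairing procedure would modify values in the reported violating pairs; both steps are outside the scope of this sketch.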
Data last updated: 2023-05-31
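On the Impact of the Big Data Environment on the Development of Information Science
Quality Evaluation of Cistanche Seeds and Primary Processing of the Medicinal Material
An Efficient Parallel Algorithm for Radiation Diffusion on Multi-Block Structured Grids in Inertial Confinement Fusion Implosions
Preliminary Application Study of Precipitation Signal Identification and Retrieval from Himawari-8/AHI Infrared Spectral Data
Macro-Level Gap Analysis of Chinese and Foreign Academic Papers and Journals, with Suggestions for Improvement
Research on Rule-Based Multispectral Data Analysis
Research on a Data-Driven Medical Decision Model Based on Rule-Based Classification
Rule-Based Active Databases: Research on Algorithms and Mechanisms
Implicit Learning of Long-Distance Tonal Rules in Chinese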