A similarity join finds all pairs of objects from two sets that lie within a given tolerance of each other under some distance measure. It is increasingly used in big data settings, where data models and retrieval patterns carry associated or correlated constraints. However, existing parallel techniques are limited by data-exchange bottlenecks and brute-force enumeration, and can hardly translate this essentially fuzzy matching logic into efficient divide-and-conquer tasks, especially when fine-grained partitioning meets extensive overlap among skewed data. It is therefore worthwhile to exploit those constraints to improve data layout and query optimization, and thereby data locality and pruning power, particularly for massive set-similarity joins. Aiming at both high-throughput exchange and high-efficiency computation, this project will first develop MapReduce optimizations together with cascaded storage and reduced execution models, mainly for band join conditions under foreign-key reference constraints. By proposing a Reduce-oriented load-balancing strategy and applying it to progressively weaker constraints, the project aims to extend similarity join research to topological properties and positional tokens. For spatial joins in update-intensive contexts, it introduces atomic operations and cross-block GPU synchronization into data-centered join strategies based on grid snapshots. For nonmetric string similarity joins, it will further focus on optimal combinatorial prefix filtering, covering optimal token selection, compression of the grouped exchange, and high-efficiency pruning strategies. Finally, for massively parallel environments, the project will consolidate its theoretical results into a suite of robust and scalable core techniques for efficiently handling different levels of constraints over similarity joins.
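To make the prefix-filtering idea concrete, the following is a minimal single-machine sketch of a set-similarity self-join under a Jaccard threshold. It illustrates the general, well-known prefix-filtering principle only; it is not the project's combinatorial prefix method, and the function names `jaccard` and `prefix_filter_self_join` are hypothetical. It assumes tokens are comparable values such as strings and that the threshold lies in (0, 1].

```python
from collections import Counter, defaultdict
from math import ceil


def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)


def prefix_filter_self_join(records, threshold):
    """Return (i, j, similarity) for record pairs with Jaccard >= threshold.

    Illustrative sketch: records are iterables of tokens; 0 < threshold <= 1.
    """
    # Global token order: rare tokens first, so prefixes are selective.
    freq = Counter(tok for rec in records for tok in set(rec))
    ordered = [sorted(set(rec), key=lambda t: (freq[t], t)) for rec in records]

    index = defaultdict(list)  # token -> ids of records with it in their prefix
    results = []
    for rid, toks in enumerate(ordered):
        # Two records with Jaccard >= threshold must share a prefix token.
        # The small epsilon guards against floating-point overshoot in ceil.
        prefix_len = len(toks) - ceil(threshold * len(toks) - 1e-9) + 1
        candidates = set()
        for tok in toks[:prefix_len]:
            candidates.update(index[tok])
            index[tok].append(rid)
        # Verification step: compute the exact similarity for each candidate.
        for cid in sorted(candidates):
            sim = jaccard(set(toks), set(ordered[cid]))
            if sim >= threshold:
                results.append((cid, rid, sim))
    return results


records = [["a", "b", "c", "d"], ["a", "b", "c", "e"], ["x", "y", "z"]]
pairs = prefix_filter_self_join(records, 0.5)
```

Only records that share a prefix token are ever verified, which is where the pruning power comes from; in a parallel setting the prefix tokens would typically serve as the exchange keys, which this sketch does not model.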
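As a highly promising data-processing technique in the Internet era, similarity joins have broad application and research value in data cleaning, analysis, mining, and integration, and have become a research hotspot at the intersection of databases and knowledge engineering. Driven by the association constraints of data and queries, and aiming at high-throughput exchange and high-efficiency computation, this project studies optimized layout methods for typical set data that are suited to large-scale parallelism, and seeks to provide efficient pruning-based query execution for similarity joins. The project builds its research framework around a hierarchy of constraints. First, based on relational constraints, it stores data in cascaded form under MapReduce and uses pipelining and multi-way joins to explore task partitioning and load-balancing strategies for similarity joins. Next, it studies data-centered metric-distance join techniques under frequent updates, using grid partitioning and optimized mapping as the basic approach to investigate spatial-constraint pruning and cross-block synchronization strategies on GPUs. Finally, with optimal token-combination pruning as the goal, it concentrates on token-combination selection strategies from a resource-balancing perspective, grouping and prefix-compressed exchange techniques, and their filtering methods. Through this research, the project aims to establish an initial suite of robust, efficient, and scalable core techniques for association optimization across the different constraint levels of big data.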
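The grid-partitioning approach to metric-distance joins can be illustrated with a small sequential sketch. It assumes 2D points, Euclidean distance, and a cell side equal to the join radius `eps`, so that any matching pair must fall in the same or an adjacent cell. The function name `grid_distance_join` is hypothetical, and the code does not model the GPU atomic operations or cross-block synchronization the project targets.

```python
from collections import defaultdict
from math import floor, hypot


def grid_distance_join(R, S, eps):
    """Return index pairs (i, j) with Euclidean distance(R[i], S[j]) <= eps.

    Illustrative CPU sketch: R and S are lists of (x, y) tuples; eps > 0.
    """
    # Partition S into square cells of side eps.
    grid = defaultdict(list)
    for j, (x, y) in enumerate(S):
        grid[(floor(x / eps), floor(y / eps))].append(j)

    out = []
    for i, (x, y) in enumerate(R):
        cx, cy = floor(x / eps), floor(y / eps)
        # Points within eps can only lie in the 3x3 block of cells around
        # (cx, cy), so every other cell is pruned without a distance test.
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    sx, sy = S[j]
                    if hypot(x - sx, y - sy) <= eps:
                        out.append((i, j))
    return out


R = [(0.0, 0.0), (5.0, 5.0)]
S = [(0.5, 0.5), (0.9, -0.2), (3.0, 3.0)]
matches = sorted(grid_distance_join(R, S, 1.0))
```

Because each cell can be processed independently of distant cells, the same partitioning lends itself to parallel execution; under frequent updates the grid would also need to be maintained incrementally, which is outside the scope of this sketch.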
By the project's closing date, the project team had published 2 CCF-A papers, 3 CCF-B papers, and 1 CCF-C paper, along with several papers in other domestic core journals. Internationally renowned experts in databases and artificial intelligence were invited to give more than 10 seminars at various levels. The research results have begun to be applied in National Key R&D Program projects and in frontier application projects at Peng Cheng Laboratory in Shenzhen, notably advancing text processing and the management and analysis of multi-source heterogeneous big data.
Data last updated: 2023-05-31
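Research on Key Techniques of Similarity Join Operations for Big Data
Research on Key Techniques of Similarity Join Queries over High-Dimensional Big Data
Research on Key Techniques of High-Efficiency Deduplication Storage for Big Data Protection
Research on Key Techniques of Missing Data Imputation Based on Similar Nearest Neighbors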