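Research on Ensemble Learning Models and Methods for Large-Scale Data Streams

Basic Information

Grant number: 71471022

Project type: General Program

Funding amount: 63.00

Principal investigator: 王昱

Discipline classification:

Host institution: 重庆大学 (Chongqing University)

Year approved: 2014

Year concluded: 2018

Project period: 2015-01-01 - 2018-12-31

Project status: Concluded

Project participants: 钟波, 赵鹏, 杨道理, 王现宁, 徐炜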

Keywords:

Large-scale data streams; ensemble learning

Final Report Abstract

Traditional data mining research and practice focus on batch learning, in which the entire training set is available to the algorithm, which outputs a decision model after processing the data multiple times. In recent years, developments in information and network technology have dramatically changed how data are collected and processed. Data are generated and collected at high speed, so they are large-scale, dynamic, and often subject to concept drift. In these situations, data are best modeled not as static, persistent tables but as transient data streams. Consequently, traditional data mining techniques cannot support effective and efficient knowledge discovery.

In this research, we introduce the ensemble learning methodology to large-scale data stream mining. In ensemble learning, a pool of different base learners, rather than a single one, is constructed and combined to predict the class label of an unknown instance. The main idea is to exploit the strengths of the base learners while avoiding their weaknesses. Effective ensemble learning requires that the base learners be both accurate and diverse. Nevertheless, existing research has paid little attention to the diversity of the generated base learners, which may degrade overall learning performance. To overcome these limitations, this research project investigates the impact of the characteristics of large-scale data streams on the efficiency and accuracy of data mining, and studies how to construct ensemble learning models that discover interesting information and knowledge from data streams with higher accuracy and efficiency. The four main aspects of this research are:

(1) Concept and knowledge representation and detection of concept drift in large-scale data streams. Fuzzy set theory is adopted to define the domain knowledge in the context where customers' interests and background are embedded.

(2) Incremental and scalable ensemble learning models and algorithms. Sampling techniques, data reduction methods, and instance-based learning are employed to construct ensemble learning methods capable of incremental and scalable learning.

(3) Dynamic and evolutionary ensemble learning models and algorithms for dealing with concept drift, which can present the evolutionary patterns of data characteristics, parameters, and optimal ensemble learning models.

(4) A dynamic customer segmentation method and a software prototype system based on ensemble learning.

This research project studies fundamental issues of data mining and knowledge discovery in dynamic, large-scale data streams, which are of critical importance to business intelligence and decision support today. It is therefore meaningful and valuable in both theory and practice.
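The abstract describes combining a pool of base learners and adapting to concept drift, but gives no algorithmic detail. The following is only a minimal sketch of one common way to do this, a chunk-based ensemble with accuracy-weighted voting; it is not the project's actual model. The names `CentroidLearner` and `ChunkEnsemble`, the nearest-centroid base learner, and the `max_learners` parameter are all assumptions made for illustration.

```python
from collections import defaultdict, deque


class CentroidLearner:
    """Hypothetical base learner: nearest class centroid fitted on one chunk."""

    def __init__(self, chunk):
        sums, counts = {}, defaultdict(int)
        for x, y in chunk:
            if y not in sums:
                sums[y] = [0.0] * len(x)
            sums[y] = [s + v for s, v in zip(sums[y], x)]
            counts[y] += 1
        self.centroids = {y: [s / counts[y] for s in sums[y]] for y in sums}

    def predict(self, x):
        def sq_dist(c):
            return sum((a - b) ** 2 for a, b in zip(x, c))

        return min(self.centroids, key=lambda y: sq_dist(self.centroids[y]))


class ChunkEnsemble:
    """Keeps the newest `max_learners` chunk models (assumed parameter).

    Each model's vote is weighted by its accuracy on the most recent chunk,
    so models trained before a concept drift lose influence.
    """

    def __init__(self, max_learners=3):
        self.learners = deque(maxlen=max_learners)
        self.weights = []

    def update(self, chunk):
        # Train one new base learner per chunk; the oldest is dropped if full.
        self.learners.append(CentroidLearner(chunk))
        # Re-weight every retained learner by its accuracy on the new chunk.
        self.weights = [
            sum(m.predict(x) == y for x, y in chunk) / len(chunk)
            for m in self.learners
        ]

    def predict(self, x):
        votes = defaultdict(float)
        for m, w in zip(self.learners, self.weights):
            votes[m.predict(x)] += w
        return max(votes, key=votes.get)
```

Calling `update` once per incoming chunk of `(features, label)` pairs and `predict` on new instances is the intended usage. Weighting by accuracy on the newest chunk is a simple stand-in for explicit drift detection, and it does nothing to encourage the diversity among base learners that the abstract highlights.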

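In recent years, with the wide application of networked information technology across industries, the main form of data targeted by data mining has shifted from static data to large-scale data streams characterized by massive volume, dynamics, and concept drift, which makes it difficult for traditional data mining techniques to learn from data and discover knowledge effectively. Addressing the characteristics of large-scale data streams in networked environments, this project conducts research on data mining and knowledge discovery based on ensemble learning theory and methods, examines how the characteristics of large-scale data streams affect the efficiency and accuracy of learning, and studies how to find common or regular information and knowledge more efficiently and accurately. The main research content of the project includes: (1) formal representation of conceptual knowledge in large-scale data streams and concept drift detection; (2) incremental and scalable ensemble learning models and algorithms; (3) dynamically evolving ensemble learning models and algorithms for concept drift; (4) a dynamic customer segmentation method based on ensemble learning and a software system prototype. From a methodological perspective, this project studies ensemble learning methods and techniques for large-scale data streams, which helps raise the level at which enterprises and organizations apply data mining for decision support, and has significant theoretical and practical value.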

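Project Abstract

In recent years, with the wide application of networked information technology across industries, the main form of data targeted by data mining has shifted from static data to large-scale data streams characterized by massive volume, dynamics, and concept drift, which makes it difficult for traditional data mining techniques to learn from data and discover knowledge effectively. Addressing the characteristics of large-scale data streams in networked environments, this project conducts research on data mining and knowledge discovery based on ensemble learning theory and methods, examines how the characteristics of large-scale data streams affect the efficiency and accuracy of learning, and studies how to find common or regular information and knowledge more efficiently and accurately. The main research carried out in the project includes: (1) measurement and classification methods for feature-space heterogeneity in large-scale data; (2) an improved incremental and scalable k-nearest-neighbor rule and its applications; (3) dynamically evolving ensemble learning models and algorithms for concept drift; (4) ensemble-learning-based risk decision problems such as consumer credit assessment, database marketing, and customs inspection for smuggling. From a methodological perspective, this project studies ensemble learning methods and techniques for large-scale data streams, which helps raise the level at which enterprises and organizations apply data mining for decision support, and has significant theoretical and practical value.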
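The record does not say how the project's improved k-nearest-neighbor rule works. As a rough illustration of what "incremental" can mean for k-NN on a stream, here is a minimal sketch of a plain sliding-window k-NN classifier; it is a standard baseline, not the project's method. The class name `SlidingWindowKNN` and the parameters `k` and `window` are assumptions made for illustration.

```python
from collections import Counter, deque


class SlidingWindowKNN:
    """Plain k-NN over the most recent `window` labelled instances."""

    def __init__(self, k=3, window=100):
        self.k = k
        # A bounded deque drops the oldest instance automatically, which
        # keeps memory constant and lets old concepts fade out.
        self.buffer = deque(maxlen=window)

    def learn_one(self, x, y):
        # Incremental update: storing the instance is the whole training step.
        self.buffer.append((tuple(x), y))

    def predict_one(self, x):
        if not self.buffer:
            return None
        # Rank stored instances by squared Euclidean distance to x.
        nearest = sorted(
            self.buffer,
            key=lambda item: sum((a - b) ** 2 for a, b in zip(x, item[0])),
        )[: self.k]
        # Majority vote among the k nearest labels.
        return Counter(label for _, label in nearest).most_common(1)[0][0]
```

A typical loop calls `predict_one` on each arriving instance and then `learn_one` once its true label is known. The linear scan per prediction is acceptable only for small windows; a scalable variant would need an index or a data reduction step.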

Project Outputs

No outputs of this type are listed.

Data last updated: 2023-05-31

Other Related Literature

DOI: 10.12054/lydk.bisu.148

Published: 2020

DOI:

Published: 2022

DOI: 10.3724/sp.j.1089.2022.19009

Published: 2022

DOI:

Published: 2019

DOI: 10.3788/AOS202040.2212001

Published: 2020

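Other Grants of 王昱

Grant number: 61702446

Year approved: 2017

Funding amount: 25.00

Project type: Young Scientists Fund

Grant number: 41101108

Year approved: 2011

Funding amount: 23.00

Project type: Young Scientists Fund

Grant number: 71001112

Year approved: 2010

Funding amount: 17.70

Project type: Young Scientists Fund

Grant number: 71601128

Year approved: 2016

Funding amount: 18.20

Project type: Young Scientists Fund

Grant number: 21804132

Year approved: 2018

Funding amount: 25.00

Project type: Young Scientists Fund

Grant number: 81400143

Year approved: 2014

Funding amount: 23.00

Project type: Young Scientists Fund

Grant number: 51705161

Year approved: 2017

Funding amount: 25.00

Project type: Young Scientists Fund

Grant number: 81770189

Year approved: 2017

Funding amount: 55.00

Project type: General Program

Grant number: 51669011

Year approved: 2016

Funding amount: 39.00

Project type: Fund for Less Developed Regions

Grant number: 61906125

Year approved: 2019

Funding amount: 24.00

Project type: Young Scientists Fund

Grant number: 61572347

Year approved: 2015

Funding amount: 65.00

Project type: General Program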

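Similar NSFC Grants

Research on Theory and Methods of Online Learning with Weak Information for Large-Scale Data Streams

Grant number: 61906165

Year approved: 2019

Principal investigator: 翟婷婷

Discipline classification: F0603

Funding amount: 23.00

Project type: Young Scientists Fund

Research on Ensemble-Learning-Based Mining Models and Concept Drift Mining Methods for Distributed XML Data Streams

Grant number: 61773415

Year approved: 2017

Principal investigator: 毛国君

Discipline classification: F0603

Funding amount: 64.00

Project type: General Program

Research on Deep Learning Models and Methods for Urban Big Data

Grant number: 61773324

Year approved: 2017

Principal investigator: 李天瑞

Discipline classification: F0607

Funding amount: 63.00

Project type: General Program

Multi-Label Learning Models and Algorithms for Large-Scale Wireless Network Sensing Data

Grant number: 61571233

Year approved: 2015

Principal investigator: 胡海峰

Discipline classification: F0102

Funding amount: 57.00

Project type: General Program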


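Other Related Literature

Relationship Between Residents' Risk Perception and Tourism Support in Natural Disaster Areas: The Case of Beichuan and Dujiangyan, Areas Severely Hit by the Wenchuan Earthquake

Theme Park Evaluation Based on Public Sentiment Orientation: The Case of Volga Manor in Harbin

Application of Collaborative-Representation-Based Graph Embedding Discriminant Analysis to Face Recognition

An Improved Multi-Objective Sine Cosine Optimization Algorithm

Design of a Large-Aperture Primary Mirror Based on a Hybrid Optimization Method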

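Other Grants of 王昱

Research on Personalized Diabetes Diagnosis and Treatment Methods Based on Similarity Learning

Research on Spatial Deprivation in Resource-Exhausted Towns of Northeast China

Classification Techniques Considering Feature-Space Heterogeneity and Their Application in Business Intelligence

Robust Optimization Methods for Surgery Scheduling Under Limited Distributional Information

Controllable Design of Multifunctional Ratiometric Fluorescent Sensors Using Transition-Metal-Doped Carbon Quantum Dots

Mechanism of the Anti-Leukemia Effect of Donor Lymphocyte Infusion

Theory and Methods of Structural Optimization Design Considering SLM Manufacturing Constraints

Experimental Study on DLI Reversal of T-Cell Functional Exhaustion for Effective Treatment of Post-Transplant B-ALL Relapse

Habitat Response Relationships of Cascade Hydropower Development on Inland Rivers

Multi-Mode Autonomous-Switching Situation Assessment Methods for Uncertain and Missing Information

Resource Optimization Theory and Techniques in Large-Scale Mobile Crowdsensing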
