Given the recent rapid increase in the availability of extremely large data sets, their storage, access, and analysis have become critical.

Since data sets are often too large to load into the memory of a single machine, let alone to analyze statistically as a whole, the divide-and-conquer methodology has received significant attention. Conceptually, it simply involves distributing the data across multiple machines, carrying out standard statistical model fitting on each local machine separately to obtain multiple estimates of the same quantities or parameters of interest, and finally pooling those estimates into a single estimate on a central machine through a simple averaging step.

For many models, the simple divide-and-conquer method described above can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine, the so-called oracle property of the divide-and-conquer method. However, for high-dimensional models, where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, naive averaging fails because the penalty used to make high-dimensional estimation feasible introduces a bias that does not average out; debiasing before aggregation is therefore critical. In this proposal, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classification. The purpose of this study is to propose debiasing methods for these penalized models and to rigorously establish the optimal convergence rates and, in some cases, the asymptotic distributions of the aggregated estimates. Achieving this will deepen our understanding of the divide-and-conquer strategy and significantly expand its applicability.
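For intuition, the following is a minimal sketch of the "fit locally, then average" step in the simple low-dimensional case (ordinary least squares), where plain averaging is known to recover the full-data fit. The sample size, dimension, number of machines, and simulated data are hypothetical and serve only to illustrate the aggregation step, not any estimator proposed in this project.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical simulated data: n observations split across K machines.
    n, p, K = 100_000, 10, 20
    beta = rng.normal(size=p)
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)

    # Local step: each machine fits OLS on its own shard of the data.
    local_estimates = [
        np.linalg.lstsq(Xk, yk, rcond=None)[0]
        for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))
    ]

    # Aggregation step: a single averaging pass on the central machine.
    beta_dc = np.mean(local_estimates, axis=0)

    # Oracle benchmark: the fit obtained from the entire data set at once.
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

    print("distance between averaged and full-data estimates:",
          np.linalg.norm(beta_dc - beta_full))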
Because large data sets often cannot be loaded into the memory of a single machine, the divide-and-conquer approach has received wide attention in recent years. The idea is to distribute different subsets of the data to multiple machines, fit the statistical model on each machine separately, and then collect the resulting estimates on a central machine and average them.

For many models, the simple divide-and-conquer method described above can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine, the so-called oracle property of the divide-and-conquer method. However, in high-dimensional models, where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, the penalty used for variable selection introduces estimation bias, so correcting this bias before aggregation is essential. In this project, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classifiers. The aim of this study is to propose debiasing methods for these LASSO-penalized models, to rigorously establish optimal convergence rates, and to examine their finite-sample properties through numerical simulation.
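In the high-dimensional case, one standard instance of the "debias, then average" idea is the debiased (desparsified) LASSO, in which the precision matrix is approximated by nodewise LASSO regressions and the corrected local estimates are averaged across machines. The sketch below is an illustrative implementation under those assumptions, not the specific estimator to be developed in this project; the regularization parameters lam and lam_node are hypothetical placeholders that would need to be tuned in practice.

    import numpy as np
    from sklearn.linear_model import Lasso

    def debiased_lasso(X, y, lam, lam_node):
        """One-step debiased LASSO on a single machine (illustrative sketch).

        The correction is b + (1/n) * Theta_hat @ X.T @ (y - X @ b), where
        Theta_hat approximates the inverse Gram matrix via nodewise LASSO.
        """
        n, p = X.shape
        b = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

        # Nodewise LASSO: regress each column of X on the remaining columns
        # to build a sparse approximation of the precision matrix.
        Theta = np.zeros((p, p))
        for j in range(p):
            idx = np.arange(p) != j
            gamma = Lasso(alpha=lam_node, fit_intercept=False).fit(X[:, idx], X[:, j]).coef_
            tau2 = X[:, j] @ (X[:, j] - X[:, idx] @ gamma) / n
            row = np.zeros(p)
            row[j] = 1.0
            row[idx] = -gamma
            Theta[j] = row / tau2
        return b + Theta @ X.T @ (y - X @ b) / n

    def divide_and_conquer(X, y, n_machines, lam, lam_node):
        """Average debiased local estimates across machines."""
        shards = zip(np.array_split(X, n_machines), np.array_split(y, n_machines))
        return np.mean([debiased_lasso(Xk, yk, lam, lam_node) for Xk, yk in shards],
                       axis=0)

The nodewise-LASSO construction is only one of several ways to estimate the precision matrix for debiasing; the key point illustrated here is that the bias correction is applied on each machine before the single averaging pass on the central machine.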
In this project, we studied the statistical properties of, and methods based on, the divide-and-conquer strategy for several complex models, including partially linear models, quantile regression models, and nonparametric models. The statistical theory we developed clarifies several important theoretical aspects of these popular approaches to big data analysis. In particular, we derived the (often optimal) convergence rates of various estimators in different models under reasonable mathematical assumptions. Our results have been published in several leading international journals.