Given the recent rapid increase in the availability of extremely large data sets, their storage, access, and analysis have become critical.

Since data sets are often too large to load into the memory of a single machine, let alone to analyze statistically as a whole, the divide-and-conquer methodology has received significant attention. Conceptually, it simply involves distributing the data across multiple machines, carrying out standard statistical model fitting on each local machine separately to obtain multiple estimates of the same quantities or parameters of interest, and finally pooling those estimates into a single estimate on a central machine through a simple averaging step.

For many models, the simple divide-and-conquer method described above can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine, the so-called oracle property of the divide-and-conquer method. However, for high-dimensional models, where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, naive averaging fails because the penalty used to make high-dimensional estimation feasible introduces a bias that does not average out; debiasing before aggregation is therefore critical. In this proposal, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classification. The purpose of this study is to propose debiasing methods for these penalized models and to rigorously establish the optimal convergence rates and, in some cases, the asymptotic distributions of the aggregated estimates. Achieving this will deepen our understanding of the divide-and-conquer strategy and significantly expand its applicability.
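For intuition, the following is a minimal sketch of the "fit locally, then average" step in the simple low-dimensional case (ordinary least squares), where plain averaging is known to recover the full-data fit. The sample size, dimension, number of machines, and simulated data are hypothetical and serve only to illustrate the aggregation step, not any estimator proposed in this project.

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical simulated data: n observations split across K machines.
    n, p, K = 100_000, 10, 20
    beta = rng.normal(size=p)
    X = rng.normal(size=(n, p))
    y = X @ beta + rng.normal(size=n)

    # Local step: each machine fits OLS on its own shard of the data.
    local_estimates = [
        np.linalg.lstsq(Xk, yk, rcond=None)[0]
        for Xk, yk in zip(np.array_split(X, K), np.array_split(y, K))
    ]

    # Aggregation step: a single averaging pass on the central machine.
    beta_dc = np.mean(local_estimates, axis=0)

    # Oracle benchmark: the fit obtained from the entire data set at once.
    beta_full = np.linalg.lstsq(X, y, rcond=None)[0]

    print("distance between averaged and full-data estimates:",
          np.linalg.norm(beta_dc - beta_full))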
Because large data sets often cannot be loaded into the memory of a single machine, the divide-and-conquer approach has received wide attention in recent years. The idea is to distribute different subsets of the data to multiple machines, fit the statistical model on each machine separately, and then collect the resulting estimates on a central machine and average them.

For many models, the simple divide-and-conquer method described above can be shown theoretically to achieve the same estimation performance as analyzing the entire data set on a single machine, the so-called oracle property of the divide-and-conquer method. However, in high-dimensional models, where the number of parameters to estimate may exceed the number of observations, the situation is more complicated. In particular, the penalty used for variable selection introduces estimation bias, so correcting this bias before aggregation is essential. In this project, we plan to study divide-and-conquer methods for several high-dimensional statistical models, including partially linear models, quantile regression models, and support vector classifiers. The aim of this study is to propose debiasing methods for these LASSO-penalized models, to rigorously establish optimal convergence rates, and to examine their finite-sample properties through numerical simulation.
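In the high-dimensional case, one standard instance of the "debias, then average" idea is the debiased (desparsified) LASSO, in which the precision matrix is approximated by nodewise LASSO regressions and the corrected local estimates are averaged across machines. The sketch below is an illustrative implementation under those assumptions, not the specific estimator to be developed in this project; the regularization parameters lam and lam_node are hypothetical placeholders that would need to be tuned in practice.

    import numpy as np
    from sklearn.linear_model import Lasso

    def debiased_lasso(X, y, lam, lam_node):
        """One-step debiased LASSO on a single machine (illustrative sketch).

        The correction is b + (1/n) * Theta_hat @ X.T @ (y - X @ b), where
        Theta_hat approximates the inverse Gram matrix via nodewise LASSO.
        """
        n, p = X.shape
        b = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_

        # Nodewise LASSO: regress each column of X on the remaining columns
        # to build a sparse approximation of the precision matrix.
        Theta = np.zeros((p, p))
        for j in range(p):
            idx = np.arange(p) != j
            gamma = Lasso(alpha=lam_node, fit_intercept=False).fit(X[:, idx], X[:, j]).coef_
            tau2 = X[:, j] @ (X[:, j] - X[:, idx] @ gamma) / n
            row = np.zeros(p)
            row[j] = 1.0
            row[idx] = -gamma
            Theta[j] = row / tau2
        return b + Theta @ X.T @ (y - X @ b) / n

    def divide_and_conquer(X, y, n_machines, lam, lam_node):
        """Average debiased local estimates across machines."""
        shards = zip(np.array_split(X, n_machines), np.array_split(y, n_machines))
        return np.mean([debiased_lasso(Xk, yk, lam, lam_node) for Xk, yk in shards],
                       axis=0)

The nodewise-LASSO construction is only one of several ways to estimate the precision matrix for debiasing; the key point illustrated here is that the bias correction is applied on each machine before the single averaging pass on the central machine.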
In this project, we studied the statistical properties of, and methods based on, the divide-and-conquer strategy for several complex models, including partially linear models, quantile regression models, and nonparametric models. The statistical theory we developed clarifies several important theoretical aspects of these popular approaches to big data analysis. In particular, we derived the (often optimal) convergence rates of various estimators in different models under reasonable mathematical assumptions. Our results have been published in several leading international journals.