As people and businesses rely more and more on the Internet for daily life and work, they demand ever better Internet service performance. Internet Content Providers (ICPs), among the most important components of the Internet ecosystem, face great challenges in meeting these performance requirements, for two reasons. First, their scale can be enormous: the number of product lines, shared software modules, and hardware and network resources continues to grow, and so do their coupling, complexity, and dynamics. Second, user traffic demand on ICPs can be highly dynamic for reasons such as seasonality and crowd events. As a result of component failures, resource exhaustion, or demand changes, user-perceived performance can degrade in various ways, such as increased latency, reduced throughput, or access failures. Unfortunately, current ICP practice for detecting and troubleshooting such performance degradation is still relatively rudimentary and sometimes manual. This calls for a new architecture that lets ICPs automatically detect, isolate, and troubleshoot all kinds of performance degradation events in real time. To address these challenges, in this project we propose a generic, real-time, accurate architecture, based on big data analytics, for ICPs to manage user-perceived performance.
It consists of four separate but closely related components: 1) a flow-based hierarchical KPI clustering and storage system, which provides fine-grained KPI monitoring with small storage and computation overhead and fast query speed; 2) an adaptive anomaly detection engine that learns each KPI's properties and then chooses the appropriate anomaly detection algorithms and parameters for that KPI; 3) an offline, distributed association rule mining system that automatically calculates the co-occurrence relationships, and even causal relationships, between various fault, anomaly, and workflow events, without requiring 100% accurate input data; 4) a root cause analysis system that takes events and the aforementioned association rules as input, automatically works out the chain of events (including the root cause) that led to the events of interest (such as a performance degradation), and proposes actionable suggestions. The proposed architecture is adaptive, self-learning, and iterative, in that the output of one component may serve as input to another, and the latter component's results may in turn provide valuable feedback that lets the former learn and adjust its parameters for better accuracy in the future. All of this feedback is automatic, with minimal manual involvement. We will show through this project that this architecture can greatly improve the user-perceived performance of ICPs and benefit the Internet ecosystem.
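To make the idea behind component 3 concrete, the following is a minimal sketch of co-occurrence rule mining over event windows. It is an illustration under stated assumptions, not the project's system: the function name `mine_pair_rules`, the event names, the fixed time windows, and the thresholds are all hypothetical, and the sketch only measures co-occurrence (support and confidence) for pairs of events; it does not infer causality and is neither distributed nor noise-tolerant.

```python
from collections import Counter
from itertools import combinations


def mine_pair_rules(windows, min_support=0.3, min_confidence=0.7):
    """Mine rules 'antecedent -> consequent' between pairs of events.

    windows: list of event-name lists, one list per time window (assumed input shape).
    Returns tuples (antecedent, consequent, support, confidence).
    """
    n = len(windows)
    item_counts = Counter()   # number of windows containing each event
    pair_counts = Counter()   # number of windows containing both events of a pair
    for events in windows:
        unique = sorted(set(events))  # ignore repeats inside one window
        item_counts.update(unique)
        pair_counts.update(combinations(unique, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n
        if support < min_support:
            continue
        # Check both directions of the rule.
        for ante, cons in ((a, b), (b, a)):
            confidence = count / item_counts[ante]
            if confidence >= min_confidence:
                rules.append((ante, cons, support, confidence))
    # Strongest rules first; ties broken by name for a stable order.
    return sorted(rules, key=lambda r: (-r[3], -r[2], r[0], r[1]))


# Hypothetical event windows, invented for illustration only.
windows = [
    ["disk_full", "latency_increase"],
    ["disk_full", "latency_increase", "deploy"],
    ["deploy"],
    ["disk_full", "latency_increase"],
    ["deploy", "latency_increase"],
]
rules = mine_pair_rules(windows)
for ante, cons, support, confidence in rules:
    print(f"{ante} -> {cons}: support={support:.2f}, confidence={confidence:.2f}")
```

On this toy input only the `disk_full` and `latency_increase` pair passes both thresholds. A root cause analysis stage like the one described in component 4 could then consume such rules as candidate links in an event chain.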
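As the Internet reaches into every aspect of modern life, users demand an ever better Internet service experience. Internet Content Providers (ICPs) are a vital part of the Internet ecosystem; the scale, coupling, complexity, and dynamics of their product lines, modules, and software and hardware resources keep growing, while the load from users also changes dynamically. These characteristics pose great challenges for ICP service performance management: operations staff need a complete set of efficient tools to detect and troubleshoot performance degradation events (increased access latency, reduced throughput, access failures, and so on) that can occur at any time. To address these challenges, this project proposes a generic, real-time, high-accuracy architecture for ICP performance experience management based on big data analytics. It comprises four main modules: a flow-based hierarchical KPI aggregation and storage method; an adaptive anomaly detection engine for heterogeneous KPIs; a distributed, noise-tolerant association mining method for massive operations events; and a distributed fault localization model based on fault propagation chains. The architecture is adaptive, iterative, and self-learning, and can significantly improve the performance experience delivered under Internet service quality management.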
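The performance of services provided by Internet Content Providers (ICPs) accounts for a growing share of the overall end-to-end user experience, which brings both opportunities and challenges for ICP service performance management. This project studies a generic, real-time, high-accuracy architecture, based on big data analytics, for managing ICP service performance experience. It covers a flow-based hierarchical KPI (Key Performance Indicator) aggregation and storage method, an adaptive anomaly detection engine for heterogeneous KPIs, a distributed noise-tolerant association mining method for massive operations events, and a distributed fault localization model based on fault propagation chains. After four years of research, the project team made important progress on all four topics. The team proposed a method that clusters KPI data according to the shape features of KPI curves, achieving automatic clustering of streaming data; a general anomaly detection system based on supervised learning and ensemble learning, together with an unsupervised anomaly detection mechanism for periodic KPIs based on variational autoencoders; a machine learning model based on frequent itemset mining and several correlation coefficient measures that performs association rule mining and thereby helps fault propagation chains locate root causes quickly; and a fast fault localization model based on Monte Carlo tree search and layer-by-layer pruning. These results enrich the theory and methods of Internet service performance management and represent substantial breakthroughs and innovation in the related key technologies. During the project, the team published 41 academic papers, including 9 in CCF rank A conferences or journals and 15 in CCF rank B conferences or journals. The principal investigator received funding from the Okawa Foundation of Japan in 2018; one paper funded by this project won the Best Research Paper Award at IEEE ISSRE 2018 (a CCF rank B international conference), and one paper was nominated for Best Research Paper at IEEE/ACM IWQoS 2016 (a CCF rank B international conference). The project trained 10 PhD graduates and 3 master's graduates, of whom one was named Tsinghua University Person of the Year, one received the Tsinghua University Outstanding Doctoral Dissertation award, and one received the National Scholarship. The team fully completed the research plan set out in the project proposal and achieved the expected research goals.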
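As an illustration of clustering KPIs by curve shape, the following is a minimal sketch that groups KPI series whose z-normalized shapes correlate strongly. It is not the project team's method: the greedy single-pass assignment, the Pearson correlation measure, the 0.9 threshold, and all names and sample values are assumptions chosen for brevity.

```python
import math


def z_normalize(series):
    """Rescale a series to zero mean and unit variance so only its shape remains."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    if std == 0:
        return [0.0] * len(series)  # a flat curve has no shape to compare
    return [(x - mean) / std for x in series]


def shape_similarity(a, b):
    """Pearson correlation of two equal-length series (mean product of z-scores)."""
    za, zb = z_normalize(a), z_normalize(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)


def cluster_kpis(kpis, threshold=0.9):
    """Greedy clustering: put each KPI in the first cluster whose
    representative curve correlates with it at or above the threshold."""
    clusters = []  # list of (representative_series, member_names)
    for name, series in kpis.items():
        for representative, members in clusters:
            if shape_similarity(representative, series) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((series, [name]))
    return [members for _, members in clusters]


# Hypothetical KPI curves, invented for illustration only.
kpis = {
    "page_views_a": [10, 20, 30, 20, 10, 20, 30, 20],
    "page_views_b": [100, 210, 290, 200, 110, 190, 310, 200],
    "error_rate": [5, 5, 5, 9, 5, 5, 5, 5],
}
groups = cluster_kpis(kpis)
print(groups)
```

The two page-view curves differ in magnitude by roughly a factor of ten but share a shape, so they land in one group, while the spiky error-rate curve forms its own. Grouping by shape in this way is what would let one anomaly detection configuration be reused across all members of a cluster.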
Data last updated: 2023-05-31