Introduction to Apache Chukwa

gaojingsong

I. Introduction to Chukwa

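Chukwa is an open-source data collection system for monitoring large distributed systems. It is built on top of Hadoop's HDFS and MapReduce framework and inherits Hadoop's scalability and robustness. Chukwa also includes a powerful and flexible toolkit for displaying, monitoring, and analyzing the collected data. Some websites even describe Chukwa as a "full-stack solution" for log processing and analysis.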

Apache Chukwa is an Apache Hadoop subproject devoted to bridging the gap between log processing and the Hadoop ecosystem. Apache Chukwa is a scalable distributed monitoring and analysis system, aimed particularly at logs from Apache Hadoop and other distributed systems.

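What Chukwa is not

1. Chukwa is not a single-machine system. Deploying Chukwa on a single node is of little use. Chukwa is a distributed log processing system built on Hadoop. In other words, before setting up a Chukwa environment you first need a Hadoop environment, and Chukwa is then built on top of it; this relationship is also evident from Chukwa's architecture diagram. The reason is that Chukwa assumes the data to be processed is at the terabyte scale.

2. Chukwa is not a real-time error monitoring system. Systems such as Ganglia and Nagios already handle that problem well, with data latency on the order of seconds. Chukwa analyzes data at minute-level granularity; it takes the view that receiving a figure such as overall cluster CPU utilization a few minutes late is not a problem.

3. Chukwa is not a closed system. Although Chukwa ships with many analyses aimed at Hadoop clusters, that does not mean it can only monitor and analyze Hadoop. Chukwa provides an end-to-end solution and framework for collecting, storing, analyzing, and displaying large volumes of log-type data, and it covers every stage of that data's life cycle, as its architecture also shows.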

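Specifically, Chukwa focuses on the following areas:

1. Overall, Chukwa can monitor the general health of large Hadoop clusters (more than 2,000 nodes, producing terabytes of data per day) and analyze their logs.

2. For cluster users: Chukwa shows how long their jobs have been running, how many resources they are using, how many resources remain available, why a job failed, and on which node a read or write operation went wrong.

3. For cluster operations engineers: Chukwa shows hardware errors in the cluster, changes in cluster performance, and where the resource bottlenecks are.

4. For cluster managers: Chukwa shows the cluster's resource consumption and overall job execution, which can support budgeting and the coordination of cluster resources.

5. For cluster developers: Chukwa shows the main performance bottlenecks and the most frequent errors in the cluster, so that effort can be concentrated on the most important problems.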

II. How Chukwa Works

Agents and Adaptors

Apache Chukwa agents do not collect some particular fixed set of data. Rather, they support dynamically starting and stopping Adaptors, which are small, dynamically controllable modules that run inside the Agent process and are responsible for the actual collection of data.

These dynamically controllable data sources are called adaptors because they generally wrap some other data source, such as a file or a Unix command-line tool. The Apache Chukwa agent guide includes an up-to-date list of available Adaptors.

Data sources need to be dynamically controllable because the particular data being collected from a machine changes over time, and varies from machine to machine. For example, as Hadoop tasks start and stop, different log files must be monitored. We might want to increase our collection rate if we detect anomalies. And of course, it makes no sense to collect Hadoop metrics on an NFS server.
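The idea of adaptors that are started and stopped while the agent keeps running can be sketched in a few lines. This is only a toy model under stated assumptions: the class names `Agent` and `Adaptor`, their methods, and the sample data sources are hypothetical and do not correspond to Chukwa's actual Java classes or control protocol.

```python
from typing import Callable, Dict, List


# Hypothetical toy model; these names are not Chukwa's actual Java classes.
class Adaptor:
    """Wraps one data source and returns what it produced since the last poll."""

    def __init__(self, name: str, source: Callable[[], List[str]]) -> None:
        self.name = name
        self.source = source

    def poll(self) -> List[str]:
        return self.source()


class Agent:
    """Hosts adaptors that can be added and removed while the agent runs."""

    def __init__(self) -> None:
        self.adaptors: Dict[str, Adaptor] = {}

    def add(self, adaptor: Adaptor) -> None:
        self.adaptors[adaptor.name] = adaptor

    def remove(self, name: str) -> None:
        # Removing an unknown adaptor is treated as a no-op.
        self.adaptors.pop(name, None)

    def collect(self) -> Dict[str, List[str]]:
        return {name: a.poll() for name, a in self.adaptors.items()}


agent = Agent()
agent.add(Adaptor("task_log", lambda: ["task_0001 started"]))
agent.add(Adaptor("disk_usage", lambda: ["/dev/sda1 42%"]))
first = agent.collect()

# The task has finished, so its log no longer needs to be watched.
agent.remove("task_log")
second = agent.collect()
```

After the `remove` call, only the `disk_usage` source is still polled. The set of collected data changes without restarting the agent, which is the property the paragraph above describes.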

ETL Processes

Apache Chukwa Agents can write data directly to HBase or sequence files. This is convenient for rapidly getting data committed to stable storage.

HBase provides indexing by primary key and manages data compaction. It is better suited to continuous monitoring of a data stream and to producing periodic reports.

HDFS provides better throughput when working with large volumes of data, which makes it more suitable for one-off research and analysis jobs. However, it is less convenient for finding particular data items. As a result, Apache Chukwa has a toolbox of MapReduce jobs for organizing and processing incoming data.
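The difference between finding an item in a key-ordered store and finding it in a plain file can be illustrated with an in-memory sketch. It uses no HBase or HDFS API: a sorted Python list stands in for a store ordered by row key, and the `<host>/<minute>` key layout is an assumption made purely for illustration.

```python
import bisect
from typing import List, Tuple

Row = Tuple[str, str]  # (row key, value)

# Hypothetical row-key layout "<host>/<minute>"; chosen only for this sketch.
rows: List[Row] = sorted([
    ("host1/0900", "cpu=12"),
    ("host1/0901", "cpu=15"),
    ("host2/0900", "cpu=35"),
    ("host3/0900", "cpu=8"),
])
keys = [key for key, _ in rows]


def keyed_range(prefix: str) -> List[Row]:
    """Key-ordered store: jump to the first matching key, stop when the prefix ends."""
    start = bisect.bisect_left(keys, prefix)
    matches = []
    for key, value in rows[start:]:
        if not key.startswith(prefix):
            break
        matches.append((key, value))
    return matches


def full_scan(prefix: str) -> List[Row]:
    """Unindexed file: every record has to be read and tested."""
    return [(key, value) for key, value in rows if key.startswith(prefix)]


by_key = keyed_range("host2/")
by_scan = full_scan("host2/")
```

Both functions return the same rows, but `keyed_range` reads only the matching rows plus one extra, while `full_scan` reads every record. On terabyte-scale data this gap is why sequence files favor bulk analysis and a keyed store favors lookups.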

These jobs come in two kinds: Archiving and Demux. The archiving jobs simply take Chunks from their input and output new sequence files of Chunks, ordered and grouped. They do no parsing or modification of the contents. (There are several different archiving jobs that differ in precisely how they group the data.)

Demux jobs, in contrast, take Chunks as input and parse them to produce ChukwaRecords, which are sets of key-value pairs. Demux can run as a MapReduce job or as part of HBaseWriter.
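The contrast between the two kinds of job can be sketched as follows. This is a simplified illustration, not Chukwa's implementation: the `Chunk` dataclass is a hypothetical stand-in with far fewer fields than a real Chunk, the `key=value` payload format is assumed, and plain dictionaries stand in for ChukwaRecords.

```python
from dataclasses import dataclass
from typing import Dict, List


# Hypothetical, simplified stand-in for a Chunk.
@dataclass(frozen=True)
class Chunk:
    source: str     # host the data came from
    data_type: str  # kind of data, e.g. "metrics"
    seq_id: int     # position in the source stream
    data: str       # raw payload, left opaque by archiving


def archive(chunks: List[Chunk]) -> List[Chunk]:
    """Order and group chunks; payloads pass through unmodified."""
    return sorted(chunks, key=lambda c: (c.data_type, c.source, c.seq_id))


def demux(chunk: Chunk) -> List[Dict[str, str]]:
    """Parse each 'key=value key=value' line (assumed format) into one record."""
    records = []
    for line in chunk.data.splitlines():
        record = {"source": chunk.source, "data_type": chunk.data_type}
        for field in line.split():
            key, sep, value = field.partition("=")
            if sep:  # skip tokens that are not key=value pairs
                record[key] = value
        records.append(record)
    return records


chunks = [
    Chunk("host2", "metrics", 7, "cpu=35 mem=61"),
    Chunk("host1", "metrics", 3, "cpu=12 mem=48\ncpu=15 mem=50"),
]
archived = archive(chunks)
records = [record for chunk in archived for record in demux(chunk)]
```

`archive` changes only the order of the chunks, whereas `demux` looks inside each payload and emits one set of key-value pairs per line.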

For details on controlling this part of the pipeline, see the Pipeline guide. For details about the file formats, and how to use the collected data, see the Programming guide.

Data Analytics Scripts

Data stored in HBase is aggregated by data analytics scripts to visualize and interpret the health of the Hadoop cluster. The scripts are written in Pig Latin, a high-level language whose easy-to-understand examples help data analysts create additional scripts for visualizing data on HICC.
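As a rough illustration of the kind of aggregation such a script performs, the sketch below averages CPU utilization across hosts for each minute. It is written in Python rather than Pig Latin, and the record fields (`host`, `minute`, `cpu`) and sample values are assumptions, not Chukwa's actual schema.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical per-host samples; field names are assumed for this sketch.
records = [
    {"host": "host1", "minute": "09:00", "cpu": 12.0},
    {"host": "host2", "minute": "09:00", "cpu": 36.0},
    {"host": "host1", "minute": "09:01", "cpu": 15.0},
]


def average_cpu_per_minute(rows: List[dict]) -> Dict[str, float]:
    """Group samples by minute and average the CPU figure across hosts."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for row in rows:
        totals[row["minute"]] += row["cpu"]
        counts[row["minute"]] += 1
    return {minute: totals[minute] / counts[minute] for minute in totals}


summary = average_cpu_per_minute(records)
```

The result is one cluster-wide figure per minute, the granularity at which a dashboard would typically plot cluster health.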

HICC

HICC, the Hadoop Infrastructure Care Center, is a web-portal-style interface for displaying data. Data is fetched from HBase, which in turn is populated by collectors or by data analytics scripts that run on the collected data after Demux. The Administration guide has details on setting up HICC.

Apache HBase Integration

Apache Chukwa has adopted HBase to ensure that data arrives within milliseconds and is available to downstream applications at the same time. This gives monitoring applications a near-real-time view as soon as data arrives in the system. File rolling and archiving are replaced by HBase Region Server minor and major compactions.