Here are my notes introducing Hadoop-based open source projects, along with a few hints. I hope they are useful to you.
Management Tool
Ambari: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, with support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig and Sqoop. Ambari also provides a dashboard for viewing cluster health (heatmaps, for example) and the ability to view MapReduce, Pig and Hive applications visually, along with features to diagnose their performance characteristics in a user-friendly manner.
Ambari enables System Administrators to:
- Provision a Hadoop Cluster
  - Ambari handles configuration of Hadoop services for the cluster.
  - Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
- Manage a Hadoop Cluster
  - Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
- Monitor a Hadoop Cluster
  - Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
  - Ambari leverages Ganglia for metrics collection.
  - Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc).
Ambari enables Application Developers and System Integrators to:
- Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
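As a rough illustration of the REST integration mentioned above, the sketch below builds (but does not send) a request for the cluster list. The host name and credentials are placeholders. The `/api/v1/clusters` path, port 8080, and the `X-Requested-By` header are assumptions based on a typical Ambari setup, so check them against your own installation.

```python
import base64
import urllib.request


def build_cluster_list_request(host, user, password, port=8080):
    """Build (without sending) a GET request for the Ambari cluster list."""
    # Assumed endpoint layout: http://<host>:<port>/api/v1/clusters
    url = f"http://{host}:{port}/api/v1/clusters"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, method="GET")
    request.add_header("Authorization", f"Basic {token}")
    # Assumption: Ambari asks for this header on modifying calls; harmless on GET.
    request.add_header("X-Requested-By", "ambari")
    return request


# Placeholder host and credentials, for illustration only.
req = build_cluster_list_request("ambari.example.com", "admin", "admin")
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen` would actually send it. The sketch stops short of that so it can be read and run without a live cluster.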
Chukwa: Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.
Data Storage
Avro: A data serialization system.
Avro provides:
- Rich data structures.
- A compact, fast, binary data format.
- A container file, to store persistent data.
- Remote procedure call (RPC).
- Simple integration with dynamic languages. Code generation is not required to read or write data files nor to use or implement RPC protocols. Code generation is an optional optimization, only worth implementing for statically typed languages.
Avro provides functionality similar to systems such as Thrift, Protocol Buffers, etc. Avro differs from these systems in the following fundamental aspects.
- Dynamic typing: Avro does not require that code be generated. Data is always accompanied by a schema that permits full processing of that data without code generation, static datatypes, etc. This facilitates construction of generic data-processing systems and languages.
- Untagged data: Since the schema is present when data is read, considerably less type information need be encoded with data, resulting in smaller serialization size.
- No manually-assigned field IDs: When a schema changes, both the old and new schema are always present when processing data, so differences may be resolved symbolically, using field names.
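The three points above can be sketched in a few lines of plain Python. This is a toy model, not the Avro library or its binary format: `encode` and `decode` are hypothetical helpers, and the `User` schemas are made up for illustration. The payload carries values only (untagged), and the reader reconciles the old and new schemas by field name.

```python
def encode(record, writer_schema):
    """Serialize values in schema order; no field names or IDs are stored."""
    return [record[field["name"]] for field in writer_schema["fields"]]


def decode(values, writer_schema, reader_schema):
    """Rebuild a record, resolving writer and reader schemas by field name."""
    written = {
        field["name"]: value
        for field, value in zip(writer_schema["fields"], values)
    }
    record = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in written:
            record[name] = written[name]        # field present in both schemas
        elif "default" in field:
            record[name] = field["default"]     # new field: use reader default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    return record                               # writer-only fields are dropped


# Hypothetical schemas: the new version drops "age" and adds "email".
old_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "age", "type": "int"},
]}
new_schema = {"type": "record", "name": "User", "fields": [
    {"name": "name", "type": "string"},
    {"name": "email", "type": "string", "default": ""},
]}

payload = encode({"name": "Ada", "age": 36}, old_schema)
resolved = decode(payload, old_schema, new_schema)
print(payload)
print(resolved)
```

Because both schemas are on hand at read time, the reader never needs numeric field IDs: a renamed or removed field shows up as a name mismatch and is handled by a default or dropped.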
HBase: Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables — billions of rows X millions of columns — atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS.
Features
- Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables.
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters.
- Thrift gateway and a REST-ful Web service that supports XML, Protobuf, and binary data encoding options.
- Extensible jruby-based (JIRB) shell.
- Support for exporting metrics via the Hadoop metrics subsystem to files or Ganglia; or via JMX.
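To make "versioned, column-oriented" a little more concrete, here is a toy in-memory model of the data layout: a map from row key to column to timestamped values. It is only a sketch of the idea and has nothing to do with the real HBase client API. `ToyTable` and the `family:qualifier` column names are illustrative assumptions.

```python
from collections import defaultdict


class ToyTable:
    """Toy model: row key -> column -> {timestamp: value}."""

    def __init__(self):
        self._rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row, column, value, timestamp):
        # Each write adds a new version instead of overwriting the old one.
        self._rows[row][column][timestamp] = value

    def get(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (latest if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [ts for ts in versions if timestamp is None or ts <= timestamp]
        return versions[max(candidates)] if candidates else None

    def scan(self, start_row, stop_row):
        """Yield row keys in sorted order within [start_row, stop_row)."""
        for row in sorted(self._rows):
            if start_row <= row < stop_row:
                yield row


table = ToyTable()
table.put("user1", "info:email", "a@old.example", timestamp=1)
table.put("user1", "info:email", "a@new.example", timestamp=2)
table.put("user2", "info:email", "b@example.com", timestamp=1)

print(table.get("user1", "info:email"))
print(table.get("user1", "info:email", timestamp=1))
print(list(table.scan("user1", "user2")))
```

Keeping rows sorted by key is what makes range scans cheap, and it is also what allows a table to be split into contiguous key ranges for sharding.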
Hive: Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
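One way custom mappers plug in is HiveQL's `TRANSFORM ... USING` clause, which streams rows to an external script as tab-separated lines and reads rows back in the same format. Below is a minimal sketch of such a mapper. The two-column input (`user_id`, `url`) is a hypothetical layout chosen for illustration.

```python
def transform(lines):
    """Map tab-separated (user_id, url) rows to (user_id, host) rows."""
    for line in lines:
        user_id, url = line.rstrip("\n").split("\t")
        # Strip the scheme, then keep everything before the first "/".
        host = url.split("//", 1)[-1].split("/", 1)[0]
        yield f"{user_id}\t{host}"


# In a real script the input would be sys.stdin and each result would be
# printed; sample rows are used here so the sketch runs on its own.
sample = ["1\thttp://example.com/a/b\n", "2\thttps://hive.apache.org/\n"]
rows = list(transform(sample))
for row in rows:
    print(row)
```

Logic like URL parsing is awkward in a SQL-like language, which is the kind of case where dropping down to a script is more convenient than HiveQL alone.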
Accumulo: The Apache Accumulo sorted, distributed key/value store is a robust, scalable, high performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, Zookeeper, and Thrift. Apache Accumulo features a few novel improvements on the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
Gora: The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.
Although there are various excellent ORM frameworks for relational databases, data modeling in NoSQL data stores differs profoundly from that of their relational cousins. Moreover, data-model agnostic frameworks such as JDO are not sufficient for use cases where one needs the full power of the data models in column stores. Gora fills this gap by giving the user an easy-to-use in-memory data model and persistence framework for big data, with data store specific mappings and built-in Apache Hadoop support.
The overall goal for Gora is to become the standard data representation and persistence framework for big data. The roadmap of Gora can be grouped as follows.
- Data Persistence : Persisting objects to column stores such as HBase, Cassandra, Hypertable; key-value stores such as Voldemort, Redis, etc.; SQL databases such as MySQL, HSQLDB; and flat files in the local file system or Hadoop HDFS.
- Data Access : An easy to use Java-friendly common API for accessing the data regardless of its location.
- Indexing : Persisting objects to Lucene and Solr indexes, accessing/querying the data with Gora API.
- Analysis : Accessing the data and analyzing it through adapters for Apache Pig, Apache Hive and Cascading.
- MapReduce support : Out-of-the-box and extensive MapReduce (Apache Hadoop) support for data in the data store.
ORM stands for Object-Relational Mapping. It is a technology that abstracts the persistence layer (mostly relational databases) so that plain domain-level objects can be used without the cumbersome effort of saving and loading data to and from the database. Gora differs from current solutions in that:
- Gora is specifically focused on NoSQL data stores, but also has limited support for SQL databases.
- The main use case for Gora is to access/analyze big data using Hadoop.
- Gora uses Avro for bean definition, not byte code enhancement or annotations.
- Object-to-data store mappings are backend specific, so that full data model can be utilized.
- Gora is simple since it ignores complex SQL mappings.
- Gora will support persistence, indexing and analysis of data, using Pig, Lucene, Hive, etc.
HCatalog: Apache HCatalog is a table and storage management service for data created using Apache Hadoop.
This includes:
- Providing a shared schema and data type mechanism.
- Providing a table abstraction so that users need not be concerned with where or how their data is stored.
- Providing interoperability across data processing tools such as Pig, Map Reduce, and Hive.
Development Platform
Pig: Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.
At the present time, Pig's infrastructure layer consists of a compiler that produces sequences of Map-Reduce programs, for which large-scale parallel implementations already exist (e.g., the Hadoop subproject). Pig's language layer currently consists of a textual language called Pig Latin, which has the following key properties:
- Ease of programming. It is trivial to achieve parallel execution of simple, "embarrassingly parallel" data analysis tasks. Complex tasks comprised of multiple interrelated data transformations are explicitly encoded as data flow sequences, making them easy to write, understand, and maintain.
- Optimization opportunities. The way in which tasks are encoded permits the system to optimize their execution automatically, allowing the user to focus on semantics rather than efficiency.
- Extensibility. Users can create their own functions to do special-purpose processing.
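The "data flow sequence" idea can be imitated in plain Python. This is only an analogy and not Pig Latin: each named step consumes the output of the previous one, roughly in the spirit of a filter, a group and a per-group aggregate. The page-view records and the 100 ms threshold are made up for illustration.

```python
from collections import defaultdict

# Hypothetical input: (user, page, response time in milliseconds).
records = [
    ("alice", "/home", 120),
    ("bob", "/home", 80),
    ("alice", "/search", 300),
    ("carol", "/home", 150),
]

# Step 1: keep only slow requests (a filter step).
slow = [r for r in records if r[2] >= 100]

# Step 2: group response times by user (a group step).
grouped = defaultdict(list)
for user, _page, millis in slow:
    grouped[user].append(millis)

# Step 3: compute per-user request count and mean time (an aggregate step).
summary = {
    user: (len(times), sum(times) / len(times))
    for user, times in grouped.items()
}

print(summary)
```

Because every step is a transformation of a whole data set rather than a loop over shared state, a system like Pig is free to reorder, combine and parallelize the steps.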
Bigtop: Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.
The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc…) developed by a community with a focus on the system as a whole, rather than individual projects.
RHIPE: RHIPE (hree-pay') is the R and Hadoop Integrated Programming Environment. It means "in a moment" in Greek. RHIPE is a merger of R and Hadoop. R is the widely used, highly acclaimed interactive language and environment for data analysis. Hadoop consists of the Hadoop Distributed File System (HDFS) and the MapReduce distributed compute engine. RHIPE allows an analyst to carry out D&R (divide and recombine) analysis of complex big data wholly from within R. RHIPE communicates with Hadoop to carry out the big, parallel computations.
R/Hadoop: The aim of this project is to provide easy to use R interfaces to the open source distributed computing environment Hadoop including Hadoop Streaming and the Hadoop Distributed File System.
Data Transferring Tool
Sqoop: Apache Sqoop is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases. You can use Sqoop to import data from external structured datastores into Hadoop Distributed File System or related systems like Hive and HBase. Conversely, Sqoop can be used to extract data from Hadoop and export it to external structured datastores such as relational databases and enterprise data warehouses.
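As a hedged illustration of an import, the helper below assembles a Sqoop 1 style command line. The JDBC URL, table name, user and target directory are placeholders, and `sqoop_import_command` is a hypothetical helper. The command is only printed, not executed, and password handling is deliberately left out.

```python
import shlex


def sqoop_import_command(jdbc_url, table, target_dir, username, num_mappers=4):
    """Return the argument list for importing one table into HDFS."""
    return [
        "sqoop", "import",
        "--connect", jdbc_url,                # source database
        "--username", username,
        "--table", table,                     # table to import
        "--target-dir", target_dir,           # destination directory in HDFS
        "--num-mappers", str(num_mappers),    # parallel import tasks
    ]


cmd = sqoop_import_command(
    "jdbc:mysql://db.example.com/sales", "orders", "/data/sales/orders", "etl"
)
print(shlex.join(cmd))
```

Building the command as a list rather than a string keeps quoting safe if it is later passed to `subprocess.run`.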
Flume: Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.
Workflow & Pipeline
Oozie: Oozie is a workflow scheduler system to manage Apache Hadoop jobs.
- Oozie Workflow jobs are Directed Acyclic Graphs (DAGs) of actions.
- Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availability.
- Oozie is integrated with the rest of the Hadoop stack supporting several types of Hadoop jobs out of the box (such as Java map-reduce, Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system specific jobs (such as Java programs and shell scripts).
- Oozie is a scalable, reliable and extensible system.
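The workflow and coordinator ideas above can be sketched with the standard library. This is a conceptual model, not Oozie's XML workflow format or its API: the action names, `execution_order`, `is_acyclic` and `coordinator_ready` are all hypothetical. It needs Python 3.9 or later for `graphlib`.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each action maps to the actions it must wait for.
workflow = {
    "import-orders": set(),
    "import-users": set(),
    "join": {"import-orders", "import-users"},
    "report": {"join"},
}


def execution_order(dag):
    """Return one valid order in which the actions could run."""
    return list(TopologicalSorter(dag).static_order())


def is_acyclic(dag):
    """A workflow is only valid if its actions form a DAG."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


def coordinator_ready(now, next_run, required_inputs, available_inputs):
    """Trigger only when the time has come AND all input data is present."""
    return now >= next_run and set(required_inputs) <= set(available_inputs)


order = execution_order(workflow)
print(order)
print(is_acyclic({"a": {"b"}, "b": {"a"}}))
print(coordinator_ready(10, 9, ["/data/day1"], ["/data/day1", "/data/day0"]))
```

The two import actions have no dependency on each other, so a scheduler is free to run them in parallel. Only their relative order to `join` and `report` is fixed.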
Crunch: The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Running on top of Hadoop MapReduce, the Apache Crunch library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
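To see why a join is tedious on plain MapReduce, here is a small in-memory imitation of a reduce-side inner join. It is a sketch in Python, not the Crunch API or Hadoop code. The `users` and `orders` data sets and the three phase functions are made up for illustration.

```python
from collections import defaultdict
from itertools import product


def map_phase(users, orders):
    """Tag each record with its source so the reducer can tell them apart."""
    for user_id, name in users:
        yield user_id, ("user", name)
    for user_id, amount in orders:
        yield user_id, ("order", amount)


def shuffle(pairs):
    """Group tagged values by join key, as the framework would."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_phase(groups):
    """Pair every user record with every order record sharing the key."""
    for key in sorted(groups):
        names = [value for tag, value in groups[key] if tag == "user"]
        amounts = [value for tag, value in groups[key] if tag == "order"]
        for name, amount in product(names, amounts):
            yield key, name, amount


users = [(1, "ada"), (2, "bob")]
orders = [(1, 30), (1, 12), (3, 99)]
joined = list(reduce_phase(shuffle(map_phase(users, orders))))
print(joined)
```

The tagging, grouping and pairing all have to be written by hand here. That is the kind of boilerplate a higher-level pipeline library aims to hide behind a single join operation.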
This article is my original work. If you repost it, please keep it complete and credit 《四火的唠叨》 as the source.