The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.
Running on top of Hadoop MapReduce and Apache Spark, the Apache Crunch™ library is a simple Java API for tasks like joining and data aggregation that are tedious to implement on plain MapReduce. The APIs are especially useful when processing data that does not fit naturally into the relational model, such as time series, serialized object formats like protocol buffers or Avro records, and HBase rows and columns. For Scala users, there is the Scrunch API, which is built on top of the Java APIs and includes a REPL (read-eval-print loop) for creating MapReduce pipelines.
Data Model and Operators
Crunch's Java API is centered around three interfaces that represent distributed datasets: PCollection, PTable, and PGroupedTable.
A PCollection<T> represents a distributed, immutable collection of elements of type T. For example, we represent a text file as a PCollection<String> object. PCollection<T> provides a method, parallelDo, that applies a DoFn to each element in the PCollection<T> in parallel, and returns a new PCollection<U> as its result.
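For example, a minimal sketch of a parallelDo call that tokenizes lines into words; the lines collection and the whitespace tokenization are illustrative assumptions:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.types.writable.Writables;

// Apply a DoFn to every element in parallel, emitting one output per word.
// "lines" is assumed to be a PCollection<String> read from a text file.
PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
  @Override
  public void process(String line, Emitter<String> emitter) {
    for (String word : line.split("\\s+")) {
      emitter.emit(word);
    }
  }
}, Writables.strings());
```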
A PTable<K, V> is a sub-interface of PCollection<Pair<K, V>> that represents a distributed, unordered multimap of its key type K to its value type V. In addition to the parallelDo operation, PTable provides a groupByKey operation that aggregates all of the values in the PTable that have the same key into a single record. It is the groupByKey operation that triggers the sort phase of a MapReduce job. Developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used during the shuffle by providing an instance of the GroupingOptions class to the groupByKey function.
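As a sketch of that shuffle tuning, assuming a counts table of type PTable<String, Long> and an arbitrary reducer count:

```java
import org.apache.crunch.GroupingOptions;
import org.apache.crunch.PGroupedTable;

// Control the shuffle explicitly: here, just the number of reducers.
// Partitioner, grouping comparator, and sort comparator classes can be
// configured on the same builder.
GroupingOptions options = GroupingOptions.builder()
    .numReducers(10)
    .build();
PGroupedTable<String, Long> grouped = counts.groupByKey(options);
```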
The result of a groupByKey operation is a PGroupedTable<K, V> object, which is a distributed, sorted map of keys of type K to an Iterable that may be iterated over exactly once. In addition to parallelDo processing via DoFns, PGroupedTable provides a combineValues operation that allows a commutative and associative Aggregator to be applied to the values of the PGroupedTable instance on both the map and reduce sides of the shuffle. A number of common Aggregator<V> implementations are provided in the Aggregators class.
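For instance, summing grouped values with one of the built-in aggregators might look like this sketch; the wordToOne table of (word, 1L) pairs is an assumption:

```java
import org.apache.crunch.PTable;
import org.apache.crunch.fn.Aggregators;

// The aggregator runs as a combiner on the map side and again on the
// reduce side, which is safe because SUM_LONGS is commutative and associative.
PTable<String, Long> wordCounts = wordToOne
    .groupByKey()
    .combineValues(Aggregators.SUM_LONGS());
```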
Finally, PCollection, PTable, and PGroupedTable all support a union operation, which takes a series of distinct PCollections that all have the same data type and treats them as a single virtual PCollection.
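A one-line sketch, with the two input collections assumed:

```java
// Both inputs must have the same element type; the result behaves as one
// logical PCollection spanning both datasets.
PCollection<String> allLogs = logs2019.union(logs2020);
```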
To recap, the three abstract interfaces for distributed datasets are PCollection, PTable, and PGroupedTable:
1) PCollection<T> represents a distributed, immutable dataset. It provides the parallelDo and union methods; parallelDo applies a DoFn to each element and returns a new PCollection<U>.
2) PTable<K, V> is a PCollection<Pair<K, V>> that represents a distributed, unsorted multimap. In addition to parallelDo inherited from PCollection, it overrides union and adds groupByKey, which corresponds to the sort phase of a MapReduce job. Within groupByKey, developers can exercise fine-grained control over the number of reducers and the partitioning, grouping, and sorting strategies used in the shuffle (see the GroupingOptions class).
3) PGroupedTable<K, V> is the result of a groupByKey operation: a distributed, sorted map whose values can be iterated over exactly once, effectively a PCollection<Pair<K, Iterable<V>>>. In addition to parallelDo and union inherited from PCollection, it provides combineValues, which allows a commutative and associative aggregator (see the Aggregators class) to be applied to the values of a PGroupedTable instance on both the map and reduce sides of the shuffle.
All of the other data transformation operations supported by the Crunch APIs (aggregations, joins, sorts, secondary sorts, and cogrouping) are implemented in terms of these four primitives. The patterns themselves are defined in the org.apache.crunch.lib package and its children, and a few of the most common patterns have convenience functions defined on the PCollection and PTable interfaces.
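For example, an inner join from the org.apache.crunch.lib package might be sketched as follows; the orders and customers tables are hypothetical:

```java
import org.apache.crunch.PTable;
import org.apache.crunch.Pair;
import org.apache.crunch.lib.Join;

// Inner join on the shared key type; the result pairs each matching value
// from the left table with one from the right table.
PTable<String, Pair<String, String>> joined = Join.join(orders, customers);
```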
Every Crunch data pipeline is coordinated by an instance of the Pipeline interface, which defines methods for reading data into a pipeline via Source instances and writing data out from a pipeline to Target instances. There are currently three implementations of the Pipeline interface available for developers to use (a complete example follows the list below):
MRPipeline: Executes the pipeline as a series of MapReduce jobs.
MemPipeline: Executes the pipeline in-memory on the client.
SparkPipeline: Executes the pipeline by converting it to a series of Spark pipelines.
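Putting the primitives together, here is a minimal end-to-end word-count sketch using MRPipeline; the class name and argument paths are illustrative:

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    // MRPipeline compiles the logical plan into one or more MapReduce jobs.
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Tokenize, then count occurrences of each word; count() is one of the
    // convenience methods built on top of parallelDo and groupByKey.
    PTable<String, Long> counts = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings()).count();

    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();  // plan, submit, and wait for the jobs to finish
  }
}
```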
Apache Crunch is an implementation of FlumeJava. For MapReduce programs that are awkward to develop and use directly, it offers an MR pipeline layer: a data representation model, primitive and higher-level operations, and an execution planner that optimizes the MapReduce jobs for the underlying engine. From a distributed-computing perspective, many of the computational primitives Crunch provides have close analogues in Spark, Hive, and Pig, while its data reading and writing, serialization, grouping, sorting, and aggregation, and its decomposition of work into MapReduce-style stages, all have counterparts in Hadoop itself.