With the explosion of Hadoop and big data usage, many people are looking for ways to convert their existing implementations to MapReduce. Unfortunately, with the notable exceptions of "Data-Intensive Text Processing with MapReduce" and "Mahout in Action", very few publications are dedicated to designing MapReduce implementations. In his new article, "MapReduce Patterns, Algorithms, and Use Cases", Ilya Katsov provides a systematic overview of problems that can be solved using a MapReduce framework.
The article starts with a fairly straightforward use of MapReduce as a general-purpose parallel execution framework, applicable to many implementations that need large clusters for compute- and data-intensive calculations, including physical and engineering simulations, numerical analysis, performance testing, etc. The next group of algorithms, commonly used in log analysis, ETL and data querying, includes counting and summing, data collating (based on specific functions), filtering, parsing, validation and sorting.
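The counting and summing pattern can be sketched in a few lines of plain Python. This is only an illustration, not code from Katsov's article: `run_mapreduce` is a hypothetical in-memory stand-in for a real framework such as Hadoop, and the sample log lines are made up.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Hypothetical in-memory stand-in for a MapReduce job: map, shuffle, reduce."""
    groups = defaultdict(list)
    for record in records:
        # Map phase: each record may emit any number of (key, value) pairs.
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: one reducer call per key, keys visited in sorted order.
    return {key: reducer(key, values) for key, values in sorted(groups.items())}


def count_mapper(line):
    """Emit (token, 1) for every whitespace-separated token in a line."""
    for token in line.split():
        yield token.lower(), 1


def sum_reducer(key, values):
    """Sum all the counts emitted for one key."""
    return sum(values)


# Made-up sample input standing in for log lines.
log_lines = ["GET /index", "POST /login", "GET /about"]
counts = run_mapreduce(log_lines, count_mapper, sum_reducer)
```

In a real cluster the shuffle step is performed by the framework across machines; here a dictionary plays that role so the shape of the mapper and reducer stays visible.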
The second large group of MapReduce patterns discussed by Katsov comprises relational patterns, often used by data warehousing applications. These patterns are widely leveraged by Hive and Pig implementations and include predicate/function-based data selection, data projection, data union, difference and intersection, and groupBy aggregations. A separate discussion is dedicated to implementing data joins and includes such algorithms as repartition joins and replicated joins.
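As a rough illustration of a repartition (reduce-side) join, the sketch below tags every record with its source, groups the tagged records by join key, and pairs them up in the reduce step. It is a single-process approximation under assumed inputs, not the article's pseudocode: the function name `repartition_join`, the `"L"`/`"R"` tags and the sample tables are all hypothetical.

```python
from collections import defaultdict
from itertools import product


def repartition_join(left, right):
    """Inner join of two lists of (key, payload) records, reduce-side style."""
    # Map phase: emit (join_key, (source_tag, payload)) for every record.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]

    # Shuffle phase: a framework would route all values for a key to one reducer.
    groups = defaultdict(list)
    for key, tagged_value in tagged:
        groups[key].append(tagged_value)

    # Reduce phase: split each group by tag and emit the cross product.
    joined = []
    for key in sorted(groups):
        lefts = [v for tag, v in groups[key] if tag == "L"]
        rights = [v for tag, v in groups[key] if tag == "R"]
        joined.extend((key, l, r) for l, r in product(lefts, rights))
    return joined


# Made-up sample tables keyed by a user id.
users = [(1, "alice"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
result = repartition_join(users, orders)
```

A replicated join avoids the shuffle by loading the smaller data set into memory on every mapper, which only works when that data set is small enough to fit.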
Moving further up the chain of complexity, the article discusses more advanced MapReduce processing algorithms, including graph processing, search algorithms (breadth-first search), PageRank and data aggregation algorithms that can be leveraged in graph analysis, web indexing and general search applications. It also covers common text analysis and market analysis use cases requiring cross-correlation calculation. This part covers both the "pairs" and "stripes" design patterns and their comparative merits.
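The difference between "pairs" and "stripes" is easiest to see side by side. The sketch below computes item co-occurrence counts both ways in a single process; it is an assumption-laden illustration, not the article's pseudocode, and the function names and sample baskets are hypothetical. In the pairs approach each emitted key is an item pair; in the stripes approach each key is a single item and the value is a map of its neighbours.

```python
from collections import Counter, defaultdict
from itertools import permutations


def cooccurrence_pairs(baskets):
    """Pairs: emit ((a, b), 1) per ordered pair; the reducer sums per pair."""
    counts = Counter()
    for basket in baskets:
        for a, b in permutations(set(basket), 2):
            counts[(a, b)] += 1
    return counts


def cooccurrence_stripes(baskets):
    """Stripes: emit (a, {b: 1, ...}) per item; the reducer merges the maps."""
    stripes = defaultdict(Counter)
    for basket in baskets:
        items = set(basket)
        for a in items:
            # Counter.update adds 1 for every neighbour seen in this basket.
            stripes[a].update(b for b in items if b != a)
    return stripes


# Made-up sample input: each basket is one transaction.
baskets = [["milk", "bread"], ["milk", "bread", "eggs"], ["eggs", "milk"]]
pairs = cooccurrence_pairs(baskets)
stripes = cooccurrence_stripes(baskets)
```

Both functions produce the same counts. Pairs generates many small records and needs almost no per-key memory, while stripes generates fewer, larger records but must hold a whole neighbour map for one item in memory at a time.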
Finally, Katsov provides a good bibliography of more complex MapReduce implementations in the field of machine learning.
Most of the algorithms described in the article are accompanied by pseudocode, basic information on their applicability, advantages and disadvantages, and some real-world use cases.
Many people today are still struggling with the applicability of Hadoop and MapReduce to their business problems. Some still consider it a "technical approach in search of a business problem". The article is an important step in filling an existing void in the field of MapReduce algorithms, use cases and design patterns. It shows MapReduce's power far beyond the infamous "word count" and the ways it can be leveraged to solve a wide range of practical problems.
Posted by Boris Lublinsky
http://www.infoq.com/news/2012/02/MapReducePatterns