MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. It is based on the divide-and-conquer principle. A MapReduce program is composed of a Map() function and a Reduce() function. The Map step analyzes one chunk of the data and emits intermediate results. The Reduce step receives the results emitted by the Map step and computes the final result the programmer wants. It is a popular framework in industry for implementing large-scale data mining algorithms. This page collects computer science papers from different years that introduce algorithms or principles based on MapReduce.
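The Map and Reduce roles described above can be illustrated with a minimal single-process sketch. It is a toy word count, not Hadoop or any real framework's API: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the grouping step that a real cluster performs across machines is simulated here with an in-memory dictionary.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(chunk: str) -> Iterator[tuple[str, int]]:
    """Map step: analyze one chunk of text and emit (word, 1) pairs."""
    for word in chunk.lower().split():
        yield word, 1


def reduce_counts(key: str, values: list[int]) -> int:
    """Reduce step: combine every value emitted for one key."""
    return sum(values)


def run_mapreduce(
    chunks: Iterable[str],
    mapper: Callable[[str], Iterable[tuple[str, int]]],
    reducer: Callable[[str, list[int]], int],
) -> dict[str, int]:
    """Hypothetical driver: run the mapper on each chunk, group by key, reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for chunk in chunks:
        # On a cluster, each chunk would be handled by a separate Map task.
        for key, value in mapper(chunk):
            groups[key].append(value)
    # On a cluster, each key group would be sent to a Reduce task.
    return {key: reducer(key, values) for key, values in groups.items()}


chunks = ["map reduce map", "reduce shuffle", "map"]
counts = run_mapreduce(chunks, map_words, reduce_counts)
print(counts)
```

Because each chunk is mapped independently and each key is reduced independently, both loops could in principle be spread over many machines, which is the property the papers listed below rely on.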
Ads & E-commerce
Improving ad relevance in sponsored search
Predicting the Click-Through Rate for Rare/New Ads
Learning Influence Probabilities in Social Networks
Mining advertiser-specific user behavior using adfactors
Extracting user profiles from large scale data
Large-Scale Behavioral Targeting (2009)
Search Advertising using Web Relevance Feedback (2008)
Predicting Ads’ ClickThrough Rate with Decision Rules (2008)
*A stochastic learning-to-rank algorithm and its application to contextual advertising (2011)
*Learning website hierarchies for keyword enrichment in contextual advertising (2011)
Astronomy
*Algorithms for Large-Scale Astronomical Problems (2011)
Social Networks
*Social Content Matching in MapReduce (2011)
*Parallel Knowledge Community Detection Algorithm Research Based on MapReduce (2011)
*Large-Scale Community Detection on YouTube for Topic Discovery and Exploration (2011)
Bioinformatics/Medical Informatics
A novel approach to multiple sequence alignment using hadoop data grids
MapReduce-Based Pattern Finding Algorithm Applied in Motif Detection for Prescription Compatibility Network (2009)
MrsRF: an efficient MapReduce algorithm for analyzing large collections of evolutionary trees
*HBase, MapReduce, and Integrated Data Visualization for Processing Clinical Signal Data (2011)
*Accelerating statistical image reconstruction algorithms for fan-beam x-ray CT using cloud computing (2011)
Machine Translation
Training Phrase-Based Machine Translation Models on the Cloud: Open Source Machine Translation Toolkit Chaski
Grammar based statistical MT on Hadoop (2009)
Large Language Models in Machine Translation (2008)
*Fast, Easy and Cheap: Construction of Statistical Machine Translation Models with Mapreduce
Spatial Data Processing
Experiences on Processing Spatial Data with MapReduce
*Scalable spatio-temporal knowledge harvesting (2011)
Information Extraction and Text Processing
Statistical Sentence Chunking Using Map Reduce
Data-intensive text processing with MapReduce
Web-Scale Distributional Similarity and Entity Set Expansion (2009)
The infinite HMM for unsupervised PoS tagging (2009)
*Batch Text Similarity Search with MapReduce (2011)
*An Empirical Study of Massively Parallel Bayesian Networks Learning for Sentiment Extraction from Unstructured Text (2011)
*EntityTagger: automatically tagging entities with descriptive phrases (2011)
Artificial Intelligence/Machine Learning/Data Mining
LogMaster: Mining Event Correlations in Logs of Large Scale Cluster Systems
Stateful Bulk Processing for Incremental Analytics
Mining dependency in distributed systems through unstructured logs analysis
Beyond online aggregation: parallel and incremental data mining with online mapreduce
Learning based opportunistic admission control algorithm for mapreduce as a service
OWL reasoning with WebPIE: calculating the closure of 100 billion triples
Scaling ECGA model building via data-intensive computing
SPARQL basic graph pattern processing with iterative mapreduce
Residual Splash for Optimally Parallelizing Belief Propagation
Stochastic gradient boosted distributed decision trees
Distributed Algorithms for Topic Models
When Huge is Routine: Scaling Genetic Algorithms and Estimation of Distribution Algorithms via Data-Intensive Computing
Cloud Computing Boosts Business Intelligence of Telecommunication Industry
Parallel K-Means Clustering Based on MapReduce
Large-scale multimedia semantic concept modeling using robust subspace bagging and MapReduce
Parallel algorithms for mining large-scale rich-media data
Scaling Simple and Compact Genetic Algorithms using MapReduce
Scalable Distributed Reasoning using Mapreduce
Scaling Up Classifiers to Cloud Computers (2008)
*Preliminary Results on Using Matching Algorithms in Map-Reduce Applications (2011)
*Improving the Effectiveness of Statistical Feature Selection Algorithms Using Bag of Synsets and its Parallelization (2011)
*Tri-training and MapReduce-based massive data learning (2011)
*Parallel evolutionary approach of compaction problem using mapreduce (2011)
*COMET: A Recipe for Learning and Using Large Ensembles on Massive Data (2011)
*Parallelized K-Means clustering algorithm for self aware mobile ad-hoc networks (2011)
- For an example of Parallel Machine Learning with Hadoop/Mapreduce, check out our
- .
Search Query Analysis
Parallelizing Random Walk with Restart for large-scale query recommendation
BBM: Bayesian Browsing Model from Petabyte-scale Data (2009)
AIDE: Ad-hoc Intents Detection Engine over Query Logs (2009)
Information Retrieval (Search)
Automatically Incorporating New Sources in Keyword Search-Based Data Integration
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Learning URL patterns for webpage de-duplication
Information Seeking with Social Signals: Anatomy of a Social Tag-based EXploratory Search Browser
MIREX: Mapreduce Information Retrieval Experiments
Efficient Clustering of Web Derived Data Sets
The PageRank algorithm and application on searching of academic papers
A Parallel Algorithm for Finding Related Pages in the Web by Using Segmented Link Structures
On Single-Pass Indexing with MapReduce (2009)
A Data Parallel Algorithm for XML DOM Parsing (2009)
Semantic Sitemaps: Efficient and Flexible Access to Datasets on the Semantic Web (2008)
*Scalable knowledge harvesting with high precision and high recall (2011)
*MapReduce indexing strategies: Studying scalability and efficiency (2011)
*Ranking on large-scale graphs with rich metadata (2011)
*Distributed Index for Near Duplicate Detection (2011)
*SPRINT: ranking search results by paths (2011)
*Bagging Gradient-Boosted Trees for High Precision, Low Variance Ranking Models (2011)
*Sparse hidden-dynamics conditional random fields for user intent understanding (2011)
- For more about mapreduce in information retrieval, check out our presentation.
Spam & Malware Detection
Characterizing Botnets from Email Spam Records (2008)
- Clustering of emails into spam campaigns
- Finding the probability that 2 spam messages are sent from the same machine
- Estimating the likelihood of botnets based on common senders in spam campaigns
The Ghost in the Browser: Analysis of Web-based Malware (2007)
Image and Video Processing
Font rendering on a GPU-based raster image processor
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
- Video Stream Re-Rendering
Map-Reduce Meets Wider Varieties of Applications (2008)
- Location detection in images
*Counting triangles and the curse of the last reducer (2011)
*Adapting Skyline Computation to the MapReduce Framework: Algorithms and Experiments (2011)
Networking
Reducible Complexity in DNS
Simulation
Map-Reduce Meets Wider Varieties of Applications (2008)
- Simulation of earthquakes (geology)
Statistics
User-based collaborative filtering recommendation algorithms on hadoop
Brute Force and Indexed Approaches to Pairwise Document Similarity Comparisons with MapReduce (2009)
Fast Parallel Outlier Detection for Categorical Datasets using Mapreduce (2009)
MapReduce Optimization Using Regulated Dynamic Prioritization (2009)
- Digg.com story recommendations
Calculating the Jaccard Similarity Coefficient with Map Reduce for Entity Pairs in Wikipedia (2008)
- Measuring Wikipedia Editor similarity
Map-Reduce Meets Wider Varieties of Applications (2008)
- Netflix video recommendation
Large-scale Parallel Collaborative Filtering for the Netflix Prize (2008)
Numerical Mathematics
Distributed non-negative matrix factorization for dyadic data analysis on mapreduce
A mapreduce algorithm for SC
Multi-GPU Volume Rendering using MapReduce
Mapreduce for Integer Factorization
*Large-Scale Matrix Factorization with Distributed Stochastic Gradient Descent (2011)
Sets & Graphs
Towards scalable RDF graph analytics on MapReduce
Efficient Parallel Set-Similarity Joins using Mapreduce
Max-cover algorithm in map-reduce
Distributed Algorithm for Computing Formal Concepts Using Map-Reduce Framework
Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce
Graph Twiddling in a MapReduce World
DOULION: Counting Triangles in Massive Graphs with a Coin (2009)
Fast counting of triangles in real-world networks: proofs, algorithms and observations (2008)
*Filtering: A Method for Solving Graph Problems in MapReduce (2011)
*Colorful Triangle Counting and a MapReduce Implementation (2011)
*Mining Large Graphs: Algorithms, Inference, and Discoveries (2011)
*On labeled paths (2011)
*HADI: Mining radii of large graphs (2011)
*Towards Efficient Subgraph Search in Cloud Computing Environment (2011)
via:
http://atbrox.com/2011/05/16/mapreduce-hadoop-algorithms-in-academic-papers-4th-update-may-2011/