重温Hadoop（2）-- MapReduce流程及partition -

defungo

浏览: 82947 次
性别:
来自: 北京

最近访客更多访客>>

csyfly2003

david_xu

melin

biyelei

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

重温Hadoop（2）-- MapReduce流程及partition

博客分类：

分布式
hadoop

1.map(K1, V1) –> list (K2, V2) // 对输入数据进行抽取过滤排序等操作

2.combine(K2, list(V2)) –> list(K2, V2) // 为了减少reduce的输入，需要在map端对输出进行预处理，类似3.reduce。不是所有的reduce都在部分数据集上有效，比如求平均值就不能简单用于combine

4.partition(K2, V2) –> integer //将key划分配到不同reduce分区，返回分区索引号。分区内的key会排序，相同的键的所有值会合成一个组（list(V2)）

5.reduce(K2, list(V2)) –> list(K3, V3) //每个reduce会处理具有某些特性的键，每个键上都有值的序列，是通过对所有map输出的值进行统计得来的；当获得一个分区后，tasktracker会对每条记录调用reduce。

默认的map和reduce函数是IdentityMapper和IdentityReducer，均是泛型类型，简单的将所有输入写到输出中。默认的 partitioner是HashPartitioner，对每天记录的键进行哈希操作以决定该记录属于那个分区让reduce处理。

partition:

/**
 * Licensed to the Apache Software Foundation (ASF) under one
 * or more contributor license agreements.  See the NOTICE file
 * distributed with this work for additional information
 * regarding copyright ownership.  The ASF licenses this file
 * to you under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance
 * with the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

package org.apache.hadoop.mapreduce;

/** 
 * Partitions the key space.
 * 
 * <p><code>Partitioner</code> controls the partitioning of the keys of the 
 * intermediate map-outputs. The key (or a subset of the key) is used to derive
 * the partition, typically by a hash function. The total number of partitions
 * is the same as the number of reduce tasks for the job. Hence this controls
 * which of the <code>m</code> reduce tasks the intermediate key (and hence the 
 * record) is sent for reduction.</p>
 * 
 * @see Reducer
 */
public abstract class Partitioner<KEY, VALUE> {
  
  /** 
   * Get the partition number for a given key (hence record) given the total 
   * number of partitions i.e. number of reduce-tasks for the job.
   *   
   * <p>Typically a hash function on a all or a subset of the key.</p>
   *
   * @param key the key to be partioned.
   * @param value the entry value.
   * @param numPartitions the total number of partitions.
   * @return the partition number for the <code>key</code>.
   */
  public abstract int getPartition(KEY key, VALUE value, int numPartitions);
  
}

可以看到只有一个方法，自定义pratition的时候，只需实现getPartition方法，getPartition返回分配到的reducer的号，大小从0到reducer数。

分享到：

nutch源码阅读(1)-Crawl | 重温Hadoop（1）--Mapredure

2013-05-16 10:25
浏览 1048
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

重温Hadoop（2）-- MapReduce流程及partition

评论

发表评论

相关推荐

最近访客 更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论

重温Hadoop（2）-- MapReduce流程及partition

评论

发表评论

相关推荐

eclipse 编译hadoop源码

(转)hadoop/mapred 优化方法

Hadoop重温(4)--ipc

重温hadoop(3)--序列化

重温Hadoop（1）--Mapredure

重温Hadoop（1）--Mapredure

hadoop问题解决

HDFS的基本概念

微博 请问你是怎么优化数据库的？

用sharding技术来扩展你的数据库

最近访客更多访客>>

微博请问你是怎么优化数据库的？