[Notes] Analysis of the Hadoop MapReduce InputFormat

GQM

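The programming interface layer of Hadoop MapReduce exposes five programmable components: InputFormat, Mapper, Partitioner, Reducer, and OutputFormat.

InputFormat
InputFormat describes the format of the input data and provides two functions:

Data splitting: the input data is divided into a number of splits, and each split is dispatched to one Map task.

Record identification: a RecordReader is created and used to parse the records (key/value pairs) out of a split. The RecordReader is initialized before the Mapper starts consuming the split, and each record becomes one input to the Mapper's map function.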

public abstract List<InputSplit> getSplits(JobContext context)
    throws IOException, InterruptedException;

public abstract RecordReader<K,V> createRecordReader(InputSplit split,
                                                     TaskAttemptContext context)
    throws IOException, InterruptedException;

getSplits:

Quote:

Logically split the set of input files for the job.
Each InputSplit is then assigned to an individual Mapper for processing.
Note: The split is a logical split of the inputs and the input files are not physically split into chunks. For e.g. a split could be <input-file-path, start, offset> tuple. The InputFormat also creates the RecordReader to read the InputSplit.

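The split is purely logical: the input is not cut into pieces and stored separately on disk. An InputSplit only records the metadata of a split (start offset, length, the list of nodes that hold the data, and so on).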
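
To make the idea of a logical split concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The `Split` class and the `split` helper are hypothetical stand-ins for FileSplit and getSplits, and the cutting rule is deliberately naive (fixed-size pieces, no slop factor). The point is only that a split is a small metadata record and the file itself is never copied.

```java
import java.util.ArrayList;
import java.util.List;

public class LogicalSplitSketch {
    /** Hypothetical stand-in for FileSplit: metadata only, no file contents. */
    public static final class Split {
        public final String path;
        public final long start;
        public final long length;

        public Split(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "<" + path + ", " + start + ", " + length + ">";
        }
    }

    /** Describes a file of fileLength bytes as consecutive (start, length) ranges. */
    public static List<Split> split(String path, long fileLength, long splitSize) {
        List<Split> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += splitSize) {
            // The last range is shortened so it does not run past the end of the file.
            splits.add(new Split(path, start, Math.min(splitSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        // "/data/in.txt" is a made-up path used only for illustration.
        for (Split s : split("/data/in.txt", 10, 4)) {
            System.out.println(s);
        }
    }
}
```

Each Map task would later open the same file and read only the byte range named by its split.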
createRecordReader:

Quote:

Create a record reader for a given split. The framework will call RecordReader.initialize(InputSplit, TaskAttemptContext) before the split is used.
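
The call order described in the quote (create the reader, initialize it, then pull records one at a time) can be sketched as follows. `SimpleLineReader` is a hypothetical, heavily simplified reader over an in-memory string; only the method names initialize, nextKeyValue, getCurrentKey, and getCurrentValue mirror the real RecordReader class. The sketch assumes the split starts and ends on line boundaries, which the real LineRecordReader does not require.

```java
import java.util.ArrayList;
import java.util.List;

public class RecordReaderLifecycleSketch {
    /** Hypothetical simplified reader: key = byte offset of the line, value = the line. */
    public static final class SimpleLineReader {
        private String data;
        private int pos;
        private int end;
        private long key;
        private String value;

        /** Called once by the framework before any record is read. */
        public void initialize(String data, int start, int length) {
            this.data = data;
            this.pos = start;
            this.end = start + length;
        }

        /** Advances to the next record; returns false when the split is exhausted. */
        public boolean nextKeyValue() {
            if (pos >= end) {
                return false;
            }
            int newline = data.indexOf('\n', pos);
            int lineEnd = (newline < 0 || newline >= end) ? end : newline;
            key = pos;
            value = data.substring(pos, lineEnd);
            pos = lineEnd + 1; // skip the newline itself
            return true;
        }

        public long getCurrentKey() { return key; }

        public String getCurrentValue() { return value; }
    }

    /** Mimics what the framework does around a Mapper: initialize, then loop. */
    public static List<String> runMapper(String data, int start, int length) {
        SimpleLineReader reader = new SimpleLineReader();
        reader.initialize(data, start, length);
        List<String> records = new ArrayList<>();
        while (reader.nextKeyValue()) {
            // A real Mapper would receive (key, value) in its map function here.
            records.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
        return records;
    }

    public static void main(String[] args) {
        System.out.println(runMapper("ab\ncd\nef", 0, 8));
    }
}
```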

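Figure: an example based on FileInputFormat.

Figure: subclass hierarchy of InputFormat (org.apache.hadoop.mapreduce).

Analysis of TextInputFormat

File splitting algorithm

The file splitting algorithm determines how many InputSplits are produced and which byte range each InputSplit covers. TextInputFormat extends FileInputFormat, which generates InputSplits file by file: each input file is split on its own, so a split never spans two files.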

Quote:

protected long computeSplitSize(long blockSize, long minSize,
                                long maxSize) {
  return Math.max(minSize, Math.min(maxSize, blockSize));
}

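computeSplitSize takes three inputs: blockSize, minSize, and maxSize.
blockSize: the size of the file's blocks in HDFS; the default is 64 MB, set by dfs.block.size.
minSize: the minimum size of an InputSplit, set by the configuration parameter mapred.min.split.size; the default is 1.
maxSize: the maximum size of an InputSplit, set by the configuration parameter mapred.max.split.size; the default is Long.MAX_VALUE.
With the defaults, splitSize therefore equals blockSize. Once splitSize is fixed, FileInputFormat cuts the file sequentially into InputSplits of splitSize bytes each, and whatever remains at the end becomes one final InputSplit. Because the loop below continues only while the remaining bytes exceed SPLIT_SLOP times splitSize, that final split can be slightly larger than splitSize, not only smaller.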
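
A few worked values may help show how the three parameters interact. The sketch below copies the one-line formula quoted above into a standalone class; the class name and the sample sizes are chosen only for illustration.

```java
public class SplitSizeSketch {
    static final long MB = 1024L * 1024L;

    /** Same formula as the quoted computeSplitSize. */
    public static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long block = 64 * MB;
        // Defaults: the split size follows the block size.
        System.out.println(computeSplitSize(block, 1L, Long.MAX_VALUE) / MB);
        // Raising minSize above the block size yields larger splits (fewer Map tasks).
        System.out.println(computeSplitSize(block, 128 * MB, Long.MAX_VALUE) / MB);
        // Lowering maxSize below the block size yields smaller splits (more Map tasks).
        System.out.println(computeSplitSize(block, 1L, 32 * MB) / MB);
        // If minSize exceeds maxSize, minSize wins because max() is applied last.
        System.out.println(computeSplitSize(block, 128 * MB, 32 * MB) / MB);
    }
}
```

In short, minSize is the knob for making splits larger than a block and maxSize is the knob for making them smaller.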

long bytesRemaining = length;
while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
  int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
  splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
                           blkLocations[blkIndex].getHosts()));
  bytesRemaining -= splitSize;
}

if (bytesRemaining != 0) {
  splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
                           blkLocations[blkLocations.length-1].getHosts()));
}
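
The loop above can be replayed without Hadoop to see which byte ranges it yields. In this sketch each split is reduced to a {start, length} pair and the host lookup is left out. SPLIT_SLOP is set to 1.1, which is the value used in Hadoop's FileInputFormat as far as I know; treat it as an assumption, since the excerpt does not show the constant's definition.

```java
import java.util.ArrayList;
import java.util.List;

public class SplitLoopSketch {
    // Assumed value of the constant referenced by the excerpt (10% slop).
    static final double SPLIT_SLOP = 1.1;

    /** Returns {start, length} pairs produced by the same loop, without host info. */
    public static List<long[]> splitRanges(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits.add(new long[] {length - bytesRemaining, splitSize});
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            // The tail may be shorter than splitSize or up to 10% longer.
            splits.add(new long[] {length - bytesRemaining, bytesRemaining});
        }
        return splits;
    }

    public static void main(String[] args) {
        for (long[] s : splitRanges(200, 64)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
        for (long[] s : splitRanges(134, 64)) {
            System.out.println("start=" + s[0] + " length=" + s[1]);
        }
    }
}
```

With a length of 200 and a split size of 64 the loop yields four ranges, the last one 8 bytes long. With a length of 134 it yields only two, because the 70 bytes left after the first range are within the slop and are kept together instead of producing a tiny 6-byte split.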

FileSplit

FileSplit extends InputSplit and holds the file the split belongs to, the start offset, the length, and the list of hosts where the data resides.

  /** Constructs a split with host information
   *
   * @param file the file name
   * @param start the position of the first byte in the file to process
   * @param length the number of bytes in the file to process
   * @param hosts the list of hosts containing the block, possibly null
   */
  public FileSplit(Path file, long start, long length, String[] hosts) {
    this.file = file;
    this.start = start;
    this.length = length;
    this.hosts = hosts;
  }

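The hosts are obtained as follows: the file that the InputSplit belongs to is looked up (through the NameNode) to get all of its BlockLocations; the start offset of the InputSplit is used to find the matching blkIndex; and the host information is read from the BlockLocation at that index.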
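
The lookup from a split's start offset to a block index can be sketched like this. `Block` is a hypothetical stand-in for BlockLocation that carries only an offset, a length, and host names, and the node names are invented; the search is a plain linear scan, which I believe matches what getBlockIndex does, but the excerpt does not show its body.

```java
public class BlockIndexSketch {
    /** Hypothetical stand-in for BlockLocation. */
    public static final class Block {
        public final long offset;
        public final long length;
        public final String[] hosts;

        public Block(long offset, long length, String... hosts) {
            this.offset = offset;
            this.length = length;
            this.hosts = hosts;
        }
    }

    /** Returns the index of the block whose byte range contains the given offset. */
    public static int getBlockIndex(Block[] blocks, long offset) {
        for (int i = 0; i < blocks.length; i++) {
            if (blocks[i].offset <= offset && offset < blocks[i].offset + blocks[i].length) {
                return i;
            }
        }
        throw new IllegalArgumentException("Offset " + offset + " is outside of the file");
    }

    public static void main(String[] args) {
        // A 150-byte file stored as three blocks; host names are made up.
        Block[] blocks = {
            new Block(0, 64, "node1", "node2"),
            new Block(64, 64, "node2", "node3"),
            new Block(128, 22, "node1", "node3"),
        };
        int idx = getBlockIndex(blocks, 70);
        System.out.println(idx + " -> " + String.join(",", blocks[idx].hosts));
    }
}
```

The hosts of the block that contains the split's first byte are what the scheduler can use to place the Map task close to its data.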

LineRecordReader

  public RecordReader<LongWritable, Text> 
    createRecordReader(InputSplit split,
                       TaskAttemptContext context) {
    return new LineRecordReader();
  }

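LineRecordReader extends RecordReader and adapts the LineReader class. LineReader allocates a byte-array buffer (its size comes from io.file.buffer.size; LineReader falls back to 64 KB when the parameter is not set) and reads data from the stream through DFSClient.DFSInputStream.read(byte buf[], int off, int len). When a record spans two blocks, the stream relocates to the node holding the next block and performs at least one more read (up to one buffer's worth of bytes from the newly located node):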

if (pos > blockEnd) {
  currentNode = blockSeekTo(pos);
}
int realLen = (int) Math.min((long) len, (blockEnd - pos + 1L));
int result = readBuffer(buf, off, realLen);
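
The effect of clamping realLen at blockEnd is that a single read never crosses a block boundary, so a record that spans two blocks needs at least two reads. The sketch below models only that arithmetic: blocks are assumed to have a fixed size, the seek to a new node is represented by recomputing blockEnd, and the helper names are hypothetical.

```java
public class BlockBoundaryReadSketch {
    /** Same clamp as in the excerpt: never read past the last byte of the block. */
    public static int clampedReadLength(long pos, long blockEnd, int len) {
        return (int) Math.min((long) len, blockEnd - pos + 1L);
    }

    /** Counts the reads needed to fetch `total` bytes starting at `pos`. */
    public static int readsNeeded(long pos, long total, long blockSize, int bufferSize) {
        int reads = 0;
        long remaining = total;
        while (remaining > 0) {
            // Inclusive index of the last byte in the block that contains pos;
            // this stands in for the blockSeekTo(pos) step of the real stream.
            long blockEnd = (pos / blockSize + 1) * blockSize - 1;
            int len = (int) Math.min((long) bufferSize, remaining);
            int realLen = clampedReadLength(pos, blockEnd, len);
            pos += realLen;
            remaining -= realLen;
            reads++;
        }
        return reads;
    }

    public static void main(String[] args) {
        // A 10-byte record starting 4 bytes before the end of a 64-byte block.
        System.out.println(readsNeeded(60, 10, 64, 4096));
        // The same record placed entirely inside one block.
        System.out.println(readsNeeded(0, 10, 64, 4096));
    }
}
```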