Hadoop Code Analysis (Part 3)

jiji879


Below is the nextKeyValue() code from LineRecordReader:

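public boolean nextKeyValue() throws IOException {
  if (key == null) {
    key = new LongWritable();
  }
  // The key is the byte offset at which this read starts.
  key.set(pos);
  if (value == null) {
    value = new Text();
  }
  int newSize = 0;  // bytes consumed by the most recent readLine() call
  while (pos < end) {
    // Read one line into value. The third argument caps how many bytes
    // readLine() may consume for this line.
    newSize = in.readLine(value, maxLineLength,
                          Math.max((int)Math.min(Integer.MAX_VALUE, end-pos),
                                   maxLineLength));
    if (newSize == 0) {
      // Nothing was read: end of input.
      break;
    }
    pos += newSize;
    if (newSize < maxLineLength) {
      // The line fits within the limit, so it becomes the current record.
      break;
    }

    // line too long. try again
    LOG.info("Skipped line of size " + newSize + " at pos " +
             (pos - newSize));
  }
  if (newSize == 0) {
    // No more records in this split.
    key = null;
    value = null;
    return false;
  } else {
    return true;
  }
}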

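In key.set(pos), pos is the byte offset at which the line starts in the file, and value holds the content of the line. An example from Hadoop: The Definitive Guide illustrates this: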

On the top of the Crumpetty Tree

The Quangle Wangle sat,

But his face you could not see,

On account of his Beaver Hat.

LineRecordReader turns this input into four K/V pairs:

(0, On the top of the Crumpetty Tree)

(33, The Quangle Wangle sat,)

(57, But his face you could not see,)

(89, On account of his Beaver Hat.)
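The keys above can be checked by hand: each one is the previous key plus the byte length of the previous line, plus one byte for the newline. The following is a minimal sketch of that arithmetic, assuming '\n' line endings; LineOffsets is a hypothetical helper written for this post, not a Hadoop class.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Hadoop): reproduces the byte offsets
// that serve as keys for '\n'-terminated lines.
class LineOffsets {

    // Returns the starting byte offset of every line in text.
    static List<Long> offsets(String text) {
        List<Long> result = new ArrayList<>();
        long pos = 0;
        for (String line : text.split("\n")) {
            result.add(pos);
            // Advance by the line's UTF-8 byte length plus one byte for '\n'.
            pos += line.getBytes(StandardCharsets.UTF_8).length + 1;
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "On the top of the Crumpetty Tree\n"
                + "The Quangle Wangle sat,\n"
                + "But his face you could not see,\n"
                + "On account of his Beaver Hat.\n";
        System.out.println(offsets(text)); // prints [0, 33, 57, 89]
    }
}
```

The first line is 32 characters long, so the second record starts at offset 33, which matches the keys listed above.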

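Take the WordCount example: for each K/V pair it receives, the mapper works on the value. It calls StringTokenizer itr = new StringTokenizer(value.toString()) to split the value into individual tokens, and emits intermediate output in the following form:

(On,1),(the,1),(top,1),(of,1)..... The framework then sorts and groups this intermediate output by key and passes it to the reducer, which sums the counts.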
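The tokenizing step can be tried outside Hadoop. The sketch below uses only java.util.StringTokenizer; TokenizeLine and mapLine are hypothetical names, and a plain String stands in for Hadoop's Text value. A real mapper would write each pair to its context instead of collecting them in a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical stand-in for the map step of WordCount (not Hadoop code).
class TokenizeLine {

    // Splits one line on whitespace and pairs every token with the count 1.
    static List<String> mapLine(String value) {
        List<String> pairs = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(value);
        while (itr.hasMoreTokens()) {
            pairs.add("(" + itr.nextToken() + ",1)");
        }
        return pairs;
    }

    public static void main(String[] args) {
        System.out.println(mapLine("On the top of the Crumpetty Tree"));
        // prints [(On,1), (the,1), (top,1), (of,1), (the,1), (Crumpetty,1), (Tree,1)]
    }
}
```

Note that "the" appears twice in the output: the mapper does no aggregation of its own, which is why the counts are summed later in the reducer.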

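So the path of a job's input, from the raw input to the mapper's output, is roughly as follows:

The input is registered with FileInputFormat.addInputPath(...), and FileInputFormat's getSplits() divides it into splits. In this example, TextInputFormat creates a LineRecordReader for each split, and the LineRecordReader reads the K/V pairs. LineRecordReader is the actual reader: the work of pulling data out of the input stream is done by it. Finally, the K/V pairs it reads are handed to the mapper for processing.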
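To make the hand-off from reader to mapper concrete, here is a small in-memory imitation of that flow. It is only a sketch under stated assumptions: InMemoryLineReader and MiniWordCount are hypothetical classes, the reader assumes single-byte characters and '\n' line endings, there is a single split, and a TreeMap stands in for the sort, group and reduce stages that Hadoop runs separately.

```java
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical in-memory stand-in for LineRecordReader (not Hadoop's class).
class InMemoryLineReader {
    private final String[] lines;
    private int index = -1;
    private long pos = 0;   // offset of the next line to read
    private long key;       // offset of the current line
    private String value;   // content of the current line

    InMemoryLineReader(String text) {
        this.lines = text.split("\n");
    }

    // Advances to the next line; returns false when the input is exhausted.
    boolean nextKeyValue() {
        if (index + 1 >= lines.length) {
            return false;
        }
        index++;
        key = pos;
        value = lines[index];
        pos += value.length() + 1; // assumes single-byte characters plus '\n'
        return true;
    }

    long getCurrentKey() { return key; }

    String getCurrentValue() { return value; }
}

class MiniWordCount {

    // Drives the reader and applies the WordCount map logic to every record.
    static Map<String, Integer> run(String text) {
        InMemoryLineReader reader = new InMemoryLineReader(text);
        // TreeMap keeps keys sorted; merge() plays the role of the reducer's sum.
        Map<String, Integer> counts = new TreeMap<>();
        while (reader.nextKeyValue()) {
            StringTokenizer itr = new StringTokenizer(reader.getCurrentValue());
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(run("a b\nb c\n")); // prints {a=1, b=2, c=1}
    }
}
```

The loop in run() has the same shape as the one the framework uses to feed a mapper: keep calling nextKeyValue() until it returns false, and pass the current key and value to the map logic each time.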


2011-01-11 16:25