"split" which is a logical
concept relatives to a "block" whis is real store unit.
When a client submits a job to the JobTracker (JT), it computes the splits per input file; the TaskTracker (TT) then hands each InputSplit to a map task.
So splits are what spawn mappers. If you use FileInputFormat and make isSplitable() return false, the file will NOT be split, and the whole file goes to a single mapper.
A RecordReader is used in the map task to turn the data of a split (computed by the client before submitting to the JT) back into records. So, if you want, you can read a whole split as a single record.
Combining FileInputFormat and RecordReader, you can get exactly one record for a whole file (see the sketch below):
a. make isSplitable() return false;
b. override next() in your RecordReader to read the whole split at once.
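A minimal sketch of that idea against the old "mapred" API; the class and field names here are mine, not from any Hadoop release:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;                                   // (a) never split the file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) {
                return false;                           // only one record per split
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = fs.open(file);
            try {
                // (b) read the whole split in one go
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        public NullWritable createKey() { return NullWritable.get(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() { /* the stream is closed in next() */ }
    }
}

The mapper then receives one (NullWritable, BytesWritable) pair per input file.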
How is the split size computed?
New-version formula:
split size = max(min-split-size, min(max-split-size, blocksize))
Note that the final number of splits is not simply the file length divided by the split size; a "split slop" factor is used as an optimization, so a small leftover tail is folded into the last split instead of becoming its own tiny split (presumably to save the extra seek and task overhead). A sketch follows.
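Roughly, the computation looks like this (the slop factor is about 1.1 in the Hadoop source; the loop is simplified and ignores block locations):

class SplitSizeSketch {
    // slop factor used by FileInputFormat (about 1.1 in the Hadoop source)
    static final double SPLIT_SLOP = 1.1;

    // new-version formula: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // how many splits a single file of fileLength bytes yields
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileLength;
        // cut another full split only while the remainder is noticeably
        // bigger than splitSize; the tail is folded into the last split
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;           // the (possibly slightly oversized) last split
        }
        return splits;
    }
}

For example, with a 128 MB split size a 129 MB file gives a single split, because 129/128 is below the slop factor.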
Old-version formula:
split size = max(min-split-size, min(goalsize, blocksize))
where goalsize is the total size of all input files divided by numMapTasks.
Of course there is a split slop in it as well.
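The old-version size computation, in the same sketch style (parameter names follow the prose above, not necessarily the exact Hadoop source):

static long oldSplitSize(long totalSize, int numMapTasks, long minSplitSize, long blockSize) {
    // goalsize = total input size / requested number of map tasks
    long goalSize = totalSize / (numMapTasks == 0 ? 1 : numMapTasks);
    return Math.max(minSplitSize, Math.min(goalSize, blockSize));
}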
Finally, the client writes a split file, summarizing all the split info, to the DFS. Because a split is only logical, the application still gets a second chance to adjust how the input is read once it reaches the mapper.
How are records restored from splits?
Yes, this is the exciting part. When the client computes the splits before submitting the job, it does not consider line lengths (a line may exceed the threshold of mapred.linerecordreader.maxlength), nor whether the cut falls in the middle of a multi-byte (non-ASCII) character.
In local mode it is LocalJobRunner that runs the tasks. A LineReader is used to recover records from each split (really a fragment of the raw file) and push them to a mapper. The important points (a sketch follows):
A. each split carries its raw (parent) file as a property, together with the current data offset (relative to the raw file) and the current data length of the split;
B. CR and LF are both single-byte ASCII codes, so a split boundary can never land inside one of them, which avoids extra trouble when stitching lines back together across splits.
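A simplified sketch of the idea (not the real LineRecordReader source): every reader except the one starting at offset 0 throws away the partial first line it lands in, and every reader may read past its split's end to finish the last line it started. Class and method names here are mine:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

class SplitLineScanner {
    private long pos;          // current offset in the raw (parent) file
    private final long end;    // end offset of this split in the raw file
    private final LineReader in;

    SplitLineScanner(FSDataInputStream stream, long start, long length) throws IOException {
        this.pos = start;
        this.end = start + length;
        stream.seek(start);
        this.in = new LineReader(stream);
        if (start != 0) {
            // not the first split: skip the (possibly partial) line we landed in;
            // the previous split's reader finishes that line itself
            pos += in.readLine(new Text());
        }
    }

    /** Fills value with the next line; returns false when this split is done. */
    boolean nextLine(Text value) throws IOException {
        if (pos > end) {
            return false;                      // already finished the line that crossed end
        }
        int bytesRead = in.readLine(value);    // may read beyond end to finish a line
        if (bytesRead == 0) {
            return false;                      // end of file
        }
        pos += bytesRead;
        return true;
    }
}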
That is how local mode does it; what about a real cluster? TODO
:)
By the way, there is a trick to avoid re-splitting the raw file in LocalJobRunner; go look at its job.run():
if (job.getUseNewMapper()) {
    ...
} else {
    ..
}
You could use JobClient.getSplits() instead; maybe that counts as an "optimization" :)