"split" which is a logical
concept relatives to a "block" whis is real store unit.
When a client submits a job to the JobTracker (JT), it computes the splits per input file; the TaskTracker (TT) then hands each InputSplit to a map task.
So splits are what spawn mappers. If you use FileInputFormat and make isSplitable() return false, the file will NOT be split, and the whole file goes to a single mapper.
A RecordReader is used in the map task to turn the data of a split (computed by the client before submitting to the JT) back into records. So, if you want, you can read a whole split as a single record.
Combining FileInputFormat and RecordReader, you can get exactly one record for a whole file (see the sketch below):
a. make isSplitable() return false;
b. override next() in your RecordReader to read the whole split at once.
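A minimal sketch of that idea against the old "mapred" API; the class and field names here are mine, not from any Hadoop release:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

public class WholeFileInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false;                                   // (a) never split the file
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    static class WholeFileRecordReader implements RecordReader<NullWritable, BytesWritable> {
        private final FileSplit split;
        private final JobConf conf;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf conf) {
            this.split = split;
            this.conf = conf;
        }

        public boolean next(NullWritable key, BytesWritable value) throws IOException {
            if (processed) {
                return false;                           // only one record per split
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            FSDataInputStream in = fs.open(file);
            try {
                // (b) read the whole split in one go
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        public NullWritable createKey() { return NullWritable.get(); }
        public BytesWritable createValue() { return new BytesWritable(); }
        public long getPos() { return processed ? split.getLength() : 0; }
        public float getProgress() { return processed ? 1.0f : 0.0f; }
        public void close() { /* the stream is closed in next() */ }
    }
}

The mapper then receives one (NullWritable, BytesWritable) pair per input file.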
How is the split size computed?
New-version formula:
split size = max(min-split-size, min(max-split-size, blocksize))
Note that the final number of splits is not simply the file length divided by the split size; a "split slop" factor is used as an optimization, so a small leftover tail is folded into the last split instead of becoming its own tiny split (presumably to save the extra seek and task overhead). A sketch follows.
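Roughly, the computation looks like this (the slop factor is about 1.1 in the Hadoop source; the loop is simplified and ignores block locations):

class SplitSizeSketch {
    // slop factor used by FileInputFormat (about 1.1 in the Hadoop source)
    static final double SPLIT_SLOP = 1.1;

    // new-version formula: max(minSize, min(maxSize, blockSize))
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // how many splits a single file of fileLength bytes yields
    static int countSplits(long fileLength, long splitSize) {
        int splits = 0;
        long bytesRemaining = fileLength;
        // cut another full split only while the remainder is noticeably
        // bigger than splitSize; the tail is folded into the last split
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            splits++;
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            splits++;           // the (possibly slightly oversized) last split
        }
        return splits;
    }
}

For example, with a 128 MB split size a 129 MB file gives a single split, because 129/128 is below the slop factor.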
Old-version formula:
split size = max(min-split-size, min(goalsize, blocksize))
where goalsize is the total size of all input files divided by numMapTasks.
Of course there is a split slop in it as well.
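The old-version size computation, in the same sketch style (parameter names follow the prose above, not necessarily the exact Hadoop source):

static long oldSplitSize(long totalSize, int numMapTasks, long minSplitSize, long blockSize) {
    // goalsize = total input size / requested number of map tasks
    long goalSize = totalSize / (numMapTasks == 0 ? 1 : numMapTasks);
    return Math.max(minSplitSize, Math.min(goalSize, blockSize));
}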
Finally, the client writes a split file, summarizing all the split info, to the DFS. Because a split is only logical, the application still gets a second chance to adjust how the input is read once it reaches the mapper.
How are records restored from splits?
Yes, this is the exciting part. When the client computes the splits before submitting the job, it does not consider line lengths (a line may exceed the threshold of mapred.linerecordreader.maxlength), nor whether the cut falls in the middle of a multi-byte (non-ASCII) character.
In local mode it is LocalJobRunner that runs the tasks. A LineReader is used to recover records from each split (really a fragment of the raw file) and push them to a mapper. The important points (a sketch follows):
A. each split carries its raw (parent) file as a property, together with the current data offset (relative to the raw file) and the current data length of the split;
B. CR and LF are both single-byte ASCII codes, so a split boundary can never land inside one of them, which avoids extra trouble when stitching lines back together across splits.
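A simplified sketch of the idea (not the real LineRecordReader source): every reader except the one starting at offset 0 throws away the partial first line it lands in, and every reader may read past its split's end to finish the last line it started. Class and method names here are mine:

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

class SplitLineScanner {
    private long pos;          // current offset in the raw (parent) file
    private final long end;    // end offset of this split in the raw file
    private final LineReader in;

    SplitLineScanner(FSDataInputStream stream, long start, long length) throws IOException {
        this.pos = start;
        this.end = start + length;
        stream.seek(start);
        this.in = new LineReader(stream);
        if (start != 0) {
            // not the first split: skip the (possibly partial) line we landed in;
            // the previous split's reader finishes that line itself
            pos += in.readLine(new Text());
        }
    }

    /** Fills value with the next line; returns false when this split is done. */
    boolean nextLine(Text value) throws IOException {
        if (pos > end) {
            return false;                      // already finished the line that crossed end
        }
        int bytesRead = in.readLine(value);    // may read beyond end to finish a line
        if (bytesRead == 0) {
            return false;                      // end of file
        }
        pos += bytesRead;
        return true;
    }
}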
That is how local mode does it; what about a real cluster? TODO
:)
By the way, there is a trick to avoid re-splitting the raw file in LocalJobRunner; go look at its job.run():
if (job.getUseNewMapper()) {
    ...
} else {
    ..
}
You could use JobClient.getSplits() instead; maybe that counts as an "optimization" :)