
InputFormat (3): org.apache.hadoop.mapreduce.InputFormat<K, V>

 

InputFormat describes the input-specification for a Map-Reduce job.

The Map-Reduce framework relies on the InputFormat of the job to:

1. Validate the input-specification of the job.
2. Split up the input file(s) into logical InputSplits, each of which is then assigned to an individual Mapper.
3. Provide the RecordReader implementation used to glean input records from the logical InputSplit for processing by the Mapper.

The default behavior of file-based InputFormats, typically sub-classes of FileInputFormat, is to split the input into logical InputSplits based on the total size, in bytes, of the input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize, as the sketch below illustrates.
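For example, a driver can set these bounds programmatically through the static helpers on org.apache.hadoop.mapreduce.lib.input.FileInputFormat, which write the mapreduce.input.fileinputformat.split.minsize and .split.maxsize properties. A minimal sketch, assuming illustrative 64 MB and 256 MB bounds (the class name SplitSizeDemo and the concrete values are hypothetical):

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeDemo {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "split-size-demo");
    // Lower bound: no split smaller than 64 MB (hypothetical value);
    // sets mapreduce.input.fileinputformat.split.minsize.
    FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);
    // Upper bound: no split larger than 256 MB (hypothetical value);
    // sets mapreduce.input.fileinputformat.split.maxsize.
    FileInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);
  }
}

Remember that the effective split size is still capped by the FileSystem blocksize unless the maximum is raised above it.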

Clearly, logical splits based on input size are insufficient for many applications, since record boundaries must be respected. In such cases, the application also has to implement a RecordReader, which is responsible for respecting record boundaries and presenting a record-oriented view of the logical InputSplit to the individual task. (A sketch of a custom InputFormat appears after the class listing below.)

package org.apache.hadoop.mapreduce;

import java.io.IOException;
import java.util.List;

import org.apache.hadoop.classification.InterfaceAudience;
import org.apache.hadoop.classification.InterfaceStability;

@InterfaceAudience.Public
@InterfaceStability.Stable
public abstract class InputFormat<K, V> {

  /** 
   * Logically split the set of input files for the job.  
   * 
   * <p>Each {@link InputSplit} is then assigned to an individual {@link Mapper}
   * for processing.</p>
   *
   * <p><i>Note</i>: The split is a <i>logical</i> split of the inputs and the
   * input files are not physically split into chunks. E.g., a split could
   * be an <i>&lt;input-file-path, start, offset&gt;</i> tuple. The InputFormat
   * also creates the {@link RecordReader} to read the {@link InputSplit}.
   * 
   * @param context job configuration.
   * @return the list of {@link InputSplit}s for the job.
   */
  public abstract List<InputSplit> getSplits(JobContext context)
      throws IOException, InterruptedException;
  
  /**
   * Create a record reader for a given split. The framework will call
   * {@link RecordReader#initialize(InputSplit, TaskAttemptContext)} before
   * the split is used.
   * @param split the split to be read
   * @param context the information about the task
   * @return a new record reader
   * @throws IOException
   * @throws InterruptedException
   */
  public abstract RecordReader<K, V> createRecordReader(InputSplit split,
      TaskAttemptContext context) throws IOException, InterruptedException;

}
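
To make the contract concrete, here is a minimal sketch of a custom file-based InputFormat: it extends FileInputFormat, turns off splitting so a file is never divided across Mappers (and no record can straddle two splits), and delegates record reading to the library's LineRecordReader. The class name UnsplittableTextInputFormat is hypothetical; FileInputFormat and LineRecordReader are the real classes from org.apache.hadoop.mapreduce.lib.input.

import java.io.IOException;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader;

// Hypothetical example class; not part of the Hadoop distribution.
public class UnsplittableTextInputFormat extends FileInputFormat<LongWritable, Text> {

  @Override
  protected boolean isSplitable(JobContext context, Path file) {
    // Force one split per file so no record can straddle two splits.
    return false;
  }

  @Override
  public RecordReader<LongWritable, Text> createRecordReader(
      InputSplit split, TaskAttemptContext context)
      throws IOException, InterruptedException {
    // The framework calls initialize(split, context) on this reader
    // before the first nextKeyValue() call.
    return new LineRecordReader();
  }
}

A job would select this format with job.setInputFormatClass(UnsplittableTextInputFormat.class); getSplits is inherited from FileInputFormat, which honors the isSplitable override.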





