bit1129

[Hadoop 5] Analyzing the Results of the Word Count Example

 

Below is the output of a Word Count run. The input consisted of two small files, no more than a few KB in size.

 

hadoop@hadoop-Inspiron-3521:~/hadoop-2.5.2/bin$ hadoop jar WordCountMapReduce.jar /users/hadoop/hello/world /users/hadoop/output5
--->/users/hadoop/hello/world
--->/users/hadoop/output5
14/12/15 22:35:40 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
14/12/15 22:35:41 INFO input.FileInputFormat: Total input paths to process : 2 // two files to process in total
14/12/15 22:35:41 INFO mapreduce.JobSubmitter: number of splits:2  // two input splits; each split is handled by one map task
14/12/15 22:35:42 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1418652929537_0001
14/12/15 22:35:43 INFO impl.YarnClientImpl: Submitted application application_1418652929537_0001
14/12/15 22:35:43 INFO mapreduce.Job: The url to track the job: http://hadoop-Inspiron-3521:8088/proxy/application_1418652929537_0001/
14/12/15 22:35:43 INFO mapreduce.Job: Running job: job_1418652929537_0001
14/12/15 22:35:54 INFO mapreduce.Job: Job job_1418652929537_0001 running in uber mode : false
14/12/15 22:35:54 INFO mapreduce.Job:  map 0% reduce 0%
14/12/15 22:36:04 INFO mapreduce.Job:  map 50% reduce 0%
14/12/15 22:36:05 INFO mapreduce.Job:  map 100% reduce 0%
14/12/15 22:36:16 INFO mapreduce.Job:  map 100% reduce 100%
14/12/15 22:36:17 INFO mapreduce.Job: Job job_1418652929537_0001 completed successfully
14/12/15 22:36:17 INFO mapreduce.Job: Counters: 49
	File System Counters
		FILE: Number of bytes read=3448
		FILE: Number of bytes written=299665
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
		HDFS: Number of bytes read=2574
		HDFS: Number of bytes written=1478
		HDFS: Number of read operations=9
		HDFS: Number of large read operations=0
		HDFS: Number of write operations=2
	Job Counters 
		Launched map tasks=2 // one map task per input file (each file fits in a single split)
		Launched reduce tasks=1
		Data-local map tasks=2 // both map tasks read their input from the local node
		Total time spent by all maps in occupied slots (ms)=17425
		Total time spent by all reduces in occupied slots (ms)=8472
		Total time spent by all map tasks (ms)=17425
		Total time spent by all reduce tasks (ms)=8472
		Total vcore-seconds taken by all map tasks=17425
		Total vcore-seconds taken by all reduce tasks=8472
		Total megabyte-seconds taken by all map tasks=17843200
		Total megabyte-seconds taken by all reduce tasks=8675328
	Map-Reduce Framework
		Map input records=90 // the two input files contain 90 lines in total
		Map output records=251 // the maps emitted 251 records, i.e. roughly 2.8 words per line (251/90)
		Map output bytes=2940
		Map output materialized bytes=3454
		Input split bytes=263
		Combine input records=0
		Combine output records=0
		Reduce input groups=138
		Reduce shuffle bytes=3454
		Reduce input records=251
		Reduce output records=138
		Spilled Records=502
		Shuffled Maps =2
		Failed Shuffles=0
		Merged Map outputs=2
		GC time elapsed (ms)=274
		CPU time spent (ms)=3740
		Physical memory (bytes) snapshot=694566912
		Virtual memory (bytes) snapshot=3079643136
		Total committed heap usage (bytes)=513277952
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=2311   // total size of the two input files
	File Output Format Counters 
		Bytes Written=1478 // size of the output file part-r-00000
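
Several of the counters above can be cross-checked against each other. The following is a minimal sketch in Python that only does arithmetic on values copied from the log; the dictionary keys and the function name `check_counters` are made up for illustration. The reading of `Spilled Records` as "written to disk once on the map side and once on the reduce side" is an interpretation that fits the numbers, not something the log itself states.

```python
# Counter values copied from the job output above.
counters = {
    "map_input_records": 90,
    "map_output_records": 251,
    "reduce_input_records": 251,
    "reduce_input_groups": 138,
    "reduce_output_records": 138,
    "spilled_records": 502,
    "hdfs_bytes_read": 2574,
    "input_format_bytes_read": 2311,
    "input_split_bytes": 263,
}


def check_counters(c):
    """Return a dict of facts derived from the counters of one run."""
    return {
        # Average number of words emitted per input line.
        "words_per_line": round(c["map_output_records"] / c["map_input_records"], 2),
        # No combiner ran, so every map output record reached the reducer.
        "all_map_output_reduced": c["map_output_records"] == c["reduce_input_records"],
        # One output line per distinct word.
        "one_output_per_group": c["reduce_input_groups"] == c["reduce_output_records"],
        # Consistent with one spill on the map side and one on the reduce side.
        "spilled_twice": c["spilled_records"] == 2 * c["map_output_records"],
        # HDFS reads = file contents + split metadata.
        "hdfs_read_adds_up": c["hdfs_bytes_read"]
        == c["input_format_bytes_read"] + c["input_split_bytes"],
    }


facts = check_counters(counters)
for name, value in facts.items():
    print(name, "=", value)
```

All four boolean checks hold for this run, and the average comes out at 2.79 words per line.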
 
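Mapper only, no Reducer
Set the Combiner to the Reducer implementation and set the number of reduce tasks to 0. Only the mappers then produce output, in files named part-m-NNNNN (starting with part-m-00000). The result shows:
1. The output is neither sorted nor merged by key, i.e. the Combiner had no effect. With zero reduce tasks the map output is written straight to the output files, bypassing the sort and spill stage in which a Combiner would run.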
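
The difference between a map-only job and a job with a reduce phase can be imitated in a few lines of plain Python. This is a toy model, not Hadoop code: `map_words` and `run_job` are hypothetical names, and the two input lines are invented sample data.

```python
def map_words(lines):
    """Map phase: emit one (word, 1) pair per word, in input order."""
    return [(word, 1) for line in lines for word in line.split()]


def run_job(lines, num_reduce_tasks):
    """Run the toy job and return the records that would be written out."""
    pairs = map_words(lines)
    if num_reduce_tasks == 0:
        # Map-only job: pairs are written as emitted, no sort and no merge.
        return pairs
    # With a reduce phase, pairs are sorted by key and summed per key.
    totals = {}
    for word, count in sorted(pairs):
        totals[word] = totals.get(word, 0) + count
    return list(totals.items())


lines = ["hello world", "hello hadoop"]  # invented sample input
print(run_job(lines, 0))
print(run_job(lines, 1))
```

With zero reduce tasks, `hello` appears twice with a count of 1 each and the records keep their input order; with one reduce task the records come out sorted and `hello` has a count of 2.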
Setting five Reducers
The output directory contains five files, part-r-00000 through part-r-00004.
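
Which file a word ends up in is decided by the partitioner. Hadoop's default `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`. The sketch below follows that formula in Python but substitutes CRC32 for the Java hash code, so the concrete assignments are an assumption and will differ from a real cluster; `partition` and `output_files` are hypothetical helper names.

```python
import zlib

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def partition(key, num_reduce_tasks):
    """Pick a reducer index for a key (CRC32 stands in for key.hashCode())."""
    return (zlib.crc32(key.encode("utf-8")) & INT_MAX) % num_reduce_tasks


def output_files(keys, num_reduce_tasks):
    """Map each output file name to the sorted keys its reducer would write."""
    files = {f"part-r-{i:05d}": [] for i in range(num_reduce_tasks)}
    for key in sorted(keys):
        name = f"part-r-{partition(key, num_reduce_tasks):05d}"
        files[name].append(key)
    return files


words = ["hello", "world", "hadoop", "count", "word", "map"]  # invented sample keys
files = output_files(words, 5)
for name, keys in files.items():
    print(name, keys)
```

Every key lands in exactly one file, and each file is sorted internally, but there is no ordering across the five files.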
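Setting the block size
Two parameters constrain the values the block size can take, so they have to be considered whenever it is changed:
1. The block size cannot be smaller than dfs.namenode.fs-limits.min-block-size, which defaults to 1048576 (1 MB).
2. The block size must be an integer multiple of io.bytes.per.checksum (named dfs.bytes-per-checksum in Hadoop 2.x), which defaults to 512 bytes.
To change the block size to 512 bytes, configure hdfs-site.xml as follows (the minimum block size is lowered to 256 so that 512 is accepted):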
<property>
  <name>dfs.block.size</name>
  <!--<value>67108864</value>-->
  <value>512</value>
  <description>The default block size for new files.</description>
</property>

<property>
  <name>dfs.namenode.fs-limits.min-block-size</name>
  <!--<value>67108864</value>-->
  <value>256</value>
  <description>The minimum of block size</description>
</property>
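
The two constraints can be written down as a small check. This is a sketch in Python, not the validation HDFS itself performs; `check_block_size` is a hypothetical name, and the checksum chunk size of 512 bytes used in the calls is an assumed value passed in as a parameter.

```python
def check_block_size(block_size, min_block_size, bytes_per_checksum):
    """Return a list of human-readable problems; empty means the size is acceptable."""
    problems = []
    if block_size < min_block_size:
        problems.append(
            f"{block_size} is below the minimum block size {min_block_size}"
        )
    if block_size % bytes_per_checksum != 0:
        problems.append(
            f"{block_size} is not a multiple of the checksum chunk size {bytes_per_checksum}"
        )
    return problems


# Values from the configuration above; 512 for the checksum chunk is an assumption.
print(check_block_size(512, 256, 512))
# Same block size against the default minimum of 1048576 bytes.
print(check_block_size(512, 1048576, 512))
```

The first call reports no problems, while the second shows why the minimum block size has to be lowered before a 512-byte block is accepted.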
 
Pro Apache Hadoop (p. 13):
A map task can run on any compute node in the cluster, and multiple map tasks can run in parallel across the cluster. The map task is responsible for transforming the input records into key/value pairs. The output of all the maps will be partitioned, and each partition will be sorted. There will be one partition for each reduce task. Each partition’s sorted keys and the values associated with the keys are then processed by the reduce task. There can be multiple reduce tasks running in parallel on the cluster.
The key to receiving a list of values for a key in the reduce phase is a phase known as the sort/shuffle phase in MapReduce. All the key/value pairs emitted by the Mapper are sorted by the key in the Reducer. If multiple Reducers are allocated, a subset of keys will be allocated to each Reducer. The key/value pairs for a given Reducer are sorted by key, which ensures that all the values associated with one key are received by the Reducer together.
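
The sort/shuffle step described in the quote can be sketched in plain Python: sort the emitted pairs by key, then group the values of each key so that the reduce function sees them together. The names `shuffle_and_sort` and `reduce_counts` and the sample pairs are illustrative only; a real job does this across machines and per partition.

```python
from itertools import groupby
from operator import itemgetter


def shuffle_and_sort(pairs):
    """Sort pairs by key and collect the values of each key into one list."""
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_counts(grouped):
    """Reduce phase of word count: sum the values received for each key."""
    return [(key, sum(values)) for key, values in grouped]


pairs = [("hello", 1), ("world", 1), ("hello", 1), ("hadoop", 1)]  # invented map output
grouped = shuffle_and_sort(pairs)
print(grouped)
print(reduce_counts(grouped))
```

The grouped form is what the log's `Reduce input groups` counter counts: one group per distinct key, regardless of how many records carried that key.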
 
The Word Count MapReduce process
 
 

 
p. 24:
Some of the metadata stored by the NameNode includes these:
• File/directory name and its location relative to the parent directory.
• File and directory ownership and permissions.
• File name of individual blocks. Each block is stored as a file in the local file system of the DataNode in the directory that can be configured by the Hadoop system administrator.
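How to inspect data blocks on HDFS
If a file is smaller than the block size, it still occupies one block (whose metadata is recorded in the NameNode), but the actual size of that block equals the size of the file.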
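
The same rule extends to larger files: all blocks are full except possibly the last one, which holds only the remaining bytes. A minimal Python sketch, with `block_sizes` as a hypothetical helper and the file and block sizes chosen as examples:

```python
def block_sizes(file_size, block_size):
    """Return the sizes of the blocks a file of file_size bytes is split into."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only takes up as much space as the data it holds.
        sizes.append(remainder)
    return sizes


# A 1478-byte file with 512-byte blocks.
print(block_sizes(1478, 512))
# The same file with a 128 MB block size: a single block of 1478 bytes.
print(block_sizes(1478, 128 * 1024 * 1024))
```

With 512-byte blocks the file is split into blocks of 512, 512 and 454 bytes; with a 128 MB block size it occupies one block whose actual size is 1478 bytes.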
 
p. 19:
The NameNode file that contains the metadata is fsimage. Any changes to the metadata during the system
operation are stored in memory and persisted to another file called edits. Periodically, the edits file is merged with the fsimage file by the Secondary NameNode.
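
The relationship between fsimage and edits can be pictured as a snapshot plus a log of changes. The toy model below is a loose analogy in Python and says nothing about the real on-disk formats: the operation names `create` and `delete`, the function `checkpoint`, and the paths are all invented for illustration.

```python
def checkpoint(fsimage, edits):
    """Merge logged edits into a snapshot; return (new_fsimage, empty_edits)."""
    merged = dict(fsimage)
    for op, path, *args in edits:
        if op == "create":
            merged[path] = args[0]
        elif op == "delete":
            merged.pop(path, None)
        else:
            raise ValueError(f"unknown operation: {op}")
    # After the merge, the edit log starts over empty.
    return merged, []


# Hypothetical namespace snapshot and edit log.
fsimage = {"/users/hadoop/a.txt": {"owner": "hadoop"}}
edits = [
    ("create", "/users/hadoop/b.txt", {"owner": "hadoop"}),
    ("delete", "/users/hadoop/a.txt"),
]

new_fsimage, new_edits = checkpoint(fsimage, edits)
print(new_fsimage)
print(new_edits)
```

Replaying the log on top of the old snapshot yields the current namespace, which is why the merge lets the edit log be truncated.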
 
The following command shows the state of HDFS, including the blocks of each file and their locations:
hdfs fsck / -files -blocks -locations | grep /users/hadoop/wordcount -A 30