Hadoop workshop homework.
For privacy, this post does not include the source code; it shows only the job output logs and counters.
- Copy the packaged jar file to the Hadoop cluster:
[root@n1 hadoop-examples]# scp gsun@192.168.1.102:~/prog/hadoop/cdh4-examples/cdh4-examples.jar .
Password:
cdh4-examples.jar                                  100%   46KB  46.0KB/s   00:00
- Copy the input data into HDFS:
$ scp NASA_access_log_Jul95.gz root@n1.example.com:/root/hadoop-examples
root@n1.example.com's password:
NASA_access_log_Jul95.gz                           100%   20MB  19.7MB/s   00:00
[root@n1 hadoop-examples]# gunzip -d NASA_access_log_Jul95.gz
[root@n1 hadoop-examples]# hadoop fs -mkdir nasa_access_log
[root@n1 hadoop-examples]# hadoop fs -copyFromLocal NASA_access_log_Jul95 ./nasa_access_log/
Scenario 1 output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.LogProcessor nasa_access_log output 2
13/07/13 00:14:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:14:57 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:14:58 INFO mapred.JobClient: Running job: job_201307122107_0009
13/07/13 00:14:59 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:15:17 INFO mapred.JobClient:  map 5% reduce 0%
13/07/13 00:15:18 INFO mapred.JobClient:  map 14% reduce 0%
13/07/13 00:15:21 INFO mapred.JobClient:  map 28% reduce 0%
13/07/13 00:15:25 INFO mapred.JobClient:  map 44% reduce 0%
13/07/13 00:15:27 INFO mapred.JobClient:  map 68% reduce 0%
13/07/13 00:15:30 INFO mapred.JobClient:  map 78% reduce 0%
13/07/13 00:15:34 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:15:36 INFO mapred.JobClient:  map 96% reduce 0%
13/07/13 00:15:39 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:15:54 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:15:56 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:15:59 INFO mapred.JobClient: Job complete: job_201307122107_0009
13/07/13 00:15:59 INFO mapred.JobClient: Counters: 33
13/07/13 00:15:59 INFO mapred.JobClient:   File System Counters
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes read=21497514
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes written=31791353
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes written=2139772
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:15:59 INFO mapred.JobClient:   Job Counters
13/07/13 00:15:59 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=63399
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=26747
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output bytes=43967362
13/07/13 00:15:59 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:15:59 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce shuffle bytes=10171946
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Spilled Records=5615964
13/07/13 00:15:59 INFO mapred.JobClient:     CPU time spent (ms)=43710
13/07/13 00:15:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=767377408
13/07/13 00:15:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3596718080
13/07/13 00:15:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=397082624
13/07/13 00:15:59 INFO mapred.JobClient:   demo.LogProcessorMap$LOG_PROCESSOR_COUNTER
13/07/13 00:15:59 INFO mapred.JobClient:     BAD_RECORDS=1871988
# of Good Records :1871988
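The `LOG_PROCESSOR_COUNTER` group above is a custom job counter. Since the original Java mapper is withheld, here is a hypothetical Python sketch of how the same kind of custom counter can be driven from a Hadoop Streaming task: Streaming recognizes lines of the form `reporter:counter:<group>,<counter>,<amount>` written to stderr. The log pattern and the `BAD_RECORDS` name mirror the output above, but the parsing logic is an assumption, not the original code.

```python
import re
import sys

# Hypothetical sketch (the original mapper is not published): a streaming
# mapper can bump a custom counter by writing a specially formatted line
# to stderr, which Hadoop Streaming picks up as a counter increment.
LOG_PATTERN = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\] "([^"]*)" (\d+) (\S+)')

def process_line(line, err=sys.stderr):
    """Emit 'host\t1' for a parsable access-log line; count unparsable ones."""
    match = LOG_PATTERN.match(line)
    if match is None:
        # Streaming counter protocol: reporter:counter:<group>,<counter>,<amount>
        err.write("reporter:counter:LOG_PROCESSOR_COUNTER,BAD_RECORDS,1\n")
        return None
    host = match.group(1)
    return "%s\t1" % host

if __name__ == "__main__":
    for raw in sys.stdin:
        out = process_line(raw.rstrip("\n"))
        if out is not None:
            print(out)
```

A well-formed NASA access-log line yields `host\t1` on stdout, while a malformed line only touches the counter, so the reducer never sees it.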
Scenario 2 output:
[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.genericwritable.LogProcessor nasa_access_log output 2
13/07/13 00:17:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:17:28 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:17:29 INFO mapred.JobClient: Running job: job_201307122107_0011
13/07/13 00:17:30 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:17:43 INFO mapred.JobClient:  map 24% reduce 0%
13/07/13 00:17:45 INFO mapred.JobClient:  map 33% reduce 0%
13/07/13 00:17:46 INFO mapred.JobClient:  map 49% reduce 0%
13/07/13 00:17:48 INFO mapred.JobClient:  map 57% reduce 0%
13/07/13 00:17:49 INFO mapred.JobClient:  map 66% reduce 0%
13/07/13 00:17:51 INFO mapred.JobClient:  map 75% reduce 0%
13/07/13 00:17:54 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:17:57 INFO mapred.JobClient:  map 99% reduce 0%
13/07/13 00:17:59 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:18:12 INFO mapred.JobClient:  map 100% reduce 50%
13/07/13 00:18:15 INFO mapred.JobClient:  map 100% reduce 69%
13/07/13 00:18:18 INFO mapred.JobClient:  map 100% reduce 70%
13/07/13 00:18:20 INFO mapred.JobClient:  map 100% reduce 83%
13/07/13 00:18:21 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:18:25 INFO mapred.JobClient:  map 100% reduce 86%
13/07/13 00:18:26 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:18:30 INFO mapred.JobClient: Job complete: job_201307122107_0011
13/07/13 00:18:30 INFO mapred.JobClient: Counters: 32
13/07/13 00:18:30 INFO mapred.JobClient:   File System Counters
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes read=70122269
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes written=103466795
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes written=86859890
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:18:30 INFO mapred.JobClient:   Job Counters
13/07/13 00:18:30 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=47028
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=44185
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:18:30 INFO mapred.JobClient:     Map input records=1891715
13/07/13 00:18:30 INFO mapred.JobClient:     Map output records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Map output bytes=168829257
13/07/13 00:18:30 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:18:30 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce shuffle bytes=33609934
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Spilled Records=11231928
13/07/13 00:18:30 INFO mapred.JobClient:     CPU time spent (ms)=51290
13/07/13 00:18:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=914145280
13/07/13 00:18:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4566802432
13/07/13 00:18:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=573489152
Scenario 3 (Hadoop streaming MapReduce)
- Copy the Python script to the Hadoop cluster:
$ scp logProcessor.py root@n1.example.com:/root/hadoop-examples
root@n1.example.com's password:
logProcessor.py                                    100%  470     0.5KB/s   00:00
Output:
[root@n1 hadoop-examples]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
    -input nasa_access_log -output output \
    -mapper 'python logProcessor.py' -reducer aggregate -file logProcessor.py
packageJobJar: [logProcessor.py, /tmp/hadoop-root/hadoop-unjar641255321819856404/] [] /tmp/streamjob5121005386227726797.jar tmpDir=null
13/07/13 00:34:05 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:34:05 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/13 00:34:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/07/13 00:34:06 INFO streaming.StreamJob: Running job: job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: To kill this job, run:
13/07/13 00:34:06 INFO streaming.StreamJob: UNDEF/bin/hadoop job -Dmapred.job.tracker=n1.example.com:8021 -kill job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: Tracking URL: http://n1.example.com:50030/jobdetails.jsp?jobid=job_201307122107_0015
13/07/13 00:34:07 INFO streaming.StreamJob:  map 0% reduce 0%
13/07/13 00:34:24 INFO streaming.StreamJob:  map 11% reduce 0%
13/07/13 00:34:25 INFO streaming.StreamJob:  map 25% reduce 0%
13/07/13 00:34:27 INFO streaming.StreamJob:  map 39% reduce 0%
13/07/13 00:34:28 INFO streaming.StreamJob:  map 52% reduce 0%
13/07/13 00:34:31 INFO streaming.StreamJob:  map 75% reduce 0%
13/07/13 00:34:33 INFO streaming.StreamJob:  map 87% reduce 0%
13/07/13 00:34:34 INFO streaming.StreamJob:  map 100% reduce 0%
13/07/13 00:34:46 INFO streaming.StreamJob:  map 100% reduce 100%
13/07/13 00:34:50 INFO streaming.StreamJob: Job complete: job_201307122107_0015
13/07/13 00:34:50 INFO streaming.StreamJob: Output: output
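The job above pairs a Python mapper with the built-in `aggregate` reducer. Since the real `logProcessor.py` is withheld, here is a hypothetical sketch of the shape such a mapper must take: the Aggregate package expects each mapper output line as `<function>:<key>\t<value>`, e.g. `LongValueSum` so the reduce side totals the counts per key. Keying on the client host (the first field of the access-log line) is an assumption about what the homework counted.

```python
import sys

# Hypothetical sketch of a mapper usable with "-reducer aggregate"
# (not the original logProcessor.py). The Aggregate package parses
# mapper output of the form "<function>:<key>\t<value>".
def to_aggregate_record(line):
    """Turn one access-log line into a LongValueSum record keyed by host."""
    fields = line.split()
    if not fields:
        return None
    host = fields[0]  # first whitespace-separated field is the client host
    return "LongValueSum:%s\t1" % host

if __name__ == "__main__":
    for raw in sys.stdin:
        rec = to_aggregate_record(raw.strip())
        if rec is not None:
            print(rec)
```

With this convention the reducer needs no custom code at all; `aggregate` sums the `1`s per host and writes `host\tcount` pairs to the output directory.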