Homework - NASA Access Log Processing - George's dev dream port

sunwinner

Homework - NASA Access Log Processing

Category: Hadoop

Hadoop workshop homework.

For privacy, the blog post will not show source code at all, only the job output logs and counters.

Copy the packaged JAR file to the Hadoop cluster:

[root@n1 hadoop-examples]# scp gsun@192.168.1.102:~/prog/hadoop/cdh4-examples/cdh4-examples.jar .
Password:
cdh4-examples.jar                                                                            100%   46KB  46.0KB/s   00:00

Copy the input data into HDFS:

$ scp NASA_access_log_Jul95.gz root@n1.example.com:/root/hadoop-examples
root@n1.example.com's password: 
NASA_access_log_Jul95.gz                                                              100%   20MB  19.7MB/s   00:00 
[root@n1 hadoop-examples]# gunzip -d NASA_access_log_Jul95.gz 
[root@n1 hadoop-examples]# hadoop fs -mkdir nasa_access_log
[root@n1 hadoop-examples]# hadoop fs -copyFromLocal NASA_access_log_Jul95 ./nasa_access_log/
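
Since the job source is not shown, here is a minimal sketch of how one line of the input could be parsed, assuming the file follows the Common Log Format (host, identity, user, timestamp, request, status, bytes). The names `LOG_PATTERN`, `LogRecord` and `parse_line`, and the sample line, are illustrative assumptions and are not taken from the homework code.

```python
import re
from typing import NamedTuple, Optional

# Assumed layout (Common Log Format):
#   host ident user [timestamp] "request" status bytes
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


class LogRecord(NamedTuple):
    host: str
    timestamp: str
    request: str
    status: int
    size: int  # bytes sent; a "-" in the log is stored as 0


def parse_line(line: str) -> Optional[LogRecord]:
    """Return a LogRecord, or None when the line does not match the pattern."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    size = match.group("size")
    return LogRecord(
        host=match.group("host"),
        timestamp=match.group("timestamp"),
        request=match.group("request"),
        status=int(match.group("status")),
        size=0 if size == "-" else int(size),
    )


# Illustrative line in the assumed format.
sample = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
          '"GET /history/apollo/ HTTP/1.0" 200 6245')
record = parse_line(sample)
print(record.host, record.status, record.size)  # 199.72.81.55 200 6245
```

A mapper built on this kind of parser would typically count the `None` results as bad records and everything else as good records.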

Scenario 1 output:

[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.LogProcessor nasa_access_log output 2
13/07/13 00:14:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:14:57 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:14:58 INFO mapred.JobClient: Running job: job_201307122107_0009
13/07/13 00:14:59 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:15:17 INFO mapred.JobClient:  map 5% reduce 0%
13/07/13 00:15:18 INFO mapred.JobClient:  map 14% reduce 0%
13/07/13 00:15:21 INFO mapred.JobClient:  map 28% reduce 0%
13/07/13 00:15:25 INFO mapred.JobClient:  map 44% reduce 0%
13/07/13 00:15:27 INFO mapred.JobClient:  map 68% reduce 0%
13/07/13 00:15:30 INFO mapred.JobClient:  map 78% reduce 0%
13/07/13 00:15:34 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:15:36 INFO mapred.JobClient:  map 96% reduce 0%
13/07/13 00:15:39 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:15:54 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:15:56 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:15:59 INFO mapred.JobClient: Job complete: job_201307122107_0009
13/07/13 00:15:59 INFO mapred.JobClient: Counters: 33
13/07/13 00:15:59 INFO mapred.JobClient:   File System Counters
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes read=21497514
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes written=31791353
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes written=2139772
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:15:59 INFO mapred.JobClient:   Job Counters 
13/07/13 00:15:59 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=63399
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=26747
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output bytes=43967362
13/07/13 00:15:59 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:15:59 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce shuffle bytes=10171946
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Spilled Records=5615964
13/07/13 00:15:59 INFO mapred.JobClient:     CPU time spent (ms)=43710
13/07/13 00:15:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=767377408
13/07/13 00:15:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3596718080
13/07/13 00:15:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=397082624
13/07/13 00:15:59 INFO mapred.JobClient:   demo.LogProcessorMap$LOG_PROCESSOR_COUNTER
13/07/13 00:15:59 INFO mapred.JobClient:     BAD_RECORDS=1871988
# of Good Records :1871988
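
The counters above are easier to reason about once they are in a dictionary. The following is a small sketch that pulls `name=value` pairs out of JobClient log lines and derives two ratios; `parse_counters` and `COUNTER_LINE` are hypothetical helper names, and the sample text is copied from the Scenario 1 log.

```python
import re

# Matches JobClient counter lines such as
#   "... INFO mapred.JobClient:     Spilled Records=5615964"
COUNTER_LINE = re.compile(
    r"INFO mapred\.JobClient:\s+(?P<name>[^=]+)=(?P<value>\d+)\s*$"
)


def parse_counters(log_text):
    """Collect every 'name=value' counter found in a JobClient log."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group("name").strip()] = int(match.group("value"))
    return counters


SAMPLE = """\
13/07/13 00:15:59 INFO mapred.JobClient: Counters: 33
13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Spilled Records=5615964
"""

counters = parse_counters(SAMPLE)
# How many times, on average, each map output record was spilled to disk.
spills_per_record = counters["Spilled Records"] / counters["Map output records"]
# Average number of records that share one reduce key.
records_per_key = counters["Map output records"] / counters["Reduce input groups"]
print(spills_per_record)  # 3.0
print(round(records_per_key, 1))
```

A spill ratio well above 1 suggests that records were written to disk more than once between map and reduce, which is usually a hint that the sort buffer settings are worth a look.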

Scenario 2 output:

[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.genericwritable.LogProcessor nasa_access_log output 2
13/07/13 00:17:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:17:28 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:17:29 INFO mapred.JobClient: Running job: job_201307122107_0011
13/07/13 00:17:30 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:17:43 INFO mapred.JobClient:  map 24% reduce 0%
13/07/13 00:17:45 INFO mapred.JobClient:  map 33% reduce 0%
13/07/13 00:17:46 INFO mapred.JobClient:  map 49% reduce 0%
13/07/13 00:17:48 INFO mapred.JobClient:  map 57% reduce 0%
13/07/13 00:17:49 INFO mapred.JobClient:  map 66% reduce 0%
13/07/13 00:17:51 INFO mapred.JobClient:  map 75% reduce 0%
13/07/13 00:17:54 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:17:57 INFO mapred.JobClient:  map 99% reduce 0%
13/07/13 00:17:59 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:18:12 INFO mapred.JobClient:  map 100% reduce 50%
13/07/13 00:18:15 INFO mapred.JobClient:  map 100% reduce 69%
13/07/13 00:18:18 INFO mapred.JobClient:  map 100% reduce 70%
13/07/13 00:18:20 INFO mapred.JobClient:  map 100% reduce 83%
13/07/13 00:18:21 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:18:25 INFO mapred.JobClient:  map 100% reduce 86%
13/07/13 00:18:26 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:18:30 INFO mapred.JobClient: Job complete: job_201307122107_0011
13/07/13 00:18:30 INFO mapred.JobClient: Counters: 32
13/07/13 00:18:30 INFO mapred.JobClient:   File System Counters
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes read=70122269
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes written=103466795
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes written=86859890
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:18:30 INFO mapred.JobClient:   Job Counters 
13/07/13 00:18:30 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=47028
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=44185
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:18:30 INFO mapred.JobClient:     Map input records=1891715
13/07/13 00:18:30 INFO mapred.JobClient:     Map output records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Map output bytes=168829257
13/07/13 00:18:30 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:18:30 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce shuffle bytes=33609934
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Spilled Records=11231928
13/07/13 00:18:30 INFO mapred.JobClient:     CPU time spent (ms)=51290
13/07/13 00:18:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=914145280
13/07/13 00:18:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4566802432
13/07/13 00:18:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=573489152
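
To see what the GenericWritable variant costs, the two runs can be compared counter by counter. This sketch hard-codes values copied from the Scenario 1 and Scenario 2 logs; the dictionary keys and the `ratios` helper are illustrative names only.

```python
# Values copied from the two job logs in this post.
scenario_1 = {
    "Map output records": 1871988,
    "Map output bytes": 43967362,
    "Reduce shuffle bytes": 10171946,
    "HDFS bytes written": 2139772,
}
scenario_2 = {
    "Map output records": 3743976,
    "Map output bytes": 168829257,
    "Reduce shuffle bytes": 33609934,
    "HDFS bytes written": 86859890,
}


def ratios(baseline, other):
    """Return other/baseline for every counter present in the baseline."""
    return {name: other[name] / baseline[name] for name in baseline}


for name, value in ratios(scenario_1, scenario_2).items():
    print("%-22s x%.2f" % (name, value))
```

The map output record count exactly doubles, which is consistent with a mapper that emits two values per input line, although the source is not shown here so this is only an inference from the counters.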

Scenario 3 (Hadoop streaming MapReduce)

Copy the Python script to the Hadoop cluster:

$ scp logProcessor.py root@n1.example.com:/root/hadoop-examples
root@n1.example.com's password: 
logProcessor.py                                                                       100%  470     0.5KB/s   00:00

Output:

[root@n1 hadoop-examples]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar -input nasa_access_log -output output -mapper 'python logProcessor.py' -reducer aggregate -file logProcessor.py 
packageJobJar: [logProcessor.py, /tmp/hadoop-root/hadoop-unjar641255321819856404/] [] /tmp/streamjob5121005386227726797.jar tmpDir=null
13/07/13 00:34:05 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:34:05 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/13 00:34:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/07/13 00:34:06 INFO streaming.StreamJob: Running job: job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: To kill this job, run:
13/07/13 00:34:06 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=n1.example.com:8021 -kill job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: Tracking URL: http://n1.example.com:50030/jobdetails.jsp?jobid=job_201307122107_0015
13/07/13 00:34:07 INFO streaming.StreamJob:  map 0%  reduce 0%
13/07/13 00:34:24 INFO streaming.StreamJob:  map 11%  reduce 0%
13/07/13 00:34:25 INFO streaming.StreamJob:  map 25%  reduce 0%
13/07/13 00:34:27 INFO streaming.StreamJob:  map 39%  reduce 0%
13/07/13 00:34:28 INFO streaming.StreamJob:  map 52%  reduce 0%
13/07/13 00:34:31 INFO streaming.StreamJob:  map 75%  reduce 0%
13/07/13 00:34:33 INFO streaming.StreamJob:  map 87%  reduce 0%
13/07/13 00:34:34 INFO streaming.StreamJob:  map 100%  reduce 0%
13/07/13 00:34:46 INFO streaming.StreamJob:  map 100%  reduce 100%
13/07/13 00:34:50 INFO streaming.StreamJob: Job complete: job_201307122107_0015
13/07/13 00:34:50 INFO streaming.StreamJob: Output: output
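
The `-reducer aggregate` option expects each mapper output line to name an aggregator, for example `LongValueSum:<key>`, followed by a tab and a value. The script below is a minimal sketch of a mapper in that style and is not the `logProcessor.py` used above; it assumes the key is the client host (first field) and the value is the response size (last field), both of which are guesses about the job.

```python
#!/usr/bin/env python
import sys


def map_line(line):
    """Turn one access-log line into an aggregate-style record, or None."""
    fields = line.split()
    if len(fields) < 2:
        return None  # blank or truncated line
    host = fields[0]
    size = fields[-1]
    if not size.isdigit():
        size = "0"  # the log writes "-" when no bytes were sent
    # "LongValueSum:" tells the aggregate reducer to sum the values per key.
    return "LongValueSum:%s\t%s" % (host, size)


def main(stdin=sys.stdin, stdout=sys.stdout):
    for line in stdin:
        record = map_line(line)
        if record is not None:
            stdout.write(record + "\n")


# Only consume stdin when input is piped in, as Hadoop streaming does.
if __name__ == "__main__" and not sys.stdin.isatty():
    main()
```

The mapper can be tried locally before submitting the job by piping a few lines of the log file into it and checking that each output line starts with `LongValueSum:`.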

2013-07-13 00:36