Homework - NASA Access Log Processing


Hadoop workshop homework.


For privacy, this post does not show any source code, only the job output logs and counters.

  • Copy the packaged jar file to the Hadoop cluster:
[root@n1 hadoop-examples]# scp gsun@ .
cdh4-examples.jar                                                                            100%   46KB  46.0KB/s   00:00 


  • Copy the input data into HDFS:
    $ scp NASA_access_log_Jul95.gz root@n1.example.com:/root/hadoop-examples
    root@n1.example.com's password: 
    NASA_access_log_Jul95.gz                                                              100%   20MB  19.7MB/s   00:00 
    [root@n1 hadoop-examples]# gunzip -d NASA_access_log_Jul95.gz 
    [root@n1 hadoop-examples]# hadoop fs -mkdir nasa_access_log
    [root@n1 hadoop-examples]# hadoop fs -copyFromLocal NASA_access_log_Jul95 ./nasa_access_log/
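
The NASA access log is published in Common Log Format, so each record carries a host, a timestamp, a request, a status code, and a response size. Since the post does not include the job's source, the following is only a minimal Python sketch of how one such line could be parsed. The names `LOG_PATTERN` and `parse_line` are hypothetical, and the sample line is an illustrative record in that format, not taken from the jobs below.

```python
import re

# Common Log Format: host ident user [timestamp] "request" status bytes
# (hypothetical helper; not the code used by the jobs in this post)
LOG_PATTERN = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>.*)" (?P<status>\d{3}) (?P<size>\d+|-)$'
)


def parse_line(line):
    """Return a dict of fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    # A "-" size means no response body was sent; treat it as 0 bytes.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    fields["status"] = int(fields["status"])
    return fields


if __name__ == "__main__":
    sample = ('199.72.81.55 - - [01/Jul/1995:00:00:01 -0400] '
              '"GET /history/apollo/ HTTP/1.0" 200 6245')
    print(parse_line(sample))
```

Returning `None` for lines that do not match lets the caller count malformed records separately instead of failing the whole run.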

Scenario 1 output:

[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.LogProcessor nasa_access_log output 2
13/07/13 00:14:56 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:14:57 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:14:58 INFO mapred.JobClient: Running job: job_201307122107_0009
13/07/13 00:14:59 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:15:17 INFO mapred.JobClient:  map 5% reduce 0%
13/07/13 00:15:18 INFO mapred.JobClient:  map 14% reduce 0%
13/07/13 00:15:21 INFO mapred.JobClient:  map 28% reduce 0%
13/07/13 00:15:25 INFO mapred.JobClient:  map 44% reduce 0%
13/07/13 00:15:27 INFO mapred.JobClient:  map 68% reduce 0%
13/07/13 00:15:30 INFO mapred.JobClient:  map 78% reduce 0%
13/07/13 00:15:34 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:15:36 INFO mapred.JobClient:  map 96% reduce 0%
13/07/13 00:15:39 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:15:54 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:15:56 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:15:59 INFO mapred.JobClient: Job complete: job_201307122107_0009
13/07/13 00:15:59 INFO mapred.JobClient: Counters: 33
13/07/13 00:15:59 INFO mapred.JobClient:   File System Counters
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes read=21497514
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of bytes written=31791353
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of bytes written=2139772
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:15:59 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:15:59 INFO mapred.JobClient:   Job Counters 
13/07/13 00:15:59 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=63399
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=26747
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:15:59 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output bytes=43967362
13/07/13 00:15:59 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:15:59 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce shuffle bytes=10171946
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:15:59 INFO mapred.JobClient:     Spilled Records=5615964
13/07/13 00:15:59 INFO mapred.JobClient:     CPU time spent (ms)=43710
13/07/13 00:15:59 INFO mapred.JobClient:     Physical memory (bytes) snapshot=767377408
13/07/13 00:15:59 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=3596718080
13/07/13 00:15:59 INFO mapred.JobClient:     Total committed heap usage (bytes)=397082624
13/07/13 00:15:59 INFO mapred.JobClient:   demo.LogProcessorMap$LOG_PROCESSOR_COUNTER
13/07/13 00:15:59 INFO mapred.JobClient:     BAD_RECORDS=1871988
# of Good Records :1871988
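
The counter lines above all follow a `name=value` pattern, which makes them easy to load into a dictionary when comparing runs. A minimal sketch, assuming the console output has been saved as text; `COUNTER_LINE` and `parse_counters` are hypothetical names, and the three sample lines are copied from the Scenario 1 log above.

```python
import re

# Matches counter lines such as:
# 13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
COUNTER_LINE = re.compile(
    r"INFO mapred\.JobClient:\s+(?P<name>.+?)=(?P<value>\d+)\s*$"
)


def parse_counters(log_text):
    """Collect every 'name=value' counter line into a dict of ints."""
    counters = {}
    for line in log_text.splitlines():
        match = COUNTER_LINE.search(line)
        if match:
            counters[match.group("name")] = int(match.group("value"))
    return counters


if __name__ == "__main__":
    sample = """\
13/07/13 00:15:59 INFO mapred.JobClient:     Map input records=1871988
13/07/13 00:15:59 INFO mapred.JobClient:     Map output bytes=43967362
13/07/13 00:15:59 INFO mapred.JobClient:     Reduce shuffle bytes=10171946
"""
    counters = parse_counters(sample)
    ratio = counters["Reduce shuffle bytes"] / counters["Map output bytes"]
    print(counters)
    print(f"shuffle bytes / map output bytes: {ratio:.2f}")
```

Lines without a numeric `name=value` pair, such as the progress lines and the `Counters: 33` header, are skipped.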


Scenario 2 output: 

[root@n1 hadoop-examples]# hadoop jar cdh4-examples.jar demo.genericwritable.LogProcessor nasa_access_log output 2
13/07/13 00:17:28 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:17:28 INFO input.FileInputFormat: Total input paths to process : 1
13/07/13 00:17:29 INFO mapred.JobClient: Running job: job_201307122107_0011
13/07/13 00:17:30 INFO mapred.JobClient:  map 0% reduce 0%
13/07/13 00:17:43 INFO mapred.JobClient:  map 24% reduce 0%
13/07/13 00:17:45 INFO mapred.JobClient:  map 33% reduce 0%
13/07/13 00:17:46 INFO mapred.JobClient:  map 49% reduce 0%
13/07/13 00:17:48 INFO mapred.JobClient:  map 57% reduce 0%
13/07/13 00:17:49 INFO mapred.JobClient:  map 66% reduce 0%
13/07/13 00:17:51 INFO mapred.JobClient:  map 75% reduce 0%
13/07/13 00:17:54 INFO mapred.JobClient:  map 87% reduce 0%
13/07/13 00:17:57 INFO mapred.JobClient:  map 99% reduce 0%
13/07/13 00:17:59 INFO mapred.JobClient:  map 100% reduce 0%
13/07/13 00:18:12 INFO mapred.JobClient:  map 100% reduce 50%
13/07/13 00:18:15 INFO mapred.JobClient:  map 100% reduce 69%
13/07/13 00:18:18 INFO mapred.JobClient:  map 100% reduce 70%
13/07/13 00:18:20 INFO mapred.JobClient:  map 100% reduce 83%
13/07/13 00:18:21 INFO mapred.JobClient:  map 100% reduce 84%
13/07/13 00:18:25 INFO mapred.JobClient:  map 100% reduce 86%
13/07/13 00:18:26 INFO mapred.JobClient:  map 100% reduce 100%
13/07/13 00:18:30 INFO mapred.JobClient: Job complete: job_201307122107_0011
13/07/13 00:18:30 INFO mapred.JobClient: Counters: 32
13/07/13 00:18:30 INFO mapred.JobClient:   File System Counters
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes read=70122269
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of bytes written=103466795
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     FILE: Number of write operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes read=205308182
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of bytes written=86859890
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of read operations=4
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of large read operations=0
13/07/13 00:18:30 INFO mapred.JobClient:     HDFS: Number of write operations=2
13/07/13 00:18:30 INFO mapred.JobClient:   Job Counters 
13/07/13 00:18:30 INFO mapred.JobClient:     Launched map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Launched reduce tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Data-local map tasks=2
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps in occupied slots (ms)=47028
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces in occupied slots (ms)=44185
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all maps waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:     Total time spent by all reduces waiting after reserving slots (ms)=0
13/07/13 00:18:30 INFO mapred.JobClient:   Map-Reduce Framework
13/07/13 00:18:30 INFO mapred.JobClient:     Map input records=1891715
13/07/13 00:18:30 INFO mapred.JobClient:     Map output records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Map output bytes=168829257
13/07/13 00:18:30 INFO mapred.JobClient:     Input split bytes=278
13/07/13 00:18:30 INFO mapred.JobClient:     Combine input records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Combine output records=0
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input groups=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce shuffle bytes=33609934
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce input records=3743976
13/07/13 00:18:30 INFO mapred.JobClient:     Reduce output records=81621
13/07/13 00:18:30 INFO mapred.JobClient:     Spilled Records=11231928
13/07/13 00:18:30 INFO mapred.JobClient:     CPU time spent (ms)=51290
13/07/13 00:18:30 INFO mapred.JobClient:     Physical memory (bytes) snapshot=914145280
13/07/13 00:18:30 INFO mapred.JobClient:     Virtual memory (bytes) snapshot=4566802432
13/07/13 00:18:30 INFO mapred.JobClient:     Total committed heap usage (bytes)=573489152

Scenario 3 (Hadoop streaming MapReduce)

Copy the Python script to the Hadoop cluster:

$ scp logProcessor.py root@n1.example.com:/root/hadoop-examples
root@n1.example.com's password: 
logProcessor.py                                                                       100%  470     0.5KB/s   00:00 


[root@n1 hadoop-examples]# hadoop jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar -input nasa_access_log -output output -mapper 'python logProcessor.py' -reducer aggregate -file logProcessor.py 
packageJobJar: [logProcessor.py, /tmp/hadoop-root/hadoop-unjar641255321819856404/] [] /tmp/streamjob5121005386227726797.jar tmpDir=null
13/07/13 00:34:05 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/07/13 00:34:05 INFO mapred.FileInputFormat: Total input paths to process : 1
13/07/13 00:34:06 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-root/mapred/local]
13/07/13 00:34:06 INFO streaming.StreamJob: Running job: job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: To kill this job, run:
13/07/13 00:34:06 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=n1.example.com:8021 -kill job_201307122107_0015
13/07/13 00:34:06 INFO streaming.StreamJob: Tracking URL: http://n1.example.com:50030/jobdetails.jsp?jobid=job_201307122107_0015
13/07/13 00:34:07 INFO streaming.StreamJob:  map 0%  reduce 0%
13/07/13 00:34:24 INFO streaming.StreamJob:  map 11%  reduce 0%
13/07/13 00:34:25 INFO streaming.StreamJob:  map 25%  reduce 0%
13/07/13 00:34:27 INFO streaming.StreamJob:  map 39%  reduce 0%
13/07/13 00:34:28 INFO streaming.StreamJob:  map 52%  reduce 0%
13/07/13 00:34:31 INFO streaming.StreamJob:  map 75%  reduce 0%
13/07/13 00:34:33 INFO streaming.StreamJob:  map 87%  reduce 0%
13/07/13 00:34:34 INFO streaming.StreamJob:  map 100%  reduce 0%
13/07/13 00:34:46 INFO streaming.StreamJob:  map 100%  reduce 100%
13/07/13 00:34:50 INFO streaming.StreamJob: Job complete: job_201307122107_0015
13/07/13 00:34:50 INFO streaming.StreamJob: Output: output
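
With `-reducer aggregate`, Hadoop Streaming expects each mapper output line to name an aggregator function in front of the key, for example `LongValueSum:<key>`, followed by a tab and the value. The actual `logProcessor.py` is not shown in this post, so the following is only a hypothetical mapper sketch. It assumes, purely for illustration, that the job sums response bytes per host; the function names are invented for this example.

```python
import sys


def map_line(line):
    """Turn one access-log line into an aggregate-package record, or None."""
    parts = line.split()
    if len(parts) < 2:
        return None  # blank or truncated line
    host, size = parts[0], parts[-1]
    if not size.isdigit():
        size = "0"  # a "-" size means no bytes were returned
    # "LongValueSum:" tells the aggregate reducer to sum the values per key.
    return "LongValueSum:%s\t%s" % (host, size)


def main(stdin=sys.stdin, stdout=sys.stdout):
    # Streaming feeds input records on stdin and reads results from stdout.
    for line in stdin:
        record = map_line(line)
        if record is not None:
            stdout.write(record + "\n")


if __name__ == "__main__":
    main()
```

A mapper like this can be checked locally before submitting the job by piping a few log lines into it, since it only reads stdin and writes stdout.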





