Hive分析搜索引擎的数据（搜狗实验室数据） -

Kevin12

浏览: 238810 次
性别:
来自: 上海

最近访客更多访客>>

gtsle

a1002323289

changyuan126

一帆风顺

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

Hive分析搜索引擎的数据（搜狗实验室数据）

博客分类：

Hive

Hive Hadoop

搜狗实验室：http://www.sogou.com/labs/resources.html
用户查询日志：http://www.sogou.com/labs/dl/q.html

将上面三种数据全部下载下来，解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中；

注意：如果数据乱码的情况下，请先转成utf-8的格式，方便后面的练习；

数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中，用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值，即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列：时间；第二列：id；第四列：搜索关键词；第四列：关键词在搜索结果中的排名；第五列：查询结果列表中用户点击了第几个；第六列：url;

创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;

hive> create external table tbsogou(id string,websession string,keyword string,search_seq int,click_seq int,website string) partitioned by (d string) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 0.208 seconds
[img]http://dl2.iteye.com/upload/attachment/0115/7441/03188d9f-132d-3178-9e26-0668e800f756.png[/img]

创建对应分区目录,并将数据上传到对应的分区中

root@master1:/usr/local/hive/hivedata# ll
总用量 1379652
drwxr-xr-x 2 root root       4096  3月  6 19:05 ./
drwxr-xr-x 6 root root       4096  3月  6 19:08 ../
-rw-rw-r-- 1 jylu jylu  108750574  6月 14  2014 SogouQ1.txt
-rw-rw-r-- 1 jylu jylu  217441417  6月 14  2014 SogouQ2.txt
-rw-rw-r-- 1 jylu jylu 1086552775  7月 13  2014 SogouQ3.txt
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
16/03/06 22:26:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
16/03/06 22:26:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
16/03/06 22:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ2.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
16/03/06 22:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
16/03/06 22:27:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ3.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
16/03/06 22:27:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -lsr /user/hive/warehouse/testdb.db/tbsogou/
lsr: DEPRECATED: Please use 'ls -R' instead.
16/03/06 22:28:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - root supergroup          0 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
-rw-r--r--   2 root supergroup  108750574 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1/SogouQ1.txt
drwxr-xr-x   - root supergroup          0 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
-rw-r--r--   2 root supergroup  217441417 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2/SogouQ2.txt
drwxr-xr-x   - root supergroup          0 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
-rw-r--r--   2 root supergroup 1086552775 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3/SogouQ3.txt
root@master1:/usr/local/hive/hivedata#

加载数据到对应的分区中

hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq1')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq1';
OK
Time taken: 0.266 seconds
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq2')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq2';
OK
Time taken: 0.177 seconds
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq3')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq3';
OK
Time taken: 0.245 seconds

查看分区

hive> show partitions tbsogou;
OK
d=sogouq1
d=sogouq2
d=sogouq3
Time taken: 0.152 seconds, Fetched: 3 row(s)

查看每个分区的数量

hive>  select d,count(1) from tbsogou group by d;
Query ID = root_20160307213457_df786923-5b7e-41e3-be18-bf6c2c06a926
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 6
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457352585589_0001, Tracking URL = http://master1:8088/proxy/application_1457352585589_0001/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457352585589_0001
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 6
2016-03-07 21:35:21,593 Stage-1 map = 0%,  reduce = 0%
2016-03-07 21:36:22,322 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 3.86 sec
2016-03-07 21:36:39,202 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 13.62 sec
2016-03-07 21:36:42,474 Stage-1 map = 19%,  reduce = 0%, Cumulative CPU 14.91 sec
2016-03-07 21:36:46,047 Stage-1 map = 31%,  reduce = 0%, Cumulative CPU 15.84 sec
2016-03-07 21:36:48,170 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 16.12 sec
2016-03-07 21:36:49,239 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 16.56 sec
2016-03-07 21:36:58,007 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 17.53 sec
2016-03-07 21:37:07,956 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 21.51 sec
2016-03-07 21:37:12,077 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 24.44 sec
2016-03-07 21:37:15,700 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 25.86 sec
2016-03-07 21:37:16,828 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 26.18 sec
2016-03-07 21:37:53,346 Stage-1 map = 100%,  reduce = 4%, Cumulative CPU 27.15 sec
2016-03-07 21:37:56,467 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 28.87 sec
2016-03-07 21:37:58,541 Stage-1 map = 100%,  reduce = 28%, Cumulative CPU 29.83 sec
2016-03-07 21:37:59,589 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 31.31 sec
2016-03-07 21:38:01,668 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 33.3 sec
2016-03-07 21:38:02,712 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 35.63 sec
2016-03-07 21:38:05,801 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 37.31 sec
2016-03-07 21:38:06,831 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 39.19 sec
MapReduce Total cumulative CPU time: 39 seconds 190 msec
Ended Job = job_1457352585589_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 6   Cumulative CPU: 39.19 sec   HDFS Read: 1491774463 HDFS Write: 49 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 190 msec
OK
sogouq1    1000000
sogouq2    2000000
sogouq3    10000000
Time taken: 194.981 seconds, Fetched: 3 row(s)

从上面看执行情况，发现hive执行时使用了一个 job，执行时间比较长，原因有二：
1.因为自己搭建的是四台虚拟机进行测试，和生产环境的大集群没有可比性；
2.hive执行MapReduce过程本身就很慢；
执行的过程中可通过浏览器查看job运行情况：

练习：(使用分区为sogouq3的数据，因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个，点击了第一个的有多少；

hive> select count(*)cnt from tbsogou where d='sogouq3' and search_seq=1 and click_seq=1;
Query ID = root_20160313111650_9c4bfb1e-c330-4ef1-b6f1-ef0d74812678
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0005, Tracking URL = http://master1:8088/proxy/application_1457829749863_0005/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0005
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2016-03-13 11:17:00,345 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:17:12,813 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 4.75 sec
2016-03-13 11:17:13,845 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 5.04 sec
2016-03-13 11:17:16,936 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 8.86 sec
2016-03-13 11:17:17,975 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 9.81 sec
2016-03-13 11:17:26,298 Stage-1 map = 40%,  reduce = 13%, Cumulative CPU 15.56 sec
2016-03-13 11:17:28,357 Stage-1 map = 53%,  reduce = 13%, Cumulative CPU 19.81 sec
2016-03-13 11:17:32,491 Stage-1 map = 60%,  reduce = 13%, Cumulative CPU 21.32 sec
2016-03-13 11:17:38,644 Stage-1 map = 67%,  reduce = 20%, Cumulative CPU 22.65 sec
2016-03-13 11:17:40,708 Stage-1 map = 73%,  reduce = 20%, Cumulative CPU 23.21 sec
2016-03-13 11:17:44,834 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 24.91 sec
2016-03-13 11:17:47,903 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.27 sec
MapReduce Total cumulative CPU time: 26 seconds 270 msec
Ended Job = job_1457829749863_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 26.27 sec   HDFS Read: 1147287940 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 270 msec
OK
3891022
Time taken: 58.658 seconds, Fetched: 1 row(s)

2.查询分区d='sogouq3'中搜索的关键字排名前5的数据；

hive> select keyword,count(*) as cnt from tbsogou where d='sogouq3' group by keyword order by cnt desc limit 5;
Query ID = root_20160313111820_181be93c-07f4-4aa3-ab42-778b9cc2f312
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0006, Tracking URL = http://master1:8088/proxy/application_1457829749863_0006/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0006
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:18:31,471 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:18:46,946 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 6.6 sec
2016-03-13 11:18:57,652 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 30.33 sec
2016-03-13 11:18:59,234 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 32.57 sec
2016-03-13 11:19:02,356 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 34.44 sec
2016-03-13 11:19:03,644 Stage-1 map = 47%,  reduce = 0%, Cumulative CPU 36.89 sec
2016-03-13 11:19:05,416 Stage-1 map = 53%,  reduce = 0%, Cumulative CPU 36.89 sec
2016-03-13 11:19:09,267 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 40.57 sec
2016-03-13 11:19:10,466 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 42.18 sec
2016-03-13 11:19:22,900 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 45.82 sec
2016-03-13 11:19:27,047 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 49.62 sec
2016-03-13 11:19:28,201 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 50.0 sec
2016-03-13 11:19:29,228 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 51.6 sec
2016-03-13 11:19:38,556 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 52.38 sec
2016-03-13 11:19:44,830 Stage-1 map = 100%,  reduce = 26%, Cumulative CPU 53.72 sec
2016-03-13 11:19:47,907 Stage-1 map = 100%,  reduce = 27%, Cumulative CPU 56.14 sec
2016-03-13 11:19:51,016 Stage-1 map = 100%,  reduce = 29%, Cumulative CPU 59.15 sec
2016-03-13 11:19:54,105 Stage-1 map = 100%,  reduce = 34%, Cumulative CPU 61.99 sec
2016-03-13 11:19:56,161 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 63.76 sec
2016-03-13 11:20:13,727 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 67.11 sec
2016-03-13 11:20:15,901 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 71.64 sec
2016-03-13 11:20:16,927 Stage-1 map = 100%,  reduce = 88%, Cumulative CPU 74.51 sec
2016-03-13 11:20:20,012 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 76.43 sec
2016-03-13 11:20:23,082 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 78.55 sec
2016-03-13 11:20:25,143 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 79.45 sec
MapReduce Total cumulative CPU time: 1 minutes 19 seconds 450 msec
Ended Job = job_1457829749863_0006
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0007, Tracking URL = http://master1:8088/proxy/application_1457829749863_0007/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0007
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2016-03-13 11:20:37,281 Stage-2 map = 0%,  reduce = 0%
2016-03-13 11:20:55,833 Stage-2 map = 17%,  reduce = 0%, Cumulative CPU 2.7 sec
2016-03-13 11:20:58,917 Stage-2 map = 33%,  reduce = 0%, Cumulative CPU 7.23 sec
2016-03-13 11:20:59,942 Stage-2 map = 44%,  reduce = 0%, Cumulative CPU 8.87 sec
2016-03-13 11:21:00,966 Stage-2 map = 61%,  reduce = 0%, Cumulative CPU 9.46 sec
2016-03-13 11:21:03,019 Stage-2 map = 83%,  reduce = 0%, Cumulative CPU 11.85 sec
2016-03-13 11:21:06,098 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 13.3 sec
2016-03-13 11:21:11,232 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 17.45 sec
MapReduce Total cumulative CPU time: 17 seconds 450 msec
Ended Job = job_1457829749863_0007
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 79.45 sec   HDFS Read: 1147297321 HDFS Write: 115331402 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 17.45 sec   HDFS Read: 115338803 HDFS Write: 110 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 36 seconds 900 msec
OK
百度    77627
baidu    36564
人体艺术    29598
4399小游戏    23306
优酷    20847
Time taken: 172.861 seconds, Fetched: 5 row(s)
从分析的结果看，用过搜狗查询的关键词是“百度”的挺高的。

3.url搜索访问此时排名前5的

hive> select website ,count(website) as cnt from tbsogou where d='sogouq3' group by website order by cnt desc limit 5;
Query ID = root_20160313112231_24d147cb-8dce-412c-b255-67dc65660cd4
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0008, Tracking URL = http://master1:8088/proxy/application_1457829749863_0008/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0008
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:22:39,114 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:22:54,627 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 16.32 sec
2016-03-13 11:23:00,819 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 28.21 sec
2016-03-13 11:23:01,845 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 29.87 sec
2016-03-13 11:23:03,426 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 29.87 sec
2016-03-13 11:23:09,805 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 39.44 sec
2016-03-13 11:23:11,009 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 42.0 sec
2016-03-13 11:23:14,083 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 46.6 sec
2016-03-13 11:23:26,957 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 53.89 sec
2016-03-13 11:23:30,204 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 55.56 sec
2016-03-13 11:23:32,269 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 56.56 sec
2016-03-13 11:23:35,356 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 57.3 sec
2016-03-13 11:23:50,951 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 62.5 sec
2016-03-13 11:23:51,977 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 63.37 sec
2016-03-13 11:23:54,032 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU 63.52 sec
2016-03-13 11:23:55,053 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 63.69 sec
2016-03-13 11:23:58,131 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 64.0 sec
2016-03-13 11:24:00,172 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 64.32 sec
2016-03-13 11:24:03,255 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 64.59 sec
2016-03-13 11:24:06,351 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 65.05 sec
2016-03-13 11:24:08,590 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 65.28 sec
2016-03-13 11:24:11,687 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 65.67 sec
2016-03-13 11:24:18,963 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 66.17 sec
2016-03-13 11:24:49,446 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 69.83 sec
2016-03-13 11:24:50,461 Stage-1 map = 100%,  reduce = 31%, Cumulative CPU 71.2 sec
2016-03-13 11:24:52,685 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 72.16 sec
2016-03-13 11:24:53,716 Stage-1 map = 100%,  reduce = 47%, Cumulative CPU 73.29 sec
2016-03-13 11:24:55,755 Stage-1 map = 100%,  reduce = 53%, Cumulative CPU 73.96 sec
2016-03-13 11:25:09,145 Stage-1 map = 100%,  reduce = 59%, Cumulative CPU 74.82 sec
2016-03-13 11:25:13,234 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 79.95 sec
2016-03-13 11:25:14,267 Stage-1 map = 100%,  reduce = 61%, Cumulative CPU 84.89 sec
2016-03-13 11:25:18,691 Stage-1 map = 100%,  reduce = 62%, Cumulative CPU 86.7 sec
2016-03-13 11:25:19,725 Stage-1 map = 100%,  reduce = 63%, Cumulative CPU 87.4 sec
2016-03-13 11:25:21,777 Stage-1 map = 100%,  reduce = 65%, Cumulative CPU 88.1 sec
2016-03-13 11:25:22,801 Stage-1 map = 100%,  reduce = 66%, Cumulative CPU 88.66 sec
2016-03-13 11:25:24,370 Stage-1 map = 100%,  reduce = 69%, Cumulative CPU 90.07 sec
2016-03-13 11:25:25,806 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 91.48 sec
2016-03-13 11:25:26,846 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 92.05 sec
2016-03-13 11:25:27,899 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 93.2 sec
2016-03-13 11:25:28,922 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 93.57 sec
2016-03-13 11:25:32,045 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 94.4 sec
2016-03-13 11:25:34,164 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 95.36 sec
2016-03-13 11:25:37,297 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 96.01 sec
2016-03-13 11:25:44,816 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 97.26 sec
2016-03-13 11:25:55,122 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 98.23 sec
2016-03-13 11:25:58,213 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 99.1 sec
2016-03-13 11:26:01,270 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 101.44 sec
2016-03-13 11:26:04,348 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 103.83 sec
2016-03-13 11:26:06,427 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 104.88 sec
MapReduce Total cumulative CPU time: 1 minutes 44 seconds 880 msec
Ended Job = job_1457829749863_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0009, Tracking URL = http://master1:8088/proxy/application_1457829749863_0009/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0009
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2016-03-13 11:26:23,710 Stage-2 map = 0%,  reduce = 0%
2016-03-13 11:26:44,247 Stage-2 map = 28%,  reduce = 0%, Cumulative CPU 8.53 sec
2016-03-13 11:26:50,529 Stage-2 map = 57%,  reduce = 0%, Cumulative CPU 13.83 sec
2016-03-13 11:26:53,648 Stage-2 map = 66%,  reduce = 0%, Cumulative CPU 15.83 sec
2016-03-13 11:26:54,668 Stage-2 map = 72%,  reduce = 0%, Cumulative CPU 16.36 sec
2016-03-13 11:26:56,719 Stage-2 map = 83%,  reduce = 0%, Cumulative CPU 17.23 sec
2016-03-13 11:26:59,790 Stage-2 map = 93%,  reduce = 0%, Cumulative CPU 18.65 sec
2016-03-13 11:27:02,862 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 19.38 sec
2016-03-13 11:27:10,374 Stage-2 map = 100%,  reduce = 17%, Cumulative CPU 20.61 sec
2016-03-13 11:27:13,444 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 22.25 sec
2016-03-13 11:27:15,542 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 24.15 sec
MapReduce Total cumulative CPU time: 24 seconds 150 msec
Ended Job = job_1457829749863_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 104.88 sec   HDFS Read: 1147298011 HDFS Write: 297356296 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 24.15 sec   HDFS Read: 297363766 HDFS Write: 140 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 9 seconds 30 msec
OK
http://www.baidu.com/    148348
http://www.4399.com/    38611
http://www.youku.com/    28959
http://www.hao123.com/    28715
http://qzone.qq.com/    26020
Time taken: 285.674 seconds, Fetched: 5 row(s)

4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中；

drop table if exists tmp_pre_hour_seach_info;
create table tmp_pre_hour_seach_info
as
select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2);


hive> drop table if exists tmp_pre_hour_seach_info;
OK
Time taken: 0.019 seconds
hive> create table tmp_pre_hour_seach_info
    > as
    > select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2);
Query ID = root_20160313115133_bd005a7b-edf2-4c4c-a4c1-3b83c3d34335
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0015, Tracking URL = http://master1:8088/proxy/application_1457829749863_0015/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0015
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:51:41,275 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:52:01,969 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 13.69 sec
2016-03-13 11:52:03,000 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 14.02 sec
2016-03-13 11:52:11,604 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 28.14 sec
2016-03-13 11:52:13,801 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 29.28 sec
2016-03-13 11:52:17,954 Stage-1 map = 60%,  reduce = 1%, Cumulative CPU 33.8 sec
2016-03-13 11:52:19,522 Stage-1 map = 60%,  reduce = 3%, Cumulative CPU 34.15 sec
2016-03-13 11:52:20,803 Stage-1 map = 67%,  reduce = 3%, Cumulative CPU 35.55 sec
2016-03-13 11:52:22,921 Stage-1 map = 73%,  reduce = 4%, Cumulative CPU 35.76 sec
2016-03-13 11:52:23,947 Stage-1 map = 73%,  reduce = 5%, Cumulative CPU 37.26 sec
2016-03-13 11:52:25,437 Stage-1 map = 100%,  reduce = 5%, Cumulative CPU 38.09 sec
2016-03-13 11:52:26,481 Stage-1 map = 100%,  reduce = 7%, Cumulative CPU 38.15 sec
2016-03-13 11:52:27,553 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 38.2 sec
2016-03-13 11:52:51,246 Stage-1 map = 100%,  reduce = 11%, Cumulative CPU 38.85 sec
2016-03-13 11:52:53,288 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 39.85 sec
2016-03-13 11:52:54,307 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 41.77 sec
2016-03-13 11:52:59,424 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 43.97 sec
2016-03-13 11:53:09,679 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 47.44 sec
MapReduce Total cumulative CPU time: 47 seconds 440 msec
Ended Job = job_1457829749863_0015
Moving data to: hdfs://master1:9000/user/hive/warehouse/testdb.db/tmp_pre_hour_seach_info
Table testdb.tmp_pre_hour_seach_info stats: [numFiles=5, numRows=24, totalSize=234, rawDataSize=210]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 47.44 sec   HDFS Read: 1147301096 HDFS Write: 664 SUCCESS
Total MapReduce CPU Time Spent: 47 seconds 440 msec
OK
Time taken: 98.72 seconds
hive> select * from tmp_pre_hour_seach_info order by hour ;
Query ID = root_20160313122830_a6be9cad-fc18-4c9e-9950-f6f97be0f996
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0016, Tracking URL = http://master1:8088/proxy/application_1457829749863_0016/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0016
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2016-03-13 12:28:38,814 Stage-1 map = 0%,  reduce = 0%
2016-03-13 12:28:50,158 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.48 sec
2016-03-13 12:28:57,412 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.87 sec
MapReduce Total cumulative CPU time: 3 seconds 870 msec
Ended Job = job_1457829749863_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 3.87 sec   HDFS Read: 9717 HDFS Write: 234 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 870 msec
OK
00    167799
01    121494
02    85056
03    63484
04    52604
05    51630
06    61241
07    97949
08    307308
09    516763
10    612179
11    597820
12    595252
13    623990
14    614563
15    640062
16    636304
17    581630
18    591573
19    682096
20    707056
21    659366
22    542951
23    389667
Time taken: 27.597 seconds, Fetched: 24 row(s)

5.统计用户每个时段搜索所在比例；

hive> select hour,cnt/total from tmp_pre_hour_seach_info a join (select sum(cnt)total from tmp_pre_hour_seach_info) b where 1=1 order by hour;
Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
Query ID = root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0022, Tracking URL = http://master1:8088/proxy/application_1457829749863_0022/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0022
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2016-03-13 12:39:58,777 Stage-1 map = 0%,  reduce = 0%
2016-03-13 12:40:10,073 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.3 sec
2016-03-13 12:40:17,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_1457829749863_0022
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/13 12:40:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/root/root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89.log
2016-03-13 12:40:22    Starting to launch local task to process map join;    maximum memory = 518979584
2016-03-13 12:40:23    Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable
2016-03-13 12:40:24    Uploaded 1 File to: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (559 bytes)
2016-03-13 12:40:24    End of local task; Time Taken: 1.176 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0023, Tracking URL = http://master1:8088/proxy/application_1457829749863_0023/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0023
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2016-03-13 12:40:32,128 Stage-3 map = 0%,  reduce = 0%
2016-03-13 12:40:39,379 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.54 sec
2016-03-13 12:40:46,562 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1457829749863_0023
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 10788 HDFS Write: 117 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 2.98 sec   HDFS Read: 10776 HDFS Write: 556 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
00    0.016780173516828326
01    0.012149598038448027
02    0.00850573864353989
03    0.006348503480606734
04    0.005260485745917659
05    0.0051630841582717794
06    0.006124199824457138
07    0.009795059659472449
08    0.030731300920205
09    0.0516771423374201
10    0.06121889786803525
11    0.05978297446248374
12    0.05952617027657551
13    0.06240001712027906
14    0.06145730175401859
15    0.06400724331806608
16    0.06363143719242624
17    0.05816394807235358
18    0.05915826427970776
19    0.06821071183460291
20    0.07070675252006607
21    0.06593767478409898
22    0.0542959850245559
23    0.03896733516756323
Time taken: 55.825 seconds, Fetched: 24 row(s)