`
Kevin12
  • 浏览: 235334 次
  • 性别: Icon_minigender_1
  • 来自: 上海
社区版块
存档分类
最新评论

Hive分析搜索引擎的数据(搜狗实验室数据)

    博客分类:
  • Hive
阅读更多
搜狗实验室:http://www.sogou.com/labs/resources.html
用户查询日志:http://www.sogou.com/labs/dl/q.html



将上面三种数据全部下载下来,解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中;


注意:如果数据乱码的情况下,请先转成utf-8的格式,方便后面的练习;

数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列:时间;第二列:id;第四列:搜索关键词;第四列:关键词在搜索结果中的排名;第五列:查询结果列表中用户点击了第几个;第六列:url;

创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;
hive> create external table tbsogou(id string,websession string,keyword string,search_seq int,click_seq int,website string) partitioned by (d string) row format delimited fields terminated by '\t' lines terminated by '\n';
OK
Time taken: 0.208 seconds
[img]http://dl2.iteye.com/upload/attachment/0115/7441/03188d9f-132d-3178-9e26-0668e800f756.png[/img]


创建对应分区目录,并将数据上传到对应的分区中
root@master1:/usr/local/hive/hivedata# ll
总用量 1379652
drwxr-xr-x 2 root root       4096  3月  6 19:05 ./
drwxr-xr-x 6 root root       4096  3月  6 19:08 ../
-rw-rw-r-- 1 jylu jylu  108750574  6月 14  2014 SogouQ1.txt
-rw-rw-r-- 1 jylu jylu  217441417  6月 14  2014 SogouQ2.txt
-rw-rw-r-- 1 jylu jylu 1086552775  7月 13  2014 SogouQ3.txt
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
16/03/06 22:26:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
16/03/06 22:26:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
16/03/06 22:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ2.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
16/03/06 22:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
16/03/06 22:27:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ3.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
16/03/06 22:27:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
root@master1:/usr/local/hive/hivedata# hdfs dfs -lsr /user/hive/warehouse/testdb.db/tbsogou/
lsr: DEPRECATED: Please use 'ls -R' instead.
16/03/06 22:28:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
drwxr-xr-x   - root supergroup          0 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1
-rw-r--r--   2 root supergroup  108750574 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1/SogouQ1.txt
drwxr-xr-x   - root supergroup          0 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2
-rw-r--r--   2 root supergroup  217441417 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2/SogouQ2.txt
drwxr-xr-x   - root supergroup          0 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3
-rw-r--r--   2 root supergroup 1086552775 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3/SogouQ3.txt
root@master1:/usr/local/hive/hivedata# 


加载数据到对应的分区中
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq1')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq1';
OK
Time taken: 0.266 seconds
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq2')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq2';
OK
Time taken: 0.177 seconds
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq3')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq3';
OK
Time taken: 0.245 seconds


查看分区
hive> show partitions tbsogou;
OK
d=sogouq1
d=sogouq2
d=sogouq3
Time taken: 0.152 seconds, Fetched: 3 row(s)


查看每个分区的数量
hive>  select d,count(1) from tbsogou group by d;
Query ID = root_20160307213457_df786923-5b7e-41e3-be18-bf6c2c06a926
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 6
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457352585589_0001, Tracking URL = http://master1:8088/proxy/application_1457352585589_0001/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457352585589_0001
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 6
2016-03-07 21:35:21,593 Stage-1 map = 0%,  reduce = 0%
2016-03-07 21:36:22,322 Stage-1 map = 0%,  reduce = 0%, Cumulative CPU 3.86 sec
2016-03-07 21:36:39,202 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 13.62 sec
2016-03-07 21:36:42,474 Stage-1 map = 19%,  reduce = 0%, Cumulative CPU 14.91 sec
2016-03-07 21:36:46,047 Stage-1 map = 31%,  reduce = 0%, Cumulative CPU 15.84 sec
2016-03-07 21:36:48,170 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 16.12 sec
2016-03-07 21:36:49,239 Stage-1 map = 49%,  reduce = 0%, Cumulative CPU 16.56 sec
2016-03-07 21:36:58,007 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 17.53 sec
2016-03-07 21:37:07,956 Stage-1 map = 63%,  reduce = 0%, Cumulative CPU 21.51 sec
2016-03-07 21:37:12,077 Stage-1 map = 68%,  reduce = 0%, Cumulative CPU 24.44 sec
2016-03-07 21:37:15,700 Stage-1 map = 75%,  reduce = 0%, Cumulative CPU 25.86 sec
2016-03-07 21:37:16,828 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 26.18 sec
2016-03-07 21:37:53,346 Stage-1 map = 100%,  reduce = 4%, Cumulative CPU 27.15 sec
2016-03-07 21:37:56,467 Stage-1 map = 100%,  reduce = 22%, Cumulative CPU 28.87 sec
2016-03-07 21:37:58,541 Stage-1 map = 100%,  reduce = 28%, Cumulative CPU 29.83 sec
2016-03-07 21:37:59,589 Stage-1 map = 100%,  reduce = 33%, Cumulative CPU 31.31 sec
2016-03-07 21:38:01,668 Stage-1 map = 100%,  reduce = 50%, Cumulative CPU 33.3 sec
2016-03-07 21:38:02,712 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 35.63 sec
2016-03-07 21:38:05,801 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 37.31 sec
2016-03-07 21:38:06,831 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 39.19 sec
MapReduce Total cumulative CPU time: 39 seconds 190 msec
Ended Job = job_1457352585589_0001
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 6   Cumulative CPU: 39.19 sec   HDFS Read: 1491774463 HDFS Write: 49 SUCCESS
Total MapReduce CPU Time Spent: 39 seconds 190 msec
OK
sogouq1    1000000
sogouq2    2000000
sogouq3    10000000
Time taken: 194.981 seconds, Fetched: 3 row(s)

从上面看执行情况,发现hive执行时使用了一个 job,执行时间比较长,原因有二:
1.因为自己搭建的是四台虚拟机进行测试,和生产环境的大集群没有可比性;
2.hive执行MapReduce过程本身就很慢;
执行的过程中可通过浏览器查看job运行情况:




练习:(使用分区为sogouq3的数据,因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个,点击了第一个的有多少;
hive> select count(*)cnt from tbsogou where d='sogouq3' and search_seq=1 and click_seq=1;
Query ID = root_20160313111650_9c4bfb1e-c330-4ef1-b6f1-ef0d74812678
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0005, Tracking URL = http://master1:8088/proxy/application_1457829749863_0005/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0005
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1
2016-03-13 11:17:00,345 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:17:12,813 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 4.75 sec
2016-03-13 11:17:13,845 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 5.04 sec
2016-03-13 11:17:16,936 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 8.86 sec
2016-03-13 11:17:17,975 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 9.81 sec
2016-03-13 11:17:26,298 Stage-1 map = 40%,  reduce = 13%, Cumulative CPU 15.56 sec
2016-03-13 11:17:28,357 Stage-1 map = 53%,  reduce = 13%, Cumulative CPU 19.81 sec
2016-03-13 11:17:32,491 Stage-1 map = 60%,  reduce = 13%, Cumulative CPU 21.32 sec
2016-03-13 11:17:38,644 Stage-1 map = 67%,  reduce = 20%, Cumulative CPU 22.65 sec
2016-03-13 11:17:40,708 Stage-1 map = 73%,  reduce = 20%, Cumulative CPU 23.21 sec
2016-03-13 11:17:44,834 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 24.91 sec
2016-03-13 11:17:47,903 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 26.27 sec
MapReduce Total cumulative CPU time: 26 seconds 270 msec
Ended Job = job_1457829749863_0005
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 1   Cumulative CPU: 26.27 sec   HDFS Read: 1147287940 HDFS Write: 8 SUCCESS
Total MapReduce CPU Time Spent: 26 seconds 270 msec
OK
3891022
Time taken: 58.658 seconds, Fetched: 1 row(s)


2.查询分区d='sogouq3'中搜索的关键字排名前5的数据;
hive> select keyword,count(*) as cnt from tbsogou where d='sogouq3' group by keyword order by cnt desc limit 5;
Query ID = root_20160313111820_181be93c-07f4-4aa3-ab42-778b9cc2f312
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0006, Tracking URL = http://master1:8088/proxy/application_1457829749863_0006/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0006
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:18:31,471 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:18:46,946 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 6.6 sec
2016-03-13 11:18:57,652 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 30.33 sec
2016-03-13 11:18:59,234 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 32.57 sec
2016-03-13 11:19:02,356 Stage-1 map = 40%,  reduce = 0%, Cumulative CPU 34.44 sec
2016-03-13 11:19:03,644 Stage-1 map = 47%,  reduce = 0%, Cumulative CPU 36.89 sec
2016-03-13 11:19:05,416 Stage-1 map = 53%,  reduce = 0%, Cumulative CPU 36.89 sec
2016-03-13 11:19:09,267 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 40.57 sec
2016-03-13 11:19:10,466 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 42.18 sec
2016-03-13 11:19:22,900 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 45.82 sec
2016-03-13 11:19:27,047 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 49.62 sec
2016-03-13 11:19:28,201 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 50.0 sec
2016-03-13 11:19:29,228 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 51.6 sec
2016-03-13 11:19:38,556 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 52.38 sec
2016-03-13 11:19:44,830 Stage-1 map = 100%,  reduce = 26%, Cumulative CPU 53.72 sec
2016-03-13 11:19:47,907 Stage-1 map = 100%,  reduce = 27%, Cumulative CPU 56.14 sec
2016-03-13 11:19:51,016 Stage-1 map = 100%,  reduce = 29%, Cumulative CPU 59.15 sec
2016-03-13 11:19:54,105 Stage-1 map = 100%,  reduce = 34%, Cumulative CPU 61.99 sec
2016-03-13 11:19:56,161 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 63.76 sec
2016-03-13 11:20:13,727 Stage-1 map = 100%,  reduce = 67%, Cumulative CPU 67.11 sec
2016-03-13 11:20:15,901 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 71.64 sec
2016-03-13 11:20:16,927 Stage-1 map = 100%,  reduce = 88%, Cumulative CPU 74.51 sec
2016-03-13 11:20:20,012 Stage-1 map = 100%,  reduce = 90%, Cumulative CPU 76.43 sec
2016-03-13 11:20:23,082 Stage-1 map = 100%,  reduce = 96%, Cumulative CPU 78.55 sec
2016-03-13 11:20:25,143 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 79.45 sec
MapReduce Total cumulative CPU time: 1 minutes 19 seconds 450 msec
Ended Job = job_1457829749863_0006
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0007, Tracking URL = http://master1:8088/proxy/application_1457829749863_0007/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0007
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2016-03-13 11:20:37,281 Stage-2 map = 0%,  reduce = 0%
2016-03-13 11:20:55,833 Stage-2 map = 17%,  reduce = 0%, Cumulative CPU 2.7 sec
2016-03-13 11:20:58,917 Stage-2 map = 33%,  reduce = 0%, Cumulative CPU 7.23 sec
2016-03-13 11:20:59,942 Stage-2 map = 44%,  reduce = 0%, Cumulative CPU 8.87 sec
2016-03-13 11:21:00,966 Stage-2 map = 61%,  reduce = 0%, Cumulative CPU 9.46 sec
2016-03-13 11:21:03,019 Stage-2 map = 83%,  reduce = 0%, Cumulative CPU 11.85 sec
2016-03-13 11:21:06,098 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 13.3 sec
2016-03-13 11:21:11,232 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 17.45 sec
MapReduce Total cumulative CPU time: 17 seconds 450 msec
Ended Job = job_1457829749863_0007
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 79.45 sec   HDFS Read: 1147297321 HDFS Write: 115331402 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 17.45 sec   HDFS Read: 115338803 HDFS Write: 110 SUCCESS
Total MapReduce CPU Time Spent: 1 minutes 36 seconds 900 msec
OK
百度    77627
baidu    36564
人体艺术    29598
4399小游戏    23306
优酷    20847
Time taken: 172.861 seconds, Fetched: 5 row(s)
从分析的结果看,用过搜狗查询的关键词是“百度”的挺高的。


3.url搜索访问此时排名前5的
hive> select website ,count(website) as cnt from tbsogou where d='sogouq3' group by website order by cnt desc limit 5;
Query ID = root_20160313112231_24d147cb-8dce-412c-b255-67dc65660cd4
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0008, Tracking URL = http://master1:8088/proxy/application_1457829749863_0008/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0008
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:22:39,114 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:22:54,627 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 16.32 sec
2016-03-13 11:23:00,819 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 28.21 sec
2016-03-13 11:23:01,845 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 29.87 sec
2016-03-13 11:23:03,426 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 29.87 sec
2016-03-13 11:23:09,805 Stage-1 map = 45%,  reduce = 0%, Cumulative CPU 39.44 sec
2016-03-13 11:23:11,009 Stage-1 map = 60%,  reduce = 0%, Cumulative CPU 42.0 sec
2016-03-13 11:23:14,083 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 46.6 sec
2016-03-13 11:23:26,957 Stage-1 map = 69%,  reduce = 0%, Cumulative CPU 53.89 sec
2016-03-13 11:23:30,204 Stage-1 map = 73%,  reduce = 0%, Cumulative CPU 55.56 sec
2016-03-13 11:23:32,269 Stage-1 map = 80%,  reduce = 0%, Cumulative CPU 56.56 sec
2016-03-13 11:23:35,356 Stage-1 map = 87%,  reduce = 0%, Cumulative CPU 57.3 sec
2016-03-13 11:23:50,951 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 62.5 sec
2016-03-13 11:23:51,977 Stage-1 map = 91%,  reduce = 0%, Cumulative CPU 63.37 sec
2016-03-13 11:23:54,032 Stage-1 map = 92%,  reduce = 0%, Cumulative CPU 63.52 sec
2016-03-13 11:23:55,053 Stage-1 map = 93%,  reduce = 0%, Cumulative CPU 63.69 sec
2016-03-13 11:23:58,131 Stage-1 map = 94%,  reduce = 0%, Cumulative CPU 64.0 sec
2016-03-13 11:24:00,172 Stage-1 map = 95%,  reduce = 0%, Cumulative CPU 64.32 sec
2016-03-13 11:24:03,255 Stage-1 map = 96%,  reduce = 0%, Cumulative CPU 64.59 sec
2016-03-13 11:24:06,351 Stage-1 map = 97%,  reduce = 0%, Cumulative CPU 65.05 sec
2016-03-13 11:24:08,590 Stage-1 map = 98%,  reduce = 0%, Cumulative CPU 65.28 sec
2016-03-13 11:24:11,687 Stage-1 map = 99%,  reduce = 0%, Cumulative CPU 65.67 sec
2016-03-13 11:24:18,963 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 66.17 sec
2016-03-13 11:24:49,446 Stage-1 map = 100%,  reduce = 24%, Cumulative CPU 69.83 sec
2016-03-13 11:24:50,461 Stage-1 map = 100%,  reduce = 31%, Cumulative CPU 71.2 sec
2016-03-13 11:24:52,685 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 72.16 sec
2016-03-13 11:24:53,716 Stage-1 map = 100%,  reduce = 47%, Cumulative CPU 73.29 sec
2016-03-13 11:24:55,755 Stage-1 map = 100%,  reduce = 53%, Cumulative CPU 73.96 sec
2016-03-13 11:25:09,145 Stage-1 map = 100%,  reduce = 59%, Cumulative CPU 74.82 sec
2016-03-13 11:25:13,234 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 79.95 sec
2016-03-13 11:25:14,267 Stage-1 map = 100%,  reduce = 61%, Cumulative CPU 84.89 sec
2016-03-13 11:25:18,691 Stage-1 map = 100%,  reduce = 62%, Cumulative CPU 86.7 sec
2016-03-13 11:25:19,725 Stage-1 map = 100%,  reduce = 63%, Cumulative CPU 87.4 sec
2016-03-13 11:25:21,777 Stage-1 map = 100%,  reduce = 65%, Cumulative CPU 88.1 sec
2016-03-13 11:25:22,801 Stage-1 map = 100%,  reduce = 66%, Cumulative CPU 88.66 sec
2016-03-13 11:25:24,370 Stage-1 map = 100%,  reduce = 69%, Cumulative CPU 90.07 sec
2016-03-13 11:25:25,806 Stage-1 map = 100%,  reduce = 73%, Cumulative CPU 91.48 sec
2016-03-13 11:25:26,846 Stage-1 map = 100%,  reduce = 74%, Cumulative CPU 92.05 sec
2016-03-13 11:25:27,899 Stage-1 map = 100%,  reduce = 77%, Cumulative CPU 93.2 sec
2016-03-13 11:25:28,922 Stage-1 map = 100%,  reduce = 79%, Cumulative CPU 93.57 sec
2016-03-13 11:25:32,045 Stage-1 map = 100%,  reduce = 80%, Cumulative CPU 94.4 sec
2016-03-13 11:25:34,164 Stage-1 map = 100%,  reduce = 82%, Cumulative CPU 95.36 sec
2016-03-13 11:25:37,297 Stage-1 map = 100%,  reduce = 83%, Cumulative CPU 96.01 sec
2016-03-13 11:25:44,816 Stage-1 map = 100%,  reduce = 85%, Cumulative CPU 97.26 sec
2016-03-13 11:25:55,122 Stage-1 map = 100%,  reduce = 87%, Cumulative CPU 98.23 sec
2016-03-13 11:25:58,213 Stage-1 map = 100%,  reduce = 93%, Cumulative CPU 99.1 sec
2016-03-13 11:26:01,270 Stage-1 map = 100%,  reduce = 94%, Cumulative CPU 101.44 sec
2016-03-13 11:26:04,348 Stage-1 map = 100%,  reduce = 97%, Cumulative CPU 103.83 sec
2016-03-13 11:26:06,427 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 104.88 sec
MapReduce Total cumulative CPU time: 1 minutes 44 seconds 880 msec
Ended Job = job_1457829749863_0008
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0009, Tracking URL = http://master1:8088/proxy/application_1457829749863_0009/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0009
Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
2016-03-13 11:26:23,710 Stage-2 map = 0%,  reduce = 0%
2016-03-13 11:26:44,247 Stage-2 map = 28%,  reduce = 0%, Cumulative CPU 8.53 sec
2016-03-13 11:26:50,529 Stage-2 map = 57%,  reduce = 0%, Cumulative CPU 13.83 sec
2016-03-13 11:26:53,648 Stage-2 map = 66%,  reduce = 0%, Cumulative CPU 15.83 sec
2016-03-13 11:26:54,668 Stage-2 map = 72%,  reduce = 0%, Cumulative CPU 16.36 sec
2016-03-13 11:26:56,719 Stage-2 map = 83%,  reduce = 0%, Cumulative CPU 17.23 sec
2016-03-13 11:26:59,790 Stage-2 map = 93%,  reduce = 0%, Cumulative CPU 18.65 sec
2016-03-13 11:27:02,862 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 19.38 sec
2016-03-13 11:27:10,374 Stage-2 map = 100%,  reduce = 17%, Cumulative CPU 20.61 sec
2016-03-13 11:27:13,444 Stage-2 map = 100%,  reduce = 67%, Cumulative CPU 22.25 sec
2016-03-13 11:27:15,542 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 24.15 sec
MapReduce Total cumulative CPU time: 24 seconds 150 msec
Ended Job = job_1457829749863_0009
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 104.88 sec   HDFS Read: 1147298011 HDFS Write: 297356296 SUCCESS
Stage-Stage-2: Map: 2  Reduce: 1   Cumulative CPU: 24.15 sec   HDFS Read: 297363766 HDFS Write: 140 SUCCESS
Total MapReduce CPU Time Spent: 2 minutes 9 seconds 30 msec
OK
http://www.baidu.com/    148348
http://www.4399.com/    38611
http://www.youku.com/    28959
http://www.hao123.com/    28715
http://qzone.qq.com/    26020
Time taken: 285.674 seconds, Fetched: 5 row(s)


4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中;
drop table if exists tmp_pre_hour_seach_info;
create table tmp_pre_hour_seach_info
as
select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2);


hive> drop table if exists tmp_pre_hour_seach_info;
OK
Time taken: 0.019 seconds
hive> create table tmp_pre_hour_seach_info
    > as
    > select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2);
Query ID = root_20160313115133_bd005a7b-edf2-4c4c-a4c1-3b83c3d34335
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Estimated from input data size: 5
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0015, Tracking URL = http://master1:8088/proxy/application_1457829749863_0015/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0015
Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5
2016-03-13 11:51:41,275 Stage-1 map = 0%,  reduce = 0%
2016-03-13 11:52:01,969 Stage-1 map = 7%,  reduce = 0%, Cumulative CPU 13.69 sec
2016-03-13 11:52:03,000 Stage-1 map = 20%,  reduce = 0%, Cumulative CPU 14.02 sec
2016-03-13 11:52:11,604 Stage-1 map = 27%,  reduce = 0%, Cumulative CPU 28.14 sec
2016-03-13 11:52:13,801 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 29.28 sec
2016-03-13 11:52:17,954 Stage-1 map = 60%,  reduce = 1%, Cumulative CPU 33.8 sec
2016-03-13 11:52:19,522 Stage-1 map = 60%,  reduce = 3%, Cumulative CPU 34.15 sec
2016-03-13 11:52:20,803 Stage-1 map = 67%,  reduce = 3%, Cumulative CPU 35.55 sec
2016-03-13 11:52:22,921 Stage-1 map = 73%,  reduce = 4%, Cumulative CPU 35.76 sec
2016-03-13 11:52:23,947 Stage-1 map = 73%,  reduce = 5%, Cumulative CPU 37.26 sec
2016-03-13 11:52:25,437 Stage-1 map = 100%,  reduce = 5%, Cumulative CPU 38.09 sec
2016-03-13 11:52:26,481 Stage-1 map = 100%,  reduce = 7%, Cumulative CPU 38.15 sec
2016-03-13 11:52:27,553 Stage-1 map = 100%,  reduce = 8%, Cumulative CPU 38.2 sec
2016-03-13 11:52:51,246 Stage-1 map = 100%,  reduce = 11%, Cumulative CPU 38.85 sec
2016-03-13 11:52:53,288 Stage-1 map = 100%,  reduce = 20%, Cumulative CPU 39.85 sec
2016-03-13 11:52:54,307 Stage-1 map = 100%,  reduce = 40%, Cumulative CPU 41.77 sec
2016-03-13 11:52:59,424 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 43.97 sec
2016-03-13 11:53:09,679 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 47.44 sec
MapReduce Total cumulative CPU time: 47 seconds 440 msec
Ended Job = job_1457829749863_0015
Moving data to: hdfs://master1:9000/user/hive/warehouse/testdb.db/tmp_pre_hour_seach_info
Table testdb.tmp_pre_hour_seach_info stats: [numFiles=5, numRows=24, totalSize=234, rawDataSize=210]
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 5  Reduce: 5   Cumulative CPU: 47.44 sec   HDFS Read: 1147301096 HDFS Write: 664 SUCCESS
Total MapReduce CPU Time Spent: 47 seconds 440 msec
OK
Time taken: 98.72 seconds
hive> select * from tmp_pre_hour_seach_info order by hour ;
Query ID = root_20160313122830_a6be9cad-fc18-4c9e-9950-f6f97be0f996
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0016, Tracking URL = http://master1:8088/proxy/application_1457829749863_0016/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0016
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2016-03-13 12:28:38,814 Stage-1 map = 0%,  reduce = 0%
2016-03-13 12:28:50,158 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.48 sec
2016-03-13 12:28:57,412 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.87 sec
MapReduce Total cumulative CPU time: 3 seconds 870 msec
Ended Job = job_1457829749863_0016
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 3.87 sec   HDFS Read: 9717 HDFS Write: 234 SUCCESS
Total MapReduce CPU Time Spent: 3 seconds 870 msec
OK
00    167799
01    121494
02    85056
03    63484
04    52604
05    51630
06    61241
07    97949
08    307308
09    516763
10    612179
11    597820
12    595252
13    623990
14    614563
15    640062
16    636304
17    581630
18    591573
19    682096
20    707056
21    659366
22    542951
23    389667
Time taken: 27.597 seconds, Fetched: 24 row(s)


5.统计用户每个时段搜索所在比例;
hive> select hour,cnt/total from tmp_pre_hour_seach_info a join (select sum(cnt)total from tmp_pre_hour_seach_info) b where 1=1 order by hour;
Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Stage-3:MAPRED' is a cross product
Query ID = root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89
Total jobs = 2
Launching Job 1 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0022, Tracking URL = http://master1:8088/proxy/application_1457829749863_0022/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0022
Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1
2016-03-13 12:39:58,777 Stage-1 map = 0%,  reduce = 0%
2016-03-13 12:40:10,073 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.3 sec
2016-03-13 12:40:17,294 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 3.48 sec
MapReduce Total cumulative CPU time: 3 seconds 480 msec
Ended Job = job_1457829749863_0022
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
16/03/13 12:40:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Execution log at: /tmp/root/root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89.log
2016-03-13 12:40:22    Starting to launch local task to process map join;    maximum memory = 518979584
2016-03-13 12:40:23    Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable
2016-03-13 12:40:24    Uploaded 1 File to: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (559 bytes)
2016-03-13 12:40:24    End of local task; Time Taken: 1.176 sec.
Execution completed successfully
MapredLocal task succeeded
Launching Job 2 out of 2
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Starting Job = job_1457829749863_0023, Tracking URL = http://master1:8088/proxy/application_1457829749863_0023/
Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job  -kill job_1457829749863_0023
Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1
2016-03-13 12:40:32,128 Stage-3 map = 0%,  reduce = 0%
2016-03-13 12:40:39,379 Stage-3 map = 100%,  reduce = 0%, Cumulative CPU 1.54 sec
2016-03-13 12:40:46,562 Stage-3 map = 100%,  reduce = 100%, Cumulative CPU 2.98 sec
MapReduce Total cumulative CPU time: 2 seconds 980 msec
Ended Job = job_1457829749863_0023
MapReduce Jobs Launched: 
Stage-Stage-1: Map: 2  Reduce: 1   Cumulative CPU: 3.48 sec   HDFS Read: 10788 HDFS Write: 117 SUCCESS
Stage-Stage-3: Map: 1  Reduce: 1   Cumulative CPU: 2.98 sec   HDFS Read: 10776 HDFS Write: 556 SUCCESS
Total MapReduce CPU Time Spent: 6 seconds 460 msec
OK
00    0.016780173516828326
01    0.012149598038448027
02    0.00850573864353989
03    0.006348503480606734
04    0.005260485745917659
05    0.0051630841582717794
06    0.006124199824457138
07    0.009795059659472449
08    0.030731300920205
09    0.0516771423374201
10    0.06121889786803525
11    0.05978297446248374
12    0.05952617027657551
13    0.06240001712027906
14    0.06145730175401859
15    0.06400724331806608
16    0.06363143719242624
17    0.05816394807235358
18    0.05915826427970776
19    0.06821071183460291
20    0.07070675252006607
21    0.06593767478409898
22    0.0542959850245559
23    0.03896733516756323
Time taken: 55.825 seconds, Fetched: 24 row(s)
  • 大小: 12.8 KB
  • 大小: 194.4 KB
  • 大小: 74 KB
  • 大小: 146.7 KB
  • 大小: 75.3 KB
分享到:
评论

相关推荐

    基于Hive的搜狗日志分析

    本文档主要介绍了基于Hive的搜狗日志分析的整个过程,从数据预处理、构建数据仓库、数据分析到其他数据操作等方面进行了详细的介绍。 一、 数据预处理 数据预处理是整个日志分析的第一步,也是最为重要的一步。...

    基于 Hive的数据分析案例 -MM聊天软件数据分析

    基于 Hive 的数据分析案例 -MM 聊天软件数据分析 本资源摘要信息主要介绍了基于 Hive 的数据分析案例,通过对 MM 聊天软件的数据进行统计分析,了解用户行为,实现精准的用户画像,并为公司的发展决策提供精确的...

    利用Hive进行复杂用户行为大数据分析及优化案例

    利用Hive进行复杂用户行为大数据分析及优化案例(全套视频+课件+代码+讲义+工具软件),具体内容包括: 01_自动批量加载数据到hive 02_Hive表批量加载数据的脚本实现(一) 03_Hive表批量加载数据的脚本实现(二) ...

    大数据 hive 实战数据

    在Hive中,我们可以创建一个`video`表来存储这些数据,通过Hive SQL进行查询和分析,例如,找出最受欢迎的视频或最常被搜索的类别。 其次,`user`数据通常包括用户ID、用户名、注册日期、地理位置、用户行为日志...

    hive实验报告.docx

    通过这次实验,学生能够深入理解Hive作为数据仓库工具的作用,熟悉其基本操作和使用场景,同时也掌握了自定义函数的开发流程,这对于进一步学习大数据处理和分析具有重要意义。在实际工作中,Hive的性能优化和与其他...

    Hive总结.docx

    Hive是基于Hadoop平台的数据仓库解决方案,它主要解决了在大数据场景下,业务人员和数据科学家能够通过熟悉的SQL语言进行数据分析的问题。Hive并不存储数据,而是依赖于HDFS进行数据存储,并利用MapReduce、Tez或...

    搜狗搜索日志分析报告.docx

    《搜狗搜索日志分析报告》是对海量的搜狗搜索数据进行深度挖掘和解析的科研文档,旨在揭示用户搜索行为的规律与趋势。本报告基于500万条搜狗搜索日志,采用Hadoop、Hive等大数据处理工具,结合R语言进行统计分析,...

    Hive数据仓库之垃圾分类数据分析系统

    (3)sqoop数据迁移,完成HIve与MySQL数据库中的数据交互 (4)Echarts搭建动态可视化大屏 (5)SpringBoot搭建可视化后台系统,完成前端与后台的数据传递与交互。 (6)基于Cenots7 搭建虚拟机,配置Hadoop、HDFS、...

    基于hadoop平台hive数据库处理电影数据(8965字数32页).doc

    【标题】:“基于Hadoop平台Hive数据库处理电影数据”的文档详细介绍了如何在Hadoop分布式环境中利用Hive进行大规模电影数据的分析。该系统的主要目标是建立一个分布式Hadoop集群,并在此基础上对电影数据进行深入...

    基于SpringBoot+hiveJDBC+echarts的数据大屏可视化和大数据分析源码+项目使用说明.zip

    基于SpringBoot+hiveJDBC+echarts的数据大屏可视化和大数据分析源码+项目使用说明.zip 【项目介绍】 使用Hadoop技术可快速生成分析结果,对6万条美妆销售数据分析,将数据转化为有价值的数据。 在Centos7中搭建 ...

    hive案例之---------微博数据分析及答案.zip

    在这个“hive案例之---------微博数据分析及答案”的项目中,我们将会探索如何利用Hive进行大规模的微博数据挖掘与分析。 首先,项目说明文档.docx可能会详细阐述了该项目的目标、背景、数据来源以及预期的结果。...

    电影票房数据分析-hive代码

    电影票房数据分析-hive代码 本资源为电影票房数据分析的Hive代码,涵盖了四个主要部分:统计2020年上映的电影中当前总票房最高的10部电影、统计2020年国庆假期中电影票房增长最多的三部电影及其每日的票房数据、...

    数据分析系统Hive

    ### 数据分析系统Hive #### 一、Hive背景及应用场景 Hive 是一款由 Facebook 开源的数据仓库工具,主要用于处理大规模数据集。它通过提供一种 SQL 类似的查询语言 HQL,使得用户能够轻松地对存储在 Hadoop 分布式...

    《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf

    《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第...

    基于Hadoop豆瓣电影数据分析实验报告

    【基于Hadoop豆瓣电影数据分析实验报告】 在大数据时代,对海量信息进行高效处理和分析是企业决策的关键。Hadoop作为一款强大的分布式计算框架,自2006年诞生以来,已经在多个领域展现了其卓越的数据处理能力。本...

    网站流量数据分析 (MapReduce+Hive综合实验)

    网站流量数据分析 (MapReduce+Hive综合实验)

    1. 搜狗日志查询分析; 2. 运营商关于用户基站停留数据统计; 3. 根据气象数据中心的数据进行温度统计; Hadoop

    选题:搜狗日志查询分析 (MapReduce+Hive综合实验) 前提条件: 安装好hadoop2.8.0 安装好HQL 安装好Hive 安装好eclipse 选题要求: 解压数据源,并上传到hdfs,保存的目录以个人学号区分,176为我的学号 创建hive...

    基于Hadoop的数据仓库Hive学习指南.doc

    1. **Hadoop数据仓库Hive**:Hive是由Facebook开发的一种基于Hadoop的数据仓库工具,它允许SQL熟悉的用户对存储在Hadoop分布式文件系统(HDFS)上的大规模数据进行分析。Hive将结构化的数据文件映射为数据库表,提供了...

    datax数据从hive导入mysql数据缺失解决

    Hive 通常用于存储海量数据并进行批处理分析,而 MySQL 则用于存储结构化数据并提供快速查询服务。为了实现两者之间的数据交换,DataX 作为一种开源的数据同步工具被广泛应用。 #### 问题描述 在使用 DataX 将数据...

Global site tag (gtag.js) - Google Analytics