- 浏览: 236828 次
- 性别:
- 来自: 上海
文章分类
最新评论
-
lwb314:
你的这个是创建的临时的hive表,数据也是通过文件录入进去的, ...
Spark SQL操作Hive数据库 -
yixiaoqi2010:
你好 我的提交上去 总是报错,找不到hive表,可能是哪里 ...
Spark SQL操作Hive数据库 -
bo_hai:
target jvm版本也要选择正确。不能选择太高。2.10对 ...
eclipse开发spark程序配置本地运行
搜狗实验室:http://www.sogou.com/labs/resources.html
用户查询日志:http://www.sogou.com/labs/dl/q.html
将上面三种数据全部下载下来,解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中;
注意:如果数据乱码的情况下,请先转成utf-8的格式,方便后面的练习;
数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列:时间;第二列:id;第四列:搜索关键词;第四列:关键词在搜索结果中的排名;第五列:查询结果列表中用户点击了第几个;第六列:url;
创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;
创建对应分区目录,并将数据上传到对应的分区中
加载数据到对应的分区中
查看分区
查看每个分区的数量
从上面看执行情况,发现hive执行时使用了一个 job,执行时间比较长,原因有二:
1.因为自己搭建的是四台虚拟机进行测试,和生产环境的大集群没有可比性;
2.hive执行MapReduce过程本身就很慢;
执行的过程中可通过浏览器查看job运行情况:
练习:(使用分区为sogouq3的数据,因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个,点击了第一个的有多少;
2.查询分区d='sogouq3'中搜索的关键字排名前5的数据;
3.url搜索访问此时排名前5的
4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中;
5.统计用户每个时段搜索所在比例;
用户查询日志:http://www.sogou.com/labs/dl/q.html
将上面三种数据全部下载下来,解压后拷贝到虚拟机的/usr/local/hive/hivedata目录中;
注意:如果数据乱码的情况下,请先转成utf-8的格式,方便后面的练习;
数据格式为
访问时间\t用户ID\t[查询词]\t该URL在返回结果中的排名\t用户点击的顺序号\t用户点击的URL
其中,用户ID是根据用户使用浏览器访问搜索引擎时的Cookie信息自动赋值,即同一次使用浏览器输入的不同查询对应同一个用户ID。
第一列:时间;第二列:id;第四列:搜索关键词;第四列:关键词在搜索结果中的排名;第五列:查询结果列表中用户点击了第几个;第六列:url;
创建外部分区表
分区d设置为d=sogouq1,d=sogouq2,d=sogouq3;
hive> create external table tbsogou(id string,websession string,keyword string,search_seq int,click_seq int,website string) partitioned by (d string) row format delimited fields terminated by '\t' lines terminated by '\n'; OK Time taken: 0.208 seconds [img]http://dl2.iteye.com/upload/attachment/0115/7441/03188d9f-132d-3178-9e26-0668e800f756.png[/img]
创建对应分区目录,并将数据上传到对应的分区中
root@master1:/usr/local/hive/hivedata# ll 总用量 1379652 drwxr-xr-x 2 root root 4096 3月 6 19:05 ./ drwxr-xr-x 6 root root 4096 3月 6 19:08 ../ -rw-rw-r-- 1 jylu jylu 108750574 6月 14 2014 SogouQ1.txt -rw-rw-r-- 1 jylu jylu 217441417 6月 14 2014 SogouQ2.txt -rw-rw-r-- 1 jylu jylu 1086552775 7月 13 2014 SogouQ3.txt root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 16/03/06 22:26:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ1.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 16/03/06 22:26:54 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 16/03/06 22:27:09 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ2.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 16/03/06 22:27:20 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -mkdir /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 16/03/06 22:27:34 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -put SogouQ3.txt /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 16/03/06 22:27:45 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable root@master1:/usr/local/hive/hivedata# hdfs dfs -lsr /user/hive/warehouse/testdb.db/tbsogou/ lsr: DEPRECATED: Please use 'ls -R' instead. 16/03/06 22:28:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable drwxr-xr-x - root supergroup 0 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1 -rw-r--r-- 2 root supergroup 108750574 2016-03-06 22:26 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq1/SogouQ1.txt drwxr-xr-x - root supergroup 0 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2 -rw-r--r-- 2 root supergroup 217441417 2016-03-06 22:27 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq2/SogouQ2.txt drwxr-xr-x - root supergroup 0 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3 -rw-r--r-- 2 root supergroup 1086552775 2016-03-06 22:28 /user/hive/warehouse/testdb.db/tbsogou/d=sogouq3/SogouQ3.txt root@master1:/usr/local/hive/hivedata#
加载数据到对应的分区中
hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq1')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq1'; OK Time taken: 0.266 seconds hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq2')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq2'; OK Time taken: 0.177 seconds hive> ALTER TABLE tbsogou ADD PARTITION (d='sogouq3')LOCATION '/user/hive/warehouse/testdb.db/tbsogou/d=sogouq3'; OK Time taken: 0.245 seconds
查看分区
hive> show partitions tbsogou; OK d=sogouq1 d=sogouq2 d=sogouq3 Time taken: 0.152 seconds, Fetched: 3 row(s)
查看每个分区的数量
hive> select d,count(1) from tbsogou group by d; Query ID = root_20160307213457_df786923-5b7e-41e3-be18-bf6c2c06a926 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 6 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457352585589_0001, Tracking URL = http://master1:8088/proxy/application_1457352585589_0001/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457352585589_0001 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 6 2016-03-07 21:35:21,593 Stage-1 map = 0%, reduce = 0% 2016-03-07 21:36:22,322 Stage-1 map = 0%, reduce = 0%, Cumulative CPU 3.86 sec 2016-03-07 21:36:39,202 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 13.62 sec 2016-03-07 21:36:42,474 Stage-1 map = 19%, reduce = 0%, Cumulative CPU 14.91 sec 2016-03-07 21:36:46,047 Stage-1 map = 31%, reduce = 0%, Cumulative CPU 15.84 sec 2016-03-07 21:36:48,170 Stage-1 map = 44%, reduce = 0%, Cumulative CPU 16.12 sec 2016-03-07 21:36:49,239 Stage-1 map = 49%, reduce = 0%, Cumulative CPU 16.56 sec 2016-03-07 21:36:58,007 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 17.53 sec 2016-03-07 21:37:07,956 Stage-1 map = 63%, reduce = 0%, Cumulative CPU 21.51 sec 2016-03-07 21:37:12,077 Stage-1 map = 68%, reduce = 0%, Cumulative CPU 24.44 sec 2016-03-07 21:37:15,700 Stage-1 map = 75%, reduce = 0%, Cumulative CPU 25.86 sec 2016-03-07 21:37:16,828 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 26.18 sec 2016-03-07 21:37:53,346 Stage-1 map = 100%, reduce = 4%, Cumulative CPU 27.15 sec 2016-03-07 21:37:56,467 Stage-1 map = 100%, reduce = 22%, Cumulative CPU 28.87 sec 2016-03-07 21:37:58,541 Stage-1 map = 100%, reduce = 28%, Cumulative CPU 29.83 sec 2016-03-07 21:37:59,589 Stage-1 map = 100%, reduce = 33%, Cumulative CPU 31.31 sec 2016-03-07 21:38:01,668 Stage-1 map = 100%, reduce = 50%, Cumulative CPU 33.3 sec 2016-03-07 21:38:02,712 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 35.63 sec 2016-03-07 21:38:05,801 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 37.31 sec 2016-03-07 21:38:06,831 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 39.19 sec MapReduce Total cumulative CPU time: 39 seconds 190 msec Ended Job = job_1457352585589_0001 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 6 Cumulative CPU: 39.19 sec HDFS Read: 1491774463 HDFS Write: 49 SUCCESS Total MapReduce CPU Time Spent: 39 seconds 190 msec OK sogouq1 1000000 sogouq2 2000000 sogouq3 10000000 Time taken: 194.981 seconds, Fetched: 3 row(s)
从上面看执行情况,发现hive执行时使用了一个 job,执行时间比较长,原因有二:
1.因为自己搭建的是四台虚拟机进行测试,和生产环境的大集群没有可比性;
2.hive执行MapReduce过程本身就很慢;
执行的过程中可通过浏览器查看job运行情况:
练习:(使用分区为sogouq3的数据,因为这里面的数据相对来说多一些)
1.统计分区d='sogouq3'中搜索结果为第一个,点击了第一个的有多少;
hive> select count(*)cnt from tbsogou where d='sogouq3' and search_seq=1 and click_seq=1; Query ID = root_20160313111650_9c4bfb1e-c330-4ef1-b6f1-ef0d74812678 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0005, Tracking URL = http://master1:8088/proxy/application_1457829749863_0005/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0005 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 1 2016-03-13 11:17:00,345 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:17:12,813 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 4.75 sec 2016-03-13 11:17:13,845 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 5.04 sec 2016-03-13 11:17:16,936 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 8.86 sec 2016-03-13 11:17:17,975 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 9.81 sec 2016-03-13 11:17:26,298 Stage-1 map = 40%, reduce = 13%, Cumulative CPU 15.56 sec 2016-03-13 11:17:28,357 Stage-1 map = 53%, reduce = 13%, Cumulative CPU 19.81 sec 2016-03-13 11:17:32,491 Stage-1 map = 60%, reduce = 13%, Cumulative CPU 21.32 sec 2016-03-13 11:17:38,644 Stage-1 map = 67%, reduce = 20%, Cumulative CPU 22.65 sec 2016-03-13 11:17:40,708 Stage-1 map = 73%, reduce = 20%, Cumulative CPU 23.21 sec 2016-03-13 11:17:44,834 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 24.91 sec 2016-03-13 11:17:47,903 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 26.27 sec MapReduce Total cumulative CPU time: 26 seconds 270 msec Ended Job = job_1457829749863_0005 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 1 Cumulative CPU: 26.27 sec HDFS Read: 1147287940 HDFS Write: 8 SUCCESS Total MapReduce CPU Time Spent: 26 seconds 270 msec OK 3891022 Time taken: 58.658 seconds, Fetched: 1 row(s)
2.查询分区d='sogouq3'中搜索的关键字排名前5的数据;
hive> select keyword,count(*) as cnt from tbsogou where d='sogouq3' group by keyword order by cnt desc limit 5; Query ID = root_20160313111820_181be93c-07f4-4aa3-ab42-778b9cc2f312 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0006, Tracking URL = http://master1:8088/proxy/application_1457829749863_0006/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0006 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:18:31,471 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:18:46,946 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 6.6 sec 2016-03-13 11:18:57,652 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 30.33 sec 2016-03-13 11:18:59,234 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 32.57 sec 2016-03-13 11:19:02,356 Stage-1 map = 40%, reduce = 0%, Cumulative CPU 34.44 sec 2016-03-13 11:19:03,644 Stage-1 map = 47%, reduce = 0%, Cumulative CPU 36.89 sec 2016-03-13 11:19:05,416 Stage-1 map = 53%, reduce = 0%, Cumulative CPU 36.89 sec 2016-03-13 11:19:09,267 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 40.57 sec 2016-03-13 11:19:10,466 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 42.18 sec 2016-03-13 11:19:22,900 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 45.82 sec 2016-03-13 11:19:27,047 Stage-1 map = 87%, reduce = 0%, Cumulative CPU 49.62 sec 2016-03-13 11:19:28,201 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 50.0 sec 2016-03-13 11:19:29,228 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 51.6 sec 2016-03-13 11:19:38,556 Stage-1 map = 100%, reduce = 8%, Cumulative CPU 52.38 sec 2016-03-13 11:19:44,830 Stage-1 map = 100%, reduce = 26%, Cumulative CPU 53.72 sec 2016-03-13 11:19:47,907 Stage-1 map = 100%, reduce = 27%, Cumulative CPU 56.14 sec 2016-03-13 11:19:51,016 Stage-1 map = 100%, reduce = 29%, Cumulative CPU 59.15 sec 2016-03-13 11:19:54,105 Stage-1 map = 100%, reduce = 34%, Cumulative CPU 61.99 sec 2016-03-13 11:19:56,161 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 63.76 sec 2016-03-13 11:20:13,727 Stage-1 map = 100%, reduce = 67%, Cumulative CPU 67.11 sec 2016-03-13 11:20:15,901 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 71.64 sec 2016-03-13 11:20:16,927 Stage-1 map = 100%, reduce = 88%, Cumulative CPU 74.51 sec 2016-03-13 11:20:20,012 Stage-1 map = 100%, reduce = 90%, Cumulative CPU 76.43 sec 2016-03-13 11:20:23,082 Stage-1 map = 100%, reduce = 96%, Cumulative CPU 78.55 sec 2016-03-13 11:20:25,143 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 79.45 sec MapReduce Total cumulative CPU time: 1 minutes 19 seconds 450 msec Ended Job = job_1457829749863_0006 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0007, Tracking URL = http://master1:8088/proxy/application_1457829749863_0007/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0007 Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1 2016-03-13 11:20:37,281 Stage-2 map = 0%, reduce = 0% 2016-03-13 11:20:55,833 Stage-2 map = 17%, reduce = 0%, Cumulative CPU 2.7 sec 2016-03-13 11:20:58,917 Stage-2 map = 33%, reduce = 0%, Cumulative CPU 7.23 sec 2016-03-13 11:20:59,942 Stage-2 map = 44%, reduce = 0%, Cumulative CPU 8.87 sec 2016-03-13 11:21:00,966 Stage-2 map = 61%, reduce = 0%, Cumulative CPU 9.46 sec 2016-03-13 11:21:03,019 Stage-2 map = 83%, reduce = 0%, Cumulative CPU 11.85 sec 2016-03-13 11:21:06,098 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 13.3 sec 2016-03-13 11:21:11,232 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 17.45 sec MapReduce Total cumulative CPU time: 17 seconds 450 msec Ended Job = job_1457829749863_0007 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 79.45 sec HDFS Read: 1147297321 HDFS Write: 115331402 SUCCESS Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 17.45 sec HDFS Read: 115338803 HDFS Write: 110 SUCCESS Total MapReduce CPU Time Spent: 1 minutes 36 seconds 900 msec OK 百度 77627 baidu 36564 人体艺术 29598 4399小游戏 23306 优酷 20847 Time taken: 172.861 seconds, Fetched: 5 row(s) 从分析的结果看,用过搜狗查询的关键词是“百度”的挺高的。
3.url搜索访问此时排名前5的
hive> select website ,count(website) as cnt from tbsogou where d='sogouq3' group by website order by cnt desc limit 5; Query ID = root_20160313112231_24d147cb-8dce-412c-b255-67dc65660cd4 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0008, Tracking URL = http://master1:8088/proxy/application_1457829749863_0008/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0008 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:22:39,114 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:22:54,627 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 16.32 sec 2016-03-13 11:23:00,819 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 28.21 sec 2016-03-13 11:23:01,845 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 29.87 sec 2016-03-13 11:23:03,426 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 29.87 sec 2016-03-13 11:23:09,805 Stage-1 map = 45%, reduce = 0%, Cumulative CPU 39.44 sec 2016-03-13 11:23:11,009 Stage-1 map = 60%, reduce = 0%, Cumulative CPU 42.0 sec 2016-03-13 11:23:14,083 Stage-1 map = 67%, reduce = 0%, Cumulative CPU 46.6 sec 2016-03-13 11:23:26,957 Stage-1 map = 69%, reduce = 0%, Cumulative CPU 53.89 sec 2016-03-13 11:23:30,204 Stage-1 map = 73%, reduce = 0%, Cumulative CPU 55.56 sec 2016-03-13 11:23:32,269 Stage-1 map = 80%, reduce = 0%, Cumulative CPU 56.56 sec 2016-03-13 11:23:35,356 Stage-1 map = 87%, reduce = 0%, Cumulative CPU 57.3 sec 2016-03-13 11:23:50,951 Stage-1 map = 89%, reduce = 0%, Cumulative CPU 62.5 sec 2016-03-13 11:23:51,977 Stage-1 map = 91%, reduce = 0%, Cumulative CPU 63.37 sec 2016-03-13 11:23:54,032 Stage-1 map = 92%, reduce = 0%, Cumulative CPU 63.52 sec 2016-03-13 11:23:55,053 Stage-1 map = 93%, reduce = 0%, Cumulative CPU 63.69 sec 2016-03-13 11:23:58,131 Stage-1 map = 94%, reduce = 0%, Cumulative CPU 64.0 sec 2016-03-13 11:24:00,172 Stage-1 map = 95%, reduce = 0%, Cumulative CPU 64.32 sec 2016-03-13 11:24:03,255 Stage-1 map = 96%, reduce = 0%, Cumulative CPU 64.59 sec 2016-03-13 11:24:06,351 Stage-1 map = 97%, reduce = 0%, Cumulative CPU 65.05 sec 2016-03-13 11:24:08,590 Stage-1 map = 98%, reduce = 0%, Cumulative CPU 65.28 sec 2016-03-13 11:24:11,687 Stage-1 map = 99%, reduce = 0%, Cumulative CPU 65.67 sec 2016-03-13 11:24:18,963 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 66.17 sec 2016-03-13 11:24:49,446 Stage-1 map = 100%, reduce = 24%, Cumulative CPU 69.83 sec 2016-03-13 11:24:50,461 Stage-1 map = 100%, reduce = 31%, Cumulative CPU 71.2 sec 2016-03-13 11:24:52,685 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 72.16 sec 2016-03-13 11:24:53,716 Stage-1 map = 100%, reduce = 47%, Cumulative CPU 73.29 sec 2016-03-13 11:24:55,755 Stage-1 map = 100%, reduce = 53%, Cumulative CPU 73.96 sec 2016-03-13 11:25:09,145 Stage-1 map = 100%, reduce = 59%, Cumulative CPU 74.82 sec 2016-03-13 11:25:13,234 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 79.95 sec 2016-03-13 11:25:14,267 Stage-1 map = 100%, reduce = 61%, Cumulative CPU 84.89 sec 2016-03-13 11:25:18,691 Stage-1 map = 100%, reduce = 62%, Cumulative CPU 86.7 sec 2016-03-13 11:25:19,725 Stage-1 map = 100%, reduce = 63%, Cumulative CPU 87.4 sec 2016-03-13 11:25:21,777 Stage-1 map = 100%, reduce = 65%, Cumulative CPU 88.1 sec 2016-03-13 11:25:22,801 Stage-1 map = 100%, reduce = 66%, Cumulative CPU 88.66 sec 2016-03-13 11:25:24,370 Stage-1 map = 100%, reduce = 69%, Cumulative CPU 90.07 sec 2016-03-13 11:25:25,806 Stage-1 map = 100%, reduce = 73%, Cumulative CPU 91.48 sec 2016-03-13 11:25:26,846 Stage-1 map = 100%, reduce = 74%, Cumulative CPU 92.05 sec 2016-03-13 11:25:27,899 Stage-1 map = 100%, reduce = 77%, Cumulative CPU 93.2 sec 2016-03-13 11:25:28,922 Stage-1 map = 100%, reduce = 79%, Cumulative CPU 93.57 sec 2016-03-13 11:25:32,045 Stage-1 map = 100%, reduce = 80%, Cumulative CPU 94.4 sec 2016-03-13 11:25:34,164 Stage-1 map = 100%, reduce = 82%, Cumulative CPU 95.36 sec 2016-03-13 11:25:37,297 Stage-1 map = 100%, reduce = 83%, Cumulative CPU 96.01 sec 2016-03-13 11:25:44,816 Stage-1 map = 100%, reduce = 85%, Cumulative CPU 97.26 sec 2016-03-13 11:25:55,122 Stage-1 map = 100%, reduce = 87%, Cumulative CPU 98.23 sec 2016-03-13 11:25:58,213 Stage-1 map = 100%, reduce = 93%, Cumulative CPU 99.1 sec 2016-03-13 11:26:01,270 Stage-1 map = 100%, reduce = 94%, Cumulative CPU 101.44 sec 2016-03-13 11:26:04,348 Stage-1 map = 100%, reduce = 97%, Cumulative CPU 103.83 sec 2016-03-13 11:26:06,427 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 104.88 sec MapReduce Total cumulative CPU time: 1 minutes 44 seconds 880 msec Ended Job = job_1457829749863_0008 Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0009, Tracking URL = http://master1:8088/proxy/application_1457829749863_0009/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0009 Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1 2016-03-13 11:26:23,710 Stage-2 map = 0%, reduce = 0% 2016-03-13 11:26:44,247 Stage-2 map = 28%, reduce = 0%, Cumulative CPU 8.53 sec 2016-03-13 11:26:50,529 Stage-2 map = 57%, reduce = 0%, Cumulative CPU 13.83 sec 2016-03-13 11:26:53,648 Stage-2 map = 66%, reduce = 0%, Cumulative CPU 15.83 sec 2016-03-13 11:26:54,668 Stage-2 map = 72%, reduce = 0%, Cumulative CPU 16.36 sec 2016-03-13 11:26:56,719 Stage-2 map = 83%, reduce = 0%, Cumulative CPU 17.23 sec 2016-03-13 11:26:59,790 Stage-2 map = 93%, reduce = 0%, Cumulative CPU 18.65 sec 2016-03-13 11:27:02,862 Stage-2 map = 100%, reduce = 0%, Cumulative CPU 19.38 sec 2016-03-13 11:27:10,374 Stage-2 map = 100%, reduce = 17%, Cumulative CPU 20.61 sec 2016-03-13 11:27:13,444 Stage-2 map = 100%, reduce = 67%, Cumulative CPU 22.25 sec 2016-03-13 11:27:15,542 Stage-2 map = 100%, reduce = 100%, Cumulative CPU 24.15 sec MapReduce Total cumulative CPU time: 24 seconds 150 msec Ended Job = job_1457829749863_0009 MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 104.88 sec HDFS Read: 1147298011 HDFS Write: 297356296 SUCCESS Stage-Stage-2: Map: 2 Reduce: 1 Cumulative CPU: 24.15 sec HDFS Read: 297363766 HDFS Write: 140 SUCCESS Total MapReduce CPU Time Spent: 2 minutes 9 seconds 30 msec OK http://www.baidu.com/ 148348 http://www.4399.com/ 38611 http://www.youku.com/ 28959 http://www.hao123.com/ 28715 http://qzone.qq.com/ 26020 Time taken: 285.674 seconds, Fetched: 5 row(s)
4.统计一天中用户每个时间段搜索情况,并将结果存到临时表中;
drop table if exists tmp_pre_hour_seach_info; create table tmp_pre_hour_seach_info as select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2); hive> drop table if exists tmp_pre_hour_seach_info; OK Time taken: 0.019 seconds hive> create table tmp_pre_hour_seach_info > as > select substring(id,9,2)as hour,count(*) as cnt from tbsogou where d='sogouq3' and substring(id,1,8)='20111230' group by substring(id,9,2); Query ID = root_20160313115133_bd005a7b-edf2-4c4c-a4c1-3b83c3d34335 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks not specified. Estimated from input data size: 5 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0015, Tracking URL = http://master1:8088/proxy/application_1457829749863_0015/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0015 Hadoop job information for Stage-1: number of mappers: 5; number of reducers: 5 2016-03-13 11:51:41,275 Stage-1 map = 0%, reduce = 0% 2016-03-13 11:52:01,969 Stage-1 map = 7%, reduce = 0%, Cumulative CPU 13.69 sec 2016-03-13 11:52:03,000 Stage-1 map = 20%, reduce = 0%, Cumulative CPU 14.02 sec 2016-03-13 11:52:11,604 Stage-1 map = 27%, reduce = 0%, Cumulative CPU 28.14 sec 2016-03-13 11:52:13,801 Stage-1 map = 33%, reduce = 0%, Cumulative CPU 29.28 sec 2016-03-13 11:52:17,954 Stage-1 map = 60%, reduce = 1%, Cumulative CPU 33.8 sec 2016-03-13 11:52:19,522 Stage-1 map = 60%, reduce = 3%, Cumulative CPU 34.15 sec 2016-03-13 11:52:20,803 Stage-1 map = 67%, reduce = 3%, Cumulative CPU 35.55 sec 2016-03-13 11:52:22,921 Stage-1 map = 73%, reduce = 4%, Cumulative CPU 35.76 sec 2016-03-13 11:52:23,947 Stage-1 map = 73%, reduce = 5%, Cumulative CPU 37.26 sec 2016-03-13 11:52:25,437 Stage-1 map = 100%, reduce = 5%, Cumulative CPU 38.09 sec 2016-03-13 11:52:26,481 Stage-1 map = 100%, reduce = 7%, Cumulative CPU 38.15 sec 2016-03-13 11:52:27,553 Stage-1 map = 100%, reduce = 8%, Cumulative CPU 38.2 sec 2016-03-13 11:52:51,246 Stage-1 map = 100%, reduce = 11%, Cumulative CPU 38.85 sec 2016-03-13 11:52:53,288 Stage-1 map = 100%, reduce = 20%, Cumulative CPU 39.85 sec 2016-03-13 11:52:54,307 Stage-1 map = 100%, reduce = 40%, Cumulative CPU 41.77 sec 2016-03-13 11:52:59,424 Stage-1 map = 100%, reduce = 60%, Cumulative CPU 43.97 sec 2016-03-13 11:53:09,679 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 47.44 sec MapReduce Total cumulative CPU time: 47 seconds 440 msec Ended Job = job_1457829749863_0015 Moving data to: hdfs://master1:9000/user/hive/warehouse/testdb.db/tmp_pre_hour_seach_info Table testdb.tmp_pre_hour_seach_info stats: [numFiles=5, numRows=24, totalSize=234, rawDataSize=210] MapReduce Jobs Launched: Stage-Stage-1: Map: 5 Reduce: 5 Cumulative CPU: 47.44 sec HDFS Read: 1147301096 HDFS Write: 664 SUCCESS Total MapReduce CPU Time Spent: 47 seconds 440 msec OK Time taken: 98.72 seconds hive> select * from tmp_pre_hour_seach_info order by hour ; Query ID = root_20160313122830_a6be9cad-fc18-4c9e-9950-f6f97be0f996 Total jobs = 1 Launching Job 1 out of 1 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0016, Tracking URL = http://master1:8088/proxy/application_1457829749863_0016/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0016 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 2016-03-13 12:28:38,814 Stage-1 map = 0%, reduce = 0% 2016-03-13 12:28:50,158 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.48 sec 2016-03-13 12:28:57,412 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.87 sec MapReduce Total cumulative CPU time: 3 seconds 870 msec Ended Job = job_1457829749863_0016 MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.87 sec HDFS Read: 9717 HDFS Write: 234 SUCCESS Total MapReduce CPU Time Spent: 3 seconds 870 msec OK 00 167799 01 121494 02 85056 03 63484 04 52604 05 51630 06 61241 07 97949 08 307308 09 516763 10 612179 11 597820 12 595252 13 623990 14 614563 15 640062 16 636304 17 581630 18 591573 19 682096 20 707056 21 659366 22 542951 23 389667 Time taken: 27.597 seconds, Fetched: 24 row(s)
5.统计用户每个时段搜索所在比例;
hive> select hour,cnt/total from tmp_pre_hour_seach_info a join (select sum(cnt)total from tmp_pre_hour_seach_info) b where 1=1 order by hour; Warning: Map Join MAPJOIN[19][bigTable=?] in task 'Stage-3:MAPRED' is a cross product Query ID = root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89 Total jobs = 2 Launching Job 1 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0022, Tracking URL = http://master1:8088/proxy/application_1457829749863_0022/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0022 Hadoop job information for Stage-1: number of mappers: 2; number of reducers: 1 2016-03-13 12:39:58,777 Stage-1 map = 0%, reduce = 0% 2016-03-13 12:40:10,073 Stage-1 map = 100%, reduce = 0%, Cumulative CPU 2.3 sec 2016-03-13 12:40:17,294 Stage-1 map = 100%, reduce = 100%, Cumulative CPU 3.48 sec MapReduce Total cumulative CPU time: 3 seconds 480 msec Ended Job = job_1457829749863_0022 SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/usr/local/hadoop/hadoop-2.6.0/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/usr/local/spark/spark-1.6.0-bin-hadoop2.6/lib/spark-assembly-1.6.0-hadoop2.6.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory] 16/03/13 12:40:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Execution log at: /tmp/root/root_20160313123951_9afbb68b-d07b-49e8-bd90-c2a362133d89.log 2016-03-13 12:40:22 Starting to launch local task to process map join; maximum memory = 518979584 2016-03-13 12:40:23 Dump the side-table for tag: 0 with group count: 1 into file: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable 2016-03-13 12:40:24 Uploaded 1 File to: file:/tmp/root/94a0f21c-473d-4ff2-8d73-5a286929f3ec/hive_2016-03-13_12-39-51_816_7328192251576208367-1/-local-10005/HashTable-Stage-3/MapJoin-mapfile10--.hashtable (559 bytes) 2016-03-13 12:40:24 End of local task; Time Taken: 1.176 sec. Execution completed successfully MapredLocal task succeeded Launching Job 2 out of 2 Number of reduce tasks determined at compile time: 1 In order to change the average load for a reducer (in bytes): set hive.exec.reducers.bytes.per.reducer=<number> In order to limit the maximum number of reducers: set hive.exec.reducers.max=<number> In order to set a constant number of reducers: set mapreduce.job.reduces=<number> Starting Job = job_1457829749863_0023, Tracking URL = http://master1:8088/proxy/application_1457829749863_0023/ Kill Command = /usr/local/hadoop/hadoop-2.6.0/bin/hadoop job -kill job_1457829749863_0023 Hadoop job information for Stage-3: number of mappers: 1; number of reducers: 1 2016-03-13 12:40:32,128 Stage-3 map = 0%, reduce = 0% 2016-03-13 12:40:39,379 Stage-3 map = 100%, reduce = 0%, Cumulative CPU 1.54 sec 2016-03-13 12:40:46,562 Stage-3 map = 100%, reduce = 100%, Cumulative CPU 2.98 sec MapReduce Total cumulative CPU time: 2 seconds 980 msec Ended Job = job_1457829749863_0023 MapReduce Jobs Launched: Stage-Stage-1: Map: 2 Reduce: 1 Cumulative CPU: 3.48 sec HDFS Read: 10788 HDFS Write: 117 SUCCESS Stage-Stage-3: Map: 1 Reduce: 1 Cumulative CPU: 2.98 sec HDFS Read: 10776 HDFS Write: 556 SUCCESS Total MapReduce CPU Time Spent: 6 seconds 460 msec OK 00 0.016780173516828326 01 0.012149598038448027 02 0.00850573864353989 03 0.006348503480606734 04 0.005260485745917659 05 0.0051630841582717794 06 0.006124199824457138 07 0.009795059659472449 08 0.030731300920205 09 0.0516771423374201 10 0.06121889786803525 11 0.05978297446248374 12 0.05952617027657551 13 0.06240001712027906 14 0.06145730175401859 15 0.06400724331806608 16 0.06363143719242624 17 0.05816394807235358 18 0.05915826427970776 19 0.06821071183460291 20 0.07070675252006607 21 0.06593767478409898 22 0.0542959850245559 23 0.03896733516756323 Time taken: 55.825 seconds, Fetched: 24 row(s)
发表评论
-
Spark SQL内置函数应用
2016-04-22 07:00 8675简单说明 使用Spark SQL中的内置函数对数据进行 ... -
Spark SQL操作Hive数据库
2016-04-13 22:37 17634本次例子通过scala编程实现Spark SQL操作Hive数 ... -
Spark SQL on hive配置和实战
2016-03-26 18:40 5592spark sql 官网:http://spark ... -
Hive使用默认数据库derby报错及解决方法
2016-03-05 21:19 3169在使用derby数据库时,遇到了很多错误,报错信息和解决方法如 ... -
Hive1.2.1安装和使用(基于Hadoop2.6.0)
2016-03-05 20:22 8871安装hive,这里使用mysql作为hive的metastor ... -
hive第一个字段为null
2014-03-20 23:17 1442在hive中创建表后,将准备好的数据导入到该表中,如果第一个字 ...
相关推荐
本文档主要介绍了基于Hive的搜狗日志分析的整个过程,从数据预处理、构建数据仓库、数据分析到其他数据操作等方面进行了详细的介绍。 一、 数据预处理 数据预处理是整个日志分析的第一步,也是最为重要的一步。...
基于 Hive 的数据分析案例 -MM 聊天软件数据分析 本资源摘要信息主要介绍了基于 Hive 的数据分析案例,通过对 MM 聊天软件的数据进行统计分析,了解用户行为,实现精准的用户画像,并为公司的发展决策提供精确的...
利用Hive进行复杂用户行为大数据分析及优化案例(全套视频+课件+代码+讲义+工具软件),具体内容包括: 01_自动批量加载数据到hive 02_Hive表批量加载数据的脚本实现(一) 03_Hive表批量加载数据的脚本实现(二) ...
在Hive中,我们可以创建一个`video`表来存储这些数据,通过Hive SQL进行查询和分析,例如,找出最受欢迎的视频或最常被搜索的类别。 其次,`user`数据通常包括用户ID、用户名、注册日期、地理位置、用户行为日志...
Hive是基于Hadoop平台的数据仓库解决方案,它主要解决了在大数据场景下,业务人员和数据科学家能够通过熟悉的SQL语言进行数据分析的问题。Hive并不存储数据,而是依赖于HDFS进行数据存储,并利用MapReduce、Tez或...
通过这次实验,学生能够深入理解Hive作为数据仓库工具的作用,熟悉其基本操作和使用场景,同时也掌握了自定义函数的开发流程,这对于进一步学习大数据处理和分析具有重要意义。在实际工作中,Hive的性能优化和与其他...
《搜狗搜索日志分析报告》是对海量的搜狗搜索数据进行深度挖掘和解析的科研文档,旨在揭示用户搜索行为的规律与趋势。本报告基于500万条搜狗搜索日志,采用Hadoop、Hive等大数据处理工具,结合R语言进行统计分析,...
(3)sqoop数据迁移,完成HIve与MySQL数据库中的数据交互 (4)Echarts搭建动态可视化大屏 (5)SpringBoot搭建可视化后台系统,完成前端与后台的数据传递与交互。 (6)基于Cenots7 搭建虚拟机,配置Hadoop、HDFS、...
【标题】:“基于Hadoop平台Hive数据库处理电影数据”的文档详细介绍了如何在Hadoop分布式环境中利用Hive进行大规模电影数据的分析。该系统的主要目标是建立一个分布式Hadoop集群,并在此基础上对电影数据进行深入...
基于SpringBoot+hiveJDBC+echarts的数据大屏可视化和大数据分析源码+项目使用说明.zip 【项目介绍】 使用Hadoop技术可快速生成分析结果,对6万条美妆销售数据分析,将数据转化为有价值的数据。 在Centos7中搭建 ...
电影票房数据分析-hive代码 本资源为电影票房数据分析的Hive代码,涵盖了四个主要部分:统计2020年上映的电影中当前总票房最高的10部电影、统计2020年国庆假期中电影票房增长最多的三部电影及其每日的票房数据、...
在这个“hive案例之---------微博数据分析及答案”的项目中,我们将会探索如何利用Hive进行大规模的微博数据挖掘与分析。 首先,项目说明文档.docx可能会详细阐述了该项目的目标、背景、数据来源以及预期的结果。...
### 数据分析系统Hive #### 一、Hive背景及应用场景 Hive 是一款由 Facebook 开源的数据仓库工具,主要用于处理大规模数据集。它通过提供一种 SQL 类似的查询语言 HQL,使得用户能够轻松地对存储在 Hadoop 分布式...
《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第5章 Hive数据操作.pdf《Hive数据仓库案例教程》教学课件 第...
通过上述分析可以看出,对于“Spark或MR引擎插入的数据,Hive表查询数据为0”的问题,主要是因为Tez引擎在执行`UNION ALL`操作时生成了额外的子目录,而查询引擎未能正确处理这些子目录。通过改变数据写入方式或是...
网站流量数据分析 (MapReduce+Hive综合实验)
【基于Hadoop豆瓣电影数据分析实验报告】 在大数据时代,对海量信息进行高效处理和分析是企业决策的关键。Hadoop作为一款强大的分布式计算框架,自2006年诞生以来,已经在多个领域展现了其卓越的数据处理能力。本...
选题:搜狗日志查询分析 (MapReduce+Hive综合实验) 前提条件: 安装好hadoop2.8.0 安装好HQL 安装好Hive 安装好eclipse 选题要求: 解压数据源,并上传到hdfs,保存的目录以个人学号区分,176为我的学号 创建hive...
1. **Hadoop数据仓库Hive**:Hive是由Facebook开发的一种基于Hadoop的数据仓库工具,它允许SQL熟悉的用户对存储在Hadoop分布式文件系统(HDFS)上的大规模数据进行分析。Hive将结构化的数据文件映射为数据库表,提供了...