hive的sort by应用

p_x1984

浏览: 1192881 次
性别:
来自: 北京

最近访客更多访客>>

u012363178

清风_秋雨

sun80264629

shaoaj

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

hive

在hive中不光有order by操作，还有个sort by操作。两者执行的都是排序的操作，但有存在很大的不同。
还是用上次order by的例子来说明。

测试用例
hive> select * from test09;
OK
100 tom
200 mary
300 kate
400 tim
Time taken: 0.061 seconds

hive> select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0068, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0068
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0068
2011-05-03 05:39:21,389 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:39:23,410 Stage-1 map = 50%, reduce = 0%
2011-05-03 05:39:25,430 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:39:30,470 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:39:32,493 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0068
OK
100 tom
300 kate
200 mary
400 tim
Time taken: 17.783 seconds

结果看起来和order by差不多，但是sort by是不受hive.mapred.mode参数影响，无论hive.mapred.mode在什么模式都可以。
从上面的Number of reduce tasks not specified. Defaulting to jobconf value of: 2可以看得出来，此时共启动了2个reduce。
实际上sort by可以控制每个reduce产生的文件都是排序的，这样对多个已经排序好的文件做一次归并排序就ok了。
比用order by的时候，仅仅有单个reduce要好得多。

我们把上面的结果写到文件中就看得清楚的多了。

hive> insert overwrite local directory ‘/home/hjl/sunwg/qqq’ select * from test09 sort by id;
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks not specified. Defaulting to jobconf value of: 2
In order to change the average load for a reducer (in bytes):
set hive.exec.reducers.bytes.per.reducer=
In order to limit the maximum number of reducers:
set hive.exec.reducers.max=
In order to set a constant number of reducers:
set mapred.reduce.tasks=
Starting Job = job_201105020924_0069, Tracking URL = http://hadoop00:50030/jobdetails.jsp?jobid=job_201105020924_0069
Kill Command = /home/hjl/hadoop/bin/../bin/hadoop job -Dmapred.job.tracker=hadoop00:9001 -kill job_201105020924_0069
2011-05-03 05:41:27,913 Stage-1 map = 0%, reduce = 0%
2011-05-03 05:41:30,939 Stage-1 map = 100%, reduce = 0%
2011-05-03 05:41:37,993 Stage-1 map = 100%, reduce = 50%
2011-05-03 05:41:41,023 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201105020924_0069
Copying data to local directory /home/hjl/sunwg/qqq
Copying data to local directory /home/hjl/sunwg/qqq
4 Rows loaded to /home/hjl/sunwg/qqq
OK
Time taken: 18.496 seconds

[hjl@sunwg src]$ ll /home/hjl/sunwg/qqq
total 8
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000000_0
-rwxrwxrwx 1 hjl hjl 17 May 3 05:41 attempt_201105020924_0069_r_000001_0

此时产生了2个文件，分别查看每个文件的内容。

[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000000_0
100tom
300kate
[hjl@sunwg src]$ cat /home/hjl/sunwg/qqq/attempt_201105020924_0069_r_000001_0
200mary
400tim

可以看得出来每个文件的内部都是排好顺序的。

order by和sort by都可以实现排序的功能，不过具体怎么使用还得根据情况，如果数据量不是太大的情况可以使用order by，如果数据库过于庞大，最好还是使用sort by。

分享到：

hive的distribute by应用 | hive的order by应用

2011-08-23 17:02
浏览 1716
评论(0)
分类:开源软件
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论