将公司oracle脚本迁移到hive平台hql时一些总结

chengjianxiaoxue

浏览: 1321444 次
性别:
来自: 北京

最近访客更多访客>>

liu_shui8

happy2012

nddht

yhtppp

博主相关

博客

微博

相册

留言

关于我

文章分类

社区版块

存档分类

博客分类：

hive

2 hive in http://www.tuicool.com/articles/vauiAzN

3 inner join http://www.cnblogs.com/xd502djj/archive/2013/01/18/2866662.html

4 字符串连接符 oracle 'aaa' || 'bbb' = 'aaabbb' 对应hive---> concat

5 oracle substr(字符串,截取开始位置,截取长度) --> substr('Hello World',2,4) //返回结果为 'ello'
hive substr(string A, intstart, int len) --->hive> select substr('abcde',3,2) from lxw_dual;   // cd   oracle和hive的功能是一样的

6 oracle a.entname||'%'     字符串连接函数
hive    concat(a.entname,'%')

7 在oracle中, where substr(enttype,1,2) in ('43','58','68')
                or   substr(enttype,1,3) in ('455')
    hive       where substr(enttype,1,2) = '43' or substr(enttype,1,2) = '58' or substr(enttype,1,2) = '68' or substr(enttype,1,3) = '455' 需要写在一行否则报错

8 hive 获取当前时间
这里使用shell方式,
1 有个模板文件,里面 sql为 where date1 > to_date('%SYSDATE%')
2写shell，将模板文件cp到目标目录，得到shell的当前时间后用 sed -i将拷贝到目标目录的文件的%SYSDATE%替换为shell的当前时间
3 shell中通过 hive -f 方式执行替换后的目标文件
eg:
current_date=$(date +%Y-%m-%d)

src_dir=src
target_dir=target_$current_date

rm -rf $target_dir > /dev/null 2>&1
mkdir $target_dir

cp $src_dir/temp.sql $target_dir/temp.sql
sed -i "s/%SYSDATE%/$current_date/g" $target_dir/temp.sql

sudo -u hdfs hive -f $target_dir/temp.sql

9 hive多库下,创建表可以指定库名, eg : create table mydb.stu(id string, name string);
使用hive -f时，可以在对应linux路径的 .sql文件中第一行指定库,这样sql里面的创建表都会在指定库下 create database if not exists st_statics; use st_statics;

10 hive union all 和 oracle的 union all结果一样都是讲两个表的所有数据都弄在一起即使重复也弄在一起 http://www.tuicool.com/articles/MZf6ny

11 hive case when CN is null then 0 else CN end CN

12 http://blog.csdn.net/jim110/article/details/6923934 left join 过滤条件写在on后面和写在where 后面的区别

13 hive 资源被使用完了启动hive脚本不执行使用 mapred job -list

14 hive join连接中的不等值和过滤条件区别：
join连接的过滤条件： from a join b on a.id = b.id where a.fsq_zlix <> '12' 这是过滤不是不等值连接

15 在使用cdh的时候，使用root登录后,hive环境下执行hive sql，有时候会报cannot run program "/opt/cloudera/parcels/CHD-5.2.1-../hadoop/bin/hadoop" (in directory "/root") error=13
Permission denied的错.此时登录到 /home/hdfs/下进入 sudo -u hdfs hive 执行hive脚本就不会报错

16 hive自定义函数, 在 xx.sql中引用写法：
  1 写好hive udfs函数打jar
  2 将jar 放再集群hive安装节点一个位置 eg: /cloud/mr/ent_statistics/ent_statistics.jar
  3 xx.sql(hive sql脚本)开头，写:
    set hive.cli.print.current.db=true;
    set hive.cli.print.header=true;
    add jar /cloud/mr/ent_statistics/ent_statistics.jar
    create temporary function instrfun as 'org.apache.hadoop.hive.ql.udf.instrfun';

17 oracle union 在hive应用: 用collect_set 这个函数试试
  感觉应该是现将两个表相同的记录弄临时表a后然后在将这两个表用 collect_set得到没有重复的记录b 然后在合并

18 hive uuid自定义函数: @UDFType(deterministic = false)    http://chengjianxiaoxue.iteye.com/admin/blogs/2263225

20 将linux本地文件上传到hive表中, 要让hive表中 name is null,那么上传文件name对应字段为 \N 如果写成null，那么入到hive表后为 name='null'而不是is null.

21 hive两个表列间隔,
一个是\t
stu:
1 张三
2 李四
一个是,
score
1 50
2 90
执行hive: select a.* ,b.score from stu a left join b score on a.id=b.id; 能够执行成，也就是说,表中列的间隔尽管不同，但是Hive的编译执行器会帮你过滤掉不同
实现统一化表操作。

22 hive on where的区别以及where是在哪里执行的 map 还是reduce
   hive不等值连接：
   http://blog.csdn.net/lpxuan151009/article/details/7956568
   http://blog.csdn.net/kntao/article/details/17440903
   http://my.oschina.net/leejun2005/blog/98244
   http://my.oschina.net/leejun2005/blog/307812
   http://scholers.iteye.com/blog/1771945 ！！！！

23 测试下hive0.13下 select name form t1; 是否还在走mr 在走，----> 不走

24 hive join on where : joins发生在where字句前,如果果要限制join的输出，需要写在where字句，否则写在JOIN字句
   http://yugouai.iteye.com/blog/1849395 join on where 中的详细介绍
   SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)
   WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'    -----> join on where中where是在join on后的结果集后在进一步过滤

   或者写成: SELECT a.* FROM a JOIN b ON (a.id = b.id AND a.department = b.department) 就是将where的过滤写在了on上。
   详细举例：
stu:
1 张三
2 李四
3   王五
4   赵六
一个是,
score
1 90
2 50
3   150
5   200
join on where 分开写:
select a.*, b.score from stu a left join score b on a.id = b.id where a.id='1';
1 张三 90
join on where条件写在on内
select a.*, b.score from stu a left join score b on (a.id = b.id and a.id='1');    可以看到写法不一样明显操作方式是不同的。
1 张三 90
2 李四    NULL
3   王五    NULL
4   赵六    NULL

25 oracle Procedure结构:
declare
定义部分(可选)
begin
可执行部分(必选)
exception
异常处理部分(可选)
end;

26 oracle insert into stu with xxx as tmp1   ....   这个表示根据xxx的查询规则得到临时表tmp1, 然后在后面sql中引用tmp1的数据在和别的表处理
   处理后的结果在保存到表 stu中。

27 hive union all, 子查询, left join !!!! http://www.tuicool.com/articles/MZf6ny
stu a union all tea b; ---->
id   | name     |
+------+----------+
| 1    | zhangsan |
| 2    | lisi     |
| 3    | wangwu   |
| 1    | zhaoliu |
| 2    | lisi     |
| 1    | 50       |
| 4    | 90       |
| 5    | 150

28 where a.pripid = b.pripid(+) ，哪边有（+）哪边就允许为空
等同于 from a left join b on a.pripid = b.pripid;
SELECT a.*, b.* from a(+) = b就是一个右连接，等同于select a.*, b.* from a right join b
SELECT a.*, b.* from a = b(+)就是一个左连接，等同于select a.*, b.* from a left join b

29 oracle : oracle中的nulls last/first 等效于 mysql的 order by desc/asc

30 oracle: http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2842855.html hive in not in

31 hive row_number() over(partition by pripid order by candate) rn from d_e_gt_cancel where .... hive0.13之后可以支持

32 hive 类型强转 select CONAM ,cast(CONAM as decimal(18,6))*0.07 as test111 from d_ei; 其中,CONAM是投资金额在oracle中是number(18,6)的精度,hive中用string存

在hive计算中会出现科学计数法，需要原值展示并保存到hdfs，(如果是用科学计数法存在hdfs的数比如真实是 85423318.1568，而科学计数法为 8.542331815680001E7 那么用后者数据在导入到关系库时也是后者写法)，现在需要前者方式存储，写法为：

select cast(0.615*8.32*16694676 as dicimal(18,4)) from dual limit 2;

85423318.1568

85423318.1568
注意，必须是18位，如果位数小返回结果为NULL

33 join on stu.name = score.name 时需要注意的：

一般关系库导入hdfs时都会对Null做初始化处理比如处理成''

在hive中使用 join时候要注意 join on stu.name = score.name 两边都为 ''的情况

一般业务上会对两边为'' 做过滤处理，eg:

join on stu.name = score.name and stu.name !='' and score.name != ''

否则会多出很多数据, eg:

stu:
id	name
1	zhangsan
2	''

score:
id	name	score
1	zhangsan	50
2	''	90
3	''	150

create table stu(id string,name string) row format delimited fields terminated by '\t';

create table score(id string,name string,score string) row format  delimited fields terminated by '\t';

load data local inpath 'stu' into table stu;
load data local inpath 'score' into table score;

select stu.*, sco.score from stu stu join score sco on stu.name=sco.name;
Total MapReduce CPU Time Spent: 1 seconds 280 msec
OK
stu.id  stu.name        sco.score
1       zhangsan        50
2       ''      90
2       ''      150

34 关于join on 条件1 and 条件2 的补充:

join on 条件1 where 条件2 ----> 需要些笔记总结和实验，

35 hive函数 concat转义字符 ;的使用:

select ('a',';','b') from stu; 会报错，报什么EOF的错，其实就是 ;需要转义的意思，

应该修改成 select ('a','\;','b') from stu; 最后结果就是 a;b

36 业务要求如下：

对业务主键字段去重，同时还有: 
若FCP_FZRQ不为空时，选取发布日期较早的一条，  a
若FCP_FZRQ不全为空时，保留不为空的一条， b
若FCP_FZRQ同时为空时，随机保留一条： c

下面写法中，order by case when FQX_FZRQ <> '' then FQX_FZRQ else null 表示分组时就按照FQX_FZRQ升序排，这样在分组后FQX_FZRQ都不是null的情况下，
自然能将 asc下第一条获取到，满足了条件a,

然后，条件b下，then FQX_FZRQ else null时,那么这个null在 asc排序下不参与排序,规则和oracle的一致，如下
缺省Oracle在Order by 时缺省认为null是最大值，所以如果是ASC升序则排在最后，DESC降序则排在最前 
这样的情况下，自然能符合条件b的获取数据，
至于条件c，因为是随便的因此获取数据规则和条件 a,b一致即可。

代码如下

insert into table f_china_qyjs
select sysdateFun1('yyyyMMdd'),
t.pripid,
t.s_ext_nodenum,
t.entname,
t.entname_gllzd,
tocharFun(t1.FQX_ISSDATE,'yyyy-MM-dd','yyyy/MM/dd'),
'14',
'A1402',
'',
t1.FQX_EVAUNIT,
concat(t1.FQX_CERNUM,t1.FQX_EVAUNIT,t1.FQX_ISSDATE,t1.FQX_ENTNAME),
t1.fqx_id
from (select * from be_derivative_db20151200.f_ent_qyzm where s_ext_nodenum='110000') t
inner join (select *, row_number() over(partition by FQX_CERNUM,FQX_EVAUNIT,FQX_ISSDATE,FQX_ENTNAME order by case when FQX_FZRQ <> '' then FQX_FZRQ  else null end asc ) num from ST_SRRZ.CHANPIN) t1 
on t.entname_gllzd = t1.FCP_QYMC_GLLZD and t1.num = 1 
where length(t.entname_gllzd) > 3 and t1.FCP_FZRQ <> '' AND T1.FCP_FZRQ is not null;

37: 将企业成立之前的所有废弃数据删除掉：

select xxx, .....  
from (select *, min(case when fhq_act='A0101' then fhq_time else null end) over(partition by fhq_lcid, fhq_area) qycl_date from f_china_qyjs_2 t where fhq_time<> '') t
left join m_weidu2 t1 on t.fhq_act = t1.wddm 
left join m_shijian t2 on t.fhq_act = t2.wddm 
where  t.fhq_time >=t.qycl_date 
group by t.ddate, toCharFun1(t.fhq_time,'yyyy/MM/dd','yyyy/MM'), t.fhq_lcid, t.fhq_areq, t2.wdmc, t1.wddw,t1.wdzd;

上面的min是当状态为A0101即企业成立时，将fhq_lcid, fhq_area作为分组字段后聚合所有相同企业并得到这些行的最小企业成立时间，
然后在每条记录对比的时候，用每条记录的fhq_time和这个最小成立时间qycl_date做对比来过滤掉不符合的数据。

分享到：

hadoop fsimage edits关系 | hive自定义函数-uuid

2015-12-11 14:18
浏览 5758
评论(0)
分类:互联网
查看更多

发表评论

您还没有登录,请您登录后再发表评论

最近访客更多访客>>

博主相关

文章分类

社区版块

存档分类

最新评论