`
k_lb
  • 浏览: 856186 次
  • 性别: Icon_minigender_1
  • 来自: 郑州
社区版块
存档分类
最新评论
  • kitleer: 据我所知,国内有款ETL调度监控工具TaskCTL,支持ket ...
    kettle调度

hive中使用自定义函数(UDF)实现分析函数row_number的功能

 
阅读更多
之前部门实现row_number是使用的transform,我觉得用UDF实现后,平时的使用会更方便,免去了transform相对繁琐的语法。

<wbr><wbr><wbr>用到的测试表为:</wbr></wbr></wbr>

hive> desc row_number_test;
OK
id1<wbr><wbr><wbr><wbr>int<br> id2<wbr><wbr><wbr><wbr>string<br> age<wbr><wbr><wbr><wbr>int<br> score<wbr><wbr> double<br> name<wbr><wbr><wbr>string</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>

<wbr></wbr>

hive> select * from row_number_test;
OK
2<wbr><wbr><wbr><wbr><wbr><wbr>t04<wbr><wbr><wbr><wbr>25<wbr><wbr><wbr><wbr><wbr>60.0<wbr><wbr><wbr>youlia<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t01<wbr><wbr><wbr><wbr>20<wbr><wbr><wbr><wbr><wbr>85.0<wbr><wbr><wbr>liujiannan<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>zengqiu<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>30<wbr><wbr><wbr><wbr><wbr>88.0<wbr><wbr><wbr>hongqu<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>27<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>yongqi<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>19<wbr><wbr><wbr><wbr><wbr>75.0<wbr><wbr><wbr>wangdong<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>zengqiu</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>

<wbr></wbr>

使用时要先在子查询中进行分区与排序,比如oracle中这样一句SQL:

select row_number() over (partition by id1 order by age desc)from row_number_test;

转换为hive语句应该是:

select row_number(id1) from<wbr><wbr>--partition by的字段传到row_number函数中去</wbr></wbr>

<wbr><wbr><wbr> (select *from row_number_test distribute by id1 sort by id1,age desc) a;</wbr></wbr></wbr>

<wbr></wbr>

如果partition by 两个字段:

select row_number() over (partition by id1,id2 orderby<wbr>score) from row_number_test;</wbr>

转换为hive语句应该是:

select row_number(id1,id2)<wbr><wbr>--partition by的字段传到row_number函数中去</wbr></wbr>

<wbr><wbr><wbr> from(select * from row_number_test distribute by id1,id2 sort byid1,id2,score) a;</wbr></wbr></wbr>

<wbr></wbr>

展示一下查询结果:

1.

select id1,id2,age,score,name,row_number(id1) rn from (select *from row_number_test distribute by id1 sort by id1,age desc) a;

<wbr></wbr>

OK
2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>30<wbr><wbr><wbr><wbr><wbr>88.0<wbr><wbr><wbr>hongqu<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>1<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>27<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>yongqi<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>2<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t04<wbr><wbr><wbr><wbr>25<wbr><wbr><wbr><wbr><wbr>60.0<wbr><wbr><wbr>youlia<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>3<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>zengqiu<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>1<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>zengqiu<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>2<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t01<wbr><wbr><wbr><wbr>20<wbr><wbr><wbr><wbr><wbr>85.0<wbr><wbr><wbr>liujiannan<wbr><wbr><wbr><wbr><wbr>3<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>19<wbr><wbr><wbr><wbr><wbr>75.0<wbr><wbr><wbr>wangdong<wbr><wbr><wbr><wbr><wbr><wbr><wbr>4</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>

<wbr></wbr>

2.

select id1,id2,age,score,name,row_number(id1,id2) rn from(select * from row_number_test distribute by id1,id2 sortby id1,id2,score) a;

<wbr></wbr>

OK
2<wbr><wbr><wbr><wbr><wbr><wbr>t04<wbr><wbr><wbr><wbr>25<wbr><wbr><wbr><wbr><wbr>60.0<wbr><wbr><wbr>youlia<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>1<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>zengqiu<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>1<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>27<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr>yongqi<wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr><wbr>1<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>24<wbr><wbr><wbr><wbr><wbr>70.0<wbr><wbr><wbr> zengqiu<wbr><wbr><wbr><wbr> 2<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t02<wbr><wbr><wbr><wbr>19<wbr><wbr><wbr><wbr><wbr>75.0<wbr><wbr><wbr> wangdong<wbr><wbr><wbr><wbr>3<br> 1<wbr><wbr><wbr><wbr><wbr><wbr>t01<wbr><wbr><wbr><wbr>20<wbr><wbr><wbr><wbr><wbr>85.0<wbr><wbr><wbr>liujiannan<wbr><wbr><wbr><wbr><wbr>1<br> 2<wbr><wbr><wbr><wbr><wbr><wbr>t03<wbr><wbr><wbr><wbr>30<wbr><wbr><wbr><wbr><wbr>88.0<wbr><wbr><wbr> hongqu<wbr><wbr><wbr><wbr><wbr>2</wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr></wbr>

<wbr></wbr>

下面是代码,只实现了接收1个参数和2个参数的evaluator方法,参数再多的照搬代码就可以了,代码仅供参考:

package org.rowincrement;

import java.util.ArrayList;
import org.apache.hadoop.hive.ql.exec.UDFArgumentException;
import org.apache.hadoop.hive.ql.metadata.HiveException;
import org.apache.hadoop.hive.ql.exec.UDFArgumentLengthException;
import org.apache.hadoop.hive.ql.udf.generic.GenericUDTF;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
import org.apache.hadoop.hive.serde2.objectinspector.StructObjectInspector;
import org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;

public class RowIncrement extends GenericUDTF {
 
  Object[] result = new Object[1];
 
  @Override
  public void close() throws HiveException {
  }
 
  @Override
  public StructObjectInspector initialize(ObjectInspector[] args)
          throws UDFArgumentException {
      if (args.length != 1) {
          throw new UDFArgumentLengthException("RowIncrement takes only one argument");
      }
      if (!args[0].getTypeName().equals("int")) {
       throw new UDFArgumentException("RowIncrement only takes an integer as a parameter");
        }
      ArrayList<String> fieldNames = new ArrayList<String>();
      ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
      fieldNames.add("col1");
      fieldOIs.add(PrimitiveObjectInspectorFactory.javaIntObjectInspector);
     
      return ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames,fieldOIs);
  }
 
  @Override
  public void process(Object[] args) throws HiveException {
   try
   {
    int n = Integer.parseInt(args[0].toString());
    for (int i=0;i<n;i++)
    {
     result[0] = i+1;
     forward(result);
    }
   }
   catch (Exception e) { 
			throw new HiveException("RowIncrement has an exception");
   }
  }
 public static void main(String args[])

 {

  Row_number t = new Row_number();

  System.out.println(t.evaluate(123));

  System.out.println(t.evaluate(123));

  System.out.println(t.evaluate(123));

  System.out.println(t.evaluate(1234));

  System.out.println(t.evaluate(1234));

  System.out.println(t.evaluate(1234));

  System.out.println(t.evaluate(1235));

 }
}


分享到:
评论

相关推荐

    自定义hive函数

    至于“rownum”,在标准SQL中,`ROW_NUMBER()`是一个窗口函数,但在Hive中,你需要使用`ROW_NUMBER()`或者`RANK()`等内置函数来实现类似的功能,为每一行分配一个唯一的行号。如果Hive内置的函数无法满足需求,你...

    Hive函数实例数据The-NBA-Championship.txt

    Hive函数实例数据文件通过列出不同NBA球队及其夺冠年份的方式,展现了如何在Hive中使用函数对数据进行处理。通过这一实例,我们可以了解到Hive中数据的组织形式、函数的应用以及如何处理复杂数据。 Hive是基于...

    hive 函数大全

    10. **UDAF(用户自定义聚合函数)**:用于实现自定义的聚合操作,如统计特定条件下的数据个数。 11. **UDTF(用户自定义表生成函数)**:能够返回多行数据的函数,可以将一行数据转化为多行。 以上是Hive函数的...

    hive函数大全.7z

    5. **高级特性**:可能会涉及UDF(用户定义函数)、UDAF(用户定义聚合函数)和UDTF(用户定义表生成函数)的开发和使用,这些允许用户自定义函数来扩展Hive的功能。 6. **性能优化**:如何通过分区、桶、缓存等...

    Hive-工具篇_hive_

    - **窗口函数**:ROW_NUMBER、RANK、LEAD和LAG等窗口函数在分析排序数据时非常有用,例如计算排名或滞后/领先值。 6. **Hive函数全攻略** - **内置函数**:包括字符串函数(如CONCAT、SUBSTRING)、数学函数(如...

    hive常用函数参数手册

    在大数据处理领域,Hive作为一个广泛使用的数据仓库工具,提供了丰富的内置函数来支持数据的处理与分析。本文档旨在介绍Hive中的常用函数及其用法,帮助用户更好地理解和应用这些函数。需要注意的是,由于Hive的不同版本...

    hive的函数大全.rar

    本资料“hive的函数大全.rar”包含了Hive的中英文两个版本的函数大全详细介绍,是学习和使用Hive进行大数据分析的重要参考资料。 Hive函数是Hive查询语言(HQL)的核心组成部分,它们允许用户执行各种数据操作,如...

    hive 表 dml 操作.zip

    此外,还可以使用子查询、窗口函数(ROW_NUMBER、RANK、DENSE_RANK 等)进行复杂的数据分析。 5. **PARTITION**:在Hive中,分区是一种组织大量数据的有效方式,可以提高查询性能。通过在表上定义一个或多个分区列...

    大数据组件 Hive 面试题 + Hive 高频面试题

    窗口函数是数据分析中的利器,如`RANK()`、`DENSE_RANK()`和`ROW_NUMBER()`。它们在`OVER()`定义的窗口内工作,可以实现动态排序。`LAG()`和`LEAD()`用于获取当前行前后的数据,而`NTILE()`将数据分配到固定数量的组...

    《企业级Hive实战课程》大纲

    - 窗口和分析函数的使用(SUM、AVG、MIN、MAX、NTILE、ROW_NUMBER、RANK等); - 虚拟列(INPUT__FILE__NAME、BLOCK__OFFSET__INSIDE__FILE)的含义与应用场景。 6. **Hive存储格式与压缩** - 行式存储与列式...

    HIVE面试题集锦201901

    8. **Hive UDF(用户自定义函数)**:包括UDF(单行函数)、UDAF(聚合函数)和UDTF(多行函数),允许用户扩展Hive的功能,处理特殊数据类型或复杂逻辑。 9. **Hive与Spark集成**:使用Spark作为执行引擎,可以...

    Hive编程指南+HIVE从入门到精通+Hive高级编程+Apache Oozie

    4. **视图与UDF**: 视图简化了复杂查询,而用户自定义函数(UDF)允许用户扩展Hive的功能。 5. **Hive与Hadoop的交互**: 了解如何在Hadoop集群上部署和配置Hive,以及如何通过HDFS交互数据。 **Hive从入门到精通** ...

    hive编程指南

    - **UDF(User Defined Function)**:用户可以定义自己的函数来扩展HQL的功能,实现更复杂的数据处理逻辑。 ### 数据加载与导出 Hive支持多种方式来加载数据到表中,如使用`LOAD DATA INPATH`命令将HDFS中的数据...

    2021年各大企业大数据技术面试题.pdf

    开窗函数如`ROW_NUMBER()`, `RANK()`, `LEAD()`, `LAG()`等在SQL中用于在数据窗口上执行计算,比如计算排名或移动平均。自连接SQL是将一个表与自身进行连接,常用于处理具有层级关系的数据或查找相关记录。 2. **...

    Spark SQL操作大全.zip

    - **窗口函数**:提供类似SQL的窗口函数,如row_number、lead、lag、rank等,用于处理时间序列数据和排名问题。 - **UDF(用户自定义函数)**:可以注册自定义函数,扩展Spark SQL的功能。 - **Join操作**:支持...

    Spark SQL最佳实践

    - **窗口函数**: 支持窗口函数,如`row_number()`, `lead()`, `lag()`等,进行复杂的数据分析。 - **Joins**: 提供不同类型的join操作,包括内连接、外连接和全连接。 - **UDF(用户自定义函数)**: 可以注册Java...

Global site tag (gtag.js) - Google Analytics