In-Depth Study of "Programming Hive": HiveQL Queries (1)

flyingdutchman


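        Earlier chapters covered Hive table definitions and data manipulation. This chapter turns to HiveQL queries.
        SELECT ... FROM ... queries
        SELECT is the projection operator in SQL. Let's look again at the partitioned table employees defined earlier: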

              CREATE TABLE employees (  
                name STRING,  
                salary FLOAT,  
                subordinates ARRAY<STRING> COMMENT 'Subordinates',  
                deductions MAP<STRING,FLOAT> COMMENT 'Deductions',  
                address STRUCT<street:STRING,city:STRING,state:STRING,zip:INT>  
              )  
              PARTITIONED BY(country STRING,state STRING);

A SELECT query:

              hive> SELECT name,salary FROM employees;
              John Doe      100000.0
              Mary Smith     80000.0
              Todd Jones     70000.0
              Bill King      60000.0

You can also give the table, view, or subquery that follows FROM an alias, for example:

              hive> SELECT e.name,e.salary FROM employees e;

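The two HiveQL statements above are equivalent. Table aliases are especially useful in JOIN operations.
Next, let's see how to query the collection-typed columns of the employees table. First, an ARRAY column, such as subordinates: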

              hive> SELECT name,subordinates FROM employees;
              John Doe     ["Mary Smith","Todd Jones"]
              Mary Smith   ["Bill King"]
              Todd Jones   []
              Bill King    []

Next, a MAP column, such as deductions:

              hive> SELECT name,deductions FROM employees;
              John Doe   {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
              Mary Smith {"Federal Taxes":0.2,"State Taxes":0.05,"Insurance":0.1}
              Todd Jones {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}
              Bill King  {"Federal Taxes":0.15,"State Taxes":0.03,"Insurance":0.1}

Next, a STRUCT column, such as address:

              hive> SELECT name,address FROM employees;
              John Doe   {"street":"1 Michigan Ave.","city":"Chicago","state":"IL","zip":60600}
              Mary Smith {"street":"100 Ontario St.","city":"Chicago","state":"IL","zip":60601}
              Todd Jones {"street":"200 Chicago Ave.","city":"Oak Park","state":"IL","zip":60700}
              Bill King  {"street":"300 Obscure Dr.","city":"Obscuria","state":"IL","zip":60100}

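Next, let's see how to reference individual elements inside collection-typed columns. ARRAY elements are selected with a zero-based index, MAP values with their key, and STRUCT fields with dot notation. Indexing past the end of an ARRAY returns NULL: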

              hive> SELECT name,subordinates[0],deductions["State Taxes"],address.city FROM employees;
              John Doe    Mary Smith  0.05  Chicago
              Mary Smith  Bill King   0.05  Chicago 
              Todd Jones  NULL        0.03  Oak Park
              Bill King   NULL        0.03  Obscuria
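
The NULL values in the output above come from indexing into an empty ARRAY. As a small sketch of one way to guard against that, the query below uses Hive's built-in size() and coalesce() functions. The column aliases and the 'none' placeholder are illustrative choices, not part of the original example.

```sql
-- size() returns the number of elements in an ARRAY (or MAP).
-- coalesce() returns its first non-NULL argument, so employees
-- without subordinates show 'none' instead of NULL.
SELECT name,
       size(subordinates)                AS num_subordinates,
       coalesce(subordinates[0], 'none') AS first_subordinate
FROM employees;
```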

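Selecting columns with regular expressions
In a Hive query you can use a regular expression to select the columns whose names match it. The example below selects the symbol column and every column whose name starts with price. Note that the pattern is enclosed in backticks, not single quotes; a single-quoted 'price.*' would be treated as a string literal:

              hive> SELECT symbol,`price.*` FROM stocks;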
              AAPL  195.69  197.88  194.0  194.12  194.12
              AAPL  192.63  196.0   190.85  195.46  195.46
              AAPL  196.73  198.37  191.57  192.05  192.05
              AAPL  195.17  200.2   194.42  199.23  199.23
              AAPL  195.91  196.32  193.38  195.86  195.86
              ...
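
One caveat for readers on a newer Hive release than the one this article was written against: from Hive 0.13.0 onward, backticks quote ordinary identifiers by default, so regex column selection only works after setting hive.support.quoted.identifiers to none. A minimal sketch, assuming the same stocks table:

```sql
-- Needed on Hive 0.13.0 and later; earlier releases accept
-- regex column specifications without this setting.
SET hive.support.quoted.identifiers=none;

-- Selects symbol plus every column whose name matches price.*
SELECT symbol, `price.*` FROM stocks;
```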

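Computing with column values
In HiveQL you can not only select columns from a table but also compute values from them with functions and arithmetic expressions. For example, the following query returns each employee's name in upper case, salary, federal tax rate, and the salary after federal taxes, rounded to the nearest integer: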

              hive> SELECT upper(name),salary,deductions["Federal Taxes"],
                  > round(salary * (1 - deductions["Federal Taxes"])) 
                  > FROM employees;
              JOHN DOE    100000.0  0.2   80000
              MARY SMITH   80000.0  0.2   64000
              TODD JONES   70000.0  0.15  59500
              BILL KING    60000.0  0.15  51000

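        Hive is open-source software written in Java, and when functions or arithmetic expressions compute column values, implicit type conversions generally follow Java's rules.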
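
When the implicit conversion is not the one you want, an explicit CAST makes the intended type visible in the query. The following is a small sketch against the employees table; the aliases are made up for illustration.

```sql
SELECT name,
       cast(salary AS INT)         AS salary_int,  -- FLOAT converted to INT
       cast(address.zip AS STRING) AS zip_text     -- INT converted to STRING
FROM employees;
```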

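        Aggregate functions
        Aggregate functions work without any special configuration, but you can often improve their performance by setting the hive.map.aggr property to true, which makes Hive perform partial aggregation in the map phase:

              hive> SET hive.map.aggr=true;
              hive> SELECT count(*),avg(salary) FROM employees;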

However, setting hive.map.aggr to true uses more memory.
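
Other built-in aggregates are used the same way. As an illustration only (the aliases are invented here), several of them can be combined in one query over the employees table:

```sql
SELECT count(*)                     AS num_employees,
       count(DISTINCT address.city) AS num_cities,
       min(salary)                  AS min_salary,
       max(salary)                  AS max_salary,
       sum(salary)                  AS total_salary,
       avg(salary)                  AS avg_salary
FROM employees;
```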

LIMIT
A typical HiveQL query returns every row that matches, but the LIMIT clause caps the number of rows returned:

              hive> SELECT upper(name),salary,deductions["Federal Taxes"],
                  > round(salary * (1 - deductions["Federal Taxes"])) 
                  > FROM employees 
                  > LIMIT 2;
              JOHN DOE    100000.0  0.2   80000
              MARY SMITH   80000.0  0.2   64000

        Column aliases
        You can name a column or a computed expression with AS:

              hive> SELECT upper(name),salary,deductions["Federal Taxes"] AS
                  > fed_taxes,round(salary * (1 - deductions["Federal Taxes"])) AS
                  > salary_minus_fed_taxes
                  > FROM employees 
                  > LIMIT 2;
              JOHN DOE    100000.0  0.2   80000
              MARY SMITH   80000.0  0.2   64000

Subqueries
Column aliases are especially useful for referring to the columns of a subquery. Let's turn the previous example into a subquery:

              hive> FROM(
                  >   SELECT upper(name) AS name,salary,deductions["Federal Taxes"] AS
                  >   fed_taxes,round(salary * (1 - deductions["Federal Taxes"])) 
                  >   AS salary_minus_fed_taxes
                  >   FROM employees
                  > ) e
                  > SELECT e.name,e.salary_minus_fed_taxes
                  > WHERE e.salary_minus_fed_taxes > 70000;
              JOHN DOE    80000

CASE ... WHEN ... THEN expressions
As in standard SQL, CASE ... WHEN ... THEN can be used in the SELECT list to choose a result based on the value of a column, for example:

              hive> SELECT name,salary,
                  >   CASE
                  >     WHEN  salary < 50000.0 THEN 'low'
                  >     WHEN  salary >= 50000.0 AND salary < 70000.0 THEN 'middle'
                  >     WHEN  salary >= 70000.0 AND salary < 100000.0 THEN 'high'
                  >     ELSE 'very high'
                  >   END AS bracket FROM  employees;
                  John Doe        100000.0  very high         
                  Mary Smith       80000.0  high
                  Todd Jones       70000.0  high
                  Bill King        60000.0  middle
                  Boss Man        200000.0  very high
                  Fred Finance    150000.0  very high
                  Stacy Accountant 60000.0  middle
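
Because the WHEN branches are tested in order and the first match wins, the lower bounds in the example above are redundant. An equivalent, shorter form might look like this:

```sql
SELECT name, salary,
  CASE
    WHEN salary <  50000.0 THEN 'low'        -- below 50000
    WHEN salary <  70000.0 THEN 'middle'     -- reached only if salary >= 50000
    WHEN salary < 100000.0 THEN 'high'       -- reached only if salary >= 70000
    ELSE 'very high'
  END AS bracket
FROM employees;
```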

WHERE clauses
SELECT determines which columns are returned, while WHERE determines which rows are returned:

               hive> SELECT name,salary,deductions["Federal Taxes"],
                   >   salary * (1 - deductions["Federal Taxes"])
                   > FROM employees
                   > WHERE round(salary * (1 - deductions["Federal Taxes"])) >  
                   >  70000;
              John Doe   100000.0  0.2  80000.0

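This example has one drawback: the expression salary * (1 - deductions["Federal Taxes"]) is written out in both the SELECT list and the WHERE clause, which is repetitive and not ideal for performance. Can a column alias for salary * (1 - deductions["Federal Taxes"]) remove the duplication? Unfortunately, this does not work: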

               hive> SELECT name,salary,deductions["Federal Taxes"],
                   >   salary * (1 - deductions["Federal Taxes"]) AS 
                   >   salary_minus_fed_taxes
                   > FROM employees
                   > WHERE round(salary_minus_fed_taxes) >  70000;
              FAILED:Error in semantic analysis: Line 4:13 Invalid table alias or 
              column reference 'salary_minus_fed_taxes': (possible column names  
              are: name,salary,subordinates,deductions,address)

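As the error message indicates, a column alias cannot be referenced in the WHERE clause. Is there another way to remove the duplication? The answer is to use a subquery: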

              hive> SELECT e.* FROM
                  > (SELECT name,salary,deductions["Federal Taxes"] AS ded,
                  >    salary * (1 - deductions["Federal Taxes"]) AS 
                  >    salary_minus_fed_taxes
                  >  FROM employees) e
                  > WHERE round(salary_minus_fed_taxes) >  70000;

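On Hive 0.13.0 and later, the same idea can also be written with a common table expression (a WITH clause), which some readers find easier to follow than an inline subquery. This is an alternative sketch, not the form used in the book; the name e and the column aliases are carried over from the subquery example.

```sql
-- The CTE computes the derived column once and names it,
-- so the outer WHERE clause can refer to the alias.
WITH e AS (
  SELECT name, salary,
         deductions["Federal Taxes"] AS ded,
         salary * (1 - deductions["Federal Taxes"]) AS salary_minus_fed_taxes
  FROM employees
)
SELECT e.*
FROM e
WHERE round(e.salary_minus_fed_taxes) > 70000;
```
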
2013-05-14 23:37