Learning HiveQL and Common Hive Operations

郑云飞

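Hive Services

The Hive shell is only one of several services that can be run with the hive command. The --service option specifies which service to run, and typing hive --service help prints the list of available services. The most useful ones are described below.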

cli

The command-line interface to Hive (the shell). This is the default service.

hiveserver

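Runs Hive as a server exposing a Thrift service, so that clients written in different languages can access it. Applications that use the Thrift, JDBC, or ODBC connectors need a running Hive server to communicate with Hive. Set the HIVE_PORT environment variable to specify the port the server listens on (default 10000).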

hwi

The Hive web interface. See the sidebar "Hive Web Interface" on page 372.

After starting the web service with hive --service hwi, it can be reached at http://ip:9999/hwi

jar

The Hive equivalent of hadoop jar. It is a convenient way to run Java applications whose classpath includes both Hadoop and Hive classes.

metastore

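By default, the metastore runs in the same process as the Hive service. With this service, the metastore can instead run as a standalone (remote) process. Set the METASTORE_PORT environment variable to specify the port the server listens on.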

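Hive Clients

Start the Hive remote access service with hive --service hiveserver &

It prints "Starting Hive Thrift Server".

Once the server is running, Hive can be accessed and operated through a Thrift client, the JDBC driver, or the ODBC driver.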

Metastore

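The metastore is the central repository for Hive metadata. It consists of two parts: a service and the backing data store.

The default backing store is a Derby database, which can only be accessed locally on a single machine.

In practice the metadata is usually kept in a remote database, separate from Hive. With MySQL, for example, it is enough to configure the MySQL connection parameters; see the installation notes.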

HiveQL

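HiveQL is Hive's language for querying and processing data. Internally, each statement is parsed into the corresponding operations, such as MapReduce jobs.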

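Data Types

Primitive types

TINYINT: 1 byte

SMALLINT: 2 bytes

INT: 4 bytes

BIGINT: 8 bytes

BOOLEAN: TRUE/FALSE

FLOAT: 4 bytes, single-precision floating point

DOUBLE: 8 bytes, double-precision floating point

STRING: character string

Complex types

ARRAY: an ordered collection of fields

MAP: an unordered collection of key-value pairs

STRUCT: a collection of named fields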
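As a small illustration of the complex types, the sketch below declares one column of each kind and reads an element from each. The table name complex_demo, its column names, and the map key 'height' are hypothetical and are not part of the original post.

```sql
-- Hypothetical table with one column per complex type.
CREATE TABLE complex_demo (
  tags  ARRAY<STRING>,                          -- ordered, accessed by index
  attrs MAP<STRING, INT>,                       -- key-value pairs, accessed by key
  site  STRUCT<name:STRING, elevation:INT>      -- named fields, accessed with a dot
);

-- Element access for each type.
SELECT tags[0], attrs['height'], site.name
FROM complex_demo;
```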

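Type Conversion

Hive allows implicit conversion wherever a value's type fits within the range of the target type (for example, a narrower numeric type to a wider one).

To convert explicitly, call the cast function, for example: cast('1' as int)

If the value cannot be converted, as in cast('x' as int), the result is NULL.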
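The statements below are a minimal sketch of the conversions just described. They assume the records table used elsewhere in this post exists and holds at least one row; the FROM clause is included because older Hive releases do not accept a SELECT without one.

```sql
-- Explicit conversion of a numeric string: yields the INT value 1.
SELECT CAST('1' AS INT) FROM records LIMIT 1;

-- A string that is not a valid number cannot be converted: yields NULL.
SELECT CAST('x' AS INT) FROM records LIMIT 1;

-- Implicit conversion: the INT column is promoted to a wider numeric type
-- before it is compared with the fractional literal.
SELECT year, temperature FROM records WHERE temperature > 40.5;
```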

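Tables

A Hive table logically consists of the stored data and the associated metadata that describes the layout of that data.

Hive has two kinds of tables: managed tables, whose data lives in Hive's own warehouse directory, and external tables, whose data lives in HDFS outside the warehouse directory. For a managed table, load and drop operate on both the data and the metadata. For an external table, they essentially touch only the metadata.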

Creating a regular table

create table records (year string, temperature int, quality int) row format delimited fields terminated by '\t';

Creating an external table

Location of the external table's data

[root@ebsdi-23260-oozie tmp]# hadoop fs -put sample.txt /user/houchangren/tmp/location
[root@ebsdi-23260-oozie tmp]# hadoop fs -mkdir /user/houchangren/tmp/location
[root@ebsdi-23260-oozie tmp]# hadoop fs -put sample.txt /user/houchangren/tmp/location
[root@ebsdi-23260-oozie tmp]# hadoop fs -cat /user/houchangren/tmp/location/sample.txt
1990    44      1
1991    45      2
1992    41      3
1993    43      2
1994    41      1

Creating the table with an external data location and viewing the data

hive> create external table tb_ext_records (year string, temperature int, quality int) row format delimited fields terminated by '\t' location '/user/houchangren/tmp/location/';
OK
Time taken: 0.133 seconds
hive> select * from tb_ext_records;
OK
1990    44      1
1991    45      2
1992    41      3
1993    43      2
1994    41      1
Time taken: 0.107 seconds
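To see the difference from a managed table, one can drop the external table and list its location afterwards. This is only a sketch for the Hive shell, reusing the table and path from the session above; the expectation is that only the metadata is removed and sample.txt remains in HDFS.

```sql
-- Dropping an external table removes its metadata only.
DROP TABLE tb_ext_records;

-- The data file should still be listed under the external location.
dfs -ls /user/houchangren/tmp/location/;
```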

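Partitions and Buckets

A partitioned table stores its data separately according to the values of one or more designated columns. Unlike a regular table, the partition columns are declared when the table is created, and they do not appear in the data files; in fact they must not.

A partitioned table can be partitioned along more than one dimension.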

Creating a partitioned table

create table tb_test (year string, temperature int, quality int) partitioned by (ds string, ds2 string) row format delimited fields terminated by '\t';

Listing partitions

show partitions tb_test;

Loading data into a specific partition

load data local inpath '/root/hcr/tmp/sample.txt' into table tb_test partition (ds='2013-12-06', ds2='shanghai');

Querying by partition

select * from tb_test where ds='2013-12-06';
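Because tb_test has two partition columns, a query can filter on both of them. The sketch below assumes the same sample file is also loaded into a second partition with ds2='shandong', which is the value that appears in this post's later query output.

```sql
-- Load the same local file into a second partition.
LOAD DATA LOCAL INPATH '/root/hcr/tmp/sample.txt'
INTO TABLE tb_test PARTITION (ds='2013-12-06', ds2='shandong');

-- Filter on both partition columns so that only one partition is read.
SELECT year, temperature
FROM tb_test
WHERE ds='2013-12-06' AND ds2='shanghai';

-- Both partitions should now be listed.
SHOW PARTITIONS tb_test;
```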

Creating a bucketed table

create table tb_test_bucket (year string, temperature int, quality int) clustered by (temperature) into 3 buckets row format delimited fields terminated by '\t';

Loading data into the buckets

insert overwrite table tb_test_bucket select * from records;
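On Hive releases before 2.0, an insert produces one file per bucket only when bucketing enforcement is switched on. The post does not show this setting, so the sketch below assumes such a release and sets the property before repeating the insert.

```sql
-- Assumption: a Hive version older than 2.0, where bucketing is not enforced by default.
SET hive.enforce.bucketing = true;

-- The number of reducers then follows the 3 buckets declared on the table.
INSERT OVERWRITE TABLE tb_test_bucket
SELECT * FROM records;
```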

Viewing the files in HDFS

hive> dfs -ls /user/hive/warehouse/tb_test_bucket;
Found 3 items
-rw-r--r--   2 root supergroup         20 2013-12-09 11:36 /user/hive/warehouse/tb_test_bucket/000000_0
-rw-r--r--   2 root supergroup         20 2013-12-09 11:36 /user/hive/warehouse/tb_test_bucket/000001_0
-rw-r--r--   2 root supergroup         60 2013-12-09 11:36 /user/hive/warehouse/tb_test_bucket/000002_0

Sampling the data

select * from tb_test_bucket tablesample(bucket 1 out of 2 on temperature);

hive> select * from tb_test_bucket tablesample(bucket 1 out of 2 on temperature);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201311101215_51576, Tracking URL = http://hadoop-master.TB.com:50030/jobdetails.jsp?jobid=job_201311101215_51576
Kill Command = /usr/lib/hadoop-0.20/bin/hadoop job  -Dmapred.job.tracker=hadoop-master.TB.com:8021 -kill job_201311101215_51576
Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 0
2013-12-09 11:36:48,415 Stage-1 map = 0%,  reduce = 0%
2013-12-09 11:36:50,449 Stage-1 map = 33%,  reduce = 0%, Cumulative CPU 2.81 sec
2013-12-09 11:36:51,463 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 2.81 sec
2013-12-09 11:36:52,475 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.39 sec
2013-12-09 11:36:53,489 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 4.39 sec
2013-12-09 11:36:54,504 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 4.39 sec
MapReduce Total cumulative CPU time: 4 seconds 390 msec
Ended Job = job_201311101215_51576
MapReduce Jobs Launched:
Job 0: Map: 3   Accumulative CPU: 4.39 sec   HDFS Read: 802 HDFS Write: 20 SUCESS
Total MapReduce CPU Time Spent: 4 seconds 390 msec
OK
1990    44      1
1990    44      1
Time taken: 11.094 seconds

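Importing Data

Insert overwrite table

With overwrite, inserting data forcibly replaces whatever the target already contains.

Using dynamic partitions (taking data from the partitions of one table and writing it into another partitioned target table; the partition values already exist in the table being queried).

Setting up the environment

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;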

Target partitioned table

create table tb_test_pt (year string, temperature int, quality int) partitioned by (ds string) row format delimited fields terminated by '\t';

Inserting with dynamic partitions

insert overwrite table tb_test_pt partition(ds) select year,temperature,quality,ds from tb_test;
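For contrast, the same data can be written with a static partition, where the partition value is spelled out in the statement instead of being taken from the last column of the select. This is a sketch against the same two tables and does not need the dynamic partition settings.

```sql
-- Static partition insert: the target partition is named explicitly,
-- so the partition column is not part of the select list.
INSERT OVERWRITE TABLE tb_test_pt PARTITION (ds='2013-12-06')
SELECT year, temperature, quality
FROM tb_test
WHERE ds='2013-12-06';
```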

Multi-table insert

Hive supports the following syntax:

from sourceTable
insert overwrite table targetTable
select col1, col2

Source table data

hive> select * from tb_test;
OK
1990    44      1       2013-12-06      shandong
1991    45      2       2013-12-06      shandong
1992    41      3       2013-12-06      shandong
1993    43      2       2013-12-06      shandong
1994    41      1       2013-12-06      shandong
1990    44      1       2013-12-06      shanghai
1991    45      2       2013-12-06      shanghai
1992    41      3       2013-12-06      shanghai
1993    43      2       2013-12-06      shanghai
1994    41      1       2013-12-06      shanghai

Creating the three target tables

create table tb_records_by_year (year string, count int) row format delimited fields terminated by '\t';
create table tb_stations_by_year (year string, count int) row format delimited fields terminated by '\t';
create table tb_good_records_by_year (year string, count int) row format delimited fields terminated by '\t';

SQL for the multi-table insert

from tb_test
insert overwrite table tb_stations_by_year
select year, count(distinct temperature)
group by year
insert overwrite table tb_records_by_year
select year, count(1)
group by year
insert overwrite table tb_good_records_by_year
select year, count(1)
where temperature != 9999 and (quality = 0 or quality = 1 or quality = 3)
group by year;

Results

hive> select * from tb_records_by_year;
OK
1990    2
1991    2
1992    2
1993    2
1994    2
Time taken: 0.088 seconds
hive> select * from tb_stations_by_year;
OK
1990    1
1991    1
1992    1
1993    1
1994    1
Time taken: 0.081 seconds
hive> select * from tb_good_records_by_year;
OK
1990    2
1992    2
1994    2
Time taken: 0.085 seconds

Create Table … As Select (CTAS)

Writes the result of a Hive query directly into a new table. (The operation is atomic, so if the query fails, the table is not created.)

Example

create table tb_records_ctas as select year, temperature from tb_test;

Exporting Data

Exporting to a local directory

insert overwrite local directory '/root/hcr/tmp/ex_abc2.txt' select * from m_t2;

Exporting to an HDFS directory

insert overwrite directory '/user/houchangren/tmp/m_t2' select * from m_t2;
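The target of these statements is a directory, and by default the exported fields are separated by the Ctrl-A character (\001). The sketch below writes tab-separated output instead. It assumes Hive 0.11 or later, which accepts a row format clause on a local directory export, and the output path m_t2_tsv is a hypothetical name chosen for this example.

```sql
-- Assumption: Hive 0.11 or later. The output directory name is hypothetical.
INSERT OVERWRITE LOCAL DIRECTORY '/root/hcr/tmp/m_t2_tsv'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
SELECT * FROM m_t2;
```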

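Modifying Tables with ALTER TABLE

Renaming a table with rename to

alter table tb_records_ctas rename to tb_records_2;

Adding a new column

alter table tb_records_2 add columns (new_col int);

Changing a column's definition

ALTER TABLE tb_records_2 CHANGE COLUMN new_col col1 string;

There are many more ALTER TABLE operations for changing table information.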
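A few of those further operations are sketched below against the tables created earlier in this post. The partition value '2013-12-07' and the property text are made up for the example.

```sql
-- Add an empty partition to the partitioned table, then remove it again.
ALTER TABLE tb_test ADD PARTITION (ds='2013-12-07', ds2='shanghai');
ALTER TABLE tb_test DROP PARTITION (ds='2013-12-07', ds2='shanghai');

-- Attach a table property (here a free-text comment).
ALTER TABLE tb_records_2 SET TBLPROPERTIES ('comment' = 'renamed from tb_records_ctas');

-- Inspect the resulting column list.
DESCRIBE tb_records_2;
```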


2014-04-07 00:02