1. Hive CLI (command line)
Command-line options:
-d k=v (define a variable), -e "<hql>" (run the quoted HQL), -f filename (run an HQL script file), -h host, -p port, -v (echo the executed HQL to the console)
Interactive mode:
set; : show all Hive variables, e.g. set mapred.reduce.tasks=32;
set k=v : no error is raised if the key k does not exist
! shell command : run a shell command from the Hive shell, e.g. ! echo aa
dfs command : run hadoop fs commands from the Hive shell, identical to the hadoop fs command
set : print the variables that have been set in Hive
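A quick sketch of the CLI options in use; the table name pv and the script path are placeholders:
# define a variable and run an inline query, echoing the HQL (-v)
hive -d day=2008-06-08 -v -e "SELECT * FROM pv WHERE dt = '${hivevar:day}' LIMIT 10;"
# run a script file instead
hive -f /tmp/report.hql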
Data types:
Integers
TINYINT - 1 byte integer
SMALLINT - 2 byte integer
INT - 4 byte integer
BIGINT - 8 byte integer
Boolean type
BOOLEAN - TRUE/FALSE
Floating point numbers
FLOAT - single precision
DOUBLE - Double precision
String type
STRING - sequence of characters in a specified character set
Complex Types
Structs : struct_type : STRUCT < col_name : data_type [COMMENT col_comment], ... >, e.g. STRUCT {a INT; b INT}; fields are accessed with dot notation, as in c.a
Maps : looked up by key, e.g. M['group']
Arrays : all elements must be of the same type; for A = ['a', 'b', 'c'], A[1] returns 'b'
Union : union_type : UNIONTYPE < data_type, data_type, ... >
SELECT create_union(0, key), create_union(if(key<100, 0, 1), 2.0, value), create_union(1, "a", struct(2, "b")) FROM src LIMIT 2;
TIMESTAMP (Note: only available starting with Hive 0.8.0)
BINARY (Note: only available starting with Hive 0.8.0)
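A small sketch pulling the complex types together; the table and column names are made up for illustration:
CREATE TABLE complex_demo (
  id    INT,
  tags  ARRAY<STRING>,
  props MAP<STRING, STRING>,
  addr  STRUCT<city:STRING, zip:INT>
);
SELECT tags[0],         -- first array element
       props['color'],  -- map lookup by key
       addr.city        -- struct field access
FROM complex_demo;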
Operators (built-in):
Relational Operators
A LIKE B : SQL pattern matching with _ and %, same as in SQL
A RLIKE B : regular-expression LIKE, e.g. 'foo' rlike 'f.*' returns true; equivalent to A REGEXP B
Arithmetic Operators
A % B, A & B (bitwise AND), A | B (bitwise OR), A ^ B (bitwise XOR), ~A (bitwise NOT)
Logical Operators
A AND B, A OR B, !A, NOT A
Operators on Complex Types
A[n] A is an Array and n is an int
M[key] M is a Map<K, V> and key has type K
S.x S is a struct
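A few of these operators in one query; this assumes the page_view table created in the "Create table" section below:
SELECT 'foo' LIKE 'f__',   -- true: _ matches exactly one character
       'foo' RLIKE 'f.*',  -- true: Java regular expression
       7 % 3,              -- 1
       6 & 3,              -- 2 (bitwise AND)
       friends[0],         -- first element of an array column
       properties['k']     -- map lookup by key
FROM page_view LIMIT 1;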
Functions:
round(double a) : BIGINT
floor(double a) : BIGINT
ceil(double a) : BIGINT
rand(), rand(int seed) : double
concat(string A, string B, ...)
substr(string A, int start, int length)
upper(string A)
lower(string A)
trim(string A)
ltrim(string A)
rtrim(string A)
regexp_replace(string A, string B, string C)
size(Map<K.V>) : returns the number of elements in the map type
size(Array<T>)
cast(<expr> as <type>) : same as in MySQL
from_unixtime(int unixtime)
to_date(string timestamp) : to_date("1970-01-01 00:00:00") = "1970-01-01"
year(string date) : year("1970-01-01") = 1970
month(string date) : month("1970-11-01") = 11
day(string date) : day("1970-11-01") = 1, equivalent to dayofmonth()
hour()/minute()/second()
weekofyear(string date)
get_json_object(string json_string, string path)
Aggregate functions:
count(*), count(expr), count(DISTINCT expr[, expr_.])
sum(col), sum(DISTINCT col)
avg(col), avg(DISTINCT col)
min(col)
max(col)
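A sketch combining several of these; the logs table (ts as a Unix epoch, payload as JSON text) is hypothetical:
SELECT to_date(from_unixtime(ts))                        AS day,
       count(DISTINCT get_json_object(payload, '$.uid')) AS uniq_users,
       round(avg(length(payload)), 2)                    AS avg_len
FROM logs
GROUP BY to_date(from_unixtime(ts));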
Create table:
CREATE [EXTERNAL] TABLE [IF NOT EXISTS] page_view(viewTime INT COMMENT '', userid BIGINT,
                page_url STRING, referrer_url STRING,
                friends ARRAY<BIGINT>, properties MAP<STRING, STRING>,
                ip STRING COMMENT 'IP Address of the User')
COMMENT 'This is the page view table'
PARTITIONED BY(dt STRING COMMENT '', country STRING)
CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS : the table is hashed into 32 buckets by userid, and within each bucket rows are sorted by viewTime
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '1'
COLLECTION ITEMS TERMINATED BY '2'
MAP KEYS TERMINATED BY '3'
STORED AS SEQUENCEFILE/TEXTFILE/RCFILE/INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname
[LOCATION hdfs_path]
[AS select_statement];
Notes:
1.The EXTERNAL keyword lets you create a table and provide a LOCATION so that Hive does not use a default location for this table. When dropping an EXTERNAL table, data in the table is NOT deleted from the file system.
2.tables or partitions can be bucketed using CLUSTERED BY columns, and data can be sorted within that bucket via SORT BY columns. This can improve performance on certain kinds of queries.
3.Table names and column names are case insensitive but SerDe and property names are case sensitive.
4.The CLUSTERED BY and SORTED BY creation commands do not affect how data is inserted into a table – only how it is read. This means that users must be careful to insert data correctly by specifying the number of reducers to be equal to the number of buckets, and using CLUSTER BY and SORT BY commands in their query.
Inserting data into a bucketed table:
set hive.enforce.bucketing = true;
FROM user_id
INSERT OVERWRITE TABLE user_info_bucketed
PARTITION (ds='2009-02-25')
SELECT userid, firstname, lastname WHERE ds='2009-02-25';
Altering tables: Alter Table/Partition Statements
Altering partitions:
ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION partition_spec [LOCATION 'location1'] partition_spec [LOCATION 'location2'] ... (in older Hive versions this syntax cannot add more than one partition per statement)
partition_spec: (partition_col = partition_col_value, partition_col = partition_col_value, ...)
ALTER TABLE table_name DROP [IF EXISTS] partition_spec, partition_spec, ...
ALTER TABLE table_name [PARTITION partitionSpec] SET LOCATION "new location"
Alter Table/Partition Protections:
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE NO_DROP;
ALTER TABLE table_name [PARTITION partition_spec] ENABLE|DISABLE OFFLINE;
Altering table columns:
ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST|AFTER column_name]
ALTER TABLE table_name ADD|REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)
Alter Table Properties (attach your own metadata to a table):
ALTER TABLE table_name SET TBLPROPERTIES table_properties
Alter Table (Un)Archive:
ALTER TABLE table_name ARCHIVE PARTITION partition_spec;
ALTER TABLE table_name UNARCHIVE PARTITION partition_spec;
Renaming a partition:
ALTER TABLE table_name PARTITION partition_spec RENAME TO PARTITION partition_spec;
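A few concrete ALTER sketches against the page_view table from above; the location and property values are made up:
ALTER TABLE page_view ADD IF NOT EXISTS PARTITION (dt='2008-06-09', country='US') LOCATION '/user/data/pv/2008-06-09/US';
ALTER TABLE page_view DROP IF EXISTS PARTITION (dt='2008-06-08', country='US');
ALTER TABLE page_view CHANGE COLUMN ip ip_address STRING COMMENT 'renamed ip column' AFTER referrer_url;
ALTER TABLE page_view SET TBLPROPERTIES ('owner' = 'data-team');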
Views:
CREATE VIEW [IF NOT EXISTS] view_name [(column_name [COMMENT column_comment], ...) ]
[COMMENT view_comment]
[TBLPROPERTIES (property_name = property_value, ...)]
AS SELECT ...
DROP VIEW [IF EXISTS] view_name
ALTER VIEW view_name SET TBLPROPERTIES table_properties
Notes:
1. Views are purely logical; materialized views are not currently supported.
2. If a view's underlying table is dropped, the view's schema does not change, and using the view then raises an error.
3. Views are read-only.
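A minimal view sketch over the page_view table from above:
CREATE VIEW IF NOT EXISTS us_page_views
COMMENT 'page views from the US'
AS SELECT userid, page_url FROM page_view WHERE country = 'US';
SELECT count(*) FROM us_page_views;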
Functions:
ADD JAR <path> : add a jar to Hive's classpath (ADD FILE adds a file to the distributed cache)
CREATE TEMPORARY FUNCTION function_name AS class_name (the class must be on the classpath)
DROP TEMPORARY FUNCTION [IF EXISTS] function_name
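A sketch of registering and dropping a UDF; the jar path and class name are hypothetical:
ADD JAR /tmp/my-udfs.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'com.example.hive.udf.ToUpper';
SELECT to_upper(page_url) FROM page_view LIMIT 10;
DROP TEMPORARY FUNCTION IF EXISTS to_upper;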
Indexes:
CREATE INDEX index_name
ON TABLE base_table_name (col_name, ...)
AS index_type
[WITH DEFERRED REBUILD]
[IDXPROPERTIES (property_name=property_value, ...)]
[IN TABLE index_table_name]
[
[ ROW FORMAT ...] STORED AS ...
| STORED BY ...
]
[LOCATION hdfs_path]
[TBLPROPERTIES (...)]
[COMMENT "index comment"]
DROP INDEX [IF EXISTS] index_name ON table_name
Details: https://cwiki.apache.org/confluence/display/Hive/IndexDev#CREATE_INDEX
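A concrete index sketch on page_view using the built-in compact index handler; the index name is made up:
CREATE INDEX pv_userid_idx
ON TABLE page_view (userid)
AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
WITH DEFERRED REBUILD;
ALTER INDEX pv_userid_idx ON page_view REBUILD;  -- build the deferred index
DROP INDEX IF EXISTS pv_userid_idx ON page_view;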
Show/Describe Statements
show databases/tables [like 'RegExp'] : the pattern is treated as a regular expression
show partitions tableName [PARTITION(ds='2010-03-03')]
show tblproperties tableName (as of Hive 0.10.0)
SHOW FUNCTIONS "a.*"
SHOW [FORMATTED] (INDEX|INDEXES) ON table_with_index [(FROM|IN) db_name]
SHOW COLUMNS (FROM|IN) table_name [(FROM|IN) db_name] : lists all columns, including partition columns (as of Hive 0.10)
desc database xl_netdisk_ods;
Loading data:
1. Loading data into a table:
Standard syntax:
LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)] : without OVERWRITE the existing data is kept, as long as file names do not clash; with it, the existing data is replaced
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
Hive extension (dynamic partition inserts):
INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
Notes:
1. INSERT OVERWRITE will overwrite any existing data in the table or partition unless IF NOT EXISTS is provided for a partition (as of Hive 0.9.0).
INSERT INTO will append to the table or partition, keeping the existing data intact. (Note: INSERT INTO syntax is only available starting in version 0.8.)
2. Loading data into a directory:
Standard syntax:
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 SELECT ... FROM ...
Hive extension (multiple inserts):
FROM from_statement
INSERT OVERWRITE [LOCAL] DIRECTORY directory1 select_statement1
[INSERT OVERWRITE [LOCAL] DIRECTORY directory2 select_statement2] ...
3. Examples:
LOAD DATA LOCAL INPATH '/tmp/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
LOAD DATA INPATH '/user/data/pv_2008-06-08_us.txt' INTO TABLE page_view PARTITION(date='2008-06-08', country='US')
INSERT OVERWRITE TABLE user_active SELECT user.* FROM user WHERE user.active = 1;
In order to check the existence of a key in another table, the user can use LEFT SEMI JOIN, as illustrated by the following example.
INSERT OVERWRITE TABLE pv_users
SELECT u.*
FROM user u LEFT SEMI JOIN page_view pv ON (pv.userid = u.id)
WHERE pv.date = '2008-03-03';
However, the following query is not allowed:
INSERT OVERWRITE TABLE pv_gender_agg
SELECT pv_users.gender, count(DISTINCT pv_users.userid), count(DISTINCT pv_users.ip) (only one DISTINCT expression is allowed per SELECT)
FROM pv_users
GROUP BY pv_users.gender;
Replacing IN/EXISTS subqueries in Hive:
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
can be rewritten as:
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)
LEFT SEMI JOIN: the Left Semi Join operator returns each row from the first (top) input when there is a matching row in the second (bottom) input. If no join predicate exists in the Argument column, every row is a matching row.
Multi Table/File Inserts
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/data/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='US')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'US'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='CA')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'CA'
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country='UK')
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip WHERE pvs.country = 'UK';
Dynamic-partition Insert (automatically creates partitions, removing the need to know partition values up front; inserts into many partitions without one job per partition; available in versions after 0.6)
FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt='2008-06-08', country) : note that country is given no value here; partitions are created automatically and the values inserted
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip, pvs.country
Only trailing partition columns can be dynamic (the dynamic column must be the last partition column, or the last subpartition), so (dt, country='US') is not allowed. Dynamic partition columns must be supplied in the SELECT; static ones must not. If a target partition already exists, its data is overwritten; partitions not covered by the insert are left untouched.
If a partition column's value is null or '', the row is inserted into the default partition __HIVE_DEFAULT_PARTITION__ and is considered bad data.
Note: each node (mapper or reducer) may create at most 100 partitions, one DML statement may create at most 1000 partitions in total and at most 100,000 files; all of these limits are configurable. By default Hive does not allow all partitions
to be dynamic; change hive.exec.dynamic.partition.mode=nonstrict to lift that restriction.
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> FROM page_view_stg pvs
INSERT OVERWRITE TABLE page_view PARTITION(dt, country)
SELECT pvs.viewTime, pvs.userid, pvs.page_url, pvs.referrer_url, null, null, pvs.ip,
from_unixtime(pvs.viewTime, 'yyyy-MM-dd') ds, pvs.country
DISTRIBUTE BY ds, country; : if a single mapper or reducer would otherwise produce more than 100 partitions, DISTRIBUTE BY spreads the partitions evenly across reducers
This query generates a MapReduce job rather than a map-only job: the SELECT clause becomes the plan for the mappers, whose output is distributed to the reducers based on the value of the (ds, country) pairs; the INSERT clause becomes the plan in the reducers, which write to the dynamic partitions.
Queries:
Notes:
1. sort by vs order by: SORT BY only guarantees ordering within a single reducer; ORDER BY guarantees total ordering, which is achieved at the cost of performance by using a single reducer.
2. Sorting follows the column's type; if numbers are stored as strings, cast them to a numeric type before sorting.
3. To get all rows with the same key into the same reducer, use CLUSTER BY ..., or DISTRIBUTE BY ... SORT BY ...; the former is shorthand for the latter, and the DISTRIBUTE BY columns are usually a prefix of the SORT BY columns (see the sketch after this list).
4. Partitions behave as in MySQL and are selected automatically; partition pruning applies to conditions in WHERE and in the ON clause following a join.
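A sketch of the CLUSTER BY shorthand against the page_view table from above:
SELECT userid, viewTime FROM page_view CLUSTER BY userid;
SELECT userid, viewTime FROM page_view DISTRIBUTE BY userid SORT BY userid;           -- equivalent to the line above
SELECT userid, viewTime FROM page_view DISTRIBUTE BY userid SORT BY userid, viewTime; -- same routing, finer ordering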
Joins:
1. Multiple joins can become one map/reduce job. Hive converts joins over multiple tables into a single map/reduce job if every table uses the same column in the join clauses, e.g. SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
2. In each map/reduce stage of a join, the earlier tables are buffered and the rows of the last table are streamed through. Across multiple jobs, the previous results are buffered while the last table streams into the reducers. To change which table is streamed, give a hint,
e.g. SELECT /*+ STREAMTABLE(a) */ a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1): by default b would be streamed; the hint streams a instead.
3. Joins occur BEFORE WHERE CLAUSES. Be careful with left, right and full outer joins: a condition placed in ON gives a different result from the same condition in WHERE. Once you know the order of JOIN and WHERE, and what an outer join means, the results are easy to understand:
the ON conditions are applied first, then the join, then WHERE. For an outer join, the ON conditions mainly filter the non-preserved table and have no effect on the preserved one, whose rows are always output; filters on the preserved table therefore belong in WHERE. If
the preserved table carries too many rows that will never match, pre-filter it in a subquery first (see the sketch after this list).
4. A join can be done with mappers only, with restrictions: SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key. For each mapper over a, all of b is read into memory, so the join completes in the map phase. The restriction is that a FULL/RIGHT OUTER JOIN b cannot be performed.
If both joined tables are large but are bucketed on the join columns, and the bucket count of one table is a multiple of the other's, a map join is still possible.
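A sketch of point 3; this assumes tables a and b where b carries a ds partition column:
SELECT a.key, b.val
FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds = '2009-07-07');  -- every row of a survives
SELECT a.key, b.val
FROM a LEFT OUTER JOIN b ON (a.key = b.key)
WHERE b.ds = '2009-07-07';  -- filters after the join, discarding rows of a that found no match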
Lateral view: used with a UDTF to expand each input row into multiple output rows
lateralView: LATERAL VIEW udtf(expression) tableAlias AS columnAlias (',' columnAlias)*
fromClause: FROM baseTable (lateralView)*
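For example, exploding the friends array of page_view into one row per friend:
SELECT pv.userid, f.friend_id
FROM page_view pv LATERAL VIEW explode(pv.friends) f AS friend_id;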
Sampling data: TABLESAMPLE
table_sample: TABLESAMPLE (BUCKET x OUT OF y [ON colname]) : splits the data into y buckets and returns bucket x; if the table was not bucketed at creation time, the whole table is scanned and bucketed on the fly; colname may be rand()
block_sample: TABLESAMPLE (n PERCENT) : samples n percent of the data volume, not n percent of the rows
SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
SELECT * FROM source TABLESAMPLE(0.1 PERCENT) s;
Virtual columns:
INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE
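For example, to see which file and byte offset each row came from:
SELECT INPUT__FILE__NAME, BLOCK__OFFSET__INSIDE__FILE, userid
FROM page_view LIMIT 5;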
Inspecting how a statement executes:
EXPLAIN [EXTENDED] query
Generating sampled data:
INSERT OVERWRITE TABLE pv_gender_sum_sample
SELECT pv_gender_sum.*
FROM pv_gender_sum TABLESAMPLE(BUCKET 3 OUT OF 32 [ON userid]);
UNION ALL: omitted
Array Operations
Array columns in tables can only be created programmatically currently.
SELECT pv.friends[2] FROM page_views pv; SELECT pv.userid, size(pv.friends) FROM page_view pv;
Map(Associative Arrays) Operations
Maps provide collections similar to associative arrays. Such structures can only be created programmatically currently.
INSERT OVERWRITE TABLE page_views_map
SELECT pv.userid, pv.properties['page type'] FROM page_views pv; SELECT size(pv.properties) FROM page_view pv;
The difference between distribute by and cluster by:
Distribute By and Sort By: Instead of specifying "cluster by", the user can specify "distribute by" and "sort by", so the partition columns and sort columns can be different
Altering Tables
ALTER TABLE tab1 ADD COLUMNS (c1 INT COMMENT 'a new int column', c2 STRING DEFAULT 'def val');
Dropping Tables and Partitions
ALTER TABLE pv_users DROP PARTITION (ds='2008-08-08')
Hive built-in functions:
Listing functions: show functions; desc function [extended] fname;
Operators:
A [NOT] BETWEEN B AND C (as of version 0.9.0)
Functions
1. Mathematical functions:
round(double a[, int d]) : int/double, pow(a,b), sqrt(), bin(a) : binary representation, hex() : hexadecimal representation, conv(BIGINT num, int from_base, int to_base)
abs(), pmod(int a, int b) : returns a positive remainder where a % b would return a negative one, degrees(double a) : radians to degrees, radians(double a), e(), pi(), sign() : sign function
std(), stddev()
2.Collection Functions
size(Map<K.V>), size(Array<T>), map_keys(Map<K.V>), map_values(Map<K.V>), array_contains(Array<T>, value), sort_array(Array<T>) : sorts in natural order (as of version 0.9.0)
array(n0, n1...) - Creates an array with the given elements
3.Type Conversion Functions
cast(expr as <type>)
4.Date Functions
from_unixtime(bigint unixtime[, string format])
unix_timestamp(string date, string pattern)
weekofyear(string date),
datediff(string enddate, string startdate),
date_add(string startdate, int days),
date_sub(string startdate, int days)
from_utc_timestamp(timestamp, string timezone)
to_utc_timestamp(timestamp, string timezone)
date_format
5.Conditional Functions
if(boolean testCondition, T valueTrue, T valueFalseOrNull)
COALESCE(a1, a2, ...) - Returns the first non-null argument, or null if all arguments are null; usable in place of ifnull
CASE a WHEN b THEN c [WHEN d THEN e]* [ELSE f] END : when a = b, returns c; when a = d, returns e; else returns f
CASE WHEN a THEN b [WHEN c THEN d]* [ELSE e] END : when a = true, returns b; when c = true, returns d; else returns e
6.String Functions
concat_ws(string SEP, string A, string B...) : concatenation with a custom separator
find_in_set(string str, string strList) find_in_set('ab', 'abc,b,ab,c,def') returns 3
format_number(number x, int d) Formats the number X to a format like '#,###,###.##', rounded to D decimal places(as of Hive 0.10.0)
get_json_object(string json_string, string path) : the root object is referred to as $
json_tuple(jsonStr, p1, p2, ..., pn) - like get_json_object, but it takes multiple names and return a tuple. All the input parameters and output column types are string.
in_file(string str, string filename)
instr(string str, string substr), locate(string substr, string str[, int pos])
lpad(string str, int len, string pad) : lpad('a',3,'b') returns 'bba'; also rpad(), ltrim(), rtrim(), trim()
ngrams(array<array<string>>, int N, int K, int pf) : estimates the top-K most frequent n-grams of length N from tokenized sentences; pf is the precision factor
parse_url(string urlString, string partToExtract [, string keyToExtract]) Returns the specified part from the URL. Valid values for partToExtract include HOST, PATH, QUERY, REF, PROTOCOL, AUTHORITY, FILE, and USERINFO.
parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'HOST') : 'facebook.com',parse_url('http://facebook.com/path1/p.php?k1=v1&k2=v2#Ref1', 'QUERY', 'k1') returns 'v1'.
parse_url_tuple(url, partname1, partname2, ..., partnameN) - extracts N (N>=1) parts from a URL.
SELECT b.* FROM src LATERAL VIEW parse_url_tuple(fullurl, 'HOST', 'PATH', 'QUERY', 'QUERY:id') b as host, path, query, query_id LIMIT 1;
SELECT parse_url_tuple(a.fullurl, 'HOST', 'PATH', 'QUERY', 'REF', 'PROTOCOL', 'FILE', 'AUTHORITY', 'USERINFO', 'QUERY:k1') as (ho, pa, qu, re, pr, fi, au, us, qk1) from src a;
printf(String format, Obj... args) (as of Hive 0.9.0)
regexp_extract('foothebar', 'foo(.*?)(bar)', n) : extracts the n-th capture group, e.g. regexp_extract('foothebar', 'foo(.*?)(bar)', 2) returns 'bar'
regexp_replace(string INITIAL_STRING, string PATTERN, string REPLACEMENT)
repeat(string str, int n) Repeat str n times
reverse(string A)
sentences(string str, string lang, string locale) :array<array<string>>
space(int n)
split(string str, string pat):array Split str around pat (pat is a regular expression)
str_to_map(text[, delimiter1, delimiter2]) Splits text into key-value pairs using two delimiters. Delimiter1 separates text into K-V pairs, and Delimiter2 splits each K-V pair. Default delimiters are ',' for delimiter1 and '=' for delimiter2.
substr(string|binary A, int start, int len) or substring(string|binary A, int start, int len)
translate(string input, string from, string to)
upper(),lower()
group_concat() : not a Hive built-in; concat_ws() over collect_set() is the usual substitute
map_keys()
map_values()
Misc. Functions
java_method(class, method[, arg1[, arg2..]]) : return type varies; synonym for reflect (as of Hive 0.9.0)
reflect(class, method[, arg1[, arg2..]]) : return type varies; use this UDF to call Java methods by matching the argument signature (uses reflection). (as of Hive 0.7.0)
XPath Functions (extract data from XML):
xpath, xpath_short, xpath_int, xpath_long, xpath_float, xpath_double, xpath_number, xpath_string
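A sketch of the xpath UDFs on inline XML literals, using the src table referenced elsewhere in these notes:
SELECT xpath('<a><b>b1</b><b>b2</b></a>', 'a/b/text()'),  -- ["b1","b2"]
       xpath_string('<a><b>bb</b></a>', 'a/b'),           -- 'bb'
       xpath_int('<a><b>4</b></a>', 'a/b')                -- 4
FROM src LIMIT 1;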
Built-in Aggregate Functions (UDAF)
max(), min(), count(), avg(), sum()
double variance(col), var_pop(col) : Returns the variance of a numeric column in the group
double var_samp(col) : Returns the unbiased sample variance of a numeric column in the group
double stddev_pop(col) : Returns the standard deviation of a numeric column in the group
double stddev_samp(col) : Returns the unbiased sample standard deviation of a numeric column in the group
double covar_pop(col1, col2) : Returns the population covariance of a pair of numeric columns in the group
double covar_samp(col1, col2) : Returns the sample covariance of a pair of numeric columns in the group
Built-in Table-Generating Functions (UDTF)
explode(array<TYPE> a) : for each element in a, explode() generates a row containing that element
No other expressions are allowed in the SELECT:
SELECT pageid, explode(adid_list) AS myCol ... is not supported
UDTFs cannot be nested:
SELECT explode(explode(adid_list)) AS myCol ... is not supported
GROUP BY / CLUSTER BY / DISTRIBUTE BY / SORT BY are not supported:
SELECT explode(adid_list) AS myCol ... GROUP BY myCol is not supported
stack(INT n, v_1, v_2, ..., v_k) : breaks v_1, ..., v_k into n rows; each row has k/n columns; n must be a constant
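A sketch of stack(), plus the LATERAL VIEW workaround for the explode() restrictions above; the pages table and adid_list column are hypothetical:
SELECT stack(2, 'A', 10, 'B', 20) AS (col0, col1) FROM src LIMIT 2;  -- two rows: ('A', 10) and ('B', 20)
SELECT pageid, myCol
FROM pages LATERAL VIEW explode(adid_list) t AS myCol
GROUP BY pageid, myCol;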