When the reducer of a MapReduce stage is a join, Hive appends a one-byte tag to every map-output key so that the reduce side can tell which source table each row came from. org.apache.hadoop.hive.ql.exec.ExecReducer reads the flag in configure():

public class ExecReducer extends MapReduceBase implements Reducer {
  private boolean isTagged = false;

  @Override
  public void configure(JobConf job) {
    MapredWork gWork = Utilities.getMapRedWork(job);
    isTagged = gWork.getNeedsTagging(); // initialized from the query plan
  }
}

needsTagging is set to true only when the reducer is JoinOperator.class:

// org.apache.hadoop.hive.ql.plan.MapredWork
public void setNeedsTagging(boolean needsTagging) {
  this.needsTagging = needsTagging;
}

// in the plan generator:
if (reducer.getClass() == JoinOperator.class) {
  plan.setNeedsTagging(true);
}

In reduce(), the tag is read from the last byte of the key and stripped before the key is handed on:

// org.apache.hadoop.hive.ql.exec.ExecReducer
public void reduce(Object key, Iterator values, OutputCollector output,
    Reporter reporter) {
  BytesWritable keyWritable = (BytesWritable) key;
  tag.set((byte) 0); // the tag defaults to 0
  if (isTagged) { // the key carries a tag
    // remove the tag
    int size = keyWritable.getSize() - 1; // key length without the tag byte
    tag.set(keyWritable.get()[size]); // the last byte is the tag value
    keyWritable.setSize(size);
  }
}
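To make the round trip concrete, here is a minimal standalone sketch (not Hive code; the class and method names are made up for illustration): the map side appends one tag byte after the serialized key, and the reduce side reads and strips it, exactly as reduce() does above.

import org.apache.hadoop.io.BytesWritable;

public class TagRoundTrip {
  // Map side: append the tag byte after the serialized key bytes.
  static BytesWritable tagKey(byte[] keyBytes, byte tag) {
    byte[] tagged = new byte[keyBytes.length + 1];
    System.arraycopy(keyBytes, 0, tagged, 0, keyBytes.length);
    tagged[keyBytes.length] = tag; // the last byte carries the tag
    return new BytesWritable(tagged);
  }

  // Reduce side: read the tag from the last byte, then shrink the
  // writable so downstream code sees only the original key.
  static byte stripTag(BytesWritable keyWritable) {
    int size = keyWritable.getLength() - 1;
    byte tag = keyWritable.getBytes()[size];
    keyWritable.setSize(size);
    return tag;
  }

  public static void main(String[] args) {
    BytesWritable k = tagKey("key100".getBytes(), (byte) 1);
    System.out.println(stripTag(k)); // prints 1
  }
}

Because the tag is the last byte of the key, rows with the same join key still sort together, with the tag acting as a tiebreaker, so the reducer sees each key's rows grouped and ordered by source table.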
join1.q.out: a two-table join computed in a single MapReduce job. The tags are 0 and 1, so in the reduce phase each row can be traced to the table it came from.
EXPLAIN
FROM src src1 JOIN src src2 ON (src1.key = src2.key)
INSERT OVERWRITE TABLE dest_j1 SELECT src1.key, src2.value

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src1
          TableScan
            alias: src1
            Reduce Output Operator
              key expressions:
                expr: key
                type: string
              sort order: +
              Map-reduce partition columns:
                expr: key
                type: string
              tag: 0
              value expressions:
                expr: key
                type: string
        src2
          TableScan
            alias: src2
            Reduce Output Operator
              key expressions:
                expr: key
                type: string
              sort order: +
              Map-reduce partition columns:
                expr: key
                type: string
              tag: 1
              value expressions:
                expr: value
                type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
            Inner Join 0 to 1
          condition expressions:
            0 {VALUE._col0}
            1 {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col3
          Select Operator
            expressions:
              expr: _col0
              type: string
              expr: _col3
              type: string
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                expr: UDFToInteger(_col0)
                type: int
                expr: _col1
                type: string
              outputColumnNames: _col0, _col1
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest_j1

  Stage: Stage-0
    Move Operator
      tables:
        replace: true
        table:
          input format: org.apache.hadoop.mapred.TextInputFormat
          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          name: dest_j1
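In the reduce phase, the join operator can then bucket incoming rows by tag before producing output. The following is a hypothetical sketch (plain Java, not Hive's actual CommonJoinOperator) of what dispatch by tag looks like for the two-table plan above:

import java.util.ArrayList;
import java.util.List;

public class TwoWayTagDispatch {
  public static void main(String[] args) {
    // Rows arriving for one reduce key, each as (tag, value);
    // with the plan above, tag 0 is src1 and tag 1 is src2.
    Object[][] rows = { {0, "key=100"}, {1, "value=val_100"} };

    List<String> src1Rows = new ArrayList<>();
    List<String> src2Rows = new ArrayList<>();
    for (Object[] r : rows) {
      if ((int) r[0] == 0) src1Rows.add((String) r[1]);
      else src2Rows.add((String) r[1]);
    }
    // Inner join: emit every pairing of the two buffered sides.
    for (String a : src1Rows)
      for (String b : src2Rows)
        System.out.println(a + " | " + b);
  }
}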
join3.q.out: a three-table join computed in a single MapReduce job. The tags are 0, 1, and 2. Note that src2 emits no value expressions and its condition-expression slot is empty: only its key takes part in the join, and no src2 column appears in the SELECT list.
EXPLAIN
FROM src src1 JOIN src src2 ON (src1.key = src2.key) JOIN src src3 ON (src1.key = src3.key)
INSERT OVERWRITE TABLE dest1 SELECT src1.key, src3.value

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Map Reduce
      Alias -> Map Operator Tree:
        src1
          TableScan
            alias: src1
            Reduce Output Operator
              key expressions:
                expr: key
                type: string
              sort order: +
              Map-reduce partition columns:
                expr: key
                type: string
              tag: 0
              value expressions:
                expr: key
                type: string
        src2
          TableScan
            alias: src2
            Reduce Output Operator
              key expressions:
                expr: key
                type: string
              sort order: +
              Map-reduce partition columns:
                expr: key
                type: string
              tag: 1
        src3
          TableScan
            alias: src3
            Reduce Output Operator
              key expressions:
                expr: key
                type: string
              sort order: +
              Map-reduce partition columns:
                expr: key
                type: string
              tag: 2
              value expressions:
                expr: value
                type: string
      Reduce Operator Tree:
        Join Operator
          condition map:
            Inner Join 0 to 1
            Inner Join 0 to 2
          condition expressions:
            0 {VALUE._col0}
            1
            2 {VALUE._col1}
          handleSkewJoin: false
          outputColumnNames: _col0, _col5
          Select Operator
            expressions:
              expr: _col0
              type: string
              expr: _col5
              type: string
            outputColumnNames: _col0, _col1
            Select Operator
              expressions:
                expr: UDFToInteger(_col0)
                type: int
                expr: _col1
                type: string
              outputColumnNames: _col0, _col1
              File Output Operator
                compressed: false
                GlobalTableId: 1
                table:
                  input format: org.apache.hadoop.mapred.TextInputFormat
                  output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
                  serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
                  name: dest1

  Stage: Stage-0
    Move Operator
      tables:
        replace: true
        table:
          input format: org.apache.hadoop.mapred.TextInputFormat
          output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
          serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
          name: dest1
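The same idea generalizes to any number of tagged inputs. Here is a hypothetical sketch for the three-table plan (tags 0 to 2), buffering rows per tag and emitting the cross product for one key, roughly what the join operator's per-input storage does:

import java.util.ArrayList;
import java.util.List;

public class MultiWayTagDispatch {
  public static void main(String[] args) {
    int numInputs = 3;
    // Rows for one reduce key as (tag, value); the empty string for
    // tag 1 mirrors src2 above, which emits no value columns.
    Object[][] rows = { {0, "src1.key=100"}, {1, ""}, {2, "src3.value=val_100"} };

    // One buffer per tagged input.
    List<List<String>> storage = new ArrayList<>();
    for (int i = 0; i < numInputs; i++) storage.add(new ArrayList<>());
    for (Object[] r : rows) storage.get((int) r[0]).add((String) r[1]);

    // Inner join: cross product across all inputs; an empty buffer
    // (no rows from that table for this key) yields no output at all.
    List<String> acc = new ArrayList<>();
    acc.add("");
    for (List<String> side : storage) {
      List<String> next = new ArrayList<>();
      for (String prefix : acc)
        for (String v : side)
          next.add(prefix.isEmpty() ? v : prefix + " | " + v);
      acc = next;
    }
    for (String row : acc) System.out.println(row);
  }
}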