After the MapR framework is installed, install and configure HBase and Hive.
The MapR framework is installed under /opt/mapr,
HBase is installed under /opt/mapr/hbase/hbase-0.90.4,
and Hive is installed under /opt/mapr/hive/hive-0.7.1.
The procedure for integrating Hive with HBase is as follows:
1. Copy /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar and /opt/mapr/hbase/hbase-0.90.4/lib/zookeeper-3.3.2.jar into the /opt/mapr/hive/hive-0.7.1/lib directory.
Note: if other versions of these two jars already exist under hive/lib (for example zookeeper-3.3.1.jar), delete them and use the versions that ship with HBase.
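A minimal shell sketch of this step (the rm line applies only if an older zookeeper jar is actually present):
cp /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar /opt/mapr/hive/hive-0.7.1/lib/
cp /opt/mapr/hbase/hbase-0.90.4/lib/zookeeper-3.3.2.jar /opt/mapr/hive/hive-0.7.1/lib/
rm /opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.1.jar    # remove the conflicting older version, if present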
2. Edit the hive-site.xml file under hive/conf and add the following at the bottom:
<property>
  <name>hive.querylog.location</name>
  <value>/opt/mapr/hive/hive-0.7.1/logs</value>
</property>
<property>
  <name>hive.aux.jars.path</name>
  <value>file:///opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,file:///opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,file:///opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar</value>
</property>
Note: if hive-site.xml does not exist, create it yourself, or rename the hive-default.xml.template file and use that.
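For example, a sketch of creating the file from the shipped template:
cd /opt/mapr/hive/hive-0.7.1/conf
cp hive-default.xml.template hive-site.xml    # then append the properties above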
3. Copy hbase-0.90.4.jar into hadoop/lib on every Hadoop node (including the master).
4. Copy the hbase-site.xml file under hbase/conf into hadoop/conf on every Hadoop node (including the master).
Note: if steps 3 and 4 are skipped, running Hive will very likely produce the following error:
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately.
This could be a sign that the server has too many connections (30 is the default). Consider inspecting your ZK server logs for that error and
then make sure you are reusing HBaseConfiguration as often as you can. See HTable's javadoc for more information.
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.
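A hedged shell sketch of steps 3 and 4; the node names (node1, node2, node3) and the Hadoop path (/opt/mapr/hadoop/hadoop-0.20.2, taken from the job output later in this article) are assumptions to adapt to your own cluster:
for node in node1 node2 node3; do
  scp /opt/mapr/hbase/hbase-0.90.4/hbase-0.90.4.jar $node:/opt/mapr/hadoop/hadoop-0.20.2/lib/
  scp /opt/mapr/hbase/hbase-0.90.4/conf/hbase-site.xml $node:/opt/mapr/hadoop/hadoop-0.20.2/conf/
done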
5. Start Hive
Single-node startup:
bin/hive -hiveconf hbase.master=master:60000
Cluster startup:
bin/hive -hiveconf hbase.zookeeper.quorum=node1,node2,node3 (list all the ZooKeeper nodes)
If hive.aux.jars.path is not configured in hive-site.xml, Hive can instead be started as follows:
hive --auxpath /opt/mapr/hive/hive-0.7.1/lib/hive-hbase-handler-0.7.1.jar,/opt/mapr/hive/hive-0.7.1/lib/hbase-0.90.4.jar,/opt/mapr/hive/hive-0.7.1/lib/zookeeper-3.3.2.jar -hiveconf hbase.master=localhost:60000
Testing shows that after adding the following to Hive's hive-site.xml configuration file:
<property>
  <name>hive.zookeeper.quorum</name>
  <value>node1,node2,node3</value>
  <description>The list of zookeeper servers to talk to. This is only needed for read/write locks.</description>
</property>
Hive can work with HBase without any extra startup parameters.
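That is, Hive can then simply be started with:
bin/hive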
6. Test after startup
(1) Create a table that HBase recognizes
CREATE TABLE hbase_table_1(key int, value string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
hbase.table.name defines the name of the table in HBase.
hbase.columns.mapping defines the mapping to HBase column families: for multiple columns use e.g. data:1,data:2, and for multiple column families use e.g. data1:1,data2:1. The :key entry is fixed and maps to the HBase row key; since rows inserted from the pokes table below use the foo column as the key, foo must hold unique values.
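For illustration, a hypothetical DDL that maps several columns into one column family (the table and column names here are made up):
CREATE TABLE hbase_table_multi(key int, val1 string, val2 string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val1,cf1:val2") TBLPROPERTIES ("hbase.table.name" = "xyz_multi");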
Create a partitioned table:
CREATE TABLE hbase_table_1(key int, value string) partitioned by (day string) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,cf1:val") TBLPROPERTIES ("hbase.table.name" = "xyz");
(2) Load data with SQL
Create a plain Hive table:
create table pokes(foo int, bar string) row format delimited fields terminated by ',';
Bulk-load the data:
load data local inpath '/home/1.txt' overwrite into table pokes;
The contents of the 1.txt file are:
1,hello
2,pear
3,world
Insert into hbase_table_1 with SQL:
insert overwrite table hbase_table_1 select * from pokes;
Insert into the partitioned table:
insert overwrite table hbase_table_1 partition (day='2012-01-01') select * from pokes;
(3) View the data
hive> select * from hbase_table_1;
OK
1 hello
2 pear
3 world
(Note: partitioned tables integrated with HBase have a known problem: select * from the table returns no data, while select key, value from the table does.)
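For example, a sketch of that column-list workaround against the partitioned table created earlier:
select key, value from hbase_table_1 where day='2012-01-01';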
(4) Log in to HBase to view the data
hbase shell
hbase(main):002:0> describe 'xyz'
DESCRIPTION                                                          ENABLED
 {NAME => 'xyz', FAMILIES => [{NAME => 'cf1', BLOOMFILTER => 'NONE', true
 REPLICATION_SCOPE => '0', COMPRESSION => 'NONE', VERSIONS => '3',
 TTL => '2147483647', BLOCKSIZE => '65536', IN_MEMORY => 'false',
 BLOCKCACHE => 'true'}]}
1 row(s) in 0.0830 seconds
hbase(main):003:0> scan 'xyz'
ROW COLUMN+CELL
1 column=cf1:val, timestamp=1331002501432, value=hello
2 column=cf1:val, timestamp=1331002501432, value=pear
3 column=cf1:val, timestamp=1331002501432, value=world
At this point the data just inserted through Hive is visible in HBase.
7. For tables that already exist in HBase, use CREATE EXTERNAL TABLE in Hive
For example, if the HBase table is named test1 and its column families are a:, b:, and c:, the Hive DDL is:
create external table hive_test (key int,gid map<string,string>,sid map<string,string>,uid map<string,string>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" ="a:,b:,c:") TBLPROPERTIES ("hbase.table.name" = "test1");
After the table is created in Hive, query the contents of the HBase table test1:
select * from hive_test;
OK
1 {"":"qqq"} {"":"aaa"} {"":"bbb"}
2 {"":"qqq"} {} {"":"bbb"}
To query the values inside the gid map column:
select gid[''] from hive_test;
which returns:
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_201203052222_0017, Tracking URL = http://localhost:50030/jobdetails.jsp?jobid=job_201203052222_0017
Kill Command = /opt/mapr/hadoop/hadoop-0.20.2/bin/../bin/hadoop job -Dmapred.job.tracker=maprfs:/// -kill job_201203052222_0017
2012-03-06 14:38:29,141 Stage-1 map = 0%, reduce = 0%
2012-03-06 14:38:33,171 Stage-1 map = 100%, reduce = 100%
Ended Job = job_201203052222_0017
OK
qqq
qqq
If the HBase table test1 instead has the columns user:gid, user:sid, info:uid, and info:level, the Hive DDL is:
create external table hive_test(key int,user map<string,string>,info map<string,string>) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" ="user:,info:") TBLPROPERTIES ("hbase.table.name" = "test1");
Query the HBase table like this:
select user['gid'] from hive_test;
Creating a linked table
Since the table we want to query here already exists in HBase, create it with CREATE EXTERNAL TABLE, as follows:
CREATE EXTERNAL TABLE hbase_table_2(key string, value string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "data:1")
TBLPROPERTIES("hbase.table.name" = "test");
hbase.columns.mapping points to the corresponding column family: for multiple columns use e.g. data:1,data:2, and for multiple column families use e.g. data1:1,data2:1.
hbase.table.name points to the corresponding HBase table.
hbase_table_2(key string, value string) is the linked table on the Hive side.
Note that when the data is int, the DDL differs: setting the default storage type to binary tells the handler to read the HBase cell bytes as binary-encoded values rather than UTF-8 strings.
CREATE EXTERNAL TABLE HisDiagnose(key string, doctorId int, patientId int, description String, rtime int) STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,diagnoseFamily:doctorId,diagnoseFamily:patientId,diagnoseFamily:description,diagnoseFamily:rtime","hbase.table.default.storage.type"="binary") TBLPROPERTIES("hbase.table.name" = "HisDiagnose");
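Once the table is mapped with binary storage, the integer columns behave as real numbers in Hive queries; for example, a hypothetical range filter on rtime:
select key, doctorId from HisDiagnose where rtime > 1330000000;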
Functional test:
Using Hive:
hive> create table pokes (foo int, bar string);
OK
Time taken: 0.251 seconds
hive> create table invites (foo INT, bar STRING) partitioned by (ds string);
OK
Time taken: 0.106 seconds
hive> show tables;
OK
invites
pokes
Time taken: 0.107 seconds
hive> describe invites;
OK
foo    int
bar    string
ds     string
Time taken: 0.151 seconds
hive> alter table pokes add columns (new_col int);
OK
Time taken: 0.117 seconds
hive> alter table invites add columns (new_col2 int);
OK
Time taken: 0.152 seconds
hive> LOAD DATA LOCAL INPATH './examples/files/kv1.txt' OVERWRITE INTO TABLE pokes;
Copying data from file:/home/hadoop/hadoop-0.19.1/contrib/hive/examples/files/kv1.txt
Loading data to table pokes
OK
Time taken: 0.288 seconds
hive> load data local inpath './examples/files/kv2.txt' overwrite into table invites partition (ds='2008-08-15');
Copying data from file:/home/hadoop/hadoop-0.19.1/contrib/hive/examples/files/kv2.txt
Loading data to table invites partition {ds=2008-08-15}
OK
Time taken: 0.524 seconds
hive> LOAD DATA LOCAL INPATH './examples/files/kv3.txt' OVERWRITE INTO TABLE invites PARTITION (ds='2008-08-08');
Copying data from file:/home/hadoop/hadoop-0.19.1/contrib/hive/examples/files/kv3.txt
Loading data to table invites partition {ds=2008-08-08}
OK
Time taken: 0.406 seconds
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
Total MapReduce jobs = 1
Starting Job = job_200902261245_0002, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0002
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0002
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
Ended Job = job_200902261245_0002
Moving data to: /tmp/hdfs_out
OK
Time taken: 18.551 seconds
hive> select count(1) from pokes;
Total MapReduce jobs = 2
Number of reducers = 1
In order to change number of reducers use: set mapred.reduce.tasks = <number>
Starting Job = job_200902261245_0003, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0003
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0003
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
map = 100%, reduce = 17%
map = 100%, reduce = 100%
Ended Job = job_200902261245_0003
Starting Job = job_200902261245_0004, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0004
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0004
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
map = 100%, reduce = 100%
Ended Job = job_200902261245_0004
OK
500
Time taken: 57.285 seconds
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a;
Total MapReduce jobs = 1
Starting Job = job_200902261245_0005, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0005
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0005
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
Ended Job = job_200902261245_0005
Moving data to: /tmp/hdfs_out
OK
Time taken: 18.349 seconds
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a;
Total MapReduce jobs = 2
Number of reducers = 1
In order to change number of reducers use: set mapred.reduce.tasks = <number>
Starting Job = job_200902261245_0006, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0006
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0006
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
map = 100%, reduce = 17%
map = 100%, reduce = 100%
Ended Job = job_200902261245_0006
Starting Job = job_200902261245_0007, Tracking URL = http://gp1:50030/jobdetails.jsp?jobid=job_200902261245_0007
Kill Command = /home/hadoop/hadoop-0.19.1/bin/hadoop job -Dmapred.job.tracker=gp1:9001 -kill job_200902261245_0007
map = 0%, reduce = 0%
map = 50%, reduce = 0%
map = 100%, reduce = 0%
map = 100%, reduce = 17%
map = 100%, reduce = 100%
Ended Job = job_200902261245_0007
Moving data to: /tmp/reg_5
OK
Time taken: 70.956 seconds