This tutorial will quickly teach you how to use HBase, a column-oriented data store that sits on top of Hadoop. It works best when you have large tables and need random, real-time access to your Big Data. Though it does not support SQL, HBase can easily be connected to Hive, giving you the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.
The BigSQL bundle automatically starts HBase in pseudo-distributed mode, in which a master and a region server both run on your local computer.
The tutorial uses the data file from the Hadoop Hive Tutorial (see that tutorial for all prerequisites). If you have not grabbed the file already, it is located in the zipfile here. Place the file ex1data.csv into the ~/Downloads/Sample_files/ directory.
Note: if you are using the BigSQL distribution (highly recommended) make sure you are using at least version beta 2.28!
The first step is to upload the CSV file into HDFS. Use the hadoop fs command to create the directory and copy ex1data.csv from your Downloads folder. (On Hadoop 2.x, pass -p to -mkdir if the parent directories do not exist yet.)
$ hadoop fs -mkdir /user/data/salesdata
$ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv
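Before uploading, it can be worth checking that every line of the file has the same number of fields. The sketch below is not part of the original workflow. It assumes 22 comma-separated fields per line (one row key plus the 21 columns imported later in this tutorial) and deliberately splits on every comma without honoring quotes, on the assumption that the import step later does the same. The function name find_bad_lines is made up for illustration.

```python
def find_bad_lines(lines, expected_fields=22, separator=","):
    """Return (line_number, field_count) for every line with an unexpected field count.

    The split is naive on purpose: a quoted value that contains the
    separator is counted as two fields.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # ignore blank lines
        count = len(line.split(separator))
        if count != expected_fields:
            bad.append((number, count))
    return bad
```

To run it against the sample file, pass an open file handle, for example find_bad_lines(open(path)) where path points at your local copy of ex1data.csv. An empty list means every line has the expected width.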
Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.
$ hbase shell
hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
hbase > quit
Use the ImportTsv tool to import the CSV file into the HBase table. The field that becomes each row's key is not listed by name; instead, the special name HBASE_ROW_KEY marks its position in the column list. In this example, HBASE_ROW_KEY takes the place of s_num, the first field.
$ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' \
-Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,\
location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,\
location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,\
units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,\
pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt \
sales_data /user/data/salesdata/ex1data.csv
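Since this file is separated by commas and not tabs, you need to specify '-Dimporttsv.separator=,'. The column list must not contain any spaces, so the continuation lines of the command are not indented.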
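Typing the column list by hand is error-prone, since a stray space or a missing entry is easy to overlook. As a rough sketch, the list can be generated from a description of the column families instead. The names FAMILIES and importtsv_columns are made up for illustration; the families and qualifiers are the ones used in this tutorial.

```python
# Column families and qualifiers, in the order the fields appear in the CSV.
FAMILIES = [
    ("location", ["s_borough", "s_neighbor", "s_b_class", "s_c_p", "s_block",
                  "s_lot", "s_easement", "w_c_p_2", "s_address", "s_app_num",
                  "s_zip"]),
    ("units", ["s_res_units", "s_com_units", "s_tot_units"]),
    ("size", ["s_sq_ft", "s_g_sq_ft"]),
    ("age", ["s_yr_built"]),
    ("pricing", ["s_tax_c", "s_b_class2", "s_price", "s_sales_dt"]),
]


def importtsv_columns(families, row_key_position=0):
    """Build the value for -Dimporttsv.columns from (family, qualifiers) pairs."""
    entries = [
        f"{family}:{qualifier}"
        for family, qualifiers in families
        for qualifier in qualifiers
    ]
    # HBASE_ROW_KEY marks which CSV field is used as the row key.
    entries.insert(row_key_position, "HBASE_ROW_KEY")
    spec = ",".join(entries)
    if any(ch.isspace() for ch in spec):
        raise ValueError("column spec must not contain whitespace")
    return spec


print("-Dimporttsv.columns=" + importtsv_columns(FAMILIES))
```

The printed argument can be pasted into the ImportTsv command in place of the hand-written list.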
HBase is also very good with bulk uploads. To do this, pass the '-Dimporttsv.bulk.output' option to ImportTsv so that it generates HBase-compatible files instead of writing to the table directly, then use the 'completebulkload' utility to load those files into the HBase table.
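The two bulk-load steps can be sketched as a pair of commands. This is an illustration under assumptions rather than a tested recipe: the output directory /user/data/salesdata_hfiles is hypothetical, the column list is abbreviated, and the LoadIncrementalHFiles class name (the class behind completebulkload) is taken from HBase releases of the 0.9x/1.x era, so check it against your version. The helper name bulk_load_commands is made up.

```python
import shlex

IMPORT_TSV = "org.apache.hadoop.hbase.mapreduce.ImportTsv"
# Assumption: class name as found in HBase 0.9x/1.x; it may differ in other releases.
LOAD_HFILES = "org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles"


def bulk_load_commands(table, columns, csv_path, hfile_dir, separator=","):
    """Return the two argument lists for a bulk load: generate files, then load them."""
    generate = [
        "hbase", IMPORT_TSV,
        f"-Dimporttsv.separator={separator}",
        f"-Dimporttsv.bulk.output={hfile_dir}",  # write files instead of puts
        f"-Dimporttsv.columns={columns}",
        table, csv_path,
    ]
    load = ["hbase", LOAD_HFILES, hfile_dir, table]
    return generate, load


generate, load = bulk_load_commands(
    table="sales_data",
    columns="HBASE_ROW_KEY,location:s_borough",  # abbreviated for the example
    csv_path="/user/data/salesdata/ex1data.csv",
    hfile_dir="/user/data/salesdata_hfiles",      # hypothetical output directory
)
for command in (generate, load):
    print(" ".join(shlex.quote(argument) for argument in command))
```

The sketch only prints the commands. To run them, each list could be handed to subprocess.run on a machine where the hbase launcher is on the PATH.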
To ensure that the table has been created in HBase, you can use the list command to show all HBase tables.
$ hbase shell
hbase > list
TABLE
sales_data
To check the data within the table, you can use the scan command. This prints every cell in the table, one cell per line of output.
hbase > scan 'sales_data'
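The reason a scan prints one line per cell, not one line per record, follows from how HBase stores data: each row key points to a set of family:qualifier columns, and every value is a separate cell. A toy model in Python makes this concrete. The values below are invented for illustration and do not come from ex1data.csv; sorting Python strings only approximates HBase's byte-wise ordering.

```python
def cells(table):
    """Yield (row_key, column, value) in roughly the order a scan returns them.

    Row keys are compared as strings, not numbers, so "10" sorts before "2".
    """
    for row_key in sorted(table):
        for column in sorted(table[row_key]):
            yield row_key, column, table[row_key][column]


# Made-up sample rows: row key -> {"family:qualifier": value}
sample = {
    "2": {"pricing:s_price": "450000"},
    "10": {"location:s_zip": "10002", "age:s_yr_built": "1925"},
    "1": {"location:s_zip": "10001"},
}

for row_key, column, value in cells(sample):
    print(row_key, column, value)
```

Three rows produce four output lines here, because the row with key "10" holds two cells. The same effect means a scan of sales_data prints up to 21 lines per record.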
To add the table to Hive, create an external table in Hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below, without spaces between the entries. Note that even though s_num is listed in the definition of the table, it is not listed under the SerDe properties: it is the row key, which this mapping leaves implicit.
$ hive
hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
    s_num FLOAT,
    s_borough INT,
    s_neighbor STRING,
    s_b_class STRING,
    s_c_p STRING,
    s_block STRING,
    s_lot STRING,
    s_easement STRING,
    w_c_p_2 STRING,
    s_address STRING,
    s_app_num STRING,
    s_zip STRING,
    s_res_units STRING,
    s_com_units STRING,
    s_tot_units INT,
    s_sq_ft FLOAT,
    s_g_sq_ft FLOAT,
    s_yr_built INT,
    s_tax_c INT,
    s_b_class2 STRING,
    s_price FLOAT,
    s_sales_dt STRING
)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

hive > DESCRIBE sales_data;
OK
col_name      data_type   comment
s_num         float       from deserializer
s_borough     int         from deserializer
s_neighbor    string      from deserializer
s_b_class     string      from deserializer
s_c_p         string      from deserializer
s_block       string      from deserializer
s_lot         string      from deserializer
s_easement    string      from deserializer
w_c_p_2       string      from deserializer
s_address     string      from deserializer
s_app_num     string      from deserializer
s_zip         string      from deserializer
s_res_units   string      from deserializer
s_com_units   string      from deserializer
s_tot_units   int         from deserializer
s_sq_ft       float       from deserializer
s_g_sq_ft     float       from deserializer
s_yr_built    int         from deserializer
s_tax_c       int         from deserializer
s_b_class2    string      from deserializer
s_price       float       from deserializer
s_sales_dt    string      from deserializer
Time taken: 0.27 seconds, Fetched: 22 row(s)

You can also use the HBase Console (localhost:60010/master-status) to check the user tables created and their attributes and other metrics!

For more information on BigSQL visit BigSQL.org
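A note on the hbase.columns.mapping string used in the CREATE EXTERNAL TABLE statement: the tutorial relies on the first Hive column being mapped to the row key implicitly. The Hive documentation recommends naming the key explicitly with a :key entry, in which case the string needs one entry per Hive column. The sketch below builds such a string, assuming (as in this tutorial) that each HBase qualifier has the same name as its Hive column. The names HIVE_COLUMNS, FAMILY_MEMBERS and hive_hbase_mapping are made up for illustration.

```python
HIVE_COLUMNS = [
    "s_num", "s_borough", "s_neighbor", "s_b_class", "s_c_p", "s_block",
    "s_lot", "s_easement", "w_c_p_2", "s_address", "s_app_num", "s_zip",
    "s_res_units", "s_com_units", "s_tot_units", "s_sq_ft", "s_g_sq_ft",
    "s_yr_built", "s_tax_c", "s_b_class2", "s_price", "s_sales_dt",
]

FAMILY_MEMBERS = {
    "location": ["s_borough", "s_neighbor", "s_b_class", "s_c_p", "s_block",
                 "s_lot", "s_easement", "w_c_p_2", "s_address", "s_app_num",
                 "s_zip"],
    "units": ["s_res_units", "s_com_units", "s_tot_units"],
    "size": ["s_sq_ft", "s_g_sq_ft"],
    "age": ["s_yr_built"],
    "pricing": ["s_tax_c", "s_b_class2", "s_price", "s_sales_dt"],
}


def hive_hbase_mapping(hive_columns, family_members, key_column):
    """Return an hbase.columns.mapping value with an explicit :key entry."""
    family_of = {
        qualifier: family
        for family, qualifiers in family_members.items()
        for qualifier in qualifiers
    }
    entries = []
    for column in hive_columns:
        if column == key_column:
            entries.append(":key")  # the HBase row key
        else:
            # Raises KeyError if a Hive column has no column family assigned.
            entries.append(f"{family_of[column]}:{column}")
    return ",".join(entries)  # no whitespace between entries


mapping = hive_hbase_mapping(HIVE_COLUMNS, FAMILY_MEMBERS, key_column="s_num")
print(mapping)
```

Generating the string this way also catches a Hive column that was never assigned to a column family, which is otherwise easy to miss in a list of 22 entries.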