
Hadoop HBase Tutorial

This tutorial will quickly teach you how to use HBase, a column-oriented data store that sits on top of Hadoop. It works best when you have large tables and need random, real-time access to your Big Data. Though it does not support SQL, HBase can easily be connected to Hive, giving you the read/write speed of HBase, the ease of Hive, and the parallel processing of MapReduce.

The BigSQL bundle automatically starts HBase in pseudo-distributed mode, in which a master and a region server both run on your local computer.
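
If you want to confirm that both daemons came up, one quick check (assuming a standard JDK is on your PATH) is the jps command, which should list an HMaster and an HRegionServer process among the running JVMs:

        $ jps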

The tutorial uses the data file from the Hadoop Hive Tutorial (see that tutorial for all prerequisites). If you have not grabbed the file already, it is located in the zipfile here. Place the file ex1data.csv into the ~/Downloads/Sample_files/ directory, so its full path is ~/Downloads/Sample_files/ex1data.csv.

Note: If you are using the BigSQL distribution (highly recommended), make sure you are using at least version beta 2.28!

The first step is to upload the CSV file into HDFS. Use the hadoop fs command to make the directory and copy ex1data.csv from your Downloads folder.

	$ hadoop fs -mkdir /user/data/salesdata
 	$ hadoop fs -copyFromLocal ~/Downloads/Sample_files/ex1data.csv /user/data/salesdata/ex1data.csv
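
Note that on Hadoop 2.x, hadoop fs -mkdir needs the -p flag to create missing parent directories, so if the mkdir above complains, rerun it as hadoop fs -mkdir -p /user/data/salesdata. You can then confirm the upload with a listing:

        $ hadoop fs -ls /user/data/salesdata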

Next, start the hbase shell and create the table “sales_data” with the column families location, units, size, age and pricing.

	$ hbase shell
     	hbase > create 'sales_data', 'location', 'units', 'size', 'age', 'pricing'
     	hbase > quit
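
Before loading any data, you can double-check the column families with the shell's describe command, which prints the table schema and per-family settings:

        $ hbase shell
        hbase > describe 'sales_data'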

Use the ImportTsv tool to import the CSV file into the HBase table. The column that will serve as the row key is not listed by name; in its position we put the special HBASE_ROW_KEY token instead of explicitly saying s_num.

        $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' \
            '-Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt' \
            sales_data /user/data/salesdata/ex1data.csv

Since this file is separated by commas and not tabs, you need to specify ‘-Dimporttsv.separator=,’.

HBase is also very good at bulk loads. To perform one, pass the ‘importtsv.bulk.output’ option to ImportTsv to generate HBase-compatible HFiles, then use the ‘completebulkload’ utility to load those files into the HBase table, as sketched below.
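
A minimal sketch of that two-step flow, assuming /user/data/salesdata/hfiles as a scratch output directory; the column list is abbreviated here and should be the full list used above:

        $ hbase org.apache.hadoop.hbase.mapreduce.ImportTsv '-Dimporttsv.separator=,' \
            -Dimporttsv.bulk.output=/user/data/salesdata/hfiles \
            '-Dimporttsv.columns=HBASE_ROW_KEY,location:s_borough,...' \
            sales_data /user/data/salesdata/ex1data.csv
        $ hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/data/salesdata/hfiles sales_data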

To verify that the table has been created and loaded, you can use the list command in the HBase shell to show all HBase tables.

     	$ hbase shell
     	hbase > list
        	TABLE
        	sales_data

To check the data within the table, you can use the scan command, which prints every cell in the table as its own output line.

     	hbase > scan 'sales_data'
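
A full scan of a large table produces a lot of output, so while poking around it can help to cap it with the shell's LIMIT option, or to get a quick row total with count:

        hbase > scan 'sales_data', {LIMIT => 5}
        hbase > count 'sales_data'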

To add the table to Hive, create an external table in Hive stored by org.apache.hadoop.hive.hbase.HBaseStorageHandler. You must list the hbase.columns.mapping as shown below (note that the mapping string must not contain whitespace between entries). Even though s_num is listed in the definition of the table, it is not listed under the SerDe properties: as the first column, it implicitly maps to the HBase row key.

     	$ hive
        hive > CREATE EXTERNAL TABLE IF NOT EXISTS sales_data (
            s_num FLOAT, s_borough INT, s_neighbor STRING, s_b_class STRING,
            s_c_p STRING, s_block STRING, s_lot STRING, s_easement STRING,
            w_c_p_2 STRING, s_address STRING, s_app_num STRING, s_zip STRING,
            s_res_units STRING, s_com_units STRING, s_tot_units INT,
            s_sq_ft FLOAT, s_g_sq_ft FLOAT, s_yr_built INT, s_tax_c INT,
            s_b_class2 STRING, s_price FLOAT, s_sales_dt STRING )
            STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
            WITH SERDEPROPERTIES ("hbase.columns.mapping" = "location:s_borough,location:s_neighbor,location:s_b_class,location:s_c_p,location:s_block,location:s_lot,location:s_easement,location:w_c_p_2,location:s_address,location:s_app_num,location:s_zip,units:s_res_units,units:s_com_units,units:s_tot_units,size:s_sq_ft,size:s_g_sq_ft,age:s_yr_built,pricing:s_tax_c,pricing:s_b_class2,pricing:s_price,pricing:s_sales_dt");

	hive> DESCRIBE sales_data;                                                                                  
		OK
		col_name		data_type		comment
		s_num               	float               	from deserializer   
		s_borough           	int                 	from deserializer   
		s_neighbor          	string              	from deserializer   
		s_b_class           	string              	from deserializer   
		s_c_p               	string              	from deserializer   
		s_block             	string              	from deserializer   
		s_lot               	string              	from deserializer   
		s_easement          	string              	from deserializer   
		w_c_p_2             	string              	from deserializer   	
		s_address           	string              	from deserializer   
		s_app_num           	string              	from deserializer   
		s_zip               	string              	from deserializer   
		s_res_units         	string              	from deserializer   
		s_com_units         	string              	from deserializer   
		s_tot_units         	int                 	from deserializer   
		s_sq_ft             	float               	from deserializer   
		s_g_sq_ft           	float               	from deserializer   
		s_yr_built          	int                 	from deserializer   
		s_tax_c             	int                 	from deserializer   
		s_b_class2          	string              	from deserializer   
		s_price             	float               	from deserializer   
		s_sales_dt          	string              	from deserializer   
		Time taken: 0.27 seconds, Fetched: 22 row(s)
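
With the mapping in place, the table can be queried with ordinary HiveQL, and Hive runs the query as a MapReduce job over the data stored in HBase. As a quick sanity check (any aggregate will do; this one simply counts sales per borough):

        hive > SELECT s_borough, COUNT(*) FROM sales_data GROUP BY s_borough;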

You can also use the HBase master web UI (http://localhost:60010/master-status) to check the user tables created, along with their attributes and other metrics!
For more information on BigSQL, visit BigSQL.org.