There are 3 common ways to import data into HBase:
1> Use ImportTsv
ImportTsv is a utility that will load data in TSV format into HBase.
It can be used in two ways:
a. Loading TSV-format data from HDFS into HBase via Puts.
This kind of load is a non-bulk load.
E.g.:
bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 table_name /user/hadoop/data
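For this command the input files under /user/hadoop/data must be tab-separated (ImportTsv's default separator), one record per line. A hypothetical two-line sample, where <TAB> stands for a tab character:
row_001<TAB>some_value
row_002<TAB>another_value
The first field becomes the row key (HBASE_ROW_KEY) and the second is stored in information:c1.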
b. Another method is to use bulk loading.
It is divided into 2 steps:
1> Generate StoreFiles for bulk loading:
HADOOP_CLASSPATH=`${HBASE_HOME}/bin/hbase classpath` ${HADOOP_HOME}/bin/hadoop jar ${HBASE_HOME}/hbase-0.94.2.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,information:c1 -Dimporttsv.bulk.output=/user/hadoop/rdc_search_hfile -Dimporttsv.separator=, rdc_search_information /user/hadoop/sourcedata
2>Move the generated StoreFiles into an HBase table using completebulkload utility
bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /user/hadoop/rdc_search_hfile rdc_search_information
In the above demo, the table name is rdc_search_information and it has one column family, information.
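The second step can also be done programmatically from Java. A minimal sketch, assuming HBase 0.94-era client classes and the same HFile path and table name as in the demo above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadHFiles {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();
        LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
        // Moves the generated StoreFiles into the regions of the target table
        loader.doBulkLoad(new Path("/user/hadoop/rdc_search_hfile"),
                new HTable(conf, "rdc_search_information"));
    }
}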
Running ImportTsv with no arguments prints brief usage information:
Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>
Imports the given input directory of TSV data into the specified table.
The column names of the TSV data must be specified using the -Dimporttsv.columns
option. This option takes the form of comma-separated column names, where each
column name is either a simple column family, or a columnfamily:qualifier. The special
column name HBASE_ROW_KEY is used to designate that this column should be used
as the row key for each imported record. You must specify exactly one column
to be the row key, and you must specify a column name for every column that exists in the
input data.
By default importtsv will load data directly into HBase. To instead generate
HFiles of data to prepare for a bulk data load, pass the option:
-Dimporttsv.bulk.output=/path/for/output
Note: the target table will be created with default column family descriptors if it does not already exist.
Other options that may be specified with -D include:
-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
'-Dimporttsv.separator=|' - eg separate on pipes instead of tabs
-Dimporttsv.timestamp=currentTimeAsLong - use the specified timestamp for the import
-Dimporttsv.mapper.class=my.Mapper - A user-defined Mapper to use instead of org.apache.hadoop.hbase.mapreduce.TsvImporterMapper
2> We can write a Java project that reads a file from the local file system and uses the Put class to put the data into HBase, as in the sketch below.
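A minimal sketch of this approach, assuming the table and column family from the ImportTsv demo above, a hypothetical comma-separated local input file, and the HBase 0.94-era client API:

import java.io.BufferedReader;
import java.io.FileReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutDataIntoHBase {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();            // reads hbase-site.xml
        HTable table = new HTable(conf, "rdc_search_information");   // table from the demo above
        // Hypothetical local file: one "rowkey,value" pair per line
        BufferedReader reader = new BufferedReader(new FileReader("/home/hadoop/sourcedata.csv"));
        String line;
        while ((line = reader.readLine()) != null) {
            String[] fields = line.split(",");
            Put put = new Put(Bytes.toBytes(fields[0]));              // first field is the row key
            put.add(Bytes.toBytes("information"), Bytes.toBytes("c1"),
                    Bytes.toBytes(fields[1]));                        // second field goes to information:c1
            table.put(put);
        }
        reader.close();
        table.close();
    }
}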
3>Use the Pig to read the data from HDFS and write to HBase.
Ref: http://hbase.apache.org/book/ops_mgt.html#importtsv