
HBase: export a table to JSON files


  I want to export a table to JSON-format files, but after some googling I found no solution. I know Pig is used to do some SQL-like MapReduce stuff, and Hive is a data warehouse built on top of HBase, but I couldn't find a solution/workaround with those either (maybe I missed something).

  So I decided to use MR to handle this case. Of course, the data in an HFile is in byte form, which means it must first be converted to a String using UTF-8 and then put into a JSON object to escape the special characters. But when I went down that road I was not happy: writeUTF() (copied below from java.io.DataOutputStream) limits the encoded output to 64K bytes:

    static int writeUTF(String str, DataOutput out) throws IOException {
        int strlen = str.length();
        int utflen = 0;
        int c, count = 0;

        /* use charAt instead of copying String to char array */
        for (int i = 0; i < strlen; i++) {
            c = str.charAt(i);
            if ((c >= 0x0001) && (c <= 0x007F)) {
                utflen++;
            } else if (c > 0x07FF) {
                utflen += 3;
            } else {
                utflen += 2;
            }
        }

        if (utflen > 65535)
            throw new UTFDataFormatException(
                "encoded string too long: " + utflen + " bytes");
...
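   For example, a minimal sketch that trips this limit (the class name is made up for illustration):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.util.Arrays;

    public class WriteUtfLimitDemo {
        public static void main(String[] args) throws Exception {
            DataOutputStream out = new DataOutputStream(new ByteArrayOutputStream());
            char[] big = new char[70000];
            Arrays.fill(big, 'a');           // 70k ASCII chars encode to 70k UTF-8 bytes
            out.writeUTF(new String(big));   // throws UTFDataFormatException (> 65535)
        }
    }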

   Maybe you want to use these other methods to address it, but no luck there either (a small demo follows the list):

write(byte[]): writes the bytes directly to the fs

writeBytes(String): writes one byte per char to the fs (the high byte of each char is dropped)

writeChars(String): writes two bytes per char (UTF-16 units, not UTF-8)
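
   Here is a quick sketch of why the first two mangle non-Latin text (class name is made up):

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.nio.charset.StandardCharsets;

    public class WriteMethodsDemo {
        public static void main(String[] args) throws Exception {
            String s = "中";  // one CJK char, 3 bytes in UTF-8
            ByteArrayOutputStream buf = new ByteArrayOutputStream();

            new DataOutputStream(buf).writeBytes(s);
            System.out.println(buf.size());   // 1 -- high byte dropped, data destroyed

            buf.reset();
            new DataOutputStream(buf).writeChars(s);
            System.out.println(buf.size());   // 2 -- UTF-16 units, not UTF-8

            buf.reset();                      // write(byte[]) is fine once you already hold UTF-8 bytes
            buf.write(s.getBytes(StandardCharsets.UTF_8));
            System.out.println(buf.size());   // 3 -- round-trips correctly
        }
    }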

 

  That means only the writeUTF() style is suitable for writing mixed text out to a file, and it uses UTF-8 encoding to write the bytes. So I figured I could construct the 'JSON-style' bytes myself to achieve this.

  And UTF-8 decoding will simply be the reverse of this encoding step (copied from the JDK):

    for (; i < strlen; i++) {
        c = str.charAt(i);
        if ((c >= 0x0001) && (c <= 0x007F)) {
            bytearr[count++] = (byte) c;                            // 1 byte: ASCII
        } else if (c > 0x07FF) {
            bytearr[count++] = (byte) (0xE0 | ((c >> 12) & 0x0F)); // 3 bytes
            bytearr[count++] = (byte) (0x80 | ((c >>  6) & 0x3F));
            bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
        } else {
            bytearr[count++] = (byte) (0xC0 | ((c >>  6) & 0x1F)); // 2 bytes
            bytearr[count++] = (byte) (0x80 | ((c >>  0) & 0x3F));
        }
    }
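
   For completeness, a minimal sketch of the reverse (decoding) direction, assuming well-formed 1- to 3-byte sequences; in practice new String(bytearr, StandardCharsets.UTF_8) does the same job with full error handling:

    static String decodeUtf8(byte[] bytearr) {
        StringBuilder sb = new StringBuilder(bytearr.length);
        int i = 0;
        while (i < bytearr.length) {
            int b = bytearr[i] & 0xFF;
            if (b < 0x80) {            // 0xxxxxxx: 1-byte sequence
                sb.append((char) b);
                i += 1;
            } else if (b < 0xE0) {     // 110xxxxx 10xxxxxx: 2-byte sequence
                sb.append((char) (((b & 0x1F) << 6) | (bytearr[i + 1] & 0x3F)));
                i += 2;
            } else {                   // 1110xxxx 10xxxxxx 10xxxxxx: 3-byte sequence
                sb.append((char) (((b & 0x0F) << 12)
                        | ((bytearr[i + 1] & 0x3F) << 6)
                        |  (bytearr[i + 2] & 0x3F)));
                i += 3;
            }
        }
        return sb.toString();
    }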

 

   And we know the byte[] coming out of the HFile is already UTF-8 encoded, so the workaround code below works:

    private void write2Hdfs() throws IOException {
        // write one record per line to the fs
        logger.info("start to write file..");
        long start = System.currentTimeMillis();
        for (MiniArchive jso : container) { // buffer size was set in init: 5 MB
            this.outputStream.write('{');
            this.outputStream.write(this.BYTE_KEY);   // pre-encoded JSON fragment
            this.outputStream.write(jso.getKey());
            this.outputStream.write(this.BYTE_QUOT);  // closing quote

            if (jso.getTitle() != null) {
                this.outputStream.write(this.BYTE_TITLE);
                this.outputStream.write(jso.getTitle());
                this.outputStream.write(this.BYTE_QUOT);
            }
...
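   The BYTE_* fields aren't shown in the snippet; my assumption is that they are fragments of the JSON skeleton pre-encoded to UTF-8 once, so the hot loop only ever writes raw bytes, e.g.:

    import java.nio.charset.StandardCharsets;

    class JsonFragments {
        // hypothetical values -- encode the skeleton once, reuse for every record
        static final byte[] BYTE_KEY   = "\"key\":\"".getBytes(StandardCharsets.UTF_8);
        static final byte[] BYTE_TITLE = ",\"title\":\"".getBytes(StandardCharsets.UTF_8);
        static final byte[] BYTE_QUOT  = "\"".getBytes(StandardCharsets.UTF_8);
    }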

   Yeah, the escaping is a must before writing out the appropriate bytes:

    private static byte[] escapeRef(byte[] bytes) {
        if (bytes == null || bytes.length == 0)
            return null;
        // pass 1: record the positions of every '"' byte
        Set<Integer> set = null;
        for (int i = 0; i < bytes.length; i++) {
            if (bytes[i] == '\"') {
                if (set == null) {
                    set = new HashSet<Integer>(10);
                }
                set.add(i);
            }
        }
        if (set != null) {
            // pass 2: copy, inserting a '\' before each recorded position
            byte[] ret = new byte[bytes.length + set.size()];
            int newIndex = 0;
            // would System.arraycopy() up to the first quote be faster? see the
            // Set-free sketch below; also, contains() gets worse as the set grows
            for (int i = 0; i < bytes.length; i++) {
                if (set.contains(i)) {
                    ret[newIndex] = '\\'; // insert the escape
                    newIndex++;
                    set.remove((Integer) i); // note: i must be boxed for remove(Object)
                }
                ret[newIndex++] = bytes[i];
            }
...
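   On the arraycopy question in the comment above: a Set isn't really needed at all. A minimal Set-free sketch (my variant, not the original code) counts the quotes in one pass, then copies and inserts inline. Since '"' (0x22) can never appear inside a UTF-8 multi-byte sequence, scanning the raw bytes is safe:

    private static byte[] escapeRefNoSet(byte[] bytes) {
        if (bytes == null || bytes.length == 0)
            return null;
        int quotes = 0;                    // pass 1: just count the '"' bytes
        for (byte b : bytes)
            if (b == '"')
                quotes++;
        if (quotes == 0)
            return bytes;                  // nothing to escape, reuse the input array
        byte[] ret = new byte[bytes.length + quotes];
        int newIndex = 0;                  // pass 2: copy, inserting '\' inline
        for (byte b : bytes) {
            if (b == '"')
                ret[newIndex++] = '\\';
            ret[newIndex++] = b;
        }
        return ret;
    }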

 

    So I did construct a fake 'JSON' myself ;)

   After some perf tests, I found this solution is about 4x faster than the three-step conversion:

pseudo code (a sketch of this baseline follows):
a. convert the bytes retrieved from HBase to a String
b. put the String into a JSON object and use toString() to escape the special char '"'
c. convert that String back to bytes for writing out with write(byte[])
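
   For reference, a minimal sketch of that baseline path (org.json is used here purely for illustration; any JSON library of the same shape fits):

    import java.nio.charset.StandardCharsets;
    import org.json.JSONObject;

    class ThreeStepBaseline {
        static byte[] convert(byte[] key, byte[] title) {
            JSONObject jso = new JSONObject();
            jso.put("key", new String(key, StandardCharsets.UTF_8));         // a. bytes -> String
            if (title != null)
                jso.put("title", new String(title, StandardCharsets.UTF_8));
            String json = jso.toString();                                    // b. escapes '"' etc.
            return json.getBytes(StandardCharsets.UTF_8);                    // c. String -> bytes
        }
    }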

 
