Taoo (Beijing)

Configuring Pig to use LZO

 
Reposted from: http://stackoverflow.com/questions/7277621/how-to-get-pig-to-work-with-lzo-files
I haven't tried this myself yet.

----------------------
I recently got this to work and wrote up a wiki on it for my coworkers. Here's an excerpt detailing how to get Pig to work with LZOs. Hope this helps someone!

NOTE: This is written with a Mac in mind. The steps will be almost identical for other OSes, and this should definitely give you what you need to know to configure on Windows or Linux, but you will need to extrapolate a bit (obviously, change Mac-centric folders to whatever your OS uses, etc...).
Hooking PIG up to be able to work with LZOs

This was by far the most annoying and time-consuming part for me, not because it's difficult, but because there are 50 different tutorials online, none of which are all that helpful. Anyway, what I did to get this working is:

    1. Clone hadoop-lzo from GitHub at https://github.com/kevinweil/hadoop-lzo.

    2. Compile it to get a hadoop-lzo*.jar and the native *.o libraries. You'll need to compile this on a 64-bit machine.

    3. Copy the native libs to $HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/.

    4. Copy the Java jar to $HADOOP_HOME/lib and $PIG_HOME/lib.

    5. Then configure Hadoop and Pig so that the property java.library.path points to the LZO native libraries. You can do this in $HADOOP_HOME/conf/mapred-site.xml with:

    <property>
        <name>mapred.child.env</name>
        <value>JAVA_LIBRARY_PATH=$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/</value>
    </property>

    6. Now run pig again and try the grunt shell to make sure everything still works. If it doesn't, you probably made a mistake in mapred-site.xml and should double-check it.

    7. Great! We're almost there. All you need to do now is install elephant-bird; clone it from https://github.com/kevinweil/elephant-bird.

    8. To build elephant-bird you'll need quite a few prerequisites. These are listed on the page mentioned above and might change, so I won't repeat them here. What I will stress is that the versions matter: if you grab an incorrect version and run ant, you will get errors. So don't install the prerequisites from brew or MacPorts, as you'll likely get a newer version; instead, download the tarballs and build each one.

    9. Run ant in the elephant-bird folder to create the jar.

    10. For simplicity's sake, move the jars you'll need to register frequently (hadoop-lzo-x.x.x.jar and elephant-bird-x.x.x.jar) somewhere you can easily find them. /usr/local/lib/hadoop/... works nicely.

    11. Try things out! Play around with loading normal files and LZOs in the grunt shell. Register the relevant jars mentioned above, try loading a file, limiting the output to a manageable number, and dumping it. This should all work the same whether you're using a plain text file or an LZO.
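Steps 1-4 can be sketched as shell commands. This is a rough outline, not a tested recipe: the ant targets are from the hadoop-lzo README of that era, the build output paths may differ on your machine, and the Mac_OS_X-x86_64-64 folder name should be swapped for your own platform:

```shell
# Step 1: clone hadoop-lzo (run the build on a 64-bit machine)
git clone https://github.com/kevinweil/hadoop-lzo.git
cd hadoop-lzo

# Step 2: compile the jar and the native libraries
ant compile-native tar

# Steps 3-4: copy the native libs and the jar into Hadoop and Pig
# (adjust the platform folder name for your OS)
cp build/native/Mac_OS_X-x86_64-64/lib/* "$HADOOP_HOME/lib/native/Mac_OS_X-x86_64-64/"
cp build/hadoop-lzo-*.jar "$HADOOP_HOME/lib/"
cp build/hadoop-lzo-*.jar "$PIG_HOME/lib/"
```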
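In the grunt shell, the try-out in step 11 looks roughly like this. The jar versions and the input path are placeholders you'd replace with your own; `LzoTextLoader` is elephant-bird's loader for plain-text LZO files:

```pig
-- Register the jars you parked in /usr/local/lib/hadoop in step 10
REGISTER /usr/local/lib/hadoop/hadoop-lzo-x.x.x.jar;
REGISTER /usr/local/lib/hadoop/elephant-bird-x.x.x.jar;

-- Load an LZO-compressed text file, keep the output manageable, and dump it
logs = LOAD '/data/sample.lzo' USING com.twitter.elephantbird.pig.load.LzoTextLoader();
few  = LIMIT logs 10;
DUMP few;
```

Loading a plain text file works the same way with the default `PigStorage` loader, so you can compare the two side by side.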
