[Pig 1] Getting Started with Pig

bit1129

Installing Pig

1. Download Pig

wget http://mirror.bit.edu.cn/apache/pig/pig-0.14.0/pig-0.14.0.tar.gz

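2. Extract the archive and set the environment variables

If Pig runs in Map/Reduce mode, the HADOOP_HOME environment variable must be set as well. PIG_CLASSPATH below points at the Hadoop configuration directory.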

export PIG_HOME=/home/hadoop/pig-0.14.0
export PIG_CLASSPATH=/home/hadoop/hadoop-2.5.2/etc/hadoop/conf
export PATH=$PIG_HOME/bin:$PATH
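Step 2 mentions extracting the archive but shows no command for it. A minimal sketch of that step follows; the helper name unpack_pig and both of its arguments are hypothetical, and it assumes the tarball contains a single top-level pig-<version> directory.

```shell
# Hypothetical helper: unpack a Pig tarball into a target directory and
# print the resulting installation path (a candidate for PIG_HOME).
unpack_pig() {
    tarball=$1
    dest=$2
    [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
    mkdir -p "$dest" && tar -xzf "$tarball" -C "$dest" || return 1
    # Assumes the archive unpacks to <dest>/<tarball name without .tar.gz>.
    echo "$dest/$(basename "$tarball" .tar.gz)"
}
```

With the paths used in this post, it could be called as PIG_HOME=$(unpack_pig /home/hadoop/pig-0.14.0.tar.gz /home/hadoop) before the exports above.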

3. This post uses the default Pig configuration. To change Pig properties, edit the following file:

/home/hadoop/pig-0.14.0/conf/pig.properties

4. Start Pig

/home/hadoop/pig-0.14.0/bin/pig

Startup output:

hadoop@tom-Inspiron-3521:~/pig-0.14.0/bin$ ./pig
14/12/28 12:31:03 INFO pig.ExecTypeProvider: Trying ExecType : LOCAL
14/12/28 12:31:03 INFO pig.ExecTypeProvider: Trying ExecType : MAPREDUCE
14/12/28 12:31:03 INFO pig.ExecTypeProvider: Picked MAPREDUCE as the ExecType
2014-12-28 12:31:03,217 [main] INFO  org.apache.pig.Main - Apache Pig version 0.14.0 (r1640057) compiled Nov 16 2014, 18:02:05
2014-12-28 12:31:03,217 [main] INFO  org.apache.pig.Main - Logging error messages to: /home/hadoop/pig-0.14.0/bin/pig_1419741063215.log
2014-12-28 12:31:03,279 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file /home/hadoop/.pigbootup not found
2014-12-28 12:31:04,154 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-12-28 12:31:04,154 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-12-28 12:31:04,154 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: hdfs://yuzhitao-Inspiron-3521:9000
2014-12-28 12:31:06,100 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>

Basic Pig usage

1. Start Hadoop 2.5.2

2. Prepare the data set: vim $PIG_HOME/data/sampledata.txt

2014:11:23
2014:11:61
2014:12:32
2014:8:11
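As an alternative to typing the data in vim, the file can be created non-interactively. This is only a sketch: the DATA_DIR variable is an assumption, and it falls back to ./data when PIG_HOME is not set. The four lines as shown come to 43 bytes, the same size reported for part-m-00000 later in this post.

```shell
# Write the four sample records to sampledata.txt and report the file size.
DATA_DIR="${PIG_HOME:-$PWD}/data"   # assumed location; ./data if PIG_HOME is unset
mkdir -p "$DATA_DIR"

cat > "$DATA_DIR/sampledata.txt" <<'EOF'
2014:11:23
2014:11:61
2014:12:32
2014:8:11
EOF

# Size in bytes (3 lines of 11 bytes and 1 line of 10 bytes, newlines included).
wc -c < "$DATA_DIR/sampledata.txt"
```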

3. Start Pig with the following command

pig

4. From within Pig, upload sampledata.txt to HDFS

grunt> fs -copyFromLocal /home/hadoop/pig-0.14.0/data/sampledata.txt /user/hadoop

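Note: /user/hadoop is the target directory on HDFS. ./ can be used instead; relative HDFS paths are normally resolved against the user's home directory (/user/hadoop here), but exactly how grunt resolves ./ still needs to be looked into.

5. Check the HDFS state from within Pig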

grunt> ls
hdfs://tom-Inspiron-3521:9000/user/hadoop/sampledata.txt<r 1>	77

77 is the size of sampledata.txt in bytes.

6. View the contents of sampledata.txt

grunt> fs -cat /user/hadoop/sampledata.txt

7. Load sampledata.txt into Pig, splitting on ':' and declaring three columns A, B, C

grunt> A = LOAD 'sampledata.txt' USING PigStorage(':') AS (A:int,B:int, C:int);

Result:

grunt> A = LOAD 'sampledata.txt' USING PigStorage(':') AS (A:int,B:int, C:int);
2014-12-28 13:19:44,205 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-12-28 13:19:44,299 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
grunt>

8. Show the schema of A, then dump its contents

grunt> describe A;
A: {A: int,B: int,C: int}
grunt> dump A;

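Pig launches a MapReduce job. The log shows connection failures on port 10020, and running telnet localhost 10020 in a terminal is refused as well. After nearly five minutes of repeated attempts to connect to 10020, the map task still completed in the end. The output was: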

2014-12-28 15:10:55,732 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
2014-12-28 15:10:55,739 [main] INFO  org.apache.pig.data.SchemaTupleBackend - Key [pig.schematuple] was not set... will not generate code.
2014-12-28 15:10:55,763 [main] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 1
2014-12-28 15:10:55,764 [main] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 1
(2014,11,23)
(2014,11,61)
(2014,12,32)
(2014,8,11)

9. Filter the rows with a condition

grunt> B = FILTER A BY B != 12;
grunt> dump B;

Pig again retries the connection to port 10020 many times, but the result is eventually printed:

(2014,11,23)
(2014,11,61)
(2014,8,11)
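Because each dump takes minutes in this setup, it can help to cross-check the expected rows locally first. The following is a plain awk sketch, not Pig: it assumes the same four records as above and needs neither Hadoop nor Pig.

```shell
# Local cross-check of the FILTER step with awk.
sample='2014:11:23
2014:11:61
2014:12:32
2014:8:11'

# -F: splits on ':' like PigStorage(':'); $2 is the second column (B).
filtered=$(printf '%s\n' "$sample" | awk -F: '$2 != 12')
printf '%s\n' "$filtered"
```

The three remaining lines correspond to the three tuples that dump B prints.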

10. Group A

grunt> B = GROUP A BY B; -- the B after BY is the column name
grunt> dump B;

Pig again retries the connection to port 10020 many times, but the result is eventually printed:

(8,{(2014,8,11)})
(11,{(2014,11,61),(2014,11,23)})
(12,{(2014,12,32)})
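The shape of the GROUP result (one tuple per key, holding a bag of the matching rows) can be imitated locally with sort and awk. This is an illustrative sketch over the same four records, not how Pig computes it. A Pig bag is unordered, so the order of tuples inside a bag may differ from the dump above.

```shell
# Local imitation of GROUP ... BY the second column, printed in Pig's tuple notation.
sample='2014:11:23
2014:11:61
2014:12:32
2014:8:11'

grouped=$(printf '%s\n' "$sample" | sort -t: -k2,2n | awk -F: '
    # Key changed: emit the finished group and start a new bag.
    NR > 1 && $2 != key { print "(" key ",{" bag "})"; bag = "" }
    # Append the current row to the bag as a tuple.
    { key = $2; bag = bag (bag == "" ? "" : ",") "(" $1 "," $2 "," $3 ")" }
    # Emit the last group.
    END { if (NR > 0) print "(" key ",{" bag "})" }')
printf '%s\n' "$grouped"
```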

11. Store the result on HDFS

grunt> STORE A INTO 'pigStorageA' USING PigStorage(':');

Pig again retries the connection to port 10020 many times, but the job eventually completes. The result:

grunt> ls
hdfs://tom-Inspiron-3521:9000/user/hadoop/pig	<dir>
hdfs://tom-Inspiron-3521:9000/user/hadoop/pigStorageA	<dir>    <-- result output directory
hdfs://tom-Inspiron-3521:9000/user/hadoop/wordcount	<dir>
grunt> fs -l /user/hadoop/pigStorageA
-l: Unknown command
grunt> fs -ls /user/hadoop/pigStorageA
Found 2 items
-rw-r--r--   1 hadoop supergroup          0 2014-12-28 15:53 /user/hadoop/pigStorageA/_SUCCESS
-rw-r--r--   1 hadoop supergroup         43 2014-12-28 15:53 /user/hadoop/pigStorageA/part-m-00000
grunt> fs -cat /user/hadoop/pigStorageA/part-m-00000    <-- the stored result
2014:11:23
2014:11:61
2014:12:32
2014:8:11
grunt>
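The statements typed at the grunt prompt can also be collected in a script file and run non-interactively. The sketch below writes such a script and, if pig is on the PATH, runs it with -x local, which executes against the local file system without a Hadoop cluster. WORK_DIR, sample.pig and the alias names recs, not12 and grouped are assumptions made for illustration; the column names are the ones used in this post.

```shell
# Write the sample data and a Pig script, then run the script in local mode if possible.
WORK_DIR="${WORK_DIR:-$PWD/pig-local-demo}"   # hypothetical scratch directory
mkdir -p "$WORK_DIR"

# In local mode LOAD reads from the local file system, relative to the working directory.
cat > "$WORK_DIR/sampledata.txt" <<'EOF'
2014:11:23
2014:11:61
2014:12:32
2014:8:11
EOF

cat > "$WORK_DIR/sample.pig" <<'EOF'
-- Same statements as the grunt session; only the alias names differ.
recs    = LOAD 'sampledata.txt' USING PigStorage(':') AS (A:int, B:int, C:int);
not12   = FILTER recs BY B != 12;
DUMP not12;
grouped = GROUP recs BY B;
DUMP grouped;
EOF

if command -v pig >/dev/null 2>&1; then
    (cd "$WORK_DIR" && pig -x local sample.pig)
else
    echo "pig is not on PATH; script written to $WORK_DIR/sample.pig"
fi
```

Distinct alias names are used so that the relation names cannot be confused with the column names A and B.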

Aliases

The example above used

grunt> A = LOAD 'sampledata.txt' USING PigStorage(':') AS (A:int,B:int, C:int);

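The A after grunt> is an alias. An alias of what? In Pig Latin an alias names a relation, here the bag of tuples produced by the LOAD statement.

The dump command evaluates an alias and prints the result to the console.
The describe command shows the schema of an alias.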

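Summary

This post did not really succeed as an introduction to Pig: it gives no intuitive feel for what Pig can do or how it differs from Hive. The configuration also had a problem, so errors were reported over and over while running Pig commands.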

About the refused connections on port 10020

On Hadoop 2.x, the Job History Server has to be configured and started.

1. Add the Job History settings

vim mapred-site.xml

Add the following:

    <property>
        <name>mapreduce.jobhistory.address</name>
        <value>hadoop.master:10020</value>
    </property>
    <property>
        <name>mapreduce.jobhistory.webapp.address</name>
        <value>hadoop.master:19888</value>
    </property>

2. Start the JobHistoryServer

sbin/mr-jobhistory-daemon.sh start historyserver
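After starting the server, the telnet test from earlier can be repeated as a small script. This sketch relies on bash's /dev/tcp pseudo-device, so it assumes bash rather than a plain POSIX sh. The function name port_open and the HISTORY_HOST variable are hypothetical; set HISTORY_HOST to the host configured in mapreduce.jobhistory.address.

```shell
#!/usr/bin/env bash
# Return success if a TCP connection to host:port can be opened.
port_open() {
    local host=$1 port=$2
    # The subshell opens fd 3 on the connection and closes it again on exit.
    (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null
}

HISTORY_HOST="${HISTORY_HOST:-localhost}"   # assumption: same host as the telnet test

if port_open "$HISTORY_HOST" 10020; then
    echo "port 10020 on $HISTORY_HOST is accepting connections"
else
    echo "nothing is accepting connections on $HISTORY_HOST:10020"
fi
```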

2014-12-28 13:29