Pig relational operator examples
cogroup
- Takes two relations, groups each by the specified field, and collects the tuples of both relations under the shared key column.
grunt> cat A;
0,1,2
1,3,4
grunt> cat B;
0,5,2
1,7,8
grunt> b = load 'B' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> c = cogroup a by c1,b by c1;
grunt> illustrate c;
------------------------------------------------
| a     | c1:int | c2:int | c3:int |
------------------------------------------------
|       | 0      | 1      | 2      |
|       | 0      | 1      | 2      |
------------------------------------------------
------------------------------------------------
| b     | c1:int | c2:int | c3:int |
------------------------------------------------
|       | 0      | 5      | 2      |
|       | 0      | 5      | 2      |
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------
| c     | group:int | a:bag{:tuple(c1:int,c2:int,c3:int)} | b:bag{:tuple(c1:int,c2:int,c3:int)} |
-----------------------------------------------------------------------------------------------------------------
|       | 0         | {(0, 1, 2), (0, 1, 2)}              | {(0, 5, 2), (0, 5, 2)}              |
-----------------------------------------------------------------------------------------------------------------
grunt> dump c;
(0,{(0,1,2)},{(0,5,2)})
(1,{(1,3,4)},{(1,7,8)})
Now modify the A file:
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> dump c;
(0,{(0,1,2)},{(0,5,2)})
(1,{(1,3,4)},{(1,7,8)})
(3,{(3,5,7)},{})
(4,{(4,6,4)},{})
From this you can see that COGROUP associates the relations on the specified key: for each key value, the tuples that match it are collected into one bag per input relation, and those bags become the fields of the output tuple for that key.
For example, to see which values the first column of the second relation holds for each key:
grunt> e = foreach c generate $0,$2.c1;
grunt> dump e;
(0,{(0)})
(1,{(1)})
(3,{})
(4,{})
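As a further sketch (continuing with relation c from above; the FLATTEN step is standard Pig, not part of the original notes), flattening both bags turns the COGROUP result back into an inner join, because flattening an empty bag drops the record:
f = foreach c generate flatten(a), flatten(b);  -- keys 3 and 4 disappear since their b bag is empty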
group
Pig grouping
Syntax
alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
The Pig documentation explains the basics clearly, so I did not re-test the simple examples myself and only note them here:
http://pig.apache.org/docs/r0.11.1/basic.html#GROUP
Collect all tuples into a single group, typically used for computing statistics such as counts:
B = GROUP A ALL;
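A minimal sketch of the usual counting pattern on top of GROUP ALL (COUNT is a Pig built-in; the alias names here are just illustrative):
B = GROUP A ALL;
cnt = FOREACH B GENERATE COUNT(A);  -- total number of tuples in A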
Group by a specified column:
B = GROUP A BY f1;
Group by a combination of columns; the tuples in each group are collected into a bag:
B = GROUP A BY (key1,key2);
GROUP ... PARTITION BY
A custom Pig partitioner class:
package com.hcr.hadoop.pig;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {

    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        System.out.println("[LOG] key:" + key + ",value=" + value);
        // Integer keys are partitioned by value modulo the number of reducers;
        // all other key types fall back to hashCode.
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        } else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
Package it into a jar and upload it to the server:
/home/pig/pig-0.11.0/udflib/simplePartitioner.jar
grunt> register /home/pig/pig-0.11.0/udflib/simplePartitioner.jar
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> a = load 'A' using PigStorage(',');
grunt> b = group a by $0 partition by com.hcr.hadoop.pig.SimpleCustomPartitioner;
grunt> dump b;
In the output you will then see job stats information like the following:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReduceTime Alias Feature Outputs
job_201308011137_82475 1 1 2 2 2 2 9 9 9 9 a,b GROUP_BY hdfs://hadoop-master.xx.com/tmp/temp-1972808180/tmp2041929350,
The first field is the job ID.
Next, look at the MapReduce job's execution details:
http://hadoop-master.xx.com:50030/jobdetails.jsp?jobid=job_201308011137_82475
Then open the map task's logs:
stdout logs
[LOG] key:Null: false index: 0 (0),value=Null: false index: 0 (1,2)
[LOG] key:Null: false index: 0 (1),value=Null: false index: 0 (3,4)
[LOG] key:Null: false index: 0 (3),value=Null: false index: 0 (5,7)
[LOG] key:Null: false index: 0 (4),value=Null: false index: 0 (6,4)
Pig group PARALLEL
grunt> b = group a by $0 partition by com.hcr.hadoop.pig.SimpleCustomPartitioner PARALLEL 2;
grunt> dump b;
In the output you can now see that the Reduces value is 2:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReduceTime Alias Feature Outputs
job_201308011137_82476 1 2 2 2 2 2 11 9 10 10 a,b GROUP_BY hdfs://hadoop-master.xx.com/tmp/temp-1972808180/tmp-1122010364
JOIN (inner)
Syntax:
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
Official docs: http://pig.apache.org/docs/r0.11.1/basic.html#join-inner
Example: X = JOIN A BY fieldA, B BY fieldB, C BY fieldC;
'replicated' tells Pig to execute the join with the fragment-replicate algorithm; in that case no reduce task is triggered.
Example
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
'replicated' is for joining a large relation with small ones: the small relations are held in memory and the fragment-replicate technique is used.
'skewed' is used the same way as 'replicated', but it targets a different scenario:
a skewed join helps balance the work when the join-key distribution is heavily skewed (evening out the processing pressure across nodes).
For details see page 62 of Programming Pig; typing it all out is too much work, so I'll be lazy here.
'merge' has the same usage, but it applies to merge joins over data that is already sorted on the join key; see page 64 of Programming Pig.
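A minimal sketch of both variants (relation and file names are illustrative, not from the original notes):
-- skewed join: balances reducers when some join-key values are far more frequent than others
users = LOAD 'users' AS (name, city);
wlogs = LOAD 'weblogs' AS (user, url);
jnd1 = JOIN wlogs BY user, users BY name USING 'skewed';
-- merge join: both inputs must already be sorted on the join key
a1 = LOAD 'sorted_A' AS (key, v1);
b1 = LOAD 'sorted_B' AS (key, v2);
jnd2 = JOIN a1 BY key, b1 BY key USING 'merge';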
For PARTITION BY, refer to the GROUP ... PARTITION BY usage above.
Join modes: (inner join) (left join) (right join)
join / left join / right join
A null key never matches anything, not even another null.
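A sketch of that behavior (field and file names are illustrative, not from the original notes): tuples whose join key is null never appear in an inner join, and in an outer join they survive only on the outer side, with nulls filled in for the other relation's fields.
l = LOAD 'left_data' AS (k:chararray, v:int);
r = LOAD 'right_data' AS (k:chararray, w:int);
inner_jnd = JOIN l BY k, r BY k;             -- drops every tuple with k == null
left_jnd = JOIN l BY k LEFT OUTER, r BY k;   -- keeps l's null-key tuples, r's fields become null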
-- join2key.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
        volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
-- leftjoin.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
        volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
You can also join more than two relations at once, but only for inner joins:
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = load 'input3' as (e, f);
alpha = join A by x, B by u, C by e;
A relation can also be joined with itself, but the data must be loaded twice:
-- selfjoin.pig
-- For each stock, find all dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
jnd = join divs1 by symbol, divs2 by symbol;
increased = filter jnd by divs1::date < divs2::date and
        divs1::dividends < divs2::dividends;
The following does not work:
-- selfjoin.pig
-- For each stock, find all dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
jnd = join divs1 by symbol, divs1 by symbol;
increased = filter jnd by divs1::date < divs2::date and
        divs1::dividends < divs2::dividends;
UNION
Pig's UNION concatenates two (or more) relations. Their field counts and schemas may differ; where Pig can reconcile them it performs an implicit cast, and where it cannot, the result's schema becomes "Schema for xx unknown".
Cast/promotion rules:
double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> cat B;
0,5,2
1,7,8
grunt> cat C;
1,2,3,4,5,6
2,3,4,6,7
3,5,6,4,hou
4,2,5,7,2,中国人
90,10,23
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> b = load 'B' using PigStorage(',') as (x:int,y:int,z:int);
grunt> c = load 'C' using PigStorage(',');
grunt> d = union a,b,c;
grunt> dump d;
(1,2,3,4,5,6)
(2,3,4,6,7)
(3,5,6,4,hou)
(4,2,5,7,2,中国人)
(90,10,23)
(0,1,2)
(1,3,4)
(3,5,7)
(4,6,4)
(0,5,2)
(1,7,8)
grunt> describe c;
Schema for c unknown.
grunt> d = union a,b;
grunt> describe d;
d: {x: int,y: int,z: int}
CROSS
Pig Latin's CROSS computes the cross product (also called the Cartesian product) of two or more relations. It pairs every tuple of one relation with every tuple of the second (and, if there are more relations, it further pairs each result with every tuple of those relations in turn). The size of the output is the product of the sizes of the input relations, so the result can be very large.
Syntax:
alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> cat B;
0,5,2
1,7,8
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> B = load 'B' using PigStorage(',') as (x:int,y:int,z:int);
grunt> c = cross a,B;
grunt> dump c;
(0,1,2,0,5,2)
(0,1,2,1,7,8)
(1,3,4,0,5,2)
(1,3,4,1,7,8)
(3,5,7,0,5,2)
(3,5,7,1,7,8)
(4,6,4,0,5,2)
(4,6,4,1,7,8)
Because CROSS is executed in parallel and the data volume is usually large, you should always specify the number of reduce tasks:
cross x,x PARALLEL number
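For instance, a minimal sketch using the relations above (the reducer count 10 is only an illustration):
c = cross a, B PARALLEL 10;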
DEFINE
Macro DEFINE
Script:
cat A;
DEFINE my_macro(A, sortkey) RETURNS C {
    B = FILTER $A BY (c1>1 and c2>=0);
    $C = ORDER B BY $sortkey;
};
a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
b = my_macro(a,c1);
dump b;
Execution result:
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> DEFINE my_macro(A, sortkey) RETURNS C {
>> B = FILTER $A BY (c1>1 and c2>=0);
>> $C = ORDER B BY $sortkey;
>> };
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> b = my_macro(a,c1);
grunt> dump b;
(3,5,7)
(4,6,4)
RETURNS can also be void, returning nothing and performing all the work inside the macro itself, for example:
DEFINE my_macro1(A, sortkey, outfile) RETURNS void {
    B = FILTER $A BY (c1>1 and c2>=0);
    C = ORDER B BY $sortkey;
    Store C into '$outfile';
};
DEFINE my_macro2(outfile) RETURNS void {
    a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
    my_macro1(a,c1,$outfile);
};
my_macro2('macroOut');
Execution result:
grunt> DEFINE my_macro1(A, sortkey, outfile) RETURNS void {
>> B = FILTER $A BY (c1>1 and c2>=0);
>> C = ORDER B BY $sortkey;
>> Store C into '$outfile';
>> };
grunt>
grunt> DEFINE my_macro2(outfile) RETURNS void {
>> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
>> my_macro1(a,c1,$outfile);
>> };
grunt> my_macro2('macroOut');
grunt> cat macroOut
3 5 7
4 6 4
grunt>
Of course, a macro can also be written into a separate script, pulled in with IMPORT, and then used (see the IMPORT section below).
DEFINE (UDFs, streaming)
This form of DEFINE specifies UDF aliases and stream commands to execute.
Since I am not very familiar with streaming, for its usage please refer to:
http://pig.apache.org/docs/r0.11.1/basic.html#define-udfs
DISTINCT
Removes duplicate tuples.
Example
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
In this example all duplicate tuples are removed.
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
FILTER
Filters out tuples that do not satisfy a condition.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple with relation X.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
In this example the condition states that if the first field equals 8, or if the sum of fields f2 and f3 is not greater than the first field, then include the tuple in relation X.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3> f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
FOREACH
Pig's FOREACH processes a relation by reading and transforming its tuples one by one. Inside a FOREACH you can nest user-defined functions (UDFs) as well as other operations such as FLATTEN, DISTINCT, FILTER, LIMIT, and ORDER.
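A minimal sketch of a nested FOREACH, assuming the same a and b defined just below (a loaded from file A, b = group a by (x,y)); it keeps, for each group, the two tuples with the largest z:
c = foreach b {
    sorted_a = order a by z desc;
    top2 = limit sorted_a 2;
    generate group, top2;
};
The UDF-based example follows.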
The GetMax class:
package com.hcr.hadoop.pig;

import java.io.IOException;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GetMax extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple input) throws IOException {
        List<Object> list = input.getAll();
        Object[] objects = list.toArray();
        int max = 0;
        for (Object object : objects) {
            // Called as getMax(a): the argument is the bag of grouped tuples.
            // Note the break: only the first tuple of the bag is inspected.
            if (object instanceof DataBag) {
                DataBag dbag = (DataBag) object;
                for (Tuple tuple : dbag) {
                    Object x = tuple.get(0);
                    Object y = tuple.get(1);
                    Object z = tuple.get(2);
                    max = CompareInt(x, y, z);
                    break;
                }
            }
            // Called as getMax(group): the argument is the (x,y) group-key tuple.
            if (object instanceof Tuple) {
                Tuple tuple = (Tuple) object;
                Object x = tuple.get(0);
                Object y = tuple.get(1);
                max = CompareInt(x, y);
                break;
            }
        }
        return max;
    }

    int String2Int(String str) {
        return Integer.valueOf(str);
    }

    int Object2Int(Object str) {
        return String2Int(str.toString());
    }

    // Returns the largest of the arguments that can be parsed as int.
    int CompareInt(Object... objs) {
        int max = 0;
        for (Object object : objs) {
            try {
                int _tmpMax = Object2Int(object);
                if (max < _tmpMax) {
                    max = _tmpMax;
                }
            } catch (Exception e) {
                System.out.println("[LOG] value object=" + object + " cannot be converted to int");
                e.printStackTrace();
            }
        }
        return max;
    }
}
grunt> register /home/pig/pig-0.11.0/udflib/getMax.jar
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> b = group a by (x,y);
grunt> define getMax com.hcr.hadoop.pig.GetMax();
grunt> c = foreach b {
>> generate group, getMax(group);
>> };
grunt> dump c;
((0,1),1)
((1,3),3)
((3,5),5)
((4,6),6)
That takes the largest value among the group-key fields. To instead take the largest value from the data inside each group, pass the bag a to the UDF:
c = foreach b {
    generate group, getMax(a);
};
Taking the maximum over the data in each group:
grunt> c = foreach b {
>> generate group, getMax(a);
>> };
grunt> dump c;
((0,1),2)
((1,3),4)
((3,5),7)
((4,6),6)
IMPORT
Pig's IMPORT statement pulls a script of macros into the current session, which makes them convenient to reuse.
[pig@ebsdi-23260-oozie scripts]$ pwd
/home/pig/pig-0.11.0/scripts
[pig@ebsdi-23260-oozie scripts]$ ll
total 4
-rw-rw-r-- 1 pig pig 300 09-29 13:42 my_macro.pig
[pig@ebsdi-23260-oozie scripts]$ cat my_macro.pig
DEFINE import_my_macro(A, sortkey, outfile) RETURNS void {
B = FILTER $A BY (c1>1 and c2>=0);
C = ORDER B BY $sortkey;
Store C into '$outfile';
};
DEFINE import_my_macro2(outfile) RETURNS void {
a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
import_my_macro(a,c1,$outfile);
};
Execution result:
grunt> IMPORT '/home/pig/pig-0.11.0/scripts/my_macro.pig';
grunt> import_my_macro2('importMacroOut');
After the job finishes, check the result:
grunt> cat importMacroOut;
3 5 7
4 6 4
LIMIT
Takes a specified number of tuples. In Pig, apart from ORDER BY, nothing produces output in a guaranteed order, so which tuples LIMIT returns is not deterministic.
If the specified number is larger than the actual number of tuples, all tuples are returned; otherwise exactly the specified number is returned.
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> b = limit a 10;
grunt> dump b;
(0,1,2)
(1,3,4)
(3,5,7)
(4,6,4)
grunt> b = limit a 2;
grunt> dump b;
(0,1,2)
(1,3,4)
MAPREDUCE
A feature added in Pig 0.8. It is typically used when a native MapReduce job handles a step better than Pig, but the step still has to be combined with the preceding Pig data flow.
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar'
        STORE A INTO 'inputDir'
        LOAD 'outputDir' AS (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;
ORDER BY
Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
ORDER BY is used much like GROUP, except the sort fields are not wrapped in parentheses.
Programming Pig points out that, to cope with data skew, Pig's ORDER deliberately breaks the usual guarantee that all records with the same key go to a single reducer. So if your logic depends on every record for a key being processed by one reducer (for example when you specify a custom partitioner), do not rely on Pig's ORDER, or the results may be incorrect.
A = LOAD 'mydata' AS (x: int, y:map[]);
B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
RANK
Adds a column of rank (sequence) values to each tuple. Details: http://pig.apache.org/docs/r0.11.1/basic.html#rank
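A minimal sketch (using the relation a loaded from A earlier; not re-run here):
ranked = rank a;                  -- prepends a 1-based sequence number to every tuple
ranked_by_x = rank a by x desc;   -- ranks by x; tied values share the same rank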
SAMPLE
Randomly selects tuples according to the specified sampling rate. Details: http://pig.apache.org/docs/r0.11.1/basic.html#sample
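A minimal sketch (again on relation a; the result varies between runs because sampling is probabilistic):
s = sample a 0.5;   -- keeps roughly 50% of a's tuples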
SPLIT
Sometimes we need to route the tuples of one relation into several categories, for example
putting good records into a "good" relation and bad records into a "bad" relation. (See page 340 of Hadoop: The Definitive Guide for an example.)
Example
In this example relation A is split intothree relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF(f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
Example
In this example, the SPLIT and FILTER statements are essentially equivalent. However, because SPLIT is implemented as "split the data stream and then apply filters", the SPLIT statement is more expensive than the FILTER statement because Pig needs to filter and store two data streams.
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);
-- where ignored_var is not used elsewhere
output_var = FILTER input_var BY (field1 isnot null);
STORE
Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
In this example, the CONCAT function is used to format the data before it is stored.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = FOREACH A GENERATE CONCAT('a:',(chararray)a1), CONCAT('b:',(chararray)a2), CONCAT('c:',(chararray)a3);
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
STORE B INTO 'myoutput' using PigStorage(',');
CAT myoutput;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
STREAM
Executes an external program or script inside the data flow.
http://pig.apache.org/docs/r0.11.1/basic.html#stream
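A minimal sketch (the script name stream.pl and its path are illustrative; any executable available on the task nodes could be used):
A = LOAD 'data';
B = STREAM A THROUGH `stream.pl`;   -- pipe every tuple through the script's stdin/stdout
-- or give the command an alias and ship the script to the cluster:
DEFINE cmd `stream.pl` SHIP('/local/path/stream.pl');
C = STREAM A THROUGH cmd;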
Summary:
Which Pig operators trigger a reduce phase:
GROUP: because GROUP collects all records with the same key together, data being processed in the map phase is forced through a shuffle → reduce phase.
ORDER: because all matching records must be brought together before they can be sorted, ORDER triggers a reduce phase. In addition to the job you wrote, Pig also adds an extra MapReduce job to the pipeline to sample your data and determine its distribution, so that a badly skewed distribution does not make the job extremely inefficient.
DISTINCT: because records have to be collected together before duplicates can be detected, DISTINCT triggers a reduce phase. DISTINCT does, however, use the combiner to remove duplicate records already in the map phase.
JOIN: JOIN matches records on a key, and since all records with the same key must be collected together, it triggers a reduce phase.
LIMIT: because the records must be collected together before the requested number can be counted out, LIMIT triggers a reduce phase.
COGROUP: similar to GROUP (see the earlier part of this article), so it triggers a reduce phase.
CROSS: computes the cross product of two or more relations.
Reference: http://www.cnblogs.com/uttu/archive/2013/02/19/2917438.html