Pig: Introduction to Latin - 2


Relational Operations

  • foreach

foreach takes a set of expressions and applies them to every record in the data pipeline.

 

A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray,preferences:map[]);
B = foreach A generate user, id;

 

prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,volume, adj_close);
gain = foreach prices generate close - open;
gain2 = foreach prices generate $6 - $3;

 

prices = load 'NYSE_daily' as (exchange, symbol, date, open,high, low, close, volume, adj_close);
beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open
middle = foreach prices generate open..close; -- produces open, high, low, close
end = foreach prices generate volume..; -- produces volume, adj_close

 

bball = load 'baseball' as (name:chararray, team:chararray,position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';

 

A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;

 

A = load 'input' as (b:bag{t:(x:int, y:int)});
B1 = foreach A generate b.x;

B2 = foreach A generate b.(x, y);
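
What the two bag projections above produce can be modeled in a few lines of Python. This is only a sketch of the semantics, not Pig code: a bag is represented as a list of tuples, and SCHEMA and project_bag are hypothetical names introduced for illustration.

```python
# Model of b:bag{t:(x:int, y:int)}: field names mapped to tuple positions.
SCHEMA = {"x": 0, "y": 1}

def project_bag(bag, fields):
    """Model of b.(f1, f2, ...): keep the named fields of every tuple in the bag."""
    positions = [SCHEMA[name] for name in fields]
    return [tuple(t[p] for p in positions) for t in bag]

b = [(1, 10), (2, 20), (3, 30)]
b1 = project_bag(b, ["x"])       # like b.x: still a bag, of one-field tuples
b2 = project_bag(b, ["x", "y"])  # like b.(x, y)
print(b1)
print(b2)
```

The point of the model is that projecting a bag yields a new bag, not a scalar value.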

 

Note: For fields that are simple projections with no other operators applied, Pig keeps the same name as before. Once any expression beyond simple projection is applied, Pig does not assign a name to the field. You can assign a name with the as clause.

 

  • Filter

The filter statement allows you to select which records will be retained in your data pipeline. A filter contains a predicate. If that predicate evaluates to true for a given record, that record will be passed down the pipeline. Otherwise, it will not.

 

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';

notstartswithcm = filter divs by not symbol matches 'CM.*';

 

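Note:

Precedence is not, then and, then or, so a and b or not c is equivalent to (a and b) or (not c).

Pig will short-circuit Boolean operations when possible.

Nulls follow three-valued logic: x == null results in a value of null, not true or false, and null neither matches nor fails to match any regular expression. A filter passes only records for which the predicate is true, so use is null or is not null to test for null values.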
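
The effect of nulls on the two filters above can be modeled in Python. This is a sketch of the semantics only, with None standing in for Pig's null; matches, not3, and pig_filter are hypothetical helper names, not Pig functions.

```python
import re

def matches(value, pattern):
    """Model of matches: whole-string regex match; a null input gives null."""
    if value is None:
        return None
    return re.fullmatch(pattern, value) is not None

def not3(truth):
    """Three-valued not: null stays null."""
    return None if truth is None else not truth

def pig_filter(records, predicate):
    """Keep only records whose predicate is exactly true; null is dropped."""
    return [r for r in records if predicate(r) is True]

symbols = ["CME", "IBM", None]
starts = pig_filter(symbols, lambda s: matches(s, "CM.*"))
not_starts = pig_filter(symbols, lambda s: not3(matches(s, "CM.*")))
print(starts)
print(not_starts)
```

In this model the record with a null symbol appears in neither result, which is why a filter and its negation do not always add up to the full input.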

 

  • Group

The group statement collects together records with the same key: all records with the same value for the provided key are collected into a bag.

 

daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily by stock;
cnt = foreach grpd generate group, COUNT(daily);

 

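daily = load 'NYSE_daily' as (exchange, stock, date, dividends); -- dividends must be in the schema for AVG below
grpd = group daily by (exchange, stock); -- group by multiple keys
avg = foreach grpd generate group, AVG(daily.dividends);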

 

grpd = group daily all;
cnt = foreach grpd generate COUNT(daily);

 

Note: Because grouping collects all records together with the same value for the key, you often get skewed results: some reducers receive far more records than others. Pig has a number of ways that it tries to manage this skew to balance out the load across your reducers. The one that applies to grouping is Hadoop's combiner, which can greatly reduce the amount of data shipped over the network and written to disk.
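
How a combiner helps a grouped COUNT can be sketched in Python. This is a simplified model under stated assumptions, not Pig's or Hadoop's implementation: each list stands for the records seen by one map task, and map_side_combine and reduce_side_merge are hypothetical names.

```python
from collections import Counter

def map_side_combine(records, key_index):
    """Combiner model: count records per key inside one map task."""
    counts = Counter()
    for rec in records:
        counts[rec[key_index]] += 1
    return counts

def reduce_side_merge(partials):
    """Reducer model: add up the partial counts from all map tasks."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return dict(total)

map_task_1 = [("NYSE", "CME"), ("NYSE", "CME"), ("NYSE", "IBM")]
map_task_2 = [("NYSE", "CME"), ("NYSE", "IBM")]
partials = [map_side_combine(task, 1) for task in (map_task_1, map_task_2)]
cnt = reduce_side_merge(partials)
shipped = sum(len(p) for p in partials)  # partial counts sent to reducers
print(cnt, shipped)
```

Each map task ships at most one partial count per key instead of one record per input row, so a very frequent key costs the reducer little more than a rare one.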

 

  • Order by

The order statement sorts your data for you, producing a total order of your output data.

 

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,date:chararray, open:float, high:float,  low:float, close:float,volume:int, adj_close:float);

bydate = order daily by date;

bydatensymbol = order daily by date, symbol;

byclose = order daily by close desc, open;

 

Note: Like group, order can produce skew, because the values of the sort key are rarely distributed evenly. Pig addresses this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order.
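
The sample-then-partition idea can be sketched in Python. This is a plain range partitioner written for illustration, assuming a non-empty list of keys; it is not Pig's actual partitioner, and build_boundaries and partition are hypothetical names.

```python
import bisect
import random

def build_boundaries(keys, num_reducers, sample_size, seed=0):
    """Sample the keys and pick num_reducers - 1 cut points at even quantiles."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    return [sample[(i * len(sample)) // num_reducers]
            for i in range(1, num_reducers)]

def partition(keys, boundaries):
    """Send each key to the reducer that owns its key range."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for k in keys:
        parts[bisect.bisect_right(boundaries, k)].append(k)
    return parts

rng = random.Random(42)
keys = [rng.randint(0, 999) for _ in range(1000)]
boundaries = build_boundaries(keys, num_reducers=4, sample_size=100)
parts = partition(keys, boundaries)
# Each reducer sorts its own range; concatenating the ranges is a total order.
ordered = [k for part in parts for k in sorted(part)]
print([len(p) for p in parts])
```

Because the cut points come from the sampled distribution rather than from the key space, dense key ranges get narrower partitions and the reducers end up with roughly similar loads.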

 

  • Distinct

The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields.

 

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;

 

Note: Because it needs to collect like records together in order to determine whether they are duplicates, distinct forces a reduce phase. It does make use of the combiner to remove any duplicates it can in the map phase.
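
The two-stage duplicate removal can be modeled in Python as follows. This is only a sketch: each list stands for one map task's records, and combine_distinct and reduce_distinct are hypothetical names.

```python
def combine_distinct(records):
    """Combiner model: drop duplicates visible within one map task."""
    return set(records)

def reduce_distinct(partials):
    """Reducer model: drop duplicates that span map tasks."""
    return sorted(set().union(*partials))

map_task_1 = [("NYSE", "CME"), ("NYSE", "CME"), ("NYSE", "IBM")]
map_task_2 = [("NYSE", "IBM"), ("NYSE", "GE")]
partials = [combine_distinct(task) for task in (map_task_1, map_task_2)]
uniq = reduce_distinct(partials)
print(uniq)
```

The reduce step is still required because the same record can appear in more than one map task, as ("NYSE", "IBM") does here.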

 

  • Join

Join selects records from one input to put together with records from another input.

 

daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);


jnd1 = join daily by symbol, divs by symbol;

jnd2 = join daily by (symbol, date), divs by (symbol, date);

jnd3 = join daily by (symbol, date) left outer, divs by (symbol, date);

 

Note: Pig does these joins in MapReduce by using the map phase to annotate each record with which input it came from. It then uses the join key as the shuffle key. Thus join forces a new reduce phase. Once all of the records with the same value for the key are collected together, Pig does a cross product between the records from both inputs. To minimize memory usage, it has MapReduce order the records coming into the reducer using the input annotation it added in the map phase. Thus all of the records for the left input arrive first. Pig caches these in memory. All of the records for the right input arrive second. As each of these records arrives, it is crossed with each record from the left side to produce an output record. In a multiway join, the left n - 1 inputs are held in memory, and the nth is streamed through. It is important to keep this in mind when writing joins in your Pig queries if you know that one of your inputs has more records per value of the chosen key. Placing that input on the right side of your join will lower memory usage and possibly increase your script's performance.
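
The mechanism described in the note can be sketched in Python for a two-way inner join. This is a simplified single-process model, not Pig's implementation; reduce_side_join and the sample records are made up for illustration.

```python
from itertools import groupby

def reduce_side_join(left, right, key):
    """Model of a reduce-side inner join on key(record)."""
    # Map phase: annotate each record with its input (0 = left, 1 = right).
    tagged = [(key(r), 0, r) for r in left] + [(key(r), 1, r) for r in right]
    # Shuffle: order by join key, then by tag, so left records arrive first.
    tagged.sort(key=lambda t: (t[0], t[1]))
    out = []
    for _, group in groupby(tagged, key=lambda t: t[0]):
        cached_left = []  # left records for this key are held in memory
        for _, tag, rec in group:
            if tag == 0:
                cached_left.append(rec)
            else:
                # Each right record is streamed and crossed with the cache.
                for left_rec in cached_left:
                    out.append(left_rec + rec)
    return out

daily = [("CME", "2009-01-02", 10.0), ("CME", "2009-01-03", 11.0),
         ("IBM", "2009-01-02", 80.0)]
divs = [("CME", 0.5), ("GE", 0.3)]
jnd = reduce_side_join(daily, divs, key=lambda r: r[0])
print(jnd)
```

Only cached_left grows with the number of records per key, which is why the input with more records per key value belongs on the right side.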

 

  • Sample

sample offers a simple way to get a sample of your data.

 

divs = load 'NYSE_dividends';
some = sample divs 0.1;
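
One way to picture sample is as an independent coin flip per record, so the result holds roughly, not exactly, the requested fraction of the input. The Python sketch below models that reading; pig_sample and its seed parameter are hypothetical and exist only to make the example repeatable.

```python
import random

def pig_sample(records, p, seed=None):
    """Keep each record independently with probability p."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < p]

records = list(range(1000))
some = pig_sample(records, 0.1, seed=7)
print(len(some))  # close to 100, but not guaranteed to be exactly 100
```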

 

  • Parallel

The parallel clause can be attached to any relational operator in Pig Latin. However, it controls only reduce-side parallelism, so it makes sense only for operators that force a reduce phase. These are: group*, order, distinct, join*, limit, cogroup*, and cross. (* Some implementations of these operators do not force a reduce phase.)

 

daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
bysymbl = group daily by symbol parallel 10;

 

Note: parallel clauses apply only to the statement to which they are attached; they do not carry through the script. To set a default parallel value for the whole script, put set default_parallel 10; before any other commands in the script.

 

 
