Pig relational operator examples
cogroup
- Takes two relations, groups each by the specified field, and collects the tuples of both relations under the shared key column.
grunt> cat A;
0,1,2
1,3,4
grunt> cat B;
0,5,2
1,7,8
grunt> b = load 'B' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> c = cogroup a by c1,b by c1;
grunt> illustrate c;
------------------------------------------------
| a     | c1:int | c2:int | c3:int |
------------------------------------------------
|       | 0      | 1      | 2      |
|       | 0      | 1      | 2      |
------------------------------------------------
------------------------------------------------
| b     | c1:int | c2:int | c3:int |
------------------------------------------------
|       | 0      | 5      | 2      |
|       | 0      | 5      | 2      |
------------------------------------------------
-----------------------------------------------------------------------------------------------------------------
| c     | group:int | a:bag{:tuple(c1:int,c2:int,c3:int)} | b:bag{:tuple(c1:int,c2:int,c3:int)} |
-----------------------------------------------------------------------------------------------------------------
|       | 0         | {(0, 1, 2), (0, 1, 2)}              | {(0, 5, 2), (0, 5, 2)}              |
-----------------------------------------------------------------------------------------------------------------
grunt> dump c;
(0,{(0,1,2)},{(0,5,2)})
(1,{(1,3,4)},{(1,7,8)})
Now modify the A file:
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> dump c;
(0,{(0,1,2)},{(0,5,2)})
(1,{(1,3,4)},{(1,7,8)})
(3,{(3,5,7)},{})
(4,{(4,6,4)},{})
From this you can see that COGROUP associates the relations on the specified key: for each key value, the tuples that match it are collected into one bag per input relation, and those bags become the fields of the output tuple for that key.
For example, to see which values the first column of the second relation holds for each key:
grunt> e = foreach c generate $0,$2.c1;
grunt> dump e;
(0,{(0)})
(1,{(1)})
(3,{})
(4,{})
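As a further sketch (continuing with relation c from above; the FLATTEN step is standard Pig, not part of the original notes), flattening both bags turns the COGROUP result back into an inner join, because flattening an empty bag drops the record:
f = foreach c generate flatten(a), flatten(b);  -- keys 3 and 4 disappear since their b bag is empty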
group
Pig grouping
Syntax
alias = GROUP alias { ALL | BY expression } [, alias ALL | BY expression …] [USING 'collected' | 'merge'] [PARTITION BY partitioner] [PARALLEL n];
The Pig documentation explains the basics clearly, so I did not re-test the simple examples myself and only note them here:
http://pig.apache.org/docs/r0.11.1/basic.html#GROUP
Collect all tuples into a single group, typically used for computing statistics such as counts:
B = GROUP A ALL;
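A minimal sketch of the usual counting pattern on top of GROUP ALL (COUNT is a Pig built-in; the alias names here are just illustrative):
B = GROUP A ALL;
cnt = FOREACH B GENERATE COUNT(A);  -- total number of tuples in A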
Group by a specified column:
B = GROUP A BY f1;
Group by a combination of columns; the tuples in each group are collected into a bag:
B = GROUP A BY (key1,key2);
GROUP ... PARTITION BY
A custom Pig partitioner class:
package com.hcr.hadoop.pig;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.pig.impl.io.PigNullableWritable;

public class SimpleCustomPartitioner extends Partitioner<PigNullableWritable, Writable> {

    @Override
    public int getPartition(PigNullableWritable key, Writable value, int numPartitions) {
        System.out.println("[LOG] key:" + key + ",value=" + value);
        // Integer keys are partitioned by value modulo the number of reducers;
        // all other key types fall back to hashCode.
        if (key.getValueAsPigType() instanceof Integer) {
            int ret = (((Integer) key.getValueAsPigType()).intValue() % numPartitions);
            return ret;
        } else {
            return (key.hashCode()) % numPartitions;
        }
    }
}
Package it into a jar and upload it to the server:
/home/pig/pig-0.11.0/udflib/simplePartitioner.jar
grunt> register /home/pig/pig-0.11.0/udflib/simplePartitioner.jar
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> a = load 'A' using PigStorage(',');
grunt> b = group a by $0 partition by com.hcr.hadoop.pig.SimpleCustomPartitioner;
grunt> dump b;
In the output you will then see job stats information like the following:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReduceTime Alias Feature Outputs
job_201308011137_82475 1 1 2 2 2 2 9 9 9 9 a,b GROUP_BY hdfs://hadoop-master.xx.com/tmp/temp-1972808180/tmp2041929350,
The first field is the job ID.
Next, look at the MapReduce job's execution details:
http://hadoop-master.xx.com:50030/jobdetails.jsp?jobid=job_201308011137_82475
Then open the map task's logs:
stdout logs
[LOG] key:Null: false index: 0 (0),value=Null: false index: 0 (1,2)
[LOG] key:Null: false index: 0 (1),value=Null: false index: 0 (3,4)
[LOG] key:Null: false index: 0 (3),value=Null: false index: 0 (5,7)
[LOG] key:Null: false index: 0 (4),value=Null: false index: 0 (6,4)
Pig group PARALLEL
grunt> b = group a by $0 partition by com.hcr.hadoop.pig.SimpleCustomPartitioner PARALLEL 2;
grunt> dump b;
In the output you can now see that the Reduces value is 2:
Job Stats (time in seconds):
JobId Maps Reduces MaxMapTime MinMapTime AvgMapTime MedianMapTime MaxReduceTime MinReduceTime AvgReduceTime MedianReduceTime Alias Feature Outputs
job_201308011137_82476 1 2 2 2 2 2 11 9 10 10 a,b GROUP_BY hdfs://hadoop-master.xx.com/tmp/temp-1972808180/tmp-1122010364
JOIN (inner)
Syntax:
alias = JOIN alias BY {expression|'('expression [, expression …]')'} (, alias BY {expression|'('expression [, expression …]')'} …) [USING 'replicated' | 'skewed' | 'merge' | 'merge-sparse'] [PARTITION BY partitioner] [PARALLEL n];
Official docs: http://pig.apache.org/docs/r0.11.1/basic.html#join-inner
Example: X = JOIN A BY fieldA, B BY fieldB, C BY fieldC;
'replicated' tells Pig to execute the join with the fragment-replicate algorithm; in that case no reduce task is triggered.
Example
big = LOAD 'big_data' AS (b1,b2,b3);
tiny = LOAD 'tiny_data' AS (t1,t2,t3);
mini = LOAD 'mini_data' AS (m1,m2,m3);
C = JOIN big BY b1, tiny BY t1, mini BY m1 USING 'replicated';
'replicated' is for joining a large relation with small ones: the small relations are held in memory and the fragment-replicate technique is used.
'skewed' is used the same way as 'replicated', but it targets a different scenario:
a skewed join helps balance the work when the join-key distribution is heavily skewed (evening out the processing pressure across nodes).
For details see page 62 of Programming Pig; typing it all out is too much work, so I'll be lazy here.
'merge' has the same usage, but it applies to merge joins over data that is already sorted on the join key; see page 64 of Programming Pig.
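A minimal sketch of both variants (relation and file names are illustrative, not from the original notes):
-- skewed join: balances reducers when some join-key values are far more frequent than others
users = LOAD 'users' AS (name, city);
wlogs = LOAD 'weblogs' AS (user, url);
jnd1 = JOIN wlogs BY user, users BY name USING 'skewed';
-- merge join: both inputs must already be sorted on the join key
a1 = LOAD 'sorted_A' AS (key, v1);
b1 = LOAD 'sorted_B' AS (key, v2);
jnd2 = JOIN a1 BY key, b1 BY key USING 'merge';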
For PARTITION BY, refer to the GROUP ... PARTITION BY usage above.
Join modes: (inner join) (left join) (right join)
join / left join / right join
A null key never matches anything, not even another null.
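A sketch of that behavior (field and file names are illustrative, not from the original notes): tuples whose join key is null never appear in an inner join, and in an outer join they survive only on the outer side, with nulls filled in for the other relation's fields.
l = LOAD 'left_data' AS (k:chararray, v:int);
r = LOAD 'right_data' AS (k:chararray, w:int);
inner_jnd = JOIN l BY k, r BY k;             -- drops every tuple with k == null
left_jnd = JOIN l BY k LEFT OUTER, r BY k;   -- keeps l's null-key tuples, r's fields become null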
-- join2key.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
        volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date), divs by (symbol, date);
-- leftjoin.pig
daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close,
        volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);
jnd = join daily by (symbol, date) left outer, divs by (symbol, date);
You can also join more than two relations at once, but only for inner joins:
A = load 'input1' as (x, y);
B = load 'input2' as (u, v);
C = load 'input3' as (e, f);
alpha = join A by x, B by u, C by e;
A relation can also be joined with itself, but the data must be loaded twice:
-- selfjoin.pig
-- For each stock, find all dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
divs2 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
jnd = join divs1 by symbol, divs2 by symbol;
increased = filter jnd by divs1::date < divs2::date and
        divs1::dividends < divs2::dividends;
The following does not work:
-- selfjoin.pig
-- For each stock, find all dividends that increased between two dates
divs1 = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,
        date:chararray, dividends);
jnd = join divs1 by symbol, divs1 by symbol;
increased = filter jnd by divs1::date < divs2::date and
        divs1::dividends < divs2::dividends;
UNION
Pig's UNION concatenates two (or more) relations. Their field counts and schemas may differ; where Pig can reconcile them it performs an implicit cast, and where it cannot, the result's schema becomes "Schema for xx unknown".
Cast/promotion rules:
double > float > long > int > bytearray
tuple|bag|map|chararray > bytearray
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> cat B;
0,5,2
1,7,8
grunt> cat C;
1,2,3,4,5,6
2,3,4,6,7
3,5,6,4,hou
4,2,5,7,2,中国人
90,10,23
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> b = load 'B' using PigStorage(',') as (x:int,y:int,z:int);
grunt> c = load 'C' using PigStorage(',');
grunt> d = union a,b,c;
grunt> dump d;
(1,2,3,4,5,6)
(2,3,4,6,7)
(3,5,6,4,hou)
(4,2,5,7,2,中国人)
(90,10,23)
(0,1,2)
(1,3,4)
(3,5,7)
(4,6,4)
(0,5,2)
(1,7,8)
grunt> describe c;
Schema for c unknown.
grunt> d = union a,b;
grunt> describe d;
d: {x: int,y: int,z: int}
CROSS
Pig Latin's CROSS computes the cross product (also called the Cartesian product) of two or more relations. It pairs every tuple of one relation with every tuple of the second (and, if there are more relations, it further pairs each result with every tuple of those relations in turn). The size of the output is the product of the sizes of the input relations, so the result can be very large.
Syntax:
alias = CROSS alias, alias [, alias …] [PARTITION BY partitioner] [PARALLEL n];
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> cat B;
0,5,2
1,7,8
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> B = load 'B' using PigStorage(',') as (x:int,y:int,z:int);
grunt> c = cross a,B;
grunt> dump c;
(0,1,2,0,5,2)
(0,1,2,1,7,8)
(1,3,4,0,5,2)
(1,3,4,1,7,8)
(3,5,7,0,5,2)
(3,5,7,1,7,8)
(4,6,4,0,5,2)
(4,6,4,1,7,8)
Because CROSS is executed in parallel and the data volume is usually large, you should always specify the number of reduce tasks:
cross x,x PARALLEL number
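For instance, a minimal sketch using the relations above (the reducer count 10 is only an illustration):
c = cross a, B PARALLEL 10;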
DEFINE
Macro DEFINE
Script:
cat A;
DEFINE my_macro(A, sortkey) RETURNS C {
    B = FILTER $A BY (c1>1 and c2>=0);
    $C = ORDER B BY $sortkey;
};
a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
b = my_macro(a,c1);
dump b;
Execution result:
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> DEFINE my_macro(A, sortkey) RETURNS C {
>> B = FILTER $A BY (c1>1 and c2>=0);
>> $C = ORDER B BY $sortkey;
>> };
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> b = my_macro(a,c1);
grunt> dump b;
(3,5,7)
(4,6,4)
RETURNS can also be void, returning nothing and performing all the work inside the macro itself, for example:
DEFINE my_macro1(A, sortkey, outfile) RETURNS void {
    B = FILTER $A BY (c1>1 and c2>=0);
    C = ORDER B BY $sortkey;
    Store C into '$outfile';
};
DEFINE my_macro2(outfile) RETURNS void {
    a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
    my_macro1(a,c1,$outfile);
};
my_macro2('macroOut');
Execution result:
grunt> DEFINE my_macro1(A, sortkey, outfile) RETURNS void {
>> B = FILTER $A BY (c1>1 and c2>=0);
>> C = ORDER B BY $sortkey;
>> Store C into '$outfile';
>> };
grunt>
grunt> DEFINE my_macro2(outfile) RETURNS void {
>> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
>> my_macro1(a,c1,$outfile);
>> };
grunt> my_macro2('macroOut');
grunt> cat macroOut
3 5 7
4 6 4
grunt>
Of course, a macro can also be written into a separate script, pulled in with IMPORT, and then used (see the IMPORT section below).
DEFINE (UDFs, streaming)
This form of DEFINE specifies UDF aliases and stream commands to execute.
Since I am not very familiar with streaming, for its usage please refer to:
http://pig.apache.org/docs/r0.11.1/basic.html#define-udfs
DISTINCT
Removes duplicate tuples.
Example
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(8,3,4)
(1,2,3)
(4,3,3)
(4,3,3)
(1,2,3)
In this example all duplicate tuples are removed.
X = DISTINCT A;
DUMP X;
(1,2,3)
(4,3,3)
(8,3,4)
FILTER
Filters out tuples that do not satisfy a condition.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example the condition states that if the third field equals 3, then include the tuple with relation X.
X = FILTER A BY f3 == 3;
DUMP X;
(1,2,3)
(4,3,3)
(8,4,3)
In this example the condition states that if the first field equals 8, or if the sum of fields f2 and f3 is not greater than the first field, then include the tuple in relation X.
X = FILTER A BY (f1 == 8) OR (NOT (f2+f3> f1));
DUMP X;
(4,2,1)
(8,3,4)
(7,2,5)
(8,4,3)
FOREACH
Pig's FOREACH processes a relation by reading and transforming its tuples one by one. Inside a FOREACH you can nest user-defined functions (UDFs) as well as other operations such as FLATTEN, DISTINCT, FILTER, LIMIT, and ORDER.
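A minimal sketch of a nested FOREACH, assuming the same a and b defined just below (a loaded from file A, b = group a by (x,y)); it keeps, for each group, the two tuples with the largest z:
c = foreach b {
    sorted_a = order a by z desc;
    top2 = limit sorted_a 2;
    generate group, top2;
};
The UDF-based example follows.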
The GetMax class:
package com.hcr.hadoop.pig;

import java.io.IOException;
import java.util.List;

import org.apache.pig.EvalFunc;
import org.apache.pig.data.DataBag;
import org.apache.pig.data.Tuple;

public class GetMax extends EvalFunc<Integer> {

    @Override
    public Integer exec(Tuple input) throws IOException {
        List<Object> list = input.getAll();
        Object[] objects = list.toArray();
        int max = 0;
        for (Object object : objects) {
            // Called as getMax(a): the argument is the bag of grouped tuples.
            // Note the break: only the first tuple of the bag is inspected.
            if (object instanceof DataBag) {
                DataBag dbag = (DataBag) object;
                for (Tuple tuple : dbag) {
                    Object x = tuple.get(0);
                    Object y = tuple.get(1);
                    Object z = tuple.get(2);
                    max = CompareInt(x, y, z);
                    break;
                }
            }
            // Called as getMax(group): the argument is the (x,y) group-key tuple.
            if (object instanceof Tuple) {
                Tuple tuple = (Tuple) object;
                Object x = tuple.get(0);
                Object y = tuple.get(1);
                max = CompareInt(x, y);
                break;
            }
        }
        return max;
    }

    int String2Int(String str) {
        return Integer.valueOf(str);
    }

    int Object2Int(Object str) {
        return String2Int(str.toString());
    }

    // Returns the largest of the arguments that can be parsed as int.
    int CompareInt(Object... objs) {
        int max = 0;
        for (Object object : objs) {
            try {
                int _tmpMax = Object2Int(object);
                if (max < _tmpMax) {
                    max = _tmpMax;
                }
            } catch (Exception e) {
                System.out.println("[LOG] value object=" + object + " cannot be converted to int");
                e.printStackTrace();
            }
        }
        return max;
    }
}
grunt> register /home/pig/pig-0.11.0/udflib/getMax.jar
grunt> cat A;
0,1,2
1,3,4
3,5,7
4,6,4
grunt> a = load 'A' using PigStorage(',') as (x:int,y:int,z:int);
grunt> b = group a by (x,y);
grunt> define getMax com.hcr.hadoop.pig.GetMax();
grunt> c = foreach b {
>> generate group, getMax(group);
>> };
grunt> dump c;
((0,1),1)
((1,3),3)
((3,5),5)
((4,6),6)
That takes the largest value among the group-key fields. To instead take the largest value from the data inside each group, pass the bag a to the UDF:
c = foreach b {
    generate group, getMax(a);
};
Taking the maximum over the data in each group:
grunt> c = foreach b {
>> generate group, getMax(a);
>> };
grunt> dump c;
((0,1),2)
((1,3),4)
((3,5),7)
((4,6),6)
IMPORT
Pig's IMPORT statement pulls a script of macros into the current session, which makes them convenient to reuse.
[pig@ebsdi-23260-oozie scripts]$ pwd
/home/pig/pig-0.11.0/scripts
[pig@ebsdi-23260-oozie scripts]$ ll
total 4
-rw-rw-r-- 1 pig pig 300 09-29 13:42 my_macro.pig
[pig@ebsdi-23260-oozie scripts]$ cat my_macro.pig
DEFINE import_my_macro(A, sortkey, outfile) RETURNS void {
B = FILTER $A BY (c1>1 and c2>=0);
C = ORDER B BY $sortkey;
Store C into '$outfile';
};
DEFINE import_my_macro2(outfile) RETURNS void {
a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
import_my_macro(a,c1,$outfile);
};
Execution result:
grunt> IMPORT '/home/pig/pig-0.11.0/scripts/my_macro.pig';
grunt> import_my_macro2('importMacroOut');
After the job finishes, check the result:
grunt> cat importMacroOut;
3 5 7
4 6 4
LIMIT
Takes a specified number of tuples. In Pig, apart from ORDER BY, nothing produces output in a guaranteed order, so which tuples LIMIT returns is not deterministic.
If the specified number is larger than the actual number of tuples, all tuples are returned; otherwise exactly the specified number is returned.
grunt> a = load 'A' using PigStorage(',') as (c1:int,c2:int,c3:int);
grunt> b = limit a 10;
grunt> dump b;
(0,1,2)
(1,3,4)
(3,5,7)
(4,6,4)
grunt> b = limit a 2;
grunt> dump b;
(0,1,2)
(1,3,4)
MAPREDUCE
A feature added in Pig 0.8. It is typically used when a native MapReduce job handles a step better than Pig, but the step still has to be combined with the preceding Pig data flow.
A = LOAD 'WordcountInput.txt';
B = MAPREDUCE 'wordcount.jar'
        STORE A INTO 'inputDir'
        LOAD 'outputDir' AS (word:chararray, count:int)
        `org.myorg.WordCount inputDir outputDir`;
ORDER BY
Syntax
alias = ORDER alias BY { * [ASC|DESC] | field_alias [ASC|DESC] [, field_alias [ASC|DESC] …] } [PARALLEL n];
ORDER BY is used much like GROUP, except the sort fields are not wrapped in parentheses.
Programming Pig points out that, to cope with data skew, Pig's ORDER deliberately breaks the usual guarantee that all records with the same key go to a single reducer. So if your logic depends on every record for a key being processed by one reducer (for example when you specify a custom partitioner), do not rely on Pig's ORDER, or the results may be incorrect.
A = LOAD 'mydata' AS (x: int, y:map[]);
B = ORDER A BY x; -- this is allowed because x is a simple type
B = ORDER A BY y; -- this is not allowed because y is a complex type
B = ORDER A BY y#'id'; -- this is not allowed because y#'id' is an expression
Examples
Suppose we have relation A.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
In this example relation A is sorted by the third field, a3, in descending order. Note that the order of the three tuples ending in 3 can vary.
X = ORDER A BY a3 DESC;
DUMP X;
(7,2,5)
(8,3,4)
(1,2,3)
(4,3,3)
(8,4,3)
(4,2,1)
RANK
Adds a column of rank (sequence) values to each tuple. Details: http://pig.apache.org/docs/r0.11.1/basic.html#rank
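A minimal sketch (using the relation a loaded from A earlier; not re-run here):
ranked = rank a;                  -- prepends a 1-based sequence number to every tuple
ranked_by_x = rank a by x desc;   -- ranks by x; tied values share the same rank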
SAMPLE
Randomly selects tuples according to the specified sampling rate. Details: http://pig.apache.org/docs/r0.11.1/basic.html#sample
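A minimal sketch (again on relation a; the result varies between runs because sampling is probabilistic):
s = sample a 0.5;   -- keeps roughly 50% of a's tuples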
SPLIT
Sometimes we need to route the tuples of one relation into several categories, for example
putting good records into a "good" relation and bad records into a "bad" relation. (See page 340 of Hadoop: The Definitive Guide for an example.)
Example
In this example relation A is split intothree relations, X, Y, and Z.
A = LOAD 'data' AS (f1:int,f2:int,f3:int);
DUMP A;
(1,2,3)
(4,5,6)
(7,8,9)
SPLIT A INTO X IF f1<7, Y IF f2==5, Z IF(f3<6 OR f3>6);
DUMP X;
(1,2,3)
(4,5,6)
DUMP Y;
(4,5,6)
DUMP Z;
(1,2,3)
(7,8,9)
Example
In this example, the SPLIT and FILTER statements are essentially equivalent. However, because SPLIT is implemented as "split the data stream and then apply filters", the SPLIT statement is more expensive than the FILTER statement because Pig needs to filter and store two data streams.
SPLIT input_var INTO output_var IF (field1 is not null), ignored_var IF (field1 is null);
-- where ignored_var is not used elsewhere
output_var = FILTER input_var BY (field1 isnot null);
STORE
Examples
In this example data is stored using PigStorage and the asterisk character (*) as the field delimiter.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
STORE A INTO 'myoutput' USING PigStorage('*');
CAT myoutput;
1*2*3
4*2*1
8*3*4
4*3*3
7*2*5
8*4*3
In this example, the CONCAT function is used to format the data before it is stored.
A = LOAD 'data' AS (a1:int,a2:int,a3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
(7,2,5)
(8,4,3)
B = FOREACH A GENERATE CONCAT('a:',(chararray)a1), CONCAT('b:',(chararray)a2), CONCAT('c:',(chararray)a3);
DUMP B;
(a:1,b:2,c:3)
(a:4,b:2,c:1)
(a:8,b:3,c:4)
(a:4,b:3,c:3)
(a:7,b:2,c:5)
(a:8,b:4,c:3)
STORE B INTO 'myoutput' using PigStorage(',');
CAT myoutput;
a:1,b:2,c:3
a:4,b:2,c:1
a:8,b:3,c:4
a:4,b:3,c:3
a:7,b:2,c:5
a:8,b:4,c:3
STREAM
Executes an external program or script inside the data flow.
http://pig.apache.org/docs/r0.11.1/basic.html#stream
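A minimal sketch (the script name stream.pl and its path are illustrative; any executable available on the task nodes could be used):
A = LOAD 'data';
B = STREAM A THROUGH `stream.pl`;   -- pipe every tuple through the script's stdin/stdout
-- or give the command an alias and ship the script to the cluster:
DEFINE cmd `stream.pl` SHIP('/local/path/stream.pl');
C = STREAM A THROUGH cmd;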
Summary:
Which Pig operators trigger a reduce phase:
GROUP: because GROUP collects all records with the same key together, data being processed in the map phase is forced through a shuffle → reduce phase.
ORDER: because all matching records must be brought together before they can be sorted, ORDER triggers a reduce phase. In addition to the job you wrote, Pig also adds an extra MapReduce job to the pipeline to sample your data and determine its distribution, so that a badly skewed distribution does not make the job extremely inefficient.
DISTINCT: because records have to be collected together before duplicates can be detected, DISTINCT triggers a reduce phase. DISTINCT does, however, use the combiner to remove duplicate records already in the map phase.
JOIN: JOIN matches records on a key, and since all records with the same key must be collected together, it triggers a reduce phase.
LIMIT: because the records must be collected together before the requested number can be counted out, LIMIT triggers a reduce phase.
COGROUP: similar to GROUP (see the earlier part of this article), so it triggers a reduce phase.
CROSS: computes the cross product of two or more relations.
Reference: http://www.cnblogs.com/uttu/archive/2013/02/19/2917438.html