Pig: Introduction to Latin - 4

  • cogroup

cogroup is a generalization of group . Instead of collecting records of one input based on a key, it collects records of n inputs based on a key. The result is a record with a key and one bag for each input.

A = load 'input1' as (id:int, val:float);
B = load 'input2' as (id:int, val2:int);
C = cogroup A by id, B by id;
describe C;
C: {group: int,A: {id: int,val: float},B: {id: int,val2: int}}

Another way to think of cogroup is as the first half of a join. The keys are collected together, but the cross product is not done.

  • union

union puts two data sets together by concatenating them.

A = load 'input1' as (x:int, y:float);
B = load 'input2' as (x:int, y:float);
C = union A, B;
describe C;
C: {x: int,y: float}

A = load 'input1' as (w:chararray, x:int, y:float);
B = load 'input2' as (x:int, y:double, z:chararray);
C = union onschema A, B;
describe C;
C: {w: chararray,x: int,y: double,z: chararray}

  • cross

cross matches the mathematical set operation of the same name. In the following Pig Latin, cross takes every record in NYSE_daily and combines it with every record in NYSE_dividends:

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray,date:chararray,

                                              open:float,high:float,  low:float,close:float, volume:int, adj_close:float);
divs= load 'NYSE_dividends' as (exchange:chararray, symbol:chararray,date:chararray, dividends:float);
tonsodata = cross daily, divs parallel 10;

猜你喜欢

转载自ylzhj02.iteye.com/blog/2038581
pig