Pig: Introduction to Pig Latin - 2

Relational Operations

  • foreach

foreach takes a set of expressions and applies them to every record in the data pipeline.

 

A = load 'input' as (user:chararray, id:long, address:chararray, phone:chararray, preferences:map[]);
B = foreach A generate user, id;

 

prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
gain = foreach prices generate close - open;
gain2 = foreach prices generate $6 - $3;

 

prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
beginning = foreach prices generate ..open; -- produces exchange, symbol, date, open
middle = foreach prices generate open..close; -- produces open, high, low, close
end = foreach prices generate volume..; -- produces volume, adj_close

 

bball = load 'baseball' as (name:chararray, team:chararray, position:bag{t:(p:chararray)}, bat:map[]);
avg = foreach bball generate bat#'batting_average';

 

A = load 'input' as (t:tuple(x:int, y:int));
B = foreach A generate t.x, t.$1;

 

A = load 'input' as (b:bag{t:(x:int, y:int)});
B1 = foreach A generate b.x;

B2 = foreach A generate b.(x, y);

 

Note: For fields that are simple projections with no other operators applied, Pig keeps the same name as before. Once any expression beyond simple projection is applied, Pig does not assign a name to the field.

You can assign a name with the as clause.
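For example, a computed field has no name until you assign one (a minimal sketch reusing the NYSE_daily schema from the examples above):

prices = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
-- without "as", the computed field would be unnamed and addressable only by position ($1 here)
gains = foreach prices generate symbol, close - open as gain;
sorted = order gains by gain desc;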

 

  • Filter

The filter statement allows you to select which records will be retained in your data pipeline. A filter contains a predicate. If that predicate evaluates to true for a given record, that record will be passed down the pipeline. Otherwise, it will not.

 

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
startswithcm = filter divs by symbol matches 'CM.*';

notstartswithcm = filter divs by not symbol matches 'CM.*';

 

Note:

a and b or not c  <=> (a and b) or (not c).

Pig will short-circuit Boolean operations when possible.

null neither matches nor fails to match any regular expression value. Likewise, x == null results in a value of null rather than true or false.
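A small sketch illustrating both points, reusing the NYSE_dividends schema from above:

divs = load 'NYSE_dividends' as (exchange:chararray, symbol:chararray, date:chararray, dividends:float);
-- the parentheses are not required, but they make the default precedence explicit
cm_or_other = filter divs by (symbol matches 'CM.*' and dividends > 0.1) or (not exchange == 'NYSE');
-- because x == null evaluates to null, use "is null" / "is not null" to test for nulls
has_div = filter divs by dividends is not null;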

 

  • Group

The group statement collects records with the same key: all records with the same value for the provided key are gathered together into a bag.

 

daily = load 'NYSE_daily' as (exchange, stock);
grpd = group daily by stock;
cnt = foreach grpd generate group, COUNT(daily);

 

daily = load 'NYSE_daily' as (exchange, stock, date, dividends);
grpd = group daily by (exchange, stock); -- group by multiple keys
avg = foreach grpd generate group, AVG(daily.dividends);

 

grpd = group daily all;
cnt = foreach grpd generate COUNT(daily);

 

Note: Because grouping collects together all records with the same value for the key, the results are often skewed: a few keys account for most of the records, so some reducers receive far more data than others, increasing the amount of data shipped over the network and written to disk. Pig has a number of ways that it tries to manage this skew to balance out the load across your reducers. The one that applies to grouping is Hadoop's combiner.
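The combiner applies only when the foreach following the group uses algebraic functions such as COUNT, which can be partially computed on the map side. A sketch of the contrast, reusing the relations above:

grpd = group daily by stock;
-- COUNT is algebraic: partial counts are computed map-side by the combiner
cnt = foreach grpd generate group, COUNT(daily);
-- generating the bag itself ships every record to the reducers; no combiner here
bags = foreach grpd generate group, daily;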

 

  • Order by

The order statement sorts your data for you, producing a total order of your output data.

 

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray, date:chararray, open:float, high:float, low:float, close:float, volume:int, adj_close:float);

bydate = order daily by date;

bydatensymbol = order daily by date, symbol;

byclose = order daily by close desc, open;

 

Note: Like group, order can suffer from skew. Pig solves this by first sampling the input of the order statement to get an estimate of the key distribution. Based on this sample, it then builds a partitioner that produces a balanced total order.

 

  • Distinct

The distinct statement is very simple. It removes duplicate records. It works only on entire records, not on individual fields.

 

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
uniq = distinct daily;

 

Note: Because it needs to collect like records together in order to determine whether they are duplicates, distinct forces a reduce phase. It does make use of the combiner to eliminate duplicate records early, in the map phase, wherever it can.
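Because distinct works only on entire records, the distinct values of a single field are obtained by projecting that field first (a minimal sketch using the same NYSE_daily fields):

daily = load 'NYSE_daily' as (exchange:chararray, symbol:chararray);
-- project down to one field, then dedupe the resulting one-field records
symbols = foreach daily generate symbol;
uniq_symbols = distinct symbols;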

 

  • Join

Join selects records from one input to put together with records from another input.

 

daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
divs = load 'NYSE_dividends' as (exchange, symbol, date, dividends);


jnd1 = join daily by symbol, divs by symbol;

jnd2 = join daily by (symbol, date), divs by (symbol, date);

jnd3 = join daily by (symbol, date) left outer, divs by (symbol, date);

 

Note: Pig does these joins in MapReduce by using the map phase to annotate each record with which input it came from. It then uses the join key as the shuffle key; thus join forces a new reduce phase. Once all of the records with the same value for the key are collected together, Pig does a cross product between the records from both inputs.

To minimize memory usage, it has MapReduce order the records coming into the reducer using the input annotation it added in the map phase, so all of the records for the left input arrive first. Pig caches these in memory. All of the records for the right input arrive second; as each of these records arrives, it is crossed with each record from the left side to produce an output record. In a multiway join, the left n - 1 inputs are held in memory, and the nth is streamed through.

It is important to keep this in mind when writing joins if you know that one of your inputs has more records per value of the chosen key: placing that input on the right side of your join will lower memory usage and possibly increase your script's performance.
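For example, if daily has far more records per symbol than divs, putting daily on the right means only the smaller input is cached (a sketch using the relations loaded above):

-- divs has fewer records per key, so it is cached; daily is streamed through
jnd = join divs by symbol, daily by symbol;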

 

  • Sample

The sample statement offers a simple way to get a sample of your data.

 

divs = load 'NYSE_dividends';
some = sample divs 0.1;
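Note that sample is probabilistic: each record is kept with probability 0.1, so the output contains approximately, not exactly, 10% of the input records.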

 

  • Parallel

The parallel clause can be attached to any relational operator in Pig Latin. However, it controls only reduce-side parallelism, so it makes sense only for operators that force a reduce phase: group, order, distinct, join, limit, cogroup, and cross.

 

daily = load 'NYSE_daily' as (exchange, symbol, date, open, high, low, close, volume, adj_close);
bysymbl = group daily by symbol parallel 10;

 

Note: A parallel clause applies only to the statement to which it is attached; it does not carry through the script. You can set a script-wide default by putting set default_parallel 10; at the beginning of the script, before any other commands.
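A minimal sketch of the script-wide default:

set default_parallel 10;
daily = load 'NYSE_daily' as (exchange, symbol);
grpd = group daily by symbol; -- uses 10 reducers unless an explicit parallel clause overrides it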

 

 

Reposted from ylzhj02.iteye.com/blog/2037580