Apache Pig is an abstraction over MapReduce. It is a tool / platform for analyzing large datasets, which it expresses as data flows.
Scripts are written in the Pig Latin language, which has some similarities to Hive. Here is a brief summary.
1, Loading data
A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
Here the text file a.txt is loaded into relation A; the file contains 6 columns.
2, Storing data
STORE alias INTO 'directory' USING function;
First, use the LOAD keyword to load data into the relation student:
student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt'
USING PigStorage(',')
as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );
Then store the relation in the HDFS directory "/pig_Output/":
STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');
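To make the effect of USING PigStorage(',') together with an AS schema more concrete, here is a plain-Python sketch (a model, not Pig itself). The helper names parse_line and to_line and the sample row are made up for illustration. Treating a missing or uncastable field as null, and writing null back as an empty field, are assumptions about Pig's usual behavior that are worth checking against your Pig version.

```python
# Plain-Python model of LOAD ... USING PigStorage(',') AS (...); this is not Pig.
SCHEMA = [("id", int), ("firstname", str), ("lastname", str),
          ("phone", str), ("city", str)]


def parse_line(line, delimiter=","):
    """Split one text line and cast each field to its declared type."""
    raw = line.rstrip("\n").split(delimiter)
    record = []
    for position, (_name, cast) in enumerate(SCHEMA):
        try:
            record.append(cast(raw[position]))
        except (IndexError, ValueError):
            # Assumption: a missing or uncastable field becomes null (None).
            record.append(None)
    return tuple(record)


def to_line(record, delimiter=","):
    """Model of STORE ... USING PigStorage(','): join the fields with the delimiter."""
    # Assumption: null is written as an empty field.
    return delimiter.join("" if field is None else str(field) for field in record)


student = parse_line("1,Ann,Lee,5550100,Springfield")  # hypothetical sample row
```

Calling to_line(student) gives back the original comma-separated line, which is the round trip that the LOAD and STORE statements above perform on every row.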
3, Apache Pig diagnostic operators
The LOAD statement simply loads data into the specified relation in Apache Pig. To verify the execution of a LOAD statement, you must use a diagnostic operator. Pig Latin provides four diagnostic operators:
- Dump operator
- Describe operator
- Explain operator
- Illustrate operator
3.1 Dump operator
DUMP is mainly used for debugging; it prints the result directly to the screen.
A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
DUMP A;
The output of DUMP is printed as tuples.
3.2 Describe operator
The DESCRIBE operator shows the schema of a relation. It is similar to desc in Hive, which shows the structure of a table.
DESCRIBE A;
3.3 Explain operator
The EXPLAIN operator shows the logical, physical, and MapReduce execution plans of a relation. It is similar to explain in Hive.
EXPLAIN A;
3.4 Illustrate operator
The ILLUSTRATE operator shows the step-by-step execution of a sequence of statements; it can roughly be thought of as viewing the result of each step in turn.
illustrate Relation_name;
4, Grouping and joining
4.1 group operator
The GROUP operator groups the data in one or more relations, collecting the data that has the same key. Each result is output as a tuple.
-- group by age
A = LOAD 'a.txt' AS (age: int, name: chararray, city: chararray);
Group_data = GROUP A BY age;
-- group by age and city
Group_multiple = GROUP A BY (age, city);
-- put all tuples of the relation into a single group
Group_all = GROUP A ALL;
The output of GROUP ALL is a single tuple: (all,{(..),(..),...})
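The shape of the three GROUP results above can be modeled in plain Python. This is only a sketch of the semantics, not Pig: the helper group_by and the sample rows are made up, and the groups are sorted purely to make the result deterministic here (Pig does not promise an order).

```python
from collections import defaultdict

# Hypothetical rows with the schema (age, name, city) used above.
A = [(21, "Ann", "Pune"), (22, "Bob", "Delhi"),
     (21, "Cid", "Pune"), (21, "Dee", "Delhi")]


def group_by(rows, key):
    """Model of GROUP rows BY key: one (group, bag) pair per distinct key."""
    bags = defaultdict(list)
    for row in rows:
        bags[key(row)].append(row)
    # Sorted only so this sketch is deterministic.
    return sorted(bags.items())


group_data = group_by(A, lambda r: r[0])              # GROUP A BY age
group_multiple = group_by(A, lambda r: (r[0], r[2]))  # GROUP A BY (age, city)
group_all = [("all", list(A))]                        # GROUP A ALL
```

Each element is a (group, bag) pair: the group is the key (a tuple when grouping by several fields, the literal all for GROUP ALL), and the bag holds the complete original rows.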
4.2 cogroup operator
The COGROUP operator works the same way as the GROUP operator. The only difference is that GROUP is normally used with a single relation, while COGROUP is used in statements involving two or more relations.
student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray);
employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
as (id:int, name:chararray, age:int, city:chararray);
cogroup_data = COGROUP student_details by age, employee_details by age;
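-- cogroup feels a bit like a full outer join in Hive: both files are grouped by age,
-- and if one file has no rows for a given age, that side of the result is filled with an empty bag {}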
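The behavior described in the comments above, one output tuple per key with an empty bag on the side that has no rows, can be sketched in plain Python. This is a model rather than Pig; the function cogroup and the sample rows are hypothetical and are not the contents of the files loaded above.

```python
def cogroup(left, right, left_key, right_key):
    """Model of COGROUP left BY k, right BY k: (key, left_bag, right_bag)."""
    keys = sorted({left_key(r) for r in left} | {right_key(r) for r in right})
    return [
        (k,
         [r for r in left if left_key(r) == k],    # empty list models {}
         [r for r in right if right_key(r) == k])
        for k in keys
    ]


# Hypothetical (id, name, age) rows.
students = [(1, "Ann", 21), (2, "Bob", 22)]
employees = [(7, "Eve", 22), (8, "Fay", 23)]

cogroup_data = cogroup(students, employees, lambda r: r[2], lambda r: r[2])
```

With this data, age 21 appears only among the students and age 23 only among the employees, so those two groups each carry one empty bag, while age 22 has rows on both sides.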
4.3 Computing a set difference with cogroup
In Hive, a set difference is usually implemented with a left join. For example, suppose table_a and table_b each have a single field, id, and we want to find the ids that are in table_a but not in table_b.
-- Hive implementation
select
a.id
from (
select
id
from table_a
) a
left join (
select
id
from table_b
) b
on a.id=b.id
where b.id is NULL
Now look at how to do the same in Pig:
A = LOAD 'table_a' USING PigStorage('\t') AS (id: chararray);
B = LOAD 'table_b' USING PigStorage('\t') AS (id: chararray);
C = COGROUP A BY id, B BY id;
D = FILTER C BY IsEmpty(B);
E = FOREACH D GENERATE FLATTEN(A);
E = FOREACH E GENERATE A::id as id;
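The COGROUP, filter-on-empty-bag, FLATTEN pipeline above can be traced step by step in plain Python. This is a sketch of the idea under made-up data, not Pig: the id values are hypothetical, and lists stand in for bags.

```python
table_a = ["1", "2", "3", "4"]   # hypothetical ids in table_a
table_b = ["2", "4", "5"]        # hypothetical ids in table_b

# C: cogroup both inputs by id -> (id, bag_of_A_rows, bag_of_B_rows)
keys = sorted(set(table_a) | set(table_b))
C = [(k, [a for a in table_a if a == k], [b for b in table_b if b == k])
     for k in keys]

# D: keep only the groups whose B bag is empty (the IsEmpty(B) test)
D = [group for group in C if not group[2]]

# E: flatten the A bag, giving one output row per element of the bag
E = [a for _key, bag_a, _bag_b in D for a in bag_a]
```

The id 5 exists only in table_b, so its B bag is not empty and the group is dropped; the ids 2 and 4 are dropped for the same reason, leaving only the ids unique to table_a.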
5, join operator
- Self-join
- Inner-join
- Outer-join: left join, right join, and full join
5.1 self-join & inner-join
Self-join and inner-join use the same syntax, but a self-join joins a relation with itself. To do this, load the same data into two different relations.
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;
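As a rough model of what this JOIN produces, the following plain-Python sketch concatenates every pair of rows whose keys are equal. It is not Pig; the function inner_join and the sample rows are hypothetical. Binding the same data to two names mirrors how a self-join loads one file into two relations.

```python
def inner_join(left, right, left_key, right_key):
    """Model of JOIN left BY k, right BY k: concatenate every matching pair."""
    return [l + r for l in left for r in right if left_key(l) == right_key(r)]


# Hypothetical (id, name) rows; the same data under two names, as in a self-join.
customers1 = [(1, "Ann"), (2, "Bob")]
customers2 = list(customers1)

self_joined = inner_join(customers1, customers2,
                         lambda r: r[0], lambda r: r[0])
```

Each output row carries the fields of both inputs, which is why the joined relation in Pig refers to its columns with prefixed names such as Relation1_name::id.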
5.2 outer-join
These are analogous to left join, right join, and full outer join in Hive.
Left Outer Join
The left outer join operator returns all rows from the left relation, even if there is no match in the right relation.
The syntax is as follows:
Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;
Right Outer Join
The right outer join operator returns all rows from the right relation, even if there is no match in the left relation.
The syntax is as follows:
Relation3_name = JOIN Relation1_name BY id RIGHT, Relation2_name BY customer_id;
Full Outer Join
The full outer join operator returns rows when there is a match in either relation.
The syntax is as follows:
Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;
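The difference between the three outer joins is easiest to see on a small example. The sketch below is a plain-Python model, not Pig: outer_join, its parameters, and the sample rows are hypothetical, and None stands in for Pig's null.

```python
def outer_join(left, right, left_key, right_key, how, left_width, right_width):
    """Model of JOIN ... LEFT/RIGHT/FULL OUTER; None stands in for null."""
    rows = [l + r for l in left for r in right if left_key(l) == right_key(r)]
    if how in ("left", "full"):
        matched = {right_key(r) for r in right}
        # Unmatched left rows are padded with nulls on the right side.
        rows += [l + (None,) * right_width
                 for l in left if left_key(l) not in matched]
    if how in ("right", "full"):
        matched = {left_key(l) for l in left}
        # Unmatched right rows are padded with nulls on the left side.
        rows += [(None,) * left_width + r
                 for r in right if right_key(r) not in matched]
    return rows


customers = [(1, "Ann"), (2, "Bob")]   # hypothetical (id, name)
orders = [(101, 2), (102, 3)]          # hypothetical (order_id, customer_id)
args = (customers, orders, lambda c: c[0], lambda o: o[1])

left_rows = outer_join(*args, how="left", left_width=2, right_width=2)
right_rows = outer_join(*args, how="right", left_width=2, right_width=2)
full_rows = outer_join(*args, how="full", left_width=2, right_width=2)
```

Here customer 1 has no order and order 102 has no customer, so the left join keeps the former, the right join keeps the latter, and the full join keeps both, each padded with nulls.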
6, cross operator
The CROSS operator computes the cross product of two or more relations.
It can be understood as computing the Cartesian product.
The syntax is as follows:
cross_data = CROSS customers, orders;
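A Cartesian product pairs every row of one relation with every row of the other, so the output size is the product of the input sizes. The following plain-Python sketch (not Pig, with made-up rows) shows the idea using itertools.product.

```python
from itertools import product

customers = [(1, "Ann"), (2, "Bob")]   # hypothetical rows
orders = [(101,), (102,), (103,)]      # hypothetical rows

# Model of CROSS customers, orders: every customer paired with every order.
cross_data = [c + o for c, o in product(customers, orders)]
```

Two customers and three orders give six rows, which is why CROSS becomes expensive quickly on large relations.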
7, union operator
The UNION operator of Pig Latin merges the contents of two relations. To perform a UNION on two relations, their columns and fields must be identical.
The syntax is as follows:
Relation_name3 = UNION Relation_name1, Relation_name2;
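One point that is easy to miss is that UNION in Pig behaves like a bag union: duplicate tuples are kept rather than removed. The plain-Python sketch below models this with made-up rows; it is not Pig, and list concatenation fixes an output order that Pig itself does not guarantee.

```python
relation1 = [(1, "Ann"), (2, "Bob")]   # hypothetical rows
relation2 = [(2, "Bob"), (3, "Cid")]   # hypothetical rows with the same columns

# Model of UNION relation1, relation2: the rows of both inputs, duplicates kept.
union_data = relation1 + relation2

# Applying DISTINCT afterwards would remove the repeated (2, "Bob").
deduplicated = sorted(set(union_data))
```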
8, split operator
The SPLIT operator splits a relation into two or more relations.
The syntax is as follows:
SPLIT Relation_name INTO Relation1_name IF (condition1), Relation2_name IF (condition2);
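Each condition in a SPLIT is evaluated independently, so a tuple can land in more than one output relation, or in none. The plain-Python sketch below models this with hypothetical data and conditions; it is not Pig.

```python
# Hypothetical (name, age) rows.
students = [("Ann", 21), ("Bob", 23), ("Cid", 25), ("Dee", 30)]

# Model of:
#   SPLIT students INTO younger IF (age < 24), older IF (age > 22 AND age < 28);
younger = [s for s in students if s[1] < 24]
older = [s for s in students if 22 < s[1] < 28]
```

Bob satisfies both conditions and appears in both results, while Dee satisfies neither and is dropped.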
9, filter operator
The FILTER operator selects the required tuples from a relation based on a condition.
The syntax is as follows:
Relation2_name = FILTER Relation1_name BY (condition);
filter_data = FILTER student_details BY city == 'Chennai';
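In plain Python, the FILTER statement above corresponds to a list comprehension with a condition. This is only a model with made-up rows, not Pig, and the three-field layout (id, name, city) is a simplification of the student_details schema.

```python
# Hypothetical (id, name, city) rows.
student_details = [(1, "Ann", "Chennai"), (2, "Bob", "Delhi"),
                   (3, "Cid", "Chennai")]

# Model of: filter_data = FILTER student_details BY city == 'Chennai';
filter_data = [row for row in student_details if row[2] == "Chennai"]
```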
10, distinct operator
The DISTINCT operator removes redundant (duplicate) tuples from a relation.
The syntax is as follows:
distinct_data = DISTINCT student_details;
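DISTINCT compares whole tuples, not individual fields. A plain-Python model with made-up rows looks like this; it is not Pig, and it keeps the first occurrence of each tuple only to make the result predictable, since Pig does not guarantee an output order.

```python
# Hypothetical (id, name) rows containing repeated tuples.
student_details = [(1, "Ann"), (2, "Bob"), (1, "Ann"), (3, "Cid"), (2, "Bob")]

# Model of: distinct_data = DISTINCT student_details;
seen = set()
distinct_data = []
for row in student_details:
    if row not in seen:      # the entire tuple must match to count as a duplicate
        seen.add(row)
        distinct_data.append(row)
```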