Apache Pig Syntax Description

Apache Pig is an abstraction over MapReduce. It is a tool/platform for analyzing large datasets, which it represents as data flows.

Scripts are written in the Pig Latin language, which has some similarities to Hive. Here is a brief summary.

1, Loading data

A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);

Here the text file a.txt is loaded into the relation A; the file contains 6 columns.
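To make the effect of the AS clause concrete, here is a minimal Python sketch of how one input line could be turned into a typed tuple. It assumes the default tab delimiter and models a failed cast as None (standing in for Pig's null); the names SCHEMA and parse_line are hypothetical, and lines with missing fields are not modeled.

```python
# Hypothetical model of the schema in the LOAD statement above.
SCHEMA = [("col1", str), ("col2", int), ("col3", int),
          ("col4", int), ("col5", float), ("col6", float)]

def parse_line(line, schema=SCHEMA, delimiter="\t"):
    """Split one text line and cast each field to its declared type."""
    fields = line.rstrip("\n").split(delimiter)
    row = []
    for (_name, cast), raw in zip(schema, fields):
        try:
            row.append(cast(raw))
        except ValueError:
            row.append(None)  # assumption: a value that cannot be cast becomes null
    return tuple(row)

good = parse_line("x\t1\t2\t3\t4.5\t6.0\n")
bad = parse_line("x\toops\t2\t3\t4.5\t6.0")
print(good)
print(bad)
```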

2, Storing data

STORE alias INTO 'directory' USING function;

Use the LOAD keyword to load data into the relation student:

student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' 
   USING PigStorage(',')
   as ( id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray );

Then store the relation in the HDFS directory "/pig_Output/":

STORE student INTO 'hdfs://localhost:9000/pig_Output/' USING PigStorage(',');

3, Apache Pig diagnostic operators
The LOAD statement simply loads data into the specified relation in Apache Pig. To verify the execution of a LOAD statement, you must use a diagnostic operator. Pig Latin provides four diagnostic operators:

  • Dump operator
  • Describe operator
  • Explain operator
  • Illustrate operator

3.1 Dump operator
DUMP is mainly used for debugging; it prints the result directly to the screen.

A = LOAD 'a.txt' AS (col1:chararray, col2:int, col3:int, col4:int, col5:double, col6:double);
DUMP A;

DUMP prints each record of the relation as a tuple.

3.2 Describe operator
The DESCRIBE operator shows the schema of a relation, similar to viewing a table structure with desc in Hive.

DESCRIBE A;

3.3 Explain operator
The EXPLAIN operator shows the logical, physical, and MapReduce execution plans of a relation, similar to explain in Hive.

EXPLAIN A;

3.4 Illustrate operator
The ILLUSTRATE operator shows the step-by-step execution of a sequence of statements; you can think of it as a way to view sample results at each step.

illustrate Relation_name;

4, Grouping and joining
4.1 Group operator
The GROUP operator groups the data in one or more relations, collecting the data that has the same key. Each result is output as a tuple.

-- group by age
A = LOAD 'a.txt' AS (age: int, name: chararray, city: chararray);
Group_data = GROUP A BY age;

-- group by age and city
Group_multiple = GROUP A BY (age, city);

-- put all tuples of the relation into a single group
Group_all = GROUP A ALL;

The output of GROUP ALL is a single tuple: (all,{(..),(..),...})
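The shape of the GROUP output can be modeled outside Pig. The following Python sketch is only an illustration of the semantics, not how Pig executes the statement; the sample rows and the helper group_by are hypothetical, and the result is sorted purely for readability.

```python
from collections import defaultdict

def group_by(relation, key_fn):
    """Model of GROUP: return (group_key, bag_of_tuples) pairs."""
    bags = defaultdict(list)
    for row in relation:
        bags[key_fn(row)].append(row)
    return sorted(bags.items())

# Hypothetical sample rows with the (age, name, city) schema used above.
A = [(21, "Ann", "Pune"), (22, "Bob", "Delhi"), (21, "Cid", "Pune")]

group_data = group_by(A, lambda r: r[0])              # GROUP A BY age
group_multiple = group_by(A, lambda r: (r[0], r[2]))  # GROUP A BY (age, city)
group_all = group_by(A, lambda r: "all")              # GROUP A ALL

for key, bag in group_data:
    print(key, bag)
```

Each pair corresponds to one output tuple of the form (group, {bag of matching tuples}); with GROUP ALL there is exactly one such tuple.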

4.2 Cogroup operator
The COGROUP operator works in the same way as the GROUP operator. The only difference is that GROUP is normally used with a single relation, while COGROUP is used in statements involving two or more relations.

student_details = LOAD 'hdfs://localhost:9000/pig_data/student_details.txt' USING PigStorage(',')
   as (id:int, firstname:chararray, lastname:chararray, age:int, phone:chararray, city:chararray); 
employee_details = LOAD 'hdfs://localhost:9000/pig_data/employee_details.txt' USING PigStorage(',')
   as (id:int, name:chararray, age:int, city:chararray);
cogroup_data = COGROUP student_details by age, employee_details by age;

-- cogroup feels a bit like a full outer join in Hive: the two files are grouped by age,
-- and if one of the files has no rows for a given age value, the result is padded with {}
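To see what "padded with {}" means in practice, here is a minimal Python model of COGROUP on two small relations. The helper cogroup and the sample rows are hypothetical and only illustrate the output shape.

```python
def cogroup(left, right, left_key, right_key):
    """Model of COGROUP: one (key, left_bag, right_bag) tuple per distinct key."""
    keys = sorted({left_key(r) for r in left} | {right_key(r) for r in right})
    return [
        (k,
         [r for r in left if left_key(r) == k],
         [r for r in right if right_key(r) == k])
        for k in keys
    ]

# Hypothetical sample rows: (name, age)
students = [("Ann", 21), ("Bob", 22)]
employees = [("Eve", 22), ("Dan", 23)]

cogroup_data = cogroup(students, employees, lambda r: r[1], lambda r: r[1])
for row in cogroup_data:
    print(row)
```

Ages 21 and 23 each appear in only one input, so the bag for the other input is empty in those rows.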

4.3 Computing a set difference with cogroup
In Hive, a set difference is generally implemented with a left join. For example, suppose table_a and table_b each have only one field, id, and we want to find the ids that are in table_a but not in table_b.

-- Hive implementation
select
a.id 
from (
	select 
	id 
	from table_a
) a
left join (
	select
	id
	from table_b
) b
on a.id=b.id
where b.id is NULL

Now let's see how to do the same thing in Pig:

A = LOAD 'table_a' USING PigStorage('\t') AS (id: chararray);
B = LOAD 'table_b' USING PigStorage('\t') AS (id: chararray);

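-- Group both relations by id; each output tuple is (group, {A tuples}, {B tuples})
C = COGROUP A BY id, B BY id;
-- Keep only the ids that have no matching tuple in B
D = FILTER C BY IsEmpty(B);
-- Flatten the bag of A tuples back into rows, then rename the field
E = FOREACH D GENERATE FLATTEN(A);
E = FOREACH E GENERATE A::id AS id;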
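The idea behind the COGROUP-based set difference can be checked with a small Python model. This is a sketch of the logic only; the function name and the sample ids are hypothetical.

```python
def set_difference_via_cogroup(a_ids, b_ids):
    # COGROUP A BY id, B BY id: build (id, bag_from_A, bag_from_B) per distinct id
    cogrouped = [
        (k, [x for x in a_ids if x == k], [x for x in b_ids if x == k])
        for k in sorted(set(a_ids) | set(b_ids))
    ]
    # Filter with IsEmpty(B): keep the ids that have nothing on the B side
    only_in_a = [(k, a_bag, b_bag) for k, a_bag, b_bag in cogrouped if not b_bag]
    # FLATTEN(A): expand each remaining A bag back into individual rows
    return [x for _, a_bag, _ in only_in_a for x in a_bag]

table_a = ["1", "2", "3", "3"]
table_b = ["2", "4"]
result = set_difference_via_cogroup(table_a, table_b)
print(result)
```

As with the left join in Hive, duplicate ids in table_a are preserved in the result; add a DISTINCT step if unique ids are needed.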
 

5, Join operator

  • Self-join
  • Inner-join
  • Outer-join − left join, right join, and full join

5.1 Self-join & inner-join
Self-join and inner-join use the same syntax. A self-join joins a relation with itself, so the same data must be loaded into two different relations.

Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key ;

5.2 Outer-join
These are analogous to left join, right join, and full outer join in Hive.

Left Outer Join
The left outer join operator returns all rows from the left relation, even if there is no match in the right relation.
The syntax is as follows:

Relation3_name = JOIN Relation1_name BY id LEFT OUTER, Relation2_name BY customer_id;

Right Outer Join
The right outer join operator returns all rows from the right relation, even if there is no match in the left relation.

The syntax is as follows:

Relation3_name = JOIN Relation1_name BY id RIGHT, Relation2_name BY customer_id;

Full Outer Join
The full outer join operator returns rows when there is a match in either of the relations.

The syntax is as follows:

Relation3_name = JOIN Relation1_name BY id FULL OUTER, Relation2_name BY customer_id;
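The difference between the three outer joins is easiest to see on a tiny example. The Python sketch below models the row-level behavior only; outer_join, the sample data, and the use of None for nulls are illustrative assumptions, and both inputs are assumed to be non-empty.

```python
def outer_join(left, right, left_key, right_key, how):
    """Model of an outer join; `how` is 'left', 'right' or 'full'."""
    left_pad = (None,) * len(left[0])    # stands in for nulls on the left side
    right_pad = (None,) * len(right[0])  # stands in for nulls on the right side
    rows = []
    for l in left:
        matches = [r for r in right if right_key(r) == left_key(l)]
        if matches:
            rows.extend(l + r for r in matches)
        elif how in ("left", "full"):
            rows.append(l + right_pad)   # unmatched left row is kept
    if how in ("right", "full"):
        left_keys = {left_key(l) for l in left}
        # unmatched right rows are kept, padded on the left
        rows.extend(left_pad + r for r in right if right_key(r) not in left_keys)
    return rows

# Hypothetical sample data: customers (id, name), orders (order_id, customer_id)
customers = [(1, "Ann"), (2, "Bob")]
orders = [(100, 1), (101, 3)]
key_c = lambda c: c[0]
key_o = lambda o: o[1]

left_rows = outer_join(customers, orders, key_c, key_o, "left")
right_rows = outer_join(customers, orders, key_c, key_o, "right")
full_rows = outer_join(customers, orders, key_c, key_o, "full")
print(full_rows)
```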

6, Cross operator
The CROSS operator computes the cross product of two or more relations, which can be understood as the Cartesian product. The example below shows how to use the CROSS operator in Pig Latin.

The syntax is as follows:

cross_data = CROSS customers, orders;
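As a quick illustration of the Cartesian product, the following Python sketch pairs every row of one list with every row of the other. The sample rows are hypothetical.

```python
from itertools import product

# Hypothetical sample data: customers (id, name), orders (order_id, customer_id)
customers = [(1, "Ann"), (2, "Bob")]
orders = [(100, 1), (101, 3), (102, 2)]

# Every customer row is combined with every order row.
cross_data = [c + o for c, o in product(customers, orders)]
print(len(cross_data))
```

The result has len(customers) * len(orders) rows, which is why CROSS can be very expensive on large relations.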

7, Union operator
The UNION operator in Pig Latin merges the contents of two relations. To perform a UNION on two relations, their columns and field types must be identical.

The syntax is as follows:

Relation_name3 = UNION Relation_name1, Relation_name2;

8, Split operator
The SPLIT operator splits a relation into two or more relations.

The syntax is as follows:

SPLIT Relation_name INTO Relation1_name IF (condition1), Relation2_name IF (condition2);
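One detail worth knowing is that the conditions are evaluated independently, so a tuple can land in more than one output relation, or in none. A minimal Python model, with a hypothetical helper split and made-up rows:

```python
def split(relation, *conditions):
    """Model of SPLIT: one output list per condition, in the order given."""
    return [[row for row in relation if cond(row)] for cond in conditions]

# Hypothetical rows: (name, age)
people = [("Ann", 21), ("Bob", 23), ("Cid", 30)]

younger, mid = split(
    people,
    lambda r: r[1] < 25,        # condition1
    lambda r: 22 < r[1] < 28,   # condition2
)
print(younger)
print(mid)
```

Here Bob satisfies both conditions and appears in both outputs, while Cid satisfies neither and is dropped.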

9, Filter operator
The FILTER operator selects the required tuples from a relation based on a condition.

The syntax is as follows:

Relation2_name = FILTER Relation1_name BY (condition);

filter_data = FILTER student_details BY city == 'Chennai';

10, Distinct operator
The DISTINCT operator removes redundant (duplicate) tuples from a relation.

The syntax is as follows:

distinct_data = DISTINCT student_details;

Origin blog.csdn.net/zuolixiangfisher/article/details/90237565