Pig relational operators

  • There are six main categories of relational operators in Pig Latin: load and store, diagnostic, grouping and joining, filtering, sorting, and merging and splitting. Combined with the general-purpose operators, these can be used to analyze data stored locally or in an HDFS cluster.

1. Load and store operators

  • This category contains two operators, LOAD and STORE, which load data into a relation from the local or HDFS file system, or store a relation back to it.

1. LOAD

  • A LOAD statement consists of two parts separated by an equals sign: the left side names the relation in which the data is stored, and the right side defines how the data is loaded.
Relation_name = LOAD 'Input file path' USING function as schema;
  • Relation_name: the name of the target relation in which the data is saved
  • Input file path: the path where the data is stored, either local or HDFS. For data in HDFS, the format is "hdfs://localhost:9000/path"
  • function: a load function chosen from the set provided by Apache Pig, for example:

PigStorage() Load and store structured, delimited files
TextLoader() Load unstructured data into Pig
BinStorage() Load and store data using a machine-readable binary format
JsonLoader() Load JSON data into Pig

  • schema: the data schema, which must be specified when loading data, with the following syntax:
(column1:data type,column2:data type,column3:data type);
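
  • Putting the parts together, a minimal sketch (assuming a hypothetical two-column, comma-separated file at /data/demo.txt); note that the argument to PigStorage() is the field delimiter, which defaults to a tab:
demo = LOAD '/data/demo.txt' USING PigStorage(',') as (name:chararray,age:int);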

example

Create data files

  • Create a student.txt file on the Linux host, enter the following comma-separated content, and save the file to the home directory
108,胡占一,,1995-06-03,95033
405,钱多多,,1989-06-03,95031
107,朱琳琳,,1997-05-08,95033
101,蓝樱桃,,1996-12-11,95033
109,王三石,,1994-07-08,95031
103,王俊毅,,1993-09-14,95031

  • Start the Hadoop cluster
start-all.sh

  • Create an HDFS directory, upload the file to it, and list the directory to verify the upload
hadoop fs -mkdir /pig_input
hadoop fs -put ~/data/student.txt /pig_input
hadoop fs -ls /pig_input

  • Start the Grunt shell in MapReduce mode and load the data in student.txt into Pig. Note that Hadoop's history server must be running when Pig loads data from HDFS
pig -x mapreduce
# Note: adjust the HDFS port to your own configuration
student = LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);
dump student;
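
  • The dump should print one tuple per line of student.txt, for example:
(108,胡占一,,1995-06-03,95033)
(405,钱多多,,1989-06-03,95031)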

2. STORE

  • Since only a limited amount of information can be displayed on screen, the results of Pig's data analysis need to be saved to a persistent storage system. In Pig, the STORE operator stores a loaded relation in the file system
STORE Relation_name INTO 'required_directory_path' [USING function];
  • Parameter Description
  • Relation_name: relation name
  • required_directory_path: the target storage path for the relation
  • USING function: the store function

example

  • Export the data in the relation named student to the /pigfile_output directory of HDFS and view it
# Note: adjust the HDFS port to your own configuration
STORE student INTO 'hdfs://hadoop102:8020/pigfile_output' USING PigStorage(',');
fs -cat /pigfile_output/part-m-00000

2. Diagnostic operators

  • Diagnostic operators verify that the data loaded into a relation with the LOAD statement is correct.

dump

  • Executes Pig Latin statements and prints the results on screen; typically used to debug code
dump student;

describe

  • Displays the schema of a relation
describe student;
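
  • For the student relation loaded earlier, describe prints the schema:
student: {sno: chararray,sname: chararray,ssex: chararray,sbirthday: chararray,class: chararray}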

explain

  • Displays the logical, physical, and MapReduce execution plans of a relation
explain student;

illustrate

  • Shows the step-by-step execution of the statements on a small sample of the data
illustrate student;

3. Grouping operators

  • The GROUP operator groups the data in one or more relations; when grouping multiple relations, they must share at least one common key.
# Group the data in a single relation
Group_data = GROUP Relation_name BY Group_key;
# Group the data in multiple relations
Group_data = GROUP Relation1_name BY Group_key, Relation2_name BY Group_key;
  • Parameter Description
  • Relation_name: relation name
  • Group_key: group key

example

  • There are two data files, contract.txt and temporary.txt. Create the two files on the Linux system and upload them to the /pig_input directory of HDFS
vim contract.txt

001,Rajiv,21,1254745857,Hyderabad
002,Siddarth,22,54786541785,Kolkata
003,Rajesh,22,14856978541,Delhi
004,Preethi,21,13254785642,Pune
005,Trupthi,23,9848022336,Bhuwaneshwar
006,Archana,23,487965214857,Chennai
007,Komal,24,35478595417,Trivendram
008,Bharathi,24,12547896547,Chennai
vim temporary.txt

001,Robin,22,Newyork
002,Bob,23,Kolkata
003,Maya,23,Tokyo
004,Sara,25,London
005,David,23,Bhuwaneshwar
006,Maggy,22,Chennai

  • Upload the files to HDFS
hadoop fs -mkdir -p /pig_input
hadoop fs -put ~/data/contract.txt ~/data/temporary.txt /pig_input
hadoop fs -ls /pig_input

  • Run the Grunt shell in MapReduce mode and load the data from the two files into the contract and temporary relations respectively
pig -x mapreduce

# Note: adjust the HDFS port to your own configuration
contract = LOAD 'hdfs://hadoop102:8020/pig_input/contract.txt' USING PigStorage(',') as (id:int,firstname:chararray,age:int,phone:chararray,city:chararray);

dump contract;

# Note: adjust the HDFS port to your own configuration
temporary = LOAD 'hdfs://hadoop102:8020/pig_input/temporary.txt' USING PigStorage(',') as (id:int,name:chararray,age:int,city:chararray);

dump temporary;

  • Group the data in the contract and temporary relations by age
contract_group = GROUP contract BY age;
dump contract_group;
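
  • Each tuple in a grouped relation pairs a group key with a bag of the tuples sharing that key. For age 21, for example, the output looks like this (tuple order inside a bag is not guaranteed):
(21,{(4,Preethi,21,13254785642,Pune),(1,Rajiv,21,1254745857,Hyderabad)})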

temporary_group = GROUP temporary BY age;
dump temporary_group;
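
  • GROUP can also group multiple relations on a shared key in a single statement (the multi-relation form, equivalent to COGROUP). A sketch using the two relations above, pairing each age with one bag of contract tuples and one bag of temporary tuples:
cogroup_data = GROUP contract BY age, temporary BY age;
dump cogroup_data;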

4. Join operators

  • JOIN is used to combine the records of two or more relations. To perform the operation, one or more fields are declared as the key for each relation; when the keys match, the tuples are joined, otherwise the records are discarded.

self-join

  • Joins a relation with itself. Because Pig requires distinct relation names in a JOIN, the same data is typically loaded into two different relations
Relation3_name = JOIN Relation1_name BY key, Relation2_name BY key;
  • Parameter Description
  • Relation3_name: the name of the target relation that holds the joined data
  • Relation1_name, Relation2_name: two relations containing the same data
  • key: the join key

example

  • Create a data file named score.txt and upload it to the /pig_input directory of HDFS
vim score.txt

108,Hadoop生态体系,89
405,Linux操作系统,67
107,高等数学,87
103,高等数学,100

  • Upload it to HDFS
hadoop fs -put ~/data/score.txt /pig_input
hadoop fs -ls /pig_input

  • Enter the Grunt shell in MapReduce mode, load the data from the score.txt file in HDFS into the score1 and score2 relations, and perform a self-join
# Note: adjust the HDFS port to your own configuration
score1 = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);
# Note: adjust the HDFS port to your own configuration
score2 = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);

score3 = JOIN score1 BY stu_no, score2 BY stu_no;

dump score3;
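
  • Because both relations contain the same data, each tuple joins with itself and its fields appear twice, for example:
(103,高等数学,100,103,高等数学,100)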

inner join

  • The inner join is the most frequently used join, also known as an equijoin. It joins the rows of two tables that match on the join predicate and creates a new relation.
  • During execution, an inner join compares each row of A with each row of B to find all pairs of rows that satisfy the condition.
result = JOIN relation1 BY columnname, relation2 BY columnname;
  • Parameter Description
  • relation1, relation2: the two relations to be joined
  • columnname: the join predicate

Example
Load the data from student.txt and score.txt under the /pig_input directory of HDFS into relations, and perform an inner join on student and score

# Note: adjust the HDFS port to your own configuration
score = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);
student= LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);
result = JOIN student BY sno, score BY stu_no;
dump result;
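
  • Only students with a matching score record appear in the result, for example:
(108,胡占一,,1995-06-03,95033,108,Hadoop生态体系,89)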

outer join

  • left outer join
  • Returns all the records from the left table plus the matching records from the right table; the fields of unmatched right-table records are filled with null values
outer_left = JOIN relation1 BY columnname LEFT, relation2 BY columnname;
  • example

Load the data from the student.txt and score.txt files under the /pig_input directory in HDFS into the student and score relations respectively, and perform a left outer join.

# Note: adjust the HDFS port to your own configuration
score = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);
student= LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);
outer_left = JOIN student BY sno LEFT, score BY stu_no;
dump outer_left;
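
  • Students without a matching score record are kept, and the score fields are null (printed as empty), for example:
(101,蓝樱桃,,1996-12-11,95033,,,)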

  • right outer join
  • Returns all the records from the right table plus the matching records from the left table; the fields of unmatched left-table records are filled with null values
outer_right = JOIN relation1 BY columnname RIGHT, relation2 BY columnname;

example

  • Load the data from the student.txt and score.txt files under the /pig_input directory in HDFS into the student and score relations respectively, and perform a right outer join.
# Note: adjust the HDFS port to your own configuration
score = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);
student= LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);
outer_right = JOIN student BY sno RIGHT, score BY stu_no;
dump outer_right;

  • full outer join
  • Returns all the records from both tables; unmatched fields on either side are filled with null values
outer_full = JOIN relation1 BY columnname FULL OUTER, relation2 BY columnname;
  • example
  • Load the data from the student.txt and score.txt files under the /pig_input directory in HDFS into the student and score relations respectively, and perform a full outer join.
# Note: adjust the HDFS port to your own configuration
score = LOAD 'hdfs://hadoop102:8020/pig_input/score.txt' USING PigStorage(',') as (stu_no:chararray,cname:chararray,degree:int);
student= LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);
outer_full = JOIN student BY sno FULL OUTER, score BY stu_no;
dump outer_full;

5. Filter operators

  • There are three filter operators in Pig Latin
  • FILTER
  • Selects the desired tuples from a relation based on a condition
  • DISTINCT
  • Removes duplicate tuples from a relation
  • FOREACH
  • Generates data transformations based on columns of data
#FILTER
FILTER relation BY (condition);
#DISTINCT
DISTINCT relation;
#FOREACH 
FOREACH relation GENERATE (required data);
  • Create a data file named data_details.txt, upload it to the /pig_input directory of HDFS, and load the data into the details relation
vim data_details.txt

001,Rajiv,Reddy,9848022337,Hyderabad
002,Siddarth,Battacharya,9848022338,Kolkata
002,Siddarth,Battacharya,9848022338,Kolkata
003,Rajesh,Khanna,9848022339,Delhi
003,Rajesh,Khanna,9848022339,Delhi
004,Preethi,Agarwal,9848022330,Pune
005,Trupthi,Mohanthy,9848022336,Bhuwaneshwar
006,Archana,Mishra,9848022335,Chennai
006,Archana,Mishra,9848022335,Chennai

hadoop fs -put ~/data/data_details.txt /pig_input

pig -x mapreduce

details = LOAD 'hdfs://hadoop102:8020/pig_input/data_details.txt' USING PigStorage(',') as (id:int,firstname:chararray,lastname:chararray,phone:chararray,city:chararray);

  • FILTER
  • Use the FILTER operator to select the record whose id is 005
filter_data = FILTER details BY id == 5;
dump filter_data;
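
  • Because id was loaded as an int, 005 is compared as the number 5, and the output is:
(5,Trupthi,Mohanthy,9848022336,Bhuwaneshwar)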

  • DISTINCT
  • Use the DISTINCT operator to remove duplicate tuples from the data
distinct_data = DISTINCT details;
dump distinct_data;

  • FOREACH
  • Use the FOREACH operator to get the values of id, firstname, and city from the distinct_data relation and store them in a relation called foreach_data
foreach_data = FOREACH distinct_data GENERATE id,firstname,city;
dump foreach_data;

6. Sorting operators

  • Use the ORDER BY keyword to sort a relation on one or more fields. ORDER BY is often used with the LIMIT operator, which returns the specified number of tuples from the front of the sorted relation
# ORDER operator
ORDER Relation BY column (ASC|DESC);
# LIMIT operator
LIMIT Relation number_of_tuples;
  • Load the data from the student.txt file under the /pig_input directory of HDFS into a relation named student, sort it in ascending order by sno, and display the first two tuples
student = LOAD 'hdfs://hadoop102:8020/pig_input/student.txt' USING PigStorage(',') as (sno:chararray,sname:chararray,ssex:chararray,sbirthday:chararray,class:chararray);

order_by_data = ORDER student BY sno ASC;

limit_data = LIMIT order_by_data 2;

dump limit_data;
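
  • To sort in descending order instead, replace ASC with DESC, for example:
order_desc = ORDER student BY sno DESC;
dump order_desc;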
