Big Data (12) Pig: a high-level query language for processing large-scale data

1 Introduction

The language used to describe the data flow is called Pig Latin. Pig Latin programs currently run in one of two execution environments: a local execution environment in a single JVM, or a distributed execution environment on a Hadoop cluster.

In Pig, each operation or transformation processes input data and produces an output result. These transformations are compiled into a series of MapReduce jobs. Pig frees the programmer from knowing how the transformations are carried out, so the engineer can focus on the data rather than on the details of the implementation.
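As a sketch of what such a data flow looks like in Pig Latin (the file name and field names here are made up for illustration):

-- each statement reads one relation and produces a new one;
-- Pig compiles the whole flow into a series of MapReduce jobs
records = LOAD 'scores.txt' USING PigStorage('\t') AS (name:chararray, score:int);
passed  = FILTER records BY score >= 60;
grouped = GROUP passed BY name;
counts  = FOREACH grouped GENERATE group, COUNT(passed);
STORE counts INTO 'score_counts';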

2 Download and install

Installation package download address: http://pig.apache.org/releases.html.
#tar -xzvf pig-0.17.0.tar.gz -C /usr/local/
#cd /usr/local/
#mv pig-0.17.0/ pig
#vi /root/.bashrc

export PIG_HOME=/usr/local/pig
export PATH=$PATH:$PIG_HOME/bin

#source /root/.bashrc
#pig -help
#pig -version        (prints 0.17.0)
#hadoop version      (prints 2.8.5)
#pig -x local        (set the execution mode to local)
grunt>
grunt> quit          (exit the Grunt shell)

3 Execution modes

3.1 local mode

Grunt is Pig's shell. In local mode, Pig runs in a single JVM and accesses the local file system. This mode is used to test or process small-scale data sets.
#pig -x local
grunt>
grunt> quit
For example, extract the first column of the Linux /etc/passwd file and output it.
#head -n 5 /etc/passwd

root:x:0:0:root:/root:/bin/bash
bin:x:1:1:bin:/bin:/sbin/nologin
daemon:x:2:2:daemon:/sbin:/sbin/nologin
adm:x:3:4:adm:/var/adm:/sbin/nologin
lp:x:4:7:lp:/var/spool/lpd:/sbin/nologin

#pig -x local -brief

Connecting to hadoop file system at: file:///
[this indicates the local file system]

Usage:

grunt> A = load '/etc/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
Note: the spaces in these statements are required.
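Given the five sample lines of /etc/passwd shown above, dump B prints one single-field tuple per input row, so the output should begin with:

(root)
(bin)
(daemon)
(adm)
(lp)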

3.2 MapReduce mode

If you run pig in this mode without Hadoop installed and configured, an error will be reported:
ERROR 4010: Cannot find hadoop configurations in classpath (neither hadoop-site.xml nor core-site.xml was found in the classpath). If you plan to use local mode, please put -x local option in command line.
Pig finds and runs the corresponding Hadoop client according to the HADOOP_HOME environment variable.
#cat /root/.bashrc

export HADOOP_HOME=/usr/local/hadoop
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
export HADOOP_CONF_DIR=/usr/local/hadoop/etc/hadoop
export HADOOP_CLASSPATH=/usr/local/hadoop/share/hadoop/common/lib

#pig -brief

Connecting to hadoop file system at: hdfs://pda1:9000

(1) Start the hadoop cluster
#start-dfs.sh
#start-yarn.sh
#mr-jobhistory-daemon.sh start historyserver   (note: this line must be run)
(2) Upload the file to the cluster
#hdfs dfs -ls /
#hdfs dfs -put /etc/passwd /
(3) Enter the Pig shell and run the same flow: load the file into A, splitting fields on ':', project the first column of A into B, and dump B.
#pig -x mapreduce -brief
grunt> A = load '/passwd' using PigStorage(':');
grunt> B = foreach A generate $0 as id;
grunt> dump B;
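Instead of dumping to the console, the result can also be written back to HDFS with store; the output path /passwd_ids below is just an example (it must not already exist):

grunt> store B into '/passwd_ids' using PigStorage(',');
grunt> fs -cat /passwd_ids/part*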
(4) Stop the cluster
#mr-jobhistory-daemon.sh stop historyserver
#stop-yarn.sh
#stop-dfs.sh

Origin blog.csdn.net/qq_20466211/article/details/112675623