This example is based on Hadoop 2.7.3 running as a pseudo-distributed cluster.
1. Copy the dataset into the /user/root directory of the Hadoop cluster's HDFS
hdfs dfs -copyFromLocal 2008.csv /user/root
2. Write the totalmiles.pig script
records = LOAD '2008.csv' USING PigStorage(',') AS
(Year,Month,DayofMonth,DayOfWeek,DepTime,CRSDepTime,ArrTime,CRSArrTime,UniqueCarrier,FlightNum,TailNum,ActualElapsedTime,CRSElapsedTime,AirTime,ArrDelay,DepDelay,Origin,Dest,Distance:int,TaxiIn,TaxiOut,Cancelled,CancellationCode,Diverted,CarrierDelay,WeatherDelay,NASDelay,SecurityDelay,LateAircraftDelay);
milage_recs = GROUP records ALL;
tot_miles = FOREACH milage_recs GENERATE SUM(records.Distance);
STORE tot_miles INTO '/user/root/totalmiles';
LOAD: reads a file in HDFS, or all files under a directory.
USING: by default Pig splits each line on tabs; USING PigStorage(',') tells Pig to split on commas instead.
AS (...): HDFS stores raw bytes, so the AS clause maps each parsed field onto a schema that Pig understands. Here only Distance is declared as int, since it is the field being summed.
GROUP ... ALL: groups every record into a single bag, so an aggregate can be computed over the whole relation.
FOREACH A GENERATE B: evaluates expression B over relation A; here SUM collapses the bag of Distance values into a single total.
STORE ... INTO: writes the result back to HDFS.
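The LOAD / GROUP ALL / SUM pipeline above can be sketched in plain Python. This is only an illustration of what the script computes, not how Pig executes it; the two sample rows are made up, but they follow the 29-column layout declared in the AS clause, with Distance as the 19th field (index 18):

```python
import csv
import io

# A tiny stand-in for 2008.csv: same 29-column layout as the AS clause,
# with Distance in column 19 (index 18). These rows are fabricated.
sample = io.StringIO(
    "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA,810,4,8,0,,0,,,,,\n"
    "2008,1,3,4,754,735,1002,1000,WN,3231,N772SW,128,145,113,2,19,IAD,TPA,810,5,10,0,,0,,,,,\n"
)

# LOAD ... USING PigStorage(',')  ->  parse each line on commas
records = list(csv.reader(sample))

# AS (... Distance:int ...)  ->  cast the Distance field (index 18) to int
distances = [int(row[18]) for row in records]

# GROUP records ALL; FOREACH ... GENERATE SUM(...)  ->  one total over all rows
tot_miles = sum(distances)
print(tot_miles)  # 810 + 810 = 1620 for the two sample rows
```

Run against the full 2008.csv, the same logic yields the single total that Pig writes to /user/root/totalmiles.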
3. Execute the totalmiles.pig script from the command line
pig -x mapreduce totalmiles.pig
Tip: the -x flag selects the execution engine (mapreduce, spark, tez, local, etc.); choose the one that matches the computing framework your cluster runs.
Run output (last lines of the log):
Details at logfile: /usr/test/code/pig_1516001376428.log
2018-01-14 23:29:39,112 [main] INFO org.apache.pig.Main - Pig script completed in 3 seconds and 128 milliseconds (3128 ms)
4. View the results
hdfs dfs -cat /user/root/totalmiles/part-r-00000
Result:
[root@slave1 code]# hdfs dfs -cat /user/root/totalmiles/part-r-00000
5091775499
References:
1. "Hadoop For Dummies"
2. "Apache Pig Getting Started"