Pig built-in functions

1 Introduction
Pig comes with a set of built-in functions: eval functions, load and store functions, math functions, string functions, and bag and tuple functions. There are two kinds of functions in Pig: built-in functions and custom UDFs (user-defined functions). They differ in two ways:
First: built-in functions do not need to be registered, because Pig itself knows where they are.
Second: built-in functions do not need to be qualified with a package path when used, because Pig itself knows where to find them.
2 Dynamic invokers
Java already has a large number of utility class libraries, so Pig lets us call the functions we need flexibly through reflection, without wrapping them in a UDF.
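As a minimal sketch of a dynamic invoker, the script below binds the static Java method java.net.URLDecoder.decode(String, String) to a Pig alias with InvokeForString. The alias UrlDecode, the file name encoded.txt, and the field name e are assumptions made for illustration.

```pig
-- Bind a static Java method to a Pig alias; the second argument lists the parameter types.
DEFINE UrlDecode InvokeForString('java.net.URLDecoder.decode', 'String String');

-- 'encoded.txt' is a hypothetical input with one URL-encoded string per line.
encoded = LOAD 'encoded.txt' USING PigStorage() AS (e:chararray);
decoded = FOREACH encoded GENERATE UrlDecode(e, 'UTF-8');
DUMP decoded;
```

The invoker is chosen by the return type of the Java method: besides InvokeForString there are InvokeForInt, InvokeForLong, InvokeForDouble, and InvokeForFloat.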

Currently, dynamic invokers can be used for any static function that:

accepts no parameters, or accepts some combination of string, int, long, double, and float parameters, or arrays of these types
returns a string, an int, a long, a double, or a float
3 Eval functions
3.1 Avg usage: AVG(expression) computes the average of the values in a column, ignoring null values. It is available after GROUP ALL or after grouping by a single column.
3.2 Concat usage: CONCAT(expression1, expression2) concatenates the values of two fields into one string. If either of them is null, the result is null.
3.3 Count usage: COUNT(expression) counts the number of elements in a bag, excluding null values. It requires a preceding GROUP.
3.4 Count_Star usage: COUNT_STAR(expression) works like COUNT. The difference is that COUNT_STAR includes null values in the count.
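The aggregate functions above might be combined as in the following sketch. The file scores.txt, its comma delimiter, and the field names are hypothetical.

```pig
-- 'scores.txt' is a hypothetical comma-separated file of (name, score) rows.
scores = LOAD 'scores.txt' USING PigStorage(',') AS (name:chararray, score:int);

-- Aggregate functions operate on the bag produced by GROUP.
by_name = GROUP scores BY name;
stats = FOREACH by_name GENERATE
    group AS name,
    AVG(scores.score) AS avg_score,         -- null scores are ignored
    COUNT(scores.score) AS scored_rows,     -- rows with a null score are not counted
    COUNT_STAR(scores.score) AS all_rows;   -- rows with a null score are counted

-- GROUP ALL puts every row into a single group, giving one overall average.
everyone = GROUP scores ALL;
overall = FOREACH everyone GENERATE AVG(scores.score) AS avg_score;

-- CONCAT needs two values of the same type, so the int is cast first.
-- The result is null when either name or score is null.
labels = FOREACH scores GENERATE CONCAT(name, (chararray)score) AS label;
```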






3.5 Diff usage: DIFF(expression1, expression2) compares two bag fields in a tuple and returns a bag of the tuples that appear in one bag but not in the other, similar to the diff function in Linux or Python.
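A short sketch of DIFF on a row that holds two bags. The file name bag_data and the schema are assumptions for illustration.

```pig
-- 'bag_data' is a hypothetical file whose rows hold two bags of (int, int) tuples.
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int, t2:int)}, B2:bag{T2:tuple(f1:int, f2:int)});

-- For each row, emit the tuples that appear in only one of the two bags.
-- When both bags hold the same tuples, the result is an empty bag.
X = FOREACH A GENERATE DIFF(B1, B2);
DUMP X;
```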

3.6 IsEmpty usage: IsEmpty(expression) checks whether a bag or map is empty (contains no data). It can be used in FILTER to filter data.
3.7 Max usage: MAX(expression) computes the largest numeric value in a single column, or the largest string in dictionary order. Like COUNT, it requires a preceding GROUP.
3.8 Min usage: MIN(expression) computes the smallest numeric value in a single column, or the smallest string in dictionary order. Like COUNT, it requires a preceding GROUP.
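IsEmpty, MAX, and MIN from 3.6 to 3.8 could be used together roughly as follows. The two input files and their fields are hypothetical.

```pig
-- Hypothetical inputs: a list of student ids, and (id, score) rows.
students = LOAD 'students.txt' USING PigStorage(',') AS (id:long);
scores = LOAD 'scores.txt' USING PigStorage(',') AS (id:long, score:int);

-- COGROUP yields one bag per input; a bag is empty when that input has no row for the key.
joined = COGROUP students BY id, scores BY id;
no_scores = FILTER joined BY IsEmpty(scores);

-- MAX and MIN take the grouped bag, in the same way as COUNT.
with_scores = FILTER joined BY NOT IsEmpty(scores);
ranges = FOREACH with_scores GENERATE
    group AS id,
    MAX(scores.score) AS best,
    MIN(scores.score) AS worst;
```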
3.9 PluckTuple usage: PluckTuple(prefix), bound to an alias with DEFINE, takes a string prefix and selects the columns of a relation whose names begin with that prefix.
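A sketch of PluckTuple after a join, where column names carry the a:: and b:: prefixes. The relation names, file names, and the alias pluck are chosen only for illustration.

```pig
a = LOAD 'a' AS (x, y);
b = LOAD 'b' AS (x, y);
c = JOIN a BY x, b BY x;

-- Keep only the joined columns whose names start with the prefix 'a::'.
DEFINE pluck PluckTuple('a::');
d = FOREACH c GENERATE FLATTEN(pluck(*));
DESCRIBE d;
```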

3.10 Size usage: SIZE(expression) computes the size of any Pig data type: the number of characters in a string, or the number of elements in a collection type (tuple, bag, or map).
3.11 Subtract usage: SUBTRACT(expression1, expression2) computes the difference of two bags and returns a new bag holding the tuples of the first bag that are not in the second.
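SIZE and SUBTRACT applied to bag fields might look like this. The file name bag_data and the schema are assumptions.

```pig
-- 'bag_data' is a hypothetical file whose rows hold two bags of (int, int) tuples.
A = LOAD 'bag_data' AS (B1:bag{T1:tuple(t1:int, t2:int)}, B2:bag{T2:tuple(f1:int, f2:int)});

X = FOREACH A GENERATE
    SIZE(B1) AS b1_count,             -- number of tuples in B1
    SUBTRACT(B1, B2) AS only_in_b1;   -- tuples of B1 that are not in B2
DUMP X;
```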

3.12 Sum usage: SUM(expression) sums the values of a column. Like the other aggregate functions, it requires a preceding GROUP.
3.13 Tokenize usage: TOKENIZE(expression, 'field_delimiter') splits a string on the specified delimiter and returns the resulting words in a bag. It is the classic building block of word count.
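The word count mentioned above can be sketched as follows, assuming a plain-text input file named input.txt with words separated by spaces.

```pig
-- 'input.txt' is a hypothetical text file; TextLoader yields one chararray field per line.
lines = LOAD 'input.txt' USING TextLoader() AS (line:chararray);

-- TOKENIZE returns a bag of words; FLATTEN turns that bag into one row per word.
words = FOREACH lines GENERATE FLATTEN(TOKENIZE(line, ' ')) AS word;

-- Group identical words and count the rows in each group.
grouped = GROUP words BY word;
counts = FOREACH grouped GENERATE group AS word, COUNT(words) AS n;
DUMP counts;
```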
4 Load/Store functions
Load and store functions determine how data is loaded into and written out of Pig. Pig provides a set of built-in load and store functions, and you can also write your own custom load and store functions as UDFs.
4.1 Handling compression
Compression support is determined by Pig's load and store functions.
PigStorage and TextLoader support gzip and bzip compression for both reading and writing. BinStorage does not support compression. To process gzip-compressed files, the input and output files must have a .gz extension. Gzip files cannot be split across multiple maps, which means that the number of maps equals the number of files.

To process bzip-compressed files, the input and output files must have a .bz or .bz2 extension. Bzip files can be split across multiple maps.

Pig reads and writes compressed files correctly only when the file really is compressed in the matching format. Simply renaming a file or appending a .gz or .bz suffix does not work.
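A minimal sketch of loading and storing compressed data with the default PigStorage. The file names are hypothetical, and the inputs are assumed to be genuinely compressed files.

```pig
-- The .gz extension tells PigStorage to decompress on load and to compress on store.
-- 'myinput.gz' must be a real gzip file, not a renamed plain-text file.
A = LOAD 'myinput.gz';
STORE A INTO 'myoutput.gz';

-- The same pattern applies to bzip, using a .bz or .bz2 extension.
B = LOAD 'myinput.bz';
STORE B INTO 'myoutput.bz';
```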



4.2 BinStorage
Loads and stores data in a machine-readable format. It is rarely used and has some bugs that lose type information, so it is not covered in detail here.
4.3 JsonLoader, JsonStorage
Load and store JSON data.
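As a rough sketch, JsonLoader can be given a schema string and JsonStorage can write the relation back out. The file names and the two fields are assumptions, and the input is assumed to hold one JSON object per line.

```pig
-- 'people.json' is a hypothetical file with one JSON object per line,
-- for example {"name":"alice","age":30}.
people = LOAD 'people.json' USING JsonLoader('name:chararray,age:int');

-- Write the relation back out as JSON.
STORE people INTO 'people_out' USING JsonStorage();
```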
4.4 PigDump
Stores data in UTF-8 format.
4.5 PigStorage
Loads and stores structured text data.
Usage: PigStorage(field_delimiter, options)
Parameter 1: the field delimiter, which must be enclosed in single quotes.
Parameter 2: options. These are rarely used and are not described in detail here.
This function is Pig's default load and store function, and it supports compression. The input can be a file, a directory, or a set of directories.
PigStorage stores and displays complex data types as follows:
Tuple: (item1,item2,item3). An empty tuple is valid and is stored as ().
Bag: {(tuple1),(tuple2),(tuple3)}. An empty bag is valid: {}.
Map: [key1#value1,key2#value2]. An empty map is valid: [].
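A small sketch of PigStorage with an explicit delimiter on both load and store. The file names and the schema are hypothetical.

```pig
-- Load comma-separated rows; the delimiter is passed in single quotes.
A = LOAD 'student.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:float);

-- Store the same relation using a different delimiter.
STORE A INTO 'student_out' USING PigStorage('|');
```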


4.6 TextLoader
Loads unstructured data in UTF-8 format. Each resulting tuple contains a single field that holds one line of input text. TextLoader also supports compression, with some limitations. TextLoader cannot be used to store data.

4.7 HBaseStorage
Loads and stores data from HBase tables.
Usage is similar to PigStorage: HBaseStorage('columns', 'options'), where you specify the columns as a delimited list of family:qualifier names, followed by the load options.
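A sketch of loading from HBase, assuming a table named users with an info column family. The table, column, and field names are hypothetical.

```pig
-- Columns are listed as family:qualifier, separated by spaces.
-- The -loadKey=true option returns the row key as the first field.
users = LOAD 'hbase://users'
    USING org.apache.pig.backend.hadoop.hbase.HBaseStorage(
        'info:first_name info:last_name', '-loadKey=true')
    AS (id:bytearray, first_name:chararray, last_name:chararray);
```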
4.8 AvroStorage
Loads and stores data from Avro files.
4.9 TrevniStorage
Loads and stores Trevni files.
5 Mathematical Functions
5.1 ABS Absolute value
5.2 ACOS Arc cosine
5.3 ASIN Arc Sine
5.4 ATAN Arc Tangent
5.5 CBRT Cube Root
5.6 CEIL Rounds up to the nearest integer
5.7 COS Cosine
5.8 COSH Hyperbolic cosine
5.9 EXP Exponent
5.10 FLOOR Rounds down to the nearest integer
5.11 LOG Natural logarithm (base e)
5.12 LOG10 Base 10 logarithm
5.13 RANDOM Generates a pseudo-random number greater than or equal to 0.0 and less than 1.0
5.14 ROUND Returns the nearest integer
5.15 SIN Sine
5.16 SINH Hyperbolic Sine
5.17 SQRT Square Root
5.18 TAN Tangent
5.19 TANH Hyperbolic Tangent
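The math functions are applied per row inside FOREACH. A minimal sketch, assuming a hypothetical file numbers.txt with one number per line:

```pig
nums = LOAD 'numbers.txt' AS (x:double);

out = FOREACH nums GENERATE
    ABS(x) AS abs_x,
    CEIL(x) AS ceil_x,        -- rounds up
    FLOOR(x) AS floor_x,      -- rounds down
    ROUND(x) AS round_x,      -- nearest integer
    SQRT(ABS(x)) AS root_x;   -- ABS keeps the argument non-negative
DUMP out;
```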
6 String Functions
6.1 EndsWith Usage: ENDSWITH('foobar', 'bar') returns true; tests whether a string ends with the given string
6.2 EqualsIgnoreCase Compares two strings, ignoring case
6.3 IndexOf Returns the index of the first occurrence of the search string in the source string
6.4 Last_Index_Of Returns the index of the last occurrence of the search string in the source string
6.5 Lower Converts to lowercase
6.6 Ltrim Removes leading spaces
6.7 Regex_Extract Extracts the part of a string matched by a regular expression
Usage: REGEX_EXTRACT(string, regex, index)
the first parameter: the original string,
the second parameter: the regular expression,
the third parameter: the index of the matched group to return.
An example is as follows:
Suppose we want to get the IP address from 192.168.1.5:8080. Writing it is very simple:
REGEX_EXTRACT('192.168.1.5:8080', '(.*):(.*)', 1) returns 192.168.1.5.


6.8 Regex_Extract_All
Returns a tuple of all the groups matched by the specified regular expression.
For example, REGEX_EXTRACT_ALL('192.168.1.5:8080', '(.*):(.*)') returns a tuple containing the two elements that were separated by the colon: (192.168.1.5,8080).
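Applied to a relation, the two regex functions might be used as in this sketch. The file hosts.txt and the field names are hypothetical.

```pig
-- 'hosts.txt' is a hypothetical file with one host:port string per line.
hosts = LOAD 'hosts.txt' AS (addr:chararray);

-- REGEX_EXTRACT returns a single matched group, selected by index.
ips = FOREACH hosts GENERATE REGEX_EXTRACT(addr, '(.*):(.*)', 1) AS ip;

-- REGEX_EXTRACT_ALL returns every group as a tuple; FLATTEN turns the tuple into fields.
pairs = FOREACH hosts GENERATE
    FLATTEN(REGEX_EXTRACT_ALL(addr, '(.*):(.*)')) AS (ip:chararray, port:chararray);
```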
6.9 Replace
Replaces existing characters in a string with new characters.
Usage: REPLACE(string, 'regExp', 'newChar');
6.10 Rtrim
Removes trailing spaces.
6.11 StartsWith
Tests whether a string starts with the given string.
6.12 StrSplit
Usage: STRSPLIT(string, regex, limit)
Splits a string around matches of the regular expression. limit is the maximum number of elements returned.

6.13 SubString
Extracts a substring from a string.
Usage: SUBSTRING(string, startIndex, stopIndex)
This is similar to taking a substring in Java.
6.14 Trim
Removes leading and trailing spaces.
6.15 Ucfirst
Converts the first letter of a string to uppercase.
6.16 Upper
Converts to uppercase.
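Several of the string functions above can be combined in one FOREACH. This is only a sketch; the file names.txt and the field s are hypothetical.

```pig
names = LOAD 'names.txt' AS (s:chararray);

out = FOREACH names GENERATE
    UPPER(TRIM(s)) AS cleaned,       -- strip surrounding spaces, then uppercase
    SUBSTRING(s, 0, 3) AS prefix,    -- characters at indexes 0, 1, 2 (stopIndex is exclusive)
    REPLACE(s, '-', '_') AS renamed, -- every match of the regex '-' becomes '_'
    STRSPLIT(s, ',', 2) AS parts;    -- tuple of at most 2 elements, split on ','
DUMP out;
```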
7 Date functions
7.1 AddDuration Adds a duration to the specified date
7.2 CurrentTime Returns the current timestamp
7.3 DaysBetween Returns the number of days between two dates
7.4 GetDay Gets the day of the month from a date
7.5 GetHour Gets the hour from a date
7.6 GetMilliSecond Gets the milliseconds from a date
7.7 GetMinute Gets the minutes from a date
7.8 GetMonth Gets the month from a date
7.9 GetSecond Gets the seconds from a date
7.10 GetWeek Gets the week of the year from a date
7.11 GetWeekYear Gets the week year from a date
7.12 GetYear Gets the year from a date
7.13 HoursBetween Returns the number of hours between two dates
7.14 MilliSecondsBetween Returns the number of milliseconds between two dates
7.15 MinutesBetween Returns the number of minutes between two dates
7.16 MonthsBetween Returns the number of months between two dates
7.17 SecondsBetween Returns the number of seconds between two dates
7.18 SubtractDuration Returns the date obtained by subtracting a duration from the specified date
7.19 ToDate Returns a DateTime object built from the parameters
7.20 ToMilliSeconds Converts a date to the number of milliseconds since the Unix epoch
7.21 ToString Converts a date to a string
7.22 ToUnixTime Converts a date to Unix time
7.23 WeeksBetween Returns the number of weeks between two dates
7.24 YearsBetween Returns the number of years between two dates
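A sketch of a few date functions working together. The file events.txt, its two timestamp columns, and the yyyy-MM-dd HH:mm:ss format are assumptions made for illustration.

```pig
-- 'events.txt' is a hypothetical file of "start,end" timestamps such as 2015-01-01 08:00:00.
events = LOAD 'events.txt' USING PigStorage(',') AS (start_s:chararray, end_s:chararray);

-- ToDate parses a string into a datetime using the given format.
dated = FOREACH events GENERATE
    ToDate(start_s, 'yyyy-MM-dd HH:mm:ss') AS start_dt,
    ToDate(end_s, 'yyyy-MM-dd HH:mm:ss') AS end_dt;

out = FOREACH dated GENERATE
    GetYear(start_dt) AS year,
    DaysBetween(end_dt, start_dt) AS days,           -- end minus start, in days
    AddDuration(start_dt, 'P1D') AS next_day,        -- ISO 8601 duration: one day
    ToString(start_dt, 'yyyy-MM-dd') AS day_str,
    ToUnixTime(start_dt) AS epoch_seconds;
DUMP out;
```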
8 Tuple, Bag, Map functions
8.1 TOTUPLE
Converts one or more expressions to a tuple.
8.2 TOBAG
Converts one or more expressions to a bag.
8.3 TOMAP
Converts key/value expression pairs into a map.
8.4 TOP
Returns the top n tuples from a bag of tuples.
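The four functions in this section might be used as follows. The file student.csv and its schema are hypothetical.

```pig
a = LOAD 'student.csv' USING PigStorage(',') AS (name:chararray, age:int, gpa:double);

b = FOREACH a GENERATE
    TOTUPLE(name, age) AS t,              -- tuple of the two fields
    TOBAG(name, (chararray)age) AS bg,    -- bag with one tuple per expression
    TOMAP(name, gpa) AS m;                -- map with name as the key and gpa as the value

-- TOP(n, column, bag): the 2 tuples with the largest value in column 1 (age, counted from 0).
g = GROUP a ALL;
oldest = FOREACH g GENERATE TOP(2, 1, a);
DUMP oldest;
```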





