The meaning and usage of some keywords in Pig

This article mainly sorts out the meaning and usage of some keywords in Pig. Although Pig is a framework centered on data stream processing, most of the keywords and operations of a database have counterparts in Pig, and they are very flexible and concise. This is the last article before the Spring Festival; I wish you all a happy Spring Festival!
1. Reserved keywords:
-- A assert, and, any, all, arrange, as, asc, AVG
-- B bag, BinStorage, by, bytearray, BIGINTEGER, BIGDECIMAL
-- C cache, CASE, cat, cd, chararray, cogroup, CONCAT, copyFromLocal, copyToLocal, COUNT, cp, cross
-- D datetime, %declare, %default, define, dense, desc, describe, DIFF, distinct, double, du, dump
-- E e, E, eval, exec, explain
-- F f, F, filter, flatten, float, foreach, full
-- G generate, group
-- H help
-- I if, illustrate, import, inner, input, int, into, is
-- J join
-- K kill
-- L l, L, left, limit, load, long, ls
-- M map, matches, MAX, MIN, mkdir, mv
-- N not, null
-- O onschema, or, order, outer, output
-- P parallel, pig, PigDump, PigStorage, pwd
-- Q quit
-- R register, returns, right, rm, rmf, rollup, run
-- S sample, set, ship, SIZE, split, stderr, stdin, stdout, store, stream, SUM
-- T TextLoader, TOKENIZE, through, tuple
-- U union, using
-- V void (there are no reserved keywords under W, X, Y, or Z)
2. Case sensitivity: aliases are case-sensitive, while keywords are not. For example, load, group, and foreach are equivalent to LOAD, GROUP, and FOREACH.
3. Alias definition: the first character must be a letter; the remaining characters may be letters, digits, or underscores.
4. Collection types
Bag: similar to a table; can contain multiple rows
Tuple: similar to a row; can contain multiple fields
Field: a specific piece of data
5. Column references. In a relational database we use column names to locate the value of a field within a row of data; in JDBC we can reference a column either by name or by index. Pig also supports both kinds of reference; positional references are written with a dollar sign and a number, such as $0 and $1.
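A minimal sketch of both reference styles, assuming a hypothetical people.txt with two comma-separated columns:
A = LOAD 'people.txt' USING PigStorage(',') AS (name:chararray, age:int);
-- $0 and the alias name point at the same column
B = FOREACH A GENERATE $0, age;
DUMP B;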
6. Data types
(basic types)
Int: signed 32-bit integer
Long: signed 64-bit integer
Float: 32-bit single-precision floating-point number
Double: 64-bit double-precision floating-point number
Chararray: a string, like String in Java; must be UTF-8 encoded
Bytearray: a blob of bytes
Boolean: Boolean type
Datetime: Date type
Biginteger: Java BigInteger
Bigdecimal: Java BigDecimal
(collection types)
Tuple: an ordered collection of field values, similar to a List in Java
Bag: a collection of tuples, similar to Java's Collection super-interface
Map: like a Java Map; the key and value are separated by #, and you also use # when referencing a value
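A minimal sketch of accessing the collection types, assuming a hypothetical prefs.txt whose columns hold a name, a tuple literal such as (1,2), and a map literal such as [city#beijing]:
A = LOAD 'prefs.txt' AS (name:chararray, pos:tuple(x:int, y:int), props:map[]);
-- '.' dereferences a tuple field, '#' dereferences a map key
B = FOREACH A GENERATE name, pos.x, props#'city';
DUMP B;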
7. Operators:
(1) Comparison operators: ==, !=, <, >, >=, <=
(2) Comparison operator matches: applies to strings and supports regular expressions
(3) Arithmetic operators: +, -, *, /, %, the bincond ?:, and CASE
(4) Null operators: is null, is not null
(5) Collection type reference symbols: tuple (.), map (#)
(6) Relational operators: cogroup, group, join
(7) Built-in functions: COUNT_STAR, SUM, MIN, MAX, COUNT, AVG, CONCAT, SIZE
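A small sketch of matches and the ?: bincond, assuming a relation A with name and age fields:
-- matches takes a regular expression; ?: works like Java's ternary operator
J = FILTER A BY name matches 'J.*';
K = FOREACH A GENERATE name, (age >= 18 ? 'adult' : 'minor') AS stage;
DUMP K;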
8. When joining multiple data sources, fields with the same name are distinguished by their alias, using A::name and B::name.
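A brief sketch, assuming hypothetical relations A and B that both contain a name field:
C = JOIN A BY name, B BY name;
-- the duplicated column must now be qualified with its source alias
D = FOREACH C GENERATE A::name, B::name;
DUMP D;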
9. flatten can un-nest a collection or nested type, see the following example:
B = {(a,b,c),(b,b,c)}
After FLATTEN(B), each inner tuple becomes its own row of data:
(a,b,c)
(b,b,c)
Flattening a tuple, by contrast, promotes its fields into columns of the current row.
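A classic word-count-style sketch of flatten, assuming a hypothetical lines.txt of free text:
A = LOAD 'lines.txt' AS (line:chararray);
-- TOKENIZE returns a bag of words; FLATTEN yields one row per word
B = FOREACH A GENERATE FLATTEN(TOKENIZE(line)) AS word;
C = GROUP B BY word;
D = FOREACH C GENERATE group AS word, COUNT(B) AS cnt;
DUMP D;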
10. cogroup: groups multiple relations at the same time.
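A short sketch, assuming hypothetical relations A and B that each have a name field:
-- each output row holds the group key plus one bag of matching tuples per relation
C = COGROUP A BY name, B BY name;
DUMP C;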
11. cross: combines two data sources by producing their Cartesian product.
12. distinct: de-duplication. Unlike in relational databases, a single field cannot be de-duplicated on its own; distinct applies to whole rows. To de-duplicate a single field, first project that field out by itself, then apply distinct, as sketched below.
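A sketch of single-field de-duplication, assuming a relation A with a name field:
N = FOREACH A GENERATE name;  -- project the single field first
U = DISTINCT N;
DUMP U;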
13. filter: filtering, similar to a database WHERE condition; the filter expression must return a boolean value.
14. foreach: iterates over a relation, extracting one or several columns of data.
15. group: grouping, similar to GROUP BY in a database.
16. partition by: specifies a custom Partitioner, the same Partitioner component as in Hadoop.
17. join: inner and outer joins, similar to relational databases. Under Hadoop there are different join strategies: replicated (copy) join, merge join, skewed join, and so on.
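A sketch of selecting a join strategy, assuming a large relation big and a small relation small that share an id field:
-- USING 'replicated' copies the small relation into memory on each map task
C = JOIN big BY id, small BY id USING 'replicated';
DUMP C;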
18. limit: restricts the number of rows returned in the result set, similar to the LIMIT keyword in MySQL.
19. load: the Pig keyword responsible for loading a data source from a specified path; the path can use wildcards, consistent with Hadoop's path globbing.
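A sketch of load with a glob path, plus the matching store (item 25), using hypothetical paths:
A = LOAD '/logs/2014-01-*/part-*' USING PigStorage('\t') AS (ip:chararray, url:chararray);
STORE A INTO '/output/logs_merged' USING PigStorage(',');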
20. mapreduce: executes a jar package in native MapReduce mode from within Pig.
21. order by: similar to ORDER BY in relational databases.
22. rank: given a relation, generates an incrementing sequence number for each row, similar to the index in a for loop.
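A sketch of order by and rank, assuming a relation A with an age field (rank is available in Pig 0.11 and later):
B = ORDER A BY age DESC;
R = RANK A BY age DESC;  -- prepends a rank column to every row
DUMP R;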
23. sample: a sampler; randomly draws an approximate fraction of records from a data set (the argument is a sampling rate between 0 and 1, not an exact row count).
24. split: splits a large data set into several different small data sets according to conditions.
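A sketch of sample and split, assuming a relation A with an age field:
T = SAMPLE A 0.1;  -- keep roughly 10% of the rows
SPLIT A INTO minors IF age < 18, adults IF age >= 18;
DUMP adults;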
25. store: Pig's keyword for persisting results; stores a relation to a specified location using a specified storage function (see the load sketch under item 19).
26. stream: provides a streaming way to interact with other languages from within a Pig script, for example passing Pig's intermediate results to Python, Perl, or shell scripts.
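A sketch of streaming rows through an external command, assuming a hypothetical filter.py shipped with the job:
-- each row of A is piped to the script's stdin; its stdout becomes relation B
DEFINE cmd `python filter.py` SHIP('filter.py');
B = STREAM A THROUGH cmd;
DUMP B;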
27. union: like UNION in SQL; merges two result sets into a single result set.
28. register: registers our UDF components with this keyword; the component may be a jar package or a Python file.
29. define: defines an alias for referencing a UDF.
30. import: within a Pig script, use the import keyword to include another Pig script (typically a file of macros).
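A combined sketch of register, define, and import, with hypothetical file and class names:
REGISTER myudfs.jar;                       -- make the jar's classes visible
DEFINE ToUpper com.example.pig.ToUpper();  -- short alias for the UDF
IMPORT 'common_macros.pig';                -- pull in macros from another script
B = FOREACH A GENERATE ToUpper(name);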
