A summary of common Hive functions

  1. about date

  1. datediff(date1, date2)

  • Return the number of days from date2 to date1 (that is, date1 minus date2);
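As a quick check of the argument order, here is a minimal sketch with literal dates. The dates are illustrative and not from the original post; the first call mirrors the example in the Hive function documentation.

```sql
-- datediff(date1, date2): days counted from the second argument to the first
SELECT datediff('2009-03-01', '2009-02-27');  -- 2

-- Swapping the arguments flips the sign
SELECT datediff('2009-02-27', '2009-03-01');  -- -2
```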

  2. date_sub(start_day, num_days)

  • Return the date of num_days before start_day;
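A small sketch of date_sub with literal values (chosen for illustration only):

```sql
-- One day before the last day of 2008
SELECT date_sub('2008-12-31', 1);   -- 2008-12-30

-- 31 days before the same date lands on the last day of November
SELECT date_sub('2008-12-31', 31);  -- 2008-11-30
```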

  3. timestampdiff()

  • timestampdiff(year|month|day|hour|minute|second, date1, date2)

  • Of the two values being compared, put the earlier time first (date1) and the later time second (date2).
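The signature above matches MySQL's TIMESTAMPDIFF, which returns date2 minus date1 in the requested unit. The sketch below assumes a MySQL-style timestampdiff is available in your environment; that is an assumption to verify, since not every Hive installation provides this function. The values follow the examples in the MySQL manual.

```sql
-- Assumption: MySQL-style timestampdiff(unit, date1, date2) = date2 - date1
-- Earlier time first, later time second: positive result
SELECT timestampdiff(MONTH, '2003-02-01', '2003-05-01');  -- 3

-- Later time first: the result becomes negative
SELECT timestampdiff(YEAR, '2002-05-01', '2001-01-01');   -- -1
```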

  2. window function

1. FIRST_VALUE () function

  • The FIRST_VALUE() function returns the first value in the window, up to the current row

SELECT *,
    FIRST_VALUE(order_price) OVER (PARTITION BY user_id ORDER BY order_price) AS firstvalue 
FROM order_content;

2. LAST_VALUE() function

  • The LAST_VALUE() function returns the last value in the window, up to the current row

  • last_value(field, true): Skips rows where the field is NULL and returns the last non-null value.

SELECT *,
    LAST_VALUE(order_price) OVER (PARTITION BY user_id ORDER BY order_price) AS lastvalue
FROM order_content;
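When the OVER clause has an ORDER BY but no explicit frame, the window ends at the current row, so LAST_VALUE() usually just returns the current row's own value. If the last value of the whole partition is wanted, one option is to widen the frame explicitly. The following is a sketch against the same order_content table; the aliases partition_last and partition_last_non_null are made up for illustration.

```sql
SELECT *,
       -- Frame covers the whole partition, so every row sees the same last value
       LAST_VALUE(order_price) OVER (
           PARTITION BY user_id
           ORDER BY order_price
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS partition_last,
       -- Second argument true: NULL values are skipped
       LAST_VALUE(order_price, true) OVER (
           PARTITION BY user_id
           ORDER BY order_price
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       ) AS partition_last_non_null
FROM order_content;
```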

3. lag() and lead()

These two functions return a column's value from the row that is a given number of rows above or below the current row in the result set.

  • lag(): Returns the value from the row n rows above the current row

  • This function has three parameters: the first is the column to query, the second is the number of rows to offset upward, and the third is the default value returned when the offset goes past the upper boundary.

-- Query the age from the row 1 position above

SELECT user_id,
       user_age,
       lag(user_age, 1, 0) over(ORDER BY user_id) RESULT
FROM user_info;

  • lead(): Returns the value from the row n rows below the current row

  • This function has three parameters: the first is the column to query, the second is the number of rows to offset downward, and the third is the default value returned when the offset goes past the bottom boundary.

-- Query the age from the row 2 positions below

SELECT user_id,
       user_age,
       lead(user_age, 2, 0) over(ORDER BY user_id)
FROM user_info;

  3. other

  • COALESCE()

Oracle usually handles null values with NVL, while MySQL commonly uses IFNULL. The two functions are similar, and both are two-argument special cases of a more general function, COALESCE().

COALESCE() function definition: returns the value of the first non-null expression in the list. Returns null if all expressions evaluate to null

The COALESCE() function has two usages:

  1. COALESCE ( expression1, expression2 );

  2. COALESCE ( expression1, expression2, ... expression-n );

The first form is equivalent to NVL in Oracle or IFNULL in MySQL, and can be written as the expression:

CASE WHEN expression1 IS NOT NULL THEN expression1 ELSE expression2 END;

The second form can contain n expressions: if the first one is not null, it is returned; otherwise the next one is checked, and so on. If all of them are null, the result is null.

Note: in Vertica, an empty string is not the same as a null value
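Both usages can be tried with literals. This is a minimal sketch; the string values are illustrative, and the explicit CAST(NULL AS STRING) is only there to give the NULL literal a type.

```sql
-- Two-argument form: behaves like NVL / IFNULL
SELECT COALESCE(CAST(NULL AS STRING), 'fallback');                       -- fallback

-- n-argument form: the first non-null expression wins
SELECT COALESCE(CAST(NULL AS STRING), CAST(NULL AS STRING), 'third');    -- third

-- All arguments null: the result is NULL
SELECT COALESCE(CAST(NULL AS STRING), CAST(NULL AS STRING));             -- NULL

-- An empty string is not NULL, so it is returned as-is
SELECT COALESCE('', 'fallback');                                         -- (empty string)
```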


  4. UDTF

  • EXPLODE(col): Split the complex array or map structure in one column of the hive table into multiple rows.

  • Single Column Explode

Requirement: Turn the data in the student column from one row into multiple rows (using split and explode together with LATERAL VIEW)

select
    class,student_name
from
    default.class_info
    lateral view explode(split(student,',')) t as student_name;

  • posexplode(): Works like explode(), and additionally returns the position (index, starting at 0) of each element.

Requirement: Number the students in each class in order (using the posexplode function)

select
    class,student_index + 1 as student_index,student_name
from
    default.class_info
    lateral view posexplode(split(student,',')) t as student_index,student_name;

  • LATERAL VIEW

Usage: LATERAL VIEW udtf(expression) tableAlias AS columnAlias

Explanation: It is used together with a UDTF such as explode (often fed by split). It can turn one column of data into multiple rows, and the resulting rows can then be aggregated. The lateral view first calls the UDTF for each row of the original table, the UDTF turns that row into one or more rows, and the results are then combined into a virtual table that can be given a table alias.
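To illustrate aggregating over the exploded rows, here is a sketch that counts the students in each class. It reuses the default.class_info table from the examples above; the alias student_count is made up for illustration.

```sql
SELECT class,
       count(*) AS student_count   -- one exploded row per student
FROM default.class_info
     LATERAL VIEW explode(split(student, ',')) t AS student_name
GROUP BY class;
```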


5. OUTSIDE

  • CONCAT(string A/col, string B/col...): Returns the result of concatenating the input strings, supporting any number of input strings;

  • CONCAT_WS(separator, str1, str2,...):

  • It is a special form of CONCAT(). The first parameter is the separator placed between the remaining parameters.

  • The separator can be a string, like the rest of the arguments.

  • If the separator is NULL, the return value is also NULL.

  • This function skips any NULL values after the separator parameter; empty strings are not skipped.

  • The separator is added between the concatenated strings;

  • Note: the arguments of CONCAT_WS must be string or array<string>.

  • COLLECT_SET(col): Collects the values of a field into an array-type field, removing duplicates.

  • COLLECT_LIST(col): Collects the values of a field into an array-type field, keeping duplicates.
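The functions in this section can be sketched as follows. The literals are illustrative; the last query reuses default.class_info from the UDTF section and simply reverses the explode, with the subquery alias exploded made up for the example. The order of elements produced by collect_set is not guaranteed.

```sql
SELECT concat('a', 'b', 'c');                            -- abc
SELECT concat_ws('-', 'a', 'b', 'c');                    -- a-b-c

-- NULL arguments after the separator are skipped
SELECT concat_ws('-', 'a', CAST(NULL AS STRING), 'c');   -- a-c

-- A NULL separator makes the whole result NULL
SELECT concat_ws(CAST(NULL AS STRING), 'a', 'b');        -- NULL

-- Many rows back into one string per class: collect_set builds an
-- array<string> without duplicates, concat_ws joins it with commas
SELECT class,
       concat_ws(',', collect_set(student_name)) AS students
FROM (
    SELECT class, student_name
    FROM default.class_info
         LATERAL VIEW explode(split(student, ',')) t AS student_name
) exploded
GROUP BY class;
```

Replacing collect_set with collect_list in the last query would keep duplicate names.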

Origin blog.csdn.net/m0_57126939/article/details/129689587