Commonly used functions and syntax in Hive and examples of business scenarios

aggregate function

collect_list - collect column values into an array

The collect_list function collects the values of the specified column into an array and returns that array as the result. It is usually used together with a GROUP BY clause to gather all values that share the same key into one array.

The syntax of collect_list is:

collect_list(column)

where column is the column name or expression whose values are collected.

For example, to group a table by the category column and collect the product values of each group into an array, you can use the following query:

SELECT category, collect_list(product) AS products
FROM your_table
GROUP BY category;

This query groups rows by the category column and, for each group, creates an array named products containing all product values in that group.

Precautions:

  • collect_list returns an array that can contain duplicate values.
  • To remove duplicate values from the array, use the collect_set function instead.
  • collect_list requires all values collected for a group to fit in memory, so be aware of memory constraints when dealing with large data volumes. If a group is too large, consider another way to perform the aggregation.
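The difference between collect_list and collect_set can be seen with a small sketch. The inline table purchases and its rows are hypothetical, made up only for this illustration, and the query assumes a Hive version that supports WITH and SELECT without FROM:

```sql
-- Hypothetical inline data: customer 101 bought 'iPhone' twice.
WITH purchases AS (
  SELECT 101 AS customer_id, 'iPhone' AS product
  UNION ALL
  SELECT 101 AS customer_id, 'iPhone' AS product
  UNION ALL
  SELECT 101 AS customer_id, 'MacBook Pro' AS product
)
SELECT customer_id,
       collect_list(product)       AS all_products,       -- keeps the duplicate 'iPhone'
       collect_set(product)        AS distinct_products,  -- each product appears once
       size(collect_list(product)) AS n_all,              -- 3
       size(collect_set(product))  AS n_distinct          -- 2
FROM purchases
GROUP BY customer_id;
```

The order of elements inside either array is not guaranteed, so only the element counts are shown as expected values.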

Function example usage:

Suppose there is a table named orders with the following columns: order_id (order ID), customer_id (customer ID), and product (product name).

+----------+-------------+-------------------+
| order_id | customer_id | product           |
+----------+-------------+-------------------+
| 1        | 101         | iPhone            |
| 2        | 101         | MacBook Pro       |
| 3        | 102         | iPad              |
| 4        | 102         | Apple Watch       |
| 5        | 102         | AirPods           |
+----------+-------------+-------------------+

To group by customer_id and collect the product values of each group into an array, you can use the following query:

SELECT customer_id, collect_list(product) AS products
FROM orders
GROUP BY customer_id;

The query results will look like this:

+-------------+------------------------------------+
| customer_id | products                           |
+-------------+------------------------------------+
| 101         | ["iPhone", "MacBook Pro"]          |
| 102         | ["iPad", "Apple Watch", "AirPods"] |
+-------------+------------------------------------+

For each customer_id group, collect_list collects the product values in that group into an array and returns the array as the result. The order of elements within each array is not guaranteed.

size - Returns the number of array or Map elements

The SIZE() function returns the number of elements in an array or Map. It is used to count the elements stored in a column of a collection type.

Function introduction: SIZE(collection) accepts one argument of a collection type (array or Map) and returns the number of elements in that collection.

Example: Suppose you have a table employees that contains an employee ID (employee_id) and an array of skills (skills). To calculate the number of skills each employee has, use the SIZE() function:

SELECT employee_id, SIZE(skills) AS num_skills
FROM employees;

This returns a result set containing each employee ID and the number of skills that employee has. SIZE() calculates the size of each array and returns it as the value of the num_skills column.

SIZE() can also count the key-value pairs in a Map. Suppose there is a table product_sales that contains a product ID (product_id) and a Map of sales keyed by month (sales_by_month). To calculate the number of months with sales for each product, use the SIZE() function:

SELECT product_id, SIZE(sales_by_month) AS num_months
FROM product_sales;

This returns a result set with each product ID and the number of months in which it had sales. SIZE() counts the key-value pairs in each Map and returns the count as the value of the num_months column.

The SIZE() function is one of the common functions used to calculate the size of collection types (array and map) in Hive. It can help us to conduct statistics and analysis of the number of elements in the collection, so as to gain insight into the structure and characteristics of the data.
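As a quick check of the behaviour described above, SIZE() can be applied to literal collections built with array() and map(). The second query is a defensive sketch for NULL collections: Hive's implementation has historically returned -1 rather than NULL for size(NULL), which is an assumption worth verifying on your version, so the guard makes the intended result explicit.

```sql
-- Literal collections: no table needed.
SELECT size(array('sql', 'python', 'java'))    AS n_array,  -- 3
       size(map('2023-01', 10, '2023-02', 12)) AS n_map;    -- 2

-- Guard against NULL collections before counting
-- (employees and skills are the table and column from the example above).
SELECT employee_id,
       IF(skills IS NULL, 0, size(skills)) AS num_skills
FROM employees;
```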

Example business scenario:

  1. Social media analysis: Suppose there is a user table of a social media platform, which contains a user ID (user_id) and the user's friend list (friends). The friend list is an array that stores friend IDs. To analyze the distribution of the number of friends per user, calculate the size of the friend list with the SIZE() function:
SELECT user_id, SIZE(friends) AS num_friends
FROM user_friends;
  2. Shopping basket analysis: Suppose there is an order table of an e-commerce platform, which contains an order ID (order_id) and a product list (items). The product list is an array of product IDs representing the products purchased in one order. To analyze the number of items purchased in each order, calculate the size of the item list with the SIZE() function:
SELECT order_id, SIZE(items) AS num_items
FROM orders;
  3. Log analysis: Suppose there is a log table that records users visiting web pages, including a user ID (user_id) and the list of visited pages (pages). The page list is an array of page URLs. To analyze the distribution of the number of pages visited per user, calculate the size of the page list with the SIZE() function:
SELECT user_id, SIZE(pages) AS num_pages
FROM user_logs;

length - returns the length of the string

The length() function returns the length (number of characters) of a string. It takes a string as its argument and returns the number of characters in that string.

Example of how to use the function:

SELECT length('Hello, World!') AS str_length;

Output result:

str_length
------------
13

In the above example, length('Hello, World!') returns the number of characters in the string 'Hello, World!', which is 13.

The LENGTH() function is mainly used on string values, for example to check a length limit or to work out positions before extracting a substring.

The count includes spaces and special characters.
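A length check of this kind might look like the following sketch. The table name companies is hypothetical, and the expected length of 18 characters for credit_code is an assumption made for illustration:

```sql
-- Hypothetical table `companies`; credit_code is assumed to be 18 characters long.
-- List the rows whose trimmed code has a different length.
SELECT credit_code,
       length(trim(credit_code)) AS code_length
FROM companies
WHERE length(trim(credit_code)) <> 18;
```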

window function

lag - get the value of the row preceding the current row in the result set

The LAG() function is used to obtain the value of the row preceding the current row in the result set. It can be used to perform window function operations, giving each row the value of the previous row.

Function introduction: LAG(expression, offset, default_value) returns the value of expression from the row that lies offset rows before the current row. offset defaults to 1. If there is no such row (for example, the current row is the first row), the function returns default_value, or NULL when no default is given. It is used with an OVER clause that contains an ORDER BY clause.

Example business scenario:

  1. Sales growth rate calculation: Suppose there is a sales data table that contains the month (month) and the monthly sales (sales_amount). To calculate the sales growth rate for each month, use LAG() to get the sales of the previous month. No default value is passed, so the first month, which has no previous month, gets a NULL growth rate instead of a misleading number:
SELECT month, sales_amount,
    (sales_amount - LAG(sales_amount, 1) OVER (ORDER BY month)) / LAG(sales_amount, 1) OVER (ORDER BY month) AS sales_growth_rate
FROM sales_data;
  2. User behavior analysis: Suppose there is a user log table that contains a user ID (user_id) and a login time (login_time). To analyze the interval between logins, use LAG() to get the previous login time and subtract it. The query below converts both times to seconds with unix_timestamp(), so time_interval is in seconds; the first login of each user gets 0 because its default is the login time itself:
SELECT user_id, login_time,
    unix_timestamp(login_time) - unix_timestamp(LAG(login_time, 1, login_time) OVER (PARTITION BY user_id ORDER BY login_time)) AS time_interval
FROM user_logs;
  3. Calculation of inventory changes: Suppose there is an inventory transaction table that contains a product ID (product_id), a transaction date (transaction_date) and a transaction quantity (transaction_quantity). To calculate how each transaction quantity differs from the previous one for the same product, use LAG() to get the quantity of the previous transaction:
SELECT product_id, transaction_date, transaction_quantity,
    transaction_quantity - LAG(transaction_quantity, 1, 0) OVER (PARTITION BY product_id ORDER BY transaction_date) AS inventory_change
FROM inventory_transactions;

In these examples, the LAG() function is used to obtain the value of the previous row in the result set for related calculations or analysis. In this way, it can easily handle time series, comparison of previous and subsequent rows of data, etc., and help in deeper data analysis and insight. Depending on specific business needs, you can combine other functions and clauses to build complex analytical queries.
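To see what LAG() returns row by row, here is a minimal sketch on inline data. The table monthly_sales, its column names and its three rows are hypothetical and exist only for this illustration:

```sql
WITH monthly_sales AS (
  SELECT '2023-01' AS sales_month, 100 AS sales_amount
  UNION ALL
  SELECT '2023-02' AS sales_month, 120 AS sales_amount
  UNION ALL
  SELECT '2023-03' AS sales_month, 90 AS sales_amount
)
SELECT sales_month,
       sales_amount,
       -- value from the previous row in sales_month order (NULL on the first row)
       LAG(sales_amount, 1) OVER (ORDER BY sales_month) AS prev_amount,
       sales_amount - LAG(sales_amount, 1) OVER (ORDER BY sales_month) AS change_vs_prev
FROM monthly_sales
ORDER BY sales_month;
-- sales_month  sales_amount  prev_amount  change_vs_prev
-- 2023-01      100           NULL         NULL
-- 2023-02      120           100          20
-- 2023-03      90            120          -30
```

Leaving out the third argument keeps the first row NULL, which is usually easier to interpret than a made-up default.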

lead - get the value of the row after the current row in the result set

The LEAD() function is used to get the value of the row following the current row in the result set. It can be used to perform window function operations, providing each row with the value of the next row.

Function introduction: LEAD(expression, offset, default_value) returns the value of expression from the row that lies offset rows after the current row. offset defaults to 1. If there is no such row (for example, the current row is the last row), the function returns default_value, or NULL when no default is given. It is used with an OVER clause that contains an ORDER BY clause.

Example business scenario:

  1. Periodic data analysis: Suppose there is a sales data table that contains a product ID (product_id), a sales date (sale_date) and a sales volume (sales_quantity). To calculate how sales change from each day to the next recorded day, use LEAD() to get the next day's sales volume. No default value is passed, so the last row of each product gets NULL rather than a false drop of 100%:
SELECT product_id, sale_date, sales_quantity,
    (LEAD(sales_quantity, 1) OVER (PARTITION BY product_id ORDER BY sale_date) - sales_quantity) / sales_quantity AS sales_growth_rate
FROM sales_data;
  2. User activity analysis: Suppose there is a user activity table that contains a user ID (user_id) and an active date (active_date). To analyze consecutive activity, use LEAD() to get the user's next active date and compute the gap in days. A gap of 1 means the user was active on consecutive days; the last row of each user gets 0 because its default is the active date itself:
SELECT user_id, active_date,
    DATEDIFF(LEAD(active_date, 1, active_date) OVER (PARTITION BY user_id ORDER BY active_date), active_date) AS days_to_next_active
FROM user_activity;
  3. Stock data analysis: Suppose there is a stock trading table that contains a stock code (stock_code), a trade date (trade_date) and a closing price (closing_price). To calculate the price change from each trading day to the next, use LEAD() to get the next closing price. The last row of each stock gets 0 because its default is the current closing price:
SELECT stock_code, trade_date, closing_price,
    (LEAD(closing_price, 1, closing_price) OVER (PARTITION BY stock_code ORDER BY trade_date) - closing_price) / closing_price AS price_change_rate
FROM stock_data;

In these examples, the LEAD() function obtains the value of the next row in the result set for related calculations or analysis. This makes it easy to handle time series and comparisons between a row and the rows that follow it.
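The row-by-row behaviour of LEAD() can be sketched on inline data. The table activity and its three rows are hypothetical and exist only for this illustration:

```sql
WITH activity AS (
  SELECT 1 AS user_id, '2023-06-01' AS active_date
  UNION ALL
  SELECT 1 AS user_id, '2023-06-02' AS active_date
  UNION ALL
  SELECT 1 AS user_id, '2023-06-05' AS active_date
)
SELECT user_id,
       active_date,
       -- next active date of the same user (NULL on the user's last row)
       LEAD(active_date, 1) OVER (PARTITION BY user_id ORDER BY active_date) AS next_active_date,
       datediff(LEAD(active_date, 1) OVER (PARTITION BY user_id ORDER BY active_date),
                active_date) AS gap_days
FROM activity
ORDER BY active_date;
-- active_date  next_active_date  gap_days
-- 2023-06-01   2023-06-02        1
-- 2023-06-02   2023-06-05        3
-- 2023-06-05   NULL              NULL
```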

row_number - returns the row number in the result set

row_number() is a window function that assigns a unique sequence number to each row in the query result set. It is often used to number rows after sorting or grouping, so that later steps can process or filter them by that number.

The syntax of the row_number() function is as follows:

row_number() over ([partition by col1, col2, ...] order by col3, col4, ...)
  • The partition by clause is optional and specifies the columns to group by. If it is given, row_number() numbers the rows independently within each group, so the first row of every group gets the number 1.
  • The order by clause specifies the columns that determine the order in which rows are numbered.

Here is an example showing how to use the row_number() function:

SELECT col1, col2, col3, row_number() OVER (ORDER BY col3) as row_num
FROM table_name;

In the above example, row_number() numbers the rows in ascending order of col3 and stores the number in a new column called row_num.
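A common use of row_number() with partition by is keeping one row per group. The sketch below reuses the orders table from the collect_list section and assumes that a larger order_id means a more recent order, which is an assumption made for illustration:

```sql
-- Keep only the latest order of each customer.
SELECT customer_id, order_id, product
FROM (
  SELECT customer_id, order_id, product,
         row_number() OVER (PARTITION BY customer_id ORDER BY order_id DESC) AS rn
  FROM orders
) ranked
WHERE rn = 1;
-- With the sample data shown earlier:
-- 101  2  MacBook Pro
-- 102  5  AirPods
```

The window function has to be computed in a subquery because its result cannot be referenced in the WHERE clause of the same SELECT.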

conditional function

ifnull - returns the value of the second expression if the first expression is null, otherwise returns the value of the first expression

The IFNULL() function replaces NULL values with a specified default value. It accepts two arguments: the expression to check and a default value. If the expression evaluates to NULL, IFNULL() returns the default value; otherwise it returns the value of the expression.

Function introduction: IFNULL(expression, default_value) returns default_value when expression is NULL, so the resulting column contains no NULL values. If IFNULL is not available in your Hive version, NVL(expression, default_value) and COALESCE(expression, default_value), both described below, give the same result.

Example business scenario:

  1. Calculate the average rating: Suppose there is a movie rating table that contains a movie ID (movie_id) and a rating (rating). In some cases the rating column may contain NULL values. AVG() already skips NULL values, so replacing them with 0 counts unrated rows as zero ratings and lowers the average; use this form only when that is the intended meaning:
SELECT movie_id, AVG(IFNULL(rating, 0)) AS avg_rating
FROM movie_ratings
GROUP BY movie_id;
  2. Adjust the sales amount: Suppose there is a sales order table that contains a customer ID (customer_id) and an order amount (order_amount). In some cases the order amount may be NULL. For sales analysis, IFNULL() can replace NULL values with 0 to give an adjusted amount:
SELECT customer_id, IFNULL(order_amount, 0) AS adjusted_amount
FROM sales_orders;
  3. Counting the number of NULL values: In data quality analysis, counting the NULL values in a column is a common requirement. Suppose there is a user table that contains a user ID (user_id) and an email address (email). IFNULL(email, 1) is not suitable here, because non-NULL emails are strings and cannot be summed meaningfully. Instead, turn each NULL into 1 and every other value into 0 with IF(), then sum the result:
SELECT SUM(IF(email IS NULL, 1, 0)) AS null_count
FROM users;

In these examples, NULL values are either replaced with a default value or counted explicitly. This gives direct control over how missing or invalid data affects aggregate calculations, data analysis and data quality checks.
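The effect of replacing NULL before aggregating can be sketched on inline data. The table ratings and its three rows are hypothetical; NVL is used here because it is described in the next section and behaves like IFNULL(expression, default_value):

```sql
WITH ratings AS (
  SELECT 1 AS movie_id, 4 AS rating
  UNION ALL
  SELECT 1 AS movie_id, CAST(NULL AS INT) AS rating
  UNION ALL
  SELECT 1 AS movie_id, 2 AS rating
)
SELECT movie_id,
       AVG(rating)              AS avg_ignoring_nulls,  -- 3.0: only the two real ratings
       AVG(NVL(rating, 0))      AS avg_nulls_as_zero,   -- 2.0: the NULL counts as a 0 rating
       COUNT(*) - COUNT(rating) AS null_count           -- 1: COUNT(rating) skips NULLs
FROM ratings
GROUP BY movie_id;
```

Which average is right depends on whether a missing rating should mean "no opinion" or "zero".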

nvl - returns the value of the second expression if the first expression is NULL, otherwise returns the value of the first expression

The NVL() function handles NULL values. It accepts two parameters: the first is the expression or column to check, and the second is the replacement value. If the first parameter is NULL, NVL() returns the second parameter as a substitute; otherwise it returns the value of the first parameter.

The syntax of the NVL() function is as follows:

NVL(expression, substitute_value)
  • expression: The expression or column to check.
  • substitute_value: The substitute value, which is returned if expression is NULL.

Here is an example query that demonstrates how to use the NVL() function in Hive:

SELECT name, NVL(age, 0) AS age
FROM persons;

In the above example, the persons table has two columns: name and age. If age is NULL, 0 is used as the substitute value. The query returns the name column together with either the actual age (when it is not NULL) or 0 (when it is NULL).

The NVL() function is useful for handling NULL values, because it lets a query supply a substitute value wherever a NULL could cause problems.

coalesce - returns the first non-null expression value in the argument list

The COALESCE() function returns the first non-NULL value from a list of expressions. It accepts multiple parameters and checks them one by one from left to right, returning the first value that is not NULL. It returns NULL if all parameters are NULL.

Function usage syntax:

COALESCE(expr1, expr2, expr3, ...)

Parameter Description:

  • expr1, expr2, expr3, ...: List of expressions to check.

Example usage: Suppose we have a table my_table with two columns, col1 and col2, and we want to get the first non-NULL value of these two columns.

SELECT COALESCE(col1, col2) AS result
FROM my_table;

In the above example, COALESCE(col1, col2) first checks the value of col1 and returns it if it is not NULL; if col1 is NULL, it goes on to check col2 and returns its value. Finally, AS result gives the output column the alias result.

The COALESCE() function is useful for working with columns or variables that may be NULL. As long as at least one argument is guaranteed to be non-NULL (for example, a literal placed last), the result is never NULL.

Specific to the use in the code:

coalesce(`eid`,'')

This expression picks the first non-NULL value between the eid column and the empty string. If the value of eid is not NULL, it returns the value of eid; if the value of eid is NULL, it returns an empty string.
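The left-to-right evaluation can be checked with literals. This is a minimal sketch; the typed NULLs are written with CAST only so that the example does not depend on how a bare NULL literal is typed:

```sql
SELECT COALESCE(CAST(NULL AS STRING), 'second', 'third')    AS first_non_null,  -- second
       COALESCE(CAST(NULL AS STRING), CAST(NULL AS STRING)) AS all_null;        -- NULL
```

Putting a literal such as '' last, as in coalesce(`eid`,''), is what guarantees a non-NULL result.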

string functions

split - split the string by the specified delimiter

The SPLIT() function splits a string around matches of the specified delimiter and returns a string array. It accepts two parameters: the string to split and the delimiter, which Hive interprets as a regular expression.

The function syntax is as follows:

SPLIT(str, delimiter)

Parameter Description:

  • str: The string to split.
  • delimiter: The delimiter pattern (a regular expression) that specifies where to split the string.

Return Value: The SPLIT() function returns a string array containing the substrings produced by splitting at the specified delimiter.

Example usage:

SELECT SPLIT('Hello,World,How,Are,You', ',') AS result;

Output result:

["Hello", "World", "How", "Are", "You"]

In the above example, the SPLIT() function splits the string 'Hello,World,How,Are,You' on commas and returns the string array ["Hello", "World", "How", "Are", "You"]. Each comma-separated section becomes one element of the array. Note that the result is an array of strings; the double quotes are only how the elements are displayed.

Specific to the use in the code:

split(`hisentname`,';')

This splits the value of the hisentname field on the semicolon (;).
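Because the delimiter is a regular expression, characters such as . and | have to be escaped when they are meant literally. A minimal sketch with literal strings:

```sql
SELECT split('a.b.c', '\\.')     AS escaped_dot,    -- ["a","b","c"]
       split('a1b22c', '[0-9]+') AS digit_runs,     -- ["a","b","c"]
       split('x;y;z', ';')[0]    AS first_element;  -- x
```

An unescaped '.' would match every character rather than a literal dot, so the result would not be the three letters. Array elements are addressed with a zero-based index, as in the last column.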

concat - concatenates two or more strings

The CONCAT() function concatenates multiple strings into one string. It takes two or more strings as parameters and returns the result of joining them in order.

Function example usage:

Suppose we have two strings, 'Hello' and 'World', and we want to concatenate them into the single string 'Hello World'.

SELECT CONCAT('Hello', ' ', 'World') AS result;

In the above example, CONCAT('Hello', ' ', 'World') concatenates the three strings 'Hello', a space and 'World' to get 'Hello World'.

The CONCAT() function accepts multiple parameters, which can be string constants, column names, or other expressions. It concatenates the arguments into a single string in the order they appear. If any argument is NULL, the whole result is NULL; use CONCAT_WS(), which skips NULL arguments, or wrap nullable arguments in NVL() or COALESCE() when NULLs should be ignored.

Example usage: Suppose we have a table my_table with two columns, first_name and last_name, and we want to concatenate them into a full name.

SELECT CONCAT(first_name, ' ', last_name) AS full_name
FROM my_table;

In the above example, CONCAT(first_name, ' ', last_name) concatenates the value of the first_name column, a space character, and the value of the last_name column to get the full name. AS full_name gives the result the alias full_name.

Note: In Hive SQL, the number of parameters that CONCAT() accepts may be limited (256 is the figure given here; check it against your Hive version). If you need to concatenate a very large number of strings, you may need to split them across multiple CONCAT() calls.
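How NULL arguments affect concatenation can be sketched with literals. The typed NULL stands in for a missing middle name and is purely illustrative:

```sql
SELECT concat('John', CAST(NULL AS STRING), 'Smith')         AS with_concat,     -- NULL
       concat_ws(' ', 'John', CAST(NULL AS STRING), 'Smith') AS with_concat_ws,  -- John Smith
       concat('John', NVL(CAST(NULL AS STRING), ''), 'Smith') AS with_nvl;       -- JohnSmith
```

concat_ws takes the separator as its first argument and leaves out NULL values, which makes it a convenient choice for building names or paths from nullable columns.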

trim - remove spaces from both ends of a string

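The TRIM() function removes spaces from the beginning and end of a string. Hive's basic form is TRIM(input_string); LTRIM() and RTRIM() remove spaces from only the left or the right end.

The SQL-standard extended syntax is as follows (it may not be accepted by every Hive version, so test it before relying on it):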

TRIM([BOTH | LEADING | TRAILING] trim_character FROM input_string)

Parameter Description:

  • BOTH: Remove characters from both the beginning and the end of the string. This is the default.
  • LEADING: Only remove characters at the beginning of the string.
  • TRAILING: Only remove characters at the end of the string.
  • trim_character: The character or string to be removed. The default is the space character.
  • input_string: The string to process.

Example:

SELECT TRIM('   Hello World   ') AS trimmed_string;

Output result: Hello World

In the above example, the TRIM() function removes the spaces from the beginning and end of the string '   Hello World   ' and returns the processed string 'Hello World'.

Specific to the use in the code:

trim(`entname`)

This removes leading and trailing spaces from the entname field.
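The one-sided variants can be compared with a small sketch. The square brackets are added with concat only to make the remaining spaces visible:

```sql
SELECT concat('[', trim('  Hello  '),  ']') AS both_ends,   -- [Hello]
       concat('[', ltrim('  Hello  '), ']') AS left_only,   -- [Hello  ]
       concat('[', rtrim('  Hello  '), ']') AS right_only;  -- [  Hello]
```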

regexp - checks if a string matches a specified regular expression

The regexp function checks whether a string matches a specified regular expression pattern. It is most often written as an infix operator, string REGEXP pattern (RLIKE is a synonym).

The function syntax is as follows:

regexp(string, pattern)
  • string: The string to match.
  • pattern: The regular expression pattern that specifies the matching rule.

The function returns true if the given string matches the regular expression pattern, and false otherwise. A match anywhere in the string counts; use ^ and $ to require that the whole string matches.

Here are some examples:

  1. Check whether a string matches a specified regular expression pattern:
SELECT regexp('hello', '^h.*');

Output: true

In this example, the given string is "hello" and the regular expression pattern is "^h.*", which represents any string starting with the letter "h". Since the string "hello" starts with "h", it matches and the result is true.

  2. Use the regexp function in a query to filter data:
SELECT column_name
FROM table_name
WHERE regexp(column_name, '[0-9]+');

This statement selects the rows of the table whose column value matches the regular expression pattern [0-9]+, which represents a sequence of one or more digits. Only rows that match the pattern are returned.

Note: In Hive SQL, regular expression patterns are based on Java's regular expression syntax. Therefore, you can use the syntax rules of Java regular expressions to build patterns.

Specific to the use in the code:

case when `credit_code` regexp '0{18}' then null 
else upper(regexp_replace(`credit_code`,'\\s',''))
end as `uscc_code` 
  1. when credit_code regexp '0{18}' then null means that if the value of the credit_code column matches the regular expression pattern 0{18} (that is, it contains 18 consecutive 0s), NULL is returned.
  2. else upper(regexp_replace(credit_code,'\\s','')) means that otherwise the value of the credit_code column is cleaned and converted to uppercase:
  • regexp_replace(credit_code,'\\s','') replaces whitespace characters in the credit_code column with empty strings. \s is the regular expression class for whitespace characters (spaces, tabs, line breaks).
  • upper(...) converts the cleaned credit_code value to uppercase.

Finally, depending on the condition, the value of the uscc_code column is either NULL (when credit_code contains 18 consecutive 0s) or the cleaned uppercase string (when it does not).
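Because REGEXP succeeds when the pattern is found anywhere in the string, the pattern 0{18} also matches longer values that merely contain 18 zeros. The following sketch, which builds its test strings with repeat() and concat(), contrasts the unanchored and anchored forms:

```sql
SELECT concat('A', repeat('0', 18)) REGEXP '0{18}'   AS contains_18_zeros,  -- true
       concat('A', repeat('0', 18)) REGEXP '^0{18}$' AS exactly_18_zeros,   -- false
       repeat('0', 18)              REGEXP '^0{18}$' AS all_zero_code;      -- true
```

Whether the anchored form is the better choice for credit_code depends on the data; it is shown here only to illustrate the matching rule.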

regexp_replace - Replace matches in a string with a regular expression

The regexp_replace() function replaces the parts of a string that match a regular expression.

The syntax of the function is as follows:

regexp_replace(string, pattern, replacement)
  • string: The string to be replaced.
  • pattern: The regular expression pattern to match.
  • replacement: The string that replaces each matched part.

This function will search the given string for parts matching the regular expression pattern and replace them with the replacement string. If no matching part is found, the original string is returned.

Here are some examples:

  1. Replace numbers in a string with specific characters:
SELECT regexp_replace('Hello123World456', '[0-9]', '*');

Output: Hello***World***

  2. Remove spaces from a string:
SELECT regexp_replace('Hello World', '\\s', '');

Output: HelloWorld

  3. Replace all commas in a string with semicolons:
SELECT regexp_replace('a,b,c,d', ',', ';');

Output: a;b;c;d

  4. Use an empty string to remove a specific pattern in a string:
SELECT regexp_replace('abc123def456', '[a-z]', '');

Output: 123456

Note: Hive uses Java regular expression syntax, and a backslash in a pattern has to be doubled inside a Hive string literal (for example '\\s'). Escaping behaviour can differ between Hive versions and clients, so test patterns in your own environment.

Specific to the use in the code:

regexp_replace(`credit_code`,'\\s','')

This code replaces the whitespace characters matched by \s in the credit_code column with empty strings. The backslash is doubled in '\\s' because a backslash must itself be escaped inside the string literal; the regular expression engine then receives \s.

This means that any whitespace characters in the credit_code column are removed. For example, if the value of credit_code is ABC 123 DEF, the result after replacement is ABC123DEF.
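The cleaning step can be tried on a literal before running it against the table. This minimal sketch combines the whitespace removal with upper(), as the CASE expression in the regexp section does:

```sql
SELECT upper(regexp_replace('abc 123 def', '\\s', '')) AS cleaned;  -- ABC123DEF
```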

Specific to the use in the code:

regexp_replace(
	regexp_replace(
		regexp_replace(`hisentname`, ';', ';')
		,'&|nbsp;|&|/|:|:|\\.|企业基本信息|名称|企业(机构)名称|企业名称|名称序号|联系电话|第一名称|第二名称|序号|【变更前内容】|\\*|-|[0-9]|[a-zA-Z]', ''
		)
	, '\\s', '') 
AS `hisentname`

This code performs three nested replacement operations on the value of the hisentname column and returns the cleaned result under the same column name, hisentname.

  1. regexp_replace(hisentname,';',';') replaces the full-width (Chinese) semicolon (;) in the hisentname column with the ASCII semicolon (;). This is the first replacement operation.
  2. regexp_replace(...,'&|nbsp;|&|/|:|:|\\.|企业基本信息|名称|企业(机构)名称|企业名称|名称序号|联系电话|第一名称|第二名称|序号|【变更前内容】|\\*|-|[0-9]|[a-zA-Z]','') uses regular expression matching to remove special characters and keywords from the hisentname column. The removed items are: &, nbsp;, &, /, :, :, . (escaped with a backslash), basic information of the enterprise, name, enterprise (organization) name, enterprise name, name serial number, contact number, first name, second name, serial number, 【content before change】, * (escaped with a backslash), -, digits and letters. This is the second replacement operation.
  3. regexp_replace(...,'\\s','') removes the whitespace characters from the string produced by the previous step. This is the third replacement operation. \s is the regular expression class for whitespace characters.

Finally, after the three replacement operations, the processed string is returned as the hisentname column.

substr - returns a substring of a string

The SUBSTR() function extracts a substring from a string.

The syntax of the function is as follows:

SUBSTR(string, start, length)

where:

  • string is the original string from which to extract the substring.
  • start is the position at which extraction begins; positions are counted from 1.
  • length is the number of characters to extract.

The SUBSTR() function returns the substring extracted from the original string.

Example: Suppose there is a string Hello, World! and we want to extract the substring World from it. We can use the following statement:

SELECT SUBSTR('Hello, World!', 8, 5);
-- Output: World

In the above code, SUBSTR('Hello, World!', 8, 5) extracts a substring of length 5 starting at the 8th character of the string.
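The length argument can also be left out, in which case the substring runs to the end of the string. A minimal sketch with the same literal:

```sql
SELECT substr('Hello, World!', 8)    AS to_end,      -- World!
       substr('Hello, World!', 1, 5) AS first_five;  -- Hello
```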

upper / lower - convert string to upper/lower case

The upper() function converts a string to uppercase.

The syntax of the function is as follows:

upper(string)
  • string: The string to be converted to uppercase.

This function converts all characters in the given string to uppercase and returns the converted result.

Here are some examples:

  1. Convert a string to uppercase:
SELECT upper('hello world');

output:HELLO WORLD

  2. Convert strings in a column to uppercase:
SELECT upper(column_name) FROM table_name;

This statement selects a column in a table and converts all string values in the column to uppercase.

Note: Function names are case-insensitive in Hive SQL, so upper() and UPPER() are the same function. It can be applied to any string regardless of its original case; characters that are already uppercase or have no case are left unchanged.

lower() is used in the same way as upper() and has the opposite effect: it converts a string to lowercase.

explode - split an array or Map into multiple rows

EXPLODE(), usually called through LATERAL VIEW, splits an array column (Array) into multiple rows, producing one new row for each array element. It is used in a SELECT statement.

Function usage syntax:

SELECT ...
FROM ...
LATERAL VIEW EXPLODE(array_column) table_alias AS column_alias

Parameter Description:

  • array_column: The array column to split (Array).
  • table_alias: Generated table alias.
  • column_alias: Generated column alias.

When LATERAL VIEW EXPLODE() is used, Hive treats each element of the array column as a new row and exposes these rows through the virtual table named by table_alias. You can then refer to column_alias in the SELECT statement and do further processing on the generated rows.

Example usage: Suppose we have a table my_table with an array column called array_col, and we want to split the array into rows.

SELECT column_alias
FROM my_table
LATERAL VIEW EXPLODE(array_col) my_table_alias AS column_alias;

In the example above, LATERAL VIEW EXPLODE() splits the array column array_col of the table my_table into multiple rows. Each array element becomes a new row, and these rows are exposed under the alias my_table_alias. You can select the required columns in the SELECT statement and perform further operations on the generated rows.

Note that EXPLODE() only works on collection columns: arrays, and also Maps, where it produces a key column and a value column and therefore needs two column aliases. It cannot be applied to other column types.
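Both the array and the Map case can be sketched on inline data. The table ent, the derived table t and all values are hypothetical; the first query also shows how split() and explode() combine to turn a delimited string into rows:

```sql
-- Array case: one output row per name in the semicolon-separated string.
WITH ent AS (
  SELECT 1 AS id, 'Alpha;Beta;Gamma' AS hisentname
)
SELECT id, name
FROM ent
LATERAL VIEW explode(split(hisentname, ';')) names AS name;
-- Produces three rows: (1, Alpha), (1, Beta), (1, Gamma)

-- Map case: explode yields a key column and a value column.
SELECT k, v
FROM (SELECT map('jan', 10, 'feb', 12) AS sales_by_month) t
LATERAL VIEW explode(sales_by_month) kv AS k, v;
-- Produces two rows: (jan, 10) and (feb, 12)
```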

date function

datediff - returns the difference in days between two dates

The datediff() function calculates the difference in days between two dates. It takes two dates as input parameters and returns an integer: the number of days from the second date to the first date.

The function syntax is as follows:

datediff(enddate, startdate)

enddate and startdate are date parameters, which can be of string type or date type. enddate is normally the later date and startdate the earlier date; if enddate is earlier than startdate, the result is negative.

Here are some examples:

Example 1:

SELECT datediff('2023-06-27', '2023-06-20');

The output is: 7

Example 2:

SELECT datediff('2023-06-01', '2023-07-01');

The output is: -30

In example 1, the first date is '2023-06-27' and the second date is '2023-06-20', and the difference between them is 7 days.

In example 2, the first date is '2023-06-01' and the second date is '2023-07-01'. Because the first date is earlier than the second, the result is negative: the first date is 30 days before the second.

Note: datediff() compares only the date parts of its arguments; any time-of-day part is ignored.
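The behavior described above can be checked with a few literal values. The last query is a hypothetical business use that assumes an orders table with an order_date column, which is not defined in this article.

```sql
-- The time-of-day part is ignored, so two timestamps on the same day differ by 0 days.
SELECT datediff('2023-06-27 23:59:59', '2023-06-27 00:00:01');  -- 0

-- Swapping the arguments flips the sign.
SELECT datediff('2023-06-20', '2023-06-27');  -- -7

-- Hypothetical scenario: age of each order in days.
SELECT order_id, datediff(current_date, order_date) AS days_since_order
FROM orders;
```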

current_timestamp - returns the current timestamp

The CURRENT_TIMESTAMP() function returns the current timestamp, that is, the current date and time.

CURRENT_TIMESTAMP() takes no parameters. It returns a timestamp value, usually displayed in the format 'yyyy-MM-dd HH:mm:ss' or 'yyyy-MM-dd HH:mm:ss.SSS'.

Function example usage:

SELECT CURRENT_TIMESTAMP() AS current_time;

Output result: 2023-06-05 12:34:56

In the above example, CURRENT_TIMESTAMP() returns the current date and time, i.e. '2023-06-05 12:34:56'. The actual output depends on the system time when the query runs.
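When only part of the current time is needed, the timestamp can be combined with other built-in date functions. This is a small sketch; the column aliases are illustrative, and date_format assumes a Hive version that provides it (1.2.0 or later).

```sql
SELECT
  current_timestamp()                                   AS now_ts,      -- full timestamp
  current_date()                                        AS today,       -- date part only
  date_format(current_timestamp(), 'yyyy-MM-dd HH:mm')  AS now_minute;  -- custom string format
```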

array function

sort_array - Sort an array

The sort_array function sorts an array and returns the sorted array as the result.

sort_array() is often used in conjunction with the collect_list() function.

The syntax of sort_array is as follows:

sort_array(array[, ascendingOrder])

Here array is the array to be sorted. ascendingOrder is an optional flag that selects ascending order (true, the default) or descending order (false). This two-argument form is the Spark SQL signature; the Hive built-in is documented as taking only the array and always sorting in ascending order, so check which engine and version you are running.

sort_array sorts the given array and returns the sorted array. If no sort order is specified, the array is sorted in ascending order.

Example usage of sort_array:

Suppose there is a table named numbers with the following columns: id (number) and values (array of integers).

+----+----------------------+
| id | values               |
+----+----------------------+
| 1  | [5, 3, 2, 4, 1]      |
| 2  | [9, 7, 6, 8, 10]     |
+----+----------------------+

To sort the values array, the following query can be used (values is a reserved word in Hive, so the column name is quoted with backticks):

SELECT id, sort_array(`values`) AS sorted_values
FROM numbers;

The query results will look like this:

+----+----------------------+
| id | sorted_values        |
+----+----------------------+
| 1  | [1, 2, 3, 4, 5]      |
| 2  | [6, 7, 8, 9, 10]     |
+----+----------------------+

For each row, sort_array sorts the values array and returns the sorted array as the result.

Note that sort_array only sorts the elements inside the array and does not change the values of other columns. Elements are sorted according to the natural ordering of their data type: integers are sorted by numerical value, and strings are sorted in lexicographic order.
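The combination with collect_list() mentioned above can be sketched as follows. It assumes the orders table (order_id, customer_id, product) used in the collect_list section of this article.

```sql
-- Collect each customer's products into an array, then sort the array.
SELECT customer_id,
       sort_array(collect_list(product)) AS products_sorted
FROM orders
GROUP BY customer_id;
```

collect_list() does not guarantee any particular element order, so wrapping it in sort_array() is a simple way to get a deterministic result.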

array_contains - Checks whether an array contains the specified element

The ARRAY_CONTAINS() function is used to check whether the specified element is contained in the array and returns a Boolean value (true or false). It can be used for membership checking and filtering operations on arrays.

Function introduction: The ARRAY_CONTAINS(array, value) function accepts two parameters: an array and a value. It checks whether the specified value is contained in the array and returns a Boolean result.

Example business scenario:

  1. User tag matching: Assume there is a user table that contains a user ID (user_id) and a list of tags for the user (tags). The tag list is an array that stores the user's hobbies. To find users with a specific interest tag, use ARRAY_CONTAINS() for matching:
SELECT user_id
FROM users
WHERE ARRAY_CONTAINS(tags, 'sports');
  2. Product filtering: Assume there is a product table that contains a product ID (product_id) and an array of applicable industries (industries). To find products suitable for a specific industry, filter with ARRAY_CONTAINS():
SELECT product_id
FROM products
WHERE ARRAY_CONTAINS(industries, 'technology');
  3. Counting rows by array membership: Suppose there is a sales data table that contains a product ID (product_id) and an array of sales amounts (sales_amounts). ARRAY_CONTAINS() does not count the elements inside an array, but it can be used as a filter before aggregating. For example, to count, per product, the rows whose array contains a zero amount:
SELECT product_id, COUNT(*) AS zero_amount_rows
FROM sales_data
WHERE ARRAY_CONTAINS(sales_amounts, 0)
GROUP BY product_id;

In these examples, the ARRAY_CONTAINS() function is used to check the membership in the array to meet certain conditions. It can help with operations such as label matching, array filtering, and array aggregation, thereby supporting data query and analysis in various business scenarios.

It should be noted that the ARRAY_CONTAINS() function may require more complex usage methods for complex data types in arrays, such as structures or nested arrays. When using this function in a specific environment and tool, please refer to the relevant documentation and official guide for the exact usage and behavior.
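The Boolean result can be checked directly on literal arrays, and it can also serve as a grouping key. The second query assumes the users table with a tags array column described in the first scenario above.

```sql
SELECT array_contains(array('sports', 'music'), 'sports');  -- true
SELECT array_contains(array('sports', 'music'), 'travel');  -- false

-- Number of users with and without the 'sports' tag.
SELECT array_contains(tags, 'sports') AS likes_sports, COUNT(*) AS user_count
FROM users
GROUP BY array_contains(tags, 'sports');
```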

encryption function

md5 - Computes the MD5 hash of a string

The MD5() function calculates the MD5 hash value of a given string. MD5 is a commonly used hashing algorithm that converts input data of any length into a fixed-length 128-bit hash value. Different inputs can produce the same hash (a collision), so the value is not guaranteed to be unique, although accidental collisions are extremely unlikely.

MD5() takes a string as input and returns its MD5 hash as a 32-character hexadecimal string. It can be used to calculate hash values of strings in Hive SQL and is often used for data digests, data comparison, and checksums. Note that hashing is one-way and is not encryption.

Example usage: Suppose we have the string 'Hello, World!' and we want to calculate its MD5 hash.

SELECT MD5('Hello, World!') AS hash_value;

In the above example, the expression MD5('Hello, World!') calculates the MD5 hash value of the string 'Hello, World!' and returns it as a string. The result is '65a8e27d8879283831b664bd8b7f0ad4'.

Note: MD5 is an older hashing algorithm. While it is still usable in some scenarios, it is considered insecure. In practical applications, especially where sensitive data is involved, a stronger hashing algorithm such as SHA-256 is recommended. Hive also provides the SHA2() function to calculate SHA-2 hash values.
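One non-security use mentioned above is data comparison. The following sketch fingerprints selected columns of each row so that two snapshots of a table can be compared by hash; it assumes the orders table (order_id, customer_id, product) from earlier in this article, and the '|' separator is an arbitrary choice.

```sql
-- concat_ws joins the values with a separator so that ('ab', 'c') and ('a', 'bc')
-- do not produce the same input string.
SELECT order_id,
       md5(concat_ws('|', CAST(customer_id AS STRING), product)) AS row_fingerprint
FROM orders;
```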

sha2 - Computes the SHA-2 hash of a string

The SHA2() function calculates the SHA-2 (Secure Hash Algorithm 2) hash value of the given string. SHA-2 is a family of cryptographic hash functions that includes variants such as SHA-224, SHA-256, SHA-384, and SHA-512. These algorithms were designed by the National Security Agency (NSA) and are widely used in cryptography and security applications.

SHA2() accepts two parameters: the string to be hashed and the bit length of the hash. In Hive the bit length can be 224, 256, 384, or 512 (0 is treated as 256), corresponding to SHA-224, SHA-256, SHA-384, and SHA-512, respectively. For example, SHA2('hello', 256) returns the SHA-256 hash of the string 'hello'.

The following example query demonstrates how to use SHA2() in Hive:

SELECT SHA2('hello', 256);

Output:

2cf24dba5fb0a30e26e83b2ac5b9e29e1b161e5c1fa7425e73043362938b9824

This is the SHA-256 hash of the string 'hello'. Note that the output hash is a hexadecimal string and that the input is case-sensitive: 'Hello' produces a different hash.

SHA2() is usually used in Hive for data security, data digests, and password protection. For example, hashes of sensitive data can be stored in Hive tables instead of the plaintext values for increased security.

SHA2() and MD5() are both Hive functions for calculating hash values, but there are some important differences between them:

  1. Hash algorithm: SHA2() uses the SHA-2 algorithm family, and MD5() uses the MD5 algorithm. SHA-2 is more secure than MD5 and offers several variants, such as SHA-256, SHA-384, and SHA-512, so you can choose the bit length according to your needs. In contrast, MD5 has known security weaknesses and is vulnerable to collision attacks.
  2. Output length: The output length of SHA2() depends on the chosen bit length, while MD5() always produces a 128-bit (16-byte) hash. SHA-256 produces a hash of 256 bits (32 bytes), SHA-384 produces 384 bits (48 bytes), and SHA-512 produces 512 bits (64 bytes). Longer outputs make collisions and brute-force attacks harder.
  3. Collision probability: A collision is when two different inputs produce the same hash. Practical collision attacks exist for MD5, whereas none are known for the SHA-2 family, and the longer SHA-2 output also makes accidental collisions far less likely.
  4. Security: SHA-2 provides higher security than MD5. MD5 has been widely broken and is not suitable for storing hashes of sensitive data. SHA-2 is considered one of the more secure and collision-resistant hash algorithm families.

To sum up, if hash values need to be calculated in Hive and security is a key consideration, use SHA2(), choosing SHA-256 or a longer variant. MD5() may still be useful for simple validation or in situations that are not security-sensitive.
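The difference in output length can be observed directly, since both functions return the hash as a hexadecimal string with two characters per byte. The column aliases below are illustrative.

```sql
SELECT length(md5('hello'))        AS md5_hex_len,     -- 32 hex characters (128 bits)
       length(sha2('hello', 256))  AS sha256_hex_len,  -- 64 hex characters (256 bits)
       length(sha2('hello', 512))  AS sha512_hex_len;  -- 128 hex characters (512 bits)
```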

encrypt - Encrypt a string

The encrypt() function encrypts the given string. It encrypts the string using the specified encryption algorithm and key and returns the encrypted result. It can be used to protect sensitive data such as passwords or other confidential information.

The syntax of encrypt() is as follows:

encrypt(string input, string key);

Parameter Description:

  • input: String to encrypt.
  • key: The key used for encryption.

Note: encrypt() is not one of Hive's built-in functions. It only works if a user-defined function or encryption plugin with that name has been installed and registered in your environment. For AES encryption, Hive itself (1.3.0 and later) ships the built-in functions aes_encrypt() and aes_decrypt().

Example of using encrypt():

SELECT encrypt('password123', 'mySecretKey') AS encrypted_password FROM my_table;

In the above example, the string 'password123' is encrypted with the key 'mySecretKey', and the encrypted result is returned in the column encrypted_password.

The actual encryption algorithm depends on the plugin and on the Hive configuration and environment. Common encryption algorithms include AES, DES, and RSA; which one is used is determined by the installed plugin.
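Where no encrypt() plugin is available, the built-in aes_encrypt() and aes_decrypt() functions (Hive 1.3.0 and later) cover the AES case. The sketch below is an assumption-laden illustration, not the plugin's behavior: the key literal is a placeholder, and real keys should not be written into queries.

```sql
-- The key must be 16, 24, or 32 bytes long (AES-128/192/256); this literal is a placeholder only.
-- aes_encrypt returns BINARY, so base64() is used to get a printable string.
SELECT base64(aes_encrypt('password123', '1234567890123456')) AS encrypted_password;

-- Round trip: decrypt what was just encrypted and cast the BINARY result back to STRING.
SELECT CAST(aes_decrypt(aes_encrypt('password123', '1234567890123456'),
                        '1234567890123456') AS STRING) AS decrypted_password;
```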

type conversion function

cast - casts an expression to the specified data type

The cast() function converts the value of an expression or column to a specified data type, turning one data type into another compatible one.

The syntax of cast() is as follows:

CAST(expression AS data_type)

where expression is the expression or column to be converted and data_type is the target data type.

Here are some common data type conversion examples:

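-- Convert a string to an integer
CAST('123' AS INT)

-- Convert a string to a floating-point number
CAST('3.14' AS DOUBLE)

-- Convert an integer to a string
CAST(456 AS STRING)

-- Convert a date string to the date type
CAST('2023-01-01' AS DATE)

-- Convert a NULL value to the string type
CAST(NULL AS STRING)

-- Convert the current timestamp (the current date and time) to a string
CAST(current_timestamp() AS STRING)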

Note that cast() can only convert between compatible data types. If a value cannot be converted (for example CAST('abc' AS INT)), Hive returns NULL rather than raising an error, so failed conversions can go unnoticed unless you check for NULL.

In Hive SQL, cast() is very useful for data type conversion, data format conversion, and precision conversion: it converts data to the type required by a specific calculation or processing step.
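Because a failed conversion yields NULL, a cast can double as a simple data-quality check. In the second query, raw_orders and its STRING column amount_str are hypothetical names used only for illustration.

```sql
SELECT CAST('123' AS INT) AS ok_value,    -- 123
       CAST('abc' AS INT) AS bad_value;   -- NULL, not an error

-- Find rows whose amount_str is present but not numeric.
SELECT *
FROM raw_orders
WHERE amount_str IS NOT NULL
  AND CAST(amount_str AS DOUBLE) IS NULL;
```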

to_date - converts a string to a date format

The to_date function is used to convert a string to a date format. It parses the given string into a date and returns the corresponding date value.

Function introduction: The to_date(string) function accepts a string parameter and parses it into a date format. The string parameter must conform to the date format supported by Hive, otherwise a NULL value will be returned.

Example usage scenarios:

  1. String date conversion: Assume there is a data table with a field date_str that stores dates in string form (such as '2023-06-29'). For date calculation and analysis, the string dates need to be converted to the date type:
SELECT to_date(date_str) AS dt
FROM my_table;
  2. Date comparison and filtering: Suppose there is an order table that contains the order number (order_id) and the order date (order_date). To select orders within a specific date range, use to_date to convert both the column and the query parameters to dates before comparing them:
SELECT order_id, order_date
FROM orders
WHERE to_date(order_date) BETWEEN to_date('2023-01-01') AND to_date('2023-06-30');
  3. Date aggregation statistics: Suppose there is a sales data table that contains the sales date (sale_date) and the sales amount (sale_amount). To total the sales amount per day, use to_date to convert the date string into a date and aggregate on it:
SELECT to_date(sale_date) AS sale_day, SUM(sale_amount) AS total_sales
FROM sales
GROUP BY to_date(sale_date);

In these examples, the to_date function is used to convert a string date to a date type for operations such as date comparison, date aggregation, and date calculation. It is useful when working with date data and facilitates date-based queries and analysis.

Please note that the to_date function depends on the date format of the input string, so you need to ensure that the input string conforms to the date format supported by Hive.
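The effect of the function on individual values can be seen with literals: the time part is dropped, and an unparseable string yields NULL.

```sql
SELECT to_date('2023-06-29 12:34:56');  -- 2023-06-29
SELECT to_date('not a date');           -- NULL
```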

to_unix_timestamp - Convert a date or time string to UNIX timestamp format

The to_unix_timestamp function is used to convert a date or time string to UNIX timestamp format. It parses the given date or time string into a UNIX timestamp and returns the corresponding integer value.

Function introduction: The to_unix_timestamp(string[, pattern]) function accepts a date or time string and parses it into a UNIX timestamp. Without a pattern, the string is expected to be in 'yyyy-MM-dd HH:mm:ss' format; other formats need an explicit pattern. If the string cannot be parsed, NULL is returned. A UNIX timestamp is the number of seconds elapsed since January 1, 1970 00:00:00 UTC.

Example usage scenarios:

  1. Time comparison and filtering: Suppose there is a log table that contains a log timestamp field (log_timestamp). To select logs within a specific time range, use to_unix_timestamp to convert both the column and the query parameters to UNIX timestamps before comparing them:
SELECT log_id, log_timestamp
FROM logs
WHERE to_unix_timestamp(log_timestamp) BETWEEN to_unix_timestamp('2023-06-29 00:00:00') AND to_unix_timestamp('2023-06-30 23:59:59');
  2. Time calculation and conversion: Suppose there is a task table that contains the task start time (start_time) and the task execution duration (duration, in seconds). To calculate the end time of the task, convert the start time to a UNIX timestamp with to_unix_timestamp and add the duration:
SELECT task_id, start_time, duration,
    from_unixtime(to_unix_timestamp(start_time) + duration) AS end_time
FROM tasks;
  3. Timestamp format conversion: Assume there is a data table with a field date_str that stores dates in string form (such as '2023-06-29'). To convert the field to a UNIX timestamp for later calculations, use to_unix_timestamp; because the string has no time part, the pattern is passed explicitly:
SELECT date_str, to_unix_timestamp(date_str, 'yyyy-MM-dd') AS unix_ts
FROM my_table;

In these examples, the to_unix_timestamp function is used to convert a date or time string to a UNIX timestamp for time comparison, time calculation, and time format conversion. It is very useful when working with temporal data and doing time-related calculations.

Please note that to_unix_timestamp depends on the format of the input string, so make sure the input matches the default 'yyyy-MM-dd HH:mm:ss' format or the pattern you pass.
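The role of the pattern argument can be sketched with literal values. Absolute results depend on the session time zone, so only the difference in the last query has a fixed expected value.

```sql
-- Default pattern: the input must look like 'yyyy-MM-dd HH:mm:ss'.
SELECT to_unix_timestamp('2023-06-29 08:00:00');

-- Date-only input: pass the pattern explicitly.
SELECT to_unix_timestamp('2023-06-29', 'yyyy-MM-dd');

-- Difference between two times in seconds
-- (3600 when no daylight-saving change falls in between).
SELECT to_unix_timestamp('2023-06-29 09:00:00')
     - to_unix_timestamp('2023-06-29 08:00:00') AS elapsed_seconds;
```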

from_unixtime - Convert a UNIX timestamp to a date or time string format

The from_unixtime function is used to convert a UNIX timestamp to a date or time string format. It parses the given UNIX timestamp into a date or time string and returns the corresponding string value.

Function introduction: The from_unixtime(unix_timestamp[, format]) function accepts a UNIX timestamp parameter and converts it to a date or time string. It can specify an optional format parameter that defines the format of the output string. If no format parameter is provided, the "yyyy-MM-dd HH:mm:ss" format is used by default.

Example usage scenarios:

  1. UNIX timestamp conversion: Suppose there is a data table that contains a UNIX timestamp field (unix_timestamp). To convert the UNIX timestamp to a readable date and time, use from_unixtime:
SELECT unix_timestamp, from_unixtime(unix_timestamp) AS datetime
FROM my_table;
  2. Date format customization: Suppose there is an order table that contains the order date (order_date), stored as a UNIX timestamp in seconds. To output the order date in a custom format, call from_unixtime with the format parameter:
SELECT order_id, from_unixtime(order_date, 'yyyy/MM/dd') AS formatted_date
FROM orders;
  3. Timestamp conversion: Suppose there is a log table that contains a log timestamp field (log_timestamp), stored as a UNIX timestamp in seconds. To convert the log timestamps to a specific time format, call from_unixtime with the format parameter:
SELECT log_id, from_unixtime(log_timestamp, 'HH:mm:ss') AS log_time
FROM logs;

In these examples, the from_unixtime function is used to convert a UNIX timestamp to a date or time string for time format customization, timestamp conversion, and human-readable output. It is useful in handling UNIX timestamps and date/time format conversions.

It should be noted that the from_unixtime function returns a string type, so subsequent calculations, comparisons, or format processing need to be performed as needed during use.
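Since the result is a string, it often has to be combined with other functions. The following sketch shows a round trip through UNIX seconds and a cast back to a date type; the aliases are illustrative.

```sql
-- Round trip: string -> UNIX seconds -> string, adding one hour (3600 seconds) in between.
SELECT from_unixtime(unix_timestamp('2023-06-29 08:00:00') + 3600) AS one_hour_later;

-- Custom output pattern; the result is a STRING, so cast it when a DATE is needed.
SELECT CAST(from_unixtime(unix_timestamp('2023-06-29 08:00:00'), 'yyyy-MM-dd') AS DATE) AS day_only;
```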

math function

greatest - returns the greatest value

The GREATEST() function returns the largest value from a given set of values. It accepts multiple parameters and returns the maximum of them.

Syntax:

GREATEST(value1, value2, ...)

Parameters:

  • value1, value2, ...: The values to compare, which can be numbers, strings, or dates.

Return value:

  • Returns the maximum value among the arguments.

Precautions:

  • If any parameter is NULL, the result is NULL.
  • When GREATEST() compares parameters of different types, it converts them to a common type according to the type comparison rules before comparing.

Example:

SELECT GREATEST(5, 10, 3, 8); -- returns 10
SELECT GREATEST('apple', 'banana', 'orange'); -- returns 'orange'
SELECT GREATEST(date '2021-01-01', date '2022-03-15', date '2020-12-25'); -- returns '2022-03-15'
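Because a single NULL argument makes the whole result NULL, nullable columns are often wrapped in coalesce() first. In the second query, order_events and its string columns paid_date and shipped_date ('yyyy-MM-dd') are hypothetical, and the fallback value '1970-01-01' is an arbitrary choice.

```sql
SELECT greatest(5, 10, 3) AS max_value,   -- 10
       least(5, 10, 3)    AS min_value;   -- 3

-- Latest of several nullable dates per row.
SELECT order_id,
       greatest(coalesce(paid_date,    '1970-01-01'),
                coalesce(shipped_date, '1970-01-01')) AS last_activity_date
FROM order_events;
```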

floor - Returns the largest integer not greater than the given number (rounded down)

The floor function returns the largest integer not greater than the given number, that is, it rounds the numeric argument down.

Function introduction: The floor(x) function accepts a numeric parameter x and returns the largest integer value less than or equal to x. This holds for negative numbers as well: floor(3.7) is 3 and floor(-3.7) is -4, so negative values are rounded away from zero.

Example usage scenarios:

  1. Value rounding: Suppose there is a sales table that contains the total amount (total_amount) of each sales order. To get the integer part of the order amount, use floor to round the total amount down:
SELECT order_id, total_amount, floor(total_amount) AS rounded_amount
FROM sales;
  2. Price adjustment: Suppose there is a product table that contains product prices (price). For price adjustment purposes, the price is rounded down to the nearest integer and used as the adjusted price:
SELECT product_id, price, floor(price) AS adjusted_price
FROM products;
  3. Timestamp conversion: Suppose there is a log table that contains a log timestamp field (log_timestamp) stored as UNIX seconds. To round log timestamps down to the minute and then group and count the logs, use floor:
SELECT floor(log_timestamp/60)*60 AS minute_timestamp, COUNT(*) AS count
FROM logs
GROUP BY floor(log_timestamp/60)*60;

In these examples, the floor function is used to round down numeric values ​​or timestamps for numeric manipulation, price adjustment, timestamp conversion, and aggregation operations. It is very useful when working with numerical and temporal data and can be used for data processing and calculations in various business scenarios.

Note that floor returns an integer type (BIGINT), which can be used in calculations and comparisons with other values.
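The rounding direction is easiest to see on a few literal values, with ceil() shown for contrast. The column aliases are illustrative.

```sql
SELECT floor(3.7)   AS a,   -- 3
       floor(-3.7)  AS b,   -- -4 (rounds toward negative infinity, not toward zero)
       floor(5)     AS c,   -- 5
       ceil(-3.7)   AS d;   -- -3
```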

logic function

case when - implement conditional judgment and branch logic

The CASE WHEN expression returns different values based on conditions. It is similar to conditional statements (such as if-else) in other programming languages.

The general syntax of CASE WHEN is as follows:

CASE WHEN condition1 THEN result1
     WHEN condition2 THEN result2
     ...
     ELSE resultN
END

  • condition1, condition2, … are the conditional expressions to be evaluated.
  • result1, result2, … are the result expressions to return when the corresponding conditions are met.
  • ELSE resultN is optional and specifies a default result expression to return when none of the conditions is met.

Precautions:

  • CASE WHEN evaluates the conditions in order. Once a condition is met, the corresponding result is returned and the remaining conditions are not evaluated.
  • Multiple WHEN clauses can be used to set different conditions and results as needed.
  • If no condition is met and no ELSE clause is supplied, CASE WHEN returns NULL.

The following example shows how to use CASE WHEN in Hive SQL:

SELECT column1, column2,
       CASE WHEN column1 > 10 THEN 'Large'
            WHEN column1 > 5 THEN 'Medium'
            ELSE 'Small'
       END AS size
FROM my_table;

In the above example, different size values are returned depending on the value of column1: 'Large' if column1 is greater than 10; 'Medium' if it is greater than 5 but less than or equal to 10; otherwise 'Small'.
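CASE WHEN is also frequently placed inside an aggregate to count rows per category in a single pass. This sketch assumes a hypothetical table my_table with a numeric column column1 and reuses the thresholds from the example above.

```sql
SELECT SUM(CASE WHEN column1 > 10 THEN 1 ELSE 0 END)                  AS large_count,
       SUM(CASE WHEN column1 > 5 AND column1 <= 10 THEN 1 ELSE 0 END) AS medium_count,
       SUM(CASE WHEN column1 <= 5 THEN 1 ELSE 0 END)                  AS small_count
FROM my_table;
```

Rows where column1 is NULL match none of these conditions and are therefore not counted in any of the three columns.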

with as

The WITH ... AS clause (a common table expression) defines a named subquery that behaves like a temporary result set. The name can then be referenced in the query that follows.

The syntax of WITH ... AS is as follows:

WITH tmp AS (
    -- subquery definition
    SELECT ...
)
SELECT ...
FROM tmp;

Using WITH tmp AS can improve the readability and reusability of a query, especially when the query needs to refer to the same subquery result several times. It avoids writing the same subquery repeatedly and simplifies the structure of the statement.

The result set defined by a WITH clause needs no cleanup; nothing has to be dropped manually.

Its lifetime is tied to the statement that defines it. It is visible only inside that statement and no longer exists once the statement finishes.

This makes intermediate results easy to manage, because there is nothing to delete or release by hand. The definition is evaluated again each time the statement runs, which keeps queries independent of one another.

Note that a WITH definition is valid only within its own statement; it is not visible to other statements, other sessions, or queries running in parallel. To share an intermediate result across several queries, consider CREATE TEMPORARY TABLE (visible for the rest of the current session) or a permanent table.
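A complete statement that refers to the same WITH definition twice might look like the sketch below. It assumes the orders table (order_id, customer_id, product) used earlier in this article; the name customer_orders and the threshold of 3 orders are illustrative.

```sql
WITH customer_orders AS (
    -- Number of orders per customer, defined once.
    SELECT customer_id, COUNT(*) AS order_count
    FROM orders
    GROUP BY customer_id
)
SELECT 'frequent' AS bucket, customer_id, order_count
FROM customer_orders
WHERE order_count >= 3
UNION ALL
SELECT 'occasional' AS bucket, customer_id, order_count
FROM customer_orders
WHERE order_count < 3;
```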

if

In Hive SQL, IF is a conditional function that returns one of two values depending on the result of a condition.

The syntax is as follows:

IF(condition, value_if_true, value_if_false)

where:

  • condition is a Boolean expression that specifies the condition.
  • value_if_true is the value or expression to return if the condition is true.
  • value_if_false is the value or expression to return if the condition is false.

Example usage:

SELECT IF(salary > 5000, 'High', 'Low') AS salary_category
FROM employees;

In the above example, the result depends on the value of salary: if the salary is greater than 5000, 'High' is returned; otherwise 'Low' is returned.
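When the tested column can be NULL, the comparison evaluates to NULL and IF returns the false branch, so a NULL salary would be labeled 'Low' above. A nested IF can handle that case explicitly; this sketch assumes the same employees table, and the 'Unknown' label is illustrative.

```sql
SELECT IF(salary IS NULL, 'Unknown',
          IF(salary > 5000, 'High', 'Low')) AS salary_category
FROM employees;
```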

select "1" as xxx from table

This pattern is usually used to add a temporary auxiliary column that distinguishes data sources: a constant string is placed in front of AS and becomes the value of the new column.

For example:

select id,"1" as source from code_table_1
union all
select id,"2" as source from code_table_2

In this sample code, '1' and '2' are used as source identifiers. They form an auxiliary column that distinguishes rows coming from code_table_1 from rows coming from code_table_2.

  1. In the first SELECT statement, '1' as source marks the row as coming from the table code_table_1: the source identifier is set to '1' under the alias source.
  2. In the second SELECT statement, '2' as source marks the row as coming from the table code_table_2: the source identifier is set to '2' under the alias source.

'1' and '2' are only auxiliary identifiers in this code. They distinguish the data sources and make it possible to filter by source or assign priorities; they have no other special meaning.

Example usage:

SELECT `code`, name
FROM (
	SELECT `code`, name, row_number() OVER (PARTITION BY name ORDER BY source DESC) AS rn
	FROM (
		SELECT `code`, name, '1' AS source
		FROM n000_code_cb18
		UNION ALL
		SELECT `code`, name, '2' AS source
		FROM n000_code_cb18_new
	) a
) aa
WHERE aa.rn = 1;

In the innermost subquery, the two SELECT statements read from different tables:

  1. In the first SELECT statement, '1' as source indicates that the row is from the table n000_code_cb18 and sets the source identifier to '1'.
  2. In the second SELECT statement, '2' as source indicates that the row is from the table n000_code_cb18_new and sets the source identifier to '2'.

UNION ALL merges the rows of both tables. ROW_NUMBER() then partitions the merged rows by name and orders each partition by source in descending order, so the row from n000_code_cb18_new (source '2') is numbered 1 whenever both tables contain the same name.

In the outer query, the condition WHERE aa.rn = 1 keeps only the row numbered 1, i.e. the highest-priority row in each partition.

Origin blog.csdn.net/wt334502157/article/details/131460277