PostgreSQL: Use JSON functions and regular expressions to help you process semi-structured data easily and efficiently


1. Introduction

PostgreSQL is a powerful, easy-to-use, stable and reliable open-source relational database management system that is widely used in applications of all sizes, including enterprise systems.
It is an object-relational database and offers rich extensibility and flexibility.

  • Supports complex queries and operations, including full-text search, JSON data processing, and transaction processing.
  • Supports advanced functions such as data analysis, graphics processing, and time-series analysis, some of them through extensions.
  • Supports data backup and recovery, and can be scaled out with techniques such as partitioning and sharding, which helps enterprises manage and maintain very large data sets.
  • Supports highly concurrent, high-performance reads and writes, and handles large amounts of data stably and reliably.

In addition, PostgreSQL has extensive community support and a rich ecosystem, and it integrates easily with other open-source tools and applications.

This article covers some JSON and array processing problems that I ran into while developing data models over the past two days.

Environment: PostgreSQL 14.1, Windows 11

2. JSON data processing scenarios

Before processing data, you first need some JSON data. We start with two ways of producing a JSON value: the ::json cast and the to_json() function. They differ as follows:

  • ::json: The value to be converted must be a string that contains strictly valid JSON.
  • to_json(): In addition to strings, the value to be converted can also be a number, an array, and so on.

For example, use ::json to convert the string '["12","ab"]' into a JSON value:

select '["12","ab"]'::json ary;

To take the first value, apply the ->> operator with index 0.

Use to_json() to convert the array array['12','ab'] into a JSON value:

select to_json(array['12','ab']) ary;

The first value is taken in the same way.
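As a minimal sketch of the difference between the two conversions (the column aliases are illustrative and not from the original examples), the queries below read the first element from both JSON values and show that to_json() also accepts non-string input:

```sql
-- ::json parses a string that already contains valid JSON.
SELECT '["12","ab"]'::json ->> 0 AS first_from_cast;              -- 12

-- to_json() converts a SQL value (here a text array) to JSON.
SELECT to_json(array['12','ab']) ->> 0 AS first_from_to_json;     -- 12

-- to_json() also accepts numbers and plain text; text becomes a JSON string.
SELECT to_json(12) AS num_json, to_json('ab'::text) AS text_json; -- 12 | "ab"
```

A string that is not valid JSON, such as 'ab'::json, raises an error, whereas to_json('ab'::text) wraps the text in a JSON string.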

With these two methods understood, the four scenarios below put them to use:

2.1 Scenario 1: JSON value

Task 1: Extract the value of the key a from the string '{"a":"1","b":"2"}'. This is relatively simple. As introduced above, use ->> to obtain the value. The SQL is as follows:

select '{"a":"1","b":"2"}'::json->>'a' as "a值";

Note that ->> returns a value of type text. If you want a JSON value instead, use -> as in the following SQL:

select '{"a":"1","b":"2"}'::json-> 'a' as "a值";


Task 2: Extract the value of the key a from the string '[{"a":"1","b":"2"}]'. This has one more layer of nesting, so you need one more extraction step. Note that the first step must use -> so that it returns a JSON value from which you can continue to extract. Whether the structure is an array or a key-value object, values are extracted with -> and ->>. The former returns the JSON type, and the latter returns the text type.

select '[{"a":"1","b":"2"}]'::json->0->>'a' as "a-text";
select '[{"a":"1","b":"2"}]'::json->0-> 'a' as "a-json";

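To check the return types directly, a small sketch using pg_typeof() is shown below. The #>> path operator in the second query is not used elsewhere in this article; it is included only as an alternative way to walk several levels in one step, and the aliases are illustrative.

```sql
-- -> keeps the JSON type, so it can be chained; ->> ends the chain with text.
SELECT pg_typeof('{"a":"1","b":"2"}'::json ->  'a') AS arrow_type,        -- json
       pg_typeof('{"a":"1","b":"2"}'::json ->> 'a') AS double_arrow_type; -- text

-- The #>> operator follows a whole path at once and returns text.
SELECT '[{"a":"1","b":"2"}]'::json #>> '{0,a}' AS a_text;                 -- 1
```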

2.2 Scenario 2: Split key-value pairs

Task: Split the key-value pairs in the string '{"a":"1","b":"2"}' into one row per pair, with the key and the value each in its own column.

key | value
a   | 1
b   | 2

To get the keys of a JSON key-value structure, use json_object_keys(). It returns one key per row. The results are as follows:

select '{"a":"1","b":"2"}'::json "k_v",json_object_keys('{"a":"1","b":"2"}'::json) AS key;

There is no matching function that returns only the values, but once you have the key you can use it to look up the value: for each row, take the value from the k_v column using the key in the key column.
The SQL for the value is as follows:

select '{"a":"1","b":"2"}'::json ->> json_object_keys('{"a":"1","b":"2"}'::json) AS val;

Finally, let's put the three together and take a look:

select '{"a":"1","b":"2"}'::json AS "k_v"
      ,json_object_keys('{"a":"1","b":"2"}'::json) AS key
      ,'{"a":"1","b":"2"}'::json ->>
         json_object_keys('{"a":"1","b":"2"}'::json) AS val;

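As an alternative sketch (not the approach used above), the built-in json_each_text() function returns each key together with its value, which avoids calling json_object_keys() twice:

```sql
-- json_each_text() expands the top-level object into (key, value) rows,
-- with the values returned as text.
SELECT key, value
FROM json_each_text('{"a":"1","b":"2"}'::json);
-- key | value
-- a   | 1
-- b   | 2
```

If the values should stay JSON instead of text, json_each() has the same shape.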

2.3 Scenario 3: Split string

Task: Split the key-value-like string '分数: 5' ("Score: 5") into two columns. Note that there may be spaces in between.

This is different from the previous scenario. There, the input was a key-value pair in standard JSON format. Here, it is just a string. It looks like a key-value pair, but its "key" and "value" are not enclosed in double quotes, so it is not valid JSON.

Therefore, you need to split the string to separate the values, and you need to deal with the spaces before splitting.

There are several ways to do this. For example, first use replace('分数: 5',' ','') to remove the spaces, then use regexp_split_to_array(<string>, '[::]') to split the string on colons. The character class contains both the ASCII colon and the full-width (Chinese) colon, so that either kind is handled. This function returns an array, which you can convert to JSON with to_json() and then extract a value from. You can also take the value directly from the array. The example below takes the first element both ways.

Note that JSON indexes start at 0, while array indexes start at 1.

-- Convert to JSON, then extract the value
SELECT to_json(regexp_split_to_array(replace('分数: 5',' ',''), '[::]'))->>0;
-- Extract the value directly from the array
SELECT (regexp_split_to_array(replace('分数: 5',' ',''), '[::]'))[1];

The method above replaces first and then splits. With a regular expression you can get the value in one step. The value is still obtained with regexp_split_to_array(), but the pattern is different. The pattern '(\W+)' means:

  • \W: Matches any character that is not a letter, a digit, a Chinese character, or an underscore (_)
  • +: Matches one or more of the preceding item
  • (): Groups the pattern, treating the content in the parentheses as a whole

So '(\W+)' matches one or more consecutive characters that are not letters, digits, Chinese characters, or underscores. If there are spaces and a colon in the middle, they are all matched at once.
The results of splitting with '(\W+)' are as follows:

-- Convert to JSON, then extract the value
SELECT to_json(regexp_split_to_array('分数: 5', '(\W+)'))->>0;
-- Extract the value directly from the array
SELECT (regexp_split_to_array('分数: 5', '(\W+)'))[1];

The final step is to split '分数: 5' into two columns. Taking the second method as an example, the final SQL is as follows:

-- Convert to JSON, then extract the values
SELECT to_json(regexp_split_to_array('分数: 5', '(\W+)'))->>0 AS key,
       to_json(regexp_split_to_array('分数: 5', '(\W+)'))->>1 AS value;
-- Extract the values directly from the array
SELECT (regexp_split_to_array('分数: 5', '(\W+)'))[1] AS key,
       (regexp_split_to_array('分数: 5', '(\W+)'))[2] AS value;

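The queries above call regexp_split_to_array() once per column. A possible variation, sketched here with a hypothetical subquery alias t and column name parts, splits the string once and then picks the elements by position. The pattern '\s*[::]\s*' is also an assumption of this sketch: it names the two colon characters explicitly instead of relying on \W, and it assumes the default standard_conforming_strings = on setting so that the backslashes reach the regex engine unchanged.

```sql
-- Split once in a subquery, then pick the elements by position (arrays start at 1).
-- The pattern matches an ASCII or full-width colon plus any surrounding whitespace.
SELECT parts[1] AS key,
       parts[2] AS value
FROM (
    SELECT regexp_split_to_array('分数: 5', '\s*[::]\s*') AS parts
) AS t;
-- key  | value
-- 分数 | 5
```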

2.4 Scenario 4: Matching strings in batches

Task: Remove all angle brackets <> and the characters inside them from the string '<h3>标签1</h3>\n<p><strong>等级3:一般</strong></p>', leaving only the text.

You can remove characters with the regexp_replace() function, which supports regular expressions. To remove all angle brackets and the strings inside them, you need a pattern that matches them. There are many possible patterns. Here are two: '<[^>]+>' and '<.*?>'.

  • < and > represent the left and right angle brackets respectively.
  • [^>] represents any character except the right angle bracket.
  • + indicates that the preceding item can appear one or more times.
  • .*? is a non-greedy match of any characters, so each pair of angle brackets and the string inside it is matched separately.
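To see why the non-greedy form or the negated character class is needed, the following sketch compares them with a plain greedy .* on a short made-up input. The 'g' flag used here makes regexp_replace() replace every match instead of only the first one; the aliases are illustrative.

```sql
-- Greedy .* runs to the last '>' and swallows the text between the tags.
SELECT regexp_replace('<b>x</b>', '<.*>', '', 'g')    AS greedy;      -- (empty string)
-- Non-greedy .*? stops at the first '>' so only the tags are removed.
SELECT regexp_replace('<b>x</b>', '<.*?>', '', 'g')   AS non_greedy;  -- x
-- [^>]+ cannot cross a '>' at all, which gives the same result.
SELECT regexp_replace('<b>x</b>', '<[^>]+>', '', 'g') AS negated;     -- x
```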

Taking the first one as an example, check the matching results:

SELECT regexp_replace('<h3>标签1</h3>\n<p><strong>等级3:一般</strong></p>', '<[^>]+>', '');

Judging from the result, the first match has been found and removed correctly, which shows that the logic works.
But the goal is to remove all angle brackets and the strings inside them. This requires the optional flags parameter of regexp_replace(). By default, the function replaces only the first match. Passing the flag 'g' makes it replace all matches.

Several commonly used modes are as follows:

  • 'g': Global matching. All matches in the string are found and replaced. By default only the first match is replaced.
  • 'i': Case-insensitive matching. By default matching is case-sensitive.
  • 'm': Historical synonym for 'n', newline-sensitive matching. ^ and $ also match at the start and end of each line, and . and negated bracket expressions do not match a newline.
  • 's': Non-newline-sensitive matching, which is the default. A newline is treated as an ordinary character. This flag does not control greediness, which is set per quantifier (for example .* versus .*?).
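A short sketch of how the flags change the result, using made-up input strings and illustrative aliases:

```sql
-- No flags: only the first match is replaced.
SELECT regexp_replace('a1b2c3', '[0-9]', '#')          AS first_only;   -- a#b2c3
-- 'g': every match is replaced.
SELECT regexp_replace('a1b2c3', '[0-9]', '#', 'g')     AS all_matches;  -- a#b#c#
-- Flags can be combined: 'gi' replaces every match and ignores case.
SELECT regexp_replace('Tag TAG tag', 'tag', 'x', 'gi') AS any_case;     -- x x x
```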

Examples of the final two matching methods are as follows:

SELECT regexp_replace('<h3>标签1</h3>\n<p><strong>等级3:一般</strong></p>', '<[^>]+>', '', 'g');
SELECT regexp_replace('<h3>标签1</h3>\n<p><strong>等级3:一般</strong></p>', '<.*?>', '', 'g');

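One detail worth checking in your own environment: with the default standard_conforming_strings = on setting, the \n inside an ordinary '...' literal is a backslash followed by the letter n, not a line break, so it stays in the output as two characters. The sketch below is an optional follow-up rather than part of the original task. It uses an E'...' escape string to get a real newline and then replaces it with a space; the alias plain_text is illustrative.

```sql
-- E'...' makes \n a real newline; the inner call strips the tags,
-- the outer call turns each newline into a single space.
SELECT regexp_replace(
         regexp_replace(E'<h3>标签1</h3>\n<p><strong>等级3:一般</strong></p>', '<[^>]+>', '', 'g'),
         E'\n', ' ', 'g'
       ) AS plain_text;   -- 标签1 等级3:一般
```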

3. Summary

This article focuses on using JSON functions and regular expressions in PostgreSQL and introduces several common data processing techniques, from simple to more involved: extracting JSON values, splitting key-value pairs, splitting strings, and replacing matches in bulk.

PostgreSQL's JSON data type is very flexible and supports key-value pairs and nested array structures, which makes it easy to store and retrieve semi-structured data. Combined with regular expressions, which are well suited to string processing, many such processing tasks become straightforward.

Due to the length limitation of the article, only a small part of JSON functions and regular expressions are introduced. If you want to get more information and understand more comprehensive knowledge points, you can check the official documentation.




Related reading:

Regular expressions (Although this is an explanation of regular expressions in Python, the underlying knowledge points of regular expressions are the same.)


Origin blog.csdn.net/qq_45476428/article/details/131749333