Hive SQL: extracting multiple matching values with regexp_extract and regexp_replace

Sample text: 《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有<<1984>>也蛮好看。 (In English: "The Ordinary World" has good ratings, the movie adapted from "The Hunchback of Notre Dame" is good, and "1984" is also a pretty good read. Note that the third title is wrapped in << >> rather than the book title marks 《 》.)

How can the regexp_extract and regexp_replace functions be used to extract all the book titles from the text above?

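-- Read from the innermost call outward: normalize << >> to 《 》, cut the text
-- after the last 》, put a comma before each title, then drop the first comma.
select substr(
         regexp_replace(
           regexp_extract(
             regexp_replace(regexp_replace('《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有<<1984>>也蛮好看。','<<','《'),'>>','》')
             ,'(.*》)',1)
           ,'.*?(《[^》《]+》)',',$1')
         ,2) as books
;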

Code analysis:
step1: The two inner regexp_replace() calls normalize the ASCII brackets, turning << into 《 and >> into 》.
step2: regexp_extract returns the part of the text that matches the pattern '(.*》)'. Because .* is greedy, the match runs from the start of the text to the last 》, so the main purpose of this step is to drop everything after the last closing title mark.

select regexp_extract(
         regexp_replace(regexp_replace('《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有<<1984>>也蛮好看。','<<','《'),'>>','》')
         ,'(.*》)',1)
;

The result at this point is:

《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有《1984》

step3: regexp_replace replaces the text in front of each book title with a comma

-- $1 refers to whatever the first parenthesized group matched
select regexp_replace(
         '《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有《1984》'
         ,'.*?(《[^》《]+》)',',$1')
;

The result at this point is:

,《平凡的世界》,《巴黎圣母院》,《1984》

Two things to note here:
*1) The regular expression uses the non-greedy quantifier .*?. If the greedy .* is used instead, the first match swallows everything up to the last title, and the final result is

,《1984》
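
The difference can be checked without a Hive session. The sketch below uses Python's re module as a stand-in, on the assumption that it treats these particular patterns the same way as the Java regex engine behind Hive's regexp_replace. Python writes the backreference as \1 where Hive uses $1. The names STEP2 and join_titles are made up for this illustration.

```python
import re

# Output of step 2 for the sample sentence (text after the last title removed).
STEP2 = "《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有《1984》"


def join_titles(text, prefix):
    """Replace `prefix` followed by a title with ',' + title.

    Mirrors regexp_replace(text, prefix || '(《[^》《]+》)', ',$1') in Hive.
    """
    return re.sub(prefix + r"(《[^》《]+》)", r",\1", text)


# Lazy prefix: stops at the first title each time, so every title is kept.
print(join_titles(STEP2, r".*?"))  # ,《平凡的世界》,《巴黎圣母院》,《1984》

# Greedy prefix: one match runs up to the last title, the others are lost.
print(join_titles(STEP2, r".*"))   # ,《1984》
```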

*2) If step 2 is omitted, the text after the last book title is left untouched by the replacement, so the result is not what we want.

select regexp_replace(
         regexp_replace(regexp_replace('《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有<<1984>>也蛮好看。','<<','《'),'>>','》')
         ,'.*?(《[^》《]+》)',',$1')
;

The result at this point is:

,《平凡的世界》,《巴黎圣母院》,《1984》也蛮好看。
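
One more detail about the step 3 pattern concerns the character class between the title marks. Inside square brackets, | and a non-leading ^ are ordinary characters, so a class written as [^》|^《] excludes | and ^ as well as the two title marks, while [^》《] excludes only the marks. The following Python sketch illustrates the difference on a made-up string; SAMPLE and find_titles are hypothetical names, and Python's re is again assumed to behave like Hive's Java regex for these patterns.

```python
import re

# Hypothetical input: the first title contains a '|' character.
SAMPLE = "《A|B》 and 《C》"


def find_titles(pattern, text):
    """Return every substring of `text` that matches `pattern`."""
    return re.findall(pattern, text)


# Class excludes only the title marks: both titles are found.
print(find_titles(r"《[^》《]+》", SAMPLE))    # ['《A|B》', '《C》']

# Class also excludes '|' and '^': the title containing '|' is skipped.
print(find_titles(r"《[^》|^《]+》", SAMPLE))  # ['《C》']
```

For the sample sentence in this article both spellings give the same result, because none of the titles contains | or ^.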

step4: substr drops the leading comma by returning everything from the second character onward

select substr(',《平凡的世界》,《巴黎圣母院》,《1984》',2)
;

The final result is:

《平凡的世界》,《巴黎圣母院》,《1984》

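For readers who want to replay the four steps outside Hive, here is a minimal Python sketch of the same pipeline. It is not the Hive query itself: it assumes Python's re module matches these patterns the same way as Hive's Java regex, uses \1 where Hive uses $1, and the names TEXT and extract_books are invented for the example.

```python
import re

TEXT = "《平凡的世界》评分不错,《巴黎圣母院》改变成的电影不错,还有<<1984>>也蛮好看。"


def extract_books(text):
    """Return the book titles in `text` as one comma-separated string."""
    # step1: normalize << and >> to the title marks 《 and 》.
    normalized = text.replace("<<", "《").replace(">>", "》")

    # step2: greedy .* keeps everything up to the last 》.
    match = re.search(r"(.*》)", normalized)
    if match is None:
        return ""  # no closing title mark, so there is nothing to extract
    head = match.group(1)

    # step3: replace the text before each title with a comma.
    joined = re.sub(r".*?(《[^》《]+》)", r",\1", head)

    # step4: drop the leading comma, like substr(..., 2) in Hive.
    return joined[1:]


print(extract_books(TEXT))  # 《平凡的世界》,《巴黎圣母院》,《1984》
```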
Origin blog.csdn.net/p1306252/article/details/133384270