A pitfall with the SQL LIMIT clause

In some scenarios, we may want to extract the first 100 rows from a large table unparsed and apply a UDF to only those rows. The SQL statement that first comes to mind is:

@pyspark
insert into table parsed
select url, parse_func(content) as parsed_content from unparsed
limit 100;

However, this statement actually applies the UDF to all the rows in unparsed and only then extracts the first 100 rows, which is not what we want. To avoid this, move the limit into a subquery so that the UDF only sees the rows that survive it:

@pyspark
insert into table parsed
select url, parse_func(content) as parsed_content
from (
    select url, content from unparsed
    limit 100
) t;

Note that the following rewrite has no effect. Wrapping the whole select in parentheses leaves the limit outside the UDF call, so the speed will not change:

@pyspark
insert into table parsed
(select url, parse_func(content) as parsed_content from unparsed limit 100);
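
To make the cost difference between the first two statements concrete, here is a toy model in plain Python. It is not Spark and says nothing about how the query planner works internally; it only mimics the behaviour described above. The in-memory list standing in for unparsed, the upper-casing body of parse_func, and the helper names are all assumptions made for illustration.

```python
from itertools import islice


def make_counting_udf():
    """Return a stand-in UDF plus a counter of how often it was called."""
    calls = {"n": 0}

    def parse_func(content):
        calls["n"] += 1
        return content.upper()  # placeholder for an expensive parse

    return parse_func, calls


# Hypothetical stand-in for the unparsed table: (url, content) pairs.
unparsed = [(f"http://example.com/{i}", f"page {i}") for i in range(10_000)]


def udf_then_limit(rows, n):
    """Model of the first statement: parse every row, then keep n."""
    parse_func, calls = make_counting_udf()
    parsed = [(url, parse_func(content)) for url, content in rows]
    return parsed[:n], calls["n"]


def limit_then_udf(rows, n):
    """Model of the subquery rewrite: keep n rows, then parse only those."""
    parse_func, calls = make_counting_udf()
    limited = list(islice(rows, n))
    parsed = [(url, parse_func(content)) for url, content in limited]
    return parsed, calls["n"]


slow_rows, slow_calls = udf_then_limit(unparsed, 100)
fast_rows, fast_calls = limit_then_udf(unparsed, 100)

print(slow_calls, fast_calls)  # 10000 100
print(slow_rows == fast_rows)  # True
```

Both variants return the same 100 rows here because a Python list has a fixed order, but the first one calls the UDF 100 times more often. On a real cluster, prefixing a statement with EXPLAIN shows where the limit sits relative to the UDF in the query plan.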

Origin blog.csdn.net/raelum/article/details/133578044