In some scenarios, we may want to extract the first 100 rows from a large table named unparsed and apply a UDF to only those rows. The most obvious SQL statement is as follows:
@pyspark
insert into table parsed
select url, parse_func(content) as parsed_content from unparsed
limit 100;
However, this statement actually applies the UDF to every row in unparsed and only then extracts the first 100 rows, which is not what we want. To avoid this, move the limit into a subquery so that it runs before the UDF:
@pyspark
insert into table parsed
select url, parse_func(content) as parsed_content
from (
    select url, content from unparsed
    limit 100
);
Note that the following statement, which only wraps the original query in parentheses, has no effect, and the speed does not change:
@pyspark
insert into table parsed
(select url, parse_func(content) as parsed_content from unparsed limit 100);
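The difference between the two query shapes can be illustrated with a small plain-Python analogy. This is not Spark code and says nothing about how the query planner works internally. It only mimics the evaluation order described above, assuming the UDF runs before the limit in the first form and after it in the subquery form. The list `unparsed`, the stand-in `parse_func`, and the two helper functions are hypothetical names used for illustration.

```python
from itertools import islice

calls = 0  # counts how many times the stand-in UDF runs


def parse_func(content):
    """Hypothetical stand-in for the real UDF; it only counts invocations."""
    global calls
    calls += 1
    return content.upper()


# Hypothetical stand-in for the large `unparsed` table: (url, content) pairs.
unparsed = [(f"http://example.com/{i}", f"page {i}") for i in range(10_000)]


def udf_then_limit(rows, n):
    # Mirrors the first query: the UDF runs on every row, then n rows are kept.
    parsed = [(url, parse_func(content)) for url, content in rows]
    return parsed[:n]


def limit_then_udf(rows, n):
    # Mirrors the subquery form: n rows are taken first, then the UDF runs.
    return [(url, parse_func(content)) for url, content in islice(rows, n)]


calls = 0
slow = udf_then_limit(unparsed, 100)
calls_slow = calls

calls = 0
fast = limit_then_udf(unparsed, 100)
calls_fast = calls

print(calls_slow, calls_fast)  # 10000 100
```

Both functions return the same 100 rows, but the first calls the UDF once per row of the whole table, while the second calls it only 100 times. With an expensive UDF, that difference accounts for most of the running time.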