[Sublime] Using the find and replace function of Sublime Text 3 to extract all the news headlines from an HTML page

1. Task

Since we want news headlines, where can we find the most of them in one place? On a ranking page, of course. We chose the NetEase News Ranking.
Screenshot of NetEase News Ranking
Our goal is to get every news headline on that page. What are the ways to do this?

First, we can crawl the HTML document of the page, parse it with the bs4 library, and then use regular expressions to extract the relevant content.
Second, we can use the find and replace function of a text editor that supports regular expressions to delete everything between two consecutive headlines, leaving only the headline text.
Third, according to the course taught by Teacher Chen Guang of Beijing University of Posts and Telecommunications, certain browser plug-ins can do something similar even more conveniently. The blogger still doesn't know which plug-in that is, though, so I can't say more about it here. If anyone knows, please leave a comment!

2. Solution

The first method is a simple crawler. Python and R are both very convenient tools for this. I won't go into detail here; interested readers are welcome to leave the blogger a comment.
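For readers who want a starting point, here is a minimal sketch of the parsing step only. It assumes the headlines are the text of <a> tags whose href ends in .html, which is what the regular expression later in this post relies on. The sample markup and the names SAMPLE_HTML, HeadlineParser and extract_headlines are made up for illustration, and the standard-library html.parser is swapped in for bs4 so that the example runs without installing anything. Downloading the page is left out.

```python
from html.parser import HTMLParser

# Hypothetical sample markup; the real ranking page's structure may differ.
SAMPLE_HTML = """
<table>
<tr><td><a href="https://news.example.com/a/001.html">First headline</a></td>
<td>12345</td></tr>
<tr><td><a href="https://news.example.com/a/002.html">Second headline</a></td>
<td>6789</td></tr>
</table>
"""


class HeadlineParser(HTMLParser):
    """Collect the text of every <a> tag whose href ends in .html."""

    def __init__(self):
        super().__init__()
        self.headlines = []
        self._in_link = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value can be None.
        href = dict(attrs).get("href") or ""
        if tag == "a" and href.endswith(".html"):
            self._in_link = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_link:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._in_link:
            self.headlines.append("".join(self._buffer).strip())
            self._in_link = False


def extract_headlines(html):
    parser = HeadlineParser()
    parser.feed(html)
    return parser.headlines


print(extract_headlines(SAMPLE_HTML))
```

On a real page, some links ending in .html may not be headlines, so the filter in handle_starttag would probably need tightening.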

The second method requires basic knowledge of regular expressions. Looking at the page source, we found that the content to delete lies between </a> and .html">, so we need a regular expression that matches this part. The expression is: <\/a>[\S\s]*?\.html">
Regular expression find and replace
Note that the .* option must be selected to enable the regular expression function! (It is at the bottom left corner of the picture.)
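To check the pattern outside the editor, the same replacement can be reproduced with Python's re module. This is only a sketch: the sample fragment and the names SAMPLE_HTML, PATTERN and strip_between_headlines are made up, and replacing each match with a newline is an assumption, since the replacement text used in Sublime is not stated above.

```python
import re

# Hypothetical fragment shaped like the ranking page; real markup may differ.
SAMPLE_HTML = (
    '<td><a href="https://news.example.com/a/001.html">First headline</a></td>\n'
    '<td>12345</td>\n'
    '<td><a href="https://news.example.com/a/002.html">Second headline</a></td>\n'
    '<td>6789</td>\n'
    '<td><a href="https://news.example.com/a/003.html">Third headline</a></td>\n'
)

# The same expression as typed into Sublime's Find field.
PATTERN = re.compile(r'<\/a>[\S\s]*?\.html">')


def strip_between_headlines(html):
    # Assumption: each match is replaced by a newline,
    # so that every headline ends up on its own line.
    return PATTERN.sub("\n", html)


result = strip_between_headlines(SAMPLE_HTML)
print(result)
```

The pattern only removes what sits between two headlines, so the markup before the first headline and after the last one is still there and has to be deleted by hand.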

Normally we would use . to match any character, but . cannot match a newline, so here we use [\S\s]*? instead. [\s] matches any whitespace character (spaces, newlines, tab indentation, etc.); [\S] matches any non-whitespace character; combined as [\S\s], they match every character. *? means a minimal (lazy) match, as opposed to a greedy match.
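The difference between these variants is easy to see on a tiny string. The sketch below uses Python's re module, where . also skips newlines by default. TEXT is a made-up example.

```python
import re

# Made-up text: three "headlines" A, B, C separated by markup that spans lines.
TEXT = 'A</a>\n<x>1.html">B</a>\n<x>2.html">C'

# . does not match a newline, so this pattern finds nothing.
dot_lazy = re.findall(r'</a>.*?\.html">', TEXT)

# [\S\s] matches every character, and *? stops at the nearest .html">
any_lazy = re.findall(r'</a>[\S\s]*?\.html">', TEXT)

# Without the ?, the match runs on to the last .html"> and swallows B.
any_greedy = re.findall(r'</a>[\S\s]*\.html">', TEXT)

print(dot_lazy)
print(any_lazy)
print(any_greedy)

# Lazy replacement keeps all three headlines; greedy replacement loses B.
print(re.sub(r'</a>[\S\s]*?\.html">', "\n", TEXT))
print(re.sub(r'</a>[\S\s]*\.html">', "\n", TEXT))
```

This is why the lazy form is needed here: a greedy match would delete everything from the first headline to the last, including every headline in between.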

Here is the result.
All news headlines

3. Reflection

If you have the opportunity, you should study how to use the browser plug-in to accomplish this task more conveniently!

Origin blog.csdn.net/why_not_study/article/details/105416164