Crawling movie information with a Python web crawler

A web crawler can be broadly divided into three simple steps:

    The first step is to get the data,

    The second step is to process the data,

    The third step is to store the data.
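The three steps can be wired together roughly as follows. This is only a minimal sketch: the names `crawl` and `store_as_csv`, and the choice of a CSV file with `title` and `date` columns for the storage step, are assumptions made for illustration and do not come from the original code.

```python
import csv


def crawl(url, fetch, process, store):
    """Run the three steps in order; each step is a plain function passed in."""
    raw = fetch(url)          # step 1: get the data
    records = process(raw)    # step 2: process the data
    store(records)            # step 3: store the data
    return records


def store_as_csv(path):
    """Build a store step that writes (title, date) rows to a CSV file.

    The CSV format and the column names are assumptions for this sketch.
    """
    def store(records):
        with open(path, "w", newline="", encoding="utf-8") as handle:
            writer = csv.writer(handle)
            writer.writerow(["title", "date"])
            writer.writerows(records)
    return store
```

Keeping each step as a separate function means the fetching and parsing parts can be swapped out or tested without touching the others.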

To obtain the data I used urllib from Python's standard library, a module that makes it very easy to fetch page content.

Specifically:

The data I want to crawl here is the movie titles, dates, and so on from a movie page on the Movie Paradise site.

headers here is an argument: when your browser accesses a server, the server learns some information about your browser, your operating system, and so on. A function call then checks whether the response succeeded; on success the site returns status 200. Reading the response at that point gives the source code of the page. I also converted the result to a string: the page source shows that the page is encoded as gb2312, so decoding it as gb2312 is all that is needed.
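The request described here might look something like the sketch below. The URL and the User-Agent string are placeholders, since the post does not show the real ones, and `build_request`, `decode_page`, and `fetch_html` are hypothetical helper names.

```python
import urllib.request

# Placeholder values: the original post does not show its URL or header string.
URL = "https://www.example.com/"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


def build_request(url, headers=HEADERS):
    """Attach browser-like headers so the server sees an ordinary User-Agent."""
    return urllib.request.Request(url, headers=headers)


def decode_page(raw, encoding="gb2312"):
    """Turn the raw bytes into a string using the page's declared encoding."""
    # errors="replace" keeps going if a few bytes fall outside the encoding.
    return raw.decode(encoding, errors="replace")


def fetch_html(url, encoding="gb2312", timeout=10):
    """Return the decoded page source, or None if the status is not 200."""
    request = build_request(url)
    # urlopen raises urllib.error.HTTPError for 4xx/5xx responses, so this
    # check mainly guards against other non-200 status codes.
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.getcode() != 200:
            return None
        raw = response.read()
    return decode_page(raw, encoding)
```

Calling `fetch_html(URL)` would then return the page source as a string, ready for parsing.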

As the page source shows, charset is set to gb2312, which confirms the encoding.
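Instead of reading the charset by eye, it could also be pulled out of the raw bytes before decoding. The following is a rough sketch using a regular expression; `detect_charset` is a hypothetical helper, and the fallback to utf-8 is an assumption rather than something the post specifies.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form.
_CHARSET_RE = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([\w-]+)', re.IGNORECASE
)


def detect_charset(raw, default="utf-8"):
    """Look for a charset declaration near the top of the raw page bytes."""
    match = _CHARSET_RE.search(raw[:2048])
    if match is None:
        return default
    return match.group(1).decode("ascii").lower()
```

The returned name can be passed straight to `bytes.decode`.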

    When we fetch the page data, we find that it is in HTML format, with a lot of HTML and CSS code in it, but we only want some of the text. What should we do?

    This is where a powerful data processing module comes in: beautifulsoup4, commonly known as Beautiful Soup. After installing this module, we can process the HTML further and extract the information we need.

Here we use Beautiful Soup's CSS selector support to pick out only the information we want. According to the page source, the elements with class equal to ulink hold the information we need, and the elements with color=#8F8C89 hold the time we need. The select method filters out the selected information. Final results:
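A minimal sketch of this selection step is shown below. It assumes beautifulsoup4 is installed (`pip install beautifulsoup4`). The sample HTML is invented for illustration and only imitates the two attributes mentioned above; the tag names in it and the helper name `extract_movies` are assumptions.

```python
from bs4 import BeautifulSoup

# Invented sample markup: only class="ulink" and color="#8F8C89" come from
# the post; the tags and the text are placeholders.
SAMPLE_HTML = """
<table>
  <tr><td><a href="/html/1.html" class="ulink">Example Movie One</a></td></tr>
  <tr><td><font color="#8F8C89">Date: 2019-06-01 Clicks: 0</font></td></tr>
  <tr><td><a href="/html/2.html" class="ulink">Example Movie Two</a></td></tr>
  <tr><td><font color="#8F8C89">Date: 2019-06-02 Clicks: 0</font></td></tr>
</table>
"""


def extract_movies(html):
    """Pair each title (class ulink) with its date line (color #8F8C89)."""
    soup = BeautifulSoup(html, "html.parser")
    titles = [tag.get_text(strip=True) for tag in soup.select(".ulink")]
    infos = [tag.get_text(strip=True) for tag in soup.select('[color="#8F8C89"]')]
    # zip assumes the two lists line up one-to-one, as in the sample above.
    return list(zip(titles, infos))


for title, info in extract_movies(SAMPLE_HTML):
    print(title, "|", info)
```

The selectors match on the class and the color attribute alone, so they do not depend on which tags the real page uses.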

The click count shows 0 because it is 0 on the site itself, which is probably a problem with the site. In this way we obtain the movie information and the release times, and there is a lot of it. From implementing this simple crawler, I found that writing web crawlers requires not only Python knowledge but also some understanding of front-end topics such as HTML and CSS. A crawler collects a site's data, and some sites have set up anti-crawler measures, so crawler techniques keep being updated.


Origin blog.csdn.net/fei347795790/article/details/92621874