Python crawler tutorial in practice: crawling NetEase news

Foreword

The text and images in this article come from the internet and are for learning and exchange only, not for any commercial use. Copyright belongs to the original author; if there is any problem, please contact us to resolve it.

Author: Amauri

PS: If you need Python learning materials, you can get them yourself via the link below

http://note.youdao.com/noteshare?id=3054cce4add8a909e784ad934f956cef

The crawler in this article is entry-level, so experienced readers may want to skip it.

We will mainly crawl NetEase news: the news title, author, source, release time, and body text.

First, open the 163.com site and pick a category at random; here I choose the domestic news section. Then right-click to view the page source, and notice that the news list is nowhere to be found in it. This means the page is rendered asynchronously, i.e., the data is fetched through an API.

Having confirmed that, press F12 to open the console in Google Chrome and click the Network tab. Keep scrolling the page, and addresses like "... special/00804KVA/cm_guonei_03.js? ...." appear on the right; open their Response and you find exactly the API we are looking for. (screenshot in the original post)

You can see that these interface addresses follow a clear pattern: "cm_guonei_03.js", "cm_guonei_04.js", and so on, which obviously generalizes to:

http://temp.163.com/special/0...*).js

This address is the request we want to grab.
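For example, a quick sanity check of the pattern (a toy snippet of my own, not from the original post):

url_temp = 'http://temp.163.com/special/00804KVA/cm_guonei_0{}.js'
print(url_temp.format(3))  # -> http://temp.163.com/special/00804KVA/cm_guonei_03.js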

Then you only need three Python libraries:

  • requests

  • json

  • BeautifulSoup

The requests library handles network requests; in plain terms, it simulates a browser accessing a resource.

Since what we are collecting comes from an API whose format is JSON, we use the json library to parse it. BeautifulSoup is used to parse HTML documents and makes it easy to pull out the content of a specified div.
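As a tiny illustration of each library's role (a toy snippet of my own, not from the original post):

import json
import requests
from bs4 import BeautifulSoup

resp = requests.get('http://example.com')   # simulate a browser request
data = json.loads('{"title": "demo"}')      # parse a JSON string
soup = BeautifulSoup('<div class="post_text">hello</div>', 'html.parser')
print(resp.status_code, data['title'], soup.find('div', class_='post_text').get_text())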

Now let's start writing the crawler:

The first step is to import the three packages above:

import json
import requests
from bs4 import BeautifulSoup

 

Then we define a method that fetches the data for a given number of pages:

def get_page(page):
    url_temp = 'http://temp.163.com/special/00804KVA/cm_guonei_0{}.js'
    return_list = []
    for i in range(page):
        url = url_temp.format(i)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        content = response.text  # fetch the response body
        _content = formatContent(content)  # turn the response into a JSON string
        result = json.loads(_content)
        return_list.append(result)
    return return_list
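
The formatContent helper called above is never defined in the article. A minimal sketch of what it might look like, assuming the cm_guonei_*.js endpoints return JSONP of the form data_callback([...]):

import re

def formatContent(content):
    # Assumption: the endpoint wraps its payload in a JSONP callback such as
    # data_callback([...]); keep only the outermost [...] so json.loads can parse it.
    match = re.search(r'\[.*\]', content, re.S)
    return match.group(0) if match else '[]'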

 

In this way we obtain a list with the contents of each page:

(screenshot in the original post)

Analyzing the returned data, shown in the figure below, we can see that it already contains the title, the publish time, and the URL of the news content page that we need to grab.

(screenshot in the original post)

Now that we have the URL of each content page, we can start crawling the news body text.

Before crawling, first analyze the HTML of a body page to locate where the text, the author, and the source sit in the document.

In the document, the source of the article is in the a tag with id="ne_article_source", the author is in the span tag with class="ep-editor", and the body text is in the div tag with class="post_text".

The code that fetches these three elements:

def get_content(url):
    source = ''
    author = ''
    body = ''
    resp = requests.get(url)
    if resp.status_code == 200:
        body = resp.text
        bs4 = BeautifulSoup(body, 'html.parser')  # parse the article page
        source = bs4.find('a', id='ne_article_source').get_text()
        author = bs4.find('span', class_='ep-editor').get_text()
        body = bs4.find('div', class_='post_text').get_text()
    return source, author, body

At this point, all the data we want to crawl has been collected.

The next step, of course, is to save it; for convenience I simply save it as text. The final result looks like this:

(screenshot in the original post)

Each record is a JSON-formatted string of the form "title": ['date', 'url', 'source', 'author', 'text'].
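The saving code itself is not shown in the article. A minimal sketch that ties get_page and get_content together and writes one record per article in the format above; the listing field names (title, time, docurl) are my assumption about the JSON structure, which the article only shows in a screenshot:

def crawl_and_save(pages, out_path='news.txt'):
    # Hypothetical driver, not from the original article.
    with open(out_path, 'w', encoding='utf-8') as f:
        for page in get_page(pages):
            for item in page:
                url = item.get('docurl', '')  # assumed field name for the article URL
                if not url:
                    continue
                source, author, text = get_content(url)
                record = {item.get('title', ''): [item.get('time', ''), url, source, author, text]}
                f.write(json.dumps(record, ensure_ascii=False) + '\n')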

Note that the current implementation is fully synchronous and linear, so collection will be quite slow. The main delay is in network I/O, so it can be upgraded to asynchronous I/O and collected asynchronously.
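The article stops at that suggestion. As one illustration (my own, not from the original post), a thread pool is a simple alternative to fully asynchronous I/O that already hides most of the network latency:

from concurrent.futures import ThreadPoolExecutor

def get_contents_concurrently(urls, max_workers=8):
    # Fetch many article pages in parallel threads instead of a plain loop;
    # each worker still calls the synchronous get_content defined above.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(get_content, urls))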


Origin www.cnblogs.com/Qqun821460695/p/12001703.html