Author: Amauri
This crawler is entry-level, so experienced scrapers can safely skip it. We will crawl NetEase News and collect each article's title, author, source, publish time, and body text.
First, open the 163.com site and pick a category; here I chose domestic news. Right-click and view the page source: the news list is nowhere to be found in it. This tells us the page loads its data asynchronously, i.e. it fetches the data through an API.
To confirm this, open the Chrome DevTools console with F12, click the Network tab, and keep scrolling the page. Requests like "...special/00804KVA/cm_guonei_03.js?..." appear on the right, and opening the Response view shows exactly the API data we are looking for.
These interface addresses follow an obvious pattern: "cm_guonei_03.js", "cm_guonei_04.js", and so on:
http://temp.163.com/special/0...*).js
That address pattern is the request we want to make.
Then we only need three Python libraries:
- requests
- json
- BeautifulSoup
The requests library handles the network requests; put plainly, it simulates a browser fetching a resource. Since what we collect comes from an API whose format is JSON, we use the json library to parse it. BeautifulSoup parses HTML documents and makes it easy to pull out the content of a specified div.
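As a quick illustration of the json half of that, json.loads turns JSON text into Python lists and dicts. The payload below is made up for the example, in the same spirit as the news-list API response:

```python
import json

# A made-up payload resembling one item of the news-list API response.
raw = '[{"title": "Example headline", "docurl": "http://news.163.com/example.html"}]'
data = json.loads(raw)
print(data[0]['title'])  # Example headline
```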
Now let's start writing the crawler. The first step is to import the three packages:
```python
import json
import requests
from bs4 import BeautifulSoup
```
Then we define a method that fetches the data for the specified number of pages:
```python
def get_page(page):
    url_temp = 'http://temp.163.com/special/00804KVA/cm_guonei_0{}.js'
    return_list = []
    for i in range(page):
        url = url_temp.format(i)
        response = requests.get(url)
        if response.status_code != 200:
            continue
        content = response.text  # fetch the response body
        _content = formatContent(content)  # format the json string
        result = json.loads(_content)
        return_list.append(result)
    return return_list
```
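Note that get_page calls a formatContent helper the article never shows. Here is a minimal sketch of what it likely does, assuming the .js endpoint returns a JSONP-style wrapper (e.g. `data_callback([...])`) around the JSON array; check the actual response in DevTools before relying on this:

```python
import re

def formatContent(content):
    # Assumed behavior: the endpoint wraps its JSON array in a JavaScript
    # callback such as data_callback([...]); keep only the bracketed
    # array so json.loads can parse it.
    match = re.search(r'\[.*\]', content, re.S)
    return match.group(0) if match else content
```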
This returns a list holding the contents of each page:
Analyzing that data (shown in the figure below) reveals that it contains exactly what we need: the title of each article and the URL of its content page.
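For illustration, here is how one page of that parsed JSON could be turned into (title, url) pairs. The field names 'title' and 'docurl' are my assumptions about the response's shape; verify them in the Network tab's Response view and adjust as needed:

```python
def extract_articles(page_data):
    # 'title' and 'docurl' are assumed field names; confirm the real
    # keys in the API response before relying on them.
    return [(item.get('title'), item.get('docurl')) for item in page_data]

# A made-up page of parsed JSON for demonstration.
page = [{'title': 'Some headline', 'docurl': 'http://news.163.com/a.html'}]
pairs = extract_articles(page)
```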
Now that we have the URL of each article page, we can start crawling the news body. Before writing that code, let's first analyze the article page's HTML and locate the body text, author, and source in the document.
Inspecting the document, the three pieces sit in the following elements:
- Source: the <a> tag with id="ne_article_source"
- Author: the <span> tag with class="ep-editor"
- Body: the <div> tag with class="post_text"
Here is the code that collects these three elements:
```python
def get_content(url):
    source = ''
    author = ''
    body = ''
    resp = requests.get(url)
    if resp.status_code == 200:
        body = resp.text
        bs4 = BeautifulSoup(body, 'html.parser')  # name a parser explicitly
        source = bs4.find('a', id='ne_article_source').get_text()
        author = bs4.find('span', class_='ep-editor').get_text()
        body = bs4.find('div', class_='post_text').get_text()
    return source, author, body
```
At this point, all the data we set out to crawl has been collected. All that's left is to save it; for simplicity I save it directly as text. The final result is a JSON-format string per article, shaped like:
"title": ['date', 'url', 'source', 'author', 'text']
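A minimal sketch of that saving step, assuming each article is held as a title mapped to its [date, url, source, author, text] list; the filename and helper name are my own choices, not from the article:

```python
import json

def save_results(results, path='news.txt'):
    # Write one JSON object per line: {"title": [date, url, source, author, text]}
    with open(path, 'w', encoding='utf-8') as f:
        for title, record in results.items():
            f.write(json.dumps({title: record}, ensure_ascii=False) + '\n')
```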