Collecting stock data in Python: the first step is data collection. The following example briefly demonstrates crawling.
Data collection
1. Open the designated financial information site, e.g. the China Securities Times: http://stock.stcn.com/dapan/index.shtml
2. Use the Chrome browser (recommended) to analyze the structure of the site and identify the article list to be extracted
Analysis of the available path:
<head> → <body> → <div> → <div> → ... → <li> → <a>
The path here is deep and requires positioning at multiple levels. Note that find returns only the first match, while find_all returns every match, so find_all is the more reasonable choice here.
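As a quick illustration of the difference between find and find_all (using a made-up HTML snippet, not the actual page):

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for the real page structure
html = '<ul class="news_list2"><li><a>first</a></li><li><a>second</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # only the first matching node
all_links = soup.find_all('a')  # a list of every matching node

print(first.string)    # first
print(len(all_links))  # 2
```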
The divs are nested, so there are several common ways to extract the contents of the inner layers. In this example:
You can use name='ul', attrs={'class': "news_list2"}, or name='div', attrs={'class': "content clearfix"}.
Suppose tag is the search result for the first layer.
Then locate the second-layer article by calling find on tag to get sub_tag.
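The two-layer positioning described above can be sketched like this (the HTML snippet is a simplified stand-in for the real page):

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the nested structure on the real page
html = '''
<div class="content clearfix">
  <ul class="news_list2">
    <li><a href="/a1">Article 1</a></li>
    <li><a href="/a2">Article 2</a></li>
  </ul>
</div>
'''
soup = BeautifulSoup(html, 'html.parser')

# First layer: locate the enclosing <ul>
tag = soup.find(name='ul', attrs={'class': 'news_list2'})
# Second layer: search for the article links inside tag only
sub_tags = tag.find_all('a')
print([t.string for t in sub_tags])  # ['Article 1', 'Article 2']
```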
If the page data is simple, article extraction can be as straightforward as:
soup.find_all("a")  # find all <a> nodes in the page
But the results are often unsatisfactory, because advertisements or recommended-article lists frequently show up as well. You either need to filter the results by condition or match them with regular expressions.
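Here is a sketch of both approaches, condition-based filtering and regex matching (the class name, hrefs, and URL pattern are invented for illustration):

```python
import re
from bs4 import BeautifulSoup

# Invented snippet mixing a real article link with an ad link
html = '''
<ul class="news_list2">
  <li><a href="/news/123.shtml">Real article</a></li>
  <li><a href="/ad/banner.shtml">Advertisement</a></li>
</ul>
'''
soup = BeautifulSoup(html, 'html.parser')

# Condition: keep only links whose href starts with /news/
news_links = [a for a in soup.find_all('a')
              if a.get('href', '').startswith('/news/')]

# Regex: find_all also accepts a compiled pattern for attribute matching
news_links_re = soup.find_all('a', href=re.compile(r'^/news/'))

print([a.string for a in news_links])           # ['Real article']
print([a['href'] for a in news_links_re])       # ['/news/123.shtml']
```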
Sample code
1. Create a new .py file and import BeautifulSoup
from bs4 import BeautifulSoup
import requests
import time
import json

url = 'http://stock.stcn.com/dapan/index.shtml'
wb_data = requests.get(url)  # a plain GET is enough to fetch the page
soup = BeautifulSoup(wb_data.content, 'lxml')
2. Use a for loop to locate the article nodes
This approach has several advantages:
- It avoids extra temporary variables and additional if checks, saving memory even when the <div>s are nested several levels deep
- Using two nested loops for positioning also saves space compared with building intermediate result lists
- Note: although there is only one matching path here, find_all is used rather than find, because the single result returned by find is less convenient to iterate over in the subsequent nested loop
for tag in soup.find_all(name='ul', attrs={'class': "news_list2"}):
    for sub in tag.find_all("a"):
        print(sub)
3. Display the results:
Then filter the data if only the article titles should be kept:
for tag in soup.find_all(name='ul', attrs={'class': "news_list2"}):
    for sub in tag.find_all("a"):
        print(sub.string)  # sub is still a Tag object; .string extracts its text content
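One caveat worth noting (a small illustrative aside, not from the original): .string returns None when a tag contains more than one child, so get_text() is the safer choice for links with mixed content:

```python
from bs4 import BeautifulSoup

# A link whose text is split across a nested tag and a plain string
soup = BeautifulSoup('<a><b>Hot</b> news title</a>', 'html.parser')
a = soup.find('a')

print(a.string)      # None: the <a> has more than one child
print(a.get_text())  # 'Hot news title': concatenates all text descendants
```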