Python web scraping with BeautifulSoup (2): collecting stock news articles

 

To analyze stocks in Python, the first step is data collection. The following example briefly demonstrates how to crawl news articles.

Data collection

1. Open a chosen financial news site, for example the Securities Times market page: http://stock.stcn.com/dapan/index.shtml

2. Use the Chrome browser (recommended) to analyze the structure of the page and identify the article list to be extracted.

 

Analysis gives a path roughly like:

<head> > <body> > <div> > <div> > ... > <li> > <a>

The path here is deep and requires locating elements at several levels. find returns only the first match, while find_all returns every match that can be queried; find_all is the more reasonable choice here.
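To make the difference concrete, here is a minimal, self-contained sketch. The HTML snippet is hypothetical (only the class name news_list2 is taken from the page above), and the stdlib html.parser is used so the example runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet mimicking the article-list structure.
html = """
<ul class="news_list2">
  <li><a href="/a1.shtml">First headline</a></li>
  <li><a href="/a2.shtml">Second headline</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

first = soup.find('a')          # only the first match, a single Tag
all_links = soup.find_all('a')  # every match, returned as a list

print(first.string)    # First headline
print(len(all_links))  # 2
```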

The <div> elements are nested. To extract content from the lower layers there are several approaches; the ones commonly used in this example:

You can use name='ul', attrs={'class': "news_list2"}, or name='div', attrs={'class': "content clearfix"}.

Suppose tag is the search result for the first layer.

Then locate the second-layer article by calling find on tag, e.g. sub_tag = tag.find(...).
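The layered positioning described above can be sketched as follows. The HTML is a hypothetical stand-in for the live page; only the two class names come from the article:

```python
from bs4 import BeautifulSoup

# Hypothetical nested structure: an outer <div class="content clearfix">
# wrapping the <ul class="news_list2"> article list.
html = """
<div class="content clearfix">
  <ul class="news_list2">
    <li><a href="/a1.shtml">Headline one</a></li>
    <li><a href="/a2.shtml">Headline two</a></li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# First layer: locate the wrapper div (exact class-string match).
tag = soup.find(name='div', attrs={'class': 'content clearfix'})
# Second layer: search only inside that tag.
sub_tag = tag.find(name='ul', attrs={'class': 'news_list2'})

titles = [a.string for a in sub_tag.find_all('a')]
print(titles)
```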

If the page is simple, extracting the articles can be as easy as:

soup.find_all("a")  # find all <a> nodes in the page

But the result is often unsatisfactory, because ads or recommended-article lists frequently show up as well. In that case you need to filter by conditions, or match with regular expressions.
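One way to do that filtering, sketched on a hypothetical snippet that mixes article links with an ad link (the /dapan/ path pattern and ad URL are assumptions for illustration), is to pass a compiled regular expression to find_all:

```python
import re
from bs4 import BeautifulSoup

# Hypothetical page fragment: real article links plus a sponsored link.
html = """
<ul class="news_list2">
  <li><a href="/dapan/202003/1.shtml">Market report</a></li>
  <li><a href="http://ads.example.com/promo">Sponsored</a></li>
  <li><a href="/dapan/202003/2.shtml">Index review</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Keep only <a> tags whose href matches the article-path pattern.
articles = soup.find_all('a', href=re.compile(r'^/dapan/'))
print([a.string for a in articles])
```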



Sample code

1. Create a new .py file and import BeautifulSoup:

from bs4 import BeautifulSoup
import requests

url = 'http://stock.stcn.com/dapan/index.shtml'
wb_data = requests.get(url)  # fetch the page with a plain GET request
soup = BeautifulSoup(wb_data.content, 'lxml')

 

2. Locate the article nodes with a nested for loop

This style has several advantages:

 - It avoids extra temporary variables and added if statements, saving memory, even when the <div> elements are nested several layers deep.

 - Using two nested queries for positioning also saves iteration space.

 - Note that although the path is unique at this point, find_all is still used rather than find, because the single object find returns is less convenient for the subsequent nested loop.

for tag in soup.find_all(name='ul',attrs={'class':"news_list2"}):
    for sub in tag.find_all("a"):
        print(sub)
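The loop above prints the whole <a> tags. As a sketch of a typical next step, the same nested loop can pull out each link target and its text instead; the snippet below is a hypothetical stand-in for the live page:

```python
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched page.
html = """
<ul class="news_list2">
  <li><a href="/a1.shtml">Headline one</a></li>
  <li><a href="/a2.shtml">Headline two</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

rows = []
for tag in soup.find_all(name='ul', attrs={'class': 'news_list2'}):
    for sub in tag.find_all('a'):
        # .get reads the attribute; .string reads the tag's text.
        rows.append((sub.get('href'), sub.string))
print(rows)
```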

 

3. Display the results:

[screenshot: the printed <a> tags]

Then filter the data, if only the article titles should be kept:

for tag in soup.find_all(name='ul', attrs={'class': "news_list2"}):
    for sub in tag.find_all("a"):
        print(sub.string)  # sub is still a Tag object; .string extracts its text content
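A natural follow-up is to save the filtered titles, for example to a JSON file. This is a sketch on a hypothetical snippet standing in for the live page; the filename titles.json is an assumption:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical snippet standing in for the fetched page.
html = """
<ul class="news_list2">
  <li><a href="/a1.shtml">Headline one</a></li>
  <li><a href="/a2.shtml">Headline two</a></li>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# str() converts NavigableString objects to plain strings for JSON.
titles = [str(sub.string)
          for tag in soup.find_all(name='ul', attrs={'class': 'news_list2'})
          for sub in tag.find_all('a')]

with open('titles.json', 'w', encoding='utf-8') as f:
    json.dump(titles, f, ensure_ascii=False)
```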

    

 


Origin www.cnblogs.com/nerocm/p/12501972.html