When we need to obtain data from the Web, we can use an HTML-parsing library to extract it quickly. A variety of third-party libraries can parse HTML pages, such as lxml and Beautiful Soup. Below is an example of using lxml to crawl the statistical data we need from the Web.
I want to get the information for all the bus lines in Beijing from a public-transit website, in preparation for subsequent processing.
First, import requests, which is used to request the web page and obtain the raw HTML:
import requests
Then import the etree module of lxml:
import lxml.etree
We take the index page that lists all the bus lines as our starting point, and crawl it to obtain the URL of every line.
lxml.etree parses an HTML page layer by layer according to its tags, gradually forming a tree structure. By viewing the page source to find which tags hold the data we want, we can then use the xpath function to extract it.
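As a minimal illustration of how this works (the HTML snippet here is invented for demonstration; it mimics the structure of the real index page):

import lxml.etree

doc = lxml.etree.HTML('<div class="list"><ul><li><a href="/line1">Line 1</a></li></ul></div>')
print(doc.xpath("//div[@class='list']/ul/li/a/text()"))  # ['Line 1']
print(doc.xpath("//div[@class='list']/ul/li/a/@href"))   # ['/line1']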
def get_all_line():
    url = 'http://beijing.gongjiao.com/lines_all.html'
    text = requests.get(url).text
    doc = lxml.etree.HTML(text)
    # every bus line sits in an <li> under the div with class 'list'
    all_lines = doc.xpath("//div[@class='list']/ul/li")
    print(len(all_lines))
    f = open('./data/allline.txt', 'a')
    for line in all_lines:
        line_name = line.xpath('./a/text()')[0].strip()
        line_url = line.xpath('./a/@href')[0]
        # one record per line, in the form 'name$$url'
        f.write(line_name + '$$' + line_url + '\n')
    f.close()
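A minimal sketch of calling this function; the os.makedirs call is my addition, since the ./data directory must exist before the file can be opened:

import os

os.makedirs('./data', exist_ok=True)  # make sure the output directory exists
get_all_line()                        # writes one 'name$$url' pair per line to ./data/allline.txt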
This gives us the URL for every line; on this basis we can then crawl the relevant data from each line's page.
Before crawling the line data, we create a dictionary to store the different fields, which keeps the data easy to manage. Here I created a dictionary with 13 fields:
df_dict = {
    'line_name': [],
    'line_url': [],
    'line_start': [],
    'line_stop': [],
    'line_op_time': [],
    'line_interval': [],
    'line_price': [],
    'line_company': [],
    'line_up_times': [],
    'line_station_up': [],
    'line_station_up_len': [],
    'line_station_down': [],
    'line_station_down_len': []
}
Next we read the URLs back from the file we just generated; the following function implements fetching and parsing the data for a single line:
def getbuslin(line):
    # each input line has the form 'line_name$$line_url'
    line_name = line[:line.find('$$')]
    line_url = line[line.find('$$') + 2:]
    text = requests.get(line_url).text
    doc = lxml.etree.HTML(text)
    infos = doc.xpath("//div[@class='gj01_line_header clearfix']")
    for info in infos:
        start_stop = info.xpath('./dl/dt/a/text()')
        op_times = info.xpath('./dl/dd[1]/b/text()')
        interval = info.xpath('./dl/dd[2]/text()')
        price = info.xpath('./dl/dd[3]/text()')
        company = info.xpath('./dl/dd[4]/text()')
        up_times = info.xpath('./dl/dd[5]/text()')
        # upbound stations
        all_stations_up = doc.xpath('//ul[@class="gj01_line_img JS-up clearfix"]')
        for station in all_stations_up:
            station_up_name = station.xpath('./li/a/text()')
            df_dict['line_station_up'].append(station_up_name)
            df_dict['line_station_up_len'].append(len(station_up_name))
        # downbound stations; some loop lines have none
        all_stations_down = doc.xpath('//ul[@class="gj01_line_img JS-down clearfix"]')
        if len(all_stations_down) == 0:
            df_dict['line_station_down'].append('')
            df_dict['line_station_down_len'].append(0)
        for station in all_stations_down:
            station_down_name = station.xpath('./li/a/text()')
            df_dict['line_station_down'].append(station_down_name)
            df_dict['line_station_down_len'].append(len(station_down_name))
        df_dict['line_name'].append(line_name)
        df_dict['line_url'].append(line_url)
        df_dict['line_start'].append(start_stop[0])
        df_dict['line_stop'].append(start_stop[1])
        if len(op_times) == 0:
            op_times = ['']  # fall back to an empty string when the page omits operating hours
        df_dict['line_op_time'].append(op_times[0])
        # the first five characters are the Chinese field label, so slice them off
        df_dict['line_interval'].append(interval[0][5:])
        df_dict['line_company'].append(company[0][5:])
        df_dict['line_price'].append(price[0][5:])
        df_dict['line_up_times'].append(up_times[0][5:])
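The loop that feeds each record of allline.txt to this function is not shown above; a minimal sketch might look like this:

with open('./data/allline.txt', encoding='utf-8') as f:
    for line in f:
        line = line.strip()  # drop the trailing newline so line_url stays clean
        if line:
            getbuslin(line)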
Each value in this dictionary is a list, so the usual list operations, such as append, apply to every field.
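For example (the value here is made up purely for illustration):

df_dict['line_name'].append('Line 1')  # hypothetical value
print(len(df_dict['line_name']))       # number of records collected so far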
From each line's page I extracted the line name, starting and terminal stations, operating hours, departure interval, fare, operating company, departure times, and all of the upbound and downbound stations.
After all the data has been stored in the dictionary, we save it as a CSV file, which requires the pandas library:
import pandas as pd

df = pd.DataFrame(df_dict)
# write all collected records to a CSV file (the filename is arbitrary)
df.to_csv('./data/bus_lines.csv', encoding='utf-8', index=False)
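As a quick sanity check, not part of the original pipeline, the file can be read back to confirm the export:

df_check = pd.read_csv('./data/bus_lines.csv')
print(df_check.shape)   # (number of lines crawled, 13 fields)
print(df_check.head())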
At this point, the parsed page data has been saved locally for subsequent processing.
Since bus line data is relatively static, it does not need real-time updates, and the amount of data is fairly small; even so, the entire program takes several minutes to run.
I have not used multiprocessing here; as a follow-up optimization, multiprocessing could take full advantage of the machine's performance and reduce the time required, as sketched below.
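A rough sketch of that optimization, using multiprocessing.Pool. Note that getbuslin would have to be refactored to return one record per line instead of appending to the shared df_dict, because worker processes do not share memory; the crawl_one function below is a simplified stand-in for that refactor, not the full parser:

from multiprocessing import Pool
import pandas as pd

def crawl_one(line):
    # simplified stand-in: a full version would fetch and parse the page
    # the way getbuslin does, then return all 13 fields as one dict
    name, url = line.split('$$', 1)
    return {'line_name': name, 'line_url': url}

if __name__ == '__main__':
    with open('./data/allline.txt', encoding='utf-8') as f:
        lines = [l.strip() for l in f if l.strip()]
    with Pool(4) as pool:  # 4 worker processes; tune to your CPU
        records = pool.map(crawl_one, lines)
    pd.DataFrame(records).to_csv('./data/bus_lines.csv', encoding='utf-8', index=False)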