Using lxml to parse HTML pages and acquire data in bulk

When we need to get data from the Web, we can use an HTML-parsing library to obtain it quickly. A variety of third-party libraries are available for parsing HTML pages, for example lxml, Beautiful Soup, and so on. Below is an example of using lxml to crawl the data we need from the Web.

I want to get the information for every Beijing bus line from a Beijing public transit website, in preparation for subsequent processing.

First, import requests, which we use to request a web page and obtain its raw HTML:

import requests

Then import the etree module of lxml:

import lxml.etree

We start crawling from the index page that lists all bus lines, and collect the URL corresponding to each line.

lxml.etree splits the layers of an HTML page according to its tags, forming a tree structure that grows downward. By viewing the page source we find which tags hold the data we want, then use the xpath function to extract it from the corresponding tags.
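
As a minimal sketch of how this works (using a made-up HTML fragment, not the real page), an xpath expression addresses nodes in that tag tree:

import lxml.etree

# A made-up fragment, only to illustrate how xpath addresses the tag tree
snippet = "<div class='list'><ul><li><a href='/line/1'>Line 1</a></li></ul></div>"
doc = lxml.etree.HTML(snippet)
print(doc.xpath("//div[@class='list']/ul/li/a/text()"))  # ['Line 1']
print(doc.xpath("//div[@class='list']/ul/li/a/@href"))   # ['/line/1']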

def get_all_line():
    # Fetch the index page that lists every bus line
    url = 'http://beijing.gongjiao.com/lines_all.html'
    text = requests.get(url).text
    doc = lxml.etree.HTML(text)
    all_lines = doc.xpath("//div[@class='list']/ul/li")
    print(len(all_lines))
    # Save each line as 'name$$url', one per row
    with open('./data/allline.txt', 'a') as f:
        for line in all_lines:
            line_name = line.xpath("./a/text()")[0].strip()
            line_url = line.xpath("./a/@href")[0]
            f.write(line_name + '$$' + line_url + '\n')

This gives us the URL for every line, one name$$url pair per row. On this basis we can crawl the relevant data from each line's page.

Before crawling the line data, we create a dictionary for storing the different fields, which keeps them easy to manage. I created a dictionary with 13 fields:

df_dict = {
'line_name': [], 'line_url': [], 'line_start': [], 'line_stop': [],
'line_op_time': [], 'line_interval': [], 'line_price': [], 'line_company': [],
'line_up_times': [], 'line_station_up': [], 'line_station_up_len': [],
'line_station_down': [], 'line_station_down_len': [] 
}

 

Next we read each URL back from the data file we just generated; the following function implements fetching and parsing the data for a single line.

def getbuslin(line):
    # Each row of allline.txt has the form 'name$$url'
    line_name = line[:line.find('$$')]
    line_url = line[line.find('$$')+2:].strip()  # strip the trailing newline
    text = requests.get(line_url).text
    doc = lxml.etree.HTML(text)
    # The page header block holds the line's summary fields
    infos = doc.xpath("//div[@class='gj01_line_header clearfix']")
    for info in infos:
        start_stop = info.xpath("./dl/dt/a/text()")
        op_times = info.xpath("./dl/dd[1]/b/text()")
        interval = info.xpath("./dl/dd[2]/text()")
        price = info.xpath("./dl/dd[3]/text()")
        company = info.xpath("./dl/dd[4]/text()")
        up_times = info.xpath("./dl/dd[5]/text()")

        # All stops in the 'up' direction
        all_stations_up = doc.xpath('//ul[@class="gj01_line_img JS-up clearfix"]')
        for station in all_stations_up:
            station_up_name = station.xpath('./li/a/text()')
            df_dict['line_station_up'].append(station_up_name)
            df_dict['line_station_up_len'].append(len(station_up_name))

        # Some lines have no 'down' direction; record empty values for them
        all_stations_down = doc.xpath('//ul[@class="gj01_line_img JS-down clearfix"]')
        if len(all_stations_down) == 0:
            df_dict['line_station_down'].append('')
            df_dict['line_station_down_len'].append(0)
        for station in all_stations_down:
            station_down_name = station.xpath('./li/a/text()')
            df_dict['line_station_down'].append(station_down_name)
            df_dict['line_station_down_len'].append(len(station_down_name))

        df_dict['line_name'].append(line_name)
        df_dict['line_url'].append(line_url)
        df_dict['line_start'].append(start_stop[0])
        df_dict['line_stop'].append(start_stop[1])
        if len(op_times) == 0:
            op_times.append('')  # fall back to an empty value when the page omits hours
        df_dict['line_op_time'].append(op_times[0])
        # Each remaining value carries a leading field label; slice it off
        df_dict['line_interval'].append(interval[0][5:])
        df_dict['line_company'].append(company[0][5:])
        df_dict['line_price'].append(price[0][5:])
        df_dict['line_up_times'].append(up_times[0][5:])
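
Reading the saved file back and feeding each row to this function ties the two steps together. A minimal driver sketch, assuming the name$$url layout written by get_all_line:

# Feed every saved 'name$$url' row to getbuslin
with open('./data/allline.txt', encoding='utf-8') as f:
    for row in f:
        if row.strip():  # skip blank rows
            getbuslin(row)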

Each value in this dictionary is just a list, so the fields can be manipulated with ordinary list operations.
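
One practical consequence: pandas will later require every column list to have the same length, so a quick sanity check (a sketch, not part of the original code) can catch a missed append:

# Every field list must end up the same length, or the DataFrame
# construction below will fail
lengths = {k: len(v) for k, v in df_dict.items()}
assert len(set(lengths.values())) == 1, lengths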

For each bus line I extracted the line name, starting station, terminal station, hours of operation, departure interval, fare, operating company, departure times, and all stops in both the up and down directions.

After all the data has been stored in the dictionary we created, we save it as a CSV file; this requires the pandas library.

import pandas as pd

df = pd.DataFrame(df_dict)
name = './data/bus_lines'  # arbitrary output path; pick any file name
df.to_csv(name + '.csv', encoding='utf-8', index=None)
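
To confirm the export worked, the file can be read straight back (a quick check, reusing the file name assumed above):

# Read the file back and confirm the shape matches what we collected
check = pd.read_csv(name + '.csv', encoding='utf-8')
print(check.shape)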

At this point the parsed page data has been saved locally for subsequent processing.

Bus line data is fairly static, so it does not need real-time updates, and the amount of data is relatively small; even so, the whole program takes several minutes to run.

I have not applied any multi-process treatment here; as a follow-up, the runtime could be reduced by parallelizing the crawl to take full advantage of the machine.
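
A note on that follow-up: because getbuslin appends to a global df_dict, separate processes would each get their own copy of it and the results would not be aggregated, so a plain process pool needs restructuring first. Since the crawl is network-bound, a thread pool is a simpler option that reuses getbuslin unchanged; a sketch under those assumptions:

from multiprocessing.pool import ThreadPool

# The threads share df_dict directly, unlike separate processes
with open('./data/allline.txt', encoding='utf-8') as f:
    rows = [row for row in f if row.strip()]

with ThreadPool(8) as pool:  # 8 workers is an arbitrary choice
    pool.map(getbuslin, rows)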

 
