Storing data from a multi-page Selenium crawl

Target site: https://china.nba.com/playerindex/

Goal: crawl the information for all players. Analyzing the target site shows that the URL does not change when you switch pages; the page content is swapped in place without a reload.

Use Selenium to simulate the browser and crawl the data:

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from lxml import etree
import pandas as pd
import json
import time

browser = webdriver.Chrome()
try:
    browser.get("https://china.nba.com/playerindex/")
    wait = WebDriverWait(browser, 10)
    # Analyzing the page: players are grouped under the letters A to Z, and
    # only the last index in the xpath changes from letter to letter, so
    # splice together the xpath for each letter's tab, click it, and grab the info
    for i in range(1, 27):
        path = ('//*[@id="main-container"]/div/div[2]/div[2]/section/div/div'
                '/div/div/div[3]/div[2]/div[2]/' + 'div[' + str(i) + ']')
        # find the element by xpath and click it
        browser.find_element_by_xpath(path).click()
        # the first table in the page source is the player information table
        df = pd.read_html(browser.page_source)[0]
        # export this letter's table to its own csv file
        df.to_csv('std' + str(i) + '.csv', index=False)
        time.sleep(10)
finally:
    browser.close()
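Incidentally, the script creates a wait object but never uses it; the fixed time.sleep(10) could instead become an explicit wait. A minimal sketch, assuming the player list is rendered as a plain HTML table (the By.TAG_NAME locator is an assumption, not taken from the live page):

from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

# Sketch: wait until the player table is present instead of sleeping a
# fixed 10 seconds; the By.TAG_NAME locator is an assumed placeholder.
wait.until(EC.presence_of_element_located((By.TAG_NAME, 'table')))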

Since I couldn't find a way to collect every page into a single csv file directly, I used the simplest method: generate the 26 csv files and then merge them into one.

import pandas as pd
import os

# collect the 26 per-letter csv file names
la = []
for i in range(1, 27):
    path = 'std' + str(i) + '.csv'
    la.append(path)

# append each file in turn to one combined csv
for inputfile in la:
    pf = pd.read_csv(inputfile, header=None)
    pf.to_csv('all.csv', mode='a', index=False, header=False)
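A hedged alternative I did not use here: because each per-letter file was written with a header row, reading it back with header=None drags 26 header rows into all.csv. Concatenating with pandas would keep a single header instead; a minimal sketch:

import pandas as pd

# Sketch: read each per-letter csv with its header and concatenate,
# so the combined file ends up with exactly one header row.
files = ['std' + str(i) + '.csv' for i in range(1, 27)]
combined = pd.concat((pd.read_csv(f) for f in files), ignore_index=True)
combined.to_csv('all.csv', index=False)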

You'll need to adjust the paths here for your own setup. This produces one large csv file, which can then be imported into the database; and of course I don't mean importing it by hand.

The xlrd module supports importing xls tables into a database, so my plan is to convert the csv file into an xls file.

from io import StringIO
import csv
import pandas as pd

c_path = r"C:\Users\23990\Desktop\all.csv"
x_path = r"C:\Users\23990\Desktop\xxx.xls"   # the xls file is created automatically when to_excel is called


def csv_to_xls(csv_path, xls_path):
    # read the csv as plain text and feed it to the csv reader
    with open(csv_path, 'r', encoding='gb18030', errors='ignore') as f:
        data = f.read()
    data_file = StringIO(data)
    print(data_file)
    csv_reader = csv.reader(data_file)
    list_csv = []
    for row in csv_reader:
        list_csv.append(row)
    df_csv = pd.DataFrame(list_csv).applymap(str)
    '''
    This part is not needed for the csv-to-xls conversion; it filters the csv file and writes it back:
    df_csv = df_csv[(df_csv[4] == '') | (df_csv[4] == 'name')]  # filter on rows whose column 4 is empty or 'name'
    df_csv.to_csv(csv_path, index=0, header=0, encoding='gb18030')  # write the csv file back
    '''
    writer = pd.ExcelWriter(xls_path)
    # write to Excel
    df_csv.to_excel(
        excel_writer=writer,
        index=False,
        header=False
    )
    writer.save()


csv_to_xls(c_path, x_path)
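For reference, the same conversion can be sketched directly in pandas. A hedged note: writing .xls goes through the xlwt engine, which newer pandas versions dropped (they also replaced ExcelWriter.save() with close()), so this sketch assumes an older pandas with xlwt installed:

import pandas as pd

# Sketch: csv straight to xls; assumes an older pandas where the xlwt
# engine still handles .xls output.
df = pd.read_csv(r"C:\Users\23990\Desktop\all.csv", encoding='gb18030', header=None)
df.to_excel(r"C:\Users\23990\Desktop\xxx.xls", index=False, header=False)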

Finally, import it into the database. Perfect. This method is fully operational in a Windows environment; I haven't tried it on a Mac. When I find a new approach I'll write about it.
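As a minimal sketch of that last step, assuming a MySQL database and a pre-created players table (the connection details, table name, and schema below are illustrative assumptions, not part of the original workflow):

import xlrd
import pymysql

# Sketch: read the xls with xlrd and insert every row into the database.
# host/user/password/database and the players table are assumptions.
book = xlrd.open_workbook(r"C:\Users\23990\Desktop\xxx.xls")
sheet = book.sheet_by_index(0)

conn = pymysql.connect(host='localhost', user='root',
                       password='root', database='nba', charset='utf8mb4')
cur = conn.cursor()
for r in range(sheet.nrows):
    row = sheet.row_values(r)
    placeholders = ','.join(['%s'] * len(row))  # one %s per column
    cur.execute('INSERT INTO players VALUES (' + placeholders + ')', row)
conn.commit()
cur.close()
conn.close()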

 
