Target site: https://china.nba.com/playerindex/
Goal: crawl the information for all players. Analysis of the target site shows that the URL does not change when switching pages, and the page source stays the same (the content is loaded dynamically).
Use Selenium to simulate a browser and crawl the data:
```python
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
import pandas as pd
import time

browser = webdriver.Chrome()
try:
    browser.get("https://china.nba.com/playerindex/")
    wait = WebDriverWait(browser, 10)
    # Page analysis: the name index runs from A to Z, and only the last
    # div index in the XPath changes for each letter, so we splice the
    # corresponding XPath for each letter, click it, and grab the table.
    for i in range(1, 27):
        path = ('//*[@id="main-container"]/div/div[2]/div[2]/section'
                '/div/div/div/div/div[3]/div[2]/div[2]/div[' + str(i) + ']')
        # locate the letter tab by XPath and click it
        browser.find_element_by_xpath(path).click()
        # the first table on the page is the player information table
        df = pd.read_html(browser.page_source)[0]
        # export each letter's table to its own csv file
        df.to_csv('std' + str(i) + '.csv', index=False)
        time.sleep(10)
finally:
    browser.close()
```
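The key trick in the loop above is `pd.read_html`, which parses every `<table>` element in an HTML page into a list of DataFrames; indexing with `[0]` takes the first one. A minimal self-contained sketch of that behavior, using a made-up two-row table in place of `browser.page_source`:

```python
from io import StringIO
import pandas as pd

# a tiny stand-in for browser.page_source containing one player table
html = StringIO("""
<table>
  <tr><th>Player</th><th>Team</th></tr>
  <tr><td>Alice</td><td>Lakers</td></tr>
  <tr><td>Bob</td><td>Bulls</td></tr>
</table>
""")

# read_html returns a list of DataFrames, one per <table> element;
# [0] selects the first (and here, only) table
df = pd.read_html(html)[0]
print(df.shape)  # (2, 2): two data rows, two columns
```

The `<th>` row is inferred as the header, which is why the real table arrives with its column names already set.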
Since I had no way to gather every page into a single csv file directly, I used the simplest method: generate 26 csv files, then merge them into one csv file.
```python
import pandas as pd

# collect the 26 per-letter file names
la = []
for i in range(1, 27):
    path = 'std' + str(i) + '.csv'
    la.append(path)

# append each file to all.csv
for inputfile in la:
    pf = pd.read_csv(inputfile, header=None)
    pf.to_csv('all.csv', mode='a', index=False, header=False)
```
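Appending file by file works, but pandas can also stack the frames in one step with `pd.concat`. A minimal sketch of the idea, using small in-memory DataFrames as stand-ins for the per-letter csv files:

```python
import pandas as pd

# stand-ins for two of the per-letter csv files
df_a = pd.DataFrame([['Alice', 'Lakers'], ['Adam', 'Heat']])
df_b = pd.DataFrame([['Bob', 'Bulls']])

# pd.concat stacks the frames vertically; ignore_index renumbers the rows
merged = pd.concat([df_a, df_b], ignore_index=True)
print(len(merged))  # 3
```

With the real files the same idea would read each `std<i>.csv` with `pd.read_csv(..., header=None)` into a list, concat once, and write `all.csv` once, avoiding 26 separate append writes.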
You will need to adjust the paths here. This produces one large csv file, which can then be imported into the database (and I don't mean importing it by hand).
The xlrd module supports importing xls tables into the database, so I want to convert the csv file into an xls file.
```python
from io import StringIO
import csv
import pandas as pd

c_path = r"C:\Users\23990\Desktop\all.csv"
x_path = r"C:\Users\23990\Desktop\xxx.xls"  # the xls file is created automatically when to_excel is called

def csv_to_xls(csv_path, xls_path):
    with open(csv_path, 'r', encoding='gb18030', errors='ignore') as f:
        data = f.read()
    data_file = StringIO(data)
    print(data_file)
    csv_reader = csv.reader(data_file)
    list_csv = []
    for row in csv_reader:
        list_csv.append(row)
    df_csv = pd.DataFrame(list_csv).applymap(str)
    '''
    This part is not needed for the csv-to-xls conversion; it filters
    the csv file and writes it back:
    df_csv = df_csv[(df_csv[4] == '') | (df_csv[4] == 'name')]  # keep rows where column 4 is empty or 'name'
    df_csv.to_csv(csv_path, index=0, header=0, encoding='gb18030')  # write the filtered data back to the csv file
    '''
    writer = pd.ExcelWriter(xls_path)  # write to Excel
    df_csv.to_excel(
        excel_writer=writer,
        index=False,
        header=False
    )
    writer.save()

csv_to_xls(c_path, x_path)
```
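The commented-out block above selects rows by the value in the fifth column (index 4) with a boolean mask. A minimal sketch of that masking idea on a toy DataFrame (the column values are made up):

```python
import pandas as pd

# toy data: column 4 holds either a name, the header word 'name', or ''
rows = [
    ['a', 'b', 'c', 'd', 'name'],
    ['a', 'b', 'c', 'd', 'LeBron'],
    ['a', 'b', 'c', 'd', ''],
]
df = pd.DataFrame(rows)

# keep only rows whose column 4 is empty or the literal 'name';
# `|` combines the two element-wise boolean conditions
filtered = df[(df[4] == '') | (df[4] == 'name')]
print(len(filtered))  # 2
```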
Finally, import it into the database. Perfect. This method is fully operational on Windows; I have not tried it on a Mac. I'll write about it when I find a new approach.
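As a cross-platform alternative to the xls route, pandas can also write a DataFrame straight into a database with `to_sql`. A minimal sketch using Python's built-in sqlite3 as a stand-in for the real database (the table name `players` and the data are made up):

```python
import sqlite3
import pandas as pd

# a toy stand-in for the merged player data
df = pd.DataFrame({'player': ['Alice', 'Bob'], 'team': ['Lakers', 'Bulls']})

conn = sqlite3.connect(':memory:')  # in-memory database for the demo

# to_sql creates the table and inserts the rows in one call
df.to_sql('players', conn, index=False, if_exists='replace')

# read it back to confirm the import worked
count = conn.execute('SELECT COUNT(*) FROM players').fetchone()[0]
print(count)  # 2
conn.close()
```

For a real server you would pass an SQLAlchemy engine instead of the sqlite3 connection, but the `to_sql` call stays the same.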