/1 Introduction/
With the rise of online shopping, many traditional shops have transformed to online businesses, and the emergence of e-commerce has greatly facilitated our lives.
/2 Project goal/
Search and go directly to the destination with one-click through the Python program, crawl Taobao product links, product names, and product image links, and record each operation in the log file.
/ 3 Project preparation/
Use the sublime text 3 editor to write the program, first look at the main interface after the program runs:
/4 Project realization/
1. Analyze the page structure and put the product information in their respective lists. Take the following shop as an example.
2. The old way, F12, because we are looking for product links in the store, so we find as many products as possible. From the perspective of the layout of the store, it seems that the baby recommends more products in this section, so we will Climb all the content in this section.
3. Steps 1, 2 and 3 in the figure are the various information of the products we want to crawl. It can be seen that the products are all in the dt tag with the class photo, so we need to extract them.
try: urllib3.disable_warnings() #Remove warnings from urllib3 #Web request rep=requests.get(self.e2.get(),verify=False,timeout=4) #Certificate verification is set to FALSE, set access delay rep .encoding='gbk' soup=BeautifulSoup(rep.content,'html.parser') result=soup.find_all('dt',class_='photo') #Get all dt elements whose class is photo for x in result : tt=x.find_all('a') #Get all the child elements under dt a for y in tt: for x in y: ab=x.find_next_siblings('img') #Get all the next sibling elements img for z in ab: \#Add the product name and product picture link to the lists aa and bb aa.append (z ['old']) bb.append('https:'+z['data-ks-lazyload']) cc.append('https:'+y['href'])#Add the product link to the list cc except: return
In this way, we can easily obtain the product link, product name, and product image link, and then save them in the aa, bb, and cc lists.
/ 5 Writing GUI interface, interactive and friendly/
In order to make the running result more beautiful, we need to make a GUI interface, which has to mention the Python built-in GUI artifact tkinter.
Okay, let's get back to the subject, we can write an interactive interface to encapsulate it into a class, which is more beautiful.
class page: def __init__(self): self.ti=dt.now().strftime("%Y/%m/%d %H:%S:%M") self.root = tk.Tk() # Initialize the window self.root.title('Taobao Get Merchant Baby V1.0') #Window name self.root.geometry("700x700") #Set window size self.root.iconbitmap('q.ico') self.root .resizable(width=True,height=True) #Set whether the window is variable, width is not variable, height is variable, the default is True \ #Create label, text, background color, font (color, size), label height and Wide self.label1 =tk.Label(self.root,text='Shop Homepage:',font=('宋体',10),width=12,height=2) \ #Create input box, label height, font size Color, content display method self.e2 = tk.Entry(self.root,width=30,show=None, font=('Arial', 12)) # Display in plain text self.label2 =tk.Label(self. root,text='Taobao Direct:',font=('宋体',10),width=12,height=2) self.e1 = tk.Entry(self.root,width=30,show=None, font=('Arial', 12)) \ #Create button content width and height button binding event self.b1 = tk.Button(self.root, text='parse page', width=8,height=1,command=self.parse) self.b2 =tk. Button(self.root, text='Generate excel', width=8,height=1,command=self.sc) self.b3 =tk.Button(self.root, text='Taobao search', width=8, height=1,command=self.search) self.b4 =tk.Button(self.root, text='Close the program', width=8,height=1,command=self.close) self.b5 =tk.Button (self.root, text='Save log', width=8,height=1,command=self.log) \ #Create a text box self.te=tk.Text(self.root,height=40) self.label1 .place(x=140,y=30,anchor='nw') self.label2.place(x=138,y=70,anchor='nw') \ #Add all parts to the interface self.e1. place(x=210,y=74,anchor='nw') self.e2.place(x=210,y = 34, anchor = 'nw') self.b1.place(x=160,y=110,anchor='nw') self.b2.place(x=240,y=110,anchor='nw') self.b3.place(x=320, y=110,anchor='nw') self.b4.place(x=400,y=110,anchor='nw') self.b5.place(x=480,y=110,anchor='nw') self.te.place(x=40,y=170,anchor='nw') \ #Set the start text of the input box self.e1.delete(0, "end") self.e1.insert(0, "Please enter The product to be searched") self.root.mainloop()
Even if the GUI interface is created, the effect diagram is as follows:
/ 6 Enter the homepage address of the target shop, generate data and export to Excel and log records/
1. Get the data by entering the homepage address of the Taobao shop, so we need to make a judgment on the program, because we encapsulate it in a class, so we need to add a self in each function bracket, the code is as follows:
# Parse web content def parse(self): self.res() if self.e2.get()=='': #Determine whether the value of the input box is empty \ #Insert the value into the text box self.te.insert ('insert','.... Please enter the URL.... \n') elif str(self.e2.get()).find('taobao.com')==-1 or aa==' ': self.te.insert('insert',' ... The address is incorrect...\n') else: self.te.insert("insert","Parsing the target webpage: %s\n\ n"%self.e2.get()) self.te.insert("insert",".... The start of parsing:.... \n") #INSERT index indicates the current position of the insert cursor self. te.insert("insert","\n\n") for x,y,z in zip(aa,bb,cc): #Combine the list where the data is located result=x+'\n'+y+'\n' +z+'\n\n' self.te.insert("insert",result,"\n\n") #Insert into the text box self.te.insert("insert","\n\n") #Insert a space self.te.insert("end","The parsing is complete...\n")
2. Generate an Excel file, the code is as follows:
# Save the result to excel def sc(self): self.te.insert("insert"," ... Start generating: ...\n") av={'Time':self.ti, 'Product name':aa,'Product link':cc,'Product picture link':bb} \#Generate dataframe multidimensional array df=p.DataFrame(av,columns=['time','product name','product Link','Product image link'],index=range(len(aa))) df.to_excel('22.xlsx', sheet_name='taobao') #Generate excel self.te.insert("end"," ... The generation is complete.... \N")
After the code is run, the following effect is obtained:
3. Generate a log file, the code is as follows:
# Save log def log(self): ss=str(self.te.get(0.0,'end')).split('\n') #Separate text box content with open('1.txt','w ',encoding='utf8') as f: #Save log for y in range(len(ss)): rea=str(self.ti)+ss[y]+'\n' f.write(rea)
After the code is run, the following effect is obtained:
/8 Quick search Taobao commodity webpage direct program is closed/
To search Taobao products with one click, we first find the search address of Taobao, and then make a get request and pass different values to him. General search will involve a keyword search.
Here we first find the search entry of Taobao, the address is:
https://s.taobao.com/search?q=
Then pass the value later, because we want to browse in the browser, so we need to use the webbrowser module, which is specifically responsible for accessing the page, and its usage is webbrowser.open(url).
So the code is as follows:
# Search products def search(self): self.te.insert("insert",".... Open the browser:.... \N") wb.open('https://s.taobao .com/search?q='+self.e1.get()) #Open the browser
The final step is to close the program. code show as below:
# Close the program def close(self): self.te.insert("insert","..... Close the program:.... \N") self.root.destroy() #Destroy the window
/9 Summary/
1. It is not recommended to grab too much data, which may cause load on the server, just try it briefly.
2. This article is based on the Python web crawler, using the crawler library, to create a simple and intelligent Taobao search system, and can be operated to generate logs.
3. This system seems very simple, but in fact it is a big challenge for the novice Xiaobai. Even some big guys are easy to fall into the hole. The main page analysis is a bit complicated and changeable, and there will be many abnormalities that will make you trapped. . Generally speaking, it's a pretty good practice project, and it's a test for yourself, I hope you like it.
4. If you need the source code of this article, please click the " source code " keyword to get it. You think it's good, remember to give a star.