1, first import the relevant libraries
import requests
import bs4
import threading  # multi-threading speeds up the crawl and lets you fetch many pages
import os
2, use bs4 to get the HTML content
The target page: http://www.umei.cc/bizhitupian/diannaobizhi/1.htm — this is just the first page of pictures; of course you can batch-crawl all the pictures on the site.
bs = bs4.BeautifulSoup(requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm").text, "html.parser")
At this point we have the page's HTML, but the output contains some garbled characters, so let's modify the code:
import bs4
import requests
import os

req = requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm")
req.encoding = "utf-8"
bs = bs4.BeautifulSoup(req.text, "html.parser")
This solves the problem of the crawled HTML coming out garbled.
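To see why a wrong charset produces garbled text, here is a tiny offline sketch (the sample string is invented for illustration; no network access needed):

```python
# Bytes as a server would send them, UTF-8 encoded.
raw = "电脑壁纸".encode("utf-8")

# Decoding with the wrong charset produces mojibake;
# decoding with the declared utf-8 recovers the original text.
wrong = raw.decode("latin-1")
right = raw.decode("utf-8")
print(right)  # 电脑壁纸
```

Setting `req.encoding = "utf-8"` tells requests to do the second, correct decoding when it builds `req.text`.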
3, match the image tags we need in the HTML
obj = bs.find_all("a", {"class": {"TypeBigPics"}})  # "a" is the tag name and TypeBigPics is the class value; this finds the picture <a> tags by their class
This gives us all the matching image <a> tags: find_all() returns every matching tag, while find() returns only the first match.
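The difference between find_all() and find() can be shown on a small made-up snippet (the file names are invented; only the class name matches the tutorial's target site):

```python
import bs4

# Two fake picture links for demonstration purposes only.
html = ('<a class="TypeBigPics"><img src="a.jpg"></a>'
        '<a class="TypeBigPics"><img src="b.jpg"></a>')
soup = bs4.BeautifulSoup(html, "html.parser")

all_links = soup.find_all("a", {"class": "TypeBigPics"})  # list of every match
first_link = soup.find("a", {"class": "TypeBigPics"})     # only the first match

print(len(all_links))                      # 2
print(first_link.find("img").get("src"))   # a.jpg
```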
4, extract the img tag inside each <a> tag
imgObj = []  # stores the img objects
for s in obj:
    imgObj.append(s.find("img"))  # extract the img object and append it to imgObj
5, get the src value of each img tag
srcObj = []  # stores the image src values
for o in imgObj:
    srcObj.append(o.get("src"))
Now we have the URLs of all the pictures on this page, and the next step is to download them.
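Steps 3 to 5 can be rehearsed offline on a made-up snippet, with the two loops collapsed into one list comprehension (the file names here are invented; only the class name comes from the tutorial):

```python
import bs4

# Fake page fragment standing in for the real site's HTML.
html = ('<a class="TypeBigPics"><img src="p1.jpg"></a>'
        '<a class="TypeBigPics"><img src="p2.jpg"></a>')
bs = bs4.BeautifulSoup(html, "html.parser")
obj = bs.find_all("a", {"class": "TypeBigPics"})

# Steps 4 and 5 in one line: pull each inner img, then its src value.
srcObj = [a.find("img").get("src") for a in obj]
print(srcObj)  # ['p1.jpg', 'p2.jpg']
```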
6, download pictures
for img in srcObj:
    with open("D:\\Images\\" + os.path.basename(img), 'wb') as f:
        f.write(requests.get(img).content)
        print(os.path.basename(img) + " saved successfully")
srcObj holds the image URLs collected above, and D:\\Images\\ is the local directory to save into (note the double backslashes). os.path.basename(img) keeps the original picture file name; you can replace it with your own naming scheme. With that, this simple crawler is finished.
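One practical caveat: open() fails if the target directory does not exist. A small helper (not part of the original tutorial, shown here with fake bytes instead of a real download) creates the folder first:

```python
import os
import tempfile

def save_image(content, directory, name):
    # Create the folder if it is missing, then write the raw bytes and
    # return the full path of the saved file.
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name)
    with open(path, "wb") as f:
        f.write(content)
    return path

# Offline usage example: fake bytes stand in for requests.get(img).content.
demo_dir = tempfile.mkdtemp()
saved = save_image(b"\x89PNG fake bytes", demo_dir, "wallpaper_01.jpg")
```

In the real crawler you would call it as `save_image(requests.get(img).content, "D:\\Images", os.path.basename(img))`.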
7, the complete code is as follows
import bs4
import requests
import os

req = requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm")
req.encoding = "utf-8"
bs = bs4.BeautifulSoup(req.text, "html.parser")
obj = bs.find_all("a", {"class": {"TypeBigPics"}})
objHtml = []
objImg = []
for s in obj:
    objHtml.append(s.find("img"))
for o in objHtml:
    objImg.append(o.get("src"))
for img in objImg:
    with open("D:\\pics22223\\" + os.path.basename(img), 'wb') as f:
        f.write(requests.get(img).content)
        print(os.path.basename(img) + " saved successfully")
8, use multiple threads to crawl all the pictures on the site
Here is the source code directly:
import bs4
import requests
import os
import threading

def ojue(i):
    bs = bs4.BeautifulSoup(requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/" + str(i) + ".htm").text, "html.parser")
    obj = bs.find_all("a", {"class": {"TypeBigPics"}})
    objHtml = []
    ImgObj = []
    for f in obj:
        objHtml.append(f.get("href"))  # this time follow each picture's detail page
    for z in objHtml:
        htmlText = bs4.BeautifulSoup(requests.get(z).text, "html.parser")
        Img = htmlText.find_all("img")
        for c in Img:
            ImgObj.append(c.get("src"))
    for img in ImgObj:
        with open("D:\\pics22223\\" + os.path.basename(img), 'wb') as f:
            f.write(requests.get(img).content)
            print(os.path.basename(img) + " saved successfully")

for i in range(627):  # range(627) yields 0 to 626, so i + 1 covers pages 1 to 627
    threading.Thread(target=ojue, args=(i + 1,)).start()  # target is the function each thread runs
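Starting 627 threads at once can overwhelm both your machine and the server. A gentler variant (an alternative to the loop above, not from the original post) uses a thread pool with a bounded number of workers; `crawl_page` below is a stand-in for ojue so the sketch runs offline:

```python
from concurrent.futures import ThreadPoolExecutor

def crawl_page(page):
    # Stand-in for ojue(page): the real function would fetch and save images.
    return "page %d done" % page

# At most 8 pages are crawled concurrently; map preserves page order.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(crawl_page, range(1, 628)))  # pages 1..627

print(results[0])  # page 1 done
```

To use it for real, replace `crawl_page` with `ojue` and the pool will schedule all 627 pages across the 8 worker threads.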