Crawling images from a simple website with Python

1. First, import the required libraries

import requests
import bs4
import threading  # for the multi-threaded crawler; speeds things up and lets us crawl many pages
import os

2. Use bs4 to parse the HTML content

The target site: http://www.umei.cc/bizhitupian/diannaobizhi/1.htm. This is only the first page of images; of course, you can batch-crawl every picture on the site.

bs = bs4.BeautifulSoup(requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm").text, "html.parser")

At this point we have the page's HTML, but the output contains some garbled characters. We can fix this by modifying the code:

import bs4
import requests
import os
req = requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm")
req.encoding="utf-8"
bs = bs4.BeautifulSoup(req.text, "html.parser")

This fixes the garbled characters in the crawled HTML.
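The garbling can be reproduced offline: it happens when bytes encoded in one charset are decoded with another, which is exactly what setting `req.encoding = "utf-8"` corrects. A minimal sketch (the sample string is hypothetical, not taken from the site):

```python
# UTF-8 bytes decoded with the wrong codec come out garbled, which is
# what happens when requests guesses the wrong encoding from the headers.
raw = "电脑壁纸".encode("utf-8")     # bytes as the server sends them
wrong = raw.decode("iso-8859-1")    # wrong guess -> garbled text
right = raw.decode("utf-8")         # explicit req.encoding = "utf-8" fixes it
print(right)  # 电脑壁纸
```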

3. Match the image-wrapping tags we need in the HTML

obj = bs.find_all("a", {"class": {"TypeBigPics"}})  # "a" is the tag name; "TypeBigPics" is the value of the class attribute

Here "a" matches `<a>` tags, and the class filter keeps only the `<a>` tags whose class is TypeBigPics, which are the ones wrapping the images we want. find_all() returns every matching `<a>` tag as a list, while find() returns only the first match.
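The difference between find_all() and find() can be seen on a small offline snippet (the HTML below is hypothetical, written to mimic the site's markup):

```python
import bs4

html = ('<a class="TypeBigPics" href="/p1.htm"><img src="/1.jpg"></a>'
        '<a class="TypeBigPics" href="/p2.htm"><img src="/2.jpg"></a>')
soup = bs4.BeautifulSoup(html, "html.parser")
all_links = soup.find_all("a", {"class": {"TypeBigPics"}})  # list of every match
first_link = soup.find("a", {"class": {"TypeBigPics"}})     # first match only
print(len(all_links))          # 2
print(first_link.get("href"))  # /p1.htm
```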

4. Extract the img tag inside each a tag

imgObj = []  # list for storing the img tag objects
for s in obj:
    imgObj.append(s.find("img"))  # extract the img inside each a tag and append it to imgObj

5. Get the value of each img tag's src attribute

srcObj = []  # list for storing the image src values
for o in imgObj:
    srcObj.append(o.get("src"))

Now we have the file paths of all the images on this page; the next step is to download them.
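One caveat worth noting: a src value may be site-relative rather than a full URL. A small sketch using the standard library's urljoin to build an absolute URL before downloading (the "/uploads/1.jpg" path is hypothetical):

```python
from urllib.parse import urljoin

# Join a relative src with the page URL to get a downloadable address.
page = "http://www.umei.cc/bizhitupian/diannaobizhi/1.htm"
src = "/uploads/1.jpg"  # hypothetical relative path
full = urljoin(page, src)
print(full)  # http://www.umei.cc/uploads/1.jpg
```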

6. Download the images

for img in srcObj:
    with open("D:\\Images\\"+os.path.basename(img),'wb') as f:
        f.write(requests.get(img).content)
    print(os.path.basename(img) + " saved successfully")

srcObj holds the image URLs collected above, and D:\\Images\\ is the local directory to save into (note: use double backslashes). os.path.basename(img) keeps the original image file name; you can replace it with a file name of your own. With that, this simple crawler is finished.
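The open() call above fails with FileNotFoundError if the save directory does not exist yet. A sketch of guarding against that with os.makedirs, and of how os.path.basename derives the file name from a URL (the folder name and URL here are hypothetical):

```python
import os

# Create the save directory first so open(..., 'wb') does not fail.
save_dir = "pics_demo"                # hypothetical relative folder
os.makedirs(save_dir, exist_ok=True)  # no error if it already exists
name = os.path.basename("http://www.umei.cc/uploads/1.jpg")
path = os.path.join(save_dir, name)
print(name)  # 1.jpg
```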

7. The complete code

import bs4
import requests
import os
req = requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/1.htm")
req.encoding="utf-8"
bs = bs4.BeautifulSoup(req.text, "html.parser")
obj = bs.find_all("a",{"class":{"TypeBigPics"}})
objHtml=[]
objImg=[]
for s in obj:
    objHtml.append(s.find("img"))
for o in objHtml:
    objImg.append(o.get("src"))
for img in objImg:
    with open("D:\\pics22223\\"+os.path.basename(img),'wb') as f:
        f.write(requests.get(img).content)
    print(os.path.basename(img) + " saved successfully")

8. Crawl every image on the site with multiple threads

The source code follows directly:

import bs4
import requests
import os
import threading
def ojue(i):
    bs = bs4.BeautifulSoup(requests.get(r"http://www.umei.cc/bizhitupian/diannaobizhi/" + str(i) + ".htm").text, "html.parser")
    obj = bs.find_all("a", {"class": {"TypeBigPics"}})
    objHtml = []
    ImgObj = []
    for a in obj:
        objHtml.append(a.get("href"))  # collect the link to each image's detail page
    for z in objHtml:
        htmlText = bs4.BeautifulSoup(requests.get(z).text, "html.parser")
        Img = htmlText.find_all("img")
        for c in Img:
            ImgObj.append(c.get("src"))
    for img in ImgObj:
        with open("D:\\pics22223\\" + os.path.basename(img), 'wb') as f:
            f.write(requests.get(img).content)
        print(os.path.basename(img) + " saved successfully")

for i in range(627):  # range(627) yields 0 to 626
    threading.Thread(target=ojue, args=(i + 1,)).start()  # target is the function each thread runs
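Starting 627 threads at once is heavy on both your machine and the server. One common alternative, sketched here with the standard library's ThreadPoolExecutor, caps the number of concurrent workers; fetch_page is a hypothetical stand-in for the real per-page crawl function:

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(i):
    # Stand-in for the real per-page crawl (ojue above); just does dummy work.
    return i * i

# A bounded pool runs at most max_workers tasks concurrently.
with ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch_page, range(1, 11)))
print(results[:3])  # [1, 4, 9]
```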

 


Origin www.cnblogs.com/MingGyGy-Castle/p/11962188.html