No emojis in the chat? This programmer crawled 100,000 emojis with Python

Foreword

The whole thing started a few days ago. A friend of mine was chatting with a girl he liked, and the conversation kept getting awkward. He wanted to send a few emojis to lighten the mood, but then he looked at his own emoji collection. . .
. . . After sending those, he basically said goodbye to the girl for good. Then he asked me for help: did I have any emoji packs? I didn't, but the website does.

Analyzing the page

The site being crawled today is Doutu, and it really does have a lot of emojis. Having seen the impressive number of pages, it's time to figure out how to get the URLs of the emoji images. First open Chrome, then press F12 to enter happy crawler mode.
Then follow the steps in the screenshot: click arrow No. 1 first, then select an emoji. The element in the red box is the object we want to crawl, and the src of the emoji image is inside it.
Now that we know how to get the URL of each emoji, it's time to write the code.
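Since the listing spans a lot of pages, the crawler also needs a URL for every page. The exact pagination pattern depends on the site; the base URL and the "page" query parameter below are assumptions for illustration only, so check the real pagination links in the browser first.

# A minimal sketch of building the list of page URLs to crawl.
# Both the base URL and the "page" parameter are assumptions here,
# not taken from the article.
baseurl = "https://www.doutula.com/photo/list/?page="
pageurls = [baseurl + str(page) for page in range(1, 101)]  # e.g. the first 100 pages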

Implementation

Parse the page

Get web content

This function fetches the raw content of a page to be crawled:

import urllib.request

def askURL(url):
    # Pretend to be a normal browser so the site doesn't reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read()
    except Exception as result:
        print(result)
    return html
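A quick, hypothetical usage check, fetching one listing page (the URL here is just a placeholder, not necessarily the real one):

html = askURL("https://www.doutula.com/photo/list/?page=1")  # placeholder URL
print(len(html))  # size of the downloaded page, 0-ish if the request failed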

Parse web content

import re
from bs4 import BeautifulSoup

# Regular expression that pulls the image name and src out of the img tag
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find every img tag on the page
    for item in bs.find_all('img'):
        item = str(item)
        # Use the regex above to pull out the image src and the image name
        imgsrc = re.findall(imglink, item)
        # Not every img tag is an emoji we want, so the match may come back empty; skip those
        if len(imgsrc) != 0:
            imgname = ""
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs
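getFileType and getFileName are helper functions that aren't shown in the article; a minimal sketch of what they might look like, assuming they simply derive the extension and the file name from the image URL:

# Assumed helpers -- not shown in the original code
def getFileType(url):
    # Extension after the last dot, e.g. "jpg" or "gif"
    return url.rsplit('.', 1)[-1]

def getFileName(url):
    # Fall back to the last path segment of the URL as the file name
    return url.rsplit('/', 1)[-1]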

Now that we have the links and names of all the images, we can start downloading them.

File download

Multithreaded download

Because there are a lot of files, it's best to download them with multiple threads. The snippet below is just an example; you can write your own version following the same logic.

from concurrent.futures import ThreadPoolExecutor

# Hand each download off to a worker thread
pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    pool.submit(FileDownload.downloadFile, urls[j], filelocation[j])
 
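FileDownload.downloadFile isn't shown in the article either; a minimal sketch of a download function with that signature, assuming it simply writes the image bytes to disk:

import urllib.request

# Assumed implementation of the download function passed to pool.submit above
def downloadFile(url, filelocation):
    try:
        req = urllib.request.Request(url=url, headers={"User-Agent": "Mozilla/5.0"})
        with urllib.request.urlopen(req) as response, open(filelocation, 'wb') as f:
            f.write(response.read())
    except Exception as result:
        print(result)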

Results

In total, more than 100,000 emojis were crawled. After this run, we're quite well stocked with emojis ourselves.

Summary

This is a very simple crawler, good hands-on practice for beginners like me. If you're interested in crawlers, you can read the other articles in my crawler column; maybe you'll like them too.

Crawler column, come and click on me

Two lines of code to crawl the Weibo hot search list and send an email reminder, so Mom no longer has to worry about me missing the gossip. Crawler basics

Crawling 4K pretty-girl pictures with Python. Life is too short, I use Python

Crawling Bilibili (b station) videos with Python. Life is too short, I use Python

Crawling beauty pictures with Python. Crawler basics

I'll write more when I get the chance. If anything here infringes, it will be deleted immediately.

Origin: blog.csdn.net/qq_43627076/article/details/119851587