No emoji packs to send in chat? This programmer crawled 100,000 of them with Python
Foreword
The story started a few days ago. A friend of mine was chatting with a girl he liked, and the conversation kept turning awkward. He wanted to send a few emoji images to lighten the mood, but then he looked at his sticker collection. . .
. . . After he sent those, he could basically say goodbye to the girl. So he came to me for help and asked whether I had any emoji packs. I didn't, but the website does.
Analyzing the page
The site we're crawling today is Doutu, and I have to say it really does have a lot of emoji images. After seeing the impressive page count, it's time to work out how to get the URL of each emoji image. First open Chrome, then press F12 to enter happy crawler mode,
and follow the steps in the figure below: click arrow No. 1 first, then select an emoji image. The red box marks the element we want to crawl, and the image's src is inside it.
Once I had figured out how to get the image URLs, I started writing the code.
Implementation
Parse the page
Get web content
The function below fetches the raw content of a page:
import urllib.request

def askURL(url):
    # Pretend to be a normal browser so the site doesn't reject the request
    head = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.106 Safari/537.36"
    }
    req = urllib.request.Request(url=url, headers=head)
    html = ""
    try:
        response = urllib.request.urlopen(req)
        html = response.read()
    except Exception as result:
        print(result)
    return html
Parse web content
import re
from bs4 import BeautifulSoup

# Regular expression that pulls out the picture name (alt) and the real
# image URL (data-original)
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" data-backup=".*?" data-original="(.*?)" referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

def getimgsrcs(url):
    html = askURL(url)
    bs = BeautifulSoup(html, "html.parser")
    names = []
    srcs = []
    # Find all the img tags
    for item in bs.find_all('img'):
        item = str(item)
        # Apply the regex above to grab the image's src and name
        imgsrc = re.findall(imglink, item)
        # Not every img tag on the page is one we want, so the regex may
        # return an empty list; check before using it
        if len(imgsrc) != 0:
            imgname = ""
            # getFileType / getFileName are the author's own helpers that
            # derive the extension / file name from a URL (not shown here)
            if imgsrc[0][0] != '':
                imgname = imgsrc[0][0] + '.' + getFileType(imgsrc[0][1])
            else:
                imgname = getFileName(imgsrc[0][1])
            names.append(imgname)
            srcs.append(imgsrc[0][1])
    return names, srcs
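It's worth checking what that regular expression actually captures. Below, a made-up img tag shaped like the site's markup (the attribute values are invented for illustration) is run through the same pattern; group 1 is the picture name and group 2 is the real image URL hidden in data-original, since src only holds a lazy-load placeholder:

```python
import re

# Same pattern as in getimgsrcs above
imglink = re.compile(
    r'<img alt="(.*?)" class="img-responsive lazy image_dta" '
    r'data-backup=".*?" data-original="(.*?)" '
    r'referrerpolicy="no-referrer" src=".*?"/>',
    re.S)

# A sample tag mimicking the site's markup; the values are made up
sample = ('<img alt="doge" class="img-responsive lazy image_dta" '
          'data-backup="https://backup.example/doge.jpg" '
          'data-original="https://img.example/doge.jpg" '
          'referrerpolicy="no-referrer" src="loading.gif"/>')

matches = re.findall(imglink, sample)
print(matches)  # [('doge', 'https://img.example/doge.jpg')]
```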
Now that we have the links and names of all the pictures, we can start downloading.
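One detail the snippets gloss over: getimgsrcs handles a single page, while the site has thousands. A small loop over page numbers ties it together. The URL pattern below is a hypothetical placeholder, not the site's real pagination scheme; check the address bar in your browser for the actual pattern. The page-fetching function is passed in as a parameter, so with the code above in scope you would call crawlAllPages(last_page, getimgsrcs).

```python
# Hypothetical pagination pattern -- replace with the site's real one
BASE_URL = "https://example.com/photo/list/?page={}"

def buildPageURL(page):
    """Return the URL of one gallery page (placeholder pattern)."""
    return BASE_URL.format(page)

def crawlAllPages(last_page, fetch):
    """Run fetch (e.g. getimgsrcs above) on pages 1..last_page and
    collect every picture name and URL."""
    all_names, all_srcs = [], []
    for page in range(1, last_page + 1):
        names, srcs = fetch(buildPageURL(page))
        all_names.extend(names)
        all_srcs.extend(srcs)
    return all_names, all_srcs
```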
File download
Multithreaded download
Since there are a lot of files, it's best to download them with multiple threads. The snippet below is just an example; you can flesh it out following the same logic.
from concurrent.futures import ThreadPoolExecutor

# urls / filelocation hold the image URLs and the target file paths
pool = ThreadPoolExecutor(max_workers=50)
for j in range(len(names)):
    pool.submit(FileDownload.downloadFile, urls[j], filelocation[j])
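FileDownload.downloadFile is the author's own helper and isn't shown in the post. A minimal stand-in, assuming it takes the image URL and a target path, might look like this:

```python
import urllib.request

def downloadFile(url, filelocation):
    """Download one image to filelocation; log and skip on any error."""
    req = urllib.request.Request(
        url=url,
        headers={"User-Agent": "Mozilla/5.0"},  # same browser trick as askURL
    )
    try:
        with urllib.request.urlopen(req) as response:
            data = response.read()
        with open(filelocation, "wb") as f:
            f.write(data)
    except Exception as result:
        # One bad URL shouldn't stop the whole thread pool
        print(result)
```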
Results
A total of more than 100,000 emoji images were crawled; this time, we're the ones with the big sticker collection.
Summary
A very simple crawler, well suited for beginners like me to practice on. If you're interested in crawlers, you can read the other articles in my crawler column; maybe you'll like them too.
My crawler column: come and take a look
Crawling 4K pretty-girl wallpapers with Python: life is short, I use Python
Crawling Bilibili videos with Python: life is short, I use Python
Crawling beauty pictures with Python: crawler basics
I'll write more when I get the chance. Any infringing content will be removed immediately.