Preface
The text and pictures in this article are from the Internet, are for learning and communication purposes only, and have no commercial use. If you have any questions, please contact us and we will address them.
Free tutorial videos on Python crawlers, data analysis, website development, and other case studies are available online:
https://space.bilibili.com/523606542
Previous articles in this series
Python Crawler Beginner Tutorial (1): Crawling Douban movie ranking information
Python Crawler Beginner Tutorial (2): Crawling novels
Python Crawler Beginner Tutorial (3): Crawling Lianjia second-hand housing data
Python Crawler Beginner Tutorial (4): Crawling 51job.com recruitment information
Python Crawler Beginner Tutorial (5): Crawling Bilibili video barrage
Python Crawler Beginner Tutorial (6): Making word cloud diagrams
Python Crawler Beginner Tutorial (7): Crawling Tencent Video barrage
Python Crawler Beginner Tutorial (8): Crawling forum articles and saving them as PDF
Basic development environment
- Python 3.6
- Pycharm
- wkhtmltopdf
Use of related modules
- re
- requests
- concurrent.futures
Install Python and add it to the environment variables, pip installs the required related modules.
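Of the modules listed above, only requests is third-party (re and concurrent.futures ship with the Python standard library), so assuming a standard Python 3 setup the install step reduces to:

```shell
pip install requests
```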
1. Clarify the requirements
Who doesn't send a few emoticons when chatting these days? Emoticons are an important tool in chat and a good helper for closing the distance between friends. When a conversation turns awkward, just send an emoticon and the embarrassment disappears.
In this article, I will use Python to crawl emoticon images in batches and save them for future use.
2. Web page data analysis
As shown in the figure, all the image data on the Doutu website is contained in a tags. You can try requesting this web page directly to check whether the response also contains the image addresses.
import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def main(html_url):
    response = get_response(html_url)
    print(response.text)


if __name__ == '__main__':
    url = 'https://www.doutula.com/photo/list/'
    main(url)
Use Ctrl + F to search the output and confirm that the image addresses are present.
One point worth noting: the result returned when requesting the web page with Python contains the image URL address in two attributes:
data-original="picture url"
data-backup="picture url"
To extract the URL addresses, you can use the parsel parsing library or the re regular-expression module. I used parsel in earlier articles, so this one uses regular expressions instead.
urls = re.findall('data-original="(.*?)"', response.text)
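To see what this pattern captures, here is a self-contained sketch that runs the same regular expression against a simplified, made-up fragment of the page's HTML:

```python
import re

# Simplified, hypothetical fragment of the HTML the page returns
html = '''
<a href="/detail/1">
    <img data-original="https://img.example.com/photo/abc.jpg"
         data-backup="https://img.example.com/photo/abc.jpg">
</a>
'''

# Non-greedy capture of everything between the quotes after data-original=
urls = re.findall('data-original="(.*?)"', html)
print(urls)  # ['https://img.example.com/photo/abc.jpg']
```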
Complete code for crawling a single page
import os
import re

import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    image_content = get_response(image_url).content
    filename = 'images\\' + image_name
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    # Make sure the output directory exists before writing to it
    os.makedirs('images', exist_ok=True)
    url = 'https://www.doutula.com/photo/list/'
    main(url)
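In the loop above, the file name is taken from the last path segment of each image URL. A quick illustration with a hypothetical link:

```python
# Hypothetical image URL; split('/')[-1] keeps only the final path segment
link = 'https://img.example.com/photo/abc123.gif'
image_name = link.split('/')[-1]
print(image_name)  # abc123.gif
```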
Multi-threaded crawling of all the site's pictures (if you have enough storage)
The site has 3,631 pages of emoticon data in total, hehe.
import concurrent.futures
import os
import re

import requests


def get_response(html_url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36'
    }
    response = requests.get(url=html_url, headers=headers)
    return response


def save(image_url, image_name):
    image_content = get_response(image_url).content
    filename = 'images\\' + image_name
    with open(filename, mode='wb') as f:
        f.write(image_content)
    print(image_name)


def main(html_url):
    response = get_response(html_url)
    urls = re.findall('data-original="(.*?)"', response.text)
    for link in urls:
        image_name = link.split('/')[-1]
        save(link, image_name)


if __name__ == '__main__':
    # Make sure the output directory exists before writing to it
    os.makedirs('images', exist_ok=True)
    # ThreadPoolExecutor creates a thread pool object
    # max_workers is the maximum number of concurrent worker threads
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=3)
    for page in range(1, 3632):
        url = f'https://www.doutula.com/photo/list/?page={page}'
        # submit adds a task to the thread pool
        executor.submit(main, url)
    executor.shutdown()
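As a variant (not from the original article), the pool can also be used as a context manager, which calls shutdown() automatically when the block exits; executor.map then spreads the page numbers across the workers. The crawl function here is a stand-in for the main() defined above:

```python
import concurrent.futures

def crawl(page):
    # Stand-in for main(); a real run would fetch and save the page's images
    return f'https://www.doutula.com/photo/list/?page={page}'

# Context-manager form: shutdown() runs automatically on exit
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    results = list(executor.map(crawl, range(1, 4)))

print(len(results))  # 3
```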