Crawling Fengniao (Hummingbird) forum pictures - Introduction
Today we try something new: the aiohttp library, which we will use to speed up our crawler.
Installing the module follows the usual routine:
pip install aiohttp
Wait for the run to finish and the installation is complete. If you want to study further, the official documentation is a must: https://aiohttp.readthedocs.io/en/stable/
Then you can start writing code.
The page we want to crawl this time is
http://bbs.fengniao.com/forum/forum_101_1_lastpost.html
Open it and we can easily spot the page numbers; it has been a while since pagination was this easy to read.
Try using aiohttp to access this page. Importing the module is nothing special, a plain import will do. Note that to write an asynchronous-IO crawler with asyncio + aiohttp, you must put the async keyword in front of each coroutine function.
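Before the crawler itself, here is a minimal, self-contained sketch of what the async keyword does (the function name say_hello is made up for illustration):

```python
import asyncio

# a function defined with `async def` is a coroutine function;
# calling it returns a coroutine that an event loop must run
async def say_hello():
    await asyncio.sleep(0)  # `await` hands control back to the event loop
    return "hello"

result = asyncio.run(say_hello())
print(result)  # hello
```

asyncio.run is the modern one-line entry point; the loop.run_until_complete pattern used later in this article does the same job.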
Next, try to fetch the source code of the page at the address above.
In the code, declare a function fetch_img_url that carries one parameter, the page number (you could also hard-code it).
The with context manager is not explained here; you can look up the details on your own ( `· ω · ')
aiohttp.ClientSession() as session creates a session object, and that session object is then used to open the page. A session supports several operations, such as post, get, put, etc.
The line await response.text() waits for the page data to be returned.
asyncio.get_event_loop() creates an event loop (not a thread), and its run_until_complete method is responsible for scheduling and running tasks. tasks may be a single coroutine or a list of them.
import aiohttp
import asyncio

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'  # string formatting
    # or simply: url = 'http://bbs.fengniao.com/forum/forum_101_1_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }
    async with aiohttp.ClientSession() as session:
        # get the carousel image addresses
        async with session.get(url, headers=headers) as response:
            try:
                html = await response.text()  # the page source
                print(html)
            except Exception as e:
                print("基本错误")
                print(e)

# you can copy this part as-is
loop = asyncio.get_event_loop()
tasks = asyncio.ensure_future(fetch_img_url(1))
results = loop.run_until_complete(tasks)
The last part of the code above can also be rewritten as
loop = asyncio.get_event_loop()
tasks = [fetch_img_url(1)]
results = loop.run_until_complete(asyncio.wait(tasks))
Well, with the source code in hand we are only a stone's throw from our final goal.
Now modify the code to fetch multiple pages in bulk (range(1, 10) below covers pages 1 through 9). Only tasks needs to change; run it and look at the results.
tasks = [fetch_img_url(num) for num in range(1, 10)]
The operations that follow are very similar to a previous post: look for the pattern.
Open any list page, for example
http://bbs.fengniao.com/forum/forum_101_4_lastpost.html
Click a picture to enter its detail page, click a picture on the detail page to enter the carousel page, then click once more to reach the image playback page.
On the playback page we can find all the image links in the source code. The question now is: how do we turn the first link above into the carousel link???
The carousel page, for example, is http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html
Right-click to view its source.
Keep analyzing ~ ~ ~ ~ ヾ(=・ω・=)o
How does
http://bbs.fengniao.com/forum/forum_101_4_lastpost.html
turn into the link below?
http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html
Go back to the first link, open the F12 developer tools, and inspect a picture.
Inside the highlighted frame we can spot the numbers we want, so all we need to do is match them out with a regular expression.
In the code below, at the location marked with ####, note that while writing the regex I found it could not match everything in a single step; it takes two steps, as you can see in detail o(╥﹏╥)o
- Find all the pictures inside <div class="picList">
- Extract the two numbers we need
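The two-step extraction can be checked on a small sample first. The href snippet below is a hypothetical example shaped to match the pattern used in the crawler code; only the number pattern matters:

```python
import re

# hypothetical href as it would appear inside the picList block (assumed markup)
sample = '<a href="/forum/10408464_p89383854.html">'
href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
m = href_pattern.search(sample)

# the two captured numbers slot straight into the carousel URL template
slide_url = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html".format(m.group(1), m.group(2))
print(slide_url)  # http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html
```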
import re  # new import needed for the regular expressions

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }
    async with aiohttp.ClientSession() as session:
        # get the carousel image addresses
        async with session.get(url, headers=headers) as response:
            try:
                ###############################################
                url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
                html = await response.text()  # the page source
                pattern = re.compile(r'<div class="picList">([\s\S]*?)</div>')
                first_match = pattern.findall(html)
                href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
                urls = [url_format.format(href_pattern.search(url).group(1),
                                          href_pattern.search(url).group(2)) for url in first_match]
                ##############################################
            except Exception as e:
                print("基本错误")
                print(e)
The code is complete and we now have the URLs we want. Next, read each of those URLs and match the image links inside them.
import json  # new import needed to parse the picList data

async def fetch_img_url(num):
    # copy the code from above
    async with aiohttp.ClientSession() as session:
        # get the carousel image addresses
        async with session.get(url, headers=headers) as response:
            try:
                # copy the code from above
                ################################################################
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            slider_html = await slider.text()  # the page source
                            try:
                                pic_list_pattern = re.compile(r'var picList = \[(.*)?\];')
                                pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
                                pic_json = json.loads(pic_list)  # the image list is now in hand
                                print(pic_json)
                            except Exception as e:
                                print("代码调试错误")
                                print(pic_list)
                                print("*" * 100)
                                print(e)
                    except Exception as e:
                        print("获取图片列表错误")
                        print(img_slider)
                        print(e)
                        continue
                ################################################################
                print("{}已经操作完毕".format(url))
            except Exception as e:
                print("基本错误")
                print(e)
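The picList step can also be tried in isolation. The slider_html string below is a hypothetical one-entry sample; the real pages embed a much longer list, but the shape assumed here matches the regex used above:

```python
import json
import re

# hypothetical inline JavaScript from a carousel page (assumed format)
slider_html = 'var picList = [{"downloadPic":"http://img.fengniao.com/demo/001.jpg"}];'

# grab the array body, re-wrap it in brackets, and parse it as JSON
pic_list_pattern = re.compile(r'var picList = \[(.*)?\];')
pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
pic_json = json.loads(pic_list)
print(pic_json[0]["downloadPic"])  # http://img.fengniao.com/demo/001.jpg
```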
With the picture JSON finally in hand comes the last step: downloading the images. Ta-da ~ ~ ~ ~, one burst of rapid operations and the pictures come down.
async def fetch_img_url(num):
    # copy the code from above
    async with aiohttp.ClientSession() as session:
        # get the carousel image addresses
        async with session.get(url, headers=headers) as response:
            try:
                # copy the code from above
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            # copy the code from above
                            ##########################################################
                            for img in pic_json:
                                try:
                                    img = img["downloadPic"]
                                    async with session.get(img, headers=headers) as img_res:
                                        imgcode = await img_res.read()  # read the image bytes
                                        # the with block closes the file automatically
                                        with open("images/{}".format(img.split('/')[-1]), 'wb') as f:
                                            f.write(imgcode)
                                except Exception as e:
                                    print("图片下载错误")
                                    print(e)
                                    continue
                            ###############################################################
                    except Exception as e:
                        print("获取图片列表错误")
                        print(img_slider)
                        print(e)
                        continue
                print("{}已经操作完毕".format(url))
            except Exception as e:
                print("基本错误")
                print(e)
The images are written into an images folder, so create it before you run (os.makedirs('images', exist_ok=True) will do it).
In tasks I opened up 1024 coroutines, but I suggest you stick to about 100; too much concurrency is hard on their server.
Once the above runs, add some finishing touches, such as saving to a folder of your choice, and you are done.
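The 100-coroutine cap suggested above can be enforced with asyncio.Semaphore. This is a sketch with a dummy coroutine standing in for the real fetch_img_url (bounded_fetch is a made-up name, and the sleep simulates the network call):

```python
import asyncio

async def bounded_fetch(sem, num):
    # hold the semaphore while the request runs, so at most 100 are in flight
    async with sem:
        await asyncio.sleep(0)  # stand-in for the real fetch_img_url work
        return num

async def main():
    sem = asyncio.Semaphore(100)  # concurrency cap
    # launch all 1024 tasks; the semaphore throttles how many run at once
    return await asyncio.gather(*(bounded_fetch(sem, n) for n in range(1, 1025)))

results = asyncio.run(main())
print(len(results))  # 1024
```

asyncio.gather returns the results in task order, so results[0] corresponds to page 1.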