Getting Started with Python Crawlers [7]: Crawling Fengniao Images, Part 2

Fengniao Images - Introduction

Today we try something new: the aiohttp library, which we will use to speed up our crawler.

Installing the module is the usual routine:

pip install aiohttp


Once it finishes running, the installation is complete. If you want to dig deeper, the official documentation is essential: https://aiohttp.readthedocs.io/en/stable/

Then you can start writing code.

The page we want to crawl this time is

http://bbs.fengniao.com/forum/forum_101_1_lastpost.html

Open the page, and we can easily pick out the page number.


It has been a while since page numbers were this easy to read off.


Try importing aiohttp and using it to access this page. There is nothing special about the module itself; a plain import works.
If we want to write the crawler with asyncio + aiohttp asynchronous IO, note that coroutine functions must be declared with async in front of def.

Next, try fetching the source code of the page at the address above.

In the code we declare a function fetch_img_url that takes one parameter; the value could also simply be hard-coded.

The with context manager is not explained here; you can look it up on your own ( `· ω · ')

aiohttp.ClientSession() as session: creates a session object, and that session object is then used to open the page. A session can perform multiple operations, such as post, get, put, and so on.

The line await response.text() waits for the page data to come back.

asyncio.get_event_loop creates the event loop, and its run_until_complete method is responsible for scheduling and running tasks. tasks may be a single coroutine or a list of them.

import aiohttp
import asyncio

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'  # build the URL from the page number
    # or simply hard-code url = 'http://bbs.fengniao.com/forum/forum_101_1_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # fetch the carousel page addresses
        async with session.get(url, headers=headers) as response:
            try:
                html = await response.text()   # grab the page source
                print(html)

            except Exception as e:
                print("basic error")
                print(e)

# you can copy this part as-is
loop = asyncio.get_event_loop()
tasks = asyncio.ensure_future(fetch_img_url(1))
results = loop.run_until_complete(tasks)

The last part of the above code can be rewritten as

loop = asyncio.get_event_loop()
tasks =  [fetch_img_url(1)]
results = loop.run_until_complete(asyncio.wait(tasks))
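On Python 3.7+ the same scheduling is usually written with asyncio.run and asyncio.gather instead of managing the loop by hand. A minimal sketch of that pattern, using a stand-in coroutine (fake_fetch below is a placeholder, not the crawler's real function):

```python
import asyncio

async def fake_fetch(num):
    # placeholder standing in for fetch_img_url; just echoes its page number
    await asyncio.sleep(0)  # yield control, like a real network await would
    return f"page {num} done"

async def main():
    # gather runs all coroutines concurrently and returns results in order
    return await asyncio.gather(*(fake_fetch(n) for n in range(1, 4)))

results = asyncio.run(main())
print(results)
```

asyncio.run creates and closes the loop for you, which avoids the deprecation warnings that get_event_loop now raises in newer Python versions.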

Well, with the source code in hand, we are only a stone's throw from the ultimate goal.
Next, modify the code to fetch pages in bulk.
Only tasks needs to change; run it and see the results below.

tasks =  [fetch_img_url(num) for num in range(1, 10)]  # pages 1 through 9; use range(1, 11) for a full 10


The series of operations that follows is very similar to an earlier post in this series: find the pattern.
Just open any one page

http://bbs.fengniao.com/forum/forum_101_4_lastpost.html

Click a thumbnail to reach the thread page, click a picture inside the thread, and you land on a carousel page.


Click once more to enter the picture slideshow page.


On the slideshow page, viewing the source reveals all the image links. Then the question arises: how do we go from the first link above to these carousel links?
Below is the source of http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html; view-source it to follow along.


Let's keep analyzing ~~~~ ヾ(=・ω・=)o

How does
http://bbs.fengniao.com/forum/forum_101_4_lastpost.html
turn into the link below?
http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html
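The two id numbers in the slide link come from the thread hrefs on the list page. A quick sketch of that transformation on a made-up href (the id values below are only illustrative):

```python
import re

# a thread link as it might appear in the list-page source (illustrative values)
href = 'href="/forum/10408464_p89383854.html"'

# capture the thread id and the post id
match = re.search(r'href="/forum/(\d+)_p(\d+)\.html', href)
slide_url = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html".format(
    match.group(1), match.group(2))
print(slide_url)
```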

Going back to the first link, open the F12 developer tools and grab one of the pictures to inspect.


The boxed area in the screenshot contains the numbers we want, so all we need to do is match them out with a regular expression.
At the #### marked spots in the code below, note that while writing the regex I could not get the match done in a single step; it takes two steps, as you can see in the details o(╥﹏╥)o

  1. Find all the pictures inside <div class="picList">
  2. Extract the two numbers we want
import re  # needed for the regular expressions below

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # fetch the carousel page addresses
        async with session.get(url, headers=headers) as response:
            try:
                ###############################################
                url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
                html = await response.text()   # grab the page source
                pattern = re.compile(r'<div class="picList">([\s\S]*?)</div>')
                first_match = pattern.findall(html)
                href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
                urls = [url_format.format(href_pattern.search(url).group(1), href_pattern.search(url).group(2)) for url in first_match]
                ##############################################

            except Exception as e:
                print("basic error")
                print(e)

With that code in place we have the URLs we want; next, fetch each of those URLs and match out the image links inside.
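The carousel page embeds its image list as a JavaScript variable, so the extraction can be rehearsed on a stand-in snippet first (the downloadPic value below is made up for illustration):

```python
import re
import json

# a stand-in fragment of the slideshow page source (values are illustrative)
slider_html = 'var picList = [{"downloadPic":"http://img.example.com/a.jpg"}];'

# capture the array contents, re-wrap them in brackets, and parse as JSON
pic_list_pattern = re.compile(r'var picList = \[(.*)?\];')
pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
pic_json = json.loads(pic_list)
print(pic_json[0]["downloadPic"])
```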

import json  # needed to parse the embedded picList array

async def fetch_img_url(num):
    # copy the code from above
    async with aiohttp.ClientSession() as session:
        # fetch the carousel page addresses
        async with session.get(url, headers=headers) as response:
            try:
                # copy the code from above
                ################################################################
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            slider_html = await slider.text()   # grab the page source
                            pic_list = ""  # defined up front so the except block can print it safely
                            try:
                                pic_list_pattern = re.compile(r'var picList = \[(.*)?\];')
                                pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
                                pic_json = json.loads(pic_list)  # the image list is now in hand
                                print(pic_json)
                            except Exception as e:
                                print("debugging error")
                                print(pic_list)
                                print("*"*100)
                                print(e)

                    except Exception as e:
                        print("error fetching the image list")
                        print(img_slider)
                        print(e)
                        continue
                ################################################################

                print("{} finished".format(url))
            except Exception as e:
                print("basic error")
                print(e)


With the picture JSON finally in hand, the last step is to download the images. Dangdangdang ~~~~ a flurry of rapid-fire operations, and the pictures come down.


async def fetch_img_url(num):
    # copy the code from above
    async with aiohttp.ClientSession() as session:
        # fetch the carousel page addresses
        async with session.get(url, headers=headers) as response:
            try:
                # copy the code from above
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            # copy the code from above
                            ##########################################################
                            for img in pic_json:
                                try:
                                    img = img["downloadPic"]
                                    async with session.get(img, headers=headers) as img_res:
                                        imgcode = await img_res.read()  # read the image bytes
                                        with open("images/{}".format(img.split('/')[-1]), 'wb') as f:
                                            f.write(imgcode)  # the with block closes the file for us
                                except Exception as e:
                                    print("image download error")
                                    print(e)
                                    continue
                            ###############################################################

                    except Exception as e:
                        print("error fetching the image list")
                        print(img_slider)
                        print(e)
                        continue
                print("{} finished".format(url))
            except Exception as e:
                print("basic error")
                print(e)

The pictures are written into an images folder, so create it ahead of time.


tasks can open up to 1024 coroutines here, but I suggest capping it at about 100; too much concurrency puts too much pressure on their server.
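One common way to cap concurrency is an asyncio.Semaphore. The sketch below uses a dummy worker (not the real crawler function) to show the pattern, with the limit set to 100 as suggested:

```python
import asyncio

async def bounded_fetch(sem, num):
    # at most the semaphore's limit of workers may enter this block at once
    async with sem:
        await asyncio.sleep(0)  # stands in for the real network request
        return num

async def main():
    sem = asyncio.Semaphore(100)  # cap concurrency at 100, as suggested
    return await asyncio.gather(*(bounded_fetch(sem, n) for n in range(1, 1025)))

results = asyncio.run(main())
print(len(results))
```

All 1024 coroutines are still scheduled, but no more than 100 hold the semaphore (and thus hit the server) at the same time.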

Once the above runs, polish a few details, such as saving into a specified folder, and you are done.
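Saving into a specified folder mostly comes down to creating the directory up front. A small sketch (the folder name and image URL below are placeholders):

```python
import os

save_dir = "images"                       # target folder (placeholder name)
os.makedirs(save_dir, exist_ok=True)      # create it if missing; no error if it already exists

img_url = "http://img.example.com/a.jpg"  # made-up URL for illustration
file_path = os.path.join(save_dir, img_url.split('/')[-1])  # e.g. images/a.jpg
print(file_path)
```

With this in place, the open("images/{}".format(...)) call in the crawler never fails for lack of a folder.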


Origin blog.51cto.com/14445003/2423290