Python and SEO: an asynchronous crawler demo for scraping Baidu keyword competition levels

How do you check how competitive a keyword is? Even newcomers should know the basic method: search for the keyword directly in a browser. On Baidu, for example, a small line of gray text above the results reads "Baidu found about 100,000,000 relevant results for you". That count is the keyword's competition level, and it indicates how difficult it will be to optimize your ranking for that keyword later. Of course, it is only a reference indicator.

Another important reference indicator is the keyword's Baidu Index, which covers keywords Baidu already tracks. Most optimization work should start from studying keywords that have a Baidu Index; high-volume ("big") keywords all have one.

Key points

asyncio --- Asynchronous I/O

Python 3.4 introduced the concept of coroutines, but that version's coroutines were still based on generator objects. Python 3.5 added the async/await syntax, which makes coroutines much more convenient to write.

The most commonly used library for working with coroutines in Python is asyncio.

asyncio is a library for writing concurrent code, using async/await syntax.

asyncio serves as the foundation for a number of high-performance Python asynchronous frameworks, including network and web servers, database connection libraries, and distributed task queues.

asyncio is often the best choice for building IO-intensive and high-level structured network code.

event_loop: the event loop, essentially an infinite loop. We can register functions on it, and when their trigger conditions are met, the corresponding handler is called.

coroutine: in Python this usually refers to the coroutine object type. We can register a coroutine object on the event loop, and the loop will call it. A method defined with the async keyword is not executed immediately when called; instead, the call returns a coroutine object.

task: a Task is a further wrapper around a coroutine object that also tracks the task's state.

Future: represents the result of a task that may or may not have executed yet. In practice, there is no essential difference between a Future and a Task.

The async/await keywords, available since Python 3.5, are used specifically for coroutines: async defines a coroutine, and await suspends execution at a blocking call.
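The definitions above fit in a few lines: calling an async def function only builds a coroutine object, ensure_future wraps it into a Task on the event loop, and await suspends until it finishes. A minimal sketch (the fetch name and delay are illustrative, not part of the original demo):

```python
import asyncio

async def fetch(name, delay):
    # await suspends this coroutine, letting the event loop run others meanwhile
    await asyncio.sleep(delay)
    return f"{name} done"

async def main():
    # Calling an async def function does NOT run it; it returns a coroutine object
    coro = fetch("a", 0.01)
    # Wrapping the coroutine in a Task schedules it on the event loop
    task = asyncio.ensure_future(coro)
    # await suspends main() until the task completes, then yields its result
    return await task

print(asyncio.run(main()))  # → a done
```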

Concurrency with asyncio: gather and wait

gather is more high-level than wait: it can group tasks and collect their results, so it is generally the preferred choice.

wait is used when tasks need more customized handling.
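The practical difference shows up in the return values: gather yields an ordered list of results, while wait yields unordered done/pending sets of Task objects that you must unpack yourself. A small sketch (square is a made-up helper):

```python
import asyncio

async def square(x):
    await asyncio.sleep(0)  # yield control once, to simulate I/O
    return x * x

async def with_gather():
    # gather returns results in the order the awaitables were passed in
    return await asyncio.gather(square(2), square(3))

async def with_wait():
    # wait returns (done, pending) sets of Task objects; completion order is not guaranteed
    tasks = [asyncio.ensure_future(square(n)) for n in (2, 3)]
    done, pending = await asyncio.wait(tasks)
    return sorted(t.result() for t in done)

print(asyncio.run(with_gather()))  # → [4, 9]
print(asyncio.run(with_wait()))    # → [4, 9]
```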

Single-threaded (synchronous) version

# -*- coding: utf-8 -*-
# Scrape Baidu search result counts (keyword competition level)
# 20201113 @author WX: huguo00289

import re
import time

import requests
from fake_useragent import UserAgent

def search(keyword):
    total = ''
    ua = UserAgent()
    url = f'https://www.baidu.com/s?wd={keyword}&ie=UTF-8'
    headers = {
        'User-Agent': ua.random,
        'Cookie': 'BIDUPSID=E8605F17778754AD6BAA328A17329DAF; PSTM=1595994013; BAIDUID=E8605F17778754AD8EAC311EDCEC5A37:FG=1; BD_UPN=12314353; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; COOKIE_SESSION=75158_0_8_0_82_8_0_0_0_8_1_0_75159_0_1_0_1605083022_0_1605083023%7C9%230_0_1605083023%7C1; H_PS_645EC=c097mGOFZEl3IZjKw2lVOhIl4YyhcIr2Zp3YMimT2D62xwJo8q%2B9jeQnZq3gvUXMGbhD; BA_HECTOR=a42l8ka5ah8h0003611fqs8b60p; BD_HOME=1; H_PS_PSSID=32818_1452_33045_32939_33060_32973_32705_32961',
    }
    try:
        html = requests.get(url, headers=headers, timeout=5).content.decode('utf-8')
        # Extract the "Baidu found about N relevant results" count from the page
        total = re.search(r'<span class="nums_text">百度为您找到相关结果约(.+?)个</span>', html, re.M | re.I).group(1)
    except Exception as e:
        print(f"Error: {e}")
    if total != '':
        print(keyword, total)


def main():
    start = time.time()  # record the start timestamp
    keywords = ["seo优化技巧","百度站长平台","sem怎么学习","全网推广营销","seo网站优化方案","百度烧钱推广","自媒体推广策划"]
    for keyword in keywords:
        search(keyword)
    end = time.time()  # record the end timestamp
    print('Total runtime: {} seconds'.format(end - start))  # program elapsed time

asyncio + aiohttp async version using wait (in addition to the imports above, this also needs asyncio and aiohttp):

import asyncio
import aiohttp

async def get_content(keyword):
    ua = UserAgent()
    headers = {
        'User-Agent': ua.random,
        'Cookie': 'BIDUPSID=E8605F17778754AD6BAA328A17329DAF; PSTM=1595994013; BAIDUID=E8605F17778754AD8EAC311EDCEC5A37:FG=1; BD_UPN=12314353; BDORZ=B490B5EBF6F3CD402E515D22BCDA1598; COOKIE_SESSION=75158_0_8_0_82_8_0_0_0_8_1_0_75159_0_1_0_1605083022_0_1605083023%7C9%230_0_1605083023%7C1; H_PS_645EC=c097mGOFZEl3IZjKw2lVOhIl4YyhcIr2Zp3YMimT2D62xwJo8q%2B9jeQnZq3gvUXMGbhD; BA_HECTOR=a42l8ka5ah8h0003611fqs8b60p; BD_HOME=1; H_PS_PSSID=32818_1452_33045_32939_33060_32973_32705_32961',
    }
    async with aiohttp.ClientSession() as session:
        response = await session.get(f'https://www.baidu.com/s?wd={keyword}&ie=UTF-8',headers=headers,timeout=5)
        content = await response.read()
        return content




async def get_num(keyword):
    total = ''
    content = await get_content(keyword)
    try:
        html = content.decode('utf-8')
        # Extract the "Baidu found about N relevant results" count from the page
        total = re.search(r'<span class="nums_text">百度为您找到相关结果约(.+?)个</span>', html, re.M | re.I).group(1)
    except Exception as e:
        print(f"Error: {e}")
    if total != '':
        print(keyword, total)



def run():
    tasks = []
    start = time.time()  # record the start timestamp
    keywords = ["seo优化技巧","百度站长平台","sem怎么学习","全网推广营销","seo网站优化方案","百度烧钱推广","自媒体推广策划"]
    loop = asyncio.get_event_loop()
    for keyword in keywords:
        c = get_num(keyword)
        # Wrap the returned coroutine object into a task object
        task = asyncio.ensure_future(c)
        tasks.append(task)
    loop.run_until_complete(asyncio.wait(tasks))
    end = time.time()  # record the end timestamp
    print('Total runtime: {} seconds'.format(end - start))  # program elapsed time

asyncio + aiohttp async version using gather:

def run_gather():
    start = time.time()  # record the start timestamp
    keywords = ["seo优化技巧","百度站长平台","sem怎么学习","全网推广营销","seo网站优化方案","百度烧钱推广","自媒体推广策划"]
    tasks = [asyncio.ensure_future(get_num(keyword)) for keyword in keywords]
    loop = asyncio.get_event_loop()
    # gather groups the tasks into a single awaitable
    tasks = asyncio.gather(*tasks)
    loop.run_until_complete(tasks)
    end = time.time()  # record the end timestamp
    print('Total runtime: {} seconds'.format(end - start))  # program elapsed time
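On Python 3.7+, asyncio.run can replace the get_event_loop / run_until_complete boilerplate used above. A sketch of the gather runner in that style; the real aiohttp-based get_num is replaced here by a sleeping stub so the snippet runs standalone:

```python
import asyncio
import time

async def get_num(keyword):
    # Stand-in stub for the real aiohttp-based get_num above; sleeps to simulate network I/O
    await asyncio.sleep(0.1)
    return keyword

async def main(keywords):
    # gather schedules every coroutine concurrently on the loop that asyncio.run creates
    return await asyncio.gather(*(get_num(k) for k in keywords))

keywords = ["seo优化技巧", "百度站长平台", "sem怎么学习"]
start = time.time()
results = asyncio.run(main(keywords))
elapsed = time.time() - start
print(results)
# elapsed stays near 0.1s rather than 0.3s, because the three sleeps overlap
```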


References

  • [1] asyncio — Asynchronous I/O — Python 3.9.0 documentation

  • [2] asyncio + aiohttp asynchronous crawler

  • [3] Python crawler study notes: asyncio + aiohttp asynchronous crawler principles and analysis

  • [4] From 0 to 1: the evolution of Python asynchronous programming

  • [5] Concurrency with asyncio: gather and wait


Origin blog.csdn.net/minge89/article/details/109685500