Using pyppeteer to crawl Taobao products

Previously I used pyppeteer to bypass webdriver detection when logging in to Taobao, but that does not mean there is no detection after login. Today I will take crawling the product names returned for a search keyword as an example.

The whole process has 4 steps: 1. log in; 2. enter the keyword and click search; 3. scroll to the bottom and extract the data; 4. click to the next page, then repeat steps 3 and 4 until there is no next page. (In practice one account cannot crawl every page; if you want all pages you may have to buy or borrow accounts. In this tutorial I only crawl the first few pages of data.) First I need to build a framework. To keep the program simple I use an object-oriented design; the rough skeleton is as follows:

class TaoBaoSpider:
    async def login(self):
        pass
    async def search(self):
        pass
    async def crawl(self):
        pass

Because pyppeteer is built on Python's async/await mechanism, these methods must be coroutines, so each definition is preceded by the async keyword.
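As a quick standalone illustration of that mechanism (unrelated to Taobao itself): a coroutine defined with async does nothing when simply called; it must be awaited inside another coroutine or driven by an event loop, which is exactly how the constructor will run these methods later:

from asyncio import get_event_loop, sleep

async def demo():
    # await suspends this coroutine without blocking the event loop
    await sleep(1)
    print('done')

# the coroutine only runs once the event loop drives it
get_event_loop().run_until_complete(demo())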

Log in

There are two ways to log in: scanning a QR code with your phone, or entering an account and password. To keep things as simple and as undetectable as possible, I log in by scanning the QR code. Since the scan is done by hand, the login method is very simple to implement: it only needs to wait for a while. Note, however, that the sleep function from the time module cannot be used here; it is a synchronous, blocking call and must not appear in an asynchronous function. We need an asynchronous wait, namely the sleep coroutine from the asyncio module, used with the await keyword, as shown below.

from asyncio import sleep

class TaoBaoSpider:
    @staticmethod
    async def login():
        await sleep(10)

    async def search(self):
        pass

    async def crawl(self):
        pass

Here I set the wait to 10 seconds, which should be enough to finish the QR code login. If that feels too tight, change 10 to a larger number.
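Alternatively, rather than guessing a duration, you can let pyppeteer wait for the navigation that follows a successful login. This is just a variant sketch, not what the rest of this article uses; it assumes a successful scan redirects away from the login page and that the page object is passed in:

async def login(page):
    # resolves once the browser navigates away from the login page,
    # i.e. after the QR code has been scanned and accepted
    await page.waitForNavigation({'timeout': 60000})  # wait up to 60 s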

Next we need to test. Before testing, we have to write the class's constructor to initialize some attributes. Because the search and crawl methods need a browser object, we must initialize one, as shown below.

self.browser = await launch(headless=False, args=['--disable-infobars', f'--window-size={self.width},{self.height}'])

We all know that the await keyword must appear in the body of a function defined with async. Some people might therefore try this:

async def __init__(self):

This is wrong; the IDE flags it with an error (shown in the figure) which, roughly translated, says that the function "__init__" cannot be async. So we need to define a separate asynchronous method to complete the asynchronous initialization; I call this method init. The code below mainly adds __init__ and init.

from asyncio import sleep, get_event_loop
from pyppeteer import launch

class TaoBaoSpider:
    def __init__(self):
        self.width, self.height = 1500, 800
        get_event_loop().run_until_complete(self.init())
        get_event_loop().run_until_complete(self.login())
        get_event_loop().run_until_complete(self.search())
        get_event_loop().run_until_complete(self.crawl())

    async def init(self):
        # noinspection PyAttributeOutsideInit
        self.browser = await launch(headless=False,
                                    args=['--disable-infobars', f'--window-size={self.width},{self.height}'])
        # noinspection PyAttributeOutsideInit
        self.page = await self.browser.newPage()
        await self.page.setViewport({'width': self.width, 'height': self.height})
        await self.page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')
        # hide navigator.webdriver so the page cannot detect the automation
        await self.page.evaluate('()=>{Object.defineProperties(navigator,{webdriver:{get:()=>false}})}')

    @staticmethod
    async def login():
        await sleep(10)

    async def search(self):
        pass

    async def crawl(self):
        pass

When the program runs, a Chromium browser pops up showing the Taobao login page. Scan the QR code with your phone and the login succeeds without being detected. With the login part tested, we can move on to the search method.

 

Search

Before writing the search method, let's consider a question: isn't it too fast to search the instant we log in? Accessing too quickly makes it easy to get banned, so we need to slow down, which we can again achieve with an asynchronous wait.

That raises another key question: how long should we wait? Too long and efficiency suffers; too short and we risk detection. Think about how fast a human browses Taobao: I figure roughly one click every 1 to 4 seconds, at random. Some people might take it for granted that this is simple; just initialize the interval in the constructor, like this:

self.sleep_time = 1+random()*3

In fact, this is wrong: the field is initialized once and never changes, so the interval between every two actions is identical. Human operations are never that mechanical, so it would be a miracle not to get detected. We need a value that changes every time it is used. Since the value keeps changing we could define it as a method, but then every call needs a pair of parentheses, which is cumbersome. Instead, we decorate the method with property, so it can be read like an attribute with no parentheses. Below I implement the search method as well.
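First, to see the difference concretely, here is a minimal standalone sketch (the Demo class is mine, purely for illustration): a plain attribute keeps the value computed once in the constructor, while a property recomputes its value on every access:

from random import random

class Demo:
    def __init__(self):
        self.fixed = 1+random()*3  # computed once, never changes

    @property
    def sleep_time(self):
        return 1+random()*3  # recomputed on every access

d = Demo()
print(d.fixed, d.fixed)            # the same value twice
print(d.sleep_time, d.sleep_time)  # two (almost surely) different values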

from asyncio import sleep, get_event_loop
from pyppeteer import launch
from random import random

class TaoBaoSpider:
    def __init__(self):
        self.width, self.height = 1500, 800
        get_event_loop().run_until_complete(self.init())
        get_event_loop().run_until_complete(self.login())
        get_event_loop().run_until_complete(self.search())
        get_event_loop().run_until_complete(self.crawl())

    async def init(self):
        # noinspection PyAttributeOutsideInit
        self.browser = await launch(headless=False,
                                    args=['--disable-infobars', f'--window-size={self.width},{self.height}'])
        # noinspection PyAttributeOutsideInit
        self.page = await self.browser.newPage()
        await self.page.setViewport({'width': self.width, 'height': self.height})
        await self.page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')
        await self.page.evaluate('()=>{Object.defineProperties(navigator,{webdriver:{get:()=>false}})}')

    @staticmethod
    async def login():
        await sleep(10)

    @property
    def sleep_time(self):
        # a fresh random interval between 1 and 4 seconds on every access
        return 1+random()*3

    async def search(self):
        await self.page.click('#q')
        await sleep(self.sleep_time)
        await self.page.keyboard.type('机械革命')
        await sleep(self.sleep_time)
        await self.page.click('#J_TSearchForm > div.search-button > button')
        await sleep(self.sleep_time)

    async def crawl(self):
        pass

if __name__ == '__main__':
    TaoBaoSpider()

The search keyword here is hard-coded; you can easily change it so the keyword is passed in as a parameter, as sketched below.
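A minimal sketch of that change (only the affected parts are shown; init, login and crawl stay as above, and the parameter name keyword is mine):

class TaoBaoSpider:
    def __init__(self, keyword='机械革命'):
        self.width, self.height = 1500, 800
        self.keyword = keyword  # store the keyword instead of hard-coding it
        get_event_loop().run_until_complete(self.init())
        get_event_loop().run_until_complete(self.login())
        get_event_loop().run_until_complete(self.search())
        get_event_loop().run_until_complete(self.crawl())

    async def search(self):
        await self.page.click('#q')
        await sleep(self.sleep_time)
        await self.page.keyboard.type(self.keyword)  # type the stored keyword
        await sleep(self.sleep_time)
        await self.page.click('#J_TSearchForm > div.search-button > button')
        await sleep(self.sleep_time)

# usage: TaoBaoSpider('机械革命'), or any other keyword

Now for the most critical steps: crawling the data.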

Crawl data

What we will crawl is the product name from the search results: the two lines below the price in the figure below.

By inspecting the elements, we can find the HTML that each name corresponds to in the page source, as shown in the following figure:

Since there are further tags nested inside the a tag, we need to filter twice: first extract the content of the a tag, then strip the remaining tags and whitespace from what was extracted. So two regular expressions are needed. The first extracts the content of the a tag; after repeated comparison with the page source, the final regex is:

pattern = compile(r'<a id=".*?" class="J_ClickStat".*?>(.*?)</a>', S)

The second is the regex used for the replacement:

repl_pattern = compile(r'<.*?>|\s+')
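To sanity-check the two patterns, you can run them on a fragment shaped like the page source (the snippet below is a made-up example for illustration, not real Taobao HTML):

from re import compile, S

pattern = compile(r'<a id=".*?" class="J_ClickStat".*?>(.*?)</a>', S)
repl_pattern = compile(r'<.*?>|\s+')

html = '<a id="x" class="J_ClickStat" href="#">\n  <span>机械革命</span> 游戏本\n</a>'
for result in pattern.findall(html):
    print(repl_pattern.sub('', result))  # prints: 机械革命游戏本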

Next we try to get the data from the first 5 pages. First scroll to the bottom (not in one jump; we must imitate a human, so here I use uniformly accelerated scrolling, x = v0*t + 1/2*a*t^2 with v0 = 0, where the acceleration is a random value; a fixed one might be detected), then fetch the page source and filter out the data, and finally click through to the next page and repeat. The complete source code is given below.

from asyncio import sleep, get_event_loop
from pyppeteer import launch
from random import random
from re import compile, S

class TaoBaoSpider:
    def __init__(self):
        self.width, self.height = 1500, 800
        get_event_loop().run_until_complete(self.init())
        get_event_loop().run_until_complete(self.login())
        get_event_loop().run_until_complete(self.search())
        get_event_loop().run_until_complete(self.crawl())

    async def init(self):
        # noinspection PyAttributeOutsideInit
        self.browser = await launch(headless=False,
                                    args=['--disable-infobars', f'--window-size={self.width},{self.height}'])
        # noinspection PyAttributeOutsideInit
        self.page = await self.browser.newPage()
        await self.page.setViewport({'width': self.width, 'height': self.height})
        await self.page.goto('https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/')
        await self.page.evaluate('()=>{Object.defineProperties(navigator,{webdriver:{get:()=>false}})}')

    @staticmethod
    async def login():
        await sleep(10)

    @property
    def sleep_time(self):
        return 1+random()*3

    async def search(self):
        await self.page.click('#q')
        await sleep(self.sleep_time)
        await self.page.keyboard.type('机械革命')
        await sleep(self.sleep_time)
        await self.page.click('#J_TSearchForm > div.search-button > button')
        await sleep(self.sleep_time)

    async def crawl(self):
        pattern = compile(r'<a id=".*?" class="J_ClickStat".*?>(.*?)</a>', S)
        repl_pattern = compile(r'<.*?>|\s+')
        for i in range(5):
            height = await self.page.evaluate('document.body.clientHeight')
            scrolled_height = 0
            a = 1+random()  # random acceleration
            t = 1
            # scroll down with uniform acceleration: x=v0*t+1/2*a*t**2, v0=0
            while scrolled_height < height:
                scrolled_height = int(1/2*a*t**2)
                await self.page.evaluate(f'window.scrollTo(0,{scrolled_height})')
                t += 1
            await sleep(self.sleep_time)
            html = await self.page.content()
            results = pattern.findall(html)
            for result in results:
                result = repl_pattern.sub('', result)
                print(result)
            print()
            await sleep(self.sleep_time)
            # click the "next page" button
            await self.page.click('#mainsrp-pager > div > div > div > ul > li.item.next > a')
            await sleep(self.sleep_time)
        await sleep(self.sleep_time)

if __name__ == '__main__':
    TaoBaoSpider()

The result of running the program is shown in the figure.

As you can see, the data displayed on the web page has been crawled. Finally, here is a summary of some techniques for dealing with this kind of particularly strict anti-crawling website:

1. Simulate human operation, and wait a random amount of time between requests.

2. If login is required, log in manually where possible; automated login is more likely to be detected.

3. Do not visit too many times with one account (if login is required) or one IP. If you want to crawl a lot of data, use multiple accounts (where login is needed) or multiple IPs, as sketched below.
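For the last point, one common way to spread requests across IPs is Chromium's --proxy-server switch, which pyppeteer passes through to the browser. A minimal sketch, assuming you already have a pool of working proxies (the addresses below are placeholders):

from asyncio import get_event_loop
from random import choice
from pyppeteer import launch

PROXIES = ['http://127.0.0.1:8001', 'http://127.0.0.1:8002']  # placeholders

async def new_browser():
    # each launch routes its traffic through a randomly chosen proxy
    return await launch(headless=False,
                        args=[f'--proxy-server={choice(PROXIES)}'])

# browser = get_event_loop().run_until_complete(new_browser())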


Origin blog.csdn.net/zhangge3663/article/details/108202145