The first tutorial of 2021: an introduction to web data crawling and scraping

An introductory tutorial on web data scraping

Earlier, we looked at how a page is put together and learned the basic structure of a web page. Next, we move on to the second stage of learning: data scraping.

2.1 Before scraping, we need to understand the crawler protocol (robots.txt). By checking it, we can see which crawlers a website allows and which data may be crawled. You can view it by appending /robots.txt to the site's root URL. For example, entering http://baidu.com/robots.txt shows Baidu's crawling rules: each block starts with the crawler name (the User-agent line), and the Disallow lines that follow list the file paths that crawler is not allowed to fetch.
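As a quick illustration (a minimal sketch; it uses the requests library, which we install in the next section), you can fetch a site's robots.txt and print it directly:

import requests

# Fetch a site's robots.txt and print it; the User-agent and Disallow
# lines show which crawlers are restricted from which paths.
resp = requests.get('http://baidu.com/robots.txt')
print(resp.text)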

2.2 Learning the crawler code:

First install the requests library: enter pip install requests in the console and the installation completes automatically. The requests library provides a get method that fetches the response content of a specified page.
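A minimal sketch to confirm the installation works (the URL here is only an example):

import requests

# Quick check that requests is installed and can fetch a page.
resp = requests.get('https://www.example.com')
print(resp.status_code)  # 200 means the request succeeded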

We generally write a crawler in three steps:

1. Find the data: open the browser, open the developer tools, view the page source, and locate the data we want.

2. Fetch the page with the program.

3. Parse the response.

First, we use the developer tools to find where the content we want sits on the page. Then we can send a request to the page and get the response content. While we are in the developer tools, copy the User-Agent and Cookie values for later use.

Import the requests library

import requests  # import the library

resp = requests.get(url='https://www.uisdc.com/',
                    headers={
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
                        'Cookie': 'Hm_lvt_7aeefdb15fe9aede961eee611c7e48a5=1610174500; notLoginHasViewPages=2; Hm_lpvt_7aeefdb15fe9aede961eee611c7e48a5=1610174550'})

The general form is resp = requests.get(url=URL, headers={...}), where headers is a dictionary of disguise information; the two most commonly used entries are the User-Agent and Cookie shown above.

This gives us the page's response. If we print resp, we generally see <Response [200]>, which means the page was fetched normally. To see the actual content, we only need to access the object's .text attribute:

resp = requests.get(url='https://www.uisdc.com/',
                    headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75',
                             'Cookie': 'Hm_lvt_7aeefdb15fe9aede961eee611c7e48a5=1610174500; notLoginHasViewPages=2; Hm_lpvt_7aeefdb15fe9aede961eee611c7e48a5=1610174550'})
print(resp.text)
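As mentioned above, printing resp itself shows the status; a small sanity check before parsing might look like this:

print(resp)  # e.g. <Response [200]> when the request succeeded
if resp.status_code == 200:
    print(len(resp.text), 'characters of HTML received')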

Now we can see all of the page's code. After fetching the page, though, we find there is far too much of it; the data we need is buried inside and hard to pick out at a glance, so we move on to the third step: parsing.

There are many ways to parse a page, such as regular expression parsing, CSS selector parsing, and XPath parsing (a short XPath sketch follows the CSS selector section below).

  • Regular expression parsing

Let's try regular expression parsing first. Suppose I want all the article titles on this page. Through the developer tools, I found that all of them sit inside h2 tags whose class is title. I imported re, studied the title tag carefully, and found that each title looks like <h2 class="title">Title content</h2>.

import re

# Non-greedy group captures everything between class="title"> and </h2>
pattern = re.compile(r'class="title">(.*?)</h2>')
print(pattern.findall(resp.text))  # findall returns a list of all matched titles

After writing this regular expression and matching it against the page's code, I got the titles.

  • CSS selector parsing

After installing the bs4 library (pip install beautifulsoup4), we can parse with CSS selectors.

With the CSS selector, the first step is the same as with the regular expression: fetch the page. The next step is different. Because this is selector-based parsing, the page's tags need to be organized first so they don't remain a jumble, so the second step is to use bs4 to convert the page content into something the selector can understand:

import bs4

soup = bs4.BeautifulSoup(resp.text, 'html.parser')

Here we use bs4's BeautifulSoup class to parse resp.text with 'html.parser' into content that bs4 can read, and save it in the variable soup.

Inspecting the page again, we find that the title text is inside h2 tags whose class is defined as title.

soup = bs4.BeautifulSoup(resp.text, 'html.parser')
anchors = soup.select('h2.title')
print(anchors)

Here I use bs4's select method to find all h2 tags whose class is defined as title.

Printing the result shows a pile of data still mixed with tags: select returns a list of tag objects. We can loop over these tags and use each tag's text attribute to get the text inside it, and in this way collect all the titles.

soup = bs4.BeautifulSoup(resp.text, 'html.parser')
titles = soup.select('h2.title')
for title in titles:
    print(title.text)

One thing to explain: the string we passed when selecting the tags, 'h2.title', is called a tag (CSS) selector. You can find a dedicated tutorial on tag selectors later to study them in detail; select and tag selectors have many other uses, but I won't cover the rest here so as not to confuse everyone.
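The parsing section above also mentioned XPath parsing. As a minimal sketch only, assuming the lxml library (pip install lxml), which this tutorial does not otherwise use, selecting the same h2.title elements with XPath might look like this:

from lxml import etree

# Parse the HTML fetched earlier and select h2 tags whose class is exactly "title".
tree = etree.HTML(resp.text)
titles = tree.xpath('//h2[@class="title"]/text()')
print(titles)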

2.3 Data persistence

Getting the data is not enough; we also need to write it to a file for long-term storage so it can be used at any time later. There are many ways to write to Excel. I will teach only one, so as not to confuse everyone: install the xlwt library (pip install xlwt), which has very good compatibility and performance.

An Excel file itself is a workbook, a sheet inside it is a worksheet, and each grid we write into is called a cell. So after installing xlwt, the work is simple: import the library, create a workbook, create a worksheet, and write the data.

soup = bs4.BeautifulSoup(resp.text, 'html.parser')
lists = soup.select('a.a_block')   # each a.a_block wraps one article card
print(lists)
for titles in lists:
    title = titles.select_one('h2.title').text  # the article title
    nue = titles.select_one('p').text           # the short paragraph under the title

To prepare the data for writing, I changed the code above a little and also grabbed the short paragraph under each title, which makes it convenient to write to Excel.

First we import the library, create a workbook, add a worksheet, and name it:

import xlwt
wb = xlwt.Workbook()          # create the workbook
sheet = wb.add_sheet('表单')  # add a worksheet named 表单 ("form")

Write the header cells (序号 = number, 标题 = title, 详情 = details):

sheet.write(0,0,'序号')
sheet.write(0,1,'标题')
sheet.write(0,2,'详情')

Next, we only need to write the data into the cells.

import bs4
import xlwt

# resp is the page response fetched earlier with requests.get

wb = xlwt.Workbook()
sheet = wb.add_sheet('表单')

# header row: 序号 (number), 标题 (title), 详情 (details)
sheet.write(0, 0, '序号')
sheet.write(0, 1, '标题')
sheet.write(0, 2, '详情')
a = 0  # row counter

soup = bs4.BeautifulSoup(resp.text, 'html.parser')
lists = soup.select('a.a_block')
print(lists)
for titles in lists:
    title = titles.select_one('h2.title').text   # article title
    nue = titles.select_one('p').text            # short paragraph under the title
    lis_name = [title, nue]
    a += 1                                       # next row (also used as the 序号)
    sheet.write(a, 0, a)                         # column 0: serial number
    for index, name in enumerate(lis_name):
        sheet.write(a, index + 1, name)          # columns 1 and 2: 标题 and 详情
wb.save('页面.xls')

At this point, the basics of crawlers are covered. Simple small websites can generally be worked through like this to crawl data. Next, a few things outside the outline.

Using a proxy IP

When we crawl data, we often perform a large number of requests in a short time. Site operators are not fools and will not simply let you crawl freely; sometimes your IP will be blocked. At this point we need a proxy IP. So how do we use one?

In fact, there is only one difference: when we make the get request, we add a proxies argument and put our proxy IP address inside it, and then we can crawl without restraint. The proxy IP address needs to be bought from a professional provider; domestic IPs are still relatively cheap. Below I use the same website as above to write an example for reference.

resp = requests.get(url='https://www.uisdc.com/',
                    headers={
                        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36 Edg/87.0.664.75'},
                    proxies={
                        'https': 'http://ip:port'})  # replace ip and port with your purchased proxy; that's all there is to it

Basically, some sites will not give you text data directly; instead they give you a link, and when you request that link you receive JSON data. We need to convert it and take out what we want, operating according to the specific website.
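A minimal sketch of handling such a response: requests can decode a JSON body directly with .json(). The URL below is just a public test endpoint that returns JSON; replace it with the link you found in the developer tools.

import requests

# httpbin.org/json is a public test endpoint that returns a JSON body.
resp = requests.get('https://httpbin.org/json')
data = resp.json()  # parse the JSON body into Python dicts and lists
print(type(data))
print(data)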

Origin blog.csdn.net/SaharaLater/article/details/112396354