How to crawl data with a Python crawler in just six steps!

Crawling data with a Python crawler is really easy. You just need to master these six steps, and none of them is complicated. I used to think crawling was difficult, but once I got started, I learned how to crawl a page in less than an hour.

Python crawling in six steps

Step 1: Install the requests and BeautifulSoup libraries:

The two libraries are imported in the program like this:

import requests
from bs4 import BeautifulSoup

Since I use PyCharm for Python programming, I will explain how to install these two libraries in PyCharm. From the File menu on the main page, open Settings, then find the Project Interpreter section. In the panel that opens, click the + sign above the package list to search for and install the two packages. This will probably go more smoothly for readers who already have an interpreter configured. The specific steps are shown in the figure below.
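
If you prefer the command line to PyCharm's package manager, the same two libraries can also be installed with pip (note that BeautifulSoup is published under the package name beautifulsoup4):

pip install requests beautifulsoup4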

Step 2: Obtain the headers and cookies required by the crawler:

I wrote a crawler that scrapes the Weibo hot search list, so let's use it as the example here. Obtaining headers and cookies is necessary for a crawler program: they directly determine whether the crawler can accurately locate and fetch the page it wants to crawl.

First, open the Weibo hot search page and press F12 to bring up the browser's developer tools, which show the page's underlying code, as in the figure below. Find the Network tab and press Ctrl+R to refresh the page. If the request information is already listed, there is no need to refresh, though refreshing does no harm. Then browse the Name column, find the request for the page we want to crawl, right-click it, and under Copy, copy the request as a cURL command (this is what the converter site in the next step expects), just like in the picture below.

After copying it, open the website "Convert curl commands to code". This page automatically generates the headers and cookies from what you copied, as shown below. Just copy the generated headers and cookies and paste them into the program.

# Crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh1ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)

Paste it into the program like this. These are the request headers and cookies for the Weibo hot search page.

Step 3: Get the web page:

Once we have the headers and cookies, we can copy them into our program and then use requests.get() to fetch the web page.

# Get the web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)
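
Before moving on to parsing, it can help to confirm that the request actually went through. The snippet below is just an optional quick check using standard attributes of the requests response object:

print(response.url)          # the final URL, including the ?cate=realtimehot query string from params
print(response.status_code)  # 200 means the page was returned successfully
response.raise_for_status()  # raise an error for any non-2xx status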

Step 4: Parse the web page:

At this point, we need to go back to the web page. Press F12 again and find the Elements tab of the developer tools. Click the small arrow-in-a-box icon in the upper-left corner, as shown below, and then click on the page content you are interested in. The code corresponding to the part you clicked will automatically be highlighted on the right.

After we find the code for the part of the page we want to crawl, as in the picture above, we place the mouse on that code, right-click, and choose Copy → Copy selector.
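
In the program, parsing corresponds to handing the downloaded HTML to BeautifulSoup, exactly as in the complete example at the end of this article:

# Parse the web page
response.encoding = 'utf-8'
soup = BeautifulSoup(response.text, 'html.parser')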

Step 5: Analyze the obtained information and simplify the address:

The selector we just copied is essentially the address where the corresponding part of the web page is stored. Since what we want is a whole category of information on the page, we need to analyze and generalize the address we obtained. Of course, using that address as-is is not impossible, but then you would only get the single item you clicked on.

#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(2) > td.td-02 > a
#pl_top_realtimehot > table > tbody > tr:nth-child(9) > td.td-02 > a

These are three of the addresses I obtained. They are clearly very similar; the only difference is the tr part. tr is an HTML tag, and the :nth-child(n) suffix is a sub-selector that picks out one specific row. We can infer that this type of information is stored under the tr rows, so if we select from tr directly, without restricting it to a single row, we get all the entries of this kind at once. The simplified address is therefore:

#pl_top_realtimehot > table > tbody > tr > td.td-02 > a

This step will probably come more easily to readers who have some familiarity with front-end languages such as HTML and CSS, but it doesn't matter if you have no such background. The main idea is to keep the parts that are the same and drop the parts that differ; try it a few times and you will get it right, as the small check below shows.
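
As a quick sanity check, you can compare the two addresses directly with the soup object from the parsing step: the row-specific selector returns a single entry, while the simplified one returns the whole column.

# Only the first hot-search entry
first = soup.select('#pl_top_realtimehot > table > tbody > tr:nth-child(1) > td.td-02 > a')
# Every hot-search entry in the table
all_entries = soup.select('#pl_top_realtimehot > table > tbody > tr > td.td-02 > a')
print(len(first), len(all_entries))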

Step 6: Crawl the content and clean the data:

With the previous step complete, we can crawl the data directly. Store the simplified address from above in a variable; passing it to soup.select() will pull out the page content we want.

# Content to crawl
content="#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"

After that, we use soup.select() together with the .text attribute to filter out the unnecessary information, such as the surrounding HTML and scripts, so that it does not get in the way of whoever reads the data. With that, we have successfully crawled the information.

fo = open("./微博热搜.txt", 'a', encoding="utf-8")
a = soup.select(content)
for i in range(len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()

I saved the data to a text file, which is why the code calls write(). Where readers save the data, and how they use it, is entirely up to them.
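
As a small variation on the same idea (a sketch, not a change to the method), the file handling can also be written with a with block, which closes the file automatically even if an error occurs:

# Same cleaning loop, but the file is closed automatically
with open("./微博热搜.txt", 'a', encoding="utf-8") as fo:
    for item in soup.select(content):
        fo.write(item.text + '\n')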

The complete code for crawling Weibo hot searches:

import os
import requests
from bs4 import BeautifulSoup
# Crawler header data
cookies = {
    'SINAGLOBAL': '6797875236621.702.1603159218040',
    'SUB': '_2AkMXbqMSf8NxqwJRmfkTzmnhboh1ygvEieKhMlLJJRMxHRl-yT9jqmg8tRB6PO6N_Rc_2FhPeZF2iThYO9DfkLUGpv4V',
    'SUBP': '0033WrSXqPxfM72-Ws9jqgMF55529P9D9Wh-nU-QNDs1Fu27p6nmwwiJ',
    '_s_tentry': 'www.baidu.com',
    'UOR': 'www.hfut.edu.cn,widget.weibo.com,www.baidu.com',
    'Apache': '7782025452543.054.1635925669528',
    'ULV': '1635925669554:15:1:1:7782025452543.054.1635925669528:1627316870256',
}
headers = {
    'Connection': 'keep-alive',
    'Cache-Control': 'max-age=0',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.89 Safari/537.36 SLBrowser/7.0.0.6241 SLBChan/25',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Sec-Fetch-Site': 'cross-site',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-User': '?1',
    'Sec-Fetch-Dest': 'document',
    'Accept-Language': 'zh-CN,zh;q=0.9',
}
params = (
    ('cate', 'realtimehot'),
)
# Data storage
fo = open("./微博热搜.txt", 'a', encoding="utf-8")
# Get the web page
response = requests.get('https://s.weibo.com/top/summary', headers=headers, params=params, cookies=cookies)
# Parse the web page
response.encoding='utf-8'
soup = BeautifulSoup(response.text, 'html.parser')
# Content to crawl
content="#pl_top_realtimehot > table > tbody > tr > td.td-02 > a"
# Clean the data
a = soup.select(content)
for i in range(len(a)):
    a[i] = a[i].text
    fo.write(a[i] + '\n')
fo.close()
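
Running the script appends the current hot-search titles, one per line, to 微博热搜.txt in the script's working directory; the file is created automatically on the first run.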

Source: blog.csdn.net/Everly_/article/details/133138470