A summary of Python web-scraping methods: fetching webpage data with a Python crawler

This article walks through the steps of fetching simple webpage data with Python, and along the way covers why a scraping script sometimes fails to retrieve page content correctly.


Python is well suited to data processing, and it is a good choice for writing a crawler: many ready-made packages exist, and you only need to call them to accomplish otherwise complex tasks. Before we start, we need to install a few dependencies. Open a command line and make sure Python and pip are available on your computer; if not, install them first.

After that we can use pip to install the prerequisite module requests (the later sections also use BeautifulSoup, which can be installed the same way under the package name beautifulsoup4):

pip install requests

requests is a simple, easy-to-use HTTP library implemented in Python, much simpler to use than urllib. requests lets you send HTTP/1.1 requests: specify the URL, optionally add query-string parameters, and start fetching webpage content.
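For example, requests can assemble the query string for you from a dict of parameters. This small sketch (the example.com URL and parameters are placeholders for illustration) builds a request without actually sending it, just to show the URL requests would produce:

```python
from requests import Request

# Build (but do not send) a request, to see how requests assembles
# a URL from a base URL plus query-string parameters.
prepared = Request(
    "GET",
    "https://example.com/search",
    params={"q": "python", "page": 1},
).prepare()

print(prepared.url)  # https://example.com/search?q=python&page=1
```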

1. Grab the source code of the web page

Taking this platform as an example, grab the company name data in the web page, web page link: https://www.crrcgo.cc/admin/crr_supplier.html?page=1
We will extract the company names from this page's HTML source.
First, lay out the steps:
1. Open the target site
2. Fetch the target site's source code and print it

import requests

This imports the requests module we need.

page=requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')

This line uses the GET method to fetch the web page at the given URL. In effect, we receive the same page data a browser would get when opening that address.

print(page.text)

This line prints the text content of the response we obtained.

import requests
page=requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')
print(page.text)

We have successfully fetched the source code of the target page.

2. Grab the content of a label in the source code of a webpage

But the output above is full of angle-bracketed markup, which is not directly useful to us. This markup is the webpage file we received from the server: just as Office documents come in doc and pptx formats, webpage files usually come in HTML format, and our browsers render that HTML into the pages we see.
To extract valuable data from these characters, we first need to understand tag elements. The text content of each tag sits between an opening and a closing angle-bracket pair (the closing tag starts with /). The name inside the brackets (such as img or div) indicates the element type (an image or a text container), and tags can carry additional attributes (such as src).
The text inside a tag is the data we need, but to locate the right tag among many we usually rely on its id or class attribute.

We can open any webpage in a browser and press F12 to open the element inspector (Elements), where you can see the hundreds of markup elements that make up the page. They are nested layer by layer; for example, below a div element is nested inside body: body is the parent (outer) element, and div is the child (inner) element.

<body>
    <div>十分钟上手数据爬虫</div>
</body>
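As a quick sketch of how such markup is parsed (using the BeautifulSoup library that this article introduces shortly), we can load this tiny snippet and inspect the parent/child relationship:

```python
from bs4 import BeautifulSoup

# Parse the two-element snippet from above.
snippet = "<body><div>十分钟上手数据爬虫</div></body>"
soup = BeautifulSoup(snippet, "html.parser")

div = soup.find("div")        # the child (inner) element
print(div.name)               # div
print(div.parent.name)        # body -- its parent (outer) element
print(div.text)               # 十分钟上手数据爬虫
```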

Back to our crawl: we only want the company-name data in the page, nothing else. Inspecting the page's HTML, we find that each company name sits inside a div element whose class is detail_head.

import requests
req=requests.get('https://www.crrcgo.cc/admin/crr_supplier.html?page=1')

These two lines, explained above, fetch the page data.

from bs4 import BeautifulSoup

We use the BeautifulSoup module to turn the angle-bracket-filled HTML into a more usable structure. The statement "from bs4 import BeautifulSoup" means importing BeautifulSoup from the bs4 package; bs4 contains several modules, and BeautifulSoup is just one of them.

req.encoding = "utf-8"

This tells requests to decode the fetched web content as UTF-8.

soup = BeautifulSoup(req.text, 'html.parser')

This line uses the HTML parser (html.parser) to parse the HTML text we fetched with requests; soup holds the parsed result.

company_item = soup.find('div', class_="detail_head")

find returns the first matching element, while find_all returns all of them. Here we find the first div element whose class attribute is detail_head.

dd = company_item.text.strip()

The .text property extracts the tag's text with the angle-bracketed markup stripped away, and the strip() method removes specified characters (whitespace, including newlines, by default) from the beginning and end of the string.
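To make the find / find_all / strip() distinction concrete, here is a self-contained sketch on a hypothetical HTML fragment shaped like the target page (the company names are made up):

```python
from bs4 import BeautifulSoup

# A made-up fragment mimicking the target page's structure.
html = """
<div class="detail_head"> Company A </div>
<div class="detail_head"> Company B </div>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.find("div", class_="detail_head")      # first match only (a Tag)
every = soup.find_all("div", class_="detail_head")  # all matches (a list)

print(first.text.strip())  # Company A  (.text drops the tags, strip() the spaces)
print(len(every))          # 2
```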

After the final splicing, the code is as follows:

import requests
from bs4 import BeautifulSoup

req = requests.get(url="https://www.crrcgo.cc/admin/crr_supplier.html?page=1")
req.encoding = "utf-8"
soup = BeautifulSoup(req.text, features="html.parser")
company_item = soup.find("div",class_="detail_head")
dd = company_item.text.strip()
print(dd)

Running this successfully captures the company information we wanted from the page, but only one company: find returns only the first match.

So we switch to find_all and add a loop to grab all the company names in the page; the code barely changes:

for company_item in company_items:
    dd = company_item.text.strip()
    print(dd)

The final code is as follows:

import requests
from bs4 import BeautifulSoup

req = requests.get(url="https://www.crrcgo.cc/admin/crr_supplier.html?page=1")
req.encoding = "utf-8"
soup = BeautifulSoup(req.text, features="html.parser")
company_items = soup.find_all("div", class_="detail_head")
for company_item in company_items:
    dd = company_item.text.strip()
    print(dd)

The final run prints all the company names found on the page.

3. Grab content from multiple pages

What if we now want to crawl company names across multiple pages? It's simple: the main code is already written, and we only need one more loop.
Looking at the pages we want to crawl, we find that as we move between pages only the number after page= changes. (Of course, the URLs of many large sites, such as JD.com and Taobao, are often deliberately hard to guess.)

inurl="https://www.crrcgo.cc/admin/crr_supplier.html?page="
for num in range(1,6):
    print("================ Crawling page "+str(num)+" ================")

We write the loop to grab only pages 1 to 5, using the range function. A key feature of range is that it is half-open: it includes the start value and excludes the stop value, so to grab 5 pages we must write range(1, 6).
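A quick demonstration of the half-open behavior:

```python
# range(start, stop) includes start but excludes stop,
# so range(1, 6) yields the five page numbers 1..5.
pages = list(range(1, 6))
print(pages)  # [1, 2, 3, 4, 5]
```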

    outurl=inurl+str(num)
    req = requests.get(url=outurl)

This splices the loop counter onto the base URL to form each page's complete URL, then fetches that page's data.
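The splicing itself can be sketched on its own, without any network access; the f-string version shown at the end is an equivalent alternative, not what this article's code uses:

```python
base = "https://www.crrcgo.cc/admin/crr_supplier.html?page="

# String concatenation, as in the article's loop:
urls = [base + str(num) for num in range(1, 6)]
print(urls[0])   # ...page=1
print(urls[-1])  # ...page=5

# An equivalent, often more readable f-string form:
urls_f = [f"{base}{num}" for num in range(1, 6)]
assert urls == urls_f
```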

The complete code is as follows:

import requests
from bs4 import BeautifulSoup

inurl="https://www.crrcgo.cc/admin/crr_supplier.html?page="
for num in range(1,6):
    print("================ Crawling page "+str(num)+" ================")
    outurl=inurl+str(num)
    req = requests.get(url=outurl)
    req.encoding = "utf-8"
    soup = BeautifulSoup(req.text, features="html.parser")
    company_items = soup.find_all("div", class_="detail_head")
    for company_item in company_items:
        dd = company_item.text.strip()
        print(dd)

We have successfully crawled all the company names across pages 1 to 5.

I have been studying recently, and the material is scattered, so I recorded my learning results both for my own reference and to help others. I hope this article is helpful to you; if there are mistakes, please point them out! And if you like it, don't forget to give it a like!



Origin blog.csdn.net/mynote/article/details/132318429