"Reptile" is a small test, and you can master it without coding knowledge~

In the big data era of information explosion, we are surrounded by countless data every day. How can you quickly find the information you need in this vast digital ocean? Traditional search methods may no longer meet our needs. This is where a technology called the "crawler" comes in, and it has become an indispensable tool of the big data era.

Let's start with a small example of what a crawler can do. Taking Python as our tool, we will crawl the hero avatars from the official Honor of Kings website and save the images from the webpage. A Python image crawler can be broken into four steps: clarify the purpose, send the request, parse the data, and save the data. The concrete steps are as follows.

1. Crawler example


1.1 Clarify the purpose

Open the hero list page of the official Honor of Kings website. This page contains the avatar images of all the heroes. The URL is as follows.

https://pvp.qq.com/web201605/herolist.shtml

1.2 Send a request

Use the requests library to send a request; if the returned status code is 200, the server connection is normal.

import requests

u = 'https://pvp.qq.com/web201605/herolist.shtml'
response = requests.get(u)
print('Status code: {}'.format(response.status_code))
if response.status_code == 200:
    print('Server connection is normal')

1.3 Data analysis

Before parsing the data, you need to install pyquery in advance. The pyquery library is similar to Beautiful Soup: you pass in HTML text to initialize a PyQuery object, and the initialization methods include passing in a string directly, passing in a URL, or passing in a file name. Here the URL is passed in, and then nodes are searched.

# Parse the data
from pyquery import PyQuery

doc = PyQuery(url=u)  # initialize the PyQuery object directly from the URL
items = doc('.herolist>li')
print(items)
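For reference, the three initialization styles mentioned above look roughly like this (a minimal sketch; the HTML string and file name are hypothetical examples):

from pyquery import PyQuery

doc1 = PyQuery('<li><img src="demo.jpg"></li>')                     # from an HTML string
doc2 = PyQuery(url='https://pvp.qq.com/web201605/herolist.shtml')   # from a URL
doc3 = PyQuery(filename='herolist.html')                            # from a local file (hypothetical name)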

Then traverse the nodes: use the find function to locate child nodes, and extract the image URL and image name while traversing.

for item in items.items():  # .items() yields PyQuery objects, so find/attr work below
    url = item.find('img').attr('src')
    # print(url)
    urls = 'http:' + url  # the src is protocol-relative, so prepend http:
    name = item.find('a').text()
    # print(name)
    url_content = requests.get(urls).content

1.4 Save data

Finally, save the data. You need to create a new folder in advance to store it. Also note that the saving code must be written inside the for loop; otherwise only one picture will be saved.

    # This block sits inside the for loop above, so every image gets saved
    with open('C:/Users/尚天强/Desktop/王者荣耀picture/' + name + '.jpg', 'wb') as file:
        file.write(url_content)
        print('Downloading %s......%s' % (name, urls))

A timer is also added to measure how long the image crawl takes; here the whole crawl took 7.03 seconds in total.

import time

start = time.time()
# ... the crawling and saving code above runs here ...
end = time.time()
print('The image crawl took {:.2f} seconds in total'.format(end - start))

A dynamic demo of the crawling run shows that the whole process is very fast.

With that, we have successfully crawled the Honor of Kings hero avatars, and the code file also contains the high-definition avatar images.
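To put the four steps together, here is a minimal end-to-end sketch of the whole crawl, assembled from the snippets above (the save path follows the example and should be adjusted to your own machine):

import os
import time

import requests
from pyquery import PyQuery

start = time.time()

u = 'https://pvp.qq.com/web201605/herolist.shtml'
save_dir = 'C:/Users/尚天强/Desktop/王者荣耀picture/'
os.makedirs(save_dir, exist_ok=True)  # create the target folder if needed

doc = PyQuery(url=u)
for item in doc('.herolist>li').items():
    url = 'http:' + item.find('img').attr('src')  # src is protocol-relative
    name = item.find('a').text()
    content = requests.get(url).content
    with open(save_dir + name + '.jpg', 'wb') as file:
        file.write(content)
        print('Downloading %s......%s' % (name, url))

end = time.time()
print('The image crawl took {:.2f} seconds in total'.format(end - start))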

2. Bright Data


The crawler, a seemingly mysterious word, is actually our right-hand helper for the challenges of big data. In this era of unlimited data growth, crawlers open up a fast and efficient way for us to obtain information. The example above requires a programming background, so next I will introduce how to crawl data with Bright Data, no programming required.

2.1 Bright Data Registration

To use Bright Data's features, you need to register on its official website; a personal email address is enough for registration.

Official website address: https://get.brightdata.com/dhsjfx

2.2 Main functions

After logging in to the official website, you can see the commonly used functions displayed on the main interface: the proxy & crawler infrastructure, and the datasets & Web Scraper IDE. Their functions are introduced below.

Proxy & crawler infrastructure: a fast and stable proxy network, with static and dynamic IPs covering 195 countries worldwide, so you can say goodbye to anti-crawling restrictions and blocks. It includes:

  • Proxy networks: dynamic residential, static residential ISP, datacenter, and mobile proxies.

  • Web Unlocker: all-round automatic unlocking.

  • SERP API: easily unlock search engine results.

Datasets and Web Scraper IDE: whether you need complete, rich big-data sets or an easy way to develop data-scraping tools at scale, you can find it here. It includes:

  • Datasets

  • Custom datasets

  • Web Scraper IDE

2.3 Proxy & crawler infrastructure

When web crawling, many websites take steps to limit or block access from specific IP addresses. This is primarily to prevent over-crawling and protect the privacy of website data. Therefore, if you use a fixed IP address for crawling operations, you are likely to encounter restricted access problems.

To avoid this, many crawler developers choose to use proxy IPs. A proxy IP hides your real IP address by relaying traffic through a proxy server: when you crawl through a proxy, the request the website server receives shows the proxy server's IP address instead of your real one. Bright Data offers a variety of proxy IP capabilities.

The advantage of using proxy IPs is that you can rotate through different proxies when accessing the target website, so even if one proxy IP is restricted or blocked, you can continue crawling through other available proxies. In addition, real residential proxy IPs help you better simulate the behavior of real users, improving the crawler's efficiency and success rate.
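As a rough illustration of how a proxy plugs into the requests code from the first section, here is a minimal sketch; the proxy host, port, and credentials are placeholder values, not real Bright Data endpoints:

import requests

# Placeholder proxy endpoint; substitute the host, port, and credentials
# provided by your proxy service
proxies = {
    'http': 'http://username:password@proxy.example.com:8000',
    'https': 'http://username:password@proxy.example.com:8000',
}
response = requests.get('https://pvp.qq.com/web201605/herolist.shtml', proxies=proxies)
print(response.status_code)  # the target site sees the proxy's IP, not yours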

2.4 Datasets and Web Scraper IDE

In the world of data science and machine learning, large datasets are essential. Sometimes, to get the data we need, we have to scrape information from websites, and that process, while necessary, is often time-consuming and complicated. Fortunately, some platforms and tools now provide convenient solutions.

In the Web Scraper IDE, the platform officially provides crawled data from many well-known sites. This means you don't need to start from scratch and crawl every website manually: you can use these ready-made datasets directly, saving a great deal of time and effort.

These datasets typically cover a variety of domains, from social media and news websites to e-commerce platforms. Whether you're doing market analysis, content generation, or pattern recognition, you'll find the data you need in these datasets.

What's even more exciting is that these datasets have been rigorously screened and cleaned to ensure accuracy and completeness, so you can use the data without worrying about missing or incorrect records.

Using the datasets provided with the Web Scraper IDE can greatly simplify data scraping and let you get to the core work of data analysis faster. If you need high-quality data quickly, these official datasets are an excellent choice.

2.5 Web Scraper IDE

Bright Data also provides a web-based IDE tool with sample code. You can use the templates and corresponding code directly, or customize a crawler yourself and build a dataset tailored to your needs. Click the Create button to enter the custom-dataset interface.

Here we take crawling the Douban Top 250 movies as an example. Fill in the information as prompted; when filling in the example URLs, you need to provide at least two URL links before the data can be crawled.

Then, for the fields returned from the web page, you can edit the field names, data types, and so on for the crawled data. The returned data fields can also be previewed, so you can inspect the crawl results in advance.

After the data fields are set, click the download button to save the data in either of two formats: JSON or CSV. Through the preview we can see the basic information that was crawled, and working with the custom crawled data is also very simple.
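Once the file is downloaded, reading it back in Python is straightforward. A small sketch using only the standard library; the file names here are hypothetical placeholders for the downloaded export:

import csv
import json

# Hypothetical file names for the downloaded Douban Top 250 data
with open('douban_top250.json', encoding='utf-8') as f:
    movies = json.load(f)
print(len(movies), 'records in the JSON export')

with open('douban_top250.csv', encoding='utf-8', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows[0] if rows else 'empty CSV')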

With the help of crawlers, you can collect data easily. Whether you need large-scale data collection, a way around website blocks, or proxy management, Bright Data can provide you with high-quality service. If you want to learn more about Bright Data's crawling features, click through to the original article, apply for a free trial, and start your crawling journey!

Interested friends can get a complete set of Python learning materials, including interview questions, resume templates, and more. See below for details.

1. Python learning roadmaps for every direction

The technical points of every Python direction have been compiled into a summary of knowledge points across various fields. The value is that you can find the corresponding learning resources for each knowledge point below, making your learning more comprehensive.


2. Essential development tools for Python

The tools have been organized for you; you can get started right after installation!

3. Latest Python study notes

Once you have learned some basics and built up your own understanding, you can read books or the handwritten notes compiled by seniors. These notes record their detailed understanding of various technical points; those insights are fairly unique, and you can learn a different way of thinking from them.


4. Python video collection

Watch comprehensive zero-to-advanced learning videos. Watching videos is the fastest and most effective way to learn: following the teacher's train of thought in the video makes it easy to get started, from the basics to the advanced material.


5. Practical cases

What you learn on paper is ultimately shallow; you must type along with the videos and practice for yourself in order to put what you have learned to use. At this point you can learn from some practical cases.


6. Interview Guide

Interview question collections and resume templates are included.
