A step-by-step guide to Python web crawlers, with a Taobao data-scraping example

Author: Ayue's Little Dongdong

Learning: Python, C/C++

Blog homepage: Ayue's Little Dongdong's CSDN blog (Python and C/C++ knowledge)

Table of contents

A Beginner's Guide to Web Crawling

1. Introduction to web crawlers

2. Preparation

3. Send HTTP request

4. Parse HTML

5. Crawl data

6. Use API

7. Crawler example: crawling Taobao data

8. Crawler ethics

Conclusion


This article explains how to write web crawlers in Python, with accompanying code throughout.

A Beginner's Guide to Web Crawling

A web crawler is an automated program that collects information from the Internet. Crawlers are the foundation of search engines, price-comparison sites, social media platforms, and more. This guide explains how to write a web crawler in Python.

1. Introduction to web crawlers

A web crawler is a software program that automatically scrapes information from the Internet. Web crawlers obtain and parse HTML pages by sending HTTP requests, and extract the required data from them.

Behind every crawler are two key concepts: crawling and parsing. Crawling is the process of fetching data from a website; parsing is converting the fetched data into a workable format.

2. Preparation

Before writing a web crawler in Python, you need to install the following components:

  • Python 3
  • Requests
  • BeautifulSoup4

You can install these dependencies using the following commands:

pip install requests
pip install beautifulsoup4
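
If you want to confirm the installation worked, you can run a short check like the following (a minimal sketch; both packages expose a standard __version__ attribute):

# Quick sanity check that the crawler dependencies are importable
import requests
import bs4

print('requests version:', requests.__version__)
print('beautifulsoup4 version:', bs4.__version__)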

3. Send HTTP request

Before using Python to send HTTP requests, you need to understand the HTTP protocol. HTTP is a protocol used to transfer information between computers. When you enter a URL into your browser, the browser sends an HTTP request to get the page. Similarly, we can send HTTP requests using Python’s Requests library.

import requests

response = requests.get('https://www.example.com')

In the above code, we send an HTTP GET request to https://www.example.com and store the response in a variable named response. You can access the contents of the response using response.content.
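
Before parsing a response, it is also good practice to check whether the request actually succeeded. Here is a minimal sketch (the timeout value and status-code handling are illustrative choices, not part of the original example):

import requests

# Send the request with a timeout so the program does not hang indefinitely
response = requests.get('https://www.example.com', timeout=10)

# 200 means success; raise_for_status() raises an exception for 4xx/5xx responses
print('Status code:', response.status_code)
response.raise_for_status()

# response.text is the decoded body; response.content is the raw bytes
print(response.text[:200])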

4. Parse HTML

The purpose of a web crawler is usually to collect data from a website. For data analysis and visualization, you need to convert this data into an actionable format. In web development, the most common format is HTML. You can use Python's BeautifulSoup library to parse HTML pages.

from bs4 import BeautifulSoup

html = '''
<html>
  <head>
    <title>Example</title>
  </head>
  <body>
    <div class="content">
      <p>Hello, world!</p>
    </div>
  </body>
</html>
'''

soup = BeautifulSoup(html, 'html.parser')

title = soup.title.text
content = soup.find('div', {'class': 'content'}).text

In the above code, we parsed a simple HTML document using the BeautifulSoup library. We use soup.title.text to get the title of the page, and soup.find('div', {'class': 'content'}).text to get the text of the content block.
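
BeautifulSoup also offers other ways to locate elements. The following sketch (using a small HTML string of its own, as an illustration) shows find_all() for matching every tag and select() for CSS selectors:

from bs4 import BeautifulSoup

html = '<html><body><div class="content"><p>Hello</p><p>World</p></div></body></html>'
soup = BeautifulSoup(html, 'html.parser')

# find_all() returns every matching tag as a list
for p in soup.find_all('p'):
    print(p.text)

# select() accepts CSS selectors, e.g. class names and child combinators
for p in soup.select('div.content > p'):
    print(p.text)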

5. Crawl data

Now, you are ready to scrape data from your website. To understand how to create a crawler, let's start with a simple example.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.example.com'

response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')

title = soup.title.text
print('Title:', title)

for link in soup.find_all('a'):
    print(link.get('href'))

In the above code, we send an HTTP GET request to https://www.example.com and parse the response content using BeautifulSoup. We use soup.title.text to get the title of the page, soup.find_all('a') to get all the links, and link.get('href') to print the URL of each link.
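
Links extracted with link.get('href') are often relative URLs. The sketch below normalizes them with urllib.parse.urljoin and skips duplicates (the deduplication with a set is an assumption for illustration, not part of the original example):

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

URL = 'https://www.example.com'
response = requests.get(URL)
soup = BeautifulSoup(response.content, 'html.parser')

seen = set()
for link in soup.find_all('a'):
    href = link.get('href')
    if not href:
        continue
    # Convert relative links (e.g. /about) into absolute URLs
    absolute = urljoin(URL, href)
    if absolute not in seen:
        seen.add(absolute)
        print(absolute)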

6. Use API

Some websites provide APIs that let you obtain data directly over HTTP. APIs are generally easier to work with than scraping a website's HTML.

Here is an example of using Python’s Requests library to access the API:

import requests

response = requests.get('https://api.example.com/data')
data = response.json()

for item in data:
    print(item['name'], item['value'])

In the above code, we send an HTTP GET request to the API and use the .json() method to parse the response as JSON. We then loop over the list of data and print the name and value of each item.
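
Real APIs are often paginated and can fail, so a loop with basic error handling is common. Here is a sketch against the same hypothetical endpoint (the page parameter and response shape are assumptions for illustration; consult the real API's documentation):

import requests

# Hypothetical paginated endpoint
base_url = 'https://api.example.com/data'

for page in range(1, 4):
    response = requests.get(base_url, params={'page': page}, timeout=10)
    if response.status_code != 200:
        print('Request failed with status', response.status_code)
        break
    data = response.json()
    if not data:  # stop when the API returns an empty page
        break
    for item in data:
        print(item['name'], item['value'])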

7. Crawler example: crawling Taobao data

To use Python to crawl Taobao data, you can use the following steps:

  1. Determine the keyword to crawl and construct the search URL. For example, to crawl data for "mask" (口罩), the search URL is: https://s.taobao.com/search?q=mask

  2. Send an HTTP request to obtain the content of the search results page. Use Python's requests library to send HTTP requests and obtain page content.

  3. Parse the page content and extract product information. Use Python's BeautifulSoup library to parse HTML page content and extract the required information.

  4. Storing data. The extracted product information can be stored in a local file or database.

Here is the sample code:

import requests
from bs4 import BeautifulSoup

def get_search_result(url):
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'}
    # Send the HTTP request
    response = requests.get(url, headers=headers)
    # Parse the HTML page
    soup = BeautifulSoup(response.text, 'html.parser')
    items = soup.find_all(class_='item J_MouserOnverReq')
    for item in items:
        # Extract the product information
        title = item.find(class_='title').text.strip()
        price = item.find(class_='price g_price g_price-highlight').text.strip()
        sales = item.find(class_='deal-cnt').text.strip()
        shop = item.find(class_='shopname J_MouseEneterLeave J_ShopInfo').text.strip()
        # Store the data
        with open('data.txt', 'a', encoding='utf-8') as f:
            f.write(f"Title: {title}, Price: {price}, Sales: {sales}, Shop: {shop}\n")

if __name__ == '__main__':
    keyword = '口罩'  # '口罩' means 'mask'
    url = f'https://s.taobao.com/search?q={keyword}'
    get_search_result(url)

After executing the above code, a data.txt file will be generated in the current directory, which contains the crawled product information.
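
Step 4 above also mentions storing the data in a database. Here is a minimal sketch using Python's built-in sqlite3 module (the database file, table name, and columns are illustrative assumptions):

import sqlite3

# Create (or open) a local SQLite database and a table for the scraped items
conn = sqlite3.connect('taobao.db')
conn.execute('''CREATE TABLE IF NOT EXISTS products (
                    title TEXT, price TEXT, sales TEXT, shop TEXT)''')

def save_product(title, price, sales, shop):
    # Parameterized queries keep scraped strings from breaking the SQL
    conn.execute('INSERT INTO products VALUES (?, ?, ?, ?)',
                 (title, price, sales, shop))
    conn.commit()

save_product('example title', '9.9', '100', 'example shop')
conn.close()

You could call save_product() inside the loop of the crawler above instead of writing to data.txt.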

8. Crawler ethics

Web crawlers can easily be abused to collect data at large scale without explicit permission. We therefore need to follow good crawling practices to avoid harming websites and their users.

Here are some tips for following good web crawling behavior:

  • Follow the target website's robots.txt file to learn which pages may be crawled (see the sketch after this list).
  • Do not request the same site too frequently, to avoid degrading its performance.
  • Respect user privacy and avoid collecting sensitive data.
  • Avoid using web crawlers for illegal activities.
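
As a practical illustration of the first two points, Python's standard library can check robots.txt, and a short sleep between requests throttles the crawl. A minimal sketch (the URLs, delay value, and user-agent wildcard are arbitrary choices for illustration):

import time
from urllib import robotparser

import requests

# Parse the site's robots.txt to see whether a URL may be fetched
rp = robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')
rp.read()

urls = ['https://www.example.com/', 'https://www.example.com/private']
for url in urls:
    if rp.can_fetch('*', url):
        requests.get(url, timeout=10)
        print('Fetched', url)
    else:
        print('Disallowed by robots.txt:', url)
    # Wait between requests so the site is not overloaded
    time.sleep(1)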

Conclusion

In this guide, we covered how to write a web crawler in Python: sending HTTP requests with Requests, parsing HTML pages with BeautifulSoup, and accessing data through APIs. Finally, we offered tips for responsible crawling.

 
