A Senior Front-End Engineer's Practical Guide to Crawlers

Introduction:

In today's Internet era, the explosive growth of online information provides us with massive data resources. However, effectively extracting useful information from this data remains a challenging task. As an automated data extraction tool, crawler technology has become an integral part of the daily work of front-end development engineers.

Front-end crawlers are powerful data collection tools: they can gather data from the Internet, crawl web content, and support data analysis and processing. This article provides an in-depth look at the principles, common tools, and techniques of front-end crawlers, and uses real application cases to help readers master front-end crawling from scratch.

1. Introduction to front-end crawlers

Front-end crawlers are web crawlers implemented with front-end technologies such as JavaScript. Compared with back-end crawlers, they focus more on extracting data from web pages, processing it, and presenting it.

2. Principles and workflow of front-end crawlers

1. Web page request and response

  • Use an HTTP request library to send requests and obtain web page content.
  • Receive the server response and obtain the HTML source code.

2. Parse HTML

  • Use an HTML parsing library to parse the HTML source code and extract the target data.
  • Locate elements via CSS selectors, XPath, and similar methods.

3. Data processing and storage

  • Process, clean, and transform the captured data.
  • Store the data in memory, in files, or in a database.

Summary: send HTTP requests (requests) --> receive the response data --> parse the data (data cleaning with bs4, re, ...) --> store the results (file, Excel, MySQL, Redis, MongoDB)
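To make this pipeline concrete, here is a minimal Node.js sketch that combines Axios (request), Cheerio (parsing), and the built-in fs module (storage). The page URL and the h2.title selector are placeholder assumptions, not part of any real site:

const axios = require('axios');
const cheerio = require('cheerio');
const fs = require('fs');

(async () => {
  // 1. Send the HTTP request and receive the HTML source
  const { data: html } = await axios.get('http://example.com/page');

  // 2. Parse the HTML and extract the target data (hypothetical selector)
  const $ = cheerio.load(html);
  const titles = $('h2.title')
    .map((i, el) => $(el).text().trim())
    .get();

  // 3. Process and store the data (here: a JSON file)
  fs.writeFileSync('titles.json', JSON.stringify(titles, null, 2));
})();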

3. Common tools and frameworks for front-end crawlers

  1. Axios: used to send HTTP requests and receive server responses.
  2. Cheerio: a parsing library with jQuery-like syntax, used to parse HTML source code.
  3. Puppeteer: a headless Chrome Node.js library that can execute JavaScript in a simulated browser environment and supports DOM manipulation, page screenshots, and other features.
  4. Request-Promise: a Promise-based HTTP request library that makes it easy to send requests and handle responses.

4. Tips and precautions for front-end crawlers

  1. User-Agent setting: simulate a browser when sending requests to avoid being identified as a crawler by the website.
  2. Request interval setting: avoid sending too many requests in a short period to reduce the load on the target website (a sketch covering this and the User-Agent setting follows this list).
  3. Element location techniques: use CSS selectors, XPath, or other methods to precisely locate the target elements.
  4. Page rendering and dynamic content handling: use tools such as Puppeteer to handle pages and dynamic content that require JavaScript rendering.
  5. Data storage and legality: pay attention to verifying the legality of the captured data and to choosing an appropriate storage method.
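As a rough illustration of the first two tips, the sketch below sends Axios requests with a browser-like User-Agent header and waits between requests. The URLs and the two-second delay are placeholders:

const axios = require('axios');

// Placeholder list of pages to fetch politely
const urls = ['http://example.com/a', 'http://example.com/b'];

const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

(async () => {
  for (const url of urls) {
    const response = await axios.get(url, {
      headers: {
        // Pretend to be a regular desktop browser
        'User-Agent':
          'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/120.0 Safari/537.36',
      },
    });
    console.log(url, response.status);

    // Wait between requests to limit the load on the target site
    await sleep(2000);
  }
})();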

5. Real application cases

1. Capture news data: Use front-end crawlers to automatically capture the latest news titles, content and release time from multiple news websites, update them regularly, and generate your own news aggregation website.

  • Use the Axios library to send HTTP requests to obtain the web content of the news website.
const axios = require('axios');

axios.get('http://example.com/page')
  .then(response => {
    console.log(response.data);  // the fetched page content
  })
  .catch(error => {
    console.error(error);
  });
  • Use HTML parsing libraries such as Cheerio to parse the captured HTML source code and extract information such as news titles, content, and release time.
const cheerio = require('cheerio');

const html = '<div><h1>Hello, World!</h1></div>';
const $ = cheerio.load(html);
const title = $('h1').text();

console.log(title);  // prints: Hello, World!
  • To process, clean, and transform the extracted data, you can use JavaScript or other data processing tools. For news pages that render their content with JavaScript, a headless browser such as Puppeteer can fetch the rendered page, as in the example below.
const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://example.com');

  // Wait for a specific element to finish loading
  await page.waitForSelector('h1');

  const title = await page.$eval('h1', elem => elem.textContent);

  console.log(title);  // prints the page title

  await browser.close();
})();
  • You can use front-end frameworks such as Vue or React to build the news aggregation website and display the captured data.
  • Update data regularly through scheduled tasks or triggered events to keep the website content up to date (a minimal scheduling sketch follows).
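As a rough illustration of the scheduled-update step, this sketch re-runs a hypothetical fetchNews() function with Node's built-in setInterval; a production setup might use a cron-style scheduler instead:

// Hypothetical crawl function that fetches and stores the latest articles
async function fetchNews() {
  // ... send requests, parse the HTML, store the results ...
  console.log('News refreshed at', new Date().toISOString());
}

// Run once at startup, then refresh every 30 minutes (placeholder interval)
fetchNews();
setInterval(() => {
  fetchNews().catch(err => console.error('Refresh failed:', err));
}, 30 * 60 * 1000);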

2. Price comparison and monitoring: by capturing product data from multiple e-commerce websites, you can compare prices, reviews, and other information across sites and help users find the best deal.

import requests
from bs4 import BeautifulSoup
import time

def get_product_price(url):
    response = requests.get(url)
    soup = BeautifulSoup(response.text, 'html.parser')

    # Parse the HTML source with BeautifulSoup and extract the product price
    price_element = soup.find('span', class_='price')  # assumes the price lives in <span class="price">
    price = price_element.get_text().strip()

    return price

def compare_prices(product1, product2):
    if product1["price"] < product2["price"]:
        return f"{product1['name']} is cheaper"
    elif product1["price"] > product2["price"]:
        return f"{product2['name']} is cheaper"
    else:
        return "Both products have the same price"

def monitor_prices(products):
    while True:
        for product in products:
            price = get_product_price(product["url"])
            if product["price"] != price:
                print(f"The price of {product['name']} changed! Old price: {product['price']}, new price: {price}")
                product["price"] = price
        time.sleep(60)  # check prices every 60 seconds

# Define the list of products to monitor
products = [
    {"name": "Product 1", "url": "http://example.com/product1", "price": ""},
    {"name": "Product 2", "url": "http://example.com/product2", "price": ""},
    {"name": "Product 3", "url": "http://example.com/product3", "price": ""}
]

# Fetch the initial price of each product
for product in products:
    product["price"] = get_product_price(product["url"])
    print(f"Initial price of {product['name']}: {product['price']}")

# Monitor price changes
monitor_prices(products)
  • In the above example code, we defined three functions:

    1. get_product_price(): This function is used to obtain price information on a specific product web page.

    2. compare_prices(): This function is used to compare the prices of two products and return the comparison result.

    3. monitor_prices(): This function is used to monitor product price changes. In an infinite loop, it first fetches the current product's price, then re-fetches the price every 60 seconds and compares it to the previous price. If the price changes, the product name, old price, and new price will be printed.

  • Next, we define a product list, products, that contains the product information to be monitored, including each product's name, URL, and initial price.

  • We then use the get_product_price() function to get the initial price of each product and print it out.

  • Finally, we call the monitor_prices() function to start monitoring price changes: in an infinite loop, it fetches each product's price every 60 seconds and compares it with the previous price. If the price changes, the product name, old price, and new price are printed.

In addition, for operations that frequently crawl web pages and monitor prices, please abide by the website's terms of use and privacy policy, and set appropriate crawl intervals as needed to avoid excessive burden on the website.

3. Data analysis and visualization: Use front-end crawlers to collect data in specific fields. Through data processing and analysis, combined with data visualization tools, intuitive charts and reports can be generated to help decision-making and insight.

  • Use front-end crawlers to capture data in specific fields, such as stock prices, weather data, etc.
  • You can use Puppeteer or similar tools to handle pages and dynamic content that require JavaScript rendering.
  • To clean, transform, and analyze the captured data, you can use JavaScript or other data processing tools.
  • Combined with data visualization libraries such as D3.js and ECharts, you can generate charts, reports, and other visualizations.
  • Data analysis results can be embedded into front-end applications, or a dashboard can be created for users to interact with and query.
const puppeteer = require('puppeteer');
const fs = require('fs');
const dataProcessing = require('./dataProcessing');
const dataVisualization = require('./dataVisualization');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('http://example.com');

  // Execute JavaScript on the page to collect the data
  const data = await page.evaluate(() => {
    const elements = Array.from(document.querySelectorAll('.data-element')); // assumes the data elements use the class .data-element
    return elements.map((element) => element.textContent);
  });

  // Close the browser
  await browser.close();

  // Process and transform the captured data
  const processedData = dataProcessing.processData(data);

  // Save the processed data to a file
  fs.writeFileSync('processedData.json', JSON.stringify(processedData));

  // Visualize the data and generate charts
  dataVisualization.generateChart(processedData);
})();

Please note that this code example only demonstrates the basic process, and the actual data processing and visualization steps may vary based on specific data types, needs, and circumstances. You can modify the code to adapt to actual data processing and visualization solutions according to your project needs.

In the example, we use a module called dataProcessing to process the data and a module called dataVisualization to generate the charts. You need to create and adapt these modules to your actual needs, implementing the corresponding processing and visualization logic based on your data characteristics and business logic.

Also note the async functions and await statements in the code, which handle the asynchronous nature of Puppeteer operations; they ensure that subsequent code does not run until the browser has finished the current operation. In addition, error handling and exception logic can be added as needed.

4. Crawl hot searches on Weibo

Puppeteer is a headless browser tool that can simulate user behavior and visit web pages. We can use Puppeteer together with Node.js to crawl Weibo's hot search list. The following sample code uses Puppeteer to crawl Weibo hot searches:

  • First, make sure the Puppeteer package is installed in your project. You can install it with the following command:
npm install puppeteer
  • Next, create a file named scrape_weibo.js and write the following JavaScript code to crawl Weibo hot searches:
const puppeteer = require('puppeteer');

(async () => {
  // Launch the browser
  const browser = await puppeteer.launch();

  // Create a new page
  const page = await browser.newPage();

  // Navigate to the Weibo hot search page
  await page.goto('https://s.weibo.com/top/summary');

  // Wait for the hot search data to load
  await page.waitForSelector('table.list-table tr.td-02');

  // Extract the hot search entries
  const hotItems = await page.$$('table.list-table tr.td-02');

  // Iterate over the entries and extract the rank, keyword, and popularity
  for (const item of hotItems) {
    const keyword = await item.$eval('a', element => element.innerText);
    const rank = await item.$eval('td.td-01.ranktop', element => element.innerText.trim());
    const hotness = await item.$eval('.hot', element => element.innerText.trim());
    console.log(`Rank: ${rank}, keyword: ${keyword}, popularity: ${hotness}`);
  }

  // Close the browser
  await browser.close();
})();

In the above code, we import the puppeteer package and use the puppeteer.launch() method to start a browser instance.

  • Then, we use the browser.newPage() method to create a new page and the page.goto() method to navigate to the Weibo hot search page.

  • Next, we use the page.waitForSelector() method to wait for the hot search data to load; here we wait for an element matching the selector table.list-table tr.td-02 to appear.

  • We then use the page.$$() method to find all elements matching that selector and store them in the hotItems array.

  • Finally, we use a for...of loop to iterate over the hotItems array, call $eval() on each element handle to extract the keyword, rank, and popularity, and print them to the console.

  • Lastly, we use the browser.close() method to close the browser instance.

  • To run this script, use the following command:

node scrape_weibo.js

Please note that crawling Weibo is a form of data scraping; abide by Weibo's relevant regulations and terms of service when developing and using such a script. You also need to pay attention to crawl speed and the load placed on the server.

5. Search engine crawlers

a. Analyze how search engines use crawler technology to build and update indexes

Search engines use crawler technology to build and update indexes, which is a complex process. Here is a brief analysis (a minimal crawl-loop sketch follows the list):

  1. Crawl web pages: search engine crawler programs, often called spiders, fetch web content from the Internet. They jump from one page to another by following links, building up a collection of web pages.

  2. Parse the web page: The crawler program will parse the HTML code of the web page and extract metadata about the page, such as title, description, URL and other information.

  3. Extract links: The crawler will extract links from the web page and add these links to the queue to be crawled for further crawling.

  4. Access the page: The crawler program will send an HTTP request to the server to obtain the content of the web page. They will simulate the browser behavior of ordinary users, including sending GET requests, handling redirects, and handling form submissions.

  5. Indexing: The crawler extracts useful content from the crawled web pages and stores it in an index database. These contents may include web page text, titles, links, images, etc.

  6. Update index: Search engines regularly revisit crawled web pages to obtain the latest content and update the index database. This ensures real-time and accurate search results.
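To make the crawl-parse-extract-links loop concrete, here is a minimal, hypothetical Node.js sketch of a breadth-first crawler that keeps a queue of URLs, fetches each page, extracts links with Cheerio, and records page titles as a stand-in for indexing. A real search engine crawler would add politeness rules, robots.txt checks, large-scale deduplication, and much more:

const axios = require('axios');
const cheerio = require('cheerio');

async function crawl(startUrl, maxPages = 10) {
  const queue = [startUrl];      // URLs waiting to be crawled
  const visited = new Set();     // URLs already crawled
  const index = [];              // toy "index": page URL + title

  while (queue.length > 0 && visited.size < maxPages) {
    const url = queue.shift();
    if (visited.has(url)) continue;
    visited.add(url);

    try {
      const { data: html } = await axios.get(url);
      const $ = cheerio.load(html);

      // "Index" the page: store its URL and title
      index.push({ url, title: $('title').text() });

      // Extract links and enqueue them for further crawling
      $('a[href^="http"]').each((i, el) => {
        queue.push($(el).attr('href'));
      });
    } catch (err) {
      console.error('Failed to fetch', url, err.message);
    }
  }
  return index;
}

crawl('http://example.com').then(index => console.log(index));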

b. Introduce the function and use of robots.txt file

The robots.txt file is a text file used for website management. It tells search engine crawlers which pages may be crawled and which should be ignored.

Function:
  • Control access: website administrators can use the robots.txt file to tell search engine crawlers whether they may access a specific page or directory. This can protect sensitive information or restrict access to certain resources.
  • Manage crawl frequency: by setting up the robots.txt file, website administrators can specify how often crawlers fetch web pages, controlling the load on server resources.
  • Direct indexing behavior: through the robots.txt file, webmasters can instruct search engines not to index specific pages or links. This is useful for avoiding duplicate content, protecting privacy, or focusing indexing on key pages.
Instructions:
  • Create the file: create a text file named robots.txt in the root directory of the website.
  • Write the rules: in the robots.txt file, use specific syntax rules to define which pages crawlers may access and which should be blocked (see the example after this list).
  • Configure paths: in the rules, you can use wildcards and special directives to match different URL paths and crawler user agents (i.e., specific search engines).
  • Upload to the server: upload the prepared robots.txt file to the root directory of the website so that search engines can find it.
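For illustration only, a simple robots.txt might look like the following; the paths shown are placeholders, and the actual rules depend on the site's structure and policy:

# Rules that apply to all crawlers
User-agent: *
# Block the back-office and private directories (placeholder paths)
Disallow: /admin/
Disallow: /private/
# Explicitly allow a public section
Allow: /public/
# Non-standard directive, honored by some crawlers only
Crawl-delay: 10

# Tell crawlers where the sitemap lives
Sitemap: http://example.com/sitemap.xml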

Note that not all crawlers follow the rules in the robots.txt file, so it cannot completely prevent unwanted access or pointless crawling. However, most search engine crawlers that respect web conventions will abide by it.

6. Network information monitoring and competitive product analysis

Please note that this article does not go into detail on monitoring a specific competitor's website changes and content updates. Crawling technology can be used for many legal and compliant purposes, such as search engine indexing and data collection; however, misuse of crawler technology may violate laws and regulations and infringe on the privacy or intellectual property rights of others.

For legal and compliant purposes, taking the monitoring of competitors' website changes and content updates as an example, the general steps are as follows (a minimal change-detection sketch follows the list):

  1. Identify the target: identify the competitor sites you want to monitor and understand the crawling behavior they allow and the restrictions they impose, such as those specified in their robots.txt file.
  2. Design your crawler code: Using the right programming language and tools, write your crawler code to visit the target website and extract the required information. Make sure your code complies with laws, regulations, and site usage guidelines.
  3. Regular crawling: Set a reasonable crawling frequency and ensure that it does not burden the website or interfere with normal operation. Respect the website's server load and privacy policy.
  4. Data processing and analysis: Process and analyze the crawled data to obtain valuable competitor information and make necessary comparisons and evaluations.
  • It should be noted that the specific implementation methods in the above cases may vary depending on specific projects and needs, involving more technical details and thinking. Therefore, during actual implementation, please carefully consider the source and legality of the data, as well as related issues such as the service provider's terms of use and privacy policy.
  • Equally important, the legal and compliant use of crawler technology requires compliance with relevant laws and regulations, privacy rights, and intellectual property rights. Before conducting any scraping activities, please ensure that you understand and comply with local laws and the terms of use of the relevant website, and respect the rights and privacy of others.
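As a rough, hypothetical illustration of steps 2 through 4, the sketch below fetches a page, hashes its main content, and compares the hash with the previously stored value to detect changes. The URL, the #main-content selector, and the one-hour interval are placeholders; any real use must respect the target site's robots.txt, terms of use, and server load:

const axios = require('axios');
const cheerio = require('cheerio');
const crypto = require('crypto');
const fs = require('fs');

const TARGET_URL = 'http://example.com/news';   // placeholder URL
const STATE_FILE = 'lastHash.txt';

async function checkForChanges() {
  const { data: html } = await axios.get(TARGET_URL);
  const $ = cheerio.load(html);

  // Hash only the main content area (hypothetical selector) to ignore ads and timestamps
  const content = $('#main-content').text().trim();
  const hash = crypto.createHash('sha256').update(content).digest('hex');

  const lastHash = fs.existsSync(STATE_FILE) ? fs.readFileSync(STATE_FILE, 'utf8') : '';
  if (hash !== lastHash) {
    console.log('Content changed at', new Date().toISOString());
    fs.writeFileSync(STATE_FILE, hash);
  }
}

// Check once per hour; keep the interval conservative to avoid burdening the site
checkForChanges();
setInterval(() => checkForChanges().catch(console.error), 60 * 60 * 1000);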

Conclusion:

As an interesting and practical skill, front-end crawling helps us obtain data from the Internet and conduct effective data analysis. Through the explanations in this article and the real application cases, readers can master the basic principles, common tools, and techniques of front-end crawlers and understand their applications in different fields. At the same time, please keep legal compliance and ethical principles in mind to protect the healthy development of the online ecosystem.
