Crawling Movie Information with Python: An Introduction to Ajax, a Hands-On Case, and MongoDB Storage

Introduction to Ajax

Ajax (Asynchronous JavaScript and XML) is a technique for asynchronous communication in web applications. By exchanging data with the server in the background, it allows part of a web page to be updated in real time without refreshing the entire page. The main features of Ajax include:

  1. Asynchronous communication: Ajax is asynchronous, which means it can communicate without blocking the user interface. The user can continue to interact with the web page without waiting for the server to respond.

  2. Data exchange: Ajax allows data to be exchanged between the client and server, typically using XML, JSON or other data formats. This enables web pages to load, display and update data in real time without completely reloading the entire page.

  3. No page refresh required: Traditional web applications typically require a full page refresh every time they interact with the server. Ajax can refresh only part of the page, providing a smoother user experience.

  4. Dynamic content: Ajax enables developers to create dynamic, real-time updating web content that can be dynamically loaded and modified based on user actions and needs.

  5. Multiple uses: Ajax can be used not only for loading data, but also for submitting forms, validating user input, auto-complete searches, live chat, and many other interactive features in web applications.

Ajax usually consists of the following core components:

  • XMLHttpRequest object: This is the core of Ajax; it allows JavaScript code to communicate with the server, sending HTTP requests and receiving responses. In modern web development the fetch API is often used instead of XMLHttpRequest because it is simpler and more powerful.

  • Server-side script: The server-side needs to provide an endpoint that accepts Ajax requests, and can process these requests, perform corresponding operations, and return response data.

  • Asynchronous event handling: JavaScript code needs to be able to handle Ajax requests and responses in the background to ensure it does not block the user interface. This usually involves using callback functions or Promises to handle asynchronous operations.

  • Data format: Ajax can use a variety of data formats to exchange information, including XML, JSON, HTML and plain text.

Ajax has become an important part of modern web application development, providing an effective way to achieve real-time, interactive, and dynamic user experiences. Many popular web applications and frameworks (such as React, Angular, and Vue.js) use Ajax to handle data loading and interaction. Through Ajax, web applications can respond to user needs more effectively and provide a better user experience.
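
From a crawler's point of view, an Ajax-driven page means the data you want is usually not in the initial HTML at all: it arrives later from a JSON endpoint, and the crawler can call that endpoint directly. Below is a minimal sketch with requests (the URL is the index API of the demo site used later in this article; the X-Requested-With header is one that many Ajax libraries such as jQuery attach, and whether a particular site requires it is an assumption to verify case by case):

import requests

# The index API of the demo site used later in this article
url = 'https://spa1.scrape.center/api/movie/?limit=10&offset=0'
headers = {
    'Accept': 'application/json',
    # Many Ajax libraries send this header; not every endpoint requires it
    'X-Requested-With': 'XMLHttpRequest',
}
response = requests.get(url, headers=headers)
print(response.json())   # the same JSON payload the browser receives via Ajax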

A Practical Case

Building on the previous article (https://blog.csdn.net/rubyw/article/details/132714499?spm=1001.2014.3001.5501), we crawl a page that uses Ajax to render its content dynamically and store the results in a local MongoDB database.
Target website: https://spa1.scrape.center

In the browser's developer tools, filter the Network panel to show only XHR requests, then observe which requests fire as the page changes.

[Screenshot: XHR requests for the first index page]

[Screenshot: XHR request for a movie detail page]
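
The XHR requests reveal that the index data comes from an API of the form https://spa1.scrape.center/api/movie/?limit={limit}&offset={offset}, where limit is the number of movies per page and offset is how many movies to skip. A quick probe of the first page (a sketch; the exact field names, such as results, should be confirmed against what the XHR panel shows):

import requests

# Fetch the first index page: 10 movies starting from offset 0
data = requests.get('https://spa1.scrape.center/api/movie/?limit=10&offset=0').json()
print(list(data.keys()))                        # top-level keys of the JSON response
for movie in data.get('results') or []:
    print(movie.get('id'), movie.get('name'))   # id feeds the detail API below

With the API structure understood, the full crawler below pages through the index, fetches each movie's detail endpoint, and upserts the result into MongoDB.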

# Ajax crawling + MongoDB storage

import pymongo
import requests
import logging

logging.basicConfig(level=logging.INFO,
                    format='%(asctime)s - %(levelname)s: %(message)s')

INDEX_URL = 'https://spa1.scrape.center/api/movie/?limit={limit}&offset={offset}'

MONGO_CONNECTION_STRING = 'mongodb://localhost:27017'
MONGO_DB_NAME = 'movies'
MONGO_COLLECTION_NAME = 'movies'

client = pymongo.MongoClient(MONGO_CONNECTION_STRING)
db = client[MONGO_DB_NAME]
collection = db[MONGO_COLLECTION_NAME]

# Fetch a JSON API endpoint and return the parsed response
def scrape_api(url):
    logging.info('scraping %s...', url)
    try:
        response = requests.get(url)
        if response.status_code == 200:
            return response.json()
        else:
            logging.error('get invalid status code %s while scraping %s',
                          response.status_code, url)
        return None
    except requests.RequestException:
        logging.error('error occurred while scraping %s', url, exc_info=True)
        return None


LIMIT = 10


def scrape_index(page):
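    # Each index page carries LIMIT movies, so page n starts at
    # offset = LIMIT * (page - 1): page 1 -> offset 0, page 2 -> offset 10.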
    url = INDEX_URL.format(limit=LIMIT, offset=LIMIT * (page - 1))
    return scrape_api(url)


DETAIL_URL = 'https://spa1.scrape.center/api/movie/{id}'


def scrape_detail(id):
    url = DETAIL_URL.format(id=id)
    return scrape_api(url)


TOTAL_PAGE = 10


def save_data(data):
    # Upsert: update the document if it already exists, insert it otherwise
    collection.update_one({
        'name': data.get('name')   # query by movie name
    }, {
        '$set': data   # the update to apply
    }, upsert=True)


def main():
    for page in range(1, TOTAL_PAGE + 1):
        index_data = scrape_index(page)
        if not index_data:   # skip index pages that failed to download
            continue
        for item in index_data.get('results') or []:
            id = item.get('id')
            detail_data = scrape_detail(id)
            if not detail_data:   # skip detail pages that failed to download
                continue
            logging.info('detail data %s', detail_data)
            save_data(detail_data)
            logging.info('data saved successfully')


if __name__ == '__main__':
    main()
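
Once the script has run, a quick pymongo check confirms what was stored (a minimal sketch, assuming the connection defaults above; with TOTAL_PAGE = 10 and LIMIT = 10 the collection should hold around 100 documents):

import pymongo

client = pymongo.MongoClient('mongodb://localhost:27017')
collection = client['movies']['movies']
print(collection.count_documents({}))        # roughly 100 documents expected
print(collection.find_one({}, {'name': 1}))  # peek at one stored movie name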

To browse the stored data visually, RoboMongo/Robo 3T is an easy-to-use and powerful MongoDB client, available on all three major desktop platforms. The official website is https://robomongo.org/ and the download page is https://robomongo.org/download.

Finally, the crawled and saved results can be seen in the local MongoDB database:

[Screenshot: the stored movie documents in the local MongoDB database]

Origin: blog.csdn.net/rubyw/article/details/132715850