Anaconda development environment and crawler overview

Anaconda development environment

  • Anaconda is an integrated environment for data analysis and machine learning. It bundles the various interpreters and modules used for those tasks.
  • Jupyter: a browser-based, visual coding tool provided with the Anaconda distribution.
  • Precautions
    • When setting up the environment you only need to install Anaconda. The installation path must be pure English, with no special symbols.
  • Test whether the installation succeeded:
    • Open a terminal, type jupyter notebook, and press Enter. If the notebook starts, the installation succeeded and the environment variables are also configured correctly.
    • Otherwise, open the Windows Start menu and look for an Anaconda folder in the program list. If a program called Navigator appears inside it, the installation itself succeeded; only the environment variables are not configured.

How to start jupyter

  • Method 1: with the environment variables configured, type jupyter notebook directly in the terminal and press Enter.
  • Method 2: without configuring environment variables, open Anaconda Navigator, click the Home option in the upper left corner, and click Launch under the Jupyter Notebook icon.
    • Recommendation: click Environments in the upper left corner instead.

From the environment's menu, choose Open Terminal, cd to the folder you want to work in (switching the drive letter if needed), type jupyter notebook in that terminal, and press Enter. A scripted version of these steps is sketched below.
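The same steps can be driven from a small Python script. This is a minimal sketch, assuming Anaconda's jupyter command is on the PATH; the folder path is a hypothetical placeholder:

```python
import subprocess

# Hypothetical working directory; replace it with the folder you want Jupyter to use.
WORKDIR = r"C:\work\notebooks"

# Equivalent to opening a terminal, cd-ing into WORKDIR, and running `jupyter notebook`.
# The call blocks until the notebook server is shut down (Ctrl+C in the terminal).
subprocess.run(["jupyter", "notebook"], cwd=WORKDIR, check=True)
```

The directory passed as cwd becomes the root directory shown in the browser, which is exactly the behaviour described in the next section.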

Basic usage of jupyter

  • After you run the jupyter notebook command in a terminal, a service is started on this machine and your default browser is opened automatically.
    • Note: you can cd into a specific directory before executing the jupyter notebook instruction; the page the browser opens then shows the directory structure of that directory.
      • In other words, the directory your terminal was in becomes the root directory of the Jupyter page in the browser.

The New menu:

  • Python 3: create a new Jupyter source file (the important option)

    • A source file consists of cells: a cell is an editable input box.
    • The role of a cell:
      • Depending on its mode, it is used to write code or notes. Both can be run directly in the current file and the results viewed immediately.
    • Cell modes:
      • code: for writing Python code
        • A code cell can contain one line of code or many.
        • Feature: the order in which cells are written does not matter, but the order in which they are executed must be top-down.
          • As long as the related variables, functions, or classes are defined in some cell, executing that cell loads those definitions into the cache of the current source file; any other cell can then use the cached definitions directly (see the sketch after this list).
      • markdown: for writing notes. You can use Markdown syntax to style the text, or use HTML tags instead.
  • folder: create a new folder

  • text file: Create a new text file with any suffix

    • You can write programs in it, but you cannot run them directly from this file.
  • terminal: create a new browser-based terminal (useful for downloading packages, e.g. pip install xxxx)
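A minimal sketch of the "define once, use anywhere" behaviour of code cells described above. The two blocks below stand for two separate cells in one notebook; the variable and function names are made up for illustration:

```python
# --- cell 1: execute this cell first (shift + enter) ---
# Running it loads `greeting` and `shout` into the current source file's cache,
# so any other cell can use them afterwards.
greeting = "hello jupyter"

def shout(text):
    """Return the text in upper case with an exclamation mark."""
    return text.upper() + "!"

# --- cell 2: may be written anywhere in the file, but executed after cell 1 ---
print(shout(greeting))   # prints: HELLO JUPYTER!
```

Executing cell 2 before cell 1 raises a NameError, which is why execution order must be top-down even though writing order does not matter.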

Use of shortcut keys

  • Insert a cell: a inserts a cell above, b inserts a cell below
  • Delete a cell: x
  • Execute a cell: shift + enter
  • Switch the cell mode: m (switch to markdown), y (switch to code)
  • After a cell has been executed, double-click on the left side of the cell to return it to editable mode
  • Collapse or restore execution results: double-click on the left side of the output
  • Open the help documentation: shift + tab
  • tab: auto-complete
  • Undo: z

A Jupyter source file can be exported after it is written (an equivalent command-line route is sketched below):

  • File → Download as → HTML
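Besides the File menu, the export can also be done from a terminal with Jupyter's nbconvert tool. A minimal sketch, assuming a hypothetical notebook named demo.ipynb sits in the current directory and the jupyter command is on the PATH:

```python
import subprocess

# Equivalent to File -> Download as -> HTML: convert the notebook to an
# HTML file (demo.html) written next to the original demo.ipynb.
subprocess.run(["jupyter", "nbconvert", "--to", "html", "demo.ipynb"], check=True)
```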

Crawler overview

  • What is a crawler?
    • It is the process of writing a program that simulates a browser going online and then grabs data from the Internet (a minimal sketch follows this list).
      • Key points:
        • Simulation: the browser itself is the purest, most natural crawling tool.
        • Grabbing:
          • grab the entire source code of a page
          • grab local (partial) data from a page
  • Classification of crawlers:
    • General-purpose crawler:
      • crawls the entire source data of a page
    • Focused crawler:
      • crawls local (partial) data within a page
        • A focused crawler must be built on top of a general-purpose crawler.
    • Incremental crawler:
      • monitors a website for data updates so that only the newest data is crawled.
    • Distributed crawler:
      • the ultimate weapon for increasing crawling efficiency.
  • Anti-crawling mechanisms
    • Applied on the website (portal) side. If the website does not want crawlers to grab its data easily, it can put related mechanisms or measures in place to prevent them from doing so.
  • Anti-anti-crawling strategies
    • Applied in the crawler program. The crawler can adopt strategies that defeat the anti-crawling mechanisms in order to obtain the relevant data.
  • The first anti-crawling mechanism:
    • The Robots protocol: it deters gentlemen, not villains.
      • It is a plain-text agreement that states which data on the site may be crawled by which crawlers and which may not.
    • Breaking it (not compliant):
      • Simply ignoring the agreement is enough.
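To make the terms above concrete, here is a minimal sketch of a general-purpose crawl plus a small focused step, written with the third-party requests library (assumed installed via pip install requests). The target URL is a hypothetical example; the User-Agent header is a simple anti-anti-crawling measure that makes the program look like a browser, and the robots.txt check shows how a compliant crawler honours the Robots protocol:

```python
import re
from urllib import robotparser

import requests  # third-party library: pip install requests

URL = "https://www.example.com/"                   # hypothetical target page
UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"   # pretend to be a browser

# Robots protocol: a compliant crawler checks robots.txt before fetching.
rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")
rp.read()
if not rp.can_fetch(UA, URL):
    raise SystemExit("robots.txt forbids crawling this URL")

# General-purpose crawl: grab the entire source code of the page.
response = requests.get(URL, headers={"User-Agent": UA}, timeout=10)
response.raise_for_status()
page_source = response.text

# Focused step: extract local data (here just the <title>) from the grabbed page.
match = re.search(r"<title>(.*?)</title>", page_source, re.S | re.I)
print(match.group(1).strip() if match else "no title found")
```

Dropping the robots.txt check is exactly the non-compliant "break" mentioned above; leaving it in keeps the crawler on the gentleman's side of the protocol.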

Origin: www.cnblogs.com/zzsy/p/12687154.html