Web crawler exam

1. How to crawl static web pages and dynamically loaded web pages, with code:

Crawling static pages:

  In a static page, the data is embedded in the HTML itself (usually fetched with a GET request).

  Fetch the page source with the requests library, then parse it with bs4 or re:

  import requests

  url = " "

  html = requests.get(url).text
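As a minimal sketch of the parse step that follows: the HTML literal below stands in for `requests.get(url).text`, and the markup is invented for illustration; bs4's `BeautifulSoup(html, "html.parser")` could extract the same value.

```python
import re

# Stand-in for: html = requests.get(url).text
# The markup here is invented for illustration.
html = '<html><head><title>Demo</title></head><body><p>hello</p></body></html>'

# Extract the title with re; bs4 would do the same job on the same string.
title = re.search(r"<title>(.*?)</title>", html).group(1)
print(title)  # Demo
```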

Crawling dynamically loaded pages:

  Structured data: json, xml, etc.

  The main difference from a static page is that the data is refreshed via ajax: the page queries new data from the database and re-renders it into the front end without reloading the whole page.

  The data travels in network (XHR) response packets, so it cannot be obtained by crawling the HTML alone.
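For example, a captured XHR response often carries json that can be parsed directly; the response body below is invented for illustration.

```python
import json

# Invented stand-in for the body of an ajax/XHR response
# captured in the browser's Network panel.
body = '{"movies": [{"title": "A"}, {"title": "B"}], "total": 2}'

data = json.loads(body)
titles = [m["title"] for m in data["movies"]]
print(titles)  # ['A', 'B']
```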

There are two common ways to crawl dynamic pages:

  1. Capture the request packets in the browser's Network panel

    Requesting the interface usually requires extra parameters; these parameters must be reverse-engineered, often by cracking the js that generates them.

  2. Headless browser rendering

    selenium is a browser-testing framework that can drive a real browser through webdriver. Wait until the page has finished loading, grab the fully rendered source, and parse it with bs4 or re:

# coding: utf-8
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')  # headless mode: no browser window pops up
driver = webdriver.Chrome(chrome_options=chrome_options)
driver.get('http://public.163.com/#/list/movie')
html = driver.page_source  # full source after the page has rendered
print(html)
driver.quit()
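Going back to the first approach, "cracking" the interface parameters usually means reproducing in Python a value the page computes in js. A hypothetical sketch: the signing formula, salt, and path below are all made up, and a real site's js must be read to find its actual recipe.

```python
import hashlib

# Hypothetical signing scheme: sign = md5(path + timestamp + salt).
# Real sites use their own formulas, found by reading the page's js.
def make_sign(path, ts, salt="demo-salt"):
    raw = f"{path}{ts}{salt}"
    return hashlib.md5(raw.encode("utf-8")).hexdigest()

params = {"page": 1, "ts": 1570000000}
params["sign"] = make_sign("/api/list", params["ts"])
print(params["sign"])  # 32-character hex digest
```

Once the sign checks out against what the browser sends, the interface can be requested directly with requests, skipping the browser entirely.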

 


Origin www.cnblogs.com/lskai/p/11982936.html