1.3 Web page data capture
Li Mu
Bilibili: https://space.bilibili.com/1567748478/channel/collectiondetail?sid=28144
Course homepage: https://c.d2l.ai/stanford-cs329p/
1. Web page data capture
- Web scraping (web page data capture): extracting data from specific websites
- Characteristics: noisy data, spam, and large data scale
- Common applications: price comparison and price-tracking sites
- What is the difference between crawling and scraping?
  - Crawling: collects entire web pages
  - Scraping: a data scientist targets specific web pages and extracts only the data of interest
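To make the distinction concrete, here is a minimal sketch using only Python's standard library. It is not taken from the course: the HTML snippet, the `price` class name, and the `PriceExtractor` and `scrape_prices` names are hypothetical, chosen for illustration. A crawler would store the whole page, while scraping keeps only the data of interest (here, the prices).

```python
from html.parser import HTMLParser

# Hypothetical page content; a real scraper would download this from a website.
PAGE = """
<html><body>
  <h1>Listings</h1>
  <div class="item"><span class="name">Desk</span> <span class="price">$120</span></div>
  <div class="item"><span class="name">Lamp</span> <span class="price">$35</span></div>
</body></html>
"""


class PriceExtractor(HTMLParser):
    """Collects the text inside every element whose class list contains 'price'."""

    def __init__(self):
        super().__init__()
        self._in_price = False
        self.prices = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "price" in classes:
            self._in_price = True

    def handle_endtag(self, tag):
        # Assumes price elements are not nested, which holds for this snippet.
        self._in_price = False

    def handle_data(self, data):
        if self._in_price and data.strip():
            self.prices.append(data.strip())


def scrape_prices(html):
    """Return only the price strings found in the given HTML."""
    parser = PriceExtractor()
    parser.feed(html)
    return parser.prices


crawled = PAGE                  # crawling: keep the entire page
scraped = scrape_prices(PAGE)   # scraping: keep only the data of interest
print(scraped)
```

In practice the selectors depend entirely on the target site's markup, so the class name used here would need to be replaced after inspecting the real page.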
2. Crawler
Problem: on Linux, command-line tools such as curl can fetch pages, but websites generally use various measures to block crawlers;
Solution: use a headless browser, i.e. a browser without a GUI. (A large number of requests from the same IP to the same website within a short period may get that IP banned; using cloud servers can help here.)
- Sample code:
from selenium import webdriver
chrome_op