Web Crawler Review
requests
Must master
Data crawling workflow:
- Specify the url
- Initiate a request
- Fetch response data
- Parse the data
- Persistent storage
get/post parameters (see the sketch after this list):
- url
- data/params
- headers
- proxies
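A minimal sketch of how these parameters are passed with requests; the url, keyword, header, and proxy values below are placeholder assumptions:

    import requests

    url = 'https://www.example.com/search'           # hypothetical target url
    params = {'kw': 'python'}                        # query parameters (use data= for post)
    headers = {'User-Agent': 'Mozilla/5.0'}          # UA spoofing
    proxies = {'https': 'https://127.0.0.1:8888'}    # optional proxy ip

    response = requests.get(url=url, params=params, headers=headers, proxies=proxies)
    page_text = response.text                        # fetch the response data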
Handling Ajax dynamically loaded data:
- Dynamically loaded data: the data is fetched by a separate (Ajax) request
- Use a packet-capture tool to find the data packet that carries the data, then do a local or global search inside it
Simulated login:
- Clicking the login button sends a corresponding POST request
- Dynamic request parameters:
- Usually hidden in the page's front-end source (e.g. hidden input fields)
Captcha-recognition platforms:
- Chaojiying (超级鹰), Yundama (云打码)
Cookie handling:
Manual handling
- Copy the Cookie value from the request headers in the packet-capture tool and paste it into headers
requests.Session() handles cookies automatically (see the sketch below)
- session = requests.Session()
- Example cookie string: aaa=123; time=1506660011
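A minimal sketch of automatic cookie handling with a Session object; the urls and form fields are placeholder assumptions:

    import requests

    session = requests.Session()
    headers = {'User-Agent': 'Mozilla/5.0'}
    # The first request lets the session capture any cookies the site sets
    session.get('https://www.example.com/login', headers=headers)
    # Subsequent requests made through the same session carry those cookies automatically
    response = session.post('https://www.example.com/login',
                            data={'user': 'xxx', 'password': 'xxx'},
                            headers=headers)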
Proxy ip:
- Type: http, https
Thread pool (see the sketch after this list):
- from multiprocessing.dummy import Pool
- pool.map(func, iterable)
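A minimal sketch of fetching pages with a thread pool; the urls are placeholder assumptions:

    import requests
    from multiprocessing.dummy import Pool   # thread pool

    urls = ['https://www.example.com/page/1',
            'https://www.example.com/page/2']

    def get_page(url):
        return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

    pool = Pool(4)                        # 4 worker threads
    pages = pool.map(get_page, urls)      # blocks until every url has been fetched
    pool.close()
    pool.join()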
Image lazy loading: use the corresponding pseudo-attribute (e.g. src2 / data-src) instead of src
+ Single-threaded asynchronous multi-task coroutines (see the sketch after this list)
- Coroutine: a special object; define a function with the async keyword, and calling that function immediately returns a coroutine object
- Task object: a further wrapper around the coroutine object
- Binding a callback: def callback(task): return task.result()
- task.add_done_callback(callback)
- Task list: a list of multiple task objects
- Event loop object: the task list must be registered with the event loop object; once the loop is started, each task in the list is called asynchronously
- aiohttp: an asyncio-based asynchronous network request module
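A minimal sketch of the coroutine / task / event-loop pattern described above, using aiohttp; the urls are placeholder assumptions:

    import asyncio
    import aiohttp

    urls = ['https://www.example.com/a', 'https://www.example.com/b']

    async def get_page(url):                 # async def: calling it returns a coroutine object
        async with aiohttp.ClientSession() as session:
            async with session.get(url) as response:
                return await response.text()

    def callback(task):                      # bound callback, runs after the task finishes
        print(len(task.result()))

    loop = asyncio.get_event_loop()
    tasks = []
    for url in urls:
        task = asyncio.ensure_future(get_page(url))   # wrap the coroutine in a task object
        task.add_done_callback(callback)              # bind the callback
        tasks.append(task)
    loop.run_until_complete(asyncio.wait(tasks))      # register the task list with the event loop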
Data parsing: 1. locate the tag, 2. extract the data
- Regular Expressions:
- bs4:
- xpath:
selenium
- Relation to crawling:
- 1. Conveniently captures dynamically loaded data (whatever is visible in the browser can be obtained)
- 2. Can perform simulated login
- Purpose:
- Automates the relevant browser operations
- Usage:
- Install the package
- Download the browser driver
- Instantiate a browser object
- Code the relevant browser actions
- Close the browser
- The find_element family of functions:
- Locate tags
- switch_to.frame(iframe id)
- Switch into an iframe
- Headless browser
- Evading detection
- PhantomJS (a headless browser)
- Headless Chrome:
- Action chains: from selenium.webdriver import ActionChains (combined usage sketch below)
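A combined sketch of the selenium usage above (headless Chrome, tag location, iframe switching, action chains); the url, element locators, and driver setup are placeholder assumptions, and older selenium versions may also need an explicit chromedriver path:

    from selenium import webdriver
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By

    options = webdriver.ChromeOptions()
    options.add_argument('--headless')            # headless mode
    options.add_argument('--disable-gpu')

    bro = webdriver.Chrome(options=options)       # instantiate the browser object
    bro.get('https://www.example.com')            # hypothetical page

    box = bro.find_element(By.ID, 'kw')           # locate a tag (find_element family)
    box.send_keys('python')                       # a browser action

    # bro.switch_to.frame('iframeResult')         # switch into an iframe before locating tags in it

    # action = ActionChains(bro)                  # action chain: click-and-hold then drag
    # action.click_and_hold(box).move_by_offset(20, 0).release().perform()

    bro.quit()                                    # close the browser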
scrapy
Project creation workflow:
- scrapy startproject proName
- cd proName
- scrapy genspider spiderName www.xxx.com
- scrapy crawl spiderName
Data parsing:
- response.xpath('xpath expression')
- xpath returns a list of Selector objects; the parsed data (string type) is stored inside each Selector object. Use extract() or extract_first() to obtain the string data (see the sketch below)
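A minimal sketch of parsing with response.xpath inside a spider; the spider name, start url, and xpath expressions are placeholder assumptions:

    import scrapy

    class DemoSpider(scrapy.Spider):
        name = 'demo'
        start_urls = ['https://www.example.com']

        def parse(self, response):
            li_list = response.xpath('//ul/li')                   # list of Selector objects
            for li in li_list:
                title = li.xpath('./a/text()').extract_first()    # one string
                links = li.xpath('./a/@href').extract()           # list of strings
                yield {'title': title, 'links': links}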
Persistent storage:
- Terminal-command-based persistent storage
- scrapy crawl spiderName -o filePath
- Pipeline-based (see the sketch after this list):
- Parse the data
- Wrap the parsed data in an item-type object
- Submit the item to the pipeline
- The process_item(self, item, spider) method of the pipeline class receives the item and persists it in any form
- Enable the pipeline in settings
- Precautions:
- When enabling pipelines, the smaller the priority value, the higher the priority
- One pipeline class represents storing data to one specific platform
- Returning item from process_item passes the item on to the next pipeline class to be executed
- Items submitted by the spider go only to the highest-priority pipeline class
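A minimal item / pipeline / settings sketch of the pipeline-based storage described above; the project, class, and field names are placeholder assumptions:

    # items.py
    import scrapy

    class ProItem(scrapy.Item):
        title = scrapy.Field()

    # pipelines.py
    class ProPipeline:
        def open_spider(self, spider):
            self.fp = open('data.txt', 'w', encoding='utf-8')

        def process_item(self, item, spider):
            self.fp.write(item['title'] + '\n')   # any form of persistent storage
            return item                           # pass the item to the next pipeline class

        def close_spider(self, spider):
            self.fp.close()

    # settings.py
    # ITEM_PIPELINES = {'proName.pipelines.ProPipeline': 300}   # smaller value = higher priority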
Handling paginated data:
- Full-site data crawling
- Manual request sending: yield scrapy.Request / scrapy.FormRequest(url, callback, formdata) (see the sketch below)
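A minimal sketch of full-site pagination with manual request sending; the spider name, url template, and page count are placeholder assumptions:

    import scrapy

    class PageSpider(scrapy.Spider):
        name = 'page'
        start_urls = ['https://www.example.com/list?page=1']
        url_tpl = 'https://www.example.com/list?page=%d'
        page = 1

        def parse(self, response):
            # ... parse the current page here ...
            if self.page < 5:                     # crawl the first 5 pages
                self.page += 1
                new_url = self.url_tpl % self.page
                yield scrapy.Request(url=new_url, callback=self.parse)   # manual request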
post request
cookie handling:
Log level:
- LOG_LEVEL = "ERROR"
Request parameter passing:
Use case:
- Implementing depth-based crawling (the data to be crawled is not all on the same page)
How to pass parameters between requests:
yield scrapy.Request/FormRequest(url, callback, meta=...)
- The meta dictionary is passed to the callback; in the callback, use response.meta to receive it (see the sketch below)
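A minimal sketch of passing an item to a detail-page callback via meta; the spider name, urls, and xpath expressions are placeholder assumptions:

    import scrapy

    class DepthSpider(scrapy.Spider):
        name = 'depth'
        start_urls = ['https://www.example.com/list']

        def parse(self, response):
            for li in response.xpath('//ul/li'):
                item = {'title': li.xpath('./a/text()').extract_first()}
                detail_url = li.xpath('./a/@href').extract_first()
                # meta carries the partially-filled item into the callback
                yield scrapy.Request(detail_url, callback=self.parse_detail,
                                     meta={'item': item})

        def parse_detail(self, response):
            item = response.meta['item']          # receive the dictionary from meta
            item['content'] = response.xpath('//div//text()').extract()
            yield item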
How the five core components work:
Downloader middleware (see the sketch after this list):
- process_request: intercepts normal requests
- UA spoofing
- request.headers['User-Agent'] = 'xxx'
- Proxy setting
- request.meta['proxy'] = 'http://ip:port'
- process_response: intercepts responses; tamper with the response content or substitute a different response object
- process_exception
- Intercepts requests that raised an exception
- Corrects the abnormal request
- Returns the corrected request object to re-send the request
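A minimal downloader-middleware sketch covering the three interception methods; the class name, UA pool, and proxy pool are placeholder assumptions:

    # middlewares.py
    import random

    class ProDownloaderMiddleware:
        user_agents = ['Mozilla/5.0 ...']         # hypothetical UA pool
        proxies = ['http://127.0.0.1:8888']       # hypothetical proxy pool

        def process_request(self, request, spider):
            request.headers['User-Agent'] = random.choice(self.user_agents)   # UA spoofing
            # request.meta['proxy'] = random.choice(self.proxies)             # proxy setting
            return None

        def process_response(self, request, response, spider):
            # tamper with the response content or return a different response object here
            return response

        def process_exception(self, request, exception, spider):
            request.meta['proxy'] = random.choice(self.proxies)   # correct the abnormal request
            return request                                        # re-send the corrected request

    # settings.py
    # DOWNLOADER_MIDDLEWARES = {'proName.middlewares.ProDownloaderMiddleware': 543}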
UA pool and proxy pool:
Using selenium in scrapy:
- 1. Add a browser-object attribute to the spider class.
- 2. Close the browser object in the spider's closed(self, spider) method
- 3. In the middleware's process_response, get the browser object and perform the automation actions (such as scrolling the page)
crawlSpider (see the sketch after this list):
- Full-site data crawling
- LinkExtractor(allow=r'regex')
- Extracts links (urls) according to the specified rule (regex)
- Rule(link, callback, follow=True)
- Receives the links from the link extractor, sends requests for them, then parses the requested data according to the specified rule
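A minimal crawlSpider sketch with LinkExtractor and Rule; the spider name, start url, and regex are placeholder assumptions:

    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    class SiteSpider(CrawlSpider):
        name = 'site'
        start_urls = ['https://www.example.com']

        link = LinkExtractor(allow=r'page=\d+')   # extract urls matching the regex
        rules = (
            Rule(link, callback='parse_item', follow=True),   # request them, then parse
        )

        def parse_item(self, response):
            pass   # parse each followed page here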
Distributed crawling
Why native scrapy cannot be distributed:
- The scheduler and the pipeline cannot be shared
scrapy_redis's role:
- Provides a scheduler and pipeline that can be shared (see the sketch below)
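A hedged sketch of a typical scrapy_redis setup; the module paths are the standard scrapy-redis ones, while the spider name and redis key are placeholder assumptions:

    # settings.py
    DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'       # shared request dedup
    SCHEDULER = 'scrapy_redis.scheduler.Scheduler'                   # shared scheduler
    SCHEDULER_PERSIST = True                                         # keep the queue between runs
    ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}   # shared pipeline
    REDIS_HOST = '127.0.0.1'
    REDIS_PORT = 6379

    # spider
    from scrapy_redis.spiders import RedisCrawlSpider

    class FbsSpider(RedisCrawlSpider):
        name = 'fbs'
        redis_key = 'fbsQueue'    # shared start-url queue in redis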
Anti-crawling mechanism summary
robots.txt
UA detection
Captcha
Data encryption
cookie
IP banning
Dynamic token
Dynamically loaded data
JS encryption
JS obfuscation
Image lazy loading
Data cleaning (see the sketch after these notes)
Null detection, drop rows containing nulls: df.dropna(axis=0)
Null detection, fill nulls: df.fillna(method='ffill', axis=0)
Outlier detection and filtering:
Define the condition that identifies an outlier
Duplicate-row detection and removal:
df.drop_duplicates(keep='first')
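A minimal pandas cleaning sketch covering the four operations above; the toy DataFrame and the outlier rule (2 standard deviations) are placeholder assumptions:

    import pandas as pd

    df = pd.DataFrame({'a': [1, None, 3, 3], 'b': [10, 20, 300, 300]})   # toy data

    df_dropped = df.dropna(axis=0)                   # drop rows containing nulls
    df_filled = df.fillna(method='ffill', axis=0)    # forward-fill nulls

    limit = df['b'].mean() + 2 * df['b'].std()       # condition that defines an outlier
    df_no_outliers = df[df['b'] <= limit]            # filter outlier rows

    df_unique = df.drop_duplicates(keep='first')     # drop duplicate rows, keep the first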
Interview questions
- Describe anti-crawling problems you have encountered while crawling and how you solved them.
- How do you improve crawler efficiency?
- requests + thread pool
- asyncio + aiohttp
- scrapy
- Distributed crawling (the ultimate option)
- How much data has your crawler crawled?
- List the Python modules you have used for web crawling.
- Network requests: urllib, requests, aiohttp
- Data parsing: re, bs4, lxml
- selenium
- dummy, asyncio, pyExcl
- What is the requests module for, and how is it used?
- What is the beautifulsoup module for, and how is it used?
- What is the selenium module for, and how is it used?
- Outline the workflow of each component in the scrapy framework.
- How do you set up a proxy in the scrapy framework (two ways)?
- How does the scrapy framework download large files?
- How does scrapy implement rate limiting?
- How does scrapy pause and resume a crawl?
- How do you add custom commands to scrapy?
- How does scrapy record the crawl depth?
- How do pipelines work in scrapy?
- How does a scrapy pipeline discard an item object?
- Briefly describe the roles of scrapy's spider middleware and downloader middleware.
- What is the role of the scrapy-redis component?
- How does the scrapy-redis component deduplicate requests?
- How does the scrapy-redis scheduler implement depth-first and breadth-first crawling?
Crawling data on the mobile side:
- fiddler, Charles (青花瓷), mitmproxy
What types of data have you crawled, and roughly how much?
E-commerce, medical devices, news, stocks, finance, recruitment, project bidding
Around a million records or so for some; roughly 200,000 (20w) for the rest ...
Crawler frameworks
- scrapy, pyspider (just be aware of it)
Talk about your understanding of scrapy
- scrapy's functional modules
- The workflow of the five core components
How to parse a local fragment of page data that still carries its tags:
- Use bs4
Understanding middleware
- Downloader middleware
- Role: intercept requests and responses in batches
- Intercepting requests
- UA spoofing, proxies
- Intercepting responses
- Tampering with the response content
How to detect website data updates?
Incremental crawling
Scheduled crawls, e.g. a shell script run on a timer
Depth-first (scrapy's default): not all nodes are kept, so the memory footprint is smaller, but it runs more slowly
Breadth-first: all nodes are kept, so the memory footprint is larger, but it runs faster
Know some machine learning basics
sklearn  # entry level, but it packages a lot and can be used directly
Linear regression
KNN (handwritten digit recognition, captcha recognition)
House price prediction