Web Crawler Review

requests

Points to master

Data crawling workflow (a requests sketch follows the parameter list below):

  1. Specify the URL
  2. Initiate the request
  3. Fetch the response data
  4. Parse the data
  5. Persist the data

get, post parameters:

  • url
  • data/params
  • headers
  • proxies
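
A minimal sketch of these steps with requests; the URL, parameters, and proxy address are placeholder values.

```python
import requests

# 1. specify the url, 2. initiate the request, 3. fetch the response data,
# 4/5. parse and persist it. All concrete values below are placeholders.
url = 'https://www.example.com/search'
params = {'kw': 'python'}                       # query-string / form parameters
headers = {'User-Agent': 'Mozilla/5.0'}         # UA spoofing
proxies = {'https': 'https://127.0.0.1:8888'}   # pass proxies=proxies to route through a proxy

response = requests.get(url, params=params, headers=headers)
response.encoding = 'utf-8'
page_text = response.text                       # .json() for JSON, .content for bytes

with open('./page.html', 'w', encoding='utf-8') as fp:
    fp.write(page_text)                         # persistent storage
```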

Handling Ajax dynamically loaded data:

  • Dynamically loaded data comes from a separate request, not from the page URL itself
  • Use a packet capture tool to locate the corresponding data packet, searching the captured packets locally or globally
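
A sketch of requesting such an endpoint directly; the Ajax URL and its parameters are hypothetical stand-ins for whatever the capture tool reveals.

```python
import requests

# Hypothetical Ajax endpoint found with a packet capture tool; request it
# directly and parse the JSON it returns instead of the rendered page.
ajax_url = 'https://www.example.com/api/list'
data = {'pageIndex': 1, 'pageSize': 20}
headers = {'User-Agent': 'Mozilla/5.0'}

json_data = requests.post(ajax_url, data=data, headers=headers).json()
for row in json_data.get('list', []):
    print(row)
```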

Simulated login:

  • Clicking the login button fires a corresponding POST request
  • Dynamic request parameters:
    • Usually hidden in the login page's HTML
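
A sketch of a simulated login, assuming the hidden dynamic parameter is an input named token in the login page; all URLs, field names, and the xpath are hypothetical.

```python
import requests
from lxml import etree

# Hypothetical login flow: parse the dynamic parameter out of the login page,
# then POST the form through a Session so the login cookie is kept.
session = requests.Session()
headers = {'User-Agent': 'Mozilla/5.0'}

login_page = session.get('https://www.example.com/login', headers=headers).text
token = etree.HTML(login_page).xpath('//input[@name="token"]/@value')[0]

form_data = {'username': 'xxx', 'password': 'xxx', 'token': token}
resp = session.post('https://www.example.com/login', data=form_data, headers=headers)
print(resp.status_code)    # the session now carries the login cookie
```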

Captcha-solving platforms:

  • Chaojiying (超级鹰), Yundama (云打码)

cookie handling:

  • Manual processing

    • Copy the Cookie value from the request headers shown in the packet capture tool and paste it into the headers dict
  • requests.Session() handles cookies automatically

    • session = requests.Session()

    Example Cookie header value: aaa=123; time=1506660011
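
A sketch contrasting the two approaches, reusing the example cookie string above; the URLs are placeholders.

```python
import requests

# Manual handling: paste the Cookie header captured from the packet tool.
headers = {
    'User-Agent': 'Mozilla/5.0',
    'Cookie': 'aaa=123; time=1506660011',          # value copied from the capture tool
}
requests.get('https://www.example.com/profile', headers=headers)

# Automatic handling: a Session object stores and resends cookies by itself.
session = requests.Session()
session.get('https://www.example.com/login')       # response cookies saved in the session
session.get('https://www.example.com/profile')     # cookies sent along automatically
```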

Proxy ip:

  • Type: http, https

Thread Pool:

  • from multiprocessing.dummy import Pool
  • pool.map(func, url_list): apply the blocking function to every element concurrently
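
A minimal sketch of the thread-pool pattern with placeholder URLs.

```python
import requests
from multiprocessing.dummy import Pool   # thread-based Pool

urls = ['https://www.example.com/page/%d' % i for i in range(1, 6)]  # placeholders

def get_page(url):
    # blocking network request executed inside a worker thread
    return requests.get(url, headers={'User-Agent': 'Mozilla/5.0'}).text

pool = Pool(5)                    # 5 worker threads
pages = pool.map(get_page, urls)  # map(func, list): fetch every url concurrently
pool.close()
pool.join()
```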

Lazy-loaded images: the real image URL sits in a pseudo attribute (e.g. src2 or original) instead of src

Single-threaded asynchronous multi-task coroutines

  • Coroutine: a special object; calling a function defined with the async keyword immediately returns a coroutine object
  • Task object: a further wrapper around the coroutine
    • Binding a callback: def callback(task): return task.result()
    • task.add_done_callback(callback)
  • Task list: a list holding multiple task objects
  • Event loop object: the task list must be registered with the event loop; once the loop is started, it calls every task in the list asynchronously
  • aiohttp: an asynchronous network request module (see the sketch below)
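
A minimal sketch of the coroutine / task / callback / event-loop pieces with aiohttp; the URLs are placeholders, and asyncio.run is used to create and start the event loop.

```python
import asyncio
import aiohttp

urls = ['https://www.example.com/page/%d' % i for i in range(1, 4)]  # placeholder urls

async def get_page(url):
    # calling an async-defined function returns a coroutine object
    async with aiohttp.ClientSession() as session:
        async with session.get(url) as response:
            return await response.text()

def callback(task):
    # bound callback: task.result() holds the coroutine's return value
    print(len(task.result()))

async def main():
    tasks = []                                        # task list
    for url in urls:
        task = asyncio.ensure_future(get_page(url))   # wrap the coroutine in a task object
        task.add_done_callback(callback)
        tasks.append(task)
    await asyncio.wait(tasks)                         # run all tasks asynchronously

asyncio.run(main())   # creates the event loop, registers the tasks, and starts it
```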

Data parsing: first locate the tag, then extract the data it carries

  • Regular Expressions:
  • bs4:
  • xpath:
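
A small side-by-side sketch of the three parsing approaches on a toy HTML snippet.

```python
import re
from bs4 import BeautifulSoup
from lxml import etree

html = '<div class="song"><a href="https://www.example.com">tag text</a></div>'

# Regular expression
print(re.findall(r'href="(.*?)"', html))

# bs4: locate the tag, then read its text / attributes
soup = BeautifulSoup(html, 'lxml')
print(soup.select('.song > a')[0].text, soup.a['href'])

# xpath: locate the tag, /text() for text, /@attr for attributes
tree = etree.HTML(html)
print(tree.xpath('//div[@class="song"]/a/text()'), tree.xpath('//div/a/@href'))
```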

selenium

  • Relation to crawling:
    • 1. Conveniently captures dynamically loaded data (whatever is visible in the browser can be obtained)
    • 2. Can perform simulated login
  • Purpose:
    • Automates browser operations
  • Usage steps:
    • Install the environment
    • Download the browser driver
    • Instantiate a browser object
    • Script the relevant actions
    • Close the browser
  • find_element family of functions:
    • Locate tags
  • switch_to.frame(iframe id)
    • Switch into an iframe
  • Headless browser
  • Evading detection
  • PhantomJS (a headless browser)
  • Google Chrome headless mode
  • Action chains: from selenium.webdriver import ActionChains
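
A sketch of these steps, assuming a Chrome driver is available on PATH; the commented calls mark where element location, iframe switching, and action chains would go, since the placeholder page has no real elements to act on.

```python
import time
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')                                      # headless mode
options.add_argument('--disable-blink-features=AutomationControlled')   # evade detection

browser = webdriver.Chrome(options=options)     # instantiate the browser object
browser.get('https://www.example.com')

# find_element family: locate a tag, then act on it
# element = browser.find_element(By.ID, 'kw')
# element.send_keys('python')

# switch into an iframe before locating tags inside it
# browser.switch_to.frame('iframe_id')

# action chains for drag / click-and-hold interactions
# ActionChains(browser).click_and_hold(element).move_by_offset(50, 0).perform()

time.sleep(2)
browser.quit()                                  # close the browser
```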

scrapy

  • Project creation process:

    • scrapy startproject proName
    • cd proName
    • scrapy genspider spiderName www.xxx.com
    • scrapy crawl spiderName
  • Data parsing:

    • response.xpath('xpath expression')
    • xpath returns a list of Selector objects; the parsed string data is stored inside each Selector, and extract() or extract_first() pulls the string out (see the spider/pipeline sketch at the end of this scrapy section)
  • Persistent storage:

    • Terminal-command-based persistent storage
      • scrapy crawl spiderName -o filePath
    • Pipeline-based persistent storage:
      • Parse the data
      • Package the parsed data into an Item-type object
      • Submit the item to the pipeline
      • The pipeline class's process_item(self, item, spider) method receives the item and persists it in any form
      • Enable the pipeline in settings.py
      • Notes:
        • When enabling pipelines, each is given a priority value; the smaller the value, the higher the priority
        • One pipeline class stores data to one specific platform
        • The item returned by process_item is passed on to the next pipeline class to be executed
        • Items submitted by the spider are delivered only to the highest-priority pipeline
  • Handling paginated data:

    • Full-site data crawling
      • Manual request sending: yield scrapy.Request(url, callback) / scrapy.FormRequest(url, callback, formdata)
  • post request

  • cookie handling:

  • Log level:

    • LOG_LEVEL = "ERROR"
  • Request parameter passing:

    • When it is needed:

      • Deep crawling (the data to crawl is not all located on the same page)
    • How to pass parameters with a request:

      • yield scrapy.Request/FormRequest(url, callback, meta={...})

        The meta dictionary is handed to the callback; use response.meta inside the callback to receive it

  • The five core components and how they work together:

  • Downloader middleware:

    • process_request: intercepts normal requests
      • UA spoofing
        • request.headers['User-Agent'] = 'xxx'
      • Setting a proxy
    • process_response: intercepts responses; tamper with the response content or swap in a new response object
    • process_exception
      • Intercepts requests that raised exceptions
      • Corrects the abnormal request
      • return request resends the corrected request object
  • UA pool and proxy pool:

  • Application of selenium in scrapy:

    • 1. Add the browser object as an attribute of the spider
    • 2. Close the browser object in the spider's closed(self, spider) method
    • 3. In the middleware's process_response, fetch the spider's browser object and run the automation actions (e.g. scrolling or dragging)
  • CrawlSpider:

    • Full-site data crawling
    • LinkExtractor(allow=r'regex')
      • Extracts links (urls) matching the specified rule (regex)
    • Rule(link, callback, follow=True)
      • Receives the links extracted by the link extractor, sends requests to them, and parses the responses with the specified callback
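
A minimal sketch tying the parsing, pagination, meta passing, pipeline, and downloader-middleware steps above together, inside a hypothetical proName project; all xpaths, URLs, class names, and field names are placeholders.

```python
# spiderName.py -- hypothetical spider in a proName project
import scrapy

class SpidernameSpider(scrapy.Spider):
    name = 'spiderName'
    start_urls = ['https://www.xxx.com/list/1.html']
    page = 1

    def parse(self, response):
        # response.xpath returns Selector objects; extract_first() unwraps the string
        for div in response.xpath('//div[@class="item"]'):
            title = div.xpath('./a/text()').extract_first()
            yield {'title': title}                    # item submitted to the pipeline

        if self.page < 5:                             # manual request for the next page
            self.page += 1
            next_url = 'https://www.xxx.com/list/%d.html' % self.page
            # meta is handed to the callback and read there via response.meta
            yield scrapy.Request(next_url, callback=self.parse, meta={'page': self.page})


# pipelines.py -- enable in settings.py:
#   ITEM_PIPELINES = {'proName.pipelines.PronamePipeline': 300}
class PronamePipeline:
    def open_spider(self, spider):
        self.fp = open('./data.txt', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.fp.write(item['title'] + '\n')           # any form of persistent storage
        return item                                   # pass the item to the next pipeline

    def close_spider(self, spider):
        self.fp.close()


# middlewares.py -- downloader middleware sketch (UA list and proxy are placeholders)
import random

class PronameDownloaderMiddleware:
    user_agents = ['Mozilla/5.0 ...']                 # UA pool

    def process_request(self, request, spider):
        request.headers['User-Agent'] = random.choice(self.user_agents)   # UA spoofing
        # request.meta['proxy'] = 'https://ip:port'                       # proxy setting
        return None

    def process_response(self, request, response, spider):
        return response                               # or a tampered / replacement response

    def process_exception(self, request, exception, spider):
        return request                                # resend the corrected request
```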

Distributed crawling

Why scrapy alone cannot implement distributed crawling:

  • Its scheduler and pipeline cannot be shared across machines

Role of scrapy_redis:

  • Provides a shareable scheduler and pipeline (a settings sketch follows)
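
A sketch of the scrapy_redis settings that provide the shared scheduler, dedup filter, and pipeline; the redis host and port are placeholders.

```python
# settings.py sketch for scrapy_redis

# shared dedup filter and shared scheduler provided by scrapy_redis
DUPEFILTER_CLASS = 'scrapy_redis.dupefilter.RFPDupeFilter'
SCHEDULER = 'scrapy_redis.scheduler.Scheduler'
SCHEDULER_PERSIST = True      # keep the request queue / fingerprints after a crawl

# shared pipeline that writes items into redis
ITEM_PIPELINES = {'scrapy_redis.pipelines.RedisPipeline': 400}

# the redis server every distributed node connects to (placeholder address)
REDIS_HOST = '127.0.0.1'
REDIS_PORT = 6379
```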

Summary of anti-crawling mechanisms

  • robots.txt
  • UA detection
  • Captchas
  • Data encryption
  • Cookies
  • IP banning
  • Dynamic tokens
  • Dynamically loaded data
  • JS encryption
  • JS obfuscation
  • Image lazy loading

Data cleaning

  • Null detection, dropping rows containing nulls: df.dropna(axis=0)
  • Null detection, filling nulls: df.fillna(method='ffill', axis=0)
  • Outlier detection and filtering:
    • Define the condition that marks a value as an outlier
  • Duplicate-row detection and removal:
    • df.drop_duplicates(keep='first')
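
A small sketch of these cleaning steps on a toy DataFrame; the outlier condition ("more than twice the column mean") is just an illustrative choice.

```python
import pandas as pd

# Toy DataFrame illustrating the cleaning steps above.
df = pd.DataFrame({'price': [10, None, 10, 500], 'qty': [1, 2, 1, 3]})

df = df.fillna(method='ffill', axis=0)           # or df.dropna(axis=0) to drop null rows

# outlier filtering: keep rows that satisfy the chosen condition,
# here "price no more than twice the column mean"
df = df[df['price'] <= 2 * df['price'].mean()]

df = df.drop_duplicates(keep='first')            # drop duplicate rows, keep the first
print(df)
```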

Interview questions

  1. Describe anti-crawling problems you have encountered while crawling and how you solved them.
  2. How can crawler efficiency be improved?
    1. requests + thread pool
    2. asyncio + aiohttp
    3. scrapy
    4. Distributed crawling (the ultimate option)
  3. Roughly how much data have your crawlers collected?
  4. List the Python modules you have used for web crawling.
    • Network requests: urllib, requests, aiohttp
    • Data parsing: re, bs4, lxml
    • selenium
    • multiprocessing.dummy, asyncio, pyExcl
  5. What is the requests module for, and how is it used?
  6. What is the BeautifulSoup module for, and how is it used?
  7. What is the selenium module for, and how is it used?
  8. Outline the workflow of each component in the scrapy framework.
  9. How do you set up a proxy in scrapy (two ways)?
  10. How does scrapy download large files?
  11. How do you rate-limit requests in scrapy?
  12. How do you pause and resume a scrapy crawl?
  13. How do you add custom commands to scrapy?
  14. How does scrapy record and limit crawl depth?
  15. How do pipelines work in scrapy?
  16. How can a scrapy pipeline discard an item object?
  17. Briefly describe the roles of spider middleware and downloader middleware in scrapy.
  18. What is the role of the scrapy-redis component?
  19. How does scrapy-redis implement request deduplication?
  20. How does the scrapy-redis scheduler implement depth-first and breadth-first crawling?

Capturing data from mobile apps:

  • Fiddler, Charles (青花瓷), mitmproxy

What types of data have you crawled, and roughly how much?

  • E-commerce, medical devices, news, stocks, finance, recruitment, project bidding

    On the order of 1 million records for the larger sources, around 200,000 (20w) for the rest ...

Crawler frameworks

  • scrapy, pyspider (just be aware of the latter)

Talk about your understanding of scrapy

  • scrapy's functional modules
  • The workflow of the five core components

How to extract a portion of page data together with its tags:

  • Use bs4

Understanding middleware

  • Downloader middleware
    • Role: intercept requests and responses in batch
      • Intercepting requests
        • UA spoofing, setting proxies
      • Intercepting responses
        • Tampering with the response content

How do you detect when a site's data is updated?

  • Incremental crawling

    Crawl on a schedule, e.g. a shell script run on a timer

Depth-first (scrapy's default): not all nodes are kept, so the memory footprint is small; runs fast

Breadth-first: all nodes are kept, so the memory footprint is large; runs slower

Know a little about machine learning

  • sklearn  # entry level, but it packages many ready-to-use algorithms

    • Linear regression (e.g. predicting house prices)

    • KNN (handwritten digit recognition, captcha recognition)
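
A minimal sklearn sketch of the KNN use case above, run on the library's built-in handwritten digits dataset.

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# load the handwritten digits dataset and split off a test set
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))   # accuracy on the held-out digits
```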

Origin www.cnblogs.com/Doner/p/11468658.html