A highly practical reference that progresses from the basics to advanced topics. It starts with environment configuration, the principles of crawlers, and the basic request and parsing libraries, then moves on to data storage, laying the foundation step by step. It then introduces, by category, Ajax, dynamic pages, verification code recognition, App crawling, proxy use, and simulated login. The final part covers the pyspider and Scrapy frameworks with examples, distributed deployment, and more. The book introduces many very practical tools, such as Selenium and Splash for dynamic web crawling, Charles, mitmdump, and Appium for App crawling, and Scrapyd and Gerapy for distributed crawler deployment. Both the explanations and the source code can be put to use directly.
About the Author
Cui Qingcai is a software engineer at Microsoft (China) and holds a master's degree from Beihang University. His main research interests include web crawlers, web development, and machine learning.
Brief Introduction
This book describes how to develop web crawlers using Python 3. Compared with the first edition, this edition provides a dedicated practice platform for the hands-on project of each knowledge point, avoiding the problem of cases going stale. In addition, the content has been enriched and updated with new topics such as asynchronous crawlers, JavaScript reverse engineering, App reverse engineering, intelligent page parsing, deep-learning-based verification code recognition, and Kubernetes operation and deployment.
Big news: the book comes recommended by Guido van Rossum, the father of Python!
Everyone knows who the father of Python is, right? It's Guido van Rossum, who created Python in 1989.
Table of Contents
The full table of contents is posted below. The book is divided into 15 chapters, summarized as follows:
Chapter 1: Introduces the detailed configuration process for all environments used in this book, covering the three major platforms: Windows, Linux, and Mac. You don't need to read this chapter straight through; refer to it as needed.
Chapter 2: Introduces the background knowledge you need before learning crawlers, such as HTTP, crawlers, the basic principles of proxies, and the basic structure of web pages. Readers who know nothing about crawlers are advised to study this chapter first.
Chapter 3: Introduces the most basic crawler operations; learning crawlers generally starts here. The chapter covers the two most basic request libraries (urllib and requests) and the basic usage of regular expressions. After this chapter, you will have mastered the most basic crawling techniques.
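As a minimal sketch of the Chapter 3 workflow (not code from the book), the snippet below extracts a page title with a regular expression. It parses a fixed HTML string so it runs offline; for a real fetch you would download the page first with urllib or requests.

```python
import re
from urllib.request import Request, urlopen  # stdlib request library covered in the chapter

# Hypothetical offline stand-in for a downloaded page; replace with
# urlopen(Request(url)).read().decode("utf-8") for a real request.
html = "<html><head><title>Example Domain</title></head><body></body></html>"

# A non-greedy regular expression pulls the title text out of the raw HTML.
match = re.search(r"<title>(.*?)</title>", html)
title = match.group(1) if match else None
print(title)  # → Example Domain
```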
Chapter 4: Introduces the basic usage of the page parsing libraries, including Beautiful Soup, XPath, and pyquery, which make information extraction more convenient and faster and are essential tools for crawlers.
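To illustrate the XPath style of extraction without third-party dependencies, here is a sketch using the standard library's `xml.etree.ElementTree`, which supports a limited subset of XPath (the book uses lxml, Beautiful Soup, and pyquery, which are far more capable):

```python
from xml.etree import ElementTree

# A small, well-formed snippet standing in for a downloaded page.
html = """<html><body>
  <ul>
    <li class="item"><a href="/a">First</a></li>
    <li class="item"><a href="/b">Second</a></li>
  </ul>
</body></html>"""

root = ElementTree.fromstring(html)
# XPath-style query: every <a> inside an <li class="item">.
links = [a.text for a in root.findall(".//li[@class='item']/a")]
print(links)  # → ['First', 'Second']
```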
Chapter 5: Introduces the common forms of data storage and the corresponding operations, including storage in files such as TXT, JSON, and CSV, as well as basic storage operations with the relational database MySQL, the non-relational database MongoDB, and Redis, so that crawled data can be saved flexibly and conveniently.
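A minimal sketch of the file-based storage covered in Chapter 5 (the database examples require running servers, so only JSON and CSV are shown; the field names are made up for illustration):

```python
import csv
import json
import os
import tempfile

# Hypothetical scraped records.
rows = [
    {"name": "item1", "price": 9.9},
    {"name": "item2", "price": 19.9},
]

tmpdir = tempfile.mkdtemp()

# JSON: dump the whole list in one call.
json_path = os.path.join(tmpdir, "data.json")
with open(json_path, "w", encoding="utf-8") as f:
    json.dump(rows, f, ensure_ascii=False)

# CSV: header written once, then one dict per row.
csv_path = os.path.join(tmpdir, "data.csv")
with open(csv_path, "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(rows)

with open(json_path, encoding="utf-8") as f:
    reloaded = json.load(f)
print(reloaded[0]["name"])  # → item1
```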
Chapter 6: Introduces the process of crawling Ajax-loaded data. Some web pages load their data through Ajax requests to an API, so they cannot be crawled by conventional methods. This chapter introduces how to crawl such data.
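The core idea behind Ajax crawling is to skip the HTML entirely and request the JSON endpoint the page itself calls. A hedged sketch (the endpoint and fields here are invented; a canned response is parsed so the example runs offline):

```python
import json
from urllib.request import Request, urlopen

# In practice you find the API URL in the browser DevTools Network panel
# and request it directly, e.g.:
#   req = Request(api_url, headers={"X-Requested-With": "XMLHttpRequest"})
#   payload = json.loads(urlopen(req).read().decode("utf-8"))
# Here a canned response body stands in for the network call.
response_body = '{"data": [{"title": "Post 1"}, {"title": "Post 2"}], "has_more": false}'

payload = json.loads(response_body)
titles = [item["title"] for item in payload["data"]]
print(titles)  # → ['Post 1', 'Post 2']
```

Because the endpoint returns structured JSON, no HTML parsing is needed at all.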
Chapter 7: Introduces the crawling of dynamically rendered pages. More and more website content is now rendered by JavaScript, so the original HTML may contain no useful content, and the rendering process may involve JavaScript encryption algorithms. This chapter shows how to use tools such as Selenium and Splash to crawl data by simulating a browser.
Chapter 8: Introduces methods for handling verification codes. Verification codes are an important anti-crawler measure for websites; this chapter covers solutions for various kinds, including graphical verification codes, GeeTest slide verification codes, touch/click verification codes, and the Weibo grid verification code.
Chapter 9: Introduces the use of proxy. Restricting IP access is also an important measure for website anti-crawlers. In addition, we can also use proxies to disguise the real IP of crawlers. Using proxies can effectively solve this problem. Through this chapter, we learned how to use proxies, how to maintain proxy pools, and how to use ADSL proxies
Chapter 10: Introduces simulated login and crawling. Some websites require logging in before the desired content can be seen. This chapter introduces the most basic simulated-login methods and how to maintain a Cookies pool.
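The piece that makes simulated login persist across crawler runs is cookie storage. A sketch with the standard library's `http.cookiejar` (the cookie values here are invented; a real session cookie would come from the login response):

```python
import os
import tempfile
from http.cookiejar import Cookie, MozillaCookieJar

# Pretend this session cookie was set by a successful login response.
jar = MozillaCookieJar()
jar.set_cookie(Cookie(
    version=0, name="sessionid", value="abc123", port=None, port_specified=False,
    domain="example.com", domain_specified=True, domain_initial_dot=False,
    path="/", path_specified=True, secure=False, expires=2147483647,
    discard=False, comment=None, comment_url=None, rest={}, rfc2109=False,
))

# Save to disk so a later crawler run can stay logged in.
path = os.path.join(tempfile.mkdtemp(), "cookies.txt")
jar.save(path, ignore_discard=True, ignore_expires=True)

restored = MozillaCookieJar()
restored.load(path, ignore_discard=True, ignore_expires=True)
print([c.name for c in restored])  # → ['sessionid']
```

A shared store of many such cookie sets is essentially what the chapter's Cookies pool maintains.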
Chapter 11: Introduces methods for crawling Apps, including the use of the packet-capture tools Charles and mitmproxy. It also introduces how to connect mitmdump to Python scripts for real-time crawling, and how to use Appium to fully simulate operations on a mobile App for crawling.
Chapter 12: Introduces the pyspider crawler framework and its usage. The framework is simple, easy to use, and powerful, and can save a lot of crawler development time. This chapter demonstrates crawler development with the framework through a case study.
Chapter 13: Introduces the Scrapy crawler framework and its usage. Scrapy is currently the most widely used crawler framework. This chapter introduces its basic architecture and principles, the usage of each component, common Scrapy configuration options, and how to integrate it with Docker.
Chapter 14: Introduces the basic principles and implementation of distributed crawlers. Distributed crawling is essential for improving crawling efficiency. This chapter introduces how to implement a distributed crawler with Scrapy and Redis.
Chapter 15: Introduces the deployment and management of distributed crawlers. Completing distributed deployment conveniently and quickly can save developers a great deal of time. This chapter shows how to deploy and manage distributed crawlers using tools such as Scrapy, Scrapyd, Docker, and Gerapy.
There is far too much material in the book to show it all here!