Strongly recommended by the father of Python: Python 3 Web Crawler Development Practice, a must-read book for getting started with crawlers, with a Douban score of 9.2

A highly practical reference book that progresses from the simple to the advanced. It starts with environment configuration and the principles of crawlers, moves on to the basic request and parsing libraries and then data storage, laying the foundation step by step, and then covers Ajax, dynamically rendered pages, CAPTCHA recognition, App crawling, proxy usage, and simulated login by topic. The final part explains the pyspider and Scrapy frameworks with examples, distributed deployment, and more. The book introduces many very practical tools, such as Selenium and Splash for crawling dynamic web pages, Charles, mitmdump, and Appium for App crawling, and Scrapyd and Gerapy for distributed crawler deployment. Both the knowledge points and the source code can be put to use directly.

About the Author

Cui Qingcai, software engineer at Microsoft (China) and a master's graduate of Beihang University, mainly researches web crawling, web development, machine learning, and related directions.

Brief Introduction

This book describes how to develop web crawlers with Python 3. Compared with the first edition, it provides a dedicated practice platform for the hands-on project of each knowledge point, avoiding the problem of case sites going out of date.

In addition, topics such as asynchronous crawlers, JavaScript reverse engineering, App reverse engineering, intelligent page parsing, CAPTCHA recognition with deep learning, and Kubernetes operations and deployment have been added, and the existing content has been enriched and updated.

Big News

Recommended by the father of Python, Guido van Rossum!

Everyone should know who the father of Python is, right? It's Guido van Rossum, who began writing Python in 1989.

Table of Contents

The table of contents of the whole book is summarized below. The book is divided into 15 chapters:

Chapter 1: Introduces the detailed configuration process for all the environments used in this book, covering the three major platforms: Windows, Linux, and Mac. You don't need to read this chapter straight through; refer to it whenever you need it.

Chapter 2: Introduces the background knowledge you need before learning crawlers, such as HTTP, the basic principles of crawlers and proxies, and the basic structure of web pages. Readers who know nothing about crawlers are advised to study this chapter first.

Chapter 3: Introduces the most basic crawler operations; learning crawlers usually starts here. The chapter covers the two most basic request libraries (urllib and requests) and the basic usage of regular expressions. After this chapter, you can master the most basic crawling techniques.
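
To give a flavor of what this chapter teaches, here is a minimal sketch of my own (not code from the book) that fetches a page with requests and pulls out its title with a regular expression; the URL is just a placeholder:

```python
import re
import requests

# Fetch a page and extract its <title> with a regular expression.
html = requests.get("https://example.com", timeout=10).text
match = re.search(r"<title>(.*?)</title>", html, re.S)
if match:
    print(match.group(1))
```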

Chapter 4: Introduces the basic usage of page parsing libraries, including Beautiful Soup, XPath, and pyquery, which make information extraction more convenient and efficient and are essential tools for crawlers.
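
As a rough illustration (my sketch, using a made-up HTML snippet), the same link can be extracted with each of the three libraries the chapter covers:

```python
from bs4 import BeautifulSoup
from lxml import etree
from pyquery import PyQuery as pq

html = '<div><a href="https://example.com" class="link">Example</a></div>'

# The same href extracted three ways
print(BeautifulSoup(html, "lxml").find("a", class_="link").get("href"))      # Beautiful Soup
print(etree.HTML(html).xpath('//a[@class="link"]/@href')[0])                 # XPath via lxml
print(pq(html)("a.link").attr("href"))                                       # pyquery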

Chapter 5: Introduces the common forms and operations of data storage, including saving to files such as TXT, JSON, and CSV, and the basic storage operations of the relational database MySQL, the non-relational database MongoDB, and Redis, so that crawled data can be saved flexibly and conveniently.
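
A minimal sketch of two of these storage targets (my example data, assuming a local MongoDB instance on the default port):

```python
import json
from pymongo import MongoClient

items = [{"title": "Example", "url": "https://example.com"}]

# Save to a JSON file
with open("items.json", "w", encoding="utf-8") as f:
    json.dump(items, f, ensure_ascii=False, indent=2)

# Save to MongoDB (assumes a local instance on port 27017)
client = MongoClient("localhost", 27017)
client["crawler"]["items"].insert_many(items)
```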

Chapter 6: Introduces the process of crawling Ajax data. Some web pages load their data via Ajax requests to an API, so they cannot be crawled by conventional means. This chapter introduces how to crawl such data by requesting the Ajax interfaces directly.
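
The basic idea looks roughly like the sketch below (mine, with a hypothetical API endpoint and response shape); the real endpoint is found in the browser's developer tools under the Network (XHR/Fetch) panel:

```python
import requests

# Hypothetical JSON API that a page loads via Ajax; adjust to the real endpoint.
api_url = "https://example.com/api/articles?page=1"
headers = {"X-Requested-With": "XMLHttpRequest"}
data = requests.get(api_url, headers=headers, timeout=10).json()
for item in data.get("results", []):
    print(item.get("title"))
```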

Chapter 7: Introduces the crawling of dynamically rendered pages. More and more website content is rendered with JavaScript, so the raw HTML may not contain any useful content, and the rendering process may involve JavaScript encryption algorithms. This chapter shows how to use tools such as Selenium and Splash to simulate a browser for data crawling.
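
A minimal Selenium sketch (my example, assuming Chrome and a matching ChromeDriver are installed; the target is a public JavaScript-rendered demo page):

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
try:
    driver.get("https://quotes.toscrape.com/js/")  # content rendered by JavaScript
    for q in driver.find_elements(By.CSS_SELECTOR, "div.quote span.text"):
        print(q.text)
finally:
    driver.quit()
```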

Chapter 8: Introduces how to handle CAPTCHAs. CAPTCHAs are an important anti-crawling measure for websites; through this chapter you can learn solutions for various kinds of CAPTCHAs, including graphical CAPTCHAs, GeeTest slider CAPTCHAs, touch (click) CAPTCHAs, and Weibo grid CAPTCHAs.

Chapter 9: Introduces the use of proxies. Restricting access by IP is another important anti-crawling measure for websites, and proxies can be used to disguise a crawler's real IP and work around it. This chapter explains how to use proxies, how to maintain a proxy pool, and how to use ADSL dial-up proxies.
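
Using a proxy with requests looks roughly like this (my sketch; the proxy address is hypothetical and should be replaced with your own proxy or a proxy-pool entry):

```python
import requests

proxies = {
    "http": "http://127.0.0.1:7890",   # placeholder proxy address
    "https": "http://127.0.0.1:7890",
}
resp = requests.get("https://httpbin.org/ip", proxies=proxies, timeout=10)
print(resp.json())  # shows the exit IP the target server sees
```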

 

Chapter 10: Introduces simulated login and crawling. Some websites require logging in before the desired content can be viewed. This chapter introduces the most basic simulated login method and how to maintain a Cookies pool.
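
The most basic form of simulated login is a session that keeps cookies across requests; a minimal sketch of mine (the login URL and form fields are hypothetical, and real sites often also require tokens):

```python
import requests

session = requests.Session()
session.post(
    "https://example.com/login",                      # placeholder login endpoint
    data={"username": "user", "password": "pass"},    # placeholder form fields
)
# The Session object keeps the login cookies, so later requests stay authenticated.
resp = session.get("https://example.com/profile")
print(resp.status_code)
```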

Chapter 11: Introduces methods for crawling Apps, including the use of the packet-capturing tools Charles and mitmproxy. It also introduces how to connect mitmdump to Python scripts for real-time crawling, and how to use Appium to fully simulate operations on a mobile App for crawling.
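
Connecting mitmdump to a Python script looks roughly like the sketch below (mine; the API path is a placeholder for whatever the App actually requests):

```python
# Save as capture.py and run with: mitmdump -s capture.py
from mitmproxy import http

def response(flow: http.HTTPFlow) -> None:
    # Placeholder API path of the App being analyzed; adjust to the real one.
    if "example.com/api" in flow.request.pretty_url:
        print(flow.request.pretty_url)
        print(flow.response.get_text()[:200])
```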

Chapter 12: Introduces the pyspider crawler framework and its usage. The framework is simple, easy to use, and powerful, and can save a lot of crawler development time. This chapter uses a case study to show crawler development with this framework.
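
A pyspider handler follows a fixed template; the sketch below is close to the framework's default template (my placeholder start URL, not the book's case):

```python
from pyspider.libs.base_handler import *

class Handler(BaseHandler):
    @every(minutes=24 * 60)
    def on_start(self):
        self.crawl("https://example.com/", callback=self.index_page)

    def index_page(self, response):
        # Follow every absolute link on the page (a deliberately simple rule).
        for each in response.doc('a[href^="http"]').items():
            self.crawl(each.attr.href, callback=self.detail_page)

    def detail_page(self, response):
        return {"url": response.url, "title": response.doc("title").text()}
```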

Chapter 13: Introduces the Scrapy crawler framework and its usage. Scrapy is currently the most widely used crawler framework. This chapter covers its basic architecture and principles, the usage of each component, common Scrapy configuration, and how to integrate it with Docker.
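
A minimal Scrapy spider of my own, targeting the public scraping sandbox quotes.toscrape.com (not one of the book's cases), gives an idea of the framework's style; inside a Scrapy project it would be run with `scrapy crawl quotes -o quotes.json`:

```python
import scrapy

class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        # Yield one item per quote block on the page
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Follow pagination until there is no "next" link
        next_page = response.css("li.next a::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse)
```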

Chapter 14: Introduces the basic principles and implementation of distributed crawlers. Distributed crawling is essential for improving crawling efficiency. This chapter introduces how to implement a distributed crawler using Scrapy and Redis.
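
With the scrapy-redis package, turning a Scrapy project distributed is largely a matter of configuration; a sketch of the typical settings.py additions (assuming a local Redis server):

```python
# settings.py additions when using the scrapy-redis package
SCHEDULER = "scrapy_redis.scheduler.Scheduler"               # share the request queue via Redis
DUPEFILTER_CLASS = "scrapy_redis.dupefilter.RFPDupeFilter"   # deduplicate requests across machines
SCHEDULER_PERSIST = True                                     # keep the queue between runs
REDIS_URL = "redis://localhost:6379"                         # adjust to your Redis server
```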

Chapter 15: Introduces the deployment and management of distributed crawlers. Completing distributed crawler deployment conveniently and quickly saves developers a lot of time. This chapter introduces distributed crawler deployment and management using tools such as Scrapy, Scrapyd, Docker, and Gerapy.
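
Once a project is deployed to a Scrapyd server, a crawl can be scheduled over its HTTP API; a small sketch (assuming Scrapyd on its default port 6800 and a hypothetical project name "myproject" with a spider "quotes"):

```python
import requests

resp = requests.post(
    "http://localhost:6800/schedule.json",              # Scrapyd's schedule endpoint
    data={"project": "myproject", "spider": "quotes"},  # placeholder project/spider names
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}
```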

There is really too much material in the book to show it all here!

Friends who need the accompanying learning videos can leave a message to get them for free.


Source: blog.csdn.net/m0_70615468/article/details/127885787