Python web crawling from scratch: pitfalls I ran into

First, some background: I am an iOS mobile developer. During my annual leave, with nothing else to do because of the epidemic, I wrote a few small crawlers to pass the time and, along the way, scraped some things I like to satisfy my own curiosity.

Without further ado: in my first week of learning to crawl I ran into a number of pitfalls. This article offers some suggestions so that novices waste less time.

** 1 ** Environment: a crawler needs some software support. Above all, you need a Python environment. I recommend Anaconda, a scientific computing environment. Anaconda is an open-source Python distribution that bundles conda, Python, and more than 180 scientific packages together with their dependencies. Crawling later requires many libraries with many dependencies among them, and a small slip can leave you with baffling, frustrating errors, especially on Windows. Anaconda makes these issues much easier to handle. There are many installation guides online (one I picked: Anaconda installation Getting Started tutorial). One more note: if the official Anaconda website is unreachable without getting around the firewall, use the Tsinghua University open source software mirror site instead (link: Tsinghua University Open Source software mirror sites).

** 2 ** IDE: PyCharm is the usual choice.

** 3 ** Start with a simple static website (dynamic websites, by contrast, are more complex to crawl), for example a simple novel site such as Biqukan (笔趣看).

** 4 ** A crawler is essentially two things: network requests and data processing. There is not much to say about the network requests; the requests library handles them. Data processing mostly means extracting strings from HTML pages. I recommend that novices first learn the basics of regular expressions to match and extract the content they want, and only then move on to a parsing library.
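
As a rough illustration of the regular-expression step, here is a minimal sketch. The HTML snippet, the pattern, and the function name `extract_chapters` are all made up for this example; in a real crawler the string would come from something like `requests.get(url).text`, and the pattern would have to match the actual page.

```python
import re

# Hypothetical HTML, standing in for the text of a downloaded page.
html = """
<div class="listmain">
  <dd><a href="/book/1.html">Chapter 1 The Beginning</a></dd>
  <dd><a href="/book/2.html">Chapter 2 The Journey</a></dd>
</div>
"""

# Non-greedy groups capture each link's href and its visible text.
CHAPTER_LINK = re.compile(r'<a href="(.*?)">(.*?)</a>')


def extract_chapters(page):
    """Return a list of (url, title) tuples found in the page."""
    return CHAPTER_LINK.findall(page)


chapters = extract_chapters(html)
for url, title in chapters:
    print(url, title)
```

Regular expressions like this break easily when the markup changes, which is one reason to switch to a parsing library once the basics are clear.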

** 5 ** Many newcomers say that crawler code copied from the web often does not run as expected (I did plenty of this myself). I see three reasons: ① The code is old, and the site it crawls is no longer accessible. ② The site structure has changed, so the original extraction logic no longer applies. ③ The Internet moves quickly: many sites used to run on plain HTTP, which has security risks, and most now use HTTPS. For a crawler, this means the basic request often needs a headers parameter with fields such as Cookie, User-Agent ... depending on the actual situation (the User-Agent header field describes the client making the request; a Cookie is a small piece of data stored in the browser that lets the server identify the user and the session).
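
To show what attaching headers looks like, here is a small sketch. It uses the standard library's `urllib.request` rather than requests so that it runs without installing anything; with requests, the same dictionary would be passed as `requests.get(url, headers=HEADERS)`. The header values, the URL, and the helper name `build_request` are placeholders, not values from a real site.

```python
import urllib.request

# Placeholder values: copy the real ones from your browser's developer tools.
HEADERS = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Cookie": "sessionid=example-value",
}


def build_request(url, headers):
    """Create a request object carrying the given headers; nothing is sent yet."""
    return urllib.request.Request(url, headers=headers)


req = build_request("https://example.com/", HEADERS)

# urllib normalizes header name capitalization, so compare in lower case.
sent_headers = {name.lower(): value for name, value in req.header_items()}
print(sent_headers)

# To actually fetch the page:
# with urllib.request.urlopen(req, timeout=10) as resp:
#     html = resp.read().decode("utf-8")
```

Which headers a site really requires varies; a reasonable approach is to start with User-Agent alone and add Cookie or others only if the response differs from what the browser shows.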

** 6 ** Once simple static pages are done, how do you crawl the more complex dynamic ones? The difference between static and dynamic pages is as follows:

Static pages:
(1) A static page should not be understood simply as a page where nothing moves. It mainly means the page contains no program code, only HTML, and usually has a suffix such as .html, .htm, or .xml. Once a static page is made, its content does not change. Static pages can still include moving parts, mainly things like animated GIFs.
(2) A static page can be opened directly by double-clicking the file, and its content is the same no matter who opens it or when.

Dynamic pages:
(1) A dynamic page is built with web programming techniques, in contrast to a static page. Besides HTML tags, its files contain program code for specific functions. That code lets the browser and the server interact, so the server generates the page content dynamically according to each client's request.
In other words, compared with a static page, the page code itself does not change, but the displayed content can change with time, the environment, or the results of database operations.
(2) Whether a page is dynamic has nothing to do with visual effects such as animations or scrolling marquees. A dynamic page can be plain text or can include animated content; those are only forms of presentation. Whether or not a page has dynamic effects, any page generated with dynamic web technology (such as PHP, ASP, JSP, etc.) is called a dynamic page.

Put simply: the HTML that a Python program gets back from requesting the page address is not the same as what you see in the browser. To get the content shown on the page, you have to analyze the page and request the additional, hidden data it loads.

Here is an example showing how to analyze a page and find its Ajax requests (see also: What Ajax is and the steps of an Ajax request)
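
Once the Ajax request has been located (for instance under the XHR filter in the Network tab of the browser's developer tools), its response is often JSON rather than HTML. The sketch below assumes such a response; the body, its field names, and the function `parse_titles` are invented for illustration and will differ on any real site.

```python
import json

# Hypothetical response body of an Ajax request, as it might appear in the
# browser's developer tools. Real field names depend on the site.
ajax_body = (
    '{"total": 2, "items": ['
    '{"id": 1, "title": "First post"}, '
    '{"id": 2, "title": "Second post"}]}'
)


def parse_titles(body):
    """Decode the JSON body and return the title of every item."""
    data = json.loads(body)
    return [item["title"] for item in data["items"]]


titles = parse_titles(ajax_body)
print(titles)
```

In a real crawler the body would be fetched by requesting the Ajax URL directly, usually with the same headers the browser sent.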
That is all I wanted to say for now. There will certainly be more pitfalls to step in later ...


Origin blog.csdn.net/qq_41431582/article/details/104614175