Crawler basics: the basic structure of a web page

The basic structure of a web page

A web page is a file stored in a folder on the server. It can be static (it may contain JavaScript, but when the client requests the page, the server does not run any program; it just sends the file as-is, the same way it would send an image. This is how most early websites worked), or it can be dynamic. When we build our own site with WordPress, for example, its pages are generated dynamically by PHP programs.

Static pages are simple to write and load quickly, but they have a major limitation: they cannot change, let alone interact with the user. Dynamic pages arose to fill this gap; they make search, queries, login and registration, and many other useful features possible.
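The difference can be sketched in a few lines of Python (a hypothetical minimal example, not how WordPress itself works): a static page is the same bytes for every visitor, while a dynamic page is rebuilt by a program on each request.

```python
from datetime import datetime

# Static: the server sends these exact bytes to every visitor.
STATIC_PAGE = "<html><body><h1>Welcome</h1></body></html>"

def render_dynamic_page(username: str) -> str:
    """Dynamic: the HTML is generated per request, so it can react
    to who is asking and when (the way a PHP backend would)."""
    return (
        "<html><body>"
        f"<h1>Hello, {username}!</h1>"
        f"<p>Generated at {datetime.now():%Y-%m-%d %H:%M}</p>"
        "</body></html>"
    )

print(render_dynamic_page("alice"))
```

Because the dynamic page is produced by code, it can embed search results, login state, or anything else the server knows at request time.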

Web pages can be divided into three parts: HTML, JavaScript, and CSS.

  1. HTML specifies the overall structure of a page. Since it holds every element on the page, it has to be very general. A commonly used analogy is that HTML is the skeleton of a web page.
  2. CSS lays out and decorates the HTML content. Layout here means it can change the position of HTML elements; for example, the declaration `float: left` presses an element against the left edge of its parent. As for decoration, it goes without saying that CSS beautifies the content: for text we can choose the font style, size, color, and position; for images, the opacity, rounded corners, and position.
  3. JavaScript is a script embedded inside the HTML text, though it can also be referenced as an external file (CSS files are referenced the same way). Common uses are adding an image carousel to a page (log in to Taobao and you will see one immediately) and submitting forms. Since my JavaScript is not strong, this description may be overly simple.
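The three parts above live together in one document. Below is a toy page parsed with Python's standard-library `html.parser`, just to show that the skeleton (HTML tags), the decoration (a `style` element), and the behavior (a `script` element) all appear in a single file; the page content is made up for illustration.

```python
from html.parser import HTMLParser

PAGE = """<html>
<head>
  <style>h1 { color: steelblue; }</style>   <!-- CSS: decoration -->
  <script>console.log("loaded");</script>   <!-- JavaScript: behavior -->
</head>
<body>
  <h1>Title</h1>                            <!-- HTML: skeleton -->
</body>
</html>"""

class TagCollector(HTMLParser):
    """Record every opening tag the parser sees."""
    def __init__(self):
        super().__init__()
        self.tags = []
    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

collector = TagCollector()
collector.feed(PAGE)
print(collector.tags)
```

Walking the tag list like this is, in miniature, what a crawler's parsing step does.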

Why do we need to understand the basic structure of web pages?

Answer: When we write crawlers, the most important task is to analyze the web page's response and extract the data we want. (Of course you could skip the extraction, but then why not just visit the site someone else already built? (Laughs)) Once we understand the basic structure of a web page, we can locate the elements we want more accurately. In Python crawlers, XPath and CSS selectors (both are methods for locating elements on a page) can only be used well on top of an understanding of how the page's structure arranges its content.

Web page structure and XPath

Copy a full XPath: `/html/body/div[1]/div[1]/div[1]/div/div[3]/button[1]`. What you see is a path that locates the element level by level, starting from the page's root node; you must understand the logical relationships between these HTML tags to pinpoint the element you want. Note: this is what I copied from the developer tools. I usually don't write it this way; I'm too lazy to write something that long.
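Here is a minimal sketch of that kind of level-by-level lookup, using the limited XPath subset built into Python's standard-library `xml.etree.ElementTree`. Real pages are rarely well-formed XML, so crawlers usually reach for lxml's fuller XPath support instead; the markup below is made up and deliberately well-formed.

```python
import xml.etree.ElementTree as ET

# A made-up, well-formed page fragment.
HTML = """<html><body>
  <div><p>first block</p></div>
  <div><p>second block</p><button>OK</button></div>
</body></html>"""

root = ET.fromstring(HTML)
# Step down from the root level by level, like the copied path,
# using positional predicates to pick among sibling <div>s:
button = root.find("./body/div[2]/button")
print(button.text)  # text of the element the path lands on
```

The positional predicate `div[2]` is exactly why you need to understand the tag hierarchy: change the page structure and the path points somewhere else.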

Web page structure and CSS selectors

CSS selectors are used to precisely locate the elements that need styling. They can be divided into class selectors, HTML tag selectors, and ID selectors. A detailed description will not be given here.

Note: XPath has its own grammar that can precisely locate content in XML and HTML documents; you will need to spend some time learning it yourself.



Origin blog.csdn.net/weixin_47249161/article/details/113967266