Crawler notes: network basics

As a novice and a curious baby, I am very interested in crawling too. When I listen to other people talk about it, I feel like an ignorant baby. Now I am finally getting started.
As the old Chinese saying goes, you can't eat hot tofu in a hurry.
I'm keeping to my usual pace and starting from the basics.
As someone who is flat broke, signing up for a paid class is out of the question, so I still have to teach myself.
Without further ado, here is the link to the study material.
Today I found a fairly basic video on Bilibili: https://www.bilibili.com/video/BV1Ek4y1B7Eb?p=2
Below are today's notes.
As usual, import the library first. The requests library fetches the content of the entire page. How do we turn that into information we can understand?
bs4 works like a sorter: it parses out the data we want.
Sending cookies: a cookie is like your footprint on the beach, and that footprint matches only your own feet. For example, when you log in to QQ on a website, you enter the password the first time but not the second time; you are logged in directly.
See https://baike.baidu.com/item/cookie/1119
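A minimal sketch of how cookies can be attached to a request with requests, assuming the `cookies` parameter is the mechanism meant here. The URL and the cookie name `session_id` are made up for illustration. The request is only prepared, not sent, so the outgoing header can be inspected without a network connection.

```python
import requests

# Hypothetical cookie; a real site would hand this out after login.
cookies = {"session_id": "abc123"}

# Build the request without sending it, so we can see what would go out.
request = requests.Request("GET", "https://example.com/", cookies=cookies)
prepared = request.prepare()

# The cookies dict ends up in the Cookie request header.
cookie_header = prepared.headers["Cookie"]
print(cookie_header)
```

To actually send it, the same dict can be passed as `requests.get(url, cookies=cookies)`.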
Timeout setting: in the output you will find that one string of numbers is different.
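A small sketch of a timeout setting, assuming it refers to the `timeout` parameter of `requests.get`. The helper name `fetch_with_timeout` and the 3-second default are my own choices, not something from the video.

```python
import requests


def fetch_with_timeout(url, timeout=3.0):
    """Return the page text, or None if the server takes too long to answer."""
    try:
        response = requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        # Raised when connecting or reading exceeds `timeout` seconds.
        return None
    return response.text
```

Calling it with a very small value such as `timeout=0.001` will most likely return None, because almost no server answers that quickly.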
Proxies: to use a proxy, set it through the proxies parameter.
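A sketch of the proxies parameter in use. The address `127.0.0.1:8080` is only a placeholder and the helper name `fetch_via_proxy` is hypothetical; swap in a proxy you actually have access to.

```python
import requests

# Placeholder proxy address; replace it with a proxy you actually control.
proxies = {
    "http": "http://127.0.0.1:8080",
    "https": "http://127.0.0.1:8080",
}


def fetch_via_proxy(url, timeout=5.0):
    """Send a GET request through the proxies defined above."""
    return requests.get(url, proxies=proxies, timeout=timeout)
```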
requests gets the content of the entire page, and bs4 then parses that content into text we can understand.
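The two steps together might look like the sketch below. The helper `fetch_soup` is a name I made up, and the demo parses a small hard-coded page so it runs without a network connection; "html.parser" is Python's built-in parser, chosen here only because it needs no extra install.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url, timeout=5.0):
    """Download a page with requests and parse it with bs4."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()  # fail loudly on 4xx/5xx responses
    return BeautifulSoup(response.text, "html.parser")


# Offline demo: parse a hard-coded page instead of a real download.
html = "<html><head><title>Demo page</title></head><body><p>Hello</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

print(soup.title.string)  # Demo page
print(soup.p.get_text())  # Hello
```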
Four types of objects: Tag, NavigableString, Comment, BeautifulSoup
The four object types are clearly documented. A Comment holds content that may exist in the page source as a comment, a section that is not rendered on the web page.
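A quick way to see the four object types side by side. The HTML snippet is invented for the demo.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

html = "<p>visible text<!-- hidden note --></p>"
soup = BeautifulSoup(html, "html.parser")

tag = soup.p               # Tag: an HTML element
text = tag.contents[0]     # NavigableString: the text inside a tag
comment = tag.contents[1]  # Comment: a special kind of NavigableString

print(type(soup).__name__)     # BeautifulSoup
print(type(tag).__name__)      # Tag
print(type(text).__name__)     # NavigableString
print(type(comment).__name__)  # Comment
print(comment.strip())         # hidden note
```

Because a Comment is also a NavigableString, checking `isinstance(node, Comment)` is one way to tell comments apart from ordinary text.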

Document tree: direct child nodes (a father's father is a grandfather)
To read out children and grandchildren as well, use .descendants, which yields all the child and grandchild nodes under a given tag.
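A small comparison of direct children and all descendants, using a made-up snippet. `.children` is the bs4 attribute for direct children; text nodes are filtered out here so that only tag names are listed.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

html = "<div><p><b>bold</b></p><span>text</span></div>"
soup = BeautifulSoup(html, "html.parser")
div = soup.div

# .children yields only the direct children of <div>.
child_names = [child.name for child in div.children if isinstance(child, Tag)]

# .descendants walks the whole subtree, grandchildren included.
descendant_names = [node.name for node in div.descendants if isinstance(node, Tag)]

print(child_names)       # ['p', 'span']
print(descendant_names)  # ['p', 'b', 'span']
```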
You can also process the node content with a for loop. Note that in this example the outputs of soup.a.string and soup.p.string are the same.
If a tag contains more than one child node, .string cannot tell which one to return, so it returns None.
Note that spaces and newlines each count as a node, so in that case the outputs of soup.a.string and soup.p.string differ.
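Both cases can be reproduced with two tiny snippets. The HTML is invented for the demo; the only difference between the two is the newlines around the a tag.

```python
from bs4 import BeautifulSoup

# Case 1: <p> has exactly one child, the <a> tag.
tight = BeautifulSoup("<p><a>link</a></p>", "html.parser")
print(tight.a.string)  # link
print(tight.p.string)  # link (taken from the only child)

# Case 2: the newlines around <a> are text nodes too,
# so <p> now has three children and .string gives up.
loose = BeautifulSoup("<p>\n<a>link</a>\n</p>", "html.parser")
print(loose.a.string)  # link
print(loose.p.string)  # None
```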
To get multiple pieces of content under a tag, use .strings or .stripped_strings (the latter removes extra whitespace).
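A short comparison of the two on a made-up snippet that contains newlines and indentation between the tags.

```python
from bs4 import BeautifulSoup

html = "<p>\n  <b>first</b>\n  <i>second</i>\n</p>"
soup = BeautifulSoup(html, "html.parser")

# .strings yields every text node, whitespace-only ones included.
all_strings = [str(s) for s in soup.p.strings]

# .stripped_strings skips whitespace-only nodes and trims the rest.
clean_strings = [str(s) for s in soup.p.stripped_strings]

print(all_strings)    # ['\n  ', 'first', '\n  ', 'second', '\n']
print(clean_strings)  # ['first', 'second']
```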
.parent gets the parent node of the current tag.
.parents yields all the ancestor nodes of the current element.
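A sketch of walking upward from a tag, using an invented snippet. The last ancestor is the BeautifulSoup object itself, whose name is "[document]".

```python
from bs4 import BeautifulSoup

html = "<html><body><p><a>link</a></p></body></html>"
soup = BeautifulSoup(html, "html.parser")
a = soup.a

# .parent is the single node directly above <a>.
print(a.parent.name)  # p

# .parents walks all the way up to the top of the document.
ancestor_names = [ancestor.name for ancestor in a.parents]
print(ancestor_names)  # ['p', 'body', 'html', '[document]']
```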
Sibling nodes (nodes at the same level as the current node)
.next_sibling gets the next sibling node.
.previous_sibling gets the previous sibling node.
It is like good brothers standing in a row.
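A tiny example of moving sideways between siblings. The snippet is made up and deliberately has no whitespace between the tags, because a newline between two tags would itself be a sibling text node.

```python
from bs4 import BeautifulSoup

html = "<p><b>one</b><i>two</i><u>three</u></p>"
soup = BeautifulSoup(html, "html.parser")
middle = soup.i

print(middle.next_sibling.name)      # u
print(middle.previous_sibling.name)  # b

# The first child has nobody standing before it.
print(soup.b.previous_sibling)       # None
```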

Preceding and following nodes: all nodes before (or after) the current node, regardless of level. Note the difference from sibling nodes. The attributes are:
.next_element
.previous_element
.next_elements
.previous_elements
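The difference from sibling nodes shows up in a snippet as small as this one (invented for the demo): the next element of a tag is whatever was parsed right after it, which is usually its own text.

```python
from bs4 import BeautifulSoup

html = "<p><b>one</b><i>two</i></p>"
soup = BeautifulSoup(html, "html.parser")
b = soup.b

# Sibling: the next node at the same level.
sibling = str(b.next_sibling)

# Element: the next node in parse order, here the text inside <b>.
element = str(b.next_element)

# .next_elements keeps going in parse order until the document ends.
following = [str(node) for node in b.next_elements]

print(sibling)    # <i>two</i>
print(element)    # one
print(following)  # ['one', '<i>two</i>', 'two']
```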
Search the document tree: find_all
find_all searches the tag nodes under the current tag and returns every one that matches.
The parameters that can be passed directly to find_all():
Tag name, such as a, p, h1
List, such as ['a', 'b']
True, which matches every tag under the current tag
Regular expression
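The four kinds of argument listed above can be tried on one made-up snippet. The regular expression `^h` is my own example; it matches tag names that start with "h".

```python
import re

from bs4 import BeautifulSoup

html = "<div><a>link</a><b>bold</b><p>para</p><h1>title</h1></div>"
soup = BeautifulSoup(html, "html.parser")

by_name = [tag.name for tag in soup.find_all("a")]
by_list = [tag.name for tag in soup.find_all(["a", "b"])]
by_true = [tag.name for tag in soup.find_all(True)]
by_regex = [tag.name for tag in soup.find_all(re.compile("^h"))]

print(by_name)   # ['a']
print(by_list)   # ['a', 'b']
print(by_true)   # ['div', 'a', 'b', 'p', 'h1']
print(by_regex)  # ['h1']
```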
Search the document tree: find_all with keyword parameters
find_all(attribute_name=attribute_value)
Note that to search by class you write class_, because class is a Python keyword.
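A sketch of keyword-parameter searches on an invented snippet; the attribute values ("home", "nav", "/about") are placeholders.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<a id="home" class="nav" href="/">Home</a>'
    '<a class="nav" href="/about">About</a>'
    '<a href="/other">Other</a>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

# Search by id.
by_id = [tag.get_text() for tag in soup.find_all(id="home")]

# Search by class: note the trailing underscore.
by_class = [tag.get_text() for tag in soup.find_all(class_="nav")]

# Any other attribute works the same way.
by_href = [tag.get_text() for tag in soup.find_all(href="/about")]

print(by_id)     # ['Home']
print(by_class)  # ['Home', 'About']
print(by_href)   # ['About']
```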
CSS selectors
soup.select() filters elements and returns a list.
Syntax rules: a tag name is written as is, a class name is prefixed with a dot, and an id is prefixed with #.
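The three selector forms on one made-up snippet. The class name `intro` and the id `first` are placeholders.

```python
from bs4 import BeautifulSoup

html = (
    '<div>'
    '<p class="intro" id="first">Hello</p>'
    '<p class="intro">World</p>'
    '<span>!</span>'
    '</div>'
)
soup = BeautifulSoup(html, "html.parser")

by_tag = [tag.get_text() for tag in soup.select("span")]      # tag name as is
by_class = [tag.get_text() for tag in soup.select(".intro")]  # dot for class
by_id = [tag.get_text() for tag in soup.select("#first")]     # hash for id

print(by_tag)    # ['!']
print(by_class)  # ['Hello', 'World']
print(by_id)     # ['Hello']
```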
I'm off now; I don't want to lose my hair.
Bye bye.
Shortcomings, and how to fix them.

Origin blog.csdn.net/m0_52456045/article/details/113151411