[Network security: 100 web crawler exercises] Exercise 12: Extracting specified data with the pyquery parsing library

Table of contents

1. Goal 1: Basics and environment preparation

2. Goal 2: Start using pyquery

3. Goal 3: Extract the specified data

4. Goal 4: Obtain specified data in the form of a list

5. Extension: other methods

6. Network Security O


1. Goal 1: Basics and environment preparation

1. Documentation:

PyQuery is a parsing library whose API closely follows jQuery, so the jQuery documentation is a useful reference when using it.

jQuery API Chinese documentation https://www.94xh.com/

2. Environment:

Install pyquery from the command line (or install it directly from PyCharm):

pip install pyquery

3. Locating elements

<li>
    <div class="A">
        <div id="B"></div>
    </div>
</li>

Locate <li> ---> 'li'

Locate <div class="A"> ---> 'li .A'

Locate <div id="B"> ---> 'li .A #B'

(Each space-separated part matches a descendant of the part before it; .A and #B match by class and id, whatever the tag name is.)


 



2. Goal 2: Start using pyquery

1. Print head data

from pyquery import PyQuery as pq

doc = pq(url='http://www.baidu.com')
print(doc('head'))

This prints the contents of <head>, but the Chinese characters in it come out garbled.

2. Encoding:

Fix the garbled characters by fetching the page with requests and decoding the response bytes as UTF-8 before handing the text to pyquery.

import requests
from pyquery import PyQuery as pq

response = requests.get('http://www.baidu.com')
content = response.content.decode('utf-8')
doc = pq(content)
print(doc('head'))

The Chinese characters now display correctly.

 

3. Extract matching elements

A selector returns every element that matches it.

For example, .official-newsbd selects every <div class="official-newsbd"> on the page.

 

import requests
from pyquery import PyQuery as pq

response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
print(doc('.official-newsbd'))

 



3. Goal 3: Extract the specified data

1. Goal:

Extract the information for every image in the list

2. Full extraction: 

Extract all <div class="thumb">

import requests
from pyquery import PyQuery as pq

response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.thumb')

print(items)


 

3. Step-by-step extraction:

Narrow the selection one level at a time: select the container first, then search inside it with find().

import requests
from pyquery import PyQuery as pq

response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')

print(item)


4. Extract child elements

children() returns the direct child elements of the current selection.

item = items.children()



4. Goal 4: Obtain specified data in the form of a list

Suppose we want to extract the link URLs.

1. Find the innermost tag

First, navigate to the innermost tag that holds the data.

(You can drill down one level at a time, or select it directly.)

import requests
from pyquery import PyQuery as pq

response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')
i = item('a')

print(i)

 

2. For loop: extract the URL from each tag

Key code

(There are several equivalent ways to read an attribute.)

.attr('href')

.attr.href

Full code

import requests
from pyquery import PyQuery as pq

response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')
for i in item:
    b = pq(i)('a')
    print(b.attr('href'))

The two-step selection can also be written as a single descendant selector.

Before improvement:

items = doc('.official-newsbd')
item = items.find('.thumb')

After improvement:

item = doc('.official-newsbd .thumb')

3. Get the text

Key code

b.text()

(The target tag in this example contains no text to extract.)



5. Extension: other methods

1. Siblings


The siblings() method returns a PyQuery object containing all sibling elements of the current element, that is, the other child nodes of the same parent. The current element itself is excluded.

from pyquery import PyQuery as pq

html = '''
<div>
  <p class="first">First paragraph</p>
  <p class="second">Second paragraph</p>
  <p class="third">Third paragraph</p>
</div>
'''

doc = pq(html)
elem = doc('.second')

siblings = elem.siblings()
print(siblings)

2. Parent

The parent() method returns a PyQuery object containing the immediate parent element of the current element.
 

from pyquery import PyQuery as pq

html = '''
<div class="parent">
  <p>Child paragraph</p>
</div>
'''

doc = pq(html)
child = doc('p')

parent = child.parent()
print(parent)


6. Network Security O

README.md Book Bansheng/Network Security Knowledge System-Practice Center-Code Cloud-Open Source China (gitee.com) https://gitee.com/shubansheng/Treasure_knowledge/blob/master/README.md

GitHub - BLACKxZONE/Treasure_knowledge https://github.com/BLACKxZONE/Treasure_knowledge


Origin blog.csdn.net/qq_53079406/article/details/131663524