Table of contents
1. Goal 1: Basics and environment setup
2. Goal 2: Start using pyquery
3. Goal 3: Extract the specified data
4. Goal 4: Obtain the specified data as a list
1. Goal 1: Basics and environment setup
1. Documentation:
pyquery is a parsing library whose API is modeled on jQuery, so the jQuery documentation is a useful reference for its selectors and methods.
jQuery API Chinese documentation: https://www.94xh.com/
2. Environment:
Install pyquery with pip (or install it from within PyCharm):
pip install pyquery
3. Positioning
<li>
<div class="A">
<div id="B">
Locate <li> ---> 'li'
Locate <div class="A"> ---> 'li .A'
Locate <div id="B"> ---> 'li .A #B'
(the parts of a selector follow the nesting order of the elements, whatever the tag names are)
2. Goal 2: Start using pyquery
1. Print head data
from pyquery import PyQuery as pq
doc = pq(url='http://www.baidu.com')
print(doc('head'))
This prints the <head> data, but note that the Chinese characters come out garbled.
2. Encoding:
Fix the garbled characters by fetching the page with requests and decoding the response bytes explicitly as UTF-8
import requests
from pyquery import PyQuery as pq
response = requests.get('http://www.baidu.com')
content = response.content.decode('utf-8')
doc = pq(content)
print(doc('head'))
The Chinese characters are no longer garbled
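A common cause of this kind of garbling (though not necessarily the exact cause here) is that UTF-8 bytes get decoded with a different charset. The sketch below uses only the standard library and a hypothetical sample string to show the effect of the wrong and the right decoding.

```python
# Bytes such as a server might send for a UTF-8 page (hypothetical sample).
raw = '<title>百度一下</title>'.encode('utf-8')

# Wrong charset: every byte is mapped to one Latin-1 character.
garbled = raw.decode('iso-8859-1')

# Right charset: the original text comes back.
correct = raw.decode('utf-8')

print(garbled)
print(correct)
```

This is why the code above works on response.content (raw bytes) and calls decode('utf-8') itself instead of relying on a guessed encoding.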
3. Extract elements by their class attribute
A class selector extracts every tag that matches it.
'.official-newsbd' selects all <div class="official-newsbd"> elements
import requests
from pyquery import PyQuery as pq
response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
print(doc('.official-newsbd'))
3. Goal 3: Extract the specified data
1. Goal:
Extract information about all list images
2. Full extraction:
Extract all <div class="thumb">
import requests
from pyquery import PyQuery as pq
response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.thumb')
print(items)
3. Step-by-step extraction:
Narrow the selection step by step: select the outer block first, then search inside it
import requests
from pyquery import PyQuery as pq
response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')
print(item)
4. List data extraction
children()
returns the direct child elements of each element in the current selection
item = items.children()
4. Goal 4: Obtain the specified data as a list
Suppose we want to extract the link URLs
1. Find the innermost tag
First, select down to the innermost tag that holds the data
(You can narrow the selection step by step, or select the tag directly)
import requests
from pyquery import PyQuery as pq
response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')
i = item('a')
print(i)
2. For loop: extract the URL from each tag
key code
(There are several equivalent ways to read the attribute)
.attr('href')
.attr.href
full code
import requests
from pyquery import PyQuery as pq
response = requests.get('https://www.chinaz.com/')
content = response.content.decode('utf-8')
doc = pq(content)
items = doc('.official-newsbd')
item = items.find('.thumb')
for i in item:
    # iterating a PyQuery object yields raw lxml elements, so wrap each one in pq() again
    b = pq(i)('a')
    print(b.attr('href'))
Before improvement:
items = doc('.official-newsbd')
item = items.find('.thumb')
After improvement:
item = doc('.official-newsbd .thumb')
3. Get the text
key code
b.text()
(The target tag in this example contains no text, so there is nothing to extract here)
5. Extension: other methods
1. Siblings
The siblings() method returns a PyQuery object containing all sibling elements of the current element, that is, the other child nodes under the same parent node, excluding the current element itself
from pyquery import PyQuery as pq
html = '''
<div>
<p class="first">First paragraph</p>
<p class="second">Second paragraph</p>
<p class="third">Third paragraph</p>
</div>
'''
doc = pq(html)
elem = doc('.second')
siblings = elem.siblings()
print(siblings)
2. Parent
The parent() method returns a PyQuery object containing the immediate parent element of the current element
from pyquery import PyQuery as pq
html = '''
<div class="parent">
<p>Child paragraph</p>
</div>
'''
doc = pq(html)
child = doc('p')
parent = child.parent()
print(parent)
6. Network Security O
GitHub - BLACKxZONE/Treasure_knowledge: https://github.com/BLACKxZONE/Treasure_knowledge