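I. Using XPath
For the basic usage rules, see: https://blog.csdn.net/weixin_42569562/article/details/84670604?from=singlemessage
1.1 XPath selection examples:
1. Select the movie titles on the Douban Movie Top 250 page
//div[@class='hd']/a/span[1]/text()
2. Select the image paths (the src attribute of the img tag)
//div[@class='pic']//img/@src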
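To see what the two selectors return, here is a minimal sketch run against a made-up HTML fragment (the markup and the example.com URLs are assumptions, not Douban's real page). It uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath, so `.text` and `.get("src")` stand in for `text()` and `@src`. With lxml, the original expressions can be passed to `xpath()` unchanged.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragment that mimics the structure the selectors expect.
SAMPLE = """
<ol>
  <li>
    <div class="pic"><a href="#"><img src="https://example.com/img/p1.jpg"/></a></div>
    <div class="hd"><a href="#"><span>Movie A</span><span>Alias A</span></a></div>
  </li>
  <li>
    <div class="pic"><a href="#"><img src="https://example.com/img/p2.jpg"/></a></div>
    <div class="hd"><a href="#"><span>Movie B</span><span>Alias B</span></a></div>
  </li>
</ol>
"""

root = ET.fromstring(SAMPLE.strip())

# Counterpart of //div[@class='hd']/a/span[1]/text(): first span under each title link
titles = [span.text for span in root.findall(".//div[@class='hd']/a/span[1]")]

# Counterpart of //div[@class='pic']//img/@src: src of every img below a "pic" div
srcs = [img.get("src") for img in root.findall(".//div[@class='pic']//img")]

print(titles)
print(srcs)
```

Note that `span[1]` is 1-based: it picks the first span inside each link, which is why the alias spans are left out.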
1.2 Using XPath in a program
from lxml import etree
et_node = etree.HTML(html_text)  # parse the HTML string into an element tree
pic_paths = et_node.xpath("XPath expression")  # run the XPath query; returns a list of matches
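1.3 Code demos
1.3.1 Scraping Douban movie posters
import os
from lxml import etree
from urllib import request

url = 'https://movie.douban.com/top250?start=50'
# Assumption: Douban may reject urllib's default User-Agent, so send a browser-like one.
req = request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})
response = request.urlopen(req)
if response.status == 200:
    html = response.read().decode()
    et_node = etree.HTML(html)  # parse the HTML string into an element tree
    pic_paths = et_node.xpath("//div[@class='pic']//img/@src")  # poster URLs
    os.makedirs("./images_movie", exist_ok=True)  # urlretrieve does not create directories
    for pic_path in pic_paths:
        pic_name = pic_path.split("/")[-1]  # last path segment as the file name
        request.urlretrieve(pic_path, "./images_movie/" + pic_name)  # download the image
        print(pic_name + " downloaded successfully!")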
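Taking the file name with `split("/")[-1]` works for plain image URLs, but it keeps any query string or fragment in the name. A minimal sketch of a slightly more defensive helper, using only the standard library (the function name `filename_from_url` and the fallback name are hypothetical):

```python
import posixpath
from urllib.parse import urlparse

def filename_from_url(url, default="unnamed"):
    """Return the last path segment of a URL, ignoring query string and fragment."""
    name = posixpath.basename(urlparse(url).path)
    return name or default  # fall back when the path ends with "/"

print(filename_from_url("https://example.com/img/p1.jpg?size=large"))
```

The helper could replace the `pic_path.split("/")[-1]` expression in the loop if the scraped URLs turn out to carry parameters.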
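1.3.2 Scraping QQ emoticons
import os
import time
from lxml import etree
import requests

url = 'http://sc.chinaz.com/biaoqing/'
response = requests.get(url)
if response.status_code == 200:
    html = response.text
    et_node = etree.HTML(html)
    # The selector reads the src2 attribute of each img tag instead of src
    pic_paths = et_node.xpath("//div[@class='up']//a/img/@src2")
    os.makedirs('./images_qq', exist_ok=True)  # make sure the target directory exists
    for pic_path in pic_paths:
        # Prefix a timestamp so that files with the same name do not overwrite each other
        pic_name = str(int(time.time() * 100000)) + pic_path.split("/")[-1]
        response2 = requests.get(pic_path)
        with open('./images_qq/' + pic_name, 'wb') as f:
            f.write(response2.content)  # content holds the raw image bytes
        print(pic_name + " downloaded successfully!")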
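II. Distributed crawling with Redis
2.1 Installing Redis on Windows:
Download: https://github.com/MicrosoftArchive/redis/releases
Unzip the archive, then run redis-server redis.windows.conf from the command line.
To install Redis as a Windows service:
Change into the Redis directory and run redis-server --service-install redis.windows-service.conf --loglevel verbose
Start the Redis service:
redis-server --service-start
Stop the Redis service:
redis-server --service-stop
By default Redis does not require a password, but one can be set:
In redis.windows-service.conf, uncomment requirepass and change its value to the password you want.
To allow remote access to Redis:
In redis.windows-service.conf, comment out the bind directive.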
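As a rough illustration of the two configuration changes above, the relevant lines of redis.windows-service.conf might end up looking like this. The password 123 is only an example value, and the default bind address shown is an assumption about the shipped file:

```
# redis.windows-service.conf (excerpt)

# Uncommented and set to the chosen password
requirepass 123

# Commented out so that Redis no longer listens only on the local interface
# bind 127.0.0.1
```

The service reads the file at startup, so stop and start it again (redis-server --service-stop, then redis-server --service-start) for the changes to take effect.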
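Test the connection:
redis-cli -a password -h ip_address -p 6379
redis-cli -a 123 -h localhost -p 6379
With a shared Redis instance, several machines can crawl at the same time and pull work from the same queue, which greatly improves throughput.
See the code below for details.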
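The scripts in 2.2 share work through a single Redis list: the master adds URLs with LPUSH and each slave removes them with RPOP, so the list behaves as a first-in, first-out queue and every URL is handed to exactly one worker. A minimal in-memory sketch of that pattern, using collections.deque as a stand-in for the Redis list (no Redis server is involved; the worker names and URLs are made up):

```python
from collections import deque

queue = deque()  # stand-in for the Redis list "pic_url"

def lpush(item):
    """Mimic LPUSH: add an item at the head (left end) of the list."""
    queue.appendleft(item)

def rpop():
    """Mimic RPOP: remove an item from the tail (right end); None when empty."""
    return queue.pop() if queue else None

# The master pushes three hypothetical image URLs.
for n in (1, 2, 3):
    lpush(f"https://example.com/img/{n}.png")

# Two workers take turns popping until the queue is drained.
taken = {"worker-1": [], "worker-2": []}
workers = list(taken)
turn = 0
while True:
    url = rpop()
    if url is None:
        break  # nothing left to do
    taken[workers[turn % 2]].append(url)
    turn += 1

print(taken)
```

URLs come out in the order they were pushed, and no URL appears in both workers' lists, which is the property the distributed crawl relies on.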
2.2 Code demos
2.2.1 master_server code
from redis import Redis
from bs4 import BeautifulSoup
import requests

# Read the first 100 site URLs from alexa.txt (tab-separated, URL in the second column)
link_url = []
with open('alexa.txt') as f:
    for line in f:
        url = line.split("\t")[1].strip()
        link_url.append(url)
        if len(link_url) == 100:
            break

def push_to_redis():
    redis = Redis(host="localhost", port=6379, password="")
    for link in link_url:
        response = requests.get(link)
        soup = BeautifulSoup(response.text, 'lxml')
        imgs = soup.find_all("img")  # find every img tag on the page
        for img in imgs:
            pic_src = img.get("src")  # src of the current img tag; None if it is missing
            if not pic_src:
                continue  # skip img tags without a usable src
            redis.lpush("pic_url", pic_src)  # push the image URL onto the Redis list
            print("Pushed to Redis:", pic_src)

if __name__ == '__main__':
    push_to_redis()
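The file-reading part of the master assumes every line of alexa.txt has at least two tab-separated columns; a blank or malformed line would raise IndexError. A small sketch of a more tolerant parser, under the same assumption about the file layout (the function name `parse_url_lines` and the sample lines are hypothetical):

```python
def parse_url_lines(lines, limit=100):
    """Return up to `limit` URLs from tab-separated lines (URL in the second column)."""
    urls = []
    for line in lines:
        parts = line.rstrip("\r\n").split("\t")
        if len(parts) < 2 or not parts[1]:
            continue  # skip blank or malformed lines instead of raising IndexError
        urls.append(parts[1])
        if len(urls) == limit:
            break
    return urls

sample = [
    "1\thttp://example.com\n",
    "\n",
    "2\thttp://example.org\n",
    "3\thttp://example.net\n",
]
print(parse_url_lines(sample, limit=2))
```

With a real file, the call would be `parse_url_lines(open('alexa.txt'))`, since a file object yields its lines one by one.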
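2.2.2 slave code
import os
import time
from redis import Redis
import requests

def get_pic():
    redis = Redis(host="localhost", port=6379, password="")
    os.makedirs("./images", exist_ok=True)  # make sure the target directory exists
    while True:
        url = redis.rpop("pic_url")  # bytes, or None when the list is empty
        if url is None:
            break  # queue drained: stop this worker
        url = url.decode()
        if url[0:2] == "//":
            continue  # skip protocol-relative URLs, which have no scheme
        try:
            response = requests.get(url)
        except requests.RequestException:
            continue  # skip URLs that cannot be fetched instead of stopping the worker
        if response.status_code == 200:
            pic_name = str(int(time.time() * 100000)) + ".png"
            with open("./images/" + pic_name, 'wb') as f:
                f.write(response.content)

if __name__ == '__main__':
    get_pic()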
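The slave skips URLs that start with "//", and relative paths such as img/x.png cannot be fetched on their own either. One possible refinement, sketched here as an assumption rather than part of the original design, is to resolve each src against the URL of the page it came from before pushing it to Redis. The standard library's urljoin does this (the helper name `absolute_url` and the example URLs are hypothetical):

```python
from urllib.parse import urljoin

def absolute_url(page_url, src):
    """Resolve an img src (absolute, protocol-relative or relative) against its page URL."""
    return urljoin(page_url, src)

page = "https://example.com/a/index.html"
for src in ("//cdn.example.com/x.png", "/img/x.png", "img/x.png", "http://other.example/x.png"):
    print(absolute_url(page, src))
```

In the master, this would mean pushing `absolute_url(link, pic_src)` instead of `pic_src`, so that every queued entry is directly downloadable by a slave.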
III. Django REST framework supplement
3.1 Installing the packages needed for REST in Django
pip install djangorestframework
pip install markdown # Markdown support for the browsable API.
pip install django-filter # Filtering support
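3.2 Creating a REST-style (RESTful) project
Key points:
1. Create a serialization module, serializers.py, and in it define a serializer class that inherits from
one of the serializer classes in rest_framework.serializers (e.g. HyperlinkedModelSerializer).
Inside the custom serializer class, define an inner Meta class:
class Meta:
    model = ModelName
    fields = (field_names)
2. In views.py, create a class-based view that inherits from one of the classes in viewsets, and associate it with the serializer class.
class FruitView(viewsets.ModelViewSet):
    queryset = Fruit.objects.all()
    serializer_class = FruitSerializers
3. Configure the routes:
from rest_framework import routers
router = routers.DefaultRouter()
router.register('fruits', FruitView)
urlpatterns = [
    path('', include(router.urls)),
]
4. Add the app to the project settings file settings.py:
INSTALLED_APPS = [
    'rest_framework',
]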
3.3 Code demo
1. models.py
from django.db import models

class Fruit(models.Model):
    name = models.CharField(max_length=20)
    price = models.FloatField()
    url = models.URLField()

    class Meta:
        db_table = 'fruits'
2. serializers.py
from rest_framework import serializers
from myapp.models import Fruit

class FruitSerializers(serializers.HyperlinkedModelSerializer):
    class Meta:
        model = Fruit
        fields = ('name', 'price', 'url')
3. views.py
from django.shortcuts import render
from rest_framework import viewsets
from myapp.models import Fruit
from myapp.serializers import FruitSerializers

class FruitView(viewsets.ModelViewSet):
    queryset = Fruit.objects.all()
    serializer_class = FruitSerializers
4. urls.py
from django.contrib import admin
from django.urls import path, include
from rest_framework import routers
from myapp.views import FruitView

router = routers.DefaultRouter()
router.register('fruits', FruitView)
urlpatterns = [
    path('admin/', admin.site.urls),
    path('', include(router.urls)),
    path('browser/', include('rest_framework.urls', namespace='rest_framework'))
]
# No sub-routes: the router is included directly in the project's urls.py
# Visit: http://localhost:8000/fruits/
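As a rough guide to what the router above exposes: a ModelViewSet registered under 'fruits' is served at a list URL and a detail URL, and the HTTP method selects the action. The sketch below only builds those URLs; the base address and the primary key value are assumptions, and no request is sent.

```python
BASE = "http://localhost:8000"  # assumed development server address

# (HTTP method, path template, ModelViewSet action)
ROUTES = [
    ("GET", "/fruits/", "list"),
    ("POST", "/fruits/", "create"),
    ("GET", "/fruits/{pk}/", "retrieve"),
    ("PUT", "/fruits/{pk}/", "update"),
    ("PATCH", "/fruits/{pk}/", "partial_update"),
    ("DELETE", "/fruits/{pk}/", "destroy"),
]

def endpoints(pk=1):
    """Return (method, full URL, action) tuples for one hypothetical Fruit record."""
    return [(method, BASE + path.format(pk=pk), action) for method, path, action in ROUTES]

for method, url, action in endpoints(pk=1):
    print(f"{method:6} {url:40} {action}")
```

Opening the list URL in a browser shows the browsable API, where the same actions can be tried through the generated forms.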