Web Crawler Tutorial, Part 2: Using XPath Expressions with the urllib Library; BeautifulSoup Basics

With urllib, we can use XPath expressions to extract information from a page. To do this, you first need to install the lxml module, and then convert the fetched page data into lxml's tree structure with etree.

Using XPath expressions with the urllib library

etree.HTML() converts the fetched HTML string into a tree structure, i.e. a format that XPath expressions can query.
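Since etree.HTML() parses any HTML string, the conversion can be sketched without a network request. A minimal sketch, assuming lxml is installed; the sample markup and title below are invented for illustration:

```python
from lxml import etree

# an invented HTML string standing in for a fetched page
html_text = "<html><head><title>Example Page</title></head><body><p>hi</p></body></html>"

tree = etree.HTML(html_text)            # parse the string into an lxml element tree
title = tree.xpath('/html/head/title/text()')  # query it with an XPath expression
print(title)                            # a list of matching text nodes
```

For a text() expression like this, xpath() returns a list of strings, here ['Example Page'].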

#!/usr/bin/env python
# -*- coding:utf8 -*-
import urllib.request
from lxml import etree  # module that converts HTML into a tree structure

# fetch the page and decode it, ignoring undecodable bytes
wye = urllib.request.urlopen('http://sh.qihoo.com/pc/home').read().decode("utf-8", 'ignore')
zhuanh = etree.HTML(wye)  # convert the fetched HTML string into a tree, the format XPath expressions can query
print(zhuanh)
hqq = zhuanh.xpath('/html/head/title/text()')  # extract the page title with an XPath expression
# note: data returned by an XPath expression is sometimes a list and sometimes not, so handle both cases
if isinstance(hqq, list):  # check whether the result is a list
    print(hqq)
else:
    xh_hqq = [i for i in hqq]  # if it is not a list, collect the items into one
    print(xh_hqq)
# returns: ['【今日爆点】你的专属资讯平台']


BeautifulSoup Basics

BeautifulSoup is a module for extracting HTML elements.

This article refers to BeautifulSoup version 3.2.1.

 


Origin www.cnblogs.com/liuyun258/p/11115645.html