任务
爬取该网页商品的名称,图片地址,价格,阅读人数,星级评价
使用bs4库,用到css selecter, xpath以后会用到
select地址:f12,找到标签,右键复制select地址
name:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4:nth-child(2) > a
image:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > img
money:
body > div:nth-child(2) > div > div.col-md-9 > div:nth-child(2) > div:nth-child(1) > div > div.caption > h4.pull-right
…….
代码
from bs4 import BeautifulSoup#导入库
with open('./index.html','r') as wb_data:#打开本地文件,读
soup = BeautifulSoup(wb_data,'lxml')#解析内容
#得到每个小标签的集合,对比前面的select路径,发现不一样了吧
#模糊路径去掉位置,可以爬取该页所有相同模式下的内容
images = soup.select('body > div > div > div.col-md-9 > div > div > div > img')
names = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4 > a')
moneys = soup.select('body > div > div > div.col-md-9 > div > div > div > div.caption > h4.pull-right')
reads = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p.pull-right')
fives = soup.select('body > div > div > div.col-md-9 > div > div > div > div.ratings > p:nth-of-type(2)')
my_list = []
for image,name,money,read,five in zip(images,names,moneys,reads,fives):
data = {
'name':name.get_text(),
'image':image.get('src'),
'money':money.get_text(),
'read':read.get_text(),
'fives':len(five.find_all("span",class_="glyphicon glyphicon-star"))
# 观察发现,每一个星星会有一次<span class="glyphicon glyphicon-star"></span>,所以我们统计有多少次,就知道有多少个星星了;
# 使用find_all 统计有几处是★的样式,第一个参数定位标签名,第二个参数定位css 样式,具体可以参考BeautifulSoup 文档示例http://www.crummy.com/software/BeautifulSoup/bs4/doc.zh/#find-all;
# 由于find_all()返回的结果是列表,我们再使用len()方法去计算列表中的元素个数,也就是星星的数量
}
my_list.append(data)
for i in my_list:
print(i['name'],i['image'],i['money'],i['read'],i['fives'],sep='\n')
print('\n')
运行结果
最后附上网页源代码,没有图片,仅供参考源码
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="description" content="">
<meta name="author" content="">
<title>Shop Homepage - Start Bootstrap Template</title>
<!-- Bootstrap Core CSS -->
<link href="css/bootstrap.min.css" rel="stylesheet">
<!-- Custom CSS -->
<link href="css/shop-homepage.css" rel="stylesheet">
<!-- HTML5 Shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.4.2/respond.min.js"></script>
<![endif]-->
</head>
<body>
<!-- Navigation -->
<nav class="navbar navbar-inverse navbar-fixed-top" role="navigation">
<div class="container">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse" data-target="#bs-example-navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
<a class="navbar-brand" href="#">Web Parse</a>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="bs-example-navbar-collapse-1">
<ul class="nav navbar-nav">
<li>
<a href="#">Home</a>
</li>
<li>
<a href="#">Other</a>
</li>
<li>
<a href="#">Contact</a>
</li>
</ul>
</div>
<!-- /.navbar-collapse -->
</div>
<!-- /.container -->
</nav>
<!-- Page Content -->
<div class="container">
<div class="row">
<div class="col-md-3">
<p class="lead">Shop Name</p>
<div class="list-group">
<a href="#" class="list-group-item">Category 1</a>
<a href="#" class="list-group-item">Category 2</a>
<a href="#" class="list-group-item">Category 3</a>
</div>
</div>
<div class="col-md-9">
<div class="row carousel-holder">
<div class="col-md-12">
<div id="carousel-example-generic" class="carousel slide" data-ride="carousel">
<ol class="carousel-indicators">
<li data-target="#carousel-example-generic" data-slide-to="0" class=""></li>
<li data-target="#carousel-example-generic" data-slide-to="1" class=""></li>
<li data-target="#carousel-example-generic" data-slide-to="2" class="active"></li>
</ol>
<div class="carousel-inner">
<div class="item">
<img class="slide-image" src="img/pic800_0000_12.jpg" alt="">
</div>
<div class="item">
<img class="slide-image" src="img/pic8_0002_b43bf91d0e8ffc5f08bd68d3ee90b9b9.jpg" alt="">
</div>
<div class="item active">
<img class="slide-image" src="img/pic8_0002_b43bf91d0e8ffc5f08bd68d3ee90b9b9.jpg" alt="">
</div>
</div>
<a class="left carousel-control" href="#carousel-example-generic" data-slide="prev">
<span class="glyphicon glyphicon-chevron-left"></span>
</a>
<a class="right carousel-control" href="#carousel-example-generic" data-slide="next">
<span class="glyphicon glyphicon-chevron-right"></span>
</a>
</div>
</div>
</div>
<div class="row">
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0000_073a9256d9624c92a05dc680fc28865f.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$24.99</h4>
<h4><a href="#">EarPod</a>
</h4>
<p>See more snippets like this online store item at web store </p>
</div>
<div class="ratings">
<p class="pull-right">65 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0005_828148335519990171_c234285520ff.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$64.99</h4>
<h4><a href="#">New Pocket</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">12 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0006_949802399717918904_339a16e02268.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$74.99</h4>
<h4><a href="#">New sunglasses</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">31 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0008_975641865984412951_ade7a767cfc8.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$84.99</h4>
<h4><a href="#">Art Cup</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">6 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0001_160243060888837960_1c3bcd26f5fe.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$94.99</h4>
<h4><a href="#">iphone gamepad</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">18 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0002_556261037783915561_bf22b24b9e4e.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$214.5</h4>
<h4><a href="#">Best Bed</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">18 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0011_1032030741401174813_4e43d182fce7.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$500</h4>
<h4><a href="#">iWatch</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">35 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
<div class="col-sm-4 col-lg-4 col-md-4">
<div class="thumbnail">
<img src="img/pic_0010_1027323963916688311_09cc2d7648d9.jpg" alt="">
<div class="caption">
<h4 class="pull-right">$15.5</h4>
<h4><a href="#">Park tickets</a>
</h4>
<p>This is a short description. Lorem ipsum dolor sit amet, consectetur adipiscing elit.</p>
</div>
<div class="ratings">
<p class="pull-right">8 reviews</p>
<p>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star"></span>
<span class="glyphicon glyphicon-star-empty"></span>
</p>
</div>
</div>
</div>
</div>
</div>
</div>
</div>
<!-- /.container -->
<div class="container">
<hr>
<!-- Footer -->
<footer>
<div class="row">
<div class="col-lg-12">
<p>Copyright © Your Website 2014</p>
</div>
</div>
</footer>
</div>
<!-- /.container -->
<!-- jQuery -->
<script src="js/jquery.js"></script>
<!-- Bootstrap Core JavaScript -->
<script src="js/bootstrap.min.js"></script>
</body>
</html>