Scraping web page information with Python

Copyright notice: this is the blogger's original article and may not be reproduced without permission. https://blog.csdn.net/xumeng7231488/article/details/78472845

I. A quick look at HTML pages

1. Recommended browser:

Use Chrome; its Inspect panel shows a page's HTML code alongside its CSS styles.

2. What a page is made of:

A page's content has three main parts: JavaScript handles behavior, HTML handles structure, and CSS handles styling. A local copy is usually organized the same way: html + images + css.

3. Common tags and structure

<div></div>                 marks out a region of the page
<div class="aasdf"></div>   attaches a style class
<p>wowiji</p>               paragraph of text
<li></li>                   list item
<img>                       image
<h1></h1> ... <h6></h6>     six levels of headings
<a href=""></a>             hyperlink


Tags can be nested inside one another.

4. Hands-on: build a page

Tool: PyCharm

Files: sample.html

       main.css

Skeleton: header (title bar + nav bar), content (main body), footer

5. Rendered result

(Screenshot of the finished page not reproduced here.)
6. HTML source

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <title>The blah</title>
    <link rel="stylesheet" type="text/css" href="main.css">
</head>
<body>
    <div class="header">
        <img src="images/blah.png">
        <ul class="nav">
            <li><a href="#">Home</a></li>
            <li><a href="#">Site</a></li>
            <li><a href="#">Other</a></li>
        </ul>
    </div>
    <div class="main-content">
        <h2>Article</h2>
        <ul class="article">
            <li>
                <img src="images/0001.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0002.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0003.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
            <li>
                <img src="images/0004.jpg" width="100" height="90">
                <h3><a href="#">The blah</a></h3>
                <p>Say something</p>
            </li>
        </ul>
    </div>
    <div class="footer">
        <p>@xumeng</p>
    </div>
</body>
</html>


7. CSS source

body {
    padding: 0 0 0 0;
    background-color: #ffffff;
    background-image: url(images/bg3-dark.jpg);
    background-position: top left;
    background-repeat: no-repeat;
    background-size: cover;
    font-family: Helvetica, Arial, sans-serif;
}
.main-content {
    width: 500px;
    padding: 20px 20px 20px 20px;
    border: 1px solid #dddddd;
    border-radius:25px;
    margin: 30px auto 0 auto;
    background: #f1f1f1;
    -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
    -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
    box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
}
.main-content p {
    line-height: 26px;
}
.main-content h2 {
    color: dimgray;
}
 
.nav {
    padding-left: 0;
    margin: 5px 0 20px 0;
    text-align: center;
}
.nav li {
    display: inline;
    padding-right: 10px;
}
.nav li:last-child {
    padding-right: 0;
}
.header {
    padding: 10px 10px 10px 10px;
 
}
 
.header a {
    color: #ffffff;
}
.header img {
    display: block;
    margin: 0 auto 0 auto;
}
.header h1 {
    text-align: center;
}
 
.article {
    list-style-type: none;
    padding: 0;
}
.article li {
    border: 1px solid #f6f8f8;
    background-color: #ffffff;
    height: 90px;
}
.article h3 {
    border-bottom: 0;
    margin-bottom: 5px;
}
.article a {
    color: #37a5f0;
    text-decoration: none;
}
.article img {
    float: left;
    padding-right: 11px;
}
 
.footer {
    margin-top: 20px;
}
.footer p {
    color: #aaaaaa;
    text-align: center;
    font-weight: bold;
    font-size: 12px;
    font-style: italic;
    text-transform: uppercase;
}
 
 
 
 
 
 
.post {
    padding-bottom: 2em;
}
.post-title {
    font-size: 2em;
    color: #222;
    margin-bottom: 0.2em;
}
.post-avatar {
    border-radius: 50px;
    float: right;
    margin-left: 1em;
}
.post-description {
    font-family: Georgia, "Cambria", serif;
    color: #444;
    line-height: 1.8em;
}
.post-meta {
    color: #999;
    font-size: 90%;
    margin: 0;
}
 
.post-category {
    margin: 0 0.1em;
    padding: 0.3em 1em;
    color: #fff;
    background: #999;
    font-size: 80%;
}
.post-category-design {
    background: #5aba59;
}
.post-category-pure {
    background: #4d85d1;
}
.post-category-yui {
    background: #8156a7;
}
.post-category-js {
    background: #df2d4f;
}
 
.post-images {
    margin: 1em 0;
}
.post-image-meta {
    margin-top: -3.5em;
    margin-left: 1em;
    color: #fff;
    text-shadow: 0 1px 1px #333;
}


8. Notes:

There are ten images in total; mind the relative paths: the CSS, HTML, and images folders sit in the same directory.

Note to self: the project lives at F:\Python实战:四周实现爬虫系统\作业代码\第一周\上课_1

 

II. Parsing elements from a local file

1. HTML source of the file to parse

<html>
<head>
    <link rel="stylesheet" type="text/css" href="new_blah.css">
</head>
<body>
    <div class="header">
        <img src="images/blah.png">
        <ul class="nav">
            <li><a href="#">Home</a></li>
            <li><a href="#">Site</a></li>
            <li><a href="#">Other</a></li>
        </ul>
    </div>
    <div class="main-content">
        <h2>Article</h2>
        <ul class="articles">
            <li>
                <img src="images/0001.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">fun</span>
                        <span class="meta-cate">Wow</span>
                    </p>
                    <p class="description">white sands and turquoise waters</p>
                </div>
                <div class="rate">
                    <span class="rate-score">4.5</span>
                </div>
            </li>
            <li>
                <img src="images/0002.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">How to get tanned</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
                    </p>
                    <p class="description">hot bikini girls on beach</p>
                </div>
                <div class="rate">
                    <img src="images/Fire.png" width="18" height="18">
                    <span class="rate-score">5.0</span>
                </div>
            </li>
            <li>
                <img src="images/0003.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">sea</span>
                    </p>
                    <p class="description">To make the most of your visit</p>
                </div>
                <div class="rate">
                    <span class="rate-score">3.5</span>
                </div>
            </li>
            <li>
                <img src="images/0004.jpg" width="100" height="91">
                <div class="article-info">
                    <h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
                    <p class="meta-info">
                        <span class="meta-cate">bay</span>
                        <span class="meta-cate">boat</span>
                        <span class="meta-cate">beach</span>
                    </p>
                    <p class="description">choosing a beach in Cape Cod</p>
                </div>
                <div class="rate">
                    <span class="rate-score">3.0</span>
                </div>
            </li>
        </ul>
    </div>
    <div class="footer">
        <p>© Mugglecoding</p>
    </div>
</body>
</html>


 

 

2. CSS file of the page to parse

body {
    padding: 0 0 0 0;
    background-color: #ffffff;
    background-image: url(images/bg3-dark.jpg);
    background-position: top left;
    background-repeat: no-repeat;
    background-size: cover;
    font-family: Helvetica, Arial, sans-serif;
}
.main-content {
    width: 500px;
    padding: 20px 20px 20px 20px;
    border: 1px solid #dddddd;
    border-radius:15px;
    margin: 30px auto 0 auto;
    background: #fdffff;
    -webkit-box-shadow: 0 0 22px 0 rgba(50, 50, 50, 1);
    -moz-box-shadow:    0 0 22px 0 rgba(50, 50, 50, 1);
    box-shadow:         0 0 22px 0 rgba(50, 50, 50, 1);
}
.main-content p {
    line-height: 26px;
}
.main-content h2 {
    color: #585858;
}
.articles {
    list-style-type: none;
    padding: 0;
}
.articles img {
    float: left;
    padding-right: 11px;
}
.articles li {
    border-top: 1px solid #F1F1F1;
    background-color: #ffffff;
    height: 90px;
    clear: both;
}
.articles h3 {
    margin: 0;
}
.articles a {
    color:#585858;
    text-decoration: none;
}
.articles p {
    margin: 0;
}
 
.article-info {
    float: left;
    display: inline-block;
    margin: 8px 0 8px 0;
}
 
.rate {
    float: right;
    display: inline-block;
    margin:35px 20px 35px 20px;
}
 
.rate-score {
    font-size: 18px;
    font-weight: bold;
    color: #585858;
}
 
.rate-score-hot {
 
 
}
 
.meta-info {
}
 
.meta-cate {
    margin: 0 0.1em;
    padding: 0.1em 0.7em;
    color: #fff;
    background: #37a5f0;
    font-size: 20%;
    border-radius: 10px ;
}
 
.description {
    color: #cccccc;
}
 
.nav {
    padding-left: 0;
    margin: 5px 0 20px 0;
    text-align: center;
}
.nav li {
    display: inline;
    padding-right: 10px;
}
.nav li:last-child {
    padding-right: 0;
}
.header {
    padding: 10px 10px 10px 10px;
 
}
 
.header a {
    color: #ffffff;
}
.header img {
    display: block;
    margin: 0 auto 0 auto;
}
.header h1 {
    text-align: center;
}
 
 
 
.footer {
    margin-top: 20px;
}
.footer p {
    color: #aaaaaa;
    text-align: center;
    font-weight: bold;
    font-size: 12px;
    font-style: italic;
    text-transform: uppercase;
}


 

3. Parsing steps

1) Parse the page with BeautifulSoup

2) Describe the locations to scrape

3) Pull the information out of the tags and load it into a container that is easy to query

4. Parsing the page with BeautifulSoup

1) Code

The standard call is soup = BeautifulSoup(html, 'lxml'): the first argument is the page content, the second the parser. BeautifulSoup supports several parsers: Python's built-in 'html.parser', lxml's HTML parser ('lxml'), lxml's XML parser ('lxml-xml'), and 'html5lib'.

from bs4 import BeautifulSoup

with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    print(Soup)

2) Error 1

can't import beautifulsoup

Cause: the BeautifulSoup package is not installed. Fix, from cmd:

pip install bs4

3) Error 2

bs4.FeatureNotFound: Couldn't find a tree builder with the features you requested: lxml. Do you need to install a parser library?

Cause: the parser is not installed. Fix, from cmd:

pip install lxml
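Once a parser is available, the parser argument can be exercised on a small fragment. The sketch below uses 'html.parser', which needs no extra install; 'lxml', 'lxml-xml', or 'html5lib' can be swapped in once installed (they differ mainly in speed and in how they repair malformed HTML):

```python
# Minimal check of BeautifulSoup's parser argument on an inline fragment.
from bs4 import BeautifulSoup

html = '<ul class="nav"><li><a href="#">Home</a></li><li><a href="#">Site</a></li></ul>'
soup = BeautifulSoup(html, 'html.parser')  # or 'lxml' / 'html5lib' if installed

# CSS selection works the same regardless of which parser built the tree.
links = [a.get_text() for a in soup.select('ul.nav a')]
print(links)  # ['Home', 'Site']
```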

4) Output

<html>
<head>
<link href="new_blah.css" rel="stylesheet" type="text/css"/>
</head>
<body>
<div class="header">
<img src="images/blah.png"/>
<ul class="nav">
<li><a href="#">Home</a></li>
<li><a href="#">Site</a></li>
<li><a href="#">Other</a></li>
</ul>
</div>
<div class="main-content">
<h2>Article</h2>
<ul class="articles">
<li>
<img height="91" src="images/0001.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">Sardinia's top 10 beaches</a></h3>
<p class="meta-info">
<span class="meta-cate">fun</span>
<span class="meta-cate">Wow</span>
</p>
<p class="description">white sands and turquoise waters</p>
</div>
<div class="rate">
<span class="rate-score">4.5</span>
</div>
</li>
<li>
<img height="91" src="images/0002.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">How to get tanned</a></h3>
<p class="meta-info">
<span class="meta-cate">butt</span><span class="meta-cate">NSFW</span>
</p>
<p class="description">hot bikini girls on beach</p>
</div>
<div class="rate">
<img height="18" src="images/Fire.png" width="18"/>
<span class="rate-score">5.0</span>
</div>
</li>
<li>
<img height="91" src="images/0003.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">How to be an Aussie beach bum</a></h3>
<p class="meta-info">
<span class="meta-cate">sea</span>
</p>
<p class="description">To make the most of your visit</p>
</div>
<div class="rate">
<span class="rate-score">3.5</span>
</div>
</li>
<li>
<img height="91" src="images/0004.jpg" width="100"/>
<div class="article-info">
<h3><a href="www.sample.com">Summer's cheat sheet</a></h3>
<p class="meta-info">
<span class="meta-cate">bay</span>
<span class="meta-cate">boat</span>
<span class="meta-cate">beach</span>
</p>
<p class="description">choosing a beach in Cape Cod</p>
</div>
<div class="rate">
<span class="rate-score">3.0</span>
</div>
</li>
</ul>
</div>
<div class="footer">
<p>© Mugglecoding</p>
</div>
</body>
</html>


 

5. Describing the location to scrape

Locations are described with CSS selectors. To obtain one: select the element on the page -> right-click Inspect -> right-click the highlighted node -> Copy -> Copy selector.

# Source
from bs4 import BeautifulSoup

with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    # print(Soup)
    print("First image")
    # image1 = Soup.select('body > div.main-content > ul > li:nth-child(1) > img')
    # The selector copied from Chrome uses :nth-child(1), which raises an
    # error here; follow the error's hint and use :nth-of-type(1) instead.
    image1 = Soup.select('body > div.main-content > ul > li:nth-of-type(1) > img')
    print(image1)
    print("All images")
    # Dropping the position qualifier matches the image in every <li>.
    images = Soup.select('body > div.main-content > ul > li > img')
    # Select the remaining fields the same way.
    title = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    score = Soup.select('body > div.main-content > ul > li > div.rate > span')
    selector = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    description = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')
    print(images, title, score, selector, description, sep='\n----------------------------------\n')


# Output
First image
[<img height="91" src="images/0001.jpg" width="100"/>]
All images
[<img height="91" src="images/0001.jpg" width="100"/>, <img height="91" src="images/0002.jpg" width="100"/>, <img height="91" src="images/0003.jpg" width="100"/>, <img height="91" src="images/0004.jpg" width="100"/>]
----------------------------------
[<a href="www.sample.com">Sardinia's top 10 beaches</a>, <a href="www.sample.com">How to get tanned</a>, <a href="www.sample.com">How to be an Aussie beach bum</a>, <a href="www.sample.com">Summer's cheat sheet</a>]
----------------------------------
[<span class="rate-score">4.5</span>, <span class="rate-score">5.0</span>, <span class="rate-score">3.5</span>, <span class="rate-score">3.0</span>]
----------------------------------
[<span class="meta-cate">fun</span>, <span class="meta-cate">Wow</span>, <span class="meta-cate">butt</span>, <span class="meta-cate">NSFW</span>, <span class="meta-cate">sea</span>, <span class="meta-cate">bay</span>, <span class="meta-cate">boat</span>, <span class="meta-cate">beach</span>]
----------------------------------
[<p class="description">white sands and turquoise waters</p>, <p class="description">hot bikini girls on beach</p>, <p class="description">To make the most of your visit</p>, <p class="description">choosing a beach in Cape Cod</p>]


6. Extracting the relevant information

# Print every field for every article
from bs4 import BeautifulSoup

with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
    # selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
    descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')

for title, image, desc, selec, score in zip(titles, images, descrs, selecs, scores):
    data = {
        # 'selec': selec.get_text(),
        'selec': list(selec.stripped_strings),  # every text fragment under the tag
        'title': title.get_text(),
        'image': image.get('src'),
        'desc': desc.get_text(),
        'score': score.get_text()
    }
    print(data)
 


# Output
{'selec': ['fun', 'Wow'], 'title': "Sardinia's top 10 beaches", 'image': 'images/0001.jpg', 'desc': 'white sands and turquoise waters', 'score': '4.5'}
{'selec': ['butt', 'NSFW'], 'title': 'How to get tanned', 'image': 'images/0002.jpg', 'desc': 'hot bikini girls on beach', 'score': '5.0'}
{'selec': ['sea'], 'title': 'How to be an Aussie beach bum', 'image': 'images/0003.jpg', 'desc': 'To make the most of your visit', 'score': '3.5'}
{'selec': ['bay', 'boat', 'beach'], 'title': "Summer's cheat sheet", 'image': 'images/0004.jpg', 'desc': 'choosing a beach in Cape Cod', 'score': '3.0'}
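The list(selec.stripped_strings) call above behaves differently from get_text(): instead of one blob of text, it yields each text fragment separately with surrounding whitespace removed. A tiny self-contained demonstration:

```python
# get_text() vs stripped_strings on a <p> holding several <span> tags.
from bs4 import BeautifulSoup

html = '<p class="meta-info"> <span>bay</span> <span>boat</span> </p>'
p = BeautifulSoup(html, 'html.parser').select_one('p.meta-info')

print(p.get_text())              # ' bay boat '  (one string, whitespace and all)
print(list(p.stripped_strings))  # ['bay', 'boat']
```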
 


# Print the articles scoring above 3
from bs4 import BeautifulSoup

info = []
with open('F:/Python实战:四周实现爬虫系统/作业代码/第一周/上课_2/web/new_index.html', 'r') as wb_data:
    Soup = BeautifulSoup(wb_data, 'lxml')
    images = Soup.select('body > div.main-content > ul > li > img')
    titles = Soup.select('body > div.main-content > ul > li > div.article-info > h3 > a')
    scores = Soup.select('body > div.main-content > ul > li > div.rate > span')
    # selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info > span')
    selecs = Soup.select('body > div.main-content > ul > li > div.article-info > p.meta-info')
    descrs = Soup.select('body > div.main-content > ul > li > div.article-info > p.description')

for title, image, desc, selec, score in zip(titles, images, descrs, selecs, scores):
    data = {
        # 'selec': selec.get_text(),
        'selec': list(selec.stripped_strings),  # every text fragment under the tag
        'title': title.get_text(),
        'image': image.get('src'),
        'desc': desc.get_text(),
        'score': score.get_text()
    }
    info.append(data)
for i in info:
    if float(i['score']) > 3:
        print(i['title'], i['score'])
 


# Output:
Sardinia's top 10 beaches 4.5
How to get tanned 5.0
How to be an Aussie beach bum 3.5
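The same info list can be ranked as well as filtered. A small self-contained sketch, using sample data that mirrors the printed results above:

```python
# Sort collected articles by score, highest first (sample data).
info = [
    {'title': "Sardinia's top 10 beaches", 'score': '4.5'},
    {'title': 'How to get tanned', 'score': '5.0'},
    {'title': 'How to be an Aussie beach bum', 'score': '3.5'},
    {'title': "Summer's cheat sheet", 'score': '3.0'},
]

# Scores are scraped as strings, so convert before comparing.
ranked = sorted(info, key=lambda i: float(i['score']), reverse=True)
for i in ranked:
    print(i['title'], i['score'])  # 'How to get tanned 5.0' comes first
```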


III. Scraping a real web page

Scraping TripAdvisor with requests + BeautifulSoup

1. How the local client and the server exchange data

1) The HTTP protocol

Clicking a page sends a request to the server:

# GET example:

GET /page_one.html HTTP/1.1
Host: www.sample.com

The server answers with a response carrying a status_code and the page to display.

To watch the exchange: right-click -> Inspect -> Network tab.

HTTP/1.0 methods: GET, POST, HEAD

HTTP/1.1 adds: OPTIONS, PUT, DELETE, TRACE, CONNECT
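requests composes exactly this kind of message for you. One way to see the method and path it would send, without touching the network, is to prepare the request first (www.sample.com is the placeholder host from the example, not a real endpoint):

```python
# Inspect what requests would send, without sending it.
import requests

req = requests.Request('GET', 'http://www.sample.com/page_one.html')
prepared = req.prepare()

print(prepared.method)    # GET
print(prepared.path_url)  # /page_one.html
```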

2) Setup

pip install requests


2. Steps for parsing a real page

1) Send the request with requests

2) Fetch the whole page

from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout = 500)
soup=BeautifulSoup(wb_data.text,'lxml')
print(soup)


3) Describe the position of the target element

# Scrape one title via its copied selector
from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
titles=soup.select('#taplc_attraction_coverpage_attraction_0 > div:nth-of-type(4) > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
print(titles)


Result:

[<a class="poiTitle" data-tpact="shelf_item_click" data-tpatt="4|poi|272517" data-tpid="20" data-tpp="Attractions" href="/Attraction_Review-g60763-d272517-Reviews-Conservatory_Garden-New_York_City_New_York.html" onclick="widgetEvCall('handlers.shelfItemClick', event, this)" target="_blank">温室花园</a>]


4) Select all elements sharing a characteristic attribute, e.g. image size

# Scrape every image with width="200"
from bs4 import BeautifulSoup
import requests
 
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
imgs=soup.select('img[width="200"]')
print(imgs)


5) Iterate and collect into dictionaries

# Build one dictionary per item
from bs4 import BeautifulSoup
import requests
url='https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
 
wb_data=requests.get(url,timeout=500)
soup=BeautifulSoup(wb_data.text,'lxml')
imgs=soup.select('img[width="200"]')
titles=soup.select('#taplc_attraction_coverpage_attraction_0 > div > div > div > div.shelf_item_container > div:nth-of-type(1) > div.poi > div > div.item.name > a')
for title,img in zip(titles,imgs):
    data={
        'title':title.get_text(),
        'img':img.get('src'),
    }
    print(data)


3. Skipping the login step by supplying the information in the request headers

from bs4 import BeautifulSoup
import requests
import time
 
url_saves = 'http://www.tripadvisor.com/Saves#37685322'
url = 'https://cn.tripadvisor.com/Attractions-g60763-Activities-New_York_City_New_York.html'
urls = ['https://cn.tripadvisor.com/Attractions-g60763-Activities-oa{}-New_York_City_New_York.html#ATTRACTION_LIST'.format(str(i)) for i in range(30,930,30)]
 
headers = {
    'User-Agent':'',   # copy from your browser's Network panel
    'Cookie':''        # copy from a logged-in session
}
 
 
def get_attractions(url,data=None):
    wb_data = requests.get(url)
    time.sleep(4)
    soup = BeautifulSoup(wb_data.text,'lxml')
    titles    = soup.select('div.property_title > a[target="_blank"]')
    imgs      = soup.select('img[width="160"]')
    cates     = soup.select('div.p13n_reasoning_v2')
 
    if data is None:
        for title,img,cate in zip(titles,imgs,cates):
            data = {
                'title'  :title.get_text(),
                'img'    :img.get('src'),
                'cate'   :list(cate.stripped_strings),
                }
            print(data)  # print inside the loop: one dict per attraction
 
 
def get_favs(url,data=None):
    wb_data = requests.get(url,headers=headers)
    soup      = BeautifulSoup(wb_data.text,'lxml')
    titles    = soup.select('a.location-name')
    imgs      = soup.select('div.photo > div.sizedThumb > img.photo_image')
    metas = soup.select('span.format_address')
 
    if data is None:
        for title,img,meta in zip(titles,imgs,metas):
            data = {
                'title'  :title.get_text(),
                'img'    :img.get('src'),
                'meta'   :list(meta.stripped_strings)
            }
            print(data)
 
for single_url in urls:
    get_attractions(single_url)


4. Anti-scraping

If the desktop site resists scraping: open Inspect -> switch to the mobile device view -> parse that version instead (the mobile site's protections are usually looser).
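A minimal sketch of that trick with requests: present a phone-style User-Agent so the server returns the simpler mobile page. The UA string below is only an example; copy a current one from Chrome's device toolbar.

```python
# Fetch a page while pretending to be a phone browser.
import requests

mobile_headers = {
    'User-Agent': ('Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) '
                   'AppleWebKit/601.1.46 (KHTML, like Gecko) '
                   'Version/9.0 Mobile/13B143 Safari/601.1'),
}

def fetch_mobile(url):
    """Fetch a URL while presenting a mobile User-Agent."""
    return requests.get(url, headers=mobile_headers, timeout=30)
```

The response can then go through the same BeautifulSoup pipeline as before.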

 

IV. Fetching dynamically loaded (asynchronous) data

1. Asynchronous loading

Content keeps loading without a page change: JavaScript fetches the data in batches after the initial HTML arrives, so it is not part of the page source you first receive.

2. Spotting the async data

Inspect -> Network -> XHR

Name column: as each new batch loads, a request appears whose URL carries the page number (page=x).

Response: a batch of div tags, including the links you want.

3.代码

from bs4 import BeautifulSoup
import requests
import time
url = 'https://knewone.com/discover?page='
def get_page(url,data=None):
 
    wb_data = requests.get(url)
    soup = BeautifulSoup(wb_data.text,'lxml')
    imgs = soup.select('a.cover-inner > img')
    titles = soup.select('section.content > h4 > a')
    links = soup.select('section.content > h4 > a')
 
    if data is None:
        for img,title,link in zip(imgs,titles,links):
            data = {
                'img':img.get('src'),
                'title':title.get('title'),
                'link':link.get('href')
            }
            print(data)
 
# Paging driver: fetch pages start..end-1 with a polite delay
def get_more_pages(start,end):
    for one in range(start,end):
        get_page(url+str(one))
        time.sleep(2)
 
get_more_pages(1,10)
 


V. Homework: scrape product listings

from bs4 import BeautifulSoup
import requests
import time
 
url = 'http://bj.58.com/pingbandiannao/24604629984324x.shtml'  # example detail page
 
def get_links_from(who_sells):
    urls = []
    list_view = 'http://bj.58.com/pbdn/{}/pn2/'.format(str(who_sells))
    wb_data = requests.get(list_view)
    soup = BeautifulSoup(wb_data.text,'lxml')
    for link in soup.select('td.t a.t'):
        urls.append(link.get('href').split('?')[0])
    return urls
 
 
def get_views_from(url):
    # The listing id is the filename minus its 'x.shtml' suffix.
    id = url.split('/')[-1].split('x.shtml')[0]
    api = 'http://jst1.58.com/counter?infoid={}'.format(id)
    # 58.com's view-counter endpoint; if query APIs like this are new to
    # you, the Sina Weibo API docs are a readable introduction.
    js = requests.get(api)
    views = js.text.split('=')[-1]
    return views
 
 
def get_item_info(who_sells=0):
 
    urls = get_links_from(who_sells)
    for url in urls:
 
        wb_data = requests.get(url)
        soup = BeautifulSoup(wb_data.text,'lxml')
        data = {
            'title':soup.title.text,
            'price':soup.select('.price')[0].text,
            'area' :list(soup.select('.c_25d')[0].stripped_strings) if soup.find_all('span','c_25d') else None,
            'date' :soup.select('.time')[0].text,
            'cate' :'个人' if who_sells == 0 else '商家',  # individual vs merchant
            # 'views':get_views_from(url)
        }
        print(data)
 
# get_links_from(1)

get_item_info(0)  # pass who_sells (0 = individual, 1 = merchant), not a URL

