Do you really know how to buy things at the best price? Learn to use Python to crawl prices!

1. The coding process

1 Determine the goal: use the regular expressions you have just learned to crawl product names and product prices from an e-commerce website.

2 Determine the plan:

①Select e-commerce website:

Querying products by keyword on Taobao requires logging in first. Working out the Taobao login flow means capturing and analyzing network packets, which takes a while and may not succeed at all, so I gave up on Taobao.

On JD.com, running a keyword search shows that the keyword is simply placed in the URL:

'https://search.jd.com/Search?keyword=' + keyword + '&enc=utf-8&wq=' + keyword

The source of the search results page also contains the product price and product name directly, without relying on JavaScript to fill them in, so I decided to learn on the friendlier JD.com.
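The search URL above is built by plain string concatenation, which leaves the encoding of the Chinese keyword to the HTTP library. As a minimal sketch, the same URL can be built with explicit percent-encoding using the standard library. The helper name `build_search_url` and the optional `page` argument are assumptions made for illustration, not part of the original code.

```python
from urllib.parse import urlencode

BASE_URL = "https://search.jd.com/Search"  # search address quoted in the text above


def build_search_url(keyword, page=None):
    """Hypothetical helper: build the keyword search URL with percent-encoding."""
    params = {"keyword": keyword, "enc": "utf-8", "wq": keyword}
    if page is not None:
        params["page"] = page  # paging parameter, as used later in the crawler
    return BASE_URL + "?" + urlencode(params)  # urlencode encodes non-ASCII text as UTF-8


url = build_search_url("书包", page=1)
print(url)
```

Whether the site accepts exactly these parameters today is not something this sketch can guarantee; it only shows how to assemble the query string safely.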

② Determine the regular expressions that capture the price and product name from the source of the product list page. Copy a fragment of that source, experiment on it, and move the expressions into the crawler once the experiment works.

The experimental code is as follows:

# Experiment: extract the product price and product name
import re
goods1 = '手机壳'  # search keyword ("phone case")
html1 = '<div class="p-price"><strong class="J_55346447381" data-done="1"><em>¥</em><i>28.80</i></strong>		</div>		<div class="p-name p-name-type-2">			<a target="_blank" title="【推荐苹果11/X系列隐形钻石膜,防爆不碎边】专享买1送1,领券59减3、满79减5!京配免邮,次日达!多买多优惠!猛戳这里去购买!" href="//item.jd.com/55346447381.html" onclick="searchlog(1,55346447381,0,1,flagsClk=20971660)">				<em><span class="p-tag" style="background-color:#c81623">京东超市</span>亿色(ESR)苹果11/11Pro<font class="skcolor_ljg">手机壳</font> iPhone11 Pro max保护套超薄全透明防摔硅胶壳 苹果11【6.1英寸】送钢化膜</em>				<i class="promo-words" id="J_AD_55346447381">【推荐苹果11/X系列隐形钻石膜,防爆不碎边】专享买1送1,领券59减3、满79减5!京配免邮,次日达!多买多优惠!猛戳这里去购买!</i>			</a>		</div>'
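# Price: a string that starts with <em>¥</em><i> and ends with a dot followed by two digits
plt = re.findall(r'<em>¥</em><i>.*?\.\d\d', html1)
print(plt)
price = plt[0].split('<i>')[1]  # keep only the number after <i>
print(price)
# Name: from <em> up to the first </em> that follows.
# Note: [^(<em>¥</em>)] is a character class. It matches one character that is none of ( < e m > ¥ / ),
# so it skips the price's <em>¥</em> only because that <em> is preceded by '>'.
# The matched string also includes that one leading character.
tlt1 = re.findall(r'[^(<em>¥</em>)]<em>.*?' + goods1 + r'.*?</em>', html1)  # name must contain the keyword
tlt2 = re.findall(r'[^(<em>¥</em>)]<em>.*?[\u4e00-\u9fa5].*?</em>', html1)  # name must contain a Chinese character
print(tlt1)
print(tlt2)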
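The experiment above relies on a character class to step over the price's `<em>¥</em>`. A possible variation, sketched here on a shortened sample written for this example (not the real page fragment), uses a capture group for the price and a negative lookahead for the name, so the results contain only the text of interest.

```python
import re

# Shortened sample in the same shape as the JD.com fragment used in the experiment.
sample = (
    '<div class="p-price"><strong><em>¥</em><i>28.80</i></strong></div>'
    '<div class="p-name"><a><em>ESR <font class="skcolor_ljg">手机壳</font> iPhone11</em>'
    '<i class="promo-words">promo text</i></a></div>'
)

# The capture group (...) returns only the number, so no split('<i>') is needed.
prices = re.findall(r'<em>¥</em><i>(\d+\.\d\d)</i>', sample)

# (?!¥) is a negative lookahead: match <em> only when it is not followed by ¥.
names = re.findall(r'<em>(?!¥)(.*?)</em>', sample)

print(prices)
print(names)
```

With a capture group, `re.findall` returns the group content rather than the whole match, which is why the `<em>` tags and the leading character disappear from the results.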

2. The complete crawler code

It is generous of JD.com to leave its search pages open for us novices to learn from. The following code is for learning and discussion only. Please crawl the way a human would browse, and do not send frequent, machine-like requests.
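One way to crawl "like a human" is to pause for a random interval between requests instead of firing them back to back. The sketch below is only an illustration of that idea: the helper `crawl_politely`, its parameters, and the delay range are assumptions, and the dry run uses stand-in functions so that nothing is actually requested.

```python
import random
import time


def crawl_politely(urls, fetch, min_delay=2.0, max_delay=5.0, sleep=time.sleep):
    """Hypothetical helper: fetch each URL in turn, pausing a random time between requests."""
    pages = []
    for index, url in enumerate(urls):
        if index > 0:  # no need to wait before the very first request
            sleep(random.uniform(min_delay, max_delay))
        pages.append(fetch(url))
    return pages


# Dry run with stand-ins: the fake fetch returns a string, and pauses are recorded instead of slept.
pauses = []
pages = crawl_politely(
    ["page-1", "page-2", "page-3"],
    fetch=lambda url: "html of " + url,
    sleep=pauses.append,
)
print(pages)
print(len(pauses))
```

In a real crawl, `fetch` could be a small function that wraps `requests.get(url, headers=..., timeout=30)` and returns the response text, with `sleep` left at its default.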

# Crawler code
import requests
import re
import time

goods = '书包'  # search keyword ("school bag")
depth = 2  # search depth of 2: crawl page 1 and page 2
start_url = 'https://search.jd.com/Search?keyword=' + goods + '&enc=utf-8&wq=' + goods
infoList = []
hd = {'user-agent': 'Mozilla/5.0'}
for j in range(depth):  # process each page in turn
    # Build the paged URL, e.g. https://search.jd.com/Search?keyword=书包&enc=utf-8&wq=书包&page=1
    url = start_url + '&page=' + str(j + 1)
    try:
        r = requests.get(url, headers=hd, timeout=30)
        r.raise_for_status()
        r.encoding = r.apparent_encoding  # decode with the detected encoding to avoid garbled text
        print(r.status_code)
        html = r.text
        print(r.url)
        print(r.text)
    except requests.RequestException:
        print("Fetch error")
        continue  # nothing to parse for this page, move on to the next one
    # Price: starts with <em>¥</em><i> and ends with a dot followed by two digits
    plt = re.findall(r'<em>¥</em><i>.*?\.\d\d', html)
    # Name: from <em> to the first </em>, containing a Chinese character; the leading
    # character class skips the price's <em>¥</em>, which is preceded by '>'
    tlt = re.findall(r'[^(<em>¥</em>)]<em>.*?[\u4e00-\u9fa5].*?</em>', html)
    if len(plt) != len(tlt):
        print("Parse warning: price and name counts differ")
    for p, title in zip(plt, tlt):  # zip stops at the shorter list, so no IndexError
        price = p.split('<i>')[1]
        infoList.append([price, title])  # append() adds a new item to the end of the list
    time.sleep(2)  # pause between pages

tplt = "{:^10}\t{:^10}\t{:^20}"  # print template; each {} is a slot filled by str.format()
print(tplt.format("Serial number", "Price", "Product name"))
count = 0
for g in infoList:
    count = count + 1
    print(tplt.format(count, g[0], g[1]))  # print price and name; the title string is left unprocessed

3. The crawled information

Only the first 10 items of the list are shown, and the redundant markup in the product titles has not been stripped.

Serial number price product name

1 49.00 <em>Multifunctional student hanging book bag, adjustable desk hanging bag, book storage bag, student hanging book bag, book hanging bag, desk storage bag, document stationery hanging book bag, desk artifact hanger blue</em>

2 99.00 <em>Scarecrow Backpack for Men and Women 14/15.6-inch Large Capacity Laptop Bag Multifunctional Travel Backpack Water-Repellent Business Casual Student<font class="skcolor_ljg">School Bag</font>50470 Black</em>

3 69.00 <em>Backpack Men's Backpack Large Capacity Fashion Casual Business Travel Laptop Bag High School College Student <font class="skcolor_ljg">School Bag</font> Men's Trendy USB Charging Bag 65199 Black</em>

4 159.00 <em> Septwolves Backpack Backpack Men's 15.6-inch Computer Bag Business Casual Commuting Water-Repellent Oxford Cloth <font class="skcolor_ljg">School Bag</font> Black B0301872-201</em>

5 168.00 <em>Switzerland SWICKY Backpack Men's Backpack New Large Capacity Casual Business Travel Laptop Bag Student <font class="skcolor_ljg">School Bag</font> Business Travel Bag USB Charging Bag Black Large with USB Free Multifunctional Knife +lock</em>

6 169.00 <em>Septwolves Backpack Men's Oxford Cloth Backpack Casual Simple 15.6-inch Computer Bag Fashionable Travel Bag Large Capacity Student <font class="skcolor_ljg">School Bag</font> Male Black B0301062-201</em>

7 69.80 <em><font class="skcolor_ljg">School bag</font>Men's luminous backpack for primary and secondary school students, Korean version of casual computer bag, college USB travel bag, USB large music boy + pencil case + anti-theft lock </em>

8 159.00 <em><img class="p-tag3" src=" //img14.360buyimg.com/uba/jfs/t6919/268/501386350/1257/92e5fb39/5976fcf9Nd915775f.png " />The 9th V. NINE primary school students<font class="skcolor_ljg">schoolbag</font>boys and girls children's spine protection<font class="skcolor_ljg">schoolbag</font> 1-3-6 grade burden-reducing backpack junior high school students leisure<font class=" skcolor_ljg">Schoolbag</font> VD9BV33972J blue with pink</em>

9 79.00 <em><span class="p-tag" style="background-color:#c81623">JD Supermarket</span> The9 V.NINE Backpack Men's and Women's Cartoon Print<font class="skcolor_ljg"> School bag</font>Six-piece canvas casual backpack campus primary and secondary school students<font class="skcolor_ljg">School bag</font> VB7BV32884J Pink suit</em>

10 59.90 <em>2020 New Style<font class="skcolor_ljg">School Bag</font>Men's Backpack Female Junior High School Back-to-School Bag Casual Simple Fashion Trend Canvas Bag Versatile High School Student Black</em>
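The titles listed above still carry the `<em>`, `<font>`, `<span>` and `<img>` markup from the page. As a rough sketch of the post-processing that was left out, the tags can be removed with one more regular expression. The helper name `clean_title` is an assumption for illustration, and the approach is deliberately simple: it drops tags but keeps their inner text, so a label such as "JD Supermarket" inside a `<span>` would remain.

```python
import re


def clean_title(raw):
    """Hypothetical helper: remove HTML tags and collapse whitespace in a captured title."""
    text = re.sub(r'<[^>]+>', '', raw)        # drop tags such as <em>, <font ...>, <img ... />
    return re.sub(r'\s+', ' ', text).strip()  # collapse runs of whitespace, trim the ends


raw = '<em>2020 New Style<font class="skcolor_ljg">School Bag</font>Men\'s Backpack</em>'
print(clean_title(raw))
```

In the crawler, this could be applied to each `title` before it is appended to `infoList`.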

4. Part of the product list page source, as captured at the time (2020-4-12)

Comparing this page source with the crawler code shows why the regular expressions are written the way they are.

<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="蓝色"><img data-url="https://item.jd.com/64923971966.html"  data-presale="" data-sku="64923971966" data-img="1" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/89937/11/18011/133904/5e8e7872E5d238ffa/e2752ecd1eb188cc.jpg" class="err-product" width="25" height="25" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_64923971966" data-done="1"><em>¥</em><i>49.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="多功能学生挂书袋可调课桌挂袋书本收纳袋 学生挂书袋 书挂袋书桌收纳袋文件文具挂书袋课桌神器挂架 蓝色" href="https://item.jd.com/64923971966.html" onclick="searchlog(1,64923971966,0,1,'','adwClk=1');searchAdvPointReport('https://ccc-x.jd.com/dsp/nc?ext=aHR0cHM6Ly9pdGVtLmpkLmNvbS82NDkyMzk3MTk2Ni5odG1s&log=4o6yQPJy6XmVSDUPaAlnilzQoTl0WfQq_iFkBg-nAELRr_jWgST6F3gHkDceKGeLFNVwe-soMnCpciBNs23mQ-Ilfi01tO75IDlJJX-6zhuGhAHxgFmEvKNeQT_qOIh8ZDU-NBcY8BsO9QLaz0X57aPu3e23a54_KScadwVylpD691LvcQa8ZbIjXHcQ17QOvtke4mexTr2lONtxOUaqrutZv5jV-h-7aOPjf_pruYgj_evBk7UICQoYrHVO0KZ_lui2p5hOalWxF3oKDmIkyo4ZwP8laIw9XFGI5tSiiOm1NqThyDWIwpRknK91PjiHNrlTIzMDemk-v03a2rjIi-Q9nHrG7vrq_SP0hc3z8aqUgN5VvW8WeChuIzSBJSGEoENy3HEx0XnARSKCiUbYBcrU--XghhLCocnp0a8x_sX7vMd1idTT4W7eeYfs-2v1u2ftQZz3UWxuI3bljxX0ZQ7obwL7Nyw9KbZS9wasMO5UY9kv5KyTRUc3-SQCTeEhUnCFou_VllDAaoHd90ols2Ca3lLUcCgcWEqv8HL7xiQ17MN8mm9-HFMIyYlZWwGZ1E9NuCW9M2PZ2IqYDTqGY5aVRFkJez8V3wQrqn61VwU9KCrU8GT2WOUmahNglOKLTvkdAzsKbg5Un2kUV3D2mssvsf76pw5itFaS5nSyxg3QPpBBd_gWrWCMrbuQ858X&v=404&clicktype=1&&clicktype=1');">
				<em>多功能学生挂书袋可调课桌挂袋书本收纳袋 学生挂书袋 书挂袋书桌收纳袋文件文具挂书袋课桌神器挂架 蓝色</em>
				<i class="promo-words" id="J_AD_64923971966"></i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_64923971966" target="_blank" href="https://item.jd.com/64923971966.html" onclick="searchlog(1,64923971966,0,3,'','adwClk=1')"></a></strong>
		</div>
		<div class="p-focus"><a class="J_focus" data-sku="64923971966" href="javascript:;" title="点击关注" onclick="searchlog(1,64923971966,0,5,'','adwClk=1')"><i></i>关注</a></div>
		<div class="p-shop" data-dongdong="" data-selfware="0" data-score="0" data-reputation="20" data-verderId="800106" data-shopid="795794">
		</div>	
		
		<div class="p-icons" id="J_pro_64923971966">
		</div>
		<span class="p-promo-flag">广告</span>
		
		<img source-data-lazy-advertisement="https://im-x.jd.com/dsp/np?log=4o6yQPJy6XmVSDUPaAlnilzQoTl0WfQq_iFkBg-nAELRr_jWgST6F3gHkDceKGeLFNVwe-soMnCpciBNs23mQ-Ilfi01tO75IDlJJX-6zhuGhAHxgFmEvKNeQT_qOIh8ZDU-NBcY8BsO9QLaz0X57aPu3e23a54_KScadwVylpARyavUgVRRZoP_thQ20x2cxcX9K-q692C4F-Ae3UlBOJQTPbwpeA47iOpQzp8MW-tnzhG4QcrgoNATCpmXhtOptt3X3m7MguGIYN2oKkU75SMlgTYm8masby6PnX7SeBx1yBcShcgL6IjCCrM_6RK9vJw8wVwmwW7VFgwAA5Ns0XspwYX1RIa8NoHIg6fzhJpq6wv56y7ePNvsosaGVfQoHjghgzr7XUaKnRhD-mRppyp0YHaLEuKPRIbqKvGO0ZTX4_iqFQyyOA24W8owSLkyKcUNiuRzv87NVKxkWEczyI_NvmrKLVtAy2pSNQKG1Q1tR84c1U_94w39kgMmZf9F0cNk-vsR2zq1DzwzJXILKv6BEWLANsPlDiKA9LkBsErwzHkoPKETW5cxZxubDxCnB9UpJcJ4GaGOrPu--5kmV2gsn1Cnj7OmpvttAZ9oRynB68bXmO5NQY3kaE2WOLOhXGG50Zx7KB1gRCCyB4zZMTr93pHBKNRK0LZCK3f4cbAOFs4uV5yL_vME_7tl_bFJmfgiBSlZZcHouiD99UcLmQ&v=404&rt=3" >
	</div>
</li>
<li class="gl-item" data-sku="5181576" data-spu="5181576" data-pid="5181576">
	<div class="gl-i-wrap">
		<div class="p-img">
			<a target="_blank" title="【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!" href="//item.jd.com/5181576.html" onclick="searchlog(1,5181576,1,2,'','flagsClk=1077940872')">
				<img width="220" height="220" class="err-product" data-img="1" source-data-lazy-img="//img11.360buyimg.com/n7/jfs/t1/96664/15/14541/395640/5e675971E689c5511/5b03b94f7fa247d1.jpg" />
</a>			<div data-lease="" data-catid="12071" data-venid="1000001048" data-presale=""></div>
		</div>
		<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="主图款15.6英寸黑色款"><img  data-presale="" data-sku="5181576" data-img="1" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/96664/15/14541/395640/5e675971E689c5511/5b03b94f7fa247d1.jpg" class="err-product" width="25" height="25" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色17.3英寸"><img  data-presale="" data-sku="100003909565" data-img="1" width="25" height="25" data-lazy-img="//img10.360buyimg.com/n9/jfs/t1/108779/28/8424/281341/5e675356E20f0c196/12e26e33f67909a0.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="主图款15.6英寸灰色"><img  data-presale="" data-sku="4242121" data-img="1" width="25" height="25" data-lazy-img="//img11.360buyimg.com/n9/jfs/t1/97571/6/14547/438010/5e675538Ee03aedd1/3f6a133e5ee04207.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色款"><img  data-presale="" data-sku="100002467473" data-img="1" width="25" height="25" data-lazy-img="//img13.360buyimg.com/n9/jfs/t1/85679/31/14640/129467/5e675a9aE87526b01/f41bf024388f7456.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="主图款15.6英寸蓝色"><img  data-presale="" data-sku="4242123" data-img="1" width="25" height="25" data-lazy-img="//img13.360buyimg.com/n9/jfs/t1/105047/20/14381/391870/5e675500E822f2475/8a835c1bce455204.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="灰色款"><img  data-presale="" data-sku="100002467469" data-img="1" width="25" height="25" data-lazy-img="//img14.360buyimg.com/n9/jfs/t1/86089/15/14660/208728/5e675b0fEbda83d7c/ab11a079415adc5f.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="深灰色款"><img  data-presale="" data-sku="100003060945" data-img="1" width="25" height="25" data-lazy-img-slave="//img10.360buyimg.com/n9/jfs/t1/108751/29/8393/194375/5e675317Ed80a440f/69f564da9cf0c212.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色A款"><img  data-presale="" data-sku="100005730491" data-img="1" width="25" height="25" data-lazy-img-slave="//img11.360buyimg.com/n9/jfs/t1/104931/2/14580/631645/5e675cf5E81bae1c9/5c3346eabfbe5b30.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色B款"><img  data-presale="" data-sku="100010262064" data-img="1" width="25" height="25" data-lazy-img-slave="//img14.360buyimg.com/n9/jfs/t1/89293/5/14530/560061/5e675c7fEe5c723c8/2ca305d297ae82c3.jpg" class="err-product" /></a></li>
										<li class="ps-item"><a href="javascript:;" title="黑色C款"><img  data-presale="" data-sku="100010262050" data-img="1" width="25" height="25" data-lazy-img-slave="//img10.360buyimg.com/n9/jfs/t1/98662/13/14678/545819/5e675cc2Ee6f433a2/076556df49cd2936.jpg" class="err-product" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_5181576" data-done="1"><em>¥</em><i>99.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!" href="//item.jd.com/5181576.html" onclick="searchlog(1,5181576,1,1,'','flagsClk=1077940872')">
				<em>稻草人双肩包男女14/15.6英寸大容量笔记本电脑包多功能旅行出差背包防泼水商务休闲学生<font class="skcolor_ljg">书包</font>50470黑色</em>
				<i class="promo-words" id="J_AD_5181576">【稻草人爆款双肩包,15.6英寸超大容量,三大隔层,出行轻松搞掂】8-12日,每满119减20元,稻草人品质保证。快来抢购吧!</i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_5181576" target="_blank" href="//item.jd.com/5181576.html#comment" onclick="searchlog(1,5181576,1,3,'','flagsClk=1077940872')"></a></strong>
		</div>
		<div class="p-focus"><a class="J_focus" data-sku="5181576" href="javascript:;" title="点击关注" onclick="searchlog(1,5181576,1,5,'','flagsClk=1077940872')"><i></i>关注</a></div>
		<div class="p-shop" data-dongdong="" data-selfware="1" data-score="5" data-reputation="98">
<span class="J_im_icon"><a target="_blank" class="curr-shop hd-shopname" onclick="searchlog(1,1000001048,0,58)" href="//mall.jd.com/index-1000001048.html" title="稻草人京东自营旗舰店">稻草人京东自营旗舰店</a></span>		</div>	
		
		<div class="p-icons" id="J_pro_5181576" data-done="1">
			<i class="goods-icons J-picon-tips J-picon-fix" data-idx="1" data-tips="京东自营,品质保障">自营</i>
    		<i class="goods-icons4 J-picon-tips" style="border-color:#4d88ff;color:#4d88ff;" data-idx="1" data-tips="品质服务,放心购物" >放心购</i>
<i class="goods-icons4 J-picon-tips" data-tips="本商品参与满减促销">每满119-20</i>		</div>
	</div>
</li>
<li class="gl-item" data-sku="59975470952" data-spu="13810867851" data-pid="59975470952">
	<div class="gl-i-wrap">
		<div class="p-img">
			<a target="_blank" title="【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】" href="//item.jd.com/59975470952.html" onclick="searchlog(1,59975470952,2,2,'','flagsClk=1094713996')">
				<img width="220" height="220" class="err-product" data-img="1" source-data-lazy-img="//img12.360buyimg.com/n7/jfs/t1/100706/25/17185/130140/5e8459f0Efbd3fdcf/379d9e03eea2a5d7.jpg" />
</a>			<div data-lease="" data-catid="12071" data-venid="84618" data-presale=""></div>
		</div>
		<div class="p-scroll">
			<span class="ps-prev">&lt;</span>
			<span class="ps-next">&gt;</span>
			<div class="ps-wrap">
				<ul class="ps-main">
					<li class="ps-item"><a href="javascript:;" class="curr" title="黑色"><img  data-presale="" data-sku="59975470952" data-img="1" data-lazy-img="//img12.360buyimg.com/n9/jfs/t1/100706/25/17185/130140/5e8459f0Efbd3fdcf/379d9e03eea2a5d7.jpg" class="err-product" width="25" height="25" /></a></li>
									</ul>
			</div>
		</div>
		<div class="p-price">
<strong class="J_59975470952" data-done="1"><em>¥</em><i>69.00</i></strong>		</div>
		<div class="p-name p-name-type-2">
			<a target="_blank" title="【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】" href="//item.jd.com/59975470952.html" onclick="searchlog(1,59975470952,2,1,'','flagsClk=1094713996')">
				<em>双肩包男士背包大容量时尚休闲商务旅行笔记本电脑包高中大学生<font class="skcolor_ljg">书包</font>男潮流USb充电包包65199 黑色</em>
				<i class="promo-words" id="J_AD_59975470952">【好店认证】【买一送“一”送钥匙包】【支持7天无理由退换货,赠送运费险,售后无忧】【支持货到付款】</i>
			</a>
		</div>
		<div class="p-commit">
			<strong><a id="J_comment_59975470952" target="_blank" href="//item.jd.com/59975470952.html#comment" onclick="searchlog(1,59975470952,2,3,'','flagsClk=1094713996')"></a></strong>
		</div>
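The page source above shows that every product sits in its own `<li class="gl-item" ...>` element, with the price in a `p-price` block and the name in a `p-name` block. One possible refinement, sketched here under that assumption, is to split the page into product blocks first and extract the price and name inside each block, so a product with a missing price cannot shift the pairing. The helper `parse_items` and the sample HTML are written for this example only.

```python
import re


def parse_items(html):
    """Hypothetical helper: return (price, name) pairs, one per complete product block."""
    items = []
    # Everything before the first <li class="gl-item" is not a product, hence [1:].
    for block in re.split(r'<li class="gl-item"', html)[1:]:
        price = re.search(r'<em>¥</em><i>(\d+\.\d\d)</i>', block)
        # re.S lets . match newlines, because the <em> sits on a later line than the div.
        name = re.search(r'<div class="p-name[^"]*">.*?<em>(.*?)</em>', block, re.S)
        if price and name:  # skip blocks where either piece is missing
            items.append((price.group(1), name.group(1)))
    return items


sample = '''
<li class="gl-item" data-sku="1">
  <div class="p-price"><strong><em>¥</em><i>49.00</i></strong></div>
  <div class="p-name p-name-type-2"><a href="#">
    <em>Hanging book bag</em></a></div>
</li>
<li class="gl-item" data-sku="2">
  <div class="p-name p-name-type-2"><a href="#"><em>No price yet</em></a></div>
</li>
<li class="gl-item" data-sku="3">
  <div class="p-price"><strong><em>¥</em><i>99.00</i></strong></div>
  <div class="p-name p-name-type-2"><a href="#"><em>Backpack</em></a></div>
</li>
'''

for price, name in parse_items(sample):
    print(price, name)
```

The second sample block has no price and is skipped, while the other two stay correctly paired.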

The most important thing in this section is that you need to learn regular expressions. Here is a brief explanation of regular expressions.
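The building blocks used in this article's patterns can be tried out on a short string. The sample text below is made up for illustration; the patterns themselves are standard features of Python's `re` module.

```python
import re

text = '<em>¥</em><i>28.80</i> <em>苹果 case</em>'

greedy = re.findall(r'<em>.*</em>', text)        # .* is greedy: it grabs as much as possible
lazy = re.findall(r'<em>.*?</em>', text)         # .*? is lazy: it stops at the first </em>
digits = re.findall(r'\d+\.\d\d', text)          # \d is a digit, \. is a literal dot
chinese = re.findall(r'[\u4e00-\u9fa5]+', text)  # a range covering common Chinese characters

print(greedy)
print(lazy)
print(digits)
print(chinese)
```

The greedy pattern returns the whole string as a single match, while the lazy one returns the two `<em>...</em>` pieces separately, which is why the crawler's patterns use `.*?`.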

Origin blog.csdn.net/Everly_/article/details/133139074