Crawling 2,000+ Condom Listings from Taobao with Python

 


Warning: this tutorial is for learning and exchange only. Do not use it for commercial profit; violators do so at their own risk! If anything here infringes the privacy or interests of any organization or group, please contact the author and it will be deleted!

First: A Review of the Taobao Login

We previously covered how to log in to Taobao with the requests library, and received a lot of feedback and questions from readers. I'm very pleased, and I apologize to those I didn't get around to replying to!

One note about that login function: the code itself is fine. If you hit an "applying for st code failed" error at login time, replace all of the request parameters in the _verify_password method.

Taobao login 2.0 brought one improvement: a cookie-serialization feature. Its purpose is to make crawling Taobao data easier, because if the same IP logs in to Taobao too frequently, it may trigger Taobao's risk-control mechanisms!
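The article's cookie-serialization code isn't shown here, but the idea can be sketched with the standard pickle module (the file name and helper names below are my own, not from the original project):

```python
import os
import pickle

import requests

COOKIE_FILE = "taobao_cookies.pkl"  # hypothetical path for the serialized cookies


def save_cookies(session, path=COOKIE_FILE):
    """Serialize the session's cookie jar to disk so later runs can reuse the login."""
    with open(path, "wb") as f:
        pickle.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session, path=COOKIE_FILE):
    """Restore a previously saved cookie jar; returns True if one was found."""
    if not os.path.exists(path):
        return False
    with open(path, "rb") as f:
        session.cookies = requests.utils.cookiejar_from_dict(pickle.load(f))
    return True
```

With this in place, a crawl run can call load_cookies first and only go through the full login flow when no saved cookies exist.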

In my own practical use, logging in to Taobao basically always succeeds; if it does not, replace the login parameters as described above!

Second: Crawling Taobao Product Data

This article explains how to crawl the data; analyzing the data comes in the next one. The reason for splitting them is that I ran into a great many problems while crawling Taobao, and I want to walk you through the crawling in detail, so out of consideration for length and for how much readers can absorb at once, it is split into two parts. The goal is the same as always: even a complete beginner should be able to follow!

The crawl works by calling Taobao's PC search interface, extracting the returned data, and then saving it as an Excel file!

A seemingly simple task, but it hides a lot of problems. Let's work through it step by step!

Third: Crawling a Single Page of Data

When we start writing a crawler project, we should break it into steps and proceed one at a time, and the first step is usually to try crawling a single page!

1. Find the URL that loads the data

Open taobao.com in the browser and log in, then open Chrome's DevTools, click the Network tab, and tick Preserve log. Type the product you want to search for into the search box.

This is the request for the first page. Inspecting the response, we find the product data is embedded inside the HTML page rather than returned as pure JSON.

2. Is there an interface that returns pure JSON?

Curious whether there is an interface that returns pure JSON, I clicked through to the next page (page two).

The response for page two turned out to be pure JSON! Comparing the two request URLs, we can look for the parameter that makes the server return JSON only.

The comparison shows that if the search URL carries the parameter ajax=true, the server returns JSON directly. So can we simply mimic that request and fetch the JSON straight away?

So I took the page-two request parameters and requested the JSON directly, but doing the same for page one produced an error:

Instead of JSON, the response is just a link. What is this link? Click it...
Ta-da: the slider CAPTCHA appears. Some readers will ask: can requests beat Taobao's slider? I asked several veteran crawler developers. The slider works by collecting response time, drag speed, timing, position, trajectory, retry count and more, and then judging whether the drag was done by a human. On top of that, the algorithm changes frequently, so I chose to give up on this route!

3. Use the web-page interface

So we go with the page-one style of request (a URL without the ajax=true parameter, which returns a full web page) and then extract the data from it!

This way we can crawl Taobao's page HTML.
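As a sketch of what such a page request might look like: the endpoint and the 44-items-per-page offset step are assumptions based on how Taobao's PC search URL behaved at the time, and fetch_search_page expects a session that already carries login cookies:

```python
import requests

SEARCH_URL = "https://s.taobao.com/search"  # Taobao's PC search endpoint
ITEMS_PER_PAGE = 44  # one result page holds 44 items, so the "s" offset steps by 44


def build_search_params(keyword, page):
    """Query parameters for one search page; note there is no ajax=true,
    so Taobao returns the full HTML page with the data embedded."""
    return {"q": keyword, "s": (page - 1) * ITEMS_PER_PAGE}


def fetch_search_page(session, keyword, page, timeout=10):
    """Fetch one search-result page as HTML using a logged-in session."""
    resp = session.get(SEARCH_URL, params=build_search_params(keyword, page),
                       timeout=timeout)
    resp.raise_for_status()
    return resp.text
```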

Fourth: Extracting Product Attributes

After fetching the page, our job is to extract the data: first pull the JSON out of the HTML, then parse the JSON to get the attributes we want.

1. Extract the product JSON from the page

Since we chose to request the whole page, we need to know where in the page the data is embedded and how to extract it.

After some searching and comparing, I found that the JS variable g_page_config in the returned page is exactly the product information we want, and it is in JSON format!
Then a regular expression can pull the data out:

import re

# capture the JSON assigned to g_page_config, including its closing braces
goods_match = re.search(r'g_page_config = (.*?}});', response.text)

2. Get the price and other product info

To extract from the JSON, we first need to understand its structure; we can paste the data into a JSON plugin or an online parser to inspect it.
Once we understand the JSON structure, we can write a method to extract the attributes we want.

Fifth: Saving to Excel

There are many libraries for working with Excel; someone online has compared and reviewed them specifically, if you're interested: https://dwz.cn/M6D8AQnq

I chose pandas to handle Excel, because pandas is convenient to use and one of the most common data-analysis libraries!

1. Install the libraries

pandas actually relies on several other libraries for its Excel support, so we need to install a few of them:

pip install xlrd
pip install openpyxl
pip install numpy
pip install pandas

2. Save to Excel

One gotcha here: pandas has no append mode for Excel, so you have to read the existing data first, append the new rows, and then write everything back to the Excel file!
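A sketch of that read-append-rewrite dance (the helper names are mine; note that newer pandas versions removed DataFrame.append, so pd.concat is used instead):

```python
import os

import pandas as pd


def merge_frames(old_df, new_df):
    """The 'append' step: combine existing data with new rows."""
    if old_df is None or old_df.empty:
        return new_df.reset_index(drop=True)
    return pd.concat([old_df, new_df], ignore_index=True)


def append_to_excel(rows, path):
    """Read whatever is already in the Excel file, append `rows`
    (a list of dicts), then rewrite the whole file."""
    new_df = pd.DataFrame(rows)
    old_df = pd.read_excel(path) if os.path.exists(path) else None
    merge_frames(old_df, new_df).to_excel(path, index=False)
```

Each crawled page can then simply call append_to_excel(goods, "taobao.xlsx") and the file grows run by run.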

Checking the output file confirms the data was saved as expected.

Sixth: Batch Crawling

Once the full single-pass flow (crawl, extract, save) is complete, we can call it in a loop to crawl in batches.
The delay in seconds used here comes from my own testing, going from 3s and 5s up to 10s or more; request too frequently and the CAPTCHA starts to appear!
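The batch loop can be sketched like this; fetch_page and save_rows are placeholders for the single-page crawl and Excel-save steps described above, and the delay bounds follow the 10s-plus finding:

```python
import random
import time


def crawl_all(keyword, pages, fetch_page, save_rows, min_delay=10, max_delay=15):
    """Run the whole flow (crawl -> extract -> save) page by page.
    `fetch_page(keyword, page)` returns the extracted rows for one page;
    `save_rows(rows)` persists them. A randomized pause between pages keeps
    the request rate low enough to reduce slider-CAPTCHA triggers."""
    total = 0
    for page in range(1, pages + 1):
        rows = fetch_page(keyword, page)
        save_rows(rows)
        total += len(rows)
        if page < pages:  # no need to sleep after the last page
            time.sleep(random.uniform(min_delay, max_delay))
    return total
```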
Crawling in several batches, I collected more than two thousand records.

Seventh: Problems Encountered While Crawling Taobao

Crawling Taobao ran into a great many problems; let me list them for you one by one:

1. Login problems

Q: What should I do if applying for the st code fails?
A: Replace all of the request parameters in the _verify_password method.

As long as the parameters are right, login basically always succeeds!

2. Proxy pool

To keep my own IP from being banned, I used a proxy pool. Crawling Taobao needs high-quality IPs; I tried many free IPs from the web and almost none of them could crawl anything.

But there is one quite good IP site, 站大爷 (zdaye): http://ip.zdaye.com/dayProxy.html. It publishes a fresh batch of IPs every hour, and quite a few of the IPs I tried there could still crawl Taobao.
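A simple way to rotate through such a pool with requests (a sketch; the helper names are mine, and real code would also need to drop proxies that stop working):

```python
import itertools

import requests


def make_proxy_cycle(proxy_list):
    """Round-robin iterator over a list of 'ip:port' strings from a proxy site."""
    return itertools.cycle(proxy_list)


def fetch_with_proxy(url, proxy_cycle, timeout=10):
    """Send one request through the next proxy in the pool."""
    proxy = next(proxy_cycle)
    proxies = {"http": f"http://{proxy}", "https": f"http://{proxy}"}
    return requests.get(url, proxies=proxies, timeout=timeout)
```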

3. Retry mechanism

To guard against ordinary request failures, I added a retry mechanism to the crawl method!
You need to install the retry library:

pip install retry
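The retry package is used as a decorator (from retry import retry, then @retry(tries=3, delay=2) on the request function). As a self-contained illustration of what that decorator does, here is a minimal stdlib equivalent:

```python
import functools
import time


def retry(tries=3, delay=1, exceptions=(Exception,)):
    """Minimal stand-in for the `retry` package's decorator: re-run the
    function up to `tries` times, sleeping `delay` seconds between attempts,
    and re-raise the last error if every attempt fails."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise
                    time.sleep(delay)
        return wrapper
    return decorator
```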

4. The slider CAPTCHA

Even with everything above in place, the slider still shows up now and then; in my many tests, it most often appeared after roughly 20 to 40 crawls.
When the slider appears, I simply wait half an hour and then continue crawling, because the slider cannot be solved with the requests library; later on, when we study other frameworks such as Selenium, we'll see whether they can handle it!

5. The crawler is still a work in progress

As it stands, the crawler is far from perfect and can only be called a semi-finished product. There is plenty of room for improvement: automatic proxy-pool maintenance, multi-threaded segmented crawling, solving the slider, and so on. We will improve it together, bit by bit, until it grows into a fully-fledged crawler!


Source: www.cnblogs.com/qingdeng123/p/11567522.html