Scraping Taobao product information

Introduction: Taobao, Alibaba Group's online shopping site in China, was founded by Jack Ma on May 10, 2003; it is a C2C shopping site serving consumers in mainland China, Hong Kong, Macao, Taiwan, and Malaysia. Taobao has a number of anti-scraping measures, which makes things a bit tricky. Based on the information I collected and analyzed, here is how to get started with a Taobao crawler.
For a site with anti-scraping measures this strict, I personally think the best way to access information normally is to simulate a login and obtain cookies; otherwise it is easy to get stuck on things like slider verification.

Part 1: Logging in to Taobao

1. Open the Taobao login page
We can use the account's phone number as the login method.
At the code level, the Taobao login consists of four steps:
1. After the username is entered, the browser sends a POST request to Taobao (taobao.com) to determine whether slider verification is required.

Request URL for the slider-verification check: `https://login.taobao.com/member/request_nick_check.do?_input_charset=utf-8`
2. After the user enters the password, the browser sends a POST request to Taobao (taobao.com) to verify whether the username and password are correct; if they are, a token is returned.

Request URL for verifying the username and password: https://login.taobao.com/member/login.jhtml
3. The browser takes the token to Alibaba (alibaba.com) and exchanges it for an st code.

4. With the st code, the browser obtains cookies, and the login succeeds.

Request URL for obtaining the cookies: https://login.taobao.com/member/vst.htm?st={}

At this point you might wonder: why does the Taobao login need to be this complicated? Couldn't we just log in directly at taobao.com? Why verify the username and password at Taobao first and then go to alibaba.com to exchange for an st code?
Every company grows from small to large, and the architecture of a young internet company evolves gradually; I suspect the earliest Taobao login was nowhere near this complicated. But from Taobao's founding in 2003 up to 2020, as Alibaba has grown, its subsidiaries have multiplied, and many systems have had to be split apart while remaining interconnected. For example, after logging in to Taobao a user may also want to log in to Tmall (Taobao and Tmall have different top-level domains, so they cannot share cookies), yet both belong to Alibaba Group; logging in again and again would be annoying and hurt the user experience. Single sign-on appeared to solve this problem.

Single Sign On (SSO) is currently one of the more popular solutions for integrating enterprise systems. SSO means that across multiple application systems, a user only needs to log in once to access all of the mutually trusting systems. (Baidu Baike)

(SSO flow diagram, collected from online sources.)
Once we know the flow, the request URLs, and the parameters, we can simulate the requests in code.

Code implementation

1. Check whether slider verification is required after the account is entered

    def _user_check(self):
        # Ask Taobao whether this account triggers slider verification
        # (self.user_check_url is the request_nick_check.do URL described above)
        data = {
            'username': self.username,
            'ua': self.ua
        }
        try:
            response = self.session.post(self.user_check_url, data=data, timeout=self.timeout)
            response.raise_for_status()
        except Exception:
            print('This account requires slider verification')

2. Verify the username and password; on success, return the st-code request URL

    def _verify_password(self):
        verify_password_headers = {
            'Connection': 'keep-alive',
            'Cache-Control': 'max-age=0',
            'Origin': 'https://login.taobao.com',
            'Upgrade-Insecure-Requests': '1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36',
            'Content-Type': 'application/x-www-form-urlencoded',
            'Referer': 'https://login.taobao.com/member/login.jhtml?from=taobaoindex&f=top&style=&sub=true&redirect_url=https%3A%2F%2Fi.taobao.com%2Fmy_taobao.htm',
        }
        # Form data copied from the browser when logging in to taobao.com
        verify_password_data = {
            'TPL_username': self.username,
            'ncoToken': '507aa5182c96f27aaad2e6571a3425846ae5745f',
            'slideCodeShow': 'false',
            'useMobile': 'false',
            'lang': 'zh_CN',
            'loginsite': 0,
            'newlogin': 0,
            'TPL_redirect_url': 'https://www.taobao.com/',
            'from': 'tb',
            'fc': 'default',
            'style': 'default',
            'keyLogin': 'false',
            'qrLogin': 'true',
            'newMini': 'false',
            'newMini2': 'false',
            'loginType': '3',
            'gvfdcname': '10',
            'gvfdcre': '68747470733A2F2F6C6F67696E2E74616F62616F2E636F6D2F6D656D6265722F6C6F676F75742E6A68746D6C3F73706D3D613231626F2E323031372E3735343839343433372E372E3561663931316439456B5773397A26663D746F70266F75743D7472756526726564697265637455524C3D68747470732533412532462532467777772E74616F62616F2E636F6D253246',
            'TPL_password_2': self.TPL_password2,
            'loginASR': '1',
            'loginASRSuc': '1',
            'oslanguage': 'zh-CN',
            'sr': '1536*864',
            'osVer': '',
            'naviVer': 'chrome|75.03770142',
            'osACN': 'Mozilla',
            'osAV': '5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
            'osPF': 'Win32',
            'appkey': '00000000',
            'mobileLoginLink': 'https://login.taobao.com/member/login.jhtml?redirectURL=https://www.taobao.com/&useMobile=true',
            'showAssistantLink': '',
            'um_token': 'TF0ABBC1D4E5BBF32528851CD189E64A00635DB82C599E278C3A8475900',
            'ua': self.ua
        }
        try:
            response = self.session.post(self.verify_password_url, headers=verify_password_headers, data=verify_password_data,
                                          timeout=self.timeout)
            response.raise_for_status()
        except Exception as e:
            print('Request to verify the username and password failed:')
            raise e
        # Extract the st-code request URL from the returned page
        apply_st_url_match = re.search(r'<script src="(.*?)"></script>', response.text)
        # Return it if present
        if apply_st_url_match:
            print('Username and password verified; st-code request URL: {}'.format(apply_st_url_match.group(1)))
            return apply_st_url_match.group(1)
        else:
            raise RuntimeError('Username/password verification failed! response: {}'.format(response.text))

其中"verify_password_data"的数据来源为:
3. Request the st code; return it if successful

    def _apply_st(self):
        # Verify the password first to obtain the st-code request URL
        apply_st_url = self._verify_password()
        try:
            response = self.session.get(apply_st_url)
            response.raise_for_status()
        except Exception as e:
            print('Request for the st code failed:')
            raise e
        st_match = re.search(r'"data":{"st":"(.*?)"}', response.text)
        if st_match:
            print('Got the st code: {}'.format(st_match.group(1)))
            return st_match.group(1)
        else:
            print('Failed to get the st code! response: {}'.format(response.text))

4. Log in with the st code

    def login(self):
        """
        使用st码登录
        :return:
        """
        # 加载cookies文件
        if self._load_cookies():
            return True
        # 判断是否需要滑块验证
        self._user_check()
        st = self._apply_st()
        headers = {
            'Host': 'login.taobao.com',
            'Connection': 'Keep-Alive',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        }
        try:
            response = self.session.get(self.vst_url.format(st), headers=headers)
            response.raise_for_status()
        except Exception as e:
            print('st-code login request failed:')
            raise e
        # Login succeeded; extract the redirect URL to the user's Taobao home page
        my_taobao_match = re.search(r'top.location.href = "(.*?)"', response.text)
        if my_taobao_match:
            print('Logged in to Taobao; redirect URL: {}'.format(my_taobao_match.group(1)))
            self._serialization_cookies()
            return True
        else:
            raise RuntimeError('Login failed! response: {}'.format(response.text))

5. Get the Taobao nickname

    def get_taobao_nick_name(self):
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        }
        try:
            response = self.session.get(self.my_taobao_url, headers=headers)
            response.raise_for_status()
        except Exception as e:
            print('Request for the Taobao home page failed:')
            raise e
        # Extract the Taobao nickname
        nick_name_match = re.search(r'<input id="mtb-nickname" type="hidden" value="(.*?)"/>', response.text)
        if nick_name_match:
            print('Logged in to Taobao; your nickname is: {}'.format(nick_name_match.group(1)))
            return nick_name_match.group(1)
        else:
            print('Failed to get the Taobao nickname! response: {}'.format(response.text))

6. Serialize the cookies.
Every time the program finishes running, the login cookies are gone, which means the next run has to log in all over again; that is too repetitive, whereas a browser can keep cookie information. Serializing the cookies solves this problem.

Serialization is the process of converting an object's state into a form that can be stored or transmitted. (Baidu Baike)

Simply put, serialization persists an object. Objects live in memory; when the program finishes, that memory is released and all objects and variables are cleared. Serialization lets us save them to a file, and even after the program exits, the next run can read the file back into memory and rebuild the objects; that reverse process is called deserialization.

In other words, after the first successful login we save the cookies to a local file; on later runs we read them back and there is no need to log in again. If the cookies have expired, we delete the local cookies file, log in again, and write new cookies for future use.

    def _load_cookies(self):
        # 1. Check whether the serialized cookies file exists
        if not os.path.exists(COOKIES_FILE_PATH):
            print('Cookies file does not exist')
            return False
        # 2. Load the cookies
        self.session.cookies = self._deserialization_cookies()
        # 3. Check whether the cookies have expired
        try:
            self.get_taobao_nick_name()
        except Exception:
            os.remove(COOKIES_FILE_PATH)
            print('Cookies have expired; deleted the cookies file!')
            return False
        print('Loaded Taobao login cookies successfully!')
        return True
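
The `_serialization_cookies` and `_deserialization_cookies` helpers referenced above are not shown in the post. A minimal sketch of what they might look like, assuming `json` and `requests` are imported at module level and `COOKIES_FILE_PATH` points to a writable JSON file:

    def _serialization_cookies(self):
        # Illustrative sketch: persist the session cookies as a JSON file
        cookies_dict = requests.utils.dict_from_cookiejar(self.session.cookies)
        with open(COOKIES_FILE_PATH, 'w', encoding='utf-8') as f:
            json.dump(cookies_dict, f)

    def _deserialization_cookies(self):
        # Illustrative sketch: rebuild a cookiejar from the saved JSON file
        with open(COOKIES_FILE_PATH, 'r', encoding='utf-8') as f:
            cookies_dict = json.load(f)
        return requests.utils.cookiejar_from_dict(cookies_dict)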


Part 2: Scraping Taobao product information

Open taobao.com in the browser and log in, open Chrome's DevTools, click the Network tab, and check Preserve log; then type the name of the product you want to search for into the search box.
If Preserve log is not checked, the request information will be lost.

  1. First, analyze the page.
    Copy the response data of the search request; after a simple cleanup, paste it into json.cn and the hierarchy of the data becomes clear, which makes JSON parsing and data extraction easier.

  2. Parse the JSON data and extract the title, price, seller location, seller nickname, and comment URL

    def _get_goods_info(self, goods_str):
        # Parse the g_page_config JSON and pull out the fields we need
        goods_json = json.loads(goods_str)
        goods_items = goods_json['mods']['itemlist']['data']['auctions']
        goods_list = []
        for goods_item in goods_items:
            goods = {'title': goods_item['raw_title'],
                     'price': goods_item['view_price'],
                     'location': goods_item['item_loc'],
                     'nick': goods_item['nick'],
                     'comment_url': goods_item['comment_url']}
            goods_list.append(goods)

        return goods_list
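
For reference, the part of `g_page_config` that this method walks has roughly the following shape; the keys are exactly the ones used above, while the values here are placeholders:

    goods_json = {
        'mods': {
            'itemlist': {
                'data': {
                    'auctions': [
                        {
                            'raw_title': '...',    # product title
                            'view_price': '...',   # price
                            'item_loc': '...',     # seller location
                            'nick': '...',         # seller nickname
                            'comment_url': '...',  # link to the review page
                            # many other fields omitted
                        },
                    ]
                }
            }
        }
    }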


Grab the URLs of the requests that carry the product data for pages 1, 2, and 3.
We can see that the first page's URL does not contain ajax=true while pages 2 and 3 do, and that some of the parameters change from page to page, but in a regular pattern (s increases by 44 per page).

  1. With a bit of simple deduction we can construct the URLs for the following pages:
    def spider_goods(self, page):
        s = page * 44
        # Search URL: the q parameter is the search keyword, s = page * 44 is the start index of the results
        search_url = f'https://s.taobao.com/search?initiative_id=tbindexz_20170306&ie=utf8&spm=a21bo.2017.201856-taobao-item.2&sourceId=tb.index&search_type=item&ssid=s5-e&commend=all&imgfile=&q={self.q}&suggest=history_1&_input_charset=utf-8&wq=biyunt&suggest_query=biyunt&source=suggest&bcoffset=4&p4ppushleft=%2C48&s={s}&data-key=s&data-value={s + 44}'

        # Proxy IP
        proxies = {'http': '123.207.218.215:1080'}
        # Request headers
        headers = {
            'referer': 'https://www.taobao.com/',
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36'
        }
        response = req_session.get(search_url, headers=headers, proxies=proxies,
                                   verify=False, timeout=self.timeout)
        # The product data is embedded in the page as the g_page_config variable
        goods_match = re.search(r'g_page_config = (.*?)}};', response.text)
        # No match means the page did not contain the product data
        if not goods_match:
            print('Failed to extract the data from the page!')
            print(response.text)
            raise RuntimeError('g_page_config not found in the page')
        goods_str = goods_match.group(1) + '}}'
        goods_list = self._get_goods_info(goods_str)
        self._save_excel(goods_list)
        print(goods_str)
  2. Write the JSON data to an Excel file
    def _save_excel(self, goods_list):
        """
        Write the JSON data to an Excel file
        :param goods_list: product data
        :return:
        """
        # pandas has no append mode for Excel, so read the existing file first and then write everything back
        if os.path.exists(GOODS_EXCEL_PATH):
            df = pd.read_excel(GOODS_EXCEL_PATH)
            df = df.append(goods_list)
        else:
            df = pd.DataFrame(goods_list)

        writer = pd.ExcelWriter(GOODS_EXCEL_PATH)
        # The columns parameter fixes the column order in the generated Excel file
        df.to_excel(excel_writer=writer, columns=['title', 'price', 'location', 'nick', 'comment_url'], index=False,
                    encoding='utf-8', sheet_name='Sheet')
        writer.save()
        writer.close()
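
Putting the pieces together, a usage sketch might look like the following; the spider class name and its constructor arguments are assumptions for illustration (the post does not show the class definition), while TaoBaoLogin and req_session come from the login code in Part 1:

    if __name__ == '__main__':
        # Log in first so req_session carries valid cookies
        tbl = TaoBaoLogin(req_session)
        tbl.login()
        # Scrape the first three result pages for a keyword (class name and arguments are illustrative)
        spider = TaobaoGoodsSpider(q='手机', timeout=10)
        for page in range(3):
            spider.spider_goods(page)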

(A screenshot of the resulting Excel file is omitted here.)

Part 3: Scraping reviews

  1. Open any product's detail page.


  2. Find the request that contains the review data.
  3. Simulate the request.
    From observation, the request URL is https://rate.tmall.com/list_detail_rate.htm, and everything after the "?" in the URL is the query params.
    After analysis, the params that must be carried in order to get the review data are:
"itemId"        #商品id  
"sellerId"
"currentPage"   #页码
"callback"

The complete code is as follows:

# # coding=gbk
# Scrape reviews of a Longines (浪琴) watch listing
import requests
import json,random
from Taobao_Log import TaoBaoLogin
requests.packages.urllib3.disable_warnings()
req_session = requests.Session()
tbl = TaoBaoLogin(req_session)
tbl.login()
proxies = [
    {'https': 'https://'+'151.253.165.70:8080'}
    # {'http':'http://'+'47.112.200.175:8000'}
]
proxies = random.choice(proxies)
url = "https://rate.tmall.com/list_detail_rate.htm"
header={
    "cookie":"cna=EYnEFeatJWUCAbfhIw4Sd0GO; x=__ll%3D-1%26_ato%3D0; hng=CN%7Czh-CN%7CCNY%7C156; uc1=cookie14=UoTaHYecARKhrA%3D%3D; uc3=vt3=F8dBy32hRyZzP%2FF7mzQ%3D&lg2=U%2BGCWk%2F75gdr5Q%3D%3D&nk2=1DsN4FjjwTp04g%3D%3D&id2=UondHPobpDVKHQ%3D%3D; t=ad1fbf51ece233cf3cf73d97af1b6a71; tracknick=%5Cu4F0F%5Cu6625%5Cu7EA22013; lid=%E4%BC%8F%E6%98%A5%E7%BA%A22013; uc4=nk4=0%401up5I07xsWKbOPxFt%2BwuLaZ8XIpO&id4=0%40UOE3EhLY%2FlTwLmADBuTfmfBbGpHG; lgc=%5Cu4F0F%5Cu6625%5Cu7EA22013; enc=ieSqdE6T%2Fa5hYS%2FmKINH0mnUFINK5Fm1ZKC0431E%2BTA9eVjdMzX9GriCY%2FI2HzyyntvFQt66JXyZslcaz0kXgg%3D%3D; _tb_token_=536fb5e55481b; cookie2=157aab0a58189205dd5030a17d89ad52; _m_h5_tk=150df19a222f0e9b600697737515f233_1565931936244; _m_h5_tk_enc=909fba72db21ef8ca51c389f65d5446c; otherx=e%3D1%26p%3D*%26s%3D0%26c%3D0%26f%3D0%26g%3D0%26t%3D0; l=cBa4gFrRqYHNUtVvBOfiquI8a17O4IJ51sPzw4_G2ICP9B5DeMDOWZezto8kCnGVL6mpR3RhSKO4BYTKIPaTlZXRFJXn9MpO.; isg=BI6ORhr9X6-NrOuY33d_XmZFy2SQp1Ju1qe4XLjXJRHsGyp1IJ9IG0kdUwfSA0oh",
    "referer":"https://detail.tmall.com/item.htm",
    "User-Agent":"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36",
}
params = {                               # required params
    "itemId": "602110852465",            # product id
    "sellerId": "3439921580",            # seller id
    "currentPage": "2",                  # page number
    "callback": "jsonp548",
    }
# Decode the response and strip the jsonp wrapper characters that break json.loads (\n\rjsonp548(...))
req = req_session.get(url=url, params=params, headers=header, proxies=proxies).content.decode('utf-8')[12:-1]
req = "{" + req
print(req)
result=json.loads(req)
print(result)
# Print the review text of every review on this page
for rate in result['rateDetail']['rateList']:
    comment = rate['rateContent']
    print(comment)

(A screenshot of the printed reviews is omitted here.)

Part 4: Enhanced version

With this crawler we can already get the product list pages and the reviews we want. But how would we scrape the reviews of every product on a list page? A sketch of one approach follows.
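
Assuming we have already collected each product's itemId and sellerId (for example while parsing the search results or from each product's detail page; the helper below and the goods_ids list are illustrative assumptions), a minimal sketch that reuses the url, header, proxies, and req_session objects from the Part 3 script might look like this:

    def spider_all_comments(goods_ids, max_pages=3):
        # Illustrative sketch: fetch reviews for every (itemId, sellerId) pair
        all_comments = []
        for item_id, seller_id in goods_ids:
            for page in range(1, max_pages + 1):
                params = {
                    "itemId": item_id,
                    "sellerId": seller_id,
                    "currentPage": str(page),
                    "callback": "jsonp548",
                }
                resp = req_session.get(url=url, params=params, headers=header, proxies=proxies)
                body = "{" + resp.content.decode('utf-8')[12:-1]
                try:
                    result = json.loads(body)
                except json.JSONDecodeError:
                    # Probably blocked by anti-scraping; skip this page (or rotate proxy/cookies)
                    continue
                for rate in result['rateDetail']['rateList']:
                    all_comments.append(rate['rateContent'])
        return all_comments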
While writing and debugging the code, however, we find that the very same code starts producing errors after being repeated only once or twice:
From the error output we can see that the failure is at result = json.loads(req).
The root cause is that Taobao's anti-scraping is very strict: when it detects that the same Taobao account or IP address is sending requests too frequently, it asks for a fresh login or stops returning the expected page, so we get an unexpected response and the JSON parsing fails.
Possible solutions:

  1. Use an IP proxy pool: obtain multiple usable IPs and keep switching between them.

  2. Create multiple accounts, obtain multiple sets of cookies, and keep switching between them.
    Regarding the IP proxy pool:

  3. Find an IP proxy pool framework.
    We can find the open-source project we need on GitHub; with more than 9000 stars, it is a solid project.

  4. Long story short: following the project's README.md, set up the required libraries and settings, and install the Redis database along with the visualization tool Redis Desktop Manager.

    Start the Redis database.

    Open the visualization tool.

  5. Try fetching IPs.
    Entries ending with "success" indicate usable IPs.
    They can also be inspected in the database visualization tool.
    In addition, we can query the fetched proxies through the API.
    API address: http://127.0.0.1:5010

    The API provides endpoints for, among other things, randomly getting a usable IP and checking the number of available IPs; see the project's README for the full API description.
    To use it in the crawler code, the API can be wrapped directly as functions:

import requests

def get_proxy():
    return requests.get("http://127.0.0.1:5010/get/").json()

def delete_proxy(proxy):
    requests.get("http://127.0.0.1:5010/delete/?proxy={}".format(proxy))

# your spider code

def getHtml():
    # ....
    retry_count = 5
    proxy = get_proxy().get("proxy")
    while retry_count > 0:
        try:
            html = requests.get('https://www.example.com', proxies={"http": "http://{}".format(proxy)})
            # access the page through the proxy
            return html
        except Exception:
            retry_count -= 1
    # failed 5 times; delete this proxy from the pool
    delete_proxy(proxy)
    return None

We can use url = 'http://httpbin.org/ip' to actually test whether an IP from the proxy pool works.
The test was successful.
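
For reference, a minimal sketch of that check, reusing get_proxy() from the snippet above, might be:

    import requests

    def check_proxy(proxy):
        # Ask httpbin which IP it sees; if the request succeeds through the proxy, the proxy is usable
        resp = requests.get('http://httpbin.org/ip',
                            proxies={'http': 'http://{}'.format(proxy)},
                            timeout=10)
        print(resp.json())

    check_proxy(get_proxy().get('proxy'))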

To sum up, getting usable IPs from the IP proxy pool is now basically solved.

Applying the IP proxy pool to counter anti-scraping
While running the code, another problem appeared (an error about exceeding the maximum number of HTTP connections):
Solution:

  1. Turn off any VPN / "scientific internet access" software such as Shadowsocks.
  2. Add the following code to the program.

The number of HTTP connections exceeds the limit: connections are Keep-Alive by default, so too many connections accumulate and the server can no longer handle new ones. Another possible cause is a global proxy unintentionally left enabled after installing Shadowsocks. It can also happen when requests.get is called too frequently while collecting data or using an API.

import os
# Tell requests to bypass the proxy for this host (replace with the domain you are actually requesting)
os.environ['NO_PROXY'] = 'stackoverflow.com'

(A screenshot of the review results for an individual product is omitted here.)


Origin blog.csdn.net/qq_43544005/article/details/104357920