Scraping Taobao Zhitongche Data with Python (1)

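I recently helped a friend with a small project to scrape data from Taobao Zhitongche. Python seemed well suited to writing crawlers, so I decided to build the program in Python.
First comes the login step. Taobao's login verification is complex, so simply typing the account and password on the command line will not work. After some research I found that Selenium, a browser automation and testing framework, can handle this, so I decided to implement login with it.
Start by downloading a clean copy of the Firefox browser and placing it in the program's main directory, then open the browser from Python: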

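import os
import time

from selenium import webdriver
from selenium.webdriver.firefox.firefox_binary import FirefoxBinary
from selenium.webdriver.firefox.firefox_profile import FirefoxProfile

def openbrowser_login():
    # Uses the Selenium 2/3-era API; newer Selenium releases configure the
    # binary and profile through an Options object instead.
    binary = FirefoxBinary(os.getcwd() + '/Firefox/Firefox.exe')
    profile = FirefoxProfile()
    # Turn off the disk and offline caches.
    profile.set_preference("browser.cache.disk.enable", False)
    profile.set_preference("browser.cache.offline.enable", False)
    driver = webdriver.Firefox(firefox_binary=binary, firefox_profile=profile)
    driver.get('http://zhitongche.taobao.com/')
    # Poll every 2 seconds; a second window handle is taken as login success.
    while True:
        if len(driver.window_handles) > 1:
            print('Page redirect detected!')
            driver.switch_to.window(driver.window_handles[1])
            time.sleep(3)
            driver.get(driver.current_url)  # reload the page in the new window
            time.sleep(5)
            break
        else:
            time.sleep(2)
    # Flatten the cookies into one "name=value;name=value" string.
    cookie = [item["name"] + "=" + item["value"] for item in driver.get_cookies()]
    cookiestr = ';'.join(cookie)
    try:
        driver.quit()
    except Exception:
        pass  # ignore errors while closing the browser
    return cookiestr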

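The approach is to locate the Firefox executable under the working directory and then open the Taobao Zhitongche login page in that browser. The program checks the browser every two seconds; once it sees that an extra tab has been opened, it treats the login as successful, collects the page's cookies, and returns them as a single string. A few settings are also applied when the browser is launched: profile is the browser's preferences profile, used here to turn off browser caching.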
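
The cookie-joining step at the end of openbrowser_login can be pulled out into a small pure function, which makes it easy to check without launching a browser. The sketch below assumes cookies shaped like the dictionaries returned by Selenium's driver.get_cookies() (each with at least "name" and "value" keys). The helper name cookies_to_header and the sample values are made up for illustration, and it joins with "; ", the separator given in RFC 6265, where the function above uses a bare ";".

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into one Cookie header value."""
    # Each cookie becomes "name=value"; pairs are separated by "; ".
    return "; ".join(f"{c['name']}={c['value']}" for c in cookies)


# Sample data only; real cookies would come from driver.get_cookies().
sample = [
    {"name": "t", "value": "abc123", "domain": ".taobao.com"},
    {"name": "cookie2", "value": "xyz", "domain": ".taobao.com"},
]
print(cookies_to_header(sample))
```

Extra keys such as "domain" are simply ignored, so the dictionaries can be passed through unchanged.
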
Next is the function that checks whether the login succeeded:

import urllib.request

def check_login(cookiestr):
    print('Starting login check!')
    url = 'https://i.taobao.com/my_taobao.htm'
    headers = {
        'Host': 'i.taobao.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'Accept-Language': 'zh-CN,zh;q=0.8,en-US;q=0.5,en;q=0.3',
        # 'Accept-Encoding': 'gzip, deflate',  # left out: urllib does not decompress responses
        'Referer': 'https://www.taobao.com',
        'Content-Type': 'application/x-www-form-urlencoded',
        'Connection': 'Keep-Alive',
        'Cookie': cookiestr,
        'Cache-Control': 'max-age=0',
    }
    request = urllib.request.Request(url, headers=headers)
    try:
        response = urllib.request.urlopen(request)
        # If the final URL is unchanged, we were not redirected elsewhere,
        # so the cookie is treated as a valid login.
        if response.geturl() == url:
            print('Login check passed!')
            return True
    except Exception as e:
        print(e)
    print('Login check failed! Please log in again!')
    return False

Then comes the Taobao Zhitongche permission check. If it passes, the cookie is saved to a file so it can be reused next time:

import json
import urllib.parse
import urllib.request

def check_subway(cookiestr):
    print('Starting Taobao Zhitongche check!')
    url = 'http://subway.simba.taobao.com/bpenv/getLoginUserInfo.htm'
    headers = {
        'Host': 'subway.simba.taobao.com',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64; rv:43.0) Gecko/20100101 Firefox/43.0',
        'Accept': 'application/json, text/javascript, */*; q=0.01',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'Connection': 'Keep-Alive',
        'Cookie': cookiestr,
        'Origin': 'http://subway.simba.taobao.com',
        'Cache-Control': 'max-age=0',
        'X-Requested-With': 'XMLHttpRequest',
    }
    request = urllib.request.Request(url, headers=headers)
    data = {'_referer': '/tools/insight/index'}
    # Passing data makes urlopen send a POST request.
    postdata = urllib.parse.urlencode(data).encode('utf-8')
    try:
        response = urllib.request.urlopen(request, data=postdata)
        string = response.read().decode()
        parse = json.loads(string)
        if parse['code'] == '200':
            print('Taobao Zhitongche check passed! You are logged in as <' + parse['result']['nickName'] + '>')
            # Save the cookie string so the next run can skip the browser login.
            with open('cookie', 'wt') as fp:
                fp.write(cookiestr)
            print('Login cookie saved!')
            return parse['result']['token']
    except Exception as e:
        print(e)
    print('Taobao Zhitongche check failed! Please log in again!')
    return False
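
The response-handling part of check_subway can also be tried out offline. The following is a minimal sketch, not the author's code: the helper name extract_token and the sample payloads are hypothetical, shaped only after the fields the function above reads (code, result.nickName, result.token). It returns None on failure rather than False, so the return value is either a token or nothing.

```python
import json


def extract_token(body):
    """Return the token from a getLoginUserInfo-style JSON body, or None."""
    try:
        parsed = json.loads(body)
    except ValueError:
        return None  # body is not JSON, e.g. an HTML login page
    if not isinstance(parsed, dict) or parsed.get("code") != "200":
        return None
    result = parsed.get("result")
    if not isinstance(result, dict):
        return None
    return result.get("token")


# Hypothetical payloads for illustration only.
ok = '{"code": "200", "result": {"nickName": "demo", "token": "tok-1"}}'
denied = '{"code": "601"}'
print(extract_token(ok))
print(extract_token(denied))
print(extract_token("<html>login</html>"))
```

Keeping the parsing separate from the network call means the permission logic can be checked against saved responses without sending requests to Taobao.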

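In the main routine, the program tries to load the cookie file first. If the cookie has expired or there is no cookie file, it opens the browser so the user can log in: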

import os

# Main routine
if os.path.exists('cookie'):
    print('Cookie file found! Logging in with the saved cookie!')
    with open('cookie', 'r') as fp:
        cookiestr = fp.read()
else:
    cookiestr = openbrowser_login()
# Keep retrying until both the login check and the Zhitongche check pass.
while True:
    if check_login(cookiestr):
        token = check_subway(cookiestr)
        if token is not False:
            break
    cookiestr = openbrowser_login()
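
The loop above retries forever if login keeps failing. As a hedged alternative, the sketch below shows the same "saved cookie first, browser login as fallback" flow with a bounded number of attempts. The function obtain_token, its parameters, and the stand-in callables are all hypothetical; in the real program the callables would wrap the cookie file read, openbrowser_login, and the two checks.

```python
def obtain_token(load_saved, browser_login, validate, max_attempts=3):
    """Try a saved cookie first, then fall back to fresh browser logins.

    load_saved()    -> cookie string, or None if no cookie file exists
    browser_login() -> cookie string
    validate(c)     -> token string, or None if the cookie is rejected
    """
    cookiestr = load_saved()
    for _ in range(max_attempts):
        if not cookiestr:
            cookiestr = browser_login()
        token = validate(cookiestr)
        if token is not None:
            return cookiestr, token
        cookiestr = None  # rejected: force a fresh login on the next round
    raise RuntimeError(f"login failed after {max_attempts} attempts")


# Stand-ins for demonstration; they record calls instead of opening a browser.
calls = []


def fake_login():
    calls.append("login")
    return "fresh-cookie"


result = obtain_token(
    load_saved=lambda: "stale-cookie",
    browser_login=fake_login,
    validate=lambda c: "tok-1" if c == "fresh-cookie" else None,
)
print(result, calls)
```

Passing the steps in as functions keeps the control flow testable and puts the retry limit in one place.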

