The First Step of a Long March: Getting a Web Crawler Running

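My original goal was to learn machine learning, and the proper first step would have been to learn Python systematically. I have been using Python for a long time, but entirely self-taught: whenever I needed something, I found a sample online and used it as-is. A few days ago I watched some instructors' live-streamed courses to shore up the basics. Regular expressions left the deepest impression, and along the way I gradually came to understand web crawlers a little. To process data you first have to obtain it. I won't explain the modules in detail here, since there are plenty of examples online; the point is mainly to record the pitfalls I ran into. Below is a small example that scrapes the book listings from Douban.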

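# -*- coding: utf-8 -*-
"""
Created on Tue Mar 12 11:21:25 2019

@author: lilide

"""
import re

import requests

# Fetch the Douban Books home page.
response = requests.get('https://book.douban.com/')
# print(response.status_code)

# Alternative: decode the raw bytes explicitly instead of using response.text:
# content = response.content.decode('utf-8')

# Save the raw HTML for inspection. encoding='utf-8' is passed explicitly so
# the write does not depend on the platform's default encoding.
# The with-block closes the file even if an error occurs.
with open('full.txt', 'w', encoding='utf-8') as fo:
    fo.write(response.text)

# Raw string so that \s reaches the regex engine unchanged;
# re.S lets '.' match newlines, since each entry spans several lines.
pattern = re.compile(
    r'<li\s?class="">.*?="more-meta">\s?.*?"title">\s(.*?)\s<.*?class="author">\s(.*?)\s?</span>',
    re.S,
)

results = pattern.findall(response.text)
# print(results)

for name, author in results:
    # Strip all whitespace left over from the HTML indentation.
    author = re.sub(r'\s', '', author)
    name = re.sub(r'\s', '', name)
    print(name, '--' * 5, author)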

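There are two main things worth recording:

The first is the encoding:

With fo = open('full.txt','w'), running the code above raised: UnicodeEncodeError: 'gbk' codec can't encode character '\xbb' in position 0: illegal multibyte seque
Comparing the two versions, the cause appears to be that the file object was opened with GBK as its encoding, which is what open() falls back to on a Chinese-locale Windows system when no encoding is given.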

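The fix is to specify the encoding explicitly when opening the file:

fo = open('full.txt', 'w', encoding='utf-8')

The response body can be decoded explicitly in the same way:

r = requests.get('https://book.douban.com')  # instead of using r.text

content = r.content.decode('utf-8')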
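The encoding pitfall can be reproduced without touching the network. The following is a minimal sketch, not the code from this post: the helper name `can_write` and the sample string are made up for illustration, and `encoding='gbk'` is passed explicitly to simulate a machine whose default encoding is GBK (on other systems a plain `open()` might not fail at all).

```python
import os
import tempfile


def can_write(path, text, encoding):
    """Return True if `text` can be written to `path` using `encoding`."""
    try:
        with open(path, 'w', encoding=encoding) as f:
            f.write(text)
    except UnicodeEncodeError:
        return False
    return True


# '\xbb' is the character named in the error message above;
# the rest of the string is an arbitrary placeholder.
sample = 'page text with \xbb in it'

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'full.txt')
    gbk_ok = can_write(path, sample, 'gbk')     # simulates the GBK default
    utf8_ok = can_write(path, sample, 'utf-8')  # the fix used in this post
    # Read the file back with the same encoding it was written with.
    with open(path, encoding='utf-8') as f:
        round_trip = f.read()

print('gbk:', gbk_ok, 'utf-8:', utf8_ok)
```

Passing `encoding` on every `open()` call keeps the script's behavior the same regardless of which machine it runs on.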

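The second is building the regular expression: pattern = re.compile(r'<li\s?class="">.*?="more-meta">\s?.*?"title">\s(.*?)\s<.*?class="author">\s(.*?)\s?</span>', re.S)

At first it never matched anything, mainly because I was too impatient. When writing a regular expression, make sure every piece actually corresponds to something present in the page source; long expressions like this one call for extra care.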
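One way to apply that care is to split the long pattern into fragments and check, prefix by prefix, where the match first breaks. The sketch below is only an illustration: `SAMPLE_HTML` is a hypothetical snippet shaped to fit the pattern, not real Douban markup, and `first_failing_fragment` is a made-up helper name.

```python
import re

# Hypothetical markup for illustration only; the real page may differ.
SAMPLE_HTML = '''
<li class="">
  <div class="cover"><a href="#"><img src="#"></a></div>
  <div class="more-meta">
    <h4 class="title">
      ExampleTitle
    </h4>
    <p>
      <span class="author">
        ExampleAuthor
      </span>
    </p>
  </div>
</li>
'''

# The pattern from this post, cut into pieces that can be tested one by one.
FRAGMENTS = [
    r'<li\s?class="">',
    r'.*?="more-meta">',
    r'\s?.*?"title">',
    r'\s(.*?)\s<',
    r'.*?class="author">',
    r'\s(.*?)\s?</span>',
]


def first_failing_fragment(fragments, text):
    """Return the index of the first fragment that breaks the match, or None."""
    for i in range(1, len(fragments) + 1):
        prefix = ''.join(fragments[:i])
        if re.search(prefix, text, re.S) is None:
            return i - 1
    return None


print('first failing fragment:', first_failing_fragment(FRAGMENTS, SAMPLE_HTML))

# Once every prefix matches, the joined pattern can be used as a whole.
pattern = re.compile(''.join(FRAGMENTS), re.S)
cleaned = [
    tuple(re.sub(r'\s', '', part) for part in match)
    for match in pattern.findall(SAMPLE_HTML)
]
print(cleaned)
```

When a fragment index comes back instead of `None`, only that piece needs to be compared against the saved page source (here, `full.txt`), which is much quicker than staring at the whole expression.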

I referred to the blog of the author known as coder: https://www.cnblogs.com/zhaof/p/6925674.html
