Web Scraping Lessons

1. If scraped data comes back empty even though the content shows up in the browser's F12 dev tools, view the raw page source. If the content is missing there, it is rendered by JavaScript, and the only option is to simulate a browser with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('......')  # URL omitted
driver.implicitly_wait(20)  # wait up to 20s for elements to appear
author = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[1]').text
date = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[2]').text
driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[2]/div[1]/button[2]').click()
print(author, date)

2. Scrapy does not execute JavaScript, so without simulating a browser it can only crawl static pages; see the middleware sketch below.
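
One common way around this (not from the original post) is a downloader middleware that lets Selenium render each page and hands the finished HTML back to Scrapy. A minimal sketch; the class name and file layout are assumptions:

# middlewares.py -- render pages with Selenium before Scrapy parses them
from scrapy.http import HtmlResponse
from selenium import webdriver

class SeleniumMiddleware:
    def __init__(self):
        self.driver = webdriver.Chrome()

    def process_request(self, request, spider):
        # Load the page in a real browser, then return the rendered
        # HTML to Scrapy as an ordinary response (skips the downloader).
        self.driver.get(request.url)
        return HtmlResponse(url=request.url,
                            body=self.driver.page_source,
                            encoding='utf-8',
                            request=request)

Enable it in settings.py, e.g. DOWNLOADER_MIDDLEWARES = {'myproject.middlewares.SeleniumMiddleware': 543} (module path and priority assumed).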

3. If a page visibly has content but its response URL opens as a 404, the site does not want it accessed that way; for now there is no way to crawl it (though you can at least inspect the failed response, as sketched below).
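
Scrapy's HttpError middleware drops non-2xx responses before they reach the spider callback, so by default you never even see the 404. A minimal sketch for inspecting what the server actually returned; the spider name and URL are placeholders:

import scrapy

class DebugSpider(scrapy.Spider):
    name = 'debug'
    # Let 404 responses through to the callback instead of dropping them.
    handle_httpstatus_list = [404]
    start_urls = ['https://example.com/some-page']

    def parse(self, response):
        if response.status == 404:
            # Inspect headers/body to see why the server refused the request.
            self.logger.warning('Got 404 for %s', response.url)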

4. In Scrapy, default request headers can be set in settings.py:

DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "bbs.tju.edu.cn",
    "Referer": "https://bbs.tju.edu.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
}
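
Headers can also be set per request rather than globally, via the headers argument of scrapy.Request; a small sketch with placeholder names:

import scrapy

class BoardSpider(scrapy.Spider):
    name = 'board'

    def start_requests(self):
        # Per-request headers take precedence over DEFAULT_REQUEST_HEADERS.
        yield scrapy.Request(
            'https://bbs.tju.edu.cn/',
            headers={'Referer': 'https://bbs.tju.edu.cn/'},
            callback=self.parse,
        )

    def parse(self, response):
        pass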

5. Plain insertion into MongoDB with pymongo:

import pymongo

client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
taobao = mydb['taobao']
# scraping code omitted
commodity = {
    'goods': goods,      # values filled in by the scraping step above
    'price': price,
    'sell': sells,
    'shop': shop,
    'address': address,
}
taobao.insert_one(commodity)
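
Note that re-running the scraper with insert_one will store duplicates. A common alternative, assuming 'goods' is unique enough to act as a key (an assumption, not from the original post), is an upsert:

# Update the existing document, or insert it if it is not there yet.
taobao.update_one({'goods': goods}, {'$set': commodity}, upsert=True)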

6. Inserting into MongoDB from a Scrapy pipeline:

# pipelines.py
import pymongo

class JianshuPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        self.post = test['jianshu']

    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)  # insert() is deprecated in pymongo
        return item
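
The pipeline only runs once it is enabled in settings.py; the module path below assumes the Scrapy project is named jianshu:

# settings.py
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,
}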

Reposted from blog.csdn.net/lirui7610/article/details/79953005