Web Crawler Lessons

1. If some fields come back empty after crawling but the data is visible in the browser's DevTools (F12), view the page source: the content is most likely generated by JavaScript rather than present in the static HTML.

In that case you have to drive a real browser with Selenium:

from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('......')  # URL omitted
driver.implicitly_wait(20)  # wait up to 20 s for elements to render
# Read the author and date from the JS-rendered DOM
author = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[1]').text
date = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[2]').text
# Click a button on the page (e.g. to expand or load more content)
driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[2]/div[1]/button[2]').click()
print(author, date)
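A quick way to confirm that the data really is JavaScript-rendered is to fetch the raw HTML with requests and check whether a value you can see under F12 is present. This is a minimal sketch; the URL and the search string are placeholders, not from the page above.

import requests

# Placeholder URL and text: substitute the real page and a value visible in F12
resp = requests.get('https://example.com/page')
if 'text seen in F12' in resp.text:
    print('Data is in the static HTML; plain requests or scrapy is enough')
else:
    print('Data is likely rendered by JS; fall back to Selenium')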

2. Scrapy by itself does not simulate a browser, so it can only crawl static web pages; a minimal spider is sketched below.
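A minimal sketch of such a spider, where the spider name, URL, and selector are placeholders:

import scrapy

class StaticSpider(scrapy.Spider):
    name = 'static_demo'                    # placeholder spider name
    start_urls = ['https://example.com']    # placeholder URL

    def parse(self, response):
        # Only works if the data is in the initial HTML, not rendered by JS
        for title in response.xpath('//h4/a/text()').getall():
            yield {'title': title}

Run it with scrapy crawl static_demo from inside a Scrapy project.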

3. If a page shows content in the browser but the crawler's request comes back 404, the site does not want to be crawled, and there is no workaround for the time being.

4. In Scrapy, default request headers can be set in settings.py:

DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "bbs.tju.edu.cn",
    "Referer": "https://bbs.tju.edu.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
}
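Headers can also be attached to individual requests instead of project-wide; a minimal sketch, where the spider name is a placeholder and the URL matches the settings above:

import scrapy

class HeaderSpider(scrapy.Spider):
    name = 'header_demo'  # placeholder spider name

    def start_requests(self):
        # Per-request headers override the project-wide defaults
        yield scrapy.Request(
            'https://bbs.tju.edu.cn/',
            headers={'Referer': 'https://bbs.tju.edu.cn/'},
            callback=self.parse,
        )

    def parse(self, response):
        pass  # parsing omitted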

5. Saving to MongoDB with plain pymongo

import pymongo

client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
taobao = mydb['taobao']
# ... scraping code omitted; it fills in goods, price, sells, shop, address ...
commodity = {
    'goods': goods,
    'price': price,
    'sell': sells,
    'shop': shop,
    'address': address,
}
taobao.insert_one(commodity)
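To check that the documents arrived, you can read them back from the same collection, for example:

# Print the first few stored commodities and the total count
for doc in taobao.find().limit(5):
    print(doc)
print(taobao.count_documents({}))  # count_documents needs pymongo 3.7+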

6. Saving to MongoDB from Scrapy

# in pipelines.py
import pymongo

class JianshuPipeline(object):
    def __init__(self):
        # Connect once when Scrapy instantiates the pipeline
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        self.post = test['jianshu']  # collection the items are written to

    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)  # insert() was deprecated in pymongo 3
        return item
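The pipeline only runs if it is enabled in settings.py; the project module name below is a placeholder for your own project:

# settings.py
ITEM_PIPELINES = {
    'jianshu_project.pipelines.JianshuPipeline': 300,  # placeholder module path
}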
