1. For data that comes back empty after crawling but is visible in the page under F12 (developer tools), check the page source; if the content is generated by JavaScript, the only option is to drive a real browser with Selenium:
```python
from selenium import webdriver
from selenium.webdriver.common.by import By

driver = webdriver.Chrome()
driver.get('......')  # URL omitted in the original
include_title = []
driver.implicitly_wait(20)
# find_element_by_xpath() was removed in Selenium 4; use find_element(By.XPATH, ...)
author = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[1]').text
date = driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[1]/div[1]/div[1]/div[1]/h4/a[2]').text
driver.find_element(By.XPATH, '//*[@id="main"]/div/div/div[1]/div[2]/div[1]/button[2]').click()
print(author, date)
```
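To see why a plain fetch comes back empty, here is a minimal stdlib sketch with a made-up page source (the HTML and the `price` element are hypothetical, not from the original notes): the visible element is empty and the value only exists inside the JavaScript, so it either has to be dug out of the script text or rendered with a real browser.

```python
import re

# Hypothetical page source: the element a crawler would target is empty,
# and the real value only appears inside a <script> block.
html = '''
<div id="price"></div>
<script>
  document.getElementById("price").innerText = "129.00";
</script>
'''

# A plain HTTP fetch only ever sees this source, so the <div> is empty;
# here we pull the value straight out of the JavaScript with a regex.
match = re.search(r'innerText\s*=\s*"([^"]+)"', html)
print(match.group(1))  # 129.00
```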
2. With Scrapy alone (no simulated browser) you can only crawl static web pages.
3. For pages that show content in the browser but whose response URL returns 404 when requested, the site does not want them visited; for now there is no way to crawl them.
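One thing worth knowing here: Scrapy's `HttpErrorMiddleware` silently drops non-2xx responses by default, so if a page serves a useful body alongside a 404 status, the spider never sees it. As a sketch, the real `HTTPERROR_ALLOWED_CODES` setting lets such responses reach your `parse()` callback so you can at least inspect them:

```python
# in settings.py: let 404 responses through Scrapy's HttpErrorMiddleware
# so parse() can inspect their body instead of having them dropped
HTTPERROR_ALLOWED_CODES = [404]
```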
4. For Scrapy, you can set default request headers in settings.py:
```python
# in settings.py
DEFAULT_REQUEST_HEADERS = {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate, br",
    "Accept-Language": "zh-CN,zh;q=0.9",
    "Connection": "keep-alive",
    "Host": "bbs.tju.edu.cn",
    "Referer": "https://bbs.tju.edu.cn/",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36",
}
```
5. Inserting into MongoDB with plain pymongo:
```python
import pymongo

client = pymongo.MongoClient('localhost', 27017)
mydb = client['mydb']
taobao = mydb['taobao']
# ... scraping code omitted ...
commodity = {
    'goods': goods,      # values obtained from the scrape
    'price': price,
    'sell': sells,
    'shop': shop,
    'address': address,
}
taobao.insert_one(commodity)
```
6. Inserting into MongoDB from Scrapy:
```python
# in pipelines.py
import pymongo

class JianshuPipeline(object):
    def __init__(self):
        client = pymongo.MongoClient('localhost', 27017)
        test = client['test']
        jianshu = test['jianshu']
        self.post = jianshu

    def process_item(self, item, spider):
        info = dict(item)
        self.post.insert_one(info)  # Collection.insert() is deprecated; use insert_one()
        return item
```
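For the pipeline above to run at all, it also has to be registered in settings.py; the module path below is an assumption based on the class name, so adjust it to your actual project layout.

```python
# in settings.py: enable the pipeline (lower number = runs earlier)
ITEM_PIPELINES = {
    'jianshu.pipelines.JianshuPipeline': 300,  # assumed module path
}
```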