Getting Started with Python Crawlers [20]: A Site-Wide Juejin (Nuggets) User Crawler with Scrapy

To crawl all of a site's users, the theory is simple: start from one user, crawl that user's following list, then crawl the following lists of those users, and keep expanding outward.

First, open any user's profile page.

The green circles mark the information we want to collect. What if this person follows 0 users? Then we need to keep looking for an entry point: the starting user must follow others. We pick the following list rather than the followers list to keep the data useful, because a followers list may contain a large number of throwaway or inactive accounts of little value.

I chose an entry page whose user follows three people; you could pick one with more, it makes little difference!
https://juejin.im/user/55fa7cd460b2e36621f07dde/following
From this page we grab each followed user's ID.

With an ID in hand, we can splice together links of the following form:

https://juejin.im/user/&lt;user ID&gt;/following
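
For example, a minimal splicing sketch (following_url is a hypothetical helper name, not part of the project code):

def following_url(user_id):
    # splice the user ID into the "following" page URL
    return "https://juejin.im/user/{}/following".format(user_id)

# following_url("55fa7cd460b2e36621f07dde")
# -> "https://juejin.im/user/55fa7cd460b2e36621f07dde/following"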

Writing the crawler

With the analysis done, we can create a scrapy project.

In the items.py file we define all the data we need. Note the _id = scrapy.Field() reserved up front in preparation for mongodb; the remaining fields are explained by their comments.

import scrapy

class JuejinItem(scrapy.Item):

    _id = scrapy.Field()
    username = scrapy.Field()
    job = scrapy.Field()
    company = scrapy.Field()
    intro = scrapy.Field()
    # columns (专栏)
    columns = scrapy.Field()
    # pins (沸点)
    boiling = scrapy.Field()
    # shares (分享)
    shares = scrapy.Field()
    # likes (赞)
    praises = scrapy.Field()
    # books (小册)
    books = scrapy.Field()
    # following (关注了)
    follow = scrapy.Field()
    # followers (关注者)
    followers = scrapy.Field()
    goods = scrapy.Field()
    editer = scrapy.Field()
    reads = scrapy.Field()
    collections = scrapy.Field()
    tags = scrapy.Field()

Next, write the spider's main entry file, JuejinspiderSpider.py.

import scrapy
from scrapy.selector import Selector
from Juejin.items import JuejinItem

class JuejinspiderSpider(scrapy.Spider):
    name = 'JuejinSpider'
    allowed_domains = ['juejin.im']
    # start URL    5c0f372b5188255301746103
    start_urls = ['https://juejin.im/user/55fa7cd460b2e36621f07dde/following']

The logic of the parse function is not complicated; the processing splits into two flows:

  1. Return the item
  2. Return requests for the following list
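
In shape, that is roughly the following (a bare sketch; build_item and following_links are hypothetical helper names, and the real parse below inlines both flows):

def parse(self, response):
    # flow 1: extract the profile data and yield it as an item
    yield self.build_item(response)             # hypothetical helper
    # flow 2: yield a request for every user on the following page
    for url in self.following_links(response):  # hypothetical helper
        yield scrapy.Request(url, callback=self.parse)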

To fill the item we match fields with XPath. To cut down the amount of code, I wrote a small extraction helper called get_default.

    def get_default(self, exts):
        # return the first extracted value, or 0 if the XPath matched nothing
        if len(exts) > 0:
            ret = exts[0]
        else:
            ret = 0
        return ret

    def parse(self, response):
        # base_data = response.body_as_unicode()
        select = Selector(response)
        item = JuejinItem()
        # grab the profile data
        item["username"] = select.xpath("//h1[@class='username']/text()").extract()[0]
        position = select.xpath("//div[@class='position']/span/span/text()").extract()
        if position:
            job = position[0]
            if len(position) > 1:
                company = position[1]
            else:
                company = ""
        else:
            job = company = ""
        item["job"] = job
        item["company"] = company
        item["intro"] = self.get_default(select.xpath("//div[@class='intro']/span/text()").extract())
        # columns (专栏)
        item["columns"] = self.get_default(select.xpath("//div[@class='header-content']/a[2]/div[2]/text()").extract())
        # pins (沸点)
        item["boiling"] = self.get_default(select.xpath("//div[@class='header-content']/a[3]/div[2]/text()").extract())
        # shares (分享)
        item["shares"] = self.get_default(select.xpath("//div[@class='header-content']/a[4]/div[2]/text()").extract())
        # likes (赞)
        item["praises"] = self.get_default(select.xpath("//div[@class='header-content']/a[5]/div[2]/text()").extract())
        # books (小册)
        item["books"] = self.get_default(select.xpath("//div[@class='header-content']/a[6]/div[2]/text()").extract())

        # following (关注了)
        item["follow"] = self.get_default(select.xpath("//div[@class='follow-block block shadow']/a[1]/div[2]/text()").extract())
        # followers (关注者)
        item["followers"] = self.get_default(select.xpath("//div[@class='follow-block block shadow']/a[2]/div[2]/text()").extract())

        # the stats block has three rows for some users (with an extra "editer" row) and two for others
        right = select.xpath("//div[@class='stat-block block shadow']/div[2]/div").extract()
        if len(right) == 3:
            item["editer"] = self.get_default(select.xpath("//div[@class='stat-block block shadow']/div[2]/div[1]/span/text()").extract())
            item["goods"] = self.get_default(select.xpath("//div[@class='stat-block block shadow']/div[2]/div[2]/span/span/text()").extract())
            item["reads"] = self.get_default(select.xpath("//div[@class='stat-block block shadow']/div[2]/div[3]/span/span/text()").extract())
        else:
            item["editer"] = ""
            item["goods"] = self.get_default(select.xpath("//div[@class='stat-block block shadow']/div[2]/div[1]/span/span/text()").extract())
            item["reads"] = self.get_default(select.xpath("//div[@class='stat-block block shadow']/div[2]/div[2]/span/span/text()").extract())

        item["collections"] = self.get_default(select.xpath("//div[@class='more-block block']/a[1]/div[2]/text()").extract())
        item["tags"] = self.get_default(select.xpath("//div[@class='more-block block']/a[2]/div[2]/text()").extract())
        yield item  # return the item

Now that the code successfully yields the item, open settings.py: enable the pipeline so we can test whether the data can be stored, and configure the request headers via DEFAULT_REQUEST_HEADERS.

settings.py

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    "Host": "juejin.im",
    "Referer": "https://juejin.im/timeline?sort=weeklyHottest",
    "Upgrade-Insecure-Requests": "1",
    "User-Agent": "Mozilla/5.0 浏览器UA"
}

ITEM_PIPELINES = {
   'Juejin.pipelines.JuejinPipeline': 20,
}

This crawler stores its data in mongodb, so the storage code goes into pipelines.py.


import pymongo

DATABASE_IP = '127.0.0.1'
DATABASE_PORT = 27017
DATABASE_NAME = 'sun'
client = pymongo.MongoClient(DATABASE_IP, DATABASE_PORT)
db = client.sun
db.authenticate("dba", "dba")  # pymongo 3.x style authentication
collection = db.jujin  # the collection we will insert into

class JuejinPipeline(object):

    def process_item(self, item, spider):
        try:
            collection.insert_one(dict(item))  # insert() is deprecated in pymongo 3.x
        except Exception as e:
            print(e.args)
        return item  # return the item for any later pipelines
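
Since the crawl can reach the same user more than once across runs, a hedged variant (my assumption: you want a single document per user) is to upsert keyed on username rather than insert blindly:

class JuejinPipeline(object):

    def process_item(self, item, spider):
        # replace this user's existing document, or insert one if absent
        collection.replace_one({"username": item["username"]}, dict(item), upsert=True)
        return item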

Run the code; if there are no errors, the last step is to complete the crawl loop inside the spider's parse method. Scrapy's built-in duplicate filter skips URLs that have already been requested, so the loop will not revisit the same user.

        list_li = select.xpath("//ul[@class='tag-list']/li")  # every user in the following list
        for li in list_li:
            a_link = li.xpath(".//meta[@itemprop='url']/@content").extract()[0]  # the user's URL
            # yield a request for the spliced-together following page
            yield scrapy.Request(a_link + "/following", callback=self.parse)

That is all the code.

With that, the full-site user crawler is written.

Directions for extension

  1. Each run only crawls the first page of each following list; the loop could be extended to page through the rest, which is not hard
  2. Turn up the concurrency settings in settings.py (see the sketch after this list)
  3. Add redis for more speed; several distributed crawlers that raise the crawl rate will follow in later posts
  4. The idea extends to user crawlers for many more sites, several of which we will also write later
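
For point 2, a hedged sketch of the relevant settings.py knobs (the values are illustrative, not tuned; note that Scrapy is asynchronous rather than literally multithreaded, so these concurrency settings play that role):

CONCURRENT_REQUESTS = 32             # default is 16
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # cap parallel requests per domain
DOWNLOAD_DELAY = 0.25                # small delay between requests, to stay polite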

Source: blog.51cto.com/14445003/2424143