[Crawler] Study notes day 62: 7.4 scrapy-redis in practice - processing the data in Redis

7.4 scrapy-redis in practice - processing the data in Redis

Processing the data in Redis

The data has been crawled, but it sits in Redis unprocessed. Earlier we did not define our own ITEM_PIPELINES in the settings file and used RedisPipeline instead, so the items are now stored in Redis under the key youyuan:items and need a separate processing step.

The scrapy-youyuan directory contains a process_items.py file. It is a template from the scrapy-redis example project that reads items from Redis and processes them.

Suppose we want to read the data out of youyuan:items and write it to MongoDB or MySQL. We can write our own process_youyuan_profile.py file and keep it running in the background, so that newly crawled data is stored in the database continuously.
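
Both scripts below share the same read loop and differ only in where each item is written. As a minimal sketch of that shared shape (it is not part of the scrapy-redis example), the helper below accepts any client that exposes a redis-py style blpop(keys, timeout) method plus a callback that stores one item. The names consume, handle and max_items are hypothetical and exist only for illustration.

```python
import json


def consume(client, key, handle, timeout=5, max_items=None):
    """Pop JSON items from a Redis list and pass each one to handle().

    client    -- any object with a redis-py style blpop(keys, timeout) method
    handle    -- hypothetical callback that stores one decoded item
    max_items -- stop after this many items (None means run until a timeout)
    Returns the number of items handled.
    """
    handled = 0
    while max_items is None or handled < max_items:
        popped = client.blpop([key], timeout=timeout)
        if popped is None:
            # blpop returns None when the timeout expires with no data.
            break
        _source, data = popped
        try:
            item = json.loads(data)
        except ValueError:
            # Skip payloads that are not valid JSON instead of crashing.
            continue
        handle(item)
        handled += 1
    return handled
```

With a real connection this could be called as consume(rediscli, "youyuan:items", store_item), where store_item is whatever function writes to the database. Passing timeout=0 makes blpop block indefinitely, which matches the behaviour of the scripts below.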

Storing in MongoDB

  1. Start the MongoDB database: sudo mongod
  2. Run the following program: py2 process_youyuan_mongodb.py
# process_youyuan_mongodb.py

# -*- coding: utf-8 -*-

import json
import redis
import pymongo

def main():

    # Redis connection settings
    rediscli = redis.StrictRedis(host='192.168.199.108', port=6379, db=0)
    # MongoDB connection settings
    mongocli = pymongo.MongoClient(host='localhost', port=27017)

    # Database name (MongoDB creates it on first write)
    db = mongocli['youyuan']
    # Collection name
    sheet = db['beijing_18_25']

    while True:
        # blpop gives FIFO order, brpop gives LIFO order; both return (key, value)
        source, data = rediscli.blpop(["youyuan:items"])

        item = json.loads(data)
        # insert() is deprecated in PyMongo 3 and removed in PyMongo 4
        sheet.insert_one(item)

        try:
            print(u"Processing: %(name)s <%(link)s>" % item)
        except KeyError:
            print(u"Error processing: %r" % item)

if __name__ == '__main__':
    main()
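
The comment in the loop says that blpop reads in FIFO order and brpop in LIFO order. That holds when the producer appends with RPUSH, which is assumed here to be how the items reached youyuan:items. The sketch below imitates the Redis list with a collections.deque instead of a live server, so it illustrates the ordering only; the helper name drain and the sample values are made up.

```python
from collections import deque


def drain(pushed, pop_side):
    """Return the order in which values leave a list filled by RPUSH.

    pop_side is "left" (like BLPOP) or "right" (like BRPOP).
    """
    queue = deque(pushed)  # RPUSH appends, so deque order equals push order
    out = []
    while queue:
        out.append(queue.popleft() if pop_side == "left" else queue.pop())
    return out


pushed = ["item1", "item2", "item3"]
blpop_order = drain(pushed, "left")    # oldest first: FIFO
brpop_order = drain(pushed, "right")   # newest first: LIFO
print(blpop_order)
print(brpop_order)
```

FIFO is usually the better fit for this job, because items are stored in roughly the order in which they were crawled.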

Storing in MySQL

  1. Start MySQL: mysql.server start (the command differs between platforms)

  2. Log in as the root user: mysql -uroot -p

  3. Create the database youyuan: create database youyuan;

  4. Switch to that database: use youyuan

  5. Create the table beijing_18_25, defining the column name and data type for every field.

  6. Run the following program: py2 process_youyuan_mysql.py

#process_youyuan_mysql.py

# -*- coding: utf-8 -*-

import json
import redis
import MySQLdb

def main():
    # Redis connection settings
    rediscli = redis.StrictRedis(host='192.168.199.108', port=6379, db=0)
    # MySQL connection settings
    mysqlcli = MySQLdb.connect(host='127.0.0.1', user='power', passwd='xxxxxxx', db='youyuan', port=3306, use_unicode=True)

    while True:
        # blpop gives FIFO order, brpop gives LIFO order; both return (key, value)
        source, data = rediscli.blpop(["youyuan:items"])
        item = json.loads(data)

        try:
            # Get a cursor for this operation
            cur = mysqlcli.cursor()
            # Run the parameterized SQL INSERT statement
            cur.execute("INSERT INTO beijing_18_25 (username, crawled, age, spider, header_url, source, pic_urls, monologue, source_url) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s )", [item['username'], item['crawled'], item['age'], item['spider'], item['header_url'], item['source'], item['pic_urls'], item['monologue'], item['source_url']])
            # Commit the transaction
            mysqlcli.commit()
            # Close the cursor
            cur.close()
            print("inserted %s" % item['source_url'])
        except MySQLdb.Error as e:
            print("Mysql Error %d: %s" % (e.args[0], e.args[1]))

if __name__ == '__main__':
    main()
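
Step 5 asks for a table whose columns match the INSERT statement, but the screenshot that showed the schema is missing. As a rough sketch, the block below takes the nine column names from the INSERT above and pairs them with column types that are purely assumptions (adjust them to your data). It builds the CREATE TABLE and INSERT statements and runs them against an in-memory SQLite database, used here as a stand-in for MySQL so the example runs without a server. The sample item values are made up, and treating pic_urls as a list is also an assumption. Note that sqlite3 uses ? placeholders where MySQLdb uses %s.

```python
import json
import sqlite3

TABLE = "beijing_18_25"

# Column names come from the INSERT statement; the types are assumptions.
COLUMNS = [
    ("username", "VARCHAR(64)"),
    ("crawled", "VARCHAR(32)"),
    ("age", "VARCHAR(16)"),
    ("spider", "VARCHAR(32)"),
    ("header_url", "VARCHAR(255)"),
    ("source", "VARCHAR(32)"),
    ("pic_urls", "TEXT"),
    ("monologue", "TEXT"),
    ("source_url", "VARCHAR(255)"),
]

names = [name for name, _type in COLUMNS]
create_sql = "CREATE TABLE %s (%s)" % (
    TABLE, ", ".join("%s %s" % col for col in COLUMNS))
insert_sql = "INSERT INTO %s (%s) VALUES (%s)" % (
    TABLE, ", ".join(names), ", ".join("?" for _ in names))


def to_row(item):
    """Order the item's fields to match COLUMNS; lists become JSON text."""
    row = []
    for name in names:
        value = item.get(name)
        if isinstance(value, (list, dict)):
            value = json.dumps(value)
        row.append(value)
    return row


# Made-up sample item, shaped like the fields used in the INSERT.
sample = {
    "username": "example_user", "crawled": "2020-01-01 00:00:00",
    "age": "22", "spider": "youyuan",
    "header_url": "http://example.com/h.jpg",
    "source": "youyuan", "pic_urls": ["http://example.com/1.jpg"],
    "monologue": "hello", "source_url": "http://example.com/profile",
}

conn = sqlite3.connect(":memory:")
conn.execute(create_sql)
conn.execute(insert_sql, to_row(sample))
conn.commit()
stored = conn.execute("SELECT username, pic_urls FROM " + TABLE).fetchone()
print(stored)
```

Serializing list values to JSON text before the insert keeps each parameter a single scalar, which is what a one-placeholder-per-column statement expects.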

Origin blog.csdn.net/qq_35456045/article/details/104111491