Writing a Crawler in 150 Lines of Code (2)

Previous article: http://dushen.iteye.com/blog/2415336

Project address: https://gitee.com/dushen666/spider.git

 

Continuing from the previous article, where we crawled the data and saved it as a JSON file, this article inserts the data into a relational database and implements deduplication.

The following uses MySQL as an example:

  1. We create the table structure according to the items in the previous article:
    /*
    SQLyog Ultimate v10.42
    MySQL - 5.7.20-0ubuntu0.16.04.1 : Database - movie-website1
    *********************************************************************
    */
    
    
    /*!40101 SET NAMES utf8 */;
    
    /*Table structure for table `spider_h6080` */
    
    CREATE TABLE `spider_h6080` (
      `name` VARCHAR(1000) DEFAULT NULL,
      `url` VARCHAR(1000) DEFAULT NULL,
      `num` VARCHAR(1000) DEFAULT NULL,
      FULLTEXT KEY `spiderh6080index` (`name`,`url`)
    ) ENGINE=INNODB DEFAULT CHARSET=utf8;
    
    /*Table structure for table `spider_h6080_movieinfo` */
    
    CREATE TABLE `spider_h6080_movieinfo` (
      `id` INT(11) NOT NULL AUTO_INCREMENT,
      `moviename` VARCHAR(100) DEFAULT NULL,
      `prefilename` VARCHAR(50) DEFAULT NULL,
      `suffixname` VARCHAR(20) DEFAULT NULL,
      `createtime` DATETIME(6) DEFAULT NULL,
      `updatetime` DATETIME(6) DEFAULT NULL,
      `publishtime` VARCHAR(100) DEFAULT NULL,
      `types` VARCHAR(200) DEFAULT NULL,
      `area` VARCHAR(200) DEFAULT NULL,
      `language` VARCHAR(200) DEFAULT NULL,
      `actor` VARCHAR(200) DEFAULT NULL,
      `director` VARCHAR(200) DEFAULT NULL,
      `keyword` VARCHAR(200) DEFAULT NULL,
      `weight` INT(11) DEFAULT NULL,
      `countnumber` INT(11) DEFAULT NULL,
      `avaliblesum` INT(11) DEFAULT NULL,
      `introduce` VARCHAR(2000) DEFAULT NULL,
      `clickcount` INT(11) DEFAULT NULL,
      `playcount` INT(11) DEFAULT NULL,
      `duration` VARCHAR(100) DEFAULT NULL,
      `isoutsource` VARCHAR(2) DEFAULT NULL,
      `picurl` VARCHAR(500) DEFAULT NULL,
      `classify_id` INT(11) DEFAULT NULL,
      PRIMARY KEY (`id`),
      FULLTEXT KEY `moviename` (`moviename`)
    ) ENGINE=INNODB DEFAULT CHARSET=utf8;
    
    
     
  2. Install MySQL-python for connecting to our MySQL database:

     Ubuntu installation steps are as follows:

    sudo apt-get install libmysqlclient-dev libmysqld-dev python-dev python-setuptools
    pip install MySQL-python
     

     On Windows, download MySQL-python-1.2.5.win-amd64-py2.7.exe from the attachment and double-click it to install.

     

     To verify, run import MySQLdb in the Python interactive shell. If no error is reported, the installation succeeded:

    C:\Users\du>python
    Python 2.7.10 (default, May 23 2015, 09:44:00) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import MySQLdb
    >>>
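
     As a quick connectivity check, the following sketch connects and runs a trivial query. It is only a sketch: the host, user, password, and database mirror the pipeline code further below and must be adapted to your own setup.

    import MySQLdb

    # Connection parameters are examples; adjust to your environment.
    conn = MySQLdb.connect(host='127.0.0.1', port=3306, user='root',
                           passwd='ROOT', db='movie-website1', charset='utf8')
    cur = conn.cursor()
    cur.execute('SELECT VERSION()')
    print cur.fetchone()[0]  # prints the server version string
    cur.close()
    conn.close()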

     

  3. Install pybloom for deduplication.
    pip install pybloom
     Description: pybloom implements the Bloom filter, proposed by Burton Howard Bloom in 1970. A Bloom filter consists of a long bit vector and a series of random hash functions, and can be used to test whether an element is in a set (with a small chance of false positives, but no false negatives).
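
     A minimal sketch of the pybloom API (the capacity and URLs below are illustrative values, not the project's):

    from pybloom import BloomFilter

    bf = BloomFilter(capacity=1000, error_rate=0.001)
    print bf.add('http://example.com/a')   # False: not seen before
    print bf.add('http://example.com/a')   # True: (probably) seen already
    print 'http://example.com/b' in bf     # False: membership test
    print bf.count                         # number of elements added so far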

     There are usually three approaches to deduplication:
    1. Keep the filter field in a list. For each scraped item, check whether its filter field is already in the list: if it is, drop the item; otherwise insert the item into the database and append the field to the list. However, as the list keeps growing, the program occupies more and more memory, performance degrades, and the system's OOM killer may even kill the process.

    # in __init__: self.ids_seen = []

    if item['url'] in self.ids_seen:
        raise DropItem("Exist Exception! Duplicate item found: %s" % item['name'])
    else:
        sql = ("INSERT INTO spider_h6080 (NAME,url,num) VALUES ('%s', '%s', '%s')"
               % (item['name'], item['url'], item['num']))
        try:
            self.cur.execute(sql)
            self.conn.commit()
            self.ids_seen.append(item['url'])
        except Exception as err:
            raise DropItem("DB Exception! Duplicate item found: %s" % err)
        return item

     2. Declare the filter field as a primary key (or unique key) in the database and rely on the database's duplicate-key errors to reject duplicates. However, a large number of database errors may cause the program to crash. (Note that the spider_h6080 table above defines no unique key, so one would have to be added for this approach to work.)

    sql = ("INSERT INTO spider_h6080 (NAME,url,num) VALUES ('%s', '%s', '%s')" % (item['name'], item['url'], item['num']))
    try:
    	self.cur.execute(sql)
    	self.conn.commit()
    except Exception as err:
    	raise DropItem("%s" % err)
    return item

     3. Use a Bloom filter. A Bloom filter trades a small, bounded error rate for memory space and speed: the program's memory usage still grows, but the growth is tiny compared with a plain list.

    from pybloom import BloomFilter

    # in __init__: self.ids_seen = BloomFilter(capacity=5000000, error_rate=0.001)

    if item['url'] in self.ids_seen:
        raise DropItem("Exist Exception! Duplicate item found: %s" % item['name'])
    else:
        sql = ("INSERT INTO spider_h6080 (NAME,url,num) VALUES ('%s', '%s', '%s')"
               % (item['name'], item['url'], item['num']))
        try:
            self.cur.execute(sql)
            self.conn.commit()
            self.ids_seen.add(item['url'])
        except Exception as err:
            raise DropItem("DB Exception! Duplicate item found: %s" % err)
        return item
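
     A fixed-capacity BloomFilter raises an error once more than capacity elements have been added. If the number of URLs is hard to bound in advance, pybloom also provides ScalableBloomFilter, which grows as needed. A minimal sketch with illustrative parameters:

    from pybloom import ScalableBloomFilter

    # Grows automatically instead of failing at a fixed capacity.
    sbf = ScalableBloomFilter(initial_capacity=1000, error_rate=0.001,
                              mode=ScalableBloomFilter.SMALL_SET_GROWTH)
    sbf.add('http://example.com/a')
    print 'http://example.com/a' in sbf  # True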

     

  4. The code of pipelines.py is as follows:
    # -*- coding: utf-8 -*-
    
    # Define your item pipelines here
    #
    # Don't forget to add your pipeline to the ITEM_PIPELINES setting
    # See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
    import sys
    reload(sys)
    sys.setdefaultencoding('utf-8')
    from scrapy.exceptions import DropItem
    from spider.items import H6080Item, H6080MovieInfo
    import MySQLdb
    from pybloom import BloomFilter
    
    
    class Spider1Pipeline(object):
        def __init__(self):
            #self.ids_seen = []
            self.ids_seen = BloomFilter(capacity=5000000, error_rate=0.001)
            #self.movienames_seen = []
            self.movienames_seen = BloomFilter(capacity=5000000, error_rate=0.001)
            self.conn = MySQLdb.connect(host='127.0.0.1', port=3306, user='root', passwd='ROOT', db='movie-website1', charset='utf8')
            self.cur = self.conn.cursor()
            # Pre-populate both filters with the rows already in the database,
            # so deduplication survives crawler restarts.
            self.cur.execute('SELECT url FROM spider_h6080')
            result = self.cur.fetchall()
            for row in result:
                self.ids_seen.add(row[0])
            self.cur.execute('SELECT picurl FROM spider_h6080_movieinfo')
            result = self.cur.fetchall()
            for row in result:
                self.movienames_seen.add(row[0])
    
        def process_item(self, item, spider):
            # BloomFilter.count is the number of elements added so far.
            print "Count of ids_seen: %s" % self.ids_seen.count
            print "Count of movienames_seen: %s" % self.movienames_seen.count
            if isinstance(item, H6080Item):
                if item['url'] in self.ids_seen:
                    raise DropItem("Exist Exception! Duplicate item found: %s" % item['name'])
                else:
                    sql = ("INSERT INTO spider_h6080 (NAME,url,num) VALUES ('%s', '%s', '%s')" % (item['name'], item['url'], item['num']))
                    try:
                        self.cur.execute(sql)
                        self.conn.commit()
                        self.ids_seen.add(item['url'])
                    except Exception as err:
                        raise DropItem("DB Exception! Duplicate item found: %s" % err)
                    return item
            elif isinstance(item, H6080MovieInfo):
                if item['picurl'] in self.movienames_seen:
                    raise DropItem("Exist Exception! Duplicate item found: %s" % item['name'])
                sql = "INSERT INTO spider_h6080_movieinfo (moviename,actor,TYPES,AREA,publishtime,countnumber,introduce,director,picurl) VALUES ('%s', '%s', '%s', '%s','%s', '%s', '%s', '%s', '%s')" % (item['name'], item['actor'], item['types'], item['area'], item['publishtime'], item['countnumber'], item['introduce'], item['director'], item['picurl'])
                try:
                    self.cur.execute(sql)
                    self.conn.commit()
                    self.movienames_seen.add(item['picurl'])
                except Exception as err:
                    raise DropItem("DB Exception! Duplicate item found: %s" % err)
                return item
    
     Here, the two item types use url and picurl respectively as their deduplication fields, so two Bloom filters are declared, one per field. In BloomFilter(capacity=5000000, error_rate=0.001), capacity is the maximum number of elements the filter can hold, and error_rate is the maximum tolerable false-positive rate.
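
     One caveat: the INSERT statements above build SQL by string interpolation, so a value containing a quote character will break the statement, and the resulting error is then reported as a duplicate. A safer variant (a sketch, not the project's actual code) passes the values as parameters and lets MySQLdb do the escaping:

    sql = "INSERT INTO spider_h6080 (NAME,url,num) VALUES (%s, %s, %s)"
    try:
        self.cur.execute(sql, (item['name'], item['url'], item['num']))
        self.conn.commit()
        self.ids_seen.add(item['url'])
    except Exception as err:
        raise DropItem("DB Exception! %s" % err)
    return item

     Also remember to enable the pipeline in settings.py; assuming the project module is named spider, as in the imports above, something like:

    ITEM_PIPELINES = {
        'spider.pipelines.Spider1Pipeline': 300,
    }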

At this point, the crawler has implemented data storage and deduplication, and we can use it to crawl the videos.

 

Finally, attach the project address again: https://gitee.com/dushen666/spider.git

 

 
