Python 3 Web Crawler in Practice 33, Data Storage: Non-Relational Database Storage with MongoDB

NoSQL stands for Not Only SQL and refers to non-relational databases. NoSQL stores are based on key-value pairs and do not need to go through a SQL parsing layer. There is no coupling between data items, so performance is very high.

Non-relational databases can be broken down as follows:

  • Key-value stores, such as Redis, Voldemort, and Oracle BDB.
  • Column stores, such as Cassandra, HBase, and Riak.
  • Document databases, such as CouchDB and MongoDB.
  • Graph databases, such as Neo4J, InfoGrid, and Infinite Graph.

For crawler data storage, some fields may fail to be extracted and end up missing, the data may change at any time, and there may be nested relationships within the data. If we use a relational database, first, we need to create tables in advance, and second, nested data has to be serialized before it can be stored, which is inconvenient. A non-relational database avoids these problems and is simple and efficient.
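
To make the point above concrete, here is a minimal sketch in plain Python (no database involved). The two items and the scores field are made-up sample data, and COLUMNS is a hypothetical table layout. It shows that a fixed-column row needs placeholders for missing fields and a serialized string for nested values, while the dictionaries themselves could be stored as documents unchanged.

```python
import json

# Two scraped items (made-up sample data): the second one lacks 'age'
# and carries a nested 'scores' field.
items = [
    {'id': '20170101', 'name': 'Jordan', 'age': 20},
    {'id': '20170202', 'name': 'Mike', 'scores': {'math': 90, 'english': [85, 88]}},
]

# A relational table has fixed columns (hypothetical layout), so every row
# must fill each column and nested values must be serialized to text.
COLUMNS = ['id', 'name', 'age', 'scores']


def to_row(item):
    """Flatten one item into a tuple matching COLUMNS."""
    row = []
    for column in COLUMNS:
        value = item.get(column)  # None when the field was not extracted
        if isinstance(value, (dict, list)):
            value = json.dumps(value)  # nested data becomes a JSON string
        row.append(value)
    return tuple(row)


rows = [to_row(item) for item in items]
print(rows)
```

A document database can take each dictionary in items as it is, which is why the rest of this section stores crawler output that way.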

In this section we introduce the main data storage operations of MongoDB and Redis.

MongoDB storage

MongoDB is a non-relational database written in C++. It is an open source database system based on distributed file storage. Its content is stored in a form similar to JSON objects, and a field value can contain other documents, arrays, and arrays of documents, which makes it very flexible. In this section we look at MongoDB storage operations in Python 3.

1. Preparations

Before beginning this section, make sure that MongoDB is installed and its service is running, and that Python's PyMongo library is installed. If not, refer to the installation instructions in Chapter 1.

2. Connect MongoDB

To connect to MongoDB we use MongoClient from the PyMongo library. Generally we pass in MongoDB's IP address and port: the first parameter, host, is the address and the second parameter, port, is the port. If the port is not passed, it defaults to 27017.

import pymongo
client = pymongo.MongoClient(host='localhost', port=27017)

This creates a MongoDB connection object.

Alternatively, the first parameter of MongoClient, host, can be given a MongoDB connection string that begins with mongodb, for example:

client = pymongo.MongoClient('mongodb://localhost:27017/')

This achieves the same connection.
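
If the host and port come from configuration, the connection string can be assembled in code. The following is a minimal sketch, not part of PyMongo: build_mongo_uri and its username and password parameters are hypothetical names introduced here, and the credentials case is an assumption that goes beyond the example above. It percent-encodes credentials with the standard library so that characters such as @ or / do not break the URI.

```python
from urllib.parse import quote_plus


def build_mongo_uri(host='localhost', port=27017, username=None, password=None):
    """Build a mongodb:// connection string (hypothetical helper)."""
    credentials = ''
    if username is not None and password is not None:
        # Reserved characters such as '@' or '/' must be percent-encoded.
        credentials = '%s:%s@' % (quote_plus(username), quote_plus(password))
    return 'mongodb://%s%s:%d/' % (credentials, host, port)


uri = build_mongo_uri()
print(uri)
# The string can then be passed as the first argument:
# client = pymongo.MongoClient(uri)
```

With the defaults, the helper produces the same string that is used in the example above.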

3. Specify database

MongoDB is divided into databases, so the next step is to specify which database to operate on. Here we use the test database as an example:

db = client.test

Accessing the test attribute of client returns the test database. You can also specify it like this:

db = client['test']

The two ways are equivalent.

4. Specify collection

Each MongoDB database contains a number of collections, which are similar to tables in a relational database. The next step is to specify the collection to operate on. Here we use a collection named students. As with the database, there are two ways to specify it:

collection = db.students

collection = db['students']

This declares a Collection object.

5. Insert data

Now we can insert data. For the students collection, we create a new student record in the form of a dictionary:

student = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}

Here we specify the student's ID, name, age, and gender, and then call the collection's insert() method directly to insert the data:

result = collection.insert(student)
print(result)

In MongoDB, every document has an _id attribute that uniquely identifies it. If _id is not explicitly specified, MongoDB automatically generates one of type ObjectId. The insert() method returns the _id value after execution.

Result:

5932a68615c2606814c91f3d

Of course, we can also insert multiple documents at once by passing them as a list, for example:

student1 = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}

student2 = {
    'id': '20170202',
    'name': 'Mike',
    'age': 21,
    'gender': 'male'
}

result = collection.insert([student1, student2])
print(result)

The result returned is the list of the corresponding _id values:

[ObjectId('5932a80115c2606a59e8a048'), ObjectId('5932a80115c2606a59e8a049')]

In fact, in PyMongo 3.x the insert() method is no longer officially recommended. It still works in 3.x, but it was removed in PyMongo 4.0. The official recommendation is to use insert_one() and insert_many() to insert a single record and multiple records respectively.

student = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}

result = collection.insert_one(student)
print(result)
print(result.inserted_id)

Result:

<pymongo.results.InsertOneResult object at 0x10d68b558>
5932ab0f15c2606f0c1cf6c5

Unlike insert(), this returns an InsertOneResult object, and we can read its inserted_id attribute to get the _id.

For the insert_many() method, we pass the data as a list, for example:

student1 = {
    'id': '20170101',
    'name': 'Jordan',
    'age': 20,
    'gender': 'male'
}

student2 = {
    'id': '20170202',
    'name': 'Mike',
    'age': 21,
    'gender': 'male'
}

result = collection.insert_many([student1, student2])
print(result)
print(result.inserted_ids)

The insert_many() method returns an InsertManyResult. Its inserted_ids attribute gives the list of _id values of the inserted data. Result:

<pymongo.results.InsertManyResult object at 0x101dea558>
[ObjectId('5932abf415c2607083d3b2ac'), ObjectId('5932abf415c2607083d3b2ad')]
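
A crawler usually collects many items before writing them, so one natural use of insert_many() is to flush items in batches. The sketch below is only an illustration under assumptions: chunked and save_in_batches are hypothetical helpers, and FakeCollection is an in-memory stand-in that exists only so the example runs without a MongoDB server. With a real PyMongo collection, the same save_in_batches call would go through collection.insert_many() and return ObjectId values.

```python
def chunked(items, size):
    """Yield successive lists of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def save_in_batches(collection, items, batch_size=100):
    """Insert items with one insert_many() call per batch; return all new _ids."""
    inserted_ids = []
    for batch in chunked(items, batch_size):
        result = collection.insert_many(batch)
        inserted_ids.extend(result.inserted_ids)
    return inserted_ids


class FakeResult:
    """Stand-in for InsertManyResult (demo only)."""
    def __init__(self, inserted_ids):
        self.inserted_ids = inserted_ids


class FakeCollection:
    """Hypothetical in-memory stand-in for a PyMongo collection (demo only)."""
    def __init__(self):
        self.documents = []
        self.calls = 0

    def insert_many(self, documents):
        self.calls += 1
        first = len(self.documents)
        self.documents.extend(documents)
        # Real MongoDB would return ObjectId values; integers are used here.
        return FakeResult(list(range(first, first + len(documents))))


fake = FakeCollection()
students = [{'id': str(20170101 + i), 'name': 'student%d' % i} for i in range(5)]
ids = save_in_batches(fake, students, batch_size=2)
print(ids, fake.calls)
```

Because chunked() never yields an empty list, insert_many() is never called with an empty batch.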

6. Queries

After inserting data, we can query it with the find_one() or find() method. find_one() returns a single result, while find() returns a generator-like object.

result = collection.find_one({'name': 'Mike'})
print(type(result))
print(result)

Here we query the document whose name is Mike. The returned result is a dictionary. Result:

<class 'dict'>
{'_id': ObjectId('5932a80115c2606a59e8a049'), 'id': '20170202', 'name': 'Mike', 'age': 21, 'gender': 'male'}

We can see that there is an extra _id attribute, which MongoDB added automatically during insertion.

We can also query directly by ObjectId, which requires the ObjectId class from the bson library.

from bson.objectid import ObjectId

result = collection.find_one({'_id': ObjectId('593278c115c2602667ec6bae')})
print(result)

The result is still a dictionary. Result:

{'_id': ObjectId('593278c115c2602667ec6bae'), 'id': '20170101', 'name': 'Jordan', 'age': 20, 'gender': 'male'}

Of course, if no matching document exists, None is returned.

To query multiple documents, we can use the find() method. For example, to find the documents where age is 20:

results = collection.find({'age': 20})
print(results)
for result in results:
    print(result)

Result:

<pymongo.cursor.Cursor object at 0x1032d5128>
{'_id': ObjectId('593278c115c2602667ec6bae'), 'id': '20170101', 'name': 'Jordan', 'age': 20, 'gender': 'male'}
{'_id': ObjectId('593278c815c2602678bb2b8d'), 'id': '20170102', 'name': 'Kevin', 'age': 20, 'gender': 'male'}
{'_id': ObjectId('593278d815c260269d7645a8'), 'id': '20170103', 'name': 'Harden', 'age': 20, 'gender': 'male'}

The return value is of type Cursor, which behaves like a generator. We need to iterate over it to get all the results, each of which is a dictionary.

To query the documents where age is greater than 20, write:

results = collection.find({'age': {'$gt': 20}})

Here the value for the query key is no longer a plain number but a dictionary. Its key is the comparison operator $gt, meaning greater than, and its value is 20. This finds all documents where age is greater than 20.

The comparison operators are summarized in the following table:

Symbol  Meaning                   Example
$lt     Less than                 {'age': {'$lt': 20}}
$gt     Greater than              {'age': {'$gt': 20}}
$lte    Less than or equal to     {'age': {'$lte': 20}}
$gte    Greater than or equal to  {'age': {'$gte': 20}}
$ne     Not equal to              {'age': {'$ne': 20}}
$in     In the given list         {'age': {'$in': [20, 23]}}
$nin    Not in the given list     {'age': {'$nin': [20, 23]}}
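
To get a feel for what these operators mean, the following sketch re-implements them in plain Python over a list of dictionaries. It is only an illustration of the semantics, not how MongoDB evaluates queries: matches is a hypothetical helper, the sample students are made up, and the sketch simply treats a missing field as a non-match, which is a simplification.

```python
import operator

# Map each comparison operator from the table to an equivalent Python test.
COMPARATORS = {
    '$lt': operator.lt,
    '$gt': operator.gt,
    '$lte': operator.le,
    '$gte': operator.ge,
    '$ne': operator.ne,
    '$in': lambda value, choices: value in choices,
    '$nin': lambda value, choices: value not in choices,
}


def matches(document, query):
    """Return True if document satisfies a flat query like {'age': {'$gt': 20}}."""
    for field, condition in query.items():
        if field not in document:
            return False  # simplification: missing fields never match here
        value = document[field]
        if isinstance(condition, dict):
            # Every operator in the condition dictionary must hold.
            if not all(COMPARATORS[op](value, arg) for op, arg in condition.items()):
                return False
        elif value != condition:
            return False  # a plain value means equality
    return True


# Made-up sample data for the demonstration.
students = [
    {'name': 'Jordan', 'age': 20},
    {'name': 'Mike', 'age': 21},
    {'name': 'Mark', 'age': 23},
]
older_than_20 = [s['name'] for s in students if matches(s, {'age': {'$gt': 20}})]
aged_20_or_23 = [s['name'] for s in students if matches(s, {'age': {'$in': [20, 23]}})]
print(older_than_20, aged_20_or_23)
```

Several operators can be combined in one condition dictionary, for example {'age': {'$gte': 20, '$lt': 23}}, in which case all of them must hold.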

Regular expression matching is also supported. For example, to query the students whose name begins with M:

results = collection.find({'name': {'$regex': '^M.*'}})

Here $regex specifies a regular expression match, and ^M.* is a regular expression that matches strings beginning with M, so this finds all results that match the pattern.

Some further operators are summarized below:

Symbol   Meaning                   Example                                            Example meaning
$regex   Regular expression match  {'name': {'$regex': '^M.*'}}                       name begins with M
$exists  Attribute exists          {'name': {'$exists': True}}                        the name attribute exists
$type    Type check                {'age': {'$type': 'int'}}                          age is of type int
$mod     Modulo operation          {'age': {'$mod': [5, 0]}}                          age modulo 5 equals 0
$text    Text query                {'$text': {'$search': 'Mike'}}                     a text-type attribute contains the string Mike
$where   Advanced query condition  {'$where': 'obj.fans_count == obj.follows_count'}  its fans count equals its follows count
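
Some of these operators have direct counterparts in plain Python, which can help when checking a condition locally before sending it to the database. The sketch below is only an analogy under assumptions: it uses Python's re module and the % operator on made-up sample values, and MongoDB's own regular expression engine may differ from Python's for more complex patterns.

```python
import re

# $regex: an unanchored search, so '^' is what pins the match to the start.
names = ['Harden', 'Jordan', 'Kevin', 'Mark', 'Mike']
pattern = re.compile('^M.*')
starts_with_m = [name for name in names if pattern.search(name)]

# $mod with [5, 0]: keep values whose remainder after division by 5 is 0.
ages = [20, 21, 25, 26]  # made-up sample ages
divisible_by_5 = [age for age in ages if age % 5 == 0]

# $exists with True: the key is present in the document.
documents = [{'name': 'Mike'}, {'age': 20}]
has_name = [doc for doc in documents if 'name' in doc]

print(starts_with_m, divisible_by_5, has_name)
```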

More detailed usage of these operators can be found in the official MongoDB documentation: https://docs.mongodb.com/manu...

7. Count

To count the number of documents in a query result, call the count() method. For example, to count all documents:

count = collection.find().count()
print(count)

Or to count the documents that meet a certain condition:

count = collection.find({'age': 20}).count()
print(count)

The result is a number: the count of documents that meet the condition.
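
Note that count() on a cursor was deprecated in PyMongo 3.7 and removed in PyMongo 4.0, where collection.count_documents() takes its place. The following sketch shows one way to write code that works with either version. count_matching is a hypothetical helper, and FakeCollection is an in-memory stand-in (equality conditions only) that exists just so the example runs without a MongoDB server.

```python
def count_matching(collection, query=None):
    """Count documents matching query, preferring the newer API when present."""
    query = {} if query is None else query
    if hasattr(collection, 'count_documents'):
        return collection.count_documents(query)
    # Fallback for old PyMongo releases that only offer Cursor.count().
    return collection.find(query).count()


class FakeCollection:
    """Hypothetical in-memory stand-in for a PyMongo collection (demo only)."""
    def __init__(self, documents):
        self.documents = documents

    def count_documents(self, query):
        # Supports plain equality conditions only.
        return sum(
            all(doc.get(key) == value for key, value in query.items())
            for doc in self.documents
        )


fake = FakeCollection([{'age': 20}, {'age': 20}, {'age': 21}])
total = count_matching(fake)
aged_20 = count_matching(fake, {'age': 20})
print(total, aged_20)
```

With a real collection, count_matching(collection, {'age': 20}) plays the role of collection.find({'age': 20}).count() from the example above.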

8. Sort

To sort, call the sort() method and pass in the field to sort by and a flag for ascending or descending order, for example:

results = collection.find().sort('name', pymongo.ASCENDING)
print([result['name'] for result in results])

Result:

['Harden', 'Jordan', 'Kevin', 'Mark', 'Mike']

Here we pass pymongo.ASCENDING to specify ascending order. For descending order, pass pymongo.DESCENDING.

9. Offset

In some cases we may want only some of the elements. Here we can use the skip() method to offset by a few positions. For example, an offset of 2 skips the first two elements and returns results starting from the third.

results = collection.find().sort('name', pymongo.ASCENDING).skip(2)
print([result['name'] for result in results])

Result:

['Kevin', 'Mark', 'Mike']

We can also use the limit() method to specify the number of results to take, for example:

results = collection.find().sort('name', pymongo.ASCENDING).skip(2).limit(2)
print([result['name'] for result in results])

Result:

['Kevin', 'Mark']

Without limit(), three results would be returned as before. With the limit applied, only the first two are returned.
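
The combined effect of sort(), skip(), and limit() is the same as slicing a sorted sequence. As a rough plain-Python analogy (this runs on an in-memory list, not on MongoDB), the query above behaves like this:

```python
from itertools import islice

# The five names from the examples above, in arbitrary insertion order.
names = ['Mike', 'Jordan', 'Mark', 'Harden', 'Kevin']

skip, limit = 2, 2
ordered = sorted(names)                           # like sort('name', pymongo.ASCENDING)
page = list(islice(ordered, skip, skip + limit))  # like skip(2).limit(2)
print(page)  # ['Kevin', 'Mark']
```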

It is worth noting that when the database holds a very large amount of data, on the order of tens or hundreds of millions of documents, it is best not to use a large offset to query data, as this is likely to cause memory overflow. Instead, you can query with an operation like the following:

from bson.objectid import ObjectId
collection.find({'_id': {'$gt': ObjectId('593278c815c2602678bb2b8d')}})

This requires recording the _id of the last document returned by the previous query.
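
The idea of remembering the last _id can be turned into a simple paging loop. The following is a minimal sketch under assumptions, not a definitive implementation: iter_pages and fetch_from_list are hypothetical helpers, the integer _id values stand in for ObjectId values, and the in-memory DOCS list replaces the database so that the example runs on its own. The commented-out function shows how the same page fetch might look with PyMongo.

```python
def iter_pages(fetch_page, page_size):
    """Yield pages of documents, resuming after the last _id seen."""
    last_id = None
    while True:
        page = fetch_page(last_id, page_size)
        if not page:
            return
        yield page
        last_id = page[-1]['_id']  # remember where this page ended


# With PyMongo, fetch_page could look like this (not executed here):
# def fetch_page(last_id, page_size):
#     query = {} if last_id is None else {'_id': {'$gt': last_id}}
#     cursor = collection.find(query).sort('_id', pymongo.ASCENDING).limit(page_size)
#     return list(cursor)

# In-memory stand-in so the sketch runs without a database.
DOCS = [{'_id': i, 'name': 'student%d' % i} for i in range(1, 8)]


def fetch_from_list(last_id, page_size):
    rows = [doc for doc in DOCS if last_id is None or doc['_id'] > last_id]
    return rows[:page_size]


pages = [[doc['_id'] for doc in page] for page in iter_pages(fetch_from_list, 3)]
print(pages)
```

Sorting by _id in each fetch matters here, since the last document of a page is only a valid resume point when the results come back in _id order.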

10. Update

To update data we can use the update() method, specifying the update condition and the updated data, for example:

condition = {'name': 'Kevin'}
student = collection.find_one(condition)
student['age'] = 25
result = collection.update(condition, student)
print(result)

Here we update the age of the document whose name is Kevin. We first specify the query condition, then fetch the document, modify its age, and finally call the update() method with the condition and the modified document to complete the update.

Result:

{'ok': 1, 'nModified': 1, 'n': 1, 'updatedExisting': True}

The result is a dictionary: ok indicates that the operation succeeded, and nModified is the number of documents affected.

We can also use the $set operator to update the data:

result = collection.update(condition, {'$set': student})

This way only the fields present in the student dictionary are updated. Other fields that existed in the original document are neither updated nor deleted. Without $set, the entire previous document is replaced by the student dictionary, and any other fields that existed originally are deleted.
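
The difference between the two forms can be mimicked with ordinary dictionaries. The sketch below is a plain-Python analogy, not PyMongo code: replace_document and set_fields are hypothetical helpers, and the sample document with its integer _id is made up for the demonstration.

```python
def replace_document(original, new_fields):
    """Mimic update() without $set: the whole document is replaced, _id is kept."""
    replaced = dict(new_fields)
    replaced['_id'] = original['_id']
    return replaced


def set_fields(original, new_fields):
    """Mimic update() with $set: only the given fields change."""
    updated = dict(original)
    updated.update(new_fields)
    return updated


# Made-up sample document.
original = {'_id': 1, 'name': 'Kevin', 'age': 20, 'gender': 'male'}
changes = {'name': 'Kevin', 'age': 25}

replaced = replace_document(original, changes)
merged = set_fields(original, changes)
print(replaced)  # 'gender' is gone
print(merged)    # 'gender' is still there
```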

The update() method is in fact also no longer officially recommended. It has been split into update_one() and update_many(), whose usage is stricter: the second parameter must be a dictionary whose keys are operators beginning with $. Let's look at an example.

condition = {'name': 'Kevin'}
student = collection.find_one(condition)
student['age'] = 26
result = collection.update_one(condition, {'$set': student})
print(result)
print(result.matched_count, result.modified_count)

Here we call the update_one() method. The second parameter can no longer be the modified dictionary itself; it must take a form such as {'$set': student}. The return value is of type UpdateResult, and its matched_count and modified_count attributes give the number of documents matched and the number of documents affected respectively.

Result:

<pymongo.results.UpdateResult object at 0x10d17b678>
1 0

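Note that modified_count is 0 whenever the matched document already holds the values being set, for example when the same update is run a second time. Let's look at another example: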

condition = {'age': {'$gt': 20}}
result = collection.update_one(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)

Here the query condition is age greater than 20 and the update is {'$inc': {'age': 1}}, which increments age by 1. After execution, the age of the first matching document is increased by 1.

Result:

<pymongo.results.UpdateResult object at 0x10b8874c8>
1 1

We can see that one document was matched and one document was affected.

If we call the update_many() method, all matching documents are updated, for example:

condition = {'age': {'$gt': 20}}
result = collection.update_many(condition, {'$inc': {'age': 1}})
print(result)
print(result.matched_count, result.modified_count)

This time the number of matches is no longer 1. The result is as follows:

<pymongo.results.UpdateResult object at 0x10c6384c8>
3 3

We can see that all matching documents were updated.
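
The contrast between update_one() and update_many() can be imitated on an in-memory list. The sketch below is a plain-Python analogy under assumptions: inc_age is a hypothetical helper, the ages are made-up sample data, and "first match" here means first in list order, whereas MongoDB picks the first document it finds for the query.

```python
def inc_age(documents, min_age, many=False):
    """Add 1 to 'age' where age > min_age; return (matched, modified) counts."""
    matched = [doc for doc in documents if doc['age'] > min_age]
    if not many:
        matched = matched[:1]  # like update_one: stop at the first match
    for doc in matched:
        doc['age'] += 1        # like {'$inc': {'age': 1}}
    return len(matched), len(matched)


# Made-up sample data.
students = [{'age': 20}, {'age': 21}, {'age': 22}, {'age': 26}]

one_result = inc_age(students, 20)              # only the first match changes
ages_after_one = [doc['age'] for doc in students]

many_result = inc_age(students, 20, many=True)  # every match changes
ages_after_many = [doc['age'] for doc in students]

print(one_result, ages_after_one)
print(many_result, ages_after_many)
```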

11. Delete

Deletion is relatively simple: call the remove() method with a condition, and all documents that meet the condition are deleted, for example:

result = collection.remove({'name': 'Kevin'})
print(result)

Result:

{'ok': 1, 'n': 1}

There are again two newer recommended methods, delete_one() and delete_many(), for example:

result = collection.delete_one({'name': 'Kevin'})
print(result)
print(result.deleted_count)
result = collection.delete_many({'age': {'$lt': 25}})
print(result.deleted_count)

Result:

<pymongo.results.DeleteResult object at 0x10e6ba4c8>
1
4

delete_one() deletes the first document that meets the condition, while delete_many() deletes all documents that meet the condition. Both return a DeleteResult, whose deleted_count attribute gives the number of documents deleted.

12. More

PyMongo also provides some combined methods, such as find_one_and_delete(), find_one_and_replace(), and find_one_and_update(), which find a document and then delete, replace, or update it. Their usage is basically the same as that of the methods described above.

Indexes can also be managed, with methods such as create_index(), create_indexes(), and drop_index().

13. Conclusion

This section explained how to use PyMongo to insert, delete, update, and query data in MongoDB. We will use these operations for data storage in the hands-on cases later.


Origin blog.51cto.com/14445003/2426850