Python 3 Web Crawler in Practice (32): Data Storage in a Relational Database, MySQL

A relational database is a database built on the relational model, and the relational model stores data in two-dimensional tables. A table is therefore made up of rows and columns: each column is a field and each row is a record. A table can be seen as a collection of entities of one kind, and the links between entities are expressed through relationships between tables, such as primary key and foreign key relationships. A database made up of multiple tables related in this way is a relational database.

There are a variety of relational databases, such as SQLite, MySQL, Oracle, SQL Server, DB2, and so on.

In this section we look at storing data in MySQL from Python 3.

In Python 2, most code connected to MySQL through MySQLdb, but that library does not officially support Python 3, so the library recommended here is PyMySQL.

This section explains how to operate a MySQL database with PyMySQL.

1. Preparations

Before starting, make sure MySQL is installed and running, and that the PyMySQL library is installed. If it is not, refer to the installation instructions in the first chapter.

2. Connect to the database

First we try to connect to the database. Assume MySQL is running locally, the user is root, the password is 123456, and the port is 3306. We use PyMySQL to connect and then create a new database called spiders. The code is as follows:

import pymysql

db = pymysql.connect(host='localhost',user='root', password='123456', port=3306)
cursor = db.cursor()
cursor.execute('SELECT VERSION()')
data = cursor.fetchone()
print('Database version:', data)
cursor.execute("CREATE DATABASE spiders DEFAULT CHARACTER SET utf8")
db.close()

Output:

Database version: ('5.6.22',)

Here the connect() method of PyMySQL creates a MySQL connection object. The host parameter is the address of the machine running MySQL; because MySQL runs locally here, we pass localhost, and for a remote MySQL server we would pass its public IP address. The remaining parameters are user (the user name), password (the password), and port (the port, 3306 by default).

Once the connection succeeds, we call the cursor() method to obtain a cursor and use it to execute SQL statements with execute(). Two statements are executed here. The first queries the current MySQL version, and fetchone() then returns the first row of the result, which contains the version number. The second creates a database named spiders with utf8 as its default character set. Because that statement is not a query, nothing needs to be fetched; once it has run, the spiders database exists and we use it for all subsequent operations.
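As a minimal sketch of the connection steps described above, the settings can be kept in one dictionary and the connection closed in a finally block so it is released even when a statement fails. The names DEFAULT_CONFIG, build_config, and server_version are hypothetical helpers introduced for illustration, not part of PyMySQL; only pymysql.connect(), cursor(), execute(), fetchone(), and close() come from the library.

```python
# Hypothetical helper names; the connection values are the ones assumed in this section.
DEFAULT_CONFIG = {
    'host': 'localhost',
    'user': 'root',
    'password': '123456',
    'port': 3306,
}


def build_config(**overrides):
    """Return a copy of DEFAULT_CONFIG with any overrides applied."""
    config = dict(DEFAULT_CONFIG)
    config.update(overrides)
    return config


def server_version(config):
    """Connect with the given settings and return the MySQL version string."""
    import pymysql  # imported here so the helpers above load without PyMySQL

    db = pymysql.connect(**config)
    try:
        with db.cursor() as cursor:
            cursor.execute('SELECT VERSION()')
            return cursor.fetchone()[0]
    finally:
        db.close()  # always release the connection, even on error
```

Calling server_version(build_config()) against a running local server should return the same version string that the listing above prints; for a remote server, build_config(host='<its address>') overrides only the host.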

3. Create a table

The database generally needs to be created only once, and it could just as well be created manually. All later operations work on this database, so from here on we specify spiders as the current database when connecting, and every operation runs inside it.

To do this, the connection takes one additional parameter, db.

Next we create a new table by executing a CREATE TABLE statement. We create a table called students with three fields, structured as follows:

| Field name | Meaning | Type |
|---|---|---|
| id | Student ID | varchar |
| name | Full name | varchar |
| age | Age | int |

Sample code for creating the table:

import pymysql

db = pymysql.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
cursor = db.cursor()
sql = 'CREATE TABLE IF NOT EXISTS students (id VARCHAR(255) NOT NULL, name VARCHAR(255) NOT NULL, age INT NOT NULL, PRIMARY KEY (id))'
cursor.execute(sql)
db.close()

After running this, a table called students exists with the three fields listed above.

For demonstration we specified only a few simple fields. In a real crawler, the fields are designed around the data that is crawled.
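Since the fields depend on what is crawled, one option is to generate the CREATE TABLE statement from a list of field definitions. The following is only a sketch: build_create_table is a hypothetical helper, it marks every column NOT NULL as the example table does, and it assumes the table and column names come from trusted code, because identifiers cannot be passed as %s parameters.

```python
def build_create_table(table, columns, primary_key):
    """Build a CREATE TABLE statement from (name, type) pairs.

    Assumes trusted identifiers; every column is declared NOT NULL.
    """
    column_defs = [
        '{name} {type} NOT NULL'.format(name=name, type=col_type)
        for name, col_type in columns
    ]
    column_defs.append('PRIMARY KEY ({key})'.format(key=primary_key))
    return 'CREATE TABLE IF NOT EXISTS {table} ({defs})'.format(
        table=table, defs=', '.join(column_defs))


sql = build_create_table(
    'students',
    [('id', 'VARCHAR(255)'), ('name', 'VARCHAR(255)'), ('age', 'INT')],
    'id',
)
print(sql)
```

For the three fields above this produces the same statement as the hand-written one, and it can be passed to cursor.execute() in the same way.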

4. Insert the data

The next step is to insert parsed data into the database. Suppose we have crawled one student record: the student ID is 20120001, the name is Bob, and the age is 20. The following code inserts this record into the database:

import pymysql

id = '20120001'
user = 'Bob'
age = 20

db = pymysql.connect(host='localhost', user='root', password='123456', port=3306, db='spiders')
cursor = db.cursor()
sql = 'INSERT INTO students(id, name, age) values(%s, %s, %s)'
try:
    cursor.execute(sql, (id, user, age))
    db.commit()
except:
    db.rollback()
db.close()

We first construct a SQL statement. The values are not spliced in with string concatenation such as:

sql = 'INSERT INTO students(id, name, age) values(' + id + ', ' + user + ', ' + str(age) + ')'

That style is cumbersome and hard to read, so we use %s placeholders instead: one %s for each value. The SQL statement is passed as the first argument to execute(), and all the values are passed together as a tuple in the second argument.

This avoids the trouble of string concatenation and also avoids problems with conflicting quotation marks, because the driver quotes and escapes each value itself.

Note that the db object's commit() method must be called for the insertion to take effect. This is the call that actually submits the changes to the database; inserts, updates, and deletes all need it.

We also add a layer of exception handling: if execution fails, rollback() is called to roll the data back, as if nothing had happened.

This touches on transactions. The transaction mechanism ensures data consistency: an operation either happens completely or does not happen at all. When inserting a record, for example, there is never a half-inserted state; the record is either fully inserted or not inserted. This is the atomicity of a transaction. Transactions have three other properties as well: consistency, isolation, and durability. Together they are usually called the ACID properties.
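The quotation mark problem is easy to reproduce. The sketch below uses Python's built-in sqlite3 module as a stand-in so that it runs without a MySQL server; that substitution is an assumption made for illustration, and sqlite3 writes its placeholder as ? where PyMySQL uses %s. The name O'Brien is an invented value containing a quote.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for MySQL.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE students (id TEXT PRIMARY KEY, name TEXT, age INTEGER)')

name = "O'Brien"  # a value that contains a single quote

# Concatenation: the quote inside the name ends the SQL string early.
bad_sql = "INSERT INTO students(id, name, age) VALUES ('1', '" + name + "', 20)"
try:
    conn.execute(bad_sql)
    concat_ok = True
except sqlite3.Error:
    concat_ok = False

# Placeholders: the driver handles the quoting of each value.
conn.execute('INSERT INTO students(id, name, age) VALUES (?, ?, ?)', ('1', name, 20))
stored = conn.execute('SELECT name FROM students').fetchone()[0]

print('Concatenation worked:', concat_ok)
print('Stored name:', stored)
```

The concatenated statement fails with a syntax error, while the placeholder version stores the name unchanged.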

They are summarized as follows:

| Property | Explanation |
|---|---|
| Atomicity | A transaction is an indivisible unit of work: either all of its operations are performed or none are. |
| Consistency | A transaction must move the database from one consistent state to another consistent state. Consistency and atomicity are closely related. |
| Isolation | The execution of one transaction cannot be interfered with by other transactions. The operations and data used inside a transaction are isolated from other concurrent transactions, so concurrently executing transactions cannot interfere with each other. |
| Durability | Also known as permanence: once a transaction commits, its changes to the data in the database are permanent. Subsequent operations or failures should not affect them. |

Inserts, updates, and deletes all change the database, and changes must be made as a transaction, so the standard pattern for these operations is:

try:
    cursor.execute(sql)
    db.commit()
except:
    db.rollback()

This ensures data consistency; the commit() and rollback() methods are what provide transaction support here.

We now know that insertion is done by constructing a SQL statement, but there is an obvious inconvenience. Suppose a new field such as gender is suddenly added. The SQL statement then has to be changed to:
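To see the all-or-nothing behaviour of the try / commit / except / rollback pattern, the sketch below inserts two rows in one transaction where the second row repeats the primary key. It again uses the built-in sqlite3 module as a stand-in for MySQL (an assumption for the sake of a runnable example, with ? placeholders instead of %s); insert_all is a hypothetical helper.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for a MySQL connection
cursor = conn.cursor()
cursor.execute('CREATE TABLE students (id TEXT PRIMARY KEY, name TEXT, age INTEGER)')
conn.commit()


def insert_all(rows):
    """Insert every row in one transaction; undo all of them if any fails."""
    try:
        for row in rows:
            cursor.execute('INSERT INTO students(id, name, age) VALUES (?, ?, ?)', row)
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()  # discards the rows inserted before the failure
        return False


# The second row repeats the primary key, so the whole batch is rolled back.
ok = insert_all([('20120001', 'Bob', 20), ('20120001', 'Mary', 21)])
count = cursor.execute('SELECT COUNT(*) FROM students').fetchone()[0]

# A batch without conflicts is committed as a whole.
ok2 = insert_all([('20120001', 'Bob', 20), ('20120011', 'Mary', 21)])
count2 = cursor.execute('SELECT COUNT(*) FROM students').fetchone()[0]

print(ok, count, ok2, count2)
```

After the failed batch the table is still empty, even though the first INSERT succeeded on its own; that is the atomicity described in the table above.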

INSERT INTO students(id, name, age, gender) values(%s, %s, %s, %s)

The parameter tuple has to change accordingly:

(id, name, age, gender)

This is clearly not what we want. In many cases we want a general insert method that never needs changing and simply accepts a dictionary whose contents can vary. For example, we construct a dictionary like this:

{
    'id': '20120001',
    'name': 'Bob',
    'age': 20
}

The SQL statement is then constructed dynamically from the dictionary, and so is the parameter tuple, which gives us a general insert method. The insert code is rewritten as follows:

data = {
    'id': '20120001',
    'name': 'Bob',
    'age': 20
}
table = 'students'
keys = ', '.join(data.keys())
values = ', '.join(['%s'] * len(data))
sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(table=table, keys=keys, values=values)
try:
    if cursor.execute(sql, tuple(data.values())):
        print('Successful')
        db.commit()
except:
    print('Failed')
    db.rollback()
db.close()

The data is passed in as a dictionary held in the variable data, and the table name is held in the variable table. Next we need to construct the SQL statement dynamically.

First we construct the list of fields to insert: id, name, and age. These are just the keys of data, joined with commas, so ', '.join(data.keys()) yields id, name, age. Then we construct the %s placeholders, one per field. With three fields we need %s, %s, %s, so we start with the one-element list ['%s'], multiply it by the number of fields to get ['%s', '%s', '%s'], and call join() to get %s, %s, %s. Finally the string format() method puts the table name, field names, and placeholders together, and the dynamically constructed SQL statement becomes:

INSERT INTO students(id, name, age) VALUES (%s, %s, %s)

Finally, the sql variable is passed as the first argument to execute() and a tuple of the values of data as the second, and the data is inserted successfully.

We can now insert data by passing in a dictionary, with no need to modify the SQL statement or the insert code.
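The construction steps above can be wrapped in a small function so the statement and its parameters are always built together. This is a sketch rather than a library API: build_insert is a hypothetical name, and it assumes the table name and dictionary keys are trusted, since they are formatted into the SQL text directly.

```python
def build_insert(table, data):
    """Return (sql, params) for inserting the dictionary data into table."""
    keys = ', '.join(data.keys())
    placeholders = ', '.join(['%s'] * len(data))  # one %s per field
    sql = 'INSERT INTO {table}({keys}) VALUES ({values})'.format(
        table=table, keys=keys, values=placeholders)
    return sql, tuple(data.values())


data = {'id': '20120001', 'name': 'Bob', 'age': 20}
sql, params = build_insert('students', data)
print(sql)
print(params)
```

The pair can then be handed to cursor.execute(sql, params) inside the usual try / commit / except / rollback block. Adding a gender key to the dictionary changes both the field list and the placeholders without touching the function.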

5. Update Data

Updating data also comes down to executing a SQL statement. The simplest way is to construct one and execute it:

sql = 'UPDATE students SET age = %s WHERE name = %s'
try:
    cursor.execute(sql, (25, 'Bob'))
    db.commit()
except:
    db.rollback()
db.close()

Here too the SQL is built with placeholders, the execute() method is called with the parameters passed as a tuple, and commit() is called to apply the change.

For simple updates, this method is perfectly adequate.

In real crawling, however, most of the time we are inserting data, and we worry about duplicates. If a record already exists, we usually want to update it rather than store it a second time. As mentioned above, there is also the matter of constructing the SQL dynamically. So here we implement a deduplicating approach: if the record already exists it is updated, and if it does not exist it is inserted. It also accepts values flexibly through a dictionary.

data = {
    'id': '20120001',
    'name': 'Bob',
    'age': 21
}

table = 'students'
keys = ', '.join(data.keys())
values = ', '.join(['%s'] * len(data))

sql = 'INSERT INTO {table}({keys}) VALUES ({values}) ON DUPLICATE KEY UPDATE'.format(table=table, keys=keys, values=values)
update = ','.join([" {key} = %s".format(key=key) for key in data])
sql += update
try:
    if cursor.execute(sql, tuple(data.values())*2):
        print('Successful')
        db.commit()
except:
    print('Failed')
    db.rollback()
db.close()

The SQL statement constructed here is an insert statement with ON DUPLICATE KEY UPDATE appended, which means that if the primary key already exists, an update is performed instead. For example, the data passed in still has the id 20120001, but the age has changed from 20 to 21. No new record is inserted; the record whose id is 20120001 is updated.

The complete SQL statement constructed looks like this:

INSERT INTO students(id, name, age) VALUES (%s, %s, %s) ON DUPLICATE KEY UPDATE id = %s, name = %s, age = %s

Compared with the insert statement described earlier, there is an extra part at the end listing the fields to update. ON DUPLICATE KEY UPDATE causes the existing record to be updated when the primary key already exists, and it is followed by the field assignments. The statement therefore contains six %s placeholders, so the tuple passed as the second argument to execute() is multiplied by 2 to double its length.

In this way, data is inserted if the primary key does not exist and updated if it does.
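The insert-or-update statement can be built by a function in the same way. The sketch below follows the construction described in this section, including the doubled parameter tuple; build_upsert is a hypothetical helper name, and the table name and keys are assumed to be trusted.

```python
def build_upsert(table, data):
    """Return (sql, params) for an INSERT ... ON DUPLICATE KEY UPDATE statement."""
    keys = ', '.join(data.keys())
    placeholders = ', '.join(['%s'] * len(data))
    assignments = ', '.join('{key} = %s'.format(key=key) for key in data)
    sql = ('INSERT INTO {table}({keys}) VALUES ({values}) '
           'ON DUPLICATE KEY UPDATE {assignments}').format(
        table=table, keys=keys, values=placeholders, assignments=assignments)
    # The values are needed twice: once for VALUES, once for the assignments.
    return sql, tuple(data.values()) * 2


data = {'id': '20120001', 'name': 'Bob', 'age': 21}
sql, params = build_upsert('students', data)
print(sql)
print(params)
```

The number of %s placeholders in sql always equals len(params), which is the property execute() relies on.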

6. Delete Data

Deletion is relatively simple: use a DELETE statement, specifying the target table name and the deletion condition. The db object's commit() method is still needed for it to take effect. An example:

table = 'students'
condition = 'age > 20'

sql = 'DELETE FROM {table} WHERE {condition}'.format(table=table, condition=condition)
try:
    cursor.execute(sql)
    db.commit()
except:
    db.rollback()

db.close()

Here we specify the table name and the deletion condition. Deletion conditions vary widely: the operators include greater than, less than, equal to, LIKE, and so on, and conditions can be joined with connectors such as AND and OR. Rather than build complicated logic to construct the condition, we pass it in directly as a string.

7. Query data

After insert, update, and delete, one very important operation remains: the query.

Queries use the SELECT statement. Let's look at an example:
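When the values inside the condition come from crawled data rather than from your own code, one possible variation is to keep the condition as a string but leave the values as %s placeholders, as in the insert and update examples. This is a sketch under that assumption; build_delete is a hypothetical helper.

```python
def build_delete(table, condition, params=()):
    """Return (sql, params) for a DELETE with a placeholder-based condition."""
    sql = 'DELETE FROM {table} WHERE {condition}'.format(
        table=table, condition=condition)
    return sql, tuple(params)


sql, params = build_delete('students', 'age > %s', (20,))
print(sql)
print(params)
```

The result is executed with cursor.execute(sql, params) followed by db.commit(), so the condition stays flexible while the values are still escaped by the driver.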

sql = 'SELECT * FROM students WHERE age >= 20'

try:
    cursor.execute(sql)
    print('Count:', cursor.rowcount)
    one = cursor.fetchone()
    print('One:', one)
    results = cursor.fetchall()
    print('Results:', results)
    print('Results Type:', type(results))
    for row in results:
        print(row)
except:
    print('Error')

Output:

Count: 4
One: ('20120001', 'Bob', 25)
Results: (('20120011', 'Mary', 21), ('20120012', 'Mike', 20), ('20120013', 'James', 22))
Results Type: <class 'tuple'>
('20120011', 'Mary', 21)
('20120012', 'Mike', 20)
('20120013', 'James', 22)

Here we construct a SQL statement that selects students aged 20 and over and pass it to execute(). Note that the db object's commit() method is not needed here. We can then read the cursor's rowcount property to get the number of rows in the result; in this example it is 4.

We then call fetchone(), which returns the first row of the result as a tuple. The elements of the tuple correspond to the fields in order: the first element is the first field, id, the second element is the second field, name, and so on. Next we call fetchall(), which returns all rows of the result, and print the result and its type. It is a tuple of tuples in which each element is one record. We then iterate over it and print the records one by one.

Notice a problem, though. The count is 4, and fetchall() is supposed to get all the data, so why does it return only three rows? The reason is that the cursor keeps an internal offset pointer into the query result. The pointer starts at the first row, and each fetch moves it forward, so the next fetch returns the next row. We called fetchone() once at the start, which moved the pointer to the second row. fetchall() returns everything from the pointer's current position to the end of the result, so it gets only the remaining three rows. This offset pointer is the concept to understand here.

We can therefore use a while loop with fetchone() to get all the data instead of pulling everything out at once with fetchall(). fetchall() returns the entire result as a tuple of tuples, and if the amount of data is large, the memory cost is high. Fetching rows one at a time as follows is recommended:
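The offset pointer behaviour is common to Python database cursors, so it can be reproduced without a MySQL server. The sketch below uses the built-in sqlite3 module as a stand-in (an assumption for illustration; it uses ? placeholders, returns a list from fetchall() rather than a tuple, and does not report rowcount for SELECT). The four rows are the ones shown in the output above.

```python
import sqlite3

conn = sqlite3.connect(':memory:')  # stand-in for the MySQL connection
cursor = conn.cursor()
cursor.execute('CREATE TABLE students (id TEXT PRIMARY KEY, name TEXT, age INTEGER)')
rows = [
    ('20120001', 'Bob', 25),
    ('20120011', 'Mary', 21),
    ('20120012', 'Mike', 20),
    ('20120013', 'James', 22),
]
cursor.executemany('INSERT INTO students(id, name, age) VALUES (?, ?, ?)', rows)
conn.commit()

cursor.execute('SELECT * FROM students WHERE age >= 20 ORDER BY id')
first = cursor.fetchone()     # pointer moves past the first row
rest = cursor.fetchall()      # everything from the pointer to the end
leftover = cursor.fetchone()  # nothing left, so this is None

print('First:', first)
print('Rest:', rest)
print('Leftover:', leftover)
```

Four rows match the query, but fetchall() returns only the three that follow the row already consumed by fetchone().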

sql = 'SELECT * FROM students WHERE age >= 20'
try:
    cursor.execute(sql)
    print('Count:', cursor.rowcount)
    row = cursor.fetchone()
    while row:
        print('Row:', row)
        row = cursor.fetchone()
except:
    print('Error')

Each iteration of the loop moves the pointer forward by one row, so rows are fetched as they are needed, which is simple and efficient.
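The while loop can also be packaged as a generator so that calling code can use a plain for loop. This is a sketch: iter_rows is a hypothetical helper that relies only on fetchone(), and the demonstration uses the built-in sqlite3 module as a stand-in for MySQL, with ? placeholders instead of %s.

```python
import sqlite3


def iter_rows(cursor):
    """Yield rows one at a time from a cursor that has executed a query."""
    row = cursor.fetchone()
    while row is not None:
        yield row
        row = cursor.fetchone()


conn = sqlite3.connect(':memory:')  # stand-in for the MySQL connection
cursor = conn.cursor()
cursor.execute('CREATE TABLE students (id TEXT PRIMARY KEY, name TEXT, age INTEGER)')
cursor.executemany(
    'INSERT INTO students(id, name, age) VALUES (?, ?, ?)',
    [('20120001', 'Bob', 25), ('20120011', 'Mary', 21), ('20120012', 'Mike', 19)],
)
conn.commit()

cursor.execute('SELECT * FROM students WHERE age >= 20 ORDER BY id')
names = [row[1] for row in iter_rows(cursor)]
print(names)
```

As far as I know, PyMySQL's default cursor has already transferred the whole result to the client by the time execute() returns, so for very large results the unbuffered pymysql.cursors.SSCursor is worth considering alongside row-by-row fetching.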

8. Conclusion

In this section we covered operating a MySQL database with PyMySQL and how to construct some SQL statements. We will apply them in real data storage examples later.


Origin blog.51cto.com/14445003/2426849