Crawling 100 GB of data (more than 20 million records) with Python: MySQL or MongoDB?


We can answer this question from two angles. One is how MySQL and MongoDB differ in storing and reading 100 GB of data. The other is the structure of the data itself and the application you have in mind, which together determine which database is more convenient.

100G data volume

Storing 100 GB puts no real pressure on either MySQL or MongoDB today. If you need to read the data frequently, I recommend MongoDB. The reason is that MongoDB makes full use of the system's memory: its older MMAPv1 storage engine memory-mapped the data files, and the current default engine, WiredTiger, keeps its own in-memory cache. The more of the working set fits in memory, the faster MongoDB's queries are. After all, disk I/O and memory access are not in the same order of magnitude.

If query speed is not critical, the data you crawled is structured, and you are familiar with MySQL syntax and operations, MySQL can store data at this scale as well. It amounts to storing one very large Excel table.

data structure

If the data structure is inconsistent, that is, some rows lack fields that other rows have, then I recommend MongoDB. MongoDB does not enforce a strict schema and stores records as JSON-like documents. If the fields your crawler extracts change frequently, MongoDB is lenient about it and staying compatible is easy. But if you have transaction requirements, MySQL is the better choice: as a NoSQL database, MongoDB was not originally designed around transactions (multi-document transactions only arrived in version 4.0), so the right choice depends on your requirements.
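To make the point about inconsistent fields concrete, here is a minimal sketch in plain Python with no database involved. The records and field names are made up for illustration. It shows what a fixed-column table would need (the union of all fields, with gaps filled by None), whereas a document store could keep each dict as it is.

```python
# Hypothetical crawler output: each page yields a slightly different set of fields.
records = [
    {"url": "https://example.com/a", "title": "A", "price": 9.9},
    {"url": "https://example.com/b", "title": "B"},                      # no price
    {"url": "https://example.com/c", "title": "C", "tags": ["x", "y"]},  # extra field
]


def union_of_fields(docs):
    """Collect every field name seen, in first-seen order.

    This is the column list a fixed-schema table would have to declare.
    """
    columns = []
    for doc in docs:
        for key in doc:
            if key not in columns:
                columns.append(key)
    return columns


def to_fixed_rows(docs, columns):
    """Pad each record to the full column list, filling missing fields with None."""
    return [tuple(doc.get(col) for col in columns) for doc in docs]


columns = union_of_fields(records)
rows = to_fixed_rows(records, columns)

print(columns)
for row in rows:
    print(row)
```

Every new field the crawler starts emitting adds a column to the fixed layout, which in a relational database means altering the table. The document form needs no such change.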

All in all, I personally recommend MongoDB for storing massive crawler datasets. Both for irregular data structures and for read and write speed, MongoDB is up to the job, and it scales horizontally with little effort through sharding and replication.

However, the final choice depends on your needs: you still have to weigh how often you write, how much data you have, and how you will use it.

Every technology has its own application scenarios, and the choice has to fit yours.

As a rule of thumb, MySQL performance drops noticeably once a table holds more than about 10 million rows, although master-slave replication or middleware can improve it. For MongoDB, 100 GB and 20 million records is a normal workload, and it is simpler to manage than MySQL. However, if you need strong transactional guarantees or strong consistency, MongoDB may not meet the requirements.

Much of the data a crawler collects can be stored directly in unstructured form. The fields are not yet settled, and because the volume is so large you will not analyze it immediately, so I recommend storing it in MongoDB and leaving the data mining for the next step.

Next, let me walk through the differences between MySQL and MongoDB.


Insertion Stability Analysis

Insertion stability describes how the insertion rate changes as the amount of data already in the database grows.

In this test the measurement window is 100,000 rows: each data point shows how many rows per second were inserted while one batch of 100,000 rows was being written.
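The measurement described above can be sketched as follows. This is not the author's test harness: `measure_insert_rates` and `insert_into_dict` are hypothetical names, and a plain dict stands in for the database so the example runs without a server. In a real test, the callable would wrap the MySQL or MongoDB insert.

```python
import time


def measure_insert_rates(insert_batch, total_rows, batch_size=100_000):
    """Return the rows-per-second rate for each consecutive batch.

    insert_batch(start, count) is any callable that inserts `count` rows
    whose keys begin at `start`.
    """
    rates = []
    for start in range(0, total_rows, batch_size):
        count = min(batch_size, total_rows - start)
        t0 = time.perf_counter()
        insert_batch(start, count)
        elapsed = time.perf_counter() - t0
        # Guard against a zero reading from the timer on very fast batches.
        rates.append(count / elapsed if elapsed > 0 else float("inf"))
    return rates


# Stand-in "database": a dict keyed by primary key (assumption for illustration).
store = {}


def insert_into_dict(start, count):
    for i in range(start, start + count):
        store[i] = {"value": i}


rates = measure_insert_rates(insert_into_dict, total_rows=300_000)
for n, rate in enumerate(rates, start=1):
    print(f"batch {n}: {rate:,.0f} rows/s")
```

Plotting one point per batch gives the kind of curve the article's charts describe: a flat line means stable insertion, while dips mark the glitches.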

The test produced four charts (the images are not included here):

1. MongoDB, inserting with _id specified:

2. MongoDB, inserting without specifying _id:

3. MySQL, inserting with PRIMARY KEY specified:

4. MySQL, inserting without specifying PRIMARY KEY:

analyze:

1. Without query optimization, MySQL cannot match MongoDB's query speed. MongoDB makes full use of the system's memory. Our test machine has 64 GB, and the more memory there is, the faster MongoDB queries run.

2. The queried data in this experiment is also randomly generated, so it is very unlikely that everything being queried already sits in MongoDB's memory cache. MongoDB therefore has to go to disk several times per query, and its query rate depends on how many disk reads are needed. It can happen that a large query is served with relatively few disk reads, which makes its average speed higher. Seen this way, the fluctuations in MongoDB's query speed are within a reasonable range.

3. MySQL's stability is beyond doubt.

Among the data stored in a database there is a special key called the primary key, which uniquely identifies a record in a table. A table cannot have more than one primary key, and the primary key cannot be null.

Both MongoDB and MySQL have the concept of a primary key.

In MongoDB the primary key field is named "_id". If the user does not assign one when creating a document, MongoDB generates an ObjectId for it automatically.

In MySQL the primary key is declared with PRIMARY KEY when the table is defined, and its value is supplied when a row is inserted. When no primary key is declared, an index can take over part of its role. An ordinary index may contain empty or duplicate values, while a unique index does not allow duplicates. If neither a primary key nor a suitable unique index exists, the InnoDB storage engine automatically creates a hidden row ID to identify each row.
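The two behaviours described above, a declared primary key rejecting duplicates and the engine assigning a hidden identifier when no key is declared, can be tried without a server. The sketch below uses Python's built-in sqlite3 module as a stand-in for MySQL. SQLite's implicit `rowid` is only an analogue of InnoDB's hidden row ID, not the same mechanism, and the table names are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One table declares a primary key, the other declares none.
conn.execute("CREATE TABLE with_pk (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE TABLE without_pk (title TEXT)")

# A declared primary key must be unique: the second insert is rejected.
conn.execute("INSERT INTO with_pk (id, title) VALUES (1, 'first')")
try:
    conn.execute("INSERT INTO with_pk (id, title) VALUES (1, 'duplicate')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Without a declared key, SQLite still identifies each row by an implicit rowid.
conn.execute("INSERT INTO without_pk (title) VALUES ('a')")
conn.execute("INSERT INTO without_pk (title) VALUES ('b')")
hidden_ids = [row[0] for row in conn.execute("SELECT rowid FROM without_pk ORDER BY rowid")]

print(duplicate_rejected)
print(hidden_ids)
conn.close()
```

The uniqueness check on every insert is part of the extra work a database does when the caller supplies the key, which is relevant to the insertion-rate comparison that follows.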

Summarize:

1. The overall insertion speed ranks roughly as in the earlier statistics: MongoDB without a specified _id > MySQL without a specified primary key > MySQL with a specified primary key > MongoDB with a specified _id.

2. As the charts show, when the primary key is specified, both MySQL and MongoDB see the per-second insertion rate dip at regular intervals as the data grows by orders of magnitude, which appears as regular glitches in the charts. When the primary key is not specified, the insertion rate is fairly even most of the time, but as the database grows the rate drops briefly at certain points and then stabilizes again.

3. Overall, MongoDB's rate fluctuates more than MySQL's, with a much larger variance.

4. When MongoDB inserts with a specified _id, insertion efficiency drops significantly as the amount of inserted data grows. In the other three tests the insertion rate stayed close to a constant level most of the time from start to finish.
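One way to put numbers on "fluctuation" and "glitches" is sketched below. The function name, the 0.5 dip threshold, and the two sample series are assumptions made up for illustration. They are not the article's measurements.

```python
import statistics


def fluctuation_report(rates, dip_ratio=0.5):
    """Summarize a series of per-batch insertion rates (rows/s).

    cv   : coefficient of variation (population stdev / mean); larger means less stable.
    dips : indexes of batches whose rate fell below dip_ratio * median.
    """
    mean = statistics.mean(rates)
    stdev = statistics.pstdev(rates)
    median = statistics.median(rates)
    dips = [i for i, r in enumerate(rates) if r < dip_ratio * median]
    return {"mean": mean, "cv": stdev / mean, "dips": dips}


# Made-up example series, NOT measured data.
steady = [10_000, 10_200, 9_900, 10_100, 9_800, 10_000]
spiky = [12_000, 11_500, 3_000, 12_500, 11_800, 2_500]

steady_report = fluctuation_report(steady)
spiky_report = fluctuation_report(spiky)

print(steady_report)
print(spiky_report)
```

A series can have the higher mean rate and still be the less stable one, which is the distinction the summary draws between MongoDB and MySQL.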

analyze:

1. The glitches occur because, once enough data has been inserted, MongoDB has to write in-memory data to disk and MySQL has to re-partition the table. These operations are triggered automatically each time the data reaches a certain size, so a noticeable glitch appears at intervals.

2. MongoDB is still relatively new, and its stability is not as good as that of MySQL, which has been in use for many years.

3. When MongoDB inserts with a specified _id, its performance degrades considerably.

1. When the amount of data read is small, MongoDB's query speed is far ahead of MySQL's.

2. As the amount of data queried grows, MySQL's query speed declines steadily, while MongoDB's fluctuates.

