What to pay attention to when Python processes a CSV file with tens of millions of rows

When you are only computing statistics over a few hundred thousand records, you can usually just load the data into a list, run the calculations, and be done with it.
But when the table you are dealing with has millions of records, it becomes hard to inspect with a traditional Excel worksheet: an .xls sheet holds at most 65,536 rows, and even when you open a CSV in Excel, only the first 1024*1024 = 1,048,576 rows can be displayed (as far as I know), which is just over one million. So a table this large has to be consulted with a different tool: a database.
What should you pay attention to before and after importing a table with tens of millions of rows into a database, and why? Let's go through it briefly.

Database storage

Storing your data in a database keeps it in a friendly form and makes it easy to query and compute statistics, but the following points need attention:

1. Data type selection when building a table

We know that when creating a table, we want the content of the database table to match the original data exactly. So when building the table, choose an appropriate data type and length for each column: INT, CHAR, TEXT, FLOAT, and so on.
Misunderstanding: using the same data types as the original table will definitely give you the same data.
A related question:
If you read the CSV file with pandas' default dtype, will the values you get be exactly the same as in the original file?
The answer is: not necessarily!
A quick case: I imported a table that contains a float field, reading it with pandas' default dtype so that the inferred column type matched the original table (float). The resulting database table looked like this:

[screenshot: imported table, float values slightly altered]

However, the table should actually look like this:

[screenshot: original table]

You can see at a glance that the data in the first table has drifted slightly from the actual values. This is a precision-loss problem. In engineering terms the error is negligible, but it is a real nuisance in calculations, and it changes the appearance of the original data.
The solution is to read everything in as strings (dtype=str).
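To make the precision issue concrete, here is a minimal sketch; the value is made up and just long enough to exceed float64 precision:

import io
import pandas as pd

csv_text = "0.12345678901234567890123\n"   # hypothetical CSV content: one overly precise value

default_df = pd.read_csv(io.StringIO(csv_text), header=None)           # pandas infers float64
text_df = pd.read_csv(io.StringIO(csv_text), header=None, dtype=str)   # keep the raw text

print(default_df[0][0])   # the trailing digits have been rounded away
print(text_df[0][0])      # exactly the text stored in the file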

Therefore, when importing a table, choose data types according to the actual situation instead of always accepting the defaults. When reading with pandas, you can refer to this:

import pandas as pd

csvPath = r"name.csv"
p = pd.read_csv(csvPath, header=None, dtype=str, engine='python')   # read every column as text
print(p.head())

When the data volume is large and you do not know in advance how long the values in a column can be, it is safest to pick a generously sized type. You can start with VARCHAR(255); if an insert fails because some record is suddenly too long, switch that column to a larger type such as TEXT. You may wonder why not just use TEXT from the start (in MySQL, TEXT columns are stored and indexed less conveniently than VARCHAR, so they are usually not the default choice).
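For reference, a minimal sketch of a table definition along these lines; the table name is the one used later in this post, and the column types and sizes are illustrative guesses, not the actual schema:

create_sql = """
CREATE TABLE `A_migration_copy1_copy1` (
    `move_date`  VARCHAR(255),
    `start_city` VARCHAR(255),
    `end_city`   VARCHAR(255),
    `move_index` VARCHAR(255)   -- kept as text to dodge the float precision issue above
) DEFAULT CHARSET = utf8mb4;
"""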

2. Encoding and preprocessing when building a table

You should pay attention to this regardless of the table's size, but with a large table a mistake costs far more time!!!

  1. For the database encoding, just pick utf8mb4, no questions asked... it reduces the exceptions thrown by certain special characters.
  2. In addition, pay special attention to strings that contain quotation marks and #: you will find they cause errors when the text is exported and dropped back into code. Why? Because # starts a comment in Python, and once an embedded quotation mark closes the string literal early, everything after the # on that line is treated as a comment. Please see:
    Suppose a text is: [Today's topic "#Talk about the teasers around you", what do you all want to say.]
    After importing it into the database and taking the text back out for processing, you get:

a = "Today's topic "#Talk about the teasers around you", what do you all want to say."
# the embedded quote has already closed the string, so Python treats everything after # as a comment

The text at this point is no longer the original text: part of it has silently turned into a comment...
So how do you solve this? There is no single solution, but there are easy ones.
First, I suggest preprocessing the data before storing it: remove or replace the quotation marks and # characters according to your actual situation (see the sketch below).
Second, weighing the number of errors against the size of the data, you can also consider exception handling plus a regular expression (re) to match the content you need...
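A minimal preprocessing sketch along the lines of the first option; the replacement characters chosen here are arbitrary and should be adapted to your own data:

import pandas as pd

def clean_text(value):
    # Replace or strip the characters that later break quoting or act as comments;
    # swapping " for ' and dropping # are arbitrary choices, adjust them as needed.
    if isinstance(value, str):
        return value.replace('"', "'").replace('#', '')
    return value

csvPath = r"name.csv"
p = pd.read_csv(csvPath, header=None, dtype=str, engine='python')
p = p.applymap(clean_text)   # clean every cell before generating SQL / writing to the database

If you insert the rows from Python anyway, parameterized queries (cursor.executemany with %s placeholders, as sketched near the end of this post) let the driver handle the quoting for you and sidestep the problem entirely.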

Next, a little more on this point.

Maybe after hitting this situation you start adding exception handling and using re to match the text you need, which shows your code quality is high... but it is really a symptom of a storage design flaw that was not avoided up front. A good data structure and storage scheme saves you redundant, cumbersome processing later. Think the overall design through clearly: we are talking about tables with millions or tens of millions of rows!

3. Adding indexes and auto-increment when creating a table

Adding an index makes it possible to locate a particular record quickly, which is very helpful in practice and makes many problems easier to solve; at the very least you can tell where an error occurred. An auto-increment primary key, in turn, gives every row a stable id to index and refer to.
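As a hedged sketch of what that can look like, building on the hypothetical table definition from section 1:

# Hypothetical example: give each row an auto-increment primary key and add an
# index on the date column so that individual records can be located quickly.
alter_sql = """
ALTER TABLE `A_migration_copy1_copy1`
    ADD COLUMN `id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY FIRST,
    ADD INDEX `idx_move_date` (`move_date`);
"""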

4. Comparison between Navicat import and program import

Navicat is simply a godsend for database management. It can create databases, import and view data, run statistics, create and modify tables, transfer tables between servers, and more; it is great if you cannot remember the commands.
If you have a cloud database and a large table with tens of millions of rows to import, what should you focus on?

  1. When the imported data table comes without any field descriptions, you need to design the field list yourself. Once the fields are designed, during the import:

    [screenshot of the import/field-mapping dialog]

    No need to say more; you will know what to do when you see it.
  2. When your database is a cloud database and the table is very large, importing a single table usually takes several hours. During that time, make sure your home network is stable, your computer does not go to sleep, and similar factors do not interrupt the job.
    ** Please do not imitate the following: I recently imported a table of more than 35 million rows, loading the local CSV file straight into the cloud database, which took about 4 hours. Because I stopped watching it after starting the import, assuming the network was stable and so on, import errors appeared two or three hours in, probably because the network dropped the connection (I forgot to take a screenshot)... so I wasted those hours all over again!!!
  3. Next, after the Navicat import failed, I tried importing with a program instead. My design idea was: first read the file with pandas, load it into memory, generate a batch INSERT command, and then run the import.
    The code is below.
    A warning first: small tables can refer to the following code, but with a large table your CPU may not survive!!! Don't imitate this.

# coding: utf-8
import pandas as pd
import pymysql

# Batch write into the database (one giant INSERT -- see the warning above)
def writedata(data):
    conn = pymysql.connect(host='cdb-xxxxxxxxxxxxxx.com', port=xxxxx, user='root',
                           passwd='xxxx!', db="city_A", charset='utf8mb4')
    cur = conn.cursor()          # cursor used to send SQL commands from Python
    cur.execute('show tables;')
    print(cur.fetchall())

    sql = "insert into `A_migration_copy1_copy1` (move_date,start_city,end_city,move_index) values {}".format(data)

    cur.execute(sql)
    conn.commit()
    cur.close()
    conn.close()


# Generate the batch of values to import
def insertVals(csvPath: str):
    p = pd.read_csv(csvPath, header=None, dtype=str, engine='python')   # read every cell as text
    arrayP = p.values
    valuestr = []
    for i in arrayP:
        valuestr.append(tuple(i))    # one tuple per row: (move_date, start_city, end_city, move_index)
    return str(valuestr)[1:-1]       # strip the outer square brackets -- crude, but it works


if __name__ == '__main__':
    csvPath = r"name.csv"
    writedata(insertVals(csvPath))

Then after waiting for more than an hour:


In the end, the CPU couldn't take it any more and gave up.

  1. Continuing: with the approach above of reading everything with pandas and then firing one huge batch insert, the CPU can't cope. Is there a better way? Yes, we could insert in batches of 1,000 or 10,000 rows at a time (see the sketch after this list), but going through pandas still felt far too slow to me!!! I had to come up with a more reasonable solution.
  2. Final solution:
    1. Use Navicat to import the CSV into a local database first
    2. Establish a connection from the local machine to the cloud database
    3. Transfer the table from the local database to the cloud database


    First, importing into the local database is about as fast as importing into the cloud database, but the local import needs no network; the whole csv -> table step happens locally!
    Second, the transfer from the local database to the cloud database is table -> table, and Navicat does it very quickly: moving the table of more than 35 million rows to the cloud took less than half an hour.
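For completeness, here is a minimal sketch of the "insert in batches of 1,000 or 10,000 rows" idea mentioned above, using pandas' chunksize together with pymysql's executemany and %s placeholders; the connection details are placeholders in the spirit of the code earlier in this post:

import pandas as pd
import pymysql

def import_in_chunks(csvPath, chunk_rows=10000):
    conn = pymysql.connect(host='cdb-xxxxxxxxxxxxxx.com', port=3306,   # 3306 is the MySQL default, adjust to your instance
                           user='root', passwd='xxxx!', db="city_A", charset='utf8mb4')
    cur = conn.cursor()
    sql = ("insert into `A_migration_copy1_copy1` "
           "(move_date, start_city, end_city, move_index) values (%s, %s, %s, %s)")
    # chunksize makes read_csv yield DataFrames of at most chunk_rows rows,
    # so the whole file is never held in memory at once
    for chunk in pd.read_csv(csvPath, header=None, dtype=str, chunksize=chunk_rows):
        cur.executemany(sql, [tuple(row) for row in chunk.values])   # the driver escapes each value
        conn.commit()
    cur.close()
    conn.close()

if __name__ == '__main__':
    import_in_chunks(r"name.csv")

Whether this beats the local-import-then-transfer route depends on your network and hardware; as noted above, in my case the table-to-table transfer with Navicat was still the faster option.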

Summary

When a table is small, it can be modified at any time to suit whatever the application needs; when a table is huge, changing it is cumbersome and error-prone, and working around a poor layout in code means extra, complex processing.
So before designing the storage table, think through the table's encoding, the storage type of each field, and the most suitable way to store the data.

Origin blog.csdn.net/zxj19880502/article/details/129265851