Ten tips to improve your efficiency with Pandas

1、read_csv

If the table you want to read is very large, try the chunksize parameter (e.g. chunksize=5). Instead of loading the entire table at once, this reads only a small part of it at a time.
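Here is a minimal sketch, assuming a large file named big_table.csv (a made-up file name):

import pandas as pd

# Read 1000 rows at a time instead of loading the whole file
reader = pd.read_csv('big_table.csv', chunksize=1000)
for chunk in reader:
    # each chunk is an ordinary DataFrame
    print(chunk.shape)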

2、select_dtypes

When using Python for data preprocessing, this command will save you time. After reading a table, the data type of each column may be bool, int64, float64, object, category, timedelta64 or datetime64. You can get an overview of all the data types with this command:

df.dtypes.value_counts()

Then run the following to select a sub-dataset with only the desired types:

df.select_dtypes(include=['float64', 'int64'])
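As a quick illustration with made-up data:

import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [0.5, 1.5], 'c': ['x', 'y']})
print(df.dtypes.value_counts())   # one int64, one float64, one object column
numeric = df.select_dtypes(include=['float64', 'int64'])
print(numeric.columns.tolist())   # ['a', 'b']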

3、copy

This is a very important command. If you execute the following command:

import pandas as pd
df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1
df2['a'] = df2['a'] + 1
df1.head()

You will find that df1 has changed too. This is because df2 = df1 does not make a copy of df1 and assign it to df2; it creates a pointer to df1. So any change to df2 also changes df1. To solve this problem, you can use either of these two methods:

df2 = df1.copy()

or

from copy import deepcopy
df2 = deepcopy(df1)
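To verify the fix, repeat the example with a copy; this time df1 stays untouched:

df1 = pd.DataFrame({'a': [0, 0, 0], 'b': [1, 1, 1]})
df2 = df1.copy()
df2['a'] = df2['a'] + 1
print(df1['a'].tolist())   # [0, 0, 0] - the original is unchanged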

4、map

This is a very cool command for simple data transformation. First, define a dictionary whose 'keys' are the old values and whose 'values' are the new values.

level_map = {1: 'high', 2: 'medium', 3: 'low'}
df['c_level'] = df['c'].map(level_map)
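Here is a complete sketch, assuming a column 'c' of numeric level codes:

import pandas as pd

df = pd.DataFrame({'c': [1, 3, 2, 1]})
level_map = {1: 'high', 2: 'medium', 3: 'low'}
df['c_level'] = df['c'].map(level_map)
print(df)
#    c c_level
# 0  1    high
# 1  3     low
# 2  2  medium
# 3  1    high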

5、apply

If we want to create a new column whose values are computed from other columns, the apply function is very useful.

def rule(x, y):
    if x == 'high' and y > 10:
        return 1
    else:
        return 0

df = pd.DataFrame({'c1': ['high', 'high', 'low', 'low'], 'c2': [0, 23, 17, 4]})
df['new'] = df.apply(lambda x: rule(x['c1'], x['c2']), axis=1)
df.head()

In the code above, we define a function rule(x, y) with two input variables, and use apply to run it on the "c1" and "c2" columns.

But the problem is that apply is sometimes too slow. For example, given two numeric columns "c1" and "c2", you could compute their row-wise maximum like this:

df['maximum'] = df.apply(lambda x: max(x['c1'], x['c2']), axis=1)

But it is much slower than the following command:

df['maximum'] = df[['c1', 'c2']].max(axis=1)

Note: if a built-in function can do the same job, avoid apply, because built-ins are usually faster. For example, to round column 'c' to integers, use round(df['c'], 0) or df['c'].round(0) instead of an apply function.
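A small sketch of the difference, with made-up numbers:

import pandas as pd

df = pd.DataFrame({'c': [1.24, 2.67, 3.51]})
df['c_round'] = df['c'].round(0)        # vectorized built-in: fast
# df['c_round'] = df['c'].apply(round)  # slower alternative via apply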

6、value_counts

This command checks the distribution of values. For example, to see how many distinct values column "c" contains and how often each one occurs, you can run:

df['c'].value_counts()

Below is a summary of some useful variants (a combined sketch follows the list):

A. normalize=True: show frequencies (proportions) instead of counts.

B. dropna=False: include missing values in the statistics.

C. sort=False: do not sort the result by count.

D. df['c'].value_counts().reset_index(): convert the statistics into a DataFrame for further processing.
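A combined sketch of these options on made-up data:

import pandas as pd
import numpy as np

df = pd.DataFrame({'c': ['x', 'x', 'y', np.nan]})
print(df['c'].value_counts())                 # counts, sorted by count
print(df['c'].value_counts(normalize=True))   # frequencies instead of counts
print(df['c'].value_counts(dropna=False))     # include the missing value
print(df['c'].value_counts(sort=False))       # leave the result unsorted
print(df['c'].value_counts().reset_index())   # as a DataFrame for reprocessing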

7、missing value statistics

When building a model, if you want to exclude rows that contain missing values, or count how many values are missing in each row, use .isnull() together with .sum().

import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3], 'c1': [0, 0, np.nan], 'c2': [np.nan, 1, 1]})
df = df[['id', 'c1', 'c2']]
df['num_nulls'] = df[['c1', 'c2']].isnull().sum(axis=1)
df.head()
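Building directly on the frame above, rows with missing values can then be excluded:

# keep only the rows that have no missing values in c1/c2
complete = df[df['num_nulls'] == 0]
print(complete)   # only the row with id 2 remains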

8、select rows with specific IDs

In SQL, we can use SELECT * FROM ... WHERE ID IN ('A001', 'C022', ...) to get records with specific IDs. In Pandas you can do:

df_filter = df['ID'].isin(['A001', 'C022', ...])
df[df_filter]
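A self-contained sketch with made-up IDs and values:

import pandas as pd

df = pd.DataFrame({'ID': ['A001', 'B005', 'C022'], 'value': [10, 20, 30]})
df_filter = df['ID'].isin(['A001', 'C022'])   # boolean mask
print(df[df_filter])
#      ID  value
# 0  A001     10
# 2  C022     30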

9、grouping by percentile

Suppose you have a numeric column and want to group its values by percentile: say, the top 5% of values into group 1, the next 15% (5%–20%) into group 2, the next 30% (20%–50%) into group 3, and the remaining 50% into group 4. Pandas of course offers many ways to do this, but here is a new approach that runs fast (because it avoids the apply function):

import numpy as np

cut_points = [np.percentile(df['c'], i) for i in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    # each comparison adds 1 for every cut point the value falls below
    df['group'] = df['group'] + (df['c'] < cut_points[i])
# or use <= cut_points[i]
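A quick check on made-up random data (1000 normally distributed values) confirms the group sizes:

import numpy as np
import pandas as pd

df = pd.DataFrame({'c': np.random.default_rng(0).normal(size=1000)})
cut_points = [np.percentile(df['c'], i) for i in [50, 80, 95]]
df['group'] = 1
for i in range(3):
    df['group'] = df['group'] + (df['c'] < cut_points[i])
print(df['group'].value_counts().sort_index())
# roughly: group 1 -> 5%, group 2 -> 15%, group 3 -> 30%, group 4 -> 50%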

10、to_csv

This is a common command that everyone uses. But I will point out two extra tips. The first is:

print(df[:5].to_csv())

This command prints exactly the first five rows of data as they will be written to the file.

The other trick is float_format='%.0f'.

It handles the case where integer and missing values are mixed together. If a column contains both missing values and integers, its data type becomes float instead of int. Adding float_format='%.0f' when exporting the table rounds all floats to whole numbers. If all the output columns are supposed to be integers, this trick also gets rid of the annoying '.0' suffixes.
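A minimal sketch of the effect:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [1, 2, 3], 'c': [4.0, np.nan, 6.0]})   # c became float because of the NaN
print(df.to_csv(float_format='%.0f'))
# ,id,c
# 0,1,4
# 1,2,
# 2,3,6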

 
