16 - Big Data Processing Techniques -- Data Analysis

Big Data Processing Techniques

In [1]:
import pandas as pd
gl = pd.read_csv('game_logs.csv')
gl.head()
C:\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py:2717: DtypeWarning: Columns (12,13,14,15,19,20,81,83,85,87,93,94,95,96,97,98,99,100,105,106,108,109,111,112,114,115,117,118,120,121,123,124,126,127,129,130,132,133,135,136,138,139,141,142,144,145,147,148,150,151,153,154,156,157,160) have mixed types. Specify dtype option on import or set low_memory=False.
  interactivity=interactivity, compiler=compiler, result=result)
Out[1]:
  date number_of_game day_of_week v_name v_league v_game_number h_name h_league h_game_number v_score ... h_player_7_name h_player_7_def_pos h_player_8_id h_player_8_name h_player_8_def_pos h_player_9_id h_player_9_name h_player_9_def_pos additional_info acquisition_info
0 18710504 0 Thu CL1 na 1 FW1 na 1 0 ... Ed Mincher 7.0 mcdej101 James McDermott 8.0 kellb105 Bill Kelly 9.0 NaN Y
1 18710505 0 Fri BS1 na 1 WS3 na 1 20 ... Asa Brainard 1.0 burrh101 Henry Burroughs 9.0 berth101 Henry Berthrong 8.0 HTBF Y
2 18710506 0 Sat CL1 na 2 RC1 na 1 12 ... Pony Sager 6.0 birdg101 George Bird 7.0 stirg101 Gat Stires 9.0 NaN Y
3 18710508 0 Mon CL1 na 3 CH1 na 1 12 ... Ed Duffy 6.0 pinke101 Ed Pinkham 5.0 zettg101 George Zettlein 1.0 NaN Y
4 18710509 0 Tue BS1 na 2 TRO na 1 9 ... Steve Bellan 5.0 pikel101 Lip Pike 3.0 cravb101 Bill Craver 6.0 HTBF Y

5 rows × 161 columns

In [2]:
gl.shape
Out[2]:
(171907, 161)
In [3]:
gl.info(memory_usage='deep')
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 171907 entries, 0 to 171906
Columns: 161 entries, date to acquisition_info
dtypes: float64(77), int64(6), object(78)
memory usage: 860.5 MB
In [4]:
# Average memory usage per column, grouped by dtype
for dtype in ['float64', 'int64', 'object']:
    selected_dtype = gl.select_dtypes(include=[dtype])
    mean_usage_b = selected_dtype.memory_usage(deep=True).mean()
    mean_usage_mb = mean_usage_b / 1024 ** 2
    print('Average memory usage', dtype, mean_usage_mb)
Average memory usage float64 1.2947326073279748
Average memory usage int64 1.1241934640066964
Average memory usage object 9.514454069016855
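The object columns dominate because each cell points to a separate Python string, and only `deep=True` counts those strings. A minimal sketch of the difference, using a hypothetical toy column rather than the game log data:

```python
import pandas as pd

# Hypothetical toy column; dtype=object mirrors how read_csv stored the strings above.
names = pd.Series(['Charlie Grimm', 'Al Lopez', 'Warren Spahn'] * 1000, dtype=object)

shallow_b = names.memory_usage(deep=False)  # pointers and index only
deep_b = names.memory_usage(deep=True)      # also counts the Python string objects

print(shallow_b, deep_b)
```

The shallow figure is what `info()` reports by default, so it can badly underestimate string-heavy frames.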
In [5]:
import numpy as np
int_types = ['uint8','int8','int16','int32','int64']
for it in int_types:
    print (np.iinfo(it))
Machine parameters for uint8
---------------------------------------------------------------
min = 0
max = 255
---------------------------------------------------------------

Machine parameters for int8
---------------------------------------------------------------
min = -128
max = 127
---------------------------------------------------------------

Machine parameters for int16
---------------------------------------------------------------
min = -32768
max = 32767
---------------------------------------------------------------

Machine parameters for int32
---------------------------------------------------------------
min = -2147483648
max = 2147483647
---------------------------------------------------------------

Machine parameters for int64
---------------------------------------------------------------
min = -9223372036854775808
max = 9223372036854775807
---------------------------------------------------------------
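The ranges above can be turned into a small lookup. The following is a sketch only: `smallest_int_dtype` is a hypothetical helper written for illustration, not a pandas or NumPy function.

```python
import numpy as np

def smallest_int_dtype(lo, hi):
    """Return the name of the narrowest integer dtype that can hold [lo, hi]."""
    # Candidates ordered by width; unsigned is tried first at each width.
    candidates = ['uint8', 'int8', 'uint16', 'int16',
                  'uint32', 'int32', 'uint64', 'int64']
    for name in candidates:
        info = np.iinfo(name)
        if info.min <= lo and hi <= info.max:
            return name
    raise ValueError('range does not fit in a 64-bit integer')

print(smallest_int_dtype(0, 255))
print(smallest_int_dtype(-1, 100))
print(smallest_int_dtype(0, 18710504))
```

A date stored as an integer such as 18710504 is too large for 16 bits, which is why it needs a 32-bit type.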

In [6]:
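def mem_usage(pandas_obj):
    """Return the deep memory usage of a DataFrame or Series as a string in MB."""
    if isinstance(pandas_obj, pd.DataFrame):
        # DataFrame.memory_usage returns one value per column, so sum them
        usage_b = pandas_obj.memory_usage(deep=True).sum()
    else:
        # Series.memory_usage already returns a single number
        usage_b = pandas_obj.memory_usage(deep=True)
    usage_mb = usage_b / 1024 ** 2
    return '{:03.2f} MB'.format(usage_mb)

# Downcast every int64 column to the smallest unsigned integer type that fits
gl_int = gl.select_dtypes(include=['int64'])
coverted_int = gl_int.apply(pd.to_numeric, downcast='unsigned')
print(mem_usage(gl_int))
print(mem_usage(coverted_int))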
7.87 MB
1.48 MB
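To see what `downcast` does column by column, here is a small sketch on made-up Series (the values are illustrative, not taken from the game logs). As far as I understand the behavior, a column containing negative values is left as a signed type under `downcast='unsigned'`, so `downcast='integer'` is the option to use there.

```python
import pandas as pd

small = pd.Series([0, 100, 255], dtype='int64')
medium = pd.Series([0, 70000], dtype='int64')
signed = pd.Series([-1, 100], dtype='int64')

small_u = pd.to_numeric(small, downcast='unsigned')    # fits in 8 bits
medium_u = pd.to_numeric(medium, downcast='unsigned')  # needs 32 bits
signed_u = pd.to_numeric(signed, downcast='unsigned')  # has a negative value
signed_i = pd.to_numeric(signed, downcast='integer')   # signed downcast instead

print(small_u.dtype, medium_u.dtype, signed_u.dtype, signed_i.dtype)
```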
In [7]:
gl_float = gl.select_dtypes(include=['float64'])
converted_float = gl_float.apply(pd.to_numeric,downcast='float')
print(mem_usage(gl_float))
print(mem_usage(converted_float))
100.99 MB
50.49 MB
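Halving the width of a float column also reduces its precision: float32 carries roughly 7 significant decimal digits. A minimal NumPy sketch of the trade-off, with made-up values:

```python
import numpy as np

values64 = np.array([0.5, 16777217.0], dtype='float64')
values32 = values64.astype('float32')

# Half the bytes per element ...
print(values64.nbytes, values32.nbytes)
# ... but 2**24 + 1 cannot be represented exactly in float32
print(values64[1], float(values32[1]))
```

For columns such as counts or defensive positions this loss is irrelevant, but it is worth checking before downcasting measurements that need many significant digits.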
In [8]:
optimized_gl = gl.copy()
optimized_gl[coverted_int.columns] = coverted_int
optimized_gl[converted_float.columns] = converted_float
print(mem_usage(gl))
print(mem_usage(optimized_gl))
860.50 MB
803.61 MB
In [9]:
gl_obj = gl.select_dtypes(include = ['object']).copy()
gl_obj.describe()
Out[9]:
  day_of_week v_name v_league h_name h_league day_night completion forefeit protest park_id ... h_player_6_id h_player_6_name h_player_7_id h_player_7_name h_player_8_id h_player_8_name h_player_9_id h_player_9_name additional_info acquisition_info
count 171907 171907 171907 171907 171907 140150 116 145 180 171907 ... 140838 140838 140838 140838 140838 140838 140838 140838 1456 140841
unique 7 148 7 148 7 2 116 3 5 245 ... 4774 4720 5253 5197 4760 4710 5193 5142 332 1
top Sat CHN NL CHN NL D 19210630,,3,2,45 H V STL07 ... grimc101 Charlie Grimm grimc101 Charlie Grimm lopea102 Al Lopez spahw101 Warren Spahn HTBF Y
freq 28891 8870 88866 9024 88867 82724 1 69 90 7022 ... 427 427 491 491 676 676 339 339 1112 140841

4 rows × 78 columns

In [10]:
dow = gl_obj.day_of_week
dow.head()
Out[10]:
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: object
In [11]:
dow_cat = dow.astype('category')
dow_cat.head()
Out[11]:
0    Thu
1    Fri
2    Sat
3    Mon
4    Tue
Name: day_of_week, dtype: category
Categories (7, object): [Fri, Mon, Sat, Sun, Thu, Tue, Wed]
In [13]:
dow_cat.head(10).cat.codes
Out[13]:
0    4
1    0
2    2
3    1
4    5
5    4
6    2
7    2
8    1
9    5
dtype: int8
In [14]:
print (mem_usage(dow))
print (mem_usage(dow_cat))
9.84 MB
0.16 MB
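The saving comes from storing each distinct string once and keeping only a small integer code per row. The same effect can be reproduced on a toy Series (hypothetical data, built here with `dtype=object` to match the column above):

```python
import pandas as pd

days = pd.Series(['Thu', 'Fri', 'Sat', 'Mon', 'Tue'] * 1000, dtype=object)
days_cat = days.astype('category')

print(list(days_cat.cat.categories))        # distinct values, stored once
print(days_cat.cat.codes.head().tolist())   # one small integer per row
print(days.memory_usage(deep=True), days_cat.memory_usage(deep=True))
```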
In [15]:
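converted_obj = pd.DataFrame()
for col in gl_obj.columns:
    num_unique_values = len(gl_obj[col].unique())
    num_total_values = len(gl_obj[col])
    # Convert to category only when fewer than half of the values are unique;
    # otherwise the category table would be nearly as large as the column itself
    if num_unique_values / num_total_values < 0.5:
        # Plain column assignment keeps the category dtype
        converted_obj[col] = gl_obj[col].astype('category')
    else:
        converted_obj[col] = gl_obj[col]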
In [16]:
print(mem_usage(gl_obj))
print(mem_usage(converted_obj))
751.64 MB
51.67 MB
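The same rule can be packaged as a reusable function. This is a sketch under assumptions: `to_category_if_repetitive` and its `max_unique_ratio` parameter are hypothetical names, the toy frame is invented, and the input is assumed to be non-empty.

```python
import pandas as pd

def to_category_if_repetitive(df, max_unique_ratio=0.5):
    """Convert columns to category when their share of unique values is low."""
    converted = pd.DataFrame(index=df.index)
    for col in df.columns:
        # dropna=False counts NaN as a value, like len(col.unique()) does
        ratio = df[col].nunique(dropna=False) / len(df[col])
        if ratio < max_unique_ratio:
            converted[col] = df[col].astype('category')
        else:
            converted[col] = df[col]
    return converted

toy = pd.DataFrame({
    'team': pd.Series(['CHN', 'STL', 'CHN', 'STL', 'CHN', 'STL'], dtype=object),
    'game_id': pd.Series(['g1', 'g2', 'g3', 'g4', 'g5', 'g6'], dtype=object),
})
result = to_category_if_repetitive(toy)
print(result.dtypes)
```

The repetitive `team` column becomes a category, while the all-unique `game_id` column is left as it was.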
In [19]:
date = optimized_gl.date
date[:5]
Out[19]:
0    18710504
1    18710505
2    18710506
3    18710508
4    18710509
Name: date, dtype: uint32
In [20]:
print (mem_usage(date))
0.66 MB
In [21]:
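# datetime64[ns] stores 8 bytes per value, twice the 4 bytes of uint32,
# but the column can now be used for date arithmetic and time-based selection
optimized_gl['date'] = pd.to_datetime(date, format='%Y%m%d')
print(mem_usage(optimized_gl['date']))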
1.31 MB
In [22]:
optimized_gl['date'][:5]
Out[22]:
0   1871-05-04
1   1871-05-05
2   1871-05-06
3   1871-05-08
4   1871-05-09
Name: date, dtype: datetime64[ns]
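Once the target types are known, they can be declared when the file is read, which is also what the DtypeWarning in the first cell suggests. A minimal sketch, assuming a tiny in-memory CSV with a few of the same column names; the `column_types` mapping is illustrative and would have to be built for all 161 columns of the real file:

```python
import io
import pandas as pd

# Stand-in for game_logs.csv with only four columns (illustrative data)
csv_text = (
    "date,day_of_week,v_name,v_score\n"
    "18710504,Thu,CL1,0\n"
    "18710505,Fri,BS1,20\n"
    "18710506,Sat,CL1,12\n"
)

# Hypothetical dtype mapping; date is read as text and parsed afterwards
column_types = {
    'date': str,
    'day_of_week': 'category',
    'v_name': 'category',
    'v_score': 'uint8',
}

sample = pd.read_csv(io.StringIO(csv_text), dtype=column_types)
sample['date'] = pd.to_datetime(sample['date'], format='%Y%m%d')
print(sample.dtypes)
```

Declaring types up front means the large object columns are never materialized as Python strings in the first place.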

Reposted from blog.csdn.net/m0_38039437/article/details/80819839