One of the Three Musketeers of Python data analysis: the Pandas library

Data processing and cleaning in Python

Pandas is a library built on NumPy. For data processing, it can be thought of as an enhanced version of NumPy. Pandas is also an open source project.

Unlike NumPy, Pandas has two data structures of its own: Series for one-dimensional data and DataFrame for two-dimensional data (a table).
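To make the two structures concrete, here is a minimal sketch. The column names and values are invented for illustration and are not from the original post.

```python
import pandas as pd

# A Series is a one-dimensional labeled array.
prices = pd.Series([1200, 3500, 800], name="price")

# A DataFrame is a two-dimensional table; each column is a Series.
df = pd.DataFrame({
    "city": ["beijing", "shanghai", "guangzhou"],
    "price": [1200, 3500, 800],
})

print(prices.ndim)    # number of dimensions of the Series
print(df.ndim)        # number of dimensions of the DataFrame
print(df.shape)       # (rows, columns)

column = df["price"]  # selecting one column gives back a Series
```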

head()  # first 5 rows by default

tail()  # last 5 rows by default
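A quick way to check the defaults is to call both methods on a small table. This is only a sketch with a made-up 20-row DataFrame. Passing a number, as in head(10), changes how many rows come back.

```python
import pandas as pd

# Hypothetical 20-row table, just to count the rows returned.
df = pd.DataFrame({"n": range(20)})

first_default = df.head()   # no argument: first 5 rows
last_default = df.tail()    # no argument: last 5 rows
first_ten = df.head(10)     # explicit row count

print(len(first_default), len(last_default), len(first_ten))
```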

First, data cleaning

1, fill null values with the number 0:
df.fillna(value=0)

2, fill NA values with the mean of the price column:
df['price'].fillna(df['price'].mean())


3, strip leading and trailing spaces from the city field:
df['city'] = df['city'].map(str.strip)

4, convert case:
df['city'] = df['city'].str.lower()

5, change the data type:
df['price'].astype('int')

6, rename a column:
df.rename(columns={'category': 'category-size'})

7, delete duplicate values, keeping the first occurrence:
df['city'].drop_duplicates()

8, delete duplicate values, keeping the last occurrence:
df['city'].drop_duplicates(keep='last')

9, replace data:
df['city'].replace('sh', 'shanghai')
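The cleaning steps above can be chained on one small table. The sketch below uses invented sample data. Note that most of these methods return a new object instead of changing df in place, so the result is assigned back where the change should be kept.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({
    "city": [" Beijing ", "SH", "guangzhou ", "SH", " Beijing "],
    "price": [1200.0, np.nan, 2133.0, 5433.0, np.nan],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A"],
})

# Steps 2 and 5: fill NA with the column mean, then convert to int.
df["price"] = df["price"].fillna(df["price"].mean()).astype("int")

# Steps 3 and 4: strip surrounding spaces, then lowercase.
df["city"] = df["city"].map(str.strip).str.lower()

# Step 9: replace an abbreviation with the full name (whole-value match).
df["city"] = df["city"].replace("sh", "shanghai")

# Step 6: rename returns a new DataFrame; assign it to keep the change.
df = df.rename(columns={"category": "category-size"})

# Steps 7 and 8: keep the first or the last occurrence of each value.
first_kept = df["city"].drop_duplicates()
last_kept = df["city"].drop_duplicates(keep="last")

print(df)
```

The two drop_duplicates calls return the same set of city names. They differ only in which row (and therefore which index label) survives for each repeated value.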

 

Second, data extraction

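Three main indexers are used: loc, iloc and ix. loc extracts by label, iloc extracts by position, and ix can extract by label and position at the same time. Note that ix was deprecated in pandas 0.20 and removed in pandas 1.0, so current versions only offer loc and iloc.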
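The difference between label and position is easiest to see when the index labels do not match the row order. The following is a small sketch with an invented table whose index is deliberately shuffled.

```python
import pandas as pd

# Hypothetical table whose index labels differ from row positions.
df_inner = pd.DataFrame(
    {"city": ["beijing", "shanghai", "guangzhou"], "price": [1200, 3500, 800]},
    index=[3, 1, 2],
)

by_label = df_inner.loc[3]      # the row whose index label is 3
by_position = df_inner.iloc[2]  # the third row, whatever its label

print(by_label["city"], by_position["city"])
```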

1, extract a single row by index value

df_inner.loc[3]

2, extract a range of rows by index value

df_inner.iloc[0:5]

3, reset the index

df_inner.reset_index()

4, set the date column as the index
df_inner = df_inner.set_index('date')

5, extract all the data up to the 4th
df_inner[:'2013-01-04']

6, extract a region of data by position using iloc

df_inner.iloc[:3, :2]  # the numbers around the colons are no longer index labels but positions, counted from 0: the first three rows and the first two columns

7, extract individual rows and columns by position using iloc

df_inner.iloc[[0, 2, 5], [4, 5]]  # rows at positions 0, 2 and 5; columns at positions 4 and 5

8, extract data by mixing index labels and positions using ix

df_inner.ix[:'2013-01-03', :4]  # rows up to 2013-01-03, first four columns (requires pandas older than 1.0)

9, check whether the value of the city column is beijing
df_inner['city'].isin(['beijing'])

10, check whether the city column contains beijing or shanghai, then extract the rows that meet the condition

df_inner.loc[df_inner['city'].isin(['beijing', 'shanghai'])]

11, extract the first three characters and generate a data table
pd.DataFrame(category.str[:3])  # category is assumed to be a Series of strings, such as df_inner['category']
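The extraction steps above can be tried end to end on a small date-indexed table. This is a sketch with invented data. Because ix is no longer available in pandas 1.0 and later, step 8 is shown here as a loc slice followed by an iloc slice, which is one possible substitute and not the original post's method.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df_inner = pd.DataFrame({
    "date": pd.date_range("2013-01-02", periods=6),
    "city": ["beijing", "shanghai", "guangzhou", "shenzhen", "shanghai", "beijing"],
    "category": ["100-A", "100-B", "110-A", "110-C", "210-A", "130-F"],
    "price": [1200, 3500, 2133, 5433, 800, 4432],
}).set_index("date")

# Step 5: label slicing on a date index includes the end label.
up_to_4th = df_inner.loc[:"2013-01-04"]

# Step 6: positional slicing excludes the end position.
corner = df_inner.iloc[:3, :2]

# Step 8 without ix: slice rows by label, then columns by position.
mixed = df_inner.loc[:"2013-01-03"].iloc[:, :2]

# Steps 9 and 10: build a boolean mask with isin, then filter with loc.
mask = df_inner["city"].isin(["beijing", "shanghai"])
selected = df_inner.loc[mask]

# Step 11: first three characters of each category value.
prefix = pd.DataFrame(df_inner["category"].str[:3])

print(selected)
```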

 

merge, append, join, concat: combine data (DataFrame.append was removed in pandas 2.0; concat covers the same use)
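As a rough illustration of how these combining functions differ, the sketch below joins two invented tables on an id column and stacks a table on itself. The table names, columns and values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical tables, invented for illustration.
df = pd.DataFrame({"id": [1001, 1002, 1003],
                   "city": ["beijing", "shanghai", "guangzhou"]})
df1 = pd.DataFrame({"id": [1001, 1002, 1004],
                    "pay": ["Y", "N", "Y"]})

# merge matches rows on a key column, like a SQL join.
inner = pd.merge(df, df1, on="id", how="inner")  # ids present in both tables
left = pd.merge(df, df1, on="id", how="left")    # every row of df, NaN where unmatched

# join matches on the index instead of a column.
joined = df.set_index("id").join(df1.set_index("id"), how="outer")

# concat stacks tables vertically; ignore_index renumbers the rows.
stacked = pd.concat([df, df], ignore_index=True)

print(inner)
```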

set_index('id'): set the index

sort_values(by=['age']): sort by the values of a particular column

sort_index(): sort by the index
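A short sketch of these three calls on an invented table follows. The id and age values are made up for the example.

```python
import pandas as pd

# Hypothetical sample data, invented for illustration.
df = pd.DataFrame({"id": [1003, 1001, 1002], "age": [34, 23, 54]})

indexed = df.set_index("id")              # id becomes the index
by_age = indexed.sort_values(by=["age"])  # rows ordered by age, ascending
by_id = indexed.sort_index()              # rows ordered by index label

print(by_age)
print(by_id)
```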

df_inner['group'] = np.where(df_inner['price'] > 3000, 'high', 'low'): if the value in the price column is greater than 3000, the group column shows high, otherwise it shows low
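Here is a minimal runnable version of this conditional column, using three invented prices. A price of exactly 3000 falls in the low group because the comparison is strictly greater than.

```python
import numpy as np
import pandas as pd

# Hypothetical prices, invented for illustration.
df_inner = pd.DataFrame({"price": [1200, 3000, 5433]})

# np.where picks 'high' where the condition is True and 'low' elsewhere.
df_inner["group"] = np.where(df_inner["price"] > 3000, "high", "low")

print(df_inner)
```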

 


Origin www.cnblogs.com/harsin/p/11766165.html