Commonly used pandas functions

Part 1: Introduction to common features and functions

Importing Packages

Generally we start with the following imports; numpy and pandas are usually used together:

import pandas as pd
import numpy as np
The following abbreviations are used throughout:

df: a pandas DataFrame object
s: a pandas Series object
Data Import

pd.read_csv(filename): import data from a CSV file
pd.read_table(filename): import data from a delimited text file
pd.read_excel(filename): import data from an Excel file
pd.read_sql(query, connection_object): import data from a SQL table/database
pd.read_json(json_string): import data from a JSON-formatted string
pd.read_html(url): parse a URL, HTML file, or string
pd.read_clipboard(): get content from the clipboard
pd.DataFrame(dict): create a DataFrame from a dictionary object, as the sketch below shows
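For example, a minimal sketch of two of these (the CSV text and column names are invented for illustration):

import io
import pandas as pd

csv_text = "name,score\nAlice,90\nBob,85\n"
df = pd.read_csv(io.StringIO(csv_text))  # read_csv accepts a path or any file-like object
print(df)

df2 = pd.DataFrame({'name': ['Alice', 'Bob'], 'score': [90, 85]})  # the same data, built from a dict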
Data Export

df.to_csv(filename): export data to a CSV file
df.to_excel(filename): export data to an Excel file
df.to_sql(table_name, connection_object): export data to a SQL table
df.to_json(filename): export data in JSON format to a text file
Creating Objects

pd.DataFrame(np.random.rand(20, 5)): create a DataFrame of 20 rows and 5 columns of random numbers
pd.Series(my_list): create a Series from the iterable my_list
df.index = pd.date_range('1900/1/30', periods=df.shape[0]): add a date index
index and reindex are useful in combination: the index labels the elements, and elements follow their labels. So when you want to rearrange the elements, you only need to rearrange the index, which is exactly what reindex does.

In addition, reindex can introduce new labels, whose values are filled with NaN, as the sketch below shows.
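A minimal sketch of this behavior (the labels are invented):

import pandas as pd

s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
# Reordering the index reorders the elements; each value follows its label.
print(s.reindex(['c', 'a', 'b']))
# A label absent from the original index appears with value NaN.
print(s.reindex(['a', 'b', 'c', 'd']))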

Viewing Data

df.head(n): view the first n rows of a DataFrame
df.tail(n): view the last n rows of a DataFrame
df.shape: view the number of rows and columns (an attribute, not a method)
df.info(): view the index, data types, and memory information
df.describe(): view summary statistics of the numeric columns
s.value_counts(dropna=False): view the unique values and counts of a Series
df.apply(pd.Series.value_counts): view the unique values and counts of each column of a DataFrame
apply is very useful: combined with a lambda it can accomplish many tasks, such as singling out elements that contain a certain part. For example, assuming a DataFrame cities with columns 'Area square miles' and 'City name':

 

cities['Is wide and has saint name'] = (cities['Area square miles'] > 50) & cities['City name'].apply(lambda name: name.startswith('San'))

 
Data Selection

df[col]: select a column by name, returned as a Series
df[[col1, col2]]: select multiple columns, returned as a DataFrame
s.iloc[0]: select data by position
s.loc['index_one']: select data by index label
df.iloc[0, :]: return the first row
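A small sketch contrasting the selectors above (the frame is invented):

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]}, index=['x', 'y', 'z'])
print(df['a'])         # column 'a' as a Series
print(df[['a', 'b']])  # multiple columns as a DataFrame
print(df.iloc[0, :])   # first row, by position
print(df.loc['y'])     # row labeled 'y', by index label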
Data Cleaning

df.columns = ['a', 'b', 'c']: rename the columns
pd.isnull(): check a DataFrame for null values, returning a boolean array
pd.notnull(): check a DataFrame for non-null values, returning a boolean array
df.dropna(): drop all rows that contain null values
df.fillna(x): replace all null values in a DataFrame with x
s.astype(float): change the data type of a Series to float
s.replace(1, 'one'): replace all values equal to 1 with 'one'
df.rename(columns=lambda x: x + 1): batch-rename the columns
df.set_index('column_one'): use column_one as the index
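A minimal cleaning sketch tying several of these together (the values are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1.0, np.nan, 3.0], 'b': [np.nan, 5.0, 6.0]})
print(pd.isnull(df))         # boolean mask marking the nulls
print(df.dropna())           # drop rows containing any null
print(df.fillna(df.mean()))  # fill nulls with each column's mean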
Data Processing: Filter, Sort, GroupBy

df[df[col] > 0.5]: select the rows where the value of column col is greater than 0.5
df.sort_values(col1): sort the data by column col1, ascending by default
df.groupby(col): return a GroupBy object grouped by column col
df.groupby(col1).agg(np.mean): return the mean of all columns, grouped by column col1
df.pivot_table(index=col1, values=[col2, col3], aggfunc=max): create a pivot table grouped by col1, computing the maximum of col2 and col3
df.apply(np.mean): apply the function np.mean to each column of the DataFrame
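A short groupby/pivot_table sketch (the column names and values are invented):

import numpy as np
import pandas as pd

df = pd.DataFrame({'team': ['A', 'A', 'B', 'B'],
                   'score': [1, 2, 3, 4],
                   'bonus': [10, 20, 30, 40]})
print(df.groupby('team').agg(np.mean))  # mean of each column per team
print(df.pivot_table(index='team',
                     values=['score', 'bonus'],
                     aggfunc=max))      # max of score and bonus per team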
Combining Data

df1.append(df2): append the rows of df2 to the end of df1 (removed in pandas 2.0; use pd.concat instead)
pd.concat([df1, df2], axis=1): append the columns of df2 to df1
df1.join(df2, on=col1, how='inner'): perform a SQL-style join of the columns of df1 and df2
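A minimal combining sketch (the frames are invented; note that join matches column col1 of df1 against the index of df2, so for a column-to-column join, merge is the usual tool):

import pandas as pd

df1 = pd.DataFrame({'k': [1, 2], 'v1': ['a', 'b']})
df2 = pd.DataFrame({'k': [1, 2], 'v2': ['c', 'd']})
print(pd.concat([df1, df2], axis=0))        # stack rows (the modern replacement for append)
print(pd.concat([df1, df2], axis=1))        # place columns side by side
print(df1.merge(df2, on='k', how='inner'))  # SQL-style join on column 'k'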
Statistics

df.describe(): view summary statistics of the numeric columns
df.mean(): return the mean of every column
df.corr(): return the correlation coefficients between columns
df.count(): return the number of non-null values in each column
df.max(): return the maximum of each column
df.min(): return the minimum of each column
df.median(): return the median of each column
df.std(): return the standard deviation of each column
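For instance, a quick look at a few of these on a small invented frame:

import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3, 4], 'b': [10, 20, 19, 40]})
print(df.describe())  # count, mean, std, min, quartiles, max per column
print(df.corr())      # pairwise correlation between columns
print(df.median())    # per-column median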
Pandas supported data types

int: integer type
float: floating-point type
bool: boolean type
object: string type
category: categorical type
datetime: date/time type
Additions:

df.astype: convert the data type
df.value_counts: count occurrences of each identical value
df.hist(): plot a histogram
pd.get_dummies: one-hot encoding, converting a categorical attribute into matrix-style indicator columns; for example, with the three colors R, G, B, red is encoded as [1, 0, 0]
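A small sketch of dtype conversion, value_counts, and pd.get_dummies together (the color column is invented; the explicit category order R, G, B makes red come out as [1, 0, 0], matching the example above):

import pandas as pd

df = pd.DataFrame({'color': ['R', 'G', 'B', 'R']})
df['color'] = pd.Categorical(df['color'], categories=['R', 'G', 'B'])  # dtype conversion with a fixed category order
print(df['color'].value_counts())   # count occurrences of each value
print(pd.get_dummies(df['color']))  # one indicator column per color; each row has a single 1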
 


Part 2: House price prediction case

Given a training CSV file, predict the sale prices for the examples in the given test CSV file.


 
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
# %matplotlib inline  (use this magic when running in a Jupyter notebook)
 
###################1 original data##################
train_df = pd.read_csv('input/train.csv', index_col=0)  # data import
test_df = pd.read_csv('input/test.csv', index_col=0)
 
print("type of train_df:" + str(type(train_df)))
#print(train_df.columns)
print("shape of train_df:" + str(train_df.shape))
print("shape of test_df:" + str(test_df.shape))
 
train_df.head()  # view the data
#print(train_df.head())
 
###################2 smooth label######################
prices = pd.DataFrame({"price":train_df["SalePrice"], "log(price+1)":np.log1p(train_df["SalePrice"])})
print("shape of prices:" + str(prices.shape))  # data creation
prices.hist()  # histogram
plt.show()
 
y_train = np.log1p(train_df.pop('SalePrice'))
print("shape of y_train:" + str(y_train.shape))
 
###################3 take train and test data together##############
all_df = pd.concat((train_df, test_df), axis=0)  # combine the data
print("shape of all_df:" + str(all_df.shape))
 
###################4 make category data to string####################
print(all_df['MSSubClass'].dtypes)
all_df['MSSubClass'] = all_df['MSSubClass'].astype(str)  # data type conversion
print(all_df['MSSubClass'].value_counts())  # count occurrences of each value
 
###################5 fill null values##################
all_dummy_df = pd.get_dummies(all_df)  # one-hot encoding: e.g. with colors RGB, R is encoded as [1, 0, 0]
print(all_dummy_df.head())
# data cleaning: find attributes with nulls and sort them by the number of nulls
print(all_dummy_df.isnull().sum().sort_values(ascending=False).head())

mean_cols = all_dummy_df.mean()  # statistics: per-column mean
print(mean_cols.head(10))

all_dummy_df = all_dummy_df.fillna(mean_cols)  # data cleaning: replace nulls with the column mean
print(all_dummy_df.isnull().sum().sum())
 
###################6 smooth numeric cols##################
numeric_cols = all_df.columns[all_df.dtypes != 'object']  # select the non-object columns, i.e. the numeric data
print(numeric_cols)

numeric_col_means = all_dummy_df.loc[:, numeric_cols].mean()  # select columns by label, then take the mean
numeric_col_std = all_dummy_df.loc[:, numeric_cols].std()
all_dummy_df.loc[:, numeric_cols] = (all_dummy_df.loc[:, numeric_cols] - numeric_col_means) / numeric_col_std
 
###############7 train model#######################
dummy_train_df = all_dummy_df.loc[train_df.index]
dummy_test_df = all_dummy_df.loc[test_df.index]
print("shape of dummy_train_df:" + str(dummy_train_df.shape))
print("shape of dummy_test_df:" + str(dummy_test_df.shape))
 
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
 
X_train = dummy_train_df.values
X_test = dummy_test_df.values
 
alphas = np.logspace(-3, 2, 50)
test_scores = []
for alpha in alphas:
    clf = Ridge(alpha)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=10, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
plt.plot(alphas, test_scores)
plt.title("Alpha vs CV Error")
plt.show()
 
from sklearn.ensemble import RandomForestRegressor
max_features = [.1, .3, .5, .7, .9, .99]
test_scores = []
for max_feat in max_features:
    clf = RandomForestRegressor(n_estimators=200, max_features=max_feat)
    test_score = np.sqrt(-cross_val_score(clf, X_train, y_train, cv=5, scoring='neg_mean_squared_error'))
    test_scores.append(np.mean(test_score))
 
plt.plot(max_features, test_scores)
plt.title("Max Features vs CV Error")
plt.show()
 
#########################8 stacking#####################
ridge = Ridge(alpha=15)
rf = RandomForestRegressor(n_estimators=200, max_features=.3)
ridge.fit(X_train, y_train)
rf.fit(X_train, y_train)
 
y_ridge = np.expm1(ridge.predict(X_test))
y_rf = np.expm1(rf.predict(X_test))
 
y_final = (y_ridge + y_rf)/2
 
######################9 submission############################
submission_df = pd.DataFrame(data = {'Id':test_df.index, 'SalePrice':y_final})
print(submission_df.head())
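The script stops at printing the head of submission_df and never writes it to disk; presumably the final step would use df.to_csv from Part 1. A one-line sketch (the file name is a hypothetical choice):

submission_df.to_csv('submission.csv', index=False)  # hypothetical file name; index=False keeps 'Id' as a plain column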
