Pandas study notes

http://blog.csdn.net/qq_16234613/article/details/64217337

1. Read data

import pandas as pd

data = pd.read_csv("census.csv")

# Success - display the first record
# (display() is available in IPython/Jupyter; use print() in a plain script)
display(data.head(n=1))
# data is a DataFrame
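
As a self-contained sketch of the same read step, the snippet below parses a small in-memory CSV instead of census.csv. The column names and rows are made up for illustration; only pd.read_csv, head and shape come from pandas itself.

```python
import io

import pandas as pd

# Hypothetical stand-in for census.csv (made-up rows, not the real data set)
csv_text = (
    "age,education,income\n"
    "39,Bachelors,<=50K\n"
    "50,Masters,>50K\n"
    "38,HS-grad,<=50K\n"
)

# read_csv accepts a path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(type(data))        # the result is a DataFrame
print(data.shape)        # (number of rows, number of columns)
print(data.head(n=1))    # first record only
```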

Read                          Write
read_csv                      to_csv
read_excel                    to_excel
read_hdf                      to_hdf
read_sql                      to_sql
read_json                     to_json
read_msgpack (experimental)   to_msgpack (experimental)
read_html                     to_html
read_gbq (experimental)       to_gbq (experimental)
read_stata                    to_stata
read_sas
read_clipboard                to_clipboard
read_pickle                   to_pickle (faster than CSV)

Note: read_msgpack and to_msgpack were removed in pandas 1.0.
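
To make the CSV versus pickle pairing concrete, here is a minimal round-trip sketch. It does not measure speed; it only shows one practical difference, namely that pickle keeps column dtypes while CSV stores everything as text. The DataFrame and file names are invented for the example.

```python
import os
import tempfile

import pandas as pd

# Made-up frame with a datetime column
df = pd.DataFrame({
    "when": pd.to_datetime(["2020-01-01", "2020-01-02"]),
    "n": [1, 2],
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "df.csv")
    pkl_path = os.path.join(tmp, "df.pkl")

    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)

    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes)   # "when" comes back as text unless parse_dates is used
print(from_pkl.dtypes)   # dtypes are preserved
```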

# Save as a CSV file
submission = pd.DataFrame({'PassengerId': test_df['PassengerId'], 'Survived': predictions})
submission.to_csv("submission.csv", index=False)
# index controls whether the row index (the row labels) is written to the file
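
The effect of the index parameter is easiest to see on a tiny frame. In this sketch the passenger IDs and predictions are made up; calling to_csv without a path returns the CSV text as a string, which makes the two variants easy to compare.

```python
import pandas as pd

# Made-up values standing in for test_df['PassengerId'] and predictions
submission = pd.DataFrame({"PassengerId": [892, 893], "Survived": [0, 1]})

# With no path argument, to_csv returns the CSV text instead of writing a file
with_index = submission.to_csv()
without_index = submission.to_csv(index=False)

print(with_index)      # first column holds the row index 0, 1
print(without_index)   # only the two data columns
```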

2. Get a column

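# Get the column named income
income_raw = data['income']
income = data.loc[:,'income']
# Get the two columns named name and income
sub_data = data.loc[:,['name','income']]
# Get the column at position 1 (positions start at 0, so this is the second column)
sub_Data1 = data.iloc[:,1]
# ":" selects every row; a range also works, e.g. the first two rows, [0, 2)
sub_Data1 = data.iloc[0:2,1]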
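
Here is a small runnable sketch of the same selections, using an invented three-row frame. It also shows one detail that is easy to trip over: iloc slices exclude the end position, while loc slices include the end label.

```python
import pandas as pd

# Made-up data with the column names used in these notes
data = pd.DataFrame({
    "name": ["Ann", "Bob", "Cid"],
    "age": [39, 50, 38],
    "income": ["<=50K", ">50K", "<=50K"],
})

income = data.loc[:, "income"]                # one label -> Series
sub_data = data.loc[:, ["name", "income"]]    # list of labels -> DataFrame
second_col = data.iloc[:, 1]                  # position 1 -> the "age" column
first_two = data.iloc[0:2, 1]                 # positions 0 and 1, end excluded
by_label = data.loc[0:2, "age"]               # labels 0, 1 and 2, end included

print(first_two.tolist())
print(by_label.tolist())
```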

3. Count the number of occurrences of each value

income_value_counts = income.value_counts()
# The return value is a pandas.Series
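
A minimal sketch of value_counts on made-up income labels. The normalize=True variant is an addition to these notes; it returns fractions instead of counts.

```python
import pandas as pd

# Made-up labels in the style of the income column
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K", ">50K"], name="income")

counts = income.value_counts()                 # sorted by count, descending
shares = income.value_counts(normalize=True)   # fractions instead of counts

print(counts)
print(shares)
```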

4. Series basic concepts and properties

class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)

One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.

A Series is a one-dimensional array data structure with labels (an index). The statistical functions inherited from numpy.ndarray have been overridden to skip missing data such as NaN automatically.

The operators (+, -, /, *, **) between Series work element by element after aligning on index labels, and the two Series do not need to have the same length. In other words, NumPy-style array operations (filtering with a boolean array, scalar multiplication, applying mathematical functions) are preserved in a pandas Series, and they keep the link between each index label and its value.

(Element-wise arithmetic works much like it does for NumPy's ndarray. Here is an official ndarray example, which shows broadcasting between arrays of different shapes:)

>>> import numpy as np
>>> x = np.array([1, 2])
>>> y = np.array([[3], [4]])

>>> x
array([1, 2])

>>> y
array([[3],
       [4]])

>>> x + y
array([[4, 5],
       [5, 6]])
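
Returning to Series, the index alignment described above can be sketched as follows. The two Series and their labels are made up. Note that alignment is by label rather than by position, and labels present on only one side produce NaN unless a fill value is given.

```python
import pandas as pd

s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20], index=["b", "d"])

total = s1 + s2                       # aligned by label, not by position
filled = s1.add(s2, fill_value=0)     # treat a label missing on one side as 0
doubled = s1[s1 > 1] * 2              # boolean filter, then scalar multiplication

print(total)     # only "b" exists on both sides; the rest is NaN
print(filled)
print(doubled)   # the labels "b" and "c" stay attached to their values
```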

Some additional background:

We have already seen that Python supports a boolean data type. It has only two values, True and False, and supports the following operations:

AND operation: The result is True only if both Boolean values are True.

True and True   # ==> True
True and False   # ==> False
False and True   # ==> False
False and False   # ==> False

OR operation: The result is True as long as at least one of the Boolean values is True.

True or True   # ==> True
True or False   # ==> True
False or True   # ==> True
False or False   # ==> False

NOT operation: Change True to False, or False to True:

not True   # ==> False
not False   # ==> True

Boolean operations are used to make conditional judgments: depending on whether the result is True or False, the program executes different code.

In Python, and, or, and not can also be applied to values of other types. Consider the following code:

a = True
print(a and 'a=T' or 'a=F')

The result is not a boolean but the string 'a=T'. Why?

Because Python treats 0, the empty string '' and None as False, and treats other numbers and non-empty strings as True. Moreover, both operators return one of their operands instead of a plain boolean, so:

True and 'a=T' evaluates to 'a=T'
'a=T' or 'a=F' then evaluates to 'a=T' as well
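
The evaluation steps above can be checked directly. The sketch below also shows a known pitfall of the "x and y or z" idiom: when the middle value is itself falsy, the idiom falls through to the last operand. A conditional expression avoids this; it is shown here as an alternative and is not part of the original notes.

```python
a = True
print(a and 'a=T' or 'a=F')      # takes the middle operand

a = False
print(a and 'a=T' or 'a=F')      # takes the last operand

# Pitfall: the middle operand is an empty string, which counts as False
a = True
old_style = a and '' or 'fallback'

# A conditional expression picks the branch from the condition alone
new_style = '' if a else 'fallback'

print(repr(old_style), repr(new_style))
```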

In addition, Boolean values can also be used in arithmetic, where True counts as 1 and False counts as 0:
a = True
b = False
c = a + b
print(c)

# Output: 1
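
Because True counts as 1 and False as 0, summing a collection of booleans counts the True values, and dividing by the length gives the fraction of True values. A minimal sketch with made-up flags:

```python
flags = [True, False, True, True]

n_true = sum(flags)                # number of True values
share = sum(flags) / len(flags)    # fraction of True values

print(isinstance(True, int))       # bool is a subclass of int
print(n_true, share)
```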

So the prediction accuracy can be computed as follows (all of the background above is really there to explain the single expression inside this function):

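def accuracy_score(truth, pred):
    """Return the accuracy of pred relative to truth."""

    # Make sure the number of predictions matches the number of outcomes
    if len(truth) == len(pred):

        # Compute the accuracy as a percentage: truth == pred is a boolean
        # Series, and its mean is the fraction of True values
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)

    else:
        return "Number of predictions does not match number of outcomes!"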
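
A usage sketch for the function, with made-up truth and prediction values. The function is repeated so the snippet runs on its own. It assumes both arguments are pandas Series with the same index, so that == compares them element by element.

```python
import pandas as pd


def accuracy_score(truth, pred):
    """Return the accuracy of pred relative to truth as a message."""
    if len(truth) == len(pred):
        return "Predictions have an accuracy of {:.2f}%.".format(
            (truth == pred).mean() * 100
        )
    return "Number of predictions does not match number of outcomes!"


# Made-up labels: four of the five predictions are correct
truth = pd.Series([1, 0, 1, 1, 0])
pred = pd.Series([1, 0, 0, 1, 0])

matches = truth == pred          # element-wise comparison -> boolean Series
print(matches.tolist())
print(accuracy_score(truth, pred))
print(accuracy_score(truth, pred[:3]))   # lengths differ
```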

Note that for Python's built-in list type, + concatenates two lists, just as it does for strings; it does not add element by element.
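
A short sketch of the difference, with made-up values. It also shows why accuracy_score needs a Series or ndarray rather than plain lists: comparing two lists with == gives a single boolean, not one boolean per element.

```python
import numpy as np

a = [1, 2]
b = [3, 4]

concatenated = a + b                         # list: concatenation
elementwise = np.array(a) + np.array(b)      # ndarray: element-wise addition

print(concatenated)
print(elementwise)
print("ab" + "cd")                           # str: concatenation, like list

list_equal = [1, 0] == [1, 1]                        # one bool for the whole list
array_equal = np.array([1, 0]) == np.array([1, 1])   # one bool per element
print(list_equal, array_equal)
```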

5. Get the value of a key in the Series

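# Number of respondents with an income greater than $50,000.
# For example, the income column only contains the values '<=50K' and '>50K',
# and income.value_counts() prints:
#  <=50K    34014
#  >50K     11208

n_greater_50k = income.value_counts().get('>50K')
# value_counts() is sorted by count in descending order, so the most frequent value is
n_most_group = income.value_counts().index[0]

A Series wraps its data together with an index. To get the underlying NumPy array, use .values.

To get the i-th element by position, you can write a.values[i] (or a.iloc[i]).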
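
The lookups in this section can be tried on a small made-up Series. The default argument to get and the use of iloc are additions to the notes; both are standard pandas.

```python
import pandas as pd

# Made-up labels: three '<=50K' and two '>50K'
income = pd.Series(["<=50K"] * 3 + [">50K"] * 2)
counts = income.value_counts()

n_greater_50k = counts.get(">50K")      # lookup by label
n_unknown = counts.get("unknown", 0)    # default when the label is absent
most_common = counts.index[0]           # label with the largest count
first_count = counts.iloc[0]            # value by position
raw = counts.values                     # underlying NumPy array

print(n_greater_50k, n_unknown, most_common, first_count, raw)
```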

6. Split data

# Split the data into features and the corresponding labels
income_raw = data['income']
features_raw = data.drop('income', axis = 1)
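
A minimal sketch of the split on an invented two-row frame. It shows that drop returns a new DataFrame and leaves the original untouched, and that drop(columns=...) is an equivalent spelling.

```python
import pandas as pd

# Made-up rows with the label column "income"
data = pd.DataFrame({
    "age": [39, 50],
    "education": ["Bachelors", "Masters"],
    "income": ["<=50K", ">50K"],
})

income_raw = data["income"]                    # labels
features_raw = data.drop("income", axis=1)     # features, as a new DataFrame
same_features = data.drop(columns="income")    # equivalent spelling

print(list(features_raw.columns))
print(list(data.columns))                      # the original still has income
```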

7. Apply the function to each row or column:

def num_missing(x):
    return x.isnull().sum()
# Apply to each column:
print(data.apply(num_missing, axis=0))
# Apply to each row:
print(data.apply(num_missing, axis=1).head())

# Apply a log transform to skewed features (note the lambda)
import numpy as np
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))
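
A runnable sketch of both uses of apply, on a made-up frame with a few missing values. With axis=0 the function receives one column at a time, and with axis=1 it receives one row at a time. np.log1p is used in the check at the end only as an equivalent of np.log(x + 1).

```python
import numpy as np
import pandas as pd

# Made-up values, including some missing entries
data = pd.DataFrame({
    "capital-gain": [0.0, 2174.0, np.nan],
    "capital-loss": [0.0, np.nan, np.nan],
})


def num_missing(x):
    """Count the missing values in one column or one row."""
    return x.isnull().sum()


per_column = data.apply(num_missing, axis=0)   # one result per column
per_row = data.apply(num_missing, axis=1)      # one result per row

skewed = ["capital-gain", "capital-loss"]
logged = data[skewed].apply(lambda x: np.log(x + 1))

print(per_column)
print(per_row)
print(logged)
```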

8. For
