http://blog.csdn.net/qq_16234613/article/details/64217337
1. Read data
import pandas as pd
data = pd.read_csv("census.csv")
# Success - display the first record
display(data.head(n=1))
# data is a DataFrame
Reader                        Writer
read_csv                      to_csv
read_excel                    to_excel
read_hdf                      to_hdf
read_sql                      to_sql
read_json                     to_json
read_msgpack (experimental)   to_msgpack (experimental)
read_html                     to_html
read_gbq (experimental)       to_gbq (experimental)
read_stata                    to_stata
read_sas                      (no writer)
read_clipboard                to_clipboard
read_pickle                   to_pickle (faster than CSV)
# Save as a CSV file
submission = pd.DataFrame({ 'PassengerId': test_df['PassengerId'],'Survived': predictions })
submission.to_csv("submission.csv", index=False)
# The index parameter controls whether the row labels (the index) are written to the file
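To make the effect of `index=False` concrete, here is a minimal sketch that writes to an in-memory buffer instead of a file. The small `submission` frame and its values are made up for illustration and are not from the original data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the submission frame built above
submission = pd.DataFrame({"PassengerId": [892, 893, 894],
                           "Survived": [0, 1, 0]})

# index=False: only the two data columns are written
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_without_index = buffer.getvalue()

# Default (index=True): an extra unnamed leading column holds the row labels
buffer_with_index = io.StringIO()
submission.to_csv(buffer_with_index)
csv_with_index = buffer_with_index.getvalue()

# Reading the index-free CSV back gives the same table
restored = pd.read_csv(io.StringIO(csv_without_index))

print(csv_without_index.splitlines()[0])
print(csv_with_index.splitlines()[0])
```

Leaving the index out is usually what you want for a submission file, since the reader would otherwise see an unexpected extra column.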
2. Get a column
# Get the column named income
income_raw = data['income']
income = data.loc[:,'income']
# Get the two columns named name and income
sub_data = data.loc[:,['name','income']]
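# Get the column at position 1; positions start at 0, so this is the second column
sub_Data1 = data.iloc[:,1]
# ':' selects every row of the column; a range also works, e.g. the first two rows [0,2)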
sub_Data1 = data.iloc[0:2,1]
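The difference between label-based `loc` and position-based `iloc` is easiest to see side by side. The following is a small sketch on a made-up three-row frame (the column names and values are assumptions for illustration); note that `iloc` slices exclude the end position while `loc` slices include the end label.

```python
import pandas as pd

# Hypothetical toy data, not the census file
data = pd.DataFrame({"name": ["Ann", "Bob", "Cy"],
                     "age": [39, 50, 38],
                     "income": ["<=50K", ">50K", "<=50K"]})

by_label = data.loc[:, "income"]     # select by column label
by_position = data.iloc[:, 2]        # select by column position (0-based)

first_two_ages = data.iloc[0:2, 1]   # positions 0 and 1; the end is excluded
label_slice = data.loc[0:2, "age"]   # labels 0, 1 and 2; the end is included

print(first_two_ages.tolist())
print(label_slice.tolist())
```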
3. Count the number of occurrences of each value
income_value_counts = income.value_counts()
# The return value is a pandas.Series
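As a quick illustration of what `value_counts()` returns, here is a sketch on a hand-made Series (the five values are invented). The result is indexed by the distinct values and sorted by count in descending order; `normalize=True` returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical income column with five rows
income = pd.Series(["<=50K", ">50K", "<=50K", "<=50K", ">50K"], name="income")

counts = income.value_counts()                # counts per distinct value
shares = income.value_counts(normalize=True)  # fractions that sum to 1

print(counts)
print(shares)
```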
4. Series basic concepts and properties
class pandas.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.
In other words, a Series is a one-dimensional array with a label (index) attached to each element. The statistical functions of numpy.ndarray have been overridden so that they automatically skip missing data such as NaN.
The operators (+, -, /, *, **) between Series work element by element, matching elements by index label, so the two Series are not required to have the same length. A pandas Series therefore keeps NumPy's array operations (filtering data with boolean arrays, scalar multiplication, and applying mathematical functions) while preserving the link between each value and its index.
(Element-wise arithmetic is similar to NumPy's ndarray. An official ndarray example:)
>>> x = np.array([1, 2])
>>> y = np.array([[3], [4]])
>>> x
array([1, 2])
>>> y
array([[3],
       [4]])
>>> x + y
array([[4, 5],
       [5, 6]])
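Unlike the ndarray example, arithmetic between two Series is matched by index label rather than by position. Here is a minimal sketch with two invented Series of different lengths; labels present in only one operand produce NaN unless a fill value is supplied.

```python
import pandas as pd

# Hypothetical Series with different lengths and differently ordered labels
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20], index=["y", "x"])

# Values are paired by label: x -> 1 + 20, y -> 2 + 10, z has no partner
total = a + b

# add() with fill_value treats the missing partner as 0 instead of NaN
filled = a.add(b, fill_value=0)

print(total)
print(filled)
```

The result index is the sorted union of both indexes, which matches the description quoted from the pandas documentation.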
Some additional background:
We have already seen that Python supports boolean data. The boolean type has only two values, True and False, and supports the following operations:
AND operation: The result of the calculation is True only if both Boolean values are True.
True and True # ==> True
True and False # ==> False
False and True # ==> False
False and False # ==> False
OR operation: As long as there is a Boolean value of True, the result of the calculation is True.
True or True # ==> True
True or False # ==> True
False or True # ==> True
False or False # ==> False
NOT operation: Change True to False, or False to True:
not True # ==> False
not False # ==> True
Boolean operations are used to make conditional judgments. Depending on whether the result is True or False, the program can execute different code.
In Python, the Boolean type can also perform and, or, and not operations with other data types, see the following code:
a = True
print(a and 'a=T' or 'a=F')
The result of the calculation is not a boolean type, but a string 'a=T', why is this?
Because Python treats 0, the empty string '', and None as False, and treats other numbers and non-empty strings as True. In addition, and and or return one of their operands rather than a strict boolean, so:
True and 'a=T' evaluates to 'a=T'.
Python then evaluates 'a=T' or 'a=F', and the result is still 'a=T'.
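The evaluation steps above can be checked directly. The sketch below also shows a known pitfall of the `and ... or ...` idiom: it falls through to the last operand whenever the middle value is itself falsy, which is why a conditional expression is generally the safer choice. The variable names are only illustrative.

```python
a = True
result_true = a and 'a=T' or 'a=F'    # a is truthy, so 'a=T' is returned

b = False
result_false = b and 'a=T' or 'a=F'   # b is falsy, so the or branch returns 'a=F'

# Pitfall: the middle operand '' is falsy, so the or branch wins anyway
pitfall = True and '' or 'fallback'

# A conditional expression returns '' as intended
safe = '' if True else 'fallback'

print(result_true, result_false, repr(pitfall), repr(safe))
```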
In addition, Boolean values can take part in arithmetic, where True counts as 1 and False counts as 0:
a = True
b = False
c = a + b
print(c) # ==> 1
So the accuracy of a set of predictions can be computed as follows (all of the background above is really there to explain the first return line of this function):
def accuracy_score(truth, pred):
    """ Return the accuracy of pred relative to truth """
    # Make sure the number of predictions matches the number of outcomes
    if len(truth) == len(pred):
        # Compute the prediction accuracy (as a percentage)
        return "Predictions have an accuracy of {:.2f}%.".format((truth == pred).mean()*100)
    else:
        return "Number of predictions does not match number of outcomes!"
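To see why `(truth == pred).mean()` yields the accuracy, here is a small worked sketch on two invented Series. Comparing two Series gives a boolean Series, and because True counts as 1 and False as 0, its mean is the fraction of matches. This relies on `truth` and `pred` being Series (or NumPy arrays); with plain Python lists, `==` returns a single bool, which has no `.mean()` method.

```python
import pandas as pd

# Hypothetical labels and predictions
truth = pd.Series([1, 0, 1, 1, 0])
pred = pd.Series([1, 0, 0, 1, 0])

matches = truth == pred      # element-wise comparison, a boolean Series
n_correct = matches.sum()    # True is counted as 1
accuracy = matches.mean()    # fraction of True values

print("Predictions have an accuracy of {:.2f}%.".format(accuracy * 100))
```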
Note that for Python's built-in list type, + concatenates two lists rather than adding them element by element, just as it does for strings.
5. Get the value of a key in the Series
# Number of respondents whose income is greater than $50,000.
# For example, the income column contains only the values '<=50K' and '>50K',
# and income.value_counts() prints:
# <=50K 34014
# >50K 11208
n_greater_50k = income.value_counts().get('>50K')
# value_counts() is sorted in descending order, so the most frequent value can be obtained with
n_most_group = income.value_counts().index[0]
A Series stores its values together with an index. To get the underlying values as a NumPy array, use .values.
To get the i-th element by position, write a.values[i] (a.iloc[i] also works).
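The lookups in this section can be tried on a hand-built Series that mimics the `value_counts()` output shown above (the counts are copied from the comment; the `"unknown"` key is a made-up example of a missing label). `get` returns a default instead of raising when the label is absent.

```python
import pandas as pd

# Stand-in for income.value_counts(), using the counts quoted above
counts = pd.Series([34014, 11208], index=["<=50K", ">50K"])

n_greater_50k = counts.get(">50K")     # lookup by label
n_unknown = counts.get("unknown", 0)   # missing label: default is returned

most_common_label = counts.index[0]    # label of the first (largest) entry
first_by_values = counts.values[0]     # position 0 of the underlying NumPy array
first_by_iloc = counts.iloc[0]         # position 0 through the Series itself

print(n_greater_50k, n_unknown, most_common_label, first_by_values, first_by_iloc)
```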
6. Split data
# Split the data into features and the corresponding labels
income_raw = data['income']
features_raw = data.drop('income', axis = 1)
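A short sketch of the split on a made-up two-row frame (column names other than `income` are assumptions). It shows that `drop` returns a new DataFrame and leaves the original untouched, and that `columns=` is an equivalent spelling of `axis=1`.

```python
import pandas as pd

# Hypothetical miniature of the census data
data = pd.DataFrame({"age": [39, 50],
                     "education": ["Bachelors", "Masters"],
                     "income": ["<=50K", ">50K"]})

income_raw = data["income"]                 # labels
features_raw = data.drop("income", axis=1)  # everything except the label column

# Equivalent form using the columns keyword
features_alt = data.drop(columns=["income"])

print(list(features_raw.columns))
print(list(data.columns))  # the original frame still has the income column
```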
7. Apply the function to each row or column:
import numpy as np

def num_missing(x):
    return sum(x.isnull())
# Apply to each column:
print(data.apply(num_missing, axis=0))
# Apply to each row:
print(data.apply(num_missing, axis=1).head())
# Apply a log transform to skewed features; note the lambda
skewed = ['capital-gain', 'capital-loss']
features_raw[skewed] = data[skewed].apply(lambda x: np.log(x + 1))
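The effect of the `axis` argument can be seen on a tiny frame with a few missing values (the numbers are invented). With `axis=0` the function receives one column at a time, with `axis=1` one row at a time. The sketch uses `x.isnull().sum()`, which should give the same counts as the built-in `sum`, and `np.log1p`, which computes log(1 + x) and is offered here only as an alternative to the lambda.

```python
import numpy as np
import pandas as pd

# Hypothetical data with missing values
data = pd.DataFrame({"capital-gain": [0.0, 2174.0, np.nan],
                     "capital-loss": [0.0, np.nan, np.nan]})

def num_missing(x):
    # x is a Series: one column (axis=0) or one row (axis=1)
    return x.isnull().sum()

per_column = data.apply(num_missing, axis=0)  # one count per column
per_row = data.apply(num_missing, axis=1)     # one count per row

# Log transform of every value; zeros stay at 0, NaN stays NaN
log_features = np.log1p(data)

print(per_column)
print(per_row)
print(log_features)
```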
8. For