Data analysis 02 / pandas basics and a stock analysis case

1. pandas Introduction

  • numpy helps us process numeric data. Data also comes in many other types (strings, time series, and so on), and pandas handles these non-numeric types well in data analysis.

  • The two most commonly used pandas classes: Series and DataFrame

2. Series

  • Definition:

    A Series is a one-dimensional, array-like object made up of two components:

    values: a set of data (ndarray type)

    index: the data labels associated with the values

  • Creating a Series

    1. From a list or a numpy array

    2. From a dictionary

    Code Example:

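    import pandas as pd
    from pandas import Series,DataFrame
    import numpy as np
    
    # Method 1: from a list
    s1 = Series(data=[1,2,3,4,5])
    
    # Method 2: from a numpy array
    s2 = Series(data=np.random.randint(0,100,size=(4,)))
    
    # Method 3: from a dictionary
    dic = {
        'a':1,
        'b':2,
        'c':3
    }
    # The index of a Series can be made of strings (here, the dictionary keys)
    s3 = Series(data=dic)
    
    # The data stored in a Series must be one-dimensional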
  • Series index

    1. Implicit index: numeric positions, used by default

    2. Explicit index: custom labels (such as strings), which make the data more readable

    Code Example:

    # index= sets the explicit index (the labels mean math, English, and science)
    s4 = Series(data=[1,2,3],index=['数学','英语','理综'])
  • Series indexing and slicing

    1. Index Operations

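    # Implicit (positional) index access
    s4.iloc[0]   # s4[0] also works on older pandas, but positional [] on a labeled Series is deprecated
    # Explicit (label) index access
    s4['数学']
    s4.数学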

    2. Slice

    s4[0:2]
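
The difference between implicit and explicit access can be sketched with the dedicated accessors `iloc` (positions) and `loc` (labels). The Series below and its English labels are illustrative stand-ins, not the `s4` from the text.

```python
import pandas as pd

# Illustrative Series with string labels (hypothetical data)
s = pd.Series([1, 2, 3], index=["math", "english", "science"])

first_by_position = s.iloc[0]      # implicit index: position 0
first_by_label = s.loc["math"]     # explicit index: label "math"
first_two = s.iloc[0:2]            # positional slice, end excluded

print(first_by_position, first_by_label)
print(first_two)
```

Using `iloc` and `loc` makes it unambiguous whether a key is meant as a position or as a label.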
  • Common Series attributes

    • shape: the shape; Example: s4.shape
    • size: the number of elements; Example: s4.size
    • index: the index labels; Example: s4.index
    • values: the underlying data as an ndarray; Example: s4.values
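
As a quick check of these attributes, here is a minimal sketch on a small, made-up Series (the labels and numbers are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical scores, labeled by subject
s = pd.Series(data=[90, 85, 77], index=["math", "english", "science"])

shape = s.shape          # a 1-tuple, because a Series is one-dimensional
size = s.size            # number of elements
labels = list(s.index)   # the index labels
values = s.values        # the data as a numpy ndarray

print(shape, size, labels, values)
```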
  • Common Series methods

    1. head(), tail()

    s4.head(2)   # show the first n items
    s4.tail(2)   # show the last n items

    2. unique()

    s = Series(data=[1,1,2,2,3,4,5,6,6,6,6,6,6,7,8])
    s.unique()   # remove duplicates from the Series

    3. add(), sub(), mul(), div() / Series arithmetic

    s + s is equivalent to s.add(s)

    Arithmetic rule: values whose index labels match are combined; labels present on only one side produce NaN

    s1 = Series(data=[1,2,3,4])
    s2 = Series(data=[5,6,7])
    s1 + s2
    
    # Result:
    0     6.0
    1     8.0
    2    10.0
    3     NaN
    dtype: float64
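
If NaN is not wanted for unmatched labels, the method form accepts a `fill_value` argument that stands in for the missing side. A minimal sketch using the same two Series as above:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4])
s2 = pd.Series([5, 6, 7])

plain = s1 + s2                     # label 3 exists only in s1, so the result there is NaN
filled = s1.add(s2, fill_value=0)   # the missing value in s2 is treated as 0

print(plain)
print(filled)
```

Choosing 0 as the fill value is an assumption that suits addition; another operation may call for a different neutral value.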

    4. isnull(), notnull() / application: cleaning null values out of a Series

    s1 = Series(data=[1,2,3,4],index=['a','b','c','e'])
    s2 = Series(data=[1,2,3,4],index=['a','d','c','f'])
    s = s1 + s2
    s
    
    # Result:
    a    2.0
    b    NaN
    c    6.0
    d    NaN
    e    NaN
    f    NaN
    dtype: float64
    
    # Clean the nulls out of the result: a boolean Series can be used as an index
    s[s.notnull()]
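
The boolean-mask cleanup can also be written with `dropna()`, which pandas provides for this purpose. A small sketch, rebuilding the same two Series, suggests the two forms give the same result:

```python
import pandas as pd

s1 = pd.Series([1, 2, 3, 4], index=["a", "b", "c", "e"])
s2 = pd.Series([1, 2, 3, 4], index=["a", "d", "c", "f"])
s = s1 + s2   # only labels "a" and "c" exist on both sides

by_mask = s[s.notnull()]   # keep the entries where the mask is True
by_dropna = s.dropna()     # drop the NaN entries directly

print(by_dropna)
```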

3. DataFrame

  • DataFrame overview

    A DataFrame is a tabular data structure made up of several columns of data arranged in a fixed order. It was designed to extend the Series from one-dimensional to multi-dimensional use. A DataFrame has both a row index and a column index.

    • Row index: index
    • Column index: columns
    • Values: values
  • Creating a DataFrame

    1. From an ndarray

    2. From a dictionary

    Example:

    df = DataFrame(data=np.random.randint(0,100,size=(5,6)))
    df
    dic = {
        'name':['zhangsan','lisi','wangwu'],
        'salary':[10000,15000,10000]
    }
    df = DataFrame(data=dic,index=['a','b','c'])
    df
  • DataFrame attributes

    • df.values: All values
    • df.shape: Shape
    • df.index: row index
    • df.columns: column index
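
A short sketch of these attributes, rebuilding the name/salary table from the creation example so that the block runs on its own:

```python
import pandas as pd

df = pd.DataFrame(
    data={"name": ["zhangsan", "lisi", "wangwu"], "salary": [10000, 15000, 10000]},
    index=["a", "b", "c"],
)

print(df.shape)           # (number of rows, number of columns)
print(list(df.index))     # row labels
print(list(df.columns))   # column labels
print(df.values)          # all values as a 2-D ndarray
```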
  • DataFrame indexing

    1. Indexing columns

    # Select a single column
    df['name']
    
    # Select multiple columns
    df[['salary','name']]

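    2. Indexing rows

    # Select a single row
    df.loc['a']  # explicit (label) index
    df.iloc[0]  # implicit (positional) index
    
    # Select multiple rows
    df.loc[['a','c']]  # explicit (label) index
    df.iloc[[0,2]]  # implicit (positional) index

    3. Selecting a single element

    df.loc['b','salary']  # explicit (label) index
    df.iloc[1,1]  # implicit (positional) index

    4. Selecting multiple elements

    df.loc[['b','c'],'salary']  # explicit (label) index
    df.iloc[[1,2],1]  # implicit (positional) index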
  • DataFrame slicing

    1. Slicing rows

    # Slice rows
    df[0:2]

    2. Slicing columns

    # Slice columns
    df.iloc[:,0:2]
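
One detail worth keeping in mind when slicing: `iloc` slices exclude the end position, while `loc` slices include the end label. A minimal sketch on the same name/salary table (rebuilt here so the block is self-contained):

```python
import pandas as pd

df = pd.DataFrame(
    data={"name": ["zhangsan", "lisi", "wangwu"], "salary": [10000, 15000, 10000]},
    index=["a", "b", "c"],
)

by_position = df.iloc[0:2]                     # positions 0 and 1; position 2 is excluded
by_label = df.loc["a":"b"]                     # labels "a" through "b"; "b" is included
both_axes = df.loc["a":"b", "name":"salary"]   # label slices on rows and columns together

print(by_position)
print(by_label)
```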
  • DataFrame arithmetic: the same as for Series

    Where both the row label and the column label of two elements match, the arithmetic is applied to them; everywhere else the result is NaN
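
A small sketch of this alignment rule, using two made-up DataFrames whose row and column labels only partly overlap:

```python
import pandas as pd

# Hypothetical frames: they share only row "a" and column "x"
df1 = pd.DataFrame({"x": [1, 2], "y": [3, 4]}, index=["a", "b"])
df2 = pd.DataFrame({"x": [10, 20], "z": [30, 40]}, index=["a", "c"])

total = df1 + df2   # result covers the union of the row labels and of the column labels

print(total)
```

Only the cell at row "a", column "x" has a value on both sides, so it is the only cell that is not NaN.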

  • Viewing the data types of df

    df.dtypes
  • Converting to a time data type: pd.to_datetime(col)

    Example:

    dic = {
        'time':['2019-01-09','2011-11-11','2018-09-22'],
        'salary':[1111,2222,3333]
    }
    df = DataFrame(data=dic)
    
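    # Convert the time column to a datetime type
    df['time'] = pd.to_datetime(df['time'])
    
    # Before the conversion the dtype of time is: object
    # After the conversion the dtype of time is: datetime64[ns]
    # After the conversion, datetime64[ns] operations become available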
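
As one example of what the conversion enables, a datetime column exposes the `.dt` accessor for date parts. A minimal sketch with the same three dates:

```python
import pandas as pd

df = pd.DataFrame({
    "time": ["2019-01-09", "2011-11-11", "2018-09-22"],
    "salary": [1111, 2222, 3333],
})
df["time"] = pd.to_datetime(df["time"])

years = df["time"].dt.year.tolist()     # year of each date
months = df["time"].dt.month.tolist()   # month of each date
is_datetime = pd.api.types.is_datetime64_any_dtype(df["time"])

print(years, months, is_datetime)
```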
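  • Setting a column as the row index: df.set_index()

    Example:

    # Use the time column as the row index of the original data
    df.set_index(df['time'],inplace=True)   # inplace=True replaces the index of df itself
    
    # Drop the old time column
    df.drop(labels='time',axis=1,inplace=True)  # axis in drop: 0 means rows, 1 means columns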
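
A possible shortcut, shown here as a sketch rather than as the author's method: passing the column name to `set_index` moves the column into the index and removes it from the columns in one step.

```python
import pandas as pd

df = pd.DataFrame({
    "time": pd.to_datetime(["2019-01-09", "2011-11-11", "2018-09-22"]),
    "salary": [1111, 2222, 3333],
})

# Passing the column name (not the Series) drops the column automatically
df = df.set_index("time")

print(df)
```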

4. Using DataFrame for stock analysis

  • Requirements: stock analysis
    • Use the tushare package to fetch the historical market data of a stock.
      • tushare: a financial data interface package
      • pip install tushare
    • Output all dates on which the stock closed more than 3% above its opening price.
    • Output all dates on which the stock opened more than 2% below the previous day's close.
    • If, starting from January 1, 2010, I buy one lot (100 shares) on the first trading day of every month and sell all my shares on the last trading day of every year, what is my profit to date?
  • Code:

    1. Use the tushare package to get the historical market data of a stock

    import tushare as ts
    import pandas as pd
    from pandas import Series,DataFrame
    
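    # Use the tushare package to fetch the historical market data of a stock
    df = ts.get_k_data('600519',start='1988-01-01')
    # Write the data to a local file for persistent storage
    df.to_csv('./maotai.csv')
    
    # Load the data from the local text file into a DataFrame
    df = pd.read_csv('./maotai.csv')
    df.head(10)
    
    # Drop the useless "Unnamed: 0" column
    df.drop(labels='Unnamed: 0',axis=1,inplace=True)
    df.head(5)  # show the first five rows; 5 is also the default
    
    # Convert the date column to a datetime type
    df['date'] = pd.to_datetime(df['date'])
    
    # Use the date column as the row index of the original data
    df.set_index(df['date'],inplace=True)
    
    # Drop the old date column
    df.drop(labels='date',axis=1,inplace=True)
    df.head()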
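
The save-and-reload steps can likely be shortened. The sketch below uses a tiny made-up table and an in-memory buffer in place of tushare data and `./maotai.csv` (both are assumptions, so that the block runs without network or disk access). Writing with `index=False` avoids the "Unnamed: 0" column, and `read_csv` can parse the dates and set the index in one call.

```python
import io
import pandas as pd

# Hypothetical stand-in for the downloaded market data
raw = pd.DataFrame({
    "date": ["2020-01-02", "2020-01-03"],
    "open": [10.0, 10.5],
    "close": [10.5, 10.2],
})

buffer = io.StringIO()            # stands in for a file on disk
raw.to_csv(buffer, index=False)   # index=False: the old row numbers are not written out
buffer.seek(0)

# Parse the date column and use it as the row index while reading
df = pd.read_csv(buffer, parse_dates=["date"], index_col="date")

print(df)
```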

    2. Output all dates on which the stock closed more than 3% above its opening price.

    # Pseudocode: (close - open) / open > 0.03
    (df['close'] - df['open'])/df['open'] > 0.03
    
    # A boolean sequence can be used as the row index of df;
    # e.g. on a 3-row DataFrame, df.loc[[True,False,True]] keeps the first and third rows
    df.loc[(df['close'] - df['open'])/df['open'] > 0.03]
    
    df.loc[(df['close'] - df['open'])/df['open'] > 0.03].index
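
To see the boolean filter work without downloading anything, here is a sketch on three made-up trading days (the prices are invented for illustration):

```python
import pandas as pd

# Hypothetical prices: day 1 gains 4%, day 2 gains 2%, day 3 gains 4%
prices = pd.DataFrame(
    {"open": [100.0, 100.0, 50.0], "close": [104.0, 102.0, 52.0]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

gain = (prices["close"] - prices["open"]) / prices["open"]
up_days = prices.loc[gain > 0.03].index   # dates where the close beat the open by more than 3%

print(up_days)
```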

    3. Output all dates on which the stock opened more than 2% below the previous day's close

    # Pseudocode: (open - previous close) / previous close < -0.02
    
    # Shift the close column down by one row so that each open sits on the same row as the previous close
    (df['open'] - df['close'].shift(1))/df['close'].shift(1) < -0.02
    
    # Use the boolean result as the row index of df
    df.loc[(df['open'] - df['close'].shift(1))/df['close'].shift(1) < -0.02]
    
    df.loc[(df['open'] - df['close'].shift(1))/df['close'].shift(1) < -0.02].index
    
    # shift(1): moves all the data in a Series down by one position
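
The effect of `shift(1)` is easiest to see on a few rows. In this sketch the prices are made up: the first row has no previous close, so its comparison comes out False.

```python
import pandas as pd

# Hypothetical prices for three consecutive trading days
prices = pd.DataFrame(
    {"open": [101.0, 97.0, 109.0], "close": [100.0, 110.0, 105.0]},
    index=pd.to_datetime(["2020-01-02", "2020-01-03", "2020-01-06"]),
)

prev_close = prices["close"].shift(1)                 # NaN, 100.0, 110.0
gap = (prices["open"] - prev_close) / prev_close      # NaN, -0.03, about -0.009
gap_down = gap < -0.02                                # comparing NaN gives False
gap_down_days = prices.loc[gap_down].index

print(gap_down_days)
```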

    4. If, starting from January 1, 2010, I buy one lot on the first trading day of every month and sell all my shares on the last trading day of every year, what is my profit to date?

    Analysis:

    Buy: 12 purchases in a full year, one lot (100 shares) each time, so 1200 shares are bought in a full year

    Sell: one sale in a full year, selling all 1200 shares at once

    Code:

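    # Extract the trading data from 2010-1-1 to today
    data = df['2010':'2019']
    data.head()
    
    # Resample the data to get the first trading day of each month
    # (pandas 2.2 and later prefer the alias 'ME' over 'M')
    data_monthly = data.resample('M').first()
    
    # Total money spent
    cost_money = (data_monthly['open']*100).sum()
    
    # Money received from selling: get the last trading day of each year
    # ([:-1] leaves out 2019, which has not ended yet; pandas 2.2 and later prefer 'YE' over 'A')
    data_yearly = data.resample('A').last()[:-1]
    recv_money = (data_yearly['open']*1200).sum()
    
    # The value of the shares still held in 2019 also counts toward the profit
    last_money = 1200*data['close'].iloc[-1]
    
    # The total profit:
    last_money + recv_money - cost_money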
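
The same strategy can be checked by hand on synthetic data. The sketch below is not the author's code: it assumes constant made-up prices (open 10, close 11), groups by year and month instead of calling `resample`, and counts the shares actually bought in the unfinished final year instead of assuming 1200.

```python
import pandas as pd

# Synthetic trading days: all of 2020 plus the first three months of 2021
days = pd.bdate_range("2020-01-01", "2021-03-31")
data = pd.DataFrame({"open": 10.0, "close": 11.0}, index=days)

# First trading day of every (year, month): buy 100 shares at the open
monthly = data.groupby([data.index.year, data.index.month]).first()
cost = (monthly["open"] * 100).sum()

# Shares bought per year (12 months * 100 in a full year)
shares_bought = monthly.groupby(level=0).size() * 100

# Last trading day of every finished year: sell that year's shares at the open
last_year = data.index.year.max()
yearly = data.groupby(data.index.year).last()
finished = yearly.drop(index=last_year)
received = (finished["open"] * shares_bought.drop(index=last_year)).sum()

# Shares bought in the unfinished year are valued at the latest close
holding = shares_bought.loc[last_year] * data["close"].iloc[-1]

profit = received + holding - cost
print(cost, received, holding, profit)
```

With these numbers: 15 purchases cost 15000, the 2020 sale returns 12000, and the 300 shares from 2021 are worth 3300, for a profit of 300.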

Origin www.cnblogs.com/liubing8/p/12025163.html