Python Stock Analysis Series - Data Integration .p7

Welcome to Part 7 of the Python for Finance tutorial series. In the previous tutorial, we grabbed the Yahoo Finance data for all of the S&P 500 companies. In this tutorial, we will combine that data into a single DataFrame.

The code so far:

 
import bs4 as bs
import datetime as dt
import os
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


get_data_from_yahoo()
 

We now have all of the data, but we may want to evaluate it together. To do this, we will join all of the stock datasets into one. Each stock file currently contains: open, high, low, close, volume, and adjusted close. At least to start, most of us are only interested in the adjusted close.

def compile_data():
    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

First, we load the list of tickers we saved earlier and start with an empty DataFrame called main_df. Now we are ready to read in the dataset for each stock:

    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

You do not need Python's enumerate here; I use it only so we can see how far along we are while reading in all of the data. You could simply iterate over the tickers. From this point, we *could* generate extra columns with interesting data, such as:

        df['{}_HL_pct_diff'.format(ticker)] = (df['High'] - df['Low']) / df['Low']
        df['{}_daily_pct_chng'.format(ticker)] = (df['Close'] - df['Open']) / df['Open']
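
To make the arithmetic concrete, here is a small sketch of those two columns on a toy frame. The ticker name AAA and all of the prices are made up for illustration. Note that, despite the "pct" in the names, the results are fractions (0.1 means 10%).

```python
import pandas as pd

ticker = "AAA"  # placeholder ticker, not a real symbol

# Two made-up trading days
df = pd.DataFrame({
    "Open": [100.0, 50.0],
    "High": [110.0, 55.0],
    "Low": [100.0, 50.0],
    "Close": [105.0, 49.0],
})

# High-low spread relative to the low: (110 - 100) / 100 and (55 - 50) / 50
df["{}_HL_pct_diff".format(ticker)] = (df["High"] - df["Low"]) / df["Low"]

# Open-to-close change relative to the open: (105 - 100) / 100 and (49 - 50) / 50
df["{}_daily_pct_chng".format(ticker)] = (df["Close"] - df["Open"]) / df["Open"]

print(df)
```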

For now, though, we will not bother with this. Just know that it is an avenue you could pursue later. Instead, we are really only interested in the Adj Close column:

        df.rename(columns={'Adj Close':ticker}, inplace=True)
        df.drop(['Open','High','Low','Close','Volume'], axis=1, inplace=True)
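
As a quick check of what the rename and drop leave behind, here is a sketch that runs the same steps on an in-memory CSV instead of a file from stock_dfs. The ticker AAA and the rows are made up, and the column layout is assumed to match the files described above.

```python
import io

import pandas as pd

# Made-up rows in the column layout the stock files are assumed to have
csv_text = """Date,Open,High,Low,Close,Volume,Adj Close
2010-01-04,10.0,10.6,9.9,10.5,1000,10.2
2010-01-05,10.5,11.0,10.4,10.9,1200,10.6
"""
ticker = "AAA"  # placeholder ticker, not a real symbol

df = pd.read_csv(io.StringIO(csv_text))
df.set_index("Date", inplace=True)

# Name the adjusted close after the ticker, then drop everything else
df.rename(columns={"Adj Close": ticker}, inplace=True)
df.drop(columns=["Open", "High", "Low", "Close", "Volume"], inplace=True)

print(df)
```

What remains is a single column named after the ticker, indexed by Date, which is the shape we need for joining.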

Now we have just that column (or possibly extras, as above, but remember that in this example we are not computing HL_pct_diff or daily_pct_chng). Note that we have renamed the Adj Close column to the stock's ticker. Let's start building the shared DataFrame:

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

If main_df is empty, we start with the current df; otherwise, we use pandas' join.
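
To see what the outer join does when two stocks do not share all of their dates, here is a minimal sketch with two toy frames. The tickers AAA and BBB and their values are made up. With how='outer', every date from either frame is kept and the gaps are filled with NaN.

```python
import pandas as pd

# Toy data: two placeholder tickers with partly different trading dates
aaa = pd.DataFrame(
    {"AAA": [10.0, 10.5, 11.0]},
    index=pd.Index(["2010-01-04", "2010-01-05", "2010-01-06"], name="Date"),
)
bbb = pd.DataFrame(
    {"BBB": [20.0, 21.0]},
    index=pd.Index(["2010-01-05", "2010-01-07"], name="Date"),
)

main_df = pd.DataFrame()
for df in (aaa, bbb):
    if main_df.empty:
        # First frame: nothing to join against yet
        main_df = df
    else:
        # Keep the union of both date indexes
        main_df = main_df.join(df, how="outer")

print(main_df)
```

An inner join would instead keep only the dates present in every stock, which would throw away rows for companies that joined the index later.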

Still inside the for loop, we will add two more lines:

        if count % 10 == 0:
            print(count)

This prints the current count only when it is evenly divisible by 10. The expression count % 10 gives the remainder when count is divided by 10. So the condition count % 10 == 0 is True only when that remainder is 0, that is, when count is evenly divisible by 10.
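
Here is the same progress check in isolation, as a small sketch. The 25 ticker names are placeholders generated just for the demonstration, and the printed list is only there so we can inspect which counts passed the test.

```python
# Placeholder ticker names T0 .. T24
tickers = ["T{}".format(i) for i in range(25)]

printed = []
for count, ticker in enumerate(tickers):
    # count % 10 is the remainder after dividing count by 10
    if count % 10 == 0:
        printed.append(count)
        print(count, ticker)
```

Only the counts 0, 10, and 20 pass the check, so we get one progress line for every ten tickers.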

Once the for loop is done:

    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')

Putting it together, the function and the call to it look like this:

def compile_data():
    with open("sp500tickers.pickle","rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count,ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close':ticker}, inplace=True)
        df.drop(['Open','High','Low','Close','Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()
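
Once the joined file exists, it can be loaded back without rebuilding it. The sketch below writes a tiny stand-in frame to a temporary folder first so that it runs on its own; the tickers and values are made up, and in practice you would point read_csv at the sp500_joined_closes.csv written above.

```python
import os
import tempfile

import pandas as pd

# Stand-in for the joined frame (made-up values), saved to a temporary folder
joined = pd.DataFrame(
    {"AAA": [10.0, 10.5], "BBB": [None, 20.0]},
    index=pd.Index(["2010-01-04", "2010-01-05"], name="Date"),
)
path = os.path.join(tempfile.mkdtemp(), "sp500_joined_closes.csv")
joined.to_csv(path)

# index_col restores the Date index; parse_dates converts it to timestamps
loaded = pd.read_csv(path, index_col="Date", parse_dates=True)
print(loaded)
```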
 

The full code up to this point:

 
import bs4 as bs
import datetime as dt
import os
import pandas as pd
import pandas_datareader.data as web
import pickle
import requests


def save_sp500_tickers():
    resp = requests.get('http://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
    soup = bs.BeautifulSoup(resp.text, 'lxml')
    table = soup.find('table', {'class': 'wikitable sortable'})
    tickers = []
    for row in table.findAll('tr')[1:]:
        ticker = row.findAll('td')[0].text
        tickers.append(ticker)
    with open("sp500tickers.pickle", "wb") as f:
        pickle.dump(tickers, f)
    return tickers


# save_sp500_tickers()
def get_data_from_yahoo(reload_sp500=False):
    if reload_sp500:
        tickers = save_sp500_tickers()
    else:
        with open("sp500tickers.pickle", "rb") as f:
            tickers = pickle.load(f)
    if not os.path.exists('stock_dfs'):
        os.makedirs('stock_dfs')

    start = dt.datetime(2010, 1, 1)
    end = dt.datetime.now()
    for ticker in tickers:
        # just in case your connection breaks, we'd like to save our progress!
        if not os.path.exists('stock_dfs/{}.csv'.format(ticker)):
            df = web.DataReader(ticker, 'morningstar', start, end)
            df.reset_index(inplace=True)
            df.set_index("Date", inplace=True)
            df = df.drop("Symbol", axis=1)
            df.to_csv('stock_dfs/{}.csv'.format(ticker))
        else:
            print('Already have {}'.format(ticker))


def compile_data():
    with open("sp500tickers.pickle", "rb") as f:
        tickers = pickle.load(f)

    main_df = pd.DataFrame()

    for count, ticker in enumerate(tickers):
        df = pd.read_csv('stock_dfs/{}.csv'.format(ticker))
        df.set_index('Date', inplace=True)

        df.rename(columns={'Adj Close': ticker}, inplace=True)
        df.drop(['Open', 'High', 'Low', 'Close', 'Volume'], axis=1, inplace=True)

        if main_df.empty:
            main_df = df
        else:
            main_df = main_df.join(df, how='outer')

        if count % 10 == 0:
            print(count)
    print(main_df.head())
    main_df.to_csv('sp500_joined_closes.csv')


compile_data()
 

In the next tutorial, we will see whether we can quickly find any relationships in the data.

Origin www.cnblogs.com/medik/p/10989795.html