[Data Mining] Introducing NLP technology to stock market analysis

1. Description

        Machine learning models used in trading are typically trained on historical stock prices and other quantitative data to predict future stock prices. Natural language processing (NLP), however, also lets us analyze financial documents, such as Form 10-K filings, to predict stock movements.

2. Interpretation of natural language processing

Image credit: Adam Geitgey

        Natural language processing is a branch of artificial intelligence concerned with teaching computers to read language and extract meaning from it. Because language is so complex, computers must go through a series of steps to understand text. Below is a quick description of the steps in a typical NLP pipeline; a short code sketch illustrating a few of them follows the list.

  1. Sentence segmentation
    Text documents are segmented into individual sentences.
  2. Tokenization
    Once the document is broken into sentences, we further split the sentences into individual words. Each word is called a token, hence the name tokenization.
  3. Part-of-speech tagging
    We feed each token and a few words around it into a pre-trained part-of-speech classification model to receive the tagged part of speech as output.
  4. Lemmatization
    Words often occur in different forms while referring to the same object/action. To prevent computers from treating different forms of a word as different words, we perform lemmatization: the process of grouping the inflected forms of a word so they can be analyzed as a single item, identified by the word's lemma (its dictionary form).
  5. Stop words
    Extremely common words such as "and", "the", and "a" add little value, so we flag them as stop words and exclude them from any analysis performed on the text.
  6. Dependency parsing
    A dependency parser assigns a syntactic structure to the sentence, capturing the relationships between its words.
  7. Noun phrases
    Grouping the words of each noun phrase together can help simplify sentences for cases where we don't care about adjectives.
  8. Named entity recognition
    Named entity recognition models label entities such as person names, company names, and geographic locations.
  9. Coreference resolution
    Because NLP models analyze individual sentences, they can be confused by pronouns referring to nouns in other sentences. To address this, we use coreference resolution to keep track of which noun each pronoun refers to.
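
        To make these steps concrete, here is a minimal sketch of a few of them using NLTK, the library used later in this project. The example sentence is made up for illustration, and the nltk.download calls fetch the required models/corpora on first run.

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('stopwords')
nltk.download('wordnet')

text = "Amazon reported strong results. Revenues were growing faster than expected."

sentences = nltk.sent_tokenize(text)                 # 1. sentence segmentation
tokens = [nltk.word_tokenize(s) for s in sentences]  # 2. tokenization
tagged = [nltk.pos_tag(t) for t in tokens]           # 3. part-of-speech tagging

lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

# 4./5. lemmatize (verb form, as the project does later) and drop stop words
cleaned = [
    lemmatizer.lemmatize(word.lower(), 'v')
    for sent in tokens for word in sent
    if word.isalpha() and word.lower() not in stop_words]

print(tagged[0][:5])
print(cleaned)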

        For a more in-depth description of NLP, read this.

        After completing these steps, our text is ready for analysis. Now that we understand NLP better, let's take a look at my project code (from Project 5 of Udacity's AI for Trading course). Click here to view the full GitHub repository.

3. NLP data import/download

        First, we make the necessary imports; project_helper contains various utility and graphics functions.

import nltk
import numpy as np
import pandas as pd
import pickle
import pprint
import project_helper


from tqdm import tqdm

Then we download the stopwords corpus for stopword removal and the wordnet corpus for lemmatization.

nltk.download('stopwords')
nltk.download('wordnet')

4. Get 10-K data

        A 10-K filing includes information such as company history, organizational structure, executive compensation, equity, subsidiaries, and audited financial statements. To look up 10-K documents, we use each company's unique CIK (Central Index Key).

cik_lookup = {
    'AMZN': '0001018724',
    'BMY': '0000014272',   
    'CNP': '0001130310',
    'CVX': '0000093410',
    'FL': '0000850209',
    'FRT': '0000034903',
    'HON': '0000773840'}

        We now fetch the list of 10-K filings from the SEC and display them, using Amazon's data as an example.

from bs4 import BeautifulSoup

sec_api = project_helper.SecAPI()

def get_sec_data(cik, doc_type, start=0, count=60):
    # Build the EDGAR ATOM feed URL for the given CIK and filing type
    rss_url = 'https://www.sec.gov/cgi-bin/browse-edgar?action=getcompany' \
        '&CIK={}&type={}&start={}&count={}&owner=exclude&output=atom' \
        .format(cik, doc_type, start, count)
    sec_data = sec_api.get(rss_url)
    feed = BeautifulSoup(sec_data.encode('ascii'), 'xml').feed
    entries = [
        (
            entry.content.find('filing-href').getText(),
            entry.content.find('filing-type').getText(),
            entry.content.find('filing-date').getText())
        for entry in feed.find_all('entry', recursive=False)]

    return entries
example_ticker = 'AMZN'
sec_data = {}
for ticker, cik in cik_lookup.items():
    sec_data[ticker] = get_sec_data(cik, '10-K')
pprint.pprint(sec_data[example_ticker][:5])

        We receive a list of URLs that point to files containing metadata associated with each filing. The metadata is not what we need, so we download the filing itself by replacing each index URL with the filing URL. We use tqdm to track download progress and then look at an example document.

raw_fillings_by_ticker = {}
for ticker, data in sec_data.items():
    raw_fillings_by_ticker[ticker] = {}
    for index_url, file_type, file_date in tqdm(data, desc='Downloading {} Fillings'.format(ticker), unit='filling'):
        if (file_type == '10-K'):
            file_url = index_url.replace('-index.htm', '.txt').replace('.txtl', '.txt')            
            
            raw_fillings_by_ticker[ticker][file_date] = sec_api.get(file_url)
print('Example Document:\n\n{}...'.format(next(iter(raw_fillings_by_ticker[example_ticker].values()))[:1000]))

        Next, we break each downloaded filing into its constituent documents. Within a filing, a <DOCUMENT> tag marks the beginning of each document and a </DOCUMENT> tag marks its end.

import re

def get_documents(text):
    extracted_docs = []

    doc_start_pattern = re.compile(r'<DOCUMENT>')
    doc_end_pattern = re.compile(r'</DOCUMENT>')

    # Positions just after each <DOCUMENT> tag and just before each </DOCUMENT> tag
    doc_start_is = [x.end() for x in doc_start_pattern.finditer(text)]
    doc_end_is = [x.start() for x in doc_end_pattern.finditer(text)]

    for doc_start_i, doc_end_i in zip(doc_start_is, doc_end_is):
        extracted_docs.append(text[doc_start_i:doc_end_i])

    return extracted_docs
filling_documents_by_ticker = {}
for ticker, raw_fillings in raw_fillings_by_ticker.items():
    filling_documents_by_ticker[ticker] = {}
    for file_date, filling in tqdm(raw_fillings.items(), desc='Getting Documents from {} Fillings'.format(ticker), unit='filling'):
        filling_documents_by_ticker[ticker][file_date] = get_documents(filling)
print('\n\n'.join([
    'Document {} Filed on {}:\n{}...'.format(doc_i, file_date, doc[:200])
    for file_date, docs in filling_documents_by_ticker[example_ticker].items()
    for doc_i, doc in enumerate(docs)][:3]))

        Define the get_document_type function to return the type of a given document.

def get_document_type(doc):
    
    type_pattern = re.compile(r'<TYPE>[^\n]+')
    
    doc_type = type_pattern.findall(doc)[0][len('<TYPE>'):] 
    
    return doc_type.lower()

        Use the get_document_type function to filter out non-10-K documents from the filings.

ten_ks_by_ticker = {}
for ticker, filling_documents in filling_documents_by_ticker.items():
    ten_ks_by_ticker[ticker] = []
    for file_date, documents in filling_documents.items():
        for document in documents:
            if get_document_type(document) == '10-k':
                ten_ks_by_ticker[ticker].append({
                    'cik': cik_lookup[ticker],
                    'file': document,
                    'file_date': file_date})
project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['cik', 'file', 'file_date'])

5. Preprocessing data

        To clean up the document text, we remove the HTML and convert all text to lowercase.

def remove_html_tags(text):
    text = BeautifulSoup(text, 'html.parser').get_text()
    
    return text
def clean_text(text):
    text = text.lower()
    text = remove_html_tags(text)
    
    return text

        Use the clean_text function to clean up the document.

for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Cleaning {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_clean'] = clean_text(ten_k['file'])
project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_clean'])

        Now we lemmatize all the data.

from nltk.stem import WordNetLemmatizer
from nltk.corpus import wordnet
def lemmatize_words(words):

    lemmatized_words = [WordNetLemmatizer().lemmatize(word, 'v') for word in words]
    
    return lemmatized_words
word_pattern = re.compile(r'\w+')
for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Lemmatize {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = lemmatize_words(word_pattern.findall(ten_k['file_clean']))
project_helper.print_ten_k_data(ten_ks_by_ticker[example_ticker][:5], ['file_lemma'])

Remove stop words.

from nltk.corpus import stopwords
lemma_english_stopwords = lemmatize_words(stopwords.words('english'))
for ticker, ten_ks in ten_ks_by_ticker.items():
    for ten_k in tqdm(ten_ks, desc='Remove Stop Words for {} 10-Ks'.format(ticker), unit='10-K'):
        ten_k['file_lemma'] = [word for word in ten_k['file_lemma'] if word not in lemma_english_stopwords]
print('Stop Words Removed')

6. 10-K sentiment analysis

        We perform sentiment analysis on the 10-Ks using the Loughran-McDonald sentiment word list, which was built specifically for finance-related text analysis.

sentiments = ['negative', 'positive', 'uncertainty', 'litigious', 'constraining', 'interesting']

sentiment_df = pd.read_csv('loughran_mcdonald_master_dic_2018.csv')
sentiment_df.columns = [column.lower() for column in sentiment_df.columns] # Lowercase the columns for ease of use

# Remove unused information
sentiment_df = sentiment_df[sentiments + ['word']]
sentiment_df[sentiments] = sentiment_df[sentiments].astype(bool)
sentiment_df = sentiment_df[(sentiment_df[sentiments]).any(axis=1)]  # Keep only words tagged with at least one sentiment

# Apply the same preprocessing to these words as the 10-k words
sentiment_df['word'] = lemmatize_words(sentiment_df['word'].str.lower())
sentiment_df = sentiment_df.drop_duplicates('word')


sentiment_df.head()

Using the sentiment word lists, we generate a sentiment bag of words for each 10-K document; the bag of words counts the number of occurrences of each sentiment word in the document.

from collections import defaultdict, Counter
from sklearn.feature_extraction.text import CountVectorizer

def get_bag_of_words(sentiment_words, docs):
    # Count occurrences of the sentiment words in each document
    vec = CountVectorizer(vocabulary=sentiment_words)
    vectors = vec.fit_transform(docs)
    words_list = vec.get_feature_names()
    bag_of_words = np.zeros([len(docs), len(words_list)])

    for i in range(len(docs)):
        bag_of_words[i] = vectors[i].toarray()[0]

    return bag_of_words.astype(int)
sentiment_bow_ten_ks = {}
for ticker, ten_ks in ten_ks_by_ticker.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sentiment_bow_ten_ks[ticker] = {
        sentiment: get_bag_of_words(sentiment_df[sentiment_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments}
project_helper.print_ten_k_data([sentiment_bow_ten_ks[example_ticker]], sentiments)

7. Jaccard similarity

        Now that we have the bags of words, we can convert them to boolean arrays and calculate the Jaccard similarity. Jaccard similarity is defined as the size of the intersection of two sets divided by the size of their union. For example, the Jaccard similarity between two sentences is the number of words the sentences have in common divided by the total number of unique words across both sentences. The closer the Jaccard similarity is to 1, the more similar the sets are. To make our calculations easier to understand, we plot the Jaccard similarities over time.
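
        As a quick illustration of this definition, here is a toy example with made-up word sets (not part of the project code):

doc_a = {'revenue', 'grow', 'risk', 'competition'}
doc_b = {'revenue', 'decline', 'risk', 'litigation'}

jaccard = len(doc_a & doc_b) / len(doc_a | doc_b)
print(jaccard)  # 2 shared words / 6 unique words = 0.33...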

from sklearn.metrics import jaccard_similarity_score

def get_jaccard_similarity(bag_of_words_matrix):
    jaccard_similarities = []
    bag_of_words_matrix = np.array(bag_of_words_matrix, dtype=bool)

    # Compare each filing with the next one chronologically
    for i in range(len(bag_of_words_matrix) - 1):
        u = bag_of_words_matrix[i]
        v = bag_of_words_matrix[i + 1]
        jaccard_similarities.append(jaccard_similarity_score(u, v))

    return jaccard_similarities
# Get dates for the universe
file_dates = {
    ticker: [ten_k['file_date'] for ten_k in ten_ks]
    for ticker, ten_ks in ten_ks_by_ticker.items()}
jaccard_similarities = {
    ticker: {
        sentiment_name: get_jaccard_similarity(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sentiments.items()}
    for ticker, ten_k_sentiments in sentiment_bow_ten_ks.items()}
project_helper.plot_similarities(
    [jaccard_similarities[example_ticker][sentiment] for sentiment in sentiments],
    file_dates[example_ticker][1:],
    'Jaccard Similarities for {} Sentiment'.format(example_ticker),
    sentiments)

8. TFIDF

        From the list of sentiment words, let's generate term frequency-inverse document frequency (TFIDF) values for the 10-K documents. TFIDF is an information retrieval technique that reflects how important a word/term is within a collection of documents. Each term is assigned a term frequency (TF) score and an inverse document frequency (IDF) score, and the product of the two is the term's TFIDF weight. Higher TFIDF weights indicate rarer terms and lower TFIDF weights indicate more common terms.
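
        As a quick illustration, here is a toy three-document corpus (made up for this example, not part of the project data) showing how TfidfVectorizer down-weights terms that appear everywhere and up-weights rarer ones:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

corpus = [
    'revenue grow revenue grow',
    'revenue decline litigation risk',
    'revenue grow risk']

vec = TfidfVectorizer()
weights = vec.fit_transform(corpus).toarray()
terms = vec.get_feature_names_out()  # use vec.get_feature_names() on older scikit-learn

print(pd.DataFrame(weights, columns=terms).round(2))
# 'revenue' appears in every document, so it has the lowest IDF; within the second
# document, 'decline' and 'litigation' (which appear nowhere else) carry the highest weights.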

from sklearn.feature_extraction.text import TfidfVectorizer
def get_tfidf(sentiment_words, docs):
    
    vec = TfidfVectorizer(vocabulary=sentiment_words)
    tfidf = vec.fit_transform(docs)
    
    return tfidf.toarray()
sentiment_tfidf_ten_ks = {}
for ticker, ten_ks in ten_ks_by_ticker.items():
    lemma_docs = [' '.join(ten_k['file_lemma']) for ten_k in ten_ks]
    
    sentiment_tfidf_ten_ks[ticker] = {
        sentiment: get_tfidf(sentiment_df[sentiment_df[sentiment]]['word'], lemma_docs)
        for sentiment in sentiments}
project_helper.print_ten_k_data([sentiment_tfidf_ten_ks[example_ticker]], sentiments)

9. Cosine similarity

        From our TFIDF values, we can calculate the cosine similarity and plot it over time. Like Jaccard similarity, cosine similarity is a metric for determining how similar documents are. It measures the cosine of the angle between two vectors projected in a multidimensional space, regardless of their magnitude. For text analysis, the two vectors are typically arrays containing the word counts (or TFIDF weights) of the two documents.
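
        A toy illustration of the calculation (made-up count vectors, not project data):

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

v1 = np.array([[3, 0, 2]])              # word counts for one document
v2 = np.array([[6, 0, 4]])              # same proportions, twice the counts
v3 = np.array([[0, 5, 0]])              # no words in common with v1

print(cosine_similarity(v1, v2)[0, 0])  # 1.0 -- same direction, magnitude ignored
print(cosine_similarity(v1, v3)[0, 0])  # 0.0 -- orthogonal vectors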

from sklearn.metrics.pairwise import cosine_similarity

def get_cosine_similarity(tfidf_matrix):
    cosine_similarities = []

    # Compare each filing's TFIDF vector with the next one chronologically
    for i in range(len(tfidf_matrix) - 1):
        cosine_similarities.append(
            cosine_similarity(tfidf_matrix[i].reshape(1, -1), tfidf_matrix[i + 1].reshape(1, -1))[0, 0])

    return cosine_similarities
cosine_similarities = {
    ticker: {
        sentiment_name: get_cosine_similarity(sentiment_values)
        for sentiment_name, sentiment_values in ten_k_sentiments.items()}
    for ticker, ten_k_sentiments in sentiment_tfidf_ten_ks.items()}
project_helper.plot_similarities(
    [cosine_similarities[example_ticker][sentiment] for sentiment in sentiments],
    file_dates[example_ticker][1:],
    'Cosine Similarities for {} Sentiment'.format(example_ticker),
    sentiments)

10. Price Data

        Now, we'll evaluate the alpha factor by comparing it to the stock's annual pricing. We can download pricing data from QuoteMedia.

pricing = pd.read_csv('yr-quotemedia.csv', parse_dates=['date'])
pricing = pricing.pivot(index='date', columns='ticker', values='adj_close')

pricing

11. Convert the data to a DataFrame

        Alphalens is a Python library for analyzing the performance of alpha factors. It works with DataFrames, so we have to convert our dictionary into a DataFrame.


cosine_similarities_df_dict = {'date': [], 'ticker': [], 'sentiment': [], 'value': []}

for ticker, ten_k_sentiments in cosine_similarities.items():
    for sentiment_name, sentiment_values in ten_k_sentiments.items():
        for sentiment_i, sentiment_value in enumerate(sentiment_values):
            cosine_similarities_df_dict['ticker'].append(ticker)
            cosine_similarities_df_dict['sentiment'].append(sentiment_name)
            cosine_similarities_df_dict['value'].append(sentiment_value)
            cosine_similarities_df_dict['date'].append(file_dates[ticker][1:][sentiment_i])

cosine_similarities_df = pd.DataFrame(cosine_similarities_df_dict)
# Truncate filing dates to the year so they align with the annual pricing data
cosine_similarities_df['date'] = pd.DatetimeIndex(cosine_similarities_df['date']).year
cosine_similarities_df['date'] = pd.to_datetime(cosine_similarities_df['date'], format='%Y')
cosine_similarities_df.head()

Before taking advantage of many of the alphalens functions, we need to align indices and convert times to unix timestamps.

import alphalens as al
factor_data = {}
skipped_sentiments = []
for sentiment in sentiments:
    cs_df = cosine_similarities_df[(cosine_similarities_df['sentiment'] == sentiment)]
    cs_df = cs_df.pivot(index='date', columns='ticker', values='value')
    
    try:
        data = al.utils.get_clean_factor_and_forward_returns(cs_df.stack(), pricing.loc[cs_df.index], quantiles=5, bins=None, periods=[1])
        factor_data[sentiment] = data
    except:
        skipped_sentiments.append(sentiment)
if skipped_sentiments:
    print('\nSkipped the following sentiments:\n{}'.format('\n'.join(skipped_sentiments)))
factor_data[sentiments[0]].head()

        We also create factor DataFrames indexed by Unix time so they are compatible with Alphalens's factor_rank_autocorrelation and mean_return_by_quantile functions.

unixt_factor_data = {
    factor: data.set_index(pd.MultiIndex.from_tuples(
        [(x.timestamp(), y) for x, y in data.index.values],
        names=['date', 'asset']))
    for factor, data in factor_data.items()}

12. Factor returns

        Let's look at the factor returns over time.

ls_factor_returns = pd.DataFrame()
for factor_name, data in factor_data.items():
    ls_factor_returns[factor_name] = al.performance.factor_returns(data).iloc[:, 0]
(1 + ls_factor_returns).cumprod().plot()

        As expected, 10-K filings expressing positive sentiment generated the largest gains, while 10-K filings containing negative sentiment resulted in the largest losses.

13. Turnover analysis

        Using factor rank autocorrelation, we can analyze the stability of the alpha factor over time. We want the factor ranks to remain relatively stable from period to period, which implies low turnover.

ls_FRA = pd.DataFrame()
for factor, data in unixt_factor_data.items():
    ls_FRA[factor] = al.performance.factor_rank_autocorrelation(data)
ls_FRA.plot(title="Factor Rank Autocorrelation")

14. Sharpe Ratio

        Finally, let's calculate the Sharpe ratio: the average return minus the risk-free return, divided by the standard deviation of the returns. Here the risk-free return is taken as zero and daily returns are annualized by a factor of sqrt(252).
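
        As a small worked example of this calculation (made-up daily returns, with the risk-free rate taken as zero, as in the project code below):

import numpy as np

daily_returns = np.array([0.001, -0.002, 0.003, 0.0005, -0.001])  # hypothetical daily factor returns
sharpe_annualized = np.sqrt(252) * daily_returns.mean() / daily_returns.std()  # annualize with sqrt(252)
print(round(sharpe_annualized, 2))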

daily_annualization_factor = np.sqrt(252)  
(daily_annualization_factor * ls_factor_returns.mean() / ls_factor_returns.std()).round(2)

        A Sharpe ratio of 1 is considered acceptable, a ratio of 2 is very good, and a ratio of 3 is excellent. As expected, positive sentiment is associated with a high Sharpe ratio and negative sentiment with a low Sharpe ratio. The other sentiments are also associated with high Sharpe ratios. However, replicating these returns in the real world is much harder, because so many complex factors affect stock prices.

References and Citations

[1] Udacity, Artificial Intelligence for Trading, GitHub

 
