Predicting the Dow Jones Index from Ten Years of News Text with word2vec and a CNN

Copyright notice: if reposting, please credit Juanlyjack https://blog.csdn.net/m0_38088359/article/details/82861111

1. Data description:
(1) News data: historical news headlines crawled from the Reddit WorldNews channel (/r/worldnews). They are ranked by Reddit users' votes, and only the top 25 headlines are considered for each date. (Range: 2008-06-08 to 2016-07-01)
(2) Stock data: the Dow Jones Industrial Average (DJIA) is used as a proof of concept. (Range: 2008-08-08 to 2016-07-01)
File format: csv
Combined_News_DJIA.csv
This combined dataset has 27 columns. The first column is "Date", the second is "Label", and the remaining columns are the news headlines "Top1" through "Top25".
The label is "1" when the DJIA Adj Close rose or stayed flat, and "0" when it fell.
For evaluation, the data from 2008-08-08 to 2014-12-31 is used as the training set, and the test set is the following two years of data (2015-01-02 to 2016-07-01). This is roughly an 80/20 split.
The final result is evaluated with AUC.
Download link: https://pan.baidu.com/s/12Y2fVIJ7yhnlJGQCkyYysg password: xwg8
2. Text preprocessing
(1) Inspect the data

import pandas as pd
import numpy as np
from sklearn.metrics import roc_auc_score
from datetime import date
# Read the data
data = pd.read_csv('Combined_News_DJIA.csv',header=0,encoding='utf8')
# Look at the first five rows; the layout is simple and intuitive. Label 1 means the DJIA rose or stayed flat that day; label 0 means it fell.
data.head()
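
Since the label encodes up-or-flat versus down days, it is also worth checking how balanced the two classes are before modeling (a quick sketch; pandas' value_counts does the counting):

# Class distribution of the target: 1 = DJIA rose or stayed flat, 0 = fell
print(data['Label'].value_counts())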

(2) Split the dataset (train/test sets)

train = data[data['Date'] < '2015-01-01']
test = data[data['Date'] > '2014-12-31']

Next we treat every headline as a separate sentence and pool them all together to build a corpus of all sentences. Here we use flatten() to flatten all the text into a single numpy array, which serves as the corpus. Note, however, that X_train and X_test cannot simply be flattened the same way, because each of their rows must stay aligned with y_train and y_test.

X_train = train[train.columns[2:]]
corpus = X_train.values.flatten().astype(str)

X_train = X_train.values.astype(str)
X_train = np.array([' '.join(x) for x in X_train])
X_test = test[test.columns[2:]]
X_test = X_test.values.astype(str)
X_test = np.array([' '.join(x) for x in X_test])
y_train = train['Label'].values
y_test = test['Label'].values
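
As a quick sanity check of the split (a sketch; the counts in the comment are inferred from the array shapes printed later in this post):

# Number of trading days in each split, roughly 80/20
print(len(train), len(test))  # 1611 378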

After splitting, the corpus looks like this:

corpus[:3]

Output:

array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war"',
       "b'BREAKING: Musharraf to be impeached.'",
       "b'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)'"], 
      dtype='<U312')
X_train[:1]

Output:

array([ 'b"Georgia \'downs two Russian warplanes\' as countries move to brink of war" b\'BREAKING: Musharraf to be impeached.\' b\'Russia Today: Columns of troops roll into South Ossetia; footage from fighting (YouTube)\' b\'Russian tanks are moving towards the capital of South Ossetia, which has reportedly been completely destroyed by Georgian artillery fire\' b"Afghan children raped with \'impunity,\' U.N. official says - this is sick, a three year old was raped and they do nothing" b\'150 Russian tanks have entered South Ossetia whilst Georgia shoots down two Russian jets.\' b"Breaking: Georgia invades South Ossetia, Russia warned it would intervene on SO\'s side" b"The \'enemy combatent\' trials are nothing but a sham: Salim Haman has been sentenced to 5 1/2 years, but will be kept longer anyway just because they feel like it." b\'Georgian troops retreat from S. Osettain capital, presumably leaving several hundred people killed. [VIDEO]\' b\'Did the U.S. Prep Georgia for War with Russia?\' b\'Rice Gives Green Light for Israel to Attack Iran: Says U.S. has no veto over Israeli military ops\' b\'Announcing:Class Action Lawsuit on Behalf of American Public Against the FBI\' b"So---Russia and Georgia are at war and the NYT\'s top story is opening ceremonies of the Olympics?  What a fucking disgrace and yet further proof of the decline of journalism." b"China tells Bush to stay out of other countries\' affairs" b\'Did World War III start today?\' b\'Georgia Invades South Ossetia - if Russia gets involved, will NATO absorb Georgia and unleash a full scale war?\' b\'Al-Qaeda Faces Islamist Backlash\' b\'Condoleezza Rice: "The US would not act to prevent an Israeli strike on Iran." Israeli Defense Minister Ehud Barak: "Israel is prepared for uncompromising victory in the case of military hostilities."\' b\'This is a busy day:  The European Union has approved new sanctions against Iran in protest at its nuclear programme.\' b"Georgia will withdraw 1,000 soldiers from Iraq to help fight off Russian forces in Georgia\'s breakaway region of South Ossetia" b\'Why the Pentagon Thinks Attacking Iran is a Bad Idea - US News &amp; World Report\' b\'Caucasus in crisis: Georgia invades South Ossetia\' b\'Indian shoe manufactory  - And again in a series of "you do not like your work?"\' b\'Visitors Suffering from Mental Illnesses Banned from Olympics\' b"No Help for Mexico\'s Kidnapping Surge"'], 
      dtype='<U4424')

The target values look like this:

y_train[:5]

Output:

array([0, 1, 0, 0, 1])

Next we tokenize the data. For Chinese text I have usually used jieba; this time we use word_tokenize from the third-party nltk library. Note that each element of corpus and of X_train means something different: each element of corpus is a single sentence, while each element of X_train is the collection of all of one day's headline sentences (matching one label).

from nltk.tokenize import word_tokenize

corpus = [word_tokenize(x) for x in corpus]
X_train = [word_tokenize(x) for x in X_train]
X_test = [word_tokenize(x) for x in X_test]

Preprocessing steps:
* lowercase everything
* remove stopwords
* remove numbers and symbols
* lemmatize, so that all words take a uniform form, e.g. removing tense or person variations
The stopword list comes from stopwords in nltk.corpus. If you hit an error here, just follow the error message and download the corresponding nltk corpus, e.g. as shown below.
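
The resources this post needs can be fetched in one go (a minimal sketch; the resource names are the standard nltk ones for tokenization, stopwords, and WordNet lemmatization):

import nltk
nltk.download('punkt')      # tokenizer models used by word_tokenize
nltk.download('stopwords')  # stopword lists, including English
nltk.download('wordnet')    # WordNet data used by WordNetLemmatizer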

# Stopwords
from nltk.corpus import stopwords
stop = stopwords.words('english')

# Numbers
import re
def hasNumbers(inputString):
    return bool(re.search(r'\d', inputString))

# Special symbols
def isSymbol(inputString):
    return bool(re.match(r'[^\w]', inputString))

# lemma
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

def check(word):
    """
    Return True if the word should be kept,
    False if it should be removed.
    """
    word = word.lower()
    if word in stop:
        return False
    elif hasNumbers(word) or isSymbol(word):
        return False
    else:
        return True

# Put the checks above together
def preprocessing(sen):
    res = []
    for word in sen:
        if check(word):
            # This only strips the b'...' markers left over from Python byte strings stored as str; the data wasn't cleaned upstream, and other datasets won't have this issue
            word = word.lower().replace("b'", '').replace('b"', '').replace('"', '').replace("'", '')
            res.append(wordnet_lemmatizer.lemmatize(word))
    return res

Apply the preprocessing to the corpus and to the train/test sets separately:

corpus = [preprocessing(x) for x in corpus]
X_train = [preprocessing(x) for x in X_train]
X_test = [preprocessing(x) for x in X_test]
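
To sanity-check the pipeline, it helps to peek at the first few processed tokens of one training day (a sketch; the exact tokens depend on your nltk version and downloaded corpora):

# First 10 cleaned, lemmatized tokens of the first training day
print(X_train[0][:10])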

3. Training the NLP model
(1) A brief introduction to word2vec: the word2vec algorithm covers the skip-gram and CBOW models, trained with hierarchical softmax or negative sampling. word2vec is integrated into the gensim library.
The parameters in detail:
gensim.models.word2vec.Word2Vec(sentences=None, size=100, alpha=0.025, window=5, min_count=5, max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>, iter=5, null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)

· sentences: can be a list; for large corpora, it is recommended to use BrownCorpus, Text8Corpus, or LineSentence to stream them.
· sg: selects the training algorithm. The default 0 uses CBOW; sg=1 uses skip-gram.
· size: the dimensionality of the feature vectors, 100 by default. Larger sizes need more training data but give better results; 300-800 is a commonly recommended range.
· window: the maximum distance between the current word and the predicted word within a sentence.
· alpha: the learning rate.
· seed: seeds the random number generator, which affects the initialization of the word vectors.
· min_count: truncates the vocabulary; words with a frequency below min_count are dropped. Default 5.
· max_vocab_size: caps RAM usage while building the vocabulary. If the number of unique words exceeds this, the least frequent ones are pruned. Roughly 1 GB of RAM is needed per 10 million word types. None means no limit.
· sample: the threshold for randomly downsampling high-frequency words; default 1e-3, useful range (0, 1e-5).
· workers: controls how many worker threads are used for training.
· hs: if 1, hierarchical softmax is used; if 0 (default), negative sampling is used.
· negative: if > 0, negative sampling is used, and this sets how many noise words are drawn.
· cbow_mean: if 0, use the sum of the context word vectors; if 1 (default), use their mean. Only applies when CBOW is used.
· hashfxn: the hash function used to initialize the weights; Python's built-in hash function by default.
· iter: the number of iterations (epochs) over the corpus, 5 by default.
· trim_rule: the vocabulary trimming rule, specifying which words to keep and which to discard. Can be None (min_count is used) or a callable that accepts (word, count, min_count) and returns utils.RULE_DISCARD, utils.RULE_KEEP, or utils.RULE_DEFAULT.
· sorted_vocab: if 1 (default), words are sorted by descending frequency before word indexes are assigned.
· batch_words: the number of words per batch passed to the worker threads; default 10000.

from gensim.models.word2vec import Word2Vec
model = Word2Vec(corpus, size=128, window=5, min_count=5, workers=4)  # 128-dim vectors, matching vec_size below
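
Once trained, the model can be queried directly (a sketch using the pre-4.0 gensim API that the rest of this post relies on; 'war' is assumed to be in the vocabulary, which the headlines above make very likely):

# Vocabulary size and the vector for a single word
print(len(model.wv.vocab))
print(model.wv['war'].shape)  # (128,)
# Nearest neighbours in the embedding space
print(model.wv.most_similar('war', topn=5))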

We now build a 256×128 matrix to represent each day's news text. For each day's news, only the first 256 words are considered; shorter days are padded with all-zero vectors of length 128. vec_size refers to the size of our word vectors themselves,
and padding_size is there so that every matrix we generate ends up with the same dimensions.

def transform_to_matrix(x, padding_size=256, vec_size=128):
    res = []
    for sen in x:
        matrix = []
        for i in range(padding_size):
            try:
                matrix.append(model[sen[i]].tolist())
            except (KeyError, IndexError):
                # Two possible failures:
                # 1. the word is not in the vocabulary (KeyError)
                # 2. the sentence is shorter than padding_size (IndexError)
                # Either way, pad with an all-zero vector
                matrix.append([0] * vec_size)
        res.append(matrix)
    return res

Build the training and test sets:

X_train = transform_to_matrix(X_train)
X_test = transform_to_matrix(X_test)
print(X_train[123])

The output is a 256×128 matrix, too large to show here.
Before moving on, we reshape our input.
The reason is to wrap one extra dimension around each matrix, to tell the CNN model that each data point is an independent, single-channel input.

# Convert to numpy arrays for easier handling
X_train = np.array(X_train)
X_test = np.array(X_test)

# Check the array shapes
print(X_train.shape)
print(X_test.shape)

Output:

(1611, 256, 128)
(378, 256, 128)

Reshape the input:

X_train = X_train.reshape(X_train.shape[0], 1, X_train.shape[1], X_train.shape[2])
X_test = X_test.reshape(X_test.shape[0], 1, X_test.shape[1], X_test.shape[2])

print(X_train.shape)
print(X_test.shape)

Output:

(1611, 1, 256, 128)
(378, 1, 256, 128)

4. Defining the CNN model
Here we use the Keras framework to define the CNN model.

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Dense, Dropout, Activation, Flatten

# set parameters:
batch_size = 32
n_filter = 16
filter_length = 4
nb_epoch = 5
n_pool = 2

# Build a Sequential model; our input is (channels, height, width),
# so the spatial layers are told to use channels_first
model = Sequential()
model.add(Conv2D(n_filter, (filter_length, filter_length),
                 input_shape=(1, 256, 128), data_format='channels_first'))
model.add(Activation('relu'))
model.add(Conv2D(n_filter, (filter_length, filter_length),
                 data_format='channels_first'))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(n_pool, n_pool), data_format='channels_first'))
model.add(Dropout(0.25))
model.add(Flatten())
# A fully connected head on top
model.add(Dense(128))
model.add(Activation('relu'))
model.add(Dropout(0.5))
# Single sigmoid unit for binary classification
model.add(Dense(1))
model.add(Activation('sigmoid'))
# compile the model
model.compile(loss='binary_crossentropy',
              optimizer='adadelta',
              metrics=['accuracy'])
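
Before training, it is worth printing a summary to confirm the tensor shapes flowing through the network (exact parameter counts may vary slightly with the Keras version):

# Layer-by-layer output shapes and parameter counts
model.summary()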

Train the model:

model.fit(X_train, y_train, batch_size=batch_size, epochs=nb_epoch,
          verbose=0)
score = model.evaluate(X_test, y_test, verbose=0)
print('Test score:', score[0])
print('Test accuracy:', score[1])

Results:

Test score: 0.492063492221
Test accuracy: 0.507936509829
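
The task's stated metric is AUC rather than accuracy. Since the sigmoid output is a probability, we can score it with the roc_auc_score imported at the top of this post (a minimal sketch):

# AUC on the test set, using the predicted probability of class 1
y_prob = model.predict(X_test).ravel()
print('Test AUC:', roc_auc_score(y_test, y_prob))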
