cs224-assignment1

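The assignment itself is simple. The annoying part is that it depends on two corpora that are not easy to download: one is the corpus in NLTK, the other is "word2vec-google-news-300" from gensim. The first one ran directly for me; the second needs a VPN or proxy to download, so for convenience a link is attached at the end of this post.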

Note: I am a beginner, so the code may not be the most concise or clear. This post is only a record for myself.

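Q1 First, run the setup code to make sure every imported package is present and the environment meets the requirements. Then read each document among the fileids of the "crude" category in NLTK's reuters corpus, adding "start" at the beginning and "end" at the end of every document.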
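The reading step described above can be sketched roughly as follows. This is a minimal sketch, not the assignment's own starter code: the helper names `wrap_documents` and `read_crude_corpus` are made up for illustration, and the token strings simply follow the "start"/"end" wording used in this post. `read_crude_corpus` assumes the Reuters corpus has already been downloaded with NLTK.

```python
def wrap_documents(docs, start_token="start", end_token="end"):
    """Put a start token before and an end token after every document.

    `docs` is an iterable of word lists; the token strings are assumptions
    taken from the wording of this post.
    """
    return [[start_token] + list(doc) + [end_token] for doc in docs]


def read_crude_corpus():
    """Read all Reuters documents in the "crude" category (hypothetical helper).

    NLTK is imported lazily so that wrap_documents can be used without it.
    """
    from nltk.corpus import reuters
    files = reuters.fileids("crude")
    return wrap_documents(reuters.words(f) for f in files)


# Small check that does not need NLTK at all.
print(wrap_documents([["oil", "prices", "rose"]]))
```

Keeping the wrapping step separate from the NLTK call makes it easy to try the later functions on a tiny hand-written corpus first.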

Q1.1 Implement a function that collects the distinct (de-duplicated) words of the corpus and counts them.

def distinct_words(corpus):
    corpus_words = []
    num_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    for sentence in corpus:  # each sentence is a list of words
        for word in sentence:
            corpus_words.append(word)  # collect the words of the current sentence
    corpus_words = list(set(corpus_words))  # drop duplicates
    corpus_words = sorted(corpus_words)
    num_corpus_words = len(corpus_words)
    # ------------------

    return corpus_words, num_corpus_words
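For comparison, the same result can be obtained more compactly with a set comprehension. This is only an alternative sketch; the name `distinct_words_compact` is hypothetical and not part of the assignment.

```python
def distinct_words_compact(corpus):
    """Return the sorted list of distinct words in `corpus` and its length."""
    vocab = sorted({word for sentence in corpus for word in sentence})
    return vocab, len(vocab)


toy_corpus = [
    ["start", "all", "that", "glitters", "end"],
    ["start", "all", "is", "well", "end"],
]
words, num_words = distinct_words_compact(toy_corpus)
print(words, num_words)
```

Sorting matters because the position of each word in this list later becomes its row index in the co-occurrence matrix.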

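Q1.2 Build the co-occurrence matrix from the corpus. M is the co-occurrence matrix, word2Ind maps each word to its row index in M, and window_size is the size of the context window used for counting.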

def compute_co_occurrence_matrix(corpus, window_size=4):

    words, num_words = distinct_words(corpus)
    M = None
    word2Ind = {}
    
    # ------------------
    # Write your implementation here.
    M = np.zeros((num_words,num_words))
    for i in range(0, num_words):  # build the word -> row index dictionary
        word2Ind[words[i]] = i
    # fill in the matrix
    for sentence in corpus:
        for i in range(0, len(sentence)):
            # context to the right of position i
            if i + window_size < len(sentence):
                for j in range(0, window_size):
                    M[word2Ind[sentence[i]], word2Ind[sentence[i+j+1]]] += 1
            else:  # the window is clipped at the end of the sentence
                k = len(sentence) - i - 1
                for j in range(0, k):
                    M[word2Ind[sentence[i]], word2Ind[sentence[i+j+1]]] += 1
            # context to the left of position i
            if i - window_size > 0:
                for j in range(0, window_size):
                    M[word2Ind[sentence[i]], word2Ind[sentence[i-j-1]]] += 1
            else:  # the window is clipped at the start of the sentence
                k = i
                for j in range(0, k):
                    M[word2Ind[sentence[i]], word2Ind[sentence[i-j-1]]] += 1
                 
    # ------------------

    return M, word2Ind
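The window logic can also be written with a single clipped index range instead of four branches. The sketch below is a simplified illustration that uses only the standard library and stores the counts in a `Counter` keyed by (center word, context word) rather than in a NumPy array. The name `co_occurrence_counts` is hypothetical.

```python
from collections import Counter


def co_occurrence_counts(corpus, window_size=4):
    """Count how often each context word appears within `window_size`
    positions of each center word, clipping the window at sentence edges."""
    counts = Counter()
    for sentence in corpus:
        for i, center in enumerate(sentence):
            lo = max(0, i - window_size)                   # clip on the left
            hi = min(len(sentence), i + window_size + 1)   # clip on the right
            for j in range(lo, hi):
                if j != i:                                 # skip the center word itself
                    counts[(center, sentence[j])] += 1
    return counts


toy_corpus = [["start", "all", "that", "glitters", "end"]]
counts = co_occurrence_counts(toy_corpus, window_size=1)
print(counts[("all", "that")], counts[("start", "that")])
```

Each count can then be copied into the dense matrix with `M[word2Ind[a], word2Ind[b]] = n`. Because every pair is counted from both sides, the resulting matrix is symmetric.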

Q1.3 Use TruncatedSVD from sklearn to reduce the dimensionality of the co-occurrence matrix.

def reduce_to_k_dim(M, k=2):
   
    n_iters = 10     # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced = svd.fit_transform(M)  # fit the SVD on M and project M in a single call
    # ------------------

    print("Done.")
    return M_reduced

scikit-learn documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
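As a rough reminder of what the reduction computes: with $n$ words, $M$ is an $n \times n$ matrix and the truncated SVD keeps only the $k$ largest singular values. The symbols $U_k$, $\Sigma_k$ and $V_k$ are introduced here for illustration. Since scikit-learn uses an approximate solver by default, the equalities hold only up to numerical approximation.

```latex
M \approx U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{n \times k},\;
\Sigma_k \in \mathbb{R}^{k \times k},\;
V_k \in \mathbb{R}^{n \times k}

M_{\text{reduced}} = U_k \Sigma_k = M V_k \in \mathbb{R}^{n \times k}
```

With $k = 2$, each row of $M_{\text{reduced}}$ gives the two coordinates at which the corresponding word is plotted.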

Q1.4 Plot the words at their reduced 2-dimensional coordinates, mainly using matplotlib.

def plot_embeddings(M_reduced, word2Ind, words):

    # ------------------
    # Write your implementation here.
    for w in words:  # the words to display
        X = M_reduced[word2Ind[w]][0]
        Y = M_reduced[word2Ind[w]][1]
        plt.scatter(X, Y, marker='x')
        plt.text(X, Y, w)
        
    plt.show()
    # ------------------

Figure:

Q1.5 Plot the given words using the co-occurrence matrix and analyze the result.

Figure:

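Result: oil-related words such as petroleum, industry, energy and oil cluster together, but bpd and barrels, which are both units of oil quantity, do not end up close to each other.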

Q2 This part mainly uses the gensim package to work with word2vec vectors.

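Run the code below. Without a VPN or proxy the download in this code fails, so you can download the vectors separately and then replace the loading code with one that reads the local copy.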

def load_word2vec():
    """ Load Word2Vec Vectors
        Return:
            wv_from_bin: All 3 million embeddings, each length 300
    """
    import gensim.downloader as api
    wv_from_bin = api.load("word2vec-google-news-300")
    # Note: `.vocab` exists in gensim 3.x; gensim 4.x replaced it with `.key_to_index`.
    vocab = list(wv_from_bin.vocab.keys())
    print("Loaded vocab size %i" % len(vocab))
    return wv_from_bin

wv_from_bin = load_word2vec()
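If the vectors were downloaded by hand, one possible replacement for the loader is sketched below. It assumes the downloaded file is in the original word2vec binary format. I have not verified what the shared archive actually contains, and both the function name `load_word2vec_local` and the path in the usage comment are hypothetical.

```python
import os


def load_word2vec_local(path):
    """Load word2vec vectors from a local file instead of downloading them.

    Assumes `path` points to a file in word2vec binary format.
    """
    if not os.path.isfile(path):
        raise FileNotFoundError("word2vec file not found: %s" % path)
    # Imported lazily so the path check works even without gensim installed.
    from gensim.models import KeyedVectors
    return KeyedVectors.load_word2vec_format(path, binary=True)


# Usage (hypothetical path, adjust to where the file was saved):
# wv_from_bin = load_word2vec_local("path/to/word2vec-google-news-300.gz")
```

Checking the path first gives a clearer error message than the one raised from deep inside the loader.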

Q2.1 Analysis of the word2vec plot

Figure:

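The words are the same ones used for the co-occurrence matrix, and it is obvious that the two methods cluster the words differently. venezuela and kuwait do not cluster together.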

Q2.2 Use gensim's most_similar function to find the words closest to a target word. Closeness is measured mainly by cosine similarity.

Here I used "mother":

wv_from_bin.most_similar("mother")

output:
[('daughter', 0.8706233501434326),
 ('grandmother', 0.8442240953445435),
 ('aunt', 0.8435925841331482),
 ('niece', 0.807008683681488),
 ('father', 0.7901482582092285),
 ('son', 0.768320620059967),
 ('sister', 0.7633353471755981),
 ('wife', 0.7550681829452515),
 ('stepmother', 0.7531880140304565),
 ('granddaughter', 0.7470966577529907)]

As the output shows, daughter has the highest score, and all of the results are words for family relationships.
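To make the ranking idea concrete, here is a small pure-Python sketch of a cosine-based nearest-neighbour search. It is not gensim's implementation, the name `most_similar_toy` is hypothetical, and the 3-dimensional vectors are invented for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between vectors u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar_toy(vectors, word, topn=3):
    """Rank all other words by cosine similarity to `word`, highest first."""
    query = vectors[word]
    scores = [(other, cosine_similarity(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:topn]


# Invented toy vectors, NOT real word2vec values.
toy_vectors = {
    "mother": [0.9, 0.8, 0.1],
    "daughter": [0.8, 0.9, 0.1],
    "father": [0.9, 0.3, 0.2],
    "oil": [0.0, 0.1, 0.9],
}
ranking = most_similar_toy(toy_vectors, "mother")
print(ranking)
```

The real model does the same kind of comparison over 3 million 300-dimensional vectors, which is why it is much slower on a CPU.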

Q2.3 Analysis of how near synonyms and antonyms are to each other, mainly using gensim's distance method.

# ------------------
# Write your synonym & antonym exploration code here.

w1 = "normal"
w2 = "regular"
w3 = "abnormal"
w1_w2_dist = wv_from_bin.distance(w1, w2)
w1_w3_dist = wv_from_bin.distance(w1, w3)

print("Synonyms {}, {} have cosine distance: {}".format(w1, w2, w1_w2_dist))
print("Antonyms {}, {} have cosine distance: {}".format(w1, w3, w1_w3_dist))

# ------------------

output:
Synonyms normal, regular have cosine distance: 0.6609785556793213
Antonyms normal, abnormal have cosine distance: 0.4773910641670227
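To read these numbers: gensim's distance is one minus the cosine similarity, so a smaller distance means the words are closer. The symbols $\mathbf{u}$ and $\mathbf{v}$ below stand for the two word vectors, and the similarities are rounded from the output above.

```latex
d(\mathbf{u}, \mathbf{v}) = 1 - \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}

\text{normal, regular: } 1 - 0.661 \approx 0.339
\qquad
\text{normal, abnormal: } 1 - 0.477 \approx 0.523
```

So in this vector space the antonym abnormal is actually closer to normal than the synonym regular is. One plausible reason is that antonyms tend to appear in very similar contexts.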

Q2.4 Word analogies; Q2.5 Incorrect analogies; Q2.6 Some words may carry bias (Guided Analysis of Bias in Word Vectors); Q2.7 Much the same as above

Q2.4-2.7 all rely mainly on the most_similar function for their analysis, so I did not run them... I was on a CPU... laggy and slow.
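For reference, an analogy query in gensim is typically written as `wv_from_bin.most_similar(positive=["woman", "king"], negative=["man"])`. The idea behind it can be sketched without the large model as below. This is a simplified illustration and not gensim's exact computation: the name `analogy` is hypothetical and the tiny vectors are invented.

```python
import math


def _cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def analogy(vectors, a, b, c):
    """Return the word x that best completes "a is to b as c is to x".

    The target vector is b - a + c; the three input words are excluded
    from the candidates.
    """
    target = [vb - va + vc
              for va, vb, vc in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: _cosine(vectors[w], target))


# Invented toy vectors, NOT real word2vec values.
toy_vectors = {
    "man": [1.0, 0.0, 0.0],
    "woman": [1.0, 1.0, 0.0],
    "king": [1.0, 0.0, 1.0],
    "queen": [1.0, 1.0, 1.0],
    "apple": [0.0, 1.0, -1.0],
}
print(analogy(toy_vectors, "man", "king", "woman"))
```

In the toy data the second coordinate plays the role of gender and the third of royalty, so king - man + woman lands exactly on queen. Real embeddings are far noisier, which is what Q2.5 to Q2.7 explore.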

Link to the "word2vec-google-news-300" vectors:

Link: https://pan.baidu.com/s/1A0RZZzXLXxmM-wDZCew52Q
Extraction code: oqxw

Foneone
