[notes] stanford engineering cs224n lecture 1

lecture 1

  1. the course

    an understanding of the effective modern methods for deep learning: basics first, then the key methods used in nlp

    a big picture understanding of human languages and the difficulties in understanding and producing them

    an understanding of and ability to build systems (in pytorch) for some of the major problems in nlp (word meaning, dependency parsing, machine translation, question answering)

    hw1: hopefully an easy on-ramp: an ipython notebook
    hw2: pure python (numpy), doing multivariate calculus so you really understand the basics
    hw3: introduces pytorch
    hw4/5: use pytorch on a gpu (microsoft azure)

  2. human language and word meaning

  • as a channel for transmitting information, human language is pathetically slow

  • how to represent the meaning of a word:
    the commonest linguistic way of thinking of meaning:
    signifier (symbol) <-> signified (idea or thing) = denotational semantics

  • problems with resources like wordnet (see the nltk sketch below)
    miss nuance
    miss new meanings of words
    subjective
    require human labor to create and adapt
    cannot compute accurate word similarity
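
to make these problems concrete, here is a minimal sketch (close to the lecture's demo) of querying wordnet through nltk; it assumes nltk is installed and the wordnet corpus has been fetched with nltk.download('wordnet'):

```python
# a sketch assuming nltk is installed and the wordnet data downloaded
from nltk.corpus import wordnet as wn

poses = {'n': 'noun', 'v': 'verb', 's': 'adj (s)', 'a': 'adj', 'r': 'adv'}

# print every synonym set for "good"; the output mixes registers and
# senses ("good", "goodness", "commodity", "beneficial", ...), which is
# exactly the missing-nuance problem listed above
for synset in wn.synsets("good"):
    print("{}: {}".format(poses[synset.pos()],
                          ", ".join(l.name() for l in synset.lemmas())))
```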

  • traditional nlp (up to 2012)
    regard words as discrete symbols
    one-hot vectors (every pair of distinct vectors is orthogonal, so there is no natural notion of word similarity; see the sketch below)
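
a quick numpy sketch of why one-hot vectors cannot encode similarity (the toy vocabulary is made up for illustration):

```python
import numpy as np

# toy vocabulary; real vocabularies run to hundreds of thousands of words
vocab = ["motel", "hotel", "house"]
index = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    v = np.zeros(len(vocab))
    v[index[word]] = 1.0
    return v

# the dot product of any two distinct one-hot vectors is 0:
# "motel" and "hotel" look no more alike than "motel" and "house"
print(one_hot("motel") @ one_hot("hotel"))   # 0.0
print(one_hot("motel") @ one_hot("house"))   # 0.0
```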

word vectors / word embeddings / word representations:
a dense vector for each word, chosen so that it is similar to the vectors of words that appear in similar contexts => vector space
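
a sketch of how similarity falls out of dense vectors, using invented 4-dimensional embeddings (real embeddings are typically 100-300 dimensional and learned, not hand-written):

```python
import numpy as np

# toy dense embeddings; these values are invented for illustration only
vecs = {
    "motel": np.array([0.29, 0.79, -0.40, 0.15]),
    "hotel": np.array([0.33, 0.74, -0.36, 0.19]),
    "house": np.array([0.80, -0.12, 0.31, -0.42]),
}

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# words that appear in similar contexts end up with similar vectors
print(cosine(vecs["motel"], vecs["hotel"]))  # close to 1
print(cosine(vecs["motel"], vecs["house"]))  # much smaller
```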

  3. word2vec introduction

idea:

  • we have a large corpus of text
  • every word in a fixed vocabulary is represented by a vector
  • go through each position t in the text, which has a center word c and context words o
  • use the similarity of the word vectors for c and o to calculate the probability of o given c
  • keep adjusting the word vectors to maximize this probability (the objective is written out below)
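
concretely, for a corpus of length T and a window of size m, the likelihood being maximized and the average negative log-likelihood objective being minimized are:

$$
L(\theta) = \prod_{t=1}^{T} \prod_{\substack{-m \le j \le m \\ j \neq 0}} P(w_{t+j} \mid w_t;\, \theta)
\qquad
J(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \sum_{\substack{-m \le j \le m \\ j \neq 0}} \log P(w_{t+j} \mid w_t;\, \theta)
$$

minimizing J(θ) is exactly maximizing the accuracy of predicting context words.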

softmax function
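in word2vec each word w gets two vectors, v_w when it is the center word and u_w when it is a context word, and the probability of o given c is a softmax over dot products:

$$
P(o \mid c) = \frac{\exp(u_o^{\top} v_c)}{\sum_{w \in V} \exp(u_w^{\top} v_c)}
\qquad
\mathrm{softmax}(x)_i = \frac{\exp(x_i)}{\sum_{j=1}^{n} \exp(x_j)}
$$

"max" because it amplifies the probability of the largest x_i, "soft" because it still assigns some probability to the smaller ones.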

  4. word2vec objective function gradients (key gradient sketched after this list)
  5. optimization basics
  6. looking at word vectors
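
a sketch of the two pieces items 4 and 5 refer to: the center-word gradient derived in the lecture, and the vanilla gradient-descent update used to minimize J(θ):

$$
\frac{\partial}{\partial v_c} \log P(o \mid c) = u_o - \sum_{w \in V} P(w \mid c)\, u_w
\qquad
\theta^{\text{new}} = \theta^{\text{old}} - \alpha\, \nabla_{\theta} J(\theta)
$$

the gradient has the pleasing form "observed context vector minus expected context vector", and α is the learning rate (in practice the update is done stochastically on sampled windows rather than over the whole corpus).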

gensim
it's a package for word and text similarity modeling, which started with (lda-style) topic models and grew into svd and neural word representations. it's efficient and scalable, and quite widely used.
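
a minimal sketch of "looking at word vectors" with gensim; the vectors file path is hypothetical (the class demo builds such a file from pretrained GloVe vectors, e.g. with gensim's glove2word2vec script):

```python
# a sketch assuming gensim is installed and a word2vec-format vectors
# file exists at the (hypothetical) path below
from gensim.models import KeyedVectors

vectors = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# nearest neighbours by cosine similarity in the embedding space
print(vectors.most_similar("banana"))

# the classic analogy: vector('king') - vector('man') + vector('woman')
# should land near vector('queen')
print(vectors.most_similar(positive=["king", "woman"], negative=["man"]))
```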

Gensim is a Python library for automatically extracting semantic topics from documents, and it is smart enough to do this with very little pain.
Gensim can process raw, unstructured digital text (plain text). Its algorithms, such as Latent Semantic Analysis (LSA), Latent Dirichlet Allocation, and Random Projections, discover the semantic structure of documents by examining the statistical co-occurrence patterns of the words in a training corpus. These algorithms are unsupervised, which means all you need is a corpus of documents.
Once these statistical patterns are found, any text can be expressed succinctly in a semantic representation and queried for topical similarity against other texts.
Core concepts
Corpus
A collection of digital documents, used to automatically infer their structure, topics, and so on. For this reason the corpus is also called the training corpus: the inferred latent structure is later used to assign topics to new documents, with no human intervention such as hand-labeling the documents.
Vector
In the vector space model, each document is represented as an array of features; for example, a single feature can be thought of as a question-answer pair:

1. How many times does the word splonge appear in the document? Zero.
2. How many paragraphs does the document consist of? Two.
3. How many fonts does the document use? Five.
Each question is usually represented by its integer id (here 1, 2, and 3), so this document becomes the series of pairs (1, 0.0), (2, 2.0), (3, 5.0). If we know all the questions in advance, we can leave them implicit and simply write (0.0, 2.0, 5.0). This sequence of answers is treated as a vector (here a three-dimensional one); for practical purposes, only questions whose answers can be converted to a single real number are considered.

Sparse Vector
Usually the answers to most questions are 0.0. To save space, we omit them from the document's representation and write only (2, 2.0), (3, 5.0) (dropping (1, 0.0)). Since the set of questions is known in advance, all features missing from a sparse representation of a document can unambiguously be taken to be 0.0.
Gensim does not prescribe any specific corpus format; whatever the format is, iterating over a corpus must yield these sparse vectors one at a time.
Model
We use a model to transform one document representation into another. Documents in Gensim are represented as vectors, so a model can be thought of as a transformation between two vector spaces. The details of this transformation are learned from the training corpus.
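
putting Corpus, Vector, and Model together, a minimal end-to-end sketch with a made-up two-document corpus; doc2bow produces exactly the sparse vectors described above, and TfidfModel is one concrete model (bag-of-words space -> tf-idf space):

```python
from gensim import corpora, models

# a made-up, already-tokenized training corpus of two tiny documents
texts = [["human", "computer", "interaction"],
         ["graph", "of", "trees", "and", "graph", "minors"]]

# map every token to an integer id
dictionary = corpora.Dictionary(texts)

# each document becomes a sparse vector of (token_id, count) pairs
bow_corpus = [dictionary.doc2bow(text) for text in texts]
print(bow_corpus[1])  # "graph" appears twice, so its pair is (id, 2)

# a model is a transformation between vector spaces, learned from the
# training corpus: here raw counts -> tf-idf weights
tfidf = models.TfidfModel(bow_corpus)
print(tfidf[bow_corpus[1]])
```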

Author: 迅速傅里叶变换
Link: https://www.jianshu.com/p/e21b59a46e4c
Source: 简书 (Jianshu)

Reposted from blog.csdn.net/deardeerluluu/article/details/89280755