CS224N Learning - Lecture 1

This is my first note for CS224N (Natural Language Processing with Deep Learning). I hope I can finish this class in the near future!

Lecture 1 gave an outline of NLP and focused on a basic question in this domain: how do we represent the meaning of a word? To answer this question, three different methods were introduced in chronological order: WordNet, representing words by discrete symbols (one-hot encoding), and representing words by context (Word2Vec).

1.Why is NLP difficult?

NLP's difficulty can be traced back to the difficulty of human language itself, which was a big step in civilization. Language lets people pass knowledge down through the decades and across wide distances, breaking the limits of time and geography. Nowadays, nearly all human knowledge is stored in text form, so it would be interesting work to let computers understand and learn from that knowledge in order to build the "really useful" strong artificial intelligence.
However, it is not that easy for computers to understand human language. In fact, this task is sometimes difficult even for us humans. Here's a little comic about this:

A comic
In human language, different situations, different speakers, and even different word orders can change the meaning dramatically, and handling those effects became one of the biggest challenges in NLP.

2.WordNet

Let's go back to the simple question we mentioned at the beginning: how do we represent the meaning of a word? To answer this question, researchers first came up with WordNet in the mid-1980s.
WordNet is a thesaurus containing lists of synonym sets and hypernyms. Like a big dictionary, it lists the synonyms and hypernyms of each word, and people believed that the meaning of a word could be expressed by these synonyms and hypernyms. For example, in WordNet, the synonyms of the word "good" include "just", "goodness", "upright", etc. Those words do explain the meaning of "good" in some way.
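If you want to poke around WordNet yourself, here is a minimal sketch using the NLTK library (my own addition, not part of the lecture); it assumes the WordNet corpus has already been downloaded via nltk.download("wordnet").

```python
from nltk.corpus import wordnet as wn  # assumes nltk.download("wordnet") has been run

# List a few synonym sets (synsets) that contain the word "good",
# along with their member lemmas and hypernyms.
for synset in wn.synsets("good")[:5]:
    print(synset.name(), "-", synset.definition())
    print("  synonyms:", [lemma.name() for lemma in synset.lemmas()])
    print("  hypernyms:", [h.name() for h in synset.hypernyms()])
```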
However, the problems with resources like WordNet are apparent. First, it misses nuance, since some synonyms are only correct in certain contexts. Second, it takes human labor to create and adapt, which is costly. Last but not least, even with many specialists and a lot of effort, it still can't stay up to date, because new words come up every day, and as time goes by the dictionary becomes more and more difficult to maintain.
Problems with WordNet

3.Representing words by discrete symbols

Can we represent a word as a vector? That question leads to a simple solution. With one-hot encoding, we can transform a word into a vector, which is far easier to use in downstream tasks.
For example, we can translate the words "hotel" and "motel" into long vectors consisting only of 0s and 1s.
an example of one-hot vector
In this case, we assume that the whole corpus includes only 15 words, so every word in the corpus is represented as a 15-dimensional vector made up of 0s and 1s. As you can see, the word "motel" is the 11th word in our customized dictionary, so the 11th position of the vector is 1 and the others are 0. It is easy to see that the dimension of the word vector equals the number of distinct words in the corpus.
Representing words by discrete symbols is convenient and simple, and it does make sense in many NLP tasks such as calculating the similarity of two sentences or passages. But this method also has its Achilles' heel: since all the word vectors are orthogonal (perpendicular to each other), we cannot calculate the similarity of different words. Furthermore, for a large corpus, the dimension of each vector becomes incredibly large (2 million or even more), which leads to the "curse of dimensionality" and makes computation hard.
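To make the orthogonality problem concrete, here is a small sketch with a made-up five-word vocabulary (not the 15-word one from the figure): the dot product of any two distinct one-hot vectors is always zero, so they say nothing about how similar two words are.

```python
import numpy as np

# A made-up toy vocabulary; a real corpus could have millions of words
vocab = ["hotel", "motel", "banking", "crisis", "good"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def one_hot(word):
    """Return a |V|-dimensional vector with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[word_to_index[word]] = 1.0
    return vec

hotel, motel = one_hot("hotel"), one_hot("motel")
print(hotel)                 # [1. 0. 0. 0. 0.]
print(motel)                 # [0. 1. 0. 0. 0.]
print(np.dot(hotel, motel))  # 0.0 -- the vectors say nothing about similarity
```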

4.Representing words by context

That awkward situation in NLP was ended by Word2Vec, an epoch-making method for converting words to vectors, created by researchers at Google.
Distributional semantics is the core of this algorithm. It holds that "a word's meaning is given by the words that frequently appear close-by", and this is one of the most successful ideas of modern statistical NLP.
In short, when a word w appears in a text, its context is the set of words that appear nearby (within a fixed-size window), and we want to use the many contexts of w to build up a representation of w.
Use context to represent a word

And for each word, we get a dense vector instead of a one-hot vector:
Dense vector for each word

These word vectors are also called "word embeddings" or "word representations".
This paper by Google explains the algorithm clearly. The main idea is as follows:

  1. We have a large corpus of text
  2. Every word in a fixed vocabulary is represented by a vector
  3. Go through each position t in the text, which has a center word c and context (“outside”) words o
  4. Use the similarity of the word vectors for c and o to calculate the probability of o given c (or vice versa)
  5. Keep adjusting the word vectors to maximize this probability

This is a classic situation where "banking" is the center word and we want to use it to predict its context. Since the window size is 2, we use the word vector of "banking" (initialized as a random vector) to predict the two words before it and the two words after it. If the prediction is wrong, we adjust the vector of "banking" to get a better prediction.
Use center word to predict contexts
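As a minimal sketch of how these (center, context) training pairs come about, the snippet below slides a window of size 2 over a made-up sentence (a stand-in for the one on the slide) and collects the context words around "banking".

```python
# Generate (center, context) pairs with a window size of 2
sentence = "problems turning into banking crises as before".split()
window = 2

pairs = []
for t, center in enumerate(sentence):
    for j in range(-window, window + 1):
        if j == 0 or not (0 <= t + j < len(sentence)):
            continue  # skip the center word itself and out-of-range positions
        pairs.append((center, sentence[t + j]))

# The pairs whose center word is "banking":
print([p for p in pairs if p[0] == "banking"])
# [('banking', 'turning'), ('banking', 'into'), ('banking', 'crises'), ('banking', 'as')]
```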

But how do we adjust the vectors? Well, that's the key point.
Going back to the last picture: for each center word c and context word o, we calculate a probability with softmax (v denotes the word vector of the center word c, and u the word vector of the outside word o):
Probability1
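Written out, this is the standard skip-gram softmax:

```latex
P(o \mid c) = \frac{\exp(u_o^\top v_c)}{\sum_{w \in V} \exp(u_w^\top v_c)}
```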
And over all positions in the text and all context words in their windows, we calculate an overall likelihood:
Overall likelihood probability
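In its standard form, the likelihood multiplies these probabilities over every position t = 1, ..., T in the corpus and every context word within a window of size m:

```latex
L(\theta) = \prod_{t=1}^{T} \; \prod_{\substack{-m \le j \le m \\ j \ne 0}} P(w_{t+j} \mid w_t; \theta)
```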
With this likelihood, we define the loss function:
Loss function
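The loss is the average negative log-likelihood, so minimizing it is the same as maximizing the likelihood above:

```latex
J(\theta) = -\frac{1}{T} \log L(\theta)
          = -\frac{1}{T} \sum_{t=1}^{T} \; \sum_{\substack{-m \le j \le m \\ j \ne 0}} \log P(w_{t+j} \mid w_t; \theta)
```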
Our goal is to minimize the loss function by adjusting θ (the collection of all the word vectors) step by step.

But how do we adjust θ? In fact, we use gradient descent to adjust it automatically. This passage introduces gradient descent in a simple way.
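As a toy illustration of the update rule θ ← θ − α∇J(θ), here is gradient descent on a simple quadratic loss (just the mechanics, not the Word2Vec loss itself):

```python
import numpy as np

def toy_loss(theta):
    return np.sum((theta - 3.0) ** 2)        # minimized at theta = 3

def toy_grad(theta):
    return 2.0 * (theta - 3.0)               # gradient of the loss above

theta = np.random.randn(5)                   # random starting point
alpha = 0.1                                  # learning rate (step size)
for step in range(100):
    theta = theta - alpha * toy_grad(theta)  # the gradient descent update

print(theta)  # every entry is now very close to 3
```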

So, after round after round of adjustment on a large corpus, we finally get what we want: a dense vector for each word. Unlike the one-hot vectors mentioned above, these context-based vectors achieve amazing results in many NLP tasks, and this approach became one of the most popular word-embedding methods of the past 10 years.
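In practice you rarely write this training loop yourself; here is a minimal sketch using the gensim library (not covered in the lecture; it assumes gensim 4.x and uses a tiny made-up corpus, so the resulting vectors are meaningless):

```python
from gensim.models import Word2Vec

# A tiny made-up corpus; real training needs a far larger one
sentences = [
    "problems turning into banking crises".split(),
    "the hotel is near the motel".split(),
    "banking regulation prevents crises".split(),
]

# sg=1 selects the skip-gram variant described in this note
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)

print(model.wv["banking"][:5])                   # first 5 dimensions of the dense vector
print(model.wv.most_similar("banking", topn=3))  # nearest words by cosine similarity
```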

5.Conclusion

In Lecture 1, we learned about why NLP is difficult and focused on a simple question: "How do we represent the meaning of a word?" Three different methods were introduced: WordNet, representing words by discrete symbols, and representing words by context.
Since this is the first time I have written in English, there must be a little "Chinglish", and I will definitely try my best to avoid it in the following notes. Maybe I will upload Lecture 2 in a few days!

References:

Stanford CS224N NLP with Deep Learning
Learning Note of 川陀学者, a Google software engineer
The original paper of Word2Vec
A Word2Vec tutorial (Skip-Gram Model)

Reposted from www.cnblogs.com/fowillwly/p/12198778.html