From Word2Vec to FastText
Applications of Word2Vec in Deep Learning
- Text generation (Word2Vec + RNN/LSTM)
- Text classification (Word2Vec + CNN)
Text Generation
Neural network: essentially a nonlinear regression model built from a stack of simple formulas
A common (feed-forward) neural network
A neural network with memory
Feeding each input in isolation is therefore not enough; we want the classifier to remember the contextual relationships:
The purpose of an RNN is to take information with sequential relationships into account.
What is a sequential relationship? It is the context of information over time.
RNN
The state $S$ computed at each time step (the short-term memory):
$$S_t = f(U x_t + W S_{t-1})$$
The final output of the neuron is based on the last $S$:
$$O_t = \mathrm{softmax}(V S_t)$$
Simply put, for t = 5 this is equivalent to stretching one neuron into five copies, one per time step.
In other words, $S$ is what we call memory, because it records the information from t = 1 through 5.
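To make the two formulas concrete, here is a minimal NumPy sketch of one unrolled pass. The dimensions, the choice of tanh as $f$, and the random initialization are illustrative assumptions, not the only options:

```python
# A minimal sketch of the unrolled RNN above, using NumPy.
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d_in, d_hid, d_out = 4, 8, 3           # assumed sizes, for illustration
rng = np.random.default_rng(0)
U = rng.normal(size=(d_hid, d_in))     # input -> state
W = rng.normal(size=(d_hid, d_hid))    # previous state -> state
V = rng.normal(size=(d_out, d_hid))    # state -> output

S = np.zeros(d_hid)                    # S_0: empty memory
xs = rng.normal(size=(5, d_in))        # a sequence of 5 inputs (t = 1..5)
for x_t in xs:
    S = np.tanh(U @ x_t + W @ S)       # S_t = f(U x_t + W S_{t-1})
O = softmax(V @ S)                     # O_t = softmax(V S_t), from the last S
```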
LSTM
LSTM stands for Long Short-Term Memory: a short-term memory that is made to last a long time.
The most important component of an LSTM is the Cell State, which runs straight along the entire timeline and acts as the thread of memory.
Along the way it is modified by pointwise multiplication and addition operations that update the memory.
How much information flows in or out is controlled by valves called gates:
A value of 1 means: keep all of this information.
A value of 0 means: this information can be forgotten.
- Forget gate
Decides what information we should forget.
It looks at the previous state $h_{t-1}$ and the current input $x_t$, and the gate outputs a value between 0 and 1 (through a sigmoid, just like an activation function).
1 means: remember it! 0 means: forget it!
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$
- Input gate (memory gate)
Decides what to remember.
This gate is more complicated and works in two steps:
First, a sigmoid decides which information we need to update (out with the old).
Second, a tanh layer creates the candidate values $\tilde{C}_t$ for the new cell state.
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
- Cell state update
Update the old cell state to the new one, combining the forget and input gates through pointwise multiplication and addition:
$$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$$
- Output gate
Uses the memory to decide what value to output.
Now that the cell state has been updated, we use this memory line to determine the output (the $O_t$ here plays the same role as the per-step output of the plain RNN above):
$$O_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$$
$$h_t = O_t * \tanh(C_t)$$
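Wiring the four pieces together: below is a minimal NumPy sketch of a single LSTM step. The sizes, the weight layout over $[h_{t-1}, x_t]$, and the random initialization are assumptions for illustration:

```python
# A minimal NumPy sketch of one LSTM step, implementing the four
# equations above (forget gate, input gate, cell update, output gate).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

d_in, d_hid = 4, 8                          # assumed sizes
rng = np.random.default_rng(0)
def mat():                                  # a weight matrix over [h_{t-1}, x_t]
    return rng.normal(size=(d_hid, d_hid + d_in))
W_f, W_i, W_C, W_o = mat(), mat(), mat(), mat()
b_f = b_i = b_C = b_o = np.zeros(d_hid)

def lstm_step(h_prev, C_prev, x_t):
    z = np.concatenate([h_prev, x_t])       # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)            # forget gate
    i_t = sigmoid(W_i @ z + b_i)            # input (memory) gate
    C_tilde = np.tanh(W_C @ z + b_C)        # candidate cell state
    C_t = f_t * C_prev + i_t * C_tilde      # pointwise multiply + add
    o_t = sigmoid(W_o @ z + b_o)            # output gate
    h_t = o_t * np.tanh(C_t)                # new hidden state
    return h_t, C_t

h, C = np.zeros(d_hid), np.zeros(d_hid)
for x_t in rng.normal(size=(5, d_in)):      # run over a 5-step sequence
    h, C = lstm_step(h, C, x_t)
```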
Text Classification
Baseline: BoW + SVM
Deep Learning: CNN for Text
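As a concrete starting point, here is a minimal sketch of the BoW + SVM baseline with scikit-learn; the toy texts and labels are made up for illustration:

```python
# A minimal sketch of the BoW + SVM baseline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

texts = ["good movie", "great plot", "bad acting", "terrible film"]
labels = [1, 1, 0, 0]                       # 1 = positive, 0 = negative (toy data)

clf = make_pipeline(CountVectorizer(), LinearSVC())  # BoW counts -> linear SVM
clf.fit(texts, labels)
print(clf.predict(["great movie"]))         # likely [1] on this toy data
```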
CNN4Text
In images, convolution kernels implement effects such as blur and sharpen.
How do we carry this over to text?
$$C_i = f(W^T X_{i:i+h-1} + b)$$
- Convert the text into a "picture": stack the word vectors into a matrix
- Make the CNN one-dimensional: slide filters over windows of words (see the sketch below)
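A minimal NumPy sketch of the convolution formula above: slide one filter over windows of $h$ word vectors, then 1-max pool. The sentence length, embedding size, window size, and random data are assumptions:

```python
# Text convolution: C_i = f(W . X_{i:i+h-1} + b), narrow ("valid") style.
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 7, 5, 3                     # sentence length, embed dim, window size
X = rng.normal(size=(n, d))           # sentence as a stack of word vectors
W = rng.normal(size=(h, d))           # one convolution filter
b = 0.0

def relu(z):
    return np.maximum(z, 0.0)

# One output per window position: inner product of filter and window.
C = np.array([relu(np.sum(W * X[i:i + h]) + b) for i in range(n - h + 1)])
feature = C.max()                     # 1-max pooling -> one feature per filter
```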
RNN text generation → strict logic, word order preserved
CNN text classification → tolerant of local errors
Boundary handling:
Narrow (no padding) vs. wide (zero-padded) convolution
Stride: how many positions the filter moves at each step
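The narrow vs. wide distinction can be seen directly with NumPy's convolve modes; the toy signal and filter here are made up:

```python
# Narrow vs. wide boundary handling on a 1-D signal.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # signal of length n = 5
w = np.array([1.0, 1.0, 1.0])             # filter of size h = 3

print(np.convolve(x, w, mode="valid"))    # narrow: n - h + 1 = 3 outputs
print(np.convolve(x, w, mode="full"))     # wide:   n + h - 1 = 7 outputs
# A stride of s would keep every s-th output, shrinking the map by ~s.
```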
FastText
Compared with Word2Vec, FastText's key ingredients are:
- BoW → bag of bi-grams (keeps some local word order)
- Hashing Trick
- Hierarchical Softmax
FastText focuses on text classification, compressing space (memory) and accelerating time (training).
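To illustrate the first two tricks (a bag of uni- and bi-grams hashed into a fixed number of buckets, feeding a linear classifier), here is a scikit-learn sketch; the toy data is made up, and this only mimics the idea rather than the fastText library itself:

```python
# Bag of n-grams + hashing trick + linear classifier, FastText-style.
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = ["good movie", "great plot", "bad acting", "terrible film"]
labels = [1, 1, 0, 0]                         # toy data

vec = HashingVectorizer(ngram_range=(1, 2),   # BoW -> bag of uni+bi-grams
                        n_features=2**10)     # hashing trick: fixed memory
clf = make_pipeline(vec, LogisticRegression())
clf.fit(texts, labels)
print(clf.predict(["great movie"]))           # likely [1] on this toy data
```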