learning resources for rnn

refs

I have tried, without being strict about it, to list these tutorials in historical or technical order. A star 🌟 marks a personal recommendation, and ✅ marks a post I have read.

papers

  • [GRU original paper, not high quality though ✅]
  • [RNN learning by karpathy and feifei]
  • [transformer: attention is all you need ✅]
  • [First attention mechanism: Neural Machine Translation by Jointly Learning to Align and Translate]
  • [Convolutional Sequence to Sequence Learning]

understandings

When we speak or write, the next word we use mostly depends on the previous words and the overall semantic context. If $h_t$ denotes a hidden state summarizing everything said up to step $t$, we can simply model it as follows:

$$
x_{t+1} = f(h_t)
$$

That is essentially what an RNN does: it maintains a hidden state and models the conditional probability of the next word given the previous words. Over the past years, RNNs have shown their strength in fields including, but not limited to, text generation, machine translation and speech recognition. However, because the architecture is so plain and naive, several problems manifest themselves.
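As a rough illustration of the formula above, here is a minimal sketch of one vanilla RNN step in NumPy. The weight names (`W_xh`, `W_hh`, `W_hy`), the toy sizes and the random initialization are all assumptions made for this example, not something taken from the referenced papers; nothing is trained here.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def rnn_step(x_t, h_prev, params):
    """One step of a vanilla RNN: update the hidden state, then score the next word."""
    W_xh, W_hh, W_hy, b_h, b_y = params
    h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)   # new hidden state
    p_next = softmax(W_hy @ h_t + b_y)                # P(x_{t+1} | x_1..x_t)
    return h_t, p_next

rng = np.random.default_rng(0)
vocab_size, hidden_size = 5, 8   # toy sizes, chosen only for illustration
params = (
    rng.normal(scale=0.1, size=(hidden_size, vocab_size)),   # W_xh
    rng.normal(scale=0.1, size=(hidden_size, hidden_size)),  # W_hh
    rng.normal(scale=0.1, size=(vocab_size, hidden_size)),   # W_hy
    np.zeros(hidden_size),                                   # b_h
    np.zeros(vocab_size),                                    # b_y
)

h = np.zeros(hidden_size)
for token_id in [0, 3, 1]:            # a made-up sequence of word ids
    x = np.zeros(vocab_size)
    x[token_id] = 1.0                 # one-hot encoding of the current word
    h, p_next = rnn_step(x, h, params)
```

After the loop, `p_next` is a probability distribution over the toy vocabulary, which plays the role of $f(h_t)$ in the formula.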

RNNs suffer from vanishing and exploding gradients. Exploding gradients can be handled with techniques such as gradient clipping. Vanishing gradients, however, call for a more sophisticated architecture, which is why Long Short-Term Memory (LSTM) was introduced. Besides, although in theory an RNN is able to ‘remember’ every previous hidden state no matter the distance, in practice it does not show this ability. LSTM partially resolves this problem as well.
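Gradient clipping can be sketched in a few lines. The version below rescales all gradients together when their global L2 norm exceeds a threshold; the function name `clip_by_global_norm` and the threshold value are assumptions for this example, and other variants (such as clipping each value independently) exist.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their joint L2 norm is at most max_norm."""
    total_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        grads = [g * scale for g in grads]   # direction is kept, length is shrunk
    return grads, total_norm

# Made-up gradients whose global norm is sqrt(9 + 16 + 144) = 13.
grads = [np.array([3.0, 4.0]), np.array([[12.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Clipping only bounds the size of an update; it does nothing for gradients that have already shrunk toward zero, which is why the vanishing case needs an architectural change.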

LSTM is strong enough to mitigate vanishing gradients and to keep a long history of previous hidden states, that is, content at the semantic level. But its architecture is complex, which makes it costly to train. The Gated Recurrent Unit (GRU) is intended to lighten the burden of LSTM while preserving most of its ability, as numerous papers and experiments have supported.
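To make the gating idea concrete, here is a minimal sketch of a single GRU step in NumPy. The parameter names and toy sizes are assumptions for this example, and the weights are random rather than trained. Note that references differ on whether the update gate `z` multiplies the old state or the candidate; this sketch uses one common convention.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, p):
    """One GRU step with an update gate z and a reset gate r."""
    z = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])   # how much to update
    r = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])   # how much history to use
    h_cand = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r * h_prev) + p["b_h"])
    # Convention assumed here: z near 0 keeps the old state, z near 1 takes the candidate.
    return (1.0 - z) * h_prev + z * h_cand

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 6   # toy sizes, chosen only for illustration
p = {}
for gate in ("z", "r", "h"):
    p[f"W_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, input_size))
    p[f"U_{gate}"] = rng.normal(scale=0.1, size=(hidden_size, hidden_size))
    p[f"b_{gate}"] = np.zeros(hidden_size)

h = np.zeros(hidden_size)
for x_t in rng.normal(size=(3, input_size)):   # three made-up input vectors
    h = gru_step(x_t, h, p)
```

The cell uses three groups of weights (update, reset, candidate), compared with four in a standard LSTM, and it keeps a single state vector instead of separate hidden and cell states.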

Building on these units, several Seq2Seq models have been proposed for problems that can be modeled as taking a sequence in and producing a sequence out, machine translation for example. Typically, these models use an encoder-decoder architecture. Specifically, one RNN / LSTM / GRU encodes the input sequence $S_{input}$ into a context vector / matrix $C$, and another RNN / LSTM / GRU decodes $C$ into the output sequence $S_{output}$. Because $C$ has a fixed length, the encoder-decoder architecture can handle input and output sequences whose lengths vary and differ from each other. Image captioning also benefits from this architecture; there the encoder is a CNN, as expected.
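The encoder-decoder idea can be sketched as follows, using plain tanh RNNs and greedy decoding. Everything here is an assumption for illustration: the toy vocabulary, the `BOS` / `EOS` ids, the weight names and the untrained random weights, so the decoded ids are meaningless. The point is only the data flow from $S_{input}$ to $C$ to $S_{output}$.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, hidden = 6, 8          # toy sizes
BOS, EOS = 0, 1               # hypothetical start / end token ids

def init(*shape):
    return rng.normal(scale=0.1, size=shape)

E_enc, W_enc_x, W_enc_h = init(vocab, hidden), init(hidden, hidden), init(hidden, hidden)
E_dec, W_dec_x, W_dec_h = init(vocab, hidden), init(hidden, hidden), init(hidden, hidden)
W_out = init(vocab, hidden)

def encode(src_ids):
    """Run the encoder RNN over the input; its last hidden state is the context C."""
    h = np.zeros(hidden)
    for i in src_ids:
        h = np.tanh(W_enc_x @ E_enc[i] + W_enc_h @ h)
    return h

def decode(C, max_len=10):
    """Greedy decoding: start from C and feed each predicted token back in."""
    h, token, out = C, BOS, []
    for _ in range(max_len):
        h = np.tanh(W_dec_x @ E_dec[token] + W_dec_h @ h)
        token = int(np.argmax(W_out @ h))
        if token == EOS:
            break
        out.append(token)
    return out

C_short = encode([2, 3, 4])
C_long = encode([2, 3, 4, 5, 2, 3, 4, 5])
output_ids = decode(C_short)
```

Both `C_short` and `C_long` have the same shape even though the inputs differ in length, which is exactly the fixed-length bottleneck that the text describes.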

The drawbacks of Seq2Seq are twofold. First, as long as you use an LSTM / GRU / RNN, you will find training hard. Second, LSTM / GRU share the RNN's problem: they cannot ‘remember’ context or state far from the current state. Intuitively, if the RNN / LSTM / GRU wants to use $h_{t-n}$, the information has to pass back through $n$ units, in other words a path that is linear in the distance. When the input sentence is too long, Seq2Seq loses its magical power.
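One way to see why distant states are hard to use is to track how strongly $h_t$ depends on an earlier state as the distance grows. The sketch below does this for a plain tanh RNN with no inputs; the recurrent matrix is a made-up example rescaled so that its largest singular value is 0.9, an assumption chosen to make the decay easy to observe.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden = 16

# Hypothetical recurrent matrix, rescaled so its largest singular value is 0.9.
W_hh = rng.normal(size=(hidden, hidden))
W_hh *= 0.9 / np.linalg.norm(W_hh, 2)

h = rng.normal(size=hidden)
jacobian = np.eye(hidden)     # accumulates d h_t / d h_0
norms = []
for t in range(1, 51):
    h = np.tanh(W_hh @ h)     # input term omitted to keep the sketch short
    # One-step Jacobian: d h_t / d h_{t-1} = diag(1 - h_t^2) @ W_hh
    step = (1.0 - h ** 2)[:, None] * W_hh
    jacobian = step @ jacobian
    norms.append(float(np.linalg.norm(jacobian, 2)))
```

Under this assumption each step shrinks the Jacobian by a factor of at most 0.9, so `norms[n - 1]` is bounded by `0.9 ** n`: the influence of a state $n$ steps back fades exponentially with the length of the path.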

Facebook proposed a Seq2Seq model based entirely on CNNs. Compared to an RNN, a CNN is far friendlier to GPUs and parallel training. It can also relate a state $n$ positions back through a path shorter than the $n$ steps an RNN needs: about $O(n/k)$ stacked layers for kernel width $k$, or $O(\log n)$ with dilated convolutions.

Another great mechanism is attention. When you translate a sentence, you actually focus on some words rather than on the entire sentence. Attention is meant to simulate this process. It lets the decoder focus on any of the encoder's hidden states, which not only improves the quality of the context $C$ but also addresses the problem of ‘forgetting’. Going further, Google's Attention Is All You Need proposed a model with no CNN or RNN at all, only attention layers. This model is known as the Transformer; see The Illustrated Transformer for a detailed and clear explanation, and harvardnlp's The Annotated Transformer for further explanation with code.
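As a small illustration of attention, the sketch below lets one decoder state attend over a set of encoder states using the scaled dot-product form from the Transformer paper. The first attention paper listed above used an additive scoring function instead, so this is one variant, not the only one. The sizes and random states are assumptions for the example, and the learned projections of a real model are left out.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Return the attended values and the attention weights, softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # similarity of each query to each key
    scores = scores - scores.max(axis=-1, keepdims=True) # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
enc_states = rng.normal(size=(7, 16))   # 7 made-up encoder hidden states
dec_state = rng.normal(size=(1, 16))    # 1 made-up decoder hidden state

# The decoder state is the query; encoder states serve as both keys and values.
context, weights = scaled_dot_product_attention(dec_state, enc_states, enc_states)
```

Here `weights` holds one weight per encoder position and sums to 1, and `context` is the corresponding weighted average of encoder states. Every encoder position is reachable in a single step, regardless of how far back it is.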