Recurrent Neural Networks (RNNs)

  • Sequences as inputs/outputs
  • Sequential processing
  • Turing complete
  • Memory through a hidden state persisted between timesteps
    • the same operation is applied at every timestep, invariant to position in the sequence
    • this weight sharing reduces the number of parameters needed
  • Output is fed back as input at the next timestep
  • Variable-sized inputs and outputs: encoder-decoder architectures
  • Three weight matrices (input-to-hidden, hidden-to-hidden, hidden-to-output) and two bias vectors (hidden, output)
  • Stateful: hidden state is kept across batches of inputs
  • Activation is usually sigmoid or tanh
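A minimal sketch of one RNN timestep in pure Python, assuming the usual naming (W_xh, W_hh, W_hy for the three weight matrices, b_h and b_y for the two bias vectors) and tanh as the hidden activation:

```python
import math

def matvec(W, x):
    # Matrix-vector product on nested lists
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y):
    # Hidden state update: h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h)
    pre = [a + b + c for a, b, c in zip(matvec(W_xh, x), matvec(W_hh, h), b_h)]
    h_new = [math.tanh(p) for p in pre]
    # Output logits: y_t = W_hy h_t + b_y (softmax is applied later)
    y = [a + b for a, b in zip(matvec(W_hy, h_new), b_y)]
    return h_new, y
```

The same `rnn_step` (same weights) is applied at every timestep, which is where the parameter sharing comes from; only `h` changes as the sequence is consumed.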
  • BPTT (backpropagation through time)
    • Vanishing/exploding gradients
      • BPTT repeatedly multiplies the gradient by the recurrent weight matrix; consider its eigendecomposition
      • If the largest eigenvalue magnitude is less than 1, gradients converge to 0 (vanish); if greater than 1, they explode to infinity over long sequences
      • Element-wise clipping
        • Clip each gradient component if its magnitude exceeds a threshold value
      • Norm clipping
        • Clip if $\|g\| > v$: $g \leftarrow \frac{gv}{\|g\|}$
        • v can be decided by trial and error
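The two clipping strategies above can be sketched as follows (`v` is the threshold hyperparameter, chosen by trial and error as noted):

```python
import math

def clip_elementwise(g, v):
    # Clip each gradient component independently into [-v, v]
    return [max(-v, min(v, x)) for x in g]

def clip_by_norm(g, v):
    # If ||g|| > v, rescale the whole gradient to g * v / ||g||;
    # this preserves the gradient's direction, unlike element-wise clipping
    norm = math.sqrt(sum(x * x for x in g))
    if norm > v:
        return [x * v / norm for x in g]
    return g
```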
  • Training stuff
    • Softmax, applied to every output vector simultaneously
      • A lower softmax temperature (e.g. between 0 and 0.5) sharpens the distribution: samples are more confident and hence more conservative
      • A temperature near 1 or above gives more diverse but less confident samples
    • Feed a character into the RNN, get a distribution over the character that comes next, sample from it, and feed the sample back in
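The sampling side of that loop can be sketched as a temperature-scaled softmax plus a categorical draw (pure Python; in generation, the sampled index is fed back into the RNN as the next input):

```python
import math
import random

def softmax(logits, temperature=1.0):
    # Lower temperature sharpens the distribution (more confident,
    # more conservative samples); higher temperature flattens it
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]

def sample(probs):
    # Draw an index from the categorical distribution via inverse CDF
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1
```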
  • Some basic patterns from here
    • The model first discovers the general word-space structure and then rapidly starts to learn the words.
    • It starts with short words and eventually gets the longer ones.
    • Topics and themes that span multiple words (and in general longer-term dependencies) start to emerge only much later.