Category Archives: Machine Learning

[Paper Summary] Attention Is All You Need


model architecture

encoder-decoder pairing in brief

transformer architecture in more detail

  • in each encoder and decoder layer, self-attention lets each position attend, in principle, to all positions of the previous layer
  • to keep the seq-to-seq transduction auto-regressive, the decoder’s self-attention masks all positions to the right of the current step
  • the output of the topmost encoder layer supplies the keys & values for the encoder-decoder attention in each decoder layer
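The rightward masking in the decoder can be pictured as a lower-triangular table of allowed connections. Here is a minimal sketch in plain Python; the helper name `causal_mask` is made up for illustration and does not come from the paper.

```python
def causal_mask(n):
    """Return an n x n table; entry [i][j] is True if position i may attend to position j."""
    # Position i may look at itself and at everything to its left (j <= i),
    # but never at positions to its right.
    return [[j <= i for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(" ".join("x" if allowed else "." for allowed in row))
```

Each printed row is one decoder position, and an `x` marks a position it is allowed to attend to.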

scaled dot-product and multi-head attention

scaled dot-product attention

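  • dot-product (multiplicative) attention is one of the two commonly used attention functions, the other being the additive attention of NMT by jointly learning to align and translate
  • for large dk the dot products grow large in magnitude and push the softmax into regions with very small gradients; a scaling factor (1/sqrt(dk)) counteracts this
  • in practice faster and more space-efficient than additive attention since it reduces to optimized matrix multiplication, and the scaling factor keeps its quality comparable for large dk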
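To make the scaling concrete, here is a minimal sketch of softmax(QK^T / sqrt(dk)) V in plain Python, with lists of lists standing in for matrices. The function names and the optional `mask` argument are assumptions for illustration; a real implementation would use batched matrix multiplication.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V, mask=None):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v. Returns n_q x d_v.

    mask[i][j] == False (hypothetical convention) blocks query i from key j.
    Every query must keep at least one key unmasked.
    """
    d_k = len(K[0])
    d_v = len(V[0])
    out = []
    for i, q in enumerate(Q):
        # Dot product of the query with every key, scaled by 1/sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        if mask is not None:
            scores = [s if mask[i][j] else float("-inf") for j, s in enumerate(scores)]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[c] for w, v in zip(weights, V)) for c in range(d_v)])
    return out
```

When all keys are identical the weights are uniform, so the output is just the mean of the value vectors, which is a handy sanity check.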

multi-head attention

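  • it produces multiple attention results from parallel ‘scaled dot-product attention’ heads, each applied to its own learned linear projection of the queries, keys and values
  • because each head works at reduced dimension (dk = dv = dmodel/h), the total compute load is almost the same as that of single-head attention with the full dimensionality
  • a brilliant choice: the model can attend to several representation subspaces at once without giving up computational efficiency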
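The "almost the same compute load" claim is easy to check with rough arithmetic. The sketch below counts only the multiply-adds inside the attention itself (QK^T and the weighted sum over V) and deliberately ignores the linear projections; the function name is hypothetical, while h = 8 and dmodel = 512 are the base-model settings from the paper.

```python
def attention_mult_adds(n, d_model, h):
    """Rough multiply-add count for h attention heads over a length-n sequence.

    Assumes d_k = d_v = d_model / h and ignores the input/output projections.
    """
    d_k = d_model // h
    # Per head: n*n*d_k for Q K^T plus n*n*d_k for weights times V.
    per_head = 2 * n * n * d_k
    return h * per_head


if __name__ == "__main__":
    print(attention_mult_adds(100, 512, 8))  # 8 heads of width 64
    print(attention_mult_adds(100, 512, 1))  # 1 head of width 512
```

Both calls give the same count, because splitting dmodel across h heads shrinks each head by exactly the factor that the number of heads grows.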

position-wise feed-forward networks

  • each layer in the encoder/decoder has a feed forward network with parameters shared across positions but unique per layer
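In the paper the feed-forward network is FFN(x) = max(0, xW1 + b1)W2 + b2. Here is a minimal plain-Python sketch of "shared across positions": one set of weights is applied independently to every position's vector. The function names and the tiny toy weights are assumptions for illustration only.

```python
def ffn(x, W1, b1, W2, b2):
    """FFN(x) = max(0, x W1 + b1) W2 + b2 for a single position vector x."""
    hidden = [
        max(0.0, sum(xi * W1[i][j] for i, xi in enumerate(x)) + b1[j])
        for j in range(len(b1))
    ]
    return [
        sum(hj * W2[j][k] for j, hj in enumerate(hidden)) + b2[k]
        for k in range(len(b2))
    ]


def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same FFN parameters to every position of the sequence X."""
    return [ffn(x, W1, b1, W2, b2) for x in X]


if __name__ == "__main__":
    W1 = [[1.0, -1.0], [0.0, 1.0]]  # toy d_model x d_ff weights
    b1 = [0.0, 0.0]
    W2 = [[1.0, 0.0], [0.0, 1.0]]   # toy d_ff x d_model weights
    b2 = [0.5, 0.5]
    print(position_wise_ffn([[1.0, 2.0], [2.0, 1.0]], W1, b1, W2, b2))
```

A different layer would simply hold its own W1, b1, W2, b2, which is the "unique per layer" part.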

embeddings and softmax

  • the same weight matrix is shared between the input/output embedding layers and the pre-softmax linear transformation
  • in the embedding layers, those shared weights are multiplied by sqrt(dmodel)
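As a rough picture of the weight sharing, the sketch below uses one matrix `E` both for the embedding lookup (scaled by sqrt(dmodel)) and for the pre-softmax logits. The function names and toy matrix are hypothetical, and the softmax itself is left out.

```python
import math


def embed(token_ids, E):
    """Look up rows of E (vocab x d_model) and scale them by sqrt(d_model)."""
    scale = math.sqrt(len(E[0]))
    return [[scale * w for w in E[t]] for t in token_ids]


def output_logits(h, E):
    """Pre-softmax linear transformation that reuses E: one logit per vocab row."""
    return [sum(a * b for a, b in zip(h, row)) for row in E]


if __name__ == "__main__":
    E = [[1.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]  # toy vocab of 2, d_model = 4
    print(embed([1], E))
    print(output_logits([1.0, 1.0, 0.0, 0.0], E))
```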

positional encoding

  • sinusoidal functions are used to inject the sequence order information and to help the model learn to attend by relative positions
  • for any fixed offset k, PE(pos+k) can be represented as a linear function of PE(pos)
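The paper defines PE(pos, 2i) = sin(pos / 10000^(2i/dmodel)) and PE(pos, 2i+1) = cos(pos / 10000^(2i/dmodel)). The sketch below builds that encoding and then checks the linearity claim numerically: each (sin, cos) pair at pos+k is a rotation of the pair at pos, and the rotation depends only on k. The helper `shift_encoding` is my own illustration and assumes an even dmodel.

```python
import math


def positional_encoding(pos, d_model):
    """Sinusoidal encoding for one position; d_model is assumed to be even."""
    pe = []
    for i in range(0, d_model, 2):  # i plays the role of 2i in the paper
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe


def shift_encoding(pe, k, d_model):
    """Compute PE(pos+k) from PE(pos) using only k (a per-frequency rotation)."""
    out = []
    for i in range(0, d_model, 2):
        w = 1.0 / (10000 ** (i / d_model))
        s, c = pe[i], pe[i + 1]
        # sin(a + b) = sin a cos b + cos a sin b
        out.append(s * math.cos(w * k) + c * math.sin(w * k))
        # cos(a + b) = cos a cos b - sin a sin b
        out.append(c * math.cos(w * k) - s * math.sin(w * k))
    return out


if __name__ == "__main__":
    shifted = shift_encoding(positional_encoding(5, 8), 3, 8)
    direct = positional_encoding(8, 8)
    print(max(abs(a - b) for a, b in zip(shifted, direct)))
```

The printed difference should be at floating-point noise level, since the shift is an exact trigonometric identity.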

why self-attention

  • complexity per layer
    • a self-attention layer (O(n^2·d)) has less computational complexity than a recurrent layer (O(n·d^2)) if the sequence length n is less than the representation dimension d, which is most often the case
    • a conv layer (O(k·n·d^2)) costs a factor of k more than a recurrent layer; separable convolutions bring this down to O(k·n·d + n·d^2), which even with k = n equals self-attention + FFN
    • when n is very large, self-attention restricted to a neighborhood of size r could be an alternative, but it increases the max path length to O(n/r)
  • sequential operations
    • self-attention, conv and restricted self-attention layers all need only a constant number (O(1)) of sequential operations, while a recurrent layer needs O(n)
  • maximum path length
    • a single self-attention layer connects all positions to each other directly (O(1) path length), which none of the other layer types compared can do
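To get a feel for the per-layer complexity comparison, here is a small sketch that plugs numbers into the big-O expressions from the paper's comparison table, dropping constant factors. The function name and the defaults k = 3 and r = 16 are assumptions chosen for illustration.

```python
def per_layer_cost(n, d, k=3, r=16):
    """Leading-order per-layer cost for sequence length n and dimension d.

    k is the convolution kernel width, r the neighborhood size of
    restricted self-attention; constant factors are ignored.
    """
    return {
        "self_attention": n * n * d,
        "recurrent": n * d * d,
        "convolutional": k * n * d * d,
        "restricted_self_attention": r * n * d,
    }


if __name__ == "__main__":
    for n in (50, 1000):
        print(n, per_layer_cost(n, 512))
```

With d = 512, self-attention is cheaper than the recurrent layer at n = 50 and more expensive at n = 1000, which matches the "n less than d" condition.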


training

  • training data and batching
    • WMT 2014 English-German
      • 4.5 million sentence pairs
      • shared source/target vocab of about 37000 tokens with BPE
    • WMT 2014 English-French
      • 36 million sentence pairs
      • shared source/target vocab of about 32000 tokens with word-piece
    • sentence pairs batched together by approximate sequence length
      • each training batch holds a set of sentence pairs with approximately 25000 source tokens and 25000 target tokens
  • hardware and schedule
    • one machine with 8 P100 GPUs
    • base model
      • training step time: 0.4 seconds
      • training for 100000 steps (12 hours)
    • big model
      • training step time: 1.0 seconds
      • training for 300000 steps (3.5 days)
  • optimizer
    • Adam optimizer with beta1 = 0.9, beta2 = 0.98, epsilon = 10^-9
    • learning rate varying over the course of training with warmup_steps = 4000:
      • lrate = dmodel^-0.5 * min(step_num^-0.5, step_num * warmup_steps^-1.5)
  • regularization
    • residual dropout
      • dropout applied in the encoder/decoder stacks to:
        • multi-head attention output before ‘Add & Norm’
        • FFN output before ‘Add & Norm’
        • input/output embeddings + positional encodings
      • dropout probability for the base model: 0.1
    • label smoothing
      • epsilon = 0.1
      • performance: better for accuracy and BLEU while worse for perplexity
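Two of the training settings above are easy to sketch in a few lines. The first function follows the paper's schedule, lrate = dmodel^-0.5 * min(step^-0.5, step * warmup_steps^-1.5), with dmodel = 512 for the base model. The second shows one common form of label smoothing, mixing the one-hot target with a uniform distribution over the vocabulary; that exact form is my assumption, since implementations differ in how they spread epsilon. Both function names are hypothetical.

```python
def lrate(step, d_model=512, warmup_steps=4000):
    """Learning rate at a given training step (step must be >= 1)."""
    # Linear warmup for the first warmup_steps, then inverse square-root decay.
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)


def smooth_labels(true_index, vocab_size, epsilon=0.1):
    """Target distribution: (1 - epsilon) one-hot plus epsilon spread uniformly.

    Assumed variant; some implementations spread epsilon over the
    other vocab_size - 1 entries instead.
    """
    dist = [epsilon / vocab_size] * vocab_size
    dist[true_index] += 1.0 - epsilon
    return dist


if __name__ == "__main__":
    for step in (1, 1000, 4000, 100000):
        print(step, lrate(step))
    print(smooth_labels(1, 4))
```

The learning rate peaks at step = warmup_steps, where the two arguments of `min` coincide.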



Statement of accomplishment for Coursera Machine Learning class

Finally acquired the statement of accomplishment for Andrew Ng’s Machine Learning class on Coursera. Actually, it took me more than 2 years to make it since the class first opened in 2012. I don’t wanna make any excuses for such procrastination but, like many others, I had a hard time setting aside time just for the class. Yeah, shame on me, but I finished the course with a score of 100%. 🙂

Anyway, I’m personally proud of what I’ve done over the last couple of months. And deep appreciation to Andrew Ng for this encouraging class.

Statement of accomplishment: Coursera ml 2014

the one posting on the curse of dimensionality

i just found a blog on computer vision and machine learning, and got a chance to gain a clear, fundamental understanding of the curse of dimensionality.

the posting above provides a very thorough explanation of the essence of the curse of dimensionality. the other postings on that blog also seem to be must-read articles for ML newbies, just like me. 🙂

it would be really worth reading all the material on that blog, and that’s what i’m gonna do for the next week.

Undergraduate ML course at UBC 2012

I happened to find these undergraduate ML course videos that UBC has made available on YouTube. I really have to thank UBC and Professor Nando de Freitas for this valuable sharing. Anyone with a strong interest in ML should definitely take a look through all the sessions. That’s what I’m gonna do for the next few months.


Here’s a community-driven educational guide currently focused on machine learning and probabilistic AI – Metacademy. It will be a great help for those with a strong interest and enthusiasm in machine learning, just like me. 🙂 It draws an intuitive map of the sequential steps towards the concept you queried. You’ll see once you give it a try.

Machine Learning Video Library

A bunch of educational video segments on machine learning – Machine Learning Video Library.

Topics included:

Aggregation, Bayesian Learning, Bias-Variance Tradeoff, Bin Model, Data Snooping, Error Measures, Gradient Descent, Learning Curves, Learning Diagram, Learning Paradigms, Linear Classification, Linear Regression, Logistic Regression, Netflix Competition, Neural Networks, Nonlinear Transformation, Occam’s Razor, Overfitting, Radial Basis Functions, Regularization, Sampling Bias, Support Vector Machines, Validation, VC Dimension