Despite being the standard loss function for training multi-class neural networks, the log-softmax has two potential limitations. First, it involves computations that scale linearly with the number of output classes, which can restrict the size of problems that can be tackled with current hardware. Second, it remains unclear how closely it matches the task loss, such as the top-k error rate or ...
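To make the first point concrete, a minimal formulation of the log-softmax (cross-entropy) loss is sketched below; the notation (a score vector $s \in \mathbb{R}^C$ over $C$ classes and ground-truth label $y$) is introduced here for illustration and is not taken from the text above:
$$
\ell(s, y) \;=\; -s_y + \log \sum_{j=1}^{C} \exp(s_j),
$$
where the log-sum-exp term runs over all $C$ classes, so evaluating the loss and its gradient requires $O(C)$ operations per example, which is the source of the linear scaling in the number of output classes.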