First some context. In supervised learning, the goal is to learn, or infer, an (initially) unknown function $f : x \mapsto y$ from a set of training data given as $T$ "input/output" pairs $\{(x^{\mu}, y^{\mu})\}_{\mu=1:T}$.¹ More generally, you try to infer the conditional distribution $\rho(y \mid x)$ from this training set; the reason is that, in general, your outputs contain some noise (or stated better, trial-t...