Before training, the parameter values in all units (i.e., neurons) are unknown.
Training algorithms for neural networks (e.g., gradient descent) are iterative, so the weights and biases must be initialized before the first iteration.
Weight initialization methods sample the initial values from one of the following distributions:
Random normal: normal distribution, typically with mean 0 and SD 0.05;
Random uniform: uniform distribution with range $[-0.05,0.05]$;
Xavier normal: truncated normal distribution, centered on 0, with $\mathrm{SD}=\sqrt{2 /(\text{in}+\text{out})}$, where “in” is the number of units in the preceding layer that feed into the current unit (the one whose parameters you are initializing), and “out” is the number of units in the subsequent layer to which the current unit is connected;
Xavier uniform: uniform distribution within [-limit, limit], where “limit” is $\sqrt{6 /(\text { in }+ \text { out })}$, and “in” and “out” are defined as in Xavier normal, above.
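The four sampling schemes above can be sketched in NumPy as follows. This is a minimal illustration, not a library implementation: the function names, the `(n_in, n_out)` weight-matrix layout, the fixed seed, and the choice to truncate the Xavier normal at 2 SD by redrawing are all assumptions made for the example.

```python
import numpy as np

# Seed fixed only so the sketch is reproducible (an assumption, not a requirement).
rng = np.random.default_rng(seed=0)


def random_normal(n_in, n_out, sd=0.05):
    """Normal distribution with mean 0 and the given SD."""
    return rng.normal(loc=0.0, scale=sd, size=(n_in, n_out))


def random_uniform(n_in, n_out, limit=0.05):
    """Uniform distribution over [-limit, limit)."""
    return rng.uniform(low=-limit, high=limit, size=(n_in, n_out))


def xavier_normal(n_in, n_out):
    """Truncated normal, centered on 0, with SD = sqrt(2 / (in + out)).

    Assumption: truncation means redrawing any value farther than
    2 SD from the mean until all values fall inside that band.
    """
    sd = np.sqrt(2.0 / (n_in + n_out))
    w = rng.normal(loc=0.0, scale=sd, size=(n_in, n_out))
    outside = np.abs(w) > 2.0 * sd
    while outside.any():
        w[outside] = rng.normal(loc=0.0, scale=sd, size=outside.sum())
        outside = np.abs(w) > 2.0 * sd
    return w


def xavier_uniform(n_in, n_out):
    """Uniform distribution over [-limit, limit), limit = sqrt(6 / (in + out))."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(low=-limit, high=limit, size=(n_in, n_out))


# Example: a layer with 300 inputs ("in") and 100 outputs ("out").
W = xavier_uniform(300, 100)
b = np.zeros(100)  # bias initialized with 0
```

Note that both Xavier variants scale the spread by the layer sizes, so wider layers start with smaller weights, whereas the plain random normal and random uniform schemes use the same fixed spread for every layer.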
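The bias term is usually initialized with 0 or 1. Note: we cannot use a constant such as 0 or 1 for the weights, because all units in a layer would then compute the same output and receive the same gradient update, so they could never learn different features.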
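To illustrate why a constant value is acceptable for biases but not for weights, the sketch below compares the first-layer gradient under constant and random weight initialization. Everything in it is an assumption chosen for illustration: a tiny network with 3 inputs, 4 tanh hidden units, one linear output, a squared-error loss, a single made-up training example, and the hypothetical helper `hidden_gradient`.

```python
import numpy as np

x = np.array([[0.5, -1.0, 2.0]])  # one made-up example with 3 input features
y = np.array([[1.0]])             # its made-up target


def hidden_gradient(W1, W2):
    """Gradient of 0.5 * (y_hat - y)^2 with respect to the first-layer weights W1."""
    h = np.tanh(x @ W1)                  # hidden activations, shape (1, 4)
    y_hat = h @ W2                       # linear output, shape (1, 1)
    d_out = y_hat - y                    # dLoss/dy_hat
    d_h = (d_out @ W2.T) * (1.0 - h**2)  # backpropagate through tanh
    return x.T @ d_h                     # shape (3, 4): one column per hidden unit


# Constant initialization: every weight is 1.
const_grad = hidden_gradient(np.full((3, 4), 1.0), np.full((4, 1), 1.0))

# Random uniform initialization in [-0.05, 0.05).
rng = np.random.default_rng(seed=0)
rand_grad = hidden_gradient(
    rng.uniform(-0.05, 0.05, size=(3, 4)),
    rng.uniform(-0.05, 0.05, size=(4, 1)),
)

# With constant weights, every hidden unit gets exactly the same gradient column,
# so the units stay identical after each update.
columns_identical = np.allclose(const_grad, const_grad[:, [0]])
random_columns_identical = np.allclose(rand_grad, rand_grad[:, [0]])
print(columns_identical, random_columns_identical)
```

Under constant initialization the hidden units are interchangeable, so gradient descent updates them identically; random initialization breaks this symmetry.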