vanishing gradient

in deep neural networks, the gradient can become very small after propagating backward through several layers, which slows down or even stops learning. this phenomenon is called the vanishing gradient problem. [cite:@ovidiu_deep_learning] to reduce the risk of vanishing gradients in deep feedforward networks, it is recommended to use relu instead of the sigmoid (or other sigmoidal functions) as the activation function: the derivative of the sigmoid is at most 0.25, so multiplying many such factors together shrinks the gradient roughly exponentially with depth. gradients can also grow too large through the same repeated multiplications; that is the exploding gradient problem.
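
the sketch below (not from the cited text; depth, width, seed, and the 1/sqrt(width) weight scale are arbitrary choices) pushes a unit gradient backward through a stack of random linear layers and compares the resulting gradient norm for sigmoid versus relu activations, to illustrate the shrinking effect described above.

#+begin_src python
import numpy as np

DEPTH, WIDTH = 50, 64  # assumed toy sizes, just for illustration

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_norm(act, act_grad, seed=0):
    """forward through DEPTH random linear layers with activation `act`,
    then push a unit gradient backward and return its final norm."""
    rng = np.random.default_rng(seed)       # same weights for both activations
    h = rng.standard_normal(WIDTH)
    weights, pre_activations = [], []
    for _ in range(DEPTH):
        w = rng.standard_normal((WIDTH, WIDTH)) / np.sqrt(WIDTH)
        z = w @ h
        weights.append(w)
        pre_activations.append(z)
        h = act(z)
    grad = np.ones(WIDTH)                    # pretend dL/dh at the top is all ones
    for w, z in zip(reversed(weights), reversed(pre_activations)):
        grad = w.T @ (grad * act_grad(z))    # chain rule through activation and linear layer
    return np.linalg.norm(grad)

print("sigmoid:", backprop_norm(sigmoid, lambda z: sigmoid(z) * (1 - sigmoid(z))))
print("relu:   ", backprop_norm(lambda z: np.maximum(z, 0.0),
                                lambda z: (z > 0).astype(float)))
#+end_src

with sigmoid the printed norm collapses toward zero after a few dozen layers, while with relu it stays orders of magnitude larger, since the relu derivative is 1 wherever the unit is active.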