linear regression

given a training data set comprising $\text{[math]}$ observations $\text{[math]}$ , where $\text{[math]}$ , together with corresponding target values $\text{[math]}$ , the goal is to predict the value of $\text{[math]}$ for a new value of $\text{[math]}$ with a predictive hypothesis function $\text{[math]}$ :
$\text{[math]}$
to make the hypothesis applicable to a wider class of functions, we extend this linear function with a set of $\text{[math]}$ fixed, non-linear basis functions $\text{[math]}$, with $\text{[math]}$ for the bias parameter, such that
$\text{[math]}$
this removes the restriction imposed by using only linear functions of $\text{[math]}$. the model is nonetheless still a linear function of $\text{[math]}$, which eases analysis.
here, $\text{[math]}$ is the weight matrix, which we hope represents a linear transformation that maps a vector of input features $\text{[math]}$ to an output vector $\text{[math]}$ .
[cite:@ml_bishop_2006 section 3.1 linear basis function models]
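
as a rough illustration of the basis-function idea, here is a minimal sketch for scalar inputs. the polynomial basis, the helper names `polynomial_basis` and `predict`, and the example weights are assumptions made for illustration and do not come from the text above. the first basis function is the constant 1, so the first weight acts as the bias.

```python
def polynomial_basis(x, num_basis):
    """Return [phi_0(x), ..., phi_{M-1}(x)] with phi_j(x) = x**j.

    phi_0(x) = 1 is the dummy basis function that carries the bias weight.
    The polynomial choice is an assumption; any fixed non-linear functions work.
    """
    return [x ** j for j in range(num_basis)]


def predict(weights, x):
    """Hypothesis: the sum over j of w_j * phi_j(x).

    Non-linear in x (through the basis functions) but linear in the weights.
    """
    basis_values = polynomial_basis(x, len(weights))
    return sum(w * phi for w, phi in zip(weights, basis_values))


# Hypothetical weights encoding 1 + 0*x + 2*x**2.
weights = [1.0, 0.0, 2.0]
print(predict(weights, 3.0))  # 1 + 2 * 9 = 19.0
```

doubling every weight doubles the prediction, which is the sense in which the model stays linear in the weights even though it is quadratic in the input.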

we start with a random point in $\text{[math]}$ for $\text{[math]}$ and use an optimization algorithm to arrive at a good enough approximation of a hypothetical target function $\text{[math]}$, from which we assume the observations $\text{[math]}$ were drawn. "good enough" is defined by some criterion or loss function. here we consider traditional gradient descent as the optimization method. the observations may be divided into batches, but that doesn't change the theory. our goal is to converge on a good enough $\text{[math]}$ by moving in the direction opposite to the gradient of the loss function, since a small enough step in that direction decreases the loss and so moves us toward a local minimum (i.e. "sliding downhill"). a training step therefore consists of: $\text{[math]}$ where $\text{[math]}$ is the loss function and $\text{[math]}$ is the learning rate. the optimal weight matrix $\text{[math]}$ is the one that minimizes this loss for a given batch (set of observations): $\text{[math]}$
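
the training loop described above can be sketched as follows. this is only a sketch under stated assumptions: the loss is taken to be the mean squared error (the text leaves the loss unspecified), the basis is polynomial, the whole data set is used as a single batch, and all names (`features`, `mse_gradient`, `gradient_descent`) and hyperparameter values are hypothetical.

```python
import random


def features(x, num_basis):
    # Assumed polynomial basis: phi_j(x) = x**j, with phi_0(x) = 1 for the bias.
    return [x ** j for j in range(num_basis)]


def predict(weights, x):
    return sum(w * phi for w, phi in zip(weights, features(x, len(weights))))


def mse_loss(weights, xs, ts):
    # Assumed loss: half the mean squared error over the batch.
    return sum((predict(weights, x) - t) ** 2 for x, t in zip(xs, ts)) / (2 * len(xs))


def mse_gradient(weights, xs, ts):
    # Gradient of mse_loss with respect to each weight w_j:
    # the batch mean of (prediction - target) * phi_j(x).
    grad = [0.0] * len(weights)
    for x, t in zip(xs, ts):
        error = predict(weights, x) - t
        for j, phi in enumerate(features(x, len(weights))):
            grad[j] += error * phi / len(xs)
    return grad


def gradient_descent(xs, ts, num_basis, learning_rate=0.1, steps=2000, seed=0):
    rng = random.Random(seed)
    # Start from a random point in weight space.
    weights = [rng.uniform(-1.0, 1.0) for _ in range(num_basis)]
    for _ in range(steps):
        grad = mse_gradient(weights, xs, ts)
        # One training step: move against the gradient, scaled by the learning rate.
        weights = [w - learning_rate * g for w, g in zip(weights, grad)]
    return weights


# Toy observations drawn from the target function t = 1 + 2x (no noise).
xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
ts = [1.0 + 2.0 * x for x in xs]
weights = gradient_descent(xs, ts, num_basis=2)
print(weights)  # should be close to [1.0, 2.0]
```

because this assumed loss is quadratic in the weights, it has a single minimum, so the "local minimum" reached here is also the global one. with other loss functions or models that guarantee need not hold.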