Regression is all about fitting a low order parametric model or curve to data, so we can reason about it or make predictions on points not covered by the data. Both data and model are known, but we'd like to find the model parameters that make the model fit best or good enough to the data according to some metric.
We may also be interested in how well the model supports the data or whether we better look for another more appropriate model.
In a regression, a lot of data is reduced and generalized into a few parameters. The resulting model can obviously no longer reproduce all the original data exactly  if you need the data to be reproduced exactly, have a look at interpolation instead.
In the simplest yet still common form of regression we would like to fit a line
\(y : x \mapsto a + b x\) to a set of points \((x_j,y_j)\), where \(x_j\) and \(y_j\) are scalars.
Assuming we have two double arrays for x and y, we can use Fit.Line
to evaluate the \(a\) and \(b\)
parameters of the least squares fit:

Or in F#:
let a, b = Fit.Line ([10.0;20.0;30.0], [15.0;20.0;25.0])
How well do these parameters fit the data? The data points happen to be positioned exactly on a line. Indeed, the coefficient of determination confirms the perfect fit:

In practice, a line is often not an adequate model. But if we can choose a model that is linear, we can leverage the power of linear algebra; otherwise we have to resort to iterative methods (see Nonlinear Optimization).
A linear model can be described as linear combination of \(N\) arbitrary but known functions \(f_i(x)\), scaled by the model parameters \(p_i\). Note that none of the functions \(f_i\) depends on any of the \(p_i\) parameters.
\[y : x \mapsto p_1 f_1(x) + p_2 f_2(x) + \cdots + p_N f_N(x)\]
If we have \(M\) data points \((x_j,y_j)\), then we can write the regression problem as an overdefined system of \(M\) equations:
\[\begin{eqnarray} y_1 &=& p_1 f_1(x_1) + p_2 f_2(x_1) + \cdots + p_N f_N(x_1) \\ y_2 &=& p_1 f_1(x_2) + p_2 f_2(x_2) + \cdots + p_N f_N(x_2) \\ &\vdots& \\ y_M &=& p_1 f_1(x_M) + p_2 f_2(x_M) + \cdots + p_N f_N(x_M) \end{eqnarray}\]
Or in matrix notation with the predictor matrix \(X\) and the response \(y\):
\[\begin{eqnarray} \mathbf y &=& \mathbf X \mathbf p \\ \begin{bmatrix}y_1\\y_2\\ \vdots \\y_M\end{bmatrix} &=& \begin{bmatrix}f_1(x_1) & f_2(x_1) & \cdots & f_N(x_1)\\f_1(x_2) & f_2(x_2) & \cdots & f_N(x_2)\\ \vdots & \vdots & \ddots & \vdots\\f_1(x_M) & f_2(x_M) & \cdots & f_N(x_M)\end{bmatrix} \begin{bmatrix}p_1\\p_2\\ \vdots \\p_N\end{bmatrix} \end{eqnarray}\]
Provided the dataset is small enough, if transformed to the normal equation \(\mathbf{X}^T\mathbf y = \mathbf{X}^T\mathbf X \mathbf p\) this can be solved efficiently by the Cholesky decomposition (do not use matrix inversion!).

Using normal equations is comparably fast as it can dramatically reduce the linear algebra problem
to be solved, but that comes at the cost of less precision. If you need more precision, try using
MultipleRegression.QR
or MultipleRegression.Svd
instead, with the same arguments.
To fit to a polynomial we can choose the following linear model with \(f_i(x) := x^i\):
\[y : x \mapsto p_0 + p_1 x + p_2 x^2 + \cdots + p_N x^N\]
The predictor matrix of this model is the Vandermonde matrix.
There is a special function in the Fit
class for regressions to a polynomial,
but note that regression to high order polynomials is numerically problematic.

The \(x\) in the linear model can also be a vector \(\mathbf x = [x^{(1)}\; x^{(2)} \cdots x^{(k)}]\) and the arbitrary functions \(f_i(\mathbf x)\) can accept vectors instead of scalars.
If we use \(f_i(\mathbf x) := x^{(i)}\) and add an intercept term \(f_0(\mathbf x) := 1\) we end up at the simplest form of ordinary multiple regression:
\[y : x \mapsto p_0 + p_1 x^{(1)} + p_2 x^{(2)} + \cdots + p_N x^{(N)}\]
For example, for the data points \((\mathbf{x}_j = [x^{(1)}_j\; x^{(2)}_j], y_j)\) with values
([1,4],15)
, ([2,5],20)
and ([3,2],10)
we can evaluate the best fitting parameters with:

The Fit.MultiDim
routine uses normal equations, but you can always choose to explicitly use e.g.
the QR decomposition for more precision by using the MultipleRegression
class directly:

In multiple regression, the functions \(f_i(\mathbf x)\) can also operate on the whole vector or mix its components arbitrarily and apply any functions on them, provided they are defined at all the data points. For example, let's have a look at the following complicated but still linear model in two dimensions:
\[z : (x, y) \mapsto p_0 + p_1 \mathrm{tanh}(x) + p_2 \psi(x y) + p_3 x^y\]
Since we map (x,y) to (z) we need to organize the tuples in two arrays:

Then we can call Fit.LinearMultiDim with our model, which will return an array with the best fitting 4 parameters \(p_0, p_1, p_2, p_3\):

Let's say we have the following model:
\[y : x \mapsto a + b \ln x\]
For this case we can use the Fit.LinearCombination
function:

In order to evaluate the resulting model at specific data points we can manually apply
the values of p to the model function, or we can use an alternative function with the Func
suffix that returns a function instead of the model parameters. The returned function
can then be used to evaluate the parametrized model:

Sometimes it is possible to transform a nonlinear model into a linear one. For example, the following power function
\[z : (x, y) \mapsto u x^v y^w\]
can be transformed into the following linear model with \(\hat{z} = \ln z\) and \(t = \ln u\)
\[\hat{z} : (x, y) \mapsto t + v \ln x + w \ln y\]

Sometimes the regression error can be reduced by dampening specific data points. We can achieve this by introducing a weight matrix \(W\) into the normal equations \(\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\mathbf{p}\). Such weight matrices are often diagonal, with a separate weight for each data point on the diagonal.
\[\mathbf{X}^T\mathbf{W}\mathbf{y} = \mathbf{X}^T\mathbf{W}\mathbf{X}\mathbf{p}\]

Weighter regression becomes interesting if we can adapt them to the point of interest and e.g. dampen all data points far away. Unfortunately this way the model parameters are dependent on the point of interest \(t\).
