# Linear Representations and Basis Vectors

This page describes basic linear algebra concepts related to linear representations in vector spaces.

## Basis Vectors

Originally we are given the recorded data in the channel space, say with $n$ channels, and $T$ samples (i.e. time points, frames). The data can be thought of as a collection of $T$ vectors in $n$-dimensional space, each of which in the case of EEG is a snapshot of the electric potential at the electrodes (relative to a given reference) at a particular time point.

The data can also be thought of as a collection of $n$ time series, or channel vectors, in $T$-dimensional space; or as a collection of spatiotemporal data segments (each e.g. an $n\times L$ matrix) in $(nL)$-dimensional space. As we are concerned here with instantaneous ICA, we'll primarily think of the data as a set of $T$ vectors in $n$-dimensional space, disregarding the temporal order of the vectors.

ICA is a type of linear representation of data in terms of a set of "basis" vectors. Since we're working here in $n$-channel space, the vectors we're interested in will be in $\mathbb{R}^n$. To illustrate, in the following we'll use a three-dimensional example, say data recorded using three channels. The data is then given to us in three-dimensional vector space.

Each of these data points is a vector in three dimensional space.

$\mathbf{x}_1,\mathbf{x}_2,\ldots,\mathbf{x}_T \in \mathbb{R}^3$

In general, any point in $n$-dimensional space can be represented as a linear combination of any $n$ vectors that are linearly independent. For example, let's take the vectors,

$[\mathbf{b}_1 \mathbf{b}_2 \cdots \mathbf{b}_n] \triangleq \mathbf{B}$

Linear independence means that no vector in the set can be formed as a linear combination of the others, i.e. each vector branches out into a new dimension, and they do not all lie in a zero-volume subspace of $\mathbb{R}^n$. Equivalently, there is no nonzero vector $\mathbf{v}$ that multiplies $\mathbf{B}$ to produce the zero vector:

$\nexists \; \mathbf{v} \neq \mathbf{0} \; \text{such that}\; \mathbf{B} \mathbf{v} = \mathbf{0}$

Mathematically, this is true if and only if:

$\det \mathbf{B} \neq 0$
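As a quick numerical sketch of this criterion (using NumPy, with made-up illustrative vectors), we can test linear independence by checking the determinant:

```python
import numpy as np

# Three vectors in R^3, stacked as the columns of B (values are illustrative).
B = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])

# Nonzero determinant <=> the columns are linearly independent (a valid basis).
det_B = np.linalg.det(B)        # nonzero

# A dependent set: the third column is the sum of the first two.
B_dep = np.array([[1.0, 0.0, 1.0],
                  [0.0, 1.0, 1.0],
                  [0.0, 0.0, 0.0]])
det_dep = np.linalg.det(B_dep)  # zero: the columns span only a plane
```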

So for example any data vector, $\mathbf{x} \in \mathbb{R}^3$, can be represented in terms of three linearly independent basis vectors, $\mathbf{b}_1, \mathbf{b}_2, \mathbf{b}_3 \in \mathbb{R}^3$, using a set of coefficients contained say in some (unique) coefficient vector, $\mathbf{c} \in \mathbb{R}^3$:

$\mathbf{x} = \sum_{i=1}^3 c_i \mathbf{b}_i = \mathbf{B}\mathbf{c}$

A linear representation of the data is a fixed basis set, $\mathbf{B}$, that is used to represent each data point:

$\mathbf{x}_t = \sum_{i=1}^n c_{it} \mathbf{b}_i = \mathbf{B}\mathbf{c}_t$

If we collect the vectors into matrices $\mathbf{X} \triangleq [\mathbf{x}_1\cdots \mathbf{x}_T]$ and $\mathbf{C} \triangleq [\mathbf{c}_1\cdots \mathbf{c}_T]$, then we can write,

$\mathbf{X} = \mathbf{B}\mathbf{C}$

where $\mathbf{X}$ is the $n \times T\;$ data matrix, $\mathbf{B}$ is the $n \times n\;$ matrix of basis vectors, and $\mathbf{C}$ is the $n \times T$ coefficient (or loading, or weight) matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the coordinate space represented by the basis $\mathbf{B}$.
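A minimal sketch of this representation (using NumPy on synthetic data; the sizes and seed are arbitrary): given a basis $\mathbf{B}$ and data $\mathbf{X}$, the coefficients $\mathbf{C}$ are obtained by solving $\mathbf{B}\mathbf{C} = \mathbf{X}$, and each column $\mathbf{c}_t$ reconstructs its data point.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 3, 100

B = rng.standard_normal((n, n))   # a (generically full-rank) basis
X = rng.standard_normal((n, T))   # data: T points in R^n

# Coordinates of every data point in the basis B: solve B C = X column by column.
C = np.linalg.solve(B, X)

# Reconstruction: x_t = B c_t for every t, i.e. X = B C.
err = np.abs(B @ C - X).max()     # numerically zero
```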

We have assumed thus far that the data itself is "full rank", i.e. that there exists a set of $n$ data vectors that are linearly independent. It may happen, however, that the data do not lie in the "full volume" of $\mathbb{R}^n$, but rather occupy a subspace of smaller dimension.

In three dimensions, for example, all of the data might exist in a two-dimensional subspace.

The data is still represented as points or vectors in three dimensional space, with three coordinates, but in fact only two coordinates are required (once a "center" point has been fixed in the subspace).

Even if the data does not lie exactly in a subspace, it may be the case that one of the dimensions (directions) is just numerical noise. Eliminating such extraneous dimensions can lead to more efficient and stable subsequent processing of the data.

To understand how the data occupies the space volumetrically, and in the case of data that is not full rank, how to determine which subspace the data lies in, we will use Principal Component Analysis, described in the next section.

## Principal Component Analysis (PCA)

Let the data be represented by an $n \times T$ (channels by time points) matrix $\mathbf{X} = [\mathbf{x}_1 \cdots \mathbf{x}_T]$, which we're thinking of as a set of $T$ vectors contained in the columns. Let us also assume that the data is "zero mean", i.e. that the mean of each channel (row of $\mathbf{X}$) has been removed (subtracted from the row), so that:

$\frac{1}{T} \sum_{t=1}^T \big[\mathbf{X}\big]_{it} =\, 0, \quad i = 1,\ldots, n$
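This zero-mean step can be sketched as follows (NumPy, synthetic data; the offset of 5 is just to make the centering visible):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((4, 500)) + 5.0   # channels x time, with a nonzero mean

# Subtract each channel's (row's) mean across time.
X0 = X - X.mean(axis=1, keepdims=True)

row_means = X0.mean(axis=1)               # all numerically zero
```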

Now, one way to determine the rank of the data is to examine the covariance matrix, or matrix of channel correlations, which is defined by,

$\mathbf{\Sigma} \;\triangleq\; \frac{1}{T} \sum_{t=1}^{T} \mathbf{x}_t \mathbf{x}_t^T \;=\; \mathbf{X}\mathbf{X}^T /\, T$

The matrix $\mathbf{\Sigma}$ has the same rank, or intrinsic dimensionality, as the matrix $\mathbf{X}$. If we perform an eigen-decomposition of $\mathbf{\Sigma}$, we get,

$\mathbf{\Sigma} \;=\; \sum_{i=1}^{n} \lambda_i \mathbf{q}_i \mathbf{q}_i^T \;\triangleq\; \mathbf{Q}\mathbf{\Lambda} \mathbf{Q}^T$

where $\lambda_i$ and $\mathbf{q}_i,\, i = 1,\ldots,n,$ are the eigenvalues and eigenvectors respectively.

Since $\mathbf{\Sigma}$ is symmetric and "positive semidefinite", all the eigenvalues are real and non-negative. $\mathbf{\Sigma}$ (and thus $\mathbf{X}$) is full rank if and only if all eigenvalues are strictly positive, i.e. $\lambda_i > 0, \quad i = 1,\ldots, n$.

If some of the eigenvalues are zero, then the data is not full rank, and the rank $r$ is equal to the number of nonzero eigenvalues. In this case, the data lies entirely in the $r$-dimensional subspace spanned by the eigenvectors corresponding to the nonzero eigenvalues, so that,
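As a numerical sketch of determining rank from the eigenvalues (NumPy, synthetic rank-deficient data; sizes and seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
n, T, r = 3, 1000, 2

# Rank-2 data in R^3: every x_t is a combination of two fixed vectors.
A = rng.standard_normal((n, r))
X = A @ rng.standard_normal((r, T))

Sigma = X @ X.T / T
lam, Q = np.linalg.eigh(Sigma)    # eigenvalues in ascending order

# Count eigenvalues above a small relative threshold: the rank of the data.
rank = int(np.sum(lam > 1e-10 * lam.max()))
```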

$\mathbf{x}_t = \sum_{i=1}^r c_{it} \mathbf{q}_i = \mathbf{B}\mathbf{c}_t$

and,

$\mathbf{X} = \mathbf{B}\mathbf{C}$

where $\mathbf{X}$ is the $n \times T\;$ data matrix, $\mathbf{B} = [\mathbf{q}_1 \cdots \mathbf{q}_r]$ is the $n \times r\;$ matrix of basis vectors, and $\mathbf{C}$ is the $r \times T$ coefficient matrix, with $\mathbf{c}_t$ giving the "coordinates" of the point $\mathbf{x}_t$ in the $r$-dimensional space of the nonzero eigenvectors.

The data $\mathbf{X}$ is reduced in dimension from $n\times T$ to $r\times T$ by "projecting" onto the $r$-dimensional space,

$\mathbf{C} = \mathbf{B}^T\mathbf{X}$

Analysis may be conducted on the reduced data $\mathbf{C}$, e.g. ICA may be performed, giving results in $r$ dimensional space. The coordinates in the original $n$ dimensional data space are then given by simply multiplying the $r$ dimensional vectors by $\mathbf{B}$. The $n\times n$, rank $r$, matrix $\mathbf{P}$,

$\mathbf{P} = \mathbf{B}\mathbf{B}^T$

in this case is called a "projection matrix", projecting the data in the full space onto the subspace spanned by the first $r$ eigenvectors.
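The projection and dimension reduction can be sketched numerically (NumPy, synthetic rank-deficient data as above; all names illustrative): the reduced coordinates recover the data exactly when it lies in the subspace, and $\mathbf{P} = \mathbf{B}\mathbf{B}^T$ is idempotent, as a projection matrix must be.

```python
import numpy as np

rng = np.random.default_rng(3)
n, T, r = 3, 1000, 2
A = rng.standard_normal((n, r))
X = A @ rng.standard_normal((r, T))       # rank-2 data in R^3

Sigma = X @ X.T / T
lam, Q = np.linalg.eigh(Sigma)            # eigenvalues ascending
B = Q[:, -r:]                             # eigenvectors of the r nonzero eigenvalues

C = B.T @ X                               # reduced r x T coordinates
X_rec = B @ C                             # coordinates back in the full space
P = B @ B.T                               # n x n, rank-r projection matrix

recon_err = np.abs(X_rec - X).max()       # ~0: the data lies in the subspace
idem_err = np.abs(P @ P - P).max()        # ~0: P is idempotent (P^2 = P)
```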

## Singular Value Decomposition (SVD)

A related decomposition, called the Singular Value Decomposition (SVD), can be performed directly on the data matrix itself to produce a linear representation (of possibly reduced rank). The SVD decomposes the data matrix into,

$\mathbf{X} = \mathbf{U}\mathbf{S}\mathbf{V}^T$

where $\mathbf{X}$ is the $n \times T\;$ data matrix, $\mathbf{U} = [\mathbf{u}_1 \cdots \mathbf{u}_r]$ is the $n \times r\;$ matrix of orthonormal (orthogonal and unit norm) "left eigenvectors", $\mathbf{S}$ is the $r \times r\;$ diagonal matrix of strictly positive "singular values", and $\mathbf{V}$ is the $T \times r$ matrix of orthonormal "right eigenvectors".

From the SVD, we see that,

$\mathbf{\Sigma} \;\triangleq\; \mathbf{X}\mathbf{X}^T /\, T \;=\; \mathbf{U}\mathbf{S}\mathbf{V}^T \mathbf{V}\mathbf{S}\mathbf{U}^T /\, T \;=\; \mathbf{U}\mathbf{S}^2\mathbf{U}^T /\, T$

so that $\mathbf{U} = \mathbf{Q},$ and $\mathbf{S}^2 /\, T = \mathbf{\Lambda}$. The SVD directly gives the linear representation:
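The correspondence between the SVD and the eigendecomposition of the covariance can be checked numerically (NumPy, synthetic zero-mean data; seed and sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n, T = 3, 1000
X = rng.standard_normal((n, T))
X -= X.mean(axis=1, keepdims=True)        # zero-mean channels

U, s, Vt = np.linalg.svd(X, full_matrices=False)

Sigma = X @ X.T / T
lam, Q = np.linalg.eigh(Sigma)            # eigenvalues ascending

# Eigenvalues of Sigma are the squared singular values divided by T.
eig_err = np.abs(np.sort(s**2 / T) - lam).max()

# C = U^T X equals S V^T, so X = U C is the SVD's linear representation.
C = U.T @ X
c_err = np.abs(C - np.diag(s) @ Vt).max()
```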

$\mathbf{X} \;=\; \mathbf{U}\mathbf{C} \;=\;\mathbf{U}\big(\mathbf{U}^T\mathbf{X}\big)$

where $\mathbf{C} = \mathbf{U}^T\mathbf{X} = \mathbf{S}\mathbf{V}^T$. The vectors in $\mathbf{U}$ are orthonormal (orthogonal and unit norm), and the rows of $\mathbf{C}$ are orthogonal (since $\mathbf{S}$ is diagonal, and $\mathbf{V}$ is orthonormal.)

The SVD gives the unique linear representation of the data matrix $\mathbf{X}=\mathbf{B}\mathbf{C}$ such that the columns of $\mathbf{B}$ are orthonormal and the rows of $\mathbf{C}$ are orthogonal (not necessarily unit norm: $\mathbf{C}\mathbf{C}^T = \mathbf{S}^2$). Strictly speaking, the SVD is unique only when the singular values are all distinct; a subspace determined by equal singular values does not have a unique orthonormal basis, allowing for arbitrary cancelling rotations of the left and right eigenvectors in that subspace.

Having the rows of $\mathbf{C}$ be orthogonal, i.e. uncorrelated, is a desirable feature of the representation, but having the basis vectors be orthonormal is overly restrictive in many cases of interest, like EEG. However, if we only require the rows of $\mathbf{C}$ to be orthogonal, then we lose the uniqueness of the representation, since for any orthonormal matrix $\mathbf{R}$, and any full rank diagonal matrix $\mathbf{D}$, we have,

$\mathbf{X} \;=\; (\mathbf{U}\mathbf{S}\mathbf{R}^T\mathbf{D}^{-1})(\mathbf{D}\mathbf{R}\mathbf{V}^{T})$

where the rows of the new coefficient matrix $\mathbf{D}\mathbf{R}\mathbf{V}^T = \mathbf{D}\mathbf{R}\mathbf{S}^{-1}\mathbf{C}$ are still orthogonal, but the new basis vectors, the columns of $\mathbf{U}\mathbf{S}\mathbf{R}^T\mathbf{D}^{-1}$, are in general no longer orthogonal.

A linear representation of the data,

$\mathbf{X} \;=\; \mathbf{B}\mathbf{C}$

implies that the coefficients can be recovered from the data using the inverse of $\mathbf{B}$ (or in the case of rank deficient $\mathbf{B}$, any left inverse, like the pseudoinverse):

$\mathbf{C} \;=\; \mathbf{B}^{-1}\mathbf{X}$

## PCA and Sphering

We have seen that the SVD representation is one linear representation of the data matrix. In the parameterization of the previous section, the whitening transformation derived directly from the SVD (PCA whitening, $\mathbf{B}^{-1} = \mathbf{S}^{-1}\mathbf{U}^T$) puts,

$\mathbf{R} = \mathbf{D} = \mathbf{I}$

where $\mathbf{I}$ is the identity matrix.

Another representation, which we call "sphering", puts,

$\mathbf{R} = \mathbf{U}, \quad \mathbf{D} = \mathbf{I}$

This latter representation has certain advantages. We can show, e.g., that the sphering transformation leaves the data changed as little as possible among all "whitening" transformations, i.e. those that leave the resulting rows of the coefficient matrix uncorrelated with unit average power.

$\frac{1}{T} \sum_{t=1}^T \big[\mathbf{C}\big]_{it}^2 \,=\, 1, \quad i = 1,\ldots, n$

This is equivalent to taking $\mathbf{D} = \mathbf{I}$ (for notational simplicity we absorb the factor $1/\sqrt{T}$ into $\mathbf{S}$ in this section, so that $\mathbf{\Sigma} = \mathbf{U}\mathbf{S}^2\mathbf{U}^T$). Let the general form of a "whitening" decorrelating transformation, then, be:

$\mathbf{B}^{-1} \;=\; \mathbf{R}\mathbf{S}^{-1}\mathbf{U}^T$

for arbitrary orthonormal matrix $\mathbf{R}$. We measure the distance of the transformed data from the original data by the sum of the squared errors:

$E (\mathbf{B}^{-1}) = \frac{1}{T}\sum_{t=1}^{T} {\|\mathbf{x}_t-\mathbf{B}^{-1}\mathbf{x}_t\|^2} = \text{trace}((\mathbf{I}-\mathbf{B}^{-1})\mathbf{\Sigma}(\mathbf{I}-\mathbf{B}^{-1})^T) = \text{trace}(\mathbf{\Sigma}+\mathbf{B}^{-1}\mathbf{\Sigma}\mathbf{B}^{-T}) - 2 \,\text{trace}(\mathbf{B}^{-1}\mathbf{\Sigma})$

Writing $\mathbf{B}^{-1}$ in the general form of the decorrelating transformation, we get,

$E (\mathbf{R}) = \text{trace}(\mathbf{\Sigma}+\mathbf{I}) - 2 \,\text{trace}(\mathbf{R}\mathbf{S}\mathbf{U}^T) = \sum_{i=1}^n\; \sigma_{ii}^2 + 1 - 2\,\sigma_{ii} \mathbf{u}_i^T \mathbf{r}_i \ge \sum_{i=1}^n\; (\sigma_{ii} - 1)^2$

Equality is achieved in the last inequality if and only if $\mathbf{R} = \mathbf{U}$. The resulting minimal squared error is the same squared error that would result from simply normalizing the variance of each channel, which is equivalent to the transformation $\mathbf{S}^{-1}$.

We shall refer to this particular whitening transformation, $\mathbf{B}^{-1} \;=\; \mathbf{U}\mathbf{S}^{-1}\mathbf{U}^T$, as the "sphering" transformation. This matrix is also referred to as the inverse of the "square root" of the covariance matrix $\mathbf{\Sigma}$. It is the unique symmetric matrix $\mathbf{B}^{-1} \triangleq \mathbf{W}=\mathbf{W}^T$ satisfying, $\mathbf{\Sigma} = \mathbf{W}\mathbf{W}^T = \mathbf{W}^2$.
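The sphering transformation can be sketched numerically (NumPy, synthetic correlated data; sizes and seed are illustrative): $\mathbf{W} = \mathbf{U}\mathbf{S}^{-1}\mathbf{U}^T$ built from the eigendecomposition of $\mathbf{\Sigma}$ is symmetric, and the sphered data has identity covariance.

```python
import numpy as np

rng = np.random.default_rng(6)
n, T = 4, 5000
X = rng.standard_normal((n, n)) @ rng.standard_normal((n, T))  # correlated channels
X -= X.mean(axis=1, keepdims=True)

Sigma = X @ X.T / T
lam, U = np.linalg.eigh(Sigma)

# Symmetric sphering matrix: the inverse matrix square root of Sigma.
W = U @ np.diag(lam ** -0.5) @ U.T

C = W @ X                                 # sphered ("whitened") data
Sigma_C = C @ C.T / T                     # should be the identity

white_err = np.abs(Sigma_C - np.eye(n)).max()
sym_err = np.abs(W - W.T).max()           # W is symmetric
```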

Remarks:

We can view this result as saying that the whitening matrix $\mathbf{B}^{-1} \;=\; \mathbf{U}\mathbf{S}^{-1}\mathbf{U}^T$ changes the data as little as possible, where the data is viewed either as a collection of channel vectors, or as a collection of channel time series. The result is not surprising since using $\mathbf{R} = \mathbf{U}$ undoes the initial rotation performed by $\mathbf{U}^T$.

We have found in practice, performing ICA on EEG data, that using the (symmetric) sphering matrix as an initialization for ICA generally yields the best results and the quickest convergence, especially in comparison with the whitening matrix $\mathbf{B}^{-1} \;=\; \mathbf{S}^{-1}\mathbf{U}^T$, which might suggest that the former whitening transformation produces more independent components than the latter. This is confirmed empirically in our mutual information computations.

Why should the sphering matrix $\mathbf{U}\mathbf{S}^{-1}\mathbf{U}^T$ produce more independent time series and a better starting point for ICA than the whitening matrix $\mathbf{S}^{-1}\mathbf{U}^T$? In the case of EEG, this is likely due to the fact that the EEG sensor electrodes are spread out at distances of the same order as the distance between the EEG sources. Thus the sources tend to have a much larger effect on a relatively small number of sensors, rather than a moderate effect on all of the sensors.

The whitening matrix $\mathbf{S}^{-1}\mathbf{U}^T$, in projecting the data onto the eigenvectors of the covariance matrix, produces time series that are each mixtures of all of the channels, and in this sense more mixed than the original data, in which the sources distribute over a relatively small number of channels.

The sphering matrix $\mathbf{U}\mathbf{S}^{-1}\mathbf{U}^T$ on the other hand, rotates the transformed data back into its original coordinates, and produces time series that are closest to the original data, which was relatively independent at the start.

By leaving the data in the eigenvector coordinate system, the whitening matrix $\mathbf{S}^{-1}\mathbf{U}^T$ forces the ICA algorithm to “undo” a great deal of mixing in the time series, and as a starting point for iterative algorithms, makes it more difficult (in terms of potential local optima) and more time consuming (since the starting point is farther from the ICA optimum).

## EEG Data Reference and Re-referencing

EEG data is recorded as a potential difference between the electrode location and the reference. Biosemi active recordings use a reference that is separate from the scalp electrodes. If data is recorded with a specific electrode reference, then the data essentially includes a "zero" channel corresponding to the signal at the reference location relative to itself.

A commonly used reference is the "average reference", which consists essentially of subtracting the mean scalp potential at each time point from the recorded channel potential. Let the vector of all ones be denoted, $\mathbf{e} \triangleq [1 \cdots 1]^T$. If the data is denoted $\mathbf{X}$, then average referenced data is equivalent to,

$(\mathbf{I} - \mathbf{e}\mathbf{e}^T\! / n) \mathbf{X}$

The average reference reduces the rank of the data because the referencing matrix is rank $n-1$ (note that if you include the original reference when computing average reference, average reference does not reduce the rank of the data). In particular, the vector $\mathbf{e}$ is in the "null space" of the referencing matrix:

$(\mathbf{I} - \mathbf{e}\mathbf{e}^T\! / n) \mathbf{e} = \mathbf{0}$

The left-hand side expands as,

$(\mathbf{I} - \mathbf{e}\mathbf{e}^T\! / n)\, \mathbf{e} \;=\; \mathbf{e} - \mathbf{e}\,(\mathbf{e}^T \mathbf{e})/n \;=\; \mathbf{e} - \mathbf{e} \;=\; \mathbf{0}$

where the factor $1/n$ is key, since $\mathbf{e}^T \mathbf{e} = n$.
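A quick numerical confirmation (NumPy; $n = 8$ channels is an arbitrary illustrative choice) that the average-referencing matrix annihilates $\mathbf{e}$ and has rank $n-1$:

```python
import numpy as np

n = 8
e = np.ones((n, 1))
M = np.eye(n) - e @ e.T / n       # average-referencing matrix

null_err = np.abs(M @ e).max()    # e is in the null space: M e = 0
rank = np.linalg.matrix_rank(M)   # n - 1
```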

Re-referencing to a specific channel or channels can be represented similarly. Let the vector with one in the jth position be denoted

$\mathbf{e}_j \triangleq [0 \cdots 0\, \overbrace{1}^{j\mathrm{th}}\, 0 \cdots 0]^T$

Suppose e.g. that the mastoid electrode numbers are $j$ and $k$. Then the linked mastoid re-reference is equivalent to:

$\big(\mathbf{I} - \mathbf{e}(\mathbf{e}_j+\mathbf{e}_k)^T\!/2\big) \mathbf{X}$

Again, however, $\mathbf{e}$ is in the null space of this referencing matrix, showing that the rank is $n-1$. Any referencing matrix will be rank deficient, and will thus leave the data rank deficient by one dimension.
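The same check works for the linked-mastoid referencing matrix (NumPy; the channel indices $j$, $k$ below are hypothetical, 0-based):

```python
import numpy as np

n, j, k = 8, 2, 5                         # illustrative channel count and indices
e = np.ones((n, 1))
ej = np.zeros((n, 1)); ej[j] = 1.0
ek = np.zeros((n, 1)); ek[k] = 1.0

M = np.eye(n) - e @ (ej + ek).T / 2       # linked-mastoid referencing matrix

null_err = np.abs(M @ e).max()            # e is again in the null space
rank = np.linalg.matrix_rank(M)           # n - 1: rank deficient by one
```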

In addition to referencing, EEG pre-processing usually includes high-pass filtering (to reduce non-stationarity caused by slow drifts). Linear filtering (such as high, low, band-pass, FIR, IIR, etc.) can be represented as a matrix multiplication of the data on the right by a large $T \times T$ matrix $\mathbf{F}$ whose columns are time-shifted versions of each other (a Toeplitz matrix). The combined referencing and filtering operations can be represented as:

$\big(\mathbf{I} - \mathbf{e}(\mathbf{e}_j+\mathbf{e}_k)^T\!/2\big) \mathbf{X} \mathbf{F}$

The resulting referenced and filtered matrix should remain rank deficient by one. However, when referencing is done first (reducing the rank by one) and filtering is performed afterward, it may happen that the rank of the data increases so that it becomes essentially full rank again. This is apparently due to numerical effects of multiplying (in effect) by a $T \times T$ matrix $\mathbf{F}$.

To summarize, re-referencing should reduce the rank of the data, relegating it to an $n-1$ dimensional subspace of the $n$-dimensional channel space. However, subsequent filtering of the rank-reduced referenced data may increase the rank of the data again (so that the minimum singular value is significantly larger than zero.) In this case, numerical noise in the vector (direction) $\mathbf{e}$ is essentially added back into the data as an independent component.
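This numerical effect can be illustrated with a sketch (NumPy, synthetic data; the FIR taps and the use of single precision are illustrative assumptions standing in for the rounding noise of a real filtering pipeline): after average referencing, the smallest singular value is at machine precision, but after filtering in reduced precision it is no longer negligible relative to machine epsilon.

```python
import numpy as np

rng = np.random.default_rng(7)
n, T = 8, 2000
X = rng.standard_normal((n, T))

# Average reference: the rank drops to n - 1.
Xr = X - X.mean(axis=0, keepdims=True)
s_ref = np.linalg.svd(Xr, compute_uv=False)
ref_gap = s_ref[-1] / s_ref[0]            # ~ double precision: rank deficient

# Filter each channel with a short FIR (hypothetical taps), but in single
# precision, mimicking the accumulated numerical noise of a long pipeline.
h = np.array([-0.25, 0.5, -0.25], dtype=np.float32)
Xf = np.stack([np.convolve(row.astype(np.float32), h, mode='same') for row in Xr])
s_filt = np.linalg.svd(Xf.astype(np.float64), compute_uv=False)
filt_gap = s_filt[-1] / s_filt[0]         # no longer at double precision:
                                          # noise leaks back into direction e
```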