Abstract
We present investigation of gene expression profiles by means of singular spectrum analysis (SSA). The biological problem under investigation is the decomposition of bicoid protein profiles of Drosophila melanogaster into the sum of a signal and noise, where the former consists of an exponential-in-distance pattern and is close to constant nonspecific component, or “background.” The signal processing problems addressed are (i) trend extraction from a noisy signal, (ii) batch processing of similar data, and (iii) analytical approximation of the signal components by the sum of exponential and constant-like functions. The proposed methods are evaluated on the given 17 series.
1. Introduction
Singular spectrum analysis is a method intended to perform decomposition of a sequence of measurements (usually a time series) into a sum of interpretable components, such as a trend, cycles, and noise [1]. SSA is recognized in geosciences, and is giving promising results in other areas (see http://www.math.uni-bremen.de/~theodore/ssawiki/). This work presents the use of SSA for signal extraction from spatial one-dimensional gene expression data. SSA was chosen as a compromise between parametric methods like regression, which can lead to wrong results if the model is not valid, and frequency methods like filtering.
The study of the activity of diverse genes has become one of key approaches in modern functional genomics, and is crucial for our understanding of embryo development. The expression of genes is traced either in time or in space (as in our case), along different tissues and organs or even a whole embryo. The research of gene expression is aimed at biomedical problems, but first it is systematically tested and developed on so-called model organisms. D. melanogaster (fruit fly) is one such organism, and the gene ensemble governing early events of fly embryo segmentation is one of the best studied genetic networks. This network of cross-regulating genes makes complicated patterns of their products, the segmentation factors. These patterns are directing the embryo developmental processes. Exponential-in-distance patterns are common at the very beginning of segmentation, and the primary morphogenetic gradient of the protein bicoid is most known and best studied [2, 3].
The biological problem under investigation is extraction of a signal from the noisy bicoid protein profile. Following [2], we assume bicoid to generate exponential pattern. The measured protein profile contains also the smooth residual referred to as “background” [4] and the measurement noise. In general, the form of the background is still an open question [2–4]. Moreover, it includes an unknown additive function depending on the confocal microscopy settings used. In this paper, we extend the model of [2], allowing the background to differ from constant. The problem is complicated by the fact that (i) the data contain outliers and (ii) that the data are very noisy and the noise has unknown structure. Although the noise appears to be multiplicative, the carried out study showed that it is true in very rough approximation only.
The SSA features demonstrated are as follows: (i) signal extraction with no parametric models of signal/noise specified; (ii) robustness to outliers; (iii) taking into account a parametric model of signal, if specified; (iv) interactive analysis with control over quality of separation of signal and noise; (v) batch processing of a set of similar series; (vi) derivation of an analytical formula of the signal.
Section 2 introduces the data, SSA, and the related methods used. The results of data processing are presented in Section 3. Finally, the conclusions are provided.
2. Methods and Approaches
Biological Data
Expression level of protein in wild-type fruit fly embryos (D. melanogaster,
Oregon-R), was measured using fluorescently tagged antibodies. For each embryo,
a pixel image with 8 bits of fluorescence
data was obtained. Image processing transforms each image into an ASCII table
containing a series of records (fluorescence intensity), one for each nucleus.
About 1100–1300 nuclei are obtained from each image. Each nucleus is
characterized by a unique identification number, the anteroposterior (AP) and
dorsoventral (DV) coordinates of its centroid, and the average fluorescence
level of the gene product. Because the expression of segmentation genes is
largely a function of position along the AP axis, it is natural to use the AP
profiles of gene expression. We use straightened data from a rectangle of 50%
of the DV height of the embryo, centred on the AP axis; for details see
[2]. This captures
approximately 700–800 nuclei. As in [2], we investigate only the interval of the AP coordinate
between 20 and 80% egg lengths (%EL). For examples of plotting nuclear
intensity versus AP position, see Figures 1(a) and 1(c). The considered test set
consists of 17 embryo images belonging to cleavage cycle 13 and thoroughly
studied in [2]. The data
and software used in this study are available at
http://www.math.uni-bremen.de/~theodore/GENESSA/. For a description of the
embryos, see [5], http://flyex.ams.sunysb.edu/FlyEx/,
or http://urchin.spbcas.ru/flyex/.
SSA
Let us describe the basic algorithm of SSA for
extraction of signal from a one-dimensional series .
For the given data, represents the intensity measured at the th nucleus inside the considered interval of
20%EL–80%EL. The first step of SSA has only the parameter and the window length, ,
and consists of the construction of the Hankel matrix of size ,
with column vectors ,
which is called the trajectory matrix. Note that there is a one-to-one
relation between series of length and Hankel matrices of size ;
each secondary diagonal of a Hankel matrix has equal values and produces a term
of the series. The trajectory matrix is then decomposed into the sum of the
ordered elementary matrices, ,
where are nonzero eigenvalues of in decreasing order, are the corresponding eigenvectors, and are the factor vectors. This is the so-called singular
value decomposition (SVD), and each SVD component generates an elementary
reconstructed component (elementary RC) of the series as follows. The matrix is hankelized by averaging the entries with indices ,
and the corresponding series of length is reconstructed by the above-mentioned
one-to-one relation. Thus, we decompose into the sum of elementary RCs, ,
where is the number of nonzero eigenvalues (the so-called SSA rank of ). Then, we choose a group of indices of the desirable components of (signal in our case), and gather the reconstructed
signal as .
The signal extraction problem is thus reduced to (i)
choice of window length and (ii) selection of the subgroup of SVD components for reconstruction. These
questions are briefly discussed in the next paragraph.
Trend Extraction in SSA
SSA needs no a
priori specification of models of series and signals, neither deterministic nor
stochastic ones. In this paper, we are interested in extraction of a slowly
varying signal, usually called the trend. Hereinafter, we refer to trend
instead of signal.
Generally, SSA is able to extract different kinds of
trends. Note that any trend can be approximated by a finite-rank series as the
class of finite-rank series includes all kinds of sums of products of
polynomials, exponentials, and sinusoids. Let us assume that the trend is (or
is approximated by) a series of rank .
With large enough and , the trend is separable from noise and is
reconstructed by the -leading SVD components. The subspace spanned
by the -corresponding eigenvectors contains
information about the finite-rank structure and, in particular, allows to
derive the approximate analytical formula of the trend. If is not large enough (for strong noise or high
rank ) and separability is bad, then the trend
still can be extracted using small .
In this case, trend is determined by a few leading components, and SSA works
like a smoothing adaptive linear filter. However, the subspace spanned by the
corresponding eigenvectors does not reflect the finite-rank structure of the
trend and is liable to be affected by noise and outliers.
Grouping of SVD components is based on the fact that
the slowly varying component of the series generates eigenvectors and factor
vectors of slowly varying form [1, 6], and therefore is composed of similar elementary RCs.
Thus, the identification of the components of a trend consists in identification
(visual or automatic) of slowly varying eigenvectors, factor vectors, or
elementary RCs.
AutoSSA for Trend Extraction
Here we present a method of identification of slowly
varying eigenvectors. This method is easy to use as it has only two parameters
(if the window length is fixed).
Firstly, we introduce the periodogram of a vector : , which can be interpreted as the contribution of the frequency .
The cumulative contribution is evaluated as , .
For ,
the contribution of low frequencies to is defined as .
Let us consider eigenvectors .
Then, given and ,
we select SVD components with eigenvectors satisfying .
One may interpret this method as selection of SVD components characterized
mostly by low-frequency fluctuations.
The low-frequency boundary manages the scale of the extracted trend; the
lower is, the slower the trend varies. The parameter regulates an acceptable share of higher
frequencies in the extracted component, and is usually chosen close to 1. For
more details on selecting and for description of a fitting procedure to
choose ,
see [6].
Derivation of the Analytical Form of a Component
A series consisting of a sum of exponentials and
sinusoids has SSA-rank ,
and is represented in the complex-valued form as .
The subspace spanned by the -leading eigenvectors of the trajectory matrix
determines the values .
The methods calculating through this subspace are called the subspace
methods. Real-valued correspond to exponential components; complex
conjugate generate sinusoids. For a noisy signal, the
subspace of the signal can be estimated using SSA as the space spanned by the
eigenvectors .
Subspace methods for calculation of exponentials (mostly complex-valued) have been known for a
long time. Among their advantages are (i) high resolution (estimation of close
frequencies of summand sinusoids), (ii) robustness to outliers, and (iii)
little prior information (no signal model specified, only its rank ).
In this paper, we use the method ESPRIT [7], which was chosen to
illustrate application of subspace methods. Note that there are modifications
of ESPRIT for more precise results, for example, weighted/total least-squares
ESPRIT. Having estimated, the coefficients can be computed by means of one of the least-squares
methods. In terms of SSA, ESPRIT exploits the rotational invariance property of
the subspace of the signal found by SSA.
3. Results and Discussion
3.1. Methodology
An Ideal Case (Trend Is Separable from Noise with )
Let us consider
the data ms19 (data from a single embryo) as an example (see Figure 1(a)). The only parameter of SSA is the window length .
SSA theory [1] implies
that for better separability between components of a given series, one should
choose close to the half-length .
Having performed SVD with ,
we visually examine the elementary RCs produced (see Figure 1(b)). As only the
two leading elementary RCs vary slowly, we reconstruct the trend with the two
leading SVD components. The resulting trend is depicted in Figure 1(a). One can
easily see that the result reasonably reconstructs the trend, but at the same
time is robust to the apparent outliers.
A Bad Case (Trend Is Not Separable from Noise with )
However, for some cases the trend cannot be separated
from noise or cyclic components. For the data ac2 (see Figure 1(c)) with ,
the third and fourth elementary RCs contain both trend and noise (see Figure
1(d)). The contribution of these SVD components is small and they can be
omitted for a tentative trend reconstruction. But for background estimation,
loss of even 1-2 intensity units is sizeable. Let us consider two additional
tricks which help to reconstruct trend more precisely.
Use of Small Window Length
The trend and
noise components can be mixed between each other due to the complicated trend
shape or strong noise, and the latter is observed in our study. The first
strategy to cope with the mixing is to choose small window length .
With small ,
SSA works like a smoothing adaptive linear filter. With for ac2, we get only the one leading
SVD component corresponding to the trend. Due to small window length, this SVD
component includes all terms of the trend. This method is suitable for
different kinds of signals, and overcomes the mixing of trend and noise.
However, the resultant decomposition does not allow us to split the trend into
the pattern and background.
Improvement of Separability by the Addition of a Constant
Recall that we
suppose the trend to be the sum of an exponential pattern and an almost
constant background, where the latter is approximated by an exponential
function with small rate. Then, in this particular case another trend
extraction strategy can be exploited. Such a trend generates two SVD components
[1]. In order to
enlarge the contribution of the second SVD component and therefore to reduce
mixture with the rest, we add a constant to the given data, thus artificially
increasing the background. After that we use the theoretically best window
length with no effect of mixing. This strategy
greatly helps; having added to the given data, the results for ac2 (with ) become visually the same as in the good case ms19 depicted in Figure 1(b). The value equal to 50 was chosen to provide separability
for all series from the considered test set and therefore to allow us to
extract trends being split into patterns and backgrounds. As for ms19,
smaller values of that are enough for separability can be used.
3.2. Batch Identification of Trend Components
Above we investigated the properties of the SSA representation of bicoid gene expression data, which contain the exponential pattern and a close-to-constant background. Let us apply AutoSSA for batch-processing the whole dataset taking into account the experience of the previous section. First, we set the parameter . As mentioned above, defines the low-frequency interval . Examining the eigenvector periodogram, we can guess as a value bounding the interval of large periodogram values next to the zero frequency. Taking the data ms19 as an example, we consider periodograms of its eigenvectors (). Exactly the two leading SVD components of ms19 are to be identified. The first six frequencies of the periodograms of both eigenvectors contain the prevailing contribution. Thus, we select . For five randomly selected test series (ad36, as27, cb23, hx8, iz13), the procedure of choice of presented in [6] () produces the following : 0.85,0.87,0.88,0.85,0.88 of which the smallest is selected. AutoSSA with and identifies the same SVD components as those visually identified above; two leading SVD components for ms19 () and for ac2 increased by (), as well as only the leading component for ac2 with .
Having added to all series from the given set, AutoSSA identifies exactly the two leading components. The visual check of the identified eigenvectors proves their slowly varying shape; this substantiates the use of AutoSSA, especially for these data where addition of a constant has enhanced the separability. Moreover, this uniformity of results over the whole dataset allows us to derive the analytical approximation using these SVD components.
3.3. Analytical Trend Approximation: Exponential Pattern Plus Background
Let us consider two-rank approximation of the trend. A two-rank series is either an exponentially modulated sinusoid or a sum of two real exponentials [1], and a constant function is a special case of an exponential. Note that we specify no approximation model (from the two given above) but only the rank. For the data ac2, the resulting formula for the trend is , where runs through the nucleus numbers in the considered AP interval. After transformation from nucleus number to AP coordinate , we obtain the approximation for the exponential pattern and the background : and (see Figure 2).
(a)
(b)
3.4. Summary Results of the Analytical Approximation
The considered dataset contains 17 series, and we extracted exponential patterns using ESPRIT for all of them, increased by in advance. The fact that the patterns tend to zero close to the posterior end is confirmed by the biological interpretation of the bicoid gradient. To generate reference results, we fitted the curve to each original series using the least-squares (LSs) method, like [2]. The patterns produced with ESPRIT and with LS-fitting are very similar, which confirms potential of ESPRIT in extracting exponential patterns without fixing the model of constant background. Both the MATLAB function fminsearch (v.7.4) and the nonlinear estimation module of STATISTICA (v.6.0) with randomly chosen initial values produce substantially incorrect results. Thus, the initial values used are crucial. It turns out that by using ESPRIT estimates of , , and as the initial values, the procedure of the LS-estimation becomes stable and precise. This shows the usefulness of ESPRIT in combination with LS-methods.
The SSA-based procedure presented here is more flexible than the usual bicoid profile modeling with constant background. On the other hand, as simultaneous estimation of parameters of two exponents is less stable in general, the variation of the resulting pattern parameters can be potentially larger than that in the model with a constant background. However, even for modeling with fixed shape of the background, SSA can be useful for setting initial values of the corresponding fitting procedure.
4. Conclusions
First, we developed the SSA-based technique for signal extraction from one-dimensional spatial gene expression profiles containing exponential-in-distance patterns and constant-like backgrounds. The obtained results are consistent with the state-of-the-art results for the given data, though the data contain strong noise and outliers, and we do not assume models of profile, pattern, and background to be known a priori. Moreover, the feasibility of batch processing of the given data using AutoSSA is demonstrated.
Second, using the SSA-related method ESPRIT, we obtained an analytical representation of the signal as a sum of two exponential functions. The first is the well-known exponential pattern of the bicoid protein, and the second is the background approximated by an exponential or linear function in our case. The employed method produces stable parameter estimates, even for noisy series. Moreover, these estimates can be used as initial values for nonlinear least-squares fitting procedures in which the model is assumed a priori.
Acknowledgments
The authors thank David Holloway for comments and Alexey Timofeyev for ESPRIT software. This work was supported by the NSF/NIGMS BioMath 1-R01-GM072022 and CRDF RUB1-1643-ST-06 Grants.