In my first blog post, I introduced the principle of maximum entropy (MaxEnt). Here I use the inference of gene regulatory networks (GRNs), as proposed in [1], to show how MaxEnt is applied.
Recall that with MaxEnt, any problem of interest is a problem of inferring the least biased, or most uncertain, probability distribution from limited data. To do so, MaxEnt essentially provides a constrained optimisation procedure: maximise the entropy, which quantifies the uncertainty of the probability distribution, while satisfying constraints that force the probability distribution to be consistent with the data. In essence, unlike many methods that assume a model, MaxEnt infers the least biased probabilistic model directly from the data.
From the above perspective, the goal of GRN inference is to infer the least biased probabilistic model of gene interactions that could give rise to the macrostate of a genome. To depict the macrostate of the genome of interest, we assume we conduct microarray or RNA-seq experiments to measure the expression levels of its $N$ genes, denoted as a state vector $\mathbf{x} = (x_1, \dots, x_N)$. We assume we have repeated the experiment $T$ times, which generated $T$ distinct state vectors $\mathbf{x}^k$, $k = 1, \dots, T$. Let $\rho(\mathbf{x})$ denote the probability distribution function of $\mathbf{x}$, i.e., the probability that the genome is in the arbitrary state $\mathbf{x}$. Based on MaxEnt, the aim of GRN inference is to find the most unbiased probability distribution $\rho(\mathbf{x})$ by maximising the Shannon entropy of the system, i.e., the amount of information in the whole genome, or the number of microscopic states, subject to some constraints:

$$S = -\sum_{\mathbf{x}} \rho(\mathbf{x}) \ln \rho(\mathbf{x})$$
The constraints represent the observed macroscopic state of the genome, which include:
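1) Normalisation constraint, i.e., the probabilities of all observable states sum to 1:

$$\sum_{\mathbf{x}} \rho(\mathbf{x}) = 1 \tag{9}$$

2) The first moment, $\langle x_i \rangle$, coincides with that derived from the expression data:

$$\langle x_i \rangle = \sum_{\mathbf{x}} \rho(\mathbf{x})\, x_i = \frac{1}{T} \sum_{k=1}^{T} x_i^k \tag{10}$$

3) The second moment, $\langle x_i x_j \rangle$, coincides with that derived from the expression data:

$$\langle x_i x_j \rangle = \sum_{\mathbf{x}} \rho(\mathbf{x})\, x_i x_j = \frac{1}{T} \sum_{k=1}^{T} x_i^k x_j^k \tag{11}$$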
The above constraints 2 and 3 ensure that the probability distribution preserves 1) the mean expression level of each gene (constraint 2); and 2) the correlations between genes (constraint 3), respectively.
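The data side of constraints 2 and 3 can be computed directly from an expression matrix. The following is a minimal sketch, assuming the data are stored as a NumPy array with one row per experiment and one column per gene; the function name `empirical_moments` and this layout are illustrative choices, not part of the method in [1].

```python
import numpy as np

def empirical_moments(X):
    """Empirical first and second moments of expression data.

    X is assumed to have shape (T, N): T experiments (rows), N genes (columns).
    """
    X = np.asarray(X, dtype=float)
    T = X.shape[0]
    first = X.mean(axis=0)                  # <x_i>: mean expression of each gene
    second = X.T @ X / T                    # <x_i x_j>: averaged pairwise products
    cov = second - np.outer(first, first)   # covariance: <x_i x_j> - <x_i><x_j>
    return first, second, cov
```

The covariance returned here uses the $1/T$ normalisation, so it should agree with `np.cov(X, rowvar=False, bias=True)`.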
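Using Lagrange multipliers, we combine the objective function and the constraints into a single constrained optimisation problem, which essentially links the microscopic and macroscopic scales. By solving this problem, we essentially maximise the number of microscopic realisations of the system, with the constraints ensuring that the inferred probability distribution is consistent with the data, i.e., the observations of the macroscopic states. The Lagrangian can be written as:

$$\mathcal{L} = -\sum_{\mathbf{x}} \rho(\mathbf{x}) \ln \rho(\mathbf{x}) - \lambda \Big( \sum_{\mathbf{x}} \rho(\mathbf{x}) - 1 \Big) - \sum_{i} \mu_i \Big( \sum_{\mathbf{x}} \rho(\mathbf{x})\, x_i - \langle x_i \rangle \Big) - \frac{1}{2} \sum_{i,j} M_{ij} \Big( \sum_{\mathbf{x}} \rho(\mathbf{x})\, x_i x_j - \langle x_i x_j \rangle \Big)$$

where $\lambda$, $\mu_i$ and $M_{ij}$ are the Lagrange multipliers, and $\langle x_i \rangle$ and $\langle x_i x_j \rangle$ are the first and second moments derived from the data. To maximise $\mathcal{L}$, we set its first-order derivative with respect to $\rho(\mathbf{x})$ to 0, i.e.,

$$\frac{\partial \mathcal{L}}{\partial \rho(\mathbf{x})} = -\ln \rho(\mathbf{x}) - 1 - \lambda - \sum_{i} \mu_i x_i - \frac{1}{2} \sum_{i,j} M_{ij} x_i x_j = 0$$

and obtain:

$$\rho(\mathbf{x}) = A \exp\Big( -\sum_{i} \mu_i x_i - \frac{1}{2} \sum_{i,j} M_{ij} x_i x_j \Big) \tag{13}$$

where the matrix $M = (M_{ij})$ collects the Lagrange multipliers of the second-moment constraints, and the constant $A = e^{-(1+\lambda)}$.

Equation (13) depicts a Boltzmann-like distribution, i.e., $\rho(\mathbf{x}) \propto e^{-H(\mathbf{x})}$, where $H(\mathbf{x}) = \sum_{i} \mu_i x_i + \frac{1}{2} \sum_{i,j} M_{ij} x_i x_j$ plays the role of the energy function in conventional statistical mechanics. Constant $A$ is the inverse of the partition function $Z = \sum_{\mathbf{x}} e^{-H(\mathbf{x})}$.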
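To calculate the elements of the matrix $M$, we assume the discrete states of the genome $\mathbf{x}$ can be approximated with a continuum. Therefore, the constraints (Equations 9-11) become

$$\int \rho(\mathbf{x})\, d\mathbf{x} = 1 \tag{14}$$

$$\int \rho(\mathbf{x})\, x_i\, d\mathbf{x} = \langle x_i \rangle \tag{15}$$

$$\int \rho(\mathbf{x})\, x_i x_j\, d\mathbf{x} = \langle x_i x_j \rangle \tag{16}$$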
Rearranging equation (16), we have
Solving the above equation yields:
By substituting Equation (18) into Equation (16), we obtain

$$C = M^{-1}$$

or

$$M = C^{-1}$$

where $C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle$ is the covariance matrix of the expression data. This means the interaction matrix $M$ is the inverse of the covariance matrix $C$, and its element $M_{ij}$ represents the interaction between genes $i$ and $j$. This also means we can obtain $M$ by inverting the covariance matrix of the gene expression data.
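As a hedged sketch of why this inverse relation holds (not necessarily the route taken in [1]), assume for brevity that expression levels are measured as deviations from their means, so that the MaxEnt distribution takes the form $\rho(\mathbf{x}) = A \exp(-\tfrac{1}{2} \mathbf{x}^{\top} M \mathbf{x})$ with $M$ symmetric and positive definite. The standard Gaussian integrals then give:

$$\begin{aligned}
\int e^{-\frac{1}{2} \mathbf{x}^{\top} M \mathbf{x}}\, d\mathbf{x} &= (2\pi)^{N/2} (\det M)^{-1/2} \quad\Rightarrow\quad A = (2\pi)^{-N/2} (\det M)^{1/2}, \\
\langle x_i x_j \rangle = A \int x_i x_j\, e^{-\frac{1}{2} \mathbf{x}^{\top} M \mathbf{x}}\, d\mathbf{x} &= (M^{-1})_{ij}.
\end{aligned}$$

Under the zero-mean assumption the second-moment matrix equals the covariance matrix $C$, so $C = M^{-1}$.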
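However, because the number of samples $T$ in a gene expression dataset is usually much smaller than the number of genes $N$, the covariance matrix $C$ is singular and the inversion problem is underdetermined. To solve this problem, we use spectral decomposition to invert $C$ in the non-zero eigenspace, which represents the subspace spanned by the gene expression data:

$$C = \sum_{\nu} \omega_{\nu}\, \mathbf{v}_{\nu} \mathbf{v}_{\nu}^{\top}$$

where $\omega_{\nu}$ is the $\nu$-th eigenvalue of $C$, and $\mathbf{v}_{\nu}$ is the corresponding eigenvector, yielding the interaction matrix:

$$M = \sum_{\nu:\, \omega_{\nu} \neq 0} \frac{1}{\omega_{\nu}}\, \mathbf{v}_{\nu} \mathbf{v}_{\nu}^{\top}$$

where the summation is over the non-zero eigenvalues.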
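The inversion restricted to the non-zero eigenspace can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the code used in [1]: the function name `interaction_matrix`, the (experiments × genes) data layout, and the relative tolerance used to decide which eigenvalues count as non-zero are all assumptions made here.

```python
import numpy as np

def interaction_matrix(X, rel_tol=1e-10):
    """Invert the covariance matrix within its non-zero eigenspace.

    X is assumed to have shape (T, N): T experiments (rows), N genes (columns).
    rel_tol is a hypothetical cut-off: eigenvalues below rel_tol times the
    largest eigenvalue are treated as zero.
    """
    X = np.asarray(X, dtype=float)
    C = np.cov(X, rowvar=False, bias=True)      # (N, N) covariance, 1/T normalisation
    eigvals, eigvecs = np.linalg.eigh(C)        # C is symmetric, so eigh applies
    keep = eigvals > rel_tol * eigvals.max()    # mask of non-zero eigenvalues
    V = eigvecs[:, keep]                        # eigenvectors spanning the data subspace
    M = (V / eigvals[keep]) @ V.T               # sum over kept nu of v v^T / omega
    return M, C
```

When $T \ll N$, at most $T - 1$ eigenvalues of the covariance matrix are non-zero, so the result is a pseudo-inverse rather than a true inverse; it satisfies $CMC = C$ and $MCM = M$ up to numerical error.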
The above method was published in 2006, and several improved methods have appeared since, e.g., a method based on an Ising model approximation [2]. Interested readers are referred to two recent review papers [3,4] (it is worth mentioning that paper [4] also reviews the application of MaxEnt to protein contact prediction). However, all the state-of-the-art methods reviewed in the two papers can only infer undirected networks, which is not enough for gaining insights into the real regulatory mechanisms. Another drawback is that these MaxEnt methods cannot incorporate a prior probability distribution derived from previous data, which may make them inefficient. In future blog posts, I shall discuss my thoughts on addressing these drawbacks.
Reference
1.