The Generalised Maximum Entropy Principle

In this blog, I will introduce the Generalised Maximum Entropy Principle (GMEP) as presented in (Kesavan and Kapur 1989). To generalise MaxEnt, the paper explicitly expresses its three probabilistic entities, namely the entropy measure, the set of moment constraints, and the probability distribution, and then examines the consequences, e.g., the inverse MaxEnt problems. The paper also links MaxEnt to the Minimum Discrimination Information principle (based on the Kullback-Leibler divergence).

The general form of MaxEnt

The MaxEnt principle essentially proposes a procedure for inferring a particular probability distribution P = \{p_i\}, \;\; i=1, \cdots, N by maximising the entropy H(P) = - \sum_{i=1}^{N} p_i \ln p_i subject to constraints. Maximising the entropy H determines the most unbiased probability distribution, while the imposed constraints ensure the inferred probability distribution is consistent with the data.

Following (Kesavan and Kapur 1989), we first give the formalism of the MaxEnt principle, which in essence is a process of constrained maximisation using Lagrange multipliers. We maximise:

\begin{aligned}  H = -\sum_{i=1}^{N} p_i \ln p_i \end{aligned}  \ \ \ \ (1)

Subject to the constraints:

\begin{aligned}  \sum_{i=1}^{N} p_i = 1 \end{aligned}  \ \ \ \ (2)

\begin{aligned}  \sum_{i=1}^{N} p_i g_r(x_i) = a_r, \; r = 1, 2, \dots, M \end{aligned}  \ \ \ \ (3)

where the first constraint (equation (2)) is the normalization constraint. The second one (equation (3)) ensures the inferred probability distribution is consistent with the data. In equation (3), g_r(x_i) is a function or property (e.g., energy) of a microscopic variable x_i (e.g., a particle) such that \sum_{i=1}^N  p_i g_r(x_i) is the theoretical expectation value of the property of the whole system (e.g., idealised gas); and a_r can be a moment (e.g., mean) of the observed property of the whole system.

Equation (3) essentially represents the so-called testable information, i.e., “a statement about a probability distribution whose truth or falsity is well-defined”. It is worth mentioning that equation (3) can take the form of inequality constraints as well, i.e.,

\begin{aligned}  \sum_{i=1}^{N} p_i g_r(x_i) \geq a_r, \; r = 1, 2, \dots, M \end{aligned}  \ \ \ \ (3')

The Lagrangian is:

\begin{aligned} L = - \sum_{i=1}^{N} p_i \ln p_i - (\lambda_0 - 1) \left[ \sum_{i=1}^{N} p_i - 1 \right] - \sum_{r=1}^{M} \lambda_r \left[ \sum_{i=1}^{N} p_i g_r(x_i) - a_r \right] . \end{aligned}  \ \ \ \ (4)

We then maximise the Lagrangian by setting \partial L / \partial p_i = 0, yielding:

\begin{aligned} p_i = \exp \left[  - \left( \lambda_0 + \sum_{r=1}^{M} \lambda_r g_r(x_i) \right) \right], \;  i = 1, 2, \cdots, N. \end{aligned}

Letting g_0(x_i) = 1, we have

\boxed{\begin{aligned} p_i = \exp \left[  - \sum_{r=0}^{M} \lambda_r g_r(x_i) \right], \;  i = 1, 2, \cdots, N. \end{aligned} } \ \ \ \ (5)

To determine the Lagrange multipliers \lambda_r, r=0, 1, \dots, M, we substitute equation (5) into equations (2) and (3) to get

\begin{aligned} \exp(\lambda_0) = \sum_{i=1}^{N} \exp \left( - \sum_{r=1}^M \lambda_{r} g_r(x_i) \right) \end{aligned}  \ \ \ \ (6)


\begin{aligned} a_r \exp(\lambda_0) = \sum_{i=1}^{N} g_r(x_i) \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right), \; r=1,2, \cdots, M. \end{aligned}  \ \ \ \ (7)

Dividing equation (7) by equation (6), we obtain

\boxed{\begin{aligned}  \frac{\sum_{i=1}^{N} g_r(x_i) \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right)}{\sum_{i=1}^{N} \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right)} = a_r, \; r=1,2, \cdots, M. \end{aligned} } \ \ \ \ (8)

We then solve the M equations in (8) using any root solver to obtain the desired values of \lambda_r, r=1, \cdots, M; \lambda_0 then follows from equation (6).

For a continuous random variable with a probability density function f whose support is a set \mathcal{X}, the continuous MaxEnt problem can be written as:

\begin{aligned} \underset{f}{\arg\max} \; - \int_{\mathcal{X}} f(x) \ln f(x) dx \end{aligned}  \ \ \ \ (8')

subject to

\begin{aligned}  \int_{\mathcal{X}} f(x) dx = 1  \end{aligned}  \ \ \ \ (9)

\begin{aligned}  \int_{\mathcal{X}} f(x) g_r(x) dx = \bar{a}_r; \;  r=1, \cdots, M  \end{aligned}  \ \ \ \ (10)

MDI and Relative Entropy Maximisation

The Minimum Discrimination Information (MDI) principle is based on the Kullback-Leibler (KL) divergence, a measure that discriminates the probability distribution P from a reference distribution Q:

\begin{aligned}  D_{\text{KL}}(P:Q) = \sum_{i=1}^{N} p_i \ln \frac{p_i}{q_i} \end{aligned}  \ \ \ \ (11)

which is always \geq 0 and attains its global minimum of zero exactly when the two distributions are identical.
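As a quick numerical illustration of these two properties, here is a minimal pure-Python sketch; the distributions `p` and `q` are made-up examples, and `kl_divergence` is my own helper name:

```python
import math

def kl_divergence(p, q):
    """D_KL(P:Q) = sum_i p_i ln(p_i / q_i); assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two example distributions over the same support (illustrative values).
p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

assert kl_divergence(p, q) > 0           # strictly positive when P != Q
assert abs(kl_divergence(p, p)) < 1e-12  # zero when the distributions coincide
```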

Based on the KL divergence, Kullback proposed the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution P should be chosen which is as close to the reference distribution Q as possible, so that the new data produce as small an information gain D_{\text{KL}}(P:Q) as possible. This can be formulated as a Lagrangian with the same constraints as MaxEnt:

\begin{aligned} L = \sum_{i=1}^{N} p_i \ln \frac{p_i}{q_i}  + (\lambda_0 -1) \left( \sum_{i=1}^N p_i - 1 \right) + \sum_{r=1}^M \lambda_r \left( \sum_{i=1}^N p_i g_r(x_i) - a_r \right) \end{aligned}  \ \ \ \ (12)

The relative entropy, which is the negative of the KL divergence, is defined as

\begin{aligned} H(P|Q) = - \sum_{i=1}^N p_i \ln \frac{p_i}{q_i} \end{aligned}  \ \ \ \ (13)

so maximising the relative entropy H(P|Q) is equivalent to minimising D_{\text{KL}}(P:Q). Setting \partial L / \partial p_i = 0 and letting g_0(x_i) = 1, we obtain

\begin{aligned} p_i = q_i \exp \left[ - \sum_{r=0}^M \lambda_r g_r(x_i)   \right] \end{aligned}  \ \ \ \ (14)


\begin{aligned} \exp(\lambda_0) = \sum_{i=1}^{N} q_i \exp \left[ - \sum_{r=1}^M \lambda_{r} g_r(x_i) \right]. \end{aligned}  \ \ \ \ (15)

The relative entropy maximisation principle essentially introduces a priori probability distribution Q. When we do not have any knowledge of the a priori probability, we can assume Q is uniform, that is, q_i = 1/N, i=1, \cdots, N.

In this case,

\begin{aligned} H(P|Q) = - \sum_{i=1}^N p_i \ln \frac{p_i}{q_i} = - \sum_{i=1}^N p_i (\ln p_i + \ln N) = H(P) - \ln N,  \end{aligned}  \ \ \ \ (16)

so with a uniform prior, relative-entropy maximisation reduces to ordinary entropy maximisation up to the constant \ln N.
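This identity is easy to check numerically. A minimal pure-Python sketch, where the distribution `p` is an arbitrary made-up example and the helper names are mine:

```python
import math

def shannon_entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    # H(P|Q) = -sum_i p_i ln(p_i / q_i) = -D_KL(P:Q)
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

N = 4
p = [0.1, 0.2, 0.3, 0.4]  # an arbitrary distribution, for illustration
q = [1 / N] * N           # uniform prior

lhs = relative_entropy(p, q)
rhs = shannon_entropy(p) - math.log(N)
assert abs(lhs - rhs) < 1e-12
```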

The generalised MaxEnt

The generalised version essentially aims to determine any one of the three entities when the other two are specified. The GMEP first generalises the MEP by relaxing the restriction of the entropy measure to the Shannon entropy (as in Jaynes's MEP) or to the Kullback-Leibler measure (as in Kullback's MDIP). Following the paper, we introduce two versions, based on the MEP and the MDIP, respectively.

GMEP (The MEP version)

We introduce a convex function \phi (\cdot) such that the generalised entropy measure is defined as:

\begin{aligned} H(P) = - \sum_{i=1}^N \phi (p_i),  \end{aligned}  \ \ \ \ (18)

and the constraints are defined as

\begin{aligned} \sum_{i=1}^N p_i = 1  \end{aligned}  \ \ \ \ (19)

\begin{aligned} \sum_{i=1}^N p_i g_r(x_i) =a_{r}, \;\; r=1, 2,\cdots, M.  \end{aligned}  \ \ \ \ (20)

Similarly, we use the Lagrange method to maximise (18) subject to the M+1 constraints in equations (19) and (20). Letting g_0(x_i) = 1 and setting \partial L / \partial p_i = 0 yields the expression:

\begin{aligned} \frac{\partial \phi(p_i)}{\partial p_i} = - \sum_{r=0}^{M} \lambda_r g_r(x_i). \end{aligned}  \ \ \ \ (21)

The Direct Problem

If we know the entropy measure \phi(\cdot) and the constraint mean values a_r of g_r(x_i), \;\; r=1, \cdots, M, we can determine the probability distribution that maximises the entropy measure. This is done by solving (21) for p_i, substituting into (20), and solving for the Lagrange multipliers, which in turn yield the probabilities p_i, as in equation (5) for the Shannon case.

The First Inverse Problem: Determination of Constraints

The first inverse problem is essentially to find the most unbiased constraints. Given the entropy measure \phi(\cdot) and the probability distribution P = \{p_i\}, \;\; i=1, \cdots, N, we can determine one or more constraints that yield the given probability distribution, assuming that the entropy is maximised subject to these constraints.

Since we know \phi(\cdot), we can derive \frac{\partial \phi(p_i)}{\partial p_i} and, given P, evaluate the left-hand side of equation (21). We can then identify the functions g_r(x_i) by matching terms, which yields one of the most unbiased sets of constraints. We shall use an example to illustrate how to do this.

The Second Inverse Problem: Determination of the Entropy Measure

Given the constraints g_r(x_i), \;\; r=1, \cdots, M, and the probability distribution p_i, we determine the most unbiased entropy measure that, when maximised subject to the given constraints, yields the given probability distribution.

We can obtain a differential equation by substituting the given values of g_r(x_i) and p_i into (21). After solving the differential equation, we obtain \phi(\cdot) and hence the entropy measure H(P) = - \sum_{i=1}^{N} \phi(p_i).

Example: Generalised Jaynes’ dice

Suppose we have a dice with N sides, where the i-th side shows a unique number x_i and occurs with probability p_i. We shall denote the entropy measure by \phi (p_i).

Solution of the direct problem

For the direct problem, we assume we know the entropy measure and the constraints; the problem is to obtain the probability distribution. Formally, we take Shannon's entropy measure:

\begin{aligned} \phi (p_i) = p_i \ln p_i  \end{aligned}  \ \ \ \ (22)

We also define the normalisation constraint and the constraint that ensures the inferred probability distribution is consistent with the observed mean value from T dice rolls, \bar{x} = \langle x \rangle = \frac{1}{T} \sum_{t=1}^{T} x_t:

\begin{aligned} \sum_{i=1}^{N} p_i = 1 \end{aligned}  \ \ \ \ (23)

\begin{aligned} \sum_{i=1}^{N} p_i x_i = \bar{x} \end{aligned}  \ \ \ \ (24)

Using the Lagrange method, we obtain the solution for the probability distribution p_i (see my previous blog for a detailed derivation, with some notational differences):

\begin{aligned}  p_i = \frac{\exp(- \lambda_1 x_i)}{\sum_{j=1}^N \exp (- \lambda_1 x_j)}  \end{aligned}  \ \ \ \ (25)

where \lambda_1 can be obtained by using any root solver on the following equation:

\begin{aligned}  \frac{\sum_{i=1}^N x_i \exp(- \lambda_1 x_i)}{\sum_{j=1}^N \exp (- \lambda_1 x_j)} = \bar{x}  \end{aligned}  \ \ \ \ (26)
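As a sketch of how this root solve might look in practice, here is a pure-Python bisection for \lambda_1; the side values and observed mean are made up, and `solve_lambda1` is my own helper, not from the paper:

```python
import math

def solve_lambda1(x, x_bar, lo=-10.0, hi=10.0, steps=200):
    """Bisection for the root condition above: sum_i (x_i - x_bar) exp(-l x_i) = 0.
    Shifting the exponent by x_bar leaves the root unchanged but avoids overflow."""
    def f(l):
        return sum((xi - x_bar) * math.exp(-l * (xi - x_bar)) for xi in x)
    assert f(lo) * f(hi) < 0, "the root must be bracketed by [lo, hi]"
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# A four-sided dice with side values x_i and an observed mean of 3.0 (made up).
x = [1.0, 2.0, 4.0, 8.0]
lam1 = solve_lambda1(x, 3.0)
Z = sum(math.exp(-lam1 * xi) for xi in x)
p = [math.exp(-lam1 * xi) / Z for xi in x]
assert abs(sum(p) - 1.0) < 1e-9
assert abs(sum(pi * xi for pi, xi in zip(p, x)) - 3.0) < 1e-6  # constraint holds
```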

The First Inverse Problem: Determination of Constraints

For this problem, we take Shannon's entropy measure:

\begin{aligned} \phi (p_i) = p_i \ln p_i  \end{aligned}  \ \ \ \ (27)

and we also know the probability distribution:

\begin{aligned}  p_i = \frac{\exp(- \mu x_i)}{\sum_{j=1}^N \exp (- \mu x_j)}  \end{aligned}  \ \ \ \ (28)

where \mu is a constant. Since the normalisation constraint is obvious, the problem is to determine a most unbiased set of constraints of the form

\begin{aligned}  \sum_{i=1}^N p_i g_1(x_i) = \bar{x}  \end{aligned}  \ \ \ \ (29)

From equation (21) we have

\begin{aligned} \frac{\partial \phi(p_i)}{\partial p_i} = 1 + \ln p_i = -(\lambda_0 + \lambda_1 g_1(x_i)). \end{aligned}  \ \ \ \ (30)

Substituting (28) into (30), we have

\begin{aligned} -1 + \ln \sum_{j=1}^N \exp (- \mu x_j) + \mu x_i = \lambda_0 + \lambda_1 g_1(x_i). \end{aligned}  \ \ \ \ (31)

Equating terms, we obtain

\begin{aligned} \lambda_0 & = -1 + \ln \left [ \sum_{j=1}^N \exp (- \mu x_j) \right], \\ \lambda_1 & = \mu \text{ and }  g_1(x_i)  = x_i.   \end{aligned}  \ \ \ \ (32)

Therefore, we recover the constraints:

\begin{aligned} \sum_{i=1}^N p_i = 1 \text{ and } \sum_{i=1}^N p_i x_i = \bar{x}.    \end{aligned}  \ \ \ \ (33)

The Second Inverse Problem: Determination of the Entropy Measure

In this problem, we know the probability distribution and the constraints derived in the previous subsection, and we need to determine the most unbiased entropy measure. From equation (21),

\begin{aligned} \frac{\partial \phi(p_i)}{\partial p_i} = -(\lambda_0 + \lambda_1 x_i).   \end{aligned}  \ \ \ \ (34)

Since p_i = \exp(- \mu x_i) / \sum_{j=1}^N \exp(-\mu x_j), we can express x_i in terms of p_i, namely \mu x_i = - \ln p_i - \ln \sum_{j=1}^N \exp(-\mu x_j). Substituting this, together with \lambda_1 = \mu and \lambda_0 = -1 + \ln \sum_{j=1}^N \exp(-\mu x_j), into (34) gives the differential equation \frac{\partial \phi(p_i)}{\partial p_i} = 1 + \ln p_i, whose solution is \phi(p_i) = p_i \ln p_i (up to terms that do not affect the maximisation). The most unbiased entropy measure is therefore Shannon's entropy, as expected.


Kesavan, H. K., and J. N. Kapur. 1989. “The Generalized Maximum Entropy Principle.” IEEE Transactions on Systems, Man, and Cybernetics 19 (5): 1042–52.


Deriving PLA from MaxCal

The principle of least action (PLA), or more accurately, the principle of stationary action, is a first principle of physics, and one with deep explanatory power. In this blog, I shall summarise a beautiful paper, “Principle of maximum caliber and quantum physics” [1]. I shall omit the quantum physics part but show how to derive the PLA from the principle of Maximum Calibre (MaxCal), the generalisation of MaxEnt.

Let a and b be two points in the phase space of a dynamical system, and p_i(ab) be the probability that the system goes from a to b following an individual path i; then the calibre is

\begin{aligned}  S(a,b) = -\sum_{i=1}^{N} p_i(ab) \ln p_i(ab) \end{aligned} \ \ \ \ (1),

where N is the total number of possible paths from point a to point b. We assume A_i(a,b) is the physical property that characterises the individual path i, and that the average over all possible paths is \langle A(a,b) \rangle. We also assume the probabilities of all paths sum to 1. The two assumptions can be written as:

\begin{aligned} \sum_i^{N} p_i(ab) A_i(a,b) = \langle A(a,b) \rangle \end{aligned} \ \ \ \ (2)

\begin{aligned} \sum_i^N p_i(ab) = 1\end{aligned} \ \ \ \ (3)

Using the Lagrange multipliers method, the auxiliary function can be written as

\begin{aligned} S^{'} = -\sum_i^N p_i(ab) \ln p_i(ab) - \eta \left( \sum_i^N p_i(ab) A_i(a,b) - \langle A(a,b) \rangle \right) -   \lambda \left( \sum_i^N p_i(ab) - 1 \right) \end{aligned} \ \ \ \ (4).

To obtain the stationary points, we calculate the derivatives of S' with respect to p_i and the two multipliers \lambda and \eta:

\begin{aligned} \frac{\partial S^{'}}{\partial p_i} = -\ln p_i(ab) -1 - \eta  A_i(a,b) - \lambda = 0 \end{aligned} \ \ \ \ (5)

\begin{aligned} \frac{\partial S^{'}}{\partial \eta} = \sum_i^N p_i(ab) A_i(a,b) - \langle A(a,b) \rangle = 0 \end{aligned} \ \ \ \ (6)

\begin{aligned} \frac{\partial S^{'}}{\partial \lambda} = \sum_i^N p_i(ab) - 1 = 0 \end{aligned} \ \ \ \ (7)

From equation (5), we have

\begin{aligned} p_i(ab) =  \exp(- 1 - \lambda - \eta A_i(a,b))   = \exp(-1-\lambda) \exp (-\eta A_i(a,b))  \end{aligned} \ \ \ \ (8)

From equation (7), we have

\begin{aligned} \sum_i^N p_i(ab) = \sum_i^N \exp (-1- \eta A_i(a,b) - \lambda) = \exp (-1-\lambda)  \sum_i^N \exp (-\eta A_i(a,b)) = 1 \end{aligned}, which yields:

\begin{aligned}  \exp(-1-\lambda) = \frac{1}{ \sum_i^N \exp (-\eta A_i(a,b)) } \end{aligned} \ \ \ \ (9)

We then substitute \exp(-1-\lambda) in equation (8) with equation (9) to obtain

\begin{aligned} p_i(ab) = \frac{1}{Z} \exp(-\eta A_i(a,b)) \end{aligned} \ \ \ \ (10) ,

where Z = \sum_{i=1}^{N} \exp(-\eta A_i(a,b))

From equation (10), we notice that p_i(ab), i.e., the probability of the i-th path between points a and b, is maximal when its corresponding physical property A_i(a,b) is minimal (assuming \eta > 0).
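To see this numerically, here is a minimal pure-Python sketch of equation (10) with made-up action values; `path_probabilities` is my own helper name:

```python
import math

def path_probabilities(actions, eta):
    """p_i = exp(-eta * A_i) / Z for a finite set of path 'actions' A_i."""
    # Shifting by min(actions) changes Z but not the probabilities; it avoids overflow.
    weights = [math.exp(-eta * (a - min(actions))) for a in actions]
    Z = sum(weights)
    return [w / Z for w in weights]

# Hypothetical actions for four paths between a and b (illustrative numbers).
A = [1.0, 1.2, 2.0, 3.5]

for eta in (0.1, 1.0, 10.0):
    p = path_probabilities(A, eta)
    assert abs(sum(p) - 1) < 1e-12
    assert p[0] == max(p)  # the least-action path is always the most probable

# As eta grows, probability concentrates on the least-action path.
assert path_probabilities(A, 10.0)[0] > path_probabilities(A, 0.1)[0]
```

This also illustrates the remark below about a large \eta suppressing all but the classical trajectory.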

If we replace A_i(a,b) with the action, then we recover the principle of least action. However, we should stress that MaxCal is more general: it not only contains the least-action path as its most probable outcome, but also allows other trajectories with lower probability. Since classically there is only one allowed trajectory, we can infer that \eta must be large, suppressing the extra trajectories and leaving only the most probable one.

1. General IJ. Principle of maximum caliber and quantum physics. Phys Rev E. 2018;98: 012110.

Inference of Gene Regulatory Networks using MaxEnt

In my first blog, I introduced the principle of maximum entropy (MaxEnt). Here I shall use the example of inferring Gene Regulatory Networks (GRNs), as proposed in [1], to see how to apply MaxEnt.

Recall that with MaxEnt, any problem of interest is a problem of inferring the least biased, or most uncertain, probability distribution from limited data. To do so, MaxEnt essentially provides a constrained optimisation procedure: maximise the entropy, which quantifies the uncertainty of the probability distribution, subject to constraints that force the probability distribution to be consistent with the data. In essence, unlike many methods that assume a model, MaxEnt infers the least biased probabilistic model directly from the data.

From the above perspective, the goal of GRN inference is to infer the least biased probabilistic model of gene interactions that could give rise to the macrostate of a genome. To depict the macrostate of the genome of interest, we assume we conduct microarray or RNA-seq experiments to measure the expression levels of its N genes, denoted as a state vector \vec{x} = (x_1, \cdots, x_N). We assume we have repeated the experiment T times, generating T distinct state vectors \vec{x}^1, \dots, \vec{x}^T. Let \rho(\vec{x}) denote the probability distribution function of \vec{x}, i.e., the probability that the genome is in an arbitrary state \vec{x}. Based on MaxEnt, the aim of GRN inference is to find the most unbiased probability distribution \rho(\vec{x}) by maximising the Shannon entropy of the system (which measures the number of compatible microscopic states), subject to some constraints:

\begin{aligned} - \sum_{\vec{x}} \rho(\vec{x}) \ln \rho(\vec{x}). \end{aligned}  \ \ \ \ (8)

The constraints represent the observed macroscopic state of the genome, which include:

1) Normalisation constraint, i.e., the probabilities of all observable states sum to 1:

\begin{aligned} \sum_{\vec{x}} \rho(\vec{x}) = \sum_{k=1}^{T} \rho(\vec{x}^k) = 1 \end{aligned}  \ \ \ \ (9)

2) The first moment, \langle x_i \rangle ,

\begin{aligned} \langle x_i \rangle = \sum_{\vec{x}} \rho(\vec{x}) x_i = \frac{1}{T} \sum_{k=1}^{T} x_{i}^{k} \end{aligned}  \ \ \ \ (10)

3) The second moments, \langle x_i x_j \rangle, which coincide with those derived from the expression data:

\begin{aligned} \langle x_i x_j \rangle = \sum_{\vec{x}} \rho(\vec{x}) x_i x_j = \frac{1}{T} \sum_{k=1}^{T} x_i^k x_j^k  \end{aligned}.  \ \ \ \ (11)

The above constraints 2 and 3 ensure that the probability distribution \rho(\vec{x}) preserves 1) the mean expression level of each gene (constraint 2); and 2) the correlations between genes (constraint 3), respectively.
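The empirical moments on the right-hand sides of equations (10) and (11) are computed directly from the data matrix. A minimal pure-Python sketch with made-up expression values:

```python
# Hypothetical expression data: T = 3 experiments, N = 2 genes (illustrative numbers).
data = [
    [1.0, 2.0],   # state vector x^1
    [2.0, 4.0],   # state vector x^2
    [3.0, 6.0],   # state vector x^3
]
T = len(data)
N = len(data[0])

# First moments <x_i> (equation (10)).
mean = [sum(row[i] for row in data) / T for i in range(N)]

# Second moments <x_i x_j> (equation (11)).
second = [[sum(row[i] * row[j] for row in data) / T for j in range(N)]
          for i in range(N)]

assert mean == [2.0, 4.0]
assert abs(second[0][1] - 28 / 3) < 1e-12  # (1*2 + 2*4 + 3*6) / 3
```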

Using Lagrange multipliers, we combine the objective function and the constraints into a single constrained optimisation problem, which essentially links the microscopic and macroscopic scales. By solving this problem, we maximise the number of microscopic realisations of the system, with the constraints ensuring the inferred probability distribution is consistent with the data, i.e., the observed macroscopic states. The Lagrangian can be written as:

\begin{aligned} L = - \sum_{\vec{x}} \rho(\vec{x}) \ln \rho(\vec{x}) - \lambda \left[ \sum_{\vec{x}} \rho(\vec{x}) - 1 \right] - \sum_{i=1}^{N} \mu_i \left[ \sum_{\vec{x}} \rho(\vec{x}) x_i - \langle x_i \rangle \right] - \sum_{i,j=1}^{N} \eta_{ij} \left[ \sum_{\vec{x}} \rho(\vec{x}) x_i x_j - \langle x_i x_j \rangle \right]  \end{aligned} . \ \ \ \ (12)

To maximise L, we set its first-order derivative against \rho(\vec{x}) to 0, i.e., \partial L / \partial \rho(\vec{x}) = 0 and obtain:

\begin{aligned} \rho(\vec{x}) = \exp (-1 - \lambda - \vec{\mu} \cdot \vec{x} - \frac{1}{2} \vec{x} \mathbf{M} \vec{x} ) = A \exp (-\frac{1}{2} \vec{y} \mathbf{M} \vec{y}) \end{aligned} . \ \ \ \ (13)

where the matrix \mathbf{M} \equiv 2 \eta, \vec{y} = \vec{x} + \mathbf{M}^{-1} \vec{\mu}, and the constant A = \exp \left( \frac{1}{2} \vec{\mu} \mathbf{M}^{-1} \vec{\mu} - 1 - \lambda \right).

Equation (13) depicts a Boltzmann-like distribution, i.e., \rho(\vec{x}) \sim \exp(-H), where H = \frac{1}{2} \vec{y} \mathbf{M} \vec{y} plays the role of the energy function in conventional statistical mechanics. The constant A is the inverse of the corresponding partition function, i.e., the normalisation constant.

To calculate the elements of \mathbf{M}, we assume the discrete states of the genome \vec{x} can be approximated with a continuum. The constraints (equations 9-11) then become

\begin{aligned} 1 = \int d^{N}x \, \rho (\vec{x}) = \int d^{N} y \, A \exp \left(- \frac{1}{2} \vec{y} \mathbf{M} \vec{y}\right) \end{aligned} . \ \ \ \ (14)

\begin{aligned} \langle x_i \rangle & = \int d^{N}x \, \rho (\vec{x}) x_i \\ &= \int d^{N} y \, A \exp \left(- \frac{1}{2} \vec{y} \mathbf{M} \vec{y}\right) \left(y_i - \sum_{j} M_{ij}^{-1} \mu_j\right) \\ &= - \sum_{j} M_{ij}^{-1} \mu_{j} \end{aligned}. \ \ \ \ (15)

\begin{aligned} \langle x_i x_j \rangle & = \int d^{N}x \, \rho (\vec{x}) x_i x_j \\ &= \int d^{N} y \, A \exp \left(- \frac{1}{2} \vec{y} \mathbf{M} \vec{y}\right) \left[ \langle x_i \rangle \langle x_j \rangle + y_i \langle x_j \rangle + y_j \langle x_i \rangle + y_i y_j \right] \\ &= \langle x_i \rangle \langle x_j \rangle + \int d^N y \, A \exp \left(- \frac{1}{2} \vec{y} \mathbf{M} \vec{y}\right) y_i y_j  \end{aligned}. \ \ \ \ (16)

Rearranging equation (16), we have

\begin{aligned}  \int d^N y \, A \exp \left(- \frac{1}{2} \vec{y} \mathbf{M} \vec{y}\right) y_i y_j = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle = C_{ij}  \end{aligned}. \ \ \ \ (17)

Evaluating this Gaussian integral yields:

\begin{aligned} \langle y_i y_j \rangle = M_{ij}^{-1} \end{aligned}.  \ \ \ \ (18)

Combining equations (17) and (18), we obtain

\begin{aligned}  M_{ij}^{-1} = C_{ij} = \langle x_i x_j \rangle - \langle x_i \rangle \langle x_j \rangle, \end{aligned}  \ \ \ \ (19)

or, in matrix form,

\begin{aligned}  \mathbf{M}^{-1} = \mathbf{C}, \end{aligned}  \ \ \ \ (20)

which means the interaction matrix is the inverse of the covariance matrix; the matrix element M_{ij} represents the interaction between genes i and j. This also means we can obtain \mathbf{M} by inverting the covariance matrix of the gene expression data.
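As a toy illustration of equation (20), here is a pure-Python sketch inverting a 2-by-2 covariance matrix; the values are made up, and we assume C is invertible (the singular case is handled by the spectral approach discussed next):

```python
# Toy 2x2 covariance matrix C (made-up values); M = C^{-1} gives the interactions.
C = [[2.0, 0.5],
     [0.5, 1.0]]

# Closed-form inverse of a 2x2 matrix.
det = C[0][0] * C[1][1] - C[0][1] * C[1][0]
M = [[ C[1][1] / det, -C[0][1] / det],
     [-C[1][0] / det,  C[0][0] / det]]

# Check M C = I.
I = [[sum(M[i][k] * C[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
assert abs(I[0][0] - 1) < 1e-12 and abs(I[1][1] - 1) < 1e-12
assert abs(I[0][1]) < 1e-12 and abs(I[1][0]) < 1e-12
```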

However, because the number of samples T in a gene expression dataset is usually much smaller than the number of genes N, \mathbf{C} is singular and the inversion problem is underdetermined. To solve this problem, we use spectral decomposition to invert \mathbf{C} in the non-zero eigenspace, i.e., the subspace spanned by the gene expression data:

\begin{aligned}  C_{ij} = \sum_{k=1}^N \omega_{k} v_{i}^{k} v_{j}^{k} \end{aligned},  \ \ \ \ (21)

where \omega_k is the k-th eigenvalue of \mathbf{C} and \vec{v}^k is the corresponding eigenvector, yielding:

\begin{aligned}  M_{ij} = C_{ij}^{-1} = \sum_{k=1}^{T-1} \omega_{k}^{-1} v_{i}^{k} v_{j}^{k}, \end{aligned}  \ \ \ \ (22)

where the summation is over the non-zero eigenvalues only.

The above method was published in 2006, and since then there have been improved methods, e.g., a method based on an Ising-model approximation [2]. Interested readers are referred to two recent review papers [3,4] (it is worth mentioning that paper [4] also reviews the application of MaxEnt to protein contact prediction). However, all the state-of-the-art methods reviewed in these two papers can only infer undirected networks, which is not enough for gaining insights into the real regulatory mechanisms. Another drawback is that these MaxEnt methods cannot incorporate prior probability distributions derived from previous data, which can be inefficient. In future blogs, I shall discuss my thoughts on addressing these drawbacks.



Principle of Maximum Entropy

This is the first blog of a series to argue a logical but controversial view: biomedicine as inference. In this first one, I am going to introduce the principle of Maximum Entropy (MaxEnt). The principle was proposed by Edwin Jaynes in 1957. MaxEnt and its more general form, Maximum Calibre (MaxCal), have been applied to many problems in physics and biology, e.g., molecular dynamics (see Ken Dill's recent papers on his Scholar page) and Gene Regulatory Network inference. However, many problems in biomedicine that could be naturally solved by MaxEnt and MaxCal remain untouched. Below, I will introduce MaxEnt using the simple example of Jaynes' dice.

What is MaxEnt

MaxEnt was proposed in Jaynes' two papers in 1957 [1,2]. In these papers, he first laid out the argument that the entropy of statistical mechanics, called thermodynamic entropy, and the Shannon information entropy are the same thing. This argument was originally proposed by Shannon himself: “Quantities of the form H = - \sum p_i \log p_i will be recognized as that of entropy as defined in certain formulations of statistical mechanics where p_i is the probability of a system being in cell i of its phase-space.” (Note: as we shall discuss in another blog, thermodynamic entropy and Shannon information entropy are not the same thing; the latter is more general than the former. However, the difference between these two entropies does not affect the correctness of MaxEnt.)

Based on the above argument, Jaynes then derived MaxEnt, which provides a radical view that statistical mechanics is a particular application of statistical inference. In Jaynes' later publications, MaxEnt was generalised as Maximum Calibre (MaxCal). MaxEnt and MaxCal have been successfully applied to both complex equilibrium systems and nonequilibrium systems. Many other first principles and fundamental theories in physics can be derived from MaxEnt and MaxCal; for example, in [3], the least action principle was derived from MaxCal. I shall discuss their simplicity and generality in another blog; in the following, I shall explain the intuition behind MaxEnt from the perspectives of physics and statistics.


In physics, statistical mechanics is a branch that combines the principles and procedures of statistics with the laws of both classical and quantum mechanics. It aims to predict and explain the macroscopic states of a system from a probabilistic examination of the underlying microscopic states. To understand this concept, let’s take a look at a thermodynamic system, e.g., ideal gas, in which the molecules move independently between instantaneous collisions. The macroscopic state of the gas, e.g., the temperature or pressure can be derived from the average microscopic states, i.e., the average velocity of the gas atomic particles.

To predict and explain the macroscopic states from microscopic states, MaxEnt asserts that the most likely macroscopic state of a system is the one compatible with the maximum number of microscopic states. This statement is based on the definition of entropy in physics, which is determined by the number of distinct microscopic states compatible with a given macroscopic state. Therefore, the essence of MaxEnt is to maximise the number of distinct microscopic states compatible with a certain macroscopic state. The MaxEnt principle is a variational principle, which can be formulated as a constrained optimisation problem, with the constraints representing testable information about the macroscopic state.

Although MaxEnt is rooted in statistical mechanics, it is a general principle for inferring probabilistic theoretical models from experimental data, one of the most fundamental and general problems in quantitative science. Let me explain MaxEnt from this statistical inference (or, more poshly, data science) perspective using a simple example: deriving a probabilistic model of rolling a dice.

Example: Jaynes’ dice

We use a simple example called Jaynes' dice, adopted from [4,5]. Assume we have a dice whose N sides (usually 6, but as mathematicians we can generalise) are denoted i = \{1, 2, \cdots, N \}, and we execute T experiments, recording the side facing up x_k, k=1, \cdots, T, and calculating the average value \langle x \rangle = \frac{1}{T} \sum_{k=1}^T x_k. Our aim is to infer a probabilistic model that assigns the probability of the i-th side occurring, denoted p_i, i =1, \cdots, N, given the limited data, e.g., the average value \langle x \rangle. With this probabilistic model, we can answer some practical questions, such as: what is the probability of outcome i in a (T+1)-th measurement?

Since the data are limited, we can infer many different probability distributions that are consistent with the data, but intuitively and logically, the least biased (or most uncertain) one would be preferred. MaxEnt asserts that the least biased probability distribution is the one with the maximum Shannon information entropy (called the maximum entropy probability distribution). Following MaxEnt, the inference problem is then cast into a constrained optimisation problem, i.e., to maximise the entropy of the distribution while satisfying constraints that enforce consistency with the data.

Formally, MaxEnt maximises the Shannon entropy

\begin{aligned} H = - \sum_{i=1}^{N} p_i \ln p_i  \end{aligned}  \ \ \ \ (1)

Subject to the constraints:

\begin{aligned}  \sum_{i=1}^{N} p_i = 1  \end{aligned}  \ \ \ \ (2)

\begin{aligned}  \sum_{i=1}^{N} p_i i= \langle x \rangle = \frac{1}{T} \sum_{k=1}^T  x_k \end{aligned}  \ \ \ \ (3)

where the first constraint (equation (2)) is the normalization constraint. The second one (equation (3)) ensures the inferred probability distribution is consistent with the data.

The Lagrangian is

\begin{aligned} L = - \sum_{i=1}^{N} p_i \ln p_i - \lambda \left( \sum_{i=1}^{N} p_i - 1 \right) - \eta \left( \sum_{i=1}^{N} p_i i - \langle x \rangle \right). \end{aligned}  \ \ \ \ (4)

Setting \partial L / \partial p_i = 0, we have

\begin{aligned}  p_i = \exp ( -1 - \lambda  - \eta i) = \exp(-1-\lambda) \exp(-\eta i) \end{aligned}  \ \ \ \ (5)

Substituting equation (5) into equation (2) yields:

\begin{aligned}  \sum_{i=1}^{N} p_i = \exp(-1-\lambda) \sum_{i=1}^{N}\exp(-\eta i) =1 \Longrightarrow  \end{aligned}

\begin{aligned} \exp(-1-\lambda) = \frac{1}{\sum_{i=1}^{N} \exp(-\eta i)} \end{aligned}  \ \ \ \ (6)

Substitute equation (6) into (5) we obtain

\begin{aligned}  p_i = \frac{1}{Z} \exp(-\eta i) \end{aligned}  \ \ \ \ (7)

where the partition function Z is defined as

\begin{aligned}  Z = \sum_{i=1}^{N} \exp (-\eta i) \end{aligned} \ \ \ \ (8)

Unfortunately, there is no closed-form solution for \eta. However, equation (7) is just an exponential-like distribution with parameter \eta, where Z(\eta) is a normalization constant that makes the probabilities sum to 1. Now the task is to find \eta. Notice that

\begin{aligned}  \frac{\partial \ln Z}{\partial \eta} = - \langle x \rangle \end{aligned} \ \ \ \ (9)

Let \exp(-\eta) = y, we have

\begin{aligned}   Z = \sum_{i=1}^{N} \exp (-\eta i) = \sum_{i=1}^{N} y^i   \end{aligned} \ \ \ \ (10)

Using derivative chain rule, we have

\begin{aligned}   \frac{\partial \ln Z}{\partial \eta} = \frac{\partial \ln Z}{\partial Z} \frac{\partial Z}{\partial y} \frac{\partial y}{\partial \eta} = \frac{1}{Z} \sum_{i=1}^{N} i y^{i-1} \cdot (-y) = - \frac{1}{Z} \sum_{i=1}^{N} i y^{i} = - \langle x \rangle \end{aligned}, \ \ \ \ (11)

Combining equations (9) and (11), we have:

\begin{aligned}     \frac{\sum_{i=1}^{N}(i y^{i})}{ \sum_{i=1}^{N} y^i} =  \langle x \rangle \end{aligned} \ \ \ \ (11),

Expanding and simplifying the above equation we get:

\begin{aligned}   \sum_{i=1}^{N} (i-  \langle x \rangle) y^{i}  = \sum_{i=1}^{N} (i-  \langle x \rangle) \exp(-\eta i) =  0\end{aligned} \ \ \ \ (12)

The above equation (12) can be solved by any root solver, which gives us the desired value of \eta. To see how to solve equation (12), please read Brian Keng’s blog Maximum Entropy Distributions [5].
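As a sketch of such a root solve, here is a pure-Python bisection for \eta in equation (12); `solve_eta` is my own helper, and the observed mean of 4.5 for a six-sided dice is an illustrative choice:

```python
import math

def solve_eta(N, x_avg, lo=-10.0, hi=10.0, steps=200):
    """Bisection on equation (12): sum_{i=1..N} (i - <x>) exp(-eta * i) = 0.
    The exponent is shifted by <x>, which leaves the root unchanged but is stabler."""
    def f(eta):
        return sum((i - x_avg) * math.exp(-eta * (i - x_avg))
                   for i in range(1, N + 1))
    assert f(lo) * f(hi) < 0, "the root must be bracketed by [lo, hi]"
    for _ in range(steps):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# A six-sided dice whose observed average is 4.5 (illustrative numbers).
eta = solve_eta(6, 4.5)
Z = sum(math.exp(-eta * i) for i in range(1, 7))
p = [math.exp(-eta * i) / Z for i in range(1, 7)]
assert abs(sum(i * pi for i, pi in zip(range(1, 7), p)) - 4.5) < 1e-6
assert p[5] == max(p)  # a mean above 3.5 skews probability toward side 6
```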

From the perspective of statistical mechanics, each individual side i of the dice is a microscopic state, while the observed average outcome of the rolls characterises the macroscopic state. We observe the macroscopic state through independent runs of dice rolling, which yield limited data (e.g., T runs).


1. Jaynes ET. Information Theory and Statistical Mechanics. Phys Rev. 1957;106: 620–630.

2. Jaynes ET. Information Theory and Statistical Mechanics. II. Physical Review. 1957. pp. 171–190. doi:10.1103/physrev.108.171

3. General IJ. Principle of maximum caliber and quantum physics. Phys Rev E. 2018;98: 012110.

4. Fougere PF. Maximum Entropy Calculations on a Discrete Probability Space. In: Erickson GJ, Smith CR, editors. Maximum-Entropy and Bayesian Methods in Science and Engineering: Foundations. Dordrecht: Springer Netherlands; 1988. pp. 205–234.

5. Keng B. Maximum Entropy Distributions. In: Bounded Rationality [Internet]. 27 Jan 2017 [cited 13 Mar 2020]. Available: