Principle of Maximum Entropy

This is the first blog of a series to argue a logical but controversial view: biomedicine as inference. In this first one, I am going to introduce the principle of Maximum Entropy (MaxEnt). The principle was proposed by Edward Jaynes in 1957. MaxEnt and its more general form Maximum Calibre (MaxCal) have been applied in many problems in physics and biology, e.g., molecular dynamics, (See Ken Dill’s recent papers on his Scholar page), Gene Regulatory Network inference. However, many problems in biomedicine that can be naturally solved by MaxEnt and MaxCal remained untouched. In this first blog, I will introduce MaxEnt using an example of Gene Regulatory Network inference.

What is MaxEnt

MaxEnt was proposed in Jaynes’ two papers in 1957 [1,2]. In the paper, he first laid the argument that the entropy of statistical mechanics, called thermodynamic entropy and the Shannon information entropy are the same thing. This argument was originally proposed by Shannon himself: “Quantities of the form H = - \sum p_i \log p_i will be recognized as that of entropy as defined in certain formulations of statistical mechanics where p_i is the probability of a system being in cell i of its phase-space.” (Note: as we shall discuss in another blog, thermodynamic entropy and Shannon information entropy are not the same thing, but with the latter being more general than the former. However, the difference between these two entropies does not affect the correctness of MaxEnt.)

Based on the above argument, Jaynes then derived MaxEnt, which provides a radical view that statistical mechanics is a particular application of statistical inference. In Jaynes’ later publications, MaxEnt was generalised as Maximum Calibre (MaxCal). MaxEnt and MaxCal have been successfully applied to both complex equilibrium systems and nonequilibrium systems. Many other first principles and fundamental theories in physics can be derived from MaxEnt and MaxCal. For example, in [3], the least action principle was derived from MaxCal. I shall discuss their simplicity and generality in another blog, but in the following, I shall first explain the intuition behind MaxEnt first from the perspective of physics and statistics.

Intuition

In physics, statistical mechanics is a branch that combines the principles and procedures of statistics with the laws of both classical and quantum mechanics. It aims to predict and explain the macroscopic states of a system from a probabilistic examination of the underlying microscopic states. To understand this concept, let’s take a look at a thermodynamic system, e.g., ideal gas, in which the molecules move independently between instantaneous collisions. The macroscopic state of the gas, e.g., the temperature or pressure can be derived from the average microscopic states, i.e., the average velocity of the gas atomic particles.

To predict and explain the macroscopic states from microscopic states, MaxEnt asserts that the most likely macroscopic state of a system is the one with the maximum number of microscopic states. This statement is based on the definition of entropy in physics, which is the number of distinct microscopic states compatible with a given macroscopic state. Therefore, the essence of MaxEnt is to maximise the number of distinct microscopic states compatible with a certain macroscopic state. The MaxEnt principle is a variational principle, which can be formulated as a constrained optimisation problem, and the constraints representing testable information about the macroscopic state.

Although MaxEnt rooted in statistical mechanics, it is a general principle for inferring probabilistic theoretical models from experimental data, one of the most fundamental and general problems in quantitative science. Let me explain MaxEnt from this statistical inference (or more poshly, data science) perspective using a simple example, deriving a probabilistic model of rolling a dice.

Example: Jaynes’ dice

We use a simple example called Jaynes dice which was adopted from [4,5]. Assuming we have a dice with values of all N sides (usually 6 sides, but as a mathematician, we can generalise) denoted as i =\{1, 2, \cdots, N \} and execute T experiments to record the side facing up x_k, k=1, \cdots, T and calculate the average value as \langle x \rangle = \frac{1}{T} \sum_{k=1}^T x_k. Our aim is to infer a probabilistic model to assign the probability distribution of all the events, i.e., the probability of the ith side occurs, denoted as p_i , i =1, \cdots, N, given the limited data, e.g., average value \langle x \rangle. With this probabilistic model, we can answer some practical questions such as what is the probability of having outcome i in an (N+1)th measurement

Since the data are limited, we can infer many different probability distributions that are consistent with the data, but intuitively and logically, the one that is the least biased (or the most uncertain) would be prefered. MaxEnt asserts that the least biased probability distribution is the one with the maximum Shannon information entropy (called maximum entropy probability distribution). Following MaxEnt, the inference problem is then cast into a constrained optimisation problem, i.e., to maximise the entropy distribution while satisfying the constraints which enforce the probability distribution to be consistent with the data.

Let’s denote the probability distribution function as

\begin{aligned} H = - \sum_{i=1}^{N} p_i \ln p_i  \end{aligned}  \ \ \ \ (1)

Subject to the constraints:

\begin{aligned}  \sum_{i=1}^{N} p_i = 1  \end{aligned}  \ \ \ \ (2)

\begin{aligned}  \sum_{i=1}^{N} p_i i= \langle x \rangle = \frac{1}{T} \sum_{k=1}^T  x_k \end{aligned}  \ \ \ \ (3)

where the first constraint (equation (2)) is the the normalization constraint. The second one (equation (3)) ensures the inferred probability distribution is consistent with the data.

The Lagrangian is

\begin{aligned} L = - \sum_{i=1}^{N} p_i \ln p_i - \lambda \left( \sum_{i=1}^{N} p_i - 1 \right) - \eta \left( \sum_{i=1}^{N} p_i i - \langle x \rangle \right). \end{aligned}  \ \ \ \ (4)

Setting \partial L / \partial p_i = 0, we have

\begin{aligned}  p_i = \exp ( -1 - \lambda  - \eta i) = \exp(-1-\lambda) \exp(-\eta i) \end{aligned}  \ \ \ \ (5)

Substitute equation (5) into equation (2), yeiling:

\begin{aligned}  \sum_{i=1}^{N} p_i = \exp(-1-\lambda) \sum_{i=1}^{N}\exp(-\eta i) =1 \Longrightarrow  \end{aligned}

\begin{aligned} \exp(-1-\lambda) = \frac{1}{\sum_{i=1}^{N} \exp(-\eta i)} \end{aligned}  \ \ \ \ (6)

Substitute equation (6) into (5) we obtain

\begin{aligned}  p_i = \frac{1}{Z} \exp(-\eta i) \end{aligned}  \ \ \ \ (7)

where the partition function Z is defined as

\begin{aligned}  Z = \sum_{i=1}^{N} \exp (-\eta i) \end{aligned} \ \ \ \ (8)

Unfortunately, equation (7) has no closed form solution. However, the solution is just an exponential-like distribution with parameter \eta and Z(\eta) as a normalization constant to make sure the probabilities sum to 1. Now the task is to find \eta. Notice that

\begin{aligned}  \frac{\partial \ln Z}{\partial \eta} = - \langle x \rangle \end{aligned} \ \ \ \ (9)

Let \exp(-\eta) = y, we have

\begin{aligned}   Z = \sum_{i=1}^{N} \exp (-\eta i) = \sum_{i=1}^{N} y^i   \end{aligned} \ \ \ \ (10)

Using derivative chain rule, we have

\begin{aligned}   \frac{\partial \ln Z}{\partial \eta} = \frac{\partial \ln Z}{\partial Z} \frac{\partial Z}{\partial y} \frac{\partial y}{\partial \eta} = -\frac{1}{Z} \sum_{i=1}^{N}(i y^{i-1}) \cdot (-y) = - \frac{1}{Z} \sum_{i=1}^{N}(i y^{i}) = - \langle x \rangle \end{aligned}, \ \ \ \ (11)

Substitute equation (10) into (11), we have:

\begin{aligned}     \frac{\sum_{i=1}^{N}(i y^{i})}{ \sum_{i=1}^{N} y^i} =  \langle x \rangle \end{aligned} \ \ \ \ (11),

Expanding and simplifying the above equation we get:

\begin{aligned}   \sum_{i=1}^{N} (i-  \langle x \rangle) y^{i}  = \sum_{i=1}^{N} (i-  \langle x \rangle) \exp(-\eta i) =  0\end{aligned} \ \ \ \ (12)

The above equation (12) can be solved by any root solver, which gives us the desired value of \eta. To see how to solve equation (12), please read Brian Keng’s blog Maximum Entropy Distributions [5].

From the perspective of statistical mechanics, the outcome of the dice rolling is the macroscopic states, the probability of the ith side occurring is the underlying microscopic states. We can observe the macroscopic states by independent runs of dice rolling and collect limited data (e.g., N runs).

Reference

1. Jaynes ET. Information Theory and Statistical Mechanics. Phys Rev. 1957;106: 620–630.

2. Jaynes ET. Information Theory and Statistical Mechanics. II. Physical Review. 1957. pp. 171–190. doi:10.1103/physrev.108.171

3. General IJ. Principle of maximum caliber and quantum physics. Phys Rev E. 2018;98: 012110.

4. Fougere PF. Maximum Entropy Calculations on a Discrete Probability Space. In: Erickson GJ, Smith CR, editors. Maximum-Entropy and Bayesian Methods in Science and Engineering: Foundations. Dordrecht: Springer Netherlands; 1988. pp. 205–234.

5. Keng B. Maximum Entropy Distributions. In: Bounded Rationality [Internet]. 27 Jan 2017 [cited 13 Mar 2020]. Available: http://bjlkeng.github.io/posts/maximum-entropy-distributions/

One thought on “Principle of Maximum Entropy

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: