# The Generalised Maximum Entropy Principle

In this post, I will introduce the Generalised Maximum Entropy Principle as presented in (Kesavan and Kapur 1989). To generalise MaxEnt, the paper makes explicit its three probabilistic entities, namely the entropy measure, the set of moment constraints, and the probability distribution, and then examines the consequences of treating each one as the unknown, e.g., the inverse MaxEnt problems. The paper also links MaxEnt to the Minimum Discrimination Information principle, which is based on the Kullback-Leibler divergence.

# The general form of MaxEnt

The MaxEnt principle essentially proposes a procedure for inferring a particular probability distribution $P = \{p_i\}, \;\; i=1, \cdots, N$ by maximising the entropy $H(P) = - \sum_{i=1}^{N} p_i \ln p_i$ subject to constraints. Maximising the entropy $H$ selects the most unbiased probability distribution, while the imposed constraints ensure the inferred probability distribution is consistent with the data.

Following (Kesavan and Kapur 1989), we first give the formalism of the MaxEnt principle, which in essence is a constrained maximisation solved with Lagrange multipliers: we maximise the entropy

\begin{aligned} H = -\sum_{i=1}^{N} p_i \ln p_i \end{aligned} \ \ \ \ (1)

Subject to the constraints:

\begin{aligned} \sum_{i=1}^{N} p_i = 1 \end{aligned} \ \ \ \ (2)

\begin{aligned} \sum_{i=1}^{N} p_i g_r(x_i) = a_r, \; r = 1, 2, \dots, M \end{aligned} \ \ \ \ (3)

where the first constraint (equation (2)) is the normalization constraint. The second one (equation (3)) ensures the inferred probability distribution is consistent with the data. In equation (3), $g_r(x_i)$ is a function or property (e.g., energy) of a microscopic variable $x_i$ (e.g., a particle) such that $\sum_{i=1}^N p_i g_r(x_i)$ is the theoretical expectation value of the property of the whole system (e.g., idealised gas); and $a_r$ can be a moment (e.g., mean) of the observed property of the whole system.

Equation (3) essentially represents the so-called testable information, i.e., “a statement about a probability distribution whose truth or falsity is well-defined”. It is worth mentioning that the constraints in equation (3) can be inequalities as well, i.e.,

\begin{aligned} \sum_{i=1}^{N} p_i g_r(x_i) \geq a_r, \; r = 1, 2, \dots, M. \end{aligned} \ \ \ \ (3')

The Lagrangian is:

\begin{aligned} L = - \sum_{i=1}^{N} p_i \ln p_i - (\lambda_0 - 1) \left[ \sum_{i=1}^{N} p_i - 1 \right] - \sum_{r=1}^{M} \lambda_r \left[ \sum_{i=1}^{N} p_i g_r(x_i) - a_r \right] . \end{aligned} \ \ \ \ (4)

Setting $\partial L / \partial p_i = 0$ in equation (4) and solving for $p_i$ yields:

\begin{aligned} p_i = \exp \left[ - \left( \lambda_0 + \sum_{r=1}^{M} \lambda_r g_r(x_i) \right) \right], \; i = 1, 2, \cdots, N. \end{aligned}

Letting $g_0(x_i) = 1$ for all $i$, we have

\boxed{\begin{aligned} p_i = \exp \left[ - \sum_{r=0}^{M} \lambda_r g_r(x_i) \right], \; i = 1, 2, \cdots, N. \end{aligned} } \ \ \ \ (5)

To determine the Lagrange multipliers $\lambda_i$, $i=0, 1, \dots, M$, we substitute equation (5) into the constraints (2) and (3) to get

\begin{aligned} \exp(\lambda_0) = \sum_{i=1}^{N} \exp \left( - \sum_{r=1}^M \lambda_{r} g_r(x_i) \right) \end{aligned} \ \ \ \ (6)

and

\begin{aligned} a_r \exp(\lambda_0) = \sum_{i=1}^{N} g_r(x_i) \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right), \; r=1,2, \cdots, M. \end{aligned} \ \ \ \ (7)

Substituting equation (6) into (7), we obtain

\boxed{\begin{aligned} \frac{\sum_{i=1}^{N} g_r(x_i) \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right)}{\sum_{i=1}^{N} \exp \left( - \sum_{s=1}^M \lambda_{s} g_s(x_i) \right)} = a_r, \; r=1,2, \cdots, M. \end{aligned} } \ \ \ \ (8)

We then solve equations (8) using any root solver to obtain $\lambda_r$, $r=1, \cdots, M$; $\lambda_0$ then follows from equation (6).
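As an illustration, here is a minimal Python sketch for the case of a single moment constraint ($M = 1$); the function name and the bisection approach are my own choices, not from the paper. It solves equation (8) for $\lambda_1$ and then recovers the $p_i$ via equation (5), with $\lambda_0$ absorbed into the normalisation.

```python
import math

def maxent_single_constraint(xs, g, a, lam_lo=-50.0, lam_hi=50.0):
    """Solve equation (8) for a single constraint sum_i p_i g(x_i) = a
    by bisection on lambda_1, then recover the p_i from equation (5)."""
    xs = list(xs)

    def moment(lam):
        # Theoretical mean of g under p_i proportional to exp(-lam * g(x_i)).
        w = [math.exp(-lam * g(x)) for x in xs]
        z = sum(w)
        return sum(g(x) * wi for x, wi in zip(xs, w)) / z

    # moment(lam) is strictly decreasing in lam, so bisection converges.
    lo, hi = lam_lo, lam_hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if moment(mid) > a:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    w = [math.exp(-lam * g(x)) for x in xs]
    z = sum(w)
    return lam, [wi / z for wi in w]

# Jaynes' dice: faces 1..6 with observed mean 4.5 (a biased die).
lam1, p = maxent_single_constraint(range(1, 7), lambda x: x, 4.5)
```

In practice a library root solver (e.g. a bracketing method from SciPy) would be used instead of hand-rolled bisection; bisection is chosen here only because the single-constraint moment function is monotone in $\lambda_1$.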

For a continuous random variable with probability density function $f$ supported on a set $\mathcal{X}$, the continuous MaxEnt problem can be written as:

\begin{aligned} \underset{f}{\arg\max} \; - \int_{\mathcal{X}} f(x) \ln f(x) \, dx \end{aligned}

subject to

\begin{aligned} \int_{\mathcal{X}} f(x) \, dx = 1 \end{aligned} \ \ \ \ (9)

\begin{aligned} \int_{\mathcal{X}} f(x) g_r(x) \, dx = a_r; \; r=1, \cdots, M \end{aligned} \ \ \ \ (10)
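A classical illustration of the continuous case (not from the paper, but a standard result): take $\mathcal{X} = \mathbb{R}$ with constraints fixing the mean and variance,

\begin{aligned} \int_{\mathbb{R}} x f(x) \, dx = \mu, \qquad \int_{\mathbb{R}} (x - \mu)^2 f(x) \, dx = \sigma^2. \end{aligned}

The stationarity condition gives $f(x) = \exp\left[ -\lambda_0 - \lambda_1 x - \lambda_2 (x - \mu)^2 \right]$, and solving for the multipliers from the constraints yields

\begin{aligned} f(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp\left[ - \frac{(x - \mu)^2}{2 \sigma^2} \right], \end{aligned}

i.e., the normal distribution is the maximum-entropy distribution on $\mathbb{R}$ with given mean and variance.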

# MDI and Relative Entropy Maximisation

The Minimum Discrimination Information (MDI) principle is based on the Kullback-Leibler (KL) divergence, a measure that discriminates the probability distribution $P$ from $Q$:

\begin{aligned} D_{\text{KL}}(P:Q) = \sum_{i=1}^{N} p_i \ln \frac{p_i}{q_i} \end{aligned} \ \ \ \ (11)

which is always $\geq 0$, attaining its global minimum of zero if and only if the two distributions are identical.
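To make this concrete, here is a small Python check (illustrative only; the function name is mine):

```python
import math

def kl_divergence(p, q):
    """D_KL(P:Q) = sum_i p_i ln(p_i / q_i), equation (11)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.3, 0.2]
q = [1 / 3, 1 / 3, 1 / 3]

print(kl_divergence(p, p))  # 0.0 for identical distributions
print(kl_divergence(p, q))  # strictly positive otherwise
```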

Based on the KL divergence, Kullback proposed the Principle of Minimum Discrimination Information (MDI): given new facts, a new distribution $P$ should be chosen which is as close to the reference distribution $Q$ as possible, so that the new data produce as small an information gain $D_{\text{KL}}(P:Q)$ as possible. This can be formulated as a Lagrangian with the same constraints as MaxEnt:

\begin{aligned} L = \sum_{i=1}^{N} p_i \ln \frac{p_i}{q_i} + (\lambda_0 -1) \left( \sum_{i=1}^N p_i - 1 \right) + \sum_{r=1}^M \lambda_r \left( \sum_{i=1}^N p_i g_r(x_i) - a_r \right) \end{aligned} \ \ \ \ (12)

The negative of the KL divergence is referred to as the relative entropy, defined as

\begin{aligned} H(P|Q) = - \sum_{i=1}^N p_i \ln \frac{p_i}{q_i} \end{aligned} \ \ \ \ (13)

Minimising the Lagrangian (12), equivalently maximising the relative entropy (13) subject to the same constraints, yields (with $g_0 = 1$)

\begin{aligned} p_i = q_i \exp \left[ - \sum_{r=0}^M \lambda_r g_r(x_i) \right] \end{aligned} \ \ \ \ (14)

and

\begin{aligned} \exp(\lambda_0) = \sum_{i=1}^{N} q_i \exp \left[ - \sum_{r=1}^M \lambda_{r} g_r(x_i) \right]. \end{aligned} \ \ \ \ (15)

The relative entropy maximisation principle essentially introduces an a priori probability distribution $Q$. When we do not have any knowledge of the a priori probability, we can assume $Q$ is uniform, that is, $q_i = 1/N$, $i=1, \cdots, N$, in which case

\begin{aligned} H(P|Q) = - \sum_{i=1}^N p_i \ln \frac{p_i}{q_i} = - \sum_{i=1}^N p_i (\ln p_i + \ln N) = H(P) - \ln N \end{aligned} \ \ \ \ (16)
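This identity is easy to verify numerically; the following Python sketch (function names are mine) checks that with a uniform prior the relative entropy equals the Shannon entropy shifted by $\ln N$:

```python
import math

def shannon_entropy(p):
    # H(P) = -sum_i p_i ln p_i
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def relative_entropy(p, q):
    # H(P|Q) = -sum_i p_i ln(p_i / q_i), equation (13)
    return -sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.1, 0.2, 0.3, 0.4]
n = len(p)
q = [1 / n] * n  # uniform prior

lhs = relative_entropy(p, q)
rhs = shannon_entropy(p) - math.log(n)
print(abs(lhs - rhs))  # agrees up to floating-point error
```

Consequently, with a uniform prior, maximising the relative entropy is equivalent to maximising the Shannon entropy.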

# The generalised MaxEnt

The generalised version essentially aims to determine any one of the three entities when the other two are specified. The GMEP first generalises MEP by relaxing the restriction of the entropy measure to the Shannon entropy, as in Jaynes’ MEP, or to the Kullback-Leibler measure, as in Kullback’s MDIP. Following the paper, we introduce two versions, based on MEP and MDIP respectively.

### GMEP (The MEP version)

We introduce a convex function $\phi (\cdot)$ such that the generalised entropy measure is defined as:

\begin{aligned} H(P) = - \sum_{i=1}^N \phi (p_i), \end{aligned} \ \ \ \ (17)

and the constraints are defined as

\begin{aligned} \sum_{i=1}^N p_i = 1 \end{aligned} \ \ \ \ (18)

\begin{aligned} \sum_{i=1}^N p_i g_r(x_i) =a_{r}, \;\; r=1, 2,\cdots, M. \end{aligned} \ \ \ \ (20)

Similarly, we use the Lagrange method to maximise the generalised entropy subject to the $M+1$ constraints above; letting $g_0(x_i) = 1$, this yields the stationarity condition:

\begin{aligned} - \frac{\partial \phi(p_i)}{\partial p_i} = \sum_{r=0}^{M} \lambda_r g_r(x_i). \end{aligned} \ \ \ \ (21)

#### The Direct Problem

If we know the entropy measure $\phi(\cdot)$ and the constraint mean values $a_r$ of $g_r(x_i), \;\; r=1, \cdots, M$, we can determine the probability distribution that maximises the entropy measure. This is done by substituting (21) into (20) and solving for the Lagrange multipliers, which in turn yield the probabilities $p_i$; for the Shannon measure this recovers equation (5).

#### The First Inverse Problem: Determination of Constraints

The first inverse problem is essentially to find the most unbiased constraints. Given the entropy measure $\phi(\cdot)$ and the probability distribution $P = \{p_i\}, \;\; i=1, \cdots, N$, we can determine one or more constraints that yield the given probability distribution, assuming that the entropy is maximised subject to these constraints.

Since we know $\phi(\cdot)$ and $P$, we can evaluate $\frac{\partial \phi(p_i)}{\partial p_i}$, i.e., one side of equation (21). We can then identify the functions $g_r(x_i)$ by matching terms on the other side, which yields one of the most unbiased sets of constraints. We shall use an example below to illustrate how to do this.

#### The Second Inverse Problem: Determination of the Entropy Measure

Given the constraints $g_r(x_i), \;\; r=1, \cdots, M$, and the probability distribution $p_i$, determine the most unbiased entropy measure that, when maximised subject to the given constraints, yields the given probability distribution.

We can obtain a differential equation by substituting the given values of $g_r(x_i)$ and $p_i$ into (21). After solving the differential equation, we obtain $\phi(\cdot)$ and hence the entropy measure $H(P) = -\sum_{i=1}^{N} \phi(p_i)$.

## Example: Generalised Jaynes’ dice

Suppose we have a die with $N$ sides, where the $i$-th side shows a unique number $x_i$ and turns up with probability $p_i$. We shall write the entropy measure in terms of $\phi (p_i)$.

### Solution of the direct problem

For the direct problem, we assume we know the entropy measure and the constraints; the problem is to obtain the probability distribution. Formally, we define the entropy measure as Shannon’s entropy:

\begin{aligned} \phi (p_i) = p_i \ln p_i \end{aligned} \ \ \ \ (22)

We also define the normalisation constraint and the constraint that ensures the inferred probability distribution is consistent with the mean value observed from $T$ die rolls, $\bar{x} = \langle x \rangle = \frac{1}{T} \sum_{t=1}^{T} x_t$:

\begin{aligned} \sum_{i=1}^{N} p_i = 1 \end{aligned} \ \ \ \ (23)

\begin{aligned} \sum_{i=1}^{N} p_i x_i = \bar{x} \end{aligned} \ \ \ \ (24)

Using the Lagrange method, we can obtain the solution for the probability distribution $p_i$ (see my previous blog for a detailed derivation, with some notational differences):

\begin{aligned} p_i = \frac{\exp(- \lambda_1 x_i)}{\sum_{j=1}^N \exp (- \lambda_1 x_j)} \end{aligned} \ \ \ \ (25)

where $\lambda_1$ can be obtained by applying any root solver to the following equation:

\begin{aligned} \frac{\sum_{i=1}^N x_i \exp(- \lambda_1 x_i)}{\sum_{i=1}^N \exp (- \lambda_1 x_i)} = \bar{x} \end{aligned} \ \ \ \ (26)

### Solution of the first inverse problem

For this problem, the entropy measure is again defined as Shannon’s entropy:

\begin{aligned} \phi (p_i) = p_i \ln p_i \end{aligned} \ \ \ \ (27)

and the probability distribution is known to be

\begin{aligned} p_i = \frac{\exp(- \mu x_i)}{\sum_{j=1}^N \exp (- \mu x_j)} \end{aligned} \ \ \ \ (28)

where $\mu$ is a constant. Since the normalisation constraint is always imposed, the problem is to determine a most unbiased constraint of the form

\begin{aligned} \sum_{i=1}^N p_i g_1(x_i) = \bar{x}. \end{aligned} \ \ \ \ (29)

From equation (21) we have

\begin{aligned} - \frac{\partial \phi(p_i)}{\partial p_i} = -(1+ \ln p_i )= \lambda_0 + \lambda_1 g_1(x_i). \end{aligned} \ \ \ \ (30)

Substituting (28) into (30), we have

\begin{aligned} -1 + \ln \left[ \sum_{j=1}^N \exp (- \mu x_j) \right] + \mu x_i = \lambda_0 + \lambda_1 g_1(x_i). \end{aligned} \ \ \ \ (31)

Equating terms, we obtain

\begin{aligned} \lambda_0 & = -1 + \ln \left [ \sum_{j=1}^N \exp (- \mu x_j) \right], \\ \lambda_1 & = \mu \text{ and } g_1(x_i) = x_i. \end{aligned} \ \ \ \ (32)

Therefore, we can derive the constraints:

\begin{aligned} \sum_{i=1}^N p_i = 1 \text{ and } \sum_{i=1}^N p_i x_i = \bar{x}. \end{aligned} \ \ \ \ (33)

### Solution of the second inverse problem

In this problem, we know the probability distribution and the constraints found in the first inverse problem, and we need to determine the most unbiased entropy measure. From equation (21),

\begin{aligned} - \frac{\partial \phi(p_i)}{\partial p_i} = \lambda_0 + \lambda_1 x_i \end{aligned} \ \ \ \ (34)
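A sketch of the remaining steps (my completion of the derivation, following the paper's recipe): invert the given distribution to express $x_i$ in terms of $p_i$, substitute, and integrate. From $p_i = \exp(-\mu x_i) / Z$ with $Z = \sum_{j=1}^N \exp(-\mu x_j)$, we have $\mu x_i = -\ln p_i - \ln Z$, so the stationarity condition $-\partial \phi(p_i) / \partial p_i = \lambda_0 + \lambda_1 x_i$ becomes the differential equation

\begin{aligned} \frac{d \phi}{d p_i} = \frac{\lambda_1}{\mu} \ln p_i + \left( \frac{\lambda_1}{\mu} \ln Z - \lambda_0 \right). \end{aligned}

Integrating gives $\phi(p_i) = \frac{\lambda_1}{\mu} p_i \ln p_i + c \, p_i$, where $c$ collects the constants of integration. Under the normalisation constraint, a term linear in $p_i$ only shifts $\sum_i \phi(p_i)$ by a constant, so up to the positive scale $\lambda_1 / \mu$ the entropy measure is $H(P) = -\sum_{i=1}^N p_i \ln p_i$: the most unbiased entropy measure is Shannon’s, as expected.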

# Reference

Kesavan, H. K., and J. N. Kapur (1989). “The Generalized Maximum Entropy Principle.” IEEE Transactions on Systems, Man, and Cybernetics, 19(5).

would you be able to recommend some articles about it.