In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If $P$ and $Q$ are probability distributions on the real line whose first moments exist, and such that $P$ is absolutely continuous with respect to $Q$, i.e. $P \ll Q$, then

$$D_{KL}(P\parallel Q)\geq \Psi_Q^*(\mu'_1(P)),$$

where $\Psi_Q^*$ is the rate function, i.e. the convex conjugate of the cumulant-generating function, of $Q$, and $\mu'_1(P)$ is the first moment of $P$.
The Cramér–Rao bound is a corollary of this result.
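As a concrete check of the statement, here is a minimal numerical sketch; the Gaussian setup is an illustrative choice, not part of the article. For $Q=N(0,1)$ the cumulant-generating function is $\Psi_Q(\theta)=\theta^2/2$, so the rate function is $\Psi_Q^*(\mu)=\mu^2/2$, and for $P=N(\mu,\sigma^2)$ the divergence has a closed form.

```python
# Sketch: verify Kullback's inequality for Gaussians (illustrative choice,
# not from the article). Q = N(0,1) has rate function Psi_Q*(m) = m**2/2;
# P = N(mu, sigma**2) has D_KL(P||Q) = (sigma**2 + mu**2 - 1 - ln sigma**2)/2.
import numpy as np

mu, sigma = 1.7, 0.6
kl = 0.5 * (sigma**2 + mu**2 - 1 - np.log(sigma**2))  # D_KL(P || Q)
rate = 0.5 * mu**2                                    # Psi_Q*(mu'_1(P))
print(kl, rate)    # 1.636... >= 1.445
assert kl >= rate  # equality holds exactly when sigma == 1
```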
Proof

Let $P$ and $Q$ be probability distributions (measures) on the real line whose first moments exist, and such that $P \ll Q$. Consider the natural exponential family of $Q$ given by

$$Q_\theta(A)={\frac {\int_A e^{\theta x}\,Q(dx)}{\int_{-\infty}^{\infty}e^{\theta x}\,Q(dx)}}={\frac {1}{M_Q(\theta)}}\int_A e^{\theta x}\,Q(dx)$$

for every measurable set $A$, where $M_Q$ is the moment-generating function of $Q$. (Note that $Q_0=Q$.) Then

$$D_{KL}(P\parallel Q)=D_{KL}(P\parallel Q_\theta)+\int_{\operatorname{supp}P}\left(\log{\frac {\mathrm{d}Q_\theta}{\mathrm{d}Q}}\right)\mathrm{d}P.$$
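This decomposition is easy to verify numerically on a finite support, where the tilted measure $Q_\theta$ is just a reweighting of $Q$. A small sketch, with the distributions and $\theta$ chosen arbitrarily for illustration:

```python
# Sketch: check D_KL(P||Q) = D_KL(P||Q_theta) + sum_x P(x) log(Q_theta(x)/Q(x))
# on a finite support, with the tilted family Q_theta ∝ exp(theta*x) Q.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0])
q = np.array([0.4, 0.3, 0.2, 0.1])   # Q
p = np.array([0.1, 0.2, 0.3, 0.4])   # P << Q (same support)
theta = 0.8

m_q = np.sum(np.exp(theta * x) * q)  # M_Q(theta)
q_theta = np.exp(theta * x) * q / m_q

kl = lambda a, b: np.sum(a * np.log(a / b))
lhs = kl(p, q)
rhs = kl(p, q_theta) + np.sum(p * np.log(q_theta / q))
assert np.isclose(lhs, rhs)          # the decomposition holds exactly
```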
By Gibbs' inequality we have $D_{KL}(P\parallel Q_\theta)\geq 0$, so that

$$D_{KL}(P\parallel Q)\geq \int_{\operatorname{supp}P}\left(\log{\frac {\mathrm{d}Q_\theta}{\mathrm{d}Q}}\right)\mathrm{d}P=\int_{\operatorname{supp}P}\left(\log{\frac {e^{\theta x}}{M_Q(\theta)}}\right)P(dx).$$
Simplifying the right side, we have, for every real $\theta$ where $M_Q(\theta)<\infty$:

$$D_{KL}(P\parallel Q)\geq \mu'_1(P)\,\theta-\Psi_Q(\theta),$$
where $\mu'_1(P)$ is the first moment, or mean, of $P$, and $\Psi_Q=\log M_Q$ is called the cumulant-generating function. Taking the supremum completes the process of convex conjugation and yields the rate function:

$$D_{KL}(P\parallel Q)\geq \sup_\theta\left\{\mu'_1(P)\,\theta-\Psi_Q(\theta)\right\}=\Psi_Q^*(\mu'_1(P)).$$
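Continuing the same illustrative discrete example from above, the full bound can be checked by computing the convex conjugate numerically, here with scipy's scalar minimizer as an assumed dependency:

```python
# Sketch: compute Psi_Q*(mu'_1(P)) = sup_t { mu1*t - log M_Q(t) } numerically
# and compare it with D_KL(P||Q), on the same discrete example as above.
import numpy as np
from scipy.optimize import minimize_scalar

x = np.array([0.0, 1.0, 2.0, 3.0])
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.1, 0.2, 0.3, 0.4])

mu1 = np.sum(p * x)                                # first moment of P
psi = lambda t: np.log(np.sum(np.exp(t * x) * q))  # cumulant-gen. fn of Q
rate = -minimize_scalar(lambda t: psi(t) - mu1 * t).fun  # Psi_Q*(mu1)

kl = np.sum(p * np.log(p / q))
print(kl, rate)           # kl >= rate, strictly here since P is not a tilt of Q
assert kl >= rate - 1e-9  # Kullback's inequality
```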
Corollary: the Cramér–Rao bound

Start with Kullback's inequality
Let $X_\theta$ be a family of probability distributions on the real line indexed by the real parameter $\theta$, and satisfying certain regularity conditions. Then

$$\lim_{h\to 0}{\frac {D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2}}\geq \lim_{h\to 0}{\frac {\Psi_\theta^*(\mu_{\theta+h})}{h^2}},$$

where $\Psi_\theta^*$ is the convex conjugate of the cumulant-generating function of $X_\theta$, and $\mu_{\theta+h}$ is the first moment of $X_{\theta+h}$.
Left side
The left side of this inequality can be simplified as follows:
$$\begin{aligned}\lim_{h\to 0}{\frac {D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2}}&=\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\log\left({\frac {\mathrm{d}X_{\theta+h}}{\mathrm{d}X_\theta}}\right)\mathrm{d}X_{\theta+h}\\&=-\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\log\left({\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)\mathrm{d}X_{\theta+h}\\&=-\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\log\left(1-\left(1-{\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)\right)\mathrm{d}X_{\theta+h}\\&=\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\left[\left(1-{\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)+{\frac {1}{2}}\left(1-{\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)^2+o\left(\left(1-{\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)^2\right)\right]\mathrm{d}X_{\theta+h}&&{\text{Taylor series for }}\log(1-t)\\&=\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\left[{\frac {1}{2}}\left(1-{\frac {\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)^2\right]\mathrm{d}X_{\theta+h}\\&=\lim_{h\to 0}{\frac {1}{h^2}}\int_{-\infty}^{\infty}\left[{\frac {1}{2}}\left({\frac {\mathrm{d}X_{\theta+h}-\mathrm{d}X_\theta}{\mathrm{d}X_{\theta+h}}}\right)^2\right]\mathrm{d}X_{\theta+h}\\&={\frac {1}{2}}{\mathcal {I}}_X(\theta),\end{aligned}$$

which is half the Fisher information of the parameter $\theta$. (The first-order term drops out because both measures have total mass 1, so its integral vanishes.)
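This limit can be sanity-checked numerically, here with the Poisson family $X_\theta=\mathrm{Poisson}(\theta)$ as a stand-in (an illustrative choice, not from the article), where the divergence has a closed form and $\mathcal{I}_X(\theta)=1/\theta$:

```python
# Sketch: D_KL(X_{theta+h} || X_theta)/h**2 -> I(theta)/2 for the Poisson
# family (illustrative choice): D_KL(Pois(a)||Pois(b)) = b - a + a*log(a/b),
# and the Fisher information is I(theta) = 1/theta.
import numpy as np

theta = 3.0
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    kl = theta - (theta + h) + (theta + h) * np.log((theta + h) / theta)
    print(h, kl / h**2)  # -> 1/(2*theta) = 1/6 = I(theta)/2 as h -> 0
```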
Right side
The right side of the inequality can be developed as follows:
$$\lim_{h\to 0}{\frac {\Psi_\theta^*(\mu_{\theta+h})}{h^2}}=\lim_{h\to 0}{\frac {1}{h^2}}\sup_t\{\mu_{\theta+h}t-\Psi_\theta(t)\}.$$

This supremum is attained at the value $t=\tau$ at which the first derivative of the cumulant-generating function satisfies

$$\Psi'_\theta(\tau)=\mu_{\theta+h},$$

but we have

$$\Psi'_\theta(0)=\mu_\theta,$$

so that, taking the difference quotient $(\Psi'_\theta(\tau)-\Psi'_\theta(0))/\tau=(\mu_{\theta+h}-\mu_\theta)/\tau$ as $h\to 0$,

$$\Psi''_\theta(0)={\frac {d\mu_\theta}{d\theta}}\lim_{h\to 0}{\frac {h}{\tau}}.$$
Moreover,

$$\lim_{h\to 0}{\frac {\Psi_\theta^*(\mu_{\theta+h})}{h^2}}={\frac {1}{2\Psi''_\theta(0)}}\left({\frac {d\mu_\theta}{d\theta}}\right)^2={\frac {1}{2\operatorname{Var}(X_\theta)}}\left({\frac {d\mu_\theta}{d\theta}}\right)^2,$$

since $\Psi''_\theta(0)=\operatorname{Var}(X_\theta)$, the second cumulant of $X_\theta$.
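The right-side limit can be checked on the same illustrative Poisson family, where $\Psi_\theta(t)=\theta(e^t-1)$, so the conjugate is available in closed form and $\mu_\theta=\operatorname{Var}(X_\theta)=\theta$:

```python
# Sketch: Psi*_theta(mu_{theta+h})/h**2 -> (d mu/d theta)**2 / (2 Var) for
# Poisson(theta): Psi_theta(t) = theta*(exp(t)-1), so
# Psi*_theta(m) = m*log(m/theta) - m + theta, and the limit is 1/(2*theta).
import numpy as np

theta = 3.0
conj = lambda m: m * np.log(m / theta) - m + theta  # closed-form conjugate
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    print(h, conj(theta + h) / h**2)  # -> 1/(2*theta) = 1/6, matching the left side
```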
Putting both sides back together
We have:
$${\frac {1}{2}}{\mathcal {I}}_X(\theta)\geq {\frac {1}{2\operatorname{Var}(X_\theta)}}\left({\frac {d\mu_\theta}{d\theta}}\right)^2,$$
which can be rearranged as:
$$\operatorname{Var}(X_\theta)\geq {\frac {(d\mu_\theta/d\theta)^2}{{\mathcal {I}}_X(\theta)}}.$$
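To instantiate the bound, again on the illustrative Poisson family: $\mu_\theta=\theta$, $\operatorname{Var}(X_\theta)=\theta$, and $\mathcal{I}_X(\theta)=1/\theta$, so the Cramér–Rao bound holds with equality there:

```python
# Sketch: the final bound for Poisson(theta): Var = theta and
# (d mu/d theta)**2 / I(theta) = 1 / (1/theta) = theta, i.e. equality.
theta = 3.0
var = theta
bound = 1.0 / (1.0 / theta)  # (d mu_theta / d theta)**2 / I_X(theta)
assert var >= bound          # holds with equality for this family
```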