Not to be confused with the chain rule in calculus.
In probability theory, the chain rule[1] (also called the general product rule[2][3]) describes how to calculate the probability of the intersection of events that are not necessarily independent, or the joint distribution of random variables, using conditional probabilities. The rule allows one to express a joint probability in terms of conditional probabilities alone.[4] It is notably used in the context of discrete stochastic processes and in applications such as the study of Bayesian networks, which describe a probability distribution in terms of conditional probabilities.
Chain rule for events
For two events $A$ and $B$, the chain rule states that

$$\mathbb{P}(A\cap B)=\mathbb{P}(B\mid A)\,\mathbb{P}(A),$$

where $\mathbb{P}(B\mid A)$ denotes the conditional probability of $B$ given $A$.
Example

Urn A contains 1 black ball and 2 white balls, and Urn B contains 1 black ball and 3 white balls. Suppose we pick an urn at random and then draw a ball from it. Let event $A$ be choosing the first urn, i.e. $\mathbb{P}(A)=\mathbb{P}(\overline{A})=1/2$, where $\overline{A}$ is the complementary event of $A$. Let event $B$ be drawing a white ball. Given that we have chosen the first urn, the probability of drawing a white ball is $\mathbb{P}(B\mid A)=2/3$. The intersection $A\cap B$ then describes choosing the first urn and drawing a white ball from it. The probability can be calculated by the chain rule as follows:
$$\mathbb{P}(A\cap B)=\mathbb{P}(B\mid A)\,\mathbb{P}(A)=\frac{2}{3}\cdot\frac{1}{2}=\frac{1}{3}.$$
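The result is easy to sanity-check numerically. The following Python sketch (a minimal simulation added for illustration; the function name is ours) estimates $\mathbb{P}(A\cap B)$ by repeatedly choosing an urn and drawing a ball, and should return a value close to $1/3$:

```python
import random

def estimate_first_urn_and_white(trials: int = 100_000) -> float:
    """Estimate P(A ∩ B): the first urn is chosen AND a white ball is drawn."""
    urn_a = ["black", "white", "white"]           # Urn A: 1 black, 2 white
    urn_b = ["black", "white", "white", "white"]  # Urn B: 1 black, 3 white
    hits = 0
    for _ in range(trials):
        chose_a = random.random() < 0.5           # pick an urn uniformly at random
        ball = random.choice(urn_a if chose_a else urn_b)
        if chose_a and ball == "white":
            hits += 1
    return hits / trials

print(estimate_first_urn_and_white())  # ≈ 0.333
```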
Finitely many events
For events $A_{1},\ldots,A_{n}$ whose intersection does not have probability zero, the chain rule states
$$\begin{aligned}\mathbb{P}\left(A_{1}\cap A_{2}\cap\ldots\cap A_{n}\right)&=\mathbb{P}\left(A_{n}\mid A_{1}\cap\ldots\cap A_{n-1}\right)\mathbb{P}\left(A_{1}\cap\ldots\cap A_{n-1}\right)\\&=\mathbb{P}\left(A_{n}\mid A_{1}\cap\ldots\cap A_{n-1}\right)\mathbb{P}\left(A_{n-1}\mid A_{1}\cap\ldots\cap A_{n-2}\right)\mathbb{P}\left(A_{1}\cap\ldots\cap A_{n-2}\right)\\&=\mathbb{P}\left(A_{n}\mid A_{1}\cap\ldots\cap A_{n-1}\right)\mathbb{P}\left(A_{n-1}\mid A_{1}\cap\ldots\cap A_{n-2}\right)\cdot\ldots\cdot\mathbb{P}(A_{3}\mid A_{1}\cap A_{2})\,\mathbb{P}(A_{2}\mid A_{1})\,\mathbb{P}(A_{1})\\&=\mathbb{P}(A_{1})\,\mathbb{P}(A_{2}\mid A_{1})\,\mathbb{P}(A_{3}\mid A_{1}\cap A_{2})\cdot\ldots\cdot\mathbb{P}(A_{n}\mid A_{1}\cap\dots\cap A_{n-1})\\&=\prod_{k=1}^{n}\mathbb{P}(A_{k}\mid A_{1}\cap\dots\cap A_{k-1})\\&=\prod_{k=1}^{n}\mathbb{P}\left(A_{k}\,\middle|\,\bigcap_{j=1}^{k-1}A_{j}\right),\end{aligned}$$

with the convention that the empty intersection occurring for $k=1$ is the whole sample space, so that the first factor in the products is simply $\mathbb{P}(A_{1})$.
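The factorization can be verified mechanically on any finite probability space. The sketch below (a toy uniform space of our own choosing, not from the article; it uses the convention, formalized later in the article, that $\mathbb{P}(A\mid B)=0$ when $\mathbb{P}(B)=0$) checks the three-event case exactly with rational arithmetic:

```python
from fractions import Fraction
from itertools import product

# Toy uniform space: two coin flips and one die roll, 24 equally likely outcomes.
omega = set(product("HT", "HT", range(1, 7)))

def prob(event: set) -> Fraction:
    """P(event) on the uniform space `omega`."""
    return Fraction(len(event), len(omega))

def cond(a: set, b: set) -> Fraction:
    """P(a | b), with the convention that it is 0 when P(b) = 0."""
    return prob(a & b) / prob(b) if prob(b) > 0 else Fraction(0)

a1 = {w for w in omega if w[0] == "H"}   # first flip is heads
a2 = {w for w in omega if w[0] == w[1]}  # the two flips agree
a3 = {w for w in omega if w[2] > 3}      # die shows more than 3

lhs = prob(a1 & a2 & a3)
rhs = prob(a1) * cond(a2, a1) * cond(a3, a1 & a2)
assert lhs == rhs  # both sides equal 1/8
print(lhs, rhs)
```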
For $n=4$, i.e. four events, the chain rule reads
$$\begin{aligned}\mathbb{P}(A_{1}\cap A_{2}\cap A_{3}\cap A_{4})&=\mathbb{P}(A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\,\mathbb{P}(A_{3}\cap A_{2}\cap A_{1})\\&=\mathbb{P}(A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\,\mathbb{P}(A_{3}\mid A_{2}\cap A_{1})\,\mathbb{P}(A_{2}\cap A_{1})\\&=\mathbb{P}(A_{4}\mid A_{3}\cap A_{2}\cap A_{1})\,\mathbb{P}(A_{3}\mid A_{2}\cap A_{1})\,\mathbb{P}(A_{2}\mid A_{1})\,\mathbb{P}(A_{1}).\end{aligned}$$
Example

We randomly draw 4 cards, one at a time and without replacement, from a deck of 52 cards. What is the probability that we have picked 4 aces?

First, we set $A_{n}:=\left\{\text{draw an ace in the }n^{\text{th}}\text{ try}\right\}$. We then obtain the following probabilities:
$$\mathbb{P}(A_{1})=\frac{4}{52},\qquad \mathbb{P}(A_{2}\mid A_{1})=\frac{3}{51},\qquad \mathbb{P}(A_{3}\mid A_{1}\cap A_{2})=\frac{2}{50},\qquad \mathbb{P}(A_{4}\mid A_{1}\cap A_{2}\cap A_{3})=\frac{1}{49}.$$
Applying the chain rule,
$$\mathbb{P}(A_{1}\cap A_{2}\cap A_{3}\cap A_{4})=\frac{4}{52}\cdot\frac{3}{51}\cdot\frac{2}{50}\cdot\frac{1}{49}=\frac{24}{6497400}=\frac{1}{270725}.$$
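Exact rational arithmetic reproduces the result; a brief Python check (our own illustration):

```python
from fractions import Fraction

# The factors P(A1), P(A2 | A1), P(A3 | A1 ∩ A2), P(A4 | A1 ∩ A2 ∩ A3).
factors = [Fraction(4, 52), Fraction(3, 51), Fraction(2, 50), Fraction(1, 49)]

prob_four_aces = Fraction(1)
for f in factors:
    prob_four_aces *= f

print(prob_four_aces)         # 1/270725 (= 24/6497400 in lowest terms)
print(float(prob_four_aces))  # ≈ 3.7e-06
```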
Statement of the theorem and proof
Let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space. Recall that the conditional probability of an $A\in\mathcal{A}$ given $B\in\mathcal{A}$ is defined as
$$\mathbb{P}(A\mid B):={\begin{cases}\dfrac{\mathbb{P}(A\cap B)}{\mathbb{P}(B)},&\mathbb{P}(B)>0,\\[2pt]0,&\mathbb{P}(B)=0.\end{cases}}$$
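This case distinction is straightforward to mirror in code; a minimal sketch (the helper name is hypothetical) that returns 0 when the conditioning event has probability zero:

```python
from fractions import Fraction

def conditional(p_a_and_b: Fraction, p_b: Fraction) -> Fraction:
    """P(A | B) per the definition above: 0 whenever P(B) = 0."""
    return p_a_and_b / p_b if p_b > 0 else Fraction(0)

print(conditional(Fraction(1, 3), Fraction(1, 2)))  # 2/3
print(conditional(Fraction(0), Fraction(0)))        # 0, by convention
```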
Then we have the following theorem.

Theorem (Chain rule). Let $(\Omega,\mathcal{A},\mathbb{P})$ be a probability space and $A_{1},\ldots,A_{n}\in\mathcal{A}$. Then

$$\mathbb{P}\left(A_{1}\cap\ldots\cap A_{n}\right)=\prod_{k=1}^{n}\mathbb{P}\left(A_{k}\mid A_{1}\cap\ldots\cap A_{k-1}\right).$$
Proof
The formula follows immediately by recursion:

$$\begin{aligned}(1)\qquad \mathbb{P}(A_{1})\,\mathbb{P}(A_{2}\mid A_{1})&=\mathbb{P}(A_{1}\cap A_{2})\\(2)\qquad \mathbb{P}(A_{1})\,\mathbb{P}(A_{2}\mid A_{1})\,\mathbb{P}(A_{3}\mid A_{1}\cap A_{2})&=\mathbb{P}(A_{1}\cap A_{2})\,\mathbb{P}(A_{3}\mid A_{1}\cap A_{2})\\&=\mathbb{P}(A_{1}\cap A_{2}\cap A_{3}),\end{aligned}$$
where we used the definition of the conditional probability in the first step.
Chain rule for discrete random variables
Two random variables
For two discrete random variables $X,Y$, we use the events $A:=\{X=x\}$ and $B:=\{Y=y\}$ in the definition above, and find the joint distribution as
$$\mathbb{P}(X=x,Y=y)=\mathbb{P}(X=x\mid Y=y)\,\mathbb{P}(Y=y),$$
or
$$\mathbb{P}_{(X,Y)}(x,y)=\mathbb{P}_{X\mid Y}(x\mid y)\,\mathbb{P}_{Y}(y),$$
where $\mathbb{P}_{X}(x):=\mathbb{P}(X=x)$ is the probability distribution of $X$ and $\mathbb{P}_{X\mid Y}(x\mid y)$ is the conditional probability distribution of $X$ given $Y$.
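As a concrete illustration, a joint probability mass function can be assembled from a marginal and a conditional table exactly as in the formula above. The distributions below are a toy choice of ours, not from the article:

```python
# Marginal P_Y(y) and conditional P_{X|Y}(x | y) for binary X and Y.
p_y = {0: 0.4, 1: 0.6}
p_x_given_y = {
    0: {0: 0.9, 1: 0.1},
    1: {0: 0.3, 1: 0.7},
}

# Chain rule: P_{(X,Y)}(x, y) = P_{X|Y}(x | y) * P_Y(y).
p_xy = {(x, y): p_x_given_y[y][x] * p_y[y]
        for y in p_y for x in p_x_given_y[y]}

print(p_xy)                # the joint distribution
print(sum(p_xy.values()))  # 1.0, a sanity check
```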
Finitely many random variables
Let $X_{1},\ldots,X_{n}$ be random variables and $x_{1},\dots,x_{n}\in\mathbb{R}$. By the definition of conditional probability,
$$\mathbb{P}\left(X_{n}=x_{n},\ldots,X_{1}=x_{1}\right)=\mathbb{P}\left(X_{n}=x_{n}\mid X_{n-1}=x_{n-1},\ldots,X_{1}=x_{1}\right)\mathbb{P}\left(X_{n-1}=x_{n-1},\ldots,X_{1}=x_{1}\right),$$
and using the chain rule, where we set $A_{k}:=\{X_{k}=x_{k}\}$, we can find the joint distribution as
$$\begin{aligned}\mathbb{P}\left(X_{1}=x_{1},\ldots,X_{n}=x_{n}\right)&=\mathbb{P}\left(X_{1}=x_{1}\mid X_{2}=x_{2},\ldots,X_{n}=x_{n}\right)\mathbb{P}\left(X_{2}=x_{2},\ldots,X_{n}=x_{n}\right)\\&=\mathbb{P}(X_{1}=x_{1})\,\mathbb{P}(X_{2}=x_{2}\mid X_{1}=x_{1})\,\mathbb{P}(X_{3}=x_{3}\mid X_{1}=x_{1},X_{2}=x_{2})\cdot\ldots\\&\qquad\cdot\mathbb{P}(X_{n}=x_{n}\mid X_{1}=x_{1},\dots,X_{n-1}=x_{n-1}).\end{aligned}$$
For $n=3$, i.e. for three random variables, the chain rule reads
$$\begin{aligned}\mathbb{P}_{(X_{1},X_{2},X_{3})}(x_{1},x_{2},x_{3})&=\mathbb{P}(X_{1}=x_{1},X_{2}=x_{2},X_{3}=x_{3})\\&=\mathbb{P}(X_{3}=x_{3}\mid X_{2}=x_{2},X_{1}=x_{1})\,\mathbb{P}(X_{2}=x_{2},X_{1}=x_{1})\\&=\mathbb{P}(X_{3}=x_{3}\mid X_{2}=x_{2},X_{1}=x_{1})\,\mathbb{P}(X_{2}=x_{2}\mid X_{1}=x_{1})\,\mathbb{P}(X_{1}=x_{1})\\&=\mathbb{P}_{X_{3}\mid X_{2},X_{1}}(x_{3}\mid x_{2},x_{1})\,\mathbb{P}_{X_{2}\mid X_{1}}(x_{2}\mid x_{1})\,\mathbb{P}_{X_{1}}(x_{1}).\end{aligned}$$
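This factorization is what makes ancestral sampling possible in, e.g., Bayesian networks: draw $X_1$ from its marginal, then $X_2$ given $X_1$, and so on. A minimal sketch (the helper and the toy distributions are ours):

```python
import random

def sample_chain(marginal, conditionals):
    """Draw (x1, ..., xn) by sampling each X_k given the values drawn so far.

    `marginal` maps x to P(X1 = x); each entry of `conditionals` maps the
    tuple of earlier values to a dict {x: P(X_k = x | earlier values)}.
    """
    def draw(dist):
        return random.choices(list(dist), weights=list(dist.values()))[0]

    values = [draw(marginal)]
    for cond in conditionals:
        values.append(draw(cond(tuple(values))))
    return tuple(values)

# X1 is a fair bit; X2 copies X1 with probability 0.8; X3 = X1 XOR X2.
p_x1 = {0: 0.5, 1: 0.5}
conds = [
    lambda prev: {prev[0]: 0.8, 1 - prev[0]: 0.2},
    lambda prev: {prev[0] ^ prev[1]: 1.0},
]
print(sample_chain(p_x1, conds))  # e.g. (1, 1, 0)
```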
References

Schilling, René L. (2021). Measure, Integral, Probability & Processes – Probab(ilistical)ly the Theoretical Minimum (1st ed.). Dresden: Technische Universität Dresden. ISBN 979-8-5991-0488-9.
Feller, William (1968). An Introduction to Probability Theory and Its Applications, vol. I (3rd ed.). New York / London / Sydney: Wiley. ISBN 978-0-471-25708-0.
Russell, Stuart J.; Norvig, Peter (2003). Artificial Intelligence: A Modern Approach (2nd ed.). Upper Saddle River, New Jersey: Prentice Hall. ISBN 0-13-790395-2. p. 496.