Moments of Random Variables

Oliver C. Ibe , in Fundamentals of Applied Probability and Random Processes (Second Edition), 2014

3.3 Expectation of Nonnegative Random Variables

Some random variables assume only nonnegative values. For example, the time X until a component fails cannot be negative. In Chapter 1 we defined the reliability function R(t) of a component as the probability that the component has not failed by time t. Thus, if the PDF of X is f_X(x) and the CDF is F_X(x), we can define the reliability function of the component as R_X(t), which is related to the CDF and PDF as follows:

(3.3a) \quad R_X(t) = P[X > t] = 1 - P[X \le t] = 1 - F_X(t)

(3.3b) \quad F_X(t) = 1 - R_X(t)

Proposition 3.1:

For a nonnegative random variable X with CDF F_X(x), the expected value is given by

E[X] = \int_0^\infty P[X > x]\,dx = \int_0^\infty [1 - F_X(x)]\,dx \quad (X \text{ continuous})

E[X] = \sum_{x=0}^\infty P[X > x] = \sum_{x=0}^\infty [1 - F_X(x)] \quad (X \text{ discrete})

Proof:

We first prove the case for a continuous random variable. Since

P[X > x] = \int_x^\infty f_X(u)\,du

we have that

\int_0^\infty P[X > x]\,dx = \int_0^\infty \int_x^\infty f_X(u)\,du\,dx

The region of integration \{(x, u) \mid 0 \le x < \infty;\ x \le u < \infty\} is shown in Figure 3.2.

Figure 3.2. Region of Integration

From the figure we observe that the region of integration can be transformed into \{(x, u) \mid 0 \le x \le u;\ 0 \le u < \infty\}, which gives

\int_0^\infty P[X > x]\,dx = \int_0^\infty \int_x^\infty f_X(u)\,du\,dx = \int_{u=0}^\infty \left[\int_{x=0}^u dx\right] f_X(u)\,du = \int_{u=0}^\infty u\,f_X(u)\,du = E[X]

This proves the proposition for the case of a continuous random variable. For a discrete random variable X that assumes only nonnegative values, we have that

E[X] = \sum_{x=0}^\infty x\,p_X(x) = \sum_{x=1}^\infty x\,p_X(x) = 1\,p_X(1) + 2\,p_X(2) + 3\,p_X(3) + 4\,p_X(4) + \cdots
= p_X(1) + [p_X(2) + p_X(2)] + [p_X(3) + p_X(3) + p_X(3)] + [p_X(4) + p_X(4) + p_X(4) + p_X(4)] + \cdots
= \sum_{x=1}^\infty p_X(x) + \sum_{x=2}^\infty p_X(x) + \sum_{x=3}^\infty p_X(x) + \cdots
= [1 - F_X(0)] + [1 - F_X(1)] + [1 - F_X(2)] + \cdots
= \sum_{x=0}^\infty [1 - F_X(x)] = \sum_{x=0}^\infty P[X > x]

Example 3.5

Use the above method to find the expected value of the random variable X whose PDF is given in Example 3.4.

Solution:

From Example 3.4, the PDF of X is given by

f_X(x) = \begin{cases} \lambda e^{-\lambda x} & x \ge 0 \\ 0 & x < 0 \end{cases}

Thus, the CDF of X is given by

F_X(x) = \int_0^x f_X(u)\,du = \int_0^x \lambda e^{-\lambda u}\,du = 1 - e^{-\lambda x}

Since X is a nonnegative random variable, its expected value is given by

E[X] = \int_0^\infty [1 - F_X(x)]\,dx = \int_0^\infty e^{-\lambda x}\,dx = \frac{1}{\lambda}

which is the same result that we obtained in Example 3.4.
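As a numerical sketch (not from the text), Proposition 3.1 can be checked directly in both its continuous and discrete forms; here the exponential rate λ = 2 and the geometric parameter p = 0.3 are assumed purely for illustration.

```python
import math

# Continuous check: X ~ Exponential(lam), so E[X] should equal 1/lam.
# Integrate the tail 1 - F_X(x) = exp(-lam*x) on [0, T] by the trapezoid rule.
lam = 2.0            # assumed rate parameter, for illustration only
T, n = 20.0, 200000  # truncation point and number of subintervals
h = T / n
tail_sum = sum(math.exp(-lam * i * h) for i in range(1, n))
e_continuous = h * (tail_sum + 0.5 * (1.0 + math.exp(-lam * T)))

# Discrete check: X ~ Geometric(p) on {1, 2, ...}, so E[X] = 1/p.
# Sum the tail probabilities P[X > x] = (1 - p)^x over x = 0, 1, 2, ...
p = 0.3
e_discrete = sum((1 - p) ** x for x in range(2000))

print(e_continuous, e_discrete)
```

Both sums match the closed-form means 1/λ = 0.5 and 1/p ≈ 3.333 to within the discretization and truncation error.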

URL: https://www.sciencedirect.com/science/article/pii/B9780128008522000031

What is This Stuff Called Probability?

John K. Kruschke , in Doing Bayesian Data Analysis (Second Edition), 2015

4.3.2.2 The normal probability density function

Any function that has only nonnegative values and integrates to 1 (i.e., satisfies Equation 4.3) can be construed as a probability density function. Perhaps the most famous probability density function is the normal distribution, also known as the Gaussian distribution. A graph of the normal curve is a well-known bell shape; an example is shown in Figure 4.4.

Figure 4.4. A normal probability density function, shown with a comb of narrow intervals. The integral is approximated by summing the width times height of each interval.

The mathematical formula for the normal probability density has two parameters: μ (Greek mu) is called the mean of the distribution and σ (Greek sigma) is called the standard deviation. The value of μ governs where the middle of the bell shape falls on the x-axis, so it is called a location parameter, and the value of σ governs how wide the bell is, so it is called a scale parameter. As discussed in Section 2.2, you can think of the parameters as control knobs with which to manipulate the location and scale of the distribution. The mathematical formula for the normal probability density is

(4.4) \quad p(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\!\left(-\frac{1}{2}\left[\frac{x-\mu}{\sigma}\right]^2\right).

Figure 4.4 shows an example of the normal distribution for specific values of μ and σ as indicated. Notice that the peak probability density can be greater than 1.0 when the standard deviation, σ, is small. In other words, when the standard deviation is small, a lot of probability mass is squeezed into a small interval, and consequently the probability density in that interval is high.

Figure 4.4 also illustrates that the area under the normal curve is, in fact, 1. The x axis is divided into a dense comb of small intervals, with width denoted Δx. The integral of the normal density is approximated by summing the masses of all the tiny intervals as in Equation 4.2. As can be seen in the text within the graph, the sum of the interval areas is essentially 1.0. Only rounding error, and the fact that the extreme tails of the distribution are not included in the sum, prevent the sum from being exactly 1.
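The comb-of-intervals approximation described above can be sketched numerically; the parameter values below (μ = 0, σ = 0.25, Δx = 0.001) are assumptions for illustration, not the values used in Figure 4.4.

```python
import math

# Approximate the integral of a normal density by a comb of narrow
# intervals: sum width * height over the comb, as in Figure 4.4.
mu, sigma = 0.0, 0.25  # assumed example parameters
dx = 0.001             # comb interval width
lo, hi = mu - 8 * sigma, mu + 8 * sigma

def p(x):  # Equation 4.4
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

n = int((hi - lo) / dx)
mass = sum(p(lo + (i + 0.5) * dx) * dx for i in range(n))  # sum of interval areas
peak = p(mu)  # peak density; exceeds 1.0 here because sigma is small
print(mass, peak)
```

The summed interval areas come out essentially 1.0, and with σ = 0.25 the peak density is about 1.6, illustrating that a density (unlike a probability mass) can exceed 1.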

URL: https://www.sciencedirect.com/science/article/pii/B9780124058880000040

Multiple Random Variables

Oliver C. Ibe , in Fundamentals of Applied Probability and Random Processes (Second Edition), 2014

5.5 Determining Probabilities from a Joint CDF

Suppose that X and Y are given random variables and we are required to determine the probability of a certain event defined in terms of X and Y for which the joint CDF is known. We start by sketching the event in the x-y plane. For example, assume we are required to find P[a < X \le b, c < Y \le d]. The region of interest is shown in Figure 5.2, which defines four partitions.

Figure 5.2. Domain Partitions

Consider the following events:

E_1 = \{X \le b, Y \le d\}
E_2 = \{X \le b, Y \le c\}
E_3 = \{X \le a, Y \le d\}
E_4 = \{X \le a, Y \le c\}
E_5 = \{a < X \le b, c < Y \le d\}

The region of interest is B, which corresponds to event E_5 and can be obtained as follows:

E_5 = E_1 - E_2 - E_3 + E_4

Thus,

P[a < X \le b, c < Y \le d] = F_{XY}(b, d) - F_{XY}(b, c) - F_{XY}(a, d) + F_{XY}(a, c)

Note that the probability is simply the joint CDF evaluated at the point where X and Y jointly have the larger of their two values plus the CDF evaluated at the point where they jointly have their smaller values minus the CDF evaluated at the two points where they have mixed smaller and larger values. When X and Y are independent random variables, the above result becomes

P[a < X \le b, c < Y \le d] = F_{XY}(b, d) - F_{XY}(b, c) - F_{XY}(a, d) + F_{XY}(a, c)
= F_X(b)F_Y(d) - F_X(b)F_Y(c) - F_X(a)F_Y(d) + F_X(a)F_Y(c)
= F_X(b)[F_Y(d) - F_Y(c)] - F_X(a)[F_Y(d) - F_Y(c)]
= [F_X(b) - F_X(a)][F_Y(d) - F_Y(c)]
= P[a < X \le b]\,P[c < Y \le d]

Finally, for the case when X and Y are discrete random variables, the joint CDF can be obtained from the joint PMF as follows:

F_{XY}(x, y) = \sum_{m \le x} \sum_{n \le y} p_{XY}(m, n)

For example, if X and Y take only nonnegative values,

F_{XY}(1, 2) = \sum_{m \le 1} \sum_{n \le 2} p_{XY}(m, n) = p_{XY}(0, 0) + p_{XY}(0, 1) + p_{XY}(0, 2) + p_{XY}(1, 0) + p_{XY}(1, 1) + p_{XY}(1, 2)

Example 5.7

The joint CDF of two discrete random variables X and Y is given as follows:

F_{XY}(x, y) = \begin{cases} 1/8 & x = 1, y = 1 \\ 5/8 & x = 1, y = 2 \\ 1/4 & x = 2, y = 1 \\ 1 & x = 2, y = 2 \end{cases}

Determine the following:

a.

Joint PMF of X and Y

b.

Marginal PMF of X

c.

Marginal PMF of Y

Solution:

The joint PMF is obtained from the relationship

F_{XY}(x, y) = \sum_{m \le x} \sum_{n \le y} p_{XY}(m, n)

Thus,

F_{XY}(1, 1) = p_{XY}(1, 1) = 1/8
F_{XY}(1, 2) = p_{XY}(1, 1) + p_{XY}(1, 2) = 5/8 \Rightarrow p_{XY}(1, 2) = 5/8 - 1/8 = 1/2
F_{XY}(2, 1) = p_{XY}(1, 1) + p_{XY}(2, 1) = 1/4 \Rightarrow p_{XY}(2, 1) = 1/4 - 1/8 = 1/8
F_{XY}(2, 2) = p_{XY}(1, 1) + p_{XY}(1, 2) + p_{XY}(2, 1) + p_{XY}(2, 2) = 1 \Rightarrow p_{XY}(2, 2) = 1 - 1/8 - 1/2 - 1/8 = 1/4

The joint PMF becomes

p_{XY}(x, y) = \begin{cases} 1/8 & x = 1, y = 1 \\ 1/2 & x = 1, y = 2 \\ 1/8 & x = 2, y = 1 \\ 1/4 & x = 2, y = 2 \end{cases}

The marginal PMF of X is given by

p_X(x) = \begin{cases} p_{XY}(1, 1) + p_{XY}(1, 2) = 5/8 & x = 1 \\ p_{XY}(2, 1) + p_{XY}(2, 2) = 3/8 & x = 2 \end{cases}

The marginal PMF of Y is given by

p_Y(y) = \begin{cases} p_{XY}(1, 1) + p_{XY}(2, 1) = 1/4 & y = 1 \\ p_{XY}(1, 2) + p_{XY}(2, 2) = 3/4 & y = 2 \end{cases}
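The differencing step used in the solution generalizes to the rectangle rule p(x, y) = F(x, y) - F(x-1, y) - F(x, y-1) + F(x-1, y-1); a short sketch (not from the text) reproduces Example 5.7 with it:

```python
# Recover the joint PMF of Example 5.7 from its joint CDF by differencing:
#   p(x, y) = F(x, y) - F(x-1, y) - F(x, y-1) + F(x-1, y-1)
F = {(1, 1): 1/8, (1, 2): 5/8, (2, 1): 1/4, (2, 2): 1.0}

def cdf(x, y):
    # F_XY is 0 below the support of (X, Y)
    return F.get((x, y), 0.0) if x >= 1 and y >= 1 else 0.0

p = {(x, y): cdf(x, y) - cdf(x - 1, y) - cdf(x, y - 1) + cdf(x - 1, y - 1)
     for x in (1, 2) for y in (1, 2)}

# Marginals: sum out the other variable.
pX = {x: p[(x, 1)] + p[(x, 2)] for x in (1, 2)}
pY = {y: p[(1, y)] + p[(2, y)] for y in (1, 2)}
print(p, pX, pY)
```

The computed PMF and marginals agree with the hand calculation: p = {1/8, 1/2, 1/8, 1/4}, p_X = (5/8, 3/8), p_Y = (1/4, 3/4).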

URL: https://www.sciencedirect.com/science/article/pii/B9780128008522000055

Basic Concepts in Probability

Oliver C. Ibe , in Markov Processes for Stochastic Modeling (Second Edition), 2013

1.9.1 Markov Inequality

The Markov inequality applies to random variables that take only nonnegative values. It can be stated as follows:

Proposition 1.1

If X is a random variable that takes only nonnegative values, then for any a>0,

P[X \ge a] \le \frac{E[X]}{a}

Proof

We consider only the case when X is a continuous random variable. Thus,

E[X] = \int_0^\infty x f_X(x)\,dx = \int_0^a x f_X(x)\,dx + \int_a^\infty x f_X(x)\,dx \ge \int_a^\infty x f_X(x)\,dx \ge \int_a^\infty a f_X(x)\,dx = a \int_a^\infty f_X(x)\,dx = a P[X \ge a]

and the result follows.
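As a closed-form sketch (an illustration, not part of the text), the inequality can be checked for an exponential random variable with rate 1, where both sides are known exactly:

```python
import math

# Check Markov's inequality for X ~ Exponential(1): E[X] = 1 and
# P[X >= a] = exp(-a), so both sides are available in closed form.
results = []
for a in (0.5, 1.0, 2.0, 5.0, 10.0):
    exact = math.exp(-a)   # P[X >= a]
    bound = 1.0 / a        # E[X] / a
    results.append((a, exact, bound))
    print(a, round(exact, 4), round(bound, 4))
```

The bound holds at every a but is loose (for a < 1 it is trivial, exceeding 1), which is typical of Markov's inequality: it uses only the mean.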

URL: https://www.sciencedirect.com/science/article/pii/B9780124077959000013

Special Random Variables

Sheldon M. Ross , in Introduction to Probability and Statistics for Engineers and Scientists (Fifth Edition), 2014

5.5 Normal Random Variables

A random variable is said to be normally distributed with parameters μ and σ², and we write X \sim N(\mu, \sigma^2), if its density is

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-(x-\mu)^2/2\sigma^2}, \quad -\infty < x < \infty

The normal density f(x) is a bell-shaped curve that is symmetric about μ and that attains its maximum value of \frac{1}{\sqrt{2\pi}\,\sigma} \approx 0.399/\sigma at x = μ (see Figure 5.7).

Figure 5.7. The normal density function (a) with μ = 0, σ = 1 and (b) with arbitrary μ and σ².

The normal distribution was introduced by the French mathematician Abraham de Moivre in 1733 and was used by him to approximate probabilities associated with binomial random variables when the binomial parameter n is large. This result was later extended by Laplace and others and is now encompassed in a probability theorem known as the central limit theorem, which gives a theoretical base to the often noted empirical observation that, in practice, many random phenomena obey, at least approximately, a normal probability distribution. Some examples of this behavior are the height of a person, the velocity in any direction of a molecule in gas, and the error made in measuring a physical quantity.

To compute E[X] note that

E[X - \mu] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty (x - \mu)\,e^{-(x-\mu)^2/2\sigma^2}\,dx

Letting y = (x - \mu)/\sigma gives

E[X - \mu] = \frac{\sigma}{\sqrt{2\pi}} \int_{-\infty}^\infty y\,e^{-y^2/2}\,dy

But

\int_{-\infty}^\infty y\,e^{-y^2/2}\,dy = -e^{-y^2/2}\Big|_{-\infty}^\infty = 0

showing that E[X - \mu] = 0, or equivalently that

E [ X ] = μ

Using this, we now compute Var(X) as follows:

(5.5.1) \quad \mathrm{Var}(X) = E[(X-\mu)^2] = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^\infty (x-\mu)^2 e^{-(x-\mu)^2/2\sigma^2}\,dx = \frac{\sigma^2}{\sqrt{2\pi}} \int_{-\infty}^\infty y^2 e^{-y^2/2}\,dy

With u = y and dv = y\,e^{-y^2/2}\,dy, the integration by parts formula

\int u\,dv = uv - \int v\,du

yields that

\int_{-\infty}^\infty y^2 e^{-y^2/2}\,dy = -y\,e^{-y^2/2}\Big|_{-\infty}^\infty + \int_{-\infty}^\infty e^{-y^2/2}\,dy = \int_{-\infty}^\infty e^{-y^2/2}\,dy

Hence, from (5.5.1)

\mathrm{Var}(X) = \sigma^2 \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-y^2/2}\,dy = \sigma^2

where the preceding used that \frac{1}{\sqrt{2\pi}} e^{-y^2/2} is the density function of a normal random variable with parameters μ = 0 and σ = 1, so its integral must equal 1.

Thus μ and σ 2 represent, respectively, the mean and variance of the normal distribution. A very important property of normal random variables is that if X is normal with mean μ and variance σ 2, then for any constants a and b, b ≠ 0, the random variable Y  = a  + bX is also a normal random variable with parameters

E[Y] = E[a + bX] = a + bE[X] = a + b\mu

and variance

\mathrm{Var}(Y) = \mathrm{Var}(a + bX) = b^2\,\mathrm{Var}(X) = b^2\sigma^2

To verify this, let FY (y) be the distribution function of Y. Then, for b > 0

F_Y(y) = P(Y \le y) = P(a + bX \le y) = P\!\left(X \le \frac{y-a}{b}\right) = F_X\!\left(\frac{y-a}{b}\right)

where FX is the distribution function of X. Similarly, if b < 0, then

F_Y(y) = P(a + bX \le y) = P\!\left(X \ge \frac{y-a}{b}\right) = 1 - F_X\!\left(\frac{y-a}{b}\right)

Differentiation yields that the density function of Y is

f_Y(y) = \begin{cases} \dfrac{1}{b} f_X\!\left(\dfrac{y-a}{b}\right), & \text{if } b > 0 \\[4pt] -\dfrac{1}{b} f_X\!\left(\dfrac{y-a}{b}\right), & \text{if } b < 0 \end{cases}

which can be written as

f_Y(y) = \frac{1}{|b|} f_X\!\left(\frac{y-a}{b}\right) = \frac{1}{\sqrt{2\pi}\,\sigma|b|} e^{-\left(\frac{y-a}{b}-\mu\right)^2/2\sigma^2} = \frac{1}{\sqrt{2\pi}\,\sigma|b|} e^{-(y-a-b\mu)^2/2b^2\sigma^2}

showing that Y = a + bX is normal with mean a + bμ and variance b²σ².

It follows from the foregoing that if X \sim N(\mu, \sigma^2), then

Z = \frac{X - \mu}{\sigma}

is a normal random variable with mean 0 and variance 1. Such a random variable Z is said to have a standard, or unit, normal distribution. Let Φ(·) denote its distribution function. That is,

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^x e^{-y^2/2}\,dy, \quad -\infty < x < \infty

This result that Z = (X - μ)/σ has a standard normal distribution when X is normal with parameters μ and σ² is quite important, for it enables us to write all probability statements about X in terms of probabilities for Z. For instance, to obtain P{X < b}, we note that X will be less than b if and only if (X - μ)/σ is less than (b - μ)/σ, and so

P\{X < b\} = P\!\left\{\frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right\} = \Phi\!\left(\frac{b-\mu}{\sigma}\right)

Similarly, for any a < b,

P\{a < X < b\} = P\!\left\{\frac{a-\mu}{\sigma} < \frac{X-\mu}{\sigma} < \frac{b-\mu}{\sigma}\right\} = P\!\left\{\frac{a-\mu}{\sigma} < Z < \frac{b-\mu}{\sigma}\right\} = P\!\left\{Z < \frac{b-\mu}{\sigma}\right\} - P\!\left\{Z < \frac{a-\mu}{\sigma}\right\} = \Phi\!\left(\frac{b-\mu}{\sigma}\right) - \Phi\!\left(\frac{a-\mu}{\sigma}\right)

It remains for us to compute Φ(x). This has been accomplished by an approximation and the results are presented in Table A1 of the Appendix, which tabulates Φ(x ) (to a 4-digit level of accuracy) for a wide range of nonnegative values of x. In addition, Program 5.5a of the text disk can be used to obtain Φ(x).

While Table A1 tabulates Φ(x) only for nonnegative values of x, we can also obtain Φ(–x) from the table by making use of the symmetry (about 0) of the standard normal probability density function. That is, for x > 0, if Z represents a standard normal random variable, then (see Figure 5.8)

Figure 5.8. Standard normal probabilities.

\Phi(-x) = P\{Z < -x\} = P\{Z > x\} \ \text{(by symmetry)} = 1 - \Phi(x)

Thus, for instance,

P\{Z < -1\} = \Phi(-1) = 1 - \Phi(1) = 1 - .8413 = .1587

Example 5.5a

If X is a normal random variable with mean μ = 3 and variance σ² = 16, find

(a)

P{X < 11};

(b)

P{X > –1};

(c)

P{2 < X < 7}.

Solution
(a)

P\{X < 11\} = P\!\left\{\frac{X-3}{4} < \frac{11-3}{4}\right\} = \Phi(2) = .9772

(b)

P\{X > -1\} = P\!\left\{\frac{X-3}{4} > \frac{-1-3}{4}\right\} = P\{Z > -1\} = P\{Z < 1\} = \Phi(1) = .8413

(c)

P\{2 < X < 7\} = P\!\left\{\frac{2-3}{4} < \frac{X-3}{4} < \frac{7-3}{4}\right\} = \Phi(1) - \Phi(-1/4) = \Phi(1) - [1 - \Phi(1/4)] = .8413 + .5987 - 1 = .4400
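Instead of Table A1, Φ can be evaluated via the error function, Φ(x) = [1 + erf(x/√2)]/2; a short sketch (using Python's math.erf rather than the text's Program 5.5a) reproduces Example 5.5a:

```python
import math

# Standard normal CDF in terms of the error function:
#   Phi(x) = (1 + erf(x / sqrt(2))) / 2
def Phi(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

mu, sigma = 3.0, 4.0  # X ~ N(3, 16) as in Example 5.5a

ans_a = Phi((11 - mu) / sigma)                        # (a) P{X < 11} = Phi(2)
ans_b = 1 - Phi((-1 - mu) / sigma)                    # (b) P{X > -1} = Phi(1)
ans_c = Phi((7 - mu) / sigma) - Phi((2 - mu) / sigma) # (c) P{2 < X < 7}
print(round(ans_a, 4), round(ans_b, 4), round(ans_c, 4))
```

The three values agree with the table-based answers .9772, .8413, and .4400.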

Example 5.5b

Suppose that a binary message — either "0" or "1" — must be transmitted by wire from location A to location B. However, the data sent over the wire are subject to a channel noise disturbance and so, to reduce the possibility of error, the value 2 is sent over the wire when the message is "1" and the value -2 is sent when the message is "0." If x, x = ±2, is the value sent at location A, then R, the value received at location B, is given by R = x + N, where N is the channel noise disturbance. When the message is received at location B, the receiver decodes it according to the following rule:

if R ≥ .5, then "1" is concluded
if R < .5, then "0" is concluded

Because the channel noise is often normally distributed, we will determine the error probabilities when N is a standard normal random variable.

There are two types of errors that can occur: One is that the message "1" can be incorrectly concluded to be "0" and the other that "0" is incorrectly concluded to be "1." The first type of error will occur if the message is "1" and 2   + N < .5, whereas the second will occur if the message is "0" and –2   + N > .5.

Hence,

P\{\text{error} \mid \text{message is "1"}\} = P\{N < -1.5\} = 1 - \Phi(1.5) = .0668

and

P\{\text{error} \mid \text{message is "0"}\} = P\{N > 2.5\} = 1 - \Phi(2.5) = .0062

Example 5.5c

The power W dissipated in a resistor is proportional to the square of the voltage V. That is,

W = rV^2

where r is a constant. If r  =   3, and V can be assumed (to a very good approximation) to be a normal random variable with mean 6 and standard deviation 1, find
(a)

E[W];

(b)

P{W > 120}.

Solution
(a)

E[W] = E[3V^2] = 3E[V^2] = 3\left(\mathrm{Var}(V) + (E[V])^2\right) = 3(1 + 36) = 111

(b)

P\{W > 120\} = P\{3V^2 > 120\} = P\{V^2 > 40\} = P\{V > \sqrt{40}\} = P\{V - 6 > \sqrt{40} - 6\} = P\{Z > .3246\} = 1 - \Phi(.3246) = .3727

(the probability that V < -\sqrt{40}, some 12 standard deviations below the mean, is negligible)

Let us now compute the moment generating function of a normal random variable. To start, we compute the moment generating function of a standard normal random variable Z.

E[e^{tZ}] = \int_{-\infty}^\infty e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-(x^2-2tx)/2}\,dx = e^{t^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-(x-t)^2/2}\,dx = e^{t^2/2} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty e^{-y^2/2}\,dy = e^{t^2/2}

Now, if Z is a standard normal, then X  = μ  + σZ is normal with mean μ and variance σ 2. Using the preceding, its moment generating function is

E[e^{tX}] = E[e^{t\mu + t\sigma Z}] = E[e^{t\mu} e^{t\sigma Z}] = e^{t\mu} E[e^{t\sigma Z}] = e^{t\mu} e^{(\sigma t)^2/2} = e^{\mu t + \sigma^2 t^2/2}

Another important result is that the sum of independent normal random variables is also a normal random variable. To see this, suppose that X_i, i = 1, …, n, are independent, with X_i being normal with mean \mu_i and variance \sigma_i^2. The moment generating function of \sum_{i=1}^n X_i is as follows.

E\!\left[e^{t\sum_{i=1}^n X_i}\right] = E\!\left[e^{tX_1} e^{tX_2} \cdots e^{tX_n}\right] = \prod_{i=1}^n E[e^{tX_i}] \ \text{(by independence)} = \prod_{i=1}^n e^{\mu_i t + \sigma_i^2 t^2/2} = e^{\mu t + \sigma^2 t^2/2}

where

\mu = \sum_{i=1}^n \mu_i, \qquad \sigma^2 = \sum_{i=1}^n \sigma_i^2

Therefore, \sum_{i=1}^n X_i has the same moment generating function as a normal random variable having mean μ and variance σ². Hence, from the one-to-one correspondence between moment generating functions and distributions, we can conclude that \sum_{i=1}^n X_i is normal with mean \sum_{i=1}^n \mu_i and variance \sum_{i=1}^n \sigma_i^2.

Example 5.5d

Data from the National Oceanic and Atmospheric Administration indicate that the yearly precipitation in Los Angeles is a normal random variable with a mean of 12.08 inches and a standard deviation of 3.1 inches.

(a)

Find the probability that the total precipitation during the next 2 years will exceed 25 inches.

(b)

Find the probability that next year's precipitation will exceed that of the following year by more than 3 inches.

Assume that the precipitation totals for the next 2 years are independent.

Solution

Let X 1 and X 2 be the precipitation totals for the next 2 years.

(a)

Since X_1 + X_2 is normal with mean 24.16 and variance 2(3.1)² = 19.22, it follows that

P\{X_1 + X_2 > 25\} = P\!\left\{\frac{X_1 + X_2 - 24.16}{\sqrt{19.22}} > \frac{25 - 24.16}{\sqrt{19.22}}\right\} = P\{Z > .1916\} \approx .4240

(b)

Since -X_2 is a normal random variable with mean -12.08 and variance (-1)²(3.1)², it follows that X_1 - X_2 is normal with mean 0 and variance 19.22.

P\{X_1 > X_2 + 3\} = P\{X_1 - X_2 > 3\} = P\!\left\{\frac{X_1 - X_2}{\sqrt{19.22}} > \frac{3}{\sqrt{19.22}}\right\} = P\{Z > .6843\} \approx .2469

Thus there is a 42.4 percent chance that the total precipitation in Los Angeles during the next 2 years will exceed 25 inches, and there is a 24.69 percent chance that next year's precipitation will exceed that of the following year by more than 3 inches.■
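Example 5.5d can be checked without tables using Python's statistics.NormalDist (a convenience not mentioned in the text), applying the same sum and difference distributions derived above:

```python
from statistics import NormalDist
import math

mu, sigma = 12.08, 3.1  # yearly LA precipitation ~ N(12.08, 3.1^2)

# (a) X1 + X2 ~ N(2*mu, 2*sigma^2)
total = NormalDist(2 * mu, math.sqrt(2) * sigma)
p_a = 1 - total.cdf(25)

# (b) X1 - X2 ~ N(0, 2*sigma^2)
diff = NormalDist(0.0, math.sqrt(2) * sigma)
p_b = 1 - diff.cdf(3)

print(round(p_a, 4), round(p_b, 4))
```

Both values match the table-based answers .4240 and .2469 to the displayed precision.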

For α ∈ (0, 1), let zα be such that

P\{Z > z_\alpha\} = 1 - \Phi(z_\alpha) = \alpha

That is, the probability that a standard normal random variable is greater than zα is equal to α (see Figure 5.9).

Figure 5.9. P{Z > z_\alpha} = α.

The value of zα can, for any α, be obtained from Table A1. For instance, since

1 - \Phi(1.645) = .05, \quad 1 - \Phi(1.96) = .025, \quad 1 - \Phi(2.33) = .01

it follows that

z_{.05} = 1.645, \quad z_{.025} = 1.96, \quad z_{.01} = 2.33

Program 5.5b on the text disk can also be used to obtain the value of zα .

Since

P\{Z < z_\alpha\} = 1 - \alpha

it follows that 100(1 – α) percent of the time a standard normal random variable will be less than zα . As a result, we call zα the 100(1 – α) percentile of the standard normal distribution.
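In place of Table A1 or Program 5.5b, z_α can be computed directly from the inverse standard normal CDF; a sketch using Python's statistics.NormalDist (an assumption of this note, not the text's software):

```python
from statistics import NormalDist

# z_alpha satisfies P{Z > z_alpha} = alpha, i.e. z_alpha = Phi^{-1}(1 - alpha).
std = NormalDist()  # standard normal: mean 0, sigma 1
z = {alpha: std.inv_cdf(1 - alpha) for alpha in (0.05, 0.025, 0.01)}
for alpha, value in z.items():
    print(alpha, round(value, 3))
```

This gives 1.645, 1.960, and 2.326; the table value 2.33 for z_{.01} is the two-decimal rounding of 2.326.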

URL: https://www.sciencedirect.com/science/article/pii/B9780123948113500058

Conditional Probability and Conditional Expectation

Mark A. Pinsky , Samuel Karlin , in An Introduction to Stochastic Modeling (Fourth Edition), 2011

2.5.3 The Maximal Inequality for Nonnegative Martingales

Because a martingale has constant mean, Markov's inequality applied to a nonnegative martingale immediately yields

\Pr\{X_n \ge \lambda\} \le \frac{E[X_0]}{\lambda}, \quad \lambda > 0.

We will extend the reasoning behind Markov's inequality to achieve an inequality of far greater power:

(2.49) \quad \Pr\left\{\max_{0 \le n \le m} X_n \ge \lambda\right\} \le \frac{E[X_0]}{\lambda}.

Instead of limiting the probability of a large value for a single observation X n , the maximal inequality (2.49) limits the probability of observing a large value anywhere in the time interval 0, …, m, and since the right side of (2.49) does not depend on the length of the interval, the maximal inequality limits the probability of observing a large value at any time in the infinite future of the martingale!

In order to prove the maximal inequality for nonnegative martingales, we need but a single additional fact: If X and Y are jointly distributed random variables and B is an arbitrary set, then

(2.50) \quad E[X\,\mathbf{1}_B(Y)] = E[E(X \mid Y)\,\mathbf{1}_B(Y)]

But (2.50) follows from the conditional expectation property (2.12), E[g(X)h(Y)] = E{h(Y)E[g(X) | Y]}, with g(x) = x and h(y) = \mathbf{1}\{y \in B\}. We will have need of (2.50) with X = X_m and Y = (X_0, …, X_n), whereupon (2.50) followed by (2.48) then justifies

(2.51) \quad E[X_m \mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}] = E[E\{X_m \mid X_0, \ldots, X_n\}\,\mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}] = E[X_n \mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}].

Theorem 2.1.

Let X_0, X_1, … be a martingale with nonnegative values; i.e., Pr{X_n ≥ 0} = 1 for n = 0, 1, …. For any λ > 0,

(2.52) \quad \Pr\left\{\max_{0 \le n \le m} X_n \ge \lambda\right\} \le \frac{E[X_0]}{\lambda}

and

(2.53) \quad \Pr\left\{\max_{n \ge 0} X_n \ge \lambda\right\} \le \frac{E[X_0]}{\lambda}.

Proof. Inequality (2.53) follows from (2.52) because the right side of (2.52) does not depend on m. We begin with the law of total probability, as in Chapter 1, Section 1.2.1. Either the {X_0, …, X_m} sequence rises above λ for the first time at some index n, or else it remains always below λ. As these possibilities are mutually exclusive and exhaustive, we apply the law of total probability to obtain

E[X_m] = \sum_{n=0}^m E[X_m \mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}] + E[X_m \mathbf{1}\{X_0 < \lambda, \ldots, X_m < \lambda\}]
\ge \sum_{n=0}^m E[X_m \mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}] \quad (X_m \ge 0)
= \sum_{n=0}^m E[X_n \mathbf{1}\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\}] \quad \text{by (2.51)}
\ge \lambda \sum_{n=0}^m \Pr\{X_0 < \lambda, \ldots, X_{n-1} < \lambda, X_n \ge \lambda\} = \lambda \Pr\left\{\max_{0 \le n \le m} X_n \ge \lambda\right\}.

Since the martingale has constant mean, E[X_m] = E[X_0], and (2.52) follows.

Example A gambler begins with a unit amount of money and faces a series of independent fair games. Beginning with X_0 = 1, the gambler bets the amount p, 0 < p < 1. If the first game is a win, which occurs with probability 1/2, the gambler's fortune is X_1 = 1 + pX_0 = 1 + p. If the first game is a loss, then X_1 = 1 − pX_0 = 1 − p. After the nth play and with a current fortune of X_n, the gambler wagers pX_n, and

X_{n+1} = \begin{cases} (1+p)X_n & \text{with probability } 1/2 \\ (1-p)X_n & \text{with probability } 1/2. \end{cases}

Then, {X_n} is a nonnegative martingale, and the maximal inequality (2.52) with λ = 2, e.g., asserts that the probability that the gambler ever doubles his money is less than or equal to 1/2, and this holds no matter what the game is, as long as it is fair, and no matter what fraction p of his fortune is wagered at each play. Indeed, the fraction wagered may vary from play to play, as long as it is chosen without knowledge of the next outcome.
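A Monte Carlo sketch of this gambler (an illustration of the bound, not part of the text; the wager fraction p = 0.25, horizon, and run count are arbitrary choices):

```python
import random

# Fair-game martingale: X_0 = 1 and X_{n+1} = (1 + p)X_n or (1 - p)X_n,
# each with probability 1/2. The maximal inequality with lambda = 2 says
# the probability of ever doubling is at most 1/2, whatever p is.
random.seed(1)
p, horizon, runs = 0.25, 1000, 5000
doubled = 0
for _ in range(runs):
    x = 1.0
    for _ in range(horizon):
        x *= (1 + p) if random.random() < 0.5 else (1 - p)
        if x >= 2.0:
            doubled += 1
            break
freq = doubled / runs
print(freq)
```

The empirical doubling frequency lands below 1/2, as the maximal inequality requires; the gap from 1/2 reflects the "overshoot" of the fortune past the level 2 at the moment of doubling.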

As amply demonstrated by this example, the maximal inequality is a very strong statement. Indeed, more elaborate arguments based on the maximal and other related martingale inequalities are used to show that a nonnegative martingale converges: If {X_n} is a nonnegative martingale, then there exists a random variable, let us call it X, for which \lim_{n \to \infty} X_n = X. We cannot guarantee the equality of the expectations in the limit, but the inequality E[X_0] ≥ E[X] ≥ 0 can be established.

Example In Chapter 3, Section 3.8, we will introduce the branching process model for population growth. In this model, X_n is the number of individuals in the population in the nth generation, and μ > 0 is the mean family size or expected number of offspring of any single individual. The mean population size in the nth generation is X_0 μ^n. In this branching process model, X_n/μ^n is a nonnegative martingale (see Chapter 3, Problem 3.8.4), and the maximal inequality implies that the probability of the actual population ever exceeding 10 times the mean size is less than or equal to 1/10. The nonnegative martingale convergence theorem asserts that the evolution of such a population after many generations may be described by a single random variable X in the form

X_n \approx X \mu^n, \quad \text{for large } n

Example How NOT to generate a uniformly distributed random variable An urn initially contains one red and one green ball. A ball is drawn at random and it is returned to the urn, together with another ball of the same color. This process is repeated indefinitely. After the nth play, there will be a total of n + 2 balls in the urn. Let Rn be the number of these balls that are red, and Xn = R n /(n + 2) the fraction of red balls. We claim that {X n } is a martingale. First, observe that

R_{n+1} = \begin{cases} R_n + 1 & \text{with probability } X_n \\ R_n & \text{with probability } 1 - X_n \end{cases}

so that

E[R_{n+1} \mid X_n] = R_n + X_n = (n+2)X_n + X_n = (n+3)X_n,

and finally,

E[X_{n+1} \mid X_n] = \frac{1}{n+3} E[R_{n+1} \mid X_n] = \frac{n+3}{n+3} X_n = X_n.

This verifies the martingale property, and because such a fraction is always nonnegative, indeed, between 0 and 1, there must be a random variable X to which the martingale converges. We will derive the probability distribution of the random limit. It is immediate that R 1 is equally likely to be 1 or 2, since the first ball chosen is equally likely to be red or green. Continuing,

\Pr\{R_2 = 3\} = \Pr\{R_2 = 3 \mid R_1 = 2\}\Pr\{R_1 = 2\} = \left(\tfrac{2}{3}\right)\left(\tfrac{1}{2}\right) = \tfrac{1}{3};
\Pr\{R_2 = 2\} = \Pr\{R_2 = 2 \mid R_1 = 1\}\Pr\{R_1 = 1\} + \Pr\{R_2 = 2 \mid R_1 = 2\}\Pr\{R_1 = 2\} = \left(\tfrac{1}{3}\right)\left(\tfrac{1}{2}\right) + \left(\tfrac{1}{3}\right)\left(\tfrac{1}{2}\right) = \tfrac{1}{3};

and since the probabilities must sum to 1,

\Pr\{R_2 = 1\} = \tfrac{1}{3}.

By repeating these simple calculations, it is easy to see that

\Pr\{R_n = k\} = \frac{1}{n+1} \quad \text{for } k = 1, 2, \ldots, n+1,

and that, therefore, Xn is uniformly distributed over the values 1/(n + 2), 2/(n + 2), …, (n + 1)/(n + 2). This uniform distribution must prevail in the limit, which leads to

\Pr\{X \le x\} = x \quad \text{for } 0 < x < 1.

Think about this remarkable result for a minute! If you sit down in front of such an urn and play this game, eventually the fraction of red balls in your urn will stabilize in the near vicinity of some value, call it U. If I play the game, the fraction of red balls in my urn will stabilize also, but at another value, U ′. Anyone who plays the game will find the fraction of red balls in the urn tending toward some limit, but everyone will experience a different limit. In fact, each play of the game generates a fresh, uniformly distributed random variable, in the limit. Of course, there may be faster and simpler ways to generate uniformly distributed random variables.
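A simulation sketch of this urn (an illustration of the limit law, not part of the text; the step and run counts are arbitrary choices):

```python
import random

# Polya urn: start with one red and one green ball; draw a ball at
# random and return it together with another ball of the same colour.
# The limiting fraction of red balls should be roughly Uniform(0, 1).
random.seed(7)

def limit_fraction(steps=400):
    red, total = 1, 2
    for _ in range(steps):
        if random.random() < red / total:
            red += 1   # drew red: add another red ball
        total += 1     # one ball added either way
    return red / total

limits = [limit_fraction() for _ in range(4000)]
mean = sum(limits) / len(limits)
below_half = sum(x < 0.5 for x in limits) / len(limits)
print(round(mean, 3), round(below_half, 3))
```

Each run "stabilizes" at its own fraction; across runs the sample mean is near 1/2 and about half the limits fall below 1/2, consistent with a uniform limiting distribution.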

Martingale implications include many more inequalities and convergence theorems. As briefly mentioned at the start, there are so-called systems theorems that delimit the conditions under which a gambling system, such as doubling the bets until a win is secured, can turn a fair game into a winning game. A deeper discussion of martingale theory would take us well beyond the scope of this introductory text, and our aim must be limited to building an enthusiasm for further study. Nevertheless, a large variety of important martingales will be introduced in the Problems at the end of each section in the remainder of the book.

URL: https://www.sciencedirect.com/science/article/pii/B9780123814166000022

Other Generalised Functions

R.F. Hoskins Research Professor , in Delta Functions (Second Edition), 2011

7.1.4 Fractional differentiation

We have achieved a satisfactory definition of an integration operator I^λ which makes sense for all nonnegative values of λ. The next step is to introduce a corresponding definition of a derivative operator of non-integer order. This will effectively extend the meaning of I^λ to negative values of λ. In what follows we shall assume that all input signals x(t) are sufficiently well behaved to allow the necessary integrations and differentiations to be carried out; that is, we assume not only that each x(t) vanishes outside some finite interval but also that it may be differentiated as often as we wish.

To begin with, let α be any real number such that 0 ≤ α < 1. Then I^{1-α} is certainly well-defined and so we can define an operator D^α which represents the cascade connection of the fractional integrator represented by I^{1-α} and an ideal differentiator:

(7.14) \quad x(t) \to y(t) \equiv D^\alpha x(t) = \frac{d}{dt} I^{1-\alpha} x(t) = \frac{1}{\Gamma(1-\alpha)} \frac{d}{dt} \int_{-\infty}^t x(\tau)(t-\tau)^{-\alpha}\,d\tau.

For α = 0 this reduces to

x(t) \to D^0 x(t) = \frac{d}{dt} \int_{-\infty}^t x(\tau)\,d\tau = x(t)

so that the impulse response is δ(t) and D^0 is identical with I^0. Further, since we have the equivalent representation

D^\alpha x(t) = \frac{1}{\Gamma(1-\alpha)} \int_{0+}^\infty \tau^{-\alpha} x'(t-\tau)\,d\tau = I^{1-\alpha} x'(t),

it follows at once that

I^\alpha D^\alpha x(t) = I^\alpha I^{1-\alpha} x'(t) = I^1 x'(t) = x(t).

Hence, for 0 ≤ α < 1, D^α is the inverse of the fractional integration operator I^α. Finally, for an arbitrary λ > 0 we can always write

\lambda = n - 1 + \alpha \quad \text{where } n \in \mathbb{N} \text{ and } 0 \le \alpha < 1,

and to complete the definition, we need only to set

(7.15) \quad D^\lambda x(t) \equiv \frac{1}{\Gamma(1-\alpha)} \frac{d^n}{dt^n} \int_{-\infty}^t x(\tau)(t-\tau)^{-\alpha}\,d\tau

The operator D^λ represents a (causal) time-invariant linear system which consists of a cascade combination of a fractional integrator of order 1 − α and n ideal differentiators. It should therefore have the transfer function H(s) = s^{n-1+\alpha} = s^\lambda. However, for any value of n > 1, there will exist no (ordinary) function h(t) of which H(s) is the Laplace transform, and so we cannot express (7.11) in the form of a classical convolution integral. If α = 0 then D^λ becomes D^{n-1} and the problem is solved, at least formally, by the introduction of the generalised function δ^{(n-1)}(t). To obtain a satisfactory development of a comprehensive fractional calculus we therefore need to define other types of generalised functions.
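Fractional derivatives of this kind can be approximated numerically; the sketch below uses the Grünwald-Letnikov scheme (a standard discretization, not the operator construction in this text) and checks it against the known half-derivative of x(t) = t, namely 2√(t/π).

```python
import math

# Grunwald-Letnikov approximation of a fractional derivative of order
# alpha for a causal signal (x(t) = 0 for t < 0):
#   D^alpha x(t) ~ h^(-alpha) * sum_j w_j * x(t - j*h),
# with weights w_0 = 1 and w_j = w_{j-1} * (j - 1 - alpha) / j.
def gl_derivative(x, t, alpha, h=1e-3):
    n = int(t / h)
    w, total = 1.0, x(t)
    for j in range(1, n + 1):
        w *= (j - 1 - alpha) / j
        total += w * x(t - j * h)
    return total / h ** alpha

# Known result: the half-derivative of x(t) = t is 2*sqrt(t/pi).
approx = gl_derivative(lambda t: t, t=1.0, alpha=0.5)
exact = 2.0 * math.sqrt(1.0 / math.pi)
print(approx, exact)
```

The first-order scheme converges as the step h shrinks; with h = 10^-3 the approximation agrees with 2/√π ≈ 1.1284 to a few decimal places.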

URL: https://www.sciencedirect.com/science/article/pii/B9781904275398500078

Elements of Probability

Sheldon Ross , in Simulation (Fifth Edition), 2013

2.7 Chebyshev's Inequality and the Laws of Large Numbers

We start with a result known as Markov's inequality.

Proposition 3 Markov's Inequality

If X takes on only nonnegative values, then for any value a > 0

P\{X \ge a\} \le \frac{E[X]}{a}

Proof 3

Define the random variable Y by

Y = \begin{cases} a, & \text{if } X \ge a \\ 0, & \text{if } X < a \end{cases}

Because X ≥ 0, it easily follows that

X \ge Y

Taking expectations of the preceding inequality yields

E[X] \ge E[Y] = a P\{X \ge a\}

and the result is proved.

As a corollary we have Chebyshev's inequality, which states that the probability that a random variable differs from its mean by more than k of its standard deviations is bounded by 1 / k 2 , where the standard deviation of a random variable is defined to be the square root of its variance.

Corollary 2 Chebyshev's Inequality

If X is a random variable having mean μ and variance σ², then for any value k > 0,

P\{|X - \mu| \ge k\sigma\} \le \frac{1}{k^2}

Proof 4

Since (X - \mu)^2/\sigma^2 is a nonnegative random variable whose mean is

E\!\left[\frac{(X-\mu)^2}{\sigma^2}\right] = \frac{E[(X-\mu)^2]}{\sigma^2} = 1

we obtain from Markov's inequality that

P\!\left\{\frac{(X-\mu)^2}{\sigma^2} \ge k^2\right\} \le \frac{1}{k^2}

The result now follows since the inequality (X - \mu)^2/\sigma^2 \ge k^2 is equivalent to the inequality |X - \mu| \ge k\sigma.

We now use Chebyshev's inequality to prove the weak law of large numbers, which states that the probability that the average of the first n terms of a sequence of independent and identically distributed random variables differs from its mean by more than ϵ goes to 0 as n goes to infinity.

Theorem 1 The Weak Law of Large Numbers

Let X_1, X_2, … be a sequence of independent and identically distributed random variables having mean μ. Then, for any ϵ > 0,

P\!\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| > \epsilon\right\} \to 0 \quad \text{as } n \to \infty

Proof 5

We give a proof under the additional assumption that the random variables X i have a finite variance σ 2 . Now

E\!\left[\frac{X_1 + \cdots + X_n}{n}\right] = \frac{1}{n}\left(E[X_1] + \cdots + E[X_n]\right) = \mu

and

\mathrm{Var}\!\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{1}{n^2}\left[\mathrm{Var}(X_1) + \cdots + \mathrm{Var}(X_n)\right] = \frac{\sigma^2}{n}

where the above equation makes use of the fact that the variance of the sum of independent random variables is equal to the sum of their variances. Hence, from Chebyshev's inequality, it follows that for any positive k

P\!\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| \ge \frac{k\sigma}{\sqrt{n}}\right\} \le \frac{1}{k^2}

Hence, for any ϵ > 0, by letting k be such that k\sigma/\sqrt{n} = \epsilon, that is, by letting k^2 = n\epsilon^2/\sigma^2, we see that

P\!\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| \ge \epsilon\right\} \le \frac{\sigma^2}{n\epsilon^2}

which establishes the result.

A generalization of the weak law is the strong law of large numbers, which states that, with probability 1,

\lim_{n \to \infty} \frac{X_1 + \cdots + X_n}{n} = \mu

That is, with probability 1, the long-run average of a sequence of independent and identically distributed random variables converges to its mean.
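The convergence asserted by both laws is easy to see empirically. The sketch below (our illustration, using a fair-coin sequence with μ = 0.5 and a fixed seed) tracks the sample average as n grows:

```python
import random

def running_average(n, seed=42):
    """Average of n fair-coin indicator variables (1 = heads)."""
    rng = random.Random(seed)
    total = sum(rng.random() < 0.5 for _ in range(n))
    return total / n

# As n grows, the sample average settles near mu = 0.5,
# as the law of large numbers predicts.
for n in (100, 10_000, 1_000_000):
    print(n, running_average(n))
```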

URL: https://www.sciencedirect.com/science/article/pii/B9780124158252000024

Random Variables and Expectation

Sheldon M. Ross , in Introduction to Probability and Statistics for Engineers and Scientists (Fifth Edition), 2014

4.9 Chebyshev's Inequality and the Weak Law of Large Numbers

We start this section by proving a result known as Markov's inequality.

Proposition 4.9.1 Markov's Inequality

If X is a random variable that takes only nonnegative values, then for any value a > 0

P\{X \ge a\} \le \frac{E[X]}{a}

Proof

We give a proof for the case where X is continuous with density f.

\begin{aligned}
E[X] &= \int_0^\infty x f(x)\,dx \\
&= \int_0^a x f(x)\,dx + \int_a^\infty x f(x)\,dx \\
&\ge \int_a^\infty x f(x)\,dx \\
&\ge \int_a^\infty a f(x)\,dx \\
&= a \int_a^\infty f(x)\,dx \\
&= a\,P\{X \ge a\}
\end{aligned}

and the result is proved.■

As a corollary, we obtain Proposition 4.9.2.

Proposition 4.9.2 Chebyshev's Inequality

If X is a random variable with mean μ and variance σ 2, then for any value k > 0

P\{|X - \mu| \ge k\} \le \frac{\sigma^2}{k^2}

Proof

Since (X - \mu)^2 is a nonnegative random variable, we can apply Markov's inequality (with a = k^2) to obtain

(4.9.1) \quad P\{(X - \mu)^2 \ge k^2\} \le \frac{E[(X - \mu)^2]}{k^2}

But since (X - \mu)^2 \ge k^2 if and only if |X - \mu| \ge k, Equation 4.9.1 is equivalent to

P\{|X - \mu| \ge k\} \le \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}

and the proof is complete.■

The importance of Markov's and Chebyshev's inequalities is that they enable us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known. Of course, if the actual distribution were known, then the desired probabilities could be exactly computed and we would not need to resort to bounds.
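To see how conservative such bounds can be, consider an exponential random variable with mean 1, whose tail P{X ≥ a} = e^{−a} is known exactly; the comparison below is our illustration, with helper names of our choosing:

```python
import math

# X ~ Exponential(1): E[X] = 1 and P{X >= a} = exp(-a) exactly.
def markov_bound(mean, a):
    """Markov's upper bound on P{X >= a} for a nonnegative X."""
    return mean / a

for a in (2.0, 5.0, 10.0):
    exact = math.exp(-a)
    bound = markov_bound(1.0, a)
    print(f"a={a}: exact={exact:.5f}, Markov bound={bound:.2f}")
```

The bound is valid but loose, exactly because it uses nothing about the distribution beyond its mean.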

Example 4.9a

Suppose that it is known that the number of items produced in a factory during a week is a random variable with mean 50.

(a)

What can be said about the probability that this week's production will exceed 75?

(b)

If the variance of a week's production is known to equal 25, then what can be said about the probability that this week's production will be between 40 and 60?

Solution

Let X be the number of items that will be produced in a week:

(a)

By Markov's inequality

P\{X > 75\} \le \frac{E[X]}{75} = \frac{50}{75} = \frac{2}{3}

(b)

By Chebyshev's inequality

P\{|X - 50| \ge 10\} \le \frac{\sigma^2}{10^2} = \frac{25}{100} = \frac{1}{4}

Hence

P\{|X - 50| < 10\} \ge 1 - \frac{1}{4} = \frac{3}{4}

and so the probability that this week's production will be between 40 and 60 is at least .75.■
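The two bounds of Example 4.9a can be reproduced in a few lines (our sketch; variable names are illustrative):

```python
# Known quantities from Example 4.9a
mean, var = 50, 25

# (a) Markov's inequality: P{X > 75} <= E[X]/75
bound_a = mean / 75

# (b) Chebyshev's inequality with k = 10:
# P{|X - 50| >= 10} <= var/10^2, hence
# P{40 < X < 60} >= 1 - var/10^2
bound_b = 1 - var / 10**2

print(bound_a, bound_b)
```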

By replacing k by k\sigma in Equation 4.9.1, we can write Chebyshev's inequality as

P\{|X - \mu| \ge k\sigma\} \le 1/k^2

Thus it states that the probability that a random variable differs from its mean by at least k standard deviations is bounded by 1/k².

We will end this section by using Chebyshev's inequality to prove the weak law of large numbers, which states that the probability that the average of the first n terms in a sequence of independent and identically distributed random variables differs from its mean by more than ε goes to 0 as n goes to infinity.

Theorem 4.9.3 The Weak Law of Large Numbers

Let X_1, X_2, \ldots be a sequence of independent and identically distributed random variables, each having mean E[X_i] = \mu. Then, for any \epsilon > 0,

P\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| > \epsilon\right\} \to 0 \quad \text{as } n \to \infty

Proof

We shall prove the result only under the additional assumption that the random variables have a finite variance σ 2. Now, as

E\left[\frac{X_1 + \cdots + X_n}{n}\right] = \mu \quad \text{and} \quad \mathrm{Var}\left(\frac{X_1 + \cdots + X_n}{n}\right) = \frac{\sigma^2}{n}

it follows from Chebyshev's inequality that

P\left\{\left|\frac{X_1 + \cdots + X_n}{n} - \mu\right| > \epsilon\right\} \le \frac{\sigma^2}{n\epsilon^2}

and the result is proved.■

For an application of the above, suppose that a sequence of independent trials is performed. Let E be a fixed event and denote by P(E) the probability that E occurs on a given trial. Letting

X_i = \begin{cases} 1 & \text{if } E \text{ occurs on trial } i \\ 0 & \text{if } E \text{ does not occur on trial } i \end{cases}

it follows that X_1 + X_2 + \cdots + X_n represents the number of times that E occurs in the first n trials. Because E[X_i] = P(E), it thus follows from the weak law of large numbers that for any positive number ε, no matter how small, the probability that the proportion of the first n trials in which E occurs differs from P(E) by more than ε goes to 0 as n increases.
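A short simulation (ours) of this frequency interpretation, taking an event with probability P(E) = 0.3 and a fixed seed as an illustrative example:

```python
import random

def empirical_proportion(p, n, seed=1):
    """Fraction of n independent trials on which an event of probability p occurs."""
    rng = random.Random(seed)
    return sum(rng.random() < p for _ in range(n)) / n

# The proportion of trials on which E occurs approaches P(E) = 0.3.
for n in (100, 10_000, 1_000_000):
    print(n, empirical_proportion(0.3, n))
```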

URL: https://www.sciencedirect.com/science/article/pii/B9780123948113500046

Random Variables

Sheldon Ross , in Introduction to Probability Models (Eleventh Edition), 2014

2.8 Limit Theorems

We start this section by proving a result known as Markov's inequality.

Proposition 2.6 Markov's Inequality

If X is a random variable that takes only nonnegative values, then for any value a > 0

P { X a } E [ X ] a

Proof

We give a proof for the case where X is continuous with density f .

\begin{aligned}
E[X] &= \int_0^\infty x f(x)\,dx \\
&= \int_0^a x f(x)\,dx + \int_a^\infty x f(x)\,dx \\
&\ge \int_a^\infty x f(x)\,dx \\
&\ge \int_a^\infty a f(x)\,dx \\
&= a \int_a^\infty f(x)\,dx \\
&= a\,P\{X \ge a\}
\end{aligned}

and the result is proven.

As a corollary, we obtain the following.

Proposition 2.7 Chebyshev's Inequality

If X is a random variable with mean μ and variance σ 2 , then, for any value k > 0 ,

P\{|X - \mu| \ge k\} \le \frac{\sigma^2}{k^2}

Proof

Since ( X - μ ) 2 is a nonnegative random variable, we can apply Markov's inequality (with a = k 2 ) to obtain

P\{(X - \mu)^2 \ge k^2\} \le \frac{E[(X - \mu)^2]}{k^2}

But since (X - \mu)^2 \ge k^2 if and only if |X - \mu| \ge k, the preceding is equivalent to

P\{|X - \mu| \ge k\} \le \frac{E[(X - \mu)^2]}{k^2} = \frac{\sigma^2}{k^2}

and the proof is complete.

The importance of Markov's and Chebyshev's inequalities is that they enable us to derive bounds on probabilities when only the mean, or both the mean and the variance, of the probability distribution are known. Of course, if the actual distribution were known, then the desired probabilities could be exactly computed, and we would not need to resort to bounds.

Example 2.49

Suppose we know that the number of items produced in a factory during a week is a random variable with mean 500.

(a)

What can be said about the probability that this week's production will be at least 1000?

(b)

If the variance of a week's production is known to equal 100, then what can be said about the probability that this week's production will be between 400 and 600?

Solution:  Let X be the number of items that will be produced in a week.

(a)

By Markov's inequality,

P\{X \ge 1000\} \le \frac{E[X]}{1000} = \frac{500}{1000} = \frac{1}{2}

(b)

By Chebyshev's inequality,

P\{|X - 500| \ge 100\} \le \frac{\sigma^2}{(100)^2} = \frac{100}{10000} = \frac{1}{100}

Hence,

P\{|X - 500| < 100\} \ge 1 - \frac{1}{100} = \frac{99}{100}

and so the probability that this week's production will be between 400 and 600 is at least 0.99.  
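The arithmetic of Example 2.49 can be checked directly (our sketch; variable names are illustrative):

```python
# Known quantities from Example 2.49
mean, var = 500, 100

# (a) Markov: P{X >= 1000} <= E[X]/1000
bound_a = mean / 1000

# (b) Chebyshev with k = 100: P{|X - 500| >= 100} <= var/100^2,
# so P{400 < X < 600} >= 1 - var/100^2
bound_b = 1 - var / 100**2

print(bound_a, bound_b)
```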

The following theorem, known as the strong law of large numbers, is probably the most well-known result in probability theory. It states that the average of a sequence of independent random variables having the same distribution will, with probability 1, converge to the mean of that distribution.

Theorem 2.1 Strong Law of Large Numbers

Let X_1, X_2, \ldots be a sequence of independent random variables having a common distribution, and let E[X_i] = \mu. Then, with probability 1,

\frac{X_1 + X_2 + \cdots + X_n}{n} \to \mu \quad \text{as } n \to \infty

As an example of the preceding, suppose that a sequence of independent trials is performed. Let E be a fixed event and denote by P ( E ) the probability that E occurs on any particular trial. Letting

X_i = \begin{cases} 1, & \text{if } E \text{ occurs on the } i\text{th trial} \\ 0, & \text{if } E \text{ does not occur on the } i\text{th trial} \end{cases}

we have by the strong law of large numbers that, with probability 1,

(2.25) \quad \frac{X_1 + \cdots + X_n}{n} \to E[X] = P(E)

Since X_1 + \cdots + X_n represents the number of times that the event E occurs in the first n trials, we may interpret Equation (2.25) as stating that, with probability 1, the limiting proportion of time that the event E occurs is just P(E).

Running neck and neck with the strong law of large numbers for the honor of being probability theory's number one result is the central limit theorem. Besides its theoretical interest and importance, this theorem provides a simple method for computing approximate probabilities for sums of independent random variables. It also explains the remarkable fact that the empirical frequencies of so many natural "populations" exhibit a bell-shaped (that is, normal) curve.

Theorem 2.2 Central Limit Theorem

Let X_1, X_2, \ldots be a sequence of independent, identically distributed random variables, each with mean \mu and variance \sigma^2. Then the distribution of

\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}}

tends to the standard normal as n \to \infty. That is,

P\left\{\frac{X_1 + X_2 + \cdots + X_n - n\mu}{\sigma\sqrt{n}} \le a\right\} \to \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{a} e^{-x^2/2}\,dx

as n \to \infty.

Note that like the other results of this section, this theorem holds for any distribution of the X i s; herein lies its power.

If X is binomially distributed with parameters n and p , then X has the same distribution as the sum of n independent Bernoulli random variables, each with parameter p . (Recall that the Bernoulli random variable is just a binomial random variable whose parameter n equals 1.) Hence, the distribution of

\frac{X - E[X]}{\sqrt{\mathrm{Var}(X)}} = \frac{X - np}{\sqrt{np(1-p)}}

approaches the standard normal distribution as n approaches \infty. The normal approximation will, in general, be quite good for values of n satisfying np(1-p) \ge 10.
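A sketch (ours) of how good the approximation is right at the edge of this rule of thumb: it compares the exact binomial CDF with a continuity-corrected normal CDF for n = 40, p = 1/2, where np(1−p) = 10; the function names are of our choosing.

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def binom_cdf(k, n, p):
    """Exact P{X <= k} for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k + 1))

def normal_approx_cdf(k, n, p):
    """Continuity-corrected normal approximation to P{X <= k}."""
    mu, sigma = n * p, sqrt(n * p * (1 - p))
    return phi((k + 0.5 - mu) / sigma)

n, p = 40, 0.5          # n*p*(1-p) = 10, the edge of the rule of thumb
print(binom_cdf(22, n, p), normal_approx_cdf(22, n, p))
```

Even at this borderline case, the two values agree to about two decimal places.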

Example 2.50 Normal Approximation to the Binomial

Let X be the number of times that a fair coin, flipped 40 times, lands heads. Find the probability that X = 20 . Use the normal approximation and then compare it to the exact solution.

Solution:  Since the binomial is a discrete random variable, and the normal a continuous random variable, it leads to a better approximation to write the desired probability as

\begin{aligned}
P\{X = 20\} &= P\{19.5 < X < 20.5\} \\
&= P\left\{\frac{19.5 - 20}{\sqrt{10}} < \frac{X - 20}{\sqrt{10}} < \frac{20.5 - 20}{\sqrt{10}}\right\} \\
&= P\left\{-0.16 < \frac{X - 20}{\sqrt{10}} < 0.16\right\} \\
&\approx \Phi(0.16) - \Phi(-0.16)
\end{aligned}

where \Phi(x), the probability that the standard normal is less than x, is given by

\Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} e^{-y^2/2}\,dy

By the symmetry of the standard normal distribution

\Phi(-0.16) = P\{N(0,1) > 0.16\} = 1 - \Phi(0.16)

where N ( 0 , 1 ) is a standard normal random variable. Hence, the desired probability is approximated by

P\{X = 20\} \approx 2\Phi(0.16) - 1

Using Table 2.3, we obtain

P\{X = 20\} \approx 0.1272

The exact result is

P\{X = 20\} = \binom{40}{20}\left(\frac{1}{2}\right)^{40}

which can be shown to equal 0.1254.
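Recomputing Example 2.50 (our sketch): rounding the standardized endpoint to 0.16, as the text does when using Table 2.3, gives 2Φ(0.16) − 1 ≈ 0.1272, while keeping z = 0.5/√10 unrounded brings the approximation much closer to the exact binomial probability.

```python
from math import comb, erf, sqrt

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

# Exact probability that 40 fair-coin flips produce exactly 20 heads
exact = comb(40, 20) / 2**40

# Continuity-corrected approximation P{19.5 < X < 20.5} with z = 0.5/sqrt(10)
approx = 2 * phi(0.5 / sqrt(10)) - 1

print(round(exact, 4), round(approx, 4))
```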

 

Table 2.3. Area Φ ( x ) under the Standard Normal Curve to the Left of x

x 0.00 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09
0.0 0.5000 0.5040 0.5080 0.5120 0.5160 0.5199 0.5239 0.5279 0.5319 0.5359
0.1 0.5398 0.5438 0.5478 0.5517 0.5557 0.5597 0.5636 0.5675 0.5714 0.5753
0.2 0.5793 0.5832 0.5871 0.5910 0.5948 0.5987 0.6026 0.6064 0.6103 0.6141
0.3 0.6179 0.6217 0.6255 0.6293 0.6331 0.6368 0.6406 0.6443 0.6480 0.6517
0.4 0.6554 0.6591 0.6628 0.6664 0.6700 0.6736 0.6772 0.6808 0.6844 0.6879

0.5 0.6915 0.6950 0.6985 0.7019 0.7054 0.7088 0.7123 0.7157 0.7190 0.7224
0.6 0.7257 0.7291 0.7324 0.7357 0.7389 0.7422 0.7454 0.7486 0.7517 0.7549
0.7 0.7580 0.7611 0.7642 0.7673 0.7704 0.7734 0.7764 0.7794 0.7823 0.7852
0.8 0.7881 0.7910 0.7939 0.7967 0.7995 0.8023 0.8051 0.8078 0.8106 0.8133
0.9 0.8159 0.8186 0.8212 0.8238 0.8264 0.8289 0.8315 0.8340 0.8365 0.8389

1.0 0.8413 0.8438 0.8461 0.8485 0.8508 0.8531 0.8554 0.8577 0.8599 0.8621
1.1 0.8643 0.8665 0.8686 0.8708 0.8729 0.8749 0.8770 0.8790 0.8810 0.8830
1.2 0.8849 0.8869 0.8888 0.8907 0.8925 0.8944 0.8962 0.8980 0.8997 0.9015
1.3 0.9032 0.9049 0.9066 0.9082 0.9099 0.9115 0.9131 0.9147 0.9162 0.9177
1.4 0.9192 0.9207 0.9222 0.9236 0.9251 0.9265 0.9279 0.9292 0.9306 0.9319

1.5 0.9332 0.9345 0.9357 0.9370 0.9382 0.9394 0.9406 0.9418 0.9429 0.9441
1.6 0.9452 0.9463 0.9474 0.9484 0.9495 0.9505 0.9515 0.9525 0.9535 0.9545
1.7 0.9554 0.9564 0.9573 0.9582 0.9591 0.9599 0.9608 0.9616 0.9625 0.9633
1.8 0.9641 0.9649 0.9656 0.9664 0.9671 0.9678 0.9686 0.9693 0.9699 0.9706
1.9 0.9713 0.9719 0.9726 0.9732 0.9738 0.9744 0.9750 0.9756 0.9761 0.9767

2.0 0.9772 0.9778 0.9783 0.9788 0.9793 0.9798 0.9803 0.9808 0.9812 0.9817
2.1 0.9821 0.9826 0.9830 0.9834 0.9838 0.9842 0.9846 0.9850 0.9854 0.9857
2.2 0.9861 0.9864 0.9868 0.9871 0.9875 0.9878 0.9881 0.9884 0.9887 0.9890
2.3 0.9893 0.9896 0.9898 0.9901 0.9904 0.9906 0.9909 0.9911 0.9913 0.9916
2.4 0.9918 0.9920 0.9922 0.9925 0.9927 0.9929 0.9931 0.9932 0.9934 0.9936

2.5 0.9938 0.9940 0.9941 0.9943 0.9945 0.9946 0.9948 0.9949 0.9951 0.9952
2.6 0.9953 0.9955 0.9956 0.9957 0.9959 0.9960 0.9961 0.9962 0.9963 0.9964
2.7 0.9965 0.9966 0.9967 0.9968 0.9969 0.9970 0.9971 0.9972 0.9973 0.9974
2.8 0.9974 0.9975 0.9976 0.9977 0.9977 0.9978 0.9979 0.9979 0.9980 0.9981
2.9 0.9981 0.9982 0.9982 0.9983 0.9984 0.9984 0.9985 0.9985 0.9986 0.9986

3.0 0.9987 0.9987 0.9987 0.9988 0.9988 0.9989 0.9989 0.9989 0.9990 0.9990
3.1 0.9990 0.9991 0.9991 0.9991 0.9992 0.9992 0.9992 0.9992 0.9993 0.9993
3.2 0.9993 0.9993 0.9994 0.9994 0.9994 0.9994 0.9994 0.9995 0.9995 0.9995
3.3 0.9995 0.9995 0.9995 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996 0.9997
3.4 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9997 0.9998

Example 2.51

Let X_i, i = 1, 2, \ldots, 10, be independent random variables, each being uniformly distributed over (0, 1). Estimate P\{\sum_{i=1}^{10} X_i > 7\}.

Solution: Since E[X_i] = \frac{1}{2} and \mathrm{Var}(X_i) = \frac{1}{12}, we have by the central limit theorem that

P\left\{\sum_{i=1}^{10} X_i > 7\right\} = P\left\{\frac{\sum_{i=1}^{10} X_i - 5}{\sqrt{10 \cdot \frac{1}{12}}} > \frac{7 - 5}{\sqrt{10 \cdot \frac{1}{12}}}\right\} \approx 1 - \Phi(2.2) = 0.0139
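The same estimate (our sketch), computed without rounding the standardized value to 2.2; the small difference from 0.0139 is purely the rounding of z:

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n = 10
mu, var = n * 0.5, n / 12.0          # sum of 10 Uniform(0,1) variables
z = (7 - mu) / sqrt(var)             # (7 - 5)/sqrt(10/12), about 2.19
estimate = 1 - phi(z)

print(round(z, 2), round(estimate, 4))
```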

Example 2.52

The lifetime of a special type of battery is a random variable with mean 40 hours and standard deviation 20 hours. A battery is used until it fails, at which point it is replaced by a new one. Assuming a stockpile of 25 such batteries, the lifetimes of which are independent, approximate the probability that over 1100 hours of use can be obtained.

Solution: If we let X_i denote the lifetime of the ith battery to be put in use, then we desire p = P\{X_1 + \cdots + X_{25} > 1100\}, which is approximated as follows:

p = P\left\{\frac{X_1 + \cdots + X_{25} - 1000}{20\sqrt{25}} > \frac{1100 - 1000}{20\sqrt{25}}\right\} \approx P\{N(0,1) > 1\} = 1 - \Phi(1) \approx 0.1587
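Example 2.52 in code (our sketch; names are illustrative):

```python
from math import erf, sqrt

def phi(x):
    """Standard normal CDF, via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

n, mean, sd = 25, 40, 20
# Standardize the total lifetime: (1100 - n*mean)/(sd*sqrt(n)) = 1
z = (1100 - n * mean) / (sd * sqrt(n))
p = 1 - phi(z)

print(z, round(p, 4))
```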

We now present a heuristic proof of the central limit theorem. Suppose first that the X_i have mean 0 and variance 1, and let E[e^{tX}] denote their common moment generating function. Then the moment generating function of (X_1 + \cdots + X_n)/\sqrt{n} is

E\left[\exp\left\{t\,\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right\}\right] = E\left[e^{tX_1/\sqrt{n}}\, e^{tX_2/\sqrt{n}} \cdots e^{tX_n/\sqrt{n}}\right] = \left(E\left[e^{tX/\sqrt{n}}\right]\right)^n \quad \text{by independence}

Now, for n large, we obtain from the Taylor series expansion of e y that

e^{tX/\sqrt{n}} \approx 1 + \frac{tX}{\sqrt{n}} + \frac{t^2 X^2}{2n}

Taking expectations shows that when n is large

E\left[e^{tX/\sqrt{n}}\right] \approx 1 + \frac{t\,E[X]}{\sqrt{n}} + \frac{t^2 E[X^2]}{2n} = 1 + \frac{t^2}{2n} \quad \text{because } E[X] = 0,\ E[X^2] = 1

Therefore, we obtain that when n is large

E\left[\exp\left\{t\,\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right\}\right] \approx \left(1 + \frac{t^2}{2n}\right)^n

As n goes to \infty the approximation can be shown to become exact, and we have

\lim_{n \to \infty} E\left[\exp\left\{t\,\frac{X_1 + \cdots + X_n}{\sqrt{n}}\right\}\right] = e^{t^2/2}

Thus, the moment generating function of (X_1 + \cdots + X_n)/\sqrt{n} converges to the moment generating function of a standard normal random variable with mean 0 and variance 1. Using this, it can be proven that the distribution function of the random variable (X_1 + \cdots + X_n)/\sqrt{n} converges to the standard normal distribution function \Phi.

When the X_i have mean \mu and variance \sigma^2, the random variables (X_i - \mu)/\sigma have mean 0 and variance 1. Thus, the preceding shows that

P\left\{\frac{(X_1 - \mu) + (X_2 - \mu) + \cdots + (X_n - \mu)}{\sigma\sqrt{n}} \le a\right\} \to \Phi(a)

which proves the central limit theorem.
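The key limit in this heuristic, that (1 + t²/2n)ⁿ approaches e^{t²/2}, can be checked numerically (our sketch):

```python
from math import exp

def mgf_approx(t, n):
    """(1 + t^2/(2n))^n, the approximate MGF of (X1 + ... + Xn)/sqrt(n)."""
    return (1 + t * t / (2 * n)) ** n

t = 1.0
target = exp(t * t / 2)   # e^{t^2/2}, the standard normal MGF at t
for n in (10, 100, 10_000):
    print(n, mgf_approx(t, n), "->", target)
```

The printed values increase toward the standard normal moment generating function value as n grows, in line with the limit above.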

URL: https://www.sciencedirect.com/science/article/pii/B9780124079489000025