Probability theory

Let $\Omega$ be a set with a probability measure $\mathbb{P}$ defined on some sigma-algebra. A random variable is a measurable function $X:\Omega\to \mathbb{R}$. (Roughly speaking, this means $\mathbb{P}\{\omega: X(\omega)\in (a,b)\}$ is defined for every interval $(a,b)$.) It is hard to study a random variable directly; instead we study $X$ through its probability distribution. The probability distribution of $X$ is the measure $\mu$ on $\mathbb{R}$ given by $\mu(a,b) = \mathbb{P}\{\omega\in \Omega: X(\omega)\in (a,b)\}$; it is a probability measure on $\mathbb{R}$. If $d\mu = f(x)\,dx$ for some function $f$, we call $f$ the density function of $X$.

Expectations

The expectation of $X$ is given by $EX = \int_{\Omega} X(\omega)\,d\mathbb{P}(\omega)$. To compute expectations of a random variable, or of a function of a random variable, one usually uses the following handy formula: $E(g(X)) = \int_{\mathbb{R}} g(x)\,d\mu(x)$, where $\mu$ is the probability distribution of $X$.
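The formula $E(g(X)) = \int_{\mathbb{R}} g(x)\,d\mu(x)$, where $\mu$ is the distribution of $X$, can be sanity-checked numerically. The following minimal Python sketch, with $X$ uniform on $[0,1]$ and $g(x) = x^2$ as purely illustrative choices, compares a Monte Carlo estimate of $E(g(X))$ on the sample-space side with a Riemann sum for $\int g\,d\mu$ on the distribution side:

```python
import random

# X uniform on [0, 1] and g(x) = x^2 (illustrative choices).
random.seed(0)

def g(x):
    return x * x

# Sample-space side: Monte Carlo estimate of E(g(X)).
n = 200_000
mc = sum(g(random.random()) for _ in range(n)) / n

# Distribution side: Riemann sum for the integral of g(x) d_mu(x) = x^2 dx on [0, 1].
m = 10_000
quad = sum(g((k + 0.5) / m) for k in range(m)) / m

print(mc, quad)  # both approximate 1/3
```

Both numbers approximate $\int_0^1 x^2\,dx = \frac{1}{3}$: the two sides of the formula agree.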

Variance

The variance of $X$ is the squared $L^2$ norm of $X - E(X)$, i.e. $D(X) = E((X - E(X))^2) = E(X^2) - (E(X))^2$.

The Characteristic function

Let $X$ be a random variable. The characteristic function of $X$ is its Fourier transform, given by $\Phi_X(s) = E(e^{-2\pi i sX})$.

  • Let $d\mu$ be the law of distribution of $X$; then $\Phi_X(s) = \mathcal F\mu(s)$. Recall that this is given by $\mathcal F\mu(s) = \int_{\mathbb{R}} e^{-2\pi i s x}\,d\mu(x)$.
  • If $X$ has a density function $f(t)$, then $\Phi_X(s) = \int_{-\infty}^{+\infty} f(t) e^{-2\pi i s t}\,dt = \mathcal F f(s)$.

Characteristic function and moments of $X$

  1. The characteristic function detects the moments of $X$. By the derivative theorem, $$\frac{d^k}{ds^k}\Phi_X(s) = \mathcal F((-2\pi i x)^k \mu)(s) = \int_{-\infty}^{+\infty}(-2\pi i x)^k e^{-2\pi i s x}\,d\mu(x) = (-2\pi i)^k \int x^k e^{-2\pi i s x}\,d\mu(x).$$ Setting $s = 0$ we get $E(X^k) = \frac{1}{(-2\pi i)^k}\Phi_X^{(k)}(0)$.

  2. Conversely, if $\Phi_X$ has a global Taylor expansion (this is the case when $X$, i.e. $d\mu$, is compactly supported, so that its Fourier transform is analytic by the Paley–Wiener theorem),

    $$\Phi_X(s) = \sum_{k\geq 0}\frac{\Phi_X^{(k)}(0)}{k!}s^{k} = \sum_{k\geq 0} \frac{(-2\pi i s)^k E(X^k)}{k!}.$$
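Item 1 can be illustrated numerically (a sketch; the uniform distribution on $[-\frac12,\frac12]$ is chosen only because its characteristic function $\frac{\sin(\pi s)}{\pi s}$ is explicit). A finite-difference second derivative of $\Phi_X$ at $0$ recovers $E(X^2) = \frac{1}{12}$:

```python
import math

# Characteristic function of the uniform distribution on [-1/2, 1/2]:
# Phi(s) = sin(pi s) / (pi s), with Phi(0) = 1.
def phi(s):
    if s == 0.0:
        return 1.0
    return math.sin(math.pi * s) / (math.pi * s)

# Second derivative at 0 by central finite differences.
h = 1e-3
phi2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h ** 2

# E(X^2) = Phi''(0) / (-2 pi i)^2 = Phi''(0) / (-4 pi^2); the exact value is 1/12.
m2 = phi2 / (-4 * math.pi ** 2)
print(m2)
```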

Uncertainty principle

A complex function $\psi$ is called a wave function if $|\psi(x)|^2\,dx$ is a finite measure on $\mathbb{R}$. We normalize so that it is a probability measure. We also assume $\psi\in \mathcal S$ for simplicity, so that $\mathcal F\psi$ is also in $\mathcal S$. The uncertainty principle says that the variances of the random variables with densities $|\psi|^2$ and $|\mathcal F\psi|^2$ cannot both be small. In other words, $\psi$ and $\mathcal F\psi$ cannot be simultaneously localized.

Theorem (Uncertainty principle)
$$\int_{\mathbb{R}} x^2 |\psi(x)|^2\,dx \cdot \int_{\mathbb R} s^2 |\mathcal F\psi(s)|^2\,ds \geq \frac{1}{16\pi^2}.$$

Proof. By our normalization we have $\int |\psi(x)|^2\,dx = 1$. Integration by parts gives $1 = \int \psi(x)\overline{\psi(x)}\,dx = x|\psi(x)|^2\big|_{-\infty}^{+\infty} - \int x\,d(\psi\bar{\psi}) = 0 - \int x(\psi'\bar{\psi} + \psi\bar{\psi'})\,dx$. Taking absolute values, $1 = \left|\int_{-\infty}^{+\infty} x(\bar{\psi}\psi' + \psi\bar{\psi'})\,dx\right| \leq \int_{-\infty}^{+\infty} 2|x\psi||\psi'|\,dx \leq 2\|x\psi\|_{L^2}\|\psi'\|_{L^2}$ by the Cauchy–Schwarz inequality. Squaring, we get

$$4\|x\psi(x)\|^2_{L^2} \cdot \|\psi'(x)\|_{L^2}^2 \geq 1.$$

Now we look at the derivative term. By the Plancherel identity and the derivative theorem, $\|\psi'\|_{L^2}^2 = \|\mathcal F\psi'\|_{L^2}^2 = \|2\pi i s\,\mathcal F\psi(s)\|_{L^2}^2 = \int 4\pi^2 s^2 |\mathcal F\psi(s)|^2\,ds$. Plugging this into the inequality above gives $16\pi^2 \|x\psi(x)\|_{L^2}^2 \cdot \|s\,\mathcal F\psi(s)\|^2_{L^2} \geq 1$, as claimed.
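As a numerical check of the theorem (a sketch): for the Gaussian wave function $\psi(x) = (2\pi)^{-1/4}e^{-x^2/4}$, for which $|\psi|^2$ is the standard normal density, the product of the two integrals should attain the bound $\frac{1}{16\pi^2}$, the Gaussian being the extremal case:

```python
import math

# psi(x) = (2 pi)^(-1/4) exp(-x^2 / 4): |psi(x)|^2 is the standard normal density.
c = (2 * math.pi) ** -0.25
def psi(x):
    return c * math.exp(-x * x / 4)

# Midpoint Riemann sums on [-L, L].
L, n = 10.0, 2000
dx = 2 * L / n
xs = [-L + (k + 0.5) * dx for k in range(n)]

# Position spread: integral of x^2 |psi(x)|^2 dx (equals 1 for this psi).
var_x = sum(x * x * psi(x) ** 2 for x in xs) * dx

# F psi(s) = integral of psi(x) e^{-2 pi i s x} dx; real-valued since psi is even.
def fpsi(s):
    return sum(psi(x) * math.cos(2 * math.pi * s * x) for x in xs) * dx

# Frequency spread: integral of s^2 |F psi(s)|^2 ds (the Gaussian decays fast,
# so a short interval in s suffices).
S, m = 1.0, 400
ds = 2 * S / m
var_s = 0.0
for k in range(m):
    s = -S + (k + 0.5) * ds
    var_s += s * s * fpsi(s) ** 2 * ds

product = var_x * var_s
print(var_x, var_s, product, 1 / (16 * math.pi ** 2))
```

The printed product matches $\frac{1}{16\pi^2}\approx 0.00633$ up to discretization error.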

Central Limit Theorem

Independent random variables

Random variables $X$ and $Y$ are said to be independent provided $E(f(X)g(Y)) = E(f(X))E(g(Y))$ for all measurable $f, g$. There are two main consequences of independence that we are going to use:

  1. $E(XY) = E(X)E(Y)$. This implies $D(X+Y) = D(X) + D(Y)$, where $D(X) = E(X^2) - (E(X))^2$ is the variance of $X$.
  2. Let $f_X, f_Y$ be the probability density functions of $X, Y$. Then the density of $X+Y$ satisfies $f_{X+Y} = f_X * f_Y$.
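Consequence 2 is easy to check numerically (a sketch): convolving the uniform density on $[-\frac12,\frac12]$ with itself should produce the triangle density $f_{X+Y}(x) = \max(1-|x|,\,0)$:

```python
# Numerical convolution of the uniform density Pi on [-1/2, 1/2] with itself.
# The exact answer is the triangle density f_{X+Y}(x) = max(1 - |x|, 0).
def Pi(x):
    return 1.0 if -0.5 <= x <= 0.5 else 0.0

def conv(f, g, x, n=2000, L=1.0):
    # Midpoint Riemann sum for (f * g)(x) = integral of f(t) g(x - t) dt.
    dt = 2 * L / n
    return sum(f(-L + (k + 0.5) * dt) * g(x - (-L + (k + 0.5) * dt))
               for k in range(n)) * dt

for x in (0.0, 0.25, 0.5, 0.9):
    print(x, conv(Pi, Pi, x), max(1 - abs(x), 0.0))
```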

Statement of Central limit theorem

Two random variables $X, Y$ are called identically distributed provided that their laws of distribution are the same (i.e. correspond to the same probability measure on $\mathbb{R}$).

We call a family $(X_i)_{i=1}^n$ of independent, identically distributed random variables i.i.d. for short.

We are interested in the behaviour of the distribution of $X_1+\cdots+X_n$ as $n\to \infty$. For example, consider a point starting at the origin of $\mathbb{R}$. At every second, the point moves a random distance, uniformly distributed in $[-\frac{1}{2},\frac{1}{2}]$ and independent of the previous moves. Then the location of the point at the $n$-th second is $X_1+\cdots+X_n$, where the $X_i$ are uniformly distributed random variables on $[-\frac{1}{2},\frac{1}{2}]$. What does the probability distribution of the location look like when $n$ is large? In this case we can calculate the distribution directly, since the densities are convolutions. However, it turns out that successive convolutions are very complicated even in this simplest case. The following is taken from this wikipedia page.

The frequency point of view

Assume the $X_i$ are i.i.d. with common density function $\Pi(x) = \begin{cases} 1, & x\in [-\frac{1}{2},\frac{1}{2}] \\ 0, & \text{else.} \end{cases}$ Then $E(X_i) = 0$ and $D(X_i) = E(X_i^2) = \int_{-\frac{1}{2}}^{\frac{1}{2}} x^2\,dx = \frac{1}{12}$, so $D(X_1+\cdots+X_n) = \frac{n}{12}$. Let $S_n := \frac{X_1+\cdots+X_n}{\sqrt{n/12}}$ be the standard normalization, so that $D(S_n) = 1$.

We have $\Phi_{X_1+\cdots+X_n}(s) = (\mathcal F\Pi(s))^n = \left(\frac{\sin(\pi s)}{\pi s}\right)^n$. Note that $\Phi_{X/a}(s) = E(e^{-2\pi i s \frac{X}{a}}) = E(e^{-2\pi i \frac{s}{a} X}) = \Phi_X(\frac{s}{a})$, so

$$\Phi_{S_n}(s) = \left( \frac{\sin\left(\frac{2\sqrt{3}\pi s}{\sqrt{n}}\right)}{\frac{2\sqrt{3}\pi s}{\sqrt{n}}}\right)^n.$$

For every fixed $s$, when $n$ is large, $\frac{s}{\sqrt{n}}$ is near $0$. By Taylor expansion we have $\frac{\sin x}{x} = \frac{x - \frac{x^3}{3!} + \frac{x^5}{5!} + o(x^5)}{x} = 1 - \frac{x^2}{2\cdot 3} + O(x^4)$. Actually in this case we have a global Taylor series, so the "$O(x^4)$" is concretely the tail of the power series $\sum_{k\geq 0,\ k\text{ even}} \frac{(-1)^{k/2} x^k}{(k+1)!}$, but we don't need it.

$$\begin{split} \Phi_{S_n}(s) &= \left(1 - \frac{1}{2\cdot 3}\left(\frac{2\sqrt 3 \pi s}{\sqrt n}\right)^2 + O\!\left(\frac{1}{n^2}\right)\right)^n \\ &= \left(1 - \frac{2\pi^2 s^2}{n} + O\!\left(\frac{1}{n^2}\right)\right)^n \\ &\to e^{-2\pi^2 s^2} \end{split}$$

when $n\to \infty$.
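This convergence is easy to observe numerically (a sketch): evaluate the exact formula for $\Phi_{S_n}(s)$ at a fixed $s$ and compare with the limit $e^{-2\pi^2 s^2}$ as $n$ grows:

```python
import math

def phi_Sn(s, n):
    # Exact characteristic function of the normalized sum of n uniforms.
    x = 2 * math.sqrt(3) * math.pi * s / math.sqrt(n)
    return (math.sin(x) / x) ** n

s = 0.3
limit = math.exp(-2 * math.pi ** 2 * s ** 2)
for n in (10, 100, 10_000, 1_000_000):
    print(n, phi_Sn(s, n), limit)
```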

Taking the inverse Fourier transform, we see that $f_{S_n}(x)\to \frac{1}{\sqrt{2\pi}}e^{-\frac{x^2}{2}}$, since the inverse Fourier transform is continuous.

The central limit theorem says this is true for general random variables.

Theorem

Let $(X_i)_{i=1}^{\infty}$ be a sequence of i.i.d. random variables with common expectation $\mu$ and standard deviation $\sigma$. Then $\frac{X_1+\cdots+X_n - n\mu}{\sqrt{n}\,\sigma}$ converges weakly to the Gaussian random variable. In other words,

$$\lim_{n\to \infty}\mathbb{P}\left(a < \frac{X_1+\cdots+X_n - n\mu}{\sqrt{n}\,\sigma} < b \right) = \frac{1}{\sqrt{2\pi}}\int_{a}^{b} e^{-\frac{x^2}{2}}\,dx.$$

Here a Gaussian random variable means a random variable whose probability density function is $\frac{1}{\sqrt{2\pi}} e^{-\frac{x^2}{2}}$.
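A Monte Carlo illustration of the theorem (a sketch; exponential variables with $\mu = \sigma = 1$ are just one convenient, visibly non-symmetric choice):

```python
import math
import random

random.seed(42)

# X_i ~ Exp(1), so mu = sigma = 1 and the summands are far from symmetric.
n, trials = 500, 5000
a, b = -1.0, 1.0
hits = 0
for _ in range(trials):
    total = sum(random.expovariate(1.0) for _ in range(n))
    z = (total - n) / math.sqrt(n)  # (X_1 + ... + X_n - n*mu) / (sqrt(n)*sigma)
    if a < z < b:
        hits += 1

empirical = hits / trials
gaussian = math.erf(1 / math.sqrt(2))  # P(-1 < Z < 1) for the standard Gaussian
print(empirical, gaussian)
```

With $n = 500$ summands the empirical probability already matches $\frac{1}{\sqrt{2\pi}}\int_{-1}^{1} e^{-x^2/2}\,dx \approx 0.683$ to within Monte Carlo error.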

Proof of Central Limit theorem

By shifting if necessary we may assume the $X_i$ have expectation zero. Let $S_n := \frac{X_1+\cdots+X_n}{\sqrt{n}\,\sigma}$ be the normalized sum. The structure of the proof of the central limit theorem is:

  1. Note that by the scaling theorem, the characteristic function of the Gaussian random variable is $e^{-2\pi^2 s^2}$. Show that the i.i.d. assumption implies that $\Phi_{S_n}(s)$ converges pointwise to $e^{-2\pi^2 s^2}$.
  2. Use the fact that for probability measures, pointwise convergence of the Fourier transforms implies weak convergence. This is called the Lévy continuity theorem.

We first give the proof of 1 in full detail. Then we give a sketch of the proof of 2.

Proof of 1. The proof is similar to the computation above using Taylor expansion. In the general case, the Taylor expansion is $\Phi_X(s) = 1 - \frac{4\pi^2 \sigma^2 s^2}{2} + o(s^2)$, so for fixed $s$ (note that we think of $s$ as a constant and $n$ as the variable),

$$\Phi_{S_n}(s) = \left(1 - 2\pi^2\sigma^2 \left(\frac{s}{\sqrt{n}\,\sigma}\right)^2 + o\!\left(\frac{1}{n}\right) \right)^n = \left(1 - \frac{2\pi^2 s^2}{n} + o\!\left(\frac{1}{n}\right)\right)^n.$$

Write $\Phi_{S_n}(s) = (1+c_n)^n$, where $c_n = \frac{-2\pi^2 s^2 + e_n}{n}$ with $e_n\to 0$ as $n\to \infty$. For every $\epsilon > 0$, there exists $N$ large enough such that $|e_n| < \epsilon$ when $n > N$. Enlarging $N$ if necessary, we also have $\left(1+\frac{-2\pi^2 s^2 + \epsilon}{n}\right)^n < e^{-2\pi^2 s^2 + \epsilon} + \epsilon$ when $n > N$. So $\limsup_{n\to \infty} (1+c_n)^n \leq e^{-2\pi^2 s^2 + \epsilon} + \epsilon$ for every $\epsilon$. This implies $\limsup_{n\to \infty}(1+c_n)^n \leq e^{-2\pi^2 s^2}$. By the same argument, we can show that $\liminf_{n\to \infty}(1+c_n)^n \geq e^{-2\pi^2 s^2}$. So the limit exists and $\lim_{n\to \infty}\Phi_{S_n}(s) = e^{-2\pi^2 s^2}$. We have proved the pointwise convergence.
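For completeness, the Taylor coefficients used at the start of this proof follow from the moment formula $E(X^k) = \frac{1}{(-2\pi i)^k}\Phi_X^{(k)}(0)$ of the earlier section, together with $E(X) = 0$ and $E(X^2) = \sigma^2$ after centering:

```latex
\Phi_X'(0) = -2\pi i\, E(X) = 0, \qquad
\Phi_X''(0) = (-2\pi i)^2 E(X^2) = -4\pi^2\sigma^2,
\qquad\text{hence}\qquad
\Phi_X(s) = 1 - \frac{4\pi^2\sigma^2 s^2}{2} + o(s^2).
```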

The Lévy continuity theorem

Let $(\mu_n)_{n=1}^{\infty}$ be a sequence of probability measures, and let $\Phi_n$ be their characteristic functions (i.e. Fourier transforms). Note that these are uniformly continuous, being Fourier transforms of probability measures. Suppose $\Phi_n \to \phi$ pointwise. Then the following are equivalent:

(1) $\mu_n$ converges weakly to a probability measure $\mu$.

(2) $\phi$ is the Fourier transform of a probability measure $\mu$.

(3) $\phi$ is continuous at zero.

(4) The family $(\mu_n)_{n=1}^{\infty}$ is tight (to be defined later).

To see $(1)\implies(2)$, recall that by definition $\mu_n\to \mu$ weakly iff $(\mu_n, f)\to(\mu, f)$ for every bounded continuous function $f$. But $\Phi_n(s) = (\mu_n(x), e^{-2\pi i s x})$, and $e^{-2\pi i s x}$ is bounded and continuous in the $x$ variable for every $s$. This implies $\Phi_n(s) = (\mu_n(x), e^{-2\pi i x s}) \to (\mu(x), e^{-2\pi i x s})$ for every $s$. Since we already know that $\Phi_n(s)\to \phi(s)$ pointwise, this implies $\phi(s) = (\mu(x), e^{-2\pi i x s}) = \Phi_\mu(s)$.

$(2)\implies(3)$ follows from the remark that the Fourier transform of a probability measure is uniformly continuous, so in particular continuous at $0$.

Tightness of measures

Let $(\mu_n)_{n=1}^{\infty}$ be a family of probability measures.

We know that for any $\mu_i$, $\lim_{R\to \infty}\mu_i(-R,R) = 1$. The family is called tight provided that this holds uniformly, i.e. for every $\epsilon > 0$ there exists $R$ (not depending on $i$) such that $\mu_i(|x|>R) < \epsilon$ for every $i$.

Consequence of tightness: $(4)\implies(1)$.

We give a sketch proof here to illustrate how to use tightness.

Uniform estimate of $\mu_n(|x|>R)$ by the mean value of $\Phi_n$ near zero.

$(3)\implies(4)$

For $\epsilon > 0$, $\int_{-\epsilon}^{\epsilon}(1 - \Phi_n(t))\,dt = \int_{\mathbb R}\left[2\epsilon - \frac{\sin(2\pi\epsilon x)}{\pi x}\right]d\mu_n(x)$. So $\frac{1}{\epsilon}\int_{-\epsilon}^{\epsilon}(1-\Phi_n(t))\,dt = \int_{\mathbb{R}}\left[2 - 2\,\frac{\sin(2\pi\epsilon x)}{2\pi\epsilon x}\right]d\mu_n(x)$. The main part of the integrand is a scaling of $1-\operatorname{sinc}(x)$, which looks as below.

picture of 1-sinc

Observe that $1 - \frac{\sin x}{x} \geq \frac{1}{2}\chi_{\{|x| > 2\}}(x)$, so

$$2 - 2\,\frac{\sin(2\pi\epsilon x)}{2\pi\epsilon x} \geq \chi_{\{|2\pi\epsilon x| > 2\}}(x)$$

and it follows that

$$\frac{1}{\epsilon}\int_{-\epsilon}^{\epsilon}(1-\Phi_n(t))\,dt \geq \mu_n\left\{|x| > \frac{1}{\pi\epsilon}\right\}.$$
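A quick numerical sanity check of this inequality (a sketch; the standard Gaussian, whose characteristic function is $e^{-2\pi^2 t^2}$ in this convention, is just a convenient test measure):

```python
import math

# mu_n = standard Gaussian, with characteristic function Phi(t) = exp(-2 pi^2 t^2).
def phi(t):
    return math.exp(-2 * math.pi ** 2 * t ** 2)

eps = 0.1

# LHS: (1/eps) * integral over [-eps, eps] of (1 - Phi(t)) dt, midpoint rule.
m = 10_000
dt = 2 * eps / m
lhs = sum(1 - phi(-eps + (k + 0.5) * dt) for k in range(m)) * dt / eps

# RHS: mu{|x| > 1/(pi*eps)} for the standard Gaussian.
R = 1 / (math.pi * eps)
rhs = 1 - math.erf(R / math.sqrt(2))  # P(|X| > R) = 2 * (1 - CDF(R))

print(lhs, rhs)
```

The left side dominates the Gaussian tail mass, as the inequality predicts.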

Claim: For every $\delta > 0$, there exist a small $\epsilon_0$ and a large $N$, depending on $\delta$, such that for every $n > N$, $\frac{1}{\epsilon_0}\int_{-\epsilon_0}^{\epsilon_0}(1-\Phi_n(t))\,dt < \delta$; this implies $\mu_n\{|x| > \frac{1}{\pi\epsilon_0}\} < \delta$ for every $n > N$.

First we point out that the claim implies tightness. Given $\delta > 0$, we want to find $R$ large enough that $\mu_n\{|x| > R\} < \delta$ for every $n$. The claim gives an $N$ and an $R'$ such that $\mu_n\{|x| > R'\} < \delta$ for every $n > N$. To handle the finitely many remaining indices, choose for each $1\leq i \leq N$ an $R_i$ with $\mu_i\{|x| > R_i\} < \delta$, and let $R = \max\{R_1,\dots,R_N,R'\}$.

So it remains to prove the claim. The LHS is (twice) the mean value of $\Phi_n$ on a small interval near zero. Recall that $\Phi_n$ converges to $\phi$ pointwise and $\phi$ is continuous at zero, so $1-\phi$ is small near zero. These two conditions together give uniform control of the value of $\Phi_n$ in a small neighborhood of zero. To check this, $\left|\frac{1}{\epsilon}\int_{-\epsilon}^{\epsilon}(1 - \Phi_n(t))\,dt\right| \leq \left|\frac{1}{\epsilon}\int_{-\epsilon}^{\epsilon}(1 - \phi(t))\,dt\right| + \left|\frac{1}{\epsilon} \int_{-\epsilon}^{\epsilon}(\phi(t) - \Phi_n(t))\,dt\right|$. The first term is $\leq \frac{\delta}{2}$ for some small $\epsilon_0$ by the mean value theorem for integrals and the continuity of $\phi$. For this $\epsilon_0$, the second term is $\leq \frac{\delta}{2}$ when $n$ is large, by the dominated convergence theorem (dominated by the constant function $2$, because $|\Phi_n|\leq 1$ as Fourier transforms of probability measures, and pointwise convergence implies $|\phi|\leq 1$ as well).

Appendix