Math Misc
2025-01-16

A collection of miscellaneous math notes, updated irregularly (supposedly).

Circulant Matrix#

An $n \times n$ circulant matrix $\mathbf{C}$ takes the form

$$\mathbf{C} = \left[\begin{array}{ccccc}c_0 & c_{n-1} & \cdots & c_2 & c_1 \\ c_1 & c_0 & c_{n-1} & & c_2 \\ \vdots & c_1 & c_0 & \ddots & \vdots \\ c_{n-2} & & \ddots & \ddots & c_{n-1} \\ c_{n-1} & c_{n-2} & \cdots & c_1 & c_0\end{array}\right]$$

It has the determinant $\det(\mathbf{C}) = \prod_{i=0}^{n-1} f(\omega^i)$, where $f(x) = c_0 + c_1 x + \cdots + c_{n-1} x^{n-1}$ and $\omega = e^{2\pi i/n}$.

Proof

Let $\mathbf{\Omega}=(\omega^{(i-1)(j-1)})_{1\leq i, j \leq n} \in \mathbf{\mathcal{M}}_n(\mathbb{C})$, then

$$\begin{aligned} \det(\mathbf{C}\mathbf{\Omega}) &= \det\left(\left[\begin{array}{cccc}f(1) & f(\omega) & \cdots & f(\omega^{n-1}) \\ f(1) & \omega f(\omega) & \cdots & \omega^{n-1} f(\omega^{n-1}) \\ \vdots & \vdots & \ddots & \vdots \\ f(1) & \omega^{n-1}f(\omega) & \cdots & \omega^{(n-1)(n-1)}f(\omega^{n-1}) \end{array}\right]\right) \\ &=f(1)f(\omega)\cdots f(\omega^{n-1})\det(\mathbf{\Omega}) \\ &=\det(\mathbf{\Omega}) \prod_{i=0}^{n-1} f(\omega^i) \end{aligned}$$

Therefore, $\det(\mathbf{C})\det(\mathbf{\Omega}) = \det(\mathbf{\Omega})\prod_{i=0}^{n-1} f(\omega^i)$, and since the Vandermonde matrix $\mathbf{\Omega}$ is invertible, $\det(\mathbf{C}) = \prod_{i=0}^{n-1} f(\omega^i)$. $\quad \square$
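As a quick numerical sanity check (not part of the original proof; the vector `c` below is an arbitrary example), the formula can be verified with NumPy:

```python
import numpy as np

# First column of the circulant matrix: c_0, c_1, ..., c_{n-1}
c = np.array([2.0, 5.0, -1.0, 3.0])
n = len(c)

# Build C column by column: column j is the first column cyclically shifted down by j
C = np.column_stack([np.roll(c, j) for j in range(n)])

# det(C) = prod_i f(omega^i) with f(x) = c_0 + c_1 x + ... + c_{n-1} x^{n-1}
omega = np.exp(2j * np.pi / n)
f = lambda x: sum(c[k] * x**k for k in range(n))
prod = np.prod([f(omega**i) for i in range(n)])

print(np.linalg.det(C))   # determinant computed directly
print(prod.real)          # product formula (imaginary part is ~0)
```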

Gradient of Log-Determinant Function#

Let $\mathbf{X}\in \mathbf{S}_{++}$ be a positive definite matrix; then $\nabla (\log\det\mathbf{X})=\mathbf{X}^{-1}$.

Proof

Let $\mathbf{Y}=\mathbf{X}+\Delta \mathbf{X}$, then

$$\begin{aligned} \log\det\mathbf{Y} &= \log\det\left(\mathbf{X}^{\frac{1}{2}}(\mathbf{I}+\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}})\mathbf{X}^{\frac{1}{2}}\right) \\ &= \log\det\mathbf{X} + \log\det(\mathbf{I}+\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}) \\ &= \log\det\mathbf{X} + \sum_{i=1}^n \log(1+\lambda_i) \end{aligned}$$

where $\lambda_i$ are the eigenvalues of $\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}$. As $\Delta\mathbf{X}\rightarrow \mathbf{0}$, we have $\lambda_i = o(1)$, and thus $\log(1+\lambda_i)=\lambda_i + o(\lambda_i)$. Therefore

$$\begin{aligned} \log\det\mathbf{Y} &= \log\det\mathbf{X} + \sum_{i=1}^n \lambda_i + o(\|\Delta\mathbf{X}\|) \\ &=\log\det\mathbf{X}+\text{tr}(\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}) + o(\|\Delta\mathbf{X}\|) \\ &=\log\det\mathbf{X}+\text{tr}(\mathbf{X}^{-1}\Delta\mathbf{X}) + o(\|\Delta\mathbf{X}\|) \end{aligned}$$

Thus, $\mathrm{d}(\log\det\mathbf{X}) = \text{tr}(\mathbf{X}^{-1}\mathrm{d}\mathbf{X})\Longrightarrow \nabla (\log\det\mathbf{X})=\mathbf{X}^{-1}$. $\quad \square$
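A finite-difference sketch of this identity (the matrix and the perturbation below are arbitrary test data, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random symmetric positive definite X
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)

# Claim: grad log det X = X^{-1}, i.e. d(log det X) ≈ tr(X^{-1} dX)
dX = 1e-6 * rng.standard_normal((n, n))
dX = (dX + dX.T) / 2          # keep the perturbation symmetric

lhs = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
rhs = np.trace(np.linalg.inv(X) @ dX)
print(lhs, rhs)               # the two values agree to first order
```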

Barbalat’s Lemma#

If $f:\mathbb{R}^+ \rightarrow \mathbb{R}^+$ is uniformly continuous with $\int_a^{+\infty} f(t)\,dt < +\infty$, then $\lim_{t\rightarrow +\infty} f(t) = 0$.

For series we have $\sum_{n=1}^{+\infty} u_n < +\infty \Rightarrow \lim_{n\rightarrow +\infty} u_n = 0$. However, for functions $\int_a^{+\infty} f(t)\,dt < +\infty \nRightarrow \lim_{t\rightarrow +\infty} f(t) = 0$; the uniform continuity is necessary.

  • Example 1. $f(x)=\sin(x^2)$
    Since $\int_1^{+\infty} \sin(x^2)\,dx \overset{u=x^2}{=} \int_1^{+\infty} \frac{\sin u}{2\sqrt{u}}\,du$, by Dirichlet’s test the integral converges. However, $\lim_{x\rightarrow +\infty} \sin(x^2)$ does not exist.
  • Example 2. $f(x)=\frac{x}{1+x^6\sin^2 x} > 0$
    Define $F(u)=\int_0^u f(x)\,dx$ and $a_k=\int_{(k-1)\pi}^{k\pi} f(x)\,dx$; then $F(u)$ is monotonically increasing on $\mathbb{R}^+$ and $F(n\pi)=\sum_{k=1}^{n}a_k$. We have
    $$\begin{aligned} a_k &\leq \int_{(k-1)\pi}^{k\pi} \frac{k\pi}{1+(k-1)^6\pi^6\sin^2x}\,dx \\ &= 2k\pi \int_0^{\frac{\pi}{2}} \frac{dx}{1+(k-1)^6\pi^6\sin^2x} \\ &\leq 2k\pi \int_0^{\frac{\pi}{2}} \frac{dx}{1+(k-1)^6\pi^6(\frac{2}{\pi} x)^2} \\ &= 2k\pi\int_0^{\frac{\pi}{2}} \frac{dx}{1+{[2(k-1)^3\pi^2x]}^2} \\ &= \frac{k}{(k-1)^3\pi}\int_0^{(k-1)^3\pi^3}\frac{dt}{1+t^2} \\ &\leq \frac{k}{(k-1)^3\pi}\int_0^{+\infty}\frac{dt}{1+t^2} \\ &=\frac{k}{2(k-1)^3} \underset{+\infty}{\sim} \frac{1}{2k^2} \end{aligned}$$
    Here we use the inequality $\sin x \geq \frac{2x}{\pi}$ on $[0, \frac{\pi}{2}]$ and the fact that $\int_0^{+\infty}\frac{dt}{1+t^2} = \frac{\pi}{2}$. Therefore, $\lim_{n\rightarrow +\infty} F(n\pi)$ exists. Since $F(x)$ is increasing, $\lim_{x\rightarrow +\infty} F(x)< +\infty$, but $\lim_{x\rightarrow +\infty} f(x)$ doesn’t exist.
Proof

If $f(x)\nrightarrow 0$ as $x\rightarrow +\infty$, then there exist $\epsilon>0$ and a sequence $(x_n)$ with $x_n\rightarrow+\infty$ such that $\forall n\in \mathbb{N},\ f(x_n)>\epsilon$. Since $f$ is uniformly continuous, $\exists \delta>0$ such that $\forall n\in\mathbb{N},\ \forall y>0,\ |x_n-y|<\delta \Rightarrow |f(x_n)-f(y)|<\frac{\epsilon}{2}$. So $\forall x\in[x_n, x_n+\delta]$, we have $f(x)\geq f(x_n)-|f(x_n)-f(x)|>\frac{\epsilon}{2}$. Therefore

$$\left|\int_a^{x_n+\delta} f(t)\,dt - \int_a^{x_n}f(t)\,dt\right| = \int_{x_n}^{x_n+\delta} f(t)\,dt > \frac{\epsilon\delta}{2}>0$$

However, since $\int_a^{+\infty} f(t)\,dt$ exists, the LHS converges to $0$ as $n\rightarrow +\infty$, yielding a contradiction. $\quad \square$
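To make Example 1 concrete, here is a rough numerical sketch (grid and range chosen ad hoc): the partial integrals of $\sin(x^2)$ hover around a fixed value while the integrand keeps oscillating.

```python
import numpy as np

# f(x) = sin(x^2): the integral over [1, +inf) converges, but f has no limit.
dx = 1e-4
x = np.arange(1.0, 40.0, dx)        # fine grid (oscillation period ~ pi/x near x)
f = np.sin(x**2)
F = np.cumsum(f) * dx               # partial integrals  int_1^X f(t) dt

for X in (10.0, 20.0, 30.0, 39.0):
    i = np.searchsorted(x, X)
    print(f"X={X:5.1f}  integral≈{F[i]:.4f}  f(X)={f[i]: .3f}")
# The integral column settles near a constant (up to O(1/X) ripples);
# the f(X) column keeps oscillating between -1 and 1.
```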

Perron-Frobenius Theorem#

Definition

Positive Matrix#

A positive (non-negative) matrix $\mathbf{A}\in \mathbf{\mathcal{M}}_{n,m}(\mathbb{R})$ is a matrix with all entries $a_{ij}$ positive (non-negative). If $\mathbf{A}>0$, $\mathbf{\alpha} \geq 0$ and $\mathbf{\alpha}\neq 0$, then $\mathbf{A}\mathbf{\alpha}>0$.

Definition

Spectral Radius#

The spectral radius of a matrix $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{R})$ is defined as $\rho(\mathbf{A})=\max\{|\lambda|:\lambda \text{ is an eigenvalue of } \mathbf{A}\}$.

Let $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{R})$ be a positive matrix. Then

  1. $\mathbf{A}$ has a positive eigenvalue $\rho(\mathbf{A})$ associated with a positive eigenvector $\mathbf{v}$ (the Perron–Frobenius eigenvalue).
  2. The Perron–Frobenius eigenvalue is simple (i.e. both the algebraic and the geometric multiplicity of $\rho(\mathbf{A})$ are 1).
  3. There are no other positive eigenvectors except positive multiples of $\mathbf{v}$.
Lemma

Gelfand’s formula#

If $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{C})$ and $\mathbf{A}\neq 0$, then $\rho(\mathbf{A})=\lim_{k\rightarrow +\infty} \|\mathbf{A}^k\|_2^{\frac{1}{k}}$, with $\|\cdot\|_2$ the spectral norm.

Proof

We may suppose that $\rho(\mathbf{A})=1$: since $n\,\rho(\mathbf{A})\geq|\text{tr}(\mathbf{A})|>0$, we can replace $\mathbf{A}$ by $\mathbf{A}/\rho(\mathbf{A})$.

  1. Let $\lambda$ be an eigenvalue of $\mathbf{A}$ with $|\lambda|=1=\rho(\mathbf{A})$ and $\mathbf{v}$ a corresponding eigenvector. Consider $\mathbf{\alpha}=(|\mathbf{v}_i|)_i\geq 0$, $\mathbf{\alpha}\neq 0$; then we have

    $$\begin{aligned} (\mathbf{A}\mathbf{\alpha})_i &= \sum_j a_{ij}|\mathbf{v}_j| \geq \left|\sum_j a_{ij}\mathbf{v}_j\right| = |\lambda \mathbf{v}_i| \\ &= |\lambda|\,|\mathbf{v}_i| = |\mathbf{v}_i| \end{aligned}$$

    which implies $(\mathbf{A}\mathbf{\alpha})_i\geq \mathbf{\alpha}_i$ for every $i$, i.e. $\mathbf{A}\mathbf{\alpha}\geq \mathbf{\alpha}$. Now we prove that $\mathbf{A}\mathbf{\alpha}\neq\mathbf{\alpha}$ is impossible. Suppose that $\mathbf{A}\mathbf{\alpha}\geq\mathbf{\alpha}$ with $\mathbf{A}\mathbf{\alpha}\neq\mathbf{\alpha}$; then

    $$\mathbf{A}\mathbf{\alpha}-\mathbf{\alpha}\geq 0,\ \mathbf{A}\mathbf{\alpha}-\mathbf{\alpha}\neq 0 \Rightarrow \mathbf{A}(\mathbf{A}\mathbf{\alpha}-\mathbf{\alpha})>0 \Rightarrow \mathbf{A}^2\mathbf{\alpha}>\mathbf{A}\mathbf{\alpha}$$

    Thus, $\exists\, \epsilon>0$ such that $\mathbf{A}^2\mathbf{\alpha}>(1+\epsilon)\mathbf{A}\mathbf{\alpha}$, and by applying $\mathbf{A}$ repeatedly we get $\mathbf{A}^k\mathbf{\alpha}>(1+\epsilon)^{k-1}\mathbf{A}\mathbf{\alpha}$, which implies $\|\mathbf{A}^k\mathbf{\alpha}\|_2 > (1+\epsilon)^{k-1}\|\mathbf{A}\mathbf{\alpha}\|_2$. Therefore

    $$\lim_{k\rightarrow +\infty}\|\mathbf{A}^k\|_2^{\frac{1}{k}}\geq \lim_{k\rightarrow +\infty}\left(\frac{\|\mathbf{A}^k\mathbf{\alpha}\|_2}{\|\mathbf{\alpha}\|_2}\right)^{\frac{1}{k}} \geq \lim_{k\rightarrow +\infty}\left((1+\epsilon)^{k-1}\,\frac{\|\mathbf{A}\mathbf{\alpha}\|_2}{\|\mathbf{\alpha}\|_2}\right)^{\frac{1}{k}} = 1+\epsilon > 1=\rho(\mathbf{A})$$

    which contradicts Gelfand’s formula. Therefore $\mathbf{A}\mathbf{\alpha}=\mathbf{\alpha}$, and since $\mathbf{A}>0$ and $\mathbf{\alpha}\geq 0,\ \mathbf{\alpha}\neq 0$, we get $\mathbf{\alpha}=\mathbf{A}\mathbf{\alpha}>0$: there exists a positive eigenvector $\mathbf{\alpha}$ associated with the eigenvalue $\rho(\mathbf{A})$.

  2. Assume that there exists an eigenvector $\mathbf{\beta}$ that is linearly independent of $\mathbf{\alpha}$ and corresponds to the eigenvalue $\rho(\mathbf{A})=1$ (we may take $\mathbf{\beta}$ real; if $\mathbf{\beta}$ has no positive entry, replace it by $-\mathbf{\beta}$). Let $c=\min_{\mathbf{\beta}_i>0} \frac{\mathbf{\alpha}_i}{\mathbf{\beta}_i}=\frac{\mathbf{\alpha}_k}{\mathbf{\beta}_k}$; then $c>0$, $\mathbf{\alpha}-c\mathbf{\beta}\geq 0$ and $\mathbf{\alpha}-c\mathbf{\beta}\neq 0$. However, since $\mathbf{A}>0$, $\mathbf{\alpha}-c\mathbf{\beta}=\mathbf{A}(\mathbf{\alpha}-c\mathbf{\beta})>0$, which contradicts the fact that $\mathbf{\alpha}_k-c\mathbf{\beta}_k=0$. Therefore, the geometric multiplicity of $\rho(\mathbf{A})$ is 1.

    Since $\mathbf{A}^\mathsf{T}$ is also positive, by the first part there exists a positive eigenvector $\mathbf{\gamma}$ of $\mathbf{A}^\mathsf{T}$ associated with the eigenvalue $\rho(\mathbf{A}^\mathsf{T})=1$. Write $\mathcal{U}=\text{span}(\mathbf{\gamma})$; its orthogonal complement $\mathcal{U}^\perp$ is $\mathbf{A}$-invariant, since $\mathbf{\gamma}^\mathsf{T}\mathbf{x}=0$ implies $\mathbf{\gamma}^\mathsf{T}\mathbf{A}\mathbf{x}=(\mathbf{A}^\mathsf{T}\mathbf{\gamma})^\mathsf{T}\mathbf{x}=\mathbf{\gamma}^\mathsf{T}\mathbf{x}=0$. Since $\mathbf{\alpha}>0$ and $\mathbf{\gamma}>0$, we have $\mathbf{\gamma}^\mathsf{T}\mathbf{\alpha}>0$, so $\mathbf{\alpha}\notin\mathcal{U}^\perp$ and $\mathbb{R}^n=\text{span}(\mathbf{\alpha})\oplus\mathcal{U}^\perp$. Choose a basis $\mathcal{E}=(\mathbf{\alpha}, e_1, \cdots, e_{n-1})$ of $\mathbb{R}^n$ with $(e_1, \cdots, e_{n-1})$ a basis of $\mathcal{U}^\perp$; then the matrix of $\mathbf{A}$ under $\mathcal{E}$ is

    $$\left(\begin{array}{cc}1 & \mathbf{0} \\ \mathbf{0} & \mathbf{B}\end{array}\right)$$

    If $1$ were an eigenvalue of $\mathbf{B}$, then $\mathbf{A}$ would have an eigenvector $\mathbf{\delta}\in\mathcal{U}^\perp$ associated with the eigenvalue $1$; since $\mathbf{\alpha}\notin\mathcal{U}^\perp$, $\mathbf{\delta}$ and $\mathbf{\alpha}$ would be linearly independent, contradicting the fact that the geometric multiplicity is 1. As the characteristic polynomial of $\mathbf{A}$ is $(x-1)\,\chi_{\mathbf{B}}(x)$, the algebraic multiplicity of $\rho(\mathbf{A})$ is also 1.

  3. Suppose there exists a positive eigenvector $\mathbf{\beta}$ corresponding to an eigenvalue $\lambda\neq 1$. Since $\mathbf{A}>0$ and $\mathbf{\beta}>0$, $\lambda\mathbf{\beta}=\mathbf{A}\mathbf{\beta}>0$, so $\lambda>0$, and $\lambda\leq\rho(\mathbf{A})=1$ gives $0<\lambda<1$. Let $c=\max_i \frac{\mathbf{\alpha}_i}{\mathbf{\beta}_i}=\frac{\mathbf{\alpha}_k}{\mathbf{\beta}_k}$; then $c\mathbf{\beta}-\mathbf{\alpha}\geq 0$ and $c\mathbf{\beta}-\mathbf{\alpha}\neq 0$ ($c\mathbf{\beta}\neq\mathbf{\alpha}$ since they correspond to different eigenvalues). However, since $\mathbf{A}>0$, $\lambda c\mathbf{\beta}-\mathbf{\alpha}=\mathbf{A}(c\mathbf{\beta}-\mathbf{\alpha})>0$, which means $\forall i\in[\![1, n]\!]\ \ \lambda c\mathbf{\beta}_i>\mathbf{\alpha}_i$; taking $i=k$ gives $c<\lambda c<c$ (using $\lambda<1$), a contradiction. Therefore, there are no other positive eigenvectors except positive multiples of $\mathbf{\alpha}$. $\quad \square$
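A small NumPy check of the theorem, and of Gelfand's formula used in the proof (the matrix is just a random positive example, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(4, 4))   # a strictly positive matrix

eigvals, eigvecs = np.linalg.eig(A)
rho = np.max(np.abs(eigvals))
k = np.argmax(np.abs(eigvals))

v = np.real(eigvecs[:, k])
v = v / v[np.argmax(np.abs(v))]          # normalize so the largest entry is +1

print("spectral radius        :", rho)
print("is rho an eigenvalue?  :", np.isclose(eigvals[k].imag, 0))
print("Perron eigenvector     :", v)     # all entries strictly positive
print("Gelfand: ||A^k||^(1/k) :",
      np.linalg.norm(np.linalg.matrix_power(A, 50), 2) ** (1 / 50))
```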

KL Divergence#

Definition

If $P$ and $Q$ are two probability distributions, then the Kullback–Leibler divergence from $Q$ to $P$ is defined as

$$D_{KL}(P||Q) = \sum_{x\in \mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) = \mathbb{E}_{X \sim P}\left[\log\left(\frac{P(X)}{Q(X)}\right)\right]$$
In general, $D_{KL}(P||Q)\neq D_{KL}(Q||P)$.

Positivity#

$D_{KL}(P||Q) \geq 0$

Proof
$$\begin{aligned} D_{KL}(P||Q) &= \sum_{x\in \mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) \\ &\geq -\log\left(\sum_{x\in \mathcal{X}} P(x)\cdot \frac{Q(x)}{P(x)}\right) \\ &= -\log\left(\sum_{x\in \mathcal{X}} Q(x)\right) \\ &= -\log 1 = 0 \end{aligned}$$

Here we use Jensen’s inequality, since $f(x)=-\log x$ is a convex function.
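A tiny numerical illustration of the definition, the positivity, and the asymmetry (the two distributions below are made-up examples):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

print(kl(P, Q))   # >= 0
print(kl(Q, P))   # >= 0, but different from kl(P, Q): KL is not symmetric
print(kl(P, P))   # == 0 exactly when the two distributions coincide
```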

Forward and Reverse KL Divergence#

If we use the distribution $Q_\theta(X)$ to approximate the distribution $P(X)$, then

  • Minimizing the forward KL divergence $\underset{\theta}{\argmin}\ D_{KL}(P||Q_\theta)$ is equivalent to performing maximum likelihood estimation of $\theta$ under $P$.

    Proof
    $$\begin{aligned} \argmin_\theta D_{KL}(P||Q_\theta) &= \argmin_\theta \sum_{x\in \mathcal{X}} P(x)\log P(x) - \sum_{x\in \mathcal{X}} P(x)\log Q_\theta(x) \\ &= \argmax_\theta\ \mathbb{E}_{X\sim P}[\log Q_\theta(X)] + \underbrace{\sum_{x\in \mathcal{X}} -P(x)\log P(x)}_{H(P)\text{ : entropy of } P\text{, constant in }\theta} \\ &= \argmax_\theta\ \mathbb{E}_{X\sim P}[\log Q_\theta(X)] \\ &\approx \argmax_\theta \frac{1}{m}\sum_{i=1}^{m}\log Q_\theta(x_i) = \argmax_\theta \prod_{i=1}^{m}Q_\theta(x_i) \\ &=\argmax_\theta\ \mathcal{L}(\theta) = \theta^{\text{MLE}} \end{aligned}$$

    Here we use $P_\text{data}=\{x_1, \cdots, x_m\}$ to denote the collected data; the expectation over $P$ is approximated by the empirical average over these samples.

    Wherever $P(x)$ has high probability, $Q(x)$ must also have high probability.

    *Figure (omitted): fitting a bimodal distribution $P$ with a unimodal distribution $Q_\theta$ by minimizing the forward KL divergence.*

    This property of the forward KL divergence is also often referred to as “zero avoiding”, because it tends to avoid having $Q(x)=0$ at any position where $P(x)>0$.

  • Minimizing the reverse KL divergence $\underset{\theta}{\argmin}\ D_{KL}(Q_\theta||P)$ is equivalent to requiring the fit to concentrate on a single mode as much as possible.

    Proof
    $$\argmin_\theta D_{KL}(Q_\theta||P) = \argmin_\theta\ \mathbb{E}_{X\sim Q_\theta}[-\log P(X)] - H(Q_\theta)$$

    The term $\mathbb{E}_{X\sim Q_\theta}[-\log P(X)]$ becomes very large whenever $Q_\theta$ places mass where $P(x)$ is close to $0$, so $Q_\theta$ is forced to be (near) zero wherever $P$ is small (“zero forcing”). A unimodal $Q_\theta$ therefore cannot cover both modes of $P$ together with the low-probability region between them, and instead concentrates on a single mode; the entropy term $H(Q_\theta)$ only encourages $Q_\theta$ to spread out within that mode. Hence minimizing the reverse KL divergence makes $Q_\theta$ fit $P$ while staying concentrated on one mode.

    Wherever $Q(x)$ has high probability, $P(x)$ must also have high probability.

    *Figure (omitted): fitting the same bimodal distribution $P$ with a unimodal distribution $Q_\theta$ by minimizing the reverse KL divergence.*

    The reverse KL divergence tends to minimize the difference between $Q_\theta(x)$ and $P(x)$ wherever $Q_\theta(x)>0$; a small numerical sketch of both behaviours follows below.
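The mode-covering vs. mode-seeking behaviour can be reproduced with a small experiment (a sketch with made-up numbers: a bimodal $P$ on a grid, a unimodal Gaussian $Q_\theta$, and `scipy.optimize.minimize` for both objectives):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target P on a grid: mixture of N(-3, 0.7^2) and N(3, 0.7^2)
x = np.linspace(-10, 10, 2001)
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)
p /= p.sum()

def q_theta(theta):
    """Unimodal Gaussian candidate Q_theta = N(mu, sigma^2), discretized on the grid."""
    mu, log_sigma = theta
    q = norm.pdf(x, mu, np.exp(log_sigma))
    return q / q.sum()

def kl(a, b):
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

forward = minimize(lambda t: kl(p, q_theta(t)), x0=[0.5, 0.0])   # D_KL(P || Q_theta)
reverse = minimize(lambda t: kl(q_theta(t), p), x0=[0.5, 0.0])   # D_KL(Q_theta || P)

print("forward KL fit: mu=%.2f sigma=%.2f" % (forward.x[0], np.exp(forward.x[1])))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (reverse.x[0], np.exp(reverse.x[1])))
# Expected (up to local minima of the optimizer): the forward KL fit puts mu near 0
# with a large sigma (covering both modes), while the reverse KL fit puts mu near
# one of the modes (+3 or -3) with a small sigma.
```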
