Math Misc
2025-01-16

A collection of miscellaneous math notes, updated irregularly (supposedly).

Circulant Matrix#

An $n \times n$ circulant matrix $\mathbf{C}$ takes the form

$$\mathbf{C} = \left[\begin{array}{ccccc}c_0 & c_{n-1} & \cdots & c_2 & c_1 \\ c_1 & c_0 & c_{n-1} & & c_2 \\ \vdots & c_1 & c_0 & \ddots & \vdots \\ c_{n-2} & & \ddots & \ddots & c_{n-1} \\ c_{n-1} & c_{n-2} & \cdots & c_1 & c_0\end{array}\right]$$

It has the determinant $\det(\mathbf{C}) = \prod_{i=0}^{n-1} f(\omega^i)$, where $f(x) = c_0 + c_1 x + \cdots + c_{n-1} x^{n-1}$ and $\omega = e^{2\pi i/n}$.

Proof

Let $\mathbf{\Omega}=(\omega^{(i-1)(j-1)})_{1\leq i, j \leq n} \in \mathbf{\mathcal{M}}_n(\mathbb{C})$, then

$$\begin{aligned} \det(\mathbf{C}\mathbf{\Omega}) &= \det\left(\left[\begin{array}{cccc}f(1) & f(\omega) & \cdots & f(\omega^{n-1}) \\ f(1) & \omega f(\omega) & \cdots & \omega^{n-1} f(\omega^{n-1}) \\ \vdots & \vdots & \ddots & \vdots \\ f(1) & \omega^{n-1}f(\omega) & \cdots & \omega^{(n-1)(n-1)}f(\omega^{n-1}) \end{array}\right]\right) \\ &=f(1)f(\omega)\cdots f(\omega^{n-1})\det(\mathbf{\Omega}) \\ &=\det(\mathbf{\Omega}) \prod_{i=0}^{n-1} f(\omega^i) \end{aligned}$$

Therefore, $\det(\mathbf{C})\det(\mathbf{\Omega}) = \det(\mathbf{\Omega})\prod_{i=0}^{n-1} f(\omega^i)$, and since the Vandermonde matrix $\mathbf{\Omega}$ is invertible, $\det(\mathbf{C}) = \prod_{i=0}^{n-1} f(\omega^i)$. $\quad \square$
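As a quick numerical sanity check (not part of the original proof; the vector `c` below is an arbitrary example), the formula can be verified with NumPy:

```python
import numpy as np

# First column of the circulant matrix: c_0, c_1, ..., c_{n-1}
c = np.array([2.0, 5.0, -1.0, 3.0])
n = len(c)

# Build C column by column: column j is the first column cyclically shifted down by j
C = np.column_stack([np.roll(c, j) for j in range(n)])

# det(C) = prod_i f(omega^i) with f(x) = c_0 + c_1 x + ... + c_{n-1} x^{n-1}
omega = np.exp(2j * np.pi / n)
f = lambda x: sum(c[k] * x**k for k in range(n))
prod = np.prod([f(omega**i) for i in range(n)])

print(np.linalg.det(C))   # determinant computed directly
print(prod.real)          # product formula (imaginary part is ~0)
```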

Gradient of Log-Determinant Function#

Let $\mathbf{X}\in \mathbf{S}_{++}$ be a positive definite matrix; then $\nabla (\log\det\mathbf{X})=\mathbf{X}^{-1}$.

Proof

Let $\mathbf{Y}=\mathbf{X}+\Delta \mathbf{X}$, then

$$\begin{aligned} \log\det\mathbf{Y} &= \log\det\left(\mathbf{X}^{\frac{1}{2}}(\mathbf{I}+\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}})\mathbf{X}^{\frac{1}{2}}\right) \\ &= \log\det\mathbf{X} + \log\det(\mathbf{I}+\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}) \\ &= \log\det\mathbf{X} + \sum_{i=1}^n \log(1+\lambda_i) \end{aligned}$$

where $\lambda_i$ are the eigenvalues of $\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}$. As $\Delta\mathbf{X}\rightarrow \mathbf{0}$, we have $\lambda_i = o(1)$, and thus $\log(1+\lambda_i)=\lambda_i + o(\lambda_i)$. Therefore

$$\begin{aligned} \log\det\mathbf{Y} &= \log\det\mathbf{X} + \sum_{i=1}^n \lambda_i + o(\|\Delta\mathbf{X}\|) \\ &=\log\det\mathbf{X}+\text{tr}(\mathbf{X}^{-\frac{1}{2}}\Delta\mathbf{X}\mathbf{X}^{-\frac{1}{2}}) + o(\|\Delta\mathbf{X}\|) \\ &=\log\det\mathbf{X}+\text{tr}(\mathbf{X}^{-1}\Delta\mathbf{X}) + o(\|\Delta\mathbf{X}\|) \end{aligned}$$

Thus, $\mathrm{d}(\log\det\mathbf{X}) = \text{tr}(\mathbf{X}^{-1}\mathrm{d}\mathbf{X})\Longrightarrow \nabla (\log\det\mathbf{X})=\mathbf{X}^{-1}$. $\quad \square$
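A finite-difference sketch of this identity (the matrix and the perturbation below are arbitrary test data, not from the derivation above):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5

# Random symmetric positive definite X
B = rng.standard_normal((n, n))
X = B @ B.T + n * np.eye(n)

# Claim: grad log det X = X^{-1}, i.e. d(log det X) ≈ tr(X^{-1} dX)
dX = 1e-6 * rng.standard_normal((n, n))
dX = (dX + dX.T) / 2          # keep the perturbation symmetric

lhs = np.linalg.slogdet(X + dX)[1] - np.linalg.slogdet(X)[1]
rhs = np.trace(np.linalg.inv(X) @ dX)
print(lhs, rhs)               # the two values agree to first order
```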

Barbalat’s Lemma#

If $f:\mathbb{R}^+ \rightarrow \mathbb{R}^+$ is uniformly continuous with $\int_a^{+\infty} f(t)\,dt < +\infty$, then $\lim_{t\rightarrow +\infty} f(t) = 0$.

For series we have $\sum_{n=1}^{+\infty} u_n < +\infty \Rightarrow \lim_{n\rightarrow +\infty} u_n = 0$. However, for functions $\int_a^{+\infty} f(t)\,dt < +\infty \nRightarrow \lim_{t\rightarrow +\infty} f(t) = 0$; the uniform continuity is necessary.

  • Example 1. $f(x)=\sin(x^2)$
    Since $\int_1^{+\infty} \sin(x^2)\,dx \overset{u=x^2}{=} \int_1^{+\infty} \frac{\sin u}{2\sqrt{u}}\,du$, by Dirichlet’s test the integral converges. However, $\lim_{x\rightarrow +\infty} \sin(x^2)$ does not exist.
  • Example 2. $f(x)=\frac{x}{1+x^6\sin^2 x} > 0$
    Define $F(u)=\int_0^u f(x)\,dx$ and $a_k=\int_{(k-1)\pi}^{k\pi} f(x)\,dx$; then $F(u)$ is monotonically increasing on $\mathbb{R}^+$ and $F(n\pi)=\sum_{k=1}^{n}a_k$. We have
    $$\begin{aligned} a_k &\leq \int_{(k-1)\pi}^{k\pi} \frac{k\pi}{1+(k-1)^6\pi^6\sin^2x}\,dx \\ &= 2k\pi \int_0^{\frac{\pi}{2}} \frac{dx}{1+(k-1)^6\pi^6\sin^2x} \\ &\leq 2k\pi \int_0^{\frac{\pi}{2}} \frac{dx}{1+(k-1)^6\pi^6(\frac{2}{\pi} x)^2} \\ &= 2k\pi\int_0^{\frac{\pi}{2}} \frac{dx}{1+{[2(k-1)^3\pi^2x]}^2} \\ &= \frac{k}{(k-1)^3\pi}\int_0^{(k-1)^3\pi^3}\frac{dt}{1+t^2} \\ &\leq \frac{k}{(k-1)^3\pi}\int_0^{+\infty}\frac{dt}{1+t^2} \\ &=\frac{k}{2(k-1)^3} \underset{+\infty}{\sim} \frac{1}{2k^2} \end{aligned}$$
    Here we use the inequality $\sin x \geq \frac{2x}{\pi}$ on $[0, \frac{\pi}{2}]$ and the fact that $\int_0^{+\infty}\frac{dt}{1+t^2} = \frac{\pi}{2}$. Therefore, $\lim_{n\rightarrow +\infty} F(n\pi)$ exists. Since $F(x)$ is increasing, $\lim_{x\rightarrow +\infty} F(x)< +\infty$, but $\lim_{x\rightarrow +\infty} f(x)$ doesn’t exist.
Proof

If $f(x)\nrightarrow 0$ as $x\rightarrow +\infty$, then there exist $\epsilon>0$ and a sequence $(x_n)$ with $x_n\rightarrow+\infty$ such that $\forall n\in \mathbb{N},\ f(x_n)>\epsilon$. Since $f$ is uniformly continuous, $\exists \delta>0$ such that $\forall n\in\mathbb{N},\ \forall y>0,\ |x_n-y|<\delta \Rightarrow |f(x_n)-f(y)|<\frac{\epsilon}{2}$. So $\forall x\in[x_n, x_n+\delta]$, we have $f(x)\geq f(x_n)-|f(x_n)-f(x)|>\frac{\epsilon}{2}$. Therefore

$$\left|\int_a^{x_n+\delta} f(t)\,dt - \int_a^{x_n}f(t)\,dt\right| = \int_{x_n}^{x_n+\delta} f(t)\,dt > \frac{\epsilon\delta}{2}>0$$

However, since $\int_a^{+\infty} f(t)\,dt$ exists, the LHS converges to $0$ as $n\rightarrow +\infty$, yielding a contradiction. $\quad \square$
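To make Example 1 concrete, here is a rough numerical sketch (grid and range chosen ad hoc): the partial integrals of $\sin(x^2)$ hover around a fixed value while the integrand keeps oscillating.

```python
import numpy as np

# f(x) = sin(x^2): the integral over [1, +inf) converges, but f has no limit.
dx = 1e-4
x = np.arange(1.0, 40.0, dx)        # fine grid (oscillation period ~ pi/x near x)
f = np.sin(x**2)
F = np.cumsum(f) * dx               # partial integrals  int_1^X f(t) dt

for X in (10.0, 20.0, 30.0, 39.0):
    i = np.searchsorted(x, X)
    print(f"X={X:5.1f}  integral≈{F[i]:.4f}  f(X)={f[i]: .3f}")
# The integral column settles near a constant (up to O(1/X) ripples);
# the f(X) column keeps oscillating between -1 and 1.
```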

Perron-Frobenius Theorem#

Definition

Positive Matrix#

A positive (non-negative) matrix $\mathbf{A}\in \mathbf{\mathcal{M}}_{n,m}(\mathbb{R})$ is a matrix with all entries $a_{ij}$ positive (non-negative). If $\mathbf{A}>0$, $\mathbf{\alpha} \geq 0$ and $\mathbf{\alpha}\neq 0$, then $\mathbf{A}\mathbf{\alpha}>0$.

Definition

Spectral Radius#

The spectral radius of a matrix $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{R})$ is defined as $\rho(\mathbf{A})=\max\{|\lambda|:\lambda \text{ is an eigenvalue of } \mathbf{A}\}$.

Let $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{R})$ be a positive matrix. Then

  1. $\mathbf{A}$ has a positive eigenvalue $\rho(\mathbf{A})$ associated with a positive eigenvector $\mathbf{v}$ (the Perron–Frobenius eigenvalue).
  2. The Perron–Frobenius eigenvalue is simple (i.e. both the algebraic and the geometric multiplicity of $\rho(\mathbf{A})$ are 1).
  3. There are no other positive eigenvectors except positive multiples of $\mathbf{v}$.
Lemma

Gelfand’s formula#

If $\mathbf{A}\in \mathbf{\mathcal{M}}_n(\mathbb{C})$ and $\mathbf{A}\neq 0$, then $\rho(\mathbf{A})=\lim_{k\rightarrow +\infty} \|\mathbf{A}^k\|_2^{\frac{1}{k}}$, with $\|\cdot\|_2$ the spectral norm.

Proof

We may suppose that $\rho(\mathbf{A})=1$: since $n\,\rho(\mathbf{A})\geq|\text{tr}(\mathbf{A})|>0$, we can replace $\mathbf{A}$ by $\mathbf{A}/\rho(\mathbf{A})$.

  1. Let $\lambda$ be an eigenvalue of $\mathbf{A}$ with $|\lambda|=1=\rho(\mathbf{A})$ and $\mathbf{v}$ a corresponding eigenvector. Consider $\mathbf{\alpha}=(|\mathbf{v}_i|)_i\geq 0$, $\mathbf{\alpha}\neq 0$; then we have

    $$\begin{aligned} (\mathbf{A}\mathbf{\alpha})_i &= \sum_j a_{ij}|\mathbf{v}_j| \geq \left|\sum_j a_{ij}\mathbf{v}_j\right| = |\lambda \mathbf{v}_i| \\ &= |\lambda|\,|\mathbf{v}_i| = |\mathbf{v}_i| \end{aligned}$$

    which implies $(\mathbf{A}\mathbf{\alpha})_i\geq \mathbf{\alpha}_i$ for every $i$, i.e. $\mathbf{A}\mathbf{\alpha}\geq \mathbf{\alpha}$. Now we prove that $\mathbf{A}\mathbf{\alpha}\neq\mathbf{\alpha}$ is impossible. Suppose that $\mathbf{A}\mathbf{\alpha}\geq\mathbf{\alpha}$ with $\mathbf{A}\mathbf{\alpha}\neq\mathbf{\alpha}$; then

    $$\mathbf{A}\mathbf{\alpha}-\mathbf{\alpha}\geq 0,\ \mathbf{A}\mathbf{\alpha}-\mathbf{\alpha}\neq 0 \Rightarrow \mathbf{A}(\mathbf{A}\mathbf{\alpha}-\mathbf{\alpha})>0 \Rightarrow \mathbf{A}^2\mathbf{\alpha}>\mathbf{A}\mathbf{\alpha}$$

    Thus, $\exists\, \epsilon>0$ such that $\mathbf{A}^2\mathbf{\alpha}>(1+\epsilon)\mathbf{A}\mathbf{\alpha}$, and by applying $\mathbf{A}$ repeatedly we get $\mathbf{A}^k\mathbf{\alpha}>(1+\epsilon)^{k-1}\mathbf{A}\mathbf{\alpha}$, which implies $\|\mathbf{A}^k\mathbf{\alpha}\|_2 > (1+\epsilon)^{k-1}\|\mathbf{A}\mathbf{\alpha}\|_2$. Therefore

    $$\lim_{k\rightarrow +\infty}\|\mathbf{A}^k\|_2^{\frac{1}{k}}\geq \lim_{k\rightarrow +\infty}\left(\frac{\|\mathbf{A}^k\mathbf{\alpha}\|_2}{\|\mathbf{\alpha}\|_2}\right)^{\frac{1}{k}} \geq \lim_{k\rightarrow +\infty}\left((1+\epsilon)^{k-1}\,\frac{\|\mathbf{A}\mathbf{\alpha}\|_2}{\|\mathbf{\alpha}\|_2}\right)^{\frac{1}{k}} = 1+\epsilon > 1=\rho(\mathbf{A})$$

    which contradicts Gelfand’s formula. Therefore $\mathbf{A}\mathbf{\alpha}=\mathbf{\alpha}$, and since $\mathbf{A}>0$ and $\mathbf{\alpha}\geq 0,\ \mathbf{\alpha}\neq 0$, we get $\mathbf{\alpha}=\mathbf{A}\mathbf{\alpha}>0$: there exists a positive eigenvector $\mathbf{\alpha}$ associated with the eigenvalue $\rho(\mathbf{A})$.

  2. Assume that there exists an eigenvector $\mathbf{\beta}$ that is linearly independent of $\mathbf{\alpha}$ and corresponds to the eigenvalue $\rho(\mathbf{A})=1$ (we may take $\mathbf{\beta}$ real; if $\mathbf{\beta}$ has no positive entry, replace it by $-\mathbf{\beta}$). Let $c=\min_{\mathbf{\beta}_i>0} \frac{\mathbf{\alpha}_i}{\mathbf{\beta}_i}=\frac{\mathbf{\alpha}_k}{\mathbf{\beta}_k}$; then $c>0$, $\mathbf{\alpha}-c\mathbf{\beta}\geq 0$ and $\mathbf{\alpha}-c\mathbf{\beta}\neq 0$. However, since $\mathbf{A}>0$, $\mathbf{\alpha}-c\mathbf{\beta}=\mathbf{A}(\mathbf{\alpha}-c\mathbf{\beta})>0$, which contradicts the fact that $\mathbf{\alpha}_k-c\mathbf{\beta}_k=0$. Therefore, the geometric multiplicity of $\rho(\mathbf{A})$ is 1.

    Since $\mathbf{A}^\mathsf{T}$ is also positive, by the first part there exists a positive eigenvector $\mathbf{\gamma}$ of $\mathbf{A}^\mathsf{T}$ associated with the eigenvalue $\rho(\mathbf{A}^\mathsf{T})=1$. Write $\mathcal{U}=\text{span}(\mathbf{\gamma})$; its orthogonal complement $\mathcal{U}^\perp$ is $\mathbf{A}$-invariant, since $\mathbf{\gamma}^\mathsf{T}\mathbf{x}=0$ implies $\mathbf{\gamma}^\mathsf{T}\mathbf{A}\mathbf{x}=(\mathbf{A}^\mathsf{T}\mathbf{\gamma})^\mathsf{T}\mathbf{x}=\mathbf{\gamma}^\mathsf{T}\mathbf{x}=0$. Since $\mathbf{\alpha}>0$ and $\mathbf{\gamma}>0$, we have $\mathbf{\gamma}^\mathsf{T}\mathbf{\alpha}>0$, so $\mathbf{\alpha}\notin\mathcal{U}^\perp$ and $\mathbb{R}^n=\text{span}(\mathbf{\alpha})\oplus\mathcal{U}^\perp$. Choose a basis $\mathcal{E}=(\mathbf{\alpha}, e_1, \cdots, e_{n-1})$ of $\mathbb{R}^n$ with $(e_1, \cdots, e_{n-1})$ a basis of $\mathcal{U}^\perp$; then the matrix of $\mathbf{A}$ under $\mathcal{E}$ is

    $$\left(\begin{array}{cc}1 & \mathbf{0} \\ \mathbf{0} & \mathbf{B}\end{array}\right)$$

    If $1$ were an eigenvalue of $\mathbf{B}$, then $\mathbf{A}$ would have an eigenvector $\mathbf{\delta}\in\mathcal{U}^\perp$ associated with the eigenvalue $1$; since $\mathbf{\alpha}\notin\mathcal{U}^\perp$, $\mathbf{\delta}$ and $\mathbf{\alpha}$ would be linearly independent, contradicting the fact that the geometric multiplicity is 1. As the characteristic polynomial of $\mathbf{A}$ is $(x-1)\,\chi_{\mathbf{B}}(x)$, the algebraic multiplicity of $\rho(\mathbf{A})$ is also 1.

  3. Suppose there exists a positive eigenvector $\mathbf{\beta}$ corresponding to an eigenvalue $\lambda\neq 1$. Since $\mathbf{A}>0$ and $\mathbf{\beta}>0$, $\lambda\mathbf{\beta}=\mathbf{A}\mathbf{\beta}>0$, so $\lambda>0$, and $\lambda\leq\rho(\mathbf{A})=1$ gives $0<\lambda<1$. Let $c=\max_i \frac{\mathbf{\alpha}_i}{\mathbf{\beta}_i}=\frac{\mathbf{\alpha}_k}{\mathbf{\beta}_k}$; then $c\mathbf{\beta}-\mathbf{\alpha}\geq 0$ and $c\mathbf{\beta}-\mathbf{\alpha}\neq 0$ ($c\mathbf{\beta}\neq\mathbf{\alpha}$ since they correspond to different eigenvalues). However, since $\mathbf{A}>0$, $\lambda c\mathbf{\beta}-\mathbf{\alpha}=\mathbf{A}(c\mathbf{\beta}-\mathbf{\alpha})>0$, which means $\forall i\in[\![1, n]\!]\ \ \lambda c\mathbf{\beta}_i>\mathbf{\alpha}_i$; taking $i=k$ gives $c<\lambda c<c$ (using $\lambda<1$), a contradiction. Therefore, there are no other positive eigenvectors except positive multiples of $\mathbf{\alpha}$. $\quad \square$
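A small NumPy check of the theorem, and of Gelfand's formula used in the proof (the matrix is just a random positive example, not from the post):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.uniform(0.1, 1.0, size=(4, 4))   # a strictly positive matrix

eigvals, eigvecs = np.linalg.eig(A)
rho = np.max(np.abs(eigvals))
k = np.argmax(np.abs(eigvals))

v = np.real(eigvecs[:, k])
v = v / v[np.argmax(np.abs(v))]          # normalize so the largest entry is +1

print("spectral radius        :", rho)
print("is rho an eigenvalue?  :", np.isclose(eigvals[k].imag, 0))
print("Perron eigenvector     :", v)     # all entries strictly positive
print("Gelfand: ||A^k||^(1/k) :",
      np.linalg.norm(np.linalg.matrix_power(A, 50), 2) ** (1 / 50))
```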

KL Divergence#

Definition

If $P$ and $Q$ are two probability distributions, then the Kullback–Leibler divergence from $Q$ to $P$ is defined as

$$D_{KL}(P||Q) = \sum_{x\in \mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) = \mathbb{E}_{X \sim P}\left[\log\left(\frac{P(X)}{Q(X)}\right)\right]$$
In general, $D_{KL}(P||Q)\neq D_{KL}(Q||P)$.

Positivity#

$D_{KL}(P||Q) \geq 0$

Proof
$$\begin{aligned} D_{KL}(P||Q) &= \sum_{x\in \mathcal{X}} P(x)\log\left(\frac{P(x)}{Q(x)}\right) \\ &\geq -\log\left(\sum_{x\in \mathcal{X}} P(x)\cdot \frac{Q(x)}{P(x)}\right) \\ &= -\log\left(\sum_{x\in \mathcal{X}} Q(x)\right) \\ &= -\log 1 = 0 \end{aligned}$$

Here we use Jensen’s inequality, since $f(x)=-\log x$ is a convex function.
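A tiny numerical illustration of the definition, the positivity, and the asymmetry (the two distributions below are made-up examples):

```python
import numpy as np

def kl(p, q):
    """D_KL(p || q) for discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return np.sum(np.where(p > 0, p * np.log(p / q), 0.0))

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.2, 0.3, 0.5])

print(kl(P, Q))   # >= 0
print(kl(Q, P))   # >= 0, but different from kl(P, Q): KL is not symmetric
print(kl(P, P))   # == 0 exactly when the two distributions coincide
```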

Forward and Reverse KL Divergence#

If we use the distribution $Q_\theta(X)$ to approximate the distribution $P(X)$, then

  • Minimizing the forward KL divergence $\underset{\theta}{\argmin}\ D_{KL}(P||Q_\theta)$ is equivalent to performing maximum likelihood estimation of $\theta$ under $P$.

    Proof
    $$\begin{aligned} \argmin_\theta D_{KL}(P||Q_\theta) &= \argmin_\theta \sum_{x\in \mathcal{X}} P(x)\log P(x) - \sum_{x\in \mathcal{X}} P(x)\log Q_\theta(x) \\ &= \argmax_\theta\ \mathbb{E}_{X\sim P}[\log Q_\theta(X)] + \underbrace{\sum_{x\in \mathcal{X}} -P(x)\log P(x)}_{H(P)\text{ : entropy of } P\text{, constant in }\theta} \\ &= \argmax_\theta\ \mathbb{E}_{X\sim P}[\log Q_\theta(X)] \\ &\approx \argmax_\theta \frac{1}{m}\sum_{i=1}^{m}\log Q_\theta(x_i) = \argmax_\theta \prod_{i=1}^{m}Q_\theta(x_i) \\ &=\argmax_\theta\ \mathcal{L}(\theta) = \theta^{\text{MLE}} \end{aligned}$$

    Here we use $P_\text{data}=\{x_1, \cdots, x_m\}$ to denote the collected data; the expectation over $P$ is approximated by the empirical average over these samples.

    Wherever $P(x)$ has high probability, $Q(x)$ must also have high probability.

    *Figure (omitted): fitting a bimodal distribution $P$ with a unimodal distribution $Q_\theta$ by minimizing the forward KL divergence.*

    This property of the forward KL divergence is also often referred to as “zero avoiding”, because it tends to avoid having $Q(x)=0$ at any position where $P(x)>0$.

  • Minimizing the reverse KL divergence $\underset{\theta}{\argmin}\ D_{KL}(Q_\theta||P)$ is equivalent to requiring the fit to concentrate on a single mode as much as possible.

    Proof
    $$\argmin_\theta D_{KL}(Q_\theta||P) = \argmin_\theta\ \mathbb{E}_{X\sim Q_\theta}[-\log P(X)] - H(Q_\theta)$$

    The term $\mathbb{E}_{X\sim Q_\theta}[-\log P(X)]$ becomes very large whenever $Q_\theta$ places mass where $P(x)$ is close to $0$, so $Q_\theta$ is forced to be (near) zero wherever $P$ is small (“zero forcing”). A unimodal $Q_\theta$ therefore cannot cover both modes of $P$ together with the low-probability region between them, and instead concentrates on a single mode; the entropy term $H(Q_\theta)$ only encourages $Q_\theta$ to spread out within that mode. Hence minimizing the reverse KL divergence makes $Q_\theta$ fit $P$ while staying concentrated on one mode.

    Wherever $Q(x)$ has high probability, $P(x)$ must also have high probability.

    *Figure (omitted): fitting the same bimodal distribution $P$ with a unimodal distribution $Q_\theta$ by minimizing the reverse KL divergence.*

    The reverse KL divergence tends to minimize the difference between $Q_\theta(x)$ and $P(x)$ wherever $Q_\theta(x)>0$; a small numerical sketch of both behaviours follows below.
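The mode-covering vs. mode-seeking behaviour can be reproduced with a small experiment (a sketch with made-up numbers: a bimodal $P$ on a grid, a unimodal Gaussian $Q_\theta$, and `scipy.optimize.minimize` for both objectives):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Bimodal target P on a grid: mixture of N(-3, 0.7^2) and N(3, 0.7^2)
x = np.linspace(-10, 10, 2001)
p = 0.5 * norm.pdf(x, -3, 0.7) + 0.5 * norm.pdf(x, 3, 0.7)
p /= p.sum()

def q_theta(theta):
    """Unimodal Gaussian candidate Q_theta = N(mu, sigma^2), discretized on the grid."""
    mu, log_sigma = theta
    q = norm.pdf(x, mu, np.exp(log_sigma))
    return q / q.sum()

def kl(a, b):
    eps = 1e-12
    return np.sum(a * (np.log(a + eps) - np.log(b + eps)))

forward = minimize(lambda t: kl(p, q_theta(t)), x0=[0.5, 0.0])   # D_KL(P || Q_theta)
reverse = minimize(lambda t: kl(q_theta(t), p), x0=[0.5, 0.0])   # D_KL(Q_theta || P)

print("forward KL fit: mu=%.2f sigma=%.2f" % (forward.x[0], np.exp(forward.x[1])))
print("reverse KL fit: mu=%.2f sigma=%.2f" % (reverse.x[0], np.exp(reverse.x[1])))
# Expected (up to local minima of the optimizer): the forward KL fit puts mu near 0
# with a large sigma (covering both modes), while the reverse KL fit puts mu near
# one of the modes (+3 or -3) with a small sigma.
```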
