stat._.jun

[Paper Review] Online PAC-Bayes Learning

Mon, 04 May 2026 03:43:09 GMT

PAC-Bayes

Learning sample $\mathcal{S} := \{z_1, z_2, \dots, z_n\} \sim \mathcal{D}^n$
Hypothesis $h : \mathcal{X} \to \mathcal{Y}$, $h \in \mathcal{H}$
Prior $P$, Posterior $Q$
Loss function $\ell : \mathcal{H} \times \mathcal{Z} \to \mathbb{R}$

In general, a PAC-Bayes bound takes the following form. With probability at least $1-\delta$, $$ \mathbb{E}_{h \sim Q}[\mathcal{R}_\mathcal{D}(h)] \le \mathbb{E}_{h \sim Q}[\hat{\mathcal{R}}_\mathcal{S}(h)] + \text{Complexity Terms} $$

Online PAC-Bayes

Now consider a setting where the prior $P$ and the posterior $Q$ change over time as data accumulate.

Sequence of Priors $(P_i)_{1 \le i \le m}$: $P_1 = P$ is a data-free prior, and $P_i$ may depend on $\mathcal{F}_{i-1}$ (the information available before step $i$).
Sequence of Posteriors $(Q_i)_{1 \le i \le m}$

The paper's main result, the Online PAC-Bayes bound, is as follows.

Moreover, it is known that the $Q_{i}$ minimizing the upper bound is given by a Gibbs posterior. In general, computing a Gibbs posterior is very expensive. One could approximate it variationally (e.g. Cherief-Abdellatif, 2019), but this paper proposes a different approach.
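To make the Gibbs posterior concrete, here is a minimal sketch on a finite hypothesis set, where it can be computed exactly. It assumes the standard form $Q(h) \propto P(h)\exp(-\lambda \ell(h))$, which minimizes $\lambda \mathbb{E}_Q[\ell] + \mathrm{KL}(Q \| P)$; the prior, the losses, and $\lambda$ below are made-up values for illustration and do not come from the paper.

```python
import numpy as np

def gibbs_posterior(prior, losses, lam):
    """Q(h) proportional to P(h) * exp(-lam * loss(h)) on a finite hypothesis set."""
    log_w = np.log(prior) - lam * losses
    log_w -= log_w.max()                      # stabilise the exponentials
    w = np.exp(log_w)
    return w / w.sum()

def objective(q, prior, losses, lam):
    """lam * E_Q[loss] + KL(Q || P), the quantity the Gibbs posterior minimises."""
    return lam * np.dot(q, losses) + np.sum(q * np.log(q / prior))

prior = np.array([0.25, 0.25, 0.25, 0.25])    # hypothetical data-free prior
losses = np.array([0.9, 0.4, 0.1, 0.6])       # hypothetical losses of 4 hypotheses
lam = 2.0                                     # hypothetical temperature parameter

q_gibbs = gibbs_posterior(prior, losses, lam)
q_other = np.array([0.1, 0.2, 0.6, 0.1])      # an arbitrary competing posterior
```

On a finite set the normalizing constant is just a sum; the cost discussed in the post comes from replacing that sum with an integral over a large hypothesis space.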

Disintegrated Bounds

As noted above, the posterior is obtained by minimizing the upper bound. However, this minimization is computationally expensive, because the following integral is hard to evaluate. $$ \mathbb{E}_{h_i \sim Q}[\ell (h_i, z_i)] = \int \ell(h_i, z_i) \, d Q(h_i) $$ In addition, the normalizing term of the Gibbs posterior requires the following computation. $$ \mathbb{E}_{h_i \sim Q}[\exp(-\lambda \ell(h_i, z_i))] $$

So, instead of minimizing the loss averaged over the entire hypothesis space, the idea is to minimize the loss evaluated at sampled hypotheses.

To this end, the paper proves a bound of the following form for a sampled hypothesis sequence $(h_1, \dots, h_m)$.

$\Psi$ and $\Phi$ depend on which disintegrated PAC-Bayes inequality is used, and the paper proposes two options: a KL-based one from (Rivasplata, 2020) and a Renyi-based one from (Viallard, 2021).

One then proceeds by minimizing this upper bound at every step.

Key contribution?

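I did not follow everything, but my understanding is that the key contribution is adapting the existing PAC-Bayesian bound to the online setting (Thm 2.3). More specifically, the important contribution seems to be proving a bound in a setting that allows priors depending on past information and posteriors that change over time.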

Haddouche, Maxime, and Benjamin Guedj. "Online pac-bayes learning." Advances in Neural Information Processing Systems 35 (2022): 25725-25738.

[Paper Review] RL with KL penalties is better viewed as Bayesian inference

Sun, 03 May 2026 10:19:31 GMT

Case : Language model

$\mathcal{X}$ : set of sequences of tokens from some vocabulary
$\pi$ : Language Model (LM)
$\pi_0$ : Pretrained LM
$\pi(x)$ : probability of sequence $x \in \mathcal{X}$
$r(x)$ : Reward (Human preferences)

The objective commonly used in Reinforcement Learning from Human Feedback (RLHF) is the following. $$ \max_{\pi} \left\{ \mathbb{E}_{x \sim \pi}[r(x)] - \beta \, \mathrm{KL}(\pi \,\|\, \pi_0) \right\} $$

By the Donsker-Varadhan lemma (or duality formula), the maximizer $\pi^*$ of this objective is known to be the following Gibbs posterior. (The paper's appendix makes a mild assumption for this.) $$ \pi^*(x) \propto \exp( r(x) / \beta) \, \pi_0(x) $$
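As a sanity check of this closed form, the sketch below works on a toy space of four "sequences", where the tilted distribution can be normalized exactly. The values of $\pi_0$, $r$, and $\beta$ are invented for illustration. It compares the objective at $\pi^*$ with the objective at randomly drawn distributions, and also checks the identity that the optimal value equals $\beta \log \sum_x \pi_0(x) e^{r(x)/\beta}$.

```python
import numpy as np

def kl(p, q):
    """KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

def rlhf_objective(pi, pi0, r, beta):
    """E_{x ~ pi}[r(x)] - beta * KL(pi || pi0)."""
    return float(np.dot(pi, r)) - beta * kl(pi, pi0)

pi0 = np.array([0.5, 0.3, 0.15, 0.05])    # hypothetical pretrained LM
r = np.array([0.0, 1.0, 2.0, 3.0])        # hypothetical rewards
beta = 0.5

w = pi0 * np.exp(r / beta)                # exponential tilting of pi0
pi_star = w / w.sum()

rng = np.random.default_rng(0)
candidates = rng.dirichlet(np.ones(4), size=1000)
best_random = max(rlhf_objective(p, pi0, r, beta) for p in candidates)
```

With a real vocabulary the sum defining the normalizer runs over all token sequences, which is exactly what makes the exact posterior unavailable.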

In general, however, this posterior cannot be computed, because exponential tilting over the entire sequence space $\mathcal{X}$ is intractable. A natural approach is therefore to consider an LM $\pi_{\theta}$ parametrized by $\theta$ and to rewrite the objective as follows.

$$ \max_{\theta} \left\{ \mathbb{E}_{x \sim \pi_{\theta}}[r(x)] - \beta \, \mathrm{KL}(\pi_{\theta} \,\|\, \pi_0) \right\} $$

The figure in the paper makes this easier to understand.

Without KL Penalty

Without the KL term, the problem becomes pure expected-reward maximization. The policy therefore collapses to a Dirac delta that puts all of its mass on the highest-reward sequence.

Key contribution?

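As the title says, the main contribution seems to be translating RLHF into the language of Bayesian inference and reframing it.

When reading about RLHF, I had wondered whether it could be interpreted as variational Bayes, with $\pi_0$ playing the role of the prior and the negative reward playing the role of the loss. It turns out there is a paper that lays this out.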

Korbak, Tomasz, Ethan Perez, and Christopher Buckley. "RL with KL penalties is better viewed as Bayesian inference." Findings of the Association for Computational Linguistics: EMNLP 2022. 2022.

Extreme Value Theorem

Mon, 16 Feb 2026 15:02:24 GMT

Thm. (EVT) If $f: K \to \mathbb{R}$ is continuous on a compact set $K \subseteq \mathbb{R}$, then $f$ attains a maximum and a minimum value.

Proof. Since $K$ is compact and $f$ is continuous, $f(K)$ is compact as well. Since $f(K)$ is bounded, we can set $\alpha = \sup f(K)$, and $\alpha \in f(K)$ because $f(K)$ is closed. Thus $\exists x_1 \in K$ such that $\alpha = f(x_1)$. The minimum is handled in the same way with $\inf f(K)$.

Thm. Let $\Theta_0 = (a, b) \subseteq \mathbb{R}$ be an open interval and let $\ell \in C^2(\Theta_0)$ satisfy 1) $\nabla^2 \ell(\theta) <0$ for any $\theta \in \Theta_0$ and 2) $\lim_{\theta \to \partial \Theta_0} \ell(\theta) = -\infty$. Then, $\exists ! \, \hat \theta \in \Theta_0$ such that $\nabla \ell(\hat \theta) = 0$.

Sketch of the proof.

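Use the EVT to show that $\ell$ attains its maximum over $\Theta_0$ at some $\hat \theta$. (The trick seems to be choosing the compact set well: by condition 2, $\ell$ falls below $\ell(\theta^\circ)$ outside some closed interval $[a', b'] \subset \Theta_0$ containing a fixed $\theta^\circ$, so the maximum over $[a', b']$ is the maximum over $\Theta_0$.)
Strict concavity follows automatically from condition 1, and it can be used to show that the solution is unique.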

Open Set

Sat, 14 Feb 2026 10:41:49 GMT

Def. (Open Set) A set $O \subseteq \mathbb{R}$ is open if $\forall a \in O$, $\exists \epsilon > 0$ such that $V_{\epsilon}(a) \subseteq O$.

Thm. (Intersection and Union of Open Sets)

1. The union of an arbitrary collection of open sets is open.
2. The intersection of a finite collection of open sets is open.

Proof for 1. Consider $O = \cup_{\lambda \in \Lambda} O_{\lambda}$ where $O_{\lambda}$ is open for any $\lambda \in \Lambda$. Let $a \in O$. Then $a \in O_{\lambda '}$ for some $\lambda ' \in \Lambda$. Since $O_{\lambda '}$ is open, $\exists \epsilon > 0$ such that $V_{\epsilon}(a) \subseteq O_{\lambda '} \subseteq O$. (QED)

Proof for 2. Consider $O = \cap_{n \in [N]}O_n$ where $O_n$ is open for any $n \in [N]$. Let $a \in O$. Then for each $n \in [N]$, $\exists \epsilon_n > 0$ such that $V_{\epsilon_n} (a) \subseteq O_n$. Take $\epsilon = \min_{n \in [N]} \epsilon_n > 0$. Then, $\forall n \in [N], V_{\epsilon}(a) \subseteq V_{\epsilon_n}(a) \subseteq O_n$, so $V_{\epsilon}(a) \subseteq O$. (QED)

Def. (Limit Point) $x \in \mathbb{R}$ is a limit point of $A$ if $$ \forall \epsilon >0, \exists a \in A \; (a \neq x) \textup{ s.t. } a \in V_{\epsilon}(x). $$

Thm. $x$ is a limit point of $A$ if and only if $\lim_{n \to \infty} a_n = x$ for some sequence $(a_n) \subseteq A$ with $a_n \neq x$ for all $n$.

Proof ($\Rightarrow$). For each $n$, take $\epsilon = 1/n$. Then, we can choose $a_n \in A$ $(a_n \neq x)$ such that $a_n \in V_{1/n}(x)$. Now for arbitrary $\epsilon > 0$, choose $N \in \mathbb{N}$ so that $1/N < \epsilon$. $$ n \geq N \Rightarrow d(a_n, x) < 1/n \leq 1/N < \epsilon $$

Proof ($\Leftarrow$). Straightforward from the $\epsilon$-$N$ definition of convergence: for any $\epsilon > 0$ there is an $N$ with $a_N \in V_{\epsilon}(x)$, and $a_N \in A$, $a_N \neq x$.

Generalized Least Squares (GLS)

Tue, 10 Feb 2026 18:30:41 GMT

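Unlike the linear model considered so far, consider the following model. $$ \begin{aligned} Y \sim N(X\beta , \sigma^2V) \end{aligned} $$ Equivalently, $Y = X\beta + \epsilon$ with $\epsilon \sim N(0, \sigma^2 V)$. Assume $V$ is known and only $\sigma^2$ is unknown. How do we estimate $\beta$? First, we would like to find a least squares estimator. Whenever the model is varied like this, the way to use the LSE is not to apply it directly but to first bring the model into a form where it applies.

Using an eigendecomposition, write $V = \Gamma \Gamma^T$. Multiplying both sides by $\Gamma^{-1}$ gives $\Gamma^{-1} Y = \Gamma^{-1}X\beta + \Gamma^{-1}\epsilon$. Let $\Gamma^{-1} Y = Z$ and $\Gamma^{-1}X = W$. $$ Cov(Z) = \Gamma^{-1}Cov(Y)(\Gamma^{-1})^T = \sigma^2 \Gamma^{-1} V (\Gamma^{-1})^T = \sigma^2I $$

Therefore the LSE can be applied to the transformed model $Z = W \beta + \Gamma^{-1} \epsilon$, and the regression estimate is given as follows.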

$$ \begin{aligned} \hat \beta_G &= (W^TW)^{-1}W^TZ \\ &= (X^T(\Gamma^{-1})^T\Gamma^{-1}X)^{-1}X^T (\Gamma^{-1})^T\Gamma^{-1}Y \\ &= (X^T(\Gamma\Gamma^T)^{-1}X)^{-1}X^T(\Gamma\Gamma^T)^{-1}Y \\ &= (X^TV^{-1}X)^{-1}X^TV^{-1}Y \end{aligned} $$

The estimator obtained this way also satisfies the Gauss-Markov theorem. Its covariance matrix is given as follows.

$$ \begin{aligned} Cov(\hat \beta_G) &= \sigma^2 (X^TV^{-1}X)^{-1}X^TV^{-1} VV^{-1}X(X^TV^{-1}X)^{-1} \\ &=\sigma^2(X^TV^{-1}X)^{-1} \end{aligned} $$
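The equivalence between "whiten, then run OLS" and the closed form $(X^TV^{-1}X)^{-1}X^TV^{-1}Y$ can be checked numerically. The sketch below uses simulated data with a made-up positive-definite $V$. One deliberate substitution: it takes $\Gamma$ from a Cholesky factorization rather than an eigendecomposition, which works the same way because only $V = \Gamma\Gamma^T$ is used.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
B = rng.normal(size=(n, n))
V = B @ B.T + n * np.eye(n)               # hypothetical known positive-definite V
beta_true = np.array([1.0, -2.0, 0.5])

Gamma = np.linalg.cholesky(V)             # V = Gamma Gamma^T (Cholesky factor)
Y = X @ beta_true + Gamma @ rng.normal(size=n)

# Whitened model: Z = W beta + noise whose covariance is proportional to I
Z = np.linalg.solve(Gamma, Y)
W = np.linalg.solve(Gamma, X)
beta_whitened, *_ = np.linalg.lstsq(W, Z, rcond=None)

# Closed form (X^T V^{-1} X)^{-1} X^T V^{-1} Y, using that V is symmetric
Vinv_X = np.linalg.solve(V, X)
beta_gls = np.linalg.solve(X.T @ Vinv_X, Vinv_X.T @ Y)
```

Solving linear systems instead of forming $V^{-1}$ explicitly is a numerical choice, not something the derivation requires.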

Graphical LASSO

Tue, 10 Feb 2026 12:32:38 GMT

Graphical LASSO refers to solving the following optimization problem. With the model $Y_1 ,\dots, Y_n \sim N (0, \Sigma)$, the goal is to find the precision matrix $\Theta = \Sigma^{-1}$. Letting $S$ be the sample covariance matrix of $Y$, the problem is $$ \textup{maximize} \; \log(|\Theta|) - \textup{tr}(S\Theta) - \lambda \| \Theta \|_1 $$ After dropping unnecessary factors and constants, the log-likelihood of the multivariate normal is given as follows. $$ \begin{aligned} \ell_n(\Sigma) &\propto -\log(|\Sigma|)- \frac{1}{n}\sum_{i \in [n]}Y_i^{\top}\Sigma^{-1}Y_i \\ &= \log(|\Theta|) - \textup{tr}\Big( \frac{1}{n}\sum_{i \in [n]}Y_i Y_i^{\top}\Sigma^{-1}\Big) \\ &= \log(|\Theta|) - \textup{tr}(S \Theta) \end{aligned} $$

So it can be understood as the log-likelihood with an $\ell_1$ regularization term placed on the precision matrix.
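The trace identity used above, and the fact that the unpenalized objective is maximized at $\Theta = S^{-1}$, can be checked on simulated data. This is only a sketch of the objective, not a Graphical LASSO solver; the data are made up, and applying the $\ell_1$ norm to every entry including the diagonal is an assumption (conventions differ).

```python
import numpy as np

def penalized_loglik(theta, S, lam):
    """log|Theta| - tr(S Theta) - lam * ||Theta||_1 (elementwise l1 norm)."""
    sign, logdet = np.linalg.slogdet(theta)
    assert sign > 0, "Theta must be positive definite"
    return logdet - np.trace(S @ theta) - lam * np.abs(theta).sum()

rng = np.random.default_rng(0)
Y = rng.normal(size=(200, 3))             # hypothetical zero-mean sample
S = Y.T @ Y / Y.shape[0]                  # sample covariance (mean known to be 0)

theta_mle = np.linalg.inv(S)              # maximiser when lam = 0
theta_pert = theta_mle + 0.05 * np.eye(3) # a nearby positive-definite alternative

# (1/n) * sum_i Y_i^T Theta Y_i, which should equal tr(S Theta)
quad_form = np.mean([y @ theta_mle @ y for y in Y])
```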

Scheffe Test (Scheffe's Multiple Hypothesis Test)

Tue, 10 Feb 2026 11:36:32 GMT

This name shows up in experimental design and all over the place, yet I keep studying it and forgetting it. It has a bit of a Wald/score flavor. For testing $H_0 : C\beta = 0$, the F statistic is the following ($s^2$ is the MSE). $$ F = \frac{(C \hat \beta)^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}(C \hat \beta) /q} {s^2} $$ But what does it look like for a single hypothesis $H_0 : a^{\top}\beta = 0$? $$ \begin{aligned} F &= \frac{(a^{\top} \hat \beta)^{\top}[a^{\top}(X^{\top}X)^{-1}a]^{-1}(a^{\top} \hat \beta) /1} {s^2} \\ &= \frac{(a^{\top}\hat \beta)^2}{s^2 [a^{\top} (X^{\top}X)^{-1}a]} \quad \textup{for some } \, a \end{aligned} $$ The idea behind Scheffe's multiple testing is: "if we set the rejection region based on the largest of these F values, can't we test all the hypotheses at once?" So the quantity to control is $$ \max_a \frac{(a^{\top}\hat \beta)^2}{s^2 [a^{\top} (X^{\top}X)^{-1}a]} $$ Applying the quotient rule carefully gives the answer. First define $Q$. $$ Q = \frac{(a^{\top}\hat \beta)^2}{s^2 [a^{\top} (X^{\top}X)^{-1}a]} $$

$$ \begin{aligned} &\frac{dQ}{da} = \frac{1}{s^2} \left[ \frac{2(a^{\top} \hat \beta)(a^{\top}(X^{\top}X)^{-1}a)\hat \beta - 2 (a^{\top} \hat \beta)^2 (X^{\top}X)^{-1}a }{(a^{\top} (X^{\top}X)^{-1}a)^2} \right] = 0 \\ &\Rightarrow (a^{\top}(X^{\top}X)^{-1}a)\hat \beta - (a^{\top} \hat \beta)(X^{\top}X)^{-1}a = 0 \\ &\Rightarrow a = \frac{(a^{\top}(X^{\top}X)^{-1}a)}{(a^{\top} \hat \beta)}(X^{\top}X) \hat \beta = \xi (X^{\top}X) \hat \beta \end{aligned} $$

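Substituting this last expression for $a$ back into $Q$, $a$ disappears and the maximum comes out very cleanly.

$$ \frac{\xi^2 (\hat \beta^{\top}X^{\top}X \hat \beta)^2}{s^2 \xi^2 (\hat \beta^{\top}X^{\top}X \hat \beta)} = \frac{\hat \beta^{\top} X^{\top}X \hat \beta}{s^2} $$

Since $\hat \beta^{\top}X^{\top}X\hat \beta = \hat \beta^{\top}X^{\top}Y$ by the normal equations, this matches the F statistic for $H_0 : \beta_{0:p} = 0$ up to its degrees-of-freedom factor $p+1$. $$ \frac{\hat \beta^{\top} X^{\top}Y}{s^2} = \frac{SSR(Full)}{MSE} $$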
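The claim that $Q$ is maximized in the direction $a \propto (X^{\top}X)\hat\beta$, with maximum $\hat\beta^{\top}X^{\top}X\hat\beta / s^2$, is easy to check numerically. The sketch below uses simulated regression data (all values invented) and compares the closed-form maximum against $Q$ evaluated at many random directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3                              # k = number of coefficients incl. intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
Y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

XtX = X.T @ X
beta_hat = np.linalg.solve(XtX, X.T @ Y)
resid = Y - X @ beta_hat
s2 = resid @ resid / (n - k)              # MSE

def Q(a):
    """(a^T beta_hat)^2 / (s^2 * a^T (X^T X)^{-1} a)."""
    return (a @ beta_hat) ** 2 / (s2 * (a @ np.linalg.solve(XtX, a)))

a_star = XtX @ beta_hat                   # stationary direction; scaling is irrelevant
q_max = beta_hat @ XtX @ beta_hat / s2    # closed-form maximum

random_qs = [Q(a) for a in rng.normal(size=(2000, k))]
```

That no random direction exceeds `q_max` is just the Cauchy-Schwarz inequality in the inner product induced by $(X^{\top}X)^{-1}$.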

Restricted Regression

Tue, 10 Feb 2026 10:40:57 GMT

Restricted Regression

Consider the following hypothesis testing problem. $$ H_0 : C\beta = 0 \quad \textup{versus} \quad H_1 : C\beta \neq 0 $$ In a linear model, carrying out an LR test for this problem is a bit awkward: how do we find the MLE under the null hypothesis? Under the null, the LRT reduces to the following problem. $$ \textup{minimize } \| Y- X\beta\|^2 \textup{ subject to } C\beta = 0 $$ Set up the Lagrangian. $$ \begin{aligned} \mathcal{L}(\beta, \lambda) &= \frac{1}{2}\| Y - X \beta \|^2 + \lambda^{\top}C\beta \end{aligned} $$ Now differentiate. First with respect to $\beta$, $$ \begin{aligned} &\nabla_{\beta} \mathcal{L} = -X^{\top}(Y - X\beta) + C^{\top} \lambda = 0 \\ &\Rightarrow X^{\top}X \beta = X^{\top}Y - C^{\top}\lambda \\ &\Rightarrow \hat\beta_c = (X^{\top}X)^{-1}X^{\top}Y - (X^{\top}X)^{-1}C^{\top}\lambda \end{aligned} $$ and then with respect to $\lambda$. $$ \begin{aligned} &\nabla_{\lambda}\mathcal{L} = C\beta = 0 \\ &\Rightarrow C\hat \beta_c = 0 \end{aligned} $$ Now combine the two pieces of information. Multiplying both sides of the $\beta$ equation by $C$ gives $$ \begin{aligned} &C \hat \beta_c = C(X^{\top}X)^{-1}X^{\top}Y - C(X^{\top}X)^{-1}C^{\top}\lambda \\ &\Rightarrow 0 = C(X^{\top}X)^{-1}X^{\top}Y - C(X^{\top}X)^{-1}C^{\top}\lambda \\ &\Rightarrow \lambda = [C(X^{\top}X)^{-1}C^{\top}]^{-1} C(X^{\top}X)^{-1}X^{\top}Y \end{aligned} $$ Substituting this back into the first equation finishes the job. $$ \hat \beta_c = (X^{\top}X)^{-1}X^{\top}Y - (X^{\top}X)^{-1}C^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1} C(X^{\top}X)^{-1}X^{\top}Y $$ The first term is recognizable as the OLS regression coefficient.

As a natural extension, the same argument works without difficulty for $H_0 : C\beta = t$.
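A small numerical sketch of the restricted estimator, assuming simulated data and a hypothetical single constraint $\beta_1 = \beta_2$ (so $C = (0, 1, -1, 0)$). It computes $\lambda$ and $\hat\beta_c$ exactly as in the derivation above, so the result can be compared against refitting a reparametrized model in which the two constrained columns are merged.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
Y = X @ np.array([1.0, 0.5, 0.5, -1.0]) + rng.normal(size=n)
C = np.array([[0.0, 1.0, -1.0, 0.0]])     # hypothetical constraint: beta_1 = beta_2

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ Y              # first term of the formula

M = C @ XtX_inv @ C.T
lam = np.linalg.solve(M, C @ beta_ols)    # Lagrange multiplier
beta_c = beta_ols - XtX_inv @ C.T @ lam   # restricted estimator

rss_ols = np.sum((Y - X @ beta_ols) ** 2)
rss_c = np.sum((Y - X @ beta_c) ** 2)     # can only be larger than rss_ols
```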

Hypothesis Testing

$$ C \hat \beta \sim N(C\beta, \sigma^2 C(X^{\top}X)^{-1}C^{\top}) $$ Several facts follow from this. First define SSH (in the spirit of SSR). $$ SSH = (C\hat\beta)^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}(C\hat \beta)/\sigma^2 $$ Since this is a quadratic form, it is not hard to see that it follows a chi-square distribution under $H_0$, but to carry out an F test we must establish its independence from SSE. So rewrite SSH in a different form, as a quadratic form in $Y$: $$ SSH = Y^{\top}AY/\sigma^2, \quad A = X(X^{\top}X)^{-1}C^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}C(X^{\top}X)^{-1}X^{\top}. $$ If multiplying $A$ by the $I - H$ of SSE gives the zero matrix $O$, the two quadratic forms are independent, so we can test using SSH/SSE. $$ \begin{aligned} (I - H)A &= A - HA \\ &= A - X(X^{\top}X)^{-1}X^{\top}X(X^{\top}X)^{-1}C^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}C(X^{\top}X)^{-1}X^{\top} \\ &= A - X(X^{\top}X)^{-1}C^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}C(X^{\top}X)^{-1}X^{\top} \\ &= A - A = O \end{aligned} $$

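The expressions are hideous, but in any case the F statistic follows from the facts above. Although it was not stated earlier, if we take $C \in \mathbb{R}^{q \times (p+1)}$ with full row rank, then under $H_0$ (with SSE scaled by $\sigma^2$ as well, so that $\sigma^2$ cancels in the ratio) $$ F = \frac{SSH /q}{SSE / (n-p-1)} \sim F_{q, n-p-1}. $$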
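The orthogonality of the two quadratic forms and the resulting F statistic can be checked numerically. The sketch below uses simulated data and a hypothetical $C$ testing $\beta_1 = \beta_2 = 0$; SSH and SSE are left unscaled because $\sigma^2$ cancels in the ratio. The matrix `A` is the matrix of the quadratic form $(C\hat\beta)^{\top}[C(X^{\top}X)^{-1}C^{\top}]^{-1}(C\hat\beta)$ written in terms of $Y$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 80, 3                              # p slopes plus an intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
Y = X @ np.array([1.0, 0.0, 0.0, 2.0]) + rng.normal(size=n)
C = np.array([[0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0]])      # hypothetical H0: beta_1 = beta_2 = 0
q = C.shape[0]

XtX_inv = np.linalg.inv(X.T @ X)
H = X @ XtX_inv @ X.T                     # hat matrix
A = X @ XtX_inv @ C.T @ np.linalg.inv(C @ XtX_inv @ C.T) @ C @ XtX_inv @ X.T

SSH = Y @ A @ Y                           # unscaled; sigma^2 cancels in F
SSE = Y @ (np.eye(n) - H) @ Y
F = (SSH / q) / (SSE / (n - p - 1))
```

For this kind of $C$, SSH also equals the increase in residual sum of squares when the tested columns are dropped, which gives an independent way to validate the matrix expression.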

When $Y \sim N(\mu, \Sigma)$, the MGF of $Y^{\top} A Y$ is as follows. $$ M_A(t) = |I-2tA \Sigma|^{-1/2} \exp \left\{ -\frac{1}{2} \mu^{\top}(I-(I-2tA \Sigma)^{-1})\Sigma^{-1}\mu \right\} $$ This is surprisingly useful to know, since it makes the later proofs in Rencher easier. Setting $\Xi = |I-2tA\Sigma|$ and differentiating $\log M_{A}(t)$ gives the cumulants, which can be used to derive the variance. A healthy dose of memorization does help!!

Karlin-Rubin Theorem and UMPU

Tue, 10 Feb 2026 07:54:32 GMT

1. Karlin-Rubin

Def. (Monotone Likelihood Ratio) A model $\{f_\theta : \theta \in \Theta \}$ is said to have a monotone likelihood ratio (MLR) in a statistic $T$ if, for any $\theta_1 < \theta_2$, the ratio $f_{\theta_2}(X) / f_{\theta_1}(X)$ is a nondecreasing function of $T$.

Proposition 1. Suppose that a model $\{ f_{\theta} : \theta \in \Theta \}$ has MLR in a statistic $T$. Then, the power function $\beta_{\phi_t}(\theta)$ of the test $\phi_t(X) = \mathbb{I}(T(X) > t)$ is nondecreasing in $\theta$ for any $t$.

Thm. (Karlin-Rubin) Assume that $\{f_\theta : \theta \in \Theta \}$ has MLR in $T$. Consider the one-sided hypothesis testing problem $H_0 : \theta \leq \theta_0 \quad \textup{versus} \quad H_1 : \theta > \theta_0$. Then the test $\phi^*(X) = \mathbb{I}(T(X) > t_{\alpha})$ is a uniformly most powerful level $\alpha$ test, where $t_{\alpha}$ is chosen so that $\beta_{\phi^*}(\theta_0) = \alpha$.

A UMP test does not always exist, but if a model has the monotone likelihood ratio property, the Karlin-Rubin theorem guarantees that a UMP test exists for one-sided hypotheses. A typical example is hypothesis testing in an exponential family, as below.

Corollary 1. Suppose the population density belongs to an exponential family of the following form. $$ f(x; \theta) = h(x) \exp \{ g(\theta)^{\top}T(x) - B(\theta) \} $$ If $g(\theta)$ is a monotonically increasing function of $\theta$, then $\phi^*(X) = \mathbb{I}(T(X) > t_\alpha)$ is a UMP test.

To prove Corollary 1, it is enough to notice that the condition that $g(\theta)$ is monotonically increasing translates directly into MLR.
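For a concrete instance of Proposition 1, take $X_1, \dots, X_n \sim N(\theta, 1)$, which has MLR in $T = \sum_i X_i$. The sketch below evaluates the power function of the one-sided z-test in closed form using only the standard library; the sample size, $\theta_0$, and $\alpha$ are illustrative defaults, not values from the text.

```python
from statistics import NormalDist

def power_one_sided(theta, theta0=0.0, n=25, alpha=0.05):
    """Power of rejecting when sqrt(n) * (xbar - theta0) > z_{1-alpha}, X_i ~ N(theta, 1)."""
    z = NormalDist().inv_cdf(1 - alpha)
    return 1 - NormalDist().cdf(z - n ** 0.5 * (theta - theta0))

thetas = [-0.5 + 0.1 * i for i in range(11)]   # grid from -0.5 to 0.5
powers = [power_one_sided(t) for t in thetas]
```

The power is increasing in $\theta$ and equals $\alpha$ at $\theta_0$, so the level constraint over the whole null $\theta \le \theta_0$ is binding only at the boundary.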

2. Uniformly Most Powerful Unbiased Test

Def. (Unbiased Test) A test $\phi_t(X)$ is a level $\alpha$ unbiased test if $\sup_{\theta \in \Theta_0} \beta_{\phi_t}(\theta) \leq \alpha$ and $\inf_{\theta \in \Theta_1} \beta_{\phi_t}(\theta) \geq \alpha$.

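Unbiased tests may feel unfamiliar. If the probability of a Type I error (rejecting the null when it is true) is controlled to be at most $\alpha$, then the power should be at least $\alpha$. This sounds vague, but if the power were below $\alpha$, the test would be worse than a randomized test that rejects with probability $\alpha$ regardless of the data. So the condition says that the test is at least better than guessing.

Consider the following example. $$ X_1, \dots, X_n \sim N(\theta,1) $$ $$ H_0 : \theta = \theta_0 \textup{ versus } H_1 : \theta \neq \theta_0 $$ The LRT $\phi(X)$ for this hypothesis satisfies the following.

The first is the condition on the significance level, $$ \beta_{\phi}(\theta_0) = \alpha, \quad \nabla_{\theta} \beta_{\phi}(\theta)\big|_{\theta = \theta_0} = 0 $$ and the second says that it has the best power among such tests: for any $\theta_1 \neq \theta_0$, $$ \forall \, \varphi : (\beta_{\varphi}(\theta_0) = \alpha) \, \land \, (\nabla_{\theta} \beta_{\varphi}(\theta)\big|_{\theta = \theta_0} = 0), \; \beta_{\phi}(\theta_1) \geq \beta_{\varphi}(\theta_1) $$

The meaning of the derivative condition is that $\theta_0$ is a stationary point of the power function, and in fact a local minimum. This is consistent with the definition of an unbiased test, which requires $$ \beta_{\phi}(\theta) \geq \alpha, \; \forall \theta \neq \theta_0. $$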
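The two conditions can be seen directly on the power function of the two-sided z-test for $N(\theta, 1)$ data. The sketch below computes the power in closed form and approximates its derivative at $\theta_0$ with a central difference; $n$, $\theta_0$, and $\alpha$ are illustrative defaults.

```python
from statistics import NormalDist

def power_two_sided(theta, theta0=0.0, n=25, alpha=0.05):
    """Power of rejecting when |sqrt(n) * (xbar - theta0)| > z_{1-alpha/2}."""
    nd = NormalDist()
    z = nd.inv_cdf(1 - alpha / 2)
    shift = n ** 0.5 * (theta - theta0)       # mean of the test statistic under theta
    return nd.cdf(-z - shift) + 1 - nd.cdf(z - shift)

def slope(theta, h=1e-5):
    """Central-difference approximation of the derivative of the power function."""
    return (power_two_sided(theta + h) - power_two_sided(theta - h)) / (2 * h)

grid = [-1.0 + 0.05 * i for i in range(41)]   # theta values from -1 to 1
powers = [power_two_sided(t) for t in grid]
```

The power equals $\alpha$ at $\theta_0$, has zero slope there, and is never below $\alpha$ on the grid, which is the unbiasedness property in this example.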

Neural Tangent Kernel

Sat, 25 Oct 2025 16:12:37 GMT

A very well-organized blog post

1. Gradient Flow

First, we need to understand gradient flow, which is a slightly different way of looking at gradient descent. $$ \theta_{t+1} = \theta_{t} - \eta \nabla_{\theta} \mathcal{L}(\theta_t) $$ A simple rearrangement gives the following. $$ \frac{\theta_{t+1} - \theta_{t}}{\eta} = -\nabla_{\theta}\mathcal{L}(\theta_t) $$ If $\eta \approx 0$, can we not think of it as follows? $$ \frac{d\theta(t)}{dt} = - \nabla_{\theta}\mathcal{L}(\theta(t)) $$ This viewpoint is called gradient flow.
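A minimal sketch of this limit, assuming the toy loss $\mathcal{L}(\theta) = \theta^2/2$ (chosen because its gradient flow has the closed-form solution $\theta(t) = \theta_0 e^{-t}$). Running gradient descent for $T/\eta$ steps with smaller and smaller $\eta$ should approach $\theta(T)$.

```python
import math

def gradient_descent(theta0, eta, total_time):
    """Run GD on L(theta) = theta**2 / 2 for total_time / eta steps."""
    theta = theta0
    for _ in range(round(total_time / eta)):
        theta -= eta * theta              # the gradient of theta**2 / 2 is theta
    return theta

theta0, T = 1.0, 2.0
flow = theta0 * math.exp(-T)              # exact gradient-flow solution at time T

# Distance between discrete GD and the continuous flow for shrinking step sizes
errors = [abs(gradient_descent(theta0, eta, T) - flow) for eta in (0.1, 0.01, 0.001)]
```

The error shrinks roughly in proportion to $\eta$, which is the usual first-order behavior of an Euler discretization.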

2. Empirical Risk

Suppose the empirical risk is given as follows. $$ \mathcal{L}(\theta) := \frac{1}{N} \sum_i \ell( f(x_i ; \theta), y_i) $$

By the chain rule, we have the following. $$ \frac{df(x ; \theta)}{dt} = \frac{\partial f(x; \theta)}{\partial \theta} \frac{d\theta}{dt} $$

3. Wrap Up.

$$ \nabla_{\theta} \mathcal{L}(\theta) = \frac{1}{N} \sum_{i = 1}^{N} \nabla_{\theta} \ell(f(x_i; \theta), y_i) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} f(x_i; \theta) \nabla_{f} \ell(f(x_i ;\theta),y_i) $$

$$ \frac{df(x ; \theta)}{dt} = - \frac{1}{N} \sum_{i=1}^{N} \nabla_{\theta} f(x ; \theta)^{\top} \nabla_{\theta} f(x_i; \theta) \nabla_{f} \ell(f(x_i ;\theta),y_i) $$

The term $\nabla_{\theta} f(x ; \theta)^{\top} \nabla_{\theta} f(x_i; \theta)$ is called the Neural Tangent Kernel, and many references denote it by $\Theta$.

The perspective that lets us understand a gradient-based learning model as a form of kernel regression is quite interesting.
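To make the kernel concrete, the sketch below computes the empirical NTK Gram matrix of a tiny network at a fixed parameter value. The architecture $f(x; W, v) = v^{\top}\tanh(Wx)$ with scalar input, the width, and the inputs are all assumptions made for illustration; the gradient is written out by hand so that only NumPy is needed.

```python
import numpy as np

def grad_f(x, W, v):
    """Gradient of f(x; W, v) = v . tanh(W x) with respect to (W, v), for scalar x."""
    h = np.tanh(W * x)
    dW = v * (1.0 - h ** 2) * x           # df/dW_k = v_k * tanh'(W_k x) * x
    return np.concatenate([dW, h])        # df/dv_k = tanh(W_k x)

rng = np.random.default_rng(0)
width = 64
W = rng.normal(size=width)
v = rng.normal(size=width) / np.sqrt(width)

xs = np.array([-1.0, -0.2, 0.5, 2.0])     # hypothetical inputs
J = np.stack([grad_f(x, W, v) for x in xs])
ntk = J @ J.T                             # ntk[i, j] = grad f(x_i) . grad f(x_j)
```

Because the Gram matrix is $JJ^{\top}$, it is symmetric and positive semidefinite by construction, which is what allows it to be read as a kernel.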

Likelihood Ratio Test

Tue, 18 Jul 2023 15:29:30 GMT

Likelihood Ratio Test

Consider the following hypothesis test. $$ H_0 : \theta \in \Theta_0 \quad \textup{vs.} \quad H_1 : \theta \in \Theta_1 $$ Define the likelihood ratio $\Lambda(X)$ as follows. $$ \Lambda(X) = \frac{\sup_{\Theta} \prod_{i=1}^n f(X_i \mid \theta)}{\sup_{\Theta_0} \prod_{i=1}^n f(X_i \mid \theta)} $$ Then the following test is called the likelihood ratio test. $$ \phi(X) = \mathbb{I}(\Lambda(X) \geq k) $$

Sufficient Statistic

By the Fisher-Neyman factorization theorem, we can observe the following. (Let $T$ be a sufficient statistic and let $\Lambda^*$ denote the likelihood ratio based on it.) $$ \begin{aligned} \Lambda(X) &= \frac{\sup_{\Theta} \prod_{i=1}^n f(X_i \mid \theta)}{\sup_{\Theta_0} \prod_{i=1}^n f(X_i \mid \theta)} \\ &= \frac{f( X \mid \hat \theta_n)}{f(X \mid \hat \theta_{0n})} \\ &= \frac{g(T(X) \mid \hat \theta_n) h(X)}{g(T(X) \mid \hat \theta_{0n}) h(X)} \\ &=\frac{g(T(X) \mid \hat \theta_n)}{g(T(X) \mid \hat \theta_{0n})} \\ &= \Lambda^*(T) \end{aligned} $$ So it suffices to find the likelihood ratio of the sufficient statistic.

Wald, Score

Together with the likelihood ratio test, these form the big three that always get bundled together. The Wald test is based on the following fact. $$ \sqrt{n}(\hat \theta_n - \theta_0) \overset{d}{\to} N(0, I(\theta_0)^{-1}) $$ Hence $n (\hat \theta_n - \theta_0)^{\top} I(\theta_0) (\hat \theta_n - \theta_0) \overset{d}{\to} \chi^2_k$. The score test is based on the following fact. $$ \frac{1}{\sqrt{n}}S_n(\theta_0) = \sqrt{n}\left( \frac{1}{n}\sum_{i=1}^n \nabla_{\theta} \log f(X_i \mid \theta)|_{\theta = \theta_0} \right) \overset{d}{\to} N(0, I(\theta_0)) $$ Hence $\frac{1}{n}S_n(\theta_0)^{\top}I(\theta_0)^{-1}S_n(\theta_0) \overset{d}{\to} \chi^2_k$.

Asymptotics

Of course, in many cases we do not know the exact distribution of the likelihood ratio test statistic. So asymptotic analysis, which asks where the distribution converges when $n$ is large enough, is very important. In general the following holds for large $n$. Here we only consider the simple case of a two-sided test with $H_0 : \theta = \theta_0$, where $k$ is the dimension of $\theta$. $$ 2 \log \Lambda(X) \overset{d}{\to} \chi^2_k $$

Now let us explain why. The proof is similar to the proof of the asymptotic normality of the MLE. By a Taylor expansion around $\hat \theta_n$, with $\theta^*$ between $\hat \theta_n$ and $\theta_0$ and using $\nabla \ell_n(\hat \theta_n) = 0$, $$ \begin{aligned} \ell_n(\theta_0) &= \ell_n(\hat \theta_n) + \nabla \ell_n(\hat \theta_n)^{\top} (\theta_0 - \hat \theta_n ) + \frac{1}{2}(\hat \theta_n - \theta_0)^{\top} \nabla^2 \ell_n(\theta^*)(\hat \theta_n - \theta_0) \\ &= \ell_n(\hat \theta_n) + \frac{1}{2}(\hat \theta_n - \theta_0)^{\top} \nabla^2 \ell_n(\theta^*)(\hat \theta_n - \theta_0) \end{aligned} $$ We also know the following. Under suitable regularity conditions $\hat \theta_n \to \theta_0$, and hence $\theta^* \to \theta_0$. $$ -\frac{1}{n}\nabla^2 \ell_n(\theta^*) = \frac{1}{n}\sum_{i \in [n]} \left[ - \nabla^2 \log f(X_i \mid \theta^*)\right] \overset{P}{\to}\mathbb{E}_{\theta_0}[-\nabla^2 \log f(X_1 \mid \theta_0)] = I(\theta_0) $$

Therefore, we obtain the following. $$ \begin{aligned} 2(\ell_n(\hat \theta_n) - \ell_n(\theta_0)) &= n(\hat \theta_n - \theta_0)^{\top} \left[-\frac{1}{n}\nabla^2 \ell_n(\theta^*)\right](\hat \theta_n - \theta_0) \\ &\overset{d}{\to} Z^{\top} I(\theta_0)Z \sim \chi^2_k, \quad Z \sim N(0, I(\theta_0)^{-1}) \end{aligned} $$
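In the $N(\theta, 1)$ model the quadratic approximation is exact, so the LRT, Wald, and score statistics all reduce to $n(\bar X - \theta_0)^2$. The sketch below checks this on simulated data; the sample size and $\theta_0$ are arbitrary illustrative choices.

```python
import numpy as np

def loglik(theta, x):
    """Log-likelihood of N(theta, 1), dropping the additive constant."""
    return -0.5 * np.sum((x - theta) ** 2)

rng = np.random.default_rng(0)
theta0, n = 0.0, 100
x = rng.normal(loc=theta0, size=n)

theta_hat = x.mean()                               # MLE of theta
lrt = 2 * (loglik(theta_hat, x) - loglik(theta0, x))
wald = n * (theta_hat - theta0) ** 2               # Fisher information I(theta) = 1
score = np.sum(x - theta0) ** 2 / n                # S_n(theta0)^2 / (n * I(theta0))
```

In models where the log-likelihood is not quadratic, the three statistics differ in finite samples and agree only asymptotically.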

Rao-Blackwell Theorem

Tue, 18 Jul 2023 15:04:00 GMT

1. The Rao-Blackwell Theorem

Let $\hat{\eta}$ be an estimator of $\eta(\theta)$, let $Y = U(X)$ be a sufficient statistic for $\theta$, and define $\hat{\eta}^{\star} = \mathbb{E}_{\theta}[\hat{\eta}|Y]$.

Then the following hold.

$$ \begin{aligned} \mathbb{E}_{\theta}(\hat{\eta}^{\star}) &= \mathbb{E}_{\theta}(\hat{\eta}) \\ Var(\hat{\eta}^\star) &\leq Var(\hat{\eta}) \end{aligned} $$

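2. What the Rao-Blackwell Theorem Means

The Rao-Blackwell theorem says that, given any estimator, we can construct a better one by taking its conditional expectation given a sufficient statistic.
"Better" here means better in terms of MSE. Since the expectation is unchanged, the bias does not grow, while the variance does not increase. If the original estimator is unbiased, the new estimator remains unbiased and has variance no larger than the original.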
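A small simulation illustrates the variance reduction. It assumes a Bernoulli($\theta$) sample, the crude unbiased estimator $\hat\eta = X_1$, and the sufficient statistic $Y = \sum_i X_i$, for which $\mathbb{E}[X_1 \mid Y] = Y/n$ is the sample mean. The parameter value and simulation sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 0.3, 10, 20000
X = rng.binomial(1, theta, size=(reps, n))   # each row is one Bernoulli(theta) sample

crude = X[:, 0].astype(float)                # unbiased estimator that uses only X_1
rao_blackwell = X.mean(axis=1)               # E[X_1 | sum of X_i] = sample mean

# Both estimators target theta; the second should have roughly 1/n of the variance
summary = {
    "crude": (crude.mean(), crude.var()),
    "rao_blackwell": (rao_blackwell.mean(), rao_blackwell.var()),
}
```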

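3. The Sufficiency Condition

What happens if the statistic is not sufficient? It may seem like an odd question, but the following happens.
Let $Z$ be a statistic that is not sufficient. If we take the conditional expectation of the estimator $\hat{\eta}$ given $Z$, does the Rao-Blackwell theorem still hold?
The reason the theorem requires a sufficient statistic can be seen from the following example.

Suppose $Z \perp X$ and $\mathbb{E}_{\theta}(\hat{\eta}) = \theta$. Then $\hat{\eta}^\star = \mathbb{E}_{\theta}[\hat{\eta}|Z] = \mathbb{E}_{\theta}[\hat{\eta}] = \theta$, so $\hat{\eta}^\star$ depends on the parameter and therefore is not an estimator of $\theta$.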

Reference

Hogg et al. (2021). Introduction to Mathematical Statistics (8th Edition)
강기훈, 박진호. (2015). 수리통계학 : R을 이용한 실습

Convergence in Distribution

Tue, 18 Jul 2023 14:31:03 GMT

1. Definition of Convergence in Distribution

Let $F_{n}$ be the cdf of $X_{n}$, let $F$ be the cdf of $X$, and let $C(F) = \{x \in \mathbb{R} : F \text{ is continuous at } x\}$.

The random variables $X_{n}$ are said to converge in distribution to $X$ if the following holds. $$ \forall x \in C(F), \; \lim_{n \to \infty}F_{n}(x) = F(x) $$

In symbols, this is written as follows. $$ X_{n} \overset{d}{\to} X $$

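2. Moment Generating Functions and Convergence in Distribution

Let $M_{n}$ be the mgf of $X_{n}$ and let $M$ be the mgf of $X$, both assumed to exist in a neighborhood of $0$.

$$ \exists \eta > 0, \quad\forall t \in (-\eta, \eta), \quad \lim_{n \to \infty} M_{n}(t) = M(t) \quad \Longrightarrow \quad X_{n} \overset{d}{\to} X $$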
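A classic use of this criterion is the Poisson limit of the binomial: the mgf of $\mathrm{Binomial}(n, \lambda/n)$ converges pointwise to the mgf of $\mathrm{Poisson}(\lambda)$. The sketch below evaluates both at one illustrative point $t$ for increasing $n$; the values of $\lambda$ and $t$ are arbitrary.

```python
import math

def mgf_binomial(t, n, lam):
    """MGF of Binomial(n, lam / n): (1 - p + p * e^t)^n."""
    p = lam / n
    return (1 - p + p * math.exp(t)) ** n

def mgf_poisson(t, lam):
    """MGF of Poisson(lam): exp(lam * (e^t - 1))."""
    return math.exp(lam * (math.exp(t) - 1))

lam, t = 2.0, 0.5
gaps = [abs(mgf_binomial(t, n, lam) - mgf_poisson(t, lam)) for n in (10, 100, 1000, 10000)]
```

The gap shrinks as $n$ grows, in line with $\mathrm{Binomial}(n, \lambda/n) \overset{d}{\to} \mathrm{Poisson}(\lambda)$; a single $t$ only illustrates the pointwise limit and is not a proof.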

Reference

Hogg et al. (2021). Introduction to Mathematical Statistics (8th Edition)