sea_note.log

[Weeks 7-9] [Code Implementation] Semi-Parametric Contextual Pricing

Sat, 24 Feb 2024 20:19:02 GMT

Reference textbook

Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data, 2nd Edition, MIT Press.

1. Introduction

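For anyone selling a product or service, choosing the selling price is a critical decision. A seller wants the price that yields the maximum sales revenue, and to find it the seller needs to know the customer's willingness to pay (WTP, Willingness To Pay).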

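However, WTP differs from customer to customer, and the seller has no way to observe it directly. If the distribution of customers' WTP can be estimated from past sales records, the seller can adopt a Contextual Dynamic Pricing strategy that sets a different price for each customer.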

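The defining feature of sales data is that it is type-I interval censored data. For example, if a cup of coffee was sold at 1,500 won, we can infer that the customer's WTP for that cup was at least 1,500 won and that the customer bought it because it was offered at or below the price they had in mind. The customer's WTP is therefore observed not as a single value but as an interval.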

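To estimate a probability distribution from such data, the earlier notes covered nonparametric estimation. This time customer information is taken into account, so a contextual vector $\mathbf{x}_t$ appears, and we examine how strongly each piece of information affects pricing. For this we use the Cox Proportional Hazard Model.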

2. Cox Proportional Hazard Model

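The Cox PH model is one of the techniques of survival analysis. What we care about is 'when a particular event occurs', and because the event happens only once (death, failure, and so on), the problem fits naturally into survival analysis. Read broadly, survival versus death can stand for employed versus dismissed, healthy versus ill, and similar pairs.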

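The cumulative distribution function $F(t)$ of the random variable $T$ that denotes the failure time is defined as $$ F(t) = \mathrm{P}(T \le t)~(t \ge0) $$ and gives the probability of failing within time $t$. Conversely, the survivor function $S(t)$, the probability of surviving past time $t$, is $$ S(t) = 1 - F(t) = \mathrm{P}(T > t). $$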

In general the state moves from survival $\rightarrow$ death as time passes. The hazard function $\lambda(t)$, which expresses the instantaneous rate at which death occurs at time $t$ (that is, the rate of leaving the survival state) given survival up to $t$, is

$$ \begin{aligned} \lambda(t) &= \lim_{h \rightarrow 0+} {\mathrm{P}(t \le T < t+h \mid T \ge t) \over h} \\ &=\lim_{h \rightarrow 0+} {F(t+h)-F(t) \over h S(t)} \\ &= {F'(t) \over S(t)} \\ &= -{d \over dt} \log S(t) \end{aligned} $$ so $F(t)$ can be written as $$ F(t)=1-\text{exp}\bigg(-\int_0^t \lambda(s) ds \bigg)~(t \ge 0) $$ and is therefore determined entirely by $\lambda(t)$. If $\lambda(t)=k$ (a constant function), then $T$ follows an exponential distribution with rate parameter $k$.
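As a quick numerical check of the last identity, the sketch below integrates a hazard function with `scipy.integrate.quad` and recovers $F(t)$ from it. The helper name `cdf_from_hazard` and the values $k=0.5$, $t=3$ are illustrative assumptions, not part of the original notes.

```python
import numpy as np
from scipy import integrate, stats


def cdf_from_hazard(hazard, t):
    """Return F(t) = 1 - exp(-integral_0^t hazard(s) ds) via numerical quadrature."""
    cumulative_hazard, _ = integrate.quad(hazard, 0.0, t)
    return 1.0 - np.exp(-cumulative_hazard)


k = 0.5  # illustrative constant hazard rate (assumption)
t = 3.0  # illustrative evaluation point (assumption)

f_numeric = cdf_from_hazard(lambda s: k, t)
# With a constant hazard k, T should be exponential with rate k (scale 1/k).
f_exponential = stats.expon.cdf(t, scale=1.0 / k)
print(f_numeric, f_exponential)
```

The two printed values should agree up to quadrature error, which is the constant-hazard case described above.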

Realistically, the survivor function differs from individual to individual. We express this through covariates $\mathbf{x}$, and the corresponding hazard function $\lambda(t|\mathbf{x})$ is written $$ \lambda(t|\mathbf{x}) = \kappa(\mathbf{x})\lambda_0(t). $$ Here $\kappa(\mathbf{x})$ is a function whose range is positive, and $\lambda_0(t)$ is the baseline hazard function, the component shared by every $\lambda(t|\mathbf{x})$ regardless of the covariates.

The usual choice is $\kappa(\mathbf{x})=\text{exp}(\mathbf{x}^T \beta)$. The vector $\beta$ can then be read as a weight vector that indicates how strongly each variable in the covariate vector $\mathbf{x}$ affects the distribution, and $\lambda_0(t)$ is the hazard for $\mathbf{x}=\mathbf{0}$.

Given the covariates $\mathbf{x}$, the survivor function $S(t|\mathbf{x})$ is $$ \begin{aligned} S(t|\mathbf{x}) & =\text{exp}\bigg(-\int_0^t \lambda(s|\mathbf{x}) ds \bigg) \\ &= \text{exp} \bigg(-\int_0^t\lambda_0(s) ds \cdot\text{exp}(\mathbf{x}^T\beta) \bigg) \\ &= S(t)^{\text{exp}(\mathbf{x}^T\beta)} \end{aligned} $$ Here $S(t)$, like $\lambda_0(t)$, is the baseline function that does not depend on the covariates ($\mathbf{x}=\mathbf{0}$), and from now on it is written $S_0(t)(=1-F_0(t))$.
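The proportional-hazards relation $S(t|\mathbf{x}) = S_0(t)^{\text{exp}(\mathbf{x}^T\beta)}$ can be checked numerically. The sketch below borrows the Gamma baseline (shape 2, scale 5) used in the experiments later in these notes; the covariate value `x = 0.5` and the function names are illustrative assumptions.

```python
import numpy as np
from scipy import stats

beta = np.array([-2.0])  # coefficient used in Experiment 1
x = np.array([0.5])      # illustrative covariate value (assumption)
t = np.linspace(0.5, 20.0, 5)


def baseline_survival(t):
    """S_0(t) for a Gamma(shape=2, scale=5) baseline."""
    return stats.gamma.sf(t, a=2, scale=5)


def conditional_survival(t, x, beta):
    """S(t | x) = S_0(t) ** exp(x^T beta)."""
    return baseline_survival(t) ** np.exp(x @ beta)


s_x = conditional_survival(t, x, beta)
# Ratio of cumulative hazards, -log S(t|x) / -log S_0(t), at each grid point.
ratio = np.log(s_x) / np.log(baseline_survival(t))
print(ratio)
```

Every entry of `ratio` should equal $\text{exp}(\mathbf{x}^T\beta)$, independent of $t$, which is exactly what "proportional" means here.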

3. Contextual Pricing using Cox PH Model

Let $v_t$ be the customer's WTP for some item and $p_t$ the price the seller offers. Let $\mathbf{x}_t$ be the buyer's information and let $y_t \in \{0, 1\}$ indicate whether the item was purchased: $y_t=1$ if it was purchased and $y_t=0$ if it was not.

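For example, consider a customer buying a plane ticket. The airline offers the ticket at price $p_t$, the buyer has a price $v_t$ in mind, and the buyer purchases the ticket if the offered price is at most $v_t$. The buyer's information $\mathbf{x}_t$ may include age, gender, place of residence, destination, and so on. That is, $$ \begin{aligned} v_t \ge p_t | \mathbf{x}_t ~~~ &\Leftrightarrow ~~~ y_t=1 | \mathbf{x}_t \\ v_t < p_t | \mathbf{x}_t ~~~ &\Leftrightarrow ~~~ y_t=0 | \mathbf{x}_t \end{aligned} $$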

The seller wants to know the distribution of $v_t$. Treating $v_t$ as a random variable and defining 'purchase' as survival and 'no purchase' as death, we have $$ \begin{aligned} \mathrm{P}(v_t \ge p_t | \mathbf{x}_t) &= S(p_t|\mathbf{x}_t) \\ &= S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)} \\ &= \text{exp} \bigg(-\int_0^{p_t}\lambda_0(s) ds \cdot\text{exp}(\mathbf{x}_t^T\beta) \bigg) \\ &= \text{exp} \bigg(-\Lambda_0(p_t) \cdot \text{exp}(\mathbf{x}_t^T\beta) \bigg) \end{aligned} $$ where $$ \begin{aligned} \Lambda_0(p_t) &:= \int_0^{p_t} \lambda_0(s) ds \\ \Lambda(p_t|~\mathbf{x}_t) &:= \Lambda_0(p_t)\cdot \text{exp}(\mathbf{x}_t^T\beta) \end{aligned} $$ are called cumulative hazard functions.
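To make the purchase probability concrete, the sketch below evaluates $\text{exp}(-\Lambda_0(p_t)\,\text{exp}(\mathbf{x}_t^T\beta))$ on a price grid and, purely as an illustration of how such a curve could feed a pricing decision, picks the grid price with the highest expected revenue $p \cdot \mathrm{P}(v_t \ge p \mid \mathbf{x}_t)$. The Gamma baseline is taken from the experiments in these notes; the two-dimensional `beta`, the customer vector `x`, and the grid search itself are assumptions made for the example.

```python
import numpy as np
from scipy import stats


def baseline_cumulative_hazard(p):
    """Lambda_0(p) = -log S_0(p) for a Gamma(shape=2, scale=5) baseline."""
    return -stats.gamma.logsf(p, a=2, scale=5)


def purchase_probability(p, x, beta):
    """P(v >= p | x) = exp(-Lambda_0(p) * exp(x^T beta))."""
    return np.exp(-baseline_cumulative_hazard(p) * np.exp(x @ beta))


beta = np.array([-2.0, 0.3])  # illustrative coefficients (assumption)
x = np.array([0.2, -1.0])     # illustrative customer context (assumption)
prices = np.linspace(0.1, 30.0, 300)

prob = purchase_probability(prices, x, beta)
expected_revenue = prices * prob
best_price = prices[np.argmax(expected_revenue)]
print(best_price)
```

A grid search is used only because it is easy to verify; any one-dimensional optimizer could take its place.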

Let us estimate the distribution of $v_t$ from this type-I interval censored data. By the definitions above, $$ \begin{aligned} \mathrm{P}(y_t=1|\mathbf{x}_t) &= S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)} \\ \mathrm{P}(y_t=0|\mathbf{x}_t) &= 1 - S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)} \end{aligned} $$ so the log-likelihood to maximize when estimating $S_0$ (or $\Lambda_0$) is $$ l(S_0) = {1 \over T} \sum_{t=1}^T \bigg(y_t \log S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)} + (1-y_t) \log (1-S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)}) \bigg) $$ or, equivalently, since $\log S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)} = -\Lambda_0(p_t)\,\text{exp}(\mathbf{x}_t^T\beta)$, $$ l(\Lambda_0)= {1 \over T} \sum_{t=1}^T \bigg(-y_t~\Lambda_0(p_t)~{\text{exp}(\mathbf{x}_t^T\beta)} + (1-y_t) \log (1-e^{-\Lambda_0(p_t)\text{exp}(\mathbf{x}_t^T\beta)}) \bigg). $$ Here $T$ is the number of samples (not the failure time of Section 2).
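The log-likelihood in its $\Lambda$ form can be written as a small NumPy function. This is a sketch under stated assumptions: the baseline is treated as known (Gamma with shape 2 and scale 5, as in the experiments), the function and variable names are illustrative, and $y_t$ is simulated directly as a Bernoulli draw with success probability $S(p_t|\mathbf{x}_t)$, which has the same distribution as $\mathbb{1}\{v_t \ge p_t\}$.

```python
import numpy as np
from scipy import stats


def log_likelihood(beta, x, p, y, baseline_sf):
    """Average log-likelihood of purchase indicators y at offered prices p."""
    cox = np.exp(x @ beta)                       # exp(x_t^T beta)
    cum_hazard = -np.log(baseline_sf(p)) * cox   # Lambda(p_t | x_t)
    log_s = -cum_hazard                          # log S(p_t | x_t)
    log_f = np.log(-np.expm1(-cum_hazard))       # log(1 - S(p_t | x_t))
    return np.mean(y * log_s + (1 - y) * log_f)


def baseline_sf(price):
    """Known baseline S_0 (assumption for this sketch)."""
    return stats.gamma.sf(price, a=2, scale=5)


rng = np.random.default_rng(0)
n = 5000
true_beta = np.array([-2.0])
x = rng.standard_normal((n, 1))
p = rng.uniform(0.1, 30.0, size=n)

# y_t = 1 exactly when v_t >= p_t, so y_t ~ Bernoulli(S(p_t | x_t)).
s_true = baseline_sf(p) ** np.exp(x @ true_beta)
y = (rng.uniform(size=n) < s_true).astype(float)

ll_true = log_likelihood(true_beta, x, p, y, baseline_sf)
ll_zero = log_likelihood(np.array([0.0]), x, p, y, baseline_sf)
print(ll_true, ll_zero)
```

With data generated at $\beta=-2$, the likelihood at the true coefficient should exceed the likelihood at $\beta=0$, which is the signal the estimator exploits.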

4. Contextual Pricing Neural Network

Let us now implement all of this with a neural network. What the network has to estimate is $S(p_t|\mathbf{x}_t)$ and $\beta$. Since $\mathbf{x}_t$ is given, estimating $\beta$ is parametric while estimating $S(p_t|\mathbf{x}_t)$ is nonparametric, so handling both together makes this a semi-parametric estimation.

In practice the data available to us are $\mathbf{x}_t, p_t, y_t$, and from these we must estimate the unknown distribution of $v_t|\mathbf{x}_t$. With such data the neural network learns $S(p_t|\mathbf{x}_t)$ directly rather than $S_0(p_t)$.

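So if we are not interested in how each variable in $\mathbf{x}_t$ affects the distribution of $v_t$, it suffices to train a network of the above form, one that learns $S(p_t|\mathbf{x}_t)$ directly.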

If $\beta$ is to be estimated as well, we propose building $\text{Loss1}$, which learns $S(p_t|\mathbf{x}_t)$, and $\text{Loss2}$, which learns $\beta$, and adding the two together.

$\text{Loss1}$ is the same as in the first network structure and makes the network learn $S(p_t|\mathbf{x}_t)$.
$\text{Loss2}$ makes the network learn $\beta$.
- For each $\mathbf{x}_t$ the function $S(p_t|\mathbf{x}_t)=S_0(p_t)^{\text{exp}(\mathbf{x}_t^T\beta)}$ is different, but all of them share the same $S_0(p_t)$.
- Training so that this shared $S_0(p_t)$ coincides with $S(p_t|\mathbf{x}_t=\mathbf{0})$ lets the $\beta$ inside $\text{exp}(\mathbf{x}_t^T\beta)$ be determined automatically.
$\text{Loss} = \text{Loss1}+\text{Loss2}$

The network can be designed to estimate $\Lambda(p_t|\mathbf{x}_t)$ in the same way. Since $S(p_t|\mathbf{x}_t) = e^{-\Lambda(p_t|\mathbf{x}_t)}$, similar results can be expected.
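Written out, the two losses take the following form. The formulas are read off from the training code later in these notes, so they are one reading of the proposal rather than a separate derivation; hats denote network outputs, and $\hat\beta$ is the weight of the linear layer.

$$ \begin{aligned} \text{Loss1}_S &= -{1 \over T} \sum_{t=1}^T \bigg( y_t \log \hat S(p_t|\mathbf{x}_t) + (1-y_t) \log \big(1-\hat S(p_t|\mathbf{x}_t)\big) \bigg) \\ \text{Loss2}_S &= {1 \over T} \sum_{t=1}^T \bigg( \hat S(p_t|\mathbf{0})^{\text{exp}(\mathbf{x}_t^T\hat\beta)} - \hat S(p_t|\mathbf{x}_t) \bigg)^2 \\ \text{Loss1}_\Lambda &= {1 \over T} \sum_{t=1}^T \bigg( y_t\, \hat\Lambda(p_t|\mathbf{x}_t) - (1-y_t) \log \big(1-e^{-\hat\Lambda(p_t|\mathbf{x}_t)}\big) \bigg) \\ \text{Loss2}_\Lambda &= {1 \over T} \sum_{t=1}^T \bigg( e^{-\hat\Lambda(p_t|\mathbf{0})\,\text{exp}(\mathbf{x}_t^T\hat\beta)} - e^{-\hat\Lambda(p_t|\mathbf{x}_t)} \bigg)^2 \end{aligned} $$

$\text{Loss1}$ is the negative of the log-likelihood $l$, and $\text{Loss2}$ penalizes any departure of the learned $\hat S(p_t|\mathbf{x}_t)$ from the proportional-hazards form built from the network's own output at $\mathbf{x}_t=\mathbf{0}$.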

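Experiment 1. One-dimensional context

True $\beta=-2$
Initial $\beta \sim \text{Unif}(-1, 1)$
$\mathbf{x}_t \sim \mathrm{N}(0, 1)$
True $S_0(p_t) = 1-\text{GammaCDF}(\alpha=2, \beta=0.2)$, where the Gamma $\beta$ is the rate parameter (scale $=1/0.2=5$ in the code), not the regression coefficient

Fix the random seed and generate the data.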

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module

import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt

# Function that fixes the random seeds
def RandomFix(random_seed):
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    np.random.seed(random_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# beta to estimate: -2
# F to estimate: Gamma distribution (alpha=2, rate beta=0.2, i.e. scale=5)
class RV_Sampler(st.rv_continuous):
    def __init__(self, cox):
        super().__init__()
        self.cox = cox

    def _cdf(self, x):
        return 1 - (st.gamma.sf(x=x, a=2, loc=0, scale=5))**self.cox

def generate_dataC(num_sample):
    beta = np.array([-2])
    x_t = np.random.randn(len(beta), num_sample)    

    cox_t = np.exp(np.dot(beta, x_t))
    v_t = np.array([RV_Sampler(cox=val).rvs(size=1)[0] for val in cox_t])
    p_t = st.uniform.rvs(size=num_sample) * 30
    y_t = (v_t >= p_t).astype(int)

    return torch.tensor(x_t.T, dtype=torch.float), torch.tensor(p_t, dtype=torch.float), torch.tensor(y_t, dtype=torch.float)

Sampling from the distribution relies on rvs from scipy.stats. It uses the inverse CDF method, which applies the inverse of the cdf to a value sampled from $\text{Unif}(0,1)$.
The graph of $S_0(p_t)$ is as follows.
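For this particular model the inversion can also be done in closed form, because $S_0(v)^{c} = u$ is equivalent to $v = S_0^{-1}(u^{1/c})$ with $c = \text{exp}(\mathbf{x}_t^T\beta)$. The sketch below is a vectorized alternative to the `RV_Sampler` class, not the code used in these notes; `sample_wtp` is a hypothetical name, and `scipy.stats.gamma.isf` supplies $S_0^{-1}$.

```python
import numpy as np
from scipy import stats


def sample_wtp(cox, rng):
    """Draw v with survival function S_0(v) ** cox, where S_0 is Gamma(shape=2, scale=5)."""
    u = rng.uniform(size=np.shape(cox))
    # S_0(v) ** cox = u  <=>  v = S_0^{-1}(u ** (1 / cox))
    return stats.gamma.isf(u ** (1.0 / cox), a=2, scale=5)


rng = np.random.default_rng(0)
cox = np.full(200_000, 0.5)  # illustrative constant value of exp(x^T beta) (assumption)
v = sample_wtp(cox, rng)

price = 10.0
empirical = np.mean(v >= price)
theoretical = stats.gamma.sf(price, a=2, scale=5) ** 0.5
print(empirical, theoretical)
```

The empirical purchase rate at a fixed price should match $S_0(p)^{c}$ up to Monte Carlo error, and the draw needs no per-sample numerical root finding.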

The model configuration is as follows. Building on the previously implemented MonoBlock, we construct MonoNet_bdd (estimating $S(p_t|\mathbf{x}_t)$) and MonoNet_mul (estimating $\Lambda(p_t|\mathbf{x}_t)$).

class MonoBlock(Module):
    def __init__(self, in_feature:int, out_feature:int, mono_indicator = 'inc', activation = 'none', activation_partition = (0,0,1)):
        super().__init__()
        self.activation = activation
        self.activation_partition = activation_partition

        self.in_feature = in_feature
        self.out_feature = out_feature
        self.mono_indicator = mono_indicator

        self.W = Parameter(torch.randn(self.in_feature, self.out_feature))
        self.b = Parameter(torch.randn(self.out_feature))

    def get_activation(self):
        convex = getattr(F, self.activation)
        def concave(x):
            return -convex(-x)
        def saturated(x):
            plus = -convex(-x+torch.ones_like(x)) + convex(torch.ones_like(x))
            minus = convex(x+torch.ones_like(x)) - convex(torch.ones_like(x))
            return torch.where(x >= 0, plus, minus)
        return convex, concave, saturated

    def activation_index(self, x):
        if sum(self.activation_partition) != 1:
            raise ValueError(f"sum of activation_partition must be 1")
        if len(self.activation_partition) != 3:
            raise ValueError(f"length of activation_partition must be 3")

        convex_num = int(self.activation_partition[0] * len(x.T))
        concave_num = int(self.activation_partition[1] * len(x.T))
        return convex_num, convex_num+concave_num, len(x.T)

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        if self.mono_indicator == 'inc':
            self.mono_indicator = torch.ones(x.shape[1])
        if x.shape[1] != self.in_feature:
            raise ValueError(f"matrix multiplication cannot be implemented : {x.shape[0]}x{x.shape[1]} and {self.in_feature}x{self.out_feature}")
        if len(self.mono_indicator) != self.in_feature:
            raise ValueError(f"number of variable does not match : {len(self.mono_indicator)} and {self.in_feature}")

        mono_oper = torch.tensor(self.mono_indicator).to(device).reshape(-1, 1) * torch.abs(self.W)
        W_oper = torch.where(torch.abs(mono_oper) >= torch.abs(self.W), mono_oper, self.W)
        x = torch.matmul(x, W_oper) + self.b

        convex_idx, concave_idx, saturated_idx = self.activation_index(x)
        if self.activation == 'none':
            out = torch.cat([x.T[:convex_idx], x.T[convex_idx:concave_idx], x.T[concave_idx:saturated_idx]], dim=0).T
        else:
            convex_act, concave_act, saturated_act = self.get_activation()
            out = torch.cat([convex_act(x.T[:convex_idx]), concave_act(x.T[convex_idx:concave_idx]), saturated_act(x.T[concave_idx:saturated_idx])], dim=0).T

        return out


class MonoNet_bdd(nn.Module):
    def __init__(self, context_size):
        super().__init__()
        self.context_size = context_size
        self.linear = nn.Linear(self.context_size, 1, bias=False)
        self.mono = nn.Sequential(
            MonoBlock(2, 32, mono_indicator=[0, -1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )

        eps = 1e-6
        for m in self.modules():
            if isinstance(m, MonoBlock):
                torch.nn.init.uniform_(m.W)
                torch.nn.init.uniform_(m.b, a=-1, b=1)
            elif isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight, a=-1, b=1)


    def coxfunc(self, x):
        return torch.exp(self.linear(x))

    def forward(self, x, p):
        cat = torch.cat([x, p.reshape(-1, 1)], dim=1)
        S_x = torch.sigmoid(self.mono(cat))

        cox = self.coxfunc(x)
        S_0 = S_x**(1/cox)

        return S_x, S_0, cox


class MonoNet_mul(nn.Module):
    def __init__(self, context_size):
        super().__init__()
        self.context_size = context_size
        self.linear = nn.Linear(self.context_size, 1, bias=False)
        self.mono = nn.Sequential(
            MonoBlock(2, 32, mono_indicator=[0, 1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )

        eps = 1e-6
        for m in self.modules():
            if isinstance(m, MonoBlock):
                torch.nn.init.uniform_(m.W)
                torch.nn.init.uniform_(m.b, a=-1, b=1)
            elif isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight, a=-1, b=1)


    def coxfunc(self, x):
        return torch.exp(self.linear(x))

    def origin(self, x):
        result = torch.log(3*x+1) / (0.01 + torch.log(3*x+1))
        return result.reshape(-1, 1)

    def forward(self, x, p):
        cat = torch.cat([x, p.reshape(-1, 1)], dim=1)
        out = self.mono(cat)
        L_x = torch.where(out>=1, out, torch.exp(out-1)) * self.origin(p)

        cox = self.coxfunc(x)
        L_0 = L_x / cox

        return L_x, L_0, cox
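The sign constraint that MonoBlock applies to its weights can be isolated in a few lines. The sketch below is a simplified restatement rather than the class itself: `constrain_weights` is a hypothetical helper, and it selects on the indicator directly instead of comparing absolute values, which gives the same result. An indicator of 1 forces non-negative weights (increasing in that input), -1 forces non-positive weights (decreasing), and 0 leaves the weights free.

```python
import torch


def constrain_weights(W, mono_indicator):
    """Apply per-input sign constraints to W of shape (in_features, out_features)."""
    ind = torch.as_tensor(mono_indicator, dtype=W.dtype).reshape(-1, 1)
    signed = ind * torch.abs(W)  # +|W| for indicator 1, -|W| for indicator -1
    return torch.where(ind != 0, signed, W)


torch.manual_seed(0)
W = torch.randn(2, 4)
# Input 0 is the context (unconstrained), input 1 is the price (decreasing),
# mirroring mono_indicator=[0, -1] in MonoNet_bdd.
W_c = constrain_weights(W, [0, -1])

x_low = torch.tensor([[0.3, 1.0]])
x_high = torch.tensor([[0.3, 2.0]])  # same context, higher price
print(torch.all(x_high @ W_c <= x_low @ W_c))
```

Because the later layers use non-negative weights and monotone activations, a first layer that is decreasing in price keeps the whole network decreasing in price, as a survival function should be.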

The common hyperparameters used for training are defined as follows.

common_param = {'learning_rate' : 0.01,
                'num_epoch' : 5000}
num_repl = 100

Generate the data and train the model.

# Model training _ estimating S _ replication
y_rate_SC = []
S_errorC = []
S_betaC_x = [0]
S_betaC_y = []
S_betaC_res = []

p = np.linspace(0.1, 30, 200)
P = torch.tensor(p, dtype=torch.float).reshape(-1, 1).to(device)
X = torch.zeros((200, 1), dtype=torch.float).to(device)
baseX = torch.zeros((10000, 1), dtype=torch.float).to(device)
gtC = st.gamma.cdf(p, a=2, loc=0, scale=5).reshape(-1, 1)

for idx in range(1, num_repl+1):
    RandomFix(idx)
    print(f"====={idx} Replication =====")

    modelS = MonoNet_bdd(1)
    modelS.to(device)
    optimizer = torch.optim.Adam(modelS.parameters(), lr=common_param['learning_rate'])

    if idx == 1:
        for m in modelS.modules():
            if isinstance(m, nn.Linear):
                S_betaC_y.append(m.weight[0].cpu().detach().numpy()[0])

    train_x, train_p, train_y = generate_dataC(10000)
    y_rate_SC.append(sum(train_y) / 10000)

    train_x = train_x.to(device)
    train_p = train_p.to(device)
    train_y = train_y.reshape(-1, 1).to(device)

    for epoch in range(1, common_param['num_epoch']+1):
        modelS.train()

        S_x, S_0, cox = modelS(train_x, train_p)
        S_00 = modelS(baseX, train_p)[0]

        loss1 = torch.mean(-1 * train_y * torch.log(S_x) - (1-train_y) * torch.log(1 - S_x))
        loss2 = torch.mean((S_00**cox - S_x)**2)
        train_loss = loss1 + loss2
        if epoch % 100 == 0:
            print(f"  [Epoch {epoch}] Train loss : {train_loss}")

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if idx == 1:
            for m in modelS.modules():
                if isinstance(m, nn.Linear):
                    cand = m.weight[0].cpu().detach().numpy()[0]
                    if S_betaC_y[-1] != cand:
                        S_betaC_x.append(epoch)
                        S_betaC_y.append(cand)
    for m in modelS.modules():
        if isinstance(m, nn.Linear):
            S_betaC_res.append(m.weight[0].cpu().detach().numpy()[0])

    F_0 = 1 - modelS(X, P)[1]
    S_errorC.append(gtC - F_0.cpu().detach().numpy())

fig = plt.figure()
fig.set_figwidth(20)

ax1 = fig.add_subplot(131)
ax1.plot(p, gtC, 'r')
ax1.plot(p, -1 * sum(S_errorC)/len(S_errorC) + gtC, color='b')

ax2 = fig.add_subplot(132)
ax2.plot(S_betaC_x, S_betaC_y)

ax3 = fig.add_subplot(133)
ax3.hist(S_betaC_res, bins=100)

print(f"y_t=1 ratio : {sum(y_rate_SC).numpy()/len(y_rate_SC)}")
print(f"Std(RMSE) : {np.sqrt(np.mean(sum(val**2 for val in S_errorC)))}")

To show that the result does not hinge on chance, the same experiment was repeated 100 times with only the random seed changed (replication). Across these runs both $F_0(p_t) = 1-S_0(p_t)$ and $\beta$ were estimated well.

The first plot shows the mean of $F_0(p_t)$ over the 100 replications.
- The red curve is the true $F_0(p_t)$ and the blue curve is the estimated $F_0(p_t)$.
The second plot shows how the value of $\beta$ changed during the first replication.
The third plot is a histogram of the $\beta$ values obtained over the 100 replications. They mostly cluster near the true value $-2$.

The same procedure is now applied to estimate $\Lambda(p_t|\mathbf{x}_t)$.

# Model training _ estimating L _ replication
y_rate_LC = []
L_errorC = []
L_betaC_x = [0]
L_betaC_y = []
L_betaC_res = []

p = np.linspace(0.1, 30, 200)
P = torch.tensor(p, dtype=torch.float).reshape(-1, 1).to(device)
X = torch.zeros((200, 1), dtype=torch.float).to(device)
baseX = torch.zeros((10000, 1), dtype=torch.float).to(device)
gtC = st.gamma.cdf(p, a=2, loc=0, scale=5).reshape(-1, 1)

for idx in range(1, num_repl+1):
    RandomFix(idx)
    print(f"====={idx} Replication =====")

    modelL = MonoNet_mul(1)
    modelL.to(device)
    optimizer = torch.optim.Adam(modelL.parameters(), lr=common_param['learning_rate'])

    if idx == 1:
        for m in modelL.modules():
            if isinstance(m, nn.Linear):
                L_betaC_y.append(m.weight[0].cpu().detach().numpy()[0])

    train_x, train_p, train_y = generate_dataC(10000)
    y_rate_LC.append(sum(train_y) / 10000)

    train_x = train_x.to(device)
    train_p = train_p.to(device)
    train_y = train_y.reshape(-1, 1).to(device)

    for epoch in range(1, common_param['num_epoch']+1):
        modelL.train()

        L_x, L_0, cox = modelL(train_x, train_p)
        L_00 = modelL(baseX, train_p)[0]

        loss1 = torch.mean(train_y * L_x - (1-train_y) * torch.log(1 - torch.exp(-L_x)))
        loss2 = torch.mean((torch.exp(-L_00 * cox) - torch.exp(-L_x))**2)

        train_loss = loss1 + loss2
        if epoch % 100 == 0:
            print(f"  [Epoch {epoch}] Train loss : {train_loss}")

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if idx == 1:
            for m in modelL.modules():
                if isinstance(m, nn.Linear):
                    cand = m.weight[0].cpu().detach().numpy()[0]
                    if L_betaC_y[-1] != cand:
                        L_betaC_x.append(epoch)
                        L_betaC_y.append(cand)
    for m in modelL.modules():
        if isinstance(m, nn.Linear):
            L_betaC_res.append(m.weight[0].cpu().detach().numpy()[0])

    F_0 = 1 - torch.exp(-modelL(X, P)[1])
    L_errorC.append(gtC - F_0.cpu().detach().numpy())

fig = plt.figure()
fig.set_figwidth(20)

ax1 = fig.add_subplot(131)
ax1.plot(p, gtC, 'r')
ax1.plot(p, -1 * sum(L_errorC)/len(L_errorC) + gtC, color='b')

ax2 = fig.add_subplot(132)
ax2.plot(L_betaC_x, L_betaC_y)

ax3 = fig.add_subplot(133)
ax3.hist(L_betaC_res, bins=100)

print(f"y_t=1 ratio : {sum(y_rate_LC).numpy()/len(y_rate_LC)}")
print(f"Std(RMSE) : {np.sqrt(np.mean(sum(val**2 for val in L_errorC)))}")

As with the result above, the model trained well.

Experiment 2. Multi-dimensional context

True $\beta=(\beta_1, \beta_2, \beta_3, \beta_4, \beta_5)=(-2, 0.3, -0.5, 1.5, 0)$
Initial $\beta_i \sim \text{Unif}(-1, 1),~~i=1,2,3,4,5$
$\mathbf{x}_t \sim \mathrm{N}(\mathbf{0}, I_5)$
True $S_0(p_t) = 1-\text{GammaCDF}(\alpha=2, \beta=0.2)$, where the Gamma $\beta$ is again the rate parameter (scale 5 in the code)

Compared with Experiment 1, the only change is that the dimension of the contextual vector grows to 5, so the $\beta$ to be estimated is also 5-dimensional.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module

import numpy as np
import scipy.stats as st
import scipy.special as sp
import matplotlib.pyplot as plt

# Function that fixes the random seeds
def RandomFix(random_seed):
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    np.random.seed(random_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

# beta to estimate: (-2, 0.3, -0.5, 1.5, 0)
# F to estimate: Gamma distribution (alpha=2, rate beta=0.2, i.e. scale=5)
class RV_Sampler(st.rv_continuous):
    def __init__(self, cox):
        super().__init__()
        self.cox = cox

    def _cdf(self, x):
        return 1 - (st.gamma.sf(x=x, a=2, loc=0, scale=5))**self.cox

def generate_dataC(num_sample):
    beta = np.array([-2, 0.3, -0.5, 1.5, 0])
    x_t = np.random.randn(len(beta), num_sample)    

    cox_t = np.exp(np.dot(beta, x_t))
    v_t = np.array([RV_Sampler(cox=val).rvs(size=1)[0] for val in cox_t])
    p_t = st.uniform.rvs(size=num_sample) * 30
    y_t = (v_t >= p_t).astype(int)

    return torch.tensor(x_t.T, dtype=torch.float), torch.tensor(p_t, dtype=torch.float), torch.tensor(y_t, dtype=torch.float)

The rest of the model configuration and the training procedure are the same.

class MonoBlock(Module):
    def __init__(self, in_feature:int, out_feature:int, mono_indicator = 'inc', activation = 'none', activation_partition = (0,0,1)):
        super().__init__()
        self.activation = activation
        self.activation_partition = activation_partition

        self.in_feature = in_feature
        self.out_feature = out_feature
        self.mono_indicator = mono_indicator

        self.W = Parameter(torch.randn(self.in_feature, self.out_feature))
        self.b = Parameter(torch.randn(self.out_feature))

    def get_activation(self):
        convex = getattr(F, self.activation)
        def concave(x):
            return -convex(-x)
        def saturated(x):
            plus = -convex(-x+torch.ones_like(x)) + convex(torch.ones_like(x))
            minus = convex(x+torch.ones_like(x)) - convex(torch.ones_like(x))
            return torch.where(x >= 0, plus, minus)
        return convex, concave, saturated

    def activation_index(self, x):
        if sum(self.activation_partition) != 1:
            raise ValueError(f"sum of activation_partition must be 1")
        if len(self.activation_partition) != 3:
            raise ValueError(f"length of activation_partition must be 3")

        convex_num = int(self.activation_partition[0] * len(x.T))
        concave_num = int(self.activation_partition[1] * len(x.T))
        return convex_num, convex_num+concave_num, len(x.T)

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        if self.mono_indicator == 'inc':
            self.mono_indicator = torch.ones(x.shape[1])
        if x.shape[1] != self.in_feature:
            raise ValueError(f"matrix multiplication cannot be implemented : {x.shape[0]}x{x.shape[1]} and {self.in_feature}x{self.out_feature}")
        if len(self.mono_indicator) != self.in_feature:
            raise ValueError(f"number of variable does not match : {len(self.mono_indicator)} and {self.in_feature}")

        mono_oper = torch.tensor(self.mono_indicator).to(device).reshape(-1, 1) * torch.abs(self.W)
        W_oper = torch.where(torch.abs(mono_oper) >= torch.abs(self.W), mono_oper, self.W)
        x = torch.matmul(x, W_oper) + self.b

        convex_idx, concave_idx, saturated_idx = self.activation_index(x)
        if self.activation == 'none':
            out = torch.cat([x.T[:convex_idx], x.T[convex_idx:concave_idx], x.T[concave_idx:saturated_idx]], dim=0).T
        else:
            convex_act, concave_act, saturated_act = self.get_activation()
            out = torch.cat([convex_act(x.T[:convex_idx]), concave_act(x.T[convex_idx:concave_idx]), saturated_act(x.T[concave_idx:saturated_idx])], dim=0).T

        return out


class MonoNet_bdd(nn.Module):
    def __init__(self, context_size):
        super().__init__()
        self.context_size = context_size
        self.linear = nn.Linear(self.context_size, 1, bias=False)
        self.mono = nn.Sequential(
            MonoBlock(1+self.context_size, 32, mono_indicator=[*[0]*self.context_size, -1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )
        for m in self.modules():
            if isinstance(m, MonoBlock):
                torch.nn.init.uniform_(m.W)
                torch.nn.init.uniform_(m.b, a=-1, b=1)
            elif isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight, a=-1, b=1)

    def coxfunc(self, x):
        return torch.exp(self.linear(x))

    def forward(self, x, p):
        cat = torch.cat([x, p.reshape(-1, 1)], dim=1)
        S_x = torch.sigmoid(self.mono(cat))

        cox = self.coxfunc(x)
        S_0 = S_x**(1/cox)
        return S_x, S_0, cox


class MonoNet_mul(nn.Module):
    def __init__(self, context_size):
        super().__init__()
        self.context_size = context_size
        self.linear = nn.Linear(self.context_size, 1, bias=False)
        self.mono = nn.Sequential(
            MonoBlock(1+self.context_size, 32, mono_indicator=[*[0]*self.context_size, 1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )
        for m in self.modules():
            if isinstance(m, MonoBlock):
                torch.nn.init.uniform_(m.W)
                torch.nn.init.uniform_(m.b, a=-1, b=1)
            elif isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight, a=-1, b=1)

    def coxfunc(self, x):
        return torch.exp(self.linear(x))

    def origin(self, x):
        result = torch.log(3*x+1) / (0.01 + torch.log(3*x+1))
        return result.reshape(-1, 1)

    def forward(self, x, p):
        cat = torch.cat([x, p.reshape(-1, 1)], dim=1)
        out = self.mono(cat)
        L_x = torch.where(out>=1, out, torch.exp(out-1)) * self.origin(p)

        cox = self.coxfunc(x)
        L_0 = L_x / cox

        return L_x, L_0, cox

common_param = {'learning_rate' : 0.01,
                'context_size' : 5,
                'num_epoch' : 5000,
                'num_sample' : 10000}
num_repl = 100

The following trains $S(p_t|\mathbf{x}_t)$.

# Model training _ estimating S _ replication
y_rate_SC = []
S_errorC = []
S_betaC_x = [[0] for _ in range(common_param['context_size'])]
S_betaC_y = [[] for _ in range(common_param['context_size'])]
S_betaC_res = []

p = np.linspace(0.1, 30, 200)
P = torch.tensor(p, dtype=torch.float).reshape(-1, 1).to(device)
X = torch.zeros((200, common_param['context_size']), dtype=torch.float).to(device)
baseX = torch.zeros((common_param['num_sample'], common_param['context_size']), dtype=torch.float).to(device)
gtC = st.gamma.cdf(p, a=2, loc=0, scale=5).reshape(-1, 1)

for idx in range(1, num_repl+1):
    RandomFix(idx)
    print(f"====={idx} Replication =====")

    modelS = MonoNet_bdd(common_param['context_size'])
    modelS.to(device)
    optimizer = torch.optim.Adam(modelS.parameters(), lr=common_param['learning_rate'])

    if idx == 1:
        for m in modelS.modules():
            if isinstance(m, nn.Linear):
                for i in range(common_param['context_size']):
                    S_betaC_y[i].append(m.weight[0].cpu().detach().numpy()[i])

    train_x, train_p, train_y = generate_dataC(common_param['num_sample'])
    y_rate_SC.append(sum(train_y) / common_param['num_sample'])

    train_x = train_x.to(device)
    train_p = train_p.to(device)
    train_y = train_y.reshape(-1, 1).to(device)

    for epoch in range(1, common_param['num_epoch']+1):
        modelS.train()

        S_x, S_0, cox = modelS(train_x, train_p)
        S_00 = modelS(baseX, train_p)[0]

        loss1 = torch.mean(-1 * train_y * torch.log(S_x) - (1-train_y) * torch.log(1 - S_x))
        loss2 = torch.mean((S_00**cox - S_x)**2)

        train_loss = loss1 + loss2
        if epoch % 100 == 0:
            print(f"  [Epoch {epoch}] Train loss : {train_loss}")

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if idx == 1:
            for m in modelS.modules():
                if isinstance(m, nn.Linear):
                    cand = m.weight[0].cpu().detach().numpy()
                    for i in range(common_param['context_size']):
                        if cand[i] != S_betaC_y[i][-1]:
                            S_betaC_x[i].append(epoch)
                            S_betaC_y[i].append(cand[i])

    for m in modelS.modules():
        if isinstance(m, nn.Linear):
            S_betaC_res.append(m.weight[0].cpu().detach().numpy())

    F_0 = 1 - modelS(X, P)[1]
    S_errorC.append(gtC - F_0.cpu().detach().numpy())

fig = plt.figure()
fig.set_figwidth(20)

ax1 = fig.add_subplot(121)
ax1.plot(p, gtC, 'r')
ax1.plot(p, -1 * sum(S_errorC)/len(S_errorC) + gtC, color='b')

ax2 = fig.add_subplot(122)
for i in range(common_param['context_size']):
    ax2.plot(S_betaC_x[i], S_betaC_y[i])

fig2 = plt.figure()
fig2.set_figwidth(20)

S_betaC_res = np.array(S_betaC_res).T
for i in range(common_param['context_size']):
    ax = fig2.add_subplot(1, common_param['context_size'], i+1)
    ax.hist(S_betaC_res[i], bins=100)


print(f"y_t=1 ratio : {sum(y_rate_SC).numpy()/len(y_rate_SC)}")
print(f"Std(RMSE) : {np.sqrt(np.mean(sum(val**2 for val in S_errorC)))}")

Both $F_0(p_t)$ and $\beta$ were learned well. Each component of $\beta$ also ended up close to its true value $\beta_i$. The same procedure was then carried out for $\Lambda(p_t|\mathbf{x}_t)$.

# Model training _ estimating L _ replication
y_rate_LC = []
L_errorC = []
L_betaC_x = [[0] for _ in range(common_param['context_size'])]
L_betaC_y = [[] for _ in range(common_param['context_size'])]
L_betaC_res = []

p = np.linspace(0.1, 30, 200)
P = torch.tensor(p, dtype=torch.float).reshape(-1, 1).to(device)
X = torch.zeros((200, common_param['context_size']), dtype=torch.float).to(device)
baseX = torch.zeros((common_param['num_sample'], common_param['context_size']), dtype=torch.float).to(device)
gtC = st.gamma.cdf(p, a=2, loc=0, scale=5).reshape(-1, 1)

for idx in range(1, num_repl+1):
    RandomFix(idx)
    print(f"====={idx} Replication =====")

    breaker = False
    modelL = MonoNet_mul(common_param['context_size'])
    modelL.to(device)
    optimizer = torch.optim.Adam(modelL.parameters(), lr=common_param['learning_rate'])

    if idx == 1:
        for m in modelL.modules():
            if isinstance(m, nn.Linear):
                for i in range(common_param['context_size']):
                    L_betaC_y[i].append(m.weight[0].cpu().detach().numpy()[i])

    train_x, train_p, train_y = generate_dataC(common_param['num_sample'])

    train_x = train_x.to(device)
    train_p = train_p.to(device)
    train_y = train_y.reshape(-1, 1).to(device)

    for epoch in range(1, common_param['num_epoch']+1):
        modelL.train()

        L_x, L_0, cox = modelL(train_x, train_p)
        L_00 = modelL(baseX, train_p)[0]

        loss1 = torch.mean(train_y * L_x - (1-train_y) * torch.log(1 - torch.exp(-L_x)))
        loss2 = torch.mean((torch.exp(-L_00 * cox) - torch.exp(-L_x))**2)

        train_loss = loss1 + loss2
        if torch.any(torch.isnan(train_loss)):
            breaker = True
            break
        if epoch % 100 == 0:
            print(f"  [Epoch {epoch}] Train loss : {train_loss}")

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        if idx == 1:
            for m in modelL.modules():
                if isinstance(m, nn.Linear):
                    cand = m.weight[0].cpu().detach().numpy()
                    for i in range(common_param['context_size']):
                        if cand[i] != L_betaC_y[i][-1]:
                            L_betaC_x[i].append(epoch)
                            L_betaC_y[i].append(cand[i])
    if breaker:
        continue

    for m in modelL.modules():
        if isinstance(m, nn.Linear):
            L_betaC_res.append(m.weight[0].cpu().detach().numpy())

    F_0 = 1 - torch.exp(-modelL(X, P)[1])
    L_errorC.append(gtC - F_0.cpu().detach().numpy())
    y_rate_LC.append(sum(train_y.cpu().detach()) / common_param['num_sample'])

fig = plt.figure()
fig.set_figwidth(20)

ax1 = fig.add_subplot(121)
ax1.plot(p, gtC, 'r')
ax1.plot(p, -1 * sum(L_errorC)/len(L_errorC) + gtC, color='b')

ax2 = fig.add_subplot(122)
for i in range(common_param['context_size']):
    ax2.plot(L_betaC_x[i], L_betaC_y[i])

fig2 = plt.figure()
fig2.set_figwidth(20)

L_betaC_res = np.array(L_betaC_res).T
for i in range(common_param['context_size']):
    ax = fig2.add_subplot(1, common_param['context_size'], i+1)
    ax.hist(L_betaC_res[i], bins=100)


print(f"y_t=1 ratio : {sum(y_rate_LC).numpy()/len(y_rate_LC)}")
print(f"Std(RMSE) : {np.sqrt(np.mean(sum(val**2 for val in L_errorC)))}")

A similar training result was obtained here as well.

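Experiment 3. Comparison with the ic_sp model

The ic_sp model in R's icenReg package addresses the same task as the experiments so far. Given the contextual vector $\mathbf{x}_t$ together with $p_t$ and $y_t$, ic_sp yields $F_0(p_t)$ and the effect of each variable in $\mathbf{x}_t$ (that is, the weight values $\beta$).

Let us train both models on the same data and compare the results. Fix the random seed, then generate and save the data.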

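# Data generation
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module

import numpy as np
import pandas as pd
import scipy.stats as st
import matplotlib.pyplot as plt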

# Seed 고정 함수
def RandomFix(random_seed):
    torch.manual_seed(random_seed)
    torch.cuda.manual_seed(random_seed)
    np.random.seed(random_seed)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

class RV_Sampler(st.rv_continuous):
    def __init__(self, cox):
        super().__init__()
        self.cox = cox

    def _cdf(self, x):
        return 1 - (st.gamma.sf(x=x, a=2, loc=0, scale=5))**self.cox

def generate_dataC(num_sample):
    beta = np.array([-2, 0.3, -0.5, 1.5, 0])
    x_t = np.random.randn(len(beta), num_sample)

    cox_t = np.exp(np.dot(beta, x_t))
    v_t = np.array([RV_Sampler(cox=val).rvs(size=1)[0] for val in cox_t])
    p_t = st.uniform.rvs(size=num_sample) * 30
    y_t = (v_t >= p_t).astype(int)

    return x_t.T, p_t.T, y_t.T

# Convert the data to csv
train_x, train_p, train_y = generate_dataC(10000)
data = np.concatenate([train_x, train_p.reshape(-1, 1), train_y.reshape(-1, 1)], axis=1)

data_pd = pd.DataFrame(data, columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'p', 'y'])
# data_pd.to_csv('data_pd.csv', index=False)
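RV_Sampler draws each sample by numerically inverting a custom `_cdf`, which can be slow for 10,000 samples. As a rough sketch of an alternative (not the method used in this post), the proportional-hazards relation $S(v|\mathbf{x}) = S_0(v)^{\exp(\beta^\top\mathbf{x})}$ allows inverse-transform sampling: draw $U \sim \text{Uniform}(0,1)$ and solve $S_0(v) = U^{1/\text{cox}}$. The names `baseline_sf` and `sample_wtp` are hypothetical, and the closed form assumes the same Gamma(shape 2, scale 5) baseline as above.

```python
import numpy as np

def baseline_sf(v, scale=5.0):
    """Survival function of Gamma(shape=2, scale): S_0(v) = exp(-v/scale) * (1 + v/scale)."""
    z = np.asarray(v, dtype=float) / scale
    return np.exp(-z) * (1.0 + z)

def sample_wtp(cox, rng, scale=5.0, n_iter=60):
    """Draw one WTP per entry of `cox` by solving S_0(v) = U**(1/cox) with bisection."""
    cox = np.asarray(cox, dtype=float)
    u = rng.uniform(size=cox.shape)
    target = u ** (1.0 / cox)   # S(v|x) = S_0(v)**cox = U  <=>  S_0(v) = U**(1/cox)
    lo = np.zeros_like(target)
    hi = np.ones_like(target)
    # Grow the upper bracket until S_0(hi) <= target for every sample
    while np.any(baseline_sf(hi, scale) > target):
        hi = np.where(baseline_sf(hi, scale) > target, 2.0 * hi, hi)
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        too_small = baseline_sf(mid, scale) > target   # S_0 is decreasing in v
        lo = np.where(too_small, mid, lo)
        hi = np.where(too_small, hi, mid)
    return 0.5 * (lo + hi)

rng = np.random.default_rng(0)
v_base = sample_wtp(np.ones(20000), rng)        # cox = 1 reproduces the baseline
v_fast = sample_wtp(np.full(20000, 2.0), rng)   # cox = 2 doubles the hazard
```

Under these assumptions, `sample_wtp(np.exp(np.dot(beta, x_t)), rng)` could stand in for the list comprehension over `RV_Sampler` in `generate_dataC`.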

We estimate $S_0(p_t)$ and $\beta$ using MonoNet.

class MonoBlock(Module):
    def __init__(self, in_feature:int, out_feature:int, mono_indicator = 'inc', activation = 'none', activation_partition = (0,0,1)):
        super().__init__()
        self.activation = activation
        self.activation_partition = activation_partition

        self.in_feature = in_feature
        self.out_feature = out_feature
        self.mono_indicator = mono_indicator

        self.W = Parameter(torch.randn(self.in_feature, self.out_feature))
        self.b = Parameter(torch.randn(self.out_feature))

    def get_activation(self):
        convex = getattr(F, self.activation)
        def concave(x):
            return -convex(-x)
        def saturated(x):
            plus = -convex(-x+torch.ones_like(x)) + convex(torch.ones_like(x))
            minus = convex(x+torch.ones_like(x)) - convex(torch.ones_like(x))
            return torch.where(x >= 0, plus, minus)
        return convex, concave, saturated

    def activation_index(self, x):
        if sum(self.activation_partition) != 1:
            raise ValueError(f"sum of activation_partition must be 1")
        if len(self.activation_partition) != 3:
            raise ValueError(f"length of activation_partition must be 3")

        convex_num = int(self.activation_partition[0] * len(x.T))
        concave_num = int(self.activation_partition[1] * len(x.T))
        return convex_num, convex_num+concave_num, len(x.T)

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        if self.mono_indicator == 'inc':
            self.mono_indicator = torch.ones(x.shape[1])
        if x.shape[1] != self.in_feature:
            raise ValueError(f"matrix multiplication cannot be implemented : {x.shape[0]}x{x.shape[1]} and {self.in_feature}x{self.out_feature}")
        if len(self.mono_indicator) != self.in_feature:
            raise ValueError(f"number of variable does not match : {len(self.mono_indicator)} and {self.in_feature}")

        mono_oper = torch.tensor(self.mono_indicator).to(device).reshape(-1, 1) * torch.abs(self.W)
        W_oper = torch.where(torch.abs(mono_oper) >= torch.abs(self.W), mono_oper, self.W)
        x = torch.matmul(x, W_oper) + self.b

        convex_idx, concave_idx, saturated_idx = self.activation_index(x)
        if self.activation == 'none':
            out = torch.cat([x.T[:convex_idx], x.T[convex_idx:concave_idx], x.T[concave_idx:saturated_idx]], dim=0).T
        else:
            convex_act, concave_act, saturated_act = self.get_activation()
            out = torch.cat([convex_act(x.T[:convex_idx]), concave_act(x.T[convex_idx:concave_idx]), saturated_act(x.T[concave_idx:saturated_idx])], dim=0).T

        return out


class MonoNet_bdd(nn.Module):
    def __init__(self, context_size):
        super().__init__()
        self.context_size = context_size
        self.linear = nn.Linear(self.context_size, 1, bias=False)
        self.mono = nn.Sequential(
            MonoBlock(1+self.context_size, 32, mono_indicator=[*[0]*self.context_size, -1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )
        for m in self.modules():
            if isinstance(m, MonoBlock):
                torch.nn.init.uniform_(m.W)
                torch.nn.init.uniform_(m.b, a=-1, b=1)
            elif isinstance(m, nn.Linear):
                torch.nn.init.uniform_(m.weight, a=-1, b=1)

    def coxfunc(self, x):
        return torch.exp(self.linear(x))

    def forward(self, x, p):
        cat = torch.cat([x, p.reshape(-1, 1)], dim=1)
        S_x = torch.sigmoid(self.mono(cat))

        cox = self.coxfunc(x)
        S_0 = S_x**(1/cox)
        return S_x, S_0, cox

# Estimate S_0 (equivalently F_0 = 1 - S_0) with MonoNet
comparison = {'learning_rate' : 0.01,
              'num_epoch' : 5000,
              'num_sample' : 10000,
              'context_size' : 5}

RandomFix(1)
baseX = torch.zeros((comparison['num_sample'], comparison['context_size']), dtype=torch.float).to(device)

modelS = MonoNet_bdd(comparison['context_size'])
modelS.to(device)
optimizer = torch.optim.Adam(modelS.parameters(), lr=comparison['learning_rate'])

train_x = torch.tensor(train_x, dtype=torch.float).to(device)
train_p = torch.tensor(train_p.reshape(-1, 1), dtype=torch.float).to(device)
train_y = torch.tensor(train_y.reshape(-1, 1), dtype=torch.float).to(device)

for epoch in range(1, comparison['num_epoch']+1):
    modelS.train()

    S_x, S_0, cox = modelS(train_x, train_p)
    S_00 = modelS(baseX, train_p)[0]

    loss1 = torch.mean(-1 * train_y * torch.log(S_x) - (1-train_y) * torch.log(1 - S_x))
    loss2 = torch.mean((S_00**cox - S_x)**2)

    train_loss = loss1 + loss2
    if epoch % 100 == 0:
        print(f"  [Epoch {epoch}] Train loss : {train_loss}")

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

for m in modelS.modules():
    if isinstance(m, nn.Linear):
        print(m.weight[0].cpu().detach().numpy())

p = np.linspace(0.1, 30, 200)
P = torch.tensor(p, dtype=torch.float).reshape(-1, 1).to(device)
X = torch.zeros((200, comparison['context_size']), dtype=torch.float).to(device)
# Ground-truth baseline survival S_0(p) = 1 - F_0(p), matching the model output S_x at x = 0
gtS = st.gamma.sf(p, a=2, loc=0, scale=5).reshape(-1, 1)

S_0 = modelS(X, P)[0].cpu().detach().numpy()
plt.plot(p, gtS, color='r')
plt.plot(p, S_0, color='b')

The estimated results are as follows.

Estimated $\beta : (-2.0889125,~~0.3212827,~~-0.5420359,~~1.5514011,~~0.01493187)$

Next, we fit the ic_sp model from the icenReg package on the same data. The following code is written in R.

install.packages('icenReg')
library('icenReg')
library("utils")

data_pd <- read.csv('/content/data_pd.csv')
data_pd$p <- as.numeric(as.character(data_pd$p))
data_pd$y <- 3 * (1 - data_pd$y )
data_pd$q <- data_pd$p

data_pd$q[data_pd$y == 0] <- Inf
data_pd$p[data_pd$y == 3] <- 0

data_pd

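The output shows part of data_pd.
To mark the data as type-I interval censored, the y column was recoded (0 for right-censored rows, 3 for interval-censored rows) and a q column was added so that [p, q] forms the interval.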
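For readers following along in Python, the same encoding can be sketched as below. This mirrors the R preprocessing as I read it: a purchase ($y_t=1$) means the WTP is at least $p_t$, so the interval is $[p_t, \infty)$; no purchase means the WTP lies in $[0, p_t]$. The helper name `to_interval` and the sample records are hypothetical.

```python
import math

def to_interval(p, y):
    """Map one posted price p and purchase indicator y to a WTP interval (left, right)."""
    if y == 1:
        return (p, math.inf)   # purchased: WTP >= p (right-censored)
    return (0.0, p)            # not purchased: WTP < p (interval-censored on [0, p])

# Illustrative records (price, purchased?)
records = [(12.5, 1), (3.0, 0), (20.0, 0)]
intervals = [to_interval(p, y) for p, y in records]
print(intervals)
```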

fit_ph <- ic_sp(formula = Surv(time=p, time2=q, event=y, type='interval') ~ x1+x2+x3+x4+x5, model = 'ph', bs_samples = 100, data = data_pd)

newdata <- data.frame(x1 = c(0),
                      x2 = c(0),
                      x3 = c(0),
                      x4 = c(0),
                      x5 = c(0))
alpha = 2
theta = 5

plot(data.frame(x = 0:300 / 10, prob = 1 - pgamma(q = 0:300 / 10, shape = alpha, scale = theta, lower.tail = TRUE)), lwd=2, type="l", col='red')
par(new=TRUE)
plot(fit_ph, newdata, lwd=2, col='blue')

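The model was fitted and the curve for $\mathbf{x}_t=\mathbf{0}$ was plotted.
For easier visual comparison, the two curves are overlaid.

The red curve is the true $S_0(p_t)$ and the blue curve is the estimated $S_0(p_t)$.
Estimated $\beta : (-2.112000,~~0.291100,~~-0.528000,~~1.603000,~~0.008823)$

The two models give similar results, although by visual inspection the ic_sp model appears to estimate $F_0(p_t)$ more closely. A deeper Neural Network may be able to close this gap.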
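The visual comparison concerns the baseline curve. For the coefficients, the two estimates can also be compared numerically against the true $\beta$ used in `generate_dataC`. A small sketch with the values copied from above (the helper name `abs_errors` is hypothetical):

```python
true_beta = [-2.0, 0.3, -0.5, 1.5, 0.0]
beta_mononet = [-2.0889125, 0.3212827, -0.5420359, 1.5514011, 0.01493187]
beta_ic_sp = [-2.112000, 0.291100, -0.528000, 1.603000, 0.008823]

def abs_errors(estimate, truth):
    """Coordinate-wise absolute error between an estimate and the true coefficients."""
    return [abs(e - t) for e, t in zip(estimate, truth)]

summary = {}
for name, est in [("MonoNet", beta_mononet), ("ic_sp", beta_ic_sp)]:
    err = abs_errors(est, true_beta)
    summary[name] = (sum(err) / len(err), max(err))
    print(f"{name}: mean |error| = {summary[name][0]:.4f}, max |error| = {summary[name][1]:.4f}")
```

On these numbers, both methods recover every coordinate of $\beta$ to within about 0.11. This is a single run on one dataset, so it should not be read as a general ranking.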

5. Conclusion

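So far, we have designed a Neural Network model for Contextual Pricing and run experiments with it. After the last experiment, a natural question arises: if a good model such as ic_sp already exists, is there any point in implementing this with a Neural Network?

I believe the value lies in the flexibility of the Neural Network. For example, adding a term such as $\lambda\sum\beta$ to the Loss Function makes it possible to roughly control the range of the estimates, and other adjustments of this kind are possible.

The point that still needs thought in the current Neural Network is identifiability. Unlike a typical Neural Network problem that estimates a single function, here two objects ($S_0(p_t)$ and $\beta$) must each be estimated accurately. $\text{Loss1}$ alone, which only makes the network fit $S(p_t|\mathbf{x}_t)$, cannot separate the two, and that role is played by $\text{Loss2}$. Whether this construction of the $\text{Loss}$ theoretically guarantees identifiability remains to be discussed.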
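To make the roles of the two terms concrete, here is a small NumPy sketch of what each one measures. It is only an illustration on toy curves (the exponential baseline and the constant `cox` value are assumptions), not a proof of identifiability: Loss2 is zero exactly when the fitted $S(p|\mathbf{x})$ equals $S_0(p)^{\exp(\beta^\top\mathbf{x})}$ on the sample, and positive for a curve that ignores that structure.

```python
import numpy as np

def loss1(S_x, y):
    """Binary cross-entropy between y = 1{WTP >= p} and the predicted survival S(p|x)."""
    return np.mean(-y * np.log(S_x) - (1 - y) * np.log(1 - S_x))

def loss2(S_x, S_00, cox):
    """Squared violation of the proportional-hazards structure S(p|x) = S_0(p)**cox."""
    return np.mean((S_00 ** cox - S_x) ** 2)

p = np.linspace(0.5, 20, 50)
S_00 = np.exp(-p / 10)                # toy baseline survival (assumption)
cox = np.full_like(p, np.exp(0.7))    # toy exp(beta.x), same for every row (assumption)

S_ph = S_00 ** cox                    # respects the PH structure exactly
S_other = 1 / (1 + p / 10)            # a valid decreasing curve that ignores it

rng = np.random.default_rng(0)
y = (rng.uniform(size=p.shape) < S_ph).astype(float)   # simulated purchase indicators

print("Loss1 on PH curve   :", loss1(S_ph, y))
print("Loss2 on PH curve   :", loss2(S_ph, S_00, cox))
print("Loss2 on other curve:", loss2(S_other, S_00, cox))
```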

[Week 6] [Code Implementation] Advanced Monotonic Neural Network with PyTorch

Sat, 10 Feb 2024 12:56:44 GMT

1. Introduction

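The reason for studying and implementing the Monotone Neural Network is to use it to estimate a cumulative distribution function (CDF). Every CDF is monotone increasing and bounded (between 0 and 1), so we add a technique that forces the output of the Monotone Neural Network to lie between 0 and 1.

We also implement a Monotone Neural Network that passes through the origin. This is needed to estimate the Cumulative Hazard rate in Cox Regression; the details are covered in Week 7.

At this stage we only deal with increasing functions whose input and output are both one-dimensional. Each technique should extend naturally to higher dimensions.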

2. Bounded Monotone Neural Network

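To make the model output at least 0 (or strictly greater) and at most 1 (or strictly less), we compose it with an increasing function $\mathbb{R} \rightarrow (0, 1)$. Typical choices are the Sigmoid function and the Tanh function (with appropriate scaling).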
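As a quick sanity check of this idea, the sketch below verifies on a grid that both candidate squashing functions are strictly increasing and stay inside $(0, 1)$, using only the standard library. The helper names are hypothetical. It also checks the identity ${1 \over 2}\tanh z + {1 \over 2} = \text{sigmoid}(2z)$, which means the two choices differ only by a scaling of the input.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def scaled_tanh(z):
    """Tanh rescaled from (-1, 1) to (0, 1)."""
    return 0.5 * math.tanh(z) + 0.5

grid = [i / 10 for i in range(-50, 51)]
for squash in (sigmoid, scaled_tanh):
    values = [squash(z) for z in grid]
    assert all(0.0 < v < 1.0 for v in values)               # bounded in (0, 1)
    assert all(a < b for a, b in zip(values, values[1:]))   # strictly increasing

# The two choices differ only by an input scaling: 0.5*tanh(z) + 0.5 == sigmoid(2z)
assert all(abs(scaled_tanh(z) - sigmoid(2 * z)) < 1e-12 for z in grid)
```

Because composing an increasing network with an increasing squashing function is again increasing, either choice keeps monotonicity.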

First, let us bring in the Monotonic Dense Block.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module

import numpy as np
import matplotlib.pyplot as plt

# MonoBlock implementation
## activation: choose one of relu, elu, selu, none
class MonoBlock(Module):
    def __init__(self, in_feature:int, out_feature:int, mono_indicator = 'inc', activation = 'none', activation_partition = (0,0,1)):
        super().__init__()
        self.activation = activation
        self.activation_partition = activation_partition

        self.in_feature = in_feature
        self.out_feature = out_feature
        self.mono_indicator = mono_indicator

        self.W = Parameter(torch.randn(self.in_feature, self.out_feature))
        self.b = Parameter(torch.randn(self.out_feature))

    def get_activation(self):
        convex = getattr(F, self.activation)
        def concave(x):
            return -convex(-x)
        def saturated(x):
            plus = -convex(-x+torch.ones_like(x)) + convex(torch.ones_like(x))
            minus = convex(x+torch.ones_like(x)) - convex(torch.ones_like(x))
            return torch.where(x >= 0, plus, minus)
        return convex, concave, saturated

    def activation_index(self, x):
        if sum(self.activation_partition) != 1:
            raise ValueError(f"sum of activation_partition must be 1")
        if len(self.activation_partition) != 3:
            raise ValueError(f"length of activation_partition must be 3")

        convex_num = int(self.activation_partition[0] * len(x.T))
        concave_num = int(self.activation_partition[1] * len(x.T))
        return convex_num, convex_num+concave_num, len(x.T)

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        if self.mono_indicator == 'inc':
            self.mono_indicator = torch.ones(x.shape[1])
        if x.shape[1] != self.in_feature:
            raise ValueError(f"matrix multiplication cannot be implemented : {x.shape[0]}x{x.shape[1]} and {self.in_feature}x{self.out_feature}")
        if len(self.mono_indicator) != self.in_feature:
            raise ValueError(f"number of variable does not match : {len(self.mono_indicator)} and {self.in_feature}")

        mono_oper = torch.tensor(self.mono_indicator).reshape(-1, 1) * torch.abs(self.W)
        W_oper = torch.where(torch.abs(mono_oper) >= torch.abs(self.W), mono_oper, self.W)
        x = torch.matmul(x, W_oper) + self.b

        convex_idx, concave_idx, saturated_idx = self.activation_index(x)
        if self.activation == 'none':
            out = torch.cat([x.T[:convex_idx], x.T[convex_idx:concave_idx], x.T[concave_idx:saturated_idx]], dim=0)
        else:
            convex_act, concave_act, saturated_act = self.get_activation()
            out = torch.cat([convex_act(x.T[:convex_idx]), concave_act(x.T[convex_idx:concave_idx]), saturated_act(x.T[concave_idx:saturated_idx])], dim=0).T

        return out

We build a Monotone Neural Network and compose the Sigmoid or Tanh function at the very end. So that its values fall between 0 and 1, Tanh is used in the form ${1 \over 2}\tanh x + {1 \over 2}$.

# Bounded (0~1) MonoNet implementation
## Sigmoid
class BddMonoNet_Sig(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 4, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(4, 1)
        )

    def forward(self, x):
        return torch.sigmoid(self.mono(x))

## Tanh
class BddMonoNet_Tanh(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 4, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(4, 1)
        )

    def forward(self, x):
        return 0.5 * torch.tanh(self.mono(x)) + 0.5

Model validation

Let us check experimentally that the techniques above work. We generate data from two synthetic functions, train each model on them, and inspect the predictions.

Red graph: $y = {0.3 \over 1+e^{-x}} + {0.5 \over 1+e^{-(5x-2)}} + {0.2 \over 1+e^{-(2x+5)}}$
Blue graph: $y=1-e^{-2x}~~(x \ge 0)$

# Fix the seed for reproducibility
random_seed = 42
torch.manual_seed(random_seed)
np.random.seed(random_seed)

# Data generation: train, valid, test
def data_generate1(num_sample, noise):
    X = np.random.uniform(-4, 4, num_sample)
    Y1 = 0.3 / (1 + np.exp(-X))
    Y2 = 0.5 / (1 + np.exp(-5*X+2))
    Y3 = 0.2 / (1 + np.exp(-2*X-5))
    Y = Y1 + Y2 + Y3 + noise * np.random.normal(0, 1, num_sample)
    return torch.tensor(X, dtype=torch.float), torch.tensor(Y, dtype=torch.float)

def data_generate2(num_sample, noise):
    X = np.random.uniform(0, 8, num_sample)
    Y = 1 - np.exp(-2*X) + noise * np.random.normal(0, 1, num_sample)
    return torch.tensor(X, dtype=torch.float), torch.tensor(Y, dtype=torch.float)

train_x1, train_y1 = data_generate1(800, 0.1)
train_x2, train_y2 = data_generate2(800, 0.1)

valid_x1, valid_y1 = data_generate1(100, 0)
valid_x2, valid_y2 = data_generate2(100, 0)

test_x1, test_y1 = data_generate1(100, 0)
test_x2, test_y2 = data_generate2(100, 0)

Both models are trained on the generated data.

# Model training
## BddMonoNet_Sig
my_param_sig = {'learning_rate' : 0.001,
            'num_epoch' : 10000}

## Dataset 1
model_sig1 = BddMonoNet_Sig()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_sig1.parameters(), lr=my_param_sig['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_sig['num_epoch']+1):
    model_sig1.train()

    output = model_sig1(train_x1)
    train_loss = criterion(output, train_y1)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_sig1.eval()
        if epoch % 500 == 0:
            valid_output = model_sig1(valid_x1)
            valid_loss = criterion(valid_output, valid_y1)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")


## Dataset 2
model_sig2 = BddMonoNet_Sig()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_sig2.parameters(), lr=my_param_sig['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_sig['num_epoch']+1):
    model_sig2.train()

    output = model_sig2(train_x2)
    train_loss = criterion(output, train_y2)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_sig2.eval()
        if epoch % 500 == 0:
            valid_output = model_sig2(valid_x2)
            valid_loss = criterion(valid_output, valid_y2)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")

# Model training
## BddMonoNet_Tanh
my_param_tanh = {'learning_rate' : 0.001,
            'num_epoch' : 10000}

## Dataset 1
model_tanh1 = BddMonoNet_Tanh()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_tanh1.parameters(), lr=my_param_tanh['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_tanh['num_epoch']+1):
    model_tanh1.train()

    output = model_tanh1(train_x1)
    train_loss = criterion(output, train_y1)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_tanh1.eval()
        if epoch % 500 == 0:
            valid_output = model_tanh1(valid_x1)
            valid_loss = criterion(valid_output, valid_y1)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")


## Dataset 2
model_tanh2 = BddMonoNet_Tanh()
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_tanh2.parameters(), lr=my_param_tanh['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_tanh['num_epoch']+1):
    model_tanh2.train()

    output = model_tanh2(train_x2)
    train_loss = criterion(output, train_y2)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_tanh2.eval()
        if epoch % 500 == 0:
            valid_output = model_tanh2(valid_x2)
            valid_loss = criterion(valid_output, valid_y2)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")
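The four training loops above share the same structure and differ only in the model and the data. As a possible refactoring (a sketch, not part of the original experiments), they could be folded into one helper. The function name `train_model` and the `log_every` parameter are hypothetical; the loss, optimizer, and logging follow the loops above. The toy `nn.Linear` run at the end only demonstrates the call.

```python
import torch
import torch.nn as nn

def train_model(model, train_x, train_y, valid_x, valid_y,
                learning_rate=0.001, num_epoch=10000, log_every=500):
    """Full-batch Adam training with MSE loss; returns the best validation loss seen."""
    criterion = nn.MSELoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
    best_valid_loss = float('inf')

    for epoch in range(1, num_epoch + 1):
        model.train()
        train_loss = criterion(model(train_x), train_y)

        optimizer.zero_grad()
        train_loss.backward()
        optimizer.step()

        # Validate only at logging epochs, as in the loops above
        if epoch % log_every == 0:
            model.eval()
            with torch.no_grad():
                valid_loss = criterion(model(valid_x), valid_y).item()
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")
            best_valid_loss = min(best_valid_loss, valid_loss)

    return best_valid_loss

# Toy usage on y = 2x with a plain linear layer (illustration only)
torch.manual_seed(0)
toy_x = torch.linspace(-1, 1, 64).reshape(-1, 1)
toy_y = 2 * toy_x
toy_best = train_model(nn.Linear(1, 1), toy_x, toy_y, toy_x, toy_y,
                       learning_rate=0.05, num_epoch=500, log_every=50)
print(f"[PyTorch] Best valid loss : {toy_best}")
```

With the models in this post, a call would look like `train_model(BddMonoNet_Sig(), train_x1, train_y1, valid_x1, valid_y1)`.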

The predictions of each model on the test data are shown below. Red points are the values of the true function and blue points are the model predictions.

# Evaluate the BddMonoNet_Sig and BddMonoNet_Tanh models
test_pred_sig1 = model_sig1(test_x1)
test_pred_sig2 = model_sig2(test_x2)
test_pred_tanh1 = model_tanh1(test_x1)
test_pred_tanh2 = model_tanh2(test_x2)

# Visualize the results
fig_cdf_test = plt.figure()
fig_cdf_test.set_figwidth(15)

ax_cdf_test_sig1 = fig_cdf_test.add_subplot(221)
ax_cdf_test_sig1.scatter(test_x1.detach(), test_y1.detach(), color='r')
ax_cdf_test_sig1.scatter(test_x1.detach(), test_pred_sig1[0].detach(), color='b')

ax_cdf_test_sig2 = fig_cdf_test.add_subplot(222)
ax_cdf_test_sig2.scatter(test_x2.detach(), test_y2.detach(), color='r')
ax_cdf_test_sig2.scatter(test_x2.detach(), test_pred_sig2[0].detach(), color='b')

ax_cdf_test_tanh1 = fig_cdf_test.add_subplot(223)
ax_cdf_test_tanh1.scatter(test_x1.detach(), test_y1.detach(), color='r')
ax_cdf_test_tanh1.scatter(test_x1.detach(), test_pred_tanh1[0].detach(), color='b')

ax_cdf_test_tanh2 = fig_cdf_test.add_subplot(224)
ax_cdf_test_tanh2.scatter(test_x2.detach(), test_y2.detach(), color='r')
ax_cdf_test_tanh2.scatter(test_x2.detach(), test_pred_tanh2[0].detach(), color='b')

ax_cdf_test_sig1.set_title('Data 1')
ax_cdf_test_sig2.set_title('Data 2')
ax_cdf_test_sig1.set_ylabel('Sigmoid', fontsize=14)
ax_cdf_test_tanh1.set_ylabel('Tanh', fontsize=14)

We can also confirm that each model satisfies the bounded condition.

# Bounded check
print(torch.all((0 <= test_pred_sig1) & (test_pred_sig1 <= 1)))
print(torch.all((0 <= test_pred_sig2) & (test_pred_sig2 <= 1)))
print(torch.all((0 <= test_pred_tanh1) & (test_pred_tanh1 <= 1)))
print(torch.all((0 <= test_pred_tanh2) & (test_pred_tanh2 <= 1)))

3. Origin-Passing Monotone Neural Network

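To make the network pass through the origin ($=\mathbf{0}$), we introduce the following three techniques and compare their effectiveness.

Remove all biases $~(N(x) = \sigma(Wx))$
Remove the bias at the last layer $~(N(x)=f(x)-f(0))$
Multiply by an increasing function that passes through the origin $~(N(x)=\sigma(f(x)) \cdot {\ln(3x+1) \over 0.01+\ln(3x+1)})$

It is not hard to check that all three methods pass through the origin $(0, 0)$ without breaking monotonicity (increasing). However, the third method can only be used on a non-negative domain, so in the experiments below it is trained on data sampled from the non-negative region.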
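The second and third constructions can be checked by hand on a toy example. The sketch below uses a hypothetical increasing function `f` in place of the network and verifies that $f(x)-f(0)$ vanishes at 0, and that the multiplier ${\ln(3x+1) \over 0.01+\ln(3x+1)}$ is 0 at the origin, increasing, and already above 0.99 at $x=1$.

```python
import math

def f(x):
    """Toy increasing function standing in for the monotone network (assumption)."""
    return 2.0 * x + 1.5

def shift_to_origin(x):
    """Method 2: N(x) = f(x) - f(0)."""
    return f(x) - f(0.0)

def origin_gate(x):
    """Multiplier of method 3: ln(3x+1) / (0.01 + ln(3x+1)), defined for x >= 0."""
    if x < 0:
        raise ValueError("x must be non-negative")
    t = math.log(3.0 * x + 1.0)
    return t / (0.01 + t)

xs = [i / 100 for i in range(0, 801)]          # grid on [0, 8]
gates = [origin_gate(x) for x in xs]

print(shift_to_origin(0.0))                    # value of method 2 at the origin
print(origin_gate(0.0), origin_gate(1.0))      # gate at the origin and at x = 1
```

The gate rises steeply near 0, so its effect on the fitted curve is concentrated in a small neighborhood of the origin.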

The imports and the Monotonic Dense Block (MonoBlock) are the same as in Section 2 and are reused here unchanged.

Removing all biases

Every bias is initialized to 0 and excluded from training by setting its requires_grad to False.

# Remove all biases: not trained, fixed at 0
class MonoNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 24, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(24, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )

    def forward(self, x):
        return self.mono(x)

model_nobias1 = MonoNet()
model_nobias2 = MonoNet()

def init_weights(m):
    if isinstance(m, MonoBlock):
        m.b.data.fill_(0)
        m.b.requires_grad_(False)

model_nobias1.apply(init_weights)
model_nobias2.apply(init_weights)
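Why removing every bias is enough: with no bias, each layer maps 0 to activation(0), and ELU satisfies $\text{ELU}(0)=0$ (the saturated variant used in MonoBlock also maps 0 to 0), so the whole network maps 0 to 0. The NumPy sketch below illustrates this with a hypothetical `bias_free_net` that uses plain ELU and non-negative weights. It is a simplified stand-in for MonoBlock, not the same implementation.

```python
import numpy as np

def elu(z):
    return np.where(z > 0, z, np.exp(np.minimum(z, 0.0)) - 1.0)

def bias_free_net(x, weights):
    """Forward pass with non-negative weights and no bias: each layer is elu(h @ |W|)."""
    h = np.asarray(x, dtype=float).reshape(-1, 1)
    for W in weights[:-1]:
        h = elu(h @ np.abs(W))
    return h @ np.abs(weights[-1])      # last layer without activation

rng = np.random.default_rng(0)
weights = [rng.normal(size=shape) for shape in [(1, 8), (8, 4), (4, 1)]]

xs = np.linspace(-4, 4, 81)
out = bias_free_net(xs, weights).ravel()
at_zero = bias_free_net([0.0], weights).item()

print("N(0) =", at_zero)
print("non-decreasing:", bool(np.all(np.diff(out) >= -1e-12)))
```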

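Removing the bias at the last layer

After the last layer, the origin method computes the network output at zero, $f(0)$, and this value is subtracted from the output.

# Remove the bias at the last layer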
class MonoNet_mbias(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 24, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(24, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )

    def origin(self, x):
        zero = torch.zeros_like(x)
        return self.mono(zero)

    def forward(self, x):
        return self.mono(x) - self.origin(x)

model_mbias1 = MonoNet_mbias()
model_mbias2 = MonoNet_mbias()

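Multiplying by an increasing function that passes through the origin

We multiply by a function that passes through the origin, forcing the product through the origin. To preserve monotonicity (increasing), the multiplier is itself an increasing function, and to keep its influence on training as small as possible we choose one that stays close to the constant function $(=1)$ away from the origin.

$y={\ln(3x+1) \over 0.01 + \ln(3x+1)}~~(x \ge 0)$

However, this can only be used on a domain of $x \ge 0$. To define the function on the negative region as well, its values there would have to be negative to keep monotonicity, which affects training considerably.

Also, unlike composition, a product with a positive increasing function is guaranteed to be increasing when the original function is itself positive and increasing. For this reason, the value produced by the last layer is first passed through the following function.

$y=\begin{cases}e^{x-1} & \text{if } x<1 \\ x & \text{if } x\ge1 \end{cases}$

Dynamic Pricing, the topic of Week 7, is always posed on a non-negative domain, so the third method can be used there as well.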
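The need for the positivity step can be seen on a toy example. In the sketch below (helper names and the toy function `f` are hypothetical), the threshold follows the one used in `MonoNet_mul.forward`: values of at least 1 are kept, and smaller values are mapped to $e^{z-1}$, which is positive, increasing, and continuous at $z=1$. Multiplying a function that takes negative values directly by the gate breaks monotonicity, while the transformed product stays increasing.

```python
import math

def positive_part(z):
    """z for z >= 1, exp(z - 1) otherwise: positive, increasing, continuous at z = 1."""
    return z if z >= 1.0 else math.exp(z - 1.0)

def origin_gate(x):
    """ln(3x+1) / (0.01 + ln(3x+1)) for x >= 0."""
    t = math.log(3.0 * x + 1.0)
    return t / (0.01 + t)

def f(x):
    """Toy increasing function that takes negative values (assumption)."""
    return x - 3.0

def is_increasing(values):
    return all(a <= b for a, b in zip(values, values[1:]))

xs = [i / 100 for i in range(0, 801)]                        # grid on [0, 8]
raw = [f(x) * origin_gate(x) for x in xs]                    # product without the transform
fixed = [positive_part(f(x)) * origin_gate(x) for x in xs]   # product with the transform

print("raw product increasing  :", is_increasing(raw))
print("fixed product increasing:", is_increasing(fixed))
```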

# Multiply by an increasing function that passes through the origin (defined only for non-negative inputs)
class MonoNet_mul(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 24, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(24, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 1)
        )

    def origin(self, x):
        if torch.any(x.T < 0):
            raise ValueError(f"input value 'x' should be non-negative")
        return torch.log(3*x.T+1) / (0.01 + torch.log(3*x.T+1))

    def forward(self, x):
        out = self.mono(x)
        return torch.where(out>=1, out, torch.exp(out-1)) * self.origin(x)


model_mul1 = MonoNet_mul()
model_mul2 = MonoNet_mul()

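Model validation

We generate the train, valid, and test data used for validation. The following two synthetic functions, both with one-dimensional input and output, are used.

Red graph: $y=\ln({x \over 5}+1)+x+\sin x$
Blue graph: $y={e^x \over x^2+1}-1$

# Fix the seed for reproducibility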
random_seed = 42
torch.manual_seed(random_seed)
np.random.seed(random_seed)

# Data generation: train, valid, test
def data_generate(num_sample, noise):
    X = np.random.uniform(-4, 4, num_sample)
    Y1 = np.log(X/5 + 1) + X + np.sin(X) + noise * np.random.normal(0, 1, num_sample)
    Y2 = np.exp(X) / (X**2 + 1) - 1 + noise * np.random.normal(0, 1, num_sample)
    return torch.tensor(X, dtype=torch.float), torch.tensor(Y1, dtype=torch.float), torch.tensor(Y2, dtype=torch.float)

def data_generate_pos(num_sample, noise):
    X = np.random.uniform(0, 8, num_sample)
    Y1 = np.log(X/5 + 1) + X + np.sin(X) + noise * np.random.normal(0, 1, num_sample)
    Y2 = np.exp(X) / (X**2 + 1) - 1 + noise * np.random.normal(0, 1, num_sample)
    return torch.tensor(X, dtype=torch.float), torch.tensor(Y1, dtype=torch.float), torch.tensor(Y2, dtype=torch.float)

train_x, train_y1, train_y2 = data_generate(800, 0.1)
valid_x, valid_y1, valid_y2 = data_generate(100, 0.1)
test_x, test_y1, test_y2 = data_generate(100, 0.1)

train_xp, train_yp1, train_yp2 = data_generate_pos(800, 0.1)
valid_xp, valid_yp1, valid_yp2 = data_generate_pos(100, 0.1)
test_xp, test_yp1, test_yp2 = data_generate_pos(100, 0.1)

Each of the three models is trained on both datasets.

# Model training
## model_nobias
my_param_nobias = {'learning_rate' : 0.001,
                   'num_epoch' : 10000}

## Dataset 1
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_nobias1.parameters(), lr=my_param_nobias['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_nobias['num_epoch']+1):
    model_nobias1.train()

    output = model_nobias1(train_x)
    train_loss = criterion(output, train_y1)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_nobias1.eval()
        if epoch % 500 == 0:
            valid_output = model_nobias1(valid_x)
            valid_loss = criterion(valid_output, valid_y1)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")


## Dataset 2
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_nobias2.parameters(), lr=my_param_nobias['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_nobias['num_epoch']+1):
    model_nobias2.train()

    output = model_nobias2(train_x)
    train_loss = criterion(output, train_y2)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_nobias2.eval()
        if epoch % 500 == 0:
            valid_output = model_nobias2(valid_x)
            valid_loss = criterion(valid_output, valid_y2)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")

# Model training
## model_mbias
my_param_mbias = {'learning_rate' : 0.001,
                   'num_epoch' : 10000}

## Dataset 1
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_mbias1.parameters(), lr=my_param_mbias['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_mbias['num_epoch']+1):
    model_mbias1.train()

    output = model_mbias1(train_x)
    train_loss = criterion(output, train_y1)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_mbias1.eval()
        if epoch % 500 == 0:
            valid_output = model_mbias1(valid_x)
            valid_loss = criterion(valid_output, valid_y1)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")


## Dataset 2
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_mbias2.parameters(), lr=my_param_mbias['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_mbias['num_epoch']+1):
    model_mbias2.train()

    output = model_mbias2(train_x)
    train_loss = criterion(output, train_y2)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_mbias2.eval()
        if epoch % 500 == 0:
            valid_output = model_mbias2(valid_x)
            valid_loss = criterion(valid_output, valid_y2)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")

# Model training
## model_mul
my_param_mul = {'learning_rate' : 0.001,
                   'num_epoch' : 10000}

## Dataset 1
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_mul1.parameters(), lr=my_param_mul['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_mul['num_epoch']+1):
    model_mul1.train()

    output = model_mul1(train_xp)
    train_loss = criterion(output, train_yp1)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_mul1.eval()
        if epoch % 500 == 0:
            valid_output = model_mul1(valid_xp)
            valid_loss = criterion(valid_output, valid_yp1)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")


## Dataset 2
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_mul2.parameters(), lr=my_param_mul['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param_mul['num_epoch']+1):
    model_mul2.train()

    output = model_mul2(train_xp)
    train_loss = criterion(output, train_yp2)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model_mul2.eval()
        if epoch % 500 == 0:
            valid_output = model_mul2(valid_xp)
            valid_loss = criterion(valid_output, valid_yp2)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")

The predictions of each model on the test data are as follows.

# Evaluate the BddMonoNet_Sig and BddMonoNet_Tanh models
test_pred_nobias1 = model_nobias1(test_x)
test_pred_nobias2 = model_nobias2(test_x)
test_pred_mbias1 = model_mbias1(test_x)
test_pred_mbias2 = model_mbias2(test_x)
test_pred_mul1 = model_mul1(test_xp)
test_pred_mul2 = model_mul2(test_xp)


# Visualize the results
fig_ori_test = plt.figure()
fig_ori_test.set_figwidth(15)
fig_ori_test.set_figheight(10)


ax_ori_test_nobias1 = fig_ori_test.add_subplot(321)
ax_ori_test_nobias1.scatter(test_x.detach(), test_y1.detach(), color='r')
ax_ori_test_nobias1.scatter(test_x.detach(), test_pred_nobias1.detach(), color='b')

ax_ori_test_nobias2 = fig_ori_test.add_subplot(322)
ax_ori_test_nobias2.scatter(test_x.detach(), test_y2.detach(), color='r')
ax_ori_test_nobias2.scatter(test_x.detach(), test_pred_nobias2.detach(), color='b')

ax_ori_test_mbias1 = fig_ori_test.add_subplot(323)
ax_ori_test_mbias1.scatter(test_x.detach(), test_y1.detach(), color='r')
ax_ori_test_mbias1.scatter(test_x.detach(), test_pred_mbias1.detach(), color='b')

ax_ori_test_mbias2 = fig_ori_test.add_subplot(324)
ax_ori_test_mbias2.scatter(test_x.detach(), test_y2.detach(), color='r')
ax_ori_test_mbias2.scatter(test_x.detach(), test_pred_mbias2.detach(), color='b')

ax_ori_test_mul1 = fig_ori_test.add_subplot(325)
ax_ori_test_mul1.scatter(test_xp.detach(), test_yp1.detach(), color='r')
ax_ori_test_mul1.scatter(test_xp.detach(), test_pred_mul1.detach(), color='b')

ax_ori_test_mul2 = fig_ori_test.add_subplot(326)
ax_ori_test_mul2.scatter(test_xp.detach(), test_yp2.detach(), color='r')
ax_ori_test_mul2.scatter(test_xp.detach(), test_pred_mul2.detach(), color='b')


ax_ori_test_nobias1.set_title('Data 1')
ax_ori_test_nobias2.set_title('Data 2')
ax_ori_test_nobias1.set_ylabel('No bias', fontsize=14)
ax_ori_test_mbias1.set_ylabel('Minus bias', fontsize=14)
ax_ori_test_mul1.set_ylabel('Multiple', fontsize=14)

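The model with every bias removed is less expressive than the others; as a result, it approximates a complicated function with a comparatively simple shape.
In the model that removes the bias at the last layer, the overall bias is governed by that last bias. If the model fails to learn the data near the origin, this bias becomes skewed, which in turn skews the whole model.
The model that multiplies by an increasing function is restricted to a domain of nonnegative inputs and may be distorted near 0. Its overall bias, however, is expected to be lower than that of the second method.

We can also check whether each model passes through the origin.

# Check that each model passes through the origin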
print(model_nobias2(torch.tensor([0], dtype=torch.float)) == torch.tensor([0]))
print(model_mbias1(torch.tensor([0], dtype=torch.float)) == torch.tensor([0]))
print(model_mul1(torch.tensor([0], dtype=torch.float)) == torch.tensor([0]))

[Week 5] [Code Implementation] Constrained Monotonic Neural Network with PyTorch

Sat, 10 Feb 2024 07:20:44 GMT

Original paper

Runje D, Shankaranarayana S, Constrained Monotonic Neural Network
References
[GitHub]$~$monotonic-nn

1. Introduction

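Estimating a monotone function is called isotonic regression. The defining property (constraint) of a monotone function is monotonicity: put simply, it must be either an increasing or a decreasing function. The Week 4 paper devises a technique for this and gives its theoretical justification. The authors also built a package on Keras, which we reimplement here in PyTorch. We then compare its performance with the isotonic regression provided by Sklearn.

2. Implementing the Monotonic Dense Block

We first implement the Monotonic Dense Block, the core building block of a Monotone Neural Network.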

import torch
import torch.nn.functional as F
from torch.nn.parameter import Parameter
from torch.nn.modules.module import Module

# activation: choose one of relu, elu, selu, none
class MonoBlock(Module):
    def __init__(self, in_feature:int, out_feature:int, mono_indicator = 'inc', activation = 'none', activation_partition = (0,0,1)):
        super().__init__()
        self.activation = activation
        self.activation_partition = activation_partition

        self.in_feature = in_feature
        self.out_feature = out_feature
        self.mono_indicator = mono_indicator

        self.W = Parameter(torch.randn(self.in_feature, self.out_feature))
        self.b = Parameter(torch.randn(self.out_feature))

    def get_activation(self):
        convex = getattr(F, self.activation)
        def concave(x):
            return -convex(-x)
        def saturated(x):
            plus = -convex(-x+torch.ones_like(x)) + convex(torch.ones_like(x))
            minus = convex(x+torch.ones_like(x)) - convex(torch.ones_like(x))
            return torch.where(x >= 0, plus, minus)
        return convex, concave, saturated

    def activation_index(self, x):
        if sum(self.activation_partition) != 1:
            raise ValueError(f"sum of activation_partition must be 1")
        if len(self.activation_partition) != 3:
            raise ValueError(f"length of activation_partition must be 3")

        convex_num = int(self.activation_partition[0] * len(x.T))
        concave_num = int(self.activation_partition[1] * len(x.T))
        return convex_num, convex_num+concave_num, len(x.T)

    def forward(self, x):
        if len(x.shape) == 1:
            x = x.reshape(-1, 1)
        if self.mono_indicator == 'inc':
            self.mono_indicator = torch.ones(x.shape[1])
        if x.shape[1] != self.in_feature:
            raise ValueError(f"matrix multiplication cannot be implemented : {x.shape[0]}x{x.shape[1]} and {self.in_feature}x{self.out_feature}")
        if len(self.mono_indicator) != self.in_feature:
            raise ValueError(f"number of variable does not match : {len(self.mono_indicator)} and {self.in_feature}")

        mono_oper = torch.tensor(self.mono_indicator).reshape(-1, 1) * torch.abs(self.W)
        W_oper = torch.where(torch.abs(mono_oper) >= torch.abs(self.W), mono_oper, self.W)
        x = torch.matmul(x, W_oper) + self.b

        convex_idx, concave_idx, saturated_idx = self.activation_index(x)
        if self.activation == 'none':
            out = torch.cat([x.T[:convex_idx], x.T[convex_idx:concave_idx], x.T[concave_idx:saturated_idx]], dim=0)
        else:
            convex_act, concave_act, saturated_act = self.get_activation()
            out = torch.cat([convex_act(x.T[:convex_idx]), concave_act(x.T[convex_idx:concave_idx]), saturated_act(x.T[concave_idx:saturated_idx])], dim=0).T

        return out

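There are three core pieces to implement.

get_activation(self)
- Given an activation function $\breve{\rho}$ that is zero-centered, monotonically increasing, convex, and lower-bounded, this method returns $\hat{\rho}$ and $\tilde{\rho}$ along with $\breve{\rho}$ itself.
- activation specifies $\breve{\rho}$: 'none' means the identity (no activation is applied), and any other value is looked up by name in torch.nn.functional.
activation_index(self, x)
- Returns the indices at which the affine-transformed input vector ($=|\mathbf{W^T}|_t\cdot\mathbf{x+b}$) is split into three parts.
- activation_partition gives the proportions of $\breve{s}, \hat{s}, \tilde{s}$.
forward(self, x)
- Performs $\mathbf{x} \rightarrow \rho^S(|\mathbf{W^T}|_t\cdot\mathbf{x+b})$.
- mono_indicator is an array that sets, for each variable of the input vector, whether the output is increasing, decreasing, or unconstrained in it. $\mathbf{W} \rightarrow |\mathbf{W^T}|_t$ is computed according to mono_indicator.
- The three parts given by activation_index are passed through $\breve{\rho}, \hat{\rho}, \tilde{\rho}$ respectively and then concatenated back into a single vector.

Let us check that $\hat{\rho}, \tilde{\rho}$ are returned correctly for each choice of activation function $\breve{\rho}$.

# Visualize the convex, concave, and saturated (neither convex nor concave) activations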
import matplotlib.pyplot as plt
fig = plt.figure()

x = torch.linspace(-2, 2, steps=201)

# Visualize ReLU
convex_relu, concave_relu, saturated_relu = MonoBlock(1, 3, [1], activation='relu').get_activation()

ax1 = fig.add_subplot(231)
ax4 = fig.add_subplot(234)

ax1.plot(x, convex_relu(x))
ax4.plot(x, convex_relu(x))
ax4.plot(x, concave_relu(x))
ax4.plot(x, saturated_relu(x).detach())

# Visualize ELU
convex_elu, concave_elu, saturated_elu = MonoBlock(1, 3, [1], activation='elu').get_activation()

ax2 = fig.add_subplot(232)
ax5 = fig.add_subplot(235)

ax2.plot(x, convex_elu(x))
ax5.plot(x, convex_elu(x))
ax5.plot(x, concave_elu(x))
ax5.plot(x, saturated_elu(x).detach())

# Visualize SELU
convex_selu, concave_selu, saturated_selu = MonoBlock(1, 3, [1], activation='selu').get_activation()

ax3 = fig.add_subplot(233)
ax6 = fig.add_subplot(236)

ax3.plot(x, convex_selu(x))
ax6.plot(x, convex_selu(x))
ax6.plot(x, concave_selu(x))
ax6.plot(x, saturated_selu(x).detach())

We obtain the same result as the following figure from the paper.

3. Implementing the Monotone Neural Network

We stack Monotonic Dense Blocks to build a Monotone Neural Network.

import torch.nn as nn
import torch.nn.functional as F

class MonoNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.mono = nn.Sequential(
            MonoBlock(1, 32, mono_indicator=[1], activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(32, 16, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(16, 8, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(8, 4, activation='elu', activation_partition=(0, 0, 1)),
            MonoBlock(4, 1)
        )

    def forward(self, x):
        return self.mono(x)

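The length of mono_indicator must equal the dimension of the input vector, because the $i$-th indicator determines whether the output increases or decreases in the $i$-th variable.
The number of layers and the input/output dimension of each layer can be chosen freely.

4. Comparing Monotone Neural Network Performance

We generate data from a synthetic function, train the PyTorch, Keras, and Sklearn models on it, and compare their predictions. Performance is reported as the MSE on the test data.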

import numpy as np

# Fix the seeds for reproducibility
random_seed = 42
torch.manual_seed(random_seed)
np.random.seed(random_seed)

# Generate the train, valid, and test data
def data_generate(num_sample, noise):
    X = np.random.uniform(-1, 5, num_sample)
    Y = np.exp(X - 2 + np.sin(X)) + noise * np.random.normal(0, 0.1, num_sample)
    return torch.tensor(X, dtype=torch.float), torch.tensor(Y, dtype=torch.float)

train_x, train_y = data_generate(800, 0.8)
valid_x, valid_y = data_generate(100, 0)
test_x, test_y = data_generate(100, 0)

The synthetic function is $y=e^{x-2+\sin x}$, which is increasing. Noise is added to the train data, while the valid and test data are noise-free.

Training the PyTorch model

# Train the PyTorch model
my_param = {'learning_rate' : 0.01,
            'num_epoch' : 2000}

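# Instantiate the network (assumed to be the MonoNet defined above; the original snippet omits this line)
model = MonoNet()
criterion = nn.MSELoss()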
optimizer = torch.optim.Adam(model.parameters(), lr=my_param['learning_rate'])
best_valid_loss_torch = 10**9

for epoch in range(1, my_param['num_epoch']+1):
    model.train()

    output = model(train_x)
    train_loss = criterion(output, train_y)

    optimizer.zero_grad()
    train_loss.backward()
    optimizer.step()

    with torch.no_grad():
        model.eval()
        if epoch % 50 == 0:
            valid_output = model(valid_x)
            valid_loss = criterion(valid_output, valid_y)
            print(f"    [Epoch {epoch}] Valid loss : {valid_loss}")

            if best_valid_loss_torch > valid_loss:
                best_valid_loss_torch = valid_loss

print(f"[PyTorch] Best valid loss : {best_valid_loss_torch}")

Training the Keras model

# Build the Keras model
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input
from airt.keras.layers import MonoDense
from tensorflow.keras.optimizers import Adam

model_keras = Sequential()

model_keras.add(Input(shape=(1,)))
model_keras.add(
    MonoDense(512, activation="elu", monotonicity_indicator=[1]))
model_keras.add(
    MonoDense(512, activation="elu"))
model_keras.add(
    MonoDense(512, activation="elu"))
model_keras.add(
    MonoDense(512, activation="elu"))
model_keras.add(
    MonoDense(1))

optimizer_keras = Adam(learning_rate=my_param['learning_rate'])
model_keras.compile(optimizer=optimizer_keras, loss="mse")

model_keras.fit(
    x=np.array(train_x), y=np.array(train_y), batch_size=10000, validation_data=(np.array(valid_x), np.array(valid_y)), epochs=2000
)

Training the Sklearn model

# Build the Sklearn model

from sklearn.isotonic import IsotonicRegression
iso_reg = IsotonicRegression().fit(train_x, train_y)

Model evaluation

# Evaluate the PyTorch model
test_pred = model(test_x)

fig = plt.figure()
fig.set_figwidth(25)
ax1 = fig.add_subplot(131)

ax1.scatter(test_x.detach(), test_y.detach(), color='r')
ax1.scatter(test_x.detach(), test_pred.detach(), color='b')
ax1.set_title('PyTorch')
ax1.text(-1, 6.5, f"MSE : {criterion(test_y, test_pred)}")
ax1.text(-1, 6, f"Param : {sum(p.numel() for p in model.parameters() if p.requires_grad)}")


# Evaluate the Keras model
test_pred_keras = model_keras.predict(x=np.array(test_x))
ax2 = fig.add_subplot(132)

ax2.scatter(test_x.detach(), test_y.detach(), color='r')
ax2.scatter(test_x.detach(), test_pred_keras, color='b')
ax2.set_title('Keras')
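# Flatten the (N, 1) Keras output to match test_y's shape (N,); otherwise MSELoss broadcasts to (N, N)
ax2.text(-1, 6.5, f"MSE : {criterion(test_y, torch.tensor(test_pred_keras).flatten())}")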
ax2.text(-1, 6, f"Param : {model_keras.count_params()}")

# Evaluate the Sklearn model
test_pred_iso = iso_reg.predict(test_x)
ax3 = fig.add_subplot(133)

ax3.scatter(test_x.detach(), test_y.detach(), color='r')
ax3.scatter(test_x.detach(), test_pred_iso, color='b')
ax3.set_title('Sklearn')
ax3.text(-1, 6.5, f"MSE : {criterion(test_y, torch.tensor(test_pred_iso))}")
ax3.text(-1, 6, f"Param : -")

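The predictions on the test data are as follows. Red points are values of the true function; blue points are the model's predictions.

The PyTorch model achieves the best performance, nearly identical to that of the Sklearn model.
Compared with the Keras model, the PyTorch model achieves better performance with fewer parameters.

Let us check that the PyTorch model satisfies monotonicity.

# Monotonicity check (PyTorch)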
sort_idx_x = np.argsort(test_x.detach().numpy())
sort_idx_pred = np.argsort(test_pred.detach().numpy()[0])
np.all(sort_idx_x == sort_idx_pred)

The argsort of the test inputs and the argsort of the model predictions agree at every position, which confirms that the model is monotone.
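The argsort comparison relies on distinct inputs and strictly increasing predictions. As a hedged alternative, the minimal sketch below sorts the (input, prediction) pairs and checks that predictions never decrease. The helper name `is_monotone_increasing` and the tolerance `tol` are illustrative assumptions, not part of the original code.

```python
def is_monotone_increasing(xs, preds, tol=0.0):
    """Return True if preds never decrease (beyond tol) as xs increase."""
    # Sorting the pairs orders them by x; ties in x are ordered by prediction,
    # so tied inputs are never counted as violations.
    pairs = sorted(zip(xs, preds))
    return all(p2 >= p1 - tol for (_, p1), (_, p2) in zip(pairs, pairs[1:]))


print(is_monotone_increasing([3, 1, 2], [9.0, 1.0, 4.0]))
print(is_monotone_increasing([3, 1, 2], [1.0, 9.0, 4.0]))
```

With the tensors used here, one could pass `test_x.tolist()` and `test_pred.detach().flatten().tolist()`.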

[Week 4] [Paper Analysis] Constrained Monotonic Neural Network

Sun, 04 Feb 2024 18:22:35 GMT

Original paper

Runje D, Shankaranarayana S, Constrained Monotonic Neural Network

1. Abstract

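Monotonicity is a property (constraint) that is frequently required in real-world problems. The classical way to build a monotonic fully-connected neural network is to restrict the sign of the weights (positive for increasing, negative for decreasing). This has a drawback: with Sigmoid as the activation function the network approximates functions poorly, and with ReLU it approximates only convex functions well. To overcome this, the paper keeps the idea of sign-constrained weights and adds a new activation function that removes these limitations. The proposed method performs (slightly) better than existing methods while using far fewer parameters, requiring no change to the training procedure, and satisfying universality.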

2. Background

2-1. Monotonic Architecture - by Construction

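Let us briefly review the methods proposed so far for building a Monotone Neural Network. As the name says, a Monotone Neural Network is a neural network that has monotonicity. If both input and output are one-dimensional, the network is an increasing or a decreasing function; in the multidimensional case, every output component is increasing or decreasing in each variable (partially monotonically increasing/decreasing). For convenience, we discuss only the increasing case.

Restricting all weights to be nonnegative (Archer & Wang, 1993)

Setting every weight of a fully connected neural network to be nonnegative yields an increasing function. Under the classical Universal Approximation Theorem, however, negative weights are indispensable for representing an arbitrary function. [1] Accordingly, with Sigmoid or Tanh as the activation function, this approach reportedly failed to estimate the target function well.

Later, once ReLU became the popular activation function, applying ReLU to the scheme above could estimate only convex functions well. The affine map with nonnegative weights (the role played by nn.Linear) and ReLU are both convex and nondecreasing, and a composition of such functions is again convex, so the network itself becomes a convex function.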
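The convexity claim can be checked numerically. The sketch below builds a small ReLU network with nonnegative weights and tests, on a uniform grid, that first differences (monotonicity) and second differences (convexity) are nonnegative. The layer sizes, the random seed, and the use of NumPy instead of PyTorch are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def relu(z):
    return np.maximum(z, 0.0)


# Hypothetical 1-16-16-1 network: weights are made nonnegative with abs, biases are free.
W1, b1 = np.abs(rng.normal(size=(1, 16))), rng.normal(size=16)
W2, b2 = np.abs(rng.normal(size=(16, 16))), rng.normal(size=16)
W3, b3 = np.abs(rng.normal(size=(16, 1))), rng.normal(size=1)


def net(x):
    h = relu(x.reshape(-1, 1) @ W1 + b1)
    h = relu(h @ W2 + b2)
    return (h @ W3 + b3).ravel()  # identity activation on the last layer


x = np.linspace(-3.0, 3.0, 601)
y = net(x)
first_diff = np.diff(y)        # nonnegative for a nondecreasing function
second_diff = np.diff(y, n=2)  # nonnegative for a convex function on a uniform grid
print(first_diff.min() >= -1e-8, second_diff.min() >= -1e-8)
```

Whatever the random draw, both checks should pass, which is exactly the limitation: such a network cannot represent a concave or S-shaped increasing function.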

Other techniques have also been devised.

Positive weights & Max-Min Pooling (Sill, 1997)
Linear Calibrators & Lattice (You et al, 2017)

2-2. Monotonic Architecture - by Regularization

Instead of changing the architecture, some methods obtain monotonicity through a regularization term.

Adding a term that penalizes non-monotonicity
A point loss function with an added soft monotonicity constraint
Training while correcting counterexamples that violate monotonicity

3. Main Idea

Let us look at the core idea of the paper. We first define the following concepts.

3-1. Constrained Weight

Definition A multivariate function $f:\reals^n \rightarrow \reals$ is partially monotonically increasing (in $x_i$) if $$ x_i^0>x_i^1~~\Rightarrow~~f(x_1,~\cdots,~x_i^0,~\cdots,~x_n) \ge f(x_1,~\cdots,~x_i^1,~\cdots,~x_n) $$ holds.

Likewise, $f$ is partially monotonically decreasing (in $x_i$) if $$ x_i^0>x_i^1~~\Rightarrow~~f(x_1,~\cdots,~x_i^0,~\cdots,~x_n) \le f(x_1,~\cdots,~x_i^1,~\cdots,~x_n) $$ holds.

The goal of the paper is to design a neural network that is partially monotonically increasing/decreasing. When the input is a vector of dimension two or more, increase or decrease can even be chosen separately for each variable. [2]

Definition The $n$-dimensional monotonicity indicator vector $t=[t_1,~\cdots,~t_n]$ is defined as follows. $$ t_j = \begin{cases} 1 & \text{if } {\partial f(\mathbf{x})_i \over \partial x_j} \ge 0 \text{ for each } i \in \{1,~\cdots,~m\} \\ -1 & \text{if } {\partial f(\mathbf{x})_i \over \partial x_j} \le 0 \text{ for each } i \in \{1,~\cdots,~m\} \\ 0 & \text{otherwise} \end{cases} $$

The monotonicity indicator vector $t$ decides, for each input variable, whether the output increases or decreases in it. Set $t_i=1$ if the output should increase in the $i$-th input variable, $t_i=-1$ if it should decrease, and $t_i=0$ if no constraint is wanted.

Once the monotonicity indicator vector $t$ is fixed, the operation $|\cdot|_t$ that uses it is defined as follows.

For an $m\times n$ matrix $\mathbf{M}$, the entries of $\mathbf{M}'=|\mathbf{M}|_t$ are $$ m'_{j,i} = \begin{cases} |m_{j,i}| & \text{if } t_i=1 \\ -|m_{j,i}| & \text{if } t_i=-1 \\ m_{j,i} & \text{otherwise} \end{cases} $$ [3] With this operation the fully-connected layer is defined as $$ \mathbf{h = |W^T|}_t\cdot\mathbf{x+b} $$ after which the new activation function is applied.
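As a small illustration of the operation $|\cdot|_t$, the sketch below applies it column by column with NumPy. The function name `constrain_weights` is a hypothetical helper introduced only for this example; the column index plays the role of $i$ in $m'_{j,i}$.

```python
import numpy as np


def constrain_weights(M, t):
    """Apply |M|_t: column i becomes |.| if t_i = 1, -|.| if t_i = -1, unchanged if t_i = 0."""
    M = np.asarray(M, dtype=float)
    t = np.asarray(t)  # shape (n,), broadcast across the m rows
    abs_M = np.abs(M)
    return np.where(t == 1, abs_M, np.where(t == -1, -abs_M, M))


M = [[1.0, -2.0, 3.0],
     [-4.0, 5.0, -6.0]]
t = [1, -1, 0]
print(constrain_weights(M, t))
# first column becomes nonnegative, second nonpositive, third is left as is
```

Because every entry multiplying $x_i$ then has the sign of $t_i$, the sign of $\partial h_j / \partial x_i$ is fixed regardless of the values learned for $\mathbf{W}$.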

Lemma 1 states that monotonicity is preserved after multiplying by the constrained weights and adding the bias. The proof is straightforward and is omitted.

Lemma 1 For each $i \in\{1,~\cdots,~n\}$, $$ t_i=1~~\Rightarrow~~{\partial h_j \over \partial x_i}\ge0~~\text{for all } j\in\{1,~\cdots,~m\} $$ $$ t_i=-1~~\Rightarrow~~{\partial h_j \over \partial x_i}\le0~~\text{for all } j\in\{1,~\cdots,~m\} $$ hold.

3-2. Activation Function

The paper proposes a way to use unbounded activation functions such as ReLU. To overcome the limitation that only convex functions can be estimated, it defines the following activation functions.

Definition Let $\breve{\rho}$ be a zero-centered, monotonically increasing, convex, lower-bounded function. $$ \hat{\rho}(x):=-\breve{\rho}(-x) $$ $$ \tilde{\rho}(x):=\begin{cases} \breve{\rho}(x+1)-\breve{\rho}(1) & \text{if } x<0 \\ \hat{\rho}(x-1)+\breve{\rho}(1) & \text{if } x\ge0 \end{cases} $$

Each activation function has the following characteristics.

$\breve{\rho}$ : the activation function one originally wanted to use (convex, lower-bounded)
$\hat{\rho}$ : the point reflection of $\breve{\rho}$ through the origin (concave, upper-bounded)
$\tilde{\rho}$ : a mixture of $\breve{\rho}$ and $\hat{\rho}$ (non-convex, non-concave, bounded)
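To make the definition concrete, the following sketch takes $\breve{\rho}$ to be ReLU and evaluates $\hat{\rho}$ and $\tilde{\rho}$ with NumPy. The function names are illustrative assumptions. For ReLU the definitions reduce to $\hat{\rho}(x)=\min(x,0)$ and $\tilde{\rho}(x)=\text{clip}(x,-1,1)$, which the code compares against.

```python
import numpy as np


def convex(x):
    """breve-rho: ReLU."""
    return np.maximum(x, 0.0)


def concave(x):
    """hat-rho(x) = -breve-rho(-x)."""
    return -convex(-x)


def saturated(x):
    """tilde-rho: shifted convex branch for x < 0, shifted concave branch for x >= 0."""
    x = np.asarray(x, dtype=float)
    return np.where(x < 0,
                    convex(x + 1.0) - convex(1.0),
                    concave(x - 1.0) + convex(1.0))


xs = np.linspace(-2.0, 2.0, 9)
print(np.allclose(concave(xs), np.minimum(xs, 0.0)))
print(np.allclose(saturated(xs), np.clip(xs, -1.0, 1.0)))
```

The saturated branch is what supplies boundedness on both sides, which matters later when the Heaviside step is approximated.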

For example, taking $\breve{\rho}$ to be ReLU, SeLU, or ELU gives the following $\hat{\rho}, \tilde{\rho}$.

The three activation functions are combined to define the activation function $\rho^{\mathbf{s}}$.

Definition Let $\breve{\rho}$ be a zero-centered, monotonically increasing, convex, lower-bounded function, and let $\mathbf{h} \in \reals^m$, $\mathbf{s}=(\breve{s},~\hat{s},~\tilde{s}) \in \N^3$ with $\breve{s}+\hat{s}+\tilde{s}=m$. The activation function $\rho^\mathbf{s}:\reals^m \rightarrow \reals^m$ is defined as follows. $$ \rho^\mathbf{s}(\mathbf{h})_j = \begin{cases} \breve{\rho}(h_j) & \text{if } j\le\breve{s} \\ \hat{\rho}(h_j) & \text{if } \breve{s}<j\le\breve{s}+\hat{s} \\ \tilde{\rho}(h_j) & \text{if } j>\breve{s}+\hat{s} \end{cases} $$

That is, $\rho^{\mathbf{s}}$ splits the vector obtained after the affine transformation into consecutive groups of $\breve{s}, \hat{s}, \tilde{s}$ components and applies $\breve{\rho}, \hat{\rho}, \tilde{\rho}$ to them respectively. How many components are assigned to each function determines how convex or concave the result is. Corollary 3 states that this activation function does not break monotonicity.

Corollary 3 Let $\mathbf{y}=\rho^\mathbf{s}\big(|\mathbf{W^T}|_t \cdot \mathbf{x+b}\big)$. For each $i \in\{1,~\cdots,~n\}$, $$ t_i=1~~\Rightarrow~~{\partial y_j \over \partial x_i}\ge0~~\text{for all } j\in\{1,~\cdots,~m\} $$ $$ t_i=-1~~\Rightarrow~~{\partial y_j \over \partial x_i}\le0~~\text{for all } j\in\{1,~\cdots,~m\} $$ $$ \mathbf{s}=(m,~0,~0)~~\Rightarrow~~\mathbf{y}_j:\text{convex for all } j\in\{1,~\cdots,~m\} $$ $$ \mathbf{s}=(0,~m,~0)~~\Rightarrow~~\mathbf{y}_j:\text{concave for all } j\in\{1,~\cdots,~m\} $$ hold.

The procedure so far can be drawn as the following figure. It is called the Monotonic Dense Block.

4. Architecture

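Several Monotonic Dense Blocks can be chained to form a network architecture. The paper introduces two main types.

Type1 Monotonic Dense Blocks are simply chained one after another.
- From the second block on, the monotonicity indicator vector $t$ is always the all-ones vector $\mathbf{1}$. Whether each variable increases or decreases is decided in the first block; the later blocks only compose increasing functions and thereby preserve it.
- The last block uses the identity as its activation function. If it used $\mathbf{s}=(0, 0, 1)$, a bounded activation function would be composed last, which would severely limit the functions the network can represent.
Type2 The input vector is split into monotonic and non-monotonic parts before passing through the network.
- For the monotonic inputs, increase or decrease can be chosen per variable. A separate Monotonic Dense Block is used for each variable so that convexity/concavity can also be chosen per variable.
- The non-monotonic inputs pass through an ordinary neural network, whose role (feature extraction, for example) can be assigned according to the task.
- The two parts are concatenated and passed through further Monotonic Dense Blocks. As in Type1, the monotonicity indicator vector $t$ is fixed to $\mathbf{1}$, and the last block uses the identity as its activation function.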

5. Universality

Having looked at the structure of the Monotone Neural Network, we now need a theoretical justification that this structure can actually approximate an arbitrary monotone function well. Theorem 4 gives this justification for the case where Sigmoid is the activation function. [4]

Theorem 4 [5] For any continuous, monotone nondecreasing function $f:K \rightarrow \reals$ ($K$ : compact subset of $\reals^k$), there exists a feedforward neural network with the following properties.

At most $k$ hidden layers
Sigmoid as the activation function
Positive weights
For any given $\epsilon>0$, an output $O_x$ satisfying $|O_x-f(x)|<\epsilon$ for every $x \in K$

Our goal is to replace the Heaviside function $\mathbf{H}$ used in [5] with $\tilde{\rho}$. Lemma 5 states that $\mathbf{H}$ can be approximated using $\tilde{\rho}$.

Lemma 5 Let $\breve{\rho}$ be a zero-centered, monotonically increasing, convex, lower-bounded function. The function $$ \mathbf{H}(z)=\begin{cases} 1 & \text{if } z \ge0 \\ 0 & \text{if } z < 0 \end{cases} $$ can be approximated on $\reals$ by the function $$ \tilde{\rho}_H(x)=\alpha \tilde{\rho}(x)+\beta~~\text{for some } \alpha\in \reals^+,~\beta\in\reals. $$

Lemma 6 in turn states that $\tilde{\rho}_H$ can be replaced by $\tilde{\rho}$.

Lemma 6 For some $\alpha \in \reals^+,~\beta \in \reals$, let the activation function $\tilde{\rho}_{\alpha,\beta}(x)$ satisfy $$ \tilde{\rho}_{\alpha,\beta}(x)=\alpha \tilde{\rho}(x)+\beta. $$ Then for every constrained monotone neural network $N_{\alpha,\beta}$ that uses $\tilde{\rho}_{\alpha,\beta}(x)$ as its activation function, there exists a constrained monotone neural network $N$ that uses $\tilde{\rho}(x)$ as its activation function and satisfies $$ N(x)=N_{\alpha,\beta}(x). $$

In short, $\mathbf{H}$ can be approximated by $\tilde{\rho}_H$, and any neural network $N_{\alpha,\beta}$ that uses $\tilde{\rho}_H$ as its activation function can be reproduced exactly with $\tilde{\rho}$.

From this, Theorem 7 provides the theoretical justification for the Constrained Monotone Neural Network.

Theorem 7 Let $\breve{\rho}$ be a zero-centered, monotonically increasing, convex, lower-bounded function. Any multivariate continuous monotone function defined on a compact set $K \subset \reals^k$ can be approximated by a monotone constrained neural network with the following properties.

At most $k$ hidden layers
$\rho$ as the activation function

Theorem 7 also gives a maximum number of hidden layers, but according to the experiments, using fewer layers gave better performance.

6. Experiment

The paper reports experiments comparing its method with existing Monotone Neural Networks. On all three datasets, COMPAS, Blog Feedback, and Loan Defaulter, it achieved the best performance with the fewest parameters.

COMPAS : a classification problem with 4 monotone features out of 13
Blog Feedback : a regression problem with 8 monotone features out of 276
Loan Defaulter : a classification problem with 5 monotone features out of 28

It also achieved the best performance on datasets that have only monotone features.

AutoMPG : a regression problem with 3 monotone features
Heart Disease : a classification problem with 2 monotone features

7. Code

The authors provide Colab code here. For now, let us train a Monotone Neural Network that approximates the bivariate function $z=x^3+e^{-y}$.

Data generation

We generate training data from $z=x^3+e^{-y}$ with added noise.

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)


def generate_data(no_samples: int, noise: float):
    x = rng.normal(size=(no_samples, 2))
    y = x[:, 0] ** 3
    y += np.exp(-x[:, 1])
    y += noise * rng.normal(size=no_samples)
    return x, y


x_train, y_train = generate_data(10_000, noise=0.1)


d = {f"x{i}": x_train[:5, i] for i in range(2)}
d["y"] = y_train[:5]
pd.DataFrame(d).style.hide(axis="index")

Building the model

After installing the package, the Monotone Dense Block is available as MonoDense. We stack it in a Sequential to build the model.

!pip install monotonic-nn

from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense, Input

from airt.keras.layers import MonoDense

model = Sequential()

model.add(Input(shape=(2,)))
monotonicity_indicator = [1, -1]
model.add(
    MonoDense(128, activation="elu", monotonicity_indicator=monotonicity_indicator)
)
model.add(MonoDense(128, activation="elu"))
model.add(MonoDense(1))

model.summary()

Training the model

We set the learning rate, optimizer, and so on, and train the model.

from tensorflow.keras.optimizers import Adam
from tensorflow.keras.optimizers.schedules import ExponentialDecay

lr_schedule = ExponentialDecay(
    initial_learning_rate=0.01,
    decay_steps=10_000 // 32,
    decay_rate=0.9,
)
optimizer = Adam(learning_rate=lr_schedule)
model.compile(optimizer=optimizer, loss="mse")

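# x_val and y_val are not defined at this point in the original; this noise-free validation set (and its size) is an assumption
x_val, y_val = generate_data(1_000, noise=0.0)

model.fit(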
    x=x_train, y=y_train, batch_size=32, validation_data=(x_val, y_val), epochs=10
)

Checking the result

We compare the trained model's output with the ground truth.

import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, projection='3d')
x_val, y_val = generate_data(100, noise=0.0)
pred = model.predict(x=x_val)

x = x_val[:, 0]
y = x_val[:, 1]
z1 = pred[:, 0]
z2 = y_val

ax.scatter(x, y, z1, color = 'r', alpha = 0.5)
ax.scatter(x, y, z2, color = 'g', alpha = 0.5)
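# predictions: red
# ground truth: green

The result is as follows.

The red and green points overlap closely, which indicates that training went well. We can also generate grid data along the $x$ and $y$ axes to verify monotonicity in each variable.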
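The grid-based check mentioned above could look like the following sketch. It evaluates a predictor on a 2D grid and tests the sign of the differences along each axis against the monotonicity indicator. The helper `check_grid_monotonicity`, the grid range, and the stand-in predictor `true_function` are assumptions for illustration; with the trained model one would pass `model.predict` instead.

```python
import numpy as np


def check_grid_monotonicity(predict, grid, indicator, tol=1e-9):
    """For each input axis, check that differences along it have the sign given by indicator."""
    xx, yy = np.meshgrid(grid, grid, indexing="ij")  # z[i, j] corresponds to (grid[i], grid[j])
    points = np.stack([xx.ravel(), yy.ravel()], axis=1)
    z = np.asarray(predict(points)).reshape(len(grid), len(grid))
    results = []
    for axis, t in enumerate(indicator):
        diffs = np.diff(z, axis=axis)
        # t = 1: diffs must be >= 0; t = -1: diffs must be <= 0; t = 0: always passes
        results.append(bool(np.all(t * diffs >= -tol)))
    return results


def true_function(points):
    # Stand-in for model.predict: the target z = x^3 + exp(-y) itself.
    return points[:, 0] ** 3 + np.exp(-points[:, 1])


grid = np.linspace(-2.0, 2.0, 21)
print(check_grid_monotonicity(true_function, grid, indicator=[1, -1]))
```

A grid check only samples finitely many points, so it can reveal violations but does not prove monotonicity; for this architecture monotonicity holds by construction.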

8. Conclusion

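The paper claims a neural network implementation that guarantees monotonicity and attains universality, and highlights the following three features.

A fully-connected layer can be replaced directly by a Monotone Dense Block
- The block does not differ structurally from a layer of an existing network. This underlines the generality and extensibility of the Monotone Dense Block.
Far fewer parameters at the same level of performance
- In the experiments it achieved the best performance among the compared models, but not by a large margin. It did so, however, with far fewer parameters, which is a major advantage in training speed and computation.
Expected combination with other layer types such as convolution layers
- At present it is used together with fully-connected layers, as in the Type2 architecture, but with suitable modifications it may be usable with other layers such as convolution layers.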

Endnotes

[1] To build a bar-shaped function that is nonzero only on a very small interval, one sigmoid-shaped function has to be subtracted from another.

[2] For example, when approximating a bivariate function $f(x, y)$ with input $\mathbf{x} \in\reals^2$, one can design a neural network that is increasing in $x$ and decreasing in $y$.

[3] As an example, take input data $\mathbf{x} \in \reals^9$, the monotonicity indicator vector $t=(-1, -1, -1, 0, 0, 0, 1, 1, 1)$, and the following initial weight matrix ($9\times12$) $\mathbf{W}$.

Here the network is set up so that the output decreases in the first three variables, is unconstrained in the middle three, and increases in the last three. In the result $(|\mathbf{W^T}|_t)^{\mathbf{T}}$ of applying $t$, rows 1 to 3, which multiply the decreasing variables, are negative (or $0$); rows 4 to 6, which multiply the unconstrained variables, are unchanged; and rows 7 to 9, which multiply the increasing variables, are positive (or $0$).

[4] According to the Background section, however, using Sigmoid as the activation function does not approximate functions well. This apparent contradiction is not clearly resolved; my tentative guess is that the theoretical grounding is sufficient but the experiments did not produce meaningful results.

[5] The proof of Theorem 4 proceeds by mathematical induction. Here we present the case $k=1$ for an input vector $\mathbf{x} \in \reals^k$.

We make the following two assumptions.

$f$ : strictly increasing
$f(x) \ge 0$ ($K$ : compact $\Rightarrow$ $f$ : bounded, $f_\text{new}=f+C$)

Strict monotonicity can later be relaxed to monotonicity, and since a continuous $f$ on a compact set is bounded, adding a suitable constant $C$ gives $f(x) \ge 0$.

With these, $f(x)$ can be written using the Heaviside function $\mathbf{H}$ as follows. $$ f(x) = \int_0^\infin \mathbf{H}(f(x)-u)du,~~\mathbf{H}(z)=\begin{cases} 1 & \text{if } z \ge0 \\ 0 & \text{if } z < 0 \end{cases} $$

Since $f$ is continuous and strictly increasing, it has an inverse, so $$ f(x)=\int_0^\infin \mathbf{H}(f(x)-u)du=\int_0^\infin \mathbf{H}(x-f^{-1}(v))dv, $$ and since $f$ is continuous on a compact set, the integral is a Riemann integral and can be written as $$ f(x)\approx\sum_{i=1}^N (v_{i+1}-v_i)\mathbf{H}(x-f^{-1}(v_i)) $$ $$ \Updownarrow $$

which in turn can be expressed as the following neural network.
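The Riemann sum can be read as a one-hidden-layer network of Heaviside units with input weight $1$, bias $-f^{-1}(v_i)$, and nonnegative output weight $v_{i+1}-v_i$. The sketch below checks this numerically for an illustrative choice, $f(x)=x^2$ on $[0,1]$, which is an assumption made only for this example, as are the names `step_network` and `n_units`.

```python
import math


def heaviside(z):
    return 1.0 if z >= 0 else 0.0


def step_network(x, f_inv, v_max, n_units):
    """Sum of Heaviside units: input weight 1, bias -f_inv(v_i), output weight v_{i+1} - v_i."""
    dv = v_max / n_units  # uniform partition v_i = i * dv of [0, v_max]
    return sum(dv * heaviside(x - f_inv(i * dv)) for i in range(n_units))


def f(x):
    return x * x  # strictly increasing and nonnegative on [0, 1]


n_units = 1000
errors = [abs(step_network(k / 50, math.sqrt, 1.0, n_units) - f(k / 50)) for k in range(51)]
print(max(errors))  # stays on the order of the partition width 1 / n_units
```

Every weight in this network is nonnegative, which is why the construction fits the positive-weight setting of Theorem 4; refining the partition shrinks the error.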

[Week 3] Computation of the NPMLE for type-I interval censored data

Mon, 22 Jan 2024 19:56:14 GMT

Reference textbook

Groeneboom and Jongbloed (2014), Nonparametric Estimation under Shape Constraints, Cambridge University Press.

1. Review

In Section 2.1 we saw how to estimate a monotonically increasing function $r$.

Lemma 2.1 $\hat{r}$ minimizes the strictly convex function $$ Q(r) = {1 \over 2}\sum_{i=1}^n (r_i-y_i)^2w_i $$ over the convex cone $C=\{(r_1,~r_2,~\cdots,~r_n)\in\reals^n \mid r_1 \le r_2 \le \cdots \le r_n\}$ if and only if $$ \sum_{j=1}^i \hat{r}_jw_j \begin{cases} \le \sum_{j=1}^i y_jw_j & \text{for } i=1,2,\cdots,n \\ =\sum_{j=1}^i y_jw_j & \text{if } \hat{r}_{i+1}>\hat{r}_i \text{ or } i=n \end{cases} $$

If the estimate $\hat{r}$ of $r$ is defined as the left derivative of the convex minorant of the points $$ (0,0),~(w_1,~w_1y_1),~\cdots,~\Big(\sum_{j=1}^nw_j,~\sum_{j=1}^nw_jy_j\Big), $$ it satisfies the equality/inequality conditions of Lemma 2.1 and is therefore the minimizer of $Q(r)$. The point to remember is that we know the minimizer of a quadratic form such as $Q(r)$.
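The minimizer of $Q(r)$ over $C$ can also be computed with the Pool Adjacent Violators Algorithm (PAVA), a standard procedure that yields the same values as the slopes of the convex minorant. The sketch below is a plain-Python version under that substitution; the name `isotonic_fit` is a hypothetical helper, and the final loop checks the partial-sum conditions of Lemma 2.1 on a toy example.

```python
def isotonic_fit(y, w):
    """Weighted least-squares fit under r_1 <= ... <= r_n via pool-adjacent-violators."""
    blocks = []  # each block is [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # Merge backwards while the ordering constraint is violated.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit


y, w = [1.0, 3.0, 2.0, 4.0], [1.0, 1.0, 1.0, 1.0]
r_hat = isotonic_fit(y, w)
print(r_hat)

# Lemma 2.1: partial sums of r_hat * w never exceed those of y * w, and match at i = n.
cum_r = cum_y = 0.0
for r_i, y_i, w_i in zip(r_hat, y, w):
    cum_r += r_i * w_i
    cum_y += y_i * w_i
    assert cum_r <= cum_y + 1e-12
assert abs(cum_r - cum_y) < 1e-12
```

In this example the violating pair $(3, 2)$ is pooled into its mean, which corresponds to one linear piece of the convex minorant spanning two points.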

2. Newton Algorithm

The Newton Algorithm finds the minimizer of a (convex) function. That is, for a (convex) function $\phi : \reals^n \rightarrow \reals$, it searches for the $\hat{\theta} \in \reals^n$ that minimizes $\phi$, approaching $\hat{\theta}$ step by step through $\theta^{(0)},~\theta^{(1)},~\theta^{(2)}, \cdots$.

1. Choose an initial value $\theta^{(0)}$
2. For $k\ge0$, define $\theta^{(k+1)}=A(\theta^{(k)})$ $$ = \argmin_{\theta} \bigg( \phi(\theta^{(k)})+(\theta-\theta^{(k)})^\mathrm{T}\nabla\phi(\theta^{(k)})+{1\over2}(\theta-\theta^{(k)})^\mathrm{T}\nabla^2\phi(\theta^{(k)})(\theta-\theta^{(k)}) \bigg) $$
3. Repeat step 2 until the stopping condition is met

The drawback of the Newton Algorithm is that it does not guarantee convergence to the true minimizer of $\phi$. This holds even when $\phi$ is a convex function.
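A one-dimensional example makes the lack of a guarantee concrete. The function $\phi(\theta)=\sqrt{1+\theta^2}$ is convex with minimizer $0$, and the Newton update simplifies to $\theta^{(k+1)}=-(\theta^{(k)})^3$, so the iteration converges only when $|\theta^{(0)}|<1$. The choice of $\phi$ and the helper names below are assumptions made for illustration.

```python
import math


def grad(theta):
    return theta / math.sqrt(1.0 + theta * theta)


def hess(theta):
    return (1.0 + theta * theta) ** -1.5


def newton(theta0, n_steps):
    """Plain Newton iteration: minimize the local quadratic model at each step."""
    theta = theta0
    for _ in range(n_steps):
        theta = theta - grad(theta) / hess(theta)
    return theta


print(abs(newton(0.5, 5)))  # shrinks rapidly toward the minimizer 0
print(abs(newton(1.5, 3)))  # grows with every step: the iteration diverges
```

The full Newton step overshoots because the quadratic model is trusted too far from the current point, which is what damping or a line search is meant to prevent.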

3. Iterative Convex Minorant Algorithm

Let us look at an algorithm that addresses this weakness of the Newton Algorithm. It aims to converge to the minimizer whenever the convex function has a unique one. We define the following terms and make three assumptions.

$\phi$ : objective function, the function to be minimized
$C=\{r \in \reals^n:r_1 \le r_2 \le \cdots \le r_n\}$ : convex set

$\phi$ : convex, continuous on $C$
$\phi$ : continuously differentiable on $\{\beta \in \reals^n :\phi(\beta)< \infin \}$
$\hat{\beta}$ : unique minimum of $\phi$ on $C$ exists

Under these assumptions the following equivalence holds. $$ \hat{\beta}=\argmin_{\beta \in C}~\phi(\beta)~~\Leftrightarrow~~ \begin{cases} \langle\hat{\beta},~\nabla\phi(\hat{\beta})\rangle=0 \\ \langle\beta,~\nabla\phi(\hat{\beta})\rangle \ge0~~\text{for all } \beta\in C \end{cases} $$

As in the left panel, if the global minimum lies inside the convex set, then $\nabla\phi(\hat{\beta})=\mathbf{0}$ and the equality and inequality hold. As in the right panel, even if the global minimum lies outside the convex set, the constrained minimizer lies on the boundary of the convex set and the equality and inequality still hold.

This equivalence is later used as the stopping condition of the Modified Iterative Convex Minorant Algorithm.

The Iterative Convex Minorant Algorithm (ICM) searches for the minimizer in a way similar to the Newton method.

1. Choose an initial value $\beta^{(0)}$
2. Set the next value $(\beta^{(k+1)})$ to the minimizer of an approximation of $\phi$ around the previous value $(\beta^{(k)})$
3. Repeat step 2 until the stopping condition is met

The approximation in step 2 is defined as follows. For any $\beta$ near $\gamma \in C$, $\phi(\gamma+(\beta-\gamma))$ is approximated by $$ \begin{aligned} \phi(\gamma+(\beta-\gamma)) &\approx \phi(\gamma)+(\beta-\gamma)^\mathrm{T}\nabla\phi(\gamma)+{1 \over 2}(\beta-\gamma)^\mathrm{T}D(\beta-\gamma) \\ &=\phi(\gamma) + {1 \over 2}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))^\mathrm{T}D(\beta-\gamma+D^{-1}\nabla\phi(\gamma)) \\ &\quad - {1 \over 2}\nabla\phi(\gamma)^\mathrm{T}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))+{1\over2}\nabla\phi(\gamma)^\mathrm{T}(\beta-\gamma) \\ &=\phi(\gamma)+{1 \over 2}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))^\mathrm{T}D(\beta-\gamma+D^{-1}\nabla\phi(\gamma))+c_\gamma \end{aligned} $$ $$ \begin{aligned} \text{where}~ c_\gamma &= -{1 \over 2}\nabla\phi(\gamma)^\mathrm{T}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))+{1\over2}\nabla\phi(\gamma)^\mathrm{T}(\beta-\gamma) \\ &=-{1\over2}\nabla\phi(\gamma)^\mathrm{T}D^{-1}\nabla\phi(\gamma), \end{aligned} $$ $$ D:\text{positive diagonal matrix} $$ and the minimizer $B(\gamma)$ of this approximation is $$ \begin{aligned} B(\gamma)&=\argmin_{\beta \in C} \bigg(\phi(\gamma)+{1 \over 2}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))^\mathrm{T}D(\beta-\gamma+D^{-1}\nabla\phi(\gamma))+c_\gamma \bigg)\\ &=\argmin_{\beta \in C} \bigg({1 \over 2}(\beta-\gamma+D^{-1}\nabla\phi(\gamma))^\mathrm{T}D(\beta-\gamma+D^{-1}\nabla\phi(\gamma)) \bigg) \\ &=\argmin_{\beta \in C}\bigg({1 \over 2}\sum_{i=1}^n \bigg(\beta_i-\gamma_i+{1 \over d_i}{\partial \over \partial\beta_i}\phi(\gamma)\bigg)^2d_i\bigg) \end{aligned} $$ $$ \text{where}~ d_i=D_{i,i}. $$ Since $B(\gamma)$ is the minimizer of a quadratic form of the same type as $Q(r)$, with $y_i=\gamma_i-{1 \over d_i}{\partial \over \partial\beta_i}\phi(\gamma)$ and $w_i=d_i$, it can be computed with the convex minorant exactly as in Lemma 2.1.
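A single ICM step can therefore be sketched as a weighted isotonic regression of $\gamma_i - \frac{1}{d_i}\frac{\partial}{\partial\beta_i}\phi(\gamma)$ with weights $d_i$. The code below uses pool-adjacent-violators for that regression (a standard substitute for reading slopes off the convex minorant) and a toy objective $\phi(\beta)=\frac12\sum_i(\beta_i-a_i)^2$ with $D=I$. The helper names and the toy objective are assumptions for illustration.

```python
def weighted_isotonic(y, w):
    """Minimize sum w_i (r_i - y_i)^2 subject to r_1 <= ... <= r_n (pool-adjacent-violators)."""
    blocks = []  # [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    return [mean for mean, _, count in blocks for _ in range(count)]


def icm_step(gamma, grad, d):
    """B(gamma): isotonic regression of gamma_i - grad_i / d_i with weights d_i."""
    target = [g - gr / di for g, gr, di in zip(gamma, grad, d)]
    return weighted_isotonic(target, d)


# Toy objective: phi(beta) = 0.5 * sum((beta_i - a_i)^2), so grad phi(gamma) = gamma - a.
a = [1.0, 3.0, 2.0, 4.0]
gamma = [0.0, 0.0, 0.0, 0.0]
grad = [g - ai for g, ai in zip(gamma, a)]
d = [1.0] * len(a)  # D = identity, which equals the exact Hessian for this objective
beta_next = icm_step(gamma, grad, d)
print(beta_next)
```

Because the toy objective is itself quadratic and $D$ matches its Hessian, one step already lands on the constrained minimizer; for a general $\phi$ the step is only a candidate.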

이러한 알고리즘은 $\phi$를 최소화하는 것이 아닌, $\phi$의 근사함수를 최소화하는 과정을 나타낸다. 그렇다면 근사함수를 최소화하는 것이 $\phi$를 최소화하는 것에 도움이 될까? Lemma 7.1은 그에 대한 일부 답변을 제시한다.

Lemma 7.1 For a $\phi$ satisfying the assumptions above and any $\beta \in C \setminus \{\hat{\beta}\}$ with $\phi(\beta)<\infin$, $$ \phi(\beta+\lambda(B(\beta)-\beta))<\phi(\beta) $$ holds for every sufficiently small $\lambda>0$.

즉, ICM의 방법이 $\phi$를 최소화하는 방향임은 보장한다.

Lemma 7.1 Proof $\psi:[0, 1] \rightarrow \reals$를 다음과 같이 정의하자. $$ \psi(\lambda)=\phi(\beta+\lambda(B(\beta)-\beta)) $$ [Claim] : $\psi'(0+)=(B(\beta)-\beta)^\mathrm{T}\nabla\phi(\beta)<0$ 만약 위 Claim이 증명된다면 $$ \psi'(0+) \approx {\psi(0+\lambda)-\psi(0) \over \lambda}<0, $$ $$ \psi(0+\lambda)-\psi(0)=\phi(\beta+\lambda(B(\beta)-\beta)) - \phi(\beta)<0 $$ 와 같이 Lemma 7.1이 증명된다.

다음 두 사실로부터, $$ B(\beta)^\mathrm{T}\big(D(\beta)(B(\beta)-\beta)+\nabla\phi(\beta)\big)=0 $$ $$ \beta^\mathrm{T}\big(D(\beta)(B(\beta)-\beta)+\nabla\phi(\beta)\big) \ge 0 $$ 다음 부등식이 성립한다. $$ (B(\beta)-\beta)^\mathrm{T}D(\beta)(B(\beta)-\beta)+\psi'(0) \le 0 $$ 하지만 $\beta \in C \setminus {\hat{\beta}}$이므로 $D(\beta)$는 positive definite이고, 따라서 $$ (B(\beta)-\beta)^\mathrm{T}D(\beta)(B(\beta)-\beta)>0~~\rArr~~\psi'(0)<0 $$ 가 성립한다.

Lemma 7.1은 감소 방향에 대한 해답은 주지만, $\hat{\beta}$까지 감소하는지에 (=최소화원으로의 수렴성) 대해서는 보장하지 않는다. 이를 보완하는 알고리즘을 소개한다.

4. Modified Iterative Convex Minorant Algorithm

The Modified Iterative Convex Minorant Algorithm (MICM) adds convergence to $\hat{\beta}$ on top of ICM. As in ICM, it approaches $\hat{\beta}$ through iterates $\beta^{(0)}, \beta^{(1)}, \beta^{(2)}, \cdots$, which are related as follows. $$ \beta^{(k+1)}= C(\beta^{(k)})=\begin{cases} B(\beta^{(k)}) & \text{if}~~\phi(B(\beta^{(k)}))<\phi(\beta^{(k)})+(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(B(\beta^{(k)})-\beta^{(k)}) \\ x \in \big\{y \in \text{seg}(\beta^{(k)},~~B(\beta^{(k)}))~~|~~(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(y-\beta^{(k)}) \le \phi(y)-\phi(\beta^{(k)}) \le \epsilon\nabla\phi(\beta^{(k)})^\mathrm{T}(y-\beta^{(k)})\big\} & \text{else} \end{cases} $$ $$ \text{where~~}\epsilon \in(0,~~0.5) $$

위 관계식을 그림으로 살펴보자. $\phi$를 최소화하는 것이 목표지만, 이를 $\psi$ 관점에서 살펴볼 것이다.

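1) $\phi(B(\beta^{(k)})) < \phi(\beta^{(k)})+(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(B(\beta^{(k)})-\beta^{(k)})$ The right-hand side equals $\phi(\beta^{(k)})+(1-\epsilon)\psi'(0)$, which is the value at $\lambda=1$ of the line through $(0, \phi(\beta^{(k)}))$ with slope $(1-\epsilon)\psi'(0)$. Because $\psi'(0)<0$ this line is always decreasing, and multiplying the slope by $1-\epsilon$ sets a benchmark for how much decrease we hope for. When the inequality holds, $\phi(B(\beta^{(k)}))$ lies below this line, so the full step is judged to decrease $\phi$ enough and we set $\beta^{(k+1)} = B(\beta^{(k)})$.

2) $\phi(B(\beta^{(k)})) \ge \phi(\beta^{(k)})+(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(B(\beta^{(k)})-\beta^{(k)})$ Following the interpretation in 1), this is the case where the full step does not decrease $\phi$ as much as hoped. The figure shows a situation where $\phi(B(\beta^{(k)}))<\phi(\beta^{(k)})$ still holds, but this is not guaranteed in general. We therefore choose a suitable $\lambda \in (0, 1]$ and take the point $\beta^{(k)}+\lambda(B(\beta^{(k)})-\beta^{(k)})$, whose function value is $\psi(\lambda)$, as $\beta^{(k+1)}$.

Since we want a sufficient decrease, $\lambda$ is chosen where $\psi$ lies on or below the line through $(0, \phi(\beta^{(k)}))$ with slope $\epsilon\psi'(0)$. To rule out steps that are too short, $\lambda$ is also restricted to where $\psi$ lies on or above the line through $(0, \phi(\beta^{(k)}))$ with slope $(1-\epsilon)\psi'(0)$. These are exactly the two inequalities in the definition of $C(\beta^{(k)})$, and they correspond to the light-green region in the figure.

Pick any $\lambda_k$ in this region and set $\beta^{(k+1)}=\beta^{(k)}+\lambda_k(B(\beta^{(k)})-\beta^{(k)})$, so that $\phi(\beta^{(k+1)})=\psi(\lambda_k)$.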

The stopping condition of the algorithm is $$ \sum_{i=j}^n {\partial \over \partial\beta_i}\phi(\hat{\beta}) \begin{cases} \ge 0 & \text{for} ~~~1\le j \le n \\ =0 & \text{for}~~j=1 \end{cases} $$ It is based on the following equivalence. $$ \langle\beta,~~\nabla\phi(\hat{\beta})\rangle \ge0~~\text{for~~all}~~ \beta\in C ~~\Leftrightarrow~~ \sum_{i=j}^n {\partial \over \partial\beta_i}\phi(\hat{\beta}) \begin{cases} \ge 0 & \text{for} ~~~1\le j \le n \\ =0 & \text{for}~~j=1 \end{cases} $$

Theorem 7.3은 MICM에 의해 $\hat{\beta}$으로 수렴함을 보인다.

Theorem 7.3 Let $\phi:\reals^n \rightarrow(-\infin,\infin]$ satisfy the conditions above and let $\beta^{(0)} \in C$ satisfy $\phi(\beta^{(0)})<\infin$. If the positive definite diagonal matrix $D$ is chosen so that the mapping $\beta \mapsto D(\beta)$ is continuous on the set $K=\{\beta\in C:\phi(\beta) \le \phi(\beta^{(0)})\}$, then the Modified Iterative Convex Minorant Algorithm converges to $\hat{\beta}$.

Theorem 7.3 Proof 모든 $k \ge 0$에 대하여 $\beta^{(k)} \in K$고, K는 compact이다. $\rArr~~$[Enough to Show] : 모든 $\beta \in K\setminus{\hat{\beta}}$에서 Modified ICM 함수 $C$가 닫혀있다.

만약 닫힘성(closedness)가 증명되면 7.1절의 Theorem 7.1에 의해 수렴성이 보장된다. (교재 참고)

임의의 $\beta \in K \setminus{\hat{\beta}}$와 $\beta^{(k)} \rightarrow \beta$인 $\beta^{(k)} \in K$에 대하여, $\gamma^{(k)}$를 $$ \gamma^{(k)} \in C(\beta^{(k)}),~~\gamma^{(k)} \rightarrow \gamma~~\text{for~~some~~} \gamma \in K $$ 로 정의하자. $\rArr~~$[Enough to Show] : $\gamma \in C(\beta)$

다음 사실로부터, $$ \nabla\phi(\beta^{(k)}) \rightarrow \nabla \phi(\beta),~B(\beta^{(k)}) \rightarrow B(\beta) $$

1) $\phi(B(\beta^{(k)})) \le \phi(\beta^{(k)})+(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(B(\beta^{(k)})-\beta^{(k)})$ $$ \gamma^{(k_j)}=B(\beta^{(k_j)})~~\text{for~~each~~}j \ge 0 $$ $$ C(\beta)={B(\beta)}~~\text{as~~}\beta^{(k)} \rightarrow \beta $$ 가 성립하므로 $$ \gamma^{(k_j)} \rightarrow B(\beta)~~\text{as~~} j \rightarrow \infin $$ $$ \gamma=B(\beta) \in C(\beta) $$ 가 성립한다.

2) $\phi(B(\beta^{(k)})) > \phi(\beta^{(k)})+(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(B(\beta^{(k)})-\beta^{(k)})$ MICM에 의해 $$ \phi(\gamma^{(k)})-\phi(\beta^{(k)}) \in [(1-\epsilon)\nabla\phi(\beta^{(k)})^\mathrm{T}(\gamma^{(k)}-\beta^{(k)}),~~\epsilon\nabla\phi(\beta^{(k)})^\mathrm{T}(\gamma^{(k)}-\beta^{(k)})] $$ 이고, 충분히 큰 $k$에 대하여 $$ \beta^{(k)} \rightarrow \beta,~~\gamma^{(k)} \rightarrow \gamma,~~\nabla\phi(\beta^{(k)}) \rightarrow \nabla\phi(\beta) $$ 가 성립하므로 $$ \phi(\gamma)-\phi(\beta) \in [(1-\epsilon)\nabla\phi(\beta)^\mathrm{T}(\gamma-\beta),~~\epsilon\nabla\phi(\beta)^\mathrm{T}(\gamma-\beta)] $$ 이고, 따라서 $$ \gamma \in C(\beta)= \text{seg}(\beta,~B(\beta)) $$ 가 성립한다.

MICM의 전체 과정은 다음과 같다.
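The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the textbook's reference implementation: the names `pava` and `micm` are hypothetical, $C$ is taken to be the cone $\{\beta_1 \le \cdots \le \beta_n\}$, $B(\beta)$ is computed with a pool-adjacent-violators routine instead of an explicit convex minorant, and the step length is found by a simple bisection. The stopping test also checks $\langle\beta,\nabla\phi(\beta)\rangle = 0$, which is assumed here as the complementary equality condition.

```python
def pava(y, w):
    """Weighted least-squares projection of y onto {b_1 <= ... <= b_n}."""
    blocks = []  # each block: [weighted mean, total weight, number of points]
    for yi, wi in zip(y, w):
        blocks.append([yi, wi, 1])
        # merge adjacent blocks while they violate monotonicity
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            blocks.append([(m1 * w1 + m2 * w2) / (w1 + w2), w1 + w2, c1 + c2])
    return [m for m, _, c in blocks for _ in range(c)]


def micm(phi, grad, diag, beta0, eps=0.25, tol=1e-9, max_iter=200):
    """Sketch of the modified ICM iteration; eps must lie in (0, 0.5)."""
    beta = list(beta0)
    for _ in range(max_iter):
        g, d = grad(beta), diag(beta)
        # Stopping rule: tail sums of the gradient >= 0, total sum = 0,
        # plus <beta, grad> = 0 (assumed complementary condition).
        tail, tails = 0.0, []
        for gi in reversed(g):
            tail += gi
            tails.append(tail)
        inner = sum(b * gi for b, gi in zip(beta, g))
        if min(tails) >= -tol and abs(tails[-1]) <= tol and abs(inner) <= tol:
            break
        # B(beta): weighted projection of beta - D^{-1} grad with weights d_i
        target = [b - gi / di for b, gi, di in zip(beta, g, d)]
        step = [t - b for t, b in zip(pava(target, d), beta)]
        slope = sum(gi * si for gi, si in zip(g, step))  # psi'(0)
        if slope >= -tol:
            break  # no descent direction left
        f0 = phi(beta)

        def psi(lam):
            return phi([b + lam * s for b, s in zip(beta, step)])

        lam, lo, hi = 1.0, 0.0, 1.0
        if not psi(1.0) < f0 + (1 - eps) * slope:
            for _ in range(60):  # bisection on the step length
                diff = psi(lam) - f0
                if diff > eps * lam * slope:          # too little decrease
                    hi = lam
                elif diff < (1 - eps) * lam * slope:  # step too short
                    lo = lam
                else:
                    break
                lam = 0.5 * (lo + hi)
        beta = [b + lam * s for b, s in zip(beta, step)]
    return beta


# Toy problem: phi is a least-squares criterion, so the minimizer over C
# is the isotonic regression of y.
y = [3.0, 1.0, 2.0, 5.0, 4.0]
result = micm(
    phi=lambda b: 0.5 * sum((bi - yi) ** 2 for bi, yi in zip(b, y)),
    grad=lambda b: [bi - yi for bi, yi in zip(b, y)],
    diag=lambda b: [1.0] * len(b),
    beta0=[0.0] * len(y),
)
print(result)
```

With $D$ equal to the Hessian of a quadratic $\phi$, the first full step already lands on the minimizer, so the toy problem mainly exercises the stopping rule. A likelihood-based $\phi$ would exercise the line search as well.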

[2주차] Nonparametric MLE for the type-I interval censored data

Sat, 13 Jan 2024 19:14:05 GMT

참고교재

Groeneboom and Jongbloed (2014), Nonparametric Estimation under Shape Constraints, Cambridge University Press.

1. Monotone Regression

$x_i$가 고정값이며 증가함에 따라 $y_i$가 확률변수 $$ Y_i=r(x_i)+\epsilon_i $$ 의 실현값이라 하자. 여기서 $r$은 $x_i$를 통해 $Y_i$를 설명하고자 하는 함수이며, $\epsilon_i$는 $\mathrm{E}(\epsilon_i)=0$을 만족하는 확률변수(노이즈)이다. 우리의 목표는 단조함수 $r$을 추정하는 것이다. (단조 증가 or 단조 감소)

단조 증가가 가정된 함수 $r$을 추정한다고 하자. 관측에 의해 얻어진 실현값은 노이즈($\epsilon_i$) 때문에 단조성이 나타나지 않을 수 있다. ($y_i > y_{i+1}$인 실현값이 존재할 수 있다) 그럼에도 실현값만을 참고하면서, 실현값을 가장 잘 설명하는 단조 증가함수가 무엇일지 추정하는 과정이다.

다음 데이터를 예시로 살펴보자. 12세~18세 여자 아이들의 신장을 조사한 데이터다.

이때, 나이에 따른(12세~18세) 여자 아이들의 신장의 경향성을 설명하는 함수 $r$을 모델링하고자 한다. 그러한 $r$을 가장 잘 나타낸 $\hat{r}$은 다음과 같이 나타낼 수 있다.

$$ \hat{r}=\argmin_{r \in M} {1 \over 2}\sum_{i=1}^n (y_i-r(x_i))^2w_i,~~\text{where~~}M=\{f:\reals \rightarrow \reals~|~f(u)\le f(v)~~\text{for~~all~~}u \le v\} $$

즉, 단조 증가하는 함수 중 위의 quadratic form을 최소화하는 $r$이 우리가 원하는 함수라는 뜻이다. $\hat{r}$을 찾기에 앞서, 위 식은 오직 함수 $r$의 이산적인 함숫값에 의해서만 결정된다. 즉 $x_i$가 아닌 곳에서의 함숫값은 위 식의 최소화에 영향을 주지 않으므로, $x_i$가 아닌 점에서 $r$의 함숫값은 상수함수라 가정한다.

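Lemma 2.1 states a necessary and sufficient condition for $\hat{r}$ in this setting. Since only the values of $\hat{r}$ at $x_1,~~x_2,~~\cdots,~~x_n$ need to be determined, we can think of the problem as estimating the vector $\hat{r}=(\hat{r}(x_1),~~\hat{r}(x_2),~~\cdots,~~\hat{r}(x_n))$. Write $Q(r)={1 \over 2}\sum_{i=1}^n (y_i-r_i)^2w_i$ and $C=\{r \in \reals^n : r_1 \le r_2 \le \cdots \le r_n\}$. Then $\hat{r}$ minimizes $Q$ over $C$ if and only if $$ \sum_{j=1}^i \hat{r}_jw_j \begin{cases} \le \sum_{j=1}^i y_jw_j &\text{for}~~i=1,~~2,~~\cdots,~~n \\ =\sum_{j=1}^i y_jw_j &\text{if}~~\hat{r}_{i+1}>\hat{r}_i~~ \text{or}~~~ i=n \end{cases} $$

In words, the inequality holds for every $i=1,2,\cdots,~n$, and in addition equality holds whenever $\hat{r}_{i+1}>\hat{r}_i$ or $i=n$.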

Lemma 2.1 Proof ** 1. Inequality ($\Leftrightarrow$) ** Define the vector $v^{(i)}=(1,~~\cdots,~~1,~~0,~~\cdots,~~0)$ ($1$ up to the $i^{th}$ component and $0$ afterwards, $1 \le i \le n$). Then $\hat{r}-\epsilon v^{(i)} \in C$ for every $i$ and every $\epsilon>0$. Since $\hat{r}$ minimizes the strictly convex function $Q(r)$ over $C$, $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big( Q(\hat{r}-\epsilon v^{(i)})- Q(\hat{r})\big) \ge 0 $$ follows naturally. (Moving away from the minimizer $\hat{r}$ in any feasible direction, here $-\epsilon v^{(i)}$, cannot decrease $Q(r)$.)

By the definition of $Q(r)$, $$ Q(\hat{r} - \epsilon v^{(i)})-Q(\hat{r})={1 \over 2}\sum_{j=1}^n \big(\epsilon v_j^{(i)}w_j(\epsilon v_j^{(i)}-2\hat{r}_j+2y_j) \big) $$ so that $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big( Q(\hat{r}-\epsilon v^{(i)})- Q(\hat{r})\big) =\sum_{j=1}^i \big(y_j-\hat{r}_j\big)w_j \ge0. $$

** 2. Equality ($\Leftrightarrow$) ** Under the condition $\hat{r}_{i+1}>\hat{r}_i$, we have $\hat{r}_i+\epsilon<\hat{r}_{i+1}$ for sufficiently small $\epsilon>0$, hence $\hat{r}+\epsilon v^{(i)} \in C$. (For $i=n$, $\hat{r}+\epsilon v^{(n)} \in C$ holds for every $\epsilon>0$.) Therefore $$ Q(\hat{r} + \epsilon v^{(i)})-Q(\hat{r})={1 \over 2}\sum_{j=1}^n \big(\epsilon v_j^{(i)}w_j(\epsilon v_j^{(i)}+2\hat{r}_j-2y_j) \big) $$ and $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big( Q(\hat{r}+\epsilon v^{(i)})- Q(\hat{r})\big) =\sum_{j=1}^i \big(\hat{r}_j-y_j\big)w_j \ge0. $$ Combined with the inequality proved in part 1, this gives $$ \sum_{j=1}^i\hat{r}_jw_j=\sum_{j=1}^iy_jw_j. $$

By Lemma 2.1, any $\hat{r}$ that satisfies $$ \sum_{j=1}^i \hat{r}_jw_j \begin{cases} \le \sum_{j=1}^i y_jw_j &\text{for}~~i=1,~~2,~~\cdots,~~n \\ =\sum_{j=1}^i y_jw_j &\text{if}~~\hat{r}_{i+1}>\hat{r}_i~~ \text{or}~~~ i=n \end{cases} $$ is the minimizer of $Q(r)$. If we define $\hat{r}$ as the left derivative of the convex minorant of the points $$ (0,0),~~(w_1,~~w_1y_1),~~\cdots,~~\bigg(\sum_{j=1}^nw_j,~~\sum_{j=1}^nw_jy_j\bigg) $$ then it satisfies these conditions and is therefore the minimizer of $Q(r)$. The points were built from the deviations of the mean height ($\bar{x}_i-\bar{x}$) of girls aged 12 to 18, and the resulting convex minorant is shown below.

convex minorant는 convex한 꼴을 이루기 때문에, 실선의 기울기는 단조 증가한다. 따라서 단조 증가하는 $r$의 추정치로서 사용하기에 적합하며, Lemma 2.1은 이 기울기가 최소화원임을 보장한다.
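The construction in Lemma 2.1 can be sketched in Python as follows. This is a minimal illustration rather than the book's code: the function names `gcm_left_slopes` and `isotonic_fit` are hypothetical, the weights are assumed strictly positive, and the data are toy values, not the height data discussed above.

```python
def gcm_left_slopes(xs, ys):
    """Left derivatives, at xs[1:], of the greatest convex minorant
    of the points (xs[i], ys[i]); xs must be strictly increasing."""
    hull = [0]  # indices of the points where the minorant has its kinks
    for i in range(1, len(xs)):
        while len(hull) >= 2:
            a, b = hull[-2], hull[-1]
            # drop b if slope(a, b) >= slope(b, i): b is not a convex kink
            if (ys[b] - ys[a]) * (xs[i] - xs[b]) >= (ys[i] - ys[b]) * (xs[b] - xs[a]):
                hull.pop()
            else:
                break
        hull.append(i)
    slopes = []
    for a, b in zip(hull, hull[1:]):
        s = (ys[b] - ys[a]) / (xs[b] - xs[a])
        slopes.extend([s] * (b - a))  # constant slope on each segment
    return slopes


def isotonic_fit(y, w):
    """Monotone least-squares fit via the cumulative sum diagram
    (0, 0), (w_1, w_1 y_1), ..., (sum w_j, sum w_j y_j)."""
    xs, ys = [0.0], [0.0]
    for yi, wi in zip(y, w):
        xs.append(xs[-1] + wi)
        ys.append(ys[-1] + wi * yi)
    return gcm_left_slopes(xs, ys)


y = [3.0, 1.0, 2.0, 5.0, 4.0]
w = [1.0] * 5
r_hat = isotonic_fit(y, w)
print(r_hat)
```

The slopes are non-decreasing by construction, and the partial sums of `r_hat` stay at or below those of `y`, with equality at the kinks, which is the condition in Lemma 2.1.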

일반적으로는 평균 신장의 편차가 아닌, 평균 신장 데이터를 기반으로 convex minorant를 그리지만, 명확한 시각화를 위해 편차를 활용하였다. 편차를 이용하더라도 원래 구하고자 하던 기울기는 어렵지 않게 구할 수 있다.

2. Estimation from Current Status Data

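The next data set records whether Rubella had occurred in 230 Austrian men surveyed in 1988.
$T_i$ : [observed] age of each man at the time of the survey (years)
$\Delta_i$ : [observed] whether Rubella has occurred (occurred ($\Delta_i=1$), not occurred ($\Delta_i=0$))
$X_i$ : [to estimate] age at which Rubella occurs in an Austrian man (years)
$F$ : [to estimate] distribution function (CDF) of the age at Rubella onset

From these data we estimate the distribution function $F$ of the age at onset. Given the nature of the disease, we assume that once Rubella has occurred the state persists for life. Sort the observed times so that $t_1<t_2<\cdots<t_n$, and let $\delta_i$ denote the indicator observed at $t_i$.

For each observation (man), the probabilities of the two possible outcomes at the survey time are $$ \mathrm{P}(\Delta_i=1)=\mathrm{P}(X_i \le T_i)=F(t_i),~~~\mathrm{P}(\Delta_i=0)=\mathrm{P}(X_i>T_i)=1-F(t_i). $$ This gives the log likelihood function of $F$, $$ l(F)=\sum_{i=1}^n \delta_i \log F(t_i)+(1-\delta_i) \log(1-F(t_i)). $$ The problem is now to find the $\hat{F}$ that maximizes $l(F)$.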

앞선 예시에서와 같은 이유로 $\hat{F}$는 $t_i$ 이외의 점에서 상수함수로 가정할 수 있다. Lemma 2.3은 이러한 문제에서 최대화원 $\hat{F}$이 무엇인지 알려준다.

Lemma 2.3 다음과 같이 $P_i$를 정의하자. $$ P_0=(0,~~0),~~P_i=\bigg(i,~~\sum_{j=1}^i\delta_j \bigg),~~1\le i \le n $$ $\hat{F}(t_i)$를 $P_i$로 만든 convex minorant의, $P_i$에서의 left derivative로 정의하면 $\hat{F}$는 $l(F)$의 유일한 최대화원이다.

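Lemma 2.3 Proof 1. Maximizer. To keep $l(F)$ well defined, assume the following two conditions.

$\delta_1=1$
$\delta_n=0$

By the definition of $P_i$, the slopes of the convex minorant of the $P_i$ lie between $0$ and $1$. Assuming $\delta_1=1$ keeps $\log \hat{F}(t_1)$ well defined, and assuming $\delta_n=0$ keeps $\log (1-\hat{F}(t_n))$ well defined.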

$l(F)$는 strictly concave 함수이므로, $F(t_1) \le F(t_2) \le \cdots \le F(t_n)$을 만족하는 모든 $F$에 대하여 $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big) \le0 $$ 임을 보이면 $\hat{F}$이 $l(F)$의 최대화원임이 증명된다. 위 부등식의 의미는, $\hat{F}$로부터 어떤 방향으로라도 멀어진다면 $l(F)$의 기울기가 감소(or 0)한다는 뜻이다.

$$ \begin{aligned} \lim_{\epsilon \rightarrow 0+}&\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big)\\ &=\lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\sum_{i=1}^n \bigg(\delta_i \log \bigg(1+{\epsilon\big(F(t_i)-\hat{F}(t_i)\big) \over \hat{F}(t_i)}\bigg)+(1-\delta_i) \log\bigg(1+{-\epsilon\big(F(t_i)-\hat{F}(t_i)\big) \over 1-\hat{F}(t_i)} \bigg)\bigg) \\ &=\sum_{i=1}^n\bigg(\delta_i {F(t_i)-\hat{F}(t_i) \over \hat{F}(t_i)} - (1-\delta_i){F(t_i)-\hat{F}(t_i) \over 1-\hat{F}(t_i)}\bigg) \\ &= \sum_{i=1}^n{F(t_i)(\delta_i-\hat{F}(t_i)) \over \hat{F}(t_i)(1-\hat{F}(t_i))} - \sum_{i=1}^n {\delta_i-\hat{F}(t_i) \over 1-\hat{F}(t_i)} \\ &=: I_1-I_2 \end{aligned} $$ so it suffices to show $I_1-I_2 \le 0$. Let the kinks of the convex minorant of the $P_i$ (the points it passes through) be indexed by $$ 0=i_0<i_1<\cdots<i_k=n. $$ On each block $i_{j-1}<i\le i_j$ the slope of the minorant, and hence $\hat{F}(t_i)$, is constant.

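1-1. $I_1 \le 0$ $$ I_1=\sum_{i=1}^n{F(t_i)(\delta_i-\hat{F}(t_i)) \over \hat{F}(t_i)(1-\hat{F}(t_i))}=\sum_{j=1}^k {1 \over \hat{F}(t_{i_j})(1-\hat{F}(t_{i_j}))}\sum_{i=i_{j-1}+1}^{i_j}F(t_i)(\delta_i-\hat{F}(t_i)) $$ Because $F(t_1) \le F(t_2) \le \cdots \le F(t_n)$, there are $\alpha_m \ge 0$ such that $$ F(t_i)=\sum_{m=i_{j-1}+1}^i \alpha_m,~~i_{j-1}<i\le i_j. $$ Exchanging the order of summation gives $$ \sum_{i=i_{j-1}+1}^{i_j}F(t_i)(\delta_i-\hat{F}(t_i))=\sum_{m=i_{j-1}+1}^{i_j}\alpha_m\sum_{i=m}^{i_j}\big(\delta_i-\hat{F}(t_i)\big) \le 0, $$ because the convex minorant lies on or below the points $P_i$ and touches $P_{i_j}$, so every tail sum $\sum_{i=m}^{i_j}(\delta_i-\hat{F}(t_i))$ is at most $0$. Hence $I_1 \le 0$.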

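1-2. $I_2 = 0$ $$ \begin{aligned} I_2&=\sum_{j=1}^k\sum_{i=i_{j-1}+1}^{i_j} {\delta_i-\hat{F}(t_i) \over 1-\hat{F}(t_i)}\\ &=\sum_{j=1}^k{1 \over 1-\hat{F}(t_{i_j})}\sum_{i=i_{j-1}+1}^{i_j} \big(\delta_i-\hat{F}(t_i)\big)=0 \end{aligned} $$ On each block $i_{j-1}<i\le i_j$, $\hat{F}(t_i)$ is constant and equals the slope of the segment from $P_{i_{j-1}}$ to $P_{i_j}$, so each inner sum vanishes.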

위의 두 사실로부터 $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big)=I_1-I_2 \le 0 $$ 이 성립한다.

2. 유일성 최대화원 $\hat{F}$가 유일하기 위해선 $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big) < 0 $$ 가 성립해야 한다. 즉, 등호가 성립할 수 없음을 보일 것이다.

$$ \begin{aligned} &\lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big)= 0 \\ &\Leftrightarrow~~I_1=0 \\ &\Leftrightarrow~~\sum_{i=m}^{i_j}\big(\delta_i-\hat{F}(t_i)\big)=0 \end{aligned} $$ This is equivalent to the convex minorant passing through every $P_i$, which in turn is equivalent to $\delta_i=1$ for all $i$. That contradicts the assumption $\delta_n=0$ made earlier, so $$ \lim_{\epsilon \rightarrow 0+}\epsilon^{-1}\big(l(\hat{F}+\epsilon(F-\hat{F}))-l(\hat{F})\big)= 0 $$ is impossible. Therefore the maximizer $\hat{F}$ is unique.

풍진 발병 데이터의 $P_i$와 이들로 만든 convex minorant는 (a)와 같고, 이로부터 추정한 풍진 발병 나이 분포함수(CDF) $\hat{F}$는 (b)와 같다. 대부분 20대 이후에는 풍진 바이러스를 보유함을 알 수 있다.

[1주차] Examples and Technologies of Censored Data

Tue, 26 Dec 2023 07:16:30 GMT

참고교재

Groeneboom and Jongbloed (2014), Nonparametric Estimation under Shape Constraints, Cambridge University Press.
Wooldridge (2010), Econometric Analysis of Cross Section and Panel Data, 2nd Edition, MIT Press.
참고자료
**[Lecture Note]** https://scholar.harvard.edu/files/montamat/files/nonparametric_estimation.pdf
**[Github Docs]** https://reliability.readthedocs.io/en/latest/What%20is%20censored%20data.html

1. Nonparametric Estimation

Nonparametric Estimation (비모수적 추정)이란 관측치의 분포 함수에 대한 최소한의 가정을 기반으로 분포 함수를 추정하는 방법이다. 관측치가 특정 분포를 따른다는 강력한 가정을 갖는 Parametric Estimation과 달리, 관측치의 분포에 대해 함부로 단정지을 수 없는 경우에 사용한다.

대표적인 Nonparametric Estimation의 예시로

Kernel Estimation : Density Function의 각 함숫값을 결정할 때, 각 점의 local한 범위 내의 값이 전체 관측치에서 얼마나 관측되었는지 그 비율을 함숫값으로 결정하는 기법

이 있다. local한 범위 안에서 관측치가 중심점과 얼마나 멀리 있느냐에 따라 가중치를 부여하기도 하며, 관측치마다의 가중치를 계산하는 함수를 Kernel function이라 한다. 범위와 그에 따른 가중치를 어떻게 잡느냐에 따라 Uniform Kernel, Normal Kernel 등으로 세분화될 수 있다. 의미상으로는 Neural Network의 Kernel과 일맥상통하는 듯 하다.
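The kernel estimator described above can be sketched in a few lines of Python. This is only an illustration: the function name `kernel_density`, the bandwidth `h`, and the sample values are assumptions made for the example, and only the uniform and normal kernels mentioned in the text are implemented.

```python
import math


def kernel_density(x, data, h, kernel="normal"):
    """Kernel density estimate at point x with bandwidth h.

    Each observation contributes K((x - x_i) / h), and the sum is
    scaled by 1 / (n * h) so that the estimate integrates to 1.
    """
    total = 0.0
    for xi in data:
        u = (x - xi) / h
        if kernel == "uniform":
            # every observation within distance h gets the same weight
            total += 0.5 if abs(u) <= 1.0 else 0.0
        else:
            # normal kernel: closer observations get larger weights
            total += math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)
    return total / (len(data) * h)


sample = [0.0, 1.0, 2.0]
print(kernel_density(1.0, sample, h=1.0, kernel="uniform"))
print(kernel_density(1.0, sample, h=0.5, kernel="normal"))
```

The bandwidth `h` plays the role of the "local range" in the text: a larger `h` averages over more observations and gives a smoother estimate.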

교재에 소개된 몇 가지 예시 상황을 살펴보자.

1.1 Is There a Warming-up of Lake Mendota?

Lake Mendota는 미국에 위치한 호수로, 1800년대부터 연구가 시작되어 오랜 기간 데이터를 축적한 좋은 샘플이다. 예시로 살펴볼 데이터는 1855년부터 157년간 측정한 Lake Mendota가 얼어있는 일 수이다.

$i$의 증가를 연도의 증가(=시간의 흐름)로 볼 때, 호수가 얼어있는 일 수 $v_i$를 모델링하고자 한다. 이 때 호수가 얼어있는 일 수가 점점 짧아질 것이라고 가정하여 (혹은 선행 분석을 통해 감소추세임을 밝혀내어)

$$ v_1 \ge v_2 \ge \cdots \ge v_{157} $$

을 만족한다고 하자. 157년간 관측된 데이터를 $Y_i$라 할 때,

$$ \sum_{i=1}^{157} (Y_i-v_i)^2 $$

을 최소화하는 것이 목표가 되겠다. 이렇게 단조증가/단조감소하는 estimator를 Isotonic estimator라 하고, 이를 모델링하는 과정을 Isotonic regression이라 한다.

$v_i$를 추정하기 위한 실질적인 방법으로

$$ (0,0),~~(1,Y_1),~~(2,Y_1+Y_2),~~\cdots,~~\bigg(157,~\sum_{j=1}^{157}Y_j\bigg) $$

의 (least) concave majorant의 left derivative를 활용할 수 있다.(이론적 근거는 2주차에 계속) 이로부터 추정한 $v_i$는 아래 (a)의 실선과 같고, 이를 smooth하게 이은 결과는 아래 (b)의 실선과 같다.

1.2 Onset of Nonlethal Lung Tumor

두 그룹의 쥐의 수명을 측정한 데이터를 살펴보자. 한 그룹은 무균 환경, 다른 한 그룹은 일반 환경에 놓여있으며, 관심분포는 RFM type의 폐 종양이 쥐에게 발병하는 나이이다.

CE(=Conventional Environment, 일반 환경)과 GE(=Germ-free Environment, 무균 환경)에서 쥐의 수명(일 수)을 폐 종양 발병 여부($\Delta=1$ : 발병, $\Delta=0$ : 발병하지 않음)에 따라 분류한 데이터다. 당시 연구자들은 폐 종양이 쥐에게 치명적이라는 가정 하에, 두 그룹에서 쥐에게 폐 종양이 발병하는 나이를 모델링하였다. 이를 위해 Kaplan-Meier estimator를 사용하였으며, 그 결과 아래 실선(일반환경)과 점선(무균환경)과 같이 나타났다.

이로부터 무균 환경에서 쥐의 폐 종양 발병 나이는 일반 환경보다 확연히 늦다는 결론을 얻을 수 있었지만, 사실 이는 잘못된 과정이었다. 폐 종양이 쥐에게 치명적이라는 가정이 잘못되었고(사실은 치명적이지 않다고 한다), 폐 종양이 발견된 쥐($\Delta=1$)의 수명이 곧 폐 종양의 발병 나이와 유사할 것이라는 가정이 잘못된 것이다.

이를 옳게 추정하는 과정은 다음과 같다. 각각의 쥐($i$)를 관측한 순간(=쥐가 사망한 순간)($T_i$), 쥐는 폐 종양에 걸려있거나($\Delta_i=1$) 걸려있지 않다($\Delta_i=0$). 즉, 폐 종양에 걸린 나이($X_i$)는

$$ \Delta_i=1~~~ \Leftrightarrow ~~~X_i \le T_i $$ $$ \Delta_i=0~~~ \Leftrightarrow ~~~X_i > T_i $$

를 만족한다. 쥐가 사망한 순간에만 폐 종양 여부가 관측되었기 때문에, 쥐의 폐 종양 발병 나이의 누적분포함수 $F$를 추정하기 위해

$$ l(F) = \sum_{i=1}^{n} (\delta_i\log F(t_i) + (1-\delta_i)\log (1-F(t_i))) $$

와 같이 log likelihood function을 정의한다. $l(F)$를 최대화하는 $\hat{F}$를 추정하기 위해

$$ (0,0),~~(1,\delta_1),~~(2,\delta_1+\delta_2),~~\cdots,~~(n,\sum_{j=1}^n \delta_j) $$ 의 (greatest) convex minorant의 left derivative를 활용할 수 있다.(이론적 근거는 2주차에 계속) 이번 예시도 앞선 것과 유사한 맥락을 갖는다.
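As a small illustration, the estimator $\hat{F}$ can be computed in Python as below. This sketch assumes distinct observation times and uses the max-min formula for isotonic regression, $\hat{F}(t_i)=\max_{s \le i}\min_{u \ge i}{1 \over u-s+1}\sum_{j=s}^u\delta_j$, which gives the same values as the left derivative of the convex minorant. The formula is swapped in here for brevity and is not taken from the text above. The function name `npmle_current_status` and the data are hypothetical.

```python
def npmle_current_status(times, deltas):
    """NPMLE of F at the sorted observation times for current status data.

    Uses the max-min formula (O(n^3), fine for small n); the result equals
    the left derivative of the greatest convex minorant of the points
    (i, delta_1 + ... + delta_i).
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    d = [deltas[i] for i in order]
    n = len(d)
    prefix = [0]  # prefix[i] = delta_1 + ... + delta_i
    for v in d:
        prefix.append(prefix[-1] + v)
    f_hat = []
    for i in range(n):
        best = 0.0
        for s in range(i + 1):
            # smallest block average over blocks starting at s and covering i
            m = min((prefix[u + 1] - prefix[s]) / (u + 1 - s) for u in range(i, n))
            best = max(best, m)
        f_hat.append(best)
    return [times[i] for i in order], f_hat


# Toy data: age at inspection and whether the event had already occurred.
ages = [30, 5, 12, 20, 8]
status = [1, 0, 0, 1, 1]
t_sorted, f_hat = npmle_current_status(ages, status)
print(t_sorted, f_hat)
```

For larger samples a linear-time convex minorant or pool-adjacent-violators routine would be the natural replacement for the triple loop.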

1.3 The Transmission Potential of a Disease

마지막으로 살펴볼 예시는 불가리아의 A형 간염에 감염된 인구 수이다. A형 간염과 같은 전염성이 있는 바이러스는 '얼마나 잘 전염되는가'가 주목해야 할 특징이다. 이를 Transmission Potential이라 하며, 이 값이 1보다 클 경우 감염병 유행의 위험이 있다.

짧은 감염 주기를 갖는 전염병의 Transmission Potential $R_v$는

$$ R_v = {\int_0^{\infty}e^{-\int_0^a \mu(u)du}\lambda(a)^2V(a)da \over \int_0^{\infty}e^{-\int_0^a \mu(u)du}\lambda(a)^2e^{-\int_0^a \lambda(u)du}da} $$

를 따르며 $\lambda$는 감염 강도, $V(x)$는 나이가 $x$살인 사람에게 항체가 없을 확률, $\mu$는 사망률이다. 추정하고자 하는 것은 감염 강도 $\lambda$이며, 1.2에서의 절차와 유사하게 어느 한 시점에 표본 집단에 대해 항체 보유 여부를 조사함으로써 항체를 보유하면 이미 감염되었고, 항체를 보유하지 않으면 잠재적으로 감염되거나 아예 감염되지 않을 사람으로 판단할 수 있다.

2. Censored Data

Censored Data (중도절단 데이터)란 정확한 사건 발생 시점을 알 수 없는 데이터이다. 일반적으로 사건 발생 시점을 아는 데이터인 Complete Data와는 반대로, '실패' 또는 '죽음'을 관찰하는 경우(Ex. 수명, 어떤 약의 지속시간 등) 관측하는 시점마다 각 객체는 '아직 실패하지 않음'과 '실패'의 상태 중 한 가지로 나타난다. 아직 실패하지 않은 경우 실패(=사건의 발생)의 순간이 해당 시점 이후의 어딘가에 존재할 것이라고 짐작할 수 있다. 이러한 경우는 Censored Data와 Complete Data가 혼재된 형태로 나타난다.

Right Censored Data는 직전에 설명한 바와 같은 Censored Data의 한 유형이다. 시간이 흐름(오른쪽 방향)에 따라 각 객체에서 사건이 발생함으로써 더 이상 '생존하지 않은' 상태가 된다. 관측 시점에서는 중도 절단된 '실패' 데이터와 '아직 생존한' 데이터를 볼 수 있다. '아직 생존한' 데이터의 경우 실패 시점을 정확히 알 수 없으며, 어떤 구간 안에 실패 시점이 존재할 것이라고 추측할 수 있다.

Left Censored Data는 테스트 시작 전에 이미 '실패'한 데이터이다. Interval Censored Data 중 구간의 낮은 쪽이 0인 데이터이며, 현실에서 거의 찾아보기 힘든 유형이다.

Interval Censored Data는 사건 발생 시점의 하한과 상한까지 알려진 Censored Data의 한 유형이다. 실험 객체에 대해 여러 번 관측을 할 때, $i$ 시점에서는 실패하지 않았지만 $i+1$ 시점에서는 실패한 경우 사건 발생 시점(=데이터)을 하나의 구간으로 나타낼 수 있다.

Type I Censored Data는 미리 정해둔 시간이 다 되어 테스트가 중단되는 경우이다. 이때 종료 시점까지 실패가 발생하지 않은 데이터는 종료시점에 Right-censoring 되어, 실패 시점이 구간으로 나타나게 된다.

Type II Censored Data는 미리 정해둔 실패 횟수가 다 되어 테스트가 중단되는 경우이다. 이때 종료 시점까지 실패 횟수에 도달하지 않은 데이터는 종료시점에 Right-censoring 된다.

지금부터 여러 유형의 Censoring 모델을 살펴보자.

2.1 Binary Censoring

어떤 상품이나 서비스에 대한 지불할 의향이 있는 금액(Willing To Pay, 이하 WTP)을 모델링하자. 같은 상품이라도 지불 주체($i$)마다 생각하는 '구매하기에 합리적인 가격의 최댓값'($y_i$)가 존재하며, 상품 가격($r_i$)보다 높을 경우 해당 제품을 구매할 것($w_i=1$)이다. 즉,

$w_i$ : purchase decision (buys ($w_i=1$), does not buy ($w_i=0$))
$r_i$ : selling price offered by the seller
$y_i$ : the highest price the buyer considers reasonable for the product

$$ w_i=1~~\Leftrightarrow~~y_i>r_i $$ $$ w_i=0~~\Leftrightarrow~~y_i<r_i $$

이며, $P(y_i =r_i)=0$이라 가정하자. 판매자의 입장에서 가장 알고 싶은 정보는 $y_i$이며, $y_i$의 추정값을 $r_i$로 설정할 것이다. 고객의 정보($x_i$)를 바탕으로 $y_i$를 추정하는 가장 간단한 모델 형태는

$$ y_i = \mathbf{x}_i \beta+u_i $$ $$ \mathbf{E}(u_i|\mathbf{x}_i)=0 $$

을 만족하는 Linear Model이다. noise인 $u_i$가

$$ u_i|\mathbf{x}_i, r_i~ \sim~N(0, \sigma^2) $$

을 만족한다고 가정하면, 제시된 가격에 고객이 상품을 구입할 확률을 다음과 같이 모델링할 수 있다.

$$ \begin{aligned} \mathrm{P}(w_i=1|\mathbf{x}_i,~~r_i) &= \mathrm{P}(y_i>r_i~~|~~\mathbf{x}_i,~~r_i) \\ &= \mathrm{P}\bigg({u_i \over \sigma} > {(r_i-\mathbf{x}_i \beta) \over \sigma}~\bigg|~\mathbf{x}_i,~r_i\bigg)\\ &= \Phi\bigg({\mathbf{x}_i\beta-r_i \over \sigma}\bigg) \end{aligned} $$
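The purchase probability above, and the Bernoulli log likelihood it implies, can be sketched in Python as follows. This is a minimal illustration with hypothetical names (`std_normal_cdf`, `purchase_prob`, `log_likelihood`) and made-up numbers. It assumes the linear model with normal noise exactly as stated, and it only evaluates the likelihood. It does not fit $\beta$ and $\sigma$.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf Phi(z), written with the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def purchase_prob(x, beta, r, sigma):
    """P(w = 1 | x, r) = Phi((x . beta - r) / sigma)."""
    xb = sum(xi * bi for xi, bi in zip(x, beta))
    return std_normal_cdf((xb - r) / sigma)


def log_likelihood(X, prices, w, beta, sigma):
    """Log likelihood of the binary purchase decisions w."""
    ll = 0.0
    for x, r, wi in zip(X, prices, w):
        p = purchase_prob(x, beta, r, sigma)
        ll += math.log(p) if wi == 1 else math.log(1.0 - p)
    return ll


beta = [0.5, 1.0]          # hypothetical coefficients
x = [1.0, 2.0]             # hypothetical customer features, x . beta = 2.5
print(purchase_prob(x, beta, r=2.5, sigma=1.0))   # price equal to mean WTP
print(purchase_prob(x, beta, r=3.5, sigma=1.0))   # higher price
print(log_likelihood([x, x], [2.0, 3.5], [1, 0], beta, 1.0))
```

When the offered price equals the mean WTP $\mathbf{x}_i\beta$ the purchase probability is one half, and it decreases as the price rises.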

하지만 $u_i$의 이분산성과 비정규성을 무시하는 가정은 Binary Censoring 과정에서 우려되는 것이 타당하다고 한다.(함부로 가정하기 어렵다는 의미인 듯 하다) 또한 $y_i \ge0$이 가정된 상황에서 Linear Model을 적용하는 것은 이상적이지 못하여

$$ y_i = \mathrm{max}(0,~\mathbf{x}_i\beta+u_i) $$

와 같이 Type I Tobit Model을 적용하거나

$$ y_i =e^{\mathbf{x}_i\beta+u_i} $$

와 같이 $y_i \ge 0$을 유지하면서도 WTP가 0인 고객을 충분히 모델링할 수 있는 형식이 필요하다. 이를 바탕으로 구성한 '제시된 가격에서 고객이 상품을 구매할 확률'은

$$ \mathrm{P}(w_i=1|\mathbf{x}_i,~r_i) = \Phi({\mathbf{x}_i\beta-\log r_i \over \sigma}) $$

로 나타난다.

2.2 Interval Coding

2.1에서와 유사하게, 알려지지 않은 $y_i$를 추정하고자 한다. 달라진 점은 $y_i$의 정확한 값을 추정하는 것이 아닌, 미리 설정한 특정 구간 안에 속하도록 모델링하는 것이다. 즉,

$$ w_i=0~~\mathrm{if}~~y_i \le r_1 $$ $$ w_i=1~~\mathrm{if}~~ r_1< y_i\le r_2 $$ $$ \vdots $$ $$ w_i=J~~\mathrm{if}~~y_i>r_J $$

and, for each $j=0,1,\cdots,~~J$, model $y_i$ through the probability $\mathrm{P}(w_i=j~~|~\mathbf{x}_i)$ that $y_i$ falls in interval $j$. Data of this kind are called Interval Censored Data.

Let us model such interval censored $y_i$ with the linear model $y_i=\mathbf{x}_i\beta+u_i$, that is, estimate the parameters $\beta$ and $\sigma$ that determine $y_i$. (Recall that $u_i|\mathbf{x}_i,~~r_i~~\sim~N(0, \sigma^2)$.) The log likelihood contribution of observation $i$ for $\beta$ and $\sigma$ is

$$ \begin{aligned} l_i(\beta,~\sigma) &= 1[w_i=0] \log\bigg(\Phi\bigg({r_1-\mathbf{x}_i\beta \over \sigma}\bigg)\bigg)\\ &+ 1[w_i=1]\log\bigg(\Phi\bigg({r_2-\mathbf{x}_i\beta \over \sigma}\bigg)-\Phi\bigg({r_1-\mathbf{x}_i\beta \over \sigma}\bigg)\bigg)\\ &+ \cdots \\ &+ 1[w_i=J]\log\bigg(1-\Phi\bigg({r_J-\mathbf{x}_i\beta \over \sigma}\bigg)\bigg) \end{aligned} $$
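The interval probabilities and the log likelihood above can be evaluated with a short Python sketch. The names `interval_probs` and `interval_loglik` are hypothetical, the cut points and fitted means are made-up values, and the fitted means $\mathbf{x}_i\beta$ are passed in directly to keep the example short.

```python
import math


def std_normal_cdf(z):
    """Standard normal cdf Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def interval_probs(xb, sigma, cuts):
    """P(w = j | x) for j = 0..J, given the mean xb = x . beta
    and the increasing cut points r_1 < ... < r_J."""
    cdf = [0.0] + [std_normal_cdf((r - xb) / sigma) for r in cuts] + [1.0]
    return [cdf[j + 1] - cdf[j] for j in range(len(cuts) + 1)]


def interval_loglik(w, xb, sigma, cuts):
    """Sum of l_i(beta, sigma) over observations with categories w."""
    return sum(math.log(interval_probs(m, sigma, cuts)[wi]) for wi, m in zip(w, xb))


cuts = [1.0, 2.0, 3.0]                 # hypothetical r_1, r_2, r_3
probs = interval_probs(2.0, 1.0, cuts)
print(probs, sum(probs))
print(interval_loglik([0, 3], [2.0, 2.0], 1.0, cuts))
```

Padding the cdf values with 0 and 1 handles the two open-ended categories $w_i=0$ and $w_i=J$ with the same difference formula as the interior ones. In an estimation routine this function would be maximized over $\beta$ and $\sigma$.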

2.3 Censoring from Above and Below

관측할 데이터에 상한 또는 하한을 설정하여 Right Censored Data 또는 Left Censored Data로서 관측할 수 있다. 예를 들어 사람들의 재산을 조사한다고 할 때, 너무 큰 금액대를 조사하는 것은 큰 의미가 없어보인다. Right Censored Data의 분포 함수를 추정하는 방법을 구체적으로 알아보자.

Given an upper bound ($r_i$), the original observation ($y_i$) is replaced by

$$ w_i:=\mathrm{min}(y_i,~r_i). $$ $w_i$ is right censored data that never exceeds the upper bound ($r_i$), and from it we want to estimate the conditional cdf ($F$) and pdf ($f$) of $y_i$. Clearly $w_i \le r_i$, so we split the support of $w_i$ into the cases $w<r_i$ and $w=r_i$.

For $w<r_i$ the observation is not censored, so $\mathrm{P}(w_i \le w~|~\mathbf{x}_i,~~r_i) = \mathrm{P}(y_i \le w~~|~~\mathbf{x}_i) = F(w~~|~~\mathbf{x}_i~~;~\theta)$.
For $w=r_i$ the original observation ($y_i$) is at least the upper bound ($r_i$), so $\mathrm{P}(w_i=r_i|\mathbf{x}_i,~~r_i) = \mathrm{P}(y_i \ge r_i~~|~~\mathbf{x}_i) = 1-F(r_i~~|~~\mathbf{x}_i~~;~\theta)$.

위의 과정에서 $y_i$와 $r_i$의 독립성이 가정된다. 의미상으로 보아도 두 변수는 독립임이 자명하다. (상한값이 관측치에 영향을 받진 않으므로) 이를 바탕으로, $w_i$의 pdf($f$)를 추정하는 log likelihood function을 다음과 같이 구성할 수 있다.

$$ \begin{aligned} l(\theta) = \sum_{i=1}^I \bigg( &1[w_i<r_i]\log f(w_i|\mathbf{x}_i;\theta)\\ &+1[w_i=r_i]\log(1-F(r_i|\mathbf{x}_i;\theta)) \bigg) \end{aligned} $$

이때 $y_i|\mathbf{x}_i~~\sim N(\mathbf{x}_i\beta,~~\sigma^2)$이 가정된다면 위의 과정을 거친 모델을 Censored Normal Regression Model이라 한다. 정규성이 가정된 상태에서 $w_i$의 pdf($f$)를 추정하는 log likelihood function은 다음과 같이 구성될 수 있다.

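$$ \begin{aligned} l(\theta) = \sum_{i=1}^I \bigg( &1[w_i<r_i]\log\bigg({1 \over \sigma}\phi\bigg({w_i-\mathbf{x}_i\beta \over \sigma}\bigg)\bigg)\\ &+1[w_i=r_i]\log\bigg(1-\Phi\bigg({r_i-\mathbf{x}_i\beta \over \sigma}\bigg)\bigg) \bigg) \end{aligned} $$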

($\phi$는 표준정규분포의 pdf, $\Phi$는 표준정규분포의 cdf이다.)

[논문분석] Denoising Diffusion Implicit Models (DDIM)

Fri, 12 May 2023 09:13:47 GMT

<논문> [arXiv] Denoising Diffusion Implicit Models

<참고자료> [Reference] Christopher M Bishop, Pattern Recognition and Machine Learning, 2006 [tistory] DDIM : Denoising Diffusion Implicit Models [page] [논문리뷰] DDIM : Denoising Diffusion Implicit Model

DDIM은 DDPMs의 후속 논문이다. DDPMs의 단점인 '느린 Sampling 속도'를 해결하고자 새로운 Sampling 방안을 제시하였다. 이 포스팅은 DDPMs를 안다는 전제 하에 진행될 것이므로 DDPMs를 먼저 알고 오는 것을 권장한다.

1. Introduction

기존의 이미지 생성 모델인 VAE는 다양한 이미지를 생성할 수 있지만 quality가 낮았고, GAN은 높은 quality의 이미지를 생성할 수 있지만 다양성이 낮았다. 이와 달리 DDPMs의 diffusion model은 높은 quality의 이미지를 다양하게 생성할 수 있었지만, Sampling 속도가 느리다는 단점이 있었다. diffusion model의 구조 상, Pure Gaussian Noise에 $T(=1000)$번의 Denoising Process를 거쳐야 이미지를 생성할 수 있기 때문이다.

DDIM은 non-Markovian diffusion process를 활용하여 DDPMs에서의 Sampling 속도를 10배 이상 향상시킨다. 또한 consistency를 향상시켜, 비슷한 위치에서 $\mathbf{x}_T$를 Sampling 한다면 비슷한 이미지를 얻을 수 있다고 한다.

Before we start, note that the $\alpha_t$ used in DDIM differs from the one in DDPMs. DDPMs uses $\alpha_t = 1-\beta_t$, whereas DDIM uses $\alpha_t = \prod_{i=1}^t(1-\beta_i)$. In other words, what DDPMs writes as $\bar{\alpha}_t$ is written as $\alpha_t$ in DDIM. The DDIM notation is more compact, but this post uses the DDPMs convention for $\alpha_t$.

2. 개요

DDPMs의 느린 Sampling 속도는 무엇 때문일까? DDIM은 Markov Chain을 원인으로 보았다. 이미지를 Sampling 하는 데 $T(=1000)$번씩이나 Denoising process를 거치지 말고, 몇 단계씩 건너뛰며 Sampling 속도를 높이자는 것이 DDIM의 주장이다. 이를 위해 non-Markovian Forward process와 non-Markovian Reverse process를 제안하고, Reverse process가 $T(=1000)$번의 단계를 거치는 것이 아닌 부분수열(subsequence)에 따라 움직일것을 제안한다.

process가 바뀌면 Loss function이 바뀌는 것이 일반적인데, DDIM의 핵심 중 하나는 Loss function이 바뀌었음에도 불구하고 최적해의 위치는 바뀌지 않는다는 것이다. 즉 파라미터 $\theta$가 최적인 순간이 DDPMs, DDIM 모두 같다는 것이며, 따라서 새롭게 학습을 진행할 필요가 없다. 그래서 보통 학습은 DDPMs의 방법으로, Sampling은 DDIM의 방법으로 진행한다고 한다.

3. Non-Markovian process

First, recall the loss function of DDPMs. $$ L_{simple}(\theta) = \mathrm{E}_{t, \mathbf{x}_0, \epsilon} \bigg[ (\epsilon - \epsilon_{\theta}(\mathbf{x}_t, t))^2 \bigg] $$

The simple loss function is obtained from the original sum of KL divergences by dropping the coefficient that depends on $t$ $(={\beta_t^2 \over 2\sigma_t^2\alpha_t (1-\bar{\alpha}_t)})$ and replacing the sum by an average. It can therefore be seen as a special case of the following weighted loss. $$ L_\gamma(\epsilon_\theta) \coloneqq \sum_{t=1}^T \gamma_t \mathrm{E}_{\mathbf{x}_0 \sim q(\mathbf{x}_0),\epsilon_t \sim N(0, I)} \bigg[ ||\epsilon_\theta^{(t)}(\mathbf{x}_t, t) - \epsilon_t||_2^2 \bigg] $$

DDPMs는 본래 $\gamma_t ={\beta_t^2 \over 2\sigma_t^2\alpha_t (1-\bar{\alpha}_t)}$였으나, 간단한 Loss function으로 설계하기 위해 $\gamma_t=1$로 두었다. DDIM에서는 보다 일반적인 형태를 다루기 위해 $\gamma_t$를 유지한다.

DDIM에서 중요하게 본 포인트는 Loss function이 marginal distribution인 $q(\mathbf{x}_t|\mathbf{x}_0)$에 의해서만 결정되고, joint distribution인 $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$에는 영향을 받지 않는다는 것이다. [1] 즉, marginal distribution만 유지한다면 joint distribution은 무엇이 와도 상관없다고 해석하여, 같은 marginal distribution을 갖는 다른 process를 고려한다. 그 중, Markov process를 다룬 DDPMs와 달리 non-Markov process를 정의할 것이다.

3-1. Definition

위에서 언급한 바와 같이 non-Markovian process $q_\sigma$를 새롭게 설계하는데, 이 process는 $q_\sigma(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)$를 만족해야 한다. 이를 고려하여 $q_\sigma$를 정의하자.

$$ q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0) \coloneqq q_\sigma(\mathbf{x}_1|\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) $$

Reading this expression closely: given $\mathbf{x}_0$, we first obtain $\mathbf{x}_1$ $(=q_\sigma(\mathbf{x}_1|\mathbf{x}_0))$ and then obtain $\mathbf{x}_2$, $\mathbf{x}_3$, $\cdots$, $\mathbf{x}_T$ one after another using $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0)$. As in the image below, DDPMs (left) refers only to the previous image, while DDIM (right) refers to both the previous and the initial image. By Bayes' theorem,

$$ \begin{aligned} q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0) &= q_\sigma(\mathbf{x}_1|\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1}, \mathbf{x}_0) \\ &= q_\sigma(\mathbf{x}_1|\mathbf{x}_0) \prod_{t=2}^T {q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)~~q_\sigma(\mathbf{x}_t~~|~\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0)} \\ &= q_\sigma(\mathbf{x}_T|\mathbf{x}_0) \prod_{t=2}^T q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) \end{aligned} $$

where the last step uses the telescoping of the ratios $q_\sigma(\mathbf{x}_t|\mathbf{x}_0) / q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0)$. Every forward process aims at pure Gaussian noise, so we can define

$$ q_\sigma(\mathbf{x}_T|\mathbf{x}_0) \coloneqq N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I). $$

In addition, for all $t>1$ define

$$ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = N\bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}},~~\sigma_t^2I\bigg). $$

With these two definitions, $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0,~~(1-\bar{\alpha}_t)I) $$ holds for all $t$. The paper gives the proof in Lemma 1 of Appendix B. [2]

Once this property is established, the non-Markovian forward process follows. By Bayes' theorem, $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0) = {q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) q_\sigma(\mathbf{x}_t|\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0)} $$ so $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)$ can be regarded as the forward process. Since $\mathbf{x}_t$ depends on both $\mathbf{x}_{t-1}$ and $\mathbf{x}_0$, it is a non-Markov process. [3]

The form of $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ resembles $q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ in DDPMs. Substituting the DDPMs posterior variance $\sigma_t^2={\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$ gives $$ \begin{aligned} q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0) &= N\bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}},~~\sigma_t^2I\bigg) \\ &= N\bigg({\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}\mathbf{x}_t + {\sqrt{\bar{\alpha}_{t-1}}\beta_t \over 1-\bar{\alpha}_t}\mathbf{x}_0,~~{\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}I\bigg) \\ &= q(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) \end{aligned} $$ $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0) = N(\sqrt{\alpha_t}\mathbf{x}_{t-1},~~\beta_tI) = q(\mathbf{x}_t~~|\mathbf{x}_{t-1}) $$ which matches the expressions in DDPMs. In particular, the forward process then has the Markov property. In other words, $q_\sigma(\mathbf{x}_{t-1}|~\mathbf{x}_t, \mathbf{x}_0)$ is a general form that contains DDPMs as a special case.

실제 Sampling 하는 과정을 generative process라 하며, $p_\theta$를 이용하여 나타낸다. DDIM에서 generative process는 $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$를 닮아야 하는데, 구하고자 하는 $\mathbf{x}_0$가 조건으로 걸려있는 상태이다. 하지만 $\mathbf{x}_0$를 실제로는 모르는 상황이므로, $\mathbf{x}_t$를 이용하여 $\mathbf{x}_0$의 예측값을 구하고 이를 대체값으로 쓸 것이다. 만약 $\mathbf{x}_t$가 가진 $\epsilon_\theta^{(t)}$를 알 수 있다면 (예측할 수 있다면),

$$ \mathbf{x}_0 \approx {\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta^{(t)} \over \sqrt{\bar{\alpha}_t}} \coloneqq f_\theta^{(t)}(\mathbf{x}_t) $$

에 의해 $\mathbf{x}_0$를 예측할 수 있고,

$$ p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t) = \begin{cases} N(f_\theta^{(1)}(\mathbf{x}_1),~~\sigma_1^2I) &\text{if } t=1\\ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t)) &\text{otherwise} \end{cases} $$

This defines the generative process. Predicting $\epsilon_\theta^{(t)}$ from $\mathbf{x}_t$ is exactly what the DDPMs network does. From this, $p_\theta(\mathbf{x}_{0:T})$ can be defined as follows. $$ p_\theta(\mathbf{x}_{0:T}) = p_\theta(\mathbf{x}_T)\prod_{t=1}^T p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t) $$

Comparison

Comparing the processes with DDPMs, the forward process changes from $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ to $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)$ and the reverse process changes from $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$ to $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$. The generative process also changes from $$ p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) = \begin{cases} {1 \over \sqrt{\alpha_t}}\bigg(\mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t,t) \bigg) &\text{if } t=1\\ N\bigg({1 \over \sqrt{\alpha_t}}\bigg(\mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t,t) \bigg),~~\beta_tI\bigg) &\text{otherwise} \end{cases} $$ to $$ p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t) = \begin{cases} N(f_\theta^{(1)}(\mathbf{x}_1),~~\sigma_1^2I) &\text{if } t=1\\ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t)) &\text{otherwise} \end{cases} $$ Both papers generally rely on $q(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)$, but use a different process at $t=1$, the step that produces the final image.

3-2. Loss Function

DDIM에서 정의한 process에 따라 Loss function$(=J_\sigma(\epsilon_\theta))$을 재구성해야 한다. 다만 DDPMs에서의 Loss function과 동일한 논리흐름을 가지므로, Loss function의 형식 역시 같다. 즉

$$ \begin{aligned} J_\sigma(\epsilon_\theta) &\coloneqq \mathrm{E}_{\mathbf{x}_{0:T}\sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ \log q_\sigma(\mathbf{x}_{1:T}|\mathbf{x}_0)-\log p_\theta(\mathbf{x}_{0:T}) \bigg] \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T}\sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ \sum_{t=2}^T D_{KL}(q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t)) - \log p_\theta^{(1)}(\mathbf{x}_0|\mathbf{x}_1) \bigg] \end{aligned} $$

Of the two terms inside the expectation, each summand of the KL divergence term is

$$ \begin{aligned} \mathrm{E}&_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ D_{KL}(q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t)) \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ D_{KL}(q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,\mathbf{x}_0)||q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,f_\theta^{(t)}(\mathbf{x}_t))) \bigg] \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\mathbf{x}_0-f_\theta^{(t)}(\mathbf{x}_t)||_2^2 \over 2\sigma_t^2} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\epsilon_t - \epsilon_\theta^{(t)}(\mathbf{x}_t)||_2^2 \over 2d\sigma_t^2\bar{\alpha}_t} \bigg] \end{aligned} $$ Here $\equiv$ means equality up to terms that do not affect training. It is used so that the constant terms unrelated to training, which appear when computing the KL divergence between two multivariate Gaussian distributions, can be ignored.

The remaining term is $$ \begin{aligned} \mathrm{E}&_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ -\log p_\theta^{(1)}(\mathbf{x}_0|\mathbf{x}_1) \bigg] \\ &\equiv \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\mathbf{x}_0-f_\theta^{(1)}(\mathbf{x}_1)||_2^2 \over 2\sigma_1^2} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{0:T} \sim q_\sigma(\mathbf{x}_{0:T})} \bigg[ {||\epsilon_1 - \epsilon_\theta^{(1)}(\mathbf{x}_1)||_2^2 \over 2d\sigma_1^2\bar{\alpha}_1} \bigg] \end{aligned} $$

Rewriting $J_\sigma(\epsilon_\theta)$ with these two results gives

$$ J_\sigma(\epsilon_\theta) \equiv \sum_{t=1}^T {1 \over 2d\sigma_t^2\bar{\alpha}_t} \mathrm{E} \bigg[ ||\epsilon_t-\epsilon_\theta^{(t)}(\mathbf{x}_t)||_2^2\bigg] $$

which is the $L_\gamma$ of the Non-Markovian process section with $\gamma_t={1 \over 2d\sigma_t^2\bar{\alpha}_t}$. Hence, for a given $\sigma$ there exist positive weights $\gamma_t$ and a constant $C\in\reals$, neither depending on $\epsilon_\theta$, such that $J_\sigma(\epsilon_\theta) = L_\gamma(\epsilon_\theta) + C$. [4]

위의 결론은 DDIM을 완성하는 중요한 포인트다. 기존의 DDPMs의 표현식으로부터 $\sigma_t$를 사용하는 일반화 식으로 process를 재구성하였기 때문에 Loss function은 달라져야 하지만, 위 결론에 의해 우리는 어떤 $L_\gamma+C$로 Loss function을 대체할 수 있다.

또한 $\gamma$는 최적해를 찾는 데 영향을 주지 않는다. 이는 곧 특정 $L_\gamma$를 사용해도 최적해의 값은 변하지 않는다는 뜻이고, 따라서 우리가 잘 알고있는 (DDPMs에서 사용하는) $L_1$으로 대체할 수 있다는 의미이다. 만약 DDPMs에서 $\epsilon$-예측 네트워크를 학습시켰다면, 이를 추가학습 없이 DDIM에서 사용할 수 있다.

3-3. Sampling

Suppose the $\epsilon$-prediction network of DDPMs has already been trained, so that the noise $\epsilon_\theta^{(t)}(\mathbf{x}_t)$ contained in $\mathbf{x}_t$ can be predicted. Then, for $t \ge 2$, from

$$ p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t) = q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t,~~f_\theta^{(t)}(\mathbf{x}_t)) $$ we can sample $\mathbf{x}_{t-1}$ as $$ \mathbf{x}_{t-1} = \sqrt{\bar{\alpha}_{t-1}} \bigg({\mathbf{x}_t-\sqrt{1-\bar{\alpha}_t}\epsilon_\theta^{(t)}(\mathbf{x}_t) \over \sqrt{\bar{\alpha}_t}} \bigg) +\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\cdot\epsilon_\theta^{(t)}(\mathbf{x}_t)+\sigma_t\epsilon_t $$ $$ \text{where}~~\epsilon_t \sim N(0, I). $$ As shown earlier, when $\sigma_t^2={\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$ this coincides with the sampling step of DDPMs.
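The update above can be sketched as a single function. This is a minimal illustration using plain Python lists as vectors; the name `ddim_step` and its arguments are assumptions made for this example, and `eps_pred` stands in for the output $\epsilon_\theta^{(t)}(\mathbf{x}_t)$ of an already trained network.

```python
import math
import random

def ddim_step(x_t, eps_pred, abar_t, abar_prev, sigma_t, rng=random):
    """One reverse step x_t -> x_{t-1} following the sampling equation above.

    abar_t and abar_prev are the cumulative products alpha-bar at t and t-1.
    """
    # Predicted clean image f_theta(x_t): invert x_t = sqrt(abar_t) x_0 + sqrt(1 - abar_t) eps.
    x0_pred = [
        (x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
        for x, e in zip(x_t, eps_pred)
    ]
    # Coefficient of the "direction pointing to x_t"; clamped at 0 to guard against rounding.
    dir_coef = math.sqrt(max(1.0 - abar_prev - sigma_t ** 2, 0.0))
    return [
        math.sqrt(abar_prev) * x0 + dir_coef * e + sigma_t * rng.gauss(0.0, 1.0)
        for x0, e in zip(x0_pred, eps_pred)
    ]
```

With `sigma_t = 0` the step is deterministic, since the random term is multiplied by zero.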

하지만 여전히 한 단계씩 Sampling을 진행하고 있으며, 이대로면 DDPMs와 큰 차이가 없을 것이다. 우리의 목표는 몇 단계를 한 번에 건너뛰어 Sampling 속도를 높이는 것이다.

4. Accelerated Sampling

$p_\theta^{(t)}(\mathbf{x}_{t-1}|\mathbf{x}_t)$ still samples one step at a time, so it brings no speed-up. To accelerate sampling, **we sample in the order $\lang \tau_S,~\tau_{S-1},~\cdots,~\tau_2,~\tau_1\rang$** instead of $\lang T,~T-1,~\cdots,~2,~1\rang$. Here $\tau = \lang\tau_1,~\tau_2,~\cdots,~\tau_S\rang$ is any sub-sequence of $\lang1,~2,~\cdots,~T\rang$ that satisfies

- $\tau_i \in \{1,2,\cdots,T\}$ for every $i \in \{1,2,\cdots,S\}$
- $S < T$
- $i < j \Rightarrow \tau_i < \tau_j$
- $\tau_S = T$
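One simple way to build such a sub-sequence is to space the indices evenly. The helper below is only a sketch: the name `make_tau` and the even spacing are assumptions for illustration, and any increasing sub-sequence ending at $T$ would do.

```python
def make_tau(T: int, S: int) -> list[int]:
    """Return an increasing sub-sequence of 1..T with S entries whose last entry is T."""
    if not 1 <= S <= T:
        raise ValueError("need 1 <= S <= T")
    # Integer arithmetic keeps every entry in {1, ..., T}; consecutive entries differ
    # by at least 1 because T / S >= 1, so the sequence is strictly increasing.
    return [(i + 1) * T // S for i in range(S)]

tau = make_tau(1000, 10)  # ten sampling steps instead of a thousand
```

Choosing `S == T` recovers the full sequence $\lang 1, 2, \cdots, T \rang$, i.e. the non-accelerated case.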

4-1. Definition

We now redefine a process $q_{\sigma,\tau}$ that passes only through the $\mathbf{x}_{\tau_i}$. The construction mirrors that of $q_\sigma$, and since the $\mathbf{x}_t$ with $t \notin \tau$ are of little interest, they get a comparatively simple definition.

$$ q_{\sigma,\tau}(\mathbf{x}_{1:T}|\mathbf{x}_0) = q_{\sigma,\tau}(\mathbf{x}_{\tau_1}|\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_i}|\mathbf{x}_{\tau_{i-1}},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t|\mathbf{x}_0) $$

For $t \in \tau$, each image depends on the previous image in the sub-sequence $(\mathbf{x}_{\tau_{i-1}} \to \mathbf{x}_{\tau_i})$ and on the initial image $(\mathbf{x}_0)$; for $t \notin \tau$, it depends only on the initial image $(\mathbf{x}_0)$.

By Bayes' Theorem, $$ \begin{aligned} q_{\sigma,\tau}(\mathbf{x}_{1:T}|\mathbf{x}_0) &= q_{\sigma,\tau}(\mathbf{x}_{\tau_1}|\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_i}|\mathbf{x}_{\tau_{i-1}},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t|\mathbf{x}_0) \\ &= q_{\sigma,\tau}(\mathbf{x}_{\tau_1}|\mathbf{x}_0)\prod_{i=2}^S {q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i},\mathbf{x}_0)~q_{\sigma,\tau}(\mathbf{x}_{\tau_i}|\mathbf{x}_0) \over q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_0)} \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t|\mathbf{x}_0) \\ &= q_{\sigma,\tau}(\mathbf{x}_{\tau_S}|\mathbf{x}_0)\prod_{i=2}^S q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i},\mathbf{x}_0) \prod_{t\notin\tau} q_{\sigma,\tau}(\mathbf{x}_t|\mathbf{x}_0) \end{aligned} $$

The last equality holds because the ratios ${q_{\sigma,\tau}(\mathbf{x}_{\tau_i}|\mathbf{x}_0) / q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_0)}$ telescope. The $\mathbf{x}_t$ with $t \notin \tau$ are never used during sampling, so they are given the simple form below. The last image $(t=T)$ is defined as it was for $q_\sigma$.

$$ q_{\sigma,\tau}(\mathbf{x}_t|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0,~~(1-\bar{\alpha}_t)I)~~\text{for}~~t=T~~\text{and every}~~t\notin\tau $$

For every $\tau_i \in \tau$, define

$$ q_{\sigma,\tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i},\mathbf{x}_0) = N\bigg(\sqrt{\bar{\alpha}_{\tau_{i-1}}}\mathbf{x}_0+\sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma_{\tau_i}^2} \cdot {\mathbf{x}_{\tau_i}-\sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_{\tau_i}}},~~\sigma_{\tau_i}^2I \bigg) $$

Once again, Lemma 1 of Appendix B gives

$$ q_{\sigma,\tau}(\mathbf{x}_{\tau_i}|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0,~~(1-\bar{\alpha}_{\tau_i})I)~~\text{for every}~~\tau_i \in \tau $$

[5] The authors describe this definition as follows: the $\mathbf{x}_{\tau_i}$ together with $\mathbf{x}_0$ form a 'chain' (linked in a single line, like train cars), while the remaining variables together with $\mathbf{x}_0$ form a 'star graph' (each one connected only to $\mathbf{x}_0$).

Sampling itself also proceeds only along the sub-sequence $\tau$. The accelerated generative process $p_\theta$ is defined as $$ p_\theta(\mathbf{x}_{0:T}) \coloneqq p_\theta(\mathbf{x}_T) \prod_{i=1}^S p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i}) \prod_{t \notin \tau} p_\theta^{(t)}(\mathbf{x}_0|\mathbf{x}_t) $$

It is reasonable for the factors actually used in sampling, $p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i})$, to mimic the non-Markovian diffusion process, so they are defined as $$ p_\theta^{(\tau_i)} (\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i}) = q_{\sigma, \tau} (\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i}, f_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i})) $$ and the remaining steps are defined as $$ p_\theta^{(t)}(\mathbf{x}_0|\mathbf{x}_t) = N(f_\theta^{(t)}(\mathbf{x}_t), \sigma_t^2I) $$

4-2. Loss Function

앞선 Loss function에서 했던 바와 같이, 모든 $\sigma, \tau$에 대하여 $J_{\sigma, \tau} (\epsilon_\theta) = L_\gamma + C$로 표현할 수 있는 $\gamma_t$가 존재함을 보여야 한다. 만약 존재한다면, 임의의 부분수열만을 거치도록 설계된 generative process에 의해 만들어지는 Loss function은 어떤 $L_\gamma$로 대체될 수 있으며, $L_\gamma$의 최적해는 DDPMs에서 사용한 $L_1$의 최적해와 같으므로, $\epsilon$-예측 네트워크를 따로 학습시키지 않고 (=DDPMs의 네트워크를 이용하여) 원하는 부분수열만을 거쳐가도록 Sampling 할 수 있다.

관련된 내용은 논문의 Appendix C.1에 기술되어 있지만 증명은 생략되었고, 수식이 논리적으로 전개되지 않으므로 위 증명은 생략하겠다.

4-3. Sampling

The conclusion is that, even though the loss function changes, its optimum is located where the optimum of the DDPMs loss $L_1$ is. Sampling can therefore use $$ p_\theta(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i}) = \begin{cases} N(f_\theta^{(\tau_1)}(\mathbf{x}_{\tau_1}), \sigma_{\tau_1}^2I) &\text{if } i=1 \\ q_{\sigma, \tau}(\mathbf{x}_{\tau_{i-1}}|\mathbf{x}_{\tau_i}, f_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i})) &\text{otherwise} \end{cases} $$ which gives

$$ \mathbf{x}_{\tau_{i-1}} = \begin{cases} {\mathbf{x}_{\tau_1} - \sqrt{1-\bar{\alpha}_{\tau_1}}\epsilon_\theta^{(\tau_1)}(\mathbf{x}_{\tau_1}) \over \sqrt{\bar{\alpha}_{\tau_1}}} +\sigma_{\tau_1}\epsilon_{\tau_1} &\text{if } i=1 \\ \\ \sqrt{\bar{\alpha}_{\tau_{i-1}}} \cdot {\mathbf{x}_{\tau_i} - \sqrt{1-\bar{\alpha}_{\tau_i}}\epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}) \over \sqrt{\bar{\alpha}_{\tau_i}}} + \sqrt{1-\bar{\alpha}_{\tau_{i-1}}-\sigma_{\tau_i}^2} \cdot \epsilon_\theta^{(\tau_i)}(\mathbf{x}_{\tau_i}) + \sigma_{\tau_i}\epsilon_{\tau_i} &\text{otherwise} \end{cases} $$ $$ \text{where}~~\epsilon_{\tau_i} \sim N(0, I) $$ The formula looks much like the one before acceleration. The key difference is that the sub-sequence $\tau$ can be chosen freely, which cuts the number of sampling steps from $T$ to $S$.
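The accelerated sampler can be sketched end to end as a loop over $\tau$ in reverse. Everything named here (`alpha_bars`, `ddim_sample`, `oracle_eps`, the linear $\beta$ schedule, the target `x_star`) is an assumption for illustration. In particular, the sketch uses the convention $\tau_0 = 0$ with $\bar{\alpha}_0 = 1$ and the $\eta$-parametrized $\sigma_{\tau_i}$ from the DDIM paper, under which the last step ($i=1$) adds no noise. A toy "oracle" noise predictor for a data distribution concentrated on a single point replaces the trained network.

```python
import math
import random

def alpha_bars(betas):
    """Return [abar_0, ..., abar_T] with abar_0 = 1 and abar_t = prod_{i<=t} (1 - beta_i)."""
    out = [1.0]
    for b in betas:
        out.append(out[-1] * (1.0 - b))
    return out

def ddim_sample(eps_model, abar, tau, eta=0.0, dim=3, seed=0):
    """Sample along tau_S, ..., tau_1 only; eps_model(x, t) plays the role of the network."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(dim)]  # x_T ~ N(0, I)
    steps = [0] + list(tau)                         # tau_0 = 0 denotes the clean image
    for i in range(len(tau), 0, -1):
        t, prev = steps[i], steps[i - 1]
        a_t, a_prev = abar[t], abar[prev]
        # eta = 0 gives the deterministic sampler; eta = 1 matches the DDPMs variance.
        sigma = eta * math.sqrt((1 - a_prev) / (1 - a_t)) * math.sqrt(1 - a_t / a_prev)
        eps = eps_model(x, t)
        x0 = [(xi - math.sqrt(1 - a_t) * ei) / math.sqrt(a_t) for xi, ei in zip(x, eps)]
        coef = math.sqrt(max(1 - a_prev - sigma ** 2, 0.0))
        x = [math.sqrt(a_prev) * x0i + coef * ei + sigma * rng.gauss(0.0, 1.0)
             for x0i, ei in zip(x0, eps)]
    return x

# Toy setup: T = 1000 with an assumed linear beta schedule, and S = 10 sampling steps.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * k / (T - 1) for k in range(T)]
abar = alpha_bars(betas)
x_star = [0.5, -1.0, 2.0]  # hypothetical single "training image"

def oracle_eps(x, t):
    """Exact noise when x_0 is always x_star: eps = (x_t - sqrt(abar_t) x_star) / sqrt(1 - abar_t)."""
    return [(xi - math.sqrt(abar[t]) * si) / math.sqrt(1 - abar[t]) for xi, si in zip(x, x_star)]

tau = [100 * k for k in range(1, 11)]
sample = ddim_sample(oracle_eps, abar, tau, eta=0.0)
```

With the oracle predictor the ten-step sampler lands on `x_star`, because every predicted $\mathbf{x}_0$ is exact regardless of how many steps are skipped. A real network is only approximately correct, which is where the trade-off between $S$ and sample quality comes from.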

5. Experiment

$\sigma$와 $\tau$는 실험을 진행하는 사람에 따라 변화할 수 있는 하이퍼파라미터다. DDIM은 다양한 $\sigma$와 $\tau$를 비교하여 가장 높은 성능을 보이는 순간을 찾는 실험을 진행하였다.

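$S$ is the length of the sub-sequence $\tau$. To compare against DDPMs, DDIM introduces $\eta$ through $$ \sigma_{\tau_i}(\eta) = \eta \sqrt{1-\bar{\alpha}_{\tau_{i-1}} \over 1-\bar{\alpha}_{\tau_i}} \sqrt{1-{\bar{\alpha}_{\tau_i} \over \bar{\alpha}_{\tau_{i-1}}}} $$ For consecutive steps ($\tau_i = t$, $\tau_{i-1} = t-1$) the second factor equals $\sqrt{\beta_t}$, so this reduces to $\sigma_t = \eta \sqrt{\beta_t (1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$.

When $\eta = 1$, $\sigma_{\tau_i}$ equals the value used in DDPMs, and the experiments decrease $\eta$ down to $\eta = 0$. The setting $\eta = 1$, $S=1000$ is exactly the DDPMs experiment.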
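As a quick numeric check, the sketch below assumes the parametrization from the DDIM paper, $\sigma_{\tau_i}(\eta) = \eta \sqrt{(1-\bar{\alpha}_{\tau_{i-1}})/(1-\bar{\alpha}_{\tau_i})}\sqrt{1-\bar{\alpha}_{\tau_i}/\bar{\alpha}_{\tau_{i-1}}}$, and confirms that for consecutive steps and $\eta = 1$ its square equals the DDPMs variance $\beta_t(1-\bar{\alpha}_{t-1})/(1-\bar{\alpha}_t)$. The function name and the numbers are illustrative assumptions.

```python
import math

def ddim_sigma(abar_t, abar_prev, eta):
    """sigma for a jump from alpha-bar = abar_t back to alpha-bar = abar_prev."""
    return eta * math.sqrt((1.0 - abar_prev) / (1.0 - abar_t)) * math.sqrt(1.0 - abar_t / abar_prev)

# Consecutive steps: abar_t = abar_prev * (1 - beta_t).
beta_t = 0.02
abar_prev = 0.5
abar_t = abar_prev * (1.0 - beta_t)

sigma_ddim = ddim_sigma(abar_t, abar_prev, eta=1.0)
var_ddpm = beta_t * (1.0 - abar_prev) / (1.0 - abar_t)
```

Setting `eta=0.0` returns zero, which is the fully deterministic sampler discussed below.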

평가지표는 FID이며, 값이 작을수록 높은 성능을 의미한다. 전체적으로 $S$가 클수록, $\eta$가 작을수록 성능이 좋음을 알 수 있다. 이로부터 DDIM은 $\eta=0$을 채택하여 $\sigma_{\tau_i}=0$으로 둔다. $S=1000$이면 Sampling 시간이 오래 걸리므로, 어느정도 성능 하락을 감수하면서 속도를 10배 이상$(S \le100)$ 향상시켰다. [6]

6. Endnotes

[1] The DDPMs network is trained to extract the noise contained in $\mathbf{x}_t$. A known answer $\epsilon \sim N(0, I)$ is prepared first, a problem $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$ is built from it, and the network output $\epsilon_\theta^{(t)}(\mathbf{x}_t)$ is compared against the answer $\epsilon$. The only distribution needed to build $\mathbf{x}_t$ is the marginal distribution $$ q(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0 , (1-\bar{\alpha}_t)I) $$ Hence the idea: as long as this marginal distribution is unchanged, the joint distribution can be anything.

[2] Appendix B의 Lemma 1을 증명하자. 논문은 참고문헌으로 Christopher M Bishop 의 Pattern recognition and machine learning을 인용하였다. 해당 서적에서 참고한 수식은 다음과 같다.

The proof of Lemma 1 works like mathematical induction, run backwards from $t=T$. Given the two definitions $$ q_\sigma(\mathbf{x}_T|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I) $$ $$ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0) = N \bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}}, \sigma_t^2I \bigg) $$ we show that, for every $t \le T$, $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) \\ \Downarrow \\ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I) $$ Here $a \Rightarrow b$ means that $b$ is true whenever $a$ is true.

Once this implication is proved, the claim follows for $t=T-1,~\cdots,~2,~1$ in turn, because the case $t=T$, $q_\sigma(\mathbf{x}_T|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)$, holds by definition.

Assume $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) $$ holds. Treating $\mathbf{x}_0$ as a constant, this can be written as $$ q_\sigma(\mathbf{x}_t) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) $$ and likewise $$ q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t) = N \bigg(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \cdot {\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0 \over \sqrt{1-\bar{\alpha}_t}}, \sigma_t^2I \bigg) $$ so that, by equation (2.115), $$ \begin{aligned} q_\sigma(\mathbf{x}_{t-1}) &= q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0) \\ &=N \bigg( \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, \bigg( \sigma_t^2 + {1-\bar{\alpha}_{t-1}-\sigma_t^2 \over 1-\bar{\alpha}_t} \cdot (1-\bar{\alpha}_t) \bigg)I \bigg) \\ &= N(\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I) \end{aligned} $$ Therefore, by induction, $$ q_\sigma(\mathbf{x}_t|\mathbf{x}_0) = N(\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) $$ holds for every $t \le T$.
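The induction step can be checked numerically in the scalar case. The sketch below applies the linear-Gaussian marginal rule (if $y|x \sim N(ax+b, L^{-1})$ and $x \sim N(\mu, \Lambda^{-1})$, then $y \sim N(a\mu+b, L^{-1}+a^2\Lambda^{-1})$), which is the form of Bishop's (2.115) used above. The function name and the sample values are assumptions for illustration.

```python
import math

def marginal_of_prev(abar_t, abar_prev, sigma_t, x0=1.0):
    """Mean and variance of x_{t-1} | x_0 implied by q(x_t|x_0) and q_sigma(x_{t-1}|x_t, x_0)."""
    # q_sigma(x_{t-1}|x_t, x_0) has mean a * x_t + b and variance sigma_t^2.
    a = math.sqrt(1.0 - abar_prev - sigma_t ** 2) / math.sqrt(1.0 - abar_t)
    b = (math.sqrt(abar_prev) - a * math.sqrt(abar_t)) * x0
    # Induction hypothesis: x_t | x_0 ~ N(sqrt(abar_t) x_0, 1 - abar_t).
    mean_t, var_t = math.sqrt(abar_t) * x0, 1.0 - abar_t
    # Linear-Gaussian marginal: mean a*mu + b, variance sigma^2 + a^2 * var.
    return a * mean_t + b, sigma_t ** 2 + a * a * var_t

mean, var = marginal_of_prev(abar_t=0.3, abar_prev=0.5, sigma_t=0.2, x0=1.7)
```

The result matches $\sqrt{\bar{\alpha}_{t-1}}\,x_0$ and $1-\bar{\alpha}_{t-1}$ for any admissible $\sigma_t$, which is exactly why the marginals do not depend on the choice of $\sigma$.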

[3] Since $q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)$ and $q_\sigma(\mathbf{x}_t|\mathbf{x}_0)$ are known, $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)$ can be computed explicitly.

$$ q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0) = {q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_t, \mathbf{x}_0)~q_\sigma(\mathbf{x}_t|\mathbf{x}_0) \over q_\sigma(\mathbf{x}_{t-1}|\mathbf{x}_0)} = N(\mathbf{x}_t;\tilde{\mu}, \tilde{\Sigma}) $$ where

$$ \begin{aligned} \tilde{\mu} &= {\sqrt{1-\bar{\alpha}_t}\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2} \over 1-\bar{\alpha}_{t-1}}\mathbf{x}_{t-1} + \bigg(\sqrt{\bar{\alpha}_t} - {\sqrt{1-\bar{\alpha}_t}\sqrt{1-\bar{\alpha}_{t-1}-\sigma_t^2}\sqrt{\bar{\alpha}_{t-1}} \over 1-\bar{\alpha}_{t-1}}\bigg) \mathbf{x}_0, \\ \tilde{\Sigma} &= {\sigma_t^2(1-\bar{\alpha}_t) \over 1-\bar{\alpha}_{t-1}}I \end{aligned} $$ which is again a Multivariate Gaussian Distribution.

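In general $\tilde{\mu}$ still depends on $\mathbf{x}_0$, so this forward process is not Markovian. For the particular choice $\sigma_t^2={\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$ the coefficient of $\mathbf{x}_0$ in $\tilde{\mu}$ vanishes, and $q_\sigma(\mathbf{x}_t|\mathbf{x}_{t-1},\mathbf{x}_0)$ becomes the Markov process of DDPMs. Either way, whether the forward process is Markov matters little here.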

[4] In the paper, this is stated as Theorem 1 in Appendix B.

[5] [2]와 같은 과정이다. $t$ 대신 $\tau_i$가, $t-1$ 대신 $\tau_{i-1}$이 들어갈 뿐이다.

[6] $\eta=0$이면 $\sigma_{\tau_i}=0$이고, 이는 diffusion process와 denoising process 모두에서 무작위성을 넣지 않겠다는 의미이다. 오히려 생성모델의 다양성을 떨어트리는 인사이트로 보인다.

결과론적인 이야기지만 위 상황에 의미를 부여하면, 기존 DDPMs에서 denoising process는 $\mu_\theta$로 denoise를 한 뒤 $\Sigma_\theta$로 다시 노이즈를 넣었다고 볼 수 있다. Denoise 하는 데 노이즈를 다시 넣는 것이 직관에는 맞지 않으며, DDIM에서 $\sigma_{\tau_i}=0$으로 둠으로써 이러한 비효율을 해소했다고 해석할 수 있다.

[논문분석] Denoising Diffusion Probabilistic Models (DDPMs)

Mon, 10 Apr 2023 09:28:58 GMT

<논문> [arXiv] Denoising Diffusion Probabilistic Models

<참고자료> [arXiv] Deep Unsupervised Learning using Nonequilibrium Thermodynamics [arXiv] Understanding Diffusion Models: A Unified Perspective [github.io] What are Diffusion Models? [HuggingFace] The Annotated Diffusion Model [YouTube] 생성모델부터 Diffusion까지 [YouTube] [Paper Review] Denoising Diffusion Probabilistic Models

Diffusion 모델은 최근(2022) 가장 핫한 생성 AI로 주목받고 있다. 비평형 열역학의 아이디어로부터 출발한 Diffusion model이 어떤 원리로 학습하고 추론하는지 알아보자.

1. Introduction

Diffusion Model은 Denoising Diffusion Probabilistic Models (이하 DDPMs)에서 처음 제시된 모델이 아니다. 2015년에 출간된 논문 (이하 DPM)에서 처음으로 Diffusion Model을 제안하였고, 이를 개선한 것이 현재의 DDPMs이다. 개인적으로 DPM과 VAE를 이해하면 DDPMs를 이해할 수 있다고 생각한다. 그래도 이번 포스팅에서는 DDPMs를 최대한 깊이 파헤쳐볼 것이다.

2. DDPMs

DDPMs에서 Diffusion Model이 어떻게 작동하는지 대략적으로 알아보자. 앞으로 기술되는 'Diffusion Model'은 모두 DDPMs의 모델을 지칭한다.

2-1. 개요

'Diffusion'이라는 단어는 '확산', '전파'라는 뜻을 가진다. Diffusion Model이 처음 제안된 DPM은 비평형 열역학으로부터 아이디어를 얻었는데, 비평형 열역학은 열평형 상태가 아닌 시스템을 다루는 분야이다. 즉, 열평형 상태가 아니므로 열은 평형을 이루기 위해 이동할 것이다. 마치 차가운 손으로 뜨거운 손난로를 쥐면 손이 따뜻해지는 것처럼, 열이 고루 퍼진 상태가 일반적이고 평형하지 않은 상태는 자연스럽지 않은 의도적인 형태이다.

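A similar idea applies to images: the images we look at have their pixels arranged deliberately. [1] The core of the Diffusion Model is to scatter such a deliberate image into a natural noise image, and then to reconstruct the deliberate image from the noise image through the reverse procedure.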

(1)과 같이 원래 이미지에 노이즈를 첨가하는 과정을 Diffusion Process라 하고, (2)와 같이 완전한 노이즈 이미지로부터 노이즈를 제거하는 과정을 Denoising Process라 한다.

우리의 목표는 Denoising Process를 잘 수행하는 모델을 만들어, 무작위로 Sampling 한 노이즈 이미지로부터 새로운 이미지를 생성하는 것이다. 하지만 Denoising Process를 수행하는 모델은 일반적으로 만들기 어려우므로, Diffusion Process를 참고하며 설계할 것이다.

2-2. Diffusion Process

Diffusion Process, 또는 Forward Process라 불리는 단계이다.

실제 데이터 분포 $q(\mathbf{x})$에서 이미지 $\mathbf{x}_0$를 Sampling 하였다고 하자. 이미지 $\mathbf{x}_0$에 노이즈를 첨가할 것인데, 한 번에 넣는 것이 아닌 여러 단계(=$T$번)에 나누어 조금씩 넣을 것이다. $\mathbf{x}_0$에 노이즈를 첨가한 이미지를 $\mathbf{x}_1$, $\mathbf{x}_1$에 노이즈를 첨가한 이미지를 $\mathbf{x}_2$, $\cdots$ 으로 정의하자. [2] 추후 Denoising Process에서는 $\mathbf{x}_0$를 만들기 위해 $\mathbf{x}_1$, $\mathbf{x}_2$, $\cdots$, $\mathbf{x}_T$가 활용되므로, $\mathbf{x}_0$를 제외한 이들을 잠재변수(latent variable)라 부른다.

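We assume that the image at each step can be produced knowing only the image at the previous step. [3] Accordingly, the image $\mathbf{x}_t$ at step $t$ is made by adding Gaussian noise to $\mathbf{x}_{t-1}$. [4] This is written as $$ q(\mathbf{x}_t|\mathbf{x}_{t-1}) = N(\mathbf{x}_t;\sqrt{1-\beta_t}\mathbf{x}_{t-1},~\beta_{t}I) $$

- $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is the likelihood of sampling $\mathbf{x}_t$ from the distribution whose parameters are set by $\mathbf{x}_{t-1}$. Equivalently, it is **the likelihood that the image at step $t$ is $\mathbf{x}_t$ given that the image at step $t-1$ is $\mathbf{x}_{t-1}$.** More loosely, it is the distribution that $\mathbf{x}_t$ follows.
- Because this is the Diffusion Process, increasing $t$ is the forward direction. The present $(=\mathbf{x}_t)$ is determined from past information $(=\mathbf{x}_{t-1})$.
- $\mathbf{x}_t$ is obtained by sampling one value (a vector) from a Multivariate Gaussian Distribution. The parameters of that distribution depend only on $\mathbf{x}_{t-1}$, so the sequence is a Markov Chain.
- $\beta_t$ controls how much noise is added. Its value differs with $t$, and it is chosen by us rather than learned. [5] In other words, it is known in advance.
- $I$ is the identity matrix. It keeps the covariance simple, which makes the calculations easier. A Gaussian Distribution whose covariance has the form $a \cdot I$ for a constant $a$ is called an Isotropic Gaussian Distribution.
- $\mathbf{x}_{t-1}$ is multiplied by $\sqrt{1-\beta_t}$ to keep the variance of $\mathbf{x}_t$ at 1: if $\mathbf{x}_{t-1}$ has unit variance, then $\mathbf{x}_t$ has variance $(1-\beta_t) + \beta_t = 1$. [9]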

The expression above is a conditional distribution for a single step. From it we can derive $q(\mathbf{x}_{0:T})\coloneqq q(\mathbf{x}_0,~\mathbf{x}_1,~\cdots,~\mathbf{x}_T)$, the joint distribution over the images of all $T+1$ steps. By the chain rule, $$ \begin{aligned} q(\mathbf{x}_{0:T}) &\coloneqq q(\mathbf{x}_0,~\mathbf{x}_1,~\cdots,~\mathbf{x}_T) \\ &= q(\mathbf{x}_T|\mathbf{x}_{T-1},~\cdots,~\mathbf{x}_1,~\mathbf{x}_0) \cdot q(\mathbf{x}_{T-1},~\cdots,~\mathbf{x}_1,~\mathbf{x}_0) \\ &= q(\mathbf{x}_T|\mathbf{x}_{T-1},~\cdots,~\mathbf{x}_1,~\mathbf{x}_0) \cdot q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2},~\cdots,~\mathbf{x}_0) \cdot q(\mathbf{x}_{T-2},~\cdots,~\mathbf{x}_0) \\ &= \cdots \\ &= q(\mathbf{x}_T|\mathbf{x}_{T-1},~\cdots,~\mathbf{x}_0) \cdot q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2},~\cdots,~\mathbf{x}_0) \cdots q(\mathbf{x}_1|\mathbf{x}_0) \cdot q(\mathbf{x}_0) \end{aligned} $$ and, by the Markov Property, $$ \begin{aligned} &q(\mathbf{x}_T|\mathbf{x}_{T-1},~\cdots,~\mathbf{x}_0) \cdot q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2},~\cdots,~\mathbf{x}_0) \cdots q(\mathbf{x}_1|\mathbf{x}_0) \cdot q(\mathbf{x}_0) \\ &= q(\mathbf{x}_T|\mathbf{x}_{T-1}) \cdot q(\mathbf{x}_{T-1}|\mathbf{x}_{T-2}) \cdots q(\mathbf{x}_1|\mathbf{x}_0) \cdot q(\mathbf{x}_0) \\ &= q(\mathbf{x}_0) \cdot \prod_{t=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1}) \end{aligned} $$ Since $q(\mathbf{x}_0)$ cannot be known in practice [6], we move $\mathbf{x}_0$ into the condition and keep only the remaining factors. $$ {q(\mathbf{x}_{0:T}) \over q(\mathbf{x}_0)} = q(\mathbf{x}_{1:T}|\mathbf{x}_0) = \prod_{t=1}^T q(\mathbf{x}_t|\mathbf{x}_{t-1}) $$ This result is used later when designing the Loss Function, so keep it in mind.

If $\mathbf{x}_{t-1}$ is known, the Gaussian Distribution $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is known, so $\mathbf{x}_t$ can be obtained (that is, sampled). Using the Reparameterization Trick [7] for the Gaussian Distribution, $\mathbf{x}_t$ is written as $$ \mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1}+\sqrt{\beta_t}\epsilon,~~\text{where}~~\epsilon\sim N(0, I) $$
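The reparameterized step translates directly into code. This is a minimal sketch with plain Python lists as vectors; the function name `q_step` and the three $\beta$ values are assumptions chosen for illustration.

```python
import math
import random

def q_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) x_{t-1}, beta_t I) as mean + std * eps."""
    return [
        math.sqrt(1.0 - beta_t) * x + math.sqrt(beta_t) * rng.gauss(0.0, 1.0)
        for x in x_prev
    ]

rng = random.Random(0)
x = [1.0, -2.0, 0.5]          # a toy "image" with three pixels
for beta in [0.1, 0.2, 0.3]:  # three forward steps with hand-picked betas
    x = q_step(x, beta, rng)
```

The two extremes show what $\beta_t$ does: `beta_t = 0` returns the input unchanged, while `beta_t = 1` discards the input and returns pure noise.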

Now think about this backwards. $\mathbf{x}_t$ is determined once $\mathbf{x}_{t-1}$ is known, $\mathbf{x}_{t-1}$ once $\mathbf{x}_{t-2}$ is known, $\cdots$, and $\mathbf{x}_1$ once $\mathbf{x}_0$ is known. So $\mathbf{x}_t$ can be obtained from $\mathbf{x}_0$ alone! This is like finding the closed form of a recurrence. The derivation below expresses $\mathbf{x}_t$ in terms of $\mathbf{x}_0$; for convenience, define $\alpha_t=1-\beta_t$ and $\bar{\alpha}_t=\prod_{i=1}^{t}\alpha_i$. $$ \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1}+\sqrt{1-\alpha_t}\epsilon_t \\ &= \sqrt{\alpha_t}\left(\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_{t-1}}\epsilon_{t-1}\right) +\sqrt{1-\alpha_t}\epsilon_t \\ &= \sqrt{\alpha_t}\sqrt{\alpha_{t-1}}\mathbf{x}_{t-2} + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1}+\sqrt{1-\alpha_t}\epsilon_t \\ &~~\text{where}~~\epsilon_t,\epsilon_{t-1} \sim N(0, I) \end{aligned} $$ Both $\epsilon_t$ and $\epsilon_{t-1}$ are sampled from $N(0, I)$; the subscripts make explicit that they are independent draws. Since $$ \sqrt{1-\alpha_t}\epsilon_t \sim N(0, (1-\alpha_t)I), \\ \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1} \sim N(0, \alpha_t(1-\alpha_{t-1})I) $$ we have $$ \begin{aligned} \sqrt{1-\alpha_t}\epsilon_t + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1} &\sim N(0, ((1-\alpha_t) + \alpha_t(1-\alpha_{t-1}))I) \\ &= N(0, (1-\alpha_t\alpha_{t-1})I) \end{aligned} $$ [8] and, again by the Reparameterization Trick, $$ \sqrt{1-\alpha_t}\epsilon_t + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1} = \sqrt{1-\alpha_t\alpha_{t-1}}\tilde{\epsilon},~~\text{where}~~\tilde{\epsilon}\sim N(0, I) $$ The symbol $\tilde{\epsilon}$ carries no meaning beyond denoting a vector sampled from $N(0, I)$.
That is, $$ \mathbf{x}_t = \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\tilde{\epsilon} $$ and repeating the same step gives $$ \begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t\alpha_{t-1}}\mathbf{x}_{t-2}+\sqrt{1-\alpha_t\alpha_{t-1}}\tilde{\epsilon} \\ &= \sqrt{\alpha_t\alpha_{t-1}\alpha_{t-2}}\mathbf{x}_{t-3}+\sqrt{1-\alpha_t\alpha_{t-1}\alpha_{t-2}}\tilde{\epsilon} \\ &= \cdots \\ &= \sqrt{\alpha_t\alpha_{t-1}\cdots\alpha_1}\mathbf{x}_0+\sqrt{1-\alpha_t\alpha_{t-1}\cdots\alpha_1}\tilde{\epsilon} \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\tilde{\epsilon} \end{aligned} $$ We know every $\beta_t$, hence every $\alpha_t$ and every $\bar{\alpha}_t$. So given only $\mathbf{x}_0$, we can obtain $\mathbf{x}_t$ in one shot without walking through the steps one at a time. This is written as $$ q(\mathbf{x}_t|\mathbf{x}_0) = N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) $$
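The closed form can be verified without sampling by tracking the mean coefficient and the noise variance through the one-step recursion. In the sketch below, `forward_moments` and the $\beta$ values are illustrative assumptions; the recursion itself is the one derived above.

```python
import math

def forward_moments(betas):
    """Track x_t = m_t * x_0 + noise of variance v_t through x_t = sqrt(a_t) x_{t-1} + sqrt(1-a_t) eps."""
    m, v = 1.0, 0.0  # at t = 0: x_0 itself, no noise
    for b in betas:
        m = math.sqrt(1.0 - b) * m  # the mean coefficient shrinks by sqrt(alpha_t)
        v = (1.0 - b) * v + b       # variances of independent Gaussians add
    return m, v

betas = [0.01, 0.05, 0.1, 0.2]
abar = math.prod(1.0 - b for b in betas)  # alpha-bar after four steps
m, v = forward_moments(betas)
```

After the loop, `m` equals $\sqrt{\bar{\alpha}_t}$ and `v` equals $1-\bar{\alpha}_t$ up to rounding, matching $q(\mathbf{x}_t|\mathbf{x}_0)$.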

Diffusion Process 해석

지금까지 Diffusion Process가 어떻게 정의되었는지 알아보았다. 직전 이미지에 Gaussian noise를 첨가해 다음 이미지를 만드는 방식이다. 지금부터는 수식에 담긴 의미를 알아보자.

The paper sets $T=1000$, but suppose $T$ grows without bound. $\beta_T$ approaches 0.02 and $\alpha_T$ approaches 0.98, yet $\bar{\alpha}_T$ approaches 0 because it is a product of ever more factors smaller than 1. So as $T$ grows, $q(\mathbf{x}_T|\mathbf{x}_0)=N(\mathbf{x}_T;\sqrt{\bar{\alpha}_T}\mathbf{x}_0, (1-\bar{\alpha}_T)I)$ approaches $N(0, I)$ regardless of $\mathbf{x}_0$. $(\therefore q(\mathbf{x}_T) \approx N(0, I))$
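To see how small $\bar{\alpha}_T$ actually becomes, the sketch below assumes the linear schedule used in the DDPMs paper, with $\beta_1 = 10^{-4}$ rising linearly to $\beta_T = 0.02$ over $T = 1000$ steps. The helper names are assumptions for illustration.

```python
def linear_betas(T, beta_1=1e-4, beta_T=0.02):
    """Linearly spaced beta_1, ..., beta_T (assumed schedule)."""
    return [beta_1 + (beta_T - beta_1) * k / (T - 1) for k in range(T)]

def alpha_bars(betas):
    """Cumulative products alpha-bar_1, ..., alpha-bar_T with alpha_t = 1 - beta_t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

abar = alpha_bars(linear_betas(1000))
print(abar[0], abar[-1])  # first and last cumulative products
```

Under this schedule $\bar{\alpha}_T$ is on the order of $10^{-5}$, so the mean $\sqrt{\bar{\alpha}_T}\mathbf{x}_0$ is nearly zero and the variance $1-\bar{\alpha}_T$ is nearly one already at $T = 1000$.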

충분히 큰 $T$에 대하여 $\mathbf{x}_T \sim N(0, I)$이다. [9] 이로부터, Diffusion Process는 이전 단계의 이미지로부터 다음 단계의 이미지 분포를 조금씩 옮겨가 결국 $N(0, I)$에 가까워지는 과정임을 알 수 있다.

An intuitive way to read $\mathbf{x}_t = \sqrt{1-\beta_t}\mathbf{x}_{t-1}+\sqrt{\beta_t}\epsilon$ is as something like an interpolation between $\mathbf{x}_{t-1}$ and $\epsilon$. Since $\beta_t$ increases within the interval $(0, 1)$, each application moves $\mathbf{x}_t$ a little closer to $\epsilon$, and the distribution of $\mathbf{x}_t$ a little closer to $N(0, I)$.

이후 Diffusion process의 반대 과정인 Denoising process를 진행할 것인데, 이는 반대로 $N(0, I)$에서 $\mathbf{x}_0$의 분포를 찾아가는 과정임을 추측해볼 수 있다.

2-3. Denoising Process

This is the process of turning a pure-noise image back into the original image $\mathbf{x}_0$, also called the Reverse Process. Since the Diffusion Process was defined through $q$, it is natural to define the Denoising Process through $q$ as well. Ideally, then, we would know $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$.

In practice, however, $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ cannot be computed. We know $q(\mathbf{x}_t|\mathbf{x}_{t-1})$, so Bayes' Theorem comes to mind, but it requires $q(\mathbf{x}_t)$, which in turn requires $q(\mathbf{x}_0)$, and that is unavailable. [6]

Intuitively, even if the image at step $t-1$ were not $\mathbf{x}_{t-1}$, the image at step $t$ could still plausibly be $\mathbf{x}_t$. Picture building a distribution over $\mathbf{x}_t$ from each of many different $\mathbf{x}_{t-1}$ and sampling $\mathbf{x}_t$ from each. Working out how likely it is that step $t-1$ was one particular $\mathbf{x}_{t-1}$ does not look easy.

We therefore define $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to approximate $q(\mathbf{x}_{t-1}|\mathbf{x}_t)$ and work out which distribution it is.

According to Feller's 1949 study, if the Diffusion Process is Gaussian and each change $(=\beta_t)$ is sufficiently small, then the reverse process (the Denoising Process) is Gaussian as well. [10] So $p_\theta$ can be defined as $$ p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t) = N(\mathbf{x}_{t-1};\mu_\theta(\mathbf{x}_t, t),~\Sigma_\theta(\mathbf{x}_t, t)) $$

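- $p_\theta(\mathbf{x}_{t-1} | \mathbf{x}_t)$ is the likelihood of sampling $\mathbf{x}_{t-1}$ from the distribution whose parameters are set by $\mathbf{x}_t$. Equivalently, it is **the likelihood that the image at step $t-1$ is $\mathbf{x}_{t-1}$ given that the image at step $t$ is $\mathbf{x}_t$.** More loosely, it is the distribution that $\mathbf{x}_{t-1}$ follows.
- Because this is the Denoising Process, decreasing $t$ is the forward direction. A larger $t$ is further in the past, and the present $(=\mathbf{x}_{t-1})$ is determined from past information $(=\mathbf{x}_t)$.
- The parameters $\mu_\theta$ and $\Sigma_\theta$ of the distribution are determined by $\mathbf{x}_t$ and $t$. [11]
- The reverse step is Gaussian only when each change is small, so the Denoising Process necessarily removes noise a little at a time over many steps $(=T)$.
- The subscript $(_\theta)$ is written explicitly to mark quantities determined by training.

As in the Diffusion Process, the expression above is a conditional distribution for a single step. From it we can obtain $p_\theta(\mathbf{x}_{0:T}) \coloneqq p_\theta(\mathbf{x}_0,~\mathbf{x}_1,~\cdots,~\mathbf{x}_T)$, the joint distribution over the images of all $T+1$ steps. The derivation is identical to that of the Diffusion Process, so only the result is given. $$ p_\theta(\mathbf{x}_{0:T}) \coloneqq p_\theta(\mathbf{x}_0,~\mathbf{x}_1,~\cdots,~\mathbf{x}_T) = p_\theta(\mathbf{x}_T)\cdot\prod_{t=1}^Tp_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) $$ $p_\theta$ restores the original image from a pure-noise image, so its starting point $(=p_\theta(\mathbf{x}_T))$ is the pure-noise distribution $(=N(0, I))$. Because $p_\theta(\mathbf{x}_T) = N(\mathbf{x}_T;0, I)$ is known, we keep it in the joint form rather than conditioning on it. This result is also used later when designing the Loss Function.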

Denoising Process 해석

Denoising Process의 형태는 Diffusion Process와 거의 똑같다. 즉, 추측한 바와 같이 Denoising Process는 $N(0, I)$에서 $\mathbf{x}_0$의 분포를 찾아가는 과정이며, 각 단계별로 Gaussian Distribution을 정의하고 Sampling 하는 것을 반복할 것이다.

Our goal is to find the $p_\theta$ that performs ideal denoising, which amounts to finding $\mu_\theta$ and $\Sigma_\theta$. If we know how to remove the noise (that is, how to shift the distribution) at each step $t$, we can generate an image from pure noise.

$\mu_\theta$와 $\Sigma_\theta$는 어떤 목표를 가져야 할까? $p_\theta$를 바탕으로 이미지를 생성하였을 때, 이미지가 Training data 이미지와 비슷하다면 좋은 $\mu_\theta$와 $\Sigma_\theta$를 찾았다고 할 수 있을 것이다. 즉, VAE에서 사용한 방법과 같이 $p_\theta(\mathbf{x}_0)$가 최대가 되도록 하는 방향으로 $\mu_\theta$와 $\Sigma_\theta$를 추측할 것이다.

2-4. Loss Function 설계

$p_\theta(\mathbf{x}_0)$의 증가는 $\log{\left(p_\theta(\mathbf{x}_0)\right)}$의 증가와 같고, $-\log{\left(p_\theta(\mathbf{x}_0)\right)}$의 감소와 같다. [12] $-\log{\left(p_\theta(\mathbf{x}_0)\right)}$를 낮추기 위해 $\mu_\theta$와 $\Sigma_\theta$가 어떤 값을 가져야 하는지 알아보자. $\mu_\theta$와 $\Sigma_\theta$가 어떤 값과 가까워져야 하는지 아는 것이 곧 Loss Function을 설계하는 것이다.

The probability density function of a distribution integrates to 1, so [13] $$ \begin{aligned} -\log(p_\theta(\mathbf{x}_0)) &= -\log(p_\theta(\mathbf{x}_0)) \cdot \int q(\mathbf{x}_{1:T}|\mathbf{x}_0)d\mathbf{x}_{1:T} \\ &= -\int \log(p_\theta(\mathbf{x}_0)) \cdot q(\mathbf{x}_{1:T}|\mathbf{x}_0)d\mathbf{x}_{1:T} \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_0))\bigg] \end{aligned} $$

By Bayes' Theorem, [14] $$ \begin{aligned} &\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\bigg[-\log(p_\theta(\mathbf{x}_0))\bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[- \log\left({p_\theta(\mathbf{x}_{0:T}) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)}\cdot{q(\mathbf{x}_{1:T}|\mathbf{x}_0) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right) \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)} - \log{q(\mathbf{x}_{1:T}|\mathbf{x}_0) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] + \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[ -\log{q(\mathbf{x}_{1:T}|\mathbf{x}_0) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] \end{aligned} $$ The second term of the last line has the form of a KL Divergence: $$ \begin{aligned} \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[ -\log{q(\mathbf{x}_{1:T}|\mathbf{x}_0) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] &= -\int\log\left({q(\mathbf{x}_{1:T}|\mathbf{x}_0) \over p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)}\right) \cdot q(\mathbf{x}_{1:T}|\mathbf{x}_0)d\mathbf{x}_{1:T} \\ &= -D_{KL}(~q(\mathbf{x}_{1:T}|\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{1:T}|\mathbf{x}_0)~) \le 0 \end{aligned} $$ [15] Therefore $$ -\log(p_\theta(\mathbf{x}_0)) \le \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] $$ As in VAE, the technique is to minimize an upper bound on the quantity of interest so that the quantity itself is pushed down too. From here on, instead of $-\log(p_\theta(\mathbf{x}_0))$ we develop the expressions so as to **minimize** $\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg]$.

The quantities $p_\theta(\mathbf{x}_{0:T})$ and $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$ inside the $\log$ were derived earlier; see the places where $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ and $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ were first defined. Hence $$ \begin{aligned} &\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_{0:T}) \over q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log{p_\theta(\mathbf{x}_T) \cdot \prod_{t=1}^Tp_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over \prod_{t=1}^Tq(\mathbf{x}_t|\mathbf{x}_{t-1})} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \log\left(\prod_{t=1}^T{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_t|\mathbf{x}_{t-1})}\right) \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=1}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_t|\mathbf{x}_{t-1})} \bigg] \end{aligned} $$

With the right manipulation, the last expression looks as if it could be turned into a KL Divergence. In the end $\mu_\theta$ and $\Sigma_\theta$ will be trained to resemble some distribution (presumably a Gaussian Distribution). That target cannot be $q(\mathbf{x}_t|\mathbf{x}_{t-1})$, though, because $q$ is the process that adds noise.

This is where a technical manipulation comes in: we condition $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ on $\mathbf{x}_0$ and then apply Bayes' Theorem.

By the Markov Property [16] $$ q(\mathbf{x}_t|\mathbf{x}_{t-1}) = q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0) $$ so, splitting off the $t=1$ term, $$ \begin{aligned} &\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=1}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_t|\mathbf{x}_{t-1})} \bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0)} - \log{p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)}\bigg] \end{aligned} $$

and, by Bayes' Theorem, [17] $$ \begin{aligned} &\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0)} - \log{p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)}\bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=2}^T\log\bigg({p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)} \cdot {q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} \bigg) - \log{p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)}\bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)} - \sum_{t=2}^T \log{q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} - \log{p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)}\bigg] \end{aligned} $$ Here the second sum telescopes, $$ \sum_{t=2}^T \log{q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} = \log \bigg(\prod_{t=2}^T {q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} \bigg) = \log {q(\mathbf{x}_1|\mathbf{x}_0) \over q(\mathbf{x}_T|\mathbf{x}_0)} $$ so the expectation becomes $$ \begin{aligned} &\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_T)) - \log {q(\mathbf{x}_1|\mathbf{x}_0) \over q(\mathbf{x}_T|\mathbf{x}_0)} - \log{p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)} -\sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)}\bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log\bigg(p_\theta(\mathbf{x}_T) \cdot {q(\mathbf{x}_1|\mathbf{x}_0) \over q(\mathbf{x}_T|\mathbf{x}_0)} \cdot {p_\theta(\mathbf{x}_0|\mathbf{x}_1) \over q(\mathbf{x}_1|\mathbf{x}_0)}\bigg) - \sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)}\bigg] \\ &= \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log {p_\theta(\mathbf{x}_T) \over q(\mathbf{x}_T|\mathbf{x}_0)} - \log( p_\theta(\mathbf{x}_0|\mathbf{x}_1)) - \sum_{t=2}^T\log{p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t) \over q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)}\bigg] \\ &= \fcolorbox{red}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)}\bigg]$} + \fcolorbox{green}{white}{$\sum_{t=2}^T \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg]$} \\ & \kern{146pt} + \fcolorbox{blue}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \bigg]$} \end{aligned} $$

By [18], each term of the last expression expands as follows.

$$ \begin{aligned} L \coloneqq~& \fcolorbox{red}{white}{$D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)||p_\theta(\mathbf{x}_T))$} + \fcolorbox{green}{white}{$\sum_{t=2}^T D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$} \\ & \fcolorbox{blue}{white}{$- \log(p_\theta(\mathbf{x}_0|\mathbf{x}_1))$} \end{aligned} $$

지금까지 살펴본 바로, 위의 정리된 식을 Loss function으로 만들어야 할 듯 하다. 하지만 $L$의 각 항의 의미를 분석하면 더 간단한 Loss function으로 만들 수 있다.

$$L_T \coloneqq \fcolorbox{red}{white}{$D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)~||~p_\theta(\mathbf{x}_T))$}$$
- This measures how different the two distributions $q(\mathbf{x}_T|\mathbf{x}_0)$ and $p_\theta(\mathbf{x}_T)$ are. Although $p_\theta(\mathbf{x}_T)$ is written with $\theta$, it is in fact the fixed distribution $p_\theta(\mathbf{x}_T) = N(\mathbf{x}_T;0, I)$, so it has no parameters to learn. It is therefore ignored during training.
- In addition, $q(\mathbf{x}_T|\mathbf{x}_0)$ is in practice very close to $N(\mathbf{x}_T;0, I)$, so the KL Divergence is close to 0.
$$L_{t-1} \coloneqq D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)), \\ \kern{30pt} \fcolorbox{green}{white}{$\sum_{t=2}^T D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$}$$
- This measures how different the two distributions $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ and $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ are. It tells us which distribution the $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ we are after should resemble, so **we will train $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ to resemble $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$.**
$$L_0 \coloneqq \fcolorbox{blue}{white}{$- \log(p_\theta(\mathbf{x}_0|\mathbf{x}_1))$}$$
- This is the final denoising step and relates to the likelihood of sampling $\mathbf{x}_0$ from the latent variable $\mathbf{x}_1$. It is also a training target, and it will need to be trained differently from $L_{t-1}$. [19]

First, let us minimize $L_{t-1} = D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$. To do so we need $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$, the distribution that $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ is supposed to approximate.

Working with $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ turns out to be remarkably convenient. As derived shortly, **$q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ can be written as a single probability distribution, and that distribution is a Gaussian Distribution.** Another big advantage is that, looking at the overall Loss, there is almost nothing to consider besides $L_{t-1}$. In other words, the target that $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ has to match is $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$.

Let us derive $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$. One thing to keep in mind: $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ is a probability distribution over $\mathbf{x}_{t-1}$, so the expression must be arranged in terms of $\mathbf{x}_{t-1}$. By Bayes' Theorem and the Markov Property,

$$ Q \coloneqq q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) = {q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0) \cdot q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} = {q(\mathbf{x}_t|\mathbf{x}_{t-1}) \cdot q(\mathbf{x}_{t-1}|\mathbf{x}_0) \over q(\mathbf{x}_t|\mathbf{x}_0)} $$ where $$ \begin{aligned} q(\mathbf{x}_t|\mathbf{x}_{t-1}) &= N(\mathbf{x}_t;\sqrt{1-\beta_t}\mathbf{x}_{t-1}, \beta_tI) \\
q(\mathbf{x}_{t-1}|\mathbf{x}_0) &= N(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I) \\
q(\mathbf{x}_t|\mathbf{x}_0) &= N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I) \end{aligned} $$ are each the probability density function of a Multivariate Gaussian Distribution [20]. Hence

$$ \begin{aligned} Q &= {N(\mathbf{x}_t;\sqrt{1-\beta_t}\mathbf{x}_{t-1},\beta_tI) \cdot N(\mathbf{x}_{t-1};\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0, (1-\bar{\alpha}_{t-1})I) \over N(\mathbf{x}_t;\sqrt{\bar{\alpha}_t}\mathbf{x}_0, (1-\bar{\alpha}_t)I)} \\ \\
&= { {e^{-{1 \over 2} \big( (\mathbf{x}_t-\sqrt{1-\beta_t}\mathbf{x}_{t-1})^T {1 \over \beta_t} I (\mathbf{x}_t - \sqrt{1-\beta_t}\mathbf{x}_{t-1}) \big)} \over \sqrt{(2\pi\beta_t)^D}} \cdot {e^{-{1 \over 2} \big( (\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^T {1 \over 1-\bar{\alpha}_{t-1}} I (\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0) \big)} \over \sqrt{(2\pi(1-\bar{\alpha}_{t-1}))^D}} \over {e^{-{1 \over 2} \big( (\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^T {1 \over 1-\bar{\alpha}_t} I (\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0) \big)} \over \sqrt{(2\pi(1-\bar{\alpha}_t))^D}} } \\ \\
&= { e^{-{ (\mathbf{x}_t-\sqrt{1-\beta_t}\mathbf{x}_{t-1})^T (\mathbf{x}_t-\sqrt{1-\beta_t}\mathbf{x}_{t-1}) \over 2\beta_t } -{ (\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0)^T (\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0) \over 2(1-\bar{\alpha}_{t-1}) } +{ (\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0)^T (\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0) \over 2(1-\bar{\alpha}_t) } } \over \sqrt{\bigg( 2\pi \cdot {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t} \bigg)^D }} \end{aligned} $$ The expression is unwieldy, so let us introduce some shorthand: $$ K \coloneqq {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t},~\mathbf{x}^2 \coloneqq \mathbf{x}^T\mathbf{x} $$ [21] Then $$ Q = {e^{ -{ {1-\bar{\alpha}_{t-1} \over 1-\bar{\alpha}_t} (1-\beta_t) \big(\mathbf{x}_{t-1}-{\mathbf{x}_t \over \sqrt{1-\beta_t}} \big)^2 + {\beta_t \over 1-\bar{\alpha}_t} \big(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 \big)^2 - {\beta_t(1-\bar{\alpha}_{t-1}) \over (1-\bar{\alpha}_t)^2} \big(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \big)^2 \over 2K} } \over \sqrt{(2\pi K)^D}} $$ We will focus on the numerator of the exponent, so call it $U$. For convenience, also define $$ m = {(1-\bar{\alpha}_{t-1})(1-\beta_t) \over 1-\bar{\alpha}_t},~ n = {\beta_t \over 1-\bar{\alpha}_t} $$ so that $$ \begin{aligned} U &= {1-\bar{\alpha}_{t-1} \over 1-\bar{\alpha}_t} (1-\beta_t) \bigg(\mathbf{x}_{t-1}-{\mathbf{x}_t \over \sqrt{1-\beta_t}} \bigg)^2 + {\beta_t \over 1-\bar{\alpha}_t} \bigg(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 \bigg)^2 \\ & \kern{164pt} - {\beta_t(1-\bar{\alpha}_{t-1}) \over (1-\bar{\alpha}_t)^2} \bigg(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \bigg)^2 \\
&= m \bigg(\mathbf{x}_{t-1}-{\mathbf{x}_t \over \sqrt{1-\beta_t}} \bigg)^2 + n \bigg(\mathbf{x}_{t-1}-\sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 \bigg)^2 \\ & \kern{164pt} - {\beta_t(1-\bar{\alpha}_{t-1}) \over (1-\bar{\alpha}_t)^2} \bigg(\mathbf{x}_t-\sqrt{\bar{\alpha}_t}\mathbf{x}_0 \bigg)^2 \end{aligned} $$ In $U$, $m$ and $n$ are the coefficients of $\mathbf{x}_{t-1}^2$, and conveniently $m+n=1$. Completing the square in $\mathbf{x}_{t-1}$ gives $$ U = \bigg( \mathbf{x}_{t-1}- \bigg( {\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t} \mathbf{x}_t + {\sqrt{\bar{\alpha}_{t-1}}\beta_t \over 1-\bar{\alpha}_t} \mathbf{x}_0 \bigg) \bigg)^2 $$ The expansion is omitted because it is not needed right now. The remaining terms in $\mathbf{x}_t^2$, $\mathbf{x}_0^2$ and $\mathbf{x}_t \cdot \mathbf{x}_0$, which do not involve $\mathbf{x}_{t-1}$, cancel exactly once $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \alpha_t\bar{\alpha}_{t-1}$ are substituted, so no approximation is required at this step. Therefore $$ \begin{aligned} Q &= q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) = {e^{-{U \over 2K}} \over \sqrt{(2\pi K)^D}} \\ \\
&= { e^{- {\big( \mathbf{x}_{t-1}- \big( {\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t} \mathbf{x}_t + {\sqrt{\bar{\alpha}_{t-1}}\beta_t \over 1-\bar{\alpha}_t} \mathbf{x}_0 \big) \big)^2 \over 2\cdot{\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}} } \over \sqrt{\bigg(2\pi{\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}\bigg)^D} } \\ \\
&= { e^{-{1 \over 2} \big( (\mathbf{x}_{t-1}-\bar{\mu})^T \bar{\Sigma}^{-1} (\mathbf{x}_{t-1}-\bar{\mu}) \big) } \over \sqrt{(2\pi)^D |\bar{\Sigma}|} } \\
&= N(\mathbf{x}_{t-1};\bar{\mu},\bar{\Sigma}) \\ &\text{where}~ \fcolorbox{red}{white}{$\bar{\mu} = {\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t} \mathbf{x}_t + {\sqrt{\bar{\alpha}_{t-1}}\beta_t \over 1-\bar{\alpha}_t} \mathbf{x}_0$},~~\fcolorbox{blue}{white}{$\bar{\Sigma}= {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}I$} \end{aligned} $$ In the end, $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ is a Gaussian Distribution whose mean $(=\bar{\mu})$ and covariance $(=\bar{\Sigma})$ are known, and this is exactly what $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ has to resemble.
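The closed form above can be sanity-checked numerically. The following is a minimal one-dimensional sketch, not part of the original post: it assumes the linear $\beta_t$ schedule of [5] with $T=1000$, and the helper names (`gauss_pdf`, `posterior_params`, `bayes_ratio`) are made up for illustration. It evaluates the Bayes ratio of the three Gaussian densities directly and compares it with $N(\mathbf{x}_{t-1};\bar{\mu},\bar{\Sigma})$.

```python
import math
import random

def gauss_pdf(x, mean, var):
    """Density of the 1-D Gaussian N(mean, var) at x."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

# Assumed linear schedule: beta_1 = 1e-4, beta_T = 0.02, T = 1000.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alphas = [1.0 - b for b in betas]
alpha_bars, running = [], 1.0
for a in alphas:
    running *= a
    alpha_bars.append(running)

def posterior_params(t, x_t, x_0):
    """Closed-form mean and variance of q(x_{t-1} | x_t, x_0); t is 1-indexed, t >= 2."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    mean = (math.sqrt(alpha_t) * (1 - ab_prev) * x_t
            + math.sqrt(ab_prev) * beta_t * x_0) / (1 - ab_t)
    var = beta_t * (1 - ab_prev) / (1 - ab_t)
    return mean, var

def bayes_ratio(t, x_prev, x_t, x_0):
    """q(x_t | x_{t-1}) * q(x_{t-1} | x_0) / q(x_t | x_0), evaluated directly."""
    beta_t, alpha_t = betas[t - 1], alphas[t - 1]
    ab_t, ab_prev = alpha_bars[t - 1], alpha_bars[t - 2]
    num = (gauss_pdf(x_t, math.sqrt(alpha_t) * x_prev, beta_t)
           * gauss_pdf(x_prev, math.sqrt(ab_prev) * x_0, 1 - ab_prev))
    return num / gauss_pdf(x_t, math.sqrt(ab_t) * x_0, 1 - ab_t)

random.seed(0)
max_rel_err = 0.0
for t in (2, 10, 500, 1000):
    x_0 = random.uniform(-1.0, 1.0)
    ab_t = alpha_bars[t - 1]
    x_t = math.sqrt(ab_t) * x_0 + math.sqrt(1 - ab_t) * random.gauss(0.0, 1.0)
    mean, var = posterior_params(t, x_t, x_0)
    x_prev = mean + math.sqrt(var) * random.gauss(0.0, 1.0)
    closed = gauss_pdf(x_prev, mean, var)
    direct = bayes_ratio(t, x_prev, x_t, x_0)
    max_rel_err = max(max_rel_err, abs(direct - closed) / closed)

print("largest relative difference:", max_rel_err)
```

In this sketch the two evaluations differ only at the level of floating-point rounding, for small and large $t$ alike.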

Let us recap. We built the Loss so that $-\log(p_\theta(\mathbf{x}_0))$ decreases, and to compute $L_{t-1}$, one of the three terms of the Loss, we derived $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$. Since $L_{t-1}$ was originally a KL Divergence, let us now compute the KL Divergence between $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ and $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.

Since $\bar{\Sigma}$ is a fixed covariance, it is reasonable to fix $\Sigma_\theta(\mathbf{x}_t, t)$ as well. For a constant $\sigma_t$, let $$ \Sigma_\theta(\mathbf{x}_t, t) = \sigma_t^2I = {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}I $$ One more observation: for large $t$, the $\bar{\alpha}_t$ appearing in $\bar{\Sigma}$, the covariance of $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$, is close to 0. The same holds for $\bar{\alpha}_{t-1}$, so $1-\bar{\alpha}_t \approx 1-\bar{\alpha}_{t-1}$. That is, $$ \bar{\Sigma}= {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}I \approx \beta_t I $$ and therefore we may take $$ \sigma_t^2I = \beta_tI $$

This simplifies the computation, and the paper reports that the two choices showed no large difference in performance in its experiments.
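To see how close the two variance choices are, the ratio $\bar{\Sigma}/\beta_t$ can be tabulated for a few time steps. This is a minimal sketch, not from the original post: it assumes the linear schedule of [5] with $T=1000$ and uses a hypothetical helper `posterior_variance`.

```python
# Assumed linear schedule: beta_1 = 1e-4, beta_T = 0.02, T = 1000.
T = 1000
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]
alpha_bars, running = [], 1.0
for b in betas:
    running *= 1.0 - b
    alpha_bars.append(running)

def posterior_variance(t):
    """beta_t * (1 - abar_{t-1}) / (1 - abar_t), with abar_0 = 1; t is 1-indexed."""
    ab_prev = alpha_bars[t - 2] if t >= 2 else 1.0
    return betas[t - 1] * (1.0 - ab_prev) / (1.0 - alpha_bars[t - 1])

ratios = {t: posterior_variance(t) / betas[t - 1] for t in (2, 10, 100, 1000)}
for t, r in ratios.items():
    print(f"t={t:4d}  posterior variance / beta_t = {r:.4f}")
```

Under this assumed schedule the ratio is essentially 1 at large $t$ but noticeably below 1 for the first few steps, so $\sigma_t^2 = \beta_t$ is best read as a convenient choice that matches $\bar{\Sigma}$ closely over most of the chain.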

Let us compute $D_{KL} (q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$. By the formula for the KL Divergence between two Multivariate Gaussian Distributions [22], with $\bar{\Sigma} = KI$, $K = {\beta_t(1-\bar{\alpha}_{t-1}) \over 1-\bar{\alpha}_t}$ and $\Sigma_\theta = \sigma_t^2I$, $$ \begin{aligned} D_{KL} (q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)) &= {(\mu_\theta - \bar{\mu})^2 \over 2\sigma_t^2} + {1 \over 2} \bigg( {DK \over \sigma_t^2} - D + D\ln {\sigma_t^2 \over K} \bigg) \\
&= {(\mu_\theta - \bar{\mu})^2 \over 2\sigma_t^2} + C, \end{aligned} $$ $$ \sum_{t=2}^T \bigg( D_{KL} (q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)) \bigg) \approx \sum_{t=2}^T {(\mu_\theta(\mathbf{x}_t,t) - \bar{\mu}(\mathbf{x}_t,\mathbf{x}_0))^2 \over 2\sigma_t^2} $$ The terms that do not depend on $\theta$, including the $\ln$ term, are collected into a constant $C$ in the paper and dropped in the second expression. The $\mu_\theta$ we are looking for is determined by what the previous image is $(=\mathbf{x}_t)$ and which step we are at $(=t)$. So $\bar{\mu}$, currently expressed through $\mathbf{x}_t$ and $\mathbf{x}_0$, needs to be rewritten. Applying the Reparameterization Trick to $q(\mathbf{x}_t|\mathbf{x}_0)$ gives $$ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\epsilon, \\ \mathbf{x}_0 = {1 \over \sqrt{\bar{\alpha}_t}} (\mathbf{x}_t - \sqrt{1-\bar{\alpha}_t}\epsilon) $$ so $\mathbf{x}_0$ can be expressed through $\mathbf{x}_t$, $t$ and $\epsilon$, and $$ \bar{\mu}(\mathbf{x}_t, t) = {1 \over \sqrt{\alpha_t}} \bigg( \mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon \bigg) $$ $\mu_\theta(\mathbf{x}_t, t)$ has to resemble $\bar{\mu}(\mathbf{x}_t, t)$, and since $\mathbf{x}_t$ and $t$ are assumed known, the only thing left to find is $\epsilon$. That is, we can write $$ \mu_\theta(\mathbf{x}_t, t) = {1 \over \sqrt{\alpha_t}} \bigg( \mathbf{x}_t - {\beta_t \over \sqrt{1-\bar{\alpha}_t}}\epsilon_\theta(\mathbf{x}_t, t) \bigg) $$ Having to find an $\epsilon$ that was sampled from $N(0, I)$ may sound odd; this is explained in detail below. Recomputing the KL Divergence with these results, $$ \begin{aligned} \sum_{t=2}^T& \bigg( D_{KL} (q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)) \bigg) \approx \sum_{t=2}^T {(\mu_\theta(\mathbf{x}_t,t) - \bar{\mu}(\mathbf{x}_t,\mathbf{x}_0))^2 \over 2\sigma_t^2} \\
&= \sum_{t=2}^T \bigg( {\beta_t^2 \over 2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)} (\epsilon - \epsilon_\theta(\mathbf{x}_t, t))^2 \bigg) \end{aligned} $$ According to the paper, experiments showed that designing the Loss function without the leading coefficient $(= {\beta_t^2 \over 2\sigma_t^2\alpha_t(1-\bar{\alpha}_t)})$ gave better results. That is, the paper defines $$ L_{simple}(\theta) \coloneqq \mathrm{E}_{t,\mathbf{x}_0,\epsilon} \bigg[ (\epsilon - \epsilon_\theta(\mathbf{x}_t, t))^2 \bigg] \\ \text{where}~~t \sim \{2, 3, \cdots, T\},~~ \mathbf{x}_0 \sim q(\mathbf{x}),~ \epsilon \sim N(0, I) $$ as the **Loss function for $L_{t-1}$**.

Now consider $L_0$. In [19], because of the special nature of $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$, $\mu_\theta(\mathbf{x}_1, 1)$ was set up separately. $\mu_\theta(\mathbf{x}_1, 1)$ must be trained so that the product of probabilities is maximized, and the resulting $\mu_\theta(\mathbf{x}_1, 1)$ should be close to $\mathbf{x}_0$. That is, $$ \mu_\theta(\mathbf{x}_1, 1) \stackrel{need}{\approx} \mathbf{x}_0 $$ Look again at the $L_{simple}$ defined above. The network is ultimately trained to predict $\epsilon$, but this is equivalent to training it to predict $\bar{\mu}$. That is, $$ \mu_\theta(\mathbf{x}_t, t) \stackrel{need}{\approx} \bar{\mu}(\mathbf{x}_t, t),~~\text{where}~~t=2,3,\cdots,T $$ The best case would be that the $t=1$ case of $L_{simple}$ can adequately stand in for the decoder of [19]. If so, we do not need a separate Loss for $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$ and can simply add the $t=1$ case to $L_{simple}$.

For $t=1$, $\bar{\mu}(\mathbf{x}_t, t)$ is $$ \bar{\mu}(\mathbf{x}_1, 1) = {1 \over \sqrt{\alpha_1}} (\mathbf{x}_1 - \sqrt{\beta_1}\epsilon) $$ and for $t=1$, $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is $$ q(\mathbf{x}_1|\mathbf{x}_0) = N(\mathbf{x}_1;\sqrt{1-\beta_1}\mathbf{x}_0, \beta_1I) $$ Applying the Reparameterization Trick to it gives $$ \mathbf{x}_1 = \sqrt{1-\beta_1}\mathbf{x}_0 + \sqrt{\beta_1}\epsilon,~\mathbf{x}_0 = {1 \over \sqrt{1-\beta_1}}(\mathbf{x}_1-\sqrt{\beta_1}\epsilon) $$ Since $\alpha_1 = 1-\beta_1$, this means $$ \bar{\mu}(\mathbf{x}_1, 1) = \mathbf{x}_0 $$ so **$L_{simple}$, which so far held only for $t=2, \cdots, T$, describes $L_0$ well enough when $t=1$.** We can therefore conclude that **minimizing $-\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1))$ $(=L_0)$ amounts to minimizing the $t=1$ case of $L_{simple}$.** $L_0,L_1,\cdots,L_{T-1}$ correspond to the cases $t=1,2,\cdots,T$ of $L_{simple}(\theta)$ respectively, and our final Loss Function is defined as follows. $$ \fcolorbox{red}{white}{$L_{simple}(\theta) \coloneqq \mathrm{E}_{t,\mathbf{x}_0,\epsilon} \bigg[ (\epsilon - \epsilon_\theta(\mathbf{x}_t, t))^2 \bigg]$} \\ \text{where}~~t \sim \{1, 2, \cdots, T\},~~ \mathbf{x}_0 \sim q(\mathbf{x}),~ \epsilon \sim N(0, I) $$
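The identity $\bar{\mu}(\mathbf{x}_1, 1) = \mathbf{x}_0$ is easy to confirm with numbers. A small scalar sketch, assuming $\beta_1 = 10^{-4}$ as in [5]; the variable names are illustrative only:

```python
import math
import random

random.seed(0)
beta_1 = 1e-4
alpha_1 = 1.0 - beta_1          # for t = 1, alpha_bar_1 equals alpha_1

x_0 = random.uniform(-1.0, 1.0)  # one "pixel" of the clean image
eps = random.gauss(0.0, 1.0)     # the noise that was actually added

# Forward step q(x_1 | x_0) written with the Reparameterization Trick.
x_1 = math.sqrt(alpha_1) * x_0 + math.sqrt(beta_1) * eps

# mu_bar(x_t, t) = (x_t - beta_t / sqrt(1 - alpha_bar_t) * eps) / sqrt(alpha_t), at t = 1.
mu_bar_1 = (x_1 - beta_1 / math.sqrt(1.0 - alpha_1) * eps) / math.sqrt(alpha_1)

print(x_0, mu_bar_1)
```

If the network recovered $\epsilon$ perfectly at $t=1$, the predicted mean would coincide with $\mathbf{x}_0$ up to rounding, which is what the argument above relies on.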

Supplement

Predicting $\epsilon$ may sound a little strange. $\epsilon$ is a vector sampled from the Isotropic Gaussian Distribution $N(0, I)$, so one may wonder whether predicting it means anything. (I certainly did.) To state the conclusion first: we predict $\epsilon$ (= the noise) in the sense of extracting, from a noisy image $\mathbf{x}_t$, the noise that $\mathbf{x}_t$ contains.

The first thing to remember is that we want a $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ that removes noise well. $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ has to resemble $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$, so let us pin down exactly what $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ does. Given $\mathbf{x}_t$ and $\mathbf{x}_0$ (and since $\mathbf{x}_0$ can be expressed through $\mathbf{x}_t$, $t$ and $\epsilon$, equivalently given $\mathbf{x}_t$, $t$ and $\epsilon$), it tells us (the distribution of) $\mathbf{x}_{t-1}$. So $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$ can likewise be seen as finding $\mathbf{x}_{t-1}$ from $\mathbf{x}_t$, $t$ and $\epsilon$. $\epsilon$ appeared because $\mathbf{x}_0$ was rewritten in terms of $\mathbf{x}_t$, and so $\epsilon$ is the noise contained in $\mathbf{x}_t$. **More precisely, $\mathbf{x}_{t-1}$ can be determined when $\mathbf{x}_t$, $t$ and the noise in $\mathbf{x}_t$ are known.** Although $\epsilon$ was drawn at random from $N(0, I)$, in the actual Denoising Process the noise that $\mathbf{x}_t$ carries is already fixed, and only by recovering it can we obtain $\mathbf{x}_{t-1}$.

This is even easier to understand by thinking about the Sampling (= Inference) stage. During Sampling we are given only the noisy image $\mathbf{x}_t$ and the time step $t$; we cannot know which image it came from or which noise was added. Yet according to the equations developed so far, we need to know the noise $(=\epsilon)$ in $\mathbf{x}_t$ in order to obtain $\mathbf{x}_{t-1}$.

Hence we need a network that recovers the noise $\epsilon$ from the noisy image $\mathbf{x}_t$. We prepare an arbitrary noise $\epsilon$, deliberately add it to the original image $\mathbf{x}_0$, and design a network that can extract that noise again. The setup is similar in structure to an AutoEncoder: the input noise is the target $(=\epsilon)$, and the Output $(=\epsilon_\theta)$ has to resemble it.

2-5. Algorithm

The actual training algorithm is as follows.

(1) Fetch training data $\mathbf{x}_0$,
(2) sample $t$ from $\{1, 2, \cdots, T\}$, where every value is equally likely,
(3) sample $\epsilon$ from $N(0, I)$,
(4) build $\mathbf{x}_t$ from (1), (2) and (3),
(5) produce $\epsilon_\theta$ from $\mathbf{x}_t$ and $t$, and
(6) train the network that produces $\epsilon_\theta$ so that $\epsilon_\theta$ moves closer to $\epsilon$.

In the Sampling (= Inference) stage, the trained $\epsilon_\theta$ yields $\mu_\theta$, and $\mu_\theta$ yields $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$, which drives the Denoising Process that unrolls $\mathbf{x}_T$ step by step into $\mathbf{x}_0$.
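The training steps listed above can be sketched in a few lines. This is a NumPy illustration under stated assumptions rather than the author's implementation: the schedule is the linear one from [5], `eps_model` is a hypothetical stand-in that always predicts zero (a real model would be a trained network), and no gradient update is performed; the function only evaluates $L_{simple}$ for one batch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed linear schedule: beta_1 = 1e-4, beta_T = 0.02, T = 1000.
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bars = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Hypothetical placeholder for the epsilon_theta network; always predicts zero noise."""
    return np.zeros_like(x_t)

def simple_loss(x0_batch):
    """Evaluate L_simple on one batch of clean images x_0."""
    batch = x0_batch.shape[0]
    # Sample t uniformly from {1, ..., T}, one value per image.
    t = rng.integers(1, T + 1, size=batch)
    # Sample the noise epsilon ~ N(0, I).
    eps = rng.standard_normal(x0_batch.shape)
    # Build x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps.
    ab = alpha_bars[t - 1].reshape(batch, *([1] * (x0_batch.ndim - 1)))
    x_t = np.sqrt(ab) * x0_batch + np.sqrt(1.0 - ab) * eps
    # Predict the noise from (x_t, t) and compare with the true noise.
    eps_pred = eps_model(x_t, t)
    return float(np.mean((eps - eps_pred) ** 2))

x0 = rng.uniform(-1.0, 1.0, size=(8, 1, 28, 28))  # fake images scaled to [-1, 1]
loss = simple_loss(x0)
print("L_simple with the zero predictor:", loss)
```

With the zero predictor the loss sits near 1, the variance of $\epsilon$; a trained network should push it well below that. In a real training loop this value would be backpropagated through the network by an autodiff framework.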

Training Architecture

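The training process is illustrated in the following figure. The theory and explanation were long, but the training procedure itself is simpler than one might expect. The L2 Loss in the figure is one possible choice; depending on the situation, Smooth L1 Loss and others can be used instead. As will be covered in the code analysis post, the network that produces $\epsilon_\theta$ is a Neural Network, and a UNet is used.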

Inference Architecture

The process of generating an image after training is complete is illustrated in the following figure. In the last step, which produces $\mathbf{x}_0$ from $\mathbf{x}_1$, no $z$ is sampled and $\mu_\theta$ is used directly as $\mathbf{x}_0$.
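The inference procedure can be written down as a short loop. The sketch below uses NumPy and makes several assumptions not stated in the post: the function name `sample`, the choice $\sigma_t^2 = \beta_t$, and a dummy noise predictor used only to exercise the loop. It mirrors the formula for $\mu_\theta(\mathbf{x}_t, t)$ and skips the added noise at the last step, as described above.

```python
import numpy as np

def sample(eps_model, shape, betas, rng):
    """Run the reverse (denoising) chain x_T -> x_0 with sigma_t^2 = beta_t."""
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    T = len(betas)
    x = rng.standard_normal(shape)  # start from x_T ~ N(0, I)
    for t in range(T, 0, -1):
        eps_pred = eps_model(x, t)
        # mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - abar_t) * eps_theta) / sqrt(alpha_t)
        mean = (x - betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1]) * eps_pred) / np.sqrt(alphas[t - 1])
        if t > 1:
            z = rng.standard_normal(shape)
            x = mean + np.sqrt(betas[t - 1]) * z
        else:
            x = mean  # final step: no z is sampled, the mean becomes x_0
    return x

# Usage with a dummy predictor and a short chain, only to show the call pattern.
rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 50)
dummy_model = lambda x, t: np.zeros_like(x)
img = sample(dummy_model, (2, 1, 28, 28), betas, rng)
print(img.shape)
```

With a trained $\epsilon_\theta$ in place of `dummy_model` and the full $T$ used in training, the returned array would be the generated image batch in the $[-1, 1]$ scale.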

2-6. Furthermore

Later work includes Denoising Diffusion Implicit Models (DDIM), which improves on the current model, Diffusion Models Beat GANs on Image Synthesis, and Tackling the Generative Learning Trilemma with Denoising Diffusion GANs, which combines diffusion with GANs.

3. Endnotes

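[1] Consider a color image of size $512{\times}512{\times}3$. Each pixel value can take 256 values, so one image is one of $256^{512{\times}512{\times}3}$ possibilities. Among this astronomically large number of images, only a tiny fraction look like real photographs to us, and a photo that looks natural can be viewed as a deliberate arrangement of pixels.

This can be checked experimentally: if every one of the $512{\times}512{\times}3$ pixel values (an integer from 0 to 255) is chosen at random, the result will generally be a noise image. (Strictly speaking, independent uniform pixel values give uniform noise; the Gaussian noise used throughout this post is defined in [4].)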

[2] This definition is used unchanged in the reverse process. That is, the original image is $\mathbf{x}_0$ and the final noise image is $\mathbf{x}_T$.

One thing to be aware of is that every $\mathbf{x}_i$ has the same resolution (= number of pixels).

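[3] This property is called the Markov Property: the current state is determined only by the immediately preceding state. A sequence of random variables with the Markov Property is called a Markov Chain.

Writing $s_t$ for the state at time $t$, a Markov Chain $S_1$, $S_2$, $\cdots$ has the following property. $$ P(S_t = s_t|S_{t-1} = s_{t-1}, S_{t-2} = s_{t-2},~\cdots, S_1 = s_1) = P(S_t = s_t~|~S_{t-1} = s_{t-1}) $$ In other words, states earlier than the immediately preceding one are ignored in the condition.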

[4] Noise that follows a Gaussian Distribution is called Gaussian noise. The noise has the same resolution (= number of pixels) as the image, so it follows a Multivariate Gaussian Distribution with as many dimensions as the image has pixels.

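[5] We will, however, use the $\beta_t$ proposed in the paper: $\beta_1=10^{-4}$, $\beta_T=0.02$, with the value increasing gradually in between.

How it increases depends on the schedule. The authors used a linear schedule, in which the value increases at a constant rate, but later work reported that a cosine schedule performs better.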
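As a concrete illustration of the linear schedule, the snippet below builds $\beta_1,\cdots,\beta_T$ and the cumulative product $\bar{\alpha}_T$. It is a minimal sketch that assumes $T=1000$, a value not stated in this note; the variable names are illustrative.

```python
T = 1000                      # assumed number of diffusion steps
beta_start, beta_end = 1e-4, 0.02

# Linear schedule: equally spaced values from beta_1 to beta_T.
betas = [beta_start + (beta_end - beta_start) * i / (T - 1) for i in range(T)]

# alpha_bar_T = product over t of (1 - beta_t).
alpha_bar_T = 1.0
for b in betas:
    alpha_bar_T *= 1.0 - b

print("beta_1 =", betas[0], " beta_T =", betas[-1])
print("alpha_bar_T =", alpha_bar_T)
```

Under these assumptions $\bar{\alpha}_T$ comes out far below $10^{-4}$, which is why $q(\mathbf{x}_T|\mathbf{x}_0)$ is treated as practically $N(0, I)$.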

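[6] Knowing $q(\mathbf{x})$ means already knowing the distribution of the real Training data. Generating images similar to the Training data would then just be a matter of Sampling from that distribution, and there would be no need to design a Diffusion Model at all.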

[7] Suppose a random variable $X$ follows the normal distribution $N(\mu, \sigma^2)$. A sample $z$ from this distribution can be written as $$ z = \mu + \sigma\epsilon,~~\epsilon\sim N(0, 1) $$ The Reparameterization Trick separates out the variable that carries the randomness $(\epsilon)$ and rewrites the expression so that $z$ is differentiable with respect to $\mu$ and $\sigma$.

Let us apply the Reparameterization Trick to $\mathbf{x}_t$ sampled from the Multivariate Gaussian Distribution $N(\sqrt{1-\beta_t}\mathbf{x}_{t-1},~\beta_tI)$. For the method, see the linked reference.

Find $A$ such that $AA^{\mathrm{T}}=\beta_tI$. It is easy to see that $A=\sqrt{\beta_t}I$ works. Therefore, $$ \begin{aligned} \mathbf{x}_t &= \sqrt{1-\beta_t}\mathbf{x}_{t-1} + A\epsilon \\ &= \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}I\epsilon \\ &= \sqrt{1-\beta_t}\mathbf{x}_{t-1} + \sqrt{\beta_t}\epsilon,~~\text{where}~~ \epsilon\sim N(0, I) \end{aligned} $$ According to that reference, $\epsilon$ should be formed by drawing as many independent values from the standard normal distribution $N(0, 1)$ as there are dimensions and stacking them into one vector, but this is the same as Sampling a single vector from $N(0, I)$.
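The reparameterized forward step can be checked empirically. The sketch below is an illustration with assumed values ($\beta_t = 0.02$, a constant $\mathbf{x}_{t-1}$ of 0.5 in every coordinate) and a hypothetical helper `forward_step`; it draws $\epsilon$ coordinate by coordinate from $N(0, 1)$, as described above.

```python
import math
import random

def forward_step(x_prev, beta_t, rng):
    """x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps, with eps ~ N(0, I)."""
    scale = math.sqrt(1.0 - beta_t)
    noise_std = math.sqrt(beta_t)
    return [scale * v + noise_std * rng.gauss(0.0, 1.0) for v in x_prev]

rng = random.Random(0)
beta_t = 0.02
x_prev = [0.5] * 20000            # a long vector so the statistics are stable
x_t = forward_step(x_prev, beta_t, rng)

mean = sum(x_t) / len(x_t)
var = sum((v - mean) ** 2 for v in x_t) / len(x_t)
print("empirical mean:", mean, " expected:", math.sqrt(1.0 - beta_t) * 0.5)
print("empirical variance:", var, " expected:", beta_t)
```

The empirical mean and variance land close to $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$ and $\beta_t$, consistent with $N(\sqrt{1-\beta_t}\mathbf{x}_{t-1},~\beta_tI)$.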

[8] Suppose two independent random variables $X$ and $Y$ each follow a Gaussian Distribution. $$ X \sim N(\mu_x, \Sigma_x),~~Y \sim N(\mu_y, \Sigma_y) $$ Then their sum also follows a normal distribution. $$ X+Y \sim N(\mu_x+\mu_y, \Sigma_x+\Sigma_y) $$

Therefore, when $$ \sqrt{1-\alpha_t}\epsilon_t \sim N(0, (1-\alpha_t)I), \\ \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1} \sim N(0, \alpha_t(1-\alpha_{t-1})I) $$ the sum $\sqrt{1-\alpha_t}\epsilon_t + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1}$ follows the Gaussian Distribution $$ \sqrt{1-\alpha_t}\epsilon_t + \sqrt{\alpha_t}\sqrt{1-\alpha_{t-1}}\epsilon_{t-1} \sim N(0, (1-\alpha_t\alpha_{t-1})I) $$ since $(1-\alpha_t) + \alpha_t(1-\alpha_{t-1}) = 1-\alpha_t\alpha_{t-1}$.

[9] The reason $\mathbf{x}_{t-1}$ is multiplied by $\sqrt{1-\beta_t}$ is to keep the covariance of $\mathbf{x}_t$ equal to $I$.

Consider $t=T$. For sufficiently large $T$, $\mathbf{x}_T \sim N(0, I)$, so the covariance of $\mathbf{x}_T$ is $Cov(\mathbf{x}_T)=I$. Since $\mathbf{x}_T = \sqrt{1-\beta_T}\mathbf{x}_{T-1}+\sqrt{\beta_T}\epsilon$ with $\epsilon$ independent of $\mathbf{x}_{T-1}$, $$ \begin{aligned} Cov(\mathbf{x}_T) &= Cov(\sqrt{1-\beta_T}\mathbf{x}_{T-1}+\sqrt{\beta_T}\epsilon)\\ &= (1-\beta_T)Cov(\mathbf{x}_{T-1})+\beta_TCov(\epsilon) \\ &= (1-\beta_T)Cov(\mathbf{x}_{T-1})+\beta_TI \\ &= I \end{aligned} $$ and therefore $Cov(\mathbf{x}_{T-1})=I$.

By the same argument, every latent variable keeps covariance $I$. The factor $\sqrt{1-\beta_t}$ was inserted deliberately for this purpose.

[10] Proof to be added later.

[11] By the Markov Property, the two parameters $\mu_\theta$ and $\Sigma_\theta$ clearly depend on $\mathbf{x}_t$. In addition, $t$ determines $\beta_t$, $\beta_t$ is tied to $q(\mathbf{x}_t|\mathbf{x}_{t-1})$, and $q(\mathbf{x}_t|\mathbf{x}_{t-1})$ is tied to $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$, so they can be said to depend on $t$ as well.

To make this explicit, they are written as $\mu_\theta(\mathbf{x}_t, t)$ and $\Sigma_\theta(\mathbf{x}_t, t)$.

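[12] A logarithm with base greater than 1 is an increasing function, so for any positive-valued function $f(x)$, $f(x)$ increases or decreases exactly when $\log(f(x))$ does.

Since we will later train with gradient descent so that a certain value (= the Loss) goes down, we use a quantity that we want to decrease without changing our objective.

[13] $q(\mathbf{x}_{1:T}|\mathbf{x}_0)$ is a probability density function, so $\int q(\mathbf{x}_{1:T}|\mathbf{x}_0)d\mathbf{x}_{1:T}=1$.

[14] By Bayes' Theorem, $$ p_\theta(\mathbf{x}_0) = {p_\theta(\mathbf{x}_0,~\mathbf{x}_1,~\cdots,~\mathbf{x}_T) \over p_\theta(\mathbf{x}_1,~\cdots,~\mathbf{x}_T~|~\mathbf{x}_0)} $$ To avoid confusion: this is used here as a statement about conditional probability, not for parameter estimation.

[15] The KL Divergence is always greater than or equal to 0. The larger its value, the greater the difference between the two probability distributions being compared.

[16] The Markov Property says that nothing other than the immediately preceding state has any influence. That is, $$ q(\mathbf{x}_t|\mathbf{x}_{t-1}) = q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0) $$ This is not specific to $\mathbf{x}_0$; $\mathbf{x}_0$ was chosen because it makes the later computation with Bayes' Theorem easier.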

[17] $$ \begin{aligned} q(\mathbf{x}_t|\mathbf{x}_{t-1},~\mathbf{x}_0) &= {q(\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0) \over q(\mathbf{x}_{t-1},~\mathbf{x}_0)} \\
&= {q(\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0) \over q(\mathbf{x}_{t-1},~\mathbf{x}_0)} \cdot {q(\mathbf{x}_t,~\mathbf{x}_0) \over q(\mathbf{x}_t,~\mathbf{x}_0)} \cdot {q(\mathbf{x}_0) \over q(\mathbf{x}_0)}\\
&= {q(\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0) \over q(\mathbf{x}_t,~\mathbf{x}_0)} \cdot {q(\mathbf{x}_t,~\mathbf{x}_0) \over q(\mathbf{x}_0)} \cdot {q(\mathbf{x}_0) \over q(\mathbf{x}_{t-1},~\mathbf{x}_0)} \\
&= {q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \cdot q(\mathbf{x}_t|\mathbf{x}_0) \over q(\mathbf{x}_{t-1}|\mathbf{x}_0)} \end{aligned} $$

The resulting expression is the ordinary Bayes' Theorem, except that every $q$ is additionally conditioned on $\mathbf{x}_0$.

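[18] We will use properties of multiple integrals to transform each term of the expression. Keep the following properties in mind.

- $\int (\cdots)~ d\mathbf{x}_{1:T} = \int\int \cdots \int (\cdots)~ d\mathbf{x}_1 d\mathbf{x}_2 \cdots d\mathbf{x}_T$
- The integral of a probability distribution over its whole domain is 1.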

$$ \begin{aligned}
\bold{\big(1\big)}~~\fcolorbox{red}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)}\bigg]$}
= \int \bigg( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)} \bigg) d\mathbf{x}_{1:T}
\end{aligned} $$ Here, $$ q(\mathbf{x}_{1:T}|\mathbf{x}_0) = {q(\mathbf{x}_T,~\mathbf{x}_0) \over q(\mathbf{x}_0)} \cdot {q(\mathbf{x}_{0:T}) \over q(\mathbf{x}_T,~\mathbf{x}_0)} = q(\mathbf{x}_T|\mathbf{x}_0) \cdot q(\mathbf{x}_{1:T-1}|\mathbf{x}_T,~\mathbf{x}_0) $$ so $$ \begin{aligned}
\int \bigg( q(\mathbf{x}_{1:T}&|\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)} \bigg) d\mathbf{x}_{1:T} \\
&= \int \bigg( q(\mathbf{x}_T|\mathbf{x}_0) \cdot q(\mathbf{x}_{1:T-1}|\mathbf{x}_T,~\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)} \bigg) d\mathbf{x}_{1:T} \\
&= \int \bigg( q(\mathbf{x}_T|\mathbf{x}_0) \cdot \log {q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)} \cdot \int q(\mathbf{x}_{1:T-1}|\mathbf{x}_T,~\mathbf{x}_0)d\mathbf{x}_{1:T-1} \bigg) d\mathbf{x}_T \\
&= \int \bigg( q(\mathbf{x}_T|\mathbf{x}_0) \cdot \log {q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)} \bigg)~ d\mathbf{x}_T \\
&= D_{KL}(~q(\mathbf{x}_T|\mathbf{x}_0)~||~p_\theta(\mathbf{x}_T)~) \end{aligned} $$

$$ \fcolorbox{red}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_T|\mathbf{x}_0) \over p_\theta(\mathbf{x}_T)}\bigg] = D_{KL}(q(\mathbf{x}_T|\mathbf{x}_0)~||~p_\theta(\mathbf{x}_T))$} $$

$$ \begin{aligned} \bold{\big(2\big)}~~~ &\fcolorbox{green}{white}{$\sum_{t=2}^T \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg]$} \\
& \kern{85pt} = \sum_{t=2}^T \int \bigg( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg) d\mathbf{x}_{1:T} \end{aligned} $$ Here, $$ \begin{aligned} q(\mathbf{x}_{1:T}|&\mathbf{x}_0) \\
&= {q(\mathbf{x}_0,~\mathbf{x}_{t-1},~\mathbf{x}_t) \over q(\mathbf{x}_0,~\mathbf{x}_t)} \cdot {q(\mathbf{x}_{0:T}) \over q(\mathbf{x}_0,~\mathbf{x}_{t-1},~\mathbf{x}_t)} \cdot {q(\mathbf{x}_0,~\mathbf{x}_t) \over q(\mathbf{x}_0)} \\
&= q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \cdot q(\mathbf{x}_{1:t-2},~\mathbf{x}_{t+1:T}~|~\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0) \cdot q(\mathbf{x}_t|\mathbf{x}_0) \end{aligned} $$ so $$ \begin{aligned} \sum_{t=2}^T& \int \bigg( q(\mathbf{x}_{1:T}|\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg) d\mathbf{x}_{1:T} \\
&= \sum_{t=2}^T \int \bigg( q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \cdot q(\mathbf{x}_{1:t-2},~\mathbf{x}_{t+1:T}~|~\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0) \\ & \kern{110pt} \cdot q(\mathbf{x}_t|\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg) d\mathbf{x}_{1:T} \\
&= \sum_{t=2}^T \int q(\mathbf{x}_t|\mathbf{x}_0) \bigg( \int q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \\ & \kern{30pt} \cdot \bigg( \int q(\mathbf{x}_{1:t-2},~\mathbf{x}_{t+1:T}~|~\mathbf{x}_t,~\mathbf{x}_{t-1},~\mathbf{x}_0)~ d\mathbf{x}_{1:t-2}\, d\mathbf{x}_{t+1:T} \bigg) d\mathbf{x}_{t-1} \bigg) d\mathbf{x}_t \\
&= \sum_{t=2}^T \int q(\mathbf{x}_t|\mathbf{x}_0) \bigg( \int q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \cdot \log{ q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)}~ d\mathbf{x}_{t-1} \bigg) d\mathbf{x}_t \\
&= \sum_{t=2}^T \mathrm{E}_{\mathbf{x}_t \sim q(\mathbf{x}_t|\mathbf{x}_0)} \Big[ D_{KL}(~q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)~) \Big] \end{aligned} $$ The KL term still depends on $\mathbf{x}_t$, so an outer expectation over $q(\mathbf{x}_t|\mathbf{x}_0)$ remains. In this post that outer expectation is left implicit and the term is written simply as a sum of KL Divergences:

$$ \fcolorbox{green}{white}{$\sum_{t=2}^T \mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[\log{q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0) \over p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)} \bigg] = \sum_{t=2}^T D_{KL}(q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)~||~p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t))$} $$

$$ \begin{aligned} \bold{\big(3\big)}~~~ &\fcolorbox{blue}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \bigg]$} \\
& \kern{45pt} = \int\bigg( - q(\mathbf{x}_{1:T}|\mathbf{x}_0) \cdot \log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \bigg) d\mathbf{x}_{1:T} \\
& \kern{45pt} = \int \bigg( -q(\mathbf{x}_1|\mathbf{x}_0) \cdot \log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \cdot \int q(\mathbf{x}_{2:T}|\mathbf{x}_1,~\mathbf{x}_0)d\mathbf{x}_{2:T} \bigg) d\mathbf{x}_1 \\
& \kern{45pt} = \mathrm{E}_{\mathbf{x}_1 \sim q(\mathbf{x}_1|\mathbf{x}_0)} \Big[ -\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \Big] \end{aligned} $$ As in (2), the remaining expectation over $\mathbf{x}_1$ is left implicit in this post, and the term is written simply as $$ \fcolorbox{blue}{white}{$\mathrm{E}_{\mathbf{x}_{1:T} \sim q(\mathbf{x}_{1:T}|\mathbf{x}_0)} \bigg[-\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1)) \bigg] = -\log(p_\theta(\mathbf{x}_0|\mathbf{x}_1))$} $$

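[19] To briefly preview how $L_{t-1}$ is handled: we show that $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ takes the form of a particular Gaussian Distribution, and train $\mu_\theta$ and $\Sigma_\theta$ to match the mean and covariance of $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$. In this way **we obtain $p_\theta(\mathbf{x}_{t-1}|\mathbf{x}_t)$.**

$L_0$ is likewise about obtaining $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$, which looks similar to what $L_{t-1}$ is after. For now, however, $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$ cannot be regarded as similar to $q(\mathbf{x}_{t-1}|\mathbf{x}_t,~\mathbf{x}_0)$ at $t=1$, because at $t=1$ we get $q(\mathbf{x}_0|\mathbf{x}_1,~\mathbf{x}_0)=1$: $\mathbf{x}_0$ is already given, so the distribution is degenerate. This means we do not know what value $\mu_\theta(\mathbf{x}_1, 1)$, which produces $\mathbf{x}_0$, should resemble. To resolve this, the paper defines an independent Decoder $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$ as follows.

$$ p_\theta(\mathbf{x}_0|\mathbf{x}_1) = \prod_{i=1}^D \int_{\delta_{-}(x_0^i)}^{\delta_{+}(x_0^i)} N(x;\mu_\theta^i(\mathbf{x}_1, 1), \sigma_1^2) dx $$ $$ \text{where}~~\delta_{+}(x) = \begin{cases} \infty &\text{if~~} x=1 \\ x+{1 \over 255} &\text{if~~} x<1 \end{cases},~~
\delta_{-}(x) = \begin{cases} -\infty &\text{if~~}x=-1 \\ x-{1 \over 255} &\text{if~~} x>-1 \end{cases} $$ Here $D$ is the dimension of the vector (= the total number of pixel values in one image), and each component of a vector is denoted by a superscript $(^i)$. Note in advance that the paper sets $\Sigma_\theta(\mathbf{x}_t, t)$ to the fixed value $\beta_tI$, on the grounds that experiments showed no large difference when it is held constant.

Let us interpret this definition. In numpy each pixel of an image is represented by an integer from 0 to 255, but here we assume it has been linearly scaled to a real number from -1 to 1. This is so that $p(\mathbf{x}_T)=N(\mathbf{x}_T;0, I)$ operates on a consistent scale; one can also think of it as preserving the symmetry of the Gaussian Distribution. The values $\{0, 1, \cdots, 255\}$ are then mapped to $\{-1, -{253 \over 255}, \cdots, {253 \over 255}, 1\}$. With this in mind, the integration interval can be read as follows.

A probability density function assigns probability 0 to any single point, but its integral over a very small interval $(={2 \over 255})$ can be regarded as the probability of the discrete value at the center of that interval. That is, $\mu_\theta(\mathbf{x}_1, 1)$ is split into components, and for each component the likelihood (= integral, probability) of the corresponding component of $\mathbf{x}_0$ is obtained from the resulting (univariate) Gaussian Distribution; these are then all multiplied together. The point at which the product of likelihoods is largest is the one we want.

In summary, this is the process of finding, for each pixel, the $\mu_\theta^i(\mathbf{x}_1, 1)$ that maximizes the likelihood.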
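The per-pixel integral in the Decoder can be computed from the Gaussian CDF. The following is a minimal scalar sketch, not the paper's code: the helper names `normal_cdf` and `pixel_prob` and the example values $\mu_\theta^i = 0.1$, $\sigma_1 = 0.3$ are assumptions chosen for illustration. Pixel level $k \in \{0,\cdots,255\}$ corresponds to $x = 2k/255 - 1$.

```python
import math

def normal_cdf(x, mean, std):
    """CDF of the univariate Gaussian N(mean, std^2)."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def pixel_prob(k, mean, std):
    """Probability that the Decoder assigns to pixel level k (0..255)."""
    x = 2.0 * k / 255.0 - 1.0
    # delta_plus is +infinity for the top level, delta_minus is -infinity for the bottom level.
    upper = 1.0 if k == 255 else normal_cdf(x + 1.0 / 255.0, mean, std)
    lower = 0.0 if k == 0 else normal_cdf(x - 1.0 / 255.0, mean, std)
    return upper - lower

mean, std = 0.1, 0.3   # assumed mu_theta^i(x_1, 1) and sigma_1 for one pixel
probs = [pixel_prob(k, mean, std) for k in range(256)]
total = sum(probs)
best_k = max(range(256), key=lambda k: probs[k])

print("sum over all 256 levels:", total)
print("most likely level:", best_k)
```

Because the 256 intervals tile the whole real line, the probabilities sum to 1, and the most likely level is the one whose interval contains the mean. The full $p_\theta(\mathbf{x}_0|\mathbf{x}_1)$ is the product of such terms over all $D$ components (in practice, a sum of logs).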

[20] The probability density function $N(\mathbf{x};\mu, \Sigma)$ of a Multivariate Gaussian Distribution is $$ N(\mathbf{x};\mu, \Sigma) = {1 \over \sqrt{(2\pi)^D|\Sigma|}} e^{-{1 \over 2} \big((\mathbf{x}-\mu)^T\Sigma^{-1}(\mathbf{x}-\mu) \big)} $$ where $D$ is the dimension of $\mathbf{x}$ and $|\Sigma|$ is the determinant of the matrix $\Sigma$.

[21] For a vector $\mathbf{x}$, $\mathbf{x}^T\mathbf{x}$ equals the inner product $\mathbf{x} \cdot \mathbf{x}$. So $\mathbf{x}^T\mathbf{x}$ is a single number, namely the sum of squares of all components of $\mathbf{x}$. That is, when $$ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{bmatrix} $$ we have $$ \mathbf{x}^T \mathbf{x} = \sum_{i=1}^D x_i^2 $$ This is usually written $||\mathbf{x}||^2$, but for convenience it is sometimes written $\mathbf{x}^2$.

Writing $(\mathbf{x}-\mathbf{a})^T(\mathbf{x}-\mathbf{a})$ in this shorthand gives $$ (\mathbf{x}-\mathbf{a})^T(\mathbf{x}-\mathbf{a}) = (\mathbf{x}-\mathbf{a}) \cdot (\mathbf{x}-\mathbf{a}) = \mathbf{x}\cdot\mathbf{x} - 2\mathbf{x}\cdot\mathbf{a} + \mathbf{a}\cdot\mathbf{a} = (\mathbf{x}-\mathbf{a})^2 $$

[22] The formula for the KL Divergence between two Multivariate Gaussian Distributions is as follows.

$$ \begin{aligned} D_{KL}&(N_0(\mu_0, \Sigma_0)~||~N_1(\mu_1, \Sigma_1))\\ &={1 \over 2}\left(tr(\Sigma_1^{-1}\Sigma_0)+(\mu_1-\mu_0)^{T}\Sigma_1^{-1}(\mu_1-\mu_0)-k+\ln\left({\det(\Sigma_1) \over \det(\Sigma_0)}\right) \right) \end{aligned} $$ $tr$ is the sum of the diagonal entries of a matrix (the trace), and $k$ is the dimension of the Multivariate Gaussian Distribution, i.e. the dimension of $\mu_0,~\mu_1$.
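For the isotropic covariances used in this post ($\Sigma = \sigma^2 I$), the formula reduces to scalar arithmetic. A minimal sketch with a hypothetical helper `kl_isotropic`, written in plain Python:

```python
import math

def kl_isotropic(mu0, var0, mu1, var1):
    """KL( N(mu0, var0 * I) || N(mu1, var1 * I) ) from the general formula above."""
    k = len(mu0)
    trace_term = k * var0 / var1                    # tr(Sigma_1^{-1} Sigma_0)
    quad_term = sum((b - a) ** 2 for a, b in zip(mu0, mu1)) / var1
    logdet_term = k * math.log(var1 / var0)         # ln(det Sigma_1 / det Sigma_0)
    return 0.5 * (trace_term + quad_term - k + logdet_term)

# Equal covariances: only the squared distance between the means remains.
same_cov = kl_isotropic([0.0, 0.0], 0.5, [1.0, 2.0], 0.5)
print(same_cov)   # (1^2 + 2^2) / (2 * 0.5)

# Identical distributions: the divergence is zero.
print(kl_isotropic([0.3, -0.1], 0.2, [0.3, -0.1], 0.2))
```

When both covariances are equal, the trace, $-k$ and $\ln$ terms cancel, leaving the squared difference of the means divided by $2\sigma^2$, which is the form that appears in $L_{t-1}$.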

[Code Implementation] Auto-Encoding Variational Bayes (VAE)

Mon, 03 Apr 2023 03:58:08 GMT

<Paper> Auto-Encoding Variational Bayes <Code> VAE_MNIST.ipynb

Let us put the VAE theory into practice. The goal is to train a VAE model and generate MNIST images. Everything was run on colab.

Preparing libraries

Prepare the required libraries. We will use PyTorch, and the data will be the MNIST dataset from torchvision.

import torch
import torch.nn as nn

import torchvision.datasets
import torchvision.transforms
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
from tqdm import tqdm

matplotlib.pyplot and tqdm are imported for visualization and progress tracking, respectively.

Setting hyperparameters

A dictionary is defined so that all hyperparameters can be set in one place.

config = {'batch_size' : 16, 'latent_dim' : 10, 'learning_rate' : 0.00001, 'epoch' : 30}

Preparing the dataset

The MNIST dataset provided by torchvision is used for training, and the model will generate MNIST images.

# Select the training device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Download the MNIST dataset
train_data = torchvision.datasets.MNIST('./data', train=True, download=True, transform=torchvision.transforms.ToTensor())

# Wrap the dataset in a DataLoader
train_dataloader = DataLoader(train_data, batch_size=config['batch_size'], shuffle=True, drop_last=True)

device is defined so that training runs on the GPU when one is available. This is commonly described as "putting it on the device".

Designing the model

Let us design the Encoder and Decoder, the core of the VAE, and set up the Loss Function.

# Design the Encoder and Decoder separately
#   The Encoder outputs (mu, logvar) are used in the Loss Function, so it is kept as its own module

class Encoder(nn.Module):
    def __init__(self, x_dim=784, h1_dim=196, h2_dim=49, z_dim=config['latent_dim']):
        super(Encoder, self).__init__()

        # 1st hidden layer : 784 -> 196
        self.fc1 = nn.Sequential(
            nn.Linear(x_dim, h1_dim),
            nn.ReLU()
        )

        # 2nd hidden layer : 196 -> 49
        self.fc2 = nn.Sequential(
            nn.Linear(h1_dim, h2_dim),
            nn.ReLU()
        )

        # output layer : 49 -> 10
        self.mu = nn.Linear(h2_dim, z_dim)
        self.logvar = nn.Linear(h2_dim, z_dim)

    # Helper for the reparameterization trick
    def reparameterization(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)

        mu = self.mu(x)
        logvar = self.logvar(x)

        z = self.reparameterization(mu, logvar)
        return z, mu, logvar

class Decoder(nn.Module):
    def __init__(self, x_dim=784, h1_dim=196, h2_dim=49, z_dim=config['latent_dim']):
        super(Decoder, self).__init__()

        # 1st hidden layer : 10 -> 49
        self.fc1 = nn.Sequential(
            nn.Linear(z_dim, h2_dim),
            nn.ReLU()
        )

        # 2nd hidden layer : 49 -> 196
        self.fc2 = nn.Sequential(
            nn.Linear(h2_dim, h1_dim),
            nn.ReLU()
        )

        # output layer : 196 -> 784
        self.fc3 = nn.Linear(h1_dim, x_dim)

    # Apply a sigmoid so that the outputs fall between 0 and 1
    def forward(self, z):
        z = self.fc1(z)
        z = self.fc2(z)
        z = self.fc3(z)
        pred = nn.Sigmoid()(z)
        return pred

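One difference from the theory post: the Encoder outputs $\log(\sigma_i^2)$ (= logvar) instead of $\sigma_i$.

logvar comes out of a final nn.Linear, so its entries can be zero or negative as well as positive. A standard deviation, however, must always be positive. Forcing this with a function such as nn.ReLU would distort the values.
The fix is to output $\log(\sigma_i^2)$ rather than $\sigma_i$. Since $\sigma_i^2$ is positive, it lies in the domain of $\log$, and the range of $\log$ is all real numbers, so a negative output from nn.Linear causes no problem.
In short, we sidestep the non-positive values by treating the output as a transformed standard deviation. More precisely, we treat the network output as the value of a one-to-one function whose range is the whole real line, and recover the original quantity (the standard deviation) through its inverse. [1]

There are two ways to produce mu and logvar: use a separate nn.Linear for each, as in the code above, or use a single nn.Linear with twice the out_features and split its output in half.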
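The second option can be sketched as follows. This is only an illustration, not the code used in this post: the class name SplitHeadEncoder and the single hidden layer are assumptions made to keep the example short, and torch.chunk is used to split the head output into mu and logvar.

```python
import torch
import torch.nn as nn

class SplitHeadEncoder(nn.Module):
    """Hypothetical variant: one Linear layer emits mu and logvar together."""

    def __init__(self, x_dim=784, h_dim=49, z_dim=10):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        # out_features is 2 * z_dim: the first half is mu, the second half is logvar
        self.head = nn.Linear(h_dim, 2 * z_dim)

    def forward(self, x):
        h = self.body(x)
        mu, logvar = torch.chunk(self.head(h), 2, dim=-1)
        std = torch.exp(0.5 * logvar)  # positive whatever the sign of logvar
        z = mu + std * torch.randn_like(std)
        return z, mu, logvar

torch.manual_seed(0)
z, mu, logvar = SplitHeadEncoder()(torch.rand(4, 784))
print(z.shape, mu.shape, logvar.shape)
```

Both layouts have the same number of head parameters; the split version simply stores them in one weight matrix.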

Optimizer

We collect the parameters of both models and pass them, together with the learning rate, to the optimizer.

# Create the Encoder and Decoder and move them to the device
encoder = Encoder().to(device)
decoder = Decoder().to(device)

# Define the optimizer from the model parameters and the learning rate
parameters = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(parameters, lr=config['learning_rate'])

Training

The forward pass runs the Encoder and then the Decoder; the backward pass computes the loss, backpropagates it, and updates the parameters.

# The image labels are not used

for epoch in tqdm(range(config['epoch'])):
    for i, (x, _) in enumerate(train_dataloader):
        # Forward
        input = x.view(config['batch_size'], -1).to(device)
        z, mu, logvar = encoder(input)
        output = decoder(z)

        # Compute the reconstruction loss and the regularization loss
        reconst_loss = nn.BCELoss(reduction='sum')(output, input)
        regular_loss = 0.5 * torch.sum(mu**2 + torch.exp(logvar) - logvar - 1)

        # Backward
        loss = reconst_loss + regular_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f" Loss : {loss}")

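This model does not use the labels. Because the goal here is generation rather than prediction, this walkthrough does not set aside a validation set; we simply watch the loss to check that training is progressing. Note that the printed value is the loss of the last batch of each epoch, summed over the images in that batch.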

Inference

# Sample latent vectors z by drawing repeatedly from N(0, 1)
# check_num_image : number of images to generate
check_num_image = 10
z = torch.randn(check_num_image, config['latent_dim']).to(device)
sampled_images = decoder(z).view(check_num_image, 28, 28)

# Visualize the generated images
fig = plt.figure(figsize=(10, (check_num_image//2)))
for idx, img in enumerate(sampled_images):
    ax = fig.add_subplot(2, check_num_image//2, idx+1)
    img = img.detach().cpu().numpy()  # move to the CPU first; .numpy() fails on CUDA tensors
    ax.imshow(img, cmap='gray')

The inference results are visualized below.

Some images look like digits and some do not. If the Encoder extracted features better, we could expect better images. The images are also blurry, which is a commonly cited weakness of the VAE.

The full code is as follows.

"""
# VAE code implementation

## 1. Setup

### 1-1. Import the required libraries
"""

import torch
import torch.nn as nn

import torchvision.datasets
import torchvision.transforms
from torch.utils.data import DataLoader

import matplotlib.pyplot as plt
from tqdm import tqdm

"""### 1-2. Define the hyperparameters"""

config = {'batch_size' : 16, 'latent_dim' : 10, 'learning_rate' : 0.00001, 'epoch' : 30}

"""## 2. Load the data"""

# Select the training device
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# Download the MNIST dataset
train_data = torchvision.datasets.MNIST('./data', train=True, download=True, transform=torchvision.transforms.ToTensor())

# Wrap the dataset in a DataLoader
train_dataloader = DataLoader(train_data, batch_size=config['batch_size'], shuffle=True, drop_last=True)

"""## 3. Model design

### 3-1. Encoder and Decoder
"""

# Define the Encoder and Decoder separately,
#   because the Encoder outputs (mu, logvar) are also needed by the loss function

class Encoder(nn.Module):
    def __init__(self, x_dim=784, h1_dim=196, h2_dim=49, z_dim=config['latent_dim']):
        super(Encoder, self).__init__()

        # 1st hidden layer : 784 -> 196
        self.fc1 = nn.Sequential(
            nn.Linear(x_dim, h1_dim),
            nn.ReLU()
        )

        # 2nd hidden layer : 196 -> 49
        self.fc2 = nn.Sequential(
            nn.Linear(h1_dim, h2_dim),
            nn.ReLU()
        )

        # output layer : 49 -> 10
        self.mu = nn.Linear(h2_dim, z_dim)
        self.logvar = nn.Linear(h2_dim, z_dim)

    # Helper for the reparameterization trick
    def reparameterization(self, mu, logvar):
        std = torch.exp(0.5 * logvar)
        eps = torch.randn_like(std)
        return mu + eps * std

    def forward(self, x):
        x = self.fc1(x)
        x = self.fc2(x)

        mu = self.mu(x)
        logvar = self.logvar(x)

        z = self.reparameterization(mu, logvar)
        return z, mu, logvar

class Decoder(nn.Module):
    def __init__(self, x_dim=784, h1_dim=196, h2_dim=49, z_dim=config['latent_dim']):
        super(Decoder, self).__init__()

        # 1st hidden layer : 10 -> 49
        self.fc1 = nn.Sequential(
            nn.Linear(z_dim, h2_dim),
            nn.ReLU()
        )

        # 2nd hidden layer : 49 -> 196
        self.fc2 = nn.Sequential(
            nn.Linear(h2_dim, h1_dim),
            nn.ReLU()
        )

        # output layer : 196 -> 784
        self.fc3 = nn.Linear(h1_dim, x_dim)

    # Apply a sigmoid so that the outputs fall between 0 and 1
    def forward(self, z):
        z = self.fc1(z)
        z = self.fc2(z)
        z = self.fc3(z)
        pred = nn.Sigmoid()(z)
        return pred

"""### 3-2. Optimizer"""

# Create the Encoder and Decoder and move them to the device
encoder = Encoder().to(device)
decoder = Decoder().to(device)

# Define the optimizer from the model parameters and the learning rate
parameters = list(encoder.parameters()) + list(decoder.parameters())
optimizer = torch.optim.Adam(parameters, lr=config['learning_rate'])

"""## 4. Training"""

# The image labels are not used

for epoch in tqdm(range(config['epoch'])):
    for i, (x, _) in enumerate(train_dataloader):
        # Forward
        input = x.view(config['batch_size'], -1).to(device)
        z, mu, logvar = encoder(input)
        output = decoder(z)

        # Compute the reconstruction loss and the regularization loss
        reconst_loss = nn.BCELoss(reduction='sum')(output, input)
        regular_loss = 0.5 * torch.sum(mu**2 + torch.exp(logvar) - logvar - 1)

        # backprop and optimize
        loss = reconst_loss + regular_loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f" Loss : {loss}")

"""## 5. Inference"""

# Sample latent vectors z by drawing repeatedly from N(0, 1)
check_num_image = 10
z = torch.randn(check_num_image, config['latent_dim']).to(device)
sampled_images = decoder(z).view(check_num_image, 28, 28)

# Visualize the generated images
fig = plt.figure(figsize=(10, (check_num_image//2)))
for idx, img in enumerate(sampled_images):
    ax = fig.add_subplot(2, check_num_image//2, idx+1)
    img = img.detach().cpu().numpy()  # move to the CPU first; .numpy() fails on CUDA tensors
    ax.imshow(img, cmap='gray')

[1] This is only possible because the function is one-to-one. Since $\log(x)~(=\ln(x))$ is one-to-one, it has an inverse $(=e^x)$, and applying that inverse to logvar yields $\sigma_i^2$.

In the code, this step can be seen in self.reparameterization, where torch.exp(0.5*logvar) computes $\sigma_i$.

[Paper Review] Auto-Encoding Variational Bayes (VAE)

Mon, 27 Mar 2023 19:17:30 GMT

<Paper> Auto-Encoding Variational Bayes

<References> [YouTube] 오토인코더의 모든것 [github.io] Variational AutoEncoder [arXiv] Tutorial on Variational Autoencoders

The Variational AutoEncoder (VAE) is a generative model with the structure of an AutoEncoder. Only the structure is shared, however; the purpose is entirely different. Before looking at the VAE, let us briefly review what an AutoEncoder is.

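1. AutoEncoder

The AutoEncoder is a model proposed for unsupervised learning. Suppose we want to classify images that have no labels. How should we go about it?

Latent Vector

An image can be viewed as a vector. Taking MNIST as an example, each image has 28x28=784 pixels, so it can be viewed as a 784-dimensional vector. In other words, 784 numbers describe one image.

Put differently, each MNIST image is described by 784 features, and an image is a particular combination of those features.

It is unlikely, though, that all 784 features are needed to describe one image. The 784-dimensional vector can be compressed into a lower-dimensional one, and the compressed vector is called the latent vector. The neural network that performs the compression is called the Encoder.

Unsupervised learning

The goal of the AutoEncoder is to represent unlabeled data well as compressed vectors. Unlike with labeled data, however, there is no direct way to check whether the data has been compressed well.

To check whether the latent vector has been extracted well, a Decoder is added. The Decoder takes the latent vector and reconstructs the original input image (vector). The difference between the original and the reconstructed image is used as the loss function, so the network is trained to compress well and reconstruct well.

The result is an Encoder (a neural network) that extracts good latent vectors.

A trained Encoder can be used as follows.

Append a fully-connected layer for classification to the end of the Encoder.
Train only that final layer using a small amount of labeled data. For MNIST, this means training a layer that classifies the digits 0 through 9.
The Encoder extracts good latent vectors and the final layer classifies well from them, so we end up with a model that classifies unlabeled images.

Summary

The AutoEncoder is a model that attaches a Decoder in order to obtain a well-trained Encoder.

Before moving on to the VAE, remember that the purpose of the AutoEncoder was to obtain the Encoder.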

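2. Variational AutoEncoder (VAE)

Although the VAE carries AutoEncoder in its name, the only thing the two share is the structure. Keep in mind that they look alike, and set the AutoEncoder aside for now.

Overview

The goal of the VAE is to generate images similar to the training data. The main idea is to obtain a high-dimensional vector (an image) from a low-dimensional vector (a latent vector). A Decoder (a neural network) is used to raise the dimension.

Why use a latent vector as the input to the network? As with the AutoEncoder, an image can be compressed into a low-dimensional latent vector. The underlying premise is that the latent vectors of similar images cluster together in some region of the space. [1]

If we can find where the latent vectors of the training data cluster, and we have a Decoder that reconstructs them well, then sampling a latent vector $z$ from that neighborhood should give a new image that could plausibly belong to the training data.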

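First attempt

Now let us design the training procedure. It is called the first attempt because this approach is going to fail.

Let us first fix some notation that will be used repeatedly. Interpreting these expressions was the hardest part for me as well.

$p(x)$ : the likelihood that the training data $x$ is sampled from the true training-data distribution. [2] Viewed as a function of $x$, it is the PDF of a probability distribution, and it can also be read as the distribution from which $x$ is sampled.
- An easy way to read $p(x)$: ideally we would find the true training-data distribution, because simply sampling points from it would already generate new images. In practice, however, finding that distribution is impossible. [3]
- So we assume that our training data was sampled from the true training-data distribution and try to estimate what that distribution is. Because sampling is random, a great many distributions could have produced our training data, although the plausibility differs from one distribution to another. [4]
- Which of these distributions is the most reasonable choice? Comparing them requires the notion of likelihood: we measure how plausible it is that our training data was sampled from each candidate distribution. [5] [6]
- We therefore look for the distribution under which our training data is most likely to be sampled. Even if it is not the true one, it is the most plausible candidate for the true training-data distribution.
- In other words, the network is trained in the direction that maximizes $p(x)$.

$p_\theta(x|z)$ : the likelihood that the training data $x$ is sampled from the true training-data distribution, given the latent vector $z$. Viewed as a function of $x$, it is the PDF of a probability distribution, and it can also be read as the distribution from which $x$ is sampled.
- This is the likelihood given some latent vector $z$. The latent vector $z$ passes through the Decoder and becomes the parameter of some probability distribution, so this is the likelihood that the training data $x$ is sampled from the distribution with that parameter.
- Both $p(x)$ and $p_\theta(x|z)$ are likelihoods under the true training-data distribution, but the distributions have different shapes. Because $p_\theta(x|z)$ adds the condition $z$, it can be viewed as one part of the distribution $p(x)$.
- The Decoder is trained so that the latent vector $z$ maps well to the parameter of that probability distribution (that is, the parameters of the Decoder, a neural network, are optimized). The subscript $\theta$ makes this explicit.

$p(z)$ : the likelihood that $z$ is sampled from the distribution of latent vectors that generate meaningful images (images similar to the training data). Viewed as a function of $z$, it is the PDF of a probability distribution, and it can also be read as the distribution from which $z$ is sampled.
- Passing an arbitrary $z$ through the Decoder will not necessarily give a meaningful image. There should be a region where the meaningful latent vectors cluster, and this is the likelihood that $z$ is sampled from that distribution.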

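Our goal is to maximize $p(x)$ for the training data $x$. We do not even know which distribution $p(x)$ is computed from, but following the Bayesian-inference argument in [7], marginalizing over $z$ lets us write $p(x)$ as follows.

$$ p(x)={\int p_\theta(x|z)p(z)dz} $$

Let us try to make $p(x)$ as large as possible through $p_\theta(x|z)$ and $p(z)$.

$p(z)$ : There is presumably some distribution from which meaningful latent vectors can be sampled. We simply assume that it is a multivariate Gaussian distribution with mean $0$ (the zero vector) and covariance $I$ (the identity matrix). [8]
- $p(z)$ is therefore fixed. There is nothing more to do with it.
$p_\theta(x|z)$ : The latent vector $z$ passes through the Decoder and becomes the parameter of some probability distribution, and this is the likelihood that the training data $x$ is sampled from the distribution built from that parameter.
- Increasing $p_\theta(x|z)$ goes in the same direction as a simpler criterion. If $p$ denotes the vector that comes out of the Decoder for $z$, it is the direction in which the two vectors $p$ and $x$ become equal. [9]
- So we train the Decoder parameters so that $p$ and $x$ become equal.
${\int p_\theta(x|z)p(z)dz}$ : To evaluate the integral on a computer we can use a Monte Carlo approximation. [10] Instead of integrating over every $z$, we sample a few values $z_i$ from $p(z)$ and replace the integral with the average of $p_\theta(x|z_i)$ over those samples.

$$ p(x)={\int p_\theta(x|z)p(z)dz}\approx{1 \over N}\sum_{i=1}^{N}p_\theta(x|z_i),\quad z_i \sim p(z) $$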
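To make the Monte Carlo step concrete, here is a minimal sketch on a toy model that is not from the paper: $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, for which the marginal is known in closed form, $p(x) = N(x;~0, 2)$. The names normal_pdf and mc_marginal are made up for this illustration. When the $z_i$ are drawn from $p(z)$, the estimate is the sample mean of $p(x|z_i)$.

```python
import math
import random

def normal_pdf(x, mean, var):
    """Density of N(mean, var) evaluated at x."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def mc_marginal(x, n_samples, rng):
    """Estimate p(x) = integral of p(x|z) p(z) dz by sampling z from p(z)."""
    total = 0.0
    for _ in range(n_samples):
        z = rng.gauss(0.0, 1.0)          # z_i ~ p(z) = N(0, 1)
        total += normal_pdf(x, z, 1.0)   # p(x | z_i) = N(x; z_i, 1)
    return total / n_samples

rng = random.Random(0)
x = 0.5
estimate = mc_marginal(x, 100_000, rng)
exact = normal_pdf(x, 0.0, 2.0)  # closed form, valid for this toy model only
print(estimate, exact)
```

In one dimension the estimate converges quickly. For images, almost every sampled $z$ gives a likelihood near zero, which is the difficulty the rest of the post deals with.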

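Now it looks as though we can start training.

First attempt _ failure

This approach did not succeed, however, because what the loss measures does not match what we mean by a good image.

(a) is a training image, and (b) and (c) are images generated from sampled $z$. Intuitively (c) is closer to a '2' than (b), yet the measured error between (a) and (b) is smaller than the error between (a) and (c). The model therefore learns that (b) is the image closer to a '2'.

Changing the metric used to compute the error could solve this, but that was judged not to be an efficient approach, and the paper proposes a different method.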

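Second attempt

Let us take stock. We wanted to maximize $p(x)$ but do not know it, so we rewrote it as

$$ p(x)={\int p_\theta(x|z)p(z)dz} $$

and tried to compute $p_\theta(x|z)$ and $p(z)$.

In practice, $p_\theta(x|z)$ is reported to be nearly 0 for most $z$. This hurts the first attempt, which samples only a few $z$. Even within the Gaussian distribution, we need to draw only the meaningful $z$.

Drawing $z$ well means that $z$ should come not from $p(z)$ but from a distribution of $z$ that is meaningful for each input image. [11] This is the distribution of $z$ that is likely to generate the training data $x$ well, so it is a distribution conditioned on $x$, and we define the likelihood of sampling $z$ from it as $p(z|x)$.

$p(z|x)$ is unknown as well. So we define a comparatively simple distribution $q_\phi(z|x)$ that approximates it as closely as possible, and find its parameters by training. These parameters depend on the input training data $x$.

To take this into account, let us express and interpret $p(x)$ in a different way from the formula above. The goal of maximizing $p(x)$ is unchanged, but we will maximize $\log(p(x))$ instead.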

$$ \begin{aligned} \log(p(x)) &= \log(p(x))\int{q_\phi(z|x)dz} = \int{\log(p(x))q_\phi(z|x)dz}\\ &=\int{\log\left({p_\theta(x|z)p(z) \over p(z|x)}\right)}q_\phi(z|x)dz\\ &=\int{\log\left(p_\theta(x|z)\cdot{p(z) \over q_\phi(z|x)}\cdot{q_\phi(z|x) \over p(z|x)}\right)}q_\phi(z|x)dz\\ &= \int{\log(p_\theta(x|z))q_\phi(z|x)dz} -\int{\log\left({q_\phi(z|x) \over p(z)}\right)}q_\phi(z|x)dz+ \int{\log\left({q_\phi(z|x) \over p(z|x)}\right)q_\phi(z|x)dz} \end{aligned} $$

[12] [13] [14]

$\log(p(x))$ has been split into three terms. Let us look at what each one means.

$\int{\log(p_\theta(x|z))q_\phi(z|x)dz}$ : This can be read as an expectation. That is,

$$ \int{\log(p_\theta(x|z))q_\phi(z|x)dz}=E_{q_\phi(z|x)}[\log(p_\theta(x|z))] $$

$-\int{\log\left({q_\phi(z|x) \over p(z)}\right)}q_\phi(z|x)dz$ : This can be read as the KL divergence between two probability distributions. That is,

$$ -\int{\log\left({q_\phi(z|x) \over p(z)}\right)}q_\phi(z|x)dz = -D_{KL}(q_\phi(z|x)||p(z)) $$

$\int{\log\left({q_\phi(z|x) \over p(z|x)}\right)q_\phi(z|x)dz}$ : This can also be read as the KL divergence between two probability distributions. That is,

$$ \int{\log\left({q_\phi(z|x) \over p(z|x)}\right)q_\phi(z|x)dz} = D_{KL}(q_\phi(z|x)||p(z|x)) $$

Putting these together, $$ \begin{aligned} \log(p(x)) &= E_{q_\phi(z|x)}[\log(p_\theta(x|z))]\\ &- D_{KL}(q_\phi(z|x)||p(z)) + D_{KL}(q_\phi(z|x)||p(z|x)). \end{aligned} $$

Here comes one trick. We do not know what distribution $p(z|x)$ is, so the last term cannot be computed. But a KL divergence is never negative, and because the Encoder and Decoder are trained so that $q_\phi(z|x)$ approximates $p(z|x)$, $D_{KL}(q_\phi(z|x)||p(z|x))$ is close to 0 at the optimum. We therefore leave the $D_{KL}(q_\phi(z|x)||p(z|x))$ term out when maximizing $\log(p(x))$. [15] This gives

$$ \log(p(x)) \ge E_{q_\phi(z|x)}[\log(p_\theta(x|z))] - D_{KL}(q_\phi(z|x)||p(z)) $$

so we look for the maximum of $\log(p(x))$ by pushing its lower bound as high as possible.
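The three-term decomposition can be checked numerically on a small discrete example. The sketch below is not from the paper: the latent variable takes only three values and every number is made up, so that all integrals become finite sums and the true posterior $p(z|x)$ can be computed exactly.

```python
import math

# Toy discrete latent model; all numbers are made up for illustration.
p_z = [0.5, 0.3, 0.2]              # prior p(z)
p_x_given_z = [0.10, 0.60, 0.30]   # likelihood p(x|z) for one fixed x
q = [0.2, 0.5, 0.3]                # approximate posterior q(z|x)

# Exact marginal and exact posterior, available only because z is discrete.
p_x = sum(l * p for l, p in zip(p_x_given_z, p_z))
posterior = [l * p / p_x for l, p in zip(p_x_given_z, p_z)]

def kl(a, b):
    """KL divergence between two discrete distributions."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

reconstruction = sum(qi * math.log(l) for qi, l in zip(q, p_x_given_z))
elbo = reconstruction - kl(q, p_z)   # the lower bound
gap = kl(q, posterior)               # the term that is dropped

print(math.log(p_x), elbo, gap)
```

The gap is never negative and equals $\log(p(x))$ minus the lower bound, so raising the bound either raises $\log(p(x))$ or shrinks the gap.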

Now let us have the computer optimize the network (learn the parameters of the Encoder and Decoder) so that $E_{q_\phi(z|x)}[\log(p_\theta(x|z))] - D_{KL}(q_\phi(z|x)||p(z))$ is maximized.

We will optimize with gradient descent, so let us set up the loss function so that a decrease in the loss is an increase in $\log(p(x))$. What is the simplest way? Put a minus sign in front. That is, $$ \begin{aligned} \arg\max_{\theta,\phi}~\log(p(x)) &\approx \arg\max_{\theta,\phi}\left(E_{q_\phi(z|x)}[\log(p_\theta(x|z))] - D_{KL}(q_\phi(z|x)~||~p(z))\right)\\ &=\arg\min_{\theta,\phi}\left(-E_{q_\phi(z|x)}[\log(p_\theta(x|z))] + D_{KL}(q_\phi(z|x)~||~p(z))\right) \end{aligned} $$ so if we take $-E_{q_\phi(z|x)}[\log(p_\theta(x|z))] + D_{KL}(q_\phi(z|x)||p(z))$ as the loss function, the parameters are learned so that $\log(p(x))$ is maximized.

Before making the loss function concrete, let us briefly look at what each term of $-E_{q_\phi(z|x)}[\log(p_\theta(x|z))] + D_{KL}(q_\phi(z|x)||p(z))$ means.

$-E_{q_\phi(z|x)}[\log(p_\theta(x|z))]$ : From the training data $x$, we find (the parameters of) the distribution of $z$ that is likely to generate images similar to $x$, sample $z$ from that distribution, and take the negative log-likelihood of sampling $x$ from the 'distribution of similar $x$' built from that $z$.
- That is, it measures how different the input data $x$ is from the data $\tilde{x}$ generated from the 'distribution of similar $x$'. The lower it is, the more similar the two are, and this term is called the Reconstruction Error.
$D_{KL}(q_\phi(z|x)||p(z))$ : The KL divergence between $p(z)$, the distribution from which the first attempt tried to sample $z$, and $q_\phi(z|x)$, the distribution that approximates $p(z|x)$, from which the ideal $z$ for each input image can be sampled.
- That is, it measures how different the two distributions $p(z)$ and $q_\phi(z|x)$ are. The lower it is, the more similar the two distributions are, and this term is called the Regularization Error.

For both terms, the direction of decrease is the direction we want.

Finally, let us turn each term of $-E_{q_\phi(z|x)}[\log(p_\theta(x|z))] + D_{KL}(q_\phi(z|x)||p(z))$ into a loss function. A subscript $~_i$ marks quantities that belong to the $i$-th data point.

Reconstruction Error $$ \begin{aligned}
-E_{q_\phi(z_i|x_i)}[\log(p_\theta(x_i|z_i))] &\approx -{1 \over K}\sum_{j=1}^K\log(p_\theta(x_i|z_{i,j})),\quad K:~\text{number of } z_i \text{ samples}\\ &=-\log(p_\theta(x_i|z_i))~~(\mathrm{let}~K=1)\\ &=-\log(p_{i}^{x_i}(1-p_i)^{1-x_i})\\ &=-\left(x_i\log(p_i)+(1-x_i)\log(1-p_i)\right) \end{aligned} $$

$$ \therefore \mathrm{Reconstruction~~Error} = -\sum_{j=1}^{D}\left(x_{i,j}\log(p_{i,j})+(1-x_{i,j})\log(1-p_{i,j})\right),\quad D:~\text{dimension of the training data } x_i $$

[16] [17] [18] [19]

Before computing the Regularization Error, we assume that $q_\phi(z|x)$ is a Gaussian distribution with mean $\mu_i$ and covariance $\Sigma_i$, where $\Sigma_i$ is a $J{\times}J$ diagonal matrix whose $J$ diagonal entries are the squares of the entries of $\sigma_i$. [20] It is an approximating distribution rather than the true one anyway, and this choice makes the KL divergence between two multivariate Gaussian distributions easy to compute.

$$ \begin{aligned} D_{KL}(q_\phi(z|x)||p(z)) &= {1 \over 2}\left(tr(I\Sigma_i)+\mu_i^{T}\mu_i-J+\ln\left({1 \over \prod_{j=1}^{J}\sigma_{i,j}^2}\right)\right)\\ &={1 \over 2}\sum_{j=1}^{J}\left(\mu_{i,j}^{2}+\sigma_{i,j}^{2}-\ln(\sigma_{i,j}^{2})-1\right),\quad J:~\text{dimension of the latent vector} \end{aligned} $$

$$ \therefore \mathrm{Regularization~Error} ={1 \over 2}\sum_{j=1}^{J}\left(\mu_{i,j}^{2}+\sigma_{i,j}^{2}-\ln(\sigma_{i,j}^{2})-1\right) $$

[21]

$$ \begin{aligned} \mathrm{Loss~~Function} &= \mathrm{Reconstruction~~Error} + \mathrm{Regularization~Error}\\ &= -\sum_{j=1}^{D}\left(x_{i,j}\log(p_{i,j})+(1-x_{i,j})\log(1-p_{i,j})\right)\\ &+ {1 \over 2}\sum_{j=1}^{J}\left(\mu_{i,j}^{2}+\sigma_{i,j}^{2}-\ln(\sigma_{i,j}^{2})-1\right) \end{aligned} $$
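The loss above can be written almost term by term in PyTorch. The following is a minimal sketch under a few assumptions: the function name vae_loss and the clamping constant are illustrative, the encoder is assumed to output logvar $= \ln(\sigma^2)$, and both sums run over the whole batch. The reconstruction term is compared against torch.nn.functional.binary_cross_entropy as a sanity check.

```python
import torch
import torch.nn.functional as F

def vae_loss(x, p, mu, logvar):
    """Return (reconstruction error, regularization error), summed over the batch."""
    eps = 1e-7                       # hypothetical clamp that keeps log() finite
    p = p.clamp(eps, 1 - eps)
    reconstruction = -(x * torch.log(p) + (1 - x) * torch.log(1 - p)).sum()
    # logvar = ln(sigma^2), so sigma^2 = exp(logvar)
    regularization = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum()
    return reconstruction, regularization

torch.manual_seed(0)
x = torch.rand(3, 784)                     # stand-in for a batch of images
p = torch.sigmoid(torch.randn(3, 784))     # stand-in for the Decoder output
mu, logvar = torch.randn(3, 10), torch.randn(3, 10)

rec, reg = vae_loss(x, p, mu, logvar)
reference = F.binary_cross_entropy(p, x, reduction='sum')
print(rec.item(), reference.item(), reg.item())
```

The regularization term is zero exactly when mu is the zero vector and logvar is zero, that is, when $q_\phi(z|x)$ coincides with $N(0, I)$.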

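Second attempt _ Architecture

We are ready to train. The second attempt is illustrated in the following diagram.

One vector did not appear in the previous section: $\epsilon_i$. It is introduced in order to use the reparameterization trick. [22] [23]

Second attempt _ Inference

The purpose of the VAE is to generate images, so inference uses the Decoder. Remember that the VAE's Encoder only served to help the Decoder.

Inference with the VAE proceeds by sampling a latent vector from $N(0, I)$ and passing it through the Decoder to obtain an image. This is where the VAE's advantage over the AutoEncoder shows.

The following figure comes from networks built with a 2-dimensional latent vector space. After training, the latent vectors of the test data are plotted. The left panel is the AutoEncoder and the right panel is the VAE.

The biggest difference between the two is how well the latent vectors cluster. The AutoEncoder's latent vectors spread over a wide, irregular region, so there is little guarantee that an arbitrarily chosen latent vector decodes into a meaningful image.

The VAE's latent vectors, in contrast, cluster well. [24] Sampling a latent vector within that region and passing it through the Decoder is likely to give a meaningful image. In other words, the difference between the two models is whether the region of latent vectors that produce meaningful images is predictable.

The dimension of the latent vector can be adjusted. The larger it is, the more varied the image features it can encode. The following are the inference (generation) results with the latent dimension set to 2, 5, 10, and 20.

Furthermore

Later work includes the Conditional Variational AutoEncoder (CVAE), which generates images subject to a condition on the desired image, and the Adversarial AutoEncoder (AAE), which generates images together with a GAN.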

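3. Endnote

[1] Take MNIST images as an example. An image can be viewed as a vector, and a vector as a point in space, so one image is a single point in 784-dimensional space. Images that a person reads as '3' should lie close together in that space, since a difference of one or two pixels does not destroy the shape of a '3'.

The Encoder is a neural network, which on closer inspection is a composition of several functions. Linear transformations, ReLU, and Sigmoid are all continuous, and passing a set of nearby points (images) through continuous functions any number of times keeps the results nearby. So we can expect the latent vectors of images with the same meaning, such as '3', to cluster together in the low-dimensional space.

[2] Every individual value of a continuous random variable has probability 0. Looking at the probability density function (PDF), however, there is clearly a difference in 'frequency'. When sampling from the standard normal distribution, values near 0 are drawn far more often than values near 100,000, and the notion of likelihood is introduced to describe this difference.

Likelihood is not probability. It is best read as the relative plausibility that a point is drawn when sampling. The likelihood is given by the value of the PDF.

[3] Take MNIST again. The true training-data distribution is the distribution of every handwritten image of the digits 0 through 9. Finding this distribution is practically impossible, because everyone writes digits differently and the variations are endless.

[4] Even when sampling from a normal distribution with mean 1,000,000 and standard deviation 1, there is a chance of drawing -1,000,000. The chance is vanishingly small, though, so this is not the distribution we want.

[5] Consider a normal distribution with unknown mean and standard deviation 1. We draw 5 points from it and all of them are 10. Under which distribution is 'drawing 5 points and getting 10 every time' most plausible? Intuitively, the answer is the normal distribution with mean 10. To compare candidates numerically, we express the plausibility of each drawn point as a likelihood and look for the distribution with the highest likelihood.

The PDF $f(x)$ of a normal distribution with mean $\mu$ and standard deviation 1 is $f(x)={1 \over \sqrt{2\pi}}e^{-{(x-\mu)^2 \over 2}}$, and the likelihood of sampling 10 from it is $f(10)={1 \over \sqrt{2\pi}}e^{-{(10-\mu)^2 \over 2}}$. The likelihood of sampling several values is the product of the individual likelihoods, so the likelihood of sampling 10 five times is $(f(10))^5 = ({1 \over \sqrt{2\pi}})^5e^{-{5(10-\mu)^2 \over 2}}$, which is largest when $\mu$ is 10.

That is, the distribution under which sampling 10 five times is most plausible is the normal distribution with mean 10, so it is reasonable to conclude that the distribution we sampled from is the normal distribution with mean 10.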
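The worked example in [5] can be reproduced with a small grid search. The sketch below uses made-up helper names (log_likelihood, candidates) and works with the log of the likelihood, which has the same maximizer as the likelihood itself.

```python
import math

def log_likelihood(mu, samples):
    """Log of the product of N(x; mu, 1) densities over the samples."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (x - mu) ** 2 for x in samples)

samples = [10.0] * 5                          # five draws, all equal to 10
candidates = [m / 10 for m in range(0, 201)]  # candidate means 0.0, 0.1, ..., 20.0

best_mu = max(candidates, key=lambda m: log_likelihood(m, samples))
print(best_mu)
```

Replacing samples with values that are not all equal moves the maximizer to their sample mean, up to the resolution of the grid.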

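This rests on the principle of maximum likelihood. Sampling is random, so even a low-likelihood point can be drawn with small probability. But we work under the premise that the most plausible event is the one that occurred, so our goal is to find the distribution with the highest likelihood. The more points are sampled (the more training data we have), the more likely it is that the highest-likelihood distribution is the one we are after.

[6] This is maximum likelihood estimation. Usually the family of the distribution is known (normal, Bernoulli, and so on) and only its parameters are estimated, but here the true training-data distribution is so complex that we do not know which distribution it follows.

[7] This formula comes from Bayesian inference, a method that estimates the posterior from the prior and additional information.

Given sampled data $x$, we want to find the distribution it was sampled from, or equivalently the parameter $\theta$ of that distribution. The posterior (the density of the parameter given the data) can then be rewritten as follows.

$$ p(\theta | x)={p(x|\theta)p(\theta) \over p(x)} $$

Here $p(\theta | x)$ is a probability density, so integrating it over all $\theta$ must give 1. Therefore

$$ p(x)=\int{p(x|\theta)p(\theta)d\theta} $$

This is a natural expression in light of the law of total probability.

For more detail, look up marginal likelihood or normalization constant.

[8] This may seem a strange assumption, but the first one or two layers of the Decoder are said to be able to transform Gaussian-distributed $z$ values so that they follow the distribution of meaningful latent vectors.

That is, whatever distribution the latent vectors actually follow, sampling $z$ from a normal distribution and passing it through a few layers yields values that follow the latent vector distribution we want.

This works because a neural network acts as a function approximator. Fundamentally, deep learning is used because neural networks can approximate nonlinear functions very well. Here the network approximates the function that maps points from the normal distribution to points in the latent vector space. [arXiv] Tutorial on Variational Autoencoders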

[9] Take a normal distribution with standard deviation fixed at $\sigma$ as an example. Suppose we want the result of passing some input $x_i$ through a neural network $f_\theta$ to equal the target $y_i$, that is, we want $f_\theta(x_i)$ and $y_i$ to become equal. For simplicity, $x_i$ and $y_i$ are single values rather than vectors.

Now consider the likelihood of sampling $y_i$ from a normal distribution with mean $f_\theta(x_i)$ and standard deviation $\sigma$. Which value has the highest likelihood under a normal distribution? The mean, $f_\theta(x_i)$. So the direction in which $f_\theta(x_i)$ and $y_i$ become equal is the direction in which the likelihood of sampling $y_i$ increases.

[10] ${\int p_\theta(x|z)p(z)dz}$ has the form of an expectation. That is,

$$ {\int p_\theta(x|z)p(z)dz} = E_{z \sim p(z)}[{p_\theta(x|z)}] $$

Replacing the expectation with the sample mean over $N$ values $z_i$ drawn from $p(z)$ gives $$ E_{z \sim p(z)}[{p_\theta(x|z)}] \approx {1 \over N}\sum_{i=1}^{N}p_\theta(x|z_i) $$

as the approximation.

[11] For MNIST, this means that a latent vector $z$ that generates a '2' well need not be considered when generating an '8'.

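[12] A logarithm with base greater than 1 is an increasing function. The $\log$ in the paper has base $e$, so $\log(p(x))$ increases and decreases exactly when $p(x)$ does.

[13] $q_\phi(z|x)$ is the PDF of a probability distribution, so $\int{q_\phi(z|x)dz}=1$.

[14] By Bayes' Theorem, $p(x)={p(x|z)p(z) \over p(z|x)}$.

[15] Every KL divergence is non-negative. It is 0 when the two distributions are identical and grows as they differ.

[16] By the Monte Carlo approximation, the expectation $-E_{q_\phi(z_i|x_i)}[\log(p_\theta(x_i|z_i))]$ is approximated by drawing a few samples $z_{i,j}$ and averaging, $-{1 \over K}\sum_{j=1}^K\log(p_\theta(x_i|z_{i,j}))$. The larger $K$ is, the better the approximation.

[17] Because of the computational cost during training, a single $z_i$ is sampled for each $x_i$.

[18] This continues from [9]. The direction in which $p_\theta(x_i|z_i)$ increases is the direction in which $p_i$, produced by passing $z_i$ through the Decoder, becomes equal to $x_i$, and it is the direction in which the likelihood of sampling $x_i$ from a distribution with parameter $p_i$ increases.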

If that distribution is a Gaussian distribution, $-\log(p_\theta(x_i|z_i))$ reduces to a mean squared error; if it is a Bernoulli distribution, it reduces to a cross-entropy error.

Gaussian Distribution $(p_\theta(x_i|z_i) = N(x_i;~p_i, 1))$

$$ \begin{aligned} -\log(p_\theta(x_i|z_i)) &= -\log\left({1 \over \sqrt{2\pi}}e^{-{(x_i-p_i)^2 \over 2}}\right)\\ &= -\log\left({1 \over \sqrt{2\pi}}\right) + {(x_i-p_i)^2 \over 2}\\ &= {(x_i-p_i)^2 \over 2} + \mathrm{const} \end{aligned} $$

Up to an additive constant and a positive scale, neither of which changes the minimizer, this is the squared error $(x_i-p_i)^2$.

Bernoulli Distribution $(p_\theta(x_i|z_i) = Bern(x_i;~p_i))$

$$ \begin{aligned} -\log(p_\theta(x_i|z_i)) &= -\log\left(p_i^{x_i}(1-p_i)^{1-x_i}\right)\\ &=-x_i\log(p_i) - (1-x_i)\log(1-p_i) \end{aligned} $$

To sum up, whichever distribution $p_\theta(x_i|z_i)$ is, the direction in which it increases is the direction in which $x_i$ and $p_i$ become equal. In particular, when $p_\theta(x_i|z_i)$ is a Bernoulli distribution the loss takes the form of the cross-entropy error, which also pushes $x_i$ and $p_i$ together, so training works well under the Bernoulli assumption.

Had we assumed a Gaussian distribution for $p_\theta(x_i|z_i)$ instead of a Bernoulli distribution, a different form of Reconstruction Error would result.
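A quick numerical check of the claim that both choices push $p_i$ toward $x_i$: the sketch below evaluates the two negative log-likelihoods for a single pixel on a grid of candidate $p_i$ values. The function names and the pixel value 0.8 are made up for illustration.

```python
import math

def gaussian_nll(x, p):
    """-log N(x; p, 1): half the squared error plus a constant."""
    return 0.5 * math.log(2 * math.pi) + 0.5 * (x - p) ** 2

def bernoulli_nll(x, p):
    """-log Bern(x; p), extended to a pixel intensity x in [0, 1]."""
    return -(x * math.log(p) + (1 - x) * math.log(1 - p))

x = 0.8                                   # one pixel intensity
grid = [i / 100 for i in range(1, 100)]   # candidate Decoder outputs 0.01 ... 0.99

best_gaussian = min(grid, key=lambda p: gaussian_nll(x, p))
best_bernoulli = min(grid, key=lambda p: bernoulli_nll(x, p))
print(best_gaussian, best_bernoulli)
```

Both losses are minimized at $p_i = x_i$. They differ in how strongly they penalize a given mismatch, which is why the two assumptions lead to different Reconstruction Errors.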

[19] The result of the expression above is a vector, with one term per component of $x_i$. All of its components are summed into a single error value.

[20] $\Sigma_i = \begin{bmatrix} \sigma_{i,1}^2 & 0 & 0 & \cdots & 0 \\ 0 & \sigma_{i, 2}^2 & 0 & \cdots & 0 \\ 0 & 0 & \sigma_{i, 3}^2 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & \sigma_{i, J}^2 \end{bmatrix},~~ $ $\sigma_i = \begin{bmatrix} \sigma_{i, 1} \\ \sigma_{i, 2} \\ \vdots \\ \sigma_{i, J} \end{bmatrix} $

A simple covariance is assumed for computational convenience.

[21] The KL divergence between two multivariate Gaussian distributions is given by the following formula.

$$ \begin{aligned} D_{KL}&(N_0(\mu_0, \Sigma_0)||N_1(\mu_1, \Sigma_1))\\ &={1 \over 2}\left(tr(\Sigma_1^{-1}\Sigma_0)+(\mu_1-\mu_0)^{T}\Sigma_1^{-1}(\mu_1-\mu_0)-k+\ln\left({\det(\Sigma_1) \over \det(\Sigma_0)}\right) \right) \end{aligned} $$ Here $tr$ is the trace (the sum of the diagonal entries of a matrix) and $k$ is the dimension of the multivariate Gaussian distribution, that is, the dimension of $\mu_0$ and $\mu_1$.
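As a sanity check that this general formula reduces to the Regularization Error used in the main text, the sketch below evaluates it for diagonal covariances only, where the trace, inverse, and determinant become simple sums over the diagonal entries. The function name kl_diag_gaussians and the numbers are made up for illustration.

```python
import math

def kl_diag_gaussians(mu0, var0, mu1, var1):
    """KL(N(mu0, diag(var0)) || N(mu1, diag(var1))) from the general formula."""
    k = len(mu0)
    trace = sum(v0 / v1 for v0, v1 in zip(var0, var1))
    quad = sum((m1 - m0) ** 2 / v1 for m0, m1, v1 in zip(mu0, mu1, var1))
    log_det_ratio = sum(math.log(v1) - math.log(v0) for v0, v1 in zip(var0, var1))
    return 0.5 * (trace + quad - k + log_det_ratio)

mu, var = [0.3, -1.2], [0.5, 2.0]   # q(z|x) = N(mu, diag(var))

# General formula with N_1 = N(0, I) ...
general = kl_diag_gaussians(mu, var, [0.0, 0.0], [1.0, 1.0])
# ... against the closed form 0.5 * sum(mu^2 + sigma^2 - ln(sigma^2) - 1)
closed = 0.5 * sum(m ** 2 + v - math.log(v) - 1 for m, v in zip(mu, var))
print(general, closed)
```

The two values agree, and the divergence of a distribution from itself is zero.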

[22] We train the parameters of the Encoder and Decoder by backpropagation, so the loss function must be differentiable. The question is whether $z_i$ is differentiable. As originally defined, $z_i$ is sampled from a Gaussian distribution with mean $\mu_i$ and standard deviation $\sigma_i$. $z_i$ is a random 'variable', but can it be differentiated? $z_i$ depends heavily on $\mu_i$ and $\sigma_i$, so can we compute ${\partial z_i \over \partial\mu_i}$ or ${\partial z_i \over \partial\sigma_i}$?

The answer is no: the sampling step itself cannot be differentiated. But by changing only the representation, we can make it differentiable while keeping the same meaning. We sample $\epsilon_i$ from the isotropic Gaussian distribution $N(0, I)$ and write $z_i=\mu_i+\sigma_i\odot\epsilon_i$, where $\odot$ is element-wise multiplication.

If this is not intuitive, recall the standardization of a normal distribution from high-school probability and statistics. For a random variable $X \sim N(\mu, \sigma^2)$ we have ${X-\mu \over \sigma} \sim N(0, 1)$, which suggests that the reverse direction works as well.

The key advantage of the reparameterization trick is that the randomly drawn value does not depend on $\mu_i$ or $\sigma_i$. The random value $\epsilon_i$ is treated as a constant during backpropagation, and $z_i$ thereby becomes differentiable.
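A minimal sketch of this point in PyTorch, with made-up values for mu and sigma: because eps is drawn without reference to mu or sigma, autograd treats it as a constant and the gradients of $z$ with respect to both parameters are well defined.

```python
import torch

torch.manual_seed(0)
mu = torch.tensor([0.5, -1.0], requires_grad=True)
sigma = torch.tensor([2.0, 0.3], requires_grad=True)

eps = torch.randn(2)       # drawn independently of mu and sigma
z = mu + sigma * eps       # same distribution as sampling z from N(mu, diag(sigma^2))
z.sum().backward()

print(mu.grad)     # dz/dmu is 1 for every component
print(sigma.grad)  # dz/dsigma equals eps
```

If $z$ were instead produced by a sampling call that takes mu and sigma as arguments without providing a gradient path, there would be nothing for backward() to propagate through.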

[23] The example in [22] differs slightly from our situation. We are sampling from a multivariate Gaussian distribution, which is not quite the same as the univariate case.

A method for sampling from a multivariate Gaussian distribution is described in the linked reference. The one complication is that we need to find a matrix $A$ with $AA^\mathrm{T}=\Sigma$.

Recall the original purpose. We sample the latent vector $z$ from the multivariate Gaussian distribution $N(\mu_i, \Sigma_i)$, and we use the reparameterization trick to keep it differentiable.

According to that reference, a value $z$ sampled from $N(\mu_i, \Sigma_i)$ can be written as $z=\mu_i+A_i\epsilon$. $(A_i$ is a $J{\times}J$ matrix satisfying $A_iA_i^\mathrm{T}=\Sigma_i$, $\epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_J \end{bmatrix} $, and each $\epsilon_j$ is drawn independently from $N(0, 1))$

Let $A_i = \begin{bmatrix} \sigma_{i,1} & 0 & \cdots & 0 \\ 0 & \sigma_{i, 2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{i, J} \end{bmatrix} $. It is, loosely speaking, the square root of $\Sigma_i$. Since $A_i^\mathrm{T}=A_i$,

$A_iA_i^{T}=\Sigma_i$, so the condition is satisfied. Therefore

$z=\mu_i+A_i\epsilon = \mu_i + \begin{bmatrix} \sigma_{i,1} & 0 & \cdots & 0 \\ 0 & \sigma_{i, 2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_{i, J} \end{bmatrix} \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_J \end{bmatrix} = \mu_i + \begin{bmatrix} \sigma_{i, 1}\epsilon_1 \\ \sigma_{i, 2}\epsilon_2 \\ \vdots \\ \sigma_{i, J}\epsilon_J \end{bmatrix} = \mu_i + \sigma_i\odot\epsilon $.

The Architecture section says that $\epsilon_i$ is sampled from $N(0, I)$. Is that the same as $\epsilon = \begin{bmatrix} \epsilon_1 \\ \vdots \\ \epsilon_J \end{bmatrix} $ in the formula above? Applying the same method with a matrix $A$ such that $AA^\mathrm{T}=I$ $(A=I$ works$)$, it is easy to see that concatenating $J$ independent draws from $N(0, 1)$ is the same as sampling one vector from $N(0, I)$.
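The equivalence between $\mu_i + A_i\epsilon$ and $\mu_i + \sigma_i\odot\epsilon$ for a diagonal $A_i$ can be checked with plain Python. The values of mu and sigma below are made up, and matvec is a small helper written for this sketch.

```python
import random

def matvec(A, v):
    """Plain matrix-vector product for a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

rng = random.Random(0)
mu = [1.0, -2.0, 0.5]
sigma = [0.5, 2.0, 1.5]
J = len(mu)
# A = diag(sigma), so A A^T = diag(sigma^2) = Sigma
A = [[sigma[r] if r == c else 0.0 for c in range(J)] for r in range(J)]

def sample_z():
    eps = [rng.gauss(0.0, 1.0) for _ in range(J)]   # J independent N(0, 1) draws
    via_matrix = [m + a for m, a in zip(mu, matvec(A, eps))]
    via_elementwise = [m + s * e for m, s, e in zip(mu, sigma, eps)]
    return via_matrix, via_elementwise

draws = [sample_z() for _ in range(20_000)]
max_gap = max(abs(a - b) for zm, ze in draws for a, b in zip(zm, ze))

# The second coordinate should have mean near -2.0 and variance near 2.0 ** 2.
second = [zm[1] for zm, _ in draws]
mean_second = sum(second) / len(second)
var_second = sum((v - mean_second) ** 2 for v in second) / len(second)
print(max_gap, mean_second, var_second)
```

For a covariance that is not diagonal, $A$ would typically be a Cholesky factor of $\Sigma$ and the element-wise shortcut would no longer apply.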

[24] This is in fact the expected result, because training pushes $q_\phi(z|x)$ toward $p(z)=N(z;~0, I)$. This shows up in the loss function as the Regularization Error.
