flatfish_selfish.log

Let's learn about Dask.

Thu, 01 Feb 2024 18:17:02 GMT

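I first came across the Dask library while taking a Fast Campus course.

Dask is a Python library for parallel computing, useful when complex computations over large datasets need to run quickly. PySpark, the Python API for Apache Spark, is also widely used, but it is awkward to work with when you are used to pandas, and Dask can be a solution to that problem. Dask integrates closely with the Python stack, including NumPy and pandas, so it offers a familiar interface even for large datasets.

Dask has two main components: dynamic task scheduling and data collections.

Dynamic task scheduling : splits a complex job into small tasks and schedules them to run concurrently
Data collections : provide data structures such as Dask Array, Dask DataFrame, and Dask Bag for processing large datasets efficiently

The code makes this easier to see.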

import time

import dask.dataframe as dd
import numpy as np
import pandas as pd

def create_datasets(nrows: int, ncols: int) -> tuple[pd.DataFrame, pd.DataFrame]:
  main_data = {f"col_{i}": np.random.rand(nrows) for i in range(ncols)}
  ref_data = {f"col_{i}": np.random.rand(nrows//10) for i in range(ncols)}
  main_df = pd.DataFrame(main_data)
  ref_df = pd.DataFrame(ref_data)
  return main_df, ref_df

The code above generates random datasets.

def pandas_operations(main_df: pd.DataFrame, ref_df: pd.DataFrame) -> tuple[float, float]:
  start_time_agg = time.time()
  grouped = main_df.groupby("col_0").mean()
  end_time_agg = time.time()

  start_time_join = time.time()
  joined = main_df.merge(ref_df, on="col_0", how="left")
  end_time_join = time.time()

  return end_time_agg - start_time_agg, end_time_join - start_time_join

def dask_operations(
    main_df: pd.DataFrame, ref_df: pd.DataFrame, npartitions: int) -> tuple[float,float]:
    dmain_df = dd.from_pandas(main_df, npartitions=npartitions)
    dref_df = dd.from_pandas(ref_df, npartitions=npartitions)

    start_time_agg = time.time()
    grouped_task = dmain_df.groupby("col_0").mean()
    grouped = grouped_task.compute()  # nothing runs until compute() is called
    end_time_agg = time.time()
    grouped_task.visualize("grouped.svg")

    start_time_join = time.time()
    joined_task = dmain_df.merge(dref_df, on="col_0", how="left")
    joined = joined_task.compute()
    end_time_join= time.time()
    joined_task.visualize("joined.svg")

    return end_time_agg - start_time_agg, end_time_join - start_time_join

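The first function uses pandas to time a group-by mean and a left join; the second runs the same operations in parallel with Dask.

The measured times were as follows.

Pandas aggregation time: 7.414660930633545 s
Pandas join time: 10.029629230499268 s
Dask aggregation time: 27.535321712493896 s
Dask join time: 7.579635143280029 s

Dask probably took longer on the aggregation because computing a mean involves only simple operations such as addition and division, so splitting the work costs more than it saves. It likely also matters that "col_0" holds random floats, so almost every row forms its own group and there is little to aggregate within each partition.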
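One way to see why this group-by is hard on any engine is to count the groups. The sketch below uses only the standard library and stands in for the benchmark's key column; `count_groups` and the bucketed key are hypothetical, added purely for comparison.

```python
import random
from collections import defaultdict


def count_groups(keys):
    """Group row indices by key and return the number of distinct groups."""
    groups = defaultdict(list)
    for row, key in enumerate(keys):
        groups[key].append(row)
    return len(groups)


random.seed(0)
nrows = 100_000

# Uniform random floats, playing the role of np.random.rand(nrows) in col_0.
float_keys = [random.random() for _ in range(nrows)]
# Hypothetical low-cardinality key: the same values bucketed into 10 bins.
bucket_keys = [int(k * 10) for k in float_keys]

n_float_groups = count_groups(float_keys)
n_bucket_groups = count_groups(bucket_keys)
print(n_float_groups, n_bucket_groups)
```

With random float keys the number of groups is essentially the number of rows, so a group-by mean cannot shrink the data; a key with few distinct values is the case where partition-wise aggregation has something to combine.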

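Since Dask is supposed to be faster but this example came out slower, I looked for a simple example where it actually wins, and ended up trying a much larger dataset.

Below is crime data for the city of Chicago from 2001 to the present. https://catalog.data.gov/dataset/crimes-2001-to-present-398a4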

import dask.dataframe as dd
from datetime import datetime
import time

start_time = datetime.now()
st = time.time()
print("start_time : {}".format(start_time))

ddf = dd.read_csv("crime.csv", dtype=str, on_bad_lines='skip')  # Updated for future-proofing based on warning
print(ddf.head())

ddf = ddf.dropna()

# Correctly process 'Location' to extract 'lat' and 'lon'
ddf['Location'] = ddf['Location'].str.replace(r"[()]", "", regex=True).str.split(',')  # Use regex to remove parentheses

# Assign 'lat' and 'lon' from the split 'Location'
ddf['lat'] = ddf['Location'].map(lambda x: x[0] if x else None).astype('float64')
ddf['lon'] = ddf['Location'].map(lambda x: x[1] if x else None).astype('float64')

print(ddf.head(30))

# Group by 'Date' and aggregate 'lat'
agg_ddf = ddf.groupby(['Date']).agg({"lat": ['mean', 'sum', 'count']})
print(agg_ddf.head(30))

# Count rows to check dataframe size
print(ddf.count().compute())

end_time = datetime.now()
et = time.time()
print("running time : {} seconds".format(et - st))
print("end_time : {}".format(end_time))

That was the Dask version; the pandas version follows.

import pandas as pd
from datetime import datetime

start_time = datetime.now()
print("start_time : {}".format(start_time))

# Read CSV using Pandas
df = pd.read_csv("crime.csv", dtype=str, na_filter=True)
print(df.head())

# Drop rows with any NA values
df = df.dropna()

# Process 'Location' to extract 'lat' and 'lon'
df['Location'] = df['Location'].str.replace(r"[()]", "", regex=True).str.split(',')
df['lat'] = df['Location'].apply(lambda x: float(x[0]) if x else None)
df['lon'] = df['Location'].apply(lambda x: float(x[1]) if x else None)

print(df.head(30))

# Group by 'Date' and aggregate 'lat'
agg_df = df.groupby('Date')['lat'].agg(['mean', 'sum', 'count'])
print(agg_df.head(30))

# Count rows to check dataframe size
print(df.count())

end_time = datetime.now()
print("running time : {} seconds".format((end_time - start_time).total_seconds()))
print("end_time : {}".format(end_time))

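That was the version without Dask. Surprisingly, the Dask version took 230 seconds while the version without Dask took 123 seconds.

The size of the dataset aside, my guess is that the operations themselves are too simple for Dask to pay off. I'll probably get plenty of chances to try again... One thing is certain: it is so similar to pandas that it is really easy to use.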

Back Propagation

Sat, 06 Jan 2024 14:38:25 GMT

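We have been looking at how an AI model learns using MNIST, and last time we covered the feed-forward pass. As an MNIST handwritten-digit image goes through the feed-forward pass, the model outputs a prediction for the digit. At that stage, however, the weights and biases are only used in the computation, so the model is still far too weak to guess the digit correctly. These weights and biases need to be updated, and the core of that update is today's topic: back propagation.

First, the loss. The loss is the difference between the feed-forward output and the given correct answer. The way this difference, or error, is computed is expressed as a function, called the loss function. Put simply, it quantifies how wrong the model's prediction was, and the model adjusts its weights and biases to minimize this value. This process is what we mean when we say that 'the model learns', and back propagation plays a central role in it.

So how does back propagation adjust the weights and biases? Think of tracing the loss back through each layer and each neuron. Ultimately we have to adjust the values used by the model in the direction that makes the loss smaller. If passing through each neuron is a function, then the computation that reaches the output can be written as a composition of many functions. In short, back propagation is the process of differentiating this composite function with the chain rule. Doing so tells us how a change at each neuron affects the output. Take $$\frac{\partial L}{\partial W_{3,1}}$$ as an example. It expresses how the loss $$L$$ changes as the weight $$W_{3,1}$$ changes, and it is called a gradient. In other words, as back propagation passes through each neuron, the gradient tells us in which direction and by how much a change is needed. These partial derivatives are used to adjust the weights in the direction that reduces the loss, and the size of a weight update is generally the learning rate multiplied by the partial derivative. With each iteration the weights and biases are adjusted a little more, so that the network's predictions move closer to the true values.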
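As a minimal sketch of the idea, the following applies the chain rule by hand to a tiny network with one input, one ReLU hidden unit, and one output, then takes a single gradient-descent step. The network shape, the squared-error loss, and all numeric values are assumptions chosen for illustration, not taken from the post.

```python
def forward(params, x):
    """Forward pass: x -> z = w1*x + b1 -> h = ReLU(z) -> y = w2*h + b2."""
    w1, b1, w2, b2 = params
    z = w1 * x + b1
    h = max(0.0, z)
    y = w2 * h + b2
    return z, h, y


def loss(params, x, target):
    """Squared-error loss L = 0.5 * (y - target)^2 (an assumed loss function)."""
    _, _, y = forward(params, x)
    return 0.5 * (y - target) ** 2


def backward(params, x, target):
    """Chain rule from the loss back to each parameter."""
    w1, b1, w2, b2 = params
    z, h, y = forward(params, x)
    dL_dy = y - target                         # dL/dy
    dL_dw2 = dL_dy * h                         # dy/dw2 = h
    dL_db2 = dL_dy                             # dy/db2 = 1
    dL_dh = dL_dy * w2                         # dy/dh = w2
    dL_dz = dL_dh * (1.0 if z > 0 else 0.0)    # derivative of ReLU
    dL_dw1 = dL_dz * x                         # dz/dw1 = x
    dL_db1 = dL_dz                             # dz/db1 = 1
    return [dL_dw1, dL_db1, dL_dw2, dL_db2]


params = [0.5, 0.1, -0.3, 0.2]   # w1, b1, w2, b2 (arbitrary starting values)
x, target = 2.0, 1.0
learning_rate = 0.1

grads = backward(params, x, target)
loss_before = loss(params, x, target)
# Update rule: parameter <- parameter - learning_rate * gradient
params_new = [p - learning_rate * g for p, g in zip(params, grads)]
loss_after = loss(params_new, x, target)
print(loss_before, loss_after)
```

Each gradient is a product of local derivatives along the path from the loss to the parameter, which is exactly the composite-function view described above; after one update the loss on this example is smaller than before.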

Understanding Deep Learning Model Training with MNIST

Sat, 30 Dec 2023 14:34:21 GMT

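What is MNIST?

MNIST is a large database of handwritten digits. The name stands for 'Modified National Institute of Standards and Technology', and it is a benchmark dataset widely used in computer vision, image processing, and machine learning.

The MNIST dataset consists of 70,000 grayscale images of handwritten digits from 0 to 9. Each image is 28 x 28 pixels, and each pixel has a grayscale value from 0 (black) to 255 (white). The data is split into a training set of 60,000 images and a test set of 10,000 images, and each image carries a label with the digit it shows. For example, an image of the digit 3 has the label 3.

Let's use the MNIST classification task to explain the feed-forward pass.

The MNIST classification task asks a deep learning model to decide which digit from 0 to 9 a given digit image shows.

It can be broken down into four parts.

Input layer
Hidden layer
Output layer
Activation & classification

Input layer

A person can look at a digit image and recognize the digit right away, but a deep learning model cannot. The image has to be converted into a set of pixel values, which means turning the 28 x 28 matrix of numbers into a 784 x 1 column of numbers. Each pixel is called a unit, or neuron, so for a single MNIST image the input layer consists of 784 neurons.

Output layer

The final goal of the classification task is to decide which digit from 0 to 9 the given image shows. The output layer therefore has 10 units, one for each digit from 0 to 9.

Hidden layer

The hidden layer connects the input layer to the output layer: it multiplies the input-layer data by weights, adds biases, and passes the result on to the output layer.

In the figure above, layer 1 is the hidden layer, and the values computed in the hidden layer can be written as follows, going from the top input neuron to the bottom one.

$$i_1w_{1_{1}}+b_{1_{1}}$$ $$i_1w_{1_{2}}+b_{1_{2}}$$ $$i_2w_{2_{1}}+b_{2_{1}}$$ $$i_2w_{2_{2}}+b_{2_{2}}$$

Activation function

In the figure, you can see the symbol f at the output of layer 1 and at the end of the output layer. This is the activation function, and ReLU is a representative example. If the value computed at a hidden-layer neuron is less than 0, ReLU turns it into 0 before passing it to the next layer, the output layer. If the value is greater than or equal to 0, it is passed on unchanged. Activation functions matter because they introduce non-linearity into a deep learning model, which lets the model capture more complex patterns and relationships in the data. For more on activation functions, see the link below. https://stipplelabs.medium.com/the-power-of-relu-and-its-variants-4c8f57079e29

MNIST network structure

Putting the above together, the structure looks like the one below. A digit image is represented by 784 input-layer units. Each input value is processed by the 500 (an arbitrary number) units of the hidden layer, and the results are passed to the output layer, where the weighted sums are computed. This whole process is called feed-forward.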
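The 784 → 500 → 10 feed-forward pass described above can be sketched in plain Python. This is only an illustration under assumptions: the weights are random rather than trained, the image is random noise standing in for an MNIST digit, pixel values are scaled to [0, 1], and the output layer is left without an activation.

```python
import random


def relu(values):
    """ReLU: negative values become 0, everything else passes through."""
    return [max(0.0, v) for v in values]


def dense(inputs, weights, biases):
    """Weighted sum plus bias; weights[j][i] connects input i to unit j."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, biases)
    ]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (untrained, for illustration)."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    biases = [0.0] * n_out
    return weights, biases


rng = random.Random(0)
w_hidden, b_hidden = init_layer(784, 500, rng)
w_out, b_out = init_layer(500, 10, rng)

# Stand-in for one 28 x 28 grayscale image with values 0..255.
image = [[rng.randint(0, 255) for _ in range(28)] for _ in range(28)]
# Input layer: flatten 28 x 28 into 784 values (scaled to [0, 1] by assumption).
pixels = [value / 255.0 for row in image for value in row]

hidden = relu(dense(pixels, w_hidden, b_hidden))   # hidden layer, 500 units
scores = dense(hidden, w_out, b_out)               # output layer, 10 units
predicted_digit = max(range(10), key=lambda d: scores[d])
print(predicted_digit)
```

With untrained weights the predicted digit is meaningless; the point is only the shape of the computation, which is what training later adjusts.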

Backward propagation and optimization will be covered next time.

DragGAN : Interactive Point-based Manipulation on the Generative Image Manifold

Wed, 31 May 2023 06:25:51 GMT

1. Introduction

DragGAN lets a user edit a GAN-generated image by dragging. The handle points to be moved are marked in red and the corresponding target points in blue; the image changes within the region marked as flexible (the brighter area), and the rest stays unchanged.

There have been steady attempts to control GANs before. Most relied on supervised learning with existing 3D models or manually annotated data, so they generalize poorly to new object categories and the spatial attributes that can be edited are limited. More recently, text-guided image synthesis has drawn attention, but it lacks precision and flexibility when editing spatial attributes.

The paper proposes interactive point-based manipulation as a way past these limitations, and presents DragGAN as a model that implements it.

2. Method

Suppose an image $\mathbf{I}\in \mathbb{R}^{3\times H \times W}$ is generated by a GAN from a latent code w. As input, the user specifies n handle points { $p_i=(x_{p,i},y_{p,i})|i=1,2,...,n$ } and n target points { $t_i=(x_{t,i},y_{t,i})|i=1,2,...,n$ }.

Image manipulation can be understood as performing an optimization, and each optimization step splits into two sub-steps. The first is motion supervision, where a loss that pushes the handle points towards the target points is used to optimize the latent code w. During this optimization the object moves slightly.

However, how far things move differs by object and by position, so motion supervision does not tell us exactly where the handle points end up. The handle point positions therefore need to be updated afterwards, which is point tracking.

The figure below shows how DragGAN works: going from the first image on the left to the second is motion supervision, and going from the second to the third is point tracking.

2-1. Motion Supervision

Supervising the movement of points in a GAN-generated image has not been studied much. The paper proposes a motion supervision loss that does not rely on any additional network.

The key to motion supervision is that the intermediate features of the GAN generator are highly discriminative, which can be read as: an edit produces a clearly observable change.

The reason can be seen in the figure below: compared with using the input latent code W, using $W^+$, obtained after passing through the layers, makes it easier to apply 'out-of-distribution' changes.

Experiments showed that the feature map after the 6th block (layer) of StyleGAN2 gives the best trade-off between resolution and discriminativeness.

DragGAN supervises small patches around each handle point to move towards the target point, and the motion supervision loss $\mathcal{L}$ used for this is as follows.

$\mathcal{L}=\sum_{i=0}^{n}\sum_{q_i\in\Omega_1(p_i,r_1)}\|\mathbf{F}(q_i)-\mathbf{F}(q_i+d_i)\|_1 + \lambda\|(\mathbf{F}-\mathbf{F}_0)\cdot(1-\mathbf{M})\|_1$

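Let's look at each term.

| Term | Definition |
| --- | --- |
| n | number of handle points |
| $\Omega_1(p_i,r_1)$ | pixels whose distance to $p_i$ is less than $r_1$ |
| $\mathbf{F}(q_i)$ | feature value at pixel $q_i$ |
| $d_i$ | normalized vector pointing from $p_i$ to $t_i$ |
| 1-M | the unmasked region, i.e. the part that must stay unchanged |
| $\mathbf{F}-\mathbf{F}_0$ | difference from the initial feature map $\mathbf{F}_0$ (reconstruction loss) |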
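Two of the ingredients defined above, the patch $\Omega_1(p_i,r_1)$ and the direction $d_i$, are easy to compute directly. The sketch below is an illustration only; `circular_patch` and `unit_direction` are hypothetical helper names, pixels are integer (x, y) pairs, and evaluating the features at the shifted positions $q_i+d_i$ is left out.

```python
import math


def unit_direction(handle, target):
    """d_i: the vector from the handle point to the target point, normalized."""
    dx, dy = target[0] - handle[0], target[1] - handle[1]
    norm = math.hypot(dx, dy)
    if norm == 0.0:
        return (0.0, 0.0)   # handle already sits on the target
    return (dx / norm, dy / norm)


def circular_patch(center, radius):
    """Omega_1: integer pixels whose distance to the center is less than radius."""
    cx, cy = center
    return [
        (x, y)
        for y in range(cy - radius, cy + radius + 1)
        for x in range(cx - radius, cx + radius + 1)
        if math.hypot(x - cx, y - cy) < radius
    ]


handle, target = (10, 10), (13, 14)
d = unit_direction(handle, target)
patch = circular_patch(handle, 3)
# Each patch pixel q is supervised towards the shifted position q + d.
shifted = [(x + d[0], y + d[1]) for x, y in patch]
print(d, len(patch))
```

Because $d_i$ has unit length, the shifted positions fall between pixels, so a real implementation needs interpolated feature lookups at those positions.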

2-2. Point Tracking

After motion supervision has moved the object, we need to find exactly where each handle point has moved to. Point tracking is commonly done with optical flow estimation models or particle video approaches, but as mentioned earlier, using an additional model not only hurts efficiency but also lets errors accumulate. This can be critical for GANs, which tend to produce alias artifacts, i.e. distorted images.

The paper approaches point tracking as a nearest-neighbour search within a feature patch. Let the initial handle point be $p_i$ and its feature in the initial feature map be $f_i=\mathbf{F}_0(p_i)$. The patch around $p_i$ can then be defined as follows. $$ \Omega_2(p_i,r_2)=\{(x,y)\mid |x-x_{p,i}|<r_2,\ |y-y_{p,i}|<r_2\} $$

$p_i$ is then updated to the nearest neighbour of $f_i$ within $\Omega_2(p_i,r_2)$.
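The nearest-neighbour update can be sketched as a brute-force search over a square patch. This is an illustration under assumptions: `track_point` is a hypothetical helper, the feature map is a nested list indexed as `feature_map[y][x]`, distances are L1, and the patch uses the strict bound $|x-x_{p,i}|<r_2$, $|y-y_{p,i}|<r_2$.

```python
def l1_distance(a, b):
    """L1 distance between two feature vectors."""
    return sum(abs(u - v) for u, v in zip(a, b))


def track_point(feature_map, handle, f_init, r2):
    """Return the pixel in the square patch around `handle` whose feature is
    closest to `f_init`, the handle's feature in the initial feature map."""
    height, width = len(feature_map), len(feature_map[0])
    px, py = handle
    best, best_dist = handle, float("inf")
    # Strict bound |x - px| < r2, |y - py| < r2, clipped to the image.
    for y in range(max(0, py - r2 + 1), min(height, py + r2)):
        for x in range(max(0, px - r2 + 1), min(width, px + r2)):
            dist = l1_distance(feature_map[y][x], f_init)
            if dist < best_dist:
                best, best_dist = (x, y), dist
    return best


size = 8
# Toy initial feature map: every pixel has a unique 2-D feature.
initial_map = [[[float(x), float(y)] for x in range(size)] for y in range(size)]
handle = (2, 2)
f_init = initial_map[handle[1]][handle[0]]

# Toy "edited" feature map: all content has shifted one pixel to the right.
edited_map = [[[float(x - 1), float(y)] for x in range(size)] for y in range(size)]

new_handle = track_point(edited_map, handle, f_init, r2=3)
print(new_handle)
```

In the toy example the content moved one pixel to the right, so the search should relocate the handle one pixel to the right as well.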

3. Implementation Details

$r_1$ and $r_2$, the distances used in motion supervision and point tracking, are hyperparameters; the experiments used 3 and 12 respectively, and $\lambda$ in the motion supervision loss was set to 20.

Optimization ran until every handle point was within d pixels of its target point.

4. Discussions

4-1. Effects of Mask

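Without a mask, in the dog image above the whole body moves rather than just the head. This means point-based manipulation can have multiple valid outcomes, and the GAN tries to find the closest solution on the image manifold. The mask therefore not only keeps a specific part of the image fixed but also plays an important role in reducing ambiguity for the GAN.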

4-2. Out-of-Distribution Manipulation

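Another strength of DragGAN is that it can also produce manipulations that lie outside the distribution of the training images.

*If the user always wants the generated image to stay within the training distribution, adding extra regularization to the latent code w is mentioned as a possible approach.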

4-3. Limitations

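One limitation is that the quality of the edit still depends on the diversity of the training data. As in the figure above, pushing a human pose outside the training distribution produces artifacts. In addition, tracking is difficult when a handle point is placed in a region that lacks texture.

Finally, the paper raises social and ethical concerns. Because DragGAN can change the spatial attributes of images, it could be misused to show a real person in a fake pose or with a fake expression.

Personal thoughts...

The hype seems almost on the level of ChatGPT. The GitHub repo has nothing but a README and a GIF of the UI, yet it already has ten thousand stars. Someone seems to have implemented it too, but I don't know whether that is the official code.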

Generative Modeling by Estimating Gradients of the Data Distribution

Mon, 22 May 2023 23:09:56 GMT

I've been writing in English so far, but since I'm Korean I'll write in Korean from now on.

1. Introduction

Existing generative models are either likelihood-based, like VAEs, or trained adversarially, like GANs.

1-1. Limitations of previous generative models

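These generative models perform reasonably well, but they have the following limitations.

likelihood-based model

They need either a specialized architecture to normalize a complex probabilistic model, or
a surrogate such as the ELBO, because the loss cannot be computed directly.

GAN

The balance between generator and discriminator has to be maintained at all times, so training can be unstable, and
it is hard to compare and evaluate one GAN model against another.

Other Objectives

They mostly work well only on low-dimensional data.

To overcome these limitations, the paper brings score matching into gradient estimation and sampling, and generates new samples with Langevin dynamics.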

1-2. What is Score?

Before score matching and Langevin dynamics, we need to go over what a score is.

The ultimate goal of a generative model is to learn the PDF (probability density function) of the given data. Once the PDF is learned, we know how the data is distributed and can generate new samples from it. The PDF of the given data is written $p_{data}(x)$.

The score (or score function) is the gradient of the log-PDF with respect to the input variable. Put simply, it is a guide that teaches the generative model how to get from a simple distribution back to the original, complex one. The score is written $\nabla_x\log{p_{data}(x)}$.

2. Score-based generative modeling

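2-1. Score matching for score estimation

Let $s_{\theta}$ be a neural network parametrized by $\theta$ that estimates the score of $p_{data}(x)$. Its objective is to minimize $\frac{1}{2}\mathbb{E}_{p_{data}(x)}[||s_{\theta}(x)-\nabla_{x}\log{p_{data}(x)}||^2_2]$.

2-2. Denoising Score Matching

As earlier papers have noted, however, obtaining $p_{data}(x)$ directly is hard. Score matching gets around this: up to a constant, the objective in 2-1 is equivalent to the expression below, which lets $s_{\theta}$ be trained directly without estimating $p_{data}(x)$. Strictly speaking this is the original score matching form; denoising score matching, which the paper uses in practice, avoids the expensive trace term by perturbing the data with noise and matching the score of the perturbed distribution.

$$ \mathbb{E}_{p_{data}(x)}[tr(\nabla_x s_{\theta}(x))+\frac{1}{2}||s_{\theta}(x)||^{2}_2] $$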

Let's break down each term.

$\nabla_x s_{\theta}(x)$ : Jacobian matrix of the score $s_{\theta}(x)$

- describes how the score changes with respect to the data $x$

$tr(...)$ : trace of the Jacobian matrix, i.e. the sum of its diagonal entries

- the divergence of the score
- indicates how much the score field expands or contracts at each point of the data space

$||s_{\theta}(x)||^{2}_2$ : squared L2 norm of the score

- the magnitude of the score at each point of the data space

$\mathbb{E}_{p_{data}(x)}[...]$ : average over all data points
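To make the terms concrete, here is a one-dimensional sketch of the trace-plus-norm objective. It assumes a hypothetical score model $s_{\theta}(x) = -(x-m)/v$ (the score of a Gaussian with mean $m$ and variance $v$), for which the Jacobian is the single number $-1/v$, and it replaces the expectation with an average over samples.

```python
import random


def score_model(x, mean, variance):
    """Score of a 1-D Gaussian model: d/dx log N(x; mean, variance)."""
    return -(x - mean) / variance


def score_matching_objective(samples, mean, variance):
    """Sample average of tr(grad_x s(x)) + 0.5 * ||s(x)||^2 in one dimension."""
    trace_term = -1.0 / variance   # derivative of score_model w.r.t. x
    total = 0.0
    for x in samples:
        total += trace_term + 0.5 * score_model(x, mean, variance) ** 2
    return total / len(samples)


rng = random.Random(0)
samples = [rng.gauss(1.0, 2.0) for _ in range(5000)]   # data from N(1, 2^2)

data_mean = sum(samples) / len(samples)
data_var = sum((x - data_mean) ** 2 for x in samples) / len(samples)

best = score_matching_objective(samples, data_mean, data_var)
shifted = score_matching_objective(samples, data_mean + 0.5, data_var)
too_wide = score_matching_objective(samples, data_mean, 2.0 * data_var)
too_narrow = score_matching_objective(samples, data_mean, 0.5 * data_var)
print(best, shifted, too_wide, too_narrow)
```

The objective never touches the data density itself, only the model score and its derivative, yet it is lowest when the model's mean and variance match those of the data.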

2-3. Langevin Dynamics

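While denoising score matching is used to train the model, Langevin dynamics is used to generate samples, and it is notable for requiring nothing from p(x) other than its score. Langevin dynamics samples by repeatedly adding the score, as in the equation below.

$$ \tilde{x}_t=\tilde{x}_{t-1}+\frac{\epsilon}{2}\nabla_x\log{p(\tilde{x}_{t-1})}+\sqrt{\epsilon}z_t $$

The step size is fixed at $\epsilon$ and the initial value is drawn as $\tilde{x}_0 \sim\pi(x)$, where $\pi$ is a prior distribution and $z_t \sim \mathcal{N}(0, I)$. The prior is the simplest distribution, i.e. fully perturbed data; starting from it, the equation above is iterated to recover the data step by step and work back to the original complex distribution.

Langevin dynamics is apparently a mathematical model of the motion of molecular systems; knowing that the idea is borrowed from there was enough for me to follow the paper.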
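The update rule can be tried on a distribution whose score is known in closed form. The sketch below is only an illustration: the target $p(x)=\mathcal{N}(3,1)$, the uniform prior $\pi$, the step size, and the number of steps are all assumptions, and the exact score is used in place of a trained network $s_{\theta}$.

```python
import math
import random

TARGET_MEAN = 3.0   # hypothetical target distribution p(x) = N(3, 1)


def score(x):
    """Exact score of N(TARGET_MEAN, 1): d/dx log p(x) = -(x - mean)."""
    return -(x - TARGET_MEAN)


def langevin_sample(rng, step_size=0.05, n_steps=500):
    """Run x_t = x_{t-1} + (eps / 2) * score(x_{t-1}) + sqrt(eps) * z_t."""
    x = rng.uniform(-10.0, 10.0)   # x_0 ~ pi(x), an assumed uniform prior
    for _ in range(n_steps):
        z = rng.gauss(0.0, 1.0)
        x = x + 0.5 * step_size * score(x) + math.sqrt(step_size) * z
    return x


rng = random.Random(0)
samples = [langevin_sample(rng) for _ in range(1000)]
sample_mean = sum(samples) / len(samples)
sample_var = sum((s - sample_mean) ** 2 for s in samples) / len(samples)
print(sample_mean, sample_var)
```

Although every chain starts far from the target, the samples end up with a mean near 3 and a variance near 1, using nothing but the score. With a finite step size the stationary variance is slightly larger than the target's.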

3. Challenges

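Score-based generative modeling had to overcome the following problems.

3-1. The manifold hypothesis

Manifold?

High-dimensional data has some pattern or structure and forms a lower-dimensional manifold.

Manifold Hypothesis

Real-world data tends to concentrate on a low-dimensional manifold embedded in a high-dimensional space. The paper points out two difficulties that follow from the manifold hypothesis:

The score is a gradient taken in the high-dimensional ambient space, so it is undefined when x is confined to a low-dimensional manifold.
When the data lies on a manifold, the score matching objective yields inconsistent, unstable score estimates.

To address this, a very small amount of noise (imperceptible to the eye) is added to the data.

Adding noise to data on the manifold means it is no longer confined to the manifold, and
because the original data is barely damaged, learning its features is not affected.

3-2. Low Data Density Regions

In regions where the data density is low, both score estimation and sampling with Langevin dynamics can be difficult.

4. Noise Conditional Score Networks: learning and inference

To solve the problems raised in section 3:

The data is perturbed with noise at several different levels.
A single score network is trained to estimate the scores for all noise levels simultaneously.

Once training with steps 1 and 2 is done, samples are drawn with Langevin dynamics.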

4-1. Training

4-2. Inference

World-GAN : a Generative Model for Minecraft Worlds

Thu, 18 May 2023 12:12:12 GMT

Motivation

After taking a leave of absence from university, I recently got into playing Minecraft. It brought back childhood memories, and at the same time I realized how many features have been changed and updated. While enjoying life in a cubic 3D world, I wondered what a Minecraft world would look like if it were generated by a generative model.

Having read about GANs, I searched for generative models related to the game and found an interesting one : World-GAN.

World-GAN

What is World-GAN?

World-GAN is a generative model for Minecraft worlds. It performs procedural content generation via machine learning (PCGML) in Minecraft from a single example. It uses the block2vec representation, inspired by word2vec and dense representations in NLP. With block2vec, World-GAN can generate large worlds based on parts of users' creations.

Related Works

To understand how World-GAN works, it helps to read the following earlier works :

1. SinGAN (Shaham, Dekel, and Michaeli 2019)

GAN architecture that learns from a single natural image
Cascade of fully convolutional, patch-based generators and discriminators at multiple scales

2. TOAD-GAN (M. Awiszus, F. Schubert, and B. Rosenhahn 2020)

replaces the bilinear downsampling in SinGAN with a special downsampling operation
determines the importance of each token using a hierarchy constructed by a heuristic, motivated by the TF-IDF metric from NLP
applied to 2D token-based games like Super Mario

Problem Scenarios

The main problems with applying TOAD-GAN directly to world generation can be summarized in two parts.

The move from 2D to 3D dramatically increases the size of the samples.
Minecraft has a large variety of tokens with a long-tailed distribution
- low-frequency tokens can be aliased away, even though they are significant and should not be ignored

block2vec

In order to resolve the above problems, the paper suggests a new token embedding method called block2vec.

Say there is a token $b_i$ in a given training sample, and let $f(b_i)$ be the frequency of that token. Then the occurrence probability of the token can be written as :

$P(b_i)=\sqrt{\frac{f(b_i)}{0.001}+1}\times \frac{0.001}{f(b_i)}$

Sampling the tokens according to $P(b_i)$ mitigates the issue of token imbalance.
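A quick numerical check shows how the formula above rebalances a long-tailed token distribution. The block names and counts below are made up for illustration, and clipping values above 1 is an assumption about how the result is used as a probability.

```python
import math
from collections import Counter


def sampling_probability(frequency, threshold=0.001):
    """P(b_i) as written above, with f(b_i) given as a relative frequency."""
    raw = math.sqrt(frequency / threshold + 1.0) * threshold / frequency
    return min(1.0, raw)   # assumption: clip values above 1


# Hypothetical training sample dominated by a few common blocks.
tokens = ["air"] * 900 + ["stone"] * 99 + ["diamond_ore"]
counts = Counter(tokens)
total = sum(counts.values())

probabilities = {
    token: sampling_probability(count / total) for token, count in counts.items()
}
for token, probability in probabilities.items():
    print(token, round(probability, 4))
```

Common tokens get a small sampling probability and rare tokens a large one, so the rare but significant blocks are not drowned out during training.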

Three advantages of using block2vec

Reduced memory requirements
Omit definition of hierarchy
- visualized token embeddings (dimension reduced to 2 by MDE technique)
- rare tokens are placed close to semantically similar more common tokens
Choosing a different mapping from internal representations to tokens allows us to change the style of the generated content after training

Training

skip-gram model with two linear layers predicting context from the target token
Generator produces $m\times D\times H\times W$ tensor
Tensor fed to the discriminator

Experiment Results

Qualitative

Quantitative

![](https://velog.velcdn.com/images/flatfish_selfish/post/7f9025c5-f4fa-4446-bec0-f9602f18f436/image.png)

Generative Adversarial Nets

Thu, 18 May 2023 04:55:06 GMT

Adversarial nets

Generator : generates samples by passing random noise through a multilayer perceptron
Discriminator : learns to determine whether a sample is from the model distribution or the data distribution

Training

Value function $V(D,G)$

D and G play a two-player minimax game with value function :

$\min_G \max_D V(D, G) = \mathbb{E}_{\boldsymbol{x} \sim p_{data}(\boldsymbol{x})}[\log D(\boldsymbol{x})] + \mathbb{E}_{\boldsymbol{z} \sim p_{z}(\boldsymbol{z})}[\log(1 - D(G(\boldsymbol{z})))]$

Discriminator $D$ tries to make $D(\boldsymbol{x})=1$ and $D(G(\boldsymbol{z}))=0$
- classify real data as 1 and fake data as 0
Generator $G$ tries to make $D(G(\boldsymbol{z}))=1$
- "deceive" the discriminator into classifying fake data as 1

Algorithm of Training GAN

optimizing $D$ to completion on finite datasets is prohibitive
- computationally expensive
- leads to overfitting
- instead, alternate k steps of optimizing $D$ with one step of optimizing $G$
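The value function from the Training section can be estimated from a batch of discriminator outputs. The sketch below is an illustration with made-up numbers rather than a trained model; `value_function` is a hypothetical helper that averages the two log terms over the batch.

```python
import math


def value_function(d_real, d_fake):
    """Batch estimate of V(D, G).

    d_real: discriminator outputs D(x) on real samples
    d_fake: discriminator outputs D(G(z)) on generated samples
    """
    real_term = sum(math.log(d) for d in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - d) for d in d_fake) / len(d_fake)
    return real_term + fake_term


# A discriminator that separates real from fake well: V is close to 0.
confident_d = value_function([0.9, 0.95, 0.99], [0.1, 0.05, 0.01])
# A discriminator that cannot tell them apart and outputs 0.5 everywhere.
fooled_d = value_function([0.5, 0.5, 0.5], [0.5, 0.5, 0.5])

print(confident_d, fooled_d, -math.log(4))
```

The discriminator's updates push V up towards 0, while the generator's updates push it down; when D outputs 0.5 everywhere, V equals $-\log 4$.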

Experiments

trained on MNIST, the Toronto Face Database (TFD), and CIFAR-10
generator : mixture of rectifier linear activations and sigmoid activations
discriminator
- maxout activations
- dropout applied when training
noise z is fed only to the bottommost layer of the generator

Advantages and Disadvantages

Advantages

Generator can be updated without data examples, and only with discriminator's gradient flow
can represent sharp, even degenerate distributions

Disadvantages

no explicit representation of $p_g(x)$
D should be well synced with the generator, or else can lead to mode collapse

Auto-Encoding Variational Bayes

Mon, 15 May 2023 05:22:33 GMT

1. Introduction

1-1 Difficulty of Mean-field Approach*

assumes that all variables of the model(data, latent, parameters) are independent (but they are not!)
simplifies the calculation, but can have poor approximations for complex models with dependent variables, which requires solving intractable expectations

*mean-field approach? 👉 commonly used method in VB for choosing the form of the approximate posterior

1-2 Auto-Encoding Variational Bayes (AEVB)

to overcome this intractability, the paper proposes a new algorithm called AEVB
it enables efficient, differentiable, and unbiased estimation of the variational lower bound via the Stochastic Gradient Variational Bayes (SGVB) estimator
it simplifies posterior inference and model learning, avoiding costly iterative schemes like MCMC

2. Method

2-1 Problem Scenario

Assumptions

value $z^{(i)}$ generated from prior distribution $p_{\theta^{*}}(z)$ <- prior
value $x^{(i)}$ generated from conditional distribution $p_{\theta^{*}}(x|z)$ <- likelihood
PDF of prior & likelihood distribution are differentiable almost everywhere w.r.t. $\theta$ and z
true parameters and latent variables are unknown
**do not simplify the marginal / posterior probabilities**

Main Contributions

Efficient approximate ML or MAP estimation for the parameters $\theta$
Efficient approximate posterior inference of the latent variable z given an observed value x for a choice of parameters $\theta$
Efficient approximate marginal inference of the variable x

2-2 Variational Bound

The marginal likelihood* of datapoint i can be written as $\log p_{\theta}(x^{(i)}) =D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z|x^{(i)}))+\mathcal{L}(\theta, \phi;x^{(i)})$
Since the KL divergence is always non-negative, $\mathcal{L}(\theta, \phi;x^{(i)})$ is a lower bound on $\log p_{\theta}(x^{(i)})$.
The lower bound on the marginal likelihood of datapoint i can be rewritten as : $\mathcal{L}(\theta,\phi;x^{(i)})=-D_{KL}(q_{\phi}(z|x^{(i)})||p_{\theta}(z))+\mathbb{E}_{q_{\phi}(z|x^{(i)})}[\log p_{\theta}(x^{(i)}|z)]$

The naive Monte Carlo gradient estimator of the lower bound w.r.t. $\phi$ exhibits high variance and is therefore impractical! ==> Importance of the SGVB estimator

*marginal likelihood (evidence)? 👉 the probability of the observed data under the model, with the latent variables integrated out

👉 Since the marginal likelihood is intractable to compute, we optimize the variational lower bound instead.

2-3 Reparametrization Trick

Let z be a continuous random variable, $z \sim q_{\phi}(z|x)$ be some conditional distribution.
Then z can be expressed as: $z=g_{\phi}(\epsilon, x)$, where $\epsilon$ is a random variable following simple, known distribution
$\int q_{\phi}(z|x)f(z)dz=\int p(\epsilon)f(z)d\epsilon=\int p(\epsilon)f(g_{\phi}(\epsilon,x))d\epsilon$
- $\int q_{\phi}(z|x)f(z)dz$ :
  - an expectation of a function f(z) under the distribution $q_{\phi}(z|x)$, represented by the integral of f(z) times the PDF of z ($q_{\phi}(z|x)$), over all possible values of z
- $\int p(\epsilon)f(z)d\epsilon$ :
  - changes the variable of integration from z, which follows $q_{\phi}(z|x)$, to $\epsilon$, which follows $p(\epsilon)$
  - Since $\epsilon$ does not depend on the parameters $\phi$, it is possible to compute the gradient of the expectation w.r.t. $\phi$
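The trick is easiest to see for a one-dimensional Gaussian, where $g_{\phi}(\epsilon, x)=\mu+\sigma\epsilon$ with $\epsilon\sim\mathcal{N}(0,1)$. The sketch below is an illustration under assumptions: $f(z)=z^2$ and the values of $\mu$ and $\sigma$ are arbitrary choices, picked because the exact answers ($\mathbb{E}[z^2]=\mu^2+\sigma^2$, with gradients $2\mu$ and $2\sigma$) are known.

```python
import random


def reparameterize(mu, sigma, eps):
    """z = g(eps) = mu + sigma * eps, with eps drawn from N(0, 1)."""
    return mu + sigma * eps


rng = random.Random(0)
mu, sigma = 1.5, 0.5
n_samples = 100_000

value_sum = grad_mu_sum = grad_sigma_sum = 0.0
for _ in range(n_samples):
    eps = rng.gauss(0.0, 1.0)           # randomness independent of (mu, sigma)
    z = reparameterize(mu, sigma, eps)
    value_sum += z ** 2                 # f(z) = z^2
    grad_mu_sum += 2.0 * z              # df/dz * dz/dmu, where dz/dmu = 1
    grad_sigma_sum += 2.0 * z * eps     # df/dz * dz/dsigma, where dz/dsigma = eps

expectation = value_sum / n_samples     # estimates mu^2 + sigma^2
grad_mu = grad_mu_sum / n_samples       # estimates 2 * mu
grad_sigma = grad_sigma_sum / n_samples # estimates 2 * sigma
print(expectation, grad_mu, grad_sigma)
```

Because the noise is sampled independently of the parameters, the derivative passes straight through the sampling step, which is what allows the expectation to be differentiated w.r.t. $\phi$.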

2-4 SGVB estimator and AEVB algorithm

After applying the reparametrization trick of section 2-3, the expectation of some function $f(z)$ w.r.t. $q_{\phi}(z|x)$ can be estimated as :

$\mathbb{E}_{q_{\phi}(z|x^{(i)})}[f(z)]=\mathbb{E}_{p(\epsilon)}[f(g_{\phi}(\epsilon,x^{(i)}))]\simeq \frac{1}{L}\sum^{L}_{l=1}f(g_{\phi}(\epsilon^{(l)},x^{(i)}))$
Applying this technique to the lower bound yields the generic Stochastic Gradient Variational Bayes estimator :

$\tilde{\mathcal{L}}^{A}(\theta,\phi;x^{(i)})=\frac{1}{L}\sum^{L}_{l=1}\log p_{\theta}(x^{(i)},z^{(i,l)})-\log q_{\phi}(z^{(i,l)}|x^{(i)})$

where $z^{(i,l)}=g_{\phi}(\epsilon^{(i,l)}, x^{(i)})$ and $\epsilon^{(l)}\sim p(\epsilon)$

Below is the AEVB algorithm that uses the above estimator.

3. Example : VAE

3-1 Variational approximate posterior

$\log q_{\phi}(z|x^{(i)})=\log \mathcal{N} (z;\mu^{(i)},\sigma^{2(i)}I)$

mean and s.d. of the approximate posterior, $\mu^{(i)}$ and $\sigma^{(i)}$ : outputs of the encoder
$\phi$ : variational parameters
the main goal of VAE is to find a good approximate posterior $q_{\phi}(z|x^{(i)})$ over the latent variables

3-2 Estimator for VAE and datapoint $x^{(i)}$

$\mathcal{L}(\theta,\phi;x^{(i)})\simeq \frac{1}{2}\sum^{J}_{j=1}(1+\log ((\sigma_{j}^{(i)})^2)-(\mu_{j}^{(i)})^2-(\sigma_{j}^{(i)})^2)+\frac{1}{L}\sum^{L}_{l=1}\log p_{\theta}(x^{(i)}|z^{(i,l)})$

where $z^{(i,l)}=\mu^{(i)}+\sigma^{(i)}\odot \epsilon^{(l)}$ and $\epsilon^{(l)}\sim\mathcal{N}(0,\mathbf{I})$
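The first term of the estimator, the negative KL divergence between the Gaussian approximate posterior and the standard normal prior, has a closed form that is easy to evaluate. The sketch below computes it and draws one latent sample as $z=\mu+\sigma\odot\epsilon$; the function names and the example values of $\mu$ and $\sigma$ are assumptions for illustration, and the reconstruction term is left out because it depends on the decoder.

```python
import math
import random


def negative_kl_to_standard_normal(mu, sigma):
    """0.5 * sum_j (1 + log(sigma_j^2) - mu_j^2 - sigma_j^2)."""
    return 0.5 * sum(
        1.0 + math.log(s ** 2) - m ** 2 - s ** 2 for m, s in zip(mu, sigma)
    )


def sample_latent(mu, sigma, rng):
    """z = mu + sigma * eps, element-wise, with eps ~ N(0, I)."""
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]


rng = random.Random(0)

# Posterior identical to the prior: the KL term vanishes.
matches_prior = negative_kl_to_standard_normal([0.0, 0.0], [1.0, 1.0])
# Posterior shifted away from the prior: the term becomes negative.
shifted_mean = negative_kl_to_standard_normal([1.0], [1.0])
# A hypothetical encoder output for a 2-dimensional latent space.
mu, sigma = [0.3, -1.2], [0.8, 0.5]
regularizer = negative_kl_to_standard_normal(mu, sigma)
z = sample_latent(mu, sigma, rng)
print(matches_prior, shifted_mean, regularizer, z)
```

The term is 0 only when the approximate posterior equals the prior and negative otherwise, so maximizing the lower bound pulls the encoder's outputs towards $\mathcal{N}(0,\mathbf{I})$ while the reconstruction term pulls the other way.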

3-3 Architecture

Auto-encoder vs Variational Auto Encoder Source : https://data-science-blog.com/blog/2022/04/19/variational-autoencoders/

4. Conclusion

SGVB is a novel estimator of the variational lower bound that resolves the intractability encountered when optimizing the parameters
Since the SGVB estimator is differentiable and can be optimized straightforwardly, it leads to efficient approximate inference with continuous latent variables.
For i.i.d. datasets with continuous latent variables per datapoint, the paper introduces an efficient algorithm called Auto-Encoding VB (AEVB), which learns an approximate inference model using the SGVB estimator.
