Jaehoon

OpenCLIP 학습 코드 정리

Sun, 10 Nov 2024 12:57:11 GMT

https://github.com/mlfoundations/open_clip

Loading the model

Note: OpenCLIP ships as its own package, separate from the original CLIP (pip install open_clip_torch).

import torch
import open_clip

device = "cuda" if torch.cuda.is_available() else "cpu"
print(device)

model, _, preprocess = open_clip.create_model_and_transforms('ViT-B-32')
state_dict = torch.load('path_to_pretrained_weight', map_location=device)
model.load_state_dict(state_dict['CLIP'])
model.to(device)
tokenizer = open_clip.get_tokenizer('ViT-B-32')

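state_dict: a dict that holds the model's weights and parameters, used when saving or loading a trained model. Here it loads previously saved weights; the checkpoint used in this post stores them under the 'CLIP' key.

tokenizer: converts text into the token IDs the model expects as input. Use the tokenizer that matches the model.

preprocess: the image preprocessing function. As with the tokenizer, use the one that matches the model.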

데이터 준비

커스텀 데이터셋 정의

from PIL import Image
from datasets import load_dataset
from torch.utils.data import Dataset

ds = load_dataset("tomytjandra/h-and-m-fashion-caption")

class HMFashionDataset(Dataset):
    def __init__(self, dataset_split, preprocess):
        self.dataset = dataset_split
        self.preprocess = preprocess

    def __len__(self):
        return len(self.dataset)

    def __getitem__(self, idx):
        item = self.dataset[idx]

        # 이미지 데이터 가져오기
        image = item['image']
        if isinstance(image, Image.Image):
            image = image.convert('RGB')
        else:
            image = Image.open(image).convert('RGB')

        image = self.preprocess(image)
        caption = item['text']
        return image, caption  # 텍스트를 문자열로 반환

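load_dataset: a function from the Hugging Face datasets library that makes it easy to load many public datasets. This post uses the H&M fashion image-caption dataset.

Dataset class: the PyTorch class to subclass when defining a custom dataset. The dataset is defined through its __len__ and __getitem__ methods.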

데이터 로더 설정

import torch
from torch.utils.data import DataLoader
import datasets
import random

prompts = [
    'a photo of a {}',
    'a fashion photo of a {}',
]

def collate_fn(batch):
    images, captions = zip(*batch)
    images = torch.stack(images)
    prompted_captions = []
    for caption in captions:
        prompt = random.choice(prompts)
        prompted_captions.append(prompt.format(caption))
    texts = tokenizer(prompted_captions)
    return images, texts

print(ds)

if isinstance(ds, datasets.DatasetDict):
    if 'train' in ds:
        # ds['train']을 훈련용과 테스트용으로 분할
        split_ds = ds['train'].train_test_split(test_size=0.2, seed=42)
    else:
        raise KeyError("The dataset does not contain a 'train' split.")
else:
    # ds 자체가 Dataset인 경우
    split_ds = ds.train_test_split(test_size=0.2, seed=42)

# 데이터셋 스플릿
train_dataset = HMFashionDataset(split_ds['train'], preprocess)
test_dataset = HMFashionDataset(split_ds['test'], preprocess)

train_dataloader = DataLoader(
    train_dataset,
    batch_size=512,
    shuffle=True,
    num_workers=0,
    collate_fn=collate_fn
)

test_dataloader = DataLoader(
    test_dataset,
    batch_size=512,
    shuffle=False,
    num_workers=0,
    collate_fn=collate_fn
)

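prompts: predefined templates used to build the text input. 'a photo of a {}' is the usual choice, but the templates can be adapted to the characteristics of the dataset.

collate_fn: called by the DataLoader on every batch to assemble the samples. Here it applies a randomly chosen prompt to each caption, tokenizes the result, and uses torch.stack to combine the image tensors into a single batch tensor.

DataLoader: the PyTorch utility that wraps a dataset in an iterable that yields batches.

num_workers: the number of worker processes used for parallel data loading (0 loads the data in the main process).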

모델 학습

하이퍼 파라미터 설정

import torch
import torch.nn.functional as F
from torch.amp import GradScaler
from tqdm import tqdm

# 손실 함수 정의 (CLIP의 대조적 손실 함수)
def clip_loss(logits_per_image, logits_per_text):
    batch_size = logits_per_image.size(0)
    labels = torch.arange(batch_size, dtype=torch.long, device=logits_per_image.device)

    loss_i = F.cross_entropy(logits_per_image, labels)
    loss_t = F.cross_entropy(logits_per_text, labels)

    return (loss_i + loss_t) / 2

# Train only the visual encoder: freeze everything, then unfreeze the image tower
for param in model.parameters():
    param.requires_grad = False  # freeze all parameters
for param in model.visual.parameters():  # OpenCLIP exposes the image encoder as model.visual
    param.requires_grad = True

# 학습 가능한 파라미터만 필터링하여 전달
optimizer = torch.optim.AdamW(
    filter(lambda p: p.requires_grad, model.parameters()),  
    lr=5e-6,
    betas=(0.9, 0.98),
    eps=1e-6,
    weight_decay=0.2
)

# 스케일러 정의
scaler = GradScaler()

# 에포크 수 및 total_steps 계산
epochs = 10
total_steps = epochs * len(train_dataloader)

# 스케줄러 정의
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)

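torch.arange: creates a tensor of consecutive integers. It is used here to build the labels, since the i-th image in a batch matches the i-th text.

AdamW: a variant of Adam that applies weight decay decoupled from the gradient-based update. Alternatives include SGD, Adam, and RMSprop, and the right optimizer depends on the model and on how fast it needs to train.

lr, betas, eps, weight_decay: the hyperparameters of AdamW. lr is the learning rate, betas are the decay rates of the first- and second-moment running averages, eps is a small constant for numerical stability, and weight_decay is the decoupled weight-decay coefficient that regularizes against overfitting.

GradScaler: applies loss scaling during automatic mixed precision (AMP) training.

scheduler: gradually lowers the learning rate. CosineAnnealingLR follows a cosine curve over T_max steps.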
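To make the contrastive loss concrete, here is a minimal NumPy sketch of the same symmetric cross-entropy computed on hand-made logits. The helper names and the toy matrices are assumptions for illustration, not part of the training code above.

```python
import numpy as np

def log_softmax(logits, axis):
    # Subtract the row maximum first for numerical stability
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def symmetric_contrastive_loss(logits_per_image):
    """Mean of the image->text and text->image cross-entropies.

    The correct class for row i is column i (the diagonal).
    """
    n = logits_per_image.shape[0]
    idx = np.arange(n)  # plays the role of torch.arange(batch_size)
    loss_i = -log_softmax(logits_per_image, axis=1)[idx, idx].mean()
    loss_t = -log_softmax(logits_per_image.T, axis=1)[idx, idx].mean()
    return (loss_i + loss_t) / 2

aligned = 10.0 * np.eye(4)          # matched pairs score far above the rest
uninformative = np.zeros((4, 4))    # every pair scores the same

loss_aligned = symmetric_contrastive_loss(aligned)
loss_uniform = symmetric_contrastive_loss(uninformative)
print(loss_aligned, loss_uniform)
```

With uninformative logits the loss equals log(batch_size), and it approaches zero as the diagonal dominates, which is the direction training pushes the model.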

학습

from tqdm import tqdm

model.train()

for epoch in range(epochs):
    progress_bar = tqdm(train_dataloader, desc=f"Epoch {epoch+1}/{epochs}")
    for batch in progress_bar:
        images, texts = batch
        images = images.to(device)
        texts = texts.to(device)

        optimizer.zero_grad()

        with torch.autocast(device_type='cuda', dtype=torch.float16):
            # 이미지와 텍스트 임베딩 추출
            image_features = model.encode_image(images)
            text_features = model.encode_text(texts)

            # 임베딩 정규화
            image_features = image_features / image_features.norm(dim=1, keepdim=True)
            text_features = text_features / text_features.norm(dim=1, keepdim=True)

            # 유사도 계산
            logit_scale = model.logit_scale.exp()
            logits_per_image = logit_scale * image_features @ text_features.t()
            logits_per_text = logits_per_image.t()

            # 손실 함수 계산
            loss = clip_loss(logits_per_image, logits_per_text)

        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
        scheduler.step()  # 각 배치마다 호출

        progress_bar.set_postfix(loss=loss.item())

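Embedding normalization: each vector is scaled to unit length so that comparisons between embeddings depend only on direction, not magnitude. The dot product of two normalized embeddings is their cosine similarity.

logit_scale: a model parameter stored in log space. Taking exp() turns it into a positive multiplier (an inverse temperature) that controls how sharp the image-text similarity distribution is, which helps training stability and performance.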
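The effect of both steps can be seen on a tiny example. This is a sketch with made-up 2-D vectors and two illustrative scale values; the actual value of logit_scale depends on the checkpoint.

```python
import numpy as np

def l2_normalize(x):
    # Scale each row to unit length so only direction matters
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

image = l2_normalize(np.array([[3.0, 4.0]]))
texts = l2_normalize(np.array([[6.0, 8.0],     # same direction, different length
                               [4.0, 3.0],     # similar direction
                               [-4.0, 3.0]]))  # orthogonal

cosine = (image @ texts.T)[0]  # cosine similarities after normalization

probs = {}
for logit_scale in (1.0, 100.0):
    probs[logit_scale] = softmax(logit_scale * cosine)
    print(logit_scale, probs[logit_scale].round(3))
```

The first text has twice the length of the image vector but still gets cosine similarity 1.0. A small scale leaves the softmax nearly flat across similar candidates, while a large scale concentrates almost all the probability on the best match.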

tqdm: a library that displays a progress bar, so you can follow the progress of a loop while it runs.

Discovering and Mitigating Visual Biases through Keyword Explanation (CVPR 2024)

Sun, 13 Oct 2024 12:48:34 GMT

Contribution

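Bias-to-Text (B2T): extracts visual biases as keywords (detecting candidate biases) ⇒ each keyword is then validated with a CLIP score, which checks whether the keyword is closer to the mispredicted images than to the correctly predicted ones

It finds not only well-known biases (CelebA, Waterbirds, etc.) but also new ones ⇒ for example, an ant is mistaken for a bee in images that contain a flower
The discovered bias keywords can be used for debiased training such as DRO, applied to CLIP prompting, or used to compare models. They can also help detect wrong labels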

Bias-to-Text (B2T) Framework

Problem formulation

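Given a classifier that predicts a class $y \in \mathcal Y$ for an image $x \in \mathcal X$,

an attribute $a$ on which the classifier is frequently wrong is defined as a bias for $y$ ⇒ expressed in the form of a keyword explanation

*The detected biases include spurious correlations and distribution shifts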

Bias Keywords

Keywords shared by the captions of the mispredicted images of a class are extracted ⇒ minority subgroups show up frequently in this step (e.g. Man - Blonde hair)

*ClipCap is used as the captioning model and YAKE as the keyword extraction algorithm

CLIP score

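To verify that the extracted keywords really indicate a bias, a vision-language scoring model such as CLIP is used ⇒ keywords tied to a biased concept receive high scores

$s_{\text{CLIP}}(a; \mathcal D) := \text{sim}(a,\mathcal D_{\text{wrong}}) - \text{sim}(a,\mathcal D_{\text{correct}})$

$\text{sim}(a,\mathcal D) := \frac{1}{|\mathcal D|} \sum_{x \in \mathcal D} f_\text{image}(x) \cdot f_\text{text}(a)$

$\mathcal D_{\text{wrong}}$ and $\mathcal D_{\text{correct}}$ are the subsets of the class-wise validation set $\mathcal D$ that the classifier predicts incorrectly and correctly. $\text{sim}(a,\mathcal D)$ is the similarity between keyword $a$ and dataset $\mathcal D$: the average dot product of the image and text embeddings, which is a cosine similarity when both embeddings are normalized.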
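The score is easy to compute once the embeddings are available. The sketch below uses hand-made 2-D vectors in place of real CLIP embeddings, and the function names are assumptions for illustration.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def sim(keyword_emb, image_embs):
    """Average cosine similarity between one keyword and a set of images."""
    return float((l2_normalize(image_embs) @ l2_normalize(keyword_emb)).mean())

def clip_score(keyword_emb, wrong_embs, correct_embs):
    """s_CLIP = sim(a, D_wrong) - sim(a, D_correct)."""
    return sim(keyword_emb, wrong_embs) - sim(keyword_emb, correct_embs)

# Toy embeddings: mispredicted images lean toward the first axis
wrong = np.array([[1.0, 0.0], [0.8, 0.6]])
correct = np.array([[0.0, 1.0], [0.6, 0.8]])

bias_keyword = np.array([1.0, 0.0])     # aligned with the mispredicted images
neutral_keyword = np.array([1.0, 1.0])  # equally close to both sets

score_bias = clip_score(bias_keyword, wrong, correct)
score_neutral = clip_score(neutral_keyword, wrong, correct)
print(score_bias, score_neutral)
```

A keyword that describes only the mispredicted images gets a clearly positive score, while a keyword shared by both sets scores near zero.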

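(a) ‘species’ and ‘bird’ appear in both correctly and incorrectly predicted images, so they are non-bias keywords and their CLIP scores are low. In contrast, ‘bamboo’, ‘forest’, and ‘woods’ are more similar to the mispredicted images, so their CLIP scores are high. (b) Subgroup accuracy (AUROC) for the keywords. (c) Correlation between CLIP score and AUROC (-0.95).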

Discovering Biases in Image Classifiers

Known biases: (a) gender bias in CelebA blond, (b) background bias in Waterbirds, (c) distribution shifts in ImageNet-R with different styles, (d) ImageNet-C with natural corruptions.
Novel biases: (e) spurious correlations between the keyword “cave” and the wardrobe class, indicating geographical bias, (f) the keyword “flower” and the ant class, indicating contextual bias.

known biases

Spurious correlation

(ERM) CelebA의 Blond 클래스에 대해 B2T가 “man” 키워드를, Waterbirds에 대해서 “forest”, “ocean”을 포착하여 성별, 배경 편향을 찾아냈을 뿐만 아니라 기존의 background annotation인 “land”에 비해 더 정확한 keyword인 “bamboo”를 찾아냄

Distribution shifts

(ResNet-50) B2T는 ImageNet-R에서 키워드 “illustration”, “drawing”, 좀 더 자세하게는 “hand-drawn”, “vector-art”를 찾아내었고, ImageNet-C 에서는 “snow”(snow corruption), “window”(frost-corruption)를 찾아냄.

Sample-wise bias labeling

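The discovered keywords can be plugged into a CLIP zero-shot classifier to label bias at the sample level

Group labels are assigned with prompts that contain the bias keyword, such as “a photo of [group]”,

and the result is compared with existing methods on CelebA and Waterbirds, where ground-truth bias labels exist

⇒ The results are close to optimal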

novel biases

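Dollar Street (Figure 4e) contains images of objects from countries with different income levels. Earlier studies showed that image classifiers perform worse on images from low-income countries.

⇒ B2T analyzes this bias by applying a classifier trained on ImageNet to the Dollar Street validation set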

몇가지 예시로

“cave” (동굴): “wardrobe” (옷장) 클래스에서 저소득 국가의 옷장이 어두운 곳에 있는 경우가 많아 동굴처럼 보이는 경향

“fire” (불): “stove” (난로) 클래스에서, 저소득 국가의 전통적인 디자인의 난로는 종종 불을 사용하는 방식

이러한 객체의 차이는 국가 간의 지리적 편향을 나타내며, 분류기가 고소득 국가의 객체는 잘 예측하지만 저소득 국가의 객체는 잘못 예측하는 원인을 설명

ImageNet(Figure4. f)에서는 여러 객체가 동시에 존재함으로 인해서 발생하는 contextual biases를 발견할 수 있었는데,

“flower”(꽃)과 함께 존재하는 “ant”(개미)를 “bee”(벌)로 예측하였다. 이는 벌이 개미보다 꽃과 더 강한 연관성을 가지고 있음을 시사한다.

“playground”(놀이터)에서 “horizontal bar”(철봉)을 “swing”(그네)로 오인한다.

Applications of the B2T Keywords

B2T를 사용해서 얻어낸 키워드들은 학습, 프롬프팅, 모델 비교, 레이블 진단 등에 사용할 수 있다.

Debiased DRO training

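The sample-wise bias labels obtained above were used as the group labels for DRO (distributionally robust optimization), and the performance of the resulting DRO-B2T was measured.

Its WGA (worst-group accuracy), the accuracy on the worst-performing group, was even higher than that of DRO trained with the ground-truth group labels.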
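Worst-group accuracy itself is simple to compute. The sketch below uses a made-up set of six predictions; the labels, groups, and predictions are toy values, not results from the paper.

```python
from collections import defaultdict

def group_accuracies(labels, groups, preds):
    """Accuracy for every (class, group) pair that occurs in the data."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, g, p in zip(labels, groups, preds):
        total[(y, g)] += 1
        correct[(y, g)] += int(y == p)
    return {key: correct[key] / total[key] for key in total}

labels = ["blond", "blond", "blond", "blond", "not_blond", "not_blond"]
groups = ["woman", "woman", "man", "man", "woman", "man"]
preds = ["blond", "blond", "not_blond", "blond", "not_blond", "not_blond"]

acc = group_accuracies(labels, groups, preds)
average_acc = sum(y == p for y, p in zip(labels, preds)) / len(labels)
worst_group_acc = min(acc.values())

print(acc)
print(average_acc, worst_group_acc)
```

Average accuracy hides the weak subgroup (blond, man), which is why DRO optimizes and reports the worst group instead.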

CLIP zero-shot prompting

A keyword is added to CLIP's default prompt “a photo of a [class]”, giving a prompt of the form “a photo of a [class] in the [group]”.

Using the CLIP score computed earlier, the B2T keywords were split into B2T-pos (e.g. “ocean”) and B2T-neg (e.g. “bird”) and substituted for [group]. With positive keywords, both worst-group accuracy and average accuracy improved; with negative keywords, performance got worse.
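Building such prompts takes only a few lines. This sketch only constructs the strings; the class names and keywords are illustrative, and how the group prompts are combined into a final prediction is not covered here.

```python
def build_prompts(class_names, group_keywords):
    """Return the default prompts and the keyword-augmented prompts per class."""
    base = [f"a photo of a {c}" for c in class_names]
    grouped = {
        c: [f"a photo of a {c} in the {g}" for g in group_keywords]
        for c in class_names
    }
    return base, grouped

classes = ["landbird", "waterbird"]   # illustrative class names
keywords = ["forest", "ocean"]        # illustrative positive keywords

base_prompts, grouped_prompts = build_prompts(classes, keywords)
print(base_prompts)
print(grouped_prompts["waterbird"])
```

The resulting strings can be passed to the tokenizer in the same way as the default prompts.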

Model comparison

ResNet vs. ViT

ViT outperforms ResNet at understanding global context and at fine-grained classification.

For example, ViT handled even abstract bias keywords such as “work out” successfully

ResNet, in contrast, struggled with complex images, mispredicting “horizontal bar” as “dumbbell” and “shopping basket” as “grocery store”

ERM vs. DRO

Comparing ERM and DRO on the CelebA and Waterbirds datasets,

DRO reduced the bias keywords or removed them almost entirely: the “man” keyword disappeared for CelebA blond, and the CLIP score of “seagull” on Waterbirds dropped from 3.10 to 1.85.

Label diagnosis

B2T는 잘못된 레이블 및 레이블 모호성을 진단하는 데 사용할 수 있는데, ImageNet에 존재하는 레이블 오류를 발견하였음.

B2T를 통해 “bee”가 “fly”로, “boar”가 “pig”로 잘못 레이블링된 이미지를 발견하였고, 또한 “desk”, “market”과 같이 대체적으로 여러 객체가 한번에 포함되어 모호한 레이블도 판별 할 수 있음.

Audio-Visual Segmentation

Sun, 04 Aug 2024 09:05:23 GMT

Audio-Visual Segmentation (ECCV 2022) Source Code - GitHub

Introduction

본 논문에서는 이미지 프레임에서 소리를 내는 물체를 픽셀 단위로 구분하는 Task인 Audio-Visual Segmentation (AVS)를 제시한다. 기존에 존재하던 Task와의 차이점, 새로운 Dataset과 Baseline model 그리고 실험 결과에 대해 알아보자.

AVC(Audio-Visual Correspondence)는 오디오와 이미지가 같은 scene에 해당하는지 판단하며, AVEL(Audio-Visual Event Localization)는 사전에 학습된 event label로 video segment를 분류한다. AVVP(Audio-Visual Video Parsing)는 비디오를 몇가지 event로 나누고 소리, 프레임, 또는 모두를 label에 따라 분류한다.

이러한 작업들은 프레임/시간 수준으로 제한되므로 새로운 Task는 소리가 나는 물체를 분류하는 것으로 범위를 줄인다.

SSL (Sound Source Localization)은 이중 가장 AVS와 가까운 작업으로, 프레임 내부에서 주어진 소리와 일치하는 영역을 찾아낸다. 그러나 SSL은 patch 단위로 이루어져있고, heat map으로 영역을 표시하므로 소리를 내는 객체의 모양을 정확히 표시하지는 않는다.

Audio-Visual Segmentation

AVS (Audio-Visual Segmentation)는 각 pixel이 해당 audio와 일치하는지 파악하여 sounding object와 겹치도록 mask를 생성한다.

위 비디오 프레임 중 두 번째 행이 SSL, 세 번째 행이 AVS를 나타낸다. SSL은 patch 단위의 히트맵으로 표시된다. AVS는 pixel 단위로 물체를 표시하며, 이는 sounding object가 복수인 경우에도 마찬가지이다.

AVSBench

기존 데이터셋 중에서는 pixel 단위의 label를 제공하는 것이 없었다. frame에 대한 event만 분류하거나(AVE, LLP), target sound source의 outline이 되는 bounding box만 제공한다(Flickr-SoundNet, VGG-SS). 따라서 저자는 새로운 Task 학습에 적합한 Dataset인 AVSBench를 제시한다.

AVSBench는 sounding object의 수에 따라 Single-source와 Multi-sources로 구분한다.

각 데이터 셋은 5초 분량의 비디오가 1초 클립 5개로 나누어진 형태이며 label은 각 클립의 마지막 프레임에 제시된다. label은 binary mask 형태로 되어있으며, sounding object를 pixel-level로 표시하는 역할을 한다.

이 때 source 수에 따라 labeling 방식이 조금 다르다.

semi-supervised Single Sound Source Segmentation (S4)

For the Single-source training data, a label is provided only for the first of each video's five clips. This rests on the assumption that a one-shot annotation is sufficient in the single-source case.

fully-supervised Multiple Sound Source Segmentation (MS3)

Multi-sources is the harder task, so every clip in the training data is labeled. Samples from the actual dataset are shown below.

Baseline

논문에서는 AVS를 위한 End-to-End framework를 제시하는데, temporal pixel-wise audio-visual 상관관계를 인코딩하기 위한 TPAVI 모듈과 audio-visual correlation을 활용하기 위한 regularization loss가 포함되어있다.

Encoder

The audio and the video frames are encoded independently. The audio clip first goes through a short-time Fourier transform, and VGGish then extracts an audio feature of shape T×d (d=128). The visual features are produced by a convolution- or transformer-based backbone.

앞서 추출된 visual feature를 후처리하기 위해서 Atrous Spatial Pyramid Pooling (ASPP)이 사용된다. 이러한 후처리는 병렬적으로 수행되는데, 이로 인해 서로 다른 크기의 receptive field를 갖는 객체를 인식할 수 있게 된다.

TPAVI

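The visual features that have passed through ASPP now need to be mapped to the audio features, because this is what reveals which object is making the sound. This is done by encoding the Temporal Pixel-wise Audio-Visual Interaction (TPAVI). The sound of a source and its appearance do not always occur at the same time (for example, when the source is off-screen), so the module borrows the non-local neural network approach and lets each video frame consider every audio signal. The audio feature is first projected to the same dimension as the visual feature, then duplicated $h_i \times w_i$ times and reshaped before it enters TPAVI.

The audio-visual interaction, which captures the relation between sound and frames, can be measured with a dot product:

$Z_i = V_i + \mu(\alpha_i \, g(V_i)), \quad \alpha_i = \frac{\theta(V_i)\,\phi(\hat{A})^{\top}}{N}$

Here θ, φ, g, and μ are 1×1×1 convolutions, $N = T \times h_i \times w_i$ is a normalization factor, $\alpha_i$ is the audio-visual similarity, and $Z_i \in \mathbb{R}^{T \times h_i \times w_i \times C}$. $V_i$ denotes the visual feature and $\hat{A}$ the duplicated audio feature. Inside TPAVI, every pixel interacts with the entire audio.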
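The shapes involved are easier to follow in code. The following NumPy sketch assumes the interaction has the non-local form Z = V + μ(α g(V)) with α = θ(V) φ(Â)ᵀ / N, and it replaces each 1×1×1 convolution with a plain channel-mixing matrix. The sizes and random weights are placeholders, so this is a shape-level illustration rather than the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, h, w, C = 5, 4, 4, 8  # frames, feature-map height/width, channels (toy sizes)

V = rng.standard_normal((T, h, w, C))  # visual features after ASPP
A = rng.standard_normal((T, C))        # audio features, already projected to C dims

# Duplicate each audio vector h*w times so it matches the layout of V
A_hat = np.broadcast_to(A[:, None, None, :], (T, h, w, C))

# A 1x1x1 convolution only mixes channels, so a (C, C) matrix stands in for it
W_theta, W_phi, W_g, W_mu = (0.1 * rng.standard_normal((C, C)) for _ in range(4))

N = T * h * w                       # normalization factor
V_flat = V.reshape(N, C)            # every pixel of every frame
A_flat = A_hat.reshape(N, C)

# Audio-visual similarity between every pixel and the audio at every position
alpha = (V_flat @ W_theta) @ (A_flat @ W_phi).T / N    # shape (N, N)

# Aggregate the transformed visual features and add them back residually
Z = (V_flat + (alpha @ (V_flat @ W_g)) @ W_mu).reshape(T, h, w, C)

print(alpha.shape, Z.shape)
```

Because alpha spans all T frames, each pixel is compared against the audio of the whole clip, not just the audio of its own frame.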

아래 사진은 실제로 audio와 pixel의 유사도를 나타낸 것으로 밝을 수록 유사도가 높은 구역이다.

Experiments

SOD method인 LGVT가 ResNet50 기반 AVS 모델을 Single-Source Set에 관한 지표에서 조금 앞섰지만 Multi-Source 지표에서는 훨씬 밀린다.

⇒ 이것은 SOD가 소리내는 물체는 바뀌지만 화면은 그대로인 경우를 감지하지 못하기 때문인 것으로 보인다.

⇒ 반면 AVS는 Audio 전체를 참고하기 때문에 Visual Frame에서 어떤 객체를 포착할지 알아챈다.

Single Source인 경우 조금 앞서는 것도 LGVT가 Swin-Transformer 기반이기 때문에 Backbone 자체의 성능이 좋아서 그런 것 같고, Transformer 기반의 PVT를 쓸 경우에는 두 지표에서 LGVT를 모두 앞선다.

Qualitative examples of the SSL methods and AVS

fully-supervised MS3 환경에서, SSL 메소드들(LVS, MSSL)은 대략적인 위치만 찾아냈지만, AVS는 객체의 더 정확한 pixel 단위의 모양을 구분해낼 수 있었다.

Qualitative examples of the VOS, SOD, and AVS

In the MS3 setting, AVS was compared with SST, a SOTA method for VOS (Video Object Segmentation), and LGVT, a method for SOD (Salient Object Detection). Unlike the other two, AVS accurately tracks changes in the sounding object. When the sounding object changes partway through, as in the baby and dog example, SST and LGVT either keep capturing only one object or capture both.

Comparison with a two-stage baseline method

Two-Stage 구조에서 first-stage에 Mask R-CNN을 사용하여 segmentation quality를 증가시키더라도 AVS task 자체의 성능에 큰 영향을 미치지 않는다. 오히려 audio signal의 영향을 훨씬 많이 받는다.

Impact of audio signal and TPAVI

중간에 있는 row는 단순히 audio와 visual feature를 더한 것인데 이것만으로도 어느 정도 성능이 향상되었다.

Qualitative results under the semi-supervised S4 setting

TPAVI를 통해서 비디오에 존재하는 sounding object의 형태와 올바른 sound source를 학습하게 된다.

Qualitative results under the fully-supervised MS3 setting

TPAVI를 적용한 모델이 사람의 손과 같이 소리와 직접적으로 관련이 없는 객체를 필터링하거나 노래 부르는 사람과 같이 더 정확한 객체를 포착하는등 더 뛰어난 성능을 보였다.

Residual Learning의 이해와 ResNet-18 구현

Sun, 21 Jul 2024 13:52:07 GMT

Paper Review

요약

Residual Learning is a deep learning technique in which each layer learns with reference to its input.

ResNet keeps high accuracy at much greater depth than earlier models and also trains quickly. It won ILSVRC 2015 and performs well on image classification (CIFAR-10 and ImageNet) as well as on detection, localization, and segmentation.

Introduction

Deep Network는 입력에 가까울수록 지역적인 low feature가, 출력에 가까울수록 전역적이고 추상적인 high feature가 나타난다. 깊이에 따라 이러한 feature level이 다양해지므로 network 깊이는 중요한 요소라고 할 수 있다.

그렇다면 layer가 많으면 많을 수록 좋은 network일까?

기존에 존재하던 기울기 소실/폭발 문제는 초기 정규화(normalized initialization)와 중간 정규화 층(intermediate normalization layer)을 통해 어느정도 해결됐다.

As networks get deeper, however, a new problem called degradation appears: the accuracy of the deeper network saturates and then drops rapidly.

Degradation comes with a high training error, so it is a separate problem from overfitting, where training accuracy is high.

저자는 이러한 문제를 해결하기 위해 Deep Residual Learning 구조를 제안한다. 입력이 x, relu를 거치기 이전의 출력이 H(x)라면, H(x) = F(x) + x 로 나타낼 수 있다. 이와 같이 입력이 층을 건너뛰는 것을 skip connection 이라고 한다.

본 논문에서는 skip connection을 identity mapping 즉, 다른 추가적인 파라미터나 연산없이 구현하였다. 이로써 전체 network는 여전히 확률적 경사하강법(Stochastic Gradient Descent)을 통해 E2E 학습을 진행할 수 있다.

Deep Residual Learning

Residual Learning (잔차 학습)

그렇다면 H(x) = F(x) + x 와 같은 새로운 구조를 제시한 이유는 무엇일까? x는 우리가 이미 알고 있는 값이므로 관점을 바꾸어 F(x) = H(x) - x 를 학습시킨다고 생각해보자.

만약 정답의 형태가 H(x) = x 와 같은 identity mapping이라면 Residual Learning에서는 F(x) = 0 이 되야 하므로 이러한 결과가 나오도록 학습시키는 것이 (Residual Learning이 아닌) plain network가 H(x) = x가 되도록 학습시키는 것보다 훨씬 쉽기 때문이다.

물론 항상 정답이 H(x) = x 가 되는 것은 아니지만 아무것도 없는 상태에서 multiple layers를 어떤 함수에 근사 시키는 것보다는 x라는 reference를 제공하는 identity mapping을 바탕으로 정답을 찾아나가는 것이 학습에 유리하다.

Identity Mapping by Shortcuts

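A building block with an identity shortcut can be written as:

$y = \mathcal F(x, \{W_i\}) + x$

Taking Figure 2 as an example, $\mathcal F = W_2 \sigma(W_1 x)$, where σ is the ReLU function and the biases are omitted to simplify notation. The operation $\mathcal F + x$ is performed by a shortcut connection and element-wise addition.

Element-wise addition requires $\mathcal F$ and $x$ to have the same dimensions. When they do not, $x$ is multiplied by a projection $W_s$ before the addition:

$y = \mathcal F(x, \{W_i\}) + W_s x$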
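A small NumPy sketch makes the two cases concrete. It uses fully connected weights instead of convolutions and omits the ReLU applied after the addition; all names and sizes are chosen for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_branch(x, W1, W2):
    """F(x) = W2 · σ(W1 · x), with biases omitted."""
    return W2 @ relu(W1 @ x)

def residual_block(x, W1, W2, Ws=None):
    """y = F(x) + x, or y = F(x) + Ws · x when the dimensions differ."""
    shortcut = x if Ws is None else Ws @ x
    return residual_branch(x, W1, W2) + shortcut

x = np.array([1.0, -2.0, 3.0])

# If the residual branch outputs zero, the block is exactly the identity
zeros = np.zeros((3, 3))
y_identity = residual_block(x, zeros, zeros)

# When F changes the dimension (3 -> 5), a projection Ws aligns the shortcut
rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((5, 4))
Ws = rng.standard_normal((5, 3))
y_projected = residual_block(x, W1, W2, Ws)

print(y_identity, y_projected.shape)
```

Driving the weights toward zero is enough to recover H(x) = x, which is the sense in which the identity mapping is easy for a residual block to represent.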

Network Architectures(for ImageNet)

Plain Network

As in VGG nets (Fig. 3, left), the convolutional layers mostly use 3x3 filters and follow two design rules.

Layers with the same feature map size have the same number of filters.
If the feature map size is halved, the number of filters is doubled.

This preserves the time complexity per layer.
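A quick calculation shows why the second rule keeps the per-layer cost constant. The sketch counts multiply-accumulate operations for a 3x3 convolution, ignoring bias and padding details; the feature-map sizes are illustrative.

```python
def conv_macs(height, width, in_channels, out_channels, kernel=3):
    """Multiply-accumulates of one conv layer on a height x width output map."""
    return height * width * in_channels * out_channels * kernel * kernel

# 56x56 map with 64 filters vs. a halved 28x28 map with doubled 128 filters
macs_56 = conv_macs(56, 56, 64, 64)
macs_28 = conv_macs(28, 28, 128, 128)

print(macs_56, macs_28)
```

Halving the height and width divides the number of positions by 4, while doubling both the input and output channels multiplies the work per position by 4, so the two effects cancel.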

여기에 convolutional layer(stride=2)로 직접 downsampling을 수행한다. 마지막에는 global average pooling과 함께 1000-way fully-connected Layer 및 softmax를 통해 분류를 진행하게 된다.(Fig.3 가운데)

Residual Network

Plain Network에서 identity mapping이 추가된다. 차원이 증가할 때는 점선으로 표기하였는데 두 가지 선택지를 고려할 수 있다.

(A) identity mapping을 수행하고 증가한 차원 부분에 대해서는 zero padding을 적용한다. (B) projection shortcut을 사용한다(Ws, 1x1 conv).

두 경우 모두 size가 반으로 줄어드므로 stride를 2로 설정하였다.

Implementation

The implementation proceeds as follows.

Resize the image so that its shorter side falls between 256 and 480.
Take a 224x224 crop (from the image or its horizontal flip) and subtract the per-pixel mean.
Apply the standard color augmentation.
Apply Batch Normalization between each convolution and its activation.
Initialize the weights with He initialization.
Train with SGD (mini-batch size: 256).
Start with a learning rate of 0.1 and divide it by 10 when the error plateaus.
Iterations: 60×10^4, weight decay: 0.0001, momentum: 0.9, no dropout.

At test time, 10-crop testing is applied, and multiple scales are used so that the shorter side is one of {224, 256, 384, 480, 640}.

Experiments

ImageNet Classification

ImageNet 2012 데이터셋이 사용되었는데 128만개의 training images, 5만개의 validation images, 10만개의 test images가 사용 되었고, top-1 & top-5 error rates를 측정하였다.

Plain Networks

처음에는 18-layer와 34-layer에 대한 평가를 진행하였다. 18-layer는 아까 위에서 보았던 34-layer와 유사한 형태이다. 아래 그래프를 통해 알 수 있듯이 더 깊은 34-layer가 18-layer보다 높은 validation error(굵은 선)를 보인다.

또한 초반부에 설명했던 degradation 문제 또한 발생하였다. 즉, 34-layer plain network의 training error(가는 선)가 전체 학습 과정에서 가장 높게 나타났다.

plain network에서는 34-layer에서 validation error가 커지며 degradation 문제가 발생하였다.

이러한 문제는 배치 정규화가 적용되어 순전파/역전파 기울기에는 문제가 없었기 때문에 기울기 소실에 의해 일어나는 것으로 판단되지는 않는다.

Residual Networks

다음으로는 plain network에 residual learning이 적용된 ResNet-18과 ResNet-34의 성능을 측정하였다. 이때 모든 shortcut에 대해서 차원이 증가될 때 zero padding(A 옵션)을 적용하였다. zero padding은 추가적인 파라미터가 필요하지 않으므로 plain 모델과 파라미터 수의 차이는 없다.

위 결과들을 토대로 몇가지 사실을 알 수 있다.

ResNet-34 outperforms ResNet-18 (by about 2.8%) and shows a noticeably lower training error. => This indicates that the degradation problem has been addressed.
Compared with its plain counterpart, the 34-layer ResNet lowers the top-1 error rate by about 3.5% (28.54% -> 25.03%).
The 18-layer plain and residual nets both reach good accuracy, but the 18-layer ResNet converges a little faster.

network가 엄청나게 깊지 않다면(18-layer 포함), SGD solver는 여전히 plain net에 훌륭한 solution을 찾아줄 수 있다.

물론 ResNet은 여기에 빠른 수렴을 통해 최적화를 쉽게 만들어준다.

Identity vs. Projection Shortcuts

앞에서 파라미터가 필요하지 않은 identity shortcut이 학습에 도움이 되는 것을 알 수 있었는데, 이번에는 zero padding과 projection shortcut을 비교하여 살펴본다.

세 가지 옵션이 있다.

A) increasing dimension에 zero padding shortcut이 사용된 경우. 모든 shortcut은 parameter-free이다.

B) increasing dimension에 projection shortcut을 적용하고 나머지는 identity shortcut인 경우.

C) Every shortcut is a projection shortcut.

Error rates (%, 10-crop testing) on ImageNet validation. VGG-16 is based on our test. ResNet-50/101/152 are of option B that only uses projections for increasing dimensions.

위 Table에서 알 수 있는 것은 우선 모두 plain보다는 상당히 좋은 성능을 보였다는 것이다.

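There are differences among A, B, and C (B is slightly better than A, and C is marginally better than B), but they are small, so projection shortcuts are not essential for addressing the degradation problem.

Identity shortcuts are important for keeping the complexity of the bottleneck architecture from growing.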

Deeper Bottleneck Architectures

ResNet의 깊이가 깊어짐에 따라 training time이 너무 커지는 것을 막기 위해 파라미터 수가 적은 bottleneck architecture를 도입하였다.

위 아래에 있는 1x1 conv layer는 각각 차원을 줄였다가 다시 늘리는 데에 사용된다.

이렇게 다시 차원을 원래대로 돌아오도록 하는 이유는 parameter-free인 identity shortcut이 bottleneck architecture에 매우 중요하기 때문이다.

If the identity shortcut is replaced with a projection, the shortcut connects the two high-dimensional ends of the block, and both the time complexity and the model size double.

이러한 결과를 막아주기 때문에 identity shortcut은 bottleneck design이 효과적인 모델이 되는데 중요한 역할을 한다.
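The parameter counts behind this argument can be checked by hand. The sketch below counts convolution weights only (no bias, no BatchNorm) for a block operating on 256 channels with a 64-channel bottleneck, which are the sizes assumed here for illustration.

```python
def conv_params(in_channels, out_channels, kernel):
    """Number of weights in a conv layer without bias."""
    return in_channels * out_channels * kernel * kernel

# Two 3x3 convolutions applied directly on 256 channels
basic_on_256 = 2 * conv_params(256, 256, 3)

# Bottleneck: 1x1 reduces to 64, 3x3 works on 64, 1x1 restores 256
bottleneck = (
    conv_params(256, 64, 1)
    + conv_params(64, 64, 3)
    + conv_params(64, 256, 1)
)

# A 1x1 projection shortcut between the two 256-channel ends
projection_shortcut = conv_params(256, 256, 1)

print(basic_on_256, bottleneck, projection_shortcut)
```

The bottleneck needs roughly 17 times fewer weights than two full 3x3 layers, and a single projection shortcut would add almost as many weights as the whole bottleneck block, which is why replacing the identity shortcut roughly doubles the model size.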

50-layer ResNet

50-layer부터는 기존의 2-layer block을 3-layer bottleneck block으로 교체한다. 이 때 B 옵션을 사용하였다.

101-layer and 152-layer ResNets

Error rates (%) of single-model results on the ImageNet validation set (except † reported on the test set).

더 많은 3-layer block들을 사용하여 101-layer, 152-layer ResNet을 구성하였는데, 놀랍게도 깊이가 상당히 늘어났음에도 불구하고 152-layer ResNet(11.3 billion FLOPs)은 여전히 VGG-16/19 nets(15.3/19.6 billion FLOPs)보다 더 낮은 복잡도를 가졌다.

위 single-model 테스트에서 34-layer ResNets이 매우 경쟁력 있는 정확도를 보이지만, 50/101/152-layer ResNet은 더 높은 정확도를 보인다. 또한 여전히 degradation 문제를 찾아볼 수 없었다.

152-layer ResNet은 single-model 테스트에서 4.49%의 top-5 validation error를 보여주었는데, 이것은 심지어 single-model 임에도 이전의 모든 앙상블 모델을 제친 것이다.

Comparisons with State-of-the-art Methods

Error rates (%) of ensembles. The top-5 error is on the test set of ImageNet and reported by the test server.

저자는 서로 다른 깊이의 6개 모델을 앙상블한 모델을 만들었고, 이는 3.57% top-5 error를 보여주었다. 이 모델은 ILSVRC 2015에서 우승을 차지하였다.

Pytorch Implementation (18-layer)

ResNet은 위 표처럼 Block이 반복되는 구조이므로 Block 코드를 먼저 작성한 후, 이를 바탕으로 ResNet을 구성한다.

Block을 쌓는 구조이기 때문에 ResNet-18을 만들 수 있다면 34, 50 등도 만들 수 있다.

Import

import torch
from torch import nn
from torch import Tensor
from typing import Optional, Callable, Union, Type, List

구조만 파악하기 때문에 학습 관련 라이브러리는 배제하였다.

convolutional layer

def conv3x3(in_planes: int, out_planes: int, stride: int = 1, groups: int = 1, dilation: int = 1) -> nn.Conv2d:
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride, padding=dilation, groups=groups, bias=False, dilation=dilation)

def conv1x1(in_planes: int, out_planes: int, stride: int = 1) -> nn.Conv2d:
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

BasicBlock과 Bottleneck에서 사용될 3x3, 1x1 convolutional layer를 정의하는 부분.

BasicBlock

class BasicBlock(nn.Module):
    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    )-> None:
        super(BasicBlock, self).__init__()

        # Normalization Layer
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d

        self.conv1 = conv3x3(inplanes, planes, stride) # padding, dilation = 1
        self.bn1 = norm_layer(planes)
        self.relu = nn.ReLU(inplace=True) # inplace : 원본 직접 수정 여부
        self.conv2 = conv3x3(planes, planes) # stride = 1
        self.bn2 = norm_layer(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity  #  residual connection
        out = self.relu(out)
        return out

정규화는 기본값으로 2d Batch Normalization이 사용된다.

Block 구성) conv3x3 - BN - ReLU - conv3x3 - BN - residual connection - ReLU

Bottleneck

class Bottleneck(nn.Module):
    # Bottleneck in torchvision places the stride for downsampling at 3x3 convolution(self.conv2)
    # while original implementation places the stride at the first 1x1 convolution(self.conv1)
    # according to "Deep residual learning for image recognition"https://arxiv.org/abs/1512.03385.
    # This variant is also known as ResNet V1.5 and improves accuracy according to
    # https://ngc.nvidia.com/catalog/model-scripts/nvidia:resnet_50_v1_5_for_pytorch.

    expansion: int = 4

    def __init__(
        self,
        inplanes: int,
        planes: int,
        stride: int = 1,
        downsample: Optional[nn.Module] = None,
        groups: int = 1,
        base_width: int = 64,
        dilation: int = 1,
        norm_layer: Optional[Callable[..., nn.Module]] = None,
    ) -> None:
        super().__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        width = int(planes * (base_width / 64.0)) * groups
        # Both self.conv2 and self.downsample layers downsample the input when stride != 1
        self.conv1 = conv1x1(inplanes, width)
        self.bn1 = norm_layer(width)
        self.conv2 = conv3x3(width, width, stride, groups, dilation)
        self.bn2 = norm_layer(width)
        self.conv3 = conv1x1(width, planes * self.expansion)
        self.bn3 = norm_layer(planes * self.expansion)
        self.relu = nn.ReLU(inplace=True)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x: Tensor) -> Tensor:
        identity = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)
        out = self.relu(out)

        out = self.conv3(out)
        out = self.bn3(out)

        if self.downsample is not None:
            identity = self.downsample(x)

        out += identity
        out = self.relu(out)

        return out

Bottleneck은 ResNet-18에서 사용되지 않는다. 자세한 내용은 레퍼런스에서 참고할 수 있다.

ResNet-18

class ResNet(nn.Module):
    def __init__(
        self,
        block: Type[Union[BasicBlock, Bottleneck]],
        layers: List[int],
        num_classes: int = 1000,
        zero_init_residual: bool = False,
        norm_layer: Optional[Callable[..., nn.Module]] = None
    )-> None:
        super(ResNet, self).__init__()
        if norm_layer is None:
            norm_layer = nn.BatchNorm2d
        self._norm_layer = norm_layer # batch norm layer

        self.inplanes = 64 # input shape
        self.dilation = 1 # dilation fixed
        self.groups = 1 # groups fixed

        # input block
        self.conv1 = nn.Conv2d(3, self.inplanes, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = norm_layer(self.inplanes)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

        # residual blocks
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2, dilate=False)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2, dilate=False)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2, dilate=False)
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1))
        self.fc = nn.Linear(512 * getattr(block, 'expansion', 1), num_classes)  # Bottleneck outputs 512 * 4 channels

        # weight initialization
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight, mode='fan_out', nonlinearity='relu')
            elif isinstance(m, (nn.BatchNorm2d, nn.GroupNorm)):
                nn.init.constant_(m.weight, 1)
                nn.init.constant_(m.bias, 0)

        # Zero-initialize the last BN in each residual branch,
        # so that the residual branch starts with zeros, and each residual block behaves like an identity.
        # This improves the model by 0.2~0.3% according to https://arxiv.org/abs/1706.02677
        if zero_init_residual:
            for m in self.modules():
                if isinstance(m, Bottleneck):
                    nn.init.constant_(m.bn3.weight, 0)  # type: ignore[arg-type]
                elif isinstance(m, BasicBlock):
                    nn.init.constant_(m.bn2.weight, 0)  # type: ignore[arg-type]

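    def _make_layer(self, block: Type[Union[BasicBlock, Bottleneck]], planes: int, blocks: int, stride: int=1, dilate: bool = False) -> nn.Sequential:
        norm_layer = self._norm_layer
        downsample = None
        # BasicBlock keeps the channel count (expansion 1); Bottleneck multiplies it by 4
        expansion = getattr(block, 'expansion', 1)

        # Create a downsample layer when the shortcut has to change resolution or channels
        if stride != 1 or self.inplanes != planes * expansion:
            downsample = nn.Sequential(
                conv1x1(self.inplanes, planes * expansion, stride),
                norm_layer(planes * expansion)
            )

        layers = []
        # Keyword arguments are used because BasicBlock and Bottleneck order their parameters differently
        layers.append(block(self.inplanes, planes, stride=stride, downsample=downsample, groups=self.groups, dilation=self.dilation, norm_layer=norm_layer))
        self.inplanes = planes * expansion
        for _ in range(1, blocks):
            layers.append(block(self.inplanes, planes, groups=self.groups, dilation=self.dilation, norm_layer=norm_layer))

        return nn.Sequential(*layers)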

    def forward(self, x: Tensor) -> Tensor:
        print('input shape:', x.shape)
        x = self.conv1(x)
        print('conv1 shape:', x.shape)
        x = self.bn1(x)
        print('bn1 shape:', x.shape)
        x = self.relu(x)
        print('relu shape:', x.shape)
        x = self.maxpool(x)
        print('maxpool shape:', x.shape)

        x = self.layer1(x)
        print('layer1 shape:', x.shape)
        x = self.layer2(x)
        print('layer2 shape:', x.shape)
        x = self.layer3(x)
        print('layer3 shape:', x.shape)
        x = self.layer4(x)
        print('layer4 shape:', x.shape)

        x = self.avgpool(x)
        print('avgpool shape:', x.shape)
        x = torch.flatten(x, 1)
        print('flatten shape:', x.shape)
        x = self.fc(x)
        print('fc shape:', x.shape)

        return x

ResNet 코드에서는 BasicBlock을 이용하여 Residual Blocks를 구성하는 _make_layer 함수를 구현하고 이를 통해서 Block을 쌓는다.

또한 입력 맨 처음 단의 Input Block 또한 별도로 구성되어 있다.

Result

model = ResNet(BasicBlock, [2, 2, 2, 2])
x = torch.randn(1, 3, 112, 112)
print('\noutput shape: ', model(x).shape)

ResNet-18 has two blocks in each stage, so [2, 2, 2, 2] is passed in.

input shape: torch.Size([1, 3, 112, 112])
conv1 shape: torch.Size([1, 64, 56, 56])
bn1 shape: torch.Size([1, 64, 56, 56])
relu shape: torch.Size([1, 64, 56, 56])
maxpool shape: torch.Size([1, 64, 28, 28])
layer1 shape: torch.Size([1, 64, 28, 28])
layer2 shape: torch.Size([1, 128, 14, 14])
layer3 shape: torch.Size([1, 256, 7, 7])
layer4 shape: torch.Size([1, 512, 4, 4])
avgpool shape: torch.Size([1, 512, 1, 1])
flatten shape: torch.Size([1, 512])
fc shape: torch.Size([1, 1000])

output shape: torch.Size([1, 1000])

References

Deep Residual Learning for Image Recognition (2015) - paper
ResNet paper review - YouTube
Short Connection and Identity Mapping - blog
ResNet paper review - blog
PyTorch implementation - blog

Transformer에 대해 조금만 알아보자

Sun, 07 Jul 2024 11:45:50 GMT

소개

Transformer는 Attention Mechanism 만을 적용하여 Recurrent Network의 한계를 극복한 신경망 아키텍처이다. 2017년에 구글이 발표한 논문인 “Attention is all you need”을 통해 소개되었고 GPT, BERT의 기반이 되는 등 현재까지도 자연어 처리 분야에 큰 영향력을 끼치고 있다.

Recurrent Model의 문제점

Transformer 이전의 언어 모델은 순환 신경망(RNN) 기반으로 만들어졌다. RNN은 연속적인 데이터를 입력으로 받고 현재 시점의 hidden state가 다음 hidden state에 영향을 주는 구조이다. 이로써 자연어 문장과 같은 순차적 입력에 대해 순서를 고려한 출력을 얻을 수 있는 것이다.

하지만 이러한 순차적인 구조는 몇 가지 문제점들이 있다. 우선, 길이가 긴 입력에 대해서 뒤로 갈수록 앞선 입력에 대한 정보를 모두 기억하기 어렵다. 또한 순서대로 입력을 처리해야하므로 (병렬 처리가 불가능하여) 연산 시간 단축에 한계가 있다.

Transformer 기본 구조

예시) 영어-독일어 번역 모델

Transformer는 인코더와 디코더를 N개씩 쌓은 구조로 되어있으며, 마지막 인코더의 출력이 각 디코더에 영향을 미친다. 논문에서는 N=6을 사용하였다.

On the decoder side, a special start-of-string token marks the beginning of the sentence, and a matching end-of-string token marks its end.

Transformer는 위 그림과 같이 문장 내 단어들을 순서대로 입력하는 것이 아니라 병렬적으로 동시에 입력하게 된다.

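Positional Encoding

If the Transformer does not receive its input sequentially, how does it account for word positions? Adding position information to each word's embedding vector is called positional encoding.

In the positional encoding function, pos is the position of the embedding vector within the input sentence, and i indexes the dimensions inside the embedding vector. Even dimension indices (2i) use the sine function and odd ones (2i + 1) use the cosine function.

PE_{(pos, 2i)} = \sin(pos / 10000^{2i / d_{model}})

PE_{(pos, 2i+1)} = \cos(pos / 10000^{2i / d_{model}})

d_model is a Transformer hyperparameter that sets the output dimension of every layer. It is also the dimension of the embedding vectors here, and it appears repeatedly from this point on. The example uses d_model = 4 for simplicity, whereas the paper uses 512.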
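As a rough illustration of the two formulas, the sketch below fills a (seq_len, d_model) table in plain Python. The function name positional_encoding and the toy sizes are choices made for this example only.

```python
import math

def positional_encoding(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position values."""
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for k in range(d_model):
            i = k // 2  # dimensions 2i and 2i+1 share the same frequency
            angle = pos / (10000 ** (2 * i / d_model))
            # even dimension -> sin, odd dimension -> cos
            pe[pos][k] = math.sin(angle) if k % 2 == 0 else math.cos(angle)
    return pe

# toy setting from the text: d_model = 4, a four-word sentence
pe = positional_encoding(seq_len=4, d_model=4)
for pos, row in enumerate(pe):
    print(pos, [round(v, 4) for v in row])
```

The table is added element-wise to the word embeddings, so it must have the same shape as the embedded sentence.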

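Attention

Attention is a mechanism that computes the relationships between words and focuses on the important information. The Transformer uses the following kinds of attention.

Encoder Self-Attention applies attention using only the encoder's own Query, Key, and Value vectors, taking all positions into account. Masked Decoder Self-Attention and Encoder-Decoder Attention take place in the decoder and are applied slightly differently. The details are explained below.

The figure shows where each of the three kinds of attention introduced above is applied. "Multi-head" means that attention is computed in parallel.

Encoder

A sentence that has gone through positional encoding is fed into the encoder and passes through num_layers encoder layers (6 in the paper), each of which consists of two sublayers. Multi-head Self-Attention runs self-attention in parallel, and FFNN stands for Feed Forward Neural Network. Let us first understand the core component, self-attention.

Self-Attention

The attention function computes the similarity between a query and each key, uses those similarities as weights, and returns the weighted sum of the values. The query can be thought of as the word whose relationships we want to find out.

Self-attention is a kind of attention that uses the Q, K, and V vectors derived from every word vector of the input sentence. It is called self-attention because the attention happens within a single sentence rather than against another sentence.

The similarity computation expresses, as probabilities, that "it" in the input sentence most likely refers to "animal" rather than "street". The example above runs attention with "it" as the query; self-attention does the same for every word.

Computing the Q, K, V Vectors

Example: the sentence "I am a student"

To obtain the Q, K, and V vectors from the positionally encoded input (each word vector has d_model dimensions), the input is multiplied by a separate weight matrix for each of them. The resulting Q, K, and V vectors have size d_model / num_heads. The paper uses d_model = 512 and num_heads = 8.

The example uses d_model = 4 and num_heads = 2 for simplicity.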

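Scaled dot-product Attention

The Q vector represents the word whose similarity to the rest of the sentence we want to compute. The K vectors represent the words it is compared against, and the attention score is obtained as the dot product of the two.

The example below computes the attention scores between the word "I" and the other words in the sentence "I am a student".

When computing the attention score, the dot product is divided by \sqrt{d_k}, which rescales the result and is the reason for the word "scaled". d_k is the d_model / num_heads obtained earlier (in the paper d_k = 64, so \sqrt{d_k} = 8).

Applying the softmax function to these attention scores gives the attention distribution. Using it as weights, the weighted sum of the V vectors is the final attention value.

So far, to aid understanding, the computation has been described for a single word. In practice, self-attention is carried out for all words at once through matrix operations.

Matrix Operations

With matrix operations, the Q, K, and V vectors for all words can be computed in one step.

The dot products work the same way: a single matrix product yields the attention scores for every pair of words, and the attention values for all words follow from it.

The whole computation is written as the formula below, where Attention denotes the attention value matrix.

Attention(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{T}}{\sqrt{d_{k}}}\right) V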
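The formula can be traced step by step on tiny matrices. This is a minimal sketch in plain Python, not an efficient implementation; the helper names (matmul, transpose, softmax, attention) and the 2 x 2 toy values for Q, K, and V are made up for the example.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def softmax(row):
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """softmax(Q K^T / sqrt(d_k)) V, returning (values, weights)."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))                           # attention scores
    scaled = [[s / math.sqrt(d_k) for s in row] for row in scores]
    weights = [softmax(row) for row in scaled]                 # attention distribution
    return matmul(weights, v), weights                         # weighted sum of V

# toy inputs: 2 words, d_k = 2
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0]]
out, weights = attention(q, k, v)
print(weights)
print(out)
```

Each row of weights sums to 1, and each output row has the same width as a row of V.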

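Multi-Head Attention

The reason the Q, K, and V vectors were given d_model / num_heads dimensions earlier, instead of keeping the input's d_model dimensions, is to allow the multi-head computation.

The authors of the paper judged that running several attentions in parallel is more effective than a single attention for learning from diverse perspectives.

Each head performs self-attention, and the results are concatenated along the embedding dimension. The attention value matrix obtained by concatenation has d_model dimensions again.

The concatenated matrix is multiplied by one more weight matrix, giving a matrix of size (seq_len, d_model), the same as the input, which is the final output of Multi-head Self-Attention.

The output is kept the same size as the input for two reasons: it has to be fed into the next encoder, which has the same structure, and the residual connection requires it.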

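Position-wise FFNN

The FFNN is a sublayer that the encoder and the decoder have in common.

It can be written as follows.

FFNN(x) = \max(0, x W_{1} + b_{1}) W_{2} + b_{2}

x is the output of the Multi-Head Attention computed above, a matrix of size (seq_len, d_model). The weight W_1 has size (d_model, d_ff) and W_2 has size (d_ff, d_model). The paper uses d_ff = 2048. Within one encoder layer the same weights are applied at every position, while each encoder layer has its own set of weights.

The output of the FFNN also keeps the size (seq_len, d_model), just like self-attention. This allows the output to serve as the input of the next encoder.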
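A small worked example makes the shape bookkeeping concrete. The sketch below assumes toy sizes (d_model = 2, d_ff = 3) and hand-picked weights chosen only so the arithmetic is easy to follow; the function names are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def ffnn(x, w1, b1, w2, b2):
    """max(0, x W1 + b1) W2 + b2, applied to every row (position) of x."""
    hidden = [[max(0.0, h + b) for h, b in zip(row, b1)] for row in matmul(x, w1)]
    return [[o + b for o, b in zip(row, b2)] for row in matmul(hidden, w2)]

# toy sizes: seq_len = 2, d_model = 2, d_ff = 3 (the paper uses 512 and 2048)
x  = [[1.0, -2.0], [3.0, 4.0]]             # (seq_len, d_model)
w1 = [[1.0, 0.0, 1.0], [0.0, 1.0, -1.0]]   # (d_model, d_ff)
b1 = [0.0, 0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # (d_ff, d_model)
b2 = [0.5, -0.5]

out = ffnn(x, w1, b1, w2, b2)
print(out)  # same shape as x: (seq_len, d_model)
```

The same w1, b1, w2, b2 are used for both rows of x, which is what "position-wise" refers to.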

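Residual Connection & Layer Normalization

One part has been skipped so far in describing the sublayers: the "Add & Norm" attached to the output of each sublayer. Add stands for the residual connection and Norm for layer normalization.

Residual Connection

It was explained earlier that the final output of each sublayer must be the same size as its input; this is because the residual connection adds the input to the output. Residual connections are used to make deep neural networks easier to train.

For Multi-head Attention, for example, it can be written as follows.

H(x) = x + \mathrm{MultiHeadAttention}(x)

Layer Normalization

The result of going through the sublayer, the residual connection, and then layer normalization can be written as follows.

LN = \mathrm{LayerNorm}(x + \mathrm{Sublayer}(x))

Layer normalization computes the mean and variance over the last dimension of the tensor and normalizes with them, which helps training. Here the last dimension is d_model.

The mean and variance are computed along the d_model dimension, and each vector along the direction of the arrows is normalized with those values. The normalized vector is written as below.

ln_{i} = \mathrm{LayerNorm}(x_{i})

For the actual computation, each element (scalar) of the original vector is first normalized as follows.

\hat{x}_{i,k} = \frac{x_{i,k} - \mu_{i}}{\sqrt{\sigma_{i}^{2} + \epsilon}}

Here epsilon is a small value that prevents the denominator from becoming zero.

Then two learnable parameter vectors, \gamma (commonly initialized to 1) and \beta (commonly initialized to 0), are introduced, which completes the final formula.

ln_{i} = \gamma \hat{x}_{i} + \beta = \mathrm{LayerNorm}(x_{i})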
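The Add & Norm step can be sketched in a few lines. This is a minimal illustration under stated assumptions: eps = 1e-5 is an assumed value, gamma and beta are set to 1 and 0, and the sublayer output is a made-up constant row rather than the result of a real attention layer.

```python
import math

def layer_norm(vec, gamma, beta, eps=1e-5):
    """Normalize one vector over its last (d_model) dimension, then scale and shift."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [g * (v - mean) / math.sqrt(var + eps) + b
            for v, g, b in zip(vec, gamma, beta)]

def add_and_norm(x, sublayer_out, gamma, beta):
    """LayerNorm(x + Sublayer(x)), applied row by row."""
    return [layer_norm([a + s for a, s in zip(x_row, s_row)], gamma, beta)
            for x_row, s_row in zip(x, sublayer_out)]

d_model = 4
gamma = [1.0] * d_model  # assumed initial value
beta = [0.0] * d_model   # assumed initial value

x = [[1.0, 2.0, 3.0, 4.0]]             # one position, d_model = 4
sublayer_out = [[0.5, 0.5, 0.5, 0.5]]  # stand-in for attention or FFNN output
out = add_and_norm(x, sublayer_out, gamma, beta)
print([round(v, 4) for v in out[0]])
```

With gamma = 1 and beta = 0, each normalized row has mean close to 0 and variance close to 1.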

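Decoder

So far we have covered encoder self-attention, the position-wise FFNN, the residual connection, and layer normalization. The decoder has a similar structure, but its two attention layers differ somewhat from the encoder's.

The decoder consists of three sublayers: Masked Self-Attention, Encoder-Decoder Attention, and the position-wise FFNN. The FFNN and the Add & Norm structure are identical, so only the two attention layers are discussed here.

Masked Self-Attention

Look at the first sublayer. Like the encoder, the decoder receives a positionally encoded sentence matrix. The difference is that the encoder receives the sentence to be translated, "I am a student", while the decoder receives the matrix for its translation, "<sos> je suis étudiant". This lets the decoder refer to the words it has already translated as it proceeds.

There is one problem here. At the point of predicting suis, for example, the model should refer only to <sos> and je, and must not look ahead at suis itself or the words after it. Unlike RNN-based models, the Transformer does not receive its input sequentially, so a look-ahead mask is introduced to prevent this.

Self-attention is computed as described earlier, and then the look-ahead is masked so that the word being predicted and the words after it cannot be referenced.

The black area is the masked part. Since a matrix with those entries actually removed cannot be built, the entries to be masked are set to a negative number with a very large absolute value so that they come out close to 0 after the softmax function.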
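The masking trick can be illustrated on a toy score matrix. In this sketch the token list (including the start token written as <sos>), the uniform scores, and the constant -1e9 are all assumptions made for the example; real scores would come from Q K^T / sqrt(d_k).

```python
import math

def look_ahead_mask(seq_len):
    """True where column j lies after row i, i.e. a position that must be hidden."""
    return [[j > i for j in range(seq_len)] for i in range(seq_len)]

def masked_softmax(scores, mask, neg=-1e9):
    """Replace masked scores with a very large negative number, then apply softmax."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        masked = [neg if hidden else s for s, hidden in zip(score_row, mask_row)]
        m = max(masked)
        exps = [math.exp(v - m) for v in masked]  # masked entries underflow to 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

tokens = ["<sos>", "je", "suis", "étudiant"]  # decoder input (assumed tokens)
scores = [[1.0] * len(tokens) for _ in tokens]  # dummy, uniform attention scores
weights = masked_softmax(scores, look_ahead_mask(len(tokens)))
for token, row in zip(tokens, weights):
    print(token, [round(w, 2) for w in row])
```

The row for je, which is the position used to predict suis, puts weight only on <sos> and je, as described above.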

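Encoder-Decoder Attention

Now look at the second sublayer. In Encoder-Decoder Attention the Query vectors come from the decoder, while the Key and Value vectors come from the encoder. Because the vectors come from different places, this is not self-attention. Through it the decoder refers to the encoder's output.

The remaining operations are the same as described above.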

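References

Attention Is All You Need - paper
Paper review by Dongbin Na (나동빈) - YouTube
딥 러닝을 이용한 자연어 처리 입문 (Introduction to Natural Language Processing Using Deep Learning) - Wikidocs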

JPA and the N+1 Problem

Thu, 30 May 2024 09:58:46 GMT

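1. What Is the N+1 Problem?

When entities are fetched, N additional queries are issued to fetch their associated entities ⇒ increased load on the DB

Example (1 : N relationship)

It can occur not only in 1:N but also in 1:1 and N:1 relationships!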

Entity

    @Entity
    @Getter
    @Builder
    @NoArgsConstructor(access = AccessLevel.PROTECTED)
    @AllArgsConstructor
    public class Post {

        @Id
        @GeneratedValue(strategy = GenerationType.IDENTITY)
        private Long id;
        private String title;
        private String content;

        // ...

        @Builder.Default
        @OneToMany(mappedBy = "post", cascade = CascadeType.ALL, orphanRemoval = true)
        private List<Image> images = new ArrayList<>(); // Image has an id and a url

        //... 
    }

Service

Assume five posts are loaded

    @Transactional(readOnly = true)
    public List<PostResponse> getPostList() {
        List<Post> posts = postRepository.findAll();

        return posts.stream()
                    .map(PostResponse::toResponse)
                    .collect(Collectors.toList());
    }

Query

An extra query is issued each time the url is read from a post's images (N = 5)

    Hibernate: select p1_0.id, p1_0.content, p1_0.title from post p1_0
    Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.id=?
    Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.id=?
    Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.id=?
    Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.id=?
    Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.id=?

postRepository's findAll() fetches the Post list with a single SELECT query
@OneToMany uses the LAZY fetch type by default ⇒ a proxy object is placed where the image list would be
When the Image data is accessed, N additional SELECT queries are issued to load the entity objects

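2. Why It Happens

Paradigm mismatch between objects and the RDB

An object can reach an associated object at any time through a reference, but in an RDB data can only be retrieved through SELECT queries, so additional queries are issued whenever an associated entity is accessed.

fetch type

Q. Couldn't we just use eager loading instead of lazy loading?

@OneToMany and @ManyToMany default to lazy (LAZY) loading

@ManyToOne and @OneToOne default to eager (EAGER) loading

⇒ With eager loading the N+1 problem occurs at the moment JPQL is executed, and unexpected queries may be issued, so it is not used in practice.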

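3. Solutions

1) fetch join

Characteristics

A fetch join retrieves the object graph with a single SQL statement
Associated entities are fetched together (eagerly) only where the fetch join is used ⇒ the global fetch strategy is ignored

Limitations

Two or more collections cannot be fetch joined.
When a collection is fetch joined, the paging API (setFirstResult, setMaxResults) cannot be used reliably (Hibernate logs a warning and pages in memory). ⇒ The number of rows changes when one-to-many data is fetched and duplicates are removed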

public interface PostRepository extends JpaRepository<Post, Long> {

    @Override
    @Query("select p from Post p join fetch p.images")
    List<Post> findAll();
}

Hibernate: 
    select
        p1_0.id,
        p1_0.content,
        i1_0.post_id,
        i1_0.id,
        i1_0.url,
        p1_0.title
    from
        post p1_0 
    join
        image i1_0 
            on p1_0.post_id=i1_0.post_post_id

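cf) From Hibernate 6 onward, duplicates are removed automatically (distinct)

Difference from a regular join

select p from Post p join p.images

JPQL does not take associations into account when returning results ⇒ only the entities named in the SELECT clause are fetched
Here only the Post entity is fetched; the Image entities are not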

2) EntityGraph

A feature that makes fetch joins more convenient to use

public interface PostRepository extends JpaRepository<Post, Long> {

    @Override
    @EntityGraph(attributePaths = {"images"})
    List<Post> findAll();
}

Unlike a fetch join, which uses an inner join by default, EntityGraph uses an outer join, so there may be performance issues

3) BatchSize

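When associated entities are fetched, an SQL IN clause is used with up to the configured size of ids
With eager loading this happens when the JPQL query is first executed; with lazy loading it happens when a lazily loaded entity is first accessed, and the IN clause fetches up to size entries at a time.
If 10 are fetched with @BatchSize(size = 5), JPA issues 2 queries (10 / 5 = 2).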

@BatchSize(size = 5)
@Entity
public class Image {
    ...
}

@Entity
public class Post {

    @BatchSize(size = 5)
    @OneToMany(mappedBy = "post", cascade = CascadeType.ALL, orphanRemoval = true)
    private List<Image> images = new ArrayList<>();
}

Hibernate: select p1_0.id, p1_0.content, p1_0.title from post p1_0
Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.post_id in (?, ?, ?, ?, ?)
Hibernate: select i1_0.id, i1_0.post_id, i1_0.url from image i1_0 where i1_0.post_id in (?, ?, ?, ?, ?)
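The query counts in the two logs can be compared with a simple counting model. The sketch below is not Hibernate code and does not reproduce its exact batching strategy; it only assumes one query for the post list, then either one query per post (lazy loading) or one IN query per batch. The function names are hypothetical.

```python
import math

def lazy_query_count(num_posts):
    """1 query for the post list + 1 query per post when its images are touched."""
    return 1 + num_posts

def batch_query_count(num_posts, batch_size):
    """1 query for the post list + one IN query per batch of post ids."""
    return 1 + math.ceil(num_posts / batch_size)

print(lazy_query_count(5))        # the N+1 case from section 1 (N = 5)
print(batch_query_count(10, 5))   # 10 posts with a batch size of 5
print(batch_query_count(10, 100)) # a large global batch size collapses to one IN query
```

In this model, 10 posts with a batch size of 5 need two IN queries on top of the initial select, which is the pattern shown in the log above.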

How to set the batch size globally

# application.yml

spring:
  jpa:
    properties:
      hibernate:
        default_batch_fetch_size: 100

References

https://www.inflearn.com/course/ORM-JPA-Basic
https://inma.tistory.com/165
https://ttl-blog.tistory.com/1135
https://velog.io/@xogml951/JPA-N1-%EB%AC%B8%EC%A0%9C-%ED%95%B4%EA%B2%B0-%EC%B4%9D%EC%A0%95%EB%A6%AC

Organizing Custom Exceptions Cleanly in Spring (feat. API Response Format)

Wed, 24 Apr 2024 09:15:47 GMT

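Introduction

Exception handling is essential for keeping a service running reliably. As the features of a Spring application grew, more exceptions had to be defined per domain and handled. This post summarizes the problems that came up in that process and how they were solved.

Problems (the previous approach)

Adding a class for every custom exception

Because a class was defined for every exception, ExceptionHandler code had to be written for each custom exception.

⇒ Unnecessarily long code, hard to maintain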

PostNotFoundException

public class PostNotFoundException extends IllegalArgumentException {

    public PostNotFoundException(Long postId) {
        super("해당 게시글이 없습니다.(게시글 ID: " + postId +")");
    }

}

Using multiple @RestControllerAdvice classes

ExceptionHandlers were defined per domain with @RestControllerAdvice as shown below, and then only some of the ExceptionHandlers worked.

⇒ The error was caused by @RestControllerAdvice being declared in several places
⇒ Use only one, or specify the order with @Order!

PostExceptionHandler

@RequiredArgsConstructor
@RestControllerAdvice
public class PostExceptionHandler {

    private final ApiResponse apiResponse;

    // Handles the case where the post cannot be found
    @ExceptionHandler(PostNotFoundException.class)
    public ResponseEntity handlePostNotFoundException(PostNotFoundException e) {
        return apiResponse.error(e.getMessage(), HttpStatus.NOT_FOUND);
    }

    // Handles the case where the user has no permission to edit or delete the post
    @ExceptionHandler(PostOwnershipException.class)
    public ResponseEntity handlePostOwnershipException(PostOwnershipException e) {
        return apiResponse.error(e.getMessage(), HttpStatus.UNAUTHORIZED);
    }

}

Exception Handling with ExceptionHandler

@ExceptionHandler

Handles exceptions thrown in a @(Rest)Controller. It takes exception classes as its value to specify which exceptions to handle.

⇒ Each controller can handle its own exceptions with ExceptionHandler, but the controller code gets longer.

@RestController
public class PostController {

    // ...

    @ExceptionHandler(RuntimeException.class)
    public ResponseEntity handleRuntimeException(RuntimeException ex) {
    // ...
    }

}

@RestControllerAdvice

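Handles exceptions thrown in multiple controllers globally.

⇒ There is no need to define ExceptionHandlers inside the controllers; a single class can take charge of exception handling.

It includes @ResponseBody, so responses are returned as JSON. It also includes @Component, so it is registered as a Spring bean.

Solution

Defining Status Codes as enums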

ErrorCode

public interface ErrorCode {

    String name();
    HttpStatus getHttpStatus();
    String getMessage();
}

Status codes are managed per domain to make maintenance easier.

PostErrorCode

@Getter
@RequiredArgsConstructor
public enum PostErrorCode implements ErrorCode {

    NO_PERMISSION(HttpStatus.UNAUTHORIZED, "User not have permission to post"),
    POST_NOT_FOUND(HttpStatus.NOT_FOUND, "Post not found"),
    ;

    private final HttpStatus httpStatus;
    private final String message;
}

Defining a Single CustomException and Applying a Handler

Unlike the previous approach (creating a class per exception), a single CustomException is defined and only the error code defined above is swapped in.

CustomException

@Getter
@RequiredArgsConstructor
public class CustomException extends RuntimeException {

    private final ErrorCode errorCode;
}

Usage example

public Post validatePostExists(Long postId) {
    Post post = postRepository.findById(postId)
            .orElseThrow(() -> new CustomException(PostErrorCode.POST_NOT_FOUND));
    return post;
}

Since only one ExceptionHandler is needed to handle CustomException, there is no longer any need for several @RestControllerAdvice classes!

GlobalExceptionHandler

@RequiredArgsConstructor
@RestControllerAdvice
public class GlobalExceptionHandler {

    private final ApiResponse apiResponse;

    @ExceptionHandler(CustomException.class)
    public ResponseEntity handleCustomException(CustomException e) {
        ErrorCode errorCode = e.getErrorCode();
        return apiResponse.error(errorCode.getMessage(), errorCode.getHttpStatus());
    }

    @ExceptionHandler(IllegalArgumentException.class)
    public ResponseEntity handleIllegalArgument(IllegalArgumentException e) {
        ErrorCode errorCode = CommonErrorCode.INVALID_PARAMETER;
        return apiResponse.error(errorCode.getMessage(), errorCode.getHttpStatus());
    }

    // ...

    @ExceptionHandler(Exception.class)
    public ResponseEntity handleAllException(Exception e) {
        ErrorCode errorCode = CommonErrorCode.INTERNAL_SERVER_ERROR;
        return apiResponse.error(e.getMessage(), errorCode.getHttpStatus());
    }
}

Exceptions defined in advance are handled by handleCustomException, and exceptions that are not a CustomException are handled by the other ExceptionHandlers.

Response example

The predefined POST_NOT_FOUND exception is applied correctly.

// 404 Not Found
{
  "status": "error",
  "message": "Post not found"
}

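Summary

This post covered exception handling in Spring with ExceptionHandler. Instead of declaring every custom exception as a class and applying them across several RestControllerAdvice classes, a single custom exception class was declared and enum status codes were applied, which improved the maintainability and extensibility of the code.

References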

[Spring] 스프링의 다양한 예외 처리 방법(ExceptionHandler, ControllerAdvice 등) 완벽하게 이해하기 - (1/2)

[Spring] @RestControllerAdvice를 이용한 Spring 예외 처리 방법 - (2/2)

[공부/Spring] @ExceptionHandler 예외처리

Appendix) API Response Format

REST API Response Format, 응답 객체는 어떤 형식이 좋을까?

The ApiResponse used in the examples above defines a JSend-style response format, as shown below. It returns, as JSON, whether the API request succeeded or failed together with a message (and data).

@Component
public class ApiResponse {

    private static final String STATUS_SUCCESS = "success";
    private static final String STATUS_ERROR = "error";

    private <T, E> ResponseEntity get(String status, @Nullable String message, @Nullable T data, @Nullable E errors, HttpStatus httpStatus) {

        if (status.equals(STATUS_SUCCESS)) {
            return new ResponseEntity<>(SucceededBody.builder()
                    .status(status)
                    .message(message)
                    .data(data)
                    .build(),
                    httpStatus);
        } else if (status.equals(STATUS_ERROR)) {
            return new ResponseEntity<>(ErroredBody.builder()
                    .status(status)
                    .message(message)
                    .build(),
                    httpStatus);
        } else {
            throw new RuntimeException("Api Response Error");
        }
    }

    // Success response (status, message, data)
    // {
    //      "status" : "success",
    //      "message" : "success message",
    //      "data" : "array or single item"
    // }
    public <T> ResponseEntity success(String message, T data, HttpStatus httpStatus) {
        return get(STATUS_SUCCESS, message, data, null, httpStatus);
    }

    // Success response (status, data)
    // {
    //      "status" : "success",
    //      "message" : null,
    //      "data" : "array or single item"
    // }
    public <T> ResponseEntity success(T data, HttpStatus httpStatus) {
        return get(STATUS_SUCCESS, null, data, null, httpStatus);
    }

    // Success response (status, message)
    // {
    //      "status" : "success",
    //      "message" : "success message",
    //      "data" : null
    // }
    public ResponseEntity success(String message, HttpStatus httpStatus) {
        return get(STATUS_SUCCESS, message, null, null, httpStatus);
    }

    // Success response (status only)
    // {
    //      "status" : "success",
    //      "message" : null,
    //      "data" : null
    // }
    public ResponseEntity success(HttpStatus httpStatus) {
        return get(STATUS_SUCCESS, null, null, null, httpStatus);
    }

    // Error response returned when an exception occurs
    // {
    //      "status" : "error",
    //      "message" : "custom error message"
    // }
    public ResponseEntity error(String message, HttpStatus httpStatus) {
        return get(STATUS_ERROR, message, null, null, httpStatus);
    }

    // Body of a success response
    @Builder
    @Setter
    @Getter
    @AllArgsConstructor
    @NoArgsConstructor
    public static class SucceededBody<T> {

        private String status;
        private String message;
        private T data;
    }

    // Body of an error response
    @Builder
    @Setter
    @Getter
    @AllArgsConstructor
    @NoArgsConstructor
    public static class ErroredBody {

        private String status;
        private String message;
    }

}

[Java] Polymorphism

Fri, 15 Mar 2024 15:10:51 GMT

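1. What Is Polymorphism?

In object-oriented programming, polymorphism means 'the ability to take many forms'. Concretely, it is the property that a reference variable of an ancestor class type can refer to an instance of a descendant class.

class Tv {
    boolean power;      // power state (on/off)
    int channel;        // channel

    void power()       { power = !power; }
    void channelUp()   { ++channel; }
    void channelDown() { --channel; }
}

class CaptionTv extends Tv {
    String text;    // string used to show the caption
    void caption() { /* body omitted */ }
}

Given these classes, Tv and CaptionTv are in an inheritance relationship. Instances of the two classes can be created as follows.

Tv t = new Tv();
CaptionTv c = new CaptionTv();

Is it also possible to refer to an instance through a reference variable of an ancestor type or a descendant type, rather than one of the same type?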

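Referring through an ancestor-type reference variable

Tv t = new CaptionTv();

A CaptionTv instance can be referenced through the Tv-type reference variable t, but a Tv-type reference variable cannot use members of the CaptionTv instance that are not defined in the Tv class. For example, t.power() can be used but t.caption() cannot.

Of the members an instance has, only those within the range defined by the reference variable's type can be used.

Referring through a descendant-type reference variable

CaptionTv c = new Tv();

This code fails to compile, because the reference variable c would be able to use more members than the actual Tv instance has. In other words, the number of members a reference variable can use must be equal to or fewer than the number of members of the instance.

An ancestor-type reference variable can refer to a descendant-type instance. Conversely, a descendant-type reference variable cannot refer to an ancestor-type instance.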

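2. Casting Reference Variables

Like primitive variables, reference variables can be cast, but only between classes in an inheritance relationship (ancestor type <--> descendant type).

Just as the cast can be omitted when converting a smaller primitive type to a larger one, the cast can be omitted when converting a descendant-type reference variable to an ancestor type.

Tv t = new Tv();
CaptionTv c = (CaptionTv)t; // (CaptionTv) cannot be omitted; this compiles, but throws ClassCastException at run time because t refers to a Tv instance

CaptionTv c = new CaptionTv();
Tv t = (Tv)c; // (Tv) can be omitted

Descendant type -> ancestor type (up-casting): the cast can be omitted
Ancestor type -> descendant type (down-casting): the cast cannot be omitted

Tv t = null;
CaptionTv c = new CaptionTv();
CaptionTv c2 = null;

c.caption();
t = c; // the cast in t = (Tv)c; is omitted
t.caption(); // compile error! caption() cannot be called through a Tv-type reference variable

c2 = (CaptionTv)t; // down-casting, so the cast cannot be omitted
c2.caption();

Casting changes the type of the reference variable, not the instance, so casting a reference variable has no effect on the instance.

In the example above the instance created is a CaptionTv, so only the range of members that can be called changes with the type of the reference variable that refers to it. Even for the same instance, caption() cannot be called through a Tv-type reference variable but can be called through a CaptionTv-type reference variable.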

3. The instanceof Operator

The instanceof operator is used (mostly in conditionals) to find out the actual type of the instance a reference variable refers to. The reference variable goes on the left and the type (class name) on the right as operands, and the result is a boolean.

class InstanceofTest {
    public static void main(String args[]) {
        CaptionTv c = new CaptionTv();

        if(c instanceof CaptionTv) {
             System.out.println("This is a CaptionTv instance.");
        }
        if(c instanceof Tv) {
             System.out.println("This is a Tv instance.");
        }
        if(c instanceof Object) {
             System.out.println("This is an Object instance.");
        }
        System.out.println(c.getClass().getName()); // prints the class name
    }
}

This is a CaptionTv instance.
This is a Tv instance.
This is an Object instance.
CaptionTv

If instanceof returns true for a type, the reference can be cast to that type.

4. Binding Between Reference Variables and Instances

If a descendant class redefines a member variable of its ancestor class, the instance variable that is accessed depends on the type of the reference variable.

class BindingTest{
    public static void main(String[] args) {
        Parent p = new Child();
        Child c = new Child();

        System.out.println("p.x = " + p.x);
        p.method();

        System.out.println("c.x = " + c.x);
        c.method();
    }
}

class Parent {
    int x = 100;

    void method() {
        System.out.println("Parent Method");
    }
}

class Child extends Parent {
    int x = 200;

    void method() {
        System.out.println("Child Method");
    }
}

Output

p.x = 100
Child Method
c.x = 200
Child Method

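Unlike methods, when a member variable is duplicated, the member variable that is used depends on the type of the reference variable, regardless of the type of the instance.

For this reason it is good practice to restrict member variables with private and let outside code reach them only through methods.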

[Spring Basics] Basic Behavior of Spring Boot

Sun, 03 Mar 2024 11:13:38 GMT

1. Welcome Page

If static/index.html is provided, Spring Boot serves it as the Welcome Page.

resources/static/index.html




    Hello
    


Hello
hello

2. Controller and viewResolver

Request /hello => the hello method is called through @GetMapping("hello") => a data attribute holding "hello!!" is added to the model => the view name "hello" is returned => viewResolver returns hello.html

When a controller returns a string, the view resolver (viewResolver) finds the view and renders it. Spring Boot's default view name mapping for template engines: resources:templates/ + {ViewName} + .html

@Controller
public class HelloController {

    @GetMapping("hello")
    public String hello(Model model){
        model.addAttribute("data", "hello!!");
        return "hello";
    }
}

resources/templates/hello.html (thymeleaf)




    Hello
    


안녕하세요. 손님

3. Static Content

The static content static/hello-static.html is returned directly, without going through HelloController or viewResolver

resources/static/hello-static.html




    static content
    


정적 컨텐츠 입니다.

4. @RequestParam()

@RequestParam("name") => the name attribute is added from the query parameter => hello-template.html is returned

Controller

 @Controller
 public class HelloController {
     @GetMapping("hello-mvc")
     public String helloMvc(@RequestParam("name") String name, Model model) {
         model.addAttribute("name", name);
         return "hello-template";
     }
}

View


 
 hello! empty

5. @ResponseBody

1) 문자 반환

The string is written "directly" to the HTTP body, without viewResolver

 @Controller
 public class HelloController {
     @GetMapping("hello-string")
     @ResponseBody
     public String helloString(@RequestParam("name") String name) {
         return "hello " + name;
     }
}

2) 객체 반환

When an object is returned together with @ResponseBody, the object is converted to JSON

 @Controller
 public class HelloController {
     @GetMapping("hello-api")
     @ResponseBody
     public Hello helloApi(@RequestParam("name") String name) {
         Hello hello = new Hello();
         hello.setName(name);
         return hello;
     }
     static class Hello {
         private String name;
         public String getName() {
             return name;
         }
         public void setName(String name) {
             this.name = name;
         }
     }
}

Build and run: go to the console

./gradlew build
cd build/libs
java -jar hello-spring-0.0.1-SNAPSHOT.jar
Check that it runs

References

스프링 입문 - 코드로 배우는 스프링 부트, 웹 MVC, DB 접근 기술
