juho.log

Multi-GPU Notes

Thu, 01 Aug 2024 06:05:51 GMT

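These are notes I put together in order to use multiple GPUs. They are mainly based on 퍼렁이's "Multi-GPU" blog posts. https://m.blog.naver.com/jjunsss/222920508815 https://tkayyoo.tistory.com/27

Parallel processing for PyTorch

Data Parallel (DP): single-process, multi-thread, single machine only
Distributed Data Parallel (DDP): multi-process, works on a single machine or across several machines

Terminology

*multi-thread: several units of execution run inside one process and share its resources and program, so context switching is fast
*multi-process: several processes run, each with its own resources, so one of them dying does not bring down the rest

*RAM: the computer's main memory; general-purpose memory needed to run applications
*VRAM: memory that works together with the GPU, storing and managing the graphics data needed to render images on screen
-> RAM is general-purpose memory for the whole system, while VRAM can be used only by the GPU. RAM can handle many kinds of work but is slower than VRAM for graphics processing. VRAM is optimized for graphics data and feeds data to the GPU faster.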

1. Data Parallel (DP)

Usage: nn.DataParallel

Rarely used nowadays

Problem:

PyTorch's data-parallel procedure makes one GPU use an excessive amount of VRAM

![](https://velog.velcdn.com/images/zzooh_o/post/ddbd4ebf-83a1-4bb3-9b63-892bbe6c17e3/image.png)

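All the data first sits on one GPU, is distributed to the other GPUs for computation, and the results are then gathered and aggregated back on that one GPU. Because this happens in both the forward and the backward pass, VRAM usage concentrates on a single GPU.

Using GPUs in parallel

To use GPUs in parallel, DP goes through these steps:

Replicate the model (copy the model onto each GPU)
Scatter (split the batch of each iteration across the GPUs)
Progress (each GPU runs its share)
Gather (collect the outputs of every GPU back on one GPU)

-> Because one GPU ends up used far more heavily than the others, even the official PyTorch documentation does not recommend it.

Reference: https://m.blog.naver.com/jjunsss/222920508815

Workarounds

The VRAM imbalance in DP arises because the loss and the outputs are repeatedly gathered on one GPU and then redistributed to all GPUs. The fix is to compute the gradients, and even apply the update, on each GPU separately.

1. Change the main GPU to a different GPU ID -> the load still concentrates on one particular GPU, so this does not really solve the problem
2. Distribute the loss computation across the GPUs -> scatter the targets needed for the loss to every GPU, compute the loss on each GPU, then continue training the model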

2. Distributed Data Parallel (DDP)

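Usage: torch.nn.parallel.DistributedDataParallel (DDP), e.g. torch.nn.parallel.DistributedDataParallel(model, device_ids=[gpu])

Terminology:

--nproc_per_node=4: number of processes to launch per node (here, one server with 4 GPUs)
used as torch.distributed.launch --nproc_per_node=4
Processes are assigned to GPUs in no particular order. This number must always match the number of GPUs you intend to use; it is a required value.
node: number of machines (node=1 when using 4 GPUs installed in one machine)
world_size: (number of machines) x (GPUs per machine) = total number of GPUs (processes) used to train the model
torch.distributed.get_world_size(): returns world_size
torch.distributed.get_rank(): returns the global rank of the current process; on a single node this is the same as the local rank
Local Rank: in DDP code the loss, the model, set_device and so on have to be run per process, and the local rank is the value used to pick the device for each one. So if the code uses an argument parser, it must always declare it, e.g. parser.add_argument("--local_rank", type=int, default=0)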
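The relationship between node, local rank, global rank and world_size can be sketched with plain arithmetic. This is only an illustration: the helper names `world_size` and `global_rank` are made up for this note, and the sketch assumes ranks are numbered node by node, which is the usual convention for the launcher.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of processes (one per GPU) across all machines."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a process, assuming ranks are numbered node by node."""
    if not 0 <= local_rank < nproc_per_node:
        raise ValueError("local_rank must be in [0, nproc_per_node)")
    return node_rank * nproc_per_node + local_rank


# Two machines with 4 GPUs each.
print(world_size(nnodes=2, nproc_per_node=4))
# Third GPU (local_rank=2) on the second machine (node_rank=1).
print(global_rank(node_rank=1, nproc_per_node=4, local_rank=2))
```

On a single machine node_rank is 0, so the global rank and the local rank coincide.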

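Problems:

Training hangs when some parameters are not used in training: if training stops midway when you try to report an error or print something during training, this is usually the cause. Try running on the main GPU alone first.
Error-prone and slow when set up by hand, so most recent models do not use it this way
Uses memory evenly across all GPUs, but has rarely been used this way since launch appeared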

3. DDP launch

DDP with torch.distributed.launch (torch versions below 1.10)

Usage: torch.distributed.launch
python -m torch.distributed.launch --nproc_per_node 4 multigpu.py

Terminology:

On a single node, master_addr, port and so on are not needed
python -m is required in order to run torch.distributed.launch
node_rank: when there are several machines (servers), the index assigned to each machine
master_addr: results have to be collected on one host, so when TCP is used for this, the master_addr address must be specified
gpus_per_node: the variable that sets how many processes run on one node. If there are 4 GPUs in total but only 3 are to be used, list the 3 GPU ids in CUDA_VISIBLE_DEVICES and also set gpus_per_node to 3

Things to watch out for:

--nproc_per_node: sets the number of GPUs (processes) running on each node. Always required when launching from the command line.
python -m torch.distributed.launch: the -m option must be used. Without it, launch does not run
--local_rank must be declared inside the code: its value is assigned automatically according to the order in which Python starts the processes. Declare it in the code and use it appropriately

In DDP, torch.cuda.set_device(args.local_rank) is used to pin each process to its own GPU
This is not the recommended approach for PyTorch; the recommended way is to declare os.environ["CUDA_VISIBLE_DEVICES"] = "0,1,2,3" beforehand, then switch to the form torch.device(args.local_rank) and move each piece of data to that device

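DDP with torchrun (torch version 1.10 and above)

Usage: torchrun (the updated version of the torch.distributed.launch utility)
CUDA_VISIBLE_DEVICES=0,1,3,4 torchrun --nproc_per_node 4 main.py

Things to watch out for:

Unlike before, local_rank is no longer declared by hand as an argument but is read from the environment through os (there is no longer any need to force a local_rank parser argument).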
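Reading the values from the environment could look roughly like this. torchrun exports LOCAL_RANK, RANK and WORLD_SIZE to every process it starts; the helper name `read_dist_env` and the single-process fallback defaults are assumptions made for this sketch.

```python
import os


def read_dist_env(env=os.environ):
    """Return (local_rank, rank, world_size) from torchrun-style variables.

    Falls back to a single-process setup when the variables are missing,
    e.g. when the script is started with plain `python main.py`.
    """
    local_rank = int(env.get("LOCAL_RANK", 0))
    rank = int(env.get("RANK", 0))
    world_size = int(env.get("WORLD_SIZE", 1))
    return local_rank, rank, world_size


local_rank, rank, world_size = read_dist_env()
# In a real training script, local_rank would then be passed to
# torch.cuda.set_device(local_rank) before building the model.
print(local_rank, rank, world_size)
```

Taking the mapping as a parameter keeps the function easy to check without actually launching several processes.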

Reference: https://m.blog.naver.com/jjunsss/222920569248

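4. DDP applications

Applying different augmentations on each GPU

We saw above that each GPU runs in its own process; this can be exploited to apply a different augmentation on each GPU during training
Since each GPU trains with a different augmentation technique, the hope is that the resulting model is more robust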
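One way to pick a different augmentation per process is to index a list by rank. The sketch below is only an illustration: the augmentation names are hypothetical placeholders rather than real transforms, and seeding the random generator per rank is an extra assumption so that each process also draws different random parameters.

```python
import random

# Hypothetical augmentation names; in practice these would be real transforms.
AUGMENTATIONS = ["horizontal_flip", "color_jitter", "random_crop", "gaussian_blur"]


def pick_augmentation(rank: int) -> str:
    """Choose one augmentation per process, cycling if there are more ranks."""
    return AUGMENTATIONS[rank % len(AUGMENTATIONS)]


def make_rng(rank: int, base_seed: int = 0) -> random.Random:
    """Per-process random generator so each GPU draws different parameters."""
    return random.Random(base_seed + rank)


for rank in range(4):
    rng = make_rng(rank)
    print(rank, pick_augmentation(rank), round(rng.random(), 3))
```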

[Paper Review] Vision Transformers Need Registers

Wed, 31 Jul 2024 02:15:42 GMT

This post is a review of "Vision Transformers Need Registers", published at ICLR 2024.

Abstract

The paper analyzes artifacts that appear in the feature maps of supervised and self-supervised ViT networks
It shows that these artifacts, which appear at inference time in low-informative background areas of images, are caused by high-norm tokens
The paper addresses this by adding additional tokens (registers) to the ViT input sequence
This works as an effective fix for both supervised and self-supervised models, and sets a new SoTA for self-supervised visual models on visual prediction tasks

Introduction

1. Extracting good generic features from images is very important

Since large, high-quality datasets are hard to come by in practice, methods for extracting better features from the data at hand have been continually studied and proposed
These include supervised pretraining and self-supervised pretraining for use on downstream tasks

2. The DINO algorithm was shown to produce models that contain explicit information about the semantic layout of an image, and object detection (discovery) algorithms have been built on top of DINO to exploit this property

These algorithms can detect objects without supervision, just by gathering information from the attention maps
Also, the last attention layer attends to semantically consistent parts of the image and often produces interpretable attention maps
In the paper's figure, DINO shows almost no artifacts while DINOv2 does

3. This paper investigates why these artifacts appear and, further, proposes a way to detect them

Artifacts??

These artifacts are tokens that have roughly 10x higher norm at the output and correspond to a small fraction of the total sequence (about 2%)
They only appear after a sufficiently long training of a sufficiently big transformer
They hold less information about their original position in the image or the original pixels in their patch
The model discards the local information contained in these patches during inference
They contain global information about the image

Method

As Figure 2 shows, artifacts appear in the attention maps of almost all ViTs
The authors analyze these artifacts with a focus on why and when they appear

1. Artifacts are high-norm outlier tokens.

Comparing the DINO norms with the DINOv2 norms in Figure 3, the norm of the artifacts (yellow) is clearly higher
!!! From here on, the authors call tokens with "norm higher than 150" "high-norm tokens"
!!! They also state that "high-norm" and "outlier" are used interchangeably in the paper

2. Outliers appear during the training of large models.

(1) Fig4-(a): the authors analyze these outlier patches during DINOv2 training; the high-norm patches start to stand out around layer 15 of the 40-layer ViT
(2) Fig4-(b): looking at the distribution of norms during DINOv2 training, the outliers only appear after one third of training.
(3) Fig4-(c): among models of different sizes (Tiny, Small, Base, Large, Huge and giant), they appear only in the three largest models

3. High-norm tokens appear when patch information is redundant.

-> To check this, the cosine similarity between high-norm tokens and their 4 neighbors is computed right after the patch embedding layer.

Fig5-(a): high-norm tokens turn out to be very similar to their neighbors

That is, these patches contain redundant information, and the model could discard their information without hurting the quality of the image representation.

Very important!

4. High-norm tokens hold little local information.

-> To understand these tokens in more detail, the authors run experiments on them (Fig5-(b))

Position prediction: We train a linear model to predict the position of each patch token in the image, and measure its accuracy
- high-norm tokens have much lower accuracy than the other tokens
- this means these tokens hold less information about their position in the image
Pixel reconstruction: We train a linear model to predict the pixel values of the image from the patch embeddings, and measure the accuracy of this model.
- high-norm tokens achieve much lower accuracy than other tokens
- this means these tokens hold less information for reconstructing the image than the others

5. Artifacts hold global information.

-> To analyze how much global information these tokens hold, they are evaluated on standard image representation learning benchmarks

classification dataset
use DINOv2-g and extract the patch embeddings
choose a single token at random, either high-norm or normal
This token is then considered as the image representation
then train a logistic regression classifier to predict the image class from this representation
measure the accuracy

-> Result: the high-norm tokens have a much higher accuracy than the other tokens (Table 1). -> This suggests that outlier tokens contain more global information than other patch tokens.

Hypothesis and Remediation

Based on these observations, the authors put forward the following hypothesis, conclusion and solution

Hypothesis: large, sufficiently trained models learn to recognize redundant tokens, and to use them as places to store, process and retrieve global information

-> Indeed, it leads the model to discard local patch information (Tab. 5b), possibly incurring decreased performance on dense prediction tasks

Solution: explicitly add new tokens to the sequence, that the model can learn to use as registers.

-> add these tokens after the patch embedding layer, with a learnable value, similarly to the [CLS] token. -> At the end of the vision transformer, these tokens are discarded, and the [CLS] token and patch tokens are used as image representations, as usual.
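The bookkeeping behind registers can be sketched with plain Python lists standing in for token embeddings. This is not the authors' implementation: the function names are made up for this note, and appending the registers at the end of the sequence is an assumption (the notes only say they are added after the patch embedding layer).

```python
def build_sequence(cls_token, patch_tokens, register_tokens):
    """Input to the transformer: [CLS] + patch tokens + register tokens."""
    return [cls_token] + list(patch_tokens) + list(register_tokens)


def strip_registers(output_tokens, num_registers):
    """Discard register outputs; keep [CLS] and patch outputs as usual."""
    kept = output_tokens[: len(output_tokens) - num_registers]
    return kept[0], kept[1:]


# Strings stand in for embeddings; the identity stands in for the transformer.
sequence = build_sequence("cls", ["p0", "p1", "p2"], ["r0", "r1"])
cls_out, patch_out = strip_registers(sequence, num_registers=2)
print(len(sequence), cls_out, patch_out)
```

The registers take part in attention like any other token, but nothing downstream ever reads their outputs.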

Experiments

The authors stress that their solution is structurally very simple, so it can easily be applied to the training of existing models
They try it on three different state-of-the-art training methods for supervised, text-supervised, and unsupervised learning, shortly described below.
- DeiT-III
- OpenCLIP
- DINOv2

3.2 Evaluation of the proposed solution - Performance regression.

3.2 Evaluation of the proposed solution - Number of register tokens.

3.3 Object Discovery

3.4 Qualitative Evaluation of Registers

Conclusion

Summary from the paper

In this work, we exposed artifacts in the feature maps of DINOv2 models, and found this phenomenon to be present in multiple existing popular models.
We have described a simple method to detect these artifacts by observing that they correspond to tokens with an outlier norm value at the output of the Transformer model.
Findings: Models naturally recycle tokens from low-informative areas and repurpose them into a different role for inference.
Following this interpretation, we have proposed a simple fix, consisting of appending additional tokens to the input sequence that are not used as outputs -> found that this entirely removes the artifacts -> improving the performance in dense prediction and object discovery.

RLHF

Thu, 30 May 2024 08:13:11 GMT

These are notes for studying RLHF.

Reference link: https://ebbnflow.tistory.com/382 Reference paper: Direct Preference Optimization: Your Language Model is Secretly a Reward Model

Direct Preference Optimization, DPO

NeurIPS 2023, Direct Preference Optimization: Your Language Model is Secretly a Reward Model

A paper that turns RLHF into cross-entropy training by removing the step of training a reward model that earlier RLHF methods required
While training an LM, we want it to tell apart the quality of the data and to learn toward the high-quality side
However, an LM is trained by maximum likelihood on the dataset, so a smart LM can only be obtained if preferred responses and behaviors are selected and fed to the model.
So building an LM that is better than the dataset it has requires an RL-based approach
But existing RL methods had to fit a reward model on a preference dataset, which makes the training pipeline complex
DPO recasts the RL objective as a simple binary cross-entropy objective, simplifying preference learning.

In earlier RLHF, the pre-trained LM (already supervised fine-tuned) generates responses, humans rank them by preference, a reward model is trained on this preference loss, and then RL tuning is carried out

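RLHF (Reinforcement Learning from Human Feedback)

Source: https://eair.tistory.com/66

LLMs are trained with reinforcement learning from human feedback

RLHF learns a reward function from human feedback, in particular from comparisons between responses, and then applies RL to optimize the policy against the learned reward function

Supervised Fine-Tuning Stage: the pre-trained model learns, on a high-quality dataset, how to respond to human queries by generating likely answers

Given a pre-trained LM, supervised fine-tuning on a dataset for the specific task yields the SFT model

Reward Modeling Stage: the SFT model is given a prompt x and generates a pair of answers y1 and y2. Humans mark which answer they prefer, and a reward model is trained on these labels with a comparison loss

Feed prompt x to the SFT model from stage 1, generate two responses y, and have a human labeler label the preferred one y_w and the less preferred one y_l.
For the human preference probability, the Bradley-Terry model, which can model overall preference from pairwise preferences, is commonly used
Then, given y_w and y_l, the reward model is trained with the negative log-likelihood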
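The Bradley-Terry preference probability and the resulting negative log-likelihood can be written down in a few lines. In this sketch the rewards are plain numbers that would normally come from the reward model r(x, y); the function names are made up for illustration.

```python
import math


def sigmoid(z: float) -> float:
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_prob(reward_w: float, reward_l: float) -> float:
    """Bradley-Terry: P(y_w preferred over y_l) = sigmoid(r_w - r_l)."""
    return sigmoid(reward_w - reward_l)


def reward_nll(reward_pairs) -> float:
    """Mean negative log-likelihood over (r_w, r_l) pairs."""
    total = sum(math.log(preference_prob(r_w, r_l)) for r_w, r_l in reward_pairs)
    return -total / len(reward_pairs)


# The loss is small when preferred answers score higher, large otherwise.
print(reward_nll([(2.0, 0.0), (1.5, -0.5)]))
print(reward_nll([(0.0, 2.0), (-0.5, 1.5)]))
```

Minimizing this loss pushes the reward of y_w above the reward of y_l for every labeled pair.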

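RL Fine-Tuning Stage: the SFT model serves as the initialization for this stage, and the RL algorithm optimizes the policy to maximize reward while limiting its deviation from the initial policy

DPO

The paper replaces the KL-constrained RL objective that was used to optimize the RL policy with the DPO objective

The derivation is omitted.

Analyzing the gradient of the DPO objective derived this way:

Taking the gradient of L_DPO with respect to theta, it raises the log-likelihood of the preferred response and lowers that of the dispreferred one, while giving a larger weight (penalty) to examples where the reward is estimated wrongly.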
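For a single preference pair, the DPO loss and the per-example gradient weight can be sketched as follows. The sequence log-probabilities under the policy and the frozen reference model are assumed to be given as plain numbers, and beta=0.1 is only an illustrative default, not a value taken from these notes.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """beta * (implicit reward of y_w minus implicit reward of y_l)."""
    return beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))


def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """-log sigmoid(margin): a binary cross-entropy on the preference."""
    return -log_sigmoid(dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta))


def dpo_gradient_weight(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """sigmoid(-margin): large when the implicit reward ranks the pair wrongly."""
    return math.exp(log_sigmoid(-dpo_margin(logp_w, logp_l, ref_logp_w, ref_logp_l, beta)))


# The policy already prefers y_w relative to the reference: smaller loss and weight.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0), dpo_gradient_weight(-1.0, -3.0, -2.0, -2.0))
# The policy prefers y_l: larger loss and a larger gradient weight.
print(dpo_loss(-3.0, -1.0, -2.0, -2.0), dpo_gradient_weight(-3.0, -1.0, -2.0, -2.0))
```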

Conclusion

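Large models such as GPT are hard for an individual to train. Even if a reward model could be trained, collecting a large amount of data is also hard
Fitting the reward model is difficult, and the RL training that follows is complex as well, so there are many obstacles
DPO can resolve these problems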

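[Paper] Dynamic Routing Between Capsules

Mon, 27 May 2024 04:30:42 GMT

Object Recognition: existing object recognition used convolutional networks: feature extraction + max pooling
Object Segmentation: because of problems with the max pooling step of conventional CNNs, a new technique, Dynamic Routing, is proposed

Before we begin

Drawbacks of convolutional networks

Invariance: a function that gives the same output even when the input X is transformed, i.e. f(T(x)) = f(x)
Equivariance: a function for which transforming the output of X gives the same result as applying that same transformation to the input X first, i.e. f(T(x)) = T(f(x))

CNN = translation invariance (samples with the same label are classified equally well regardless of changes in position or composition) -> due to max pooling

Limitation of CNNs: subsampling guarantees "local" invariance, but beyond the local range the spatial relationship cannot be captured.

That is, convolutional networks throw information away to obtain translation invariance, and so have the drawback of not taking spatial relationships into account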

Contributions

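Achieves SoTA on the MNIST dataset
Takes into account the spatial relationship between features, which CNNs could not
Strong at recognizing highly overlapping objects, which CNNs could not handle

Abstract

Capsules?

What is the entity?: instantiation parameters
Is there an entity?: length of the vector

Instantiation parameters? A vector representing a shape or object entity; we cannot know exactly how many parameters there are or what each one means, but it is a vector for describing and representing the entity

As the title "Dynamic Routing Between Capsules" suggests, the architecture ultimately solves the problem with connected layers between capsules

What happens when layers get deeper with capsules?

Forward propagation between capsules whose dimensions differ from layer to layer is difficult
The length of a capsule represents a probability, so it must lie in [0,1], but as layers get deeper the value can exceed that range after passing through the computations.

-> Dynamic Routing was introduced to solve exactly these two problems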

How the vector inputs and outputs of a capsule are computed

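The (capsule) vector has a different dimension in the previous layer and in the next layer

Affine Transform

Raises the dimension of the previous layer to that of the next layer
Raising the dimension amounts to predicting what the current capsule will look like in the next layer, and the affine transform is trained to make that prediction well.

Squash
A capsule is a vector, so ReLU or Sigmoid cannot be used as its activation function. -> "squash" is the activation function for capsules -> it maps the length of the capsule vector into [0,1]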
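The squash function from the capsule paper, v = (|s|^2 / (1 + |s|^2)) * (s / |s|), can be sketched on a plain list of floats. The small eps added to the norm is an assumption of this sketch, there only to avoid dividing by zero for an all-zero capsule.

```python
import math


def squash(s, eps: float = 1e-9):
    """Shrink vector s so its length lies in [0, 1) while keeping its direction."""
    squared_norm = sum(x * x for x in s)
    norm = math.sqrt(squared_norm)
    scale = squared_norm / (1.0 + squared_norm) / (norm + eps)
    return [scale * x for x in s]


def length(v) -> float:
    return math.sqrt(sum(x * x for x in v))


# A long vector is squashed to a length just below 1, a short one to near 0.
print(length(squash([3.0, 4.0])))
print(length(squash([0.01, 0.0])))
```

The output length can then be read as the probability that the entity is present, while the direction still encodes the instantiation parameters.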

No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding

Fri, 24 May 2024 08:04:04 GMT

This post is a review of the paper "No Time to Waste: Squeeze Time into Channel for Mobile Video Understanding".

Abstract

Recent video understanding methods use 3D blocks or 2D convolutions with additional time modeling
However, these methods all treat the temporal axis as a separate dimension, which requires large computation and memory budgets and limits their use on mobile devices

-> This paper proposes SqueezeTime, which squeezes the time axis into the channel dimension

SoTA on Kinetics400, Kinetics600, HMDB51, AVA2.1 and THUMOS14
Codes are publicly available at https://github.com/xinghaochen/SqueezeTime and https://github.com/mindspore-lab/models/tree/master/research/huawei-noah/SqueezeTime.

Introduction

Existing approaches

Traditional 3D convolutional networks: 3D ConvNets, I3D and SlowFast
- can jointly learn spatial and temporal features from videos but consume large amounts of memory and computation
  - not suitable for mobile usage
Improvement of the 3D convolutional network by 2D decomposition or approximation manually
- However, searching for such 3D architectures on a video benchmark is also time-consuming
Incorporating 2D CNN with temporal learning: Temporal Shift Module, Temporal Difference Module, Temporal Adaptive Module, Adaptive Focus, Temporal Patch Shift, Temporally-Adaptive Convolutions
- Though these methods have improved running speed, the accuracies are not quite satisfactory in mobile settings.
Transformer-based Video analysis
- not friendly to mobile devices.

SqueezeTime

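All of the methods above use the temporal axis as an extra dimension, so their computational cost is high
This paper finds that there is no need to keep the temporal axis separate like this
It therefore proposes SqueezeTime, which squeezes the temporal axis into the spatial channel dimension
It also proposes a way to compensate for the side effects of this squeeze operation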
- Channel-Time Learning Block (CTL): learn the temporal dynamics embedded into the channels
  - First branch: Temporal Focus Convolution (TFC) - concentrates on learning the potential temporal importance of different channels
  - Second branch: leveraged to restore the temporal information of multiple channels and to model the Inter-temporal Object Interaction (IOI) using large kernels.
Contributions
- Proposes SqueezeTime: squeezes the temporal dimension of the video sequence into spatial channels, which is much faster, with low memory consumption and low computation cost.
- The CTL can learn the potential temporal importance of channels, restore temporal information, and enhance inter-temporal object modeling ability, which brings 4.4% Top1 accuracy gain on K400.
- Extensive experiments demonstrate the proposed SqueezeTime can yield higher accuracy (+1.2% Top1 on K400) with faster CPU and GPU (+80% throughput on K400) speed.

SqueezeTime Network

Squeeze and Restore Time

Comparing Computation Cost

3D CNN: 2 x c_out x c_in x k^3 x h x w x t
2D CNN with temporal modeling: 2 x c_out x c_in x k^2 x h x w x t + O(t)
SqueezeTime: 2 x c_out x c_in x k^2 x h x w
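Plugging numbers into the three cost expressions makes the gap concrete. This is only a sketch: it assumes k is the kernel size, the O(t) term of the 2D variant is dropped, and the layer sizes below are arbitrary example values, not taken from the paper.

```python
def cost_3d_cnn(c_out, c_in, k, h, w, t):
    """2 * c_out * c_in * k^3 * h * w * t"""
    return 2 * c_out * c_in * k**3 * h * w * t


def cost_2d_with_temporal(c_out, c_in, k, h, w, t):
    """2 * c_out * c_in * k^2 * h * w * t (the additional O(t) term is omitted)"""
    return 2 * c_out * c_in * k**2 * h * w * t


def cost_squeezed(c_out, c_in, k, h, w):
    """2 * c_out * c_in * k^2 * h * w: no temporal factor remains"""
    return 2 * c_out * c_in * k**2 * h * w


# Arbitrary example layer: 64 -> 64 channels, 3x3 kernel, 56x56 map, 16 frames.
args = dict(c_out=64, c_in=64, k=3, h=56, w=56)
squeezed = cost_squeezed(**args)
print(cost_3d_cnn(t=16, **args) / squeezed)
print(cost_2d_with_temporal(t=16, **args) / squeezed)
```

With the same channel counts, the squeezed layer is cheaper by a factor of k*t than the 3D convolution and by a factor of t than the 2D variant.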

squeeze mechanism

Formula 1

fs is the squeeze function, fm is the mix up function, and Fb is the squeezed feature without temporal dimension.

Formula 2

β is the temporal importance learning function, ξ is the inter-temporal interaction function, and τ is the injected temporal order information. F′ is the restored feature.

Channel-Time Learning (CTL) Block

The CTL block is the basic building block of SqueezeTime and the first step toward understanding Formula 2
Components of the CTL block (Figure3-(a))
- 1 × 1 convolution : to reduce the channels
- CTL module: to learn temporal and spatial representations
- another 1 × 1 convolution: to restore the channel number

CTL module (Figure3-(b))

Fi and Fo are the input feature and output feature of the CTL block, and r is the ratio controlling the channel expansion.
set the reduction factor r to 0.25 as the default.

Figure3-(b)

Branch1: Temporal Focus Convolution (TFC)
- TFC with 1 × 1 kernel size: to especially concentrate on capturing the temporal-channel importance.
Branch2: Inter-temporal Object Interaction (IOI) module
- restore the temporal position information and model the inter-channel spatial relations using large kernels.
Final output: summation of the two branches

Temporal Focus Convolution (TFC)

The key question when squeezing the temporal dimension into channels: is the original 2D convolution suitable for modeling temporal representations that sit in different channels?

Original 2D Convolution

-> In the ordinary 2D convolution, all channels are treated as equally important

The authors argue, however, that once temporal information is squeezed into channels, the temporal importance of each channel has to be distinguished, and they propose the improved Temporal Focus 2D Convolution (TFC)
wm: the temporal-adaptive weights calculated according to the input features, it models the temporal importance of different channels.
wm: can be computed using a lightweight module, i.e., weight computation module.
In this paper, authors simply use the modules as a global MaxPool2d followed by a two-layer MLP as the WCM.

Inter-temporal Object Interaction (IOI) module

Why the IOI module is needed
- Important information can be lost when the temporal information of a video clip is squeezed into channels and when the temporal order information of the channels gets mixed up
- This loss of temporal detail has to be recoverable
- The relations between multiple different objects have to be modeled

Figure3-(c)

Top branch
- 3 × 3 TFC: to reduce the number of channels (C) to the number of frames (T) and to capture the temporal importance
- temporal position encoding information: to restore the temporal dynamics
- 7 × 7 convolution: to model the object relations between T frames
- can be swapped for another module: to capture the cross-temporal object interactions
Bottom branch:
- 3 × 3 convolution: to get the output number of channels
- direct mapping from input channels to output channels.

Experiments

Conclusion

In this paper, we concentrate on building a lightweight and fast model for mobile video analysis.

Different from current popular video models that regard time as an extra dimension, we propose to squeeze the temporal axis of a video sequence into the spatial channel dimension, which saves a great amount of memory and computation consumption.
To remedy the performance drop caused by the squeeze operation, we elaborately design an efficient backbone SqueezeTime with a stack of efficient Channel-Time Learning Block (CTL), which consists of two complementary branches to restore and excavate temporal dynamics.
Besides, we make comprehensive experiments to compare a quantity of state-of-the-art methods in mobile settings, which shows the superiority of the proposed SqueezeTime, and we hope it can foster further research on mobile video analysis.

Causal Representation Learning

Tue, 30 Apr 2024 05:43:23 GMT

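These are notes I am writing up for self-study.

Main ref: https://www.lgresearch.ai/blog/view?seq=306 Sub ref: https://lhw0772.medium.com/study-da-domain-adaptation-%EC%95%8C%EC%95%84%EB%B3%B4%EA%B8%B0-%EA%B8%B0%EB%B3%B8%ED%8E%B8-4af4ab63f871 See also: CauSSL: Causality-inspired Semi-supervised Learning for Medical Image Segmentation

Causal Inference

A method for finding the causes of an outcome and analyzing and inferring how the outcome is affected when those causes change
Because it reasons in advance about how the output changes with the input, the results of a deep learning model can be anticipated, which makes it possible to analyze the model's characteristics

Independent and identically distributed (i.i.d)

An important assumption in machine learning from data is the one that guarantees generalization, i.e. that performance on a new dataset will be similar to performance in training
Most machine learning research ignores, or treats as unnecessary, things that matter in the data such as domain shift and temporal structure, and tries to solve the generalization problem by relying on the i.i.d. assumption over large amounts of data

Domain Shift

The difference between the distributions of the training data and the test data.
Most machine learning algorithms assume the data are i.i.d., so that with enough data the training and test distributions are taken to be similar
In reality, however, the training and test distributions can differ

Learning Reusable Mechanism

People learn new knowledge quickly by building on knowledge they have already acquired

Causality Perspective

Existing machine learning algorithms work on correlation, so they can only capture correlations in the data and cannot infer causal relationships (causality)
Using causality makes robust prediction possible even in situations that differ from what has already been observed or is already known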

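1. Causal Modeling

The best way to model a natural phenomenon is with differential equations
With differential equations, change over time can be modeled:
1) based on this change, how the state of the physical system under analysis will evolve can be predicted
2) the effect of interventions on causes and variables can be inferred
3) statistical dependencies between variables can be identified
4) causal relationships can be identified
In contrast, a statistical model can only model the surface of the real system. It can also tell how a particular variable influences the outcome only under the assumption that the experimental conditions do not change.
A causal model sits between statistical models and differential equation models.
Associational Causality (Predicting in the i.i.d. setting)
- The goal of a statistical model is to predict the distribution P(Y|X) given input X and target Y
- Solvable by observing enough i.i.d. data
  - But accurate results are only obtained under the same experimental conditions, and predictions may be inaccurate when the data distribution changes
Interventional Causality (Predicting under distribution shifts)
- In real environments, an intervention can change the data distribution
  - In that case the previously learned P(X,Y) changes, and the performance of the existing statistical model is no longer guaranteed
- A causal model, in contrast, learns the effect of interventions and stays robust under changes in the data distribution

Counterfactual Causality (Answering Counterfactual Questions)
- Causal discovery: need to know why something happened
- Causal effect estimation: need to know how the outcome would change had a different action been taken
- Decision making: need to know which action to take to obtain the desired outcome

2. Independent Causal Mechanisms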

Independent Causal Mechanism (ICM) Principle: The causal generative process of a system’s variables is composed of autonomous modules that do not inform or influence each other. In the probabilistic case, this means that the conditional distribution of each variable given its causes (i.e., its mechanism) does not inform or influence the other mechanisms.

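Changing one mechanism P(X_i|PA_i) does not change the other mechanisms P(X_j|PA_j) (i ≠ j)
Knowing one mechanism P(X_i|PA_i) gives no information about another mechanism P(X_j|PA_j)

*Entanglement: being tangled up, intertwined

One z is tied to other features, so when learning features it is hard to isolate a single feature vector that affects the outcome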

3. Sparse Mechanism Shift

Sparse Mechanism Shift (SMS): Small distribution changes tend to manifest themselves in a sparse or local way in the causal/disentangled factorization, i.e., they should usually not affect all factors simultaneously.

SMS follows from ICM: when an intervention changes the distribution in the causal/disentangled factorization, only a few or specific components tend to change
This means that if an intervention changed every component, the model could hardly learn anything from the resulting distribution change

The goal of causal learning is to view the real world as a chain of independent causal mechanisms and, ultimately, to model it as a disentangled representation with causal structure

4. Implications for Machine Learning

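Learning Transferable Mechanisms: in the real world the amount of data is limited, so it is hard to rely entirely on the data distribution. The world is also hard to pin down as a single component, so structuring it into modules is important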

Semi-Supervised Learning

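Proposed to tackle the real-world shortage of labeled data
How SSL works is not precisely understood, so analytical studies are needed. ex) when the causal relation is X->Y, the model learns the mapping X->Y
By ICM, P(X) and P(Y|X) are independent and share no information: using P(X) to estimate P(Y|X) does not help at all
In the anti-causal direction (Y->X), on the other hand, SSL has been shown to be possible

Robustness and Strong Generalization

Robustness and generalization are treated as an OOD problem
The OOD problem can be viewed as an optimization problem that minimizes the empirical risk over a class of distributions
The OOD gap arises from the difference between the training distribution P(X,Y) and the test distribution P*(X,Y)
From the causality perspective, the OOD distribution P*(X,Y) can be viewed as the result of a particular intervention

Future work of Causal Representation Learning

Learning nonlinear causal relations
Learning causal variables
Understanding the biases of existing DL: understanding which components of the extracted disentangled representation help on a new task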

Batch Normalization

Sat, 23 Sep 2023 23:57:04 GMT

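These are notes on the Batch Normalization lecture material from the CMU IntroDL course. Reference: https://gaussian37.github.io/dl-concept-batchnorm/

Gradient Descent

The process of finding the weights that minimize the loss function by repeatedly stepping against its gradient, toward a point where the gradient vanishes
Step size = Learning rate
Plain gradient descent computes the gradient for all N data samples and updates the model with the average of those gradients.
This cannot process large datasets in one go, so it is common to split the N samples into batches and train batch by batch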

Stochastic Gradient Descent

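Stochastic = the data that make up each batch are chosen at random
SGD uses only part of the data to compute each update: it updates the model using only batch-size many samples
Drawback: with enough iterations SGD works, but it is very noisy
Drawback: SGD is not guaranteed to always find the lowest point
Remedy: Mini-Batch SGD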
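A tiny mini-batch SGD loop, written without any framework, shows the update rule. The setup is an assumption chosen for illustration: a one-parameter model y = w * x fitted with squared error on noise-free data, with made-up hyperparameters.

```python
import random


def minibatch_sgd(xs, ys, lr=0.1, batch_size=4, epochs=200, seed=0):
    """Fit y = w * x by mini-batch SGD on the mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # "stochastic": batches are drawn at random
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the batch-mean squared error with respect to w.
            grad = sum(2 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad  # step size = learning rate
    return w


xs = [0.1 * i for i in range(1, 11)]
ys = [3.0 * x for x in xs]
print(minibatch_sgd(xs, ys))
```

Each update only looks at batch_size samples, so one pass over the data performs several cheap updates instead of one expensive one.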

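Internal Covariate Shift

Training in batches gives rise to a problem called internal covariate shift.
When the full dataset is split into batches, the data distribution differs from batch to batch.
So when training proceeds batch by batch, the distribution of the data can differ between batches. In the original Batch Normalization paper the term refers more specifically to the distribution of each layer's inputs shifting during training as the parameters of the preceding layers change.

The technique applied to address this internal covariate shift is "Batch Normalization"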

Normalization

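The process of rescaling data of different magnitudes so that they are on the same scale (and carry equal importance), e.g. dividing an image with values between 0 and 255 by 255 to map it to the range 0 to 1

Regularization

Placing a "constraint" on the model in order to lower its complexity.

Batch Normalization

During training, Batch Normalization normalizes each batch using that batch's own mean and variance, even when the batches have different distributions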
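The normalization step itself is short enough to write out for a single feature. This sketch covers only the training-time computation on one batch; gamma and beta are the learnable scale and shift, fixed here to 1 and 0, and the running statistics used at inference are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize one feature over a batch: gamma * (x - mean) / sqrt(var + eps) + beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased variance of the batch
    return [gamma * (x - mean) / math.sqrt(var + eps) + beta for x in batch]


# Two batches with very different scales end up on (almost) the same scale.
print(batch_norm([1.0, 2.0, 3.0, 4.0]))
print(batch_norm([10.0, 20.0, 30.0, 40.0]))
```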

Limitations

Because normalization is performed per mini-batch -> as very large models have appeared recently, in order to train such models, the batch...

Meta-Learning?

Mon, 10 Jul 2023 04:24:25 GMT

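These are notes for studying meta-learning. References

Meta learning

The machine can tell whether or not it knows something

Learning to learn / Learning to generalize

Meta-learning aims to train a model that can generalize across many tasks
A meta-learned model has learned this generality, so it can learn quickly even on tasks it has never seen
Just as children use metacognition to figure out an animal they are seeing for the first time, the model can learn about unseen inputs (queries) from only a small amount of data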

Why Meta learning ?

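Generalization is possible even with a small dataset
Useful when compute is limited: meta-learning does not need much hardware to train the model

What is Few-shot learning?

It is a kind of meta-learning
In few-shot learning the model does not learn to assign each image to a specific class; it learns that 'these images belong to different classes'. In other words, it learns to distinguish what it knows from what it does not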
query sample never seen before
query sample from unknown classes

supervised learning

test sample never seen before
test sample from known classes

Learning approaches in meta-learning

Model-based approach
Metric-based approach

Maps newly arriving data into a low-dimensional space and classifies it by the distance between data points in that space
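A distance-based classifier over a handful of labeled examples illustrates the metric-based idea. The sketch assumes the embeddings are already given (here plain 2-D points) and uses the mean of each class as a prototype, in the spirit of prototypical networks; that specific choice is an assumption, not something stated in these notes.

```python
import math


def prototype(vectors):
    """Mean vector of the support examples of one class."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def classify(query, support):
    """Return the label whose prototype is closest to the query embedding.

    support maps each label to a short list of embedded examples.
    """
    prototypes = {label: prototype(vs) for label, vs in support.items()}
    return min(prototypes, key=lambda label: math.dist(query, prototypes[label]))


# Two classes with two support examples each (a 2-way 2-shot episode).
support = {"cat": [[0.0, 0.0], [0.0, 1.0]], "dog": [[5.0, 5.0], [6.0, 5.0]]}
print(classify([0.2, 0.4], support))
print(classify([5.0, 4.0], support))
```

New classes can be handled by adding a few support examples, without retraining the classifier.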

Optimization based approach

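Finds the parameters of a model generalized over many tasks and uses them to initialize the parameters for a new task. This lets the optimal task-specific parameters be found quickly.

Transfer learning

Extracts knowledge from a source task and passes it on to a target task

A model is trained on another task that has training data, and the generalized knowledge it extracts is passed on to a task that has no training data.
For a dog vs. cat image classification task with few dog and cat photos but plenty of ant and bee photos, the model first learns lines, contours and other general features while classifying ants and bees (pretraining), and is then trained to classify dog and cat images (fine-tuning)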

[Paper Review] What Makes Training Multi-modal Classification Networks Hard?

Fri, 03 Mar 2023 06:41:57 GMT

This post is a review of "What Makes Training Multi-Modal Classification Networks Hard?", published at CVPR 2020.

Background

Why Multimodal??

Uni-Modal Model (Vision-Only)
Bi-Modal Model (Audio-Vision)

Q. Why use multi-modality?

To learn better and, with richer information than a single modality, to raise model performance more easily and accurately

In theory: more modalities should boost the model easily
In practice: the best uni-modal model performs better

-> In actual training, the audio features and the visual features (RGB) start overfitting at different epochs, and overfit at different rates.

Deep-learning training

What matters most in deep-learning training?

1. Optimization
2. Generalization

-> The key is to balance optimization and generalization at the point where neither overfitting nor underfitting occurs

Introduction

Theory: in principle, adding modalities adds information, so the model should learn better.

Problem in reality: in actual training, a single-modality model sometimes outperforms the multi-modal model.

Solution: define the problem by showing cases where multimodal training really does perform worse, then propose a metric that can address it and reach SoTA.

Contribution:

Empirically demonstrate the significance of overfitting in joint training of multi-modal networks, and we identify two causes for the problem.
Propose a metric to understand the problem quantitatively: the overfitting-to-generalization ratio (OGR)
Propose a new training scheme which minimizes OGR via an optimal blend of multiple supervision signals.
Proposed Gradient-Blending (G-Blend) method gives significant gains in ablations and achieves state-of-the-art (SoTA) accuracy on benchmarks including Kinetics, EPIC-Kitchen, and AudioSet by combining audio and visual signals.

Problem definition and verification

1. Cases where unimodal is better

-> The best unimodal model can be seen to achieve higher performance
Why? In late fusion the multimodal network has far more parameters than the unimodal one, and this causes overfitting.

*A quick aside: what is optical flow? Reference: https://velog.io/@yoorachoi/%EC%BB%B4%ED%93%A8%ED%84%B0-%EB%B9%84%EC%A0%84-Optical-Flow-Lukas-Kanade-Method-%EC%A4%91%EC%8B%AC%EC%9D%98-%EA%B0%9C%EB%85%90-OpenCV-%EA%B5%AC%ED%98%84

2. Attempts to reduce overfitting

1. Standard remedies: Pretraining, Early-stopping, Dropout
2. Architectural remedies: Mid-concatenation, SE-gate, NL-gate
*Architectural?
SE-gate: Squeeze-and-Excitation Networks https://arxiv.org/abs/1709.01507

NL-gate: Non-local Neural Networks https://arxiv.org/abs/1711.07971

-> The key point here: Dropout works best? Fewer parameters work better, even though the model is multimodal!

Conclusion: multimodal training has a problem, namely that overfitting occurs as the number of parameters grows, and the paper sets out to quantify this and propose a fix.

Unimodal loss function

Multimodal loss function

Overfitting-to-Generalization Ratio (OGR)

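T: Training, V: Validation

Overfitting vs Generalizing

*Overfitting: learning patterns in a training set that do not generalize to the target distribution.

L* = ground-truth loss (the true loss) -> the true loss is hard to obtain, so the validation loss is used in its place
G = change in validation loss between checkpoints
O = gap between train loss and validation loss at each checkpoint

-> OGR measures the quality of the training process, i.e. how much of what is learned is overfitting versus generalization.

-> The paper proposes a metric and a training scheme that learn while minimizing this OGR.

-> However, in the underfitting case O itself is small, so instead the method minimizes an infinitesimal OGR using the paper's formula, which blends several gradients.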
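Measuring OGR between two checkpoints from logged losses could look like this. The sketch follows the definitions in these notes (validation loss as a stand-in for the true loss, G as the drop in validation loss, O as the validation minus train gap); the function name and the example numbers are made up.

```python
def ogr(train_before, val_before, train_after, val_after):
    """Overfitting-to-generalization ratio between two checkpoints.

    O = val loss - train loss at a checkpoint (overfitting gap)
    delta_O = how much that gap grew between the checkpoints
    delta_G = how much the validation loss dropped (generalization gained)
    """
    delta_o = (val_after - train_after) - (val_before - train_before)
    delta_g = val_before - val_after
    if delta_g == 0:
        raise ValueError("validation loss did not change; OGR is undefined")
    return abs(delta_o / delta_g)


# Train loss fell by 0.5 but validation loss only by 0.1: mostly overfitting.
print(ogr(train_before=1.0, val_before=1.2, train_after=0.5, val_after=1.1))
# Train and validation loss fell together: little overfitting.
print(ogr(train_before=1.0, val_before=1.2, train_after=0.7, val_after=0.85))
```

A small OGR means most of the learning between the two checkpoints carried over to held-out data.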

How to use in Practice?

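Use the (weighted) total of the unimodal and multimodal losses

k+1 = k single modalities + one multimodal representation

For modalities m1 and m2, a loss is computed for each unimodal feature and for the multimodal feature obtained by concatenating and fusing the two; these losses are then blended, hence the name Gradient Blending.

*Blending? Think of a kitchen blender: everything is mixed into a single vector.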
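Blending k+1 losses into one training loss is a weighted sum. In the sketch below the loss values are made-up numbers, and the weight rule w_i proportional to G_i / O_i^2 is stated as an assumption about how the paper estimates the weights; it should be checked against the paper before being relied on.

```python
def gblend_weights(generalization, overfitting):
    """Assumed weight rule: w_i proportional to G_i / O_i^2, normalized to sum to 1."""
    raw = [g / (o * o) for g, o in zip(generalization, overfitting)]
    total = sum(raw)
    return [r / total for r in raw]


def blend_losses(losses, weights):
    """Blended loss: weighted sum of the unimodal and multimodal losses."""
    if len(losses) != len(weights):
        raise ValueError("need exactly one weight per loss")
    return sum(w * l for w, l in zip(weights, losses))


# k = 2 modalities (audio, RGB) plus the fused audio-visual head.
weights = gblend_weights(generalization=[0.10, 0.20, 0.15], overfitting=[0.5, 0.5, 1.0])
print(weights)
print(blend_losses([1.2, 0.9, 0.8], weights))
```

Since the blended loss is a weighted sum, its gradient is the same weighted sum of the individual gradients, which is where the blending of gradients comes from.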

Code review

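There is no official code! But since the contribution is a method rather than code, looking at the pseudocode is probably a better way to follow the logic.

Measuring OGR

1. G-B Weight Estimation

2. Two versions, depending on how the weights are measured

2.1 Offline G-B -> weights measured once over all N epochs

2.2 Online G-B -> weights re-measured every n epochs (n small) within the N epochs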

Result

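First, an experiment showing that overfitting happens faster with multiple modalities than with a single modality. (Backbone: ResNet50) -> same as the overall problem statement of the paper described earlier.

Shows how the weight given to each modality changes from epoch to epoch. (A nice experiment...)

Why both Online and Offline?

*With other modalities

*On other tasks

*State-of-the-art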

Conclusion

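Precisely pinpointed a problem that naturally arises in multimodal training
Proposed OGR, which measures the overfitting in multimodal training, and a method that addresses it by minimizing OGR
Showed that unimodal and multimodal losses can be blended into one
Achieved SoTA across Multi-Task, Multi-modal and Multi-level Fusion settings

Thoughts

Strengths: the problem definition is clear. Before this paper, the total loss was computed from the multimodal representation alone, and nobody had pinned down when and why multimodal models do worse than unimodal ones; this paper identifies that problem and solves it. The way it proves mathematically that gradients can be blended and then verifies it experimentally is very logical.
Limitations: (1) there is not a single line on how the modalities are processed, which I am personally curious about (e.g. modality shapes, the loss function of each modality) (2) where G-Blend performs worse in the experiments, the paper attributes it to low audio-visual relevance (that is, the modalities are uncorrelated, so it is a matter of how much information overlaps rather than the model being wrong), which feels a little odd.
Impressions: as with the CVPR paper I presented last time, the problem definition and the proposed solution are very logical. Rather than tweaking code to push up the numbers, it defines an important problem in multimodal learning and proposes a way to solve it. Excellent.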

[Paper Review] Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing

Fri, 03 Feb 2023 14:10:24 GMT

Before We Begin

This post is a review of the paper "Towards Causal VQA: Revealing and Reducing Spurious Correlations by Invariant and Covariant Semantic Editing", published at CVPR 2020.

Background: What is VQA?

*VQA: Visual Question Answering* -> Given an image (visual; this can be extended to video) and a question about that image, the VQA task is to produce the correct answer to the question.

Source: the paper "VQA: Visual Question Answering"

*Related work*

Text-based Q&A: This problem is well studied in NLP and text processing. Answering text questions is closely related to VQA techniques. VQA depends on both text and vision.
Describing Visual Content: Image tagging, image captioning, video captioning, and so on are related to VQA. However, most captions are overly generic rather than specific to the image (the same caption would still make sense for many images).
Other Vision+Language Tasks: Tasks such as coreference resolution and generating referring expressions, which are easier to evaluate than image captioning.

Source: CVPR

Introduction

Today's VQA models are vulnerable to variations of the question. -> Even when the question means the same thing, they often fail to answer the rephrased version correctly.

**In this work, the authors

propose a way to evaluate the robustness of VQA models
apply semantic and visual edits to the VQA task and propose a way to make models respond more robustly to edited content and fake questions**

Source: CVPR

Most prior work studied the robustness of VQA models by introducing linguistic variation.

Source: CVPR

*1. Work that varies the answers at training time* Don't Just Assume; Look and Answer: Overcoming Priors for Visual Question Answering, 2018 CVPR

*2. Work that varies the questions* Cycle-Consistency for Robust Visual Question Answering, 2019 CVPR
Sunny and Dark Outside?! Improving Answer Consistency in VQA through Entailed Question Generation, 2019 arXiv

**Main Contributions**

Proposes a way to quantify how the performance of VQA models is affected by bias in the data and the model.

-> Uses self-built synthetic data to verify these problems and proposes a new evaluation metric to assess VQA models

Contributes a methodology and a synthetic dataset

-> Builds the synthetic dataset by applying systematic variations, based on a human study.
-> Validates the dataset with human annotations

Runs experiments with three actual state-of-the-art VQA models and verifies that performance improves
Shows that augmenting the data with the proposed synthetic dataset, via adversarial training, can address the problems above.

무엇보다도, "First systematic study to visual robustness at scale"

Synthetic Dataset for Variances and Invariances in VQA

Synthetic Dataset -> Built upon the existing VQAv2 and MS-COCO datasets
How? By removing objects with a GAN

1. Invariant VQA (IV-VQA) *Editing where the answer does not change* -> The image is edited so that the answer to the question stays the same

2. Covariant VQA (CV-VQA) *Editing where the answer changes* -> The image is edited so that the answer to the question changes

Area-Overlap threshold

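1. What if the object removed by the GAN is one that the QA depends on?
2. Large objects in the image are harder for the GAN to remove than small ones. -> This can even lead to image distortion.
3. The new image generated after removing an object must not be degraded in quality compared with the original image.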

Validation of Three Human

Quality of the newly generated image
Whether the removed object appears in, or is related to, the QA
When the model performs the VQA task on the synthetic dataset, whether its result is correct for the given image and question (yes/no/ambiguous)

Consistency Analysis

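The goal of creating edited datasets: to measure how consistent the model is under semantic changes to the image.

IV-VQA: The removed object has little to do with the QA, so the model is assumed not to change its answer.

CV-VQA: The removed object is relevant to the QA, so the model is assumed to change its answer. In this study the questions ask for a count and one object is removed, so the model's answer is expected to decrease by one.

Next, the accuracy and consistency of the models are evaluated. Specifically, the authors look at how often a model changes its predicted answer under the edits, and propose qualitative and quantitative consistency metrics for these various edits.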

Inconsistent behavior on edited data into three categories

neg -> pos
pos -> neg
neg -> neg

For example, in case 1 (neg -> pos), the answer on the real IQA was wrong, but the answer on the edited one was correct.
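The three categories can be made concrete with a small sketch for the IV-VQA case, where the ground-truth answer is the same before and after the edit. This is not the authors' notebook code; the function name `iv_vqa_flip`, the "consistent" label, and the sample predictions are assumptions, and treating neg -> neg as "both wrong, but the predicted answer changed" is my reading.

```python
from collections import Counter


def iv_vqa_flip(gt, pred_orig, pred_edit):
    """Classify one (original, edited) prediction pair for IV-VQA.

    gt is the ground-truth answer, which the invariant edit leaves unchanged.
    """
    if pred_orig == pred_edit:
        return "consistent"            # the model did not flip its answer
    if pred_edit == gt:
        return "neg->pos"              # wrong on the real image, right on the edit
    if pred_orig == gt:
        return "pos->neg"              # right on the real image, wrong on the edit
    return "neg->neg"                  # wrong both times, with different answers


# Hypothetical (ground truth, prediction on real, prediction on edited) triples.
samples = [
    ("yes", "yes", "yes"),
    ("yes", "no", "yes"),
    ("2", "2", "3"),
    ("red", "blue", "green"),
]
counts = Counter(iv_vqa_flip(*s) for s in samples)
flip_rate = 1 - counts["consistent"] / len(samples)
print(counts, flip_rate)
```

For CV-VQA the ground truth itself changes with the edit (for example, the count drops by one), so the comparison would have to use a separate ground truth for the edited image.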

Robustification by Data Augmentation

Source: excerpt from the authors' CVPR presentation

Conclusion and Future Works

propose a semantic editing based approach to study and quantify the robustness of VQA models to visual variations.
-> The first study to evaluate the robustness of VQA models by applying visual edits
Our analysis shows that the models are brittle to visual variations and reveals spurious correlation being exploited by the models to predict the correct answer.
-> Reveals that existing VQA models are brittle to visual variations and rely on spurious correlations of their own when making predictions
Next, we propose a data augmentation based technique to improve models’ performance.
-> Data augmentation with the authors' synthetic datasets improved both the performance and the consistency of the models.
Our trained models show significantly less flipping behaviour under invariant and covariant semantic edits, which we believe is an important step towards causal VQA models.
-> Models trained with the synthetic dataset flipped their answers less on both Invariant VQA and Covariant VQA; the authors stress that this is a step towards "Causal VQA", which is where the paper's title comes from
By making our invariant and covariant VQA sets as well as evaluation and synthesis available to the community, we hope to support research in the direction towards causal VQA models.
-> The authors hope this dataset will lead to a variety of research towards Causal VQA

Code Review

Full code: https://github.com/AgarwalVedika/CausalVQA

flip_accuracy_cal_cv_vqa.ipynb
flip_accuracy_cal_iv_vqa.ipynb

https://github.com/AgarwalVedika/CausalVQA/blob/master/flip_accuracy_cal_cv_vqa.ipynb

Object Remover

https://github.com/AgarwalVedika/CausalVQA/blob/master/object_remover.py

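Personal Opinion

Strengths

Clear motivation: the way it raises the problems of existing approaches is crisp
The flow of the paper is very natural: problem -> solution -> verification
The writing and structure of the paper are also very well organized
Perhaps because the authors were confident in their experimental process and results, the comparative experiments felt very well done

Weaknesses / Shortcomings

Since I thought of this as a multimodal study, I expected model-related details such as how features are extracted and used, but it only lists results from existing SoTA models, which was disappointing. (It feels more like a data validation study than a multimodal one.)

Thoughts

The contributions clearly promise a new metric for quantification, so I expected an equation or formula, but it turned out to mean the whole procedure of training with the synthetic dataset and evaluating whether the model stays consistent.

Further Reading

SAAA, which the paper above uses as a SoTA model: Show, Ask, Attend, and Answer: A Strong Baseline For Visual Question Answering https://arxiv.org/abs/1704.03162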

[Paper Review] Self-supervised Learning from a Multi-view Perspective

Sun, 15 Jan 2023 10:16:48 GMT

Before We Begin

This post is a review of the paper "SELF-SUPERVISED LEARNING FROM A MULTI-VIEW PERSPECTIVE" by Yao-Hung Hubert Tsai, Yue Wu, Ruslan Salakhutdinov, and Louis-Philippe Morency, published at ICLR 2021.

Backgrounds

What is self-supervised learning?
Supervised learning vs. unsupervised learning: whether tagged labels are available -> Unsupervised learning has no labels on the data, so it performs clustering, grouping samples into categories according to the features (representations) of the data

Self-supervised learning (SSL): starting from untagged data with no labels, the model classifies the training data on its own

1. Introduction

1.1 Problem Statement and Motivation

*Self-Supervised Learning (SSL): used when the data has no labels, to learn objectives defined between the input data and self-supervised signals

Typical SSL Pipeline

1) Pretext task: learn the features (representation) of the unlabeled input data = Pre-training: the stage where a large amount of untagged data is used to learn the general characteristics of the dataset, e.g. the BERT model

2) Downstream task = Fine-Tuning

Goal of the paper
This paper aims at analyzing why self-supervised learning performs well, theoretically and practically, across a variety of tasks.

1.2 Contributions

1) Multi-view assumption: input data and self-supervised signals are two different views of the data
2) Each of these two views is sufficient for the downstream task
3) A minimal but sufficient learned representation extracts task-relevant information (with a loss) and discards task-irrelevant information (with a gap)
4) The combination of input and self-supervised signals is also discussed

2. Methodology

2.1 Minimal and sufficient representations for self-supervised learning

Notations
X = Input
S = Self-supervised signals
Z = Representation
T = Task-relevant information
I(X;Y) = Mutual information of X and Y
H(X) = Entropy of X
X|Y = X conditioned on Y
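To make the information-theoretic notation concrete, here is a minimal sketch that computes H(X), I(X;Y), and H(X|Y) for small discrete joint distributions. It uses only the standard definitions, not anything specific to the paper; the function names and the two toy distributions are made up for illustration.

```python
import math


def entropy(p):
    """H(X) in bits for a list of probabilities."""
    return -sum(v * math.log2(v) for v in p if v > 0)


def mutual_information(joint):
    """I(X;Y) in bits; joint[i][j] = P(X=i, Y=j)."""
    px = [sum(row) for row in joint]
    py = [sum(col) for col in zip(*joint)]
    return sum(
        pxy * math.log2(pxy / (px[i] * py[j]))
        for i, row in enumerate(joint)
        for j, pxy in enumerate(row)
        if pxy > 0
    )


def conditional_entropy(joint):
    """H(X|Y) = H(X) - I(X;Y)."""
    px = [sum(row) for row in joint]
    return entropy(px) - mutual_information(joint)


# Two toy cases: Y is a perfect copy of X, and Y is independent of X.
copy = [[0.5, 0.0], [0.0, 0.5]]
independent = [[0.25, 0.25], [0.25, 0.25]]

print(mutual_information(copy), conditional_entropy(copy))
print(mutual_information(independent), conditional_entropy(independent))
```

When one variable is a copy of the other, all of the information is shared and nothing is left over; when they are independent, nothing is shared. In the review's terms, the shared part corresponds to I(·;·) and the leftover part to H(·|·).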

How it works, in short
*1) X is the input data and S is the self-supervised signal. The part they share is maximized and the part that differs is minimized.
2) The optimal Z is where this shared part is maximal = Minimal and Sufficient Self-Supervision.
3) Then, for the downstream task T, the part shared with T is called task-relevant information, and the non-overlapping part is called the gap and is discarded.*

2.2 Connections with different learning objectives

**Different Learning Objectives**
1) Forward Predictive Learning
Uses cases of Z_X to reconstruct cases of S

2) Contrastive Learning
To maximize the similarity of X and S, in order to maximize I(Z_X; S), which minimizes ε_info and maximizes the task-relevant information in Z_X.

3) Inverse Predictive Learning
Uses cases of S to reconstruct cases of Z_X

**Composing SSL Objectives**
Combine the contrastive learning objective L_CL, the forward predictive learning objective L_FP, and the inverse predictive learning objective L_IP, which leads to the composed SSL objective L_SSL.
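Written out, the combination can be sketched as a weighted sum. This assumes one scalar weight per objective; the symbol names λ_CL, λ_FP, and λ_IP are my notation for those hyper-parameters and should be checked against the paper.

```latex
L_{SSL} = \lambda_{CL} \, L_{CL} + \lambda_{FP} \, L_{FP} + \lambda_{IP} \, L_{IP}
```

Setting a weight to zero switches the corresponding objective off, so a purely contrastive setup would be the special case where only the weight on L_CL is non-zero.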

By adjusting the hyper-parameter,

3. Experiments

3.1 Experiments 1

L_CL and L_FP, which aim to extract task-relevant information
L_IP, which aims to discard task-irrelevant information

(a) Omniglot dataset
The training dataset includes images from 964 characters, and the test dataset includes images from 659 characters. Each character is drawn by 20 people with different styles.

(b) CIFAR10

3.2 Experiments 2

4. Conclusion

1) Multi-view assumption
2) The interaction of the input and the self-supervised signals includes enough task-relevant information
3) Information can be extracted or discarded according to its relevance to the task T

5. My Review

Personal summary
Contrastive Learning: aims at extracting task-relevant information
Forward Predictive Learning: aims at extracting task-relevant information
Inverse Predictive Learning: aims at discarding task-irrelevant information
Composing SSL Objectives: aims at extracting task-relevant information and discarding task-irrelevant information (i.e. finding the minimal and sufficient learned representation).

Thoughts

1) I think the significance of this paper is that it verifies the effect of, and the relationship between, the input and the self-supervised signals with solid proofs and experiments.
2) However, the mathematical formulae for the Bayes error rate and the various learning objectives presented in the paper are complex and hard to follow, so I think more discussion and explanation of what they mean is needed.
3) More tests on combinations of L_CL, L_FP, and L_IP would also be needed to find a better composed SSL objective.
4) Finally, it should be tested whether the approach still works on datasets where the input and the self-supervised signals do not overlap much.

Code review

https://github.com/yaohungt/Self_Supervised_Learning_Multiview