<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>latent-cho.log</title>
        <link>https://velog.io/</link>
        <description>Welcome to my latent space!</description>
        <lastBuildDate>Mon, 29 Aug 2022 10:47:05 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <copyright>Copyright (C) 2019. latent-cho.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/latent-cho" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[PR | 22-08] Emerging Properties in Self-Supervised Vision Transformers (DINO)]]></title>
            <link>https://velog.io/@latent-cho/PR-22-08-DINO</link>
            <guid>https://velog.io/@latent-cho/PR-22-08-DINO</guid>
            <pubDate>Mon, 29 Aug 2022 10:47:05 GMT</pubDate>
            <description><![CDATA[<h1 id="0-abstract">0. Abstract</h1>
<ol>
<li>Self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. </li>
<li>These features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT.</li>
<li>Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. 
<img src="https://velog.velcdn.com/images/latent-cho/post/9c3df80d-5f20-4d8b-8348-13101bea2541/image.png" alt=""></li>
</ol>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>The self-supervised pretraining objectives used by BERT and GPT in NLP provide a <strong>richer learning signal</strong> than a supervised objective that predicts a single label per sentence. </li>
<li>Likewise for images, image-level supervision restricts the rich visual information by forcing the model to choose among predefined categories. </li>
</ul>
<hr>
<blockquote>
<h3 id="contribution">Contribution</h3>
</blockquote>
<ol>
<li>Self-supervised ViT features explicitly contain the scene layout, such as object boundaries. This can be read off directly from the self-attention modules of the last block.</li>
<li>Self-supervised ViT features achieve 78.3% top-1 accuracy on ImageNet with a basic k-NN classifier, without any additional fine-tuning, linear classifier, or data augmentation. </li>
</ol>
<ul>
<li>The segmentation masks in contribution 1 appear to be a property of the self-supervised training itself, while contribution 2 appears to come from combining a momentum encoder, multi-crop augmentation, and smaller patches. </li>
</ul>
<h1 id="2-related-work">2. Related work</h1>
<h3 id="self-supervised-learning">Self-supervised learning</h3>
<h4 id="1-instance-classification">(1) Instance classification</h4>
<ul>
<li><strong>how</strong>: considers each image a distinct class and trains the model by discriminating between them up to data augmentations.</li>
<li><strong>caveat</strong>:  explicitly  learning  a  classifier to discriminate  between all images does not scale well with the number of images.  </li>
</ul>
<h4 id="2-noise-contrastive-estimator-nce">(2) Noise contrastive estimator (NCE)</h4>
<ul>
<li><strong>how</strong>: compare instances instead of classifying them</li>
<li><strong>caveat</strong>: it requires comparing features from a large  number of images simultaneously.  In practice, this requires large batches or memory banks </li>
</ul>
<h4 id="3-wo-discriminating-between-images">(3) W/O discriminating between images</h4>
<ul>
<li><strong>ex-BYOL</strong>: features are trained by matching them to representations obtained with a momentum encoder</li>
</ul>
<h1 id="3-approach">3. Approach</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/e54c0df1-a20f-42ee-95ce-2d58f76341f8/image.png" alt=""></p>
<h3 id="notation">Notation</h3>
<ul>
<li>student network $g_{\theta_s}$ parameterized by $\theta_s$</li>
<li>teacher network $g_{\theta_t}$ parameterized by $\theta_t$</li>
<li>input image: $x$</li>
<li>Each model outputs a probability distribution over $K$ dimensions, denoted $P_s$ and $P_t$ ($P$ is the output of a softmax)
<img src="https://velog.velcdn.com/images/latent-cho/post/305b00d7-f188-47a5-9a24-ed407f6c450a/image.png" alt=""></li>
<li>$\tau$: temperature parameter that controls the <strong>sharpness</strong> of the output distribution</li>
</ul>
<p>-&gt; The softmax here appears to be used for its normalization effect.</p>
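<p>A minimal numpy sketch of the temperature-scaled softmax above (the function name and example values are my own, not from the paper):</p>

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Temperature-scaled softmax: lower tau -> sharper distribution."""
    z = logits / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
sharp = softmax_with_temperature(logits, tau=0.1)    # close to one-hot
smooth = softmax_with_temperature(logits, tau=10.0)  # close to uniform
```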
<h3 id="1-loss-cross-entropy">(1) Loss: cross-entropy</h3>
<p>Given a fixed teacher network $g_{\theta_t}$, we learn to match these distributions by <strong>minimizing the cross-entropy loss w.r.t.  the parameters of the student network $g_{\theta_s}$</strong>
<img src="https://velog.velcdn.com/images/latent-cho/post/972ec556-602d-41f0-be0a-e1d4afac2f29/image.png" alt=""></p>
<p>Unusually, they use a cross-entropy loss here.</p>
<h4 id="-loss-ce-vs-mse-vs-ince">* loss: CE vs MSE vs INCE</h4>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/2fb242c5-ab6d-4459-9cf8-e6270d38c7ed/image.png" alt=""></p>
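<p>A rough sketch of the DINO loss under the notation above, assuming raw logits as inputs; the default temperatures are my choice ($\tau_s=0.1$, $\tau_t=0.04$ roughly match the paper's settings):</p>

```python
import numpy as np

def softmax(z, tau):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy H(P_t, P_s) = -sum_k P_t[k] * log P_s[k].

    The teacher output is treated as a fixed target; in the real
    implementation no gradient flows through the teacher."""
    p_s = softmax(student_logits, tau_s)
    p_t = softmax(teacher_logits, tau_t)  # sharper targets: tau_t < tau_s
    return -(p_t * np.log(p_s)).sum(axis=-1).mean()
```

Note the teacher uses a lower temperature, so its targets are sharper than the student's predictions.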
<h3 id="2-augementation-multi-crop">(2) Augmentation: multi-crop</h3>
<p>It is worth pausing on how the $x$ that enters the cross-entropy is augmented.</p>
<ul>
<li>Multi-crop into global views at a large resolution (224x224) and local views at a small resolution (96x96)</li>
<li>All crops are passed through the student, while only the global views are passed through the teacher,</li>
<li>so that the model is pushed to learn <strong>“local-to-global” correspondences</strong>.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/bec4c58a-26e5-40c4-a595-5217a5ecf1e1/image.png" alt=""></p>
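<p>The teacher/student view pairing can be sketched as follows, assuming the standard setup of 2 global + 6 local crops (the view names are placeholders of my own):</p>

```python
def multicrop_pairs(n_global=2, n_local=6):
    """Enumerate (teacher_view, student_view) pairs entering the loss:
    the teacher sees only global views, the student sees every view,
    and a view is never compared with itself."""
    global_views = [f"g{i}" for i in range(n_global)]
    local_views = [f"l{i}" for i in range(n_local)]
    pairs = []
    for t_view in global_views:                    # teacher: global only
        for s_view in global_views + local_views:  # student: all views
            if s_view != t_view:
                pairs.append((t_view, s_view))
    return pairs
```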
<h3 id="3-teacher-network-ema">(3) Teacher network: EMA</h3>
<p>$\theta_t \leftarrow \lambda\theta_t + (1-\lambda)\theta_s$</p>
<ul>
<li>Using an exponential moving average (EMA) on the student weights, i.e., a momentum encoder, is particularly well suited for our framework. </li>
<li>We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/5cefe396-45c7-4c24-ab68-9336eb75a54d/image.png" alt=""></p>
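<p>The EMA update itself is a one-liner; a sketch over a flat list of parameters (in the paper, $\lambda$ follows a cosine schedule from 0.996 to 1 during training):</p>

```python
def ema_update(teacher_params, student_params, lam=0.996):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise.
    lam close to 1 makes the teacher a slow-moving average of the student."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```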
<h3 id="4-avoiding-collapse-centering--sharpening">(4) Avoiding collapse: centering &amp; sharpening</h3>
<ul>
<li>It can also work with only a centering and sharpening of the momentum teacher outputs to avoid model collapse.</li>
<li>Centering prevents one dimension from dominating but encourages collapse to the uniform distribution,</li>
<li>while sharpening has the opposite effect
<img src="https://velog.velcdn.com/images/latent-cho/post/8b36b081-8839-471a-a831-f1ae9121d358/image.png" alt=""></li>
</ul>
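<p>A numpy sketch of how centering and sharpening could be applied to the teacher logits (the function names and the momentum value $m=0.9$ are my own; the paper updates the center as an EMA over batch means):</p>

```python
import numpy as np

def teacher_output(logits, center, tau_t=0.04):
    """Teacher softmax with centering (subtract the running center)
    and sharpening (low temperature tau_t)."""
    z = (logits - center) / tau_t
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def update_center(center, teacher_batch_logits, m=0.9):
    """EMA of the center over teacher logits in the batch:
    c <- m * c + (1 - m) * mean(logits)."""
    return m * center + (1.0 - m) * teacher_batch_logits.mean(axis=0)
```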
<h3 id="implementation">Implementation</h3>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/0e26d891-98ac-45d2-bcc3-ac379edef01e/image.png" alt=""></p>
<h1 id="4-results">4. Results</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/560e03c9-aa90-4d40-bf39-d43aff1514ce/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/c1e4a32c-5505-4e09-986a-d2d01b7189ae/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[PR | 22-08] data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language]]></title>
            <link>https://velog.io/@latent-cho/PR-22-08-data2vec</link>
            <guid>https://velog.io/@latent-cho/PR-22-08-data2vec</guid>
            <pubDate>Mon, 29 Aug 2022 10:00:28 GMT</pubDate>
            <description><![CDATA[<h1 id="0-abstract">0. Abstract</h1>
<p>The general idea behind self-supervised learning can be applied to any modality, but in practice the algorithms and objectives have differed per modality. This paper proposes data2vec, a single framework in which the same learning method works for speech, NLP, and CV. </p>
<ul>
<li>The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.</li>
<li>Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec <strong>predicts contextualized latent representations that contain information from the entire input</strong>.</li>
</ul>
<h1 id="2-related-work">2. Related work</h1>
<blockquote>
<p>Like BYOL and DINO, data2vec is a student-teacher model, but one with a masked prediction objective, where the target representations are contextualized. </p>
</blockquote>
<p>-&gt; My own one-line summary of data2vec!</p>
<ol>
<li><p>Differences from models that use a momentum encoder (BYOL, DINO):
(1) use a <strong>masked prediction task</strong>
(2) regress <strong>multiple neural network layer representations</strong> instead of just the top layer which we find to be more effective.
(3) Moreover, data2vec <strong>works for multiple modalities</strong>.</p>
</li>
<li><p>Vision  Transformers with <strong>masked prediction objectives</strong> predict visual tokens or the input pixels. Instead:
(1) data2vec predicts the latent representations of the input data.
(2) The latent target representations are <strong>contextualized</strong>, incorporating relevant features from the entire image <strong>instead of targets which contain information isolated to the current patch, such as visual tokens or pixels</strong></p>
</li>
</ol>
<h1 id="3-method">3. Method</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/0f424ad7-5450-445c-a68a-8f19cf55c53c/image.png" alt=""></p>
<p>Training is not contrastive; it follows a masked auto-encoding scheme whose target is the teacher's output, which, being a Transformer output, is a contextualized representation. </p>
<p>To summarize the training procedure: the student is trained with a smooth L1 loss between its representation of the masked input and the teacher's representation of the unmasked input, while the teacher is updated as an EMA of the student's parameters.</p>
<h3 id="model-architecture">Model architecture</h3>
<p>A Transformer architecture is used for every modality, but the input data must still be encoded with the existing modality-specific methods:</p>
<ul>
<li>Computer vision: ViT-strategy of encoding an image</li>
<li>Speech: Multi-layer 1D CNN</li>
<li>Text: sub-word units &amp; a learned embedding lookup table</li>
</ul>
<h3 id="masking">Masking</h3>
<ul>
<li>Computer vision: block-wise masking strategy (BEiT)</li>
<li>Speech: mask spans of latent</li>
<li>Text: mask tokens</li>
</ul>
<h3 id="training-targets">Training targets</h3>
<ul>
<li><p>The model looks at the encoding of the masked sample and predicts the original unmasked representation.</p>
</li>
<li><p>The predicted representation is contextualized: through self-attention it can carry information from the entire sample, not just a single time-step. (This is an important difference to BERT, wav2vec 2.0, BEiT, and MAE, which predict targets lacking contextual information.)</p>
</li>
<li><p>Teacher parameterization: EMA</p>
</li>
<li><p>Targets</p>
<ul>
<li>Apply a normalization to each block to obtain $\hat{a}_t^l$ </li>
<li>Average the top $K$ blocks, $y_t = \frac{1}{K}\sum_{l=L-K+1}^{L} \hat{a}_t^l$, for time-step $t$<br><img src="https://velog.velcdn.com/images/latent-cho/post/803bf838-7c2a-41ed-b1d4-0cccf3b24a40/image.png" alt=""></li>
</ul>
</li>
</ul>
<ul>
<li>Normalizing the targets helps
(1) prevent the model from collapsing into a constant representation for all time-steps
(2) prevent layers with a high norm from dominating the target features.</li>
<li>Computer vision: We use the same modified image both in teacher mode and student mode.</li>
</ul>
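<p>A sketch of the target construction above, under assumptions of my own (a simple per-time-step normalization instead of the paper's modality-specific choice, e.g. instance norm for speech):</p>

```python
import numpy as np

def build_targets(layer_outputs, K):
    """data2vec targets: normalize each of the top-K block outputs,
    then average them per time-step:
        y_t = (1/K) * sum_{l=L-K+1}^{L} a_hat_t^l
    layer_outputs: list of L arrays of shape (T, D), one per block."""
    def normalize(a):
        mu = a.mean(axis=-1, keepdims=True)
        sd = a.std(axis=-1, keepdims=True)
        return (a - mu) / (sd + 1e-6)
    top_k = layer_outputs[-K:]  # blocks L-K+1 .. L
    return np.mean([normalize(a) for a in top_k], axis=0)
```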
<h3 id="objective-smooth-l1-loss">Objective: Smooth L1 loss</h3>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/47339202-3f81-4578-b7fb-8b47e38f545f/image.png" alt=""></p>
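<p>A minimal sketch of the smooth L1 objective, with $\beta$ controlling the squared-to-linear transition (the paper tunes $\beta$ per modality; the default here is my own):</p>

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors.
    loss = 0.5 * d^2 / beta   if |d| <= beta
         = |d| - 0.5 * beta   otherwise
    The linear tail makes the loss less sensitive to outlier targets."""
    d = np.abs(pred - target)
    return np.where(d <= beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()
```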
<h1 id="5-results">5. Results</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/7bf9bb31-a96f-460c-b786-2e75885eac31/image.png" alt="">
<img src="https://velog.velcdn.com/images/latent-cho/post/a31f631e-f449-4fae-9db4-e2dac3470266/image.png" alt="">
<img src="https://velog.velcdn.com/images/latent-cho/post/445d782b-e72d-4d2b-94ee-ee0271726926/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[PR | 22-07] Self-supervised Learning: Generative or Contrastive]]></title>
            <link>https://velog.io/@latent-cho/22-07-Self-supervised-Learning-Generative-or-Contrastive</link>
            <guid>https://velog.io/@latent-cho/22-07-Self-supervised-Learning-Generative-or-Contrastive</guid>
            <pubDate>Mon, 15 Aug 2022 11:33:24 GMT</pubDate>
            <description><![CDATA[<p>Self-supervised learning (SSL) 을 공부하기 앞서, 큰 그림을 보려고 survey 논문을 가볍게 스키밍했다. SSL 에 크게 관심이 있지는 않았는데 친구들이랑 같이 공부하는게 좋아서 시작했다. 시작하고보니 연구 주제도 떠오르고, 잘한 것 같다.</p>
<h1 id="1-introduction">1. Introduction</h1>
<p>If SSL is defined in one sentence, it is &quot;predicting part of the input&quot;. </p>
<blockquote>
<p>In the invited speech at AAAI 2020, the Turing Award winner Yann LeCun described self-supervised learning as ”the machine predicts any parts of its input for any observed part”.</p>
</blockquote>
<h1 id="2-motivation-of-ssl">2. Motivation of SSL</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/71324edf-8940-4334-8f8e-139d20c891c4/image.png" alt=""></p>
<p>Two major axes will be covered going forward. </p>
<p><strong>Contrastive</strong>: MoCo and SimCLR probably come to mind first when thinking of SSL; they belong to the contrastive learning branch.
<strong>Generative</strong>: think of the BERT (AE) and GPT (AR) styles in NLP. Vision also seems to be moving from contrastive learning in this direction; I will review these one by one. </p>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/abbcfd32-e612-4402-8dd0-7f65809db553/image.png" alt=""></p>
<p><strong>Contrastive vs Generative (vs Generative-Contrastive)</strong>
The examples alone ({MoCo, SimCLR} vs {BERT, GPT}) should already give a sense of how contrastive and generative differ, but formulating the differences makes them clearer and more intuitive.</p>
<table>
<thead>
<tr>
<th align="center"></th>
<th align="center">z</th>
<th align="center">discriminator</th>
<th align="center">objective</th>
</tr>
</thead>
<tbody><tr>
<td align="center"><strong>Generative</strong></td>
<td align="center">explicit</td>
<td align="center">X</td>
<td align="center">reconstruction loss</td>
</tr>
<tr>
<td align="center"><strong>Contrastive</strong></td>
<td align="center">explicit</td>
<td align="center">O</td>
<td align="center">contrastive similarity metric</td>
</tr>
<tr>
<td align="center"><strong>Generative-Contrastive</strong></td>
<td align="center">implicit</td>
<td align="center">O</td>
<td align="center">distributional divergence</td>
</tr>
</tbody></table>
<blockquote>
<ul>
<li>A properly designed training objective related to down-stream tasks could turn our randomly initialized models into excellent pre-trained feature extractors.<ul>
<li>ex) contrastive learning - classification tasks</li>
</ul>
</li>
<li>The art of self-supervised learning primarily lies in defining proper objectives for unlabeled data. </li>
</ul>
</blockquote>
<hr>
<p>The SSL paper reviews will likely proceed in the following order:</p>
<ul>
<li>MoCo</li>
<li>SimCLR</li>
<li>BYOL</li>
<li>SimSiam</li>
<li>DINO</li>
<li>iBOT</li>
<li>BEiT</li>
<li>MAE</li>
<li>data2vec</li>
</ul>
<hr>
<h1 id="-1-reference">-1. Reference</h1>
<ul>
<li>This paper: <a href="https://arxiv.org/abs/2006.08218">https://arxiv.org/abs/2006.08218</a></li>
</ul>
]]></description>
        </item>
    </channel>
</rss>