dien-eaststar.log

[Paper Review] Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

Wed, 31 Aug 2022 07:40:54 GMT

📝 Study group: 거꾸로 읽는 self-supervised-learning Season 2: Contrastive learning on NLP

🔗 Paper link: Don’t Stop Pretraining: Adapt Language Models to Domains and Tasks

📚 Presentation slides: by Dien

💻 Presentation video: by Dien

Abstract

Language models pretrained on text from a wide variety of sources form the foundation of today’s NLP. In light of the success of these broad-coverage models, we investigate whether it is still helpful to tailor a pretrained model to the domain of a target task. We present a study across four domains (biomedical and computer science publications, news, and reviews) and eight classification tasks, showing that a second phase of pretraining in-domain (domain-adaptive pretraining) leads to performance gains, under both high- and low-resource settings. Moreover, adapting to the task’s unlabeled data (task-adaptive pretraining) improves performance even after domain-adaptive pretraining. Finally, we show that adapting to a task corpus augmented using simple data selection strategies is an effective alternative, especially when resources for domain-adaptive pretraining might be unavailable. Overall, we consistently find that multiphase adaptive pretraining offers large gains in task performance.

Introduction

1. Pretrained Language Model (LM)

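RoBERTa is pretrained on a large corpus of more than 160GB of text, including English encyclopedias, news, and literary works.
Pretrained LMs achieve strong performance on a wide range of tasks drawn from a wide range of domains.
However, it has not been established whether they generalize to the data distribution of a specific domain or task.
Here, a domain is a corpus built from a particular topic or source, and a task is the problem to be solved within a particular domain.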

2. Adaptive Pretraining

Adaptive pretraining takes a pretrained model as the initial weights and continues pretraining it.
The paper performs domain-adaptive pretraining (DAPT) on each of four domains (biomedical, computer science, news, and reviews).
It also performs task-adaptive pretraining (TAPT) using task-specific data that is more directly related to the task.
In addition, augmentation methods are used to improve TAPT, which relies on comparatively little data.

The contributions of the paper are as follows.

A thorough analysis of domain- and task-adaptive pretraining across 4 domains and 8 NLP classification tasks

A study of the transferability of adapted language models across domains and tasks

A demonstration of the benefit of augmenting the TAPT data, both with human-curated data and with data selected automatically using kNN

Background

1. RoBERTa

Among models pretrained on large corpora, the best known are the BERT family, which use masked language modeling (MLM).
RoBERTa keeps the BERT architecture, is pretrained on a corpus about 10 times larger than BERT's, and tunes several training hyperparameters.
- A new mask is applied every time a sequence is seen (BERT applies masking only once, before training)
- The next sentence prediction (NSP) loss used in BERT is removed
- Training uses a large batch size of 2K or more
- Sequences of the maximum length are used
This study takes the pretrained RoBERTa model and continues with domain- and task-adaptive pretraining.
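
The difference between static and dynamic masking can be illustrated with a small sketch. This is not RoBERTa's actual implementation: the `<mask>` string, the 15% masking rate, and the helper name `dynamic_mask` are assumptions made for illustration, and the sketch masks whole whitespace tokens rather than subword units.

```python
import random

MASK = "<mask>"  # placeholder mask token (assumed for illustration)

def dynamic_mask(tokens, mask_prob=0.15, rng=None):
    """Return a masked copy of `tokens` and the list of masked positions."""
    rng = rng or random.Random()
    masked, positions = list(tokens), []
    for i in range(len(tokens)):
        if rng.random() < mask_prob:
            masked[i] = MASK
            positions.append(i)
    return masked, positions

tokens = "the model is pretrained on a large corpus".split()
rng = random.Random(0)
# Dynamic masking: a fresh mask is sampled on every pass over the same sentence.
epoch_masks = [dynamic_mask(tokens, rng=rng) for _ in range(3)]
```

With static masking, `dynamic_mask` would be called once and the same masked sentence reused in every epoch.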

Domain-Adaptive Pretraining

Four domains that are frequently covered in prior work were selected (biomedical papers, computer science papers, news text, and Amazon reviews).
Each selected domain has two classification tasks.

1. Analyzing Domain Similarity

Domain similarity is assessed by comparing the vocabulary overlap (%) between PT (RoBERTa's pretraining corpus) and each domain corpus, based on the 10K most frequent words in each.
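
A minimal sketch of such a vocabulary-overlap measure is shown below. The whitespace tokenization, the lowercasing, and the choice of denominator (the larger of the two vocabularies, which matters only when a corpus has fewer than `k` distinct words) are assumptions for illustration, not details taken from the paper.

```python
from collections import Counter

def top_vocab(docs, k=10_000):
    """Set of the k most frequent whitespace-separated words in a corpus."""
    counts = Counter(w for d in docs for w in d.lower().split())
    return {w for w, _ in counts.most_common(k)}

def vocab_overlap(docs_a, docs_b, k=10_000):
    """Percentage of shared words between the top-k vocabularies of two corpora."""
    va, vb = top_vocab(docs_a, k), top_vocab(docs_b, k)
    return 100.0 * len(va & vb) / max(len(va), len(vb), 1)

news = ["the market fell", "the market rose"]
biomed = ["the protein binds", "the cell divides"]
overlap = vocab_overlap(news, biomed)  # only "the" is shared
```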

2. Experiments

Pretraining performance comparison (unsupervised MLM loss)

(Baseline) The MLM loss on each domain is computed with the pretrained RoBERTa model.
(DAPT) The pretrained RoBERTa model is used as the initial weights, pretraining continues on each domain corpus, and the MLM loss is then computed.
The MLM loss decreases in every domain except NEWS.
The improvement is largest in the BIOMED and CS domains, which are the least similar to PT, suggesting that the lower the domain similarity, the greater the potential of DAPT.

Downstream task performance comparison (supervised fine-tuning, classification F-score)

(Baseline) The pretrained RoBERTa model is fine-tuned for classification on the downstream tasks of each domain.
(DAPT) Pretraining continues on each domain corpus, followed by classification fine-tuning.
(¬DAPT) Pretraining continues on a domain that is unrelated (dissimilar) to the task's domain, followed by classification fine-tuning.
DAPT outperforms the baseline in most cases.
As the change in MLM loss suggested, the effect is most pronounced in the BIOMED and CS domains.

3. Domain Relevance for DAPT

When DAPT is performed on a domain dissimilar to the target domain (¬DAPT), performance actually falls below the baseline.
This suggests that what matters in DAPT is domain relevance, not simply exposure to more data.
The model adapted to NEWS actually outperformed the baseline on the CS domain, presumably because the NEWS domain is similar to PT. (This point is not in the paper; it was briefly discussed in the week 3 presentation video of 거꾸로 읽는 SSL-NLP by Dien.)

4. Domain Overlap

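When two domains are highly similar, adapting to one can transfer positively to the other even though they are different domains.
The last row of Table 3 above additionally reports results for two domains with high similarity (40%).
The model adapted to NEWS achieved comparable performance on the REVIEWS domain.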

Task-Adaptive Pretraining

1. Experiments

(TAPT) The pretrained RoBERTa model is further pretrained on the task corpus of each domain and then fine-tuned for classification on that task.
(DAPT+TAPT) The DAPT model is further pretrained on the task corpus and then fine-tuned for classification on that task.
TAPT has far less data available than DAPT, but the data is task-specific, so it is expected to be efficient.

TAPT, Combined DAPT and TAPT

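TAPT improves over the baseline on every task.
Despite using fewer resources than DAPT, TAPT even outperforms DAPT on some tasks.
On CS, performance drops noticeably relative to DAPT, presumably because of the low-resource setting.
DAPT+TAPT is the most expensive option but gives the best performance.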

Cross-Task Transfer

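(Transfer-TAPT) TAPT is performed on a different task within the same domain.

Performance drops in this setting, which suggests that the data distribution can differ from task to task even within the same domain.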

Augmenting Training Data for Task-Adaptive Pretraining

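Two augmentation methods are proposed to overcome the limited resources available to TAPT.

1. Human Curated-TAPT

A person manually curates task-related text from the unlabeled corpus of the domain, and this curated data is used to augment the task corpus.

Adding Curated-TAPT gives better performance.
In particular, the results in the blue box show that with only about 0.3% of the labeled data the model reaches about 95% of the full-data performance (83.8 versus 87.8).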

2. Automated Data Selection for TAPT

Curated-TAPT costs human effort.
The paper therefore proposes kNN-TAPT, which samples data related to the task distribution from the domain corpus in an unsupervised way (text embeddings from VAMPIRE).
- The embedding model must be light enough to embed every sentence in a reasonable amount of time
- VAMPIRE (VAriational Methods for Pretraining In Resource-limited Environments, Gururangan et al. "Variational pretraining for semi-supervised text classification." (2019)), a unigram bag-of-words language model, is used
- A VAE is trained with the word frequencies of a text as both input and target
- The encoder of the trained VAE produces the text embeddings
- The model is trained on the task and domain corpora together so that both are embedded in the same space
The trained embedding model embeds the domain and task corpora, and for each task document the k nearest domain documents are sampled.
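
The selection step can be sketched as follows, assuming the embeddings have already been computed. The function name `knn_select`, the Euclidean distance, and the brute-force search are illustrative assumptions; a real corpus would need an approximate nearest-neighbour index.

```python
import math

def knn_select(task_vecs, domain_vecs, k):
    """Indices of domain documents among the k nearest neighbours of any task document."""
    selected = set()
    for t in task_vecs:
        # Rank every domain document by distance to this task document.
        ranked = sorted(range(len(domain_vecs)),
                        key=lambda i: math.dist(t, domain_vecs[i]))
        selected.update(ranked[:k])
    return sorted(selected)

task_vecs = [(0.0, 0.0), (10.0, 10.0)]
domain_vecs = [(0.0, 1.0), (5.0, 5.0), (9.0, 9.0), (100.0, 100.0)]
picked_k1 = knn_select(task_vecs, domain_vecs, k=1)
picked_k2 = knn_select(task_vecs, domain_vecs, k=2)
```

The selected domain documents are then added to the task corpus before running TAPT. A larger `k` pulls in more of the domain corpus.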

kNN-TAPT improves on TAPT in every case.
As k increases, the performance of kNN-TAPT rises steadily and approaches that of DAPT.
More sophisticated data selection methods for kNN-TAPT could be explored in future work (kNN-LMs, RETRO, etc.).

3. Computational Requirements

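The experiments were run on the RCT-500 task of the BIOMED domain.

DAPT gives good performance, but it requires about 40 times the storage of TAPT.
Curated-TAPT gives the best performance, but it incurs the cost of a person selecting the data by hand.
kNN-TAPT gives reasonable performance for the small amount of resources it uses.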

Conclusion

The experiments show that language models struggle to encode the complexity that is specific to a domain.
Continuing to pretrain a model on a specific domain or task can improve its performance.
To improve a language model further, it is important to add corpora that suit the domain and the task.

🙆🏻‍♂️ After reading the paper..

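I used to take a language model and only fine-tune it on the task I wanted to solve; I learned that additional pretraining can give better performance.
I am curious about the kNN-based document retriever methods of kNN-LMs and RETRO mentioned in the presentation video, and plan to look into them later. 🤷🏻‍♂️
I was very nervous about the presentation video being preserved on YouTube, and it is a pity that the nervousness shows so clearly in the video.. 🤦‍♂️😂
Still, it was a good experience and I got a lot of help. I would like to contribute to activities like this more often and keep improving. 💪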

[Paper Review] VQ-VAE: Neural Discrete Representation Learning

Mon, 15 Aug 2022 15:33:52 GMT

📝 Reference site: 거꾸로 읽는 self-supervised-learning Part 1

🔗 Paper link: Neural Discrete Representation Learning

Abstract

Learning useful representations without supervision remains a key challenge in machine learning. In this paper, we propose a simple yet powerful generative model that learns such discrete representations. Our model, the Vector Quantised Variational AutoEncoder (VQ-VAE), differs from VAEs in two key ways: the encoder network outputs discrete, rather than continuous, codes; and the prior is learnt rather than static. In order to learn a discrete latent representation, we incorporate ideas from vector quantisation (VQ). Using the VQ method allows the model to circumvent issues of “posterior collapse”, where the latents are ignored when they are paired with a powerful autoregressive decoder, typically observed in the VAE framework. Pairing these representations with an autoregressive prior, the model can generate high quality images, videos, and speech as well as doing high quality speaker conversion and unsupervised learning of phonemes, providing further evidence of the utility of the learnt representations.

Introduction

1. Representation Learning for Generative Model with Discrete Features

Recent generative models rely on learned representations for many challenging tasks (few-shot learning, domain adaptation, reinforcement learning, etc.). However, learning representations in an unsupervised way is still not the dominant approach.
Maximum likelihood and reconstruction error are common objectives for training unsupervised models in the pixel domain. However, their usefulness depends on the particular application the features are used for.
The goal of this paper is to learn the important features of the data in the latent space while optimizing maximum likelihood.
Discrete representations are a more natural fit for many modalities (language and speech are discrete, and images can often be described by language). Much prior work has focused on continuous features, whereas this paper deals with discrete representations.

2. Vector Quantised Variational AutoEncoder (VQ-VAE)

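The paper introduces a new generative model that successfully combines VAEs with discrete latent representations.
By applying VQ, the model avoids both the high-variance problem (the large variance of discrete variables slows training) and "posterior collapse" (the approximate posterior simply mimics the prior, so the model is trained while ignoring the latent variables).
Because VQ-VAE uses the latent space effectively, it can model a variety of domains successfully without spending capacity on local information (noise and imperceptible details).
Finally, once a discrete latent space that suits the modality has been learned, the discrete random variables can be used to generate interesting samples and enable useful applications.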

VQ-VAE

A VAE consists of an encoder that learns $q(z|x)$, the latent variable $z$, and a decoder that learns $p(x|z)$; the reparametrization trick is applied between the encoder output and the decoder input.
- $z$: discrete latent random variable (latent space)
- $q(z|x)$: posterior distribution of $z$ given the input $x$
- $p(z)$: prior distribution
- $p(x|z)$: distribution of the input $x$ given $z$ (the decoder likelihood)
VQ-VAE adds vector quantisation (VQ) to the VAE: the prior $p(z)$ is learned over discrete latent variables, and the embedding vectors selected for those latents are passed through the decoder.

1. Discrete Latent Variables

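Training involves the encoder, the decoder, and the embedding space.
A latent embedding space $e\in R^{K \times D}$ is defined, where $K$ is the size of the discrete latent space and $D$ is the dimension of each embedding vector.
First, the encoder passes the input $x$ through a CNN and outputs $z_e(x)$.
Eq. (1) uses $z_e(x)$ and $e$ in a nearest-neighbour look-up to produce the one-hot posterior $q(z|x)$, and Eq. (2) maps the selected index to its embedding in $e$ to form $z_q(x)$.
Finally, the decoder passes $z_q(x)$ through a CNN and outputs $p(x|z_q)$.
The authors view this model as a VAE in which $\log p(x)$ is bounded by the ELBO. The prior over $z$ is defined as a uniform distribution, so the $D_{KL}$ term becomes a constant. (A standard VAE defines $z$ with a Gaussian distribution.)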
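
The nearest-neighbour look-up can be sketched in plain Python for a single encoder output vector. The function name `quantize` and the toy codebook are assumptions for illustration; in practice the look-up runs over every spatial position of the encoder feature map.

```python
def quantize(z_e, codebook):
    """Return (k, z_q): the index of the nearest code and the code itself."""
    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    # k = argmin_j ||z_e - e_j||^2, then z_q = e_k
    k = min(range(len(codebook)), key=lambda j: sq_dist(z_e, codebook[j]))
    return k, codebook[k]

codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # K = 3 codes of dimension D = 2
k, z_q = quantize([0.9, 1.2], codebook)
```

Only the index `k` needs to be stored for each position, which is what makes the latent representation discrete.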

2. Learning
Eq. (3) is the full loss function of VQ-VAE and consists of three terms. (sg: stop gradient)
① Reconstruction loss
- The first term is the reconstruction loss, which optimizes both the encoder and the decoder.
- The argmin operation used in Eq. (1, 2) is nonlinear and not differentiable, so a method similar to the straight-through estimator is used (the gradient at the decoder input $z_q(x)$ is copied to the encoder output $z_e(x)$).
- In the forward pass $z_q(x)$ is passed to the decoder, and in the backward pass the gradient is passed unchanged to the encoder.
- The encoder output $z_e(x)$ and the decoder input $z_q(x)$ share the same $D$-dimensional space, so the gradient carries useful information for lowering the reconstruction loss.
② Codebook loss
- Because the gradient is copied straight from $z_q(x)$ to $z_e(x)$, the embeddings $e_i$ receive no gradient from the reconstruction loss.
- The embedding vectors are therefore learned with vector quantisation (VQ), one of the simplest dictionary learning algorithms.
- The $l_2$ error between $e_i$ and $z_e(x)$ moves $e_i$ toward $z_e(x)$.
③ Commitment loss
- Because the volume of the embedding space is unbounded, it can grow arbitrarily if the embeddings $e_i$ do not train as fast as the encoder parameters.
- A commitment loss is added so that the encoder output commits to an embedding vector and stays close to it.
- Performance does not change much as the hyperparameter $\beta$ varies, so the authors argue that the model is robust to the value of $\beta$. ($\beta$ = 0.25 in all experiments)
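
For reference, the three terms described above can be written out as follows. This is a reconstruction in the paper's notation, with the terms in the order ①, ②, ③; the first term is written as a log-likelihood, as in the paper, and is usually implemented as a negative log-likelihood or reconstruction error to be minimized.

$$L = \log p(x \mid z_q(x)) + \lVert \mathrm{sg}[z_e(x)] - e \rVert_2^2 + \beta \, \lVert z_e(x) - \mathrm{sg}[e] \rVert_2^2$$

The stop-gradient in the second term means only the codebook $e$ is updated by it, and the stop-gradient in the third term means only the encoder is updated by it.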

The log-likelihood of the model, $\log p(x)$, is computed as $\log\sum_k p(x|z_k)p(z_k)$. Because the decoder $p(x|z)$ is trained with $z=z_q(x)$ from MAP inference, a fully converged decoder should place no probability mass on $p(x|z)$ for $z\neq z_q(x)$. Therefore $\log p(x)\approx \log p(x|z_q(x))p(z_q(x))$, and by Jensen's inequality $\log p(x)\geq \log p(x|z_q(x))p(z_q(x))$.

3. Prior

The prior distribution over the discrete latents $p(z)$ is a categorical distribution, and it can be made autoregressive by conditioning on the other $z$ in the feature map.
The paper uses PixelCNN for images and WaveNet for raw audio.

Experiments

1. Comparison with continuous variables

For evaluation, the model is compared with a VAE (continuous variables) and VIMCO (independent Gaussian or categorical priors). (VAE: $4.51$ $bits/dim$, VQ-VAE: $4.67$ $bits/dim$, VIMCO: $5.14$ $bits/dim$)
Although VQ-VAE uses a discrete latent space, it performs on par with a continuous latent space.

2. Images

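In Figure 2, the images on the right are the images on the left after being mapped to the latent space $z$ and reconstructed by the VQ-VAE decoder. Resolution and detail are slightly reduced, but the overall content is captured without losing important information despite the large reduction in dimensionality.
Figure 3 shows several images generated from VQ-VAE with a PixelCNN prior.
Figure 4 shows the results of training on data from the DeepMind Lab environment.
Figure 5 shows reconstructions that use only 3 latent variables. Much of the original scene, such as textures, room layout, and nearby walls, is preserved, yet the model does not store the pixel values themselves; they are generated by the PixelCNN. In other words, the model does not suffer from the posterior collapse that commonly occurs in VAEs, and the latent space is used meaningfully.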

3. Audio

The first experiment extracts a latent space that preserves only long-term information. As Figure 6 shows, the reconstruction has the same text content, but the waveform is quite different and the prosody of the voice changes. This means that VQ-VAE, without any linguistic supervision, learns a high-level abstract space that encodes only the content of the speech rather than low-level features.
The second experiment trains a prior on the learned latent representation to model the long-term dependencies in the data. Samples from the original WaveNet, previously the best-performing model, sound like babbling, whereas samples from VQ-VAE contain clear words and part-sentences.
The third experiment attempts speaker conversion: latents are extracted from one speaker and then reconstructed through the decoder as a different speaker. This shows that the encoded representation has factored out speaker-specific information.
The fourth experiment compares each discrete latent variable one-to-one with the ground-truth phoneme sequence. Mapping every latent variable to one of 41 phonemes and using that mapping for classification gives an accuracy of 49.3% (random accuracy is 7.2%). The latent variables obtained without supervision can therefore be said to have learned phoneme information.

4. Video

A generative model conditioned on a given action sequence is trained in the DeepMind Lab environment.
Figure 7 shows the first 6 input frames and 10 frames sampled from VQ-VAE.
VQ-VAE can generate long sequences purely in the latent space, without relying on the pixel space.

Conclusion

The paper proposes VQ-VAE, a model that combines VAEs with VQ to obtain discrete latent representations.
VQ-VAE shows that long-term dependencies in images, video, and audio can be modeled through a compressed discrete latent space.
All experiments are trained in an unsupervised way and show that the model captures the important features of the data well.
VQ-VAE also achieves performance similar to a continuous latent space on CIFAR-10.
The authors claim that VQ-VAE is the first discrete latent representation model that successfully models long-term sequences in a fully unsupervised way.

🙆🏻‍♂️ After reading the paper..

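I found it fascinating how stop gradient and VQ are used to solve several problems in learning a discrete latent space.
It is a pity that the method is not end-to-end, and I wonder whether only autoregressive models can be used for the prior and how other models would perform. 🤷🏻‍♂️
Personally, I think I need more study and another post to fully understand the concepts and equations of the VAE. 🙋🏻‍♂️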

[Paper Review] DeepCluster: Deep Clustering for Unsupervised Learning of Visual Features

Tue, 14 Jun 2022 05:18:43 GMT

📝 Reference site: 거꾸로 읽는 self-supervised-learning Part 1

🔗 Paper link: Deep Clustering for Unsupervised Learning of Visual Features

Abstract

Clustering is a class of unsupervised learning methods that has been extensively applied and studied in computer vision. Little work has been done to adapt it to the end-to-end training of visual features on large scale datasets. In this work, we present DeepCluster, a clustering method that jointly learns the parameters of a neural network and the cluster assignments of the resulting features. DeepCluster iteratively groups the features with a standard clustering algorithm, k-means, and uses the subsequent assignments as supervision to update the weights of the network. We apply DeepCluster to the unsupervised training of convolutional neural networks on large datasets like ImageNet and YFCC100M. The resulting model outperforms the current state of the art by a significant margin on all the standard benchmarks.

Introduction

1. Pre-trained Convnet

Pre-trained convnets learn general-purpose features that improve generalization when data is limited.
ImageNet, a large fully-supervised dataset, drove the progress of pretrained convnets.
Even so, ImageNet contains only about a million images intended for object classification, which is relatively small by today's real-world standards.
Larger and more diverse datasets with billions of images are needed, but labeling them is costly and introduces bias into the visual representations.
Unsupervised learning methods are needed to address this.

2. Unsupervised Learning

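Unsupervised learning methods such as clustering, dimensionality reduction, and density estimation are often used in specific vision domains (satellite, medical, etc.).
There is very little work on end-to-end unsupervised learning, and none at large scale.
Clustering methods are mainly designed as linear models on top of fixed features. Learning the clustering and the features at the same time can lead to a trivial solution in which the features become zero and the clusters collapse into a single entity. (That is, everything is learned as one cluster.)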

3. Proposed Model

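The paper proposes a new clustering method that can learn general-purpose visual features through large-scale end-to-end training.
It alternates between clustering the images and updating the convnet weights by predicting the cluster assignments.
K-means is used as the clustering algorithm. (A comparison with PIC is given in the supplementary material.)
Unlike self-supervised methods, which require a hand-designed pretext task, the proposed clustering method needs almost no domain knowledge and no specific signal from the input.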

Contribution

An end-to-end unsupervised learning method based on clustering with k-means
State-of-the-art results on many standard transfer tasks used in unsupervised learning
A discussion of the current evaluation protocol for unsupervised feature learning

Unsupervised Learning of Features

Earlier work on unsupervised clustering (Coates and Ng [32]) is multi-stage.
Earlier work that jointly learns convnet features and image clusters ([21,22,33,34]) has not been validated at the scale studied with modern convnet architectures.

Self-Supervised Learning

A form of unsupervised learning that replaces labels with "pseudo-labels" to perform representation learning. (e.g. masking, patch level, pixel level)
Many kinds of self-supervised learning exist, based on spatial and temporal consistency, image colorization, cross-channel prediction, sound, or instance counting, but they are domain dependent.

Generative Models

A form of unsupervised learning that uses autoencoders, GANs, reconstruction losses, and so on.

Method

1. Preliminaries

Convnet mapping function: $f_\theta$

Classifier: $g_w$
Inputs pass through $f_\theta$ and then $g_w$, and $\theta$ and $w$ are learned jointly.

2. Unsupervised Learning by Clustering

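If $\theta$ is sampled from a Gaussian distribution without any training, $f_\theta$ does not produce good features, yet they are still well above chance. => A classifier on top of an untrained AlexNet reaches 12% accuracy on ImageNet, whereas chance over 1,000 classes is 0.1%.
This weak signal (the initial features) is exploited: the convnet outputs (features) are clustered, and the cluster assignments are used as pseudo-labels ($y_n$) for optimization.
K-means is used for clustering, and the assignments $y_n$ that optimize the k-means objective serve as the pseudo-labels.
DeepCluster alternates between learning features and clustering, which can lead to a trivial solution (everything is learned as one cluster).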
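
For reference, the k-means objective that yields the pseudo-labels can be written as follows. This is a reconstruction in the paper's notation, where $C$ is the $d \times k$ centroid matrix, $N$ is the number of images, and $1_k$ is the all-ones vector of length $k$.

$$\min_{C \in \mathbb{R}^{d \times k}} \frac{1}{N} \sum_{n=1}^{N} \min_{y_n \in \{0,1\}^k} \lVert f_\theta(x_n) - C y_n \rVert_2^2 \quad \text{such that} \quad y_n^\top 1_k = 1$$

The constraint makes each $y_n$ a one-hot vector, so every image is assigned to exactly one cluster.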

3. Avoiding Trivial Solutions

Trivial solutions are not specific to unsupervised learning; they arise whenever a classifier and its labels are learned jointly.

Empty clusters

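A discriminative model learns decision boundaries between classes, and one optimal decision boundary is to assign all the data to a single cluster.
- This is resolved by reassigning empty clusters during k-means optimization. => The centroid of an empty cluster is replaced with a slightly perturbed copy of the centroid of a non-empty cluster, and the data of that cluster is redistributed over the two clusters.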
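
A minimal sketch of the empty-cluster fix, under the assumption that it runs between the assignment and update steps of k-means. The function name `fix_empty_clusters` and the perturbation size `eps` are hypothetical; the points are redistributed over the two clusters by the next assignment step.

```python
import random

def fix_empty_clusters(centroids, assignments, rng, eps=1e-3):
    """Replace each empty cluster's centroid by a perturbed copy of a non-empty one."""
    k = len(centroids)
    sizes = [0] * k
    for a in assignments:
        sizes[a] += 1
    non_empty = [j for j in range(k) if sizes[j] > 0]
    for j in range(k):
        if sizes[j] == 0:
            donor = rng.choice(non_empty)  # randomly chosen non-empty cluster
            centroids[j] = [c + rng.uniform(-eps, eps) for c in centroids[donor]]
    return centroids

centroids = [[0.0, 0.0], [5.0, 5.0], [9.0, 9.0]]
assignments = [0, 0, 1]  # cluster 2 received no points
fixed = fix_empty_clusters(centroids, assignments, random.Random(0))
```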

Trivial parametrization

This problem also occurs with imbalanced data: the parameters are learned mainly for the majority class.
- It is resolved by weighting the loss function by cluster size (the inverse of the size of the assigned cluster).
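
The weighting can be sketched as follows. The function name `inverse_size_weights` is an assumption for illustration; each sample's loss would be multiplied by its weight, so every cluster contributes equally in total.

```python
from collections import Counter

def inverse_size_weights(pseudo_labels):
    """Per-sample loss weight: 1 / (size of the cluster the sample is assigned to)."""
    sizes = Counter(pseudo_labels)
    return [1.0 / sizes[y] for y in pseudo_labels]

pseudo_labels = [0, 0, 0, 1]
weights = inverse_size_weights(pseudo_labels)
```

Sampling images uniformly over pseudo-labels has the same effect as applying these weights.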

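Implementation details

Results are compared using AlexNet and VGG-16. Because unsupervised methods often do not work well directly on color, a Sobel filter is applied to remove the color from the images.
The large-scale ImageNet dataset is used.
The features are reduced to 256 dimensions with PCA and normalized before k-means clustering.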

Experiments

1. Preliminary Study

Normalized Mutual Information (NMI) is used. => It measures how dependent the clusters are on the true labels.
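
NMI can be computed directly from two label assignments, as sketched below. This assumes the normalization $I(A;B)/\sqrt{H(A)H(B)}$; other normalizations exist, and the function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum(c / n * math.log(c / n) for c in Counter(labels).values())

def nmi(a, b):
    """Normalized mutual information I(A;B) / sqrt(H(A) H(B))."""
    n = len(a)
    joint, ca, cb = Counter(zip(a, b)), Counter(a), Counter(b)
    mi = sum(c / n * math.log(c * n / (ca[x] * cb[y]))
             for (x, y), c in joint.items())
    denom = math.sqrt(entropy(a) * entropy(b))
    return mi / denom if denom > 0 else 0.0

same_up_to_renaming = nmi([0, 0, 1, 1], [1, 1, 0, 0])
independent = nmi([0, 0, 1, 1], [0, 1, 0, 1])
```

A value of 1 means one assignment determines the other, and a value of 0 means they are independent. Renaming the cluster ids does not change the score.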

Relation between clusters and labels

Figure 2(a) shows the NMI between the cluster assignments and the ImageNet labels during training.
The dependence between clusters and labels increases as training progresses.

Number of reassignments between epochs

Images are reassigned to new clusters at every epoch, and Figure 2(b) measures how stable these assignments are.
As training progresses, there are fewer reassignments and the clusters stabilize.

Choosing the number of clusters

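Figure 2(c) shows model performance as a function of the number of clusters k, a hyperparameter of k-means.
The best performance is obtained at k = 10,000, which shows that partitioning the data more finely than the 1,000 ImageNet classes can be useful.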

2. Visualizations

First layer filters

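Figure 3 shows the first-layer filters of an AlexNet trained with DeepCluster on raw RGB images and on Sobel-filtered images.
Most filters trained on RGB images capture only color information, which typically matters little for object classification, whereas filters trained on Sobel-filtered images act like edge detectors.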

Probing deeper layers

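Figure 4 assesses the quality of a target filter by learning an input image that maximizes its activation.
As expected, deeper layers in the network appear to capture larger textural structures.
However, some filters in the last convolutional layer seem to simply repeat textures already captured in earlier layers.
Figure 5 shows the top activated images for some filters that appear to be semantically coherent.
The first row contains information highly correlated with classes, and the second row shows filters that respond to the style of drawings or abstract shapes.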

3. Linear classification on activations

Table 1 shows results obtained by freezing each convolutional layer and training a linear classifier on top of it.
DeepCluster outperforms previous unsupervised learning methods by about 5% on conv3 to conv5.
It performs poorly at conv1 because Sobel filtering removes color information.
Conv3 performs better than conv5, which is consistent with the results in Figure 4.
The Places column shows the results of testing a model trained on ImageNet on the Places data.
DeepCluster performs best at conv4, and, as with supervised learning on ImageNet, performance is lower at conv5 (the layer closest to the labels). This means that the labels matter less when the domain differs.

4. Pascal VOC 2007

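Table 2 shows results on three tasks on Pascal VOC (image classification, object detection, semantic segmentation).
DeepCluster outperforms previous unsupervised methods.
In particular, the setting where only FC6-8 are trained matters for practical applications, and there DeepCluster is up to 9% better (on classification) than previous methods.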

Discussion

1. ImageNet versus YFCC100M

ImageNet is a balanced dataset, and DeepCluster favors balanced datasets, so the comparison with other unsupervised work may not be fair.
Table 3 compares results on an imbalanced dataset built by randomly sampling from YFCC100M.
Although DeepCluster's performance degrades on the imbalanced dataset, it still performs better on all three tasks. => It is robust to changes in the class distribution.

2. AlexNet versus VGG

VGG is a deeper convnet than AlexNet, and Table 4 checks whether the gains hold when a different architecture is used.
Across methods, the deeper architecture leads to a substantial improvement in performance.

3. Evaluation on Instance Retrieval

Table 5 shows that the proposed model outperforms previous work at the instance level, not only at the class level.

Conclusion

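The paper proposes DeepCluster, a clustering-based method for unsupervised learning of convnets.
It clusters the convnet features with k-means, uses the assignments as pseudo-labels, and repeats the weight updates.
It is validated on large-scale datasets such as ImageNet.
The approach places almost no constraints on the input and requires no domain-specific knowledge.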

🙆🏻‍♂️ After reading the paper..

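I was impressed that the method builds on the fact that even an untrained model performs better than chance, and uses the k-means clustering results as pseudo-labels.
DEC and DAC were validated with simple convnets on relatively small datasets, whereas DeepCluster is trained with deeper architectures on the large-scale ImageNet dataset and validated on a variety of tasks.
I am curious how much each of the two remedies for trivial solutions contributes. 🤷🏻‍♂️
Similar to my question about DAC, could a more flexible model be built by treating the cluster label as a regression problem (with probability values) rather than a one-hot vector problem? 🙋🏻‍♂️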

[Paper Review] DAC: Deep Adaptive Image Clustering

Mon, 13 Jun 2022 03:59:46 GMT

📝 Reference site: 거꾸로 읽는 self-supervised-learning Part 1

🔗 Paper link: Deep Adaptive Image Clustering

Abstract

Image clustering is a crucial but challenging task in machine learning and computer vision. Existing methods often ignore the combination between feature learning and clustering. To tackle this problem, we propose Deep Adaptive Clustering (DAC) that recasts the clustering problem into a binary pairwise-classification framework to judge whether pairs of images belong to the same clusters. In DAC, the similarities are calculated as the cosine distance between label features of images which are generated by a deep convolutional network (ConvNet). By introducing a constraint into DAC, the learned label features tend to be one-hot vectors that can be utilized for clustering images. The main challenge is that the ground-truth similarities are unknown in image clustering. We handle this issue by presenting an alternating iterative Adaptive Learning algorithm where each iteration alternately selects labeled samples and trains the ConvNet. Conclusively, images are automatically clustered based on the label features. Experimental results show that DAC achieves state-of-the-art performance on five popular datasets, e.g., yielding 97.75% clustering accuracy on MNIST, 52.18% on CIFAR-10 and 46.99% on STL-10.

Introduction

1. Image Clustering

Image clustering is applied in content-based annotation and image retrieval.
K-means and agglomerative clustering have long been used, but they require a distance metric to be defined in advance, which is difficult.
Deep unsupervised feature learning with autoencoders has been studied recently, but multi-stage methods, which apply a traditional clustering method after unsupervised pre-training, have several limitations.
- Multi-stage training is cumbersome.
- The representation is fixed after unsupervised feature learning.
- The representation cannot be improved any further during clustering.

2. Proposed Method

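The paper proposes Deep Adaptive Clustering (DAC), a single-stage ConvNet-based method.
Clustering is approached as a binary pairwise-classification problem that judges whether a pair of images belongs to the same cluster.
With an added constraint, the learned features take the form of one-hot vectors.
Because the ground-truth similarities are unknown, an alternating iterative method, the Adaptive Learning algorithm, is used to optimize the model.
- In each iteration, image pairs are first selected, together with their estimated similarities, based on the fixed ConvNet.
- DAC then trains the ConvNet in a supervised manner on the selected labeled samples.
- The algorithm converges when all samples are used for training and the binary pairwise classification performance no longer improves.
- Finally, each image is clustered according to the largest response of its label feature.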

1. Data Clustering

Methods fall into three broad groups.
- Distance-based
- Density-based
- Connectivity-based

2. Image Representation

Traditional methods encode low-level features. => These often struggle with changes in the appearance of scenes and objects.
Unsupervised feature learning learns feature representations by reconstructing the input, using either autoencoders or deep generative models. => Clustering results cannot be obtained directly from the resulting representations.

3. Combination

Several methods have been proposed that combine feature learning and clustering in a single stage.
- Deep embedded clustering (DEC) learns the cluster centroids. => It has the drawback of requiring pre-training.
- Joint unsupervised learning (JULE) jointly performs agglomerative clustering and feature learning, starting from a clustering initialized with KNN. => Because it is hard to define distances between different images, performance degrades on complex image data.

4. Sample Selection

Methods for selecting training samples have been studied to train models more efficiently.
- Boosting algorithms randomly select sub-samples from the datasets that the various models were trained on and combine them to build the best-performing classifier.
- Curriculum learning uses relatively easy samples first and gradually adds more complex samples to training.
=> Both methods are restricted to labeled data.

Deep Adaptive Clustering Model

1. Binary Pairwise-Classification for Clustering

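Clustering is recast as a binary problem of learning whether a pair of images belongs to the same cluster.
- $D=\{(x_i,x_j,r_{ij})\}^n_{i=1,j=1}$
- input: $x_i, x_j \in X$ (unlabeled images)
- output: $r_{ij} \in Y$ (unknown binary variable)
Objective function: $\min_w E(w) = \sum_{i,j} L(r_{ij}, g(x_i, x_j; w))$
$L$ is the loss between $r_{ij}$ and the estimated similarity $g$, and $w$ denotes the learnable parameters.
Two issues arise with the formulation above.
- The clusters of $x_i$ and $x_j$ cannot be obtained from the estimated similarity $g$ alone.
- $Y$ is unknown in image clustering.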

2. Label Features under Clustering Constraint ($g$)

To measure the similarity ($g$) of an image pair, label features $L=\{l_i \in R^k\}^n_{i=1}$ are introduced.
$l_i$ is a k-dimensional label feature (k is the number of clusters).
A constraint is applied to the label feature $l_i$ so that it learns information that is more useful for image clustering.
With the constraints $\|l_i\|_2 = 1$ (unit $L_2$-norm) and $l_{ih} \geq 0$, the cosine similarity can be replaced by a dot product.
Therefore, with the mapping function $f_w$, $g$ can be written as $g(x_i, x_j; w) = f_w(x_i) \cdot f_w(x_j) = l_i \cdot l_j$.
Under these constraints, $l_i$ tends toward a k-dimensional one-hot vector, which makes it possible to assign images to clusters.

3. Labeled Training Samples Selection ($r$)

Because $Y$ is unknown, a strategy for selecting labeled training samples is needed. (sample: $(x_i,x_j,r_{ij})$)
ConvNets have the following two properties.
- Even before training, the filters of a randomly initialized ConvNet act as edge detectors, so it can capture low-level features.
- Once trained, a ConvNet can produce high-level features.
Based on these two properties, $r_{ij}$ is defined as follows: $r_{ij} = 1$ if $l_i \cdot l_j \geq u(\lambda)$, $r_{ij} = 0$ if $l_i \cdot l_j < l(\lambda)$, and $r_{ij} = None$ otherwise.
$\lambda$ is an adaptive parameter that controls the selection, and $u(\lambda)$ and $l(\lambda)$ are the thresholds for selecting similar and dissimilar labeled samples, respectively.
Samples with $None$ are excluded from training.
Curriculum learning is applied: samples that are easy to cluster (with high likelihood) are selected first so that rough cluster patterns are found.
As clustering progresses, the trained ConvNet extracts more effective label features, and more samples are gradually added to find more refined cluster patterns.
To this end $\lambda$ gradually increases, so $u(\lambda)$ decreases and $l(\lambda)$ increases, and the two become equal once all samples are used for training.
If $v_{ij}$ is 1, the sample $(x_i,x_j,r_{ij})$ is selected; if it is 0, the sample is not selected.
$u(\lambda)-l(\lambda)$ is a penalty term, and as it decreases, more samples are selected until all samples are used for training.
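
The selection rule described above can be sketched as a small function. The function name `pair_label` and the threshold values used in the example are assumptions for illustration; the label features are assumed to be unit vectors, so the dot product equals the cosine similarity.

```python
def pair_label(l_i, l_j, upper, lower):
    """r_ij: 1 (similar), 0 (dissimilar), or None (pair skipped for now)."""
    sim = sum(a * b for a, b in zip(l_i, l_j))  # dot product of unit vectors
    if sim >= upper:
        return 1
    if sim < lower:
        return 0
    return None

# Early in training the thresholds are far apart, so only easy pairs get a label.
early = pair_label([0.8, 0.6], [1.0, 0.0], upper=0.95, lower=0.45)
# Later the thresholds meet, so every pair is labeled.
late = pair_label([0.8, 0.6], [1.0, 0.0], upper=0.7, lower=0.7)
```

In the first call the similarity of 0.8 falls between the two thresholds, so the pair is left out of training; in the second call the same pair is labeled as similar.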

Deep Adaptive Clustering Algorithm

1. Adaptive Learning

Adaptive learning (alternating iterative optimization) is used to optimize the objective function $E$.
To implement the constraint on the label features $l_i$, a constraint layer is added to the mapping function $f$.
$L^{in}$ and $L^{out}$ are the input and output of the constraint layer.
In $L^{out}$, constraint (9a) maps every element into [0, 1], and (9b) restricts the result to a unit vector.
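
A minimal sketch of such a constraint layer for a single label feature is given below. The specific form (exponentiating after subtracting the maximum, then dividing by the $L_2$ norm) is my reading of steps (9a) and (9b) and should be treated as an assumption rather than the paper's exact definition; the function name `constraint_layer` is hypothetical.

```python
import math

def constraint_layer(l_in):
    """Map raw outputs to a non-negative vector with unit L2 norm."""
    m = max(l_in)
    tmp = [math.exp(v - m) for v in l_in]       # (9a): every element lies in (0, 1]
    norm = math.sqrt(sum(t * t for t in tmp))
    return [t / norm for t in tmp]              # (9b): unit L2 norm

l_out = constraint_layer([2.0, 0.0, -1.0])
cluster = max(range(len(l_out)), key=lambda h: l_out[h])  # largest response
```

Because the output is non-negative with unit norm, the dot product of two outputs is their cosine similarity, and the index of the largest response gives the cluster.
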
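The parameters $\lambda$ and $w$ are optimized alternately.
- When $\lambda$ is fixed (so $v$ and $r$ are determined),
  - the problem becomes a supervised one.
  - $v$ and $r$ require $O(n^2)$ memory, so training has to be done in batches.
- When $w$ is fixed,
  - $\lambda$ is updated by gradient descent.
  - $\lambda$ increases as samples are gradually added to training.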

2. Label Inference for Image Clustering

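The constraint drives the label features toward one-hot vectors, so each image is assigned to the cluster with the largest response.

To summarize the DAC algorithm:

It is optimized with the Adaptive learning algorithm.
It repeatedly selects samples with the ConvNet fixed and then trains on the selected samples.
The algorithm converges once all samples are used for training.
Each image is clustered according to the largest response of its label feature.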

Experiments

1. Image Clustering

In Table 2, DAC achieves the best performance on all three evaluation metrics: NMI, ARI, and ACC.
Representation-based clustering (AE, AEVB) outperforms traditional methods (K-means, SC). => Representation learning plays an important role in image clustering.
Single-stage (end-to-end) methods enable better representation learning.
The proposed model works well on complex, large-scale image datasets.
To verify the effect of adaptive learning, DAC is compared with DAC* (without adaptive learning). => In DAC*, samples are selected arbitrarily, so noisy samples are used. => Unlike DAC*, DAC starts from rough cluster patterns found with reliable samples and refines them, which improves the final clustering performance.

Table 3 evaluates the clustering tactics.
Applying traditional methods to the label features learned by DAC performs better than existing methods.
DAC is also simpler, because it only needs to find the largest response in the label feature.

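Images that belong to the same cluster are strongly activated on the same label feature element.
The proposed method learns high-level features rather than simple combinations of visual features. => It can separate even complex images such as airliner and airship.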

Figure 4 examines the effect of the clustering constraint.
The elements of the learned label features are counted in 4 intervals.
In the initial stage, most elements of the label features lie in [0,0.1) and [0.1, 0.5); in the final stage, they lie in [0,0.1) and [0.9,1].
This shows that the learned label features are sparse and concentrated near 0 or 1, which matches the goal of learning one-hot vectors for clustering.

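The proposed method is more robust to imbalance than the other models.

Performance generally degrades as the number of clusters increases, because more uncertainty is introduced.
Compared with other methods, DAC shows that it can handle a varying number of clusters well.
Figure 7 examines the effect of the number of samples.
The more complex the data, the more the number of samples matters, and the performance of the proposed model rises faster as samples are added.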

Conclusion

The paper proposes a single-stage, ConvNet-based method.
It proposes binary pairwise similarity classification under a clustering constraint.
The proposed method is shown to handle complex, large-scale images.

🙆🏻‍♂️ After reading the paper..

I was impressed that unsupervised clustering was solved end-to-end.
Whereas DEC is multi-stage, DAC is single-stage and performs better.
Could a more flexible model be built by treating similarity as a regression problem rather than a binary problem? 🤷🏻‍♂️
Performance is not good enough in the experiments other than MNIST; what would it take to fix that? 🙋🏻‍♂️

[Paper Review] DEC: Unsupervised Deep Embedding for Clustering Analysis

Thu, 02 Jun 2022 05:45:38 GMT

📝 Reference site: 거꾸로 읽는 self-supervised-learning Part 1

🔗 Paper link: Unsupervised Deep Embedding for Clustering Analysis

Abstract

Clustering is central to many data-driven application domains and has been studied extensively in terms of distance functions and grouping algorithms. Relatively little work has focused on learning representations for clustering. In this paper, we propose Deep Embedded Clustering (DEC), a method that simultaneously learns feature representations and cluster assignments using deep neural networks. DEC learns a mapping from the data space to a lower-dimensional feature space in which it iteratively optimizes a clustering objective. Our experimental evaluations on image and text corpora show significant improvement over state-of-the-art methods.

Introduction

1. Clustering

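Clustering, which is central to data analysis and visualization, has been studied from many angles: the definition of a cluster, the right distance metric, efficient grouping, cluster validation, and so on.
Comparatively little work has focused on unsupervised learning of the feature space for clustering.
Distance (or dissimilarity) is a core concept in clustering.
Distance depends on how well the data is represented in the feature space, so choosing an appropriate feature space is important.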

2. Feature Representation and Cluster Assignments

Unlike existing methods that cluster in the raw data space or in a shallow linear embedded space, the paper proposes DEC (Deep Embedded Clustering), which learns the clustering mapping with SGD (stochastic gradient descent) and backpropagation.
To solve feature representation and cluster assignment simultaneously in an unsupervised way, it iteratively refines the clusters using an auxiliary target distribution derived from the current soft cluster assignment.
This improves not only the clustering but also the feature representation.

3. Contributions

DEC is less sensitive to the choice of hyperparameters, which makes it robust when applied to real data.
It jointly optimizes the deep embedding and the clustering.
It iteratively refines clusters through soft assignment.
It achieves state-of-the-art clustering accuracy and speed.

Related Work

1. K-means and Spectral Clustering

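K-means and GMMs (Gaussian mixture models) are fast and applicable to a wide range of problems. However, their distance metrics are limited to the raw data space, and they are ineffective in high dimensions.
To handle high-dimensional input spaces, methods were proposed that reduce the data to a lower-dimensional space maximizing inter-cluster variance, but they rely on linear embeddings and are therefore limited on complex data.
Spectral clustering allows flexible distance metrics, and a variant that replaces eigenvalue decomposition with a deep autoencoder improves performance further. However, computing the full graph Laplacian matrix increases memory consumption and makes it slow.

2. Parametric t-SNE

Parametric t-SNE uses a deep neural network to parametrize the embedding. t-SNE itself has O(n²) time complexity, which can be approximated in O(n log n).
Inspired by parametric t-SNE, DEC defines a centroid-based probability distribution and minimizes its KLD (Kullback-Leibler divergence) to an auxiliary target distribution, improving cluster assignment and feature representation at the same time.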

Deep Embedded Clustering

- Data space $X$: $\{x_i \in X\}_{i=1}^n$
- Latent feature space $Z$
- Nonlinear mapping $f_\theta$: $X \to Z$
- Centroids of the $k$ clusters: $\{u_j \in Z\}_{j=1}^k$

The proposed algorithm, DEC, simultaneously learns the $k$ cluster centers in the feature space $Z$ and the parameters $\theta$.
DEC (1) initializes the parameters with a deep autoencoder, and (2) optimizes them by repeatedly minimizing the KLD to an auxiliary target distribution.

1. Parameter initialization

An autoencoder is trained for each layer, and the layers are then chained into an SAE (stacked autoencoder).
Each layer is a denoising autoencoder, trained with greedy layer-wise training.
After the SAE is trained to minimize reconstruction loss, its encoder is used as the initial network mapping between DEC's data space and feature space (the nonlinear mapping $f_\theta$).
The initial cluster centers are obtained by running k-means in the feature space $Z$.

2. Clustering with KL divergence

Given the nonlinear mapping $f_\theta$ and the initial cluster centers $u_j$ from the previous step, unsupervised clustering alternates between two steps: (1) compute the soft assignment between the embedded points and the cluster centroids, and (2) update $f_\theta$ and refine the cluster centroids.
2.1 Soft Assignment
The similarity between an embedded point $z_i$ and a centroid $u_j$ is measured with the Student's t-distribution kernel used in t-SNE.
$q_{ij}$ can be interpreted as the probability of assigning sample $i$ to cluster $j$ (i.e., a soft assignment).
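
As a rough illustration of the soft assignment, the sketch below computes $q_{ij}$ with a Student's t kernel, $(1 + \lVert z_i - u_j \rVert^2 / \alpha)^{-(\alpha+1)/2}$, normalized over the centroids. The function name `soft_assign`, the toy points, and the default `alpha=1.0` are assumptions made for this example, not code from the paper.

```python
def soft_assign(points, centroids, alpha=1.0):
    """Return q[i][j]: Student's t similarity of point i to centroid j,
    normalized so that each row sums to 1."""
    q = []
    for z in points:
        row = []
        for u in centroids:
            sq_dist = sum((a - b) ** 2 for a, b in zip(z, u))
            # Unnormalized kernel: (1 + ||z - u||^2 / alpha) ** (-(alpha + 1) / 2)
            row.append((1.0 + sq_dist / alpha) ** (-(alpha + 1.0) / 2.0))
        total = sum(row)
        q.append([v / total for v in row])
    return q


# Toy embedded points and centroids (hypothetical values).
points = [[0.0, 0.0], [1.0, 0.0], [4.0, 4.0]]
centroids = [[0.0, 0.0], [4.0, 4.0]]
q = soft_assign(points, centroids)
```

Each row of `q` is a probability distribution over clusters, so a point sitting on a centroid gets most of its mass on that cluster.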

2.2 KL Divergence Minimization

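An auxiliary target distribution $p_i$ is used to push the soft assignments toward high confidence.
In the experiments, $p_i$ is computed from $q_i$ with three goals: (1) strengthen predictions (improve cluster purity), (2) put more emphasis on high-confidence assignments, and (3) normalize the loss contribution of each centroid.
$f_j = \sum_i q_{ij}$ is the soft cluster frequency (the soft count of points in cluster $j$) and is used for normalization.
The final loss is the KLD between $p_i$ and $q_i$, minimized so that the two distributions match.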
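
The target distribution and loss described above can be sketched as follows, assuming the form $p_{ij} \propto q_{ij}^2 / f_j$ with $f_j = \sum_i q_{ij}$ and $L = \sum_i \sum_j p_{ij} \log (p_{ij} / q_{ij})$. The function names and the toy `q` matrix are hypothetical and only meant to show the sharpening effect.

```python
import math


def target_distribution(q):
    """Sharpen soft assignments: p_ij is proportional to q_ij^2 / f_j."""
    n_clusters = len(q[0])
    # Soft cluster frequency f_j = sum_i q_ij
    f = [sum(row[j] for row in q) for j in range(n_clusters)]
    p = []
    for row in q:
        weights = [row[j] ** 2 / f[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def kl_divergence(p, q):
    """KL(P || Q) summed over all samples and clusters."""
    return sum(
        p_ij * math.log(p_ij / q_ij)
        for p_row, q_row in zip(p, q)
        for p_ij, q_ij in zip(p_row, q_row)
        if p_ij > 0.0
    )


# Toy soft assignments for three samples and two clusters (hypothetical).
q = [[0.7, 0.3], [0.6, 0.4], [0.2, 0.8]]
p = target_distribution(q)
loss = kl_divergence(p, q)
```

In this toy case every row of `p` is more peaked than the matching row of `q`, which is the "high confidence" behaviour the notes describe.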

2.3 Optimization

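The cluster centers $u_j$ and the parameters $\theta$ are learned with SGD (stochastic gradient descent).
The gradients of $L$ are computed with respect to each embedded point $z_i$ and each cluster center $u_j$; the gradients with respect to $z_i$ are backpropagated through the network to update $\theta$.
Training stops when fewer than 0.1% of the points change cluster assignment between two consecutive iterations.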
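
The stopping rule can be checked with a few lines of code. This is only a sketch: `hard_assignments`, `should_stop`, and the `tol=0.001` default (0.1%) are names and values chosen for this example, and the labels are made-up data.

```python
def hard_assignments(q):
    """Pick the most likely cluster for each sample from soft assignments."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def should_stop(prev_labels, new_labels, tol=0.001):
    """Stop when the fraction of samples that changed cluster is below tol."""
    changed = sum(1 for a, b in zip(prev_labels, new_labels) if a != b)
    return changed / len(new_labels) < tol


# 2000 samples; a single sample switches cluster between iterations (0.05%).
prev_labels = [0] * 1000 + [1] * 1000
new_labels = list(prev_labels)
new_labels[0] = 1
stop = should_stop(prev_labels, new_labels)
```

With three changed samples out of 2000 (0.15%) the same check would return `False`, so training would continue.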

Experiments

1. Dataset

Images: MNIST (10 classes), STL-10 (10 classes)
Text: REUTERS (4 categories)

2. Evaluation Metric

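The number of clusters is set to the number of ground-truth categories, and unsupervised clustering accuracy (ACC) is computed.
$l_i$ is the ground-truth label, $c_i$ is the cluster assignment, and $m$ ranges over all possible one-to-one mappings between clusters and labels.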
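
A minimal sketch of the metric, assuming ACC is the best fraction of samples with $l_i = m(c_i)$ over all one-to-one mappings $m$. For clarity it brute-forces the mappings with `itertools.permutations`, which is only practical for a handful of clusters; the paper computes the best mapping with the Hungarian algorithm. The function name and toy data are hypothetical.

```python
from itertools import permutations


def clustering_accuracy(labels, clusters):
    """Best accuracy over one-to-one mappings from cluster ids to label ids.

    Brute force; assumes there are no more clusters than labels.
    """
    label_ids = sorted(set(labels))
    cluster_ids = sorted(set(clusters))
    best = 0
    for perm in permutations(label_ids, len(cluster_ids)):
        mapping = dict(zip(cluster_ids, perm))
        correct = sum(1 for l, c in zip(labels, clusters) if mapping[c] == l)
        best = max(best, correct)
    return best / len(labels)


# Cluster ids are arbitrary, so they need not match the label ids.
labels = [0, 0, 1, 1, 2, 2]
clusters = [2, 2, 0, 0, 1, 0]
acc = clustering_accuracy(labels, clusters)
```

Here the best mapping sends cluster 2 to label 0, cluster 0 to label 1, and cluster 1 to label 2, leaving one sample misassigned.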

Results

Table 2 reports the clustering accuracy (ACC) on each dataset; DEC performs best.
DEC w/o is the result of training with $f_\theta$ frozen, to check the benefit of end-to-end training.

Figure 2 shows how sensitive each model's performance is to hyperparameters.
DEC performs consistently regardless of the hyperparameter setting.

Figure 3 shows one cluster per row, with samples sorted left to right by increasing distance from the cluster center.
The model has trouble separating 4 and 9.

Discussion

Table 3 compares DEC against using the autoencoder alone.
The deep embedding fine-tuned with KLD performs better than the autoencoder alone.

Figure 4 examines the relationship between the gradient and the soft cluster assignment $q_{ij}$.
Points closer to the cluster center (larger $q_{ij}$) contribute more to the gradient.
Points with smaller $q_{ij}$ are more ambiguous; for example, a 5 is mistaken for an 8.

Table 4 compares performance under data imbalance; DEC is more robust to imbalance than the other models.

Two metrics are proposed for choosing the optimal number of clusters: (1) NMI (normalized mutual information) and (2) generalizability.
$I$ is the mutual information, $H$ is the entropy, and $G$ is the ratio of training loss to validation loss.
The optimal number of clusters comes out as 9 because the labels 9 and 4 look similar.
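
Both metrics are simple to compute. The sketch below assumes $\mathrm{NMI}(l, c) = I(l, c) / \tfrac{1}{2}[H(l) + H(c)]$ and $G = L_{train} / L_{validation}$; the function names and the toy labels are chosen for this example only.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in nats) of a list of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in Counter(values).values())


def mutual_information(labels, clusters):
    """Mutual information between ground-truth labels and cluster ids."""
    n = len(labels)
    joint = Counter(zip(labels, clusters))
    label_counts = Counter(labels)
    cluster_counts = Counter(clusters)
    return sum(
        (n_lc / n) * math.log(n_lc * n / (label_counts[l] * cluster_counts[c]))
        for (l, c), n_lc in joint.items()
    )


def nmi(labels, clusters):
    """NMI = I(l, c) / (0.5 * (H(l) + H(c)))."""
    denom = 0.5 * (entropy(labels) + entropy(clusters))
    return mutual_information(labels, clusters) / denom if denom > 0 else 0.0


def generalizability(train_loss, validation_loss):
    """G = L_train / L_validation; a small G suggests overfitting."""
    return train_loss / validation_loss


labels = [0, 0, 1, 1]
score_same = nmi(labels, [1, 1, 0, 0])   # same partition, renamed ids
score_indep = nmi(labels, [0, 1, 0, 1])  # partition independent of labels
```

NMI ignores how cluster ids are named, so a relabeled but identical partition scores 1, while a partition that carries no information about the labels scores 0.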

Conclusion

DEC jointly optimizes clustering and the feature space.
It iteratively optimizes a KLD objective against a self-training target distribution.
The proposed model can be viewed as an unsupervised extension of semi-supervised self-training (i.e., self-supervised learning).
It learns representations specialized for clustering without ground truth.
It offers improved performance and robustness across hyperparameter settings, which matters because cross-validation is not possible in the unsupervised setting.
Its complexity is linear in the number of data points, so it scales to large datasets.

🙆🏻‍♂️ After reading the paper..

An unsupervised clustering method that can be applied to real data is impressive.
If hyperparameters could be tuned through validation, could the model get even better?
Why is $p_{ij}$ a valid choice of target distribution? 🙋🏻‍♂️
What if the focus were on representations that describe the data well, rather than on a model for clustering, so as to improve supervised learning performance? 🤷🏻‍♂️
