<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>latent-cho.log</title>
        <link>https://velog.io/</link>
        <description>Welcome to my latent space!</description>
        <lastBuildDate>Mon, 29 Aug 2022 10:47:05 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <copyright>Copyright (C) 2019. latent-cho.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/latent-cho" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[PR | 22-08] Emerging Properties in Self-Supervised Vision Transformers (DINO)]]></title>
            <link>https://velog.io/@latent-cho/PR-22-08-DINO</link>
            <guid>https://velog.io/@latent-cho/PR-22-08-DINO</guid>
            <pubDate>Mon, 29 Aug 2022 10:47:05 GMT</pubDate>
            <description><![CDATA[<h1 id="0-abstract">0. Abstract</h1>
<ol>
<li>Self-supervised ViT features contain explicit information about the semantic segmentation of an image, which does not emerge as clearly with supervised ViTs, nor with convnets. </li>
<li>These features are also excellent k-NN classifiers, reaching 78.3% top-1 on ImageNet with a small ViT.</li>
<li>Our study also underlines the importance of momentum encoder, multi-crop training, and the use of small patches with ViTs. 
<img src="https://velog.velcdn.com/images/latent-cho/post/9c3df80d-5f20-4d8b-8348-13101bea2541/image.png" alt=""></li>
</ol>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>The self-supervised pretraining objectives used by BERT and GPT in NLP provide a <strong>richer learning signal</strong> than a supervised objective that predicts a single label per sentence. </li>
<li>Likewise for images, image-level supervision restricts the rich visual information by forcing the model to choose among predefined categories. </li>
</ul>
<hr>
<blockquote>
<h3 id="contribution">Contribution</h3>
</blockquote>
<ol>
<li>Self-supervised ViT features explicitly contain the scene layout, such as object boundaries. This can be read off directly from the self-attention modules of the last block.</li>
<li>Self-supervised ViT features achieve 78.3% top-1 accuracy on ImageNet with a basic k-NN classifier, without any additional fine-tuning, linear classifier, or data augmentation. </li>
</ol>
<ul>
<li>The segmentation masks in contribution 1 appear to be a property of the self-supervised training itself, while contribution 2 appears to come from combining a momentum encoder, multi-crop augmentation, and smaller patches. </li>
</ul>
<h1 id="2-related-work">2. Related work</h1>
<h3 id="self-supervised-learning">Self-supervised learning</h3>
<h4 id="1-instance-classification">(1) Instance classification</h4>
<ul>
<li><strong>how</strong>: considers each image a distinct class and trains the model by discriminating between them up to data augmentations.</li>
<li><strong>caveat</strong>:  explicitly  learning  a  classifier to discriminate  between all images does not scale well with the number of images.  </li>
</ul>
<h4 id="2-noise-contrastive-estimator-nce">(2) Noise contrastive estimator (NCE)</h4>
<ul>
<li><strong>how</strong>: compare instances instead of classifying them</li>
<li><strong>caveat</strong>: it requires comparing features from a large  number of images simultaneously.  In practice, this requires large batches or memory banks </li>
</ul>
<h4 id="3-wo-discriminating-between-images">(3) W/O discriminating between images</h4>
<ul>
<li><strong>ex-BYOL</strong>: features are trained by matching them to representations obtained with a momentum encoder</li>
</ul>
<h1 id="3-approach">3. Approach</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/e54c0df1-a20f-42ee-95ce-2d58f76341f8/image.png" alt=""></p>
<h3 id="notation">Notation</h3>
<ul>
<li>student network $g_{\theta_s}$ parameterized by $\theta_s$</li>
<li>teacher network $g_{\theta_t}$ parameterized by $\theta_t$</li>
<li>input image: $x$</li>
<li>Each model outputs a probability distribution over $K$ dimensions, denoted $P_s$ and $P_t$ ($P$ is the output of a softmax)
<img src="https://velog.velcdn.com/images/latent-cho/post/305b00d7-f188-47a5-9a24-ed407f6c450a/image.png" alt=""></li>
<li>$\tau$: temperature parameter that controls the <strong>sharpness</strong> of the output distribution</li>
</ul>
<p>-&gt; The softmax here appears to be used for its normalization effect.</p>
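<p>A minimal numpy sketch of the temperature-scaled softmax above (the function name and example values are my own, not from the paper):</p>

```python
import numpy as np

def softmax_with_temperature(logits, tau):
    """Temperature-scaled softmax: lower tau -> sharper distribution."""
    z = logits / tau
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.0])
sharp = softmax_with_temperature(logits, tau=0.1)    # close to one-hot
smooth = softmax_with_temperature(logits, tau=10.0)  # close to uniform
```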
<h3 id="1-loss-cross-entropy">(1) Loss: cross-entropy</h3>
<p>Given a fixed teacher network $g_{\theta_t}$, we learn to match these distributions by <strong>minimizing the cross-entropy loss w.r.t.  the parameters of the student network $g_{\theta_s}$</strong>
<img src="https://velog.velcdn.com/images/latent-cho/post/972ec556-602d-41f0-be0a-e1d4afac2f29/image.png" alt=""></p>
<p>Unusually, they use a cross-entropy loss here.</p>
<h4 id="-loss-ce-vs-mse-vs-ince">* loss: CE vs MSE vs INCE</h4>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/2fb242c5-ab6d-4459-9cf8-e6270d38c7ed/image.png" alt=""></p>
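<p>A rough sketch of the DINO loss under the notation above, assuming raw logits as inputs; the default temperatures are my choice ($\tau_s=0.1$, $\tau_t=0.04$ roughly match the paper's settings):</p>

```python
import numpy as np

def softmax(z, tau):
    z = z / tau
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def dino_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.04):
    """Cross-entropy H(P_t, P_s) = -sum_k P_t[k] * log P_s[k].

    The teacher output is treated as a fixed target; in the real
    implementation no gradient flows through the teacher."""
    p_s = softmax(student_logits, tau_s)
    p_t = softmax(teacher_logits, tau_t)  # sharper targets: tau_t < tau_s
    return -(p_t * np.log(p_s)).sum(axis=-1).mean()
```

Note the teacher uses a lower temperature, so its targets are sharper than the student's predictions.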
<h3 id="2-augementation-multi-crop">(2) Augmentation: multi-crop</h3>
<p>It is worth pausing on how the $x$ that enters the cross-entropy is augmented.</p>
<ul>
<li>Multi-crop into global views at a large resolution (224x224) and local views at a small resolution (96x96)</li>
<li>All crops are passed through the student, while only the global views are passed through the teacher,</li>
<li>so that the model is pushed to learn <strong>“local-to-global” correspondences</strong>.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/bec4c58a-26e5-40c4-a595-5217a5ecf1e1/image.png" alt=""></p>
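<p>The teacher/student view pairing can be sketched as follows, assuming the standard setup of 2 global + 6 local crops (the view names are placeholders of my own):</p>

```python
def multicrop_pairs(n_global=2, n_local=6):
    """Enumerate (teacher_view, student_view) pairs entering the loss:
    the teacher sees only global views, the student sees every view,
    and a view is never compared with itself."""
    global_views = [f"g{i}" for i in range(n_global)]
    local_views = [f"l{i}" for i in range(n_local)]
    pairs = []
    for t_view in global_views:                    # teacher: global only
        for s_view in global_views + local_views:  # student: all views
            if s_view != t_view:
                pairs.append((t_view, s_view))
    return pairs
```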
<h3 id="3-teacher-network-ema">(3) Teacher network: EMA</h3>
<p>$\theta_t \leftarrow \lambda\theta_t + (1-\lambda)\theta_s$</p>
<ul>
<li>Using an exponential moving average (EMA) on the student weights, i.e., a momentum encoder, is particularly well suited for our framework. </li>
<li>We observe that this teacher has better performance than the student throughout the training, and hence, guides the training of the student by providing target features of higher quality.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/5cefe396-45c7-4c24-ab68-9336eb75a54d/image.png" alt=""></p>
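<p>The EMA update itself is a one-liner; a sketch over a flat list of parameters (in the paper, $\lambda$ follows a cosine schedule from 0.996 to 1 during training):</p>

```python
def ema_update(teacher_params, student_params, lam=0.996):
    """theta_t <- lam * theta_t + (1 - lam) * theta_s, element-wise.
    lam close to 1 makes the teacher a slow-moving average of the student."""
    return [lam * t + (1.0 - lam) * s
            for t, s in zip(teacher_params, student_params)]
```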
<h3 id="4-avoiding-collapse-centering--sharpening">(4) Avoiding collapse: centering &amp; sharpening</h3>
<ul>
<li>It can also work with only a centering and sharpening of the momentum teacher outputs to avoid model collapse.</li>
<li>Centering prevents one dimension from dominating but encourages collapse to the uniform distribution,</li>
<li>while sharpening has the opposite effect
<img src="https://velog.velcdn.com/images/latent-cho/post/8b36b081-8839-471a-a831-f1ae9121d358/image.png" alt=""></li>
</ul>
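<p>A numpy sketch of how centering and sharpening could be applied to the teacher logits (the function names and the momentum value $m=0.9$ are my own; the paper updates the center as an EMA over batch means):</p>

```python
import numpy as np

def teacher_output(logits, center, tau_t=0.04):
    """Teacher softmax with centering (subtract the running center)
    and sharpening (low temperature tau_t)."""
    z = (logits - center) / tau_t
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def update_center(center, teacher_batch_logits, m=0.9):
    """EMA of the center over teacher logits in the batch:
    c <- m * c + (1 - m) * mean(logits)."""
    return m * center + (1.0 - m) * teacher_batch_logits.mean(axis=0)
```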
<h3 id="implementation">Implementation</h3>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/0e26d891-98ac-45d2-bcc3-ac379edef01e/image.png" alt=""></p>
<h1 id="4-results">4. Results</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/560e03c9-aa90-4d40-bf39-d43aff1514ce/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/c1e4a32c-5505-4e09-986a-d2d01b7189ae/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[PR | 22-08] data2vec: A General Framework for Self-supervised Learning in Speech, Vision and Language]]></title>
            <link>https://velog.io/@latent-cho/PR-22-08-data2vec</link>
            <guid>https://velog.io/@latent-cho/PR-22-08-data2vec</guid>
            <pubDate>Mon, 29 Aug 2022 10:00:28 GMT</pubDate>
            <description><![CDATA[<h1 id="0-abstract">0. Abstract</h1>
<p>The general idea behind self-supervised learning can be applied to any modality, but in practice the algorithms and objectives have differed per modality. This paper proposes data2vec, a single framework in which the same learning method works for speech, NLP, and CV. </p>
<ul>
<li>The core idea is to predict latent representations of the full input data based on a masked view of the input in a self-distillation setup.</li>
<li>Instead of predicting modality-specific targets such as words, visual tokens or units of human speech which are local in nature, data2vec <strong>predicts contextualized latent representations that contain information from the entire input</strong>.</li>
</ul>
<h1 id="2-related-work">2. Related work</h1>
<blockquote>
<p>Like BYOL and DINO, data2vec is a student-teacher model, but one with a masked prediction objective, where the target representations are contextualized. </p>
</blockquote>
<p>-&gt; My own one-line summary of data2vec!</p>
<ol>
<li><p>Differences from models that use a momentum encoder (BYOL, DINO):
(1) use a <strong>masked prediction task</strong>
(2) regress <strong>multiple neural network layer representations</strong> instead of just the top layer which we find to be more effective.
(3) Moreover, data2vec <strong>works for multiple modalities</strong>.</p>
</li>
<li><p>Vision  Transformers with <strong>masked prediction objectives</strong> predict visual tokens or the input pixels. Instead:
(1) data2vec predicts the latent representations of the input data.
(2) The latent target representations are <strong>contextualized</strong>, incorporating relevant features from the entire image <strong>instead of targets which contain information isolated to the current patch, such as visual tokens or pixels</strong></p>
</li>
</ol>
<h1 id="3-method">3. Method</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/0f424ad7-5450-445c-a68a-8f19cf55c53c/image.png" alt=""></p>
<p>Training is not contrastive; it follows a masked auto-encoding scheme whose target is the teacher's output, which, being a Transformer output, is a contextualized representation. </p>
<p>To summarize the training procedure: the student is trained with a smooth L1 loss between its representation of the masked input and the teacher's representation of the unmasked input, while the teacher is updated as an EMA of the student's parameters.</p>
<h3 id="model-architecture">Model architecture</h3>
<p>A Transformer architecture is used for every modality, but the input data must still be encoded with the existing modality-specific methods:</p>
<ul>
<li>Computer vision: ViT-strategy of encoding an image</li>
<li>Speech: Multi-layer 1D CNN</li>
<li>Text: sub-word units &amp; a learned embedding lookup table</li>
</ul>
<h3 id="masking">Masking</h3>
<ul>
<li>Computer vision: block-wise masking strategy (BEiT)</li>
<li>Speech: mask spans of latent</li>
<li>Text: mask tokens</li>
</ul>
<h3 id="training-targets">Training targets</h3>
<ul>
<li><p>The model looks at the encoding of the masked sample and predicts the original unmasked representation.</p>
</li>
<li><p>The predicted representation is contextualized: through self-attention it can carry information from the entire sample, not just a single time-step. (This is an important difference to BERT, wav2vec 2.0, BEiT, and MAE, which predict targets lacking contextual information.)</p>
</li>
<li><p>Teacher parameterization: EMA</p>
</li>
<li><p>Targets</p>
<ul>
<li>Apply a normalization to each block to obtain $\hat{a}_t^l$ </li>
<li>Average the top $K$ blocks, $y_t = \frac{1}{K}\sum_{l=L-K+1}^{L} \hat{a}_t^l$, for time-step $t$<br><img src="https://velog.velcdn.com/images/latent-cho/post/803bf838-7c2a-41ed-b1d4-0cccf3b24a40/image.png" alt=""></li>
</ul>
</li>
</ul>
<ul>
<li>Normalizing the targets helps
(1) prevent the model from collapsing into a constant representation for all time-steps
(2) prevent layers with a high norm from dominating the target features.</li>
<li>Computer vision: We use the same modified image both in teacher mode and student mode.</li>
</ul>
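<p>A sketch of the target construction above, under assumptions of my own (a simple per-time-step normalization instead of the paper's modality-specific choice, e.g. instance norm for speech):</p>

```python
import numpy as np

def build_targets(layer_outputs, K):
    """data2vec targets: normalize each of the top-K block outputs,
    then average them per time-step:
        y_t = (1/K) * sum_{l=L-K+1}^{L} a_hat_t^l
    layer_outputs: list of L arrays of shape (T, D), one per block."""
    def normalize(a):
        mu = a.mean(axis=-1, keepdims=True)
        sd = a.std(axis=-1, keepdims=True)
        return (a - mu) / (sd + 1e-6)
    top_k = layer_outputs[-K:]  # blocks L-K+1 .. L
    return np.mean([normalize(a) for a in top_k], axis=0)
```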
<h3 id="objective-smooth-l1-loss">Objective: Smooth L1 loss</h3>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/47339202-3f81-4578-b7fb-8b47e38f545f/image.png" alt=""></p>
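<p>A minimal sketch of the smooth L1 objective, with $\beta$ controlling the squared-to-linear transition (the paper tunes $\beta$ per modality; the default here is my own):</p>

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth L1: quadratic near zero, linear for large errors.
    loss = 0.5 * d^2 / beta   if |d| <= beta
         = |d| - 0.5 * beta   otherwise
    The linear tail makes the loss less sensitive to outlier targets."""
    d = np.abs(pred - target)
    return np.where(d <= beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()
```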
<h1 id="5-results">5. Results</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/7bf9bb31-a96f-460c-b786-2e75885eac31/image.png" alt="">
<img src="https://velog.velcdn.com/images/latent-cho/post/a31f631e-f449-4fae-9db4-e2dac3470266/image.png" alt="">
<img src="https://velog.velcdn.com/images/latent-cho/post/445d782b-e72d-4d2b-94ee-ee0271726926/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[PR | 22-07] Self-supervised Learning: Generative or Contrastive]]></title>
            <link>https://velog.io/@latent-cho/22-07-Self-supervised-Learning-Generative-or-Contrastive</link>
            <guid>https://velog.io/@latent-cho/22-07-Self-supervised-Learning-Generative-or-Contrastive</guid>
            <pubDate>Mon, 15 Aug 2022 11:33:24 GMT</pubDate>
            <description><![CDATA[<p>Self-supervised learning (SSL) 을 공부하기 앞서, 큰 그림을 보려고 survey 논문을 가볍게 스키밍했다. SSL 에 크게 관심이 있지는 않았는데 친구들이랑 같이 공부하는게 좋아서 시작했다. 시작하고보니 연구 주제도 떠오르고, 잘한 것 같다.</p>
<h1 id="1-introduction">1. Introduction</h1>
<p>If SSL is defined in one sentence, it is &quot;predicting part of the input&quot;. </p>
<blockquote>
<p>In the invited speech at AAAI 2020, the Turing Award winner Yann LeCun described self-supervised learning as ”the machine predicts any parts of its input for any observed part”.</p>
</blockquote>
<h1 id="2-motivation-of-ssl">2. Motivation of SSL</h1>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/71324edf-8940-4334-8f8e-139d20c891c4/image.png" alt=""></p>
<p>Two major axes will be covered going forward. </p>
<p><strong>Contrastive</strong>: MoCo and SimCLR probably come to mind first when thinking of SSL; they belong to the contrastive learning branch.
<strong>Generative</strong>: think of the BERT (AE) and GPT (AR) styles in NLP. Vision also seems to be moving from contrastive learning in this direction; I will review these one by one. </p>
<p><img src="https://velog.velcdn.com/images/latent-cho/post/abbcfd32-e612-4402-8dd0-7f65809db553/image.png" alt=""></p>
<p><strong>Contrastive vs Generative (vs Generative-Contrastive)</strong>
The examples alone ({MoCo, SimCLR} vs {BERT, GPT}) should already give a sense of how contrastive and generative differ, but formulating the differences makes them clearer and more intuitive.</p>
<table>
<thead>
<tr>
<th align="center"></th>
<th align="center">z</th>
<th align="center">discriminator</th>
<th align="center">objective</th>
</tr>
</thead>
<tbody><tr>
<td align="center"><strong>Generative</strong></td>
<td align="center">explicit</td>
<td align="center">X</td>
<td align="center">reconstruction loss</td>
</tr>
<tr>
<td align="center"><strong>Contrastive</strong></td>
<td align="center">explicit</td>
<td align="center">O</td>
<td align="center">contrastive similarity metric</td>
</tr>
<tr>
<td align="center"><strong>Generative-Contrastive</strong></td>
<td align="center">implicit</td>
<td align="center">O</td>
<td align="center">distributional divergence</td>
</tr>
</tbody></table>
<blockquote>
<ul>
<li>A properly designed training objective related to down-stream tasks could turn our randomly initialized models into excellent pre-trained feature extractors.<ul>
<li>ex) contrastive learning - classification tasks</li>
</ul>
</li>
<li>The art of self-supervised learning primarily lies in defining proper objectives for unlabeled data. </li>
</ul>
</blockquote>
<hr>
<p>The SSL paper reviews will likely proceed in the following order:</p>
<ul>
<li>MoCo</li>
<li>SimCLR</li>
<li>BYOL</li>
<li>SimSiam</li>
<li>DINO</li>
<li>iBOT</li>
<li>BEiT</li>
<li>MAE</li>
<li>data2vec</li>
</ul>
<hr>
<h1 id="-1-reference">-1. Reference</h1>
<ul>
<li>This paper: <a href="https://arxiv.org/abs/2006.08218">https://arxiv.org/abs/2006.08218</a></li>
</ul>
]]></description>
        </item>
    </channel>
</rss>