<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>lake_park_0706.log</title>
        <link>https://velog.io/</link>
        <description>호수공원</description>
        <lastBuildDate>Thu, 02 Feb 2023 06:50:56 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>lake_park_0706.log</title>
            <url>https://velog.velcdn.com/images/lake_park_0706/profile/e72f9096-dd0e-44ea-8e7f-9c647037fb18/image.jpg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. lake_park_0706.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/lake_park_0706" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Review] Self-supervised Vision Transformers for Land-cover Segmentation and Classification]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Self-supervised-Vision-Transformers-for-Land-cover-Segmentation-and-Classification</link>
            <guid>https://velog.io/@lake_park_0706/Review-Self-supervised-Vision-Transformers-for-Land-cover-Segmentation-and-Classification</guid>
            <pubDate>Thu, 02 Feb 2023 06:50:56 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>This paper proposes a method that <strong>combines a vision transformer architecture with self-supervised learning</strong>.  </li>
</ul>
<hr>
<h1 id="2-method">2. Method</h1>
<ul>
<li>Overall structure
<img src="https://velog.velcdn.com/images/lake_park_0706/post/7f9709ef-a181-4820-b657-1f8b8714408f/image.png" alt=""></li>
</ul>
<h4 id="21-self-supervised-learning">2.1. Self-supervised learning</h4>
<ul>
<li><img src="https://velog.velcdn.com/images/lake_park_0706/post/ed7b3ec4-2f79-49a1-855c-e86edca0ac75/image.png" alt=""></li>
</ul>
<h4 id="22-swinunet">2.2. SwinUNet</h4>
<ul>
<li>This work proposes two separate SwinUNet streams for the contrastive learning process, as shown in Fig 2; a minimal sketch of such a two-stream contrastive objective follows below. </li>
</ul>
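<p>Below is a minimal sketch of how such a two-stream contrastive objective could look. It assumes two generic encoder networks (stand-ins for the two SwinUNet streams) and an NT-Xent style loss; the names and the loss variant are illustrative, not the authors&#39; code.</p>
<pre><code class="language-python"># Hedged sketch: two-stream contrastive pretraining (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def contrastive_loss(z1, z2, temperature=0.07):
    """z1, z2: (N, D) embeddings of the same N samples from the two streams."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature       # (N, N) pairwise similarities
    labels = torch.arange(z1.size(0))        # the i-th sample in stream 1 matches the i-th in stream 2
    return F.cross_entropy(logits, labels)

# z1 = encoder_stream1(x1); z2 = encoder_stream2(x2)   # e.g. embeddings from two SwinUNet encoders
</code></pre>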
<hr>
<h1 id="3-result">3. Result</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/e623276d-e562-4ed5-a2ca-1c78f1a895a9/image.png" alt=""></p>
<hr>
<blockquote>
<p>Reference</p>
</blockquote>
<ul>
<li><a href="https://openaccess.thecvf.com/content/CVPR2022W/EarthVision/papers/Scheibenreif_Self-Supervised_Vision_Transformers_for_Land-Cover_Segmentation_and_Classification_CVPRW_2022_paper.pdf">https://openaccess.thecvf.com/content/CVPR2022W/EarthVision/papers/Scheibenreif_Self-Supervised_Vision_Transformers_for_Land-Cover_Segmentation_and_Classification_CVPRW_2022_paper.pdf</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Swin Transformer: Hierarchical Vision Transformer using Shifted Windows]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Swin-Transformer-Hierarchical-Vision-Transformer-using-Shifted-Windows</link>
            <guid>https://velog.io/@lake_park_0706/Review-Swin-Transformer-Hierarchical-Vision-Transformer-using-Shifted-Windows</guid>
            <pubDate>Wed, 01 Feb 2023 06:49:15 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>The transformer architecture shows that attention modules can outperform traditional CNN-based models.</li>
<li>However, the <strong>transformer</strong> has the problem that its computational complexity is <strong>quadratic in image size</strong>. This paper therefore proposes the <strong>Swin Transformer, whose complexity is linear in image size</strong>.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/b914ec6e-e051-4e20-be34-e758cfe716df/image.png" alt=""></p>
<hr>
<h1 id="2-method">2. Method</h1>
<h3 id="21-shifted-window-based-self-attention">2.1. Shifted Window based Self-Attention</h3>
<ul>
<li><p>The computational complexity of the transformer comes from global attention.</p>
</li>
<li><p>Self-attention in non-overlapped windows</p>
<ul>
<li>For efficiency, the paper computes <strong>local attention within non-overlapping windows of MxM patches</strong>. This reduces the computational cost, as shown in equation (2).<img src="https://velog.velcdn.com/images/lake_park_0706/post/19463ef2-dae2-48f8-a962-bac9c7e8d4bd/image.png" alt=""></li>
</ul>
</li>
<li><p>Shifted window partitioning in successive blocks</p>
<ul>
<li>Simply partitioning the image into <strong>MxM windows removes connections across windows</strong>. To deal with this, a shifted window partitioning approach is proposed.<img src="https://velog.velcdn.com/images/lake_park_0706/post/71f9b555-77b8-4963-a731-fcde8619c3f3/image.png" alt=""></li>
</ul>
</li>
<li><p>Efficient batch computation for shifted configuration</p>
<ul>
<li>A cyclic shift keeps the computational cost the same as regular window partitioning (3x3 -&gt; 2x2 windows); a small sketch of window partitioning and the cyclic shift follows below.<img src="https://velog.velcdn.com/images/lake_park_0706/post/a1605168-5566-427c-9689-6fc80a116fdd/image.png" alt=""></li>
</ul>
</li>
</ul>
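<p>The following is a small sketch of the two ingredients above: partitioning a feature map into non-overlapping MxM windows, and the cyclic shift used in the shifted configuration. Shapes and the helper name are illustrative assumptions, not the official implementation.</p>
<pre><code class="language-python"># Hedged sketch of window partitioning and the cyclic shift (illustrative only).
import torch

def window_partition(x, M):
    """x: (B, H, W, C); returns (num_windows * B, M, M, C) non-overlapping MxM windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M, M, C)

B, H, W, C, M = 2, 8, 8, 96, 4
x = torch.randn(B, H, W, C)
windows = window_partition(x, M)              # self-attention is computed inside each window

# Shifted configuration: cyclically shift the feature map by (-M//2, -M//2), partition again,
# and use an attention mask so the cost stays the same as the regular partitioning.
shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, M)
</code></pre>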
<hr>
<h1 id="3-result">3. Result</h1>
<ul>
<li><p>Image classification in ImageNet<img src="https://velog.velcdn.com/images/lake_park_0706/post/51d9a345-0a15-41da-8828-7f923cb74e4c/image.png" alt=""></p>
</li>
<li><p>Object detection in COCO<img src="https://velog.velcdn.com/images/lake_park_0706/post/12a8909b-4e71-4461-95d5-31f2eed16c7e/image.png" alt=""></p>
</li>
<li><p>Semantic Segmentation on ADE20K<img src="https://velog.velcdn.com/images/lake_park_0706/post/bd721e00-5c1b-4425-8751-2a04fc1def6b/image.png" alt=""></p>
</li>
</ul>
<hr>
<blockquote>
<p>Reference
<a href="https://arxiv.org/abs/2103.14030v2">https://arxiv.org/abs/2103.14030v2</a></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Stand-Alone Self-Attention in Vision Models]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Stand-Alone-Self-Attention-in-Vision-Models</link>
            <guid>https://velog.io/@lake_park_0706/Review-Stand-Alone-Self-Attention-in-Vision-Models</guid>
            <pubDate>Tue, 31 Jan 2023 05:34:21 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>CNN architectures achieve outstanding performance in computer vision applications. However, <strong>capturing long-range interactions with convolutions is challenging</strong> because of their poor scaling properties.</li>
<li>The long-range interaction problem has been tackled with <strong>attention</strong>, and most prior works use <strong>global attention layers as an add-on to an existing CNN</strong>.</li>
<li>This paper therefore proposes <strong>a local self-attention layer that can be used for both small and large inputs</strong>. </li>
</ul>
<hr>
<h1 id="2-method">2. Method</h1>
<h3 id="21-base-model-of-cnn">2.1. Base model of CNN</h3>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/8d2fb388-41a7-4213-926e-6f579a76295c/image.png" alt=""></p>
<h3 id="22-self-attention">2.2. Self-Attention</h3>
<ul>
<li><strong>Self-attention</strong> is defined as attention applied to a single context, in which the <strong>query, key, and value are extracted from the same image</strong>.</li>
<li>Single-head attention is computed as in equation (2). With this formulation the model <strong>extracts local attention</strong> (k is the spatial extent); a sketch of such a local attention layer follows below. Global attention can only be used after spatial downsampling because of its computational cost.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/eaa54108-68a9-430e-a444-38b6126ba128/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/65bf173d-9131-4657-988e-48510ea780b8/image.png" alt=""></li>
<li>In practice, <strong>multiple attention heads are used to learn multiple distinct representations of the input</strong>. The input features are divided depthwise into N groups, the weights of each head have dimension d_out/N x d_in/N, and the head outputs are concatenated into a d_out-dimensional representation.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/481ac61a-b3ab-410c-8c6a-1eedba6c0dbd/image.png" alt=""></li>
</ul>
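<p>As a rough illustration of the local attention described above, the sketch below computes single-head attention over a k x k neighborhood around every pixel using unfold; the relative positional embeddings of the paper are omitted, so this is a simplification, not the authors&#39; layer.</p>
<pre><code class="language-python"># Hedged sketch: single-head local self-attention over a k x k spatial extent
# (the paper's relative position embeddings are omitted for brevity).
import torch
import torch.nn.functional as F

def local_self_attention(q, k, v, extent=3):
    """q, k, v: (B, C, H, W); attention is restricted to an extent x extent window."""
    B, C, H, W = q.shape
    pad = extent // 2
    # gather the k x k neighborhood of keys and values around every pixel
    k_n = F.unfold(k, extent, padding=pad).view(B, C, extent * extent, H * W)
    v_n = F.unfold(v, extent, padding=pad).view(B, C, extent * extent, H * W)
    q = q.view(B, C, 1, H * W)
    attn = torch.softmax((q * k_n).sum(1, keepdim=True) / C ** 0.5, dim=2)
    out = (attn * v_n).sum(2)                 # weighted sum of the neighborhood values
    return out.view(B, C, H, W)
</code></pre>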
<h3 id="23-fully-attentional-vision-models">2.3. Fully Attentional Vision Models</h3>
<ul>
<li><p>The main purpose of this work is <strong>to create a fully attentional vision model</strong>. It takes an existing convolutional architecture and <strong>replaces every convolutional layer with an attention layer</strong>. </p>
<ul>
<li>The paper uses ResNet as the reference architecture.</li>
</ul>
</li>
<li><p>For the initial layers of the CNN, referred to as the stem, the attention layer is applied without positional information. </p>
</li>
</ul>
<hr>
<h1 id="3-result">3. Result</h1>
<ul>
<li><p>ImageNet classification
<img src="https://velog.velcdn.com/images/lake_park_0706/post/534c358f-a417-42f3-a07d-191e8d48d28c/image.png" alt=""></p>
</li>
<li><p>COCO object detection
<img src="https://velog.velcdn.com/images/lake_park_0706/post/569f8704-8265-405c-b032-a3fc2274a531/image.png" alt=""></p>
</li>
<li><p>Effect of relative positional information 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/ab1376b6-e26c-4711-9dca-ebbfc576ab52/image.png" alt=""></p>
</li>
<li><p>Spatial extent
<img src="https://velog.velcdn.com/images/lake_park_0706/post/11e7b2ad-0d0c-49b0-a906-66b6e3191541/image.png" alt=""></p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Dense Cross-Query-and-Support Attention Weighted Mask Aggregation for Few-Shot Segmentation - incomplete]]></title>
            <link>https://velog.io/@lake_park_0706/Reveiw-Dense-Cross-Query-and-Support-Attention-Weighted-Mask-Aggregation-for-Few-Shot-Segmentation-incomplete</link>
            <guid>https://velog.io/@lake_park_0706/Reveiw-Dense-Cross-Query-and-Support-Attention-Weighted-Mask-Aggregation-for-Few-Shot-Segmentation-incomplete</guid>
            <pubDate>Fri, 27 Jan 2023 08:38:31 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>This paper proposes a model that <strong>exploits all available foreground and background features in the support images</strong> for dense pixel-wise classification. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/1d38e760-7582-4df3-8eea-fc8bf8cf2731/image.png" alt=""></li>
</ul>
<h1 id="2-method">2. Method</h1>
<ul>
<li>Overall structure
<img src="https://velog.velcdn.com/images/lake_park_0706/post/30ef0139-1246-402d-bee7-f9f1a3d99ebd/image.png" alt=""><h3 id="21-feature-extraction-and-mask-preparation">2.1. Feature extraction and mask preparation</h3>
</li>
<li>First, feature vectors of the query image and the support image are extracted from the backbone model. The <strong>feature extractor outputs multi-scale, multi-layer feature maps</strong> as in Fig 2. Meanwhile, multi-layer <strong>support masks</strong> are generated via bilinear <strong>interpolation</strong>. </li>
</ul>
<h3 id="22-multi-scale-multi-layer-cross-attention-weighted-mask-aggregation">2.2. Multi-scale multi-layer cross attention weighted mask aggregation</h3>
<ul>
<li>Attention is the core concept of the vision transformer, and the paper <strong>adopts equation (1) to compute attention</strong> across the query and support features. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/b174a4bb-9450-4282-aee4-b32e2a7c29f2/image.png" alt=""></li>
<li>The dimension of the feature map is h x w x c and that of the mask is h x w x 1.</li>
<li>The proposed model flattens the feature maps and the interpolated masks to treat each pixel as a token, then generates the Q and K matrices. Dot-product attention is computed for each head, and the outputs of the multiple heads are averaged for each token and reshaped; a sketch of this attention-weighted aggregation follows below.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/347c2a1b-f9df-4ad1-81c2-fdf4c245ba21/image.png" alt=""></li>
</ul>
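<p>A minimal single-head, single-scale sketch of the aggregation described above is given below: query tokens attend to support tokens, and the attention weights aggregate the flattened support-mask values into a foreground score per query pixel. Tensor shapes and names are assumptions for illustration.</p>
<pre><code class="language-python"># Hedged sketch (one head, one scale) of attention weighted mask aggregation.
import torch

def mask_aggregation(q_feat, s_feat, s_mask):
    """q_feat: (Nq, d) query tokens, s_feat: (Ns, d) support tokens,
    s_mask: (Ns,) interpolated support-mask values in [0, 1]."""
    d = q_feat.size(1)
    attn = torch.softmax(q_feat @ s_feat.t() / d ** 0.5, dim=-1)   # (Nq, Ns) attention weights
    return attn @ s_mask                       # (Nq,) aggregated foreground score per query pixel
</code></pre>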
<h1 id="3-result">3. Result</h1>
<ul>
<li><img src="https://velog.velcdn.com/images/lake_park_0706/post/afaf6851-73c6-49df-80db-b92515eb3b1d/image.png" alt=""></li>
<li>Efficiency
<img src="https://velog.velcdn.com/images/lake_park_0706/post/5b0af3ae-49b9-44da-903d-0df8e0a2d438/image.png" alt=""></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] PANet: Few-Shot Image Semantic Segmentation with Prototype Alignment]]></title>
            <link>https://velog.io/@lake_park_0706/Review-PANet-Few-Shot-Image-Semantic-Segmentation-with-Prototype-Alignment</link>
            <guid>https://velog.io/@lake_park_0706/Review-PANet-Few-Shot-Image-Semantic-Segmentation-with-Prototype-Alignment</guid>
            <pubDate>Thu, 19 Jan 2023 07:21:25 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>Previous few-shot segmentation methods <strong>do not separate the feature extraction</strong> of the target object in the support set <strong>from the segmentation process</strong> of the query image, which may be problematic since the segmentation model&#39;s representation is mixed with the features extracted from the support set. </li>
<li>This paper <strong>proposes a model that separates prototype extraction from non-parametric metric learning</strong>. Moreover, <strong>the paper uses masks in the few-shot learning process</strong>. </li>
</ul>
<h1 id="2-method">2. Method</h1>
<ul>
<li>Overall structure 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/0488fc5c-e39b-471b-a952-b7fbc4dbf233/image.png" alt=""></li>
</ul>
<h3 id="21-prototype-learning">2.1. Prototype learning</h3>
<ul>
<li>The proposed model leverages the mask annotations over the support images to learn prototypes for foreground and background separately. </li>
<li>There are two strategies to exploit the segmentation masks. <strong>Early fusion masks the support image</strong> before feature extraction, whereas <strong>late fusion masks the feature map</strong> to treat foreground and background differently. <strong>This paper uses late fusion</strong>. </li>
<li>Equation (1) gives the foreground prototype and equation (2) the background prototype.<br><img src="https://velog.velcdn.com/images/lake_park_0706/post/cef05ab3-5056-48f8-9ca0-00b25e60822a/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/4d3e82f2-287b-470a-96f9-cc9883863b77/image.png" alt=""></li>
</ul>
<h3 id="22-non-parametric-metric-learning">2.2. Non-parametric metric learning</h3>
<ul>
<li>To segment the query image, <strong>the model first calculates the distance between the query feature vector at each spatial location and each prototype</strong> computed in 2.1. <strong>Then a softmax is applied over the distances to produce a probability map</strong>. Cosine similarity is used as the distance measure; a sketch of the prototype extraction and this classifier follows below. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/e76fa922-01a3-451b-8e88-21676f9a7fa1/image.png" alt=""></li>
</ul>
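<p>The sketch below ties the two steps together: masked average pooling of the support features into prototypes (Section 2.1) and the cosine-similarity softmax over prototypes (Section 2.2). The scaling factor alpha and the tensor layouts are assumptions for illustration.</p>
<pre><code class="language-python"># Hedged sketch of prototype extraction and the non-parametric cosine classifier.
import torch
import torch.nn.functional as F

def prototype(feat, mask):
    """feat: (C, H, W) support features, mask: (H, W) binary mask; returns a (C,) prototype."""
    return (feat * mask).sum(dim=(1, 2)) / (mask.sum() + 1e-5)

def segment(query_feat, prototypes, alpha=20.0):
    """query_feat: (C, H, W), prototypes: (K, C); returns (K, H, W) probability maps."""
    q = F.normalize(query_feat, dim=0)
    p = F.normalize(prototypes, dim=1)
    sim = torch.einsum('kc,chw->khw', p, q)    # cosine similarity per spatial location
    return torch.softmax(alpha * sim, dim=0)   # softmax over the prototypes
</code></pre>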
<h3 id="23-prototype-alignment-regularization-par">2.3. Prototype alignment regularization (PAR)</h3>
<ul>
<li>This module encourages the model to extract from the support images general features that guide the FSS.</li>
<li>If the model can predict a good segmentation mask, then <strong>prototypes learned from the query set should be able to segment the support images</strong>. Thus, PAR takes <strong>the query image and its predicted mask as a new support set</strong> and <strong>treats the original support set as the new query</strong>, as in Figure 2 (b).</li>
</ul>
<h1 id="3-result">3. Result</h1>
<ul>
<li><p><img src="https://velog.velcdn.com/images/lake_park_0706/post/2682e1f0-f79e-4f60-8436-40ecc13dfe94/image.png" alt=""></p>
</li>
<li><p>Comparison with other models
<img src="https://velog.velcdn.com/images/lake_park_0706/post/848016a3-7ca4-4b9c-9922-eda22f707719/image.png" alt=""></p>
</li>
<li><p>Analysis on PAR
<img src="https://velog.velcdn.com/images/lake_park_0706/post/12bf345c-3808-4c55-9844-aad90382b60c/image.png" alt=""></p>
</li>
<li><p>Result of using weak annotation
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d2a39e9a-3a66-423a-a215-b298cb8410bf/image.png" alt=""></p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Few-Shot Segmentation Propagation with Guided Networks]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Few-Shot-Segmentation-Propagation-with-Guided-Networks</link>
            <guid>https://velog.io/@lake_park_0706/Review-Few-Shot-Segmentation-Propagation-with-Guided-Networks</guid>
            <pubDate>Wed, 18 Jan 2023 12:15:07 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>Semi- and weakly supervised segmentation methods cannot segment a new input class. This paper therefore <strong>attempts to segment a new input</strong> given a few support images of the same class. </li>
<li>To do so, the paper addresses three key parts of the FSS problem: (1) <strong>how to summarize the sparse, structured support into a task representation</strong>, (2) <strong>how to condition pixelwise inference on the given task representation</strong>, and (3) <strong>how to synthesize segmentation tasks for accuracy and generality</strong>.</li>
</ul>
<h1 id="2-method">2. Method</h1>
<ul>
<li>The work tackles the problem with a two-branch model: (1) a guide branch that extracts the task representation from the support, and (2) an inference branch that segments queries given the guidance.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/9824e98a-4728-42fa-a717-60b24b029512/image.png" alt=""></li>
</ul>
<h3 id="21-guidance-from-support-to-task-representation">2.1. Guidance: From Support to Task Representation</h3>
<ul>
<li>Early fusion suffers from incompatibility of the support and query representations. The paper addresses this problem as in Figure 3 (b). </li>
<li>In late fusion, features are first extracted by the encoder, the annotations are mapped into masks with the same number of channels as the feature map, and the features are fused with the masks by multiplication. </li>
</ul>
<h3 id="22-guiding-inference">2.2. Guiding Inference</h3>
<ul>
<li><img src="https://velog.velcdn.com/images/lake_park_0706/post/9b4e7dfc-08d9-4246-8a70-d13b70b829c3/image.png" alt=""></li>
</ul>
<h1 id="3-result">3. Result</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/9f4be51e-ebd3-483d-871b-5d299ad393b3/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Prototypical Networks for Few-shot Learning]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Prototypical-Networks-for-Few-shot-Learning</link>
            <guid>https://velog.io/@lake_park_0706/Review-Prototypical-Networks-for-Few-shot-Learning</guid>
            <pubDate>Tue, 17 Jan 2023 06:17:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>Traditional neural network approaches require abundant amounts of data. To overcome this, few-shot learning aims to perform one-shot classification the way humans do. </li>
<li>The problem with this setting is that <strong>overfitting happens easily</strong>. To address it, this paper <strong>proposes prototypical networks</strong>. The model is based on the idea that <strong>there exists an embedding in which points cluster around a single prototype representation for each class</strong>. Specifically, the class means are used as prototypes. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/e9b64692-ef2f-4553-b61e-2246b8884033/image.png" alt=""></li>
</ul>
<h1 id="2-method">2. Method</h1>
<h3 id="21-model">2.1 Model</h3>
<ul>
<li><p>Equation (1) shows how the proposed model computes the mean vector of the embedded support points. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/97963b16-a42a-47de-b3d5-77112bce3948/image.png" alt=""></p>
</li>
<li><p>The probability that x belongs to a specific class k is then given by equation (2); a sketch of both steps follows below.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/9bfeacf0-8404-4d1f-bf93-1558692ef807/image.png" alt=""></p>
</li>
</ul>
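<p>A compact sketch of both equations is shown below: prototypes are the class means of the embedded support points, and query points are classified with a softmax over negative squared Euclidean distances. Variable names are illustrative.</p>
<pre><code class="language-python"># Hedged sketch of equations (1) and (2) of Prototypical Networks.
import torch

def prototypes(support_emb, support_labels, num_classes):
    """support_emb: (N, D), support_labels: (N,); returns (num_classes, D) class means."""
    return torch.stack([support_emb[support_labels == k].mean(0)
                        for k in range(num_classes)])

def classify(query_emb, protos):
    """query_emb: (M, D), protos: (K, D); returns (M, K) class probabilities."""
    dists = torch.cdist(query_emb, protos) ** 2      # squared Euclidean distances
    return torch.softmax(-dists, dim=1)              # closer prototype, higher probability
</code></pre>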
<h1 id="3-result">3. Result</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/4a29b32d-5a92-42e7-a75f-78ec192fc8b1/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Sg-one: Similarity guidance network for one-shot semantic segmentation]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Sg-one-Similarity-guidance-network-for-one-shot-semantic-segmentation</link>
            <guid>https://velog.io/@lake_park_0706/Review-Sg-one-Similarity-guidance-network-for-one-shot-semantic-segmentation</guid>
            <pubDate>Wed, 11 Jan 2023 08:45:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>Traditional segmentation models such as U-Net and FCN require a heavy labeling effort. To reduce this budget, one-shot segmentation is applied.</li>
<li>One-shot segmentation must recognize new objects from only one annotated example. </li>
<li>Older one-shot segmentation methods use a pair of networks, one trained to extract features of the support images and the other for the query images. The weaknesses of this design are: 1) the parameters of the two parallel networks are redundant, which is prone to overfitting and wastes computational resources; 2) combining the features of support and query images by mere multiplication is inadequate for guiding the query network to learn high-quality segmentation masks.  </li>
<li>To overcome these weaknesses, the SG-One model is proposed. Its main idea is to incorporate the similarities between support images and query images. </li>
<li>To extract similarities, the model uses masked average pooling. As a result, the cosine similarity between features belonging to the object is high, while the similarity with background features is low. </li>
</ul>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/8dd2a954-9441-4131-bd09-10ed878d0a29/image.png" alt=""></p>
<h1 id="2-method">2. Method</h1>
<h2 id="21-problem-definition">2.1 Problem Definition</h2>
<ul>
<li>There are three kinds of datasets: 1) a training set, 2) a support set, and 3) a testing set, and the classes of the training set and the support set do not intersect. The goal of the proposed model is therefore to train the network on the training set so that it can predict segmentation masks on the testing set given a few references from the support set. </li>
</ul>
<h2 id="22-proposed-model">2.2. Proposed Model</h2>
<h4 id="221-masked-average-pooling">2.2.1. Masked Average Pooling</h4>
<ul>
<li>This module extracts an object-related representative vector of an image from the support set. </li>
<li>The proposed model uses contextual regions: it first extracts features from the image and resizes the feature map to the size of the mask with bilinear interpolation. The feature vector is then computed by averaging the pixels within the object region as in equation (1).
<img src="https://velog.velcdn.com/images/lake_park_0706/post/45fbf108-3d6f-49b7-b975-33739e89d6e9/image.png" alt="">
In this way, the model extracts features of the object region while discarding irrelevant information such as the background. </li>
</ul>
<h4 id="222-similarity-guidance">2.2.2. Similarity Guidance</h4>
<ul>
<li>This module combines the representative vector with the query features. The masked average pooling layer extracts the representative vector of the object region; for the query image, the model extracts feature vectors and computes the cosine similarity between them and the representative vector as in equation (2); a sketch of this guidance follows below.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d4ca735a-0c7b-42fa-af40-3e96433caa0c/image.png" alt=""></li>
<li>As a result, the model obtains a similarity map that indicates the object region in the query image. The similarity map is multiplied into the feature maps of the query image coming from the segmentation branch.</li>
</ul>
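<p>The sketch below illustrates the guidance described above: masked average pooling of the support features (equation 1), a cosine similarity map over the query features (equation 2), and the pixel-wise multiplication that guides the segmentation branch. Shapes are assumptions for illustration.</p>
<pre><code class="language-python"># Hedged sketch of masked average pooling and similarity guidance.
import torch
import torch.nn.functional as F

def similarity_guidance(support_feat, support_mask, query_feat):
    """support_feat, query_feat: (C, H, W); support_mask: (H, W) with values in {0, 1}."""
    v = (support_feat * support_mask).sum(dim=(1, 2)) / (support_mask.sum() + 1e-5)
    sim = F.cosine_similarity(query_feat, v[:, None, None], dim=0)   # (H, W) similarity map
    return query_feat * sim                    # guided features for the segmentation branch
</code></pre>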
<h2 id="23-similarity-guidance-method">2.3. Similarity Guidance Method</h2>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/e4da511c-f789-4409-8d01-0d04e094fc7d/image.png" alt=""></p>
<ul>
<li>The stem is an FCN.</li>
<li>The Similarity Guidance Branch is fed the extracted features of both query and support images and produces the similarity map. In this branch, conv blocks extract features from the input images, and these feature vectors are used to calculate the closeness between the support feature and the feature at each pixel of the query image. </li>
<li>The segmentation branch performs pixel-wise classification. The proposed model uses conv blocks to obtain features for segmentation; at the last two layers the generated feature maps are fused with the similarity map by multiplication at each pixel, and the output is generated after decoding the feature maps.</li>
</ul>
<h1 id="3-results">3. Results</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/d6f64a81-f317-4be4-919d-385a41c07646/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/ea94ea6f-3c5b-4c24-94d9-ec22bdaacd2a/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[DeepLab: Semantic Image Segmentation with
Deep Convolutional Nets, Atrous Convolution,
and Fully Connected CRFs (Review)]]></title>
            <link>https://velog.io/@lake_park_0706/DeepLab-Semantic-Image-Segmentation-withDeep-Convolutional-Nets-Atrous-Convolutionand-Fully-Connected-CRFs-Review</link>
            <guid>https://velog.io/@lake_park_0706/DeepLab-Semantic-Image-Segmentation-withDeep-Convolutional-Nets-Atrous-Convolutionand-Fully-Connected-CRFs-Review</guid>
            <pubDate>Sat, 07 Jan 2023 06:28:36 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li>There are three challenges in semantic segmentation with DCNNs: (1) reduced feature resolution, (2) the existence of objects at multiple scales, and (3) reduced localization accuracy due to DCNN invariance. </li>
<li>The first challenge is caused by max-pooling and downsampling. To address it, the paper removes the downsampling operators and upsamples the filters; in other words, atrous convolution is used to recover feature-map resolution.</li>
<li>To deal with the second challenge, the model uses atrous spatial pyramid pooling (ASPP).</li>
<li>The last challenge is related to the fact that classification requires transformation invariance. The paper uses a CRF to improve the model&#39;s ability to capture fine details. </li>
</ul>
<h1 id="2-method">2. Method</h1>
<h3 id="21-atrous-convolution-for-dense-feature-extraction-and-field-of-view-enlargement">2.1. Atrous Convolution for Dense feature Extraction and Field-of-View Enlargement</h3>
<ul>
<li>The paper shows that an atrous DCNN is more effective for preserving feature resolution; a sketch of atrous convolution and ASPP follows below.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/f4e30336-0e35-44c9-959a-8291c8f6a3dc/image.png" alt=""></li>
</ul>
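<p>The sketch below shows how an atrous (dilated) convolution preserves spatial resolution while enlarging the field of view, and how ASPP combines several parallel branches with different rates. The specific rates and channel sizes are illustrative assumptions.</p>
<pre><code class="language-python"># Hedged sketch: atrous convolution and a simplified ASPP made of parallel dilated branches.
import torch
import torch.nn as nn

x = torch.randn(1, 256, 64, 64)
atrous = nn.Conv2d(256, 256, kernel_size=3, dilation=2, padding=2)   # rate r = 2
print(atrous(x).shape)                         # spatial resolution is preserved

aspp = nn.ModuleList([nn.Conv2d(256, 256, 3, padding=r, dilation=r)
                      for r in (6, 12, 18, 24)])                     # ASPP-style rates
multi_scale = sum(branch(x) for branch in aspp)                      # fused multi-scale response
</code></pre>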
<h3 id="22-multiscale-image-representations-using-aspp">2.2. Multiscale Image Representations using ASPP</h3>
<ul>
<li>As shown in Fig 4, the paper implements a model that uses multiple parallel atrous convolutional layers with different rates. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/9925cacd-6040-42ab-8714-e392e2394251/image.png" alt=""></li>
</ul>
<h3 id="23-structured-prediction-with-fully-connected-conditional-random-fields-for-accurate-boundary-recovery">2.3. Structured Prediction with Fully-Connected Conditional Random Fields for Accurate Boundary Recovery</h3>
<ul>
<li>To recover fine local boundaries, the model uses a fully connected CRF.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/f3242a34-f6da-4561-9704-1213ffa16d5d/image.png" alt=""></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Review: A Deep Convolutional
Encoder-Decoder Architecture for Image
Segmentation(SegNet) - incomplete]]></title>
            <link>https://velog.io/@lake_park_0706/Review-A-Deep-ConvolutionalEncoder-Decoder-Architecture-for-ImageSegmentationSegNet-incomplete</link>
            <guid>https://velog.io/@lake_park_0706/Review-A-Deep-ConvolutionalEncoder-Decoder-Architecture-for-ImageSegmentationSegNet-incomplete</guid>
            <pubDate>Mon, 02 Jan 2023 11:07:45 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>The motivation of this paper is that max-pooling and sub-sampling reduce feature-map resolution.</li>
<li>The encoder of SegNet is identical to the convolutional layers of VGG16. In the decoder, the model reuses the max-pooling indices from the encoder.</li>
</ul>
<h1 id="3-architecture">3. Architecture</h1>
<ul>
<li>The encoder of the model has 13 convolutional layers, and each encoder layer has a corresponding decoder layer, so there are 13 decoder layers as well. The output of the final decoder layer is passed through a multi-class soft-max classifier that computes the probability of each class for every pixel independently.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/43c258d2-62c6-4e0d-bd42-e859b79c4174/image.png" alt=""></li>
<li>In the encoder, boundary information is easily lost during max-pooling. This paper therefore stores the max-pooling indices for each encoder feature map. </li>
<li>The decoder upsamples its input feature map using these max-pooling indices (Figure 3); a sketch of this mechanism follows below. The final upsampled feature representation is fed to the soft-max classifier, whose output has as many channels as there are classes, representing the probability of each class per pixel. </li>
</ul>
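<p>A minimal sketch of the pooling-indices mechanism is shown below, using the standard PyTorch max-pool/unpool pair; it is an illustration of the idea, not the SegNet implementation.</p>
<pre><code class="language-python"># Hedged sketch: store max-pooling indices in the encoder, reuse them for upsampling in the decoder.
import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)

x = torch.randn(1, 64, 32, 32)                 # an encoder feature map
pooled, indices = pool(x)                      # indices record where each max value came from
upsampled = unpool(pooled, indices)            # sparse upsampled map; trainable convs then densify it
</code></pre>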
<blockquote>
<p>Reference
<a href="https://arxiv.org/pdf/1511.00561.pdf">https://arxiv.org/pdf/1511.00561.pdf</a></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Review] Multi-Similarity and Attention Guidance for Boosting Few-Shot
Segmentation(MSANet)]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Multi-Similarity-and-Attention-Guidance-for-Boosting-Few-ShotSegmentationMSANet-incomplete</link>
            <guid>https://velog.io/@lake_park_0706/Review-Multi-Similarity-and-Attention-Guidance-for-Boosting-Few-ShotSegmentationMSANet-incomplete</guid>
            <pubDate>Mon, 02 Jan 2023 08:00:44 GMT</pubDate>
            <description><![CDATA[<h1 id="1-motivation">1. Motivation</h1>
<ul>
<li><p>The problem with traditional supervised CNNs is that they need a large amount of well-annotated data, a balanced class distribution, and representative samples. Moreover, they do not handle unseen classes. </p>
</li>
<li><p>Few-shot learning can be a good solution to these problems. The goal of FSS is to segment the target region of a selected category in the query image. </p>
</li>
<li><p>There have been many attempts to get more guidance from prototype vectors using different mechanisms (PANet, PFENet, SG-One Net, ...).</p>
</li>
<li><p>However, <strong>these approaches (PANet, PFENet, SG-One Net, ...) can lose fine image information because of the masked average pooling operation.</strong> This paper therefore tackles the problem with a <strong>Multi-Similarity module</strong> and an <strong>Attention module</strong>. </p>
</li>
</ul>
<h1 id="2-method">2. Method</h1>
<ul>
<li>The proposed model has two guiding modules. The multi-similarity module finds a visual correspondence between the support image and the query image, while the attention module makes the FSS network focus on the target object of the query image. </li>
<li>The following image shows the overall structure of proposed model. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d242ca3f-26e1-41fc-a396-c5c1c9025f73/image.png" alt=""></li>
</ul>
<h3 id="21-multi-similarity-module">2.1. Multi-Similarity Module</h3>
<ul>
<li>First, a pair of query and support images is fed to the backbone network (VGG or ResNet). The last three blocks of the <strong>backbone extract feature maps</strong> from the support and query images respectively, as in equations 1 and 2. <img src="https://velog.velcdn.com/images/lake_park_0706/post/c62fbd75-2919-444a-9d31-bf4443ee7933/image.png" alt=""></li>
<li><strong>To activate the related object region,</strong> the feature maps are masked with the corresponding masks, which are bilinearly interpolated to the appropriate size.<img src="https://velog.velcdn.com/images/lake_park_0706/post/878f5483-4b52-42b3-9a4b-9f808d875d30/image.png" alt=""></li>
<li>To generate a visual correspondence, the paper <strong>computes the pixel-wise cosine distance between the feature maps of the support and query images</strong> as in equation 5. Figure 4 shows the overall process of computing this visual correspondence (similarity map).
<img src="https://velog.velcdn.com/images/lake_park_0706/post/c14b4d31-fde6-44c4-bdd3-f674dc0a0b65/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/b9dd4d93-3b30-4afb-b77c-f4a73b65f893/image.png" alt=""></li>
</ul>
<h3 id="22-attention-module">2.2. Attention Module</h3>
<ul>
<li>Because of the limited number of support images, the paper proposes a lightweight attention module. <strong>This module extracts class-related information from the few support samples and guides the model to focus on the target region.</strong> A rough sketch of such a module follows below.</li>
<li>The features extracted from blocks 2 and 3 are fed to the attention module; before that, the features are concatenated and their dimensionality is reduced. The actual process is shown in the following figure. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/e476ced3-a367-4de3-8805-b94b5414a5f2/image.png" alt=""></li>
</ul>
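<p>Below is a rough sketch of a lightweight attention module of this kind: mid-level features from two backbone blocks are concatenated, reduced with a 1x1 convolution, and turned into a spatial attention map. The layer sizes are assumptions and this is not the authors&#39; exact module.</p>
<pre><code class="language-python"># Hedged sketch of a lightweight attention module (not the authors' exact design).
import torch
import torch.nn as nn

class LightweightAttention(nn.Module):
    def __init__(self, c2, c3, hidden=256):
        super().__init__()
        self.reduce = nn.Conv2d(c2 + c3, hidden, kernel_size=1)   # dimensionality reduction
        self.score = nn.Conv2d(hidden, 1, kernel_size=1)          # per-pixel attention score

    def forward(self, f2, f3):
        """f2: (B, c2, H, W) and f3: (B, c3, H, W) features; returns a (B, 1, H, W) attention map."""
        fused = torch.relu(self.reduce(torch.cat([f2, f3], dim=1)))
        return torch.sigmoid(self.score(fused))
</code></pre>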
<h1 id="3-result">3. Result</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/f297cbe1-e9e7-423e-a1c3-4ff254f1f29f/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/ba649b79-7a3c-41ee-b5d8-de3c191c0ef3/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/628070b2-da53-46e5-8038-250cf824a668/image.png" alt=""></p>
<blockquote>
<p>Reference </p>
</blockquote>
<ul>
<li>MSANet: Multi-Similarity and Attention Guidance for Boosting Few-Shot
Segmentation</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Review: CONDITIONAL NETWORKS
FOR FEW-SHOT SEMANTIC SEGMENTATION]]></title>
            <link>https://velog.io/@lake_park_0706/Review-CONDITIONAL-NETWORKSFOR-FEW-SHOT-SEMANTIC-SEGMENTATION</link>
            <guid>https://velog.io/@lake_park_0706/Review-CONDITIONAL-NETWORKSFOR-FEW-SHOT-SEMANTIC-SEGMENTATION</guid>
            <pubDate>Fri, 30 Dec 2022 09:48:01 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>This paper proposes the co-FCN network, in which a conditioning branch incorporates the few-shot annotations. Samples for few-shot learning are drawn from a segmentation dataset.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/709da6ab-962a-4eef-9089-e8d0be811e30/image.png" alt=""><h1 id="2-conditional-architecture-and-few-shot-optimization">2. Conditional Architecture and Few-Shot Optimization</h1>
</li>
<li>Few-shot segmentation requires learning to segment a new input from only a few annotations. </li>
<li>The proposed model is based on an FCN and aligns the few-shot training and testing paradigms.</li>
<li>Figure 1 shows the overall structure of co-FCN. The segmentation branch is an FCN that segments the query. The conditioning branch takes the support set as input and produces features that are fused with the query image features. </li>
</ul>
<h1 id="4-discussion">4. Discussion</h1>
<ul>
<li>The proposed model is much faster than traditional models because it does not require any optimization at test time. Additionally, it copes well with sparse annotations.</li>
</ul>
<blockquote>
<p>Reference</p>
</blockquote>
<ul>
<li>CONDITIONAL NETWORKS FOR FEW-SHOT SEMANTIC SEGMENTATION</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Review: One-shot learning for semantic segmentation]]></title>
            <link>https://velog.io/@lake_park_0706/Review-One-shot-learning-for-semantic-segmentation</link>
            <guid>https://velog.io/@lake_park_0706/Review-One-shot-learning-for-semantic-segmentation</guid>
            <pubDate>Thu, 29 Dec 2022 08:50:00 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>This paper proposes semantic segmentation with one-shot learning, i.e. pixel-level prediction given a single image and its mask.</li>
<li>A naive implementation of one-shot learning for segmentation can overfit and is hard to optimize. This paper deals with these problems.</li>
<li>The paper proposes a two-branch approach to one-shot segmentation. Figure 1 shows the overall structure. The first branch takes the labeled image as input and produces a vector of parameters as output. The second branch takes these parameters as well as a new image as input and produces a segmentation mask of the image for the new class.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/bc45296c-7d38-4349-bc5e-feec5e617da9/image.png" alt=""></li>
</ul>
<h1 id="3-problem-setup">3. Problem Setup</h1>
<ul>
<li>The main purpose of this paper is to learn a model f() that, given a support set and a query image, predicts a binary mask for the semantic class. </li>
<li>During training, the model consumes image-mask pairs; at test time, the query images are annotated only for new semantic classes. As a result, the training and testing classes do not overlap.</li>
</ul>
<h1 id="4-proposed-method">4. Proposed Method</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/105f61fc-4674-4958-bf78-cabe3bc10bb6/image.png" alt=""></p>
<h2 id="41-producing-parameters-from-labeled-image">4.1: Producing Parameters from Labeled Image</h2>
<ul>
<li>VGGNet is used as the base model of the conditioning branch.</li>
<li>The label contains only the target object.</li>
<li>Weight hashing helps avoid overfitting.<h2 id="42-dense-feature-extraction">4.2: Dense Feature Extraction</h2>
</li>
<li>FCN-32s is used for dense feature extraction; a sketch of how the predicted parameters act on these dense features follows below.</li>
</ul>
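<p>As an illustration of how the two branches interact, the sketch below treats the parameters produced by the conditioning branch as a per-pixel logistic-regression classifier over the dense query features. The helper <code>conditioning_branch</code> is hypothetical; only the final classification step is shown.</p>
<pre><code class="language-python"># Hedged sketch: parameters from the conditioning branch classify every pixel of the query features.
import torch

def predict_mask(query_feat, w, b):
    """query_feat: (D, H, W) dense features, w: (D,) weights, b: scalar bias; returns (H, W) mask probabilities."""
    logits = torch.einsum('d,dhw->hw', w, query_feat) + b
    return torch.sigmoid(logits)

# w, b = conditioning_branch(support_image, support_mask)   # hypothetical helper for the first branch
</code></pre>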
<h2 id="43-training-procedure">4.3: Training Procedure</h2>
<ul>
<li>Training samples an image-label pair and maximizes the log-likelihood of the ground-truth mask.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/2862143d-4472-4eb3-aee3-d1243082687a/image.png" alt=""></li>
</ul>
<h2 id="44-extension-to-k-shot">4.4: Extension to k-shot</h2>
<p>In k-shot learning, the support set simply contains more examples. </p>
<h1 id="8-conclusion">8. Conclusion</h1>
<p>In this paper, the proposed model learns to learn an ensemble classifier and uses it to classify pixels. As a result, the model is faster and performs well. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/675942d7-18d1-47b3-8abb-84dbbfe82063/image.png" alt=""></p>
<blockquote>
<p>References
<a href="https://arxiv.org/abs/1709.03410">https://arxiv.org/abs/1709.03410</a></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[Few Shot Learning]]></title>
            <link>https://velog.io/@lake_park_0706/Few-Shot-Learning</link>
            <guid>https://velog.io/@lake_park_0706/Few-Shot-Learning</guid>
            <pubDate>Wed, 28 Dec 2022 10:29:10 GMT</pubDate>
            <description><![CDATA[<h1 id="1-basic-concept">1. Basic concept</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/a4df33f0-d6f0-4ef8-8594-8f6d369c2707/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/f902a27b-4e84-4c98-aed3-13800beea2ff/image.png" alt=""></p>
<ul>
<li>A human can recognize that the query is a pangolin by comparing it with the four support images, but this is challenging for a computer because there are only a few images.</li>
</ul>
<ul>
<li>The traditional purpose of deep learning is recognizing a certain class from an image; in few-shot learning, the purpose is learning to learn, i.e. learning the similarities and differences between objects. </li>
<li>Few-shot learning is a kind of meta learning.</li>
</ul>
<h2 id="what-is-meta-learning">What is meta learning</h2>
<p> <img src="https://velog.velcdn.com/images/lake_park_0706/post/37f84d52-8bad-400e-bb24-adc8f9cc50cb/image.png" alt=""></p>
<ul>
<li>Meta learning is learning to learn. For example, given an unknown query and a support set of 6 classes, you can tell that the query is an otter by comparing it with the support set. </li>
</ul>
<h2 id="supervised-learning-vs-few-shot-learning">Supervised learning vs few shot learning</h2>
<h4 id="supervised-learning">supervised learning</h4>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/b4bb0c54-6785-4f9d-8ca2-001fc48db38b/image.png" alt=""></p>
<ul>
<li>test samples are from known classes <h4 id="few-shot-learning">few shot learning</h4>
<img src="https://velog.velcdn.com/images/lake_park_0706/post/aa013b8d-f5dc-4646-90de-2099eac0767e/image.png" alt=""></li>
<li>Query samples are from unknown classes</li>
</ul>
<h2 id="relation-between-accuracy-and-ways">Relation between accuracy and ways</h2>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/101ec35c-ee5e-423b-b479-0a97b56224c5/image.png" alt=""></p>
<h2 id="relation-between-accuracy-and-shots">Relation between accuracy and shots</h2>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/04b7b2db-2211-4ff0-8071-6a8fd9c7d80b/image.png" alt=""></p>
<h1 id="2-siamese-network">2. Siamese Network</h1>
<ul>
<li>There are two ways to train this network.<h2 id="21-learning-pairwise-similiarity-scores">2.1: Learning Pairwise Similarity Scores</h2>
<img src="https://velog.velcdn.com/images/lake_park_0706/post/da534b1d-a4c5-487b-af6f-1844a53cceb4/image.png" alt="">
There are two kinds of samples: positive samples and negative samples.</li>
</ul>
<h3 id="211-cnn-for-feature-extraction">2.1.1: CNN for Feature Extraction</h3>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/72fda8e7-c63d-4540-98d3-2ba4b01f98fb/image.png" alt=""></p>
<p>First, the model extracts features from both samples with a shared CNN f, producing two feature vectors (h1 &amp; h2). The difference between the two vectors (z) then goes through dense layers and a sigmoid. The output should be 1 for a positive pair and 0 for a negative pair; the fully connected layers map the difference between the two objects to a score between 0 and 1.</p>
<p>As you can see, the structure looks like Siamese twins. The loss is backpropagated through the dense layers and the CNN; a small sketch of this pairwise head follows below.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/f044d972-ebf0-475f-96e3-acff09929f8e/image.png" alt="">
(when the input is a negative sample)</p>
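<p>A small sketch of this pairwise-similarity head is given below, assuming a shared CNN f has already produced the two feature vectors; layer sizes are illustrative.</p>
<pre><code class="language-python"># Hedged sketch of the pairwise-similarity head of a Siamese network.
import torch
import torch.nn as nn

class PairwiseHead(nn.Module):
    def __init__(self, feat_dim=512):
        super().__init__()
        self.dense = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, h1, h2):
        z = torch.abs(h1 - h2)                 # element-wise |h1 - h2|
        return torch.sigmoid(self.dense(z))    # close to 1 for positive pairs, 0 for negative pairs

# h1, h2 = f(x1), f(x2)   # f is the shared CNN feature extractor
</code></pre>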
<h3 id="212-one-show-prediction">2.1.2: One-shot Prediction</h3>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/45d9523f-d8fd-4529-832e-6e3cd499b927/image.png" alt=""></p>
<h2 id="22-triplet-loss">2.2: Triplet Loss</h2>
<p>The triplet loss is another method for training a Siamese network.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/272a7cd1-653e-4475-b6ac-aac2ec6f52f2/image.png" alt="">
First, choose an anchor object, then choose one positive and one negative object from the dataset. From the feature vectors of these three objects, compute the positive and negative distances: the positive distance should be small and the negative distance large enough.</p>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/3e10de35-df90-4724-b1dd-ba95667e6dc1/image.png" alt=""></p>
<p>Finally, the loss function takes the form shown in the image above; a short sketch follows below.</p>
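<p>A minimal sketch of that loss, assuming the three embeddings are already computed, is shown below.</p>
<pre><code class="language-python"># Hedged sketch of the triplet loss with a margin.
import torch

def triplet_loss(anchor, positive, negative, margin=0.2):
    """anchor, positive, negative: (N, D) embeddings; returns a scalar loss."""
    d_pos = (anchor - positive).pow(2).sum(dim=1)    # squared anchor-positive distance
    d_neg = (anchor - negative).pow(2).sum(dim=1)    # squared anchor-negative distance
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
</code></pre>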
<p>After training the Siamese network, use it for one-shot prediction.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/9217a397-2131-45ea-99ef-5b4134a04f7e/image.png" alt=""></p>
<h1 id="3-pretraining-and-fine-tuning">3. Pretraining and Fine Tuning</h1>
<h2 id="31-cosine-similarity">3.1: Cosine Similarity</h2>
<p>The cosine similarity measures how similar two vectors are. 
 <img src="https://velog.velcdn.com/images/lake_park_0706/post/db0d28ac-6218-4e4e-bbae-80905a5c6172/image.png" alt=""></p>
<h2 id="32-extracting-feature-vector">3.2: Extracting Feature Vector</h2>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/c4dc0220-0b41-4326-bf9d-3757ed49c3d7/image.png" alt=""></p>
<h2 id="33-making-few-shot-prediction">3.3: Making Few-Shot prediction</h2>
<p>There are three normalized feature vectors (one per class), and a softmax over their similarities with the query picks the most similar one.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/3b5f55c8-0c82-427b-8070-0dc672bc6adc/image.png" alt=""></p>
<h2 id="34-fine-tuning">3.4: Fine tuning</h2>
<p>Fine-tuning updates the weights and biases of the feature extractor (CNN) based on the support set. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/440f14b5-3cd7-40a7-89be-66ee20836362/image.png" alt=""></p>
<h2 id="35-entropy-regularization">3.5: Entropy Regularization</h2>
<p>In few-shot learning, the dataset is small and overfitting happens easily. To prevent this, we can use entropy regularization, which works well. 
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d4b4bdc3-8c2c-4407-bb9c-1752124ae563/image.png" alt=""></p>
<blockquote>
<p>references</p>
</blockquote>
<ul>
<li><a href="https://www.youtube.com/watch?v=hE7eGew4eeg&amp;t=32s">https://www.youtube.com/watch?v=hE7eGew4eeg&amp;t=32s</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Review: Scene Parsing via Integrated Classification Model and Variance-Based
Regularization ( incomplete )]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Scene-Parsing-via-Integrated-Classification-Model-and-Variance-BasedRegularization-incomplete</link>
            <guid>https://velog.io/@lake_park_0706/Review-Scene-Parsing-via-Integrated-Classification-Model-and-Variance-BasedRegularization-incomplete</guid>
            <pubDate>Tue, 27 Dec 2022 09:01:00 GMT</pubDate>
            <description><![CDATA[<h1 id="introduction">Introduction</h1>
<blockquote>
<p>Scene parsing
Scene parsing is to segment and parse an image into different image regions associated with semantic categories.</p>
</blockquote>
<ul>
<li>Most scene parsing models use a DNN to handle the pixel-wise classification problem. However, this approach has difficulty distinguishing categories with similar appearance. </li>
<li>This paper therefore tackles the problem from two directions: 1) it proposes an integrated classification model for scene parsing to distinguish confusing categories, and 2) it proposes a variance-based regularization to make the scores of different categories as distinct as possible.</li>
<li>The method has three steps:<ol>
<li>encoding features with a DNN model;</li>
<li>general classification;</li>
<li>refining the classifier to refine the scores. To differentiate similar categories, variance-based regularization is used to train the integrated classification model.</li>
</ol>
</li>
</ul>
<figure style="display:block; text-align:center;">
  <img src="https://velog.velcdn.com/images/lake_park_0706/post/59fe13f1-8443-4623-a47e-d84cf7eb1f4c/image.png">
  <figcaption style="text-align:center; font-size:15px; color:#808080">
    => Problem of previous segmentation model
  </figcaption>
</figure>





<blockquote>
<p>References</p>
</blockquote>
<ul>
<li><a href="http://sceneparsing.csail.mit.edu/">http://sceneparsing.csail.mit.edu/</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Review: Two-Stream Convolutional Networks for Action Recognition in Videos]]></title>
            <link>https://velog.io/@lake_park_0706/Review-Two-Stream-Convolutional-Networks-for-Action-Recognition-in-Videos</link>
            <guid>https://velog.io/@lake_park_0706/Review-Two-Stream-Convolutional-Networks-for-Action-Recognition-in-Videos</guid>
            <pubDate>Mon, 19 Dec 2022 10:38:25 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li>This paper uses CNNs to recognize human actions, which contain both spatial and temporal information.</li>
<li>The architecture is based on two streams: spatial information from video frames and temporal information from optical flow.</li>
</ul>
<h1 id="2-two-stream-architecture-for-video-recognition">2. Two-stream architecture for video recognition</h1>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/52692432-9a8a-473d-a8cf-8d7394271734/image.png" alt=""></p>
<ul>
<li>As shown in Fig 1, there are two streams (a spatial stream and a temporal stream).</li>
<li>Each stream is implemented as a CNN, and their softmax scores are combined by late fusion (averaging or an SVM).</li>
</ul>
<p><strong>Spatial stream CNN</strong> : there are useful clues in static frames</p>
<h1 id="3-optical-flow-cnn">3. Optical Flow CNN</h1>
<ul>
<li>The optical flow CNN is the CNN for temporal recognition.</li>
</ul>
<h3 id="31-cnn-input-configurations">3.1: CNN input configurations</h3>
<h4 id="optical-flow-stacking">Optical flow stacking</h4>
<p> <img src="https://velog.velcdn.com/images/lake_park_0706/post/dd324f9b-f9d8-45ab-bda8-e0cd0ee4fcd8/image.png" alt=""> -&gt; the displacement vector at point (u,v) in frame t.</p>
<ul>
<li><p>Motion is represented by stacking the flow channels of L consecutive frames,
giving 2L input channels; a small sketch follows below.</p>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/ad570af2-fb3e-46bc-8c5e-b05992611b41/image.png" alt=""></p>
</li>
</ul>
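<p>The sketch below shows what this stacking amounts to in terms of tensor shapes; the numbers are illustrative.</p>
<pre><code class="language-python"># Hedged sketch: stacking horizontal and vertical flow of L consecutive frames into 2L channels.
import torch

L, H, W = 10, 224, 224
flow = torch.randn(L, 2, H, W)                 # per frame: horizontal (dx) and vertical (dy) flow
temporal_input = flow.reshape(2 * L, H, W)     # 2L-channel input for the temporal-stream CNN
</code></pre>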
<h4 id="trajectory-stacking">Trajectory stacking</h4>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/543a1b94-d3a3-404b-9a34-d53b52eec197/image.png" alt=""></p>
<ul>
<li>(1): stores the displacement.</li>
<li>(2): stores vectors sampled along the trajectory.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/65ee5399-d47a-44a4-9f70-de815ed40adb/image.png" alt=""></li>
</ul>
<h4 id="bi-directional-optical-flow">Bi-directional optical flow</h4>
<ul>
<li>Computes an additional set of displacement fields in the opposite direction.</li>
<li>The input is constructed by stacking the forward flow of L/2 frames and the backward flow of L/2 frames.</li>
</ul>
<h4 id="mean-flow-subtraction">Mean flow subtraction</h4>
<ul>
<li>Camera movement can cause a dominant global displacement.</li>
<li>For simplicity, the mean flow vector is subtracted.</li>
</ul>
<h1 id="4-multi-task-learning">4. Multi-task learning</h1>
<ul>
<li>It is hard to simply concatenate two different datasets for video learning.</li>
<li>There are two softmax layers (one for HMDB-51 and one for UCF-101).</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[GRAF: Generative Radiance Fields
for 3D-Aware Image Synthesis Review]]></title>
            <link>https://velog.io/@lake_park_0706/GRAF-Generative-Radiance-Fieldsfor-3D-Aware-Image-Synthesis-Review</link>
            <guid>https://velog.io/@lake_park_0706/GRAF-Generative-Radiance-Fieldsfor-3D-Aware-Image-Synthesis-Review</guid>
            <pubDate>Fri, 16 Sep 2022 07:46:56 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/e5967bb8-5a22-4b56-8325-10c19e140180/image.png" alt=""></p>
<h1 id="1-abstract--introduction">1. Abstract &amp; Introduction</h1>
<ul>
<li>Conventional GANs have limitations in synthesizing 3D-aware images. This paper resolves that limitation by using radiance fields.</li>
<li>3D image synthesis generates views as if the object were seen from other positions, based on the pose of the photo. In the real world, however, obtaining the position and angle of an object is difficult, and other works tried to solve this with 2D supervision only. </li>
<li>Accordingly, the paper has three goals:<ol>
<li>a generative model for radiance fields that can produce 3D-aware representations even for images without pose information;</li>
<li>a patch-based discriminator;</li>
<li>systematic experiments on real-world data.</li>
</ol>
</li>
</ul>
<h1 id="3-method">3. Method</h1>
<h2 id="31-neural-radiance-fields">3.1 Neural Radiance Fields</h2>
<ul>
<li>A radiance field is a mapping that continuously assigns RGB values to a 3D scene depending on the 2D viewing direction. NeRF uses the 3D camera position and the 2D viewing direction as the feature representation. To make training work well even on data with complex appearance, positional encoding is applied: the encoded camera position and direction are fed to an MLP that learns the color and density (opacity) of the image. The paper shows that this positional encoding is effective, and because color varies with viewing direction much more smoothly than with camera position, fewer frequency components are used when encoding the direction (a sketch of this encoding follows after the list below).</li>
<li>Equations (1) and (2) show the rough structure of NeRF; as can be seen in equation (2), the camera position and direction are encoded and then used to learn the color and density of the output image.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/cb9f089f-0d48-4365-ac57-e2c90da55375/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/5ad8f0aa-cdb5-4519-8472-f5a5bf1f64b3/image.png" alt=""></li>
</ul>
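<p>A small sketch of the positional encoding mentioned above is given below: each coordinate is mapped to sines and cosines at increasing frequencies, with fewer frequency components used for the viewing direction than for the position. The number of frequencies is an illustrative choice.</p>
<pre><code class="language-python"># Hedged sketch of the positional encoding gamma(p) used by NeRF/GRAF.
import math
import torch

def positional_encoding(p, num_freqs):
    """p: (..., D) coordinates; returns (..., D * 2 * num_freqs) encoded features."""
    freqs = (2.0 ** torch.arange(num_freqs)) * math.pi
    angles = p[..., None] * freqs              # (..., D, num_freqs)
    enc = torch.cat([torch.sin(angles), torch.cos(angles)], dim=-1)
    return enc.flatten(-2)

xyz_enc = positional_encoding(torch.randn(1024, 3), num_freqs=10)   # sample position
dir_enc = positional_encoding(torch.randn(1024, 3), num_freqs=4)    # viewing direction (fewer components)
</code></pre>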
<h2 id="32-generative-radiance-fields">3.2 Generative Radiance Fields</h2>
<ul>
<li><p>NeRF has the drawback that the position and direction of each image are required for training. This paper therefore attempts image synthesis without position and direction information, using the generative and adversarial structure of a GAN.</p>
</li>
<li><p><code>Figure 2</code> illustrates the overall structure of GRAF. The generator $$G_θ$$ takes the camera matrix $$K$$, camera pose ξ, a 2D sampling pattern $$v$$, and shape/appearance codes $$z_a, z_s$$ as input and outputs an image patch $$P&#39;$$. The discriminator then compares a real image patch $$P$$ with this output. Training uses image patches rather than full images because full-image training would be too costly.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/260aa881-a39c-4680-9e19-f353d8373fe2/image.png" alt=""></p>
</li>
</ul>
<h3 id="321-ray-sampling">3.2.1 Ray Sampling:</h3>
<ul>
<li><img src="https://velog.velcdn.com/images/lake_park_0706/post/fdfd2384-01d4-4405-9ffa-79e72cb1e816/image.png" alt=""></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Momentum Contrast for Unsupervised Visual Representation Learning(MoCo) Review]]></title>
            <link>https://velog.io/@lake_park_0706/Momentum-Contrast-for-Unsupervised-Visual-Representation-LearningMoCo-Review</link>
            <guid>https://velog.io/@lake_park_0706/Momentum-Contrast-for-Unsupervised-Visual-Representation-LearningMoCo-Review</guid>
            <pubDate>Wed, 07 Sep 2022 03:01:31 GMT</pubDate>
            <description><![CDATA[<h1 id="1-abstract--introduction">1. Abstract &amp; Introduction</h1>
<ul>
<li>This paper performs unsupervised learning on images with a dynamic dictionary (a queue) and a contrastive loss. The encoder is trained so that it can look up the dynamic dictionary: a query is pulled close to its matching key and pushed away from the other keys, and training minimizes the contrastive loss.</li>
<li>The goal of MoCo is to build a large and consistent dictionary. The dictionary is a queue: the representations of the current mini-batch are enqueued and the oldest representations are dequeued. A momentum update is also used to improve the consistency of the keys in the queue.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/cb3194fa-2e0c-410f-b1d3-9f754718e6b7/image.png" alt=""></li>
<li>MoCo uses an instance discrimination task as its pretext task. </li>
<li>The goal of unsupervised learning is to learn pre-trained representations that achieve high performance on down-stream tasks. The paper evaluates seven down-stream tasks and reports good performance.</li>
</ul>
<blockquote>
<p>What is contrastive loss?</p>
</blockquote>
<ul>
<li><p>For a pair of similar images (a positive pair) $$x_p, x_q$$, the two should be trained to move closer together. If the loss grows with the distance, training will reduce that distance, i.e. the loss for ($$x_p, x_q$$) is $$||x_p - x_q||^2$$.</p>
</li>
<li><p>Conversely, for a negative pair $$x_n, x_q$$ with little similarity, the two should be pushed apart. A margin is introduced as the minimum distance a negative pair should keep, so the loss for ($$x_n, x_q$$) is $$\max(0, m^2 - ||x_n - x_q||^2)$$, where $$m$$ is the margin.</p>
</li>
<li><p>Combining the two, the contrastive loss is
$$Loss(x_i, x_j, y) = y\,||x_i - x_j||^2 + (1-y)\max(0, m^2 - ||x_i - x_j||^2)$$
As a result, positive pairs are pulled close together and negative pairs are pushed at least the margin apart.</p>
<h1 id="3-method">3. Method</h1>
<h2 id="31-contrastive-learning-as-dictionary-look-up">3.1) Contrastive Learning as Dictionary Look-up</h2>
</li>
<li><p>A contrastive loss function of the form of <code>Equation (1)</code> is used.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/0db1decc-18fc-4d0a-bd6b-7c7ebc68e9e2/image.png" alt=""></p>
</li>
<li><p>Tau is a temperature hyper-parameter, and the inputs to the loss function can be not only images but also patches or sets of patches that form a context.</p>
</li>
</ul>
<h2 id="32-momentum-contrast">3.2) Momentum Contrast</h2>
<ul>
<li>The paper uses a dynamic dictionary into which keys are continuously fed; these keys are randomly sampled from the data.</li>
<li>Two hypotheses underlie this design: first, a large dictionary containing many negative samples makes it possible to learn good features; second, even while keys keep being added, the dictionary keys should stay as consistent as possible.</li>
</ul>
<h4 id="dictionary-as-a-queue">Dictionary as a queue</h4>
<ul>
<li>The core of MoCo is maintaining the dictionary as a queue. With a queue, previously encoded keys can be stored and reused during training, and the dictionary size can be adjusted flexibly by changing the queue size.</li>
<li>Another property of the queue is that the oldest mini-batch is removed at every update. Since this evicts the stale keys that differ most from the newly added ones, it plays an important role in maintaining consistency.</li>
</ul>
<h4 id="momentum-update">Momentum update</h4>
<ul>
<li>With a large dictionary, back-propagating through every key becomes a problem simply because there are too many keys. A naive fix is to copy the query encoder and use it as the key encoder, but this performed very poorly. The paper attributes this to the query encoder changing very rapidly, which makes the key representations change just as fast, and addresses it with a momentum update. Updating the weights of the key encoder slowly in this way keeps the keys consistent, and a momentum of 0.999 is reported to work better than 0.9 (see the sketch after this list).</li>
</ul>
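<p>A minimal sketch of the momentum update and the queue maintenance described above, assuming the query and key encoders share the same architecture; the variable names and the queue implementation are illustrative rather than the official code.</p>
<pre><code class="language-python">import torch

@torch.no_grad()
def momentum_update(encoder_q, encoder_k, m=0.999):
    # The key encoder follows the query encoder as a slowly moving average.
    for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
        p_k.data.mul_(m).add_(p_q.data, alpha=1.0 - m)

@torch.no_grad()
def update_queue(queue, new_keys, max_size=65536):
    # Enqueue the keys of the current mini-batch and drop the oldest ones.
    queue = torch.cat([queue, new_keys], dim=0)
    return queue[-max_size:]
</code></pre>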
<h4 id="relations-to-previous-mechanisms">Relations to previous mechanisms</h4>
<ul>
<li>This part compares MoCo with two mechanisms from prior work.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d55241ff-6e9f-4b4f-86b1-7f6121c5e2c2/image.png" alt=""></li>
<li>The first is the end-to-end mechanism, which uses the samples in the current mini-batch as the dictionary. Its drawback is that it is hard to use when a large dictionary is desired, and hard to combine with pretext tasks such as local position prediction.</li>
<li>The second is the memory bank mechanism. Since it also avoids back-propagating through the keys, it can store many samples in the dictionary. However, each representation in the bank is updated only when its sample is seen, so the dictionary also contains representations that were last updated long ago, which hurts consistency.</li>
</ul>
<h2 id="33-pretext-task">3.3) Pretext Task</h2>
<ul>
<li>The pretext task used in this paper is instance discrimination: two augmented views of the same image form a positive pair, while augmented views of different images form negative pairs. Within a mini-batch, queries go through the query encoder, while keys go through the momentum encoder and are pushed into the queue.<h4 id="technical-details">Technical details</h4>
</li>
<li>A ResNet is used as the encoder; its output vector is L2-normalized and serves as the query or the key. The temperature in <code>Equation (1)</code> is set to 0.07.</li>
</ul>
<h4 id="shuffling-bn">Shuffling BN</h4>
<ul>
<li>The encoders used in this paper ($$f_q, f_k$$) contain batch normalization. In this setting, however, the paper reports that batch normalization hinders learning good representations, since information can leak across samples through the batch statistics.</li>
<li>To address this, the paper uses shuffling BN, which shuffles the sample order across GPUs before computing the batch statistics of the key encoder.</li>
<li>(Left on hold for now; I need to study batch normalization further ㅠㅠ)</li>
</ul>
<blockquote>
<p>Reference</p>
</blockquote>
<ul>
<li><a href="https://arxiv.org/abs/1911.05722">https://arxiv.org/abs/1911.05722</a></li>
<li><a href="https://89douner.tistory.com/334">https://89douner.tistory.com/334</a></li>
<li><a href="https://gaussian37.github.io/dl-concept-batchnorm/">https://gaussian37.github.io/dl-concept-batchnorm/</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Unsupervised Feature Learning via Non-Parametric Instance Discrimination(NPID) Review]]></title>
            <link>https://velog.io/@lake_park_0706/Unsupervised-Feature-Learning-via-Non-Parametric-Instance-DiscriminationNPID-Review</link>
            <guid>https://velog.io/@lake_park_0706/Unsupervised-Feature-Learning-via-Non-Parametric-Instance-DiscriminationNPID-Review</guid>
            <pubDate>Fri, 02 Sep 2022 07:35:04 GMT</pubDate>
            <description><![CDATA[<h1 id="1-abstract--introduction">1. Abstract &amp; Introduction</h1>
<ul>
<li><p>The motivation of this paper comes from object recognition on ImageNet.
As shown in <code>Figure 1</code>, looking at the top-5 classification responses for a leopard image, visually similar animals such as jaguars and cheetahs receive high softmax scores, whereas images that look nothing like a leopard, such as a chest of drawers or a boat, receive very low scores. The authors take this as evidence that the model learns visual similarity among images, and they try to exploit it: similar images should learn their mutual similarity, while images of different kinds should be well separated.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/c2d3d573-2b36-4b17-bee1-c845479a3f8f/image.png" alt=""></p>
</li>
<li><p>However, there is a problem in running this experiment on ImageNet: the number of classes is far too large, because the number of classes equals the number of training instances. To deal with this, the paper uses noise-contrastive estimation and proximal regularization.</p>
</li>
<li><p>The method is non-parametric at both training and test time: the feature of each image is stored in a memory bank, and testing is based on k-nearest neighbours (kNN).</p>
</li>
<li><p>As a result, the authors report a model that performs well while remaining fairly lightweight.</p>
</li>
</ul>
<h1 id="3-approach">3. Approach</h1>
<ul>
<li><p>The goal of this paper is to learn, without human annotation, a feature mapping $$v = f_θ(x)$$ (where $$x$$ is an input image and $$v$$ is its feature). The induced metric $$d_θ(x, y) = ||f_θ(x)-f_θ(y)||$$ should then be small when $$x$$ and $$y$$ are visually similar.</p>
</li>
<li><p>In addition, each image is treated as its own class (instance), and the model is trained to discriminate (or compare) individual images.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d6b36c4a-eceb-4082-9b39-dfd91e7c2dc6/image.png" alt=""></p>
</li>
</ul>
<h2 id="31-non-parametric-softmax-classifier">3.1) Non-Parametric Softmax Classifier</h2>
<h4 id="parametric-classifier">Parametric Classifier.</h4>
<ul>
<li>With n training images, we have $$v_i = f_θ(x_i)$$ for each of the n inputs. The probability that an image x is classified as the i-th class can then be written as <code>Equation (1)</code>.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/c54b0619-a07c-4643-9b4c-5c2a4a48fe6d/image.png" alt=""></li>
<li>However, this formulation computes the probability through class weight vectors, so it cannot directly compare different images against each other, which was the original goal. </li>
</ul>
<h4 id="non-parametric-classifier">Non-parametric Classifier.</h4>
<ul>
<li>To fix this, a non-parametric classifier is used: as in <code>Equation (2)</code>, the weights ($$w$$) are replaced by the features $$v$$ themselves.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/6b0ff64d-8703-49ad-958d-9429d95425b3/image.png" alt=""></li>
<li>To make this work, the image features extracted by the CNN are L2-normalized, so every $$v_i$$ has unit norm (see <code>Figure 2</code>). $$τ$$ is a temperature parameter that controls how concentrated the distribution of the vectors is.</li>
<li>Training then aims either to maximize the joint probability,
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d9812935-b6b9-4229-9109-dc375c572dc3/image.png" alt="">
or, equivalently, to minimize its negative log-likelihood, <code>Equation (3)</code>.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/22df6158-1a7a-4d35-b5b9-78aaeb3d1841/image.png" alt=""></li>
</ul>
<h4 id="learning-with-a-memory-bank">Learning with A Memory Bank</h4>
<ul>
<li><p>Computing $$P(i|v)$$ requires the feature vectors of all images. To avoid the burden of re-computing every representation at every step, the paper introduces a memory bank.</p>
</li>
<li><p>Let $$V = {v_j}$$ be the memory bank and $$f_i = f_θ(x_i)$$. During training, $$f_i$$ and the parameters $$θ$$ of $$f_θ$$ are updated via SGD, and the freshly computed $$f_i$$ then replaces the corresponding entry $$v_i$$ in $$V$$.</p>
</li>
<li><p>Training this way works directly with the stored feature vectors, so there is no need to keep per-class weights and their gradients, and the model can learn well even on large datasets (a small sketch follows this list).</p>
</li>
</ul>
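<p>A minimal sketch of the non-parametric softmax of <code>Equation (2)</code> together with the memory-bank replacement described above; the bank shape, temperature, and direct-replacement update follow this summary and are only illustrative.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def instance_probabilities(f, memory_bank, tau=0.07):
    # f: (B, D) L2-normalized features of the current batch.
    # memory_bank: (N, D) L2-normalized features of all N training instances.
    logits = f @ memory_bank.t() / tau      # similarity of each feature to every stored instance
    return F.softmax(logits, dim=1)         # Equation (2): P(i | v) over the n instances

def update_memory_bank(memory_bank, indices, f):
    # Replace the stored vectors of the images in the batch with their fresh features.
    memory_bank[indices] = F.normalize(f, dim=1)
    return memory_bank
</code></pre>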
<h2 id="32-noise-contrastive-estimation">3.2) Noise-Contrastive Estimation</h2>
<ul>
<li><p>Computing the full softmax of <code>Equation (2)</code> is prohibitively expensive on a very large dataset, since the denominator sums over every instance. To get around this, the paper uses Noise-Contrastive Estimation (NCE).
(I did not fully understand the details below ㅠㅠ)</p>
</li>
<li><p>Assume the noise distribution is uniform, $$P_n = 1/n$$, and that noise samples are m times more frequent than data samples. The posterior probability that a sample $$i$$ with feature $$v$$ came from the data distribution is then <code>Equation (6)</code>, and since the training objective is to minimize the negative log-posterior over data and noise samples, this amounts to minimizing <code>Equation (7)</code>. Here $$v$$ is the feature corresponding to $$x_i$$ and $$v&#39;$$ is the feature of another image; both $$v$$ and $$v&#39;$$ are vectors stored in the memory bank V.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/f360cf66-44f1-4f80-bfe9-5e68d0313602/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/6bcf733f-634d-462a-a593-ff2e8eb76d34/image.png" alt=""></p>
</li>
<li><p>Computing the normalizing constant Z in <code>Equation (4)</code> is also expensive, so it is approximated with Monte Carlo estimation, where {$$j_k$$} denotes a random subset of indices sampled from the data (a rough sketch of the NCE objective follows this list).
<img src="https://velog.velcdn.com/images/lake_park_0706/post/7ad9dd5f-6d4b-47b5-bbdc-7292d410a373/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/351e4e2a-1ac3-4771-aadd-b5b5679e5952/image.png" alt=""></p>
</li>
</ul>
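<p>A rough sketch of the NCE posterior and objective under the assumptions above (uniform noise $$P_n = 1/n$$ and m noise samples per data sample); this follows the generic NCE formulation rather than the exact notation of the paper, so treat it only as an illustration.</p>
<pre><code class="language-python">import torch

def nce_posterior(p_model, n, m):
    # Probability that a sample came from the data rather than the uniform noise P_n = 1/n.
    p_noise = 1.0 / n
    return p_model / (p_model + m * p_noise)

def nce_objective(p_data, p_noise_samples, n, m):
    # p_data: model probabilities P(i|v) for the true feature of each instance, shape (B,)
    # p_noise_samples: model probabilities P(i|v') for m noise features per instance, shape (B, m)
    h_data = nce_posterior(p_data, n, m)
    h_noise = nce_posterior(p_noise_samples, n, m)
    # Minimize the negative log-posterior over data and noise samples.
    return -(torch.log(h_data).mean() + m * torch.log(1.0 - h_noise).mean())
</code></pre>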
<h2 id="33-proximal-regularization">3.3) Proximal Regularization</h2>
<ul>
<li>In this training scheme, each sample is treated as its own class rather than many samples sharing one class, which makes the training process quite unstable. Proximal regularization is introduced to mitigate this. </li>
<li>The loss with proximal regularization is <code>Equation (9)</code>: on top of the original loss, it adds the difference between the memory-bank entry of the current iteration t and that of the previous iteration t-1, i.e. a penalty of the form $$λ||v_i^{(t)} - v_i^{(t-1)}||^2$$. This regularization term gradually shrinks as training converges.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/43e7feb8-72a0-4cce-9595-c97d8dad37b5/image.png" alt=""></li>
<li><code>Equation (10)</code> is the objective with this regularization applied, and <code>Figure 3</code> shows that it greatly reduces how much the training curve oscillates (a tiny sketch of the penalty follows this list).
<img src="https://velog.velcdn.com/images/lake_park_0706/post/265544d2-4ac8-4f89-a2d5-0bf6ba1ef545/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/0cb8cb2f-d556-4b6e-9a89-fd1211461065/image.png" alt=""></li>
</ul>
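<p>A tiny sketch of the extra proximal penalty, assuming the base contrastive objective has already been computed; the weight $$λ$$ and the names here are illustrative.</p>
<pre><code class="language-python">import torch

def loss_with_proximal_term(base_loss, v_current, v_previous, lam=30.0):
    # Penalize the difference between the feature of instance i at iteration t
    # and its feature at iteration t-1 (the smoothing idea behind Equations (9)-(10)).
    penalty = lam * ((v_current - v_previous) ** 2).sum(dim=1).mean()
    return base_loss + penalty
</code></pre>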
<h2 id="34-weighted-k-nearest-neighbor-classifier">3.4) Weighted k-Nearest Neighbor Classifier</h2>
<ul>
<li>This part explains how the class of a test image is decided. Given a test image, its feature is computed first; cosine similarities between this feature and the vectors in the memory bank are then computed, the k most similar vectors are retrieved, and the class of the test image is decided by weighted voting over those k neighbours (see the sketch below).</li>
</ul>
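<p>A minimal sketch of this weighted kNN classification, assuming an L2-normalized memory bank and exponential weights of the form $$exp(sim/τ)$$; the particular k, τ, and names are illustrative.</p>
<pre><code class="language-python">import torch

def weighted_knn_predict(feat, memory_bank, bank_labels, num_classes, k=200, tau=0.07):
    # feat: (D,) L2-normalized feature of one test image.
    # memory_bank: (N, D) stored training features, bank_labels: (N,) their class labels.
    sims = memory_bank @ feat                 # cosine similarities to every training instance
    top_sims, top_idx = sims.topk(k)          # the k nearest neighbours
    weights = torch.exp(top_sims / tau)       # similarity-based voting weights
    votes = torch.zeros(num_classes)
    votes.index_add_(0, bank_labels[top_idx], weights)
    return votes.argmax().item()
</code></pre>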
<blockquote>
<p>Reference</p>
</blockquote>
<ul>
<li><a href="https://arxiv.org/abs/1805.01978">https://arxiv.org/abs/1805.01978</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[SimCLR review]]></title>
            <link>https://velog.io/@lake_park_0706/SimCLR-review</link>
            <guid>https://velog.io/@lake_park_0706/SimCLR-review</guid>
            <pubDate>Mon, 29 Aug 2022 12:58:29 GMT</pubDate>
            <description><![CDATA[<h1 id="1-abstract--introduction">1. Abstract &amp; Introduction</h1>
<blockquote>
<p>What is Contrastive Learning?</p>
</blockquote>
<ul>
<li><p>Contrastive learning is a form of self-supervised learning that creates its own inputs and labels from an unlabeled dataset and solves a pretext task with them. </p>
</li>
<li><p>Contrastive learning pulls the representations of crops taken from the same image together and pushes the representations of crops from different images apart. As the figure shows, crops from the same image are treated as positive pairs and crops from different images as negative pairs.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/97d5a495-ec85-48c1-9aab-1d644845dc3b/image.png" alt=""></p>
</li>
<li><p>The paper highlights four ingredients that improve the performance of contrastive learning:</p>
<ul>
<li>composition of multiple data augmentations </li>
<li>learnable nonlinear transformation</li>
<li>contrastive cross entropy loss</li>
<li>larger batch size &amp; longer training</li>
</ul>
</li>
<li><p>Combining these, the paper reports the following strong results.<br><img src="https://velog.velcdn.com/images/lake_park_0706/post/3424e785-ba22-478b-998a-6fa059245428/image.png" alt=""></p>
</li>
</ul>
<h1 id="2-method">2. Method</h1>
<h2 id="21-the-contrastive-learning-framework">2.1) The Contrastive Learning Framework</h2>
<p><img src="https://velog.velcdn.com/images/lake_park_0706/post/39d5cedc-7ca6-4345-be6a-b1f672c410e8/image.png" alt=""></p>
<ul>
<li><p>First, a stochastic data augmentation module transforms the input $$x$$ into two augmented views ($$x_i$$, $$x_j$$). Since both views come from the same $$x$$, they form a positive pair. The paper uses three augmentations: random cropping, random color distortion, and random Gaussian blur.</p>
</li>
<li><p>$$f(.)$$, applied on top of $$x_i$$ and $$x_j$$, is the base encoder that extracts the representation of the input; the paper uses a ResNet and takes the output after average pooling.</p>
</li>
<li><p>$$g(.)$$ is the projection stage whose output the contrastive loss is applied to; it is an MLP with one hidden layer and a ReLU non-linearity.</p>
</li>
<li><p>Loss Function</p>
<ul>
<li>The paper does not sample negative pairs explicitly. Instead, each of the N images in a batch is transformed into two views, and for a given positive pair the remaining 2(N-1) views among the 2N are treated as negatives.</li>
<li>As a result, the loss function for a positive pair (i, j) is the following.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d801945f-ff99-456d-ad6f-0150e70578f9/image.png" alt="">
The sim function is cosine similarity, measuring how similar two representations are. (My understanding is that this loss pulls views of the same image together and pushes views of different images apart; a sketch is given after this list.)</li>
</ul>
</li>
</ul>
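<p>A minimal sketch of this loss (NT-Xent) over a batch of 2N projections, assuming the two views of image n are stored at rows n and n+N; the masking trick and variable names are my own, not the official implementation.</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def nt_xent_loss(z, tau=0.5):
    # z: (2N, D) projections; rows i and i+N are the two views of the same image.
    n = z.shape[0] // 2
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / tau                       # pairwise cosine similarities scaled by temperature
    sim.fill_diagonal_(float("-inf"))           # never contrast a view with itself
    # For row i the positive is the other view of the same image.
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)
</code></pre>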
<h2 id="22-training-with-large-batch-size">2.2) Training with Large Batch Size</h2>
<ul>
<li>Instead of storing representations in a memory bank, the paper trains with batch sizes ranging from 256 to 8192. To stabilize training with such large batches, the LARS optimizer is used.</li>
</ul>
<h1 id="3-data-augmentation-for-contrastive-representation-learning">3. Data Augmentation for Contrastive Representation Learning</h1>
<h2 id="31-composition-of-data-augmentation-operation-is-crucial-for-learning-good-representations">3.1) Composition of data augmentation operation is crucial for learning good representations</h2>
<ul>
<li><p>To compare the effect of different data augmentations, the paper studies the following set of operations.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/bc7bb79a-21f5-4436-ad3e-1a1976b21f25/image.png" alt=""></p>
</li>
<li><p>(I am not sure I understood this part completely.) The paper uses ImageNet, whose images vary in size, so images are first cropped &amp; resized and then fed to the model. Augmentations are then applied inside the model, with the additional augmentation under study applied to only one branch of the pair. This hurts overall accuracy, but since the goal is to compare augmentation types rather than maximize performance, the experiment is still run this way.</p>
</li>
<li><p>Augmentations are applied as pairs of two operations, and the results are compared in the figure below. As the figure shows, the combination of random cropping and color distortion performs best.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/066e4175-04f3-4b2e-a03e-6acce3504561/image.png" alt=""></p>
</li>
<li><p>The reason this combination works well is that, with cropping alone, the crops of an image tend to share a very similar color distribution, as in Figure 6(a), so color statistics alone could give the task away. The paper interprets color distortion as increasing the color diversity of the crops and thereby improving what is learned, and Figure 6(b) indeed shows the resulting broader color distributions (a sketch of such an augmentation pipeline follows this list).
<img src="https://velog.velcdn.com/images/lake_park_0706/post/c526db1a-c629-42d1-a7bf-9e54dcebd748/image.png" alt=""></p>
</li>
</ul>
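<p>A small sketch of a crop-plus-color-distortion pipeline of the kind compared above, written with torchvision transforms; the particular strengths and probabilities here are illustrative defaults, not necessarily the exact settings of the paper.</p>
<pre><code class="language-python">import torchvision.transforms as T

def simclr_style_augmentation(size=224, s=1.0):
    # Random crop + color distortion (jitter or grayscale) + Gaussian blur.
    color_jitter = T.ColorJitter(0.8 * s, 0.8 * s, 0.8 * s, 0.2 * s)
    return T.Compose([
        T.RandomResizedCrop(size),
        T.RandomHorizontalFlip(),
        T.RandomApply([color_jitter], p=0.8),
        T.RandomGrayscale(p=0.2),
        T.GaussianBlur(kernel_size=23, sigma=(0.1, 2.0)),
        T.ToTensor(),
    ])
</code></pre>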
<h2 id="32-contrastive-learning-needs-stronger-data-augmentation-than-superivsed-learning">3.2) Contrastive learning needs stronger data augmentation than superivsed learning</h2>
<ul>
<li><p>The effect of the strength of color distortion on model performance is also examined. Surprisingly, the stronger the distortion, the better SimCLR performs, whereas the accuracy of supervised learning actually drops. This shows that an augmentation that does not help supervised learning can still benefit self-supervised learning.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/b22bf769-6726-459e-8bba-c2d9b06e7534/image.png" alt=""></p>
</li>
</ul>
<h1 id="4-architecture-for-encoder-and-head">4. Architecture for Encoder and Head</h1>
<h2 id="41--unsupervised-contrastive-learning-benefits-from-bigger-models">4.1) Unsupervised contrastive learning benefits from bigger models</h2>
<ul>
<li>The larger the model capacity (the more parameters), the better the performance.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/ebb94035-ced2-4c2c-afaa-d9b8d1ef0865/image.png" alt=""></li>
</ul>
<h2 id="42-a-nonlinear-projection-head-improves-the-representation-quality-of-the-layer-before-it">4.2) A nonlinear projection head improves the representation quality of the layer before it</h2>
<ul>
<li>This part discusses g(h), the projection stage from Figure 2. The projection can be linear, non-linear, or absent, and the three options are compared in Figure 8; as the figure shows, the non-linear projection performs best.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/7d74671b-f213-4532-935c-9371c60b0aad/image.png" alt=""></li>
<li><p>Moreover, the representation before the projection (the hidden layer h) performs better than the projected output, which means the hidden layer retains better representations. The paper conjectures that g must be trained to be invariant to the data transformations, and that information useful for downstream tasks is therefore removed during the projection.</p>
</li>
<li><p>An experiment is run to verify this conjecture, and Table 3 supports it.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/39948480-e140-4f20-8e71-075a886946d3/image.png" alt=""></p>
</li>
</ul>
<h1 id="5-loss-function-and-batch-size">5. Loss Function and Batch Size</h1>
<h2 id="51-normalized-cross-entropy-loss-with-adjustable">5.1) Normalized cross entropy loss with adjustable</h2>
<ul>
<li>This part explains why the NT-Xent loss is used. It is compared against the NT-Logistic and Margin Triplet losses, and NT-Xent performs best, thanks to its $$l_2$$ normalization and a well-tuned temperature. (There was further explanation here that I did not fully understand...ㅠㅠ)</li>
<li><code>Table 4</code> compares the NT-Xent loss with other loss functions, and <code>Table 5</code> summarizes model performance with and without $$l_2$$ normalization and across temperature values.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/3a7f5355-a0e7-4dbb-b85c-4d0d0145a706/image.png" alt="">
<img src="https://velog.velcdn.com/images/lake_park_0706/post/d4190a74-14b4-413a-b43d-6d3e45d29cf7/image.png" alt=""></li>
</ul>
<h2 id="52-contrastive-learning-benefits-from-larger-batch-sizes-and-longer-training">5.2) Contrastive learning benefits from larger batch sizes and longer training</h2>
<ul>
<li>This part summarizes how model performance varies with batch size. As the section title suggests, larger batch sizes improve performance, which is summarized in <code>Figure 9</code>.
<img src="https://velog.velcdn.com/images/lake_park_0706/post/132873ab-cac8-4724-9764-e48d9d206e3a/image.png" alt=""></li>
</ul>
<blockquote>
<p>Reference</p>
</blockquote>
<ul>
<li><a href="https://daeun-computer-uneasy.tistory.com/37">https://daeun-computer-uneasy.tistory.com/37</a></li>
<li><a href="https://hongl.tistory.com/87">https://hongl.tistory.com/87</a></li>
<li><a href="https://cool24151.tistory.com/78">https://cool24151.tistory.com/78</a></li>
</ul>
]]></description>
        </item>
    </channel>
</rss>