<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>bona-park.log</title>
        <link>https://velog.io/</link>
        <description>읏차 웃자</description>
        <lastBuildDate>Mon, 28 Nov 2022 00:58:06 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>bona-park.log</title>
            <url>https://velog.velcdn.com/images/bona-park/profile/97d2aa17-aa8f-4cf2-bbd2-146f0ea07f0d/image.jpg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. bona-park.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/bona-park" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Docker] Let me like you, Docker : part 1]]></title>
            <link>https://velog.io/@bona-park/Docker-Let-me-like-you-Docker-part-1</link>
            <guid>https://velog.io/@bona-park/Docker-Let-me-like-you-Docker-part-1</guid>
            <pubDate>Mon, 28 Nov 2022 00:58:06 GMT</pubDate>
            <description><![CDATA[<h3 id="problem">Problem</h3>
<p>A Python file (&quot;hw1.py&quot;) runs well on my laptop.
But what if it does not on my friend&#39;s?
➡ <em>ModuleNotFoundError: No module named &#39;xgboost&#39;</em>
➡ The module isn&#39;t installed on his laptop</p>
<p>Plus, what if there&#39;s someone who hasn&#39;t installed Python at all?
What if there&#39;s someone still on Python 2 ..?</p>
<p>It is a headache to configure your environment for different versions of software on a single machine. </p>
<h3 id="solve">Solve</h3>
<p><em><strong>&quot;Let&#39;s make our program run on every environment!&quot;</strong></em></p>
<hr>
<h2 id="✅-virtualization">✅ Virtualization</h2>
<p>Virtualization, as the name implies, is the creation of a virtual version of something, such as an OS, a server or a storage device.</p>
<p>It creates a simulated computing environment that is abstracted from the physical computing hardware. </p>
<p>The software then simulates the hardware functionality.</p>
<p>There are 2 types of virtualization. Let&#39;s see what the differences are.</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/0e594d76-9090-4b88-b6af-157ea2db9a04/image.png" alt=""></p>
<h3 id="1-virtual-machines">1. Virtual Machines</h3>
<p>✌ Hardware-level virtualization</p>
<p>A virtual machine is a system which acts exactly like a computer.</p>
<p>Each virtual machine requires its <strong>own underlying operating system</strong>, and then the hardware is virtualized. </p>
<p>A virtual machine takes up a lot of resources because each VM runs a virtual copy of all the hardware that the OS needs to run. </p>
<p><strong>Hypervisor</strong>: 
A layer that runs on the physical host and interacts with both the host machine and the VMs. 
It abstracts the host computer&#39;s resources for the VMs.
Thanks to the hypervisor, the hardware resources are virtualized and the VMs are isolated from each other.</p>
<h3 id="2-container">2. Container</h3>
<p>✌ OS-level virtualization</p>
<p>Containers are a layer of abstraction <strong>above</strong> both physical machines and VMs. They sit on top of them.</p>
<p>While a VM abstracts a complete machine, a container only abstracts an application and its dependencies. </p>
<p>While a VM has its own guest OS above the host OS, which makes the VM heavy, containers share the host OS.</p>
<hr>
<h2 id="✅-docker">✅ Docker</h2>
<h3 id="what-is-docker">What is Docker?</h3>
<p><img src="https://velog.velcdn.com/images/bona-park/post/5d822b9a-40c7-4e3d-b8af-e353bb3c8e18/image.png" alt=""></p>
<p>Docker is an <strong>OS-level virtualization software</strong>.
It is designed to make it easier for developers to develop, ship, and run applications by using containers. </p>
<h3 id="container">Container</h3>
<p>As we mentioned above, containers are <strong>isolated</strong> from one another and <strong>bundle</strong> their own software, libraries and configuration files.
They can communicate with each other through well-defined channels.</p>
<h4 id="❔-then-how-they-are-isolated">❔ Then how they are isolated?</h4>
<p>Thanks to 2 isolation techonologies on linux kernel.</p>
<p>1) Namespace: a feature for partitioning kernel resources 
2) Control group(cgroup):  a feature for limiting and isolating resource usage(CPU, memorym, network ..) of a collection of processes.
<img src="https://velog.velcdn.com/images/bona-park/post/b9a741f8-1f58-4c9c-bb2b-c786367fbdee/image.png" alt=""></p>
<h3 id="docker-image">Docker Image</h3>
<p>A Docker image is a read-only template containing instructions for creating a container.
A Docker <strong>container</strong> is then a running <strong>image instance</strong>. You can create many containers from the same image, each with its own unique data and state.
<img src="https://velog.velcdn.com/images/bona-park/post/177dfdb2-d589-4821-816d-1461d8487219/image.png" width="30%" height="30%">
(You can think of the left as an Image and the right as a Container! yummy Boongeobbang ;-))</p>
<p>(image sources)
<a href="https://en.wikipedia.org/wiki/Docker_(software)">https://en.wikipedia.org/wiki/Docker_(software)</a>
<a href="https://bikramat.medium.com/namespace-vs-cgroup-60c832c6b8c8">https://bikramat.medium.com/namespace-vs-cgroup-60c832c6b8c8</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Ubuntu] Brew command not found]]></title>
            <link>https://velog.io/@bona-park/Ubunut-Brew-command-not-found</link>
            <guid>https://velog.io/@bona-park/Ubunut-Brew-command-not-found</guid>
            <pubDate>Sun, 27 Nov 2022 00:41:24 GMT</pubDate>
            <description><![CDATA[<pre><code class="language-shell">eval $(/home/linuxbrew/.linuxbrew/bin/brew shellenv) </code></pre>
<p>The line above must be added at the end of your .bashrc file:</p>
<pre><code class="language-shell">vi ~/.bashrc
source ~/.bashrc</code></pre>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Attention is all you need]]></title>
            <link>https://velog.io/@bona-park/Attention-is-all-you-need</link>
            <guid>https://velog.io/@bona-park/Attention-is-all-you-need</guid>
            <pubDate>Thu, 17 Nov 2022 11:31:43 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/bona-park/post/370315ca-11bc-42b4-808f-10b25a74fe26/image.png" alt=""></p>
<h3 id="before">Before..</h3>
<ol>
<li>RNN</li>
<li>LSTM
: slow to train</li>
</ol>
<p>Can we parallelize sequential data?</p>
<h2 id="transformers">Transformers</h2>
<p>Input sequences can be transmitted <strong>in parallel</strong></p>
<p>No concept of time step</p>
<p>Pass all the words simultaneously and determine the word embedding simultaneously</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/91239c2a-b289-493b-ba85-b589449b0010/image.png" alt=""></p>
<p>(RNN passes input word one after another)</p>
<h3 id="input-embedding">Input Embedding</h3>
<p><img src="https://velog.velcdn.com/images/bona-park/post/cd0b7e77-637d-4307-a9fb-a6494cc13b46/image.png" alt=""></p>
<p>In the embedding space, words with similar meanings are located close to each other.
There are already pretrained embedding spaces.</p>
<p>But the same word in a different sentence can have a different meaning!</p>
<h3 id="positional-encoder">Positional Encoder</h3>
<p>: a vector that gives context <strong>information based on the position of a word in a sentence</strong></p>
<p>We can use sin/cos functions to generate the PE, but any reasonable function is OK
<img src="https://velog.velcdn.com/images/bona-park/post/60fff74c-35c2-4c14-b899-5e671858a803/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/9f18a7b2-bb2b-4035-9216-f9320a522370/image.png" alt=""></p>
<hr>
<h2 id="structure-of-encoder">Structure of Encoder</h2>
<h3 id="1-attention">1. Attention</h3>
<p>: What part of the input should we focus on?</p>
<p>How relevant is the word &#39;The&#39; to the other words (big, red, dog) in the same sentence?
<img src="https://velog.velcdn.com/images/bona-park/post/c3d5188e-f3a2-4519-960b-7f968342baad/image.png" alt=""></p>
<p>Attention vectors (of English) contain the <strong>contextual relationships between the words</strong> in the sentence. </p>
<h3 id="2-feed-forward">2. Feed Forward</h3>
<p>A simple feed-forward network is applied to every one of the attention vectors! 
<img src="https://velog.velcdn.com/images/bona-park/post/3073a473-797c-44a7-85d1-60dee8dde2a9/image.png" alt=""></p>
<h4 id="problem-of-attention-vector">Problem of Attention vector</h4>
<p>Each attention vector focuses too much on its own word..
We want to know the interactions and relationships between words!</p>
<p>➡ <strong>Use multiple attention vectors</strong> for the same word and average them: Multi-Head Attention Block
(Q. vectors from different sentences..?)</p>
<p>Attention vectors are fed to the Feed Forward Network one vector at a time 
The attention vectors of different words are <strong>independent</strong> of each other 
➡ the Feed Forward Network can be parallelized! 
➡ All words can be passed to the encoder block at the same time, and the output is a set of encoded vectors for every word
<img src="https://velog.velcdn.com/images/bona-park/post/ab12a845-d55f-4651-b225-7ae6eef9fc96/image.png" alt=""></p>
<hr>
<h2 id="structure-of-decoder">Structure of Decoder</h2>
<p>In an English -&gt; French task, we feed the output French to the decoder</p>
<h3 id="1-self-attention-block">1. Self attention block</h3>
<p>Generates attention vectors (of French) showing how much each word is related to the others
<img src="https://velog.velcdn.com/images/bona-park/post/ffed0005-23fa-4636-aa95-1e94218adc45/image.png" alt=""></p>
<p>Attention vectors from both the Encoder (English) and the Decoder (French) are passed to another <strong>Encoder-Decoder Attention block</strong>.
➡ output of this block: attention vectors over all words (English + French) 
➡ Each attention vector shows the relationship with the other words across both languages
➡ English-to-French word <strong>mapping</strong> happens!
<img src="https://velog.velcdn.com/images/bona-park/post/59049a25-d23b-48f1-9b47-a89346648bac/image.png" alt=""></p>
<h4 id="multi-headed-attention">Multi-headed Attention</h4>
<p>If we used all the words in the French sentence, there&#39;d be no learning, just spitting out the next word
➡ <strong>mask the input</strong>: each position observes only the previous words and itself
<img src="https://velog.velcdn.com/images/bona-park/post/cba1f00b-ca34-4d97-8771-d13b79ef3e16/image.png" alt=""></p>
<h4 id="single-headed-attention-vs-multi-headed-attention">Single-headed attention vs Multi-headed attention</h4>
<h5 id="1-single-headed-attention">1) Single-headed attention</h5>
<p>V, K, Q: abstract vectors that extract different components of the input words
We have V, K, Q vectors for every single word
➡ create an attention vector for every word using V, K, Q
<img src="https://velog.velcdn.com/images/bona-park/post/631640ae-a02d-42da-8c1d-5d0967fe3551/image.png" alt=""></p>
<h5 id="2-multi-headed-attention">2) Multi-headed attention</h5>
<p>Use multiple weight matrices (Wv, Wk, Wq) 
➡ multiple attention vectors for every word
➡ another weight matrix (Wz) combines them
➡ now the feed-forward NN can be fed only one attention vector per word
<img src="https://velog.velcdn.com/images/bona-park/post/7198cdb9-7090-45da-9eab-fe5649839f67/image.png" alt=""></p>
<h3 id="2-feed-forward-unit">2. Feed forward unit</h3>
<p>Pass each attention vector to a feed-forward unit</p>
<h3 id="3-linear-layer">3. Linear layer</h3>
<p>: another Feed Forward Layer</p>
<p>Used to expand the dimension to the number of words in the French vocabulary</p>
<h3 id="4-softmax-layer">4. Softmax layer</h3>
<p>Transforms it into a probability distribution</p>
<p>Output: The word with the highest probability to come next
<img src="https://velog.velcdn.com/images/bona-park/post/6cf1dc97-0d5b-4d62-b5c9-1c5794faefe7/image.png" alt=""></p>
<hr>
<h3 id="codes">Codes</h3>
<p>Reference to <a href="https://github.com/hyunwoongko/transformer">https://github.com/hyunwoongko/transformer</a></p>
<blockquote>
<p>Scaled Dot-Product Attention
<img src="https://velog.velcdn.com/images/bona-park/post/db8c4c73-bc54-4bd8-b981-78f9fca298a0/image.png" alt=""></p>
</blockquote>
<pre><code class="language-python">class ScaleDotProductAttention(nn.Module):
    &quot;&quot;&quot;
    compute scale dot product attention

    Query : given sentence that we focused on (decoder)
    Key : every sentence to check relationship with Qeury(encoder)
    Value : every sentence same with Key (encoder)
    &quot;&quot;&quot;

    def __init__(self):
        super(ScaleDotProductAttention, self).__init__()
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, q, k, v, mask=None, e=1e-12):
        # input is 4 dimension tensor
        # [batch_size, head, length, d_tensor]
        batch_size, head, length, d_tensor = k.size()

        # 1. dot product Query with Key^T to compute similarity
        k_t = k.transpose(2, 3)  # transpose
        score = (q @ k_t) / math.sqrt(d_tensor)  # scaled dot product

        # 2. apply masking (opt)
        if mask is not None:
            # use a large negative value so masked positions get ~0 weight after softmax
            score = score.masked_fill(mask == 0, -10000)

        # 3. pass them softmax to make [0, 1] range
        score = self.softmax(score)

        # 4. multiply with Value
        v = score @ v

        return v, score</code></pre>
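<p>A quick shape check of how this module might be called (the tensor sizes below are just an example):</p>
<pre><code class="language-python"># batch of 2 sentences, 8 heads, 10 tokens, 64 dimensions per head (example sizes)
q = k = v = torch.randn(2, 8, 10, 64)
attention = ScaleDotProductAttention()
out, score = attention(q, k, v)   # out: [2, 8, 10, 64], score: [2, 8, 10, 10]</code></pre>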
<blockquote>
<p>Multi-head Attention 
<img src="https://velog.velcdn.com/images/bona-park/post/738c7575-2b8e-4b8e-a089-3db5ca45f032/image.png" alt=""></p>
</blockquote>
<pre><code class="language-python">  class MultiHeadAttention(nn.Module):

      def __init__(self, d_model, n_head):
          super(MultiHeadAttention, self).__init__()
          self.n_head = n_head
          self.attention = ScaleDotProductAttention()
          self.w_q = nn.Linear(d_model, d_model)
          self.w_k = nn.Linear(d_model, d_model)
          self.w_v = nn.Linear(d_model, d_model)
          self.w_concat = nn.Linear(d_model, d_model)

      def forward(self, q, k, v, mask=None):
          # 1. dot product with weight matrices
          q, k, v = self.w_q(q), self.w_k(k), self.w_v(v)

          # 2. split tensor by number of heads
          q, k, v = self.split(q), self.split(k), self.split(v)

          # 3. do scale dot product to compute similarity
          out, attention = self.attention(q, k, v, mask=mask)

          # 4. concat and pass to linear layer
          out = self.concat(out)
          out = self.w_concat(out)

          # 5. visualize attention map
          # TODO : we should implement visualization

          return out

      def split(self, tensor):
          &quot;&quot;&quot;
          split tensor by number of head

          :param tensor: [batch_size, length, d_model]
          :return: [batch_size, head, length, d_tensor]
          &quot;&quot;&quot;
          batch_size, length, d_model = tensor.size()

          d_tensor = d_model // self.n_head
          tensor = tensor.view(batch_size, length, self.n_head, d_tensor).transpose(1, 2)
          # it is similar with group convolution (split by number of heads)

          return tensor

      def concat(self, tensor):
          &quot;&quot;&quot;
          inverse function of self.split(tensor : torch.Tensor)

          :param tensor: [batch_size, head, length, d_tensor]
          :return: [batch_size, length, d_model]
          &quot;&quot;&quot;
          batch_size, head, length, d_tensor = tensor.size()
          d_model = head * d_tensor

          tensor = tensor.transpose(1, 2).contiguous().view(batch_size, length, d_model)
          return tensor
</code></pre>
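<p>And a self-attention call on the multi-head module, again with example sizes:</p>
<pre><code class="language-python"># d_model=512 split across 8 heads, batch of 2 sentences of 10 tokens (example sizes)
x = torch.randn(2, 10, 512)
mha = MultiHeadAttention(d_model=512, n_head=8)
out = mha(q=x, k=x, v=x)          # self-attention output: [2, 10, 512]</code></pre>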
<p>REF
<a href="https://github.com/hyunwoongko/transformer">https://github.com/hyunwoongko/transformer</a>
<a href="https://www.youtube.com/watch?v=TQQlZhbC5ps">https://www.youtube.com/watch?v=TQQlZhbC5ps</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Algorithms] Dijkstra]]></title>
            <link>https://velog.io/@bona-park/Algorithms-%EB%8B%A4%EC%9D%B5%EC%8A%A4%ED%8A%B8%EB%9D%BCDijkstra</link>
            <guid>https://velog.io/@bona-park/Algorithms-%EB%8B%A4%EC%9D%B5%EC%8A%A4%ED%8A%B8%EB%9D%BCDijkstra</guid>
            <pubDate>Sat, 05 Nov 2022 02:20:28 GMT</pubDate>
            <description><![CDATA[<h2 id="다익스트라dijkstra">다익스트라(Dijkstra)</h2>
<ul>
<li><p>Problem settings</p>
<ol>
<li>Shortest path from Va -&gt; Vb</li>
<li>Shortest paths from Va -&gt; every node</li>
<li>Shortest paths from every node -&gt; every other node</li>
</ol>
</li>
<li><p>Classified as a greedy algorithm</p>
</li>
<li><p>At every step, select the node with the lowest cost and repeat the process</p>
</li>
<li><p>Steps</p>
<ol>
<li>Set the start node</li>
<li>Initialize the shortest-distance table</li>
<li>Among the current node&#39;s unvisited adjacent nodes, select the one with the shortest distance and mark it as visited.</li>
<li>Compute the cost and update the shortest-distance table</li>
<li>Repeat steps (3) and (4)</li>
</ol>
</li>
</ul>
<h3 id="다익스트라-구현-방법">다익스트라 구현 방법</h3>
<ul>
<li><p>Use the <strong>Heap</strong> data structure
<img src="https://velog.velcdn.com/images/bona-park/post/de74da86-723c-4f06-b69e-03679d5832a7/image.png" alt=""></p>
</li>
<li><h4 id="cf-힙heap">cf) 힙(Heap)</h4>
<ul>
<li>A data structure used to implement a <strong>Priority Queue</strong></li>
<li>Comes in two kinds: Max Heap and Min Heap<ul>
<li>To use a min heap as a max heap (or vice versa), negate the values before storing them (see the sketch after this list).</li>
</ul>
</li>
<li>Insertion and deletion time: <strong>O(logN)</strong>
<img src="https://velog.velcdn.com/images/bona-park/post/68d2f3a5-c6a2-4c78-8449-55a93e007370/image.png" alt=""><pre><code class="language-python">import heapq
heapq.heappush(myHeap, value)
heapq.heappop(myHeap)</code></pre>
</li>
</ul>
</li>
</ul>
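<p>A small sketch of the negation trick mentioned above, since Python&#39;s heapq only provides a min heap:</p>
<pre><code class="language-python">import heapq

max_heap = []
for value in [3, 1, 4, 1, 5]:
    heapq.heappush(max_heap, -value)   # store negated values
largest = -heapq.heappop(max_heap)     # 5: the largest value comes out first</code></pre>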
<h3 id="다익스트라-예제-백준-1238번-파티-파이썬">다익스트라 예제: 백준 1238번 파티 (파이썬)</h3>
<p><img src="https://velog.velcdn.com/images/bona-park/post/f056e172-abe5-46e0-8354-5c205097c918/image.png" alt=""></p>
<pre><code class="language-python">import heapq
import sys

def dijkstra(start):
    distance = [INF] * (V+1)
    q = [] # priority queue
           # holds the current node&#39;s neighbors, with shorter distances given higher priority

    distance[start] = 0 # the shortest distance to the start node is 0
    heapq.heappush(q, (0, start))

    while q:
        # pop the node with the shortest distance from the queue
        dist, now = heapq.heappop(q)
        # skip if this node was already processed (a shorter distance was already found)
        if distance[now] &lt; dist:
            continue
        # check the other adjacent nodes connected to the current node
        for next in graph[now]:
            cost = dist + next[1]
            # if going through the current node gives a shorter path to the next node
            if cost &lt; distance[next[0]]:
                distance[next[0]] = cost # update the distance table
                heapq.heappush(q, (cost, next[0]))
    return distance

input = sys.stdin.readline
INF = int(1e9) # &quot;infinity&quot; (one billion)

V, E, X = map(int, input().split())
graph = [[] for _ in range(V+1)]


# read all the edges
for _ in range(E):
    u, v, w = map(int, input().split())
    # there is an edge from u to v with weight w
    graph[u].append((v, w))


max_dist = 0
for i in range(1, V+1):
    # Dijkstra i -&gt; X
    d_0 = dijkstra(i)[X]
    # Dijkstra X -&gt; i
    d_1 = dijkstra(X)[i]
    max_dist = max(max_dist, d_0 + d_1)

print(max_dist)</code></pre>
<p>(ref)
<a href="https://freedeveloper.tistory.com/277">https://freedeveloper.tistory.com/277</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] CoCa: Contrastive Captioners are Image-Text
Foundation Models]]></title>
            <link>https://velog.io/@bona-park/Paper-Review-CoCa-Contrastive-Captioners-are-Image-TextFoundation-Models</link>
            <guid>https://velog.io/@bona-park/Paper-Review-CoCa-Contrastive-Captioners-are-Image-TextFoundation-Models</guid>
            <pubDate>Thu, 03 Nov 2022 10:15:55 GMT</pubDate>
            <description><![CDATA[<h2 id="coca-contrastive-captioners-are-image-text-foundation-models"><em>&quot;CoCa: Contrastive Captioners are Image-Text Foundation Models&quot;</em></h2>
<p><img src="https://velog.velcdn.com/images/bona-park/post/af69fe3b-4374-4a80-b3c7-72f7cbccb86b/image.png" alt=""></p>
<h2 id="history-of-vision-and-language-training">History of Vision and Language training</h2>
<ol>
<li>Vision pretraining</li>
</ol>
<ul>
<li>pretrain ConvNets or Transformers on large-scale data such as ImageNet or Instagram to solve visual recognition problems </li>
<li>these models only learn the vision modality -&gt; not applicable to joint reasoning tasks over both image and text inputs</li>
</ul>
<ol start="2">
<li>Vision-Language Pretraining(VLP)</li>
</ol>
<ul>
<li>Early work: relying on pretrained object detection modules such as Faster R-CNN to extract visual representations</li>
<li>Later work: <strong>unifying vision and language transformers</strong>, and training multimodal transformer from scratch</li>
</ul>
<ol start="3">
<li>Image-Text Foundation models</li>
</ol>
<ul>
<li>recent works subsume both vision and vision-language pretraining</li>
<li>adaptable for a wide range of vision and image-text benchmarks</li>
</ul>
<h2 id="previous-models-for-vision-and-vison-language-problems-3-training-paradigms">Previous models for vision and vison-language problems: 3 training paradigms</h2>
<p><img src="https://velog.velcdn.com/images/bona-park/post/e7f93f88-4b5a-410a-a35c-a1b6a0e3e3dc/image.png" alt=""></p>
<ol>
<li>Single-encoder models</li>
</ol>
<ul>
<li>provides generic visual representations that can be adapted for various downstream tasks including image and video understanding</li>
<li>rely heavily on <strong>image annotations as labeled vectors</strong></li>
<li>cannot deal with free-form human natural language</li>
</ul>
<ol start="2">
<li>Dual-encoder models</li>
</ol>
<ul>
<li>pretrains two parallel encoders with a contrastive loss on web-scale noisy image-text pairs</li>
<li><strong>encode textual embeddings to the same latent space</strong>, enabling new crossmodal alignment
capabilities such as zero-shot image classification and image-text retrieval</li>
<li>misses joint components to learn fused image and text representations </li>
<li><blockquote>
<p>not applicable for joint vision-language understanding tasks such as visual question answering</p>
</blockquote>
</li>
<li>learns an aligned text encoder that enables <strong>crossmodal alignment applications</strong> such as image-text retrieval and zero-shot image classification</li>
</ul>
<ol start="3">
<li>Encoder-decoder models</li>
</ol>
<ul>
<li>During training, it takes images on the encoder side and applies Language Modeling loss on the decoder outputs </li>
<li>decoder outputs can be used as joint representations for multimodal understanding tasks</li>
<li>the image <strong>encoder</strong>: provides latent encoded features using Vision Transformers</li>
<li>Text <strong>decoder</strong>: learns to maximize the likelihood of the paired text under the forward autoregressive factorization</li>
</ul>
<h2 id="coca">CoCa</h2>
<ul>
<li>focus on training an image-text foundation model from scratch in a single pretraining stage to unify image and text</li>
<li>performs <strong>one</strong> forward and backward propagation for a batch of image-text pairs while ALBEF requires two (one on corrupted inputs and another without corruption)</li>
<li>trained from scratch on the <strong>two objectives</strong> only, while ALBEF is initialized from pretrained visual and textual encoders with additional training signals including momentum modules</li>
<li>The <strong>decoder architecture with generative loss is preferred</strong> for natural language generation and thus directly enables image captioning and zero-shot learning</li>
</ul>
<p><img src="https://velog.velcdn.com/images/bona-park/post/22fe0cbf-b1c2-46b5-9e3b-bb89d015ab12/image.png" alt=""></p>
<h3 id="architecture">Architecture</h3>
<ol>
<li>Image Encoder
Encodes images to latent representations with a neural network encoder</li>
<li>Decoupled Decoder
Simultaneously produces both unimodal and multimodal text representations for both contrastive and generative objectives</li>
</ol>
<p><strong>1) Unimodal Text Decoder</strong>
 - for the contrastive objective, learning global representations
 - appends a learnable [CLS] token at the end of the input sentence 
<strong>2) Multimodal Text Decoder</strong>
- for the captioning objective, learning fine-grained region-level features</p>
<ul>
<li>Benefits of this design<ul>
<li>Can compute two training losses efficiently</li>
<li>Induces minimal overhead</li>
</ul>
</li>
</ul>
<h3 id="basic"><strong>Basic</strong></h3>
<p><strong>Captioning approach</strong>: optimizes the conditional likelihood of text
<strong>Contrastive approach:</strong> uses an unconditional text representation</p>
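<p>A minimal sketch of how the two objectives could be combined (PyTorch; the loss weights, temperature, and tensor shapes here are assumptions rather than the paper&#39;s exact values):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def coca_loss(image_emb, text_emb, caption_logits, caption_targets,
              temperature=0.07, lambda_con=1.0, lambda_cap=2.0):
    # image_emb, text_emb: [batch, dim] pooled image / unimodal [CLS] text representations
    # caption_logits: [batch, seq_len, vocab], caption_targets: [batch, seq_len]
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # contrastive objective: align paired image/text embeddings in the shared latent space
    logits = image_emb @ text_emb.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    con_loss = (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2

    # captioning objective: autoregressive likelihood of the paired text (language modeling loss)
    cap_loss = F.cross_entropy(caption_logits.reshape(-1, caption_logits.size(-1)),
                               caption_targets.reshape(-1))

    return lambda_con * con_loss + lambda_cap * cap_loss</code></pre>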
]]></description>
        </item>
        <item>
            <title><![CDATA[[Spark] Practice: RDD]]></title>
            <link>https://velog.io/@bona-park/Spark-Practice-RDD</link>
            <guid>https://velog.io/@bona-park/Spark-Practice-RDD</guid>
            <pubDate>Tue, 18 Oct 2022 07:23:45 GMT</pubDate>
            <description><![CDATA[<blockquote>
<pre><code class="language-python">sc = SparkContext(&#39;local&#39;, &#39;assignment&#39;)
baserdd = sc.textFile(&#39;file:///home/ec2-user/RDD/grade.txt&#39;) # mind the file path
baserdd.collect()</code></pre>
</blockquote>
<pre><code>[&#39;Mathew, science, grade-3, 45&#39;,
 &#39;Mathew, history, grade-2, 55&#39;,
 &#39;Mark, maths, grade-2, 23&#39;,
 &#39;Mark, science, grade-1, 76&#39;,
 &#39;John, history, grade-1, 14&#39;,
 &#39;John, maths, grade-2, 74&#39;,
 &#39;Lisa, science, grade-1, 24&#39;,
 &#39;Lisa, history, grade-3, 86&#39;,
 &#39;Andrew, maths, grade-1, 34&#39;,
 &#39;Andrew, science, grade-3, 26&#39;,
 &#39;Andrew, history, grade-1, 74&#39;,
 &#39;Mathew, science, grade-2, 55&#39;,
 &#39;Mathew, history, grade-2, 87&#39;,
 &#39;Mark, maths, grade-1, 92&#39;,
 &#39;Mark, science, grade-2, 12&#39;,
 &#39;John, history, grade-1, 67&#39;,
 &#39;John, maths, grade-1, 35&#39;,
 &#39;Lisa, science, grade-2, 24&#39;,
 &#39;Lisa, history, grade-2, 98&#39;,
 &#39;Andrew, maths, grade-1, 26&#39;,
 &#39;Andrew, science, grade-3, 44&#39;,
 &#39;Andrew, history, grade-2, 77&#39;]</code></pre><h3 id="읽어온-line을--를-기준으로-나눈-후-list로-저장">Split each line on &quot;, &quot; and store it as a list</h3>
<p>(with the scores converted to int)</p>
<blockquote>
<pre><code class="language-python">b0 = baserdd.map(lambda line: line.split(&quot;, &quot;))
#erdd = baserdd.map(lambda line[3]: int(line[3]) )
b0.collect()</code></pre>
</blockquote>
<pre><code>[[&#39;Mathew&#39;, &#39;science&#39;, &#39;grade-3&#39;, &#39;45&#39;],
 [&#39;Mathew&#39;, &#39;history&#39;, &#39;grade-2&#39;, &#39;55&#39;],
 [&#39;Mark&#39;, &#39;maths&#39;, &#39;grade-2&#39;, &#39;23&#39;],
 [&#39;Mark&#39;, &#39;science&#39;, &#39;grade-1&#39;, &#39;76&#39;],
 [&#39;John&#39;, &#39;history&#39;, &#39;grade-1&#39;, &#39;14&#39;],</code></pre>
<blockquote>
<pre><code class="language-python">b1 = b0.map(lambda line: line[:-1] + [int(line[-1])])
b1.collect()</code></pre>
</blockquote>
<pre><code>[[&#39;Mathew&#39;, &#39;science&#39;, &#39;grade-3&#39;, 45],
 [&#39;Mathew&#39;, &#39;history&#39;, &#39;grade-2&#39;, 55],
 [&#39;Mark&#39;, &#39;maths&#39;, &#39;grade-2&#39;, 23],
 [&#39;Mark&#39;, &#39;science&#39;, &#39;grade-1&#39;, 76],
 [&#39;John&#39;, &#39;history&#39;, &#39;grade-1&#39;, 14],</code></pre><h3 id="과목별-등장-횟수를-과목명-등장-횟수로">Count the occurrences of each subject as (subject, count)</h3>
<blockquote>
<pre><code class="language-python">b2 = b1.map(lambda x: (x[1],1))
b2.collect()</code></pre>
</blockquote>
<pre><code>[(&#39;science&#39;, 1),
 (&#39;history&#39;, 1),
 (&#39;maths&#39;, 1),
 (&#39;science&#39;, 1),</code></pre>
<blockquote>
<pre><code class="language-python">b2 = b2.reduceByKey(lambda x, y: x+y)
b2.collect()</code></pre>
</blockquote>
<pre><code>[(&#39;science&#39;, 8), (&#39;history&#39;, 8), (&#39;maths&#39;, 6)]</code></pre><h3 id="학생별-평균-점수-구하기">Compute the average score per student</h3>
<p>Students with the same name but a different grade are treated as different people</p>
<blockquote>
<pre><code class="language-python">b5 = b1.map(lambda x: ((x[0], x[2]), (x[3], 1)))
b5.collect()</code></pre>
</blockquote>
<pre><code>[((&#39;Mathew&#39;, &#39;grade-3&#39;), (45, 1)),
 ((&#39;Mathew&#39;, &#39;grade-2&#39;), (55, 1)),
 ((&#39;Mark&#39;, &#39;grade-2&#39;), (23, 1)),
 ((&#39;Mark&#39;, &#39;grade-1&#39;), (76, 1)),
 ((&#39;John&#39;, &#39;grade-1&#39;), (14, 1)),
 ((&#39;John&#39;, &#39;grade-2&#39;), (74, 1)),
 ((&#39;Lisa&#39;, &#39;grade-1&#39;), (24, 1)),
 ((&#39;Lisa&#39;, &#39;grade-3&#39;), (86, 1)),
 ((&#39;Andrew&#39;, &#39;grade-1&#39;), (34, 1)),
 ((&#39;Andrew&#39;, &#39;grade-3&#39;), (26, 1)),
 ((&#39;Andrew&#39;, &#39;grade-1&#39;), (74, 1)),
 ((&#39;Mathew&#39;, &#39;grade-2&#39;), (55, 1)),
 ((&#39;Mathew&#39;, &#39;grade-2&#39;), (87, 1)),
 ((&#39;Mark&#39;, &#39;grade-1&#39;), (92, 1)),
 ((&#39;Mark&#39;, &#39;grade-2&#39;), (12, 1)),
 ((&#39;John&#39;, &#39;grade-1&#39;), (67, 1)),
 ((&#39;John&#39;, &#39;grade-1&#39;), (35, 1)),
 ((&#39;Lisa&#39;, &#39;grade-2&#39;), (24, 1)),
 ((&#39;Lisa&#39;, &#39;grade-2&#39;), (98, 1)),
 ((&#39;Andrew&#39;, &#39;grade-1&#39;), (26, 1)),
 ((&#39;Andrew&#39;, &#39;grade-3&#39;), (44, 1)),
 ((&#39;Andrew&#39;, &#39;grade-2&#39;), (77, 1))]</code></pre>
<blockquote>
<pre><code class="language-python">totalByName = b5.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
totalByName.collect()</code></pre>
</blockquote>
<pre><code>[((&#39;Mathew&#39;, &#39;grade-3&#39;), (45, 1)),
 ((&#39;Mathew&#39;, &#39;grade-2&#39;), (197, 3)),
 ((&#39;Mark&#39;, &#39;grade-2&#39;), (35, 2)),
 ((&#39;Mark&#39;, &#39;grade-1&#39;), (168, 2)),
 ((&#39;John&#39;, &#39;grade-1&#39;), (116, 3)),
 ((&#39;John&#39;, &#39;grade-2&#39;), (74, 1)),
 ((&#39;Lisa&#39;, &#39;grade-1&#39;), (24, 1)),
 ((&#39;Lisa&#39;, &#39;grade-3&#39;), (86, 1)),
 ((&#39;Andrew&#39;, &#39;grade-1&#39;), (134, 3)),
 ((&#39;Andrew&#39;, &#39;grade-3&#39;), (70, 2)),
 ((&#39;Lisa&#39;, &#39;grade-2&#39;), (122, 2)),
 ((&#39;Andrew&#39;, &#39;grade-2&#39;), (77, 1))]</code></pre><blockquote>
<pre><code class="language-python">averageByName = totalByName.mapValues(lambda x: x[0] / x[1])
averageByName.collect()</code></pre>
</blockquote>
<pre><code>[((&#39;Mathew&#39;, &#39;grade-3&#39;), 45.0),
 ((&#39;Mathew&#39;, &#39;grade-2&#39;), 65.66666666666667),
 ((&#39;Mark&#39;, &#39;grade-2&#39;), 17.5),
 ((&#39;Mark&#39;, &#39;grade-1&#39;), 84.0),
 ((&#39;John&#39;, &#39;grade-1&#39;), 38.666666666666664),
 ((&#39;John&#39;, &#39;grade-2&#39;), 74.0),
 ((&#39;Lisa&#39;, &#39;grade-1&#39;), 24.0),
 ((&#39;Lisa&#39;, &#39;grade-3&#39;), 86.0),
 ((&#39;Andrew&#39;, &#39;grade-1&#39;), 44.666666666666664),
 ((&#39;Andrew&#39;, &#39;grade-3&#39;), 35.0),
 ((&#39;Lisa&#39;, &#39;grade-2&#39;), 61.0),
 ((&#39;Andrew&#39;, &#39;grade-2&#39;), 77.0)]</code></pre>
]]></description>
        </item>
        <item>
            <title><![CDATA[[파이썬 머신러닝 완벽가이드] Chapter 8: Text Analysis]]></title>
            <link>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-8%EC%9E%A5-Text-Analysis-lkbs137k</link>
            <guid>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-8%EC%9E%A5-Text-Analysis-lkbs137k</guid>
            <pubDate>Tue, 18 Oct 2022 07:18:08 GMT</pubDate>
            <description><![CDATA[<ul>
<li><p>NLP vs text analysis</p>
</li>
<li><p>Main areas of text analysis
  Text classification (which category does a document belong to?)
  Sentiment analysis (analyzing subjective elements such as feelings expressed in the text)
  Text summarization
  Text clustering and similarity measurement</p>
</li>
<li><p>Machine learning process for text analysis
  Data preprocessing -&gt; Feature Vectorization -&gt; ML training/prediction/evaluation
<img src="https://velog.velcdn.com/images/bona-park/post/9b97dfc5-8729-43b7-9f79-90dcb178e7e9/image.png" alt=""></p>
</li>
</ul>
<p><a href="https://www.researchgate.net/figure/Text-mining-process-Source-Chakraborty-Pagolu-and-Garla-2013_fig1_262413948">https://www.researchgate.net/figure/Text-mining-process-Source-Chakraborty-Pagolu-and-Garla-2013_fig1_262413948</a></p>
<ul>
<li><p>Python-based NLP and text analysis packages
  NLTK, Gensim (topic modeling), SpaCy</p>
</li>
<li><p>Text preprocessing: text normalization
  Cleansing: remove html/xml tags and special characters
  Tokenization: sentence / word tokenization
  Filtering / stop-word removal / spelling correction: e.g. removing articles
  Stemming / Lemmatization: extract the root form of each word</p>
</li>
<li><p>N-Gram
  Tokenizing a sentence into individual words ignores the contextual meaning
  -&gt; split the text into tokens of n consecutive words
  slide a window of n words to perform the tokenization</p>
</li>
</ul>
<ul>
<li><p>Types of text feature vectorization
  BOW (Bag of Words)</p>
<pre><code>  Document Term Matrix
  word occurrence counts arranged as a matrix</code></pre><p>  Word Embedding</p>
<pre><code>  each word is represented as a vector in an N-dimensional space based on its context</code></pre></li>
<li><p>BOW 
  a model that ignores context and word order
  and extracts feature values by assigning a frequency value to every word
  (the name comes from putting all the words of a document into a bag and shaking them up)</p>
</li>
<li><p>BOW structure
  columns are words, rows are documents; each cell records how many times that column&#39;s word appears</p>
<p>  Pros: easy and fast to build </p>
<pre><code>      captures document characteristics better than expected, so it has traditionally been widely used in many fields</code></pre><p>  Cons: cannot reflect contextual meaning</p>
<pre><code>      sparse matrix problem (too many empty/null values)</code></pre></li>
<li><p>BOW feature vectorization (see the sketch after this list)</p>
<ol>
<li><p>Simple count-based vectorization
 assigns each word the number of times it appears in the document</p>
</li>
<li><p>TF-IDF vectorization
 weights the words, penalizing words that appear in many documents</p>
<p> A word that appears only in certain documents is considered likely to be an important word that characterizes those documents.</p>
<p> TF (Term Frequency): how many times the word appears in this document
 DF (Document Frequency): how many documents the word appears in
 IDF: total number of documents / DF</p>
</li>
</li>
</ol>
</li>
</ul>
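<p>A minimal scikit-learn sketch of the two vectorization schemes above (the toy corpus is made up):</p>
<pre><code class="language-python">from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [&#39;the red dog&#39;, &#39;the big red dog barked&#39;, &#39;a cat slept&#39;]   # toy corpus

count_vec = CountVectorizer()              # simple count-based BOW
bow = count_vec.fit_transform(docs)        # sparse document-term matrix

tfidf_vec = TfidfVectorizer()              # down-weights words that appear in many documents
tfidf = tfidf_vec.fit_transform(docs)

print(count_vec.get_feature_names_out())
print(tfidf.toarray().round(2))</code></pre>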
<ul>
<li><p>Feature vectorization with CountVectorizer
  data preprocessing -&gt; tokenization -&gt; text normalization -&gt; feature vectorization</p>
</li>
<li><p>Storage formats for sparse matrices (see the sketch after this list)
  COO format: store only the non-zero values in separate arrays
  CSR format: fixes COO&#39;s problem of storing position arrays redundantly</p>
</li>
<li><p>Classifying 20 Newsgroups *
  classify 18,846 news documents into 20 categories</p>
<p>  text normalization -&gt; feature vectorization -&gt; machine learning -&gt; apply a Pipeline -&gt; optimize with GridSearchCV</p>
</li>
</ul>
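<p>A small SciPy sketch of the two sparse formats:</p>
<pre><code class="language-python">import numpy as np
from scipy.sparse import coo_matrix

dense = np.array([[0, 0, 3],
                  [1, 0, 0],
                  [0, 2, 0]])

rows, cols = dense.nonzero()
data = dense[rows, cols]
coo = coo_matrix((data, (rows, cols)), shape=dense.shape)   # COO: (value, row, col) triplets
csr = coo.tocsr()                                           # CSR: row indices compressed into pointers

print(csr.toarray())</code></pre>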
<ul>
<li><p>Sentiment analysis
  supervised-learning-based analysis</p>
<p>  analysis using a sentiment lexicon</p>
<pre><code>  SentiWordNet: three sentiment scores per synset (positivity, negativity, objectivity)
  VADER: sentiment analysis for social-media text
  Pattern</code></pre><ul>
<li>predicting positive/negative sentiment of IMDB movie reviews</li>
</ul>
</li>
<li><p>SentiWordNet sentiment scores<br>  horizontal: positive/negative, vertical: degree of objectivity</p>
<p>  split the document into sentences
  -&gt; tokenize the words and tag parts of speech
  -&gt; create synset and senti_synset objects
  -&gt; get the positive/negative scores from the senti_synset and classify against a threshold</p>
</li>
<li><p>VADER
  uses the SentimentIntensityAnalyzer class</p>
</li>
</ul>
<ul>
<li><p>Topic modeling
  a technique for extracting the common topics latent in a set of documents</p>
<ol>
<li><p>Matrix-factorization-based topic modeling (LSA, NMF)</p>
</li>
<li><p>Probability-based topic modeling</p>
</li>
</ol>
</li>
<li><p>LDA (Latent Dirichlet Allocation)
  uses Bayesian inference over the words in the documents
  to infer the latent per-document topic distribution and per-topic word distribution</p>
</li>
<li><p>Conjugate prior distributions for Bayesian inference
  binomial distribution -&gt; beta distribution
  multinomial distribution -&gt; Dirichlet distribution</p>
</li>
<li><p>Components of LDA</p>
</li>
<li><p>LDA process 
  build a count-based matrix
  set the number of topics
  after an initial random topic assignment, the per-document topic distribution and per-topic word distribution are determined
  take out one word at a time and recompute the two distributions
  recompute the topic assignment distribution over all words</p>
</li>
</ul>
<ul>
<li>Document clustering</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[파이썬 머신러닝 완벽가이드] Chapter 5: Regression]]></title>
            <link>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-5%EC%9E%A5-Regression</link>
            <guid>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-5%EC%9E%A5-Regression</guid>
            <pubDate>Tue, 18 Oct 2022 07:14:52 GMT</pubDate>
            <description><![CDATA[<ul>
<li><p>Regression 
  the tendency of data values to return toward a constant value such as the mean
  (even a child from a very tall family does not grow infinitely tall)</p>
<p>  Regression: the general term for techniques that model the relationship between several independent variables and a dependent variable</p>
<p>  The core of regression in machine learning:</p>
<pre><code>  finding the optimal regression coefficients! (W1, W2, ..)</code></pre><p>  Linear vs non-linear regression: it depends on whether the model is linear in the regression coefficients! (an independent variable like x**2 does not matter!)</p>
</li>
<li><p>Classification vs regression
  Classification: the result is a category (a discrete value)</p>
<p>  Regression: a numeric (continuous) value</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/bona-park/post/c3c0171f-9a93-4c3b-9085-76e7d078ac9a/image.png" alt=""></p>
<p><a href="https://miro.medium.com/max/1400/1*kcBsTJsIiDEC9XcizmnmYg.png">https://miro.medium.com/max/1400/1*kcBsTJsIiDEC9XcizmnmYg.png</a></p>
<p>[Supervised learning]</p>
<p>Training on data that has ground-truth labels</p>
<p>Typical examples: classification (the result falls into categories) and regression (the result is a real value)</p>
<p>[Unsupervised learning]</p>
<p>Grouping unlabeled data by similar characteristics and predicting results for new data from the clusters</p>
<p>Typical examples: clustering, K-Means</p>
<p>[Reinforcement learning]</p>
<p>Learning how to earn high scores by receiving rewards for the actions taken</p>
<p>After a certain number of training iterations, a strategy that obtains high rewards emerges</p>
<p>(the list of available actions is defined in advance)</p>
<ul>
<li>Types of linear regression
  ordinary linear regression
  Ridge, Lasso, ElasticNet
  logistic regression (actually a linear model used for classification)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/bona-park/post/c913d708-9def-4eb8-9bed-b62b640bd0ca/image.png" alt=""></p>
<p><a href="https://www.google.com/url?sa=i&amp;url=https%3A%2F%2Fdevopedia.org%2Ftypes-of-regression&amp;psig=AOvVaw2_T3lABcpo_xoUkj-OpLXx&amp;ust=1651009003135000&amp;source=images&amp;cd=vfe&amp;ved=0CAwQjRxqFwoTCLDV1uWWsPcCFQAAAAAdAAAAABAP">https://www.google.com/url?sa=i&amp;url=https%3A%2F%2Fdevopedia.org%2Ftypes-of-regression&amp;psig=AOvVaw2_T3lABcpo_xoUkj-OpLXx&amp;ust=1651009003135000&amp;source=images&amp;cd=vfe&amp;ved=0CAwQjRxqFwoTCLDV1uWWsPcCFQAAAAAdAAAAABAP</a></p>
<ul>
<li><p>Simple linear regression (SR): a single feature
  y = w0 + w1*x</p>
</li>
<li><p>RSS-based regression error measurement
  RSS: the cost function of regression (the sum of squared errors)</p>
<p>  In simple linear regression, only w0 and w1 remain in the RSS expression.</p>
<p>  A regression algorithm must minimize RSS (the cost function)</p>
</li>
<li><p>Minimizing the cost: gradient descent
  update w gradually through repeated, iterative computation</p>
</li>
<li><p>Partial derivatives of R(w)</p>
</li>
<li><p>scikit-learn&#39;s LinearRegression class</p>
</li>
<li><p>The multicollinearity problem of linear regression
  when features are highly correlated with each other, the variance grows and the model becomes very sensitive to errors
  keep only the important features among the highly correlated ones!</p>
</li>
<li><p>Regression evaluation metrics
  MSE: mean squared error
  MSLE: MSE computed on log-transformed values</p>
</li>
<li><p>Caveats when using regression metrics with scoring functions</p>
</li>
<li><p>Polynomial regression
  the independent variable (x) appears in a polynomial (2nd-degree, 3rd-degree, ...) rather than a single linear term</p>
</li>
<li><p>PolynomialFeatures (see the sketch after this list)
  converts each polynomial term into a single feature and then trains LinearRegression</p>
<pre><code>  ex) transform x1**3 into z1</code></pre></li>
</ul>
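<p>A minimal sketch of the PolynomialFeatures + LinearRegression combination described above (the toy data is made up):</p>
<pre><code class="language-python">import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = np.arange(6).reshape(-1, 1)                 # toy feature
y = 1 + 2 * X[:, 0] + 0.5 * X[:, 0] ** 3        # toy cubic target

# expand x into [1, x, x^2, x^3], then fit an ordinary linear regression on those terms
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y)
print(model.predict([[7]]))</code></pre>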
<ul>
<li><p>Regularized linear regression (see the sketch after this list)
  in polynomial regression, the regression coefficients can grow explosively and need to be controlled</p>
<p>  Cost function components for the optimal model </p>
<pre><code>  = minimize the residual error on the training data + control the size of the regression coefficients</code></pre><p>  Regularization: apply a penalty, controlled by the alpha value, to shrink the regression coefficients</p>
<p>  Ridge regression: penalizes the square of w
  Lasso regression: penalizes the absolute value of w</p>
</li>
<li><p>Ridge regression
  keeps shrinking the size of the regression coefficients</p>
</li>
<li><p>Lasso regression<br>  shrinks unnecessary regression coefficients sharply to 0 and removes them</p>
</li>
<li><p>ElasticNet regression: L2 + L1
  alpha = L1 alpha + L2 alpha</p>
</li>
</ul>
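<p>A short scikit-learn sketch of the three regularized models (the dataset and alpha values are just examples):</p>
<pre><code class="language-python">from sklearn.datasets import load_diabetes
from sklearn.linear_model import ElasticNet, Lasso, Ridge

X, y = load_diabetes(return_X_y=True)

ridge = Ridge(alpha=10).fit(X, y)                       # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=0.5).fit(X, y)                      # L1 penalty: drives some coefficients to 0
enet = ElasticNet(alpha=0.5, l1_ratio=0.5).fit(X, y)    # mix of L1 and L2

print(sum(coef == 0 for coef in lasso.coef_), &#39;coefficients removed by Lasso&#39;)</code></pre>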
<ul>
<li><p>Data transformations for linear regression models</p>
<p>  Target transformation: the target should be made to follow a normal distribution (usually via a log transform)</p>
<p>  Feature transformations:    </p>
<pre><code>  StandardScaler
  MinMaxScaler</code></pre><p>  For data encoding, apply one-hot encoding</p>
</li>
<li><p>Comparing prediction performance across feature transformations
  original vs standardized vs standardized + 2nd-degree polynomial vs min-max scaled vs log-transformed</p>
</li>
</ul>
<ul>
<li><p>Logistic regression
  finds the best-fitting sigmoid function and treats the sigmoid output as a probability to decide the classification</p>
<p>  Mainly used for binary classification.
  The predicted probability is the output of the sigmoid function
  If the predicted probability is 0.5 or higher, the predicted label is 1.</p>
</li>
<li><p>Regression trees
  CART: Classification and Regression Trees</p>
<ol>
<li>Split according to the rule that minimizes RSS</li>
<li>Train/predict with the mean values of the data in each final leaf region</li>
</ol>
</li>
<li><p>Overfitting in regression trees
  mitigate overfitting by limiting the tree size, the number of nodes, etc.</p>
</li>
</ul>
<p>ref: &quot;파이썬 머신러닝 완벽가이드&quot;,  권철민, 위키북스</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Bioinformatics] Presentation: ShRec3D]]></title>
            <link>https://velog.io/@bona-park/Bioinformatics-Presentation-ShRec3D</link>
            <guid>https://velog.io/@bona-park/Bioinformatics-Presentation-ShRec3D</guid>
            <pubDate>Tue, 18 Oct 2022 04:30:09 GMT</pubDate>
            <description><![CDATA[<p>   I gave a 40-minute presentation in the Introduction to Bioinformatics class at the University of Miami. It was reviewing the paper &quot;3D Genome Reconstruction from chromosol contacts&quot; by Annick Lense. It was the first in-person presentation in my university life. 40-minute was a long presentation but I ended up performing well.</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/240a96e4-78b7-4e2c-81f8-3659165c2976/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/d3ccbfb4-9d10-4d68-81de-95ca227f4832/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/f447d654-a301-46f0-bb28-ebc43e835040/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/61c224cc-52d1-4124-ac1a-04ac0ed23ef4/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/5aca0c61-83f4-44af-89df-c7a37ac2fcb8/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/8329ff2c-b610-4501-abb0-cf3667c2c890/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/7b407568-3a96-4fc0-a873-0e27eaf048c5/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/615766f8-24fe-4f2d-b532-178dece0761f/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/7c79eb4e-f30b-49d0-a018-6114f52d2e1c/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/2ecb1f7f-1b81-42b3-ad69-ced67ec24d6b/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/3dacc5b4-af5d-4e88-9633-311a64166629/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/dd8da089-7e87-4c6f-8670-7ebe8f4b9bc8/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/d79ba6ef-cb0c-4b90-bc33-a258533b3e8b/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/6c360536-b394-4fd6-846f-08f01495b679/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/0afd3cd5-d9bc-494d-9410-f7a8ccc97878/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/5a8e197b-7bc4-45de-9c1f-c12f3273a7ba/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/41996aba-7a01-4fe1-be5b-8bd2316e30aa/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/4e6debcc-fd99-4b94-8126-82fcb6cd0ec9/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/68db0419-2392-47b6-90f2-58018a5e6f89/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/3ddbda76-ffbd-49ea-a9d7-f14ce7a786d8/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/4f5e94b3-56e7-40f3-badd-fb60b461df3e/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/5ce5dffb-9732-4ae3-89af-1b00ec6925a8/image.jpg" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/503402fd-ea1d-4c32-997a-633815096197/image.jpg" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Bioinformatics] Paper Review: Hi-C]]></title>
            <link>https://velog.io/@bona-park/Bioinformatics-Paper-Review-Hi-C</link>
            <guid>https://velog.io/@bona-park/Bioinformatics-Paper-Review-Hi-C</guid>
            <pubDate>Tue, 18 Oct 2022 03:46:14 GMT</pubDate>
            <description><![CDATA[<h2 id="comprehensive-mapping-of-long-range-interactions-reveals-folding-principles-of-the-human-genome"><a href="https://www.science.org/doi/10.1126/science.1181369"><em>Comprehensive Mapping of Long-Range Interactions Reveals Folding Principles of the Human Genome</em></a></h2>
<div style="text-align: right"> Bona Park </div>




<p>Chromosome conformation capture is a method that enables researchers to observe interactions between loci. These loci are in close contact in the 3-dimensional structure of a chromosome, but they can be far apart in the linear sequence. Understanding how chromosomes fold is important because it is relevant to the complicated relationships between chromatin structure, gene activity, and the cell’s functional state. There are many different ways to look into the 3-D structure of chromatin. Chromosome conformation capture, also known as 3C, was the first chromatin structure assay. It uses spatially constrained ligation. 3C has been adapted in several ways, such as 4C (inverse PCR) and 5C (multiplexed ligation-mediated amplification). But those methods have the limitation that they require choosing a set of target loci, so they cannot be applied to unbiased genome-wide analysis. Therefore, the researchers in this paper suggest a new method called Hi-C. It is also derived from the above methods but can identify chromatin interactions genome-wide. Hi-C starts with cross-linking and DNA digestion steps. While the 5’ overhang is filled in, a biotinylated residue is incorporated. The biotin tag sits at the center of the ligation junction in the DNA strand. This results in ligation products composed of fragments that were in close spatial proximity in the nucleus.</p>
<p>The researchers divide the genome into 1-Mb loci and build a genome-wide contact matrix M. The matrix entry m(ij) is the number of ligation products between locus i and locus j. This expresses the interactions aggregated over the population of cells.</p>
<p>They validated the Hi-C method in several ways. First, they repeated the experiment with the same restriction enzyme and with a different one and checked whether it reproduced the same results every time. This was validated by extracting similar contact matrices from the different experiments. Furthermore, they checked whether the results correspond to known features of genome organization, such as patterns in subnuclear positioning or chromosome territories.</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/6fc16726-fc93-4e59-a277-bd9957841a78/image.png" alt=""></p>
<p>From the genomic distance in base pairs along the nucleotide sequence, they computed the average contact probability. It can be expressed as In(s), where n stands for the chromosome number and s for the distance. As the genomic distance s between loci increases, In(s) decreases on every chromosome. This is also supported by 3C and fluorescence in situ hybridization (FISH). In terms of interchromosomal contact probabilities, some small, gene-rich chromosomes preferentially showed strong interactions with each other. On the other hand, small but gene-poor chromosomes do not interact that much with the other small chromosomes. This is also supported by FISH. They were also curious about whether there are specific regions on individual chromosomes that preferentially associate with each other. They defined a normalized contact matrix M*. Each entry of this matrix is obtained by dividing the corresponding entry of the contact matrix by the genome-wide average contact probability for loci at that genomic distance. Interestingly, it shows a plaid pattern with many blocks of enriched and depleted interactions. Then they defined a correlation matrix C from the assumption that spatially neighboring loci would have correlated interaction profiles. The entry c(ij) of the correlation matrix C is the Pearson correlation between the ith row and the jth column of M*. It sharpens the plaid pattern, composed of the labels A and B, even further. Furthermore, the intra-chromosomal plaid patterns were similar to the inter-chromosomal ones. This leads to the conclusion that the whole genome can be divided into two compartments in 3-dimensional space, where the interactions within each compartment are stronger than those across compartments. The Hi-C data show that regions belonging to the same compartment tend to be spatially closer. They tested this with 3D-FISH by investigating four loci on one chromosome located in the two compartments. The result validated the compartmentalization in space. They also examined the density of the regions. They found that pairs of loci in compartment B had a higher interaction frequency than pairs in compartment A, which means that B is denser and more packed.</p>
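<p>A simplified NumPy sketch of the two matrices described above (the paper&#39;s actual normalization, which accounts for the expected contact frequency at each genomic distance, is more involved than this):</p>
<pre><code class="language-python">import numpy as np

def normalize_and_correlate(M):
    # M[i, j]: number of ligation products between 1-Mb loci i and j
    M_star = M / M.mean()      # M*: divide by a genome-wide average contact level (simplified)
    C = np.corrcoef(M_star)    # c(i, j): Pearson correlation between rows i and j of M*
    return M_star, C</code></pre>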
<p>The researchers compared known genetic and epigenetic features between the two spatial compartments. Compartment A is associated with a stronger presence of genes, higher expression as measured by genome-wide mRNA expression, and more accessible chromatin. They conclude that compartment A corresponds more to actively transcribed, open, and accessible chromatin. They repeated this experiment with K562 cells and obtained results similar to those from the GM06990 cells. Even though both K562 and GM06990 showed similar compartment patterns, some loci switched compartments between the cell types. They concluded that even in a highly rearranged genome, spatial compartmentalization is related to the open or closed chromatin status.</p>
<p>Last but not least, they related chromatin structure to the compartments. They found that the intra-chromosomal contact probability scales as s^(-1). So far, chromosomal regions had been modeled as an “equilibrium globule”, in which chromatin is pictured as being in a compact and densely knotted configuration. On the other hand, a “fractal globule” is formed by an unentangled polymer, like a “beads-on-a-string” configuration. Because it lacks knots and can fold and unfold during the cell cycle, the latter is a more attractive structure for chromatin segments than the former. The two globule models make very different predictions about the 3D distance between pairs of loci and about the scaling of contact probability with genomic distance s. The researchers created ensemble models by running Monte Carlo simulations. The ensemble was tested and shown to be consistent with the theoretically derived results for contact probability and 3D distance.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[파이썬 머신러닝 완벽가이드] Chapter 4: Classification]]></title>
            <link>https://velog.io/@bona-park/hki8wkqn</link>
            <guid>https://velog.io/@bona-park/hki8wkqn</guid>
            <pubDate>Mon, 17 Oct 2022 14:59:55 GMT</pubDate>
            <description><![CDATA[<h2 id="4장-분류">[4장: 분류]</h2>
<h4 id="균일도-기반-규칙-조건">균일도 기반 규칙 조건</h4>
<h4 id="정보-균일도-측정-방법">정보 균일도 측정 방법</h4>
<p>1) 정보 이득
        ..앤트로피 개념
        정보이득 지수 = 1-엔트로피 지수</p>
<p>2) 지니계수: 불평등 지수
지니계수 낮을수록 균일한 데이터</p>
<h4 id="결정트리의-규칙노드-생성-프로세스">결정트리의 규칙노드 생성 프로세스</h4>
<pre><code>If true/ else</code></pre><h4 id="결정트리-장점">결정트리 장점</h4>
<p>   쉽고 직관적</p>
<h4 id="결정트리-단점">결정트리 단점</h4>
<p>과적합(overfitting)<br>    sol) 트리크기를 사전에 제한</p>
<h4 id="결정트리-주요-hyperparameter">결정트리 주요 hyperparameter</h4>
<pre><code>- max_depth, max_features..</code></pre><h4 id="graphviz이용한-결정트리-모델의-시각화실제-나무-모양-그림으로">Graphviz이용한 결정트리 모델의 시각화(실제 나무 모양 그림으로)</h4>
<ul>
<li>각 노드에는피처의 규칙 조건 gini samples: 현 규칙에 해당하는 데이터 건수 value: 클래스 값 기반의 데이터 건수
ex) [41,4,10] 이면 해당 조건을 만족하는 a품종은 41개 b품종은 4개<pre><code>class: value리스트 내에 가장 많은 건수를 가진 결정값</code></pre></li>
</ul>
<h4 id="결정트리의-feture-선택-중요도">결정트리의 feture 선택 중요도</h4>
<p>중요한 feature들만 선택해서 학습,예측하는게 나을 수도
    - feature_importance: 중요한 feature 찾아내기</p>
<h4 id="결정트리-실습-사용자-행동-인식-데이터-세트">결정트리 실습: 사용자 행동 인식 데이터 세트</h4>
<p>* 스마트 워치끼고 어떤 행동을 하는지 찾아낸다 </p>
<h4 id="앙상블-학습">앙상블 학습</h4>
<p>다양한 분류기의 예측 결과를 도합
    이질적인 모델들을 섞는 게 전체 성능에 도움이 될 수 ㅇ</p>
<p><strong>1. 보팅</strong>
        하드 보팅: 다수의 classifier 간 다수결
        소프트보팅: class별 확률 결정 by predict_proba()</p>
<p>* 위스콘신 유방암 데이터 예측
        (cf) kNeighborsClassifier에서 n_neighbors는 내 주변의 몇개의 이웃들을 참조해서 이 데이터를 예측할 건지</p>
<p align="center"><img src="https://velog.velcdn.com/images/bona-park/post/76126451-b183-48b7-a699-7c11eb74cc6c/image.png" width="400px"></p>      

<p><span style="font-size: 11px; color:
             #008000" >
  <a href="https://velog.velcdn.com/images%2Fjiselectric%2Fpost%2F49803ffd-d915-403f-8c78-9fe5ee26ad1d%2F%E1%84%89%E1%85%B3%E1%84%8F%E1%85%B3%E1%84%85%E1%85%B5%E1%86%AB%E1%84%89%E1%85%A3%E1%86%BA%202021-01-14%20%E1%84%8B%E1%85%A9%E1%84%8C%E1%85%A5%E1%86%AB%201.02.01.png">https://velog.velcdn.com/images%2Fjiselectric%2Fpost%2F49803ffd-d915-403f-8c78-9fe5ee26ad1d%2F%E1%84%89%E1%85%B3%E1%84%8F%E1%85%B3%E1%84%85%E1%85%B5%E1%86%AB%E1%84%89%E1%85%A3%E1%86%BA%202021-01-14%20%E1%84%8B%E1%85%A9%E1%84%8C%E1%85%A5%E1%86%AB%201.02.01.png</a></br>
</span></p>
<p> <strong>2.</strong> <strong>Bagging</strong> | Random Forest is the representative example
        several decision tree classifiers each sample their own data from the full dataset (bagging) and are trained individually
        -&gt; the final prediction is then decided by voting across all the classifiers
        relatively fast</p>
<p> bootstrapping: splitting the full dataset into overlapping subsets
            a subset can contain the same data point several times, e.g. [1233346889]</p>
<pre><code>  * Human activity recognition prediction
        (cf) visualizing the importance of individual features    
            Random forest&#39;s feature_importances_: returns the importances as an ndarray
            -&gt; convert to a Series</code></pre><p><strong>3. Boosting</strong>    
        train several learners &quot;sequentially&quot;, giving extra weight to the &quot;mispredicted data&quot; so that errors are corrected as training proceeds<br>                            -&gt; long training time</p>
<ul>
<li><p><strong>GBM</strong> (Gradient Boosting Machine)
   uses gradient descent to update the weights
   GradientBoostingClassifier class</p>
<p>   learning_rate: the learning rate applied during GBM training
   n_estimators: the number of weak learners
   subsample: the fraction of data each weak learner samples for training</p>
</li>
<li><p><strong>XGBoost</strong> (eXtreme Gradient Boosting)
   fast training
   various performance features: regularization, tree pruning<br>   various conveniences: early stopping, built-in cross-validation, native handling of missing values</p>
<ul>
<li>Early stopping (see the sketch after this list)
  stops training when the cost function no longer decreases
  early_stopping_rounds (the maximum number of rounds without improvement in the evaluation metric), eval_metric, eval_set</li>
</ul>
</li>
<li><p><strong>LightGBM</strong>
   even faster, smaller memory footprint
   automatic conversion and optimal splitting of categorical features</p>
<p>   leaf-wise tree growth (as opposed to level-wise / balanced tree growth)</p>
</li>
</ul>
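<p>A small sketch of XGBoost early stopping (the dataset and parameter values are just examples; depending on the xgboost version, early_stopping_rounds is passed to the constructor or to fit()):</p>
<pre><code class="language-python">from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# stop adding trees once the validation logloss has not improved for 50 rounds
model = XGBClassifier(n_estimators=1000, learning_rate=0.05,
                      early_stopping_rounds=50, eval_metric=&#39;logloss&#39;)
model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)</code></pre>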
<p><a href="https://miro.medium.com/max/2590/1*5gY5IdU6PO4JCqQoEDtdMA.png">https://miro.medium.com/max/2590/1*5gY5IdU6PO4JCqQoEDtdMA.png</a></p>
</br>

<h4 id="-분류-실습1-캐글경연대회의-산탄데르-은행-고객-만족-예측">* 분류 실습1: 캐글경연대회의 산탄데르 은행 고객 만족 예측</h4>
<h4 id="-분류-실습2-신용카드-사기-예측-실습">* 분류 실습2: 신용카드 사기 예측 실습</h4>
<h4 id="iqr과-박스플롯">IQR과 박스플롯</h4>
<h4 id="언더샘플링-오버샘플링">언더샘플링, 오버샘플링</h4>
<pre><code>    언더샘플링: 많은 레이블을 가진 데이터 세트를
                적은 레이블을 가진 데이터 세트 수준으로 감소 샘플링 </code></pre><h4 id="smote">SMOTE</h4>
<pre><code>    원본데이터-&gt; k최근접 이웃으로 데이터 신규증식 -&gt; 신규증식하여 오버샘플링 완료</code></pre><h4 id="정리">정리</h4>
<pre><code>    데이터 로그 변환: 약간식 성능 좋아짐
    이상치 데이터 제거: 정밀도 up, 재현율 up
    SMOTE 오버 샘플링: 정밀도 down, 재현율 up</code></pre><h4 id="stacking-model">Stacking Model</h4>
<pre><code>기반 모델들이 예측한 값들을 Stacking 형태로 만들어서
메타 모달들이 이를 학습하고 예측하는 모델</code></pre><h4 id="교차-검증-세트-기반의-스태킹">교차 검증 세트 기반의 스태킹</h4>
<p>step1: 각 기반 모델 별로 학습하고, 학습/테스트 데이터를 예측한 결과값을 기반으로
            메타 모델을 위한 학습용/테스트용 데이터를 생성한다
    step2: 스텝1에서 </p>
<h4 id="신규">(신규)</h4>
<h5 id="feature-selection">Feature Selection</h5>
<pre><code>Select the main features that make up the model
Remove the possibility that many unnecessary features degrade model performance
Select features so that the model stays explainable

Consider the distribution of feature values, the number of nulls, high correlation between features (drop the overlapping ones), independence from the target, etc.
Based on the model&#39;s feature importances</code></pre><h5 id="사이킷런-feature-selection-지원">Feature Selection support in scikit-learn</h5>
<h4 id="rfe">RFE</h4>
<p>After the initial training, rank the features by importance
            then repeatedly drop the least important features and retrain/evaluate
                to extract the optimal feature set
            takes very long to run..</p>
<h4 id="selectfrommodel">SelectFromModel</h4>
<p>After the initial training, select the features whose importance is above a certain ratio of the mean/median importance</p>
<p><em>ref: &quot;파이썬 머신러닝 완벽가이드&quot;,  권철민, 위키북스</em></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[GDSC-ML] Practice: Improve accuracy of ImageNet Classification project]]></title>
            <link>https://velog.io/@bona-park/GDSC-ML-Practice-Improve-accuracy-of-ImageNet-Classification-project</link>
            <guid>https://velog.io/@bona-park/GDSC-ML-Practice-Improve-accuracy-of-ImageNet-Classification-project</guid>
            <pubDate>Thu, 13 Oct 2022 09:32:16 GMT</pubDate>
            <description><![CDATA[<h4 id="--models">- Models</h4>
<p><img src="https://velog.velcdn.com/images/bona-park/post/75d3136e-79a2-4b0a-bd87-4c175f70b1ee/image.png" alt=""></p>
<h4 id="1-initial-mymodelyaml">1. Initial “mymodel.yaml”</h4>
<pre><code class="language-powershell">epochs: 10
resume: None
learning_rate: **0.001**
weight_decay: 10e-6
inference_device: &quot;cuda&quot;

train:
  batch_size: 4
  num_workers: 4
  valid_size: 0.3
  train_path: &quot;./2_data/train&quot;

test:
  batch_size: 4
  num_workers: 4
  test_path: &quot;./2_data/val&quot;

model:
  base_model: &quot;convnext_base&quot;
</code></pre><p>   <img src="https://velog.velcdn.com/images/bona-park/post/4431ed03-ba91-4576-a265-23a1b5b4207b/image.png" alt=""></p>
<h4 id="2-convnext_baseyaml">2. “convnext_base.yaml”</h4>
<p>   I followed the same hyperparameters as <strong><a href="https://huggingface.co/ImageIN/convnext-base-224_finetuned_on_ImageIn_annotations">convnext-base-224_finetuned_on_ImageIn_annotations</a></strong>.
    The learning rate was decreased.</p>
<p>  <img src="https://velog.velcdn.com/images/bona-park/post/255beb99-9af8-4484-a0f7-2e132214b619/image.png" alt=""></p>
<pre><code class="language-powershell">      epochs: 10
      resume: None
      learning_rate: **2e-05**
      weight_decay: 10e-6
      inference_device: &quot;cuda&quot;

      train:
        batch_size: 16
        num_workers: 4
        valid_size: 0.3
        train_path: &quot;./2_data/train&quot;

      test:
        batch_size: 16
        num_workers: 4
        test_path: &quot;./2_data/val&quot;

      model:
        base_model: &quot;convnext_base&quot;</code></pre>
<p>  <img src="https://velog.velcdn.com/images/bona-park/post/11f1103f-70c5-41ba-a8eb-4a1dc0813d78/image.png" alt=""></p>
<p>   valid_loss: 2.0629</p>
<h4 id="3-convnext_base_1yaml">3. “convnext_base_1.yaml”</h4>
<p><img src="https://velog.velcdn.com/images/bona-park/post/fbd94957-e93e-40e7-930f-084ac6586953/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/e0179f95-cfc8-487a-a866-1f91f2e9446c/image.png" alt=""></p>
<pre><code class="language-powershell">epochs: 10
resume: None
learning_rate: 4e-3
weight_decay: 10e-6
inference_device: &quot;cuda&quot;

train:
  batch_size: 16
  num_workers: 4
  valid_size: 0.3
  train_path: &quot;./2_data/train&quot;

test:
  batch_size: 16
  num_workers: 4
  test_path: &quot;./2_data/val&quot;

model:
  base_model: &quot;convnext_base&quot;</code></pre>
<p><img src="https://velog.velcdn.com/images/bona-park/post/e945e237-04b9-4440-a540-eecb6276bc43/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/6151b217-ec58-4bda-b4d7-b2526af95fda/image.png" alt=""></p>
<h4 id="4-restnet50yaml">4. “RestNet50.yaml”</h4>
<p>   learning_rate: 2e-05, batch size: 16</p>
<p>   <img src="https://velog.velcdn.com/images/bona-park/post/837d8f37-63eb-4131-a466-0dda14cc7522/image.png" alt=""></p>
<h4 id="5-resnet50_batch256yaml">5. “ResNet50_batch256.yaml”</h4>
<p>   learning_rate: 0.001, batch size: 256
   <img src="https://velog.velcdn.com/images/bona-park/post/3b40ab2c-a60e-49c1-92dd-4938b53de2a6/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[ML] Hyperparameter Tuning: Learning rate and Batch size]]></title>
            <link>https://velog.io/@bona-park/ML-Hyperparameter-Tuning-Learning-rate-and-Batch-size</link>
            <guid>https://velog.io/@bona-park/ML-Hyperparameter-Tuning-Learning-rate-and-Batch-size</guid>
            <pubDate>Thu, 13 Oct 2022 09:25:15 GMT</pubDate>
            <description><![CDATA[<h3 id="batch-size-and-learning-rate">Batch size and Learning rate</h3>
<p><img src="https://velog.velcdn.com/images/bona-park/post/d58c03f7-b3ed-4175-8be1-abc48892136c/image.png" alt=""></p>
<p><a href="https://openreview.net/pdf?id=B1Yy1BxCZ">https://openreview.net/pdf?id=B1Yy1BxCZ</a></p>
<h4 id="1-batch-size">1. Batch Size</h4>
<p><img src="https://velog.velcdn.com/images/bona-park/post/56461999-3a80-4cc8-a845-aaf0aefd44a8/image.png" alt=""></p>
<ul>
<li>small: converges quickly at the cost of noise in the training process</li>
<li>large: converges slowly with accurate estimates of the error gradient</li>
</ul>
<h4 id="2-learning-rate">2. Learning Rate</h4>
<p>   <img src="https://velog.velcdn.com/images/bona-park/post/6c02d1fa-91be-423e-8053-47f057385e04/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/708a1bfe-28d1-4f6b-823e-b6e304dcf30c/image.png" alt=""></p>
<p>The most popular form of learning rate annealing is a <em>step decay</em>
 where the learning rate is reduced by some percentage after a set number of training epochs.</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/b3a5afba-8e27-4339-bf62-5b8f085ec9d7/image.png" alt=""></p>
<p><a href="https://www.jeremyjordan.me/nn-learning-rate/">https://www.jeremyjordan.me/nn-learning-rate/</a></p>
<p><a href="https://www.baeldung.com/cs/learning-rate-batch-size">https://www.baeldung.com/cs/learning-rate-batch-size</a></p>
<p><a href="https://inhovation97.tistory.com/32">https://inhovation97.tistory.com/32</a></p>
<h3 id="bag-of-tricks-for-image-classification-with-convolutional-neural-networks"><em><strong>Bag of Tricks for Image Classification with Convolutional Neural Networks</strong></em></h3>
<p><a href="https://arxiv.org/abs/1812.01187">https://arxiv.org/abs/1812.01187</a></p>
<ul>
<li><p><strong>Increase Batch size</strong></p>
</li>
<li><p><strong>Linear scaling learning rate</strong> (sketched below)</p>
</li>
<li><p><strong>Model Architecture Tweaks</strong></p>
<blockquote>
<p>“A model tweak is a minor adjustment to the network architecture, such as changing the stride of a particular convolution layer. Such a tweak often barely changes the computational complexity but might have a non-negligible effect on the model accuracy.”</p>
</blockquote>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/bona-park/post/a7f91d58-baa4-4c93-88cf-d28ad6a99b96/image.png" alt=""></p>
<ul>
<li><p><strong>Training Refinements</strong> (see the sketch after this list)</p>
<ul>
<li><p>Cosine Learning rate Decay</p>
</li>
<li><p>Label smoothing</p>
</li>
<li><p>Mixup training</p>
<p>  <img src="https://velog.velcdn.com/images/bona-park/post/205ed119-c1f6-4486-9b90-45df20595fc3/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
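<p>Two of the refinements above map directly onto PyTorch building blocks: cosine learning-rate decay and label smoothing (mixup is left out here). A minimal sketch, assuming PyTorch 1.10+ for the <code>label_smoothing</code> argument:</p>
<pre><code class="language-python"># Cosine LR decay + label smoothing in PyTorch (placeholder model and loop)
import torch
import torch.nn as nn

model = nn.Linear(512, 1000)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=120)  # decay over 120 epochs
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)  # soften the one-hot targets

for epoch in range(120):
    # ... training loop using criterion(output, target) goes here ...
    scheduler.step()</code></pre>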
<p><a href="https://norman3.github.io/papers/docs/bag_of_tricks_for_image_classification.html">https://norman3.github.io/papers/docs/bag_of_tricks_for_image_classification.html</a></p>
<p><a href="https://medium.com/analytics-vidhya/bag-of-tricks-for-image-classification-with-convolutional-neural-networks-99f00a9b9565">https://medium.com/analytics-vidhya/bag-of-tricks-for-image-classification-with-convolutional-neural-networks-99f00a9b9565</a></p>
<p><a href="https://phil-baek.tistory.com/entry/CNN-%EA%BF%80%ED%8C%81-Bag-of-Tricks-for-Image-Classification-with-Convolutional-Neural-Networks-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0">https://phil-baek.tistory.com/entry/CNN-꿀팁-Bag-of-Tricks-for-Image-Classification-with-Convolutional-Neural-Networks-논문-리뷰</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[ML] Various ways for Hyperparameter Tuning in Machine Learning]]></title>
            <link>https://velog.io/@bona-park/Hyperparameter-Tuning-in-Machine-Learning</link>
            <guid>https://velog.io/@bona-park/Hyperparameter-Tuning-in-Machine-Learning</guid>
            <pubDate>Thu, 13 Oct 2022 09:21:33 GMT</pubDate>
            <description><![CDATA[<h3 id="hyperparameter-tuning">Hyperparameter Tuning</h3>
<p>The process of finding the right combination of hyperparameters to maximize the model performance</p>
<h3 id="hyperparameter-tuning-methods">Hyperparameter tuning methods</h3>
<ul>
<li><p>Random Search</p>
</li>
<li><p>Grid Search</p>
<ul>
<li><p>Every combination of hyperparameters in the grid is tried, one after another, in a fixed order.</p>
<p>  The model is fitted on each combination, its performance is recorded, and the best model with the best hyperparameters is returned (see the sketch after this list).</p>
</li>
</ul>
</li>
<li><p>Bayesian Optimization</p>
</li>
<li><p>Tree-structured Parzen estimators(TPE)</p>
</li>
</ul>
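<p>A minimal grid-search sketch with scikit-learn&#39;s <code>GridSearchCV</code> (the estimator and parameter grid are arbitrary):</p>
<pre><code class="language-python"># Grid search: cross-validate every combination in param_grid and keep the best one
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {&quot;C&quot;: [0.1, 1, 10], &quot;gamma&quot;: [0.01, 0.1, 1]}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)</code></pre>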
<h3 id="hyperparameter-tuning-algorithms">Hyperparameter tuning algorithms</h3>
<ul>
<li>Hyperband</li>
<li>Population-based Training(PBT)<ul>
<li>a hybrid of Random Search and manual tuning</li>
<li>Many neural networks run in parallel but they are not fully independent of each other.</li>
<li>It uses the information from the rest of the networks to refine the hyperparameters and to determine which hyperparameter values to try next</li>
</ul>
</li>
</ul>
<h3 id="useful-libraries-for-hyperparameter-optimization">Useful libraries for hyperparameter optimization</h3>
<ul>
<li><p>Optuna</p>
<ul>
<li>Efficiently search large spaces and prune unpromising trials for faster results</li>
</ul>
</li>
<li><p>Ray Tune</p>
<blockquote>
<p>Tune is a Python library for experiment execution and hyperparameter tuning at any scale. You can tune your favorite machine learning framework (<a href="https://docs.ray.io/en/latest/tune/examples/tune-pytorch-cifar.html#tune-pytorch-cifar-ref">PyTorch</a>
  , <a href="https://docs.ray.io/en/latest/tune/examples/tune-xgboost.html#tune-xgboost-ref">XGBoost</a>, <a href="https://docs.ray.io/en/latest/tune/examples/tune-sklearn.html">Scikit-Learn</a>, <a href="https://docs.ray.io/en/latest/tune/examples/tune_mnist_keras.html">TensorFlow and Keras</a>, and <a href="https://docs.ray.io/en/latest/tune/examples/index.html">more</a>) by running state of the art algorithms such as <a href="https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-pbt">Population Based Training (PBT)</a>  and <a href="https://docs.ray.io/en/latest/tune/api_docs/schedulers.html#tune-scheduler-hyperband">HyperBand/ASHA</a>.</p>
</blockquote>
<blockquote>
<p><strong>Two Benefits</strong></p>
<ul>
<li><strong>They maximize model performance</strong>: e.g., DeepMind uses PBT to <a href="https://www.reddit.com/r/MachineLearning/comments/ajgzoc/we_are_oriol_vinyals_and_david_silver_from/">achieve superhuman performance on StarCraft</a>; Waymo uses <a href="https://www.technologyreview.com/s/614004/deepmind-is-helping-waymo-evolve-better-self-driving-ai-algorithms/">PBT to enable self-driving cars</a>.</li>
<li><strong>They minimize training costs</strong>: <a href="https://determined.ai/blog/addressing-challenges-parallel-hyperparameter-optimization/">HyperBand and ASHA converge to high-quality configurations</a> in half the time taken by previous approaches; <a href="https://arxiv.org/abs/1905.05393">population-based data augmentation algorithms</a> cut costs by orders of magnitude.</li>
</ul>
<p>cf) ASHA: one of the popular early-stopping algorithms</p>
</blockquote>
<p>  <img src="https://velog.velcdn.com/images/bona-park/post/f7e9f1a4-8ae5-4616-aa1b-7084f18aefc5/image.png" alt=""></p>
<p>  To be brief, Ray Tune scales your training from a single machine to a large distributed cluster without changing your code.</p>
<ol>
<li>Simplifies scaling<ul>
<li>It lets you use all of the cores and GPUs on the machine</li>
<li>This enables parallel, asynchronous hyperparameter tuning</li>
</ul>
</li>
<li>Flexible<ul>
<li>Supports any ML framework (PyTorch, TensorFlow, ...)</li>
<li>Provides a flexible interface for optimization algorithms</li>
<li>Results can be visualized with MLflow and TensorBoard</li>
</ul>
</li>
</ol>
</li>
</ul>
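<p>For a feel of the API, here is a rough sketch assuming Ray Tune&#39;s classic <code>tune.run</code>/<code>tune.report</code> interface (newer releases use <code>tune.Tuner</code>; see the docs linked above for current usage):</p>
<pre><code class="language-python"># Rough Ray Tune sketch: sample 20 learning rates and keep the best trial
from ray import tune

def trainable(config):
    # ... train a model with config[&quot;lr&quot;] here; dummy objective for illustration ...
    tune.report(accuracy=1.0 - abs(config[&quot;lr&quot;] - 0.01))

analysis = tune.run(
    trainable,
    config={&quot;lr&quot;: tune.loguniform(1e-4, 1e-1)},
    num_samples=20,
    metric=&quot;accuracy&quot;,
    mode=&quot;max&quot;,
)
print(analysis.best_config)</code></pre>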
<p>ref) <a href="https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide">https://neptune.ai/blog/hyperparameter-tuning-in-python-complete-guide</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Spark] Spark DataFrame / SQL]]></title>
            <link>https://velog.io/@bona-park/Spark-DataFrame-SQL</link>
            <guid>https://velog.io/@bona-park/Spark-DataFrame-SQL</guid>
            <pubDate>Sat, 08 Oct 2022 06:53:10 GMT</pubDate>
            <description><![CDATA[<ul>
<li><p>Goals</p>
<ol>
<li>Understand Spark DataFrame and Dataset, which make it easy to work with structured data</li>
<li>Run SQL operations on Spark DataFrames and Datasets (see the sketch after this list)</li>
</ol>
</li>
<li><p>Spark SQL features</p>
<ol>
<li>Integrated<ul>
<li>Combines Spark programs and SQL queries seamlessly</li>
<li>Spark SQL lets you query structured data stored inside Spark as RDDs through a unified API</li>
</ul>
</li>
<li>Unified Data Access<ul>
<li>Data can be loaded and queried from a variety of sources</li>
</ul>
</li>
<li>Standard Connectivity</li>
</ol>
</li>
<li><p>DataFrame</p>
<ul>
<li>RDD+Schema</li>
<li>Like a table in a relational database</li>
<li>Designed to handle large volumes of data conveniently</li>
</ul>
</li>
<li><p>RDD vs DataFrame</p>
<ul>
<li>RDD: carries no description (schema) of the data</li>
<li>DataFrame: carries a description (schema) of the data</li>
</ul>
</li>
<li><p>Programming Interface</p>
<ul>
<li>Internally, Spark SQL works through the unified DataFrame API and presents the results in a readable form</li>
<li>Catalyst: Spark SQL&#39;s query optimizer, which optimizes the execution plan</li>
</ul>
</li>
<li><p>DataSet</p>
<ul>
<li>Almost the same as a DataFrame; the DataFrame API was merged into the Dataset API, so Dataset is the broader concept</li>
<li>A DataFrame is a Dataset of Rows: DF = Dataset[Row]</li>
<li>A Dataset can hold many types other than Row</li>
</ul>
</li>
<li><p>Key difference: when schema inference happens</p>
<ul>
<li>DataFrame: the schema is inferred at runtime (when the data is received)</li>
<li>Dataset: the schema is fixed at compile time, so errors are caught earlier and optimization works better; only available in compiled languages (not in Python)</li>
</ul>
</li>
<li><p>RDD vs Spark SQL</p>
</li>
</ul>
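<p>A minimal PySpark sketch for the goals above: build a DataFrame (RDD + schema) and query it with Spark SQL (names and data are made up):</p>
<pre><code class="language-python"># DataFrame = RDD + schema; Spark SQL can query it once it is registered as a view
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(&quot;df-sql-demo&quot;).getOrCreate()

df = spark.createDataFrame(
    [(&quot;kim&quot;, 31), (&quot;lee&quot;, 25), (&quot;park&quot;, 40)],
    [&quot;name&quot;, &quot;age&quot;],
)
df.printSchema()                      # the schema is exactly what a plain RDD lacks

df.createOrReplaceTempView(&quot;people&quot;)  # register the DataFrame as a SQL view
spark.sql(&quot;SELECT name FROM people WHERE age &gt; 30&quot;).show()</code></pre>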
]]></description>
        </item>
        <item>
<title><![CDATA[[파이썬 머신러닝 완벽가이드] Chapter 8: Text Analysis]]></title>
            <link>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-8%EC%9E%A5-Text-Analysis</link>
            <guid>https://velog.io/@bona-park/%ED%8C%8C%EC%9D%B4%EC%8D%AC-%EB%A8%B8%EC%8B%A0%EB%9F%AC%EB%8B%9D-%EC%99%84%EB%B2%BD%EA%B0%80%EC%9D%B4%EB%93%9C-8%EC%9E%A5-Text-Analysis</guid>
            <pubDate>Fri, 07 Oct 2022 13:55:48 GMT</pubDate>
            <description><![CDATA[<br>

<ul>
<li><p>NLP vs. text analytics</p>
</li>
<li><p>Main areas of text analytics
  Text classification (which category does a document belong to)
  Sentiment analysis (analyzing subjective elements such as mood expressed in the text)
  Text summarization
  Text clustering and similarity measurement</p>
</li>
<li><p>ML process for text analytics
  Data preprocessing -&gt; Feature Vectorization -&gt; ML training/prediction/evaluation</p>
</li>
</ul>
<p><a href="https://www.researchgate.net/figure/Text-mining-process-Source-Chakraborty-Pagolu-and-Garla-2013_fig1_262413948"><img src="https://velog.velcdn.com/images/bona-park/post/9ffab14c-d08b-4859-85ac-0819dfb51034/image.png" alt=""></a></p>
<ul>
<li><p>Python-based NLP / text-analytics packages
  NLTK, Gensim (topic modeling), spaCy</p>
</li>
<li><p>Text preprocessing: text normalization
  Cleansing: removing HTML/XML tags and certain symbols
  Tokenization: sentence / word tokenization
  Filtering / stop-word removal / spelling correction: e.g. dropping articles
  Stemming / lemmatization: extracting the root form of words</p>
</li>
<li><p>N-gram
  Tokenizing a sentence into individual words throws away contextual meaning
  -&gt; so treat a sequence of n words as one tokenization unit
  Tokenization is done by sliding a window of n words</p>
</li>
</ul>
<ul>
<li><p>Types of text feature vectorization
  BOW (Bag of Words)</p>
<pre><code>  Document-Term Matrix
  Word occurrence counts arranged as a matrix</code></pre><p>  Word Embedding</p>
<pre><code>  Each word is represented as a vector in an N-dimensional space based on its context</code></pre></li>
<li><p>BOW
  A model that ignores context and word order
  and extracts feature values by assigning a frequency value to every word
  (the name comes from putting all the words of a document into a bag and shaking it up)</p>
</li>
<li><p>BOW structure
  Columns are words and rows are documents; each cell records how many times that column&#39;s word appears</p>
<p>  Pros: easy and fast to build</p>
<pre><code>      Captures document characteristics better than expected, so it has traditionally been widely used</code></pre><p>  Cons: cannot reflect contextual meaning</p>
<pre><code>      Sparse-matrix problem (far too many empty/zero entries)</code></pre></li>
<li><p>BOW feature vectorization</p>
<ol>
<li><p>Simple count-based vectorization
 Each word is given the number of times it appears in each document</p>
</li>
<li><p>TF-IDF vectorization
 Words are weighted so that words appearing across many documents are penalized</p>
<p> A word that appears only in certain documents is considered likely to be an important word that characterizes those documents.</p>
<p> TF (Term Frequency): how many times the word appears in that document
 DF (Document Frequency): how many documents in the corpus contain the word
 IDF: total number of documents / DF</p>
</li>
</ol>
</li>
</ul>
<ul>
<li><p>Feature vectorization with CountVectorizer
  Data preprocessing -&gt; tokenization -&gt; text normalization -&gt; feature vectorization</p>
</li>
<li><p>Sparse-matrix storage formats
  COO format: only the non-zero values are stored in separate arrays
  CSR format: fixes COO&#39;s problem of storing redundant position-index values</p>
</li>
<li><p>Classifying 20 Newsgroups *
  Classify 18,846 news documents into 20 categories</p>
<p>  Text normalization -&gt; feature vectorization -&gt; ML model -&gt; apply a Pipeline -&gt; optimize with GridSearchCV (see the sketch after this list)</p>
</li>
</ul>
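<p>A minimal sketch of that 20 Newsgroups flow in scikit-learn: TF-IDF vectorization, a classifier, a Pipeline, and GridSearchCV (the classifier choice and parameter grid are arbitrary):</p>
<pre><code class="language-python"># TF-IDF features + classifier in one Pipeline, tuned with GridSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

news = fetch_20newsgroups(subset=&quot;train&quot;, remove=(&quot;headers&quot;, &quot;footers&quot;, &quot;quotes&quot;))

pipe = Pipeline([
    (&quot;tfidf&quot;, TfidfVectorizer(stop_words=&quot;english&quot;, ngram_range=(1, 2))),
    (&quot;clf&quot;, LogisticRegression(max_iter=1000)),
])
params = {&quot;tfidf__max_df&quot;: [0.8, 0.95], &quot;clf__C&quot;: [1, 10]}

search = GridSearchCV(pipe, params, cv=3)
search.fit(news.data, news.target)
print(search.best_params_)</code></pre>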
<ul>
<li><p>Sentiment analysis
  Supervised-learning-based analysis</p>
<p>  Lexicon-based analysis</p>
<pre><code>  SentiWordNet: three sentiment scores per synset (positivity, negativity, objectivity)
  VADER: sentiment analysis for social-media text
  Pattern</code></pre><ul>
<li>Predicting positive/negative sentiment of IMDB movie reviews</li>
</ul>
</li>
<li><p>SentiWordNet sentiment scores<br>  horizontal axis: positive vs. negative, vertical axis: degree of objectivity</p>
<p>  Split the document into sentences
  -&gt; tokenize the words and POS-tag them
  -&gt; create synset and senti_synset objects
  -&gt; get the positive/negative sentiment scores from senti_synset and classify against a threshold</p>
</li>
<li><p>VADER
  Uses the SentimentIntensityAnalyzer class (see the sketch after this list)</p>
</li>
</ul>
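<p>A minimal VADER sketch with NLTK&#39;s <code>SentimentIntensityAnalyzer</code> (the decision threshold here is an arbitrary choice):</p>
<pre><code class="language-python"># VADER sentiment scores with NLTK
import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download(&quot;vader_lexicon&quot;)  # one-time download of the lexicon

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores(&quot;This movie was surprisingly good!&quot;)
print(scores)  # keys: neg, neu, pos, compound

label = &quot;positive&quot; if scores[&quot;compound&quot;] &gt;= 0.1 else &quot;negative&quot;  # threshold-based decision
print(label)</code></pre>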
<ul>
<li><p>Topic modeling
  A technique for extracting the common topics latent in a set of documents</p>
<ol>
<li><p>Matrix-factorization-based topic modeling (LSA, NMF)</p>
</li>
<li><p>Probability-based topic modeling</p>
</li>
</ol>
</li>
<li><p>LDA (Latent Dirichlet Allocation)
  Uses the words within documents and Bayesian inference
  to infer the latent per-document topic distribution and per-topic word distribution</p>
</li>
<li><p>Conjugate prior distributions in Bayesian inference
  Binomial distribution -&gt; Beta distribution
  Multinomial distribution -&gt; Dirichlet distribution</p>
</li>
<li><p>LDA components</p>
</li>
<li><p>LDA process (see the sketch after this list)
  Build a count-based matrix
  Set the number of topics
  After topics are randomly assigned at first, the per-document topic distribution and per-topic word distribution are determined
  Take a particular word out and recompute the two distributions
  Recompute the topic-assignment distribution for all words</p>
</li>
</ul>
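<p>A minimal LDA sketch following the process above, using scikit-learn&#39;s <code>LatentDirichletAllocation</code> on a count matrix (the corpus, topic count, and other numbers are arbitrary; assumes scikit-learn 1.0+ for <code>get_feature_names_out</code>):</p>
<pre><code class="language-python"># Count-based matrix -&gt; LDA -&gt; top words per topic
from sklearn.datasets import fetch_20newsgroups
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = fetch_20newsgroups(subset=&quot;train&quot;, remove=(&quot;headers&quot;, &quot;footers&quot;, &quot;quotes&quot;)).data[:2000]

vect = CountVectorizer(max_df=0.95, min_df=2, stop_words=&quot;english&quot;)
counts = vect.fit_transform(docs)          # LDA works on raw counts, not TF-IDF

lda = LatentDirichletAllocation(n_components=8, random_state=0).fit(counts)

words = vect.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):   # per-topic word distribution
    top = [words[i] for i in topic.argsort()[-5:]]
    print(topic_idx, top)</code></pre>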
<ul>
<li>Document clustering</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[GDSC-ML] Apply PyTorch template to  Mnist classification]]></title>
            <link>https://velog.io/@bona-park/GDSC-ML-Apply-PyTorch-template-to-Mnist-classification</link>
            <guid>https://velog.io/@bona-park/GDSC-ML-Apply-PyTorch-template-to-Mnist-classification</guid>
            <pubDate>Thu, 06 Oct 2022 05:31:01 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/bona-park/post/4f375675-188d-435b-adbb-a6ca8cae84a8/image.png" alt=""></p>
<p>The second GDSC-ML session was to convert a Jupyter Notebook of an MNIST CNN model into Python scripts.</p>
<p>Like most people, I was used to doing ML projects in Jupyter notebooks. They have a big advantage: I can validate and check the code easily by just pressing (Shift + Enter).</p>
<p>But there are some drawbacks to Jupyter Notebooks in data science projects:</p>
<ul>
<li>Unorganized<ul>
<li>hard to keep track of what I write</li>
</ul>
</li>
<li>Not ideal for reproducibility<ul>
<li>if the data changes slightly, it is hard to identify the source of an error</li>
</ul>
</li>
<li>Not ideal for production<ul>
<li>hard to run Jupyter Notebook while using other tools</li>
</ul>
</li>
</ul>
<p>This time I used <a href="https://github.com/victoresque/pytorch-template">@victoresque&#39;s <em>Pytorch-Template</em></a>.
It provides a clear folder structure suitable for many deep learning projects.</p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/f2ca8a6b-d1a1-4ca6-b926-d7e5a2d3e5b3/image.png" alt=""></p>
<h3 id="previous-mnist_examleipynb-outline">previous <em>mnist_examle.ipynb</em> outline</h3>
<ul>
<li>set device</li>
<li>hyperparameter setting</li>
<li>load dataset</li>
<li>see dataset (we don&#39;t need it)</li>
<li>use torch dataloader to divide dataset into mini-batch<ul>
<li>define 2 Dataloaders: <code>train_loader</code> <code>test_loader</code></li>
</ul>
</li>
<li>define Basic CNN class</li>
<li>create a model instance from the preconfigured class</li>
<li>define optimizer -&gt; <em>now set in &#39;config.json&#39;</em></li>
<li>choose <strong>Loss function</strong></li>
<li><strong>train</strong> the model </li>
<li>update gradients at each training step</li>
<li><strong>evaluate</strong> the model</li>
</ul>
<br>

<p>Because both projects are dealing with MNIST, there weren&#39;t that many things to change from the template.</p>
<p>We only need to change some parameters in the <em>config.json</em> file and adjust the model structure in <em>model.py</em>.</p>
<h3 id="new-python-scripts-using-a-template">New Python Scripts using a template</h3>
<blockquote>
<h4 id="configjson">config.json</h4>
</blockquote>
<pre><code class="language-json">{
  &quot;name&quot;: &quot;Mnist_LeNet&quot;,
  &quot;n_gpu&quot;: 1,

  &quot;arch&quot;: {
    &quot;type&quot;: &quot;MnistModel&quot;,
    &quot;args&quot;: {}
  },
  &quot;data_loader&quot;: {
    &quot;type&quot;: &quot;MnistDataLoader&quot;,
    &quot;args&quot;: {
      &quot;data_dir&quot;: &quot;data/&quot;,
      &quot;batch_size&quot;: 50,
      &quot;shuffle&quot;: true,
      &quot;validation_split&quot;: 0.1,
      &quot;num_workers&quot;: 2
    }
  },
  &quot;optimizer&quot;: {
    &quot;type&quot;: &quot;Adam&quot;,
    &quot;args&quot;: {
      &quot;lr&quot;: 0.0001,
      &quot;weight_decay&quot;: 0,
      &quot;amsgrad&quot;: true
    }
  },
  &quot;loss&quot;: &quot;nll_loss&quot;,
  &quot;metrics&quot;: [&quot;accuracy&quot;, &quot;top_k_acc&quot;],
  &quot;lr_scheduler&quot;: {
    &quot;type&quot;: &quot;StepLR&quot;,
    &quot;args&quot;: {
      &quot;step_size&quot;: 50,
      &quot;gamma&quot;: 0.1
    }
  },
  &quot;trainer&quot;: {
    &quot;epochs&quot;: 100,

    &quot;save_dir&quot;: &quot;saved/&quot;,
    &quot;save_period&quot;: 1,
    &quot;verbosity&quot;: 2,

    &quot;monitor&quot;: &quot;min val_loss&quot;,
    &quot;early_stop&quot;: 10,

    &quot;tensorboard&quot;: true
  }
}
</code></pre>
<br>


<blockquote>
<h4 id="modelpy">model.py</h4>
</blockquote>
<pre><code class="language-python">class MnistModel(BaseModel):
        def __init__(self): ..
        def forward(self, x): ..</code></pre>
<blockquote>
<h4 id="basemodelpy">BaseModel.py</h4>
</blockquote>
<pre><code class="language-python">class BaseModel(nn.Module):
    &quot;&quot;&quot;
    Base class for all models
    &quot;&quot;&quot;
    @abstractmethod
    def forward(self, *inputs):
        &quot;&quot;&quot;
        Forward pass logic
        :return: Model output
        &quot;&quot;&quot;
        raise NotImplementedError</code></pre>
<br>

<blockquote>
<h4 id="trainpy">train.py</h4>
</blockquote>
<pre><code class="language-python"># setup data_loader instances with following conditions in config.json
data_loader = config.init_obj(&quot;data_loader&quot;, module_data)
valid_data_loader = data_loader.split_validation()

# build model architecture, then print to console
model = config.init_obj(&quot;arch&quot;, module_arch)
logger.info(model)

# prepare for (multi-device) GPU training
device, device_ids = prepare_device(config[&quot;n_gpu&quot;])
model = model.to(device)
if len(device_ids) &gt; 1:
    model = torch.nn.DataParallel(model, device_ids=device_ids)

# get function handles of loss and metrics
criterion = getattr(module_loss, config[&quot;loss&quot;])
metrics = [getattr(module_metric, met) for met in config[&quot;metrics&quot;]]

# build optimizer, learning rate scheduler. delete every lines containing lr_scheduler for disabling scheduler
trainable_params = filter(lambda p: p.requires_grad, model.parameters())
optimizer = config.init_obj(&quot;optimizer&quot;, torch.optim, trainable_params)
lr_scheduler = config.init_obj(&quot;lr_scheduler&quot;, torch.optim.lr_scheduler, optimizer) 

trainer = Trainer(
        model,
        criterion,
        metrics,
        optimizer,
        config=config,
        device=device,
        data_loader=data_loader,
        valid_data_loader=valid_data_loader,
        lr_scheduler=lr_scheduler,
    )
trainer.train()</code></pre>
<br>

<blockquote>
<blockquote>
<h4 id="load_state_dictstate_dict-stricttrue">load_state_dict(state_dict, strict=True)</h4>
<p>Copies parameters and buffers from state_dict into this module and its descendants. If strict is True, then the keys of state_dict must exactly match the keys returned by this module’s state_dict() function.
<a href="https://pytorch.org/tutorials/beginner/saving_loading_models.html">more: saving_loading_models</a></p>
</blockquote>
</blockquote>
<p><img src="https://velog.velcdn.com/images/bona-park/post/53d57b0f-0539-4476-a08b-7089ac42fd47/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/d5d3d7a3-e756-46d9-9506-f81f5affc9c5/image.png" alt=""></p>
<p>&ensp;&ensp;&ensp;&ensp;&ensp;&ensp; <em>path: ./saved/models/Mnist_LeNet/1006_133305/model_best.pth</em></p>
<br>

<blockquote>
<h4 id="testpy">test.py</h4>
</blockquote>
<pre><code class="language-python">    data_loader = getattr(module_data, config[&quot;data_loader&quot;][&quot;type&quot;])(
        config[&quot;data_loader&quot;][&quot;args&quot;][&quot;data_dir&quot;],
        batch_size=50,  #
        shuffle=True,  #
        validation_split=0.0,
        training=False,
        num_workers=2,
    )

    # build model architecture
    model = config.init_obj(&quot;arch&quot;, module_arch)
    logger.info(model)

    # get function handles of loss and metrics
    loss_fn = getattr(module_loss, config[&quot;loss&quot;])
    metric_fns = [getattr(module_metric, met) for met in config[&quot;metrics&quot;]]

    logger.info(&quot;Loading checkpoint: {} ...&quot;.format(config.resume))
    checkpoint = torch.load(config.resume)
    state_dict = checkpoint[&quot;state_dict&quot;]
    if config[&quot;n_gpu&quot;] &gt; 1:
        model = torch.nn.DataParallel(model)
    model.load_state_dict(state_dict)

    # prepare model for testing
    device = torch.device(&quot;cuda&quot; if torch.cuda.is_available() else &quot;cpu&quot;)
    model = model.to(device)
    model.eval()

    total_loss = 0.0
    total_metrics = torch.zeros(len(metric_fns))

    with torch.no_grad():
        for i, (data, target) in enumerate(tqdm(data_loader)):
            data, target = data.to(device), target.to(device)
            output = model(data)

            #
            # save sample images, or do something with output here
            #

            # computing loss, metrics on test set
            loss = loss_fn(output, target)
            batch_size = data.shape[0]
            total_loss += loss.item() * batch_size
            for i, metric in enumerate(metric_fns):
                total_metrics[i] += metric(output, target) * batch_size

    n_samples = len(data_loader.sampler)
    log = {&quot;loss&quot;: total_loss / n_samples}
    log.update(
        {met.__name__: total_metrics[i].item() / n_samples for i, met in enumerate(metric_fns)}
    )
    logger.info(log)</code></pre>
<p>Running Command: <code>python test.py --resume ./saved/models/Mnist_LeNet/1006_133305/model_best.pth</code></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/1b7c76da-0997-49db-9cf1-059fda6d68fd/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Processing] how to enable auto-complete in Processing]]></title>
            <link>https://velog.io/@bona-park/Processing-auto-complete</link>
            <guid>https://velog.io/@bona-park/Processing-auto-complete</guid>
            <pubDate>Sat, 01 Oct 2022 14:47:58 GMT</pubDate>
            <description><![CDATA[<h4 id="❔-how-to-enable-auto-complete-in-processing">❔ how to enable auto-complete in Processing?</h4>
<p><img src="https://velog.velcdn.com/images/bona-park/post/35f364b0-9f71-4f20-b39a-6fcb46243dd6/image.png" alt=""></p>
<h4 id="✅solution">✅solution</h4>
<ol>
<li><p>check <em>code completion with CTRL+space</em>
<img src="https://velog.velcdn.com/images/bona-park/post/4d5853a2-6452-419f-819d-a0ced8f32298/image.png" alt=""></p>
</li>
<li><p>find &quot;preferences.txt&quot; and change <code>pdex.completion.trigger=true</code></p>
</li>
</ol>
<p><img src="https://velog.velcdn.com/images/bona-park/post/f24ee7ea-ef4a-4870-b18f-656793bdf276/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/bona-park/post/7cd6b8f5-ad8e-4041-a556-fa604b35a58a/image.png" alt=""></p>
<ol start="3">
<li>Autocomplete works</li>
</ol>
<p><img src="https://velog.velcdn.com/images/bona-park/post/d2f884da-7513-4968-87b1-01b95349b0df/image.png" alt=""></p>
<p>▶Processing version: processing-4.0.1</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Spark RDD]]></title>
            <link>https://velog.io/@bona-park/Spark-RDD</link>
            <guid>https://velog.io/@bona-park/Spark-RDD</guid>
            <pubDate>Sat, 01 Oct 2022 05:20:47 GMT</pubDate>
            <description><![CDATA[<ul>
<li><p>RDD: Resilient Distributed Dataset</p>
<ul>
<li>A data structure Spark introduced for processing large data sets in-memory in a distributed environment</li>
<li>Purpose: abstract away the nature of distributed collections and their fault tolerance -&gt; intuitive and efficient processing of large data sets</li>
</ul>
</li>
<li><p>RDD operations: lazy execution</p>
<ul>
<li>Instead of executing an operation immediately, computation is delayed until its result is actually needed</li>
</ul>
</li>
<li><p>Spark does not run operations right away; it records the computation steps in the lineage and only executes them when a result has to be shown or returned.</p>
<ol>
<li><p>The computation steps of the transformation operators are <strong>recorded</strong> in the RDD lineage</p>
</li>
<li><p>When an action operator is called, an optimized execution plan is built from the lineage</p>
</li>
<li><p>The optimized computation is then executed</p>
<ul>
<li><p><strong>Action operators</strong>: operations that return a result, such as count and collect</p>
<ul>
<li>Operations that return a final value or write values to external storage</li>
<li>When called, the computation runs along the optimized path and the final result is returned =&gt; &quot; RDD in, Other out &quot;</li>
</ul>
</li>
<li><p><strong>Transformation operators</strong>: operations such as map and filter whose results do not need to be shown right away</p>
<ul>
<li>Operations that transform the shape of an (immutable) RDD</li>
<li>They take an existing RDD and return a new RDD =&gt; &quot; RDD in, RDD out &quot;</li>
<li>Nothing actually runs until an action operation is called (lazy execution)</li>
</ul>
</li>
</ul>
</li>
</ol>
</li>
<li><p>RDD operations (see the sketch after this list)</p>
<ol>
<li>Map <pre><code>rdd.map(lambda x : x + 1)
# applied to an RDD holding (1, 2, 3, 3), this returns (2, 3, 4, 4)</code></pre></li>
</ol>
</li>
</ul>
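<p>A minimal PySpark sketch of the transformation/action split and lazy execution (the data is made up):</p>
<pre><code class="language-python"># map/filter are transformations (RDD in, RDD out); nothing runs until an action is called
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName(&quot;rdd-demo&quot;).getOrCreate()
sc = spark.sparkContext

rdd = sc.parallelize([1, 2, 3, 3])
plus_one = rdd.map(lambda x: x + 1)            # transformation: only recorded in the lineage
evens = plus_one.filter(lambda x: x % 2 == 0)  # still nothing has been computed

print(evens.collect())   # action: the lineage is optimized and executed -&gt; [2, 4, 4]
print(plus_one.count())  # action -&gt; 4</code></pre>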
]]></description>
        </item>
        <item>
<title><![CDATA[[Hadoop & Spark] How Hadoop's map/reduce differs from Spark's RDD operations]]></title>
            <link>https://velog.io/@bona-park/Hadoop%EC%9D%98-mapreduce%EC%99%80-spark%EC%9D%98-RDD%EC%97%B0%EC%82%B0%EC%9D%80-%EC%96%B4%EB%96%A4-%EC%A0%90%EC%97%90%EC%84%9C-%EB%8B%A4%EB%A5%B8%EC%A7%80</link>
            <guid>https://velog.io/@bona-park/Hadoop%EC%9D%98-mapreduce%EC%99%80-spark%EC%9D%98-RDD%EC%97%B0%EC%82%B0%EC%9D%80-%EC%96%B4%EB%96%A4-%EC%A0%90%EC%97%90%EC%84%9C-%EB%8B%A4%EB%A5%B8%EC%A7%80</guid>
            <pubDate>Fri, 30 Sep 2022 15:25:51 GMT</pubDate>
<description><![CDATA[<p> ❔ What is the difference between Hadoop&#39;s map/reduce and Spark&#39;s RDD operations?</p>
<p><img src = "https://velog.velcdn.com/images/bona-park/post/47c99e84-8aba-4396-be9b-04c34694e87c/image.png
" width = "50%" height = "50%">  </p>
<ul>
<li><p>Hadoop processes data in a distributed way with MapReduce: to handle data stored across many locations, it processes it through the map/reduce model.</p>
</li>
<li><p>Spark also supports a MapReduce-style processing model and can likewise process data stored in many places with map/reduce.</p>
<ul>
<li>The difference between the two is whether the data is kept in memory or on disk while it is processed.</li>
</ul>
</li>
<li><p><strong>Hadoop, by default, reads the data to map/reduce from disk</strong> and writes the results back to disk. Reads and writes are therefore slow, but it can process as much data at once as the disk can hold.</p>
</li>
<li><p>In contrast, <strong>Spark reads the data to map/reduce from memory</strong> and writes the results to memory. Reads and writes are therefore fast, but it can only process as much data at once as fits in memory.
In short, both Hadoop and Spark support MapReduce, but Hadoop is disk-based MapReduce while Spark is memory-based MapReduce.</p>
</li>
<li><p>When MapReduce is a good fit
1) Linear processing of huge data sets: Hadoop MapReduce can process massive amounts of data in parallel, and when the result data set is larger than the available RAM, Hadoop MapReduce can outperform Spark.<br>
2)    It is an economical solution when results are not needed immediately.</p>
</li>
<li><p>When Spark is a good fit:
1)    Fast data processing<br>
2)    Iterative processing<br>
3)    Spark&#39;s RDDs (Resilient Distributed Datasets) allow multiple map operations in memory, whereas Hadoop MapReduce has to write intermediate results to disk.<br>
4)    Machine learning: Spark ships with MLlib, a built-in machine learning library whose ready-to-use algorithms also run in memory.</p>
</li>
</ul>
<p><em>ref)</em></p>
<p><a href="https://wooono.tistory.com/50">https://wooono.tistory.com/50</a>
        <a href="https://sunrise-min.tistory.com/entry/MapReduce-vs-Spark-%EB%A7%B5%EB%A6%AC%EB%93%80%EC%8A%A4%EC%99%80-%EC%8A%A4%ED%8C%8C%ED%81%AC%EC%9D%98-%EC%B0%A8%EC%9D%B4%EC%A0%90">https://sunrise-min.tistory.com/entry/MapReduce-vs-Spark-%EB%A7%B5%EB%A6%AC%EB%93%80%EC%8A%A4%EC%99%80-%EC%8A%A4%ED%8C%8C%ED%81%AC%EC%9D%98-%EC%B0%A8%EC%9D%B4%EC%A0%90</a>
        <a href="https://3months.tistory.com/5__11">https://3months.tistory.com/5__11</a></p>
]]></description>
        </item>
    </channel>
</rss>