yugyeong_929.log

[UoT] Introduction to Deep Learning (2)

Mon, 24 Feb 2025 18:38:39 GMT

Week3

ANNs Part2

Neural Network Architecture

Architecture: the way neurons are arranged and connected within a neural network.

Multi-Layer Perceptron (MLP):

Feed-forward and fully connected
Linear layers + nonlinear activations

Ex) architectures: MLP, CNN, RNN, ...

Neural Network Training

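Training: the process of learning the model's weights and other parameters from data
Loss function: measures how far the model's predictions are from the actual values

Optimizer: adjusts the model's weights so that the loss decreases

Methods: (1) pick weights at random

    (2) change one weight at a time in the direction that reduces the error
    (3) use Gradient Descent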

Gradient Descent

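1-layer case: move the weights step by step against the gradient (downhill) toward the point of minimum loss.
2-layer case: the loss surface is non-convex, so optimization has to cope with that.

Types of Critical Points

Local minima
Local maxima
Saddle points
Plateaus (flat regions)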

Auto Differentiation

Computing the gradients of a neural network by hand is complicated and error-prone.
Frameworks that support automatic differentiation: PyTorch, TensorFlow, Keras, Theano, etc.
Derivatives are computed over a computation graph.
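As a minimal sketch of what automatic differentiation does, the snippet below lets PyTorch compute the gradients of a tiny squared-error loss and compares them with the hand-derived values. The scalar values of `x`, `y`, `w`, and `b` are made up for illustration.

```python
import torch

# Hypothetical scalar example: loss = (w * x + b - y)^2
x = torch.tensor(2.0)
y = torch.tensor(7.0)
w = torch.tensor(1.0, requires_grad=True)
b = torch.tensor(0.0, requires_grad=True)

loss = (w * x + b - y) ** 2   # the forward pass records the computation graph
loss.backward()               # the backward pass fills w.grad and b.grad

# Hand-derived gradients for comparison:
#   dloss/dw = 2 * (w*x + b - y) * x
#   dloss/db = 2 * (w*x + b - y)
residual = (w * x + b - y).item()
manual_dw = 2 * residual * x.item()
manual_db = 2 * residual

print(w.grad.item(), manual_dw)
print(b.grad.item(), manual_db)
```

For a two-parameter model the manual derivation is easy; the point of the framework is that the same `backward()` call works unchanged for networks with millions of parameters.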

Multi-Class Classification

Classify an input as one of the digits 0-9

Hyperparameter Tuning

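Hyperparameters:
- Batch Size
- Learning Rate
- Size of Network (number of layers, number of neurons)
- Activation Function
Batching
- Each training step uses n samples and computes their average loss
- Small batch -> the optimization process is unstable (noisy updates)
- Batch too large -> computation becomes expensive
Learning Rate
- Too small: training takes a long time.
- Too large: training diverges (unstable)
- A common approach is to reduce the learning rate as training progresses
Optimization methods
- SGD
- Momentum
- RMSProp
- Adam (SGD + Momentum + RMSProp)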

Regularization

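L2 regularization (Weight Decay)
- Shrinks the magnitude of the weights, which keeps the model from becoming too complex
L1 regularization (Lasso)
- Drives some weights to exactly 0, which makes feature selection possible
Dropout
- Randomly deactivates neurons during training to prevent overfitting

Expected Questions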

What is ANN?

An ANN is a machine learning model inspired by the way neurons in the human brain process information. It consists of interconnected neurons with weights and can learn patterns from data to perform tasks such as predictions, classification, and regression.

Describe the key features of multilayer perceptron (MLP).

An MLP is a feed-forward neural network consisting of at least three layers (an input layer, one or more hidden layers, and an output layer). Neurons in each layer have weights, and non-linearity is introduced through activation functions. MLPs are fully connected and typically consist of linear layers followed by non-linear activation functions such as ReLU or sigmoid.

Describe the role of loss functions in the learning process of neural networks.

A loss function measures the difference between a model's predictions and the actual values. It is used in optimization algorithms like Gradient Descent to help the model find the optimal weights. Common loss functions include MSE and Cross-Entropy Loss.

What is Gradient Descent?

Gradient Descent is an optimization algorithm that updates weights in a neural network to minimize the loss function. It computes the gradient of the loss function and adjusts weights accordingly. Common variants include Stochastic Gradient Descent (SGD), Momentum, RMSProp, and Adam.

Explain the difference between binary classification and multi-class classification in PyTorch.

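In binary classification, there is a single output neuron. Its raw output (logit) is passed to BCEWithLogitsLoss(), which applies the Sigmoid internally, so an explicit Sigmoid is only needed at inference time to turn the logit into a probability. In multi-class classification, the number of output neurons equals the number of classes. The loss function used is CrossEntropyLoss(), which takes the raw logits and applies log-softmax internally, so no explicit Softmax layer should be placed before the loss.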
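The difference can be sketched in a few lines of PyTorch. The logits and targets below are random placeholders rather than outputs of a real model; the point is only that both loss functions expect raw logits and produce the same value as the explicit "activation + plain loss" pairing.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Binary classification: one logit per sample, float targets in {0., 1.}
binary_logits = torch.randn(4)
binary_targets = torch.tensor([0.0, 1.0, 1.0, 0.0])
bce = nn.BCEWithLogitsLoss()(binary_logits, binary_targets)
# Equivalent (but less numerically stable): explicit sigmoid followed by BCELoss
bce_manual = nn.BCELoss()(torch.sigmoid(binary_logits), binary_targets)

# Multi-class classification: one logit per class, integer class targets
num_classes = 10
mc_logits = torch.randn(4, num_classes)
mc_targets = torch.tensor([3, 0, 9, 1])
ce = nn.CrossEntropyLoss()(mc_logits, mc_targets)
# Equivalent: explicit log-softmax followed by the negative log-likelihood loss
ce_manual = nn.NLLLoss()(torch.log_softmax(mc_logits, dim=1), mc_targets)

print(bce.item(), bce_manual.item())
print(ce.item(), ce_manual.item())
```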

What is batch size, and explain the problem when it is too large or too small.

Batch size refers to the number of samples used in a single optimization step. If it is too small, the gradient estimates are noisy, so training becomes unstable and the loss fluctuates. If it is too large, each step costs more computation and memory, and optimization may slow down.

Explain overfitting and how to prevent it.

Overfitting occurs when a model fits the training data too closely and fails to generalize to new data. It can be mitigated with Dropout, L1/L2 Regularization, Data Augmentation, and Early Stopping.

[UoT] Introduction to Deep Learning (1)

Mon, 24 Feb 2025 02:49:21 GMT

Deep Learning midterm test: summary and expected questions

Week1

What is AI? AI is the intelligence of machines and the branch of computer science that aims to create it. AI is the science and engineering of making intelligent machines, especially intelligent computer programs. AI is often used to describe machines that mimic “cognitive” functions that humans associate with the human mind, such as “learning” and “problem solving”.
Unsupervised Learning: This type of learning focuses on finding patterns, regularities or structure in unlabeled data.

Week2

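Overview of ANNs
Motivation problem: image classification is introduced through a car vs. truck classifier. Input (image) -> output (assigned to one of a fixed set of categories)

How a computer sees an image
- A matrix of pixel values (between 0 and 255)
- RGB (color) image: made up of 3 channels

Challenges of visual recognition
- Viewpoint variation: all pixel values change when the camera moves.
- Intraclass variation: a class can be subdivided into finer-grained categories.
- Background clutter, illumination, occlusion, and other challenges

Steps of machine-learning-based image classification
Step 1: Prepare the dataset. Read the pixel values of each image: an RGB image is made of [R, G, B] values per pixel (or is converted to grayscale), and is then flattened into a 1-D vector.
Step 2: Split the dataset into Train/Validation/Test

Basic machine learning models: choosing a baseline model
- Standard scikit-learn models can be used: KNN, Logistic Regression, SVM, Random Forests
- Apply Logistic Regression -> low performance -> why is the performance low?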
```
  : A simple linear model struggles to capture the high-dimensional structure of image data.
  : Feature engineering is needed
```
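Feature Engineering
- Instead of representing the image simply as pixels, extract meaningful features such as the car's length, width, height, and size.
- Apply logistic regression to the new features -> performance improves.
- Problems: features must be chosen by hand; picking good features is hard; with high-dimensional data the number of features can explode.

Solution: use an ANN -> let the model learn meaningful features automatically.

ANN concept: a model that mimics the brain and learns with a structure based on neurons.
- Artificial neuron model: uses weights and a bias; takes a linear combination of the input x and then applies an activation function; the activation function adds nonlinearity

Multi-Layer Neural Networks
Basic ANN structure: Input Layer -> Hidden Layer -> Output Layer

Why is a single layer not enough? A single-layer network (1-layer ANN) = logistic regression. It cannot properly learn nonlinear data, because it can only produce a linear decision boundary.

Activation Functions
1) Step function: its gradient is zero almost everywhere and undefined at the jump, so it is hard to use with gradient-based training
2) Sigmoid: output range (0, 1). Problem: for inputs of large magnitude the gradient vanishes (Vanishing Gradient Problem)
3) Other activation functions: ReLU, Tanh, etc. are also commonly used.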
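The vanishing-gradient remark about the sigmoid can be checked numerically. The sketch below uses the standard identity sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)); the sample inputs are arbitrary.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    # derivative of the sigmoid: s * (1 - s)
    s = sigmoid(z)
    return s * (1.0 - s)

zs = np.array([0.0, 2.0, 5.0, 10.0])
grads = sigmoid_grad(zs)
for z, g in zip(zs, grads):
    print(f"z = {z:5.1f}  sigmoid'(z) = {g:.6f}")
```

The gradient peaks at 0.25 for z = 0 and shrinks rapidly as |z| grows, so a chain of sigmoid layers multiplies many small factors together during backpropagation.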
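Training Neural Networks
Loss Function
- Binary classification: Binary Cross-Entropy
- Multi-class classification: Softmax + Cross-Entropy

Backpropagation
- Uses the chain rule to compute the gradient of the loss with respect to each weight
- Optimization method: Gradient Descent uses these gradients to update the weights

Strategies for improving model performance
1) Hyperparameter tuning

 - Number of hidden layers, number of neurons, choice of activation function
 - Adjust the learning rate and batch size

2) Regularization

 - Apply dropout and L1/L2 regularization to prevent overfitting

3) Data Augmentation

 - Transform images to increase data diversity and improve generalization

Expected Questions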

What are the main challenges in image classification tasks?

The main challenges in image classification include:

Viewpoint Variation: The appearance of objects changes with camera angles.
Intraclass Variation: Objects within the same class may have different shapes, sizes, or textures.
Background Clutter: Unrelated objects in the background can interfere with classification.
Illumination and Occlusion: Lighting conditions and objects blocking parts of the image affect recognition.
Scalability: Large datasets require high computational power for training models.

Why does logistic regression perform poorly on raw pixel image data?

Logistic regression performs poorly on raw pixel data because:

Lack of Feature Extraction: It treats all pixels as independent variables without considering spatial relationships.
High Dimensionality: Images contain thousands of features, making linear models ineffective.
Non-linearity: Many real-world classification problems require non-linear decision boundaries.

What is feature engineering, and what are its limitations?

Feature engineering is the process of manually selecting or transforming raw data into meaningful representations for machine learning models.

Limitations:

Requires Domain Knowledge: Finding effective features requires expertise.
Time-Consuming: Manually extracting features is labor-intensive.
Curse of Dimensionality: High-dimensional feature representations can degrade model performance.

What is the problem with using a single-layer neural network?

A single-layer neural network cannot model complex decision boundaries, so it fails to classify data that is not linearly separable.

Why do deep neural networks require non-linear activation functions?

Without non-linear activation functions, deep neural networks behave as a single linear transformation, limiting their ability to learn complex patterns. Non-linearity allows the network to model intricate relationships in data.
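A small NumPy check of this claim, with randomly generated weights used purely for illustration: two stacked linear layers are reproduced exactly by one linear layer, while inserting a ReLU between them breaks that equivalence.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                      # 5 samples, 4 features
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)

# Two linear layers with no activation in between
two_layer = (x @ W1 + b1) @ W2 + b2

# The same mapping collapsed into a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = x @ W + b

# With a ReLU in between, the collapse no longer holds
relu_net = np.maximum(0, x @ W1 + b1) @ W2 + b2

print(np.allclose(two_layer, one_layer))
print(np.allclose(relu_net, one_layer))
```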

What does "feed-forward" mean in the context of an artificial neural network?

Feed-forward in artificial neural networks refers to the process where the input data passes through the network layer by layer without loops or feedback connections. The information moves in one direction—from the input layer to the output layer.

What does “fully-connected” mean in the context of an artificial neural network?

A fully-connected layer in a neural network means that every neuron in a layer is connected to every neuron in the next layer. This allows the model to learn complex patterns but increases the number of parameters significantly.

Why do we need both a training set and a test set?

The training set is used to teach the model, while the test set evaluates its performance on unseen data. Without a separate test set, we cannot measure how well the model generalizes to new examples.

What is the difference between sigmoid, ReLU, and softmax activation functions?

Sigmoid: Maps values to the range (0,1), often used for binary classification but suffers from vanishing gradients.
ReLU (Rectified Linear Unit): Replaces negative values with zero, reducing the vanishing gradient problem but can suffer from dying neurons.
Softmax: Converts logits into probabilities, ensuring that the sum of all class probabilities equals 1, mainly used for multi-class classification.

What is the purpose of the softmax activation? How is it similar to the sigmoid activation? How is it different?

The softmax activation function transforms a vector of raw scores (logits) into probabilities that sum to 1. Like the sigmoid function, it maps values to a range between 0 and 1. However, while sigmoid applies to binary classification, softmax is used for multi-class classification by normalizing all outputs across multiple classes.
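The relationship between the two can be made concrete with a short NumPy sketch (the logits are arbitrary example values). Softmax outputs sum to 1, and a two-class softmax over the logits [z, 0] gives exactly sigmoid(z) for the first class.

```python
import numpy as np

def softmax(logits):
    # subtracting the max does not change the result but avoids overflow
    shifted = logits - np.max(logits)
    exps = np.exp(shifted)
    return exps / exps.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

logits = np.array([2.0, 1.0, 0.1])
probs = softmax(logits)
print(probs, probs.sum())

# Two-class softmax over [z, 0] reduces to the sigmoid of z
z = 1.5
two_class = softmax(np.array([z, 0.0]))
print(two_class[0], sigmoid(z))
```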

What is the vanishing gradient problem, and how does it affect training?

The vanishing gradient problem occurs when gradients become too small during backpropagation, causing earlier layers in deep networks to learn slowly or not at all. This leads to inefficient training and poor convergence.

How is an artificial neural network similar to a biological neural network? How are they different?

Similarities:

Both process information using interconnected units (neurons).
Both rely on weighted connections to determine the strength of signal transmission.
Both can adapt based on learning experience.

Differences:

Biological neurons transmit signals chemically and electrically, while artificial neurons use mathematical computations.
Artificial neural networks are structured in layers, whereas biological neural networks are highly interconnected and dynamic.

[cs231n+michigan DL for CV] lecture 2: Image Classification (1)

Tue, 02 Jul 2024 10:55:10 GMT

Image classification: A core computer vision task

What is Image Classification?

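Image classification is a core task in computer vision: it takes an image as input and assigns it one label from a fixed set as output.

Problem: Semantic Gap

Image classification is simple for humans but hard for computers, and the reason is the semantic gap. Humans perceive an image as an image and understand it intuitively, whereas a computer sees everything as numbers. To a computer, an image is nothing more than a huge grid of numbers: pixel values between 0 and 255 across three channels (red, green, blue). This difference between how humans and computers perceive an image is called the semantic gap.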

Challenges: Viewpoint Variation

Several problems arise from the fact that a computer sees an image as numbers.

1. Viewpoint Variation: Moving the camera even slightly changes every pixel value in the photo.

2. Intraclass Variation: Variation within a class makes classification harder. For example, cats can differ in color, pattern, and overall appearance while all still being cats.

3. Fine-Grained Categories: Related to item 2; for example, subdividing cats further by breed.

4. Background Clutter: The background can make identification difficult, because the object to be recognized may blend into it.

5. Illumination Changes: To classify well despite changes in the scene, the classifier must be robust to lighting conditions. For example, the same subject may be photographed in the dark or with the lights on.

6. Deformation: Deformable objects (living, moving things such as cats) can be harder to recognize.

7. Occlusion: Sometimes only part of the cat is visible, which also makes recognition difficult.

Image Classification: Building Block for other tasks!

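Despite these difficulties, image classification is studied because it is useful in many fields. In medicine, for instance, it is used to distinguish benign from malignant tumors. It also serves as a building block for tasks such as the following.

1. Object Detection: Goes beyond classifying the image and also performs localization, drawing a bounding box around each object.

2. Image Captioning: The task of describing an image in natural language. Approaches are broadly divided into the 'Top-Down Approach' and the 'Bottom-Up Approach'.

Top-Down Approach: pass the whole image through the system and convert the resulting gist into language
This is the most widely used approach so far, but it has relative difficulty attending to the fine details of the image.
It can be trained with a Recurrent Neural Network (RNN), and this variant is regarded as the best performing.
Bottom-Up Approach: look at the image part by part, derive words from the individual parts, and combine them into a sentence
Because words are extracted from several parts of the image and then combined, it can reflect finer details.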

Attempts to Image Classification

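One of the first approaches tried for image classification used edges: extract the edges and look for characteristics or patterns in them.

In more detail, an outline is built along the edges, and a point where three lines meet is defined as a 'corner'. Rules over the set of corners are then written for each object (cat, dog, and so on) and used to tell images apart. This approach has two problems: 1) Too weak. For a cat, for example, it is easily thrown off by fur patterns, pose, and so on. 2) Low scalability. A separate rule set has to be defined for every object class.

The 'Data-Driven Approach' emerged to address these problems.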

Machine Learning: Data-Driven Approach

Collect a dataset of images and labels
Use Machine Learning to train a classifier
Evaluate the classifier on new images

Nearest Neighbor

A machine learning algorithm for classification. It requires implementing two functions, train and predict.

train function: memorizes all the training data and labels.
predict function: predicts the label of an input by comparing it with the training data.

Distance Metric to compare images

The predict function needs a way to compare the input with the training data. One measure that can be used for this is the L1 (Manhattan) distance.

The L1 distance subtracts the training image's pixel values from the test image's, takes the absolute value of each per-pixel difference, and sums the results: $d_1(I_1, I_2) = \sum_p |I_1^p - I_2^p|$.
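As a small worked example of this computation, the two 2x2 "images" below are made-up pixel values, not data from the lecture.

```python
import numpy as np

# Hypothetical 2x2 grayscale images (signed ints, so subtraction cannot wrap
# around the way it would with uint8 pixel arrays)
test_image = np.array([[56, 32],
                       [90, 23]])
train_image = np.array([[10, 20],
                        [8, 10]])

abs_diff = np.abs(test_image - train_image)  # per-pixel absolute differences
l1_distance = abs_diff.sum()                 # add them all up

print(abs_diff)
print(l1_distance)  # 46 + 12 + 82 + 13 = 153
```

When real images are loaded as `uint8`, it is worth casting them to a signed or floating-point type before subtracting, otherwise negative differences wrap around.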

import numpy as np

class NearestNeighbor:
    def __init__(self):
        pass

    ### Memorize training data
    def train(self, X, y):
        # X is N x D, where each row is an example. y is a 1-D array of N labels.
        # The nearest neighbor classifier simply remembers all the training data.
        self.Xtr = X
        self.ytr = y

    def predict(self, X):
        # X is N x D, where each row is an example we wish to predict a label for
        num_test = X.shape[0]
        # make sure that the output type matches the label type
        Ypred = np.zeros(num_test, dtype=self.ytr.dtype)

        ### For each test image (loop over all test rows):
        ### find the nearest training image and return its label
        for i in range(num_test):
            # L1 distance (sum of absolute differences) from the i-th test
            # image to every training image
            distances = np.sum(np.abs(self.Xtr - X[i, :]), axis=1)

            # get the index with the smallest distance
            min_index = np.argmin(distances)

            # predict the label of the nearest example
            Ypred[i] = self.ytr[min_index]

        return Ypred

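Looking at the code, we can see the following.

In the loop inside the predict function, the distance between self.Xtr (the stored training data) and X[i, :] (one test example) is computed by summing the absolute differences.
The training example with the smallest L1 distance is the one that differs least from the test example, so its label is written to Ypred (the result).

Problems of Nearest Neighbor

Time complexity: the train function of Nearest Neighbor simply stores the training data, so it is O(1). The predict function, however, compares each test example against all of the training data, so with N training examples it costs O(N) per prediction. Prediction time therefore keeps growing as the amount of data grows.
Decision Boundaries: Nearest Neighbor also has problems with the decision boundaries it produces. In a decision-boundary plot, each point is a training example and its color is its label; the background color is the label that would be assigned to a test point at that location.

The problems here are:

Because the boundary is built from the single closest data point, it is vulnerable to outliers.
For the same reason, the boundary is not smooth, which can lead to overfitting.

'k-nearest neighbor' was devised to address the second problem and make the boundary smoother. Instead of using only one neighbor, it uses k neighbors and predicts the class that receives the most votes.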

K-Nearest Neighbor

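As k increases, the decision boundary becomes smoother. However, regions can appear between the boundaries where no class wins the vote (shown as blank areas), and this is another issue kNN has to deal with. Such regions can be filled by inference from the surroundings or by an arbitrary choice.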
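The effect of k on an outlier can be sketched with a toy 2-D dataset. Everything below (the points, the labels, the helper name `knn_predict`) is invented for illustration and uses the L1 distance from the classifier above; ties are broken by whichever label is encountered first, which is only one of several possible policies.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    # L1 distance from the query point to every training point
    distances = np.sum(np.abs(X_train - x), axis=1)
    # indices of the k closest training points
    nearest = np.argsort(distances)[:k]
    # majority vote over their labels
    votes = Counter(y_train[nearest].tolist())
    return votes.most_common(1)[0][0]

# Class 0 clusters near the origin, class 1 near (5, 5);
# the last point is an outlier labelled 1 inside the class-0 region.
X_train = np.array([[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [1, 1]])
y_train = np.array([0, 0, 0, 1, 1, 1])
query = np.array([0.9, 0.9])

print(knn_predict(X_train, y_train, query, k=1))  # follows the outlier
print(knn_predict(X_train, y_train, query, k=3))  # outvoted by its neighbors
```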

L2 (Euclidean) distance

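Besides the L1 distance there is the L2 (Euclidean) distance. The shape of the decision boundary differs depending on which distance metric is used.

L1 distance: a metric to use when the individual elements of the vector each carry their own meaning.
L2 distance: a metric to use when the meaning of the individual elements is unknown or unimportant.

Both L1 and L2 are distances, so both must be non-negative; they differ in how they make the differences non-negative. L1 takes the absolute value of each difference, so the magnitude of each per-element difference (test minus train) is preserved as is, whereas L2 squares the differences before summing and then takes the square root.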
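One way to see why L1 suits vectors whose individual elements are meaningful is that L1 depends on the choice of coordinate axes while L2 does not. The sketch below rotates a pair of made-up 2-D points by 45 degrees and compares both distances before and after.

```python
import numpy as np

def l1(a, b):
    return np.sum(np.abs(a - b))

def l2(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

a = np.array([0.0, 0.0])
b = np.array([1.0, 0.0])

# Rotate the coordinate frame by 45 degrees
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
a_rot, b_rot = R @ a, R @ b

print(l1(a, b), l1(a_rot, b_rot))  # L1 changes under rotation
print(l2(a, b), l2(a_rot, b_rot))  # L2 stays the same
```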

Property of K-Nearest Neighbor: Universal Approximation

Given enough data and a suitable value of k, kNN can approximate almost any function. (In theory it can approximate any function, but this requires some conditions, for example that the function is continuous on the domain of interest, or that the training points are evenly spaced.) This is one reason kNN can capture fine details of the data distribution. Also, because it is not a linear model, it handles nonlinear data well.

Problems of K-Nearest Neighbor

- Curse of dimensionality: as the dimension grows, the number of training points needed to cover the space uniformly grows exponentially. Take a dataset commonly used for image classification whose images are $32 \times 32$: covering that space uniformly, so that learning is even across it, would take on the order of $2^{32 \times 32}$ training points. In practice this makes applying kNN to image classification infeasible.

- Distance metrics on pixels are not well suited to images: images that look clearly different to the human eye can still have nearly the same L2 distance to a reference image. As a result, kNN may not classify them well.

[cs226] Lecture 1: Introduction

Wed, 26 Jun 2024 14:31:37 GMT

Introduction

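What are Generative Models?

A generative model learns from the given training data and generates new data that follows the distribution of that training data. Because the generated data has to resemble the training data, the model must be able to draw samples similar to it. Accordingly, there are different kinds of generative models: some generate with an explicitly modeled data distribution (Explicit), and others generate without modeling the distribution directly (Implicit).

The lecture explains generative models through a quote from Richard Feynman: "What I cannot create, I do not understand." For a mathematical theorem, this means that if you cannot prove it yourself, you do not fully understand the concept. Applied to generative models, it becomes "if you can understand the meaning of an image or a text, you can generate it."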

Generative Modeling: Computer Graphics

How to generate natural images with a computer?

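Generating high-dimensional signals (images) in computer graphics: the process of producing an image from a given high-level description (object type, color, position, and so on). This includes rendering the image from information such as what shape the object has, what color it is, and where it is located.

'Inverse Graphics', interpreting this in reverse with a generative model: one of the ideas behind generative modeling is to run the image-generation process backwards in order to interpret an image. Starting from the image, the model infers the factors that produced it, that is, the high-level description.

Differences between computer graphics and generative models in generating and interpreting images

Computer graphics mainly generates images from a high-level description, whereas a generative model generates or interpret new data based on the data it was given.
Using the inverse-graphics approach with a generative model, one can look at an image and understand the factors that generated it.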

Statistical Generative Models

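A generative model is expressed as a probability distribution over images or text. A good generative model is built by using a statistical model to combine data with prior knowledge. Prior knowledge can include the parametric form, the loss function, the optimization algorithm, and so on.

Given an arbitrary image x as input, the model can compute the probability $p(x)$ that this image appears in the dataset.

An arbitrary image $x$ is given as input
The model computes the probability of this image under the learned distribution
In this way the model maps the input image $x$ to a scalar probability value $p(x)$. For example, an image that fits the learned distribution well will receive a high probability.

Using a generative model as a data simulator to create new data can be thought of as producing outputs rather than taking data as input.

Building a simulator with a generative model

Data collection: gather a large amount of data to train the generative model.
Model training: train the generative model on this dataset. The model learns the distribution of the dataset and gains a probabilistic understanding of what each data point looks like. In the process it learns a variety of features of the images.

Once the distribution has been learned, new images can be generated by sampling from it.

Generating data with control signals

A control signal can be used to steer a generative model in a specific way. Here a control signal is additional information supplied during generation to obtain the desired output, such as a sketch or a caption. (Such models can be thought of as Conditional Generative Models.)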

Image Generation

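Generative models have advanced considerably and are succeeding in many fields. Early models generated simple black-and-white images; they have since progressed to realistic, high-resolution images with much more detail.

The lecture briefly shows example images generated by GANs, Diffusion Models, DALL-E 3, and others, and moves on.

A brief explanation of GANs and Diffusion Models:

GAN: consists of two neural networks, a generator and a discriminator, in an adversarial relationship. The generator produces fake data, the discriminator tries to tell real data from fake, and the two learn by competing with each other.
Diffusion Model: noise is added to the original data over many steps (Forward Process), and the model takes the noisy data at each step as input and learns to remove the noise and restore the original data (Reverse Process). The trained model then generates new data by starting from a very noisy state and removing noise step by step.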

Audio and Speech Generation

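Speech generation models have also advanced. The WaveNet model is very effective at converting text to speech. More recent models can generate even more natural speech, including emotional expression and intonation. The lecture plays speech samples from a Diffusion Text2Speech model and an Audio Super Resolution model. (The Audio Super Resolution model here is a conditional generative model.)

Language Generation

Language models have also made great progress. Large language models such as GPT can perform a wide range of text generation tasks. GPT is very capable at understanding text and generating new text.

Machine Translation: uses a generative model to convert text into other languages.
Code generation: generates code from a natural-language description.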

Video Generation

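Video generation models are advancing as well. They can generate a short video from a text description, or combine several video clips into a single piece of content. Video generation models can be controlled with a text or image seed.

Utilization of Generative Model

Generative models are also used in Decision Making and Robotics, for example in self-driving cars. They also play an important role in chemistry and bioengineering, for example in designing molecules or proteins with specific properties.

Beyond these, they are used in many other areas such as 3D object generation and music generation.

Ethical Issues in Generative Models

Generative models are being applied in many fields and are advancing quickly, but the ethical issues that come with them also need to be recognized. For example, technologies such as deepfakes can be abused for various crimes, so discussion of how to prevent this is needed.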

Transformer in Pytorch

Sun, 02 Jun 2024 06:14:46 GMT

Hyperparameter of Transformer

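$d_{model}$ (hidden_size): the dimension of each token's embedding vector in the Transformer, i.e. the vector size of the inputs and outputs. The encoder and decoder use this fixed size for their inputs and outputs; it equals the embedding dimension, and values passed along inside the encoder and decoder keep the same dimension.
num_layers (num_encoder_layers): the number of layers in the encoder or decoder. (In the code implemented here it means the number of encoder layers.)
num_heads: rather than running a single attention, the Transformer runs several attentions in parallel and combines their independently computed results into one. This is the number of heads that run attention in parallel in multi-head attention.
$d_{ff}$: each Transformer layer contains a Feed-Forward Neural Network (FFN), and $d_{ff}$ is the size of its hidden layer. The FFN expands from $d_{model}$ to $d_{ff}$ and then projects back from $d_{ff}$ to $d_{model}$. (This is easier to follow alongside the explanation further down.) $d_{ff}$ is set to 4 times $d_{model}$ (hidden size), following the base architecture proposed in the Transformer paper 'Attention Is All You Need'. The FFN of each Transformer layer consists of two linear transformations with a ReLU activation, and the size of the intermediate layer ($d_{ff}$) is 4 times the input and output size $d_{model}$. The intent of this design is to let the model learn more features and so improve performance.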

Structure of Transformer

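A Transformer model is built mainly from two components:

Multi-Head Attention mechanism
Feed-Forward Neural Network (FFN)

Each Transformer layer is made of these two components. They work as follows.

1. Embedding and Multi-Head Attention

The input sequence first passes through an embedding layer and is converted into vectors of size hidden_size (or d_model).
Positional encoding is added to each input vector so that it carries its position within the sequence.
The embedded vectors are then fed into the Multi-Head Attention mechanism.

2. Feed-Forward Neural Network (FFN)

The FFN is an important part of the Transformer layer. The FFN in each layer has the following form:

$\mathrm{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2$

Here $W_1$ and $W_2$ are the weights of the linear transformations, and $b_1$ and $b_2$ are the biases.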

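How the FFN works

Input vector ( x ):
- The embedding vector of each token, with dimension hidden_size (or d_model).
First linear transformation and activation function:
- The input vector ( x ) passes through the first linear transformation and becomes a hidden vector of dimension d_ff.
- The transformation is $h = \max(0, xW_1 + b_1)$, where $W_1$ has size (hidden_size, d_ff) and h has size d_ff. ReLU is used as the activation function.
Second linear transformation:
- The hidden vector ( h ) passes through the second linear transformation and becomes an output vector of dimension hidden_size again: $y = hW_2 + b_2$, where $W_2$ has size (d_ff, hidden_size) and y has size hidden_size.

The feed-forward network therefore takes a hidden_size-dimensional input, expands it to d_ff dimensions, and then reduces it back to hidden_size dimensions:

hidden_size-dimensional input vector → d_ff-dimensional hidden layer → hidden_size-dimensional output vector

Example

Input embedding vector (hidden_size = 256): $[x_1, x_2, ..., x_{256}]$
First linear transformation of the FFN (hidden_size → d_ff = 1024): $[h_1, h_2, ..., h_{1024}]$, where ( h ) is the vector after the activation function (ReLU).
Second linear transformation of the FFN (d_ff → hidden_size): $[y_1, y_2, ..., y_{256}]$

So d_ff is the size of the intermediate layer of the feed-forward network, and it is commonly set to 4 times hidden_size.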
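The shapes in this example can be checked with a standalone sketch of just the FFN part. The batch size and sequence length below are arbitrary, and this `nn.Sequential` is only an illustration, not the layer used in the implementation discussed in this post.

```python
import torch
import torch.nn as nn

hidden_size, d_ff = 256, 1024   # sizes from the example above

ffn = nn.Sequential(
    nn.Linear(hidden_size, d_ff),   # hidden_size -> d_ff
    nn.ReLU(),                      # max(0, .)
    nn.Linear(d_ff, hidden_size),   # d_ff -> hidden_size
)

x = torch.randn(2, 5, hidden_size)  # (batch, seq_len, hidden_size), arbitrary
h = ffn[1](ffn[0](x))               # intermediate activation
y = ffn(x)                          # full FFN output

print(x.shape, h.shape, y.shape)
```

Because `nn.Linear` acts on the last dimension, the same FFN is applied independently to every token position.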

Code

Reflecting the explanation above in code, the EncoderLayer class is organized as follows:

import torch.nn as nn

# Note: MultiHeadAttention is assumed to be defined elsewhere in this implementation.
class EncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.hidden_size = config.hidden_size

        self.self_attn = MultiHeadAttention(config)
        self.self_attn_layer_norm = nn.LayerNorm(self.hidden_size)

        self.activation_fn = nn.ReLU()

        # first linear transformation of the feed-forward network: hidden_size → d_ff
        self.fc1 = nn.Linear(self.hidden_size, config.d_ff)
        # second linear transformation of the feed-forward network: d_ff → hidden_size
        self.fc2 = nn.Linear(config.d_ff, self.hidden_size)
        self.final_layer_norm = nn.LayerNorm(self.hidden_size)

        self.dropout = nn.Dropout(0.1)

    def forward(self, hidden_states, enc_self_mask):
        residual = hidden_states
        hidden_states = self.self_attn(
            query_states=hidden_states,
            key_value_states=hidden_states,
            attention_mask=enc_self_mask
        )
        hidden_states = self.dropout(hidden_states)
        hidden_states = residual + hidden_states
        hidden_states = self.self_attn_layer_norm(hidden_states)

        residual = hidden_states
        hidden_states = self.activation_fn(self.fc1(hidden_states))
        hidden_states = self.dropout(hidden_states)
        hidden_states = self.fc2(hidden_states)
        hidden_states = self.dropout(hidden_states)
        hidden_states = residual + hidden_states
        hidden_states = self.final_layer_norm(hidden_states)

        return hidden_states

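This structure reflects how the FFN inside a Transformer layer takes a hidden_size-dimensional input, expands it to d_ff dimensions, and then converts it back to a hidden_size-dimensional output. This lets the model learn more features and contributes to better performance.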

Positional Encoding

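Other earlier approaches take each word's embedding vector directly as input, but in the Transformer the encoder and decoder take the embedding vectors after their values have been adjusted. This adjustment is called Positional Encoding.

Because the Transformer does not receive words sequentially, the position of each word has to be provided in some other way. The Transformer adds the position information of each word to its embedding vector and uses the sum as the final model input.

In other words, the positional encoding values are added to the word embedding vectors before they enter the Transformer.

There are many kinds of positional encoding, but the basic one conveys position with sin and cos functions.

$PE_{(pos,\, 2i)} = \sin\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$PE_{(pos,\, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d_{model}}}\right)$

$pos$: the position of the embedding vector in the input sentence (e.g., I is at the first position of 'I am a student')
$i$: the index of the sin/cos pair within the embedding vector, covering dimensions $2i$ and $2i+1$
$d_{model}$: the hyperparameter giving the output dimension of every layer of the Transformer

With Positional Encoding, order information is preserved. For example, even the same embedding vector (the same word) ends up as a different Transformer input once the positional encoding for its position is added.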
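A NumPy sketch of these two formulas, assuming an even `d_model`. The function name `sinusoidal_encoding` and the constant "word" embedding are hypothetical and used only to show that the same vector becomes a different input at different positions.

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]  # (1, d_model / 2): values 2i
    angles = pos / (10000 ** (two_i / d_model))
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)               # even dimensions 2i
    pe[:, 1::2] = np.cos(angles)               # odd dimensions 2i + 1
    return pe

pe = sinusoidal_encoding(max_len=4, d_model=8)

# The same (hypothetical) word embedding placed at positions 0 and 2
word = np.full(8, 0.5)
input_at_pos0 = word + pe[0]
input_at_pos2 = word + pe[2]

print(pe[0])                                      # sin(0) = 0, cos(0) = 1
print(np.allclose(input_at_pos0, input_at_pos2))  # different inputs
```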

Pytorch code of Positional Encoding

import torch
import matplotlib.pyplot as plt
import numpy as np
from torch import nn

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, max_len, device):
        """
        Sinusoidal (sin, cos) positional encoding

        parameters
        - d_model : model dimension
        - max_len : maximum sequence length
        - device : cuda or cpu
        """

        super(PositionalEncoding, self).__init__() # initialize nn.Module

        # tensor with the same size as the input matrix (the embedding vectors in NLP),
        # i.e. (max_len, d_model)
        self.encoding = torch.zeros(max_len, d_model, device=device)
        self.encoding.requires_grad = False # no gradient is needed for the encoding

        # position index vector: pos runs over the indices 0 .. max_len - 1
        pos = torch.arange(0, max_len, device=device)

        # 1D (max_len,) -> 2D (max_len, 1), so that each row is one word position;
        # int64 -> float32 (the cast is optional)
        pos = pos.float().unsqueeze(dim=1)

        # _2i holds the even dimension indices 0, 2, 4, ..., size (d_model / 2,)
        # e.g. for embedding size 512: _2i = [0, 2, ..., 510]
        _2i = torch.arange(0, d_model, step=2, device=device).float()

        # (max_len, 1) / (d_model / 2,) -> broadcasts to (max_len, d_model / 2)
        self.encoding[:, ::2] = torch.sin(pos / (10000 ** (_2i / d_model)))
        self.encoding[:, 1::2] = torch.cos(pos / (10000 ** (_2i / d_model)))


    def forward(self, x):
        # self.encoding: [max_len = 512, d_model = 512]

        # x holds token indices, e.g. batch_size = 128, seq_len = 30
        batch_size, seq_len = x.size()

        # returns [seq_len = 30, d_model = 512], which is broadcast-added to the
        # token embedding of size [128, 30, 512]
        return self.encoding[:seq_len, :]

Attention

The Transformer uses three kinds of attention.

Self-Attention of Encoder
Masked Self-Attention of Decoder
Co-Attention Between Encoder and Decoder

Self-Attention is the case where the Query, Key, and Value are the same. (Here "the same" does not mean their values are equal; it means they come from the same source, either all from the encoder or all from the decoder.) In Co-Attention (encoder-decoder attention), the Query is a decoder vector, while the Key and Value are encoder vectors.

Encoder

The encoder stacks as many encoder layers as the hyperparameter num_layers specifies. A single layer is divided into a Self-Attention layer and a Feed-Forward Network layer.

Self-Attention of Encoder

What is Self-Attention? An attention function computes, for a given Query, its similarity to every Key. These similarities are used as weights in a weighted sum of the Values mapped to each Key.

Conventional attention: $Q$ = Query: the hidden state of the decoder cell at time step $t$; $K$ = Keys: the hidden states of the encoder cells at all time steps; $V$ = Values: the hidden states of the encoder cells at all time steps
Self-Attention: $Q$, $K$, and $V$ are all derived from the same sequence (for the encoder, the input sentence)

In conventional attention, the Query $Q$ is a decoder hidden state and $K$ is an encoder hidden state, so $Q$ and $K$ take different values.

In Self-Attention, however, $Q$, $K$, and $V$ all come from the same source.

$Q$: all the word vectors of the input sentence (input sequence)
$K$: all the word vectors of the input sentence
$V$: all the word vectors of the input sentence

Why, then, does the input sequence attend to itself? Consider a sentence in which the word "it" could refer to either "animal" or "street": which one is it?

By computing similarities between the words within a single input sentence, Self-Attention can work out which word "it" refers to.

Self-Attention does not use the encoder's initial $d_{model}$-dimensional input sequence directly; it first derives lower-dimensional $Q$, $K$, $V$ vectors. Their dimension is determined by $num\_heads$.

For example, with an input dimension of $d_{model} = 512$ and $num\_heads = 8$, each $Q$, $K$, $V$ vector has dimension $d_{model} / num\_heads = 512 / 8 = 64$.

To reduce the $512$-dimensional $d_{model}$ vectors to $64$ dimensions, they are multiplied by a weight matrix of size $512 \times 64$ (a separate matrix for each of $Q$, $K$, $V$). Through this process each word is converted into lower-dimensional $Q$, $K$, and $V$ vectors.
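The projection and the weighted sum can be sketched for a single head as follows. The input is random, the sequence length is arbitrary, and the $\sqrt{d_k}$ scaling and softmax follow the scaled dot-product attention of the 'Attention Is All You Need' paper rather than anything spelled out in the text above; the names `W_q`, `W_k`, `W_v` are illustrative.

```python
import math
import torch
import torch.nn as nn

torch.manual_seed(0)

d_model, num_heads = 512, 8
d_k = d_model // num_heads            # 64 dimensions per head
seq_len = 5                           # arbitrary sentence length

x = torch.randn(seq_len, d_model)     # stand-in for the embedded input sentence

# One 512 x 64 projection each for Q, K, V (single head shown)
W_q = nn.Linear(d_model, d_k, bias=False)
W_k = nn.Linear(d_model, d_k, bias=False)
W_v = nn.Linear(d_model, d_k, bias=False)
Q, K, V = W_q(x), W_k(x), W_v(x)      # each (seq_len, 64)

# Similarity of every query to every key, scaled by sqrt(d_k)
scores = Q @ K.T / math.sqrt(d_k)     # (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)
output = weights @ V                  # weighted sum of values: (seq_len, 64)

print(Q.shape, weights.shape, output.shape)
```

In multi-head attention this is repeated for each of the 8 heads and the 64-dimensional outputs are concatenated back to 512 dimensions.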

[reference] https://velog.io/@sjinu/Transformer-in-Pytorch#5-%EC%9D%B8%EC%BD%94%EB%8D%94encoder

[선형대수학] Normal equation

Sat, 27 Jan 2024 05:36:23 GMT

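What is the normal equation?

The normal equation is a method used for the least squares problem and is one of the main ways to estimate model parameters in linear regression.

-> The goal is to estimate the model parameters by minimizing the error between the given data points and the model's predictions. -> The normal equation can be described as one way of solving Ordinary Least Squares analytically.

In short, the normal equation writes the system of equations in matrix form and finds the optimal values through a matrix inverse.

The cost function used with the normal equation is usually the mean squared error (MSE). The errors are squared so that every error counts as a positive quantity, larger errors are penalized more heavily, and the cost stays differentiable, which makes it easy to optimize. The factor 1/2m is there to make differentiation convenient: the 2 that comes from differentiating the square cancels the 1/2.

Minima are usually found with derivatives, and the normal equation does the same: to find the a and b that minimize the cost, take the partial derivative with respect to each variable and set it to zero.

To obtain the optimal coefficients a and b, the resulting equations are rearranged so that they can be solved with a matrix inverse. Writing them in matrix form and simplifying yields the a and b that minimize the cost.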
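In matrix form the standard closed-form solution is $\theta = (X^T X)^{-1} X^T y$, where each row of $X$ is $[x_i, 1]$ and $\theta = [a, b]$. A minimal NumPy sketch, using the small dataset x = [1, 2, 3], y = [3, 5, 7] that the linear regression section of this post works with:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 5.0, 7.0])

# Design matrix: one column for x, one column of ones for the intercept
X = np.column_stack([x, np.ones_like(x)])

# Normal equation (X^T X) theta = X^T y.
# np.linalg.solve is used instead of forming the inverse explicitly,
# which gives the same answer with better numerical behavior.
theta = np.linalg.solve(X.T @ X, X.T @ y)
a, b = theta

print(a, b)
```

For this data the solution is a = 2 and b = 1 (up to floating-point rounding), matching the line f(x) = 2x + 1.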

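Linear regression

Because the normal equation is a way of estimating model parameters in linear regression, let's look at linear regression as well.

First, suppose we have inputs x: [1, 2, 3] and we know the corresponding target values are y: [3, 5, 7].

What value does y take when x = 4?

We can express y as a function of x, and here it is easy to guess f(x) = 2x + 1. But once the data become large and complex, the problem is far too hard to solve in your head. For problems like that, we can ask a computer to do the calculation.

How does a computer solve it? This kind of problem is called linear regression. Once we can solve a linear regression problem, we can infer y for new inputs as well.

Linear relationships show up in a wide range of real-world cases, so linear regression is a very practical model and one of the most basic algorithms in machine learning and statistics.

Now let's get a computer to solve the linear regression problem.

First we need a hypothesis. The hypothesis function is H(W, b) = Wx + b, and in the situation above a human can guess that W is 2 and b is 1.

A computer, however, cannot guess the way a human does, so we have to train it.

In other words, our goal is to bring W and b to W = 2 and b = 1. The computer does not know this, so we initialize the hypothesis arbitrarily with W = 1 and b = 0. Through repeated computation, W (currently 1) and b (currently 0) must be moved closer to the target. For that we need a measure of how far the current hypothesis is from the target.

This measure is generally called the cost, written cost(W, b), and it is usually computed with the least squares method.

Going back to our data, we can plot x: [1, 2, 3] against y: [3, 5, 7] on a graph.

We can draw our arbitrary hypothesis, W = 1 and b = 0, on the same graph. Our goal is to move H(W, b) closer to the original data, and the cost function is what tells us 'how wrong' the current hypothesis is.

The least squares cost squares the difference between each prediction and the actual value, sums the squares, and divides by the number of data points. For the hypothesis W = 1, b = 0 the cost is 29/3, which shows the predictions are far from the actual values. As the predictions approach the actual values, the cost approaches 0.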
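The 29/3 figure can be checked with a few lines of Python. This is a minimal sketch; the function name `cost` and the use of `Fraction` (to keep the result exact) are my own choices.

```python
from fractions import Fraction

def cost(W, b, xs, ys):
    """Mean squared error of the hypothesis H(x) = W*x + b."""
    squared_errors = [(W * x + b - y) ** 2 for x, y in zip(xs, ys)]
    return Fraction(sum(squared_errors), len(xs))

xs, ys = [1, 2, 3], [3, 5, 7]
print(cost(1, 0, xs, ys))  # 29/3  (errors 2, 3, 4 -> 4 + 9 + 16 = 29, divided by 3)
print(cost(2, 1, xs, ys))  # 0     (the target hypothesis fits exactly)
```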

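We can write this cost function as a formula. In it, Wx_i + b is our hypothesis, i.e. the predicted value, and y_i is the actual value.

The difference between the predicted and actual values must contribute a non-negative amount, so it is squared. An absolute value would also make it non-negative, so why use the square?

There are two reasons.

-> Squaring makes the cost grow faster, so a badly wrong hypothesis receives a larger penalty.
-> Computing an absolute value requires a conditional branch, which can make the computation slower. The square is also differentiable everywhere, which gradient-based methods rely on.

So how do we reduce this cost? We need to find the W and b that minimize it.

W is the slope in the hypothesis, and if W is set badly the error grows quadratically. To show this, put W on the x-axis and the cost on the y-axis. Suppose the W we ultimately want, the global optimum, exists. The cost increases as W moves away from the global optimum, so the cost can be drawn as a parabola (a quadratic function of W).

The slope is 0 at the global optimum, so if the slope is negative we move to the right (+), and if it is positive we move to the left (-). Repeating these moves brings us to the global optimum. This method is called gradient descent.

Put simply, gradient descent repeatedly computes the gradient and updates the variables to reduce the cost.

b is the y-intercept of the line, and like W, its cost can be drawn as a quadratic function.

b is found the same way, by following the slope of its parabola toward the optimum. Finding suitable values of W and b that reduce the cost is what solving a linear regression problem means.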
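The update rule described above can be sketched in plain Python as follows. The learning rate `lr` and the number of `steps` are values I picked for illustration; they are assumptions, not values from the original post.

```python
def gradient_descent(xs, ys, lr=0.05, steps=5000):
    """Fit H(x) = W*x + b by gradient descent on the mean squared error."""
    W, b = 1.0, 0.0  # the arbitrary initial hypothesis from the text
    m = len(xs)
    for _ in range(steps):
        errors = [W * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of (1/m) * sum(error^2) with respect to W and b.
        grad_W = (2 / m) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / m) * sum(errors)
        # Negative slope -> move right (+); positive slope -> move left (-).
        W -= lr * grad_W
        b -= lr * grad_b
    return W, b

W, b = gradient_descent([1, 2, 3], [3, 5, 7])
print(round(W, 3), round(b, 3))
```

With these settings W and b converge to roughly 2 and 1. A learning rate that is too large would make the updates diverge instead.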

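Normal equation? Gradient descent?

The normal equation minimizes the cost function analytically, in closed form. It works well on small datasets, where computing the matrix inverse is cheap. It does, however, require that the inverse exists.

Gradient descent minimizes the cost function by computing its derivative and using it to update the model parameters iteratively.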

reference) https://www.youtube.com/watch?v=ve6gtpZV83E

[Chapter.1] Introduction and Optimization Problems

Tue, 21 Nov 2023 08:41:10 GMT

key words

#Computer Model #Optimization Model #Greedy algorithms

Knapsack problem

Constraint -> the items must fit in the knapsack: an optimization problem in which a thief wants to steal the most valuable set of items.

Two variants

0/1 knapsack problem vs Continuous or fractional knapsack problem

0/1 knapsack problem: each decision affects the decisions that follow.
Continuous or fractional knapsack problem: the continuous version is very easy and can be solved with a greedy algorithm.

0/1 knapsack problem

Each item is represented by a pair, <value, weight>.
The knapsack can accommodate items with a total weight of no more than w.
A vector, I, of length n, represents the set of available items. (Each element of the vector is an item.)

A vector, V, of length n, is used to indicate whether or not items are taken.

  ex) If V[i] = 1, item i is taken. If V[i] = 0, item i is not taken.

Expressed mathematically:

Find a V that maximizes $\sum_{i=0}^{n-1} V[i] \cdot I[i].value$ subject to the constraint that $\sum_{i=0}^{n-1} V[i] \cdot I[i].weight \le w$

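Solution) Brute force algorithm (exhaustive search)

Enumerate all possible combinations of items (i.e. generate the power set).
Remove all of the combinations whose total weight exceeds the allowed weight.
From the remaining combinations choose any one whose value is the largest.

-> Often not practical! This algorithm considers every possible case. It is guaranteed to find an optimal solution, but with n items there are 2^n combinations, so it becomes very slow as the vector grows. No other approach is as reliable at finding the optimal solution, but algorithms that give fairly good solutions do exist.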
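The three brute-force steps can be sketched as below. The item list and the capacity of 20 are hypothetical values chosen for illustration, and `brute_force_knapsack` is my own name for the helper.

```python
from itertools import combinations

def brute_force_knapsack(items, max_weight):
    """items: list of (name, value, weight). Returns (best_value, best_subset)."""
    best_value, best_subset = 0, ()
    # Enumerate the power set: every subset of every size.
    for r in range(len(items) + 1):
        for subset in combinations(items, r):
            weight = sum(w for _, _, w in subset)
            if weight > max_weight:
                continue  # violates the weight constraint
            value = sum(v for _, v, _ in subset)
            if value > best_value:
                best_value, best_subset = value, subset
    return best_value, best_subset

# Hypothetical items: (name, value, weight)
items = [("clock", 175, 10), ("painting", 90, 9), ("radio", 20, 4), ("vase", 50, 2)]
value, taken = brute_force_knapsack(items, 20)
print(value, [name for name, _, _ in taken])  # 265 ['clock', 'painting']
```

With 4 items this checks 16 subsets; every extra item doubles the work, which is why the approach stops being practical quickly.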

Another Solution is the "Flexible Greedy"

Practice code

class Food(object):
    def __init__(self, n, v, w):
        self.name = n
        self.value = v
        self.calories = w
    def getValue(self):
        return self.value
    def getCost(self):
        return self.calories
    def density(self):
        return self.getValue()/self.getCost()
    def __str__(self):
        return self.name + ': <' + str(self.value)\
                 + ', ' + str(self.calories) + '>'

def buildMenu(names, values, calories):
    """names, values, calories lists of same length.
       name a list of strings
       values and calories lists of numbers
       returns list of Foods"""
    menu = []
    for i in range(len(values)):
        menu.append(Food(names[i], values[i],
                          calories[i]))
    return menu

def greedy(items, maxCost, keyFunction):
    """Assumes items a list, maxCost >= 0,
         keyFunction maps elements of items to numbers"""
    itemsCopy = sorted(items, key = keyFunction,
                       reverse = True)
    result = []
    totalValue, totalCost = 0.0, 0.0
    for i in range(len(itemsCopy)):
        if (totalCost+itemsCopy[i].getCost()) <= maxCost:
            result.append(itemsCopy[i])
            totalCost += itemsCopy[i].getCost()
            totalValue += itemsCopy[i].getValue()
    return (result, totalValue)

def testGreedy(items, constraint, keyFunction):
    taken, val = greedy(items, constraint, keyFunction)
    print('Total value of items taken =', val)
    for item in taken:
        print('   ', item)

def testGreedys(foods, maxUnits):
    print('Use greedy by value to allocate', maxUnits,
          'calories')
    testGreedy(foods, maxUnits, Food.getValue)
    print('\nUse greedy by cost to allocate', maxUnits,
          'calories')
    testGreedy(foods, maxUnits,
               lambda x: 1/Food.getCost(x))
    print('\nUse greedy by density to allocate', maxUnits,
          'calories')
    testGreedy(foods, maxUnits, Food.density)


# Note: 'cake' has no matching entry in values/calories, so buildMenu ignores it.
names = ['wine', 'beer', 'pizza', 'burger', 'fries',
         'cola', 'apple', 'donut', 'cake']
values = [89,90,95,100,90,79,50,10]
calories = [123,154,258,354,365,150,95,195]
foods = buildMenu(names, values, calories)
testGreedys(foods, 1000)

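The code uses sorted -> Python uses Timsort, whose worst-case time complexity is O(n log n) (here n is the number of items).

The overall time complexity is therefore also O(n log n), so this is a fairly efficient algorithm.

+) Python syntax note: lambda expressions

Used to create an anonymous (unnamed) function
(lambda parameters : expression)
Creates a function that evaluates the expression using those parameters and returns the result
Defined inline instead of using def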
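As a small illustration of the points above, the sketch below defines the same function with def and with lambda, then uses a lambda as a sort key the way testGreedys does. The item dictionaries are hypothetical.

```python
# Two equivalent ways to define "inverse of cost".
def inverse_cost(item):
    return 1 / item["cost"]

inverse_cost_lambda = lambda item: 1 / item["cost"]

# Hypothetical items, sorted cheapest-first by using a lambda as the sort key.
items = [{"name": "pizza", "cost": 258}, {"name": "apple", "cost": 95}, {"name": "wine", "cost": 123}]
cheapest_first = sorted(items, key=lambda item: 1 / item["cost"], reverse=True)
print([item["name"] for item in cheapest_first])  # ['apple', 'wine', 'pizza']
```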

[AI web service project] MBTIgram: Modeling - LinearSVC

Thu, 24 Aug 2023 14:22:04 GMT

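In this post I walk through the modeling process for LinearSVC, the model selected as the final one. The unresolved 'class imbalance' problem was handled by resampling with SMOTE, and the model was built with TF-IDF Vectorizer, GridSearchCV, and LinearSVC.

SMOTE

SMOTE is a sampling technique for handling imbalanced data. It is a form of oversampling: it adds synthetic samples to the minority classes.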
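To give a feel for what "oversampling with synthetic samples" means, here is a simplified sketch of the interpolation idea behind SMOTE in plain Python. It is not the imblearn implementation; the function name `synthesize`, the parameter defaults, and the sample points are all assumptions made for illustration.

```python
import random

def synthesize(minority, n_new, k=2, seed=42):
    """Create n_new synthetic points by interpolating between minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbors of x (squared Euclidean distance), excluding x itself.
        neighbors = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment from x to the neighbor
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbor)))
    return synthetic

minority = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.5)]  # hypothetical minority-class points
new_points = synthesize(minority, n_new=3)
print(new_points)
```

Each new point lies on a line segment between two existing minority samples, so the minority class grows without simply duplicating rows.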

Why did the imbalanced data have to be handled?

Imbalanced data are data in which the number of observations differs sharply between classes (for example, a 'normal' class and an 'abnormal' class). What happens if we feed such data to a model as-is? Suppose there are two classes, 0 and 1, with a 99:1 sample ratio. A classifier trained on such data that labels every sample as '0' reaches 99% accuracy. Judged on accuracy alone the model looks good, but it never classifies class '1' correctly, so it is actually a poor model.
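The 99:1 example can be reproduced directly. This is a minimal sketch with made-up labels; the "model" here is just a list of constant predictions.

```python
y_true = [0] * 99 + [1]          # 99:1 class ratio
y_pred = [0] * 100               # a "model" that always predicts class 0

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Recall for class 1: of the true 1s, how many were predicted as 1?
true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall_1 = true_positives / sum(t == 1 for t in y_true)

print(accuracy, recall_1)  # 0.99 0.0
```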

For this reason, I needed to process the imbalanced data into balanced data.

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Encode both the target and the explanatory variable so that SMOTE can be applied
encoder_X = OneHotEncoder()
encoded_X = encoder_X.fit_transform(data['posts'].to_numpy().reshape(-1, 1))
encoder_y = LabelEncoder()
encoded_y = encoder_y.fit_transform(data['type'].to_numpy())  # LabelEncoder expects a 1-D array

To apply SMOTE to the dataset, the values first have to be mapped to numbers.

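Encoding

At first I one-hot encoded both the target and the explanatory variable. But because the target in my dataset has 16 classes, I switched only the target to label encoding to keep the vector dimension from growing.

What is one-hot encoding?

One-hot encoding converts a categorical variable into a binary vector of 0s and 1s. A new binary variable is created for each category, with 1 at the position of that category and 0 everywhere else. This expresses that the categories have no order or relationship to one another.

For example, with three fruit categories ['apple', 'banana', 'orange']:

'apple' becomes [1, 0, 0]
'banana' becomes [0, 1, 0]
'orange' becomes [0, 0, 1]

One-hot encoding is mainly used when the number of categories is relatively small; its drawback is that the vector dimension grows with the number of categories.

With my dataset there are 16 categories, so each label becomes a 16-dimensional vector such as [1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

What is label encoding?

Label encoding assigns a sequential integer to each category. Because the integers imply an order among the categories, it is mainly used for variables where order matters. For a classification target, as here, the integers act only as class IDs.

For example, with grade-level categories ['lower grades', 'middle grades', 'upper grades']:

'lower grades' becomes 1
'middle grades' becomes 2
'upper grades' becomes 3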
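The two encodings can be compared side by side in plain Python. This is a minimal sketch, not the scikit-learn encoders; numbering from 0 in sorted order is an assumption that mirrors how the labels come out in the output shown later in this post.

```python
categories = ["apple", "banana", "orange"]

# Label encoding: one integer per category (sorted order, starting from 0).
label_of = {c: i for i, c in enumerate(sorted(categories))}

# One-hot encoding: a binary vector with a single 1 at the category's index.
def one_hot(category):
    return [1 if i == label_of[category] else 0 for i in range(len(label_of))]

print(label_of)            # {'apple': 0, 'banana': 1, 'orange': 2}
print(one_hot("banana"))   # [0, 1, 0]
```

With 16 MBTI types, the label encoding stays a single integer per sample while the one-hot vector grows to length 16.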

import numpy as np

unique_array = np.unique(encoded_y)
print(unique_array)

Output: [ 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15] This confirms that the 16 classes were mapped correctly to the integers 0 through 15.

Applying SMOTE

import sklearn
from imblearn.over_sampling import SMOTE

# Apply SMOTE
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(encoded_X, encoded_y)

Before the TF-IDF Vectorizer can be used later on, the data balanced with SMOTE have to be decoded back from their encoded form.

# Helper functions that decode the encoded variables back to strings
def text_inverse_transform_X(encoded_data, encoder):
    decoded_data = encoder.inverse_transform(encoded_data)
    return decoded_data

def text_inverse_transform_y(encoded_data, encoder):
    decoded_data = encoder.inverse_transform(encoded_data)
    return decoded_data

# Decode the target and the explanatory variable
X_data = text_inverse_transform_X(X_resampled, encoder_X)
y_data = text_inverse_transform_y(y_resampled, encoder_y)

y_data

I checked that the target was decoded correctly.

X_data

The explanatory variable was decoded correctly as well.

from collections import Counter
Counter(y_data)

Looking at the per-class sample counts of y_data, as in the screenshot, we can see that SMOTE was applied properly.

TF-IDF Vectorizer

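TF-IDF (Term Frequency-Inverse Document Frequency) is a widely used text feature extraction method in natural language processing that helps assess how important a word is within a document.

- TF (Term Frequency): the number of times a given term appears in a document
- IDF (Inverse Document Frequency): the inverse of the number of documents that contain the term (in practice, usually its logarithm)
- TF-IDF = TF * IDF

Vectorizing with TF-IDF has two effects:
1. Emphasizing important words: TF-IDF scores a word based on its frequency and on how widely it is spread across documents. A word that appears often in one document but rarely in the others is treated as characteristic of that document.
2. Damping common words: Text data are generally high-dimensional, which can lead to problems related to the curse of dimensionality. TF-IDF is based on term frequency, but the IDF factor lowers the weight of words that are frequent yet appear in every document. TF-IDF does not remove any dimensions by itself, but it keeps common words from dominating the representation.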
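The TF * IDF definition above can be worked through by hand on a tiny corpus. This sketch uses the plain log-inverse form of IDF; scikit-learn's TfidfVectorizer uses a smoothed variant plus normalization, so its numbers will differ. The three documents are made up for illustration.

```python
import math

docs = [
    "people talk people listen".split(),
    "people think alone".split(),
    "cats sleep alone".split(),
]

def tf(term, doc):
    return doc.count(term)

def idf(term, docs):
    df = sum(term in doc for doc in docs)   # number of documents containing the term
    return math.log(len(docs) / df)         # plain log-inverse form, no smoothing

def tf_idf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

print(round(tf_idf("people", docs[0], docs), 3))  # 0.811: appears twice, but also in another document
print(round(tf_idf("talk", docs[0], docs), 3))    # 1.099: appears once, only in this document
```

Even though "people" occurs twice in the first document, "talk" gets the higher score because it is rarer across the corpus.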

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score

# Decode to utf-8 if a value is stored as bytes
X_decoded = [x.decode('utf-8') if isinstance(x, bytes) else x for x in X_data]
y_decoded = [y.decode('utf-8') if isinstance(y, bytes) else y for y in y_data]
# data split
X, X_test, y, y_test = train_test_split(X_decoded, y_decoded, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

from sklearn.feature_extraction.text import TfidfVectorizer
# Vectorizer
tfidf = TfidfVectorizer(lowercase=False)


# Convert to strings -> vectorizing X_train directly with TfidfVectorizer raises an error
X_train_str = [str(x) for x in X_train]

# Vectorize the training data
X_train_tfidf = tfidf.fit_transform(X_train_str)

clf = LinearSVC()
clf.fit(X_train_tfidf, y_train)
# Find the best hyperparameter with grid search
cv = GridSearchCV(clf, {'C': [0.1, 0.2, 0.3, 0.4, 0.5, 1.0]}, scoring = "accuracy")

text_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf',cv)])
text_clf.fit(X_train_str, y_train)

C = cv.best_estimator_.C

print("Best parameter C : ", C)

Best parameter C : 1.0

Model training

# Apply the best hyperparameter found by grid search
text_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC(C=1.0))])
text_clf.fit(X_train_str, y_train)

X_valid_str = [str(x) for x in X_valid]  # convert to strings to avoid a TypeError
# Predict MBTI types for the validation data
pred = text_clf.predict(X_valid_str)
# Accuracy on the validation data
accuracy_score(y_valid, pred)

from sklearn.metrics import classification_report
print(classification_report(y_valid, pred))

X_str = [str(x) for x in X]
X_test_str = [str(x) for x in X_test]
# Vectorize the test texts (this vectorizer and classifier are not used by the pipeline below)
X_tfidf = tfidf.fit_transform(X_test_str)

clf = LinearSVC()
clf.fit(X_tfidf, y_test)

svc_clf = Pipeline([('tfidf',TfidfVectorizer()),('clf',LinearSVC(C=1.0))])
svc_clf.fit(X_str, y) # train on the combined train/valid data, predict on the test data

pred_svc = svc_clf.predict(X_test_str)

accuracy_score(y_test, pred_svc)

from sklearn.metrics import classification_report
print(classification_report(y_test, pred_svc))

Looking at the classification report, the model performs well across all of the evaluation metrics.

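Saving the model

The trained model is saved as a pickle file.

import pickle

with open('saved_model_result', 'wb') as f:
    pickle.dump(svc_clf, f)

Closing thoughts..

Through the MBTIgram project I developed an 'MBTI prediction algorithm using text data'. It was the first time I completed every stage of development entirely on my own, without anyone's help, so I feel especially proud and I think it will stay with me for a long time. Running into so much trial and error meant I learned that much more. I hope to keep growing by working on the regrets and shortcomings this project left me with😆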

[AI web service project] MBTIgram: Modeling - RNN

Wed, 23 Aug 2023 11:26:04 GMT

In the previous post I covered the XGBoost modeling process :) This time I cover modeling with an RNN (Recurrent Neural Network), which performs strongly in domains such as time series and text.

Modeling: RNN (Recurrent Neural Network)

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense, Dropout
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from sklearn.metrics import classification_report

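data = pd.read_csv('/content/drive/MyDrive/spp_project/data_result.csv', index_col='type')  # use the type column as the index

X = data['posts']  # explanatory variable
y = data.index  # target variable

# NOTE: the steps that build X_padded (tokenizing and padding, with max_words and max_len)
# and the label mapping (label_to_int, y_train_int, y_test_int) are not shown in this post.

# Train/test split
X_train, X_test, y_train, y_test = train_test_split(X_padded, y, test_size=0.2, random_state=42)

# Convert to numpy arrays
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)

I trained two model variants. Variant 2 adds one more LSTM layer and increases the dropout rate.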

Model variant 1

# Define the model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=256, input_length=max_len))
model.add(LSTM(64, dropout=0.2))
model.add(Dense(len(label_to_int), activation='softmax'))  # Use len(label_to_int) as the number of units

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

model.summary()

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

# Learning rate reduction callback
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1)

# Train the model
batch_size = 64
epochs = 20
model.fit(X_train, y_train_int, batch_size=batch_size, epochs=epochs,
          validation_split=0.2, callbacks=[early_stopping, reduce_lr])

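Early stopping: monitors performance on the validation data and stops training early. Stopping when performance no longer improves helps prevent overfitting and shortens training time.

monitor: the metric to monitor, e.g. 'val_loss', 'val_accuracy'
patience: how many epochs to wait without improvement before stopping
verbose: whether to print messages, e.g. 0: silent, 1: print
restore_best_weights: whether to restore the weights from the best epoch

ReduceLROnPlateau: adjusts the learning rate dynamically. When performance on the validation data stops improving, it lowers the learning rate to help the model converge to a better point.

monitor: the metric to monitor
factor: the factor by which the learning rate is reduced (new learning rate = current learning rate * factor)
verbose: whether to print messages
min_lr: the lower bound on the learning rate

In the results below, because early stopping was set with patience = 2, training ends early once val_loss has failed to improve for 2 consecutive epochs.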
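The patience rule can be illustrated without Keras. The sketch below is a simplified stand-in for the callback's logic, not its actual implementation; the function name `epochs_run` and the validation losses are hypothetical.

```python
def epochs_run(val_losses, patience=2):
    """Return how many epochs run before patience-based early stopping triggers."""
    best = float("inf")
    wait = 0  # epochs since the last improvement
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, wait = loss, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch  # stop after this epoch
    return len(val_losses)

# Hypothetical validation losses: improvement stops after epoch 3.
print(epochs_run([2.1, 1.8, 1.7, 1.75, 1.72, 1.6], patience=2))  # 5
```

Note that epoch 6 would have improved on the best loss, but training had already stopped; a larger patience trades extra training time for a chance to catch such late improvements.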

# Predict
pred_probs = model.predict(X_test)
pred_classes = np.argmax(pred_probs, axis=1)

# Compute and print the classification report
report = classification_report(y_test_int, pred_classes, target_names=label_to_int.keys())
print(report)

In the classification report, classes with small support (the number of samples in each class) generally have low precision, recall, and f1-score. Ways to improve performance include hyperparameter tuning (e.g. adjusting the number of layers), addressing class imbalance, and adjusting the dropout rate.

The first thing I tried was 'addressing class imbalance' through class weighting, but accuracy dropped sharply, validation performance barely improved, and training stopped early, so I tried other approaches.

It is not used in the final code, but for anyone curious, class weighting works as follows.

 To address class imbalance, use the class_weight parameter.
 class_weight specifies the weight applied to each class when the loss is computed;
 giving higher weights to under-represented classes helps the model learn from imbalanced data.

# Import the required module
from sklearn.utils.class_weight import compute_class_weight

# Compute class weights
class_weights = compute_class_weight(class_weight = "balanced", classes = np.unique(y_train_int), y = y_train_int)
class_weight_dict = {i: w for i, w in enumerate(class_weights)}
.
.
(omitted)
.
.
# Add class_weight=class_weight_dict to the arguments
model.fit(X_train, y_train_int, batch_size=batch_size, epochs=epochs,
          validation_split=0.2, callbacks=[early_stopping, reduce_lr], class_weight=class_weight_dict)
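For intuition, the "balanced" weights follow the formula n_samples / (n_classes * count_of_class). The plain-Python sketch below reproduces that formula on a made-up 8:2 label list; the helper name `balanced_class_weights` is my own.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}

labels = [0] * 8 + [1] * 2   # hypothetical 8:2 imbalance
print(balanced_class_weights(labels))  # {0: 0.625, 1: 2.5}
```

The rare class gets a weight four times larger here, so each of its samples contributes four times as much to the loss.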

Model variant 2

# Define the model
model = Sequential()
model.add(Embedding(input_dim=max_words, output_dim=256, input_length=max_len))
model.add(LSTM(128, dropout=0.3, return_sequences=True))
model.add(LSTM(128, dropout=0.3))
model.add(Dense(len(label_to_int), activation='softmax'))  # Use len(label_to_int) as the number of units

# Compile the model
model.compile(loss='sparse_categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print model summary
model.summary()

# Early stopping callback
early_stopping = EarlyStopping(monitor='val_loss', patience=2, verbose=1)

# Learning rate reduction callback
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=1, verbose=1)

# Train the model
batch_size = 64
epochs = 30
model.fit(X_train, y_train_int, batch_size=batch_size, epochs=epochs,
          validation_split=0.2, callbacks=[early_stopping, reduce_lr])

# Predict classes
pred_probs = model.predict(X_test)
pred_classes = np.argmax(pred_probs, axis=1)

# Calculate classification report
report = classification_report(y_test_int, pred_classes, target_names=label_to_int.keys())
print(report)

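In variant 2 as well, classes with small support (the number of samples in each class) generally have low precision, recall, and f1-score. I came to think that solving the class imbalance was the most important thing. I kept trying class weighting, but it did not solve the problem, so I decided to go back to the LinearSVC model I had used at the start and add data resampling.

Closing thoughts..

I do not have much experience with deep learning models, so I invested a lot of time but did not get good results. Doing the modeling myself made me realize how much I still have to study. Facing various errors and poor performance meant a lot of trial and error, through which I learned a great deal and also felt my shortcomings. Comparing the algorithm I wrote at the beginning with the final result, I also felt that I had grown a lot. In the next post I will explain both the first LinearSVC code I wrote and the LinearSVC code that became the final model.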

[AI web service project] MBTIgram: Modeling - XGBoost

Tue, 22 Aug 2023 07:39:40 GMT

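I built models based on the preprocessing and EDA from the previous post, experimenting with and validating three candidate models. Please refer to the previous post for the preprocessing and EDA.

Model selection

XGBoost
RNN
LinearSVC

I chose these three candidates for the following reasons.

XGBoost (eXtreme Gradient Boosting)

XGBoost is a boosting algorithm, an ensemble technique, and can perform strongly on a variety of data types and complex patterns.
It offers class weighting and sampling options for dealing with class imbalance, so it can cope with imbalanced datasets.
It provides feature importances, which make it easier to interpret which features matter for the model's predictions.

+) Hyperparameter tuning matters a lot for XGBoost, so I additionally used GridSearchCV to find the best parameters before training.

RNN (Recurrent Neural Network)

RNNs are designed to learn dependencies across a sequence (long-term dependencies are handled better by gated variants such as LSTM) and can perform strongly in domains such as time series and text.

+) To address class imbalance, I used 'compute_class_weight', which automatically computes weights according to the sample ratio of each class.

LinearSVC (Support Vector Machine with Linear Kernel)

LinearSVC separates classes with a linear decision boundary, so it is relatively simple yet can achieve high performance.

It works well on high-dimensional data and can be especially effective for natural language processing tasks such as text classification.
It can be relatively less sensitive to class imbalance.

+) As with XGBoost, I used GridSearchCV to find the best value of C for LinearSVC.

Modeling: XGBoost (eXtreme Gradient Boosting)

Preprocessing

To use XGBoost, I ran one more round of preprocessing: vectorization with TF-IDF, and a Label Encoder to map the labels to numbers.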

posts = data['posts'] # explanatory variable
MBTItype = data['type'] # target variable

# Convert to numpy arrays
posts_list = posts.to_numpy()
type_list = MBTItype.to_numpy()

type_list

array(['INTJ', 'INTJ', 'INTJ', ..., 'INTP', 'INFP', 'INFP'], dtype=object)

posts_list

array(['know tool use interaction people excuse antisocial truly enlighten mastermind know would count pet peeze something time matter people either whether group people mall never see best friend sit outside conversation jsut listen want interject sit formulate say wait inject argument thought find fascinate sit watch people talk people fascinate sit class watch different people find intrigue dad stand look like line safeway watch people home talk people like think military job people ...(omitted)

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# Vectorize posts_list into a token count matrix for the model
cntizer = CountVectorizer(analyzer="word",
                             max_features=1000,
                             max_df=0.7,
                             min_df=0.1)
# analyzer="word": analyze the text word by word
# max_features=1000: keep at most 1000 word features
# max_df=0.7: ignore words that appear in more than 70% of documents
# min_df=0.1: keep only words that appear in at least 10% of documents

print("Using CountVectorizer :")

# Convert posts_list into a token count matrix
X_cnt = cntizer.fit_transform(posts_list)

feature_names = list(enumerate(cntizer.get_feature_names_out()))
print("10 feature names can be seen below")
print(feature_names[0:10])



tfizer = TfidfTransformer()

print("\nUsing Tf-idf :")

print("Now the dataset size is as below")
# Convert the count matrix X_cnt into its tf-idf representation
X_tfidf =  tfizer.fit_transform(X_cnt).toarray()
print(X_tfidf.shape)

Using CountVectorizer : 10 feature names can be seen below [(0, 'ability'), (1, 'able'), (2, 'absolutely'), (3, 'accept'), (4, 'accurate'), (5, 'across'), (6, 'act'), (7, 'action'), (8, 'actual'), (9, 'actually')]

Using Tf-idf : Now the dataset size is as below (114742, 672)

#counting top 10 words
reverse_dic = {}
for key in cntizer.vocabulary_:
    reverse_dic[cntizer.vocabulary_[key]] = key
top_10 = np.asarray(np.argsort(np.sum(X_cnt, axis=0))[0,-10:][0, ::-1]).flatten()
[reverse_dic[v] for v in top_10]

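I used LabelEncoder in order to use XGBoost. XGBoost works on numeric data, so categorical variables have to be converted to numbers before they can be used.

The target in my dataset has 16 categories, so I mapped them to the numbers 0 to 15 to put them in a form the model can work with.

# Use LabelEncoder so the labels can be passed to XGBoost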
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(type_list)
type_list = le.transform(type_list)

from sklearn.model_selection import train_test_split
X_data = X_tfidf
y_data = type_list

X, X_test, y, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=1)
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=1)

모델 학습

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import numpy as np


xgb_model = XGBClassifier(n_estimators=100)
params = {'learning_rate': [0.1, 0.06,  0.02], 'max_depth':[3, 5, 7], 'colsample_bytree':[0.3, 0.5,0.75]} # 'n_estimators': [100, 300, 500],'min_child_weight':[1,3], // adding 'scale_pos_weight': [1, 5, 10] triggers a warning that it is not used.

# Create the GridSearchCV object
gridcv = GridSearchCV(xgb_model, param_grid=params, cv=3, scoring='f1_weighted')

# Start hyperparameter tuning
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric='mlogloss', eval_set=[(X_valid, y_valid)])

# Print the tuned parameters
print(gridcv.best_params_)
print(gridcv.best_score_)

To mitigate the class imbalance, I set scoring='f1_weighted' during hyperparameter tuning, so the best hyperparameters were chosen by the support-weighted average F1-score.

The tuning produced the result below, and I trained the model based on it.

Hyperparameter tuning result:

from xgboost import XGBClassifier
from sklearn.model_selection import GridSearchCV

# Create the model with the parameters from the first round of tuning
xgb_model = XGBClassifier(n_estimators=500, learning_rate=0.1, max_depth=7, min_child_weight=3, colsample_bytree=0.75, reg_alpha=0.03)

# Train
xgb_model.fit(X, y, early_stopping_rounds=200, eval_metric='mlogloss', eval_set=[(X_test, y_test)])

from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix, f1_score

xgb_pred = xgb_model.predict(X_test)

accuracy = accuracy_score(y_test, xgb_pred)
precision = precision_score(y_test, xgb_pred, average='weighted')
recall = recall_score(y_test, xgb_pred, average='weighted')
f1 = f1_score(y_test, xgb_pred, average = 'weighted')

print(accuracy)
print(precision)
print(recall)
print(f1)
print('Accuracy : {:.4f}\nPrecision : {:.4f}\nRecall : {:.4f}\nf1-score : {:.4f}'.format(accuracy, precision, recall, f1))

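📌 Differences from the initial modeling

This project started as an entry for the 2023 Korea-US hackathon hosted by the University of Southern California, so time was short and the preprocessing and EDA were insufficient. I knew the class imbalance was serious, but it was hard to account for it in the modeling.

The algorithm I built with LinearSVC early in the project was evaluated on accuracy alone, and I judged it to be good based on its 84% accuracy. That evaluation did not take class imbalance into account at all.

Why accuracy alone should not be used to evaluate a model when classes are imbalanced:

Distortion from data imbalance: when sample counts differ greatly between classes, the model tends to learn to classify the large classes better.
As a result, it may fail to predict the small classes correctly, while accuracy still looks high because of the imbalance.

Why F1-score should be used:

It offsets the bias caused by data imbalance and evaluates predictive performance more evenly across classes.
Because F1-score considers both precision and recall, it is sensitive to poor performance on minority classes.
It gives a more accurate picture of how well the model predicts the small classes.

I therefore decided to include F1-score in the evaluation as well.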
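A small worked example makes the difference concrete. This sketch computes per-class F1 by hand on made-up labels (8 samples of class "A", 2 of class "B"); the helper `f1_for_class` is my own, not a library function.

```python
def f1_for_class(y_true, y_pred, cls):
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical labels: 8 samples of class "A", 2 of class "B".
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]   # one of the two "B" samples is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
f1_a = f1_for_class(y_true, y_pred, "A")
f1_b = f1_for_class(y_true, y_pred, "B")
print(accuracy, round(f1_a, 3), round(f1_b, 3))  # 0.9 0.941 0.667
```

Accuracy is 90%, yet the F1-score for the small class is only about 0.67, which is exactly the kind of weakness accuracy alone hides.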

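Wrapping up..

As a result, the first model had high accuracy but a very low f1-score, so although the current model still cannot be called good, I could see that it had improved. I think the performance is poor because gradient boosting algorithms such as XGBoost excel mainly at modeling numerical data and are somewhat limited on unstructured data such as text. XGBoost is less sensitive to class imbalance, but since this is a text modeling task, I felt it was not a suitable model.

Until now I had rarely worked with datasets that have a class imbalance problem, so I evaluated models on accuracy alone. Modeling with severely imbalanced data taught me that model performance should be evaluated with a variety of metrics.

It is not a polished post, but thank you for reading.☺️

The next post is 'RNN modeling and performance evaluation'.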

[AI web service project] MBTIgram: Dataset preprocessing and EDA

Fri, 18 Aug 2023 19:10:12 GMT

1. Development environment and datasets

💻 Development environment: Google Colab
✅ Datasets used
(MBTI) Myers-Briggs Personality Type Dataset [Link] https://www.kaggle.com/datasets/datasnaek/mbti-type mbti_1.csv

MBTI Personality Types 500 Dataset [Link] https://www.kaggle.com/datasets/zeyadkhalid/mbti-personality-types-500-dataset MBTI 500.csv

I used two MBTI datasets from Kaggle. Each has two columns, type and posts: type is the MBTI type, and posts is text written by a person of that type.

❤️I am a heavy Colab user, so I used Colab for this project too.❤️

2. Data preprocessing

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import warnings
warnings.filterwarnings('ignore')

import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset
data = pd.read_csv('/content/drive/MyDrive/spp_project/mbti_concat.csv')

The dataset loaded above is a CSV file in which the two datasets were merged beforehand.

data

An 'Unnamed: 0' column, created while merging the datasets, is visible. It seems an extra index was produced during the concat() step. It is unnecessary, so I drop the whole column.

# Drop the unnecessary column
data = data.drop(['Unnamed: 0'], axis=1)
data

Now that the column is confirmed gone, let's start preprocessing in earnest. English has contractions, so the text needs to be normalized by expanding them. I wrote the code with reference to the link below. https://stackoverflow.com/questions/19790188/expanding-english-language-contractions-in-python

# Contractions map used by the preprocessing function
contractions = {"'cause": 'because',
 "I'd": 'I would',
 "I'd've": 'I would have',
 "I'll": 'I will',
 "I'll've": 'I will have',
 "I'm": 'I am',
 "I've": 'I have',
 "ain't": 'is not',
 "aren't": 'are not',
 "can't": 'cannot',
 "could've": 'could have',
 "couldn't": 'could not',
 "didn't": 'did not',
 "doesn't": 'does not',
 "don't": 'do not',
 "hadn't": 'had not',
 "hasn't": 'has not',
 "haven't": 'have not',
 "he'd": 'he would',
 "he'll": 'he will',
 "he's": 'he is',
 "here's": 'here is',
 "how'd": 'how did',
 "how'd'y": 'how do you',
 "how'll": 'how will',
 "how's": 'how is',
 "i'd": 'i would',
 "i'd've": 'i would have',
 "i'll": 'i will',
 "i'll've": 'i will have',
 "i'm": 'i am',
 "i've": 'i have',
 "isn't": 'is not',
 "it'd": 'it would',
 "it'd've": 'it would have',
 "it'll": 'it will',
 "it'll've": 'it will have',
 "it's": 'it is',
 "let's": 'let us',
 "ma'am": 'madam',
 "mayn't": 'may not',
 "might've": 'might have',
 "mightn't": 'might not',
 "mightn't've": 'might not have',
 "must've": 'must have',
 "mustn't": 'must not',
 "mustn't've": 'must not have',
 "needn't": 'need not',
 "needn't've": 'need not have',
 "o'clock": 'of the clock',
 "oughtn't": 'ought not',
 "oughtn't've": 'ought not have',
 "sha'n't": 'shall not',
 "shan't": 'shall not',
 "shan't've": 'shall not have',
 "she'd": 'she would',
 "she'd've": 'she would have',
 "she'll": 'she will',
 "she'll've": 'she will have',
 "she's": 'she is',
 "should've": 'should have',
 "shouldn't": 'should not',
 "shouldn't've": 'should not have',
 "so's": 'so as',
 "so've": 'so have',
 "that'd": 'that would',
 "that'd've": 'that would have',
 "that's": 'that is',
 "there'd": 'there would',
 "there'd've": 'there would have',
 "there's": 'there is',
 "they'd": 'they would',
 "they'd've": 'they would have',
 "they'll": 'they will',
 "they'll've": 'they will have',
 "they're": 'they are',
 "they've": 'they have',
 "this's": 'this is',
 "to've": 'to have',
 "wasn't": 'was not',
 "we'd": 'we would',
 "we'd've": 'we would have',
 "we'll": 'we will',
 "we'll've": 'we will have',
 "we're": 'we are',
 "we've": 'we have',
 "weren't": 'were not',
 "what'll": 'what will',
 "what'll've": 'what will have',
 "what're": 'what are',
 "what's": 'what is',
 "what've": 'what have',
 "when's": 'when is',
 "when've": 'when have',
 "where'd": 'where did',
 "where's": 'where is',
 "where've": 'where have',
 "who'll": 'who will',
 "who'll've": 'who will have',
 "who's": 'who is',
 "who've": 'who have',
 "why's": 'why is',
 "why've": 'why have',
 "will've": 'will have',
 "won't": 'will not',
 "won't've": 'will not have',
 "would've": 'would have',
 "wouldn't": 'would not',
 "wouldn't've": 'would not have',
 "y'all": 'you all',
 "y'all'd": 'you all would',
 "y'all'd've": 'you all would have',
 "y'all're": 'you all are',
 "y'all've": 'you all have',
 "you'd": 'you would',
 "you'd've": 'you would have',
 "you'll": 'you will',
 "you'll've": 'you will have',
 "you're": 'you are',
 "you've": 'you have'}

To keep only the meaningful word tokens in the dataset, tokens that carry little meaning have to be removed. Words such as 'I', 'my', and 'me', along with particles and affixes, appear frequently but often contribute little to analyzing meaning. These are called stopwords, and nltk ships a predefined list of them.

To use nltk's stopwords you import the module; if you get an error saying the data are missing, you can download them with nltk.download.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# NLTK stopwords
stop_words = set(stopwords.words('english'))
print('Number of stopwords :', len(stop_words))
print(stop_words)

stopwords.words("english") returns nltk's list of English stopwords. The code above prints the number of stopwords and the stopwords themselves; there were 179 of them.

import re
from bs4 import BeautifulSoup

# Preprocessing function
def preprocess_sentence(sentence, remove_stopwords = True):
    sentence = re.sub(r'https?://\S+', '', sentence) # remove links (including links at the end of the text)
    sentence = sentence.lower() # lowercase the text
    sentence = BeautifulSoup(sentence, "lxml").text # strip HTML tags such as <br />
    sentence = re.sub(r'\([^)]*\)', '', sentence) # remove parenthesized text. Ex) my friend(yugyeong) -> my friend
    sentence = re.sub('"','', sentence) # remove double quotes
    sentence = ' '.join([contractions[t] if t in contractions else t for t in sentence.split(" ")]) # expand contractions
    sentence = re.sub(r"'s\b","",sentence) # remove possessives. Ex) yugyeong's -> yugyeong
    sentence = re.sub("[^a-zA-Z]", " ", sentence) # replace non-letters (digits, special characters, etc.) with spaces
    sentence = re.sub('[m]{2,}', 'mm', sentence) # collapse 3 or more m's to 2. Ex) ummmmmmm  -> umm

    # Remove MBTI type names so the label does not leak into the text
    pers_types = ['infp' ,'infj', 'intp', 'intj', 'istp', 'isfp', 'isfj','istj', 'entp', 'enfp', 'entj', 'enfj', 'estp', 'esfp' ,'esfj' ,'estj']
    for types in pers_types:
      sentence = sentence.replace(types, '')

    # Remove stopwords (Text)
    if remove_stopwords:
        tokens = ' '.join(word for word in sentence.split() if word not in stop_words and len(word) > 1)
    # Keep stopwords (Summary)
    else:
        tokens = ' '.join(word for word in sentence.split() if len(word) > 1)
    return tokens

The preprocessing function is defined as above. pers_types lists the MBTI types: if the type names appear in the text they could affect prediction accuracy, so they are removed.
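To see what the regular-expression steps do to a single sentence, here is a reduced sketch that needs only the standard library. The two-entry contractions map, the function name `clean`, the sample sentence, and the example.com URL are stand-ins I made up; the BeautifulSoup and stopword steps are left out.

```python
import re

# Tiny stand-in for the full contractions map above.
contractions = {"i'm": "i am", "don't": "do not"}

def clean(sentence):
    sentence = re.sub(r"https?://\S+", "", sentence)            # drop links
    sentence = sentence.lower()
    sentence = " ".join(contractions.get(t, t) for t in sentence.split(" "))
    sentence = re.sub(r"'s\b", "", sentence)                     # drop possessive 's
    sentence = re.sub("[^a-zA-Z]", " ", sentence)                # non-letters -> spaces
    sentence = re.sub("m{3,}", "mm", sentence)                   # ummmm -> umm
    return " ".join(sentence.split())                            # squeeze whitespace

print(clean("I'm sure Yugyeong's blog https://example.com is good, ummmmm!!"))
# i am sure yugyeong blog is good umm
```

The order matters: contractions are expanded before the non-letter filter, because that filter would otherwise turn the apostrophes into spaces and break the lookup.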

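# Preprocess the posts column
clean_posts = []
for s in data['posts']:
    clean_posts.append(preprocess_sentence(s))
clean_posts[:5]

The results show that preprocessing worked as intended: MBTI type names such as the 'intj' that appeared in row 0 of the raw dataset are gone, links and special characters are removed, and the text is lowercased.

data['posts'] = clean_posts

# Check whether preprocessing introduced any missing values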
print(data.isnull().sum())

posts 0
type 0
dtype: int64

There are no missing values. Next, I use the collections module to find the most frequent words across the posts column and visualize them with a word cloud.

import collections
from collections import Counter

# Use Counter from collections to print the 40 most frequent words in the posts column
words = list(data["posts"].apply(lambda x: x.split()))
words = [x for y in words for x in y]
Counter(words).most_common(40)

The screenshot could not fit them all, but the 40 most frequent words were printed. Passing a list with repeated items to the Counter constructor returns an object that stores how many times each element occurs.
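The same pattern on a toy input shows what Counter returns. The two "posts" here are made up for illustration.

```python
from collections import Counter

posts = ["people think people talk", "think alone"]   # hypothetical cleaned posts
words = [word for post in posts for word in post.split()]

counts = Counter(words)
print(counts.most_common(2))  # [('people', 2), ('think', 2)]
```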

See https://www.daleseo.com/python-collections-counter/ for how to use Counter from the collections module.

import wordcloud
from wordcloud import WordCloud, STOPWORDS

wc = wordcloud.WordCloud(width=1200, height=500,
                         collocations=False, background_color="white",
                         colormap="tab20b").generate(" ".join(words))

plt.figure(figsize=(25,10))
# Draw the word cloud
plt.imshow(wc, interpolation='bilinear')
_ = plt.axis("off")

# One word cloud per MBTI type on a 4x4 grid (assumes at most 16 types)
fig, axes = plt.subplots(4, 4, figsize=(15, 16))
for ax in axes.flat:
    ax.axis("off")
for ax, mbti_type in zip(axes.flat, data['type'].unique()):
    data_4 = data[data['type'] == mbti_type]
    type_cloud = WordCloud(max_words=1628,relative_scaling=1,normalize_plurals=False).generate(data_4['posts'].to_string())
    ax.imshow(type_cloud, interpolation='bilinear')
    ax.set_title(mbti_type)

The word clouds show which words each MBTI type used most often.

3. EDA

With preprocessing done, I run EDA on the preprocessed dataset.

data.head()

data.info()

data.isnull().sum().to_frame().rename(columns={0: "Count of Missing Values"})

import seaborn as sns
import matplotlib.pyplot as plt

# Set the style and colors
sns.set(style="whitegrid", palette="pastel")

# Create the count plot
plt.figure(figsize=(14, 6))
ax = sns.countplot(data=data, x='type', order=sorted(data['type'].unique()),
                   palette="pastel")
ax.set_title('Distribution of MBTI Types')
ax.set_xlabel('MBTI Type')
ax.set_ylabel('Count')
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha="right")
for p in ax.patches:
    ax.annotate(f'{p.get_height():.0f}', (p.get_x()+p.get_width()/2, p.get_height()),
                ha='center', va='bottom', fontsize=12)
plt.tight_layout()

plt.show()

The distribution of posts per MBTI type shows severe class imbalance. For modeling, I expect that choosing a model that copes well with class imbalance will matter most for improving performance.
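One common way to respond to this kind of imbalance is to weight each class inversely to its frequency, so that rare types count more during training. The sketch below computes such weights as n_samples / (n_classes * class_count), the same formula scikit-learn uses for `class_weight="balanced"`. The label list is made up for illustration and `balanced_class_weights` is a hypothetical helper, not part of this project.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Return weight = n_samples / (n_classes * count) for each class."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}

# Hypothetical, heavily imbalanced label column
labels = ['infp'] * 6 + ['intj'] * 3 + ['estj'] * 1
weights = balanced_class_weights(labels)
print(weights)
```

The rarest class receives the largest weight, so misclassifying it costs more than misclassifying a frequent class.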

data['word_count'] = data['posts'].apply(lambda x: len(x.split()))

plt.figure(figsize=(14, 6))
sns.histplot(data=data, x='word_count', bins=30, kde=True)
plt.title('Distribution of Word Count in Posts')
plt.xlabel('Word Count')
plt.ylabel('Frequency')
plt.show()

This is the distribution of the number of words per post.

import plotly.express as px

# Color palette
color_palette = px.colors.qualitative.Pastel

# Create the box plot
fig = px.box(data, x="type", y="word_count", color="type",
             title="Word Count Distribution by MBTI Personality Type",
             category_orders={"type": sorted(data["type"].unique())},
             color_discrete_sequence=color_palette)

# Axis titles
fig.update_xaxes(title="MBTI Personality Type", showgrid=False,
                 tickfont=dict(size=12, color="black"))
fig.update_yaxes(title="Word Count", showgrid=False,
                 tickfont=dict(size=12, color="black"))

# Title font
fig.update_layout(title_font=dict(size=24, color="darkblue"))

fig.show()

This is the word count distribution for each MBTI type.

I drop the word_count column that was created for the visualizations and save the result as 'data_result.csv'. This CSV file is the final dataset used for modeling.

# Drop the word_count column, then save as a CSV file
data = data.drop(['word_count'], axis=1)
data.to_csv('data_result.csv')

Here is the EDA code I first wrote at the start of the project.

from google.colab import drive
drive.mount('/content/drive')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

train = pd.read_csv('/content/drive/MyDrive/spp_project/MBTI.csv')

train.head()

# Check the number of distinct labels
print(f"{train['type'].nunique()} labels")

# Check the number of rows per label
train['type'].value_counts()

# Check for missing values
train.isnull().sum()

# Check for duplicated posts (True means every post is unique)
train['posts'].nunique() == len(train['posts'])

# Check the frequency of each MBTI letter

# E / I frequency (first letter)
train['type'].str[0].value_counts()

# N / S frequency (second letter)
train['type'].str[1].value_counts()

# T / F frequency (third letter)
train['type'].str[2].value_counts()

# P / J frequency (fourth letter)
train['type'].str[3].value_counts()

[ML] How to Handle Missing Values in Data

Tue, 15 Aug 2023 06:39:53 GMT

When working with structured (tabular) data, columns often contain null values.

(Example from 과거수질자료_낙동강.csv, the past water quality file for the Nakdong River in the public sewage treatment data)

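Among the variables to be used in the analysis (harmful cyanobacteria cell count / Microcystis / Anabaena / Oscillatoria / Aphanizomenon / water temperature / geosmin / 2MIB / pH / Microcystin-LR), the geosmin, 2MIB, and Microcystin-LR columns have many missing values, so they need to be handled.
In most earlier analyses I simply dropped the rows with missing values, but the Nakdong River dataset does not have enough rows for that, so I looked for other methods.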

Below are seven ways to handle missing values.

Data Imputation

1. Mean imputation

-> Replace the missing values of a variable with the mean of its non-missing values. Advantage: the mean of the variable does not change after imputation.
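A minimal sketch of mean imputation on made-up numbers follows. The `geosmin` list is hypothetical, with `None` standing in for a missing value. It also prints the variance before and after, since filling every gap with the same value keeps the mean but shrinks the spread.

```python
from statistics import mean, pvariance

# Hypothetical geosmin readings; None marks a missing value
geosmin = [2.0, None, 4.0, 6.0, None, 8.0]

observed = [v for v in geosmin if v is not None]
fill_value = mean(observed)

# Replace every missing entry with the mean of the observed entries
imputed = [fill_value if v is None else v for v in geosmin]

print(fill_value, mean(imputed))                 # the mean is unchanged
print(pvariance(observed), pvariance(imputed))   # the variance gets smaller
```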

2. Substitution

-> In place of the record with the missing value, take the value from a different record that was not included in the sample.

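3. Hot deck imputation

-> Among the records that have similar values on the other variables, randomly sample one and copy its value. This works well when the variable with missing values can only take a limited range of values. Because the donor value is drawn at random, it also adds some variability, which helps keep standard errors more accurate.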

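4. Cold deck imputation

-> Like hot deck imputation, pick one record from among those that have similar values on the other variables and use it to replace the missing value. The difference is that the donor is selected by a fixed rule rather than by random sampling.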
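The difference between methods 3 and 4 can be sketched as follows. Everything here is an assumption made for illustration: the `records` data, the use of water temperature as the only similarity variable, the 0.5 degree tolerance, and the "closest temperature" rule for the cold deck. Both helpers assume at least one similar complete row exists.

```python
import random

# Hypothetical records: (water_temp, geosmin); None marks a missing geosmin value
records = [(24.1, 3.0), (24.6, 5.0), (25.0, 4.0), (12.3, 1.0), (24.5, None)]

def find_donors(target_temp, rows, tolerance=0.5):
    """Complete rows whose water temperature is within `tolerance` of the target."""
    return [r for r in rows if r[1] is not None and abs(r[0] - target_temp) <= tolerance]

def hot_deck(target_temp, rows, rng):
    """Hot deck: copy the value of a randomly sampled similar row."""
    return rng.choice(find_donors(target_temp, rows))[1]

def cold_deck(target_temp, rows):
    """Cold deck: pick the donor by a fixed rule, here the closest temperature."""
    return min(find_donors(target_temp, rows), key=lambda r: abs(r[0] - target_temp))[1]

rng = random.Random(0)  # seeded so the random draw is reproducible
hot_value = hot_deck(24.5, records, rng)
cold_value = cold_deck(24.5, records)
print(hot_value, cold_value)
```

Repeated hot deck calls can return different donors, while the cold deck rule always returns the same one for the same data.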

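5. Regression imputation

-> Use the variables without missing values as features and the variable to be filled as the target, then fit a regression model to predict the missing values. Because the predictions are based on the other variables in the data, the relationships between variables are preserved. The downside is that the variability among the imputed values is not preserved.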

6. Stochastic regression imputation

-> Add a random residual to the regression prediction and use the sum as the imputed value. It keeps all the advantages of regression imputation, and the random component is a further advantage.
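Methods 5 and 6 can be compared side by side with a one-feature least squares fit. This is only a sketch under stated assumptions: the temperature and geosmin numbers are invented, a single feature is used, and the random residual is drawn from a normal distribution whose standard deviation is the sample standard deviation of the fitted residuals.

```python
import random
from statistics import mean, stdev

# Hypothetical complete rows: water temperature (feature) and geosmin (target)
temps = [10.0, 15.0, 20.0, 25.0, 30.0]
geosmin = [1.1, 1.9, 3.2, 3.8, 5.0]

# Ordinary least squares for one feature: slope and intercept in closed form
t_mean, g_mean = mean(temps), mean(geosmin)
slope = (sum((t - t_mean) * (g - g_mean) for t, g in zip(temps, geosmin))
         / sum((t - t_mean) ** 2 for t in temps))
intercept = g_mean - slope * t_mean

# Spread of the residuals on the complete rows
residuals = [g - (intercept + slope * t) for t, g in zip(temps, geosmin)]
residual_sd = stdev(residuals)

def regression_impute(temp):
    """Method 5: the imputed value is the regression prediction itself."""
    return intercept + slope * temp

def stochastic_regression_impute(temp, rng):
    """Method 6: the prediction plus a random residual drawn from N(0, residual_sd)."""
    return regression_impute(temp) + rng.gauss(0.0, residual_sd)

rng = random.Random(42)  # seeded for reproducibility
missing_temps = [22.0, 22.0, 22.0]
deterministic = [regression_impute(t) for t in missing_temps]
stochastic = [stochastic_regression_impute(t, rng) for t in missing_temps]
print(deterministic)
print(stochastic)
```

Three rows with the same temperature get exactly the same value under method 5, which is the lost variability mentioned above, while method 6 spreads them around the regression line.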

7. Interpolation and extrapolation

-> Estimate the missing value from other observations of the same subject.
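As a sketch of interpolation, the function below fills gaps between two known measurements of the same site with a straight line. The weekly temperature series and the helper name `interpolate_linear` are hypothetical. A gap at the very end has no right-hand neighbor, so filling it would be extrapolation and it is left as missing here.

```python
def interpolate_linear(values):
    """Fill interior None gaps by drawing a straight line between known neighbors."""
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled

# Hypothetical weekly water temperatures at one sampling site
weekly_temp = [18.0, None, None, 24.0, 25.0, None]
filled = interpolate_linear(weekly_temp)
print(filled)
```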

[Reference] https://daebaq27.tistory.com/43

[Data Science] Algal Bloom Occurrence Area Prediction Project

Fri, 11 Aug 2023 12:32:45 GMT

What is an algal bloom (녹조)?

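An algal bloom is a phenomenon in which cyanobacteria grow excessively in a river or lake and turn the water dark green.

Since excessive cyanobacteria growth is the main cause of algal blooms, predicting the occurrence of harmful cyanobacteria with machine learning and deep learning should make it possible to predict where algal blooms will occur.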

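✅ Planned models: KNN, SVM, ANN
✅ Data Set: past water quality data (과거수질자료) from the Water Environment Information System -> 낙동강, 한강, 금강, 영산강.csv (Nakdong, Han, Geum, and Yeongsan rivers)

📌Data Set columns: category / site name / sampling location / survey year / water temperature / pH / DO / transparency / turbidity / Chl-a / harmful cyanobacteria cell count / Microcystis / Anabaena / Oscillatoria / Aphanizomenon / geosmin / 2MIB / Microcystin-LR
📌Columns to be used: harmful cyanobacteria cell count / Microcystis / Anabaena / Oscillatoria / Aphanizomenon / water temperature / geosmin / 2MIB / pH / Microcystin-LR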

What is an ANN (Artificial Neural Network)?

An ANN, or artificial neural network, is a machine learning algorithm modeled on the principles and structure of the human nervous system.

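Structure of an ANN

An ANN is a neural network that includes hidden layers, and it operates in the following four steps.

Step 1: Multiply the data fed into the input layer by a weight matrix and pass it to the hidden layer
Step 2: Transform the data inside the hidden layer with an activation function
Step 3: Multiply the hidden layer's output by a new weight matrix and pass it to the output layer
Step 4: Apply the output activation function and return the result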
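The four steps can be traced in a few lines of plain Python. This is a sketch, not a trained model: the weight values, the layer sizes (3 inputs, 2 hidden units, 1 output), and the choice of ReLU for the hidden layer and sigmoid for the output are all assumptions, and bias terms are left out because the steps above do not mention them.

```python
import math

def matvec(weights, vector):
    """Multiply a weight matrix (list of rows) by an input vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]

def relu(vector):
    return [max(0.0, v) for v in vector]

def sigmoid(vector):
    return [1.0 / (1.0 + math.exp(-v)) for v in vector]

# Hypothetical weights: 3 inputs -> 2 hidden units -> 1 output
W_hidden = [[0.5, -0.2, 0.1],
            [-0.3, 0.8, 0.4]]
W_output = [[1.0, -1.5]]

def forward(inputs):
    hidden_in = matvec(W_hidden, inputs)       # Step 1: input x weight matrix
    hidden_out = relu(hidden_in)               # Step 2: hidden activation
    output_in = matvec(W_output, hidden_out)   # Step 3: hidden output x new weight matrix
    return sigmoid(output_in)                  # Step 4: output activation

prediction = forward([1.0, 2.0, 3.0])
print(prediction)
```

Training would adjust `W_hidden` and `W_output`; the forward pass itself stays exactly this sequence of multiply, activate, multiply, activate.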

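Activation functions

1. Step function: outputs 0 for inputs less than 0 and 1 for inputs greater than 0
2. Sigmoid function: used so that small changes in the input are also reflected in the output
3. ReLU function (Rectified Linear Unit): passes the input through unchanged when it is greater than 0 and outputs 0 when it is 0 or less
4. Softmax function: normalizes the inputs to values between 0 and 1 whose sum is always 1; used to compute a probability distribution over N classes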
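The four functions are short enough to write out directly. The sketch below uses only the standard library. Treating an input of exactly 0 as 0 in the step function is a convention chosen here, and subtracting the maximum inside softmax is a standard numerical-stability trick that does not change the result.

```python
import math

def step(x):
    """1 for positive inputs, otherwise 0 (the value at exactly 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

def sigmoid(x):
    """Smooth S-shaped curve with outputs strictly between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Pass positive inputs through unchanged and clip the rest to 0."""
    return max(0.0, x)

def softmax(scores):
    """Turn a list of scores into probabilities that sum to 1."""
    shifted = [s - max(scores) for s in scores]  # subtract the max for numerical stability
    exps = [math.exp(s) for s in shifted]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])
print(step(-0.3), step(0.3))
print(sigmoid(-0.3), sigmoid(0.3))
print(relu(-0.3), relu(0.3))
print(probs, sum(probs))
```

Comparing `step` and `sigmoid` on -0.3 and 0.3 shows the point of item 2: the step function jumps from 0 to 1, while the sigmoid moves only slightly around 0.5.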

[Reference]
Kim Sang-Hoon(2021), Prediction of cyanobacteria harmful algal blooms in reservoir using machine learning and deep learning
https://ebbnflow.tistory.com/119