1-june.log

Trying Out YOLOv5

Sun, 21 Aug 2022 08:18:19 GMT

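When we learn math, we build up concepts one by one from first principles, but when we use math for research or measurement, we rarely start from first principles; we reach for calculators and programs that package those concepts. AI seems to work the same way: we learn it from the fundamentals, but in real applications we mostly use models that someone else has already built. Models pretrained by teams with lots of computing power and data will usually perform better anyway. So many engineers who work with AI seem to end up building a bridge between an existing model and data that lives in the wild, rather than designing the skeleton of a model. Useful models such as object detection even come packaged in an easy-to-use form, so using them well can save a lot of time. In that spirit, this post walks through an example in which I took Ultralytics' YOLOv5, an easy-to-use packaging of the YOLOv5 object detection model, and trained it further on additional data.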

YOLOv5 is trained on the COCO dataset, so it can already detect a variety of objects such as car, person, and bicycle. In this post I train the YOLOv5 model further (transfer learning) so that it can detect new objects. YOLOv5 can detect a 'car' as a whole, but it is not trained to detect individual parts of a car such as wheels, doors, or side mirrors. The goal is to make it detect those individual parts separately.

Result:


^{Modified from image by Tama66 under Pixabay License. Mercedes Daimler Automobile - Free photo on Pixabay}

^{Modification of videos by fernanfazio and Vimeo-Free-Videos respectively under Pixabay license.
Road Car Tire - Free video on Pixabay
Mercedes Glk Car Test - Free video on Pixabay}

Getting the Ultralytics YOLOv5 Model

It is really easy. Cloning from Ultralytics' GitHub gets you the YOLOv5 model together with the code for further training and the code for running detection.

!git clone https://github.com/ultralytics/yolov5
%cd yolov5/
!pip install -r requirements.txt
!pip install -U roboflow
!wget https://github.com/ultralytics/yolov5/releases/download/v6.0/yolov5s.pt

Each command is explained in more detail in Ultralytics' Quick Start - YOLOv5 Documentation (ultralytics.com).

Data for Additional Training

I was able to find data in which car parts such as lights, wheels, and doors are labelled at the following link:

However, to train YOLOv5, each label and bounding box has to be expressed as the box's centre point (x, y) plus the box's width and height, all divided by the image dimensions (see Train Custom Data 📌 - YOLOv5 Documentation (ultralytics.com)), so the data has to be converted to the YOLOv5 format. I assigned Light to 0, Wheel to 1, Glass to 2, Door to 3, and SideGlass to 4, and converted bhadreshpsavani's data, which is in json format, into YOLOv5-format txt files as follows:

import re
import json
import os
import csv

training_images_folder = os.path.join(os.getcwd(), "CarPartsDetectionChallenge", "Data", "Source_Images", "Training_Images")
jpegs = []
jsons = []
for f in os.listdir(training_images_folder):
    if f[-4:] == ".jpg":
        jpegs.append(f)
    if f[-5:] == ".json":
        jsons.append(f)

def save_as_yolo_format(destination_folder, json_data):
    img_width = json_data['asset']['size']['width']
    img_height = json_data['asset']['size']['height']

    yolov5_format_list = []
    class_ids = {'Light': 0, 'Wheel': 1, 'Glass': 2, 'Door': 3, 'SideGlass': 4}
    for i in json_data['regions']:
        # Skip regions whose tag is not one of the five classes
        if i['tags'][0] not in class_ids:
            continue
        region_num = class_ids[i['tags'][0]]
        xcentre = (i['boundingBox']['left'] + i['boundingBox']['width']/2)/img_width
        ycentre = (i['boundingBox']['top'] + i['boundingBox']['height']/2)/img_height
        bbox_width = i['boundingBox']['width']/img_width
        bbox_height = i['boundingBox']['height']/img_height
        yolov5_format = [region_num, xcentre, ycentre, bbox_width, bbox_height]
        yolov5_format_list.append(yolov5_format)

    # Same file name as the image, with the extension replaced by 'txt'
    label_name = re.sub(r'[^.]+$', 'txt', json_data['asset']['name'])
    with open(os.path.join(destination_folder, label_name), 'w', newline='') as file:
        write = csv.writer(file, delimiter=' ')
        write.writerows(yolov5_format_list)

training_images_folder = os.path.join(os.getcwd(), "CarPartsDetectionChallenge", "Data", "Source_Images", "Training_Images")
destination_folder = os.path.join(os.getcwd(), "CarParts_yolov5labels")
os.makedirs(destination_folder, exist_ok=True)  # create the output folder if it does not exist yet

for j in jsons:
    with open(os.path.join(training_images_folder, j)) as json_file:
        json_data = json.load(json_file)
    save_as_yolo_format(destination_folder, json_data)
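
As a quick sanity check on the conversion above, here is a minimal sketch with made-up numbers. The helper name `to_yolo_box` and the box values are hypothetical and do not come from the dataset. It converts one pixel-space box (left, top, width, height) into the normalized centre/width/height values that a YOLOv5 label line holds:

```python
def to_yolo_box(left, top, width, height, img_width, img_height):
    """Convert a pixel-space box to normalized (x_centre, y_centre, w, h)."""
    x_centre = (left + width / 2) / img_width
    y_centre = (top + height / 2) / img_height
    return x_centre, y_centre, width / img_width, height / img_height

# Hypothetical 200x100 box whose top-left corner is at (100, 50) in a 400x200 image
box = to_yolo_box(left=100, top=50, width=200, height=100, img_width=400, img_height=200)

# One label line: class id followed by the four normalized values (1 = Wheel in the mapping above)
label_line = " ".join(str(v) for v in (1, *box))
print(label_line)  # 1 0.5 0.5 0.5 0.5
```

Every value after the class id should fall between 0 and 1; a value outside that range usually means the box was divided by the wrong image dimension.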

You can check that the bounding boxes are drawn correctly as follows:

import json
import csv
import os
from PIL import Image, ImageDraw
import cv2
import matplotlib.pyplot as plt
%matplotlib inline

training_images_folder = os.path.join(os.getcwd(), "CarPartsDetectionChallenge", "Data", "Source_Images", "Training_Images")

img_path = os.path.join(os.getcwd(), "CarPartsDetectionChallenge", "Data", "Source_Images", "Training_Images", "example.jpg")
img = cv2.imread(img_path)
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

draw_img = Image.fromarray(img)
draw = ImageDraw.Draw(draw_img)

img_width = img.shape[1]
img_height = img.shape[0]
print(img_width, img_height)

with open(os.path.join(os.getcwd(), 'CarParts_yolov5labels', 'example.txt')) as f:
    csv_reader = csv.reader(f, delimiter=' ')
    for row in csv_reader:
        xcentre = float(row[1]) * img_width
        ycentre = float(row[2]) * img_height
        bbox_width = float(row[3]) * img_width
        bbox_height = float(row[4]) * img_height
        x0 = xcentre - bbox_width/2
        x1 = xcentre + bbox_width/2
        y0 = ycentre - bbox_height/2
        y1 = ycentre + bbox_height/2

        draw.rectangle([x0, y0, x1, y1], outline='white', width=2)

plt.figure(figsize=(20, 10))
plt.imshow(draw_img)

Output:

^{Modified from Image by PIRO4D under Pixabay License. Car Vehicle Wheels - Free image on Pixabay.}

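Split the images and the txt files holding each image's labels into a training set and a validation set, put them in separate folders, and create a yaml file that points to the training and validation folders as follows. With that, everything is ready for training:

Training and Using YOLOv5

Now just use the train.py that came with the clone of the Ultralytics YOLOv5 GitHub repo. The train.py file is well documented if you open it, and its flags let you set options such as batch size and number of epochs. After the --data flag, give the location of the yaml file that holds the locations of the data folders used for training. After the --weights flag, choose the model to train (the yolov5 folder of the cloned repo contains models such as yolov5s; pick one of them. For simple detection problems, even the small model yolov5s seems to perform well):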
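
A minimal sketch of such a yaml file, assuming the layout described in Ultralytics' Train Custom Data guide (a `train` path, a `val` path, the number of classes `nc`, and the class `names`). The folder paths and the file name `carparts.yaml` are hypothetical placeholders, and the class order follows the 0 to 4 mapping chosen earlier:

```python
from pathlib import Path

# Hypothetical folder locations; replace with wherever your split images live
train_dir = "/content/carparts/images/train"
val_dir = "/content/carparts/images/val"

# List position = class id used in the label txt files
class_names = ["Light", "Wheel", "Glass", "Door", "SideGlass"]

yaml_text = (
    f"train: {train_dir}\n"
    f"val: {val_dir}\n"
    f"nc: {len(class_names)}\n"
    f"names: {class_names}\n"
)
Path("carparts.yaml").write_text(yaml_text)
print(yaml_text)
```

As far as I know, YOLOv5 finds the label files by swapping `images` for `labels` in each image path, so the txt files would sit in a parallel `labels/train` and `labels/val` layout. Treat that as an assumption to check against the guide.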

!python train.py --img 640 --batch 8 --epochs 150 --data /content/drive/MyDrive --weights /content/yolov5/yolov5s.pt --nosave --noval --cache

The trained model is saved automatically inside the yolov5 folder and can be loaded for use. To use the saved model, run detect.py. Use the --source flag to give the location of the images or videos to run on, and the --weights flag to give the location of the model to use:

!python detect.py --source /content/drive/MyDrive/ --weights /content/yolov5/runs/train/exp11/weights/best.pt

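detect.py works on both image files (jpg, png, tiff, etc.) and video files (mp4, etc.), so you can run the command above regardless of file format. The results are saved automatically in the runs folder inside the yolov5 folder, where you can check them :D

Wrap-up

Even though I did the additional training with only a few hundred images, the results seem to come out really well. My guess is that the YOLO model already has some understanding of the object 'car'. Today I used labels that already existed, but labelling a few hundred photos goes quickly with tools like labelimg, so if you need a custom object detection model you could probably build and use one within a day or two. I am very grateful that, thanks to many people's efforts, AI has become this usable and is released as open source for anyone to use, and I think that speeds up the cycle of progress even more.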

^{Modification of video by Vimeo-Free-Videos under Pixabay license. Mercedes Glk Car Test - Free video on Pixabay}

A Simple Residual Block Example

Sat, 11 Jun 2022 09:45:38 GMT

What Is a Residual Block?

When I started learning about neural networks, I thought the fact that information passes through each layer of the network in sequence was different from an animal brain.

("Example of a deep neural network" by BrunelloN, https://commons.wikimedia.org/wiki/File:Example_of_a_deep_neural_network.png. Licensed under CC BY-SA 4.0 International)

In an animal brain, neurons are connected in a more haphazard(?) way, so information does not pass through layers of neurons in order but spreads out in all directions.

("Neurons" by Leterrier, NeuroCyto Lab, INP, Marseille, France. Obtained from NIH Image Gallery | Flickr. Licensed under CC BY-NC 2.0)

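That is why I felt that neural networks built from "residual blocks", which have skip connections linking layers that are far apart, are a bit closer to the structure of a biological brain. The concept of a residual block is simple: the output of a layer is fed not only to the very next layer but also to a layer further along. This connection is called a skip connection, and the layers spanned by a skip connection are called a residual block.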

(Image by Author)

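From the point of view of the residual block enclosed by the dotted line, given an incoming input x, the block's final output is f(x), the result of passing x through the layers inside the block, with the input x added back as is.

Building a Residual Block in Pytorch

The code that implements this is also simple (pytorch):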
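
In formula form the block computes y = f(x) + x. The tiny NumPy sketch below is not the author's code; `f` is an arbitrary stand-in for the layers inside the block. It illustrates two consequences of that formula: f(x) must have the same shape as x for the addition to work, and if the layers output zeros the block simply passes x through.

```python
import numpy as np

def residual_block(x, f):
    """Return f(x) + x; f must preserve the shape of x for the addition to work."""
    fx = f(x)
    assert fx.shape == x.shape, "f(x) and x must have the same shape"
    return fx + x

x = np.array([1.0, -2.0, 3.0])

# Stand-in for the block's layers: a fixed linear map followed by ReLU
W = np.eye(3) * 0.5
f = lambda v: np.maximum(W @ v, 0.0)
out = residual_block(x, f)          # values 1.5, -2.0, 4.5
print(out)

# If the layers output zeros, the block reduces to the identity
identity_out = residual_block(x, np.zeros_like)
print(identity_out)                 # same values as x
```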

from torch import nn

class ResBlock(nn.Module):
    def __init__(self, block):
        super().__init__()
        self.block = block
    def forward(self, x):
        return self.block(x) + x #f(x) + x

Six lines is all it takes.

Now let's put this residual block into a CNN.

First, build an ordinary CNN (I used six 3x3 convolution layers):

class Conv6(nn.Module):
  def __init__(self, n_class=10):
    super().__init__()
    self.name = 'conv6'
    self.model = nn.Sequential(
        nn.Conv2d(3, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Conv2d(32, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Conv2d(32, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Conv2d(32, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Conv2d(32, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Conv2d(32, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        nn.Flatten(),
        nn.Linear(32*32*32, 256),
        nn.ReLU(),
        nn.Linear(256, n_class)

    )
  def forward(self, x):
    return self.model(x)

Now let's wrap some of these layers in residual blocks:

class Conv6Res(nn.Module):
  def __init__(self, n_class=10):
    super().__init__()
    self.name = 'conv6res'
    self.model = nn.Sequential(
        nn.Conv2d(3, 32, 3, 1, 1),
        nn.BatchNorm2d(32),
        nn.ReLU(),

        ResBlock(
            nn.Sequential(
                nn.Conv2d(32, 32, 3, 1, 1),
                nn.BatchNorm2d(32),
                nn.ReLU(),

                nn.Conv2d(32, 32, 3, 1, 1),
                nn.BatchNorm2d(32),
                nn.ReLU()
            )
        ),

        ResBlock(
            nn.Sequential(
                nn.Conv2d(32, 32, 3, 1, 1),
                nn.BatchNorm2d(32),
                nn.ReLU(),

                nn.Conv2d(32, 32, 3, 1, 1),
                nn.BatchNorm2d(32),
                nn.ReLU(),

                nn.Conv2d(32, 32, 3, 1, 1),
                nn.BatchNorm2d(32),
                nn.ReLU(),
            )
        ),
        nn.Flatten(),
        nn.Linear(32*32*32, 256),
        nn.ReLU(),
        nn.Linear(256, n_class)
    )
  def forward(self, x):
    return self.model(x)

And that completes a network with residual blocks.

Let's train Conv6, the plain CNN without residual blocks, and Conv6Res, the CNN with residual blocks, on the CIFAR10 dataset and compare them. First, Conv6():

import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import torch
from torch import nn
from torch.utils.data import Dataset, DataLoader
from torchvision import datasets, transforms

# Downloading the CIFAR10 dataset
transform = transforms.Compose(
                               [transforms.ToTensor(),
                               transforms.Normalize((0.5,0.5,0.5),(0.5,0.5,0.5))]
)
root = os.getcwd()  # download location for the dataset
cifar_tr = datasets.CIFAR10(root=root, train=True, download=True, transform=transform)
cifar_test = datasets.CIFAR10(root=root, train=False, download=True, transform=transform)

# Split training data into train set and validation set
def split_train_valid(dataset, valid_ratio=0.1):
  data_size = len(dataset)
  indices = list(range(data_size))
  np.random.seed(1)
  np.random.shuffle(indices)

  split_point = int(np.floor(valid_ratio*data_size))
  val_index, train_index = indices[:split_point], indices[split_point:]

  train = torch.utils.data.Subset(dataset, train_index)
  valid = torch.utils.data.Subset(dataset, val_index)

  return train, valid

cifar_train, cifar_valid = split_train_valid(dataset=cifar_tr)

# Make DataLoaders for train/validation/test sets
cifar_loaders = [DataLoader(dataset=d, batch_size=128, shuffle=True, drop_last=True) for d in [cifar_train, cifar_valid, cifar_test]]

# Define model to train
model = Conv6()

# Use GPU if available (CPU if not) and move the model there
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Define loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
print("===== Train Start =====")
num_epochs = 40
history = {"train_loss": [], "train_acc": [], "valid_loss":[], "valid_acc":[]} # record of loss and accuracy in each epoch for plotting later
for epoch in range(num_epochs):
    train_loss, train_acc = 0, 0
    model.train()
    for (x, y) in cifar_loaders[0]: # cifar_loaders[0] is train set DataLoader
        x = x.to(device)
        y = y.to(device)

        y_hat = model(x)
        loss = criterion(y_hat, y)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        train_loss += loss.to("cpu").item()
        train_acc += (y_hat.argmax(1)==y).type(torch.float).to('cpu').mean().item()

    train_loss /= len(cifar_loaders[0]) # len(DataLoader) is the number of batches, not the batch size
    train_acc /= len(cifar_loaders[0])
    history["train_loss"].append(train_loss)
    history["train_acc"].append(train_acc)

    # Evaluate model on validation set
    valid_loss, valid_acc = 0, 0
    model.eval()
    with torch.no_grad():
        for (x, y) in cifar_loaders[1]: # Validation set DataLoader
            x = x.to(device)
            y = y.to(device)

            y_hat = model(x)
            loss = criterion(y_hat, y)

            valid_loss += loss.to('cpu').item()
            valid_acc += (y_hat.argmax(1)==y).type(torch.float).to('cpu').mean().item()

    valid_loss /= len(cifar_loaders[1]) # average over the number of batches
    valid_acc /= len(cifar_loaders[1])
    history["valid_loss"].append(valid_loss)
    history["valid_acc"].append(valid_acc)

    if epoch % 10 == 0:
        print(f"Epoch: {epoch}, train loss: {train_loss:>6f}, train acc: {train_acc:>3f}, valid loss: {valid_loss:>6f}, valid acc: {valid_acc:>3f}")

# Test the model on the test set
print("===== Test Start =====")
test_loss, test_acc = 0, 0
model.eval()
with torch.no_grad():
    for (x, y) in cifar_loaders[2]:
        x = x.to(device)
        y = y.to(device)

        y_hat = model(x)
        loss = criterion(y_hat, y)

        test_loss += loss.to('cpu').item()
        test_acc += (y_hat.argmax(1)==y).type(torch.float).to('cpu').mean().item()

test_loss /= len(cifar_loaders[2])
test_acc /= len(cifar_loaders[2])
print(f"Test loss: {test_loss:>6f}, Test acc: {test_acc:>6f}")

Training the model with residual blocks (i.e., model = Conv6Res()) in the same way gives the following results:

The Effect of Residual Blocks

The two models above, Conv6() and Conv6Res(), show little difference in accuracy. But the main role of residual blocks is to prevent 'degradation', where performance actually drops as a neural network gets more and more layers, so let's look at their effect in deeper networks with more layers.

I built models with the same structure as Conv6 and Conv6Res above but with 6, 9, 12, 15, 18, and 21 layers, and trained them in the same way:

(Image by Author)

For the plain CNNs without residual blocks, train accuracy generally drops as the number of layers grows (as the line colour gets darker in the graph). This is not caused by overfitting (it is accuracy on the training data, after all); it seems to be the degradation phenomenon, where accuracy falls as a network gets more layers. The CNNs with residual blocks, on the other hand, seem to suffer less from it.

Accuracy on the validation set shows the same pattern (the plain CNNs perform worse as layers are added, while the residual CNNs do not):

(Image by Author)

Finally, let's look at test set accuracy by number of layers:

(Image by Author)

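For the plain CNNs, accuracy on the test set gradually falls once the number of layers passes a certain point, while the CNNs using residual blocks show accuracy going up. It is a bit surprising that degradation shows up with only about 10 layers (this example uses no pooling layers at all; I suspect architectures with pooling layers would need more layers before degradation appears).

So residual blocks are close to essential when building neural networks with many layers. There are also many applications that lean heavily on this idea to improve performance (ResNeXt, DenseNet, etc.). The structure of a residual block can itself be varied through the order and combination of the batch normalization, activation, and weight layers inside the block, and this too can be used to improve performance (a medium post that neatly summarizes applications of residual networks and the related papers: An Overview of ResNet and its Variants | by Vincent Feng | Towards Data Science).

Conclusion

Neural networks with piles of stacked layers are performing well in many fields. One of the ideas that makes training such networks possible is the residual block, so I built the simplest residual block myself and looked at the effect of residual blocks as the number of layers changes. It is fascinating that the structure of artificial neural networks seems to be getting closer and closer to that of a biological brain :D

A CNN That Catches Malaria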

Thu, 02 Jun 2022 12:45:20 GMT

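I learned about convolutional neural networks (CNNs). I think I see why so many people want to work in computer vision: it is fascinating and fun..!

Because CNNs give computers the ability to process visual information, the ways to use them are endless, but nearly everyone learning CNNs for the first time practises by implementing a neural network that can recognize handwritten digits (using the MNIST dataset).

I searched the internet for a similar dataset that a beginner could practise on, and found a suitable one. It is the malaria dataset from the US National Institutes of Health (the version uploaded to Kaggle, the second link, is easier to use): Malaria Datasheet (nih.gov) Malaria Cell Images Dataset | Kaggle

Malaria Primer

A brief introduction to malaria

It is a disease in which a parasite called Plasmodium enters human red blood cells and lives there (it is mainly spread by mosquitoes)
About 500 million people worldwide reportedly catch it every year (staggering, right..? There is even a view that half of all humans ever born may have been killed by malaria [1]. Thanks to advances in modern medicine it can be treated fairly easily with oral medication in developed countries, but in developing countries it is reportedly still among the top 10 causes of death [2])
Korea has malaria too (especially in the northern parts of Gyeonggi and Gangwon)
Because malaria is a disease in which a parasite lives in red blood cells, diagnosis is simple: look at the blood under a microscope and see whether or not there are parasites in the red blood cells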

(Source: CDC, Pfalciparum_benchaidV2.pub (cdc.gov))

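The picture above shows red blood cells under a microscope, and the ring-shaped things inside some of them are the Plasmodium parasites (they can look different depending on the growth stage, but as far as I know this ring form is the most commonly observed). When red blood cells with these parasites are observed, malaria can be diagnosed.

NIH's malaria dataset is an image dataset of normal red blood cells and red blood cells containing malaria parasites:

(Visualized from NIH's Malaria Data (Thin Smears - Falciparum and uninfected patients) Malaria Datasheet (nih.gov))

Can you tell which red blood cells are infected with malaria and which are normal?

They are actually not that hard to tell apart. The ones marked '1' above the image are normal red blood cells, and the ones marked '0' are malaria-infected red blood cells.

Building a CNN That Catches Malaria

Each cell has already been cropped into its own image, so the data is very clean and you can go straight to implementing a CNN that distinguishes red blood cells with malaria from normal ones. I used Google Colab.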

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import os
import cv2
from PIL import Image

from torch.utils.data import Dataset, DataLoader
import torch
import torch.nn as nn
from torchvision import transforms, datasets, models

A tip for working with image data: folders of image data are large, so you usually get them as a zip file. In Colab you can run the following command to use the zip file easily:

!unzip drive/MyDrive/...zip-file-location.../...zip-file-name.zip

Once you have the image folder created this way, you can build a Dataset with torchvision's datasets.ImageFolder. Here I resize all images to 120 x 120 pixels (if the images are not all the same size, an error occurs later when iterating over them with a DataLoader to train the model) and read them in as pytorch tensors.

# Define data transformations to be performed when reading in the images
data_transforms = transforms.Compose([transforms.Resize((120, 120)),
                                      transforms.ToTensor()])

# Location of images
img_dir = '/content/cell_images/cell_images'
malariadata = datasets.ImageFolder(img_dir, transform=data_transforms)

datasets.ImageFolder treats the folder that each image is in as that image's label (i.e., class). Looking at the labels of the data read in through datasets.ImageFolder,

print(malariadata.class_to_idx)

you can see that Parasitized (images of malaria-infected red blood cells) forms one label and Uninfected (images of normal red blood cells) forms the other.

Next, let's split the images into a training set and a test set (there are many ways to do a train test split, so you do not have to do it this way)
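
Which folder gets which number follows from the folder names: ImageFolder sorts the class folders alphabetically and numbers them from 0. The plain-Python sketch below only mimics that rule for the two folder names in this dataset; it is not torchvision's implementation.

```python
# Folder names found under the image root (their order on disk is arbitrary)
folder_names = ["Uninfected", "Parasitized"]

# Sort the class folders by name and number them from 0
classes = sorted(folder_names)
class_to_idx = {name: idx for idx, name in enumerate(classes)}
print(class_to_idx)  # {'Parasitized': 0, 'Uninfected': 1}
```

Note that 0 therefore means infected and 1 means normal, which is the opposite of the usual "1 = positive" convention.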

from torch.utils.data.sampler import SubsetRandomSampler

test_size = 0.2
data_length = len(malariadata)
indices = list(range(data_length))
np.random.shuffle(indices)

test_split = int(np.floor(test_size*data_length))
test_index, train_index = indices[:test_split], indices[test_split:]

train_sampler = SubsetRandomSampler(train_index)
test_sampler = SubsetRandomSampler(test_index)

train_loader = DataLoader(malariadata, sampler=train_sampler, batch_size=32)
test_loader = DataLoader(malariadata, sampler=test_sampler, batch_size=32)

(I am not sure how best to choose the batch size. I set it to 32, following the paper I introduce shortly.)

Let's look at a few images through train_loader:

img_tensors, labels = next(iter(train_loader))

def showimg(img_tensor):
    #use matplotlib to display an image that is in tensor form
    npimg = img_tensor.numpy()
    plt.imshow(np.transpose(npimg, (1,2,0)))

fig = plt.figure(figsize=(20,15))
for i in range(20):
    ax = fig.add_subplot(4, 5, i+1, title=labels[i].item())
    showimg(img_tensors[i])
plt.show()

(Visualized from NIH's Malaria Data (Thin Smears - Falciparum and uninfected patients) Malaria Datasheet (nih.gov))

Now let's define the CNN to use. I tried to follow the CNN architecture in Umer et al's paper "A Novel Stacked CNN for Malarial Parasite Detection in Thin Blood Smear Images" (IEEE Xplore Full-Text PDF) [3]. However, the paper does not release its code and does not specify some details such as kernel padding and stride, so I filled those in as I saw fit. Figure 4 of the paper shows the CNN architecture at a glance (aside: just a week or so ago I would have looked at a figure like this and thought it looked cool but had no idea what it meant, so I am very happy that I understand it now).

(Source: IEEE Access Vol 8 2020. Umer et al [3], Figure 4. CC-BY-4.0)

I recreated it in Pytorch as follows (in the fully connected layers at the end the authors use the sigmoid function as the activation function, but I just used the familiar ReLU):

# Recreation of CNN described by
# https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9093853

class MalariaNet(nn.Module):
  def __init__(self):
    super().__init__()

    self.layer1 = nn.Sequential(
        nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),
        nn.Dropout2d(0.2),
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=1, padding=2),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2)
    )

    self.layer2 = nn.Sequential(
        nn.Conv2d(32, 64, kernel_size=3, stride=1, padding=1),
        nn.Dropout2d(0.2),
        nn.ReLU(),
        nn.Conv2d(64, 128, kernel_size=3, stride=1, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2)
    )

    self.layer3 = nn.Sequential(
        nn.Conv2d(128, 256, kernel_size=2, stride=1, padding=1),
        nn.ReLU(),
        nn.AvgPool2d(kernel_size=3, stride=3)
    )

    self.fc = nn.Sequential(
        nn.Linear(256*10*10, 512),
        nn.ReLU(),
        nn.Linear(512, 256),
        nn.ReLU(),
        nn.Linear(256, 128),
        nn.ReLU(),
        nn.Linear(128, 2)
    )

    self.fla = nn.Flatten()
    self.drop = nn.Dropout2d(0.2)

  def forward(self, x):
    out = self.layer1(x)
    out = self.drop(out)
    out = self.layer2(out)
    out = self.drop(out)
    out = self.layer3(out)
    out = self.drop(out)
    out = self.fla(out)
    out = self.fc(out)
    return out
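
The `256*10*10` passed to the first Linear layer is the flattened size of the output of layer3. A small pure-Python sketch of that arithmetic is shown here, using the standard output-size formula for convolution and pooling layers (no dilation, floor rounding). The helper name `out_size` is made up for this example.

```python
def out_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a conv or pooling layer (floor division, no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

s = 120                                   # input images are resized to 120 x 120
s = out_size(s, kernel=3, padding=1)      # layer1 conv 3x3 -> 120
s = out_size(s, kernel=4, padding=2)      # layer1 conv 4x4 -> 121
s = out_size(s, kernel=2, stride=2)       # layer1 max pool -> 60
s = out_size(s, kernel=3, padding=1)      # layer2 conv 3x3 -> 60
s = out_size(s, kernel=3, padding=1)      # layer2 conv 3x3 -> 60
s = out_size(s, kernel=2, stride=2)       # layer2 max pool -> 30
s = out_size(s, kernel=2, padding=1)      # layer3 conv 2x2 -> 31
s = out_size(s, kernel=3, stride=3)       # layer3 avg pool -> 10
flat_features = 256 * s * s               # 256 channels come out of layer3
print(s, flat_features)                   # 10 25600
```

If the input resolution or any kernel, stride, or padding value changes, the first Linear layer's input size has to be recomputed the same way.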

Now let's create a model from the MalariaNet class and train it:

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = MalariaNet()
model.to(device)
error = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Train the model
num_epochs = 30

for epoch in range(num_epochs):
    train_loss, train_acc = 0, 0
    model.train() # state that model training is beginning
    for (images, labels) in train_loader:
        images, labels = images.to(device), labels.to(device)
        predictions = model(images)

        optimizer.zero_grad()
        loss = error(predictions, labels)
        loss.backward()
        optimizer.step()

        train_loss += loss.to('cpu').item()
        train_acc += (predictions.argmax(1)==labels).type(torch.float).to('cpu').mean().item()

    train_loss /= len(train_loader) # len(train_loader) is the number of batches, not the batch size
    train_acc /= len(train_loader)
    print(f"Epoch: {epoch}, loss: {train_loss:>6f}, acc: {train_acc:>6f}")

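(After training several times, I found that sometimes the accuracy shoots up within a few epochs, as above, while other times it stays flat for so long that I stopped the training. I am not sure what causes the difference..)

Evaluating the model on the test set: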

# Test the model
test_loss, test_acc = 0, 0
model.eval() # switch to evaluation mode (dropout off); torch.no_grad() below turns off gradient tracking
with torch.no_grad():
  for (images, labels) in test_loader:
    images, labels = images.to(device), labels.to(device)
    predictions = model(images)
    loss = error(predictions, labels)

    test_loss += loss.to('cpu').item()
    test_acc += (predictions.argmax(1) == labels).type(torch.float).to('cpu').mean().item()

test_loss /= len(test_loader) # len(test_loader) is the number of batches, not the batch size
test_acc /= len(test_loader)

print(f"Test loss: {test_loss:>6f},Test acc: {test_acc:>6f}")

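It can pick out malaria in red blood cells with 95% accuracy! (The model in Umer et al's paper applies preprocessing to the data and reaches 99.98% accuracy..!)

I wanted to see the model make a prediction myself, so I picked an arbitrary image from the data and applied the model to it: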

img_path = 'cell_images/cell_images/Uninfected/C98P59ThinF_IMG_20150917_154235_cell_128.png'
# Load with PIL so the channels are RGB, as in ImageFolder (cv2.imread would return BGR)
img_original = Image.open(img_path).convert('RGB')
img_resized = img_original.resize((120,120))
img_resized

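From the picture it looks like a normal red blood cell. Let's see what the model predicts

# ToTensor scales pixel values to [0, 1], matching the transform used for training
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img_resized).unsqueeze(0).to(device)  # add a batch dimension and move to the model's device

print(model(img_tensor).argmax(1))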

The output of malariadata.class_to_idx above was {'Parasitized':0, 'Uninfected':1}, so for this image the model (correctly) judged it to be a normal red blood cell.

For fun, here is one more image

img_path = 'cell_images/cell_images/Parasitized/C101P62ThinF_IMG_20150918_151335_cell_65.png'
# Load with PIL so the channels are RGB, as in ImageFolder (cv2.imread would return BGR)
img_original = Image.open(img_path).convert('RGB')
img_resized = img_original.resize((120,120))
img_resized

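This red blood cell looks like it is infected with malaria.

Let's see the model's prediction:

# ToTensor scales pixel values to [0, 1], matching the transform used for training
transform = transforms.Compose([transforms.ToTensor()])
img_tensor = transform(img_resized).unsqueeze(0).to(device)  # add a batch dimension and move to the model's device

print(model(img_tensor).argmax(1))

It judged it (correctly) as Parasitized (well done...).

That was an introduction to the malaria dataset and a CNN model built on it. It seems like a great resource for people starting out in computer vision who find studying with only the MNIST dataset boring!

References

[1] Portrait of a serial killer | Nature
[2] The top 10 causes of death (who.int)
[3] Umer et al., "A Novel Stacked CNN for Malarial Parasite Detection in Thin Blood Smear Images", IEEE Access Vol 8, 2020 (IEEE Xplore Full-Text PDF)

Poking Around Pytorch: Linear Regression with Pytorch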

Pytorch 건드려보기: Pytorch로 하는 linear regression

Wed, 25 May 2022 11:04:00 GMT

Coefficients of a Linear Regression model changing over 3000 epochs

I learned how to use Pytorch for the first time. It is a powerful library that lets you implement complex neural networks, but to get familiar with its basic building blocks first, the goal of this post is to imitate linear regression, the king of simplicity, in pytorch (even mnist felt like too much, so I wanted to try something easier).

First, a refresher on linear regression. Written as a formula, linear regression looks like this: $$y = \alpha + \beta x$$

The goal of linear regression is to find the $\alpha$ and $\beta$ that best predict $y$ from the given $x$. When there are several $x$ variables it is also called multiple regression, and as a formula it looks like this: $$y = \alpha + \beta_1 x_1 + \cdots + \beta_k x_k$$

I will build a linear regression model twice: once with the familiar sklearn, then with the unfamiliar pytorch. For sklearn I will use LinearRegression, which finds the model coefficients directly by minimizing the sum of squared errors. For pytorch I will use a single Linear layer with as many input nodes as there are independent variables, and train the weight of each node (i.e., the model's coefficients) with Adam.

First, let me introduce the dataset. It lists demographic indicators (age, median income, average household size, race, etc.) and cancer mortality for each US county:

OLS Regression Challenge - dataset by nrippner | data.world

Our goal is to build a linear regression model that uses the other variables to predict TARGET_deathRate (cancer deaths per 100,000 people).

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv("cancer_reg.csv", encoding="ISO-8859-1")

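I only did rough data cleaning and preprocessing (I dropped columns with missing values, columns showing multicollinearity, and columns not needed for the analysis, and I removed rows with a median age of 100 or more because those are clearly data errors):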

df = pd.read_csv('cancer_reg.csv', encoding='ISO-8859-1')

# Drop columns that are not needed (axis=1 drops columns rather than rows)
df_main = df.drop(['avgDeathsPerYear', 'PctSomeCol18_24', 'PctEmployed16_Over', 'PctPrivateCoverageAlone', 'binnedInc', 'Geography', 'MedianAgeMale','MedianAgeFemale', 'PctPublicCoverage', 'PctPrivateCoverage', 'PctMarriedHouseholds', 'popEst2015', 'povertyPercent', 'PctWhite', 'PctBachDeg25_Over'], axis=1)

# Remove rows where the county's median age is 100 or more
df_main = df_main.loc[df_main["MedianAge"]<100]

First, let's build a linear regression model with the sklearn library. Sklearn's LinearRegression builds the model by minimizing the sum of squared errors.

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

X = df_main.drop(['TARGET_deathRate'], axis=1)
y = df_main['TARGET_deathRate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=1)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

y_train = np.asarray(y_train)
y_test = np.asarray(y_test)
y_train = scaler.fit_transform(y_train.reshape(-1, 1))
y_test = scaler.transform(y_test.reshape(-1, 1))

regr = LinearRegression()
regr.fit(X_train, y_train)
print("Training error: ", mean_squared_error(y_train, regr.predict(X_train)))
print("Test error: ", mean_squared_error(y_test, regr.predict(X_test)))
print(regr.coef_)
print(regr.intercept_)

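This gives us the coefficient for each variable and the y-intercept.

Looking at the coefficients in a barplot, incidenceRate (the cancer incidence rate) has the largest effect on cancer mortality (an obvious result, but it is reassuring that the model makes sense). Next come medianIncome and PctPublicCoverageAlone: median income (medianIncome) is negatively associated with cancer mortality, and the share of people with only public insurance (that is, people without private insurance who get health coverage only through medicare or medicaid) is positively associated with it. It is a little bitter that money is this closely tied to cancer mortality (all the more so because this is US data), but it also seems to show that, for common cancers, we live in a time when money can buy life-extending treatment (there are still many diseases for which no amount of money can buy such treatment, for example pancreatic cancer, which was Steve Jobs' cause of death).

Now let's see whether we can build the same model in pytorch.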
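
For intuition about what "minimizing the sum of squared errors" returns, here is a minimal NumPy sketch on synthetic data (the data, the coefficient values, and the variable names are all made up for illustration; this is not the cancer dataset and not sklearn's internal code). It solves the same least-squares problem directly with `np.linalg.lstsq`:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # synthetic features
true_coef = np.array([0.5, -0.3, 0.1])
y = 2.0 + X @ true_coef + rng.normal(scale=0.01, size=200)

# Add a column of ones so the intercept is estimated together with the coefficients
X1 = np.column_stack([np.ones(len(X)), X])
solution, *_ = np.linalg.lstsq(X1, y, rcond=None)
intercept, coef = solution[0], solution[1:]
print(intercept, coef)   # close to 2.0 and [0.5, -0.3, 0.1]
```

Because the problem has a closed-form solution, no epochs or learning rate are involved here, which is the main contrast with the pytorch approach.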

import torch

class DeathRatePredictor(torch.nn.Module):

  def __init__(self, input_dimension):
    super().__init__()
    self.linear = torch.nn.Linear(input_dimension, 1)

  def forward(self, x):
    # x is a batch of inputs with shape [n_samples, input_dimension]
    return self.linear(x)

This is probably about the simplest model you can build with torch.nn.Module. In init we declare the layers of the neural network we want to build. This example has just a single Linear layer. A Linear layer only applies a linear transformation, $y = Ax + b$, to its input. super().__init__() calls the constructor of the parent class (nn.Module) of this class (DeathRatePredictor), which sets up the machinery that nn.Module needs, such as parameter tracking.

forward is the function that actually carries out the computation on the input, using the layers declared in init, and returns the output.

Next, train the model using DeathRatePredictor:

X = df_main.drop(['TARGET_deathRate'], axis=1)
y = df_main['TARGET_deathRate']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state=1)


scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

y_train = np.asarray(y_train)
y_test = np.asarray(y_test)
y_train = scaler.fit_transform(y_train.reshape(-1, 1))
y_test = scaler.transform(y_test.reshape(-1, 1))

X_train = torch.from_numpy(X_train.astype(np.float32))
X_test = torch.from_numpy(X_test.astype(np.float32))
y_train = torch.from_numpy(y_train.astype(np.float32))
y_test = torch.from_numpy(y_test.astype(np.float32))

# TRAIN MODEL
input_dimension = X_train.shape[1] #X_train.shape is [2262, 18]

model = DeathRatePredictor(input_dimension)

criterion = torch.nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

n_epochs = 3000

train_losses = np.zeros(n_epochs)
test_losses = np.zeros(n_epochs)
epoch_coefs = []

for epoch in range(n_epochs):
  optimizer.zero_grad()
  
  outputs = model(X_train)
  loss = criterion(outputs, y_train)
  
  loss.backward()
  optimizer.step()

  outputs_test = model(X_test)
  loss_test = criterion(outputs_test, y_test)

  # Save loss values and model coefficients from each epoch for plotting later
  train_losses[epoch] = loss.item()
  test_losses[epoch] = loss_test.item()
  epoch_coefs.append(list(next(model.named_parameters())[1][0].detach().numpy()))

After 3000 epochs of training, you can check the model's coefficients as follows:

next(model.named_parameters())[1][0].detach().numpy()

You can see that the result is almost identical to sklearn's LinearRegression!

And if you plot the model coefficients saved in epoch_coefs at each epoch as a bar chart, you can watch it gradually become the same as the bar chart made from sklearn's coefficients:

For reference, here is the code for the animation above (when run in Google Colab):
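
The agreement is expected: the mean squared error of a linear model has a single minimum, so an iterative optimizer ends up at the least-squares solution. The NumPy sketch below illustrates this on synthetic data. Note the swaps: it uses plain gradient descent rather than Adam, leaves out the intercept, and the data and learning rate are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X @ np.array([0.4, -0.2, 0.0, 0.1]) + rng.normal(scale=0.05, size=300)

# Reference: closed-form least-squares solution (no intercept term in this toy setup)
w_closed, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plain gradient descent on the mean squared error
w = np.zeros(X.shape[1])
lr = 0.1
for _ in range(2000):
    grad = 2 * X.T @ (X @ w - y) / len(y)   # gradient of mean((Xw - y)^2) with respect to w
    w -= lr * grad

max_gap = np.max(np.abs(w - w_closed))
print(max_gap)   # tiny: both methods land on the same coefficients
```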

from matplotlib import animation
from matplotlib.animation import FuncAnimation
from matplotlib import rc
from IPython.display import HTML
rc('animation', html='jshtml')
%matplotlib inline

epoch_coefs_arr = np.asarray(epoch_coefs)
df_coefs = pd.DataFrame(epoch_coefs_arr)
df_coefs.columns = X.columns

fig = plt.figure(figsize=(15, 15))
ax = fig.add_subplot(1,1,1)
ax.set_ylim(-0.3, 0.4)

def animate(i):
  ax.clear()
  ax.set_ylim(-0.3, 0.4)
  ax.tick_params(axis='x', rotation=90)
  ax.set_title("Pytorch nn.Linear: Coefficients for each variable changing over 3000 epochs", fontsize=20)
  return ax.bar(df_coefs.columns, [df_coefs[feature][i] for feature in df_coefs.columns])

ani = FuncAnimation(fig, animate, frames = [30*frame for frame in range(100)], interval=30, blit=True)
HTML(ani.to_html5_video())

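Even plain linear regression turned out to be harder than I expected with a tool I had never used before, and I fumbled around a lot. Still, I think I have the basic concepts of PyTorch down now, so I should be able to try deep learning sometime next week!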

K-Means Clustering in the Wild

Tue, 17 May 2022 11:48:36 GMT

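I recently learned about unsupervised learning. Its core purpose is to find patterns within the given data, and I learned several methods for doing this in different situations, along with the principles behind them. (It was fun and fascinating, and I found myself wondering, with some envy, what the people who first came up with these methods ate to get that smart.)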

I wanted to find an example of unsupervised learning being used in the wild. Searching Google for "unsupervised machine learning clinical" turned up exactly this kind of study:

Unsupervised machine learning for identifying important visual features through bag-of-words using histopathology data from chronic kidney disease | Scientific Reports (nature.com)

Histopathology means taking a piece of an organ and examining it under a microscope, and chronic kidney disease (CKD) is a state in which kidney function is reduced, whatever the cause (most commonly diabetes).

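The paper uses K-means clustering to analyze kidney histopathology and, based on that, predicts the severity of chronic kidney disease. Its significance is that it shows histopathology findings can be analyzed without any human labelling and still lead to a clinically meaningful conclusion (severity prediction). On my first read I understood maybe 4% of it, and there are still plenty of parts I skipped with a "fine, let's just accept that this exists", but the goal of this post is to explain how K-means clustering was used for histopathology analysis.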

What Is Bag of Visual Words?

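First, the bag of visual words method needs some explanation. In natural language processing there is an analysis method called "bag of words": given a document, you ignore the order and context of its words entirely, simply count how often each word occurs to build a histogram, and then compare histograms to see how similar or different documents are. Bag of visual words applies this method to image data: instead of words, it uses image features (or feature vectors made up of several features) to build a feature histogram for each image and compares those. Here, features are the patterns that make up the image. The method is said to have been first used to compare different textures, and the picture below gives a feel for it: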

(Source: Kris Kitani, Carnegie Mellon University. 8.2 Bag of Visual Words (cmu.edu))
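The text version of the idea, bag of words, is small enough to sketch in a few lines. This is a toy illustration, not anything from the paper: the three example sentences, the helper names `bag_of_words` and `cosine_similarity`, and the choice of cosine similarity as the way to compare histograms are all assumptions made for this example.

```python
from collections import Counter
import math

def bag_of_words(text):
    """Count words, ignoring their order and context."""
    return Counter(text.lower().split())

def cosine_similarity(hist_a, hist_b):
    """Cosine similarity between two word-count histograms."""
    shared = set(hist_a) & set(hist_b)
    dot = sum(hist_a[w] * hist_b[w] for w in shared)
    norm_a = math.sqrt(sum(c * c for c in hist_a.values()))
    norm_b = math.sqrt(sum(c * c for c in hist_b.values()))
    return dot / (norm_a * norm_b)

doc_a = "the kidney filters the blood"
doc_b = "the blood the kidney filters"  # same words, different order
doc_c = "wheels and doors of a car"     # no words in common with doc_a

hist_a, hist_b, hist_c = map(bag_of_words, (doc_a, doc_b, doc_c))
sim_ab = cosine_similarity(hist_a, hist_b)
sim_ac = cosine_similarity(hist_a, hist_c)
print(hist_a)
print("a vs b:", sim_ab)
print("a vs c:", sim_ac)
```

The first two documents get identical histograms because word order is thrown away, while the third shares no words with the first and so scores zero. Bag of visual words does the same thing with image patterns in place of words.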

So how are the features for image analysis chosen? There are several methods (e.g. SIFT), but our CKD paper says it used one layer of the deep neural network that had been used to segment the kidney histopathology images as its feature set (= ResNet feature extraction).

From the features extracted this way, a few of the most representative ones are picked and called "code words" (= visual words), and the full set of code words is called the "codebook" or "visual dictionary":

(Source: Kris Kitani, Carnegie Mellon University. 8.2 Bag of Visual Words (cmu.edu))

This is exactly where K-means clustering comes in: it is used to choose the code words from the huge number of features. Similar features (or feature vectors) are clustered together, and the mean of each cluster (its centroid) is used as a code word.

Then, given a new image, that image can be represented as a histogram of code words:

(Source: Li Fei-Fei, Rob Fergus, Antonio Torralba. part_1.ppt (live.com), Recognizing and Learning Object Categories)
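The two steps above (cluster features into a codebook, then describe a new image as a histogram over that codebook) can be sketched as follows. This is a minimal toy version under loud assumptions: the "features" are made-up 2D points rather than real image descriptors, K-means is a hand-written Lloyd's algorithm with a deterministic initialisation chosen for this example, and the helper names `assign`, `build_codebook`, and `image_histogram` are hypothetical.

```python
import numpy as np

def assign(features, codebook):
    """Index of the nearest code word for each feature vector."""
    dists = np.linalg.norm(features[:, None, :] - codebook[None, :, :], axis=2)
    return dists.argmin(axis=1)

def build_codebook(features, k, n_iter=50):
    """K-means (Lloyd's algorithm); the centroids are the code words."""
    # Toy initialisation: k evenly spaced rows of the input.
    idx = np.linspace(0, len(features) - 1, k).astype(int)
    codebook = features[idx].astype(float)
    for _ in range(n_iter):
        labels = assign(features, codebook)
        for j in range(k):
            members = features[labels == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook

def image_histogram(patch_features, codebook):
    """Fraction of an image's features assigned to each code word."""
    counts = np.bincount(assign(patch_features, codebook), minlength=len(codebook))
    return counts / counts.sum()

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # three made-up patterns

# Training features pooled from many images: 100 per pattern.
train = np.vstack([c + rng.normal(scale=0.3, size=(100, 2)) for c in centers])
codebook = build_codebook(train, k=3)

# A new image with 30 features of pattern 0, 10 of pattern 1, none of pattern 2.
new_image = np.vstack([
    centers[0] + rng.normal(scale=0.3, size=(30, 2)),
    centers[1] + rng.normal(scale=0.3, size=(10, 2)),
])
hist = image_histogram(new_image, codebook)
print("codebook:\n", np.round(codebook, 2))
print("histogram:", hist)
```

Two images can then be compared through their histograms alone, exactly as documents are compared in bag of words.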

How the Kidney Paper Uses K-means Clustering

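Our kidney paper split every kidney histopathology image into 256x256-pixel patches and ran K-means clustering to divide these patches into 9 clusters, yielding 9 code words (= visual words). The number of clusters was set to 9 using the silhouette method. (In class I learned that one thing to watch out for with K-means clustering, that is, one of its weaknesses, is that you have to choose the number of clusters yourself; this paper also notes that future work should repeat the analysis with different values of k.)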
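How a silhouette score can guide the choice of k is easier to see on toy data. The sketch below is an assumption-heavy illustration, not the paper's procedure: the points are three synthetic 2D blobs, K-means is a hand-written Lloyd's algorithm with a deterministic initialisation, and `kmeans_labels` and `mean_silhouette` are hypothetical helper names. For each point, the silhouette compares the mean distance to its own cluster (a) with the mean distance to the nearest other cluster (b) as (b - a) / max(a, b); the k with the highest average wins.

```python
import numpy as np

def kmeans_labels(points, k, n_iter=50):
    """Lloyd's algorithm with a deterministic toy initialisation."""
    centroids = points[np.linspace(0, len(points) - 1, k).astype(int)].astype(float)
    for _ in range(n_iter):
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels

def mean_silhouette(points, labels):
    """Average silhouette value over all points."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(points))
    for i in range(len(points)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # a singleton cluster scores 0 by convention
        a = dists[i, own].sum() / (n_own - 1)  # the point's distance to itself is 0
        b = min(dists[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])  # made-up blob centres
points = np.vstack([c + rng.normal(scale=0.2, size=(40, 2)) for c in centers])

scores = {k: mean_silhouette(points, kmeans_labels(points, k)) for k in range(2, 6)}
best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print("best k:", best_k)
```

On real patch features the peak is rarely this clear-cut, which fits the paper's own remark that other values of k deserve a look.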

Figure 6(A) of the paper shows the 9 visual words (= code words) produced by K-means clustering. 6(B) is a kidney tissue sample seen under the microscope, and 6(C) shows which visual word's cluster each patch (256x256 pixels) of 6(B) belongs to.

Figure 6 (A) A visual dictionary that consists of 9 representative visual words, (B) a representative cortex sample, and (C) its cluster map with colored patches. Each colored patch corresponds to its assigned visual word [1]

Figure 7 shows a zoomed-in view of Figure 6(C).

Figure 7. An example of cortex trichrome stained images with color-coded patches and zoomed images [1]

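We are about to take a closer look at the visual words in Figure 6(A), the output of the clustering, but first a quick detour into kidney histology. The kidney is made up of three main kinds of tissue: glomeruli, tubules, and interstitium. (A tissue is a group of cells that together perform one function; tissues together form an organ such as the kidney, and organs together form an individual such as a person.) The glomeruli filter waste (urine) out of the blood, the tubules carry the filtered urine onward toward the bladder, and the interstitium can be thought of as whatever is left once you take those two away. Because the tubules and the interstitium usually sit right next to each other, however, they are often treated as a single tissue under the name tubulointerstitium. Figure 4 of the paper kindly shows what glomeruli, tubules, and interstitium look like under the microscope (it also marks blood vessels (arterioles) as a separate class, but that is not important here).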

Figure 4. An example of a trichrome-stained image (left) and an automatically segmented image from our trained deep learning model (right) [1]

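Now, looking at the visual words in Figure 6(A), even I, no expert, could tell that number 2 is a normal glomerulus (it looks slightly inflamed, but roughly normal) and number 5 is a diseased glomerulus. I could not work out what findings the other visual words represent, but fortunately Table 5 of the paper explains them: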

Figure 6(A) A visual dictionary that consists of 9 representative visual words [1]

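1. Normal tubules/interstitium
2. Normal glomerulus (slight inflammation)
3. Normal tubules/interstitium
4. Normal tubules/interstitium + slight interstitial expansion
5. Glomerulosclerosis (meaning a diseased glomerulus that no longer works well)
6. Normal tubules/interstitium
7. Normal tubules/interstitium
8. Tubular atrophy (a finding of diseased tubules)
9. Interstitial expansion (also a finding seen in kidney disease, but not as important as glomerulosclerosis in number 5 or tubular atrophy in number 8)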

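Without ever seeing human-labelled data, clustering alone not only separated the glomeruli from the tubules/interstitium that make up kidney tissue, but also picked out what diseased glomeruli and diseased tubules look like. Of course, at this stage the algorithm does not know which visual words are findings of kidney disease and which are normal, but the authors teach it that in the next step: for each patient's kidney histopathology images they built a histogram over these visual words and then ran supervised learning to predict how severe the kidney disease is (patients were labelled as either mild CKD or moderate/severe CKD, and a random forest classifier was trained on those labels). Severity was predicted quite accurately, with an AUC of about 0.91 (although, since patients only had to be sorted into two groups, mild vs. moderate/severe, the task itself is probably not that hard). More interesting still, when the feature importance of the features used by the random forest was computed, visual word 2 (normal glomerulus), 8 (tubular atrophy), and 5 (glomerulosclerosis) came out as the first, second, and third most important features. This closely matches what human pathologists look at most when judging CKD severity from microscopy: what percentage of glomeruli look normal, how many glomeruli are sclerosed, how far tubular atrophy has progressed, and so on.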
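The supervised step described above (visual-word histograms in, severity label out, then a look at feature importance) might look roughly like this with scikit-learn. Everything about the data is an assumption made for illustration: the histograms and labels are synthetic, and by construction only visual words 2 and 5 carry signal, so the numbers say nothing about the paper's actual results. The hyperparameters (`n_estimators=200`, a 75/25 split) are likewise arbitrary choices for this sketch.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients, n_words = 400, 9

# Made-up labels: 0 = mild CKD, 1 = moderate/severe CKD.
severity = rng.integers(0, 2, size=n_patients)

# Made-up visual-word histograms (each row sums to 1). Only word 2 (index 1)
# and word 5 (index 4) depend on the label, purely by construction.
weights = np.ones((n_patients, n_words))
weights[:, 1] += np.where(severity == 0, 2.0, 0.0)  # more of word 2 in mild cases
weights[:, 4] += np.where(severity == 1, 2.0, 0.0)  # more of word 5 in severe cases
histograms = np.vstack([rng.dirichlet(5 * w) for w in weights])

X_train, X_test, y_train, y_test = train_test_split(
    histograms, severity, test_size=0.25, random_state=0, stratify=severity)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_train, y_train)

# AUC on held-out patients, using the predicted probability of class 1.
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])

# Visual word numbers (1-based) sorted from most to least important.
ranking = np.argsort(clf.feature_importances_)[::-1] + 1
print(f"AUC: {auc:.2f}")
print("visual words by importance:", ranking)
```

In this toy setup the forest should rank words 2 and 5 at the top simply because that is where the signal was planted; in the paper, the interesting part is that the top-ranked words turned out to match what pathologists already look at.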

From Here On, My Own Thoughts

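The dataset used in this study consists of kidney histopathology from people who have chronic kidney disease, so even the "normal glomerulus" visual word included some inflammation. If kidney histopathology from healthy people were included in the learning as well, I think completely normal glomeruli would form a cluster of their own and act as a separate visual word, which could further improve the accuracy of severity prediction. Also, the causes of glomerulosclerosis (visual word #5) and the causes of tubular atrophy (visual word #8) are somewhat different (though they do overlap). This study looked at histopathology from patients with "chronic kidney disease" (reduced kidney function regardless of cause), but if data on what caused each patient's CKD could also be included as labels, it seems to me that predicting the cause of the kidney function decline, and not only its severity, might become possible.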

Wrapping Up

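I think this is a nice example of K-means clustering replacing the labelling that supervised learning would otherwise have required. And because the algorithm discovers which patterns exist in each image in a way that is more organic (?) yet still systematic than learning from human-provided labels, it feels closer to first principles. It also seems to have the potential to lead to new discoveries: if a visual word found by unsupervised learning turned out to be important for predicting CKD severity and it was a pattern human pathologists did not know about, that would amount to discovering a new histopathological pattern. For reasons like these, unsupervised learning somehow feels a bit cooler than supervised learning.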

The end!

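I am not an expert in ML methodology or in pathology, so there may be mistakes here (possibly many). Please point them out! And if anything is not explained well enough or you have questions, let me know and I will try to add more explanation.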

[1] Lee, J., Warner, E., Shaikhouni, S. et al. Unsupervised machine learning for identifying important visual features through bag-of-words using histopathology data from chronic kidney disease. Sci Rep 12, 4832 (2022). https://doi.org/10.1038/s41598-022-08974-8