dev_halo.log

Ngrok

Mon, 09 Sep 2024 01:24:24 GMT

curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
    | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc >/dev/null \
    && echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
    | sudo tee /etc/apt/sources.list.d/ngrok.list \
    && sudo apt update \
    && sudo apt install ngrok

ngrok config add-authtoken $YOUR_AUTHTOKEN

ngrok start --config "/root/.config/ngrok/ngrok.yml" --all

authtoken: ${PERSONAL_AUTH_TOKEN}
tunnels:
  jupyter:
    proto: http
    addr: 8888
  test1:
    proto: http
    addr: 7474
  test2:
    proto: http
    addr: 7687
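If the tunnel list changes often, the config body above can be generated rather than edited by hand. A minimal sketch, assuming every tunnel is a plain http tunnel; the helper name `render_ngrok_config` is hypothetical, and the token placeholder is copied from the config above.

```python
def render_ngrok_config(tunnels, authtoken_placeholder="${PERSONAL_AUTH_TOKEN}"):
    """Render an ngrok.yml body with one http tunnel per (name, port) pair."""
    lines = [f"authtoken: {authtoken_placeholder}", "tunnels:"]
    for name, port in tunnels.items():
        lines.append(f"  {name}:")
        lines.append("    proto: http")
        lines.append(f"    addr: {port}")
    return "\n".join(lines) + "\n"


# Same three tunnels as in the config above.
config_text = render_ngrok_config({"jupyter": 8888, "test1": 7474, "test2": 7687})
print(config_text)
```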

deepspeed Learning Rate Scheduler list

Fri, 05 Jul 2024 00:18:25 GMT

LR Scheduler

jupyter kernel

Thu, 23 Nov 2023 00:54:41 GMT

Install ipykernel

pip install ipykernel

Create a virtual environment

python -m venv ENV_NAME

Activate the virtual environment

source ENV_NAME/bin/activate

Deactivate the virtual environment

deactivate

Register the kernel

python -m ipykernel install --user --name ENV_NAME --display-name "DISPLAY_NAME"
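After registering the kernel, it can be worth confirming from inside a notebook that the selected kernel really runs in the virtual environment. A minimal sketch using only the standard library; the helper name `in_virtualenv` is hypothetical.

```python
import sys


def in_virtualenv():
    """Return True when the running interpreter belongs to a venv."""
    # Inside a venv, sys.prefix points at the venv while sys.base_prefix
    # still points at the base Python installation.
    return sys.prefix != sys.base_prefix


print("interpreter:", sys.executable)
print("inside venv:", in_virtualenv())
```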

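CLOVA X review

Tue, 29 Aug 2023 01:58:39 GMT

CLOVA X review

After three days on the waitlist, I finally got access to CLOVA X.

Let's review it.

1. main

middle

side

It looks like a typical chat site these days.

2. chat

Nothing remarkable; it works fine.

Hallucination

I asked about a game called Lies of P, and hallucination already shows up.

When I rephrased the question properly, it returned correct links and a plausible answer.

In-Context Learning

Interaction with the user is not possible.

ChatGPT can do In-Context Learning, but CLOVA X appears unable to.

Data Augmentation

I asked for data augmentation; ChatGPT generates it well, whereas CLOVA X could not do it at all.

CLOVA X Skill

CLOVA X calls the feature that connects Naver's platforms to the chat for search a SKILL.

Conclusion

Judged purely as a generative model, ChatGPT is superior. But seeing how it cites its sources and is woven into Naver's own services, it is very much a Naver product. Since it is still in beta, I assume not everything has been released yet.

Go CLOVA!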

git bash conda activate

Wed, 09 Aug 2023 04:31:13 GMT

miniconda

source ~/miniconda3/etc/profile.d/conda.sh

anaconda

source ~/anaconda3/etc/profile.d/conda.sh

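Let's build an abstractive document summarizer (with BART Huggingface)

Tue, 07 Feb 2023 04:03:19 GMT

KoBART

Fine-tuning BART for Korean abstractive document summarization with the Huggingface🤗 Trainer

BART Huggingface fine-tuning code, written to reproduce the setup with as few default values as possible

Data

Results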

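def infer(text):
    # Assumes `torch`, `model`, `tokenizer`, `device`, and `train_dataset` are already defined.
    if isinstance(text, int):
        # An int is treated as an index into train_dataset.
        input_ids = torch.tensor(train_dataset[text]['input_ids'])[None, :].to(device)
    else:
        # Raw text: wrap the encoded tokens with BOS/EOS.
        tmp = [tokenizer.bos_token_id] + tokenizer.encode(text) + [tokenizer.eos_token_id]
        input_ids = torch.tensor(tmp)[None, :].to(device)
    # generate() already returns token ids, so no .logits.argmax() is needed.
    out = model.generate(input_ids=input_ids)
    result = tokenizer.decode(out[0])
    print(result)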
text = """1일 오후 9시까지 최소 20만3220명이 코로나19에 신규 확진됐다. 
또다시 동시간대 최다 기록으로, 사상 처음 20만명대에 진입했다. 
방역 당국과 서울시 등 각 지방자치단체에 따르면 이날 0시부터 오후 9시까지 전국 신규 확진자는 총 20만3220명으로 집계됐다. 
국내 신규 확진자 수가 20만명대를 넘어선 것은 이번이 처음이다. 
동시간대 최다 기록은 지난 23일 오후 9시 기준 16만1389명이었는데, 이를 무려 4만1831명이나 웃돌았다. 
전날 같은 시간 기록한 13만3481명보다도 6만9739명 많다. 
확진자 폭증은 3시간 전인 오후 6시 집계에서도 예견됐다. 
오후 6시까지 최소 17만8603명이 신규 확진돼 동시간대 최다 기록(24일 13만8419명)을 갈아치운 데 이어 이미 직전 0시 기준 역대 최다 기록도 넘어섰다. 
역대 최다 기록은 지난 23일 0시 기준 17만1451명이었다. 
17개 지자체별로 보면 서울 4만6938명, 경기 6만7322명, 인천 1만985명 등 수도권이 12만5245명으로 전체의 61.6%를 차지했다. 
서울과 경기는 모두 동시간대 기준 최다로, 처음으로 각각 4만명과 6만명을 넘어섰다. 
비수도권에서는 7만7975명(38.3%)이 발생했다. 
제주를 제외한 나머지 지역에서 모두 동시간대 최다를 새로 썼다. 
부산 1만890명, 경남 9909명, 대구 6900명, 경북 6977명, 충남 5900명, 대전 5292명, 전북 5150명, 울산 5141명, 광주 5130명, 전남 4996명, 강원 4932명, 충북 3845명, 제주 1513명, 세종 1400명이다. 
집계를 마감하는 자정까지 시간이 남아있는 만큼 2일 0시 기준으로 발표될 신규 확진자 수는 이보다 더 늘어날 수 있다. 이에 따라 최종 집계되는 확진자 수는 21만명 안팎을 기록할 수 있을 전망이다. 
한편 전날 하루 선별진료소에서 이뤄진 검사는 70만8763건으로 검사 양성률은 40.5%다. 
양성률이 40%를 넘은 것은 이번이 처음이다. 
확산세가 계속 거세질 수 있다는 얘기다. 
이날 0시 기준 신규 확진자는 13만8993명이었다. 이틀 연속 13만명대를 이어갔다."""
infer(text)
 코로나19 사상 처음 20만 이상 인구가 신규 확진되었다.

Score

Measured on a 2-row val_dataset, so this is not a complete score!

Score	Rouge-1	Rouge-2	Rouge-L
Recall	1.0	1.0	1.0
Precision	1.0	1.0	1.0
F1_score	0.99	0.99	0.99
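For readers unfamiliar with the three rows of the table, here is a rough sketch of how Rouge-1 recall, precision, and F1 relate to each other. It assumes whitespace tokenization and is not the scoring library used for the numbers above; the function name `rouge_1` is hypothetical.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Return (recall, precision, f1) for unigram overlap on whitespace tokens."""
    ref_counts = Counter(reference.split())
    cand_counts = Counter(candidate.split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    if recall + precision == 0:
        return recall, precision, 0.0
    f1 = 2 * recall * precision / (recall + precision)
    return recall, precision, f1


print(rouge_1("a b c d", "a b"))
```

With only two validation rows, a single well-memorized pair is enough to push every cell close to 1.0, which is why the score should not be read as a real benchmark.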

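Lessons learned

With an initial lr of 1e-3 and no scheduler, training made no progress at all, so I assumed the model itself was at fault. Once I set the lr and added scheduler warmup, training went well, which was the lesson.

Initial

After applying lr and scheduler

Obvious in hindsight, but if the input length differs from the length the model expects, the results come out wrong.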
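The length point can be illustrated with a tiny helper that forces every token sequence to one fixed length. This is only a sketch: the name `fit_to_length` and the pad id are hypothetical, and in practice the tokenizer's own truncation and padding options would normally do this.

```python
def fit_to_length(token_ids, max_length, pad_id):
    """Truncate or right-pad token_ids so the result has exactly max_length items."""
    clipped = list(token_ids[:max_length])
    # Pad on the right with pad_id when the sequence is too short.
    return clipped + [pad_id] * (max_length - len(clipped))


print(fit_to_length([11, 12, 13], 5, pad_id=0))
print(fit_to_length([11, 12, 13, 14, 15, 16], 4, pad_id=0))
```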

Reference

Github

Data used for training

gogamza/kobart-base-v1

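1st place in the Dacon Sungkyunkwan University Sentence Type Classification AI Competition

Wed, 18 Jan 2023 12:58:10 GMT

Sentence Type Classification AI Competition

Competition info

Topic

Develop a sentence type classification AI model

Description

Develop a sentence type classification AI model that can be used broadly in every domain where language is used.

Build an AI classification model that takes a sentence as input and classifies its 'type', 'tense', 'polarity', and 'certainty'

Host / Organizer

Host: Sungkyunkwan University
Organizer: Dacon

Eligibility

Anyone, including the general public and students

Data

Attempts to improve performance

Preprocessing

Text Augmentation

Loss

Custom Model

Pretrained models used and their sources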

Code

Github

For HuggingFace Custom CosineAnnealingWarmUpRestarts

Fri, 16 Dec 2022 04:54:31 GMT

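Introduction

While writing a custom Huggingface Trainer, I was looking into schedulers to choose the lr and found a good blog post, so I tried its recommendation. The recommended scheduler was custom code written by the blog author, and it raised an error when I applied it to Huggingface, so I modified it.

Body

import math

from torch.optim.lr_scheduler import _LRScheduler


class CosineAnnealingWarmUpRestarts(_LRScheduler):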
    def __init__(self, optimizer, T_0, T_mult=1, eta_max=0.1, T_up=0, gamma=1., last_epoch=-1):
        if T_0 <= 0 or not isinstance(T_0, int):
            raise ValueError("Expected positive integer T_0, but got {}".format(T_0))
        if T_mult < 1 or not isinstance(T_mult, int):
            raise ValueError("Expected integer T_mult >= 1, but got {}".format(T_mult))
        if T_up < 0 or not isinstance(T_up, int):
            raise ValueError("Expected non-negative integer T_up, but got {}".format(T_up))
        self.T_0 = T_0
        self.T_mult = T_mult
        self.base_eta_max = eta_max
        self.eta_max = eta_max
        self.T_up = T_up
        self.T_i = T_0
        self.gamma = gamma
        self.cycle = 0
        self.T_cur = last_epoch
        super(CosineAnnealingWarmUpRestarts, self).__init__(optimizer, last_epoch)

    def get_lr(self):
        if self.T_cur == -1:
            return self.base_lrs
        elif self.T_cur < self.T_up:
            return [(self.eta_max - base_lr)*self.T_cur / self.T_up + base_lr for base_lr in self.base_lrs]
        else:
            return [base_lr + (self.eta_max - base_lr) * (1 + math.cos(math.pi * (self.T_cur-self.T_up) / (self.T_i - self.T_up))) / 2
                    for base_lr in self.base_lrs]

    def step(self, epoch=None):
        if epoch is None:
            epoch = self.last_epoch + 1
            self.T_cur = self.T_cur + 1
            if self.T_cur >= self.T_i:
                self.cycle += 1
                self.T_cur = self.T_cur - self.T_i
                self.T_i = (self.T_i - self.T_up) * self.T_mult + self.T_up
        else:
            if epoch >= self.T_0:
                if self.T_mult == 1:
                    self.T_cur = epoch % self.T_0
                    self.cycle = epoch // self.T_0
                else:
                    n = int(math.log((epoch / self.T_0 * (self.T_mult - 1) + 1), self.T_mult))
                    self.cycle = n
                    self.T_cur = epoch - self.T_0 * (self.T_mult ** n - 1) / (self.T_mult - 1)
                    self.T_i = self.T_0 * self.T_mult ** (n)
            else:
                self.T_i = self.T_0
                self.T_cur = epoch

        self.eta_max = self.base_eta_max * (self.gamma**self.cycle)
        self.last_epoch = math.floor(epoch)
        for param_group, lr in zip(self.optimizer.param_groups, self.get_lr()):
            param_group['lr'] = lr
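To see what shape `get_lr` produces without installing torch, the two branches can be reproduced as a plain function for a single cycle. A minimal sketch: the function name `warmup_cosine_lr` and the hyperparameter values are hypothetical, and restarts, `T_mult`, and `gamma` are left out.

```python
import math


def warmup_cosine_lr(t_cur, base_lr, eta_max, t_up, t_i):
    """Learning rate at step t_cur within one cycle of length t_i."""
    if t_cur < t_up:
        # Linear warmup from base_lr up to eta_max.
        return (eta_max - base_lr) * t_cur / t_up + base_lr
    # Cosine decay from eta_max back down toward base_lr.
    progress = (t_cur - t_up) / (t_i - t_up)
    return base_lr + (eta_max - base_lr) * (1 + math.cos(math.pi * progress)) / 2


# One cycle of 100 steps with 10 warmup steps (illustrative values only).
schedule = [
    warmup_cosine_lr(t, base_lr=1e-5, eta_max=1e-3, t_up=10, t_i=100)
    for t in range(100)
]
print(schedule[0], schedule[10], schedule[99])
```

The peak sits exactly at the end of warmup (`t_cur == t_up`), and the rate then falls smoothly back toward `base_lr` by the end of the cycle.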

Conclusion

It works well.

Source

Pytorch Learning Rate Scheduler

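Let's dump the Blue House national petition (Cheongmu Wiki) data

Wed, 07 Dec 2022 05:19:15 GMT

bluehouse_petitions

A dump of former President Moon Jae-in's Blue House national petition data, taken from the Blue House archive

Written with Python lxml; ray is used to collect the data with parallel crawling.

Adjust the parameters to match your machine's specs.

Preprocessing is kept to a minimum, leaving the data ready for analysis.

worker = 16
size = 3000

Requirements

pip install ray requests lxml pandas tqdm pyarrow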

Data Shape

number	제목	답변상태	참여인원	카테고리	청원시작	청원마감	청원내용	답변원고

[451513 rows x 9 columns]

RAY Process

import pandas as pd
import requests
from lxml.html import fromstring
from tqdm import tqdm
import gc
import ray
ray.init()


@ray.remote
def main(data, h, ram, iteration):
    if ram == 0 and h > iteration:  # first chunk
        s = h
        e = s-iteration
    elif 0 <= h-iteration*ram <= iteration:  # last chunk
        s = h-iteration*ram
        e = -1
    elif h-iteration*ram < 0:  # the last chunk was already reached, so return an empty row
        return data.append(pd.Series([None for _ in range(len(columns))]), ignore_index=True)
    else:
        s = h-iteration*ram
        e = s-iteration
    for code in tqdm(range(s, e, -1)):
        try:
            url = "http://19president.pa.go.kr/petitions/{}".format(code)
            res = requests.get(url)
            parser = fromstring(res.text)

            title = parser.xpath("/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/h3/text()")
            status = parser.xpath(
                "/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[1]/h4/text()")
            personnel = parser.xpath(
                "/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/h2/span/text()")
            category = parser.xpath(
                "/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[2]/ul/li[1]/text()")
            start = parser.xpath(
                "/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[2]/ul/li[2]/text()")
            end = parser.xpath(
                "/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[2]/ul/li[3]/text()")
            q = parser.xpath("/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[4]/div[2]/text()")
            a = parser.xpath("/html/body/div[3]/div[2]/section[2]/div[2]/div[1]/div[2]/div[1]/div/div[5]/div/text()")
            data = data.append(pd.Series([code, *title, ' '.join(status), *personnel, *category, *start, *end,
                                          ' '.join(q),
                                          ' '.join(a)]), ignore_index=True)
        except:
            if len(data) > 2:
                return data
            else:
                return data.append(pd.Series([None for _ in range(len(columns))]), ignore_index=True)
    return data


worker = 16
size = 3000
total = round(605368/(size*worker))
columns = ['number', '제목', '답변상태', '참여인원', '카테고리', '청원시작', '청원마감', '청원내용', '답변원고']
for i in range(total):
    df = pd.DataFrame()
    df = ray.put(df)
    starting = 605368 - (size*worker*i)
    ans = [main.remote(data=df, h=starting, ram=n, iteration=size) for n in range(worker)]
    ans = ray.get(ans)
    ans = pd.concat(ans, ignore_index=True)

    ans.columns = columns
    ans = ans.dropna(subset=['청원시작']).reset_index(drop=True)

    ans.to_parquet(f'Bluehouse{i}.parquet', engine='pyarrow', compression='gzip', index=False)
    print(f"{i}th FIN")
    del df
    del ans
    gc.collect()
ray.shutdown()
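The way each round hands a slice of petition IDs to every worker can be sketched without ray. This is a simplified restatement of the same idea, not the code above: the helper name `worker_ranges` is hypothetical, and it returns inclusive ranges that count down and stop at ID 0.

```python
def worker_ranges(start_id, size, workers):
    """Return (first, last) inclusive ID ranges, counting down, one per worker."""
    ranges = []
    for n in range(workers):
        first = start_id - size * n
        if first < 0:
            break  # nothing left for this worker
        # Each worker takes `size` IDs, clipped so the range never goes below 0.
        last = max(first - size + 1, 0)
        ranges.append((first, last))
    return ranges


# First round with the parameters used above.
print(worker_ranges(605368, 3000, 16)[:2])
```

Larger `size` values mean fewer, longer tasks per round; more workers mean more parallel requests against the archive site, so both should be tuned with the machine and the target server in mind.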

Preprocess

import pandas as pd

df = pd.read_parquet('Bluehouse.parquet')
df.청원내용 = df.청원내용.str.replace('\t', '', regex=True)
df.청원내용 = df.청원내용.str.replace('\r', '', regex=True)
df.청원내용 = df.청원내용.str.replace('\n', ' ', regex=True)
df.청원내용 = df.청원내용.str.replace(r'\s+', ' ', regex=True)
df.청원내용 = df.청원내용.str.strip()

df.답변원고 = df.답변원고.str.replace('\t', '', regex=True)
df.답변원고 = df.답변원고.str.replace('\r', ' ', regex=True)
df.답변원고 = df.답변원고.str.replace('\n', '', regex=True)
df.답변원고 = df.답변원고.str.replace(r'\s+', ' ', regex=True)
df.답변원고 = df.답변원고.str.strip()

df.number = df.number.astype(int)

df.to_parquet('Bluehouse_dump.parquet')
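The effect of the whitespace cleanup can be tried on a single string with the standard library. A minimal sketch: the helper name `normalize_whitespace` is hypothetical, and unlike the pandas code above, which deletes tabs outright, this variant turns every whitespace run into one space.

```python
import re


def normalize_whitespace(text):
    """Collapse every run of whitespace (tabs, CR, LF, spaces) into one space."""
    return re.sub(r"\s+", " ", text).strip()


print(normalize_whitespace("first line\r\n\tsecond   line  "))
```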

reference

Presidential Archives

RAY

github

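Sagemaker built-in model serving patterns

Mon, 17 Oct 2022 17:14:56 GMT

Four built-in model serving patterns in Amazon SageMaker

Once you have built a deep learning model and written a Sagemaker Pipeline for an efficient AI/ML process, the resulting model has to be applied to a real service. Sagemaker provides roughly four built-in model serving patterns, and you should choose the one that fits your project.

The pattern I actually shipped works like this: data collected up to a specific time is passed through the model, and the inferred probabilities and labels are written as parquet for the customer analytics team's dashboard.

I first applied real-time inference. Given the characteristics above, however, the size of the dataset collected up to the cutoff time is not constant, and a very large dataset exceeded the maximum payload. I therefore switched to the batch transform serving pattern.

1. Batch transform

Inference over the entire dataset
Suited to periodic inference on large volumes of data
Ephemeral resources (provisioned instances shut down right after the job completes) -> billed only for what you use

2. Asynchronous inference interface

Suited to large payloads of up to 1GB
Timeout of up to 15 minutes
Autoscaling
Suited to CV/NLP

3. Real-time inference

All artifacts needed for model inference are stored on the web server
Immediate response for payloads of up to 6MB
60-second timeout
Autoscaling

4. Serverless inference

4.1. Lambda serverless inference

Requires building a Docker image and implementing a Lambda function

4.2. Sagemaker serverless inference

Create model -> create endpoint configuration -> create endpoint -> use the Sagemaker serverless endpoint
Available memory sizes: 1/2/3/4/5/6GB

Summary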

-	Real-time	Batch	Async	Serverless
GPU support	O	O	O	X
Autoscaling	O	-	O	O
Scale to Zero	X	-	O	O
Multi-container	O	X	X	X
Multi-model	O	X	X	X
Payload size	6MB	-	1GB	4MB
Timeout	60s	-	15min	60s
Blue/green guardrails	O	X	X	1step
PrivateLink support	O	O	O	X
A/B testing	O	X	X	X
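The payload and timeout rows are the ones that forced the switch from real-time to batch, and they can be turned into a quick filter. A rough sketch only: the limits are copied from the table above (with 1GB taken as 1024MB, an assumption), the names `LIMITS` and `candidate_patterns` are hypothetical, and the other rows such as GPU support are ignored.

```python
LIMITS = {
    # pattern: (max payload in MB, max timeout in seconds); None means no limit listed
    "real-time": (6, 60),
    "serverless": (4, 60),
    "async": (1024, 15 * 60),
    "batch": (None, None),
}


def candidate_patterns(payload_mb, runtime_s):
    """Return the patterns whose listed limits cover the given request."""
    fits = []
    for name, (max_mb, max_s) in LIMITS.items():
        if max_mb is not None and payload_mb > max_mb:
            continue
        if max_s is not None and runtime_s > max_s:
            continue
        fits.append(name)
    return fits


# A dataset far larger than any endpoint payload limit leaves only batch transform.
print(candidate_patterns(payload_mb=5000, runtime_s=10))
```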

Source

AWS SlideShare

Let's serve a Yolov5 model with BentoML

Wed, 07 Sep 2022 07:14:23 GMT

BentoML_Serving

At work we developed a model that classifies narcotics from hospital images of the drugs. It was built with Yolov5 on pytorch. While weighing deployment options, I'll share how to put it on BentoML, which is reputed to be better than flask for API serving.

1. Init WrapperModel then save model as bentoml model

import torch
from models.common import DetectMultiBackend
from utils.torch_utils import select_device, time_sync
from pathlib import Path
from utils.datasets import IMG_FORMATS, VID_FORMATS, LoadImages
from utils.general import check_file, check_img_size, non_max_suppression, scale_coords
import bentoml

# Model
device = select_device('')
original_model = DetectMultiBackend('best.pt', device=device, dnn=False)


class WrapperModel(torch.nn.Module):
    def __init__(self, model):
        super().__init__()
        self.model = model

    def forward(self, imgs):
        source = str(imgs)
        is_file = Path(source).suffix[1:] in (IMG_FORMATS + VID_FORMATS)
        is_url = source.lower().startswith(('rtsp://', 'rtmp://', 'http://', 'https://'))
        if is_url and is_file:
            source = check_file(source)  # download

        # Load model
        device = select_device('')
        stride, names, pt, jit, onnx, engine = self.model.stride, self.model.names, self.model.pt, self.model.jit, self.model.onnx, self.model.engine
        imgsz = check_img_size((448, 448), s=stride)  # check image size

        half = False
        if pt or jit:
            self.model.model.half() if half else self.model.model.float()

        # Dataloader
        dataset = LoadImages(source, img_size=imgsz, stride=stride, auto=pt)
        bs = 1  # batch_size
        vid_path, vid_writer = [None] * bs, [None] * bs

        # Run inference
        dt, seen = [0.0, 0.0, 0.0], 0
        for path, im, im0s, vid_cap, s in dataset:
            t1 = time_sync()
            im = torch.from_numpy(im).to(device)
            im = im.half() if half else im.float()  # uint8 to fp16/32
            im /= 255  # 0 - 255 to 0.0 - 1.0
            if len(im.shape) == 3:
                im = im[None]  # expand for batch dim
            t2 = time_sync()
            dt[0] += t2 - t1

            # Inference
            pred = self.model(im)
            t3 = time_sync()
            dt[1] += t3 - t2

            # NMS
            pred = non_max_suppression(prediction=pred, conf_thres=torch.tensor(0.25).cuda(),
                                       iou_thres=torch.tensor(0.45).cuda(), classes=None, agnostic=False, max_det=1000)
            dt[2] += time_sync() - t3

            # Process predictions
            for i, det in enumerate(pred):  # per image
                seen += 1
                p, im0, frame = path, im0s.copy(), getattr(dataset, 'frame', 0)

                s += '%gx%g ' % im.shape[2:]  # print string
                if len(det):
                    # Rescale boxes from img_size to im0 size
                    det[:, :4] = scale_coords(im.shape[2:], det[:, :4], im0.shape).round()

                    # Print results
                    for c in det[:, -1].unique():
                        n = (det[:, -1] == c).sum()  # detections per class
                        s += f"{n} {names[int(c)]}{'s' * (n > 1)}, "  # add to string
        return s


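model = WrapperModel(original_model)

# Save the wrapper to the BentoML model store under the name that service.py loads.
bentoml.pytorch.save_model("pytorch_yolov5", model)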

2. Write service.py

import bentoml
from bentoml.io import Text

yolo_runner = bentoml.pytorch.get("pytorch_yolov5").to_runner()

svc = bentoml.Service(
    name="pytorch_yolo_demo",
    runners=[yolo_runner],
)


@svc.api(input=Text(), output=Text())
async def predict(img: str) -> str:
    assert isinstance(img, str)
    return await yolo_runner.async_run(img)

3. Run service

bentoml serve service.py:svc

4. Send API

curl -X POST -H "Content-Type: text/plain" --data 'SAMPLE IMG URI' http://localhost:3000/predict

return 'image 1/1 /home/halo/PycharmProjects/bentoml/202106180043369591_0.jpg: 352x448 1 F_Duro_50mcg,'
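The same call can be made from Python instead of curl. A minimal sketch using only the standard library; the helper name `build_predict_request` is hypothetical, and the host and port are the ones from the curl example above. The request is built but only sent if the commented lines are enabled.

```python
import urllib.request


def build_predict_request(image_uri, host="http://localhost:3000"):
    """Build (but do not send) the POST request for the /predict endpoint."""
    return urllib.request.Request(
        url=f"{host}/predict",
        data=image_uri.encode("utf-8"),
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


request = build_predict_request("SAMPLE IMG URI")
# To actually call a running service:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```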

5. Docker build

Write bentofile.yaml

service: "service:svc"  # where the bentoml.Service instance is defined
include:
- "*.py"
- "*.pt"
docker:
    base_image: "my_custom_image:latest"

Build

bentoml build

Deploying the Bento

bentoml containerize pytorch_yolo_demo:uk3q6lq7rsmkw3lr

# Successfully built docker image "pytorch_yolo_demo:uk3q6lq7rsmkw3lr"

docker run --gpus all -p 3000:3000 pytorch_yolo_demo:uk3q6lq7rsmkw3lr

Options

# Production
bentoml serve --production --host 0.0.0.0 service.py:svc

Sagemaker MLOps Pipeline with HuggingFace

Wed, 07 Sep 2022 07:02:15 GMT

huggingface_sagemaker_pipeline

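I built a sentiment analysis model with HuggingFace using customer survey data from airline K. Everyone says MLOps is great, yet there is very little information on actually deploying it to a real service. I built an MLOps pipeline around this model and will share the trial and error along the way. The current pipeline covers preprocessing, training, and model upload; how to actually serve the model will be shared later.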

Overview

Pipeline

Prepare

Preprocess dockerfile
Inference dockerfile(cpu or cuda) sagemaker-pytorch-inference-toolkit
Python Script (train.py, evaluation.py, inference.py)

Sagemaker Notebook

Sagemaker_pipeline

Sagemaker Environment variables

Environment variables

Some tips (To understand the path)

in ProcessingStep

inputs (List type)

source(S3) -> destination(container instance)

outputs (List type)

source(container instance) -> destination(S3)

in TrainingStep

inputs (Dict type)

Only supports the train and test variables

source(Your sagemaker S3 URI, made at the processing step) -> destination(container instance)

Also, when saving the model, use trainer.save_model(os.environ["SM_MODEL_DIR"])

Then you can check that the model was saved to the Sagemaker S3 bucket
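Inside the training container these paths arrive as environment variables, so a training script can resolve them in one place. A minimal sketch: the helper name `sagemaker_paths` is hypothetical, and the fallback directories are the usual container defaults, included here as an assumption so the script also runs outside Sagemaker.

```python
import os


def sagemaker_paths(env=os.environ):
    """Resolve the container paths a training script reads from and writes to."""
    return {
        # Where the trained model must be written so it is uploaded to S3.
        "model_dir": env.get("SM_MODEL_DIR", "/opt/ml/model"),
        # One SM_CHANNEL_<NAME> variable exists per key of the TrainingStep inputs dict.
        "train_dir": env.get("SM_CHANNEL_TRAIN", "/opt/ml/input/data/train"),
        "test_dir": env.get("SM_CHANNEL_TEST", "/opt/ml/input/data/test"),
    }


print(sagemaker_paths())
```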

Custom script processing step

from sagemaker.processing import ProcessingInput, ProcessingOutput, ScriptProcessor
from sagemaker.workflow.steps import ProcessingStep

processing_output_destination = 'YOUR S3 PATH'

custom_processor = ScriptProcessor(
    image_uri='YOUR ECR URI',
    instance_type='ml.c5.2xlarge',
    instance_count=1,
    base_job_name=base_job_prefix + "/preprocessing",
    sagemaker_session=sagemaker_session,
    role=role,
    command = ["python3"])

step_process = ProcessingStep(
    name="ProcessDataForTraining",
    cache_config=cache_config,
    processor=custom_processor,
    job_arguments=["--bucket",'hg.sage',
                   "--file_name", 'train/train.csv'],
    inputs=[
        ProcessingInput(
            input_name="train.csv",
            source=f"{processing_output_destination}/train",
            destination="/opt/ml/processing/input/train"
        ),
        ProcessingInput(
            input_name="test.csv",
            source=f"{processing_output_destination}/test",
            destination="/opt/ml/processing/input/test"
        ),
    ],

    outputs=[
        ProcessingOutput(
            output_name="train",
            destination=f"{processing_output_destination}/train",
            source="/opt/ml/processing/train",
        ),
        ProcessingOutput(
            output_name="test",
            destination=f"{processing_output_destination}/test",
            source="/opt/ml/processing/test",
        ),
    ],
    code='YOUR SOURCE PATH/Preprocess.py',
)

Custom script inference step

To use custom inference, you must follow these rules.

Sagemaker inference expects model.tar.gz to have the following layout

model.tar.gz
├── /code
│   └── inference.py
├── tokenizer.json
├── tokenizer_config.json
├── config.json
├── vocab.txt
└── pytorch_model.bin
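An archive with this layout can be produced with the standard library. A minimal sketch, assuming the model files sit flat in one local directory; the helper name `pack_model` and its arguments are hypothetical.

```python
import os
import tarfile


def pack_model(model_dir, inference_script, output_path="model.tar.gz"):
    """Pack model files at the archive root and the script under code/."""
    with tarfile.open(output_path, "w:gz") as archive:
        for name in sorted(os.listdir(model_dir)):
            # arcname drops the local directory prefix so files land at the root.
            archive.add(os.path.join(model_dir, name), arcname=name)
        archive.add(inference_script, arcname="code/inference.py")
    return output_path
```

Write `output_path` somewhere outside `model_dir`, otherwise the half-written archive would be picked up by the directory listing.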

custom_inference_script

S3 Directory

S3
├── /code
│   └── inference.py
├── /train
│   ├── train.csv
│   ├── model.bin(optional)
│   ├── vocab.txt(optional)
│   └── config(optional)
└── /test
    └── test.csv

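Let's build the on-premise Kubeflow nobody tells you about (with cuda)

Tue, 26 Jul 2022 11:54:19 GMT

Once a company develops a model it has to go to production, and getting it actually deployed involves many steps, with monitoring and recovery required along the way.

That created the need for MLOps, and at the customer's request I ended up building MLOps with Sagemaker.

Building on Sagemaker without any real grasp of MLOps meant a lot of tedious, inconvenient work up front, and I wondered whether this was really any good. I would rather complain knowing the alternative than complain in ignorance, so I built on-premise MLOps, and the conclusion was that Sagemaker is the right call after all.

There were plenty of resources for building Kubeflow on-premise, but they were badly outdated and short on information about using GPU resources, so I am sharing this in the hope it helps someone.

Updates are made in the GitHub repository listed as the first reference: original GitHub

Set env

Prerequisites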

nvidia docker
docker default runtime

docker

(optional)gpu docker

# /etc/docker/daemon.json
{
  "default-runtime": "nvidia",
  "runtimes": {
      "nvidia": {
          "path": "nvidia-container-runtime",
          "runtimeArgs": []
   }
  }
}

minikube

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube

minikube start --driver=docker --disk-size=100g --kubernetes-version=1.21.0 --memory=4g --cpus=4
(optional)minikube config set profile test

(optional)minikube cuda

Kubeflow runs on top of Kubernetes, so the CUDA environment must be available in minikube before it can be used

sudo apt install conntrack
sudo apt install socat
minikube start --driver=none --kubernetes-version=1.21.0
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/master/nvidia-device-plugin.yml
kubectl get pod -A | grep nvidia
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"

vim gpu.yaml
# in gpu.yaml
# caution: cuda version
# Choose an image whose CUDA version matches your environment.
apiVersion: v1
kind: Pod
metadata:
  name: gpu
spec:
  containers:
  - name: gpu-container
    image: nvidia/cuda:11.4.2-runtime-ubuntu18.04
    command:
      - "/bin/sh"
      - "-c"
    args:
      - nvidia-smi && tail -f /dev/null
    resources:
      requests:
        nvidia.com/gpu: 1
      limits:
        nvidia.com/gpu: 1
#
kubectl create -f gpu.yaml
kubectl logs gpu

kubectl

curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client

kustomize

wget https://github.com/kubernetes-sigs/kustomize/releases/download/v3.2.0/kustomize_3.2.0_linux_amd64
chmod +x kustomize_3.2.0_linux_amd64
sudo mv kustomize_3.2.0_linux_amd64 /usr/local/bin/kustomize

manifests/common/user-namespace/base/params.env
manifests/common/dex/base/config-map.yaml
kubectl -n auth rollout restart deployment dex

kubeflow/manifests

git clone https://github.com/kubeflow/manifests.git
cd manifests
git checkout v1.4-branch  # switch to the desired Kubeflow version

# The command below is sensitive to apply order, so retry until it succeeds
while ! kustomize build example | kubectl apply -f -; do echo "Retrying to apply resources"; sleep 10; done
# For reasons I could not determine, pods often fail to come up when installed this way.
# Install the components one at a time, as in the reference below.
# Reference: https://mlops-for-all.github.io/docs/setup-components/install-components-kf/#cert-manager

(optional)Kubeflow local host

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

# Login
user@example.com
12341234

(optional)add user

# manifests/common/dex/base/config-map.yaml
# hash is the login password; it must be the Bcrypt-hashed value.
apiVersion: v1
kind: ConfigMap
metadata:
  name: dex
data:
  config.yaml: |
    issuer: http://dex.auth.svc.cluster.local:5556/dex
    storage:
      type: kubernetes
      config:
        inCluster: true
    web:
      http: 0.0.0.0:5556
    logger:
      level: "debug"
      format: text
    oauth2:
      skipApprovalScreen: true
    enablePasswordDB: true
    staticPasswords:
    - email: user@example.com
      hash: $2y$12$4K/VkmDd1q1Orb3xAt82zu8gk7Ad6ReFR4LCP9UeYE90NLiN9Df72
      # https://github.com/dexidp/dex/pull/1601/commits
      # FIXME: Use hashFromEnv instead
      username: user
      userID: "15841185641784"
    - email: dev7halo@gmail.com
      hash: $2a$12$pnfKk2PSRTyM8Wm3jrEkKuM339fgBqWcFPrHbcsEHGhzDmH/pm/Uy
      username: krkim
      userID: krkim
    staticClients:
    # https://github.com/dexidp/dex/pull/1664
    - idEnv: OIDC_CLIENT_ID
      redirectURIs: ["/login/oidc"]
      name: 'Dex Login Application'
      secretEnv: OIDC_CLIENT_SECRET

(optional)namespace setup

# manifests/common/user-namespace/base/params.env

user=dev7halo@gmail.com
profile-name=krkim

Using kfp

import kfp
import requests

USERNAME = "dev7halo@gmail.com"
PASSWORD = "    "
NAMESPACE = "krkim"
HOST = "http://10.23.13.113:31265" # istio-ingressgateway's external-ip created by the load balancer.

session = requests.Session()
response = session.get(HOST)

headers = {
    "Content-Type": "application/x-www-form-urlencoded",
}

data = {"login": USERNAME, "password": PASSWORD}
session.post(response.url, headers=headers, data=data)
session_cookie = session.cookies.get_dict()["authservice_session"]

client = kfp.Client(
    host=f"{HOST}/pipeline",
    namespace=f"{NAMESPACE}",
    cookies=f"authservice_session={session_cookie}",
)
print(client.list_pipelines())

Commands

# check Kubernetes pods
kubectl get pods -A
watch kubectl get pods -A
kubectl get po -A -w

# check pods by namespace
kubectl get po -n {namespace}

# recreate a pod
kubectl get pod {pod} -n {namespace} -o yaml | kubectl replace --force -f-

# delete a pod
kubectl delete pod {pod} -n {namespace}

# force delete
kubectl delete pod {pod} -n {namespace} --grace-period 0 --force

# delete a namespace
kubectl delete namespace {namespace to delete}

# minikube service list
minikube service list -n istio-system

Info

Tool	Version
Ubuntu	22.04
minikube	v1.26.0
kubernetes	v1.21.0
kustomize	v3.2.0
cuda	11.4

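Problems I ran into

My company email contains a '.', and a '.' in the yaml file caused problems that cost me a lot of time
Building on-premise without understanding the Kubernetes ecosystem led to problems using GPU resources
After building on-premise, writing a Kubeflow pipeline from a jupyter notebook on localhost hit a token problem (solved by passing localhost:port to the client)
Could not find CSRF cookie XSRF-TOKEN in the request (reference 3)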

Reference

Original GitHub

https://github.com/kubeflow/manifests

https://velog.io/@moey920/Minikube-Nvidia-GPU-%EC%82%AC%EC%9A%A9%ED%95%98%EA%B8%B0

https://otzslayer.github.io/kubeflow/2022/06/11/could-not-find-csrf-cookie-xsrf-token-in-the-request.html

MLOps for ALL

UnicodeDecodeError (Sagemaker, HuggingFace)

Wed, 25 May 2022 01:29:12 GMT

This error occurs when you load, in Sagemaker, a checkpoint that was created with huggingface (transformers) on another OS (windows, ubuntu).

Perhaps the Sagemaker file system is backed by S3, so it does not behave like a normal file system.

As a result, the checkpoint file cannot be used.

The problem is easy to solve.

When you create the model file, just use torch.save

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in load_state_dict(checkpoint_file)
    348     try:
--> 349         return torch.load(checkpoint_file, map_location="cpu")
    350     except Exception as e:

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/serialization.py in load(f, map_location, pickle_module, **pickle_load_args)
    586             orig_position = opened_file.tell()
--> 587             with _open_zipfile_reader(opened_file) as opened_zipfile:
    588                 if _is_torchscript_zip(opened_zipfile):

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/torch/serialization.py in __init__(self, name_or_buffer)
    241     def __init__(self, name_or_buffer) -> None:
--> 242         super(_open_zipfile_reader, self).__init__(torch._C.PyTorchFileReader(name_or_buffer))
    243 

RuntimeError: [enforce fail at inline_container.cc:145] . PytorchStreamReader failed reading zip archive: failed finding central directory

During handling of the above exception, another exception occurred:

UnicodeDecodeError                        Traceback (most recent call last)
~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in load_state_dict(checkpoint_file)
    352             with open(checkpoint_file) as f:
--> 353                 if f.read().startswith("version"):
    354                     raise OSError(

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/codecs.py in decode(self, input, final)
    320         data = self.buffer + input
--> 321         (result, consumed) = self._buffer_decode(data, self.errors, final)
    322         # keep undecoded input until the next call

UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 64: invalid start byte

During handling of the above exception, another exception occurred:

OSError                                   Traceback (most recent call last)
 in 
      1 # 사용할 모델과 토크나이저 로드
----> 2 model = AutoModelForSequenceClassification.from_pretrained('trained', num_labels=3)
      3 tokenizer = AutoTokenizer.from_pretrained('koelectra-base-v3-discriminator')

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/models/auto/auto_factory.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
    444         elif type(config) in cls._model_mapping.keys():
    445             model_class = _get_model_class(config, cls._model_mapping)
--> 446             return model_class.from_pretrained(pretrained_model_name_or_path, *model_args, config=config, **kwargs)
    447         raise ValueError(
    448             f"Unrecognized configuration class {config.__class__} for this kind of AutoModel: {cls.__name__}.\n"

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
   1795             if not is_sharded:
   1796                 # Time to load the checkpoint
-> 1797                 state_dict = load_state_dict(resolved_archive_file)
   1798             # set dtype to instantiate the model under:
   1799             # 1. If torch_dtype is not None, we use that dtype

~/anaconda3/envs/pytorch_latest_p36/lib/python3.6/site-packages/transformers/modeling_utils.py in load_state_dict(checkpoint_file)
    364         except (UnicodeDecodeError, ValueError):
    365             raise OSError(
--> 366                 f"Unable to load weights from pytorch checkpoint file for '{checkpoint_file}' "
    367                 f"at '{checkpoint_file}'. "
    368                 "If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True."

OSError: Unable to load weights from pytorch checkpoint file for 'trained/pytorch_model.bin' at 'trained/pytorch_model.bin'. If you tried to load a PyTorch model from a TF 2.0 checkpoint, please set from_tf=True.
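The first error in the chain says the zip central directory could not be found, which is what a zip reader reports when the end of an archive is missing. Before retraining anything, a quick check of the file itself can narrow things down. A diagnostic sketch under that reading of the traceback; the helper name `describe_checkpoint` is hypothetical and it does not load the weights.

```python
import zipfile


def describe_checkpoint(path):
    """Report whether a checkpoint file looks like a complete zip archive."""
    if zipfile.is_zipfile(path):
        return "complete zip archive"
    with open(path, "rb") as handle:
        head = handle.read(4)
    if head == b"PK\x03\x04":
        # Starts like a zip file, but the end-of-archive record is absent.
        return "zip header present but central directory missing"
    return "not a zip archive"
```

Calling it on `trained/pytorch_model.bin` in both the original environment and in Sagemaker would show whether the file changed on the way over.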

Mount S3 at Local Docker Container

Mon, 23 May 2022 12:51:15 GMT

# Dockerfile
FROM ubuntu:18.04 

ARG DEBIAN_FRONTEND=noninteractive
RUN apt-get update
RUN apt-get install -y wget curl fuse awscli

# Goofys for s3

RUN curl -O https://storage.googleapis.com/golang/go1.18.2.linux-amd64.tar.gz && \
    tar -xvf go1.18.2.linux-amd64.tar.gz && \
    mv go /usr/local && \
    ln -s /usr/local/go/bin/go /usr/bin/go && \
    wget http://bit.ly/goofys-latest -O /usr/local/bin/goofys && \
    chmod 755 /usr/local/bin/goofys

# Run
# docker run -it --privileged IMAGE_NAME 
# docker start CONTAINER_NAME 
# docker exec -it CONTAINER_NAME bash
# aws configure
# goofys -f S3_BUCKET YOURDIR

# aws configure
AWS Access Key ID : 
AWS Secret Access Key :
Default region name : 
Default output format :

Attaching S3 to Docker on AWS EC2 and mounting it in a local container work a little differently, which cost me a whole day.

I decided to attach it with goofys, said to be the fastest of the options.

Docker Hub

Let's bake a one-shot Korean NLP Docker image

Thu, 19 May 2022 06:15:43 GMT

1. Components

pytorch/pytorch, konlpy, mecab, transformers, sklearn, jupyter lab, matplotlib, kiwi

2. Docker pull

docker pull dev7halo/ko-nlp

3. Docker run

sudo docker run -it -p 8888:8888 --name nlp --gpus all -v /home/halo/NLP:/root/.jupyter/NLP 6960dfd63b94 jupyter lab

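-p 8888:8888 publishes the jupyter port, --name nlp names the container, --gpus all exposes every GPU, -v mounts /home/halo/NLP at /root/.jupyter/NLP, and jupyter lab is the command that runs.

Maintenance repo (Git), Dockerhub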

Let's install Mecab on Sagemaker

Tue, 10 May 2022 07:48:52 GMT

pip install konlpy

Run in the Sagemaker Terminal

wget https://bitbucket.org/eunjeon/mecab-ko/downloads/mecab-0.996-ko-0.9.2.tar.gz
tar xvfz mecab-0.996-ko-0.9.2.tar.gz
cd mecab-0.996-ko-0.9.2
./configure
make
make check
sudo make install
sudo ldconfig
mecab --version
---
mecab of 0.996/ko-0.9.2

wget https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.1.1-20180720.tar.gz
tar xvfz mecab-ko-dic-2.1.1-20180720.tar.gz
cd mecab-ko-dic-2.1.1-20180720
autoreconf
./configure
make
sudo make install

pip install mecab-python

Reference 1, Reference 2

Nvidia driver, cuda, cudnn

Fri, 06 May 2022 07:42:12 GMT

NVIDIA driver advanced search

# Check the installation
nvidia-smi

# Remove the driver
sudo apt-get --purge -y remove 'cuda*'
sudo apt-get --purge -y remove 'nvidia*'
sudo apt-get --purge -y remove "*nvidia*"
sudo apt-get autoremove --purge cuda
sudo rm -rf /usr/local/cuda*

sudo /usr/local/cuda-11.2/bin
sudo /usr/bin/nvidia-uninstall

drm error stackoverflow

CUDA Toolkit archive

# Check the installation
nvcc -V

# Set the CUDA path
# 1
vi /etc/profile

# Append at the bottom of the opened file
export PATH=$PATH:/usr/local/cuda-11.3/bin 
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-11.3/lib64 
export CUDADIR=/usr/local/cuda-11.3

# 2
gedit ~/.bashrc

export PATH=/usr/local/cuda-11.1/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/cuda-11.1/lib64:$LD_LIBRARY_PATH

source ~/.bashrc

cuDNN archive

# Install
tar xvzf cudnn-11.3-linux-x64-v8.2.1.32.tgz
sudo cp cuda/include/cudnn* /usr/local/cuda-11.3/include 
sudo cp cuda/lib64/libcudnn* /usr/local/cuda-11.3/lib64 
sudo chmod a+r /usr/local/cuda-11.3/include/cudnn.h /usr/local/cuda-11.3/lib64/libcudnn*

# Why the commands below are needed:
# copying the cuDNN files drops the symbolic links, so they have to be recreated
sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_adv_train.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_ops_train.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8 

sudo ln -sf /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn.so.8.2.1.32 /usr/local/cuda-11.3/targets/x86_64-linux/lib/libcudnn.so.8

# Check the installation 1
ldconfig -N -v $(sed 's/:/ /g' <<< $LD_LIBRARY_PATH) 2>/dev/null | grep libcudnn

# Check the installation 2
sudo ldconfig

# Check the installation 3
cat /usr/local/cuda/include/cudnn_version.h | grep CUDNN_MAJOR -A 2
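
The last check greps the `CUDNN_MAJOR` define and the two lines after it. As a rough sketch, the same three values could be pulled out in Python. The function name `parse_cudnn_version` and the sample header text are assumptions made for illustration; only the three `#define` names come from the header being grepped.

```python
import re

def parse_cudnn_version(header_text):
    """Extract (major, minor, patch) from the text of cudnn_version.h."""
    version = {}
    for name in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR 8"
        match = re.search(rf"^#define\s+{name}\s+(\d+)", header_text, flags=re.MULTILINE)
        if match is None:
            raise ValueError(f"{name} not found in header text")
        version[name] = int(match.group(1))
    return version["CUDNN_MAJOR"], version["CUDNN_MINOR"], version["CUDNN_PATCHLEVEL"]

# Hypothetical header excerpt, shaped like the cuDNN 8.2.1 archive installed above
sample = "#define CUDNN_MAJOR 8\n#define CUDNN_MINOR 2\n#define CUDNN_PATCHLEVEL 1\n"
print(parse_cudnn_version(sample))  # (8, 2, 1)
```

On a real machine the text would come from reading /usr/local/cuda/include/cudnn_version.h instead of the sample string.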

Let's implement a correlation loss / metric (with PyTorch)

Wed, 06 Apr 2022 03:48:55 GMT

Loss

import torch

def correlation_loss(y_pred, y_true):
    x = y_pred.clone()
    y = y_true.clone()
    vx = x - torch.mean(x)
    vy = y - torch.mean(y)
    cov = torch.sum(vx * vy)
    # Pearson correlation; 1e-12 avoids division by zero
    corr = cov / (torch.sqrt(torch.sum(vx ** 2)) * torch.sqrt(torch.sum(vy ** 2)) + 1e-12)
    # Clamp to [-1, 1] to absorb floating-point drift
    corr = torch.clamp(corr, min=-1.0, max=1.0)
    # 1 - corr^2 is 0 for perfect positive or negative correlation
    return 1 - corr ** 2

Metric

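def correlation_metric(y_pred, y_true):
    # Accept lists, NumPy arrays, or tensors
    x = torch.as_tensor(y_pred, dtype=torch.float32)
    y = torch.as_tensor(y_true, dtype=torch.float32)
    vx = x - torch.mean(x)
    vy = y - torch.mean(y)
    cov = torch.sum(vx * vy)
    # Pearson correlation; 1e-12 avoids division by zero
    corr = cov / (torch.sqrt(torch.sum(vx ** 2)) * torch.sqrt(torch.sum(vy ** 2)) + 1e-12)
    return corr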
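
To get a feel for what the loss rewards, here is a plain-Python sketch of the same formula with no PyTorch dependency. The helper names `pearson` and `loss` are made up for this illustration and the inputs are toy data. Since the loss is 1 minus the squared correlation, a perfectly anti-correlated prediction scores as well as a perfectly correlated one.

```python
import math

def pearson(xs, ys, eps=1e-12):
    """Plain-Python version of the correlation computed above."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    vx = [x - mean_x for x in xs]
    vy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(vx, vy))
    denom = math.sqrt(sum(a * a for a in vx)) * math.sqrt(sum(b * b for b in vy)) + eps
    return cov / denom

def loss(xs, ys):
    # Clamp to [-1, 1], then penalize 1 - corr^2
    corr = max(-1.0, min(1.0, pearson(xs, ys)))
    return 1.0 - corr ** 2

print(loss([1, 2, 3], [2, 4, 6]))          # perfectly correlated: close to 0
print(loss([1, 2, 3], [6, 4, 2]))          # perfectly anti-correlated: also close to 0
print(loss([1, 2, 3, 4], [1, -1, -1, 1]))  # zero covariance: 1.0
```

If the sign of the correlation matters for a task, the metric (which returns the signed value) is the one to watch.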

Let's analyze the list of people who didn't pay their taxes

Sun, 27 Mar 2022 04:29:37 GMT

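Using part or all of the disclosed list to defame a specific person may result in criminal punishment for defamation or civil liability for damages.

Disclosure of the list of high-value habitual tax delinquents

The National Tax Service publishes the list of high-value habitual tax delinquents on its website. However, using this information to damage a delinquent's reputation can get you into trouble, so be careful!

Some data collection and analysis I did because I couldn't sleep at night and was bored

Chapter 1. Crawling

Individual delinquents

The link above lists individual high-value delinquents, but at 20 entries per page across as many as 1658 pages, collecting the data required the help of tooling.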
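
As a quick sanity check on those numbers, the sketch below relates the page count to the 33150 rows that `df.info()` reports later in this post. The constant names are made up for illustration.

```python
ROWS_PER_PAGE = 20
TOTAL_ROWS = 33150  # row count reported by df.info() later in this post

# Split the total into full pages plus whatever is left over
full_pages, remainder = divmod(TOTAL_ROWS, ROWS_PER_PAGE)
total_pages = full_pages + (1 if remainder else 0)
rows_on_last_page = remainder if remainder else ROWS_PER_PAGE

print(total_pages, rows_on_last_page)  # 1658 10
```

So 1657 pages are full and page 1658 holds only 10 rows, which means a crawl loop that assumes 20 rows on every page would need to tolerate a shorter last page.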

I originally planned to crawl with BS4, but moving between pages was awkward, so I used Selenium and collected the data by driving click events.

df = pd.DataFrame(columns = ['No','공개년도','성명','연령',
                             '상호','직업(업종)','체납자 주소',
                             '총 체납액','세목','납기','체납건수','체납요지'])

The table has 12 columns in total, so I set up the DataFrame as above.

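from selenium.webdriver.common.by import By  # find_element_by_xpath was removed in Selenium 4.3

rows = []
for click in range(1, 1659):
    for i in range(1, 20+1):
        # Read the 12 cells of row i on the current page
        inputs = []
        for j in range(1, 12+1):
            inputs.append(driver.find_element(By.XPATH, f'//*[@id="wrap"]/div/div[2]/table/tbody/tr[{i}]/td[{j}]').text)
        rows.append(inputs)
    # Move on: the next-block button has a different XPath on page 10 than on later multiples of 10
    if click == 10:
        driver.find_element(By.XPATH, '//*[@id="wrap"]/div/div[2]/form/div/a[4]').click()
    elif click%10 == 0:
        driver.find_element(By.XPATH, '//*[@id="wrap"]/div/div[2]/form/div/a[5]').click()
    else:
        driver.find_element(By.XPATH, f'//*[@id="wrap"]/div/div[2]/form/div/div/a[{click%10}]').click()
# DataFrame.append was removed in pandas 2.0, so build the frame once from the collected rows
df = pd.DataFrame(rows, columns = df.columns.to_list())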

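The annoying part of the crawl was getting past page 10. Moving to the next page means assigning a click event to a button, but the click XPath used for pages 1-10 differs from the XPath of the button from page 20 onward. Fortunately only the first 10 pages were different, so the click events worked without much trouble.

Chapter 2. Analysis

There honestly wasn't much to the analysis..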
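
The branching can be pulled out into a small pure function, which makes the three cases easier to check without a browser. This is only a sketch that mirrors the conditions in the loop above. The name `next_click_xpath` is made up, and the XPaths are copied from the loop as-is without being re-verified against the site.

```python
BASE = '//*[@id="wrap"]/div/div[2]/form/div'

def next_click_xpath(page):
    """Return the XPath to click after scraping `page`, mirroring the crawl loop."""
    if page == 10:
        # First block of ten: the next-block button is the 4th anchor
        return f'{BASE}/a[4]'
    if page % 10 == 0:
        # Later blocks: the next-block button is the 5th anchor
        return f'{BASE}/a[5]'
    # Inside a block: click the numbered page link
    return f'{BASE}/div/a[{page % 10}]'

for page in (3, 10, 20, 27):
    print(page, next_click_xpath(page))
```

In the loop, the returned string would be passed to the driver's XPath lookup before calling `.click()`.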

df.dtypes
---
No          int64
공개년도        int64
성명         object
연령        float64
상호         object
직업(업종)     object
체납자 주소     object
총 체납액      object
세목         object
납기         object
체납건수      float64
체납요지       object
dtype: object

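Among the data types, total arrears (총 체납액) is a continuous variable, but values such as '1,000' contain a separator and were read as object, so I stripped the separators and converted the column to int.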

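# Assign the result back; inplace replace on a selected column is unreliable in recent pandas
df['총 체납액'] = df['총 체납액'].replace('[^0-9]','', regex=True)
df = df.astype({'총 체납액' : 'int'})
df.info()
---
<class 'pandas.core.frame.DataFrame'>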
RangeIndex: 33150 entries, 0 to 33149
Data columns (total 12 columns):
 #   Column  Non-Null Count  Dtype  
---  ------  --------------  -----  
 0   No      33150 non-null  int64  
 1   공개년도    33150 non-null  int64  
 2   성명      33150 non-null  object 
 3   연령      33099 non-null  float64
 4   상호      22791 non-null  object 
 5   직업(업종)  26200 non-null  object 
 6   체납자 주소  33150 non-null  object 
 7   총 체납액   33150 non-null  int64  
 8   세목      33150 non-null  object 
 9   납기      33150 non-null  object 
 10  체납건수    33145 non-null  float64
 11  체납요지    33150 non-null  object 
dtypes: float64(2), int64(3), object(7)
memory usage: 3.0+ MB
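
The regex used in the cleanup keeps digits only. A minimal standalone sketch of what it does to a single value, with a made-up helper name and sample strings that assume the raw values are comma-separated as in the '1,000' example:

```python
import re

def to_int_amount(text):
    """Drop every non-digit character (e.g. thousands separators) and parse as int."""
    return int(re.sub(r'[^0-9]', '', text))

print(to_int_amount('1,000'))    # 1000
print(to_int_amount('163,299'))  # 163299
```

Note that this pattern would also drop a minus sign or a decimal point, which is fine for arrears amounts but worth remembering before reusing it elsewhere.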

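Disclosure years start at 2004, and total arrears range from a minimum of 205 to a maximum of 163299, in units of one million won.. These are some serious delinquents ..

Looking at the summary above, the minimum age is 12, which made me curious, so I checked.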

df[df.연령 < 20]

There's inheritance tax, capital gains tax, and so on. Now let's check the missing values.

df.isna().sum()
---
No            0
공개년도          0
성명            0
연령           51
상호        10359
직업(업종)     6950
체납자 주소        0
총 체납액         0
세목            0
납기            0
체납건수          5
체납요지          0
dtype: int64

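The business name (상호) and job (직업(업종)) fields are often empty, but that shouldn't be a big problem (these are individual delinquents, after all).

Let's see which jobs/industries are most common.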

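# Count rows per job/industry, keeping NaN as its own key
job_counts = df['직업(업종)'].value_counts(dropna=False).to_dict()
sorted_dict = sorted(job_counts.items(), key = lambda item: item[1], reverse = True)
print(sorted_dict[:20])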
---
[(nan, 6950), ('건설업', 1737), ('제조업', 1611), 
('서비스', 1582), ('도소매', 1536), ('제조', 1220), 
('부동산', 1117), ('부동산업', 947), ('무직', 796), 
('건설', 723), ('도매', 720), ('음식', 637), 
('소매', 569), ('소매업', 450), ('서비스업', 369), 
('도소매업', 323), ('도매업', 306), ('도매 및 소매업', 304), 
('음식점업', 198), ('임대', 160)]

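I pulled the top 20: nan is first with 6950, and manufacturing, real estate, and wholesale/retail dominate the rest. I'm curious about the nan and 무직 (unemployed) rows, so let's look at about 20 of each.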

df[df['직업(업종)'].isna()]
df[df['직업(업종)'] == '무직'][:20]

Yep.. mostly income tax and inheritance tax. Seriously infuriating.

Now let's see which regions have the most delinquents living in them.

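# '지역' (region) is not one of the 12 crawled columns; presumably it was derived from '체납자 주소' beforehand
region_counts = {}
for i in df['지역']:
    region_counts[i] = region_counts.get(i, 0) + 1
sorted_dict = sorted(region_counts.items(), key = lambda item: item[1], reverse = True)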
---
[('경기', 6817),
 ('서울', 4653),
 ('경기도', 4309),
 ('서울특별시', 2359),
 ('경남', 1516),
 ('인천', 1417),
 ('충남', 1314),
 ('부산', 1033),
 ('경북', 948),
 ('인천광역시', 935),
 ('충북', 784),
 ('부산광역시', 736),
 ('전북', 577),
 ('전남', 574),
 ('강원', 560),
 ('대구광역시', 467),
 ('대구', 466),
 ('대전', 400),
 ('대전광역시', 320),
 ('경상남도', 306),
 ('강원도', 294),
 ('광주', 285),
 ('광주광역시', 267),
 ('울산', 259),
 ('울산광역시', 250),
 ('제주특별자치도', 215),
 ('경상북도', 200),
 ('충청남도', 198),
 ('제주', 188),
 ('전라북도', 121),
 ('전라남도', 115),
 ('충청북도', 110),
 ('세종', 80),
 ('세종특별자치시', 69),
 ('서울시', 5),
 ('제주도', 2),
 ('대구역시', 1)]

The labels are inconsistent (경기 vs 경기도, 서울 vs 서울특별시, and so on), but the result is still readable by eye: 경기 is first by a wide margin and 서울 is second.

I'm sleepy now, so I'm off to bed.
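
If the inconsistent labels needed to be merged rather than eyeballed, one option is a small alias table. The sketch below is not what the post did; the `ALIASES` mapping and the `merge_regions` helper are hypothetical, and only a subset of the counts printed above is used.

```python
from collections import Counter

# Subset of the raw counts printed above
raw_counts = {'경기': 6817, '서울': 4653, '경기도': 4309, '서울특별시': 2359, '서울시': 5}

# Hypothetical alias table: long or variant name -> short name
ALIASES = {'경기도': '경기', '서울특별시': '서울', '서울시': '서울'}

def merge_regions(counts, aliases):
    """Sum counts under the canonical name for each region."""
    merged = Counter()
    for name, count in counts.items():
        merged[aliases.get(name, name)] += count
    return merged

merged = merge_regions(raw_counts, ALIASES)
print(merged.most_common())  # [('경기', 11126), ('서울', 7017)]
```

With these two regions merged, the ordering matches the reading above: 경기 first, 서울 second.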
