standing-o.log

2022 Google Cloud Study Jam Completion Review | AI & ML Intermediate/Advanced and Kubernetes Beginner/Intermediate/Advanced Tracks

Fri, 09 Dec 2022 07:56:10 GMT

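This post summarizes what I learned and how I felt while taking part in the AI & ML Intermediate/Advanced and Kubernetes Beginner/Intermediate/Advanced tracks of the 2022 Google Cloud Study Jam.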

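What is the 2022 Google Cloud Study Jam?

The 2022 Google Cloud Study Jam is an online learning program that helps participants get started with Google Cloud Platform (GCP) using Qwiklabs (hands-on lab) coupons and Coursera courses provided by the organizers.
In 2022 it ran as six tracks in total: AI & ML Beginner/Intermediate/Advanced and Kubernetes Beginner/Intermediate/Advanced.
Participants who meet the completion requirements receive Cloud Study Jam finisher gifts.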

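Why I applied

After learning that real machine learning work is done on servers, I became interested in cloud services such as GCP and AWS. I had been studying the cloud, but I had never done hands-on labs or taken part in a related project, so I applied thinking it would be good experience!
I especially liked that the program provides credits for Qwiklabs labs.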

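The tracks I completed

1. AI & ML Intermediate track

The completion requirement was to finish two Qwiklabs quests; I took Google Cloud Essentials and Advanced ML: ML Infrastructure.
- I got to try Google Cloud's basic tools and services, and the explanations were simple and detailed enough to follow without any prior cloud knowledge. Most of the labs were done with Cloud Shell commands.
Recommended Qwiklabs Advanced courses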

➔ Google Developer Essentials, Advanced ML: ML Infrastructure, Build and Deploy Machine Learning Solutions on Vertex AI, Machine Learning APIs, Integrate with Machine Learning APIs, Data Science on Google Cloud: Machine Learning

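2. AI & ML Advanced track

The completion requirement was the same as for the intermediate track, so I took Intermediate ML: TensorFlow on Google Cloud and Baseline: Infrastructure.
- I studied how to build, train, and deploy machine learning models with TensorFlow on Google Cloud.
- I learned about a range of application services, including Cloud Storage and Cloud Functions.
Recommended Qwiklabs Advanced courses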

➔ Google Developer Essentials, Baseline: Data, ML, AI, Advanced ML: ML Infrastructure, Data Science on Google Cloud, ML Pipelines on Google Cloud, Google Cloud Solutions II: Data and Machine Learning

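3. Kubernetes Beginner track

The completion requirement is to finish the Kubernetes in the Google Cloud quest.
- Kubernetes is a container orchestration system. I learned how to deploy Kubernetes on Google Cloud.
- The concepts were new and unfamiliar to me, but the explanations were friendly and the completion requirement was simple, so I could take it without pressure :)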

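4. Kubernetes Intermediate track

The completion requirement is to finish the Coursera course Getting Started with Google Kubernetes Engine.
- I studied Google Cloud computing, containers, Google Kubernetes Engine (GKE), the Kubernetes architecture, and more.
- I was already comfortable with Coursera, and the course includes Qwiklabs labs, so it was easy to follow.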

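5. Kubernetes Advanced track

The completion requirement is to take three Coursera courses (Architecting with Google Kubernetes Engine: Foundations, Architecting with Google Kubernetes Engine: Workloads, and Architecting with Google Kubernetes Engine: Production) and to complete the Kubernetes Solutions Qwiklabs labs.
- The early Foundations lectures overlapped with the intermediate track, while the later ones covered more advanced topics such as role-based access control, GKE networking, persistent storage, and Secrets and ConfigMaps.
- It was harder than the other tracks and there were definitely more courses to take, but the content flowed well and was substantial, so I think this is the program I got the most out of!

The Qwiklabs badges and Coursera certificates I earned across the five tracks. Seeing them all together feels rewarding 😂😂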

Gifts

I took part in five tracks in total and received five gifts. 😁😁 Thank you!
I received two T-shirts with "Google Developer" printed on them, a bag, a pouch, and a cap :)

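Review

Above all, I think the biggest gain is that I became comfortable with the GCP environment through lots of hands-on practice. At first, whenever a lab threw an error I would get flustered and start over, again and again, but now I can fix errors myself. 😎😎
You can apply for the Study Jam as a group! I did it on my own, but forming a team and studying together would probably help even more.
News is posted regularly on the Google Study Jams in Korea Facebook page, so it is worth following.
I have nothing but good memories, so I plan to take part again next year if I get the chance!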

2022 Data Creator Camp Top Excellence Award Review

Thu, 08 Dec 2022 13:00:10 GMT

This post summarizes what I learned and how I felt while taking part in the 2022 Data Creator Camp.

Team DAMS (Data Analysis Math Statistics) won the Top Excellence Award 🎉🎉 Thank you 👍👍

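What is the Data Creator Camp?

The Data Creator Camp is a data analysis competition in which participants solve problems from real business settings with the help of data analysis training and mentoring.
It provides online pre-study material and runs a four-week hackathon as the preliminary round, during which teams can get tutoring from a mentor every week.
It is hosted by the Ministry of Science and ICT and the National Information Society Agency (NIA) and runs for about two months.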

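Why I applied

I had been entering competitions such as Kaggle and Dacon, where the goal is to raise a machine learning model's predictive performance. Apart from my leaderboard rank, I always wondered whether the models and methods I used were appropriate for the problem and whether my reasoning was sound, so the weekly mentoring offered by the Data Creator Camp was appealing.
Above all, I applied because I felt I would rarely get another chance to work on a team project for over a month with friends I get along with so well! 😀😀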

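How it works

1. Pre-study through the online training center

Online lectures are provided, covering everything from the general history of AI to machine learning and deep learning concepts.

2. Solving the preliminary-round problems

We were given image data in which illustrations and real photographs were mixed. The preliminary problems fell into three broad parts:
- Resolve data distribution issues through EDA.
- Remove real photographs from the training data (unsupervised learning).
- Classify the illustration images (supervised learning).
Our team met about twice a week: we split up the work at one meeting and shared our analysis results at the next.
We used a Notion page to share meeting notes and progress with teammates and to keep the presentation materials for mentoring.
- We tracked our progress and model training plan there.

The organizers provided a Colab Pro environment, which made modeling more comfortable.
Finally, we submitted our source code, slides, and final model weights.
- A week before the deadline, the organizers provided the test code used to score the final models. Our team was the first to find a bug in that test code, and we wrote up an error report and sent it to the organizers. 😂😂
- Fortunately it really was a bug, and we received corrected code. We had been nervous that we were the ones who were wrong, but looking back, it pushed us to run more varied experiments, so I think it helped our team grow :)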

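3. Making use of mentoring

Every Saturday we asked our assigned mentor about the questions that had come up while working on the missions.
We also kept a separate list of questions on our Notion page and asked from it :)

Thank you to our mentor, 조희승, for listening closely to our presentations and giving great feedback 😀😀
Besides the mentoring days, we could post one question per day on the Q&A board, which we put to good use.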

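4. Finals

A week after the preliminary round, we were notified that we had made the finals. 🎉🎉

In the last stretch we reviewed our slides and script and studied likely questions and the concepts behind our models.
- To prepare for the defense, we organized our notes around the rationale for each method and model we used and the specifics of how we trained them.

  ➔ (EX) How much did L2 regularization and dropout reduce overfitting? Why does the newly proposed model converge faster than the original VGG16?
- We used self-supervised learning to remove real photographs from the training data, so I summarized a survey paper on it, shared it with my teammates, and we studied it together.

The finals started at 1 p.m. at the National Information Society Agency's Smart Square. The presentation order was drawn on the day, and we presented seventh in the university division :)

The presentation slot was only five minutes, but luckily our presenter covered everything well. 😂😂 After the presentation the whole team goes up on stage to take questions from the judges; fortunately none of the questions fell outside what we had prepared for, so I think we answered them logically.
The award ceremony was held at around 5 p.m., and I was so nervous. Team DAMS won the Top Excellence Award! Thank you 🥳🥳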

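What I learned

First, I learned a lot from the mentoring. When I need to propose my own approach to a problem, I learned how to persuade others by backing it up with EDA and modeling experiment results.
We did not include them in the final presentation, but we actually tried a great many things during the camp, and I think I learned even more from those attempts.

(Ward, DBSCAN, K-means, CAE, segmentation, AnoGAN, EfficientNet, ConvNeXt, ResNet, DenseNet ...)
I somehow ended up taking charge of the self-supervised learning part. I learned a lot from it and found it interesting, so I am considering using it in other competitions when I have time.
Things I had only vaguely thought "wouldn't it be good to try this?" turned into real knowledge as I tried them one by one :)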

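Review

Thanks to the organizers' hard work, I was able to take part comfortably and without pressure for a couple of months. Above all, winning an award made it even more rewarding and fun! (I hope the camp runs again with a different task!)
Thank you to my teammates as well! It was a fun two months, and the teamwork was great 😀😀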

Google Kubernetes Workloads

Mon, 05 Sep 2022 12:26:19 GMT

This post covers Kubernetes Deployments, Pod networking, and volumes.
Keyword : Kubernetes, GKE, deployment, pod networking, volume

Kubernetes Workload

`kubectl` command

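kubectl: the utility administrators use to control a Kubernetes cluster
kubectl transforms your command-line entries into API calls
- kubectl turns what you type on the command line into API calls and sends them to the kube-apiserver of the selected Kubernetes cluster
Use kubectl to see a list of Pods in a cluster
- kubectl get pods: kubectl converts this command into an API call and sends it over HTTPS to the kube-apiserver on the cluster's control plane
- The kube-apiserver processes the request by querying etcd ➔ the kube-apiserver returns the results to kubectl over HTTPS ➔ kubectl interprets the API response and displays the results to the administrator at the command prompt
kubectl must be configured first
- Relies on a config file: $HOME/.kube/config
- Config file contains:
  - Target cluster name, credentials for the cluster
- Current config: kubectl config view
Connect to a Google Kubernetes Engine cluster
- kubectl config view: shows the configuration of the kubectl command itself
- In any other environment where the gcloud command-line tool and kubectl are installed, use the gcloud get-credentials command (gcloud container clusters get-credentials) to fetch the cluster credentials
The kubectl command syntax has several parts
- Command: the action to perform (for example get)
- Type: the kind of Kubernetes object the command acts on; together with the command, it tells kubectl what action to perform on what type of object
- Name: the specific object of the given type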
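As a rough illustration of the syntax parts listed above, the following Python sketch assembles a command line in the order `kubectl [command] [TYPE] [NAME] [flags]`. The helper `kubectl_argv` and the Pod name `my-test-app` are hypothetical and exist only for this example; nothing is executed against a real cluster.

```python
import shlex


def kubectl_argv(command, resource_type, name=None, flags=()):
    """Assemble `kubectl [command] [TYPE] [NAME] [flags]` as an argv list."""
    argv = ["kubectl", command, resource_type]
    if name is not None:
        argv.append(name)      # NAME is optional: omit it to act on every object of TYPE
    argv.extend(flags)         # flags such as `-o yaml` go last
    return argv


list_pods = kubectl_argv("get", "pods")
one_pod = kubectl_argv("get", "pod", "my-test-app", flags=["-o", "yaml"])

print(shlex.join(list_pods))
print(shlex.join(one_pod))
```

Building the command as a list keeps each part separate, which is also the form `subprocess.run` would expect if the command were actually run.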

Deployments

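Deployments declare the state of Pods
- Each time the Pod specification is updated, a new ReplicaSet matching the changed Deployment version is created
  ➔ This is how a Deployment rolls out updated Pods in a controlled manner ➔ old Pods are removed from the old ReplicaSet and replaced by new Pods in the new ReplicaSet
- Deployments are designed for stateless applications
  ➔ Stateless app: does not store data or application state in the cluster or in persistent storage
Deployment is a two-part process
- The desired state is described in a Deployment YAML file that contains the characteristics of the Pods, along with how to run them operationally and handle their lifecycle events
  ➔ When this file is submitted to the Kubernetes control plane, a Deployment controller is created; it is responsible for turning the desired state into reality and keeping it that way
- Deployment: a higher-level controller for Pods that declares their state
Deployment has three different lifecycle states
- Progressing state, complete state, failed state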
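The hand-off between the old and the new ReplicaSet can be pictured with a toy simulation. This is only a sketch: it assumes new Pods become ready instantly, and the parameters `max_surge` and `max_unavailable` are modeled loosely on the rolling-update settings of a Deployment rather than taken from this post.

```python
def rolling_update(replicas, max_surge=1, max_unavailable=0):
    """Yield (old_pods, new_pods) after each step of a toy rolling update."""
    if max_surge == 0 and max_unavailable == 0:
        raise ValueError("max_surge and max_unavailable cannot both be zero")
    old, new = replicas, 0
    yield old, new
    while old > 0 or new < replicas:
        # Scale the new ReplicaSet up without exceeding replicas + max_surge.
        room = replicas + max_surge - (old + new)
        new += min(room, replicas - new)
        # Scale the old ReplicaSet down while keeping enough Pods running.
        removable = (old + new) - (replicas - max_unavailable)
        old -= min(old, max(removable, 0))
        yield old, new


for old, new in rolling_update(3):
    print(f"old ReplicaSet: {old} Pods, new ReplicaSet: {new} Pods")
```

With three replicas and the default arguments, the old ReplicaSet shrinks by one Pod per step while the new one grows by one, so the total never drops below three.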

Pod networking

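Pod: a group of containers with shared storage and networking
- Based on Kubernetes' "IP-per-Pod" model
  ➔ Each Pod is assigned a single IP address, and the containers in a Pod share the same network namespace, including that IP address
Your workload doesn't run in a single Pod
- Workload: made up of many different applications that need to talk to each other
- Each Pod has a unique IP address, and on a node the Pods are connected to each other through the node's root network namespace ➔ Pods on that VM can find and reach each other
- The root network namespace is connected to the node's primary NIC ➔ using the node's VM NIC, the root network namespace forwards traffic out of that node
  ➔ This means the Pods' IP addresses must be routable on the network the node is connected to
Nodes get Pod IP addresses from address ranges assigned to your Virtual Private Cloud
- In GKE, nodes get Pod IP addresses from address ranges assigned to the Virtual Private Cloud (VPC)
- VPC: a logically isolated network that provides connectivity for the resources you deploy within GCP
  ➔ These resources include Kubernetes clusters, Compute Engine instances, and App Engine flexible instances
Addressing the Pods
- GKE cluster node: a compute instance that GKE customizes and manages ➔ it is assigned an IP address from the VPC subnet the machine resides in
- VPC-native GKE clusters also create a separate alias IP range for Pods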
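To make the idea of a separate Pod address range concrete, here is a small sketch using Python's standard `ipaddress` module. The range `10.4.0.0/14`, the node names, and the choice of one /24 block per node are all assumptions made for illustration; the post does not state how large each node's slice is.

```python
import ipaddress
from itertools import islice

pod_range = ipaddress.ip_network("10.4.0.0/14")   # hypothetical alias IP range for Pods
node_names = ["node-a", "node-b", "node-c"]      # hypothetical nodes

# Assumption: each node receives its own /24 slice of the Pod range.
node_blocks = dict(zip(node_names, pod_range.subnets(new_prefix=24)))


def pod_ip(node, index):
    """Return the index-th usable Pod address inside a node's block."""
    return next(islice(node_blocks[node].hosts(), index, None))


for node in node_names:
    print(node, node_blocks[node], "first Pod IP:", pod_ip(node, 0))
```

Because every node draws from a disjoint block, each Pod ends up with an address that is unique across the cluster, which matches the IP-per-Pod model.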

Volumes

Kubernetes offers storage abstraction options

Volumes
- Volumes are the method by which you attach storage to a pod
- Some volumes are ephemeral
- Some volumes are persistent
Persistent storage options
- Are block storage, or networked file systems
- Provide durable storage outside a pod
- Are independent of the pod's lifecycle
- May exist before pod creation and be claimed

References

[1] Getting Started with Google Kubernetes Engine, Coursera

Google Kubernetes Architecture

Mon, 05 Sep 2022 12:25:56 GMT

This post covers the architecture of GKE and object management.
Keyword : Kubernetes, GKE, object management

Kubernetes architecture

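Kubernetes objects: persistent entities representing the state of the cluster
- Object spec: for each object you want to create, you give Kubernetes an object spec describing its desired state
- Object status: the current state of the object, as described by Kubernetes
Containers in a Pod share resources
- Pod: the basic building block of the standard Kubernetes model; every container running in a Kubernetes system runs inside a Pod
  ➔ A Pod embodies the environment where containers live, and that environment can hold one or more containers
  ➔ Kubernetes assigns each Pod a unique IP address; all containers in a Pod share the network namespace
Desired state compared to current state
- Suppose you have given Kubernetes a desired state: you get work done by telling Kubernetes to create and maintain one or more objects that represent that state
  ➔ Kubernetes compares the desired state with the current state ➔ the Kubernetes control plane continuously monitors the cluster state and corrects it as needed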
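The comparison of desired and current state can be sketched as a single reconciliation step. This is a simplified model, not how the Kubernetes controllers are implemented: the function `reconcile` and the representation of state as a mapping from object name to replica count are assumptions made for the example.

```python
def reconcile(desired, current):
    """Return the actions needed to move `current` toward `desired`.

    Both arguments map an object name to a replica count (a hypothetical model).
    """
    actions = []
    for name, want in desired.items():
        have = current.get(name, 0)
        if have < want:
            actions.append(("start", name, want - have))
        elif have > want:
            actions.append(("stop", name, have - want))
    # Anything running that is no longer declared should be removed.
    for name, have in current.items():
        if name not in desired and have > 0:
            actions.append(("stop", name, have))
    return actions


desired = {"nginx": 3}
current = {"nginx": 1, "old-job": 2}
print(reconcile(desired, current))
```

A control loop would call such a step repeatedly, so a failure that changes the current state is corrected on the next pass.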

Cooperating processes make a kubernetes cluster work

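Control plane: one computer ➔ coordinates the entire cluster
Nodes: the other computers ➔ run Pods
kube-apiserver: the single component you interact with directly ➔ it accepts commands that view or change the state of the cluster
➔ kubectl: connects to and communicates with the kube-apiserver
➔ etcd: the cluster's database ➔ reliably stores the state of the cluster
➔ kube-scheduler: schedules Pods onto nodes
➔ kube-controller-manager: continuously monitors the cluster state through the kube-apiserver
➔ kube-cloud-manager (cloud-controller-manager in upstream Kubernetes): manages the controllers that interact with the underlying cloud provider
Each node also runs a small family of control plane components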

Google kubernetes engine

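GKE manages all the control plane components
- GKE manages all the control plane components on your behalf
- In any Kubernetes environment, nodes are created externally by the cluster administrator, not by Kubernetes itself
Use node pools to manage different kinds of nodes
- Node pools are a GKE feature
  ➔ Automatic node upgrades, automatic node repair, and cluster autoscaling can be enabled at the node pool level
Zonal versus regional clusters
- Adding more nodes and deploying multiple replicas of an app improves its availability, but only up to a point
- What if an entire compute zone goes down?
  ➔ Use a GKE regional cluster ➔ the app's availability is maintained across multiple zones within a single region
- The purpose of configuring a regional cluster in GKE: to make sure the apps running in the cluster can survive the loss of a zone
A regional or zonal GKE cluster can also be set up as a private cluster
- Google Cloud products can still access the cluster control plane
- Authorized networks are IP address ranges that are trusted to access the control plane
- Nodes can have limited outbound access through Private Google Access ➔ they can communicate with other Google Cloud services
GKE cluster vs GKE
- In a GKE cluster, nodes are provisioned as Compute Engine virtual machines
- In GKE, the master is provisioned as an abstract part of the GKE service that is not exposed to Google Cloud customers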

Object management

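Every Kubernetes object is identified by a unique name and a unique identifier
Objects are defined in a YAML file
- The objects Kubernetes should create and maintain are defined in manifest files
- apiVersion: the Kubernetes API version used to create the object
- kind: the type of object you want (for example, Pod)
- metadata: identifies the object by name, unique ID, and namespace
- spec: in a Pod manifest, the field that defines the Pod's container images
- Keep the YAML files in a version-controlled repository
  ➔ GCP customers can use Cloud Source Repositories ➔ permissions on those files can be controlled the same way as for other GCP resources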
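The four top-level manifest fields can be seen together in a minimal Pod manifest. The sketch below builds it as a Python dictionary and prints it as JSON, which kubectl accepts alongside YAML; the Pod name `nginx`, the label, and the image tag are illustrative assumptions.

```python
import json

pod_manifest = {
    "apiVersion": "v1",                      # Kubernetes API version used to create the object
    "kind": "Pod",                           # the type of object
    "metadata": {                            # identifies the object
        "name": "nginx",
        "labels": {"app": "nginx"},
    },
    "spec": {                                # desired state: which containers to run
        "containers": [
            {"name": "nginx", "image": "nginx:latest"},
        ],
    },
}

print(json.dumps(pod_manifest, indent=2))
```

Writing the output to a file and committing it to a repository would give the version-controlled manifest the notes describe.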
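Object names
- When you create a Kubernetes object, you give it a name as a unique string
- Every object created during the lifetime of a cluster is assigned a unique ID (UID) generated by Kubernetes
- Labels are key-value pairs used to tag objects during or after creation
One way to create three nginx web servers
- Declare three Pod objects ➔ each Pod gets its own section of YAML
- Kubernetes' default scheduling algorithm prefers to spread the workload evenly across the available nodes
Allocating resource quotas
- Multiple projects run on a single cluster ➔ How can I allocate resource quotas?
- Kubernetes lets you abstract a single physical cluster into multiple virtual clusters called namespaces
- Namespaces let you apply resource quotas across the cluster ➔ they define limits on resource consumption
Deployment object: ensures that a defined set of Pods is running at any given time
Namespaces
- default namespace: holds objects for which no other namespace is defined
- kube-system namespace: holds objects created by the Kubernetes system itself
- kube-public namespace: holds objects that are public and readable by all users
  ➔ To place a resource in a namespace when creating it, either use the command-line namespace flag or specify the namespace in the resource's YAML file
  ➔ Namespaces let you implement resource quotas across the cluster and reuse the same object names in different namespaces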
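The two namespace properties noted above, per-namespace quotas and name reuse across namespaces, can be modeled in a few lines. `ToyCluster` is a hypothetical stand-in written for this example; it only counts Pods and is not the Kubernetes ResourceQuota API.

```python
class ToyCluster:
    """A toy model: objects are keyed by (namespace, name), with a Pod quota per namespace."""

    def __init__(self, pod_quota):
        self.pod_quota = dict(pod_quota)   # namespace -> maximum number of Pods
        self.pods = {}                     # (namespace, name) -> image

    def create_pod(self, namespace, name, image):
        if (namespace, name) in self.pods:
            raise ValueError(f"{name} already exists in {namespace}")
        used = sum(1 for ns, _ in self.pods if ns == namespace)
        if used >= self.pod_quota.get(namespace, float("inf")):
            raise RuntimeError(f"pod quota exceeded in {namespace}")
        self.pods[(namespace, name)] = image


cluster = ToyCluster({"test": 1, "production": 10})
cluster.create_pod("test", "web", "nginx")
cluster.create_pod("production", "web", "nginx")   # same name, different namespace: allowed

try:
    cluster.create_pod("test", "web-2", "nginx")   # exceeds the quota of the test namespace
except RuntimeError as err:
    print(err)
```

Capping only the test namespace leaves production free to grow, which is the same idea as setting quotas on test and staging to protect production workloads.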
What Services are for
- They provide a load-balanced network endpoint for Pods and let you choose how Pods are exposed

Kubernetes architecture

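You are designing an app and want the containers to sit as close together as possible to minimize latency ➔ place the containers in the same Pod
(EX) You deploy a new Kubernetes Engine regional cluster with four machines in the default pool for the first zone and leave the number of zones at the default.
➔ Number of Compute Engine machines deployed and billed against your account: 12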
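The count of 12 follows from simple multiplication, sketched below. It assumes that a regional cluster spans three zones by default and that the node pool is replicated in each zone; the post gives only the result, so the zone count is stated here as an assumption.

```python
nodes_per_zone = 4   # machines in the default pool of the first zone
default_zones = 3    # assumption: a regional cluster spans three zones by default

# The node pool is replicated in every zone of the regional cluster.
total_machines = nodes_per_zone * default_zones
print(total_machines)
```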
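You need to make sure the production apps running in a Kubernetes cluster are not affected by test and staging deployments ➔ How do you prioritize resources for the production apps?: configure namespaces for test, staging, and production, and configure Kubernetes resource quotas specific to the test and staging namespaces
When configuring storage for a stateful app, how do you provide file system storage inside the containers so that the app's data is not lost even if a Pod fails or is deleted?: create volumes backed by network-based storage to give Pods durable remote storage, and specify them in the Pod
DaemonSet: used to handle a new logging and auditing utility that must be deployed on every node in the cluster
You want to deploy multiple copies of an app and load-balance traffic across them. How do you deploy this app's Pods to the cluster's production namespace?: create a Deployment manifest that specifies the number of replicas to run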

Summary

Kubernetes controllers keep the cluster state matching the desired state.
Kubernetes consists of a family of control plane components, running on the control plane and the nodes.
GKE abstracts away the control plane.
Declare the state you want using manifest files.

References

[1] Getting Started with Google Kubernetes Engine, Coursera

What is Google Kubernetes Engine (GKE)?

Mon, 05 Sep 2022 12:25:34 GMT

This post covers Google Kubernetes Engine, Cloud Functions, and Cloud Run.
Keyword : Kubernetes, GKE, cloud function, cloud run

Google Kubernetes Engine

Kubernetes

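Kubernetes exists to manage container infrastructure more effectively
Kubernetes: orchestrates and manages container infrastructure on-premises or in the cloud
- A container-centric management environment
- Automation: automates deployment, scaling, load balancing, logging, monitoring, and other management of containerized apps ➔ like platform as a service
- Like infrastructure as a service: supports a wide range of user preferences and configuration flexibility
- Declarative configuration: the infrastructure is managed declaratively ➔ the desired system state is always documented
  ➔ With Kubernetes you describe the desired state, and Kubernetes brings the deployed system into that state and keeps it there even when failures occur
- Imperative configuration: you issue commands to change the system state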

Kubernetes features

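Supports both stateful and stateless applications: including apps that need to store data persistently
Autoscaling: containerized apps can be scaled out and in automatically based on resource utilization
Resource limits: controlling resources improves overall workload performance within the cluster
Extensibility: can be extended with plugins and add-ons
Portability: workloads can move between on-premises, GCP, and other cloud providers (no vendor lock-in)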
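Utilization-based autoscaling can be illustrated with the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler. The post itself does not mention this formula, so treat it as background for the autoscaling feature; the function name and the sample numbers are assumptions.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization):
    """Proportional scaling: ceil(current_replicas * current / target)."""
    return math.ceil(current_replicas * current_utilization / target_utilization)


print(desired_replicas(4, 90, 60))   # load above target: scale out
print(desired_replicas(4, 30, 60))   # load below target: scale in
```

Rounding up means the app errs on the side of having slightly more capacity than the target strictly requires.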

Google Kubernetes engine (GKE)

Kubernetes is powerful, but managing the infrastructure is a full-time job
➔ GKE is the Google Cloud feature that helps here
With GKE you can deploy, manage, and scale Kubernetes environments for containerized apps on GCP
GKE lets you deploy workloads easily

GKE features

Fully managed: no need to provision the underlying resources
Container-optimized OS: this Google-managed operating system is optimized for scaling quickly
Auto upgrade
- Automatic upgrades keep the cluster on the latest version of Kubernetes
- Cluster: an instantiated Kubernetes system
Auto repair
- The service automatically repairs unhealthy nodes
- Node: a virtual machine that hosts containers inside a GKE cluster
Cluster scaling
Seamless integration
- GKE integrates seamlessly with Google's Cloud Build and Container Registry
Identity and access management
Integrated logging and monitoring
- Stackdriver: monitors and manages Google Cloud system services, containers, apps, and infrastructure
  ➔ GKE integrates with Stackdriver Monitoring to give insight into app performance
Integrated networking
- GKE integrates with Virtual Private Cloud (VPC) and uses GCP's networking features
Cloud console
- Provides information about GKE clusters and their resources, and lets you view, inspect, and delete resources in the cluster

Computing options

Fully customizable virtual machines
Persistent disks and optional local SSDs
Global load balancing and autoscaling
Per-second billing

Compute engine use cases

Complete control over the OS and virtual hardware
Well suited for lift-and-shift migrations to the cloud
Most flexible compute solution, often used when a managed solution is too restrictive

App engine

Provides a fully managed, code-first platform.
Streamlines application deployment and scalability.
Provides support for popular programming languages and application runtimes.
Supports integrated monitoring, logging, and diagnostics.
Simplifies version control, canary testing, and rollbacks.
Use cases : websites, mobile app and gaming backends, RESTful APIs

Google kubernetes engine

Fully managed kubernetes platform
Supports cluster scaling, persistent disks, automated upgrades, and auto node repairs
Built-in integration with Google cloud services
Portability across multiple environments
- Hybrid computing, multi-cloud computing
Use cases : Containerized applications, cloud-native distributed systems, hybrid applications

Cloud run

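A managed compute platform that runs stateless containers via web requests or Cloud Pub/Sub events
Serverless: all infrastructure management is abstracted away ➔ you can focus on developing the app
Abstract away infrastructure management ➔ run request- or event-driven stateless workloads without worrying about servers
Automatically scales up and down
Open API and runtime environment ➔ you can choose to deploy stateless containers, with a consistent developer experience, to a fully managed environment or to your own GKE cluster
Use cases :
- Deploy stateless containers that listen for requests or events delivered via HTTP requests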
- Build applications in any language using any frameworks and tools.

Cloud functions

Event-driven, serverless compute service ➔ just upload code written in JavaScript, Python, or Go, and GCP automatically deploys the right amount of computing capacity to run it
Automatic scaling with highly available and fault-tolerant design
Charges apply only when your code runs
Triggered based on events in google cloud services, HTTP endpoints, and Firebase
Use cases :
- Supporting microservice architecture
- Serverless application backends : mobile and IoT backends, integrate with third-party services and APIs
- Intelligent applications : virtual assistant and chat bots, video and image analysis

Summary

Create a container using cloud build
Store a container in container registry
Compare and contrast kubernetes and google kubernetes engine features.
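When choosing a technology for deploying an app and you need a standalone, lightweight package that is resource-efficient and highly portable ➔ containers
Compute Engine: fully customizable virtual machines with per-second billing ➔ fine-grained control over cost
(EX) You are deploying a containerized app and want as much control as possible over how the containers are configured and deployed, but you also want to avoid the operational overhead of managing the whole container cluster environment yourself ➔ Google Kubernetes Engine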

References

[1] Getting Started with Google Kubernetes Engine, Coursera

Containers and Container Images

Mon, 05 Sep 2022 12:25:09 GMT

This post introduces containers and container images.
Keyword : Container, container image

Container and Container image

Containers

Dedicated server (Application code, dependencies, kernel, hardware) ➔ Virtual machine (Application code, dependencies, kernel, hardware + hypervisor)
- Deploying applications directly on physical computers
  ➔ wastes resources, and large-scale deployment and maintenance take a lot of time ➔ virtualization is needed
- Hypervisors create and manage virtual machines
Running multiple apps on a single VM
➔ Apps that share dependencies are not isolated from each other

The VM-centric way to solve this problem

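Run a dedicated virtual machine for each app
- Each app maintains its own dependencies ➔ the kernels are isolated, so apps do not affect each other's performance
- But for large systems, dedicated VMs are redundant and wasteful, and VMs are slow to start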

User space abstraction and containers

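Solving the dependency problem: implement the abstraction at the level of the app and its dependencies
- Virtualize only the user space, not the entire machine
- User space: all the code that sits above the kernel, including the apps and their dependencies
- Lightweight because it does not carry a full operating system ➔ can be created and shut down quickly
- Efficient because it is scheduled and packaged on top of the underlying system
  ➔ this is what it means to create containers
Container (Application code, dependencies): an isolated user space for running application code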

Why developers like containers

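Containers are an app-centric way to deliver highly scalable, high-performance apps, and they let developers work from known assumptions about the underlying hardware and software
Apps are easy to build ➔ from loosely coupled, fine-grained components (modular design)
Developers need a way to isolate apps' dependencies from each other, and packaging each app in a virtual machine is wasteful
Without containers, it is hard to troubleshoot apps that work on a developer's laptop but fail in production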

Container image

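Image: an app together with its dependencies
➔ Container: a running instance of an image
➔ Building software into container images lets developers package and ship apps easily
➔ Software is needed to build and run container images (Docker)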

Containers use a varied set of Linux technologies

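Processes: each Linux process has its own virtual memory address space, separate from all others, and processes can be created and destroyed quickly
Linux namespaces: control what an app can see, such as process ID numbers, directory trees, and IP addresses (≠ Kubernetes namespaces)
cgroups: control the maximum CPU time, memory, I/O bandwidth, and other resources an app can consume
Union file systems: encapsulate an app and its dependencies into a clean, minimal set of layers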

Containers are structured in layers

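An image is built from a container manifest file
For Docker-format container images ➔ the Dockerfile (specifies the layers inside the container image)
- FROM ubuntu:18.04 : the FROM statement creates the base layer by pulling it from a public repository
- COPY . /app : the COPY command adds a layer containing files copied from the build tool's current directory
- RUN make /app : the RUN command builds the app with make and puts the build result into a third layer
- CMD python /app/app.py : the last layer specifies the command to run inside the container when it is launched
These days, building the app inside the same container that you ship and run is not recommended ➔ use a multi-stage build process to package the app
When a new container is created from an image, the container runtime adds a writable layer on top of the underlying layers (the container layer)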

Containers promote smaller shared images

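Multiple containers share access to the same underlying image while each keeps its own data state
When you run a container, the container runtime pulls the layers it needs ➔ on an update, only the layers that differ are copied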
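The layering and sharing described above can be mimicked with Python's `collections.ChainMap`, which looks keys up through a stack of mappings and writes only to the first one. This is an analogy rather than how a container runtime works internally; the file paths and contents are made up for the example.

```python
from collections import ChainMap

# Read-only image layers, bottom to top (contents are illustrative).
image_layers = [
    {"/bin/sh": "ubuntu shell"},        # FROM ubuntu:18.04
    {"/app/app.py": "source"},          # COPY . /app
    {"/app/build.out": "artifact"},     # RUN make /app
]


def new_container(layers):
    """Stack a fresh writable layer on top of the shared image layers."""
    writable = {}
    # ChainMap searches left to right, so the writable layer shadows the image.
    return ChainMap(writable, *reversed(layers))


a = new_container(image_layers)
b = new_container(image_layers)
a["/app/app.py"] = "patched in container a"   # lands in a's writable layer only

print(a["/app/app.py"])
print(b["/app/app.py"])
```

Container `a` sees its own change, while container `b` and the image layers stay untouched, which is why many containers can share one base image.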

How can you get containers?

Download containerized software from a container registry such as gcr.io.
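- It hosts many public open-source images, and Google Cloud customers also use it to store private images in a way that integrates well with Cloud IAM
Docker: build your own container using the open-source docker command
- It must be built on a computer you trust
Build your own container using Cloud Build
- It can retrieve the source code for a build from a variety of storage locations (Cloud Storage, Git repositories, Cloud Source Repositories)
- Build steps can be configured to fetch dependencies ➔ compile source code ➔ run integration tests (each step runs in a Docker container)
- Built images can be delivered to various execution environments (GKE, App Engine, Cloud Functions)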

References

[1] Getting Started with Google Kubernetes Engine, Coursera

Cloud Computing and Google Cloud

Mon, 05 Sep 2022 12:23:54 GMT

This post covers Compute Engine, resources, and GCP billing.
Keyword : Google cloud, compute engine, resource, billing

Cloud Computing and Google Cloud

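Computing resources are provided as on-demand self-service
➔ You get the processing power, storage, and network you need without human intervention
Broad network access
Resource pooling ➔ customers do not need to know the physical location of the resources
Rapid elasticity
Measured service ➔ pay only for what you use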

Google cloud offers a range of services:

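Compute Engine: Google Cloud's infrastructure-as-a-service solution, which lets you run on-demand virtual machines in the cloud
Google Kubernetes Engine (GKE): runs containerized applications in a cloud environment that Google manages, while leaving administrative control with you
➔ Containerization: a way of packaging code to maximize portability and use resources efficiently
App Engine: GCP's managed platform-as-a-service framework; run code in the cloud without worrying about infrastructure
Cloud Functions: functions as a service; runs code in response to events, no matter how often they occur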

Build your own database solution

Compute engine, Google Kubernetes engine (GKE)

Use a managed service

Storage : Cloud bigtable, cloud storage, cloud SQL, cloud spanner, datastore

Google cloud offers a range of services

Big data : Bigquery, pub/sub, dataflow, dataproc, AI platform notebooks
Machine learning : Vision API, AI platform, speech-to-text API, cloud translation API, cloud natural language API

Resource management

Google Cloud provides resources through multi-regions, regions, and zones
- Multi-region: Americas, Europe, Asia-Pacific ➔ divided into regions
- Region: an independent geographic area within the same continent; network connectivity inside a region is fast
  ➔ divided into zones
- Zone: a deployment area for GCP resources within a focused geographic area
When an internet user sends traffic to a Google resource, Google responds from the edge network location with the lowest latency

Zonal resources operate exclusively in a single zone

GCP services and resources let you specify the geographic location of a resource
Zonal resources run within a single zone ➔ if that zone becomes unavailable, so do the resources
- Compute Engine virtual machine instances, persistent disks, GKE nodes

Regional resources span multiple zones

Run across multiple zones within one region ➔ can be deployed redundantly

Global resources

Can be managed across multiple regions
- HTTP(S) load balancers, Virtual Private Cloud (VPC) networks

Resources sit in projects, Resources have hierarchy

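Every GCP resource you use must belong to a project, wherever it is located
- Project: organizes items logically, is identified by a unique ID and number, and can be grouped into folders
The GCP resource hierarchy helps you manage resources across the departments and teams of an organization
Cloud IAM (Cloud Identity and Access Management): provides fine-grained access control over every GCP resource you use
A policy applied at the level you choose is inherited by the levels below it (organization ➔ folder ➔ project ➔ resource).
Billing accumulates at the project level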
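Policy inheritance down the hierarchy can be sketched by walking from a resource up to the organization and collecting every binding on the way. The hierarchy, the member names, and the roles below are all hypothetical, and the model covers only the simple case in which inherited permissions add up.

```python
# Hypothetical hierarchy: each node points to its parent (None marks the top).
PARENT = {
    "vm-1": "project-a",
    "project-a": "folder-team",
    "folder-team": "org",
    "org": None,
}

# Hypothetical policies set directly on a node, as (member, role) bindings.
POLICIES = {
    "org": {("alice", "viewer")},
    "project-a": {("bob", "editor")},
}


def effective_policy(node):
    """Union of the bindings set on the node and on all of its ancestors."""
    bindings = set()
    while node is not None:
        bindings |= POLICIES.get(node, set())
        node = PARENT[node]
    return bindings


print(sorted(effective_policy("vm-1")))
print(sorted(effective_policy("folder-team")))
```

The VM inherits both bindings, while the folder sees only the one set at the organization level, because policies flow downward and not upward.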

Billing

GCP billing is set up at the GCP project level
A billing account can be linked to one or more projects
A billing account can be set to be charged and invoiced automatically every month or when a threshold amount is reached
GCP customers who resell GCP services sometimes use subaccounts
How to keep your billing under control :
- Budgets and alerts, billing export, reports
  ➔ Budgets and alerts keep your billing under control; billing alerts can drive automated controls
  ➔ Billing export allows you to send your billing data to a BigQuery dataset; it stores billing details
  ➔ Reports is a visual tool to monitor expenditure
  ➔ Quotas are helpful limits; they are designed to prevent over-consumption of resources caused by errors or malicious attacks
- Quotas
  - Rate quota: resets after a specific period of time, for example by limiting the number of calls to the GKE API within each period
  - Allocation quota: controls how many resources you can have in a project and does not reset over time
Measured service: pay only for the resources you use
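The difference between the two quota types can be shown with two small classes. This is a sketch of the concept only: the class names, the limits, and the fixed-window reset are assumptions and do not reflect how Google Cloud enforces quotas.

```python
class RateQuota:
    """Allows `limit` calls per `window` seconds; the count resets each window."""

    def __init__(self, limit, window):
        self.limit, self.window = limit, window
        self.window_start, self.used = 0.0, 0

    def allow(self, now):
        if now - self.window_start >= self.window:
            self.window_start, self.used = now, 0   # periodic reset
        if self.used >= self.limit:
            return False
        self.used += 1
        return True


class AllocationQuota:
    """Caps how many resources may exist at once; it never resets with time."""

    def __init__(self, limit):
        self.limit, self.in_use = limit, 0

    def acquire(self):
        if self.in_use >= self.limit:
            return False
        self.in_use += 1
        return True

    def release(self):
        self.in_use = max(0, self.in_use - 1)


rate = RateQuota(limit=2, window=100)
print([rate.allow(t) for t in (0, 1, 2, 100)])

alloc = AllocationQuota(limit=1)
print(alloc.acquire(), alloc.acquire())
```

The third call is rejected until the window rolls over, whereas the allocation quota frees capacity only when a resource is released.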

Interacting with Google Cloud

Google tools or interfaces: used to manage and configure GCP resources

Ways to interact with Google cloud

Google Cloud console web user interface: manages GCP resources
- Web-based GUI to manage all google cloud resources
- Executes common tasks using simple mouse clicks
- Provides visibility into google cloud projects and resources
- Interacting with the Cloud console : console.cloud.google.com ➔ log in to the GCP console
Cloud SDK and cloud shell command-line interface
- Cloud SDK : gcloud, kubectl, gsutil, bq ➔ lets you write automation scripts for recurring updates
- Cloud shell : access directly, constant availability of gcloud, ephemeral compute engine virtual machine instance
Cloud console mobile app for iOS and android
- SSH to connect to compute engine instances, up-to-date billing info and alerts, customizable graphs
REST-based API for custom applications

Google Cloud's choices for organizing compute workloads

Service name : description
Kubernetes engine : a managed environment for deploying containerized applications
➔ Provides a managed compute platform with native support for containers
Compute engine : a managed environment for deploying virtual machines
App engine : a managed serverless platform for deploying applications
Cloud functions : a managed serverless platform for deploying event-driven functions

Summary

Cloud computing means on-demand, pay-as-you-go resources.
Google cloud offers 4 compute services.
Google cloud is organized into regions and zones.
The resource hierarchy helps you manage your google cloud use.
Use the cloud console and cloud shell for access.

References

[1] Getting Started with Google Kubernetes Engine, Coursera

Causality and Causal Inference

Tue, 23 Aug 2022 13:23:32 GMT

This post introduces the concepts of causality and causal inference and the related theory (back-door criterion, do-calculus).
Keyword : Causality, SCM, Back-door, Do-calculus

Causality and Causal Inference

Causality

Influence by which one event, process, state, or object (a cause) contributes to the production of another event, process, state, or object (an effect), where the cause is partly responsible for the effect, and the effect is partly dependent on the cause.
Causality in various academic disciplines
- Physics, chemistry, biology, climate science
- Psychology, social science, economics
- Epidemiology, public health
Relation to AI, ML, DS
- AI : a rational agent performing actions to achieve a goal (reinforcement learning)
- ML : currently focused on learning correlations
- DS : capture, process, analyze, communicate with data

Structural causal model (SCM)

SCM $$M = \langle U, V, F, P(U) \rangle$$ provides a formal framework.
SCM induces observational, interventional, and counterfactual distributions.
SCM induces a causal graph $$g$$, which implies conditional independencies testable via d-separation (blockage).
The underlying model $$M$$ is unknown but the causal graph $$g$$ can be given from common sense or domain knowledge.
Intervention $$do(X=x)$$ is modeled as a submodel $$M_x$$, which induces a manipulated causal graph $$g_{\bar{x}}$$.
Causal effect of $$X=x$$ on $$Y=y$$ is defined as $$P(y\mid{do(x)})$$.

Remark

Identifiability : causal effect may be computable from existing observational data for some causal graphs.
In the Markovian case with a singleton $$X$$, the causal effect can be derived easily by canceling out $$P(x\mid{pa_x})$$ from the joint distribution.
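Written out, this cancellation is the truncated factorization. The following is a sketch assuming a Markovian model over variables $$V$$, where $$pa_i$$ denotes the parents of $$V_i$$ (notation introduced only for this formula) and $$v$$ is consistent with $$X=x$$:

$$ P(v \setminus x \mid do(x)) = \prod_{V_i \in V \setminus \{X\}} P(v_i \mid pa_i) = \frac{P(v)}{P(x \mid pa_x)} $$

Summing the right-hand side over all variables other than $$Y$$ then gives $$P(y\mid{do(x)})$$ in terms of observational quantities only.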

Back-door Criterion

Definition | Back-door
- Find a set $$Z$$ s.t. it can sufficiently explain 'confounding' between $$X$$ and $$Y$$. Then,
$$ P(y|do(x))=\sum_z{P(y|x,z)P(z)} $$
Definition | Back-door criterion
- A set $$Z$$ satisfies the back-door criterion with respect to a pair of variables $$X, Y$$ in causal diagram $$g$$ if;
  - (i) no node in $$Z$$ is a descendant of $$X$$; and
  - (ii) $$Z$$ blocks every path between X ∈ $$X$$ and Y ∈ $$Y$$ that contains an arrow into X.
A back-door adjustment formula is simple and widely used but limited.
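A small numerical sketch may help show what the adjustment formula computes and how it differs from plain conditioning. All probabilities below are invented for illustration, and the setup assumes binary $$X$$, $$Y$$, $$Z$$ with $$Z$$ satisfying the back-door criterion.

```python
P_Z = {0: 0.7, 1: 0.3}                      # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.9}             # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (0, 1): 0.5,  # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (1, 0): 0.3, (1, 1): 0.7}


def p_x_given_z(x, z):
    return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]


def backdoor_adjustment(x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) P(z)."""
    return sum(P_Y1_GIVEN_XZ[(x, z)] * P_Z[z] for z in P_Z)


def observational(x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | x, z) P(z | x), with P(z | x) from Bayes' rule."""
    p_x = sum(p_x_given_z(x, z) * P_Z[z] for z in P_Z)
    return sum(P_Y1_GIVEN_XZ[(x, z)] * p_x_given_z(x, z) * P_Z[z] for z in P_Z) / p_x


effect = backdoor_adjustment(1) - backdoor_adjustment(0)
naive = observational(1) - observational(0)
print(f"adjusted effect: {effect:.3f}, naive difference: {naive:.3f}")
```

In this toy model the naive difference overstates the effect, because $$Z$$ raises both the chance of $$X=1$$ and the chance of $$Y=1$$; weighting by $$P(z)$$ instead of $$P(z\mid{x})$$ removes that confounding.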

Back-door sets as substitutes of the direct parents of X

Rain satisfies the back-door criterion relative to Sprinkler and Wet:
- (i) Rain is not a descendant of Sprinkler, and
- (ii) Rain blocks the only back-door path from Sprinkler to Wet.
Adjusting for the direct parents of Sprinkler, we have:

$$ P(\text{wt}|do(\text{sp}))=\sum_\text{sn}P(\text{wt}|\text{sp,sn})P(\text{sn})=...=\sum_\text{rn}P(\text{wt}|\text{sp,rn})P(\text{rn}) $$

Rules of Do-calculus

The back-door criterion results in a very specific form of identification formula.
Do-calculus (Pearl, 1995) provides general machinery to manipulate observational and interventional distributions.
Theorem | Rules of Do-calculus (simplified)
- Rule 1 : Adding/removing observations
$$ P(y|do(x),z)=P(y|do(x))\quad\text{if}\quad(Z\perp{Y|X})\quad\text{in}\quad g_{\bar{X}} $$
- Rule 2 : Action/observation exchange
$$ P(y|do(x),do(z))=P(y|do(x),z)\quad\text{if}\quad(Z\perp{Y|X})\quad\text{in}\quad g_{\bar{X}\underline{Z}} $$
- Rule 3 : Adding/removing actions
$$ P(y|do(x),do(z))=P(y|do(x))\quad\text{if}\quad(Z\perp{Y|X})\quad\text{in}\quad g_{\bar{X}\bar{Z}} $$

Do-calculus is sound and complete, but by itself it offers no algorithmic insight.
A graphical condition and an efficient algorithmic procedure have been developed for identifiability.
Do-calculus is a set of rules for manipulating observational and interventional probabilities.

Modern identification tasks

Experimental conditions ➔ Generalized identification
- Combining datasets of different experimental conditions
- The identifiability of any expression of the form $$P(y\mid{do(x), z})$$ can be determined given any causal graph $$g$$ and an arbitrary combination of observational and experimental studies.
- If the query is identifiable, then its estimand can be derived in polynomial time.
Environmental conditions ➔ Transportability
- Combining datasets from different sources
- Non-parametric transportability can be determined provided that the problem instance is encoded in selection diagrams.
- When transportability is feasible, the transport formula can be derived in polynomial time.
- The causal calculus and the corresponding transportation algorithm are complete.
Sampling conditions ➔ Recovering from selection bias
- Nonparametric recoverability of selection bias from causal and statistical settings can be determined provided that an augmented causal graph is available.
- When recoverability is feasible, the estimand can be derived in polynomial time.
- The result is complete for pure recoverability, and sufficient for recoverability with external information.
Responding conditions ➔ Recovering from missingness

References

This post is based on what I learned while taking part in the LG Aimers program. (It does not cover the full course content.)


[1] LG Aimers AI Essential Course, Module 5: Causal Inference, Prof. Sanghack Lee (Seoul National University)

Autonomous Driving and Radar Sensors

Tue, 23 Aug 2022 13:22:41 GMT

Autonomous Driving and Radar Sensors

This post introduces the technology and trends in autonomous driving and radar.
Keyword : Autonomous driving, Radar

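Autonomous driving market trends

Future mobility megatrends

Autonomous driving ➔ more advanced autonomy, so vehicles can drive safely on their own without driver intervention
Connectivity ➔ advanced connected autonomous driving that maximizes passenger safety and the effectiveness of traffic management
Electrification ➔ maximum driving range on a single charge, built on high energy efficiency

Advancing levels of autonomous driving

Levels of autonomous driving
- Manual driving ➔ driver assistance ➔ partial automation ➔ conditional automation ➔ high automation ➔ full automation

Autonomous vehicle market trends

Market share in 2025
- Partially autonomous: 12.4% / Fully autonomous: 0.5%
Market share in 2035
- Partially autonomous: 15% / Fully autonomous: 9.8%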

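Autonomous driving sensors

Camera
- Higher pixel counts, smaller pixel sizes, and better low-light performance to improve range and recognition rates
- Optimized lens/housing structures to ensure quality when operating at high temperatures
- In production, differentiated active alignment and calibration process technology
Radar
- Advances in antenna and signal-processing software for high-resolution 4D imaging radar
- More advanced perception software, extending performance to distinguishing object shapes and predicting situations
- In production, more advanced flatness control, calibration, and EOL process technology
LiDAR
- LiDAR for ADAS is evolving with priorities in the order of vehicle reliability, design, and cost
- Growing into a core sensor-fusion component that provides redundancy for Lv4/5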

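Autonomous driving SoC trends

Tesla
- Technology that synthesizes 3D images in real time from 2D camera images alone
- Accuracy improved through server-side deep learning and simulation focused on edge cases
NVIDIA
- Evolving from ADAS systems to the Hyperion system for autonomous driving
- Shifting from a 2D camera-centric approach to a 3D approach that also uses ultrasonic sensors, LiDAR, and radar
Mobileye
- EyeQ series for autonomous driving + Intel Atom C3000 solution for infotainment
- Autonomous vehicle approach, a hybrid of SD maps and HD maps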

Remark

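The future mobility megatrends are A.C.E (Autonomous driving, Connectivity, Electrification)
Autonomous driving is currently at Lv3, and is expected to reach Lv4 in 2025 and Lv5 (fully autonomous driving) after 2030
Autonomous vehicles are expected to grow at a CAGR of 3% through 2035, and autonomous driving sensors at a CAGR of 7%
Autonomous driving requires sensors such as camera, radar, LiDAR, 5G C-V2X communication, and audio
Sensor pod technology is emerging to overcome the limits of individual sensors
Autonomous driving solution vendors are developing Lv4/Lv5 autonomous driving, focused on commercial vehicles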

Radar

Radio detection and ranging
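Object detection technology that uses radio waves
Automotive radar recognizes vehicles, pedestrians, and road infrastructure, and collects information such as distance, relative speed, angle, and height relative to the vehicle

Distance measurement

Measure the time of flight (ToF) to calculate the distance : $$d = \frac{c_0t}{2}$$
where c₀ is the speed of light and t is the ToF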
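The ToF relation can be checked with a few lines of Python. This is a minimal sketch: the function name `tof_to_distance` and the 1 µs example value are illustrative assumptions, not something from the course material.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s

def tof_to_distance(t_seconds: float) -> float:
    """Return d = c0 * t / 2 for a round-trip time of flight t."""
    if t_seconds < 0:
        raise ValueError("time of flight must be non-negative")
    # The wave travels to the target and back, hence the division by 2.
    return C0 * t_seconds / 2.0

# Hypothetical example: a round trip of 1 microsecond is roughly 150 m.
distance_m = tof_to_distance(1e-6)
print(f"{distance_m:.1f} m")
```

The factor of 2 is the only subtle point: the measured time covers the path to the object and back.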

Velocity measurement

Pulsed radar : two successive measurements
FMCW radar : exploit the Doppler shift
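The two velocity-measurement ideas can be sketched as follows. The notes do not give formulas here, so this sketch assumes the standard monostatic Doppler relation $$f_d = 2v/\lambda$$ and a 77 GHz carrier purely for illustration; all function names and numbers are hypothetical.

```python
C0 = 299_792_458.0  # speed of light, m/s

def pulsed_radial_velocity(d1_m: float, d2_m: float, dt_s: float) -> float:
    """Velocity from two successive range measurements taken dt_s apart.

    Negative values mean the range is shrinking (target approaching).
    """
    return (d2_m - d1_m) / dt_s

def doppler_to_radial_velocity(doppler_hz: float, carrier_hz: float) -> float:
    """Velocity from a Doppler shift, assuming f_d = 2 * v / wavelength."""
    wavelength = C0 / carrier_hz
    return doppler_hz * wavelength / 2.0

# Hypothetical numbers: range drops from 50 m to 49 m in 0.1 s,
# and a 5 kHz Doppler shift is observed on an assumed 77 GHz carrier.
v_pulsed = pulsed_radial_velocity(d1_m=50.0, d2_m=49.0, dt_s=0.1)
v_doppler = doppler_to_radial_velocity(doppler_hz=5_000.0, carrier_hz=77e9)
print(v_pulsed, v_doppler)
```

The pulsed approach differentiates position over time, while the Doppler approach reads velocity directly from a frequency shift within a single measurement.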

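Required radar technologies

Antenna
- High gain, wide angle, high resolution, peak gain, radiation pattern optimization, array antenna design
mmWave circuit
- Low loss, EMC-minimizing design, platform design based on the main IC, transition minimization and RF matching
SW
- System SW, radar signal processing, perception algorithms
Mechanical design
- Optimized radome radio-wave transmittance, high reliability with waterproof, dustproof, and heat-dissipation design, simulation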

Radar technology trends

2D ADAS Basic (X, Y, Doppler) ➔ 2D ADAS improved (X, Y, Doppler) ➔ 3D (X, Y, Z, Doppler) ➔ 4D HR (High resolution; X, Y, Z, Doppler, depth) ➔ 4D UHR (Ultra high resolution; X, Y, Z, Doppler, depth) ➔ Imaging (X, Y, Z, Doppler, depth, AI/Deep learning)

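Radar market trends

Wider adoption of AEB functions for vehicle control
Front radar : adoption rate growing with higher resolution
Corner radar : low cost ➔ four or more units per vehicle for 360-degree surround sensing
New in-cabin applications are emerging to strengthen safety and convenience functions

Remark

Required radar technologies are antenna, mmWave circuit, SW, mechanical design, PCB, and process design
Radar types are SRR, MRR, and LRR, advancing toward 4D imaging radar in the future
Infineon, TI, and NXP are the vendors mainly used for automotive radar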

References

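This post is based on what I learned while participating in the LG Aimers program. (It does not cover the full course content.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 6: Understanding Autonomous Driving and Radar Sensors, 김경석 (Research Fellow, LG Innotek)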

XAI | Concepts and Taxonomy of Explainable AI

Tue, 23 Aug 2022 13:21:57 GMT

This post covers the concept of explainable AI (XAI) and how XAI methods are classified.
Keyword : XAI, CAM, LIME, RISE

Supervised (deep) learning

Supervised (deep) learning has made huge progress, but deep learning models are extremely complex
- End-to-end learning becomes a black box
- Problems arise when models are applied to make critical decisions

What is explainability & Interpretability?

Interpretability is the degree to which a human can understand the cause of a decision.
Interpretability is the degree to which a human can consistently predict the model's results.
An explanation is the answer to a why-question.

Taxonomy of XAI methods

Local vs. Global
- Local : describes an individual prediction
- Global : describes entire model behavior
White-box vs. Black-box
- White-box : explainer can access the inside of the model
- Black-box : explainer can access only the output
Intrinsic vs. Post-hoc
- Intrinsic : restricts the model complexity before training
- Post-hoc : applies after the ML model is trained
Model specific vs. Model agnostic
- Model-specific : some methods are restricted to specific model classes
- Model-agnostic : some methods can be used for any model

Examples

Linear model, simple decision tree

➔ Global, white-box, intrinsic, model-specific

Grad-CAM

➔ Local, white-box, post-hoc, model-agnostic

Simple Gradient method

Simply use the gradient as the explanation (importance)
- Interpretation of f at x₀ (for the i-th input/feature/pixel)
$$ R_i=(\nabla{f(x)}|_{x_0})_i $$
- Shows how sensitive the function value is to each input
Examples : the gradient maps are visualized for the highest-scoring class
Strength : easy to compute (via back-propagation)
Weakness : becomes noisy (due to the shattered gradient problem)
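A minimal sketch of the gradient explanation $$R_i$$ follows. To stay self-contained it swaps back-propagation for central finite differences, and `toy_score` is a hypothetical stand-in for a real network; neither comes from the course material.

```python
from typing import Callable, List

def gradient_saliency(f: Callable[[List[float]], float],
                      x0: List[float],
                      eps: float = 1e-5) -> List[float]:
    """Approximate R_i = (df/dx_i) evaluated at x0 by central differences."""
    saliency = []
    for i in range(len(x0)):
        plus = list(x0)
        minus = list(x0)
        plus[i] += eps
        minus[i] -= eps
        saliency.append((f(plus) - f(minus)) / (2.0 * eps))
    return saliency

def toy_score(x: List[float]) -> float:
    """Hypothetical class score: strong on x[0], weak on x[1], ignores x[2]."""
    return 3.0 * x[0] + 0.1 * x[1] ** 2

saliency = gradient_saliency(toy_score, [1.0, 2.0, 5.0])
print(saliency)  # approximately [3.0, 0.4, 0.0]
```

The input the score ignores receives a gradient of about zero, which is exactly the "sensitivity" reading of this method.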

SmoothGrad

A simple method to address the noisy gradients
- Add some noise to the input and average
$$ \nabla_{\text{smooth}}f(x)=\mathbb{E}_{\epsilon\sim\mathcal{N}(0,\sigma^{2}I)}[\nabla{f(x+\epsilon)}] $$
- Averaging the gradients of slightly perturbed inputs smooths the interpretation
- Typical heuristics
  - Expectation is approximated with Monte Carlo sampling (around 50 runs)
  - σ is set to 10~20% of $$x_{max}-x_{min}$$
Strength
- Clearer interpretation via simple averaging
- Applicable to most sensitivity maps
Weakness
- Computationally expensive
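The SmoothGrad expectation can be approximated as in the sketch below. The gradient function is a hypothetical toy (a smooth trend `W` plus a rapidly oscillating term, i.e. the gradient of $$W\cdot x + 0.001\sum_i\sin(1000x_i)$$), and the inputs are assumed to lie in [0, 1] so that σ = 0.15 matches the 10~20% heuristic.

```python
import math
import random
from typing import Callable, List

def smoothgrad(grad_f: Callable[[List[float]], List[float]],
               x: List[float],
               sigma: float,
               n_samples: int = 50,
               seed: int = 0) -> List[float]:
    """Monte Carlo estimate of E[grad f(x + eps)], eps ~ N(0, sigma^2 I)."""
    rng = random.Random(seed)
    total = [0.0] * len(x)
    for _ in range(n_samples):
        noisy = [xi + rng.gauss(0.0, sigma) for xi in x]
        total = [t + g for t, g in zip(total, grad_f(noisy))]
    return [t / n_samples for t in total]

# Hypothetical gradient: the trend W is what we would like to recover,
# the cosine term makes the raw gradient look noisy.
W = [2.0, -1.0, 0.5]

def toy_gradient(x: List[float]) -> List[float]:
    return [w + math.cos(1000.0 * xi) for w, xi in zip(W, x)]

x = [0.2, 0.5, 0.8]
raw = toy_gradient(x)                                   # can be off by up to 1.0
smooth = smoothgrad(toy_gradient, x, sigma=0.15, n_samples=50)
print(raw, smooth)
```

Each call evaluates the gradient `n_samples` times, which is the computational cost listed as the weakness above.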

Class activation map (CAM)

Method
- Upsample the CAM to match the size with the input image
- Global average pooling (GAP) should be implemented before the softmax layer
Alternative view of CAM
- The logit of the class c for CAM is represented by:

(GAP-FC model)

$$ Y^c=\sum_k{w_k^c}\frac{1}{Z}\sum_{ij}{A^k_{ij}} $$

Result
- CAM can localize objects in image
- Segment the regions that have the value above 20% of the max value of the CAM and take the bounding box of it
Strength
- It clearly shows what objects the model is looking at
Weakness
- Model-specific: it can be applied only to models with a restricted architecture (GAP followed by a single FC layer)
- It can only be obtained at the last convolutional layer, which makes the interpretation resolution coarse
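The relation between the CAM and the GAP-FC logit can be sketched with plain nested lists. The feature maps `A` and weights `w_c` below are made-up numbers, upsampling to the input size is omitted, and the bounding-box helper assumes the CAM has a positive maximum.

```python
from typing import List, Tuple

Grid = List[List[float]]

def class_activation_map(feature_maps: List[Grid], class_weights: List[float]) -> Grid:
    """CAM_ij = sum_k w_k^c * A^k_ij for feature maps A[k][i][j]."""
    h, w = len(feature_maps[0]), len(feature_maps[0][0])
    cam = [[0.0] * w for _ in range(h)]
    for a_k, w_k in zip(feature_maps, class_weights):
        for i in range(h):
            for j in range(w):
                cam[i][j] += w_k * a_k[i][j]
    return cam

def gap_fc_logit(feature_maps: List[Grid], class_weights: List[float]) -> float:
    """Y^c = sum_k w_k^c * (1/Z) * sum_ij A^k_ij (GAP followed by one FC layer)."""
    z = len(feature_maps[0]) * len(feature_maps[0][0])
    return sum(w_k * sum(map(sum, a_k)) / z
               for a_k, w_k in zip(feature_maps, class_weights))

def bounding_box(cam: Grid, ratio: float = 0.2) -> Tuple[int, int, int, int]:
    """Box (top, left, bottom, right) around cells above ratio * max(CAM)."""
    thresh = ratio * max(max(row) for row in cam)
    cells = [(i, j) for i, row in enumerate(cam) for j, v in enumerate(row) if v > thresh]
    rows = [i for i, _ in cells]
    cols = [j for _, j in cells]
    return min(rows), min(cols), max(rows), max(cols)

# Hypothetical 3x3 feature maps for two channels and the weights of one class.
A = [
    [[0.0, 0.0, 0.0], [0.0, 4.0, 2.0], [0.0, 2.0, 1.0]],
    [[1.0, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
]
w_c = [1.0, 0.5]
cam = class_activation_map(A, w_c)
logit = gap_fc_logit(A, w_c)
box = bounding_box(cam)
```

Averaging the CAM over all positions gives back the class logit, which is why the map can be read as a spatial decomposition of that logit.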

Grad-CAM

Method
- To calculate the channel-wise weighted sum, Grad-CAM replaces the weights with average-pooled gradients
Strength
- Model agnostic: can be applied to various output models
Weakness
- The averaged gradient is sometimes inaccurate
Result
- Debugging the training with Grad-CAM

LIME

Local interpretable model-agnostic explanations (LIME)
- Can explain the predictions of any classifier by approximating it locally with an interpretable model
- Model-agnostic, black-box
- General overview of the interpretations
Perturb the super-pixels and obtain the local interpretation model near the given example
Explaining an image classification prediction made by Google's inception neural network
Strength
- Black-box interpretation
Weakness
- Computationally expensive
- Hard to apply to certain kinds of models
- e.g., when the underlying model is still locally non-linear

RISE

Randomized input sampling for explanation (RISE)
- Sub-sampling the input image via random masks
- Record its response to each of the masked images
Comparison to LIME
- The saliency of LIME relies on super-pixels, which may not capture the correct regions
Strength
- Much clearer saliency map
Weakness
- High computational complexity
- Noisy due to sampling
RISE sometimes provides noisy importance maps
- This is due to the sampling approximation (Monte Carlo), especially in the presence of objects with varying sizes

Understanding black-box predictions via influence functions

Different approach for XAI
- Identify most influential training data point for the given prediction
Influence function
- Measure the effect of removing a training sample on the test loss value
Influence function-based explanation can show the difference between the models

Metrics

Human-based visual assessment

AMT (Amazon mechanical turk) test
- Want to know: can a human predict the model's prediction from the interpretation?
Weakness
- Obtaining human assessment is very expensive

Human annotation

Some metrics employ human annotation (localization and semantic segmentation) as a ground truth, and compare them with interpretation
- Pointing game
- Weakly supervised semantic segmentation
Pointing game
- Given human-annotated bounding boxes $$\{B^{(i)}\}_{i=1,...,N}$$ and interpretations $$\{h^{(i)}\}_{i=1,...,N}$$, the mean accuracy of the pointing game is defined by:
$$ Acc=\frac{1}{N}\sum^N_{i=1}1_{[p^{(i)}\in{B^{(i)}}]} $$
- where $$p^{(i)}$$ is a pixel s.t. $$p^{(i)}=\arg\max_p(h_p^{(i)})$$
- $$1_{[p^{(i)}\in{B^{(i)}}]}$$ is an indicator function whose value is 1 if the pixel with the highest interpretation score is located in the bounding box, and 0 otherwise
Weakly supervised semantic segmentation
- Setting : Pixel-level label is not given during training
- This metric measures the mean IoU between interpretation and semantic segmentation label
Weakness
- Hard to make the human annotations
- Such localization and segmentation labels are not a ground truth of interpretation
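The pointing-game accuracy can be computed as in this sketch. Saliency maps are plain 2-D lists and boxes are `(top, left, bottom, right)` tuples with inclusive bounds; both conventions and the example values are assumptions made for illustration.

```python
from typing import List, Tuple

Grid = List[List[float]]
Box = Tuple[int, int, int, int]  # (top, left, bottom, right), inclusive

def pointing_game_accuracy(saliency_maps: List[Grid], boxes: List[Box]) -> float:
    """Fraction of samples whose highest-scoring pixel lies inside the box."""
    hits = 0
    for h, (top, left, bottom, right) in zip(saliency_maps, boxes):
        # p = argmax over pixels of the interpretation score h_p
        _, pi, pj = max((v, i, j) for i, row in enumerate(h) for j, v in enumerate(row))
        hits += int(top <= pi <= bottom and left <= pj <= right)
    return hits / len(saliency_maps)

# Hypothetical 2x2 interpretations: the first peak falls inside its box,
# the second peak falls outside.
maps = [
    [[0.1, 0.9], [0.2, 0.3]],
    [[0.8, 0.1], [0.2, 0.3]],
]
boxes = [(0, 1, 0, 1), (1, 1, 1, 1)]
acc = pointing_game_accuracy(maps, boxes)
print(acc)
```

Only the single maximum pixel matters, so this metric says nothing about how well the rest of the map covers the object.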

Pixel perturbation

Motivation
- If we remove an important area of the image, the logit value for the class should decrease
AOPC (Area over the MoRF perturbation curve)
- AOPC measures the decrease in the logit as input patches are replaced in MoRF (most relevant first) order

$$ AOPC=\frac{1}{L+1}\mathbb{E}_{x\sim{p(x)}}\left[\sum^L_{k=0}\left(f(x^{(0)}_{MoRF})-f(x^{(k)}_{MoRF})\right)\right] $$

where f is the logit for true label

Insertion and deletion
- Both measure the AUC of each curve
  - In deletion curve, x axis is the percentage of the removed pixels in the MoRF order, and y axis is the class probability of the model
  - In insertion curve, x axis is the percentage of the recovered pixels in the MoRF order, starting from a gray image
Weakness
- Violates one of the key assumptions in ML, namely that the training and evaluation data come from the same distribution
- The perturbed input data differs from the data that the deployed model of interest sees and explains at test time
- Perturbation can create a new feature for the model, e.g., the model tends to predict perturbed inputs as Balloon

ROAR

ROAR removes some portion of pixels in train data in the order of high interpretation values of the original model, and retrains a new model
Weakness
- Retraining every time is computationally expensive

Sanity checks

Model randomization

Interpretation = Edge detector?
- Some interpretation methods produce saliency maps that are strikingly similar to those created by an edge detector
Model randomization test
- This experiment randomly re-initializes the parameters, either in a cascading fashion or one independent layer at a time
- Some interpretation methods are not sensitive to this randomization, e.g., Guided-backprop, LRP, and pattern attribution

Adversarial attack

Geometry is to blame
- Proposed the adversarial attack on interpretation: $$\mathbb{L}=||h(x_{adv})-h^t||^2+\gamma||g(x_{adv})-g(x)||^2$$
- Proposed a smoothing method to undo attack
  - Using a softplus activation with high beta can undo the interpretation attack
- Provided theoretical bound of such attack
Results
- In the right figure, the manipulated image is visualized after being attacked with the target interpretation $$h^t$$.
- For both gradient and LRP, the manipulated interpretation for the network with ReLU activation is similar to the target interpretation, but the one with softplus is not manipulated.

Adversarial model manipulation

Adversarial model manipulation
- Two models can produce totally different interpretations while having similar accuracy.
Attack on the input
- Negligible model accuracy drop
- Fooling generalizes across validation set
- Fooling transfers to different interpretations
- AOPC analysis confirms true foolings

References

This post is based on what I learned while participating in the LG Aimers program. (It does not cover the full course content.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 4: Explainable AI, Prof. 문태섭, Seoul National University

Unsupervised Learning and Deep Learning

Tue, 23 Aug 2022 13:21:06 GMT

This post explains the concept and types of unsupervised learning in AI.
Keyword : Unsupervised learning

In traditional machine learning

K-means clustering
Hierarchical clustering
Density estimation
PCA

Characteristics

Low dimensional data
Simple concepts

In deep learning

Feature engineering vs. Representation learning

Feature engineering
- By human
- Domain knowledge & creativity
- Brainstorming
Representation learning
- By machine
- Deep learning knowledge & coding skill
- Trial and error

Modern unsupervised learning

High dimensional data
Difficult concepts ➔ Not well understood, but surprisingly good performance
Deep learning
Unsupervised representation learning

Representation in deep learning

Deep learning representation is underconstrained
- Simple SGD can find one of the useful networks
- Representation characteristics can be adjusted if needed
- Learned representation becomes difficult to understand
Disentangled representation
- Aligned
- Independent
- Subspaces
- Possible because severely underconstrained

Angle information

0 ~ 2π
- Algorithm thinks : 0 and 2π are different / 0 and 1.9π are far
(x₁, x₂) = (cos(θ), sin(θ))
- 0 and 2π are the same
- 0 and 1.9π are close
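The effect of the (cos θ, sin θ) representation can be checked numerically. This is a small sketch; the helper names are illustrative.

```python
import math
from typing import Tuple

def encode_angle(theta: float) -> Tuple[float, float]:
    """Map an angle to a point on the unit circle: (cos(theta), sin(theta))."""
    return (math.cos(theta), math.sin(theta))

def euclidean(a: Tuple[float, float], b: Tuple[float, float]) -> float:
    return math.hypot(a[0] - b[0], a[1] - b[1])

zero, full_turn, almost = 0.0, 2.0 * math.pi, 1.9 * math.pi

# Raw scalar representation: 0 and 2*pi look maximally different.
raw_full = abs(zero - full_turn)      # about 6.28
raw_almost = abs(zero - almost)       # about 5.97

# Unit-circle representation: 0 and 2*pi coincide, 0 and 1.9*pi are close.
enc_full = euclidean(encode_angle(zero), encode_angle(full_turn))
enc_almost = euclidean(encode_angle(zero), encode_angle(almost))
print(raw_full, raw_almost, enc_full, enc_almost)
```

The same information is stored either way; only the second representation makes "nearby angles" nearby for a distance-based algorithm.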

Spatial information

Goal : Represent as mathematical object

Human representation problems

Human can understand
Human can design with a goal

➔ Good representation in deep learning? : Useful and irrelevant

A well defined task

Typically, only one attribute of interest is considered as y
- Imagenet - class
- y is well defined because it is simply defined as human selected label
Good representation - a vague concept (Supervised)
- Even when y is well defined, what do we want for h₁ and h₂?
- Simply say "representation learning successful" if good performance?
- But then there is almost nothing we can say about h₁ and h₂
- Other than saying "useful information has been well curated"
- Is there anything we can say or pursue?
- For a general purpose, what is a good representation?

Information bottleneck

For a well defined supervised task, what should h₁ and h₂ satisfy?
Good representation - a vague concept (Unsupervised)
- For a general purpose, what is a good representation?
- General purpose often defined as a list of downstream tasks?
- So, we go back to good performance for the tasks of interest?

Representation

What we want: a formal definition and evaluation metrics for representation
Reality : No definition, task dependent evaluation methods

Unsupervised representation learning

Unsupervised performance ≈ supervised performance
- For linear evaluation
- Thanks to instance discrimination, contrastive loss, and aggressive augmentation
As in supervised learning
- Performance metric can be unclear
- Design of surrogate loss is an art (some principled; some heuristics based)
- Training technique development continuing (but augmentation methods are dominating)
NLP
- Masked language modeling
- What next?
Unsupervised representation learning
- Still a long way to go...

References

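This post is based on what I learned while participating in the LG Aimers program. (It does not cover the full course content.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 3: Unsupervised Learning, Prof. 이원종, Seoul National University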

Supervised Learning Explained (SVM, ANN, Ensemble)

Tue, 23 Aug 2022 13:20:15 GMT

This post covers the concept of supervised learning in AI and its main model families (linear models, SVM, ANN).
Keyword : Supervised learning, overfitting, underfitting, linear model, SVM, ANN, ensemble

Supervised learning

Given a set of labeled examples $$(x^1, y^1),...,(x^N, y^N)$$, learn a mapping function g : X ➔ Y, s.t. given an unseen sample x', associated output y' is predicted.
Supervised learning relies on the size of the dataset; what if we do not have sufficient data?
- Data augmentation, learning from insufficient labels (weak supervision)
What if the data properties are different between datasets?
- Domain adaptation, transfer learning

Problem formulation

$$X = R^d$$ is an input space
- $$R^d$$ : a d-dimensional Euclidean space
- Input vector $$x ∈ X : x = (x_1,...,x_d)$$
Y is an output space (binary decision)
We want to approximate a target function f
- f : X ➔ Y (unknown ideal function)
- Data $$(x^1, y^1),...,(x^N, y^N)$$; dataset where $$y^n = f(x^n)$$ for $$n = 1,...,N$$
- Correct labels are available for the training set
- Hypothesis $$g : X ➔ Y$$ (ML model to approximate $$f$$) : $$g ∈ H$$
Learning model : feature selection, model selection, optimization

Model generalization

Learning is an ill-posed problem; data is limited to find a unique solution
Generalization (goal) : a model needs to perform well on unseen data
- Generalization error E_gen; the goal is to minimize this error, but it is impractical to compute in the real world
- Use training/validation/test set errors for the proxy

Errors

Pointwise error is measured on each input sample : $$e(h(x), y)$$
From a pointwise error to an overall error : $$E[(h(x^i) - y^i)^2]$$
- Depending on whether the input samples come from the training, validation, or testing dataset, the error is called the training error (E_train), validation error (E_val), or testing error (E_test).
Training error E_train is measured on the training set and may or may not represent E_gen; it is used for fitting a model
Testing error E_test is measured on data not used in training and can serve as a proxy for E_gen.
Goal : E_test ≈ E_gen ≈ 0

Overfitting and Underfitting

Underfitting : the model is simpler than the actual data distribution (high bias)
Overfitting : the model is more complex than the actual data distribution (high variance)
- Avoid overfitting
  - Problem : In today's ML problems, complex models tend to be used to handle high-dimensional data (with a relatively insufficient number of samples), which makes them prone to overfitting
  - Curse of dimensionality : If you increase the dimension of the data to improve performance while maintaining the density of examples per bin, the amount of data must grow exponentially.
  - Remedy : Data augmentation, regularization, ensemble

Bias and Variance

Bias : error because the model can not represent the concept
Variance : error because a model overreacts to small changes (noise) in the training data

➔ Total loss = Bias² + Variance (+ noise)

➔ Bias-variance trade-off : the two objectives trade off approximation against generalization w.r.t. model complexity

Cross-validation (CV)

CV helps select a better model and avoid overfitting (at the cost of more computation)

Linear Regression

Hypothesis set H, model parameter θ

$$ h_\theta(x)=\theta_0+\theta_1x_1+...+\theta_dx_d=\theta^Tx $$

Good for a first try : simplicity, generalization
L₂ cost function (Goal : minimizing MSE)

$$ J(\theta_0,\theta_1)=\frac{1}{2m}\sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})^2 $$

➔ $$\min_{\theta_0, \theta_1} J(\theta_0, \theta_1)$$

Optimization

Data matrix $$X \in R^{N\times(d+1)}$$, target vector $$y \in R^N$$, weight vector $$\theta \in R^{d+1}$$
In-sample error : $$E(\theta)=\frac{1}{N}\mid\mid{X\theta-y}\mid\mid_2^2$$
Normal equation
- (Least squares) E is continuous, differentiable, and convex
- Need to compute an inverse matrix and slow if the number of samples is very large

$$ \theta^{*}=\text{argmin}_{\theta}E(\theta) $$

$$ =\text{argmin}_{\theta}[\frac{1}{N}(\theta^TX^TX\theta-2\theta^TX^Ty+y^Ty)] $$

$$ \therefore\theta^{*}=(X^TX)^{-1}X^Ty=X^{+}y $$

➔ Problem : huge computational complexity, non-invertible matrix ➔ needs iterative algorithm (gradient descent)
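For a single feature plus a bias term, $$X^TX$$ is only 2×2, so the normal equation can be written out by hand. The sketch below assumes that one-feature case and uses made-up data generated from y = 1 + 2x; it also shows where the non-invertible case surfaces.

```python
from typing import List, Tuple

def fit_line_normal_equation(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Solve theta = (X^T X)^{-1} X^T y for rows X_i = [1, x_i]."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # X^T X = [[n, sx], [sx, sxx]] and X^T y = [sy, sxy]
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is not invertible; use an iterative method instead")
    # Multiply the explicit 2x2 inverse by X^T y.
    theta0 = (sxx * sy - sx * sxy) / det
    theta1 = (n * sxy - sx * sy) / det
    return theta0, theta1

# Hypothetical noise-free data from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
theta0, theta1 = fit_line_normal_equation(xs, ys)
print(theta0, theta1)
```

With many features the same computation needs a full matrix inverse, which is the cost that motivates gradient descent.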

To avoid overfitting
- If we have too many features, the hypothesis may fit the training set very well. However, it may fail to generalize to new samples.
- More features ➔ more parameters ➔ need more data; (in practice) less data ➔ overfitting ➔ Reduce the number of features, regularization

Gradient descent algorithm

$$ \theta_{new}=\theta_{old}-\alpha\frac{\partial}{\partial\theta}J(\theta) $$

J is the objective function that we want to optimize. α : the step size to control the rate to move down the error surface (hyper parameter).
- If α is too small, gradient descent can be slow.
- If α is too large, gradient descent can overshoot the minimum.
Gradient descent works well even when n is large
Batch gradient descent examines all examples at each iteration; use stochastic gradient descent or mini-batches to reduce this cost.
Advances : AdaGrad, RMSProp, Adam
Limitation : local optimum ➔ cannot guarantee global minimum but attempt to find a good local minimum
- To avoid local minimum
  - Momentum : designed to speed up learning in the presence of high curvature and small/noisy gradients ➔ exponentially weighted moving average of past gradients (low-pass filtering)
  - SGD + momentum : uses a velocity, a weighted moving average of previous gradients
  - Nesterov momentum : differs from standard momentum in where the gradient g is evaluated (lookahead gradient step)
  - AdaGrad : adapts an individual learning rate for each direction
  - RMSProp : attempts to fix the drawback of AdaGrad, in which the learning rate becomes infinitesimally small and the algorithm can no longer learn once the accumulated gradient is large.
  - Adam : RMSProp + momentum
Learning rate scheduling

➔ Need to gradually decrease learning rate over time
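The update rule can be sketched for the one-feature cost $$J(\theta_0,\theta_1)$$ from the linear regression section. This is a minimal batch version with an optional momentum term; the data, step size, and iteration count are illustrative assumptions, and learning rate scheduling is left out.

```python
from typing import List, Tuple

def gradient_descent(xs: List[float], ys: List[float],
                     alpha: float = 0.1, n_iters: int = 2000,
                     momentum: float = 0.0) -> Tuple[float, float]:
    """Minimize J = 1/(2m) * sum((theta0 + theta1*x - y)^2) by batch gradient descent."""
    m = len(xs)
    theta0 = theta1 = 0.0
    v0 = v1 = 0.0  # velocity: decayed sum of past updates (plain GD when momentum = 0)
    for _ in range(n_iters):
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        g0 = sum(errors) / m                               # dJ/dtheta0
        g1 = sum(e * x for e, x in zip(errors, xs)) / m    # dJ/dtheta1
        v0 = momentum * v0 - alpha * g0
        v1 = momentum * v1 - alpha * g1
        theta0 += v0
        theta1 += v1
    return theta0, theta1

# Hypothetical noise-free data from y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
plain = gradient_descent(xs, ys)
with_momentum = gradient_descent(xs, ys, momentum=0.9)
print(plain, with_momentum)
```

Because this cost is convex, both runs reach the same minimum; the local-optimum caveat above applies to non-convex objectives such as neural networks.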

Linear Classification

Uses a hyperplane as a decision boundary to classify samples based on a linear combination of its explanatory variables
The linear formula g ∈ H can be written as:

$$ h(x)=sign((\sum^{d}_{i=1}w_ix_i)+w_0) $$

➔ $$x_0 = 1$$, $$w_0$$ : a bias term, $$sign(x) = 1\text{ if }x>0;\ -1\text{ if }x<0$$

Sigmoid function
- Used to map a score value into a probability value.
- Squash the output of the linear function:

$$ \sigma(w^Tx)=\frac{1}{1+e^{-w^Tx}} $$

Advantage : simplicity and interpretability

Loss

Zero-one loss

$$ \text{Loss}_{0-1}(x,y,w)=1[(w\cdot\phi(x))y\le{0}] $$

Hinge loss

$$ \text{Loss}_\text{hinge}(x,y,w)=\max\{1-(w\cdot\phi(x))y,0\} $$

Cross-entropy loss
- Considers two probability mass functions (pmf) {p, 1-p} and {q, 1-q} with binary outcomes
- Cross-entropy loss measures the performance of a classification model whose output is a probability value between 0 and 1.

$$ CE(S,Y)=-\sum_{\forall{i}}Y_i\text{log}(S_i) $$
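The three losses can be compared on a few hand-picked values. In this sketch `score` stands for $$w\cdot\phi(x)$$, labels are assumed to be in {-1, +1} for the margin-based losses, and the small `eps` clamp in the cross-entropy is an added safeguard against log(0).

```python
import math
from typing import List

def zero_one_loss(score: float, y: int) -> float:
    """1 if the margin score * y is non-positive (misclassified), else 0."""
    return 1.0 if score * y <= 0 else 0.0

def hinge_loss(score: float, y: int) -> float:
    """max(1 - score * y, 0): also penalizes correct predictions with margin < 1."""
    return max(1.0 - score * y, 0.0)

def cross_entropy(probs: List[float], onehot: List[float], eps: float = 1e-12) -> float:
    """CE(S, Y) = -sum_i Y_i * log(S_i) for predicted probabilities S."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, onehot))

# A correct prediction with a small margin: zero-one loss is 0, hinge loss is not.
print(zero_one_loss(0.5, 1), hinge_loss(0.5, 1))
# A confident correct prediction has a lower cross-entropy than an unsure one.
print(cross_entropy([0.9, 0.1], [1.0, 0.0]), cross_entropy([0.5, 0.5], [1.0, 0.0]))
```

The hinge and cross-entropy losses are usable with gradient-based optimization, whereas the zero-one loss is flat almost everywhere.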

Support vector machine (SVM)

Choose the linear separator (hyperplane) with the largest margin on either side
- Maximum margin hyperplane with support vectors
- Robust to outliers

Support Vector

An instance with the minimum margin, i.e., one of the data points to which the decision boundary is most sensitive

Margin

twice the distance from the hyperplane to the nearest instance on either side
w : orthogonal to the hyperplane

Optimization

Optimal weight w and bias b
Classifies points correctly while achieving the largest possible margin
Hard margin SVM assumes linear separability
Soft margin SVM extends to non-separable cases

➔ Nonlinear transformation and kernel trick

Constraints : linearly separable, hard-margin linear SVM

$$ h(x)=w^Tx+b\geq1\text{ for }y=1 $$

$$ h(x)=w^Tx+b\leq-1\text{ for }y=-1 $$

$$ y(w^Tx+b)\geq1\text{ for all samples} $$

Objective function : linearly separable, hard-margin linear SVM
- Distance from a support vector to the hyper plane:

$$ \frac{w^Tx+b}{\mid\mid{w}\mid\mid}=\frac{\pm1}{\mid\mid{w}\mid\mid}\longrightarrow\frac{2}{\mid\mid{w}\mid\mid} $$

Kernel trick (not linearly separable)

Polynomials:

$$ K(x,y)=(x\cdot{y}+1)^p $$

Gaussian radial basis function (RBF):

$$ K(x,y)=e^{-\mid\mid{x-y}\mid\mid^2/2\sigma^2} $$

Hyperbolic tangent (multilayer perceptron kernel):

$$ K(x,y)=\text{tanh}(kx\cdot{y}-\delta) $$
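The three kernels translate directly into code. The sketch also checks the idea behind the kernel trick for the degree-2 polynomial kernel in two dimensions: the kernel value equals an ordinary dot product in an explicit feature space. The feature map `phi2` and all default parameter values are illustrative assumptions.

```python
import math
from typing import List

def dot(x: List[float], y: List[float]) -> float:
    return sum(a * b for a, b in zip(x, y))

def polynomial_kernel(x: List[float], y: List[float], p: int = 2) -> float:
    return (dot(x, y) + 1.0) ** p

def rbf_kernel(x: List[float], y: List[float], sigma: float = 1.0) -> float:
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def tanh_kernel(x: List[float], y: List[float], k: float = 1.0, delta: float = 0.0) -> float:
    return math.tanh(k * dot(x, y) - delta)

def phi2(x: List[float]) -> List[float]:
    """Explicit feature map whose dot product equals (x.y + 1)^2 for 2-D inputs."""
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x[0], r2 * x[1], x[0] ** 2, x[1] ** 2, r2 * x[0] * x[1]]

a, b = [1.0, 2.0], [3.0, 4.0]
k_poly = polynomial_kernel(a, b, p=2)   # (11 + 1)^2 = 144
k_explicit = dot(phi2(a), phi2(b))      # same value via the 6-D feature space
print(k_poly, k_explicit)
```

The kernel evaluates the high-dimensional inner product without ever building the high-dimensional vectors, which is what makes non-linear SVMs practical.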

Artificial neural network (ANN)

Needs elaborate training schemes to improve performance
Activation functions
- Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input
- Non-linearity due to the activation functions
Deep neural network can represent more complex (non-linear) boundaries with increasing neurons
Multilayer perceptron (MLP)
- can solve XOR problem
ANN for non-linear problem
- There exist cases where the accuracy is low even if the number of layers is high

➔ The output of each layer is the output of a sigmoid function

➔ Repeatedly multiplying these values (all smaller than 1) drives the product toward zero ➔ Vanishing gradient problem

Back propagation

Back propagation barely changes lower-layer parameters (vanishing gradient)
Breakthrough
- Pre-training + fine tuning
- CNN for reducing redundant parameters
- Rectified linear unit (constant gradient propagation)
- Dropout

Performance evaluation

Accuracy = (TP+TN)/ALL
Precision = TP/(TP+FP)
Recall = TP/(TP+FN)
F1 = 2xPxR/(P+R)
TPR = R = TP/(TP+FN)
TNR = TN/(TN+FP)
False positive error : predict = positive, actual = negative
False negative error : predict = negative, actual = positive
ROC curve : compares classifiers by plotting the true positive rate (TPR) against the false positive rate (FPR = 1 - TNR) across decision thresholds.
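These metrics follow directly from the four confusion-matrix counts. The sketch below uses made-up counts, computes F1 as the harmonic mean 2PR/(P+R), and assumes every denominator is non-zero.

```python
from typing import Dict

def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Compute the standard metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # same as the true positive rate (TPR)
    tnr = tn / (tn + fp)             # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": precision,
        "recall": recall,
        "f1": 2.0 * precision * recall / (precision + recall),
        "tnr": tnr,
        "fpr": 1.0 - tnr,            # x-axis of the ROC curve
    }

# Hypothetical counts for 100 test samples.
metrics = classification_metrics(tp=40, fp=10, tn=45, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

One (FPR, TPR) pair like this is a single point on the ROC curve; sweeping the decision threshold traces out the rest.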

Error measure

The error measure should be specified by the user ➔ Not always given but needs to be carefully considered

Ensemble learning

Predict class label for unseen data by aggregating a set of predictions : different classifiers (experts) learned from the training data
Make a decision by voting
Bagging and boosting : improving decision tree
- By bagging : random forest (bagging of decision trees)
- By boosting : gradient boosting machine as a generalization of AdaBoost
Advantages
- Improves predictive performance; other types of classifiers can be directly included
- Easy to implement; not much parameter tuning needed
Disadvantages
- Not a compact representation

Bagging

Bootstrapping + aggregating (for more robust performance; lower variance)
Train several models in parallel
Bagging works because it reduces variance by voting/averaging (robust to overfitting)
- Most effective when the learning algorithm is unstable, i.e., when small changes to the training set cause large changes in the learned classifier.
- Usually, the more classifiers the better

Boosting

Cascading of weak classifiers ➔ training multiple models in sequence, AdaBoost
- AdaBoost : each classifier is trained on a weighted form of the training set, the weights depend on the performance of the previous classifier, and the classifiers are combined to give the final classifier
Simple and easy, flexible, versatile, non-parametric
No prior knowledge needed about the weak learner

References

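This post is based on what I learned while participating in the LG Aimers program. (It does not cover the full course content.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 2: Supervised Learning (Classification/Regression), Prof. 강제원, Ewha Womans University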

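Understanding Reliability

Tue, 23 Aug 2022 13:19:05 GMT

This post covers the concept of reliability, reliability measures, and several related distributions.
Keyword : reliability, reliability measures, maintainability

Reliability

The probability that a system performs its intended function for a given time under given operating conditions
Importance : total cost must be managed from the perspective of the product life cycle

➔ Quality cost is a latent risk

➔ Market quality should be predictable and controlled at the development stage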

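Quality vs. reliability

Item	Quality	Reliability
Time	Static (product characteristics at the present time)	Dynamic (future performance and failures)
Driven by	Company-wide effort, mainly at the production stage	Team of specialist engineers, at the design and development stages
Data	Complete data	Incomplete data (censored data)
Improvement tools	SPC, RQC, Six Sigma	FMEA, FTA, failure analysis
Test duration	Short	Long
Measures	Defect rate, mean, variance	Failure rate, lifetime, reliability
Distributions	Binomial, normal	Exponential, Weibull, lognormal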

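Reliability analysis

Why it is needed : a paradigm shift
- Causes of failure : weak design, overload, variation in strength and load, wear, time-dependent mechanisms, latent malfunctions, errors

Academic development of reliability engineering

Late 1950s ~ 1960s
- Modeling of failure phenomena with probability distributions, development of statistical analysis methods for the exponential distribution, founding of NASA, development and use of FMEA/FTA
- Establishment of a technical committee on the reliability of equipment and components within the International Electrotechnical Commission, development of statistical analysis methods for the Weibull distribution, development of reliability sampling inspection plans
1970s
- Risk analysis considering reliability and safety, focused on complex systems such as nuclear power plants
- Expansion into equipment maintenance (including RCM) and software reliability
- FTA actively studied and used in aerospace and nuclear power
- Research on shock models and network reliability begins
1980s
- Research and development of design and analysis methodologies for accelerated life tests, application of Bayesian methods to reliability data analysis, research on network reliability
- Common-cause failure models and analysis actively studied for large, complex systems
1990s
- Development of degradation test and analysis methods, highly accelerated life tests and robust design, introduction of design of experiments, accurate decision making through load analysis and interpretation of degradation processes
2000s
- More diverse research in physics of failure, analysis of large-scale systems, application of simulation methodologies, integration with safety and risk analysis
- PHM emerges strongly, driven by the need for failure prediction and diagnosis technology for materials, components, and equipment
21st century
- Active research on reliability analysis and prediction using machine learning, reinforcement learning, and related methods
- The scope of reliability engineering expands to huge network systems ➔ measures such as resilience are defined to describe the ability to respond to temporary, unexpected events
- RAM ➔ extended to RAMS
- Functional safety : becomes standard in industries that require very high-integrity safety protection systems to prevent hazardous events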

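Reliability measures

Reliability : the probability that a component, product, or system performs its required function without failure for a given period under given operating conditions

➔ The probability of no failure during the time interval $$[0, t]$$

Failure rate
- Instantaneous failure rate (hazard rate) : the rate at which a system that has operated up to a given time fails during the following unit of time
- Average failure rate : the number of failures over the total operating time

➔ The probability that a component working at time t fails in the interval $$[t, t+\Delta]$$

Mean time to failure (MTTF) : the mean time until a failure occurs in a non-repairable system
Mean time between failures (MTBF) : the mean operating time between failures in a repairable system
Mean residual life : a measure of how much longer a used product (e.g., a car, ship, or aircraft) can be expected to operate
Maintainability : the probability that repair (maintenance) of a failed system is completed within a specified time under given conditions
Availability : the probability that a repairable system is functioning at a specific point in time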

Reliability data

Lifetime data : judged by whether the item is performing its intended function or has failed (binary data)
Performance data : measurements of product performance over time (continuous data)

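Continuous lifetime distributions

Exponential distribution
- $$f(t) = {\lambda}e^{-\lambda{t}}$$, $$F(t) = 1-e^{-\lambda{t}}$$ (continuous)
- The mean residual life equals the original mean life regardless of the elapsed time t ➔ a product that follows an exponential distribution is always as good as new while it is operating ➔ memoryless property
- Relation to the Poisson process : the exponential distribution is generally used for the time until one event occurs
- Considerations for products that follow an exponential distribution
  - A used product is stochastically as good as new, so there is no reason to replace a working component in advance for preventive maintenance.
  - To estimate the reliability function, mean time to failure, and so on, it is sufficient to collect the total operating time of the components and the number of failures at the time of observation
  - Drenick's theorem : under fairly general conditions, the lifetime distribution of complex equipment or systems approximately follows an exponential distribution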
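The memoryless property can be verified numerically from the reliability function $$R(t)=1-F(t)=e^{-\lambda t}$$. In this sketch the failure rate of 0.001 per hour and the ages used are made-up values.

```python
import math

def reliability(t: float, lam: float) -> float:
    """R(t) = P(T > t) = exp(-lam * t) for an exponential lifetime."""
    return math.exp(-lam * t)

def conditional_reliability(t: float, age: float, lam: float) -> float:
    """P(T > age + t | T > age): survive t more hours given survival up to `age`."""
    return reliability(age + t, lam) / reliability(age, lam)

lam = 0.001          # hypothetical failure rate per hour
mttf = 1.0 / lam     # mean time to failure of the exponential distribution

new_unit = reliability(100.0, lam)
used_unit = conditional_reliability(100.0, age=500.0, lam=lam)
print(mttf, new_unit, used_unit)  # the last two agree up to rounding
```

A unit that has already run 500 hours has the same chance of surviving the next 100 hours as a new one, which is why preventive replacement brings no benefit under this model.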
Gamma distribution (Erlang distribution)
- If independent random variables $$X_1,...,X_k$$ follow an exponential distribution with parameter $$\lambda>0$$, then $$Y = \sum{X_i}$$ follows a gamma distribution.
- $$Y \sim Gamma(k, \lambda)$$ where k is a positive integer, $$\lambda>0$$, $$f(y) = \frac{\lambda^k{y^{k-1}}e^{-\lambda{y}}}{(k-1)!}$$ if $$y>0$$; $$0{\quad}otherwise$$
- A generalization of the exponential distribution (the time until k events have occurred in a Poisson process)
Weibull distribution
- $$X \sim Weibull(\alpha, \beta)$$, $$\alpha>0$$ and $$\beta>0$$, $$f(x) = \beta\alpha^{\beta}x^{\beta-1}e^{-(\alpha{x})^\beta}$$ if $$x>0$$, $$0{\quad}otherwise$$
- If $$\beta=1$$, it is identical to the exponential distribution
- Reliability information : modeling the probability that a product is working over time with a Weibull distribution yields a variety of information about product life
- Weakest-link principle : given several independent and identically distributed non-negative random variables, the distribution of their minimum follows a Weibull distribution
- Mainly used for the lifetime distribution of components or parts
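The Weibull density above implies a reliability function $$R(x)=e^{-(\alpha x)^\beta}$$ and a hazard rate $$h(x)=f(x)/R(x)$$. The sketch below, with illustrative parameter values, shows how the shape parameter β controls whether the hazard falls, stays constant, or rises over time.

```python
import math

def weibull_pdf(x: float, alpha: float, beta: float) -> float:
    """f(x) = beta * alpha^beta * x^(beta-1) * exp(-(alpha*x)^beta) for x > 0."""
    if x <= 0:
        return 0.0
    return beta * alpha ** beta * x ** (beta - 1) * math.exp(-((alpha * x) ** beta))

def weibull_reliability(x: float, alpha: float, beta: float) -> float:
    """R(x) = P(X > x) = exp(-(alpha*x)^beta) for x > 0."""
    if x <= 0:
        return 1.0
    return math.exp(-((alpha * x) ** beta))

def weibull_hazard(x: float, alpha: float, beta: float) -> float:
    """Instantaneous failure rate h(x) = f(x) / R(x)."""
    return weibull_pdf(x, alpha, beta) / weibull_reliability(x, alpha, beta)

alpha = 0.5  # hypothetical scale-type parameter
for beta in (0.5, 1.0, 2.0):
    early = weibull_hazard(1.0, alpha, beta)
    late = weibull_hazard(2.0, alpha, beta)
    print(f"beta={beta}: h(1)={early:.4f}, h(2)={late:.4f}")
```

With β = 1 the hazard is the constant α, which reproduces the exponential case; β < 1 and β > 1 give decreasing and increasing failure rates respectively.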
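Normal distribution
- A symmetric, bell-shaped distribution centered at μ; the variance σ² is the parameter that determines how wide or narrow (dispersed) the distribution is
- Standard normal distribution : a normal distribution with mean 0 and variance 1
- The mean of n samples drawn from a normal distribution follows a normal distribution with mean μ and variance σ²/n; even if the samples are drawn from an arbitrary non-normal distribution, the distribution of the sample mean is approximately normal when n is large
- When the distribution from which the samples are drawn has mean μ and variance σ², and the sample size n is sufficiently large, the sample mean or the sample sum approximately follows a normal distribution
Lognormal distribution
- Widely used as an empirical model for failure data and the like because it can represent distributions of many different shapes
- It can be shown that the lognormal distribution arises when failure is caused by the cumulative effect of multiplicative shocks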

Discrete distributions

Binomial distribution
- Let $$X_1,...,X_n$$ be n independent Bernoulli trials, each equal to 1 with probability p; then $$Y = X_1 + ... + X_n$$ follows a binomial distribution with parameters n and p
- $$P(Y = y) = \binom{n}{y}{p^y}(1-p)^{n-y}$$, $$y = 0,1,2,...,n$$
Poisson distribution
- If events follow a Poisson process with rate λ per unit time, the number of events X occurring in a given time t follows a Poisson distribution with mean μ = λt
- $$P(X = x) = \frac{\mu^{x}e^{-\mu}}{x!}$$, where $$x=0,1,2,...$$
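Both probability mass functions are short enough to write with the standard library alone. The defect probability, failure rate, and observation time below are made-up values for illustration.

```python
import math

def binomial_pmf(y: int, n: int, p: float) -> float:
    """P(Y = y) = C(n, y) * p^y * (1 - p)^(n - y), y = 0, 1, ..., n."""
    return math.comb(n, y) * p ** y * (1.0 - p) ** (n - y)

def poisson_pmf(x: int, mu: float) -> float:
    """P(X = x) = mu^x * exp(-mu) / x!, x = 0, 1, 2, ..."""
    return mu ** x * math.exp(-mu) / math.factorial(x)

# Hypothetical inspection: 10 items, each defective with probability 0.1.
p_no_defect = binomial_pmf(0, n=10, p=0.1)

# Hypothetical process: 0.5 failures per hour observed for 4 hours, so mu = 2.
mu = 0.5 * 4.0
p_no_failure = poisson_pmf(0, mu)
p_at_most_two = sum(poisson_pmf(x, mu) for x in range(3))
print(p_no_defect, p_no_failure, p_at_most_two)
```

Note that both supports start at zero: "no defects" and "no failures" are valid outcomes with non-zero probability.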

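Maintainability

Purpose of maintenance : keep equipment in a condition in which it can be operated safely and economically
Breakdown maintenance : no inspection or periodic replacement is performed; equipment is repaired after it fails
Time based maintenance : a repair interval (theoretical or empirical value) is set using the parameter most proportional to equipment degradation (productivity, number of operations, etc.), and the equipment is repaired unconditionally once it has been used for that interval
Condition based maintenance : the degradation state of the equipment is tracked offline or online from measurement data and its analysis, and the equipment is repaired when the value indicating degradation reaches a predefined threshold
- Purpose, unit/part level ➔ performance degradation state ➔ parameters ➔ parameter measurement method ➔ periodic measurement of equipment anomalies ➔ correlation between parameters and functional degradation ➔ threshold setting ➔ teardown inspection of the actual item ➔ correlation verified ➔ trend management system built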

References

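This post is based on what I learned while participating in the LG Aimers program. (It does not cover the full course content.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 1: Quality and Reliability, Prof. 배석주, Hanyang University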

All About Quality

Tue, 23 Aug 2022 13:17:50 GMT

This post covers the concept and types of quality, as well as quality cost.
Keyword : quality, quality cost, SPC, quality management

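Quality

Conformance to specifications (in traditional quality control)
Consists of product features (which contribute to higher sales revenue) and freedom from defects (which contributes to cost reduction)
Change in the concept

➔ Degree to which requirements are satisfied, or fitness for use; degree to which customer expectations are met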

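Types of quality

Quality of requirements

➔ Quality as an abstract concept, required from the viewpoint of the person using the product/service

Quality of design

➔ The abstract required quality is written down concretely, taking into account the company's quality policy and manufacturing capability

➔ Determined by jointly considering constraints (technology, cost) and the quality and price of competing products

Quality of manufacturing (quality of conformance)

➔ Determined by the variability and uncertainty arising from various sources in the manufacturing system

Quality of use (market quality)

➔ Determined as customers, after using the product/service, perceive satisfaction or dissatisfaction with fulfillment of basic needs, after-sales service, maintenance, reliability, and so on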

Dimensions of quality

Performance, features, reliability, conformance, durability, serviceability, aesthetics, perceived quality

Total quality

Considers the value chain of the manufacturing system to achieve quality that satisfies customers
QCD (quality, cost, delivery) should also be included in total quality

Cost of poor quality (COPQ)

A financial measure of the profit lost unnecessarily within a company; everything that does not contribute to company profit
The target of improvement projects in Six Sigma
Quality cost + hidden cost (Q-Cost + Hidden Cost)

Synergy effects of quality

Consumers' perception of service quality determines the selling price that can be obtained in the market
High quality ➔ higher brand awareness ➔ higher price

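Quality cost

Impact

Opportunity loss arises from the higher customer churn rate caused by dissatisfaction with quality
Reducing customer churn by 5% increases company profit by 25~82%, depending on the industry

Types

Producer quality cost (prevention cost, appraisal cost, failure cost)
- Prevention cost : spent to keep defects from occurring in the first place
- Appraisal cost : spent to maintain the required quality level
- Failure cost : incurred when the required quality level is not maintained
User quality cost (cost borne by the consumer, consumer dissatisfaction cost, loss-of-reputation cost)
Social quality cost

A new concept

Traditional quality concept : outgoing quality is assured by relying on inspection, so securing high quality incurs losses from inspection, rework, and scrap
New quality concept : building a process that does not produce defects avoids the losses from inspection, rework, and scrap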

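Dispersion

Inadequate product design ∩ unstable raw materials ∩ insufficient process capability

Variation

Guidelines for understanding potential variation
- No two sources of variation are identical
- Variation in the product and the process must be measurable
- The result of an individual output cannot be predicted
- The pattern of causes must be systematically tied to specific output characteristics
- Variation is divided into chance (random) causes and assignable causes
Classification by cause of variation
- Assignable cause : due to abnormal factors, traceable to individual factors, large variation
➔ If assignable variation is present in the process, the distribution is unstable over time and unpredictable
- Chance cause : variation inherent to the process that exists even under normal operation, due to many individual factors, small variation
➔ If only chance variation is present in the process, the distribution is stable over time and predictable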
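
One common way to separate the two kinds of variation is a control chart with 3-sigma limits: points inside the limits are treated as chance variation, and points outside are candidates for an assignable cause. The sketch below is a simplification under my own assumptions: the measurements are made up, the function names are hypothetical, and the limits use the plain sample standard deviation rather than the moving-range or subgroup estimates used in practice.

```python
import statistics


def control_limits(baseline):
    """Return (lower, center, upper) 3-sigma limits estimated from in-control data."""
    center = statistics.mean(baseline)
    sigma = statistics.stdev(baseline)
    return center - 3 * sigma, center, center + 3 * sigma


def out_of_control(values, lower, upper):
    """Indices of points outside the limits (candidates for assignable causes)."""
    return [i for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical measurements taken while the process was running normally.
baseline = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
lower, center, upper = control_limits(baseline)

# Hypothetical new measurements to monitor.
new_points = [10.0, 10.1, 11.5, 9.9]
flagged = out_of_control(new_points, lower, upper)
print(flagged)
```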

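SPC (Statistical Process Control)

A management method that uses statistical techniques to run the process efficiently so that it meets its quality and productivity targets
Advantages
- Effective at preventing defects; prevents unnecessary process adjustments
- Provides information on process capability; a proven technique for improving productivity
- Works with both variables (measured) data and attributes (counted) data
Disadvantages
- Requires accurate data collection and the right type of control chart
- Requires correct analysis of the control chart and appropriate action on the patterns
- Everyone involved needs training
Statistical techniques used in SPC : mean, variance and probability distributions, control charts and process capability indices, the seven basic QC tools

➔ Seven basic QC tools : Pareto chart, cause-and-effect diagram, check sheet, histogram, scatter diagram, graph, control chart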
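
The process capability indices mentioned above are usually written as Cp = (USL - LSL) / 6σ and Cpk = min(USL - μ, μ - LSL) / 3σ. A minimal sketch, assuming made-up measurements and made-up specification limits (`lsl`, `usl`); the function name is hypothetical:

```python
import statistics


def process_capability(samples, lsl, usl):
    """Return (Cp, Cpk) for the given lower/upper specification limits."""
    mu = statistics.mean(samples)
    sigma = statistics.stdev(samples)
    cp = (usl - lsl) / (6 * sigma)
    cpk = min(usl - mu, mu - lsl) / (3 * sigma)
    return cp, cpk


# Hypothetical measurements and specification limits.
samples = [10.1, 9.9, 10.0, 10.2, 9.8, 10.0, 10.1, 9.9]
cp, cpk = process_capability(samples, lsl=9.4, usl=10.6)

# Same data, but the spec window is no longer centered on the process mean.
cp_shifted, cpk_shifted = process_capability(samples, lsl=9.7, usl=10.6)
print(round(cp, 2), round(cpk, 2), round(cp_shifted, 2), round(cpk_shifted, 2))
```

Cp only compares the spec width to the process spread, while Cpk also penalizes a mean that sits off-center, which is why the second pair differs.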

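Quality 4.0 and smart quality management

Total quality management

➔ An organization-wide effort, centered on quality and involving every part of the company, to deliver excellent products/services to customers with long-term success as the goal

Quality 4.0
- Big data : volume, variety, velocity, veracity
- Analytics : descriptive, diagnostic, predictive, prescriptive
- Connectivity : IoT makes it possible to keep workers, products, equipment, and processes connected in real time

➔ A quality management system that, through ICT convergence, has moved away from after-the-fact inspection and assurance toward proactive defect prediction and maintenance using big data collected and analyzed in advance

Ways to innovate with smart quality management
- Real-time community feedback, remote diagnosis and maintenance, advanced supply-chain quality management

Smart quality management using big data

Developing quality prediction and defect-factor analysis algorithms for a process monitoring system
- Derive indicators that capture the correlation between process variables and quality measurements (cluster analysis of the raw process variables ➔ variable selection ➔ contribution analysis)
- Build a virtual metrology system that predicts quality measurements from process variables (regression analysis ➔ variable selection ➔ principal component regression and partial least squares regression)
- Anomaly detection and diagnostic monitoring techniques for the process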
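
To make the virtual metrology idea a little more concrete, here is a small sketch of principal component regression (PCR) with NumPy. Everything in it is an assumption for illustration: the data are synthetic (four strongly correlated "process variables" driven by two latent factors), and `pcr_fit` is a hypothetical helper, not something from the course.

```python
import numpy as np


def pcr_fit(X, y, n_components):
    """Principal component regression: regress y on the top principal-component scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc = X - x_mean
    # Principal directions are the right singular vectors of the centered data.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    components = vt[:n_components].T              # (n_features, n_components)
    scores = Xc @ components                      # (n_samples, n_components)
    gamma, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    coef = components @ gamma                     # back to the original feature space
    intercept = y_mean - x_mean @ coef
    return coef, intercept


# Synthetic data: two latent factors, each observed through two nearly identical sensors.
rng = np.random.default_rng(0)
t = rng.normal(size=(200, 2))
X = np.column_stack([
    t[:, 0], t[:, 0] + 0.01 * rng.normal(size=200),
    t[:, 1], t[:, 1] + 0.01 * rng.normal(size=200),
])
y = 2.0 * t[:, 0] - 1.0 * t[:, 1] + 0.05 * rng.normal(size=200)

coef, intercept = pcr_fit(X, y, n_components=2)
pred = X @ coef + intercept
r2 = 1 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(coef.shape, round(float(r2), 3))
```

Regressing on a few principal components sidesteps the near-collinearity between sensors that would make ordinary least squares unstable.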

Smart factory

An intelligent digital system that responds proactively to fast, dynamic market changes while taking the environment into account and ensuring safety

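Areas for improving quality control

Preventive quality control

➔ Designed to assure product quality throughout the entire process

Reactive quality control

➔ Quality control after the product has been sold

Quality culture : agreement has to be reached through collaboration and dialogue among the departments in the company

Quality vs. reliability

➔ Go to post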

References

This post is based on material studied in the LG Aimers program. (It does not cover the full course.)

➔ Go to LG Aimers

[1] LG Aimers AI Essential Course, Module 1: Quality and Reliability, Prof. Suk Joo Bae (배석주), Hanyang University

[Dacon] Speech Classification Competition

Tue, 23 Aug 2022 13:07:59 GMT

Speech classification

This post covers data augmentation and feature extraction for speech data.
The code was run on Google Colab with a CPU and standard RAM.
Keyword : speech classification, classification, data-augmentation, feature-extraction ➔ Read on Dacon

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

0. Import Packages

import numpy as np
import pandas as pd
import random as rn
import os

from scipy.io import wavfile
import librosa

import matplotlib.pyplot as plt
import seaborn as sns
import IPython.display as ipd
import librosa.display

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

# reproducibility

def all_seed(seed_num):
    np.random.seed(seed_num)
    rn.seed(seed_num)
    os.environ['PYTHONHASHSEED']=str(seed_num)
    # tf.random.set_seed(seed_num)

seed_num = 42
all_seed(seed_num)

1. Load and explore dataset

train = pd.read_csv('/content/drive/MyDrive/Speech_classification/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Speech_classification/test.csv')

📝 Let's look at the waveplot of one recording.

a_filename = '/content/drive/MyDrive/Speech_classification/dataset/train/001.wav'
samples, sample_rate = librosa.load(a_filename)

plt.figure(figsize=(10, 7))

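# Plot with the sampling rate returned by librosa.load (22050 Hz by default), not a hard-coded value.
# Note: newer librosa releases replace waveplot with librosa.display.waveshow.
librosa.display.waveplot(samples, sr=sample_rate)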

plt.xlabel('time', fontsize = 14)
plt.ylabel('amplitude', fontsize = 14)
plt.title('001.wav | Length : ' + str(len(samples)))

plt.show()

print(sample_rate)
print(samples)

22050
[0.00013066 0.00016804 0.00014106 ... 0.00017342 0.00017514 0.        ]

📝 Next, build a spectrogram for the same recording.
↪ The magnitude of the short-time Fourier transform (STFT) is converted to the dB scale to produce the spectrogram.

samples, sample_rate = librosa.load(a_filename)
X = librosa.stft(samples)  # data -> short term FT
Xdb = librosa.amplitude_to_db(abs(X))

plt.figure(figsize=(12, 3))
plt.title('001.wav spectrogram | Length : ' + str(len(samples)))
librosa.display.specshow(Xdb, sr = sample_rate, x_axis='time', y_axis='hz')   
plt.colorbar()
plt.show()
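
For intuition, the magnitude-to-dB step is essentially 20·log10 of the magnitude with a floor. The sketch below is not librosa's implementation; it is a plain-Python approximation that assumes the defaults I recall from the `librosa.amplitude_to_db` documentation (`ref=1.0`, `amin=1e-5`, `top_db=80.0`), and the input magnitudes are made up.

```python
import math


def amplitude_to_db(magnitudes, ref=1.0, amin=1e-5, top_db=80.0):
    """Convert magnitudes to dB relative to ref, clipping at top_db below the peak."""
    db = [20.0 * math.log10(max(amin, m) / ref) for m in magnitudes]
    floor = max(db) - top_db  # nothing is allowed to fall more than top_db below the maximum
    return [max(value, floor) for value in db]


# Hypothetical STFT magnitudes: full scale, one tenth of full scale, and silence.
db_values = amplitude_to_db([1.0, 0.1, 0.0])
print(db_values)
```

The floor is why silent bins show up as a flat minimum color in the spectrogram instead of minus infinity.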

📝 train.csv contains the file name of each recording in the train folder and a label column. The label column holds integers from 0 to 9.

train.head()

  file_name  label
0   001.wav      9
1   002.wav      0
2   004.wav      1
3   005.wav      8
4   006.wav      0

print(train['label'].unique())

[9 0 1 8 7 4 5 2 6 3]

📝 The classes are balanced.

plt.figure(figsize=(12, 8))
sns.countplot(x=train['label'])

plt.title("The number of recordings for each label")
plt.ylabel("Count", fontsize = 14)
plt.xlabel("Label", fontsize = 14)
plt.show()

file_name = train['file_name']
train_path = '/content/drive/MyDrive/Speech_classification/dataset/train/'

📝 The recordings all have different lengths.

all_shape = []
for f in file_name:
  data, sample_rate = librosa.load(train_path + f, sr = 20000)
  all_shape.append(data.shape)

print(all_shape[:5])
print("Max :", np.max(all_shape, axis = 0))
print("Min :", np.min(all_shape, axis = 0))

[(12740,), (13126,), (12910,), (9753,), (17572,)]
Max : [19466]
Min : [7139]

2. Data augmentation

📝 New recordings are generated by adding perturbations to the original audio, to improve the model's ability to generalize.
↪ Noise injection, time stretching, pitch shifting

# add noise
def noise(sample):
    noise_amp = 0.01*np.random.uniform()*np.amax(sample)
    sample = sample + noise_amp*np.random.normal(size = sample.shape[0])
    return sample

# time stretching
def stretch(sample, rate = 0.8):
    # keyword argument: newer librosa versions no longer accept rate positionally
    stretch_sample = librosa.effects.time_stretch(sample, rate=rate)
    return stretch_sample

# pitch shifting (pitch_factor is passed to librosa as n_steps, i.e. semitones)
def pitch(sample, sampling_rate, pitch_factor = 0.8):
    pitch_sample = librosa.effects.pitch_shift(sample, sr=sampling_rate, n_steps=pitch_factor)
    return pitch_sample
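
To get a feel for how mild the noise augmentation is, the sketch below applies the same idea to a synthetic tone and measures the signal-to-noise ratio. It is an illustration under my own assumptions: the 440 Hz sine stands in for a real recording, `add_noise` is a hypothetical variant that takes an explicit random generator and scales by the peak absolute amplitude (the function above uses `np.amax(sample)` and the global NumPy state).

```python
import numpy as np

rng = np.random.default_rng(42)


def add_noise(sample, rng, max_ratio=0.01):
    """Add Gaussian noise whose scale is at most max_ratio of the peak amplitude."""
    noise_amp = max_ratio * rng.uniform() * np.max(np.abs(sample))
    return sample + noise_amp * rng.normal(size=sample.shape[0])


sr = 22050
t = np.arange(sr) / sr                           # one second of audio
clean = 0.5 * np.sin(2 * np.pi * 440.0 * t)      # synthetic stand-in for a recording
noisy = add_noise(clean, rng)

# Signal-to-noise ratio in dB: higher means the perturbation is smaller.
snr_db = 10 * np.log10(np.sum(clean ** 2) / np.sum((noisy - clean) ** 2))
print(noisy.shape, round(float(snr_db), 1))
```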

3. Feature Extraction

📝 Here are a few feature extraction methods that can help with modeling.

1. Zero Crossing Rate (ZCR)

↪ The rate at which the signal changes sign within a frame

2. Chroma_stft

↪ A chromagram computed from a waveform or power spectrogram.

3. Mel spectrum

↪ Audio signal (time domain) + Fast Fourier Transform (FFT) -> Spectrum (frequency domain)

↪ Spectrum + filtering (Mel filter bank) -> Mel spectrum

4. MFCC (Mel-Frequency Cepstral Coefficient)

↪ Extracts characteristic features from the Mel spectrum through cepstral analysis

5. RMS (Root Mean Square)

↪ Measures the average loudness of the audio
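
The two simplest features in the list, ZCR and RMS, can be written out by hand for a single frame. This is a simplified sketch, not how librosa computes them (librosa frames and pads the signal first); the function names and the toy frame are my own.

```python
import math


def zero_crossing_rate(frame):
    """Fraction of the frame length at which consecutive samples change sign."""
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return crossings / len(frame)


def rms(frame):
    """Root mean square of the frame, a rough measure of loudness."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


# Toy frame that flips sign at every step.
frame = [1.0, -1.0, 1.0, -1.0]
print(zero_crossing_rate(frame), rms(frame))
```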

def extract_features(sample, sample_rate = 22050):
    # sample_rate defaults to librosa.load's default (22050 Hz), so the features no longer
    # depend on a global sample_rate left over from an earlier cell.
    result = np.array([])

    # ZCR
    zcr = np.mean(librosa.feature.zero_crossing_rate(y = sample).T, axis=0)
    result = np.hstack((result, zcr))

    # Chroma_stft
    stft = np.abs(librosa.stft(sample))
    chroma_stft = np.mean(librosa.feature.chroma_stft(S = stft, sr = sample_rate).T, axis=0)
    result = np.hstack((result, chroma_stft))

    # Mel spectrogram
    mel = np.mean(librosa.feature.melspectrogram(y = sample, sr = sample_rate).T, axis=0)
    result = np.hstack((result, mel))

    # MFCC
    mfcc = np.mean(librosa.feature.mfcc(y = sample, sr = sample_rate).T, axis=0)
    result = np.hstack((result, mfcc))

    # Root Mean Square value
    rms = np.mean(librosa.feature.rms(y = sample).T, axis=0)
    result = np.hstack((result, rms))

    return result

📝 With noise injection and time stretching plus pitch shifting, the (1, 162) feature vector of each recording is augmented to (3, 162).

def get_features(path):

    sample, sample_rate = librosa.load(path)

    # without augmentation
    res1 = extract_features(sample)
    result = np.array(res1)

    # sample with noise
    noise_sample = noise(sample)
    res2 = extract_features(noise_sample)
    result = np.vstack((result, res2)) 

    # sample with stretching and pitching
    str_sample = stretch(sample)
    sample_stretch_pitch = pitch(str_sample, sample_rate)
    res3 = extract_features(sample_stretch_pitch)
    result = np.vstack((result, res3)) 

    return result

labels = train['label']
x, y = [], []
for f, label in zip(file_name, labels):
    feature = get_features(train_path + f)
    for fe in feature:
        x.append(fe)
        y.append(label)

X = np.array(x)
Y = np.array(y)

print("Shape of X:", np.shape(X))
print("Shape of Y:", np.shape(Y))

Shape of X: (1200, 162)
Shape of Y: (1200,)
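
The shapes above can be checked with simple arithmetic. The per-feature sizes below assume librosa's default settings (12 chroma bins, 128 mel bands, 20 MFCCs), and the clip count is inferred from the printed shape rather than stated in the post.

```python
# Number of values each feature contributes after averaging over time
# (assumes librosa defaults: n_chroma=12, n_mels=128, n_mfcc=20).
feature_sizes = {"zcr": 1, "chroma_stft": 12, "mel_spectrogram": 128, "mfcc": 20, "rms": 1}
n_features = sum(feature_sizes.values())

versions_per_clip = 3          # original, noisy, stretched + pitch-shifted
n_rows = 1200                  # from the printed shape of X
n_clips = n_rows // versions_per_clip

print(n_features, n_clips)
```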

Reference

[1] Speech Emotion Recognition by SHIVAM BURNWAL, https://www.kaggle.com/code/shivamburnwal/speech-emotion-recognition

Thanks for reading :)
I hope it was helpful 👍👍

[Dacon] Used Car Price Prediction Competition

Tue, 23 Aug 2022 13:05:16 GMT

Used car price prediction

This post covers feature engineering and an ensemble (catboost, random forest, gradient boosting).
The code was run on Google Colab with a CPU and standard RAM.
Keyword : used car price prediction, regression, catboost, randomforest, gradientboosting, ensemble, pycaret ➔ Read on Dacon

0. Import Packages

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install h5py
!pip install typing-extensions
!pip install wheel
!pip install folium==0.2.1
!pip install markupsafe==2.0.1
!pip install -U pandas-profiling
!pip install catboost
!pip install pycaret==2.3.10 markupsafe==2.0.1 pyyaml==5.4.1 -qq

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import pandas_profiling
import seaborn as sns
import random as rn
import os
import scipy.stats as stats
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder

import warnings
from collections import Counter
from pycaret.regression import *

%matplotlib inline
warnings.filterwarnings(action='ignore')

print("numpy version: {}". format(np.__version__))
print("pandas version: {}". format(pd.__version__))
print("matplotlib version: {}". format(matplotlib.__version__))
print("scikit-learn version: {}". format(sklearn.__version__))

numpy version: 1.21.6
pandas version: 1.3.5
matplotlib version: 3.2.2
scikit-learn version: 0.23.2

# reproducibility
seed_num = 42 
np.random.seed(seed_num)
rn.seed(seed_num)
os.environ['PYTHONHASHSEED']=str(seed_num)

1. Load and Check Dataset

train = pd.read_csv('/content/drive/MyDrive/Forecasting_price/dataset/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Forecasting_price/dataset/test.csv')

print(train.shape)
train.head()

(1015, 11)

   id                          title  odometer location    isimported  \
0   0                   Toyota RAV 4     18277   Lagos   Foreign Used   
1   1            Toyota Land Cruiser        10    Lagos          New    
2   2  Land Rover Range Rover Evoque     83091    Lagos  Foreign Used   
3   3                   Lexus ES 350     91524    Lagos  Foreign Used   
4   4                   Toyota Venza     94177    Lagos  Foreign Used   

           engine transmission    fuel  paint  year    target  
0  4-cylinder(I4)    automatic  petrol    Red  2016  13665000  
1  4-cylinder(I4)    automatic  petrol  Black  2019  33015000  
2  6-cylinder(V6)    automatic  petrol    Red  2012   9915000  
3  4-cylinder(I4)    automatic  petrol   Gray  2007   3815000  
4  6-cylinder(V6)    automatic  petrol    Red  2010   7385000

pr = train.profile_report()
pr.to_file('/content/drive/MyDrive/Forecasting_price/pr_report.html')
pr

Summary of Pandas profiling : Alert

High correlation

odometer-year-target-paint-fuel-transmission-engine

High cardinality

title, paint

↪ Columns with many distinct values (few repeats)

High skewness

Skewness of year : -21.68

`odometer` has 21 zeros

↪ 21 used cars (2.1%) have an odometer reading of 0
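
A strongly negative skewness such as the -21.68 reported for year usually means a few values sit far below the rest. The sketch below computes the adjusted Fisher-Pearson skewness, which I believe is the bias-corrected form that pandas' `skew()` reports; the `years` list is made up to show the effect of a single very old car.

```python
import math


def adjusted_skewness(values):
    """Adjusted Fisher-Pearson sample skewness (bias-corrected)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    g1 = m3 / m2 ** 1.5
    return math.sqrt(n * (n - 1)) / (n - 2) * g1


# Hypothetical model years: mostly recent, plus one very old outlier.
years = [2010, 2011, 2012, 2013, 2014, 1980]
print(round(adjusted_skewness(years), 2))
```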

2. EDA

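id : sample ID, title : manufacturer and model name, odometer : mileage

location : where the car is sold (city in Nigeria), isimported : whether the car was used locally (Foreign Used / New / Locally used)

engine : engine type, transmission : transmission type

fuel : fuel type, paint : paint color, year : year of manufacture, target : car price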

Data type

Numeric (4) : id, odometer, year, target
Categorical (7) : title, location, isimported, engine, transmission, fuel, paint

train.isnull().sum()

id              0
title           0
odometer        0
location        0
isimported      0
engine          0
transmission    0
fuel            0
paint           0
year            0
target          0
dtype: int64

test.isnull().sum()

id              0
title           0
odometer        0
location        0
isimported      0
engine          0
transmission    0
fuel            0
paint           0
year            0
dtype: int64

📝 There are no missing values.

df_train = train.copy()
df_test = test.copy()

2-(1). Outliers

fig, ax = plt.subplots(1, 2, figsize=(18,5))
g = sns.histplot(df_train['odometer'], color='b', label='Skewness : {:.2f}'.format(df_train['odometer'].skew()), ax=ax[0])
g.legend(loc='best', prop={'size': 16})
g.set_xlabel("Odometer", fontsize = 16)
g.set_ylabel("Count", fontsize = 16)

g = sns.histplot(df_train['year'], color='b', label='Skewness : {:.2f}'.format(df_train['year'].skew()), ax=ax[1])
g.legend(loc='best', prop={'size': 16})
g.set_xlabel("Year", fontsize = 16)
g.set_ylabel("Count", fontsize = 16)
plt.show()

numeric_fts = ['odometer', 'year']
outlier_ind = []
for i in numeric_fts:
  Q1 = np.percentile(df_train[i],25)
  Q3 = np.percentile(df_train[i],75)
  IQR = Q3-Q1
  outlier_list = df_train[(df_train[i] < Q1 - IQR * 1.5) | (df_train[i] > Q3 + IQR * 1.5)].index
  outlier_ind.extend(outlier_list)

# Drop outliers
train_df = df_train.drop(outlier_ind, axis = 0).reset_index(drop = True)
train_df

       id                          title  odometer location    isimported  \
0       0                   Toyota RAV 4     18277   Lagos   Foreign Used   
1       1            Toyota Land Cruiser        10    Lagos          New    
2       2  Land Rover Range Rover Evoque     83091    Lagos  Foreign Used   
3       3                   Lexus ES 350     91524    Lagos  Foreign Used   
4       4                   Toyota Venza     94177    Lagos  Foreign Used   
..    ...                            ...       ...      ...           ...   
970  1010                 Toyota Corolla     46768    Lagos  Foreign Used   
971  1011                   Toyota Camry     31600    Abuja  Foreign Used   
972  1012                   Toyota Camry     96802    Abuja  Foreign Used   
973  1013                   Lexus GX 460    146275    Lagos  Foreign Used   
974  1014                         DAF CF         0    Lagos  Locally used   

             engine transmission    fuel   paint  year    target  
0    4-cylinder(I4)    automatic  petrol     Red  2016  13665000  
1    4-cylinder(I4)    automatic  petrol   Black  2019  33015000  
2    6-cylinder(V6)    automatic  petrol     Red  2012   9915000  
3    4-cylinder(I4)    automatic  petrol    Gray  2007   3815000  
4    6-cylinder(V6)    automatic  petrol     Red  2010   7385000  
..              ...          ...     ...     ...   ...       ...  
970  4-cylinder(I4)    automatic  petrol   Black  2014   5415000  
971  4-cylinder(I4)    automatic  petrol  Silver  2011   3615000  
972  4-cylinder(I4)    automatic  petrol   Black  2011   3415000  
973  6-cylinder(V6)    automatic  petrol    Gold  2013  14315000  
974  6-cylinder(V6)       manual  diesel   white  1998  10015000  

[975 rows x 11 columns]

fig, ax = plt.subplots(1, 2, figsize=(18,5))
g = sns.histplot(train_df['odometer'], color='b', label='Skewness : {:.2f}'.format(train_df['odometer'].skew()), ax=ax[0])
g.legend(loc='best', prop={'size': 16})
g.set_xlabel("Odometer", fontsize = 16)
g.set_ylabel("Count", fontsize = 16)

g = sns.histplot(train_df['year'], color='b', label='Skewness : {:.2f}'.format(train_df['year'].skew()), ax=ax[1])
g.legend(loc='best', prop={'size': 16})
g.set_xlabel("Year", fontsize = 16)
g.set_ylabel("Count", fontsize = 16)
plt.show()

📝 Removing the outliers reduced the skewness.

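print("# outliers to drop :", len(outlier_ind))

# outliers to drop : 44

📝 outlier_ind holds 44 entries, but only 40 rows are dropped (1015 -> 975) because a row flagged in both odometer and year is listed twice.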
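
The IQR rule used above can be sketched without pandas. The version below collects the flagged row indices in a set, so a row that is an outlier in more than one column is counted once. The toy columns are made up, and `percentile` is a hypothetical helper that follows the linear-interpolation convention of NumPy's default.

```python
def percentile(values, q):
    """Linear-interpolation percentile (same convention as NumPy's default)."""
    data = sorted(values)
    pos = (len(data) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(data) - 1)
    return data[lo] + (data[hi] - data[lo]) * (pos - lo)


def iqr_outlier_indices(values, k=1.5):
    """Indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = percentile(values, 25), percentile(values, 75)
    iqr = q3 - q1
    return {i for i, v in enumerate(values) if v < q1 - k * iqr or v > q3 + k * iqr}


# Toy data: the last row is extreme in both columns.
rows = {
    "odometer": [10, 20, 30, 40, 50, 60, 70, 1000],
    "year": [2010, 2011, 2012, 2013, 2014, 2015, 2016, 1900],
}
to_drop = set()
for column in rows.values():
    to_drop |= iqr_outlier_indices(column)

print(sorted(to_drop))
```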

2-(2). Correlation

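📝 The correlation coefficients are computed for the columns flagged in the pandas profiling report alerts above.

📝 The categorical columns are converted to numbers with a label encoder before checking the correlations.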

cat_fts = ['title', 'location', 'isimported', 'engine', 'transmission', 'fuel', 'paint']

la_train = train_df.copy()

for i in range(len(cat_fts)):
  encoder = LabelEncoder()
  la_train[cat_fts[i]] = encoder.fit_transform(la_train[cat_fts[i]])

plt.figure(figsize = (10,8))
sns.heatmap(la_train[['odometer', 'year', 'paint', 'fuel', 'transmission', 'engine', 'target']].corr(), annot=True)
plt.show()
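
What the heatmap measures for a label-encoded column can be sketched by hand. The example is hypothetical (made-up fuel values and prices), and the helper names are my own. One caveat worth keeping in mind: for a column with more than two categories, the integer codes follow alphabetical order, so the size and sign of the correlation depend on that arbitrary ordering.

```python
import math


def label_encode(values):
    """Map each category to an integer by sorted order (the convention LabelEncoder uses)."""
    lookup = {c: i for i, c in enumerate(sorted(set(values)))}
    return [lookup[v] for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


# Hypothetical rows: diesel cars priced higher than petrol cars.
fuel = ["petrol", "diesel", "petrol", "diesel"]
target = [1.0, 3.0, 1.5, 3.5]

codes = label_encode(fuel)      # diesel -> 0, petrol -> 1
r = pearson(codes, target)
print(codes, round(r, 3))
```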

3. Feature Engineering

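3-(1). Creating the `company` column

📝 Every title value begins with the name of the car maker.

📝 The split function extracts the maker name at the first space and stores it in a new column. (Multi-word makers are cut at the first space, so Land Rover becomes Land.)
📝 The company column is then binned into classes based on the target values in the training data.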

print(train_df['title'].unique()[:20])

['Toyota RAV 4' 'Toyota Land Cruiser' 'Land Rover Range Rover Evoque'
 'Lexus ES 350' 'Toyota Venza' 'Toyota Corolla'
 'Land Rover Range Rover Sport' 'Pontiac Vibe' 'Toyota Tacoma'
 'Lexus RX 350' 'Ford Escape' 'Honda Civic' 'Volvo XC90' 'BMW 750'
 'Infiniti JX' 'Honda Accord' 'Mercedes-Benz ML 350' 'Toyota Camry'
 'Hyundai Azera' 'Lexus GX 460']

train_df['company'] = train_df['title'].apply(lambda x : x.split(" ")[0])
df_test['company'] = df_test['title'].apply(lambda x : x.split(" ")[0])

print(train_df['company'].unique())
print("#fts :", len(train_df['company'].unique()), '\n')
print(df_test['company'].unique())
print("#fts :", len(df_test['company'].unique()), '\n')

['Toyota' 'Land' 'Lexus' 'Pontiac' 'Ford' 'Honda' 'Volvo' 'BMW' 'Infiniti'
 'Mercedes-Benz' 'Hyundai' 'Jaguar' 'Mitsubishi' 'Nissan' 'Chevrolet'
 'Mazda' 'Lincoln' 'Kia' 'Acura' 'DAF' 'Man' 'Isuzu' 'IVM' 'Porsche'
 'MINI' 'GMC' 'Iveco' 'Scania' 'Volkswagen' 'GAC' 'IVECO' 'Mack' 'Peugeot'
 'Rolls-Royce' 'MAN-VOLKSWAGEN' 'Jeep' 'ALPINA' 'Bentley' 'JMC']
#fts : 39 

['Mercedes-Benz' 'Honda' 'Toyota' 'Iveco' 'Lexus' 'Nissan' 'Volkswagen'
 'Jeep' 'Ford' 'BMW' 'Mack' 'Land' 'Hyundai' 'Peugeot' 'Volvo' 'Infiniti'
 'Acura' 'Man' 'Fiat' 'MINI' 'DAF' 'Mazda' 'Porsche' 'Mitsubishi'
 'Chevrolet' 'Kia' 'Pontiac' 'Rolls-Royce']
#fts : 28

plt.figure(figsize = (20,8))
g = sns.barplot(x = 'company', y = 'target', data = train_df)

for p in g.patches:
    left, bottom, width, height = p.get_bbox().bounds
    g.annotate("%.1f"%(height/1e6), (left+width/2, height*1.01), ha='center')

g.set_xlabel("company", fontsize = 16)
g.set_ylabel("target", fontsize = 16)

plt.xticks(rotation=90)
plt.show()

company_h = np.zeros((len(g.patches)))
i = 0
for p in g.patches:
    left, bottom, width, height = p.get_bbox().bounds
    company_h[i] = (height/1e6)
    i +=1

company_h

array([  6.37849032,  29.39868421,  14.08227273,   2.715     ,
         6.31845588,   4.39417308,   4.15571429,  15.279     ,
        16.16      ,  13.37352941,   3.89282609,   2.665     ,
         3.42      ,   1.98666667,   7.233     ,   2.07875   ,
         4.415     ,   2.81785714,   4.082     ,   8.515     ,
        10.265     ,   4.015     ,   2.89      ,  14.265     ,
         5.54      ,   5.515     ,  10.015     ,   7.93      ,
         2.09409091,   1.49      ,   6.015     ,   8.015     ,
         2.125     , 150.015008  ,   6.34      ,   2.515     ,
         9.065     ,  28.015     ,   9.365     ])

companys = train_df['company'].unique()

def company_fix(train_df, df, companys):
  # companies that appear only in df (e.g. only in the test set) have no mean price from training
  only_test_com = list(set(df['company'])-set(train_df['company']))

  if len(only_test_com) != 0:
    print(only_test_com)
    for k in range(len(only_test_com)):
      df.loc[(df['company'] == only_test_com[k]), 'company'] = 1   # assign the lowest class

  # classes 1-7 from the mean target per company (company_h, in millions): bins of width 5, class 7 is >= 30
  for c in range(7):
    if c==6:
      company_ind = companys[np.where(company_h>=c*5)]
    elif c==0:
      company_ind = companys[np.where(company_h<(c+1)*5)]
    else:  
      company_ind = companys[np.where((company_h>=c*5)&(company_h<(c+1)*5))]

    for i in range(len(company_ind)):
      df.loc[(df['company'] == company_ind[i]), 'company'] = c+1

copy_train = train_df.copy()

company_fix(copy_train, train_df, companys)

company_fix(copy_train, df_test, companys)

['Fiat']

train_df['company'].unique()

array([2, 6, 3, 1, 4, 7], dtype=object)

df_test['company'].unique()

array([3, 1, 2, 4, 6, 7], dtype=object)
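
The binning rule in company_fix boils down to "floor of the mean price divided by 5 million, capped at class 7, with unseen companies falling back to class 1". A compact sketch of that rule follows; `price_class` and `encode` are hypothetical helper names, and the mean prices are rounded values read off the company_h output above for five of the makers.

```python
def price_class(mean_price_millions, width=5.0, n_classes=7):
    """Class 1 for < 5M, class 2 for 5M to < 10M, ..., class 7 for >= 30M."""
    return min(int(mean_price_millions // width), n_classes - 1) + 1


# Mean target per company in millions (rounded from the company_h output above).
company_mean = {
    "Toyota": 6.378,
    "Land": 29.399,
    "Lexus": 14.082,
    "Pontiac": 2.715,
    "Rolls-Royce": 150.015,
}
company_class = {name: price_class(price) for name, price in company_mean.items()}


def encode(company, default=1):
    """Companies not seen in training (e.g. Fiat) fall back to the lowest class."""
    return company_class.get(company, default)


print(company_class, encode("Fiat"))
```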

3-(2). `paint`

📝 The paint column is messy, so let's clean it up.

print(sorted(train.paint.unique()))

[' Black', ' Black/Red', 'Ash', 'Ash and black', 'BLACK', 'Beige', 'Black', 'Black ', 'Black and silver', 'Black sand pearl', 'Black.', 'Blue', 'Blue ', 'Brown', 'Cream', 'Cream ', 'DARK GREY', 'Dark Ash', 'Dark Blue', 'Dark Green', 'Dark Grey', 'Dark ash', 'Dark blue ', 'Dark gray', 'Dark silver ', 'Deep Blue', 'Deep blue', 'GOLD', 'Gery', 'Gold', 'Gold ', 'Gray', 'Gray ', 'Green', 'Green ', 'Grey', 'Grey ', 'Ink blue', 'Light Gold', 'Light blue', 'Light silver ', 'Magnetic Gray', 'Magnetic Gray Metallic', 'Maroon', 'Midnight Black Metal', 'Milk', 'Navy blue', 'Off white', 'Off white l', 'Pale brown', 'Purple', 'Red', 'Redl', 'SILVER', 'Silver', 'Silver ', 'Silver/grey', 'Sky blue', 'Skye blue', 'Sliver', 'Super White', 'WHITE', 'WINE', 'Whine ', 'White', 'White ', 'White orchild pearl', 'Wine', 'Yellow', 'blue', 'green', 'orange', 'red', 'white', 'white-blue', 'yellow']

def color_handling(x):
  x['paint'] = x['paint'].str.strip()   # remove leading/trailing whitespace
  x['paint'] = x['paint'].str.lower()    # convert to lower case
  x['paint'] = x['paint'].str.replace(".", "", regex=False)   # drop literal periods; "." would otherwise be a regex wildcard

color_handling(train_df)
color_handling(df_test)

train_df['paint'].unique()

array(['red', 'black', 'gray', 'white', 'blue', 'redl', 'silver',
       'black/red', 'deep blue', 'dark grey', 'brown', 'grey', 'green',
       'purple', 'gold', 'dark blue', 'milk', 'midnight black metal',
       'beige', 'dark ash', 'cream', 'dark gray', 'white orchild pearl',
       'dark green', 'yellow', 'sliver', 'wine', 'white-blue',
       'magnetic gray', 'dark silver', 'silver/grey', 'ink blue',
       'light blue', 'sky blue', 'gery', 'pale brown', 'whine',
       'black and silver', 'light silver', 'black sand pearl',
       'off white', 'ash', 'maroon', 'navy blue', 'super white',
       'ash and black', 'magnetic gray metallic', 'skye blue',
       'off white l'], dtype=object)

skye blue -> sky blue
dark ash, dark grey, dark silver, ash and black, black and silver -> dark gray
gery, grey,ash, magnetic gray metallic, magnetic gray, gray metallic, silver/grey, sliver, silver -> gray
off white l, off white, super white, white orchild pearl -> white
redl, maroon -> red
whine -> wine
ink blue, deep blue, navy blue -> dark blue
sky blue, white-blue -> light blue
black sand pearl, midnight black metal -> black
pale brown -> brown
milk -> cream

def color_fix(x):
  x['paint'] = x['paint'].str.replace("skye blue", "sky blue")

  x['paint'] = x['paint'].str.replace("dark ash", "dark gray")
  x['paint'] = x['paint'].str.replace("dark grey", "dark gray")
  x['paint'] = x['paint'].str.replace("dark silver", "dark gray")
  x['paint'] = x['paint'].str.replace("ash and black", "dark gray")
  x['paint'] = x['paint'].str.replace("black and silver", "dark gray")

  x['paint'] = x['paint'].str.replace("gery", "gray")
  x['paint'] = x['paint'].str.replace("grey", "gray")
  x['paint'] = x['paint'].str.replace("ash", "gray")
  x['paint'] = x['paint'].str.replace("silver/grey", "gray")
  x['paint'] = x['paint'].str.replace("silver/gray", "gray")
  x['paint'] = x['paint'].str.replace("sliver", "gray")
  x['paint'] = x['paint'].str.replace("silver", "gray")

  x['paint'] = x['paint'].str.replace("magnetic gray", "gray")
  x['paint'] = x['paint'].str.replace("gray metallic", "gray")
  x['paint'] = x['paint'].str.replace("magnetic gray metallic", "gray")

  x['paint'] = x['paint'].str.replace("black sand pearl", "black")
  x['paint'] = x['paint'].str.replace("midnight black metal", "black")


  x['paint'] = x['paint'].str.replace("off white l", "white")
  x['paint'] = x['paint'].str.replace("off white", "white")
  x['paint'] = x['paint'].str.replace("super white", "white")
  x['paint'] = x['paint'].str.replace("white orchild pearl", "white")

  x['paint'] = x['paint'].str.replace("redl", "red")
  x['paint'] = x['paint'].str.replace("maroon", "red")
  x['paint'] = x['paint'].str.replace("whine", "wine")

  x['paint'] = x['paint'].str.replace("ink blue", "dark blue")
  x['paint'] = x['paint'].str.replace("deep blue", "dark blue")
  x['paint'] = x['paint'].str.replace("navy blue", "dark blue")

  x['paint'] = x['paint'].str.replace("sky blue", "light blue")
  x['paint'] = x['paint'].str.replace("white-blue", "light blue")
  x['paint'] = x['paint'].str.replace("pale brown", "brown")

  x['paint'] = x['paint'].str.replace("milk", "cream")

color_fix(train_df)
color_fix(df_test)

print(sorted(train_df['paint'].unique()))
print(len(train_df['paint'].unique()))

['beige', 'black', 'black/red', 'blue', 'brown', 'cream', 'dark blue', 'dark gray', 'dark green', 'gold', 'gray', 'green', 'light blue', 'light gray', 'purple', 'red', 'white', 'wine', 'yellow']
19

print(sorted(df_test['paint'].unique()))
print(len(df_test['paint'].unique()))

['beige', 'blac', 'black', 'blue', 'brown', 'classic gray met(1f7)', 'cream', 'dark blue', 'dark gray', 'dark green', 'gold', 'golf', 'gray', 'gray and black', 'green', 'indigo ink pearl', 'light gray', 'mint green', 'red', 'white', 'white and green', 'wine', 'yellow']
23
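
The chain of substring replacements above depends on the order of the calls (for example, "grey" is rewritten before "silver/grey" is checked). An alternative is to look up the whole cleaned string in a dictionary, so each rule stands alone. This is a sketch with a partial alias table of my own, not a replacement for the full mapping in the post; values that are not listed, such as the test-only 'golf', pass through unchanged.

```python
# Partial alias table (illustrative subset of the mapping listed above).
PAINT_ALIASES = {
    "skye blue": "light blue",
    "sky blue": "light blue",
    "white-blue": "light blue",
    "gery": "gray",
    "grey": "gray",
    "sliver": "gray",
    "silver": "gray",
    "redl": "red",
    "maroon": "red",
    "whine": "wine",
}


def normalize_paint(raw):
    """Strip, lower-case, drop periods, then map known aliases by exact match."""
    key = raw.strip().lower().replace(".", "")
    return PAINT_ALIASES.get(key, key)


for raw in [" Black.", "Skye blue", "Sliver", "golf"]:
    print(repr(raw), "->", normalize_paint(raw))
```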

3-(3). `location`

📝 The location column is cleaned up as well.

train_df['location'].unique()

array(['Lagos ', 'Lagos', 'Abuja', 'Lagos State', 'Ogun', 'FCT', 'Accra',
       'other', 'Abuja ', 'Abia State', 'Adamawa ', 'Abia', 'Ogun State'],
      dtype=object)

def location_fix(x):
  x['location'] = x['location'].str.replace("Lagos ", "Lagos")
  x['location'] = x['location'].str.replace("Lagos State", "Lagos")
  x['location'] = x['location'].str.replace("Ogun State", "Ogun")
  x['location'] = x['location'].str.replace("Abuja ", "Abuja")
  x['location'] = x['location'].str.replace("Abia State", "Abia")
  x['location'] = x['location'].str.replace("LagosState", "Lagos")

location_fix(train_df)
location_fix(df_test)

print(sorted(train_df['location'].unique()))
print(len(train_df['location'].unique()))

['Abia', 'Abuja', 'Accra', 'Adamawa ', 'FCT', 'Lagos', 'Ogun', 'other']
8

print(sorted(df_test['location'].unique()))
print(len(df_test['location'].unique()))

['Abia', 'Abuja', 'Arepo ogun state ', 'Lagos', 'Mushin', 'Ogun', 'other']
7
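
The same cleanup can be expressed as "trim whitespace, then drop a trailing State". The sketch below is an alternative under my own assumptions, with a hypothetical function name; unlike the output above it also trims 'Adamawa ', and it leaves test-only values such as 'Arepo ogun state ' only partially normalized.

```python
import re


def normalize_location(raw):
    """Trim whitespace and remove a trailing 'State' suffix (case-insensitive)."""
    name = raw.strip()
    return re.sub(r"\s+state$", "", name, flags=re.IGNORECASE)


for raw in ["Lagos ", "Lagos State", "Ogun State", "Adamawa ", "Arepo ogun state "]:
    print(repr(raw), "->", repr(normalize_location(raw)))
```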

3-(4). `engine`

📝 The engine column is converted to numeric classes.

plt.figure(figsize = (10,8))
sns.barplot(x = 'engine', y = 'target', data = train_df)
plt.show()

engines = train_df['engine'].unique()
engines

array(['4-cylinder(I4)', '6-cylinder(V6)', '8-cylinder(V8)',
       '6-cylinder(I6)', '4-cylinder(H4)', '5-cylinder(I5)',
       '3-cylinder(I3)', '2-cylinder(I2)'], dtype=object)

train_df['engine']

0      4-cylinder(I4)
1      4-cylinder(I4)
2      6-cylinder(V6)
3      4-cylinder(I4)
4      6-cylinder(V6)
            ...      
970    4-cylinder(I4)
971    4-cylinder(I4)
972    4-cylinder(I4)
973    6-cylinder(V6)
974    6-cylinder(V6)
Name: engine, Length: 975, dtype: object

def engine_fix(df):
  df.loc[((df['engine'] != "8-cylinder(V8)") & (df['engine'] != "4-cylinder(H4)") & (df['engine'] != "6-cylinder(I6)") & 
          (df['engine'] != "6-cylinder(V6)") & (df['engine'] != "4-cylinder(I4)") & (df['engine'] != "5-cylinder(I5)") & (df['engine'] != "3-cylinder(I3)") & (df['engine'] != "2-cylinder(I2)")), 'engine'] = 2

  df.loc[(df['engine'] == "2-cylinder(I2)"), 'engine'] = 1
  df.loc[(df['engine'] == "3-cylinder(I3)"), 'engine'] = 1
  df.loc[(df['engine'] == "5-cylinder(I5)"), 'engine'] = 1
  df.loc[(df['engine'] == "4-cylinder(I4)"), 'engine'] = 2
  df.loc[(df['engine'] == "6-cylinder(V6)"), 'engine'] = 2
  df.loc[(df['engine'] == "6-cylinder(I6)"), 'engine'] = 2
  df.loc[(df['engine'] == "4-cylinder(H4)"), 'engine'] = 3
  df.loc[(df['engine'] == "8-cylinder(V8)"), 'engine'] = 4

engine_fix(train_df)
engine_fix(df_test)

print(sorted(train_df['engine'].unique()))
print(len(train_df['engine'].unique()))

[1, 2, 3, 4]
4

print(sorted(df_test['engine'].unique()))
print(len(df_test['engine'].unique()))

[1, 2, 4]
3

3-(5). dropping

📝 The title, location, and paint columns in train and test do not share the same set of values or the same number of distinct values.

cat_fts2 = ['title', 'location', 'isimported', 'transmission', 'fuel', 'paint']

for i in range(len(cat_fts2)):
  print(cat_fts2[i], ":")
  print(train_df[cat_fts2[i]].unique())
  print("#fts :", len(train_df[cat_fts2[i]].unique()), '\n')

title :
['Toyota RAV 4' 'Toyota Land Cruiser' 'Land Rover Range Rover Evoque'
 'Lexus ES 350' 'Toyota Venza' 'Toyota Corolla'
 'Land Rover Range Rover Sport' 'Pontiac Vibe' 'Toyota Tacoma'
 'Lexus RX 350' 'Ford Escape' 'Honda Civic' 'Volvo XC90' 'BMW 750'
 'Infiniti JX' 'Honda Accord' 'Mercedes-Benz ML 350' 'Toyota Camry'
 'Hyundai Azera' 'Lexus GX 460' 'BMW 325' 'Toyota Sienna' 'Honda Fit'
 'Honda CR-V' 'Hyundai Tucson' 'Jaguar XJ8' 'BMW X6' 'Mercedes-Benz C 300'
 'Mitsubishi Galant' 'Mercedes-Benz GL 450' 'Lexus RX 300'
 'Toyota Highlander' 'Mitsubishi CANTER PICK UP' 'Nissan Titan'
 'Lexus IS 250' 'Mercedes-Benz 200' 'Toyota Sequoia' 'Ford Explorer'
 'Hyundai ix35' 'Lexus CT 200h' 'Lexus LX 570' 'Toyota Avensis'
 'Toyota 4-Runner' 'Mercedes-Benz GLE 350' 'Mercedes-Benz E 300'
 'Toyota Avalon' 'Chevrolet Camaro' 'Land Rover Range Rover' 'Mazda CX-9'
 'Lexus RX 330' 'Lincoln Mark' 'Kia Optima' 'Lexus GS 300' 'Jaguar X-Type'
 'Nissan Altima' 'Acura MDX' 'DAF 95XF TRACTOR HEAD' 'Man TGA 18.360'
 'Nissan Pathfinder' 'Mercedes-Benz E 350' 'Honda Crosstour' 'Honda Pilot'
 'Lexus LS 460' 'Nissan Cabstar' 'Kia Sorento' 'Mercedes-Benz CLA 250'
 'Mitsubishi Pajero' 'Mercedes-Benz C 350' 'Lexus GS 350'
 'Mercedes-Benz E 320' 'Toyota Yaris' 'Toyota Matrix' 'Isuzu NQR'
 'IVM LT35' 'Hyundai Elantra' 'Porsche Cayenne' 'Toyota Prado'
 'Hyundai Sonata' 'MINI Cooper' 'Toyota Hiace' 'Mercedes-Benz 350'
 'Honda Odyssey' 'Mercedes-Benz E 550' 'GMC Terrain'
 'Mercedes-Benz GLK 350' 'Mercedes-Benz C 250' 'Mercedes-Benz ML 430'
 'Mercedes-Benz GLC 300' 'Kia Cerato' 'Chevrolet Evanda' 'Iveco TRUCK'
 'Acura ZDX' 'Mercedes-Benz 450' 'Mercedes-Benz GLA 250'
 'Mercedes-Benz CLS 500' 'Scania P94 FLATBED' 'Nissan Versa' 'Ford F 150'
 'Mercedes-Benz GLE 43 AMG' 'Volkswagen Golf' 'Mercedes-Benz 320'
 'Honda Ridgeline' 'Mercedes-Benz S 450' 'Mercedes-Benz 300' 'Kia Rio'
 'BMW 740' 'Ford Edge' 'Toyota Dyna' 'Volvo FL6' 'Toyota Coaster'
 'GAC Gonow Other' 'IVECO EUROTECH 7.50E-16' 'Mack CH613'
 'Scania TRACTOR HEAD' 'Nissan Xterra' 'Mercedes-Benz ML 320' 'Ford Focus'
 'Mercedes-Benz 220' 'Man Truck 18.44' 'BMW 730' 'Peugeot 607' 'BMW 528'
 'Volvo XC60' 'Mercedes-Benz E 200' 'Volkswagen Passat'
 'Volkswagen Sharan' 'Lexus GX 470' 'Ford Transit' 'Nissan Quest'
 'Nissan Maxima' 'Hyundai Santa Fe' 'Lexus ES 300' 'Mazda Tribute'
 'Ford Fusion' 'Acura RDX' 'Peugeot 206' 'Mercedes-Benz G 63 AMG'
 'Toyota Hilux' 'Kia Stinger' 'Volkswagen Tiguan' 'Acura TL'
 'Porsche Panamera' 'Rolls-Royce Ghost' 'BMW 745' 'BMW 335'
 'Volkswagen Jetta' 'Toyota Solara' 'Mercedes-Benz C 450 AMG'
 'Nissan Murano' 'Chevrolet Traverse' 'Volkswagen T4 Caravelle'
 'MAN-VOLKSWAGEN FLATBED' 'Nissan Frontier' 'Mercedes-Benz C 180'
 'Infiniti M35' 'Nissan Sentra' 'Jeep Cherokee' 'Toyota DYNA 200'
 'Nissan Rogue' 'Land Rover Range Rover Velar' 'ALPINA B3' 'Mazda 323'
 'Volkswagen T6 other' 'Bentley Arnage' 'Mazda 6' 'Infiniti FX'
 'Ford Expedition' 'Kia Picanto' 'Toyota Tundra' 'JMC Vigus'
 'Infiniti QX80' 'Volvo FH12' 'Volkswagen Touareg' 'Porsche Macan'
 'Peugeot 308' 'Nissan INFINITI M90.150/2' 'MINI Cooper Countryman'
 'Lexus ES 330' 'Honda Insight' 'Toyota Vitz' 'Isuzu CABSTER'
 'Mercedes-Benz C 63 AMG' 'Mercedes-Benz SL 400' 'Volkswagen 17.22'
 'DAF CF']
#fts : 185 

location :
['Lagos' 'Abuja' 'Ogun' 'FCT' 'Accra' 'other' 'Abia' 'Adamawa ']
#fts : 8 

isimported :
['Foreign Used' 'New ' 'Locally used']
#fts : 3 

transmission :
['automatic' 'manual']
#fts : 2 

fuel :
['petrol' 'diesel']
#fts : 2 

paint :
['red' 'black' 'gray' 'white' 'blue' 'black/red' 'dark blue' 'dark gray'
 'brown' 'green' 'purple' 'gold' 'cream' 'beige' 'dark green' 'yellow'
 'wine' 'light blue' 'light gray']
#fts : 19

for i in range(len(cat_fts2)):
  print(cat_fts2[i], ":")
  print(df_test[cat_fts2[i]].unique())
  print("#fts :", len(df_test[cat_fts2[i]].unique()), '\n')

title :
['Mercedes-Benz C 300' 'Honda Accord' 'Mercedes-Benz S 550'
 'Toyota Sienna' 'Toyota Hiace' 'Toyota Corolla' 'Iveco EUROCARGO 120e18'
 'Mercedes-Benz GLE 350' 'Toyota Highlander' 'Toyota Hilux' 'Toyota Camry'
 'Mercedes-Benz C 180' 'Lexus ES 350' 'Honda Fit' 'Toyota Matrix'
 'Toyota Venza' 'Lexus IS 250' 'Nissan Primera' 'Volkswagen Sharan'
 'Jeep Wrangler' 'Volkswagen Golf' 'Mercedes-Benz 814' 'Nissan Sentra'
 'Volkswagen Passat' 'Mercedes-Benz GLK 350' 'Lexus RX 350' 'Ford Mondeo'
 'BMW X3' 'Mack CXN613 CAB BEHIND ENGINE' 'Toyota RAV 4'
 'Land Rover Discovery' 'Toyota Avalon' 'Lexus GX 460' 'Hyundai Santa Fe'
 'Peugeot 206' 'Volvo FL7' 'Mercedes-Benz C 320' 'Hyundai Sonata'
 'Infiniti FX' 'Honda Civic' 'Mercedes-Benz CLS 500'
 'Mercedes-Benz GLK 300' 'Acura RDX' 'Mercedes-Benz G 550' 'BMW 535'
 'Acura TL' 'Nissan Xterra' 'Land Rover Range Rover' 'Nissan A'
 'Toyota 4-Runner' 'Honda Pilot' 'Man LE 8. 180 PLATFORM TRUCK'
 'Toyota Yaris' 'Hyundai Elantra' 'Volvo S80' 'Mercedes-Benz GLA 180'
 'Acura TSX' 'Lexus LX 570' 'Mercedes-Benz Maybach' 'Mercedes-Benz 300'
 'Acura MDX' 'Nissan INFINITI M90.150/2' 'Land Rover Range Rover Sport'
 'Nissan Altima' 'Peugeot 307' 'Fiat Ducato' 'Mercedes-Benz C 350'
 'Lexus RX 330' 'Ford Edge' 'Honda CR-V' 'Volvo FL12' 'Ford Explorer'
 'Man 26-403' 'MINI Cooper Coupé' 'Iveco TRUCK' 'Nissan Cabstar'
 'MINI Cooper' 'Lexus RX 400' 'Ford TRANSIT PICKUP' 'Toyota Prius'
 'Toyota Tundra' 'Honda Element' 'Toyota Tacoma' 'Lexus ES 300'
 'DAF XF TRACTOR HEAD' 'Honda Odyssey' 'Nissan Pathfinder' 'Mazda 323'
 'Mercedes-Benz E 300' 'Lexus GS 350' 'Mercedes-Benz ML 350'
 'Mercedes-Benz E 350' 'Porsche Cayenne' 'BMW 525' 'Toyota Land Cruiser'
 'Mack R-686ST' 'Toyota C-HR' 'Mitsubishi Eclipse' 'Chevrolet Camaro'
 'Mercedes-Benz CABIN PLUS CHASSIS ONLY' 'Mercedes-Benz GLE 450'
 'Toyota Avensis' 'Ford Mustang' 'Volvo FL6' 'Kia Optima'
 'Mitsubishi Pajero' 'Honda Crosstour' 'Lexus RX 300' 'Honda Ridgeline'
 'Mercedes-Benz 220' 'Mitsubishi Montero' 'Pontiac Vibe' 'Ford F 150'
 'Rolls-Royce Ghost' 'Ford Fusion' 'Lexus GS 300' 'Ford Transit'
 'Hyundai Azera' 'Mitsubishi L200' 'Mercedes-Benz DUMP TRUCK'
 'Mercedes-Benz WATER TANKER' 'Kia Rio' 'Man BOCKMANN' 'Lexus GX 470']
#fts : 124 

location :
['Abuja' 'Lagos' 'Ogun' 'Mushin' 'other' 'Arepo ogun state ' 'Abia']
#fts : 7 

isimported :
['New ' 'Foreign Used' 'Locally used']
#fts : 3 

transmission :
['automatic' 'manual']
#fts : 2 

fuel :
['petrol' 'diesel']
#fts : 2 

paint :
['white' 'black' 'dark gray' 'red' 'gray' 'blue' 'gold' 'green' 'cream'
 'brown' 'yellow' 'dark green' 'white and green' 'light gray' 'wine'
 'blac' 'dark blue' 'golf' 'indigo ink pearl' 'gray and black'
 'classic gray met(1f7)' 'beige' 'mint green']
#fts : 23

📝 Apply one-hot encoding to the categorical features.

train_data = train_df.copy()
test_data = df_test.copy()

for i in range(len(cat_fts2)):
  onehot_encoder = OneHotEncoder(handle_unknown="ignore", sparse = False)

  transformed = onehot_encoder.fit_transform(train_data[cat_fts2[i]].to_numpy().reshape(-1, 1))
  onehot_df = pd.DataFrame(transformed, columns=onehot_encoder.get_feature_names())
  train_data = pd.concat([train_data, onehot_df], axis=1).drop(cat_fts2[i], axis=1)

  test_transformed = onehot_encoder.transform(test_data[cat_fts2[i]].to_numpy().reshape(-1, 1))
  test_onehot_df = pd.DataFrame(test_transformed, columns=onehot_encoder.get_feature_names())
  test_data = pd.concat([test_data, test_onehot_df], axis=1).drop(cat_fts2[i], axis=1)

print(train_data.columns)
print(test_data.columns)

Index(['id', 'odometer', 'engine', 'year', 'target', 'company', 'x0_ALPINA B3',
       'x0_Acura MDX', 'x0_Acura RDX', 'x0_Acura TL',
       ...
       'x0_gold', 'x0_gray', 'x0_green', 'x0_light blue', 'x0_light gray',
       'x0_purple', 'x0_red', 'x0_white', 'x0_wine', 'x0_yellow'],
      dtype='object', length=225)
Index(['id', 'odometer', 'engine', 'year', 'company', 'x0_ALPINA B3',
       'x0_Acura MDX', 'x0_Acura RDX', 'x0_Acura TL', 'x0_Acura ZDX',
       ...
       'x0_gold', 'x0_gray', 'x0_green', 'x0_light blue', 'x0_light gray',
       'x0_purple', 'x0_red', 'x0_white', 'x0_wine', 'x0_yellow'],
      dtype='object', length=224)

📝 Apart from the target column in the train data, train and test have the same number of columns, so the one-hot encoding worked as intended.
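
As a hedged illustration of why the test columns line up with the train columns: with `handle_unknown="ignore"`, a category that appears only in the test data (such as 'mint green' in `paint`) is encoded as an all-zero row instead of creating a new column. The toy values below are made up for this sketch, and it relies on the encoder's default sparse output plus `.toarray()` rather than the `sparse` argument used above.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Toy data (hypothetical): the test set contains a colour never seen in training.
train_paint = np.array(["red", "black", "white", "black"]).reshape(-1, 1)
test_paint = np.array(["black", "mint green"]).reshape(-1, 1)

# Categories are learned from the training column only.
encoder = OneHotEncoder(handle_unknown="ignore")
train_encoded = encoder.fit_transform(train_paint).toarray()

# The unseen colour does not add a column; its row is all zeros.
test_encoded = encoder.transform(test_paint).toarray()

print(encoder.categories_[0])
print(test_encoded)
```

The trade-off is that every unseen test category collapses to the same all-zero encoding, so the model cannot tell such rows apart by that feature.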

train_x = train_data.drop('id', axis = 1)
test_x = test_data.drop('id', axis = 1)

print(train_x.shape)
print(test_x.shape)

(975, 224)
(436, 223)

4. Modeling

📝 I used PyCaret for modeling.

py_reg = setup(train_x, target = 'target', session_id = seed_num, silent = True)

                               Description             Value
0                               session_id                42
1                                   Target            target
2                            Original Data        (975, 224)
3                           Missing Values             False
4                         Numeric Features                35
5                     Categorical Features               188
6                         Ordinal Features             False
7                High Cardinality Features             False
8                  High Cardinality Method              None
9                    Transformed Train Set        (682, 226)
10                    Transformed Test Set        (293, 226)
11                      Shuffle Train-Test              True
12                     Stratify Train-Test             False
13                          Fold Generator             KFold
14                             Fold Number                10
15                                CPU Jobs                -1
16                                 Use GPU             False
17                          Log Experiment             False
18                         Experiment Name  reg-default-name
19                                     USI              ee21
20                         Imputation Type            simple
21          Iterative Imputation Iteration              None
22                         Numeric Imputer              mean
23      Iterative Imputation Numeric Model              None
24                     Categorical Imputer          constant
25  Iterative Imputation Categorical Model              None
26           Unknown Categoricals Handling    least_frequent
27                               Normalize             False
28                        Normalize Method              None
29                          Transformation             False
30                   Transformation Method              None
31                                     PCA             False
32                              PCA Method              None
33                          PCA Components              None
34                     Ignore Low Variance             False
35                     Combine Rare Levels             False
36                    Rare Level Threshold              None
37                         Numeric Binning             False
38                         Remove Outliers             False
39                      Outliers Threshold              None
40                Remove Multicollinearity             False
41             Multicollinearity Threshold              None
42             Remove Perfect Collinearity              True
43                              Clustering             False
44                    Clustering Iteration              None
45                     Polynomial Features             False
46                       Polynomial Degree              None
47                    Trignometry Features             False
48                    Polynomial Threshold              None
49                          Group Features             False
50                       Feature Selection             False
51                Feature Selection Method           classic
52            Features Selection Threshold              None
53                     Feature Interaction             False
54                           Feature Ratio             False
55                   Interaction Threshold              None
56                        Transform Target             False
57                 Transform Target Method           box-cox

compare_models()

                                    Model           MAE           MSE  \
catboost               CatBoost Regressor  2.052122e+06  3.032507e+13   
gbr           Gradient Boosting Regressor  2.215648e+06  3.169851e+13   
rf                Random Forest Regressor  2.132068e+06  3.173878e+13   
et                  Extra Trees Regressor  2.235193e+06  3.563028e+13   
ridge                    Ridge Regression  3.439487e+06  4.245590e+13   
dt                Decision Tree Regressor  2.503733e+06  3.621137e+13   
omp           Orthogonal Matching Pursuit  3.249912e+06  4.415962e+13   
lr                      Linear Regression  3.577824e+06  4.495084e+13   
llar         Lasso Least Angle Regression  3.479438e+06  4.552524e+13   
lasso                    Lasso Regression  3.562897e+06  4.500952e+13   
lightgbm  Light Gradient Boosting Machine  3.335596e+06  4.506558e+13   
en                            Elastic Net  4.823784e+06  7.191481e+13   
ada                    AdaBoost Regressor  5.726544e+06  6.274745e+13   
knn                 K Neighbors Regressor  5.217788e+06  8.947216e+13   
br                         Bayesian Ridge  5.853927e+06  9.663452e+13   
huber                     Huber Regressor  5.072447e+06  1.106676e+14   
dummy                     Dummy Regressor  6.606546e+06  1.206503e+14   
par          Passive Aggressive Regressor  6.787941e+06  1.124849e+14   
lar                Least Angle Regression  7.229185e+28  7.044768e+59   

                  RMSE            R2    RMSLE          MAPE  TT (Sec)  
catboost  4.874472e+06  7.763000e-01   0.3940  2.705000e-01     5.786  
gbr       4.992229e+06  7.507000e-01   0.3991  3.435000e-01     0.201  
rf        4.964929e+06  7.477000e-01   0.3567  2.637000e-01     0.797  
et        5.401867e+06  7.073000e-01   0.3660  2.627000e-01     0.872  
ridge     5.963903e+06  6.569000e-01   0.8327  9.147000e-01     0.036  
dt        5.587945e+06  6.550000e-01   0.4376  2.988000e-01     0.027  
omp       6.123065e+06  6.337000e-01   0.8006  7.729000e-01     0.020  
lr        6.188975e+06  6.291000e-01   0.7927  9.623000e-01     0.349  
llar      6.215977e+06  6.243000e-01   0.8050  9.201000e-01     0.084  
lasso     6.206766e+06  6.219000e-01   0.7581  9.474000e-01     0.072  
lightgbm  6.324969e+06  5.892000e-01   0.5592  4.798000e-01     0.091  
en        7.970752e+06  3.732000e-01   0.8749  1.184900e+00     0.103  
ada       7.697862e+06  3.293000e-01   0.9726  1.669600e+00     0.149  
knn       8.988560e+06  1.959000e-01   0.8459  1.057700e+00     0.071  
br        9.334507e+06  1.286000e-01   1.0131  1.458200e+00     0.048  
huber     9.854927e+06  9.680000e-02   0.8858  8.251000e-01     0.079  
dummy     1.044993e+07 -6.330000e-02   1.1069  1.864000e+00     0.013  
par       1.019090e+07 -1.097000e-01   1.1548  1.859500e+00     0.026  
lar       2.657465e+29 -6.320727e+45  28.3450  1.132970e+22     0.113

catboost = create_model('catboost', verbose = False)
rf = create_model('rf', verbose = False)
gbr = create_model('gbr', verbose = False)

📝 Blend the top three models into a single model.

blended_model = blend_models(estimator_list = [catboost, rf, gbr])

               MAE           MSE          RMSE      R2   RMSLE    MAPE
Fold                                                                  
0     3.368465e+06  1.239934e+14  1.113523e+07  0.6025  0.3779  0.3052
1     1.523530e+06  7.672571e+12  2.769941e+06  0.8638  0.3379  0.2818
2     1.430990e+06  6.330266e+12  2.516002e+06  0.8527  0.3042  0.2395
3     1.205003e+06  6.456912e+12  2.541045e+06  0.7147  0.3131  0.2569
4     2.395485e+06  2.857651e+13  5.345700e+06  0.6260  0.3721  0.3061
5     3.142842e+06  6.432011e+13  8.019982e+06  0.5683  0.3675  0.2571
6     1.753312e+06  1.835539e+13  4.284319e+06  0.8353  0.3038  0.2304
7     1.810014e+06  2.096415e+13  4.578662e+06  0.8783  0.3255  0.2637
8     1.982680e+06  1.493599e+13  3.864710e+06  0.8744  0.3505  0.2618
9     1.554736e+06  7.822396e+12  2.796855e+06  0.9317  0.3449  0.3052
Mean  2.016706e+06  2.994277e+13  4.785245e+06  0.7748  0.3398  0.2708
Std   6.935942e+05  3.544528e+13  2.654091e+06  0.1269  0.0263  0.0262
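
For intuition, blending regressors generally means averaging their predictions. The sketch below is not PyCaret's internal code. It uses scikit-learn's `VotingRegressor` on synthetic data with stand-in models (Ridge replaces CatBoost so that only scikit-learn is needed) and checks that the blended prediction equals the plain mean of the base predictions.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)
from sklearn.linear_model import Ridge

# Synthetic regression data (assumption: not the car price data from the post).
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=200)

# Three base regressors combined with equal weights.
blend = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=50, random_state=42)),
    ("gbr", GradientBoostingRegressor(random_state=42)),
    ("ridge", Ridge()),
])
blend.fit(X, y)

# The blended prediction is the mean of the fitted base models' predictions.
blended_pred = blend.predict(X)
manual_mean = np.mean([est.predict(X) for est in blend.estimators_], axis=0)
print(np.allclose(blended_pred, manual_mean))
```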

📝 Refit on the full data one last time and generate predictions for the test set.

final_model = finalize_model(blended_model)
prediction = predict_model(final_model, data = test_x)

pred = prediction['Label']

Thank you for reading :)
I hope this was helpful👍👍

[Dacon] Consumer Spending Prediction Competition Based on Consumer Data

Tue, 23 Aug 2022 12:59:51 GMT

Consumer Spending Prediction Based on Consumer Data

This post covers basic data preprocessing and EDA, followed by an ensemble of Elastic-Net, LightGBM, and XGBoost.
The code was run on Google Colab with a CPU runtime and standard RAM.
Keyword : spending prediction, regression, lightgbm, xgboost, elasticnet, ensemble ➔ Read on Dacon
👉Click

0. Import Packages

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install folium==0.2.1
!pip install markupsafe==2.0.1
!pip install -U pandas-profiling

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import pandas_profiling
import seaborn as sns
import random as rn
import os
import scipy.stats as stats
import datetime
import calendar

from sklearn.preprocessing import PowerTransformer
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

from sklearn.model_selection import GridSearchCV, cross_val_score, RepeatedKFold
from sklearn import metrics

from sklearn.linear_model import ElasticNet
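# Module aliases are needed for the version printout below
import xgboost as xgb
import lightgbm as lgb
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor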

from collections import Counter
import warnings

%matplotlib inline
warnings.filterwarnings(action='ignore')

print("numpy version: {}". format(np.__version__))
print("pandas version: {}". format(pd.__version__))
print("matplotlib version: {}". format(matplotlib.__version__))
print("scikit-learn version: {}". format(sklearn.__version__))
print("xgboost version: {}". format(xgb.__version__))
print("lightgbm version: {}". format(lgb.__version__))

numpy version: 1.21.6
pandas version: 1.3.5
matplotlib version: 3.2.2
scikit-learn version: 1.0.2
xgboost version: 0.90
lightgbm version: 2.2.3

# reproducibility
seed_num = 42   ####
np.random.seed(seed_num)
rn.seed(seed_num)
os.environ['PYTHONHASHSEED']=str(seed_num)

1. Load and Check Dataset

train = pd.read_csv('/content/drive/MyDrive/Consumer_spending_forecast/dataset/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Consumer_spending_forecast/dataset/test.csv')

print(train.shape)
train.head()

(1108, 22)

   id  Year_Birth   Education Marital_Status   Income  Kidhome  Teenhome  \
0   0        1974      Master       Together  46014.0        1         1   
1   1        1962  Graduation         Single  76624.0        0         1   
2   2        1951  Graduation        Married  75903.0        0         1   
3   3        1974       Basic        Married  18393.0        1         0   
4   4        1946         PhD       Together  64014.0        2         1   

  Dt_Customer  Recency  NumDealsPurchases  ...  NumStorePurchases  \
0  21-01-2013       21                 10  ...                  8   
1  24-05-2014       68                  1  ...                  7   
2  08-04-2013       50                  2  ...                  9   
3  29-03-2014        2                  2  ...                  3   
4  10-06-2014       56                  7  ...                  5   

   NumWebVisitsMonth  AcceptedCmp3  AcceptedCmp4  AcceptedCmp5  AcceptedCmp1  \
0                  7             0             0             0             0   
1                  1             1             0             0             0   
2                  3             0             0             0             0   
3                  8             0             0             0             0   
4                  7             0             0             0             1   

   AcceptedCmp2  Complain  Response  target  
0             0         0         0     541  
1             0         0         0     899  
2             0         0         0     901  
3             0         0         0      50  
4             0         0         0     444  

[5 rows x 22 columns]

pr = train.profile_report()
pr.to_file('/content/drive/MyDrive/Consumer_spending_forecast/pr_report.html')
pr

Summary of Pandas profiling : Alert

High Correlation

Income-Kidhome-NumWebPurchases-NumCatalogPurchases-NumStorePurchases-NumWebVisitsMonth-AcceptedCmp1-AcceptedCmp5-target
NumDealsPurchases-Teenhome

High Cardinality

Dt_Customer

↪ This column holds the date the customer enrolled with the company, so few of its values repeat.

📝 High cardinality <-> few duplicated values
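
A small sketch of how cardinality can be checked directly, assuming pandas and a made-up four-row frame (the column names match the dataset, the rows are only illustrative). `nunique()` counts distinct values per column, and dividing by the number of rows gives the share of unique values.

```python
import pandas as pd

# Toy frame (hypothetical rows): one date-like column and one category column.
toy = pd.DataFrame({
    "Dt_Customer": ["21-01-2013", "24-05-2014", "08-04-2013", "29-03-2014"],
    "Education": ["Master", "Graduation", "Graduation", "Master"],
})

# Number of distinct values per column.
cardinality = toy.nunique()

# Share of distinct values; 1.0 means every row has its own value.
unique_ratio = cardinality / len(toy)

print(cardinality)
print(unique_ratio)
```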

2. EDA | Exploratory Data Analysis

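id : sample ID, Year_Birth : customer's birth year, Education : customer's education level
Marital_Status : customer's marital status, Income : customer's annual household income
Kidhome : number of children in the household, Teenhome : number of teenagers in the household, Dt_Customer : date the customer enrolled with the company
Recency : days since the customer's last purchase, NumDealsPurchases : number of purchases made with a discount, NumWebPurchases : number of purchases made through the company website
NumCatalogPurchases : number of purchases made using a catalog, NumStorePurchases : number of purchases made directly in stores
NumWebVisitsMonth : number of visits to the company website in the last month
AcceptedCmp(1-5) : 1 if the customer accepted the offer in campaign (1-5), 0 otherwise
Complain : 1 if the customer complained in the last 2 years, 0 otherwise
Response : 1 if the customer accepted the offer in the last campaign, 0 otherwise
target : customer's total spending on products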

Data Type

Numeric (10) : id, Year_Birth, Income, Recency, NumDealsPurchases, NumWebPurchases, NumCatalogPurchases, NumStorePurchases, NumWebVisitsMonth, target
Categorical (12) : Education, Marital_Status, Kidhome, Teenhome, Dt_Customer, AcceptedCmp(1~5), Complain, Response

📝 There are no missing values.

train.isnull().sum()

id                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
target                 0
dtype: int64

test.isnull().sum()

id                     0
Year_Birth             0
Education              0
Marital_Status         0
Income                 0
Kidhome                0
Teenhome               0
Dt_Customer            0
Recency                0
NumDealsPurchases      0
NumWebPurchases        0
NumCatalogPurchases    0
NumStorePurchases      0
NumWebVisitsMonth      0
AcceptedCmp3           0
AcceptedCmp4           0
AcceptedCmp5           0
AcceptedCmp1           0
AcceptedCmp2           0
Complain               0
Response               0
dtype: int64

df_train = train.copy()
df_test = test.copy()

2-(1). Outliers

📝 Use the IQR method to find outliers in the numeric features, excluding id and target.

numeric_fts = ['Year_Birth', 'Income', 'Recency', 'NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth']

train_outlier_ind = []
for i in numeric_fts:
  Q1 = np.percentile(df_train[i],25)
  Q3 = np.percentile(df_train[i],75)
  IQR = Q3-Q1
  train_outlier_list = df_train[(df_train[i] < Q1 - IQR * 1.5) | (df_train[i] > Q3 + IQR * 1.5)].index
  train_outlier_ind.extend(train_outlier_list)

train_outlier_ind = Counter(train_outlier_ind)
train_multi_outliers = list(k for k,j in train_outlier_ind.items() if j > 2)  

print("The number of train outliers :", len(train_multi_outliers))

The number of train outliers : 0

📝 No row in the train data is an IQR outlier in more than two numeric features, which is the criterion used above, so no rows are flagged.
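
To make the rule concrete, here is a minimal sketch of the IQR check for a single column. The helper name `iqr_outlier_mask` and the sample values are hypothetical. A value is flagged when it lies more than 1.5 × IQR below the first quartile or above the third quartile.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return a boolean mask marking values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up column with one obviously extreme value at the end.
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]
mask = iqr_outlier_mask(incomes)

print(mask)
print("flagged values:", np.asarray(incomes)[mask])
```

The loop in the post applies this per feature and then keeps only rows flagged in more than two features, which is a stricter criterion than flagging on any single feature.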

2-(2). Transformation

📝 A skewed distribution can hurt model training. NumDealsPurchases has high skewness, so I try a few transformations on it.

print(df_train[numeric_fts].skew())

Year_Birth            -0.439100
Income                 0.291634
Recency               -0.061310
NumDealsPurchases      2.264245
NumWebPurchases        1.289607
NumCatalogPurchases    1.099499
NumStorePurchases      0.653689
NumWebVisitsMonth      0.299000
dtype: float64

fig = plt.figure(figsize = (16,6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.distplot(df_train['NumDealsPurchases'], ax = ax1, label='Skewness : {:.2f}'.format(df_train['NumDealsPurchases'].skew()))
ax1.legend(loc='best', fontsize = 15)

stats.probplot(df_train['NumDealsPurchases'], plot = ax2)
plt.title("Q-Q Plot", fontsize = 15)
plt.show()

Log transformation

log_trans = df_train['NumDealsPurchases'].map(lambda i: np.log(i) if i > 0 else 0)

fig = plt.figure(figsize = (16,6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.distplot(log_trans, ax = ax1, color='crimson', label='Skewness : {:.2f}'.format(log_trans.skew()))
ax1.legend(loc='best', fontsize = 15)
ax1.set_title('Log transformation', fontsize = 15)

stats.probplot(log_trans, plot = ax2)
ax2.set_title("Q-Q Plot", fontsize = 15)
plt.show()

Yeo-Johnson transformation

jy = PowerTransformer(method = 'yeo-johnson')
jy.fit(df_train['NumDealsPurchases'].values.reshape(-1, 1))
x_yj = jy.transform(df_train['NumDealsPurchases'].values.reshape(-1, 1))

fig = plt.figure(figsize = (16,6))
ax1 = fig.add_subplot(1,2,1)
ax2 = fig.add_subplot(1,2,2)

sns.distplot(x_yj, ax = ax1, color='crimson', label='Skewness : {:.5f}'.format(float(stats.skew(x_yj.ravel()))))
ax1.legend(loc='best', fontsize = 15)
ax1.set_title('Yeo-Johnson transformation', fontsize = 15)

stats.probplot(x_yj.reshape(x_yj.shape[0]), plot = ax2)
ax2.legend(['Lambda : {:.2f}'.format(float(jy.lambdas_[0]))], loc='best', fontsize = 15)
ax2.set_title("Q-Q Plot", fontsize = 15)
plt.show()

📝 The Q-Q plots show that after either transformation the points lie somewhat closer to the normal reference line.

📝 I used the Yeo-Johnson transformation for preprocessing.

df_train['NumDealsPurchases'] = x_yj
df_train['NumDealsPurchases'].head()

0    2.258975
1   -0.801066
2    0.146388
3    0.146388
4    1.846930
Name: NumDealsPurchases, dtype: float64

test_jy = PowerTransformer(method = 'yeo-johnson')
test_jy.fit(df_test['NumDealsPurchases'].values.reshape(-1, 1))
test_x_yj = test_jy.transform(df_test['NumDealsPurchases'].values.reshape(-1, 1))
df_test['NumDealsPurchases'] = test_x_yj
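
The code above fits a second transformer on the test data. A common alternative, sketched here on synthetic data rather than the competition data, is to estimate the Yeo-Johnson parameter on the train column only and reuse it for the test column, so that both are mapped by the same function. All variable names below are hypothetical.

```python
import numpy as np
from scipy import stats
from sklearn.preprocessing import PowerTransformer

# Synthetic right-skewed columns standing in for NumDealsPurchases.
rng = np.random.default_rng(42)
train_col = rng.exponential(scale=2.0, size=500).reshape(-1, 1)
test_col = rng.exponential(scale=2.0, size=200).reshape(-1, 1)

# Fit lambda (and the standardization) on the train column only.
pt = PowerTransformer(method="yeo-johnson")
train_t = pt.fit_transform(train_col)

# Apply the already fitted transformer to the test column.
test_t = pt.transform(test_col)

skew_before = stats.skew(train_col.ravel())
skew_after = stats.skew(train_t.ravel())
print("lambda:", pt.lambdas_[0])
print("skewness before / after:", skew_before, skew_after)
```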

2-(3). Correlation

📝 I computed correlation coefficients based on the alerts in the pandas profiling report above.

corr_fts1 = ['Income', 'Kidhome', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth', 'AcceptedCmp1', 'AcceptedCmp5', 'target']
corr_fts2 = ['NumDealsPurchases', 'Teenhome']

plt.figure(figsize = (10,8))
sns.heatmap(df_train[corr_fts1].corr(), annot=True)

plt.show()

plt.figure(figsize = (8,6))
sns.heatmap(df_train[corr_fts2].corr(), annot=True)

plt.show()

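📝 High correlation among independent variables is undesirable because it causes multicollinearity.

📝 This can be addressed with feature selection, dimensionality reduction, or regularization. I plan to apply regularization to the model or to use decision-tree-based models, which I expect to be less affected by multicollinearity.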
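
One way to quantify multicollinearity is the variance inflation factor (VIF), which was not computed in the original analysis. For feature j it is 1 / (1 - R²), where R² comes from regressing feature j on all the other features. The sketch below is a plain NumPy version on synthetic data. The helper `vif` is hypothetical, and a rule of thumb such as "VIF above 10 is high" is a convention rather than a hard threshold.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (2-D array-like)."""
    X = np.asarray(X, dtype=float)
    scores = []
    for j in range(X.shape[1]):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        # Least-squares fit of column j on the remaining columns plus an intercept.
        design = np.column_stack([np.ones(len(X)), others])
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        resid = target - design @ coef
        r2 = 1.0 - resid.var() / target.var()
        scores.append(1.0 / (1.0 - r2))
    return np.array(scores)

# Synthetic features: b is almost a copy of a, c is independent.
rng = np.random.default_rng(42)
a = rng.normal(size=500)
b = a + rng.normal(scale=0.1, size=500)
c = rng.normal(size=500)

scores = vif(np.column_stack([a, b, c]))
print(scores)
```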

train_dataset = df_train.copy()
test_dataset = df_test.copy()

3. Feature Engineering

3-(1) `Dt_Customer` variable : handling date data

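📝 Dt_Customer is the date the customer enrolled with the company. I want a new numeric variable that can be used for modeling while preserving the information about when the customer enrolled.

↪ Create a new variable, Pass_Customer, giving the number of days elapsed since the earliest enrollment date.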

train_dataset["Dt_Customer"]

0       21-01-2013
1       24-05-2014
2       08-04-2013
3       29-03-2014
4       10-06-2014
           ...    
1103    31-03-2013
1104    21-10-2013
1105    16-12-2013
1106    30-05-2013
1107    29-10-2012
Name: Dt_Customer, Length: 1108, dtype: object

train_dataset["Dt_Customer"] = pd.to_datetime(train_dataset["Dt_Customer"], format='%d-%m-%Y')
test_dataset["Dt_Customer"] = pd.to_datetime(test_dataset["Dt_Customer"], format='%d-%m-%Y')

train_dataset["Dt_Customer"]

0      2013-01-21
1      2014-05-24
2      2013-04-08
3      2014-03-29
4      2014-06-10
          ...    
1103   2013-03-31
1104   2013-10-21
1105   2013-12-16
1106   2013-05-30
1107   2012-10-29
Name: Dt_Customer, Length: 1108, dtype: datetime64[ns]

print(f'Minimum date: {train_dataset["Dt_Customer"].min()}')
print(f'Maximum date: {train_dataset["Dt_Customer"].max()}')

Minimum date: 2012-07-31 00:00:00
Maximum date: 2014-06-29 00:00:00

train_diff_date = train_dataset["Dt_Customer"] - train_dataset["Dt_Customer"].min()
test_diff_date = test_dataset["Dt_Customer"] - test_dataset["Dt_Customer"].min()

train_dataset["Pass_Customer"] = [i.days for i in train_diff_date]
test_dataset["Pass_Customer"] = [i.days for i in test_diff_date]

train_dataset["Pass_Customer"].head()

0    174
1    662
2    251
3    606
4    679
Name: Pass_Customer, dtype: int64
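
Note that the test column above is measured from the test set's own earliest date. If the two earliest dates differ, the same calendar date gets different `Pass_Customer` values in train and test. A minimal sketch of measuring both from one shared reference date follows. The dates are illustrative: three appear in the output above and `05-08-2012` is made up.

```python
import pandas as pd

# Illustrative dates in the dataset's day-month-year format.
train_dates = pd.to_datetime(pd.Series(["31-07-2012", "21-01-2013"]), format="%d-%m-%Y")
test_dates = pd.to_datetime(pd.Series(["05-08-2012", "29-06-2014"]), format="%d-%m-%Y")

# Use the earliest train date as the single reference point for both sets.
reference = train_dates.min()

# Vectorized day difference; no Python-level loop is needed.
train_pass = (train_dates - reference).dt.days
test_pass = (test_dates - reference).dt.days

print(train_pass.tolist())
print(test_pass.tolist())
```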

3-(2) `Year_Birth` to `Age`

📝 I created a new Age variable for the customer's age from Year_Birth.
📝 Age is computed the Korean way (current year minus birth year, plus one).

print("Minimum birth :", train_dataset["Year_Birth"].min(), "\nMaximum birth :", train_dataset["Year_Birth"].max(), "\n")
train_dataset["Year_Birth"].head()

Minimum birth : 1893 
Maximum birth : 1996

0    1974
1    1962
2    1951
3    1974
4    1946
Name: Year_Birth, dtype: int64

train_dataset["Age"] = 2022 - train_dataset["Year_Birth"] + 1
test_dataset["Age"] = 2022 - test_dataset["Year_Birth"] + 1

train_dataset["Age"].head()

0    49
1    61
2    72
3    49
4    77
Name: Age, dtype: int64

3-(3) Creating a new feature from `AcceptedCmp(1~5)` and `Response`

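📝 Each of these six variables is 1 if the customer accepted the offer in the corresponding campaign (campaigns 1~5 and the last campaign) and 0 otherwise.
📝 I use them to create a new variable, AcceptCount, which counts how many campaign offers the customer accepted.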

train_dataset["AcceptCount"] = train_dataset["AcceptedCmp1"] + train_dataset["AcceptedCmp2"] + train_dataset["AcceptedCmp3"] + train_dataset["AcceptedCmp4"] + train_dataset["AcceptedCmp5"] + train_dataset["Response"]
test_dataset["AcceptCount"] = test_dataset["AcceptedCmp1"] + test_dataset["AcceptedCmp2"] + test_dataset["AcceptedCmp3"] + test_dataset["AcceptedCmp4"] + test_dataset["AcceptedCmp5"] + test_dataset["Response"]

train_dataset["AcceptCount"].head()

0    0
1    1
2    0
3    0
4    1
Name: AcceptCount, dtype: int64
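
The same count can be written as a row-wise sum over a list of columns, which avoids repeating the DataFrame name six times. A small sketch on made-up rows (the column names match the dataset, the values do not):

```python
import pandas as pd

# The five campaign flags plus the response to the last campaign.
accept_cols = [f"AcceptedCmp{i}" for i in range(1, 6)] + ["Response"]

# Toy rows (hypothetical): no acceptances, one acceptance, three acceptances.
toy = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0],
     [0, 0, 1, 0, 0, 0],
     [1, 0, 0, 1, 0, 1]],
    columns=accept_cols,
)

# Row-wise sum of the binary flags gives the number of accepted offers.
toy["AcceptCount"] = toy[accept_cols].sum(axis=1)
print(toy["AcceptCount"].tolist())
```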

print("Minimum count :", train_dataset["AcceptCount"].min(), "\nMaximum count :", train_dataset["AcceptCount"].max(), "\n")

Minimum count : 0 
Maximum count : 5

📝 No customer in the train data accepted the offer in all six campaigns.
📝 Next, check the correlation between the original variables and target.

train_dataset[['Year_Birth', 'AcceptedCmp1','AcceptedCmp2','AcceptedCmp3','AcceptedCmp4','AcceptedCmp5', 'Response','target']].corr()

              Year_Birth  AcceptedCmp1  AcceptedCmp2  AcceptedCmp3  \
Year_Birth      1.000000     -0.050053     -0.034204      0.066802   
AcceptedCmp1   -0.050053      1.000000      0.198530      0.052213   
AcceptedCmp2   -0.034204      0.198530      1.000000      0.052513   
AcceptedCmp3    0.066802      0.052213      0.052513      1.000000   
AcceptedCmp4   -0.111485      0.184717      0.328941     -0.083690   
AcceptedCmp5   -0.010873      0.379563      0.192139      0.060890   
Response       -0.012304      0.268577      0.201945      0.194275   
target         -0.136035      0.361102      0.129995      0.040736   

              AcceptedCmp4  AcceptedCmp5  Response    target  
Year_Birth       -0.111485     -0.010873 -0.012304 -0.136035  
AcceptedCmp1      0.184717      0.379563  0.268577  0.361102  
AcceptedCmp2      0.328941      0.192139  0.201945  0.129995  
AcceptedCmp3     -0.083690      0.060890  0.194275  0.040736  
AcceptedCmp4      1.000000      0.313120  0.189849  0.256784  
AcceptedCmp5      0.313120      1.000000  0.336610  0.458208  
Response          0.189849      0.336610  1.000000  0.242760  
target            0.256784      0.458208  0.242760  1.000000

# Dt_customer
year = pd.to_datetime(train_dataset["Dt_Customer"]).dt.year
month = pd.to_datetime(train_dataset["Dt_Customer"]).dt.month
day = pd.to_datetime(train_dataset["Dt_Customer"]).dt.day

print(np.corrcoef(year, train_dataset['target']), '\n')
print(np.corrcoef(month, train_dataset['target']), '\n')
print(np.corrcoef(day, train_dataset['target']))

[[ 1.         -0.15940385]
 [-0.15940385  1.        ]] 

[[1.         0.03764911]
 [0.03764911 1.        ]] 

[[1.         0.01891694]
 [0.01891694 1.        ]]

📝 Now check the correlation between the newly created variables and target.

train_dataset[['Pass_Customer', 'Age', 'AcceptCount', 'target']].corr()

               Pass_Customer       Age  AcceptCount    target
Pass_Customer       1.000000  0.012309    -0.080152 -0.174969
Age                 0.012309  1.000000     0.043180  0.136035
AcceptCount        -0.080152  0.043180     1.000000  0.444114
target             -0.174969  0.136035     0.444114  1.000000

📝 To summarize, the absolute correlation between Pass_Customer and target is slightly larger than that between any Dt_Customer component (year, month, day) and target.

📝 As expected, replacing Year_Birth with Age only flips the sign of the correlation with target (-0.136 becomes 0.136); its magnitude is unchanged.
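
The reason is that Age is a linear function of Year_Birth with slope -1, and the Pearson correlation is unchanged in magnitude under any linear rescaling. A quick check with NumPy, using the five rows shown in `train.head()` purely as sample values:

```python
import numpy as np

# First five rows of Year_Birth and target from train.head().
year_birth = np.array([1974, 1962, 1951, 1974, 1946])
target = np.array([541, 899, 901, 50, 444])

# Same formula as in the post: Korean age in 2022.
age = 2022 - year_birth + 1

r_year = np.corrcoef(year_birth, target)[0, 1]
r_age = np.corrcoef(age, target)[0, 1]

# Equal magnitude, opposite sign.
print(r_year, r_age)
```

Tree-based models are also insensitive to this kind of monotonic change, so the benefit of Age is mainly readability.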

📝 AcceptCount is moderately correlated with target (about 0.44).

train_data = train_dataset.copy()
test_data = test_dataset.copy()

4. One-Hot Encoding

📝 I one-hot encoded the Education and Marital_Status variables.

drop_col = ['id', 'Dt_Customer', 'Year_Birth', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']

train_data = train_data.drop(drop_col, axis = 1)
test_data = test_data.drop(drop_col, axis = 1)

print(train_data['Education'].unique())
print(train_data['Marital_Status'].unique())

['Master' 'Graduation' 'Basic' 'PhD' '2n Cycle']
['Together' 'Single' 'Married' 'Widow' 'Divorced' 'Alone' 'YOLO' 'Absurd']

# One-hot encoding
train_data = pd.get_dummies(train_data)
test_data = pd.get_dummies(test_data)

print(train_data.columns)
print(test_data.columns)

Index(['Income', 'Kidhome', 'Teenhome', 'Recency', 'NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth', 'Complain', 'target', 'Pass_Customer', 'Age',
       'AcceptCount', 'Education_2n Cycle', 'Education_Basic',
       'Education_Graduation', 'Education_Master', 'Education_PhD',
       'Marital_Status_Absurd', 'Marital_Status_Alone',
       'Marital_Status_Divorced', 'Marital_Status_Married',
       'Marital_Status_Single', 'Marital_Status_Together',
       'Marital_Status_Widow', 'Marital_Status_YOLO'],
      dtype='object')
Index(['Income', 'Kidhome', 'Teenhome', 'Recency', 'NumDealsPurchases',
       'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases',
       'NumWebVisitsMonth', 'Complain', 'Pass_Customer', 'Age', 'AcceptCount',
       'Education_2n Cycle', 'Education_Basic', 'Education_Graduation',
       'Education_Master', 'Education_PhD', 'Marital_Status_Absurd',
       'Marital_Status_Alone', 'Marital_Status_Divorced',
       'Marital_Status_Married', 'Marital_Status_Single',
       'Marital_Status_Together', 'Marital_Status_Widow',
       'Marital_Status_YOLO'],
      dtype='object')

print("Length of train column :", len(train_data.columns))
print("Length of test column :", len(test_data.columns))

Length of train column : 27
Length of test column : 26

📝 Apart from the target column in the train data, train and test have the same number of columns, so the one-hot encoding worked as intended.
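
The columns match here because every category happens to occur in both files. `pd.get_dummies` only creates columns for values it actually sees, so a rare category such as `YOLO` could be missing from one side in another split. A hedged sketch of guarding against that by reindexing the test frame to the train columns (toy rows, hypothetical values):

```python
import pandas as pd

# Toy frames: 'YOLO' occurs in train but not in test.
train_toy = pd.DataFrame({
    "Income": [1.0, 2.0, 3.0],
    "Marital_Status": ["Single", "Married", "YOLO"],
})
test_toy = pd.DataFrame({
    "Income": [4.0, 5.0],
    "Marital_Status": ["Married", "Single"],
})

train_enc = pd.get_dummies(train_toy)

# Align test to the train columns; categories missing from test become 0.
test_enc = pd.get_dummies(test_toy).reindex(columns=train_enc.columns, fill_value=0)

print(list(train_enc.columns))
print(list(test_enc.columns))
```

In the real pipeline the `target` column would be dropped from the train column list before reindexing the test frame.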

train_data.head()

    Income  Kidhome  Teenhome  Recency  NumDealsPurchases  NumWebPurchases  \
0  46014.0        1         1       21           2.258975                7   
1  76624.0        0         1       68          -0.801066                5   
2  75903.0        0         1       50           0.146388                6   
3  18393.0        1         0        2           0.146388                3   
4  64014.0        2         1       56           1.846930                8   

   NumCatalogPurchases  NumStorePurchases  NumWebVisitsMonth  Complain  ...  \
0                    1                  8                  7         0  ...   
1                   10                  7                  1         0  ...   
2                    6                  9                  3         0  ...   
3                    0                  3                  8         0  ...   
4                    2                  5                  7         0  ...   

   Education_Master  Education_PhD  Marital_Status_Absurd  \
0                 1              0                      0   
1                 0              0                      0   
2                 0              0                      0   
3                 0              0                      0   
4                 0              1                      0   

   Marital_Status_Alone  Marital_Status_Divorced  Marital_Status_Married  \
0                     0                        0                       0   
1                     0                        0                       0   
2                     0                        0                       1   
3                     0                        0                       1   
4                     0                        0                       0   

   Marital_Status_Single  Marital_Status_Together  Marital_Status_Widow  \
0                      0                        1                     0   
1                      1                        0                     0   
2                      0                        0                     0   
3                      0                        0                     0   
4                      0                        1                     0   

   Marital_Status_YOLO  
0                    0  
1                    0  
2                    0  
3                    0  
4                    0  

[5 rows x 27 columns]

5. Modeling

📝 Split the train data into the features (train x) and the target.

train_x = train_data.drop('target', axis = 1)
train_y = pd.DataFrame(train_data['target'])

def nmae(true, pred):
    mae = np.mean(np.abs(true-pred))
    score = mae / np.mean(np.abs(true))
    return score
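A caveat worth checking with this metric: when `true` is a column vector of shape (n, 1), as `train_y.values` is for a single-column DataFrame, and `pred` is a flat array of shape (n,), NumPy broadcasts `true - pred` to an (n, n) matrix, so the score no longer compares matching rows. The sketch below is a shape-safe variant on toy data; the name `nmae_flat` is a hypothetical helper, not part of the original code.

```python
import numpy as np

def nmae_flat(true, pred):
    """NMAE after flattening both inputs to 1-D so rows are compared pairwise."""
    true = np.asarray(true, dtype=float).ravel()
    pred = np.asarray(pred, dtype=float).ravel()
    if true.shape != pred.shape:
        raise ValueError("true and pred must contain the same number of values")
    mae = np.mean(np.abs(true - pred))
    return mae / np.mean(np.abs(true))

# Toy data (made up): a (3, 1) column of targets and a flat (3,) array of predictions.
y_col = np.array([[10.0], [20.0], [30.0]])
y_hat = np.array([12.0, 18.0, 33.0])

# Without flattening, the subtraction pairs every target with every prediction.
broadcast_shape = (y_col - y_hat).shape

# With flattening, each target is compared only with its own prediction.
score = nmae_flat(y_col, y_hat)
```

Calling `nmae(train_y.values.ravel(), pred)` with the original function would have the same effect as the helper above.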

📝 Lasso and Ridge regression both add regularization to linear regression. I used Elastic-Net, which can apply the penalties of both models.

📝 I also used LightGBM and XGBoost, and finally built an ensemble from the three models.
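To make the "both penalties" idea concrete, here is a minimal sketch of the Elastic-Net penalty term, following the `alpha` / `l1_ratio` parameterization documented for scikit-learn's `ElasticNet`. The function name `elastic_net_penalty` and the weight vector are assumptions for illustration only.

```python
import numpy as np

def elastic_net_penalty(w, alpha, l1_ratio):
    """Penalty = alpha * l1_ratio * ||w||_1 + 0.5 * alpha * (1 - l1_ratio) * ||w||_2^2."""
    w = np.asarray(w, dtype=float)
    l1 = np.sum(np.abs(w))   # Lasso part
    l2 = np.sum(w ** 2)      # Ridge part
    return alpha * l1_ratio * l1 + 0.5 * alpha * (1.0 - l1_ratio) * l2

w = np.array([1.0, -2.0, 0.0])  # hypothetical coefficients

lasso_like = elastic_net_penalty(w, alpha=0.1, l1_ratio=1.0)  # only the L1 term remains
ridge_like = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.0)  # only the L2 term remains
mixed = elastic_net_penalty(w, alpha=0.1, l1_ratio=0.5)       # half of each
```

With `l1_ratio` between 0 and 1, the grid search below is effectively choosing how much of each penalty to use.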

Elastic-Net

ela_param_grid = {'alpha': np.arange(1e-4,1e-3,1e-4),
              'l1_ratio': np.arange(0.1,1.0,0.1),
              'max_iter':[100000]}

elasticnet = ElasticNet(random_state = seed_num)

ela_rkfold = RepeatedKFold(n_splits = 5, n_repeats = 5, random_state = seed_num)
ela_gsearch = GridSearchCV(elasticnet, ela_param_grid, cv = ela_rkfold, scoring='neg_mean_absolute_error',
                               verbose=1, return_train_score=True)

ela_gsearch.fit(train_x, train_y)

Fitting 25 folds for each of 81 candidates, totalling 2025 fits

GridSearchCV(cv=RepeatedKFold(n_repeats=5, n_splits=5, random_state=42),
             estimator=ElasticNet(random_state=42),
             param_grid={'alpha': array([0.0001, 0.0002, 0.0003, 0.0004, 0.0005, 0.0006, 0.0007, 0.0008,
       0.0009]),
                         'l1_ratio': array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]),
                         'max_iter': [100000]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)

elasticnet = ela_gsearch.best_estimator_        
ela_grid_results = pd.DataFrame(ela_gsearch.cv_results_)  
ela_pred = elasticnet.predict(train_x)

print("train nmae of elasticnet :", nmae(train_y.values, ela_pred))

train nmae of elasticnet : 1.0457572302381468

XGBoost

xgb = XGBRegressor(objective='reg:squarederror', random_state = seed_num)

xgb_param_grid = {'n_estimators':np.arange(100,500,100),
              'max_depth':[1,2,3],
             }

xgb_rkfold = RepeatedKFold(n_splits = 5, n_repeats = 1, random_state = seed_num)
xgb_gsearch = GridSearchCV(xgb, xgb_param_grid, cv = xgb_rkfold, scoring='neg_mean_absolute_error',
                               verbose=1, return_train_score=True)

xgb_gsearch.fit(train_x, train_y)

Fitting 5 folds for each of 12 candidates, totalling 60 fits

GridSearchCV(cv=RepeatedKFold(n_repeats=1, n_splits=5, random_state=42),
             estimator=XGBRegressor(objective='reg:squarederror',
                                    random_state=42),
             param_grid={'max_depth': [1, 2, 3],
                         'n_estimators': array([100, 200, 300, 400])},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)

xgb = xgb_gsearch.best_estimator_        
xgb_grid_results = pd.DataFrame(xgb_gsearch.cv_results_)  
xgb_pred = xgb.predict(train_x)

print("train nmae of xgb :", nmae(train_y.values, xgb_pred))

train nmae of xgb : 1.0603134393578721

LightGBM

lgbm = LGBMRegressor(objective='regression', random_state = seed_num)

lgbm_param_grid = {'n_estimators': [8,16,24], 'num_leaves': [6,8,12,16], 'reg_alpha' : [1,1.2], 'reg_lambda' : [1,1.2,1.4]}


lgbm_rkfold = RepeatedKFold(n_splits = 5, n_repeats = 1, random_state = seed_num)
lgbm_gsearch = GridSearchCV(lgbm, lgbm_param_grid, cv = lgbm_rkfold, scoring='neg_mean_absolute_error',
                               verbose=1, return_train_score=True)

lgbm_gsearch.fit(train_x, train_y)

Fitting 5 folds for each of 72 candidates, totalling 360 fits

GridSearchCV(cv=RepeatedKFold(n_repeats=1, n_splits=5, random_state=42),
             estimator=LGBMRegressor(objective='regression', random_state=42),
             param_grid={'n_estimators': [8, 16, 24],
                         'num_leaves': [6, 8, 12, 16], 'reg_alpha': [1, 1.2],
                         'reg_lambda': [1, 1.2, 1.4]},
             return_train_score=True, scoring='neg_mean_absolute_error',
             verbose=1)

lgbm = lgbm_gsearch.best_estimator_        
lgbm_grid_results = pd.DataFrame(lgbm_gsearch.cv_results_)  
lgbm_pred = lgbm.predict(train_x)

print("train nmae of lgbm :", nmae(train_y.values, lgbm_pred))

train nmae of lgbm : 1.0075199760582296

Blending Models - Ensemble

def blended_models(X):
    return ((elasticnet.predict(X)) + (xgb.predict(X)) + (lgbm.predict(X)))/3

blended_score = nmae(train_y.values, blended_models(train_x))
print('train nmae of blended model:', blended_score)

train nmae of blended model: 1.0298941681188711

Thank you :)
I hope this was helpful 👍👍

[Dacon] Income Prediction from Census Data

Tue, 23 Aug 2022 12:49:34 GMT

Income Prediction from Census Data

This post contains the code I wrote for Dacon's census-based income prediction competition. It covers simple data preprocessing and EDA, and ensemble modeling with LightGBM and XGBoost.
The code was run on Google Colab with a GPU and standard RAM.
Keywords: income prediction, classification, lightgbm, xgboost, ensemble, voting ➔ Read on Dacon
👉Click

0. Import Packages

from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive

!pip install -U pandas-profiling

import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import sklearn
import pandas_profiling
import seaborn as sns
import random as rn
import os
import scipy.stats as stats

from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.ensemble import VotingClassifier
from sklearn import metrics

import xgboost as xgb
import lightgbm as lgb

from collections import Counter
import warnings

%matplotlib inline
warnings.filterwarnings(action='ignore')

print("numpy version: {}". format(np.__version__))
print("pandas version: {}". format(pd.__version__))
print("matplotlib version: {}". format(matplotlib.__version__))
print("scikit-learn version: {}". format(sklearn.__version__))
print("xgboost version: {}". format(xgb.__version__))
print("lightgbm version: {}". format(lgb.__version__))

numpy version: 1.21.6
pandas version: 1.3.5
matplotlib version: 3.2.2
scikit-learn version: 1.0.2
xgboost version: 0.90
lightgbm version: 2.2.3

# reproducibility
seed_num = 42   ####
np.random.seed(seed_num)
rn.seed(seed_num)
os.environ['PYTHONHASHSEED']=str(seed_num)

1. Load and Check Dataset

Variable Description

train = pd.read_csv('/content/drive/MyDrive/Forecasting_income/dataset/train.csv')
test = pd.read_csv('/content/drive/MyDrive/Forecasting_income/dataset/test.csv')

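# regex=False so that '.' is treated as a literal dot, not as the regex wildcard
train.columns = train.columns.str.replace('.', '_', regex=False)
test.columns = test.columns.str.replace('.', '_', regex=False)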

train.head()

   id  age workclass  fnlwgt     education  education_num      marital_status  \
0   0   32   Private  309513    Assoc-acdm             12  Married-civ-spouse   
1   1   33   Private  205469  Some-college             10  Married-civ-spouse   
2   2   46   Private  149949  Some-college             10  Married-civ-spouse   
3   3   23   Private  193090     Bachelors             13       Never-married   
4   4   55   Private   60193       HS-grad              9            Divorced   

        occupation   relationship   race     sex  capital_gain  capital_loss  \
0     Craft-repair        Husband  White    Male             0             0   
1  Exec-managerial        Husband  White    Male             0             0   
2     Craft-repair        Husband  White    Male             0             0   
3     Adm-clerical      Own-child  White  Female             0             0   
4     Adm-clerical  Not-in-family  White  Female             0             0   

   hours_per_week native_country  target  
0              40  United-States       0  
1              40  United-States       1  
2              40  United-States       0  
3              30  United-States       0  
4              40  United-States       0

pr=train.profile_report()
pr.to_file('/content/drive/MyDrive/Forecasting_income/pr_report.html')

📝 Pandas profiling makes it easy and efficient to explore a DataFrame, as shown below.

Using the Alerts in the pandas profiling report

High Correlation

relationship - sex
age - marital.status
workclass - occupation
education - education.num
relationship - marital.status
race - native.country
sex - occupation
target - relationship

Data Type

Numeric (7) : id, age, fnlwgt, education.num, capital.gain, capital.loss, hours.per.week
Categorical (9) : workclass, education, marital.status, occupation, relationship, race, sex, native.country, target

Note

📝 workclass and occupation have the same share of missing values (10.5%), so this is worth checking.

📝 native.country has 583 (3.3%) missing values, so I will drop those rows.

📝 capital.gain and capital.loss are highly skewed. I will check for outliers and apply a transformation if needed.
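The first note can be checked directly by cross-tabulating the two missingness flags. The sketch below uses a small made-up frame in place of `train`, so the values and counts are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up.
toy = pd.DataFrame({
    "workclass":  ["Private", np.nan, np.nan, "Never-worked", "State-gov"],
    "occupation": ["Sales",   np.nan, np.nan, np.nan,         "Adm-clerical"],
})

wc_missing = toy["workclass"].isna()
occ_missing = toy["occupation"].isna()

# Rows: is workclass missing? Columns: is occupation missing?
overlap = pd.crosstab(wc_missing, occ_missing,
                      rownames=["workclass NaN"], colnames=["occupation NaN"])

both_missing = int((wc_missing & occ_missing).sum())
only_occupation_missing = int((~wc_missing & occ_missing).sum())
```

Applied to the real data, the off-diagonal cells of `overlap` would show how many rows are missing only one of the two columns.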

train.info()


RangeIndex: 17480 entries, 0 to 17479
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   id              17480 non-null  int64 
 1   age             17480 non-null  int64 
 2   workclass       15644 non-null  object
 3   fnlwgt          17480 non-null  int64 
 4   education       17480 non-null  object
 5   education_num   17480 non-null  int64 
 6   marital_status  17480 non-null  object
 7   occupation      15637 non-null  object
 8   relationship    17480 non-null  object
 9   race            17480 non-null  object
 10  sex             17480 non-null  object
 11  capital_gain    17480 non-null  int64 
 12  capital_loss    17480 non-null  int64 
 13  hours_per_week  17480 non-null  int64 
 14  native_country  16897 non-null  object
 15  target          17480 non-null  int64 
dtypes: int64(8), object(8)
memory usage: 2.1+ MB

2. Data Preprocessing

(1) Missing Value

train.columns[train.isnull().any()]

Index(['workclass', 'occupation', 'native_country'], dtype='object')

train[train["workclass"].isnull()]

          id  age workclass  fnlwgt     education  education_num  \
15081  15081   90       NaN   77053       HS-grad              9   
15082  15082   66       NaN  186061  Some-college             10   
15084  15084   51       NaN  172175     Doctorate             16   
15086  15086   61       NaN  135285       HS-grad              9   
15087  15087   71       NaN  100820       HS-grad              9   
...      ...  ...       ...     ...           ...            ...   
17475  17475   35       NaN  320084     Bachelors             13   
17476  17476   30       NaN   33811     Bachelors             13   
17477  17477   71       NaN  287372     Doctorate             16   
17478  17478   41       NaN  202822       HS-grad              9   
17479  17479   72       NaN  129912       HS-grad              9   

           marital_status occupation   relationship                race  \
15081             Widowed        NaN  Not-in-family               White   
15082             Widowed        NaN      Unmarried               Black   
15084       Never-married        NaN  Not-in-family               White   
15086  Married-civ-spouse        NaN        Husband               White   
15087  Married-civ-spouse        NaN        Husband               White   
...                   ...        ...            ...                 ...   
17475  Married-civ-spouse        NaN           Wife               White   
17476       Never-married        NaN  Not-in-family  Asian-Pac-Islander   
17477  Married-civ-spouse        NaN        Husband               White   
17478           Separated        NaN  Not-in-family               Black   
17479  Married-civ-spouse        NaN        Husband               White   

          sex  capital_gain  capital_loss  hours_per_week native_country  \
15081  Female             0          4356              40  United-States   
15082  Female             0          4356              40  United-States   
15084    Male             0          2824              40  United-States   
15086    Male             0          2603              32  United-States   
15087    Male             0          2489              15  United-States   
...       ...           ...           ...             ...            ...   
17475  Female             0             0              55  United-States   
17476  Female             0             0              99  United-States   
17477    Male             0             0              10  United-States   
17478  Female             0             0              32  United-States   
17479    Male             0             0              25  United-States   

       target  
15081       0  
15082       0  
15084       1  
15086       0  
15087       0  
...       ...  
17475       1  
17476       0  
17477       1  
17478       0  
17479       0  

[1836 rows x 16 columns]

train['workclass'].unique()

array(['Private', 'State-gov', 'Local-gov', 'Self-emp-not-inc',
       'Self-emp-inc', 'Federal-gov', 'Without-pay', nan, 'Never-worked'],
      dtype=object)

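📝 Rows with missing values in the workclass or occupation columns are dropped.

📝 In most cases both columns are missing at the same time, so filling only workclass with an existing category such as 'Never-worked' would not be meaningful.

📝 I also tried adding a new category for the missing values in workclass and occupation, but one-hot encoding then produced columns that differ from the test data, so I think another approach needs to be considered. 😔😔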
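One possible way around the column mismatch, sketched here under the assumption that treating missingness as its own category is acceptable for this task: fill the gaps with a placeholder level in both frames, then one-hot encode the features of train and test together so both receive the same dummy columns. The placeholder `"Unknown"` and the toy frames are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for train and test; the values are made up.
toy_train = pd.DataFrame({"workclass": ["Private", np.nan, "State-gov"], "target": [0, 1, 0]})
toy_test = pd.DataFrame({"workclass": ["Private", "Local-gov"]})

cat_cols = ["workclass"]

# Treat missing values as a separate level in both frames.
toy_train[cat_cols] = toy_train[cat_cols].fillna("Unknown")
toy_test[cat_cols] = toy_test[cat_cols].fillna("Unknown")

# Encode the features jointly (the target is excluded), then split back by key.
features = pd.concat([toy_train.drop(columns="target"), toy_test], keys=["train", "test"])
encoded = pd.get_dummies(features, columns=cat_cols)
train_enc = encoded.loc["train"]
test_enc = encoded.loc["test"]
```

Only the category names of the test features are used here, not any labels, but whether that is allowed depends on the competition rules.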

print(sum(train['workclass'].isna()))
print(sum(train['occupation'].isna()))

fill_na = train['workclass'].isna()

1836
1843

df_train = train.dropna()  

print(sum(df_train['workclass'].isna()))
print(sum(df_train['occupation'].isna()))
print(sum(df_train['native_country'].isna()))

0
0
0

df_train

          id  age     workclass  fnlwgt     education  education_num  \
0          0   32       Private  309513    Assoc-acdm             12   
1          1   33       Private  205469  Some-college             10   
2          2   46       Private  149949  Some-college             10   
3          3   23       Private  193090     Bachelors             13   
4          4   55       Private   60193       HS-grad              9   
...      ...  ...           ...     ...           ...            ...   
15076  15076   35       Private  337286       Masters             14   
15077  15077   36       Private  182074  Some-college             10   
15078  15078   50  Self-emp-inc  175070   Prof-school             15   
15079  15079   39       Private  202937  Some-college             10   
15080  15080   33       Private   96245    Assoc-acdm             12   

           marital_status       occupation   relationship                race  \
0      Married-civ-spouse     Craft-repair        Husband               White   
1      Married-civ-spouse  Exec-managerial        Husband               White   
2      Married-civ-spouse     Craft-repair        Husband               White   
3           Never-married     Adm-clerical      Own-child               White   
4                Divorced     Adm-clerical  Not-in-family               White   
...                   ...              ...            ...                 ...   
15076       Never-married  Exec-managerial  Not-in-family  Asian-Pac-Islander   
15077            Divorced     Adm-clerical  Not-in-family               White   
15078  Married-civ-spouse   Prof-specialty        Husband               White   
15079            Divorced     Tech-support  Not-in-family               White   
15080  Married-civ-spouse   Prof-specialty        Husband               White   

          sex  capital_gain  capital_loss  hours_per_week native_country  \
0        Male             0             0              40  United-States   
1        Male             0             0              40  United-States   
2        Male             0             0              40  United-States   
3      Female             0             0              30  United-States   
4      Female             0             0              40  United-States   
...       ...           ...           ...             ...            ...   
15076    Male             0             0              40  United-States   
15077    Male             0             0              45  United-States   
15078    Male             0             0              45  United-States   
15079  Female             0             0              40         Poland   
15080    Male             0             0              50  United-States   

       target  
0           0  
1           1  
2           0  
3           0  
4           0  
...       ...  
15076       0  
15077       0  
15078       1  
15079       0  
15080       0  

[15081 rows x 16 columns]

(2) Outlier

fig, ax = plt.subplots(1, 2, figsize=(12,3))
g = sns.distplot(df_train['capital_gain'], color='b', label='Skewness : {:.2f}'.format(df_train['capital_gain'].skew()), ax=ax[0])
g = g.legend(loc='best')

g = sns.distplot(df_train['capital_loss'], color='b', label='Skewness : {:.2f}'.format(df_train['capital_loss'].skew()), ax=ax[1])
g = g.legend(loc='best')
plt.show()

numeric_fts = ['age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss', 'hours_per_week']

outlier_ind = []
for i in numeric_fts:
  Q1 = np.percentile(df_train[i],25)
  Q3 = np.percentile(df_train[i],75)
  IQR = Q3-Q1
  outlier_list = df_train[(df_train[i] < Q1 - IQR * 1.5) | (df_train[i] > Q3 + IQR * 1.5)].index
  outlier_ind.extend(outlier_list)

outlier_ind = Counter(outlier_ind)
multi_outliers = list(k for k,j in outlier_ind.items() if j > 2)

# Drop outliers
train_df = df_train.drop(multi_outliers, axis = 0).reset_index(drop = True)
train_df

          id  age     workclass  fnlwgt     education  education_num  \
0          0   32       Private  309513    Assoc-acdm             12   
1          1   33       Private  205469  Some-college             10   
2          2   46       Private  149949  Some-college             10   
3          3   23       Private  193090     Bachelors             13   
4          4   55       Private   60193       HS-grad              9   
...      ...  ...           ...     ...           ...            ...   
15043  15076   35       Private  337286       Masters             14   
15044  15077   36       Private  182074  Some-college             10   
15045  15078   50  Self-emp-inc  175070   Prof-school             15   
15046  15079   39       Private  202937  Some-college             10   
15047  15080   33       Private   96245    Assoc-acdm             12   

           marital_status       occupation   relationship                race  \
0      Married-civ-spouse     Craft-repair        Husband               White   
1      Married-civ-spouse  Exec-managerial        Husband               White   
2      Married-civ-spouse     Craft-repair        Husband               White   
3           Never-married     Adm-clerical      Own-child               White   
4                Divorced     Adm-clerical  Not-in-family               White   
...                   ...              ...            ...                 ...   
15043       Never-married  Exec-managerial  Not-in-family  Asian-Pac-Islander   
15044            Divorced     Adm-clerical  Not-in-family               White   
15045  Married-civ-spouse   Prof-specialty        Husband               White   
15046            Divorced     Tech-support  Not-in-family               White   
15047  Married-civ-spouse   Prof-specialty        Husband               White   

          sex  capital_gain  capital_loss  hours_per_week native_country  \
0        Male             0             0              40  United-States   
1        Male             0             0              40  United-States   
2        Male             0             0              40  United-States   
3      Female             0             0              30  United-States   
4      Female             0             0              40  United-States   
...       ...           ...           ...             ...            ...   
15043    Male             0             0              40  United-States   
15044    Male             0             0              45  United-States   
15045    Male             0             0              45  United-States   
15046  Female             0             0              40         Poland   
15047    Male             0             0              50  United-States   

       target  
0           0  
1           1  
2           0  
3           0  
4           0  
...       ...  
15043       0  
15044       0  
15045       1  
15046       0  
15047       0  

[15048 rows x 16 columns]

print(train_df['capital_gain'].skew(), train_df['capital_loss'].skew())

12.004940559585881 4.607122286739042

📝 Even after removing the outliers, both variables remain highly skewed, so I try a log transformation.

# log transformation
train_df['capital_gain'] = train_df['capital_gain'].map(lambda i: np.log(i) if i > 0 else 0)
test['capital_gain'] = test['capital_gain'].map(lambda i: np.log(i) if i > 0 else 0)

train_df['capital_loss'] = train_df['capital_loss'].map(lambda i: np.log(i) if i > 0 else 0)
test['capital_loss'] = test['capital_loss'].map(lambda i: np.log(i) if i > 0 else 0)

print(train_df['capital_gain'].skew(), train_df['capital_loss'].skew())

3.0945787119106676 4.390015583095806
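A small side note on the transformation: the conditional `np.log` maps both 0 and 1 to 0. `np.log1p`, which computes log(1 + x), is a common alternative that is defined at 0 and keeps the two values distinct. The sketch below compares them on made-up values; it is not a claim that it would change the results above.

```python
import numpy as np
import pandas as pd

gains = pd.Series([0, 1, 100, 99999], name="capital_gain")  # made-up values

# Conditional log as in the transformation above: 0 and 1 both end up at 0.
cond_log = gains.map(lambda v: np.log(v) if v > 0 else 0)

# log1p is defined at 0, so no special case is needed.
log1p = np.log1p(gains)
```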

3. Feature Engineering

(1) Correlation

📝 Convert the categorical data to numeric values with a LabelEncoder, then check the correlations.

📝 Categorical : workclass, education, marital.status, occupation, relationship, race, sex, native.country

la_train = train_df.copy()

cat_fts = ['workclass', 'education', 'marital_status', 'occupation', 'relationship', 'race', 'sex', 'native_country']
for col in cat_fts:
  # Fit a separate encoder per column so each column gets its own integer mapping
  encoder = LabelEncoder()
  la_train[col] = encoder.fit_transform(la_train[col])

la_train.head()

   id  age  workclass  fnlwgt  education  education_num  marital_status  \
0   0   32          2  309513          7             12               2   
1   1   33          2  205469         15             10               2   
2   2   46          2  149949         15             10               2   
3   3   23          2  193090          9             13               4   
4   4   55          2   60193         11              9               0   

   occupation  relationship  race  sex  capital_gain  capital_loss  \
0           2             0     4    1           0.0           0.0   
1           3             0     4    1           0.0           0.0   
2           2             0     4    1           0.0           0.0   
3           0             3     4    0           0.0           0.0   
4           0             1     4    0           0.0           0.0   

   hours_per_week  native_country  target  
0              40              38       0  
1              40              38       1  
2              40              38       0  
3              30              38       0  
4              40              38       0

📝 I computed correlation coefficients for the pairs flagged in the alerts of the pandas profiling report above.

📝 The pairs that seem to have a fairly meaningful correlation are relationship-sex, occupation-workclass, and education-education.num.

# Pearson
la_train[['age','marital_status', 'relationship', 'sex', 'occupation', 'workclass']].corr()

                     age  marital_status  relationship       sex  occupation  \
age             1.000000       -0.271955     -0.240331  0.087515   -0.007994   
marital_status -0.271955        1.000000      0.180281 -0.124481    0.023856   
relationship   -0.240331        0.180281      1.000000 -0.590077   -0.052109   
sex             0.087515       -0.124481     -0.590077  1.000000    0.061443   
occupation     -0.007994        0.023856     -0.052109  0.061443    1.000000   
workclass       0.081100       -0.044000     -0.070512  0.078764    0.010194   

                workclass  
age              0.081100  
marital_status  -0.044000  
relationship    -0.070512  
sex              0.078764  
occupation       0.010194  
workclass        1.000000

la_train[['education', 'education_num', 'race', 'native_country']].corr()

                education  education_num      race  native_country
education        1.000000       0.348614  0.011236        0.079063
education_num    0.348614       1.000000  0.034686        0.097485
race             0.011236       0.034686  1.000000        0.126654
native_country   0.079063       0.097485  0.126654        1.000000

📝 For two categorical variables, I checked the association using the Cramer's V formula.

stat = stats.chi2_contingency(la_train[['race', 'native_country']].values, correction=False)[0]
obs = np.sum(la_train[['race', 'native_country']].values) 
mini = min(la_train[['race', 'native_country']].values.shape)-1 

# Cramer's V 
V = np.sqrt((stat/(obs*mini)))
print(V)

0.11306993147326666
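Worth noting: `scipy.stats.chi2_contingency` expects a table of observed counts, so Cramer's V is usually computed from a cross-tabulation of the two columns rather than from the raw encoded values. A minimal sketch of that variant follows; `cramers_v` is a hypothetical helper and the toy series are made up, so its output is not comparable to the number printed above.

```python
import numpy as np
import pandas as pd
from scipy import stats

def cramers_v(x, y):
    """Cramer's V between two categorical series, from their contingency table."""
    table = pd.crosstab(x, y)                                   # observed counts
    chi2 = stats.chi2_contingency(table, correction=False)[0]  # chi-square statistic
    n = table.to_numpy().sum()                                  # number of observations
    k = min(table.shape) - 1                                    # min(rows, cols) - 1
    return np.sqrt(chi2 / (n * k))

a = pd.Series(["x", "x", "y", "y", "z", "z"])
b = pd.Series(["p", "q", "p", "q", "p", "q"])

v_same = cramers_v(a, a)    # a variable against itself: perfect association
v_indep = cramers_v(a, b)   # every combination equally frequent: no association
```

On the real data this would be called as `cramers_v(la_train['race'], la_train['native_country'])`.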

(2) String to numerical

📝 Categorical data must be converted to numbers before it can be fed to a model. LabelEncoder can introduce a spurious ordering between categories, so I used one-hot encoding instead.

📝 Categorical : workclass, education, marital.status, occupation, relationship, race, sex, native.country

train_dataset = train_df.copy()
test_dataset = test.copy()

📝 One-hot encoding was done with get_dummies.

train_dataset = pd.get_dummies(train_dataset)
test_dataset = pd.get_dummies(test_dataset)

print(train_dataset.columns)
print(test_dataset.columns)

Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'target', 'workclass_Federal-gov',
       'workclass_Local-gov',
       ...
       'native_country_Portugal', 'native_country_Puerto-Rico',
       'native_country_Scotland', 'native_country_South',
       'native_country_Taiwan', 'native_country_Thailand',
       'native_country_Trinadad&Tobago', 'native_country_United-States',
       'native_country_Vietnam', 'native_country_Yugoslavia'],
      dtype='object', length=106)
Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Private',
       ...
       'native_country_Portugal', 'native_country_Puerto-Rico',
       'native_country_Scotland', 'native_country_South',
       'native_country_Taiwan', 'native_country_Thailand',
       'native_country_Trinadad&Tobago', 'native_country_United-States',
       'native_country_Vietnam', 'native_country_Yugoslavia'],
      dtype='object', length=104)

📝 Next, align the columns of train and test.

test_col = []
add_test = []

for i in test_dataset.columns:
    test_col.append(i)
for j in train_dataset.columns:
    if j not in test_col:
        add_test.append(j)
add_test.remove('target')

📝 Does the native.country column of the test data not contain the 'Holand-Netherlands' category?

print(add_test)

['native_country_Holand-Netherlands']

for d in add_test:
    test_dataset[d] = 0

print(train_dataset.columns)
print(test_dataset.columns)

Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'target', 'workclass_Federal-gov',
       'workclass_Local-gov',
       ...
       'native_country_Portugal', 'native_country_Puerto-Rico',
       'native_country_Scotland', 'native_country_South',
       'native_country_Taiwan', 'native_country_Thailand',
       'native_country_Trinadad&Tobago', 'native_country_United-States',
       'native_country_Vietnam', 'native_country_Yugoslavia'],
      dtype='object', length=106)
Index(['id', 'age', 'fnlwgt', 'education_num', 'capital_gain', 'capital_loss',
       'hours_per_week', 'workclass_Federal-gov', 'workclass_Local-gov',
       'workclass_Private',
       ...
       'native_country_Puerto-Rico', 'native_country_Scotland',
       'native_country_South', 'native_country_Taiwan',
       'native_country_Thailand', 'native_country_Trinadad&Tobago',
       'native_country_United-States', 'native_country_Vietnam',
       'native_country_Yugoslavia', 'native_country_Holand-Netherlands'],
      dtype='object', length=105)

📝 Leaving aside the target column in train, the column counts now match.
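The counts match, but in the printed index the added column sits at the end of test, so the column order differs from train. That matters when models are fed positional arrays via `.values`. As an alternative sketch, `DataFrame.reindex` can add missing columns and enforce the train order in one step. The toy frames and column names below are made up.

```python
import pandas as pd

# Toy stand-ins for the one-hot encoded train and test frames.
toy_train = pd.DataFrame({"age": [30, 40], "country_A": [1, 0],
                          "country_B": [0, 1], "target": [0, 1]})
toy_test = pd.DataFrame({"country_A": [1, 1], "age": [25, 35]})

# Feature columns in the order the model sees them during training.
feature_cols = [c for c in toy_train.columns if c != "target"]

# Columns missing from test are created and filled with 0; order follows train.
aligned_test = toy_test.reindex(columns=feature_cols, fill_value=0)
```

Note that `reindex` also drops any column that exists only in test, which is usually what is wanted for categories never seen in training.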

4. Modeling

📝 First, split the data into train and validation sets with the train_test_split function.

test_size =0.15

train_data, val_data = train_test_split(train_dataset, test_size = test_size, random_state = seed_num)

drop_col = ['target', 'id']

train_x = train_data.drop(drop_col, axis = 1)
train_y = pd.DataFrame(train_data['target'])

val_x = val_data.drop(drop_col, axis = 1)
val_y = pd.DataFrame(val_data['target'])

print(train_x.shape, train_y.shape)
print(val_x.shape, val_y.shape)

(12790, 104) (12790, 1)
(2258, 104) (2258, 1)

📝 I built a simple ensemble by soft voting over LGBM and XGBoost.

📝 Soft voting averages the predicted class probabilities of the LGBM and XGBoost models and picks the final class from that average.
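The averaging step described above can be sketched with plain NumPy. The probabilities below are hypothetical, not outputs of the models in this post.

```python
import numpy as np

# Hypothetical class probabilities from two models for three samples
# (columns: P(class 0), P(class 1)).
proba_a = np.array([[0.90, 0.10], [0.40, 0.60], [0.55, 0.45]])
proba_b = np.array([[0.70, 0.30], [0.20, 0.80], [0.30, 0.70]])

# Soft voting: average the probabilities, then take the most likely class.
avg_proba = (proba_a + proba_b) / 2
soft_vote = avg_proba.argmax(axis=1)

# Each model's own class prediction, for comparison.
hard_a = proba_a.argmax(axis=1)
hard_b = proba_b.argmax(axis=1)
```

For the third sample the two models disagree, and the more confident one decides the averaged result.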

LGBClassifier = lgb.LGBMClassifier(random_state = seed_num)

lgbm = LGBClassifier.fit(train_x.values,
                       train_y.values.ravel(),
                       eval_set = [(train_x.values, train_y), (val_x.values, val_y)], 
                       eval_metric ='auc', early_stopping_rounds = 1000,
                       verbose = True)

XGBClassifier = xgb.XGBClassifier(max_depth = 6, learning_rate = 0.01, n_estimators = 10000, random_state = seed_num)

xgb = XGBClassifier.fit(train_x.values,
                       train_y.values.ravel(),
                       eval_set = [(train_x.values, train_y), (val_x.values, val_y)], 
                       eval_metric = 'auc', early_stopping_rounds = 1000,
                       verbose = True)

voting = VotingClassifier(estimators=[('xgb', xgb),('lgbm', lgbm)], voting='soft')
vot = voting.fit(train_x.values, train_y.values.ravel())  # ravel: scikit-learn expects a 1-D label array

5. Evaluation & Submission

l_val_y_pred = lgbm.predict(val_x.values)
x_val_y_pred = xgb.predict(val_x.values)
v_val_y_pred = vot.predict(val_x.values)

# scikit-learn metrics take (y_true, y_pred); accuracy is symmetric, so the values are unchanged
print(metrics.accuracy_score(val_y, l_val_y_pred))
print(metrics.accuracy_score(val_y, x_val_y_pred))
print(metrics.accuracy_score(val_y, v_val_y_pred))

0.8702391496899912
0.8680248007085917
0.8596102745792737

print(metrics.classification_report(v_val_y_pred, val_y))

              precision    recall  f1-score   support

           0       0.93      0.89      0.91      1800
           1       0.63      0.72      0.68       458

    accuracy                           0.86      2258
   macro avg       0.78      0.81      0.79      2258
weighted avg       0.87      0.86      0.86      2258
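When reading this report, keep in mind that scikit-learn metrics take their arguments as `(y_true, y_pred)`, while the call above passes the predictions first. With the arguments in that order, precision and recall trade places and the support column counts predicted rather than true labels. The toy example below illustrates the effect on made-up labels.

```python
from sklearn.metrics import precision_score, recall_score

# Made-up labels: 3 true positives in the data, 2 predicted positives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)

# Swapping the arguments turns precision into recall and vice versa.
swapped_precision = precision_score(y_pred, y_true)
swapped_recall = recall_score(y_pred, y_true)
```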

# x_val_y_pred holds the XGBoost predictions, l_val_y_pred the LightGBM ones
val_xgb = pd.Series(x_val_y_pred, name="XGB")
val_lgbm = pd.Series(l_val_y_pred, name="LGBM")

ensemble_results = pd.concat([val_xgb,val_lgbm],axis=1)
sns.heatmap(ensemble_results.corr(), annot=True)
plt.show()

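📝 Soft voting did not improve performance.

📝 The predictions of the two models were highly correlated, which is one plausible reason the ensemble did no better than the individual models.

(I think I need to study this a bit more 😂😂)

Thank you :) I hope this was helpful 👍👍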
