kbc-1315.log

3. Oxford Cybersecurity Program : Review

Thu, 19 Feb 2026 06:51:10 GMT

Three-line summary

A good chance to live in the UK, and in Oxford at that
Academically, I did not gain much
Korea is a better place to live than I had thought

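Program overall

Because this was the first run of the program, every announcement and notification was being set up from scratch. I understand that a first run will have gaps in preparation, but the gaps here were far too large. The students did not know where, when, or on what schedule classes would be held, and neither did the faculty teaching them; it was a shock to have faculty come and ask the students.
The students had no authority, so when a problem or suggestion came up and we passed it to Kellogg and to the project office, we had no way of knowing what happened to it. Many issues were bounced back and forth between the two and then simply vanished.
The combination of British-style administration (swallowing tasks, not acting until the deadline hits) and Korean-style administration (proof by signature, proof by document) made life hard only for the students caught in the middle.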

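Classes

There were three module courses. Before dispatch there was a rumor that only students with strong relevant skills should be sent, so I went expecting to learn a great deal.
I majored in economics and am now in an AI graduate school, so I lack domain knowledge and had never taken a security course. Even so, far from being high-level, the classes reused old slides and I found nothing to learn. Students with a security background said the level was about that of a general-education course.
Late in each course there is a project whose purpose is unclear. Most are team projects that end in a PPT presentation. Time is so tight that experiments are not possible, so most teams do a survey and present it. With six people per team, free riders appear, and it becomes a great opportunity to sour relationships.
In the end, what stayed with me most was participants sharing with each other what research and study they were doing.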

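Project

The topic was AI + security. The problem was that most participants did not know much about AI and, above all, no computing resources were provided for training AI models. The time given was also far too short.
As a result, most projects stopped at "we connected an LLM and tried it out", and they were run on existing resources such as home lab servers.
The PhD in charge of the project was very enthusiastic, but again, with six people per team there were many free riders and it was not easy.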

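Living

The living allowance you receive is gone once you pay rent. Living expenses you have to cover on your own. There are dormitories, but unlike in Korea, August in the UK academic calendar is a season when few students move out or in. So cheap, nearby dorms are hard to get, and you end up having to find lodging somewhat farther away.
Buses are expensive, so many people cycled, but that is very dangerous. Unlike in Korea, a bicycle is treated as a vehicle, and traffic keeps to the left, so accidents are very frequent. Walking around, you easily spot bicycles decorated with flowers; I was told that means there was a fatal accident at that spot.
If possible, rather than finding lodging individually, it is a good idea for Koreans to band together and rent something like a Flat. The key points to check are whether the bathroom is private, whether the kitchen is shared, and the location. Sharing a kitchen and bathroom with people from other countries is really not easy. Getting together as a group and contracting a whole Flat is the better way to go. Renting a detached house is not possible, because UK property law requires a guarantor who lives or works in the UK (the college will not act as guarantor).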

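Suggestions

Hold a participant seminar at the start of the program
- People need to know each other well for project organization and module team formation to go smoothly, and the academic exchange about each other's research was genuinely useful and important.
- So, if the program is going to run anyway, rather than starting classes right away there should first be a session where participants share what research they do, what their labs are working on, and what they are studying.
- In this program we took volunteers and ran the seminar ourselves, reusing existing PPT slides as much as possible or using simple materials (Notion, etc.).
Help contacting Oxford labs for internships
- What participants really want is academic and research experience.
- Getting a reply from UK faculty by email was very hard. I contacted a great many labs and did not even get a reply.
- So, if the program asked for cooperation at the program level and helped with lab contacts, the research experience could be put to much better use.
From showcase module classes to lab-specific research classes
- The module classes had nothing to learn from. At the end of Module 3 the professor introduced a project they had actually worked on, and that was more interesting than all of the classes.
- Introductions to, or if possible classes on, the research each lab is doing, the kind you can only hear at Oxford, would be far better.
Appoint a participant leader and grant authority
- Participants have no authority, so when a problem or suggestion comes up they pass it to the college or the project office and then wait indefinitely for an answer.
- Throughout the program we ran things ourselves by asking each other what starts when and, even on the day of a class, whether this was the right room.
- So it would be better to form groups and select leaders before dispatch, give them an incentive (150,000 KRW per month), and give them the authority and duty to talk with the school and the project office, so that they serve as the contact point and handle that work.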

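Closing thoughts

I went with the thought, "If not through this program, when would I ever get to study in the UK, and at Oxford at that?" I wanted to do research, but the environment did not allow it, so I feel like I just became a graduate student with a career gap. Still, I have to restart my research somehow, and I am writing this for the next cohort. There was no academic or research harvest, but the program did satisfy the romance of living in Europe for six months. And Korea is a better place to live than I had thought.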

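Text-to-SQL: An LLM-Based Virtual Company Chatbot Template

Mon, 12 Jan 2026 13:17:11 GMT

Talking Potato: SQL + LLM-based company chatbot service

Overview

This project combines a SQL database with a large language model (LLM) in a Text-to-SQL setup: it generates virtual company data and implements a chatbot service that answers natural-language questions about it. The web UI is built with Streamlit.

Why I built it

To connect structured SQL data with natural-language queries
To generate virtual company data and run question-answering experiments on it
Usable as an example chatbot structure for teaching and prototyping

Tech stack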

Python 3.9+
Streamlit
SQLite
OpenAI API

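Installation and run

# 1. Clone the repository
git clone https://github.com/KBC-1315/Talking_Potato
cd Talking_Potato

# 2. Install the libraries
pip install -r requirements.txt

# 3. Set the OpenAI API key
# Set it as an environment variable, or enter it directly in the Streamlit UI

# 4. Run
streamlit run app.py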

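Main features

1) Settings tab

Enter company information
Generate virtual company data with the LLM
Save to the SQLite DB
Create a default admin account

2) DB status tab

Check the row count of each table
Check the stored sample data

3) Chatbot tab

DB-based question answering
Conversation context is kept
System prompt can be edited
Agent-based response flow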
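
The DB-based question answering above can be sketched roughly as follows. This is not the code in the Talking_Potato repository: the `employees` table, the `answer` and `is_read_only` helpers, and the `fake_llm` function (a stand-in for the real OpenAI call) are all assumptions made for illustration. The sketch only shows the general Text-to-SQL loop: put the schema in the prompt, let the model return SQL, refuse anything that is not a single SELECT, and run it on SQLite.

```python
import sqlite3


def schema_text(conn):
    """Collect CREATE TABLE statements to include in the LLM prompt."""
    rows = conn.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND sql IS NOT NULL"
    ).fetchall()
    return "\n".join(row[0] for row in rows)


def fake_llm(prompt):
    """Hypothetical stand-in for the LLM call; always returns one fixed query."""
    return "SELECT name FROM employees WHERE dept = 'Sales' ORDER BY name"


def is_read_only(sql):
    """Very rough guard: accept a single SELECT statement and nothing else."""
    stripped = sql.strip().rstrip(";")
    return stripped.lower().startswith("select") and ";" not in stripped


def answer(conn, question, llm=fake_llm):
    """Build the prompt, get SQL from the model, validate it, and execute it."""
    prompt = f"Schema:\n{schema_text(conn)}\n\nQuestion: {question}\nSQL:"
    sql = llm(prompt)
    if not is_read_only(sql):
        raise ValueError("refusing to run non-SELECT SQL")
    return conn.execute(sql).fetchall()


# Tiny in-memory database standing in for the generated company data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (name TEXT, dept TEXT)")
conn.executemany(
    "INSERT INTO employees VALUES (?, ?)",
    [("Kim", "Sales"), ("Lee", "R&D"), ("Park", "Sales")],
)
rows = answer(conn, "Who works in Sales?")
```

The string check in `is_read_only` is deliberately crude. In a real deployment, opening the database read-only would be a sturdier safeguard than inspecting the generated SQL.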

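Project structure

app.py        : Streamlit app entry point
core/         : main logic modules
data/         : SQLite DB storage path
requirements.txt : dependency list

Possible uses

Q&A chatbot over in-house documents
Automated FAQ response system
Experiments with SQL-based search interfaces

Summary

Talking Potato is a good example project for understanding a chatbot structure that combines SQL and an LLM. It lets you see the Streamlit and OpenAI API integration flow in one place.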

Reverse Shell and VPS: Reaching a Server with Blocked Inbound Ports

Sat, 10 Jan 2026 17:30:12 GMT

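Cover image made with GPT image generation

This post was written for research purposes. The technique can amount to a cyberattack, so keep in mind that responsibility for any indiscriminate use lies with you.

Goals

Connect reliably over SSH/VSCode from outside to a server whose inbound ports are blocked
Use a Reverse SSH Tunnel
Use an external VPS as the single entry hub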

구조

[Local PC]
    |
    | SSH / VSCode
    v
[VPS (public IP, inbound allowed)]
    ^
    | Reverse SSH (outbound)
    |
[Internal Server]

1. Basic VPS setup

Create a VPS on a commercial service such as Oracle or AWS; I used AWS Lightsail.

Linux (Ubuntu LTS)
Public IP assigned

Allow the following ports in the firewall

TCP 22
TCP 2222

2. VPS SSH configuration

Edit /etc/ssh/sshd_config

AllowTcpForwarding yes
GatewayPorts yes

Restart SSH

sudo systemctl restart ssh

3. Create the Reverse SSH Tunnel from the internal server

Prepare the key used to connect to the VPS

Example: vps_key.pem
Set its permissions to 600

chmod 600 vps_key.pem

Run the reverse SSH

ssh -i vps_key.pem -N -R 2222:localhost:22 vps_user@

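Meaning

Port 2222 on the VPS is forwarded to SSH port 22 on the internal server
The command keeps running for as long as the session is up (-N runs no remote command)

4. SSH configuration on the local PC

Write the SSH config on the local PC

File location

Linux/macOS: ~/.ssh/config

Windows: C:\Users\<username>\.ssh\config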

Host vps
  HostName 
  User vps_user
  IdentityFile ~/.ssh/vps_key.pem
  IdentitiesOnly yes

Host internal-server
  HostName localhost
  User internal_user
  Port 2222
  ProxyJump vps

5. Connection test

ssh internal-server
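
If the connection test fails, it can help to first check whether the forwarded port is listening at all. The helper below is a small sketch for that, not part of the original setup. The name `port_is_open` is made up here, and it assumes you run it on the VPS against 127.0.0.1 and port 2222 (the tunnel port used in this post).

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False
```

On the VPS, `port_is_open("127.0.0.1", 2222)` returning False suggests the reverse tunnel from the internal server is not up, and that is worth checking before you debug the local SSH config.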

6. Keep the tunnel alive with autossh

sudo apt update
sudo apt install -y autossh

autossh -M 0 -N \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  -i vps_key.pem \
  -R 2222:localhost:22 \
  vps_user@

7. Full automation with systemd (on reboot)

`/etc/systemd/system/reverse-ssh.service`

[Unit]
Description=Reverse SSH Tunnel
After=network-online.target
Wants=network-online.target

[Service]
ExecStart=/usr/bin/autossh -M 0 -N \
  -o ServerAliveInterval=30 \
  -o ServerAliveCountMax=3 \
  -i /path/to/vps_key.pem \
  -R 2222:localhost:22 \
  vps_user@
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

sudo systemctl daemon-reload
sudo systemctl enable reverse-ssh
sudo systemctl start reverse-ssh

systemctl status reverse-ssh

2-1. Oxford Cybersecurity Program : Silverstone F1

Wed, 03 Sep 2025 13:03:36 GMT

Silverstone Circuit

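Location: on the border of Northamptonshire and Buckinghamshire, England
Opened: 1948, converted from RAF Silverstone, an airfield during World War II
Length: about 5.89 km (current Grand Prix Circuit)
Features:
- A layout that mixes fast high-speed corners with long straights
- Signature corners: Copse, Maggots–Becketts–Chapel, Stowe
- Called a "mecca of speed" by drivers and fans alike
Major events:
- Hosted the first F1 World Championship Grand Prix in 1950
- Still hosts the F1 British Grand Prix every year
- Hosts various international motorsport events such as MotoGP and FIA WEC
Capacity: about 150,000 or more

In the end I did the F1 Single Seater program. It cost about 500,000 KRW... (and that was with the full first-session discount).

I have driven manual a fair amount and went in confident, but I stalled the engine three times. Unlike military or civilian vehicles, the accelerator and brake need a great deal of pressure, enough to make my legs hurt... and the pedals are very sensitive, so the car was hard to control.

For lodging I stayed at the Hilton Hotel, the best facility there, for about 200,000 KRW a night, which was cheaper than I expected. In F1 footage, the building to the left of the starting grid turned out to be the Hilton... amazing...

I had to get back to Oxford, but the return day was a Sunday and the buses were not running... So I took an Uber to the nearest area with bus service and went back to Oxford by bus from there.

When will I ever visit an F1 circuit again... It was an experience with no regrets.

Be sure to book the museum before you go, and try the simulator inside if you are interested. The souvenir shop next to the Hilton Hotel has a wider selection, but apart from special offers the online shop tends to be cheaper.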

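1-5. Oxford Cybersecurity Program : Finding Accommodation

Tue, 02 Sep 2025 21:50:56 GMT

Oxford accommodation options at a glance

For people preparing a stay in Oxford, this is a brief summary by accommodation type, with links. The descriptions are based on official pages and trustworthy platforms.

1. On-site accommodation provided by Kellogg College

It offers full-time, part-time, and short-stay accommodation.
- Full-time Student Accommodation: self-catering (kitchen, refrigerator, water purifier, etc.); mostly shared bathrooms, with some en-suite rooms. See the [Kellogg College accommodation page] for how to apply. (kellogg.ox.ac.uk)
- Short-stay Accommodation: also open to short-term visitors and alumni at set rates. (kellogg.ox.ac.uk)
- Donald Michie House (B&B style): short-stay accommodation in the middle of the Kellogg campus, with en-suite rooms, a shared lounge, kitchen, TV, bicycle storage, Wi-Fi, and so on. Breakfast is not included. (universityrooms.com)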

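2. Accommodation provided by Oxford University and its colleges

Most colleges offer accommodation options, including for graduate students, and the [University Accommodation Office] also supports accommodation for couples and families. (ox.ac.uk)
However, as of 2018-19, only about 72% of new graduate students secured college or university accommodation, and only about 53% of graduate students overall were provided with it. In other words, it is not fully guaranteed. (ox.ac.uk)

3. Privately-rented housing

Oxford has many house and flat rentals, with Rightmove, Zoopla, and SpareRoom as the main platforms. Contracts are mostly for one year and can cost more than term-time-only options. They are often the better choice if you also want to stay outside term. (ox.ac.uk)
Caution: the rental market is very volatile, and there are scam cases targeting people searching from abroad, so be careful.
If you are abroad, there are services where a local agent views the property for you, but they are not all that trustworthy.
We looked for a Roomshare as a group and checked rentals of types such as Flat, House, and Semi-detached, but under UK regulations a foreigner needs a guarantor employed in the UK along with proof of income and similar steps, so it looks nearly impossible. (https://www.rightmove.co.uk/)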

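4. Airbnb

Through the Airbnb platform you rent an ordinary home, housing included, but there are many scams here too... and, as expected, nothing is available in a good location at a good price.
Others in the same program did sign Airbnb contracts on decent terms, so if you want to use Airbnb you will need to keep a close eye on the listings.
Cost : at least about 2 million KRW per month

5. Private student halls

Options in modern buildings with shared flats, studios, en-suite rooms, and so on
- Amber Student(amberstudent.com)
- University Living(universityliving.com) https://nooc.org.uk/
Competition for places is fierce, so securing one in advance is important.
The cost is relatively low, but depending on the option it goes up to 1.5-2.5 million KRW
Private bathrooms are not that common

College dorms, Oxford University dorms, and private halls were all full, and Airbnb had no suitable listings, so the six of us from other universities who had gathered for a Roomshare ended up combing through estate agents (and in the end could not sign a contract)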

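Failed rental contract

Must have a private bathroom
Must accommodate 5-6 people
House, Semi-detached, or Flat types preferred
Must be close to Kellogg College
We searched extensively and contacted many places with the criteria above, but in the end, as foreigners we needed a guarantor, and above all they do not take short-term lets!
It seems one year or more counts as long-term and anything shorter as short-term. See the two pages below: https://homelet.co.uk/tenants/tips-for-tenants/tips-for-being-referenced https://homelet.co.uk/tenants/tips-for-tenants/what-documents-do-you-need-to-rent-a-house-or-flat
In the end... we also asked estate agencies directly, but could not find anything

Student Homes, a Korean booking agency

https://www.studenthomes.co.kr/

There are so many scams that at first I really did not trust it
So I ran the following checks
- They said the business was registered with the Gangwon Center for Creative Economy and Innovation, so I searched, and the CEO really does come up as an innovative founder
- Searched the Korea Consumer Agency and other complaint sites (clean)
- Read every review
- They also have job postings on recruitment sites
In the end I decided to use them, but I could not hold back and asked
- They said they can run the business without charging a fee because they are paid by the overseas properties, so they do not charge students separately
- The accommodation cost was the same with or without the agency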

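The Park, Oxford

After that I sent an inquiry through the platform's KakaoTalk channel and got a reply fairly quickly
This part felt a little scripted: they have one property in Oxford, and not many places are left!
The cost is roughly 2 million KRW per month and it is quite far (a 1.5-hour walk), but it has private bathrooms and the facilities looked decent, so six of us signed together.
The place is 50 minutes away even by bus
I only found this out after moving in, but to spoil it in advance: when students contract through this agency, the structure looks like this

Student - Student Homes - Uninst - Almero (property management)
With this many companies in the chain things feel a bit sluggish, but if you tell Student Homes what you need, it does get done.
Once you send over the various details, a confirmation arrives by email.
5,760 pounds = about 11 million KRW... and that covers only 150 days; the rest you have to sort out yourself. (In my case I moved in at the same property by paying for early check-in.)
You can see the facilities very realistically with the 3D view on the website (it really looked the same) https://www.theparkoxford.com/
In the end, with 1.55 million KRW for early check-in plus 11 million KRW for the regular stay, I settled the booking at about 12.55 million KRW for roughly six months... (what about living expenses)
Most short-term lets ask for the accommodation cost up front. The Park was also almost entirely prepaid, in three installments.
In my case the installments were 400, 400, and 200 (in units of 10,000 KRW), scheduled for late July, early September, and mid-October.
This was the biggest burden...
They say that demands for prepayment in property deals are often scams, but foreigners do not seem to have many other options...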

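1-4. Oxford Cybersecurity Program : Preparing Personal Prescription Medication

Tue, 02 Sep 2025 20:35:43 GMT

Three-line summary

Be sure to check whether your medication can be prescribed long-term
Get the prescription right before departure
Bring an English prescription and the medication leaflets

1. Check whether your medication can be prescribed long-term

I assumed that for any medication, if I said "I will be away for six months, so please prescribe six months' worth!", they would just give it to me. But no: the medication I take is limited to a maximum prescription of three months... and I had to be away for six. So be sure to visit your clinic in advance and ask.

2. Get the prescription right before departure

There were two options.

Go to a general hospital (secondary care) and get a long-term prescription
Get two 3-month prescriptions, as a one-time exception

I cannot say whether these apply to everyone, so be sure to ask your doctor...

With little time left before departure, I could not wait for tests, results, and consultations, so I took option 2 and visited twice, getting three months' worth each time

3. Bring an English prescription and medication leaflets

An Asian man carrying a pile of pills might look suspicious, so I brought an English prescription and English leaflets for the medication.

Most medication comes with an English leaflet, so that needed no special effort
For the English prescription, I asked the hospital's administration desk; they dug through a hazy memory of how to do it, wrote one for a fee, and even stamped it with the official seal.

Thinking about it now, nobody looked at any of this at immigration...

Anyway, six months of prescription medication: solved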

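1-3. Oxford Cybersecurity Program : Getting the Visa

Tue, 02 Sep 2025 20:12:25 GMT

Three-line summary

Install the UK ETA app
Hold your phone against your passport (the physical one)
Pay

I thought I would be going on a student visa, but I was told a UK student visa is difficult unless you are full-time... So I decided to get a travel visa instead. The UK recently switched to an electronic travel authorisation system, so you can easily apply for stays of under six months. There is a website and an app; the app is easier, so I used the app.

1. Install the UK ETA app

2. Register your passport details

I did not have my physical passport with me, only a photo of it, and lazily assumed registering with the photo would work. It did not recognize it at all (images just do not seem to be recognized well), so as Plan B I had to register the passport details using the electronic chip in the passport. The reader sits in a different place on each phone, so hold the passport against where you think it is and wait; the registration then completes. (Easy!)

3. Pay, again

The price was 16 pounds, which depends on the exchange rate but came to roughly 30,000 KRW. I paid while holding back tears...

4. Receive the ETA confirmation email

Forget about it for 2-3 days and a confirmation that it has been issued arrives at the email address you registered. This was my first time flying abroad, so I printed the email just in case, but that does not seem to be necessary...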

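1-2. Oxford Cybersecurity Program : Flights

Tue, 02 Sep 2025 19:56:52 GMT

Three-line summary

Install the Asiana app
Choose the schedule and seat
Pay, in tears

The program booked flights in bulk, so my schedule and flight were already fixed. But it turned out there was more to do. The steps below apply when a travel agency has booked your ticket for you as part of a group.

1. Install the Asiana app

Use the Asiana website or install the app

2. Look up the reservation

Go to the reservation lookup tab. At first your ticket probably will not show up even after logging in... You need to register your Asiana membership number for that.

For the registration, make sure you know your flight booking reference in advance. (e.g. DDD777)

Enter the booking reference and the other required details, and you can look it up even as a non-member.

3. Link the ticket to your account

This is simple. The ticket you looked up has a button for linking it to an Asiana account. Press it and enter the Asiana membership number shown on My Page, and the link is complete. After that, the ticket shows up under My Reservations.

4. Seat selection

You need to choose your seat on the aircraft seat map in advance.

Comfortable seats cost extra to reserve. (If they are still free two days before departure, they can be selected at no charge from then on.)

Seats for getting off quickly (paid)
Emergency exit seats (comfortable - paid)
Seats with a bit more room (paid)

Those are the kinds of seats available. The extra charge runs from about 60,000 KRW to about 160,000 KRW. Poor me picked a free seat. I was traveling with others on the outbound flight and ended up in the middle seat of the middle block of three. (I took melatonin and passed out, so it was not that uncomfortable.)

Beyond that, self check-in did not seem essential, so I skipped it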

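1-1. Oxford Cybersecurity Program : Finding Funding

Tue, 02 Sep 2025 19:40:08 GMT

In early 2025 my advisor suggested by email that I join this program, and I quickly put together the application form, my advisor's recommendation letter, a study plan, a personal statement, and the other documents, and applied.

So, luckily, I was able to skip steps such as searching for a school and searching for funding.

Link to the announcement

1. The Ministry of Science and ICT Oxford cybersecurity program

That is not the official name, but I will shorten it for convenience. The broad category is a short-term intensive capacity-building project for digital innovation talent (디지털혁신인재 단기집중역량강화), and several schools were already running it, such as the University of Toronto in Canada and Carnegie Mellon in the US.

For Oxford in the UK, however, this was the first round. The outline is as follows

Program name : Global Cybersecurity Talent Development Education Program
Training period : September 2025 to February 2026 (about 6 months) [2025.08.25 ~ 2026.02.10]
Number selected : around 30
Eligibility : graduate students, young freelancers, young people preparing for independent living (자립준비청년)
Support items : tuition, living allowance, airfare, visa fees, travel insurance (around 55 million KRW)
Faculty : Computer Science faculty affiliated with Kellogg College, University of Oxford

Anyway, living in the UK was on my bucket list, and my PhD program has a mandatory internship as a graduation requirement anyway, so I decided to apply.

2. Selection process

The overall flow is as shown above. I should say up front that I am an economics major (now enrolled in an AI graduate school) with no real background in information security or mathematics, and had never taken a class in either.

2-1. Document screening

The main items to prepare were as follows.

Scanned program application form (Korean)
Scanned consent form for the use of personal information
Recommendation letter from your advisor
Undergraduate graduation certificate and transcript (Korean)
Personal statement, free format (English)
Official English test score and proof of the stated details
Research plan (in Korean, in my case)
Graduate school certificate of enrollment (Korean)
Graduate school transcript (Korean)

Not all of these are submitted at the document screening stage; this is the full list of documents to prepare. See the announcement for what each stage requires.

The other administrative documents, recommendation letters, and so on differ by institution, so I will not go into them.

Since I am not a security major, I wrote the personal statement with the aim of showing that I am serious about information security, and of making it flow naturally from why I came to research this field, through how I prepared, to the research I want to do.

I did not expect it, but fortunately the acceptance email came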

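2-2. Practical exam and interview

If you ask why these are together when the image shows them separately: it is because they were held on the same day.

Having unexpectedly received the acceptance email, I felt strong pressure to study hard for the exam.

For the exam you choose one of two fields, information security or mathematics (linear algebra), and solve the problems within the given time. Being a humanities guy, I naturally chose information security and prepared by working through the designated candidate problems (they were posted in the announcements). I treated it like a question bank, and somehow kept wondering, "Is it okay for it to be this easy?", so I also memorized material from concept textbooks.

The information security problems were at the level of the Industrial Engineer Information Security exam (정보보안산업기사), with answer choices much easier to pick from than the Engineer-level exam. I hold the Engineer Information Processing (정보처리기사) and Big Data Analysis Engineer (빅데이터분석기사) certificates and am preparing for Engineer Information Security (정보보안기사), so it may simply have felt easy to me.

With preparation done, this poor out-of-towner traveled to Korea University, the venue for both the exam and the interview. Sadly, transportation was not covered, but lunch was, and I enjoyed the lunch box.

The exam was easier than I expected...

The problems were so easy that I was thrown off. I figured that anyone who slipped up and got something wrong would be out.

So I figured the interview would be what separated people.

However, the number of people who showed up for the exam was only slightly over 30, so the competition did not seem that intense.

The practical exam was in the morning, followed directly by the interviews in the afternoon.

Luckily I was assigned an early slot, so I interviewed without much waiting and could go home early.

There were two Korea University professors in the interview room, and answers had to be given in English so that they could gauge overall English ability.

Self-introduction
Motivation for applying (you majored in economics, so why information security?)
Future research

I answered these three main questions in English. My style is to answer freely from keywords without a script, so I rambled through my answers and left.

After a few anxious weeks of waiting, I fortunately received the acceptance email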

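2-3. Orientation

After a month of preparation following acceptance, there was an orientation at Korea University, so I went (100,000 KRW per round trip...).

About half of those accepted were Korea University students, and the other half were from other universities.

The main points covered were as follows.

Program overview (classes, projects, etc.)
Living information (accommodation, transport, etc.)
What to prepare before dispatch

The education program consists of classes taught by professors at Kellogg College together with projects. We were told, however, that arranging accommodation is very hard and that we should start looking early. That is when I started to get scared. Finally, as pre-dispatch preparation, we had to take Coursera courses and submit the certificates as proof. (Coursera courses are free to take, but it turns out you have to pay to get the certificate...)

When I looked into accommodation beforehand, almost nowhere was available, so I decided to gather students interested in sharing rooms. I cover this in more detail in the accommodation post.

2-4. Living allowance (living expenses)

This is the biggest issue. Thankfully the program took care of flights, insurance, and tuition. But the UK is known for its high prices, and I wondered what I would have to do just to avoid going hungry.

I have two sources of income.

The living allowance provided by this education program
Personnel costs from participating in research projects at my home institution (GIST)

The living allowance is nowhere near enough to cover living expenses including all accommodation costs. My advisor kindly arranged for me to cover the difference with personnel costs from the research projects I had been working on at GIST.

However, each research grant has a different managing agency and different eligibility conditions, so I had to contact the person in charge of every grant and check whether I could still be paid. For example, if a grant carries a condition such as requiring residence in Korea, payment is not possible. In my case, for the grants where payment was explicitly ruled out or where no relevant rule existed, I simply did not take the money, and I was paid only from the grants with no issues.

The Korea University project office said that as long as there is no conflict on the side of the other research grants, there is no real problem with paying the living allowance

Anyway, that is how tuition, flights, insurance, accommodation, and living expenses were sorted out. There is still a huge amount left to prepare...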

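[Web Security] Google Is the Best Hacking Tool

Fri, 25 Jul 2025 08:10:09 GMT

1. Overview

While checking whether internal board materials on a certain site could be accessed, I found a vulnerability that allowed direct download of board files that should be accessible only after login
In particular, I confirmed that files could be downloaded through the following URL structure regardless of login state:

https://example.com/api/file/fileDown?file_name=파일명.pdf

This can be classified as an authentication bypass and Insecure Direct Object Reference (IDOR) vulnerability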

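2. Search techniques

Search using advanced operators such as the following in search engines like Google and Bing

Example: searching for file download API endpoints

site:example.com inurl:fileDown

The inurl: operator is used to find pages whose URL contains fileDown, download, attach, and the like

Example: searching for downloadable files

site:example.com (filetype:pdf OR filetype:xlsx OR filetype:hwp) inurl:fileDown

This broadly detects file types that can be downloaded without access control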

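3. How to check

Even if the UI restricts the board to logged-in users,
if a file download link is exposed, you need to check whether it can be called directly

Example scenario

Try accessing the URL directly without logging in
If access succeeds, the server side may not be verifying authentication
Test whether other internal documents can be reached by changing the file name

4. Countermeasures

Authentication and authorization checks are mandatory when accessing the download API or file server
At a minimum, the following measures are needed
- Check login state and access rights before a file download
- Add token-based verification to download links (e.g. expiring URL)
- Integrate access logs with an anomaly detection system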
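
As one way to picture the "expiring URL" item above, here is a minimal sketch of a signed, time-limited download token built with HMAC from the Python standard library. The secret key, the parameter names (`file_name`, `expires`, `sig`), and the helper names are all assumptions for illustration, not the API of any particular framework.

```python
import hashlib
import hmac
import time

SECRET_KEY = b"change-me"  # hypothetical server-side secret; never sent to clients


def sign(file_name, expires_at, key=SECRET_KEY):
    """HMAC-SHA256 over the file name and expiry timestamp."""
    message = f"{file_name}:{expires_at}".encode()
    return hmac.new(key, message, hashlib.sha256).hexdigest()


def make_download_query(file_name, ttl_seconds=300, now=None):
    """Build the query parameters for a link that expires after ttl_seconds."""
    now = int(time.time()) if now is None else now
    expires_at = now + ttl_seconds
    return {
        "file_name": file_name,
        "expires": expires_at,
        "sig": sign(file_name, expires_at),
    }


def verify(file_name, expires, sig, now=None):
    """Reject expired links and links whose signature does not match."""
    now = int(time.time()) if now is None else now
    if now > int(expires):
        return False
    expected = sign(file_name, int(expires))
    return hmac.compare_digest(expected, sig)
```

Because the file name is part of the signed message, changing `file_name` in the URL invalidates the signature, and this blocks the "swap the file name" test from the scenario above. A token like this adds to the server-side login and permission check; it does not replace it.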

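5. Conclusion

UI restrictions on the client side are not enough for security
Authentication and authorization control on the server side is essential, and
checking download paths should be treated as a mandatory item in security reviews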

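[Flutter] Building a Portfolio Community Web App

Fri, 13 Jun 2025 05:07:34 GMT

Development notes: a Flutter-based portfolio community web app

https://porpolio-web-community-f3d4a.web.app/

1. Project overview ('24.06.11~13)

Using Flutter and Firebase, I built a community-style portfolio web app where users can write and share their own portfolios.

Users can do the following:

Log in and sign up (Firebase Auth)
Add/edit/delete portfolio blocks (organized as category/block/detail items)
Sort portfolios by view count
Download or share each user's portfolio as a PDF
Search and infinite-scroll browsing
Real-time data sync based on Firestore

2. Main tech stack

Category	Technology
Frontend	Flutter (Web/Mobile Responsive)
State management	Provider
Backend	Firebase Firestore, Firebase Auth
Deployment	Firebase Hosting
Document output	Portfolio PDF generation and download using the `pdf` package

3. Core features and how they are implemented

✅ User authentication

Email login is supported through Firebase Authentication
A logged-in user can edit only their own portfolio

✅ Portfolio structure

Organized as a PortfolioCategory → PortfolioBlock → BlockDetail hierarchy
Each item can be reordered, by drag and drop or otherwise
Supports text and image types (images not yet)

✅ Firestore integration

Categories and blocks are stored under users/{userId}/portfolioCategories
user_profiles stores the user profile (name, email, url, etc.) together with viewCount
portfolios/{userId} stores a PortfolioSummary used for search and browsing

✅ PDF generation

On user request, the whole portfolio is assembled into a PDF using the pdf package
On the web it downloads automatically; on mobile the share menu is opened

4. Difficulties and how they were resolved

Problem	Solution
Firebase data structure design	Structured using Firestore Subcollections
Different PDF handling on mobile and web	Platform branching: `dart:html` vs `path_provider + share_plus`
Edit permission depending on login state	Conditional branching by comparing the Firebase Auth user with userId
Custom sorting and search	State management and filtering handled in Provider

5. Future plans (no idea when;)

Add comments and feedback
Automatic ranking of popular portfolios
Multilingual support (i18n)
SEO optimization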

[Reinforcement Learning] 1-3. Markov Decision Process

Thu, 01 May 2025 12:24:03 GMT

Markov Decision Process

1. Every State and Action is a Random Variable

$$ p(a_1|s_0,a_0,s_1) $$

To determine $a_1$ given $s_1$, there is no need to know $s_0$ and $a_0$.
Why? Because they are already reflected in $s_1$. So it can be rewritten as follows

$$ p(a_1|s_1) $$

What about the following case?

$$ p(s_2|s_0,a_0,s_1,a_1) $$

To know $s_2$, both $s_1$ and $a_1$ are needed
$s_0$ and $a_0$ are already reflected in $s_1$
So it can be rewritten as follows

$$ p(s_2|s_1,a_1) $$

Sufficient Statistics -> https://velog.io/@kbc-1315/DetnEst-5.-General-MVUE#sufficient-statistics

2. Policy

$$ p(a_t|s_t):\text{Policy} $$

The distribution over which action $a$ is taken in state $s$

$$ p(s_{t+1}|s_t,a_t) : \text{Transition} $$

This one is the Transition

Reinforcement learning maximizes the Return, or more precisely the Expected Return

$$ \text{Return }G_t=R_t+\gamma R_{t+1}+\gamma^2 R_{t+2}+\cdots $$

$R_t$ : the Reward received in the State reached after taking Action $a_t$
So we look for the Policy that maximizes the Expected Return $E[G_t]$
Original source [혁펜하임 YouTube] : https://www.youtube.com/watch?v=DbbcaspZATg&list=PL_iJu012NOxehE8fdF9me4TLfbdv3ZW8g&index=3

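[Reinforcement Learning] 1-2. Q-learning

Thu, 01 May 2025 12:01:56 GMT

Q-Learning

I want my AI Agent to find its way to a good restaurant
Sure, it might wander randomly and happen to walk in
Either way, I want it to get there

Greedy Action

Assign a score to each direction
In each cell the agent can move up, down, left, or right (except at the edges)
Entering the restaurant gives a reward R (Reward)
The agent keeps moving randomly and eventually happens to enter the restaurant
Moving in that direction produced a reward, so the Q-value for that direction is updated to 1
Suppose that in the next episode the agent is in the cell with the red border. The up, down, and left neighbors have no values, but the right one does: the largest value in the cell to the right is 1.
So the "right" entry of the red-bordered cell is also updated to 1, the largest value of the cell to its right

- The rest are updated the same way
- Once this is complete, the agent will from now on take the `Greedy Action` toward the largest Q-value.
- Every run will then give the same result
- But is this result optimal? Going right immediately would be a shorter path

Epsilon-Greedy

$$\epsilon\text{-greedy},\;\epsilon\in(0,1)$$

Take a Random Action with probability $\epsilon$
Exploration : try new strategies
- Can find a new Path
- Can find a new restaurant
Exploitation : take the best known action
Without Exploration, the agent will never visit a new restaurant and will keep going to the same place.
But if it keeps acting randomly, it can drift away from the optimal strategy
So we use a Decaying $\epsilon$
The Agent explores a fair amount at first and explores less later on
That is how we found a new shortest path.
But our Agent does not know which path is better.
Up or down, the Q-values are the same...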
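
The epsilon-greedy rule with a decaying $\epsilon$ can be sketched as below. The function names and the particular decay schedule (exponential decay with a floor) are assumptions chosen for illustration; the original lecture does not fix a specific schedule.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action from a dict of action -> Q-value for the current state.

    With probability epsilon explore (uniform random action);
    otherwise exploit (action with the largest Q-value).
    """
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def decayed_epsilon(episode, start=1.0, end=0.05, decay=0.99):
    """Exponentially decaying epsilon that never drops below `end`."""
    return max(end, start * decay ** episode)
```

Early episodes (epsilon near `start`) are mostly exploration, and later episodes are mostly greedy. Keeping a small floor `end` means the agent never stops exploring completely.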

Discount Factor

Discount factor : $\gamma$ is a value greater than 0 and less than 1. When the discount factor is applied,
$\gamma^2 >\gamma^4$ at the starting point, so the Agent will choose right
So this Discount rate
- makes the Path search efficient, and
- sets how much weight is given to present versus future Rewards

Q-value Update

$$ Q(s_t,a_t) \leftarrow (1-\alpha)Q(s_t,a_t)+\alpha(R_t+\gamma\max_{a_{t+1}}Q(s_{t+1},a_{t+1})) $$

Update the Q-value on the left of the arrow with the value on the right of the arrow
$\alpha$ is a value between 0 and 1 : how much of the new information to accept?
$(1-\alpha)Q(s_t,a_t)$ : the Q-value we already had
$\alpha(R_t+\gamma\max_{a_{t+1}}Q(s_{t+1},a_{t+1}))$ : the new value to update toward
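
The update rule translates almost line for line into code. In this sketch the Q-table layout (a dict of state -> dict of action -> value), the state names, and the convention that an unknown or terminal next state contributes 0 are assumptions made for illustration.

```python
def q_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.9):
    """Apply Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (R + gamma * max_a' Q(s',a')).

    q maps state -> {action: Q-value}. A next_state with no entries
    (for example a terminal state) contributes 0 to the target.
    """
    next_values = q.get(next_state)
    best_next = max(next_values.values()) if next_values else 0.0
    target = reward + gamma * best_next
    q[state][action] = (1 - alpha) * q[state][action] + alpha * target
    return q[state][action]


# Cell "A" sits to the left of cell "B", whose best Q-value is already 1.
q_table = {"A": {"right": 0.0}, "B": {"right": 1.0}}
new_value = q_update(q_table, "A", "right", reward=0.0, next_state="B",
                     alpha=0.5, gamma=0.9)
```

With `alpha=0.5` the old value 0 and the target $0 + 0.9 \cdot 1$ are averaged, so `new_value` comes out close to 0.45. This is the discounted propagation described in the Discount Factor section.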

Original source [혁펜하임 YouTube] : https://www.youtube.com/watch?v=3Ch14GDY5Y8&list=PL_iJu012NOxehE8fdF9me4TLfbdv3ZW8g&index=2

[Review][Spoilers] Mickey 17 Review

Mon, 03 Mar 2025 11:46:39 GMT

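A film you can feel as much as you think

As the history of cinema has grown longer, its diversity and depth have grown with it. Some films, like popcorn movies made to kill time, show the director's intent openly without hiding much, while many others carry a hidden intent and invite the audience to savor it. This is a personal view, but I think the line between the two is drawn by highly subjective factors. After all, no director tells you, "This film has many hidden meanings, so please look for them." I went in having heard a lot of harsh reviews of this film, and I feel I enjoyed it more precisely because I watched from a somewhat critical, pessimistic stance. In other words, I put a lot of effort into asking which parts of the film deserved to be torn down, and what I found instead was its hidden pleasures and the director's intent. I can say it was a truly satisfying film.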

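What is death to a mortal being?

There is a question the film keeps putting to the protagonist.

"What does it feel like to die?"

Thanks to an advanced technology called the human printer, the protagonist dies but is generated again, moved into another body, and goes on with his life. Or rather, it is more accurate to say that his life is continued by the will of others. Having experienced so many deaths, you would think he could easily describe the feeling, but he cannot. Early in the film, once his personality has been turned into data and his body can be reprinted, he is pressed to kill himself as proof that he trusts the system. But in the end he cannot do it. Perhaps he could not believe that life would return after death. This does not change even late in the film. After death upon death, numbers like 17 and 18 are attached to his name, yet he still cannot answer the question clearly and only says that it is frightening. For humans, who are mortal, death is inevitable and inescapable, and it is the very image of fear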

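How should we live in the face of death?

On the future Earth where the film is set, countless people, or rather life's losers, volunteer for a colony ship and leave Earth to wipe their lives clean. Earth is shown as a place of abnormal lives, including a financier who lends money and, purely for his own amusement, kills those who cannot repay and enjoys watching it. The protagonist, too, was taken in by a friend's claim that an era would come when people prefer macarons to hamburgers, borrowed from a loan shark, and was on the verge of meeting that same fate. He left as if fleeing that cruel fate, but he boards the ship as an Expendable, a person who can be copied. All kinds of people live on the ship as well: irrational people, a politician driven out after losing in politics, elite soldiers, debtors on the run... Among them, the politician who fled after losing an election acts on the flattery and advice of his wife and those around him, and comes across as neither serious nor rational. He moves by the will of others, regards other people as mere tools or parts, and dresses this up as the will of God. There is also a character who is an elite soldier, police officer, and firefighter in one. This character keeps to their post and works devotedly for the community. There are also people with no ability at all who stay at the dictator's side through flattery alone, and people who simply get on with their lives in silence. Each finds a reason to live and does their best with the life they are given. In order to live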

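Is a human being noble simply by existing?

Three kinds of beings appear in this film: those who think of others as mere limbs, underlings, or parts and care only about their own safety; those who quietly keep to their place and live for the community; and those who live for their home and try to protect it. As the film goes on, the colonists find life forms called Creepers, horrifying in appearance, on the planet they have reached. Because of their repulsive looks they are misunderstood, attacked by humans, and seen as nothing more than pests. Are they something to be exterminated? I found the director's answer in one character's reply. They are only trying to protect their home. By what right do humans try to take it from them? Is it not humans who are truly evil? That is something to think about. Yet even among humans, different kinds of beings live side by side: alike in appearance, but some do not respect others, and some respect others as they respect themselves. Because the protagonist can move to a new body even if he dies, some wear him down with all manner of harsh experiments and difficult missions. But others suffer along with him when they see him in pain and condemn this, and they live there too.

Is nobility granted to humans at birth?

Or

Do they find that nobility by respecting other beings as they respect themselves? And is that nobility absolute?

It was a chance to think about these questions.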

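Can the dignity of a few be disregarded for the benefit of the community?

Set against all of the questions above: as the protagonist died like a part, like a tool, the community learned more about cosmic radiation and was able to conquer the alien planet's virus. What if he had not? They might well have died out without ever understanding the planet or the virus. Through the protagonist's sacrifice, the community was able to survive efficiently. Late in the film the human printer is blown up, and the protagonist can no longer be printed. The film presents this like a happy ending, with coexistence with the alien life forms established and terraforming seemingly succeeding. But if a new virus appears, humanity without an Expendable will either develop a vaccine at the cost of many lives or go extinct altogether, and it will not be able to respond to unknown threats as efficiently as before. As a child, the protagonist lost his mother in a car accident. He believes he was the cause, and so he accepts every disadvantage, death included, that disregards his own dignity. The scientists and the madman leader, who never respected him and treated him like a part or a tool, regarded him as someone who exists in order to die and trampled his dignity, yet because of this more people in the community were able to live efficiently. And although it was the protagonist's own mistake, he became an Expendable by volunteering, and went through countless deaths.

Can we blame them? They only did their job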

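If the result is good, does the process not matter?

There is a motif that shows up strangely often throughout the film. The madman politician's wife is obsessed with sauce, and every time she appears she asks what people think of her sauce. The food people eat also gets an unusual amount of screen time. There are so many kinds of sauce in the world, and so many kinds of food. When we eat, we consume ready-made food with ready-made sauce without a second thought, but countless animals die to become ingredients, are cooked by all sorts of methods, and many living things are used up in developing sauces. Consumers, however, know little about this. The politician's wife is so obsessed with sauce that she grinds up an alien creature's tail and makes it into sauce. What she calls sauce at that point is something you cannot tell apart from blood. In her eyes the alien creatures were nothing but sauce ingredients. Many people, having seen the process, are shown to be disgusted. But other sauces, too, must have involved similar manufacturing and development processes.

For the sake of the result, human happiness, does it not matter what the process is?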

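Is it the director's criticism of the film industry?

This is a personal view, but I think the sauce metaphor may be the director's criticism of today's film industry. In the protagonist's nightmare late in the film, the politician's wife tells him to taste the red sauce lying on the floor, but the protagonist, suspicious, asks her what it is made of. She answers that he has the order wrong. Tasting comes first; how it was made does not really matter. The film industry the director works in may be much the same. Most audiences have no interest in how a film was made. They just watch without thinking and judge it fun or not. But the process of making that film may be another story. There will have been much sacrifice and effort from the people involved, and some films may at times have been tied up with corruption or crime. Audiences are also not very curious about the message the director wants to convey through the film. They just wait for someone to spoon-feed them the finished dish. This film may, after all, be the director's criticism of today's film industry.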

[Workout] 20250107

Tue, 07 Jan 2025 09:09:06 GMT

[ML&DL] 11. Deep Learning

Sun, 15 Dec 2024 09:56:55 GMT

Deep Learning

Single Layer Neural Network

$$ \begin{aligned} f(X)&=\beta_0+\sum^K_{k=1}\beta_kh_k(X)\\[0.2cm] &=\beta_0+\sum^K_{k=1}\beta_k g\Big(w_{k0}+\sum^p_{j=1}w_{kj}X_j\Big) \end{aligned} $$

$A_k=h_k(X)=g(w_{k0}+\sum^p_{j=1} w_{kj}X_j)$ are called the activations in the hidden layer
$g(z)$ is called the activation function. Popular choices are the sigmoid and the rectified linear unit (ReLU)
Activation functions in hidden layers are typically nonlinear, otherwise the model collapses to a linear model
So the activations are like derived features - nonlinear transformations of linear combinations of the features
The model is fit by minimizing $\sum^n_{i=1}(y_i-f(x_i))^2$ (e.g. for regression)
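
As a small illustration of the formula for $f(X)$, the sketch below computes the hidden activations $A_k$ and the output for one input vector in plain Python. The weights and input are toy values made up for the example, and ReLU is used for $g$ only as one of the two popular choices mentioned above.

```python
import math


def sigmoid(z):
    """Sigmoid activation: 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))


def relu(z):
    """Rectified linear activation: max(0, z)."""
    return max(0.0, z)


def forward(x, W, w0, beta, beta0, g=relu):
    """f(X) = beta0 + sum_k beta_k * g(w_k0 + sum_j w_kj * x_j)."""
    activations = [
        g(w0[k] + sum(w_kj * x_j for w_kj, x_j in zip(W[k], x)))
        for k in range(len(W))
    ]
    return beta0 + sum(b * a for b, a in zip(beta, activations))


# Toy network: p = 2 inputs, K = 2 hidden units.
x = [1.0, 2.0]
W = [[1.0, -1.0], [0.5, 0.5]]   # W[k][j] = w_kj
w0 = [0.0, 0.0]                  # hidden-unit intercepts w_k0
beta = [2.0, 3.0]
beta0 = 1.0
y_hat = forward(x, W, w0, beta, beta0)
```

Here the first hidden unit sees $1 - 2 = -1$ and is switched off by ReLU, while the second sees $1.5$, so the output is $1 + 3 \cdot 1.5 = 5.5$. With a linear $g$ the whole expression would reduce to a linear model in $x$.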

Example : MNIST Digits

Handwritten digits: $28\times 28$ grayscale images, with $60K$ train and $10K$ test images
Features are the $784$ pixel grayscale values $\in [0,255]$
Labels are the digit class $0,1,\dots,9$

Goal : build a classifier to predict the image class
We build a two-layer network with $256$ units at first layer, $128$ units at second layer, and $10$ units at output layer
Along with intercepts (called biases) there are $235,146$ parameters (referred to as weights)
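
The parameter count quoted above can be checked directly: each layer with $n_\text{in}$ inputs and $n_\text{out}$ units has $(n_\text{in}+1)\times n_\text{out}$ parameters, where the $+1$ is the bias. A short sketch (the function name is made up):

```python
def count_parameters(layer_sizes):
    """Total weights plus biases of a fully connected network."""
    return sum(
        (n_in + 1) * n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# 784 inputs -> 256 units -> 128 units -> 10 outputs
total = count_parameters([784, 256, 128, 10])
```

The three layers contribute $785\times256 = 200{,}960$, $257\times128 = 32{,}896$, and $129\times10 = 1{,}290$ parameters, which sum to $235{,}146$.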

Let $Z_m=\beta_{m0}+\sum^{K_2}_{l=1}\beta_{ml}A_l^{(2)},\;m=0,1,\dots,9$ be $10$ linear combinations of activations at second layer
Output activation function encodes the softmax function $$ f_m(X)=\Pr(Y=m|X)=\frac{e^{Z_m}}{\sum^9_{l=0}e^{Z_l}} $$
We fit the model by minimizing the negative multinomial log-likelihood (or cross-entropy) $$ -\sum^n_{i=1}\sum^9_{m=0}y_{im}\log(f_m(x_i)) $$
$y_{im}$ is $1$ if true class for observation $i$ is $m$, else $0$ - i.e. one-hot encoded
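
A minimal sketch of the softmax and the per-observation cross-entropy term follows. Subtracting the maximum before exponentiating is a standard numerical-stability trick that is not in the notes; it does not change the result. Because $y_{im}$ is one-hot, the inner sum over $m$ keeps only the true class.

```python
import math


def softmax(z):
    """f_m = exp(Z_m) / sum_l exp(Z_l), computed stably."""
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_class):
    """-sum_m y_m * log(f_m) with one-hot y reduces to -log(f_true)."""
    return -math.log(probs[true_class])


# Ten equal scores give a uniform distribution over the 10 digit classes.
uniform = softmax([0.0] * 10)
loss = cross_entropy(uniform, true_class=3)
```

For a uniform prediction the loss is $\log 10 \approx 2.30$, a useful reference point: a trained classifier should end up well below it.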

Early success for neural networks in the 1990s
With so many parameters, regularization is essential
Some details of regularization and fitting will come later
Very overworked problem - best reported error rates are < $0.5\%$
Human error rate is reported to be around $0.2\%$

Convolutional Neural Network - CNN

Major success story for classifying images
The examples come from the CIFAR100 database
$32\times 32$ color natural images, with $100$ classes
$50K$ training images, $10K$ test images
Each image is a three-dimensional array (a feature map) : $32\times32\times 3$ array of $8$-bit numbers
The last dimension represents the three color channels for red, green and blue

The CNN builds up an image in a hierarchical fashion
Edges and shapes are recognized and pieced together to form more complex shapes, eventually assembling the target image
This hierarchical construction is achieved using convolution and pooling layers

Convolution Filter

The filter is itself an image, and represents a small shape, edge etc.
We slide it around the input image, scoring for matches
The scoring is done via dot-products
If the subimage of the input image is similar to the filter, the score is high, otherwise low
The filters are learned during training

The idea of convolution with a filter is to find common patterns that occur in different parts of the image
The two filters shown here highlight vertical and horizontal stripes
The result of the convolution is a new feature map
Since images have three color channels, the filter does as well : one filter per channel, and dot-products are summed
The weights in the filters are learned by the network
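
The "slide the filter and score by dot-product" step can be sketched for a single channel as follows. The image and filter are toy values made up for the example. As is usual in CNNs, the filter is not flipped, so strictly speaking this is cross-correlation.

```python
def convolve2d(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and take dot-products."""
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = len(image) - k_h + 1
    out_w = len(image[0]) - k_w + 1
    return [
        [
            sum(
                image[i + a][j + b] * kernel[a][b]
                for a in range(k_h)
                for b in range(k_w)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# A 3x3 image with a vertical stripe in the middle column.
image = [
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]
# A 2x2 filter that responds to a vertical stripe in its right column.
vertical = [
    [0, 1],
    [0, 1],
]
feature_map = convolve2d(image, vertical)
```

The resulting feature map is `[[2, 0], [2, 0]]`: the score is high where the sub-image matches the filter and low elsewhere. For a color image the same operation would run per channel and the three results would be summed.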

Pooling

Each non-overlapping $2\times 2$ block is replaced by its maximum
This sharpens the feature identification
Allows for locational invariance
Reduces the dimension by a factor of $4$ - i.e. factor of $2$ in each dimension
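
Max pooling over non-overlapping $2\times 2$ blocks can be sketched like this (toy input, made-up function name; any odd trailing row or column is simply dropped in this sketch):

```python
def max_pool_2x2(feature_map):
    """Replace each non-overlapping 2x2 block by its maximum."""
    return [
        [
            max(
                feature_map[i][j], feature_map[i][j + 1],
                feature_map[i + 1][j], feature_map[i + 1][j + 1],
            )
            for j in range(0, len(feature_map[0]) - 1, 2)
        ]
        for i in range(0, len(feature_map) - 1, 2)
    ]


pooled = max_pool_2x2([
    [1, 2, 5, 6],
    [3, 4, 7, 8],
    [9, 1, 2, 3],
    [0, 1, 4, 0],
])
```

The $4\times 4$ input becomes the $2\times 2$ output `[[4, 8], [9, 4]]`, a reduction by a factor of $4$ overall. Moving a large value to a different position within its block leaves the output unchanged, which is the locational invariance mentioned above.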

Architecture of a CNN

Many convolve + pool layers
Filters are typically small, e.g. each channel $3\times 3$
Each filter creates a new channel in convolution layer
As pooling reduces size, the number of filters/channels is typically increased
Number of layers can be very large. E.g. ResNet-50 trained on the ImageNet 1000-class image database has $50$ layers

Document Classification

Featurization : Bag-of-Words

Documents have different lengths, and consist of sequences of words
How do we create features $X$ to characterize a document?

From a dictionary, identify the $10K$ most frequently occurring words
Create a binary vector of length $p=10K$ for each document, and score a $1$ in every position that the corresponding word occurred
With $n$ documents, we now have a $n\times p$ sparse feature matrix $X$
We compare a lasso logistic regression model to a two-hidden-layer neural network on the next slide
Bag-of-words are unigrams
We can instead use bigrams ( occurrences of adjacent word pairs ), and in general m-grams
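
The bag-of-words featurization can be sketched in a few lines. The whitespace tokenizer, the lowercasing, and the toy vocabulary are simplifying assumptions; a real pipeline would use a proper tokenizer and the $10K$-word dictionary.

```python
from collections import Counter


def build_vocabulary(documents, size):
    """Return the `size` most frequent words across all documents."""
    counts = Counter(word for doc in documents for word in doc.lower().split())
    return [word for word, _ in counts.most_common(size)]


def bag_of_words(document, vocabulary):
    """Binary vector: 1 where the vocabulary word occurs in the document."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


vocabulary = ["good", "bad", "movie"]
features = bag_of_words("Good movie good", vocabulary)
```

Here `features` is `[1, 0, 1]`: word order and repeat counts are discarded, and only presence matters. Stacking one such vector per document gives the sparse $n\times p$ matrix $X$ described above.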

Lasso vs Neural Network

Simpler lasso logistic regression model works as well as neural network in this case

Recurrent Neural Networks

Often data arise as sequences :
- Documents are sequences of words, and their relative positions have meaning
- Time-series such as weather data or financial indices
- Recorded speech or music
- Handwriting, such as doctor's notes
RNNs build models that take into account this sequential nature of the data, and build a memory of the past
- The feature for each observation is a sequence of vectors $X=\{X_1,\dots,X_L\}$
- The target $Y$ is often of the usual kind - e.g. a single variable such as Sentiment, or a one-hot vector for multiclass
- However, $Y$ can also be a sequence, such as the same document in a different language

The hidden layer is a sequence of vectors $A_l$, receiving as input $X_l$ as well as $A_{l-1}$
$A_l$ produces an output $O_l$
The same weights $W$, $U$ and $B$ are used at each step in the sequence - hence the term recurrent
The $A_l$ sequence represents an evolving model for the response that is updated as each element $X_l$ is processed

Suppose $X_l=(X_{l1},X_{l2},\dots,X_{lp})$ has $p$ components, and $A_l=(A_{l1},A_{l2},\dots,A_{lK})$ has $K$ components
Then the computation at the $k$th components of hidden unit $A_l$ is $$ A_{lk}=g\left(w_{k0}+\sum^p_{j=1}w_{kj}X_{lj}+\sum^K_{s=1}u_{ks}A_{l-1,s}\right)\[0.3cm] O_l=\beta_0+\sum^K_{k=1}\beta_kA_{lk} $$
Often we are concerned only with the prediction $O_L$ at the last unit
For squared error loss, and $n$ sequence / response pairs, we would minimize $$ \sum_{i=1}^n (y_i - o_{iL})^2 = \sum_{i=1}^n \left( y_i - \left( \beta_0 + \sum_{k=1}^K \beta_k g \left( w_{k0} + \sum_{j=1}^p w_{kj} x_{iLj} + \sum_{s=1}^K u_{ks} a_{i, L-1, s} \right) \right) \right)^2 $$
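The recurrence can be made concrete with a small NumPy forward pass. This is a minimal sketch under assumptions the notes do not state: the activation $g$ is taken to be `tanh`, the initial state $A_0$ is a zero vector, and all weights are random placeholders rather than fitted values.

```python
import numpy as np

def rnn_forward(X, W, U, w0, beta, beta0, g=np.tanh):
    """Run one sequence through a simple RNN and return the outputs O_1..O_L.

    X: (L, p) sequence, W: (K, p), U: (K, K), w0: (K,), beta: (K,), beta0: scalar.
    """
    A = np.zeros(W.shape[0])                 # A_0, assumed to be zero
    outputs = []
    for x_l in X:
        A = g(w0 + W @ x_l + U @ A)          # A_l depends on X_l and A_{l-1}
        outputs.append(beta0 + beta @ A)     # O_l
    return np.array(outputs)

rng = np.random.default_rng(0)
L, p, K = 5, 3, 4
X = rng.normal(size=(L, p))
W, U = rng.normal(size=(K, p)), rng.normal(size=(K, K))
w0, beta, beta0 = rng.normal(size=K), rng.normal(size=K), 0.5

O = rnn_forward(X, W, U, w0, beta, beta0)
y = 1.0
loss = (y - O[-1]) ** 2    # squared error uses only the last output O_L
```

The same `W`, `U`, `w0`, `beta` and `beta0` are reused at every step of the loop, which is the weight sharing that gives the network its name.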

RNN and IMDB Reviews

The document feature is a sequence of words $\{W_l\}^L_1$
We typically truncate/pad the documents to the same number $L$ of words (we use $L=500$)
Each word $W_l$ is represented as a one-hot encoded binary vector $X_l$ of length $10K$, with all zeros and a single one in the position for that word in the dictionary
This results in an extremely sparse feature representation, and would not work well
Instead we use a lower-dimensional pretrained word embedding matrix $E$ of size $m\times 10K$
This reduces the binary feature vector of length $10K$ to a real feature vector of dimension $m<10K$ (e.g. $m$ in the low hundreds)
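Multiplying the embedding matrix by a one-hot vector simply selects one column of $E$, as the sketch below shows. The sizes are toy values (10 words, $m=3$) and `E` is filled with random numbers as a stand-in for a pretrained embedding.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, m = 10, 3                   # toy sizes; the notes use 10K words and m in the low hundreds
E = rng.normal(size=(m, vocab_size))    # stand-in for a pretrained embedding matrix

word_index = 7
one_hot = np.zeros(vocab_size)
one_hot[word_index] = 1.0

dense = E @ one_hot          # real feature vector of length m
lookup = E[:, word_index]    # same vector, without forming the one-hot encoding
```

In practice the lookup form is used, since it avoids storing the sparse one-hot vectors at all.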

Word Embedding

Embeddings are pretrained on very large corpora of documents, using methods similar to principal components
word2vec and GloVe are popular

After a lot of work, the results are a disappointing $76\%$ accuracy
We then fit a more exotic RNN than the one displayed - an LSTM with long and short term memory
Here $A_l$ receives input from $A_{l-1}$ (short term memory) as well as from a version that reaches further back in time (long term memory)
Now we get $87\%$ accuracy, slightly less than the $88\%$ achieved by glmnet
These data have been used as a benchmark for new RNN architectures
The best reported result found at the time of writing (2020) was around $95\%$
We point to a leaderboard

Time Series Forecasting

New-York Stock Exchange Data

The previous slide shows three daily time series covering a period of $6,051$ trading days
- Log trading volume : This is the fraction of all outstanding shares that are traded on that day, relative to a $100$-day moving average of past turnover, on the log scale
- Dow Jones return : This is the difference between the log of the Dow Jones Industrial Index on consecutive trading days
- Log volatility : This is based on the absolute values of daily price movements
Goal : predict Log trading volume tomorrow, given its observed values up to today, as well as those of Dow Jones return and Log volatility

Autocorrelation

The autocorrelation at lag $l$ is the correlation of all pairs $(v_t,v_{t-l})$ that are $l$ trading days apart
These sizable correlations give us confidence that past values will be helpful in predicting the future
This is a curious prediction problem : the response $v_t$ is also a feature $v_{t-l}$
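The lag-$l$ autocorrelation as defined above is just a Pearson correlation between the series and a shifted copy of itself. The sketch below uses a simulated series as a stand-in for the NYSE data, which is not reproduced here; the function name `autocorrelation` is hypothetical.

```python
import numpy as np

def autocorrelation(v, lag):
    """Correlation of all pairs (v_t, v_{t-lag}) that are `lag` steps apart."""
    v = np.asarray(v, dtype=float)
    return np.corrcoef(v[lag:], v[:-lag])[0, 1]

# Simulated series with persistence (an assumption, not the NYSE data).
rng = np.random.default_rng(0)
v = np.zeros(500)
for t in range(1, len(v)):
    v[t] = 0.7 * v[t - 1] + rng.normal()

acf = [autocorrelation(v, lag) for lag in range(1, 6)]
```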

RNN Forecaster

We only have one series of data
How do we set up for an RNN?
We extract many short mini-series of input sequences $X=\{X_1,\dots,X_L\}$ with a predefined length $L$ known as the lag $$ X_1 = \begin{pmatrix} v_{t-L} \\ r_{t-L} \\ z_{t-L} \end{pmatrix},\; X_2 = \begin{pmatrix} v_{t-L+1} \\ r_{t-L+1} \\ z_{t-L+1} \end{pmatrix},\; \cdots,\; X_L = \begin{pmatrix} v_{t-1} \\ r_{t-1} \\ z_{t-1} \end{pmatrix},\; \text{and } Y = v_t $$ where $v$, $r$ and $z$ denote Log trading volume, Dow Jones return and Log volatility
Since $T=6,051$, with $L=5$ we can create $6,046$ such $(X,Y)$ pairs
We use the first $4,281$ as training data, and the following $1,770$ as test data
We fit an RNN with $12$ hidden units per lag step (i.e. per $A_l$)
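The windowing step can be sketched as follows. The array `toy` is a placeholder with the same shape as the real data (one row per trading day, columns $v$, $r$, $z$), and `make_lagged_pairs` is a hypothetical helper name.

```python
import numpy as np

def make_lagged_pairs(series, L):
    """series: (T, 3) array with columns (v, r, z).

    Returns X with shape (T-L, L, 3) and Y with shape (T-L,), where
    X[i] holds the L days before the target day and Y[i] is v on the target day.
    """
    series = np.asarray(series, dtype=float)
    T = series.shape[0]
    X = np.stack([series[t - L:t] for t in range(L, T)])
    Y = series[L:, 0]
    return X, Y

T, L = 6051, 5
toy = np.arange(T * 3, dtype=float).reshape(T, 3)   # placeholder values, not the NYSE data
X, Y = make_lagged_pairs(toy, L)
```

With $T=6,051$ and $L=5$ this produces $6,046$ pairs, matching the count in the notes.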

Figure shows predictions and truth for test period
$R^2=0.42$ for RNN
$R^2=0.18$ for straw man - use yesterday's value of Log trading volume to predict that of today

Autoregression Forecaster

The RNN forecaster is similar in structure to a traditional autoregression procedure $$ \mathbf{y} = \begin{bmatrix} v_{L+1} \\ v_{L+2} \\ v_{L+3} \\ \vdots \\ v_{T} \end{bmatrix} \quad \mathbf{M} = \begin{bmatrix} 1 & v_L & v_{L-1} & \cdots & v_1 \\ 1 & v_{L+1} & v_L & \cdots & v_2 \\ 1 & v_{L+2} & v_{L+1} & \cdots & v_3 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & v_{T-1} & v_{T-2} & \cdots & v_{T-L} \end{bmatrix} $$
Fit an OLS regression of $\mathbf{y}$ on $\mathbf{M}$, giving $$ \hat{v}_t = \hat{\beta}_0 + \hat{\beta}_1 v_{t-1} + \hat{\beta}_2 v_{t-2} + \cdots + \hat{\beta}_L v_{t-L}. $$
Known as an order-L autoregression model or $AR(L)$
For the NYSE data we can include lagged version of DJ_return and log_volatility in matrix $M$, resulting in $3L+1$ columns
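A univariate $AR(L)$ fit amounts to building $\mathbf{M}$ from shifted copies of the series and calling a least squares solver. The sketch below covers only the single-series case (so $L+1$ columns rather than $3L+1$), uses a simulated series, and the name `fit_ar` is hypothetical.

```python
import numpy as np

def fit_ar(v, L):
    """Fit an order-L autoregression by ordinary least squares.

    Returns (coef, M, y) where coef = (beta_0, beta_1, ..., beta_L).
    """
    v = np.asarray(v, dtype=float)
    T = len(v)
    # The row for target v_t holds (1, v_{t-1}, ..., v_{t-L}).
    lagged = [v[L - k:T - k] for k in range(1, L + 1)]
    M = np.column_stack([np.ones(T - L)] + lagged)
    y = v[L:]
    coef, *_ = np.linalg.lstsq(M, y, rcond=None)
    return coef, M, y

# Simulated AR(2) series with noise (an assumption, not the NYSE data).
rng = np.random.default_rng(0)
v = [1.0, 2.0]
for _ in range(500):
    v.append(0.5 + 0.6 * v[-1] - 0.3 * v[-2] + 0.1 * rng.normal())
coef, M, y = fit_ar(v, L=2)
```

Adding lagged copies of the other two series as extra columns of `M` gives the $3L+1$ column version described above.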

$R^2=0.41$ for $AR(5)$ model ($16$ parameters)
$R^2=0.42$ for RNN model ($205$ parameters)
$R^2=0.42$ for $AR(5)$ model fit by neural network
$R^2=0.46$ for all models if we include day_of_week of the day being predicted

Non Convex Functions and Gradient Descent

1. Start with a guess $\theta^0$ for all the parameters in $\theta$, and set $t=0$
2. Iterate until the objective $R(\theta)$ fails to decrease :
  - 2.1. Find a vector $\delta$ that reflects a small change in $\theta$, such that $\theta^{t+1}=\theta^t+\delta$ reduces the objective
  - 2.2. Set $t \leftarrow t+1$

In this simple example we reached the global minimum
If we had started a little to the left of $\theta^0$ we would have gone in the other direction, and ended up in a local minimum
Although $\theta$ is multi-dimensional, we have depicted the process as one-dimensional
It is much harder to identify whether one is in a local minimum in high dimensions
How do we find a direction $\delta$ that points downhill?
We compute the gradient vector $$ \nabla R(\theta^t) = \left. \frac{\partial R(\theta)}{\partial \theta} \right|_{\theta = \theta^t} $$
i.e. the vector of partial derivatives at the current guess $\theta^t$
The gradient points uphill, so our update is $\delta =-\rho\nabla R(\theta^t)$ or $$ \theta^{t+1} \leftarrow \theta^t - \rho \nabla R(\theta^t) $$ where $\rho$ is the learning rate (typically small, e.g. $\rho=0.001$)
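The dependence on the starting point is easy to reproduce on a one-dimensional toy objective. The function $R(\theta)=\theta^4-3\theta^2+\theta$ below is an illustrative choice, not the one plotted in the lecture, and the learning rate $0.01$ is picked for this toy problem only.

```python
def R(theta):
    """Toy non-convex objective with a global and a local minimum."""
    return theta**4 - 3 * theta**2 + theta

def grad_R(theta):
    return 4 * theta**3 - 6 * theta + 1

def gradient_descent(theta0, rho=0.01, max_iter=10_000):
    theta = theta0
    for _ in range(max_iter):
        candidate = theta - rho * grad_R(theta)   # step against the gradient
        if R(candidate) >= R(theta):              # objective failed to decrease: stop
            break
        theta = candidate
    return theta

left = gradient_descent(-0.5)    # ends near the global minimum
right = gradient_descent(0.5)    # ends in the local minimum
```

Both runs use the same update rule; only the initial guess differs, yet `R(left)` is lower than `R(right)`.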

Gradients and Backpropagation

The objective $$ R(\theta)=\sum^n_{i=1}R_i(\theta) $$ is a sum, so its gradient is the sum of the gradients of the individual terms $$ R_i(\theta) = \frac{1}{2} \left( y_i - f_{\theta}(x_i) \right)^2 = \frac{1}{2} \left( y_i - \beta_0 - \sum_{k=1}^K \beta_k g \left( w_{k0} + \sum_{j=1}^p w_{kj} x_{ij} \right) \right)^2 $$

For ease of notation, let $z_{ik}=w_{k0}+\sum^p_{j=1}w_{kj}x_{ij}$
Backpropagation uses the chain rule for differentiation $$ \begin{aligned} \frac{\partial R_i(\theta)}{\partial \beta_k} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial \beta_k} = - (y_i - f_\theta(x_i)) \cdot g(z_{ik}) \\[0.3cm] \frac{\partial R_i(\theta)}{\partial w_{kj}} &= \frac{\partial R_i(\theta)}{\partial f_\theta(x_i)} \cdot \frac{\partial f_\theta(x_i)}{\partial g(z_{ik})} \cdot \frac{\partial g(z_{ik})}{\partial z_{ik}} \cdot \frac{\partial z_{ik}}{\partial w_{kj}} = - (y_i - f_\theta(x_i)) \cdot \beta_k \cdot g'(z_{ik}) \cdot x_{ij} \end{aligned} $$
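The two chain-rule formulas can be checked numerically against a finite-difference approximation. The sketch assumes a sigmoid activation, for which $g'(z)=g(z)(1-g(z))$; the notes leave $g$ generic, and all weights here are random placeholders.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(x, y, w0, W, beta0, beta):
    """R_i(theta) = 0.5 * (y - f_theta(x))^2 for a single-hidden-layer network."""
    f = beta0 + beta @ sigmoid(w0 + W @ x)
    return 0.5 * (y - f) ** 2

def backprop(x, y, w0, W, beta0, beta):
    z = w0 + W @ x                         # z_ik
    a = sigmoid(z)                         # g(z_ik)
    resid = y - (beta0 + beta @ a)         # y_i - f_theta(x_i)
    grad_beta = -resid * a                                   # dR_i / d beta_k
    grad_W = -resid * np.outer(beta * a * (1.0 - a), x)      # dR_i / d w_kj
    return grad_beta, grad_W

rng = np.random.default_rng(0)
p, K = 3, 4
x, y = rng.normal(size=p), 0.7
w0, W = rng.normal(size=K), rng.normal(size=(K, p))
beta0, beta = 0.1, rng.normal(size=K)
grad_beta, grad_W = backprop(x, y, w0, W, beta0, beta)

# Central finite difference for the single weight w_{kj} with k=1, j=2.
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[1, 2] += eps
W_minus[1, 2] -= eps
numeric = (loss(x, y, w0, W_plus, beta0, beta) - loss(x, y, w0, W_minus, beta0, beta)) / (2 * eps)
```

If the analytic and numeric values disagree, the derivation or its implementation has a mistake, which makes this a cheap sanity check.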

Tricks of the Trade

- Slow learning : Gradient descent is slow, and a small learning rate $\rho$ slows it even further. With early stopping, this is a form of regularization
- Stochastic gradient descent : Rather than compute the gradient using all the data, use a small minibatch drawn at random at each step
- An epoch is a count of iterations and amounts to the number of minibatch updates such that $n$ samples in total have been processed; e.g. $60K/128 \approx 469$ for MNIST
- Regularization : Ridge and lasso regularization can be used to shrink the weights at each layer. Two other popular forms of regularization are dropout and augmentation

Dropout Learning

At each SGD update, randomly remove units with probability $\phi$, and scale up the weights of those retained by $1/(1-\phi)$ to compensate
In simple scenarios like linear regression, a version of this process can be shown to be equivalent to ridge regularization
As in ridge, the other units stand in for those temporarily removed, and their weights are drawn closer together
Similar to randomly omitting variables when growing trees in random forests
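A dropout step can be sketched as a random mask applied to a layer's activations. Note one assumption: the code rescales the retained activations by $1/(1-\phi)$, which has the same effect on the next layer as scaling up the outgoing weights of the retained units; the function name `dropout` is hypothetical.

```python
import numpy as np

def dropout(activations, phi, rng):
    """Remove each unit with probability phi and rescale the survivors."""
    keep = rng.random(activations.shape) >= phi    # True for retained units
    return activations * keep / (1.0 - phi)

rng = np.random.default_rng(0)
out = dropout(np.ones(100_000), phi=0.3, rng=rng)
```

The rescaling keeps the expected value of each activation unchanged, so the layer that follows sees inputs on the same scale during training and at prediction time.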

Ridge and Data Augmentation

Make many copies of each $(x_i,y_i)$ and add a small amount of Gaussian noise to the $x_i$ - a little cloud around each observation - but leave the copies of $y_i$ alone
This makes the fit robust to small perturbations in $x_i$, and is equivalent to ridge regularization in an OLS setting

Double Descent

With neural networks, it seems better to have too many hidden units than too few
Likewise, more hidden layers seem better than few
Running stochastic gradient descent till zero training error often gives good out-of-sample error
Increasing the number of units or layers and again training till zero error sometimes gives even better out-of-sample error
When $d\leq 20$, model is OLS, and we see usual bias-variance trade-off
When $d> 20$, we revert to the minimum-norm solution
As $d$ increases above $20$, $\sum^d_{j=1} \hat \beta_j^2$ decreases since it is easier to achieve zero error, and hence less wiggly solutions

To achieve a zero-residual solution with $d=20$ is a real stretch
Easier for larger $d$

In a wide linear model ($p>n$) fit by least squares, SGD with a small step size leads to a minimum norm zero-residual solution
Stochastic gradient flow - i.e. the entire path of SGD solutions - is somewhat similar to ridge path
By analogy, deep and wide neural networks fit by SGD down to zero training error often give good solutions that generalize well
In particular, cases with a high signal-to-noise ratio - e.g. image recognition - are less prone to overfitting; the zero-error solution is mostly signal
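The minimum-norm claim for wide linear models can be checked numerically. The sketch below swaps in plain full-batch gradient descent started from zero instead of SGD, which is a simplification; the data are random and the step size is set from the largest singular value of $X$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 50                          # wide model: p > n
X = rng.normal(size=(n, p))
y = rng.normal(size=n)

# Minimum-norm zero-residual solution via the pseudoinverse.
beta_min_norm = np.linalg.pinv(X) @ y

# Full-batch gradient descent on 0.5 * ||y - X beta||^2, started at zero.
rho = 1.0 / np.linalg.norm(X, 2) ** 2  # small enough for stable convergence
beta = np.zeros(p)
for _ in range(5000):
    beta -= rho * X.T @ (X @ beta - y)

# Any other interpolating solution differs by a null-space vector and has a larger norm.
z = rng.normal(size=p)
null_part = z - np.linalg.pinv(X) @ (X @ z)
other = beta_min_norm + null_part
```

Because every update is a combination of rows of $X$, the iterate never leaves the row space, which is why it lands on the minimum-norm interpolator rather than on some other zero-residual solution.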

All Contents written based on GIST - Machine Learning & Deep Learning Lesson(Instructor : Prof. sun-dong. Kim)

[ML&DL] 10. Unsupervised Learning

Sat, 14 Dec 2024 14:18:56 GMT

Unsupervised Learning

Most of this course focuses on supervised learning methods such as regression and classification
In that setting we observe both a set of features $X_1,X_2,\dots,X_p$ for each object, as well as a response or outcome variable $Y$
The goal is then to predict $Y$ using $X_1,X_2,\dots,X_p$
Here we instead focus on unsupervised learning, where we observe only the features $X_1,X_2,\dots,X_p$
We are not interested in prediction, because we do not have an associated response variable $Y$

The Goals of Unsupervised Learning

The goal is to discover interesting things about the measurements : is there an informative way to visualize the data?
Can we discover subgroups among the variables or among the observations?
We discuss two methods
- principal components analysis : A tool used for data visualization or data pre-processing before supervised techniques are applied
- clustering : A broad class of methods for discovering unknown subgroups in data

Principal Components Analysis

PCA produces a low-dimensional representation of a dataset
It finds a sequence of linear combinations of the variables that have maximal variance, and are mutually uncorrelated
Apart from producing derived variables for use in supervised learning problems, PCA also serves as a tool for data visualization
The first principal component of a set of features $X_1,X_2,\dots,X_p$ is the normalized linear combination of the features $$ Z_1=\phi_{11}X_1+\phi_{21}X_2+\dots+\phi_{p1}X_p $$ that has the largest variance. By normalized, we mean that $\sum^p_{j=1}\phi^2_{j1}=1$
We refer to the elements $\phi_{11},\cdots,\phi_{p1}$ as the loadings of the first principal component; together, the loadings make up the principal component loading vector $$ \phi_1=\left(\phi_{11},\phi_{21},\dots,\phi_{p1}\right)^T $$
We constrain the loadings so that their sum of squares is equal to one, since otherwise setting these elements to be arbitrarily large in absolute value could result in an arbitrarily large variance

Computation of Principal Components

Suppose we have a $n\times p$ data set $X$
Since we are only interested in variance, we assume that each of the variables in $X$ has been centered to have mean zero (that is, the column means of $X$ are zero)
We then look for the linear combination of the sample feature values of the form $$ z_{i1}=\phi_{11}x_{i1}+\phi_{21}x_{i2}+\cdots+\phi_{p1}x_{ip} $$ for $i=1,\dots,n$ that has largest sample variance, subject to the constraint that $\sum^p_{j=1}\phi^2_{j1}=1$
Since each of the $x_{ij}$ has mean zero, so does $z_{i1}$ (for any values of $\phi_{j1}$)
Hence the sample variance of the $z_{i1}$ can be written as $\frac{1}{n}\sum^n_{i=1}z^2_{i1}$
Plugging in the expression for $z_{i1}$, the first principal component loading vector solves the optimization problem $$ \underset{\phi_{11}, \dots, \phi_{p1}}{\text{maximize}}\; \frac{1}{n} \sum_{i=1}^n \left( \sum_{j=1}^p \phi_{j1} x_{ij} \right)^2 \quad \text{subject to } \sum_{j=1}^p \phi_{j1}^2 = 1. $$
This problem can be solved via a singular-value decomposition of the matrix $X$, a standard technique in linear algebra
We refer to $Z_1$ as the first principal component, with realized values $z_{11},\dots,z_{n1}$

Geometry of PCA

The loading vector $\phi_1$ with elements $\phi_{11},\phi_{21},\dots,\phi_{p1}$ defines a direction in feature space along which the data vary the most
If we project the $n$ data points $x_1,\dots,x_n$ onto this direction, the projected values are the principal component scores $z_{11},\dots,z_{n1}$ themselves

Further principal components

The second principal component is the linear combination of $X_1,\dots,X_p$ that has maximal variance among all linear combinations that are uncorrelated with $Z_1$
The second principal component scores $z_{12},z_{22},\dots,z_{n2}$ take the form $$ z_{i2}=\phi_{12}x_{i1}+\dots+\phi_{p2}x_{ip} $$ where $\phi_2$ is the second principal component loading vector, with elements $\phi_{12},\dots,\phi_{p2}$
It turns out that constraining $Z_2$ to be uncorrelated with $Z_1$ is equivalent to constraining the direction $\phi_2$ to be orthogonal (perpendicular) to the direction $\phi_1$. The same holds for each subsequent component
The principal component directions $\phi_1,\phi_2,\phi_3,\dots$ are the ordered sequence of right singular vectors of the matrix $X$, and the variances of the components are $\frac{1}{n}$ times the squares of the singular values
There are at most $\min(n-1,p)$ principal components
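The link between PCA and the singular-value decomposition can be sketched directly in NumPy. The data below are random placeholders, and the helper name `pca` is hypothetical; the variances use the $\frac{1}{n}$ convention of these notes.

```python
import numpy as np

def pca(X):
    """Return loadings (columns phi_1, phi_2, ...), scores z_im and component variances."""
    Xc = X - X.mean(axis=0)                       # center each column
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt.T                               # right singular vectors
    scores = Xc @ loadings                        # projections onto each direction
    variances = s**2 / X.shape[0]                 # (1/n) * squared singular values
    return loadings, scores, variances

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
loadings, scores, variances = pca(X)
```

The loading vectors come out orthonormal and the score columns uncorrelated, which are exactly the two constraints described above.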

PCA should be performed after standardization

PCA finds the hyperplane closest to the observations

The first principal component loading vector has a very special property : it defines the line in $p$-dimensional space that is closest to the $n$ observations (using average squared Euclidean distance as a measure of closeness)
The notion of principal components as the dimensions that are closest to the $n$ observations extends beyond just the first principal component
For instance, the first two principal components of a data set span the plane that is closest to the $n$ observations, in terms of average squared Euclidean distance

Scaling of the variables matters

If the variables are in different units, scaling each to have standard deviation equal to one is recommended
If they are in the same units, you might or might not scale the variables

Proportion Variance Explained

To understand the strength of each component, we are interested in knowing the proportion of variance explained (PVE) by each one
The total variance present in a data set (assuming that the variables have been centered to have mean zero) is defined as $$ \sum^p_{j=1}\text{Var}(X_j)=\sum^p_{j=1}\frac{1}{n}\sum^n_{i=1}x_{ij}^2 $$ where $i$ indexes the rows (observations) and $j$ the variables, and the variance explained by the $m$th principal component is $$ \text{Var}(Z_m)=\frac{1}{n}\sum^n_{i=1}z^2_{im} $$
It can be shown that $\sum^p_{j=1}\text{Var}(X_j)=\sum^M_{m=1}\text{Var}(Z_m),\;\text{with }M=\min(n-1,p)$
Therefore, the PVE of the $m$th principal component is given by the positive quantity between $0$ and $1$ $$ \frac{\sum^n_{i=1}z^2_{im}}{\sum^p_{j=1}\sum^n_{i=1}x^2_{ij}} $$
The PVEs sum to one
We sometimes display the cumulative PVEs
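The PVE of each component can be read off the singular values of the centered data matrix, since the squared singular values add up to the total sum of squares. The data below are random placeholders and `proportion_variance_explained` is a hypothetical helper name.

```python
import numpy as np

def proportion_variance_explained(X):
    """PVE of each principal component of X."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)
    # sum_i z_im^2 equals the m-th squared singular value;
    # the denominator is the total sum of squares of the centered data.
    return s**2 / np.sum(Xc**2)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) * np.array([4.0, 3.0, 2.0, 1.0, 0.5])
pve = proportion_variance_explained(X)
cumulative = np.cumsum(pve)
```

Plotting `pve` against the component index gives a scree plot, and `cumulative` gives the cumulative PVE curve.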

How many principal components should we use?

If we use principal components as a summary of our data, how many components are sufficient?
- No simple answer to this question, as cross-validation is not available for this purpose
  - Why not?
  - When could we use cross-validation to select the number of components?
- The scree plot on the previous slide can be used as a guide : we look for an elbow

Matrix Completion via Principal Components

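We pose instead a modified version of the approximation criterion $$ \underset{A\in\mathbb{R}^{n\times M},\,B\in\mathbb{R}^{p\times M}}{\text{minimize}}\; \sum_{(i,j)\in O}\left(x_{ij}-\sum^M_{m=1}a_{im}b_{jm}\right)^2 $$ where $O$ is the set of all observed pairs of indices $(i,j)$, a subset of the possible $n\times p$ pairs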
Once we solve this problem :
- we can estimate a missing observation $x_{ij}$ using $\hat x_{ij}=\sum^M_{m=1}\hat a_{im}\hat b_{jm}$, where $\hat a_{im}$ and $\hat b_{jm}$ are the $(i,m)$ and $(j,m)$ elements of the solution matrices $\hat A$ and $\hat B$
- we can (approximately) recover the $M$ principal component scores and loadings, as if data were complete

Iterative Algorithm for Matrix Completion

Initialize : create a complete data matrix $\tilde X$ by filling in the missing values using mean imputation
Repeat : steps (a)-(c) until the objective in (c) fails to decrease
- (a) Solve $\text{minimize}_{A,B}\sum^p_{j=1}\sum^n_{i=1}\left(\tilde x_{ij}-\sum^M_{m=1}a_{im}b_{jm}\right)^2$ by computing the principal components of $\tilde X$
- (b) For each missing entry $(i,j)\notin O$, set $\tilde x_{ij} \leftarrow \sum^M_{m=1}\hat a_{im}\hat b_{jm}$
- (c) Compute the objective $\sum_{(i,j)\in O}\left(x_{ij}-\sum^M_{m=1}\hat a_{im}\hat b_{jm}\right)^2$
Return the estimated missing entries $\tilde x_{ij},\;(i,j)\notin O$
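The iteration can be sketched with a truncated SVD standing in for "computing the principal components". Assumptions in this sketch: missing entries are marked with `NaN`, the columns are not re-centered inside the loop, the stopping tolerance `tol` is an arbitrary choice, and `complete_matrix` is a hypothetical name.

```python
import numpy as np

def complete_matrix(X, M, max_iter=100, tol=1e-7):
    """Fill in NaN entries of X using a rank-M approximation, iteratively."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    X_tilde = np.where(observed, X, col_means)            # mean imputation
    previous = np.inf
    for _ in range(max_iter):
        # (a) rank-M approximation of the current completed matrix
        U, s, Vt = np.linalg.svd(X_tilde, full_matrices=False)
        approx = (U[:, :M] * s[:M]) @ Vt[:M]
        # (b) overwrite only the missing entries
        X_tilde = np.where(observed, X, approx)
        # (c) objective on the observed entries
        objective = np.sum((X[observed] - approx[observed]) ** 2)
        if previous - objective < tol:
            break
        previous = objective
    return X_tilde

truth = np.outer([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0])   # rank-1 matrix
X = truth.copy()
X[0, 0] = np.nan                                           # hide one entry (true value 1)
filled = complete_matrix(X, M=1)
```

The observed entries are never modified; only the imputed values change from one pass to the next.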

Clustering

K-means clustering

Note that there is no ordering of the clusters, so that cluster coloring is arbitrary
Let $C_1,\dots,C_K$ denote sets containing the indices of the observations in each cluster
These sets satisfy two properties
1. $C_1\cup C_2 \cup \dots\cup C_K=\{1,\dots,n\}$. In other words, each observation belongs to at least one of the $K$ clusters
2. $C_k \cap C_{k'}=\emptyset$ for all $k\neq k'$. In other words, the clusters are non-overlapping : no observation belongs to more than one cluster
- For instance, if the $i$th observation is in the $k$th cluster, then $i\in C_k$

The idea behind $K$-means clustering is that a good clustering is one for which the within-cluster variation is as small as possible
The within-cluster variation for cluster $C_k$ is a measure $\text{WCV}(C_k)$ of the amount by which the observations within a cluster differ from each other
Hence we want to solve the problem $$ \underset{C_1,\dots,C_K}{\text{minimize}}\left\{\sum^K_{k=1}\text{WCV}(C_k)\right\} $$
In words, this formula says that we want to partition the observation into $K$ clusters such that the total within-cluster variation, summed over all $K$ clusters, is as small as possible

How to define within-cluster variation

Typically we use squared Euclidean distance $$ \text{WCV}(C_k)=\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum^p_{j=1}(x_{ij}-x_{i'j})^2 $$ where $|C_k|$ denotes the number of observations in the $k$th cluster
Substituting this definition into the minimization problem above gives the optimization problem that defines $K$-means clustering $$ \underset{C_1,\dots,C_K}{\text{minimize}}\left\{\sum^K_{k=1}\frac{1}{|C_k|}\sum_{i,i'\in C_k}\sum^p_{j=1}(x_{ij}-x_{i'j})^2\right\} $$

K-Means Clustering Algorithm

1. Randomly assign a number, from $1$ to $K$, to each of the observations. These serve as initial cluster assignments for the observations
2. Iterate until the cluster assignments stop changing
  - 2.1. For each of the $K$ clusters, compute the cluster centroid. The $k$th cluster centroid is the vector of the $p$ feature means for the observations in the $k$th cluster
  - 2.2. Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance)
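The two steps translate almost line by line into NumPy. One detail is an assumption added here, since the notes do not cover it: if a cluster ends up empty, its centroid is re-seeded at a randomly chosen observation. The function name `k_means` and the toy data are illustrative only.

```python
import numpy as np

def k_means(X, K, rng, max_iter=100):
    n = X.shape[0]
    labels = rng.integers(0, K, size=n)            # step 1: random initial assignment
    for _ in range(max_iter):
        # step 2.1: centroid = vector of feature means within each cluster
        centroids = np.empty((K, X.shape[1]))
        for k in range(K):
            members = X[labels == k]
            # Assumption: an empty cluster is re-seeded at a random observation.
            centroids[k] = members.mean(axis=0) if len(members) else X[rng.integers(n)]
        # step 2.2: assign each observation to the closest centroid
        sq_dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = sq_dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):      # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)), rng.normal(10.0, 0.5, size=(20, 2))])
labels, centroids = k_means(X, K=2, rng=rng)
```

Since the result depends on the random initial assignment, it is common to run the procedure several times and keep the run with the smallest objective.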

Properties of the Algorithm

This algorithm is guaranteed to decrease the value of the $K$-means objective at each step. This follows from the identity $$ \frac{1}{|C_k|} \sum_{i,i' \in C_k} \sum_{j=1}^p (x_{ij} - x_{i'j})^2 = 2 \sum_{i \in C_k} \sum_{j=1}^p (x_{ij} - \bar{x}_{kj})^2 $$ where $\bar{x}_{kj} = \frac{1}{|C_k|} \sum_{i \in C_k} x_{ij}$ is the mean for feature $j$ in cluster $C_k$
However, it is not guaranteed to give the global minimum

Hierarchical Clustering

K-means requires us to pre-specify the number of clusters $K$
We describe bottom-up or agglomerative clustering
The approach in words :
- Start with each point in its own cluster
- Identify the closest two clusters and merge them
- Repeat
- Ends when all points are in a single cluster
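The bottom-up procedure can be written as a short, deliberately naive loop. The notes leave the linkage section empty, so this sketch assumes complete linkage (the distance between two clusters is the largest pairwise distance) and Euclidean dissimilarity; `agglomerative` is a hypothetical name and the implementation favors clarity over speed.

```python
import numpy as np

def agglomerative(X, n_clusters=1):
    """Merge the two closest clusters repeatedly; returns (clusters, merges)."""
    X = np.asarray(X, dtype=float)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    clusters = [[i] for i in range(len(X))]                     # each point on its own
    merges = []
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage (an assumption): largest pairwise distance.
                d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))            # record the merge height
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters, merges

X = np.array([[0.0], [1.0], [10.0], [11.0]])
clusters, merges = agglomerative(X, n_clusters=2)
```

Running the loop down to a single cluster and plotting the recorded merge heights gives the dendrogram; stopping early, as in the usage above, corresponds to cutting it.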

Linkage

Choice of Dissimilarity Measure

So far we have used Euclidean distance
An alternative is correlation-based distance which considers two observations to be similar if their features are highly correlated
This is an unusual use of correlation, which is normally computed between variables; here it is computed between the observation profiles for each pair of observations
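Correlation-based distance can be sketched as one minus the correlation between rows. Defining the distance as $1-\text{corr}$ is one common convention and is an assumption here, as is the helper name `correlation_distance`.

```python
import numpy as np

def correlation_distance(X):
    """n x n matrix of 1 - correlation between the observation profiles (rows of X)."""
    # np.corrcoef treats each row as one variable, which is what we want here.
    return 1.0 - np.corrcoef(X)

X = np.array([
    [1.0, 2.0, 3.0],
    [2.0, 4.0, 6.0],   # same shape as row 0, different magnitude
    [3.0, 2.0, 1.0],   # opposite shape to row 0
])
D_corr = correlation_distance(X)
D_euclid = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
```

Rows 0 and 1 are far apart in Euclidean distance but have correlation distance zero, which shows how the two measures can disagree about which observations are similar.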

Practical issues

Scaling of the variables matters
Should the observations of features first be standardized in some way?
For instance, maybe the variables should be centered to have mean zero and scaled to have standard deviation one
In the case of hierarchical clustering,
- What dissimilarity measure should be used?
- What type of linkage should be used?
How many clusters to choose? (in both K-means and hierarchical clustering)

Difficult problem
No agreed-upon method
Which features should we use to drive the clustering?

All Contents written based on GIST - Machine Learning & Deep Learning Lesson(Instructor : Prof. sun-dong. Kim)

[ML&DL] 9. Survival Analysis

Sat, 14 Dec 2024 13:14:52 GMT

Survival Analysis

Survival analysis concerns a special kind of outcome variable : the time until an event occurs
For example, suppose that we have conducted a five-year medical study, in which patients have been treated for cancer
We would like to fit a model to predict patient survival time, using features such as baseline health measurements or type of treatment
Sounds like a regression problem. But there is an important complication : some of the patients have survived until the end of the study. Such a patient's survival time is said to be censored
We do not want to discard this subset of surviving patients, since the fact that they survived at least five years amounts to valuable information

Non-medical Examples

The applications of survival analysis extend far beyond medicine. For example, consider a company that wishes to model churn, the event when customers cancel subscription to a service
The company might collect data on customers over some time period, in order to predict each customer's time to cancellation
However, presumably not all customers will have cancelled their subscription by the end of this time period; for such customers, the time to cancellation is censored
Survival analysis is a very well-studied topic within statistics. However, it has received relatively little attention in the machine learning community

Survival and Censoring Times

For each individual, we suppose that there is a true failure or event time $T$, as well as a true censoring time $C$
The survival time represents the time at which the event of interest occurs (such as death)
By contrast, the censoring time is the time at which censoring occurs: for example, the time at which the patient drops out of the study or the study ends

We observe either the survival time $T$ or else the censoring time $C$. Specifically, we observe the random variable $$ Y=\min(T,C) $$
If the event occurs before censoring (i.e. $T<C$) then we observe the true survival time $T$; if censoring occurs before the event ($T>C$) then we observe the censoring time. We also observe a status indicator $$ \delta=\begin{cases}1&\text{if }T\leq C\\0&\text{if }T>C\end{cases} $$
Finally, in our dataset we observe $n$ pairs $(Y,\delta)$, which we denote as $(y_1,\delta_1),\dots,(y_n,\delta_n)$
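The observed pair $(Y,\delta)$ is a simple function of the event and censoring times. The values of `T` and `C` below are made-up numbers for illustration; in real data only `Y` and `delta` are available.

```python
import numpy as np

# Hypothetical true event times and censoring times for four individuals.
T = np.array([3.0, 8.0, 5.0, 10.0])
C = np.array([6.0, 4.0, 5.0, 7.0])

Y = np.minimum(T, C)             # observed time: whichever comes first
delta = (T <= C).astype(int)     # 1 if the event was observed, 0 if censored
```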

Here is an illustration of censored survival data
For patients $1$ and $3$, the event was observed
Patient $2$ was alive when the study ended
Patient $4$ dropped out of the study

A Closer Look at Censoring

Suppose that a number of patients drop out of a cancer study early because they are very sick
An analysis that does not take into consideration the reason why the patients dropped out will likely overestimate the true average survival time
Similarly, suppose that males who are very sick are more likely to drop out of the study than females who are very sick
Then a comparison of male and female survival times may wrongly suggest that males survive longer than females
In general, we need to assume that, conditional on the features, the event time $T$ is independent of the censoring time $C$
The two examples above violate the assumption of independent censoring

The Survival Curve

The survival function ( or curve) is defined as $$ S(t)=\Pr(T>t) $$
This decreasing function quantifies the probability of surviving past time $t$
For example, suppose that a company is interested in modeling customer churn
Let $T$ represent the time that a customer cancels a subscription to the company's service
Then $S(t)$ represents the probability that a customer cancels later than time $t$
The larger the value of $S(t)$, the less likely that the customer will cancel before time $t$

Estimating the Survival Curve

Consider the BrainCancer dataset, which contains the survival times for patients with primary brain tumors undergoing treatment with stereotactic radiation methods
Only $53$ of the $88$ patients were still alive at the end of the study
Suppose we'd like to estimate $S(20)=\Pr(T>20)$, the probability that a patient survives for at least $20$ months
It is tempting to simply compute the proportion of patients who are known to have survived past $20$ months, that is, the proportion of patients for whom $Y>20$
This turns out to be $48/88$ or approximately $55\%$
However, this does not seem quite right : $17$ of the $40$ patients who did not survive to $20$ months were actually censored, and this analysis implicitly assumes they died before $20$ months
Hence it is probably an underestimate

Kaplan-Meier Survival Curve

Each point in the solid step-like curve shows the estimated probability of surviving past the time indicated on the horizontal axis
The estimated probability of survival past $20$ months is $71\%$, which is quite a bit higher than the naive estimate of $55\%$ presented earlier
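The Kaplan-Meier estimate is built sequentially over the unique death times: with $r_k$ individuals at risk and $q_k$ deaths at the $k$th death time, the estimated survival probability is the running product of $(r_k-q_k)/r_k$. The sketch below applies this to a small made-up dataset (not BrainCancer); `kaplan_meier` is a hypothetical helper name.

```python
import numpy as np

def kaplan_meier(y, delta):
    """Return the unique death times and the estimated survival probability after each."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    death_times = np.unique(y[delta == 1])
    survival, s = [], 1.0
    for d in death_times:
        r = np.sum(y >= d)                       # at risk just before time d
        q = np.sum((y == d) & (delta == 1))      # deaths at time d
        s *= (r - q) / r                         # conditional survival past d
        survival.append(s)
    return death_times, np.array(survival)

y = [2, 3, 3, 5, 8, 9]
delta = [1, 0, 1, 1, 0, 1]
times, surv = kaplan_meier(y, delta)
```

Censored individuals still count in the risk set `r` up to their censoring time, which is how the estimator uses their partial information instead of treating them as deaths.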

The Log-Rank Test

We wish to compare the survival of males to that of females
Shown are the Kaplan-Meier survival curves for the two groups
Females seem to fare a little better up to about $50$ months, but then the two curves both level off to about $50\%$
How can we carry out a formal test of equality of the two survival curves?
At first glance, a two-sample $t$-test seems like an obvious choice : but the presence of censoring again creates a complication
To overcome this challenge, we will conduct a log-rank test

Recall that $d_1<d_2<\cdots<d_K$ are the unique death times among the non-censored patients, $r_k$ is the number of patients at risk at time $d_k$, and $q_k$ is the number of patients who died at time $d_k$
We further define $r_{1k}$ and $r_{2k}$ to be the number of patients in groups $1$ and $2$, respectively, who are at risk at time $d_k$
Similarly, we define $q_{1k}$ and $q_{2k}$ to be the number of patients in groups $1$ and $2$, respectively, who died at time $d_k$
Note that $r_{1k}+r_{2k}=r_k$ and $q_{1k}+q_{2k}=q_k$

Details of the Test Statistic

At each death time $d_k$, we construct a $2\times 2$ table of counts of the form shown above
Note that if the death times are unique (i.e. no two individuals die at the same time), then one of $q_{1k}$ and $q_{2k}$ equals one, and the other equals zero
To test $\mathcal{H}_0:E(X)=\mu$ for some random variable $X$, one approach is to construct a test statistic of the form $$ W=\frac{X-E(X)}{\sqrt{\text{Var}(X)}} $$ where $E(X)$ and $\text{Var}(X)$ are the expectation and variance, respectively, of $X$ under $\mathcal{H}_0$
In order to construct the log-rank test statistic, we compute a quantity that takes exactly the form above, with $X=\sum^K_{k=1}q_{1k}$, where $q_{1k}$ is given in the top left of the table above

The resulting formula for the log-rank test statistic is $$ W=\frac{\sum^K_{k=1}(q_{1k}-E(q_{1k}))}{\sqrt{\sum^K_{k=1}\text{Var}(q_{1k})}}=\frac{\sum^K_{k=1}\left(q_{1k}-\frac{q_k}{r_k}r_{1k}\right)}{\sqrt{\sum^K_{k=1}\frac{q_k(r_{1k}/r_k)(1-r_{1k}/r_k)(r_k-q_k)}{r_k-1}}} $$
When the sample size is large, the log-rank test statistic $W$ has approximately a standard normal distribution
This can be used to compute a $p$-value for the null hypothesis that there is no difference between the survival curves in the two groups

Comparing the survival times of females and males on the BrainCancer data gives a log-rank test statistic of $$ W=1.2 $$ which corresponds to a two-sided $p$-value of $0.2$. Hence we cannot reject the null hypothesis of no difference between the two survival curves
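The log-rank statistic can be computed directly from the formula for $W$ by looping over the unique death times. The data below are a tiny made-up example (not BrainCancer), groups are coded as 1 and 2, and `log_rank` is a hypothetical helper name; the two-sided $p$-value uses the standard normal approximation.

```python
import math
import numpy as np

def log_rank(y, delta, group):
    """Return (W, two-sided p-value) comparing group 1 against group 2."""
    y = np.asarray(y, dtype=float)
    delta = np.asarray(delta, dtype=int)
    group = np.asarray(group, dtype=int)
    numerator, variance = 0.0, 0.0
    for d in np.unique(y[delta == 1]):
        at_risk = y >= d
        died = (y == d) & (delta == 1)
        r, r1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        q, q1 = died.sum(), (died & (group == 1)).sum()
        numerator += q1 - q * r1 / r                 # q_1k - E(q_1k)
        if r > 1:                                    # variance term is undefined for r = 1
            variance += q * (r1 / r) * (1 - r1 / r) * (r - q) / (r - 1)
    W = numerator / math.sqrt(variance)
    p_value = math.erfc(abs(W) / math.sqrt(2))       # 2 * (1 - Phi(|W|))
    return W, p_value

# Toy data: the two group-1 individuals die first.
W, p_value = log_rank(y=[1, 2, 3, 4], delta=[1, 1, 1, 1], group=[1, 1, 2, 2])
```

A positive `W` means group 1 experienced more deaths than expected under the null hypothesis; swapping the group labels flips the sign but leaves the $p$-value unchanged.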

Regression Models with a Survival Response

We now consider the task of fitting a regression model to survival data
We wish to predict the true survival time $T$
Since the observed quantity $Y=\min(T,C)$ is positive and may have a long right tail, we might be tempted to fit a linear regression of $\log(Y)$ on $X$
But censoring again creates a problem
To overcome this difficulty, we instead make use of a sequential construction, similar to the idea used for the Kaplan-Meier survival curve

The Hazard Function

The hazard function or hazard rate - also known as the force of mortality - is formally defined as $$ h(t)=\lim_{\Delta t\rightarrow 0}\frac{\Pr(t<T\leq t+\Delta t\mid T>t)}{\Delta t} $$ where $T$ is the (true) survival time
It is the death rate in the instant after time $t$, given survival up to that time
The hazard function is the basis for the Proportional Hazards Model

The Proportional Hazards Model

The proportional hazards assumption states that $$ h(t|x_i)=h_0(t)\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right) $$ where $h_0(t)\geq 0$ is an unspecified function, known as the baseline hazard
It is the hazard function for an individual with features $x_{i1} = \cdots=x_{ip} = 0$
The name proportional hazards arises from the fact that the hazard function for an individual with feature vector $x_i$ is some unknown function $h_0(t)$ times the factor $$ \exp\left(\sum^p_{j=1} x_{ij}\beta_j\right) $$
The quantity $$ \exp\left(\sum^p_{j=1}x_{ij}\beta_j\right) $$ is called the relative risk for the feature vector $x_i=(x_{i1},\cdots,x_{ip})$, relative to that for the feature vector $x_i=(0,\cdots,0)$
What does it mean that the baseline hazard function $h_0(t)$ is unspecified?
Basically, we make no assumption about its functional form
We allow the instantaneous probability of death at time $t$, given that one has survived at least until time $t$, to take any form
This means that the hazard function is very flexible and can model a wide range of relationships between the covariates and survival time
Our only assumption is that a one-unit increase in $x_{ij}$ corresponds to an increase in $h(t|x_i)$ by a factor of $\exp(\beta_j)$

Here is an example with $p=1$ and a binary covariate $x_i \in\{0,1\}$
Top row : the log hazard and the survival function under the model are shown (green for $x_i=0$ and black for $x_i=1$). Because of the proportional hazards assumption, the log hazard functions differ by a constant, and the survival functions do not cross
Bottom row : the proportional hazards assumption does not hold

Partial Likelihood

Because the form of the baseline hazard is unknown, we cannot simply plug $h(t|x_i)$ into the likelihood and then estimate $\beta=(\beta_1,\dots,\beta_p)^T$ by maximum likelihood
The magic of Cox's proportional hazard model lies in the fact that it is in fact possible to estimate $\beta$ without having to specify the form of $h_0(t)$
To accomplish this, we make use of the same sequential in time logic that we used to derive the Kaplan-Meier survival curve and the log-rank test
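Suppose the $i$th observation is uncensored ($\delta_i=1$), so that $y_i$ is its failure time, and assume for simplicity that there are no ties. Then the total hazard at failure time $y_i$ for the at-risk observations is $$ \sum_{i':y_{i'}\geq y_i}h_0(y_i)\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right) $$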

Therefore, the probability that the $i$th observation is the one to fail at time $y_i$ (as opposed to one of the other observations in the risk set) is $$ 0\leq \frac{h_0(y_i)\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}h_0(y_i)\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}=\frac{\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)}\leq 1 $$
Notice that the unspecified baseline hazard function $h_0(y_i)$ cancels out of the numerator and denominator

The partial likelihood is simply the product of these probabilities over all of the uncensored observations $$ \text{PL}(\beta)=\prod_{i:\delta_i=1}\frac{\exp\left(\sum^p_{j=1}x_{ij}\beta_j\right)}{\sum_{i':y_{i'}\geq y_i}\exp\left(\sum^p_{j=1}x_{i'j}\beta_j\right)} $$
Critically, the partial likelihood is valid regardless of the true value of $h_0(t)$, making the model very flexible and robust
Relative Risk Functions at each Failure Time
$$ RR_1(\boldsymbol{\beta}) = \frac{\exp\left(\sum_{j=1}^p x_{1j} \beta_j \right)} {\sum_{i':y_{i'} \geq y_1} \exp\left(\sum_{j=1}^p x_{i'j} \beta_j \right)}\\[0.3cm]
RR_3(\boldsymbol{\beta}) = \frac{\exp\left(\sum_{j=1}^p x_{3j} \beta_j \right)} {\sum_{i':y_{i'} \geq y_3} \exp\left(\sum_{j=1}^p x_{i'j} \beta_j \right)}\\[0.3cm]
RR_5(\boldsymbol{\beta}) = \frac{\exp\left(\sum_{j=1}^p x_{5j} \beta_j \right)} {\sum_{i':y_{i'} \geq y_5} \exp\left(\sum_{j=1}^p x_{i'j} \beta_j \right)} $$
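As a rough illustration, the partial likelihood can be evaluated directly from its definition. The sketch below is pure Python; the function name `cox_partial_log_likelihood` and the toy data are assumptions made for illustration (not from the lecture), and tied failure times are assumed absent. The toy data are chosen so that observations 1, 3 and 5 are the uncensored ones, mirroring the indices of the relative risk terms above.

```python
import math

def cox_partial_log_likelihood(beta, X, y, delta):
    """Log partial likelihood for Cox's model (assumes no tied failure times).

    beta  : list of p coefficients
    X     : list of n covariate rows, each of length p
    y     : list of n observed times
    delta : list of n status indicators (1 = failure observed, 0 = censored)
    """
    # Linear predictor sum_j x_ij * beta_j for every observation
    eta = [sum(xj * bj for xj, bj in zip(row, beta)) for row in X]
    log_pl = 0.0
    for i in range(len(y)):
        if delta[i] != 1:
            continue  # censored observations contribute no factor
        # Risk set: every observation still at risk at time y_i
        risk = [math.exp(eta[k]) for k in range(len(y)) if y[k] >= y[i]]
        log_pl += eta[i] - math.log(sum(risk))
    return log_pl

# Hypothetical toy data: one binary covariate, five subjects
X = [[0], [1], [0], [1], [1]]
y = [2.0, 3.0, 5.0, 7.0, 8.0]
delta = [1, 0, 1, 0, 1]

# With beta = 0 every factor is 1 / (size of the risk set): 1/5 * 1/3 * 1/1
ll_zero = cox_partial_log_likelihood([0.0], X, y, delta)
```

Note that the baseline hazard never appears in the code, which is exactly the cancellation described above.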

All Contents written based on GIST - Machine Learning & Deep Learning Lesson(Instructor : Prof. sun-dong. Kim)

[ML&DL] 8. Support Vector Machines

Sat, 14 Dec 2024 11:36:06 GMT

Support Vector Machines

Here we approach the two-class classification problem in a direct way :

We try and find a plane that separates the classes in feature space
If we cannot, we get creative in two ways :
- We soften what we mean by separates, and
- We enrich and enlarge the feature space so that separation is possible

What is a Hyperplane?

A hyperplane in $p$ dimensions is a flat affine subspace of dimension $p-1$
In general the equation for a hyperplane has the form $$ \beta_0+\beta_1X_1+\cdots+\beta_pX_p=0 $$
In $p=2$ dimensions a hyperplane is a line
If $\beta_0=0,$ the hyperplane goes through the origin, otherwise not
The vector $\beta=(\beta_1,\beta_2,\cdots,\beta_p)$ is called the normal vector - it points in a direction orthogonal to the surface of a hyperplane

Separating Hyperplanes

If $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$, then $f(X)>0$ for points on one side of the hyperplane, and $f(X)<0$ for points on the other
If we code the colored points as $Y_i=+1$ for blue, say, and $Y_i=-1$ for mauve, then if $Y_i\cdot f(X_i)>0$ for all $i$, $f(X)=0$ defines a separating hyperplane
The distance from a point $X$ to the hyperplane $f(X)=0$ is $$ d=\frac{|\beta_0+\beta_1X_1+\cdots+\beta_pX_p|}{\sqrt{\beta_1^2+\cdots+\beta_p^2}} $$
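A minimal sketch of these two checks, the sign condition $Y_i\cdot f(X_i)>0$ and the distance $d$, in plain Python. The helper names and the example line $X_1+X_2-1=0$ are hypothetical and chosen only for illustration.

```python
import math

def f(beta0, beta, x):
    """Evaluate f(x) = beta0 + beta_1 x_1 + ... + beta_p x_p."""
    return beta0 + sum(b * xj for b, xj in zip(beta, x))

def distance_to_hyperplane(beta0, beta, x):
    """Perpendicular distance from the point x to the hyperplane f(x) = 0."""
    return abs(f(beta0, beta, x)) / math.sqrt(sum(b * b for b in beta))

def is_separating(beta0, beta, X, Y):
    """True if Y_i * f(X_i) > 0 for every training point."""
    return all(yi * f(beta0, beta, xi) > 0 for xi, yi in zip(X, Y))

# Hypothetical example: the line X1 + X2 - 1 = 0 in p = 2 dimensions
beta0, beta = -1.0, [1.0, 1.0]
X = [[2.0, 2.0], [1.0, 3.0], [0.0, 0.0], [-1.0, 0.5]]
Y = [+1, +1, -1, -1]

separates = is_separating(beta0, beta, X, Y)
d_origin = distance_to_hyperplane(beta0, beta, [0.0, 0.0])  # 1 / sqrt(2)
```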

Maximal Margin Classifier

Among all separating hyperplanes, find the one that makes the biggest gap or margin between the two classes
Constrained optimization problem $$ \text{maximize } M\\[0.2cm] \text{subject to }\sum^p_{j=1}\beta^2_j=1,\\[0.3cm] y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M,\quad\text{for all }i=1,\dots,N $$

Non-separable Data

The data are not separable by a linear boundary
This is often the case, unless $N<p$

Noisy Data

Sometimes the data are separable, but noisy
This can lead to a poor solution for the maximal-margin classifier
The support vector classifier maximizes a soft margin

Support Vector Classifier

$$ \text{maximize } M\text{ subject to }\sum^p_{j=1}\beta^2_j=1,\\[0.3cm] y_i(\beta_0+\beta_1x_{i1}+\cdots+\beta_px_{ip})\geq M(1-\epsilon_i),\\[0.3cm] \epsilon_i\geq0,\;\sum^n_{i=1}\epsilon_i\leq C $$

$\epsilon_i$ : slack variable
$C$ : Budget

Linear boundary can fail

Sometimes a linear boundary simply won't work, no matter what value of $C$
The example on the left is such a case. What to do?

Feature Expansion

Enlarge the space of features by including transformations; e.g. $X^2_1,X^3_1,X_1X_2,X_1X^2_2,\dots$. Hence we go from a $p$-dimensional space to an $M>p$ dimensional space
Fit a support-vector classifier in the enlarged space
This results in non-linear decision boundaries in the original space
Example : Suppose we use $(X_1,X_2,X^2_1,X_2^2,X_1X_2)$ instead of just $(X_1,X_2)$. Then the decision boundary would be of the form $$ \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2=0 $$
This leads to nonlinear decision boundaries in the original space(quadratic conic sections)
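The example above can be sketched in a few lines: the boundary is linear in the enlarged features but quadratic in $(X_1,X_2)$. The coefficients below are hypothetical, picked so that the boundary is the unit circle; they are not fitted values.

```python
def quadratic_features(x1, x2):
    """Map (X1, X2) to the enlarged features (X1, X2, X1^2, X2^2, X1*X2)."""
    return [x1, x2, x1 ** 2, x2 ** 2, x1 * x2]

def decision_value(beta0, beta, x1, x2):
    """Linear function of the enlarged features, quadratic in (X1, X2)."""
    phi = quadratic_features(x1, x2)
    return beta0 + sum(b * v for b, v in zip(beta, phi))

# Hypothetical coefficients: the boundary is the circle X1^2 + X2^2 = 1
beta0 = -1.0
beta = [0.0, 0.0, 1.0, 1.0, 0.0]

inside = decision_value(beta0, beta, 0.2, 0.3)   # negative: inside the circle
outside = decision_value(beta0, beta, 1.5, 0.0)  # positive: outside the circle
```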

Cubic Polynomials

Here we use a basis expansion of cubic polynomials
From $2$ variables to $9$
The support-vector classifier in the enlarged space solves the problem in the lower-dimensional space

$$ \beta_0+\beta_1X_1+\beta_2X_2+\beta_3X^2_1+\beta_4X^2_2+\beta_5X_1X_2+\beta_6X^3_1+\beta_7X^3_2+\beta_8X_1X^2_2+\beta_9X^2_1X_2=0 $$

Nonlinearities and Kernels

Polynomials (especially high-dimensional ones) get wild rather fast
There is a more elegant and controlled way to introduce nonlinearities in support-vector classifiers - through the use of kernels
Before we discuss these, we must understand the role of inner products in support-vector classifiers

Inner products and Support Vectors

$$ \langle x_i,x_{i'}\rangle=\sum^p_{j=1}x_{ij}x_{i'j} :\text{inner product between vectors} $$

The linear support vector classifier can be represented as $$ f(x)=\beta_0+\sum^n_{i=1}\alpha_i\langle x,x_i\rangle\;:\;n\text{ parameters} $$
To estimate the parameters $\alpha_1,\dots,\alpha_n$ and $\beta_0$, all we need are the $\left(\begin{matrix}n\\2\end{matrix}\right)$ inner products $\langle x_i,x_{i'}\rangle$ between all pairs of training observations
It turns out that most of the $\hat \alpha_i$ can be zero : $$ f(x)=\beta_0+\sum_{i\in S}\hat \alpha_i\langle x,x_i\rangle $$
$S$ is the support set of indices $i$ such that $\hat \alpha_i >0$

Kernels and Support Vector Machines

If we can compute inner-products between observations, we can fit a SV classifier. Can be quite abstract!
Some special kernel functions can do this for us $$ K(x_i,x_{i'})=\left(1+\sum^p_{j=1}x_{ij}x_{i'j}\right)^d $$ computes the inner products needed for degree-$d$ polynomials - $\left(\begin{matrix}p+d\\d\end{matrix}\right)$ basis functions

Try it for $p=2$ and $d=2$
The solution has the form $$ f(x)=\beta_0+\sum_{i \in S}\hat \alpha_i K(x,x_i) $$
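For the $p=2$, $d=2$ case, a small sketch can confirm that the kernel equals an ordinary inner product in an explicit 6-dimensional feature space. The feature map `phi` below is one standard choice obtained by expanding $(1+x_1z_1+x_2z_2)^2$; the function names and the two test points are assumptions for illustration.

```python
import math

def poly_kernel(x, z, d=2):
    """Polynomial kernel K(x, z) = (1 + <x, z>)^d."""
    return (1.0 + sum(a * b for a, b in zip(x, z))) ** d

def phi(x):
    """Explicit feature map for p = 2, d = 2 (six basis functions)."""
    x1, x2 = x
    r2 = math.sqrt(2.0)
    return [1.0, r2 * x1, r2 * x2, x1 ** 2, x2 ** 2, r2 * x1 * x2]

x, z = [1.0, 2.0], [3.0, -1.0]

k_direct = poly_kernel(x, z)                              # (1 + 1)^2 = 4
k_explicit = sum(a * b for a, b in zip(phi(x), phi(z)))   # same value via phi
n_basis = math.comb(2 + 2, 2)                             # binomial(p + d, d) = 6
```

The kernel needs only the $p$-dimensional inner product, whereas the explicit route needs all $\left(\begin{matrix}p+d\\d\end{matrix}\right)$ features; the gap widens quickly as $p$ and $d$ grow.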

Radial Kernel

$$ K(x_i,x_{i'})=\exp\left(-\gamma\sum^p_{j=1}\left(x_{ij}-x_{i'j}\right)^2\right) $$ $$ f(x)=\beta_0+\sum_{i\in S}\hat \alpha_i K(x,x_i) $$

Implicit feature space; very high dimensional
Controls variance by squashing down most dimensions severely
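A short sketch of the radial kernel itself may help show its local behaviour: the value is $1$ for identical points and decays towards $0$ as the squared distance grows, faster for larger $\gamma$. The function name and sample points are hypothetical.

```python
import math

def radial_kernel(x, z, gamma):
    """Radial kernel K(x, z) = exp(-gamma * sum_j (x_j - z_j)^2)."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

k_same = radial_kernel([1.0, 2.0], [1.0, 2.0], gamma=1.0)   # identical points
k_near = radial_kernel([0.0, 0.0], [1.0, 0.0], gamma=1.0)   # squared distance 1
k_far = radial_kernel([0.0, 0.0], [3.0, 4.0], gamma=1.0)    # squared distance 25
```

Because distant training points contribute almost nothing to $f(x)$, only nearby support vectors influence the predicted class of a test observation.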

Example : Heart Data

ROC Curve is obtained by changing the threshold $0$ to threshold $t$ in $\hat f(X)>t$, and recording false positive and true positive rates as $t$ varies
Here we see ROC curves on the training data
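A minimal sketch of how such a curve is traced: sweep the threshold $t$ over the fitted values and record the false positive and true positive rates at each step. The scores and labels below are made up for illustration and are not the Heart data.

```python
def roc_points(scores, labels):
    """Return (false positive rate, true positive rate) pairs as t varies.

    labels are +1 / -1; an observation is predicted positive when score > t.
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    # Sweep t from +infinity (nothing positive) to -infinity (everything positive)
    thresholds = [float("inf")] + sorted(set(scores), reverse=True) + [float("-inf")]
    points = []
    for t in thresholds:
        tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s > t and y != 1)
        points.append((fp / n_neg, tp / n_pos))
    return points

# Hypothetical fitted values f(X) and true classes
scores = [2.0, 1.0, 0.5, -0.5, -1.0]
labels = [1, 1, -1, 1, -1]

points = roc_points(scores, labels)
```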

SVMs : more than 2 classes?

The SVM as defined works for $K=2$ classes
What do we do if we have $K>2$ classes?
- OVA : One versus All. Fit $K$ different 2-class SVM classifiers $\hat f_k(x),\;k=1,\dots,K$; each class versus the rest. Classify $x^*$ to the class for which $\hat f_k(x^*)$ is largest
- OVO : One versus One. Fit all $\left(\begin{matrix}K\\2\end{matrix}\right)$ pairwise classifiers $\hat f_{kl}(x)$. Classify $x^*$ to the class that wins the most pairwise competitions
Which to choose? If $K$ is not too large, use OVO
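The two decision rules can be sketched as follows, taking the fitted classifier outputs at $x^*$ as given. The class names, score values, and the sign convention for $\hat f_{kl}$ (positive means class $k$ wins) are assumptions made for this illustration.

```python
from itertools import combinations

def classify_ova(scores):
    """scores[k] = f_k(x*); choose the class with the largest value."""
    return max(scores, key=scores.get)

def classify_ovo(pairwise):
    """pairwise[(k, l)] = f_kl(x*); assumed convention: positive means k wins."""
    wins = {}
    for (k, l), value in pairwise.items():
        winner = k if value > 0 else l
        wins[winner] = wins.get(winner, 0) + 1
    return max(wins, key=wins.get)

classes = ["a", "b", "c"]
n_pairs = len(list(combinations(classes, 2)))   # K choose 2 = 3 classifiers

# Hypothetical classifier outputs at a single test point x*
ova_scores = {"a": -0.3, "b": 1.2, "c": 0.1}
ovo_scores = {("a", "b"): -0.8, ("a", "c"): 0.4, ("b", "c"): 1.1}

ova_class = classify_ova(ova_scores)
ovo_class = classify_ovo(ovo_scores)
```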

Support Vector vs Logistic Regression?

With $f(X)=\beta_0+\beta_1X_1+\cdots+\beta_pX_p$ we can rephrase support-vector classifier optimization as $$ \min_{\beta_0,\dots,\beta_p}\left\{\sum^n_{i=1}\max\left[0,1-y_if(x_i)\right]+\lambda\sum^p_{j=1}\beta^2_j\right\} $$
This has the form loss plus penalty
The loss is known as the hinge loss, which is very similar to the loss in logistic regression (negative log-likelihood)
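To see how close the two losses are, the sketch below evaluates both as a function of the margin $y_if(x_i)$. The logistic loss is written in its usual $\pm1$-coded form $\log(1+e^{-yf})$; the chosen margins are arbitrary illustration values.

```python
import math

def hinge_loss(y, fx):
    """SVM loss: max(0, 1 - y * f(x)); exactly zero once the margin exceeds 1."""
    return max(0.0, 1.0 - y * fx)

def logistic_loss(y, fx):
    """Negative log-likelihood of logistic regression with y coded as +1 / -1."""
    return math.log(1.0 + math.exp(-y * fx))

margins = [-2.0, 0.0, 1.0, 3.0]
# Each row: (margin, hinge loss, logistic loss) for an observation with y = +1
table = [(m, hinge_loss(1, m), logistic_loss(1, m)) for m in margins]
```

Both losses grow roughly linearly for badly misclassified points; the hinge loss is exactly zero for points well on the correct side, while the logistic loss is small but never exactly zero.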

Which to use : SVM or Logistic Regression

When classes are (nearly) separable, SVM does better than LR. So does LDA
When not, LR (with ridge penalty) and SVM very similar
If you wish to estimate probabilities, LR is the choice
For nonlinear boundaries, kernel SVMs are popular
Can use kernels with LR and LDA as well, but computations are more expensive

All Contents written based on GIST - Machine Learning & Deep Learning Lesson(Instructor : Prof. sun-dong. Kim)

[ML&DL] 7. Multiple Hypothesis Testing

Sat, 14 Dec 2024 10:45:40 GMT

Multiple Hypothesis Testing

This session focuses on multiple hypothesis testing
A single null hypothesis might look like

$\mathcal{H}_0$ : the expected blood pressures of mice in the control and treatment groups are the same
We will now consider testing $m$ null hypotheses, $H_{01},\dots,H_{0m}$ where e.g.

$\mathcal{H}_{0j}$ : the expected values of the $j^{th}$ biomarker among mice in the control and treatment groups are equal
In this setting, we need to be careful to avoid incorrectly rejecting too many null hypotheses, i.e. having too many false positives

A Quick Review of Hypothesis Testing

Hypothesis tests allow us to answer simple yes-or-no questions, such as
- Is the true coefficient $\beta_j$ in a linear regression equal to zero?
- Does the expected blood pressure among mice in the treatment group equal the expected blood pressure among mice in the control group?
Hypothesis testing proceeds as follows :
1. Define the null and alternative hypotheses
2. Construct the test statistic
3. Compute the $p$-value
4. Decide whether to reject the null hypothesis

1. Define the Null and Alternative Hypotheses

We divide the world into null and alternative hypotheses
The null hypothesis $\mathcal{H}_0$, is the default state of belief about the world. For instance :
1. The true coefficient $\beta_j$ equals zero
2. There is no difference in the expected blood pressure
The alternative hypothesis $\mathcal{H}_a$, represents something different and unexpected. For instance :
1. The true coefficient $\beta_j$ is non-zero
2. There is a difference in the expected blood pressure

2. Construct the Test Statistic

The test statistic summarizes the extent to which our data are consistent with $\mathcal{H}_0$
Let $\hat \mu_t$ and $\hat \mu_c$ respectively denote the average blood pressure for the $n_t$ and $n_c$ mice in the treatment and control groups
To test $\mathcal{H}_0$ : $\mu_t=\mu_c$, we use a two-sample $t$-statistic $$ T=\frac{\hat \mu_t-\hat \mu_c}{s\sqrt{\frac{1}{n_t}+\frac{1}{n_c}}},\\[0.3cm] s:\text{pooled standard deviation},\\[0.3cm] s^2=\frac{(n_t-1)s_t^2+(n_c-1)s_c^2}{n_t+n_c-2} $$ where $s_t^2$ and $s_c^2$ are the sample variances of the two groups
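A minimal sketch of the two-sample $t$-statistic with a pooled standard deviation, in plain Python. The blood pressure values are invented for illustration and chosen so that the arithmetic is easy to check by hand.

```python
import math

def two_sample_t(x, y):
    """Two-sample t-statistic using the pooled standard deviation s."""
    n_x, n_y = len(x), len(y)
    mean_x, mean_y = sum(x) / n_x, sum(y) / n_y
    # Unbiased sample variances of each group
    var_x = sum((v - mean_x) ** 2 for v in x) / (n_x - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (n_y - 1)
    s = math.sqrt(((n_x - 1) * var_x + (n_y - 1) * var_y) / (n_x + n_y - 2))
    return (mean_x - mean_y) / (s * math.sqrt(1 / n_x + 1 / n_y))

# Hypothetical blood pressures for treatment and control mice
treatment = [120.0, 125.0, 130.0, 135.0, 140.0]
control = [110.0, 115.0, 120.0, 125.0, 130.0]

# Means 130 and 120, both variances 62.5, so T = 10 / (sqrt(62.5) * sqrt(0.4)) = 2
T = two_sample_t(treatment, control)
```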

3. Compute the p-value

The $p$-value is the probability of observing a test statistic at least as extreme as the observed statistic, under the assumption that $\mathcal{H}_0$ is true
A small $p$-value provides evidence against $\mathcal{H}_0$
Suppose we compute $T=2.33$ for our test of $\mathcal{H}_0:\mu_t=\mu_c$
Under $\mathcal{H}_0,\;T\sim\mathcal{N}(0,1)$ approximately for a two-sample $t$-statistic
The p-value is $0.02$ because, if $\mathcal{H}_0$ is true, we would only see $|T|$ this large $2\%$ of the time
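The figure of $0.02$ can be reproduced with the standard library alone, using the identity $\Pr(|Z|\geq t)=\operatorname{erfc}(t/\sqrt2)$ for $Z\sim\mathcal{N}(0,1)$. The function name is an assumption for this sketch.

```python
import math

def two_sided_p_value_normal(t):
    """P(|Z| >= |t|) for Z ~ N(0, 1), via the complementary error function."""
    return math.erfc(abs(t) / math.sqrt(2.0))

p = two_sided_p_value_normal(2.33)   # roughly 0.02
```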

4. Decide Whether to Reject Null Hypothesis, Part 1

A small $p$-value indicates that such a large value of the test statistic is unlikely to occur under $\mathcal{H}_0$
So, a small $p$-value provides evidence against $\mathcal{H}_0$
If the $p$-value is sufficiently small, then we will want to reject $\mathcal{H}_0$
But how small is small enough? To answer this, we need to understand the Type 1 Error

4. Decide Whether to Reject Null Hypothesis, Part 2

A Type 1 Error occurs when we reject $\mathcal{H}_0$ although it is in fact true
The Type 1 Error rate is the probability of making a Type 1 Error
We want to ensure a small Type 1 Error rate
If we reject $\mathcal{H}_0$ when the p-value is less than $\alpha$, then the Type 1 Error rate will be at most $\alpha$
So, we reject $\mathcal{H}_0$ when the p-value falls below some $\alpha$ : often we choose $\alpha$ to equal $0.05$ or $0.01$

Multiple Testing

Now suppose that we wish to test $m$ null hypotheses, $H_{01},\dots,H_{0m}$
Can we simply reject all null hypotheses for which the corresponding $p$-value falls below $0.01?$
If we reject all null hypotheses for which the $p$-value falls below $0.01$, then how many Type 1 Errors will we make?

A Thought Experiment

Suppose that we flip a fair coin ten times, and we wish to test

$\mathcal{H}_0:\text{the coin is fair}$
- We'll probably get approximately the same number of heads and tails
- The $p$-value probably won't be small. We do not reject $\mathcal{H}_0$
But what if we flip $1,024$ fair coins ten times each?
- We'd expect one coin (on average) to come up all tails
- The $p$-value for the null hypothesis that this particular coin is fair is less than $0.002$!
- So we would conclude it is not fair, i.e. we reject null hypothesis, even though it's a fair coin
If we test a lot of hypotheses, we are almost certain to get one very small $p$-value by chance
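The numbers in this thought experiment are easy to verify. The sketch below assumes the quoted $p$-value is two-sided (all tails or all heads), which is an interpretation rather than something stated above.

```python
n_flips, n_coins = 10, 1024

p_all_tails = 0.5 ** n_flips                    # 1 / 1024 for a single fair coin
expected_all_tails = n_coins * p_all_tails      # expected number of such coins: 1
p_value_two_sided = 2 * p_all_tails             # all tails or all heads (assumed two-sided)
# Chance that at least one of the 1,024 fair coins comes up all tails
p_at_least_one = 1 - (1 - p_all_tails) ** n_coins
```

So a result that would look like strong evidence for a single coin is more likely than not to appear somewhere among $1,024$ fair coins.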

The Challenge of Multiple Testing

Suppose we test $H_{01},\dots,H_{0m}$, all of which are true, and reject any null hypothesis with a $p$-value below $0.01$
Then we expect to falsely reject approximately $0.01\times m$ null hypotheses
If $m=10,000$, then we expect to falsely reject $100$ null hypotheses by chance!

That's a lot of Type 1 Errors, i.e. false positives

The Family-Wise Error Rate

The family-wise error rate (FWER) is the probability of making at least one Type 1 error when conducting $m$ hypothesis tests
$\text{FWER}=\Pr(V\geq1)$, where $V$ is the number of Type 1 Errors (falsely rejected null hypotheses)

Challenges in Controlling the FWER

$$ \text{FWER}=1-\Pr(\text{do not falsely reject any null hypotheses}) $$

If the tests are independent, all $H_{0j}$ are true, and each test is performed at level $\alpha$, then $$ \text{FWER} = 1-\prod^m_{j=1}(1-\alpha)=1-(1-\alpha)^m $$
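Plugging a few values of $m$ into this formula shows how quickly the FWER approaches $1$. The sketch assumes independent tests with every null hypothesis true, as in the formula; the function name is hypothetical.

```python
def fwer_independent(alpha, m):
    """FWER = 1 - (1 - alpha)^m for m independent tests of true nulls at level alpha."""
    return 1 - (1 - alpha) ** m

fwer_1 = fwer_independent(0.05, 1)       # a single test: 0.05
fwer_10 = fwer_independent(0.05, 10)     # about 0.40
fwer_100 = fwer_independent(0.05, 100)   # above 0.99
```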

The Bonferroni Correction

$$ \text{FWER} = \Pr(\text{falsely reject at least one null hypothesis})=\Pr\left(\cup^m_{j=1}A_j\right)\leq\sum^m_{j=1}\Pr(A_j) $$

Where $A_j$ is the event that we falsely reject the $j$th null hypothesis
If we only reject hypotheses when the $p$-value is less than $\alpha/m$, then $$ \text{FWER}\leq\sum^m_{j=1}\Pr(A_j)\leq\sum^m_{j=1}\frac{\alpha}{m}=m\times\frac{\alpha}{m}=\alpha $$
This is the Bonferroni Correction : to control FWER at level $\alpha$, reject any null hypothesis with $p$-value below $\alpha/m$

Holm's Method for Controlling the FWER

Compute $p$-values, $p_1,\dots,p_m$ for the $m$ null hypotheses $H_{01},\dots,H_{0m}$
Order the $m$ $p$-values so that $p_{(1)}\leq p_{(2)}\leq\cdots\leq p_{(m)}$
Define $$ L=\min\left\{j:p_{(j)}>\frac{\alpha}{m+1-j}\right\} $$
Reject all null hypotheses $H_{0j}$ for which $p_j<p_{(L)}$
Holm's method controls the FWER at level $\alpha$

Holm's Method on the Fund Manager Data

The ordered $p$-values are $p_{(1)}=0.006,\;p_{(2)}=0.012,\;p_{(3)}=0.601,\;p_{(4)}=0.756,\;p_{(5)}=0.918$
The Holm procedure rejects the first two null hypotheses, because $$ p_{(1)}=0.006<0.05/(5+1-1)=0.0100\\[0.2cm] p_{(2)}=0.012<0.05/(5+1-2)=0.0125\\[0.2cm] p_{(3)}=0.601>0.05/(5+1-3)=0.0167 $$
Holm rejects $\mathcal{H}_0$ for the first and third managers, but Bonferroni only rejects $\mathcal{H}_0$ for the first manager
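Both procedures fit in a few lines of Python. The sketch below follows the definitions given above and is run on the five Fund Manager $p$-values in manager order (first manager $0.006$, third manager $0.012$); the function names are assumptions for illustration.

```python
def bonferroni(p_values, alpha):
    """Reject H_0j whenever p_j < alpha / m."""
    m = len(p_values)
    return [p < alpha / m for p in p_values]

def holm(p_values, alpha):
    """Holm's step-down method: reject H_0j whenever p_j < p_(L)."""
    m = len(p_values)
    ordered = sorted(p_values)
    # L is the smallest j (1-based) with p_(j) > alpha / (m + 1 - j)
    for j, p in enumerate(ordered, start=1):
        if p > alpha / (m + 1 - j):
            return [q < p for q in p_values]   # p is p_(L)
    return [True] * m   # no such j exists: every hypothesis is rejected

# Fund Manager p-values, listed by manager
p_values = [0.006, 0.918, 0.012, 0.601, 0.756]

bonf_reject = bonferroni(p_values, alpha=0.05)   # only the first manager
holm_reject = holm(p_values, alpha=0.05)         # first and third managers
```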

Comparison with m=10 p-values

Aim to control FWER at $0.05$
$p$-values below the black horizontal line are rejected by Bonferroni
$p$-values below the blue line are rejected by Holm
Holm and Bonferroni make the same conclusion on the black points, but only Holm rejects for the red point

A More Extreme Example

Now five hypotheses are rejected by Holm but not by Bonferroni ...
even though both control FWER at $0.05$

Holm or Bonferroni?

Bonferroni is simple : reject any null hypothesis with a $p$-value below $\alpha/m$
Holm is slightly more complicated, but it will lead to more rejections while controlling FWER

So, Holm is a better choice

The False Discovery Rate

Back to this table :
The FWER focuses on controlling $\Pr(V\geq1)$, i.e., the probability of falsely rejecting any null hypothesis
This is a tough ask when $m$ is large. It will cause us to be super conservative(i.e. to very rarely reject)
Instead, we can control the false discovery rate $$ \text{FDR}=E(V/R) $$

Intuition Behind the False Discovery Rate

$$ \text{FDR}=E(V/R)=E\left(\frac{\text{number of false rejections}}{\text{total number of rejections}}\right) $$

A scientist conducts a hypothesis test on each of $m=20,000$ drug candidates
She wants to identify a smaller set of promising candidates to investigate further
She wants reassurance that this smaller set is really promising, i.e. not too many falsely rejected $\mathcal{H}_0$'s
FWER controls $\Pr(\text{at least one false rejection})$
FDR controls the fraction of candidates in the smaller set that are really false rejections.

Benjamini-Hochberg Procedure to Control FDR

Specify $q$, the level at which to control the FDR
Compute $p$-values $p_1,\dots,p_m$ for the null hypotheses $H_{01},\dots,H_{0m}$
Order the $p$-values so that $p_{(1)}\leq\dots\leq p_{(m)}$
Define $L=\max\left\{j:p_{(j)}<qj/m\right\}$
Reject all null hypotheses $H_{0j}$ for which $p_j\leq p_{(L)}$. Then, FDR $\leq$ $q$

A Comparison of FDR vs FWER

Here, $p$-values for $m=2,000$ null hypotheses are displayed
To control FWER at level $\alpha=0.1$ with Bonferroni : reject hypotheses below green line
To control FDR at level $q=0.1$ with Benjamini-Hochberg : reject hypotheses shown in blue

Consider $m=5$ p-values from the Fund data : $p_1=0.006,\;p_2=0.918,\;p_3=0.012,\;p_4=0.601,\;p_5=0.756$
To control FDR at level $q=0.05$ using Benjamini-Hochberg :
- Notice that $p_{(1)} <0.05/5,\;p_{(2)}<2\times0.05/5,\;p_{(3)}>3\times0.05/5,\;p_{(4)}>4\times0.05/5,\;p_{(5)}>5\times0.05/5$
- So, we reject $H_{01}$ and $H_{03}$
To control FWER at level $\alpha=0.05$ using Bonferroni :
- We reject any null hypothesis for which the $p$-value is less than $0.05/5$
- So, we reject only $H_{01}$
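The Benjamini-Hochberg steps can be sketched in plain Python and checked against the Fund data above. The function name is an assumption; the thresholds $qj/m$ follow the comparison used in the worked example.

```python
def benjamini_hochberg(p_values, q):
    """Benjamini-Hochberg: reject H_0j whenever p_j <= p_(L)."""
    m = len(p_values)
    ordered = sorted(p_values)
    # L is the largest j (1-based) with p_(j) < q * j / m
    L = 0
    for j, p in enumerate(ordered, start=1):
        if p < q * j / m:
            L = j
    if L == 0:
        return [False] * m   # no p-value falls below its threshold
    cutoff = ordered[L - 1]  # this is p_(L)
    return [p <= cutoff for p in p_values]

# Fund data p-values p_1, ..., p_5
p_values = [0.006, 0.918, 0.012, 0.601, 0.756]

bh_reject = benjamini_hochberg(p_values, q=0.05)           # rejects H_01 and H_03
bonf_reject = [p < 0.05 / len(p_values) for p in p_values] # rejects only H_01
```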

Re-Sampling Approaches

So far, we have assumed that we want to test some null hypothesis $\mathcal{H}_0$ with some test statistic $T$, and that we know the distribution of $T$ under $\mathcal{H}_0$
This allows us to compute the $p$-value
What if this theoretical null distribution is unknown?

A Re-Sampling Approach for a Two-Sample t-Test

Suppose we want to test $H_0:E(X)=E(Y)$ versus $H_a :E(X)\neq E(Y)$, using $n_X$ independent observations from $X$ and $n_Y$ independent observations from $Y$
The two-sample $t$-statistic takes the form $$ T=\frac{\hat \mu_X-\hat \mu_Y}{s\sqrt{1/n_X+1/n_Y}} $$
If $n_X$ and $n_Y$ are large, then $T$ approximately follows a $\mathcal{N}(0,1)$ distribution under $\mathcal{H}_0$
If $n_X$ and $n_Y$ are small, then we don't know the theoretical null distribution of $T$
Let's take a permutation or re-sampling approach...

1. Compute the two-sample $t$-statistic $T$ on the original data $x_1,\dots,x_{n_X}$ and $y_1,\dots,y_{n_Y}$
2. For $b=1,\dots,B$ (where $B$ is a large number, like $1,000$) :
2.1. Randomly shuffle the $n_X+n_Y$ observations
2.2. Call the first $n_X$ shuffled observations $x^*_1,\dots,x^*_{n_X}$ and call the remaining observations $y^*_1,\dots,y^*_{n_Y}$
2.3. Compute a two-sample $t$-statistic on the shuffled data, and call it $T^{*b}$
3. The $p$-value is given by $$ \frac{\sum^B_{b=1} 1_{(|T^{*b}|\geq|T|)}}{B} $$
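The re-sampling steps above can be sketched with the standard library alone. The data, the seed, and the function names are assumptions for illustration; a tiny tolerance is added to the comparison so that floating-point rounding does not misclassify a shuffle that reproduces the original split.

```python
import math
import random

def two_sample_t(x, y):
    """Two-sample t-statistic using the pooled standard deviation."""
    n_x, n_y = len(x), len(y)
    mean_x, mean_y = sum(x) / n_x, sum(y) / n_y
    ss_x = sum((v - mean_x) ** 2 for v in x)
    ss_y = sum((v - mean_y) ** 2 for v in y)
    s = math.sqrt((ss_x + ss_y) / (n_x + n_y - 2))
    return (mean_x - mean_y) / (s * math.sqrt(1 / n_x + 1 / n_y))

def permutation_p_value(x, y, B=1000, seed=0):
    """Fraction of shuffled statistics at least as extreme as the observed one."""
    rng = random.Random(seed)
    t_obs = two_sample_t(x, y)
    pooled = list(x) + list(y)
    count = 0
    for _ in range(B):
        rng.shuffle(pooled)                      # step 2.1
        x_star = pooled[:len(x)]                 # step 2.2
        y_star = pooled[len(x):]
        t_star = two_sample_t(x_star, y_star)    # step 2.3
        if abs(t_star) >= abs(t_obs) - 1e-12:    # tolerance for rounding
            count += 1
    return count / B

# Hypothetical small samples with clearly different means
x = [5.1, 4.9, 6.0, 5.5, 5.9]
y = [4.2, 4.0, 4.8, 4.4, 3.9]

p_perm = permutation_p_value(x, y, B=1000, seed=0)
```

Fixing the seed makes the estimate reproducible; a larger $B$ reduces the Monte Carlo noise in the estimated $p$-value.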

Theoretical $p$-value is $0.041$. Re-sampling $p$-value is $0.042$
Theoretical $p$-value is $0.571$. Re-sampling $p$-value is $0.673$