ooooo_h.log

도커 데몬

Wed, 05 Apr 2023 15:27:07 GMT

도커의 구조

# which docker
/usr/local/bin/docker

# ps aux | grep docker 
ohyeon            9032   0.0  0.2 35284600  70884   ??  S    0:11.95 /Applications/Docker.app/Contents/MacOS/com.docker.backend -watchdog -native-api
ohyeon             10204   0.0  0.0 409223472   4384   ??  S  0:00.55 /Library/PrivilegedHelperTools/com.docker.vmnetd

"docker.backend" : Docker 엔진의 주요 컴퍼넌트 중 하나인 "dockerd"를 실행하는 역할

클라이언트로서의 도커 : 도커 데몬은 API를 입력 받아 도커 엔진의 기능을 수행하는데 API를 사용할 수 있도록 CLI(Command Line Interface)를 제공 ex) docker version, images, start .. etc

서버로서의 도커
- 컨테이너를 생성하고 실행하며 이미지, 컨테이너, 네트워크를 관리하는 주체 (dockerd 프로세스로서 동작)
- 도커 데몬 : docker 엔진의 핵심 컴포넌트, 컨테이너화된 애플리케이션을 관리하고 실행하는 핵심 서비스

도커 클라이언트는 입력 명령어를 로컬에 존재하는 도커 데몬에게 API로 전달

도커 클라이언트는 /var/run/docker.sock에 위치하는 유닉스 소켓을 통해 도커 데몬의 API를 호출
- tcp로 원격에 있는 도커 데몬을 제어 할 수도 있음

도커 데몬 실행 및 설정

우분투에서는 자동으로 서비스로 등록되어 호스트가 재시작되어도 자동으로 실행

// 도커 데몬 시작 정지 
# service docker start
# service docker stop

도커 서비스는 dockerd_로 도커 데몬을 실행할 수 있으며 _dockerd 에 여러 옵션을 적용하여 사용

# service docker stop 
# dockerd 

// 도커 데몬에 여러 옵션 적용 

# dockerd --insecure-registry : 레지스트리 컨테이너 구축
          --log_driver : 컨테이너 로깅
          --storage-opt : 스토리지 백엔드 변경

-> 단, 도커 데몬을 실행하면 하나의 터미널을 차지하는 포그라운드(foreground) 상태로 실행되기 때문에 설정파일을 통해 서비스로 실행하는 것이 일반적임

도커 데몬 제어 : -H

-H 옵션은 도커 데몬의 API를 사용할 수 있는 방법을 추가, 아무런 옵션을 설정하지 않고 도커 데몬을 실행하면 도커 클라이언트인 /usr/local/bin/docker를 위한 유닉스 소캣 /var/run/docker.sock을 사용

# dockerd
# dockerd -H unix:///var/run/docker.sock

-H 옵션에 IP주소와 포트번호를 입력하면 원격 API인 Docker Remote API로 도커를 제어할 수 있음

RESTful API 형식을 띠고 있어서 HTTP 요청으로 다른 서버의 도커를 제어할 수 있음

# dockerd -H tcp://0.0.0:2375 // 단 Remote API만을 위한 바인딩 주소를 입력했다면 
                              // 유닉스 소캣은 비활성화 되어 도커 클라이언트 사용이 불가 
# docker -H unix://var/run/docker.sock -H tcp://0.0.0:2375

도커 클라이언트가 도커 데몬에게 명령어를 수행하도록 요청할 때도 내부적으로 같은 API를 사용하므로 Remote API 또한 도커 클라이언트에서 사용 가능한 모든 명령어를 사용할 수 있음 -H 로 Remote API를 사용하려면 cURL 같은 HTTP 요청 도구를 사용함

docker-daemon:/# dockerd -H tcp://192.168.00.100:2375

client:/# curl 192.168.99.100:2375/version --silent | python -m json.tool

192.168.99.199:2375 를 도커 데몬에 바인딩하고, 192.168.99.100:2375/version을 URL로 http 요청을 보내 도커 버전을 확인할 수 있음

도커 데몬 보안 적용 : --tlsverify

도커를 설치하면 기본적으로 보안 연결이 설정되어 있지 않아 운영 환경에서 도커를 사용해야 한다면 보안을 적용하는 것이 바람직 하지 않음 (IP, Port 만으로 도커 제어 가능)

도커 데몬에 TLS 보안 적용을 위해 5개의 파일을 사용 (ca.pem, server-cert.pem, server-key.pem, cert.pem, key.pem ) 이 중 클라이언트 측에서 도커 데몬에 접근 하려면 ca.pem, cert.pem, key.pem 파일이 필요

서버측 파일 생성
```
// 인증서에 사용될 키 생성 
# mkdir keys && cd keys 
# openssl genrsa -aes256 -out ca-key.pem

// 공용 키 (public key)를 생성 
# openssl req -new x509 -days 10000 -key ca-key.pem -sha256 -out ca.pem

// 서버 측에서 사용될 키를 생성 
# openssl genrsa -out server-key.pem 4096

// 서버 측에서 사용될 인증서를 위한 인증 요청서 파일을 생성 
# openssl req -subj "/CN={$DOCKER HOST IP or Domain}" -sha256 -new -key server-key.pem -out server.csr

// 접속에 사용될 IP 주소를 extfile.cnt 파일로 저장 
# echo subjectAltName = IP:{$DOCKER Host IP or Domain},IP:127.0.0.1" > extfile.cnf

// 서버 측 인증서 파일 생성 
# openssl x509 -req -days 365 -sha256 -in server.csr -CA ca.epm -CAkey ca-key.pem -CAreateserial -out server-cert.pem -extfile extfile.cnf
```
1. 클라이언트 측에서 사용할 파일 생성
// 클라이언트 측의 키 파일과 인증 요청 파일을 생성, extfile.cnt 파일에 extendedKeyUsage 항목 추가

openssl genrsa -out key.pem 4096

openssl -req -subj '/CN=client' -new -key key.pem -out client.csr

echo extendedKeyUsage = clientAuth > extfile.cnt

// 클라이언트 측의 인증서 생성

openssl x509 -req -days 30000 -sha256 -in client.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out cert.pem -extfile extfile.cnt

// 쓰기 권한을 삭제 후 읽기 전용 파일로 생성

chmod -v 0400 ca-key.pem server-key.pem ca.pem server-cert.pem cert.pem

// 도커 데몬의 설정 파일이 존재하는 디렉터리로 파일을 옮김 (~/.docker)

cp {ca, server-cert, server-key, cert, key}.pem ~/.docker
```
이 후 보안 적용을 위해 -tlsverify 옵션을 추가해 --tlscacert, --tlscert, --tlskey 파일 위치를 입력하여  확인 
```
dockerd --tlsverify \

--tlscacert=/root/.docker/ca.pem
--tlscert=/root/.docker/server-cert.pem
--tlskey=/root/.docker/server-key.pem
-H=0.0.0.0:2376
-H unix:///var/run/docker.sock

// 클라이언트에서 접근 시에도 옵션을 주어 접근해야 함 

# docker -H 192.168.99.100:2376 \ 
--tlscacert=/root/.docker/ca.pem \
--tlscert=/root/.docker/cert.pem \
--tlskey=/root/.docker/key.pem \
--tlsverify version 

// 위의 과정이 번거로우니 환경변수로 설정하여 접근

# export DOCKER_CERT_PATH="/root/.docker"
# export DOCKER_TLS_VERIFY=1 
```

도커 스토리지 드라이버 변경 : --storage-driver

도커는 특정 스토리지 백엔드 기술을 사용해 도커 컨테이너와 이미지를 저장하고 관리함 일부 운영체제는 도커를 설치할 때 기본적으로 사용하도록 설정된 스토리지 드라이버가 있음 (데비안 : overlay2, CentOS : devicemapper)

# docker info | grep "Storage Driver"
Storage Driver : overlay2

도커를 사용하는 환경에 따라 스토리지 드라이버는 자동으로 지정되지만 도커 데몬 실행 옵션에 스토리지 드라이버를 변경할 수 있음 (OverlayFS, AUFS, Btrfs, Devicemapper, VFS , ZFS ...)

이 중 하나만 선택해 도커 데몬에 적용할 수 있으며 적용된 스토리지 드라이버에 따라 컨테이너와 이미지가 별도로 생성됨

스토리지 드라이버의 원리

개념적으로 이미지는 읽기 전용 파일로 사용되며 컨테이너는 이 이미지 위에 얇은 컨테이너 레이어를 생성함으로써 컨테이너의 고유한 공간을 생성함

실제로는 컨테이너 내부에서 읽기와 새로운 파일 쓰기, 기존의 파일 쓰기 작업이 일어날 때는 드라이버에 따라 Copy-on-Write(CoW), Redirect-on-Write(RoW) 개념을 사용 - 도커 스토리지 드라이버는 CoW, RoW를 지원해야 함

스냅샷의 기본 개념 : "원본 파일은 읽기 전용으로 사용하되 이 파일이 변경 되면 새로운 공간으로 할당" -> 스토리지를 스냅샷으로 만들면 파일 위치 저장, 변경 내역 관리함으로써 스냅샷을 사용

A, B, C 파일이 스냅샷으로 생성되었을 때 읽기 작업을 수행하는 경우 파일시스템의 원본 파일에 접근해 내용을 읽으면 되지만 쓰기 작업을 수행할 경우 원본 파일을 유지하면서도 변경된 사항을 저장할 수 있어야 함 -> 이를 해결하는 방법에 따라 CoW, RoW로 나눠지게 됨

CoW : 스냅샷의 파일에 쓰기 작업을 수행할 때 스냅샷 공간에 원본 파일을 복사한 뒤 쓰기 요청을 반영 (복사를 위해 파일 읽는 작업 + 스냅샷 공간에 쓰고 변경된 사항을 쓰는 작업, 총 2번의 쓰기 작업으로 오버헤드가 발생)

RoW : CoW와 다르게 한 번의 쓰기 작업만 발생, 스냅샷 공간에 복사하는 것이 아닌 스냅샷에 기록된 원본 파일은 스냅샷 파일로 묶은 뒤 변경 사항을 새로운 장소에 할당 받아 덮어쓰는 형식

도커 컨테이너와 이미지에 적용을 하면 이미지 레이어는 각 스냅샷에 해당하고, 컨테이너는 이 스냅샷을 사용하는 변경점 ** 컨테이너를 이미지로 만들면 변경된 사항이 스냅샷으로 생성되고 하나의 이미지 레이어로서 존재하게 됨 **

-> 이를 수행하는 방법에 따라 여러 스토리지 드라이브 (AUFS, Devicemapper, OverlayFS, Btrfs, ZFS ..) 존재

도커 데몬 모니터링

많은 수의 도커 서버를 효율적으로 관리하기 위해, 도커로 컨테이너 애플리케이션을 개발하다 문제가 생겼을 때 원한을 찾기 위해, PaaS로 제공하기 위해 실시간으로 도커 데몬의 상태를 체크하기 위해 다양한 방법으로 모니터링을 할 수 있음

도커 데몬 디버그 모드

도커 데몬의 상태를 알아내는 방법 이며 Remote API의 입출력뿐만 아니라 로컬 도커 클라이언트에서 오가는 명령어를 출력함

# dockerd -D // -D 옵션을 통해 디버그 모드 실행

위의 디버그 모드는 너무 많은 로그를 출력하기 때문에 도커에서 제공하는 명령어를 통해 모니터링을 할 수 있음

events : 도커 데몬에 어떤 일이 일어나고 있는지 실시간 스트림 로그로 보여줌

# docker events 
# docker system events

attach, commit, copy, create 등의 컨테이너 관련 명령어, delete, import, load, pull, push 등의 이미지 관련 명령어, 볼륨, 네트워크, 플러그인 등에 관한 명령어를 확인할 수 있음

// --filter 옵션을 통해 원하는 정보만 확인할 수 있음 (container, image, volumne network, plugin, daemon)
# docker events --filter "type=image"

stats : 실행 중인 모든 컨테이너의 자원 사용량을 스트림으로 출력

실행중인 모든 컨테이너의 CPU, 메모리 제한 및 사용량, 네트워크 입출력(I/O), 블록 입출력 정보를 출력

# docker stats 
CONTAINER ID   NAME      CPU %     MEM USAGE / LIMIT   MEM %     NET I/O     BLOCK I/O   PIDS
c1c8ca08d5d0   ubuntu    0.00%     864KiB / 7.674GiB   0.01%     806B / 0B   0B / 0B     1

// --no-stream으로 한번만 출력 가능 
# docker stats --no-stream

*system df * : 도커에서 사용하고 있는 이미지, 컨테이너, 로컬 볼륨의 개수, 크기, 확보 가능한 공간을 출력

# docker system df

CAdvisor : 구글이 만든 컨테이너 모니터링 도구로, 컨테이너별 실시간 자원 사용량 및 도커 모니터링 정보를 시각화 해서 보여줌

# docker run \
  --volume=/:/rootfs:ro \ 
  --volume=/var/run:/var/run:ro \
  --volume=/sys:/sys:ro \
  --volume=/var/lib/docker/:/var/lib/docker:ro \ 
  --volume=/dev/disk/:/dev/disk:ro \
  --publish=8080:8080 \
  --detach=true \ 
  --name=cadvisor \ 
  google/cadvisor:latest

파이썬 Remote API 라이브러리를 이용한 도커 사용

-H 옵션을 원격의 도커 데몬을 제어하기 위해 사용하는 것도 좋은 방법이지만 컨테이너 애플리케이션이 수행해야 할 작업이 많거나 애플리케이션 초기화 등에 복잡한 과정이 포함되어 있으면 도커를 제어하는 라이브러리를 사용해 쉽게 해결 가능 함

Go, C#, C++, Python, Dart, Java 등 사용 가능

도커 클라이언트 객체를 생성하여 도커 엔진을 제어할 수 있음, 이때 base_url에 http://192.168.0.100:2375 와 같이 도커 데몬에 접근할 수 있는 IP 주소와 포트번호를 입력해야 함

# python3 

>>> import docker 
>>> client = docker.DockerClient(base_url="unix://var/run/docker.sock')
>>> client.info()


>>> container = client.run('nginx', detach=True, ports={'80/tcp' : 80 }) # docker run -d -p 80:80 nginx 와 동일

음성

Tue, 07 Dec 2021 18:26:58 GMT

Abstract

Neural Speech Synthesis 모델들은 Text-to-Speech 와 compression applications을 위한 high quality 읍성 합성이 가능한 것을 입증했다. 이러한 새로운 모델들은 real-time operation을 위해 powerful GPUs가 필요한데, WaveRNN을 변형하여 Linear Prediction과 recurrent neural networks를 결합한 LPCNet을 제안함으로써 WaveRNN보다 훨씬 더 높은 퀄리티의 Speech synthesis가 가능하며, 3GFLOPS보다 낮은 complexity로 embedded systems나 mobile phones와 같이 lower-power devices에 deploy할 수 있음

FLOPS : 초당 부동소수점 연산, 컴퓨터가 1초동안 수행할 수 있는 부동소수점 연산의 횟수를 기준으로 삼는다. ex) GPT-3의 사전 학습에 필요한 계산은 3640 petaflop/s-day ( 24시간동안 초당 $10^{15}$(1000조)개의 부동소수점 연산을 수행하는데 소요되는 전력

Introduction

Neural Speech synthesis 모델은 최근에 high-quality speech synthesize와 code high quality speech at very low bitrate
WaveNet 기반의 알고리즘들은 high-end GPU를 통해서만 real-time operation이 가능하기 때문에 end-user devices( ex. mobile phones)에서 사용불가

$\rightarrow$ 논문 저자들은 speech synthesis를 end-user devices에서 사용할 수 있게 하고자 함 (No powerful GPUs, Limited battery capacity)

FFTNET, WaveRNN 같은 모델들은 speech synthesis의 complexity를 줄이기 위해 노력함
Low bitrate vocoders와 같은 Low complexity parametric synthesis models이 존재함
- quality가 제한적이며, Linear Prediction을 사용하여 음성의 spectral envelope(vocal tract response)를 모델링 하는데 효율적이지만 excitation signal을 모델링 하는데 문제가 있음
  
  Spectral envelope : 스펙트럼 포락선, 발음마다 vocal tract의 구조가 달라 주파수의 증폭과 감쇠가 달라짐, 발음의 종류를 결정하는 주요한 특징 excitation signal : 여기 신호, 유성음(여러 배음들로 스펙트럼 구성), 무성음(백색소음과 같은 스펙트럼)

$\rightarrow$ 저자들은 spectral envelope modeling을 neural synthesis network를 사용하지 않은 model을 제안하여 complexity를 줄이면서 SOTA모델과 동일한 quality를 보장하는 방법을 제안하였다.

WaveRNN

WaveRNN은 이전 Audio Sample인 $s_{t-1}$과 conditioning parameters $f$를 통해 discrete probability distribution인 Output Sample $P(s_t)$를 생성함
RNN(GRU) + Two fully-connected layers이 연결된 dual softmax layers로 구성되어 있으며, WaveNet과 비교하여 학습된 모델 사이즈가 작으며 speech synthesis가 빠름

16-bit model(8 coarse bits(=high bits) and 8 fine bits(=low bits)) $\rightarrow$ 8 corarse bits를 먼저 예측하고, 이를 기반으로 8 fine bits를 예측하여, 16-bits sample을 예측하는 것이 아닌 8-bits sample을 두개 예측하여 complexity가 낮음

$\rightarrow$ LPCNet의 경우 coarse/fine split을 생략

$W^{(·)}, U^{(·)}$ 는 GRU weights, complexity를 줄이기 위해 GRU weights matrix를 4x4 or 16x1 non-zero block을 사용하여 spase하게 구성했음

LPCNet

16kHz을 연산하는 sample rate network와 10-ms frames(160개의 sample, =100FPS)을 처리하는 frame rate network로 구성되어 있다
논문에서는 Audio input을 20 features로 제안하여 연구를 진행하엿다
- 18 Bark-scale cepstral coefficients, 2 pitch parameters(period, correlation)

Conditioning Parameters

Frame rate network에서 Audio의 20features가 two convolution layers(filter size =3, conv 3x1)을 통해 5 frame의 receptive field가 되어 residual connection과 two fully-connected layers를 통과하여 128-dimensional conditioning vector $f$ 가 되어 sample rate networks의 conditioning vector $f$ 가 된다

Pre-emphasis and Quantization

WaveNet과 같은 synthesis model은 output sample values를 256으로 줄이기 위해 8-bit $\mu$-law quantization을 사용한다
Speech signals의 대부분은 low frequencies에 집중되는 경향이 있기 때문에, $\mu$-law white quantization noise는 high frequencies에서 들을 수 있다 $\rightarrow$ 이러한 문제를 해결하기 위해 16bits로 quantization (ex. Audio CD)을 하는 경우가 많다
위의 문제를 해결하기 위해 논문 저자들은 first-order pre-emphasis filter를 training data에 적용하였다
- pre-emphasis filter : $E(z) = 1-\alpha z^{-1}$, high frequencies에 대한 noise의 영향을 줄이기 위해 frequencies를 강화시켜 high frequencies의 noise 영향을 줄이는 방법 $\rightarrow$ synthesis의 output에 대해서는 inverse(de-emphasis) filter를 적용하여 noise를 줄일 수 있었으며, 8-bit $\mu$-law output이 high-quality synthesis에 성공할 수 있도록 만든다

Linear Prediction

Neural speech synthesis approaches는 glottal pulses, noise excitation, vocal tract response를 포함한 speech production process를 modeling 해야 한다
vocal tract response의 경우 all-pole linear filter를 통해 표현될 수 있다 (all-pole = autoregressive, AR, 자기회귀)
- vocal tract를 필터로 본다면 필터의 coefficients를 구할 수 있으며, 이를 음성의 특징을 짓는 시스템으로 생각할 수 있다
- $s_t$를 time t의 signal로 생각하면, 이것의 linear prediction을 이전 샘플들을 기반으로 보았을 때 다음과 같이 생각할 수 있다
- $a_k$는 현재 frame을 위한 $M^{th}$ order linear prediction coefficients (LPC)로 볼 수 있다
- $a_k$를 구하는 방법 :
  - 위의 식을 정리해보면 음성신호가 prediction error filter A(z)를 통과하면 excitation을 추정하는 error signal이 나오게 되며, 이러한 error signal이 vocal tract response model H(z)를 통과하면 음성신호가 된다
  - 이러한 모델의 장점은 vocal tract를 all-pole model 식으로 하여 LP coefficient를 찾으면 되는 장점이 있음 -> $s ,\tilde{s}$ 사이의 제곱으로 error function으로 정하여 최소화 하는 $a_k$를 찾으면 된다
  - 위의 식을 미분하면 다음과 같이 나온다
  - all-pole model이기 때문에 correlation으로 정리하여 matrix로 표현하면 다음과 같이 구할 수 있으며 쉽게 $a_k$를 구할 수 있다
Prediction coefficients $a_k$를 계산하기 위해 18-band Bark-frequency ceptrum을 linear-frequency power spectral density(PSD)로 변환하여 계산해야 한다
- PSD는 Inverse FFT를 사용하여 auto-correlation으로 표현할 수 있으며 Levinson-Durbin algorithms을 통해 쉽게 Predictor의 coefficients $a_k$를 구할 수 있다.
Cepstrum으로 부터 구해진 Predictor는 speech coding text와 같은 additional information 또는 synthesize가 필요하지 않는다.
LPC Analysis는 cepstrum이 low resolution하기 때문에 input signal 만큼 정확하지는 않지만, network가 이러한 부분까지 학습할 수 있기 때문에 출력에 미치는 영향이 적다.

- WaveNet을 이용하여 vocal tract를 모델링 하려고 한 GlotNet의 open-loop filtering approaches보다 장점이 있다고 하였다.

Linear Prediction을 사용하여 neural network에 영향을 주어, network가 sample values를 예측하는 것이 아닌 excitation (prediction residual)을 예측하도록 도울 수 있다.

- 이는 network가 더 쉽게 예측을 할 수 있으며, 일반적으로 excitation이 pre-emphasized signal보다 작은 amplitude를 가지기 때문에 $\mu$-law quantization noise를 감소시킬 수 있다.

즉, network 이전에 sampling된 excitation $e_{t-1}$뿐만 아니라 과거 신호 $s_{t-1}$, current prediction $p_t$를 input으로 받아 더 좋은 성능을 낼 수 있다.

Output Layer

Output Probabilities를 계산을 쉽게 하기 위해 two fully-connected layers을 경합한 dual_fc layers를 정의하였다. (element-wise weighted sum)
dual fully-connected layer는 필수적인 요소는 아니지만 동일한 복잡도의 fully-connected layer와 비교 했을 때, output의 quality를 향상시킨다는 것을 발견했습니다.

Sparse Matrices

모델의 complexity를 낮추기 위해 저자들은 Largest GRU를 위해 일반적인 GRU(element-by-element sparseness) 대신에 WaveRNN에서 사용한 Block-sparse matrices를 사용하였다.
Training은 dense matrices로 시작하지만, 원하는 sparseness가 달성될때까지 magnitude가 가장 낮은 block을 0으로 만들어 vectorlization을 쉽게 하도록 하였다. (16x1 blocks가 우수한 성능을 제공하였다.)

Embedding and Algebraic Simplifications

일반적으로 sample values를 network의 input으로 하기전에 고정된 범위로 scaling을 하지만 논문 저자들은 $\mu$-law values의 discrete한 특성을 통해 embedding matrix E를 학습하여 embedding을 하였다.
- Embedding은 각각의 $\mu$-law level을 vector에 매핑하여, $\mu$-law value에 적용될 non-linear function set을 학습한다. 그결과 embedding matrix를 시각화 하였을때, embedding matrix E $\mu$-law scale을 linear하게 변환하는것을 알게 되었다.
sample values를 embedding한 값들을 GRU의 input으로 넣지 않고 GRU의 non-recurrent weights $U^{(·)}$의 submatrices와 pre-computing 하여 complexity를 감소하였다 .

- $U^{(u,s)}$를 sample $s_{t-1}$의 입력 sample의 embedding에 적용되는 columns로 구성된 $U^{(u)}$의 submatrix라고 하면, sample $s_{t-1}$을 update gate의 non recurrent 항에 직접 매핑하는 새로운 embedding matrix $V^{(u,s)}$ = $U^{(u,s)}E$로 표현할 수 있다.

- 모든 게이트 $(u,r,h)$ 및 모든 입력 $(s,p,e)$에 대해 pre-computed된 $V^{(·,·)}$로 표현할 수 있다. $g^{(·,·)} = U^{(·)}f$

- frame conditioning vector $f$ 또한 input의 entire frame에 대해 일정하기 때문에 위와 같은 방법으로 단순화 할 수 있다.

$\rightarrow$ 이러한 단순화를 통해 GRU의 모든 non-recurrent inputs의 cost를 무시하여 complexity를 낮출 수 있다.
![](https://images.velog.io/images/ooooo_h/post/e73a2584-1508-46db-9bd1-62f987cf0058/image.png)

Sampling from Probability Distribution

Output Probability distribution $P(e_t)$에서 직접 sampling을 하게 되면 excessive noise가 발생할 수 있다. 이것은 FFTNET 제안한 방법으로 상수 $c=2$를 곱하여 해결할수 있지만, 논문 저자들은 이러한 binary voicing decision을 만드는 대신에 $c = 1 +max(0,1.5g_p-0.5)$ 를 제안하여 문제를 해결하였다. ($g_p$ : pitch correlation)
위의 방법을 통해 Probability distribution를 다음과 같이 정의하였다.

$\mathcal{R}(·)$ operator는 분포들을 다시 정규화
- 논문 저자들은 $T = 0.002$ 일때 impulse noise을 줄이는것과 speech가 자연스러움의 trade-off중 좋은 성능을 보인다고 하였다.

Training Noise Injection

Speech를 synthesizing 할때, Network는 생성된 샘플이 훈련 중 사용된 샘플과 다르므로 불완전하다.
이러한 mismatch는 synthesis에서 excessive distortion을 야기할 수 있어 mismatch에 robust하게 만들기 위해 noise를 input에 추가하여 학습을 진행했다.
Linear Prediction을 사용하면 noise-injection이 중요한데, signal에 noise를 추가하여 학습을 하면 noise가 synthesis filter 와 같은 형태를 갖는 pre-analysis-by-synthesis vocoder era와 유사한 artifacts를 생성하는것을 확인하였다.
noise를 추가함으로써 network가 signal domain의 error를 최소화하는 것을 효과적으로 학습하는것을 확인했다. $\rightarrow$ network의 output은 prediction residual 이지만 input중 하나는 residual을 계산하는데 사용된 것과 동일하게 예측하기 때문에, CELP에서 제시한 analysis-by-synthesis 효과와 유사하며, synthesized speech의 artifacts를 크게 감소 시킨다.
noise를 설정할 때, signal amplitude에 비례하도록 하기 위해 $\mu$-law domain에 직접 추가하였으며, $[-3,3]$ 범위의 uniform distribution으로 설정하였다.

EVALUATION

Complexity

LPCNet의 complexity는 two GRUs와 dual fully-connected layer로 이루어져있다.
$N_A,N_B$ : GRUs의 size, $d$ : density of the sparse GRU, $Q$ : number of $\mu$-law level, $F_s$ ; sampling rate
$N_A = 384, N_B=16,Q=256,F_S=16000$ 로 설정했을때 약 2.8GFLOPS의 complexity를 가짐( biases, conditioning network, activation-function, $...$는 약 0.5GFLOPS의 complexity)
그 결과 Apple A8(Iphone 6)의 single core로 real-time synthesis를 가능하게 함 (Iphone 6S정도 되야 잘된다고 하네요..)
WaveNet보다 작은 complexity를 가진다고 한 FFTNET의 경우 약 16GFLOPS, WaveRNN의 경우 10GFLOPS, SampleRNN의 경우 50GFLOPS complexity를 가진다고 한다

Experimental Setup

speaker-depedent or speaker-independent 중 더 어려웅 speaker-independent에서 평가를 하였다.
Training data는 NTT Multi-Lingual Speech Database for Telephonometry (21 languages)의 4시간 speech로 이루어져 있다.
120 - epochs, 64 - batch size, 각각의 sequence는 15 10-ms frames로 구성되어 있다.
AMSGrad optimization, step_size = $\alpha = \frac{\alpha_0}{1+\delta·b}$ , $\alpha_0$=0.001, $\delta$= 5x$10^{-5}$, $b$ = batch number
$GRU_A$의 size 192, 384, 640 non-zero density $d =0.1$ (GRUs의 size 61,122,203이랑 동일), $GRU_B$의 size 16으로 하여 LPCNet과 WaveRNN+와 비교하였다.

Quality Evaluation

subjective listening test인 MUSHRA를 통해 성능 비교를 하였다. (8 utterances , 2 male and 2 females speakers에 대해 100명에 의해 평가되었다.)
LPCNet이 WaveRNN+보다 좋은 Quality를 내는것을 확인했다. 이를 통해 pre-emphasis가 $\mu$-law quantization의 noise를 무시할 수 있고, 256-value distribution(8-bit)으로 계산할 수 있기 때문에 16-bit의 WaveRNN보다 complexity가 감소한다

Conclusion

LPCNet은 neural synthesis techniques와 traditional technique인 linear prediction을 결합하여 speaker-independent speech synthesis 성능이 좋은것을 확인했다.
WaveRNN과 LPC를 단순히 결합하는거 외에도 signal values embedding, pre-emphasis와 같은 다양한 방법이 synthesis에 영향을 준다는 것을 알게 되었다.

개인화, Multi-speaker

Tue, 09 Nov 2021 15:52:16 GMT

0. Speech Synthesis

음성 합성의 경우 주어진 Text를 Speech로 변환해 주는 System, Task

HMM (Hidden Markov Model)

HMM 방법은 통계적 SPSS(Statistical parametric speech synthesis) 음성 학습 방법으로 Two step Mapping으로 이루어져 있다.

Text input을 Decision tree를 이용하여 cluster mapping을 한다.
- Decision tree를 사용하면 input space가 sub cluster로 나뉘어 지기 때문에 성능저하, 분기를 기준으로 나누기 때문에 복잡한 context 정보를 표현하는데 문제가 있음
Cluster sequence를 이용하여 Feature mapping을 진행함
- HMM의 가정 : First-order Markov Assumption, Output Independence Assumption으로 인해 성능 저하

$\rightarrow$ 위의 문제를 해결하기 위해 Deep learning 사용(Cluster to feature mapping - DNN or Acoustic Model - DNN)

$\rightarrow$ 시간에 따라 변화하는 음성을 모델링 하기에는 DNN이 부족하여 RNN 사용 (Long-term dependency 문제가 있기 때문에 gate를 적용한 LSTM 사용, previous and future contexts도 고려해야 하기 때문에 Bidirectional LSTM)

Tacotron

SPSS 방법 같은 경우 다양한 모델로 이루어져 있음 이를 End-to-End Speech synthesis 제안

Input(Characters)이 들어오면 Encoder + Decoder 구조를 통해 Mel-spectrogram이 Output으로 나옴

CBHG : Convolution Bank + highway network + GRU
Encoder : Pre-net을 이용하여 모델의 convergence와 generalization을 위해 Bottleneck layer
Decoder : Align된 Attention을 사용하여 Mel-spectrogram을 생성 (여러 frame에 대한 spectrogram을 생성 후 마지막 frame만 다음 spectrogram을 예측하는데 input으로 줌 )
Post Processing : Mel-spectrogram을 Linear-scale spectrogram을 생성하기 위한 과정, Griffin-lim vocoder를 통해 음성 생성

Tacotron2

Stop Token : Tacotron은 max length를 지정해 두고 auto-regressive하게 학습이 됨(뒷 부분을 자르고 학습함)
WaveNet Vocoder : Griffin-Lim vocoder보다 좋은 성능

Transformer TTS

Tacotron의 경우 RNN을 많이 사용하여 오래걸림

RNN을 Transformer의 Multi-head Attention (Self-attention) 으로 바꿔 Encoder , Decoder를 Parallel하게 학습할 수 있다.
학습은 빨라졌지만, Inference 할 때 여전히 auto-regressive 함, 학습시 Teacher forcing 하기 때문에 exploration bias가 생긴다.

1. 개인화 (Multi-Speaker, Style Modeling)

사람들은 Emotion, Speech rate, Prosody 등 다양한 스타일을 통해 대화를 한다. $\rightarrow$ TTS의 경우 어떻게 스타일을 모델링하고, disentangle(분리) 하는지에 대한 다양한 연구가 진행되고 있다.

Personal TTS를 위해 Multi-speaker TTS 같은 모델이 필요하지만, Multi Data (DB)에 존재하는 경우 Personal TTS 성능이 좋지만 DB와 동떨어진 경우 성능이 좋지 않다.

1.1 D-vector based Multi-speaker

Tacotron2의 Encoder Output에 Speaker Embedding을 concat하여 Multi-speaker 모델을 학습하였다.
Speaker Embedding을 위해 D-vector를 사용하였다.

D-vector : 화자 검증(Speaker Verification)에서 많이 사용되는 방법으로, End-to-End 모델을 통해 Speaker의 utterance를 d-vector로 modeling하여 test utterance의 d-vector와 consine similarity를 구하여 accept/reject하는 task에서 사용되었다. $\rightarrow$ 위 논문에서는 Speaker의 utterance에서 general한 특성과 다른 unique한 특성을 식별하는 representation으로 사용하였다.
Speaker Encoding 부분과 Tacotron2 Encoder 부분을 따로 학습하는 방법으로 진행된다.

1.2 Global Style Token

위의 D-vector 자체가 Speech synthesis를 위한 방법이 아니기 때문에 성능 향상에 한계가 있다. 이를 해결하기 위해 reference encoder를 사용하여 reference speech로 부터 style을 추출하여 Tacotron에 적용시키는 방법이 발전 되었다.

Speech의 style을 GST 모델이 자체적으로 label에 대해 soft interpretable는 방법이다.
Target으로 하고 싶은 Style의 reference speech를 Input으로 주어 reference encoder를 통해 얻어진 style embedding에 대해 self-attention을 적용하여 각 token에 대한 style을 학습하게 됨

reference encoder : 2D-convolution layer * 6 + GRU Layer

Training

variable length audio가 reference encoder를 통해 fixed-length vector로 embedding (reference embedding)
reference embedding이 Attention Module인 style token layer를 입력 된다.
- Attention : reference embedding과 각 token 사이의 유사성을 학습 (GSTs)
각각의 token에 대해 weighted sum으로 표현된 GSTs (Style embedding) 을 text encoder의 condition으로 줌
Tacotron과 jointly training을 함, 따라서 GSTs에는 style or prosody labels가 필요하지 않음

Inference

특정한 Token에 condition을 줄 수 있다. (하지만 원하는 style을 적용하기 위해서는 Inference에서 reference speech를 적용하여 각각의 token이 어떠한 style에 영향을 주는지 확인해야 함)
Different audio signal을 통해 condition을 줄 수 있다.

한계 : GST Model의 style token에 대해 학습하는 것은 Training set에 대해서 학습을 하는 것이다. 즉, DB 내에서만 학습하기 때문에 유의미하거나 독특한 token에 대해 학습을 하지 못한다 (단순하게 구성된 DB의 경우 Emotion, Prosody에 대해 다양한 style을 학습하지 못한다.)

1.3 VAE-Tacotron2

Motivate : Movitation style을 latent variable을 통해 Modeling

위의 두 방법은 Style을 fixed-dimension으로 하여 modeling을 하였지만 VAE-Tacotron2는 latent variables을 통해 확률적(stochastic)으로 modeling을 함

reference speech를 fixed-length short latent variables로 representation한 VAE + End-to-End Model Tacotron2
reference speech를 위의 GST에서 제안한 reference encoder로 embedding한 후 two separate FC layers에 linear activation을 통해 latent variable z의 $\mu,\sigma$ generate하도록 구성 , z는 reparameterization trick 으로 구성

VAE의 latent variable z의 probability $p_{\theta}(z)\sim N(\mu,\sigma)$ 로 설정한다. 즉, 컨트롤 가능한 prior을 설정함으로써 likelihood $p_{\theta}(x|z)$를 알수 있으면 posterior $p_{\theta}(x,z)$ 또한 알 수 있다. (joint probability) 여기서 likelihood $p_{\theta}(x|z)$는 NN을 학습함으로써 구할 수 있다.
학습을 하기 위해서 VAE의 ELBO (Evidence Lower BOund)와 유사하게 log-likelihood를 사용한 reconstruction loss (= Tacotron2 loss), KLD regularization term을 사용한다.

reference audio 를 통해 style을 control,transfer 하거나 latent variable을 interpolating 하여 control을 해야 한다.

1.4 Gaussian mixture VAE (GMVAE)

VAE의 경우 데이터를 control하는 latent variable z의 probability $p_{\theta}(z)\sim N(\mu,\sigma)$, 단일 정규분포로 가정을 하고 학습을 진행한다. 음성에는 다양한 특징(gender, speaker, accent, speech rate, etc)들이 존재한다. 따라서 여러 Gaussian distribution을 혼합하여 데이터를 control 하는 latent variable의 probability를 구하자는 것이 Gaussian mixture model의 핵심

VAE-Tacotron2를 발전시켜 Hierarchical latent variable을 통해 Modeling함, VAE와 유사한 방식으로 ELBO를 통해 학습을 한다.
learned prior를 통해 systematic sampling mechanism을 확인한다. (style control)
latent variables을 통해 disentangled attribute representations을 학습한다.
training data에서 해석 가능한 cluster set을 발견 ex. one cluster for clean speech and another for noisy speech
2 separate latent spaces
- Labeled(observed) attributes : Categorical observed labels ($\mathbf{y}_o$)는 continuous attribute space를 categorization하는 것으로 생각할 수 있다. Observed class는 continuous space에서 mixture component($p(\mathbf{z}_o|\mathbf{y}_0)$)를 형성하며, diagonal-covariance Gaussian 분포로 나타낼 수 있다. ex) speaker label, gender label
- Unlabeled(latent) attributes : K-way categorical discrete variables $\mathbf{y}l$ 로 부터 D-dimensional continuous variable $\mathbf{z}_l$을 sampling 함 $p(\mathbf{z}{l}|\mathbf{y}_{l})$ ex)prosody elements(accent, speech rate, etc), noise

2. 보코더 (Vocoder)

2.1 Intro

음성신호는 아날로그 신호, 시간축으로 연속적으로 진행되는 신호 이다. 컴퓨터로 처리하기 위해서는 연속된 신호를 Discrete 하게 만들어 입력으로 받아야한다.

신호의 종류로는 시간축에 대해서 연속되는지에 대해 Continuous time signal, Discrete time signal로 나뉘며, Amplitude에 대해 Continuous 한지 Discrete한지로 나뉘어 진다.

음성신호를 디지털로 바꾸기 위해서는 Sampling과 Qunatization을 사용하는데, Sampling의 경우 time 축으로, Quantization은 amplitude 축으로 진행한다. (Nyquist sampling theorem)

Frequency 영역에서 연속된 시간 신호 해석하는 방법으로 주기 함수에는 Fourier series, 비주기 함수에는 Fourier transform을 사용하여 특정 Frequency의 크기를 표현할 수 있다. ( $-\infin \sim \infin$ 에 대해 )

Sampling한 신호의 경우 time과 amplitude가 discrete하기 때문에 위의 Fourier Transform을 특정 주기(N)의 영역으로 바꿔야 하기 때문에 DFT를 사용한다. $O(N^2)$의 복잡도를 가지고 있기 때문에 FFT를 사용함 $O(Nlog_2{N})$

시간에 따라 변하는 주파수를 분석하기 위해, 주파수를 의미 있는 음성 신호로 사용하기 위해 나누어 구간에 대해 Fourier transform을 processing 한다. (window function : Hamming, Hanning window) 이렇게 DFT, FFT, STFT를 진행하면 Spectogram이 나오게 된다.

사람의 귀는 주파수에 따라 민감도가 다르기 때문에, Spectogram을 사람의 귀의 특성에 맞게 고려하자 하여 Mel-scale을 통해 재조정한 Mel-spectogram을 음성 신호의 feature로 사용한다.

2.2 Vocoder

Audio를 주파수 영역에서 분석하기 위해 STFT를 통해 의미 있는 음성 신호로 사용할 수 있다. Magnitude 값을 이용하여 Mel-scale로 변환한 Mel-spectrogram을 사용하여 음성 feature로 사용하는데, 주파수의 magnitude와 phase 정보를 알고 있으면 SFTF를 Inverse하여 원본 Audio를 복원할 수 있다. 하지만 Mel-spectrogram을 이용한 TTS 모델의 경우 phase, magnitude 정보가 아닌 magnitude만 이용하여 phase 정보를 예측하고, 이를 통해 원본 Audio도 예측해야 한다. 이러한 역할을 하는 것이 Vocoder이다.

고전적인 음성합성의 경우 발음기관의 concept에 맞춰 모델링을 진행하였지만 최근 Deep learning을 사용한 End-to-End Model로 바뀌면서 Text에 해당하는 speech feature를 Waveform으로 변환하여 speech를 생성하는 방법으로 바뀌었다. 이때 Waveform generator가 mel-spectrogram을 input으로 받아 Waveform을 생성한다. (텍스트에 해당하는 음성신호는 phase 정보가 포함되어 있기 때문에 굉장히 큼 ex 5초 음성 신호는 120k 개의 sample로 이루어짐, Mel-spectrogram의 경우 phase 정보가 없기 때문에 실제 음성 신호보다는 작은 dimension을 가지기 때문에 효율적임)

하지만 phase 정보가 없는 Mel-spectrogram으로 waveform을 생성하기 때문에 완벽히 복원하기 어렵고, 다량의 학습데이터로 일반화를 하여 phase에 대한 예측을 한다.

이때 주로 사용하는 알고리즘이 Griffin-Lim, Neural Vocoder이다

Griffin-Lim

사진출처 : https://cvml.tistory.com/14

Griffin-Lim 알고리즘은 Mel-spectrogram으로 계산된 STFT의 magnitude 값을 통해 원본 음성을 예측하는 rule-based 알고리즘이다. STFT의 redundancy에 기반하여 phase를 reconstruction하는 방법이다. STFT phase 정보를 구하기 위해 임의의 값으로 두고, 예측된 음성의 STFT magnitude값과 Mel-spectrogram으로 계산된 STFT magnitude 값의 MSE가 최소가 되도록 반복 하여 원본 음성을 찾아는 방법이다.

Neural Vocoder

Mel-spectrogram을 입력으로 받아 waveform을 생성하는 딥러닝 모델, 다량의 데이터를 학습하여 mel-spectrogram과 음성 신호 사이의 phase를 예측하여 음성을 생성하는 방법이다. 학습 시, 1~2초 정도로 나누어진 음성 신호를 학습하여 메모리 사용의 부담을 줄일 수 있다 .

Autoregressive model : WaveNet vocoder
Flow-based model : WaveGlow, FloWaveNet, Parallel WaveNet, ClariNET
GAN-based model : ParallelWaveGAN, VocGAN

Scaling up GNN's

Fri, 29 Oct 2021 06:52:13 GMT

[작성자 : 권오현]

Intro

Real-World, 많은 Large Graphs가 존재한다. Recommendation Systems(ex. Amazon, Youtube, Pinterest, etc.) , Heterogeneous Graphs(MS Academic graph), Social Networks(ex. Facebook, Twitter, Instagram), Machine Learning Task (ex. Paper categorization, Author Collaboration, Paper citation recommendation), Knowledge Graph(ex. Wikidata, Freebase)

Motivate : How we do scale up GNN's to graphs of size (100M to 100B)?

기존 DL training을 생각해 보면 Loss를 통해 model parameters update
data가 크기 때문에 mini-batches를 통해 학습 (Node level prediction task - subsample a set of node N)
Graph Neural Networks에서 mini-batch는 independently sampling, 대부분 neighbors의 information을 aggregating하지 못함
Full-batch의 경우 모든 nodes의 embedding을 동시에 생성한다는 문제점이 있다. (graph는 각각의 layer에서 all node embedding을 계산해야 한다. 이 때 이전 layer의 all node embedding을 사용하여 다음 layer의 embedding을 계산하게 됨) -> recursive structure이기 때문에 memory가 상당히 요구됨
Full-batch의 경우 Time wise, GPU Memory limited로 인해 학습이 불가능 하다.

Scaling up GNNs을 학습하는 3가지 방법을 제안한다.
small subgraphs의 message-passing을 통한 방법, Simplified GNN

Neighbor Sampling

Main Idea : GraphSAGE의 방법을 통해 graph를 mini-batch로 학습할 수 있게 함

GNN은 neighborhood aggregation을 통해 node embedding을 생성한다. (Message Passing을 통해 Computation graph 생성)
Node 0의 embedding을 구하기 위해서 Graph Structure, Two-Hop neighborhood node feature가 필요함
K-layer GNN의 경우 K-hop neighborhood와 features를 통해 embedding을 generate(= K-hop neighborhood의 embedding을 알면 embedding을 generate)
Main Insight : Single Node의 embedding을 계산하기 위해 K-hop neighborhood가 필요하다. $\rightarrow$ Node Embedding이 아닌 K-hop neighborhoods 기반의 Computation graph를 통해 Mini-batch로 학습을 할 수 있다.

M개의 sample에 대해서 K-hop neighborhood를 통해 Gradient를 계산하여 학습
전체 K-hop neigborhood의 computation graph를 구해서 one node embedding을 구해야 하기 때문에 매우 expensive하다.
- Computation Graph는 Exponential하게 커진다.
- Natural graphs에서 hub-nodes(high-degree nodes, celebrity nodes)가 존재하기 때문에 expensive 하다

이를 해결하기 위해 neighbor sampling을 제안한다

모든 Node, layer에서 H random neighbors를 뽑는다 (computation graph는 exponential하게 증가하지만, Upper bounded by H)

Remarks on Neighbor sampling

Trade off in sampling Number H : H가 작으면 Neighbor aggregation에 효과적이지만 entire subpart of network를 고려하지 않아 variance가 커져 unstable learning하게 된다
Computation Time : Computational graph는 Exponential하게 증가하는데 H가 크지 않으면 K의 관점에서 깊게 학습하는게 아니다.
How to sample the nodes : Uniformly하게 H neighbors를 sampling 한다. $\rightarrow$ Real-world networks에서는 highly skewed degree distribution을 가지고 있기 때문에 Important neighbors가 sampling 되지 않고 Noisy하게 될 수 있다. 이를 해결하기 위해 Random walk를 적용하여 nodes를 sampling 한다.

Issues with Neighbor Sampling

Neighbor sampling, computational graph는 computational efficency 하지만 computational graphs는 여전히 크다. (GNN는 message-passing layers가 많이 있기 때문에)
Computation is redundant $\rightarrow$ 이를 해결하기 위해 Same components를 except(HAGs), Cluster-GCN

Cluster-GCN

Cluster-GCN은 Full-batch에서 Motivate되었다.
Full-batch training은 L-1의 모든 nodes embedding을 통해 L layer의 embedding을 계산한다. (각각의 layer에서 2K*# 의 message가 compute되어야 함, GNN's computation은 linear하지만, Graph가 매우 크기 때문에 한번에 계산 못한다.)
Full-batch의 경우 Layer-wise node embedding update를 하기 때문에 previous layer의 embedding을 re-use하게 해줌 (computational redundancy를 줄일 수 있다.)

Cluster-GCN Key Idea : 전체 graph를 small sub-parts로 나누어 subgraphs를 통해 full-batch implementation 수행

vs Neighbor sampling : Compute the computation graph for each individual node and entire graph -> Subgraph Sampling

GNN training에 좋은 Subgraph는 어떤 것인가 ? Original Graph의 edges를 가능한 많이 retain한 subgraph
Real world graph에는 community structure, clustering이 존재하기 때문에 community structure를 잘 보존하는 것이다.

Pre-processing : Original graph partitioning
- Louvain, METIS, BigCLAM을 통해 Community detection, edge drop
Mini-batch training : Sample one node group, perform full batch message passing over that entire subgraph, compute subgraph's node embedding and loss

GraphSAGE vs Cluster-GCN : neighborhood sampling( can be distributed across the entire network) vs Neighborhood sampling(just being limited to the subgraph we sample)

Problem

Induced subgraphs는 links를 drop하기 때문에 다른 group에서 message를 받지 못하는 문제점이 있다.
Graph community detection은 similar nodes를 group으로 묶기 때문에, sample node group은 전체 group에서 일부분만 고려한다. $\rightarrow$ Fluctuates a lot from a node group to another(= gradient high variance, slow convergence of SGD)
이를 해결하기 위해 Advanced Cluster-GCN이 제안됨

Advanced Cluster-GCN

Graph Partitioning을 더 작게 하여 mini-batch training에서 multiple node group을 aggregate한다.
Induced subgraph를 multiple node group으로 구성하여 결과적으로 more diverse하게 학습을 한다.

Training

Vanilla Cluster-GCN과 동일하게 학습

Comparision

Neighbor-sampling은 각 node 마다 H개 sampling 하기 때문에 computational graph의 cost는 $H^k$, M nodes에 대해서는 $MH^K$
Cluster-GCN은 M nodes에 대해 induced graph를 통해 message passing을 하기 때문에 $KMD_{avg}$의 cost를 가진다

Cluster-GCN이 Linear하게 증가하기 때문에 더욱 효율적이다. 하지만 layer K가 크지 않으면 neighbor-sampling이 더 많이 사용된다.
Cluster-GCN은 community간의 edges를 drop하기 때문에 biased gradient estimates를 구할 가능성이 있다

Simplifying GCN

GCN은 Node features를 input으로 하여 K+1 layer의 embedding을 K layer의 neighborhood의 embedding layer와 Trainable weight, activation function을 통해 구한다.
위의 식을 Matrix Form으로 정의할 수 있다 (Adjacency Matrix와 embedding Matrix의 product)
$\left |N(v) \right |$ = Diagonal Matrix(Neighborhood degree)
Simplify GCN은 이러한 식에서 non-linearity를 제거한 방법이다
$\widetilde{A}^k$ : connects the target node to its neighbors, neighbors of neighbors(= one-hop father out in the network as we increase)
$X$ : node features
$\widetilde{A}^k$, $X$는 learnable parameters가 아니기 때문에 pre-computed하여 더욱 efficiently (sequene of sparse matrix vector products)
pre-computed matrix Linear transformation으로 embedding을 구할 수 있음
node v의 embedding은 pre-computed된 feature에만 의존하게 된다.
$\widetilde{X}$ 의 row를 sampling하여 mini-batch training 가능
Simpified GCN은 Two steps 로 구성되어 있다.

Pre-processing Step : $\widetilde{A}^K$ 는 neighborhood의 degree, $X$는 node features
Mini-batch training step : $\widetilde{X}$의 row가 random sample

장점 : Embedding of every node computation of it's independent from the other nodes

Comparision with Other Methods

Neighbor sampling과 비교하여 Simplified GCN은 computatioal graph를 구할 필요가 없고 학습이 쉽다.
Cluster-GCN은 Multiple Group에서 sampling 하여 연산을 하였지만, Simplified GCN은 Entire of nodes에서 random sampling 하면 된다.
하지만 Original GNN과 비교하여 non-lineartiy를 제거 했기 때문에 expressive power는 제한된다.
그럼에도 불구하고 Real-world cases에서 잘 작동하고 Original GNN 보다 조금 안좋을 뿐이다.
Graph에는 Homophily가 존재하기 때문이다.
Pre-processed feature는 neighboring node features를 iterative하게 averaging 하기 때문에 Node Classification에서 Simplified GCN이 작동이 잘된다.
Pre-processed feature는 neighboring node features를 iterative하게 averaging 하기 때문에 잘 작동한다.

13 : Community Detection in Networks

Fri, 15 Oct 2021 08:23:12 GMT

1. Community detection in network

Community Detection : Graph에서 nodes를 clustering 하는 것이다.
네트워크를 messy objects가 아닌 clustering, organization의 종류로 생각할 수 있음 (Following structure)
Main Idea : 서로 간에 밀접하게 연결된 노드들의 집합, Cluster를 찾는 것이 목적

1.1 Granovetter's Answer

Community에 대하여 social science and classical social networks analysis 관점에서 생각해 보자

Information flow
- Short link : the ones that kind of are a very kind of local (strong friendships)
- Long link : our acquaintances and kind of colleagues and people we meet less often

Mark Granovetter는 '사람들이 새로운 직장을 어떻게 찾는지'에 대한 유형을 분석함 $\rightarrow$ 사람들은 새로운 직장을 구할 때 친한 친구 보다 지인(Acquaintances)으로 부터 소개 받는 경우가 많다.

Granovetter's은 'Friendship'을 두가지 관점으로 바라볼 수 있다고 함(즉, link에는 두가지 관점이 있다.)

Structural perspective : 링크가 네트워크의 어떤 부분을 연결하는가 ?
Interpersonal perspective : 링크가 Strong or Weak ?

1.2 Granovetter's Explanation

Structural role - Traidic Closure = High clustering coefficient

a-c 보다 a-b가 더 연결될 가능성이 높은 edge이다. (두 명의 common nodes = High clustering coefficient)

Teenage girls with low clustering coefficient are more likely to contemplate suicide

"Edge (Link)를 두가지 관저에서 바라볼 수 있다." - Structural role

First point : Structure
- 구조적으로 강하게 연결된(tightly-connected) edges는 Socially strong
- 네트워크의 다른 부분들에 걸쳐있는 edges(Long-range Edges)는 Socially weak

Second point : Information
- 구조적으로 연결된 edges는 Information access관점에서 매우 중복(redundant)
- 다른 network와 연결된 edges(Long-range edges)는 더 많은 정보를 줄것이다. (other community에 접근할 수 있기 때문에)

1.3 Edge overlap

Grannovetter's theory는 1960년대에 제안되어 증명이 되지 않았지만 Onnela et al. 2007에서 Cell-phone networks를 통해 증명 되었다.

데이터 : EU 소속 국가 인구의 20% 휴대폰 네트워크 데이터
Edge Weight는 통화 횟수
Edge overlap : $O_{ij}$ - $i,j$의 node의 neigbors 중 얼마나 많은 지인(nodes)를 공유(overlap)하고 있는가에 대한 정보
결과적으로 통화 횟수(Inter-personal strength of friendship)가 많을수록 $O_{ij}$ 값이 커졌다. $\rightarrow$ Strong Edges를 가진 Communities가 존재한다.

2. Network Communities

Granovetter's theory에 의하면 Network는 tightly connected set of nodes !
Network communities : Lots of internal connections and few external ones

(communities = clusters = groups = modules)

Networks에서 어떻게 densely connected groups of nodes(= tightly connected nodes)를 찾을 수 있을까 ?

$\rightarrow$ Modularity, Modularity를 Maximize하는 Communities를 찾으면 됨

Communities : set of tightly connected nodes
Modularity Q : Network가 communities로 얼마나 잘 나누었는지(partitioning)에 대한 measure
Partitioned된 Disjoint Communities S에 대하여 Modularity 계산

Q : "How many edges are within the members of the Group S - How many edges would I expect between this group S in random null model" -> Null Model을 구하기 위해 subgraph mining을 배움 ex)Erdos-Renyi, configuration model

Configuration Model :
- Community가 잘 Partitioned 되어있는지를 평가하기 위한 rewired network(비교 network)
- N개의 nodes와 M개의 edges가 주어진 Real Graph G를 통해 생성된 random network G'
- Same degree distribution을 가지지만 uniformly random connections, multi graph
- Expected number of edges
  
  $k_{i}$ : link to itself or whatever else $k_j/2m$ : 노드 $j$와 연결될 확률(i->j, j->i 이기 때문에)
- Expected number of edges in G' : All paires이기 때문에 2로 나누어줌

Modularity 계산을 다시 나타내면 다음과 같이 됨
Modularity q를 [-1,1]로 맞춰주기 위해 Normalizing constant
일반적으로 0.3 ~ 0.7 이면 Significant Community Structure

3. Louvain Algorithm

Community detection을 위한 Greedy Algorithm ($O(n log{n}$) run time, n = number of nodes)
Very scalable and popular algorithm

"It's de facto thing you would use if you want to partition the network into a set of clusters"
Scale to large networks, weighted graph and hierarchical communites 등 다양한 graph에서 사용 가능 ( one level에서의 clustering 뿐만 아니라 cluster의 kind of clustering도 제공한다)
Fast implementation available, quickly and well in practice communities with high modularity
Two phases 연산을 한다.
- Phase 1 :
  - Graph의 nodes를 distinct community로 분류한다.(node마다 각각의 cluster, community를 가지게 된다.)
  - Node i를 Neighbor node j의 community와 합쳤을 경우의 modularity $\Delta Q$를 계산한다.
  - Neighbor node j의 communities들 중 가장 큰 $\Delta Q$를 발생시키는 community로 Node i를 이동
  - 위의 결과의 변화가 없을때 까지 반복
  $\Delta Q$를 구하는 방법은 다음과 같다.
  - $\Delta Q(i \rightarrow C$)를 구하기 위해 우선 C의 modularity $Q(C)$를 구해야 한다.
  - $\Sigma_{in}$ : Inside which is sum of the links, number of links or some of the link weights between the members of C
  - $\Sigma_{tot}$ : Ttotal number of links that all these nodes have
  - Modularity를 $\Sigma_{in}$, $\Sigma_{tot}$ 로 정리해 보면 다음과 같이 정리 된다.
  - $Q(C)$가 total links의 대부분이 within-community links일 때 가장 크다.

 $\Delta Q(i \rightarrow C$)를 구하는 것도 간단하게 구할 수 있다.
 ![](https://images.velog.io/images/ooooo_h/post/e3d7b9cc-28b5-4d7c-8c22-79e125230dd6/image.png)

  - $k_{i,in}$ : the number of edges node i have to other members of c
  - $k_i$ : Total degree of node i

 ![](https://images.velog.io/images/ooooo_h/post/9339a068-22a1-4479-b51c-c93af70eb375/image.png)
  - $Q_{before}$ : $Q(C)$ + isolated community i, isolated community는 number of edges inside가 0
  - $Q_{after}$ : number of edges inside $\Sigma_{in}$ increase by $k_{i,in}$

 ![](https://images.velog.io/images/ooooo_h/post/f4dfe134-6c67-47d2-ae71-96c9ea42ed7e/image.png)

  - $\Delta Q(D \rightarrow i \rightarrow C$) 도 다음과 같이 구할 수 있다.

  - Nodes의 community가 움직이지 않을 때 까지 $C \rightarrow i \rightarrow C'$ 가 greedy way로 반복된다.






- Phase 2(Restructuring) : 
   - Phase 1에서 찾은 communities를 single Super-nodes로 Aggregate
   - Super-nodes는 communites를 구성하는 nodes 간에 적어도 하나의 edge가 존재하면 connected
   - 두 Super-node 간의 edge 가중치는 커뮤니티간 모든 edge 가중치들의 합 
   - Super graph가 구해지면 phase 1을 다시 수행 (한 개의 Community를 찾을 때까지 반복)
   ![](https://images.velog.io/images/ooooo_h/post/6b8bd79f-5095-4b11-ab68-c5aee8eb2359/image.png)
   - 위의 graph는 2개의 community, 4개의 community, 즉 hierarchical structure을 가질 수 있다.

4. Detecting Overlapping Communities : BigCLAM

앞에서 Network structure에 대해 assumption을 정의하였다.
Cluster : Each one has a lot of connection and then a few connection across
하지만 실제 세계에서 사람들은 multiple social communites를 가지고 있다. (overlap social structure)

이러한 Overlapping community structure는 어떻게 Detect할 수 있을까 ? BigCLAM

4.1 BigCLAM

Two Step
- Step 1 :
  - Node community affiliation에 기반하여 Generative model을 정의한다.
    - Affiliation Graph Model (AGM))

- Step 2 : 
    - Graph G가 AGM을 통해 만들어 졌다는 가정
    - Optimization problem을 정의하여 best AGM을 찾는다.

다음과 같은 형태로 Community Affiliation을 정의한다. 이를 통해 Edge of network를 생성하는 Model을 설정한다.
Generative Model : Community affiliation으로부터 어떻게 네트워크가 generate 되는가 ?

Model Parameters :

$V$ : nodes
$C$ : Communities
$M$ : Memberships
$P_c$ : single parameters, Community C의 노드들이 서로 연결되어 있을 Likelihood

$p(u,v)$ : u,v가 서로 연결되어 있을 확률
공통된 그룹의 수가 많을수록 연결될 가능성이 커진다.

이러한 방식의 AGM은 다양한 Community structures를 만들 수 있다. (Non-overlap, overlap, hierarchical)

4.2 AGM

위의 내용은 Affiliation으로 Network를 만드는 방법에 대해 생각하였지만 Network,Graph 만으로 Model을 구할수는 없을까 ?
Real-world graph는 AGM에 의해 생성되었다고 생각할 수 있다. 주어진 Graph를 가장 잘 생성하는 Model F도 찾을 수 잇을 것이다.
Maximum Likelihood Estimation(MLE)를 통해 구할 수 있다.
Model F를 통해 생성된 synthetic graph를 real graph와 비슷하도록 생성하고 싶다 !
Model Parameter를 통해 Graph Probability를 측정할 수 있으면, 이를 argmax하는 F를 찾으면 된다.
Graph G와 model F가 주어 졌을 때, $P(G|F)$의 Likelihood를 구할 수 있다.
adjacency matrix를 기반으로 Graph G에 대한 Probability를 assign한다.
Edge가 연결될 확률을 잘 맞추고 Edge가 연결 안될 확률을 잘 맞춰야 함

4.3 "Relaxing" AGM

위의 모델은 0,1을 사용하여 노드들이 given community의 member인지 아닌지 따졌지만, 실제로는 overlaping communities기 때문에 모든 node community membership의 strength를 구해야 한다.

앞에서 각각의 communities가 parameters를 가지고 있고 여기에 대한 membership의 strength를 assign 하였다.
$F_{uA}$ : community A에 대한 node u의 strength

Community C에 대해서 node u와 v가 서로 연결되어 있을 probability $P_c(u,v)$
Community C에 대한 node u의 strength $F_{uC}$, node v의 strength $F_{uV}$의 곱이 0 이면 $P_c(u,v)$, 크면 $P_c(u,v)$ 는 1에 가까워짐
node u와v가 연결되어 있을 확률은 shared memberships의 strength에 비례
이러한 node u,v 는 여러 공통 communities를 가질 수 있는데 위의 식을 그대로 사용 할 수 있다.
다음과 같이 전개될 수 있는데, 결국 $F_{uC}$의 벡터들과 $F_{vC}$의 벡터들의 dot product가 된다.
위의 결과를 통해 community membership의 vector representation을 기반으로 pair of nodes의 probability를 정의할 수 있다.
결론적으로 $P(G|F)$의 likelihood 또한 다음과 같이 정의될 수 있다.
이를 Log-likelihood로 변환을 한다. ( log-likelihood를 optimizing하면 보다 더 numerical stability, Monotonic Transformation이기 때문에 동일한 position에서 maximize하는 것을 의미한다.)
Objective function을 optimize하면 최적의 Model F를 찾을 수 있다.

Log-likeihood를 increase하기 위해 gradient ascent를 통해 optimize 한다.
아래 식을 살펴보면 node u의 neighbors가 아닌 node v는 매우 많다.(Graph는 크고 연결된 neighbors가 별로 없기 때문에 전체 network에 대해 summation하는 것과 동일하다.)
Gradient ascent는 느리지만 이를 다시 살펴 보면 다음과 같이 되고, 모든 nodes에 대해서 actually neighbors of node u만 빼주면 된다. (sum up over all the nodes를 저장한 후 나머지만 update하면 된다.)

Reference

CS224W 투빅스12-13 GNN 스터디 데이터괴짜 BLOG

[Paper Review] (2016, Chunqiu Zeng) Online Context-Aware Recommendation with Time Varying Multi-Armed Bandit

Wed, 02 Jun 2021 01:54:08 GMT

작성자 : 권오현

Abstract

Contextual Information을 활용하여 Online Personalized recommendation Servies (e.g., 온라인 광고, 뉴스 추천)를 제공하는 방법에 대하여 Contextual Multi-armed banit에 대한 관심이 증가하고 있다.
기존의 Contextual Multi-armed bandit은 특정한 context에 대해 arm의 reward를 예측하기 위해 보상 함수의 존재를 가정하는 경우가 많다.
하지만 Real-world 에서는 시간이 지남에 User의 선호도가 동적으로 변하는 경우가 많기 때문에 위와 같은 가정은 실제로 적용되는 경우가 거의 없다.
논문 저자는 시간에 따라 변화하는 보상 함수에 대하여 Time Varying Contextual Multi armed problem을 연구하였고 특히, Particle learning을 기반으로 한 dynamical context drift model을 제안하였다.
제안된 모델은 random walk particles set으로 모델링 되며, fitted particle이 reward function을 학습하도록 한다.
Particle learning을 통해 모델이 context의 변화를 포착하는데 효과적이며 context의 latent parameters를 학습할 수 있다.
온라인 광고와 뉴스 추천에 대하여 제안한 방법론의 효과와 모델이 Time varying reward에 대해서 dynamically tracking 하여 결과적으로 CTR을 향상시킬 수 있다는 것을 보여준다.

[참고]

Multi-armed Bandit problem

Bandits

MAB는 User가 어느 슬롯머신의 손잡이를 당겨야 가장 높은 수익을 올릴 수 있을지를 결정하는 알고리즘이다.
MAB에서 가장 중요한 개념은 Exploration(탐색) & Exploitaion(활용)이다.
- 만약 우리가 꽤 좋은 결과를 돌려주는 기계를 찾아냈다면 좋은 결과를 유지하기 위해 그 기계를 계속 사용할 것인지(exploitation) 아니면 더 좋은 결과를 얻을 거라는 희망으로 다른 기계를 찾아볼 것인가(exploration)?
  - Exploration은 다양한 정보를 수집하기 위해 불확실성이 높은 다양한 광고 콘텐츠를 보여주어 사용자들의 성향을 탐색한다. Exploitation은 사용자들이 많이 소비할 것 같은 광고, 인기 콘텐츠만 제공한다.
1. Introduction
Online Personalized recommender systems은 User와 Item의 현재 정보에 따라 고객에게 적절한 feed(e.g., advertisements, news article)를 주어 고객의 만족을 극대화하려고 한다.
이러한 목표 달성을 위해 추천시스템에서 고객의 preferences에 대하여 instantly tracking 과 large item repository에서 흥미로운 아이템을 추천해주는 것이 중요하다.
Consumer preference와 target items에 대해 적절히 매칭 시키는 것은 어렵다.
1. Cold Start Problem
2. 유행의 변화와 고객의 선호도가 dynamically changing
context-based Exploration-Exploitation tradoff dilemma

정보를 얻어 가장 좋은 승률을 가진 슬롯머신을 선택해야 하기 때문에 두 과정이 반드시 필요하며, 한 가지 과정을 완전히 배제해서는 원하는 결과를 얻을 수 없다.
Contextual multi-armed bandit에 대한 여러 알고리즘이 연구됨 ($\epsilon-greedy$, LinUCB, Thompson Sampling) : 이러한 알고리즘들은 동일한 contextual information에서의 reward는 변하지 않는다고 가정한다. 즉, 시간에 따라 변화는 contextual information의 latent factor를 고려하지 않기 때문이다.

동일한 contextual information에 대하여 5개 뉴스의 CTR을 조사하였다. Multi-armed bandit에서는 처음에 많이 선택된 article_1, article_2가 시간이 지나도 CTR이 높을 것이지만 시간이 지날 수록 article_1, article_2에 대한 CTR은 낮아지고 article_3, article_4, article_5에 대한 CTR이 높아지는 것을 확인할 수 있다.

즉, 동일한 contextual information이라 할지라도 시간마다 CTR에 대한 영향이 다르다

contextual multi-armed bandit problems의 Time varying reward를 포착 하기 위해 논문 저자들은 Particle Learning에 기반한 Dynamical context drift model을 제안하였고, 효과적인 Onlne 추론 모델을 개발하였다.
Dynamical Reward에 대하여 random walk particles를 통해 모델링 하였고, Particle learning을 통해 context change와 latent parameters를 학습할 수 있다.

논문에서 Reward의 dynamic한 상황을 해결하기 위한 Contextual Multi-armed bandit을 이용한 context drift model을 제안한다. Latent parameter를 학습하고 Latent States를 추론하기 위해 Sequential Online Method를 개발 하였다.

2.1 Contextual Multi-armed Bandit

Multi-armed bandit은 다양한 분야에서 사용된다. (e.g., online advertising, web content optimizing, robotic routing)
Multi-armed bandit의 가장 중요한 목적은 Exploration-Exploitation trade off 의 균형을 맞추는 것이다. 이러한 문제를 해결하기 위해 다양한 algorithms 제안됨 ($\epsilon$-greedy, UCB, EXP3, Thompson sampling)
Contextual Multi-armed Bandit은 Contextual information을 통해 arm을 선택한다.
Multi-armed bandit algorithms는 Context information을 통합하도록 확장되었다.
- $\epsilon-greedy$ : $1-\epsilon$의 확률을 기반으로 Arm을 선택하고 $\epsilon$ 의 확률을 기반으로 random 하게 Arm을 선택한다. 즉, 현재 상황에 대하여 Best arm 선택(Exploitation), 충분한 탐색(exploration)을 위해 $\epsilon$을 선택한다
- LinUCB, LogUCB : UCB algorithms, 탐색했던 arm으로 부터 얼마나 많은 reward 받았는지 뿐만 아니라 얼마나 확실한지도 함께 따져서, 덜 확실한 쪽에 더 많은 탐색을 하는 방법 , LinUCB는 arm의 reward와 context간에 coefficent vector를 이용하여 Linear Expected Reward를 계산하였다. LogUCB는 Logistic-regression으로 설명
- Thompson Sampling : earliest heuristics, 각각의 arm에 대한 reward distribution의 parameters를 확률 변수로 보고 이 parameter distribution으로 부터 무작위로 추출하여 해당 값에 대한 reward의 기댓값이 가장 높은 arm을 선택하고, arm에 대한 parameter distribution을 업데이트함 $\rightarrow$ Bandit을 당길 경우를 $\theta_k$ 로 보면, $\theta_k$의 추정값에 대한 prior distribution을 $Beta$분포로 가정하면 conjugacy한 특성으로 인해 posterior distribution을 구할 수 있다.
- 이외의 최신 연구들은 bootstrap sample와 같이 principled sampling approach하의 parameter-free algorithms을 다루고 있다.
하지만 이러한 알고리즘은 동일한 Contextual information하에서 Reward가 변하지 않는다고 가정하고 arm의 expected reward를 예측한다.

2.2 Sequential Online Inference

Real-world 에서도 효과적인 모델을 위해 저자들은 Sequential Online inference를 사용하여 context의 latent state를 추론하고, model paramters를 학습한다.
- Sequential Monte Carlo Method (SMC) : filtering problem을 해결하기 위해 제안된 Monte Carlo Method의 set. SMC는 posterior distribution을 계산하는 simulation method를 제공한다. 이러한 방법을 통해 general state space models에서 non-linear and non-Gaussian 일 수도 있는 posterior distribution을 구할 수 있다. (현실 세계의 정보를 반영하는 posterior 구할 수 있다.)
  
  Filtering problem : 해당 시스템에 대한 불완전하고 잡음이 많은 관측 세트에서 일부 시스템의 실제 값에 대한 "최상의 추정치"를 설정하는 것이다. Monte Carlo Method : Monte-Carlo method는 무작위 추출된 난수(Random Number)를 이용하여 함수의 값을 계산하는 통계학적 방법으로서, 수치적분이나 최적화 등에 널리 쓰인다.
- Particle Learning : Particle Learning은 general state space models에 대한 sequential parameter learning 및 smoothing인 state filtering을 제안한다. Particle은 광범위한 state space models에 대한 parameter의 불확실성에 비추어 sequential filtering과 smoothing distribution을 통해 근사화 한다. Particle learning의 주요 idea는 resample-propagate framework에서 fixed parameters에 대한 조건부 통계와 particle의 근사를 통해 states에 대한 joint posterior distribution을 직접 샘플링하는 Particle algorithms을 만드는 것이다.
논문 저자는 Real-world에 대한 latent state inference와 parameter learning을 위해 particle learning 아이디어를 적용하였다.

3. Problem Formulation

저자는 Contextual Multi-armed Bandit Problem을 정의하고 Time varying contextual multi-armed badit problem을 정의하였다.

3.1 Basic Concepts and Terminologies

$\mathcal{A}$ = {$a^{(1)}$,$a^{(2)}$, $a^{(3)}$,$\cdots$ ,$a^{(k)}$} , $k$ = The number of arms $x_{t} \in \mathcal{X}$, $x_t$는 t시점의 contextual information을 의미하고 $\mathcal{X}$는 d차원의 feature space
Contextual Multi-armed bandit은 유한한 시간 범위 T에 대한 결정
Policy function $\pi$는 $t \in [1, T]$의 t에 대한 arm을 결정($\pi(x_t)$)한다. arm을 pulling 한 후 policy function에 대한 reward를 받는다.
t시점의 arm $a^{(k)}$의 reward는 $r_{k,t}$로 표현된다. $r_{k,t}$는 arm $a^{(k)}$에 제시된 context $x_{t}$에 의해 결정되는 미지의 분포로 부터 도출된다. 또한 $a^{(k)}$가 pulling 않으면 reward를 구할 수 없다.
Total Reward $R_{\pi} = \sum_{t=1}^{T}r_{\pi(x_t)}$ , policy function $\pi$ 가 total reward인 $R_{\pi}$를 최대화 하는것이 목적이다.
t시점의 arm을 선택하기 전에 policy function $\pi$는 일반적으로 과거 관측치인 $S_{\pi,t-1} = \left {(x_i,\pi(x_i),r_{\pi(x_t)})|1\leq i < t \right }$에 따라 모든 arm에 대한 보상을 예측하는 모델을 학습한다. Reward Prediction은 policy function $\pi$의 결정이 total reward를 증가시키는 것을 도와준다.
$y_{k,t}$는 arm $a^{(k)}$의 predicted reward, $y_{k,t}$$=f_{a^{(k)}}(x_t)$ 로 표현된다. 즉, contextual information인 $x_{t}$가 reward function $f_{a^{(k)}}$의 input으로 들어간다.

$f_{a^{(k)}}$의 방법중 하나는 feature vector의 linear combination이다. $ex)$ $f_{a^{(k)}}(x_t)$ = $X_t^{T}W_k + \varepsilon_k$ , $X_t^{T}$는 T 시점의 contextual information이고 $W_K$는 d-dimension의 coefficient vector, $\varepsilon_k \sim \mathcal{N}(0, \sigma^2_k)$, $y_{k,t} \sim \mathcal{N}(X_t^{T}W_k,\sigma^2_k)$

이러한 상황에서 Multi-armed bandit은 다음과 같이 표현된다. $X_t$는 t시점의 contextual information, predicted reward인 $y_{k,t}$ 는 random variable $X_t, w_k, \sigma^2_k$에 의해 결정된다.
random variable인 $w_k, \sigma^2_k$ 의 conjugate prior distribution을 $\mathcal{NIG}$(i.e., Normal Inverse Gamma) distribution으로 가정한다. $\mathcal{NIG}(\mu_w,\sum_w,\alpha,\beta)$, 여기서 parameter들은 predefined함
policy function $\pi$는 reward prediction model에 따라 arm $a^{(k)}$중 하나를 선택한다. t 시점의 arm $a^{(k)}$를 pulling하면, $r_{k,t}$가 구해진다. 이 후 새로운 sequence $(X_t, \pi(X_t), r_{\pi(X_t)})$ 가 구해져 $S_{\pi, t-1}(X_t, \pi(x_t), r_{k,t})$에 업데이터 되어 $w_k, \sigma^2_k$의 posterior distribution $\mathcal{NIG}$ distribution이 된다. 즉 , t시점의 $w_k, \sigma^2_k$의 posterior distribution이 t+1 시점의 prior distribtuion이 된다.

Posterior Distribution의 parameter update 과정 :
- Thompson sampling : each arm $a^{(k)}$에 대하여 $w_k, \sigma^2_k$ 를 $\mathcal{NIG}$로 부터 sampling한 후 , $a^{} = argmax_{1 \leq k \leq K}\left {X_t^TW_k \right }$인, $a^{}$를 선택하고 $r_{a^{*},t}$를 통해 posterior distribution을 update한다.
- LinUCB : $LinUCB(\lambda)$을 통해 score가 가장큰 arm을 선택한다.
이러한 모델의 방법론을 dynamic context drift model에 적용시켰다.

3.2 Dynamic Context Drift Modeling

arm $a^{(k)}$의 predicted reward는 contextual feature of $X_{t}$와 $W_k$의 linear combination으로 구해진다. 즉 $W_k$의 가중치들은 reward prediction을 하는데 영향을 준다.
기존의 방법론은 $W_k$ 가 distribution에 의해 정해진 가중치라고 생각하지만, Real-World에서는 distribution에 대해 알 수없기 때문에 context feature of $X_t$의 predicted reward는 dynamical하게 된다.
이에 저자는 dynamic 상황을 고려하여 시간에 따라 변화하는 reward에 대하여 $W_k$ 정보를 포착하기 위해 $W_{k,t}$를 t시점의 arm $a^{(k)}$의 가중치 벡터로 정의한다.
$\delta_{W_{k,t}}$, 시간에 따라 변화하는 drift weight는 Contextual information의 다양한 특성, context feature의 다양한 scale 때문에 모델링하기 어렵다 (앞에서 설명한 동일한 context상에서의 CTR변화)
이에 저자는 추론을 간단화하기 위해 각각의 $\delta_{W_{k,t}}$를 독립적으로 변화한다고 가정한다.
불특정한 변환으로 인해 $\delta_{k,t}$를 scale variable $\theta_k$와 Gaussian random walk $\eta_{k,t}$의 element wise product로 표현한다.

이러한 모델은 실생활에서 변하는 reward를 고려하여 모델링하였다.
새로운 모델에서 $\text c_{\text w_{k}}$ 의 값들은 reward를 prediction하는 feature weight이고 $\theta_k$는 scale of context drifting을 의미한다. $\theta_k$의 값이 클수록 시간이 지남에 따라 feature에서 발생하는 context drifting을 의미한다.

4. Methodology and solution

Posterior distribution inference는 4개의 random한 변수들에 의해 정해진다. ($\sigma^2_k, \text c_{\text w_k}, \theta_k, \eta_{k,t}$)
이러한 4개의 변수들은 parameter random variable와 latent state random variable로 분류된다.
$\sigma^2_k, \text c_{\text w_k}, \theta_k$는 고정된 parameter random variables고, 시간에 의존적이지 않다.
$\eta_{k,t}$는 latent state random variable이고 시간에 의존적이다.
t시점의 context $\text x_t$에 의해 arm $a^{(k)}$의 reward는 $r_{k,t}$로 구해지고 이러한 $X_t, r_{k,t}$는 random variables를 통해 확인될 수 있다.
모델의 목표는 latent parameters variables와 latent state random variables를 sequential observed data를 통해 추론하는 것이다. 하지만 latent state radnom variable은 random walk 방법에 의존하기 때문에, 저자는 latent parameters, latent state random의 distribution을 학습하기 위해 Sequential Monte Carlo sampling과 Particle learning을 이용하여 random walk 방법을 사용하였다.

Standard gaussian random walk를 통한 state $\eta_{k,t-1}$은 시간에 따라 변화하기 때문에 이전 정보들을 통해 구해진 정보를 parameter로 하는 분포에서 뽑아 사용된다. $\eta_{k,t-1} \sim \mathcal N(\mu_{\eta_k},\sum_{\eta_k})$

4.1 Re-sample Particles with Weights

t-1의 각각의 arm $a^{(k)}$는 고정된 particle size를 가지고 있다. Particle set을 $\mathcal P_{k,t-1}$, Particle set의 개수를 $p$로 생각하면 $\mathcal P^{(i)}_{k,t-1}$은 t-1시점의 arm $a^{(k)}$의 $i^{th}$ particle이 된다.
각각의 particle $\mathcal P^{(i)}_{k,t-1}$는 가중치가 있으며 이를 $\rho^{(i)}$로 표현하고 이는 t시점의 새로운 관측 데이터에 대한 적합도를 나타낸다. 즉, $\rho^{(i)}$는 이전 정보들을 통해 $\text x_t$와 t시점의 arm $a^{(k)}$의 reward의 likelihood로 정의된다.
$y_{k,t}$ predicted value의 distribution은 $\sigma^2_k, \text c_{\text w_k}, \theta_k, \eta_{k,t}$ 를 통해 추정되기 때문에 $y_{k,t}= r_{k,t}$하에서 다음과 같이 $\rho^{(i)}$를 구할 수 있다.
parameter가 업데이트 되기 전에 resampling먼저 수행된다. weights of particles에 의해$\mathcal P_{k}$에서 generate된 $\mathcal P_{k}'$ 를 통해 sequential parameter가 업데이트 된다.
이러한 방식으로 weights of particles의 기반으로 보다 정확한 sampling이 가능해진다.

4.2 Latent State Inference

t-1시점에서의 $\eta_{k,t-1}$의 충분 통계량은 $\mu_{\eta_k}$와 covariance $\sum_{\eta_k}$ 로 나타난다.
t시점에서 새로운 $\text x_t$와 $r_{k,t}$이 발생되면 state $\eta_{k,t}$의 충분통계량은 다시 계산되어야 한다. 이때 kalman filtering method를 이용하여 재귀적으로 $\eta_{k,t}$의 충분 통계량을 update한다.

Kalman Filter : 기존에 인지하고 있던 과거 측정데이터와 새로운 측정데이터를 사용하여 데이터에 포함된 노이즈를 제서키셔 새로운 결과를 추정하는데 사용하는 알고리즘으로 선형적 움직임을 갖는 대상을 재귀적으로 동작시킨다.
이를 통해 $\eta_{k,t}$는 다음과 같은 분포에서 추출된다.

4.3 Parameter Inference

t-1시점의 parameter random variable( $\sigma^2_k, \text c_{\text w_k}, \theta_k$)의 충분 통계량은($\alpha, \beta, \mu_c, \sum_c, \mu_{\theta}, \sum_{\theta}$)이다.
$\text z_t, \mu, \nu$는 2d-dimensional vector이고 $\sum$은 2d x 2d - dimesional matrix 그러므로 $\text c_{\text w_k}$, $\theta_k$를 추론하는 것은 $\nu_k$를 추론하는 것과 같다.

4.4 Intergration with Policies

3.1에서 LinUCB와 Tomson sampling은 $\text w_k, \sigma^2_k$가 posterior distribution에 의해 samplig된다.
t 시점의 $\text x_t$에 대해서 arm $a^{(k)}$ 가 pulling되지 않았기 때문에 Context drifting model은 reward $r_{k,t}$를 모른다. 즉, t시점의 k번째 arm에 대한 reward가 없으면 t시점의 particle re-sampling, latent state inference, parameter inference가 samppling되지 않는다.
또한 모든 arm은 독립적인 p개의 particle을 가지고 있어서 각각의 particle에서 $\text w_{k,t-1}$의 posterior distribution을 이용할 수 없다. 왜냐면 현재까지 아는 정보는 $\text w_{k,t-2}$의 posterior distribution을 t-1시점의 prior distribution이 되기 때문이다.
이에 저자들은 Tompson smapling과 LinUCB가 weight sampling을 한 방법을 적용하여 다음과 같이 정의를 하였다.
$\text w^{(i)}{k,t-1} , \mu^{(i)}{\text w_k}, \sigma^{2^{(i)}}k, \sum^{(i)}{\text w_k}$를 $i^{(th)}$ particle의 random variables로 생각해보면 $\text w_{k,t-1}$을 $\bar{\text w}_{k,t-1}$로 정의할 수 있다. (bandit algorithm에서의 추론 방법)

4.5 Algorithm

5.EMPIRICAL STUDY
저자는 모델을 성능을 확인하기 위해 2개의 real-world dataset을 사용하여 실험하였다. (the online search advertising data from Track2 of KDD CUP 2012, news recommendation data of Yahoo ! Today News)

5.1 Baseline Algorithms
Random, $\epsilon-greedy$ , GenUCB($\lambda$), TS($q_0$), TSNR($q_0$), Bootstrap, Our Methods : TVUCB($\lambda$), TVTP($q_0$)

5.2 KDD Cup 2012 Online Advertising
USER (age, gender)와 search engine(query keywords, some ads information about click count on ads)의 interaction 데이터
저자들은 binary feature vector of dimension 1,070,866 사용, 1 million user visit events
Bandit problem evalution을 위해서는 simulation, replayer 방법이 있는데, large number of Recommending items (> 50) 에서는 simulation 방법이 적절 하였다.

5.3 Yahoo ! Today News

news data collection < 50 , replayer 방법 사용

5.4 Time Cost

6. Conclusions

저자는 dynamic behavior of reward을 위한 context drift model을 제안하였다.
Particle learning을 통해 parameters와 latent drift of the context을 효율적으로 학습하는 model을 제안하였다.
기존에 존재하는 Bandit algoritms를 개선하여 보다 contextual dynamics한 상황 tracking을 가능하게 하였으며 CTR에 대하여 personalized recommendation performance를 향상 시켰다.

[Paper Review] (2017, Huifeng Guo) DeepFM : A Factorization-Machine based Neural Network for CTR Prediction

Tue, 18 May 2021 11:32:41 GMT

작성자 : 권오현

Abstract

- 추천 시스템에서 CTR을 최대화 하기 위해서는 유저 행동에 숨겨진 복잡한 상호관계(feature interaction)을 학습하는 것이 중요하다.

기존의 방법론은 낮은 차원의 상호관계(low-order interactions)나 높은 차원의 상호관계(High-order interactions)에 치우쳐 학습을 하거나, 전문가의 feature engineering이 필요하다.
논문 저자는 낮은 차원(low-order), 고차원(high-order)의 feature interactions을 End-to-End 모델로 학습할 수 있음을 보여준다.
DeepFM은 Factorization machines이 가지는 장점과 Deep Learning이 가지는 장점을 결합한 새로운 Neural Network 구조이다.

[참고]

** CTR (Click-Through Rate) 이란 **

클릭률(CTR)은 광고를 본 사용자가 해당 광고를 클릭하는 빈도의 비율입니다. 클릭률(CTR)을 사용하면 키워드와 광고, 무료 제품 목록의 실적을 파악할 수 있습니다.(클릭수 ÷ 노출수 = CTR) -Google Ads
Factorization Machines :

Linear Regression : 유저와 아이템에 대한 평가를 예측하는 문제라고 생각하면 추천시스템은 선형 회귀식에서 시작 되었다.

*Polynomial Model *: 추천 시스템의 데이터는 categorical type 변수가 많고 one-hot-encoding 값이 많아 데이터의 차원이 크고 Sparse하다. 이를 위해 각 변수 간의 interaction을 고려한 모델이 제안되었다.

이러한 모델링은 Linear Regression의 한계를 극복 가능 하지만 Parameter 수가 증가 하기 때문에 연산이 복잡, Factorization Machines는 feature interaction vector를 저차원으로 factorize하여 이를 해결함

*Factorization Machines *: Polynomial model에서 interaction weight인 $w_{ij}$ 가 두 벡터 $(\vec{v_{i}} \cdot \vec{v_{j}})$의 내적으로 factorize 되었다. 이 두 벡터는 interaction의 latent vector로 표현함으로써 latent space내에서 interaction을 보다 더 잘 잡아낼 수 있다.

Introduction

Click-Through Rate Prediction for recommendation

The task of predicting the likelihood that something to website(such as an advisement) will be checked
CTR Prediction은 User가 추천 아이템을 클릭할 확률을 추정하는 것으로 유저에게 추천될 아이템의 순위를 매길 수 있다. 또한 온라인 광고 같은 시나리오에서도 수입을 증가시킬 수 있다. $\Rightarrow$ CTR을 정확하게 추정하는 것이 핵심이다.
CTR Prediction은 User click 행동에 존재하는 implicit feature interactions를 학습하는 것이 중요하다.
일반적으로 User click 행동의 interactions는 매우 복잡하고 low-order, high-order interaction이 중요한 역할을 한다.

Example
1. 식사시간에 배달앱 다운로드 하는 경우 많음 -> "app category"와 "time-stamp"의 interaction (order-2)
2. 10대 남자들은 Shooting games & RPG games 좋아함 -> "app category"와 "gender", "age"의 interaction (order-3)
3. Association rule에 의해 발견된 '기저기와 맥주'의 interaction (order-2)
예시 1,2 와 같이 쉽게 어떠한 feature interactions는 쉽게 이해할 수 있지만, 대부분의 feature interactions는 데이터 속에 숨겨 있고 prior 정보 만으로 알기 어려워 machine learning에 의해 auto-matically하게 포착되어야 한다.(예시 3)

Previous Approaches
Generalized Linear Model FTRL [McMahan et al. 2013] :

선형 모델은 feature interaction을 학습하기에 부족하며 pairwise feature interactions을 직접 feature vector로 넣어준다.

이러한 방법은 High-order feature interactions 학습하기 어렵고 cold start 문제가 존재함.
Factorization Machine (FM) [Rendle, 2010] :

pairwise feature interactions을 feature 간의 inner product를 통해 모델링 하였다.

High-order feature interaction을 모델링 할 수 있지만 High Complexity로 인해 order-2 feature interaction만 고려한다.
Factorization-machine supported Neural Network_ (FNN) [Zhang et al., 2016]_ :

Deep Neural Network에 적용하기 전에 FM으로 pre-trains 하여 feature embeddings의 가중치를 초기화 하여 학습을 한다.

DNN에 적용하기 전에 FM을 pre-train 하기 때문에 FM에 의해 성능이 제한된다.
Product-based Neural Network (PNN) [Qu et al., 2016] :

Feature interactions에 대해 Embedding layer와 fully-connected 사이에 product layer를 도입하였다.

FNN과 PNN은 다른 DNN 처럼 CTR Prediction에 필수적인 low-order feature interactions를 거의 포착하지 못한다.
Wide & Deep learning for Recommender Systems (Wide & Deep) [Cheng et al., 2016] :

low order & high-order interactions를 모델링하기 위해 linear("wide") model과 DNN ("deep") 을 결합한 Wide & Deep 모델을 제시하였다.

Wide & Deep은 두 개의 다른 input(wide part, deep part)을 필요로 하는데 wide part input은 feature engineering이 필요함

wide part input은 designed pairwise feature interaction이기 때문에 입력 vector가 굉장히 커져 Complexity를 증가시킨다.

Problem

low-order or high-order feature interaction, feature engineering을 모델링 하는데 치우쳐 저 있다.

feature engineering 없이 low order, high-order feature interactions를 End-to-End 방식의 모델을 제안함

Model Structure

Factorization Machines(FM)과 Deep Neural Network(DNN)의 구조를 결합한 새로운 NN모델인 DeepFM을 제시한다.

Wide & Deep 모델과 다르게 같은 input, embedding layer를 공유하기 때문에 효율적으로 학습할 수 있다.

Benchmark data와 commercial data에 평가한 결과 다른 CTR Prediction 모델 보다 일관적으로 향상된 결과를 보였다.

Our Approach

Traning Dataset

$x=[x_{field_{1}},x_{field_{2}},\cdots,x_{field_{j}},\cdots, x_{field_{m}}]$

$y=\left{ \begin{matrix} 1 & user; clicked; the; item \ 0 & {otherwise} \end{matrix} \right .$

n개의 $(\chi, y)$ 데이터가 구성되어 있다. $\chi$는 User, Item pair m-fields 구성되어 있다.

$\chi$ Categorical fields (e.g., gender, location): one-hot encoding Continuous fields (e.g., age) : value, Discretization & one-hot encoding

CTR Prediction 모델의 목적은 context가 주어졌을 때 특정 앱을 클릭할 확률을 추정하는 모델을 만드는 것이다.

DeepFM
- DeepFM은 같은 input을 통해 FM Component(low-order feature interaction)와 Deep Component(high-order feature interaction)을 학습하는 것을 목표로 한다.

FM Component와 Deep Componet가 같은 Embedding 벡터를 공유한다. 이러한 구조는 두 가지 장점을 가진다. 1) $Field; vector$의 크기가 다를 수 있지만 $Embedding ,Layer$로 인해 같은 크기(k=5)를 가진다.

Gender의 경우 크기가 2인 $V_{gender}$ 이지만 국적이나 나이의 경우 Gender보다 더 큰 벡터를 가지게 된다. 하지만 결과적으로 $k-dimension$으로 Embedding된다.

2) (FNN) [Zhang et al., 2016] 에서는 $V$ 가 FM에 의해 pre-trained 한 feature embeddings의 가중치를 초기화하여 학습하지만 DeepFM은 pre-trained할 필요가 없고 End-to-End로 학습을 할 수 있게 된다.
결과적으로 $\hat{y}$는 FM, DNN의 예측값의 합을 $sigmoid$를 적용한다.

FM Component
- FM Component는 (FNN) [Zhang et al., 2016] 에서 제안된 feature interaction을 학습하기 위한 Factorization machines 이다.

FM은 각각 feature latent vector의 inner product (order-2)로 feature interaction을 모델링한다.

FM의 Output은 Addition Unit과 Inner Product Unit의 합으로 다음과 같이 표현된다.

Deep Component

Deep Component는 high-order feature interaction을 학습하기 위한 Feed-Forward Neural Network이다.

CTR prediction 모델의 Input data는 굉장히 Sparse하고 super high-demensional하다. $\Rightarrow Embedding layer$를 통해 Input vector를 low-order dense한 vector로 압축되어 NN에 입력된다.

Relationshop with the other Neural Networks

FNN [Zhang et al., 2016] : feature embeddings의 가중치를 FM에 의해 pre-trained 값을 initialization하기 때문에 두 가지 제한점을 가진다.

1) Embedding 파라미터가 FM의 영향을 받을 수 있다.

2) Pre_trained에서 도입된 Overhead로 효율성이 저하된다.

High-order feature interaction은 학습할 수 있지만 low order feature interaction은 학습 불가능 하다 DeepFM의 경우 사전 훈련이 불 필요하며, low & high order feature interaction 학습 가능하다.
PNN [Qu et al., 2016] : High order feature interaction을 포착하기 위해 $Embedding, Layer$와 $First; Hidden ; Layer$ 사이에 $Product$를 추가한 모델이다. $IPNN$(내적), $OPNN$(외적), $PNN$(내적 ,외적)

계산을 효율적으로 하기 위해 $IP$(내적)을 할 경우 뉴런을 제거하고 $OP$(외적)을 할 때 차원을 압축하여 근사적으로 계산한다.

1) $IP$(내적)을 할 경우 reliable하지만 $Product,Layer$이후에 $Hidden,Layer$의 모든 neuron에 연결도기 때문에 High Computational Complexity가 요구된다.

2) $OP$(외적)을 할 경우 unstable하고 feature의 정보를 잃어 버린다.

3) 저차원 feature interaction이 무시된다.

DeepFM의 FM Component는 마지막 $Output, Layer(1; neuron)$에만 연결된다.
Wide & Deep [Cheng et al., 2016] : DeepFM과 다르게 Wide Part의 Input을 직접 feature engineering 해야한다.

이 모델의 Linear Regression(LR)을 FM으로 바꾸면 DeepFM과 비슷하지만 DeepFM은 Feature Embedding을 share한다.

Embedding share 하는 방식은 low & high-order feature interaction을 표현하는 feature representation에 영향을 미쳐 더욱 정교하게 학습을 한다.(back-propagate manner)

DeepFM은 사전 훈련과 Feature engineering 없이 low & high-order feature interaction을 포착할 수 있다.

Experiments

Experiment Setup

DataSets :

1) Criteo DataSet : 4,500만 명의 유저 클릭을 포함한, 13개의 continuous feature과 26개의 categorical feature가 있다.

2) Company Dataset(~~아마 Huawei~~) :
- 실제 산업의 CTR prediction을 평가하기 위해 회사 데이터에서 실험을 해 보았다. 회사의 App store에서 연속된 7일의 유저 클릭 기록을 사용하여 다음 하루치를 evaluate 하였다.

10억개 가량의 기록, app과 관련된 feature(identification, category and etc)와 user feature(유저가 다운로드한 앱), context feature(operation time and etc)가 있다.

Evaluation Metrics :

1) Logloss 2) AUC Score

Model Comparison

9개의 모델 실험 비교
- LR, FM, FNN, PNN, Wide & Deep (Wide - LR & FM) , DeepFM

Performance Evaluation

Effciency Comparison

LR 대비 학습시간에 걸린 시간을 나타내는 그래프
FNN은 DeepFM에 비해 오래 걸린다. Pre-training에 시간을 많이 씀
IPNN, PNN의 경우 Inner Product의 학습 속도가 굉장히 느려지는 걸 확인할 수 있다.
Wide & Deep 모델보다는 살짝 느리지만 준수함

Effectiveness Comparison

Feature interaction이 표현되지 않은 LR보다 interaction을 표현하는 다른 모델들이 성능이 더 좋다.
Low-order & High-order feature interaction을 같이 학습하는 모델인 DeepFM과 Wide & Deep 모델이 다른 모델보다 성능이 좋다.
Feature embedding을 share하는 DeepFM이 성능이 더 좋다.

결과적으로 Wide & Deep 보다 AUC와 LogLoss 측면에서 각각 0.37% point, 0.42% point 성능이 좋아 졌다. 근소한 차이지만 Wide & Deep 논문에 의하면 Offline에서 0.275% 포인트 차이가 Online CTR에서는 3.9% point 차이로 이어졌다. 회사 AppStore에서 일일 전환율이 수백만 달러이기 때문에 작은 비율 상승은 년간 수 백만 달러의 추가 이익을 가져온다. 성능 향상이 효과적임을 확인할 수 있다. -Wide & Deep [Cheng et al., 2016]

Conclusions

이번 논문에서는 기존의 CTR예측 SOTA 모델의 단점을 극복하고 더 좋은 성능을 보인 모델인 FM기반 신경망 모델 DeepFM을 제시했다. DeepFM은 Deep Component와 FM Component를 jointly 학습한다. 이런 방식은 세 가지 장점이 있다.

1) pre-training이 필요하지 않다.

2) 높은 차원과 낮은 차원의 피쳐 상호작용을 같이 학습한다.

3) Feature Engineering이 필요 없도록 feature embedding을 공유하는 전략을 사용한다.

두 개의 real-word datasets 상에서 effectiveness측면이나 efficiency 측면을 다른 SOTA 모델과 비교했다. 실험 결과는 1) DeepFM이 AUC와 Logloss 지표에서 SOTA모델을 앞질렀고 2) 제일 efficient한 모델과 비교할 만큼 성능이 좋았다.

두 가지 후속 연구 방향이 있다. 하나는 높은 차원의 피쳐 상호작용을 강화하는 방법(예를 들면 pooling layer를 도입)을 탐색해보는 것이고 다른 하나는 GPU cluster상에서 large-scale problem을 DeepFM에서 학습하는 것이다.

ooooo_h.log

도커 데몬

도커의 구조

도커 데몬 실행 및 설정

도커 데몬 제어 : -H

도커 데몬 보안 적용 : --tlsverify

openssl genrsa -out key.pem 4096

openssl -req -subj '/CN=client' -new -key key.pem -out client.csr

echo extendedKeyUsage = clientAuth > extfile.cnt

openssl x509 -req -days 30000 -sha256 -in client.csr -CA ca.pem -CAkey ca-key.pem -CAcreateserial -out cert.pem -extfile extfile.cnt

chmod -v 0400 ca-key.pem server-key.pem ca.pem server-cert.pem cert.pem

cp {ca, server-cert, server-key, cert, key}.pem ~/.docker

dockerd --tlsverify \

도커 스토리지 드라이버 변경 : --storage-driver

스토리지 드라이버의 원리

도커 데몬 모니터링

도커 데몬 디버그 모드

파이썬 Remote API 라이브러리를 이용한 도커 사용

음성

Abstract

Introduction

WaveRNN

LPCNet

Conditioning Parameters

Pre-emphasis and Quantization

Linear Prediction

Output Layer

Sparse Matrices

Embedding and Algebraic Simplifications

Sampling from Probability Distribution

Training Noise Injection

EVALUATION

Complexity

Experimental Setup

Quality Evaluation

Conclusion

개인화, Multi-speaker

0. Speech Synthesis

HMM (Hidden Markov Model)

Tacotron

Tacotron2

Transformer TTS

1. 개인화 (Multi-Speaker, Style Modeling)

1.1 D-vector based Multi-speaker

1.2 Global Style Token

Training

Inference

1.3 VAE-Tacotron2

1.4 Gaussian mixture VAE (GMVAE)

2. 보코더 (Vocoder)

2.1 Intro

2.2 Vocoder

Griffin-Lim

Neural Vocoder

Scaling up GNN's

Intro

Neighbor Sampling

Remarks on Neighbor sampling

Issues with Neighbor Sampling

Cluster-GCN

Problem

Advanced Cluster-GCN

Training

Comparision

Simplifying GCN

Comparision with Other Methods

13 : Community Detection in Networks

1. Community detection in network

1.1 Granovetter's Answer

1.2 Granovetter's Explanation

Structural role - Traidic Closure = High clustering coefficient

1.3 Edge overlap

2. Network Communities

Networks에서 어떻게 densely connected groups of nodes(= tightly connected nodes)를 찾을 수 있을까 ?

$\rightarrow$ Modularity, Modularity를 Maximize하는 Communities를 찾으면 됨

3. Louvain Algorithm

4. Detecting Overlapping Communities : BigCLAM

이러한 Overlapping community structure는 어떻게 Detect할 수 있을까 ? BigCLAM

4.1 BigCLAM

4.2 AGM