sujinyun999.log

[#LoG_Reading] Structure-Aware Transformer for Graph Representation Learning

Fri, 21 Apr 2023 13:23:45 GMT

Structure-Aware Transformer for Graph Representation Learning

Background before reading this review.

*Notation

$G = (V, E, \mathbf X)$

node $u \in V$
node attribute $x_u \in \mathcal X \subset \mathbb R^d$
$\mathbf X \in \mathbb R^{n \times d}$

Transformers on Graph

Whereas GNNs use the graph structure explicitly, a Transformer uses node attributes to model the relations between nodes

Transformer components
1. Self-attention module
  - The input node features $\mathbf X$ are linearly projected to Query($\mathbf Q$), Key($\mathbf K$), and Value($\mathbf V$), from which self-attention is computed
  - multi-head attention
2. feed-forward NN
  - The self-attention output passes through a skip connection and an FFN, which completes one transformer layer

Absolute encoding

A positional/structural representation of the graph is added to or concatenated with the input node features and used as the Transformer input : the counterpart of the PE in the vanilla transformer

Positional encoding methods commonly used in Graph Transformers
- Laplacian PE
- Random Walk PE
(-) They do not reflect the structural similarity between a node and its neighbors

Self-attention and kernel smoothing

$\operatorname{Attn}\left(x_v\right)=\sum_ {u \in V} \frac{\kappa_ {\exp }\left(x_v, x_u\right)}{\sum_ {w \in V} \kappa_ {\exp }\left(x_v, x_w\right)} f\left(x_u\right), \forall v \in V$
- linear value function $f(x) = \mathbf W_ {\mathbf V}x$
- $\kappa_ {\exp }$ (non-symmetric) exponential kernel parameterized by $\mathbf W_ {\mathbf Q}, \mathbf W_ {\mathbf K}$

$\kappa_ {\exp }\left(x, x^{\prime}\right):=\exp \left(\left\langle\mathbf{W}_ {\mathbf{Q}} x, \mathbf{W}_ {\mathbf{K}} x^{\prime}\right\rangle / \sqrt{d_ {\text {out }}}\right)$

-   $\langle \cdot, \cdot\rangle$ : dot product
-   a learnable exponential kernel
-   (-) only position-aware, not structure-aware encoding
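
The kernel-smoother view above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's code: the function name `kernel_smoother_attention` and the toy identity projection matrices are assumptions made for the example.

```python
import math

def kernel_smoother_attention(X, W_q, W_k, W_v):
    """Attn(x_v) = sum_u [kappa(x_v, x_u) / sum_w kappa(x_v, x_w)] * (W_v x_u)."""
    def matvec(W, x):
        return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

    Q = [matvec(W_q, x) for x in X]
    K = [matvec(W_k, x) for x in X]
    V = [matvec(W_v, x) for x in X]
    d_out = len(Q[0])

    out = []
    for q in Q:
        # Exponential kernel kappa_exp(x_v, x_u) for every node u
        kappa = [math.exp(sum(a * b for a, b in zip(q, k)) / math.sqrt(d_out)) for k in K]
        total = sum(kappa)
        weights = [k / total for k in kappa]  # normalised kernel weights (a softmax)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Toy example (assumed values): three nodes, identity projections
I2 = [[1.0, 0.0], [0.0, 1.0]]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attn = kernel_smoother_attention(X, I2, I2, I2)
```

Each output row is a convex combination of the value vectors, weighted only by feature similarity; nothing in the kernel looks at the graph structure.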

1. Problem Definition

Limitations of GNN

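limited expressiveness : at most as expressive as the 1-WL test
Over-smoothing problem : once the number of GNN layers is large enough, all node representations converge to a constant
Over-squashing problem : the many messages in a graph are compressed into a single fixed-length vector, and the resulting graph “bottleneck” keeps messages from distant nodes from propagating effectively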

⇒ Beyond neighborhood aggregation!

Transformer

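A single self-attention layer can capture the interaction between any pair of nodes in the graph
Unlike a GNN, no structural inductive bias is imposed in the intermediate layers, which sidesteps the expressiveness limit of GNNs
Structural and positional information about the graph enters only through the input node features
Because only node-level structural and positional information is encoded into the input features, the amount of information that can be learned from the graph structure itself is limited

💡 Goal : design an architecture that adapts the Transformer to graph data so that it reflects the graph structure well and has high expressive power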

2. Motivation

Message passing graph neural networks.

Expressiveness limited to at most the 1-WL test, over-smoothing, over-squashing

Limitations of existing approaches

They encode only the positional relationship between nodes and do not directly encode the structural relationship
- It is hard to measure structural similarity between nodes, so structural interactions between nodes are not modeled

ex.

With a shortest-path-based positional encoding on G1 and G2, nodes u and v get the same representation with respect to all other nodes, even though the actual graph structures differ → fails to be structure-aware

💡 Proposes a transformer architecture that combines the strengths of message-passing GNNs and the Transformer to take both local and global info into account

Contribution of this paper

Q. How should structural info be encoded into the Transformer architecture?

A. Structure-Aware Transformer(SAT), which introduces structure-aware self-attention

reformulate the self-attention mechanism
- kernel smoother
- the exponential kernel, originally applied to node features, is extended to local structure by extracting a subgraph representation centered on each node
proposes a method that generates the subgraph representations automatically
- this lets the kernel smoother capture both structural and attribute similarity
Using a GNN to build node representations that contain subgraph info gives better performance than the base GNN without any further architectural change
Shows that the Transformer's gain comes from being structure-aware, and how much more interpretable SAT is than a transformer with absolute encoding

3. Method

Structure-Aware Transformer

1. Structure-Aware Self-attention

To bring structural similarity between nodes into the position-aware encoding, a generalized kernel over each node's local structure is added

With the set of subgraphs centered on each node, structure-aware attention is defined as

$\operatorname{SA-Attn}\left(v\right):=\sum_ {u \in V} \frac{\kappa_ {\text{graph} }\left(S_G(v), S_G(u)\right)}{\sum_ {w \in V} \kappa_ {\text{graph}}\left(S_G(v), S_G(w)\right)} f\left(x_u\right)$

$S_G(v)$ : the subgraph centered on $v$, together with its node features $\mathbf X$
$\kappa_ {\text{graph} }$ : a kernel that compares a pair of subgraphs

⇒ produces expressive node representations that capture both attribute and structural similarity → table 1

⇒ permutation equivariance holds only for nodes that share the same subgraph structure

$\kappa_ {\text {graph }}\left(S_G(v), S_G(u)\right)=\kappa_ {\exp }(\varphi(v, G), \varphi(u, G))$

$\varphi(v, G)$ : structure extractor that produces a vector representation of the subgraph centered on node $v$ with features $\mathbf X$
- It can be any model that produces subgraph representations, such as a GNN or a differentiable graph kernel
- If the task or data calls for edge attributes, choose a GNN that supports them. Edge attributes are not used separately in the attention; they are used only inside the subgraph extractor

k-subtree GNN extractor.

$\varphi(u, G) = \operatorname{GNN}_G^{(k)}(u)$

Produces a representation of the k-subtree structure rooted at node u
at most 1-WL test
Performs well even with small k, without over-smoothing or over-squashing issues

k-subgraph GNN extractor.

$\varphi(u, G) = \sum_ {v \in \mathcal N_k(u)} \operatorname{GNN}_G^{(k)}(v)$

Goes beyond using only the representation of node u: builds and uses a representation of the whole k-hop subgraph centered on node u
For the k-hop neighborhood $\mathcal N_k(u)$ of node u, a GNN is applied to each node and the resulting node representations are pooled (summation in the paper)
More powerful than 1-WL test
Concatenation with the original node representation captures attribute similarity as well as structural similarity
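
A rough sketch of how a structure extractor changes the attention weights, assuming an untrained GIN-style sum aggregation as the k-subtree extractor and identity query/key projections. The helper names `k_subtree_extractor` and `sa_attention_weights` are made up for this example; the paper's extractor is a trained GNN.

```python
import math

def k_subtree_extractor(adj, X, k):
    """phi(u, G): k rounds of sum aggregation (GIN-style, no learned weights)."""
    H = [list(x) for x in X]
    for _ in range(k):
        H = [[h_v[j] + sum(H[u][j] for u in adj[v]) for j in range(len(h_v))]
             for v, h_v in enumerate(H)]
    return H

def sa_attention_weights(adj, X, k):
    """Normalised kappa_exp(phi(v), phi(u)) for all node pairs (W_Q = W_K = I)."""
    Phi = k_subtree_extractor(adj, X, k)
    d = len(Phi[0])
    weights = []
    for p_v in Phi:
        scores = [sum(a * b for a, b in zip(p_v, p_u)) / math.sqrt(d) for p_u in Phi]
        m = max(scores)  # subtract the max for numerical stability
        exp_scores = [math.exp(s - m) for s in scores]
        total = sum(exp_scores)
        weights.append([e / total for e in exp_scores])
    return weights

# Star graph: node 0 is the hub, nodes 1-3 are leaves; all features are identical.
adj = {0: [1, 2, 3], 1: [0], 2: [0], 3: [0]}
X = [[1.0]] * 4
plain = sa_attention_weights(adj, X, k=0)  # k = 0: node features only
aware = sa_attention_weights(adj, X, k=1)  # k = 1: 1-hop structure included
```

With `k=0` every node looks the same and attention is uniform; with `k=1` the hub and the leaves get different representations, so the attention weights start to depend on local structure.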

Other structure extractors.

directly learn a number of “hidden graphs” as the “anchor subgraphs” to represent subgraphs
domain-specific GNNs
non-parametric graph-kernel

2. Structure-Aware Transformer

self-attention → skip connection → normalization layer → FFN → normalization layer

Augmentation on skip connection.

$x'_v = x_v +1/ \sqrt {d_v} \operatorname{SA-Attn}\left(v\right)$

$d_v$ : degree of node $v$
Including the degree factor keeps highly connected graph components from having an overwhelming influence

*For graph-level tasks, either add a virtual [CLS] node to the input graph with no connectivity to the other nodes, or aggregate the node-level representations by sum/average

3. Combination with Absolute Encoding

Adding an absolute encoding on top of the structure-aware self-attention above makes the model position-aware as well, complementing the information it already has. This combination was found to improve performance.

RandomWalk PE

With absolute PE alone the structural bias is not strong enough, so even when two nodes have similar local structures there is no guarantee that they get similar node representations!

→ This is because the distance-based or Laplacian-based positional representations commonly used as structural/positional signals do not contain the structural similarity between nodes

📌 Structure-aware attention carries a stronger inductive bias, but it is well suited to measuring the structural similarity of nodes: nodes with similar subgraph structures get similar embeddings, expressivity improves, and performance is good

4. Expressivity Analysis

With a k-subgraph GNN extractor centered on each node, SAT is guaranteed to be at least as expressive as the subgraph representation

4. Experiment

Experiment setup

Dataset

ZINC
CLUSTER
PATTERN
OGBG-PPA
OGBG-CODE2

Baseline

GNNs
- GCN
- GraphSAGE
- GAT
- GIN
- PNA
- Deeper GCN
- ExpC
Transformers
- Original Transformer with RWPE
- Graph Transformer
- SAN
- Graphormer
- GraphTrans

Results

Table1. Comparison of SAT with sota models on graph regression and classification tasks

For the ZINC dataset the metric is MAE(Mean Absolute Error), where lower is better; for CLUSTER and PATTERN the metric is Accuracy, where higher is better.

Table2. Comparison of SAT with sota models on the OGB datasets

For the OGB datasets the metrics are Accuracy and F1 score, where higher is better.

Table3. Comparison with the GNN used as the structure extractor. SAT outperforms the sparse GNN in every case

Fig3. Experiments with SAT variants on the ZINC dataset

Metric : MAE (lower is better)

Effect of k in the structure extractor
- k=0 can be seen as equivalent to a vanilla transformer that uses only absolute encoding
- k=3 gives the optimal performance
- Beyond k=4 performance degrades, which matches the known observation that GNNs with fewer layers often perform better
Effect of absolute encoding
- RWPE vs. Laplacian PE
- The gain is smaller than the gain from introducing structure-aware attention, but performance is better with RWPE, which suggests that the two encodings are complementary
Effect of the readout method
- mean and sum were compared as the readout used to aggregate node-level representations
- pooling graph-level information through a [CLS] token was compared as well
- The readout method has a very large effect for GNNs but only a very weak effect for SAT.

5. Conclusion

Strong Points.

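Instead of heuristically using shortest path distance(SPD) for structural info as Graphormer does, SAT replaces it with a GNN, which learns such local info well; this is the novel part

The global receptive field of the Transformer and the local-structure bias of the GNN are complementary

The same holds for the encodings:

positional encoding through RWPE
structure-aware attention through a k-subtree/subgraph GNN

These two play complementary roles

→ The paper combines two complementary methods according to what each learns well, gets good performance, and the reason is easy to accept

Weak Points.

A strength unique to the SPD used in Graphormer, another architecture that applies the Transformer to graph data: even for node pairs that are not directly connected and lie very far apart, it aggregates weighted edge features along the shortest path. On graph structures or datasets where that property matters, SAT will likely miss something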

Author Information

Dexiong Chen
- Department of Biosystems Science and Engineering, ETH Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, Switzerland.
Leslie O'Bray
- Department of Biosystems Science and Engineering, ETH Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, Switzerland.
Karsten Borgwardt
- Department of Biosystems Science and Engineering, ETH Zurich, Switzerland.
- SIB Swiss Institute of Bioinformatics, Switzerland.

6. Reference & Additional materials

Github Implementation : https://github.com/BorgwardtLab/SAT
Reference : Structure-Aware Transformer for Graph Representation Learning

[#LoG_Reading] Graphormer : Do Transformers Really Perform Badly for Graph Representation?

Fri, 21 Apr 2023 13:20:47 GMT

"Message-passing GNNs are known to suffer from pathologies, such as oversmoothing, due to their repeated aggregation of local information [19], and over-squashing, due to the exponential blow-up in computation paths as the model depth increases [1]. As a result, there is a growing interest in deep learning techniques that encode graph structure as a soft inductive bias, rather than as a hard-coded aspect of message passing [14, 24]." - Rethinking Graph Transformers with Spectral Attention. NeurIPS 2021 -*

Abstract

Do Transformers perform well in graph representation learning?

Graphormer

standard transformer architecture
performs well on many graph representation learning tasks
- especially on the recent OGB Large-Scale Challenge

Key Insight

the need for a way to effectively encode structural information in order to model graphs
mathematically characterizes the expressive power of Graphormer
- compared with how other GNN variants encode the structural information of a graph, all of them can be covered as special cases of Graphormer

💡 Ways to encode structural information of the graph ⇒ Graphormer

1. Introduction

There have been steady attempts to apply the transformer to graph representation

So far, however, the most effective approach has been to apply softmax attention only to a few key modules of classic GNN variants, such as feature aggregation (ex. GAT)

→ Is the transformer suitable for graph representation learning?

Graphormer

directly built upon the standard transformer architecture
SOTA on graph-level prediction tasks (OGB-LSC)

Self-attention of node $i$

does not consider the structural information that can be read from relations between node pairs
only computes the semantic similarity between node $i$ and the other nodes

The structural encoding methods are built from the following information

Centrality Encoding

Purpose : node importance in the graph

☹️ Self-attention is computed from the similarity of nodes' semantic features, so it is hard to capture information such as some nodes being more important than others

💡 degree centrality

A learnable vector is assigned to each node according to its degree and added to the node features at the input layer

Degree Centrality : centrality measured by the number of edges connected to a node (for a weighted graph, the sum of all weights)

Spatial Encoding

Purpose : structural relation between nodes

☹️ Unlike structured data such as images or natural language, graph-structured data has no canonical grid to embed on

Nodes live in a non-Euclidean space and are connected by edges

💡 Give each node pair a learnable embedding based on their spatial relation

shortest path
- encoded as a bias term of the softmax attention
- additional info : may also include edge features (ex. the type of bond between two atoms in a molecular structure)

For each node pair, the average dot product between the learnable embeddings (along the shortest path) and the edge features is computed and used in the attention module

2. Preliminary

Graph Neural Network (GNN)

AGGREGATE : aggregates information from neighboring nodes
COMBINE : fuses the aggregated neighbor information into the node representation
READOUT : aggregates node features into the final representation, by permutation invariant function

Transformer

Query
Key
Value

Self-Attention

3. Graphormer

3.1 Structural Encodings in Graphormer

3.1.1 Centrality Encoding

Importance of node centrality

☹️ Standard attention is computed from the similarity of node embeddings → hard to reflect node centrality

💡 Degree centrality

indegree&outdegree → two real-valued embedding (learnable)

Apply centrality encoding to each node and simply add it to the node feature

$h_i^{(0)} = x_i + z^{-}_{\deg^{-}(v_i)} + z^{+}_{\deg^{+}(v_i)}$

undirected : the two are simply merged into a single degree embedding

💡 Originally the similarity, and hence the attention, was computed from x (the semantic embedding) alone. Now the learnable indegree and outdegree embeddings are added before computing similarity, so the attention can capture both semantic correlation and node importance!

3.1.2 Spatial Encoding

One strength of the Transformer is its global receptive field ← it comes from computing attention over all words in a sentence at once rather than sequentially

⇒ byproduct problem : since sequential data is not processed in order, inputs such as natural language need a positional encoding

In a graph, nodes are not arranged as a sequence
multi-dimensional spatial space, connected by edges

💡 For natural language a positional encoding is enough, but a graph is not sequential, so structural information is injected in a different way = Spatial encoding

⇒ Spatial Encoding

A function $\phi(v_i,v_j) : V \times V \rightarrow \mathbb{R}$ that measures the spatial relation between $v_i, v_j$ in graph $G$
- $\phi(v_i,v_j)$ is defined by the connectivity between nodes = distance of shortest path(SPD)
- -1 if the two nodes are not connected
A learnable scalar is assigned to each output value → bias term of the self-attention module

$$ A_{ij} = \frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}} + b_{\phi(v_i,v_j)} $$

$A_{ij}$ : $(i,j)$th element of the Query-Key matrix A
- $b_{\phi(v_i,v_j)}$ : learnable scalar assigned according to $\phi(v_i,v_j)$, shared across all layers

💡 In a transformer, the query is compared with every key, the result goes through a softmax to become the score, and the attention output is the score-weighted sum of the values. The $A_{ij}$ here seems to be that (pre-softmax) score. The point is that it is not only the similarity of node embeddings with degree centrality mixed in; the distance of shortest path is added on top as a bias term!

Advantages

Compared with a conventional GNN, whose receptive field is limited to neighboring nodes, a transformer layer attends to all other nodes in the graph and provides global information
With $b_{\phi(v_i,v_j)}$, each node in a transformer layer attends adaptively according to the structural information of the graph
- ex. if $b_{\phi(v_i,v_j)}$ is learned as a decreasing function of $\phi(v_i,v_j)$ → the model attends more to nearby nodes and relatively less to distant ones
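
The spatial encoding can be sketched as follows: compute $\phi(v_i,v_j)$ by BFS, look up a scalar bias per distance, and add it to the attention logits. This is only an illustration; the bias values are hypothetical stand-ins for the learned scalars, and the content scores are set to zero to isolate the effect of the bias.

```python
import math
from collections import deque

def shortest_path_distances(adj):
    """All-pairs SPD by BFS; unreachable pairs get -1, as in the text."""
    n = len(adj)
    dist = [[-1] * n for _ in range(n)]
    for s in range(n):
        dist[s][s] = 0
        queue = deque([s])
        while queue:
            v = queue.popleft()
            for u in adj[v]:
                if dist[s][u] == -1:
                    dist[s][u] = dist[s][v] + 1
                    queue.append(u)
    return dist

def biased_attention(scores, dist, bias):
    """Row-wise softmax of scores[i][j] + b_{phi(v_i, v_j)}."""
    out = []
    for s_row, d_row in zip(scores, dist):
        logits = [s + bias[d] for s, d in zip(s_row, d_row)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

# Path graph 0 - 1 - 2 - 3 plus an isolated node 4.
adj = [[1], [0, 2], [1, 3], [2], []]
dist = shortest_path_distances(adj)
# Hypothetical bias values standing in for the learned scalars b_phi.
bias = {-1: -4.0, 0: 0.0, 1: -1.0, 2: -2.0, 3: -3.0}
scores = [[0.0] * 5 for _ in range(5)]  # equal content scores isolate the bias effect
attn = biased_attention(scores, dist, bias)
```

Because the assumed bias decreases with distance, node 0 attends most to itself, then to node 1, node 2, and so on, matching the "decreasing function" example above.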

3.1.3 Edge Encoding in the Attention

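Cases where edges also carry structural features

ex. molecular graph, where the bond between a pair of atoms can have a specific type
Traditional methods for edge encoding
1. edge features are added to the features of the associated nodes
2. for each node, the features of its associated edges are used together with the node features in the aggregation step

⇒ only edge information about associated nodes is propagated → probably not an effective way to use edge information in the whole-graph representation

the attention mechanism estimates a correlation for each node pair $(v_i, v_j)$
- the edges connecting the two nodes should be taken into account here

⇒ find the shortest path between the nodes, then take the dot product of each edge feature along the path with a learnable embedding (weight embedding) and average

added through a bias term, like the spatial encoding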

$$ A_{ij} = \frac{(h_iW_Q)(h_jW_K)^T}{\sqrt{d}} + b_{\phi(v_i,v_j)}+c_{ij}, ~~\text{where} ~c_{ij} = \frac{1}{N}\sum_{n=1}^{N}x_{e_n}(w_n^E)^T $$

shortest path from $v_i$ to $v_j$ : $\text{SP}_{ij} = (e_1,e_2, ..., e_N)$
$x_{e_n}$ : feature of the $n$th edge $e_n$ in $\text{SP}_{ij}$
$w_n^E \in \mathbb{R}^{d_E}$ : the $n$th weight embedding (learnable embedding)
$d_E$ : dimensionality of the edge feature

💡 Spatial encoding already maps the shortest path to a learnable scalar, so edge features at this more global scale seem to be handled the same way; the weight for each edge position along the path is left learnable so that it is learned automatically
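
The term $c_{ij}$ is just an average of dot products along the shortest path. A minimal sketch, with made-up edge features and weight embeddings (the function name `edge_encoding` and the zero return for an empty path are assumptions of this example):

```python
def edge_encoding(path_edge_features, weight_embeddings):
    """c_ij = (1/N) * sum_n <x_{e_n}, w_n^E> along the shortest path SP_ij."""
    n_edges = len(path_edge_features)
    if n_edges == 0:
        return 0.0  # assumption: an empty path (i == j) contributes no bias
    total = sum(
        sum(x * w for x, w in zip(x_e, w_n))
        for x_e, w_n in zip(path_edge_features, weight_embeddings)
    )
    return total / n_edges

# Two-edge path with one-hot bond types (single, double); all values are made up.
path = [[1.0, 0.0], [0.0, 1.0]]
w_E = [[0.5, -0.5], [0.2, 0.8], [0.1, 0.1]]  # one embedding per position on the path
c_ij = edge_encoding(path, w_E)
```

Only the first $N$ weight embeddings are used, one per position on the path, so longer paths reuse the same position-indexed embeddings.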

💡 What if, instead of the distance between nodes, structural information of the graph were encoded? ex. subgraphs with a particular structure -> the base idea of the follow-up paper, Structure-Aware Transformer for Graph Representation Learning!

3.2 Implementation Details of Graphormer

Graphormer Layer

Built upon classic Transformer encoder

Layer Normalization(LN) is applied before the Multi-head self-attention(MHA) and the Feed-forward block(FFN)
- On Layer Normalization in the Transformer Architecture
For the FFN sub-layer, the input, output, and inner-layer dimensions are all set to $d$

$$ \begin{aligned} h'^{(l)} &= \mathrm{MHA}(\mathrm{LN}(h^{(l-1)})) + h^{(l-1)} \\ h^{(l)} &= \mathrm{FFN}(\mathrm{LN}(h'^{(l)})) + h'^{(l)} \end{aligned} $$

Special Node

A graph pooling function for building the graph embedding

inspired by Neural Message Passing for Quantum Chemistry

[VNode]

connected to every node in the graph
updated like an ordinary node in the Aggregate-Combine steps, so that the representation of the whole graph $h_G$ is the representation of this VNode
the same idea as BERT placing a CLS token at the start of each sequence and using it for sequence-level downstream tasks such as sentence classification

☹️ But since it is connected to every node in the graph, its shortest path is always 1! And this is a virtual connection, not a real one

💡 reset all spatial encodings for $b_{ϕ([VNode],v_j)}$ and $b_{ϕ(v_i,[VNode])}$ to a distinct learnable scalar

inspired by Rethinking Positional Encoding in Language Pre-training

3.3 How Powerful is Graphormer?

Fact 1. By choosing proper weights and distance function ϕ (the function that maps the SPD to a scalar for the spatial encoding), the Graphormer layer can represent AGGREGATE and COMBINE steps of popular GNN models such as GIN, GCN, GraphSAGE.

Spatial encoding enters the self-attention computation, so a node's neighbor set (SPD=1) can be distinguished → applying softmax gives the mean over the neighbor set
Multiplying that neighbor-set mean by the node's degree turns it into a sum
With multi-head attention and the FFN, the representations of node v and of its neighbor set are computed separately and merged afterwards ← in the vanilla transformer the FFN is also applied per token

😀 With spatial encoding, Graphormer can be more expressive than the 1-WL test, which bounds MPNNs

Appendix A shows that Graphormer distinguishes graphs that the WL test cannot, and proves that other GNNs are special cases
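
The first two points can be checked numerically. In the sketch below (an illustration under assumed settings, not the paper's proof) the content term of the attention is set to zero and the bias is 0 for SPD = 1 and a very large negative number otherwise, so the softmax reduces to a mean over neighbors; multiplying by the degree then gives the sum.

```python
import math

def attention_mean_aggregate(adj, H):
    """Neighbor MEAN through softmax attention driven by an SPD-based bias only."""
    n = len(adj)
    out = []
    for i in range(n):
        # Bias 0 for SPD = 1 (direct neighbors), very negative otherwise;
        # the content term (h_i W_Q)(h_j W_K)^T is assumed to be zero.
        logits = [0.0 if j in adj[i] else -1e9 for j in range(n)]
        m = max(logits)
        exps = [math.exp(l - m) for l in logits]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * h[k] for w, h in zip(weights, H)) for k in range(len(H[0]))])
    return out

adj = [[1], [0, 2], [1]]  # path graph 0 - 1 - 2 (no isolated nodes)
H = [[1.0], [2.0], [4.0]]
mean_agg = attention_mean_aggregate(adj, H)
# Multiplying the mean by the degree recovers SUM aggregation
sum_agg = [[len(adj[i]) * v for v in row] for i, row in enumerate(mean_agg)]
```

For node 1 the attention output is the mean of its neighbors' features, (1 + 4) / 2, and scaling by its degree 2 gives their sum.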

Connection between Self-attention and Virtual Node.

[VNode] aggregates the information of the whole graph, like a Readout function

☹️ can potentially lead to inadvertent over-smoothing of information propagation

Graph Warp Module: an Auxiliary Module for Boosting the Power of Graph Neural Networks in Molecular Graph Analysis

Fact 2. By choosing proper weights, every node representation of the output of a Graphormer layer without additional encodings can represent MEAN READOUT functions.

self-attention attends to all nodes ⇒ simulate graph-level READOUT operation to aggregate information from the whole graph
No over-smoothing problem, empirically as well as theoretically ⇒ graph readout is implemented by adding the VNode

4. Experiments

4.1 OGB Large-Scale Challenge

graph-level prediction dataset

4.2 Graph Representation.

4.3 Ablation Studies

Node Relation Encoding

Positional encoding employed by previous Transformer-based GNNs

Weisfeiler-Lehman PE (WL-PE) : Graph-Bert: Only Attention is Needed for Learning Graph Representations
Laplacian PE : Laplacian Eigenmaps for dimensionality reduction and data representation

💡 Spatial encoding works better than the positional encoding methods above

Centrality Encoding

💡 Better with it than without → applying centrality encoding is essential when modeling graph data with a transformer architecture!

Edge Encoding

traditional ways to encode edge

💡proposed edge encoding performs better

5.1 Graph Transformer

5.2 Structural Encodings in GNNs

Path and Distance in GNNs

Path-Augmented Graph Transformer Network : uses node features, edge features, one-hot feature of the distance and ring flag feature when computing attention

Positional Encoding in Transformer on Graph

Graph-Bert: Only Attention is Needed for Learning Graph Representations :
- absolute WL-PE, intimacy based PE, hop based PE
A Generalization of Transformer Networks to Graphs : Absolute Laplacian PE

Edge Feature

Exploiting Edge Features in Graph Neural Networks : weights edge features using node similarity
How powerful are graph neural networks? : project edge features to an embedding vector and multiply it by attention coef, and send the result to an additional FFN sub-layer to produce edge representations

6. Conclusion

graph structural encodings
- Centrality Encoding
- Spatial Encoding
- Edge Encoding
works well on wide range of popular benchmark datasets
challenge
- quadratic complexity of the self-attention module restricts Graphormer’s application on large graphs

💡 GAT and Transformer : GAT computes attention from the semantic similarity of node embeddings, whereas the Transformer here (Graphormer) adds degree centrality, spatial encoding, and edge features to the attention computation. Each of the three is injected through a learnable function whose weights are learned during training. The advantage of attention is that every node can attend to every other node, neighbor or not ⇒ global receptive field

🤷 Then why not just stack existing MPNN layers to increase the number of hops? ⇒ That runs into the oversmoothing problem (increasing the number of GNN layers, the features tend to converge to the same value)

💡 The reason the Transformer does better than CNN, LSTM, GNN, or GCN can probably be attributed to its weaker inductive bias. Those models have architectures built around a property of their input data (locality for images, sequential order for time series), so they carry a strong inductive bias. The Transformer has no such structure: each token attends to all inputs, and properties such as locality or sequentiality enter only relatively weakly through positional encoding. ⇒ With less inductive bias there is more that the NN can learn! ⇒ It works better when there is a lot of data!

🤷 But then what if we want to use a transformer with little data? ⇒ We probably need to think about ways to add extra inductive bias

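Everything about Autoencoders - 1/2 (Deep learning basic revisit, Manifold learning)

Thu, 19 Jan 2023 13:02:01 GMT

Everything about Autoencoders - 1/3

After finishing the GAN review, I started the "Everything about Autoencoders" lectures from the same playlist! I wanted to finish them before my trip, but I got too restless to get through them all. I did cover the Deep learning basic revisit and Manifold learning parts that introduce the Autoencoder, so I'm posting what I have so far from Hong Kong airport, my layover on the way to Europe! There are still 2 hours until the next flight and I'm tired of YouTube and Netflix, so I'll try the rest of the lectures too... let's go...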

Autoencoder → purpose of nonlinear dimensionality reduction

🤷‍♂️ nonlinear dimensionality reduction?

Representation learning
Feature extraction
Manifold learning

Autoencoder’s 4 Main keywords

Unsupervised Learning → training method
Manifold Learning → encoder : performs dimensionality reduction
Generative Model Learning → decoder : acts as a generative model
ML(Maximum Likelihood) Density Estimation → Loss function : negative ML

A structure trained so that the output reproduces the input

1. Revisit Deep Neural Networks

Machine learning problem

Classical Machine Learning

collecting training data
define functions
1. output
2. Loss function
learning/training
1. find the optimal parameters that minimize the loss
predicting/testing
1. compute the output of the optimal function
*Note. a fixed input produces a fixed output

⇒ Deep Neural Networks

collecting training data
define function
1. Deep Neural Net
  1. network architecture
  2. number of layers
2. Loss function
  1. MSE, CrossEntropy
- Conditions for training a DNN with backpropagation
  
  Assumption1. Total loss of DNN over training samples is the sum of loss for each training sample : the total loss of the DNN is the sum of the losses of all samples in the training DB.
  
  Assumption2. Loss for each training example is a function of final output of DNN : the loss function takes only two inputs, the target and the network output.
learning/training
1. Gradient descent
  
  $$ \theta^* = \operatorname{argmin}_{\theta \in \Theta} L(f_{\theta}(x),y) $$
2. Iterative Method : step by step
  1. How to update $\theta → \theta+\Delta \theta$ ⇒ Only if $L(\theta+\Delta \theta) < L( \theta)$ : when the loss decreases
  2. when to stop search ⇒ If $L(\theta+\Delta \theta) == L( \theta)$ : when the loss no longer changes
  3. How to find $\Delta \theta$ so that $L(\theta+\Delta \theta) < L( \theta)$ ⇒ $\Delta \theta = -\eta \nabla L$ where $\eta > 0$ : learning rate
  - How should $\theta$ be updated so that the loss decreases?
    
    $\small L(\theta+\Delta\theta) = L(\theta)+\nabla L \cdot\Delta\theta + \text{second derivative} + \text{third derivative} + \dots$ → Taylor Expansion
    
    $\small L(\theta+\Delta\theta) \approx L(\theta)+\nabla L \cdot\Delta\theta$ → Approximation : the more derivative orders are used, the wider the region that can be represented with small error
    
    $\small L(\theta+\Delta\theta)-L(\theta) = \Delta L = \nabla L \cdot\Delta\theta$ → we need $\Delta L <0$
    
    If $\Delta \theta = -\eta \nabla L$, then $\Delta L = -\eta ||\nabla L||^2 < 0$, where $\eta > 0$ and called learning rate
    
    *$\nabla L$ = gradient of L, steepest increasing direction of $L$
3. The derivative of the loss function with respect to the parameters is needed → Backpropagation
  1. Error at output layer
    
    $$ \delta^L = \nabla_aC\odot\sigma'(z^L) $$
    - $C$ : Cost(Loss)
    - $a$ : final output of DNN
    - $\sigma(\cdot)$ : activation function
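    - Computes the error signal at the final output layer → the derivative of the loss with respect to the network output, $\nabla_aC$, multiplied element-wise by the derivative of the activation function evaluated at the feedforward value
  2. Error relationship between two adjacent layers → the error is propagated layer by layer through this relationship, back to the first layer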
    
    $$ \delta^l = \sigma'(z^l)\odot\left( \left(w^{l+1} \right)^T\delta^{l+1}\right) $$
  3. Gradient of C in terms of bias
    
    $$ \nabla_{b^l}C = \delta^l $$
  4. Gradient of C in terms of weight
    
    $$ \nabla_{W^l}C = \delta^l(a^{l-1})^T $$

    [Neural networks and deep learning](http://neuralnetworksanddeeplearning.com/chap2.html)
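
The four backpropagation equations and the update rule $\Delta\theta = -\eta\nabla L$ can be checked on a tiny scalar network. The sketch below is an illustration under assumed settings (one sigmoid hidden unit, one sigmoid output, quadratic cost, arbitrary starting parameters); it compares the backprop gradients with finite differences and then runs a few gradient descent steps.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def forward(params, x):
    w1, b1, w2, b2 = params
    z1 = w1 * x + b1
    a1 = sigmoid(z1)
    z2 = w2 * a1 + b2
    return z1, a1, z2, sigmoid(z2)

def cost(params, x, y):
    return 0.5 * (forward(params, x)[3] - y) ** 2

def backprop(params, x, y):
    _, _, w2, _ = params
    z1, a1, z2, a2 = forward(params, x)
    delta2 = (a2 - y) * sigmoid_prime(z2)       # error at the output layer
    delta1 = sigmoid_prime(z1) * (w2 * delta2)  # error propagated one layer back
    return [delta1 * x, delta1, delta2 * a1, delta2]  # dC/dw1, dC/db1, dC/dw2, dC/db2

def numeric_grad(params, x, y, eps=1e-6):
    grads = []
    for i in range(len(params)):
        up, down = list(params), list(params)
        up[i] += eps
        down[i] -= eps
        grads.append((cost(up, x, y) - cost(down, x, y)) / (2 * eps))
    return grads

params = [0.6, -0.3, 0.9, 0.2]  # arbitrary starting values
x, y = 1.0, 0.0
analytic = backprop(params, x, y)
numeric = numeric_grad(params, x, y)

# Gradient descent with the backprop gradients: theta <- theta - eta * grad
eta = 0.1
losses = [cost(params, x, y)]
for _ in range(20):
    grad = backprop(params, x, y)
    params = [p - eta * g for p, g in zip(params, grad)]
    losses.append(cost(params, x, y))
```

The analytic and numeric gradients should agree to several decimal places, and with a small learning rate the recorded losses decrease step by step.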

predicting/testing

Loss function viewpoints I : Backpropagation

In terms of how well backpropagation works, Cross Entropy Loss is better than MSE (MSE < Cross Entropy Loss)
Maximum Likelihood viewpoint : depends on the form of the output
- continuous value : MSE
- discrete value : Cross Entropy Loss

Type 1. Mean Square Error / Quadratic loss

$$ \begin{aligned} &C = (a-y)^2/2 \\ &\nabla_aC = (a-y)\\ & \delta = \nabla_aC \odot \sigma'(z) = (a-y)\sigma'(z) \end{aligned} $$

Loss $C$ = MSE ⇒ (output - target)^2/2, which reduces to $a^2/2$ for a target of $y = 0$
$\nabla_aC$ : derivative of C with respect to a
$\delta$ : error signal
$\frac{\partial C}{\partial w} = x\delta$, which equals $\delta$ for an input of $x = 1$ → $w = w-\eta\delta$
$\frac{\partial C}{\partial b} = \delta$ → $b=b-\eta\delta$
Gradient Vanishing problem
- The error at the output layer is multiplied by the derivative of the activation function ($\sigma'(z^L)$). When that value is close to 0, the error signal ($\delta$) is also close to 0 and the layers are barely updated.
- Strongly affected by the initial values!

Type 2. Cross Entropy

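With Cross Entropy, the $\sigma'(z^L)$ term drops out of the output-layer error signal ($\delta$), so it is largely free of the gradient vanishing problem and less sensitive to initial values.
- btw, with many layers the derivatives of the activation function are still multiplied repeatedly, so it is not completely free of the gradient vanishing problem.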
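
A quick numerical comparison of the two output-layer error signals, assuming a single sigmoid output unit (the helper name `output_deltas` is made up for this example). For the quadratic loss $\delta = (a-y)\sigma'(z)$; for cross entropy with a sigmoid output the $\sigma'(z)$ factor cancels and $\delta = a-y$.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output_deltas(z, y):
    a = sigmoid(z)
    mse_delta = (a - y) * a * (1.0 - a)  # (a - y) * sigma'(z)
    ce_delta = a - y                     # sigma'(z) cancels for cross entropy
    return mse_delta, ce_delta

# A confidently wrong output: z = 5 gives a close to 1 while the target is 0.
mse_delta, ce_delta = output_deltas(z=5.0, y=0.0)
```

In this saturated case the MSE error signal is more than a hundred times smaller than the cross-entropy one, which is the slow-start behaviour described above.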

Loss function viewpoints II : Maximum likelihood

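x-axis : a specific value → model output, target, etc.
y-axis : the probability of that value under the distribution
Distribution : the assumed distribution, i.e. the probability distribution of values produced by the model
Given a model with some parameters, check how much probability its output assigns to the target y, and maximize that probability
In the end the goal is for the mean, the most probable value, to coincide with the target y
Once the model's optimal distribution is found, sampling from it leads to generating values that are similar to y but not identical

Proceed under an assumption about which distribution the conditional probability $p$ follows → ex. Gaussian Dist.

$p\left(y|f_{\theta}\left(x\right)\right)$ → given the network output $f_{\theta}\left(x\right)$, the probability of the target $y$ should be maximized

Network output $f_{\theta}\left(x\right)$ : plays the role of the parameter used to estimate the conditional probability $p$ (ex. the mean in the Gaussian case)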

log-likelihood

loss : $-\log \left( p \left(y|f_{\theta}\left (x\right )\right)\right)$

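Find the model that best explains the given data → the parameters that maximize the log-likelihood

$\theta^* = \argmin_\theta[-\log \left( p \left(y|f_{\theta}\left (x\right )\right)\right)]$

⇒ the parameters of the distribution of the conditional probability $p$ have been found ⇒ the probability distribution has been found → sampling is possible!

*Extension. Generative model

By sampling from the probability distribution, a new output can be generated for a new input.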

$y_{new} \sim p(y|f_{\theta^*}\left(x_{new}\right))$

$\text{i.i.d Condition on }p \left(y|f_{\theta}\left (x\right )\right)$

Assumption1. Independence : instead of the conditional prob of the whole training db, it is estimated as the product of the conditional probs of each sample ($D_i$)

$$ p \left(y|f_{\theta}\left (x\right )\right) = \prod_i p_{D_i} \left(y|f_{\theta}\left (x_i\right )\right) $$

Assumption2. Identical Distribution : all samples are assumed to follow the same distribution

$$ p \left(y|f_{\theta}\left (x\right )\right) = \prod_i p \left(y|f_{\theta}\left (x_i\right )\right) $$

⇒ Conclusion : when both assumptions hold

$$ -\log \left( p \left(y|f_{\theta}\left (x\right )\right)\right) = -\sum_i \log \left( p \left(y_i|f_{\theta}\left (x_i\right )\right)\right) $$

The distribution can be assumed to be one of the following two (Univariate)

Gaussian Distribution → MSE
Bernoulli Distribution → Cross-Entropy

The loss reduces to each of these respectively.

The same holds in the multivariate case.
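
The two correspondences can be verified numerically. This sketch assumes a univariate Gaussian with fixed $\sigma = 1$ (so the log-normaliser is a constant that does not depend on the network output) and a Bernoulli with success probability $p$; the function names are made up for the example.

```python
import math

def gaussian_nll(y, mu, sigma=1.0):
    """-log N(y; mu, sigma^2)"""
    return 0.5 * math.log(2 * math.pi * sigma ** 2) + (y - mu) ** 2 / (2 * sigma ** 2)

def bernoulli_nll(y, p):
    """-log Bernoulli(y; p) for y in {0, 1}"""
    return -math.log(p if y == 1 else 1 - p)

def bce(y, p):
    return -y * math.log(p) - (1 - y) * math.log(1 - p)

y, mu_a, mu_b = 2.0, 1.5, 0.5
# The gap between two Gaussian NLLs equals the gap between halved squared errors,
# so minimising one is the same as minimising the other.
nll_gap = gaussian_nll(y, mu_b) - gaussian_nll(y, mu_a)
mse_gap = 0.5 * (y - mu_b) ** 2 - 0.5 * (y - mu_a) ** 2
```

The constant term cancels in `nll_gap`, which is why only the squared error matters for optimisation, and `bernoulli_nll` coincides with `bce` for both labels.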

Autoencoder
- $p(x|x)$
- the goal is for the output to equal the input when the network input is x
Variational Autoencoder
- $p(x)$
- estimates the probability distribution of the training db itself

2. Manifold Learning

Definition

When reducing high-dimensional data ($R^m$) to low-dimensional data ($R^d$), for example for ease of visualization,

Manifold : there should be a subspace that covers the high-dimensional data points well, without error. That subspace is called the manifold, and projecting onto the manifold maps the data to low dimensions.

Four objectives

Data visualization
Curse of dimensionality
- As the dimension grows, samples become very sparse in the space
- Manifold Hypothesis(assumption)
  - There is a dense low-dimensional subspace (=Manifold) that covers the samples; away from that manifold the density becomes very sparse.
Discovering most important features
- Small changes in the manifold coordinates produce small changes in the original data
- New distance metric → Euclidean distance ≠ distance on the manifold ⇒ samples that are close in the dominant feature representation can be found, and semantic interpolation becomes possible
- Entangled Manifold vs. Disentangled Manifold
  - When the dominant features are captured well, the manifold takes a disentangled form

Dimension reduction

Taxonomy

Linear
1. PCA
2. LDA
Non-linear
1. AE
2. t-SNE

Of these, the Autoencoder, a non-linear dimension reduction method, is what the following posts will look at in more detail!

Mastering GAN - Generative Adversarial Network(GAN)

Sun, 01 Jan 2023 13:46:27 GMT

https://www.youtube.com/watch?v=odpjk7_tGY0

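Goal of a Generative Model

Find $p_{\text{model}}(x)$ that approximates $p_{\text{data}}(x)$
$p_{\text{data}}(x)$ : distribution of the real training data
$p_{\text{model}}(x)$ : distribution of the data generated by the model
Minimize the difference between the two distributions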

Brief Introduction - GAN(Generative Adversarial Networks)

$D$(Discriminator Model)
$G$(Generator Model)

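The final goal is to train $G$; to do that, $D$ has to be trained first

STEP 1) Train $D$

The training objective is to classify real images as label 1 and fake images as label 0
Input : a fixed-size image vector
Output : binary, 1-dim, sigmoid (threshold 0.5)

STEP 2) Train $G$

Takes a random code (latent code $z$) and generates an image
The goal is to fool $D$ with the generated image → make the output of $D$ equal to 1
As training proceeds it generates fake images that look more and more real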

Objective(Loss) Function of GAN

$$ \min_{G} \max_{D} V(D, G) = E_{x\sim p_{data}(x)}[\log D(x)]+E_{z\sim p_z(z)}[\log(1-D(G(z)))] $$

A. Discriminator 관점

The goal is to maximize the objective function

Left Term : $E_{x\sim p_{data}(x)}[\log D(x)]$

$x\sim p_{data}(x)$ : probability density function; $x$ is sampled from the real data
Maximize $\log D(x)$ : given an input from the real data, $D$ should output a value close to 1; $D(x)$ outputs a value between 0 and 1

Right Term : $E_{z\sim p_z(z)}[\log(1-D(G(z)))]$

$z\sim p_z(z)$ : z is the input to the Generator, a 100-dimensional vector drawn at random from a standard normal/uniform distribution
$G(z)$ : the image generated from the randomly drawn vector; the output is a fake image
$D(G(z))$ : the fake image is fed to the Discriminator for Fake/Real binary classification
$\log(1-D(G(z)))$ : maximal when $D(G(z))$ is 0 = maximal when the fake image generated from $z$ is classified as fake = the training objective

B. Generator's viewpoint

The goal is to minimize the objective function

Left Term : $E_{x\sim p_{data}(x)}[\log D(x)]$

Discriminating real images is independent of the Generator

Right Term : $E_{z\sim p_z(z)}[\log(1-D(G(z)))]$

The goal is for the Discriminator to classify a fake input image as real
Minimal when $D(G(z))$ is 1 = minimal when the fake image generated from z is classified as real = the training objective

Pytorch Implementation

DCGAN Tutorial - PyTorch Tutorials 1.13.1+cu117 documentation

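import torch
import torch.nn as nn

D = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
    nn.Sigmoid())

G = nn.Sequential(
    nn.Linear(100, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Tanh())  # generated values lie in -1 ~ 1

criterion = nn.BCELoss()  # Binary Cross Entropy Loss(h(x), y)

d_optimizer = torch.optim.Adam(D.parameters(), lr=0.01)  # D maximizes V(D, G)
g_optimizer = torch.optim.Adam(G.parameters(), lr=0.01)  # G minimizes V(D, G)
# the two objectives conflict, so two separate optimizers are used

batch_size = 64
real_label = torch.ones(batch_size, 1)   # BCELoss needs tensor targets
fake_label = torch.zeros(batch_size, 1)

for step in range(100):  # fixed number of steps for illustration
    x = torch.rand(batch_size, 784) * 2 - 1  # placeholder for a batch of real images
    z = torch.randn(batch_size, 100)         # latent code

    # train D (G's output is detached so only D receives gradients)
    d_optimizer.zero_grad()
    loss = criterion(D(x), real_label) + criterion(D(G(z).detach()), fake_label)
    loss.backward()  # computes gradients for D's weights
    d_optimizer.step()

    # train G
    g_optimizer.zero_grad()
    loss = criterion(D(G(z)), real_label)
    loss.backward()
    g_optimizer.step()  # updates only the generator's parameters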

Binary Cross Entropy Loss $(h(x),y)$

$$ -y\log h(x) -(1-y)\log (1-h(x)) $$

criterion = nn.BCELoss()

Loss function

$$ \min_{G} \max_{D} V(D, G) = E_{x\sim p_{data}(x)}[\log D(x)]+E_{z\sim p_z(z)}[\log(1-D(G(z)))] $$

loss = criterion(D(x),1) + criterion(D(G(z)),0)

criterion(D(x),1) : $-\log D(x)$
criterion(D(G(z)),0) : $-\log(1-D(G(z)))$

**Note : because training uses Gradient Descent, a minus sign is attached to the original objective

**Caution when training $G$

The Discriminator must stay fixed while the Generator is trained.
Set up separate optimizers for the parameters of $D$ and $G$, and call only g_optimizer.step() when training the generator

Non-Saturating GAN Loss

Objective function of $G$

$$ \min_{G} V(G) = E_{z\sim p_z(z)}[\log(1-D(G(z)))] $$

Graph of $\log(1-x)$

Early in training $G$ produces very poor images, and $D$ is confident they are fake → $D$ outputs values very close to 0

⚠️ The gradient of $\log(1-x)$ is relatively small in this region

💡 Instead of minimizing $\log(1-x)$, maximize $\log(x)$

→ relatively large gradient

⇒ The Generator can escape its very poor initial state as quickly as possible
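A small numeric sketch of the gradient argument, treating the discriminator output $x = D(G(z))$ as the variable. The derivatives are the textbook ones; the sample values of $x$ are arbitrary.

```python
def grad_saturating(d):
    # d/dx log(1 - x) = -1 / (1 - x)
    return -1.0 / (1.0 - d)

def grad_non_saturating(d):
    # d/dx log(x) = 1 / x
    return 1.0 / d

for d in (0.01, 0.1, 0.5):
    print(d, abs(grad_saturating(d)), abs(grad_non_saturating(d)))
```

Near $x = 0$ the magnitude of the second gradient is much larger, which is the regime the generator starts in.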

Implementation

loss = criterion(D(G(z)), 1)

Why do GANs work?

Does optimizing the GAN objective really reduce the gap between the real and generated data distributions? → Yes

$$ \begin{aligned} &\min_{G}\max_{D} V(D,G) \\ \iff &\min_{G} JSD(p_{\text{data}} \| p_g)\end{aligned} $$

Let's first look at how a GAN is trained, then continue with the proof below!

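Blue dashed line : $D$, discriminative distribution
Black dotted line : $p_x$, the data-generating distribution (distribution of the original data)
Green solid line : $p_g(G)$, generative distribution
z line : domain from which $z$ is sampled uniformly
z → x arrows : the mapping $x=G(z)$, which turns the uniform samples into the non-uniform distribution $p_g$
x line : the mapped/transformed x

While the discriminative distribution is updated, the GAN learns to distinguish samples from the real data distribution $p_x$ from samples of the distribution $p_g(G)$ produced by the Generator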

$G$ contracts in regions of high density and expands in regions of low density of $p_g$.

(a) The fake distribution $p_g$ produced by the mapping $x=G(z)$

(b) The discriminator $D$ is updated toward $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$

(d) After repeated training $p_g = p_{\text{data}}$, the two distributions can no longer be told apart, and $D$ converges to $D(x) = \frac{1}{2}$

How can $p_g$ converge to $p_{\text{data}}$?
- Proof. Global Optimality
1. With $G$ fixed, the optimal $D$ is $D^*(x) = \frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}$

2. The global optimum of $G$ is at $p_g = p_{\text{data}}$.

    ![](https://velog.velcdn.com/images/sujin_yun_/post/2cd3d185-baed-42a4-a301-7c1b2ebf6591/image.png)
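A sketch of how steps 1 and 2 connect to the JSD claim, assuming the standard argument from the original GAN paper (KL and JSD are defined further down in these notes). Substituting $D^*$ into $V(D,G)$:

$$ \begin{aligned} V(D^*,G) &= E_{x\sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(x)}{p_{\text{data}}(x)+p_g(x)}\right] + E_{x\sim p_g}\left[\log \frac{p_g(x)}{p_{\text{data}}(x)+p_g(x)}\right] \\ &= -\log 4 + KL\left(p_{\text{data}} \,\Big\|\, \frac{p_{\text{data}}+p_g}{2}\right) + KL\left(p_g \,\Big\|\, \frac{p_{\text{data}}+p_g}{2}\right) \\ &= -\log 4 + 2\, JSD(p_{\text{data}} \| p_g) \end{aligned} $$

Since $JSD \ge 0$ with equality only when the two distributions coincide, the minimum over $G$ is $-\log 4$, reached at $p_g = p_{\text{data}}$.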

KL divergence

Note. BCE 

$$
BCE = \sum_{x\in\{0, 1\}}\left(-P(x)\log(Q(x))\right)
$$

- $P(x)$ = the desired target value
- $Q(x)$ = the value output by the model
- A measure of the gap between the ideal value $P$ we expect and the result $Q$ the model actually produces

KL divergence

When modeling a probability distribution $P$ with $Q$, where the discrete distributions $P$ and $Q$ are defined on the same sample space $\chi$, the KL divergence is:

$$
\begin{aligned}
D_{KL}(P\|Q) 
&= \sum_{x\in \chi}P(x)\log_b\left(\frac{P(x)}{Q(x)}\right)\\
&=-\sum_{x\in \chi}P(x)\log_b\left(\frac{Q(x)}{P(x)}\right)\\
&=-\sum_{x\in\chi}P(x)\log_b Q(x) + \sum_{x\in\chi}P(x)\log_b P(x)
\end{aligned}
$$

Rewriting this as expectations ($=\sum x\times\text{prob}(x)$) gives

$$
\Rightarrow -E_P[\log_bQ(x)]+E_P[\log_bP(x)]
$$

*Here $E_P$ denotes the expectation with respect to the distribution $P(x)$

Expanding this gives

$$
\Rightarrow H_P(Q) - H(P)
$$

*Here $H_P(Q)$ is the cross entropy of $Q$ relative to $P$, and $H(P)$ is the information entropy of $P$

$H_P(Q)$ : **the entropy obtained when the distribution Q is used in place of P while the samples actually come from P**

$H_P(Q) - H(P)$ : subtracting $H(P)$ gives the **change** relative to the original entropy

- always ≥ 0
- asymmetric : not a distance.

JSD(Jensen-Shannon Divergence)

Is there a way to turn the KL divergence into something usable as a distance?

Let $M$ be the average of the distributions $P$ and $Q$

$$ JSD(p\|q) = \frac{1}{2}KL(p\|M)+\frac{1}{2}KL(q\|M), \quad \text{where } M=\frac{1}{2}(p+q) $$
- symmetric → usable as a distance-like measure
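The asymmetry of KL and the symmetry of JSD can be seen on a toy example. This is a minimal sketch using natural logarithms; the two distributions `p` and `q` are made-up values.

```python
import math

def kl(p, q):
    """KL(p || q) for discrete distributions given as lists of probabilities."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon divergence with M = (p + q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.7, 0.2, 0.1]   # hypothetical distribution
q = [0.1, 0.3, 0.6]   # hypothetical distribution

print(kl(p, q), kl(q, p))     # differ: KL is asymmetric
print(jsd(p, q), jsd(q, p))   # equal: JSD is symmetric
```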

Algorithm

Variations of GAN

1. DCGAN(Deep Convolutional GAN)

Discriminator
- CNN
Generator
- deep convolutional NN
- deconvolution (transposed convolution) → upsampling
No pooling layer
Strided convolution / deconvolution (stride ≥ 2) instead
BN
Adam optimizer
- Momentum terms ($\beta_1$, $\beta_2$) = 0.5, 0.999
- Empirically, these values worked well for 64x64 images
Arithmetic on the Generator's input latent vectors $z$ is possible! (linear relationships)
- ex. man with glasses - man without glasses + woman without glasses ⇒ woman with glasses

2. LSGAN(Least Squares GAN)

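Original GAN loss → $G$ only has to fool $D$
Blue line = decision boundary of $D$ → below it real, above it fake
Red points → real images
Blue points → fake images
- ⇒ blue points close to the red points are well-made fake images
Pink points → fake images that fool the discriminator completely (they lie deep inside the decision boundary)

💡 Does that make the pink points well-made images? ⇒ NO

🤷 why? ⇒ A well-made image is one that is close to the real images; fooling the discriminator completely does not guarantee similarity to real data.

⇒ LSGAN pulls the pink points up toward the decision boundary.

Vanilla GAN → LSGAN

Remove the sigmoid from the last layer of $D$
$G$ is unchanged
Cross entropy loss ⇒ Least squares loss
LSGAN - loss of D
- (D(x)-1)**2 → push D(x) for real images toward 1
- D(G(z))**2 → push D(G(z)) for fake images toward 0
LSGAN - loss of G
- (D(G(z))-1)**2 → push D(G(z)) for fake images toward 1
Difference from cross entropy loss :
- outputs are pulled to be as close to 1 as possible, so samples that lie far on the correct side of the boundary are penalized too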

*Note. Let's compare them in code!

Vanilla GAN

import torch
import torch.nn as nn

D = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 1),
    nn.Sigmoid())

G = nn.Sequential(
    nn.Linear(100, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Tanh())

# Placeholder inputs: x stands in for a batch of real images, z is latent noise
x = torch.rand(64, 784) * 2 - 1
z = torch.randn(64, 100)

#Loss of D
D_loss = -torch.mean(torch.log(D(x))) - torch.mean(torch.log(1-D(G(z))))

#Loss of G
G_loss = -torch.mean(torch.log(D(G(z))))

LSGAN

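import torch
import torch.nn as nn

D = nn.Sequential(
    nn.Linear(784, 128),
    nn.ReLU(),
    nn.Linear(128, 1)) #1. Remove sigmoid

G = nn.Sequential(
    nn.Linear(100, 128),
    nn.ReLU(),
    nn.Linear(128, 784),
    nn.Tanh())

# Placeholder inputs: x stands in for a batch of real images, z is latent noise
x = torch.rand(64, 784) * 2 - 1
z = torch.randn(64, 100)

#Loss of D (least squares, minimized: no leading minus sign)
D_loss = torch.mean((D(x)-1)**2) + torch.mean(D(G(z))**2)

#Loss of G
G_loss = torch.mean((D(G(z))-1)**2)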

3. SGAN(Semi-Supervised GAN)

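MNIST data
$D$ does not just separate real from fake: it classifies the digit class (0~9) plus an added Fake class, 11 classes in total ⇒ softmax ⇒ one-hot vector
$G$ takes a one-hot vector + latent vector $z$ as input and generates a fake image
- $D$ must classify this image as fake
$D$ needs labels (supervised learning), while the images made by the generator are classified as generated without needing labels (unsupervised learning) ⇒ Semi-Supervised GAN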

4. ACGAN(Auxiliary Classifier GAN)

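$D$ → Multi-task learning
1. real image vs fake image (0 or 1) → sigmoid
2. which digit 0~9 the image shows, regardless of whether it is real → softmax
- focuses on classifying images that contain noise
$G$
- input = one-hot vector + latent vector $z$
- on the fake images generated here, $D$ performs the following two tasks
  1. real image vs fake image (0 or 1)
  2. which digit 0~9 the image shows, regardless of whether it is real
- has a data augmentation effect (images containing noise)

⇒ The loss is the sum of the losses of the two tasks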

Extensions of GAN

1. CycleGAN : Unpaired Image-to-Image Translation

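A task that changes the style or domain of an image

💡 Could such a task be learned in an unsupervised setting, without paired images?

⇒ How does it work?

ex. Translating a zebra image into a horse

$D$
- receives only horse images and learns that they are real
$G$
- takes a real image as input instead of a latent code $z$
- encoder-decoder structure that reduces the dimensionality and then restores it
- takes a zebra image and transforms it to look like a horse in order to fool $D$

**Note. The zebra must be turned into a horse while the layout of the image is preserved!

For $G_{\text{BA}}$ to restore the original image, the shape must change as little as possible → train to reduce the reconstruction error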

Implementation

https://github.com/yunjey/mnist-svhn-transfer

2. StackGAN : Text to Photo-realistic Image Synthesis

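Given a text, generate the corresponding image

⚠️ Problem: high-resolution 128x128 or 256x256 images are hard to generate directly from the $z$ vector

💡 First generate a low-resolution 64x64 image, then upsample it with a second Generator that builds on it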

Understanding Convolutions on Graphs

Wed, 21 Dec 2022 08:28:43 GMT

Understanding Convolutions on Graphs

The Challenges of Computation on Graphs

Lack of Consistent Structure

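ex. molecules; each contains a different number/type of atoms and a different number/type of connections

Node-Order Equivariance

Nodes have no inherent order, but one is assigned arbitrarily ⇒ computations should be node-order equivariant

Scalability

graphs can be very large, but they are usually sparse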

Problem Setting and Notation

Node Classification: Classifying individual nodes.
Graph Classification: Classifying entire graphs.
Node Clustering: Grouping together similar nodes based on connectivity.
Link Prediction: Predicting missing links.
Influence Maximization: Identifying influential nodes.

→ Solving these problems first requires learning node representations

$$ G = (V,E) $$

Undirected Graph
$x_v$ : node features for node $v \in V$
$h_v^{(k)}$ : representation of node $v$ after the $k$-th layer (iteration)
$M_v$ : the row of matrix $M$ holding the properties of node $v$

Extending Convolutions to Graphs

ordinary convolutions : not node-order invariant

For images, the result depends on the position of each pixel

⇒ How to generalize convolutions to general graphs?

Polynomial Filters on Graphs

Just as a CNN applies a localized filter to neighboring pixels, apply a polynomial filter to neighboring nodes

The Graph Laplacian

A : Adjacency Matrix
D : Degree Matrix

$$ D_v = \sum_u{A_{vu}} $$

L = Graph Laplacian, $n\times n$ matrix

$$ L = D-A $$
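A minimal sketch of building $L = D - A$ from an adjacency matrix stored as nested lists. The 4-node path graph is an arbitrary example, not taken from the article.

```python
def laplacian(adj):
    """Return L = D - A for a symmetric 0/1 adjacency matrix given as nested lists."""
    n = len(adj)
    degree = [sum(row) for row in adj]          # D_v = sum_u A_vu
    return [[(degree[v] if v == u else 0) - adj[v][u] for u in range(n)]
            for v in range(n)]

# Path graph 0 - 1 - 2 - 3
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
L = laplacian(A)
for row in L:
    print(row)
```

Each row of $L$ sums to zero, because the degree on the diagonal cancels the $-1$ entries of the neighbors.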

Polynomials of the Laplacian

Build polynomials of the graph Laplacian
- polynomial : plays the role of the filter in a CNN
- coefficients w : weights of the filter

$$ p_w(L) = w_0I_n + w_1L + w_2L^2 + \dots + w_dL^d = \sum_{i=0}^d w_iL^i $$

The coefficients of this polynomial can be written as the vector → $w = [w_0, \dots, w_d]$

For every $w$, $p_w(L)$ is an $n \times n$ matrix

Fixing a node order (indicated by the alphabets) and collecting all node features into a single vector x

If each node feature $x_v$ is a real number, stacking them gives a single vector $x \in \mathbb{R}^n$.
Convolving the feature vector $x$ with the polynomial filter $p_w$ can be written as:

$$ x' = p_w(L)\ x $$

💡 1. How do the coefficients $w$ in $p_w(L)= \sum_{i=0}^d w_iL^i$ affect the convolution?

ex1. $w_0 = 1$, $w_i = 0$ for $i \ge 1$

$$ x' = p_w(L)\ x = \sum_{i=0}^d w_iL^ix = w_0I_nx = x $$

ex2. $w_1 = 1$, $w_i = 0$ for $i \neq 1$

$$ \begin{aligned} x'_v = (Lx)_v &= L_vx \\ &=\sum_{u \in G}L_{vu}x_u \\ &= \sum_{u \in G}(D_{vu}-A_{vu})x_u \\ &=D_vx_v - \sum_{u \in N(v)}x_u \end{aligned} $$

Adjacency matrix * x → sum of the neighboring nodes' features

💡 2. How does the degree d of the polynomial affect the convolution?

Lemma 5.2 from Wavelets on graphs via spectral graph theory


Let $G$ be a weighted graph, $\mathcal{L}$ the graph Laplacian (normalized or non-normalized) and s > 0 an integer. For any two vertices m and n, if $d_G(m,n) > s$ then $(\mathcal{L}^s)_{m,n} = 0$.

**Proof.**

![](https://velog.velcdn.com/images/sujin_yun_/post/337fcea7-7661-4256-8f1b-42e66554760a/image.png)


![](https://velog.velcdn.com/images/sujin_yun_/post/bb5e97a2-2671-438e-b584-e3454d8fadd8/image.png)

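- a sequence of length s-1, $k_1, k_2, \dots, k_{s-1}$ with $1≤k_r≤N$
- the entry $(\mathcal{L}^s)_{m,n}$ of the $s$-th power of the Laplacian is a sum over all such sequences of intermediate nodes from m to n, where each term is the product of the Laplacian entries along the sequence
- Arguing by contraposition: for $(\mathcal{L}^s)_{m,n}$ to be non-zero, at least one term of the sum (a product of Laplacian entries) must be non-zero.
- That is, there exists a sequence $k_1, k_2, \dots, k_{s-1}$ with $\mathcal{L}_{m,k_1} \neq 0, \mathcal{L}_{k_1,k_2} \neq 0, \dots,\mathcal{L}_{k_{s-1},n} \neq 0.$ ⇒ there is a walk with s steps between nodes m and n (a step may stay at the same node, since diagonal entries are non-zero).
- Removing repeated nodes gives "a path of length at most s between nodes m and n".
- Hence, if no path of length at most s exists between m and n, i.e. the distance between the two nodes is greater than s, then $(\mathcal{L}^s)_{m,n} = 0$.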

Returning to the effect of the polynomial degree d on the convolution,

$$ \text{dist}_G(v,u) > i \Rightarrow L_{vu}^i = 0 $$

Convolving $x$ with a polynomial $p_w(L)$ of degree $d$ to obtain $x'$ gives

$$ \begin{aligned} x'_v = (p_w(L)x)_v &= (p_w(L))_vx \\ &=\sum_{i=0}^{d}w_iL_{v}^ix \\ &= \sum_{i=0}^dw_i\sum_{u \in G}L_{vu}^ix_u \\ &=\sum_{i=0}^dw_i \sum_{u \in G, \ \text{dist}_G(v,u) \le i}L_{vu}^ix_u \end{aligned} $$

The convolution at node v only involves nodes no more than d hops away
- localized polynomial filters; the degree d controls how localized the filter is
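The localization property can be checked numerically. The sketch below applies a degree-2 filter on a 5-node path graph; the weights `w` and the one-hot feature vector are hypothetical, and integers are used so the result is exact.

```python
def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y))) for j in range(len(Y[0]))]
            for i in range(len(X))]

def poly_filter(L, w):
    """Return p_w(L) = sum_i w[i] * L^i as a nested list."""
    n = len(L)
    power = [[1 if i == j else 0 for j in range(n)] for i in range(n)]  # L^0 = I
    out = [[0] * n for _ in range(n)]
    for w_i in w:
        out = [[out[i][j] + w_i * power[i][j] for j in range(n)] for i in range(n)]
        power = matmul(power, L)
    return out

# Laplacian of the path graph 0 - 1 - 2 - 3 - 4
L = [[ 1, -1,  0,  0,  0],
     [-1,  2, -1,  0,  0],
     [ 0, -1,  2, -1,  0],
     [ 0,  0, -1,  2, -1],
     [ 0,  0,  0, -1,  1]]

w = [1, 2, 3]            # hypothetical weights, degree d = 2
x = [1, 0, 0, 0, 0]      # feature only on node 0
P = poly_filter(L, w)
x_new = [sum(P[v][u] * x[u] for u in range(5)) for v in range(5)]
print(x_new)             # nodes 3 and 4 are more than 2 hops from node 0
```

The feature placed on node 0 only reaches nodes 0, 1 and 2; the outputs at nodes 3 and 4 stay zero for any choice of `w` with $d = 2$.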

ChebNet

The polynomial filter used above was defined as

$$ p_w(L) = \sum_{i=0}^d w_iL^i $$

ChebNet uses a polynomial filter of the following modified form

$$ p_w(L) = \sum_{i=1}^d w_iT_i(\tilde{L}) $$

$T_i$ : the degree-$i$ Chebyshev polynomial of the first kind
- Chebyshev polynomials
$\tilde{L}$ : the normalized Laplacian defined using the largest eigenvalue of L

$$ \tilde{L} = \frac{2L}{\lambda_{\text{max}}(L)}-I_n $$
- L → positive semi-definite : all eigenvalues of L are non-negative.
- If $\lambda_{\text{max}} > 1$, the entries of powers of L grow rapidly; $\tilde{L}$ is a scaled-down version of L whose eigenvalues are guaranteed to lie in [-1,1], which keeps the powers from blowing up.
  - for the unnormalized Laplacian $L$, the magnitude of the higher-order coefficients has to be restricted
  - for the normalized Laplacian $\tilde{L}$, larger coefficient values are allowed
- Chebyshev polynomials → properties that make interpolation numerically stable
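A scalar sketch of the two ingredients above: the eigenvalue rescaling behind $\tilde{L}$ and the standard three-term Chebyshev recurrence. The value of `lam_max` is hypothetical; in ChebNet the same recurrence is applied to the matrix $\tilde{L}$ rather than to a scalar.

```python
import math

def chebyshev(k, x):
    """T_k(x) via the recurrence T_0 = 1, T_1 = x, T_k = 2x T_{k-1} - T_{k-2}."""
    t_prev, t_curr = 1.0, x
    if k == 0:
        return t_prev
    for _ in range(k - 1):
        t_prev, t_curr = t_curr, 2 * x * t_curr - t_prev
    return t_curr

def rescale(lam, lam_max):
    """Eigenvalue of L-tilde = 2L/lam_max - I corresponding to eigenvalue lam of L."""
    return 2 * lam / lam_max - 1

lam_max = 4.0                       # hypothetical largest eigenvalue of L
for lam in (0.0, 1.0, 4.0):
    s = rescale(lam, lam_max)
    # lam**10 grows quickly, while T_10 of the rescaled eigenvalue stays in [-1, 1]
    print(lam, lam ** 10, chebyshev(10, s))
```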

Polynomial Filters are Node-Order Equivariant

Polynomial filter

independent of node order ← easy to see for a degree-1 $p_w$, where the output aggregates the features of neighboring nodes (ex. a sum does not depend on order)
proof.

permutation matrix $P$ : a square binary matrix with exactly one 1 in each row and each column and 0 elsewhere. Multiplying another matrix $A$ by it from the left permutes the rows of $A$ (= a matrix that performs row exchanges)

$$ PP^T =P^TP = I_n $$

function $f$ : node-order equivariant

$$ f(Px) = Pf(x) $$

Using a permutation matrix $P$ to produce a new node order results in the following transformations.

$$ \begin{aligned} x &\rightarrow Px \\ L &\rightarrow PLP^T \\ L^i &\rightarrow PL^iP^T \end{aligned} $$

For a polynomial filter $f(x) = p_w(L)x$, this gives:

$$ \begin{aligned} f(Px) &= \sum^{d}_{i=0}w_i(PL^iP^T)(Px) \\ &=P\sum^{d}_{i=0}w_iL^ix \\ &=Pf(x) \end{aligned} $$
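The equivariance identity can be verified on a small example. In this sketch the permutation is stored as an index list rather than a matrix, and the features, weights and node order are all hypothetical integers so that equality is exact.

```python
def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def poly_conv(L, w, x):
    """f(x) = sum_i w[i] * L^i x, computed with repeated matrix-vector products."""
    out = [0] * len(x)
    cur = list(x)                       # L^0 x
    for w_i in w:
        out = [o + w_i * c for o, c in zip(out, cur)]
        cur = matvec(L, cur)            # next power of L applied to x
    return out

def permute_vec(perm, x):               # (Px)_i = x_{perm[i]}
    return [x[p] for p in perm]

def permute_mat(perm, M):               # (P M P^T)_{ij} = M_{perm[i], perm[j]}
    return [[M[p][q] for q in perm] for p in perm]

# Laplacian of the path graph 0 - 1 - 2 - 3
L = [[1, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 1]]
x = [5, -2, 7, 1]          # hypothetical features
w = [1, 2, 3]              # hypothetical filter weights
perm = [2, 0, 3, 1]        # hypothetical new node order

lhs = poly_conv(permute_mat(perm, L), w, permute_vec(perm, x))   # f(Px) on the relabelled graph
rhs = permute_vec(perm, poly_conv(L, w, x))                      # P f(x)
print(lhs == rhs)
```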

Embedding Computation

$K$ different polynomial filter layer
- learnable weights $w^{(k)}$ of the $k^{th}$ layer

Compute the matrix $p^{(k)}$ by applying the polynomial filter with weights $w^{(k)}$ to $L$
Multiply $p^{(k)}$ by $h^{(k-1)}$ to compute $g^{(k)}$
Apply the non-linearity $\sigma$ to $g^{(k)}$ to compute $h^{(k)}$
As with weight-sharing in CNNs (the same filter applied across grid positions), a filter with the same weights is applied at every node

Modern Graph Neural Networks

$$ p_w(L) = w_0I_n + w_1L + w_2L^2 + \dots + w_dL^d $$

$w_1 = 1$, all other $w_i$ are 0
$p_w(L) = L$
degree d of $p_w$ ⇒ how localized the filter is

$$ \begin{aligned} x'_v &= L_vx \\ &=\sum_{u \in G}L_{vu}x_u \\ &= \sum_{u \in G}(D_{vu}-A_{vu})x_u \\ &=D_vx_v - \sum_{u \in N(v)}x_u \end{aligned} $$

Aggregation : aggregate the features $x_u$ of the immediate neighbors
Combination : combine with the node's own feature $x_v$

💡 Beyond what polynomial filters allow, what if we use other kinds of “Aggregation” and “Combination”?

If the aggregation is node-order equivariant, the whole convolution is node-order equivariant.
This can be understood as “message-passing” between immediately neighboring nodes.
Repeating a 1-hop localized convolution K times extends the receptive field to nodes up to K hops away.

Embedding Computation

Many GNN architectures use message-passing as their backbone

Graph Convolutional Networks (GCN)

- at each step k, the learnable function $f^{(k)}$ and matrices $W^{(k)}$, $B^{(k)}$ are shared across all nodes

Graph Attention Networks (GAT)

- at each step k, the learnable function $f^{(k)}$, matrices $W^{(k)}$, $B^{(k)}$ and attention $A^{(k)}$ are shared across all nodes
- multi-head attention

Graph Sample and Aggregate (GraphSAGE)

- at each step k, the learnable function $f^{(k)}$, matrix $W^{(k)}$ and $\text{AGG}$ are shared across all nodes
- choices of $\text{AGG}$
    - Mean (similar to GCN)

        ![](https://velog.velcdn.com/images/sujin_yun_/post/2211fce0-b814-451c-921a-05b41e44266b/image.png)


    - Dimension-wise Maximum

        ![](https://velog.velcdn.com/images/sujin_yun_/post/6459b63b-e2dc-4f7c-9f89-cde42a54e795/image.png)


    - LSTM (after ordering the sequence of neighbours)
- neighbourhood sampling : a fixed number of neighbors is sampled at random regardless of how many neighbors a node has, which lets GraphSAGE scale to very large graphs at the cost of higher variance in the node embeddings

Graph Isomorphism Network (GIN)

- at each step k, the learnable function $f^{(k)}$ and real-valued parameter $\epsilon^{(k)}$ are shared across all nodes

Prediction

The final embeddings are used to make a prediction for each node as follows.

$$ \hat{y_v} = \text{PREDICT}(h_v^{(K)}) $$

$\text{PREDICT}$ : another neural network, trained jointly with the GNN

From Local to Global Convolutions

Repeated message passing can bring information from the whole graph to a single node, but there are also ways to perform a global convolution directly

Spectral Convolutions

💡 For a feature vector $x$, the Laplacian $L$ quantifies how smooth $x$ is with respect to $G$

For a normalized $x$, i.e. one satisfying $\sum_{i=1}^{n} x^2_i = 1$,

the Rayleigh quotient is defined as follows.

$$ R_L(x) = \frac{x^TLx}{x^Tx} = \frac{\sum_{(i,j)\in E}(x_i-x_j)^2}{\sum_{i}x_i^2} = \sum_{(i,j)\in E}(x_i-x_j)^2 $$

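An off-diagonal entry of the Laplacian is 0 when the two nodes are not adjacent.
$L = D-A$

$R_L(x)$ reduces to the sum of squared differences between the features of adjacent nodes
= Smoothness
When adjacent node features are similar, $R_L(x)$ is small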
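A quick numeric sketch of the smoothness interpretation. It computes the Rayleigh quotient through the edge-sum form and cross-checks it against $x^TLx$; the path graph and the two feature vectors are made up for illustration.

```python
def rayleigh_quotient(edges, x):
    """R_L(x) = x^T L x / x^T x, using x^T L x = sum over edges of (x_i - x_j)^2."""
    num = sum((x[i] - x[j]) ** 2 for i, j in edges)
    den = sum(v * v for v in x)
    return num / den

def quad_form(L, x):
    """x^T L x computed directly from the matrix."""
    return sum(x[i] * L[i][j] * x[j] for i in range(len(x)) for j in range(len(x)))

edges = [(0, 1), (1, 2), (2, 3)]                                  # path graph 0 - 1 - 2 - 3
L = [[1, -1, 0, 0], [-1, 2, -1, 0], [0, -1, 2, -1], [0, 0, -1, 1]]

smooth = [1, 1, 1, 1]      # identical features on neighbors
rough = [1, -1, 1, -1]     # features flip sign across every edge

print(rayleigh_quotient(edges, smooth), rayleigh_quotient(edges, rough))
```

The constant vector gives the smallest possible value, which matches the smallest Laplacian eigenvalue being 0.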

Laplacian matrix $L$

Real, symmetric matrix
eigenvalues : $\lambda_1 ≤ \dots ≤ \lambda_n$
the corresponding eigenvectors $u_1, \dots, u_n$ are orthonormal

$$ u_{k_1}^Tu_{k_2} = \begin{cases} 1 & \text{if }k_1 = k_2,\\ 0 & \text{if }k_1 \neq k_2. \end{cases} $$

$$ \text{argmin}_{x,\ x\perp\{u_1, \dots, u_{i-1}\}} R_L(x) = u_i. $$

$$ \text{min}_{x,\ x\perp\{u_1, \dots, u_{i-1}\}} R_L(x) = \lambda_i. $$

eigenvectors of L → successively less smooth
Calling the eigenvalues of L its “spectrum”, the spectral decomposition of L is

$$ L = U\Lambda U^T $$
- $\Lambda$ : diagonal matrix of the sorted eigenvalues = $\begin{bmatrix}\lambda_{1} & & \\ & \ddots & \\ & & \lambda_{n}\end{bmatrix}$
- $U$ : matrix of eigenvectors, ordered to match the sorted eigenvalues = $\begin{bmatrix}u_1 \dots u_n\end{bmatrix}$

The eigenvectors satisfy the orthonormality condition : $U^TU = I$

The eigenvectors form a basis of $\mathbb{R}^n$, so a feature vector $x$ can be written as a linear combination of them

$$ x = \sum_{i=1}^n \hat{x_i}u_i = U\hat{x} $$

$\hat{x}$ : the vector of coefficients $\begin{bmatrix}\hat{x}_1 \dots \hat{x}_n\end{bmatrix}$, the spectral representation of the vector $x$
Applying the orthonormality condition to $x = U\hat{x}$ gives the conversion between the natural representation $x$ and the spectral representation $\hat{x}$

$$ x = U\hat{x}\ \Longleftrightarrow \ U^Tx = \hat{x} $$

Embedding Computation

$K$ layers → each layer's learnable parameters $\hat{w}^{(k)}$ = filter weights
number of eigenvectors used to compute the spectral representation = $m$ = number of weights needed per layer
Convolving in the spectral domain can use fewer parameters than direct convolution
Because the Laplacian eigenvectors are smooth, using the spectral representation naturally strengthens the inductive bias that neighboring nodes have similar representations
Assuming 1-d node features, the output of each layer is a vector of node representations $h^{(k)}$, where each row is the representation of one node

$$ h^{(k)} = \begin{bmatrix}h^{(k)}_{1}\\ \vdots \\ h^{(k)}_{n}\end{bmatrix} \quad \scriptsize \text{for each } k = 0,1,2,\dots \text{ up to }K $$

Graph $G$

adjacency matrix $A$
degree matrix $D$
Laplacian $L = D-A$

$$ L = U\Lambda U^T $$

$\Lambda$ : diagonal matrix of the sorted eigenvalues = $\begin{bmatrix}\lambda_{1} & & \\ & \ddots & \\ & & \lambda_{n}\end{bmatrix}$
$U$ : matrix of eigenvectors, ordered to match the sorted eigenvalues = $\begin{bmatrix}u_1 \dots u_n\end{bmatrix}$

$$ \scriptsize \color{#FE6100} \text{Computed node embeddings} \quad \color{#4D9DB5} \text{Learnable parameters} $$

Start with original feature $h^{(0)} = x$
Iterate, for k=1,2,… up to K
1. Convert previous feature $\color{#FE6100}{h}^{(k - 1)}$ to its spectral representation $\hat{h}^{(k - 1)}$
  
  $$ \hat{h}^{(k - 1)}= U^T_mh^{(k - 1)} $$
2. Convolve with filter weights ${\color{#4D9DB5} \hat{w}^{(k)}}$ in the spectral domain to get ${\hat{g}^{(k)}}$, $\odot$ = element-wise multiplication
  
  $$ \hat{g}^{(k)} = \hat{w}^{(k)} \odot \hat{h}^{(k-1)} $$
3. Convert $\hat{g}^{(k)}$ back to its natural representation ${g}^{(k)}$
  
  $$ g^{(k)} = U_m\hat{g}^{(k)} $$
4. Apply a non-linearity $\sigma$ to ${g}^{(k)}$ to get $\color{#FE6100} {h}^{(k)}$
  
  $$ h^{(k)} = \sigma (g^{(k)}) $$
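The four steps can be traced on the smallest possible graph. This sketch assumes a 2-node graph with one edge, whose Laplacian eigenvectors are hard-coded instead of computed, uses $m = n = 2$, and picks ReLU as the non-linearity; the filter weights are hypothetical.

```python
import math

def matvec(M, x):
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

def transpose(M):
    return [list(col) for col in zip(*M)]

s = 1 / math.sqrt(2)
# Columns are the eigenvectors of L = [[1, -1], [-1, 1]] for eigenvalues 0 and 2
U = [[s,  s],
     [s, -s]]

def spectral_layer(h, w_hat):
    h_hat = matvec(transpose(U), h)                   # 1. natural -> spectral
    g_hat = [w * v for w, v in zip(w_hat, h_hat)]     # 2. element-wise filtering
    g = matvec(U, g_hat)                              # 3. spectral -> natural
    return [max(0.0, v) for v in g]                   # 4. non-linearity (ReLU here)

h0 = [3.0, 1.0]
h1 = spectral_layer(h0, w_hat=[1.0, 0.0])   # keep only the smoothest component
print(h1)                                   # both nodes end up near the average of h0
```

Zeroing the weight of the second eigenvector acts as a low-pass filter: only the constant component survives, so both nodes receive the mean feature.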

Spectral Convolutions are Node-Order Equivariant

Can be proven with an approach similar to the node-order equivariance proof for Laplacian polynomial filters

proof.

changing the node order : multiplication by a permutation matrix $P$ (= a matrix that performs row exchanges)

This changes the quantities as follows

$$ \begin{aligned} x &\rightarrow Px \\ A&\rightarrow PAP^T \\ L&\rightarrow PLP^T \\ U_m &\rightarrow PU_m \end{aligned} $$

and the embedding computation changes as follows

$$ \begin{aligned} \hat{x} &\rightarrow(PU_m)^T(Px) = U^T_mx = \hat{x} \\ \hat{w}&\rightarrow (PU_m)^T(Pw) = U^T_mw = \hat{w} \\ \hat{g}&\rightarrow \hat{g} \\ g &\rightarrow (PU_m)\hat{g} = P(U_m\hat{g}) = Pg \end{aligned} $$

Since $\sigma$ is applied elementwise,

$$ f(Px) = \sigma(Pg) = P \sigma(g) = Pf(x) $$

+) The spectral quantities $\hat{x}, \hat{w}, \hat{g}$ do not change under a node permutation.

Drawbacks of spectral convolution

The eigenvector matrix $U_m$ has to be computed from $L$, which can be infeasible for very large graphs.
Even if $U_m$ can be computed, the global convolution requires repeated multiplication by $U_m$ and $U^T_m$, which can be inefficient.
The learned filters are tied to the spectral decomposition of the input graph's Laplacian L, so they are graph-specific and hard to transfer to graphs with a different structure.

Global Propagation via Graph Embeddings

Compute an embedding of the whole graph by pooling node and edge information, use that graph embedding to update the node embeddings, and repeat → Relational inductive biases, deep learning, and graph networks

This tends to ignore the graph topology that spectral convolutions can capture

Learning GNN Parameters

embedding computations → completely differentiable ⇒ end-to-end training is possible

Loss function $\mathcal{L}$ by task

Node Classification

$$ \mathcal{L}(y_v,\hat{y}_v)=-\sum_c y_{vc}\log \hat{y}_{vc} $$

$\hat{y}_{vc}$ : predicted probability that node v belongs to class c
cross-entropy loss
In the semi-supervised setting it can be defined as follows

$$ \mathcal{L}_G = \frac{\sum_{v \in \text{Lab}(G)}\mathcal{L}(y_v,\hat{y}_v)}{|\text{Lab}(G)|} $$
- $\text{Lab}(G)$ : the set of labelled nodes; the loss is computed over these only
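A minimal sketch of the semi-supervised loss: the per-node cross-entropy is averaged over labelled nodes only. The encoding of unlabelled nodes as `None`, and all labels and probabilities, are assumptions made for this example.

```python
import math

def node_loss(y_true, y_prob):
    """Cross-entropy -sum_c y_vc log(yhat_vc) for one node (y_true is one-hot)."""
    return -sum(t * math.log(p) for t, p in zip(y_true, y_prob) if t > 0)

def graph_loss(labels, probs):
    """Average node loss over labelled nodes; labels[v] is None when v is unlabelled."""
    labelled = [v for v, y in enumerate(labels) if y is not None]
    return sum(node_loss(labels[v], probs[v]) for v in labelled) / len(labelled)

labels = [[1, 0], None, [0, 1]]                       # node 1 has no label
probs = [[0.5, 0.5], [0.9, 0.1], [0.75, 0.25]]        # hypothetical model outputs
loss = graph_loss(labels, probs)
print(loss)
```

The prediction for the unlabelled node does not enter the loss, no matter how confident it is.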

Graph Classification

Aggregate the node representations to produce a representation of the whole graph

Pooling

$$ h_G = \text{PREDICT}_G(\text{AGG}_{v \in G}(\{h_v\})) $$
- SortPool (An End-to-End Deep Learning Architecture for Graph Classification) : sorts the vertices of the graph to obtain a fixed-size, node-order invariant graph representation, which is then fed to a standard NN architecture
- DiffPool (Hierarchical Graph Representation Learning with Differentiable Pooling) : clusters the vertices, builds a coarser graph with clusters in place of nodes, and applies a GNN to the coarser graph. Repeated until a single cluster remains
- SAGPool (Self-Attention Graph Pooling) : applies a GNN to learn node scores, keeps only the top-scoring nodes and discards the rest. Repeated until a single node remains

Link Prediction

Sample adjacent and non-adjacent node pairs and use their embedding vectors to predict whether an edge exists

$$ \begin{aligned} \mathcal{L}(y_v,y_u,e_{vu})&=-e_{vu}\log (p_{vu})\ -\ (1-e_{vu})\log(1-p_{vu}) \\ p_{vu} &= \sigma(y_v^Ty_u) \end{aligned} $$

$e_{vu}$ : 1 if there is an edge between nodes v and u, otherwise 0
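The link prediction loss written out for a single node pair, as a sketch in plain Python. The embedding vectors are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def link_loss(y_v, y_u, e_vu):
    """Binary cross-entropy on p_vu = sigmoid(y_v . y_u) against the edge indicator e_vu."""
    p_vu = sigmoid(sum(a * b for a, b in zip(y_v, y_u)))
    return -e_vu * math.log(p_vu) - (1 - e_vu) * math.log(1 - p_vu)

y_v = [1.0, 2.0]    # hypothetical embedding of node v
y_u = [0.5, 1.5]    # hypothetical embedding of node u

print(link_loss(y_v, y_u, 1))   # small: aligned embeddings agree with an existing edge
print(link_loss(y_v, y_u, 0))   # large: aligned embeddings contradict a missing edge
```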

Others

Techniques used in NLP models such as ELMo and BERT can be applied to GNNs

local graph properties (eg. node degrees, clustering coefficient, masked node attributes)
global graph properties (eg. pairwise distances, masked global attributes)

Among self-supervised techniques that encourage adjacent nodes to have similar embeddings, some take random-walk based approaches such as node2vec and DeepWalk

$$ \mathcal{L}_G = \sum_v\sum_{u \in N_R(v)} -\log \frac{\exp z_v^Tz_u}{\sum_{u'}\exp z_{u'}^Tz_u} $$

$N_R(v)$ : multi-set of nodes visited on random walks starting from node v

For very large graphs, Noise Contrastive Estimation is sometimes applied

A Gentle Introduction to Graph Neural Networks

Sun, 18 Dec 2022 05:27:37 GMT

A Gentle Introduction to Graph Neural Networks

Graphs and where to find them

Images as graphs

224x224x3 floats → 1 pixel = 1 node

each node's feature vector : 3-dimensional vector representing RGB value

each non-border pixel has 8 neighbors → adjacency matrix
Text as graphs

a chain of words/tokens/characters (= nodes) = directed graph

ex. Graphs are all around us ⇒ (Graphs) → (are) → (all) → (around) → (us)

However, graphs are rarely used as the encoding for images and text

→ Since they have regular structure!

Graph-valued data in the wild

1. **Molecules as graphs**
2. **Social networks as graphs**
3. **Citation networks as graphs**
4. **Other examples**

What types of problems have graph structured data?

1. Graph-level task

Goal : predict the property of an entire graph

ex. given the structure of a molecule, determine its properties

2. Node-level task

Goal : predict the identity or role of each node within a graph

ex. classifying the attributes of nodes in a network

3. Edge-level task

Goal : predict the property or presence of edges in a graph

ex. predicting whether two nodes are connected, classifying connection types

The challenges of using graphs in machine learning

4 Types of information on Graph : nodes, edges, global-context, connectivity

nodes, edges, global-context → assign an id to each node and build matrices

connectivity → Adjacency Matrix

👍) easily tensorizable

👎) with many nodes and few connections the adjacency matrix becomes very sparse and space-inefficient

👎) many different adjacency matrices encode the same connectivity, and there is no guarantee that a deep NN produces the same result for each of them (= not permutation invariant)

A memory-efficient way to represent a sparse matrix = adjacency list
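A small sketch of the conversion, assuming an undirected graph so that each edge is stored once as an `(i, j)` pair with `i < j`. The path graph is an arbitrary example.

```python
def to_adjacency_list(adj):
    """List each undirected edge once as (i, j) with i < j."""
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if adj[i][j]]

# Path graph 0 - 1 - 2 - 3 as a dense adjacency matrix (16 entries)
A = [[0, 1, 0, 0],
     [1, 0, 1, 0],
     [0, 1, 0, 1],
     [0, 0, 1, 0]]
edges = to_adjacency_list(A)
print(edges)   # 3 pairs instead of 16 matrix entries
```

The storage now grows with the number of edges rather than with the square of the number of nodes.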

Graph Neural Networks

GNN : optimizable transformation on all attributes of the graph (nodes, edges, global-context) that preserves graph symmetries (permutation invariances)

GNN ← Graph Nets architecture schematics + message passing neural network

“graph-in, graph-out” : takes node/edge/global-context information as input and progressively transforms the embeddings without changing the connectivity of the graph

The Simplest GNN

GNN Layer : uses a separate MLP for each component of the graph (V = Node, E = Edge, U = Global-context)

A single layer of a simple GNN. A graph is the input, and each component (V,E,U) gets updated by a MLP to produce a new graph. Each function subscript indicates a separate function for a different graph attribute at the n-th layer of a GNN model.

The input is a graph, and the output is a graph with the same structure (connectivity) but updated embeddings!

GNN Predictions by Pooling Information

Applying a linear classifier to each node's updated embedding gives a node prediction task

cf. When a prediction about nodes is needed but only edge information is available → information must be collected from the edges and handed to the nodes for prediction ⇒ Pooling

2 Steps of Pooling

For each item to be pooled, gather each of their embeddings and concatenate them into a matrix.
The gathered embeddings are then aggregated, usually via a sum operation.

$\rho$ : Pooling operation ⇒ gathering information from edges to nodes : $\rho_{E_n \rightarrow V_n}$
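A minimal sketch of $\rho_{E_n \rightarrow V_n}$ with sum aggregation: every node collects the embeddings of its incident edges. The function name, the graph and the edge embeddings are hypothetical.

```python
def pool_edges_to_nodes(num_nodes, edges, edge_emb):
    """Sum the embeddings of all edges incident to each node."""
    dim = len(edge_emb[0])
    pooled = [[0.0] * dim for _ in range(num_nodes)]
    for (u, v), emb in zip(edges, edge_emb):
        for node in (u, v):                      # an undirected edge touches both endpoints
            pooled[node] = [a + b for a, b in zip(pooled[node], emb)]
    return pooled

edges = [(0, 1), (1, 2)]                 # path graph 0 - 1 - 2
edge_emb = [[1.0, 0.0], [0.0, 2.0]]      # one embedding per edge
pooled = pool_edges_to_nodes(3, edges, edge_emb)
print(pooled)
```

The pooled vectors can then be passed to the node classifier in place of the missing node features.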

Model for predicting binary node information using edge-level features

Model for predicting binary edge-level information using node-level features

Model for predicting binary global property using node-level features

Similar to the global average pooling layer in CNNs
Molecular property prediction task : atomic information + connectivity → toxicity of a molecule

The classification model $c$ can be replaced by any other differentiable model

An end-to-end prediction task with a GNN model

An end-to-end prediction task with a GNN model.

The simplest GNN does not use graph connectivity inside the GNN layers
Each node/edge is processed independently
Connectivity is used only when pooling information for prediction

Passing messages between parts of the graph

Pooling inside a GNN layer can produce embeddings that take graph connectivity into account ⇒ Message Passing : neighboring nodes and edges exchange information and influence each other's updates

Message Passing

For each node in the graph, gather all the neighboring node embeddings (or messages), which is the $g$ function described above.
Aggregate all messages via an aggregate function (like sum).
All pooled messages are passed through an update function, usually a learned neural network.
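The three steps can be sketched as follows. The update function here is a fixed average followed by ReLU, a hypothetical stand-in for the learned neural network, and the graph is a 3-node path.

```python
def message_passing_step(neighbors, h, update):
    """One round of message passing over a graph given as neighbor lists."""
    new_h = []
    for v, nbrs in enumerate(neighbors):
        messages = [h[u] for u in nbrs]                                   # 1. gather
        if messages:
            agg = [sum(vals) for vals in zip(*messages)]                  # 2. aggregate (sum)
        else:
            agg = [0.0] * len(h[v])
        new_h.append(update(h[v], agg))                                   # 3. update
    return new_h

def update(h_self, agg):
    # Hypothetical stand-in for a learned network: average self and neighbors, then ReLU
    return [max(0.0, 0.5 * a + 0.5 * b) for a, b in zip(h_self, agg)]

neighbors = [[1], [0, 2], [1]]          # path graph 0 - 1 - 2
h = [[0.0], [0.0], [1.0]]               # only node 2 carries a signal
h1 = message_passing_step(neighbors, h, update)
h2 = message_passing_step(neighbors, h1, update)
print(h1, h2)                           # node 0 sees node 2's signal only after two rounds
```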

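Just as pooling can be applied to nodes or to edges, message passing can occur between nodes or between edges

uses the connectivity of the graph
increases the expressive power of the GNN

Similar in spirit to convolution
- both message passing and convolution aggregate and process information from an element's neighbors in order to update that element's value
- difference : in an image the number of neighbors is fixed, while in a graph it varies
Stacking message passing GNN layers lets a node eventually incorporate information from across the whole graph
- with each added layer, the neighborhood a node gathers from grows to 1, 2, 3, ... hops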
Schematic for a GCN architecture, which updates node representations of a graph by pooling neighboring nodes at a distance of one degree.

Learning edge representations

*Earlier example

a prediction about nodes is required but only edge information is available

→ use pooling to route the edge information to the nodes

👎) only applicable at the final prediction step of the model

⇒ sol💡) use message passing to share information between nodes and edges inside the GNN layers

👎) edge information and node information are not guaranteed to have the same dimension

⇒ sol💡) learn a linear function that maps one information space to the other, or concatenate

Architecture schematic for Message Passing layer. The first step “prepares” a message composed of information from an edge and its connected nodes and then “passes” the message to the node.

One design decision in a GNN is whether node embeddings or edge embeddings are updated first

ex. four updated representations that get combined into new node and edge representations: node to node (linear), edge to edge (linear), node to edge (edge layer), edge to node (node layer)

Molecular Graph Convolutions: Moving Beyond Fingerprints

Some of the different ways we might combine edge and node representation in a GNN layer.

Adding global representations

Drawbacks of the networks described so far

Even with many rounds of message passing, nodes that are very far apart struggle to exchange information efficiently
i.e. stacking k GNN layers propagates information only among nodes within k hops
This is a problem when a node prediction depends on information from distant nodes

⇒ sol💡) Using global representation of graph(U), i.e. master node or context vector

Global context vector
- connected to every node and edge in the network
- acts as a bridge for passing information
- builds a representation of the whole graph
- richer and more complex graph representation

Schematic of a Graph Nets architecture leveraging global representations.

Schematic for conditioning the information of one node based on three other embeddings (adjacent nodes, adjacent edges, global). This step corresponds to the node operations in the Graph Nets Layer.

When creating a new node embedding, neighboring nodes, connected edges and global information can all be used, or conditioning can restrict it to a subset

concatenate
learning linear mapping function
feature-wise modulation layer (feature-wise attention mechanism)

Feature-wise transformations

GNN playground

Graph-level prediction task
Leffingwell Odor Dataset

Some empirical GNN design lessons

a higher number of parameters does correlate with higher performance
- GNNs are a very parameter-efficient model type: for even a small number of parameters (3k) we can already find models with high performance
models with higher dimensionality tend to have better mean and lower bound performance
- higher dimensionality = higher number of parameters, so this matches the point above
GNN with a higher number of layers will broadcast information at a higher distance
- a node collects information from a wide region of the graph, so its own information risks being diluted → the more layers are stacked, the wider the spread in performance
the more graph attributes are communicating, the better the performance of the average model

Approach

neighborhood-based pooling operation
- node representation learning that depends on the graph structure
- stochastic graph traversals
new mechanisms for obtaining graph information
- Representation Learning on Graphs with Jumping Knowledge Networks
  - jumping knowledge (JK) networks define the neighborhood range differently for each node to obtain structure-aware representations
- Graph Traversal with Tensor Functionals: A Meta-Algorithm for Scalable Learning
  - Graph Traversal via Tensor Functionals (GTTF) : efficient learning on large graphs through a variety of graph algorithms
- Graph Neural Tangent Kernel: Fusing Graph Neural Networks with Graph Kernels
  - infinitely wide multi-layer GNNs trained by gradient descent
- Neural Execution of Graph Algorithms

Into the Weeds

Other types of graphs (multigraphs, hypergraphs, hypernodes, hierarchical graphs)

Schematic of more complex graphs. On the left we have an example of a multigraph with three edge types, including a directed edge. On the right we have a three-level hierarchical graph, the intermediate level nodes are hypernodes.

multigraphs(multi-edge graphs)
- a pair of nodes can share edges of several types
- For example with a social network, we can specify edge types based on the type of relationships (acquaintance, friend, family).
- GNN can be adapted by having different types of message passing steps for each edge type.
hypernode graphs(nested graphs)
- node represents a graph
- For example, we can consider a network of molecules, where a node represents a molecule and an edge is shared between two molecules if we have a way (reaction) of transforming one to the other
- GNN that learns representations at the molecule level and another at the reaction network level, and alternate between them during training.
hypergraph
- an edge can connect more than two nodes
- build a hypergraph by identifying communities of nodes and assigning a hyper-edge that is connected to all nodes in a community.

Sampling Graphs and Batching in GNNs

👎 The number of nodes and edges in a graph is not fixed, so a constant batch size is hard to form

⇒ 💡 Build subgraphs that preserve the essential properties of the large graph and batch those

graph sampling operation
- involves sub-selecting nodes and edges from the graph and is highly context-dependent
- has led to new architectures and training strategies such as Cluster-GCN and GraphSaint
  
  ⇒ Research Question : How to sample a graph?

Four different ways of sampling the same graph. Choice of sampling strategy depends highly on context since they will generate different distributions of graph statistics (# nodes, #edges, etc.). For highly connected graphs, edges can be also subsampled.

- [Little Ball of Fur: A Python Library for Graph Sampling](https://dl.acm.org/doi/abs/10.1145/3340531.3412758)
- preserving structure at a neighborhood level
    - randomly sample a fixed number of nodes to form a node-set, then add their k-hop neighbors together with the connecting edges ⇒ each such neighborhood is used like an individual graph in batch training
    - The **loss** can be masked to **only consider the node-set** since all neighboring nodes would have incomplete neighborhoods.
- randomly sample one node, expand to its k-hop neighborhood, then pick another node inside the expanded set; repeat until the desired number of sets has been built
- Random walk
- Metropolis algorithm

Inductive biases

Relational inductive biases, deep learning, and graph networks

How each graph component (edge, node, global) is related to each other so we seek models that have a relational inductive bias?

A model should preserve explicit relationships between entities (adjacency matrix) and preserve graph symmetries (permutation invariance).
The order of operations on nodes or edges should not matter, and the operations should work on a variable number of inputs

Comparing aggregation operations

Aggregation function

must be invariant to node ordering
must be differentiable
should produce similar aggregated outputs for similar inputs

mean

when nodes have a highly-variable number of neighbors or you need a normalized view of the features of a local neighborhood

max

when you want to highlight single salient(prominent) features in local neighborhoods

sum

provides a balance between these two, by providing a snapshot of the local distribution of features, but because it is not normalized, can also highlight outliers
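A toy comparison showing that no single aggregator separates every pair of neighborhoods. The neighbor feature values are made up for illustration.

```python
def aggregate(values):
    """Apply the three aggregators to a list of scalar neighbor features."""
    return {"mean": sum(values) / len(values), "max": max(values), "sum": sum(values)}

a = aggregate([2, 4])
b = aggregate([2, 2, 4, 4])   # same mean and max as a, different sum
c = aggregate([1, 3])
d = aggregate([2, 2])         # same mean and sum as c, different max

for name, result in (("a", a), ("b", b), ("c", c), ("d", d)):
    print(name, result)
```

Mean and max cannot tell `a` from `b`, while mean and sum cannot tell `c` from `d`, so the choice depends on which differences matter for the task.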

Designing new aggregation operations

Principal Neighborhood aggregation
- take into account several aggregation operations by concatenating them and adding a scaling function that depends on the degree of connectivity of the entity to aggregate

GCN as subgraph function approximators

k-layer GNN : learns a representation of the k-hop subgraph around each node

⇒ GCN is collecting all possible subgraphs of size k and learning vector representations from the vantage point of one node or edge

N-Gram Graph: Simple Unsupervised Representation for Graphs, with Applications to Molecules

Edges and the Graph Dual

Edge prediction and node prediction tasks

: an edge prediction task on a graph $G$ can be phrased as a node-level prediction on $G$’s dual.

To obtain the dual of G, nodes are converted to edges and edges to nodes

Dual Graph?
Dual-Primal Graph Convolutional Networks
- to solve an edge classification problem on $G$, we can think about doing graph convolutions on $G$’s dual (which is the same as learning edge representations on $G$)
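
As a rough sketch of the node↔edge conversion, the snippet below builds a line graph: every edge of $G$ becomes a node, and two such nodes are linked when the original edges share an endpoint. Treating the "dual" as a line graph is an assumption made here for illustration, and `line_graph` is a hypothetical helper.

```python
from itertools import combinations

def line_graph(edges):
    """Each edge of G becomes a node; two such nodes are linked if the edges share an endpoint."""
    edges = sorted(tuple(sorted(e)) for e in edges)
    return {
        (e, f)
        for e, f in combinations(edges, 2)
        if set(e) & set(f)
    }

# path graph 0 - 1 - 2 - 3
g_edges = [(0, 1), (1, 2), (2, 3)]
dual_edges = line_graph(g_edges)
```

An edge-level label on `(1, 2)` in $G$ then becomes a node-level label on the node `(1, 2)` of the converted graph.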

Graph convolutions as matrix multiplications, and matrix multiplications as walks on a graph

Message passing

“gathering” all node features values of dimension j that share an edge with $node_i$
not updating the representation of the node feature, just pooling neighboring node feature

$$ (AX)_{i,j} = A_{i,1}X_{1,j}+A_{i,2}X_{2,j}+\dots+A_{i,n}X_{n,j} = \sum_{A_{i,k}>0} X_{k,j} $$

$A$ : adjacency matrix, $n_{nodes} \times n_{nodes}$
$X$ : node feature matrix, $n_{nodes} \times node_{dim}$
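$A_{i,k}$ : whether an edge exists between node i and node k
Sparsity of the adjacency matrix $A$
- matrix multiply-free approach : entries with $A_{i,j} = 0$ add nothing to the sum → it is enough to retrieve the entries with positive values
  - this also means the aggregation function no longer has to be a sum

Repeating this process several times propagates information over a wider region of the graph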

⇒ matrix multiplication is a form of traversing over a graph

$A^K_{ij}$ : the number of walks of length K between node i and node j

$$ A^2_{ij} = A_{i,1}A_{1,j}+A_{i,2}A_{2,j}+\dots+A_{i,n}A_{n,j} $$
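
Both statements can be checked numerically on a toy graph. The matrices below are made-up examples; NumPy is used only for the matrix products.

```python
import numpy as np

# path graph 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]])
X = np.array([[1.0, 10.0],
              [2.0, 20.0],
              [3.0, 30.0]])

pooled = A @ X   # row i = sum of the feature rows of node i's neighbors
walks2 = A @ A   # entry (i, j) = number of length-2 walks from i to j
```

Node 1 pools the features of nodes 0 and 2, and `walks2[0, 2]` is 1 because the only length-2 walk from 0 to 2 goes through node 1.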

Graph Attention Networks

Can we learn importance weights for the neighboring nodes when aggregating node features?

Schematic of attention over one node with respect to its adjacent nodes. For each edge an interaction score is computed, normalized and used to weight node embeddings.

Transformers are Graph Neural Networks

transformers can be viewed as GNNs with an attention mechanism
a transformer models several elements (e.g. character tokens) as nodes in a fully connected graph, and the attention mechanism assigns edge embeddings to each node-pair, which are used to compute attention weights → the attention input $[\mathbf{W}h_i||\mathbf{W}h_j]$ is built for all possible pairs
The difference lies in the assumed pattern of connectivity between entities: a GNN assumes a sparse pattern, while the Transformer models all connections.
- GNN : operates on the adjacency matrix; variants differ in whether or not attention is used as the aggregation function
- Transformer : computes weights between every pair of nodes → modelling all connections

Graph explanations and attributions

Schematic of some explainability techniques on graphs. Attributions assign ranked values to graph attributes. Rankings can be used as a basis to extract connected subgraphs that might be relevant to a task.

GNNExplainer

An approach to graph explanations, which differ depending on the graph concept involved
extracting the most relevant subgraph that is important for a task
Attribution techniques : assign ranked importance values to the parts of the graph that are relevant to a task
Because realistic and challenging graph problems can be generated synthetically, GNNs can serve as a rigorous and repeatable testbed for evaluating attribution techniques : https://papers.nips.cc/paper/2020/hash/417fbbf2e9d5a28a855a11894b2e795a-Abstract.html

Generative modelling

Graph generation

Method
- sampling from a learned distribution
- completing a graph given a starting point
Key Point
- modelling the topology of a graph, which can vary dramatically in size and has $N_{nodes}^2$ terms
Solution
- Variational Graph Auto-Encoders
  - modelling the adjacency matrix directly like an image with an autoencoder framework
  - binary classification task : prediction of the presence or absence of an edge
  - learns to model positive patterns of connectivity and some patterns of non-connectivity in the adjacency matrix
- build a graph sequentially, by starting with a graph and applying discrete actions such as addition or subtraction of nodes and edges iteratively
  - Auto-regressive model : GraphRNN: Generating Realistic Graphs with Deep Auto-regressive Models
  - Optimization of Molecules via Deep Reinforcement Learning
  - graph to sequence with grammar elements
    - Self-Referencing Embedded Strings (SELFIES): A 100% robust molecular string representation
    - GraphGen: A Scalable Approach to Domain-agnostic Labeled Graph Generation

[Pytorch Geometric Tutorial] 3. Graph attention networks (GAT) implementation

Thu, 08 Dec 2022 06:50:18 GMT

Pytorch Geometric tutorial: Graph attention networks (GAT) implementation

💡 Neighbor nodes are not all equally important to the target node

→ If some nodes matter more than others, how can those weights be learned automatically?

⇒ GAT

Graph Attention Layer

Input : set of node features $\mathbf{h} = \{\bar{h_1},\bar{h_2}, \dots ,\bar{h_n}\} \quad \bar{h_i} \in \mathbf{R}^F$
Output : a new set of node features $\mathbf{h'} = \{\bar{h'_1},\bar{h'_2}, \dots ,\bar{h'_n}\} \quad \bar{h'_i} \in \mathbf{R}^{F'}$

apply a parameterized linear transformation to every node

$$ \mathbf{W} \cdot \bar{h_i} \quad \mathbf{W} \in \mathbf{R}^{F' \times F} $$

$(F' \times F) \cdot F$ matrix-vector product ⇒ $F'$

Self attention

$$ a:\mathbf{R}^{F'} \times \mathbf{R}^{F'} \rightarrow \mathbf{R} \\ e_{i,j} = a(\mathbf{W}\cdot \bar{h_i},\mathbf{W}\cdot \bar{h_j}) $$

$e_{i,j}$ : Specify the importance of node j’s features to node i

Normalization

$$ \alpha_{i,j} = softmax_j(e_{i,j}) = \frac{exp(e_{i,j})}{\sum_{k\in N(i)}exp(e_{i,k})} $$

attention mechanism $a$ : a single-layer feed forward neural network

The embeddings of the neighbor node j and of the node i itself are each transformed by $\mathbf{W}$ and then concatenated
LeakyReLU

$$ \alpha_{i,j} = \frac{exp(LeakyReLU(a^{T}[\mathbf{W}h_i||\mathbf{W}h_j]))}{\sum_{k\in N(i)}exp(LeakyReLU(a^{T}[\mathbf{W}h_i||\mathbf{W}h_k]))} $$

$||$ : concatenate → $F' + F'$
$[\mathbf{W}h_i||\mathbf{W}h_j]$ → $(2F' \times 1)$
$a^{T}$ : transpose(a) → $(1\times 2F')$
LeakyReLU(Real number)

Using the learned attention : the importance of each neighbor of node i determines how the input features are re-weighted

$$ h'_i = \sigma(\sum_{j\in N(i)} \alpha_{i,j} \mathbf{W}h_j) $$

Multi-head attention (repeated $K$ times)
1. Concatenation : in hidden layers
  
  $$ h'_i = ||_{k=1}^K\sigma(\sum_{j\in N(i)} \alpha_{i,j}^k \mathbf{W}^kh_j) $$
2. Average : on the final prediction layer of the network
  
  $$ h'_i = \sigma(\frac{1}{K}\sum_{k=1}^K \sum_{j\in N(i)} \alpha_{i,j}^k \mathbf{W}^kh_j) $$
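
The single-head computation can be sketched with NumPy as follows. This is not the PyG implementation: `gat_head` is a hypothetical helper, the LeakyReLU slope of 0.2 and the random inputs are assumptions, and the final nonlinearity $\sigma$ is left out.

```python
import numpy as np

def leaky_relu(z, slope=0.2):
    return np.where(z > 0, z, slope * z)

def gat_head(h, neighbors, W, a):
    """One attention head. h: [n, F], W: [F', F], a: [2F'], neighbors: dict i -> list of j."""
    Wh = h @ W.T                                   # linear transform, shape [n, F']
    out = np.zeros_like(Wh)
    alphas = {}
    for i, nbrs in neighbors.items():
        # e_ij = LeakyReLU(a^T [Wh_i || Wh_j]) for every neighbor j of i
        e = np.array([leaky_relu(a @ np.concatenate([Wh[i], Wh[j]])) for j in nbrs])
        alpha = np.exp(e - e.max())
        alpha /= alpha.sum()                       # softmax over N(i)
        alphas[i] = alpha
        out[i] = alpha @ Wh[nbrs]                  # weighted sum of transformed neighbors
    return out, alphas

rng = np.random.default_rng(0)
h = rng.normal(size=(3, 4))                        # n = 3 nodes, F = 4
W = rng.normal(size=(2, 4))                        # F' = 2
a = rng.normal(size=4)                             # 2F' = 4
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}   # self-loops included
out, alphas = gat_head(h, neighbors, W, a)
```

For multi-head attention, the same function would be called $K$ times with separate `W` and `a`, and the outputs concatenated (hidden layers) or averaged (final layer).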

👍Advantages

Computationally efficient
- Self-attention layers can be parallelized across edges
- Output features can be parallelized across nodes
Allows to assign different importances to nodes of a same neighborhood
It is applied in a shared manner to all edges in the graph
- Not required to have the entire graph
Works in both
- Transductive learning (Cora, Citeseer, Pubmed) : node classification or link prediction with access to one big whole graph
- Inductive learning (PPI) : multiple graphs; predictions are made on a different set of graphs

Message Passing Implementation

Creating Message Passing Networks - pytorch_geometric documentation

torch_geometric.nn.conv.message_passing - pytorch_geometric documentation

$$ \mathbf{x}_i^{(k)} = \gamma^{(k)}(\mathbf{x}_i^{(k-1)},f_{j\in N(i)}\phi^{(k)}(\mathbf{x}_i^{(k-1)},\mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i})) $$

The node itself, its neighbors, and the edge features are combined into a message; the messages are aggregated, merged once more with the node's own representation, and passed through another MLP to give the updated node embedding
$\mathbf{x}_i^{(k)}$ : Feature representation of node i at the k-th layer (the quantity to be updated)
$\phi^{(k)}$ : Differentiable function, Eg. MLP
$\mathbf{x}_i^{(k-1)}$ : Feature representation of node i at the (k-1)-th layer
$\mathbf{x}_j^{(k-1)}$ : Feature representation of node j at the (k-1)-th layer
$\mathbf{e}_{j,i}$ : [optionally] features of the edge from node j to node i
$f_{j\in N(i)}$ : Differentiable, ordering invariant function. Aggregate function. For every j in the neighbourhood of i. Eg. sum, average, etc...
$\gamma^{(k)}$ : Differentiable function, Eg. MLP
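
To make the roles of $\phi$, $f$, and $\gamma$ concrete, here is a framework-free NumPy sketch of one message passing step with sum aggregation. The function name `propagate` only mirrors the PyG method name for readability; it is a hypothetical stand-in, not the library API.

```python
import numpy as np

def propagate(x, edge_index, phi, gamma):
    """Sum-aggregated message passing. edge_index[0] = source j, edge_index[1] = target i."""
    src, dst = edge_index
    messages = phi(x[dst], x[src])          # phi: one message per edge (j -> i)
    aggregated = np.zeros((x.shape[0], messages.shape[1]))
    np.add.at(aggregated, dst, messages)    # f: permutation-invariant sum over N(i)
    return gamma(x, aggregated)             # gamma: update with the node's own features

x = np.array([[1.0], [2.0], [4.0]])
edge_index = np.array([[1, 2, 0],           # sources j
                       [0, 0, 1]])          # targets i
out = propagate(x, edge_index,
                phi=lambda x_i, x_j: x_j,             # message = neighbor feature
                gamma=lambda x_i, agg: x_i + agg)     # update = self + aggregate
```

Node 0 receives messages from nodes 1 and 2, node 1 from node 0, and node 2 receives none, so it keeps its own feature.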

PyTorch Geometric MessagePassing base class

PyTorch Geometric 탐구 일기 - Message Passing Scheme (1)

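A convenient base class that wires up propagation for the message passing scheme of GNNs
The user defines message(), update(), and the aggregation scheme
$\phi^{(k)}$ : message()
$\gamma^{(k)}$ : update()
$f_{j\in N(i)}$ : aggregation → max, mean, add, ...
flow : flow direction of message passing : whether a node receives information from its neighbors or sends it to them (either "source_to_target" or "target_to_source")
node_dim : the axis along which to propagate
- the default value noted here is int 0 (recent PyG releases document a default of -2)
- determines which axis the propagation runs over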

ex. Message Passing interface example

from torch_geometric.nn import MessagePassing

class MyOwnConv(MessagePassing):
    def __init__(self):
        super().__init__(aggr='add')  # add, mean or max aggregation

    def forward(self, x, edge_index, e):
        return self.propagate(edge_index, x=x, e=e) # pass everything needed for propagation

    def message(self, x_j, x_i, e): # Node features get automatically mapped to source(_j) and target(_i) nodes
        return x_j * e

torch.nn.Module is the superclass
torch.nn.Module ⇒ torch_geometric.nn.MessagePassing ⇒ OurCustomLayer
Most layer implementations in torch_geometric.nn.conv follow the message passing scheme

MessagePassing - propagate()

def propagate(self, edge_index, size=None, **kwargs):
    # Simplified excerpt of the PyG source; argument collection is elided.
    # Fused path: message and aggregation run as one sparse operation.
    if mp_type == 'adj_t' and self.fuse and not self.__explain__:
        out = self.message_and_aggregate(edge_index, **msg_aggr_kwargs)

    # Otherwise, run both functions in separation.
    elif mp_type == 'edge_index' or not self.fuse or self.__explain__:
        msg_kwargs = self.__distribute__(self.inspector.params['message'],
                                         coll_dict)
        out = self.message(**msg_kwargs)
        out = self.aggregate(out, **aggr_kwargs)
    out = self.update(out, **update_kwargs)

propagate(edge_index, size=None, **kwargs)
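Collects the edge indices and any additional data needed to construct messages and update the node embeddings
size → propagation over an (N, M) shape, as in a bipartite graph, is also possible
- in that case, pass size = (N, M)
- with size = None, the matrix is assumed to be square
Calls message() and update() in turn
message and aggregate run either as separate calls or as one fused call
The final output is produced by update()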

MessagePassing - message()

def message(self, x_j: torch.Tensor) -> torch.Tensor:
    # need to construct
    return x_j

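$\phi$, constructs the messages sent to node i
message(**kwargs)
- specifies how the “message” generated on each edge is constructed
- it is called from propagate, so it can receive any argument that was passed to propagate
- caveat: to refer to the nodes at either end of a message, append “_i” or “_j” to the argument name so that the mapping is defined
  - i : central node
  - j : neighboring nodes
  - with flow=’source_to_target’, messages are built for each edge $(j,i)\in E$
  - with flow=’target_to_source’, messages are built for each edge $(i,j)\in E$
The naming of the function arguments therefore matters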

MessagePassing - update()

def update(self, inputs: torch.Tensor):
    # need to construct
    return inputs

$\gamma$, updates the node embedding of each node i
update(aggr_out, **kwargs)
- takes the result of the message aggregation as its first argument (`inputs` above)
- any argument initially passed to propagate() is also available

Implementing the GCN Layer

Semi-Supervised Classification with Graph Convolutional Networks

$$ \mathbf{x}_i^{(k)} = \sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}} \cdot \left( \mathbf{W}^{\top} \cdot \mathbf{x}_j^{(k-1)} \right) + \mathbf{b}, $$

Neighboring node features are first transformed by the weight matrix $\mathbf{W}$, then normalized by the node degrees and summed up
Finally, the bias vector $\mathbf{b}$ is applied to the aggregated output
Comparison with general message passing
- $\mathbf{x}_i^{(k)} = \gamma^{(k)}(\mathbf{x}_i^{(k-1)},f_{j\in N(i)}\phi^{(k)}(\mathbf{x}_i^{(k-1)},\mathbf{x}_j^{(k-1)},\mathbf{e}_{j,i}))$
- $\sum_{j \in \mathcal{N}(i) \cup \{ i \}} \frac{1}{\sqrt{\deg(i)} \cdot \sqrt{\deg(j)}}$ = $f_{j\in N(i)}$
- $\mathbf{W}$ = $\phi^{(k)}$

Steps

Add self-loops to the adjacency matrix. (each node's own feature is included too; the diagonal entries are set to 1)
Linearly transform node feature matrix.
Compute normalization coefficients.
Normalize node features in $\phi$
Sum up neighboring node features ("add" aggregation).
Apply a final bias vector.

Step 1~3 : the terms inside the sum; they construct the message that will be propagated to the target node
Step 4~6 : aggregate over the neighboring node pairs and update the target node
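
The six steps can be traced without PyG in a short NumPy sketch. `gcn_layer` is a hypothetical helper written for this note; it assumes an undirected graph stored as directed edge pairs, and it is checked against the dense form $D^{-1/2}(A+I)D^{-1/2}XW + b$.

```python
import numpy as np

def gcn_layer(x, edge_index, W, b):
    n = x.shape[0]
    loops = np.arange(n)
    src = np.concatenate([edge_index[0], loops])     # Step 1: add self-loops
    dst = np.concatenate([edge_index[1], loops])
    x = x @ W                                        # Step 2: linear transform
    deg = np.bincount(dst, minlength=n).astype(float)
    norm = deg[src] ** -0.5 * deg[dst] ** -0.5       # Step 3: 1 / sqrt(deg_i * deg_j)
    out = np.zeros_like(x)
    np.add.at(out, dst, norm[:, None] * x[src])      # Steps 4-5: normalize and sum
    return out + b                                   # Step 6: bias

# path graph 0 - 1 - 2, both edge directions listed
x = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
edge_index = np.array([[0, 1, 1, 2],
                       [1, 0, 2, 1]])
W = np.array([[1.0, 2.0], [3.0, 4.0]])
b = np.array([0.5, -0.5])
out = gcn_layer(x, edge_index, W, b)

# dense reference: D^{-1/2} (A + I) D^{-1/2} X W + b
A_hat = np.eye(3)
A_hat[edge_index[1], edge_index[0]] = 1.0
d = A_hat.sum(axis=1) ** -0.5
dense = (d[:, None] * A_hat * d[None, :]) @ x @ W + b
```

Because self-loops guarantee a degree of at least 1, this sketch does not need the `inf` guard used in the PyG source below.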

Source Code

import torch
from torch.nn import Linear, Parameter
from torch_geometric.nn import MessagePassing
from torch_geometric.utils import add_self_loops, degree

class GCNConv(MessagePassing):
    def __init__(self, in_channels, out_channels):
        super().__init__(aggr='add')  # "Add" aggregation (Step 5).
        self.lin = Linear(in_channels, out_channels, bias=False)
        self.bias = Parameter(torch.empty(out_channels))

        self.reset_parameters()

    def reset_parameters(self):
        self.lin.reset_parameters()
        self.bias.data.zero_()

    def forward(self, x, edge_index):
        # x has shape [N, in_channels]
        # edge_index has shape [2, E]

        # Step 1: Add self-loops to the adjacency matrix.
        edge_index, _ = add_self_loops(edge_index, num_nodes=x.size(0))

        # Step 2: Linearly transform node feature matrix.
        x = self.lin(x)

        # Step 3: Compute normalization.
        row, col = edge_index  # split into source and target nodes
        # count how often each node appears as a target node
        deg = degree(col, x.size(0), dtype=x.dtype)
        deg_inv_sqrt = deg.pow(-0.5)
        deg_inv_sqrt[deg_inv_sqrt == float('inf')] = 0
        norm = deg_inv_sqrt[row] * deg_inv_sqrt[col]

        # Step 4-5: Start propagating messages.
        out = self.propagate(edge_index, x=x, norm=norm)

        # Step 6: Apply a final bias vector.
        out = out + self.bias

        return out

    def message(self, x_j, norm):
        # x_j has shape [E, out_channels]

        # Step 4: Normalize node features.
        return norm.view(-1, 1) * x_j

GCNConv

Inherits from MessagePassing with “add” propagation
forward()
- Step 1: Add self-loops to the adjacency matrix → [torch_geometric.utils.add_self_loops()](https://pytorch-geometric.readthedocs.io/en/latest/modules/utils.html#torch_geometric.utils.add_self_loops)
- Step 2: Linearly transform node feature matrix → [torch.nn.Linear](https://pytorch.org/docs/master/generated/torch.nn.Linear.html#torch.nn.Linear)
- Step 3: Compute normalization → $1/(\sqrt{\deg(i)} \cdot \sqrt{\deg(j)})$ → the output `norm` is a tensor of shape [num_edges, ]
  - torch_geometric.utils.degree

- Step 4-5: Start propagating messages. → call `propagate()` → internally calls `message()`, `aggregate()`, `update()`
    - `propagate()` : collects the edge indices and **any additional data** needed to construct messages and update the node embeddings
        - the node embeddings `x` and the normalization coefficients `norm` are passed as additional arguments
- Step 6: Apply a final bias vector.

message()
- normalize the neighboring node features x_j by norm
  - x_j : lifted tensor, holding the source node features of each edge
  - x_i : holds the target node features of each edge
  - i : central node
  - j : neighboring nodes

Implementations https://github.com/sujinyun999/PytorchGeometricTutorial/blob/main/Tutorial3/Tutorial3.ipynb

[Pytorch Geometric Tutorial] 1. Introduction to Pytorch geometric

Thu, 08 Dec 2022 06:40:00 GMT

Pytorch Geometric tutorial: Introduction to Pytorch geometric

Problems of dealing with graphs in deep learning

Different sizes
- the size of the adjacency matrix changes with the number of nodes
NOT invariant to node ordering
- topologically identical graphs can still have different adjacency matrices

Notation

$$ \mathbf{G} = (V,E) $$

Computation Graph

: The neighbours of a node define its computation graph

Information from the neighbor nodes is aggregated and passed through a box to produce the representation of the target node
- box : Neural Network
- Ordering-invariant aggregation : aggregate node information with an operation that does not depend on node order, such as sum or average
redundancy : building the 2-hop computation graph for every node produces overlapping parts

Math

The representation of node v at layer 0 is its node feature

$$ h_v^0 = X_v $$

$h_v^0$ : representation of node v at layer 0
$X_v$ : feature of node $v$

Node representation update

$$ h_v^{k+1} = \sigma(W_k \sum_{u \in N(v)} \frac{h_u^k}{|N(v)|}+B_kh_v^k) $$

$h_v^{k+1}$ : representation of node v at layer k+1
$h_v^{k}$ : representation of node v at layer k
$\sum_{u \in N(v)} \frac{h_u^k}{|N(v)|}$ : the average of the embeddings over the neighbors $N(v)$ of node v
$W_k, B_k$ : Weights (the box), shared across all computation graphs
$\sigma$ : Activation function (ReLU)

Representation of node v at the k-th layer

$$ Z_v = h_v^k $$

GraphSAGE

Inductive Representation Learning on Large Graphs

It can be implemented with a small modification of the Math above

average ⇒ general aggregation function
instead of adding the aggregated neighbor information to the node's own representation from the previous layer, concatenate them (,)

$$ h_v^{k+1} = \sigma([W_k \cdot AGG(\{h_u^{k}, \forall u \in N(v)\}) ,B_kh_v^k]) $$

$AGG$
- Pool : Element-wise min/max
- LSTM : note that it is not order invariant
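
A minimal NumPy sketch of this concatenating update is given below. `sage_layer` is a hypothetical helper (not the PyG `SAGEConv` layer), the toy weights are chosen only to make the numbers easy to follow, and ReLU is assumed for $\sigma$.

```python
import numpy as np

def sage_layer(h, neighbors, W, B, agg=np.mean):
    """h: [n, d]; neighbors: dict v -> list of u; W, B: [d, d_out]."""
    out = []
    for v in range(h.shape[0]):
        pooled = agg(h[neighbors[v]], axis=0)                 # AGG over N(v)
        out.append(np.concatenate([pooled @ W, h[v] @ B]))    # [W * AGG, B * h_v]
    return np.maximum(np.stack(out), 0.0)                     # sigma = ReLU

h = np.array([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]])
neighbors = {0: [1, 2], 1: [0], 2: [0, 1]}
W = np.eye(2)
B = -np.eye(2)
h_next = sage_layer(h, neighbors, W, B)                  # mean aggregation
h_next_max = sage_layer(h, neighbors, W, B, agg=np.max)  # element-wise max pooling
```

Swapping `agg` is all it takes to move from the mean update of the previous section to a pooling aggregator; the output width doubles because of the concatenation.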

Implementations https://github.com/sujinyun999/PytorchGeometricTutorial/blob/main/Tutorial1/Tutorial1.ipynb

[Dialogue System] Persona chat review

Thu, 28 Jul 2022 08:32:45 GMT

Abstract

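Problems of chit-chat models

lack of specificity : they do not display a consistent personality
often not very captivating

⇒ addressed by conditioning the chit-chat model on profile information

conditioned on the given profile information (=condition on their given profile information)
incorporating information about the person being talked to (=information about the person they are talking to)

Data covering both are collected and used to train models, aiming to improve dialogue generation as measured by next utterance prediction.

The second kind of information is unknown at first, so the model is trained to get the partner to talk about personal topics, and the resulting dialogue can be used to predict the interlocutor's profile information.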

1. Introduction

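Current state : datasets that let neural models produce reasonably meaningful chit-chat responses have only recently been built, and the weaknesses of such models are still apparent

Problems of chit-chat models

lack of consistent personality : because they are trained on dialogues spoken by many different speakers
lack of explicit long-term memory : utterances are generated from the recent dialogue history only
tendency to produce non-specific answers like “I don’t know”

→ these three problems greatly reduce the overall satisfaction of the person in the conversation

→ the authors argue that these problems stem from the lack of public datasets for general chit-chat models

Because of the low quality of recent conversational models and the difficulty of evaluating them, chit-chat models have received far less attention than task-oriented communication (chit-chat model << task-oriented communication)

btw, most conversation between people centers on socialization, personal interests, and chit-chat.

⇒ Introducing a configurable, persistent persona should make chit-chat dialogue agents more engaging.

Here, the persona is encoded by a profile made up of textual descriptions

The profile is stored in a memory-augmented NN, which can then produce more personal, specific, consistent, and engaging responses than a persona-free model → mitigates the problems of chit-chat models
Likewise, information about the conversation partner can be used in the same way

⇒ The model is trained to both ask and answer questions about personal topics, which lets it model the persona of its conversation partner

To train such models, the persona-chat dataset is presented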

a new dialogue dataset consisting of 164,356 utterances between crowdworkers
each asked to act the part of a given provided persona
get to know each other during the conversation

Next utterance prediction task during dialogue

→ compare generative & ranking model

Seq2Seq
Memory Networks

⇒ for both generative & ranking models, the models given persona info perform the task better

Traditional dialogue systems
- consist of a dialogue state tracking component and a response generator
- goal-oriented dialogue in which the user intent is clearly defined
- chit-chat settings are not considered
- focus on achieving a functional goal
- accordingly, tasks and datasets are confined to narrow domains
- IR-based models : retrieve and rank responses against the recent dialogue history using a matching score
End-to-end neural approach
1. generative recurrent system (ex. seq2seq)
  - Sequence to sequence learning with neural networks
  - A Neural Conversational Model
  - A hierarchical latent variable encoder-decoder model for generating dialogues
    - +) produce syntactically coherent novel responses
    - -) memory-free → lacks long-term coherence and a persistent personality
2. memory-augmented network
  - Evaluating Prerequisite Qualities for Learning End-to-End Dialog Systems
3. chit-chat setting
  - A persona-based neural conversation model
    - personas such as background history and speaking style, captured from a Twitter corpus, are encapsulated as distributed embeddings, giving improved results
    - there is no process of getting to know the conversation partner
  ⇒ this work focuses on explicit profile information and adds the getting-to-know process

3. The PERSONA-CHAT Dataset

Goal

Facilitate more engaging and more personal chit-chat dialogue

Data collection stages

Persona
- 1155 Possible persona
- each contains at least 5 profile sentences
- 100 never-seen-before personas are set aside for validation, and another 100 for test
Revised Persona
- to keep models from exploiting trivial word overlap, related sentences that are rephrases, generalizations, or specializations were crowdsourced for the 1155 personas
Persona chat
- 164,356 utterances over 10,981 dialogs, 15,705 utterances (968 dialogs) of which are set aside for validation, and 15,119 utterances (1000dialogs) for test

3.1 Personas

Crowdsourced workers create a character using 5 sentences of persona description.

Example.

“I am a vegetarian. I like swimming. My father used to work for Ford. My favorite band is Maroon5. I got a new job last month, which is about advertising design.”

→ these profiles aim to naturally describe typical interests that could come up in conversation between people

Each sentence is kept short, at most about 15 words

3.2 Revised Personas

Issue of textual persona :

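Profile information may be repeated unwittingly, causing massive word overlap

→ the model can then get the answer right from word overlap alone

In SQuAD, a well-known QA dataset, many cases can be solved by simple word overlap alone

Solution :

Have the original profile sentences rewritten and restructured
“a related characteristic that the same person may have” : not only sentences with the same meaning, but also characteristics such that the same persona could have both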
generalizations or specializations
ex. “I like basketball” → “I am a big fan of Michael Jordan”
Not just trivially rephrase the sentence by copying the original word
- ex. “My father worked for Ford.”
  - “My dad was employed by Ford.” (X)
  - “My dad worked in the car industry” (O)

3.3 Persona Chat

The collected personas are randomly assigned to crowdworkers, who then hold a getting-to-know-you conversation.
turn-based
max 15 words per message
instructed not to produce utterances that only slightly reword the persona profile
Minimum dialogue length : 6~8 turns

Evaluation

Standard dialogue task

given the dialogue history, predict the next utterance
with/ without profile information

Goal :

enabling interesting directions for future research (whether introducing a persona produces more engaging conversation)

Possible Scenarios

conditioning on

No persona
Your own persona
Their persona
Both

for both the original and the revised versions

Metrics

log likelihood of the correct sequence, measured via perplexity
F1 score
next utterance classification loss

: the model chooses among the correct response and N (N=19) random distractors (wrong answers), and scores 1 point when it picks the correct one (hits@1)
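
The hits@1 scoring can be sketched as below. `hits_at_1` and the toy `word_overlap` scorer are hypothetical helpers for illustration (the scorer is not one of the paper's models), and only two distractors per example are used to keep it short.

```python
def hits_at_1(score_fn, examples):
    """examples: list of (context, correct, distractors); score_fn(context, candidate) -> float."""
    hits = 0
    for context, correct, distractors in examples:
        candidates = [correct] + list(distractors)
        best = max(candidates, key=lambda c: score_fn(context, c))
        hits += best == correct   # 1 point only when the correct reply ranks first
    return hits / len(examples)

def word_overlap(context, candidate):
    """Toy scorer: number of shared lowercase words."""
    return len(set(context.lower().split()) & set(candidate.lower().split()))

examples = [
    ("do you like swimming", "yes i like swimming a lot",
     ["my dad worked for ford", "i am a vegetarian"]),
    ("what is your job", "nice to meet you",
     ["my job is advertising design", "i like dogs"]),
]
score = hits_at_1(word_overlap, examples)
```

The second example shows how a pure word-overlap scorer is misled by a distractor that repeats words from the context.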

4. Models

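Two classes of models for next utterance prediction

ranking model : treats utterances in the training set as possible candidate replies and ranks each reply
generative model : generates a novel sentence word-by-word, conditioned on the dialogue history and persona
- can be evaluated like a ranking model by computing the probability of generating each candidate and ranking the candidates by that score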

4.1 Baseline ranking models

IR baseline - many variants exist; the simplest one is used
- find the most similar message in the training set and output the response from that exchange
- similarity is measured by tf-idf weighted cosine similarity between the bags of words
Starspace( = supervised embedding model)
- IR
- learns the similarity between the dialog and the next utterance by directly optimizing the embeddings for this task, using margin ranking loss and k-negative sampling
- the similarity between (dialogue+persona) and the next utterance is measured as the cosine similarity ( = $sim(q,c')$) of the summed word embeddings, and the most similar utterance is selected → $q$ : query, $c'$ : candidate
- $W$ : dictionary of $D$ word embeddings, $D \times d$ matrix, $W_i$ indexes the ith word (row)
- $q$ and $c'$ are embedded as d-dimensional embeddings
- to include the profile, it is simply concatenated to the bag of words of the query vector
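
The IR baseline described above can be sketched in plain Python. The smoothed idf formula, the whitespace tokenizer, and the helper names (`tfidf_vectors`, `cosine`, `ir_reply`) are assumptions made for this illustration, not details taken from the paper.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    docs = [Counter(t.lower().split()) for t in texts]
    df = Counter(w for d in docs for w in d)
    n = len(docs)
    idf = {w: math.log((1 + n) / (1 + c)) + 1 for w, c in df.items()}  # smoothed idf (assumption)
    return [{w: tf * idf[w] for w, tf in d.items()} for d in docs], idf

def cosine(u, v):
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def ir_reply(query, exchanges):
    """exchanges: list of (message, response) pairs from the training set."""
    vecs, idf = tfidf_vectors([m for m, _ in exchanges])
    q = {w: tf * idf.get(w, 0.0) for w, tf in Counter(query.lower().split()).items()}
    best = max(range(len(exchanges)), key=lambda i: cosine(q, vecs[i]))
    return exchanges[best][1]   # output the response of the most similar message

exchanges = [
    ("do you like to swim", "yes , swimming is my favorite"),
    ("where do you work", "i work in advertising"),
]
reply = ir_reply("do you work", exchanges)
```

Adding the persona would amount to appending the profile sentences to `query` before vectorizing, mirroring the bag-of-words concatenation in the last bullet.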

4.2 Ranking Profile Memory Network

Both previous models combine the profile info with the dialogue history

→ the model cannot tell the two apart when deciding the next utterance

A memory network takes the dialogue history as the input query and learns attention over each profile sentence
similarity : between the input q and the profile sentence $p_i$

candidates are ranked by their similarity to $q^+$
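
A small NumPy sketch of this attention-and-rank step follows. It assumes the form $q^+ = q + \sum_i s_i p_i$ with $s_i = \mathrm{softmax}(sim(q, p_i))$ and cosine similarity, and uses made-up 3-dimensional embeddings; `profile_memory_rank` is a hypothetical helper.

```python
import numpy as np

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def profile_memory_rank(q, profile, candidates):
    s = softmax(np.array([cos(q, p) for p in profile]))   # attention over profile sentences
    q_plus = q + s @ profile                              # q+ = q + sum_i s_i p_i
    scores = np.array([cos(q_plus, c) for c in candidates])
    return np.argsort(-scores), s                         # best candidate first

q = np.array([1.0, 0.0, 0.0])                 # embedded dialogue history
profile = np.array([[0.0, 1.0, 0.0],          # embedded profile sentences p_i
                    [1.0, 0.0, 0.0]])
candidates = np.array([[0.0, 0.0, 1.0],
                       [1.0, 0.2, 0.0]])
order, s = profile_memory_rank(q, profile, candidates)
```

Because the profile enters through a separate attention step rather than being concatenated to the query, the model can weigh the profile sentences differently for each dialogue context.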

4.3 Key-Value Profile Memory Network

key-value (KV) memory network

improvement to the memory network by performing attention over keys and outputting the values
A memory network with the dialogue history as keys and the next dialogue utterance (=reply of speaking partner) as values → the model has a memory of past dialogues that it can use directly for prediction
Using $q^+$ from the ranking profile memory network, attention over each key is computed, and a weighted sum of the values forms a new query embedding $q^{++}$
$q^{++}$ is then used to rank the candidates $c'$ by similarity, in the same way as before
Because a very large set of key-value pairs can make training very slow, in the experiments the profile memory network is trained and the same model's weights are then applied at test time

4.4 Seq2Seq

A simple seq2seq with an LSTM encoder and decoder
- at each timestep t, the decoder produces the probability of word j via softmax
- trained with negative log likelihood
GloVe word embeddings
the persona is concatenated to the input sequence

4.5 Generative Profile Memory Network

Each profile sentence becomes an individual memory representation in the memory network
When decoding in the seq2seq model, each step attends over the memory (=profile sentences) to produce a persona context vector, which is fed in as an additional input

5. Experiments

Performance improves when persona information is used
The revised version is a more challenging dataset than the original
Measured by hits@1, ranking models outperform generative models
In human scoring as well, ranking models outperform generative models

Attention is all you need

Tue, 31 May 2022 01:52:27 GMT

~~True oneness of self and object~~

This post was written with reference to the following materials!

Attention is all you need

DSBA Lab Transformer lecture

Jay Alammar - The Illustrated Transformer

Transformer

Does not process sequential data step by step like an RNN; the whole sequence is processed at once
A model that uses attention to speed up training and that is easy to parallelize
Encoding component, decoding component, and the connections between them

Encoding component : a stack of Encoders, 6 in the paper

Decoding component : a stack of Decoders, also 6

→ unlike seq2seq, the layers are applied repeatedly

Uses 512 tokens; shorter sentences are padded

Encoder

Each encoder has a self-attention layer, which captures the relations between the words of the input sentence, and a fully connected feed-forward layer, which is applied identically to every word
Not sequential; unmasked, using the whole sequence at once
All encoders are identical in structure → but they do not share weights
Each encoder consists of a self attention layer and a fully connected feed-forward layer

self attention layer
- scores how strongly each word's vector relates to the other words in the sentence (a layer that helps the encoder look at other words in the input sentence as it encodes a specific word)
fully connected feed-forward layer
- takes the output of the self attention layer as its input
- the feed forward network is applied to each position separately, rather than once over all positions together, to produce the output

Decoder

In addition to the two sublayers of the encoder layer, a third sublayer, the Encoder-decoder attention layer, sits between the two
Encoder-decoder attention layer : computes how the information passed from the encoder is used when generating the final output
Requires sequential processing, Masked self attention
~~the last word is masked because of the token~~

N = 6

Encoder

1. Input Embedding : the vectors for the information that first enters the model

The embedding only happens in the bottom-most encoder : it is used as the input of the very first encoder

Uses an embedding algorithm such as Word2vec, fastText, or GloVe

common to all the encoders is that they receive a list of vectors each of the size 512 (a 512-dimensional word embedding : one word)

Only the bottom-most encoder receives word embeddings as input; the encoders above receive the output of the encoder directly below

The size of this list is a hyperparameter we can set – basically it would be the length of the longest sentence in our training dataset. (the list length itself is a hyperparameter: how long one sequence may be at most → e.g. the word count of the longest sentence, or the length that covers the top 95% of sentences) (several 512-dimensional vectors together make one sentence)

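2. Positional Encoding : information about which position each word occupies

The positional encoding is added to the input embedding (unlike concat, which appends it alongside!)

512-dim input embedding + 512-dim positional encoding ⇒ 512 dimensions

A vector added to each input embedding

The transformer takes the whole sequence at once and so cannot by itself account for word positions → positional encoding is the device that at least lets it reflect order

It must be added to the word embedding at the same scale → the size (norm) of the positional encoding must not change with the position of the word.

Since it expresses positional relations, the farther apart two words are, the farther apart their positional encoding vectors should be

Result of computing the L2-norm distribution of the positional encodings generated with the paper's formula, for sequence length n=100 and 512-dimensional words

- the standard deviation is very small relative to the mean
- the farther apart two positions are, the larger the distance between their positional encodings should be (The further the two positions, the larger the distance)
- in practice this does not always hold
- even if the information about which word comes first and which comes later is not preserved, information about how far apart they are can still be conveyed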
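
The norm observation can be reproduced with the sinusoidal encoding from the paper, $PE_{(pos,2i)} = \sin(pos/10000^{2i/d})$ and $PE_{(pos,2i+1)} = \cos(pos/10000^{2i/d})$. The sketch below assumes that formula; `positional_encoding` is a hypothetical helper name.

```python
import numpy as np

def positional_encoding(n_pos, d_model):
    pos = np.arange(n_pos)[:, None]                 # positions 0 .. n_pos-1
    i = np.arange(d_model // 2)[None, :]            # index of each sin/cos pair
    angle = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_pos, d_model))
    pe[:, 0::2] = np.sin(angle)                     # even dimensions
    pe[:, 1::2] = np.cos(angle)                     # odd dimensions
    return pe

pe = positional_encoding(100, 512)
norms = np.linalg.norm(pe, axis=1)                  # L2 norm per position
dist_from_first = np.linalg.norm(pe - pe[0], axis=1)
```

Each sin/cos pair contributes exactly 1 to the squared norm, so every position has norm $\sqrt{256} = 16$, which is why adding the encoding does not change scale with position.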
3. Multi-head attention

the word in each position flows through its own path in the encoder

There are dependencies between these paths in the self-attention layer. The feed-forward layer does not have those dependencies → thus the various paths can be executed in parallel while flowing through the feed-forward layer.

Self attention layer : dependencies O, feed-forward layer : dependencies X

Having dependencies : the positions influence one another

The number of dimensions stays the same.

The feed forward layers have the same structure but do not share weights

Self attention

Role of self attention :

The animal didn’t cross the street because it was too tired

it : the animal

When processing the word "it", self attention looks at the other words in the sentence to work out which word "it" is associated with

= which words are related to the current word?

Self-Attention in Detail

✅ Step 1. create three vectors from each of the encoder’s input vectors (for each input vector, create a Query vector, a Key vector, and a Value vector)

Query vector : representation of the word currently being looked at, the reference used to score the other words (We only care about the query of the token we’re currently processing)

Key vector : a label; given a query, the keys are what is matched against to find the words relevant to that query

ex. given the query for "it", each key is the identity (label) of one of the files being searched

Value vector : actual word representation

⇒ the query and keys are used to find the most relevant values, which are then combined!

Query, key, and value are produced from the Input Embedding

Each has its own matrix, and they are obtained by matrix multiplication

ex. X1 x Wq → (1x4) x (4x3) ⇒ (1x3) ⇒ q1

Wq, Wk, Wv are unknowns to be found through training

Notice that these new vectors are smaller in dimension than the embedding vector. Their dimensionality is 64 (from the multi-head attention viewpoint, the vectors produced by the heads are concatenated before continuing through the encoder and decoder, so it is better to keep them small)

64*8 = 512 → here 8 is the number of attention heads

✅ Step 2. calculate a score, i.e, how much focus to place on other parts of the input sentence as we encode a word at a certain position (compute a score to find which words are most related to the query)

The score is calculated by taking the dot product of the query vector with the key vector of the respective word we’re scoring.

The query of the current token (q1) is combined with every key, including its own, to compute the scores

✅ Step 3. divide the scores by the square root of the dimension of the key vectors (the scores are divided by the square root of the query/key/value dimension)

This leads to having more stable gradients

✅ Step 4. pass the result through a softmax operation

The resulting score indicates how much each word influences the current word

✅ Step 5. multiply each value vector by the softmax score (the actual values are multiplied by the softmax outputs)

✅ Step 6. sum up the weighted value vectors which produces the output of the self-attention layer at this position (for the first word).

⇒ Self-Attention of the current word = weighted sum of value vectors

Matrix Calculation of Self-Attention

Summary

Each word is multiplied by Wq, Wk, Wv to produce its query, key, and value.

The first word's query is combined with the keys of all words → softmax → weights

Each weight is multiplied by the corresponding value and the results are summed, completing the self attention of the first word

Repeating this process for every word completes the self attention for all words
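
In matrix form the six steps collapse into a few lines. The NumPy sketch below uses the toy sizes from the example above (embedding size 4, projection size 3) with random weights; `self_attention` is a hypothetical helper.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv                   # Step 1: queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # Steps 2-3: scaled dot products
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # Step 4: softmax per row
    return weights @ V, weights                        # Steps 5-6: weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4))                            # 2 words, embedding size 4
Wq, Wk, Wv = (rng.normal(size=(4, 3)) for _ in range(3))
Z, weights = self_attention(X, Wq, Wk, Wv)
```

Row `i` of `weights` holds how much word `i` attends to every word (itself included), and row `i` of `Z` is the corresponding weighted sum of value vectors.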

Multi-Head Attention

Multi-Head Attention (Expand the model’s ability to focus on different positions)

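→ with 8 different representation subspaces, the model can capture context better than with single-head attention.

→ the layer is learned several times with slightly different initial conditions, which provides more candidate words related to 'it'.

In the earlier example, rather than settling on a single word that it relates to, several are allowed = use multiple attention heads!

Calculating attention separately in eight different attention heads

The individual attention outputs are computed and then concatenated

concatenate

Multiply by Wo, which has as many rows as the concatenated vector has columns and as many columns as the original embedding dimension, and which is learned jointly during training

This produces an output with the same dim as the original input embedding

Only the first encoder computes this on the input embedding X; later encoders compute it on the output R of the previous encoder

The size is preserved throughout!

The same word can be seen to be scored differently by different heads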

Residual block & Normalization

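What comes after the self attention layer? ⇒ Residual block & Normalization

f(x)+x → differentiate → f’(x)+1 (the gradient always keeps a constant term of 1)

The input x is added as-is to the self attention output z, followed by layer normalization

⇒ this completes z, the input to the FFNN!

Residual & Normalization is applied by default twice per encoder and three times per decoder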

Position-wise Feed Forward Networks

Fully connected feed forward network

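Applied to each position separately and identically (applied individually to each position)

$FFN(x) = \max(0, xW_1 + b_1)W_2 + b_2$ (a ReLU between two linear transformations)

Different parameters for each layer

linear transformations are the same across different positions

FFNs within the same encoder block have the same structure

They differ between encoder blocks

In the figure, W1 and W2 are the same for z1 and z2!

Can be thought of as a convolution with kernel size = 1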

Decoder

Masked Multi-head attention

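The self attention layer in the decoder may only attend to tokens at positions earlier than the current one

To implement this mathematically, the scores of later words are set to -inf, so that after softmax their weights become 0

This can be pictured as follows

However, it does not have to be run sequentially; it is computed in one pass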
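
The -inf trick can be verified directly. In this NumPy sketch the raw scores are all zero so the effect of the mask is easy to read; `causal_attention_weights` is a hypothetical helper.

```python
import numpy as np

def causal_attention_weights(scores):
    n = scores.shape[0]
    future = np.triu(np.ones((n, n), dtype=bool), k=1)   # True strictly above the diagonal
    masked = np.where(future, -np.inf, scores)           # later positions get -inf
    masked = masked - masked.max(axis=-1, keepdims=True)
    w = np.exp(masked)                                   # exp(-inf) = 0
    return w / w.sum(axis=-1, keepdims=True)

scores = np.zeros((3, 3))
w = causal_attention_weights(scores)
```

All rows are masked and normalized in a single matrix operation, which is why training does not need to step through the positions one at a time.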

Whereas Masked Multi-head attention operates on the output embedding, the decoder's second sublayer, the Encoder-decoder attention layer, operates on the output of Masked Multi-head attention and the output of the encoder

encoder start by processing the input sequence

The output of the top encoder is then transformed into a set of attention vectors K and V (These are to be used by each decoder in its “encoder-decoder attention” layer which helps the decoder focus on appropriate places in the input sequence)

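K and V, obtained from the encoder's final output, are used in the encoder-decoder attention layer of each stacked decoder to decide which parts of the input the decoder should focus on

Because the decoder's input has to be masked, data is fed in sequentially during generation, and the process repeats until a special end token is reached

The encoder-decoder attention layer works the same way as the encoder's multi-headed self-attention, except that the queries come from the decoder layer below it while the keys and values come from the encoder output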
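
A single-head sketch of encoder-decoder attention, where queries are built from the decoder side and keys and values from the encoder output. The function name, the sequence lengths (6 source tokens, 3 target positions), and the random weights are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def encoder_decoder_attention(dec_x, enc_out, Wq, Wk, Wv):
    Q = dec_x @ Wq    # queries from the decoder's masked self-attention output
    K = enc_out @ Wk  # keys from the encoder's final output
    V = enc_out @ Wv  # values from the encoder's final output
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
enc_out = rng.normal(size=(6, 8))  # 6 source tokens
dec_x = rng.normal(size=(3, 8))    # 3 target positions
Wq, Wk, Wv = (rng.normal(size=(8, 4)) for _ in range(3))
out, weights = encoder_decoder_attention(dec_x, enc_out, Wq, Wk, Wv)
```

Each row of `weights` is a distribution over the 6 source tokens, i.e. where that target position is looking in the input sequence.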

The Final Linear and Softmax Layer

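Linear layer : a fully connected layer that projects the decoder output to a logits vector with one entry per vocabulary word

Softmax layer : turns the logits into the final output probabilities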
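
The last two layers can be sketched as a projection followed by a softmax. The names `W_vocab` and `b_vocab`, the toy vocabulary size of 10, and the greedy `argmax` choice of the next token are assumptions for illustration.

```python
import numpy as np

def output_probabilities(h, W_vocab, b_vocab):
    # Linear layer: project the decoder output to one logit per vocabulary word.
    logits = h @ W_vocab + b_vocab
    # Softmax layer: turn the logits into probabilities.
    logits = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, vocab_size = 8, 10
h = rng.normal(size=(d_model,))  # decoder output at one position
W_vocab = rng.normal(size=(d_model, vocab_size))
b_vocab = np.zeros(vocab_size)
probs = output_probabilities(h, W_vocab, b_vocab)
next_token = int(np.argmax(probs))  # index of the most probable word
```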
