<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>seongyeon_park.log</title>
        <link>https://velog.io/</link>
        <description>General Engineer</description>
        <lastBuildDate>Mon, 10 Oct 2022 04:23:35 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>seongyeon_park.log</title>
            <url>https://velog.velcdn.com/images/seongyeon_park/profile/0726e7a9-6408-4e4b-8ab8-823d47df70e8/social_profile.png</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. seongyeon_park.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/seongyeon_park" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[논문리뷰] STP-UDGAT: Spatial-Temporal-Preference User Dimensional Graph Attention Network for Next POI Recommendation (2020)]]></title>
            <link>https://velog.io/@seongyeon_park/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0-STP-UDGAT-Spatial-Temporal-Preference-User-Dimensional-Graph-Attention-Network-for-Next-POI-Recommendation</link>
            <guid>https://velog.io/@seongyeon_park/%EB%85%BC%EB%AC%B8%EB%A6%AC%EB%B7%B0-STP-UDGAT-Spatial-Temporal-Preference-User-Dimensional-Graph-Attention-Network-for-Next-POI-Recommendation</guid>
            <pubDate>Mon, 10 Oct 2022 04:23:35 GMT</pubDate>
<description><![CDATA[<h2 id="세-줄-요약">Three-Line Summary</h2>
<ul>
<li>Performs new PoI exploration from a user&#39;s historical PoI data along three aspects: spatial, temporal, and preference</li>
<li>Exploits the user&#39;s historical data (local view) + explores new PoIs (global view)</li>
<li>Random walks via masked self-attention enable higher-order PoI exploration (optional)</li>
</ul>
<h2 id="problem">Problem</h2>
<ul>
<li>Existing (RNN-based) methods learn the PoI-PoI relationship only from users&#39; sequential visit data</li>
</ul>
<h2 id="solution">Solution</h2>
<ul>
<li>Incorporate additional factors (STP: spatial, temporal, preference) when learning the PoI-PoI relationship.</li>
</ul>
<h2 id="approach">Approach</h2>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/ccb0038b-d681-4dd1-a7c9-c743e79fe796/image.png" alt=""></p>
<h3 id="dgat">DGAT</h3>
<p>(Algorithm details omitted)</p>
<ul>
<li>Appears to be a variant of GAT; a DGAT layer is applied to every embedding</li>
</ul>
<h3 id="pp-dgatpersonalized-preference-dgat">PP-DGAT(Personalized Preference-DGAT)</h3>
<ul>
<li>Constructs a graph from the user&#39;s historical PoI visits</li>
<li>All PoIs in this graph are fully connected, however</li>
<li>This part corresponds to the exploitation of users&#39; historical PoIs (local view)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/e6d699af-affa-416a-a741-dd1964420726/image.png" alt=""></p>
<h3 id="stp-dgat">STP-DGAT</h3>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/b4736dfc-9048-41c1-917c-df58dd9db29e/image.png" alt=""></p>
<h4 id="spatial-graph">Spatial Graph</h4>
<ul>
<li>Each PoI is made adjacent to its top-k nearest PoIs</li>
<li>Edge weight is the inverse of the Euclidean distance</li>
</ul>
<h4 id="temporal-graph">Temporal Graph</h4>
<ul>
<li>Consecutively visited PoIs are made adjacent</li>
<li>The time intervals of consecutive-visit pairs are pooled across all users and averaged</li>
<li>Edge weight is the inverse of the averaged time interval</li>
</ul>
<h4 id="preference-graph">Preference Graph</h4>
<ul>
<li>Consecutively visited PoIs are made adjacent</li>
<li>Edge weight is the transition frequency (visit count) between the PoIs; see the sketch after this list</li>
</ul>
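<p>A minimal sketch of the spatial-graph construction (the names <code>coords</code> and <code>k</code> are illustrative, not from the paper; the temporal and preference graphs follow the same pattern, with the inverse averaged interval and the visit frequency as weights):</p>
<pre><code class="language-python">import numpy as np

def build_spatial_graph(coords, k):
    # Each PoI becomes adjacent to its k nearest PoIs;
    # edge weight = 1 / Euclidean distance.
    n = len(coords)
    dists = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)             # exclude self-loops
    weights = np.zeros((n, n))
    for i in range(n):
        nearest = np.argsort(dists[i])[:k]      # top-k closest PoIs
        weights[i, nearest] = 1.0 / dists[i, nearest]
    return weights
</code></pre>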
<h3 id="exploring-new-stp-neighbours">Exploring New STP Neighbours</h3>
<ul>
<li>Finds relevant new PoIs in the STP graphs constructed above</li>
<li>The nodes in the user&#39;s PP graph serve as the seed set</li>
<li><strong>Random Walk Masked Self-Attention</strong> enables higher-order exploration in the STP graphs as well, as sketched after this list</li>
</ul>
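<p>A rough sketch of the exploration idea, assuming <code>adj</code> is a weight matrix like the one above; the walk length and sampling rule are illustrative, not the paper&#39;s exact procedure:</p>
<pre><code class="language-python">import numpy as np

def random_walks(adj, seeds, walk_len, rng=np.random.default_rng()):
    # Walk from each seed node, choosing neighbours with
    # probability proportional to edge weight.
    walks = []
    for seed in seeds:
        walk = [seed]
        for _ in range(walk_len):
            w = adj[walk[-1]]
            if w.sum() == 0:                    # dead end: stop early
                break
            walk.append(rng.choice(len(w), p=w / w.sum()))
        walks.append(walk)
    return walks    # visited nodes form the candidate set of new PoIs
</code></pre>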
<h3 id="stp-udgatuser-dgat">STP-UDGAT(User DGAT)</h3>
<ul>
<li>Intended to draw on other, similar users</li>
<li>Two users are defined as adjacent if their Jaccard similarity coefficient is at least 0.2 (see the sketch below)</li>
</ul>
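<p>The adjacency criterion as a sketch (assuming <code>visits_u</code> and <code>visits_v</code> are the two users&#39; sets of visited PoIs):</p>
<pre><code class="language-python">def users_adjacent(visits_u, visits_v, threshold=0.2):
    # Adjacent if the Jaccard similarity of the visited-PoI sets
    # is at least the threshold (0.2 in the paper).
    inter = len(visits_u &amp; visits_v)
    union = len(visits_u | visits_v)
    return union &gt; 0 and inter / union &gt;= threshold
</code></pre>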
<h2 id="results">Results</h2>
<ul>
<li>A pity that the baselines include no GNN-family methods</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[cs224w | Lecture 4. Link Analysis: PageRank]]></title>
            <link>https://velog.io/@seongyeon_park/cs224w-Lecture-4.-Link-Analysis-PageRank</link>
            <guid>https://velog.io/@seongyeon_park/cs224w-Lecture-4.-Link-Analysis-PageRank</guid>
            <pubDate>Sun, 09 Oct 2022 03:26:46 GMT</pubDate>
            <description><![CDATA[<h2 id="link-analysis-algorithms">Link Analysis Algorithms</h2>
<h3 id="pagerank-the-flow-model">PageRank: The &quot;Flow&quot; Model</h3>
<ul>
<li>A page is important if it is pointed to by other important pages</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/fb486d3e-5e70-4f9f-991c-7a81bafc4c1c/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/69663582-c899-4f7c-93bf-35b81ff98425/image.png" alt=""></p>
<ul>
<li>Connection to random walk</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/8baa652c-6e4e-477b-b508-86c7b640fcdb/image.png" alt=""></p>
<ul>
<li>What is the long-term distribution of the surfers?</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/a8949235-25f5-4bf5-b15b-0a3a96998742/image.png" alt=""></p>
<ul>
<li><p>PageRank is the stationary distribution of the random walk</p>
</li>
<li><p>This reduces to finding the eigenvector of the matrix M with eigenvalue 1</p>
</li>
<li><p>Summary</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/ddc6571e-e58f-4182-8767-b6716f9e4812/image.png" alt=""></p>
<h3 id="pagerank-how-to-solve">PageRank: How to solve?</h3>
<ul>
<li>Power Iteration</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/bcb63ede-a874-4894-9ac9-e8c224e921bf/image.png" alt=""></p>
<ul>
<li>Problems of Spider Traps / Dead ends</li>
<li><blockquote>
<p>Use <strong>Teleports</strong></p>
</blockquote>
</li>
<li>PageRank uses the Google matrix: with damping factor $\beta$, the surfer follows a link with probability $\beta$ and teleports to a random node with probability $1-\beta$ (see the sketch after the figures)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/37c90bdd-b4a0-45ad-8fef-4803749bf44f/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/c80c0582-1693-485e-94ac-2e2b2c3c0303/image.png" alt=""></p>
<h3 id="random-walk-with-restarts--personalized-pagerank">Random Walk with Restarts / Personalized PageRank</h3>
<ul>
<li>In Personalized PageRank, the teleport probability over nodes is not uniform</li>
</ul>
<h3 id="matrix-factorization-and-node-embeddings">Matrix Factorization and Node Embeddings</h3>
<ul>
<li>Two nodes being similar means they are connected to each other, i.e.,
$$
\bold{Z}^{T}\bold{Z} = A
$$</li>
<li>In practice, however, the number of nodes n is far larger than the embedding dimension d, so an exact factorization reproducing A is infeasible.</li>
<li>We can nevertheless approximate $\bold{Z}$ so that the difference between the two is small, as in the sketch after the figure.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/be18e655-7a39-4a8f-9beb-8192f6e3a5de/image.png" alt=""></p>
<ul>
<li><p>DeepWalk, Node2Vec</p>
</li>
<li><p>Limitations</p>
<ol>
<li>Cannot obtain embeddings for nodes not in the training set</li>
<li>Cannot capture structural similarity</li>
<li>Cannot utilize node, edge and graph features</li>
</ol>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[cs224w | Lecture 3. Node Embeddings]]></title>
            <link>https://velog.io/@seongyeon_park/cs224w-Lecture-3.-Node-Embeddings</link>
            <guid>https://velog.io/@seongyeon_park/cs224w-Lecture-3.-Node-Embeddings</guid>
            <pubDate>Sat, 08 Oct 2022 08:07:20 GMT</pubDate>
            <description><![CDATA[<h2 id="embedding-nodes">Embedding Nodes</h2>
<h3 id="encoder-and-decoder">Encoder and Decoder</h3>
<ul>
<li>Goal: embed nodes so that similar nodes lie close together in the embedding space
$$
similarity(u,v) \approx \bold{z}_{v}^{T} \bold{z}_{u}
$$</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/67ddbdd7-807b-4650-8e2a-a70b64ab87c3/image.png" alt=""></p>
<h3 id="shallow-encoding">&quot;Shallow&quot; Encoding</h3>
<ul>
<li>Encoder is just an embedding-lookup</li>
<li>Each node is assigned a unique embedding vector</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/33c45caa-6763-4ff6-9e91-4fb7f5c1974d/image.png" alt=""></p>
<p>The goal is to optimize the parameters $\bold{Z}$ that maximize this similarity</p>
<h2 id="random-walk-embeddings">Random-Walk Embeddings</h2>
<p>$\bold{z}_{u}^{T}\bold{z}_{v} \approx$ the probability that $u$ and $v$ co-occur on a random walk</p>
<ul>
<li>Train so that the predicted probability matches the actual random-walk probability</li>
<li>Nodes appearing in a walk&#39;s neighborhood set should receive high probability</li>
<li>Generally works well and efficiently</li>
<li>Expressivity: incorporate high-order, multi-hop info</li>
<li>Efficiency: only need to consider pairs that co-occur</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/922b4e77-fc07-4993-bf3c-490cf38c8565/image.png" alt=""></p>
<ul>
<li>Normalizing over all nodes is expensive</li>
<li><blockquote>
<p>Solution: Negative sampling</p>
</blockquote>
<ul>
<li>Sample k negative nodes instead</li>
<li>Sample each with probability proportional to its degree</li>
<li>In practice, k = 5~20</li>
</ul>
</li>
<li>Optimize embeddings using Stochastic Gradient Descent (a per-pair loss sketch follows below)</li>
</ul>
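<p>A sketch of the per-pair loss with negative sampling (assuming <code>Z</code> is the embedding matrix and <code>degrees</code> the node degrees; this follows the standard objective, not any particular implementation):</p>
<pre><code class="language-python">import numpy as np

def neg_sampling_loss(Z, u, v, degrees, k=5, rng=np.random.default_rng()):
    # Loss for one co-occurring pair (u, v) with k degree-proportional negatives.
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    negs = rng.choice(len(degrees), size=k, p=degrees / degrees.sum())
    loss = -np.log(sigmoid(Z[u] @ Z[v]))            # pull the pair together
    loss -= np.log(sigmoid(-Z[negs] @ Z[u])).sum()  # push negatives away
    return loss
</code></pre>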
<h4 id="node2vec">Node2Vec</h4>
<ul>
<li>So how should the random walks themselves be performed?</li>
<li>node2vec is a biased walk-sampling strategy for embedding a graph&#39;s nodes as vectors, interpolating between two extremes (a step sketch follows the figures below):</li>
<li>Breadth-First Search</li>
<li>Depth-First Search</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/0ff93a1e-c7af-4f7d-8e49-a3e795b7551a/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/cc392fed-0e3b-4e4c-b7db-3931a059c6e6/image.png" alt=""></p>
<ul>
<li>Parameters<ul>
<li>Return parameter p</li>
<li>In-out parameter q: ratio between BFS and DFS</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/1b45881c-ae4d-49e2-81b2-82584578365f/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/e2984a8f-0828-46d8-a09d-8e9a0ae7493c/image.png" alt=""></p>
<h2 id="embedding-entire-graphs">Embedding Entire Graphs</h2>
<h3 id="approaches">Approaches</h3>
<ul>
<li>Sum the node (sub-graph) embeddings</li>
<li>Use the embedding of a virtual node</li>
<li>Anonymous Walks: a probability vector over the possible anonymous walks (see the sketch after this list)</li>
<li>Learn Walk Embeddings: predict the next anonymous walk</li>
</ul>
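<p>For intuition, a walk is anonymized by replacing each node with the index of its first appearance, so a walk A-B-C-B-C becomes 0-1-2-1-2 (a minimal sketch):</p>
<pre><code class="language-python">def anonymize(walk):
    # Map each node to the index of its first occurrence in the walk.
    first_seen = {}
    return [first_seen.setdefault(node, len(first_seen)) for node in walk]

anonymize([7, 3, 9, 3, 9])  # -&gt; [0, 1, 2, 1, 2]
</code></pre>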
]]></description>
        </item>
        <item>
            <title><![CDATA[cs224w | Lecture 2. Traditional Methods for ML on Graphs]]></title>
            <link>https://velog.io/@seongyeon_park/cs224w-Lecture-2.-Traditional-Methods-for-ML-on-Graphs</link>
            <guid>https://velog.io/@seongyeon_park/cs224w-Lecture-2.-Traditional-Methods-for-ML-on-Graphs</guid>
            <pubDate>Sat, 08 Oct 2022 07:14:47 GMT</pubDate>
            <description><![CDATA[<h2 id="graph-level-features">Graph-Level Features</h2>
<h3 id="kernel-methods">Kernel Methods</h3>
<ul>
<li>Graph Kernels: Measure similarity between graphs</li>
<li>Goal: Design graph feature vector $\phi (G)$</li>
<li>Key Idea: <strong>Bag of Words</strong></li>
</ul>
<h4 id="graphlet-kernels">Graphlet Kernels</h4>
<ul>
<li>Key Idea: Count the number of different graphlets in a graph</li>
<li>Don&#39;t have to be connected</li>
<li>Don&#39;t have to be rooted</li>
<li>Given two graphs, G and G&#39;, the graphlet kernel is computed as follows (sketched in code after the figures):
$$
K(G,G&#39;)=\mathit{\bold{f}}_{G}^{T} \mathit{\bold{f}}_{G&#39;}
$$</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/955170e3-9cf8-46ad-a892-54882694b814/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/e776d49c-1fe4-4598-9959-00ca1cf07d85/image.png" alt=""></p>
<ul>
<li>Limitations: Counting graphlets is expensive!</li>
</ul>
<h4 id="weisfeiler-lehman-kernel">Weisfeiler-Lehman Kernel</h4>
<ul>
<li>Color Refinement: iterate color aggregation and obtain color count vectors (sketched after the figures)</li>
<li>Closely related to Graph Neural Networks</li>
<li>Counting colors takes time linear in #(nodes)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/21e890e8-7e22-4990-9ea8-b0097b1ceef7/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/7825ceda-c4be-4a9d-9f4a-45a0e79b1c36/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/d9e85d18-4036-4170-a2c9-5a1cdfd522fa/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[논문리뷰] LightGCN: Simplifying and Powering Graph Convolution Network for Recommendation (2020)]]></title>
            <link>https://velog.io/@seongyeon_park/LightGCN-Simplifying-and-Powering-Graph-Convolution-Network-for-Recommendation</link>
            <guid>https://velog.io/@seongyeon_park/LightGCN-Simplifying-and-Powering-Graph-Convolution-Network-for-Recommendation</guid>
            <pubDate>Tue, 04 Oct 2022 08:25:50 GMT</pubDate>
            <description><![CDATA[<h2 id="세-줄-요약">세 줄 요약</h2>
<ul>
<li>Aims to simplify the parts of the standard GCN design (feature transformation, nonlinear activation) that degrade collaborative-filtering performance</li>
<li>Proposes LightGCN, which keeps only the essential component, neighborhood aggregation</li>
<li>Shows improved results over SOTA GCN-based recommenders</li>
</ul>
<h2 id="problem-definition">Problem definition</h2>
<ul>
<li>GCN is a method used mainly for node classification on graphs with rich node features</li>
<li>In the user-item interaction graph for CF, however, each node carries only an ID embedding (no concrete semantics)</li>
<li>Hence the feature transformation and nonlinear activation of GCN contribute little in Neural Graph Collaborative Filtering (NGCF)</li>
</ul>
<h2 id="solution">Solution</h2>
<ul>
<li>LightGCN keeps only the most essential component of GCN: neighborhood aggregation</li>
</ul>
<h2 id="empirical-explorations-on-ngcf">Empirical Explorations on NGCF</h2>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/ba2d9aa6-68db-4fbb-8ab0-46ea3e5059b0/image.png" alt=""></p>
<ul>
<li>NGCF-f (removes feature transformation)</li>
<li>NGCF-n (removes the non-linear activation function)</li>
</ul>
<h3 id="finding">Finding</h3>
<p>(1) Feature transformation imposes a negative effect on NGCF
(2) Nonlinear activation has only a slight effect
(3) Removing both simultaneously yields a large improvement (9.57% relative improvement in recall)
<strong>-&gt; The deterioration of NGCF stems from training difficulty rather than from overfitting</strong></p>
<h2 id="lightgcn">LightGCN</h2>
<p><img src="https://velog.velcdn.com/images/seongyeon_park/post/9965b79f-a536-4720-a3de-e9180a980636/image.png" alt=""></p>
<p>In GCN, the basic update of an item or user embedding is
$$
\bold{e}_{u}^{(k+1)}=AGG(\bold{e}_{u}^{(k)},\left\{\bold{e}_{i}^{(k)}:i\in \mathcal{N}_{u}\right\}).
$$
LightGCN removes the self-connection, feature transformation, and nonlinear activation from this,
so its neighborhood aggregation becomes
$$
\bold{e}_{u}^{(k+1)}=\sum_{i \in \mathcal{N}_{u}} \frac {1} {\sqrt{\left\vert \mathcal{N}_{u} \right\vert} \sqrt{\left\vert \mathcal{N}_{i} \right\vert}} \bold{e}_{i}^{(k)} \\
\bold{e}_{i}^{(k+1)}=\sum_{u \in \mathcal{N}_{i}} \frac {1} {\sqrt{\left\vert \mathcal{N}_{i} \right\vert} \sqrt{\left\vert \mathcal{N}_{u} \right\vert}} \bold{e}_{u}^{(k)}
$$
Since the aggregation is a simple weighted sum, the only trainable model parameters are the 0-th layer embeddings $\bold{e}_{u}^{(0)}$ and $\bold{e}_{i}^{(0)}$. 
To obtain the final representations, a <strong>layer combination</strong> sums the embeddings of all $K+1$ layers: 
$$
\bold{e}_{u}=\sum^{K}_{k=0}\alpha_{k}\bold{e}_{u}^{(k)};\quad
\bold{e}_{i}=\sum^{K}_{k=0}\alpha_{k}\bold{e}_{i}^{(k)}
$$
Here $\alpha_{k}$, set as a hyperparameter, is the importance of the k-th layer embedding. 
The paper reports that setting it uniformly to $1/(K+1)$ for every layer performs well overall. 
Three reasons are given for the <strong>layer combination</strong>: 
(1) to mitigate the over-smoothing that worsens as the number of layers grows;
(2) because the embeddings at each layer capture different semantics;
(3) because layer combination via a weighted sum has the effect of a self-connection (despite the alpha being uniform...).</p>
<p>Finally, for model prediction, the inner product of the user and item embeddings gives the ranking score for recommendation:
$$
\hat{y}_{ui}=\bold{e}_{u}^{T}\bold{e}_{i}
$$
In matrix form, the process looks as follows. </p>
<p>For the user-item interaction matrix $\bold{R} \in \mathbb{R}^{M\times N}$, </p>
<p>$$
\bold{A}=\begin{pmatrix} \bold{0} &amp; \bold{R} \\ \bold{R}^{T} &amp; \bold{0} \end{pmatrix},
$$
for embedding size $T$,
$$
\bold{E}^{(0)} \in \mathbb{R}^{(M+N)\times T},
$$
and for the symmetrically normalized matrix $\tilde{A}=\bold{D}^{-\frac{1}{2}}\bold{A}\bold{D}^{-\frac{1}{2}}$,
$$
\bold{E}^{(k+1)}=\bold{D}^{-\frac{1}{2}}\bold{A}\bold{D}^{-\frac{1}{2}}\bold{E}^{(k)}. 
$$
Therefore,
$$
\begin{aligned} 
\bold{E} &amp; = \alpha_{0}\bold{E}^{(0)}+\alpha_{1}\bold{E}^{(1)}+\alpha_{2}\bold{E}^{(2)}+\cdots+\alpha_{K}\bold{E}^{(K)} \\
&amp; = \alpha_{0}\bold{E}^{(0)}+\alpha_{1}\tilde{A}\bold{E}^{(0)}+\alpha_{2}\tilde{A}^{2}\bold{E}^{(0)}+\cdots+\alpha_{K}\tilde{A}^{K}\bold{E}^{(0)} 
\end{aligned}
$$</p>
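<p>The whole matrix-form pipeline fits in a few lines (a sketch with illustrative names; <code>R</code> is the $M\times N$ interaction matrix, <code>E0</code> the 0-th layer embedding matrix, and $\alpha_{k}=1/(K+1)$ as in the paper):</p>
<pre><code class="language-python">import numpy as np

def lightgcn_embeddings(R, E0, K):
    # Propagate E(k+1) = A_tilde @ E(k), then combine layers uniformly.
    M, N = R.shape
    A = np.block([[np.zeros((M, M)), R],        # A = [[0, R], [R^T, 0]]
                  [R.T, np.zeros((N, N))]])
    deg = A.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(deg, 1e-12))   # guard isolated nodes
    A_tilde = d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    E, out = E0, E0 / (K + 1)                   # alpha_k = 1/(K+1)
    for _ in range(K):
        E = A_tilde @ E
        out = out + E / (K + 1)
    return out   # rows 0..M-1 are users, rows M..M+N-1 are items
</code></pre>
<p>Ranking scores for every user-item pair are then <code>out[:M] @ out[M:].T</code>.</p>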
]]></description>
        </item>
    </channel>
</rss>