<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>data.log</title>
        <link>https://velog.io/</link>
        <description>The brightest star in the night sky</description>
        <lastBuildDate>Sun, 17 Dec 2023 12:22:29 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>data.log</title>
            <url>https://velog.velcdn.com/images/kimminsu-ds/profile/bb651819-a955-4e90-a860-9859167499e8/image.jpeg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. data.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/kimminsu-ds" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[PinSage : Graph Convolutional Neural Networks for Web-Scale Recommender Systems (2018)]]></title>
            <link>https://velog.io/@kimminsu-ds/PinSage-GCN-Networks-for-Web-Scale-Recommender-Systems</link>
            <guid>https://velog.io/@kimminsu-ds/PinSage-GCN-Networks-for-Web-Scale-Recommender-Systems</guid>
            <pubDate>Sun, 17 Dec 2023 12:22:29 GMT</pubDate>
            <description><![CDATA[<p>dddd</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[KGCN : Knowledge Graph Convolutional Networks for Recommender Systems (2019)]]></title>
            <link>https://velog.io/@kimminsu-ds/KGCN-Knowledge-Graph-Convolutional-Networks-for-Recommender-System</link>
            <guid>https://velog.io/@kimminsu-ds/KGCN-Knowledge-Graph-Convolutional-Networks-for-Recommender-System</guid>
            <pubDate>Sun, 17 Dec 2023 12:12:30 GMT</pubDate>
            <description><![CDATA[<p><a href="https://arxiv.org/abs/1904.12575">KGCN 논문</a>
<a href="https://github.com/hwwang55/KGCN">KGCN Github</a></p>
<hr>
<h2 id="0-abstract">0. Abstract</h2>
<hr>
<h2 id="1-introduction">1. Introduction</h2>
<hr>
<h2 id="2-related-work">2. Related Work</h2>
<ul>
<li>KGCN is conceptually influenced by GCN; in general, GCNs fall into two types (spectral vs. non-spectral)<ul>
<li>KGCN applies the non-spectral approach to the knowledge graph (KG)</li>
</ul>
</li>
<li>KGCN is also related to the PinSage and GAT methods<ul>
<li>PinSage and GAT are applied to homogeneous graphs</li>
<li>KGCN applies the idea to heterogeneous graphs, offering a new perspective for recommender systems</li>
</ul>
</li>
</ul>
<hr>
<h2 id="3-kgcn">3. KGCN</h2>
<h3 id="31-problem-formulation">3.1. Problem Formulation</h3>
<h3 id="32-kgcn-layer">3.2. KGCN Layer</h3>
<h3 id="33-learning-algorithm">3.3. Learning Algorithm</h3>
<hr>
<h2 id="4-experiments">4. Experiments</h2>
<h3 id="41-datasets">4.1. Datasets</h3>
<h3 id="42-baselines">4.2. Baselines</h3>
<ul>
<li>KGCN is compared against the following baselines<ul>
<li>KG-free methods<ul>
<li><strong>SVD</strong> is a classic CF-based model using inner product to model user-item interactions.</li>
<li><strong>LibFM</strong> is a feature-based factorization model in CTR scenarios. We concatenate user ID and item ID as input for LibFM.</li>
</ul>
</li>
<li>KG-aware methods<ul>
<li><strong>LibFM + TransE</strong> extends LibFM by attaching an entity representation learned by TransE to each user-item pair.</li>
<li><strong>PER</strong> treats the KG as heterogeneous information networks and extracts meta-path based features to represent the connectivity between users and items.</li>
<li><strong>CKE</strong> combines CF with structural, textual, and visual knowledge in a unified framework for recommendation. We implement CKE as CF plus a structural knowledge module in this paper.</li>
<li><strong>RippleNet</strong> is a memory-network-like approach that propagates users’ preferences on the KG for recommendation.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="43-experiments-setup">4.3. Experiments Setup</h3>
<h3 id="44-results">4.4. Results</h3>
<h4 id="441-impact-of-neighbor-sampling-size">4.4.1. Impact of neighbor sampling size</h4>
<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/51437905-bfee-4f37-96c9-e041e550d4f8/image.png" alt=""></p>
<ul>
<li>Experiments with K = 2, 4, 8, 16, 32, 64 showed the best results at <strong><code>K=4 or K=8</code></strong></li>
<li>If K is too small, neighbor information is not reflected well; if K is too large, it instead introduces noise</li>
</ul>
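<p>A minimal sketch of the fixed-size neighbor sampling behind this experiment (illustrative only, not the authors' code; the function name and the with-replacement policy for small neighborhoods are assumptions):</p>

```python
import random

def sample_neighbors(neighbors, k, rng=random.Random(0)):
    """Sample a fixed-size receptive field of k neighbors.

    If the node has at least k neighbors, sample without replacement;
    otherwise sample with replacement so every node yields exactly k.
    """
    if len(neighbors) >= k:
        return rng.sample(neighbors, k)
    return rng.choices(neighbors, k=k)

# K=4 as in the best-performing setting above
print(sample_neighbors([10, 11, 12, 13, 14, 15], k=4))
print(sample_neighbors([10, 11], k=4))
```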
<h4 id="442-impact-of-depth-or-receptive-field">4.4.2. Impact of depth or receptive field</h4>
<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/1fcca5c8-4140-4702-9d0c-ce6431cad8fd/image.png" alt=""></p>
<ul>
<li>Experiments were run with H = 1, 2, 3, 4. KGCN is more sensitive to H (depth) than to K (neighbor sampling size)</li>
<li>Noise grows as H increases; <strong><code>H=1 or H=2</code></strong> is adequate</li>
</ul>
<h4 id="443-impact-of-dimension-of-embedding">4.4.3. Impact of dimension of embedding</h4>
<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/d73faa6e-b00d-4bd6-9e06-bc44445325cb/image.png" alt=""></p>
<ul>
<li>Performance tends to improve as the embedding dimension d grows, but beyond a certain point over-fitting sets in</li>
</ul>
<hr>
<h2 id="5-conclusions-and-future-work">5. Conclusions And Future Work</h2>
<hr>
<p>[References]
<a href="https://velog.io/@lse7530/GNN-Knowledge-Graph-Convolutional-Networks-for-RecommenderSystems">https://velog.io/@lse7530/GNN-Knowledge-Graph-Convolutional-Networks-for-RecommenderSystems</a>
<a href="https://themore-dont-know.tistory.com/5">https://themore-dont-know.tistory.com/5</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Graph Representation Learning]]></title>
            <link>https://velog.io/@kimminsu-ds/Node-Embeddings</link>
            <guid>https://velog.io/@kimminsu-ds/Node-Embeddings</guid>
            <pubDate>Sun, 17 Sep 2023 00:10:25 GMT</pubDate>
            <description><![CDATA[<blockquote>
<p><strong>3.1 Node Embeddings</strong></p>
</blockquote>
<ul>
<li><p>Recap: Traditional ML for graphs</p>
<ul>
<li>Given an input graph, extract node, link and graph-level features, learn a model that maps features to labels</li>
<li>Input graph &rarr; Structured features &rarr; Learning algorithm &rarr; Prediction</li>
</ul>
</li>
<li><p>Graph representation learning</p>
<ul>
<li><span style="background-color: rgba(242,179,188,0.5)"> <strong>Graph representation learning alleviates the need to do feature engineering every single time &rarr; Automatically learn the features</strong> </span></li>
<li>Learn how to map each node into a d-dimensional space and represent it as a vector of d numbers</li>
<li>This vector of d numbers is called the feature representation or embedding<ul>
<li>This vector captures the structure of the underlying network that we are interested in analyzing or making predictions over <img src="https://velog.velcdn.com/images/kimminsu-ds/post/b5893e5e-dc78-4427-972f-aa4353d8af21/image.png" alt=""></li>
</ul>
</li>
</ul>
</li>
<li><p>Why embedding?</p>
<ul>
<li>Task: Map node into an embedding space</li>
<li>Similarity of embeddings b/t nodes indicates their similarity in the network<ul>
<li>For example: Both nodes are close to each other (connected by an edge)</li>
</ul>
</li>
<li>Encode network structure information</li>
<li>Potentially used for many downstream predictions<ul>
<li>Node classification, Link prediction, Graph classification, Anomalous node detection, Clustering and so on</li>
</ul>
</li>
</ul>
</li>
</ul>
<blockquote>
<p><strong>Encoder and Decoder</strong></p>
</blockquote>
<ul>
<li>Assume we have a graph $G$:<ul>
<li>$V$ is the vertex set</li>
<li>$A$ is the adjacency matrix (assume binary)</li>
<li>For simplicity: no node features or extra information is used</li>
</ul>
</li>
</ul>
<ul>
<li><p>Embedding nodes</p>
<ul>
<li>Encode nodes so that similarity in the embedding space (e.g., dot product) approximates similarity in the graph</li>
<li>Learn an encoder that encodes the original network as a set of node embeddings
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/f076dd81-f1c4-48d8-aa30-4de08261efff/image.png" alt=""></li>
</ul>
</li>
<li><p>Learning node embeddings</p>
<ol>
<li>Encoder maps from nodes to embeddings</li>
<li>Define a node similarity function (i.e., a measure of similarity in the original network)</li>
<li>Decoder maps from embeddings to the similarity score</li>
<li>Optimize the parameters of the encoder</li>
</ol>
</li>
<li><p>Two key components</p>
<ul>
<li>Encoder: maps each node to a low-dimensional vector</li>
<li>Similarity function: specifies how the relationships in vector space map to the relationships in the original network</li>
</ul>
</li>
<li><p>Shallow encoding</p>
<ul>
<li>Simplest encoding approach: Encoder is just an embedding-lookup</li>
<li>Each node is assigned a unique embedding vector (i.e., we directly optimize the embedding of each node): DeepWalk, Node2Vec
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/ee83fdc4-f2f4-4c8b-8c86-1f09782abf9a/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/c1be42d7-c4ef-4d57-90f4-deb031a35c7e/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/eee333b7-4730-490a-83e5-10702f03c51c/image.png" alt=""></li>
</ul>
</li>
<li><p>Encoder + Decoder framework</p>
<ul>
<li>Shallow encoder: embedding lookup</li>
<li>Parameters to optimize: $Z$ which contains node embeddings $z_u$ for all nodes $u \in V$</li>
<li>Decoder: based on node similarity</li>
<li>Objective: maximize ${z_v}^Tz_u$ for node pairs $(u, v)$ that are similar according to our node similarity function</li>
</ul>
</li>
<li><p>How to define node similarity?</p>
<ul>
<li>Should two nodes have a similar embedding if they<ul>
<li>are linked?</li>
<li>share neighbors?</li>
<li>have similar &quot;structural roles&quot;?</li>
</ul>
</li>
<li>Similarity definition that uses <strong>random walks</strong>, and how to optimize embeddings for such a similarity measure</li>
</ul>
</li>
</ul>
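<p>The shallow encoder and dot-product decoder above fit in a few lines (a toy sketch; the matrix sizes are arbitrary):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 5, 8

# Shallow encoding: the embedding matrix Z itself is the parameter set,
# and encoding a node is just a row lookup.
Z = rng.normal(size=(num_nodes, d))

def encode(u):
    return Z[u]

def decode(z_u, z_v):
    # Decoder based on node similarity: dot product in embedding space
    return float(z_u @ z_v)

score = decode(encode(0), encode(3))
```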
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/a9bfcae1-e161-4ee2-91b9-018ba5117579/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>3.2 Random walk approaches for node embeddings</strong></p>
</blockquote>
<ul>
<li><p>Notation</p>
<ul>
<li><strong>Vector $z_u$</strong>: The embedding of node $u$ (what we aim to find)</li>
<li><strong>Probability $P(v|z_u)$</strong>: The (predicted) probability of visiting node $v$ on random walks starting from node $u$ (our model prediction based on $z_u$)</li>
</ul>
</li>
<li><p>Non-linear functions used to produce predicted probabilities $P(v|z_u)$</p>
<ul>
<li><strong>Softmax function</strong>: Turns vector of $K$ real values (model predictions) into $K$ probabilities that sum to 1</li>
<li><strong>Sigmoid function</strong>: S-shaped function that turns real values into the range of (0, 1)</li>
</ul>
</li>
</ul>
<ul>
<li>Random walk<ul>
<li>Given a graph and a starting point (node), we select one of its neighbors at random, and move to this neighbor</li>
<li>Then we select a neighbor of this point at random, and move to it</li>
<li>The (random) sequence of points visited this way is a <strong>random walk on the graph</strong></li>
</ul>
</li>
</ul>
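<p>The walk described above takes only a few lines of code (a minimal sketch on a toy adjacency list):</p>

```python
import random

# Toy undirected graph as an adjacency list
graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}

def random_walk(graph, start, length, rng=random.Random(42)):
    """Unbiased walk: repeatedly hop to a uniformly chosen neighbor."""
    walk = [start]
    for _ in range(length):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

walk = random_walk(graph, start=0, length=5)
```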
<ul>
<li><span style="background-color: rgba(242,179,188,0.5)"> <strong>Random walk embeddings:</strong></span>  ${z_u}^Tz_v \approx$ probability that $u$ and $v$ co-occur on a random walk over the graph<ol>
<li>Estimate probability of visiting node $v$ on a random walk starting from node $u$ using some random walk strategy $R$</li>
<li>Optimize embeddings to encode these random walk statistics<ul>
<li>Similarity in embedding space (here: dot product = $\cos(\theta)$) encodes random walk &quot;similarity&quot;</li>
</ul>
</li>
</ol>
</li>
</ul>
<ul>
<li>Why random walks?<ul>
<li>Expressivity: Flexible stochastic definition of node similarity that incorporates both local and higher-order neighborhood information<ul>
<li><span style="background-color: rgba(242,179,188,0.5)"><strong>IDEA:</strong></span> if random walk starting from node $u$ visits $v$ with high probability, $u$ and $v$ are similar (high-order multi-hop information)</li>
</ul>
</li>
<li>Efficiency: Do not need to consider all node pairs when training;  <span style="background-color: rgba(242,179,188,0.5)"><strong>only need to consider pairs that co-occur on random walks</strong></span></li>
</ul>
</li>
</ul>
<ul>
<li>Un-supervised feature learning<ul>
<li>Intuition: Find embedding of nodes in $d$-dimensional space that preserves similarity</li>
<li>IDEA: Learn node embeddings such that nearby nodes are close together in the network</li>
<li>Given a node $u$, how do we define nearby nodes?<ul>
<li>$N_R(u)$ ... neighborhood of $u$ obtained by some random walk strategy $R$</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li>Feature learning as Optimization<ul>
<li>Given $G=(V, E)$</li>
<li>Our goal is to learn a mapping $f: u \rightarrow \R^d$, $f(u)=z_u$</li>
<li>Log-likelihood objective $\max_{f} \displaystyle\sum_{u\in V}\log P(N_R(u)|z_u)$<ul>
<li>$N_R(u)$ is the neighborhood of node $u$ by strategy $R$</li>
</ul>
</li>
</ul>
</li>
<li>Given node $u$, we want to learn feature representations that are predictive of the nodes in its random walk neighborhood $N_R(u)$</li>
</ul>
<ul>
<li><p><strong>Random walk optimization</strong></p>
<ol>
<li>Run short fixed-length random walks starting from each node $u$ in the graph using some random walk strategy $R$</li>
<li>For each node $u$ collect $N_R(u)$, the multiset of nodes visited on random walks starting from $u$<ul>
<li>$N_R(u)$ can have repeat elements since nodes can be visited multiple times on random walks</li>
</ul>
</li>
<li>Optimize embeddings according to: Given node $u$, predict its neighbors $N_R(u)$<ul>
<li>$\max_{f} \displaystyle\sum_{u\in V}\log P(N_R(u)|z_u)$   &rarr; maximum likelihood objective</li>
</ul>
</li>
<li>Equivalently,
$$
L = \displaystyle\sum_{u \in V}\sum_{v \in N_R(u)}-\log P(v|z_u)
$$<ul>
<li>Intuition: Optimize embeddings $z_u$ to maximize the likelihood of random walk co-occurrences</li>
<li>Parameterize $P(v|z_u)$ using softmax:
$$
P(v|z_u) = {\exp({z_u}^Tz_v) \over \sum_{n \in V} \exp({z_u}^Tz_n)}
$$
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/d02106be-be65-4391-ac09-1c44bc3b04f8/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/9fbefa3f-9cef-4c59-8c28-22601f74a4cf/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/c157e602-e991-468a-ae0d-d418c3b89d26/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/eb22b4f7-a326-4d2c-96b8-6d1015623f35/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/eac94efe-07a7-4662-b9b8-31839124dbe5/image.png" alt=""></li>
</ul>
</li>
</ol>
</li>
<li><p>Stochastic Gradient Descent</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/5816700e-259f-423f-a46a-7b6a837bb16a/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/9fce8cfd-cc95-46d8-8468-b8179d9ef299/image.png" alt=""></li>
</ul>
</li>
</ul>
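<p>The objective above, with the softmax parameterization of $P(v|z_u)$, can be written out directly (a sketch with random embeddings and hand-picked co-occurrence pairs, not a full training loop):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d = 6, 4
Z = rng.normal(scale=0.1, size=(num_nodes, d))  # embeddings z_u as rows

# Toy co-occurrence pairs (u, v): v appeared in N_R(u)
pairs = [(0, 1), (0, 2), (1, 0), (2, 3)]

def walk_loss(Z, pairs):
    """L = sum over (u, v in N_R(u)) of -log softmax_v(z_u . z_v)."""
    loss = 0.0
    for u, v in pairs:
        scores = Z @ Z[u]  # z_u . z_n for every n in V
        loss -= scores[v] - np.log(np.exp(scores).sum())
    return loss

loss = walk_loss(Z, pairs)
```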
<blockquote>
<p><strong>Random walks summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/fc942bab-c81d-43f7-9c6e-d8f871b32cfb/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>Node2Vec</strong></p>
</blockquote>
<ul>
<li>How should we randomly walk?<ul>
<li>So far, we have described how to optimize embeddings given a random walk strategy $R$</li>
<li>What strategies should we use to run these random walks?<ul>
<li>Simplest idea: Just run fixed-length, unbiased random walks starting from each node (<a href="https://arxiv.org/abs/1403.6652">DeepWalk</a>)<ul>
<li>The issue is that such notion of similarity is too constrained</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li>Overview of Node2Vec<ul>
<li>Goal: Embed nodes with similar network neighborhoods close in the feature space</li>
<li>We frame this goal as a maximum likelihood optimization problem, independent of the downstream prediction task</li>
<li>Key observation: Flexible notion of network neighborhood $N_R(u)$ of node $u$ leads to rich node embeddings</li>
<li>Develop a biased 2nd-order random walk $R$ to generate the network neighborhood $N_R(u)$ of node $u$</li>
</ul>
</li>
</ul>
<ul>
<li><strong><a href="https://cs.stanford.edu/~jure/pubs/node2vec-kdd16.pdf">Node2Vec : Biased walks</a></strong><ul>
<li>IDEA: use flexible, biased random walks that can trade off b/t local and global views of the network<img src="https://velog.velcdn.com/images/kimminsu-ds/post/e4639aa3-6e62-4c7f-9d18-de62b572c4fb/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/456688e9-41bc-430f-ac61-47d35f6318b4/image.png" alt=""></li>
</ul>
</li>
</ul>
<ul>
<li><p>Interpolating BFS and DFS</p>
<ul>
<li>Biased fixed-length random walk $R$ that given a node $u$ generates neighborhood $N_R(u)$<ul>
<li>Two parameters<ul>
<li><strong>Return parameter $p$</strong>: Return back to the previous node</li>
<li><strong>In-out parameter $q$</strong>: Moving outwards (DFS) vs. inwards (BFS)</li>
<li>Intuitively, $q$ is the &quot;ratio&quot; of BFS vs. DFS</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Biased random walks</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/34fba378-f8c7-414a-a794-f6fb91d428b2/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/64ec8eb0-ecdf-4545-983d-82f51ca07b79/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/4ae1b3ff-81e6-4c90-80f5-df5b4423472e/image.png" alt=""></li>
</ul>
</li>
</ul>
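<p>One step of the biased $p$/$q$ walk can be sketched as follows (illustrative; the unnormalized weights follow the usual 1/p, 1, 1/q scheme for returning, staying at the same distance, and moving outward):</p>

```python
import random

def biased_step(graph, prev, curr, p, q, rng):
    """Pick the next node of a node2vec-style 2nd-order walk."""
    candidates = graph[curr]
    weights = []
    for x in candidates:
        if x == prev:                 # return to the previous node
            weights.append(1.0 / p)
        elif x in graph[prev]:        # same distance from prev (BFS-like)
            weights.append(1.0)
        else:                         # move outward (DFS-like)
            weights.append(1.0 / q)
    return rng.choices(candidates, weights=weights, k=1)[0]

graph = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
nxt = biased_step(graph, prev=0, curr=1, p=1.0, q=0.5, rng=random.Random(0))
```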
<ul>
<li><p>Node2Vec algorithm</p>
<ol>
<li>Compute random walk probabilities</li>
<li>Simulate $r$ random walks of length $l$ starting from each node $u$</li>
<li>Optimize the Node2Vec objective using Stochastic Gradient Descent</li>
</ol>
<ul>
<li>Linear-time complexity</li>
<li>All 3 steps are individually parallelizable</li>
</ul>
</li>
<li><p>Other random walk ideas</p>
<ul>
<li><p>Different kinds of biased random walks</p>
<ul>
<li><a href="https://ericdongyx.github.io/papers/KDD17-dong-chawla-swami-metapath2vec.pdf">Based on node attributes</a></li>
<li><a href="https://arxiv.org/abs/1710.09599">Based on learned weights</a></li>
</ul>
</li>
<li><p>Alternative optimization schemes</p>
<ul>
<li><a href="https://arxiv.org/abs/1503.03578">Directly optimize based on 1-hop and 2-hop random walk probabilities</a></li>
</ul>
</li>
<li><p>Network pre-processing techniques</p>
<ul>
<li><a href="https://arxiv.org/abs/1706.07845">Run random walks on modified versions of the original network</a></li>
</ul>
</li>
</ul>
</li>
</ul>
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/7b39f08c-8373-4db7-a440-84f945ee4963/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/24753481-a0a6-402e-86a8-737242198650/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>3.3 Embedding entire graphs</strong></p>
</blockquote>
<ul>
<li>Graph embedding $z_G$: Want to embed a subgraph or an entire graph $G$</li>
<li>Tasks<ul>
<li>Classifying toxic vs. non-toxic molecules</li>
<li>Identifying anomalous graphs</li>
</ul>
</li>
</ul>
<ul>
<li><p>Approach 1</p>
<ul>
<li>Run a standard node embedding technique on the (sub) graph $G$</li>
<li>Then just sum (or average) the node embeddings in the (sub) graph $G$
$$
z_G = \sum_{v \in G} z_v
$$</li>
</ul>
</li>
<li><p>Approach 2</p>
<ul>
<li>Introduce a &quot;virtual node&quot; to represent the (sub) graph and run a standard node embedding technique
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/1b41e208-8b6a-437b-83c4-9e15270f978a/image.png" alt=""></li>
</ul>
</li>
</ul>
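<p>Approach 1 is a one-liner once node embeddings exist (toy numbers for illustration):</p>

```python
import numpy as np

# Node embeddings z_v for the three nodes of a (sub)graph G
node_embeddings = np.array([[1.0, 2.0],
                            [3.0, 4.0],
                            [5.0, 6.0]])

z_G_sum = node_embeddings.sum(axis=0)    # z_G = sum_{v in G} z_v
z_G_mean = node_embeddings.mean(axis=0)  # averaged variant
```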
<ul>
<li><p>Approach 3 : Anonymous walk embeddings</p>
<ul>
<li>States in anonymous walks correspond to the index of the first time we visited the node in a random walk
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/6b59c978-04bf-4901-b352-9d550fe7e9c0/image.png" alt="">
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/ff125407-287e-4b1c-8947-3350eaf2654c/image.png" alt="">
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/2db28b89-54ec-44af-afb0-e6c99395a927/image.png" alt="">
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/5810eca0-97ad-4724-86ca-dec1f7f0792b/image.png" alt="">
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/cd39b277-458b-4bf0-bdca-406b8a7d5b5b/image.png" alt=""></li>
</ul>
</li>
<li><p>New idea: Learn walk embeddings</p>
<ul>
<li>Rather than simply represent each walk by the fraction of times it occurs, we learn embedding $z_i$ of anonymous walk $w_i$</li>
<li>Learn a graph embedding $Z_G$ together with all the anonymous walk embeddings $z_i$: $Z=\{z_i : i=1, \dots, \eta\}$, where $\eta$ is the number of sampled anonymous walks</li>
<li>How to embed walks? Embed walks s.t. the next walk can be predicted<img src="https://velog.velcdn.com/images/kimminsu-ds/post/a9114476-3ca9-4ee6-b750-90185ca85cf9/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/f84502cf-ea73-4c8f-ba83-b15b0d49a123/image.png" alt=""></li>
</ul>
</li>
</ul>
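<p>Mapping a walk to its anonymous version, where each node is replaced by the index of its first visit, is straightforward (a small sketch; the 1-based state indexing is an assumption):</p>

```python
def anonymize(walk):
    """Replace each node by the index of its first appearance in the walk."""
    first_seen = {}
    states = []
    for node in walk:
        if node not in first_seen:
            first_seen[node] = len(first_seen) + 1  # 1-based state index
        states.append(first_seen[node])
    return states

# Two different walks with the same visit structure map to the same state
a = anonymize(['A', 'B', 'C', 'B', 'C'])  # [1, 2, 3, 2, 3]
b = anonymize(['C', 'D', 'B', 'D', 'B'])  # [1, 2, 3, 2, 3]
```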
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/447ebd51-d643-4287-add0-b7f5825693f5/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/3110e252-53e8-44c9-8ca3-823a71772a23/image.png" alt=""></p>
</blockquote>
<hr>
<ul>
<li>How to use embeddings<ul>
<li>Clustering/community detection: cluster points $z_i$</li>
<li>Node classification: Predict label of node $i$ based on $z_i$</li>
<li>Link prediction: Predict edge $(i, j)$ based on $(z_i, z_j)$<ul>
<li>Where we can: concatenate, avg, product, or take a difference b/t the embeddings</li>
</ul>
</li>
<li>Graph classification: graph embedding $z_G$ via aggregating node embeddings or anonymous random walks &rarr; Predict label based on graph embedding $z_G$</li>
</ul>
</li>
</ul>
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/7488474a-e7b3-432d-80e9-007b9e5c7660/image.png" alt=""></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[Traditional Methods for ML in Graphs]]></title>
            <link>https://velog.io/@kimminsu-ds/Traditional-Methods-for-ML-in-Graphs</link>
            <guid>https://velog.io/@kimminsu-ds/Traditional-Methods-for-ML-in-Graphs</guid>
            <pubDate>Sun, 10 Sep 2023 02:44:02 GMT</pubDate>
            <description><![CDATA[<blockquote>
<p><strong>Traditional methods for machine learning in graphs</strong></p>
</blockquote>
<ul>
<li><p>The goal of lecture: <span style="background-color: rgba(242,179,188,0.5)"> <strong>Creating additional features that will describe how this particular node is positioned in the rest of the network and what is its local network structure</strong> </span>  </p>
<ul>
<li>These additional features that describe the topology of the network of the graph will allow us to make more accurate predictions</li>
<li><span style="background-color: rgba(242,179,188,0.5)"> <strong>Using effective features over graphs is the key to achieving good test performance</strong> </span></li>
</ul>
</li>
<li><p>ML in Graphs</p>
<ul>
<li><p>Given
$$ G = (V, E)
$$</p>
</li>
<li><p>Learn a function
$$ 
  f: V \rightarrow \R
$$ </p>
</li>
<li><p>The way we can think of this is that we are given a graph as a set of vertices and a set of edges, and we want to learn a function</p>
</li>
</ul>
</li>
</ul>
<hr>
<blockquote>
<p><strong>2.1 Node-level tasks and features</strong></p>
</blockquote>
<ul>
<li><p><strong>Node classification</strong>: We are given a network and a few nodes labeled with different colors, and the goal is to predict the colors of the uncolored nodes <img src="https://velog.velcdn.com/images/kimminsu-ds/post/9043efd6-3a68-46bc-b03e-b08447c2e96f/image.png" alt=""></p>
</li>
<li><p>Goal: Characterize the structure and position of a node in the network</p>
<ul>
<li>Node degree</li>
<li>Node centrality</li>
<li>Clustering coefficient</li>
<li>Graphlets</li>
</ul>
</li>
</ul>
<hr>
<p><strong>1. Node degree</strong></p>
<ul>
<li>The degree $k_v$ of node $v$ is the number of edges (neighboring nodes) the node has</li>
<li>Treats all neighboring nodes equally
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/a3c1d4dd-6c40-4ced-ac79-261f2d336c23/image.png" alt=""></li>
</ul>
<hr>
<p><strong>2. Node centrality</strong></p>
<ul>
<li><p>Node degree counts the neighboring nodes without capturing their importance</p>
</li>
<li><p>Node centrality $c_v$ takes the node importance in a graph into account</p>
</li>
<li><p>Different ways to model importance</p>
<ul>
<li>Eigen-vector centrality</li>
<li>Betweenness centrality</li>
<li>Closeness centrality</li>
<li>and many others</li>
</ul>
</li>
<li><p>Eigen-vector centrality: A node $v$ is important if surrounded by important neighboring nodes $u \in N(v)$
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/77ec0b6d-e6d1-450e-ad19-0da14589f1ad/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/18305d6e-fbe2-4c1a-bd41-15acad25666c/image.png" alt=""></p>
</li>
<li><p>Betweenness centrality: A node $v$ is important if it lies on many shortest paths b/t other nodes <img src="https://velog.velcdn.com/images/kimminsu-ds/post/bdc99662-a5fc-48ad-b45c-23f231650836/image.png" alt=""></p>
</li>
<li><p>Closeness centrality: A node $v$ is important if it has small shortest path lengths to all other nodes
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/16211dbf-826d-48ad-b6de-753cb1b9e5db/image.png" alt=""></p>
</li>
</ul>
<hr>
<p><strong>3. Clustering coefficient</strong></p>
<ul>
<li>Measures how connected $v$&#39;s neighboring nodes are
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/36abc195-8586-4bd3-9555-2de57f51d5b8/image.png" alt=""></li>
</ul>
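<p>Computed directly from an adjacency structure, the clustering coefficient is the fraction of pairs of $v$&#39;s neighbors that are themselves connected (a minimal sketch on a toy graph):</p>

```python
from itertools import combinations

# Undirected toy graph: neighbor sets
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}

def clustering_coefficient(graph, v):
    """e_v / C(k_v, 2): edges among v's neighbors over possible pairs."""
    k = len(graph[v])
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(graph[v], 2) if b in graph[a])
    return links / (k * (k - 1) / 2)

c0 = clustering_coefficient(graph, 0)  # neighbors 1,2,3; only (1,2) linked
```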
<hr>
<p><strong>4. Graphlets</strong></p>
<ul>
<li><p>Observation: Clustering coefficient counts the #(triangles) in the ego-network</p>
<ul>
<li>We can generalize this by counting #(pre-specified subgraphs, i.e. graphlets)
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/f11c909c-a92c-4930-998c-e8db25ea947b/image.png" alt=""> </li>
<li>Graphlets: Rooted connected non-isomorphic subgraphs 
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/48ec543c-141f-4607-8a44-a3f06eff0df5/image.png" alt=""></li>
<li>GDV (Graphlet Degree Vector): Graphlet-based features for nodes<ul>
<li>Degree counts #(edges) that a node touches</li>
<li>Clustering coefficient counts #(triangles) that a node touches </li>
<li><strong>GDV counts #(graphlets) that a node touches</strong></li>
</ul>
</li>
</ul>
</li>
<li><p>GDV(Graphlet Degree Vector): A count vector of graphlets rooted at a given node <img src="https://velog.velcdn.com/images/kimminsu-ds/post/8f009bf1-d647-4baa-ab60-c3c831f565a4/image.png" alt=""></p>
<ul>
<li>Considering graphlets on 2 to 5 nodes we get:<ul>
<li>Vector of 73 coordinates is a signature of a node that describes the topology of node&#39;s neighborhood</li>
<li>Captures its inter-connectivities out to a distance of 4 hops</li>
</ul>
</li>
<li>Graphlet degree vector provides a measure of a node&#39;s local network topology:<ul>
<li>Comparing vectors of two nodes provides a more detailed measure of local topological similarity than node degrees or clustering coefficient</li>
</ul>
</li>
</ul>
</li>
</ul>
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/cd388b67-388c-48c3-9462-2a00fb5cdc6a/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/1358ddef-de20-4fc3-b363-f38850f48499/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/ff925c76-d9b6-4c0e-9f9a-a1e79c35708a/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/70154da3-6c80-4c6e-b5df-79f2de52c24c/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>2.2 Link prediction task and features</strong></p>
</blockquote>
<ul>
<li><p><strong>The link prediction task</strong> is to predict new links based on existing links</p>
</li>
<li><p>At test time, all node pairs (with no existing links) are ranked, and the top $k$ node pairs are predicted</p>
</li>
<li><p>The key is to <span style="background-color: rgba(242,179,188,0.5)"> <strong>design features for a pair of nodes</strong> </span></p>
</li>
<li><p>Two formulations of the link prediction task</p>
<ul>
<li>Links missing at random<ul>
<li>Remove a random set of links and then aim to predict them</li>
<li>More useful for static networks like protein-protein interaction networks</li>
</ul>
</li>
<li>Links over time<ul>
<li>Given $G[t_0, t&#39;_0]$, a graph on edges up to time $t&#39;_0$, output a ranked list $L$ of links (not in $G[t_0, t&#39;_0]$) that are predicted to appear in $G[t_1, t&#39;_1]$</li>
<li>Evaluation<ul>
<li>$n=|E_{NEW}|$ : the number of new edges that appear during the test period $[t_1, t&#39;_1]$</li>
<li>Take the top $n$ elements of $L$ and count the correct edges</li>
<li>Natural for networks that evolve over time, such as transaction networks and social networks</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Methodology</p>
<ul>
<li>For each pair of nodes $(x, y)$ compute score $c(x, y)$</li>
<li>For example, $c(x, y)$ could be the # of common neighbors of $x$ and $y$</li>
<li>Sort pairs $(x, y)$ by the decreasing score $c(x, y)$</li>
<li>Predict top $n$ pairs as new links</li>
<li>See which of these links actually appear in $G[t_1, t&#39;_1]$</li>
</ul>
</li>
<li><p>Link-level features:</p>
<ul>
<li>How to featurize or create a descriptor of the relationship b/t two nodes in the network  </li>
<li>Distance-based feature</li>
<li>Local neighborhood overlap</li>
<li>Global neighborhood overlap</li>
</ul>
</li>
</ul>
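<p>The methodology above (score each non-adjacent pair, sort by decreasing score, keep the top $n$) can be sketched with common neighbors as the score $c(x, y)$:</p>

```python
# Toy graph as neighbor sets; pairs (0,3) and (2,3) share neighbor 1
graph = {0: {1, 2}, 1: {0, 2, 3}, 2: {0, 1}, 3: {1}, 4: {5}, 5: {4}}

def score(x, y):
    """c(x, y) = number of common neighbors of x and y."""
    return len(graph[x] & graph[y])

# Rank all currently unlinked pairs by decreasing score
pairs = [(x, y) for x in graph for y in graph if x < y and y not in graph[x]]
ranked = sorted(pairs, key=lambda xy: score(*xy), reverse=True)
top_n = ranked[:2]  # predict the n highest-scoring pairs as new links
```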
<hr>
<p><strong>1. Distance-based features</strong></p>
<ul>
<li>Shortest path distance b/t two nodes</li>
<li>However, this does not capture the degree of neighborhood overlap (= strength of connection)
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/efeaba82-7f6c-4a94-acab-03625ea99246/image.png" alt=""></li>
</ul>
<hr>
<p><strong>2. Local neighborhood overlap</strong></p>
<ul>
<li>One way to capture the strength of connection b/t two nodes is to ask how many neighbors they have in common</li>
<li>What is the number of common friends b/t a pair of nodes</li>
<li>Captures # neighboring nodes shared b/t two nodes $v_1$ and $v_2$
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/a6fabfcd-78bd-42bf-abff-f2709281b6d5/image.png" alt=""></li>
<li>Limitation: Is always zero if the two nodes do not have any neighbors in common (However, the two nodes may still potentially be connected in the future)</li>
</ul>
<hr>
<p><strong>3. Global neighborhood overlap</strong></p>
<ul>
<li>Resolve the limitation of local neighborhood overlap by considering the entire graph</li>
<li>Katz index: count the number of paths of all lengths b/t a given pair of nodes<ul>
<li>How to compute # paths b/t two nodes?</li>
<li>Use adjacency matrix
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/58307d34-5c5b-4e57-82c4-edbee5aa1680/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/88edaed0-a3e9-40c8-a22b-fa1957205dac/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/1f792ff2-4835-49e9-8e4f-7dfbdc15c7f8/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/9d49d0d1-2210-45ab-b0ee-fb850db52b36/image.png" alt=""></li>
</ul>
</li>
</ul>
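<p>Because $A^l_{uv}$ counts paths of length $l$ between $u$ and $v$, the Katz index has the closed form $S = \sum_{l \ge 1} \beta^l A^l = (I - \beta A)^{-1} - I$, which a few lines of NumPy verify against the truncated series (a sketch; $\beta$ must be below the reciprocal of the largest eigenvalue for convergence):</p>

```python
import numpy as np

A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 0],
              [0, 1, 0, 0]], dtype=float)
beta = 0.1  # discount factor; beta * lambda_max < 1 here

# Closed form: S = (I - beta*A)^{-1} - I
I = np.eye(len(A))
S = np.linalg.inv(I - beta * A) - I

# Truncated power series sum_{l=1}^{49} beta^l A^l for comparison
S_series = sum(beta**l * np.linalg.matrix_power(A, l) for l in range(1, 50))
```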
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/d424d8ae-2f13-467d-84f1-96d242c35466/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>2.3 Graph-level features and Graph kernels</strong></p>
</blockquote>
<ul>
<li><p><span style="background-color: rgba(242,179,188,0.5)"> <strong>Features that characterize the structure of an entire graph</strong> </span></p>
</li>
<li><p><strong>Kernel methods</strong> are widely-used for traditional ML for graph-level prediction</p>
<ul>
<li>IDEA: Design kernels instead of feature vectors</li>
<li>A quick introduction to Kernels<ul>
<li>Kernel $K(G, G&#39;) \in \R$ measures similarity b/t data</li>
<li>Kernel matrix $K=(K(G, G&#39;))_{G, G&#39;}$ must always be positive semi-definite (i.e., have non-negative eigenvalues)</li>
<li>There exists a feature representation $\phi(\cdot)$ such that $K(G, G&#39;)=\phi(G)^T\phi(G&#39;)$<ul>
<li>The value of the kernel is simply a dot product of the vector representations of the two graphs</li>
<li>Once the kernel is defined, an off-the-shelf ML model such as a kernel SVM can be used to make predictions</li>
</ul>
</li>
</ul>
</li>
</ul>
</li>
<li><p><strong>Graph Kernels</strong>: Measure similarity b/t two graphs</p>
<ul>
<li>Goal: Design graph feature vector $\phi(G)$</li>
<li>Key IDEA: <span style="background-color: rgba(242,179,188,0.5)">  <strong>BoW(Bag-of-Words) for a graph</strong> </span><ul>
<li>BoW simply uses the word counts as features for documents (no ordering considered)</li>
</ul>
</li>
</ul>
<p>[1] Graphlet kernel
[2] WL(Weisfeiler-Lehman) kernel
[3] Other kernels are also proposed in the literature</p>
<ul>
<li>Random-walk kernel</li>
<li>Shortest-path graph kernel</li>
<li>And many more</li>
</ul>
</li>
<li><p>Both the Graphlet kernel and the WL kernel use a Bag-of-* representation of the graph, where * is more sophisticated than node degrees <img src="https://velog.velcdn.com/images/kimminsu-ds/post/be496c68-01e5-48ef-8d09-13da5ee04c4e/image.png" alt=""></p>
</li>
</ul>
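<p>The kernel-as-dot-product idea above can be sketched directly: given Bag-of-* feature vectors $\phi(G)$, the kernel matrix $K_{ij} = \phi(G_i)^T\phi(G_j)$ is symmetric and positive semi-definite by construction (the feature vectors below are made-up counts, not output of a real graph kernel):</p>

```python
import numpy as np

# Hypothetical Bag-of-* feature vectors phi(G) for three toy graphs,
# e.g. counts of some structural pattern (made up for illustration).
Phi = np.array([
    [2.0, 1.0, 0.0],
    [1.0, 3.0, 1.0],
    [0.0, 1.0, 2.0],
])

# Kernel matrix: K[i, j] = phi(G_i) . phi(G_j)
K = Phi @ Phi.T

# A Gram matrix of real vectors is symmetric positive semi-definite:
eigvals = np.linalg.eigvalsh(K)
print(np.all(eigvals >= -1e-9))  # all eigenvalues are non-negative
```

<p>Such a matrix $K$ can be fed directly to a kernel SVM without ever materializing $\phi(G)$ explicitly, which is the point of the kernel trick.</p>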
<hr>
<p><strong>1. Graphlet features</strong></p>
<ul>
<li><p>Count the number of different graphlets in a graph</p>
</li>
<li><p>The definition of graphlets here is slightly different from node-level features</p>
</li>
<li><p>The two differences are</p>
<ul>
<li>Nodes in graphlets here do not need to be connected (allows for isolated nodes)</li>
<li>The graphlets here are not rooted</li>
</ul>
<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/9e39b3fe-d806-4f95-bb38-1be082e7ee49/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/468464a8-d1a3-47fe-8347-e008e7aa1d03/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/a4642f06-0295-45ee-b6b9-25e042c8ac12/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/26613dae-0ced-4f5c-bd11-21037876a158/image.png" alt=""></p>
</li>
<li><p>Limitation: Counting graphlets is expensive</p>
</li>
<li><p>Counting size-$k$ graphlets for a graph with $n$ nodes by enumeration takes $O(n^k)$ time</p>
</li>
<li><p>This is unavoidable in the worst case, since the subgraph isomorphism test (judging whether a graph is a subgraph of another graph) is NP-hard</p>
</li>
<li><p>If a graph&#39;s node degree is bounded by $d$, an $O(nd^{k-1})$ algorithm exists to count all the graphlets of size $k$</p>
</li>
</ul>
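<p>To make the $O(n^k)$ cost concrete, here is brute-force enumeration for the simplest connected size-3 graphlet, the triangle; iterating over all node triples is exactly the $n^k$ blow-up described above (the toy edge set is made up):</p>

```python
from itertools import combinations

# Toy undirected graph as an edge set (hypothetical example).
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")}
nodes = {u for e in edges for u in e}

def connected(u, v):
    return (u, v) in edges or (v, u) in edges

def count_triangles():
    """Enumerate all C(n, 3) node triples: O(n^3) work for size-3 graphlets."""
    return sum(
        1
        for a, b, c in combinations(sorted(nodes), 3)
        if connected(a, b) and connected(b, c) and connected(a, c)
    )

print(count_triangles())  # A-B-C is the only triangle -> 1
```

<p>A full graphlet kernel would count every size-$k$ pattern this way, which is why the $O(nd^{k-1})$ bounded-degree algorithm mentioned above matters in practice.</p>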
<hr>
<p><strong>2. WL(Weisfeiler-Lehman) Kernel</strong></p>
<ul>
<li><p>Goal: Design an efficient graph feature descriptor of $\phi(G)$</p>
</li>
<li><p>IDEA: <span style="background-color: rgba(242,179,188,0.5)"> <strong>Use neighborhood structure to iteratively enrich node vocabulary</strong> </span></p>
<ul>
<li>Generalized version of Bag-of-node degrees since node degrees are one-hop neighborhood information</li>
<li>Algorithm to achieve this: the WL graph isomorphism test (= color refinement)</li>
</ul>
</li>
<li><p>Color refinement
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/3bd36796-d4cd-48f2-9e32-164f51af0aaa/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/8106accd-f6db-4d6a-8907-981cceed5623/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/e01b3be7-9a2b-4c7a-bcb4-1bd2b282f12e/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/d12ce38e-e783-46df-b4d7-820a17a44dda/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/f5f45aa9-4f38-47d5-b571-00f0f1a70064/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/ac6c8e30-7529-47fe-9968-597444f6e1f6/image.png" alt=""><img src="https://velog.velcdn.com/images/kimminsu-ds/post/fed61e52-f044-4b57-8a78-ed52c6a39038/image.png" alt=""></p>
</li>
<li><p>The WL kernel is a popular and strong graph feature descriptor that gives strong performance and is computationally efficient</p>
<ul>
<li>The time complexity of color refinement at each step is linear in #(edges), since it involves aggregating neighboring colors</li>
<li>When computing a kernel value, only the colors that appear in the two graphs need to be tracked<ul>
<li>Thus, #(colors) is at most the total number of nodes</li>
<li>Counting colors takes linear time w.r.t. #(nodes)</li>
<li>In total, time complexity is linear in #(edges)</li>
</ul>
</li>
</ul>
</li>
</ul>
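<p>One round of color refinement can be sketched as: hash each node's current color together with the sorted multiset of its neighbors' colors into a fresh color id. The WL feature vector $\phi(G)$ is then the bag of colors seen over all rounds (the toy graph and the number of rounds are arbitrary choices for illustration):</p>

```python
from collections import Counter

# Toy undirected graph as adjacency lists (hypothetical example).
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}

def wl_colors(adj, rounds=2):
    """Iteratively refine node colors; returns the bag of all colors seen."""
    color = {v: 1 for v in adj}  # initial uniform coloring
    bag = Counter(color.values())
    palette = {}                 # maps (color, neighbor colors) -> fresh color id
    for _ in range(rounds):
        new = {}
        for v in adj:
            signature = (color[v], tuple(sorted(color[u] for u in adj[v])))
            if signature not in palette:
                palette[signature] = len(palette) + 2
            new[v] = palette[signature]
        color = new
        bag.update(color.values())
    return bag

print(wl_colors(adj))
# After one round, nodes are already split by degree; the resulting bag of
# colors is phi(G), and K(G, G') is its dot product with phi(G').
```

<p>Each round touches every edge once, which is where the linear-in-#(edges) complexity above comes from.</p>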
<blockquote>
<p><strong>Summary</strong>
<img src="https://velog.velcdn.com/images/kimminsu-ds/post/5eed3c3e-d99a-4bd1-82af-b4d718353a2f/image.png" alt=""></p>
</blockquote>
<hr>
<blockquote>
<p><strong>Lecture Summary</strong> <img src="https://velog.velcdn.com/images/kimminsu-ds/post/21ab0fbc-0676-4533-b2a9-d40caefe2ce8/image.png" alt=""></p>
</blockquote>
]]></description>
        </item>
        <item>
            <title><![CDATA[Introduction To DGL]]></title>
            <link>https://velog.io/@kimminsu-ds/DGL-Introduction-To-DGL</link>
            <guid>https://velog.io/@kimminsu-ds/DGL-Introduction-To-DGL</guid>
            <pubDate>Mon, 28 Aug 2023 05:24:14 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/d2c795d4-350f-4bd3-b5db-81bdc9a64321/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[ML with Graphs]]></title>
            <link>https://velog.io/@kimminsu-ds/CS224W-ML-for-Graphs</link>
            <guid>https://velog.io/@kimminsu-ds/CS224W-ML-for-Graphs</guid>
            <pubDate>Mon, 28 Aug 2023 02:48:53 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/kimminsu-ds/post/f4bd37b9-0658-492d-a307-00f87f0413c9/image.png" alt=""></p>
<hr>
<blockquote>
<p><strong>1.1 Why Graphs?</strong></p>
</blockquote>
<ul>
<li><p>Graphs are a general language for describing and analyzing entities with relations/interactions</p>
</li>
<li><p>Main Question: <span style="background-color: rgba(242,179,188,0.5)"> <strong>How do we take advantage of relational structure for better prediction?</strong> </span></p>
<ul>
<li>Complex domains have a rich relational structure, which can be represented as a relational graph</li>
<li>By explicitly modeling relationships we achieve better performance</li>
</ul>
</li>
<li><p>The modern deep learning toolbox is designed for simple sequences (text) &amp; grids (images)</p>
<ul>
<li>How can we develop neural networks that are much more broadly applicable?
&rarr; <span style="background-color: rgba(242,179,188,0.5)"> <strong>Graphs are the new frontier of deep learning</strong></span>
</li>
</ul>
</li>
</ul>
<hr>
<blockquote>
<p><strong>1.2 Applications of GraphML</strong></p>
</blockquote>
<p><a href="https://paperswithcode.com/area/graphs">PapersWithCode - Graphs</a></p>
<ul>
<li><p>In Graph Machine Learning, we can formulate different types of tasks:</p>
<ul>
<li>At the level of <strong>individual nodes</strong><ul>
<li>The Protein Folding problem: Computationally predict a protein&#39;s 3D structure based solely on its amino acid sequence </li>
<li><a href="https://www.deepmind.com/research/highlighted-research/alphafold">DeepMind&#39;s AlphaFold</a></li>
</ul>
</li>
<li>At the level of <strong>edges which is pairs of nodes</strong><ul>
<li><a href="https://arxiv.org/pdf/2011.02260.pdf">RecSys - Graph Neural Networks in Recommender System: A survey (2022)</a> </li>
</ul>
</li>
<li>At the level of <strong>sub-graphs of nodes</strong><ul>
<li><a href="https://arxiv.org/pdf/2101.11174.pdf">Traffic Prediction - Graph Neural Network for Traffic Forecasting: A Survey (2022)</a></li>
</ul>
</li>
<li>At the level of <strong>entire graphs</strong><ul>
<li><a href="https://www.sciencedirect.com/science/article/pii/S0092867420301021">Drug Discovery - A Deep Learning Approach to Antibiotic Discovery (2020)</a></li>
</ul>
</li>
</ul>
</li>
<li><p>Classic GraphML Tasks</p>
<ul>
<li><a href="https://paperswithcode.com/task/node-classification">Node classification - Predict a property of a node</a></li>
<li><a href="https://paperswithcode.com/task/link-prediction">Link Prediction - Predict whether there are missing links b/t two nodes</a></li>
<li><a href="https://paperswithcode.com/task/graph-classification">Graph classification - Categorize different graphs</a></li>
<li><a href="https://paperswithcode.com/task/graph-clustering">Graph Clustering - Detect if nodes form a community</a></li>
<li><a href="https://paperswithcode.com/task/graph-generation">Graph generation</a></li>
</ul>
</li>
</ul>
<hr>
<blockquote>
<p><strong>1.3 Choice of Graph Representation</strong></p>
</blockquote>
<ul>
<li><p>How do you define a graph? (= How to build a graph?)</p>
<ul>
<li>What are nodes?</li>
<li>What are edges?</li>
<li><span style="background-color: rgba(242,179,188,0.5)"> <strong>Choice of the proper network representation of a given domain/problem determines our ability to use networks successfully</strong> </span></li>
</ul>
</li>
<li><p><span style="background-color: rgba(239,239,240,1)"> <strong>Directed Graphs</strong> </span> Vs. <span style="background-color: rgba(239,239,240,1)"> <strong>Un-directed Graphs</strong> </span></p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/c760b17e-b19f-4d7d-ac3f-7591f525a7b6/image.png" alt=""> </li>
</ul>
</li>
<li><p>Node degrees</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/fe535207-8d61-491f-abd9-d043f3d16330/image.png" alt=""></li>
</ul>
</li>
<li><p>Bi-partite graph</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/52e0a926-8486-41a2-b56d-460b9c0c9fa1/image.png" alt=""></li>
</ul>
</li>
<li><p>Representing graphs: Adjacency matrix</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/95972a52-e7df-4bc5-b980-1edc6b30d7be/image.png" alt=""></li>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/639ca47b-0ea7-486f-bd5a-91dcaf10bd4a/image.png" alt=""></li>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/cd356fe8-1eb1-4329-9224-1bf1f458268a/image.png" alt=""></li>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/e254a20f-d333-4dea-839c-4af744d9bfda/image.png" alt=""></li>
</ul>
</li>
<li><p>Representing graphs: Edge list</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/7875aac7-080a-4042-9b47-3f5f30637a50/image.png" alt=""></li>
<li>This representation is quite popular in deep learning frameworks because the graph can be stored as a simple two-dimensional matrix</li>
<li>The drawback of this representation is that graph manipulation and analysis become hard: even computing the degree of a given node is non-trivial</li>
</ul>
</li>
<li><p>Representing graphs: Adjacency list</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/b4ae908f-b531-43f6-823b-c453de14159c/image.png" alt=""></li>
</ul>
</li>
<li><p>Node and Edge Attributes (Possible options)</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/c350d1a6-0742-43c7-9b77-af5a147efcdf/image.png" alt=""></li>
</ul>
</li>
<li><p>More Types of Graphs: <span style="background-color: rgba(239,239,240,1)"> <strong>Weighted Graphs</strong> </span> Vs. <span style="background-color: rgba(239,239,240,1)"> <strong>Un-Weighted Graphs</strong> </span></p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/7057d49d-3a4d-4b6f-9f21-7c9c55591312/image.png" alt=""></li>
</ul>
</li>
<li><p>More Types of Graphs: <span style="background-color: rgba(239,239,240,1)"> <strong>Self-edges (self-loops)</strong> </span> / <span style="background-color: rgba(239,239,240,1)"> <strong>Multi Graph</strong> </span>  </p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/fbd37fba-38cb-47a8-af5a-621122de98eb/image.png" alt=""></li>
</ul>
</li>
<li><p>Connectivity of Un-directed graphs</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/04177445-52d8-4404-abd9-593cb2f4fa66/image.png" alt=""></li>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/3bfaca67-38ed-4f4e-9294-6315f0801f40/image.png" alt=""></li>
</ul>
</li>
<li><p>Connectivity of Directed graphs</p>
<ul>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/d6b7eb12-ce31-4b6d-9b5c-e91c7536d6d9/image.png" alt=""></li>
<li><img src="https://velog.velcdn.com/images/kimminsu-ds/post/8d10fc36-0be1-450b-990a-b0c6d7a5454f/image.png" alt=""></li>
</ul>
</li>
</ul>
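<p>The three representations above can be sketched side by side; note how the degree of a node is a row sum in the adjacency matrix, a direct lookup in the adjacency list, but a full scan over the edge list (the small 4-node graph is an arbitrary example):</p>

```python
import numpy as np

# A small undirected graph on nodes 0..3 (hypothetical example).
edge_list = [(0, 1), (1, 2), (2, 3), (0, 3)]
n = 4

# Adjacency matrix: O(n^2) memory; degree = row sum.
A = np.zeros((n, n), dtype=int)
for u, v in edge_list:
    A[u, v] = A[v, u] = 1  # undirected -> symmetric matrix

# Adjacency list: memory proportional to #(edges); degree = len(neighbors).
adj_list = {v: [] for v in range(n)}
for u, v in edge_list:
    adj_list[u].append(v)
    adj_list[v].append(u)

def degree_from_edge_list(node):
    """Degree via the edge list requires scanning every edge."""
    return sum(node in edge for edge in edge_list)

# All three representations agree on the degree of node 0:
print(A[0].sum(), len(adj_list[0]), degree_from_edge_list(0))  # 2 2 2
```

<p>This is why the edge list is convenient as a dense tensor for deep learning frameworks, while the adjacency list or matrix is preferable for graph analysis.</p>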
]]></description>
        </item>
    </channel>
</rss>