<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>jaeheon-lee.log</title>
        <link>https://velog.io/</link>
        <description>https://jaeheon-lee486.github.io/</description>
        <lastBuildDate>Tue, 06 Jan 2026 12:32:52 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>jaeheon-lee.log</title>
            <url>https://velog.velcdn.com/images/jaeheon-lee/profile/e7d7d879-a47d-41eb-b658-daec32846539/image.jpeg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. jaeheon-lee.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/jaeheon-lee" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Study] Koopman Operator]]></title>
            <link>https://velog.io/@jaeheon-lee/Study-Koopman-Operator</link>
            <guid>https://velog.io/@jaeheon-lee/Study-Koopman-Operator</guid>
            <pubDate>Tue, 06 Jan 2026 12:32:52 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/549574b3-f669-4fae-b371-fd7b2252972f/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5bf4a86d-1737-4216-a254-0f0b1212bac3/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Study] Kramers escape rate]]></title>
            <link>https://velog.io/@jaeheon-lee/Study-Kramers-escape-rate</link>
            <guid>https://velog.io/@jaeheon-lee/Study-Kramers-escape-rate</guid>
            <pubDate>Mon, 11 Aug 2025 16:51:07 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a635b069-2237-47bb-b994-afd979bfd774/image.PNG" alt="">
<a href="https://home.icts.res.in/~abhi/notes/kram.pdf">https://home.icts.res.in/~abhi/notes/kram.pdf</a> </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Nonlinear Dynamics and Chaos] Poincaré Maps]]></title>
            <link>https://velog.io/@jaeheon-lee/Nonlinear-Dynamics-and-Chaos-Poincar-Maps</link>
            <guid>https://velog.io/@jaeheon-lee/Nonlinear-Dynamics-and-Chaos-Poincar-Maps</guid>
            <pubDate>Thu, 07 Aug 2025 07:50:37 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3d876577-8ac8-44c3-9342-e6d96ad4483c/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Implicit encoding of prior probabilities in optimal neural populations]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Implicit-encoding-of-prior-probabilities-in-optimal-neural-populations</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Implicit-encoding-of-prior-probabilities-in-optimal-neural-populations</guid>
            <pubDate>Fri, 27 Jun 2025 06:34:03 GMT</pubDate>
            <description><![CDATA[<h1 id="implicit-encoding-of-prior-probabilities-in-optimal-neural-populations">Implicit encoding of prior probabilities in optimal neural populations</h1>
<p>Using Poisson spiking and Fisher information, this paper shows that warping adjusts neuron density and gain to prioritize high-probability stimuli.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/27c3cc93-449d-4ab3-acab-3aafb4df2709/image.png" alt=""></p>
<p>A rough summary of the paper&#39;s core idea, focusing on &#39;warping&#39;:</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/08f3496b-f4b4-4adb-a61b-b3a45d2195be/image.png" alt=""></p>
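<p>As a toy illustration of the warping idea (my own sketch, not the paper&#39;s code): placing tuning-curve centers by inverting the prior CDF makes neuron density track the prior, so high-probability stimuli are tiled more densely. The Gaussian prior and the grid here are assumptions.</p>

```python
import numpy as np

def warped_tuning_centers(prior_pdf, n_neurons=20, grid=None):
    """Place tuning-curve centers by inverting the prior CDF, so that
    high-probability stimuli get more neurons (denser tiling)."""
    if grid is None:
        grid = np.linspace(-3, 3, 1001)
    pdf = prior_pdf(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    # uniform quantiles in [0, 1] mapped back through the inverse CDF
    quantiles = (np.arange(n_neurons) + 0.5) / n_neurons
    return np.interp(quantiles, cdf, grid)

# Gaussian prior: centers should cluster near the mode at 0
gauss = lambda s: np.exp(-s**2 / 2)
centers = warped_tuning_centers(gauss, n_neurons=11)

inner = np.diff(centers)[5]   # spacing near the mode
outer = np.diff(centers)[0]   # spacing in the tail
print(inner < outer)  # True: denser tiling where the prior is high
```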
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Hippocampal Engram Formation and Memory Precision]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Hippocampal-Engram-Formation-and-Memory-Precision</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Hippocampal-Engram-Formation-and-Memory-Precision</guid>
            <pubDate>Thu, 26 Jun 2025 10:23:18 GMT</pubDate>
            <description><![CDATA[<h1 id="hippocampal-engram-formation-and-memory-precision">Hippocampal Engram Formation and Memory Precision</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/8048ff12-1550-4e5d-bef4-1aee47455321/image.png" alt=""></p>
<p>It was not an open-access paper, so I removed the figure.</p>
<h2 id="summary">Summary</h2>
<p>This study investigates the neurobiological mechanisms underlying the development of precise episodic-like memory in mice, focusing on the role of the hippocampal CA1 region during early postnatal development. The research addresses why early memories in juvenile mice (postnatal day 16 to 20, P16–P20) are imprecise and how memory precision emerges by the fourth postnatal week (around P24). Using a combination of behavioral tasks (contextual fear conditioning and spatial foraging), chemogenetic and optogenetic manipulations, and histological analyses, the authors demonstrate that memory precision is tied to the maturation of sparse engram formation in CA1. They identify parvalbumin-expressing (PV+) interneurons and their surrounding perineuronal nets (PNNs) as critical regulators of this process, mediated by the protein HAPLN1. The study reveals that immature neuronal allocation in juvenile mice results in dense engrams, leading to generalized, imprecise memories, while competitive allocation mechanisms, supported by mature PV+ interneurons and PNNs, enable sparse engrams and precise memories in older mice. These findings provide insights into the cellular and molecular basis of childhood amnesia and the ontogeny of episodic memory.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="emergence-of-memory-precision-by-p24">Emergence of Memory Precision by P24</h3>
<p>Juvenile mice (P16–P20) exhibited imprecise contextual fear memories, freezing equally in trained (Context A) and similar novel contexts (Context B). From P24, mice displayed precise memories, with freezing in Context A exceeding 60% and dropping to ~20% in Context B, indicating a developmental shift in memory precision around the fourth postnatal week (Fig. 1C).</p>
<h3 id="engram-sparsity-and-memory-precision">Engram Sparsity and Memory Precision</h3>
<p>In P20 mice, ~40% of CA1 pyramidal neurons were c-Fos-positive without fear conditioning, indicating dense engrams, compared to P24 and P60 mice (Fig. 1H). Chemogenetic shrinking of engrams in P20 mice (using hM4Di) reduced c-Fos to ~30% and induced precise memories, while expanding engrams in P60 mice (using hM3Dq) increased c-Fos to ~50% and led to imprecise memories (Fig. 2).</p>
<h3 id="immature-neuronal-allocation-in-juveniles">Immature Neuronal Allocation in Juveniles</h3>
<p>Optogenetic allocation of CA1 neurons to engrams (using HSV-NpACY) showed that silencing allocated neurons impaired fear recall in P24 and P60 mice but not in P20 mice, indicating that juvenile memories are broadly distributed due to immature allocation mechanisms (Fig. 3). Shrinking engrams in P20 mice localized memories to sparse populations, while expanding engrams in P60 mice disrupted localization.</p>
<h3 id="role-of-pv-interneurons">Role of PV+ Interneurons</h3>
<p>PV+ interneurons in CA1 matured structurally and functionally by P24, with increased neurite density and Syt2+ synaptic terminals compared to P20 (Fig. 4). Inhibiting PV+ interneurons in P60 mice (using hM4Di) increased c-Fos to ~40%, disrupted neuronal allocation, and induced juvenile-like imprecise memories, highlighting their role in competitive allocation and engram sparsity.</p>
<h3 id="pnn-maturation-and-hapln1">PNN Maturation and HAPLN1</h3>
<p>PNNs surrounding PV+ interneurons reached adult-like density by P24 (Fig. 5). Disrupting PNNs in P60 mice with AAV-DHapln1 reduced WFA+ PNN density by ~50%, increased c-Fos, and led to dense engrams and imprecise memories (Fig. 6). Accelerating PNN formation in P20 mice with AAV-Hapln1 increased PNN density twofold, reduced c-Fos, and promoted sparse engrams and precise memories, demonstrating that PNN maturation, driven by HAPLN1, is necessary and sufficient for memory precision.</p>
<h2 id="significance">Significance</h2>
<p>This study significantly advances our understanding of the neurobiological mechanisms governing the development of precise episodic-like memory and the phenomenon of childhood amnesia. Its key contributions include:</p>
<h3 id="elucidation-of-ca1-and-sparse-engram-roles">Elucidation of CA1 and Sparse Engram Roles:</h3>
<p>The research establishes the hippocampal CA1 region as a critical hub for memory precision, demonstrating that the formation of sparse engrams is directly linked to the emergence of precise episodic memories. By showing that engram sparsity increases with age (from dense engrams in P20 mice to sparse ones by P24), the study provides a mechanistic explanation for why early memories in juveniles are imprecise, offering a neurobiological basis for childhood amnesia.</p>
<h3 id="role-of-pv-interneurons-and-pnns">Role of PV+ Interneurons and PNNs:</h3>
<p>The identification of PV+ interneurons and their surrounding PNNs as key regulators of engram sparsity and memory precision is a major conceptual advance. The study highlights how PV+ interneuron maturation, driven by lateral inhibition, and PNN stabilization facilitate competitive neuronal allocation, enabling selective recruitment of neurons into sparse engrams.</p>
<h3 id="adaptive-perspective-on-childhood-amnesia">Adaptive Perspective on Childhood Amnesia:</h3>
<p>The paper proposes that imprecise, gist-like memories in early development are not a deficit but an adaptive strategy. By prioritizing generalized, semantic-like knowledge over detailed episodic memories, the immature hippocampus may support survival by allowing young organisms to learn broad environmental patterns. This reframes childhood amnesia as a developmentally appropriate mechanism rather than a limitation.</p>
<h3 id="clinical-implications">Clinical Implications:</h3>
<p>The ability to manipulate PNNs and engram sparsity (via HAPLN1 or chemogenetics) opens potential therapeutic avenues for memory-related disorders, such as PTSD or neurodevelopmental conditions like autism. For example, enhancing PNN formation could improve memory precision in juveniles, while destabilizing PNNs in adults might promote flexible learning, offering novel strategies for cognitive intervention.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Learning efficient task-dependent representations with synaptic plasticity]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Learning-efficient-task-dependent-representations-with-synaptic-plasticity</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Learning-efficient-task-dependent-representations-with-synaptic-plasticity</guid>
            <pubDate>Sun, 22 Jun 2025 07:37:08 GMT</pubDate>
            <description><![CDATA[<h1 id="learning-efficient-task-dependent-representations-with-synaptic-plasticity">Learning efficient task-dependent representations with synaptic plasticity</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3f62077b-3f9c-434b-9a25-f11f088e1ff1/image.png" alt=""></p>
<ul>
<li>constructed stochastic RNN using novel form of reward-modulated Hebbian synaptic plasticity</li>
<li>compared each hidden unit activity in two different tasks</li>
<li>analyzed effects of noise</li>
</ul>
<h2 id="task-dependent-synaptic-plasticity">Task-dependent synaptic plasticity</h2>
<h3 id="stochastic-circuit-model">Stochastic circuit model</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/0e336a3f-f4af-4955-8c11-02dfa912d735/image.png" alt=""></p>
<ul>
<li>the stimulus orientation theta is drawn from a von Mises distribution</li>
<li>each neuron has a fixed tuning curve s_j(theta)</li>
<li>estimation task: reproduce the input theta; classification task: classify whether theta &gt; 0</li>
</ul>
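<p>The setup above can be sketched as follows (my own toy version; the concentration kappa and the cosine-bump tuning shape are assumptions, not the paper&#39;s values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# stimulus orientation theta drawn from a von Mises prior centered at 0
kappa = 2.0                       # concentration (assumed value)
theta = rng.vonmises(0.0, kappa, size=5000)

# fixed tuning: neuron j responds most strongly near its preferred angle phi_j
n_neurons = 8
phi = np.linspace(-np.pi, np.pi, n_neurons, endpoint=False)

def s(theta):
    """Feedforward drive s_j(theta): cosine-bump tuning (assumed shape)."""
    return np.exp(np.cos(theta[..., None] - phi) - 1.0)

drive = s(theta)                  # shape (5000, 8)
print(drive.shape)
```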
<p>The stochastic dynamics governing the activity of recurrent neurons 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/11a9bc39-0402-45a9-b896-29e625c468e3/image.png" alt=""></p>
<p>f: a nonlinear function.
Considering the steady-state dynamics, at equilibrium,
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/1156d6c9-883d-4cd3-9d1c-2319c441362c/image.png" alt=""></p>
<p>which recovers the original form of the stochastic dynamics.
Brownian noise B_i is added independently to each neuron; in terms of the steady-state dynamics, this noise induces fluctuations about the fixed point (steady state).</p>
<p>When the recurrent weight matrix W is symmetric, the dynamics admit an energy function:
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/ffcec790-482f-4bb3-bca0-433f2dca40aa/image.png" alt=""></p>
<p>The network dynamics implement stochastic gradient descent on this energy function, which corresponds to Langevin sampling from the stimulus-dependent steady-state probability distribution.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/70c82f96-41a5-4176-a3e9-4b99cf888e04/image.png" alt=""></p>
<p>Instead of backpropagation, they used Langevin sampling (integrated with the Euler-Maruyama method) for the update:
dr_i = -∇E dt (motion down the energy gradient) + noise,
i.e., the network samples with respect to this energy function, and the gradient can be approximated by simulating the SDE.</p>
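<p>A minimal sketch of this sampling scheme (a toy quadratic energy stands in for the network&#39;s real energy function, which involves W, f, and s):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_E(r):
    # Toy quadratic energy E(r) = 0.5 * ||r||^2, so grad E = r.
    # (Stands in for the network's energy, which involves W, f, and s.)
    return r

def langevin(r0, dt=0.01, sigma=0.5, n_steps=20000):
    """Euler-Maruyama integration of dr = -grad E(r) dt + sigma dB."""
    r = r0.copy()
    samples = np.empty((n_steps, r.size))
    for t in range(n_steps):
        r = r - grad_E(r) * dt + sigma * rng.normal(0.0, np.sqrt(dt), size=r.shape)
        samples[t] = r
    return samples

samples = langevin(np.zeros(2))
# for this energy the stationary density is N(0, sigma^2/2) per coordinate
print(samples[5000:].var(axis=0))  # both entries should be near 0.125
```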
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/ec7d797d-a847-4205-be3d-67be9e048bf7/image.png" alt=""></p>
<p><a href="https://github.dev/colinbredenberg/Efficient-Plasticity-Camera-Ready">https://github.dev/colinbredenberg/Efficient-Plasticity-Camera-Ready</a>  </p>
<h3 id="task-dependent-objectives">Task dependent objectives</h3>
<p>Task-specific objective function</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/be0dfb8d-6962-48b7-9289-e926e495a754/image.png" alt=""></p>
<p>Dr: the readout; alpha: the task-specific loss (cross-entropy or MSE)</p>
<h3 id="local-task-dependent-learning">Local task-dependent learning</h3>
<p>They derived synaptic plasticity rules by maximizing O using gradient ascent.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/60c85944-cc5e-4145-87d5-26546b6a8efd/image.png" alt=""></p>
<p>The suggested weight update is similar to a standard reward-modulated Hebbian plasticity rule: alpha(Dr, s) plays the role of the reward, and the r_i r_j terms correspond to pre- and postsynaptic activity.</p>
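<p>The shape of such a three-factor rule can be sketched generically (this is a standard reward-modulated Hebbian update, not the paper&#39;s exact derivation; the learning rate and baseline are assumptions):</p>

```python
import numpy as np

def hebbian_update(W, r, reward, baseline, lr=1e-3):
    """Three-factor rule: dW_ij proportional to (reward - baseline) * r_i r_j.
    The reward term gates a plain Hebbian outer product of firing rates."""
    return W + lr * (reward - baseline) * np.outer(r, r)

rng = np.random.default_rng(2)
W = np.zeros((4, 4))
r = rng.random(4)
W_new = hebbian_update(W, r, reward=1.0, baseline=0.2)

# the update is a symmetric rank-1 matrix scaled by the reward signal,
# and vanishes when the reward matches its baseline
print(np.allclose(W_new, W_new.T))  # True
```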
<h3 id="learning-the-decoder">Learning the decoder</h3>
<p>p(r|s;W) does not depend on D</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/71fb6f26-d298-4adc-9c54-7a631eb0eca7/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7328cd28-f84b-4c11-9573-b07d819645aa/image.png" alt=""></p>
<h2 id="numerical-results">Numerical results</h2>
<h3 id="stimulus-encoding">stimulus encoding</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/efa7847a-64e3-4586-be1d-7d51b5b05aad/image.png" alt=""></p>
<p>Figure 1: recurrent neural network architecture and task learning.
Figure 1b: the derived local plasticity rules quickly converge to a good solution.</p>
<p>Figures 1c and d: distribution of preferred orientations across neurons (histogram)</p>
<ul>
<li>estimation: concentrated around the most probable stimulus</li>
<li>classification: bimodal</li>
</ul>
<p>Figures 1e and f: average population activities, which encode the prior probability.</p>
<p>They tested a narrower input theta distribution on the estimation task: the prior was still encoded.
Shifting the prior theta distribution in the discrimination task breaks the symmetry.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/1c1c5af8-5bdf-48bb-91a9-4160149e5ef2/image.png" alt=""></p>
<p>Supplementary results, panel f: broken symmetry in the shifted-prior condition.</p>
<h3 id="decoded-outputs">decoded outputs</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/770d5f98-15a0-4ef3-862f-c13d3baec44b/image.png" alt=""></p>
<ul>
<li>responses are systematically biased for less probable stimuli.</li>
<li>effect of variance was much weaker. </li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7d6e555c-9e7c-40a2-b501-382aa5d759c9/image.png" alt=""></p>
<p>Left: theta near pi (less probable region); right: theta near 0 (most concentrated region).
d&#39;: sensitivity index (discriminability).</p>
<ul>
<li>higher discriminability in the high-probability stimulus region.
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/2ab9c15d-120a-4ba2-b7a3-3ed272323d7b/image.png" alt=""></li>
</ul>
<p>Discriminability increased as evidence accumulated.</p>
<h3 id="effects-of-internal-noise">effects of internal noise</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d0eef654-df2b-41d8-a0c2-1a1486625f70/image.png" alt=""></p>
<p>a: increased noise leads to slower learning and worse asymptotic performance.
b: higher noise increased the engagement of recurrent connectivity after learning (meaning more bias toward previous observations? - exploiting the encoded prior distribution more?)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a188bc58-cfd1-4ac7-a9c1-8251a8b92332/image.png" alt=""></p>
<ul>
<li>noise volume fraction (computed from the covariance matrix of r projected onto the two output dimensions and the covariance matrix of r projected onto the two principal components of the neural activity at a fixed stimulus s)</li>
</ul>
<p>c: after learning, the volume fraction is much smaller for probable stimuli --&gt; the network has learned to effectively &quot;hide&quot; more of its noise for frequent inputs (!!), i.e., it is more resilient to noise in the probable stimulus region.</p>
<p>d: in terms of the energy function, the noise variance acts as a temperature; the energy landscape &quot;flattens&quot; with increasing noise.</p>
<h2 id="thoughts">Thoughts</h2>
<p>This paper was really inspiring. Its focus is slightly different from ours, since it proposes a network trained only with local learning rules and analyzes tuning curves and the effects of noise, but I thought it would be interesting for us to consider using this model or introducing stochasticity to examine the influence of noise.</p>
<p>Also, the observation that bias increases for improbable stimuli was consistent with my own results. I really liked how they plotted the argmax frequency histogram for each neuron and the average population activity. It gave me the idea that, in our case, it could be valuable to plot neuronal activity across different within-trial steps (steps 0 to 11). My intuition is that there might not be neurons that respond specifically to individual steps, but rather, there may be a directional dynamic that pushes activity forward across the trial.</p>
<p>I also think it would be interesting to plot the recurrent-to-input ratio at each step. Since our task is an integration task, my intuition is that the recurrent contribution should naturally become more dominant in the later steps of each trial. </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Probabilistic Machine Learning] Gaussian Process]]></title>
            <link>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Gaussian-Process</link>
            <guid>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Gaussian-Process</guid>
            <pubDate>Sat, 26 Apr 2025 13:58:17 GMT</pubDate>
            <description><![CDATA[<p>GPFA, CCA</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5069c001-8fc6-47e7-a971-1627a230863d/image.jpg" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Disentangling Representations through Multi-task Learning]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Disentangling-Representations-through-Multi-task-Learning</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Disentangling-Representations-through-Multi-task-Learning</guid>
            <pubDate>Tue, 22 Apr 2025 12:21:51 GMT</pubDate>
            <description><![CDATA[<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745323309587_image.png" alt=""></p>
<p>X is in R^D (usually D=2), with X = x* + noise.
The model has D(=2) input channels and N(=2, 3, 6, 12, 24?) output channels.</p>
<p>Each output channel’s target is classifying x* under the decision boundary (c_i, b_i) (i = 0, 1, ..., N),
so it is basically similar to the parallel setting in the Elia &amp; Omri 2024 paper (the simplicity-bias paper).</p>
<p>x* is sampled from the uniform distribution [-0.5, 0.5]; given D=2, this covers all 4 quadrants.</p>
<p>When they test OOD, they restrict the sign of each dimension of x*:
for example, train on x* from only one quadrant and test on the remaining three quadrants.</p>
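<p>The data setup, as I understand it, can be sketched like this (the noise level and the way a quadrant is encoded are my assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_x(n, quadrant=None, noise_std=0.05):
    """Sample x* ~ Uniform[-0.5, 0.5]^2; optionally restrict the signs
    of both dimensions to one quadrant (used for the OOD split)."""
    x_star = rng.uniform(-0.5, 0.5, size=(n, 2))
    if quadrant is not None:
        sx, sy = quadrant                      # e.g. (+1, +1)
        x_star[:, 0] = sx * np.abs(x_star[:, 0])
        x_star[:, 1] = sy * np.abs(x_star[:, 1])
    x = x_star + rng.normal(0.0, noise_std, size=x_star.shape)
    return x_star, x

# train on one quadrant, test on all four
x_train, _ = sample_x(1000, quadrant=(+1, +1))
x_test, _ = sample_x(1000)                     # all quadrants
print((x_train >= 0).all())  # True: training signs are restricted
```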
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745323686031_image.png" alt=""></p>
<p>We can also link this with <a href="https://arxiv.org/pdf/1902.07275">https://arxiv.org/pdf/1902.07275</a>:
they trained the N tasks simultaneously, but if we instead add tasks one by one, the pre-formed representation for each task would be disrupted.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Competition Dynamics Shape Algorithmic Phases of In-Context Learning]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Competition-Dynamics-Shape-Algorithmic-Phases-of-In-Context-Learning</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Competition-Dynamics-Shape-Algorithmic-Phases-of-In-Context-Learning</guid>
            <pubDate>Tue, 22 Apr 2025 12:17:57 GMT</pubDate>
            <description><![CDATA[<p>interesting paper and task <a href="https://openreview.net/pdf?id=XgH1wfHSX8">https://openreview.net/pdf?id=XgH1wfHSX8</a></p>
<p>The paper proposes a new task: a finite mixture of Markov chains.
There are 10 states (0~9), and a task T is defined as a 10x10 transition matrix. The training set is T_train = {T_1, T_2, ..., T_N}, where N controls the complexity. The sequence length l is fixed to 512 during training.</p>
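<p>A small sketch of how such data can be generated (my own version; the paper&#39;s exact sampling of transition matrices may differ):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def random_transition_matrix(n_states=10):
    """One task T: a row-stochastic 10x10 transition matrix."""
    T = rng.random((n_states, n_states))
    return T / T.sum(axis=1, keepdims=True)

def sample_sequence(T, length=512):
    """Roll out one Markov-chain sequence from a task T."""
    states = np.empty(length, dtype=int)
    states[0] = rng.integers(T.shape[0])
    for t in range(1, length):
        states[t] = rng.choice(T.shape[0], p=T[states[t - 1]])
    return states

N = 5                                   # number of training tasks
T_train = [random_transition_matrix() for _ in range(N)]
seq = sample_sequence(T_train[rng.integers(N)])
print(seq.shape, seq.min(), seq.max())
```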
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152098866_image.png" alt=""></p>
<p>There are 4 algorithmic phases, crossing Unigram/Bigram with Retrieval/Inference
(Uni-Ret, Uni-Inf, Bi-Inf, Bi-Ret):
Unigram: answers based on simple statistics, a histogram of states (fast but imprecise).
Bigram: answers based on the transition matrix (more precise).
Retrieval: dependent on the training dataset - good ID (in-distribution) but poor OOD.
Inference: independent of the training dataset - great OOD.</p>
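<p>The Unigram/Bigram distinction can be made concrete with two in-context estimators (a toy sketch; the add-one smoothing is my choice):</p>

```python
import numpy as np

def unigram_probs(seq, n_states=10):
    """Unigram strategy: next-token probabilities from the plain
    histogram of states seen so far (ignores transitions)."""
    counts = np.bincount(seq, minlength=n_states).astype(float)
    return counts / counts.sum()

def bigram_probs(seq, n_states=10):
    """Bigram strategy: condition on the last state via the empirical
    transition counts (add-one smoothing to avoid zeros)."""
    counts = np.ones((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    row = counts[seq[-1]]
    return row / row.sum()

seq = np.array([0, 1, 0, 1, 0, 1, 0])      # deterministic 0<->1 chain
print(unigram_probs(seq)[:2])               # roughly [4/7, 3/7]
print(bigram_probs(seq).argmax())           # bigram puts most mass on state 1
```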
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152109831_image.png" alt=""></p>
<p>a) Data diversity threshold:
if N is small (easy task), the Uni-Ret phase dominates;
if N is high enough (complex), training proceeds Uni-Inf --&gt; Bi-Inf (good OOD) --&gt; Bi-Ret (ID overfitting).
b) Emergence of induction heads:
at high N and intermediate training steps, induction heads are formed.
c) Transient nature:
the algorithmic phase with good OOD (Bi-Inf) is only active transiently, in the middle of training, and is then displaced by Bi-Ret.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152123958_image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Signatures of Criticality in Efficient Coding Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Signatures-of-Criticality-in-Efficient-Coding-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Signatures-of-Criticality-in-Efficient-Coding-Networks</guid>
            <pubDate>Fri, 07 Mar 2025 17:42:41 GMT</pubDate>
            <description><![CDATA[<h1 id="signatures-of-criticality-in-efficient-coding-networks">Signatures of Criticality in Efficient Coding Networks</h1>
<p>This paper studies two big ideas in neuroscience: <strong>criticality</strong> (the brain operating near a critical state) and <strong>efficient coding</strong> (neurons encoding inputs optimally). Using a network of leaky integrate-and-fire (LIF) neurons, the authors test whether optimizing for efficient coding naturally leads to signatures of criticality—like power-law distributions in neural avalanches. </p>
<h2 id="why-avalanches">Why Avalanches?</h2>
<ul>
<li>Avalanches: Neuronal avalanches are cascades of spikes spreading through a network, like a chain reaction. </li>
<li>This is a hallmark of criticality: if their sizes and durations follow a power-law distribution, it signals the network is in a critical state, balanced between order (over-synchronization) and chaos (random firing).</li>
</ul>
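<p>Avalanche sizes are typically read off a binned spike raster like this (a generic sketch, not the paper&#39;s pipeline): an avalanche is a maximal run of consecutive non-empty time bins, and its size is the total spike count in the run.</p>

```python
import numpy as np

def avalanche_sizes(spike_counts):
    """Split a binned spike-count series into avalanches: maximal runs
    of non-empty bins, separated by empty bins. Returns run sizes."""
    sizes, current = [], 0
    for c in spike_counts:
        if c > 0:
            current += c
        elif current > 0:
            sizes.append(current)
            current = 0
    if current > 0:
        sizes.append(current)
    return np.array(sizes)

counts = np.array([0, 2, 3, 0, 0, 1, 0, 4, 4, 1, 0])
print(avalanche_sizes(counts))  # [5 1 9]
```

<p>A histogram of these sizes on log-log axes is what the power-law (or bump / exponential-decay) diagnostics in Fig. 1 are computed from.</p>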
<h2 id="noise-tunes-criticality-and-coding">Noise Tunes Criticality and Coding</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/4ac03baa-622f-4060-bfdb-65eaa82ac68a/image.png" alt=""></p>
<ul>
<li>examines how noise levels affect avalanche size distributions and coding performance (MSE)<ul>
<li>Low Noise (blue line): Neurons over-synchronize, causing large avalanches (a &quot;bump&quot; in the tail): a supercritical state.</li>
<li>High Noise (red line): Activity fragments into small avalanches (exponential decay): a subcritical state.</li>
<li>Moderate Noise (green line): Avalanches follow a power-law distribution (linear in log-log): a critical state.</li>
<li>The noise level where avalanches are most scale-free (lowest $\kappa$) matches where coding error (MSE) is minimized. </li>
<li>Criticality and efficient coding align</li>
</ul>
</li>
</ul>
<h2 id="robust-across-network-sizes">Robust Across Network Sizes</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/245f513c-341d-48bf-b084-fd2cf19fd451/image.png" alt=""></p>
<ul>
<li>Do the results of Fig. 1 hold across different network sizes (50 to 400 neurons)?</li>
<li>MSE (Fig. 2A) and $\kappa$ (Fig. 2B) show similar nonmonotonic patterns with noise, regardless of size.</li>
</ul>
<h2 id="discussion">Discussion</h2>
<ul>
<li>The study suggests that criticality and efficient coding aren’t separate theories but deeply connected. </li>
<li>Excessive synchronization reduces the diversity of firing patterns, and this could be explained as being trapped in a single attractor (?!)</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Recurrence resonance - noise-enhanced dynamics in recurrent neural networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Recurrence-resonance-noise-enhanced-dynamics-in-recurrent-neural-networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Recurrence-resonance-noise-enhanced-dynamics-in-recurrent-neural-networks</guid>
            <pubDate>Fri, 07 Mar 2025 10:42:54 GMT</pubDate>
            <description><![CDATA[<h1 id="recurrence-resonance---noise-enhanced-dynamics-in-recurrent-neural-networks">Recurrence resonance - noise-enhanced dynamics in recurrent neural networks</h1>
<p>This paper introduces Recurrence Resonance (RR), where adding optimal white noise enhances the mutual information ($I$) between consecutive states, reflecting improved internal information flow. Using Symmetric Boltzmann Machines (SBMs) with varied weight matrices (random, Autapses-only, Hopfield, NRooks), the study shows that RR occurs in systems with multiple pre-existing attractors (fixed points, n-cycles) when trapped in one without noise. Optimal noise r_opt enables exploration of these attractors, increasing entropy ($H$) and $I$, while excessive noise disrupts predictability, reducing $I$.</p>
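<p>The measures H and I can be computed directly from a state time series (a sketch of the standard definitions; on a perfect 2-cycle the next state is fully predictable, so I equals H):</p>

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def consecutive_state_info(states, n_states):
    """Entropy H of the state distribution and mutual information I
    between consecutive states s(t) and s(t+1), both in bits."""
    joint = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        joint[a, b] += 1
    joint /= joint.sum()
    p_t = joint.sum(axis=1)        # marginal of s(t)
    p_t1 = joint.sum(axis=0)       # marginal of s(t+1)
    H = entropy(p_t)
    I = entropy(p_t) + entropy(p_t1) - entropy(joint.ravel())
    return H, I

# a perfect 2-cycle: the next state is fully predictable, so I equals H
H, I = consecutive_state_info([0, 1] * 50, n_states=2)
print(H, I)  # both approximately 1 bit
```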
<h2 id="what-does-it-mean-that-adding-noise">What does it mean to add noise?</h2>
<p>It means adding random signals (white noise, drawn from $N(0,1)$) to each neuron in the recurrent neural network (RNN). Specifically:</p>
<ul>
<li>Noise is introduced via $r \eta_n(t)$ in the input equation $u_n(t)$, where $r$ controls its strength, to explore how noise affects the network’s dynamics and information processing. While noise is typically seen as disruptive, the paper shows it can enhance information flow under specific conditions, a phenomenon they call recurrence resonance (RR).</li>
<li>Continuous noise was applied in most experiments, with strength $r$ varied to observe changes in entropy $H$, mutual information $I$, and divergence $D$ (Figures 1, 2). 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/11e191f8-3be5-46b5-90c6-9a5487ecb8b2/image.png" alt=""></li>
<li>Short noise pulses were also tested to switch attractors 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/a29f6aaf-87df-4989-a5ba-f39a1776fbd4/image.png" alt=""></li>
</ul>
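<p>A toy sketch of the noisy update (my reading of the model; the sigmoid nonlinearity, synchronous updates, and the weight value are assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(5)

def step(s, W, r):
    """One synchronous update of binary neurons: the input u_n is the
    recurrent drive plus white noise of strength r, and each neuron
    then fires with probability sigmoid(u_n)."""
    u = W @ s + r * rng.normal(size=s.shape)
    p = 1.0 / (1.0 + np.exp(-u))
    return (rng.random(s.shape) < p).astype(float)

n = 5
W = 20.0 * np.eye(n)   # autapses-only: strong self-excitation, no coupling
s = np.ones(n)         # start in the all-active fixed point

# with r = 0 the state persists; strong noise would knock it between attractors
trapped = [step(s, W, 0.0) for _ in range(100)]
print(all((t == s).all() for t in trapped))
```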
<h2 id="noise-enables-the-exploration-of-already-existing-multiple-attractors">Noise enables the exploration of already-existing multiple attractors</h2>
<ul>
<li>Multiple attractors (e.g., fixed points, n-cycles) are predefined by the weight matrix $W$ and the network’s dynamics before noise is added (Section 3.1). Noise doesn’t create new attractors but allows the system to transition between them (Section 4).</li>
<li>Without noise ($r = 0$), the system is trapped in one attractor; with optimal noise ($r_{opt}$), it visits more pre-existing attractors. Excessive noise ($r \gg r_{opt}$) randomizes transitions without forming new attractors</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/99ff8d49-25b9-48f5-b71c-d4387a5364a7/image.png" alt=""></p>
<h2 id="method-to-confirm-that-the-system-visits-multiple-attractors">method to confirm that the system visits multiple attractors</h2>
<p>They confirmed it using <strong>joint probability distributions</strong> and <strong>information-theoretic measures</strong>:</p>
<ul>
<li><strong>Joint Probability $P(s(t), s(t+1))$</strong>: Calculated from state time series over $N_T$ steps, visualized as matrices (Figures 1D-F, 2 columns 2-4).<ul>
<li>r = 0: Few states visited (trapped in one attractor).</li>
<li>r = r_opt: More states visited, clustered around attractors.</li>
<li>r = 50: Nearly all states visited randomly.</li>
</ul>
</li>
<li><strong>Information Measures</strong>:<ul>
<li>Entropy $H$: Measures state diversity (Equation 4).</li>
<li>Mutual Information $I$: Measures predictability between states, peaking at r_opt</li>
<li>Divergence $D = H - I$: Indicates randomness.</li>
</ul>
</li>
<li><strong>State Transition Graphs</strong>: Showed preferred paths forming attractors as $w$ increases</li>
<li><strong>Specific Tests</strong>: Confirmed in Autapses-only (32 fixed points), Hopfield (2 fixed points), and NRooks (4 8-cycles) via $P(s(t), s(t+1))$ patterns</li>
</ul>
<h2 id="what-kind-of-prior-learning-process-allowed-multiple-attractors-to-already-exist">What kind of prior learning process allowed multiple attractors to already exist?</h2>
<p>No explicit learning process was used; multiple attractors exist due to the <strong>weight matrix $W$ and inherent dynamics</strong>, not training.</p>
<ul>
<li><strong>Mechanism</strong>:<ul>
<li>$W$ is predefined (randomly or structurally), and the RNN’s feedback loops naturally form attractors like fixed points or cycles (Section 1, 2.1).</li>
<li>Example: Large $w$ creates stable attractors; small $w$ leads to randomness (Section 3.1.1).</li>
</ul>
</li>
<li><strong>No Training</strong>: Unlike supervised learning, attractors emerge from the system’s autonomous dynamics (e.g., NRooks’ permutation-like $W$ ensures cycles, Section 3.2.3).</li>
<li><strong>Exception</strong>: Hopfield’s $W$ was designed to store two patterns, implying a minimal &quot;learning&quot; setup, but this was pre-set, not trained in the study (Section 3.2.2).</li>
<li><strong>Conclusion</strong>: Attractors are a mathematical consequence of $W$ and dynamics, assumed to exist for studying noise effects (Section 4).</li>
</ul>
<h2 id="weight-matrix-design-random-autapses-only-hopfield-nrooks">Weight matrix design (Random, Autapses-only, Hopfield, NRooks)</h2>
<h3 id="random-gaussian-matrix">Random Gaussian Matrix</h3>
<ul>
<li><strong>Design</strong>: $w_{nm} \sim N(0, 1)$, scaled by $w$ (Section 3.1, Figure 1A).</li>
<li><strong>Attractors</strong>: <ul>
<li>Small $w$: No clear attractors (random walk).</li>
<li>Large $w$: Fixed points or cycles (e.g., Figure 1I shows 2 fixed points at $w = 5$).</li>
</ul>
</li>
<li><strong>Effect</strong>: Noise shifts from one attractor ($r = 0$) to multiple ($r_{opt}$), then randomness ($r = 50$).</li>
</ul>
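<p>A minimal sketch of this setting (assumptions mine: the exact update rule, threshold, and noise model are simplified stand-ins for the paper's): a binary RNN with a random Gaussian weight matrix scaled by $w$.</p>

```python
import numpy as np

# Noisy binary RNN sketch: w_nm ~ N(0, 1), scaled by w; additive noise of
# amplitude r. With r = 0 the deterministic map on 2^N states must settle
# into a fixed point or cycle (one attractor); larger r lets the state hop.
rng = np.random.default_rng(0)
N, w, r = 5, 5.0, 0.0
W = w * rng.standard_normal((N, N))

def step(s, W, r, rng):
    # Threshold-unit update with additive noise.
    h = W @ s + r * rng.standard_normal(len(s))
    return (h > 0).astype(int)

s = rng.integers(0, 2, N)
states = [int("".join(map(str, s)), 2)]   # encode the binary state as an integer
for _ in range(100):
    s = step(s, W, r, rng)
    states.append(int("".join(map(str, s)), 2))
```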
<h3 id="autapses-only">Autapses-only</h3>
<ul>
<li><strong>Design</strong>: Diagonal $w = +10$, others 0 (Section 3.2.1, Figure 2A).</li>
<li><strong>Attractors</strong>: 32 quasi-stable fixed points (5 neurons, $2^5$), each neuron persists independently.</li>
<li><strong>Plot (Figure 2A)</strong>: <ul>
<li>$r = 0$: $H \approx 1$, $I \approx 1$ (2 points).</li>
<li>$r = 4$ (optimal): $H \approx 5$, $I \approx 4.5$ (all points visited).</li>
<li>$r = 50$: $H \approx 5$, $I \approx 0.1$ (random transitions).</li>
</ul>
</li>
</ul>
<h3 id="hopfield">Hopfield</h3>
<ul>
<li><strong>Design</strong>: Symmetric, stores two patterns (24, 7), no self-connections (Section 3.2.2, Figure 2B).</li>
<li><strong>Attractors</strong>: 2 stable fixed points corresponding to stored patterns.</li>
<li><strong>Plot (Figure 2B)</strong>: <ul>
<li>$r = 0$: $H = 0$, $I = 0$ (trapped in 24).</li>
<li>$r = 23$ (optimal): $H \approx 1.5$, $I \approx 1.3$ (both visited).</li>
<li>$r = 50$: $H$ increases, $I$ drops slightly.</li>
</ul>
</li>
</ul>
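<p>For intuition, here is the standard Hopfield outer-product construction for two stored patterns; whether the paper uses exactly this learning rule is my assumption (the patterns here are the 5-bit states numbered 24 and 7, as above):</p>

```python
import numpy as np

# Standard Hopfield weights (my sketch): sum of outer products of the stored
# +-1 patterns, symmetric by construction, with self-connections removed.
N = 5
stored = [24, 7]                      # states to store, as 5-bit numbers
P = np.array([[2 * ((s >> i) & 1) - 1 for i in range(N)] for s in stored])
W = (P.T @ P).astype(float)           # symmetric outer-product rule
np.fill_diagonal(W, 0.0)              # no self-connections
```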
<h3 id="nrooks">NRooks</h3>
<ul>
<li><strong>Design</strong>: One non-zero ($w = 20$) per row/column (Section 3.2.3, Figure 2C).</li>
<li><strong>Attractors</strong>: 4 stable 8-cycles (32 states organized into cycles).</li>
<li><strong>Plot (Figure 2C)</strong>: <ul>
<li>$r = 0$: $H = 3$, $I = 3$ (one cycle).</li>
<li>$r = 7$ (optimal): $H \approx 5$, $I \approx 4.9$ (all cycles).</li>
<li>$r = 50$: $H \approx 5$, $I$ decreases (randomness).</li>
</ul>
</li>
</ul>
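<p>The NRooks structure can be sketched as a scaled permutation matrix (my construction; the paper's exact placement of the non-zeros may differ):</p>

```python
import numpy as np

# NRooks-style weight matrix sketch: exactly one non-zero entry (w = 20) in
# every row and every column, like non-attacking rooks on a chessboard.
rng = np.random.default_rng(0)
N, w = 5, 20.0
perm = rng.permutation(N)        # column index of the "rook" in each row
W = np.zeros((N, N))
W[np.arange(N), perm] = w
```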
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Fragmentation of grid cell maps in a multicompartment environment]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Fragmentation-of-grid-cell-maps-in-a-multicompartment-environment</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Fragmentation-of-grid-cell-maps-in-a-multicompartment-environment</guid>
            <pubDate>Sun, 23 Feb 2025 11:02:37 GMT</pubDate>
            <description><![CDATA[<h1 id="fragmentation-of-grid-cell-maps-in-a-multicompartment-environment">Fragmentation of grid cell maps in a multicompartment environment</h1>
<p>In this paper, they recorded neural activity in grid cells and place cells when rats ran through a hairpin maze. </p>
<h2 id="figure-1">Figure 1</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3183f48b-e02a-48ae-8df0-b25d65d430c4/image.png" alt=""></p>
<ul>
<li><p>grid cells have primarily been recorded in environments with no internal boundaries (like open fields)</p>
</li>
<li><p>grid maps repeat across alleys of the hairpin maze. </p>
</li>
<li><p>in the hairpin maze, although the two-dimensional periodicity of the grid was lost, firing positions were highly correlated across arms, especially among all even-numbered arms or all odd-numbered arms.</p>
</li>
</ul>
<h2 id="figure-2">Figure 2</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5d85b213-3f85-4794-b487-74f499ee58ac/image.png" alt=""></p>
<ul>
<li>population analysis for a single cell ensemble.</li>
<li>they showed repeating submaps and high correlation between neural activity in arms traversed in the same direction, while correlations were low for arms traversed in the opposite direction. </li>
<li>grid cell representation segmentation (fragmentation)</li>
</ul>
<h2 id="figure-3">Figure 3</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/9cdf45ed-6454-4c48-a40d-28601eab14b5/image.png" alt=""></p>
<ul>
<li>population analysis for all trials and all rats</li>
<li>a,b: population vector correlation matrix (in one rat)</li>
<li>c,d: all rats</li>
<li>grid representations don&#39;t encode a simple spatial coordinate but direction-specific, visual-input-dependent maps</li>
</ul>
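<p>As a sketch of this analysis (toy data of my own, not the paper's recordings): stack the firing-rate maps into a cells &times; bins matrix and correlate the population vectors across spatial bins.</p>

```python
import numpy as np

# Population-vector correlation sketch: each column of `rates` is the
# population activity vector at one spatial bin.
rng = np.random.default_rng(0)
n_cells, n_bins = 30, 40
rates = rng.random((n_cells, n_bins))   # toy firing-rate maps (cells x bins)
corr = np.corrcoef(rates.T)             # (n_bins, n_bins) correlation matrix
# Repeating submaps would appear as off-diagonal bands of high correlation
# between arms traversed in the same running direction.
```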
<h2 id="figure-4">Figure 4</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d49656b8-0d04-47d1-976b-4e79c25e8742/image.png" alt=""></p>
<ul>
<li>representations were reset near the turning points. </li>
<li>correlation values are high between adjacent bins in the population vector, but drop around the turning points of the hairpin (the start and end of each corridor). </li>
<li>in c, they tested shuffled data, where correlation values stayed high across turning points, indicating that the observed drop is not a statistical artifact of noise</li>
</ul>
<h2 id="figure-5">Figure 5</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d86baa01-9fd8-488d-9e9c-078da212a499/image.png" alt=""></p>
<ul>
<li>shortcut experiments (most interesting results!)</li>
<li>they intentionally truncated one specific arm.</li>
<li>on truncated arms, the correlation with reference arms was high when the shift was small (near the starting point) or large (near the end point), but low at intermediate shift values &lt;-- the cells are confused about where they are. </li>
</ul>
<h2 id="figure-7">Figure 7</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/9b3c26dc-5dfb-4262-947b-17c7084c1fce/image.png" alt=""></p>
<ul>
<li>open field, hairpin, virtual hairpin, open field 2</li>
<li>in the OF, VH, and OF2 settings, the grid fields (spatial phase) were maintained, but in HP, realignment occurred (?) and the correlation values were low</li>
<li>the fact that such discontinuities were absent in the virtual hairpin task suggests that the realignments were imposed by the physical structure of the task.</li>
<li>in the transparent wall condition, grid realignment between compartments still occurred, indicating that segmentation was not solely due to visual occlusion. </li>
</ul>
<h2 id="discussion">Discussion</h2>
<ul>
<li>fragmentation of grid cells: firing patterns of grid cells were repeatedly similar within each corridor, but sharply disconnected at turning points where the direction changed --&gt; suggesting that grid maps were fragmented into &quot;submaps&quot;</li>
<li>similar to place cells.</li>
<li>fragmentation was not seen in open spaces or in the virtual hairpin without physical walls, even though the trajectory was the same as in the hairpin maze.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Probabilistic Machine Learning] Hidden Markov Model (forward-backward algorithm)]]></title>
            <link>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Hidden-Markov-Model</link>
            <guid>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Hidden-Markov-Model</guid>
            <pubDate>Sun, 23 Feb 2025 09:08:37 GMT</pubDate>
            <description><![CDATA[<p>study material: <a href="https://www.youtube.com/watch?v=7zDARfKVm7s">https://www.youtube.com/watch?v=7zDARfKVm7s</a> (thank you!)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7959cdad-1be1-4c1e-b546-12b20eb170a3/image.jpg" alt=""></p>
<p>HMM (forward-backward algorithm) implementation</p>
<ul>
<li>purpose: compute the posterior over the hidden state at each time step given all observations, {$P(z_k|x_{1:n})$} $\forall k=1,2,\dots,n$</li>
</ul>
<pre><code class="language-python">import torch

def forward_backward(log_likelihoods, trans):
    # log_likelihoods: (T, K) with entries log p(x_t | z_t = k).
    # trans: (K, K) row-stochastic, trans[i, j] = p(z_{t+1} = j | z_t = i).
    # A uniform initial state distribution is assumed.
    T, K = log_likelihoods.shape
    log_trans = torch.log(trans + 1e-10)

    # Forward pass: alpha[t, j] = log p(x_{1:t}, z_t = j).
    alpha = torch.zeros(T, K)
    alpha[0] = log_likelihoods[0] - torch.log(torch.tensor(float(K)))
    for t in range(1, T):
        alpha[t] = log_likelihoods[t] + torch.logsumexp(
            alpha[t - 1].unsqueeze(1) + log_trans, dim=0)

    # Backward pass: beta[t, i] = log p(x_{t+1:T} | z_t = i).
    beta = torch.zeros(T, K)
    for t in range(T - 2, -1, -1):
        beta[t] = torch.logsumexp(
            log_trans + log_likelihoods[t + 1] + beta[t + 1], dim=1)

    # Smoothed marginals gamma[t, k] = P(z_t = k | x_{1:T});
    # softmax over log alpha + log beta normalizes in one step.
    gamma = torch.softmax(alpha + beta, dim=1)

    # Pairwise marginals xi[t, i, j] = P(z_t = i, z_{t+1} = j | x_{1:T}),
    # normalized jointly over both state indices.
    xi = torch.zeros(T - 1, K, K)
    for t in range(T - 1):
        log_xi = (alpha[t].unsqueeze(1) + log_trans +
                  log_likelihoods[t + 1].unsqueeze(0) + beta[t + 1].unsqueeze(0))
        xi[t] = torch.softmax(log_xi.flatten(), dim=0).reshape(K, K)

    return gamma, xi
</code></pre>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Emergence-of-Hidden-Capabilities-Exploring-Learning-Dynamics-in-Concept-Space</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Emergence-of-Hidden-Capabilities-Exploring-Learning-Dynamics-in-Concept-Space</guid>
            <pubDate>Thu, 06 Feb 2025 14:00:14 GMT</pubDate>
            <description><![CDATA[<h1 id="emergence-of-hidden-capabilities-exploring-learning-dynamics-in-concept-space">Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space</h1>
<p>Using a generative model, this paper proposes a concept space to evaluate a model&#39;s learning dynamics. Each abstract coordinate of this space corresponds to a specific concept, such as size, color, or background color. </p>
<ul>
<li>introducing concept space</li>
<li>concept signal dictates speed of learning</li>
<li>sudden transition in concept learning</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/37730432-7f5a-4fc4-bd27-ee5cb805deca/image.png" alt=""></p>
<h2 id="concept-space-a-framework-for-analyzing-concept-learning-dynamics">Concept Space: A Framework for Analyzing Concept Learning Dynamics</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/8b96243c-6457-41f1-b4f0-026a030f532a/image.png" alt=""></p>
<p>z is the full high-dimensional latent representation in the concept space, while h is the observed input derived from z with some components masked or reduced.</p>
<p>z represents the complete underlying structure for data generation, and h is a partial view provided as input for the model to learn. The model uses h to infer or align with the full representation z.</p>
<p>ex) z contains actual size, shape, background &quot;representations&quot; 
ex) h contains low dimensional and (sometimes) masked representations like 01, 10, 11</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/312c3a2e-679a-4c1b-93e1-956409eb9515/image.png" alt=""></p>
<p>In this paper, they often mention the strength of &quot;signal&quot;.
(left) color separation is stronger than size separation.
(right) size separation is stronger than color separation.</p>
<h2 id="experimental-and-evaluation-setup">Experimental and Evaluation Setup</h2>
<ul>
<li>experiments were conducted in a 2D concept space using size and color variables, intentionally excluding one combination that could only be generated by composing concepts. </li>
<li>00 (large red circles), 01 (large blue circles), and 10 (small red circles) (train) -&gt; 11 (small blue circles) <strong>OOD</strong></li>
<li>followed disentangled representation learning. (diffusion, U-Net)</li>
</ul>
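<p>The train/OOD split above can be written out explicitly (a trivial sketch of the setup; the labels are mine):</p>

```python
# Concept classes as (size, color) pairs; class 11 is held out as OOD.
classes = {"00": ("large", "red"), "01": ("large", "blue"),
           "10": ("small", "red"), "11": ("small", "blue")}
train = {k: v for k, v in classes.items() if k != "11"}
ood = {k: v for k, v in classes.items() if k == "11"}
```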
<h2 id="concept-signal-determines-learning-speed">Concept Signal Determines Learning Speed</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3c905fbf-01ca-4eda-89ef-2382110b38ab/image.png" alt=""></p>
<ul>
<li>RGB contrast for color and the size difference --&gt; controlled concept signal strength</li>
<li>speed of learning definition: inverse of the number of gradient steps required to reach 80% accuracy for class 11 (OOD)</li>
<li>concept signal determines the speed at which individual concepts are learned</li>
</ul>
<h2 id="concept-signal-governs-generalization-dynamics">Concept Signal Governs Generalization Dynamics</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3cadc431-b633-44b1-9126-4aae7d3343f1/image.png" alt=""></p>
<ul>
<li>color concept signal is the relative strength of the color signal compared with the size signal</li>
<li>for in-distribution generalization on class 00, the trajectory converges to 00 regardless of the color concept signal.</li>
<li>however, in OOD generalization, trajectories first move toward an in-distribution concept like 01 or 10, depending on the color concept signal, and then suddenly turn toward 11. </li>
<li><strong>concept memorization</strong></li>
<li>a strong imbalance of concept signal --&gt; strongly biased trajectories (look at the blue trajectory in (b))</li>
<li>this highlights a potential problem with early stopping. </li>
</ul>
<h2 id="towards-a-landscape-theory-of-learning-dynamics">Towards a Landscape Theory of Learning Dynamics</h2>
<p>they tried to explain the phenomenology of learning dynamics in concept space using analytical curves (dynamics equation below)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/1f367f02-3477-4cef-8275-ea6021d92d5a/image.png" alt=""></p>
<ul>
<li>where $\hat{z}$ is the target point, $\tilde{z}$ is the initially biased target, and $\sigma$ is the signal strength ($\sigma_1$: color concept signal, $\sigma_2$: size concept signal)</li>
<li>($\hat{z}_1$, $\hat{z}_2$) = (0, 1) indicates the representation of small size and red color. when $\sigma_2$ is larger than $\sigma_1$ (i.e., the trajectory colored blue), the trajectory is biased towards (0, 1)</li>
<li>this equation uses a $1 + \exp(-(t - \hat{t}))$ factor, which closely resembles memory decay.</li>
</ul>
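<p>A toy numerical sketch of this two-stage behavior (the functional form here is my assumption based on the $1 + \exp(-(t - \hat{t}))$ factor; it is not the paper's exact equation):</p>

```python
import numpy as np

# Toy sketch: each concept coordinate follows a sigmoidal transition
# 1 / (1 + exp(-(t - t_hat))), where a stronger signal gives an earlier
# transition time t_hat. The trajectory starts at the biased ID target
# z_tilde and moves to the OOD target z_hat.
def trajectory(t, z_hat, z_tilde, t_hat1, t_hat2):
    s1 = 1.0 / (1.0 + np.exp(-(t - t_hat1)))   # first concept learned
    s2 = 1.0 / (1.0 + np.exp(-(t - t_hat2)))   # second concept learned later
    z1 = z_tilde[0] + (z_hat[0] - z_tilde[0]) * s1
    z2 = z_tilde[1] + (z_hat[1] - z_tilde[1]) * s2
    return z1, z2

t = np.linspace(0, 20, 200)
z1, z2 = trajectory(t, z_hat=(1.0, 1.0), z_tilde=(1.0, 0.0),
                    t_hat1=5.0, t_hat2=12.0)
# Early on the trajectory sits near the ID point (1, 0); after t_hat2 it
# turns toward the OOD point (1, 1), reproducing the two-stage shape.
```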
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/15e5d877-62e3-4fff-8860-c641ed99d7d9/image.png" alt=""></p>
<p>based on this framework, they derived the energy function above. $a$ is the difference between $\sigma_1$ and $\sigma_2$, and $\hat{t}_1$ and $\hat{t}_2$ are the times at which the concepts $z_1$ and $z_2$ are learned. </p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3280b8ab-fa35-4e1b-a2e4-d7134aaa0352/image.png" alt=""></p>
<ul>
<li><p>this figure illustrates the simulated trajectories for classes 00 (a) and 11 (b), which closely match the earlier ID / OOD generalization figure.</p>
</li>
<li><p>the network&#39;s learning dynamics can be decomposed into two stages: <strong>first biased toward the ID sample, then suddenly moving to the OOD sample.</strong> --&gt; a phase change underlies this decomposition, at which the model acquires the capability to alter concepts</p>
</li>
</ul>
<h2 id="sudden-transitions-in-concept-learning-spaces">Sudden Transitions in Concept Learning Spaces</h2>
<ul>
<li>at the point of departure (phase change), the model has already learned to manipulate concepts, causing a shift in its learning trajectory.</li>
<li>however, with naive prompting, the model fails to elicit these learned capabilities, making it seem incomplete.</li>
<li>to handle this insufficiency, they use two new prompting methods.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/08fc3012-48a3-4fb6-9756-f205fdeff3d1/image.png" alt=""></p>
<p>about linear latent intervention : </p>
<ol>
<li>the model first identifies a vector representing a specific concept, such as &quot;blue,&quot; by transforming the corresponding latent concept vector into the model&#39;s latent space. </li>
<li>Similarly, a vector for &quot;large size&quot; is created. The model then modifies the original concept vector by adjusting specific components to enhance or suppress certain attributes. </li>
<li>Increasing the weight of the blue concept strengthens its influence, while increasing the weight of the large-size concept reduces its effect.</li>
</ol>
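<p>The three steps above amount to a simple vector edit in latent space. A sketch with made-up concept directions (v_blue, v_large, alpha, and beta are illustrative placeholders, not the paper's values):</p>

```python
import numpy as np

# Linear latent intervention sketch: strengthen the "blue" direction and
# suppress the "large size" direction in the prompt's latent code.
rng = np.random.default_rng(1)
D = 8
v_prompt = rng.standard_normal(D)          # latent code of the naive prompt

v_blue = np.zeros(D)                       # hypothetical "blue" direction
v_blue[0] = 1.0
v_large = np.zeros(D)                      # hypothetical "large size" direction
v_large[1] = 1.0

alpha, beta = 2.0, 1.5                     # intervention strengths (hyperparameters)
v_edit = v_prompt + alpha * v_blue - beta * v_large
```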
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/36fbf1a5-745e-466c-a279-88fee6f507e9/image.png" alt=""></p>
<p>alpha, beta: hyperparameters.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/71fe5618-4c04-439c-aac1-3327eb0d73eb/image.png" alt=""></p>
<p>analyzes the second-from-left curve (green), using checkpoints and two prompting methods to generate class 11 (blue, small).</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/688f2428-7b87-4d35-b1d8-c35e7c55a234/image.png" alt=""></p>
<p>(a) naive input prompting: sometimes failed
(b, c) much faster: around 6,000 gradient steps. </p>
<ul>
<li>this shows that the model can actually manipulate the concepts and has the capacity for OOD generalization.</li>
<li>also, with every prompting method, there were sudden turns where the training trajectory abruptly changes direction. In this period, the model &quot;activates&quot; its ability to manipulate the concept space. (??)</li>
</ul>
<h2 id="effect-of-underspecification-on-learning-dynamics-in-concept-space">Effect of Underspecification on Learning Dynamics in Concept Space</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/f7a3935b-6437-45ac-8202-979b5dbe8269/image.png" alt=""></p>
<ul>
<li>in sota generative models, when a yellow strawberry is requested, the model actually generates a red strawberry. </li>
<li>similarly, they randomly select training samples with a specific combination of shape and color (like red triangles), mask the token representing the color (red), and train the model on three concept classes (00, 01, 10).</li>
</ul>
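<p>The masking procedure can be sketched as follows (my toy version; the real setup masks tokens in the training prompts of a generative model):</p>

```python
import random

# Underspecification sketch: mask the color token in a fraction of the
# prompts for one (shape, color) class before training.
random.seed(0)
mask_frac = 0.5
prompts = [("triangle", "red")] * 100
masked = [(shape, None) if random.random() < mask_frac else (shape, color)
          for shape, color in prompts]
n_masked = sum(1 for _, color in masked if color is None)
```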
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/953a2168-0252-4acb-8acc-6f8f52165a4b/image.png" alt=""></p>
<ul>
<li>the number of gradient steps required to reach accuracy 0.8 increases with the percentage of masked prompts.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/e2991870-1020-431b-b8d9-7ee067e3592a/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/4c46acb9-2580-4b3f-a295-5e2e6d78b1c3/image.png" alt=""></p>
<ul>
<li>underspecification delays and hinders OOD generalization</li>
<li>toy model also exhibits similar plots</li>
</ul>
<p>very interesting paper</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Approximation-and-Optimization-Theory-for-Linear-Continuous-Time-Recurrent-Neural-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Approximation-and-Optimization-Theory-for-Linear-Continuous-Time-Recurrent-Neural-Networks</guid>
            <pubDate>Thu, 30 Jan 2025 08:03:41 GMT</pubDate>
            <description><![CDATA[<h1 id="approximation-and-optimization-theory-for-linear-continuous-time-recurrent-neural-networks">Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a947af26-1824-424c-b66c-32c83eef94a7/image.png" alt=""></p>
<h2 id="intro">Intro</h2>
<p><strong>This is a summary of selected parts from the paper, focused on understanding the &quot;curse of memory,&quot; rather than a review of the entire content.</strong></p>
<p>This paper addresses whether continuous-time RNNs can approximate time-dependent functionals $$H_t(x)$$, which map input signals $$x(t)$$ to outputs $$y(t)$$. Unlike prior studies, this work emphasizes:</p>
<ul>
<li>The &quot;absence&quot; of underlying dynamical systems for $$H_t$$.</li>
<li>The necessity of memory decay for approximation.</li>
</ul>
<h2 id="problem-formulation">Problem Formulation</h2>
<ul>
<li><p>Initially described in a discrete setting with equations (1) and (2), the discussion transitions into the continuous setting.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3638c1f3-d992-47a0-972f-f33562bd4d3a/image.png" alt=""></p>
</li>
<li><p>The continuous RNN dynamics are expressed in equation (17), leading to the representation in equation (18).</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/78f66742-798c-4c3b-a99d-d0dac5951b24/image.png" alt=""></p>
<h2 id="universal-approximation-theorem-theorem-7">Universal Approximation Theorem (Theorem 7)</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/0c7d71f9-2412-42df-9341-57b9491945f8/image.png" alt=""></p>
<ul>
<li><p>Using the Riesz-Markov-Kakutani theorem, the existence of $$H_t$$ is shown through its unique association with a measure $$\mu_t$$.</p>
</li>
<li><p>The kernel representation $$H_t(x) = \int_0^\infty x^\top_{t-s} \rho(s) \, ds$$ (equation (23)) is central, where <strong>$$\rho$$ dictates smoothness and decay properties of input-output relationships --&gt; convolution.</strong></p>
</li>
<li><p>equation (23) underscores kernel $$\rho(t)$$&#39;s role in input-output convolution </p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/96e19deb-81ed-4643-83b8-26b2191e064a/image.png" alt=""></p>
<ul>
<li>The quality of approximation depends on how well $$\rho(t)$$ can be represented by exponential sums.</li>
</ul>
<ul>
<li>The eigenvalues of $$W$$ with $$\text{Re}(\lambda) &lt; 0$$ ensure system stability.</li>
</ul>
<h2 id="approximation-rates-theorem-10-and-inverse-approximation-theorem-theorem-11">Approximation Rates (Theorem 10) and Inverse Approximation Theorem (Theorem 11)</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/59e083bc-61f0-44e2-bc7d-efc58d05c489/image.png" alt=""></p>
<ul>
<li>I couldn&#39;t understand the full proof process... </li>
<li>(yet,) this provides bounds for approximating functionals under smoothness ($$\alpha$$) and decay ($$\beta$$) conditions, leading to approximations via width-$$m$$ RNNs with bounded error rates (equation (27)).</li>
<li>In other words, although it may seem complex, the function is $$\alpha$$-smooth (continuous and differentiable), and its derivatives are controlled by a decaying rate of $$\beta$$ as in (25). When the decaying rate and the smoothness are related as shown in (26), width-$$m$$ RNN functionals $$H_t$$ <strong>can approximate the target</strong> under these bounds.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5f212a2f-573f-4290-8d0c-07ad7addafae/image.png" alt=""></p>
<ul>
<li>demonstrates that without memory decay, approximation is infeasible. (amazing)</li>
<li>again, I gave up on understanding the proof haha..</li>
</ul>
<h2 id="curse-of-memory">Curse of Memory</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/00700610-2327-41cd-bf4c-d131ec4eb2f6/image.png" alt=""></p>
<ul>
<li>When $$\rho(t)$$ decays slowly (e.g., $$\sim t^{-(1/\omega)}$$), RNNs face exponential model-size growth to maintain accuracy.</li>
</ul>
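<p>The point can be illustrated numerically (my sketch, not the paper's construction): a width-$$m$$ linear RNN realizes kernels that are sums of $$m$$ decaying exponentials, and fitting a slowly decaying power-law kernel gets much easier as $$m$$ grows.</p>

```python
import numpy as np

# Sketch: approximate a slowly decaying (power-law) kernel rho(t) by a sum
# of m decaying exponentials c_k * exp(-lambda_k t), the class of kernels a
# width-m linear RNN can realize. The rate grid is my arbitrary choice.
t = np.linspace(0.1, 10, 200)
target = t ** -0.5                       # slow, power-law memory decay

def exp_sum_error(t, target, m):
    rates = np.logspace(-1, 1, m)        # fixed decay rates lambda_k
    basis = np.exp(-np.outer(t, rates))  # (len(t), m) exponential basis
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return np.linalg.norm(basis @ coef - target)

err_small = exp_sum_error(t, target, 2)
err_large = exp_sum_error(t, target, 12)
# err_large is much smaller: tracking slow decay needs many exponentials
# (a wider RNN), illustrating the "curse of memory".
```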
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Understanding and Controlling Memory in Recurrent Neural Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Understanding-and-Controlling-Memory-in-Recurrent-Neural-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Understanding-and-Controlling-Memory-in-Recurrent-Neural-Networks</guid>
            <pubDate>Sat, 25 Jan 2025 12:53:36 GMT</pubDate>
            <description><![CDATA[<h1 id="understanding-and-controlling-memory-in-recurrent-neural-networks">Understanding and Controlling Memory in Recurrent Neural Networks</h1>
<ul>
<li><p>2019, curriculum learning, memory, slow/flexible point</p>
<ul>
<li>difference from our setting is that 
  they are using pre-defined task transition timing and generating each batch data for every time step (??) → I&#39;m scared, wondering if this is what I should have done too.</li>
</ul>
</li>
<li><p>Task definition (look at figureA below)</p>
<ul>
<li>whole time step: 140,000. In every time step, new data is generated and trained.</li>
<li>Input: <ul>
<li>MNIST (or CIFAR-10) appears at a specific time t_s, and noise images at the remaining time steps.</li>
<li>stimulus (t_s) and response (t_a) are chosen randomly for each trial, ensuring that t_a occurs at least 4 time steps after t_s. (t_a &gt; t_s + 4)</li>
<li>the total length of the input time steps for each trial is capped at T_max = 20, meaning the network must process inputs and formulate a response within a maximum of 20 time steps. </li>
</ul>
</li>
<li>Output: <ul>
<li>the network should output a &quot;null&quot; label at all times except during t_a;
  therefore, only the response at t_a is included in the loss calculation!
  → this is implemented by introducing a new variable z in the code…. (I&#39;ve been struggling with this for an hour)</li>
<li>the output label at t_a depends on the stimulus at t_s
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737612694517_image.png" alt=""></li>
</ul>
</li>
</ul>
</li>
<li><p>Model</p>
<ul>
<li>GRU and LSTM for RNN</li>
<li>as in B above, when using CIFAR-10, CNN weights are used together</li>
</ul>
</li>
</ul>
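<p>The loss masking described in the task definition (only the response at t_a counts) can be sketched like this; the variable names, including z, follow the notes above, but the details are my guess at the implementation:</p>

```python
import torch
import torch.nn.functional as F

# Masked loss sketch: the network must output a "null" class except at the
# response time t_a, and only the prediction at t_a enters the loss.
T, K = 20, 11                    # T_max time steps, 10 classes + "null"
logits = torch.randn(T, K)       # per-step outputs of the RNN (toy values)
t_a, label = 9, 3                # response time and stimulus-dependent label

z = torch.zeros(T)               # z marks which steps contribute to the loss
z[t_a] = 1.0

targets = torch.full((T,), K - 1)    # "null" everywhere...
targets[t_a] = label                 # ...except the response step

per_step = F.cross_entropy(logits, targets, reduction="none")
loss = (z * per_step).sum() / z.sum()   # equals the loss at t_a alone
```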
<ul>
<li><p>Training Protocol - Two different Curriculum Learning 
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737610899099_image.png" alt=""></p>
<ul>
<li>VoCu:<ul>
<li><a href="https://dl.acm.org/doi/abs/10.1145/1553374.1553380">https://dl.acm.org/doi/abs/10.1145/1553374.1553380</a> Bengio Curriculum Learning 2009</li>
<li>gradually increase the number of categories 3→4→…→10 (or more)</li>
</ul>
</li>
<li>DeCu:<ul>
<li><a href="https://doi.org/10.1142/S0218488598000094">https://doi.org/10.1142/S0218488598000094</a> Sepp Hochreiter 1998, vanishing gradients in RNNs</li>
<li>gradually increase the number of T_max 6→8→…→20</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>Extrapolation ability for each diff training protocol
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737805034830_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737805080956_image.png" alt=""></p>
<ul>
<li>retrieval accuracy: DeCu was better than VoCu. </li>
<li>VoCu rapidly formed each cloud, but was unstable with higher variability</li>
<li>DeCu was relatively slow, but showed good convergence toward each fixed point</li>
</ul>
</li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737806541867_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737806953829_image.png" alt=""></p>
<pre><code>- not well described, but red dots (faster, lower acc) = VoCu, blue dots (slower, higher acc) = DeCu
- speed and accuracy have a negative correlation.
- (right) code for finding slow points: just find local minima of “speed”</code></pre><ul>
<li><p>Formation of Slow Points - WHY protocols differ</p>
<ul>
<li><p>in VoCu</p>
<ul>
<li>8:2 means: at training step 8000, a new class is introduced. (MaxClasses[0:8000] = 3)</li>
<li>jumps are observed.</li>
</ul>
</li>
<li><p>in DeCu</p>
<ul>
<li>20:8 means: at training step 20000, the delay time is increased to 8. (T_MAX_VEC[20000:30000] = 8)</li>
<li>relatively gradual changes are observed.
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808137092_image.png" alt=""></li>
</ul>
</li>
<li><p>(C) </p>
<ul>
<li>In VoCu, with Backtracking procedure, they found that “the new class is assigned to an existing slow point”. → related to shared attractors (?)</li>
</ul>
</li>
<li><p>(D) history affect performance?</p>
<ul>
<li>in (c), class 8 -orange- originated from class 5 -green-.</li>
<li>class 5 performance was impaired more than other existing classes following the introduction of class 8. (look at thick green and orange colored line)</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808449373_image.png" alt=""></p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808565909_image.png" alt=""></p>
<ul>
<li>Improving Long Term Memory<ul>
<li>slow → good</li>
<li>new regularization with hidden state speed </li>
</ul>
</li>
</ul>
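<p>The speed-based regularization can be sketched as an extra loss term on the hidden-state trajectory (my formulation of the idea; the paper's exact penalty may differ):</p>

```python
import torch

# Speed regularizer sketch: penalize the hidden-state "speed"
# ||h_{t+1} - h_t||^2 so trajectories settle near slow points,
# which should improve long-term memory.
def speed_penalty(hidden):               # hidden: (T, N) trajectory
    diffs = hidden[1:] - hidden[:-1]
    return (diffs ** 2).sum(dim=1).mean()

h = torch.cumsum(torch.ones(5, 3) * 0.1, dim=0)   # slowly drifting states
reg = speed_penalty(h)
# total_loss = task_loss + lambda_reg * reg   (lambda_reg a hyperparameter)
```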
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737809358041_image.png" alt=""></p>
<pre><code>   - slow points can move → costly
   - therefore they used center of mass of each class instead of slow points for each class.</code></pre><p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737809366943_image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] A complexity-based theory of compositionality]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-A-complexity-based-theory-of-compositionality</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-A-complexity-based-theory-of-compositionality</guid>
            <pubDate>Tue, 29 Oct 2024 23:20:02 GMT</pubDate>
            <description><![CDATA[<h1 id="a-complexity-based-theory-of-compositionality">A complexity-based theory of compositionality</h1>
<p>Compositionality is fundamental to intelligence, especially in humans: the structure of thought, language, and higher-level reasoning. However, there&#39;s no measurable, mathematical definition of it, so the authors tried to quantify it with algorithmic information theory, namely Kolmogorov complexity.</p>
<p>Definition 1 (compositionality)</p>
<ul>
<li>existence of a symbolic complex expression -&gt; where are these expressions stored? in humans, they&#39;re stored in &quot;language&quot; or other kinds of expression, but in a neural network they can only exist as some combination of representations.</li>
<li>similarly, the structure of the expression (roughly, the language) is also unclear</li>
<li>in language, there&#39;s no limit on the mapping between a sentence and its semantics.</li>
<li>also, the semantics vary with the context in which the sentence is located.</li>
</ul>
<p>compressing a representation, Kolmogorov complexity</p>
<ul>
<li>through the lens of optimal compression and Kolmogorov complexity</li>
<li>Kolmogorov complexity defines a notion of information quantity:
  the length of the shortest program, in some programming language, that outputs the object
  ex) KC of 101010: length(&quot;repeat 10, 3 times&quot;)
  : the more structure or pattern an object has, the smaller its Kolmogorov complexity</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/be9faf49-d757-43d4-93d8-b9414f39b46d/image.png" alt=""></p>
<ul>
<li>in the context of ML, given a dataset $X = (x_1, \dots, x_n)$, sufficiently large and drawn i.i.d. from a distribution $p(x)$</li>
<li>the optimal method for compressing: decompose $K(X) = K(X|p) + K(p)$</li>
<li>then $K(X|p)$ can be optimally encoded using only $-\log_2 p(x_i)$ bits per sample (Shannon information)</li>
<li>the second term $K(p)$ refers to the complexity of the data distribution </li>
</ul>
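<p>A quick numerical check of the K(X|p) term (my own sketch, not the paper&#39;s code): the optimal code length of i.i.d. samples, the sum of $-\log_2 p(x_i)$, concentrates around n times the entropy of p:</p>

```python
import math
import random

random.seed(0)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
n = 10_000
xs = random.choices(list(p), weights=list(p.values()), k=n)

# K(X|p): total Shannon code length of the samples, given the distribution p
bits = sum(-math.log2(p[x]) for x in xs)

# entropy H(p) = 1.5 bits/symbol for this particular p
entropy = -sum(q * math.log2(q) for q in p.values())

print(bits / n, entropy)  # per-sample code length is close to H(p)
```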
<p>Compressing $Z$ as a function of parts</p>
<ul>
<li>denote a representation by a matrix $Z \in \mathbb{R}^{N \times D}$ where each row $z_n$ is a D-dimensional vector drawn from $p(z|x)p(x)$.</li>
<li>describe the representation using short parts-based constituent sentences $W \in \mathcal{V}^{N \times M}$, where $\mathcal{V}$ is a finite vocabulary of discrete symbols and M is the maximum sentence length.</li>
<li>let $p_w(w)$ be the distribution under which the sentences take their most compressed form</li>
<li>then we can write the following</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d755907e-180f-49ec-967e-e98ad38864ca/image.png" alt=""></p>
<ul>
<li>then, to decode the representation back from its sentences, we need a new mapping $f:\mathcal{V}^{M} \to \mathbb{R}^D$: the semantics</li>
<li>one important thing about K(f): they model the decoding error with a Normal distribution, which introduces an additional error term induced by the Gaussian assumption: that is K(Z|W,f)</li>
</ul>
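<p>The Gaussian error term can be made concrete (a sketch under my own assumptions, not the authors&#39; code): encoding each residual $z - f(w)$ under a Normal model costs $-\log_2 \mathcal{N}(z; f(w), \sigma^2)$ bits, so a semantics f that fits Z better spends fewer bits in K(Z|W,f):</p>

```python
import numpy as np

def gaussian_bits(residuals: np.ndarray, sigma: float) -> float:
    """Code length (bits) of residuals under N(0, sigma^2),
    up to an additive quantization constant:
    -log2 N(r) = 0.5*log2(2*pi*sigma^2) + r^2 / (2*sigma^2*ln 2)."""
    return float(np.sum(0.5 * np.log2(2 * np.pi * sigma**2)
                        + residuals**2 / (2 * sigma**2 * np.log(2))))

rng = np.random.default_rng(0)
good_fit = rng.normal(0.0, 0.1, size=100)  # f explains Z well: small residuals
bad_fit = rng.normal(0.0, 2.0, size=100)   # f explains Z poorly: large residuals

print(gaussian_bits(good_fit, 1.0), gaussian_bits(bad_fit, 1.0))
```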
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7a0611bb-1a60-4696-ad02-49cd1727b88a/image.png" alt=""></p>
<p>Summary and further intuition</p>
<ul>
<li>total Kolmogorov complexity of the representation:</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/f4a81ff2-a76d-4d59-8eb2-4324c8073e58/image.png" alt=""></p>
<ul>
<li>Representational compositionality: a formal definition of compositionality</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/cdb65419-e2a1-4278-a259-e0e47a6c870a/image.png" alt=""></p>
<p>wip</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-The-Simplicity-Bias-in-Multi-Task-RNNs-Shared-Attractors-Reuse-of-Dynamics-and-Geometric-Representation</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-The-Simplicity-Bias-in-Multi-Task-RNNs-Shared-Attractors-Reuse-of-Dynamics-and-Geometric-Representation</guid>
            <pubDate>Tue, 22 Oct 2024 19:55:06 GMT</pubDate>
            <description><![CDATA[<h1 id="the-simplicity-bias-in-multi-task-rnns-shared-attractors-reuse-of-dynamics-and-geometric-representation">The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation</h1>
<ul>
<li>How can a single neural population combine these disparate objects (dynamical motifs) into a joint representation? And under what conditions do these representations become shared or separate?</li>
<li>simplicity bias: the authors weren’t trying to quantify it; they were just pointing out the tendency.<ul>
<li>it persists regardless of the number of tasks or the nature of the dynamical objects
  (observed for various objects, including fixed points, limit cycles, and line/plane attractors)</li>
<li>they link this bias to the “sequential emergence” of attractors</li>
<li>they show how external factors can resist this bias and create more complex dynamical objects
  → these external factors are ‘architectural constraints’: gated, orthogonal, parallel</li>
</ul>
</li>
<li>They use “separate” input and output
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623204820_image.png" alt=""></li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623233179_image.png" alt=""></p>
<ul>
<li><p>In this settings, they have</p>
<ul>
<li>gated setting: only one task at a time; no output required for the other task<ul>
<li>in the code, they simply ignore the loss from the other tasks</li>
</ul>
</li>
<li>orthogonal setting: only one task at a time; the other task&#39;s output must be zero</li>
<li>parallel setting: multiple tasks at a time; no constraint across tasks</li>
</ul>
</li>
<li><p>Compared to the gated setting, the orthogonal and parallel settings allow ‘interaction between tasks’ </p>
</li>
<li><p>task examples</p>
<ul>
<li>from the top: fixed point, limit cycles, line- and plane-attractors (left) and attractors (right)
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623582970_image.png" alt=""></li>
</ul>
</li>
<li><p>They trained two tasks in gated mode (sky blue) and orthogonal mode (blue) 
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623667591_image.png" alt=""></p>
</li>
<li><p>Especially in gated mode, the tasks tend to share the attractors: the simplicity bias (why? see below)</p>
<ul>
<li>A: gated mode resulted in a complete overlap between the tasks (even for different tasks); the opposite held in the orthogonal setting</li>
<li>B: a linear classifier trained to separate the neural trajectories (of the 2 tasks) failed in gated, succeeded in ortho</li>
<li>C: ratio of variance between vs. within tasks (F-factor): shared in gated, separated in ortho</li>
<li>D: looking at the spectrum of the connectivity matrix (W_recurrent), i.e., the eigenvalues lambda<ul>
<li>the number of unstable eigenvalues was larger in the orthogonal setting</li>
<li>so in the orthogonal setting there are multiple fixed points! then how do they emerge?</li>
</ul>
</li>
</ul>
</li>
<li><p>The next question: what is the origin of this simplicity bias?</p>
</li>
<li><p>consider two tasks that need two fixed points each (4 points in total)</p>
<ul>
<li>In gated mode: the 2 attractors are shared across both tasks<ul>
<li>the recurrent dynamics cause the states to depend mostly on the “task-agnostic attractors” and less on the task-specific inputs</li>
</ul>
</li>
<li>In orthogonal mode: all 4 fixed points are different<ul>
<li>the two tasks&#39; attractors have to be orthogonal to each other, forcing the network to separate them</li>
<li>because the network architecture forces the other task&#39;s output to zero while one task is being trained</li>
</ul>
</li>
</ul>
</li>
<li><p>OK, let’s test this hypothesis by following the attractor landscape of (low-rank) networks [13]
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729625167922_image.png" alt=""></p>
</li>
<li><p>with this projection (approximation), they followed the evolution of the dynamics in parallel settings trained on two fixed-point tasks
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729625527560_image.png" alt=""></p>
</li>
<li><p>in C and D, </p>
<ul>
<li>at epoch 0, only the origin is stable (both eigenvalues are inside the unit circle)</li>
<li>the origin destabilizes as a single unstable eigenvalue emerges</li>
<li>with more training, a second eigenvalue leaves the unit circle → a pair of stable fixed points emerges</li>
</ul>
</li>
<li><p>in B, they repeated this analysis (sky blue, blue, dark blue = gated, ortho, parallel)</p>
<ul>
<li>in the orthogonal and parallel modes, we can see the clear sequential emergence of two outliers</li>
<li>the gated setting is solved with a single outlier, as all tasks share a common attractor</li>
</ul>
</li>
</ul>
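<p>The eigenvalue-tracking analysis above can be mimicked in a few lines (a toy sketch with a hand-built matrix, not the paper&#39;s trained networks): count the eigenvalues of the recurrent matrix outside the unit circle, and note how adding rank-1 outliers mirrors the sequential emergence of attractors:</p>

```python
import numpy as np

def n_unstable(W: np.ndarray) -> int:
    """Eigenvalues of W outside the unit circle, i.e. unstable
    directions of the linearized discrete-time dynamics at the origin."""
    return int(np.sum(np.abs(np.linalg.eigvals(W)) > 1.0))

rng = np.random.default_rng(0)
N = 50
W = 0.1 * rng.normal(size=(N, N)) / np.sqrt(N)  # bulk spectrum well inside the unit circle

# two orthonormal directions for the rank-1 outliers
u, v = np.linalg.qr(rng.normal(size=(N, 2)))[0].T

W1 = W + 1.5 * np.outer(u, u)   # first outlier leaves the circle: first attractor pair
W2 = W1 + 1.3 * np.outer(v, v)  # second outlier: a second, separate attractor pair

print(n_unstable(W), n_unstable(W1), n_unstable(W2))  # 0, 1, 2
```

<p>In the gated setting a single outlier suffices because the attractor is shared; the orthogonal and parallel settings need one outlier per task.</p>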
<p>In discussion,</p>
<ul>
<li>reusing an existing representation leads to faster learning of a task ~ I think they wanted to emphasize the &quot;efficiency&quot; shown in the gated setting</li>
<li>if there&#39;s a modular representation, there could be some constraint from the architecture. good point! since in the typical multitask learning setting, the &#39;rule&#39; signals are processed separately at the head of the RNN.. </li>
<li>an attractor shared between tasks is not identical to an attractor formed in response to a single task, so if one task has an oscillatory component, it might suggest that the same circuit is also capable of generating such oscillations in another context (?!?)</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Flexible multitask computation in recurrent networks utilizes shared dynamical motifs]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs</guid>
            <pubDate>Mon, 16 Sep 2024 20:36:41 GMT</pubDate>
            <description><![CDATA[<h1 id="flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs">Flexible multitask computation in recurrent networks utilizes shared dynamical motifs</h1>
<ul>
<li><p><a href="https://github.dev/lauradriscoll/flexible_multitask">https://github.dev/lauradriscoll/flexible_multitask</a></p>
</li>
<li><p>set up 4 periods (context, stimulus, memory, response)
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726004022458_image.png" alt=""></p>
<ul>
<li>input: 1 (fixation) + 2 (modalities) × 2 (A sin(theta) and A cos(theta)) + 15 (rule)</li>
<li>output: 1 (fixation) + 1 (modality) × 2 (sin(phi) and cos(phi))</li>
</ul>
</li>
<li><p>fixed points and their stability, analyzed via the linearization matrix (Jacobian)</p>
<ul>
<li>used approximation <a href="https://joss.theoj.org/papers/10.21105/joss.01003">https://joss.theoj.org/papers/10.21105/joss.01003</a> tensorflow based</li>
</ul>
</li>
<li><p>single task network</p>
<ul>
<li>shared fixed points in the context and memory dynamics</li>
<li>during the stimulus period, center → ring movement</li>
<li>during the memory period, the ring attractor shrinks in the memory-PC subspace<ul>
<li>shown by introducing an interpolation parameter alpha</li>
</ul>
</li>
</ul>
</li>
<li><p>two task networks (MemoryPro, MemoryAnti)</p>
<ul>
<li>shared ring attractor &amp; two separate stable FP, one unstable FP<ul>
<li>by introducing ‘rule input’ interpolation alpha </li>
</ul>
</li>
</ul>
</li>
<li><p>task variance analysis - finding dynamical motifs</p>
<ul>
<li>cluster 2 - reaction-timed response, cluster 9 - memory-guided responses</li>
<li>figure 3a. - cluster C has block → that color is for response</li>
<li>in fig 6, the ‘response’ dynamics were disrupted when cluster-c units were lesioned
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006067423_image.png" alt=""></li>
</ul>
</li>
<li><p>exploiting these dynamical motifs (similar subspaces) with “rule interpolation”</p>
<ul>
<li>(PCA on memory state)</li>
<li>task period cluster 6, unit clusters t and u / 9,10 with a-d: shared point/ring attractor
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006430651_image.png" alt=""></li>
</ul>
</li>
<li><p>shared stimulus period dynamical motifs</p>
<ul>
<li>during the stimulus period, if the initial conditions are similar, so are the evolving trajectories.</li>
<li>during the rule (context) period, the network prepares its future pathways (trajectories)</li>
</ul>
</li>
<li><p>interpolation with rule inputs by alpha</p>
<ul>
<li>(PCA on stimulus state)</li>
<li>the “connection bridge” traced by alpha was orthogonal to the decision boundary</li>
<li>similar tasks share stable and unstable fixed points
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006896351_image.png" alt=""></li>
</ul>
</li>
<li><p>lesioning clustered RNN units - they found modular lesion effects on…</p>
<ul>
<li>c: delayed response; f: anti-response tasks; modality 1/2; t/u: category memory; a/b: continuous memory</li>
</ul>
</li>
<li><p>reusing dynamical motifs</p>
<ul>
<li>train on all tasks except MemoryAnti, then freeze everything except the input weights</li>
</ul>
</li>
</ul>
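<p>The fixed-point analysis above can be sketched in NumPy (a minimal stand-in for the TensorFlow-based fixed-point finder linked earlier; the dynamics and parameters here are toy assumptions): find a point where the RNN map returns its input, then check stability via the eigenvalues of the Jacobian:</p>

```python
import numpy as np

def rnn_step(h, W, b):
    """One step of a simple discrete-time RNN: h_{t+1} = tanh(W h_t + b)."""
    return np.tanh(W @ h + b)

def find_fixed_point(W, b, h0, iters=500):
    """Plain iteration of the map; converges when the dynamics are contracting.
    (The toolbox in the paper instead minimizes the speed q(h) = ||F(h) - h||^2.)"""
    h = h0
    for _ in range(iters):
        h = rnn_step(h, W, b)
    return h

def is_stable(h_star, W, b):
    """Linearize at h*: J = diag(1 - tanh^2(W h* + b)) W; stable if all |eig(J)| < 1."""
    J = (1.0 - np.tanh(W @ h_star + b) ** 2)[:, None] * W
    return bool(np.max(np.abs(np.linalg.eigvals(J))) < 1.0)

W = 0.5 * np.eye(2)           # toy recurrent weights (contraction)
b = np.array([1.0, -1.0])     # toy bias
h_star = find_fixed_point(W, b, np.zeros(2))

print(np.allclose(rnn_step(h_star, W, b), h_star, atol=1e-8), is_stable(h_star, W, b))
```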
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Task representations in neural networks trained to perform many cognitive tasks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks</guid>
            <pubDate>Mon, 16 Sep 2024 20:34:22 GMT</pubDate>
            <description><![CDATA[<h1 id="task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks">Task representations in neural networks trained to perform many cognitive tasks</h1>
<ul>
<li><p>task representation = RNN weights when performing specific task (?)</p>
</li>
<li><p>task explanation and codes are in <a href="https://github.com/gyyang/multitask/blob/master/task.py">https://github.com/gyyang/multitask/blob/master/task.py</a></p>
</li>
<li><p>tasks at a glance - there are two modalities of stimulus
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1725477155269_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1725477072553_image.png" alt=""></p>
</li>
<li><p>learn multiple tasks “sequentially” with continual-learning technique</p>
<ul>
<li>if there’s some ‘state’ in each task, would there be replay memory?</li>
<li>are the results biased by the “sequence” (order) of the tasks?<ul>
<li>if tested, it would take all permutations of all the tasks…</li>
</ul>
</li>
</ul>
</li>
<li><p>RNN model: just one hidden layer &amp; non-negative (with 256 units)</p>
<ul>
<li>and calculated task variance. also clustering across units.<ul>
<li>I didn’t read much about how they did the clustering and cleaned out the noise</li>
</ul>
</li>
</ul>
</li>
<li><p>32-unit ring (direction-specific recurrent units)</p>
<ul>
<li>how is it possible to force them to respond to specific directions?</li>
</ul>
</li>
<li><p>(just came out) a paper using fMRI</p>
<ul>
<li>complexity defined as the number of choices (but within the same task family, unlike us)</li>
<li>for more complex tasks → people tend to do model-based thinking </li>
<li><a href="https://www.nature.com/articles/s41467-019-13632-1">https://www.nature.com/articles/s41467-019-13632-1</a></li>
</ul>
</li>
<li><p>in line with above paper, I think we should think about each task’s complexity</p>
<ul>
<li>what amount of information (or how many dimensions) would be sufficient for the correct answer?</li>
<li>‘amount’ and ‘dimension’ → complexity (how do they combine?)</li>
</ul>
</li>
<li><p>task vector</p>
<ul>
<li>why do they use only the “steady-state” response across the stimulus period?</li>
<li>this way they can’t capture the task variance (the sensitive unit activity…)</li>
<li>how about trying DSA like <a href="https://github.com/mitchellostrow/DSA">https://github.com/mitchellostrow/DSA</a></li>
<li>I don’t know</li>
<li>how to combine the ‘task variance’ to this?</li>
</ul>
</li>
<li><p>task vector 2</p>
<ul>
<li>though it’s cool that there could exist an algebraic form of compositional representation</li>
</ul>
</li>
<li><p>rule compositionality (combination of rule inputs!!!)</p>
<ul>
<li>most important part of this paper</li>
<li>Dly Anti task = Anti + (Dly Go - Go) </li>
<li>but failed in DMS != DMC + DNMS - DNMC</li>
<li>can we use multiplication or convolution instead? <a href="https://neuroai.neuromatch.io/tutorials/W2D2_NeuroSymbolicMethods/student/W2D2_Tutorial2.html">https://neuroai.neuromatch.io/tutorials/W2D2_NeuroSymbolicMethods/student/W2D2_Tutorial2.html</a></li>
</ul>
</li>
<li><p>a bit of continual learning</p>
<ul>
<li>Dly Go → Ctx Dly DM1, Ctx Dly DM2 ----- forgets Dly Go</li>
<li>they added a “penalty for deviations of important synaptic weights” ~ a regularizer</li>
</ul>
</li>
<li><p>I didn’t look much into </p>
<ul>
<li>how they clustered</li>
<li>how they analyzed with different activation functions (tanh, …)</li>
<li>how they did continual learning (method, regularizer part)</li>
</ul>
</li>
</ul>
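<p>The rule-composition idea can be illustrated with toy vectors (entirely hypothetical numbers; the real task vectors come from trained network activity): if the rule representation is additive, Dly Anti should equal Anti + (Dly Go − Go), and cosine similarity measures how well the composition holds:</p>

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
D = 64
anti, go, delay = rng.normal(size=(3, D))  # hypothetical "feature" directions

# toy task vectors built additively, plus a little noise
dly_go = go + delay + 0.05 * rng.normal(size=D)
dly_anti = anti + delay + 0.05 * rng.normal(size=D)

composed = anti + (dly_go - go)  # Dly Anti = Anti + (Dly Go - Go)
print(cosine(composed, dly_anti))  # close to 1 if the additive algebra holds
```

<p>The DMS ≠ DMC + DNMS − DNMC failure suggests some rules are not additive in this sense, which is where multiplication/binding operations might help.</p>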
]]></description>
        </item>
    </channel>
</rss>