<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>jaeheon-lee.log</title>
        <link>https://velog.io/</link>
        <description>https://jaeheon-lee486.github.io/</description>
        <lastBuildDate>Tue, 06 Jan 2026 12:32:52 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>jaeheon-lee.log</title>
            <url>https://velog.velcdn.com/images/jaeheon-lee/profile/e7d7d879-a47d-41eb-b658-daec32846539/image.jpeg</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. jaeheon-lee.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/jaeheon-lee" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Study] Koopman Operator]]></title>
            <link>https://velog.io/@jaeheon-lee/Study-Koopman-Operator</link>
            <guid>https://velog.io/@jaeheon-lee/Study-Koopman-Operator</guid>
            <pubDate>Tue, 06 Jan 2026 12:32:52 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/549574b3-f669-4fae-b371-fd7b2252972f/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5bf4a86d-1737-4216-a254-0f0b1212bac3/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Study] Kramers escape rate]]></title>
            <link>https://velog.io/@jaeheon-lee/Study-Kramers-escape-rate</link>
            <guid>https://velog.io/@jaeheon-lee/Study-Kramers-escape-rate</guid>
            <pubDate>Mon, 11 Aug 2025 16:51:07 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a635b069-2237-47bb-b994-afd979bfd774/image.PNG" alt="">
<a href="https://home.icts.res.in/~abhi/notes/kram.pdf">https://home.icts.res.in/~abhi/notes/kram.pdf</a> </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Nonlinear Dynamics and Chaos] Poincaré Maps]]></title>
            <link>https://velog.io/@jaeheon-lee/Nonlinear-Dynamics-and-Chaos-Poincar-Maps</link>
            <guid>https://velog.io/@jaeheon-lee/Nonlinear-Dynamics-and-Chaos-Poincar-Maps</guid>
            <pubDate>Thu, 07 Aug 2025 07:50:37 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3d876577-8ac8-44c3-9342-e6d96ad4483c/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Implicit encoding of prior probabilities in optimal neural populations]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Implicit-encoding-of-prior-probabilities-in-optimal-neural-populations</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Implicit-encoding-of-prior-probabilities-in-optimal-neural-populations</guid>
            <pubDate>Fri, 27 Jun 2025 06:34:03 GMT</pubDate>
            <description><![CDATA[<h1 id="implicit-encoding-of-prior-probabilities-in-optimal-neural-populations">Implicit encoding of prior probabilities in optimal neural populations</h1>
<p>Using Poisson spiking and Fisher information, this paper shows that warping adjusts neuron density and gain to prioritize high-probability stimuli.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/27c3cc93-449d-4ab3-acab-3aafb4df2709/image.png" alt=""></p>
<p>A rough summary of the paper&#39;s core idea, focusing on &#39;warping&#39;:</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/08f3496b-f4b4-4adb-a61b-b3a45d2195be/image.png" alt=""></p>
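<p>As a toy illustration of the warping idea (my own sketch, not the paper&#39;s code): placing tuning-curve centers by inverting the prior CDF makes neuron density track the prior, so high-probability stimuli are tiled more densely. The Gaussian prior and the grid here are assumptions.</p>

```python
import numpy as np

def warped_tuning_centers(prior_pdf, n_neurons=20, grid=None):
    """Place tuning-curve centers by inverting the prior CDF, so that
    high-probability stimuli get more neurons (denser tiling)."""
    if grid is None:
        grid = np.linspace(-3, 3, 1001)
    pdf = prior_pdf(grid)
    cdf = np.cumsum(pdf)
    cdf /= cdf[-1]
    # uniform quantiles in [0, 1] mapped back through the inverse CDF
    quantiles = (np.arange(n_neurons) + 0.5) / n_neurons
    return np.interp(quantiles, cdf, grid)

# Gaussian prior: centers should cluster near the mode at 0
gauss = lambda s: np.exp(-s**2 / 2)
centers = warped_tuning_centers(gauss, n_neurons=11)

inner = np.diff(centers)[5]   # spacing near the mode
outer = np.diff(centers)[0]   # spacing in the tail
print(inner < outer)  # True: denser tiling where the prior is high
```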
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Hippocampal Engram Formation and Memory Precision]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Hippocampal-Engram-Formation-and-Memory-Precision</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Hippocampal-Engram-Formation-and-Memory-Precision</guid>
            <pubDate>Thu, 26 Jun 2025 10:23:18 GMT</pubDate>
            <description><![CDATA[<h1 id="hippocampal-engram-formation-and-memory-precision">Hippocampal Engram Formation and Memory Precision</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/8048ff12-1550-4e5d-bef4-1aee47455321/image.png" alt=""></p>
<p>It was not an open-access paper, so I removed the figure.</p>
<h2 id="summary">Summary</h2>
<p>This study investigates the neurobiological mechanisms underlying the development of precise episodic-like memory in mice, focusing on the role of the hippocampal CA1 region during early postnatal development. The research addresses why early memories in juvenile mice (postnatal day 16 to 20, P16–P20) are imprecise and how memory precision emerges by the fourth postnatal week (around P24). Using a combination of behavioral tasks (contextual fear conditioning and spatial foraging), chemogenetic and optogenetic manipulations, and histological analyses, the authors demonstrate that memory precision is tied to the maturation of sparse engram formation in CA1. They identify parvalbumin-expressing (PV+) interneurons and their surrounding perineuronal nets (PNNs) as critical regulators of this process, mediated by the protein HAPLN1. The study reveals that immature neuronal allocation in juvenile mice results in dense engrams, leading to generalized, imprecise memories, while competitive allocation mechanisms, supported by mature PV+ interneurons and PNNs, enable sparse engrams and precise memories in older mice. These findings provide insights into the cellular and molecular basis of childhood amnesia and the ontogeny of episodic memory.</p>
<h2 id="key-findings">Key Findings</h2>
<h3 id="emergence-of-memory-precision-by-p24">Emergence of Memory Precision by P24</h3>
<p>Juvenile mice (P16–P20) exhibited imprecise contextual fear memories, freezing equally in trained (Context A) and similar novel contexts (Context B). From P24, mice displayed precise memories, with freezing in Context A exceeding 60% and dropping to ~20% in Context B, indicating a developmental shift in memory precision around the fourth postnatal week (Fig. 1C).</p>
<h3 id="engram-sparsity-and-memory-precision">Engram Sparsity and Memory Precision</h3>
<p>In P20 mice, ~40% of CA1 pyramidal neurons were c-Fos-positive without fear conditioning, indicating dense engrams, compared to P24 and P60 mice (Fig. 1H). Chemogenetic shrinking of engrams in P20 mice (using hM4Di) reduced c-Fos to ~30% and induced precise memories, while expanding engrams in P60 mice (using hM3Dq) increased c-Fos to ~50% and led to imprecise memories (Fig. 2).</p>
<h3 id="immature-neuronal-allocation-in-juveniles">Immature Neuronal Allocation in Juveniles</h3>
<p>Optogenetic allocation of CA1 neurons to engrams (using HSV-NpACY) showed that silencing allocated neurons impaired fear recall in P24 and P60 mice but not in P20 mice, indicating that juvenile memories are broadly distributed due to immature allocation mechanisms (Fig. 3). Shrinking engrams in P20 mice localized memories to sparse populations, while expanding engrams in P60 mice disrupted localization.</p>
<h3 id="role-of-pv-interneurons">Role of PV+ Interneurons</h3>
<p>PV+ interneurons in CA1 matured structurally and functionally by P24, with increased neurite density and Syt2+ synaptic terminals compared to P20 (Fig. 4). Inhibiting PV+ interneurons in P60 mice (using hM4Di) increased c-Fos to ~40%, disrupted neuronal allocation, and induced juvenile-like imprecise memories, highlighting their role in competitive allocation and engram sparsity.</p>
<h3 id="pnn-maturation-and-hapln1">PNN Maturation and HAPLN1</h3>
<p>PNNs surrounding PV+ interneurons reached adult-like density by P24 (Fig. 5). Disrupting PNNs in P60 mice with AAV-DHapln1 reduced WFA+ PNN density by ~50%, increased c-Fos, and led to dense engrams and imprecise memories (Fig. 6). Accelerating PNN formation in P20 mice with AAV-Hapln1 increased PNN density twofold, reduced c-Fos, and promoted sparse engrams and precise memories, demonstrating that PNN maturation, driven by HAPLN1, is necessary and sufficient for memory precision.</p>
<h2 id="significance">Significance</h2>
<p>This study significantly advances our understanding of the neurobiological mechanisms governing the development of precise episodic-like memory and the phenomenon of childhood amnesia. Its key contributions include:</p>
<h3 id="elucidation-of-ca1-and-sparse-engram-roles">Elucidation of CA1 and Sparse Engram Roles:</h3>
<p>The research establishes the hippocampal CA1 region as a critical hub for memory precision, demonstrating that the formation of sparse engrams is directly linked to the emergence of precise episodic memories. By showing that engram sparsity increases with age (from dense engrams in P20 mice to sparse ones by P24), the study provides a mechanistic explanation for why early memories in juveniles are imprecise, offering a neurobiological basis for childhood amnesia.</p>
<h3 id="role-of-pv-interneurons-and-pnns">Role of PV+ Interneurons and PNNs:</h3>
<p>The identification of PV+ interneurons and their surrounding PNNs as key regulators of engram sparsity and memory precision is a major conceptual advance. The study highlights how PV+ interneuron maturation, driven by lateral inhibition, and PNN stabilization facilitate competitive neuronal allocation, enabling selective recruitment of neurons into sparse engrams.</p>
<h3 id="adaptive-perspective-on-childhood-amnesia">Adaptive Perspective on Childhood Amnesia:</h3>
<p>The paper proposes that imprecise, gist-like memories in early development are not a deficit but an adaptive strategy. By prioritizing generalized, semantic-like knowledge over detailed episodic memories, the immature hippocampus may support survival by allowing young organisms to learn broad environmental patterns. This reframes childhood amnesia as a developmentally appropriate mechanism rather than a limitation.</p>
<h3 id="clinical-implications">Clinical Implications:</h3>
<p>The ability to manipulate PNNs and engram sparsity (via HAPLN1 or chemogenetics) opens potential therapeutic avenues for memory-related disorders, such as PTSD or neurodevelopmental conditions like autism. For example, enhancing PNN formation could improve memory precision in juveniles, while destabilizing PNNs in adults might promote flexible learning, offering novel strategies for cognitive intervention.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Learning efficient task-dependent representations with synaptic plasticity]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Learning-efficient-task-dependent-representations-with-synaptic-plasticity</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Learning-efficient-task-dependent-representations-with-synaptic-plasticity</guid>
            <pubDate>Sun, 22 Jun 2025 07:37:08 GMT</pubDate>
            <description><![CDATA[<h1 id="learning-efficient-task-dependent-representations-with-synaptic-plasticity">Learning efficient task-dependent representations with synaptic plasticity</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3f62077b-3f9c-434b-9a25-f11f088e1ff1/image.png" alt=""></p>
<ul>
<li>constructed stochastic RNN using novel form of reward-modulated Hebbian synaptic plasticity</li>
<li>compared each hidden unit activity in two different tasks</li>
<li>analyzed effects of noise</li>
</ul>
<h2 id="task-dependent-synaptic-plasticity">Task-dependent synaptic plasticity</h2>
<h3 id="stochastic-circuit-model">Stochastic circuit model</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/0e336a3f-f4af-4955-8c11-02dfa912d735/image.png" alt=""></p>
<ul>
<li>the stimulus orientation theta is drawn from a von Mises distribution</li>
<li>each neuron has a fixed tuning curve s_j(theta)</li>
<li>estimation task: reproduce the input theta; classification task: classify whether theta &gt; 0</li>
</ul>
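<p>The setup above can be sketched as follows (my own toy version; the concentration kappa and the cosine-bump tuning shape are assumptions, not the paper&#39;s values):</p>

```python
import numpy as np

rng = np.random.default_rng(0)

# stimulus orientation theta drawn from a von Mises prior centered at 0
kappa = 2.0                       # concentration (assumed value)
theta = rng.vonmises(0.0, kappa, size=5000)

# fixed tuning: neuron j responds most strongly near its preferred angle phi_j
n_neurons = 8
phi = np.linspace(-np.pi, np.pi, n_neurons, endpoint=False)

def s(theta):
    """Feedforward drive s_j(theta): cosine-bump tuning (assumed shape)."""
    return np.exp(np.cos(theta[..., None] - phi) - 1.0)

drive = s(theta)                  # shape (5000, 8)
print(drive.shape)
```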
<p>The stochastic dynamics governing the activity of recurrent neurons 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/11a9bc39-0402-45a9-b896-29e625c468e3/image.png" alt=""></p>
<p>f: a nonlinear function.
Considering the steady-state dynamics, at equilibrium,
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/1156d6c9-883d-4cd3-9d1c-2319c441362c/image.png" alt=""></p>
<p>which recovers the original form of the stochastic dynamics.
Brownian noise B_i is added independently to each neuron; in terms of the steady-state dynamics, this noise induces fluctuations about the fixed point (steady state).</p>
<p>When the recurrent weight matrix W is symmetric, the dynamics admit an energy function:
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/ffcec790-482f-4bb3-bca0-433f2dca40aa/image.png" alt=""></p>
<p>The network dynamics implement stochastic gradient descent on this energy function, which corresponds to Langevin sampling from the stimulus-dependent steady-state probability distribution.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/70c82f96-41a5-4176-a3e9-4b99cf888e04/image.png" alt=""></p>
<p>Instead of backpropagation, they used Langevin sampling (integrated with the Euler-Maruyama method) for the update:
dr_i = -∇E dt (motion down the energy gradient) + noise,
i.e., the network samples with respect to this energy function, and the gradient can be approximated by simulating the SDE.</p>
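<p>A minimal sketch of this sampling scheme (a toy quadratic energy stands in for the network&#39;s real energy function, which involves W, f, and s):</p>

```python
import numpy as np

rng = np.random.default_rng(1)

def grad_E(r):
    # Toy quadratic energy E(r) = 0.5 * ||r||^2, so grad E = r.
    # (Stands in for the network's energy, which involves W, f, and s.)
    return r

def langevin(r0, dt=0.01, sigma=0.5, n_steps=20000):
    """Euler-Maruyama integration of dr = -grad E(r) dt + sigma dB."""
    r = r0.copy()
    samples = np.empty((n_steps, r.size))
    for t in range(n_steps):
        r = r - grad_E(r) * dt + sigma * rng.normal(0.0, np.sqrt(dt), size=r.shape)
        samples[t] = r
    return samples

samples = langevin(np.zeros(2))
# for this energy the stationary density is N(0, sigma^2/2) per coordinate
print(samples[5000:].var(axis=0))  # both entries should be near 0.125
```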
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/ec7d797d-a847-4205-be3d-67be9e048bf7/image.png" alt=""></p>
<p><a href="https://github.dev/colinbredenberg/Efficient-Plasticity-Camera-Ready">https://github.dev/colinbredenberg/Efficient-Plasticity-Camera-Ready</a>  </p>
<h3 id="task-dependent-objectives">Task dependent objectives</h3>
<p>Task-specific objective function</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/be0dfb8d-6962-48b7-9289-e926e495a754/image.png" alt=""></p>
<p>Dr: the readout; alpha: the task-specific loss (cross-entropy or MSE)</p>
<h3 id="local-task-dependent-learning">Local task-dependent learning</h3>
<p>They derived synaptic plasticity rules by maximizing O using gradient ascent.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/60c85944-cc5e-4145-87d5-26546b6a8efd/image.png" alt=""></p>
<p>The suggested weight update is similar to a standard reward-modulated Hebbian plasticity rule: alpha(Dr, s) plays the role of the reward, and the r_i r_j terms correspond to pre- and postsynaptic activity.</p>
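<p>The shape of such a three-factor rule can be sketched generically (this is a standard reward-modulated Hebbian update, not the paper&#39;s exact derivation; the learning rate and baseline are assumptions):</p>

```python
import numpy as np

def hebbian_update(W, r, reward, baseline, lr=1e-3):
    """Three-factor rule: dW_ij proportional to (reward - baseline) * r_i r_j.
    The reward term gates a plain Hebbian outer product of firing rates."""
    return W + lr * (reward - baseline) * np.outer(r, r)

rng = np.random.default_rng(2)
W = np.zeros((4, 4))
r = rng.random(4)
W_new = hebbian_update(W, r, reward=1.0, baseline=0.2)

# the update is a symmetric rank-1 matrix scaled by the reward signal,
# and vanishes when the reward matches its baseline
print(np.allclose(W_new, W_new.T))  # True
```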
<h3 id="learning-the-decoder">Learning the decoder</h3>
<p>p(r|s;W) does not depend on D</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/71fb6f26-d298-4adc-9c54-7a631eb0eca7/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7328cd28-f84b-4c11-9573-b07d819645aa/image.png" alt=""></p>
<h2 id="numerical-results">Numerical results</h2>
<h3 id="stimulus-encoding">stimulus encoding</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/efa7847a-64e3-4586-be1d-7d51b5b05aad/image.png" alt=""></p>
<p>Figure 1: recurrent neural network architecture and task learning.
Figure 1b: the derived local plasticity rules quickly converge to a good solution.</p>
<p>Figures 1c and d: distribution of preferred orientations across neurons (histogram)</p>
<ul>
<li>estimation: concentrated around the most probable stimulus</li>
<li>classification: bimodal</li>
</ul>
<p>Figures 1e and f: average population activities, which encode the prior probability.</p>
<p>They tested a narrower input theta distribution on the estimation task: the prior was still encoded.
Shifting the prior theta distribution in the discrimination task breaks the symmetry.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/1c1c5af8-5bdf-48bb-91a9-4160149e5ef2/image.png" alt=""></p>
<p>Supplementary results, panel f: broken symmetry in the shifted-prior condition.</p>
<h3 id="decoded-outputs">decoded outputs</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/770d5f98-15a0-4ef3-862f-c13d3baec44b/image.png" alt=""></p>
<ul>
<li>responses are systematically biased for less probable stimuli.</li>
<li>effect of variance was much weaker. </li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7d6e555c-9e7c-40a2-b501-382aa5d759c9/image.png" alt=""></p>
<p>Left: theta near pi (less probable region); right: theta near 0 (most concentrated region).
d&#39;: sensitivity index (discriminability).</p>
<ul>
<li>higher discriminability in the high-probability stimulus region.
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/2ab9c15d-120a-4ba2-b7a3-3ed272323d7b/image.png" alt=""></li>
</ul>
<p>Discriminability increased as evidence accumulated.</p>
<h3 id="effects-of-internal-noise">effects of internal noise</h3>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d0eef654-df2b-41d8-a0c2-1a1486625f70/image.png" alt=""></p>
<p>a: increased noise leads to slower learning and worse asymptotic performance.
b: higher noise increased the engagement of recurrent connectivity after learning (meaning more bias toward previous observations? - exploiting the encoded prior distribution more?)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a188bc58-cfd1-4ac7-a9c1-8251a8b92332/image.png" alt=""></p>
<ul>
<li>noise volume fraction (computed from the covariance matrix of r projected onto the two output dimensions and the covariance matrix of r projected onto the two principal components of the neural activity at a fixed stimulus s)</li>
</ul>
<p>c: after learning, the volume fraction is much smaller for probable stimuli --&gt; the network has learned to effectively &quot;hide&quot; more of its noise for frequent inputs (!!), i.e., it is more resilient to noise in the probable stimulus region.</p>
<p>d: in terms of the energy function, the noise variance acts as a temperature; the energy landscape &quot;flattens&quot; with increasing noise.</p>
<h2 id="thoughts">Thoughts</h2>
<p>This paper was really inspiring. Its focus is slightly different from ours, since it proposes a network trained only with local learning rules and analyzes tuning curves and the effects of noise, but I thought it would be interesting for us to consider using this model or introducing stochasticity to examine the influence of noise.</p>
<p>Also, the observation that bias increases for improbable stimuli was consistent with my own results. I really liked how they plotted the argmax frequency histogram for each neuron and the average population activity. It gave me the idea that, in our case, it could be valuable to plot neuronal activity across different within-trial steps (steps 0 to 11). My intuition is that there might not be neurons that respond specifically to individual steps, but rather, there may be a directional dynamic that pushes activity forward across the trial.</p>
<p>I also think it would be interesting to plot the recurrent-to-input ratio at each step. Since our task is an integration task, my intuition is that the recurrent contribution should naturally become more dominant in the later steps of each trial. </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Probabilistic Machine Learning] Gaussian Process]]></title>
            <link>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Gaussian-Process</link>
            <guid>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Gaussian-Process</guid>
            <pubDate>Sat, 26 Apr 2025 13:58:17 GMT</pubDate>
            <description><![CDATA[<p>GPFA, CCA</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5069c001-8fc6-47e7-a971-1627a230863d/image.jpg" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Disentangling Representations through Multi-task Learning]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Disentangling-Representations-through-Multi-task-Learning</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Disentangling-Representations-through-Multi-task-Learning</guid>
            <pubDate>Tue, 22 Apr 2025 12:21:51 GMT</pubDate>
            <description><![CDATA[<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745323309587_image.png" alt=""></p>
<p>X is in R^D (usually D=2), with X = x* + noise.
The model has D(=2) input channels and N(=2, 3, 6, 12, 24?) output channels.</p>
<p>Each output channel’s target is classifying x* under the decision boundary (c_i, b_i) (i = 0, 1, ..., N),
so it is basically similar to the parallel setting in the Elia &amp; Omri 2024 paper (the simplicity-bias paper).</p>
<p>x* is sampled from the uniform distribution [-0.5, 0.5]; given D=2, this covers all 4 quadrants.</p>
<p>When they test OOD, they restrict the sign of each dimension of x*:
for example, train on x* from only one quadrant and test on the remaining three quadrants.</p>
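<p>The data setup, as I understand it, can be sketched like this (the noise level and the way a quadrant is encoded are my assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(3)

def sample_x(n, quadrant=None, noise_std=0.05):
    """Sample x* ~ Uniform[-0.5, 0.5]^2; optionally restrict the signs
    of both dimensions to one quadrant (used for the OOD split)."""
    x_star = rng.uniform(-0.5, 0.5, size=(n, 2))
    if quadrant is not None:
        sx, sy = quadrant                      # e.g. (+1, +1)
        x_star[:, 0] = sx * np.abs(x_star[:, 0])
        x_star[:, 1] = sy * np.abs(x_star[:, 1])
    x = x_star + rng.normal(0.0, noise_std, size=x_star.shape)
    return x_star, x

# train on one quadrant, test on all four
x_train, _ = sample_x(1000, quadrant=(+1, +1))
x_test, _ = sample_x(1000)                     # all quadrants
print((x_train >= 0).all())  # True: training signs are restricted
```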
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745323686031_image.png" alt=""></p>
<p>We can also link this with <a href="https://arxiv.org/pdf/1902.07275">https://arxiv.org/pdf/1902.07275</a>:
they trained the N tasks simultaneously, but if we instead add tasks one by one, the pre-formed representation for each task would be disrupted.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Competition Dynamics Shape Algorithmic Phases of In-Context Learning]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Competition-Dynamics-Shape-Algorithmic-Phases-of-In-Context-Learning</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Competition-Dynamics-Shape-Algorithmic-Phases-of-In-Context-Learning</guid>
            <pubDate>Tue, 22 Apr 2025 12:17:57 GMT</pubDate>
            <description><![CDATA[<p>interesting paper and task <a href="https://openreview.net/pdf?id=XgH1wfHSX8">https://openreview.net/pdf?id=XgH1wfHSX8</a></p>
<p>The paper proposes a new task: a finite mixture of Markov chains.
There are 10 states (0~9), and a task T is defined as a 10x10 transition matrix. The training set is T_train = {T_1, T_2, ..., T_N}, where N controls the complexity. The sequence length l is fixed to 512 during training.</p>
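<p>A small sketch of how such data can be generated (my own version; the paper&#39;s exact sampling of transition matrices may differ):</p>

```python
import numpy as np

rng = np.random.default_rng(4)

def random_transition_matrix(n_states=10):
    """One task T: a row-stochastic 10x10 transition matrix."""
    T = rng.random((n_states, n_states))
    return T / T.sum(axis=1, keepdims=True)

def sample_sequence(T, length=512):
    """Roll out one Markov-chain sequence from a task T."""
    states = np.empty(length, dtype=int)
    states[0] = rng.integers(T.shape[0])
    for t in range(1, length):
        states[t] = rng.choice(T.shape[0], p=T[states[t - 1]])
    return states

N = 5                                   # number of training tasks
T_train = [random_transition_matrix() for _ in range(N)]
seq = sample_sequence(T_train[rng.integers(N)])
print(seq.shape, seq.min(), seq.max())
```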
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152098866_image.png" alt=""></p>
<p>There are 4 algorithmic phases, crossing Unigram/Bigram with Retrieval/Inference
(Uni-Ret, Uni-Inf, Bi-Inf, Bi-Ret):
Unigram: answers based on simple statistics, a histogram of states (fast but imprecise).
Bigram: answers based on the transition matrix (more precise).
Retrieval: dependent on the training dataset - good ID (in-distribution) but poor OOD.
Inference: independent of the training dataset - great OOD.</p>
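<p>The Unigram/Bigram distinction can be made concrete with two in-context estimators (a toy sketch; the add-one smoothing is my choice):</p>

```python
import numpy as np

def unigram_probs(seq, n_states=10):
    """Unigram strategy: next-token probabilities from the plain
    histogram of states seen so far (ignores transitions)."""
    counts = np.bincount(seq, minlength=n_states).astype(float)
    return counts / counts.sum()

def bigram_probs(seq, n_states=10):
    """Bigram strategy: condition on the last state via the empirical
    transition counts (add-one smoothing to avoid zeros)."""
    counts = np.ones((n_states, n_states))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[a, b] += 1
    row = counts[seq[-1]]
    return row / row.sum()

seq = np.array([0, 1, 0, 1, 0, 1, 0])      # deterministic 0<->1 chain
print(unigram_probs(seq)[:2])               # roughly [4/7, 3/7]
print(bigram_probs(seq).argmax())           # bigram puts most mass on state 1
```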
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152109831_image.png" alt=""></p>
<p>a) Data diversity threshold:
if N is small (easy task), the Uni-Ret phase dominates;
if N is high enough (complex), training proceeds Uni-Inf --&gt; Bi-Inf (good OOD) --&gt; Bi-Ret (ID overfitting).
b) Emergence of induction heads:
at high N and intermediate training steps, induction heads are formed.
c) Transient nature:
the algorithmic phase with good OOD (Bi-Inf) is only active transiently, in the middle of training, and is then displaced by Bi-Ret.</p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1745152123958_image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Signatures of Criticality in Efficient Coding Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Signatures-of-Criticality-in-Efficient-Coding-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Signatures-of-Criticality-in-Efficient-Coding-Networks</guid>
            <pubDate>Fri, 07 Mar 2025 17:42:41 GMT</pubDate>
            <description><![CDATA[<h1 id="signatures-of-criticality-in-efficient-coding-networks">Signatures of Criticality in Efficient Coding Networks</h1>
<p>This paper studies two big ideas in neuroscience: <strong>criticality</strong> (the brain operating near a critical state) and <strong>efficient coding</strong> (neurons encoding inputs optimally). Using a network of leaky integrate-and-fire (LIF) neurons, the authors test whether optimizing for efficient coding naturally leads to signatures of criticality—like power-law distributions in neural avalanches. </p>
<h2 id="why-avalanches">Why Avalanches?</h2>
<ul>
<li>Avalanches: Neuronal avalanches are cascades of spikes spreading through a network, like a chain reaction. </li>
<li>This is a hallmark of criticality: if their sizes and durations follow a power-law distribution, it signals the network is in a critical state, balanced between order (over-synchronization) and chaos (random firing).</li>
</ul>
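<p>Avalanche sizes are typically read off a binned spike raster like this (a generic sketch, not the paper&#39;s pipeline): an avalanche is a maximal run of consecutive non-empty time bins, and its size is the total spike count in the run.</p>

```python
import numpy as np

def avalanche_sizes(spike_counts):
    """Split a binned spike-count series into avalanches: maximal runs
    of non-empty bins, separated by empty bins. Returns run sizes."""
    sizes, current = [], 0
    for c in spike_counts:
        if c > 0:
            current += c
        elif current > 0:
            sizes.append(current)
            current = 0
    if current > 0:
        sizes.append(current)
    return np.array(sizes)

counts = np.array([0, 2, 3, 0, 0, 1, 0, 4, 4, 1, 0])
print(avalanche_sizes(counts))  # [5 1 9]
```

<p>A histogram of these sizes on log-log axes is what the power-law (or bump / exponential-decay) diagnostics in Fig. 1 are computed from.</p>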
<h2 id="noise-tunes-criticality-and-coding">Noise Tunes Criticality and Coding</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/4ac03baa-622f-4060-bfdb-65eaa82ac68a/image.png" alt=""></p>
<ul>
<li>examines how noise levels affect avalanche size distributions and coding performance (MSE)<ul>
<li>Low Noise (blue line): Neurons over-synchronize, causing large avalanches (a &quot;bump&quot; in the tail): a supercritical state.</li>
<li>High Noise (red line): Activity fragments into small avalanches (exponential decay): a subcritical state.</li>
<li>Moderate Noise (green line): Avalanches follow a power-law distribution (linear in log-log): a critical state.</li>
<li>The noise level where avalanches are most scale-free (lowest $\kappa$) matches where coding error (MSE) is minimized. </li>
<li>Criticality and efficient coding align</li>
</ul>
</li>
</ul>
<h2 id="robust-across-network-sizes">Robust Across Network Sizes</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/245f513c-341d-48bf-b084-fd2cf19fd451/image.png" alt=""></p>
<ul>
<li>Do the results of Fig. 1 hold across different network sizes (50 to 400 neurons)?</li>
<li>MSE (Fig. 2A) and $\kappa$ (Fig. 2B) show similar nonmonotonic patterns with noise, regardless of size.</li>
</ul>
<h2 id="discussion">Discussion</h2>
<ul>
<li>The study suggests that criticality and efficient coding aren’t separate theories but deeply connected. </li>
<li>Excessive synchronization reduces the diversity of firing patterns, and this could be explained as being trapped in a single attractor (?!)</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Recurrence resonance - noise-enhanced dynamics in recurrent neural networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Recurrence-resonance-noise-enhanced-dynamics-in-recurrent-neural-networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Recurrence-resonance-noise-enhanced-dynamics-in-recurrent-neural-networks</guid>
            <pubDate>Fri, 07 Mar 2025 10:42:54 GMT</pubDate>
            <description><![CDATA[<h1 id="recurrence-resonance---noise-enhanced-dynamics-in-recurrent-neural-networks">Recurrence resonance - noise-enhanced dynamics in recurrent neural networks</h1>
<p>This paper introduces Recurrence Resonance (RR), where adding optimal white noise enhances the mutual information ($I$) between consecutive states, reflecting improved internal information flow. Using Symmetric Boltzmann Machines (SBMs) with varied weight matrices (random, Autapses-only, Hopfield, NRooks), the study shows that RR occurs in systems with multiple pre-existing attractors (fixed points, n-cycles) when trapped in one without noise. Optimal noise r_opt enables exploration of these attractors, increasing entropy ($H$) and $I$, while excessive noise disrupts predictability, reducing $I$.</p>
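<p>The measures H and I can be computed directly from a state time series (a sketch of the standard definitions; on a perfect 2-cycle the next state is fully predictable, so I equals H):</p>

```python
import numpy as np

def entropy(p):
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def consecutive_state_info(states, n_states):
    """Entropy H of the state distribution and mutual information I
    between consecutive states s(t) and s(t+1), both in bits."""
    joint = np.zeros((n_states, n_states))
    for a, b in zip(states[:-1], states[1:]):
        joint[a, b] += 1
    joint /= joint.sum()
    p_t = joint.sum(axis=1)        # marginal of s(t)
    p_t1 = joint.sum(axis=0)       # marginal of s(t+1)
    H = entropy(p_t)
    I = entropy(p_t) + entropy(p_t1) - entropy(joint.ravel())
    return H, I

# a perfect 2-cycle: the next state is fully predictable, so I equals H
H, I = consecutive_state_info([0, 1] * 50, n_states=2)
print(H, I)  # both approximately 1 bit
```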
<h2 id="what-does-it-mean-that-adding-noise">What does it mean to add noise?</h2>
<p>It means adding random signals (white noise, drawn from $N(0,1)$) to each neuron in the recurrent neural network (RNN). Specifically:</p>
<ul>
<li>Noise is introduced via $r \eta_n(t)$ in the input equation $u_n(t)$, where $r$ controls its strength, to explore how noise affects the network’s dynamics and information processing. While noise is typically seen as disruptive, the paper shows it can enhance information flow under specific conditions, a phenomenon they call recurrence resonance (RR).</li>
<li>Continuous noise was applied in most experiments, with strength $r$ varied to observe changes in entropy $H$, mutual information $I$, and divergence $D$ (Figures 1, 2). 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/11e191f8-3be5-46b5-90c6-9a5487ecb8b2/image.png" alt=""></li>
<li>Short noise pulses were also tested to switch attractors 
<img src="https://velog.velcdn.com/images/jaeheon-lee/post/a29f6aaf-87df-4989-a5ba-f39a1776fbd4/image.png" alt=""></li>
</ul>
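<p>A toy sketch of the noisy update (my reading of the model; the sigmoid nonlinearity, synchronous updates, and the weight value are assumptions):</p>

```python
import numpy as np

rng = np.random.default_rng(5)

def step(s, W, r):
    """One synchronous update of binary neurons: the input u_n is the
    recurrent drive plus white noise of strength r, and each neuron
    then fires with probability sigmoid(u_n)."""
    u = W @ s + r * rng.normal(size=s.shape)
    p = 1.0 / (1.0 + np.exp(-u))
    return (rng.random(s.shape) < p).astype(float)

n = 5
W = 20.0 * np.eye(n)   # autapses-only: strong self-excitation, no coupling
s = np.ones(n)         # start in the all-active fixed point

# with r = 0 the state persists; strong noise would knock it between attractors
trapped = [step(s, W, 0.0) for _ in range(100)]
print(all((t == s).all() for t in trapped))
```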
<h2 id="noise-enables-the-exploration-of-already-existing-multiple-attractors">Noise enables the exploration of already-existing multiple attractors</h2>
<ul>
<li>Multiple attractors (e.g., fixed points, n-cycles) are predefined by the weight matrix $W$ and the network’s dynamics before noise is added (Section 3.1). Noise doesn’t create new attractors but allows the system to transition between them (Section 4).</li>
<li>Without noise ($r = 0$), the system is trapped in one attractor; with optimal noise ($r_{opt}$), it visits more pre-existing attractors. Excessive noise ($r \gg r_{opt}$) randomizes transitions without forming new attractors</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/99ff8d49-25b9-48f5-b71c-d4387a5364a7/image.png" alt=""></p>
<h2 id="method-to-confirm-that-the-system-visits-multiple-attractors">method to confirm that the system visits multiple attractors</h2>
<p>They confirmed it using <strong>joint probability distributions</strong> and <strong>information-theoretic measures</strong>:</p>
<ul>
<li><strong>Joint Probability $P(s(t), s(t+1))$</strong>: Calculated from state time series over $N_T$ steps, visualized as matrices (Figures 1D-F, 2 columns 2-4).<ul>
<li>r = 0: Few states visited (trapped in one attractor).</li>
<li>r = r_opt: More states visited, clustered around attractors.</li>
<li>r = 50: Nearly all states visited randomly.</li>
</ul>
</li>
<li><strong>Information Measures</strong>:<ul>
<li>Entropy $H$: Measures state diversity (Equation 4).</li>
<li>Mutual Information $I$: Measures predictability between states, peaking at r_opt</li>
<li>Divergence $D = H - I$: Indicates randomness.</li>
</ul>
</li>
<li><strong>State Transition Graphs</strong>: Showed preferred paths forming attractors as $w$ increases</li>
<li><strong>Specific Tests</strong>: Confirmed in Autapses-only (32 fixed points), Hopfield (2 fixed points), and NRooks (4 8-cycles) via $P(s(t), s(t+1))$ patterns</li>
</ul>
<h2 id="what-kind-of-prior-learning-process-allowed-multiple-attractors-to-already-exist">What kind of prior learning process allowed multiple attractors to already exist?</h2>
<p>No explicit learning process was used; multiple attractors exist due to the <strong>weight matrix $W$ and inherent dynamics</strong>, not training.</p>
<ul>
<li><strong>Mechanism</strong>:<ul>
<li>$W$ is predefined (randomly or structurally), and the RNN’s feedback loops naturally form attractors like fixed points or cycles (Section 1, 2.1).</li>
<li>Example: Large $w$ creates stable attractors; small $w$ leads to randomness (Section 3.1.1).</li>
</ul>
</li>
<li><strong>No Training</strong>: Unlike supervised learning, attractors emerge from the system’s autonomous dynamics (e.g., NRooks’ permutation-like $W$ ensures cycles, Section 3.2.3).</li>
<li><strong>Exception</strong>: Hopfield’s $W$ was designed to store two patterns, implying a minimal &quot;learning&quot; setup, but this was pre-set, not trained in the study (Section 3.2.2).</li>
<li><strong>Conclusion</strong>: Attractors are a mathematical consequence of $W$ and dynamics, assumed to exist for studying noise effects (Section 4).</li>
</ul>
<h2 id="weight-matrix-design-random-autapses-only-hopfield-nrooks">Weight matrix design (Random, Autapses-only, Hopfield, NRooks)</h2>
<h3 id="random-gaussian-matrix">Random Gaussian Matrix</h3>
<ul>
<li><strong>Design</strong>: $w_{nm} \sim N(0, 1)$, scaled by $w$ (Section 3.1, Figure 1A).</li>
<li><strong>Attractors</strong>: <ul>
<li>Small $w$: No clear attractors (random walk).</li>
<li>Large $w$: Fixed points or cycles (e.g., Figure 1I shows 2 fixed points at $w = 5$).</li>
</ul>
</li>
<li><strong>Effect</strong>: Noise shifts from one attractor ($r = 0$) to multiple ($r_{opt}$), then randomness ($r = 50$).</li>
</ul>
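<p>A minimal sketch of this setting (assumptions mine: the exact update rule, threshold, and noise model are simplified stand-ins for the paper's): a binary RNN with a random Gaussian weight matrix scaled by $w$.</p>

```python
import numpy as np

# Noisy binary RNN sketch: w_nm ~ N(0, 1), scaled by w; additive noise of
# amplitude r. With r = 0 the deterministic map on 2^N states must settle
# into a fixed point or cycle (one attractor); larger r lets the state hop.
rng = np.random.default_rng(0)
N, w, r = 5, 5.0, 0.0
W = w * rng.standard_normal((N, N))

def step(s, W, r, rng):
    # Threshold-unit update with additive noise.
    h = W @ s + r * rng.standard_normal(len(s))
    return (h > 0).astype(int)

s = rng.integers(0, 2, N)
states = [int("".join(map(str, s)), 2)]   # encode the binary state as an integer
for _ in range(100):
    s = step(s, W, r, rng)
    states.append(int("".join(map(str, s)), 2))
```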
<h3 id="autapses-only">Autapses-only</h3>
<ul>
<li><strong>Design</strong>: Diagonal $w = +10$, others 0 (Section 3.2.1, Figure 2A).</li>
<li><strong>Attractors</strong>: 32 quasi-stable fixed points (5 neurons, $2^5$), each neuron persists independently.</li>
<li><strong>Plot (Figure 2A)</strong>: <ul>
<li>$r = 0$: $H \approx 1$, $I \approx 1$ (2 points).</li>
<li>$r = 4$ (optimal): $H \approx 5$, $I \approx 4.5$ (all points visited).</li>
<li>$r = 50$: $H \approx 5$, $I \approx 0.1$ (random transitions).</li>
</ul>
</li>
</ul>
<h3 id="hopfield">Hopfield</h3>
<ul>
<li><strong>Design</strong>: Symmetric, stores two patterns (24, 7), no self-connections (Section 3.2.2, Figure 2B).</li>
<li><strong>Attractors</strong>: 2 stable fixed points corresponding to stored patterns.</li>
<li><strong>Plot (Figure 2B)</strong>: <ul>
<li>$r = 0$: $H = 0$, $I = 0$ (trapped in 24).</li>
<li>$r = 23$ (optimal): $H \approx 1.5$, $I \approx 1.3$ (both visited).</li>
<li>$r = 50$: $H$ increases, $I$ drops slightly.</li>
</ul>
</li>
</ul>
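<p>For intuition, here is the standard Hopfield outer-product construction for two stored patterns; whether the paper uses exactly this learning rule is my assumption (the patterns here are the 5-bit states numbered 24 and 7, as above):</p>

```python
import numpy as np

# Standard Hopfield weights (my sketch): sum of outer products of the stored
# +-1 patterns, symmetric by construction, with self-connections removed.
N = 5
stored = [24, 7]                      # states to store, as 5-bit numbers
P = np.array([[2 * ((s >> i) & 1) - 1 for i in range(N)] for s in stored])
W = (P.T @ P).astype(float)           # symmetric outer-product rule
np.fill_diagonal(W, 0.0)              # no self-connections
```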
<h3 id="nrooks">NRooks</h3>
<ul>
<li><strong>Design</strong>: One non-zero ($w = 20$) per row/column (Section 3.2.3, Figure 2C).</li>
<li><strong>Attractors</strong>: 4 stable 8-cycles (32 states organized into cycles).</li>
<li><strong>Plot (Figure 2C)</strong>: <ul>
<li>$r = 0$: $H = 3$, $I = 3$ (one cycle).</li>
<li>$r = 7$ (optimal): $H \approx 5$, $I \approx 4.9$ (all cycles).</li>
<li>$r = 50$: $H \approx 5$, $I$ decreases (randomness).</li>
</ul>
</li>
</ul>
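<p>The NRooks structure can be sketched as a scaled permutation matrix (my construction; the paper's exact placement of the non-zeros may differ):</p>

```python
import numpy as np

# NRooks-style weight matrix sketch: exactly one non-zero entry (w = 20) in
# every row and every column, like non-attacking rooks on a chessboard.
rng = np.random.default_rng(0)
N, w = 5, 20.0
perm = rng.permutation(N)        # column index of the "rook" in each row
W = np.zeros((N, N))
W[np.arange(N), perm] = w
```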
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Fragmentation of grid cell maps in a multicompartment environment]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Fragmentation-of-grid-cell-maps-in-a-multicompartment-environment</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Fragmentation-of-grid-cell-maps-in-a-multicompartment-environment</guid>
            <pubDate>Sun, 23 Feb 2025 11:02:37 GMT</pubDate>
            <description><![CDATA[<h1 id="fragmentation-of-grid-cell-maps-in-a-multicompartment-environment">Fragmentation of grid cell maps in a multicompartment environment</h1>
<p>In this paper, they recorded neural activity in grid cells and place cells when rats ran through a hairpin maze. </p>
<h2 id="figure-1">Figure 1</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3183f48b-e02a-48ae-8df0-b25d65d430c4/image.png" alt=""></p>
<ul>
<li><p>grid cells have primarily been recorded in environments with no internal boundaries (like open fields)</p>
</li>
<li><p>grid maps repeat across alleys of the hairpin maze. </p>
</li>
<li><p>in the hairpin maze, although the two-dimensional periodicity of the grid was lost, firing positions were highly correlated across arms, especially among all even-numbered arms or all odd-numbered arms.</p>
</li>
</ul>
<h2 id="figure-2">Figure 2</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5d85b213-3f85-4794-b487-74f499ee58ac/image.png" alt=""></p>
<ul>
<li>population analysis for a single cell ensemble.</li>
<li>they showed repeating submaps and high correlation between neural activity in arms traversed in the same direction, while correlations were low for arms traversed in the opposite direction. </li>
<li>grid cell representation segmentation (fragmentation)</li>
</ul>
<h2 id="figure-3">Figure 3</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/9cdf45ed-6454-4c48-a40d-28601eab14b5/image.png" alt=""></p>
<ul>
<li>population analysis for all trials and all rats</li>
<li>a,b: population vector correlation matrix (in one rat)</li>
<li>c,d: all rats</li>
<li>grid representations don&#39;t encode a simple spatial coordinate but direction-specific, visual-input-dependent maps</li>
</ul>
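<p>As a sketch of this analysis (toy data of my own, not the paper's recordings): stack the firing-rate maps into a cells &times; bins matrix and correlate the population vectors across spatial bins.</p>

```python
import numpy as np

# Population-vector correlation sketch: each column of `rates` is the
# population activity vector at one spatial bin.
rng = np.random.default_rng(0)
n_cells, n_bins = 30, 40
rates = rng.random((n_cells, n_bins))   # toy firing-rate maps (cells x bins)
corr = np.corrcoef(rates.T)             # (n_bins, n_bins) correlation matrix
# Repeating submaps would appear as off-diagonal bands of high correlation
# between arms traversed in the same running direction.
```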
<h2 id="figure-4">Figure 4</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d49656b8-0d04-47d1-976b-4e79c25e8742/image.png" alt=""></p>
<ul>
<li>representations were reset near the turning points. </li>
<li>correlation values are high between adjacent bins in the population vector, but drop around the turning points of the hairpin (the start and end of each corridor). </li>
<li>in c, they tested shuffled data, where correlation values stayed high across turning points, indicating that the observed drop is not a statistical artifact of noise</li>
</ul>
<h2 id="figure-5">Figure 5</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d86baa01-9fd8-488d-9e9c-078da212a499/image.png" alt=""></p>
<ul>
<li>shortcut experiments (most interesting results!)</li>
<li>they intentionally truncated one specific arm.</li>
<li>on truncated arms, the correlation with reference arms was high when the shift was small (near the starting point) or large (near the end point), but low at intermediate shift values &lt;-- the cells are confused about where they are. </li>
</ul>
<h2 id="figure-7">Figure 7</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/9b3c26dc-5dfb-4262-947b-17c7084c1fce/image.png" alt=""></p>
<ul>
<li>open field, hairpin, virtual hairpin, open field 2</li>
<li>in the OF, VH, and OF2 settings, the grid fields (spatial phase) were maintained, but in HP, realignment occurred (?) and the correlation values were low</li>
<li>the fact that such discontinuities were absent in the virtual hairpin task suggests that the realignments were imposed by the physical structure of the task.</li>
<li>in the transparent wall condition, grid realignment between compartments still occurred, indicating that segmentation was not solely due to visual occlusion. </li>
</ul>
<h2 id="discussion">Discussion</h2>
<ul>
<li>fragmentation of grid cells: firing patterns of grid cells were repeatedly similar within each corridor, but sharply disconnected at turning points where the direction changed --&gt; suggesting that grid maps were fragmented into &quot;submaps&quot;</li>
<li>similar to place cells.</li>
<li>fragmentation was not seen in open spaces or in the virtual hairpin without physical walls, even though the trajectory was the same as in the hairpin maze.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Probabilistic Machine Learning] Hidden Markov Model (forward-backward algorithm)]]></title>
            <link>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Hidden-Markov-Model</link>
            <guid>https://velog.io/@jaeheon-lee/Probabilistic-Machine-Learning-Hidden-Markov-Model</guid>
            <pubDate>Sun, 23 Feb 2025 09:08:37 GMT</pubDate>
            <description><![CDATA[<p>study material: <a href="https://www.youtube.com/watch?v=7zDARfKVm7s">https://www.youtube.com/watch?v=7zDARfKVm7s</a> (thank you!)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7959cdad-1be1-4c1e-b546-12b20eb170a3/image.jpg" alt=""></p>
<p>HMM (forward-backward algorithm) implementation</p>
<ul>
<li>purpose: compute the posterior over the hidden state at each time step given all observations, {$P(z_k|x_{1:n})$} $\forall k=1,2,\dots,n$</li>
</ul>
<pre><code class="language-python">import torch

def forward_backward(log_likelihoods, trans):
    # log_likelihoods: (T, K) with entries log p(x_t | z_t = k).
    # trans: (K, K) row-stochastic, trans[i, j] = p(z_{t+1} = j | z_t = i).
    # A uniform initial state distribution is assumed.
    T, K = log_likelihoods.shape
    log_trans = torch.log(trans + 1e-10)

    # Forward pass: alpha[t, j] = log p(x_{1:t}, z_t = j).
    alpha = torch.zeros(T, K)
    alpha[0] = log_likelihoods[0] - torch.log(torch.tensor(float(K)))
    for t in range(1, T):
        alpha[t] = log_likelihoods[t] + torch.logsumexp(
            alpha[t - 1].unsqueeze(1) + log_trans, dim=0)

    # Backward pass: beta[t, i] = log p(x_{t+1:T} | z_t = i).
    beta = torch.zeros(T, K)
    for t in range(T - 2, -1, -1):
        beta[t] = torch.logsumexp(
            log_trans + log_likelihoods[t + 1] + beta[t + 1], dim=1)

    # Smoothed marginals gamma[t, k] = P(z_t = k | x_{1:T});
    # softmax over log alpha + log beta normalizes in one step.
    gamma = torch.softmax(alpha + beta, dim=1)

    # Pairwise marginals xi[t, i, j] = P(z_t = i, z_{t+1} = j | x_{1:T}),
    # normalized jointly over both state indices.
    xi = torch.zeros(T - 1, K, K)
    for t in range(T - 1):
        log_xi = (alpha[t].unsqueeze(1) + log_trans +
                  log_likelihoods[t + 1].unsqueeze(0) + beta[t + 1].unsqueeze(0))
        xi[t] = torch.softmax(log_xi.flatten(), dim=0).reshape(K, K)

    return gamma, xi
</code></pre>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Emergence-of-Hidden-Capabilities-Exploring-Learning-Dynamics-in-Concept-Space</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Emergence-of-Hidden-Capabilities-Exploring-Learning-Dynamics-in-Concept-Space</guid>
            <pubDate>Thu, 06 Feb 2025 14:00:14 GMT</pubDate>
            <description><![CDATA[<h1 id="emergence-of-hidden-capabilities-exploring-learning-dynamics-in-concept-space">Emergence of Hidden Capabilities: Exploring Learning Dynamics in Concept Space</h1>
<p>Using a generative model, this paper proposes a concept space to evaluate a model&#39;s learning dynamics. Each abstract coordinate of this space corresponds to a specific concept, such as size, color, or background color. </p>
<ul>
<li>introducing concept space</li>
<li>concept signal dictates speed of learning</li>
<li>sudden transition in concept learning</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/37730432-7f5a-4fc4-bd27-ee5cb805deca/image.png" alt=""></p>
<h2 id="concept-space-a-framework-for-analyzing-concept-learning-dynamics">Concept Space: A Framework for Analyzing Concept Learning Dynamics</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/8b96243c-6457-41f1-b4f0-026a030f532a/image.png" alt=""></p>
<p>z is the full high-dimensional latent representation in the concept space, while h is the observed input derived from z with some components masked or reduced.</p>
<p>z represents the complete underlying structure for data generation, and h is a partial view provided as input for the model to learn. The model uses h to infer or align with the full representation z.</p>
<p>ex) z contains actual size, shape, background &quot;representations&quot; 
ex) h contains low dimensional and (sometimes) masked representations like 01, 10, 11</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/312c3a2e-679a-4c1b-93e1-956409eb9515/image.png" alt=""></p>
<p>In this paper, they often mention the strength of &quot;signal&quot;.
(left) color separation is stronger than size separation.
(right) size separation is stronger than color separation.</p>
<h2 id="experimental-and-evaluation-setup">Experimental and Evaluation Setup</h2>
<ul>
<li>experiments were conducted in a 2D concept space using size and color variables, intentionally excluding one combination that could only be generated by composing concepts. </li>
<li>00 (large red circles), 01 (large blue circles), and 10 (small red circles) (train) -&gt; 11 (small blue circles) <strong>OOD</strong></li>
<li>followed disentangled representation learning. (diffusion, U-Net)</li>
</ul>
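<p>The train/OOD split above can be written out explicitly (a trivial sketch of the setup; the labels are mine):</p>

```python
# Concept classes as (size, color) pairs; class 11 is held out as OOD.
classes = {"00": ("large", "red"), "01": ("large", "blue"),
           "10": ("small", "red"), "11": ("small", "blue")}
train = {k: v for k, v in classes.items() if k != "11"}
ood = {k: v for k, v in classes.items() if k == "11"}
```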
<h2 id="concept-signal-determines-learning-speed">Concept Signal Determines Learning Speed</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3c905fbf-01ca-4eda-89ef-2382110b38ab/image.png" alt=""></p>
<ul>
<li>RGB contrast for color and the size difference --&gt; controlled concept signal strength</li>
<li>speed of learning definition: inverse of the number of gradient steps required to reach 80% accuracy for class 11 (OOD)</li>
<li>concept signal determines the speed at which individual concepts are learned</li>
</ul>
<h2 id="concept-signal-governs-generalization-dynamics">Concept Signal Governs Generalization Dynamics</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3cadc431-b633-44b1-9126-4aae7d3343f1/image.png" alt=""></p>
<ul>
<li>color concept signal is the relative strength of the color signal compared with the size signal</li>
<li>for in-distribution generalization on class 00, the trajectory converges to 00 regardless of the color concept signal.</li>
<li>however, in OOD generalization, trajectories first move toward an in-distribution concept like 01 or 10, depending on the color concept signal, and then suddenly turn toward 11. </li>
<li><strong>concept memorization</strong></li>
<li>a strong imbalance of concept signal --&gt; strongly biased trajectories (look at the blue trajectory in (b))</li>
<li>this highlights a potential problem with early stopping. </li>
</ul>
<h2 id="towards-a-landscape-theory-of-learning-dynamics">Towards a Landscape Theory of Learning Dynamics</h2>
<p>they tried to explain the phenomenology of learning dynamics in concept space using analytical curves (dynamics equation below)</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/1f367f02-3477-4cef-8275-ea6021d92d5a/image.png" alt=""></p>
<ul>
<li>where $\hat{z}$ is the target point, $\tilde{z}$ is the initially biased target, and $\sigma$ is the signal strength ($\sigma_1$: color concept signal, $\sigma_2$: size concept signal)</li>
<li>($\hat{z}_1$, $\hat{z}_2$) = (0, 1) indicates the representation of small size and red color. when $\sigma_2$ is larger than $\sigma_1$ (i.e., the trajectory colored blue), the trajectory is biased towards (0, 1)</li>
<li>this equation uses a $1 + \exp(-(t - \hat{t}))$ factor, which closely resembles memory decay.</li>
</ul>
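<p>A toy numerical sketch of this two-stage behavior (the functional form here is my assumption based on the $1 + \exp(-(t - \hat{t}))$ factor; it is not the paper's exact equation):</p>

```python
import numpy as np

# Toy sketch: each concept coordinate follows a sigmoidal transition
# 1 / (1 + exp(-(t - t_hat))), where a stronger signal gives an earlier
# transition time t_hat. The trajectory starts at the biased ID target
# z_tilde and moves to the OOD target z_hat.
def trajectory(t, z_hat, z_tilde, t_hat1, t_hat2):
    s1 = 1.0 / (1.0 + np.exp(-(t - t_hat1)))   # first concept learned
    s2 = 1.0 / (1.0 + np.exp(-(t - t_hat2)))   # second concept learned later
    z1 = z_tilde[0] + (z_hat[0] - z_tilde[0]) * s1
    z2 = z_tilde[1] + (z_hat[1] - z_tilde[1]) * s2
    return z1, z2

t = np.linspace(0, 20, 200)
z1, z2 = trajectory(t, z_hat=(1.0, 1.0), z_tilde=(1.0, 0.0),
                    t_hat1=5.0, t_hat2=12.0)
# Early on the trajectory sits near the ID point (1, 0); after t_hat2 it
# turns toward the OOD point (1, 1), reproducing the two-stage shape.
```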
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/15e5d877-62e3-4fff-8860-c641ed99d7d9/image.png" alt=""></p>
<p>based on this framework, they derived the energy function above. $a$ is the difference between $\sigma_1$ and $\sigma_2$, and $\hat{t}_1$ and $\hat{t}_2$ are the times at which the concepts $z_1$ and $z_2$ are learned. </p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3280b8ab-fa35-4e1b-a2e4-d7134aaa0352/image.png" alt=""></p>
<ul>
<li><p>this figure illustrates the simulated trajectories for classes 00 (a) and 11 (b), which closely match the earlier ID / OOD generalization figure.</p>
</li>
<li><p>the network&#39;s learning dynamics can be decomposed into two stages: <strong>first biased toward the ID sample, then suddenly moving to the OOD sample.</strong> --&gt; a phase change underlies this decomposition, at which the model acquires the capability to alter concepts</p>
</li>
</ul>
<h2 id="sudden-transitions-in-concept-learning-spaces">Sudden Transitions in Concept Learning Spaces</h2>
<ul>
<li>at the point of departure (phase change), the model has already learned to manipulate concepts, causing a shift in its learning trajectory.</li>
<li>however, with naive prompting, the model fails to elicit these learned capabilities, making it seem incomplete.</li>
<li>to handle this insufficiency, they use two new prompting methods.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/08fc3012-48a3-4fb6-9756-f205fdeff3d1/image.png" alt=""></p>
<p>about linear latent intervention : </p>
<ol>
<li>the model first identifies a vector representing a specific concept, such as &quot;blue,&quot; by transforming the corresponding latent concept vector into the model&#39;s latent space. </li>
<li>Similarly, a vector for &quot;large size&quot; is created. The model then modifies the original concept vector by adjusting specific components to enhance or suppress certain attributes. </li>
<li>Increasing the weight of the blue concept strengthens its influence, while increasing the weight of the large-size concept reduces its effect.</li>
</ol>
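<p>The three steps above amount to a simple vector edit in latent space. A sketch with made-up concept directions (v_blue, v_large, alpha, and beta are illustrative placeholders, not the paper's values):</p>

```python
import numpy as np

# Linear latent intervention sketch: strengthen the "blue" direction and
# suppress the "large size" direction in the prompt's latent code.
rng = np.random.default_rng(1)
D = 8
v_prompt = rng.standard_normal(D)          # latent code of the naive prompt

v_blue = np.zeros(D)                       # hypothetical "blue" direction
v_blue[0] = 1.0
v_large = np.zeros(D)                      # hypothetical "large size" direction
v_large[1] = 1.0

alpha, beta = 2.0, 1.5                     # intervention strengths (hyperparameters)
v_edit = v_prompt + alpha * v_blue - beta * v_large
```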
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/36fbf1a5-745e-466c-a279-88fee6f507e9/image.png" alt=""></p>
<p>alpha, beta: hyperparameters.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/71fe5618-4c04-439c-aac1-3327eb0d73eb/image.png" alt=""></p>
<p>analyzes the second-from-left curve (green), using checkpoints and two prompting methods to generate class 11 (blue, small).</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/688f2428-7b87-4d35-b1d8-c35e7c55a234/image.png" alt=""></p>
<p>(a) naive input prompting: sometimes failed
(b, c) much faster: around 6,000 gradient steps. </p>
<ul>
<li>this shows that the model can actually manipulate the concepts and has the capacity for OOD generalization.</li>
<li>also, with every prompting method, there were sudden turns where the training trajectory abruptly changes direction. In this period, the model &quot;activates&quot; its ability to manipulate the concept space. (??)</li>
</ul>
<h2 id="effect-of-underspecification-on-learning-dynamics-in-concept-space">Effect of Underspecification on Learning Dynamics in Concept Space</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/f7a3935b-6437-45ac-8202-979b5dbe8269/image.png" alt=""></p>
<ul>
<li>in sota generative models, when a yellow strawberry is requested, the model actually generates a red strawberry. </li>
<li>similarly, they randomly select training samples with a specific combination of shape and color (like red triangles), mask the token representing the color (red), and train the model on three concept classes (00, 01, 10).</li>
</ul>
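<p>The masking procedure can be sketched as follows (my toy version; the real setup masks tokens in the training prompts of a generative model):</p>

```python
import random

# Underspecification sketch: mask the color token in a fraction of the
# prompts for one (shape, color) class before training.
random.seed(0)
mask_frac = 0.5
prompts = [("triangle", "red")] * 100
masked = [(shape, None) if random.random() < mask_frac else (shape, color)
          for shape, color in prompts]
n_masked = sum(1 for _, color in masked if color is None)
```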
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/953a2168-0252-4acb-8acc-6f8f52165a4b/image.png" alt=""></p>
<ul>
<li>the number of gradient steps required to reach accuracy 0.8 increases with the percentage of masked prompts.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/e2991870-1020-431b-b8d9-7ee067e3592a/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/4c46acb9-2580-4b3f-a295-5e2e6d78b1c3/image.png" alt=""></p>
<ul>
<li>underspecification delays and hinders OOD generalization</li>
<li>toy model also exhibits similar plots</li>
</ul>
<p>very interesting paper</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Approximation-and-Optimization-Theory-for-Linear-Continuous-Time-Recurrent-Neural-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Approximation-and-Optimization-Theory-for-Linear-Continuous-Time-Recurrent-Neural-Networks</guid>
            <pubDate>Thu, 30 Jan 2025 08:03:41 GMT</pubDate>
            <description><![CDATA[<h1 id="approximation-and-optimization-theory-for-linear-continuous-time-recurrent-neural-networks">Approximation and Optimization Theory for Linear Continuous-Time Recurrent Neural Networks</h1>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/a947af26-1824-424c-b66c-32c83eef94a7/image.png" alt=""></p>
<h2 id="intro">Intro</h2>
<p><strong>This is a summary of selected parts from the paper, focused on understanding the &quot;curse of memory,&quot; rather than a review of the entire content.</strong></p>
<p>This paper addresses whether continuous-time RNNs can approximate time-dependent functionals $$H_t(x)$$, which map input signals $$x(t)$$ to outputs $$y(t)$$. Unlike prior studies, this work emphasizes:</p>
<ul>
<li>The &quot;absence&quot; of underlying dynamical systems for $$H_t$$.</li>
<li>The necessity of memory decay for approximation.</li>
</ul>
<h2 id="problem-formulation">Problem Formulation</h2>
<ul>
<li><p>Initially described in a discrete setting with equations (1) and (2), the discussion transitions into the continuous setting.</p>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/3638c1f3-d992-47a0-972f-f33562bd4d3a/image.png" alt=""></p>
</li>
<li><p>The continuous RNN dynamics are expressed in equation (17), leading to the representation in equation (18).</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/78f66742-798c-4c3b-a99d-d0dac5951b24/image.png" alt=""></p>
<h2 id="universal-approximation-theorem-theorem-7">Universal Approximation Theorem (Theorem 7)</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/0c7d71f9-2412-42df-9341-57b9491945f8/image.png" alt=""></p>
<ul>
<li><p>Using the Riesz-Markov-Kakutani theorem, the existence of $$H_t$$ is shown through its unique association with a measure $$\mu_t$$.</p>
</li>
<li><p>The kernel representation $$H_t(x) = \int_0^\infty x^\top_{t-s} \rho(s) \, ds$$ (equation (23)) is central, where <strong>$$\rho$$ dictates smoothness and decay properties of input-output relationships --&gt; convolution.</strong></p>
</li>
<li><p>equation (23) underscores kernel $$\rho(t)$$&#39;s role in input-output convolution </p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/96e19deb-81ed-4643-83b8-26b2191e064a/image.png" alt=""></p>
<ul>
<li>The quality of approximation depends on how well $$\rho(t)$$ can be represented by exponential sums.</li>
</ul>
<ul>
<li>The eigenvalues of $$W$$ with $$\text{Re}(\lambda) &lt; 0$$ ensure system stability.</li>
</ul>
<h2 id="approximation-rates-theorem-10-and-inverse-approximation-theorem-theorem-11">Approximation Rates (Theorem 10) and Inverse Approximation Theorem (Theorem 11)</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/59e083bc-61f0-44e2-bc7d-efc58d05c489/image.png" alt=""></p>
<ul>
<li>I couldn&#39;t understand the full proof process... </li>
<li>(yet,) this provides bounds for approximating functionals under smoothness ($$\alpha$$) and decay ($$\beta$$) conditions, leading to approximations via width-$$m$$ RNNs with bounded error rates (equation (27)).</li>
<li>In other words, although it may seem complex, the function is $$\alpha$$-smooth (continuous and differentiable), and its derivatives are controlled by a decaying rate of $$\beta$$ as in (25). When the decaying rate and the smoothness are related as shown in (26), width-$$m$$ RNN functionals $$H_t$$ <strong>can approximate the target</strong> under these bounds.</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/5f212a2f-573f-4290-8d0c-07ad7addafae/image.png" alt=""></p>
<ul>
<li>demonstrates that without memory decay, approximation is infeasible. (amazing)</li>
<li>again, I gave up on understanding the proof haha..</li>
</ul>
<h2 id="curse-of-memory">Curse of Memory</h2>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/00700610-2327-41cd-bf4c-d131ec4eb2f6/image.png" alt=""></p>
<ul>
<li>When $$\rho(t)$$ decays slowly (e.g., $$\sim t^{-(1/\omega)}$$), RNNs face exponential model-size growth to maintain accuracy.</li>
</ul>
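<p>The point can be illustrated numerically (my sketch, not the paper's construction): a width-$$m$$ linear RNN realizes kernels that are sums of $$m$$ decaying exponentials, and fitting a slowly decaying power-law kernel gets much easier as $$m$$ grows.</p>

```python
import numpy as np

# Sketch: approximate a slowly decaying (power-law) kernel rho(t) by a sum
# of m decaying exponentials c_k * exp(-lambda_k t), the class of kernels a
# width-m linear RNN can realize. The rate grid is my arbitrary choice.
t = np.linspace(0.1, 10, 200)
target = t ** -0.5                       # slow, power-law memory decay

def exp_sum_error(t, target, m):
    rates = np.logspace(-1, 1, m)        # fixed decay rates lambda_k
    basis = np.exp(-np.outer(t, rates))  # (len(t), m) exponential basis
    coef, *_ = np.linalg.lstsq(basis, target, rcond=None)
    return np.linalg.norm(basis @ coef - target)

err_small = exp_sum_error(t, target, 2)
err_large = exp_sum_error(t, target, 12)
# err_large is much smaller: tracking slow decay needs many exponentials
# (a wider RNN), illustrating the "curse of memory".
```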
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Understanding and Controlling Memory in Recurrent Neural Networks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Understanding-and-Controlling-Memory-in-Recurrent-Neural-Networks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Understanding-and-Controlling-Memory-in-Recurrent-Neural-Networks</guid>
            <pubDate>Sat, 25 Jan 2025 12:53:36 GMT</pubDate>
            <description><![CDATA[<h1 id="understanding-and-controlling-memory-in-recurrent-neural-networks">Understanding and Controlling Memory in Recurrent Neural Networks</h1>
<ul>
<li><p>2019, curriculum learning, memory, slow/flexible point</p>
<ul>
<li>difference from our setting is that 
  they are using pre-defined task transition timing and generating each batch data for every time step (??) → I&#39;m scared, wondering if this is what I should have done too.</li>
</ul>
</li>
<li><p>Task definition (look at figureA below)</p>
<ul>
<li>whole time step: 140,000. In every time step, new data is generated and trained.</li>
<li>Input: <ul>
<li>MNIST (or CIFAR-10) appears at a specific time t_s, and noise images at the remaining time steps.</li>
<li>stimulus (t_s) and response (t_a) are chosen randomly for each trial, ensuring that t_a occurs at least 4 time steps after t_s. (t_a &gt; t_s + 4)</li>
<li>the total length of the input time steps for each trial is capped at T_max = 20, meaning the network must process inputs and formulate a response within a maximum of 20 time steps. </li>
</ul>
</li>
<li>Output: <ul>
<li>the network should output a &quot;null&quot; label at all times except during t_a;
  therefore, only the response at t_a is included in the loss calculation!
  → this is implemented by introducing a new variable z in the code…. (I&#39;ve been struggling with this for an hour)</li>
<li>the output label at t_a depends on the stimulus at t_s
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737612694517_image.png" alt=""></li>
</ul>
</li>
</ul>
</li>
<li><p>Model</p>
<ul>
<li>GRU and LSTM for RNN</li>
<li>as in B above, when using CIFAR-10, CNN weights are used together</li>
</ul>
</li>
</ul>
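<p>The loss masking described in the task definition (only the response at t_a counts) can be sketched like this; the variable names, including z, follow the notes above, but the details are my guess at the implementation:</p>

```python
import torch
import torch.nn.functional as F

# Masked loss sketch: the network must output a "null" class except at the
# response time t_a, and only the prediction at t_a enters the loss.
T, K = 20, 11                    # T_max time steps, 10 classes + "null"
logits = torch.randn(T, K)       # per-step outputs of the RNN (toy values)
t_a, label = 9, 3                # response time and stimulus-dependent label

z = torch.zeros(T)               # z marks which steps contribute to the loss
z[t_a] = 1.0

targets = torch.full((T,), K - 1)    # "null" everywhere...
targets[t_a] = label                 # ...except the response step

per_step = F.cross_entropy(logits, targets, reduction="none")
loss = (z * per_step).sum() / z.sum()   # equals the loss at t_a alone
```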
<ul>
<li><p>Training Protocol - Two different Curriculum Learning 
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737610899099_image.png" alt=""></p>
<ul>
<li>VoCu:<ul>
<li><a href="https://dl.acm.org/doi/abs/10.1145/1553374.1553380">https://dl.acm.org/doi/abs/10.1145/1553374.1553380</a> Bengio Curriculum Learning 2009</li>
<li>gradually increase the number of categories 3→4→…→10 (or more)</li>
</ul>
</li>
<li>DeCu:<ul>
<li><a href="https://doi.org/10.1142/S0218488598000094">https://doi.org/10.1142/S0218488598000094</a> Sepp Hochreiter 1998, vanishing gradients in RNNs</li>
<li>gradually increase the number of T_max 6→8→…→20</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>Extrapolation ability for each diff training protocol
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737805034830_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737805080956_image.png" alt=""></p>
<ul>
<li>retrieval accuracy: DeCu was better than VoCu. </li>
<li>VoCu rapidly formed each cloud, but was unstable with higher variability</li>
<li>DeCu was relatively slow, but showed good convergence toward each fixed point</li>
</ul>
</li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737806541867_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737806953829_image.png" alt=""></p>
<pre><code>- not well described, but red dots (faster, lower acc) = VoCu, blue dots (slower, higher acc) = DeCu
- speed and accuracy have a negative correlation.
- (right) code for finding slow points: just find local minima of “speed”</code></pre><ul>
<li><p>Formation of Slow Points - WHY protocols differ</p>
<ul>
<li><p>in VoCu</p>
<ul>
<li>8:2 means: at training step 8000, a new class is introduced. (MaxClasses[0:8000] = 3)</li>
<li>jumps are observed.</li>
</ul>
</li>
<li><p>in DeCu</p>
<ul>
<li>20:8 means: at training step 20000, the delay time is increased to 8. (T_MAX_VEC[20000:30000] = 8)</li>
<li>relatively gradual changes are observed.
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808137092_image.png" alt=""></li>
</ul>
</li>
<li><p>(C) </p>
<ul>
<li>In VoCu, with Backtracking procedure, they found that “the new class is assigned to an existing slow point”. → related to shared attractors (?)</li>
</ul>
</li>
<li><p>(D) history affect performance?</p>
<ul>
<li>in (c), class 8 -orange- originated from class 5 -green-.</li>
<li>class 5 performance was impaired more than other existing classes following the introduction of class 8. (look at thick green and orange colored line)</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808449373_image.png" alt=""></p>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737808565909_image.png" alt=""></p>
<ul>
<li>Improving Long Term Memory<ul>
<li>slow → good</li>
<li>new regularization with hidden state speed </li>
</ul>
</li>
</ul>
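<p>The speed-based regularization can be sketched as an extra loss term on the hidden-state trajectory (my formulation of the idea; the paper's exact penalty may differ):</p>

```python
import torch

# Speed regularizer sketch: penalize the hidden-state "speed"
# ||h_{t+1} - h_t||^2 so trajectories settle near slow points,
# which should improve long-term memory.
def speed_penalty(hidden):               # hidden: (T, N) trajectory
    diffs = hidden[1:] - hidden[:-1]
    return (diffs ** 2).sum(dim=1).mean()

h = torch.cumsum(torch.ones(5, 3) * 0.1, dim=0)   # slowly drifting states
reg = speed_penalty(h)
# total_loss = task_loss + lambda_reg * reg   (lambda_reg a hyperparameter)
```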
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737809358041_image.png" alt=""></p>
<pre><code>   - slow points can move → costly
   - therefore they used center of mass of each class instead of slow points for each class.</code></pre><p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1737809366943_image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] A complexity-based theory of compositionality]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-A-complexity-based-theory-of-compositionality</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-A-complexity-based-theory-of-compositionality</guid>
            <pubDate>Tue, 29 Oct 2024 23:20:02 GMT</pubDate>
            <description><![CDATA[<h1 id="a-complexity-based-theory-of-compositionality">A complexity-based theory of compositionality</h1>
<p>Compositionality is fundamental to intelligence, especially in humans: the structure of thought, language, and higher-level reasoning. However, there&#39;s no measurable, mathematical definition of it, so the authors tried to quantify it with algorithmic information theory, namely Kolmogorov complexity.</p>
<p>Definition 1 (compositionality)</p>
<ul>
<li>existence of a symbolic complex expression -&gt; where are these expressions stored? in humans, they&#39;re stored in &quot;language&quot; or other kinds of expression, but in a neural network they can only exist as some combination of representations.</li>
<li>similarly, the structure of the expression (roughly, the language) is also unclear</li>
<li>in language, there&#39;s no limit on the mapping between a sentence and its semantics.</li>
<li>also, the semantics vary with the context in which the sentence is located.</li>
</ul>
<p>compressing a representation, Kolmogorov complexity</p>
<ul>
<li>through the lens of optimal compression and Kolmogorov complexity</li>
<li>Kolmogorov complexity defines a notion of information quantity:
  the length of the shortest program, in some programming language, that outputs the object
  ex) KC of 101010: length(&quot;repeat 10, 3 times&quot;)
  : the more structure or pattern an object has, the smaller its Kolmogorov complexity</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/be9faf49-d757-43d4-93d8-b9414f39b46d/image.png" alt=""></p>
<ul>
<li>in the context of ML, given a dataset $X = (x_1, \dots, x_n)$, sufficiently large and drawn i.i.d. from a distribution $p(x)$</li>
<li>the optimal method for compressing: decompose $K(X) = K(X|p) + K(p)$</li>
<li>then $K(X|p)$ can be optimally encoded using only $-\log_2 p(x_i)$ bits per sample (Shannon information)</li>
<li>the second term $K(p)$ refers to the complexity of the data distribution </li>
</ul>
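<p>A quick numerical check of the K(X|p) term (my own sketch, not the paper&#39;s code): the optimal code length of i.i.d. samples, the sum of $-\log_2 p(x_i)$, concentrates around n times the entropy of p:</p>

```python
import math
import random

random.seed(0)
p = {"a": 0.5, "b": 0.25, "c": 0.25}
n = 10_000
xs = random.choices(list(p), weights=list(p.values()), k=n)

# K(X|p): total Shannon code length of the samples, given the distribution p
bits = sum(-math.log2(p[x]) for x in xs)

# entropy H(p) = 1.5 bits/symbol for this particular p
entropy = -sum(q * math.log2(q) for q in p.values())

print(bits / n, entropy)  # per-sample code length is close to H(p)
```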
<p>Compressing $Z$ as a function of parts</p>
<ul>
<li>denote a representation by a matrix $Z \in \mathbb{R}^{N \times D}$ where each row $z_n$ is a D-dimensional vector drawn from $p(z|x)p(x)$.</li>
<li>describe the representation using short parts-based constituent sentences $W \in \mathcal{V}^{N \times M}$, where $\mathcal{V}$ is a finite vocabulary of discrete symbols and M is the maximum sentence length.</li>
<li>let $p_w(w)$ be the distribution under which the sentences take their most compressed form</li>
<li>then we can write the following</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/d755907e-180f-49ec-967e-e98ad38864ca/image.png" alt=""></p>
<ul>
<li>then, to decode the representation back from its sentences, we need a new mapping $f:\mathcal{V}^{M} \to \mathbb{R}^D$: the semantics</li>
<li>one important thing about K(f): they model the decoding error with a Normal distribution, which introduces an additional error term induced by the Gaussian assumption: that is K(Z|W,f)</li>
</ul>
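<p>The Gaussian error term can be made concrete (a sketch under my own assumptions, not the authors&#39; code): encoding each residual $z - f(w)$ under a Normal model costs $-\log_2 \mathcal{N}(z; f(w), \sigma^2)$ bits, so a semantics f that fits Z better spends fewer bits in K(Z|W,f):</p>

```python
import numpy as np

def gaussian_bits(residuals: np.ndarray, sigma: float) -> float:
    """Code length (bits) of residuals under N(0, sigma^2),
    up to an additive quantization constant:
    -log2 N(r) = 0.5*log2(2*pi*sigma^2) + r^2 / (2*sigma^2*ln 2)."""
    return float(np.sum(0.5 * np.log2(2 * np.pi * sigma**2)
                        + residuals**2 / (2 * sigma**2 * np.log(2))))

rng = np.random.default_rng(0)
good_fit = rng.normal(0.0, 0.1, size=100)  # f explains Z well: small residuals
bad_fit = rng.normal(0.0, 2.0, size=100)   # f explains Z poorly: large residuals

print(gaussian_bits(good_fit, 1.0), gaussian_bits(bad_fit, 1.0))
```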
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/7a0611bb-1a60-4696-ad02-49cd1727b88a/image.png" alt=""></p>
<p>Summary and further intuition</p>
<ul>
<li>total Kolmogorov complexity of the representation:</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/f4a81ff2-a76d-4d59-8eb2-4324c8073e58/image.png" alt=""></p>
<ul>
<li>Representational compositionality: a formal definition of compositionality</li>
</ul>
<p><img src="https://velog.velcdn.com/images/jaeheon-lee/post/cdb65419-e2a1-4278-a259-e0e47a6c870a/image.png" alt=""></p>
<p>wip</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-The-Simplicity-Bias-in-Multi-Task-RNNs-Shared-Attractors-Reuse-of-Dynamics-and-Geometric-Representation</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-The-Simplicity-Bias-in-Multi-Task-RNNs-Shared-Attractors-Reuse-of-Dynamics-and-Geometric-Representation</guid>
            <pubDate>Tue, 22 Oct 2024 19:55:06 GMT</pubDate>
            <description><![CDATA[<h1 id="the-simplicity-bias-in-multi-task-rnns-shared-attractors-reuse-of-dynamics-and-geometric-representation">The Simplicity Bias in Multi-Task RNNs: Shared Attractors, Reuse of Dynamics, and Geometric Representation</h1>
<ul>
<li>How can a single neural population combine these disparate objects (dynamical motifs) into a joint representation? And under what conditions do these representations become shared or separate?</li>
<li>simplicity bias: the authors weren’t trying to quantify it; they were just pointing out the tendency.<ul>
<li>it persists regardless of the number of tasks or the nature of the dynamical objects
  (observed for various objects, including fixed points, limit cycles, and line/plane attractors)</li>
<li>they link this bias to the “sequential emergence” of attractors</li>
<li>they show how external factors can resist this bias and create more complex dynamical objects
  → these external factors are ‘architectural constraints’: gated, orthogonal, parallel</li>
</ul>
</li>
<li>They use “separate” input and output
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623204820_image.png" alt=""></li>
</ul>
<p><img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623233179_image.png" alt=""></p>
<ul>
<li><p>In this settings, they have</p>
<ul>
<li>gated setting: only one task at a time; no output required for the other task<ul>
<li>in the code, they simply ignore the loss from the other tasks</li>
</ul>
</li>
<li>orthogonal setting: only one task at a time; the other task&#39;s output must be zero</li>
<li>parallel setting: multiple tasks at a time; no constraint across tasks</li>
</ul>
</li>
<li><p>Compared to the gated setting, the orthogonal and parallel settings allow ‘interaction between tasks’ </p>
</li>
<li><p>task examples</p>
<ul>
<li>from the top: fixed point, limit cycles, line- and plane-attractors (left) and attractors (right)
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623582970_image.png" alt=""></li>
</ul>
</li>
<li><p>They trained two tasks in gated mode (sky blue) and orthogonal mode (blue) 
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729623667591_image.png" alt=""></p>
</li>
<li><p>Especially in gated mode, the tasks tend to share the attractors: the simplicity bias (why? see below)</p>
<ul>
<li>A: gated mode resulted in a complete overlap between the tasks (even for different tasks); the opposite held in the orthogonal setting</li>
<li>B: a linear classifier trained to separate the neural trajectories (of the 2 tasks) failed in gated, succeeded in ortho</li>
<li>C: ratio of variance between vs. within tasks (F-factor): shared in gated, separated in ortho</li>
<li>D: looking at the spectrum of the connectivity matrix (W_recurrent), i.e., the eigenvalues lambda<ul>
<li>the number of unstable eigenvalues was larger in the orthogonal setting</li>
<li>so in the orthogonal setting there are multiple fixed points! then how do they emerge?</li>
</ul>
</li>
</ul>
</li>
<li><p>The next question: what is the origin of this simplicity bias?</p>
</li>
<li><p>consider two tasks that need two fixed points each (4 points in total)</p>
<ul>
<li>In gated mode: the 2 attractors are shared across both tasks<ul>
<li>the recurrent dynamics cause the states to depend mostly on the “task-agnostic attractors” and less on the task-specific inputs</li>
</ul>
</li>
<li>In orthogonal mode: all 4 fixed points are different<ul>
<li>the two tasks&#39; attractors have to be orthogonal to each other, forcing the network to separate them</li>
<li>because the network architecture forces the other task&#39;s output to zero while one task is being trained</li>
</ul>
</li>
</ul>
</li>
<li><p>OK, let’s test this hypothesis by following the attractor landscape of (low-rank) networks [13]
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729625167922_image.png" alt=""></p>
</li>
<li><p>with this projection (approximation), they followed the evolution of the dynamics in parallel settings trained on two fixed-point tasks
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1729625527560_image.png" alt=""></p>
</li>
<li><p>in C and D, </p>
<ul>
<li>at epoch 0, only the origin is stable (both eigenvalues are inside the unit circle)</li>
<li>the origin destabilizes as a single unstable eigenvalue emerges</li>
<li>with more training, a second eigenvalue leaves the unit circle → a pair of stable fixed points emerges</li>
</ul>
</li>
<li><p>in B, they repeated this analysis (sky blue, blue, dark blue = gated, ortho, parallel)</p>
<ul>
<li>in the orthogonal and parallel modes, we can see the clear sequential emergence of two outliers</li>
<li>the gated setting is solved with a single outlier, as all tasks share a common attractor</li>
</ul>
</li>
</ul>
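<p>The eigenvalue-tracking analysis above can be mimicked in a few lines (a toy sketch with a hand-built matrix, not the paper&#39;s trained networks): count the eigenvalues of the recurrent matrix outside the unit circle, and note how adding rank-1 outliers mirrors the sequential emergence of attractors:</p>

```python
import numpy as np

def n_unstable(W: np.ndarray) -> int:
    """Eigenvalues of W outside the unit circle, i.e. unstable
    directions of the linearized discrete-time dynamics at the origin."""
    return int(np.sum(np.abs(np.linalg.eigvals(W)) > 1.0))

rng = np.random.default_rng(0)
N = 50
W = 0.1 * rng.normal(size=(N, N)) / np.sqrt(N)  # bulk spectrum well inside the unit circle

# two orthonormal directions for the rank-1 outliers
u, v = np.linalg.qr(rng.normal(size=(N, 2)))[0].T

W1 = W + 1.5 * np.outer(u, u)   # first outlier leaves the circle: first attractor pair
W2 = W1 + 1.3 * np.outer(v, v)  # second outlier: a second, separate attractor pair

print(n_unstable(W), n_unstable(W1), n_unstable(W2))  # 0, 1, 2
```

<p>In the gated setting a single outlier suffices because the attractor is shared; the orthogonal and parallel settings need one outlier per task.</p>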
<p>In discussion,</p>
<ul>
<li>reusing an existing representation leads to faster learning of a task ~ I think they wanted to emphasize the &quot;efficiency&quot; shown in the gated setting</li>
<li>if there&#39;s a modular representation, there could be some constraint from the architecture. good point! since in the typical multitask learning setting, the &#39;rule&#39; signals are processed separately at the head of the RNN.. </li>
<li>an attractor shared between tasks is not identical to an attractor formed in response to a single task, so if one task has an oscillatory component, it might suggest that the same circuit is also capable of generating such oscillations in another context (?!?)</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Flexible multitask computation in recurrent networks utilizes shared dynamical motifs]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs</guid>
            <pubDate>Mon, 16 Sep 2024 20:36:41 GMT</pubDate>
            <description><![CDATA[<h1 id="flexible-multitask-computation-in-recurrent-networks-utilizes-shared-dynamical-motifs">Flexible multitask computation in recurrent networks utilizes shared dynamical motifs</h1>
<ul>
<li><p><a href="https://github.dev/lauradriscoll/flexible_multitask">https://github.dev/lauradriscoll/flexible_multitask</a></p>
</li>
<li><p>set up 4 periods (context, stimulus, memory, response)
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726004022458_image.png" alt=""></p>
<ul>
<li>input: 1 (fixation) + 2 (modalities) × 2 (A sin(theta) and A cos(theta)) + 15 (rule)</li>
<li>output: 1 (fixation) + 1 (modality) × 2 (sin(phi) and cos(phi))</li>
</ul>
</li>
<li><p>fixed points and their stability, analyzed via the linearization matrix (Jacobian)</p>
<ul>
<li>used approximation <a href="https://joss.theoj.org/papers/10.21105/joss.01003">https://joss.theoj.org/papers/10.21105/joss.01003</a> tensorflow based</li>
</ul>
</li>
<li><p>single task network</p>
<ul>
<li>shared fixed points in the context and memory dynamics</li>
<li>during the stimulus period, center → ring movement</li>
<li>during the memory period, the ring attractor shrinks in the memory-PC subspace<ul>
<li>shown by introducing an interpolation parameter alpha</li>
</ul>
</li>
</ul>
</li>
<li><p>two task networks (MemoryPro, MemoryAnti)</p>
<ul>
<li>shared ring attractor &amp; two separate stable FP, one unstable FP<ul>
<li>by introducing ‘rule input’ interpolation alpha </li>
</ul>
</li>
</ul>
</li>
<li><p>task variance analysis - finding dynamical motifs</p>
<ul>
<li>cluster 2 - reaction-timed response, cluster 9 - memory-guided responses</li>
<li>figure 3a. - cluster C has block → that color is for response</li>
<li>in fig 6, the ‘response’ dynamics were disrupted when cluster-c units were lesioned
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006067423_image.png" alt=""></li>
</ul>
</li>
<li><p>exploiting these dynamical motifs (similar subspaces) with “rule interpolation”</p>
<ul>
<li>(PCA on memory state)</li>
<li>task period cluster 6, unit clusters t and u / 9,10 with a-d: shared point/ring attractor
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006430651_image.png" alt=""></li>
</ul>
</li>
<li><p>shared stimulus period dynamical motifs</p>
<ul>
<li>during the stimulus period, if the initial conditions are similar, so are the evolving trajectories.</li>
<li>during the rule (context) period, the network prepares its future pathways (trajectories)</li>
</ul>
</li>
<li><p>interpolation with rule inputs by alpha</p>
<ul>
<li>(PCA on stimulus state)</li>
<li>the “connection bridge” traced by alpha was orthogonal to the decision boundary</li>
<li>similar tasks share stable and unstable fixed points
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1726006896351_image.png" alt=""></li>
</ul>
</li>
<li><p>lesioning clustered RNN units - they found modular lesion effects on…</p>
<ul>
<li>c: delayed response; f: anti-response tasks; modality 1/2; t/u: category memory; a/b: continuous memory</li>
</ul>
</li>
<li><p>reusing dynamical motifs</p>
<ul>
<li>train on all tasks except MemoryAnti, then freeze everything except the input weights</li>
</ul>
</li>
</ul>
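<p>The fixed-point analysis above can be sketched in NumPy (a minimal stand-in for the TensorFlow-based fixed-point finder linked earlier; the dynamics and parameters here are toy assumptions): find a point where the RNN map returns its input, then check stability via the eigenvalues of the Jacobian:</p>

```python
import numpy as np

def rnn_step(h, W, b):
    """One step of a simple discrete-time RNN: h_{t+1} = tanh(W h_t + b)."""
    return np.tanh(W @ h + b)

def find_fixed_point(W, b, h0, iters=500):
    """Plain iteration of the map; converges when the dynamics are contracting.
    (The toolbox in the paper instead minimizes the speed q(h) = ||F(h) - h||^2.)"""
    h = h0
    for _ in range(iters):
        h = rnn_step(h, W, b)
    return h

def is_stable(h_star, W, b):
    """Linearize at h*: J = diag(1 - tanh^2(W h* + b)) W; stable if all |eig(J)| < 1."""
    J = (1.0 - np.tanh(W @ h_star + b) ** 2)[:, None] * W
    return bool(np.max(np.abs(np.linalg.eigvals(J))) < 1.0)

W = 0.5 * np.eye(2)           # toy recurrent weights (contraction)
b = np.array([1.0, -1.0])     # toy bias
h_star = find_fixed_point(W, b, np.zeros(2))

print(np.allclose(rnn_step(h_star, W, b), h_star, atol=1e-8), is_stable(h_star, W, b))
```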
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Task representations in neural networks trained to perform many cognitive tasks]]></title>
            <link>https://velog.io/@jaeheon-lee/Paper-Review-Task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks</link>
            <guid>https://velog.io/@jaeheon-lee/Paper-Review-Task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks</guid>
            <pubDate>Mon, 16 Sep 2024 20:34:22 GMT</pubDate>
            <description><![CDATA[<h1 id="task-representations-in-neural-networks-trained-to-perform-many-cognitive-tasks">Task representations in neural networks trained to perform many cognitive tasks</h1>
<ul>
<li><p>task representation = RNN weights when performing specific task (?)</p>
</li>
<li><p>task explanation and codes are in <a href="https://github.com/gyyang/multitask/blob/master/task.py">https://github.com/gyyang/multitask/blob/master/task.py</a></p>
</li>
<li><p>tasks at a glance - there are two modalities of stimulus
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1725477155269_image.png" alt="">
<img src="https://paper-attachments.dropboxusercontent.com/s_913ABF8DA3F73C73175C8BB8FFD1ADB54F703D0990295D1C06265F8B80DC0619_1725477072553_image.png" alt=""></p>
</li>
<li><p>learn multiple tasks “sequentially” with continual-learning technique</p>
<ul>
<li>if there’s some ‘state’ in each task, would there be replay memory?</li>
<li>are the results biased by the “sequence” (order) of the tasks?<ul>
<li>if tested, it would take all permutations of all the tasks…</li>
</ul>
</li>
</ul>
</li>
<li><p>RNN model: just one hidden layer &amp; non-negative (with 256 units)</p>
<ul>
<li>and calculated task variance. also clustering across units.<ul>
<li>I didn’t read much about how they did the clustering and cleaned out the noise</li>
</ul>
</li>
</ul>
</li>
<li><p>32-unit ring (direction-specific recurrent units)</p>
<ul>
<li>how is it possible to force them to respond to specific directions?</li>
</ul>
</li>
<li><p>(just came out) a paper using fMRI</p>
<ul>
<li>complexity defined as the number of choices (but within the same task family, unlike us)</li>
<li>for more complex tasks → people tend to do model-based thinking </li>
<li><a href="https://www.nature.com/articles/s41467-019-13632-1">https://www.nature.com/articles/s41467-019-13632-1</a></li>
</ul>
</li>
<li><p>in line with above paper, I think we should think about each task’s complexity</p>
<ul>
<li>what amount of information (or how many dimensions) would be sufficient for the correct answer?</li>
<li>‘amount’ and ‘dimension’ → complexity (how do they combine?)</li>
</ul>
</li>
<li><p>task vector</p>
<ul>
<li>why do they use only the “steady-state” response across the stimulus period?</li>
<li>this way they can’t capture the task variance (the sensitive unit activity…)</li>
<li>how about trying DSA like <a href="https://github.com/mitchellostrow/DSA">https://github.com/mitchellostrow/DSA</a></li>
<li>I don’t know</li>
<li>how to combine the ‘task variance’ to this?</li>
</ul>
</li>
<li><p>task vector 2</p>
<ul>
<li>though it’s cool that there could exist an algebraic form of compositional representation</li>
</ul>
</li>
<li><p>rule compositionality (combination of rule inputs!!!)</p>
<ul>
<li>most important part of this paper</li>
<li>Dly Anti task = Anti + (Dly Go - Go) </li>
<li>but failed in DMS != DMC + DNMS - DNMC</li>
<li>can we use multiplication or convolution instead? <a href="https://neuroai.neuromatch.io/tutorials/W2D2_NeuroSymbolicMethods/student/W2D2_Tutorial2.html">https://neuroai.neuromatch.io/tutorials/W2D2_NeuroSymbolicMethods/student/W2D2_Tutorial2.html</a></li>
</ul>
</li>
<li><p>a bit of continual learning</p>
<ul>
<li>Dly Go → Ctx Dly DM1, Ctx Dly DM2 ----- forgets Dly Go</li>
<li>they added a “penalty for deviations of important synaptic weights” ~ a regularizer</li>
</ul>
</li>
<li><p>I didn’t look much into </p>
<ul>
<li>how they clustered</li>
<li>how they analyzed with different activation functions (tanh, …)</li>
<li>how they did continual learning (method, regularizer part)</li>
</ul>
</li>
</ul>
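<p>The rule-composition idea can be illustrated with toy vectors (entirely hypothetical numbers; the real task vectors come from trained network activity): if the rule representation is additive, Dly Anti should equal Anti + (Dly Go − Go), and cosine similarity measures how well the composition holds:</p>

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

rng = np.random.default_rng(0)
D = 64
anti, go, delay = rng.normal(size=(3, D))  # hypothetical "feature" directions

# toy task vectors built additively, plus a little noise
dly_go = go + delay + 0.05 * rng.normal(size=D)
dly_anti = anti + delay + 0.05 * rng.normal(size=D)

composed = anti + (dly_go - go)  # Dly Anti = Anti + (Dly Go - Go)
print(cosine(composed, dly_anti))  # close to 1 if the additive algebra holds
```

<p>The DMS ≠ DMC + DNMS − DNMC failure suggests some rules are not additive in this sense, which is where multiplication/binding operations might help.</p>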
]]></description>
        </item>
    </channel>
</rss>