<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>0404_not_found.log</title>
        <link>https://velog.io/</link>
        <description></description>
        <lastBuildDate>Fri, 03 Jan 2025 14:31:24 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>0404_not_found.log</title>
            <url>https://velog.velcdn.com/images/0404_not_found/profile/94b92194-27d7-48c2-8d48-b46a2f3a2f22/social_profile.png</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. 0404_not_found.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/0404_not_found" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[GPTScore: Evaluate as You Desire]]></title>
            <link>https://velog.io/@0404_not_found/GPTScore-Evaluate-as-You-Desire</link>
            <guid>https://velog.io/@0404_not_found/GPTScore-Evaluate-as-You-Desire</guid>
            <pubDate>Fri, 03 Jan 2025 14:31:24 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>GPT</p>
<ul>
<li><p>Analytical AI to Generative AI</p>
</li>
<li><p>large PLM + Prompt $\rightarrow$ superior performance</p>
</li>
<li><p>need to evaluate the quality of these texts</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5f620815-64e3-470f-a01c-1cc7b1bff0eb/image.png" alt=""></p>
<ul>
<li><p>evaluating single aspect</p>
<ul>
<li>hard for users to evaluate aspects as they need</li>
</ul>
</li>
<li><p>multi-aspect evaluation</p>
<ul>
<li><p>lacks the aspects&#39; definition and relationship</p>
</li>
<li><p>empirically bound with metric variants</p>
</li>
</ul>
</li>
<li><p>Needed supervised training and manual annotation</p>
</li>
<li><p>using LLM to achieve multi-aspect, customized and training-free evaluation</p>
<ul>
<li><p>using zero-shot instruction and ICL</p>
</li>
<li><p>higher quality text for a specific aspect will be more likely generated</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1d92146c-1f90-4702-b874-093cb4e73df4/image.png" alt=""></p>
<ul>
<li><p>perform an evaluation as the user desires</p>
<ul>
<li><p>task specification</p>
</li>
<li><p>aspect definition</p>
</li>
<li><p>demonstrated samples</p>
<ul>
<li>well-labeled sample</li>
</ul>
</li>
<li><p>GPT to calculate how likely the text could be generated based on the evaluation protocol</p>
<ul>
<li>GPT2, OPT, T5, GPT3</li>
</ul>
</li>
</ul>
</li>
<li><p>almost all NLG task</p>
<ul>
<li><p>performed well when instructed by the definition of task and aspect</p>
</li>
<li><p>different evaluation aspects exhibit certain correlations</p>
</li>
<li><p>in summarization, data-to-text and dialogue response, GPTScore outperformed fine-tuned models</p>
</li>
<li><p>gpt3-text-davinci-003 (human feedback) is inferior to text-davinci-001</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h4 id="similarity-based-metrics">Similarity-based Metrics</h4>
<ul>
<li><p>lexical overlap-based</p>
<ul>
<li><p>BLEU</p>
</li>
<li><p>ROUGE</p>
</li>
</ul>
</li>
<li><p>embedding-based</p>
<ul>
<li><p>BERTScore</p>
</li>
<li><p>MoverScore</p>
</li>
</ul>
</li>
</ul>
<h4 id="single-aspect-evaluator">Single-aspect Evaluator</h4>
<ul>
<li><p>Coherence of dialogue system</p>
<ul>
<li>DEAM</li>
<li>QuantiDCE</li>
</ul>
</li>
<li><p>Consistency</p>
</li>
</ul>
<h4 id="multi-aspect-evaluator">Multi-aspect Evaluator</h4>
<ul>
<li><p>one evaluator handles several evaluation aspects</p>
<ul>
<li><p>different input and output pair</p>
</li>
<li><p>different prompt by the aspect name</p>
</li>
<li><p>different formulas</p>
</li>
</ul>
</li>
</ul>
<h4 id="emergent-ability">Emergent Ability</h4>
<ul>
<li><p>ICL</p>
</li>
<li><p>CoT Reasoning</p>
</li>
<li><p>Zero-shot instruction</p>
</li>
</ul>
<h4 id="bartscore--vs-gptscore">BARTScore  vs. GPTScore</h4>
<ul>
<li><p>BARTScore needs a fine-tuning step</p>
</li>
<li><p>GPTScore &gt; BARTScore</p>
<ul>
<li><p>customizable</p>
</li>
<li><p>multi-faceted</p>
</li>
<li><p>train-free</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-gptscore">3. GPTScore</h1>
<ul>
<li><p>GPT will assign a higher probability of high-quality text given instruction and context</p>
<ul>
<li><p>$d$ : task description</p>
</li>
<li><p>$a$ : aspect definition</p>
</li>
<li><p>$\bm{h} = \{h_1, h_2, \dots, h_m\}$ : text to be evaluated</p>
</li>
<li><p>$\mathcal{S}$ : context information (source or reference)</p>
</li>
</ul>
</li>
<li><p>$\text{GPTScore}(\bm{h} | d, a, \mathcal{S}) = \sum_{t=1}^m w_t\log p(h_t | \bm{h}_{&lt;t}, T(d, a, \mathcal{S}), \theta )$ (a minimal scoring sketch follows this list)</p>
<ul>
<li><p>$w_t$ : weight of the token at position $t$ (in this work, it is treated equally)</p>
</li>
<li><p>$T$ : prompt template that defines the evaluation protocol</p>
<ul>
<li><p>task-specific</p>
</li>
<li><p>handcrafted with prompt engineering</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Few-shot</p>
<ul>
<li>extending $T$</li>
</ul>
</li>
<li><p>Prompt Template</p>
<ul>
<li><p>officially given by OpenAI (GPT3-based model)</p>
</li>
<li><p>NaturalInstruction (instruction based pre-trained model)</p>
</li>
</ul>
</li>
</ul>
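<p>A minimal sketch of this scoring rule, assuming a HuggingFace causal LM (GPT-2 as a stand-in for the backbones listed above) and a placeholder prompt template rather than the exact templates from the paper:</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 here is a hypothetical stand-in; the paper uses GPT-2/OPT/FLAN-T5/GPT-3 backbones.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def gpt_score(prompt: str, hypo: str) -> float:
    """Average log-probability of the hypothesis tokens given the evaluation prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    hypo_ids = tok(hypo, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hypo_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t+1
    n_prompt = prompt_ids.shape[1]
    targets = input_ids[0, n_prompt:]                      # the m hypothesis tokens
    token_lp = log_probs[n_prompt - 1:].gather(1, targets[:, None])
    return token_lp.mean().item()                          # equal weights, w_t = 1/m

# Placeholder template T(d, a, S); the paper uses task-specific, handcrafted templates.
template = "Generate a fluent summary for the given text.\nText: {src}\nSummary:"
print(gpt_score(template.format(src="..."), " A short candidate summary."))
</code></pre>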
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/49209623-b6ac-48ec-bdce-19febf33fb38/image.png" alt=""></p>
<ul>
<li><p>Selection of Scoring Dimension</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$ vs. $p(\text{hypo} | \text{src})$</p>
</li>
<li><p>in GPTScore, it is chosen to align with the protocol of human evaluation</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experimental-settings">4. Experimental Settings</h1>
<h2 id="41-meta-evaluation">4.1 Meta Evaluation</h2>
<ul>
<li><p>how well automated scores correlate with human judgement</p>
<ul>
<li><p>$g(y_{\text{auto}}, y_{\text{human}})$</p>
</li>
<li><p>$g$ : correlation function (Spearman, Pearson)</p>
</li>
</ul>
</li>
</ul>
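<p>Concretely, the meta-evaluation just correlates the two lists of scores (toy numbers, not from the paper):</p>
<pre><code>from scipy.stats import pearsonr, spearmanr

# Toy example: automatic metric scores vs. human ratings for five samples.
y_auto = [0.61, 0.42, 0.88, 0.35, 0.70]
y_human = [3.5, 2.0, 4.5, 2.5, 4.0]

print(spearmanr(y_auto, y_human).correlation)  # rank correlation
print(pearsonr(y_auto, y_human)[0])            # linear correlation
</code></pre>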
<h2 id="42-tasks-datasets-and-aspects">4.2 Tasks, Datasets and Aspects</h2>
<ul>
<li><p>Tasks</p>
<ul>
<li><p>Dialogue Response Generation</p>
<ul>
<li><p>generate an engaging and informative response</p>
</li>
<li><p>FED datasets </p>
</li>
<li><p>turn-level, dialogue-level evaluation</p>
</li>
</ul>
</li>
<li><p>Text Summarization</p>
<ul>
<li><p>SummEval</p>
</li>
<li><p>REALSumm</p>
</li>
<li><p>NEWSROOM</p>
</li>
<li><p>QAGS_XSUM</p>
</li>
</ul>
</li>
<li><p>Data-to-Text</p>
<ul>
<li><p>generate a fluent and factual description for a given table</p>
</li>
<li><p>BAGEL</p>
</li>
<li><p>SFRES</p>
</li>
</ul>
</li>
<li><p>Machine Translation</p>
<ul>
<li>MQM (Multidimensional Quality Metrics) -&gt; MQM-2020 (Ch-&gt;Eng)</li>
</ul>
</li>
</ul>
</li>
<li><p>37 Datasets</p>
</li>
<li><p>22 Evaluation Aspects</p>
</li>
</ul>
<h2 id="43-scoring-models">4.3 Scoring Models</h2>
<ul>
<li><p>ROUGE</p>
<ul>
<li><p>ROUGE-1</p>
</li>
<li><p>ROUGE-2</p>
</li>
<li><p>ROUGE-L</p>
</li>
</ul>
</li>
<li><p>PRISM</p>
</li>
<li><p>BERTScore</p>
</li>
<li><p>MoverScore</p>
</li>
<li><p>DynaEval</p>
<ul>
<li>dialogue response generation tasks on the turn level and dialogue level</li>
</ul>
</li>
<li><p>BARTScore</p>
<ul>
<li><p>scoring model based on BART without finetuning</p>
</li>
<li><p>+CNN (finetuned on the CNNDM dataset)</p>
</li>
<li><p>+CNN +Para(+CNNDM +Paraphrase 2.0)</p>
</li>
</ul>
</li>
<li><p>GPTScore</p>
<ul>
<li><p>19 PLMs backbone</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/93ff4e13-0db7-4fa0-9048-7913d4722995/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="44-scoring-dimension">4.4 Scoring Dimension</h2>
<ul>
<li><p>INT, ENG, SPC, REL, COR, SEM, UND, FLU from FED-Turn</p>
<ul>
<li><p>$p(\text{hypo} | \text{src})$</p>
</li>
<li><p>human data in the dataset</p>
</li>
</ul>
</li>
<li><p>COH, CON, INF from SummEval and Newsroom</p>
<ul>
<li><p>$p(\text{hypo} | \text{src})$</p>
</li>
<li><p>labeled data exists</p>
</li>
</ul>
</li>
<li><p>INF, NAT and FLU from the data-to-text</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$</p>
</li>
<li><p>the source is structured input, not standard text</p>
</li>
</ul>
</li>
<li><p>ACC, FLU, MQM from machine translation</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$</p>
</li>
<li><p>source text is in different language</p>
</li>
</ul>
</li>
</ul>
<h2 id="45-evaluation-dataset-construction">4.5 Evaluation Dataset Construction</h2>
<ul>
<li><p>sampled 40 samples for each summarization dataset</p>
</li>
<li><p>sampled 100 samples for dialogue response generation and data-to-text</p>
</li>
</ul>
<h1 id="5-experiment-results">5. Experiment Results</h1>
<ul>
<li><p>three scenarios</p>
<ul>
<li><p>Vanilla : non-instruction and non-demonstration</p>
</li>
<li><p>IST : instruction only</p>
</li>
<li><p>IDM : instruction + demonstration</p>
</li>
</ul>
</li>
<li><p>Significance Tests</p>
<ul>
<li><p>based on bootstrapping</p>
<ul>
<li><p>IST or IDM &gt; VAL</p>
</li>
<li><p>IDM &gt; IST</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="51-text-summarization">5.1 Text Summarization</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b4a02b66-72f0-4503-a98b-558654061980/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d42832cc-de5f-4908-98da-5c79141a5ddb/image.png" alt=""></p>
<ul>
<li><p>Evaluator with instruction significantly improves the performance</p>
</li>
<li><p>GPT3 / FT5 based models + instructions &gt; supervised method</p>
</li>
</ul>
<h2 id="52-data-to-text">5.2 Data to Text</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1166de8d-3597-4157-9f6c-c1ffbeeba121/image.png" alt=""></p>
<ul>
<li><p>IDM &gt; IST &gt; VAL</p>
</li>
<li><p>IDM &gt; finetuned model</p>
</li>
<li><p>the choice of examples impacts the performance a lot</p>
</li>
<li><p>IDM + GPT3 small size family &gt; large sized model</p>
</li>
</ul>
<h2 id="53-dialogue-response-generation">5.3 Dialogue Response Generation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/940b8150-8bd5-4af6-82ef-48f7b1c683f2/image.png" alt=""></p>
<ul>
<li><p>GPT3-d01 &gt;&gt; GPT3-d03</p>
</li>
<li><p>GPT3-based models demonstrate stronger generalization ability</p>
</li>
</ul>
<h2 id="54-machine-translation">5.4 Machine Translation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/949d1c01-b198-47e1-89d6-bcbe5ccddfad/image.png" alt=""></p>
<ul>
<li><p>IST improved the performance</p>
</li>
<li><p>IDM &gt; IST</p>
</li>
<li><p>GPT3-c01 achieved comparable performance with d01 and d03</p>
</li>
</ul>
<h1 id="6-ablation-study">6. Ablation Study</h1>
<h2 id="61-effectiveness-of-demonstration">6.1 Effectiveness of Demonstration</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b00a7ce3-0ca4-4f1b-8223-a33fd7b6c5c8/image.png" alt=""></p>
<ul>
<li><p>demonstration  improves the performance</p>
</li>
<li><p>there is an upper bound on the performance gains</p>
</li>
<li><p>if there are only a few samples, small models are prone to performance degradation</p>
</li>
</ul>
<h2 id="62-partial-order-of-evaluation-aspect">6.2 Partial Order of Evaluation Aspect</h2>
<ul>
<li><p>tested INT as the target aspect</p>
<ul>
<li><p>combined other aspects with the definition of INT</p>
</li>
<li><p>GPT3-c01 6.7B</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/a52d25c9-63da-4589-83fd-2db5fdb38dd7/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c877b398-0000-4c95-8f2e-4215c903ea5f/image.png" alt=""></p>
<ul>
<li>By combining definitions with other highly correlated aspects, smaller model outperformed the bigger model</li>
</ul>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>customizable, multi-faceted, training-free evaluation framework using the emergent abilities of LLMs</li>
</ul>
<h1 id="8-limitation">8. Limitation</h1>
<ul>
<li><p>GPT-3.5 and GPT-4 are not included</p>
</li>
<li><p>the reason why d03 is worse than d01 is unclear as it is not open source</p>
</li>
<li><p>API cost issue</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[SELF-EXPERTISE: Knowledge-Based Instruction Dataset Augmentation for a Legal Expert Language Model]]></title>
            <link>https://velog.io/@0404_not_found/SELF-EXPERTISE-Knowledge-Based-Instruction-Dataset-Augmentation-for-a-Legal-Expert-Language-Model</link>
            <guid>https://velog.io/@0404_not_found/SELF-EXPERTISE-Knowledge-Based-Instruction-Dataset-Augmentation-for-a-Legal-Expert-Language-Model</guid>
            <pubDate>Fri, 03 Jan 2025 12:08:28 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Instruction Tuning Dataset</p>
<ul>
<li><p>Instruction Tuning is important for LLMs</p>
</li>
<li><p>Auto generation method is unsuitable for some domains where the accuracy is important</p>
</li>
</ul>
</li>
<li><p>SELF-EXPERTISE</p>
<ul>
<li><p>automatic instruction data generation for knowledge-intensive tasks</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/12c8dc60-cc60-49a5-b219-a92d84cc3e40/image.png" alt=""></p>
</li>
<li><p>19k instructions generated from the 980-example seed dataset</p>
</li>
<li><p>LxPERT : LLaMA-2-7B + SELF-EXPERTISE</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/1c4b4be8-7bc0-47ba-9ef1-0b80cbb1f69f/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h2 id="21-llm-based-instruction-dataset-augmentation">2.1 LLM-based Instruction Dataset Augmentation</h2>
<ul>
<li><p>Generate instruction dataset using LLMs</p>
</li>
<li><p>Self-Instruct $\rightarrow$ Prone to hallucination</p>
</li>
</ul>
<h2 id="22-knowledge-intensive-tasks">2.2 Knowledge-Intensive Tasks</h2>
<ul>
<li><p>requires knowledge-based solution</p>
<ul>
<li><p>legal domain is knowledge intensive</p>
</li>
<li><p>RAG</p>
</li>
<li><p>SELF-EXPERTISE generates instructions and outputs based on precise external knowledge</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-defining-instruction-data">3.1 Defining Instruction Data</h2>
<ul>
<li><p>typical instruction dataset</p>
<ul>
<li><p>(instruction, input, output) triplet + system instructions</p>
</li>
<li><p>to facilitate reasoning and narrative structure</p>
</li>
<li><p>input is optional</p>
</li>
</ul>
</li>
</ul>
<h2 id="32-self-expertise">3.2 SELF-EXPERTISE</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/50260467-b472-48f3-a23b-7167104a2503/image.png" alt=""></p>
<h4 id="knowledge-extraction-based-on-output">Knowledge Extraction based on Output</h4>
<ul>
<li><p>knowledge is extracted from the outputs of a small set of expert-written seed data</p>
</li>
<li><p>generates new user instructions and outputs with external knowledge</p>
</li>
<li><p>lawyer&#39;s argument (output) + case law (external data)</p>
</li>
</ul>
<h4 id="generation-of-user-instruction-and-input-based-on-knowledge">Generation of User Instruction and Input Based on Knowledge</h4>
<ul>
<li><p>analogous to how teachers create exam questions based on textbook</p>
</li>
<li><p>generates exam questions and context</p>
</li>
</ul>
<h4 id="system-instructions">System Instructions</h4>
<ul>
<li><p>handcrafted</p>
</li>
<li><p>wrote 8 system instructions</p>
</li>
<li><p>they differ to allow the creation of outputs in various manners, lengths, formats</p>
</li>
</ul>
<h4 id="output-generation-based-on-previous-results">Output Generation Based on Previous Results</h4>
<ul>
<li><p>Use all knowledge, system instructions, instructions and input to generate output from LLM</p>
</li>
<li><p>8 outputs for each user instruction and input pair</p>
</li>
</ul>
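<p>A rough sketch of the four-step flow. The <code>call_llm</code> helper, the prompt wordings, and splitting Step 2 into two calls are placeholders, not the actual prompts from the paper:</p>
<pre><code># Placeholder for GPT-3.5 / GPT-4 chat-completion calls; prompts are illustrative only.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

SYSTEM_INSTRUCTIONS = ["Answer as a concise legal memo.", "..."]  # 8 handcrafted variants

def augment(seed_output: str, case_law: str) -> list:
    # Step 1: extract objective legal knowledge from an expert-written output + external source
    knowledge = call_llm(
        "Extract the legal knowledge underlying this answer.\n"
        f"Answer: {seed_output}\nCase law: {case_law}")
    # Step 2: write a new user instruction and input, like a teacher writing exam questions
    instruction = call_llm(f"Write an exam-style legal question based on:\n{knowledge}")
    user_input = call_llm(f"Write the context the question refers to:\n{knowledge}")
    # Steps 3-4: generate one output per system instruction (8 per instruction/input pair)
    return [{"system": s, "instruction": instruction, "input": user_input,
             "output": call_llm(f"{s}\n{knowledge}\n{instruction}\n{user_input}")}
            for s in SYSTEM_INSTRUCTIONS]
</code></pre>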
<h2 id="33-finetuning-the-sllm-using-augmented-instruction-dataset">3.3 Finetuning the sLLM using Augmented Instruction Dataset</h2>
<ul>
<li><p>Similar to knowledge distillation</p>
</li>
<li><p>distills domain knowledge</p>
</li>
<li><p>sLLM is trained to generate responses based on indirectly learned knowledge</p>
</li>
<li><p>learns 8 types of system instructions and corresponding output forms</p>
</li>
</ul>
<h1 id="4-legal-self-expertise-data">4. Legal SELF-EXPERTISE Data</h1>
<h2 id="41-seed-dataset">4.1 Seed Dataset</h2>
<ul>
<li><p>980 legal seed instructions written by legal experts</p>
<ul>
<li><p>560 legal cases + 916 clauses</p>
</li>
<li><p>civil law, local law, legislative information and legal consultation</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/9a48d5ae-0574-42ac-b95a-9d4f1a1ab0c6/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="42-data-generation-details">4.2 Data Generation Details</h2>
<ul>
<li><p>GPT-3.5-turbo in Step 1</p>
</li>
<li><p>GPT-4-preview-1106 for Step 2 and 4</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/bee039f6-3799-4fcc-93ba-e88476be26bb/image.png" alt=""></p>
</li>
</ul>
<h2 id="43-diversity">4.3 Diversity</h2>
<ul>
<li><p>compared the lengths of the user instruction, input and output</p>
</li>
<li><p>more even than Self-Instruct</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/b1dd91e8-08db-4271-84fd-1f5d28ae86a0/image.png" alt=""></p>
</li>
<li><p>various system instructions worked well for the diversity</p>
</li>
<li><p>extracting objective knowledge from outputs will help model not be limited to particular situations</p>
<ul>
<li><p>compared the generated instructions for 200 seeds</p>
</li>
<li><p>BERT-score to calculate similarity</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/c9d7c3d2-0277-405a-aee9-909431ebd8a9/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="44-quality">4.4 Quality</h2>
<ul>
<li><p>human evaluation for 100 random samples</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/aefae21d-6fbc-4a0e-b3dd-dc378da43233/image.png" alt=""></p>
</li>
</ul>
<h1 id="5-experimental-setup">5. Experimental Setup</h1>
<h2 id="51-training-details">5.1 Training Details</h2>
<ul>
<li><p>LLaMA-2-ko 7B</p>
</li>
<li><p>SELF-EXPERTISE dataset</p>
</li>
<li><p>3 epochs, AdamW, lr 2e-5, batch 1 per device, max len 1024</p>
</li>
<li><p>A100 80G</p>
</li>
</ul>
<h2 id="52-baselines">5.2 Baselines</h2>
<ul>
<li><p>Foundation Models</p>
<ul>
<li>LLaMA-2 7B and LLaMA-2-ko 7B</li>
</ul>
</li>
<li><p>Instruction Tuned Models</p>
<ul>
<li>LLaMA-2-chat 7B and LLaMA-2-ko-chat 7B</li>
</ul>
</li>
<li><p>GPT</p>
<ul>
<li>GPT-3.5-turbo</li>
</ul>
</li>
<li><p>Instruction-tuned Models in Legal Domain</p>
<ul>
<li><p>SELF-EXPERTISE tuned LLaMA-2-ko 7B</p>
</li>
<li><p>seed dataset tuned LLaMA-2-ko 7B</p>
</li>
<li><p>legal domain dataset augmented by Self-Instruct </p>
</li>
</ul>
</li>
</ul>
<h2 id="53-evaluation-dataset">5.3 Evaluation Dataset</h2>
<ul>
<li><p>In-domain Dataset</p>
<ul>
<li><p>legal experts created a new dataset related to the same 4 domains as the seed dataset</p>
</li>
<li><p>200 pairs</p>
</li>
</ul>
</li>
<li><p>Out of domain Dataset</p>
<ul>
<li><p>100 QA pairs from easylaw.go.kr</p>
</li>
<li><p>selected questions that require knowledge not in the seed data</p>
</li>
</ul>
</li>
</ul>
<h2 id="54-evaluation-settings">5.4 Evaluation Settings</h2>
<ul>
<li><p>GPT-4 Evaluation</p>
</li>
<li><p>Human Evaluation</p>
<ul>
<li>5-point Likert scale (accuracy, fluency)</li>
</ul>
</li>
</ul>
<h1 id="6-results">6. Results</h1>
<h2 id="61-evaluation-on-in-domain-data">6.1 Evaluation on In-domain Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/db9eb6cb-4441-4d30-ab58-3652110aa714/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/073336fb-dfec-4c5e-a979-cae91ff7bbe3/image.png" alt=""></p>
<h2 id="62-evaluation-on-out-of-domain-data">6.2 Evaluation on Out-of-domain Data</h2>
<ul>
<li>Seed dataset tuned model performance noticeably dropped</li>
</ul>
<h2 id="63-quality-of-answers-relative-to-the-amount-of-training-data">6.3 Quality of Answers Relative to the Amount of Training Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b1de4282-63e7-4c28-b34b-037404f22b4c/image.png" alt=""></p>
<ul>
<li><p>Excessive augmentation of data from the same seed dataset leads to overfitting on specific knowledge</p>
</li>
<li><p>Expanding the seed data and adding a general-domain dataset would help.</p>
</li>
</ul>
<h1 id="7-discussion">7. Discussion</h1>
<ul>
<li><p>ability to follow instructions</p>
</li>
<li><p>legal domain knowledge</p>
</li>
<li><p>still prone to make errors</p>
</li>
</ul>
<h1 id="8-conclusion">8. Conclusion</h1>
<ul>
<li><p>automatically generating instruction dataset in specialized domain</p>
</li>
<li><p>can be extended to generate instruction datasets in other domain</p>
</li>
</ul>
<h1 id="comment">Comment</h1>
<ul>
<li><p>Extract knowledge and use it</p>
</li>
<li><p>Keep augmentation moderate, in proportion to the seed data</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[A Multi-Task Benchmark for Korean Legal Language Understanding and Judgement Prediction]]></title>
            <link>https://velog.io/@0404_not_found/A-Multi-Task-Benchmark-for-Korean-Legal-Langhage-Understanding-and-Judgement-Prediction</link>
            <guid>https://velog.io/@0404_not_found/A-Multi-Task-Benchmark-for-Korean-Legal-Langhage-Understanding-and-Judgement-Prediction</guid>
            <pubDate>Thu, 02 Jan 2025 16:13:20 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Previous Legal Expert Systems</p>
<ul>
<li>Useful in certain areas</li>
</ul>
</li>
<li><p>Deep Learning Based Approach</p>
<ul>
<li><p>Legal Judgement Prediction</p>
</li>
<li><p>Legal Content Generation</p>
</li>
<li><p>Legal Text Classification</p>
</li>
<li><p>Legal Event Detection</p>
</li>
<li><p>Legal Information Extraction</p>
</li>
<li><p>Legal Contract Review and QA</p>
</li>
</ul>
</li>
<li><p>LBOX</p>
<ul>
<li><p>Large Scale Korean legal AI benchmark</p>
<ul>
<li><p>precedent corpus</p>
</li>
<li><p>classification tasks (Case Name, Statute)</p>
</li>
<li><p>judgement prediction task (LJP-Criminal, Civil)</p>
</li>
<li><p>summarization task</p>
</li>
</ul>
</li>
<li><p>pre-trained $\rightarrow$ LCUBE (decoder only, based on GPT-2)</p>
<ul>
<li>doesn&#39;t have advantage on summarization task</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-background">2. Background</h1>
<h2 id="21-korean-legal-system">2.1 Korean Legal System</h2>
<ul>
<li><p>Three-tiered (District, High and the Supreme Court)</p>
</li>
<li><p>rooted in civil law system (vs. common law system)</p>
</li>
</ul>
<h2 id="22-korean-precedent">2.2 Korean Precedent</h2>
<ul>
<li><p>Structure of Korean Precedent</p>
<ul>
<li><p>meta information</p>
</li>
<li><p>gist of claim from plaintiffs in a civil case</p>
</li>
<li><p>ruling</p>
</li>
<li><p>reasoning</p>
<ul>
<li><p>facts</p>
</li>
<li><p>claims</p>
</li>
<li><p>reasoning</p>
</li>
<li><p>decisions</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>The Redaction Process</p>
<ul>
<li>Anonymizing</li>
</ul>
</li>
<li><p>Precedent Disclosure Status</p>
<ul>
<li>Courts&#39; decisions should be published via an online service</li>
</ul>
</li>
</ul>
<h1 id="3-lbox-open-datasets">3. LBOX Open Datasets</h1>
<h2 id="31-structuring-raw-data">3.1 Structuring Raw Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4a5aec20-198c-4636-afa8-308ee49eb278/image.png" alt=""></p>
<ul>
<li><p>Document Images and PDF precedents are available</p>
</li>
<li><p>Preprocessing pipeline</p>
<ul>
<li><p>Layout Classifier (based on ResNet)</p>
</li>
<li><p>Layout Parser (based on Mask-R-CNN)</p>
</li>
<li><p>OCR</p>
</li>
<li><p>Custom Language Model to correct OCR errors</p>
</li>
<li><p>Human annotation for low-confidence instances</p>
</li>
</ul>
</li>
<li><p>JSON format</p>
<ul>
<li><p>meta information</p>
</li>
<li><p>ruling</p>
</li>
<li><p>gist of claim</p>
</li>
<li><p>appeal</p>
</li>
<li><p>reasoning</p>
</li>
</ul>
</li>
</ul>
<h2 id="32-datasets">3.2 Datasets</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/73801f15-cd95-4e1d-9264-08c637536241/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6fca3f23-263f-44b1-8c9c-cc7b0775c99c/image.png" alt=""></p>
<ul>
<li><p>Precedent Corpus</p>
<ul>
<li><p>AI Hub 6k + LAW OPEN DATA 82k + Internal 65k</p>
</li>
<li><p>57% of LAW OPEN DATA consist of the trials of the Supreme Court (no factual issues)</p>
</li>
</ul>
</li>
<li><p>Case Name</p>
<ul>
<li>10k facts + case name</li>
</ul>
</li>
<li><p>Statute</p>
<ul>
<li>facts + statute</li>
</ul>
</li>
<li><p>LJP-Criminal</p>
<ul>
<li><p>facts + punishments(fine, imprisonment with labor, imprisonment without labor)</p>
</li>
<li><p>Level 0 (type of punishment)</p>
</li>
<li><p>Level 1 (degree of punishment in 3-scale, null/low/high) </p>
</li>
<li><p>Level 2 (5-scale for fine, 6-scale for imprisonment) </p>
</li>
<li><p>Level 3 (exact number) $\rightarrow$ Regression!</p>
</li>
</ul>
</li>
<li><p>LJP-Civil</p>
<ul>
<li><p>fact + gist of claim + degrees of claim acceptance</p>
</li>
<li><p>claim acceptance degree</p>
<ul>
<li><p>claimed money from the gist of claim</p>
</li>
<li><p>approved money from ruling section</p>
</li>
<li><p>approved money / claimed money</p>
</li>
</ul>
</li>
<li><p>Level 1 (rejection / partial approval / full approval; a small worked example follows this list)</p>
</li>
<li><p>Level 2 (13 categories)</p>
</li>
<li><p>mt5-small + prompt-tuning for parsing expression (money provider / receiver / amount / litigation cost)</p>
</li>
</ul>
</li>
<li><p>Summarization</p>
<ul>
<li><p>Supreme Court Decisions Report + Summary of Decision</p>
</li>
<li><p>Ruling and Reasoning section</p>
</li>
</ul>
</li>
</ul>
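<p>The LJP-Civil acceptance degree above reduces to a simple ratio; a small worked example, where the Level 1 cut-offs (zero / equal to the claim / anything in between) are the obvious ones rather than quoted from the paper:</p>
<pre><code>def claim_acceptance(claimed: int, approved: int):
    """Degree = approved money (from the ruling) / claimed money (from the gist of claim)."""
    degree = approved / claimed
    if approved == 0:
        level1 = "rejection"
    elif approved == claimed:
        level1 = "full approval"
    else:
        level1 = "partial approval"
    return degree, level1

print(claim_acceptance(10_000_000, 4_000_000))  # (0.4, 'partial approval')
</code></pre>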
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-model-training">4.1 Model Training</h2>
<ul>
<li><p>Nvidia A6000, RTX3090 or RTX6000</p>
</li>
<li><p>lr 3e-5 to 1e-4</p>
</li>
<li><p>batch 8 to 60, AdamW</p>
</li>
<li><p>finetuning experiments with error bars were repeated 3 times</p>
</li>
<li><p>google/mt5-small for fine-tuning</p>
</li>
<li><p>GPT-2 from scratch (LCUBE), Modu and Wiki corpora</p>
</li>
<li><p>byte-level BPE</p>
</li>
<li><p>50K for base and 100K for medium</p>
</li>
<li><p>compared KoGPT2 and LCUBE</p>
</li>
</ul>
<h2 id="42-task-setting">4.2 Task Setting</h2>
<ul>
<li>all tasks are cast as text generation</li>
</ul>
<h2 id="43-metric">4.3 Metric</h2>
<ul>
<li><p>Case Name, Statute, LJP-Civil : Exact Match</p>
</li>
<li><p>LJP-Criminal : F1 of individual fields</p>
</li>
</ul>
<h1 id="5-results">5. Results</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/062406cf-2bd6-487b-b88d-c5758d959714/image.png" alt=""></p>
<ul>
<li><p>Domain specific corpus is critical in the classification and the summarization tasks</p>
<ul>
<li><p>pretraining with the Precedent Corpus only also performed well in domain adaptation</p>
</li>
<li><p>in summarization task, LCUBE doesn&#39;t have an advantage over other models</p>
<ul>
<li><p>this might be from the architecture difference between encoder-decoder model and decoder only model</p>
</li>
<li><p>LCUBE generated ~40% fewer tokens $\rightarrow$ ROUGE score is low</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Domain adaptation is not helpful on legal judgement prediction tasks</p>
<ul>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/50269054-ae65-464c-b8df-325a97aa3870/image.png" alt=""></p>
</li>
<li><p>In LJP-Civil, without the facts, the model performance is close to a dummy baseline</p>
</li>
</ul>
</li>
<li><p>Legal judgement prediction is challenging</p>
<ul>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/6336a370-d495-4a05-922c-85d819442e32/image.png" alt=""></p>
</li>
<li><p>There is no one superior model</p>
</li>
</ul>
</li>
</ul>
<h1 id="6-conclusion">6. Conclusion</h1>
<ul>
<li><p>the first large-scale Korean legal AI benchmark and legal language model LCUBE</p>
</li>
<li><p>only considered precedents from the first level courts</p>
<ul>
<li>for simplicity in legal reasoning</li>
</ul>
</li>
<li><p>didn&#39;t use the plaintiffs&#39; and defendants&#39; claims</p>
</li>
<li><p>difficult to separate the claims from reasoning sections without error</p>
</li>
<li><p>didn&#39;t consider many important legal applications of AI</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks]]></title>
            <link>https://velog.io/@0404_not_found/Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasns</link>
            <guid>https://velog.io/@0404_not_found/Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasns</guid>
            <pubDate>Thu, 02 Jan 2025 13:47:47 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>PLMs learn a substantial amount of in-depth knowledge from data</p>
<ul>
<li><p>it can&#39;t expand or revise their memory</p>
</li>
<li><p>can&#39;t straightforwardly provide insight into their predictions</p>
</li>
<li><p>hallucination</p>
</li>
</ul>
</li>
<li><p>Hybrid Models (REALM, ORQA)</p>
<ul>
<li><p>parametric + non-parametric (retrieval-based)</p>
</li>
<li><p>seq2seq transformer + vector index + pre-trained neural retriever $\rightarrow$ RAG</p>
</li>
<li><p>per-sequence bases vs. per-token basis</p>
</li>
<li><p>This can be fine-tuned on any seq2seq task (generator and retriever are jointly learned)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8604b725-e408-4bc3-8659-1fcd7b7072fd/image.png" alt=""></p>
<ul>
<li><p>Enrich systems with non-parametric memory</p>
<ul>
<li><p>parametric and non-parametric components are pretrained and pre-loaded</p>
</li>
<li><p>using pre-trained access mechanisms, accessing knowledge without additional training is possible</p>
</li>
</ul>
</li>
<li><p>Works well with <strong>Knowledge-Intensive Tasks</strong> </p>
<ul>
<li>Humans could not reasonably be expected to perform without access to an external knowledge source</li>
</ul>
</li>
</ul>
<h1 id="2-methods">2. Methods</h1>
<ul>
<li><p>$x$ (input sequence) $\rightarrow$ $z$ (text documents) $\rightarrow$ $y$ (target sequence)</p>
<ul>
<li><p>$p_{\eta} (z | x)$ : retriever (returns top-K distributions)</p>
</li>
<li><p>$p_{\theta}(y_i | x, z, y_{1:i-1})$ : generator</p>
</li>
<li><p>$z$ as a latent variable</p>
</li>
</ul>
</li>
</ul>
<h2 id="21-models">2.1 Models</h2>
<h4 id="rag-sequence">RAG-Sequence</h4>
<ul>
<li><p>$p_{\text{RAG-Sequence}} (y|x) = \displaystyle \sum_{z \in \text{top-}k(p(\cdot|x))} p_{\eta}(z|x) p_{\theta}(y | x, z)$</p>
</li>
<li><p>uses the same retrieved document to generate the complete sequence</p>
</li>
</ul>
<h4 id="rag-token">RAG-Token</h4>
<ul>
<li><p>$p_{\text{RAG-Token}} (y|x) = \displaystyle \prod_{i}^{N} \sum_{z \in \text{top-}k(p(\cdot|x))} p_{\eta}(z|x) p_{\theta}(y_i | x, z_i, y_{1:i-1})$</p>
</li>
<li><p>draw a different latent document for each target token</p>
</li>
<li><p>generator to choose content from several documents when producing an answer</p>
</li>
<li><p>computes a distribution for the next output token for each document</p>
</li>
<li><p>used for sequence classification $\rightarrow$ target class as a length-one sequence</p>
</li>
</ul>
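<p>A minimal NumPy sketch of the two marginalizations, assuming the retriever distribution and the per-document token probabilities have already been computed:</p>
<pre><code>import numpy as np

# Toy setup: K=3 retrieved documents, N=4 target tokens (probabilities are made up).
p_z = np.array([0.5, 0.3, 0.2])      # p_eta(z|x) over the top-K documents
p_tok = np.random.rand(3, 4)         # p_theta(y_i | x, z, y_1:i-1), shape (K, N)

# RAG-Sequence: marginalize once over whole-sequence probabilities.
p_seq_per_doc = p_tok.prod(axis=1)   # p_theta(y | x, z) for each document
rag_sequence = float(p_z @ p_seq_per_doc)

# RAG-Token: marginalize per token, then take the product over positions.
per_token_mix = p_z @ p_tok          # shape (N,): sum over z of p(z|x) * p(y_i | x, z, ...)
rag_token = float(per_token_mix.prod())
</code></pre>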
<h2 id="22-retriever-dpr">2.2 Retriever: DPR</h2>
<ul>
<li><p>$p_{\eta} (z | x) \propto \exp({\mathbf{d}(z)^{\top}}\mathbf{q}(x))$</p>
</li>
<li><p>used BERT as $\mathbf{d}(z)$ and $\mathbf{q}(x)$</p>
</li>
<li><p>MIPS: Maximum Inner Product Search Problem</p>
</li>
<li><p>document index: non parametric memory</p>
</li>
</ul>
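<p>The retrieval score itself is an inner product between the two BERT encodings; a sketch with random stand-ins for the encoder outputs:</p>
<pre><code>import numpy as np

def dpr_retrieve(q_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """p_eta(z|x) is proportional to exp(d(z)^T q(x)); return the top-k documents (MIPS)."""
    scores = doc_vecs @ q_vec                    # inner products d(z)^T q(x)
    top = np.argsort(-scores)[:k]                # exact search; FAISS/HNSW approximates this
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()          # p(z|x) renormalized over the top-k docs

# q_vec = BERT_q(x), doc_vecs = stacked BERT_d(z); random stand-ins here.
top_docs, p_z = dpr_retrieve(np.random.rand(768), np.random.rand(1000, 768))
</code></pre>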
<h2 id="23-generator-bart">2.3 Generator: BART</h2>
<ul>
<li><p>BART-Large 400M (seq2seq transformer)</p>
</li>
<li><p>simply concatenate $z$ and $x$</p>
</li>
</ul>
<h2 id="24-training">2.4 Training</h2>
<ul>
<li><p>jointly train retriever and generator without any direct supervision on the document</p>
</li>
<li><p>NLL Loss, Adam, SGD</p>
</li>
<li><p>only trained query encoder and generator</p>
</li>
</ul>
<h2 id="25-decoding">2.5 Decoding</h2>
<ul>
<li><p>RAG-Token uses standard beam-decoder</p>
</li>
<li><p>RAG-Sequence performs beam-search for each document</p>
</li>
<li><p>Thorough Decoding vs. Fast Decoding</p>
</li>
</ul>
<h1 id="3-experiments">3. Experiments</h1>
<ul>
<li><p>Wikipedia as document index (100 token chunk, 21M documents)</p>
</li>
<li><p>FAISS, HNSW</p>
</li>
<li><p>k = 5 or 10</p>
</li>
<li><p>Open Domain QA, Abstractive QA, Jeopardy QA (non-standard QA format, fact to entity), Fact Verification (retrieve from Wikipedia and reason whether the given claim is true)</p>
</li>
<li><p>Natural Questions / TriviaQA / WebQuestions / CuratedTrec $\rightarrow$ Exact Match Scores</p>
</li>
<li><p>MSMARCO NLG task v2.1 (only question and answer)</p>
</li>
<li><p>SearchQA $\rightarrow$ SQuAD-tuned Q-BLEU-1</p>
</li>
<li><p>FEVER $\rightarrow$ label accuracy</p>
</li>
</ul>
<h1 id="4-results">4. Results</h1>
<h4 id="open-domain-qa">Open Domain QA</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ac525e14-efad-480b-9465-a019018735a7/image.png" alt=""></p>
<ul>
<li><p>Extract &lt; Generate</p>
<ul>
<li>document with only clue not the exact answer</li>
</ul>
</li>
</ul>
<h4 id="abstractive-qa">Abstractive QA</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/133ef9ff-8ff9-4449-b259-b59cb3b97754/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/18ffc61d-b885-4fa8-9c65-267a4fbb5eb3/image.png" alt=""></p>
<ul>
<li><p>RAG is more diverse than BART, less hallucinative</p>
</li>
<li><p>SotA models access gold passages while RAG is not</p>
</li>
<li><p>many questions are unanswerable without gold passages</p>
</li>
<li><p>not all questions are answerable from Wikipedia alone</p>
</li>
</ul>
<h4 id="jeopardy-question-generation">Jeopardy Question Generation</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/222c7a5e-531a-44a4-854d-0ac89dd38671/image.png" alt=""></p>
<ul>
<li>RAG-Token can perform well as it uses multiple documents</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aa8c0770-ca95-4270-b1fe-e1211ed5a97e/image.png" alt=""></p>
<ul>
<li>parametric and non-parametric memory work together</li>
</ul>
<h4 id="fact-verification">Fact Verification</h4>
<ul>
<li>document retrieved by RAG is the gold evidence in FEVER</li>
</ul>
<h4 id="additional-reuslts">Additional Reuslts</h4>
<ul>
<li><p>Diversity</p>
<ul>
<li><img src="https://velog.velcdn.com/images/0404_not_found/post/be2d434c-a5cd-4984-aa47-329fa25367f7/image.png" alt=""></li>
</ul>
</li>
</ul>
<ul>
<li><p>Retrieval Ablations</p>
<ul>
<li><img src="https://velog.velcdn.com/images/0404_not_found/post/053c9432-e2d3-4a4e-b15f-9ba832359a82/image.png" alt=""></li>
</ul>
</li>
<li><p>Index hot-swapping</p>
<ul>
<li>Changed from Wikipedia 2018 to DrQA Wikipedia dump </li>
</ul>
</li>
<li><p>Retrieving more documents</p>
<ul>
<li><p>didn&#39;t observe significant differences in performance</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/e7500080-d6e0-466d-8a60-dacc49975131/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="5-discussion">5. Discussion</h1>
<ul>
<li>Hybrid generation models with access to parametric and non-parametric memory</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration]]></title>
            <link>https://velog.io/@0404_not_found/Computational-analysis-of-140-years-of-US-political-speeches-reveals-more-positive-but-increasingly-polarized-framing-of-immigration</link>
            <guid>https://velog.io/@0404_not_found/Computational-analysis-of-140-years-of-US-political-speeches-reveals-more-positive-but-increasingly-polarized-framing-of-immigration</guid>
            <pubDate>Mon, 27 May 2024 06:52:46 GMT</pubDate>
            <description><![CDATA[<h1 id="abstract">Abstract</h1>
<ul>
<li><p>200K US congressional speeches + 5K presidential communications related to immigration from 1880 to the present</p>
</li>
<li><p>political speech about immigration is much more positive on average than the past</p>
<ul>
<li><p>shift largely between WW2 and the passage of Immigration and Nationality Act in 1965</p>
</li>
<li><p>since the late 1970s, political parties become polarized</p>
</li>
</ul>
</li>
<li><p>contextual embeddings of text</p>
<ul>
<li>modern Republicans $\rightarrow$ suggestive of metaphors long associated with immigration (animals, cargo) and frames like &quot;crimes&quot; and &quot;legality&quot;</li>
</ul>
</li>
<li><p>nationality mentioned changed the tone of speeches (Mexican, Chinese) $\rightarrow$ still a major factor in how immigrants are spoken of in Congress</p>
</li>
</ul>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Recently, the attitude toward immigration has become more negative than ever before</p>
<ul>
<li><p>anti-Chinese fearmongering in 1880s</p>
</li>
<li><p>Southern and Eastern European immigrants in 1920s</p>
</li>
<li><p>antiimmigration rhetoric of Trump (2017 to 2020)</p>
<p>$\rightarrow$ &quot;Certain types of immigrants can never truly join American society?&quot;</p>
</li>
</ul>
</li>
<li><p>how have attitudes toward immigrants in US changed over the past century?</p>
<ul>
<li><p>public opinion poll began in the 1960s</p>
</li>
<li><p>turned to Congressional Record</p>
</li>
</ul>
</li>
<li><p>Corpus</p>
<ul>
<li><p>full corpus of more than 17M congressional speeches from 1880 to present</p>
</li>
<li><p>200K speeches relevant to immigration</p>
</li>
<li><p>presidential communications</p>
</li>
<li><p>quantitative analysis</p>
</li>
</ul>
</li>
<li><p>Related works</p>
<ul>
<li><p>qualitative approaches and historical archives</p>
</li>
<li><p>quantitative work on immigration used migration and census records</p>
</li>
<li><p>Rhetorical aspects of immigration debates $\rightarrow$ dehumanizing language (vermin and cargo) with qualitative analysis</p>
</li>
<li><p>NLP methods to cover in news media and Congress $\rightarrow$ not a long time span / not a comprehensive corpus with a consistent methodology</p>
</li>
</ul>
</li>
<li><p>Methods</p>
<ul>
<li><p>identify relevant speeches</p>
<ul>
<li>automated text classification based on extensive human annotations</li>
</ul>
</li>
<li><p>curated and applied a set of lexicons for analyzing relevant frames with semi-automated method</p>
</li>
<li><p>neural contextual embedding models to quantify implicit dehumanizing metaphors</p>
</li>
</ul>
</li>
<li><p>Brief results and discussion</p>
<ul>
<li><p>political speeches about immigration today are more positive than the past</p>
<ul>
<li>the shift between WW2 and 1965 Immigration and Nationality Act</li>
</ul>
</li>
<li><p>being net positive on average since early 1950s</p>
</li>
<li><p>Trump is the first president to express sentiment toward immigration more negative than the average member of his own party</p>
</li>
<li><p>two parties have become increasingly polarized over time</p>
<ul>
<li>linear increase in polarization on immigration since the late 1970s</li>
</ul>
</li>
<li><p>today, Democrats are unprecedentedly positive</p>
</li>
<li><p>this predates the generic political polarization observed in Gentzkow et al. by more than a decade</p>
</li>
<li><p>nationality of immigrants continues to matter greatly</p>
<ul>
<li><p>Mexican $\rightarrow$ more negative than European</p>
</li>
<li><p>Mexican framed today is similar to Chinese framed during Chinese exclusion in 19th century</p>
</li>
<li><p>negative frame &quot;crime&quot;, &quot;labor&quot;, &quot;legality&quot; + dehumanizing metaphors</p>
</li>
</ul>
</li>
<li><p>there remains a strong and growing strain of antiimmigration speech among Republicans</p>
<ul>
<li><p>expressed opinions toward immigrants still vary greatly by country of origin </p>
</li>
<li><p>rhetorical strategies continue to be deployed</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-results">2. Results</h1>
<h4 id="tone-of-immigration-speeches">Tone of Immigration Speeches</h4>
<ul>
<li><p>17M congressional speeches from 1880 to 2020</p>
</li>
<li><p>human annotations and trained ML classifiers to detect immigration-related speech with accompanying tone (pro, con, neutral)</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/03cdcbbf-dd59-4b57-b752-fa87200de0d1/image.png" alt=""></p>
<ul>
<li><p>applied same models to all presidential communications by American Presidency Project (Bottom)</p>
</li>
<li><p>Fig 1</p>
<ul>
<li><p>average sentiment is negative throughout the late 19th and early 20th centuries (from the Chinese Exclusion Act (1882) to strict immigration quotas (1920s))</p>
</li>
<li><p>the attitude became more positive around the start of WW2</p>
<ul>
<li><p>rising steadily from 1940 until the end of the Johnson administration (1969)</p>
</li>
<li><p>average tone has been pro since the beginning of the Eisenhower (1953)</p>
</li>
</ul>
</li>
<li><p>beginning about a decade after 1965, an overall decline in sentiment among Republicans and an increase among Democrats is observed</p>
<ul>
<li><p>the exception is the early 1990s, a period coinciding with the end of the Cold War and NAFTA</p>
</li>
<li><p>Republicans show antiimmigration sentiment comparable to the 1920s</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Trends for presidential attitudes should be treated more cautiously as there is less text</p>
<ul>
<li><p>involves a slight domain shift (the model is trained on congressional speeches)</p>
</li>
<li><p>found a similar pattern</p>
<ul>
<li>early presidents were more antiimmigration</li>
</ul>
</li>
<li><p>in recent years, presidents are uniformly more proimmigration even the Republican (Ronald Reagan) and the Democrats (Jimmy Carter)</p>
<ul>
<li>Trump was a stark exception (the most antiimmigration president over the past 140 years)</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/2aabdcc2-d161-4e86-974f-73d55f1fd88d/image.png" alt=""></p>
<ul>
<li><p>Fig 2.</p>
<ul>
<li><p>the tone was varied dramatically depending on which groups of immigrants are being discussed</p>
</li>
<li><p>Mexican, Chinese, and Italian immigrants (identified as in Identifying Groups)</p>
</li>
<li><p>Speeches mentioning Chinese immigrants were overwhelmingly negative during Chinese exclusion (1882 to 1943)</p>
<ul>
<li>while the tone toward Italian was slightly more favorable</li>
</ul>
</li>
<li><p>Attitudes toward all groups improved from 1940 to 1970</p>
<ul>
<li>mentioning China and Mexico remained relatively more negative overall</li>
</ul>
</li>
<li><p>since the late 1970s, the gap between Italian and Mexican is as large as the gap in tone that exists between Republicans and Democrats today.</p>
</li>
<li><p>this pattern is mirrored in broader regional trends</p>
<ul>
<li><p>Most European $\rightarrow$ referred to positively on average by the 1960s</p>
</li>
<li><p>Asian $\rightarrow$ by the 1980s</p>
</li>
<li><p>Caribbean $\rightarrow$ negative on average until the 2000s</p>
</li>
<li><p>few countries are mentioned as frequently as those three (Mexico, Italy, China)</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4 id="language-framing-and-dehumanization">Language, Framing, and Dehumanization</h4>
<ul>
<li>trained interpretable logistic regression models to approximate the predictions of the contextual embedding models and determine feature importance using Shapley values</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d282950e-a5f3-4654-8e66-2c888afa041d/image.png" alt=""></p>
<ul>
<li><p>Table 1</p>
<ul>
<li><p>antiimmigration terms contains the words representing threats (dangerous, cheap), control (permit, violation), and the targets of early antiimmigration legislation (undesirable, Chinese)</p>
<ul>
<li>by midcentury and beyond, another threats appear (subversive, terrorism) along with the themes of legality (aliens, illegal) and crime (criminals, smuggling)</li>
</ul>
</li>
<li><p>proimmigration terms contain the words representing desirable characteristics (industrious), land (property, agriculture), and service (gave, served)</p>
<ul>
<li><p>by post-WW2 era, humanitarian concerns (discriminatory, migrants) and community (citizens, families, children) appeared</p>
</li>
<li><p>this continued into the present (victims, community) along with a celebration of once-vilified communities (Irish, Italian, heritage)</p>
</li>
</ul>
</li>
<li><p>Despite the relatively negative tone toward Mexican in the modern period, Hispanic and Latino had strong positive associations</p>
<ul>
<li><p>these are more likely to be used by Democrats than Republicans $\rightarrow$ they are proimmigration</p>
</li>
<li><p>but &quot;Mexico&quot; and &quot;Mexican&quot; are mentioned with very similar frequency by Democrats and Republicans $\rightarrow$ the tone difference is not simply a matter of mentioning Mexico</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>To understand the rhetorical divergence between parties $\rightarrow$ they focused on the frame about immigration $\rightarrow$ built a series of lexicons</p>
<ul>
<li>working upon the previous work, they developed 14 of these lexicons with automated/manual curation</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1f0a97f4-af50-4be1-802c-010f2bf39d4c/image.png" alt=""></p>
<ul>
<li><p>Fig 3</p>
<ul>
<li><p>almost no difference in the frames by two parties in the earlier time period</p>
</li>
<li><p>today, they use strongly divergent use of different frames</p>
<ul>
<li><p>Republicans : crime, legality, threats, deficiency, flood/tide $\rightarrow$ commonly heard antiimmigration comments</p>
</li>
<li><p>Democrats : family, victims, contributions, culture (positive)</p>
</li>
</ul>
</li>
<li><p>these patterns are robust to the exclusion of any individual term as well as to automated lexicon expansion</p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>the most salient aspects</p>
<ul>
<li><p>earlier time period : deficiency, culture, labor</p>
</li>
<li><p>today : crime, legality (partly due to frequent mentions of legal and illegal immigrants + other legal terms and crime terms (laws, visas, criminals, terrorism))</p>
</li>
</ul>
</li>
<li><p>economy is the most uncommon in speeches about immigration $\rightarrow$ the least salient in both eras</p>
</li>
</ul>
<ul>
<li><p>measured more implicit dehumanizing metaphors </p>
<ul>
<li><p>only the flood and tide metaphors emerged from the semiautomated frame construction process</p>
</li>
<li><p>measure the metaphors based on how probable such terms are as substitutes according to contextual embedding models</p>
<ul>
<li><p>animals, cargo, disease, flood/tide, machines, vermin are drawn by this method</p>
</li>
<li><p>&quot;dumping produces ~&quot; $\rightarrow$ cargo</p>
</li>
<li><p>&quot;herding of ~&quot; $\rightarrow$ animal</p>
</li>
</ul>
</li>
<li><p>Republican used more dehumanizing metaphors</p>
</li>
</ul>
</li>
</ul>
<h4 id="differences-by-country-of-origin">Differences by Country of Origin</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e15a27a3-a275-495d-a514-4bdafae37299/image.png" alt=""></p>
<ul>
<li><p>Fig 4</p>
<ul>
<li><p>how Mexicans are framed today vs how Chinese were framed a century earlier</p>
</li>
<li><p>crime, labor, legality are deployed vastly</p>
</li>
<li><p>4 most positive frames are all used far more in sentences mentioning European than the non-European groups (culture, victims, contributions, family)</p>
</li>
</ul>
</li>
<li><p>implicit dehumanizing language is slightly but significantly more common for mentions of the non-European group in both cases</p>
</li>
</ul>
<h1 id="3-discussion">3. Discussion</h1>
<ul>
<li><p>Congressional antagonism to immigration started much earlier than the quota period</p>
<ul>
<li><p>China is mentioned in more than 20% of the speeches in 1880 to 1900</p>
</li>
<li><p>negative attitudes toward immigration remained from 1880 to 1940</p>
</li>
</ul>
</li>
<li><p>negative tone toward Chinese is consistent with the many pieces of anti-Chinese legislation introduced into Congress</p>
<ul>
<li><p>1875 Page Act</p>
</li>
<li><p>1882 Chinese Exclusion</p>
</li>
<li><p>1888 Scott</p>
</li>
<li><p>mentioning Chinese remained until the Chinese Exclusion Act was repealed in 1943</p>
</li>
<li><p>significantly greater use of implicit dehumanizing language to mention Chinese and emphasis on the threatening aspects</p>
</li>
</ul>
</li>
<li><p>The combination of frames (crime, threats) underscores the dual nature of immigrants (threatening vs cheap labor)</p>
<ul>
<li><p>same pattern in the Mexican today</p>
</li>
<li><p>the frame of Europeans was more sympathetic although still negative until the middle of the 20th century</p>
</li>
</ul>
</li>
<li><p>gradual loosening of immigration laws from 1940s</p>
<ul>
<li><p>this trend mirrored by congressional tone toward immigration</p>
</li>
<li><p>eventually becoming net positive on average in 1950s</p>
</li>
<li><p>possibly by the humanitarian concerns</p>
<ul>
<li><p>signaling positive attitudes</p>
</li>
<li><p>increasing association with the &#39;victims&#39; frame and decreasing the prominence of &#39;deficiency&#39; and &#39;threats&#39;</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>nearly 30 years after the border reopened in 1965, the positive sentiment didn&#39;t fully erode even as the immigration from developing countries increased</p>
<ul>
<li><p>the partisan divide on immigration emerged in the late 1970s, although Republicans showed a neutral or positive stance until the election of Bill Clinton and NAFTA</p>
</li>
<li><p>this predates the previous work about polarization</p>
</li>
</ul>
</li>
<li><p>the results are consistent with the patterns in the previous work about the polarizations</p>
<ul>
<li><p>polarization + overall tone toward immigration</p>
</li>
<li><p>beyond the sentiment, there are the &#39;framing&#39; used in immigration debates</p>
</li>
</ul>
</li>
<li><p>understanding the causes of this polarization is beyond the scope of this paper</p>
<ul>
<li><p>legislators&#39; tone is weakly correlated with public opinion on the issue at the state level</p>
</li>
<li><p>no evidence of systematic differences in tone among House members in election vs. nonelection years</p>
</li>
</ul>
</li>
<li><p>stark differences in framing between European and non-European groups</p>
<ul>
<li><p>Chinese in the late 19th and early 20th century, Mexican today</p>
</li>
<li><p>more implicitly dehumanizing metaphors for non-European</p>
</li>
<li><p>also for the explicit frames (crime, labor, legality)</p>
</li>
<li><p>the gap between mentioning Mexico and European is equivalent to the modern gap between Democrats and Republicans</p>
</li>
</ul>
</li>
<li><p>modern immigration laws and the rhetoric of &#39;illegal&#39; immigrants were crafted specifically to target immigration from Mexico</p>
<ul>
<li>made associations with crime, legality and labor</li>
</ul>
</li>
<li><p>Mexico was also the target of early discrimination like China</p>
<ul>
<li><p>Mexico was exempt from the quota system</p>
</li>
<li><p>Although the tone of speeches mentioning Mexican increased with other nationalities, these gains were largely eroded in the early 1970s  $\rightarrow$ persistently nationality-based gap</p>
</li>
</ul>
</li>
<li><p>with the public opinion polls (by Gallup)</p>
<ul>
<li><p>also shows the increase in proimmigrant sentiment from 1965 to present</p>
</li>
<li><p>in 2019, 77% answered as a positive</p>
</li>
<li><p>in 2002, it was 52% (after 9.11)</p>
</li>
<li><p>when asked whether immigration should be decreased, 65% said it should in the 1990s</p>
</li>
<li><p>in 2020, this fell to 28%</p>
</li>
</ul>
</li>
<li><p>the analysis of congressional and presidential speeches is more complicated</p>
<ul>
<li><p>attitudes among Republicans are as negative as members of Congress were during the push for restrictive quotas</p>
</li>
<li><p>Chinese immigrants are still discussed more negatively than Europeans even though the overall sentiment is positive today</p>
</li>
<li><p>in recent years, COVID-19 fueled anti-Asian hate crimes and anti-Chinese rhetoric</p>
</li>
<li><p>despite the proimmigrant sentiment among the general population, the tone differences in Congress based on nationality are as strong as those between the parties</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-materials-and-methods">4. Materials and Methods</h1>
<h4 id="data">Data</h4>
<ul>
<li><p>43rd to 111th Congress : digitized copy of the Congressional Record</p>
</li>
<li><p>112th to 116th Congress : congressional-record tool by @unitedstates project</p>
</li>
<li><p>data with speaker, party, state and date</p>
</li>
<li><p>Procedural speeches were identified and excluded</p>
</li>
<li><p>presidential communication : all presidential documents from The American Presidency Project</p>
</li>
<li><p>Immigration statistics : Historical Statistics of the United States Millennial Edition Online + census data by the Migration Policy Institute</p>
</li>
</ul>
<h4 id="classification">Classification</h4>
<ul>
<li><p>Princeton University research assistants to label a speech</p>
<ul>
<li><p>about immigration or not</p>
</li>
<li><p>proimmigration / antiimmigration</p>
</li>
<li><p>an extensive set of keywords was used to select candidates for annotation</p>
</li>
<li><p>7626 segments annotated (3643 were judged relevant)</p>
</li>
<li><p>the judgements were aggregated with Bayesian item response model (to get a probability distribution over labels for each segment)</p>
</li>
</ul>
</li>
<li><p>trained RoBERTa</p>
<ul>
<li><p>fine-tuned the pretrained roberta-base on congressional speeches in a self-supervised manner</p>
</li>
<li><p>then fine-tuned it to be a classifier using annotated examples</p>
</li>
<li><p>~90% accuracy on relevance and 65% on tone</p>
</li>
<li><p>major error in tone is between neutral and non-neutral</p>
</li>
<li><p>models trained on earlier and later parts of the data showed similar aggregate results in the intervening years</p>
</li>
</ul>
</li>
<li><p>the predictions on segments are used to predict the speeches</p>
<ul>
<li>same predictor is used to presidential communication</li>
</ul>
</li>
</ul>
<h4 id="identifying-groups">Identifying Groups</h4>
<ul>
<li><p>the most prominent immigrant nationalities $\rightarrow$ historical data on the countries of origin of the foreign-born US population</p>
</li>
<li><p>45 countries that accounted for at least 1% of the foreign-born population in at least 1 decade</p>
</li>
<li><p>manually modified the country name and nationality</p>
</li>
</ul>
<h4 id="measuring-impact">Measuring Impact</h4>
<ul>
<li><p>used L1-regularized LR models to fit the predicted tone labels on all congressional segments classified as relevant</p>
<ul>
<li>approximates the influence of individual words</li>
</ul>
</li>
<li><p>words in the vocab : used at least 20 times / excluding numbers, punctuation, stop words / counts were binarized</p>
</li>
<li><p>Shapley values computed (reflected in Table 1)</p>
</li>
</ul>
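<p>A rough equivalent of this step in scikit-learn, with made-up feature matrices; for a linear model the Shapley value has a simple closed form, so no extra library is needed here:</p>
<pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression

# X: binarized word-presence features per segment; y: predicted tone (pro=1 / anti=0).
# Random stand-ins; the paper fits on all congressional segments classified as relevant.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 300)).astype(float)
y = rng.integers(0, 2, size=500)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# For a linear model, the Shapley value of feature j on sample i is
# coef_j * (x_ij - mean_j); averaging its magnitude ranks word importance.
shapley = clf.coef_[0] * (X - X.mean(axis=0))
word_importance = np.abs(shapley).mean(axis=0)
top_words = np.argsort(-word_importance)[:20]   # indices of the 20 most influential features
</code></pre>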
<h4 id="curating-frames">Curating Frames</h4>
<ul>
<li><p>curated lexicons for 14 immigration frames</p>
<ul>
<li><p>identified terms that occur significantly more frequently in mentions of immigrants than in mentions of generic people</p>
</li>
<li><p>considered initial exploration, annotators&#39; comments, and prior literature to identify 14 relevant categories</p>
</li>
<li><p>which terms belong to which frame was decided by majority vote</p>
</li>
</ul>
</li>
</ul>
<h4 id="identifyng-mentions">Identifyng Mentions</h4>
<ul>
<li><p>collected direct mentions + group terms + more generic person references with nationality</p>
</li>
<li><p>used to measure dehumanizing metaphorical language for each group</p>
</li>
<li><p>included slang and derogatory terms to identify groups</p>
</li>
</ul>
<h4 id="measuring-dehumanization">Measuring Dehumanization</h4>
<ul>
<li><p>introduce a method that is based purely on context $\rightarrow$ used BERT</p>
<ul>
<li><p>trained on MLM task</p>
</li>
<li><p>fine-tuning to act as a classifier</p>
</li>
<li><p>to measure implicit metaphorical language, they began with representative terms of each category</p>
<ul>
<li><p>used static vectors to find similar terms</p>
</li>
<li><p>tried to find that kind of word in the BERT&#39;s vocabulary</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>training procedure</p>
<ul>
<li><p>for each sentence that mentions an immigrant or immigrant group (see the sketch after this list)</p>
</li>
<li><p>mask the mention with [MASK] token</p>
</li>
<li><p>to compute the probability of the candidate for the [MASK]</p>
</li>
<li><p>add up all the probabilities to get an overall score for each category for that sentence</p>
</li>
<li><p>showed the log ratio of the mean probability for one set of mentions to the mean probability for the other (a rough sketch of this scoring follows after this list)</p>
</li>
</ul>
</li>
<li><p>validating procedure</p>
<ul>
<li><p>collected human judgements on a sample of masked contexts</p>
</li>
<li><p>three of the authors independently rated whether a term would be a plausible replacement</p>
</li>
<li><p>reasonably strong agreement (Krippendorff&#39;s alpha = 0.59)</p>
</li>
<li><p>correlated with the log probability by the model (r = 0.73)</p>
</li>
</ul>
</li>
</ul>
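<p>A rough sketch of the masked-LM scoring procedure above (my reconstruction; the category term lists and the example sentence are made up). The group mention is replaced with the [MASK] token, the probabilities of each category&#39;s terms at that position are summed into a per-sentence category score, and the paper then compares mean scores across mention sets via a log ratio.</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

categories = {                      # hypothetical single-token category members
    "vermin":  ["rats", "pests", "parasites"],
    "neutral": ["people", "families", "workers"],
}

def category_scores(sentence_with_mention, mention):
    masked = sentence_with_mention.replace(mention, tok.mask_token)
    enc = tok(masked, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        probs = model(**enc).logits[0, mask_pos].softmax(-1)
    # sum of [MASK] probabilities over the terms of each category
    return {name: sum(probs[tok.convert_tokens_to_ids(t)].item() for t in terms)
            for name, terms in categories.items()}

print(category_scores("The mexican immigrants are pouring into the country.",
                      "mexican immigrants"))
</code></pre>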
]]></description>
        </item>
        <item>
<title><![CDATA[Elements of World Knowledge (EWoK)]]></title>
            <link>https://velog.io/@0404_not_found/Elements-of-Worls-Knowledge-EWoK</link>
            <guid>https://velog.io/@0404_not_found/Elements-of-Worls-Knowledge-EWoK</guid>
            <pubDate>Sun, 19 May 2024 13:52:31 GMT</pubDate>
<description><![CDATA[<p>Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in LMs</p>
<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/abe0f276-9d52-4fb2-a0d9-bf0ecdf4ec00/image.png" alt=""></p>
<ul>
<li><p>LLMs acquire a substantial amount of knowledge from their training data</p>
<ul>
<li><p>knowledge about language (word meaning, syntax)</p>
</li>
<li><p>knowledge about world (social conventions, physical properties of objects)</p>
</li>
</ul>
</li>
<li><p>To test the robustness of a model&#39;s world knowledge</p>
<ul>
<li><p>Elements of World Knowledge (EWoK)</p>
<ul>
<li><p>several domains that constitute the foundation for basic human world knowledge</p>
</li>
<li><p>specific concepts within each domain</p>
</li>
<li><p>a set of item templates</p>
</li>
<li><p>a set of fillers to populate the templates (each template to be used multiple times)</p>
</li>
<li><p>a pipeline to generate a specific set of items</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Why Elements?</p>
<ul>
<li><p>it targets specific cognitive concepts (e.g. friend/enemy)</p>
</li>
<li><p>concepts leveraged in context are the first-class objects of EWoK, as opposed to individual sentences or facts</p>
</li>
<li><p>NLP benchmarks $\rightarrow$ aim to evaluate knowledge based on individual items</p>
</li>
<li><p>individual items make it hard to assess why a model fails</p>
</li>
<li><p>explicitly link the items with the concepts that they test</p>
</li>
</ul>
</li>
<li><p>Why cognition-inspired?</p>
<ul>
<li><p>selected a range of domains that have been shown to recruit dedicated cognitive and/or neural machinery in humans </p>
<ul>
<li><p>intuitive physics</p>
</li>
<li><p>physical and spatial relations</p>
</li>
<li><p>intuitive number sense</p>
</li>
<li><p>social reasoning</p>
</li>
<li><p>reasoning about agents with both physical and social knowledge</p>
</li>
</ul>
</li>
<li><p>present in preverbal infants</p>
</li>
<li><p>but language contains a rich amount of information that reflects grounded world knowledge $\rightarrow$ LLMs might acquire the domain-specific knowledge from text alone</p>
</li>
</ul>
</li>
<li><p>Why plausibility?</p>
<ul>
<li><p>plausible vs implausible context-target pairs</p>
</li>
<li><p>plausibility $\rightarrow$ serves as a proxy for factual accuracy (determines whether a given scenario makes sense)</p>
</li>
<li><p>an accurate world model is necessary for distinguishing the plausibility no matter how they are worded</p>
</li>
</ul>
</li>
<li><p>Why minimal pairs?</p>
<ul>
<li><p>contexts and targets in EWoK have a minimal-pairs design</p>
</li>
<li><p>target change results in an opposite result (plausible $\rightarrow$ implausible)</p>
</li>
<li><p>help to identify specific manipulations that LLMs are sensitive and they are not</p>
</li>
</ul>
</li>
<li><p>Why context-target combinations?</p>
<ul>
<li><p>LLMs are very good at memorization $\rightarrow$ many distinctions can be made simply from the items&#39; presence in the training data</p>
</li>
<li><p>this framework tests an LLM&#39;s ability to evaluate contextual plausibility, such that the exact same target&#39;s plausibility changes depending on the context</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<ul>
<li><p>commonsense benchmark</p>
<ul>
<li>reporting bias in training data</li>
</ul>
</li>
<li><p>Co-occurrence information easily available through perception is often underrepresented in language corpora</p>
<ul>
<li>earlier LLMs failed</li>
</ul>
</li>
<li><p>natural language inference and entailment</p>
<ul>
<li><p>recognizing textual entailment (RTE)</p>
</li>
<li><p>natural language inference (NLI)</p>
</li>
<li><p>EWoK asks for plausibility within a given context $\rightarrow$ it might indicate an entailment</p>
</li>
<li><p>LLMs use heuristics to solve the task rather than genuine understanding</p>
<ul>
<li><p>in EWoK, the task is posed as a minimal pair (one must be preferred over the alternative) $\rightarrow$ making reliance on target plausibility alone impossible</p>
</li>
<li><p>test which item design features drive model performance</p>
</li>
<li><p>test the relationship between LLM performance and surface-level item properties (length, average word frequency, BoW model performance)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>bAbi</p>
<ul>
<li><p>similar design about world knowledge and reasoning</p>
</li>
<li><p>EWoK has a simpler design but is harder in practice</p>
</li>
</ul>
</li>
<li><p>minimal pair design</p>
<ul>
<li><p>SyntaxGym, BLiMP, COMPS</p>
</li>
<li><p>Winograd Schema Challenge</p>
</li>
<li><p>EWoK used minimal pairs of pairs design</p>
<ul>
<li>both context and target sentences have a minimal pair counterpart</li>
</ul>
</li>
</ul>
</li>
<li><p>assessing LM performance</p>
<ul>
<li><p>until 2023, each item&#39;s log probability</p>
<ul>
<li><p>effective at grammatical vs ungrammatical</p>
</li>
<li><p>plausible and implausible</p>
</li>
<li><p>relevant and irrelevant object properties</p>
</li>
</ul>
</li>
<li><p>log probability shows the surface-level properties</p>
</li>
<li><p>Recently, prompting an LLM to rate plausibility directly</p>
<ul>
<li><p>LLMs perform worse with direct prompting than with implicit log probabilities</p>
</li>
<li><p>in EWoK, both log probability and explicit prompting are used</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="3-the-framework">3. The Framework</h1>
<h4 id="item-format">Item Format</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/39c71b23-e747-49bf-b387-0dd134e6cae1/image.png" alt=""></p>
<ul>
<li><p>Each item consists of two minimal pair contexts</p>
<ul>
<li><p>$C_1$ : The piano is in front of Ali. Ali turns <strong>left</strong>.</p>
</li>
<li><p>$C_2$ : The piano is in front of Ali. Ali turns <strong>right</strong>.</p>
</li>
</ul>
</li>
<li><p>Also, there are two target sentences</p>
<ul>
<li><p>$T_1$ : The piano is <strong>right</strong> of Ali.</p>
</li>
<li><p>$T_2$ : The piano is <strong>left</strong> of Ali.</p>
</li>
</ul>
</li>
<li><p>the two target items are juxtaposed such that</p>
<ul>
<li>$P(T_1 \ | \ C_1) &gt; P(T_1 \ | \ C_2)$ and $P(T_2 \ | \ C_1) &lt; P(T_2 \ | \ C_2)$</li>
</ul>
</li>
<li><p>then the base target probabilities $P(T_1)$ and $P(T_2)$ can&#39;t serve as plausibility cues $\rightarrow$ the model should rely on the given context</p>
</li>
</ul>
<h4 id="domain-and-concenpts">Domain and Concenpts</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/262b900c-2692-409f-a09a-a444ffd43787/image.png" alt=""></p>
<h4 id="dataset-generation-procedure">Dataset generation procedure</h4>
<ul>
<li><p>each concept is associated with several items that test knowledge of the concept (mostly contrasting with another concept)</p>
</li>
<li><p>flexible but controlled manner</p>
</li>
<li><p>atomic units and combination rules $\rightarrow$ generation of templates with fillers</p>
</li>
</ul>
<h4 id="contexts-and-targets">Contexts and Targets</h4>
<ul>
<li><p>target : a simple sentence that incorporates a concept</p>
</li>
<li><p>contrasting target pair is generated by</p>
<ul>
<li><p>concept swap</p>
<ul>
<li><p>{agent 1} is to the left of {agent 2}</p>
</li>
<li><p>{agent 1} is to the right of {agent 2}</p>
</li>
</ul>
</li>
<li><p>variable swap</p>
<ul>
<li><p>{agent 1} is to the left of {agent 2}</p>
</li>
<li><p>{agent 2} is to the left of {agent 1}</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>context pair : one or more minimal pairs of sentences that are paired with a target pair</p>
<ul>
<li><p>$C_1$ only matches with $T_1$ and $C_2$ only matches with $T_2$</p>
</li>
<li><p>typically an opposite concept pair (left/right) or single concept (left, with variable swap)</p>
</li>
</ul>
</li>
<li><p>contrasting concept pair is generated by </p>
<ul>
<li><p>filler swap</p>
<ul>
<li>use contrasting fillers</li>
</ul>
</li>
<li><p>variable swap</p>
<ul>
<li>changes the positions of two entities of the same kind</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4 id="templates-and-fillers">Templates and Fillers</h4>
<ul>
<li><p>Each collection of concepts, contexts, targets can be compiled into a set of templates</p>
</li>
<li><p>partial items with typed variables describing the range of fillers (a toy compile sketch follows this list)</p>
<ul>
<li><p>{object2: can_bounce=True} bounced off {object1} from below</p>
</li>
<li><p>object1 can be the desk or the crate</p>
</li>
<li><p>object2 should be the object marked with can_bounce=True (the ball, the tire)</p>
</li>
<li><p>500 filler items across 13 classes with 28 type restrictions</p>
</li>
</ul>
</li>
<li><p>users can specify various custom parameters</p>
<ul>
<li><p>number of items to generate from each template</p>
<ul>
<li>full set of items $\rightarrow$ &quot;version&quot;</li>
</ul>
</li>
<li><p>whether fillers should be held constant across all items in a version</p>
</li>
<li><p>apply transformations to filler restrictions at compile-time</p>
<ul>
<li><p>agent $\rightarrow$ agent:western=False</p>
</li>
<li><p>object $\rightarrow$ nonword</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>this allows controlled experimentation of the features</p>
</li>
</ul>
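<p>A toy sketch of how templates with typed fillers could be compiled into concrete items (an illustration of the idea, not the released EWoK pipeline; the filler attributes and names are invented):</p>
<pre><code class="language-python">import itertools, re

fillers = {
    "object": [{"name": "the ball",  "can_bounce": True},
               {"name": "the tire",  "can_bounce": True},
               {"name": "the desk",  "can_bounce": False},
               {"name": "the crate", "can_bounce": False}],
}

template = "{object2: can_bounce=True} bounced off {object1} from below"

def compile_template(template, fillers):
    # slot spec: class name + index, optionally ": attribute=value"
    slots = re.findall(r"\{(\w+?)(\d+)(?::\s*(\w+)=(\w+))?\}", template)
    pools = []
    for cls, _, attr, val in slots:
        pool = fillers[cls]
        if attr:                                    # apply the type restriction
            pool = [f for f in pool if str(f.get(attr)) == val]
        pools.append(pool)
    for combo in itertools.product(*pools):
        names = [f["name"] for f in combo]
        if len(set(names)) != len(names):           # distinct fillers per item
            continue
        out = template
        for (cls, idx, attr, val), f in zip(slots, combo):
            spec = f"{{{cls}{idx}" + (f": {attr}={val}" if attr else "") + "}"
            out = out.replace(spec, f["name"])
        yield out

for item in compile_template(template, fillers):
    print(item)
</code></pre>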
<h1 id="4-evaluation">4. Evaluation</h1>
<ul>
<li><p>with this framework, EWoK-CORE-1.0 is released by generating 5 unique fixed substitutions of filler items across 880 templates from 11 domains</p>
</li>
<li><p>evaluated with LogProb and two prompt-based methods LIKERT, CHOICE</p>
<ul>
<li>LogProb outperforms the direct prompting</li>
</ul>
</li>
<li><p>for the prompt-based evaluations</p>
<ul>
<li>collected data from LLMs and humans using paired identical prompts</li>
</ul>
</li>
</ul>
<h2 id="41-scoring-metrics">4.1. Scoring Metrics</h2>
<ul>
<li><p>LogProbs</p>
<ul>
<li><p>token-level LLM probabilities with sum of conditional log probs of each token</p>
</li>
<li><p>$\log P_{\theta}(T \ | \ C) = \sum_{k=1}^n \log P_{\theta}(\mathbf{t}_k \ | \ C, \mathbf{t}_{&lt;k})$</p>
</li>
</ul>
</li>
<li><p>LIKERT</p>
<ul>
<li>participants are prompted to rate the plausibility of each $C_i$ and $T_j$ pair on a 1-5 scale</li>
</ul>
</li>
<li><p>CHOICE</p>
<ul>
<li><p>participants are given $C_1$, $C_2$ and a single target $T$</p>
</li>
<li><p>participants should choose between $C_1$ and $C_2$ which better matches with $T$</p>
</li>
</ul>
</li>
<li><p>the metric for correctness for a given item is the recovery of the designed item structure</p>
<ul>
<li><p>$score(T_1 \ | \ C_1) &gt; score(T_1 \ | \ C_2)$ and $score(T_2 \ | \ C_1) &lt; score(T_2 \ | \ C_2)$</p>
</li>
<li><p>what counts as the score differs by method (a sketch combining LogProbs with this rule follows this list)</p>
</li>
</ul>
</li>
<li><p>find both $C, T$ matches $\rightarrow$ 1.0 (full point)</p>
</li>
<li><p>find only one match $\rightarrow$ 0.5 (half point)</p>
<ul>
<li>in LIKERT, this is the case where the model gave the same rating to both</li>
</ul>
</li>
<li><p>trivial 50% baseline for all scenarios</p>
</li>
</ul>
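<p>A hedged sketch of the LogProbs metric and the item-level accuracy rule above (my own reimplementation with gpt2 as a stand-in model, not the official EWoK evaluation code):</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob_of_target(context, target):
    """sum_k log P(t_k | C, t_&lt;k), summed over the target tokens only"""
    ctx_ids = tok(context + " ", return_tensors="pt").input_ids
    tgt_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logprobs = lm(ids).logits.log_softmax(-1)
    total = 0.0
    for k in range(tgt_ids.shape[1]):
        pos = ctx_ids.shape[1] + k                          # position of target token k
        total += logprobs[0, pos - 1, ids[0, pos]].item()   # predicted from the previous position
    return total

C1 = "The piano is in front of Ali. Ali turns left."
C2 = "The piano is in front of Ali. Ali turns right."
T1 = "The piano is right of Ali."
T2 = "The piano is left of Ali."

def item_accuracy(C1, C2, T1, T2, score=logprob_of_target):
    hit1 = score(C1, T1) &gt; score(C2, T1)   # T1 should prefer C1
    hit2 = score(C2, T2) &gt; score(C1, T2)   # T2 should prefer C2
    return (hit1 + hit2) / 2               # 1.0, 0.5 or 0.0

print(item_accuracy(C1, C2, T1, T2))
</code></pre>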
<h2 id="42-models">4.2. Models</h2>
<ul>
<li><p>20 transformer LMs</p>
</li>
<li><p>1.3B-70B parameters and different pretraining diets</p>
</li>
<li><p>13 dense pretrained transformers</p>
</li>
<li><p>4 instruction-tuned</p>
</li>
<li><p>2 chat fine-tuned</p>
</li>
<li><p>1 MoE</p>
</li>
<li><p>the model doesn&#39;t require specific formatting</p>
</li>
</ul>
<h2 id="43-surface-level-item-properties">4.3. Surface-level item properties</h2>
<ul>
<li><p>baseline: BoW with word2vec</p>
</li>
<li><p>scored with Cosine-Similarity</p>
</li>
<li><p>also tested LLMs against the number of words in each item and the average word frequency in an item (from Google Ngrams)</p>
</li>
</ul>
<h2 id="44-human-data">4.4. Human Data</h2>
<ul>
<li><p>1262 participants (591 female, 579 male, 27 other)</p>
</li>
<li><p>median age 36</p>
</li>
<li><p>US residents with English as their first language</p>
</li>
<li><p>participants with poor agreement with others were excluded</p>
</li>
</ul>
<h1 id="5-release-considerations">5. Release Considerations</h1>
<ul>
<li><p>reduce the chances of accidental incorporation of EWoK into LLMs&#39; training data</p>
</li>
<li><p>promote accountability and reporting when such incorporation is done intentionally</p>
</li>
</ul>
<h1 id="6-experiments">6. Experiments</h1>
<h4 id="ewok-core-10-is-challenging-for-llms">EWoK-CORE-1.0 is challenging for LLMs</h4>
<ul>
<li><p>even larger models generally perform much below humans</p>
</li>
<li><p>the best, falcon-40b-instruct, got 0.80 while humans got 0.95</p>
</li>
<li><p>instruction tuning doesn&#39;t affect performance under LogProbs</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4ae6c584-63b6-477a-9c5d-cdcde7690417/image.png" alt=""></p>
<h4 id="performance-vaires-drastically-by-domain">Performance vaires drastically by domain</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8c84e657-2f28-4203-a2f6-6e2dd55b6977/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d1f52845-f180-4ea7-b0fe-7c3682246498/image.png" alt=""></p>
<ul>
<li><p>domain difficulty is consistent across LLMs</p>
</li>
<li><p>heterogeneous performance of the <strong>phi</strong> models</p>
<ul>
<li><p>phi-1 is the worst</p>
</li>
<li><p>phi-1.5 outperforms all models and even humans on physical dynamics</p>
</li>
<li><p>phi-2 ranges from on par with the largest models on some domains to worse than gpt2-xl on spatial relations</p>
</li>
<li><p>possibly due to their unique training procedure (synthetic data)</p>
</li>
</ul>
</li>
</ul>
<h4 id="llms-show-heterogeneous-performance-across-dataset-versions">LLMs show heterogeneous performance across dataset versions</h4>
<ul>
<li><p>in principle, these variables should not affect the results</p>
</li>
<li><p>phi-2 and phi-1.5 showed the largest performance range</p>
</li>
<li><p>humans showed somewhat heterogeneous performance too (driven only by a subset of the domains)</p>
</li>
</ul>
<h4 id="domain-content-item-design-features-and-surface-level-item-features-all-affect-llm-performance">Domain content, item design features, and surface-level item features all affect LLM performance</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/79e8bdd9-8366-41d9-b2bd-88492ec8bd7a/image.png" alt=""></p>
<ul>
<li><p>they affected the performance, often in different ways than they affect humans</p>
</li>
<li><p>the BoW baseline is predictive of LLM performance but not of human performance</p>
</li>
<li><p>the number of words in an item negatively affects LLM performance but not human performance</p>
</li>
<li><p>word frequency negatively affects both LLM and human performance</p>
<ul>
<li>this is because the hardest two domain (physical-relations and spatial relations) have the highest word frequency</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d113dbb4-22de-46a6-85ac-41af24a7b684/image.png" alt=""></p>
<ul>
<li><p>jointly modeled all features using a mixed-effects regression</p>
<ul>
<li><p>word frequency has a significant positive effect</p>
</li>
<li><p>the number of words has a significant negative effect</p>
</li>
<li><p>domain remained a significant predictor of performance</p>
</li>
</ul>
</li>
</ul>
<h4 id="logprobs-yield-higher-accuracy-than-prompting">LogProbs yield higher accuracy than prompting</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/cfa0de7f-3c47-4e40-b164-7b923550cff9/image.png" alt=""></p>
<ul>
<li>the gap was large in smaller models</li>
</ul>
<h4 id="human-ratings-are-often-but-not-always-accurate">Human ratings are often but not always accurate</h4>
<ul>
<li><p>sometimes the discrepancies between human ratings and experimental labels resulted from specific fillers changing the plausibility</p>
<ul>
<li><p>The cooler is inside the car. Chao cannot see the cooler</p>
</li>
<li><p>this is implausible as the cooler is large and the car has windows</p>
</li>
<li><p>but with a small object and a container without windows it would be plausible</p>
</li>
</ul>
</li>
<li><p>Human made mistakes</p>
<ul>
<li><p>The bakery is north of Chao. Chao turns around. The bakery is south of Chao.</p>
</li>
<li><p>this is implausible as cardinal directions don&#39;t depend on the agent&#39;s orientation</p>
</li>
</ul>
</li>
</ul>
<h1 id="7-discussion">7. Discussion</h1>
<ul>
<li><p>the goal was to develop a dataset</p>
<ul>
<li><p>uses a uniform item format to probe diverse domains of physical and social knowledge</p>
</li>
<li><p>contains items that probe specific concepts</p>
</li>
<li><p>requires integrating information across sentences</p>
</li>
<li><p>consists of generic templates that can be used to generate a wide variety of items</p>
</li>
</ul>
</li>
<li><p>presented evaluation results</p>
</li>
<li><p>EWoK-CORE-1.0 is moderately challenging for LLMs</p>
</li>
<li><p>LogProbs contain enough information for most LLMs</p>
</li>
<li><p>Future Work</p>
<ul>
<li><p>Targeted experiments</p>
<ul>
<li>the flexibility of the framework allows specific experiments using customized sets of fillers</li>
</ul>
</li>
<li><p>Interpretability research</p>
<ul>
<li>Knowledge Editing research to basic physical and social concepts</li>
</ul>
</li>
<li><p>From elements to world models</p>
<ul>
<li><p>for a model to function as a flexible and robust general-purpose AI system, it needs to be able to construct, maintain and update internal world models</p>
</li>
<li><p>whether and how LLMs use internal world models is under ongoing investigation</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Limitations</p>
<ul>
<li><p>written in English</p>
</li>
<li><p>same prompting setup for all models</p>
<ul>
<li>with tailored prompt engineering, the performance can improve</li>
</ul>
</li>
<li><p>some items are semantically weird</p>
<ul>
<li>due to the synthetic nature of dataset</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="8-conclusion">8. Conclusion</h1>
<ul>
<li>EWoK provides a way to evaluate the fundamental elements of world knowledge</li>
</ul>
<h1 id="9-comment">9. Comment</h1>
<p>A dataset built to test a model&#39;s &#39;understanding&#39; of the real world. It looks general-purpose and carefully constructed, but an even more appropriate evaluation method would be welcome.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[LayerSkip : Enabling Early Exit Inference and Self-Speculative Decoding]]></title>
            <link>https://velog.io/@0404_not_found/LayerSkip-Enabling-Early-Exit-Inference-and-Self-Speculative-Decoding</link>
            <guid>https://velog.io/@0404_not_found/LayerSkip-Enabling-Early-Exit-Inference-and-Self-Speculative-Decoding</guid>
            <pubDate>Wed, 15 May 2024 13:30:23 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLM Acceleration</p>
<ul>
<li><p>sparsity</p>
</li>
<li><p>quantization</p>
</li>
<li><p>head pruning</p>
</li>
</ul>
</li>
<li><p>Reducing the number of layers for each token by <strong>exiting early during inference</strong></p>
</li>
<li><p>Speculative decoding</p>
<ul>
<li><p>main model + draft model</p>
</li>
<li><p>larger memory footprint and complexity</p>
</li>
<li><p>faster inference</p>
<p>$\rightarrow$ <strong>Self-Speculative Decoding</strong></p>
</li>
</ul>
</li>
<li><p>contribution</p>
<ul>
<li><p>training recipe that combines layer dropout and early exit loss</p>
</li>
<li><p>the recipe more robust to exiting at earlier layers of the model, essentially creating different sized sub-models within the same model</p>
</li>
<li><p>self-speculative decoding solution that decodes with earlier layers and verifies and corrects with later layers</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/607ebae5-3d5e-48d4-895e-4fd14f90caf9/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="2-motivation">2. Motivation</h1>
<h2 id="21-exiting-earlier-in-llms">2.1. Exiting Earlier in LLMs</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f3b7df6d-b84f-4106-99d6-0ce8a918a15e/image.png" alt=""></p>
<ul>
<li><p>Fig 2a -&gt; Llama1 7B + HumanEval coding dataset</p>
</li>
<li><p>projected each layer&#39;s output embeddings on the LM head + softmax $\rightarrow$ got the index of the output element (Unembedding)</p>
<ul>
<li><p>token predictions in earlier layers appear to be irrelevant</p>
</li>
<li><p>in later layers, token predictions converge to the final prediction</p>
</li>
<li><p>most of the time, the final token prediction is reached only a few layers before the end</p>
</li>
<li><p>intermediate layers are sometimes hesitant and change their mind</p>
</li>
<li><p>a token requires 23.45 layers out of the model&#39;s 32 layers</p>
<p>$\rightarrow$ need to make the model use fewer layers</p>
<p>$\rightarrow$ make the model not hesitate and change its mind</p>
</li>
</ul>
</li>
<li><p>skipping layers during training (dropout)</p>
<ul>
<li>higher rate for later layers and lower rates for earlier layers</li>
</ul>
</li>
<li><p>unembedding </p>
<ul>
<li><p>typically LLMs are trained to unembed at the last transformer layer</p>
</li>
<li><p>need to add a loss function during training to make the LM heads <strong>understand</strong> embeddings of earlier layers</p>
</li>
<li><p>shared LM head to early exit</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/983929cd-b5cd-47bb-8f71-619a0a80ee0e/image.png" alt=""></p>
</li>
<li><p>makes the LM head act as an ensemble over sub-models of different depths that share the same weights</p>
</li>
</ul>
</li>
</ul>
<h2 id="22-correcting-if-we-exit-too-early">2.2. Correcting if we exit too early</h2>
<ul>
<li><p>exiting early can reduce the accuracy</p>
<ul>
<li><p>needs a way to verify if an early prediction is accurate and correct it by using remaining layers</p>
</li>
<li><p>Self-speculative decoding</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-related-work">3. Related Work</h1>
<h4 id="dropout">Dropout</h4>
<ul>
<li><p>unstructured dropout (original)</p>
</li>
<li><p>large models (Llama, GPT3, PaLM) don&#39;t use it when training on large corpora</p>
</li>
<li><p>enable the training to learn across an ensemble of many models</p>
</li>
<li><p>multiplicative noise</p>
</li>
</ul>
<h4 id="layer-dropout-stochastic-depth">Layer Dropout (stochastic depth)</h4>
<ul>
<li><p>stochastically skipping layers</p>
</li>
<li><p>LayerDrop in LMs $\rightarrow$ robustness</p>
</li>
<li><p>layer dropout for training decoder-only models or scaling LMs has not been explored</p>
</li>
</ul>
<h4 id="early-exit">Early Exit</h4>
<ul>
<li><p>branch modules at different exit points in a deep learning network + additional loss</p>
</li>
<li><p>in LMs, early exit in encoder-only models was explored</p>
</li>
<li><p>dedicated LM head for each decoder layer</p>
</li>
<li><p>SkipDecode</p>
</li>
<li><p>additional FC layer</p>
</li>
</ul>
<h4 id="speculative-decoding">Speculative Decoding</h4>
<ul>
<li><p>auto-regressive decoding is slow while measuring the likelihood of a group of generated tokens in parallel is faster</p>
</li>
<li><p>draft model (fast, less accurate) to generate tokens and verify and correct with main (slow, more accurate) model</p>
</li>
</ul>
<h1 id="4-proposed-solution">4. Proposed solution</h1>
<h2 id="41-training-using-layer-dropout--early-exit-loss">4.1. Training using Layer Dropout &amp; Early Exit Loss</h2>
<ul>
<li><p>Notation</p>
<ul>
<li><p>model $X$</p>
</li>
<li><p>output $Y$</p>
</li>
<li><p>token embeddings $x_0$</p>
</li>
<li><p>number of layers $L$</p>
</li>
<li><p>$x_{l+1} = x_l + f_l (x_l)$</p>
</li>
<li><p>final LM head maps the embedding outputs to logits $e_L = g(x_L)$</p>
</li>
<li><p>BCE loss = $J_{\text{BCE}}(e_L, Y)$</p>
</li>
</ul>
</li>
</ul>
<h3 id="411-layer-dropout">4.1.1. Layer Dropout</h3>
<ul>
<li><p>layer dropout at layer $l$ and iteration $t$</p>
<ul>
<li><p>$x_{l+1, t} = x_{l, t} + M(p_{l, t})f_l(x_{l, t})$</p>
</li>
<li><p>where $M(p)$ is a Bernoulli random variable that is 0 with probability $p$ (i.e. the layer is skipped)</p>
</li>
<li><p>apply dropout on each sample separately within a batch</p>
</li>
<li><p>remove dropped sample and apply transformer operation $f_l$ on the remaining samples</p>
</li>
<li><p>same random seed for GPUs</p>
</li>
</ul>
</li>
<li><p>Dropout rate $p_{l, t} = S(t)D(l)p_{max}$</p>
<ul>
<li><p>$p_{max}$ : hyperparameter</p>
</li>
<li><p>$D(l)$ : per-layer scaling function</p>
</li>
<li><p>$D(l) = e^{{l \ln 2 \over L-1}} - 1$ was the best (growing exponentially)</p>
</li>
<li><p>$S(t)$ : per-time step scaling function</p>
</li>
<li><p>for fine-tuning or continual training of a pre-trained model, $S(t) = 1$ was the best (the schedule is sketched after this list)</p>
</li>
<li><p>for pretraining from scratch, $S(t) = e^{{t \ln 2 \over T-1}} - 1$ was the best</p>
</li>
</ul>
</li>
</ul>
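<p>A small sketch of the layer-dropout schedule defined above, $p_{l, t} = S(t)D(l)p_{max}$, with the exponential per-layer curve $D(l)$ and the two choices of $S(t)$ (my reading of the formulas; the default values below mirror the Llama2 7B continual-pretraining config from the paper):</p>
<pre><code class="language-python">import math

def D(l, L):
    """per-layer scale: 0 at the first layer, 1 at the last layer"""
    return math.exp(l * math.log(2) / (L - 1)) - 1.0

def S(t, T, from_scratch=False):
    """per-timestep scale: constant for fine-tuning / continual pretraining,
    exponential ramp-up when pretraining from scratch"""
    if not from_scratch:
        return 1.0
    return math.exp(t * math.log(2) / (T - 1)) - 1.0

def dropout_rate(l, t, L=32, T=50_000, p_max=0.1, from_scratch=False):
    return S(t, T, from_scratch) * D(l, L) * p_max

for l in [0, 8, 16, 24, 31]:
    print(l, round(dropout_rate(l, t=0, L=32), 4))
</code></pre>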
<h3 id="412-early-exit-loss">4.1.2. Early Exit Loss</h3>
<ul>
<li><p>LM head $g$ should be capable of unembedding outputs of different layers</p>
</li>
<li><p>During training, supervise the model directly to connect the early exit layers to the LM head</p>
<ul>
<li><p>$J(X, Y, t) = \displaystyle \sum_{l=0}^{l = L-1} \tilde{e}(t, l) J_{\text{BCE}}(g(x_{l+1}), Y)$</p>
</li>
<li><p>$\tilde{e}(t, l) = {C(t,l)e(l) \over \sum_{i=0}^{i=L-1} C(t,i)e(i)}$, normalized per-layer loss scale</p>
</li>
<li><p>$C(t, l)$ : Binary curriculum function that determines if we enable early exit of layer $l$ at iteration $t$</p>
</li>
<li><p>$$ e(l) = 
\begin{cases}
e_{scale} \sum_{i=0}^{i=l} i \quad &amp;\text{if } 0 \le l &lt; L-1 \\
L-1 + e_{scale} \sum_{i=0}^{i=L-2} i \quad &amp;\text{if } l = L-1
\end{cases}
$$</p>
</li>
<li><p>the scale increases across layers</p>
</li>
<li><p>the scale at one layer is proportional to the sum of the scales of all previous layers</p>
</li>
<li><p>penalize later layers with quadratically higher weight (predicting in later layers is easier)</p>
</li>
<li><p>$0 \ \le e_{scale} \ \le 1$ is a hyperparameter</p>
</li>
</ul>
</li>
</ul>
<h4 id="early-exit-loss-curriculum">Early Exit Loss Curriculum</h4>
<ul>
<li><p>adding the early exit loss of all layers at all iterations slows down training and reduces accuracy</p>
</li>
<li><p>use $C(t, l)$</p>
<ul>
<li><p>rotational early exit curriculum $C_{\text{rot}, R}$</p>
<ul>
<li><p>enable early exit at every $R$ layers</p>
</li>
<li><p>only $\lceil L/R \rceil$ unembedding operations are applied</p>
</li>
</ul>
</li>
<li><p>gradual early exit curriculum $C_{\text{grad}}$</p>
<ul>
<li>gradually enable early exit loss from layers $L-1$ to 0, one layer at a time every $T/2L$ iterations</li>
</ul>
</li>
</ul>
</li>
</ul>
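<p>A sketch combining the per-layer loss scale $e(l)$, the rotational curriculum $C_{\text{rot}, R}$ and the normalized weighted sum of per-layer losses (my reading of the formulas above, with a plain token-level cross-entropy standing in for $J_{\text{BCE}}$ and random tensors as a smoke test):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def e_scale_per_layer(L, e_scale):
    e = [e_scale * sum(range(l + 1)) for l in range(L - 1)]   # 0 &lt;= l &lt; L-1
    e.append((L - 1) + e_scale * sum(range(L - 1)))           # l = L-1
    return torch.tensor(e, dtype=torch.float)

def rotational_curriculum(t, L, R):
    """enable early exit at every R-th layer, rotating with iteration t;
    the final layer is always enabled"""
    enabled = torch.zeros(L)
    enabled[(t % R)::R] = 1.0
    enabled[L - 1] = 1.0
    return enabled

def early_exit_loss(hidden_states, lm_head, labels, t, e_scale=0.2, R=8):
    """hidden_states: list of L tensors (batch, seq, dim), one per layer output"""
    L = len(hidden_states)
    C = rotational_curriculum(t, L, R)
    e = e_scale_per_layer(L, e_scale)
    w = (C * e) / (C * e).sum()                   # normalized per-layer loss scale
    loss = 0.0
    for l, h in enumerate(hidden_states):
        if C[l] == 0:
            continue                              # no unembedding for disabled layers
        logits = lm_head(h)
        loss = loss + w[l] * F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return loss

# tiny smoke test with random tensors
L_layers, vocab, dim = 8, 100, 16
hs = [torch.randn(2, 5, dim) for _ in range(L_layers)]
head = torch.nn.Linear(dim, vocab)
labels = torch.randint(0, vocab, (2, 5))
print(early_exit_loss(hs, head, labels, t=0, e_scale=0.2, R=4))
</code></pre>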
<h4 id="hyperparameter-summary">Hyperparameter Summary</h4>
<ul>
<li><p>Layer Dropout</p>
<ul>
<li><p>$p_{max}$ : max dropout rate of last layer of the model</p>
</li>
<li><p>$S(t)$: layer dropout curriculum</p>
</li>
</ul>
</li>
<li><p>Early Exit Loss</p>
<ul>
<li><p>$e_{scale}$: scalar scale of loss of earlier layers</p>
</li>
<li><p>$C(t,l)$: early exit loss curriculum</p>
</li>
</ul>
</li>
</ul>
<h2 id="42-inference-using-early-exit">4.2. Inference using Early Exit</h2>
<ul>
<li><p>run the first $E$ transformer layers and skip to the model&#39;s LM head</p>
</li>
<li><p>the final output is $g(x_E)$</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/46049f50-9028-43a7-96c4-01e1584fdd07/image.png" alt=""></p>
<h2 id="43-inference-using-self-speculative-decoding">4.3. Inference using Self-Speculative Decoding</h2>
<ul>
<li><p>Self-speculative decoding</p>
<ul>
<li><p>uses a single model while keeping the latency benefits of traditional speculative decoding</p>
</li>
<li><p>Self Drafting and Self-Verification</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a0ad0bae-a87b-4a3b-9c0c-13bfb37a93a3/image.png" alt=""></p>
</li>
<li><p>Self Drafting: using the early exit to draft tokens</p>
</li>
<li><p>Self Verification: using the remaining layers to validate the prediction</p>
</li>
<li><p>Cache Reuse : unifies the KV cache and stores the exit query</p>
</li>
</ul>
</li>
</ul>
<h3 id="431-self-drafting">4.3.1. Self-Drafting</h3>
<ul>
<li><p>compute the first $d$ draft tokens through early exit</p>
<ul>
<li><p>leverage a subset of the LLM and conduct auto-regressive inference exiting at layer $E$</p>
</li>
<li><p>train the model once to get an ensemble of different candidate draft models at each layer depth</p>
</li>
</ul>
</li>
</ul>
<h3 id="432-self-verification">4.3.2. Self-Verification</h3>
<ul>
<li><p>leverages the full LLM to predict the next token for each draft token in a single forward pass</p>
</li>
<li><p>find the point where the draft tokens and verified tokens agree</p>
</li>
<li><p>all the draft tokens up to the disagreement point are added to the output along with the next verified token, and drafting continues from there</p>
</li>
<li><p>verification only computes the remaining $L-E$ layers (a toy version of the loop is sketched after this list)</p>
</li>
</ul>
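<p>A runnable toy of the draft-then-verify loop described above, using GPT-2 as a stand-in: it drafts greedily from an intermediate layer&#39;s hidden state through the shared LM head, then verifies the draft with one full forward pass. This only illustrates the accept/correct logic; the real implementation stops drafting at layer $E$, runs only the remaining $L-E$ layers for verification and reuses the KV and exit-query caches.</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
E, d = 6, 4                                     # exit layer and number of draft tokens

def logits_at_layer(ids, layer):
    h = model(ids, output_hidden_states=True).hidden_states[layer]
    return model.lm_head(model.transformer.ln_f(h))   # shared LM head on an early layer

@torch.no_grad()
def self_speculative_step(ids):
    # self-drafting: greedy decoding that exits at layer E
    draft = ids
    for _ in range(d):
        nxt = logits_at_layer(draft, E)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    # self-verification: one full-model pass scores every draft position at once
    full = model(draft).logits.argmax(-1)
    accepted = ids
    for i in range(ids.shape[1], draft.shape[1]):
        verified = full[:, i - 1:i]             # prediction of the full model for position i
        accepted = torch.cat([accepted, verified], dim=-1)
        if verified.item() != draft[0, i].item():
            break                               # first disagreement: keep the corrected token, stop
    return accepted

ids = tok("The layer dropout rate", return_tensors="pt").input_ids
print(tok.decode(self_speculative_step(ids)[0]))
</code></pre>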
<h3 id="433-reusing-the-cache">4.3.3. Reusing the Cache</h3>
<ul>
<li><p>avoid recomputing prior KV pairs in each layer</p>
</li>
<li><p>Single KV Cache</p>
<ul>
<li>first $E$ layers are shared in two stages</li>
</ul>
</li>
<li><p>Exit Query Cache</p>
<ul>
<li><p>saves the query vector of exit layer $E-1$ for verification to directly continue from layer $E$</p>
</li>
<li><p>save only the query for the exit layer</p>
</li>
</ul>
</li>
</ul>
<h1 id="5-experiments">5. Experiments</h1>
<ul>
<li><p>Continual Pretraining</p>
<ul>
<li><p>continue training with 52B tokens</p>
</li>
<li><p>text + code</p>
</li>
<li><p>Llama2 7B (32 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=8}$</p>
</li>
</ul>
</li>
<li><p>Llama2 13B (40 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.1$</p>
</li>
<li><p>$C_{\text{rot}, R=39}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Pretraining from scratch</p>
<ul>
<li><p>26B tokens</p>
</li>
<li><p>text + code</p>
</li>
<li><p>Llama2 1.5B (24 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=23}$</p>
</li>
</ul>
</li>
<li><p>Llama2 7B (32 layers)</p>
<ul>
<li><p>$p_{max} = 0.2$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=31}$</p>
</li>
</ul>
</li>
<li><p>higher LR when layer dropout $&gt; 0.0$ is used</p>
</li>
</ul>
</li>
<li><p>Fine-tuning on Code</p>
<ul>
<li><p>5.2B tokens</p>
</li>
<li><p>Llama1 7B</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 1.0$</p>
</li>
<li><p>$C_{\text{rot}, R=16}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Fine-tuning on Task-Specific Dataset</p>
<ul>
<li><p>TOPv2 dataset</p>
</li>
<li><p>Llama 1.5B (24 layers)</p>
<ul>
<li><p>$p_{max} = 0.2$</p>
</li>
<li><p>$e_{scale} = 1.0$</p>
</li>
<li><p>$C_{\text{grad}}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>tried LD, EE, LD+EE</p>
</li>
</ul>
<h1 id="6-results">6. Results</h1>
<h2 id="61-early-exit-inference-results">6.1. Early Exit Inference Results</h2>
<h4 id="continual-pretraining">Continual Pretraining</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/218bbbed-9e1e-436b-9675-43768b70d36c/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/89254e52-1517-4042-a1cf-e3b9972c6086/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f7fa276d-d5e0-4046-9631-35388fd91a31/image.png" alt=""></p>
<ul>
<li><p>LayerSkip is better than the baseline</p>
</li>
<li><p>for the last layer accuracy, LayerSkip has minimal drop in accuracy</p>
</li>
<li><p>some classification tasks (multiple choice, TF) $\rightarrow$ maintain relatively decent accuracy on earlier layers</p>
</li>
<li><p>generation task $\rightarrow$ drop drastically</p>
</li>
<li><p>classification is evaluated on one token while generation is evaluated on many tokens</p>
</li>
<li><p>in MMLU, Llama2 13B baseline dropped from 55.2 to 49.2</p>
</li>
<li><p>NaturalQuestions $\rightarrow$ LayerSkip&#39;s accuracy is higher at middle layer</p>
</li>
</ul>
<h4 id="pretraining-from-scratch">Pretraining from Scratch</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1e62ad4d-a614-44f5-81f3-28f0a6cca495/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d5721ab2-94ce-42d7-b3dd-a8494dc91f11/image.png" alt=""></p>
<ul>
<li><p>on the last layer in some downstream tasks, a slight drop in accuracy is seen</p>
<ul>
<li>small number of pretraining tokens $\rightarrow$ some tasks were close to random guessing</li>
</ul>
</li>
</ul>
<h4 id="finetuning-on-code-data">Finetuning on Code Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/36f7dd26-3300-4cfa-b0e7-3f70585f994f/image.png" alt=""></p>
<ul>
<li><p>Fig 10a</p>
</li>
<li><p>earlier layers are better than the baseline</p>
</li>
<li><p>LD+EE shows a big improvement</p>
</li>
<li><p>since this is domain-specific data, $e_{scale}$ was scaled up to 1.0</p>
</li>
</ul>
<h4 id="finetuning-on-task-specific-dataset">Finetuning on Task-Specific Dataset</h4>
<ul>
<li><p>Fig 10b</p>
</li>
<li><p>when layers are removed from the baseline, the model is not able to generate complete and accurate parses $\rightarrow$ 0 EM</p>
</li>
<li><p>LayerSkip shows 77% at layer 12</p>
</li>
<li><p>regression in the final layer reducing accuracy by 3%</p>
</li>
</ul>
<h2 id="62-self-speculative-decoding-results">6.2. Self-Speculative Decoding Results</h2>
<ul>
<li><p>used EM, ROUGE-2</p>
</li>
<li><p>compared with common models and tasks in Draft &amp; Verify</p>
</li>
<li><p>used greedy decoding and max 512 tokens</p>
</li>
</ul>
<h4 id="continual-pretraining-1">Continual Pretraining</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/08d30d3f-969b-4419-a93a-a9669582c3ac/image.png" alt=""></p>
<ul>
<li>higher speedups for the smaller model</li>
</ul>
<h4 id="pretraining-from-scratch-1">Pretraining from Scratch</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c5837a34-1c32-4244-9931-5f94e1511b9b/image.png" alt=""></p>
<h4 id="finetuning-on-code-data-1">Finetuning on Code Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1d083942-cc3e-47eb-9c14-2b494eb14e00/image.png" alt=""></p>
<h4 id="finetuning-on-task-specific-data">Finetuning on Task-Specific Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/cdd5f92c-e678-4035-8620-5b756b294e86/image.png" alt=""></p>
<h1 id="7-ablation-studies">7. Ablation Studies</h1>
<h4 id="scaling-with-pretraining-tokens">Scaling with Pretraining Tokens</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d15ca60d-2d89-469e-bc6c-385bff7bacc0/image.png" alt=""></p>
<ul>
<li><p>50000 steps</p>
</li>
<li><p>batch size per device: 4</p>
</li>
<li><p>context window: 4096</p>
</li>
<li><p>number of GPUs: 32, 64, 128</p>
</li>
<li><p>middle layer PPL increases by default (w/o EE)</p>
</li>
<li><p>could open door about the dynamics of transformers</p>
</li>
</ul>
<h4 id="kv-cache-in-self-speculation">KV Cache in Self-Speculation</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/28970675-bb5b-4060-aaf3-b47aa17398f2/image.png" alt=""></p>
<ul>
<li>use of KV cache is able to consistently save 9-20ms per token</li>
</ul>
<h1 id="8-limitations">8. Limitations</h1>
<ul>
<li><p>unlike prior self-speculative decoding work that doesn&#39;t require changing a model&#39;s weights, this approach needs training or fine-tuning with the recipe</p>
</li>
<li><p>$p_{max}$, $e_{scale}$, $R$ need to be tuned</p>
</li>
<li><p>when pretraining with layer dropout from scratch, a higher LR is needed and tuning the LR is tricky</p>
</li>
</ul>
<h1 id="9-conclusion">9. Conclusion</h1>
<ul>
<li><p>layer dropout + early exit loss improves accuracy and speed</p>
</li>
<li><p>hope this to be combined with PEFT</p>
</li>
<li><p>in the future, increasing the accuracy of early-exit layers and exploring dynamic conditions to determine a different exit layer can be done</p>
</li>
</ul>
<h1 id="10-comment">10. Comment</h1>
<p>Unlike pruning, this technique trains and runs inference using only selected layers, and makes good use of the remaining layers to implement Self-Speculative Decoding. It seems that, in the end, not every layer of a Transformer is needed. Feels similar in spirit to CoT without prompting.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Octopus v4: Graph of Language Models]]></title>
            <link>https://velog.io/@0404_not_found/Octopus-v4-Graph-of-Language-Models</link>
            <guid>https://velog.io/@0404_not_found/Octopus-v4-Graph-of-Language-Models</guid>
            <pubDate>Sun, 05 May 2024 02:17:48 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5d2ff526-a698-4ae7-a8ad-af1bf1364133/image.png" alt=""></p>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLMs became very powerful and used in lots of fields</p>
</li>
<li><p>Due to Llama 2 and 3, open-source LLMs have seen significant growth</p>
<ul>
<li>user may select the optimal model based on the use case</li>
</ul>
</li>
<li><p>Graph data structure</p>
<ul>
<li><p>can be used to represent the relationships between models, the optimal use cases and their capabilities</p>
</li>
<li><p>create a powerful framework for seamless model integration, intelligent query routing and optimized performance</p>
</li>
</ul>
</li>
<li><p>on-device AI models</p>
<ul>
<li><p>enhances security, reduces latency</p>
</li>
<li><p><strong>cloud-on-device collaboration</strong></p>
<ul>
<li><p>seamless integration with cloud-based models</p>
</li>
<li><p>light task for on-device models, complicated task for cloud models</p>
</li>
<li><p>IoT may play a crucial role by connecting a vast network of devices</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-works">2. Related Works</h1>
<h4 id="graph-data-format">Graph data format</h4>
<ul>
<li><p>BFS, DFS</p>
</li>
<li><p>PageRank</p>
</li>
<li><p>GNN</p>
</li>
<li><p>GAT (Graph Attention Networks), GCN (Graph Convolution Networks)</p>
</li>
</ul>
<h4 id="ai-agents-with-functional-tokens">AI agents with functional tokens</h4>
<ul>
<li><p>functional tokens can select suitable models or functions</p>
</li>
<li><p>make synergy with Octopus framework</p>
</li>
<li><p>selects the best neighbor, restructures the information and transmits optimized information</p>
</li>
</ul>
<h4 id="multi-agent-llms">Multi-Agent LLMs</h4>
<ul>
<li><p>harness collective intelligence from specialized agents</p>
</li>
<li><p>integration difficulties, data sharing issues and maintaining smooth coordination between agents</p>
</li>
<li><p>exploring possibilities like cross-domain expertise and real-time collaboration</p>
</li>
<li><p>parallel function calling $\rightarrow$ self-connections</p>
</li>
<li><p>sequential action processing $\rightarrow$ graph traversal</p>
</li>
</ul>
<h4 id="llm-scaling-law">LLM Scaling law</h4>
<ul>
<li>leveraging distributed computing and node expansion to address the scalability issues $\rightarrow$ nearly <strong>unlimited node scalability</strong></li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-lm-for-classification-from-octopus-v2">3.1 LM for classification from Octopus v2</h2>
<ul>
<li><p>functional token in Octopus v2</p>
<ul>
<li>$f$ for the choice from the set $F$, $params$ for the reformulated information derived from the query $q$</li>
</ul>
</li>
</ul>
<p>$$ 
P(f, params  \ | \ q)
$$</p>
<ul>
<li><p>used in selecting the optimal choice, reformulating the query to transmit</p>
<ul>
<li>select the best neighboring nodes, pass the information to subsequent nodes</li>
</ul>
</li>
</ul>
<h2 id="32-lms-as-nodes-in-graph">3.2 LMs as nodes in graph</h2>
<ul>
<li><p>directed and heterogeneous graph $G = (N, E)$</p>
<ul>
<li><p>master nodes $N^m$ : coordinate queries by directing to worker nodes</p>
</li>
<li><p>worker nodes $N^w$ : transfer necessary information for task</p>
</li>
<li><p>master node passes the information and worker nodes handle</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4905f222-b933-47b1-bdc0-927749315c69/image.png" alt=""></p>
<ul>
<li><p>user queries $q$ and responses $y$</p>
<ul>
<li>$P(y \ | \ q) = P(y \ | \ q; G)$</li>
</ul>
</li>
<li><p>single-step task involves only one worker node</p>
<ul>
<li><p>$P(y \ | \ q; G) = P(N^w, q_h \ | \ q; N^m)P(y \ | \ q_h ; N^w)$</p>
</li>
<li><p>the first factor on the right-hand side is the Octopus v2-style selection</p>
</li>
<li><p>uses Octopus v2 to select the best neighboring worker $N^w$ and reformat the query to $q_h$</p>
</li>
<li><p>the second factor is the selected worker computing the result (a toy routing sketch follows this list)</p>
</li>
</ul>
</li>
<li><p>Multi-step task involves several sequential interactions</p>
<ul>
<li>simply expands the formula</li>
</ul>
</li>
</ul>
<p>$$
P(y \ | \ q; G) = \prod_{i=0}^{k-1} P(N_i^w, q_{h_i} \ | \ q; N_i^m)P(y \ | \ q_{h_i}; N_i^w)
$$</p>
<ul>
<li><p>to answer one query from the user, only activating two small models is needed</p>
</li>
<li><p>use functional token to get rid of RAG</p>
</li>
</ul>
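<p>A toy sketch of the master/worker factorization above (hypothetical names and routing rule; in the actual system the master is an Octopus v2-style model that emits a learned functional token and a reformatted query rather than a hand-written dispatch table):</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class WorkerNode:
    name: str
    run: Callable[[str], str]                  # plays the role of P(y | q_h ; N^w)

workers: Dict[str, WorkerNode] = {
    "&lt;nexa_4&gt;": WorkerNode("math_gpt", lambda q: f"[math answer to: {q}]"),
    "&lt;nexa_7&gt;": WorkerNode("bio_gpt",  lambda q: f"[biology answer to: {q}]"),
}

def master_node(query):
    """stand-in for the master: returns (functional token, reformatted query),
    i.e. a sample from P(f, params | q)"""
    token = "&lt;nexa_4&gt;" if any(w in query.lower() for w in ("integral", "solve", "sum")) else "&lt;nexa_7&gt;"
    return token, f"Answer concisely: {query}"

def answer(query):
    token, q_h = master_node(query)            # master selects a worker and reformats the query
    return workers[token].run(q_h)             # only the selected worker is activated

print(answer("Solve the integral of x^2 from 0 to 1"))
</code></pre>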
<h2 id="33-task-planning-using-graphs-for-multistep-operations">3.3 Task planning using graphs for multistep operations</h2>
<ul>
<li><p>traditional approach</p>
<ul>
<li><p>all available functions are listed</p>
</li>
<li><p>LLM generated the plan with the user query and the list</p>
</li>
<li><p>a small model cannot grasp the extensive descriptions effectively</p>
</li>
<li><p>it doesn&#39;t consider the inherent relevance among function descriptions</p>
<p>$\rightarrow$ using Graph</p>
</li>
</ul>
</li>
<li><p>Graph-based approach</p>
<ul>
<li><p>only neighboring nodes are considered</p>
</li>
<li><p>reducing the complexity</p>
</li>
</ul>
</li>
<li><p>using Octopus v2</p>
<ul>
<li><p>enabling rapid query redirection and reformatting</p>
</li>
<li><p>apply the functional token to make it as a single AI agent which can take single function callings for each LMs</p>
</li>
<li><p>or the single node can be an ordinary LM (Llama3, Phi3)</p>
</li>
<li><p>At another layer, use Octopus v3 to choose among the nodes</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/879fdd22-2f29-4c66-babd-e8ce158b8745/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="34-functional-token-and-dataset-collections">3.4 Functional token and dataset collections</h2>
<ul>
<li><p>conceptualize each model as a distinct function</p>
</li>
<li><p>for specific models, detail the required prompt template in the function&#39;s doc string</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/282db3fa-d0ab-47a3-b1e3-f2d612672106/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/1989489c-416c-4b12-9188-90f1257979cf/image.png" alt=""></p>
<ul>
<li><p>construct the dataset using similar strategy to Octopus v2</p>
<ul>
<li><p>synthetic data to train the functional tokens</p>
</li>
<li><p>increase the temperature to accommodate diverse queries</p>
</li>
</ul>
</li>
</ul>
<h2 id="35-system-design-of-lm-graph">3.5 System design of LM graph</h2>
<ul>
<li><p>Worker node deployment</p>
<ul>
<li><p>$N^w$ as an individual LM</p>
</li>
<li><p>serverless architecture</p>
</li>
<li><p>limit the worker size to 10B</p>
</li>
</ul>
</li>
<li><p>Master node deployment</p>
<ul>
<li><p>base model with fewer than 10B</p>
</li>
<li><p>compact LoRA can be integrated to extend functional token capabilities</p>
</li>
<li><p>single base model with multiple LoRA, one per each worker</p>
</li>
<li><p>LoraX library</p>
</li>
</ul>
</li>
<li><p>Communication</p>
<ul>
<li><p>workers and the master are distributed across various devices</p>
</li>
<li><p>internet connectivity is essential</p>
</li>
<li><p>master $\rightarrow$ on-device, worker $\rightarrow$ cloud</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c2b204dc-6c5f-4922-ac27-e88a44ceffe5/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-task-and-models">4.1 Task and models</h2>
<ul>
<li>MMLU with 17 distinct models</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/68b37b78-ef0d-4650-b181-5b77dd2047ca/image.png" alt=""></p>
<ul>
<li><p>Specialized models from HF based on benchmark, popularity and endorsements</p>
</li>
<li><p>Not all tasks have a specialized model $\rightarrow$ Llama 3 with a system prompt is used instead of a specialized model (Humanities tasks)</p>
</li>
</ul>
<h2 id="42-mmlu-evaluation">4.2 MMLU evaluation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/959aa892-efad-4282-8a44-3d4e3666d0f6/image.png" alt=""></p>
<ul>
<li><p>example query
<img src="https://velog.velcdn.com/images/0404_not_found/post/11f7b8eb-f15d-45eb-84db-23c4433893c8/image.png" alt=""></p>
</li>
<li><p>&lt;nexa_4&gt; is a functional token which maps to math gpt</p>
</li>
</ul>
<h1 id="5-discussion-and-future-works">5. Discussion and Future works</h1>
<h2 id="51-how-to-train-a-vertical-model">5.1 How to train a vertical model</h2>
<ul>
<li><p>fine-tune with domain-specific expertise</p>
<ul>
<li><p>gather a substantial corpus</p>
</li>
<li><p>ensure the data is diverse, well-organized, embodies the knowledge</p>
</li>
<li><p>clean the data</p>
</li>
</ul>
</li>
<li><p>Use HF SFT Trainer</p>
</li>
</ul>
<h2 id="52-future-work">5.2 Future work</h2>
<ul>
<li><p>integrating a variety of vertical-specific models</p>
</li>
<li><p>Multimodal case (Octopus 3.5)</p>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>Rather than RAG, which decides the next action based on similarity to a given query, the idea is that attaching values to tokens during training in the first place lets the system select actions faster. Seems useful when building agents.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Can large language models explore in-context?]]></title>
            <link>https://velog.io/@0404_not_found/Can-large-language-models-explore-in-context</link>
            <guid>https://velog.io/@0404_not_found/Can-large-language-models-explore-in-context</guid>
            <pubDate>Sat, 30 Mar 2024 10:04:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>In-context Learning $\rightarrow$ important emergent capability of LLM</p>
<ul>
<li><p>without updating the model parameters, LLMs can solve various problems</p>
</li>
<li><p>this ability is extracted from training corpus and emerge at scale</p>
</li>
</ul>
</li>
<li><p>After GPT3, ICL has been the subject of a growing body of research</p>
<ul>
<li>lots of research focused on In-Context Supervised Learning (ICSL)</li>
</ul>
</li>
<li><p>Many application demand the use of ML model for decision making</p>
<ul>
<li><p>In-Context Reinforcement Learning (ICRL) + sequential decision making is the next frontier</p>
</li>
<li><p>LLMs are already used in decision making (experiment design, games, etc.)</p>
</li>
<li><p>ICRL is less developed than ICSL</p>
</li>
</ul>
</li>
<li><p>Decision-making agents must possess</p>
<ul>
<li><p>generalization : required for supervised learning</p>
</li>
<li><p>exploration : making suboptimal decision to gather more information</p>
</li>
<li><p>planning : account long-term consequences of decisions</p>
</li>
<li><p><strong>exploration</strong> is focused in this paper</p>
</li>
</ul>
</li>
<li><p>recent papers about ICRL</p>
<ul>
<li><p>ICRL in transformer when they are explicitly trained</p>
</li>
<li><p>training is hard</p>
</li>
<li><p>in that case, <strong>Does it exhibit the capability to explore in-context?</strong></p>
</li>
</ul>
</li>
<li><p>Deploying LLM to solve multi-armed bandit problem</p>
<ul>
<li><p>classical RL problem shows the tradeoff between exploration and exploitation</p>
</li>
<li><p>this would be the building block to general RL question</p>
</li>
</ul>
</li>
<li><p>evaluate the in-context behavior</p>
<ul>
<li><p>tested GPT-3.5, GPT-4, LLaMA2</p>
</li>
<li><p>only single configuration (prompt + model) showed satisfactory exploratory behavior</p>
</li>
<li><p>the dominant failure mode is the suffix failure (failing to select the best arm even once after some initial rounds)</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/92020dbd-8a8a-4183-ad08-93489a8b59ea/image.png" alt=""></p>
</li>
<li><p>GPT-4 with basic prompt failed in over 60%</p>
</li>
<li><p>another failure is for LLM to behave uniformly</p>
</li>
</ul>
</li>
<li><p>successful configuration</p>
<ul>
<li><p>GPT-4 + enhanced prompt</p>
<ul>
<li><p>suggestive hint</p>
</li>
<li><p>summarizes the history of interaction into per-arm average</p>
</li>
<li><p>zero-shot CoT</p>
</li>
</ul>
</li>
<li><p>the SOTA model has the capability to robustly explore if the prompt is designed carefully</p>
</li>
<li><p>but it may fail in complex environments</p>
<ul>
<li>summarizing history is non-trivial problem</li>
</ul>
</li>
</ul>
</li>
<li><p>In-Context bandit learning is hard</p>
<ul>
<li><p>stochasticity in the environment demands a high degree of replication for statistical significance</p>
</li>
<li><p>even single experiment involve hundreds or thousands LLM queries</p>
</li>
</ul>
</li>
<li><p>identify surrogate statistics as diagnostics for long-term exploration failure</p>
<ul>
<li>characterize long-term exploration failure</li>
</ul>
</li>
</ul>
<h1 id="2-experimental-setup">2. Experimental Setup</h1>
<h4 id="multi-armed-bandits">Multi-armed bandits</h4>
<ul>
<li><p>used MAB variant, Stochastic Bernoulli bandits</p>
<ul>
<li><p>$K$ possible actions (arms) $[K] = \{1, \dots, K\}$</p>
</li>
<li><p>each arm $a$ is associated with mean reward $\mu_a \in [0, 1]$ (unknown)</p>
</li>
<li><p>an agent interacts with the environment with $T$ time steps</p>
</li>
<li><p>each time step $t \in [T]$ the agent selects an arm $a_t \in [K]$ and receives a reward $r_t \in \{ 0, 1 \}$ drawn from a Bernoulli with mean $\mu_{a_t}$</p>
</li>
<li><p>the MAB instance is determined by the mean rewards and the time horizon</p>
</li>
</ul>
</li>
<li><p>rewards for the arms not chosen by the agent are not revealed $\rightarrow$ exploration is necessary to identify the best arm</p>
</li>
<li><p>focus on MAB instances where the best arm has mean reward $\mu^* = 0.5 + \Delta / 2$ and all other arms have mean reward $\mu = 0.5 - \Delta / 2$ (a toy instance plus a UCB baseline are sketched after this list)</p>
<ul>
<li>$\Delta = \mu^* - \mu$</li>
</ul>
</li>
<li><p>set $K = 5$ and $\Delta = 0.2$ as &#39;hard&#39; instance</p>
</li>
<li><p>set $K = 4$ and $\Delta = 0.5$ as &#39;easy&#39; instance</p>
</li>
</ul>
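<p>A small sketch of the hard Bernoulli instance above ($K = 5$, $\Delta = 0.2$) together with a UCB1 baseline of the kind the LLM configurations are compared against (my own toy implementation, not the paper&#39;s code):</p>
<pre><code class="language-python">import numpy as np

def make_instance(K=5, delta=0.2, rng=None):
    rng = rng or np.random.default_rng(0)
    mu = np.full(K, 0.5 - delta / 2)           # all arms at 0.5 - Delta/2 ...
    best = int(rng.integers(K))
    mu[best] = 0.5 + delta / 2                 # ... except the best arm
    return mu, best

def run_ucb(mu, T=100, rng=None):
    rng = rng or np.random.default_rng(0)
    K = len(mu)
    counts, sums, choices = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t &lt; K:
            a = t                              # play each arm once first
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(ucb.argmax())
        r = rng.binomial(1, mu[a])             # Bernoulli reward, unseen for other arms
        counts[a] += 1; sums[a] += r; choices.append(a)
    return np.array(choices)

mu, best = make_instance()
choices = run_ucb(mu)
print("fraction of rounds on the best arm:", (choices == best).mean())
</code></pre>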
<h4 id="prompts">Prompts</h4>
<ul>
<li><p>prompt design</p>
<ul>
<li><p>scenario</p>
<ul>
<li><p>positioning LLM as an agent choosing buttons to press </p>
</li>
<li><p>or a recommendation engine displaying advertisements to users</p>
</li>
</ul>
</li>
<li><p>framing</p>
<ul>
<li><p>suggestive of the need to balance exploration and exploitation</p>
</li>
<li><p>neutral</p>
</li>
</ul>
</li>
<li><p>history</p>
<ul>
<li><p>raw list over rounds</p>
</li>
<li><p>summarized via number of rounds and average rewards of each arm</p>
</li>
</ul>
</li>
<li><p>requested final answer</p>
<ul>
<li><p>single arm</p>
</li>
<li><p>distribution over arms</p>
</li>
</ul>
</li>
<li><p>method</p>
<ul>
<li><p>the answer only</p>
</li>
<li><p>CoT</p>
</li>
</ul>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/eabef338-485b-4ca7-a091-dbfe949f8778/image.png" alt=""></p>
</li>
<li><p>basic prompt is buttons / neutral framing / raw history / return only arm / no CoT</p>
</li>
</ul>
</li>
<li><p>Each modification might help LLM with model&#39;s knowledge</p>
<ul>
<li><p>advertising scenario / suggestive framing (system message) : model&#39;s knowledge of bandit algorithms</p>
</li>
<li><p>history summarization (user message) : if LLM reliably summarize history itself</p>
</li>
<li><p>returning a distribution (system message) : help to identify a good distribution (fails to sample from it)</p>
</li>
<li><p>CoT (system message) : general performance</p>
</li>
</ul>
</li>
<li><p>in GPT-4, used reinforced CoT design to additionally reminds the model to use CoT at user prompt</p>
</li>
</ul>
<h4 id="llm-configurations">LLM configurations</h4>
<ul>
<li><p>models</p>
<ul>
<li><p>GPT-3.5-Turbo-0613</p>
</li>
<li><p>GPT-4-0613</p>
</li>
<li><p>LLaMA2-13B-CHAT with 4bits</p>
</li>
</ul>
</li>
<li><p>Temparature 0(deterministic) or 1</p>
<ul>
<li>don&#39;t consider temp 1 with &#39;return distribution&#39; as it would add external randomness</li>
</ul>
</li>
<li><p>5-letter $L_1 L_2 L_3 L_4 L_5$ notation for prompt design </p>
<ul>
<li><p>$L_1$ : $\text{B}$ or $\text{A}$ for buttons or advertisements scenario</p>
</li>
<li><p>$L_2$ : $\text{N}$ or $\text{S}$ for neutral or suggestive framing</p>
</li>
<li><p>$L_3$ : $\text{R}$ or $\text{S}$ for raw or summarized history</p>
</li>
<li><p>$L_4$ : $\text{C}$ or $\tilde{\text{C}}$ or $\text{N}$ for CoT, reinforced CoT or no CoT</p>
</li>
<li><p>$L_5$ : $0$, $1$ or $\text{D}$ for temperature and returning a distribution (temp 0)</p>
</li>
<li><p>$\text{BNRN0}$ as a basic prompt</p>
</li>
<li><p>advertisement scenario will be used as a robustness check</p>
</li>
</ul>
</li>
<li><p>48 configs for GPT-3.5 and LLaMA2 and 72 configs for GPT-4</p>
</li>
</ul>
<h4 id="baselines">Baselines</h4>
<ul>
<li><p>two standard MAB algorithms</p>
<ul>
<li><p>UCB</p>
</li>
<li><p>Thompson Sampling (TS)</p>
</li>
</ul>
</li>
<li><p>Greedy (doesn&#39;t explore and eventually fails)</p>
</li>
<li><p>no parameter tuning</p>
</li>
<li><p>1000 replicates for each baseline and each MAB instance</p>
</li>
</ul>
<h4 id="scale-of-the-experiments">Scale of the experiments</h4>
<ul>
<li><p>time horizon $T = 100$</p>
</li>
<li><p>$N \in \{ 10, 20 \}$ replicates for each LLM configuration and bandit instance</p>
</li>
<li><p>single experiment on GPT-4 with basic configuration for $T = 500$ for robustness check</p>
</li>
<li><p>in detail</p>
<ul>
<li><p>GPT-3.5</p>
<ul>
<li><p>$N = 20$ replicates across all 48 prompt</p>
</li>
<li><p>about 200K queries</p>
</li>
</ul>
</li>
<li><p>GPT-4</p>
<ul>
<li>$N = 10$ replicates across 10 representative configurations</li>
</ul>
</li>
<li><p>GPT-4 (additional robustness check)</p>
<ul>
<li><p>$T=200$</p>
</li>
<li><p>two for $N = 20$</p>
</li>
<li><p>two for $N = 40$</p>
</li>
</ul>
</li>
<li><p>LLaMA2</p>
<ul>
<li><p>free from query (local model)</p>
</li>
<li><p>hard MAB instance, 32 configs, $N = 10$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>$N \times T$ LLM queries for each config and MAB instance</p>
<ul>
<li><p>$N$ : significance level, must be large to overcome randomness in rewards</p>
</li>
<li><p>$T$ : effect size, must be large so that good algorithms have enough time to identify the optimal arm</p>
</li>
</ul>
</li>
<li><p>Both exploration failures are less frequent in easier MAB instances</p>
</li>
<li><p>To cover the extremely large prompt space, one would want small $\Delta$ and large $N$, $T$</p>
</li>
<li><p>$N \in \{10, 20\}$, $T = 100$ and $\Delta = 0.2$ do not provide enough statistical power to distinguish between successful and unsuccessful methods</p>
</li>
<li><p>rely on surrogate statistics which can be detected in current moderate scale rather than scale up</p>
</li>
</ul>
<h1 id="3-experimental-results">3. Experimental Results</h1>
<h2 id="31-overview">3.1 Overview</h2>
<ul>
<li><p>All but one LLM config failed to converge to the best arm with significant probability</p>
</li>
<li><p>Suffix Failures</p>
<ul>
<li>LLM never selects the best arm after a small number of initial rounds</li>
</ul>
</li>
<li><p>Uniform-like failures</p>
<ul>
<li><p>LLM chooses all arms at uniform rate</p>
</li>
<li><p>failed to eliminate poorly performing arms</p>
</li>
</ul>
</li>
<li><p>the only exception is GPT-4 with $\text{BSS}\tilde{\text{C}}0$</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/22f9108a-3e96-4097-8ad1-0217cdc3bf04/image.png" alt=""></p>
<ul>
<li><p>Fig 3 : summarize the main set of experiments (hard MAB instance)</p>
</li>
<li><p>two surrogate statistics</p>
<ul>
<li><p>SuffFailFreq : measures suffix failures $\rightarrow$ exploration fail</p>
</li>
<li><p>$K \cdot$ MinFrac : measures uniform-like failures $\rightarrow$ exploitation fail</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/7f6fd825-acfd-49f1-b243-433fb8c47c11/image.png" alt=""></p>
<ul>
<li><p>show another statistic GreedyFrac (how similar a method is to GREEDY)</p>
</li>
<li><p>only GPT-4 with $\text{BSS}\tilde{\text{C}}0$ follows the baseline TS and UCB</p>
</li>
</ul>
<h2 id="32-identifying-failures">3.2 Identifying failures</h2>
<ul>
<li>focus on GPT-4</li>
</ul>
<h4 id="suffix-failures">Suffix Failures</h4>
<ul>
<li><p>most of the LLM configs exhibit bimodal behavior</p>
<ul>
<li><p>large fraction of the replicates choose the best arm very rarely</p>
</li>
<li><p>few replicates converged extremely quickly</p>
</li>
</ul>
</li>
<li><p>Consistent with this, suffix failures occurred many times</p>
</li>
<li><p>suggests long-term failure to explore</p>
<ul>
<li><p>cannot be improved by running more time steps</p>
</li>
<li><p>very similar to greedy and different from UCB and TS</p>
</li>
</ul>
</li>
<li><p>For an experiment replicate $\text{R}$ and round $t$</p>
<ul>
<li><p>SuffFail($t, \text{R}$) = 1 if the best arm is never chosen in rounds $[t, T]$</p>
</li>
<li><p>SuffFailFreq($t$) = mean({SuffFail($t, \text{R}$) : replicates $\text{R}$}) (see the sketch after this list)</p>
</li>
<li><p>SuffFailFreq($T/2$) : the frequency of never choosing the best arm during the last half of the rounds</p>
</li>
</ul>
</li>
<li><p>basic config (GPT-4-$\text{BNRN0}$) in Fig1 (top) for T = 500, Fig 5 for GPT-4 for T = 100</p>
</li>
</ul>
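<p>A minimal sketch of how the suffix-failure statistic defined above could be computed from logged arm choices; the array layout and the names (<code>choices</code>, <code>best_arm</code>) are assumptions for illustration, not from the paper:</p>
<pre><code>import numpy as np

def suff_fail_freq(choices, best_arm, t):
    """choices: (R, T) array of arm indices per replicate and round.
    SuffFail(t, R) = 1 if the best arm is never chosen in rounds [t, T)."""
    suffix = choices[:, t:]                      # rounds t..T-1 for every replicate
    fail = (suffix != best_arm).all(axis=1)      # True if the best arm never appears
    return fail.mean()                           # average over replicates

# example: SuffFailFreq(T/2) over 1000 replicates of a 5-arm, 100-round run
rng = np.random.default_rng(0)
choices = rng.integers(0, 5, size=(1000, 100))
print(suff_fail_freq(choices, best_arm=0, t=50))
</code></pre>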
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e26ce11c-4cae-40ac-a114-abb35e3d4f54/image.png" alt=""></p>
<ul>
<li><p>bimodal behavior is shown in left plot</p>
</li>
<li><p>LLMs have much higher SuffFailFreq than UCB and TS</p>
</li>
<li><p>as T = 100 is not enough, suffix failures are not fully reflected in Fig 5 (right)</p>
</li>
<li><p>in Fig 1, suffix failure makes the larger differences in reward for large $T$</p>
</li>
</ul>
<h4 id="uniform-like-failures">Uniform-like failures</h4>
<ul>
<li><p>in Fig 3 (left), 3 GPT-4 configurations avoid suffix failures</p>
</li>
<li><p>two of these show uniform-like failures (exploitation failures)</p>
</li>
<li><p>For an experiment replicate $\text{R}$ and round $t$,</p>
<ul>
<li><p>$f_a(t, R)$ be the fraction of rounds in which a given arm $a$ is chosen</p>
</li>
<li><p>MinFrac($t, R$) =$\min_a f_a(t, R)$</p>
</li>
<li><p>MinFrac($t$) = mean({MinFrac($t, R$) : replicates $R$ })</p>
</li>
<li><p>MinFrac($t$) $\le 1/K$ for all $t$, so rescale it by multiplying by $K$<br>i.e. $K \cdot$ MinFrac($t$) (see the sketch after this list)</p>
</li>
<li><p>Larger MinFrac($t$) means more uniform selection of arms at time $t$</p>
</li>
</ul>
</li>
<li><p>for LLMs, MinFrac($t$) doesn&#39;t decrease over time and stays larger than that of baselines</p>
</li>
<li><p>the two GPT-4 configurations that avoid suffix failures but exhibit uniform-like failures (BNRND, BSSCD) both use distributional output</p>
</li>
</ul>
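<p>Similarly, a small sketch of the $K \cdot$ MinFrac statistic, under the same assumed <code>choices</code> array layout:</p>
<pre><code>import numpy as np

def k_minfrac(choices, K, t):
    """choices: (R, T) arm indices. MinFrac(t, R) = min_a fraction of rounds
    in [0, t) where arm a was chosen; rescaled by K so uniform play gives ~1."""
    prefix = choices[:, :t]
    fracs = np.stack([(prefix == a).mean(axis=1) for a in range(K)], axis=1)  # (R, K)
    return K * fracs.min(axis=1).mean()

rng = np.random.default_rng(0)
uniform_choices = rng.integers(0, 5, size=(1000, 100))
print(k_minfrac(uniform_choices, K=5, t=100))   # close to 1 for uniform play
</code></pre>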
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/31c29a46-563f-45ca-965e-2c359a38978e/image.png" alt=""></p>
<ul>
<li><p>MinFrac doesn&#39;t decrease for the LLMs while it does for the baselines</p>
</li>
<li><p>in longer $T$, it has much lower reward than baselines</p>
<ul>
<li>poor long-term performance</li>
</ul>
</li>
</ul>
<h4 id="generality-of-the-failures">Generality of the failures</h4>
<ul>
<li><p>all LLMs except GPT-4-$\text{BSS}\tilde{\text{C}}0$ exhibit either a suffix failure or a uniform failure for hard MAB</p>
</li>
<li><p>other experiments have similar result</p>
</li>
<li><p>summary</p>
<ul>
<li><p>GPT-4 performed much better than GPT-3.5</p>
</li>
<li><p>LLaMA 2 performed much worse</p>
</li>
<li><p>all LLMs are sensitive to small changes in the prompt design    </p>
<ul>
<li>different modifications interact with each other</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="33-investigating-successes">3.3 Investigating successes</h2>
<ul>
<li><p>GPT-4-$\text{BSS}\tilde{\text{C}}0$</p>
<ul>
<li><p>no suffix failures</p>
</li>
<li><p>$K \cdot$ MinFrac is slightly larger than TS</p>
</li>
<li><p>reward is comparable to TS</p>
</li>
</ul>
</li>
<li><p>ran this config on hard MAB with $T = 200$ and $N = 40$ + $\text{BSR}\tilde{\text{C}}0$ as an ablation
<img src="https://velog.velcdn.com/images/0404_not_found/post/7bf30684-b1f9-4a23-9535-6cd7762ad799/image.png" alt=""></p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ worked well in longer T</p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ showed non-trivial fraction of suffix failures (Fig 1(b))</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/199c8488-b19c-4898-990e-9c48db541dc0/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/3dd51721-1516-4bcc-bd64-d06b88464b3e/image.png" alt=""></p>
<ul>
<li><p>Fig 8    </p>
<ul>
<li><p>basic config tends to commit to a single arm for several rounds (like greedy)</p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ also commits for long periods but to a lesser extent than the basic config</p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ switches arms frequently and qualitatively appears much more similar to TS</p>
</li>
</ul>
</li>
<li><p>Fig 9</p>
<ul>
<li><p>plotted the fraction of rounds in $[0, t]$ where the optimal arm was pulled </p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ looks like UCB except that some runs show suffix failures (the fraction goes to 0)</p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ is similar to TS, with almost all replicates slowly converging to 1</p>
</li>
</ul>
</li>
</ul>
<h2 id="34-root-causes">3.4 Root causes</h2>
<ul>
<li>understand why LLMs behave like this</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/23814952-309d-4b91-9c27-87c10f0e4655/image.png" alt=""></p>
<ul>
<li><p>Fig 13</p>
<ul>
<li><p>with (a) and (c), GPT-4 shows qualitatively different behavior in easy and hard MAB</p>
</li>
<li><p>easy instance is much easier</p>
</li>
<li><p>in easy instance, GPT-4 showed very high GreedyFrac $\rightarrow$ behave like Greedy (as it performed quite well)</p>
</li>
<li><p>GPT-4 performs quite well in low-noise settings</p>
</li>
<li><p>in the hard instance, GPT-4 did something non-trivial (neither Greedy nor uniform)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a706130d-bef5-45b9-9909-e978810a9db3/image.png" alt=""></p>
<ul>
<li><p>Fig 10</p>
<ul>
<li><p>per-round decisions with GPT-3.5</p>
</li>
<li><p>each experiment considers a particular distribution of bandit histories</p>
</li>
<li><p>sampled 50 histories of length $t$ </p>
</li>
<li><p>tracked two statistics for each agent</p>
<ul>
<li>the empirically best arm so far vs. the least-chosen arm so far</li>
</ul>
</li>
<li><p>uniform sampled data + UCB and TS sampled data</p>
</li>
<li><p>per-round performance of both the LLMs and baselines is very sensitive to data source</p>
</li>
<li><p>$\text{BNSN0}$ is too greedy, $\text{BNRN0}$ is too uniform</p>
</li>
<li><p>$\text{BNRN0}$ and $\text{BNRC0}$ fall within the reasonable range set by the baselines, yet they failed in the longitudinal experiments</p>
</li>
<li><p>hard to assess whether LLM agents are too greedy or too uniform based on per-round decisions</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-related-work">4. Related work</h1>
<ul>
<li><p>LLM capability</p>
<ul>
<li><p>general intelligence</p>
</li>
<li><p>causal reasoning</p>
</li>
<li><p>mathematical reasoning</p>
</li>
<li><p>planning</p>
</li>
<li><p>compositionality</p>
</li>
</ul>
</li>
<li><p>In-context capabilities</p>
<ul>
<li><p>theoretical and empirical investigations of ICSL</p>
</li>
<li><p>ICRL studies focus on models trained on trajectory data from another agent</p>
</li>
<li><p>justify with Bayesian meta-reinforcement learning perspective</p>
</li>
<li><p>transformers work like TS and UCB</p>
</li>
</ul>
</li>
<li><p>applying LLM to real-world decision making</p>
<ul>
<li><p>gaming, programming, medicine</p>
</li>
<li><p>generative agent to simulate human behavior in open-world environment</p>
</li>
<li><p>LLM-enabled robots</p>
</li>
</ul>
</li>
<li><p>prior work evaluated LLM performance on a two-armed bandit task used to characterize intelligent agents</p>
<ul>
<li><p>very easy MAB ($K = 2$, $\Delta = 0.6$)</p>
</li>
<li><p>single prompt design</p>
</li>
<li><p>compared to human</p>
</li>
<li><p>GPT-4 performed well</p>
</li>
</ul>
</li>
</ul>
<h2 id="41-further-background-on-mab">4.1 Further background on MAB</h2>
<ul>
<li><p>UCB</p>
<ul>
<li><p>explores by assigning each arm $a$ an index (average reward + bonus)</p>
</li>
<li><p>choose the arm with largest index</p>
</li>
<li><p>the bonus has the form $\sqrt{C/n_a}$; this paper uses $C = 1$</p>
</li>
</ul>
</li>
<li><p>TS</p>
<ul>
<li><p>proceeds as if the arms&#39; mean rewards were initially drawn from some Bayesian prior</p>
</li>
<li><p>computes a Bayesian posterior using the given history</p>
</li>
<li><p>chooses an arm with largest mean reward</p>
</li>
<li><p>chose prior that uniformly draws the mean reward at random from [0, 1] in this paper</p>
</li>
<li><p>updates each arm independently using the Beta-Bernoulli conjugate prior</p>
</li>
</ul>
</li>
<li><p>regret</p>
<ul>
<li><p>difference in expected total reward of the best arm and the algorithm</p>
</li>
<li><p>baselines achieve regret $O(\sqrt{KT \log T})$ which is nearly minimax optimal for $K$ and $T$</p>
</li>
<li><p>they also achieve the instance-optimal regret rate $O({K \over \Delta} \log T)$</p>
</li>
</ul>
</li>
<li><p>$\epsilon$-greedy and greedy</p>
</li>
</ul>
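<p>A minimal sketch of the two baselines as described above (UCB index with bonus $\sqrt{C/n_a}$, $C = 1$, and Thompson Sampling with a Beta-Bernoulli posterior), assuming Bernoulli rewards; the reward gap and horizon are illustrative:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.5, 0.5, 0.5, 0.7])      # hard-instance style gap Delta = 0.2
K, T, C = len(means), 100, 1.0

def pull(a):                                      # Bernoulli reward for arm a
    return float(rng.random() &lt; means[a])

# UCB: index = average reward + sqrt(C / n_a); unpulled arms get an infinite index
n, s = np.zeros(K), np.zeros(K)
for t in range(T):
    idx = np.where(n == 0, np.inf, s / np.maximum(n, 1) + np.sqrt(C / np.maximum(n, 1)))
    a = int(np.argmax(idx))
    r = pull(a); n[a] += 1; s[a] += r

# Thompson Sampling: Beta-Bernoulli posterior per arm, uniform prior on [0, 1]
alpha, beta = np.ones(K), np.ones(K)
for t in range(T):
    a = int(np.argmax(rng.beta(alpha, beta)))     # sample mean rewards, pick the best
    r = pull(a); alpha[a] += r; beta[a] += 1 - r

print("UCB total reward:", s.sum(), " TS posterior means:", alpha / (alpha + beta))
</code></pre>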
<h1 id="5-discussion-and-open-questions">5. Discussion and open questions</h1>
<ul>
<li>contemporary LLMs do not robustly engage in exploration required for basic statistical RL and decision making problems without further intervention</li>
</ul>
<h4 id="basic-interventions-and-the-need-for-methodological-advancements">Basic interventions and the need for methodological advancements</h4>
<ul>
<li><p>Experiment with other prompts</p>
</li>
<li><p>Experiment with few-shot prompting</p>
</li>
<li><p>Train the LLM to use auxiliary tools</p>
</li>
</ul>
<h4 id="implications-for-complex-decision-making-problems">Implications for complex decision making problems</h4>
<ul>
<li><p>simple MAB provides a clean and controllable setup </p>
</li>
<li><p>in more complex RL and decision making, similar failures also occur</p>
</li>
<li><p>the solution for MAB may not generalize to more complex settings</p>
</li>
<li><p>even for linear contextual bandits, this approach may not be applicable without a substantial intervention</p>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>A paper that checks how much capability LLMs have from the perspective of ICRL rather than ICSL. It is a simple problem, but given how much research is now going into LLM agents, it will likely serve as a baseline. Even though the problem is simple by RL standards, the fact that it takes GPT-4 plus carefully mixed prompting to solve it suggests that it is not easy to approach with LLM capabilities alone.</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a5741371-e53e-450e-95aa-1c63f32677f5/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Training Neural Networks from Scratch with Parallel Low-Rank Adapters]]></title>
            <link>https://velog.io/@0404_not_found/Training-Neural-Networks-from-Scratch-with-Parallel-Low-Rank-Adapters</link>
            <guid>https://velog.io/@0404_not_found/Training-Neural-Networks-from-Scratch-with-Parallel-Low-Rank-Adapters</guid>
            <pubDate>Sun, 24 Mar 2024 12:30:29 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fc330474-d423-45c0-9b85-77e630d87909/image.png" alt=""></p>
<ul>
<li><p>SOTA models&#39; complexity $\rightarrow$ computation / memory / communication bandwidth</p>
<ul>
<li><p>LoRA</p>
</li>
<li><p>quantizing model parameters</p>
</li>
</ul>
</li>
<li><p>Prior work has been limited to fine-tuning $\rightarrow$ tools for pretraining from scratch are absent</p>
</li>
</ul>
<blockquote>
<p>Can neural networks be trained from scratch using Low-Rank Adapters?</p>
</blockquote>
<ul>
<li><p>common computing clusters often suffer from slow cross-node training (falling back to gradient accumulation) because of limited communication speed and bandwidth</p>
<ul>
<li>Low-Rank adapters compress the communication between these processors while preserving essential structural attributes</li>
</ul>
</li>
<li><p>Vanilla LoRA underperforms in training a model from scratch</p>
<ul>
<li>using parallel low-rank updates can bridge this gap</li>
</ul>
</li>
</ul>
<h4 id="difference-to-existing-works">Difference to existing works</h4>
<ul>
<li><p>data and model parallelism</p>
<ul>
<li><p>stores different copies of the LoRA parameters</p>
</li>
<li><p>trained on different shards </p>
<ul>
<li>different from traditional federated learning which replicates the same model across devices</li>
</ul>
</li>
<li><p>their method enables distributed training with infrequent synchronizations allowing for single-device inference</p>
</li>
</ul>
</li>
<li><p>Previous works</p>
<ul>
<li><p>ReLoRA : trains and merges LoRA into main weights</p>
</li>
<li><p>FedLoRA : train LoRA parameters for finetuning within a federated learning framework $\rightarrow$ training multiple LoRA and averaging them</p>
</li>
<li><p>AdaMix : averages all MLP in MoE into a single MLP $\rightarrow$ needs constant synchronization during the forward and backward pass</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-preliminaries">2. Preliminaries</h1>
<ul>
<li>$x$ as a scalar, $\mathbf{x}$ as a vector, $X$ as a matrix, $\mathcal{X}$ as a distribution or a set</li>
<li>$f$ as a function, $F(\cdot)$ as a composition of functions, $\mathcal{L}(\cdot, \cdot)$ as a loss-function</li>
</ul>
<h2 id="21-parameter-efficient-adapters">2.1 Parameter Efficient adapters</h2>
<ul>
<li><p>Adapters : trainable functions that modify existing layers in a neural network</p>
</li>
<li><p>LoRA : subclass of linear adapters</p>
<ul>
<li><p>the linearity of LoRA allows for the trained parameters to be integrated back into the existing weights</p>
</li>
<li><p>the linearity allows models to maintain the original inference cost</p>
</li>
</ul>
</li>
</ul>
<h4 id="lora">LoRA</h4>
<ul>
<li><p>Given input $\mathbf{x} \in \reals^n$ and a linear layer $f(\cdot) : \reals^n \rightarrow \reals^m$ parameterized by the weight $W \in \reals^{m \times n}$</p>
</li>
<li><p>LoRA re-parameterizes the function as</p>
<ul>
<li><p>$f_{\text{lora}}(x) = \mathbf{W}\mathbf{x} + s \mathbf{BAx}$</p>
</li>
<li><p>$\mathbf{B} \in \reals^{m\times r}$, $\mathbf{A} \in \reals^{r \times n}$, $s \in \reals$</p>
</li>
<li><p>rank $r \ll \min(m, n)$</p>
</li>
</ul>
</li>
<li><p>Forward pass incurs an extra computational overhead</p>
</li>
<li><p>the significance of LoRA pertains to the optimizer memory footprint</p>
<ul>
<li><p>AdamW stores two states for each parameter $\rightarrow$ double the memory consumption</p>
</li>
<li><p>using LoRA, the memory cost $\mathcal{O}(r(m+n))$ is less than the original model&#39;s $\mathcal{O}(mn)$</p>
</li>
<li><p>QLoRA saves $W$ in 4-bit precision to achieve more memory saving</p>
</li>
</ul>
</li>
</ul>
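<p>A minimal PyTorch-style sketch of the LoRA re-parameterization $f_{\text{lora}}(x) = \mathbf{W}\mathbf{x} + s \mathbf{BAx}$ described above; the class name and initialization choices are assumptions for illustration:</p>
<pre><code>import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """f_lora(x) = W x + s * B A x, with W frozen and only A, B trainable."""
    def __init__(self, m, n, r, s=1.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n), requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(r, n) / n ** 0.5)            # r x n
        self.B = nn.Parameter(torch.zeros(m, r))                       # m x r, zero-init
        self.s = s

    def forward(self, x):                          # x: (..., n)
        return x @ self.W.T + self.s * (x @ self.A.T @ self.B.T)

layer = LoRALinear(m=8, n=16, r=2)
print(layer(torch.randn(4, 16)).shape)             # torch.Size([4, 8])
</code></pre>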
<h1 id="3-method">3. Method</h1>
<ul>
<li>standard training performance can be recovered using LoRA</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/acd323f6-ce65-440c-8bba-325edd8eef34/image.png" alt=""></p>
<ul>
<li><p>Low-Rank LoRA shows inferior performance to the models using standard optimization</p>
</li>
<li><p>LoRA is incapable of recovering weights that exceed the rank $r$</p>
</li>
<li><p>Although there is a solution within a low-rank proximity of the initialization, it still needs high-rank updates</p>
</li>
</ul>
<h2 id="31-motivation--multi-head-merging-perspective">3.1 Motivation : Multi-head merging perspective</h2>
<ul>
<li><p>this will show why LoRA heads in parallel can achieve the performance of standard pre-training</p>
</li>
<li><p>elevating the rank $r$ to the $\min(m, n)$ is sufficient to replicate standard pre-training performance</p>
<ul>
<li>it compromises the memory efficiency of low-rank adapters</li>
</ul>
</li>
<li><p>leveraging multiple low-rank adapters in parallel</p>
<ul>
<li><p>given a matrix of the form $\mathbf{BA} \in \reals^{d_1 \times d_2}$ and $\mathbf{B} \in \reals^{d_1 \times d}$, $\mathbf{A} \in \reals^{d \times d_2}$</p>
</li>
<li><p>then it is possible to represent the product as two lower-rank matrices $\mathbf{B_1A_1} + \mathbf{B_2A_2}$</p>
<ul>
<li><p>let $\mathbf{b}_i$ and $\mathbf{a}_i$ be the column vectors</p>
</li>
<li><p>then we can construct $\mathbf{B_1} = [\mathbf{b}_1, ..., \mathbf{b}_{[d/2]}]$, $\mathbf{B_2} = [\mathbf{b}_{[d/2]}, ..., \mathbf{b}_{d}]$ and $\mathbf{A_1} = [\mathbf{a}_1^{\top}, ..., \mathbf{a}_{[d/2]}^{\top}]$, $\mathbf{A_2} = [\mathbf{a}_{[d/2]}^{\top}, ..., \mathbf{a}_{d}^{\top}]$</p>
</li>
<li><p>then this approximates the high-rank matrix into a linear combination of low-rank matrices</p>
</li>
<li><p>the same conclusion can be reached by beginning with a linear combination of rank-1 matrices</p>
</li>
<li><p>This forms the basis for a novel multi-head LoRA</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
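<p>A small numeric sketch of the splitting argument above: a rank-$d$ product $\mathbf{BA}$ can be written exactly as the sum of two lower-rank products by splitting the columns of $\mathbf{B}$ and the rows of $\mathbf{A}$ (shapes chosen arbitrarily for illustration):</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
d1, d2, d = 8, 6, 4
B = rng.standard_normal((d1, d))     # columns b_1 ... b_d
A = rng.standard_normal((d, d2))     # rows    a_1 ... a_d

# split the d rank-1 terms into two halves: BA = B1 A1 + B2 A2
h = d // 2
B1, B2 = B[:, :h], B[:, h:]
A1, A2 = A[:h, :], A[h:, :]

print(np.allclose(B @ A, B1 @ A1 + B2 @ A2))   # True: a higher-rank product
                                                # is a sum of lower-rank products
</code></pre>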
<h4 id="multi-head-lora-mhlora">Multi-head LoRA (MHLoRA)</h4>
<ul>
<li><p>given a matrix $\mathbf{W} \in \reals^{m \times n}$ and constant $N$</p>
</li>
<li><p>$f_{\text{mhlora}}(\mathbf{x}) = \mathbf{Wx} + {s \over N} \displaystyle\sum^N_{n=1} \mathbf{B}_n \mathbf{A}_n \mathbf{x}$</p>
</li>
<li><p>reparameterizes full-rank weights into a linear combination of low-rank weights</p>
</li>
<li><p>single parallel LoRA head can approximate the trajectory of a single step of the multi-head LoRA given that the parallel LoRA heads are periodically merged into the full weights</p>
<ul>
<li><p>using the same rank $r$ for all the LoRA parameters</p>
</li>
<li><p>$\argmin_{\mathbf{B}_n \mathbf{A}_n} \mathcal{L} \left( \mathbf{W} + {s \over N} \displaystyle\sum ^N _{n=1} \mathbf{B}_n \mathbf{A}_n\right) = \argmin_{\hat{\mathbf{B}}_n \hat{\mathbf{A}}_n} \mathcal{L} \left( \hat{\mathbf{W}} + {s \over N} \hat{\mathbf{B}}_n \hat{\mathbf{A}}_n\right)$</p>
</li>
<li><p>used hat for the single parallel LoRA head</p>
</li>
<li><p>when either $\sum_{n=1}^N \mathbf{B}_n \mathbf{A}_n = \hat{\mathbf{B}}_n \hat{\mathbf{A}}_n$ or $\hat{\mathbf{W}} = \mathbf{W} + {s \over N} \sum _{j \not = n}^N \mathbf{B}_j \mathbf{A}_j$</p>
</li>
</ul>
</li>
<li><p>The first scenario is rank deficient $\rightarrow$ unable to recover the original model performance</p>
</li>
<li><p>The latter case necessitates that $\hat{\mathbf{W}}$ accumulates all the information of the LoRA parameters at every iteration $\rightarrow$ if we use a merge operator at every iteration, recovering the exact update is possible</p>
</li>
<li><p>one can recover the exact gradient updates of the MHLoRA</p>
</li>
<li><p>in distributed setting, only the LoRA params/gradients have to be communicated across devices $\rightarrow$ good when the interconnect speed is limited</p>
</li>
</ul>
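<p>A minimal sketch of the multi-head LoRA forward pass $f_{\text{mhlora}}(\mathbf{x}) = \mathbf{Wx} + {s \over N} \sum_n \mathbf{B}_n \mathbf{A}_n \mathbf{x}$; shapes and values are illustrative:</p>
<pre><code>import numpy as np

def mhlora_forward(x, W, heads, s):
    """f_mhlora(x) = W x + (s / N) * sum_n B_n A_n x, with heads = [(B_1, A_1), ...]."""
    N = len(heads)
    out = W @ x
    for B_n, A_n in heads:
        out += (s / N) * (B_n @ (A_n @ x))
    return out

rng = np.random.default_rng(0)
m, n, r, N, s = 8, 16, 2, 4, 1.0
W = rng.standard_normal((m, n))
heads = [(rng.standard_normal((m, r)), rng.standard_normal((r, n))) for _ in range(N)]
x = rng.standard_normal(n)
print(mhlora_forward(x, W, heads, s).shape)    # (8,)
</code></pre>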
<h2 id="32-lora-soup-delayed-lora-merging">3.2 LoRA soup: delayed LoRA merging</h2>
<ul>
<li><p>To reduce the communication cost of LTE</p>
<ul>
<li>local updates</li>
<li>model-averaging</li>
</ul>
</li>
<li><p>allow LoRA parameters to train independently for a longer period before the merge operator</p>
<ul>
<li><p>$\hat{\mathbf{W}} = \mathbf{W} + {s \over N} \sum _{j \not = n}^N \mathbf{B}&#39;_j \mathbf{A}&#39;_j$</p>
</li>
<li><p>$&#39;$ denotes stale estimates of the parameters</p>
</li>
</ul>
</li>
<li><p>Merging every iteration $\rightarrow$ ensures the representation will not diverge</p>
</li>
<li><p>using stale estimates relaxes this equivalence $\rightarrow$ it can still match the standard training performance</p>
<ul>
<li><p>As its estimate is inaccurate, the optimization trajectory diverges from the optimization path of MHLoRA</p>
</li>
<li><p>it doesn&#39;t imply that the model won&#39;t optimize</p>
</li>
<li><p>just different path from MHLoRA</p>
</li>
<li><p>used simple averaging (left more sophisticated merging as future work)</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/abf8e021-d03b-4bb2-a78d-3d620553aab0/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="33-lora-the-explorer-parallel-low-rank-updates">3.3 LoRA-the-Explorer: parallel low-rank updates</h2>
<ul>
<li><p>achieving an informative update $\Delta \mathbf{W}$ that does not require materialization of the full parameter size during training</p>
</li>
<li><p>parameterizing $\mathbf{W}$ such that it can be stored in low-precision and communicated efficiently (using quantized weights and keeping a high-precision copy)</p>
</li>
<li><p><strong>LoRA-the-Explorer (LTE)</strong> : optimization algorithm that approximates full-rank updates with parallel low-rank updates</p>
<ul>
<li><p>creates $N$ different LoRA heads for each linear layer at initialization</p>
</li>
<li><p>each worker is assigned its own LoRA parameters and creates a local optimizer</p>
</li>
<li><p>independently sample data from the same distribution $\mathbf{x} = \{ \mathbf{x}_1, ..., \mathbf{x}_N \}$</p>
</li>
<li><p>for each LoRA head $n$, optimize the parameters with its own partition for $T$ iterations to get $\delta_{\text{lora}_n} = -\eta \sum_{t=1}^T \nabla_{\text{lora}_n} \mathbf{x}_i[t]$</p>
</li>
<li><p>don&#39;t synchronize the optimizer state across workers</p>
</li>
<li><p>After the optimization, synchronize the LoRA parameters to compute the final update for the main weight $\Delta_{\text{lora}}(\mathbf{x}) = {1 \over N} \sum_{n=1}^N \delta_{\text{lora}_n}$</p>
</li>
<li><p>then update the LoRA parameters with the updated weights $\mathbf{W}$</p>
<ul>
<li>re-initialize the LoRA parameter or use the same value with correction term</li>
</ul>
</li>
<li><p>since it doesn&#39;t train directly on the main parameter $\mathbf{W}$, using quantized parameter $q(\mathbf{W})$ is possible</p>
<ul>
<li>keep the high-precision weight only in the master node or offload it from device during training</li>
</ul>
</li>
</ul>
</li>
</ul>
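<p>A schematic sketch of one LTE outer step as described above: each head trains independently on its own mini-batches for a few local iterations, then the averaged LoRA product is merged into the frozen main weight. The loss, data sampling, and the handling of head re-initialization are placeholders, not the authors&#39; exact implementation:</p>
<pre><code>import torch

def lte_step(W, heads, opts, sample_batch, loss_fn, s=1.0, T_local=4):
    """One LTE outer step: each head trains for T_local iterations on its own data,
    then the averaged LoRA product is merged into the frozen main weight W."""
    N = len(heads)
    for (B, A), opt in zip(heads, opts):           # in practice these run on separate workers
        for _ in range(T_local):
            x, y = sample_batch()                  # each worker samples its own mini-batch
            pred = x @ (W + s / N * (B @ A)).T     # worker sees W plus only its own head
            loss = loss_fn(pred, y)
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        delta = torch.stack([B @ A for B, A in heads]).mean(dim=0)
        W += s * delta                             # merge: W &lt;- W + (s/N) * sum_n B_n A_n
    return W                                       # (heads kept as-is here; see Sec. 3.4)

m, n, r, N = 8, 16, 2, 4
W = torch.randn(m, n)                              # main weight, never trained directly
heads = [(torch.zeros(m, r, requires_grad=True),
          torch.randn(r, n, requires_grad=True)) for _ in range(N)]
opts = [torch.optim.AdamW([B, A], lr=2e-4) for B, A in heads]
sample = lambda: (torch.randn(32, n), torch.randn(32, m))
W = lte_step(W, heads, opts, sample, torch.nn.functional.mse_loss)
</code></pre>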
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a60ec52c-0572-43d2-acd5-5c04a5996fb1/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/b4692660-d154-412f-a6f1-d580885a85b9/image.png" alt=""></p>
<h2 id="34-implementation-details">3.4 Implementation details</h2>
<h4 id="not-resetting-matrix-a-and-optimizer-states">Not resetting matrix A and optimizer states</h4>
<ul>
<li><p>investigated whether the matrices $\mathbf{A}_n$ would converge to the same sub-space during training</p>
<ul>
<li><p>If so, resetting $\mathbf{A}_n$ or using a regularizer would be needed</p>
</li>
<li><p>$\mathbf{A}$ is orthogonal to remain consistent throughout training</p>
</li>
<li><p>without reset, it performed better (re-learning $\mathbf{A}$ and re-accumulating the optimizer state waste optimization steps)
<img src="https://velog.velcdn.com/images/0404_not_found/post/b4d95410-90e8-4a33-9b91-3e8fec2a1077/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h4 id="scaling-up-s-and-lowering-learning-rate-eta">scaling up $s$ and lowering learning rate $\eta$</h4>
<ul>
<li><p>that scaling $s$ has the same effect as tuning the lr $\eta$ is a common misconception</p>
</li>
<li><p>in their experiments, setting $s$ in the range 1~4 did not give comparable performance</p>
<ul>
<li><p>using large $s$ and slightly lowering $\eta$ worked best</p>
</li>
<li><p>standard practice : set $s$ inversely proportional to the rank $r$, i.e. $s = {\alpha \over r}$</p>
</li>
<li><p>used $\alpha = 4096, s = 64$ and $\eta = 2 \cdot 10^{-4}$</p>
</li>
<li><p>lr doesn&#39;t scale linearly with $s$</p>
</li>
<li><p>$s$ only affects the forward computation</p>
<ul>
<li>it modifies the contribution of the LoRA parameters in the forward pass $\rightarrow$ implications for the effective gradient</li>
</ul>
</li>
<li><p>$s$ scales quadratically with the alignment of $\bold{B}$ and $\bold{A}$</p>
</li>
</ul>
</li>
</ul>
<h4 id="significance-of-initialization-strategies">Significance of Initialization Strategies</h4>
<ul>
<li><p>used the initialization scheme that utilizes a semi-orthogonal matrix scaled by $\sqrt{d_{out}/d_{in}}$</p>
<ul>
<li><p>originally designed for standard feed-forward models</p>
</li>
<li><p>whereas LoRA operates under the assumption that $\bold{B}$ is zero-initialized with a residual connection</p>
</li>
<li><p>in the ablation study, Kaiming initialization and Xavier initialization perform similarly</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<ul>
<li>in the transformer experiments, they mistakenly used the scaling factor $1/\sqrt{d_{out}}$ instead of the standard scaling $1/\sqrt{d_{out}/n_{attn}}$ (they plan to revise the hyper-parameters)</li>
</ul>
<h2 id="41-iterative-lora-merging">4.1 Iterative LoRA Merging</h2>
<ul>
<li><p>iteratively merging LoRA is a key component in recovering the full-rank representation</p>
</li>
<li><p>they assess the effectiveness of merging a single LoRA head in the context of linear networks trained on synthetic least-squares regression datasets
<img src="https://velog.velcdn.com/images/0404_not_found/post/dce34310-2627-4c8c-92c4-017090af6650/image.png" alt=""></p>
</li>
<li><p>without merging, the model performance does not improve</p>
</li>
<li><p>iterative merging recovers the GT solution with the rate increasing with higher merge frequency</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e1241a16-1fdc-485f-91a9-23adf9cda616/image.png" alt=""></p>
<ul>
<li><p>in Vit-S with patch size 32 on ImageNet100</p>
<ul>
<li><p>merging of a single LoRA head outperforms standalone LoRA</p>
</li>
<li><p>frequent merging delays convergence (LoRA parameter re-initialization and momentum state inconsistencies)</p>
</li>
<li><p>performance doesn&#39;t match $\rightarrow$ potential local minima when training with rank-deficient representation</p>
</li>
</ul>
</li>
<li><p>they found the merge iteration of $T = 10$ is still stable when using batch size of 4096</p>
<ul>
<li>with higher $T$, additional training may be required</li>
</ul>
</li>
<li><p>with increased merge iteration, smarter merging techniques may be necessary</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/417e0944-9d27-499d-81c7-1e7b032c4f41/image.png" alt=""></p>
<ul>
<li>to further test the generalizability, they conducted various vision tasks on MLP-Mixer </li>
</ul>
<h2 id="42-lora-parameter-alignment">4.2 LoRA parameter alignment</h2>
<ul>
<li><p>the efficacy of their optimization algorithm</p>
<ul>
<li><p>individual heads explore distinct subspaces within the parameter space
<img src="https://velog.velcdn.com/images/0404_not_found/post/7eb702c7-7ea3-489a-9583-7068248332fe/image.png" alt=""></p>
</li>
<li><p>average cosine similarity and Grassman distance between the heads $\bold{B}_n \bold{A}_n$</p>
</li>
<li><p>conducted with data samples drawn from same distribution</p>
</li>
<li><p>each set of LoRA parameters was exposed to a different set of samples</p>
</li>
<li><p>LoRA heads do not converge to the same representation</p>
</li>
<li><p>this orthogonality is maximized when using different parameters and different data (mini-batches)</p>
</li>
</ul>
</li>
</ul>
<h2 id="43-ablation-study-the-effect-of-lora-heads-rank-and-merge-iteration">4.3 Ablation study: the effect of LoRA heads, rank, and merge iteration</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/77d123a5-7767-41c2-b64e-d203ef5f1557/image.png" alt=""></p>
<ul>
<li><p>monotonic improvement in performance with an increased number of heads and ranks</p>
</li>
<li><p>extending the merge iteration negatively impacts performance</p>
</li>
<li><p>in LS regression, excessive merging hurts model accuracy</p>
</li>
<li><p>with a large enough rank and number of heads, the model converges to better accuracy even when the test loss is similar</p>
</li>
<li><p>averaging of the LoRA heads has a regularization effect similar to model ensembling</p>
</li>
<li><p>ViT-S as the primary architecture</p>
<ul>
<li><p>hidden size = 384</p>
</li>
<li><p>MLP dimension = 1536</p>
</li>
<li><p>number of heads * rank of the LoRA &gt; the largest dimension of the model $\rightarrow$ worked well</p>
</li>
<li><p>number of heads &gt; rank $\rightarrow$ longer iterations were required </p>
</li>
</ul>
</li>
</ul>
<h2 id="44-gradient-noise-with-parallel-updates">4.4 Gradient noise with parallel updates</h2>
<ul>
<li><p>in ablation, they fixed cumulative batch size of 4096 and epoch of 1200</p>
</li>
<li><p>each LoRA head received a reduced batch size of 4096/heads</p>
</li>
<li><p>scaling the rank exerts a greater impact than increasing the number of heads</p>
<ul>
<li><p>proportional scaling of gradient noise with smaller mini-batches</p>
</li>
<li><p>gradient noise contributes to slower convergence in addition to the use of stale parameter estimates</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/efd8ee11-0e5f-4fcc-bbca-4818e834751d/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>increasing the number of heads necessitates more sequential FLOPs, but it offers efficient parallelization</p>
</li>
<li>using a larger batch size for gradient estimation may prove beneficial in distributed training</li>
</ul>
<h2 id="45-performance-scaling-on-imagenet-1k">4.5 Performance Scaling on ImageNet-1K</h2>
<ul>
<li><p>scaled up to ImageNet 1K</p>
<ul>
<li><p>doubled batch size to 8192</p>
</li>
<li><p>didn&#39;t change the way mini-batches were sampled</p>
</li>
<li><p>scheduling the randomness for the mini-batches is not explored</p>
</li>
</ul>
</li>
<li><p>in Initial training, LTE outperformed standard training</p>
<ul>
<li><p>as training completed, standard training overtook LTE</p>
</li>
<li><p>LTE needs additional iterations to achieve comparable performance</p>
</li>
</ul>
</li>
<li><p>standard training appeared to benefit more from a lower lr compared to LTE</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/084a4647-25a0-46f4-ac0e-fa938603442e/image.png" alt=""></p>
<ul>
<li><p>this study is focused on training deep networks with parallel low-rank adapters (not efficiency!)</p>
</li>
<li><p>hypothetical computation analysis for future scaling efforts</p>
<ul>
<li><p>model size $M_{\text{ddp}} = M$ and $M_{\text{lte}}$ for LTE</p>
</li>
<li><p>the number of devices for each method $N_{\text{ddp}}$ and $N_{\text{lte}}$</p>
</li>
<li><p>with quantization, each LTE device requires a memory footprint of $qM + M_{\text{lte}}$</p>
</li>
<li><p>as base model is 16-bit and if we use 4-bit quantizing, $q = 0.25$</p>
</li>
<li><p>with AdamW, DDP necessitates an additional $2M$ parameters (total $3M$)</p>
</li>
<li><p>for LTE, $qM + 3M_{\text{lte}}$ is needed</p>
</li>
<li><p>Assuming training is parameter-bound by the main weights ($r \ll \min(m, n)$), LTE can leverage GPUs with roughly 1/3 the memory of DDP</p>
</li>
<li><p>LTE requires 40% more data and 20% slowdown per iteration with quantization (QLoRA)</p>
</li>
<li><p>on average, each LTE device observes 1/3 less data than a device in DDP</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e16f9afa-59d3-4bed-a619-f2d2d31953f7/image.png" alt=""></p>
</li>
<li><p>Communication bottleneck</p>
<ul>
<li><p>In multi-node systems, the communication scales with the size of the model and is bottlenecked at interconnect speed</p>
</li>
<li><p>using standard all-reduce, the gradients are shared between all devices, for a total communication of $N_{\text{ddp}}(N_{\text{ddp}} - 1)M$</p>
</li>
<li><p>for LTE, since it communicates only every $T$ iterations, the cost is ${1 \over T}N_{\text{lte}}(N_{\text{lte}} - 1)M$</p>
</li>
<li><p>using parameter server method (1 and broadcast), gradients are sent to the main parameter server and averaged</p>
</li>
<li><p>DDP with a parameter server would use $2(N_{\text{ddp}}-1)M$</p>
</li>
<li><p>LTE with parameter server would use ${1 \over T}(N_{\text{lte}} - 1)(qM + M_{\text{lte}})$</p>
</li>
<li><p>LTE can leverage lower-bandwidth communication as the parameters shared between devices are strictly smaller by a factor of $M_{\text{ddp}}/M_{\text{lte}}$</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="5-related-works">5. Related works</h1>
<ul>
<li><p>Training with adapters</p>
<ul>
<li>LoRA</li>
<li>MoE PEFT and averaging</li>
<li>Additive adapters</li>
<li>Adapters for NLP, vision, video, incremental learning, domain adaptation, vision-language, text-to-vision, perceptual learning</li>
</ul>
<ul>
<li><p>Distributed Training and Federated Learning</p>
<ul>
<li><p>Federated learning for low-compute devices, high-latency training, privacy, cross- and in-silo learning</p>
</li>
<li><p>communication efficiency    </p>
<ul>
<li>local steps</li>
<li>decentralized training</li>
<li>gradient checkpointing</li>
<li>reversible gradient computation</li>
<li>gradient or weight compression</li>
</ul>
</li>
<li><p>Combining models in federated learning</p>
<ul>
<li>FedAvg</li>
<li>weight averaging</li>
<li>probabilistic frameworks for merging</li>
<li>updating with stale parameters</li>
</ul>
</li>
<li><p>Server momentum and adaptive methods</p>
</li>
<li><p>bi-level optimization schemes</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Linear mode connectivity and model averaging</p>
<ul>
<li>deep models can be connected through nonlinear means</li>
<li>linear paths with constant energy exist in trained models</li>
<li>for models with different initialization, parameter permutations can be solved to align them linearly</li>
<li>model averaging</li>
<li>model stitching</li>
<li>Anna Karenina principle</li>
<li>model averaging within ensembles</li>
<li>utilizing an average model as a target</li>
</ul>
</li>
</ul>
<h1 id="6-conclusion">6. Conclusion</h1>
<ul>
<li><p>Low-rank adapters for model pre-training</p>
</li>
<li><p>LTE : bi-level optimization method that capitalizes on the memory-efficient properties of LoRA</p>
</li>
<li><p>how to accelerate convergence during the final 10% of the training?</p>
</li>
<li><p>how to dynamically determine the number of ranks or heads?</p>
</li>
<li><p>is heterogeneous parameterization of LoRA feasible where each LoRA has a different rank?</p>
</li>
<li><p>what strategies for merging can achieve higher performance?</p>
</li>
<li><p>This study is showing viability</p>
</li>
<li><p>tests on larger models are needed</p>
</li>
<li><p>this will pave the way for pre-training models in computationally constrained or low-bandwidth environments</p>
<ul>
<li><p>less capable and low-memory devices can train a large model</p>
</li>
<li><p><strong>wisdom of the crowd</strong></p>
</li>
</ul>
</li>
</ul>
<h1 id="7-comment">7. Comment</h1>
<p>The idea is to approximate and recover the full model using adapters, without touching the main parameters. Instead of decomposing a rank-$r$ LoRA into rank-1 pieces and merging them immediately, the heads are merged periodically. Why had no one thought of decomposing LoRA again like this before?</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Is Cosine-Similarity of Embeddings Really About Similarity?]]></title>
            <link>https://velog.io/@0404_not_found/Is-Cosine-Similarity-of-Embeddings-Really-About-Similarity</link>
            <guid>https://velog.io/@0404_not_found/Is-Cosine-Similarity-of-Embeddings-Really-About-Similarity</guid>
            <pubDate>Fri, 15 Mar 2024 21:17:55 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Discrete Entities are embedded to dense real-valued vectors</p>
<ul>
<li><p>word embedding for LLM</p>
</li>
<li><p>recommender system </p>
</li>
</ul>
</li>
<li><p>The embedding vector can be used as input to other models</p>
</li>
<li><p>Also, they can provide a data-driven notion of similarity between entities</p>
</li>
<li><p><strong>Cosine Similarity</strong> has become a very popular measure of semantic similarity</p>
<ul>
<li><p>the norm of the embedding vectors is not as important as the directional alignment between the embedding vectors</p>
</li>
<li><p>the unnormalized dot-product does not work as well in practice</p>
</li>
</ul>
</li>
<li><p>Cosine similarity of the learned embeddings can in fact yield arbitrary results</p>
<ul>
<li>learned embeddings have a DoF that can render arbitrary cosine-similarities even though their dot-products are well-defined and unique</li>
</ul>
</li>
<li><p>study linear Matrix Factorization models, which admit analytical (closed-form) solutions</p>
</li>
</ul>
<h1 id="2-matrix-factorization-models">2. Matrix Factorization Models</h1>
<ul>
<li><p>focus on linear models as they allow for closed-form solutions</p>
</li>
<li><p>matrix $X \in \mathbb{R}^{n \times p}$, containing $n$ data points and $p$ features</p>
</li>
<li><p>the goal is to estimate a low-rank matrix $AB^{\top} \in \Reals ^{p \times p}$ where $A, B \in \Reals^{p \times k}$ with $k \le p$ such that $XAB^{\top}$ is a good approximation of $X \approx XAB^{\top}$</p>
</li>
<li><p>$X$ is a user-item matrix</p>
<ul>
<li>the row $\vec{b_i}$ of $B$ : item-embeddings  ($k$-dimensional)</li>
<li>the row $\vec{x_u} \cdot A$ of $XA$ : user-embeddings</li>
<li>the embedding of user $u$ is the sum of the item-embeddings $\vec{a_j}$ that the user has consumed</li>
</ul>
</li>
<li><p>this is defined in terms of the unnormalized dot-product between two embeddings</p>
<ul>
<li><p>$(XAB^{\top})_{u,i} = \lang \vec{x_u} \cdot A, \vec{b_i} \rang$</p>
</li>
<li><p>once it has been learned, it is common to consider </p>
<ul>
<li><p>two items cosine similarity</p>
</li>
<li><p>two users cosine similarity</p>
</li>
<li><p>user-item cosine similarity</p>
</li>
</ul>
</li>
<li><p>this can lead to arbitrary results and they may not even be unique</p>
</li>
</ul>
</li>
</ul>
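<p>A minimal sketch of the matrix-factorization setup above, assuming a random binary user-item matrix for illustration:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n_users, p_items, k = 100, 20, 5
X = (rng.random((n_users, p_items)) &lt; 0.1).astype(float)   # binary user-item matrix

A = rng.standard_normal((p_items, k)) * 0.1
B = rng.standard_normal((p_items, k)) * 0.1

user_emb = X @ A                       # rows of XA: user embeddings (sum of consumed items a_j)
item_emb = B                           # rows of B: item embeddings
scores = user_emb @ item_emb.T         # (XAB^T)_{u,i} = &lt;x_u A, b_i&gt;, unnormalized dot product
print(scores.shape)                    # (100, 20)
</code></pre>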
<h2 id="21-training">2.1 Training</h2>
<ul>
<li><p>the key factor affecting the utility of cosine similarity is the <strong>regularization</strong> employed when learning the embeddings in $A, B$</p>
<ul>
<li><p>$\min_{A, B} || X - XAB^{\top} || _F ^2 + \lambda ||AB^{\top}||_F^2$</p>
</li>
<li><p>$\min_{A, B} || X - XAB^{\top} || _F ^2 + \lambda (||XA||_F^2 + ||B||_F^2)$</p>
</li>
</ul>
</li>
<li><p>First one applies $||AB^{\top}||_F^2$ to their product</p>
<ul>
<li><p>in Linear models, this L2-regularization is equivalent to learning with denoising (drop-out in the input layer)</p>
</li>
<li><p>the resulting prediction accuracy on test data was superior to the second objective</p>
</li>
<li><p>denoising/drop-out is better than weight decay (second one)</p>
</li>
</ul>
</li>
<li><p>Second one is equivalent to the usual matrix factorization objective</p>
<ul>
<li><p>$|| X - PQ^{\top} || _F ^2 + \lambda (||P||_F^2 + ||Q||_F^2)$</p>
</li>
<li><p>regularizing $P$ and $Q$ separately is similar to weight decay in deep learning</p>
</li>
</ul>
</li>
<li><p>if $\hat{A}, \hat{B}$ are solutions of either objective, then for an arbitrary rotation matrix $R \in \Reals^{k \times k}$, $\hat{A}R, \hat{B}R$ are solutions as well</p>
</li>
<li><p>cosine similarity is invariant under rotation $R$</p>
</li>
<li><p>only the first objective is invariant to rescalings of the columns of $A$ and $B$ (different latent dimensions of the embeddings)</p>
</li>
</ul>
<ul>
<li><p>if $\hat{A}\hat{B}^{\top}$ is a solution of the first objective, then for an arbitrary invertible diagonal matrix $D \in \Reals^{k \times k}$, $\hat{A}DD^{-1}\hat{B}^{\top}$ is a solution as well</p>
</li>
<li><p>Then define a new solution as a function of $D$:</p>
<p>$$
\begin{aligned} 
\hat{A}^{(D)} &amp;:= \hat{A}D \\
\hat{B}^{(D)} &amp;:= \hat{B}D^{-1}
\end{aligned}
$$</p>
</li>
<li><p>this diagonal matrix $D$ affects the normalization of the learned user and item embeddings (rows):</p>
<p>$$
\begin{aligned} 
(X\hat{A}^{(D)})_{\text{(normalized)}} &amp;= \Omega_A X\hat{A}^{(D)} = \Omega_A X\hat{A}D \\
\hat{B}^{(D)}_{\text{(normalized)}} &amp;= \Omega_B \hat{B}^{(D)} = \Omega_B \hat{B}D^{-1}
\end{aligned}
$$</p>
<p>where $\Omega_A, \Omega_B$ are the appropriate diagonal matrices that normalize each learned embedding (row) to unit Euclidean norm</p>
</li>
<li><p>a different choice for $D$ cannot be compensated by the $\Omega$ matrices</p>
</li>
<li><p>they depend on $D$, so they can be written as $\Omega_A(D), \Omega_B(D)$</p>
</li>
<li><p><strong>the cosine similarities of the embeddings depend on this arbitrary matrix $D$</strong></p>
</li>
</ul>
<ul>
<li><p>the cosine similarity becomes</p>
<ul>
<li><p>item - item
$$
\text{cosSim}(\hat{B}^{(D)}, \hat{B}^{(D)}) = \Omega_B(D) \cdot \hat{B} \cdot D^{-2} \cdot \hat{B}^{\top} \cdot \Omega_B(D)
$$</p>
</li>
<li><p>user-user
$$
\text{cosSim}(X\hat{A}^{(D)}, X\hat{A}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot D^{2} \cdot (X\hat{A})^{\top} \cdot \Omega_A(D)
$$</p>
</li>
<li><p>user-item
$$
\text{cosSim}(X\hat{A}^{(D)}, \hat{B}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot \hat{B}^{\top} \cdot \Omega_B(D)
$$</p>
</li>
</ul>
</li>
<li><p>These cosine similarities all depend on arbitrary matrix $D$</p>
</li>
<li><p>user-user and item-item similarities depend directly on $D$, while user-item similarity depends on $D$ only indirectly through its effect on the normalizing matrices</p>
</li>
</ul>
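<p>A small numeric sketch of the rescaling argument above: the product $\hat{A}DD^{-1}\hat{B}^{\top}$ is unchanged, yet the cosine similarities of the rescaled item embeddings $\hat{B}D^{-1}$ change with $D$ (the matrices here are random, for illustration only):</p>
<pre><code>import numpy as np

def cos_sim(M):
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)   # row normalization (the Omega matrices)
    return Mn @ Mn.T

rng = np.random.default_rng(0)
p, k = 6, 3
A_hat = rng.standard_normal((p, k))
B_hat = rng.standard_normal((p, k))

for scale in (1.0, 10.0):
    D = np.diag(np.array([1.0, scale, scale ** 2]))     # arbitrary diagonal rescaling
    A_d, B_d = A_hat @ D, B_hat @ np.linalg.inv(D)
    assert np.allclose(A_d @ B_d.T, A_hat @ B_hat.T)    # the model A D D^{-1} B^T is unchanged
    print(round(cos_sim(B_d)[0, 1], 3))                  # but item-item cosine similarity moves
</code></pre>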
<h2 id="22-details-on-first-objective">2.2 Details on First Objective</h2>
<ul>
<li><p>The closed-form solution of the first objective is $\hat{A}_{(1)}\hat{B}_{(1)}^{\top} = V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k \cdot V_k^{\top}$, where $X =: U\Sigma V^{\top}$, $\Sigma = \text{dMat}(..., \sigma_i, ...)$, and $V_k$ is the truncated matrix of rank $k$</p>
</li>
<li><p>Since $D$ is arbitrary, w.l.o.g. we may define $\hat{A}_{(1)} = \hat{B}_{(1)} := V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k^{{1 \over 2}}$</p>
</li>
<li><p>when we think of the special case of a full-rank MF model, this would be two cases</p>
<ul>
<li><p>choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{{1 \over 2}}$</p>
<ul>
<li><p>$A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$</p>
</li>
<li><p>$B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D^{-1} = V$</p>
</li>
<li><p>given the matrix of normalized singular vectors $V$, the normalization $\Omega_B = I$</p>
</li>
<li><p>Then $\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = VV^{\top} = I$</p>
</li>
<li><p>Cosine similarity between any pair of different item-embeddings is zero</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...) \cdot V^{\top} = \Omega_A \cdot X \cdot \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$</p>
</li>
<li><p>the only difference in user-item embeddings is the normalization $\rightarrow$ the same ranking ($\Omega_A$ is irrelevant)</p>
</li>
</ul>
</li>
<li><p>choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{-{1 \over 2}}$</p>
<ul>
<li><p>similar to previous case</p>
</li>
<li><p>$A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D^{-1} = V$</p>
</li>
<li><p>$B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, X\hat{A}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot X^{\top} \cdot \Omega_A$</p>
</li>
<li><p>for user-user similarities, it is based on the raw data-matrix</p>
</li>
<li><p>it doesn&#39;t use the learned embeddings</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot \hat{A}_{(1)} \cdot \hat{B}_{(1)}^{\top} \cdot \Omega_B$</p>
</li>
<li><p>$\Omega_B$ normalizes the rows of $B$ but this is again the same rankings</p>
</li>
<li><p>$\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_B \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{2} \cdot V^{\top} \cdot \Omega_B$</p>
</li>
<li><p>this is very different from the previous choice</p>
</li>
</ul>
</li>
<li><p>Hence, different choices of $D$ result in different cosine-similarities even though the learned model $\hat{A}_{(1)}^{(D)}\hat{B}_{(1)}^{(D)\top} = \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$ is invariant to $D$</p>
</li>
<li><p><strong>the results of cosine-similarity are arbitrary and not unique for this model</strong></p>
</li>
</ul>
</li>
</ul>
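<p>A small sketch of the closed-form solution above, computed from the SVD of a random matrix for illustration; the symmetric split between $\hat{A}_{(1)}$ and $\hat{B}_{(1)}$ is only one valid choice among the many allowed by $D$:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
lam, k = 1.0, 4

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
d = 1.0 / (1.0 + lam / sigma[:k] ** 2)            # dMat(..., 1 / (1 + lambda / sigma_i^2), ...)_k
V_k = Vt[:k].T

# one symmetric choice of the solution of the first (denoising) objective
A1 = B1 = V_k * np.sqrt(d)                        # any rescaling A1 D, B1 D^{-1} is also valid
print(np.allclose(A1 @ B1.T, V_k @ np.diag(d) @ V_k.T))   # True
</code></pre>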
<h2 id="23-details-on-second-objective">2.3 Details on Second Objective</h2>
<ul>
<li><p>The solution of the second objective is</p>
<ul>
<li><p>$\hat{A}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>$\hat{B}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>$(y)_+ = \max(0, y)$ </p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>If we use usual notation of MF $P = XA$ and $Q = B$, </p>
<ul>
<li><p>we get $\hat{P} = X\hat{A}_{(2)} = U_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>this diagonal matrix is same for user and item embeddings due to its symmetry in the L2-norm regularization</p>
</li>
<li><p>this solution is unique $\rightarrow$ there is no way to choose arbitrary matrix $D$ </p>
</li>
</ul>
</li>
<li><p>In this case, the cosine-similarity yields unique results</p>
</li>
<li><p>is this matrix $\text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$ the best possible semantic similarities?</p>
<ul>
<li>comparing this case with 2.2 suggests that the arbitrary diagonal matrix $D$ in 2.2 may be chosen as $D = \text{dMat}(..., \sqrt{{1 \over \sigma_i}}, ...)_k$</li>
</ul>
</li>
</ul>
<h1 id="3-remedies-and-alternatives-to-cosine-similarity">3. Remedies and Alternatives to Cosine-Similarity</h1>
<ul>
<li><p>when a model is trained w.r.t. the dot-product, its effect on cosine-similarity can be opaque and sometimes not even unique</p>
<ul>
<li><p>train model on cosine-similarity $\rightarrow$ use layer normalization</p>
</li>
<li><p>project the embedding back into the original space $\rightarrow$ cosine-similarity works</p>
<ul>
<li>view $X\hat{A}\hat{B}^{\top}$ as the raw data&#39;s smoothed version and the rows of $X\hat{A}\hat{B}^{\top}$ as the users&#39; embeddings in the original space</li>
</ul>
</li>
</ul>
</li>
<li><p>in cosine-similarity, normalization is applied after the embeddings have been learned</p>
<ul>
<li>this can yield worse similarities compared to applying <strong>some normalization or reduction of popularity bias</strong> before or during learning</li>
</ul>
</li>
<li><p>To resolve this, </p>
<ul>
<li><p>standardize data $X$ (zero mean, unit variance)</p>
</li>
<li><p>negative sampling, inverse propensity scaling to account for the different item popularities</p>
<ul>
<li>word2vec is trained by sampling negatives with a probability proportional to their frequency</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<ul>
<li><p>illustrate these findings for low-rank embeddings</p>
</li>
<li><p>Not aware of a good metric for semantic similarity $\rightarrow$ experiments on simulated data $\rightarrow$ ground-truths are known (clustered items data)</p>
</li>
<li><p>generated interactions between 20000 users and 1000 items assigned to 5 clusters with probability $p_c$</p>
</li>
<li><p>sampled the powerlaw-exponent for each cluster $c$, $\beta_c \sim \text{Uniform}(\beta_{min}^{(item)}, \beta_{max}^{(item)})$</p>
<ul>
<li>where $\beta_{min}^{(item)} = 0.25, \beta_{max}^{(item)} = 1.5$</li>
</ul>
</li>
<li><p>assigned a baseline popularity to each item $i$ according to the powerlaw $p_i = \text{PowerLaw}(\beta_c)$ </p>
</li>
<li><p>then generated the items that each user $u$ had interacted with</p>
<ul>
<li><p>firstly, randomly sampled user-cluster preferences $p_{uc}$ </p>
</li>
<li><p>compute the user-item probabilities $p_{ui} = {p_{uc_i}p_i \over \sum_i p_{uc_i}p_i}$</p>
</li>
<li><p>sampled the number of items for this user $k_u \sim \text{PowerLaw}(\beta^{(user)})$ (used $\beta^{(user)} = 0.5$ and sampled $k_u$ items with $p_{ui}$)</p>
</li>
</ul>
</li>
<li><p>Learned the matrices $A, B$ with two training objective ($\lambda = 10000$ and $\lambda = 100$)</p>
<ul>
<li>low-rank constraint $k=50$, $p=1000$ to complement the analytical result for the full-rank case above</li>
</ul>
</li>
</ul>
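<p>A scaled-down sketch of the simulated data generation described above; the exact power-law samplers and the Dirichlet choice for the user-cluster preferences are assumptions, since the paper&#39;s precise definitions are not reproduced in these notes:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_clusters = 200, 100, 5       # scaled-down version of the setup above

cluster = rng.integers(0, n_clusters, n_items)                       # item -&gt; cluster assignment
beta_c = rng.uniform(0.25, 1.5, n_clusters)                          # power-law exponent per cluster
p_item = rng.power(beta_c[cluster])                                  # baseline item popularity (assumed form)

X = np.zeros((n_users, n_items))
for u in range(n_users):
    p_uc = rng.dirichlet(np.ones(n_clusters))                        # user-cluster preferences (assumed)
    p_ui = p_uc[cluster] * p_item
    p_ui /= p_ui.sum()
    k_u = 1 + int(20 * rng.power(0.5))                               # number of items for this user
    items = rng.choice(n_items, size=min(k_u, n_items), replace=False, p=p_ui)
    X[u, items] = 1.0
print(X.shape, X.sum(axis=1).mean())
</code></pre>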
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d5fa3e4f-739a-4d38-9acd-87141ad6e13c/image.png" alt=""></p>
<ul>
<li><p>Left one is ground-truth item-item similarities</p>
</li>
<li><p>training with first objective and chose three re-scaling of the singular vectors in $V_k$ (middle three)</p>
</li>
<li><p>Right one is trained with second objective $\rightarrow$ unique solution</p>
</li>
<li><p>the resulting cosine-similarities are vastly different even for reasonable choices of re-scaling (extreme cases were not used)</p>
</li>
</ul>
<h1 id="5-conclusions">5. Conclusions</h1>
<ul>
<li><p>cosine similarities are heavily dependent on the method and regularization technique</p>
</li>
<li><p>in some cases, it can be rendered even meaningless</p>
</li>
<li><p>cosine-similarity of embeddings in deep models is expected to be plagued by similar problems</p>
<ul>
<li>deep model&#39;s different layers may be subject to different regularization $\rightarrow$ may affect $D$</li>
</ul>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<ul>
<li>A reflection on cosine similarity, which people tend to use blindly. The tests feel limited to a rather restricted setting, but as a prompt to question the practice at least once, it was worthwhile.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Beyond Language Models: Byte Models are Digital World Simulators]]></title>
            <link>https://velog.io/@0404_not_found/Beyond-Language-Models-Byte-Models-are-Digital-World-Simulators</link>
            <guid>https://velog.io/@0404_not_found/Beyond-Language-Models-Byte-Models-are-Digital-World-Simulators</guid>
            <pubDate>Sat, 09 Mar 2024 16:51:42 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Deep Learning has focused on interpretable digital media files - text, images, audio</p>
<ul>
<li><p>Text played central role in conveying human intelligence and has led to the emergence of LMs</p>
</li>
<li><p>LMs tokenize text and predict the next token so that they can comprehend human language and intelligence</p>
</li>
<li><p>Recent advancements extend tokenization beyond text</p>
</li>
</ul>
</li>
<li><p>These deep learning models overlook the omnipresent native binary data in the digital world</p>
<ul>
<li><p>Next-Byte Prediction will allow the models to truly understand and simulate all activities in the digital world</p>
</li>
<li><p>It has practical benefits in cybersecurity, computer diagnostics, data compression and even for reverse-engineering a software&#39;s source code from binary representation</p>
</li>
</ul>
</li>
<li><p><strong>bGPT</strong> : model for binary data processing and digital world modelling by next byte prediction</p>
<ul>
<li><p>directly interpreting and manipulating binary data</p>
</li>
<li><p>two-fold advantages</p>
<ul>
<li><p>Interpreting Digital System</p>
</li>
<li><p>Unified Modelling</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Experiment in two areas</p>
<ul>
<li><p>well-studied tasks (generative modelling, classification)</p>
</li>
<li><p>relatively underexplored tasks intrinsic to binary-native operations (data conversion, CPU state modelling)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/77ee79e4-cbcd-4b4e-ab60-63676d3cb8d8/image.png" alt=""></p>
<h1 id="2-background">2. Background</h1>
<h2 id="21-language-models">2.1 Language Models</h2>
<ul>
<li><p>Text Models</p>
<ul>
<li><p>LSTM-based to Transformer-based</p>
</li>
<li><p>Tokenization plays a fundamental role (breaking down into words or subwords)</p>
</li>
<li><p>GPT  models pretrained with self-supervised learning via next token prediction</p>
</li>
<li><p>next token prediction enables the GPT to capture the structure and semantics behind languages</p>
</li>
</ul>
</li>
<li><p>Audio Models</p>
<ul>
<li><p>AudioPaLM : merged text and speech</p>
<ul>
<li>enables speech-to-speech translation and speech recognition</li>
</ul>
</li>
<li><p>MusicGen : generate music by multiple parallel streams of acoustic tokens by EnCodec</p>
</li>
</ul>
</li>
<li><p>Image Models</p>
<ul>
<li><p>iGPT : transformer to predict next pixel</p>
</li>
<li><p>vision-language models : connect text and visual data</p>
</li>
</ul>
</li>
<li><p>Biochemical sequence Models</p>
<ul>
<li><p>Tranception : transformers to predict protein fitness</p>
</li>
<li><p>ProtGPT2 : generates protein sequences</p>
</li>
<li><p>HyenaDNA : extends context lengths in genomic modelling</p>
</li>
</ul>
</li>
</ul>
<h2 id="22-byte-models">2.2 Byte Models</h2>
<ul>
<li><p>Binary data lacks the inherent structure and semantics of human-interpretable data</p>
</li>
<li><p>MalConv, DeepVSA : malware detection and program analysis</p>
<ul>
<li><p>MalConv uses CNN to analyze byte sequences</p>
</li>
<li><p>DeepVSA : value set analysis for post-mortem program analysis</p>
</li>
</ul>
</li>
<li><p>Byte-level Byte Pair Encoding (BBPE) : used for multilingual pretraining, machine translation</p>
</li>
<li><p>ByT5 : transformers for byte sequences</p>
<ul>
<li>token-free encoding that improves noise robustness and spelling sensitivity in multilingual</li>
</ul>
</li>
<li><p>ByteFormer : raw byte sequences from images and audio</p>
</li>
<li><p>MegaByte : modelling long byte sequences across various modalities</p>
</li>
<li><p>MambaByte : used Mamba to excel in byte-level language modelling and outperformed LMs based on subword tokenization</p>
</li>
<li><p>Current research often neglects <strong>native binary data</strong>, focusing on narrow tasks and overlooking broader potential in digital world simulation</p>
</li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-model-architecture">3.1 Model Architecture</h2>
<ul>
<li><p>the high granularity of bytes results in long sequences $\rightarrow$ computational cost</p>
</li>
<li><p>quadratic self-attention scaling $\rightarrow$ computational cost</p>
</li>
<li><p><strong>hierarchical Transformer architecture</strong></p>
<ul>
<li><p>sequence of bytes $B = \{ b_1, b_2, ..., b_T \}$ of length $T$</p>
</li>
<li><p>sequence of patches $\mathcal{P} = [P_1, P_2, ..., P_N]$</p>
</li>
<li><p>each patch contains $S$ bytes</p>
</li>
<li><p>the number of patches $N = \lceil{T \over S} \rceil$ </p>
</li>
<li><p>$P_i = [b_{(i-1)S + 1}, ..., b_{iS}]$ for $1 \le i \le N$</p>
</li>
<li><p>if $T \mod S \not= 0$, the last patch is padded with $e$ (the end-of-patch token, eop) to size $S$</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0088935f-af55-4926-977a-a4424d3baad3/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h4 id="linear-projection-layer">Linear Projection Layer</h4>
<ul>
<li><p>Each patch $P_i$ from $\mathcal{P}$ is viewed as a matrix of size $S \times 257$ </p>
<ul>
<li>each byte is one-hot encoded (256 values + eop token)</li>
</ul>
</li>
<li><p>Flatten those patches into one-dimensional vectors</p>
<ul>
<li>rows in the matrix are concatenated</li>
</ul>
</li>
<li><p>the projection layer maps each flattened vector into a dense vector $E_i$ of hidden size $H$</p>
<ul>
<li>$E_i = \text{Flatten}(P_i) \cdot W_{\text{linear}}, \quad 1 \le i \le N$</li>
</ul>
</li>
<li><p>$W_{\text{linear}}$ has the shape of $(257\times S, H)$</p>
</li>
<li><p>Dense embedding enables more efficient processing of the byte sequence by reducing the dimension while preserving the essential information</p>
</li>
</ul>
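<p>A minimal sketch of the patching and linear projection described above (one-hot bytes over 257 symbols including eop, flattened per patch, projected to hidden size $H$); the sizes are illustrative:</p>
<pre><code>import torch
import torch.nn as nn

S, H, EOP = 16, 64, 256                            # patch size, hidden size, end-of-patch id

def patchify(byte_seq, S=S, eop=EOP):
    """Split a byte sequence into patches of S bytes, padding the last patch with eop."""
    pad = (-len(byte_seq)) % S
    padded = byte_seq + [eop] * pad
    return [padded[i:i + S] for i in range(0, len(padded), S)]

proj = nn.Linear(257 * S, H)                       # W_linear has shape (257 * S, H)

patches = patchify(list(b"hello, digital world"))
onehot = torch.nn.functional.one_hot(torch.tensor(patches), num_classes=257).float()  # (N, S, 257)
E = proj(onehot.flatten(start_dim=1))              # (N, H): dense patch embeddings
print(onehot.shape, E.shape)
</code></pre>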
<h4 id="patch-level-decoder">Patch-Level Decoder</h4>
<ul>
<li><p>Takes the sequence of embedded patches $\mathcal{E} = \{ E_1, E_2, ..., E_N \}$ and processes it to autoregressively predict the features of the subsequent patch, effectively learning the structure of data</p>
<ul>
<li><p>$\hat{E}_i = \text{Decoder}_{\text{patch}}(\mathcal{E}_{&lt;i} \oplus \mathcal{X}_{&lt;i})$</p>
</li>
<li><p>$\mathcal{E}_{&lt;i}$ for the sequence of patch embedding before the $i$-th patch</p>
</li>
<li><p>$\mathcal{X}_{&lt;i}$ for corresponding positional embeddings</p>
</li>
<li><p>$\oplus$ for element-wise addition</p>
</li>
</ul>
</li>
</ul>
<h4 id="byte-level-decoder">Byte-Level Decoder</h4>
<ul>
<li><p>Takes the predicted feature $\hat{E}_i$ of each patch and autoregressively reconstructs the sequence of bytes within that patch</p>
</li>
<li><p>independent for each patch and operates by conditioning on the feature representation $\hat{E}_i$ of the current patch</p>
</li>
<li><p>$\hat{b}_{i, j} = \text{Decoder}_{\text{byte}}(\hat{E}_i, b_{i, &lt;j}), \quad 1 \le j \le S$</p>
</li>
</ul>
<h2 id="32-training-objectives">3.2 Training Objectives</h2>
<h4 id="generative-modelling">Generative Modelling</h4>
<ul>
<li><p>aims to predict the next byte $b_{i+1}$ based on preceding bytes $\{ b_1, b_2, ..., b_i \}$ without explicit guidance</p>
</li>
<li><p>the objective is minimizing the negative log-likelihood of the next byte prediction across the sequence</p>
</li>
<li><p>$\mathcal{L}_{\text{GEN}}(\theta) = - \displaystyle\sum_{i=1}^{T-1} \log p(b_{i+1}|b_1, b_2, ..., b_i; \theta)$</p>
</li>
<li><p>this loss encourages the model to understand the sequential dependencies in data at the byte level</p>
</li>
</ul>
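<p>A minimal sketch of the generative objective above, written as cross-entropy over shifted byte positions (which equals the summed negative log-likelihood); the logits are a stand-in for bGPT&#39;s per-byte output:</p>
<pre><code>import torch
import torch.nn.functional as F

def next_byte_loss(logits, bytes_seq):
    """L_GEN = - sum_i log p(b_{i+1} | b_1..b_i); logits: (T, 257) model outputs,
    bytes_seq: (T,) ground-truth bytes. Position i predicts byte i+1."""
    return F.cross_entropy(logits[:-1], bytes_seq[1:], reduction="sum")

T = 32
logits = torch.randn(T, 257)                 # stand-in for the byte-level decoder output
bytes_seq = torch.randint(0, 256, (T,))
print(next_byte_loss(logits, bytes_seq))
</code></pre>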
<h4 id="classification">Classification</h4>
<ul>
<li><p>After being pretrained with next byte prediction, it is further trained on labelled datasets for classification</p>
</li>
<li><p>predicts categories from byte sequences</p>
</li>
<li><p>involves extracting a global feature from the byte sequence which is then processed by a classification head</p>
</li>
<li><p>$\mathcal{L}_{\text{CLF}}(\theta) = -\displaystyle\sum_{k=1}^K y_k \log p(y_k | B; \theta)$</p>
</li>
<li><p>$y_k$ is the boolean label for the $k$-th category indicating whether the byte sequence is for that category</p>
</li>
<li><p>$K$ for the total number of categories</p>
</li>
<li><p>$p(y_k | B; \theta)$ is the predicted probability of category $k$ given the byte sequence $B$</p>
</li>
</ul>
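<p>Both objectives above reduce to standard cross-entropy; a hedged sketch with illustrative shapes, not the exact training code:</p>
<pre><code>import torch
import torch.nn.functional as F

def generative_loss(next_byte_logits, byte_ids):
    """L_GEN: negative log-likelihood of the next byte.
    next_byte_logits: (T-1, 257) predictions for positions 2..T; byte_ids: (T,)."""
    return F.cross_entropy(next_byte_logits, byte_ids[1:])

def classification_loss(class_logits, label):
    """L_CLF: cross-entropy over K categories, computed from a global feature
    (e.g. average-pooled patch-level features) passed through a classification head."""
    return F.cross_entropy(class_logits.unsqueeze(0), torch.tensor([label]))
</code></pre>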
<h1 id="4-applications">4. Applications</h1>
<h2 id="41-digital-media-processing">4.1 Digital Media Processing</h2>
<ul>
<li><p>The field of deep learning is steadily advancing its proficiency in both generation and classification of text, audio, and images</p>
</li>
<li><p>These media are typically stored and transmitted as byte sequences $\rightarrow$ bGPT can process them for generative modelling and classification</p>
</li>
<li><p>bGPT is trained with next-byte prediction, uses features from the patch-level decoder, and employs average pooling to derive global features for classification</p>
</li>
<li><p>Data</p>
<ul>
<li>Audio : convert to WAV, including an 8000Hz sampling rate, mono channel, 8-bit depth, trimmed to 1 sec</li>
<li>Image : convert to BMP, 32 * 32, RGB, 24-bit depth</li>
</ul>
</li>
</ul>
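<p>A rough sketch of the preprocessing described above, using PIL and pydub; the exact conversion pipeline the authors used is not specified here:</p>
<pre><code>from PIL import Image
from pydub import AudioSegment

def to_bmp(in_path, out_path):
    # 32 x 32, RGB (24-bit) BMP
    Image.open(in_path).convert("RGB").resize((32, 32)).save(out_path, format="BMP")

def to_wav(in_path, out_path):
    # 8000 Hz, mono, 8-bit WAV, trimmed to 1 second
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_frame_rate(8000).set_channels(1).set_sample_width(1)
    audio[:1000].export(out_path, format="wav")
</code></pre>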
<h2 id="42-algorithm-and-hardware-simulation">4.2 Algorithm and Hardware Simulation</h2>
<h4 id="data-conversion">Data Conversion</h4>
<ul>
<li><p>converting data from one format to another with symbolic music formats (ABC notation) and MIDI files</p>
</li>
<li><p>employs the generative modelling approach on concatenated byte sequences of paired ABC and MIDI files separated by a special patch</p>
</li>
<li><p>bGPT learns to convert text-based ABC notation into binary MIDI performance signals and its reverse</p>
</li>
<li><p>ability to simulate and reverse-engineer the conversion algorithm</p>
</li>
</ul>
<h4 id="cpu-state-modeling">CPU State Modeling</h4>
<ul>
<li><p>given concatenated sequences of low-level machine instructions followed by a series of CPU register states</p>
</li>
<li><p>to accurately predict how the state updates with each instruction until the program halts</p>
</li>
<li><p>interpreting operational data and replicating digital activities within hardware</p>
</li>
<li><p>CPU States dataset (2.1M instances)</p>
<ul>
<li><p>offering a simplified representation of CPU behavior </p>
</li>
<li><p>each instance contains a 1KB memory block with varying numbers of machine instructions followed by a sequence of 16-byte CPU register states</p>
</li>
<li><p>the instruction sequences cover various instruction types (21 types with 43 variants: data movement, logical operations, arithmetic operations)</p>
</li>
<li><p>within each state</p>
<ul>
<li><p>1 byte each for the Program Counter and the Accumulator</p>
</li>
<li><p>4 bytes for Instruction Register</p>
</li>
<li><p>10 bytes for general-purpose registers</p>
</li>
</ul>
</li>
<li><p>instances are randomly generated with 1 to 256 instructions, together with the captured register states</p>
</li>
</ul>
</li>
</ul>
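<p>For intuition, one 16-byte register state could be packed as below; the field order is my assumption for illustration, not necessarily the dataset&#39;s exact layout:</p>
<pre><code>def pack_state(pc, acc, ir, gprs):
    """1-byte PC, 1-byte ACC, 4-byte IR, 10 general-purpose registers = 16 bytes."""
    assert len(ir) == 4 and len(gprs) == 10
    return bytes([pc, acc]) + bytes(ir) + bytes(gprs)

state = pack_state(pc=3, acc=0, ir=[0x01, 0x00, 0x00, 0x2A], gprs=[0] * 10)
assert len(state) == 16
</code></pre>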
<h1 id="5-experiments">5. Experiments</h1>
<h2 id="51-settings">5.1 Settings</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fd2b7912-db4a-4eb3-b001-48cc8b7d7d0b/image.png" alt=""></p>
<ul>
<li>used open-source datasets</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ffa923af-485d-4a07-82be-984403055a25/image.png" alt=""></p>
<ul>
<li><p>the 110M-parameter bGPT matches the scale of standard Transformer-based models</p>
</li>
<li><p>avoided hyperparameter tuning and data augmentation for all evaluations</p>
</li>
<li><p>Acc for classification</p>
</li>
<li><p>Bits-Per-Byte (BPB) for generative modelling (see the short note below)</p>
</li>
</ul>
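<p>Bits-Per-Byte is the average negative log-likelihood per byte expressed in bits (lower is better); a minimal helper:</p>
<pre><code>import math

def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed NLL (in nats) over a byte sequence into BPB."""
    return total_nll_nats / (num_bytes * math.log(2))
</code></pre>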
<h2 id="52-digital-media-processing">5.2 Digital Media Processing</h2>
<ul>
<li><p>used standard pre-training and fine-tuning approach</p>
</li>
<li><p>$\text{bGPT}_{\text{image}}$ : using ImageNet</p>
</li>
<li><p>$\text{bGPT}_{\text{wiki}}$ : Wikipedia</p>
</li>
<li><p>$\text{bGPT}_{\text{libri}}$ : LibriSpeech</p>
</li>
<li><p>$\text{bGPT}_{\text{signal}}$ : LibriSpeech + ImageNet</p>
</li>
<li><p>$\text{bGPT}_{\text{mix}}$ : LibriSpeech + ImageNet + Wikipedia</p>
</li>
<li><p>$\text{bGPT}_{\text{random}}$ : randomly initialized, baseline</p>
</li>
<li><p>first fine-tuned with next byte prediction on AGNews, CIFAR-10, Speech Commands v2</p>
</li>
<li><p>then fine-tuned for classification</p>
</li>
</ul>
<h3 id="521-baselines">5.2.1 Baselines</h3>
<ul>
<li><p>GPT2-small for text</p>
<ul>
<li>pretrained on English Wikipedia with the same settings as bGPT</li>
</ul>
</li>
<li><p>ViT-B/16 for image </p>
<ul>
<li><p>pretrained on ImageNet</p>
</li>
<li><p>results are taken from original studies</p>
</li>
</ul>
</li>
<li><p>AST for audio</p>
</li>
</ul>
<h3 id="522-results">5.2.2 Results</h3>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6f1e288a-073f-4df4-b9dc-ab522b8ae3ad/image.png" alt=""></p>
<ul>
<li><p>When the pretraining and fine-tuning data modalities match, the model shows strong performance in downstream tasks</p>
</li>
<li><p>Despite not having modality-specific prior knowledge, bGPT still manages to achieve performance similar to the baselines</p>
</li>
<li><p>but $\text{bGPT}_{\text{image}}$ performs much lower than ViT, as the sequential processing nature of byte models is not well suited to 2D data</p>
<ul>
<li>simply scaling up while retaining this sequential processing may not be enough to close the gap</li>
</ul>
</li>
<li><p>$\text{bGPT}_{\text{signal}}$ and $\text{bGPT}_{\text{mix}}$ show accuracy comparable to the unimodal models, with only a small loss</p>
<ul>
<li>Trade-off in byte models : mixed modality dilutes the depth of domain-specific understanding but it fosters versatility</li>
</ul>
</li>
<li><p>positive transfer (pretrain on Audio/Image and fine-tune on Image/Audio) shows improvements over random initialization</p>
<ul>
<li>audio and image have some shared byte pattern</li>
</ul>
</li>
<li><p>negative transfer (from text to other modalities) shows that the structured patterns learned in pretraining do not carry over</p>
<ul>
<li>text has byte-level organizational patterns distinct from audio and image</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e71cf76f-ff45-4664-92e7-8be5ea184549/image.png" alt=""></p>
<ul>
<li><p>To investigate cross-modal knowledge transfer    </p>
<ul>
<li><p>convert the Speech Commands v2 into 32 * 32 BMP spectrograms</p>
</li>
<li><p>8KB audio to 3KB images</p>
</li>
<li><p>there is some information loss</p>
</li>
</ul>
</li>
<li><p>the image model was chosen for its data format consistency with spectrograms</p>
</li>
<li><p>the libri model was chosen for its similarity in information content</p>
</li>
<li><p>judging by the image and libri models&#39; BPB, the performance disparity seen on CIFAR-10 does not extend to this spectrogram task</p>
<ul>
<li>CIFAR-10 images share fewer patterns with spectrograms than spectrograms share with raw audio</li>
</ul>
</li>
<li><p>the libri model achieves higher accuracy than the image model on spectrograms of speech content</p>
</li>
<li><p>byte models have an inherent capability to discern and translate abstract data features and patterns regardless of modality and format</p>
</li>
</ul>
<h2 id="53-algorithm-and-hardware-simulation">5.3 Algorithm and Hardware Simulation</h2>
<ul>
<li><p>To evaluate bGPT&#39;s ability in simulating algorithms and hardware</p>
</li>
<li><p>Lack of baseline models and widely used datasets $\rightarrow$ evaluating scalability of bGPT on binary data</p>
</li>
<li><p>data conversion and CPU state modelling</p>
</li>
<li><p>training data scale from $10^3$ to $10^6$ entries ($\text{bGPT}^3$ to $\text{bGPT}^6$)</p>
</li>
<li><p>all models are randomly initialized</p>
</li>
<li><p>for data conversion, used the IrishMAN dataset (ABC notation and MIDI files)</p>
</li>
</ul>
<h3 id="531-data-conversion">5.3.1 Data Conversion</h3>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8d3986bb-d321-40af-bc85-6cb5e835bb7a/image.png" alt=""></p>
<ul>
<li>for ABC to MIDI, $\text{BPB}_{\text{abc}}$ assesses generative modelling, as the ABC content is generated from scratch, and $\text{BPB}_{\text{MIDI}}$ evaluates data conversion, as the full ABC byte sequence is given</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ad00def8-93e6-4bdb-a5da-9eee696ac42f/image.png" alt=""></p>
<ul>
<li><p>increased data volume directly enhances model performance in simulating data conversion</p>
</li>
<li><p>from Table 5, the BPB decreases as the training data scale grows</p>
</li>
<li><p>the BPB for ABC is higher than for MIDI in both conversion directions</p>
<ul>
<li><p>ABC to MIDI focuses on simulating an existing algorithm with necessary information while the reverse process requires inferring and reconstructing missing information in MIDI (score structure, musical ornament, expression)</p>
</li>
<li><p>as MIDI is binary and ABC is text, model may find it easier to learn patterns within MIDI files</p>
</li>
</ul>
</li>
</ul>
<h3 id="532-cpu-state-modelling">5.3.2 CPU State Modelling</h3>
<ul>
<li><p>to replicate CPU functionality </p>
</li>
<li><p>selecting the highest probability byte at each step</p>
</li>
<li><p>accuracy $\rightarrow$ byte-wise comparisons with actual states</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/270b1517-f33a-403e-a5e0-61fab5d2282c/image.png" alt=""></p>
<ul>
<li><p>data volume significantly influences modelling performance</p>
</li>
<li><p>efficiency beyond simple memorization (each test case consists of an average of 128 instructions)</p>
</li>
<li><p>After epoch 11, $\text{bGPT}^5$ showed significant improvement of performance $\rightarrow$ deeper understanding of CPU states may stem from a qualitative enhancement in capability</p>
</li>
<li><p>Aligns with emergent abilities in LLMs</p>
</li>
<li><p>Is this learning genuine?</p>
<ul>
<li><p>one concern: are the performance boosts merely due to non-linear metrics or overfitting?</p>
</li>
<li><p>but BPB is linear and smooth</p>
</li>
<li><p>so this improvement seems to stem from a genuine comprehension of CPU operations</p>
</li>
</ul>
</li>
<li><p>bGPT shows strong scalability on native binary data with emergent abilities in data conversion and CPU state modelling</p>
</li>
</ul>
<h1 id="6-conclusions">6. Conclusions</h1>
<ul>
<li><p><strong>bGPT</strong> : as a versatile simulator for the digital world</p>
</li>
<li><p>extending deep learning to binary data processing</p>
</li>
<li><p>effective in modeling digital media data + modality-agnostic knowledge transfer</p>
</li>
<li><p>strong scalability in modelling native binary data and signs of emergent abilities</p>
</li>
<li><p>without modality-specific designs, it shows comparable performance</p>
</li>
<li><p>opportunities for improvement</p>
<ul>
<li><p>currently tested for short audio and low-resolution images</p>
</li>
<li><p>data conversion between ABC and MIDI</p>
</li>
<li><p>only simplified CPUs</p>
</li>
</ul>
</li>
<li><p>Future research</p>
<ul>
<li><p>reducing computational cost</p>
</li>
<li><p>scaling models and datasets to cover a broader range of data</p>
</li>
<li><p>improving model performance for underexplored tasks</p>
</li>
</ul>
</li>
</ul>
<h1 id="7-impact-statements">7. Impact Statements</h1>
<ul>
<li><p>it necessitates a careful examination of its ethical implications</p>
</li>
<li><p>its ability to simulate or reverse-engineer algorithms</p>
<ul>
<li><p>can significantly boost technological innovation in cybersecurity, software, hardware</p>
</li>
<li><p>poses a risk to intellectual property as training bGPT on paired source code and executable software might enable the reverse-engineering of proprietary software</p>
</li>
</ul>
</li>
<li><p>it opens opportunities for advancing our understanding of the digital world, but the ethical, societal, and legal implications must be handled carefully</p>
</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<ul>
<li>Since all computer data ultimately comes down to 0s and 1s, the idea is to work at the byte level and realize multimodality that way. The reverse-engineering-style task via CPU states was also quite interesting. As usual, scale is the problem, but one open question: representing everything as bytes should require a far longer context length than current models use, and the paper does not seem to address this much.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[The Era of 1-bit LLMs: All LLMs are in 1.58 bits]]></title>
            <link>https://velog.io/@0404_not_found/The-Era-of-1-bit-LLMs-All-LLMs-are-in-1.58-bits</link>
            <guid>https://velog.io/@0404_not_found/The-Era-of-1-bit-LLMs-All-LLMs-are-in-1.58-bits</guid>
            <pubDate>Mon, 04 Mar 2024 13:28:02 GMT</pubDate>
            <description><![CDATA[<h1 id="abstract">Abstract</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0370c0cd-c12d-47d7-a4dc-99e287598225/image.png" alt=""></p>
<ul>
<li><p>BitNet paved the way for a new era of 1-bit LLMs</p>
</li>
<li><p>BitNet b1.58 has every parameter as a <strong>ternary</strong> value in {-1, 0, 1}</p>
<ul>
<li>matches a full-precision Transformer with the same model size</li>
<li>significantly more cost-effective</li>
</ul>
</li>
<li><p>defines a new scaling law and recipe for training</p>
</li>
</ul>
<h1 id="1-the-era-of-1-bit-llms">1. The era of 1-bit LLMs</h1>
<ul>
<li><p>The recent LLMs&#39; size is increasing</p>
<ul>
<li><p>remarkable performance on LLM tasks</p>
</li>
<li><p>high energy consumption</p>
<ul>
<li>challenges for deployment</li>
<li>environmental and economic impact</li>
</ul>
</li>
</ul>
</li>
<li><p>Post-training quantization to create low-bit models for inference</p>
<ul>
<li><p>reduces weights and activations</p>
</li>
<li><p>16 bits to lower bits (4-bits)</p>
</li>
<li><p>sub-optimal</p>
</li>
</ul>
</li>
<li><p>BitNet presents a direction for reducing the cost of LLMs while maintaining their performance</p>
</li>
<li><p>the major computation cost comes from the <strong>floating-point addition and multiplication</strong> </p>
<ul>
<li>BitNet has only integer addition</li>
</ul>
</li>
<li><p>transferring model parameters from DRAM to the memory of an on-chip accelerator (SRAM) can be expensive during inference</p>
<ul>
<li><p>enlarging SRAM to improve throughput $\rightarrow$ significantly higher costs than DRAM</p>
</li>
<li><p>1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint</p>
</li>
</ul>
</li>
<li><p>BitNet b1.58</p>
<ul>
<li><p>added 0 to original BitNet</p>
</li>
<li><p>retains all the benefits of the original BitNet</p>
</li>
<li><p>included new computation paradigm (no multiplication for matmul)</p>
</li>
<li><p>same energy consumption as the original BitNet</p>
</li>
<li><p>stronger modeling capability $\rightarrow$ explicit support for feature filtering by the inclusion of 0</p>
</li>
<li><p>it can match full-precision baselines in terms of perplexity and end-task performance starting from the 3B size</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-bitnet-b158">2. BitNet b1.58</h1>
<h4 id="recap-bitlinear">Recap: BitLinear</h4>
<ol>
<li>Binarize weights to +1 or -1 with signum function</li>
</ol>
<ul>
<li>Centralize to be zero-mean to increase the capacity within a limited numerical range</li>
<li>Use the scaling factor $\beta$ after binarization to reduce the l2 error between the real-valued and binarized weights.</li>
</ul>
<p>$$
\tilde{W} = \text{Sign}(W - \alpha)
$$
$$
\text{Sign}(W_{ij}) = \begin{cases} +1, &amp; \text{if} \ W_{ij} &gt; 0, \\
-1, &amp; \text{if} \ W_{ij} \le 0
\end{cases}
$$
$$
\alpha = {1 \over nm} \sum_{ij} W_{ij}
$$
$$
\beta = {1 \over nm} ||W||_1
$$</p>
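<p>A small sketch of the weight binarization recapped above (PyTorch, treating $\beta$ as the scale kept for the later dequantization step):</p>
<pre><code>import torch

def binarize_weights(W):
    """Zero-center W (alpha), binarize with sign, and keep beta = mean |W|."""
    alpha = W.mean()
    beta = W.abs().mean()        # (1/nm) * ||W||_1
    W_bin = torch.sign(W - alpha)
    W_bin[W_bin == 0] = 1        # sign() returns 0 at exactly 0; map it to +1
    return W_bin, beta
</code></pre>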
<ol start="2">
<li>Quantize activations to $b$-bit precision with absmax</li>
</ol>
<ul>
<li><p>$Q_b = 2^{b-1}$</p>
</li>
<li><p>$\epsilon$ is a small floating-point number that prevents overflow in clipping
$$
\tilde{x} = \text{Quant}(x) = \text{Clip} \left( x \times {Q_b \over \gamma}, -Q_b + \epsilon , Q_b - \epsilon \right)
$$
$$
\gamma = ||x||_{\infin}
$$</p>
</li>
<li><p>For activations before non-linear functions (ReLU) $\rightarrow$ scale into $[0, Q_b]$ by subtracting the minimum of the inputs
$$
\tilde{x} = \text{Quant}(x) = \text{Clip} \left( (x-\eta) \times {Q_b \over \gamma}, \epsilon, Q_b - \epsilon\right)
$$
$$
\eta = \min_{ij} x_{ij}
$$</p>
</li>
<li><p>quantize with 8-bit </p>
</li>
<li><p>Training $\rightarrow$ quantize per tensor / Inference $\rightarrow$ quantize per token</p>
</li>
</ul>
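<p>And the absmax activation quantization above, sketched per tensor for simplicity:</p>
<pre><code>import torch

def quantize_activations(x, b=8, eps=1e-5):
    """Scale x by Q_b / gamma (gamma = max |x|) and clip to [-Q_b + eps, Q_b - eps]."""
    Qb = 2 ** (b - 1)
    gamma = x.abs().max()
    return torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)
</code></pre>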
<ol start="3">
<li>Matrix Multiplication
$$
y = \tilde{W} \tilde{x}
$$</li>
</ol>
<p>The variance of the output $y$ under following assumption</p>
<ul>
<li>the elements in $W$ and $x$ are mutually independent and share same distribution</li>
<li>$W$ and $x$ are independent of each other</li>
</ul>
<p>$$
\begin{aligned} 
\text{Var}(y) &amp;= n\text{Var}(\tilde{w}\tilde{x}) \\
&amp;= nE \left[ \tilde{w}^2 \right]E\left[ \tilde{x}^2 \right] \\
&amp;= n \beta^2 E \left[\tilde{x}^2\right] \approx E\left[\tilde{x}^2\right]
\end{aligned}
$$
In full-precision, $\text{Var}(y) = 1$ with standard initialization method $\rightarrow$ training stability. To preserve this stability, use LayerNorm function.</p>
<ul>
<li>$$
\text{Var}(y) \approx E[\text{LN}(\tilde{x})^2] = 1 \quad \quad \quad (\text{SubLN})
$$</li>
</ul>
<p>Then, the final representation of BitLinear is:
$$
y = \widetilde{W}\widetilde{x} = \widetilde{W} \text{Quant}(\text{LN}(x)) \times {\beta\gamma \over Q_b} \\
\text{LN} (x) = {x - E(x) \over \sqrt{\text{Var}(x) + \epsilon}}
$$
${\beta\gamma \over Q_b}$ means Dequantization to restore original precision</p>
<ol start="4">
<li>Model Parallelism with Group quantization and Normalization</li>
</ol>
<ul>
<li>Calculate all parameters $\alpha, \beta, \gamma, \eta$ within each group (device)</li>
<li>If the number of groups is $G$, then the parameters become
$$
\alpha_g = {G \over nm} \sum_{ij} W_{ij}^{(g)}, \quad \quad \beta_g = {G \over nm} ||W^{(g)}||_1, \\
\gamma_g = ||x^{(g)}||_{\infty}, \quad \quad \eta_g = \min_{ij} x_{ij}^{(g)}
$$</li>
<li>LayerNorm should also be applied with similar way</li>
</ul>
<h4 id="bitnet-b158">BitNet B1.58</h4>
<ul>
<li><p>based on the BitLinear</p>
</li>
<li><p>trained from scratch, 1.58-bit weights and 8-bit activations</p>
</li>
<li><p>adopted <strong>absmean</strong> quantization</p>
<ul>
<li><p>scales the weight by its average absolute value</p>
</li>
<li><p>round each value to the nearest integer among {-1, 0, 1} (a sketch follows this list)
$$
\begin{aligned} 
\tilde{W} &amp;= \text{RoundClip}({W \over \gamma + \epsilon}, -1, 1) \\
\text{RoundClip}(x, a, b) &amp;= \max(a, \min(b, \text{round}(x))) \\
\gamma &amp;= {1 \over mn} \sum_{ij} |W_{ij}|
\end{aligned}
$$</p>
</li>
<li><p>don&#39;t scale the activations before the non-linear functions to the range $[0, Q_b]$</p>
</li>
<li><p>scale all activations to $[-Q_b, Q_b]$ per token to get rid of the zero-point quantization</p>
<ul>
<li>more convenient and simple for both implementation and system-level optimization</li>
</ul>
</li>
</ul>
</li>
</ul>
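<p>A minimal sketch of the absmean quantization above, assuming a plain PyTorch tensor for the weight matrix:</p>
<pre><code>import torch

def absmean_ternarize(W, eps=1e-5):
    """Scale W by its mean absolute value, then RoundClip to {-1, 0, +1}."""
    gamma = W.abs().mean()                        # (1/mn) * sum |W_ij|
    return torch.clamp(torch.round(W / (gamma + eps)), -1, 1)
</code></pre>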
<h4 id="llama-alike-components">LLaMA-alike components</h4>
<ul>
<li><p>used LLaMA alike components</p>
<ul>
<li>RMSNorm</li>
<li>SwiGLU</li>
<li>rotary embedding</li>
<li>removed all biases</li>
</ul>
</li>
<li><p>it can be integrated into the popular open-source software</p>
</li>
</ul>
<h1 id="3-results">3. Results</h1>
<ul>
<li><p>BitNet b1.58 vs FP16 LLaMA</p>
</li>
<li><p>pretrained on RedPajama for 100B tokens</p>
</li>
<li><p>zero-shot performance</p>
<ul>
<li><p>ARC-Easy</p>
</li>
<li><p>ARC-Challenge</p>
</li>
<li><p>Hellaswag</p>
</li>
<li><p>Winogrande</p>
</li>
<li><p>PIQA</p>
</li>
<li><p>OpenbookQA</p>
</li>
<li><p>BoolQ</p>
</li>
</ul>
</li>
<li><p>validation PPL</p>
<ul>
<li><p>WikiText2</p>
</li>
<li><p>C4</p>
</li>
</ul>
</li>
<li><p>runtime GPU memory and latency</p>
<ul>
<li><p>FasterTransformer codebase</p>
</li>
<li><p>2-bit kernel from Ladder in BitNet</p>
</li>
<li><p>the time per output token</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/433c531f-d56d-405b-a4cf-e1a66e5cc3ae/image.png" alt=""></p>
<ul>
<li><p>BitNet starts to match FP LLaMA at 3B size</p>
</li>
<li><p>BitNet b1.58 3.9B outperforms FP LLaMA 3B</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/506ee24e-0f80-4162-815e-a9c6bc87e631/image.png" alt=""></p>
<ul>
<li><p>the performance gap between BitNet and LLaMA narrows as the model size increases</p>
</li>
<li><p>in terms of zero-shot performance, BitNet starts to match LLaMA at the 3B size</p>
</li>
<li><p>BitNet b1.58 3.9B outperforms LLaMA $\rightarrow$ BitNet b1.58 is a Pareto improvement over the SOTA LLMs</p>
</li>
</ul>
<h4 id="memory-and-latency">Memory and Latency</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8e91fac3-e056-40bc-a24f-2e10ecb88965/image.png" alt=""></p>
<ul>
<li><p>the speed-up increases as the model size scales</p>
<ul>
<li>the proportion of nn.Linear increases as the model size grows</li>
</ul>
</li>
<li><p>for the memory, the trend follows that of the latency</p>
<ul>
<li>as the embedding remains full precision and its proportion gets smaller</li>
</ul>
</li>
<li><p>Both were measured with a 2-bit kernel</p>
<ul>
<li>there is still room for optimization</li>
</ul>
</li>
</ul>
<h4 id="energy">Energy</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ee87a15d-a147-44e7-9599-dabeb032e909/image.png" alt=""></p>
<ul>
<li><p>for LLaMA model, the majority of matmul is FP16 multiplication while for BitNet, it is INT8 addition</p>
</li>
<li><p>BitNet is more efficient when model is large</p>
<ul>
<li>as the percentage of nn.Linear grows with the model size</li>
</ul>
</li>
</ul>
<h4 id="throughput">Throughput</h4>
<ul>
<li><p>compared on two A100 80G cards</p>
</li>
<li><p>BitNet b1.58 and LLaMA 70B</p>
</li>
<li><p>maximum batch size that fits in GPU memory</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/26a061f7-2ec8-47f0-8e44-a23249725c09/image.png" alt=""></p>
<h4 id="bitnet-b158-is-enabling-a-new-scaling-law-wrt-model-performance-and-inference">BitNet b1.58 is enabling a new scaling law w.r.t. model performance and inference</h4>
<ul>
<li><p>in terms of latency, memory usage and energy consumption,</p>
<ul>
<li><p>BitNet 13B &gt; FP16 3B</p>
</li>
<li><p>BitNet 30B &gt; FP16 7B</p>
</li>
<li><p>BitNet 70B &gt; FP16 13B</p>
</li>
</ul>
</li>
</ul>
<h4 id="training-with-2t-tokens">Training with 2T tokens</h4>
<ul>
<li><p>to test scalability in terms of token</p>
</li>
<li><p>same recipe as StableLM 3B</p>
</li>
<li><p>evaluated on </p>
<ul>
<li>Winogrande</li>
<li>PIQA</li>
<li>SciQ</li>
<li>LAMBADA</li>
<li>ARC-easy</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9f572fb5-5c2a-49f6-ba3b-a997ee9dcca9/image.png" alt=""></p>
<ul>
<li>It has strong generalization capabilities</li>
</ul>
<h1 id="4-discussion-and-future-work">4. Discussion and Future Work</h1>
<h4 id="1-bit-moe-llms">1-bit MoE LLMs</h4>
<ul>
<li><p>MoE has high memory consumption and inter-chip communication overhead</p>
</li>
<li><p>BitNet b1.58 can handle them</p>
<ul>
<li><p>reduced memory footprint reduces the number of devices required to deploy MoE models</p>
</li>
<li><p>there would be no overhead if the entire models could be placed on a single chip</p>
</li>
</ul>
</li>
</ul>
<h4 id="native-support-of-long-sequence-in-llms">Native Support of Long Sequence in LLMs</h4>
<ul>
<li><p>the main issue in handling long sequences is the memory consumption introduced by the KV caches</p>
</li>
<li><p>BitNet b1.58 reduces activations from 16-bits to 8-bits</p>
<ul>
<li><p>doubling the sequence length</p>
</li>
<li><p>if reducing to lower than 4 bits is possible, the length would be longer</p>
</li>
</ul>
</li>
</ul>
<h4 id="llms-on-edge-and-mobile">LLMs on Edge and Mobile</h4>
<ul>
<li><p>for Edge and Mobile device, BitNet b1.58 can resolve the issue of memory and computational power</p>
</li>
<li><p>BitNet is more friendly to CPU devices</p>
</li>
</ul>
<h4 id="new-hardware-for-1-bit-llms">New Hardware for 1-bit LLMs</h4>
<ul>
<li>Groq demonstrated promising results and great potential for specific LLMs (LPU)</li>
<li>expect new hardware for 1-bit LLM</li>
</ul>
<h1 id="5-comment">5. Comment</h1>
<p>A comment rewritten after losing it twice. If the original BitNet showed the potential of 1-bit models, this paper feels like a more polished follow-up. It brings on-device inference and ternary semiconductors to mind. It would have been nice if the authors had explained why 0 was not included from the start, and what effect the choice of quantization ranges has.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[CoT Reasoning without Prompting]]></title>
            <link>https://velog.io/@0404_not_found/CoT-Reasoning-without-Prompting</link>
            <guid>https://velog.io/@0404_not_found/CoT-Reasoning-without-Prompting</guid>
            <pubDate>Fri, 23 Feb 2024 12:20:30 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLMs&#39; reasoning capabilities are elicited by prompting techniques</p>
<ul>
<li><p>Few shot prompting with intermediate steps augmented demonstration exemplars</p>
</li>
<li><p>Zero shot prompting with specific instructions to show intermediate steps</p>
</li>
</ul>
</li>
<li><p>Can LLMs reason effectively without prompting?</p>
<ul>
<li><p>there exists a task-agnostic way to elicit CoT reasoning by altering the <strong>decoding procedure</strong>
<img src="https://velog.velcdn.com/images/0404_not_found/post/5dddb2a7-893a-4578-80f9-aa451adf4b2b/image.png" alt=""></p>
</li>
<li><p>the LLM generates a wrong answer via standard greedy decoding, but inspecting the alternative top-k tokens unveils inherent CoT paths</p>
<ul>
<li><p>Use standard QA format</p>
</li>
<li><p>LLMs struggle with reasoning when relying solely on greedily decoded paths</p>
</li>
<li><p>CoT reasoning patterns <strong>emerge naturally</strong> within the alternative paths among the top-k tokens</p>
</li>
<li><p>when CoT path is present, the model demonstrates increased confidence in the final answer</p>
</li>
<li><p><strong>CoT-decoding</strong> : a method to sift through the top-k paths by isolating the most reliable paths</p>
</li>
</ul>
</li>
<li><p>CoT decoding elicits reasoning capabilities without explicit prompting</p>
<ul>
<li><p>enhances the model&#39;s reasoning capabilities</p>
</li>
<li><p>paths are more prevalent in tasks frequently represented in the pre-training data and less so in complex, synthetic tasks $\rightarrow$ prompting is still needed there</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>Summarized Contributions</p>
<ul>
<li>LLMs inherently possess reasoning capabilities</li>
<li>they generate CoT reasoning when examining alternative top tokens</li>
<li>a mere change in decoding strategy effectively elicits model reasoning</li>
<li>LLM&#39;s confidence in its final answers increases when CoT is in its decoding path</li>
<li>CoT decoding to select more reliable decoding paths</li>
</ul>
</li>
</ul>
<h1 id="2-cot-decoding">2. CoT Decoding</h1>
<h2 id="21-the-presence-of-cot-paths-during-decoding">2.1 The presence of CoT Paths during Decoding</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c574c59a-04bf-4673-a30c-5a29c3fb1945/image.png" alt=""></p>
<ul>
<li><p>$k$ represents the choice of the $k$-th token at the first decoding step</p>
</li>
<li><p>PaLM-2 Large model example</p>
</li>
<li><p>The greedy decoding often doesn&#39;t contain CoT </p>
<ul>
<li><p>model&#39;s skewed perception of problem difficulty</p>
</li>
<li><p>pretrained on simpler questions</p>
</li>
</ul>
</li>
<li><p>direct answer prompts generally result in low accuracy</p>
</li>
</ul>
<h2 id="22-cot-decoding-for-extracting-cot-paths">2.2 CoT-Decoding for Extracting CoT Paths</h2>
<ul>
<li><p>Extracting CoT paths from the top-$k$ decoded paths is an issue</p>
<ul>
<li><p>CoT Paths don&#39;t consistently outrank non-CoT in the model&#39;s probability assessment</p>
</li>
<li><p>they often don&#39;t represent the predominant answer among all paths $\rightarrow$ Self-consistency is not applicable</p>
</li>
</ul>
</li>
<li><p>the presence of CoT path typically leads to a more confident decoding of the final answer</p>
<ul>
<li><p>characterized by a probability disparity between the top and secondary tokens</p>
</li>
<li><p>$\Delta_{k, \text{answer} } = {1 \over n} \sum_{x_t \in \text{answer}} p(x_t^1 \ | \ x_{&lt;t}) - p(x_t^2 \ | \ x_{&lt;t})$</p>
</li>
<li><p>$x_t^1$, $x_t^2$ means the top two tokens at each decoding step $t$ in the $k$-th decoding path chosen for their maximum post-softmax probabilities from the vocab</p>
</li>
<li><p>Overall confidence in decoding the final answer is approximated by averaging the probability differences for all relevant $x_t$ tokens</p>
<ul>
<li>For the GSM8K question in Table 1, average the probability differences for &#39;6&#39; and &#39;0&#39;</li>
</ul>
</li>
<li><p>This is called CoT-decoding and aims to extract CoT paths (a sketch of the $\Delta$ computation follows this list)</p>
</li>
<li><p>CoT path shows high $\Delta$ value</p>
</li>
</ul>
</li>
<li><p>Additional heuristic about the length of the answer</p>
<ul>
<li><p>longer decoding paths more likely contain CoT</p>
</li>
<li><p>general applicability is limited</p>
</li>
<li><p>Normalizing the probability score by length $\rightarrow$ introduces a length bias (when the decoding paths are of similar lengths, its effectiveness diminishes)</p>
</li>
</ul>
</li>
</ul>
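<p>A small sketch of the $\Delta$ computation above; the per-step top-2 probabilities are assumed to be collected while decoding the $k$-th path:</p>
<pre><code>def answer_confidence(step_top2_probs):
    """Delta_{k, answer}: average gap between the top-1 and top-2 token
    probabilities over the answer tokens of one decoding path.
    step_top2_probs is a list of (p_top1, p_top2) pairs, one per answer token."""
    return sum(p1 - p2 for p1, p2 in step_top2_probs) / len(step_top2_probs)

# e.g. for the answer "60" in Table 1: average the gaps at the tokens "6" and "0"
delta = answer_confidence([(0.90, 0.05), (0.95, 0.02)])   # 0.89
</code></pre>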
<h4 id="identifying-the-answer-spans">Identifying the answer spans</h4>
<ul>
<li><p>for math tasks, one can extract the last numerical value</p>
<ul>
<li>less precise when there are distractive numbers/options and open-ended responses</li>
</ul>
</li>
<li><p>extending the model&#39;s output with the prompt &quot;So the answer is&quot;    </p>
<ul>
<li><p>only token ids are needed</p>
</li>
<li><p>suitable for encompassing mathematical and natural language reasoning</p>
</li>
<li><p>crucial to calculate $\Delta$ over the answer spans from the original decoding path, not those following &quot;So the answer is&quot;</p>
</li>
</ul>
</li>
<li><p>When answer is more open-ended, modify the $\Delta$ calculation</p>
<ul>
<li><p>If the options are defined, aggregating the probability mass over &quot;yes&quot; and compute the probability differences between the aggregated mass on &quot;yes&quot; and &quot;no&quot;</p>
</li>
<li><p>addressing this limitation is left for further research</p>
</li>
</ul>
</li>
</ul>
<h4 id="branching-at-other-decoding-steps">Branching at other decoding steps</h4>
<ul>
<li><p>Is branching viable at later decoding stages?</p>
</li>
<li><p>Early branching significantly enhances the diversity of potential paths</p>
</li>
<li><p>The optimal branching point may vary with the task</p>
<ul>
<li>for year parity task, mid-path branching can effectively yield correct CoT paths</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/97d695f4-a161-4c73-a4be-bd5941a2b86d/image.png" alt=""></p>
<h4 id="aggregation-of-the-decoding-paths">Aggregation of the decoding paths</h4>
<ul>
<li><p>Aggregate the answers over all those paths like self-consistency without CoT prompting</p>
<ul>
<li>to mitigate sensitivity to small differences in the model&#39;s logits particularly when relying solely on the path with the maximum $\Delta$</li>
</ul>
</li>
<li><p>Majority answer may not be correct</p>
</li>
<li><p>weighted aggregation method</p>
<ul>
<li><p>take the answer that maximizes $\tilde{\Delta}_a = \sum_k \Delta_{k, a}$</p>
</li>
<li><p>$\Delta_{k, a}$ means the $k$-th decoding path whose answer is $a$</p>
</li>
<li><p>this enhances the stability of the results</p>
</li>
</ul>
</li>
</ul>
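<p>The weighted aggregation above is a one-liner over the per-path (answer, $\Delta$) pairs:</p>
<pre><code>from collections import defaultdict

def aggregate_answers(paths):
    """Pick the answer a maximizing sum_k Delta_{k, a} over the top-k paths."""
    score = defaultdict(float)
    for answer, delta in paths:
        score[answer] += delta
    return max(score, key=score.get)

best = aggregate_answers([("60", 0.89), ("5", 0.31), ("60", 0.75)])   # "60"
</code></pre>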
<h4 id="sampling-under-the-standard-qa-format">Sampling under the standard QA format</h4>
<ul>
<li><p>Can sampling achieve a similar effect and unveil the CoT reasoning paths?</p>
</li>
<li><p>although sampling works well under few-shot CoT prompting, it doesn&#39;t exhibit the desired behaviour when the model is queried with the standard QA format</p>
</li>
<li><p>less than 30% of the sampled responses contain a correct CoT path</p>
</li>
<li><p>the model tends to provide a direct answer, since the first token is sampled from the model&#39;s probability distribution, which reflects its tendency to answer directly</p>
</li>
<li><p>the rest of the tokens lead to incorrect final answers</p>
</li>
</ul>
<h1 id="3-experiments">3. Experiments</h1>
<ul>
<li>Used standard QA format (Q: (question)\nA:)</li>
<li>$k=10$ as default</li>
<li>PaLM-2 with different scales</li>
<li>Mistral-7B</li>
<li>last numerical value or the available options for Mistral</li>
<li>extend the output with &quot;So the answer is&quot; for PaLM-2</li>
</ul>
<h2 id="31-mathematical-reasoning-tasks">3.1 Mathematical Reasoning Tasks</h2>
<ul>
<li>GSM8K (grade-school math problems)</li>
<li>MultiArith (multi-step arithmetic dataset)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e02d95a8-e849-4606-b075-71e6c1a00ab1/image.png" alt=""></p>
<ul>
<li><p>CoT Decoding significantly enhances models&#39; reasoning ability</p>
</li>
<li><p>CoT Decoding partially closes the gap between the pre-trained model and instruction-tuned model</p>
</li>
<li><p>Instruction Tuning with sufficient CoT data can also be partially achieved by CoT Decoding</p>
</li>
<li><p>As instruction-tuning data contains CoT annotations, the model is expected to inherently generate CoT paths</p>
</li>
<li><p>Even after instruction-tuning, the model occasionally attempts to directly address a question</p>
</li>
</ul>
<h4 id="scaling-results-and-choice-of-k">Scaling results and choice of $k$</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1ead9dae-ac90-4e45-adc2-81737586d666/image.png" alt=""></p>
<ul>
<li><p>higher $k$ typically result in improved model performance</p>
<ul>
<li>correct CoT paths are often ranked lower</li>
</ul>
</li>
<li><p>for IT models, the effect of $k$ is not significant</p>
<ul>
<li>instruction-tuning brings forth the majority of CoT-paths to the first few paths</li>
</ul>
</li>
</ul>
<h2 id="32-natural-language-reasoning-tasks">3.2 Natural Language Reasoning Tasks</h2>
<ul>
<li><p>year parity task : Was (person) born in an even or odd year?</p>
</li>
<li><p>Even SoTA models like GPT-4 achieves at-chance accuracy (~50%) when prompted directly</p>
<ul>
<li><p>SoTA LLMs can perfectly retrieve the birth year or judge the parity when given the correct year</p>
</li>
<li><p>the limitation lies in the model&#39;s ability in knowledge manipulation</p>
</li>
</ul>
</li>
<li><p>100 celeb names and their birth years</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0e2ba919-5917-48a0-8297-4b187060b9d1/image.png" alt=""></p>
<ul>
<li><p>When the model is small, it becomes incapable of determining the parity even when given the correct year</p>
<ul>
<li>the performance doesn&#39;t vary significantly for model sizes below &quot;Small&quot; size</li>
</ul>
</li>
</ul>
<h2 id="33-symbolic-reasoning-tasks">3.3 Symbolic Reasoning Tasks</h2>
<ul>
<li><p>Coin Flip with 2, 3, 4 rounds of potential flip</p>
</li>
<li><p>two tasks from Big-Bench-Hard</p>
</li>
<li><p>Web of lies with 3, 4, 5 truth/lie statements</p>
</li>
<li><p>Multi-step arithmetic with various depths and lengths (generated)</p>
</li>
<li><p>existing dataset from (Suzgun et al., 2022)</p>
</li>
<li><p>Sports understanding and Object Counting from Big-Bench</p>
</li>
</ul>
<h4 id="the-presense-of-correct-cot-paths-depends-on-the-taks-prominence-in-the-pre-training-distribution">The presense of correct CoT paths depends on the taks prominence in the pre-training distribution</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8e43cc2c-9f73-4cca-b292-cb8c57ae6f6e/image.png" alt=""></p>
<ul>
<li><p>The gain from CoT-decoding decreases as task complexity increases</p>
</li>
<li><p>When the task is highly synthetic, the model cannot generate correct CoT paths</p>
<ul>
<li><p>tasks that lack significant representation in the pre-training distribution</p>
</li>
<li><p>tasks that require accurate state tracking (Coin-Flip and Web-of-Lies) $\rightarrow$ the model easily loses track of the states as the task becomes more complex</p>
</li>
<li><p>Multi-step Arithmetic and Object counting</p>
</li>
<li><p>CoT prompting based techniques can &#39;teach&#39; how to solve tasks like above</p>
</li>
</ul>
</li>
</ul>
<h4 id="compared-to-cot-prompting">Compared to CoT Prompting</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f3e4ce8f-6774-4aba-9bfd-43da2719a299/image.png" alt=""></p>
<ul>
<li><p>the aggregated path approach significantly improves accuracy compared to taking only the maximum-$\Delta$ path</p>
</li>
<li><p>the aggregated path results in a similar performance to few-shot CoT</p>
<ul>
<li>model possesses intrinsic abilities in solving this task effectively</li>
</ul>
</li>
<li><p>CoT prompting effectively promotes the intrinsic CoT path to the top-1 path</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/24ea1a1f-9697-43ab-b07f-678801c61607/image.png" alt=""></p>
<ul>
<li><p>CoT Decoding exhibits a more &#39;free-form&#39; generation in comparison to alternative CoT prompting</p>
<ul>
<li><p>encourage the diversity at the initial decoding step</p>
</li>
<li><p>absence of explicit constraints imposed by prompt</p>
</li>
</ul>
</li>
<li><p>CoT-decoding can reveal the LLM&#39;s intrinsic strategy for solving a problem, without being influenced by prompts</p>
<ul>
<li><p>Few shot CoT follows the standard method of solving this task (profession - evaluation)</p>
</li>
<li><p>it is influenced by the few-shot prompt</p>
</li>
</ul>
</li>
</ul>
<h2 id="34-results-across-model-families">3.4 Results across Model Families</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/968743c6-67c5-45e3-a4d3-652e0f86633e/image.png" alt=""></p>
<ul>
<li>CoT path emerges too</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/32a9e6fe-355e-49e8-a5e7-77050e59b3d1/image.png" alt=""></p>
<ul>
<li>consistent improvements across model families</li>
</ul>
<h1 id="4-conclusion">4. Conclusion</h1>
<ul>
<li><p>inherent capabilities of LLMs in generating CoT paths</p>
</li>
<li><p>exploring alternative top-k tokens reveals the natural existence of reasoning paths</p>
</li>
<li><p>presence of a CoT path correlates with increased model confidence in decoding its final answer</p>
</li>
<li><p>additional computational costs</p>
<ul>
<li>future work may leverage the CoT paths to fine-tune the model</li>
</ul>
</li>
<li><p>focused on branching at the first token</p>
<ul>
<li><p>one can explore branching at any token and find best possible paths</p>
</li>
<li><p>how to reliably identify the best token during the search</p>
</li>
</ul>
</li>
</ul>
<h1 id="5-comment">5. Comment</h1>
<p>A very inventive idea: rather than the single highest-probability token, there may be a path among the remaining top-k tokens where CoT emerges naturally and leads to the correct answer. Setting the computational cost aside, this felt like one of the most original ideas recently. A paper that makes you reconsider whether greedy decoding is always the right choice.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Self-Discover: LLMs Self-Compose Reasoning Structure]]></title>
            <link>https://velog.io/@0404_not_found/Self-Discover-LLMs-Self-Compose-Reasoning-Structure</link>
            <guid>https://velog.io/@0404_not_found/Self-Discover-LLMs-Self-Compose-Reasoning-Structure</guid>
            <pubDate>Fri, 16 Feb 2024 12:48:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>To enhance LLMs&#39; capability to reason and solve complex problems via prompting</p>
<ul>
<li><p>Few-shot &amp; Zero-shot CoT $\rightarrow$ how humans solve problems step-by-step</p>
</li>
<li><p>decomposition-based prompting $\rightarrow$ how humans break down problems into subproblems</p>
</li>
<li><p>step-back prompting $\rightarrow$ how humans reflect on the nature of a task to derive general principles</p>
</li>
</ul>
</li>
<li><p>Each method serves as an atomic reasoning module making <strong>an implicit prior assumption of the process</strong> on how to tackle a given task</p>
</li>
<li><p>Instead, each task has a unique intrinsic structure underlying the reasoning process involved in solving it efficiently</p>
<ul>
<li>Least-to-Most Prompting is more effective than CoT at <strong>symbolic manipulation and compositional generalization</strong> due to the decomposition structure of the tasks</li>
</ul>
</li>
<li><p><strong>Self-Discover</strong> $\rightarrow$ how humans devise a reasoning program for problem-solving</p>
<ul>
<li><p>Aims to discover the underlying reasoning structure of each task
<img src="https://velog.velcdn.com/images/0404_not_found/post/58d9a151-561b-4242-b95f-3ec88e4ee358/image.png" alt=""></p>
</li>
<li><p>It composes a coherent reasoning structure intrinsic to the task (Stage 1)</p>
<ul>
<li>Operates at Task Level</li>
<li>uses three actions to guide LLM to generate a reasoning structure for the task</li>
</ul>
</li>
<li><p>Solves instances of the task using the discovered structure (Stage 2)</p>
<ul>
<li>LLM simply follows the self-discovered structure to get the final answer</li>
</ul>
</li>
</ul>
</li>
<li><p>Self-Discover helps to use multiple atomic reasoning modules like CoT</p>
</li>
<li><p>It only needs 3 more inference steps on the task-level (more performant than inference-heavy ensemble approaches like self-consistency)</p>
</li>
<li><p>It conveys LLMs&#39; insights about the task in a more interpretable way</p>
</li>
<li><p>Tested on 25 challenging reasoning tasks, it outperformed the baselines on 21/25</p>
<p>  <img src="https://velog.velcdn.com/images/0404_not_found/post/c1c8d352-f396-4a2a-9799-02359993a80f/image.png" alt=""></p>
<ul>
<li>It achieves superior performance against inference-heavy methods like CoT + Self-Consistency and majority voting</li>
</ul>
</li>
<li><p>Compared Self-Discover with prompts optimized using a training set</p>
<ul>
<li>Performed on par with or better than OPRO</li>
</ul>
</li>
<li><p>Analyzed its effectiveness by breaking down BBH task into 4 categories</p>
<ul>
<li>Self-Discover worked best on tasks requiring world knowledge</li>
<li>it has a moderate performance boost on algorithmic tasks compared to CoT</li>
</ul>
</li>
<li><p>Error analysis on MATH</p>
<ul>
<li>the majority of failures come from computation errors</li>
<li>showed the universality of the reasoning structures by transferring them from PaLM 2 to GPT-4 and from GPT-4 to Llama-2-70B</li>
</ul>
</li>
</ul>
<h1 id="2-self-discovering-reasoning-structures-for-problem-solving">2. Self-Discovering Reasoning Structures for Problem-Solving</h1>
<ul>
<li>How humans use prior knowledge and skills to devise a reasoning program</li>
</ul>
<pre><code>- Search internally for what knowledge and skills might be helpful to solve it

- Attempt to apply relevant knowledge and skills to the task

- Finally, connect multiple skills and pieces of knowledge</code></pre><ul>
<li><p>Given a <strong>task</strong> and a <strong>set of reasoning module descriptions representing high-level problem-solving heuristics</strong> (&quot;Use critical thinking&quot;, &quot;Let&#39;s think step by step&quot;), Stage 1 aims to uncover the intrinsic reasoning structure via meta-reasoning</p>
<ul>
<li><p>Three meta-prompt to guide LLM to select, adapt, implement an actionable reasoning structure without labels or training</p>
</li>
<li><p>Formatted the structure in <strong>key-value pairs</strong> like JSON due to interpretability and performance</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/27f0fab3-fdaf-4631-958b-0ea5f6cf29bb/image.png" alt=""></p>
</li>
<li><p>this operates at the <strong>Task Level</strong>, so this stage is only needed once per task</p>
</li>
<li><p>Use discovered reasoning structure to solve every instance of task</p>
<ul>
<li>Follow the step-by-step reasoning plan in JSON to correctly solve the task. Fill in the values following the keys by reasoning specifically about the task given. Do not simply rephrase the keys</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="21-stage-1--self-discover-task-specific-structures">2.1 Stage 1 : Self-Discover Task-Specific Structures</h2>
<h4 id="select">SELECT</h4>
<ul>
<li><p>Not every reasoning module is helpful</p>
</li>
<li><p>Guide LLM to select module based on task example</p>
</li>
<li><p>given the raw set of reasoning modules $D$ and a few task examples without labels $t_i \in T$, Self-Discover selects a subset of reasoning modules $D_S$ using a model $\mathcal{M}$ and a meta-prompt $p_S$</p>
<ul>
<li>$D_S = \mathcal{M}(p_S \ || \ D \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h4 id="adapt">ADAPT</h4>
<ul>
<li><p>Each reasoning module provides a general description of how to solve problems</p>
</li>
<li><p>Self-Discover aims to tailor each module</p>
<ul>
<li>&quot;break the problem into subproblems&quot; $\rightarrow$ &quot;calculate each arithmetic operation in order&quot; for arithmetic problems</li>
</ul>
</li>
<li><p>given $D_S$ and meta-prompt $p_A$, the model generates the adapted reasoning module descriptions $D_A$</p>
<ul>
<li>$D_A = \mathcal{M}(p_A \ || \ D_S \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h4 id="implement">IMPLEMENT</h4>
<ul>
<li><p>Given the adapted reasoning module descriptions $D_A$, it turns the reasoning modules into an implemented reasoning structure $D_I$ with specific instructions on what to generate for each step</p>
</li>
<li><p>Provide a human-written reasoning structure $S_{\text{human}}$ on another task in addition to meta prompt to better convert the natural language descriptions into a reasoning structure</p>
<ul>
<li>$D_I = \mathcal{M}(p_I \ || \ S_{\text{human}} \ || \ D_A \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h2 id="22-stage-2--tackle-tasks-using-discovered-structures">2.2 Stage 2 : Tackle Tasks Using Discovered Structures</h2>
<ul>
<li>After those stages, use $D_I$, which is uniquely adapted to the task, to solve every instance of $T$ (a hypothetical sketch of the two-stage pipeline follows the figure below)</li>
<li>$A = \mathcal{M}(D_I \ || \ t), \quad \forall t \in T$</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/bc082359-7e12-4b4b-94ca-f800e14a0e07/image.png" alt=""></p>
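<p>A hypothetical sketch of the two-stage pipeline; <code>call_llm</code>, the meta-prompts, and the concatenation format are assumptions standing in for the paper&#39;s actual prompts:</p>
<pre><code>def call_llm(prompt):
    raise NotImplementedError("plug in an LLM API here")

def self_discover(task_examples, seed_modules, p_select, p_adapt, p_implement, s_human):
    # Stage 1 (task level, run once per task)
    d_s = call_llm("\n".join([p_select, seed_modules, task_examples]))      # SELECT
    d_a = call_llm("\n".join([p_adapt, d_s, task_examples]))                # ADAPT
    d_i = call_llm("\n".join([p_implement, s_human, d_a, task_examples]))   # IMPLEMENT
    return d_i                                  # key-value (JSON-like) reasoning structure

def solve(d_i, instance):
    # Stage 2 (instance level): follow the discovered structure to answer
    return call_llm("\n".join([d_i, instance]))
</code></pre>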
<h1 id="3-experiment-setup">3. Experiment Setup</h1>
<h4 id="tasks">Tasks</h4>
<ul>
<li><p>Used diverse reasoning benchmarks challenging for LLMs</p>
<ul>
<li><p>Big-Bench Hard (23 challenging tasks from Big-Bench)</p>
<ul>
<li>Algorithmic and Multi-Step Arithmetic Reasoning</li>
<li>NLU</li>
<li>Use of World Knowledge</li>
<li>Multilingual Knowledge and Reasoning</li>
</ul>
</li>
<li><p>Thinking for Doing (T4D)</p>
<ul>
<li>models must leverage mental state reasoning to determine actions to perform (GPT-4 + CoT reached only 50%)</li>
</ul>
</li>
<li><p>MATH test set (200 samples)</p>
</li>
</ul>
</li>
</ul>
<h4 id="models">Models</h4>
<ul>
<li>GPT-4 (gpt-4-turbo)</li>
<li>GPT-3.5 (chatGPT, gpt-3.5-turbo)</li>
<li>instruction tuned PaLM2-L</li>
<li>Llama2-70B</li>
</ul>
<h4 id="baselines">Baselines</h4>
<ul>
<li><p>Zero-shot prompting</p>
<ul>
<li>Direct Prompting</li>
<li>CoT</li>
<li>Plan-and-Solve (first generate a plan, then solve the problem)</li>
</ul>
</li>
<li><p>use the raw seed reasoning modules passed to Self-Discover</p>
<ul>
<li>CoT-Self-Consistency (sample multiple outputs with CoT and aggregate answer)</li>
<li>Majority voting of each RM</li>
<li>Best of each RM (uses the highest accuracy from each RM)</li>
</ul>
</li>
<li><p>To test the universality of the reasoning structure, compare with prompt optimization that requires a training set (OPRO)</p>
<ul>
<li>showing that when structures or prompts optimized on one model are applied to another, the reasoning structure retains more of the performance</li>
</ul>
</li>
</ul>
<h1 id="4-results">4. Results</h1>
<h2 id="41-does-self-discover-improve-llm-reasoning">4.1 Does Self-Discover Improve LLM Reasoning?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a95f4c0c-6887-4b84-8d34-9235308f51a9/image.png" alt=""></p>
<ul>
<li>For MATH, upon error analysis, the reasoning structures generated by PaLM 2-L with Self-Discover are correct 87.5% of the time (a human can follow the structures to solve the tasks perfectly)</li>
</ul>
<h2 id="42-which-types-of-problems-do-self-discover-help-the-most">4.2 Which Types of Problems Do Self-Discover Help the Most?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/366baa2a-44c8-4b8b-b0a9-e8e8e80e53e5/image.png" alt=""></p>
<ul>
<li><p>Self Discover improved the performance of World Knowledge task the most (sports understanding, movie recommendation, ruin names)</p>
</li>
<li><p>Using CoT misses the key knowledge</p>
</li>
<li><p>The algorithmic category&#39;s gain is moderate, which is consistent with the MATH result from 4.1</p>
</li>
</ul>
<h2 id="43-how-efficient-is-self-discover">4.3 How Efficient is Self-Discover?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ae51013a-4a6c-4d2e-bbd9-9855a4899b7b/image.png" alt=""></p>
<ul>
<li>Self-Discover achieves the best performance while requiring 10-40x fewer inference calls compared to Self-Consistency and majority voting</li>
</ul>
<h2 id="44-qualitative-examples">4.4 Qualitative Examples</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5bfa8594-b79f-4ab4-845c-dfd7964ee269/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/bbf108d8-28e2-4eaf-aad9-1279f8c630b9/image.png" alt=""></p>
<h1 id="5-deep-diving-into-self-discovered-reasoning-structures">5. Deep Diving Into Self-Discovered Reasoning Structures</h1>
<ul>
<li><strong>All actions of Self-Discover are needed</strong></li>
</ul>
<h2 id="51-importance-of-self-discover-actions">5.1 Importance of Self-Discover Actions</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1eae3e46-6866-42f2-a19f-5adde67c1c91/image.png" alt=""></p>
<ul>
<li>S for only SELECT / SA for SELECT and ADAPT</li>
<li>With each added step, the model&#39;s zero-shot reasoning capability improved $\rightarrow$ all three steps are beneficial</li>
</ul>
<h2 id="52-towards-universality-of-discovered-reasoning-structure">5.2 Towards Universality of Discovered Reasoning Structure</h2>
<h4 id="applying-palm-2-l-discovered-structures-to-gpt-4">Applying PaLM 2-L Discovered Structures to GPT-4</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/543008d7-0c28-4bef-9e58-6cbfbbb6602f/image.png" alt=""></p>
<h4 id="applying-gpt-4-discovered-structures-to-llama2-and-chatgpt">Applying GPT-4 Discovered Structures to Llama2 and ChatGPT</h4>
<ul>
<li>Llama2 + Self-Discover (52%) &gt; CoT (42%) on zero-shot disambiguation QA</li>
<li>GPT-3.5 (56%) &gt; CoT (51%) on geometry with 3-shot</li>
</ul>
<h1 id="6-related-work">6. Related Work</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/13661cc1-0066-4509-8bfc-d01c1286e1a4/image.png" alt=""></p>
<h4 id="opro-framework-llms-as-optimizers-yang-et-al-2023">OPRO Framework (LLMs as optimizers, Yang et al., 2023)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/afc45db7-cfe2-4439-858e-28d1c6f5dfab/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/8131d37a-8818-404e-a338-52d5411d72b8/image.png" alt=""></p>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>Self-Discover a reasoning structure for any task</li>
<li>Drastic improvements on challenging tasks</li>
<li>the composed reasoning structure is transferable</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<p>A pipeline that leaves even the structure of how to solve the problem to the LLM&#39;s own judgment. The fact that the gains show up on large models is the impressive part.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Repeat After Me:
Transformers are Better than State Space Models at Copying]]></title>
            <link>https://velog.io/@0404_not_found/Repeat-After-MeTransformers-are-Better-than-State-Space-Models-at-Copying</link>
            <guid>https://velog.io/@0404_not_found/Repeat-After-MeTransformers-are-Better-than-State-Space-Models-at-Copying</guid>
            <pubDate>Wed, 07 Feb 2024 08:20:09 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4acb6292-9ba5-44da-a30c-be920647175c/image.png" alt="">
..?</p>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Transformers require $\Omega(L)$ memory and compute to predict the next token of a sequence of length $L$ (using Flash Attention!)</p>
</li>
<li><p>Attempts to make similar architectures but with $O(1)$ memory to predict each token $\rightarrow$ S4 or Mamba / RNNs / models that can be trained in parallel like linear attention / parallel RNNs</p>
<ul>
<li>Collectively refer to these models as <strong>GSSMs (Generalized State Space Models)</strong> </li>
</ul>
</li>
<li><p>Recent work shows GSSMs&#39; strong performance, but it is not clear what these models sacrifice for efficiency</p>
<ul>
<li>One particular capability that is sacrificed is the ability to <strong>retrieve and repeat parts of the input context</strong></li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ba729e11-c72b-4032-ba20-2807f35c65b2/image.png" alt=""></p>
<ul>
<li><p>Theoretical analysis of the copying task</p>
<ul>
<li>Transformer can copy strings of length that is exponential in the number of heads of the transformer</li>
<li>Transformer implements a &#39;storage&#39; mechanism and retrieval of sequences of n-grams</li>
<li>GSSMs cannot accurately copy strings with more bits than the size of the latent state</li>
</ul>
</li>
<li><p>In practice, large GSSM may have enough capacity to represent the entire input in the latent state</p>
<ul>
<li>Transformers are both much more efficient at learning to copy and to generalize better to longer inputs</li>
<li>Copy algorithms learned by Transformers use n-grams to determine where to copy from</li>
</ul>
</li>
</ul>
<h1 id="2-theory-representational-capaciy">2. Theory: Representational Capaciy</h1>
<h2 id="21-setting">2.1 Setting</h2>
<ul>
<li><p>dictionary $\mathbb{D}$ which contains $D$ alphabet tokens</p>
</li>
<li><p>seq2seq model $H : \mathbb{D}^* \rightarrow \mathbb{D}^*$</p>
<ul>
<li>input $x_1, x_2, ... x_i$ as the prompt</li>
<li>$H(x_1, x_2, ... x_i)$ as the generated &#39;answer&#39;</li>
</ul>
</li>
<li><p>sequence to token model $h : \mathbb{D}^* \rightarrow \mathbb{D}$</p>
<ul>
<li>it naturally defines $H$ by autoregressive inference</li>
<li>for every input sequence $x_1, ... ,x_i \in \mathbb{D}$, define $x_{i+j} = h(x_1, ... ,x_{i+j-1})$ recursively and let $H(x_1, ... ,x_i) = (x_{i+1}, x_{i+2}, ... )$</li>
</ul>
</li>
</ul>
<h4 id="gssm">GSSM</h4>
<ul>
<li><p>Finite set $\mathcal{S}$ is a state space</p>
</li>
<li><p>the number of bits required to encode the states of $\mathcal{S}$ as $\text{mem}(\mathcal{S}) = \log(|\mathcal{S}|)$</p>
</li>
<li><p>GSSM is a sequence model defined by an update rule $u : \mathcal{S} \times \mathbb{D} \rightarrow \mathcal{S}$ and some output function $r : \mathcal{S} \rightarrow \mathbb{D}$</p>
<ul>
<li>Let $s_o \in \mathcal{S}$ be some initial state</li>
<li>Given sequence $x_1, ..., x_L$, the state of model at iteration $i$ is denoted by $S_i(x_1, ..., x_i)$</li>
<li>the output token is denoted by $R_i(x_1, ..., x_i)$</li>
<li>The recursive process is
$$
\begin{aligned} 
&amp;1)\quad S_0(\emptyset) = s_0 \\
&amp;2) \quad S_i(x_1, ... ,x_i) = u(S_{i-1}(x_1, ..., x_{i-1}), x_i) \\
&amp;3) \quad R_i(x_1, ..., x_i) = r(S_i(x_1, ..., x_i))
\end{aligned}
$$</li>
</ul>
</li>
<li><p>Note that for any sequence model, there are two types of memory considerations</p>
<ul>
<li>Input-Independent Memory - parameters</li>
<li>Input-Dependent Memory - activations</li>
</ul>
</li>
<li><p>The GSSM definition constrains the input-dependent memory $\text{mem}(\mathcal{S})$</p>
</li>
<li><p>It doesn&#39;t restrict in any way the amount of input-independent memory or the runtime of state updates</p>
</li>
<li><p>Leaving all other considerations unconstrained shows the lower bound on the state space memory</p>
</li>
</ul>
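<p>The GSSM recursion above in a few lines; <code>u</code>, <code>r</code>, and <code>s0</code> are stand-ins for a concrete model:</p>
<pre><code>def run_gssm(x, u, r, s0):
    """s_i = u(s_{i-1}, x_i), output R_i = r(s_i); the only input-dependent
    memory is the current state s, which is the quantity mem(S) bounds."""
    s, outputs = s0, []
    for token in x:
        s = u(s, token)
        outputs.append(r(s))
    return outputs

# Toy instance whose state is just the last token seen (so mem(S) = log D)
out = run_gssm("abcde", u=lambda s, t: t, r=lambda s: s, s0=None)
</code></pre>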
<h4 id="transformers">Transformers</h4>
<ul>
<li><p>input length $L$</p>
</li>
<li><p>dimension $d$</p>
</li>
<li><p>input tokens $\boldsymbol{x}_1, ..., \boldsymbol{x}_L \in \mathbb{R}^d$</p>
</li>
<li><p>an attention head is parametrized as $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$</p>
</li>
<li><p>$\boldsymbol{k}_i = W_k \boldsymbol{x}_i, \quad \boldsymbol{q}_i = W_q \boldsymbol{x}_i, \quad \boldsymbol{v}_i = W_v \boldsymbol{x}_i$</p>
</li>
<li><p>$K_i = [\boldsymbol{k}_1, ..., \boldsymbol{k}_i] \in \mathbb{R}^{d \times i}, \quad V_i = [\boldsymbol{v}_1, ..., \boldsymbol{v}_i] \in \mathbb{R}^{d \times i}$</p>
</li>
<li><p>the output of the head at token $i$ is $\boldsymbol{o}_i = V_i \ \cdot \ \text{softmax}(K_i \cdot \boldsymbol{q}_i) \in \mathbb{R}^d$</p>
</li>
<li><p>with $l$ attention heads, the full dimension should be $dl$</p>
</li>
<li><p>embedding $\Psi : \mathbb{D} \rightarrow \mathbb{R}^d$</p>
</li>
<li><p>MLP $f : \mathbb{R}^{dl} \rightarrow \mathbb{R}^{dl} \ \text{s.t.} \ f(\boldsymbol{x}) = U_1 \sigma (U_2 \boldsymbol{x})$</p>
</li>
<li><p>embedding and MLP are applied at the token level</p>
</li>
<li><p>Attention-block is a set of $l$ heads applied in parallel</p>
</li>
<li><p>transformer-block is an attention-block followed by an MLP on the concatenated output of $l$ heads</p>
</li>
</ul>
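<p>Below is a small numpy sketch of a single causal attention head in the notation above, i.e. $\boldsymbol{o}_i = V_i \cdot \text{softmax}(K_i^\top \boldsymbol{q}_i)$. The dimensions and random weights are illustrative assumptions, not the construction from the paper.</p>
<pre><code class="language-python">import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """Single causal attention head: o_i = V_i . softmax(K_i^T q_i)."""
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T      # rows are q_i, k_i, v_i
    outputs = []
    for i in range(X.shape[0]):
        scores = K[: i + 1] @ Q[i]                  # K_i^T q_i, keys 1..i only (causal)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax
        outputs.append(V[: i + 1].T @ weights)      # V_i . softmax(...)
    return np.stack(outputs)

d, L = 8, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
o = attention_head(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(o.shape)  # (5, 8): one output vector per position
</code></pre>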
<h4 id="the-copy-task">The Copy Task</h4>
<ul>
<li><p>Add two special tokens &lt;BOS&gt; and &lt;COPY&gt; to $\mathbb{D}$</p>
<ul>
<li>$|\mathbb{D}| = D + 2$</li>
</ul>
</li>
<li><p>A length-$L$ copy distribution $\mathcal{D}_L$ over $\mathbb{D}^{L+2}$ generates strings of the form &quot;&lt;BOS&gt;, $x_1, x_2, ..., x_L$, &lt;COPY&gt;&quot; where $\boldsymbol{x} \in (\mathbb{D} \setminus \{ \text{<BOS>}, \text{<COPY>} \})^L$</p>
</li>
<li><p>For some seq2seq model $H$, denote the error of $H$ on a copy distribution 
  $$
  \text{err}_{\mathcal{D}_L}(H) = \underset{\mathcal{D}_L}{\text{Pr}}[H_{1:L}(\text{<BOS>}, \boldsymbol{x}, \text{<COPY>}) \not= \boldsymbol{x}]
  $$</p>
</li>
</ul>
<h2 id="22-transformers-can-copy-inputs-of-exponential-length">2.2 Transformers can copy inputs of exponential length</h2>
<h4 id="construction--hash-based-copying">Construction : Hash-Based Copying</h4>
<ul>
<li><p>Hash sequences of $n$ tokens</p>
</li>
<li><p>At each step of autoregressive decoding, attend to the previous occurrence of the most recent $n$-gram and output the token that followed it (a rough sketch is given after the figures)
<img src="https://velog.velcdn.com/images/0404_not_found/post/1c8112ca-1084-42e8-be72-e03e18c34c06/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/02d02a93-c673-42d6-b5f7-69bb66a74fe1/image.png" alt=""></p>
</li>
</ul>
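<p>The hashing idea can be sketched in a few lines of Python, independently of the transformer construction itself: record the token that follows every $n$-gram, then repeatedly look up the most recent $n$-gram while generating. The dictionary-based table and the function name are assumptions for illustration only.</p>
<pre><code class="language-python">def ngram_copy(prompt, n=3):
    """Copy the prompt by looking up the most recent n-gram and emitting its successor."""
    table = {}
    for i in range(len(prompt) - n):
        table.setdefault(tuple(prompt[i : i + n]), prompt[i + n])  # first occurrence wins
    out = list(prompt[:n])                  # seed with the first n tokens
    while len(out) &lt; len(prompt):
        key = tuple(out[-n:])               # most recent n-gram
        out.append(table[key])              # correct as long as no n-gram repeats
    return out

s = list("abcdefgabx")
assert ngram_copy(s, n=3) == s
</code></pre>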
<h4 id="positional-embedding-hard-alibi">Positional Embedding: Hard-ALiBi</h4>
<ul>
<li><p>To perform the hashing described in the algorithm, it is necessary to leverage local positional information to define a hash and apply it globally on the entire input $\rightarrow$ use Hard version of ALiBi</p>
</li>
<li><p>Alibi : biases the attention scores with a penalty that is proportional to their distance ($m$ is a head-specific slope fixed before training)</p>
</li>
<li><p>add a bias $b_i$ to the $i$-th attention head </p>
<ul>
<li>$\boldsymbol{o}_i = V_i \ \cdot \ \text{softmax}(K_i^\top \boldsymbol{q}_i + b_i)$</li>
<li>$b_{i, j} = \begin{cases} - \infty \quad &amp;j \le i-m \\ 0 \quad &amp;j &gt; i-m\end{cases}$</li>
<li>Allow different heads to use different $m$, and also allow $m = \infty$ (softmax attention with no PE)
<img src="https://velog.velcdn.com/images/0404_not_found/post/8a5c5810-7324-472a-ad5d-e11dec1db367/image.png" alt=""></li>
</ul>
</li>
</ul>
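<p>A small numpy sketch of how the hard-ALiBi bias could be realized as an additive attention mask, under the masking-window reading above (the helper name and the combination with a causal mask are assumptions):</p>
<pre><code class="language-python">import numpy as np

def hard_alibi_bias(L, m):
    """Additive bias for one head: query i may only attend to keys j with j &gt; i - m.
    m = np.inf recovers plain softmax attention with no positional information."""
    i = np.arange(L)[:, None]      # query positions
    j = np.arange(L)[None, :]      # key positions
    window = np.where(j &lt;= i - m, -np.inf, 0.0)
    causal = np.triu(np.full((L, L), -np.inf), k=1)   # no attention to future tokens
    return window + causal

print(hard_alibi_bias(5, m=2))
</code></pre>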
<h4 id="guarantees">Guarantees</h4>
<ul>
<li>The copy algorithm can perfectly copy the input sequence, as long as there are no repeated $n$-gram patterns in the input</li>
<li>Then the error of the algorithm is
$$
p_{\text{n-gram}}(\mathcal{D}_L) = \underset{\mathcal{D}_L}{\text{Pr}} [\exists \ i \not= j \ \text{s.t.} \ x_i, ..., x_{i+n} = x_j, ..., x_{j+n}]
$$</li>
</ul>
<blockquote>
<h4 id="theorem-23">Theorem 2.3.</h4>
<p>  For all $n$, there exists a depth-2 transformer $\mathcal{T}$ of dimension $O(n \log (D))$ s.t. for all $2n \le L \le D^n$ and for any copy distribution $\mathcal{D}_L$, $\text{err}_{\mathcal{D}_L}(\mathcal{T}) &lt; p_{\text{n-gram}} (\mathcal{D}_L)$</p>
</blockquote>
<ul>
<li>The probability of repeated $n$-grams quickly decays when $n$ increases</li>
<li>For the uniform distribution over sequences, this probability decays <strong>exponentially</strong> with $n$</li>
</ul>
<blockquote>
<h4 id="lemma-24">Lemma 2.4.</h4>
<p>  Let $\mathcal{D}_L$ be the copy distribution generated by sampling $\boldsymbol{x}$ from the uniform distribution over the non-special (alphabet) tokens. Then $p_{\text{n-gram}}(\mathcal{D}_L) &lt; L^2D^{-n}$</p>
</blockquote>
<ul>
<li>By combining those, we get that Transformers can copy sequences of tokens drawn from the uniform distribution using a number of params that depends only logarithmically on the input sequence length</li>
</ul>
<blockquote>
<h4 id="corollary-25">Corollary 2.5.</h4>
<p>  Fix some $\epsilon \in (0, 1/2)$ and some $L \ge \Omega(\log (1/\epsilon))$, there exists a depth-2 Transformer $\mathcal{T}$ of dimension $O(\log(L/\epsilon)\log(D))$ s.t. for the uniform copy distribution $\mathcal{D}_L$, $\text{err}_{\mathcal{D}_L}(\mathcal{T}) &lt; \epsilon$</p>
</blockquote>
<ul>
<li>The construction doesn&#39;t limit the precision of the parameters or activations, but the result also holds for finite-precision transformers using $O(\log(\log(L)))$ bits</li>
</ul>
<h2 id="23-state-space-models-cannot-copy-inputs-beyond-memory-size">2.3 State Space Models cannot copy inputs beyond memory size</h2>
<ul>
<li>GSSMs cannot copy uniform input sequences unless the capacity of their state space grows linearly with the sequence length (To be able to copy, the model needs to store it in state space)</li>
</ul>
<blockquote>
<h4 id="theorem-27">Theorem 2.7.</h4>
<p>Fix some GSSM $H$ over state space $\mathcal{S}$. Then for all $L$, for the uniform copy distribution $\mathcal{D}_L$, the model $H$ has error $\text{err}_{\mathcal{D}_L}(H) &gt; 1 - {|\mathcal{S}| \over {D^L}}$</p>
</blockquote>
<blockquote>
<h4 id="corollary-28">Corollary 2.8.</h4>
<p>Fix some $L$. Then every GSSM $H$ with state space $\mathcal{S}$ s.t. $\text{mem}(\mathcal{S}) &lt; L \log (D) - 1$ has error $\text{err}_{\mathcal{D}_L}(H) &gt; 1/2$ for the uniform copy distribution $\mathcal{D}_L$</p>
</blockquote>
<ul>
<li>The Input-dependent memory of Transformers grows linearly with the sequence length (less memory-efficient than GSSM)</li>
<li>Transformers are almost optimal in terms of input-dependent memory (at least copying)</li>
<li>Thm 2.3. says that there exists a transformer which can copy inputs of length $L$ using $\tilde{O}(L)$ input-dependent memory and it is optimal by Corollary 2.8.</li>
</ul>
<h1 id="3-learning-to-copy">3. Learning to Copy</h1>
<ul>
<li><p>The results above may not be observed in practice</p>
<ul>
<li>It&#39;s not clear that transformers can indeed learn to copy from examples</li>
<li>In practice, a GSSM may use a large latent state memory, so these bounds only hold for very long sequences of tokens (also, it may not learn to do so)</li>
</ul>
</li>
</ul>
<h2 id="31-experimental-setup">3.1. Experimental Setup</h2>
<ul>
<li><p>Transformer and Mamba $\approx$ 160M</p>
</li>
<li><p>LSTM $\approx$ 40M</p>
</li>
<li><p>64 Batch</p>
</li>
<li><p>10 batches of 128 examples for test</p>
</li>
<li><p>token space size is 30, namely $\mathcal{V} = \{a, ..., z, \text{<BOS>}, \text{<EOS>}, \text{<COPY>} \}$</p>
</li>
<li><p>All strings are sampled uniformly (a rough sketch follows after this list)</p>
<ul>
<li>sample the length of the sequence</li>
<li>independently sample each position of the string from $\mathcal{V}$</li>
<li>pack the context with i.i.d. sequences during training</li>
<li>fill the context with multiple independent samples of task</li>
</ul>
</li>
<li><p>Positional Information</p>
<ul>
<li>RoPE</li>
<li>NoPE (No Positional Information)</li>
<li>Hard-ALiBi</li>
</ul>
</li>
</ul>
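<p>A possible sketch of the training-string construction described above (uniform length, i.i.d. uniform characters, contexts packed with independent examples). The exact special tokens and packing details are assumptions based on the description:</p>
<pre><code class="language-python">import random
import string

def sample_copy_example(max_len=300, alphabet=string.ascii_lowercase):
    """One copy-task example: &lt;BOS&gt; x_1..x_L &lt;COPY&gt; x_1..x_L &lt;EOS&gt;."""
    L = random.randint(1, max_len)                      # sample the length
    x = [random.choice(alphabet) for _ in range(L)]     # i.i.d. uniform characters
    return ["&lt;BOS&gt;"] + x + ["&lt;COPY&gt;"] + x + ["&lt;EOS&gt;"]

def pack_context(context_size=1024):
    """Fill a training context with multiple independent copy examples."""
    ctx = []
    while True:
        ex = sample_copy_example(max_len=50)
        if len(ctx) + len(ex) &gt; context_size:
            return ctx
        ctx += ex

ctx = pack_context()
print(len(ctx), ctx[:6])
</code></pre>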
<h2 id="32-data-efficiency-on-the-copy-task">3.2. Data Efficiency on the Copy task</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aec8afc1-ac4a-40e5-9f83-739dccb47bae/image.png" alt=""></p>
<ul>
<li>Model gets an input of $L \le 300$ tokens followed by a separator token</li>
<li>record the string-level accuracy</li>
<li>the sharp transition is due to the log-scaled x-axis and the use of string-level accuracy on the y-axis</li>
<li>String-level Accuracy
<img src="https://velog.velcdn.com/images/0404_not_found/post/0d829471-1b26-4fe6-8f19-2ccaddd98837/image.png" alt=""></li>
<li>Character-level Accuracy 
<img src="https://velog.velcdn.com/images/0404_not_found/post/46a0a4d2-efce-4d44-9256-047ffd16f022/image.png" alt=""></li>
</ul>
<h2 id="33-length-generalization-on-the-copy-task">3.3 Length Generalization on the Copy Task</h2>
<ul>
<li><p>Test to generalize out-of-distribution</p>
</li>
<li><p>Understand which function the model has learned</p>
<ul>
<li>model has truly learned the &quot;correct&quot; copy operation vs it just learned to copy sequences of the particular size it was trained on</li>
</ul>
</li>
<li><p>Trained all models on sequences of $\le 50$ tokens and tested them on sequences of up to 100 tokens (string-level accuracy)</p>
</li>
<li><p>Transformers show better generalization to longer inputs compared to GSSMs</p>
<ul>
<li>GSSMs&#39; performance drops to near zero</li>
<li>ALiBi and NoPE dramatically outperform RoPE</li>
<li>The sinusoidal embedding of RoPE creates a more dramatic change than the decay of ALiBi or NoPE</li>
</ul>
</li>
<li><p>Training with Hard-ALiBi on sequences of length less than 50 gives almost perfect generalization up to 1000 tokens</p>
</li>
</ul>
<h2 id="34-transformers-learn-to-use-n-gram-hashing">3.4. Transformers learn to use n-gram hashing</h2>
<ul>
<li><p>To test whether the transformer uses the n-gram storage-and-retrieval mechanism</p>
</li>
<li><p>Train a Hard-ALiBi Transformer on the copy task with a dataset that contains duplicated n-grams</p>
</li>
<li><p>Draw uniform sequences of tokens and randomly replace some n-gram with another n-gram that already appears in the sequence (each example always has two copies of some n-gram)</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/764ce3db-572e-4e45-a3be-931cab0cdae7/image.png" alt=""></p>
<ul>
<li>It seems Transformer relies on something like 5-gram retrieval to do the copy task</li>
</ul>
<h2 id="35-gssms-cannot-arbitrarily-retrieve-from-context">3.5. GSSMs cannot arbitrarily retrieve from context</h2>
<ul>
<li><p>n-gram lookup task : the model should use a given n-gram as a key to look up the $k$ tokens that follow that key</p>
<ul>
<li>suffix key and prefix key</li>
<li>assess length generalization
<img src="https://velog.velcdn.com/images/0404_not_found/post/f72409cd-0894-4bd6-8136-ce876579da4b/image.png" alt=""></li>
</ul>
</li>
<li><p>Suffix key version</p>
<ul>
<li>given a sequence of $L$ input tokens, a separator, and an n-gram from the input sequence</li>
<li>the model must output the sequence of $k$ tokens following the chosen n-gram</li>
<li>it requires the model to be able to &#39;store&#39; the context to find the correct key</li>
<li>train all models on sequences of at most 30 tokens</li>
<li>Transformers perform well</li>
<li>Transformers learn n-gram storage and retrieval
<img src="https://velog.velcdn.com/images/0404_not_found/post/898bb3bc-6df2-4239-b2b7-57f768d59208/image.png" alt=""></li>
</ul>
</li>
<li><p>Prefix key version</p>
<ul>
<li>provide the n-gram key at the beginning and then the full sequence</li>
<li>the model doesn&#39;t have to store the entire input as it can find the key on the fly</li>
<li>good for GSSMs since they can write the key into the state and then ignore inputs that don&#39;t match</li>
<li>GSSMs achieve almost perfect accuracy (outperforming NoPE and ALiBi, but not Hard-ALiBi)</li>
<li>This may be because positional embeddings make it more difficult to perform the hashing lookup over long distances</li>
<li>GSSMs are memory-limited but effective when the task only requires a summary of the input</li>
</ul>
</li>
</ul>
<h1 id="4-pre-trained-models">4. Pre-trained Models</h1>
<ul>
<li>pretrained Transformer, GSSM</li>
<li>copying long strings, retrieval and few-shot QA</li>
<li>Transformer outperforms GSSM even though the GSSM shows lower PPL</li>
</ul>
<h2 id="41-setup">4.1. Setup</h2>
<ul>
<li><p>Pythia transformer models 410M ~ 2.8B</p>
</li>
<li><p>Mamba with similar size</p>
</li>
<li><p>Pretrained on Pile, used same tokenizer</p>
</li>
<li><p>Copy based task  / Information Retrieval (selective copy)</p>
</li>
<li><p>String-Level Accuracy</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ce412bf9-80a7-4f26-8e4f-d22355cd0fd2/image.png" alt=""></p>
<h2 id="42-copying-the-input-text">4.2. Copying the input text</h2>
<ul>
<li>Transformers &gt; GSSM</li>
<li>Random sample from C4 dataset</li>
<li>two copies of sampled string  + first word of the string $\rightarrow$ complete the third copy</li>
<li>Unlike random strings, natural text can often be compressed, so the model can use less memory to copy it</li>
<li>When the input is more difficult to compress, GSSMs suffer due to their limited state size</li>
</ul>
<h2 id="43-retrieval-from-the-input-context">4.3. Retrieval from the input context</h2>
<ul>
<li><p>Phone-book Lookup</p>
<ul>
<li>provide a synthetic phone-book to the model and ask it to return a phone number</li>
<li>randomly sample $L$ names and phone numbers</li>
<li>two-shot examples followed by a question asking for a phone number</li>
<li>Transformer (410M) &gt; GSSM (2.8B) when $L \ge 70$</li>
</ul>
</li>
<li><p>QA</p>
<ul>
<li>2.8B Mamba and Transformer on SQuAD</li>
<li>provided a single demonstration of a QA pair over the same text</li>
<li>Mamba degrades more quickly as the paragraph length grows</li>
</ul>
</li>
</ul>
<h1 id="5-discussion">5. Discussion</h1>
<ul>
<li><p>Transformer &gt; GSSM at copying from their input text</p>
</li>
<li><p>SSM have many advantages over transformers</p>
<ul>
<li>The memory and computational complexity don&#39;t increase with the input length $\rightarrow$ good for long context</li>
<li>Better at tracking state variables across long sequences to make long consistent text</li>
<li>Similar to Human brain</li>
</ul>
</li>
<li><p>Future work is needed on hybrid architectures of SSMs and attention-like mechanisms to enhance retrieval ability</p>
<ul>
<li>Humans have very limited memory but can translate entire novels if allowed to look back at the text</li>
</ul>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>The title was provocative. The retrieval experiments demonstrate the strength of Transformers, and because of this point, adopting SSMs for text, more than for other domains, does not seem easy.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs]]></title>
            <link>https://velog.io/@0404_not_found/Adaptation-with-Self-Evaluation-to-Improve-Selective-Prediction-in-LLMs</link>
            <guid>https://velog.io/@0404_not_found/Adaptation-with-Self-Evaluation-to-Improve-Selective-Prediction-in-LLMs</guid>
            <pubDate>Thu, 01 Feb 2024 13:16:11 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/50162f85-4ed1-45d2-8677-433375573360/image.png" alt=""></p>
<ul>
<li><p>LLM is not guaranteed to be accurate for all queries</p>
</li>
<li><p>Understanding which queries they are reliable for is important</p>
</li>
<li><p><strong>Selective Prediction</strong> : the deployment scenario for AI where humans are involved to maintain overall accuracy by reviewing <strong>AI-generated, low-confidence outputs</strong></p>
<ul>
<li>Both human and AI performance are considered together to minimize human involvement cost</li>
<li>AI should use Selective Prediction to assess the accuracy of their prediction and refrain from making wrong predictions</li>
<li>Able to say &quot;I don&#39;t know&quot; when its prediction is not confident</li>
</ul>
</li>
<li><p>Selective Prediction is hard as LLM is trained to predict not the &quot;correct&quot; next token but only the &quot;next&quot; token</p>
</li>
<li><p>It also doesn&#39;t generate a confidence score $\rightarrow$ obtaining a confidence score from the output sequence is not straightforward</p>
</li>
<li><p>Distinguishing correctness from likelihood scores is challenging</p>
<ul>
<li>Using a prompt (Is the proposed answer True or False?) $\rightarrow$ does not generalize to other LLMs</li>
<li>Semantic Entropy or Self-consistency $\rightarrow$ must generate multiple output sequences</li>
<li>Fine-tuning LLMs on the target task can improve the likelihood of the ground truth $\rightarrow$ but this is not the same as minimizing wrong answers, and the model may still generate them</li>
</ul>
</li>
<li><p><strong>ASPIRE</strong> : learns self-evaluate from target-task data</p>
<ul>
<li>training LLMs on a subset of the training data from the QA tasks</li>
<li>define a selection score that combines the likelihood of the generated answer with the learned self-eval score to make selective predictions</li>
<li>less computationally expensive than generating multiple output sequences </li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h4 id="selective-predictions-for-llms">Selective Predictions for LLMs</h4>
<ul>
<li><p>Selective Prediction for classification (NLI) vs Selective Prediction for NLG</p>
<ul>
<li>NLG task has infinite size of the possible answer set</li>
</ul>
</li>
<li><p>Uncertainty Measure for LLMs</p>
</li>
<li><p>Use selective prediction to solve QA task when question is ambiguous</p>
</li>
<li><p>Use auxiliary model to distinguish correct predictions of QA model</p>
</li>
</ul>
<h4 id="parameter-efficient-fine-tuning-peft">Parameter Efficient Fine-Tuning (PEFT)</h4>
<ul>
<li>LoRA</li>
<li>Prefix Tuning</li>
<li>Soft Prompt Tuning $\rightarrow$ used!</li>
<li>P-Tuning</li>
</ul>
<h1 id="3-problem-setup">3. Problem Setup</h1>
<h4 id="notations">Notations</h4>
<ul>
<li>pretrained LLM $f$ for an arbitrary generative modeling task like QA</li>
<li>vocabulary $\mathcal{V}$</li>
<li>the space of sequences of tokens $\mathcal{V}^*$</li>
<li>logits of $f$ on $v \in \mathcal{V}$ given $\mathbf{x} \in \mathcal{V}^*$ is $\bar{f}(v \ | \ \mathbf{x})$</li>
<li>the likelihood of the next token following $\mathbf{x}$ being $v$ is
$$
f(v \ | \ \mathbf{x}) := {\exp(\bar{f} (v \ | \ \mathbf{x})) \over \sum _{v&#39; \in \mathcal{V}} \exp (\bar{f} ( v&#39; \ | \ \mathbf{x}))}
$$
(softmax!)</li>
<li>likelihood of generating $\hat{\mathbf{y}} \in \mathcal{V}^*$ given $\mathbf{x}$ is
$$
f(\hat{\mathbf{y}} \ | \ \mathbf{x}) := \prod_{i=1}^{|\hat{\mathbf{y}}|}f(\hat{y_i} \ | \ \mathbf{x}, \hat{y}_{[i-1]}) 
$$
where $\hat{\mathbf{y}} = (\hat{y_1}, \hat{y_2}, ... \hat{y}_{|\hat{\mathbf{y}}|})$ and $\hat{y}_{[i-1]} = (\hat{y_1}, ... \hat{y}_{i-1}), \hat{y}_{[0]} = \emptyset$</li>
<li>This likelihood can be very small when $|\hat{\mathbf{y}}|$ is very large $\rightarrow$ normalize the likelihood
$$
f_{\text{norm}}(\hat{\mathbf{y}} \ | \ \mathbf{x}) := f(\hat{\mathbf{y}} \ | \ \mathbf{x})^{{1 \over |\hat{\mathbf{y}}|}}
$$</li>
<li>use $f$ to generate the output sequence by solving 
$$
\hat{\mathbf{y}} ^ * = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x}) 
$$</li>
<li>Impossible to solve exactly as the output sequence can be arbitrarily long $\rightarrow$ use a decoding strategy (greedy decoding, beam search) to approximate it</li>
</ul>
<h4 id="evaluate-correctness">Evaluate Correctness</h4>
<ul>
<li><p>set of reference outputs $S$</p>
</li>
<li><p>evaluation metric $M : \mathcal{V}^* \times \mathcal{V}^* \rightarrow \ [0,1]$</p>
<ul>
<li>evaluate the similarity of the generated output $\hat{\mathbf{y}}$ and the reference output $\mathbf{y}_r \in S$</li>
</ul>
</li>
<li><p>threshold $\gamma$</p>
<ul>
<li>if $\max_{\mathbf{y}_r \in S} M(\hat{\mathbf{y}}, \mathbf{y}_r) &gt; \gamma$, then the generated output is correct</li>
</ul>
</li>
<li><p>training dataset $\mathcal{D}^{tr} = \{ (\mathbf{x}^i, S^i) \}_{i=1}^{n_{tr}}$ randomly sampled from a target task distribution</p>
</li>
<li><p>rejection operation $\bot$</p>
</li>
<li><p>selective predictor $f_s : \mathcal{V}^* \rightarrow \mathcal{V}^* \cup \{ \bot \}$ (a minimal sketch follows after this list)</p>
<ul>
<li>should achieve strong selective prediction performance on test dataset</li>
<li>composed of a predictor $\hat{f} : \mathcal{V}^* \rightarrow \mathcal{V}^*$ and a selection scoring function $g : \mathcal{V}^* \rightarrow \mathbb{R}$</li>
<li>$$
f_s(\mathbf{x}; \tau) = \begin{cases}
\hat{f}(\mathbf{x}) \quad &amp;\text{if }g(\mathbf{x}) \ge \tau \\ \bot \quad &amp;\text{if } g(\mathbf{x}) &lt; \tau
\end{cases}
$$</li>
<li>accuracy : the fraction of the accepted inputs where the predictions are correct</li>
<li>coverage : the fraction of the inputs that are accepted</li>
<li>Tune $\tau$ to achieve a certain coverage and manage accuracy-coverage trade-off</li>
</ul>
</li>
<li><p>use AUACC (area under the accuracy-coverage curve) to measure selective prediction performance</p>
</li>
<li><p>use AUROC (area under the receiver operator characteristic curve) to measure the quality of the selection score estimation</p>
<ul>
<li>equivalent to the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence</li>
</ul>
</li>
</ul>
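<p>A minimal Python sketch of the selective predictor and the accuracy/coverage quantities defined above. The toy predictor, scoring function, and data are assumptions; sweeping $\tau$ traces the accuracy-coverage curve whose area is the AUACC metric.</p>
<pre><code class="language-python">def selective_predict(x, predictor, score_fn, tau):
    """Return the prediction if g(x) &gt;= tau, otherwise abstain (None plays the role of the rejection symbol)."""
    return predictor(x) if score_fn(x) &gt;= tau else None

def accuracy_and_coverage(examples, predictor, score_fn, is_correct, tau):
    accepted = [(x, y) for x, y in examples if score_fn(x) &gt;= tau]
    coverage = len(accepted) / len(examples)
    accuracy = sum(is_correct(predictor(x), y) for x, y in accepted) / len(accepted) if accepted else 1.0
    return accuracy, coverage

exs = [("q1", "a"), ("q2", "b"), ("q3", "c")]
pred = lambda x: "a"                                  # toy predictor
g = lambda x: {"q1": 0.9, "q2": 0.4, "q3": 0.8}[x]    # toy selection score
print(accuracy_and_coverage(exs, pred, g, lambda p, y: p == y, tau=0.5))  # (0.5, 0.666...)
</code></pre>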
<h1 id="4-aspire-framework">4. ASPIRE Framework</h1>
<ul>
<li><p>LLM should have self-evaluation ability</p>
<ul>
<li>Previous work was only applicable to specific LLMs</li>
<li>Collect some training data to employ self-evaluation</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aa95889a-0598-42e3-93df-04727a83fa4b/image.png" alt=""></p>
<ul>
<li><p>Start with LoRA</p>
<ul>
<li>model parameters $\theta$ is frozen</li>
<li>adapter $\theta_p$ is added for fine-tuning and updated</li>
<li>it improves prediction accuracy and likelihood of correct output sequences $\rightarrow$ improves selective prediction performance!</li>
</ul>
</li>
<li><p>Fine-tune LLM to learn self-evaluation</p>
<ul>
<li><p>use $\theta_p$ to generate different answers for each example $(\mathbf{x}, \mathbf{y}) \in \mathcal{D}^{tr}$</p>
</li>
<li><p>supposing the decoding algorithm used to generate output sequences for $\mathbf{x}$ is $\mathcal{A}$
where $\mathcal{A}(f, \theta_p, \mathbf{x}) = [\hat{\mathbf{y}}^1, ..., \hat{\mathbf{y}}^k]$</p>
</li>
<li><p>choose output sequences such that $f(\hat{\mathbf{y}}^j \ | \ \mathbf{x}; \theta_p)$ is maximal</p>
</li>
<li><p>use metric $M$ to determine $\hat{\mathbf{y}}^j$ is correct 
i.e. if $M(\hat{\mathbf{y}}^j, \mathbf{y}) &gt; \hat{\gamma}$, it is correct</p>
</li>
<li><p>use threshold $\hat{\gamma}$ different from $\gamma$ for evaluation (choose sufficiently large $\hat{\gamma}$ so that the wrong outputs wouldn&#39;t be labeled as correct outputs)</p>
</li>
<li><p>after sampling high-likelihood outputs, tune $\theta_s$ only for learning self-evaluation ($\theta$ and $\theta_p$ are frozen)</p>
</li>
<li><p>the training objective is
$$
\begin{aligned}
&amp;\min_{\theta_s} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} \ \mathcal{L}_c + \mathcal{L}_w \\
&amp;\mathcal{L}_c = \mathbb{E}_{\hat{\mathbf{y}} \sim S_c(\mathbf{x}, \mathbf{y})} - \log f(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}; \theta_p, \theta_s) \\
&amp;\mathcal{L}_w = \mathbb{E}_{\hat{\mathbf{y}} \sim S_w(\mathbf{x}, \mathbf{y})} - \log f(\text{wrong} \ | \ \mathbf{x}, \hat{\mathbf{y}}; \theta_p, \theta_s)
\end{aligned}
$$</p>
<p>where $S_c(\mathbf{x}, \mathbf{y})$ is a set of &#39;correct&#39; outputs containing the reference $\mathbf{y}$ and the $k_c$ correct outputs with highest likelihood from $\mathcal{A}(f, \theta_p, \mathbf{x})$, and similarly for $S_w$ (if $\mathcal{A}(f, \theta_p, \mathbf{x})$ has no wrong output, add a default wrong output (e.g. an empty string) to $S_w$)</p>
</li>
<li><p>After training $\theta_s$, obtain the prediction solving 
$$
\hat{\mathbf{y}}^* = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x};\theta_p)
$$</p>
</li>
<li><p>Also, the self-eval score is defined as
$$
P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*) = {\exp (\bar{f}(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*; \theta_p, \theta_s)) \over \sum_{z \in \{\text{correct}, \text{wrong}\}} \exp (\bar{f}(z \ | \ \mathbf{x}, \hat{\mathbf{y}}^*; \theta_p, \theta_s))}
$$</p>
</li>
<li><p>Used Beam search decoding</p>
</li>
<li><p>Overall, the selection scoring function is 
$$
g(\mathbf{x}) = (1 - \alpha)\cdot \log f_{\text{norm}} (\hat{\mathbf{y}}^* \ | \ \mathbf{x}; \theta_p) + \alpha \cdot \log P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*)
$$
where $\alpha \in [0,1]$ is a hyperparameter</p>
</li>
</ul>
</li>
</ul>
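<p>A small Python sketch of how the final selection score $g(\mathbf{x})$ above could be computed from the per-token log-likelihoods of the generated answer and the learned self-eval probability. Function and argument names are assumptions for illustration.</p>
<pre><code class="language-python">import math

def selection_score(answer_token_logprobs, p_correct, alpha=0.25):
    """g(x) = (1 - alpha) * log f_norm(y* | x) + alpha * log P(correct | x, y*)."""
    log_f_norm = sum(answer_token_logprobs) / len(answer_token_logprobs)  # length-normalized log-likelihood
    return (1 - alpha) * log_f_norm + alpha * math.log(p_correct)

# Toy numbers: a fairly likely answer that the self-eval head also judges as probably correct.
print(selection_score([-0.2, -0.1, -0.3], p_correct=0.9))
</code></pre>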
<h1 id="5-implementation-via-soft-prompt-tuning">5. Implementation via Soft Prompt Tuning</h1>
<ul>
<li>They could develop prompts that effectively stimulate self-evaluation</li>
<li>it is possible to discover these prompts through soft prompt tuning with targeted training objectives
<img src="https://velog.velcdn.com/images/0404_not_found/post/a9f7bd45-9294-4a81-ac70-086c4e0a7648/image.png" alt=""></li>
</ul>
<h4 id="soft-prompt-tuning">Soft Prompt Tuning</h4>
<ul>
<li>given query $\mathbf{x} = (x_1, ..., x_{m_q})$</li>
<li>get embedding of $\mathbf{x}$  to form a matrix $X \in \mathbb{R}^{m_q \times d_e}$</li>
<li>soft-prompts $\tilde{\theta} \in \mathbb{R}^{l \times d_e}$</li>
<li>concatenate soft-prompts to query to form $[\tilde{\theta}; X] \in \mathbb{R}^{(m_q + l) \times d_e}$</li>
</ul>
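<p>A minimal numpy sketch of the soft-prompt concatenation just described; the shapes follow the notation above, and the specific sizes are made-up assumptions.</p>
<pre><code class="language-python">import numpy as np

def prepend_soft_prompt(X, theta):
    """Concatenate trainable soft-prompt vectors theta (l x d_e) in front of the
    frozen query token embeddings X (m_q x d_e), giving an (m_q + l) x d_e input."""
    return np.concatenate([theta, X], axis=0)

d_e, l, m_q = 16, 50, 7
theta_p = np.random.randn(l, d_e)    # learnable; updated by the theta_p objective below
X = np.random.randn(m_q, d_e)        # embeddings of the query tokens (frozen)
print(prepend_soft_prompt(X, theta_p).shape)   # (57, 16)
</code></pre>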
<h4 id="adapt-to-aspire">Adapt to ASPIRE</h4>
<ul>
<li><p>update $\theta_p$ with 
$$
\min_{\theta_p} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} {1 \over |\mathbf{y}|} \sum_{j=1}^{|\mathbf{y}|} - \log f(y_j \ | \ [\theta_p ; X ; Y_{[j-1]}])
$$</p>
</li>
<li><p>update $\theta_s$ with 
$$
\begin{aligned}
&amp;\min_{\theta_s} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} \ \mathcal{L}_c + \mathcal{L}_w \\
&amp;\mathcal{L}_c = \mathbb{E}_{\hat{\mathbf{y}} \sim S_c(\mathbf{x}, \mathbf{y})} - \log f(\text{correct} \ | \ [\theta_p; X; \hat{Y}; \theta_s]) \\
&amp;\mathcal{L}_w = \mathbb{E}_{\hat{\mathbf{y}} \sim S_w(\mathbf{x}, \mathbf{y})} - \log f(\text{wrong} \ | \ [\theta_p; X; \hat{Y}; \theta_s])
\end{aligned}
$$</p>
</li>
<li><p>The Inference objective becomes
$$
  \hat{\mathbf{y}}^* = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x};[\theta_p; X])
$$</p>
</li>
<li><p>The self-eval score becomes
$$
  P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*) = {\exp (\bar{f}(\text{correct} \ | \ [\theta_p; X; \hat{Y}^*; \theta_s])) \over \sum_{z \in \{\text{correct}, \text{wrong}\}} \exp (\bar{f}(z \ | \ [\theta_p; X; \hat{Y}^*; \theta_s]))}
$$</p>
</li>
</ul>
<h4 id="generation-pipeline">Generation Pipeline</h4>
<ul>
<li><p>obtain generated output and the likelihood for the output</p>
</li>
<li><p>obtain self-eval score</p>
</li>
<li><p>cache the states of first stage to reduce computational cost for second stage</p>
</li>
</ul>
<h4 id="computational-complexity">Computational Complexity</h4>
<ul>
<li>At test time : $O(l_{max})$</li>
<li>Predictive entropy and semantic entropy methods : $O(m \cdot l_{max})$</li>
</ul>
<h1 id="6-experiments">6. Experiments</h1>
<ul>
<li>Using a decoding algorithm that can sample diverse high-likelihood outputs is important</li>
<li>more training samples lead to enhanced performance</li>
<li>2k samples are enough to outperform the baselines without soft-prompt tuning</li>
</ul>
<h2 id="61-setup">6.1 Setup</h2>
<ul>
<li><p>free-form QA task : CoQA(zero-shot), SQuAD(zero-shot), TriviaQA (5-shot)</p>
</li>
<li><p>used 50K examples subset</p>
</li>
<li><p>OPT(350M, 1.3B, 2.7B, 30B), GPT-2(M, L, XL)</p>
</li>
<li><p>pretrained LLM and $\theta_p$ trained model</p>
</li>
<li><p>beam-search</p>
</li>
<li><p>selection score $g(\mathbf{x})$ with PPL, Predictive Entropy, Semantic Entropy, Self-eval, P(True)</p>
</li>
<li><p>Rouge-L as the evaluation metric $M$ with relatively large $\gamma = 0.7$ (accepting wrong answer is more costly)</p>
</li>
<li><p>Both stage of training $\theta_p$ and $\theta_s$, 10 epochs with AdamW, batch 8, lr 0.01 and cosine lr scheduling</p>
</li>
<li><p>for ASPIRE, </p>
<ul>
<li>beam search for $\mathcal{A}$</li>
<li>$l = 50$</li>
<li>$\hat{\gamma} = 0.9$</li>
<li>$k=10$</li>
<li>$k_c = 2$</li>
<li>$k_w = 10$</li>
<li>$\alpha=0.25$</li>
</ul>
</li>
</ul>
<h1 id="62-results">6.2 Results</h1>
<h4 id="accuracy">Accuracy</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e01813bd-fa5a-402b-b5d1-71284a5992d9/image.png" alt=""></p>
<h4 id="methods-to-get-selection-score">Methods to get selection score</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/10370cc5-4132-4915-9c7c-c9fbcbf0fb75/image.png" alt=""></p>
<ul>
<li>After prompt tuning, other methods&#39; AUACC is significantly improved as accuracy became better and PPL became more meaningful</li>
<li>ASPIRE with OPT-2.7B significantly outperforms Self-eval and P(True) with OPT-30B</li>
<li>For the Self-eval and P(True) methods, although the accuracy of OPT-30B is better than adapted OPT-2.7B, it has much worse selective prediction performance
$\rightarrow$ the self-evaluation approach alone is not effective, even for high-capacity LLMs</li>
</ul>
<h2 id="63-empirical-analyses">6.3 Empirical Analyses</h2>
<h4 id="the-effect-of-alpha">The effect of $\alpha$</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d777925c-116c-4cd1-9c64-dddea11b00eb/image.png" alt=""></p>
<ul>
<li>$\alpha=0.25$ is the best recipe for normalized likelihood and the learned self-eval score</li>
<li>In practice, this value can be chosen based on the performance on the validation data</li>
</ul>
<h4 id="the-choices-of-mathcala">The choices of $\mathcal{A}$</h4>
<ul>
<li>compared beam search and multinomial sampling</li>
<li>used $k$ highest scoring beams as the answer list (beam search)</li>
<li>tested temperature 0.1, 1.0, 2.0 for multinomial sampling
<img src="https://velog.velcdn.com/images/0404_not_found/post/cd18a04f-d993-49b7-bd8a-35a3af6adf74/image.png" alt=""></li>
</ul>
<h4 id="training-sample-efficienty">Training sample efficienty</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1545da21-2c45-45df-9e0f-1e9dd79923e6/image.png" alt=""></p>
<ul>
<li>Fixed the number of steps to be 50K</li>
<li>ASPIRE can significantly improve selective prediction performance even with limited number of training samples</li>
</ul>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>Adaptation with self-evaluation to improve selective prediction in LLMs</li>
<li>Soft prompt tuning</li>
<li>Implement via other PEFT approaches and adapt to larger LLMs (Future work)</li>
<li>Didn&#39;t test with larger and stronger LLMs (computational constraints)</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<p>What I liked is that the confidence is not simply produced by prompting, but is obtained through its own computation and learning. However, the tested models are somewhat old, so I wonder whether this would also work with recent sLLMs.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text]]></title>
            <link>https://velog.io/@0404_not_found/Spotting-LLMs-with-Binoculars-Zero-Shot-Detection-of-Machine-Generated-Text</link>
            <guid>https://velog.io/@0404_not_found/Spotting-LLMs-with-Binoculars-Zero-Shot-Detection-of-Machine-Generated-Text</guid>
            <pubDate>Fri, 26 Jan 2024 14:08:45 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Introducing a method for detecting LLM-generated text in a zero-shot setting (no training samples from the source LLM) </p>
</li>
<li><p>outperforms existing detectors at detecting ChatGPT-generated text</p>
</li>
<li><p>Because of its <strong>zero-shot</strong> nature, it can spot multiple different LLMs with high accuracy</p>
</li>
<li><p>Prior research (Turnitin) fixated strongly on ChatGPT</p>
</li>
<li><p>More sophisticated actors use a wide range of LLMs beyond just ChatGPT</p>
</li>
<li><p><strong>Binoculars</strong> works by viewing text through two lenses</p>
<ul>
<li>compute the log perplexity of the text in question using an &quot;observer LLM&quot;</li>
<li>compute all the next-token predictions that a &quot;performer LLM&quot; would make and compute their perplexity according to the observer</li>
<li>If the string is written by a machine, the perplexities would be similar.</li>
</ul>
</li>
</ul>
<h1 id="2-the-llm-detection-landscape">2. The LLM Detection Landscape</h1>
<ul>
<li><p>Spam and Fake news analyzing $\rightarrow$ all benefit from signals that quantify whether text is human or machine-generated</p>
</li>
<li><p>Due to the rise of the Transformer models, primitive mechanisms became useless $\rightarrow$ to record or watermark all generated text</p>
</li>
<li><p>Post-hoc detection approaches without cooperation from model owners</p>
<ul>
<li>Fine-tuned pretrained backbone for the binary classification task (adversarial training, abstention)</li>
<li>Linear classifier on top of frozen learned features allowing for the inclusion of commercial API outputs</li>
</ul>
</li>
<li><p>Using statistical signatures that are characteristic of machine-generated text</p>
<ul>
<li>requires none or little training data</li>
<li>easily adapted to newer model families</li>
<li>based on perplexity, perplexity curvature, log rank, intrinsic dimensionality of generated text, n-gram</li>
</ul>
</li>
<li><p>Detection has limitation</p>
<ul>
<li>Fully general-purpose models of language would be, by definition, impossible to detect</li>
<li>Given sufficient examples, text from a model close to the optimum is still technically detectable</li>
<li>In practice, the relative success of detection provides evidence that current language models are imperfect representations of human writing (Detectable!)</li>
</ul>
</li>
<li><p>How do we appropriately and thoroughly evaluate detectors?</p>
<ul>
<li>accuracy on test sets and AUC of classifiers are not well-suited for the high-stakes question of detection</li>
<li>Only detectors with low false-positive rates truly reduce harm</li>
<li>detectors are often only evaluated on relatively easy datasets that are reflective of their training data</li>
</ul>
</li>
</ul>
<h1 id="3-binoculars-how-it-works">3. Binoculars: How it works</h1>
<ul>
<li>perplexity and cross-perplexity (how surprising the next-token predictions of one model are to another model)</li>
</ul>
<h2 id="31-background-and-notation">3.1 Background and Notation</h2>
<ul>
<li>string $s$</li>
<li>a list of token indices $\vec{x}$</li>
<li>tokenizer $T$</li>
<li>$i$-th token ID $x_i$</li>
<li>vocab $V = { 1,2 , ... ,n }$</li>
<li>language model $\mathcal{M}$</li>
<li>number of tokens in $s$,  $L$
$$
\mathcal{M}(T(s)) = \mathcal{M}( \vec{x} ) = Y, \quad
Y_{ij} = P(v_j | x_{0:i-1}) \text{ for all} \ j \in V
$$</li>
<li>Define logPPL as the average negative log-likelihood of all tokens in the given sequence
$$
\log \text{PPL}_{\mathcal{M}}(s) = - {1 \over L} \sum_{i=1}^{L}\log (Y_{ix_{i}})
$$</li>
<li>This logPPL intuitively measures how <strong>surprising</strong> a string is to a language model</li>
<li>As it is used as a loss function, the models are likely to score their own outputs as unsurprising</li>
<li>Define <strong>Cross-Perplexity</strong> as the average per-token cross-entropy between the outputs of two models
$$
\log \text{X-PPL}_{\mathcal{M}_1, \mathcal{M}_2}(s) = - { 1 \over L} \sum_{i=1}^{L} \mathcal{M}_1(s)_i \ \cdot \ \log(\mathcal{M}_2 (s)_i)
\ \text{where } \cdot \text{ means the dot product}
$$</li>
</ul>
<h2 id="32-what-makes-detection-hard-a-primer-on-the-capybara-problem">3.2 What makes detection Hard? A primer on the Capybara problem</h2>
<ul>
<li><p>LLM tends to generate text that is unsurprising to an LLM</p>
</li>
<li><p>As humans are different from machine, human PPL is higher according to an LLM observer</p>
</li>
<li><p>When it faces hand-crafted prompts, this intuition breaks</p>
<ul>
<li>prompt &quot;1, 2, 3, &quot; results in &quot;4, 5, 6&quot; which has very low PPL</li>
<li>But a prompt like &quot;Can you write a few sentences about a capybara that is an astrophysicist?&quot; will yield a response that seems much stranger $\rightarrow$ High PPL (&quot;capybara&quot;, &quot;astrophysicist&quot;)</li>
<li>in the absence of the prompt, LLM detection seems difficult and naive perplexity-based detection fails
<img src="https://velog.velcdn.com/images/0404_not_found/post/5e07a486-6ca0-4a0b-9901-d25e9e979714/image.png" alt=""></li>
</ul>
</li>
</ul>
<h2 id="33-our-detection-score">3.3 Our Detection Score</h2>
<ul>
<li>Binoculars solves the capybara problem by providing a mechanism for estimating the <strong>baseline PPL</strong> induced by the prompt</li>
</ul>
<h4 id="motivation">Motivation</h4>
<ul>
<li><p>LM generates Low-PPL text relative to humans $\rightarrow$ PPL Threshold classifier</p>
</li>
<li><p>Capybara problem $\rightarrow$ prompt matters $\rightarrow$ Cross-PPL</p>
</li>
<li><p>Cross-PPL measures the tokens are surprising <strong>relative to the baseline PPL of an LLM acting on the same string</strong></p>
</li>
<li><p>Expect the next-token choices of humans to have even higher PPL than those of the machine $\rightarrow$ Normalize the observed PPL by the expected PPL of a machine acting on the same text (a small code sketch follows at the end of this section)
$$
B_{\mathcal{M}_1, \mathcal{M}_2} (s) = { \log \text{PPL}_{\mathcal{M}_1} (s) \over \log \text{X-PPL}_{\mathcal{M}_1, \mathcal{M}_2}(s)}
$$</p>
</li>
<li><p>The numerator is simple PPL (how surprising a string is to $\mathcal{M}_1$)</p>
</li>
<li><p>The denominator measures how surprising the token predictions of $\mathcal{M}_2$ are when observed by $\mathcal{M}_1$</p>
</li>
<li><p>Expect humans to diverge from $\mathcal{M}_1$ more than $\mathcal{M}_2$ diverges from $\mathcal{M}_1$</p>
</li>
<li><p>The Binoculars score $B$ is a general mechanism that captures a statistical signature of machine text</p>
</li>
<li><p>It is also capable of detecting generic machine-text generated by a third model altogether</p>
</li>
<li><p>Connection to other approaches </p>
<ul>
<li>Contrastive Decoding : generate high-quality text by maximizing the difference between a weak and a strong model</li>
<li>Speculative Decoding : Use weaker models to plan completions</li>
<li>Both work when pairing a strong model with a much weaker model</li>
<li>But Binoculars works well when pairing two very similar models (Falcon-7B as $\mathcal{M}_1$ and Falcon-7B-instruct as $\mathcal{M}_2$)</li>
</ul>
</li>
</ul>
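<p>A minimal numpy sketch of the Binoculars score as defined above. The two probability arrays stand in for the next-token distributions of the observer $\mathcal{M}_1$ and the performer $\mathcal{M}_2$ on the same tokenized string; the names and random toy inputs are assumptions.</p>
<pre><code class="language-python">import numpy as np

def binoculars_score(probs_m1, probs_m2, token_ids):
    """B = log PPL_{M1}(s) / log X-PPL_{M1,M2}(s).
    probs_m1, probs_m2: (L, |V|) per-position next-token distributions of M1 and M2.
    token_ids: the L observed tokens of the string s."""
    L = len(token_ids)
    log_ppl = -np.mean(np.log(probs_m1[np.arange(L), token_ids]))       # how surprising s is to M1
    log_x_ppl = -np.mean(np.sum(probs_m1 * np.log(probs_m2), axis=1))   # cross-perplexity term above
    return log_ppl / log_x_ppl   # lower scores indicate machine-generated text

rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(10), size=5)   # toy 5-token string over a vocab of 10
p2 = rng.dirichlet(np.ones(10), size=5)
print(binoculars_score(p1, p2, rng.integers(0, 10, size=5)))
</code></pre>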
<h1 id="4-accurate-zero-shot-detection">4. Accurate Zero-Shot Detection</h1>
<h2 id="41-datasets">4.1 Datasets</h2>
<ul>
<li><p>Ghostbuster : Writing Prompts, News, Student Essay datasets (Humans vs ChatGPT)</p>
</li>
<li><p>Drew human samples from CCNews, PubMed, CNN and generated machine text by LLaMA-2-7B and Falcon-7B</p>
<ul>
<li>Take the first 50 tokens of each human sample and use them as a prompt to generate up to 512 tokens</li>
<li>removed human prompt from the generation</li>
</ul>
</li>
<li><p>Orca dataset to check the reliability of the proposed method for instruction-tuned models</p>
</li>
</ul>
<h2 id="42-metrics">4.2 Metrics</h2>
<ul>
<li><p>Binary classification metrics </p>
<ul>
<li>ROC Curve</li>
<li>AUC</li>
</ul>
</li>
<li><p>In high-stakes detection settings, false positives are the most concerning harm (human text labeled as machine-generated)</p>
<ul>
<li>TPR (True-Positive rates) at FPR (False-Positive rates)</li>
<li>standard FPR threshold of 0.01%</li>
<li>when the FPR is below 1%, AUC and TPR@FPR are often uncorrelated</li>
</ul>
</li>
</ul>
<h2 id="43-benchmark-performances">4.3 Benchmark Performances</h2>
<h4 id="ghostbuster-vs-chatgpt">Ghostbuster (vs ChatGPT)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6fbd47fb-d2a8-4446-8dd7-e3533370816c/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9243f21e-ae90-4f2a-bbc8-5c56d919a382/image.png" alt=""></p>
<ul>
<li>outperforms Ghostbuster in &quot;out-of-domain&quot; settings</li>
<li>Ghostbuster and Binoculars both get stronger when given more text</li>
<li>Binoculars has a clearer advantage in the few-token regime</li>
</ul>
<h4 id="open-source-lms-vs-llama-2-and-falcon">Open source LMs (vs LLaMA-2 and Falcon)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4f892a91-0b25-49cb-9527-f61fa169f9c4/image.png" alt=""></p>
<ul>
<li>Ghostbuster fails to detect generations from other open-source models</li>
</ul>
<h1 id="5-reliability-in-the-wild">5. Reliability in the Wild</h1>
<h2 id="51-varied-text-sources">5.1 Varied Text Sources</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/2dbb4db3-1343-4c16-8092-378852dc1fbc/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/726d1742-9ee2-4421-b4d5-8f440112c62d/image.png" alt=""></p>
<ul>
<li>used M4 detection dataset</li>
<li>Binoculars generalizes across domains and languages</li>
<li>LR GLTR : Logistic Regression over Giant Language Model Test Room</li>
<li>NELA : News Landscape Classifiers</li>
</ul>
<h2 id="52-other-languages">5.2 Other Languages</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/237d4a4a-f815-4a4a-a4f5-69fc80d583c8/image.png" alt=""></p>
<ul>
<li><p>Evaluating Binoculars on samples from languages that are not well represented in Common Crawl data</p>
<ul>
<li>FPR remains low but machine text is classified as human (poor recall)</li>
<li>Binoculars is a machine-text detector that detects whether text may have been generated by a <strong>similar</strong> language model</li>
<li>Falcon has low capability in low-resource languages, so ChatGPT&#39;s text in those languages looks unlikely to be machine-generated according to this score</li>
</ul>
</li>
<li><p>A stronger multilingual pair of models would make Binoculars more effective at detecting ChatGPT-generated text in those languages</p>
</li>
</ul>
<h4 id="fpr-on-text-written-by-non-native-speakers">FPR on text written by non-native speakers</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/84f69fb6-cfab-48f1-8dbb-809e26736f00/image.png" alt=""></p>
<ul>
<li>LLM detectors are inadvertently biased against non-native English speakers, classifying their writing as machine-generated</li>
<li>Analyzed EssayForum (ESL students&#39; academic writing), comparing original essays and grammar-corrected versions</li>
<li>Binoculars is insensitive to this type of shift</li>
</ul>
<h2 id="53-memorization">5.3 Memorization</h2>
<ul>
<li>Highly memorized examples (famous quotes) are classified as machine-generated by PPL-based detection
<img src="https://velog.velcdn.com/images/0404_not_found/post/0e05d1bb-4fa1-4af8-ad6f-d25747d5f607/image.png" alt=""></li>
<li>Memorized text is human-written but is also readily produced by the machine</li>
<li>Either behavior can be acceptable depending on the application (plagiarism detection vs. removal of LLM-generated text from a training corpus)</li>
</ul>
<h2 id="54-modified-prompting-strategies">5.4 Modified Prompting Strategies</h2>
<ul>
<li><p>For the OpenOrca set, Binoculars detects 92% of GPT-3 samples and 89.57% of GPT-4 samples
<img src="https://velog.velcdn.com/images/0404_not_found/post/aa1a704a-a6ae-4fcf-bfd5-e0a8d455f3d2/image.png" alt=""></p>
</li>
<li><p>Simple detection schemes are fooled by these changes of prompt</p>
</li>
<li><p>This does not affect the performance of the Binoculars score
<img src="https://velog.velcdn.com/images/0404_not_found/post/5f4eeb73-bfbc-4940-9c3e-88448c9bbac2/image.png" alt=""></p>
</li>
</ul>
<h2 id="55-randomized-data">5.5 Randomized Data</h2>
<ul>
<li>Test arbitrary mistakes, hashcodes, or other kinds of random string
<img src="https://velog.velcdn.com/images/0404_not_found/post/a8c1f1ea-53ec-4523-8bf0-207c8f463adb/image.png" alt=""></li>
<li>Confidently scores them as human</li>
<li>LLMs usually don&#39;t generate such things</li>
</ul>
<h1 id="6-discussion-and-limitations">6. Discussion and Limitations</h1>
<ul>
<li>a method for detecting LLM-generated text in the zero-shot case</li>
<li>The transferable detector works in the zero-shot setting</li>
<li>This transferability comes from the similarity between modern LLMs (Transformers!)</li>
<li>Due to VRAM, they didn&#39;t check larger models (30B+)</li>
<li>Didn&#39;t consider explicit efforts to bypass detection</li>
<li>Non-conversational text domains are not included</li>
</ul>
<h1 id="7-comment">7. Comment</h1>
<p>A method that uses cross-PPL rather than plain PPL to check model generations in a relative way. However, loading two models probably requires quite a lot of resources. </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Sparse Upcycling: Training MoE from Dense Checkpoints]]></title>
            <link>https://velog.io/@0404_not_found/Sparse-Upcycling-Training-MoE-from-Dense-Checkpoints</link>
            <guid>https://velog.io/@0404_not_found/Sparse-Upcycling-Training-MoE-from-Dense-Checkpoints</guid>
            <pubDate>Tue, 16 Jan 2024 12:43:35 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduciton">1. Introduciton</h1>
<ul>
<li><p>Increased scale is one of the main drivers of better performance in DL (NLP, Vision, Speech, RL, Multimodal etc.)</p>
</li>
<li><p>Most SOTA Neural Nets are trained from-scratch (random weights) $\rightarrow$ Cost for training is high</p>
</li>
<li><p><strong>Model Upcycling</strong>: upgrading an existing model with a relatively small additional computational budget</p>
<ul>
<li>focus on upcycling dense models into larger sparsely activated MoEs (starting from a pretrained dense Transformer checkpoint)</li>
<li>less than 40% additional budget across all sizes, for both language and vision</li>
</ul>
</li>
<li><p>Valuable in two scenarios</p>
<ul>
<li>Have access to a pretrained Transformer and want to improve it within a computational budget</li>
<li>Plan to train a large model and don&#39;t know whether dense or MoE would be more effective $\rightarrow$ First train the dense model, then upcycle it into a MoE</li>
</ul>
</li>
<li><p>Central challenge in model upcycling is the initial performance decrease entailed by changing a trained network structure $\rightarrow$ present a model surgery recipe</p>
</li>
</ul>
<h1 id="2-background">2. Background</h1>
<h2 id="21-sparsely-activated-mixture-of-experts-moe">2.1 Sparsely Activated Mixture of Experts (MoE)</h2>
<h4 id="dense-vs-sparse">Dense vs Sparse</h4>
<ul>
<li>Dense model : apply all params to every input</li>
<li>Sparse model : activating a subset of params for each input</li>
<li>MoE models are an accelerator friendly family of sparse models that allow training of models with up to trillions of params</li>
</ul>
<h4 id="moe-model">MoE Model</h4>
<ul>
<li>alternate standard Transformer blocks with MoE blocks</li>
<li>usually replace the MLPs in a Transformer block with a number of &#39;experts&#39; (also MLP) with different params and a router (small neural net, decides which expert should be applied)</li>
<li>There are multiple routing algorithms (Top-K, BASE and Sinkhorn-BASE layers, Hash layers, Expert Choice routing)</li>
</ul>
<h4 id="sparsely-gated-moe-shazeer-et-al-2017">Sparsely Gated MoE (Shazeer et al., 2017)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9e1c69c9-829d-423e-b19a-7aa311a6e0e6/image.png" alt=""></p>
<ul>
<li><p>Gating network $G(x) \in \mathbb{R}^n$ and $n$ expert networks $E_1, E_2, ... E_n$</p>
</li>
<li><p>the output $y$ of the MoE module is $y = \sum_{i=1} ^n G(x)_i E_i(x)$</p>
</li>
<li><p>$G(x)$ is a sparse vector: it has non-zero elements only at the indices of the selected experts</p>
</li>
<li><p>The choice of gating function</p>
<ul>
<li><p>Softmax gating : $G_{\sigma}(x) = Softmax(x \cdot W_g)$ where $W_g$ is trainable weight matrix</p>
</li>
<li><p>Noisy Top-K gating (sketched in code after this list)
$$
\begin{aligned}
G(x) &amp;= Softmax(KeepTopK(H(x), k)) \\
H(x)_i &amp;= (x \cdot W_g)_i + StandardNormal() \cdot Softplus((x \cdot W_{noise})_i) \\
KeepTopK(v, k)_i &amp;= \begin{cases} v_i \quad &amp;\text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty \quad &amp;\text{otherwise} \end{cases} \\
Softplus(x) &amp;= \ln ({1 + e^x})
\end{aligned}
$$</p>
</li>
</ul>
</li>
</ul>
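<p>A small numpy sketch of the noisy top-k gate above, assuming the multiplicative-noise form of $H(x)$; matrix names mirror the equations, and the toy dimensions are arbitrary.</p>
<pre><code class="language-python">import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_topk_gate(x, W_g, W_noise, k, rng):
    """G(x) = softmax(KeepTopK(H(x), k)), sparse over experts."""
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus(x @ W_noise)
    keep = np.argsort(h)[-k:]              # indices of the k largest entries of H(x)
    masked = np.full_like(h, -np.inf)
    masked[keep] = h[keep]                 # KeepTopK: -inf everywhere else
    e = np.exp(masked - h[keep].max())
    return e / e.sum()                     # non-zero only for the k selected experts

rng = np.random.default_rng(0)
d, n_experts = 8, 4
G = noisy_topk_gate(rng.normal(size=d), rng.normal(size=(d, n_experts)),
                    rng.normal(size=(d, n_experts)), k=2, rng=rng)
print(G)   # exactly two non-zero entries, summing to 1
</code></pre>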
<h4 id="expert-choice-routing-zhou-et-al-2022">Expert Choice routing (Zhou et al., 2022)</h4>
<ul>
<li><p>$E$ for total # of experts</p>
</li>
<li><p>$n$ for total number of tokens</p>
</li>
<li><p>Router output $\bold{R} \in \mathbb{R}^{n \times E}$ : routing probabilities</p>
</li>
<li><p>the row $r_{i} \in \mathbb{R}^E$ corresponds to the $i$-th token and distribution over experts (non-negative and sum to 1)</p>
</li>
<li><p>Every expert $e$ independently chooses the $T$ tokens with highest probabilities (top-T per column) and process</p>
</li>
<li><p>parametrize $T$ as $T = C(n/E)$ where $C$ is a capacity factor to control # of tokens per expert (if $C=1$, some token will be processed by multiple experts while others by none)</p>
</li>
<li><p>This makes a model parameter count increase with minimal FLOPs overhead (router computation)</p>
</li>
<li><p>Letting $C &gt; 1$ usually leads to higher performance at a higher compute cost
<img src="https://velog.velcdn.com/images/0404_not_found/post/922b30be-8ec1-4c82-99c0-ce017f784343/image.png" alt=""></p>
</li>
<li><p>$$ S = Softmax(X \cdot W_g), \quad S \in \mathbb{R}^{n\times e} \\
G,\ I = TopK(S^T, k), \quad P = Onehot(I) \in \mathbb{R}^{e \times k \times n}
$$</p>
</li>
<li><p>$G \in \mathbb{R}^{e \times k}$ holds the routing weight of each expert for its selected tokens, and $I$ is an index matrix where $I[i,j]$ is the $j$-th token selected by the $i$-th expert</p>
</li>
<li><p>Then, apply the MoE and gating function in the dense FFN layer</p>
<ul>
<li>input : $X_{in} = P \ \cdot \ X \in \mathbb{R}^{e \times k \times d}$  where $P$ is permutation matrix</li>
<li>$X_{in}[i] \in \mathbb{R}^{k \times d}$ is input for $i$-th expert</li>
<li>output of each expert 
$X_e[i] = \text{GeLU}(X_{in}[i] \cdot W_1[i]) \cdot W_2[i]^T$</li>
<li>Final output $X_{\text{out}}[l,d] = \sum_{i,j}P[i, j, l]G[i, j]X_e[i, j, d]$ where
$l$ indexes the batch dimension and $d$ the model dimension</li>
</ul>
</li>
</ul>
<h2 id="22-architectures">2.2 Architectures</h2>
<ul>
<li><p>Apply the same sparse upcycling recipe to both language and vision tasks, on T5 and ViT (encoder)</p>
</li>
<li><p>ViT : follow V-MoE, but used global average pooling and Expert Choice Routing</p>
</li>
<li><p>T5 : use Expert Choice Routing for encoder, Top-K routing for decoder with $K=2$</p>
</li>
</ul>
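<p>A rough numpy sketch of Expert Choice routing following the $S$, $G$, $I$ notation above: each expert independently picks its top-$T$ tokens column-wise from the routing probabilities. The capacity factor and shapes are illustrative assumptions.</p>
<pre><code class="language-python">import numpy as np

def expert_choice_route(X, W_g, capacity_factor=2.0):
    """Each expert selects its top-T tokens, T = C * n / E."""
    n, e = X.shape[0], W_g.shape[1]
    S = np.exp(X @ W_g)
    S /= S.sum(axis=1, keepdims=True)            # (n, e) routing probabilities, rows sum to 1
    T = int(capacity_factor * n / e)             # tokens per expert
    I = np.argsort(-S, axis=0)[:T].T             # (e, T): I[i, j] = j-th token chosen by expert i
    G = np.take_along_axis(S.T, I, axis=1)       # (e, T): routing weight of each chosen token
    return G, I

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                     # 8 tokens, model dim 16
W_g = rng.normal(size=(16, 4))                   # 4 experts
G, I = expert_choice_route(X, W_g)
print(I.shape, G.shape)                          # (4, 4) each: every expert picks T = 4 tokens
</code></pre>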
<h1 id="3-the-upcycling-algorithm">3. The upcycling Algorithm</h1>
<h4 id="initialize">Initialize</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1eaae4d4-c82b-423e-b584-60961c5b8e30/image.png" alt=""></p>
<ul>
<li>Use dense model&#39;s parameters (checkpoint) to initialize new Transformer block (same number and shape)</li>
<li>A subset of the MLP layers are expanded into MoE layer</li>
<li>remaining layers are copied to new model</li>
<li>each MoE have a fixed number of experts</li>
<li>each expert is initialized as a copy of the original MLP</li>
<li>After initializing, continue training it for a number of additional steps (considering budget and resources)</li>
</ul>
<h2 id="design-decisions">Design Decisions</h2>
<p>The performance of upcycled models is heavily influenced by the configuration of the MoE layers</p>
<h4 id="router-type">Router Type</h4>
<ul>
<li>ViT : Expert Choice routing with $C=2$ (encoder)</li>
<li>T5 : Expert Choice routing with $C=2$ (encoder), Top-K routing with $K=2$ (decoder)</li>
</ul>
<h4 id="number-of-layers-to-upcycle">Number of layers to upcycle</h4>
<ul>
<li>Adding more MoE increases the model capacity</li>
<li>replace half of the MLP layers of original model with MoE layers</li>
</ul>
<h4 id="number-of-experts-to-add-in-upcycled-layers">Number of Experts to add in upcycled layers</h4>
<ul>
<li>Adding more experts doesn&#39;t significantly affect the FLOPS (the expert capacity is inversely proportional to the number of experts)</li>
<li>Too many experts cause a larger initial quality drop in the upcycled model (this can be overcome with sufficient upcycling compute)</li>
<li>32 experts was good</li>
</ul>
<h4 id="expert-capacity">Expert capacity</h4>
<ul>
<li>Larger expert capacity generally yields higher quality but increases the FLOPs</li>
<li>$C=2$ was good</li>
</ul>
<h4 id="resuming-optimizer-state-vision">Resuming Optimizer State (Vision)</h4>
<ul>
<li>reusing the optimizer state gives a performance boost for vision models (not language)</li>
</ul>
<h4 id="normalize-weights-after-routing-vision">Normalize weights after routing (Vision)</h4>
<ul>
<li><p>To reduce the performance drop from the upcycling model surgery, normalize the router combine weights of each token to sum to 1</p>
<ul>
<li>Each token was previously only processed by a single expert (original dense model)</li>
<li>it was helpful for vision but hurt performance for language (hypothesized to be because the decoder of T5 uses Top-K routing)</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-experimental-setup">4.1 Experimental Setup</h2>
<ul>
<li>Vision : V-MoE, ImageNet using 10-shot, 5 different training sets, average accuracy</li>
<li>Language : span corruption task on English C4 (pretrain), a proportional mix of all SuperGLUE (fine-tune), dense baseline starting checkpoint (Base), T5 1.1 checkpoints (L, XL)</li>
</ul>
<h2 id="42-results">4.2 Results</h2>
<h3 id="421-core-result">4.2.1 Core Result</h3>
<h4 id="pretraining">Pretraining</h4>
<ul>
<li>With a small amount of extra training, the upcycled models roughly recover the performance of their original checkpoints
<img src="https://velog.velcdn.com/images/0404_not_found/post/d233a7eb-4c47-4bb9-a590-a7b824a5abb9/image.png" alt=""></li>
</ul>
<h4 id="full-fine-tune">Full Fine-Tune</h4>
<ul>
<li>Still, the upcycled model has faster growth of score</li>
<li>For language, the difference is larger
<img src="https://velog.velcdn.com/images/0404_not_found/post/e5a254d9-649d-4136-b55a-6e0344d30695/image.png" alt=""></li>
</ul>
<h4 id="sparse-upcycling-vs-sparse-models-from-scratch">Sparse upcycling vs Sparse models from scratch</h4>
<ul>
<li>training from scratch takes longer to catch up with the upcycled models</li>
<li>For language, it used 120% of original dense checkpoint&#39;s computation to catch up upcycled models</li>
<li>Larger learning rate + experts can develop and diversify from the beginning</li>
<li>Given Large computation budget (&gt; 100% of original dense), training MoE from scratch may be preferable
<img src="https://velog.velcdn.com/images/0404_not_found/post/12f399fb-9f35-4806-93ac-bcfd85ad1ac9/image.png" alt=""></li>
</ul>
<h4 id="sparse-upcycling-vs-warm-starting">Sparse upcycling vs Warm starting</h4>
<ul>
<li>Dense upcycling (depth tiling) replicates layers from dense Base checkpoint to construct new layer
<img src="https://velog.velcdn.com/images/0404_not_found/post/9ea5d33f-bd77-4652-b51c-5a07488140c5/image.png" alt=""></li>
</ul>
<h3 id="422-ablations">4.2.2 Ablations</h3>
<ul>
<li>Vision : B/16 sparse model with 32 experts, $C=1$, 6 MoE layers at the last few blocks, dense checkpoint with 14 epochs + 7 additional epoch</li>
<li>Language : Base with 32 experts, $C=2$, 6 MoE layers interspersed, 0.5M ~ 1M additional steps</li>
</ul>
<h4 id="amount-of-dense-pretraining">Amount of dense pretraining</h4>
<ul>
<li>Regardless of the amount, upcycled model showed higher performance
<img src="https://velog.velcdn.com/images/0404_not_found/post/aeda3c38-56e3-41b1-8ecd-5a79f725ba74/image.png" alt=""></li>
</ul>
<h4 id="router-type-1">Router type</h4>
<ul>
<li>For vision, Top-K routing and Batch Prioritized Routing match the performance of Expert Choice routing but are slightly slower (on a per-step basis)</li>
<li>Top-K underperforms Expert Choice routing on a per-time basis
<img src="https://velog.velcdn.com/images/0404_not_found/post/5ed51719-4088-4bc4-bd6d-82b3e34daef4/image.png" alt=""></li>
</ul>
<h4 id="expert-capacity-factor">Expert Capacity Factor</h4>
<ul>
<li>The more tokens processed by expert, the greater the computation and performance</li>
<li>$C=2$ was best
<img src="https://velog.velcdn.com/images/0404_not_found/post/3d60620f-4891-4ed7-b1ef-7c4e3523e911/image.png" alt=""></li>
</ul>
<h4 id="number-of-moe-layers">Number of MoE layers</h4>
<ul>
<li>More MoE layers is not always better even on a per step basis
<img src="https://velog.velcdn.com/images/0404_not_found/post/05959dd4-d2f0-4789-940d-5c53bfd78166/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/65ae8773-bdf7-4aaa-aa8d-c21cac29afbd/image.png" alt=""></li>
</ul>
<h4 id="initialization-of-experts">Initialization of Experts</h4>
<ul>
<li>copying MLP layer &gt;&gt; train from scratch</li>
<li>adding small Gaussian noise to each copied MLP layer didn&#39;t help (a small amount has no effect, a large amount hurts the performance)
<img src="https://velog.velcdn.com/images/0404_not_found/post/01eb4339-699e-4353-a6a5-bedf57c9cd42/image.png" alt=""></li>
</ul>
<h4 id="number-of-experts">Number of Experts</h4>
<ul>
<li>Adding more experts increases the model parameter count and quality</li>
<li>Using a very large number of experts shows a large initial quality drop (Fig 10, left two)</li>
</ul>
<h1 id="5-conclusion">5. Conclusion</h1>
<ul>
<li>Provided Simple recipe to reuse pretrained dense checkpoints to initialize more powerful sparse models</li>
<li>Smooth transition from dense to MoE</li>
<li>Applicable for vision and language</li>
<li>Upcycling of model</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p> This was a different kind of MoE from what I had expected. It was interesting to see the idea of using a router to select experts, as well as the novel Expert Choice idea.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[SOLAR 10.7B: Scaling LLMs with Simple yet Effective Depth Up-Scaling]]></title>
            <link>https://velog.io/@0404_not_found/SOLAR-10.7B-Scaling-LLMs-with-Simple-yet-Effective-Depth-Up-Scaling</link>
            <guid>https://velog.io/@0404_not_found/SOLAR-10.7B-Scaling-LLMs-with-Simple-yet-Effective-Depth-Up-Scaling</guid>
            <pubDate>Tue, 09 Jan 2024 12:39:22 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Recent LLMs are scaled up following the performance scaling law $\rightarrow$ MoE</p>
<ul>
<li>Often require non-trivial changes to the training and inference framework</li>
<li>hinders widespread applicability</li>
</ul>
</li>
<li><p>Scaling up while retaining <strong>simplicity</strong> is important</p>
</li>
</ul>
<h4 id="depth-up-scaling-dus">Depth Up-Scaling (DUS)</h4>
<ul>
<li>Scaling the base model along the depth dimension and continually pretraining the scaled model</li>
<li>Not using MoE</li>
<li>No additional modules</li>
<li>No changes to the training or inference framework</li>
<li>Applicable to any Transformer architecture</li>
<li>Solar &gt; Mistral 7B, LLaMA 7b</li>
<li>Solar-Instruct &gt; Mixtral-8x7b</li>
</ul>
<h1 id="2-depth-up-scaling">2. Depth Up-Scaling</h1>
<ul>
<li>Use pretrained weights of base models to scale up</li>
<li>continually pretrain the scaled model
<img src="https://velog.velcdn.com/images/0404_not_found/post/d06a457f-037f-4b92-b2bf-99589166ece7/image.png" alt=""></li>
</ul>
<h4 id="base-model">Base Model</h4>
<ul>
<li>Any $n$-layer transformer architecture is OK (used 32-layer Llama 2)</li>
<li>Initialized Llama-2 architecture + pretrained weights from Mistral-7B</li>
</ul>
<h4 id="depthwise-scaling">Depthwise Scaling</h4>
<ul>
<li>From the base model with $n$ layers, set the target layer count $s$ for the scaled model (largely dictated by the available hardware)</li>
<li>Copy the base model</li>
<li>Remove the final $m$ layers from the original model and the initial $m$ layers from the duplicated model</li>
<li>Concatenate the two to form a model with $s = 2 \cdot(n-m)$ layers ($n=32$, $m=8$, $s=48$ for SOLAR; see the sketch after this list)</li>
</ul>
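<p>A minimal sketch of the depthwise scaling step above for $n=32$, $m=8$, $s=48$, treating the model as an ordered list of transformer-block identifiers; the helper name <code>depth_up_scale</code> is illustrative, not from the paper.</p>
<pre><code class="language-python"># Depthwise scaling: drop the final m layers from one copy and the initial m
# layers from the other, then concatenate to get s = 2 * (n - m) layers.
from copy import deepcopy

def depth_up_scale(base_layers, m):
    n = len(base_layers)
    first = deepcopy(base_layers[: n - m])  # original copy minus its final m layers
    second = deepcopy(base_layers[m:])      # duplicate minus its initial m layers
    return first + second

base = [f"layer_{i}" for i in range(32)]
scaled = depth_up_scale(base, m=8)
print(len(scaled))  # 48
</code></pre>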
<h4 id="continued-pretraining">Continued Pretraining</h4>
<ul>
<li><p>The performance of the scaled model initially drops below that of the base model</p>
</li>
<li><p>Rapid performance recovery is then observed during continued pretraining</p>
</li>
<li><p>The particular way of depthwise scaling isolates the heterogeneity in the scaled model</p>
<ul>
<li>if we simply repeated all layers so that the total layer count becomes $2n$, the <strong>layer distance</strong>, i.e. the difference in layer indices, would be too large at the seam</li>
<li>SOLAR instead sacrifices the $2m$ middle layers, thereby reducing the discrepancy at the seam</li>
<li>the success of DUS is obtained by both depthwise scaling and continued pretraining</li>
</ul>
</li>
</ul>
<h4 id="comparison-to-other-up-scaling-methods">Comparison to other up-scaling methods</h4>
<ul>
<li>DUS does not require a distinct training framework, additional modules (e.g., gating networks or dynamic expert selection), or specialized CUDA kernels</li>
<li>It integrates seamlessly into existing training and inference frameworks with high efficiency</li>
</ul>
<h1 id="3-training-details">3. Training Details</h1>
<h4 id="instruction-tuning">Instruction Tuning</h4>
<ul>
<li><p>QA Format + synthesized math QA dataset</p>
<ul>
<li>seed math data from <em>Math dataset</em> only to avoid contamination</li>
<li>using a process similar to MetaMath, rephrase the question and answers of the seed data $\rightarrow$ <em>Synth. Math-Instruct</em></li>
</ul>
</li>
</ul>
<h4 id="alignment-tuning">Alignment Tuning</h4>
<ul>
<li>The instruction-tuned model is further fine-tuned with DPO to be more aligned with the preferences of humans or of strong AI such as GPT-4</li>
<li>Open-Source + Synth.Math-Instruct</li>
<li>Speculated that the rephrased answer is better than the original answer</li>
<li>Made DPO tuples of {prompt (rephrased question), chosen (rephrased answer), rejected (original answer)} $\rightarrow$ <em>Synth.Math-Alignment</em> (see the sketch after this list)</li>
</ul>
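<p>A sketch of building DPO tuples in the <em>Synth.Math-Alignment</em> style, where the rephrased answer is treated as chosen and the original answer as rejected; the field names and the example sample are made up for illustration.</p>
<pre><code class="language-python"># Build {prompt, chosen, rejected} DPO tuples from rephrased math QA samples.
def build_dpo_tuples(samples):
    tuples = []
    for s in samples:
        tuples.append({
            "prompt": s["rephrased_question"],
            "chosen": s["rephrased_answer"],    # rephrased answer assumed better
            "rejected": s["original_answer"],
        })
    return tuples

samples = [{
    "rephrased_question": "A train covers 120 km in 2 hours. What is its speed?",
    "rephrased_answer": "Speed = distance / time = 120 / 2 = 60 km/h.",
    "original_answer": "60",
}]
print(build_dpo_tuples(samples)[0]["chosen"])
</code></pre>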
<h1 id="4-result">4. Result</h1>
<h4 id="training-dataset">Training Dataset</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fe3e6c19-c11d-46ec-acff-d47a1cc3ca67/image.png" alt=""></p>
<ul>
<li>Didn&#39;t always use all of the datasets</li>
<li>Synth. Math-Instruct can be replaced with MetaMathQA</li>
</ul>
<h4 id="result">Result</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/133fe857-9396-4579-b757-ad2ca2540c63/image.png" alt=""></p>
<ul>
<li>Merged some of the models trained during the instruction and alignment tuning stages</li>
<li>Implemented their own merging method (a generic merging sketch follows below)</li>
</ul>
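<p>The paper uses its own (unspecified) merging method; as a generic stand-in only, the sketch below shows plain elementwise weight interpolation of two fine-tuned checkpoints that share an architecture. This is not SOLAR&#39;s actual procedure.</p>
<pre><code class="language-python"># Generic checkpoint merging by elementwise weight interpolation (illustrative only).
import numpy as np

def interpolate_checkpoints(state_dict_a, state_dict_b, alpha=0.5):
    return {
        name: alpha * state_dict_a[name] + (1.0 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }

ckpt_a = {"w": np.ones((2, 2))}
ckpt_b = {"w": np.zeros((2, 2))}
merged = interpolate_checkpoints(ckpt_a, ckpt_b)  # elementwise mean of the weights
print(merged["w"])
</code></pre>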
<h4 id="ablation-on-instruction-tuning">Ablation on Instruction Tuning</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/90b4d8b3-4315-4c98-8c8b-2c62ecb077b6/image.png" alt=""></p>
<ul>
<li>Alpaca-GPT4 and OpenOrca make the model behave differently (SFT v1 vs. SFT v2)</li>
<li>Synth. Math-Instruct was helpful (SFT v3, SFT v4)</li>
<li>Merging models that specialize in different tasks is a promising way to obtain a model that performs well generally</li>
</ul>
<h4 id="ablation-on-alignment-tuning">Ablation on Alignment Tuning</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/803a90ef-406b-4cf0-9aab-663ce0662178/image.png" alt=""></p>
<ul>
<li>Adding Synth. Math-Alignment was helpful</li>
<li>Merging was not beneficial here, as DPO v2 is a strict improvement over DPO v1</li>
</ul>
<h4 id="ablation-on-sft-base-models">Ablation on SFT base models</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/71bf037a-5f90-41fb-98a4-87ecd601fb94/image.png" alt=""></p>
<ul>
<li>the performance gaps in certain tasks in the SFT base models don&#39;t always carry over to the alignment-tuned models</li>
</ul>
<h4 id="ablation-on-merge-methods">Ablation on Merge Methods</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/97d8f80a-a251-4b9a-bc6f-225a3cab8dda/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/59facc7d-b11f-435a-93e3-2a54a737d28b/image.png" alt=""></p>
<ul>
<li>As long as the merge candidates have sufficiently different strengths, the merging method may not be as crucial.</li>
<li>Merge 1 is SOLAR 10.7B-Instruct</li>
</ul>
<h1 id="5-conclusion">5. Conclusion</h1>
<ul>
<li>Depth up-scaled model SOLAR 10.7B</li>
<li>DUS is effective for scaling LLMs up from smaller base models</li>
</ul>
]]></description>
        </item>
    </channel>
</rss>