<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>0404_not_found.log</title>
        <link>https://velog.io/</link>
        <description></description>
        <lastBuildDate>Fri, 03 Jan 2025 14:31:24 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <image>
            <title>0404_not_found.log</title>
            <url>https://velog.velcdn.com/images/0404_not_found/profile/94b92194-27d7-48c2-8d48-b46a2f3a2f22/social_profile.png</url>
            <link>https://velog.io/</link>
        </image>
        <copyright>Copyright (C) 2019. 0404_not_found.log. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/0404_not_found" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[GPTScore: Evaluate as You Desire]]></title>
            <link>https://velog.io/@0404_not_found/GPTScore-Evaluate-as-You-Desire</link>
            <guid>https://velog.io/@0404_not_found/GPTScore-Evaluate-as-You-Desire</guid>
            <pubDate>Fri, 03 Jan 2025 14:31:24 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>GPT</p>
<ul>
<li><p>Analytical AI to Generative AI</p>
</li>
<li><p>large PLM + Prompt $\rightarrow$ superior performance</p>
</li>
<li><p>need to evaluate the quality of these texts</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5f620815-64e3-470f-a01c-1cc7b1bff0eb/image.png" alt=""></p>
<ul>
<li><p>evaluating single aspect</p>
<ul>
<li>hard for users to evaluate aspects as they need</li>
</ul>
</li>
<li><p>multi-aspect evaluation</p>
<ul>
<li><p>lacks the aspects&#39; definition and relationship</p>
</li>
<li><p>empirically bound with metric variants</p>
</li>
</ul>
</li>
<li><p>Needed supervised training and manual annotation</p>
</li>
<li><p>using LLM to achieve multi-aspect, customized and training-free evaluation</p>
<ul>
<li><p>using zero-shot instruction and ICL</p>
</li>
<li><p>higher quality text for a specific aspect will be more likely generated</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1d92146c-1f90-4702-b874-093cb4e73df4/image.png" alt=""></p>
<ul>
<li><p>perform an evaluation as the user desires</p>
<ul>
<li><p>task specification</p>
</li>
<li><p>aspect definition</p>
</li>
<li><p>demonstrated samples</p>
<ul>
<li>well-labeled sample</li>
</ul>
</li>
<li><p>GPT to calculate how likely the text could be generated based on the evaluation protocol</p>
<ul>
<li>GPT2, OPT, T5, GPT3</li>
</ul>
</li>
</ul>
</li>
<li><p>almost all NLG task</p>
<ul>
<li><p>performed well when instructed by the definition of task and aspect</p>
</li>
<li><p>different evaluation aspects exhibit certain correlations</p>
</li>
<li><p>in summarization, data-to-text and dialogue response, GPTScore outperformed fine-tuned models</p>
</li>
<li><p>gpt3-text-davinci-003 (human feedback) is inferior to text-davinci-001</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h4 id="similarity-based-metrics">Similarity-based Metrics</h4>
<ul>
<li><p>lexical overlap-based</p>
<ul>
<li><p>BLEU</p>
</li>
<li><p>ROUGE</p>
</li>
</ul>
</li>
<li><p>embedding-based</p>
<ul>
<li><p>BERTScore</p>
</li>
<li><p>MoverScore</p>
</li>
</ul>
</li>
</ul>
<h4 id="single-aspect-evaluator">Single-aspect Evaluator</h4>
<ul>
<li><p>Coherence of dialogue system</p>
<ul>
<li>DEAM</li>
<li>QuantiDCE</li>
</ul>
</li>
<li><p>Consistency</p>
</li>
</ul>
<h4 id="multi-aspect-evaluator">Multi-aspect Evaluator</h4>
<ul>
<li><p>one evaluator handles several evaluation aspects</p>
<ul>
<li><p>different input and output pair</p>
</li>
<li><p>different prompt by the aspect name</p>
</li>
<li><p>different formulas</p>
</li>
</ul>
</li>
</ul>
<h4 id="emergent-ability">Emergent Ability</h4>
<ul>
<li><p>ICL</p>
</li>
<li><p>CoT Reasoning</p>
</li>
<li><p>Zero-shot instruction</p>
</li>
</ul>
<h4 id="bartscore--vs-gptscore">BARTScore  vs. GPTScore</h4>
<ul>
<li><p>BARTScore needs a fine-tuning step</p>
</li>
<li><p>GPTScore &gt; BARTScore</p>
<ul>
<li><p>customizable</p>
</li>
<li><p>multi-faceted</p>
</li>
<li><p>train-free</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-gptscore">3. GPTScore</h1>
<ul>
<li><p>GPT will assign a higher probability of high-quality text given instruction and context</p>
<ul>
<li><p>$d$ : task description</p>
</li>
<li><p>$a$ : aspect definition</p>
</li>
<li><p>$\bm{h} = \{h_1, h_2, \dots, h_m\}$ : text to be evaluated</p>
</li>
<li><p>$\mathcal{S}$ : context information (source or reference)</p>
</li>
</ul>
</li>
<li><p>$\text{GPTScore}(\bm{h} | d, a, \mathcal{S}) = \sum_{t=1}^m w_t\log p(h_t | \bm{h}_{&lt;t}, T(d, a, \mathcal{S}), \theta )$ (a minimal scoring sketch follows this list)</p>
<ul>
<li><p>$w_t$ : weight of the token at position $t$ (in this work, it is treated equally)</p>
</li>
<li><p>$T$ : prompt template that defines the evaluation protocol</p>
<ul>
<li><p>task-specific</p>
</li>
<li><p>handcrafted with prompt engineering</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Few-shot</p>
<ul>
<li>extending $T$</li>
</ul>
</li>
<li><p>Prompt Template</p>
<ul>
<li><p>officially given by OpenAI (GPT3-based model)</p>
</li>
<li><p>NaturalInstruction (instruction based pre-trained model)</p>
</li>
</ul>
</li>
</ul>
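<p>A minimal sketch of this scoring rule, assuming a HuggingFace causal LM (GPT-2 as a stand-in for the backbones listed above) and a placeholder prompt template rather than the exact templates from the paper:</p>
<pre><code>import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# GPT-2 here is a hypothetical stand-in; the paper uses GPT-2/OPT/FLAN-T5/GPT-3 backbones.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def gpt_score(prompt: str, hypo: str) -> float:
    """Average log-probability of the hypothesis tokens given the evaluation prompt."""
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    hypo_ids = tok(hypo, return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, hypo_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)  # row t predicts token t+1
    n_prompt = prompt_ids.shape[1]
    targets = input_ids[0, n_prompt:]                      # the m hypothesis tokens
    token_lp = log_probs[n_prompt - 1:].gather(1, targets[:, None])
    return token_lp.mean().item()                          # equal weights, w_t = 1/m

# Placeholder template T(d, a, S); the paper uses task-specific, handcrafted templates.
template = "Generate a fluent summary for the given text.\nText: {src}\nSummary:"
print(gpt_score(template.format(src="..."), " A short candidate summary."))
</code></pre>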
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/49209623-b6ac-48ec-bdce-19febf33fb38/image.png" alt=""></p>
<ul>
<li><p>Selection of Scoring Dimension</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$ vs. $p(\text{hypo} | \text{src})$</p>
</li>
<li><p>in GPTScore, it is chosen to align with the protocol of human evaluation</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experimental-settings">4. Experimental Settings</h1>
<h2 id="41-meta-evaluation">4.1 Meta Evaluation</h2>
<ul>
<li><p>how well automated scores correlate with human judgement</p>
<ul>
<li><p>$g(y_{\text{auto}}, y_{\text{human}})$</p>
</li>
<li><p>$g$ : correlation function (Spearman, Pearson)</p>
</li>
</ul>
</li>
</ul>
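<p>Concretely, the meta-evaluation just correlates the two lists of scores (toy numbers, not from the paper):</p>
<pre><code>from scipy.stats import pearsonr, spearmanr

# Toy example: automatic metric scores vs. human ratings for five samples.
y_auto = [0.61, 0.42, 0.88, 0.35, 0.70]
y_human = [3.5, 2.0, 4.5, 2.5, 4.0]

print(spearmanr(y_auto, y_human).correlation)  # rank correlation
print(pearsonr(y_auto, y_human)[0])            # linear correlation
</code></pre>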
<h2 id="42-tasks-datasets-and-aspects">4.2 Tasks, Datasets and Aspects</h2>
<ul>
<li><p>Tasks</p>
<ul>
<li><p>Dialogue Response Generation</p>
<ul>
<li><p>generate an engaging and informative response</p>
</li>
<li><p>FED datasets </p>
</li>
<li><p>turn-level, dialogue-level evaluation</p>
</li>
</ul>
</li>
<li><p>Text Summarization</p>
<ul>
<li><p>SummEval</p>
</li>
<li><p>REALSumm</p>
</li>
<li><p>NEWSROOM</p>
</li>
<li><p>QAGS_XSUM</p>
</li>
</ul>
</li>
<li><p>Data-to-Text</p>
<ul>
<li><p>generate a fluent and factual description for a given table</p>
</li>
<li><p>BAGEL</p>
</li>
<li><p>SFRES</p>
</li>
</ul>
</li>
<li><p>Machine Translation</p>
<ul>
<li>MQM (Multidimensional Quality Metrics) -&gt; MQM-2020 (Ch-&gt;Eng)</li>
</ul>
</li>
</ul>
</li>
<li><p>37 Datasets</p>
</li>
<li><p>22 Evaluation Aspects</p>
</li>
</ul>
<h2 id="43-scoring-models">4.3 Scoring Models</h2>
<ul>
<li><p>ROUGE</p>
<ul>
<li><p>ROUGE-1</p>
</li>
<li><p>ROUGE-2</p>
</li>
<li><p>ROUGE-L</p>
</li>
</ul>
</li>
<li><p>PRISM</p>
</li>
<li><p>BERTScore</p>
</li>
<li><p>MoverScore</p>
</li>
<li><p>DynaEval</p>
<ul>
<li>dialogue response generation tasks on the turn level and dialogue level</li>
</ul>
</li>
<li><p>BARTScore</p>
<ul>
<li><p>scoring model based on BART without finetuning</p>
</li>
<li><p>+CNN (finetuned on the CNNDM dataset)</p>
</li>
<li><p>+CNN +Para(+CNNDM +Paraphrase 2.0)</p>
</li>
</ul>
</li>
<li><p>GPTScore</p>
<ul>
<li><p>19 PLMs backbone</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/93ff4e13-0db7-4fa0-9048-7913d4722995/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="44-scoring-dimension">4.4 Scoring Dimension</h2>
<ul>
<li><p>INT, ENG, SPC, REL, COR, SEM, UND, FLU from FED-Turn</p>
<ul>
<li><p>$p(\text{hypo} | \text{src})$</p>
</li>
<li><p>human data in the dataset</p>
</li>
</ul>
</li>
<li><p>COH, CON, INF from SummEval and Newsroom</p>
<ul>
<li><p>$p(\text{hypo} | \text{src})$</p>
</li>
<li><p>labeled data exists</p>
</li>
</ul>
</li>
<li><p>INF, NAT and FLU from the data-to-text</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$</p>
</li>
<li><p>the source is structured input, not standard text</p>
</li>
</ul>
</li>
<li><p>ACC, FLU, MQM from machine translation</p>
<ul>
<li><p>$p(\text{hypo} | \text{ref})$</p>
</li>
<li><p>source text is in different language</p>
</li>
</ul>
</li>
</ul>
<h2 id="45-evaluation-dataset-construction">4.5 Evaluation Dataset Construction</h2>
<ul>
<li><p>sampled 40 samples for each summarization dataset</p>
</li>
<li><p>sampled 100 samples for dialogue response generation and data-to-text</p>
</li>
</ul>
<h1 id="5-experiment-results">5. Experiment Results</h1>
<ul>
<li><p>three scenarios</p>
<ul>
<li><p>Vanilla : non-instruction and non-demonstration</p>
</li>
<li><p>IST : instruction only</p>
</li>
<li><p>IDM : instruction + demonstration</p>
</li>
</ul>
</li>
<li><p>Significance Tests</p>
<ul>
<li><p>based on bootstrapping</p>
<ul>
<li><p>IST or IDM &gt; VAL</p>
</li>
<li><p>IDM &gt; IST</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="51-text-summarization">5.1 Text Summarization</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b4a02b66-72f0-4503-a98b-558654061980/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d42832cc-de5f-4908-98da-5c79141a5ddb/image.png" alt=""></p>
<ul>
<li><p>Evaluator with instruction significantly improves the performance</p>
</li>
<li><p>GPT3 / FT5 based models + instructions &gt; supervised method</p>
</li>
</ul>
<h2 id="52-data-to-text">5.2 Data to Text</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1166de8d-3597-4157-9f6c-c1ffbeeba121/image.png" alt=""></p>
<ul>
<li><p>IDM &gt; IST &gt; VAL</p>
</li>
<li><p>IDM &gt; finetuned model</p>
</li>
<li><p>the choice of examples impacts the performance a lot</p>
</li>
<li><p>IDM + GPT3 small size family &gt; large sized model</p>
</li>
</ul>
<h2 id="53-dialogue-response-generation">5.3 Dialogue Response Generation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/940b8150-8bd5-4af6-82ef-48f7b1c683f2/image.png" alt=""></p>
<ul>
<li><p>GPT3-d01 &gt;&gt; GPT3-d03</p>
</li>
<li><p>GPT3-based models demonstrate stronger generalization ability</p>
</li>
</ul>
<h2 id="54-machine-translation">5.4 Machine Translation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/949d1c01-b198-47e1-89d6-bcbe5ccddfad/image.png" alt=""></p>
<ul>
<li><p>IST improved the performance</p>
</li>
<li><p>IDM &gt; IST</p>
</li>
<li><p>GPT3-c01 achieved comparable performance with d01 and d03</p>
</li>
</ul>
<h1 id="6-ablation-study">6. Ablation Study</h1>
<h2 id="61-effectiveness-of-demonstration">6.1 Effectiveness of Demonstration</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b00a7ce3-0ca4-4f1b-8223-a33fd7b6c5c8/image.png" alt=""></p>
<ul>
<li><p>demonstration  improves the performance</p>
</li>
<li><p>there is an upper bound on the performance gains</p>
</li>
<li><p>if there are only a few samples, small models are prone to performance degradation</p>
</li>
</ul>
<h2 id="62-partial-order-of-evaluation-aspect">6.2 Partial Order of Evaluation Aspect</h2>
<ul>
<li><p>tested INT as the target aspect</p>
<ul>
<li><p>combined other aspects with the definition of INT</p>
</li>
<li><p>GPT3-c01 6.7B</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/a52d25c9-63da-4589-83fd-2db5fdb38dd7/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c877b398-0000-4c95-8f2e-4215c903ea5f/image.png" alt=""></p>
<ul>
<li>By combining definitions with other highly correlated aspects, smaller model outperformed the bigger model</li>
</ul>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>customizable, multi-faceted, training-free evaluation framework using the emergent abilities of LLMs</li>
</ul>
<h1 id="8-limitation">8. Limitation</h1>
<ul>
<li><p>GPT-3.5 and GPT-4 are not included</p>
</li>
<li><p>the reason why d03 is worse than d01 is unclear as it is not open source</p>
</li>
<li><p>API cost issue</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[SELF-EXPERTISE: Knowledge-Based Instruction Dataset Augmentation for a Legal Expert Language Model]]></title>
            <link>https://velog.io/@0404_not_found/SELF-EXPERTISE-Knowledge-Based-Instruction-Dataset-Augmentation-for-a-Legal-Expert-Language-Model</link>
            <guid>https://velog.io/@0404_not_found/SELF-EXPERTISE-Knowledge-Based-Instruction-Dataset-Augmentation-for-a-Legal-Expert-Language-Model</guid>
            <pubDate>Fri, 03 Jan 2025 12:08:28 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Instruction Tuning Dataset</p>
<ul>
<li><p>Instruction Tuning is important for LLMs</p>
</li>
<li><p>Auto generation method is unsuitable for some domains where the accuracy is important</p>
</li>
</ul>
</li>
<li><p>SELF-EXPERTISE</p>
<ul>
<li><p>automatic instruction data generation for knowledge-intensive tasks</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/12c8dc60-cc60-49a5-b219-a92d84cc3e40/image.png" alt=""></p>
</li>
<li><p>19k instructions generated from the 980-example seed dataset</p>
</li>
<li><p>LxPERT : LLaMA-2-7B + SELF-EXPERTISE</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/1c4b4be8-7bc0-47ba-9ef1-0b80cbb1f69f/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h2 id="21-llm-based-instruction-dataset-augmentation">2.1 LLM-based Instruction Dataset Augmentation</h2>
<ul>
<li><p>Generate instruction dataset using LLMs</p>
</li>
<li><p>Self-Instruct $\rightarrow$ Prone to hallucination</p>
</li>
</ul>
<h2 id="22-knowledge-intensive-tasks">2.2 Knowledge-Intensive Tasks</h2>
<ul>
<li><p>requires knowledge-based solution</p>
<ul>
<li><p>legal domain is knowledge intensive</p>
</li>
<li><p>RAG</p>
</li>
<li><p>SELF-EXPERTISE generates instructions and outputs based on precise external knowledge</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-defining-instruction-data">3.1 Defining Instruction Data</h2>
<ul>
<li><p>typical instruction dataset</p>
<ul>
<li><p>(instruction, input, output) triplet + system instructions</p>
</li>
<li><p>to facilitate reasoning and narrative structure</p>
</li>
<li><p>input is optional</p>
</li>
</ul>
</li>
</ul>
<h2 id="32-self-expertise">3.2 SELF-EXPERTISE</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/50260467-b472-48f3-a23b-7167104a2503/image.png" alt=""></p>
<h4 id="knowledge-extraction-based-on-output">Knowledge Extraction based on Output</h4>
<ul>
<li><p>knowledge is extracted from the outputs of a small set of expert-written seed data</p>
</li>
<li><p>generates new user instructions and outputs with external knowledge</p>
</li>
<li><p>lawyer&#39;s argument (output) + case law (external data)</p>
</li>
</ul>
<h4 id="generation-of-user-instruction-and-input-based-on-knowledge">Generation of User Instruction and Input Based on Knowledge</h4>
<ul>
<li><p>analogous to how teachers create exam questions based on textbook</p>
</li>
<li><p>generates exam questions and context</p>
</li>
</ul>
<h4 id="system-instructions">System Instructions</h4>
<ul>
<li><p>handcrafted</p>
</li>
<li><p>wrote 8 system instructions</p>
</li>
<li><p>they differ to allow the creation of outputs in various manners, lengths, formats</p>
</li>
</ul>
<h4 id="output-generation-based-on-previous-results">Output Generation Based on Previous Results</h4>
<ul>
<li><p>Use all knowledge, system instructions, instructions and input to generate output from LLM</p>
</li>
<li><p>8 outputs for each user instruction and input pair</p>
</li>
</ul>
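<p>A rough sketch of the four-step flow. The <code>call_llm</code> helper, the prompt wordings, and splitting Step 2 into two calls are placeholders, not the actual prompts from the paper:</p>
<pre><code># Placeholder for GPT-3.5 / GPT-4 chat-completion calls; prompts are illustrative only.
def call_llm(prompt: str) -> str:
    raise NotImplementedError

SYSTEM_INSTRUCTIONS = ["Answer as a concise legal memo.", "..."]  # 8 handcrafted variants

def augment(seed_output: str, case_law: str) -> list:
    # Step 1: extract objective legal knowledge from an expert-written output + external source
    knowledge = call_llm(
        "Extract the legal knowledge underlying this answer.\n"
        f"Answer: {seed_output}\nCase law: {case_law}")
    # Step 2: write a new user instruction and input, like a teacher writing exam questions
    instruction = call_llm(f"Write an exam-style legal question based on:\n{knowledge}")
    user_input = call_llm(f"Write the context the question refers to:\n{knowledge}")
    # Steps 3-4: generate one output per system instruction (8 per instruction/input pair)
    return [{"system": s, "instruction": instruction, "input": user_input,
             "output": call_llm(f"{s}\n{knowledge}\n{instruction}\n{user_input}")}
            for s in SYSTEM_INSTRUCTIONS]
</code></pre>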
<h2 id="33-finetuning-the-sllm-using-augmented-instruction-dataset">3.3 Finetuning the sLLM using Augmented Instruction Dataset</h2>
<ul>
<li><p>Similar to knowledge distillation</p>
</li>
<li><p>distills domain knowledge</p>
</li>
<li><p>sLLM is trained to generate responses based on indirectly learned knowledge</p>
</li>
<li><p>learns 8 types of system instructions and corresponding output forms</p>
</li>
</ul>
<h1 id="4-legal-self-expertise-data">4. Legal SELF-EXPERTISE Data</h1>
<h2 id="41-seed-dataset">4.1 Seed Dataset</h2>
<ul>
<li><p>980 legal seed instructions written by legal experts</p>
<ul>
<li><p>560 legal cases + 916 clauses</p>
</li>
<li><p>civil law, local law, legislative information and legal consultation</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/9a48d5ae-0574-42ac-b95a-9d4f1a1ab0c6/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="42-data-generation-details">4.2 Data Generation Details</h2>
<ul>
<li><p>GPT-3.5-turbo in Step 1</p>
</li>
<li><p>GPT-4-preview-1106 for Step 2 and 4</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/bee039f6-3799-4fcc-93ba-e88476be26bb/image.png" alt=""></p>
</li>
</ul>
<h2 id="43-diversity">4.3 Diversity</h2>
<ul>
<li><p>compared the lengths of the user instruction, input and output</p>
</li>
<li><p>more even than Self-Instruct</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/b1dd91e8-08db-4271-84fd-1f5d28ae86a0/image.png" alt=""></p>
</li>
<li><p>various system instructions worked well for the diversity</p>
</li>
<li><p>extracting objective knowledge from outputs will help model not be limited to particular situations</p>
<ul>
<li><p>compared the generated instructions for 200 seeds</p>
</li>
<li><p>BERT-score to calculate similarity</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/c9d7c3d2-0277-405a-aee9-909431ebd8a9/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="44-quality">4.4 Quality</h2>
<ul>
<li><p>human evaluation for 100 random samples</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/aefae21d-6fbc-4a0e-b3dd-dc378da43233/image.png" alt=""></p>
</li>
</ul>
<h1 id="5-experimental-setup">5. Experimental Setup</h1>
<h2 id="51-training-details">5.1 Training Details</h2>
<ul>
<li><p>LLaMA-2-ko 7B</p>
</li>
<li><p>SELF-EXPERTISE dataset</p>
</li>
<li><p>3 epochs, AdamW, lr 2e-5, batch 1 per device, max len 1024</p>
</li>
<li><p>A100 80G</p>
</li>
</ul>
<h2 id="52-baselines">5.2 Baselines</h2>
<ul>
<li><p>Foundation Models</p>
<ul>
<li>LLaMA-2 7B and LLaMA-2-ko 7B</li>
</ul>
</li>
<li><p>Instruction Tuned Models</p>
<ul>
<li>LLaMA-2-chat 7B and LLaMA-2-ko-chat 7B</li>
</ul>
</li>
<li><p>GPT</p>
<ul>
<li>GPT-3.5-turbo</li>
</ul>
</li>
<li><p>Instruction-tuned Models in Legal Domain</p>
<ul>
<li><p>SELF-EXPERTISE tuned LLaMA-2-ko 7B</p>
</li>
<li><p>seed dataset tuned LLaMA-2-ko 7B</p>
</li>
<li><p>legal domain dataset augmented by Self-Instruct </p>
</li>
</ul>
</li>
</ul>
<h2 id="53-evaluation-dataset">5.3 Evaluation Dataset</h2>
<ul>
<li><p>In-domain Dataset</p>
<ul>
<li><p>legal experts created a new dataset related to the same 4 domains as the seed dataset</p>
</li>
<li><p>200 pairs</p>
</li>
</ul>
</li>
<li><p>Out of domain Dataset</p>
<ul>
<li><p>100 QA pairs from easylaw.go.kr</p>
</li>
<li><p>selected questions that require knowledge not in the seed data</p>
</li>
</ul>
</li>
</ul>
<h2 id="54-evaluation-settings">5.4 Evaluation Settings</h2>
<ul>
<li><p>GPT-4 Evaluation</p>
</li>
<li><p>Human Evaluation</p>
<ul>
<li>5-point Likert scale (accuracy, fluency)</li>
</ul>
</li>
</ul>
<h1 id="6-results">6. Results</h1>
<h2 id="61-evaluation-on-in-domain-data">6.1 Evaluation on In-domain Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/db9eb6cb-4441-4d30-ab58-3652110aa714/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/073336fb-dfec-4c5e-a979-cae91ff7bbe3/image.png" alt=""></p>
<h2 id="62-evaluation-on-out-of-domain-data">6.2 Evaluation on Out-of-domain Data</h2>
<ul>
<li>Seed dataset tuned model performance noticeably dropped</li>
</ul>
<h2 id="63-quality-of-answers-relative-to-the-amount-of-training-data">6.3 Quality of Answers Relative to the Amount of Training Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/b1de4282-63e7-4c28-b34b-037404f22b4c/image.png" alt=""></p>
<ul>
<li><p>Excessive augmentation of data from the same seed dataset leads to overfitting on specific knowledge</p>
</li>
<li><p>Expanding the seed data and adding a general-domain dataset would help.</p>
</li>
</ul>
<h1 id="7-discussion">7. Discussion</h1>
<ul>
<li><p>ability to follow instructions</p>
</li>
<li><p>legal domain knowledge</p>
</li>
<li><p>still prone to make errors</p>
</li>
</ul>
<h1 id="8-conclusion">8. Conclusion</h1>
<ul>
<li><p>automatically generating instruction dataset in specialized domain</p>
</li>
<li><p>can be extended to generate instruction datasets in other domain</p>
</li>
</ul>
<h1 id="comment">Comment</h1>
<ul>
<li><p>Extract knowledge and use it</p>
</li>
<li><p>Keep augmentation moderate, in proportion to the seed data</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[A Multi-Task Benchmark for Korean Legal Language Understanding and Judgement Prediction]]></title>
            <link>https://velog.io/@0404_not_found/A-Multi-Task-Benchmark-for-Korean-Legal-Langhage-Understanding-and-Judgement-Prediction</link>
            <guid>https://velog.io/@0404_not_found/A-Multi-Task-Benchmark-for-Korean-Legal-Langhage-Understanding-and-Judgement-Prediction</guid>
            <pubDate>Thu, 02 Jan 2025 16:13:20 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Previous Legal Expert Systems</p>
<ul>
<li>Useful in certain areas</li>
</ul>
</li>
<li><p>Deep Learning Based Approach</p>
<ul>
<li><p>Legal Judgement Prediction</p>
</li>
<li><p>Legal Content Generation</p>
</li>
<li><p>Legal Text Classification</p>
</li>
<li><p>Legal Event Detection</p>
</li>
<li><p>Legal Information Extraction</p>
</li>
<li><p>Legal Contract Review and QA</p>
</li>
</ul>
</li>
<li><p>LBOX</p>
<ul>
<li><p>Large Scale Korean legal AI benchmark</p>
<ul>
<li><p>precedent corpus</p>
</li>
<li><p>classification tasks (Case Name, Statute)</p>
</li>
<li><p>judgement prediction task (LJP-Criminal, Civil)</p>
</li>
<li><p>summarization task</p>
</li>
</ul>
</li>
<li><p>pre-trained $\rightarrow$ LCUBE (decoder only, based on GPT-2)</p>
<ul>
<li>doesn&#39;t have advantage on summarization task</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-background">2. Background</h1>
<h2 id="21-korean-legal-system">2.1 Korean Legal System</h2>
<ul>
<li><p>Three-tiered (District, High and the Supreme Court)</p>
</li>
<li><p>rooted in civil law system (vs. common law system)</p>
</li>
</ul>
<h2 id="22-korean-precedent">2.2 Korean Precedent</h2>
<ul>
<li><p>Structure of Korean Precedent</p>
<ul>
<li><p>meta information</p>
</li>
<li><p>gist of claim from plaintiffs in a civil case</p>
</li>
<li><p>ruling</p>
</li>
<li><p>reasoning</p>
<ul>
<li><p>facts</p>
</li>
<li><p>claims</p>
</li>
<li><p>reasoning</p>
</li>
<li><p>decisions</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>The Redaction Process</p>
<ul>
<li>Anonymizing</li>
</ul>
</li>
<li><p>Precedent Disclosure Status</p>
<ul>
<li>Courts&#39; decisions should be published via an online service</li>
</ul>
</li>
</ul>
<h1 id="3-lbox-open-datasets">3. LBOX Open Datasets</h1>
<h2 id="31-structuring-raw-data">3.1 Structuring Raw Data</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4a5aec20-198c-4636-afa8-308ee49eb278/image.png" alt=""></p>
<ul>
<li><p>Document Images and PDF precedents are available</p>
</li>
<li><p>Preprocessing pipeline</p>
<ul>
<li><p>Layout Classifier (based on ResNet)</p>
</li>
<li><p>Layout Parser (based on Mask-R-CNN)</p>
</li>
<li><p>OCR</p>
</li>
<li><p>Custom Language Model to correct OCR errors</p>
</li>
<li><p>Human annotation for low-confidence instances</p>
</li>
</ul>
</li>
<li><p>JSON format</p>
<ul>
<li><p>meta information</p>
</li>
<li><p>ruling</p>
</li>
<li><p>gist of claim</p>
</li>
<li><p>appeal</p>
</li>
<li><p>reasoning</p>
</li>
</ul>
</li>
</ul>
<h2 id="32-datasets">3.2 Datasets</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/73801f15-cd95-4e1d-9264-08c637536241/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6fca3f23-263f-44b1-8c9c-cc7b0775c99c/image.png" alt=""></p>
<ul>
<li><p>Precedent Corpus</p>
<ul>
<li><p>AI Hub 6k + LAW OPEN DATA 82k + Internal 65k</p>
</li>
<li><p>57% of LAW OPEN DATA consist of the trials of the Supreme Court (no factual issues)</p>
</li>
</ul>
</li>
<li><p>Case Name</p>
<ul>
<li>10k facts + case name</li>
</ul>
</li>
<li><p>Statute</p>
<ul>
<li>facts + statute</li>
</ul>
</li>
<li><p>LJP-Criminal</p>
<ul>
<li><p>facts + punishments(fine, imprisonment with labor, imprisonment without labor)</p>
</li>
<li><p>Level 0 (type of punishment)</p>
</li>
<li><p>Level 1 (degree of punishment in 3-scale, null/low/high) </p>
</li>
<li><p>Level 2 (5-scale for fine, 6-scale for imprisonment) </p>
</li>
<li><p>Level 3 (exact number) $\rightarrow$ Regression!</p>
</li>
</ul>
</li>
<li><p>LJP-Civil</p>
<ul>
<li><p>fact + gist of claim + degrees of claim acceptance</p>
</li>
<li><p>claim acceptance degree</p>
<ul>
<li><p>claimed money from the gist of claim</p>
</li>
<li><p>approved money from ruling section</p>
</li>
<li><p>approved money / claimed money</p>
</li>
</ul>
</li>
<li><p>Level 1 (rejection / partial approval / full approval; a small worked example follows this list)</p>
</li>
<li><p>Level 2 (13 categories)</p>
</li>
<li><p>mt5-small + prompt-tuning for parsing expression (money provider / receiver / amount / litigation cost)</p>
</li>
</ul>
</li>
<li><p>Summarization</p>
<ul>
<li><p>Supreme Court Decisions Report + Summary of Decision</p>
</li>
<li><p>Ruling and Reasoning section</p>
</li>
</ul>
</li>
</ul>
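<p>The LJP-Civil acceptance degree above reduces to a simple ratio; a small worked example, where the Level 1 cut-offs (zero / equal to the claim / anything in between) are the obvious ones rather than quoted from the paper:</p>
<pre><code>def claim_acceptance(claimed: int, approved: int):
    """Degree = approved money (from the ruling) / claimed money (from the gist of claim)."""
    degree = approved / claimed
    if approved == 0:
        level1 = "rejection"
    elif approved == claimed:
        level1 = "full approval"
    else:
        level1 = "partial approval"
    return degree, level1

print(claim_acceptance(10_000_000, 4_000_000))  # (0.4, 'partial approval')
</code></pre>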
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-model-training">4.1 Model Training</h2>
<ul>
<li><p>Nvidia A6000, RTX3090 or RTX6000</p>
</li>
<li><p>lr 3e-5 to 1e-4</p>
</li>
<li><p>batch 8 to 60, AdamW</p>
</li>
<li><p>finetuning experiments with error bars were repeated 3 times</p>
</li>
<li><p>google/mt5-small for fine-tuning</p>
</li>
<li><p>GPT-2 from scratch (LCUBE), Modu and Wiki corpora</p>
</li>
<li><p>byte-level BPE</p>
</li>
<li><p>50K for base and 100K for medium</p>
</li>
<li><p>compared KoGPT2 and LCUBE</p>
</li>
</ul>
<h2 id="42-task-setting">4.2 Task Setting</h2>
<ul>
<li>all tasks are cast as text generation</li>
</ul>
<h2 id="43-metric">4.3 Metric</h2>
<ul>
<li><p>Case Name, Statute, LJP-Civil : Exact Match</p>
</li>
<li><p>LJP-Criminal : F1 of individual fields</p>
</li>
</ul>
<h1 id="5-results">5. Results</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/062406cf-2bd6-487b-b88d-c5758d959714/image.png" alt=""></p>
<ul>
<li><p>Domain specific corpus is critical in the classification and the summarization tasks</p>
<ul>
<li><p>pretraining with the Precedent Corpus only also performed well in domain adaptation</p>
</li>
<li><p>in summarization task, LCUBE doesn&#39;t have an advantage over other models</p>
<ul>
<li><p>this might be from the architecture difference between encoder-decoder model and decoder only model</p>
</li>
<li><p>LCUBE generated ~40% fewer tokens $\rightarrow$ ROUGE score is low</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Domain adaptation is not helpful on legal judgement prediction tasks</p>
<ul>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/50269054-ae65-464c-b8df-325a97aa3870/image.png" alt=""></p>
</li>
<li><p>In LJP-Civil, without the facts, the model performance is close to a dummy baseline</p>
</li>
</ul>
</li>
<li><p>Legal judgement prediction is challenging</p>
<ul>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/6336a370-d495-4a05-922c-85d819442e32/image.png" alt=""></p>
</li>
<li><p>There is no one superior model</p>
</li>
</ul>
</li>
</ul>
<h1 id="6-conclusion">6. Conclusion</h1>
<ul>
<li><p>the first large-scale Korean legal AI benchmark and legal language model LCUBE</p>
</li>
<li><p>only considered precedents from the first level courts</p>
<ul>
<li>for simplicity in legal reasoning</li>
</ul>
</li>
<li><p>didn&#39;t use the plaintiffs&#39; and defendants&#39; claims</p>
</li>
<li><p>difficult to separate the claims from reasoning sections without error</p>
</li>
<li><p>didn&#39;t consider many important legal applications of AI</p>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks]]></title>
            <link>https://velog.io/@0404_not_found/Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasns</link>
            <guid>https://velog.io/@0404_not_found/Retrieval-Augmented-Generation-for-Knowledge-Intensive-NLP-Tasns</guid>
            <pubDate>Thu, 02 Jan 2025 13:47:47 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>PLMs learn a substantial amount of in-depth knowledge from data</p>
<ul>
<li><p>it can&#39;t expand or revise their memory</p>
</li>
<li><p>can&#39;t straightforwardly provide insight into their predictions</p>
</li>
<li><p>hallucination</p>
</li>
</ul>
</li>
<li><p>Hybrid Models (REALM, ORQA)</p>
<ul>
<li><p>parametric + non-parametric (retrieval-based)</p>
</li>
<li><p>seq2seq transformer + vector index + pre-trained neural retriever $\rightarrow$ RAG</p>
</li>
<li><p>per-sequence bases vs. per-token basis</p>
</li>
<li><p>This can be fine-tuned on any seq2seq task (generator and retriever are jointly learned)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8604b725-e408-4bc3-8659-1fcd7b7072fd/image.png" alt=""></p>
<ul>
<li><p>Enrich systems with non-parametric memory</p>
<ul>
<li><p>parametric and non-parametric components are pretrained and pre-loaded</p>
</li>
<li><p>using pre-trained access mechanisms, accessing knowledge without additional training is possible</p>
</li>
</ul>
</li>
<li><p>Works well with <strong>Knowledge-Intensive Tasks</strong> </p>
<ul>
<li>Humans could not reasonably be expected to perform without access to an external knowledge source</li>
</ul>
</li>
</ul>
<h1 id="2-methods">2. Methods</h1>
<ul>
<li><p>$x$ (input sequence) $\rightarrow$ $z$ (text documents) $\rightarrow$ $y$ (target sequence)</p>
<ul>
<li><p>$p_{\eta} (z | x)$ : retriever (returns top-K distributions)</p>
</li>
<li><p>$p_{\theta}(y_i | x, z, y_{1:i-1})$ : generator</p>
</li>
<li><p>$z$ as a latent variable</p>
</li>
</ul>
</li>
</ul>
<h2 id="21-models">2.1 Models</h2>
<h4 id="rag-sequence">RAG-Sequence</h4>
<ul>
<li><p>$p_{\text{RAG-Sequence}} (y|x) = \displaystyle \sum_{z \in \text{top-}k(p(\cdot|x))} p_{\eta}(z|x) p_{\theta}(y | x, z)$</p>
</li>
<li><p>uses the same retrieved document to generate the complete sequence</p>
</li>
</ul>
<h4 id="rag-token">RAG-Token</h4>
<ul>
<li><p>$p_{\text{RAG-Token}} (y|x) = \displaystyle \prod_{i}^{N} \sum_{z \in \text{top-}k(p(\cdot|x))} p_{\eta}(z|x) p_{\theta}(y_i | x, z_i, y_{1:i-1})$</p>
</li>
<li><p>draw a different latent document for each target token</p>
</li>
<li><p>generator to choose content from several documents when producing an answer</p>
</li>
<li><p>computes a distribution for the next output token for each document</p>
</li>
<li><p>used for sequence classification $\rightarrow$ target class as a length-one sequence</p>
</li>
</ul>
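<p>A minimal NumPy sketch of the two marginalizations, assuming the retriever distribution and the per-document token probabilities have already been computed:</p>
<pre><code>import numpy as np

# Toy setup: K=3 retrieved documents, N=4 target tokens (probabilities are made up).
p_z = np.array([0.5, 0.3, 0.2])      # p_eta(z|x) over the top-K documents
p_tok = np.random.rand(3, 4)         # p_theta(y_i | x, z, y_1:i-1), shape (K, N)

# RAG-Sequence: marginalize once over whole-sequence probabilities.
p_seq_per_doc = p_tok.prod(axis=1)   # p_theta(y | x, z) for each document
rag_sequence = float(p_z @ p_seq_per_doc)

# RAG-Token: marginalize per token, then take the product over positions.
per_token_mix = p_z @ p_tok          # shape (N,): sum over z of p(z|x) * p(y_i | x, z, ...)
rag_token = float(per_token_mix.prod())
</code></pre>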
<h2 id="22-retriever-dpr">2.2 Retriever: DPR</h2>
<ul>
<li><p>$p_{\eta} (z | x) \propto \exp({\mathbf{d}(z)^{\top}}\mathbf{q}(x))$</p>
</li>
<li><p>used BERT as $\mathbf{d}(z)$ and $\mathbf{q}(x)$</p>
</li>
<li><p>MIPS: Maximum Inner Product Search Problem</p>
</li>
<li><p>document index: non parametric memory</p>
</li>
</ul>
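<p>The retrieval score itself is an inner product between the two BERT encodings; a sketch with random stand-ins for the encoder outputs:</p>
<pre><code>import numpy as np

def dpr_retrieve(q_vec: np.ndarray, doc_vecs: np.ndarray, k: int = 5):
    """p_eta(z|x) is proportional to exp(d(z)^T q(x)); return the top-k documents (MIPS)."""
    scores = doc_vecs @ q_vec                    # inner products d(z)^T q(x)
    top = np.argsort(-scores)[:k]                # exact search; FAISS/HNSW approximates this
    weights = np.exp(scores[top] - scores[top].max())
    return top, weights / weights.sum()          # p(z|x) renormalized over the top-k docs

# q_vec = BERT_q(x), doc_vecs = stacked BERT_d(z); random stand-ins here.
top_docs, p_z = dpr_retrieve(np.random.rand(768), np.random.rand(1000, 768))
</code></pre>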
<h2 id="23-generator-bart">2.3 Generator: BART</h2>
<ul>
<li><p>BART-Large 400M (seq2seq transformer)</p>
</li>
<li><p>simply concatenate $z$ and $x$</p>
</li>
</ul>
<h2 id="24-training">2.4 Training</h2>
<ul>
<li><p>jointly train retriever and generator without any direct supervision on the document</p>
</li>
<li><p>NLL Loss, Adam, SGD</p>
</li>
<li><p>only trained query encoder and generator</p>
</li>
</ul>
<h2 id="25-decoding">2.5 Decoding</h2>
<ul>
<li><p>RAG-Token uses standard beam-decoder</p>
</li>
<li><p>RAG-Sequence performs beam-search for each document</p>
</li>
<li><p>Thorough Decoding vs. Fast Decoding</p>
</li>
</ul>
<h1 id="3-experiments">3. Experiments</h1>
<ul>
<li><p>Wikipedia as document index (100 token chunk, 21M documents)</p>
</li>
<li><p>FAISS, HNSW</p>
</li>
<li><p>k = 5 or 10</p>
</li>
<li><p>Open Domain QA, Abstractive QA, Jeopardy QA (non-standard QA format, fact to entity), Fact Verification (retrieve from Wikipedia and reason whether the given claim is true)</p>
</li>
<li><p>Natural Questions / TriviaQA / WebQuestions / CuratedTrec $\rightarrow$ Exact Match Scores</p>
</li>
<li><p>MSMARCO NLG task v2.1 (only question and answer)</p>
</li>
<li><p>SearchQA $\rightarrow$ SQuAD-tuned Q-BLEU-1</p>
</li>
<li><p>FEVER $\rightarrow$ label accuracy</p>
</li>
</ul>
<h1 id="4-results">4. Results</h1>
<h4 id="open-domain-qa">Open Domain QA</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ac525e14-efad-480b-9465-a019018735a7/image.png" alt=""></p>
<ul>
<li><p>Extract &lt; Generate</p>
<ul>
<li>document with only clue not the exact answer</li>
</ul>
</li>
</ul>
<h4 id="abstractive-qa">Abstractive QA</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/133ef9ff-8ff9-4449-b259-b59cb3b97754/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/18ffc61d-b885-4fa8-9c65-267a4fbb5eb3/image.png" alt=""></p>
<ul>
<li><p>RAG is more diverse than BART, less hallucinative</p>
</li>
<li><p>SotA models access gold passages while RAG is not</p>
</li>
<li><p>many questions are unanswerable without gold passages</p>
</li>
<li><p>not all questions are answerable from Wikipedia alone</p>
</li>
</ul>
<h4 id="jeopardy-question-generation">Jeopardy Question Generation</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/222c7a5e-531a-44a4-854d-0ac89dd38671/image.png" alt=""></p>
<ul>
<li>RAG-Token can perform well as it uses multiple documents</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aa8c0770-ca95-4270-b1fe-e1211ed5a97e/image.png" alt=""></p>
<ul>
<li>parametric and non-parametric memory work together</li>
</ul>
<h4 id="fact-verification">Fact Verification</h4>
<ul>
<li>document retrieved by RAG is the gold evidence in FEVER</li>
</ul>
<h4 id="additional-reuslts">Additional Reuslts</h4>
<ul>
<li><p>Diversity</p>
<ul>
<li><img src="https://velog.velcdn.com/images/0404_not_found/post/be2d434c-a5cd-4984-aa47-329fa25367f7/image.png" alt=""></li>
</ul>
</li>
</ul>
<ul>
<li><p>Retrieval Ablations</p>
<ul>
<li><img src="https://velog.velcdn.com/images/0404_not_found/post/053c9432-e2d3-4a4e-b15f-9ba832359a82/image.png" alt=""></li>
</ul>
</li>
<li><p>Index hot-swapping</p>
<ul>
<li>Changed from Wikipedia 2018 to DrQA Wikipedia dump </li>
</ul>
</li>
<li><p>Retrieving more documents</p>
<ul>
<li><p>didn&#39;t observe significant differences in performance</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/e7500080-d6e0-466d-8a60-dacc49975131/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="5-discussion">5. Discussion</h1>
<ul>
<li>Hybrid generation models with access to parametric and non-parametric memory</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Computational analysis of 140 years of US political speeches reveals more positive but increasingly polarized framing of immigration]]></title>
            <link>https://velog.io/@0404_not_found/Computational-analysis-of-140-years-of-US-political-speeches-reveals-more-positive-but-increasingly-polarized-framing-of-immigration</link>
            <guid>https://velog.io/@0404_not_found/Computational-analysis-of-140-years-of-US-political-speeches-reveals-more-positive-but-increasingly-polarized-framing-of-immigration</guid>
            <pubDate>Mon, 27 May 2024 06:52:46 GMT</pubDate>
            <description><![CDATA[<h1 id="abstract">Abstract</h1>
<ul>
<li><p>200K US congressional speeches + 5K presidential communications related to immigration from 1880 to the present</p>
</li>
<li><p>political speech about immigration is much more positive on average than the past</p>
<ul>
<li><p>shift largely between WW2 and the passage of Immigration and Nationality Act in 1965</p>
</li>
<li><p>since the late 1970s, political parties become polarized</p>
</li>
</ul>
</li>
<li><p>contextual embeddings of text</p>
<ul>
<li>modern Republicans $\rightarrow$ suggestive of metaphors long associated with immigration (animals, cargo) and frames like &quot;crimes&quot; and &quot;legality&quot;</li>
</ul>
</li>
<li><p>nationality mentioned changed the tone of speeches (Mexican, Chinese) $\rightarrow$ still a major factor in how immigrants are spoken of in Congress</p>
</li>
</ul>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Recently, the attitude toward immigration has become more negative than ever before</p>
<ul>
<li><p>anti-Chinese fearmongering in 1880s</p>
</li>
<li><p>Southern and Eastern European immigrants in 1920s</p>
</li>
<li><p>antiimmigration rhetoric of Trump (2017 to 2020)</p>
<p>$\rightarrow$ &quot;Certain types of immigrants can never truly join American society?&quot;</p>
</li>
</ul>
</li>
<li><p>how have attitudes toward immigrants in US changed over the past century?</p>
<ul>
<li><p>public opinion poll began in the 1960s</p>
</li>
<li><p>turned to Congressional Record</p>
</li>
</ul>
</li>
<li><p>Corpus</p>
<ul>
<li><p>full corpus of more than 17M congressional speeches from 1880 to present</p>
</li>
<li><p>200K speeches relevant to immigration</p>
</li>
<li><p>presidential communications</p>
</li>
<li><p>quantitative analysis</p>
</li>
</ul>
</li>
<li><p>Related works</p>
<ul>
<li><p>qualitative approaches and historical archives</p>
</li>
<li><p>quantitative work on immigration used migration and census records</p>
</li>
<li><p>Rhetorical aspects of immigration debates $\rightarrow$ dehumanizing language (vermin and cargo) with qualitative analysis</p>
</li>
<li><p>NLP methods to cover in news media and Congress $\rightarrow$ not a long time span / not a comprehensive corpus with a consistent methodology</p>
</li>
</ul>
</li>
<li><p>Methods</p>
<ul>
<li><p>identify relevant speeches</p>
<ul>
<li>automated text classification based on extensive human annotations</li>
</ul>
</li>
<li><p>curated and applied a set of lexicons for analyzing relevant frames with semi-automated method</p>
</li>
<li><p>neural contextual embedding models to quantify implicit dehumanizing metaphors</p>
</li>
</ul>
</li>
<li><p>Brief results and discussion</p>
<ul>
<li><p>political speeches about immigration today are more positive than the past</p>
<ul>
<li>the shift between WW2 and 1965 Immigration and Nationality Act</li>
</ul>
</li>
<li><p>being net positive on average since early 1950s</p>
</li>
<li><p>Trump is the first president to express sentiment toward immigration more negative than the average member of his own party</p>
</li>
<li><p>two parties have become increasingly polarized over time</p>
<ul>
<li>linear increase in polarization on immigration since the late 1970s</li>
</ul>
</li>
<li><p>today, Democrats are unprecedentedly positive</p>
</li>
<li><p>this predates the generic political polarization observed in Gentzkow et al. by more than a decade</p>
</li>
<li><p>nationality of immigrants continues to matter greatly</p>
<ul>
<li><p>Mexican $\rightarrow$ more negative than European</p>
</li>
<li><p>Mexican framed today is similar to Chinese framed during Chinese exclusion in 19th century</p>
</li>
<li><p>negative frame &quot;crime&quot;, &quot;labor&quot;, &quot;legality&quot; + dehumanizing metaphors</p>
</li>
</ul>
</li>
<li><p>there remains a strong and growing strain of antiimmigration speech among Republicans</p>
<ul>
<li><p>expressed opinions toward immigrants still vary greatly by country of origin </p>
</li>
<li><p>rhetorical strategies continue to be deployed</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-results">2. Results</h1>
<h4 id="tone-of-immigration-speeches">Tone of Immigration Speeches</h4>
<ul>
<li><p>17M congressional speeches from 1880 to 2020</p>
</li>
<li><p>human annotations and trained ML classifiers to detect immigration-related speech with accompanying tone (pro, con, neutral)</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/03cdcbbf-dd59-4b57-b752-fa87200de0d1/image.png" alt=""></p>
<ul>
<li><p>applied same models to all presidential communications by American Presidency Project (Bottom)</p>
</li>
<li><p>Fig 1</p>
<ul>
<li><p>average sentiment is negative throughout the late 19th and early 20th centuries (from the Chinese Exclusion Act (1882) to strict immigration quotas (1920s))</p>
</li>
<li><p>the attitude became more positive around the start of WW2</p>
<ul>
<li><p>rising steadily from 1940 until the end of the Johnson administration (1969)</p>
</li>
<li><p>average tone has been pro since the beginning of the Eisenhower (1953)</p>
</li>
</ul>
</li>
<li><p>beginning about a decade after 1965, an overall decline in sentiment among Republicans and an increase among Democrats is observed</p>
<ul>
<li><p>the exception is the early 1990s, a period coinciding with the end of the Cold War and NAFTA</p>
</li>
<li><p>Republicans show antiimmigration sentiment comparable to the 1920s</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Trends for presidential attitudes should be treated more cautiously as there is less text</p>
<ul>
<li><p>involves a slight domain shift (the model is trained on congressional speeches)</p>
</li>
<li><p>found a similar pattern</p>
<ul>
<li>early presidents were more antiimmigration</li>
</ul>
</li>
<li><p>in recent years, presidents are uniformly more proimmigration even the Republican (Ronald Reagan) and the Democrats (Jimmy Carter)</p>
<ul>
<li>Trump was a stark exception (the most antiimmigration president over the past 140 years)</li>
</ul>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/2aabdcc2-d161-4e86-974f-73d55f1fd88d/image.png" alt=""></p>
<ul>
<li><p>Fig 2.</p>
<ul>
<li><p>the tone was varied dramatically depending on which groups of immigrants are being discussed</p>
</li>
<li><p>Mexican, Chinese, and Italian immigrants (identified as in Identifying Groups)</p>
</li>
<li><p>Speeches mentioning Chinese immigrants were overwhelmingly negative during Chinese exclusion (1882 to 1943)</p>
<ul>
<li>while the tone toward Italian was slightly more favorable</li>
</ul>
</li>
<li><p>Attitudes toward all groups improved from 1940 to 1970</p>
<ul>
<li>mentioning China and Mexico remained relatively more negative overall</li>
</ul>
</li>
<li><p>since the late 1970s, the gap between Italian and Mexican is as large as the gap in tone that exists between Republicans and Democrats today.</p>
</li>
<li><p>this pattern is mirrored in broader regional trends</p>
<ul>
<li><p>Most European $\rightarrow$ referred to positively on average by the 1960s</p>
</li>
<li><p>Asian $\rightarrow$ by the 1980s</p>
</li>
<li><p>Caribbean $\rightarrow$ negative on average until the 2000s</p>
</li>
<li><p>few countries are mentioned as frequently as those three (Mexico, Italy, China)</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4 id="language-framing-and-dehumanization">Language, Framing, and Dehumanization</h4>
<ul>
<li>trained interpretable logistic regression models to approximate the predictions of the contextual embedding models and determine feature importance using Shapley values</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d282950e-a5f3-4654-8e66-2c888afa041d/image.png" alt=""></p>
<ul>
<li><p>Table 1</p>
<ul>
<li><p>antiimmigration terms contains the words representing threats (dangerous, cheap), control (permit, violation), and the targets of early antiimmigration legislation (undesirable, Chinese)</p>
<ul>
<li>by midcentury and beyond, another threats appear (subversive, terrorism) along with the themes of legality (aliens, illegal) and crime (criminals, smuggling)</li>
</ul>
</li>
<li><p>proimmigration terms contain the words representing desirable characteristics (industrious), land (property, agriculture), and service (gave, served)</p>
<ul>
<li><p>by post-WW2 era, humanitarian concerns (discriminatory, migrants) and community (citizens, families, children) appeared</p>
</li>
<li><p>this continued into the present (victims, community) along with a celebration of once-vilified communities (Irish, Italian, heritage)</p>
</li>
</ul>
</li>
<li><p>Despite the relatively negative tone toward Mexican in the modern period, Hispanic and Latino had strong positive associations</p>
<ul>
<li><p>these are more likely to be used by Democrats than Republicans $\rightarrow$ they are proimmigration</p>
</li>
<li><p>but &quot;Mexico&quot; and &quot;Mexican&quot; are mentioned with very similar frequency by Democrats and Republicans $\rightarrow$ the tone difference is not simply a matter of mentioning Mexico</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>To understand the rhetorical divergence between parties $\rightarrow$ they focused on the frame about immigration $\rightarrow$ built a series of lexicons</p>
<ul>
<li>working upon the previous work, they developed 14 of these lexicons with automated/manual curation</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1f0a97f4-af50-4be1-802c-010f2bf39d4c/image.png" alt=""></p>
<ul>
<li><p>Fig 3</p>
<ul>
<li><p>almost no difference in the frames by two parties in the earlier time period</p>
</li>
<li><p>today, they use strongly divergent use of different frames</p>
<ul>
<li><p>Republicans : crime, legality, threats, deficiency, flood/tide $\rightarrow$ commonly heard antiimmigration comments</p>
</li>
<li><p>Democrats : family, victims, contributions, culture (positive)</p>
</li>
</ul>
</li>
<li><p>these patterns are robust to the exclusion of any individual term as well as to automated lexicon expansion</p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>the most salient aspects</p>
<ul>
<li><p>earlier time period : deficiency, culture, labor</p>
</li>
<li><p>today : crime, legality (partly due to frequent mentions of legal and illegal immigrants + other legal terms and crime terms (laws, visas, criminals, terrorism))</p>
</li>
</ul>
</li>
<li><p>economy is the most uncommon in speeches about immigration $\rightarrow$ the least salient in both eras</p>
</li>
</ul>
<ul>
<li><p>measured more implicit dehumanizing metaphors </p>
<ul>
<li><p>only the flood and tide metaphors emerged from the semiautomated frame construction process</p>
</li>
<li><p>measure the metaphors based on how probable such terms are as substitutes according to contextual embedding models</p>
<ul>
<li><p>animals, cargo, disease, flood/tide, machines, vermin are drawn by this method</p>
</li>
<li><p>&quot;dumping produces ~&quot; $\rightarrow$ cargo</p>
</li>
<li><p>&quot;herding of ~&quot; $\rightarrow$ animal</p>
</li>
</ul>
</li>
<li><p>Republican used more dehumanizing metaphors</p>
</li>
</ul>
</li>
</ul>
<h4 id="differences-by-country-of-origin">Differences by Country of Origin</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e15a27a3-a275-495d-a514-4bdafae37299/image.png" alt=""></p>
<ul>
<li><p>Fig 4</p>
<ul>
<li><p>how Mexicans are framed today vs how Chinese were framed a century earlier</p>
</li>
<li><p>crime, labor, legality are deployed vastly</p>
</li>
<li><p>4 most positive frames are all used far more in sentences mentioning European than the non-European groups (culture, victims, contributions, family)</p>
</li>
</ul>
</li>
<li><p>implicit dehumanizing language is slightly but significantly more common for mentions of the non-European group in both cases</p>
</li>
</ul>
<h1 id="3-discussion">3. Discussion</h1>
<ul>
<li><p>Congressional antagonism to immigration started much earlier than the quota period</p>
<ul>
<li><p>China is mentioned in more than 20% of the speeches in 1880 to 1900</p>
</li>
<li><p>negative attitudes toward immigration remained from 1880 to 1940</p>
</li>
</ul>
</li>
<li><p>negative tone toward Chinese is consistent with the many pieces of anti-Chinese legislation introduced into Congress</p>
<ul>
<li><p>1875 Page Act</p>
</li>
<li><p>1882 Chinese Exclusion</p>
</li>
<li><p>1888 Scott</p>
</li>
<li><p>mentioning Chinese remained until the Chinese Exclusion Act was repealed in 1943</p>
</li>
<li><p>significantly greater use of implicit dehumanizing language to mention Chinese and emphasis on the threatening aspects</p>
</li>
</ul>
</li>
<li><p>The combination of frames (crime, threats) underscores the dual nature of immigrants (threatening vs cheap labor)</p>
<ul>
<li><p>same pattern in the Mexican today</p>
</li>
<li><p>the frame of Europeans was more sympathetic although still negative until the middle of the 20th century</p>
</li>
</ul>
</li>
<li><p>gradual loosening of immigration laws from 1940s</p>
<ul>
<li><p>this trend mirrored by congressional tone toward immigration</p>
</li>
<li><p>eventually becoming net positive on average in 1950s</p>
</li>
<li><p>possibly by the humanitarian concerns</p>
<ul>
<li><p>signaling positive attitudes</p>
</li>
<li><p>increasing association with the &#39;victims&#39; frame and decreasing the prominence of &#39;deficiency&#39; and &#39;threats&#39;</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>nearly 30 years after the border reopened in 1965, the positive sentiment didn&#39;t fully erode even as the immigration from developing countries increased</p>
<ul>
<li><p>the partisan divide on immigration emerged in the late 1970s, although Republicans showed a neutral or positive stance until the election of Bill Clinton and NAFTA</p>
</li>
<li><p>this predates the previous work about polarization</p>
</li>
</ul>
</li>
<li><p>the results are consistent with the patterns in the previous work about the polarizations</p>
<ul>
<li><p>polarization + overall tone toward immigration</p>
</li>
<li><p>beyond the sentiment, there are the &#39;framing&#39; used in immigration debates</p>
</li>
</ul>
</li>
<li><p>understanding the causes of this polarization is beyond the scope of this paper</p>
<ul>
<li><p>legislators&#39; tone is weakly correlated with public opinion on the issue at the state level</p>
</li>
<li><p>no evidence of systematic differences in tone among House members in election vs. nonelection years</p>
</li>
</ul>
</li>
<li><p>stark differences in framing between European and non-European groups</p>
<ul>
<li><p>Chinese in the late 19th and early 20th century, Mexican today</p>
</li>
<li><p>more implicitly dehumanizing metaphors for non-European</p>
</li>
<li><p>also for the explicit frames (crime, labor, legality)</p>
</li>
<li><p>the gap between mentioning Mexico and European is equivalent to the modern gap between Democrats and Republicans</p>
</li>
</ul>
</li>
<li><p>modern immigration laws and the rhetoric of &#39;illegal&#39; immigrants were crafted specifically to target immigration from Mexico</p>
<ul>
<li>made associations with crime, legality and labor</li>
</ul>
</li>
<li><p>Mexico was also the target of early discrimination like China</p>
<ul>
<li><p>Mexico was exempt from the quota system</p>
</li>
<li><p>Although the tone of speeches mentioning Mexican increased with other nationalities, these gains were largely eroded in the early 1970s  $\rightarrow$ persistently nationality-based gap</p>
</li>
</ul>
</li>
<li><p>with the public opinion polls (by Gallup)</p>
<ul>
<li><p>also shows the increase in proimmigrant sentiment from 1965 to present</p>
</li>
<li><p>in 2019, 77% answered as a positive</p>
</li>
<li><p>in 2002, it was 52% (after 9.11)</p>
</li>
<li><p>when asked whether immigration should be decreased, 65% said it should in the 1990s</p>
</li>
<li><p>in 2020, this fell to 28%</p>
</li>
</ul>
</li>
<li><p>the analysis of congressional and presidential speeches is more complicated</p>
<ul>
<li><p>attitudes among Republicans are as negative as members of Congress were during the push for restrictive quotas</p>
</li>
<li><p>Chinese immigrants are still discussed more negatively than Europeans even though the overall sentiment is positive today</p>
</li>
<li><p>in recent years, COVID-19 fueled anti-Asian hate crimes and anti-Chinese rhetoric</p>
</li>
<li><p>despite the proimmigrant sentiment among the general population, the tone differences in Congress based on nationality are as strong as those between the parties</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-materials-and-methods">4. Materials and Methods</h1>
<h4 id="data">Data</h4>
<ul>
<li><p>43rd to 111th Congress : digitized copy of the Congressional Record</p>
</li>
<li><p>112th to 116th Congress : congressional-record tool by @unitedstates project</p>
</li>
<li><p>data with speaker, party, state and date</p>
</li>
<li><p>Procedural speeches were identified and excluded</p>
</li>
<li><p>presidential communication : all presidential documents from The American Presidency Project</p>
</li>
<li><p>Immigration statistics : Historical Statistics of the United States Millennial Edition Online + census data by the Migration Policy Institute</p>
</li>
</ul>
<h4 id="classification">Classification</h4>
<ul>
<li><p>Princeton University research assistants to label a speech</p>
<ul>
<li><p>about immigration or not</p>
</li>
<li><p>proimmigration / antiimmigration</p>
</li>
<li><p>an extensive set of keywords was used to select candidates for annotation</p>
</li>
<li><p>7626 segments annotated (3643 were judged relevant)</p>
</li>
<li><p>the judgements were aggregated with Bayesian item response model (to get a probability distribution over labels for each segment)</p>
</li>
</ul>
</li>
<li><p>trained RoBERTa</p>
<ul>
<li><p>fine-tuned the pretrained roberta-base on congressional speeches in a self-supervised manner</p>
</li>
<li><p>then fine-tuned it to be a classifier using annotated examples</p>
</li>
<li><p>~90% accuracy on relevance and 65% on tone</p>
</li>
<li><p>major error in tone is between neutral and non-neutral</p>
</li>
<li><p>models trained on earlier and later parts of the data showed similar aggregate results in the intervening years</p>
</li>
</ul>
</li>
<li><p>the predictions on segments are used to predict the speeches</p>
<ul>
<li>same predictor is used to presidential communication</li>
</ul>
</li>
</ul>
<h4 id="identifying-groups">Identifying Groups</h4>
<ul>
<li><p>the most prominent immigrant nationalities $\rightarrow$ historical data on the countries of origin of the foreign-born US population</p>
</li>
<li><p>45 countries that accounted for at least 1% of the foreign-born population in at least 1 decade</p>
</li>
<li><p>manually modified the country name and nationality</p>
</li>
</ul>
<h4 id="measuring-impact">Measuring Impact</h4>
<ul>
<li><p>used L1-regularized LR models to fit the predicted tone labels on all congressional segments classified as relevant</p>
<ul>
<li>approximates the influence of individual words</li>
</ul>
</li>
<li><p>words in the vocab : used at least 20 times / excluding numbers, punctuation, stop words / counts were binarized</p>
</li>
<li><p>Shapley values computed (reflected in Table 1)</p>
</li>
</ul>
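<p>A rough equivalent of this step in scikit-learn, with made-up feature matrices; for a linear model the Shapley value has a simple closed form, so no extra library is needed here:</p>
<pre><code>import numpy as np
from sklearn.linear_model import LogisticRegression

# X: binarized word-presence features per segment; y: predicted tone (pro=1 / anti=0).
# Random stand-ins; the paper fits on all congressional segments classified as relevant.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(500, 300)).astype(float)
y = rng.integers(0, 2, size=500)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=1.0).fit(X, y)

# For a linear model, the Shapley value of feature j on sample i is
# coef_j * (x_ij - mean_j); averaging its magnitude ranks word importance.
shapley = clf.coef_[0] * (X - X.mean(axis=0))
word_importance = np.abs(shapley).mean(axis=0)
top_words = np.argsort(-word_importance)[:20]   # indices of the 20 most influential features
</code></pre>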
<h4 id="curating-frames">Curating Frames</h4>
<ul>
<li><p>curated lexicons for 14 immigration frames</p>
<ul>
<li><p>identified terms that occur significantly more frequently in mentions of immigrants than in mentions of generic people</p>
</li>
<li><p>considered initial exploration, annotators&#39; comments, and prior literature to identify 14 relevant categories</p>
</li>
<li><p>which terms belong to which frame was decided by majority vote</p>
</li>
</ul>
</li>
</ul>
<h4 id="identifyng-mentions">Identifyng Mentions</h4>
<ul>
<li><p>collected direct mentions + group terms + more generic person references with nationality</p>
</li>
<li><p>used to measure dehumanizing metaphorical language for each group</p>
</li>
<li><p>included slang and derogatory terms to identify groups</p>
</li>
</ul>
<h4 id="measuring-dehumanization">Measuring Dehumanization</h4>
<ul>
<li><p>introduce a method that is based purely on context $\rightarrow$ used BERT</p>
<ul>
<li><p>trained on MLM task</p>
</li>
<li><p>fine-tuning to act as a classifier</p>
</li>
<li><p>to measure implicit metaphorical language, they began with representative terms of each category</p>
<ul>
<li><p>used static vectors to find similar terms</p>
</li>
<li><p>tried to find that kind of word in the BERT&#39;s vocabulary</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>training procedure</p>
<ul>
<li><p>for each sentence that mentions an immigrant or immigrant group (see the sketch after this list)</p>
</li>
<li><p>mask the mention with [MASK] token</p>
</li>
<li><p>to compute the probability of the candidate for the [MASK]</p>
</li>
<li><p>add up all the probabilities to get an overall score for each category for that sentence</p>
</li>
<li><p>showed the log ratio of the mean probability for one set of mentions to the mean probability for the other (a rough sketch of this scoring follows after this list)</p>
</li>
</ul>
</li>
<li><p>validating procedure</p>
<ul>
<li><p>collected human judgements on a sample of masked contexts</p>
</li>
<li><p>three of the authors independently rated whether a term would be a plausible replacement</p>
</li>
<li><p>reasonably strong agreement (Krippendorff&#39;s alpha = 0.59)</p>
</li>
<li><p>correlated with the log probability by the model (r = 0.73)</p>
</li>
</ul>
</li>
</ul>
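<p>A rough sketch of the masked-LM scoring procedure above (my reconstruction; the category term lists and the example sentence are made up). The group mention is replaced with the [MASK] token, the probabilities of each category&#39;s terms at that position are summed into a per-sentence category score, and the paper then compares mean scores across mention sets via a log ratio.</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

categories = {                      # hypothetical single-token category members
    "vermin":  ["rats", "pests", "parasites"],
    "neutral": ["people", "families", "workers"],
}

def category_scores(sentence_with_mention, mention):
    masked = sentence_with_mention.replace(mention, tok.mask_token)
    enc = tok(masked, return_tensors="pt")
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero()[0].item()
    with torch.no_grad():
        probs = model(**enc).logits[0, mask_pos].softmax(-1)
    # sum of [MASK] probabilities over the terms of each category
    return {name: sum(probs[tok.convert_tokens_to_ids(t)].item() for t in terms)
            for name, terms in categories.items()}

print(category_scores("The mexican immigrants are pouring into the country.",
                      "mexican immigrants"))
</code></pre>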
]]></description>
        </item>
        <item>
<title><![CDATA[Elements of World Knowledge (EWoK)]]></title>
            <link>https://velog.io/@0404_not_found/Elements-of-Worls-Knowledge-EWoK</link>
            <guid>https://velog.io/@0404_not_found/Elements-of-Worls-Knowledge-EWoK</guid>
            <pubDate>Sun, 19 May 2024 13:52:31 GMT</pubDate>
<description><![CDATA[<p>Elements of World Knowledge (EWoK): A cognition-inspired framework for evaluating basic world knowledge in LMs</p>
<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/abe0f276-9d52-4fb2-a0d9-bf0ecdf4ec00/image.png" alt=""></p>
<ul>
<li><p>LLMs acquire a substantial amount of knowledge from their training data</p>
<ul>
<li><p>knowledge about language (word meaning, syntax)</p>
</li>
<li><p>knowledge about world (social conventions, physical properties of objects)</p>
</li>
</ul>
</li>
<li><p>To test the robustness of a model&#39;s world knowledge</p>
<ul>
<li><p>Elements of World Knowledge (EWoK)</p>
<ul>
<li><p>several domains that constitute the foundation for basic human world knowledge</p>
</li>
<li><p>specific concepts within each domain</p>
</li>
<li><p>a set of item templates</p>
</li>
<li><p>a set of fillers to populate the templates (each template to be used multiple times)</p>
</li>
<li><p>a pipeline to generate a specific set of items</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Why Elements?</p>
<ul>
<li><p>it targets specific cognitive concepts (e.g. friend/enemy)</p>
</li>
<li><p>concepts leveraged in context are the first-class objects of EWoK, as opposed to individual sentences or facts</p>
</li>
<li><p>NLP benchmarks $\rightarrow$ aim to evaluate knowledge based on individual items</p>
</li>
<li><p>individual items make it hard to assess why a model fails</p>
</li>
<li><p>explicitly link the items with the concepts that they test</p>
</li>
</ul>
</li>
<li><p>Why cognition-inspired?</p>
<ul>
<li><p>selected a range of domains that have been shown to recruit dedicated cognitive and/or neural machinery in humans </p>
<ul>
<li><p>intuitive physics</p>
</li>
<li><p>physical and spatial relations</p>
</li>
<li><p>intuitive number sense</p>
</li>
<li><p>social reasoning</p>
</li>
<li><p>reasoning about agents with both physical and social knowledge</p>
</li>
</ul>
</li>
<li><p>present in preverbal infants</p>
</li>
<li><p>but language contains a rich amount of information that reflects grounded world knowledge $\rightarrow$ LLMs might acquire the domain-specific knowledge from text alone</p>
</li>
</ul>
</li>
<li><p>Why plausibility?</p>
<ul>
<li><p>plausible vs implausible context-target pairs</p>
</li>
<li><p>plausibility $\rightarrow$ serves as a proxy for factual accuracy (determines whether a given scenario makes sense)</p>
</li>
<li><p>an accurate world model is necessary for distinguishing the plausibility no matter how they are worded</p>
</li>
</ul>
</li>
<li><p>Why minimal pairs?</p>
<ul>
<li><p>contexts and targets in EWoK have a minimal-pairs design</p>
</li>
<li><p>target change results in an opposite result (plausible $\rightarrow$ implausible)</p>
</li>
<li><p>help to identify specific manipulations that LLMs are sensitive and they are not</p>
</li>
</ul>
</li>
<li><p>Why context-target combinations?</p>
<ul>
<li><p>LLMs are very good at memorization $\rightarrow$ many distinctions can be made simply from the items&#39; presence in the training data</p>
</li>
<li><p>this framework tests an LLM&#39;s ability to evaluate contextual plausibility, such that the exact same target&#39;s plausibility changes depending on the context</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<ul>
<li><p>commonsense benchmark</p>
<ul>
<li>reporting bias in training data</li>
</ul>
</li>
<li><p>Co-occurrence information easily available through perception is often underrepresented in language corpora</p>
<ul>
<li>earlier LLMs failed</li>
</ul>
</li>
<li><p>natural language inference and entailment</p>
<ul>
<li><p>recognizing textual entailment (RTE)</p>
</li>
<li><p>natural language inference (NLI)</p>
</li>
<li><p>EWoK asks for plausibility within a given context $\rightarrow$ it might indicate an entailment</p>
</li>
<li><p>LLMs use heuristics to solve the task rather than genuine understanding</p>
<ul>
<li><p>in EWoK, the task is posed as a minimal pair (one must be preferred over the alternative) $\rightarrow$ making reliance on target plausibility alone impossible</p>
</li>
<li><p>test which item design features drive model performance</p>
</li>
<li><p>test the relationship between LLM performance and surface-level item properties (length, average word frequency, BoW model performance)</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>bAbi</p>
<ul>
<li><p>similar design about world knowledge and reasoning</p>
</li>
<li><p>EWoK has a simpler design but is harder in practice</p>
</li>
</ul>
</li>
<li><p>minimal pair design</p>
<ul>
<li><p>SyntaxGym, BLiMP, COMPS</p>
</li>
<li><p>Winograd Schema Challenge</p>
</li>
<li><p>EWoK used minimal pairs of pairs design</p>
<ul>
<li>both context and target sentences have a minimal pair counterpart</li>
</ul>
</li>
</ul>
</li>
<li><p>assessing LM performance</p>
<ul>
<li><p>until 2023, each item&#39;s log probability</p>
<ul>
<li><p>effective at grammatical vs ungrammatical</p>
</li>
<li><p>plausible and implausible</p>
</li>
<li><p>relevant and irrelevant object properties</p>
</li>
</ul>
</li>
<li><p>log probability shows the surface-level properties</p>
</li>
<li><p>Recently, prompting an LLM to rate plausibility directly</p>
<ul>
<li><p>LLMs perform worse with direct prompting than with implicit log probabilities</p>
</li>
<li><p>in EWoK, both log probability and explicit prompting are used</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="3-the-framework">3. The Framework</h1>
<h4 id="item-format">Item Format</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/39c71b23-e747-49bf-b387-0dd134e6cae1/image.png" alt=""></p>
<ul>
<li><p>Each item consists of two minimal pair contexts</p>
<ul>
<li><p>$C_1$ : The piano is in front of Ali. Ali turns <strong>left</strong>.</p>
</li>
<li><p>$C_2$ : The piano is in front of Ali. Ali turns <strong>right</strong>.</p>
</li>
</ul>
</li>
<li><p>Also, there are two target sentences</p>
<ul>
<li><p>$T_1$ : The piano is <strong>right</strong> of Ali.</p>
</li>
<li><p>$T_2$ : The piano is <strong>left</strong> of Ali.</p>
</li>
</ul>
</li>
<li><p>the two target items are juxtaposed such that</p>
<ul>
<li>$P(T_1 \ | \ C_1) &gt; P(T_1 \ | \ C_2)$ and $P(T_2 \ | \ C_1) &lt; P(T_2 \ | \ C_2)$</li>
</ul>
</li>
<li><p>then the base target probabilities $P(T_1)$ and $P(T_2)$ can&#39;t serve as plausibility cues $\rightarrow$ the model should rely on the given context</p>
</li>
</ul>
<h4 id="domain-and-concenpts">Domain and Concenpts</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/262b900c-2692-409f-a09a-a444ffd43787/image.png" alt=""></p>
<h4 id="dataset-generation-procedure">Dataset generation procedure</h4>
<ul>
<li><p>each concept is associated with several items that test knowledge of the concept (mostly contrasting with another concept)</p>
</li>
<li><p>flexible but controlled manner</p>
</li>
<li><p>atomic units and combination rules $\rightarrow$ generation of templates with fillers</p>
</li>
</ul>
<h4 id="contexts-and-targets">Contexts and Targets</h4>
<ul>
<li><p>target : a simple sentence that incorporates a concept</p>
</li>
<li><p>contrasting target pair is generated by</p>
<ul>
<li><p>concept swap</p>
<ul>
<li><p>{agent 1} is to the left of {agent 2}</p>
</li>
<li><p>{agent 1} is to the right of {agent 2}</p>
</li>
</ul>
</li>
<li><p>variable swap</p>
<ul>
<li><p>{agent 1} is to the left of {agent 2}</p>
</li>
<li><p>{agent 2} is to the left of {agent 1}</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>context pair : one or more minimal pairs of sentences that are paired with a target pair</p>
<ul>
<li><p>$C_1$ only matches with $T_1$ and $C_2$ only matches with $T_2$</p>
</li>
<li><p>typically an opposite concept pair (left/right) or single concept (left, with variable swap)</p>
</li>
</ul>
</li>
<li><p>contrasting concept pair is generated by </p>
<ul>
<li><p>filler swap</p>
<ul>
<li>use contrasting fillers</li>
</ul>
</li>
<li><p>variable swap</p>
<ul>
<li>changes the positions of two entities of the same kind</li>
</ul>
</li>
</ul>
</li>
</ul>
<h4 id="templates-and-fillers">Templates and Fillers</h4>
<ul>
<li><p>Each collection of concepts, contexts, targets can be compiled into a set of templates</p>
</li>
<li><p>partial items with typed variables describing the range of fillers (a toy compile sketch follows this list)</p>
<ul>
<li><p>{object2: can_bounce=True} bounced off {object1} from below</p>
</li>
<li><p>object1 can be the desk or the crate</p>
</li>
<li><p>object2 should be the object marked with can_bounce=True (the ball, the tire)</p>
</li>
<li><p>500 filler items across 13 classes with 28 type restrictions</p>
</li>
</ul>
</li>
<li><p>users can specify various custom parameters</p>
<ul>
<li><p>number of items to generate from each template</p>
<ul>
<li>full set of items $\rightarrow$ &quot;version&quot;</li>
</ul>
</li>
<li><p>whether fillers should be held constant across all items in a version</p>
</li>
<li><p>apply transformations to filler restrictions at compile-time</p>
<ul>
<li><p>agent $\rightarrow$ agent:western=False</p>
</li>
<li><p>object $\rightarrow$ nonword</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>this allows controlled experimentation of the features</p>
</li>
</ul>
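<p>A toy sketch of how templates with typed fillers could be compiled into concrete items (an illustration of the idea, not the released EWoK pipeline; the filler attributes and names are invented):</p>
<pre><code class="language-python">import itertools, re

fillers = {
    "object": [{"name": "the ball",  "can_bounce": True},
               {"name": "the tire",  "can_bounce": True},
               {"name": "the desk",  "can_bounce": False},
               {"name": "the crate", "can_bounce": False}],
}

template = "{object2: can_bounce=True} bounced off {object1} from below"

def compile_template(template, fillers):
    # slot spec: class name + index, optionally ": attribute=value"
    slots = re.findall(r"\{(\w+?)(\d+)(?::\s*(\w+)=(\w+))?\}", template)
    pools = []
    for cls, _, attr, val in slots:
        pool = fillers[cls]
        if attr:                                    # apply the type restriction
            pool = [f for f in pool if str(f.get(attr)) == val]
        pools.append(pool)
    for combo in itertools.product(*pools):
        names = [f["name"] for f in combo]
        if len(set(names)) != len(names):           # distinct fillers per item
            continue
        out = template
        for (cls, idx, attr, val), f in zip(slots, combo):
            spec = f"{{{cls}{idx}" + (f": {attr}={val}" if attr else "") + "}"
            out = out.replace(spec, f["name"])
        yield out

for item in compile_template(template, fillers):
    print(item)
</code></pre>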
<h1 id="4-evaluation">4. Evaluation</h1>
<ul>
<li><p>with this framework, EWoK-CORE-1.0 is released by generating 5 unique fixed substitutions of filler items across 880 templates from 11 domains</p>
</li>
<li><p>evaluated with LogProb and two prompt-based methods LIKERT, CHOICE</p>
<ul>
<li>LogProb outperforms the direct prompting</li>
</ul>
</li>
<li><p>for the prompt-based evaluations</p>
<ul>
<li>collected data from LLMs and humans using paired identical prompts</li>
</ul>
</li>
</ul>
<h2 id="41-scoring-metrics">4.1. Scoring Metrics</h2>
<ul>
<li><p>LogProbs</p>
<ul>
<li><p>token-level LLM probabilities with sum of conditional log probs of each token</p>
</li>
<li><p>$\log P_{\theta}(T \ | \ C) = \sum_{k=1}^n \log P_{\theta}(\mathbf{t}_k \ | \ C, \mathbf{t}_{&lt;k})$</p>
</li>
</ul>
</li>
<li><p>LIKERT</p>
<ul>
<li>participants are prompted to rate the plausibility of each $C_i$ and $T_j$ pair on a 1-5 scale</li>
</ul>
</li>
<li><p>CHOICE</p>
<ul>
<li><p>participants are given $C_1$, $C_2$ and a single target $T$</p>
</li>
<li><p>participants should choose between $C_1$ and $C_2$ which better matches with $T$</p>
</li>
</ul>
</li>
<li><p>the metric for correctness for a given item is the recovery of the designed item structure</p>
<ul>
<li><p>$score(T_1 \ | \ C_1) &gt; score(T_1 \ | \ C_2)$ and $score(T_2 \ | \ C_1) &lt; score(T_2 \ | \ C_2)$</p>
</li>
<li><p>what counts as the score differs by method (a sketch combining LogProbs with this rule follows this list)</p>
</li>
</ul>
</li>
<li><p>find both $C, T$ matches $\rightarrow$ 1.0 (full point)</p>
</li>
<li><p>find only one match $\rightarrow$ 0.5 (half point)</p>
<ul>
<li>in LIKERT, this is the case where the model gave the same rating to both</li>
</ul>
</li>
<li><p>trivial 50% baseline for all scenarios</p>
</li>
</ul>
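<p>A hedged sketch of the LogProbs metric and the item-level accuracy rule above (my own reimplementation with gpt2 as a stand-in model, not the official EWoK evaluation code):</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
lm = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def logprob_of_target(context, target):
    """sum_k log P(t_k | C, t_&lt;k), summed over the target tokens only"""
    ctx_ids = tok(context + " ", return_tensors="pt").input_ids
    tgt_ids = tok(target, return_tensors="pt").input_ids
    ids = torch.cat([ctx_ids, tgt_ids], dim=1)
    with torch.no_grad():
        logprobs = lm(ids).logits.log_softmax(-1)
    total = 0.0
    for k in range(tgt_ids.shape[1]):
        pos = ctx_ids.shape[1] + k                          # position of target token k
        total += logprobs[0, pos - 1, ids[0, pos]].item()   # predicted from the previous position
    return total

C1 = "The piano is in front of Ali. Ali turns left."
C2 = "The piano is in front of Ali. Ali turns right."
T1 = "The piano is right of Ali."
T2 = "The piano is left of Ali."

def item_accuracy(C1, C2, T1, T2, score=logprob_of_target):
    hit1 = score(C1, T1) &gt; score(C2, T1)   # T1 should prefer C1
    hit2 = score(C2, T2) &gt; score(C1, T2)   # T2 should prefer C2
    return (hit1 + hit2) / 2               # 1.0, 0.5 or 0.0

print(item_accuracy(C1, C2, T1, T2))
</code></pre>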
<h2 id="42-models">4.2. Models</h2>
<ul>
<li><p>20 transformer LMs</p>
</li>
<li><p>1.3B-70B parameters and different pretraining diets</p>
</li>
<li><p>13 dense pretrained transformers</p>
</li>
<li><p>4 instruction-tuned</p>
</li>
<li><p>2 chat fine-tuned</p>
</li>
<li><p>1 MoE</p>
</li>
<li><p>the model doesn&#39;t require specific formatting</p>
</li>
</ul>
<h2 id="43-surface-level-item-properties">4.3. Surface-level item properties</h2>
<ul>
<li><p>baseline: BoW with word2vec</p>
</li>
<li><p>scored with Cosine-Similarity</p>
</li>
<li><p>also tested LLMs against the number of words in each item and the average word frequency in an item (from Google Ngrams)</p>
</li>
</ul>
<h2 id="44-human-data">4.4. Human Data</h2>
<ul>
<li><p>1262 participants (591 female, 579 male, 27 other)</p>
</li>
<li><p>median age 36</p>
</li>
<li><p>US residents with English as their first language</p>
</li>
<li><p>participants with poor agreement with others were excluded</p>
</li>
</ul>
<h1 id="5-release-considerations">5. Release Considerations</h1>
<ul>
<li><p>reduce the chances of accidental incorporation of EWoK into LLMs&#39; training data</p>
</li>
<li><p>promote accountability and reporting when such incorporation is done intentionally</p>
</li>
</ul>
<h1 id="6-experiments">6. Experiments</h1>
<h4 id="ewok-core-10-is-challenging-for-llms">EWoK-CORE-1.0 is challenging for LLMs</h4>
<ul>
<li><p>even larger models generally perform much below humans</p>
</li>
<li><p>the best, falcon-40b-instruct, got 0.80 while humans got 0.95</p>
</li>
<li><p>instruction tuning doesn&#39;t affect performance under LogProbs</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4ae6c584-63b6-477a-9c5d-cdcde7690417/image.png" alt=""></p>
<h4 id="performance-vaires-drastically-by-domain">Performance vaires drastically by domain</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8c84e657-2f28-4203-a2f6-6e2dd55b6977/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d1f52845-f180-4ea7-b0fe-7c3682246498/image.png" alt=""></p>
<ul>
<li><p>domain difficulty is consistent across LLMs</p>
</li>
<li><p>heterogeneous performance of the <strong>phi</strong> models</p>
<ul>
<li><p>phi-1 is the worst</p>
</li>
<li><p>phi-1.5 outperforms all models and even humans on physical dynamics</p>
</li>
<li><p>phi-2 ranges from on par with the largest models on some domains to worse than gpt2-xl on spatial relations</p>
</li>
<li><p>possibly due to their unique training procedure (synthetic data)</p>
</li>
</ul>
</li>
</ul>
<h4 id="llms-show-heterogeneous-performance-across-dataset-versions">LLMs show heterogeneous performance across dataset versions</h4>
<ul>
<li><p>in principle, these variables should not affect the results</p>
</li>
<li><p>phi-2 and phi-1.5 showed the largest performance range</p>
</li>
<li><p>humans showed somewhat heterogeneous performance too (driven only by a subset of the domains)</p>
</li>
</ul>
<h4 id="domain-content-item-design-features-and-surface-level-item-features-all-affect-llm-performance">Domain content, item design features, and surface-level item features all affect LLM performance</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/79e8bdd9-8366-41d9-b2bd-88492ec8bd7a/image.png" alt=""></p>
<ul>
<li><p>they affected the performance, often in different ways than they affect humans</p>
</li>
<li><p>the BoW baseline is predictive of LLM performance but not of human performance</p>
</li>
<li><p>the number of words in an item negatively affects LLM performance but not human performance</p>
</li>
<li><p>word frequency negatively affects both LLM and human performance</p>
<ul>
<li>this is because the hardest two domain (physical-relations and spatial relations) have the highest word frequency</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d113dbb4-22de-46a6-85ac-41af24a7b684/image.png" alt=""></p>
<ul>
<li><p>jointly modeled all features using a mixed-effects regression</p>
<ul>
<li><p>word frequency has a significant positive effect</p>
</li>
<li><p>the number of words has a significant negative effect</p>
</li>
<li><p>domain remained a significant predictor of performance</p>
</li>
</ul>
</li>
</ul>
<h4 id="logprobs-yield-higher-accuracy-than-prompting">LogProbs yield higher accuracy than prompting</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/cfa0de7f-3c47-4e40-b164-7b923550cff9/image.png" alt=""></p>
<ul>
<li>the gap was large in smaller models</li>
</ul>
<h4 id="human-ratings-are-often-but-not-always-accurate">Human ratings are often but not always accurate</h4>
<ul>
<li><p>sometimes the discrepancies between human ratings and experimental labels resulted from specific fillers changing the plausibility</p>
<ul>
<li><p>The cooler is inside the car. Chao cannot see the cooler</p>
</li>
<li><p>this is implausible as the cooler is large and the car has windows</p>
</li>
<li><p>but with a small object and a container without windows it would be plausible</p>
</li>
</ul>
</li>
<li><p>Human made mistakes</p>
<ul>
<li><p>The bakery is north of Chao. Chao turns around. The bakery is south of Chao.</p>
</li>
<li><p>this is implausible as cardinal directions don&#39;t depend on the agent&#39;s orientation</p>
</li>
</ul>
</li>
</ul>
<h1 id="7-discussion">7. Discussion</h1>
<ul>
<li><p>the goal was to develop a dataset</p>
<ul>
<li><p>uses a uniform item format to probe diverse domains of physical and social knowledge</p>
</li>
<li><p>contains items that probe specific concepts</p>
</li>
<li><p>requires integrating information across sentences</p>
</li>
<li><p>consists of generic templates that can be used to generate a wide variety of items</p>
</li>
</ul>
</li>
<li><p>presented evaluation results</p>
</li>
<li><p>EWoK-CORE-1.0 is moderately challenging for LLMs</p>
</li>
<li><p>LogProbs contain enough information for most LLMs</p>
</li>
<li><p>Future Work</p>
<ul>
<li><p>Targeted experiments</p>
<ul>
<li>the flexibility of the framework allows specific experiments using customized sets of fillers</li>
</ul>
</li>
<li><p>Interpretability research</p>
<ul>
<li>Knowledge Editing research to basic physical and social concepts</li>
</ul>
</li>
<li><p>From elements to world models</p>
<ul>
<li><p>for a model to function as a flexible and robust general-purpose AI system, it needs to be able to construct, maintain and update internal world models</p>
</li>
<li><p>whether and how LLMs use internal world models is under ongoing investigation</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Limitations</p>
<ul>
<li><p>written in English</p>
</li>
<li><p>same prompting setup for all models</p>
<ul>
<li>with tailored prompt engineering, the performance can improve</li>
</ul>
</li>
<li><p>some items are semantically weird</p>
<ul>
<li>due to the synthetic nature of dataset</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="8-conclusion">8. Conclusion</h1>
<ul>
<li>EWoK provides a way to evaluate the fundamental elements of world knowledge</li>
</ul>
<h1 id="9-comment">9. Comment</h1>
<p>A dataset built to test a model&#39;s &#39;understanding&#39; of the real world. It looks general-purpose and carefully constructed, but an even more appropriate evaluation method would be welcome.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[LayerSkip : Enabling Early Exit Inference and Self-Speculative Decoding]]></title>
            <link>https://velog.io/@0404_not_found/LayerSkip-Enabling-Early-Exit-Inference-and-Self-Speculative-Decoding</link>
            <guid>https://velog.io/@0404_not_found/LayerSkip-Enabling-Early-Exit-Inference-and-Self-Speculative-Decoding</guid>
            <pubDate>Wed, 15 May 2024 13:30:23 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLM Acceleration</p>
<ul>
<li><p>sparsity</p>
</li>
<li><p>quantization</p>
</li>
<li><p>head pruning</p>
</li>
</ul>
</li>
<li><p>Reducing the number of layers for each token by <strong>exiting early during inference</strong></p>
</li>
<li><p>Speculative decoding</p>
<ul>
<li><p>main model + draft model</p>
</li>
<li><p>larger memory footprint and complexity</p>
</li>
<li><p>faster inference</p>
<p>$\rightarrow$ <strong>Self-Speculative Decoding</strong></p>
</li>
</ul>
</li>
<li><p>contribution</p>
<ul>
<li><p>training recipe that combines layer dropout and early exit loss</p>
</li>
<li><p>the recipe more robust to exiting at earlier layers of the model, essentially creating different sized sub-models within the same model</p>
</li>
<li><p>self-speculative decoding solution that decodes with earlier layers and verifies and corrects with later layers</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/607ebae5-3d5e-48d4-895e-4fd14f90caf9/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="2-motivation">2. Motivation</h1>
<h2 id="21-exiting-earlier-in-llms">2.1. Exiting Earlier in LLMs</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f3b7df6d-b84f-4106-99d6-0ce8a918a15e/image.png" alt=""></p>
<ul>
<li><p>Fig 2a -&gt; Llama1 7B + HumanEval coding dataset</p>
</li>
<li><p>projected each layer&#39;s output embeddings on the LM head + softmax $\rightarrow$ got the index of the output element (Unembedding)</p>
<ul>
<li><p>token predictions in earlier layers appear to be irrelevant</p>
</li>
<li><p>in later layers, token predictions converge to the final prediction</p>
</li>
<li><p>most of the time, the final token prediction is reached only a few layers before the end</p>
</li>
<li><p>intermediate layers are sometimes hesitant and change their mind</p>
</li>
<li><p>a token requires 23.45 layers out of the model&#39;s 32 layers</p>
<p>$\rightarrow$ need to make the model use fewer layers</p>
<p>$\rightarrow$ make the model not hesitate and change its mind</p>
</li>
</ul>
</li>
<li><p>skipping layers during training (dropout)</p>
<ul>
<li>higher rate for later layers and lower rates for earlier layers</li>
</ul>
</li>
<li><p>unembedding </p>
<ul>
<li><p>typically LLMs are trained to unembed at the last transformer layer</p>
</li>
<li><p>need to add a loss function during training to make the LM heads <strong>understand</strong> embeddings of earlier layers</p>
</li>
<li><p>shared LM head to early exit</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/983929cd-b5cd-47bb-8f71-619a0a80ee0e/image.png" alt=""></p>
</li>
<li><p>makes the LM head act as an ensemble over sub-models of different depths that share the same weights</p>
</li>
</ul>
</li>
</ul>
<h2 id="22-correcting-if-we-exit-too-early">2.2. Correcting if we exit too early</h2>
<ul>
<li><p>exiting early can reduce the accuracy</p>
<ul>
<li><p>needs a way to verify if an early prediction is accurate and correct it by using remaining layers</p>
</li>
<li><p>Self-speculative decoding</p>
</li>
</ul>
</li>
</ul>
<h1 id="3-related-work">3. Related Work</h1>
<h4 id="dropout">Dropout</h4>
<ul>
<li><p>unstructured dropout (original)</p>
</li>
<li><p>large models (Llama, GPT3, PaLM) don&#39;t use it when training on large corpora</p>
</li>
<li><p>enable the training to learn across an ensemble of many models</p>
</li>
<li><p>multiplicative noise</p>
</li>
</ul>
<h4 id="layer-dropout-stochastic-depth">Layer Dropout (stochastic depth)</h4>
<ul>
<li><p>stochastically skipping layers</p>
</li>
<li><p>LayerDrop in LMs $\rightarrow$ robustness</p>
</li>
<li><p>layer dropout for training decoder-only models or scaling LMs has not been explored</p>
</li>
</ul>
<h4 id="early-exit">Early Exit</h4>
<ul>
<li><p>branch modules at different exit points in a deep learning network + additional loss</p>
</li>
<li><p>in LMs, early exit in encoder-only models was explored</p>
</li>
<li><p>dedicated LM head for each decoder layer</p>
</li>
<li><p>SkipDecode</p>
</li>
<li><p>additional FC layer</p>
</li>
</ul>
<h4 id="speculative-decoding">Speculative Decoding</h4>
<ul>
<li><p>auto-regressive decoding is slow while measuring the likelihood of a group of generated tokens in parallel is faster</p>
</li>
<li><p>draft model (fast, less accurate) to generate tokens and verify and correct with main (slow, more accurate) model</p>
</li>
</ul>
<h1 id="4-proposed-solution">4. Proposed solution</h1>
<h2 id="41-training-using-layer-dropout--early-exit-loss">4.1. Training using Layer Dropout &amp; Early Exit Loss</h2>
<ul>
<li><p>Notation</p>
<ul>
<li><p>model $X$</p>
</li>
<li><p>output $Y$</p>
</li>
<li><p>token embeddings $x_0$</p>
</li>
<li><p>number of layers $L$</p>
</li>
<li><p>$x_{l+1} = x_l + f_l (x_l)$</p>
</li>
<li><p>final LM head maps the embedding outputs to logits $e_L = g(x_L)$</p>
</li>
<li><p>BCE loss = $J_{\text{BCE}}(e_L, Y)$</p>
</li>
</ul>
</li>
</ul>
<h3 id="411-layer-dropout">4.1.1. Layer Dropout</h3>
<ul>
<li><p>layer dropout at layer $l$ and iteration $t$</p>
<ul>
<li><p>$x_{l+1, t} = x_{l, t} + M(p_{l, t})f_l(x_{l, t})$</p>
</li>
<li><p>where $M(p)$ is a Bernoulli random variable that is 0 with probability $p$ (i.e. the layer is skipped)</p>
</li>
<li><p>apply dropout on each sample separately within a batch</p>
</li>
<li><p>remove dropped sample and apply transformer operation $f_l$ on the remaining samples</p>
</li>
<li><p>same random seed for GPUs</p>
</li>
</ul>
</li>
<li><p>Dropout rate $p_{l, t} = S(t)D(l)p_{max}$</p>
<ul>
<li><p>$p_{max}$ : hyperparameter</p>
</li>
<li><p>$D(l)$ : per-layer scaling function</p>
</li>
<li><p>$D(l) = e^{{l \ln 2 \over L-1}} - 1$ was the best (growing exponentially)</p>
</li>
<li><p>$S(t)$ : per-time step scaling function</p>
</li>
<li><p>for fine-tuning or continual training of a pre-trained model, $S(t) = 1$ was the best (the schedule is sketched after this list)</p>
</li>
<li><p>for pretraining from scratch, $S(t) = e^{{t \ln 2 \over T-1}} - 1$ was the best</p>
</li>
</ul>
</li>
</ul>
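<p>A small sketch of the layer-dropout schedule defined above, $p_{l, t} = S(t)D(l)p_{max}$, with the exponential per-layer curve $D(l)$ and the two choices of $S(t)$ (my reading of the formulas; the default values below mirror the Llama2 7B continual-pretraining config from the paper):</p>
<pre><code class="language-python">import math

def D(l, L):
    """per-layer scale: 0 at the first layer, 1 at the last layer"""
    return math.exp(l * math.log(2) / (L - 1)) - 1.0

def S(t, T, from_scratch=False):
    """per-timestep scale: constant for fine-tuning / continual pretraining,
    exponential ramp-up when pretraining from scratch"""
    if not from_scratch:
        return 1.0
    return math.exp(t * math.log(2) / (T - 1)) - 1.0

def dropout_rate(l, t, L=32, T=50_000, p_max=0.1, from_scratch=False):
    return S(t, T, from_scratch) * D(l, L) * p_max

for l in [0, 8, 16, 24, 31]:
    print(l, round(dropout_rate(l, t=0, L=32), 4))
</code></pre>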
<h3 id="412-early-exit-loss">4.1.2. Early Exit Loss</h3>
<ul>
<li><p>LM head $g$ should be capable of unembedding outputs of different layers</p>
</li>
<li><p>During training, supervise the model directly to connect the early exit layers to the LM head</p>
<ul>
<li><p>$J(X, Y, t) = \displaystyle \sum_{l=0}^{l = L-1} \tilde{e}(t, l) J_{\text{BCE}}(g(x_{l+1}), Y)$</p>
</li>
<li><p>$\tilde{e}(t, l) = {C(t,l)e(l) \over \sum_{i=0}^{i=L-1} C(t,i)e(i)}$, normalized per-layer loss scale</p>
</li>
<li><p>$C(t, l)$ : Binary curriculum function that determines if we enable early exit of layer $l$ at iteration $t$</p>
</li>
<li><p>$$ e(l) = 
\begin{cases}
e_{scale} \sum_{i=0}^{i=l} i \quad &amp;\text{if } 0 \le l &lt; L-1 \\
L-1 + e_{scale} \sum_{i=0}^{i=L-2} i \quad &amp;\text{if } l = L-1
\end{cases}
$$</p>
</li>
<li><p>the scale increases across layers</p>
</li>
<li><p>the scale at one layer is proportional to the sum of the scales of all previous layers</p>
</li>
<li><p>penalize later layers with quadratically higher weight (predicting in later layers is easier)</p>
</li>
<li><p>$0 \ \le e_{scale} \ \le 1$ is a hyperparameter</p>
</li>
</ul>
</li>
</ul>
<h4 id="early-exit-loss-curriculum">Early Exit Loss Curriculum</h4>
<ul>
<li><p>adding the early exit loss of all layers at all iterations slows down training and reduces accuracy</p>
</li>
<li><p>use $C(t, l)$</p>
<ul>
<li><p>rotational early exit curriculum $C_{\text{rot}, R}$</p>
<ul>
<li><p>enable early exit at every $R$ layers</p>
</li>
<li><p>only $\lceil L/R \rceil$ unembedding operations are applied</p>
</li>
</ul>
</li>
<li><p>gradual early exit curriculum $C_{\text{grad}}$</p>
<ul>
<li>gradually enable early exit loss from layers $L-1$ to 0, one layer at a time every $T/2L$ iterations</li>
</ul>
</li>
</ul>
</li>
</ul>
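<p>A sketch combining the per-layer loss scale $e(l)$, the rotational curriculum $C_{\text{rot}, R}$ and the normalized weighted sum of per-layer losses (my reading of the formulas above, with a plain token-level cross-entropy standing in for $J_{\text{BCE}}$ and random tensors as a smoke test):</p>
<pre><code class="language-python">import torch
import torch.nn.functional as F

def e_scale_per_layer(L, e_scale):
    e = [e_scale * sum(range(l + 1)) for l in range(L - 1)]   # 0 &lt;= l &lt; L-1
    e.append((L - 1) + e_scale * sum(range(L - 1)))           # l = L-1
    return torch.tensor(e, dtype=torch.float)

def rotational_curriculum(t, L, R):
    """enable early exit at every R-th layer, rotating with iteration t;
    the final layer is always enabled"""
    enabled = torch.zeros(L)
    enabled[(t % R)::R] = 1.0
    enabled[L - 1] = 1.0
    return enabled

def early_exit_loss(hidden_states, lm_head, labels, t, e_scale=0.2, R=8):
    """hidden_states: list of L tensors (batch, seq, dim), one per layer output"""
    L = len(hidden_states)
    C = rotational_curriculum(t, L, R)
    e = e_scale_per_layer(L, e_scale)
    w = (C * e) / (C * e).sum()                   # normalized per-layer loss scale
    loss = 0.0
    for l, h in enumerate(hidden_states):
        if C[l] == 0:
            continue                              # no unembedding for disabled layers
        logits = lm_head(h)
        loss = loss + w[l] * F.cross_entropy(logits.flatten(0, 1), labels.flatten())
    return loss

# tiny smoke test with random tensors
L_layers, vocab, dim = 8, 100, 16
hs = [torch.randn(2, 5, dim) for _ in range(L_layers)]
head = torch.nn.Linear(dim, vocab)
labels = torch.randint(0, vocab, (2, 5))
print(early_exit_loss(hs, head, labels, t=0, e_scale=0.2, R=4))
</code></pre>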
<h4 id="hyperparameter-summary">Hyperparameter Summary</h4>
<ul>
<li><p>Layer Dropout</p>
<ul>
<li><p>$p_{max}$ : max dropout rate of last layer of the model</p>
</li>
<li><p>$S(t)$: layer dropout curriculum</p>
</li>
</ul>
</li>
<li><p>Early Exit Loss</p>
<ul>
<li><p>$e_{scale}$: scalar scale of loss of earlier layers</p>
</li>
<li><p>$C(t,l)$: early exit loss curriculum</p>
</li>
</ul>
</li>
</ul>
<h2 id="42-inference-using-early-exit">4.2. Inference using Early Exit</h2>
<ul>
<li><p>run the first $E$ transformer layers and skip to the model&#39;s LM head</p>
</li>
<li><p>the final output is $g(x_E)$</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/46049f50-9028-43a7-96c4-01e1584fdd07/image.png" alt=""></p>
<h2 id="43-inference-using-self-speculative-decoding">4.3. Inference using Self-Speculative Decoding</h2>
<ul>
<li><p>Self-speculative decoding</p>
<ul>
<li><p>uses a single model while keeping the latency benefits of traditional speculative decoding</p>
</li>
<li><p>Self Drafting and Self-Verification</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a0ad0bae-a87b-4a3b-9c0c-13bfb37a93a3/image.png" alt=""></p>
</li>
<li><p>Self Drafting: using the early exit to draft tokens</p>
</li>
<li><p>Self Verification: using the remaining layers to validate the prediction</p>
</li>
<li><p>Cache Reuse : unifies the KV cache and stores the exit query</p>
</li>
</ul>
</li>
</ul>
<h3 id="431-self-drafting">4.3.1. Self-Drafting</h3>
<ul>
<li><p>compute the first $d$ draft tokens through early exit</p>
<ul>
<li><p>leverage a subset of the LLM and conduct auto-regressive inference exiting at layer $E$</p>
</li>
<li><p>train the model once to get an ensemble of different candidate draft models at each layer depth</p>
</li>
</ul>
</li>
</ul>
<h3 id="432-self-verification">4.3.2. Self-Verification</h3>
<ul>
<li><p>leverages the full LLM to predict the next token for each draft token in a single forward pass</p>
</li>
<li><p>find the point where the draft tokens and verified tokens agree</p>
</li>
<li><p>all the draft tokens up to the disagreement point are added to the output along with the next verified token, and drafting continues from there</p>
</li>
<li><p>verification only computes the remaining $L-E$ layers (a toy version of the loop is sketched after this list)</p>
</li>
</ul>
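<p>A runnable toy of the draft-then-verify loop described above, using GPT-2 as a stand-in: it drafts greedily from an intermediate layer&#39;s hidden state through the shared LM head, then verifies the draft with one full forward pass. This only illustrates the accept/correct logic; the real implementation stops drafting at layer $E$, runs only the remaining $L-E$ layers for verification and reuses the KV and exit-query caches.</p>
<pre><code class="language-python">import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
E, d = 6, 4                                     # exit layer and number of draft tokens

def logits_at_layer(ids, layer):
    h = model(ids, output_hidden_states=True).hidden_states[layer]
    return model.lm_head(model.transformer.ln_f(h))   # shared LM head on an early layer

@torch.no_grad()
def self_speculative_step(ids):
    # self-drafting: greedy decoding that exits at layer E
    draft = ids
    for _ in range(d):
        nxt = logits_at_layer(draft, E)[:, -1].argmax(-1, keepdim=True)
        draft = torch.cat([draft, nxt], dim=-1)
    # self-verification: one full-model pass scores every draft position at once
    full = model(draft).logits.argmax(-1)
    accepted = ids
    for i in range(ids.shape[1], draft.shape[1]):
        verified = full[:, i - 1:i]             # prediction of the full model for position i
        accepted = torch.cat([accepted, verified], dim=-1)
        if verified.item() != draft[0, i].item():
            break                               # first disagreement: keep the corrected token, stop
    return accepted

ids = tok("The layer dropout rate", return_tensors="pt").input_ids
print(tok.decode(self_speculative_step(ids)[0]))
</code></pre>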
<h3 id="433-reusing-the-cache">4.3.3. Reusing the Cache</h3>
<ul>
<li><p>avoid recomputing prior KV pairs in each layer</p>
</li>
<li><p>Single KV Cache</p>
<ul>
<li>first $E$ layers are shared in two stages</li>
</ul>
</li>
<li><p>Exit Query Cache</p>
<ul>
<li><p>saves the query vector of exit layer $E-1$ for verification to directly continue from layer $E$</p>
</li>
<li><p>save only the query for the exit layer</p>
</li>
</ul>
</li>
</ul>
<h1 id="5-experiments">5. Experiments</h1>
<ul>
<li><p>Continual Pretraining</p>
<ul>
<li><p>continue training with 52B tokens</p>
</li>
<li><p>text + code</p>
</li>
<li><p>Llama2 7B (32 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=8}$</p>
</li>
</ul>
</li>
<li><p>Llama2 13B (40 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.1$</p>
</li>
<li><p>$C_{\text{rot}, R=39}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Pretraining from scratch</p>
<ul>
<li><p>26B tokens</p>
</li>
<li><p>text + code</p>
</li>
<li><p>Llama2 1.5B (24 layers)</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=23}$</p>
</li>
</ul>
</li>
<li><p>Llama2 7B (32 layers)</p>
<ul>
<li><p>$p_{max} = 0.2$</p>
</li>
<li><p>$e_{scale} = 0.2$</p>
</li>
<li><p>$C_{\text{rot}, R=31}$</p>
</li>
</ul>
</li>
<li><p>higher LR when layer dropout $&gt; 0.0$ is used</p>
</li>
</ul>
</li>
<li><p>Fine-tuning on Code</p>
<ul>
<li><p>5.2B tokens</p>
</li>
<li><p>Llama1 7B</p>
<ul>
<li><p>$p_{max} = 0.1$</p>
</li>
<li><p>$e_{scale} = 1.0$</p>
</li>
<li><p>$C_{\text{rot}, R=16}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Fine-tuning on Task-Specific Dataset</p>
<ul>
<li><p>TOPv2 dataset</p>
</li>
<li><p>Llama 1.5B (24 layers)</p>
<ul>
<li><p>$p_{max} = 0.2$</p>
</li>
<li><p>$e_{scale} = 1.0$</p>
</li>
<li><p>$C_{\text{grad}}$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>tried LD, EE, LD+EE</p>
</li>
</ul>
<h1 id="6-results">6. Results</h1>
<h2 id="61-early-exit-inference-results">6.1. Early Exit Inference Results</h2>
<h4 id="continual-pretraining">Continual Pretraining</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/218bbbed-9e1e-436b-9675-43768b70d36c/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/89254e52-1517-4042-a1cf-e3b9972c6086/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f7fa276d-d5e0-4046-9631-35388fd91a31/image.png" alt=""></p>
<ul>
<li><p>LayerSkip is better than the baseline</p>
</li>
<li><p>for the last layer accuracy, LayerSkip has minimal drop in accuracy</p>
</li>
<li><p>some classification tasks (multiple choice, TF) $\rightarrow$ maintain relatively decent accuracy on earlier layers</p>
</li>
<li><p>generation task $\rightarrow$ drop drastically</p>
</li>
<li><p>classification is evaluated on one token while generation is evaluated on many tokens</p>
</li>
<li><p>in MMLU, Llama2 13B baseline dropped from 55.2 to 49.2</p>
</li>
<li><p>NaturalQuestions $\rightarrow$ LayerSkip&#39;s accuracy is higher at middle layer</p>
</li>
</ul>
<h4 id="pretraining-from-scratch">Pretraining from Scratch</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1e62ad4d-a614-44f5-81f3-28f0a6cca495/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d5721ab2-94ce-42d7-b3dd-a8494dc91f11/image.png" alt=""></p>
<ul>
<li><p>on the last layer in some downstream tasks, a slight drop in accuracy is seen</p>
<ul>
<li>small number of pretraining tokens $\rightarrow$ some tasks were close to random guessing</li>
</ul>
</li>
</ul>
<h4 id="finetuning-on-code-data">Finetuning on Code Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/36f7dd26-3300-4cfa-b0e7-3f70585f994f/image.png" alt=""></p>
<ul>
<li><p>Fig 10a</p>
</li>
<li><p>earlier layers are better than the baseline</p>
</li>
<li><p>LD+EE shows a big improvement</p>
</li>
<li><p>since this is domain-specific data, $e_{scale}$ was scaled up to 1.0</p>
</li>
</ul>
<h4 id="finetuning-on-task-specific-dataset">Finetuning on Task-Specific Dataset</h4>
<ul>
<li><p>Fig 10b</p>
</li>
<li><p>when layers are removed from the baseline, the model is not able to generate complete and accurate parses $\rightarrow$ 0 EM</p>
</li>
<li><p>LayerSkip shows 77% at layer 12</p>
</li>
<li><p>regression in the final layer reducing accuracy by 3%</p>
</li>
</ul>
<h2 id="62-self-speculative-decoding-results">6.2. Self-Speculative Decoding Results</h2>
<ul>
<li><p>used EM, ROUGE-2</p>
</li>
<li><p>compared with common models and tasks in Draft &amp; Verify</p>
</li>
<li><p>used greedy decoding and max 512 tokens</p>
</li>
</ul>
<h4 id="continual-pretraining-1">Continual Pretraining</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/08d30d3f-969b-4419-a93a-a9669582c3ac/image.png" alt=""></p>
<ul>
<li>higher speedups for the smaller model</li>
</ul>
<h4 id="pretraining-from-scratch-1">Pretraining from Scratch</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c5837a34-1c32-4244-9931-5f94e1511b9b/image.png" alt=""></p>
<h4 id="finetuning-on-code-data-1">Finetuning on Code Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1d083942-cc3e-47eb-9c14-2b494eb14e00/image.png" alt=""></p>
<h4 id="finetuning-on-task-specific-data">Finetuning on Task-Specific Data</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/cdd5f92c-e678-4035-8620-5b756b294e86/image.png" alt=""></p>
<h1 id="7-ablation-studies">7. Ablation Studies</h1>
<h4 id="scaling-with-pretraining-tokens">Scaling with Pretraining Tokens</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d15ca60d-2d89-469e-bc6c-385bff7bacc0/image.png" alt=""></p>
<ul>
<li><p>50000 steps</p>
</li>
<li><p>batch size per device: 4</p>
</li>
<li><p>context window: 4096</p>
</li>
<li><p>number of GPUs: 32, 64, 128</p>
</li>
<li><p>middle layer PPL increases by default (w/o EE)</p>
</li>
<li><p>could open door about the dynamics of transformers</p>
</li>
</ul>
<h4 id="kv-cache-in-self-speculation">KV Cache in Self-Speculation</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/28970675-bb5b-4060-aaf3-b47aa17398f2/image.png" alt=""></p>
<ul>
<li>use of KV cache is able to consistently save 9-20ms per token</li>
</ul>
<h1 id="8-limitations">8. Limitations</h1>
<ul>
<li><p>unlike prior self-speculative decoding work that doesn&#39;t require changing a model&#39;s weights, this approach needs training or fine-tuning with the recipe</p>
</li>
<li><p>$p_{max}$, $e_{scale}$, $R$ need to be tuned</p>
</li>
<li><p>when pretraining with layer dropout from scratch, a higher LR is needed and tuning the LR is tricky</p>
</li>
</ul>
<h1 id="9-conclusion">9. Conclusion</h1>
<ul>
<li><p>layer dropout + early exit loss improves accuracy and speed</p>
</li>
<li><p>hope this to be combined with PEFT</p>
</li>
<li><p>in the future, increasing the accuracy of early-exit layers and exploring dynamic conditions to determine a different exit layer can be done</p>
</li>
</ul>
<h1 id="10-comment">10. Comment</h1>
<p>Unlike pruning, this technique trains and runs inference using only selected layers, and makes good use of the remaining layers to implement Self-Speculative Decoding. It seems that, in the end, not every layer of a Transformer is needed. Feels similar in spirit to CoT without prompting.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Octopus v4: Graph of Language Models]]></title>
            <link>https://velog.io/@0404_not_found/Octopus-v4-Graph-of-Language-Models</link>
            <guid>https://velog.io/@0404_not_found/Octopus-v4-Graph-of-Language-Models</guid>
            <pubDate>Sun, 05 May 2024 02:17:48 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5d2ff526-a698-4ae7-a8ad-af1bf1364133/image.png" alt=""></p>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLMs became very powerful and used in lots of fields</p>
</li>
<li><p>Due to Llama 2 and 3, open-source LLMs have seen significant growth</p>
<ul>
<li>user may select the optimal model based on the use case</li>
</ul>
</li>
<li><p>Graph data structure</p>
<ul>
<li><p>can be used to represent the relationships between models, the optimal use cases and their capabilities</p>
</li>
<li><p>create a powerful framework for seamless model integration, intelligent query routing and optimized performance</p>
</li>
</ul>
</li>
<li><p>on-device AI models</p>
<ul>
<li><p>enhances security, reduces latency</p>
</li>
<li><p><strong>cloud-on-device collaboration</strong></p>
<ul>
<li><p>seamless integration with cloud-based models</p>
</li>
<li><p>light task for on-device models, complicated task for cloud models</p>
</li>
<li><p>IoT may play a crucial role by connecting a vast network of devices</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="2-related-works">2. Related Works</h1>
<h4 id="graph-data-format">Graph data format</h4>
<ul>
<li><p>BFS, DFS</p>
</li>
<li><p>PageRank</p>
</li>
<li><p>GNN</p>
</li>
<li><p>GAT (Graph Attention Networks), GCN (Graph Convolution Networks)</p>
</li>
</ul>
<h4 id="ai-agents-with-functional-tokens">AI agents with functional tokens</h4>
<ul>
<li><p>functional tokens can select suitable models or functions</p>
</li>
<li><p>make synergy with Octopus framework</p>
</li>
<li><p>selects the best neighbor, restructures the information and transmits optimized information</p>
</li>
</ul>
<h4 id="multi-agent-llms">Multi-Agent LLMs</h4>
<ul>
<li><p>harness collective intelligence from specialized agents</p>
</li>
<li><p>integration difficulties, data sharing issues and maintaining smooth coordination between agents</p>
</li>
<li><p>exploring possibilities like cross-domain expertise and real-time collaboration</p>
</li>
<li><p>parallel function calling $\rightarrow$ self-connections</p>
</li>
<li><p>sequential action processing $\rightarrow$ graph traversal</p>
</li>
</ul>
<h4 id="llm-scaling-law">LLM Scaling law</h4>
<ul>
<li>leveraging distributed computing and node expansion to address the scalability issues $\rightarrow$ nearly <strong>unlimited node scalability</strong></li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-lm-for-classification-from-octopus-v2">3.1 LM for classification from Octopus v2</h2>
<ul>
<li><p>functional token in Octopus v2</p>
<ul>
<li>$f$ for the choice from the set $F$, $params$ for the reformulated information derived from the query $q$</li>
</ul>
</li>
</ul>
<p>$$ 
P(f, params  \ | \ q)
$$</p>
<ul>
<li><p>used in selecting the optimal choice, reformulating the query to transmit</p>
<ul>
<li>select the best neighboring nodes, pass the information to subsequent nodes</li>
</ul>
</li>
</ul>
<h2 id="32-lms-as-nodes-in-graph">3.2 LMs as nodes in graph</h2>
<ul>
<li><p>directed and heterogeneous graph $G = (N, E)$</p>
<ul>
<li><p>master nodes $N^m$ : coordinate queries by directing to worker nodes</p>
</li>
<li><p>worker nodes $N^w$ : transfer necessary information for task</p>
</li>
<li><p>master node passes the information and worker nodes handle</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4905f222-b933-47b1-bdc0-927749315c69/image.png" alt=""></p>
<ul>
<li><p>user queries $q$ and responses $y$</p>
<ul>
<li>$P(y \ | \ q) = P(y \ | \ q; G)$</li>
</ul>
</li>
<li><p>single-step task involves only one worker node</p>
<ul>
<li><p>$P(y \ | \ q; G) = P(N^w, q_h \ | \ q; N^m)P(y \ | \ q_h ; N^w)$</p>
</li>
<li><p>the first factor on the right-hand side is the Octopus v2-style selection</p>
</li>
<li><p>uses Octopus v2 to select the best neighboring worker $N^w$ and reformat the query to $q_h$</p>
</li>
<li><p>the second factor is the selected worker computing the result (a toy routing sketch follows this list)</p>
</li>
</ul>
</li>
<li><p>Multi-step task involves several sequential interactions</p>
<ul>
<li>simply expands the formula</li>
</ul>
</li>
</ul>
<p>$$
P(y \ | \ q; G) = \prod_{i=0}^{k-1} P(N_i^w, q_{h_i} \ | \ q; N_i^m)P(y \ | \ q_{h_i}; N_i^w)
$$</p>
<ul>
<li><p>to answer one query from the user, only activating two small models is needed</p>
</li>
<li><p>use functional token to get rid of RAG</p>
</li>
</ul>
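<p>A toy sketch of the master/worker factorization above (hypothetical names and routing rule; in the actual system the master is an Octopus v2-style model that emits a learned functional token and a reformatted query rather than a hand-written dispatch table):</p>
<pre><code class="language-python">from dataclasses import dataclass
from typing import Callable, Dict

@dataclass
class WorkerNode:
    name: str
    run: Callable[[str], str]                  # plays the role of P(y | q_h ; N^w)

workers: Dict[str, WorkerNode] = {
    "&lt;nexa_4&gt;": WorkerNode("math_gpt", lambda q: f"[math answer to: {q}]"),
    "&lt;nexa_7&gt;": WorkerNode("bio_gpt",  lambda q: f"[biology answer to: {q}]"),
}

def master_node(query):
    """stand-in for the master: returns (functional token, reformatted query),
    i.e. a sample from P(f, params | q)"""
    token = "&lt;nexa_4&gt;" if any(w in query.lower() for w in ("integral", "solve", "sum")) else "&lt;nexa_7&gt;"
    return token, f"Answer concisely: {query}"

def answer(query):
    token, q_h = master_node(query)            # master selects a worker and reformats the query
    return workers[token].run(q_h)             # only the selected worker is activated

print(answer("Solve the integral of x^2 from 0 to 1"))
</code></pre>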
<h2 id="33-task-planning-using-graphs-for-multistep-operations">3.3 Task planning using graphs for multistep operations</h2>
<ul>
<li><p>traditional approach</p>
<ul>
<li><p>all available functions are listed</p>
</li>
<li><p>LLM generated the plan with the user query and the list</p>
</li>
<li><p>a small model cannot grasp the extensive descriptions effectively</p>
</li>
<li><p>it doesn&#39;t consider the inherent relevance among function descriptions</p>
<p>$\rightarrow$ using Graph</p>
</li>
</ul>
</li>
<li><p>Graph-based approach</p>
<ul>
<li><p>only neighboring nodes are considered</p>
</li>
<li><p>reducing the complexity</p>
</li>
</ul>
</li>
<li><p>using Octopus v2</p>
<ul>
<li><p>enabling rapid query redirection and reformatting</p>
</li>
<li><p>apply the functional token to make it as a single AI agent which can take single function callings for each LMs</p>
</li>
<li><p>or the single node can be an ordinary LM (Llama3, Phi3)</p>
</li>
<li><p>At another layer, use Octopus v3 to choose among the nodes</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/879fdd22-2f29-4c66-babd-e8ce158b8745/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="34-functional-token-and-dataset-collections">3.4 Functional token and dataset collections</h2>
<ul>
<li><p>conceptualize each model as a distinct function</p>
</li>
<li><p>for specific models, detail the required prompt template in the function&#39;s doc string</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/282db3fa-d0ab-47a3-b1e3-f2d612672106/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/1989489c-416c-4b12-9188-90f1257979cf/image.png" alt=""></p>
<ul>
<li><p>construct the dataset using similar strategy to Octopus v2</p>
<ul>
<li><p>synthetic data to train the functional tokens</p>
</li>
<li><p>increase the temperature to accommodate diverse queries</p>
</li>
</ul>
</li>
</ul>
<h2 id="35-system-design-of-lm-graph">3.5 System design of LM graph</h2>
<ul>
<li><p>Worker node deployment</p>
<ul>
<li><p>$N^w$ as an individual LM</p>
</li>
<li><p>serverless architecture</p>
</li>
<li><p>limit the worker size to 10B</p>
</li>
</ul>
</li>
<li><p>Master node deployment</p>
<ul>
<li><p>base model with fewer than 10B</p>
</li>
<li><p>compact LoRA can be integrated to extend functional token capabilities</p>
</li>
<li><p>single base model with multiple LoRA, one per each worker</p>
</li>
<li><p>LoraX library</p>
</li>
</ul>
</li>
<li><p>Communication</p>
<ul>
<li><p>workers and the master are distributed across various devices</p>
</li>
<li><p>internet connectivity is essential</p>
</li>
<li><p>master $\rightarrow$ on-device, worker $\rightarrow$ cloud</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c2b204dc-6c5f-4922-ac27-e88a44ceffe5/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-task-and-models">4.1 Task and models</h2>
<ul>
<li>MMLU with 17 distinct models</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/68b37b78-ef0d-4650-b181-5b77dd2047ca/image.png" alt=""></p>
<ul>
<li><p>Specialized models from HF based on benchmark, popularity and endorsements</p>
</li>
<li><p>Not all tasks have a specialized model $\rightarrow$ Llama 3 with a system prompt is used instead of a specialized model (Humanities tasks)</p>
</li>
</ul>
<h2 id="42-mmlu-evaluation">4.2 MMLU evaluation</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/959aa892-efad-4282-8a44-3d4e3666d0f6/image.png" alt=""></p>
<ul>
<li><p>example query
<img src="https://velog.velcdn.com/images/0404_not_found/post/11f7b8eb-f15d-45eb-84db-23c4433893c8/image.png" alt=""></p>
</li>
<li><p>&lt;nexa_4&gt; is a functional token which maps to math gpt</p>
</li>
</ul>
<h1 id="5-discussion-and-future-works">5. Discussion and Future works</h1>
<h2 id="51-how-to-train-a-vertical-model">5.1 How to train a vertical model</h2>
<ul>
<li><p>fine-tune with domain-specific expertise</p>
<ul>
<li><p>gather a substantial corpus</p>
</li>
<li><p>ensure the data is diverse, well-organized, embodies the knowledge</p>
</li>
<li><p>clean the data</p>
</li>
</ul>
</li>
<li><p>Use HF SFT Trainer</p>
</li>
</ul>
<h2 id="52-future-work">5.2 Future work</h2>
<ul>
<li><p>integrating a variety of vertical-specific models</p>
</li>
<li><p>Multimodal case (Octopus 3.5)</p>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>Rather than RAG, which decides the next action based on similarity to a given query, the idea is that attaching values to tokens during training in the first place lets the system select actions faster. Seems useful when building agents.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Can large language models explore in-context?]]></title>
            <link>https://velog.io/@0404_not_found/Can-large-language-models-explore-in-context</link>
            <guid>https://velog.io/@0404_not_found/Can-large-language-models-explore-in-context</guid>
            <pubDate>Sat, 30 Mar 2024 10:04:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>In-context Learning $\rightarrow$ important emergent capability of LLM</p>
<ul>
<li><p>without updating the model parameters, LLMs can solve various problems</p>
</li>
<li><p>this ability is extracted from training corpus and emerge at scale</p>
</li>
</ul>
</li>
<li><p>After GPT3, ICL has been the subject of a growing body of research</p>
<ul>
<li>lots of research focused on In-Context Supervised Learning (ICSL)</li>
</ul>
</li>
<li><p>Many application demand the use of ML model for decision making</p>
<ul>
<li><p>In-Context Reinforcement Learning (ICRL) + sequential decision making is the next frontier</p>
</li>
<li><p>LLMs are already used in decision making (experiment design, games, etc.)</p>
</li>
<li><p>ICRL is less developed than ICSL</p>
</li>
</ul>
</li>
<li><p>Decision-making agents must possess</p>
<ul>
<li><p>generalization : required for supervised learning</p>
</li>
<li><p>exploration : making suboptimal decision to gather more information</p>
</li>
<li><p>planning : account long-term consequences of decisions</p>
</li>
<li><p><strong>exploration</strong> is focused in this paper</p>
</li>
</ul>
</li>
<li><p>recent papers about ICRL</p>
<ul>
<li><p>ICRL in transformer when they are explicitly trained</p>
</li>
<li><p>training is hard</p>
</li>
<li><p>in that case, <strong>Does it exhibit the capability to explore in-context?</strong></p>
</li>
</ul>
</li>
<li><p>Deploying LLM to solve multi-armed bandit problem</p>
<ul>
<li><p>classical RL problem shows the tradeoff between exploration and exploitation</p>
</li>
<li><p>this would be the building block to general RL question</p>
</li>
</ul>
</li>
<li><p>evaluate the in-context behavior</p>
<ul>
<li><p>tested GPT-3.5, GPT-4, LLaMA2</p>
</li>
<li><p>only single configuration (prompt + model) showed satisfactory exploratory behavior</p>
</li>
<li><p>the dominant failure mode is the suffix failure (failing to select the best arm even once after some initial rounds)</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/92020dbd-8a8a-4183-ad08-93489a8b59ea/image.png" alt=""></p>
</li>
<li><p>GPT-4 with basic prompt failed in over 60%</p>
</li>
<li><p>another failure is for LLM to behave uniformly</p>
</li>
</ul>
</li>
<li><p>successful configuration</p>
<ul>
<li><p>GPT-4 + enhanced prompt</p>
<ul>
<li><p>suggestive hint</p>
</li>
<li><p>summarizes the history of interaction into per-arm average</p>
</li>
<li><p>zero-shot CoT</p>
</li>
</ul>
</li>
<li><p>the SOTA model has the capability to robustly explore if the prompt is designed carefully</p>
</li>
<li><p>but it may fail in complex environments</p>
<ul>
<li>summarizing history is non-trivial problem</li>
</ul>
</li>
</ul>
</li>
<li><p>In-Context bandit learning is hard</p>
<ul>
<li><p>stochasticity in the environment demands a high degree of replication for statistical significance</p>
</li>
<li><p>even single experiment involve hundreds or thousands LLM queries</p>
</li>
</ul>
</li>
<li><p>identify surrogate statistics as diagnostics for long-term exploration failure</p>
<ul>
<li>characterize long-term exploration failure</li>
</ul>
</li>
</ul>
<h1 id="2-experimental-setup">2. Experimental Setup</h1>
<h4 id="multi-armed-bandits">Multi-armed bandits</h4>
<ul>
<li><p>used MAB variant, Stochastic Bernoulli bandits</p>
<ul>
<li><p>$K$ possible actions (arms) $[K] = \{1, \dots, K\}$</p>
</li>
<li><p>each arm $a$ is associated with mean reward $\mu_a \in [0, 1]$ (unknown)</p>
</li>
<li><p>an agent interacts with the environment with $T$ time steps</p>
</li>
<li><p>each time step $t \in [T]$ the agent selects an arm $a_t \in [K]$ and receives a reward $r_t \in \{ 0, 1 \}$ drawn from a Bernoulli with mean $\mu_{a_t}$</p>
</li>
<li><p>the MAB instance is determined by the mean rewards and the time horizon</p>
</li>
</ul>
</li>
<li><p>rewards for the arms not chosen by the agent are not revealed $\rightarrow$ exploration is necessary to identify the best arm</p>
</li>
<li><p>focus on MAB instances where the best arm has mean reward $\mu^* = 0.5 + \Delta / 2$ and all other arms have mean reward $\mu = 0.5 - \Delta / 2$ (a toy instance plus a UCB baseline are sketched after this list)</p>
<ul>
<li>$\Delta = \mu^* - \mu$</li>
</ul>
</li>
<li><p>set $K = 5$ and $\Delta = 0.2$ as &#39;hard&#39; instance</p>
</li>
<li><p>set $K = 4$ and $\Delta = 0.5$ as &#39;easy&#39; instance</p>
</li>
</ul>
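<p>A small sketch of the hard Bernoulli instance above ($K = 5$, $\Delta = 0.2$) together with a UCB1 baseline of the kind the LLM configurations are compared against (my own toy implementation, not the paper&#39;s code):</p>
<pre><code class="language-python">import numpy as np

def make_instance(K=5, delta=0.2, rng=None):
    rng = rng or np.random.default_rng(0)
    mu = np.full(K, 0.5 - delta / 2)           # all arms at 0.5 - Delta/2 ...
    best = int(rng.integers(K))
    mu[best] = 0.5 + delta / 2                 # ... except the best arm
    return mu, best

def run_ucb(mu, T=100, rng=None):
    rng = rng or np.random.default_rng(0)
    K = len(mu)
    counts, sums, choices = np.zeros(K), np.zeros(K), []
    for t in range(T):
        if t &lt; K:
            a = t                              # play each arm once first
        else:
            ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)
            a = int(ucb.argmax())
        r = rng.binomial(1, mu[a])             # Bernoulli reward, unseen for other arms
        counts[a] += 1; sums[a] += r; choices.append(a)
    return np.array(choices)

mu, best = make_instance()
choices = run_ucb(mu)
print("fraction of rounds on the best arm:", (choices == best).mean())
</code></pre>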
<h4 id="prompts">Prompts</h4>
<ul>
<li><p>prompt design</p>
<ul>
<li><p>scenario</p>
<ul>
<li><p>positioning LLM as an agent choosing buttons to press </p>
</li>
<li><p>or a recommendation engine displaying advertisements to users</p>
</li>
</ul>
</li>
<li><p>framing</p>
<ul>
<li><p>suggestive of the need to balance exploration and exploitation</p>
</li>
<li><p>neutral</p>
</li>
</ul>
</li>
<li><p>history</p>
<ul>
<li><p>raw list over rounds</p>
</li>
<li><p>summarized via number of rounds and average rewards of each arm</p>
</li>
</ul>
</li>
<li><p>requested final answer</p>
<ul>
<li><p>single arm</p>
</li>
<li><p>distribution over arms</p>
</li>
</ul>
</li>
<li><p>method</p>
<ul>
<li><p>the answer only</p>
</li>
<li><p>CoT</p>
</li>
</ul>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/eabef338-485b-4ca7-a091-dbfe949f8778/image.png" alt=""></p>
</li>
<li><p>basic prompt is buttons / neutral framing / raw history / return only arm / no CoT</p>
</li>
</ul>
</li>
<li><p>Each modification might help LLM with model&#39;s knowledge</p>
<ul>
<li><p>advertising scenario / suggestive framing (system message) : model&#39;s knowledge of bandit algorithms</p>
</li>
<li><p>history summarization (user message) : if LLM reliably summarize history itself</p>
</li>
<li><p>returning a distribution (system message) : help to identify a good distribution (fails to sample from it)</p>
</li>
<li><p>CoT (system message) : general performance</p>
</li>
</ul>
</li>
<li><p>in GPT-4, used reinforced CoT design to additionally reminds the model to use CoT at user prompt</p>
</li>
</ul>
<h4 id="llm-configurations">LLM configurations</h4>
<ul>
<li><p>models</p>
<ul>
<li><p>GPT-3.5-Turbo-0613</p>
</li>
<li><p>GPT-4-0613</p>
</li>
<li><p>LLaMA2-13B-CHAT with 4bits</p>
</li>
</ul>
</li>
<li><p>Temparature 0(deterministic) or 1</p>
<ul>
<li>don&#39;t consider temp 1 with &#39;return distribution&#39; as it would add external randomness</li>
</ul>
</li>
<li><p>5-letter $L_1 L_2 L_3 L_4 L_5$ notation for prompt design </p>
<ul>
<li><p>$L_1$ : $\text{B}$ or $\text{A}$ for buttons or advertisements scenario</p>
</li>
<li><p>$L_2$ : $\text{N}$ or $\text{S}$ for neutral or suggestive framing</p>
</li>
<li><p>$L_3$ : $\text{R}$ or $\text{S}$ for raw or summarized history</p>
</li>
<li><p>$L_4$ : $\text{C}$ or $\tilde{\text{C}}$ or $\text{N}$ for CoT, reinforced CoT or no CoT</p>
</li>
<li><p>$L_5$ : $0$, $1$ or $\text{D}$ for temperature and returning a distribution (temp 0)</p>
</li>
<li><p>$\text{BNRN0}$ as a basic prompt</p>
</li>
<li><p>advertisement scenario will be used as a robustness check</p>
</li>
</ul>
</li>
<li><p>48 configs for GPT-3.5 and LLaMA2 and 72 configs for GPT-4</p>
</li>
</ul>
<h4 id="baselines">Baselines</h4>
<ul>
<li><p>two standard MAB algorithms</p>
<ul>
<li><p>UCB</p>
</li>
<li><p>Thompson Sampling (TS)</p>
</li>
</ul>
</li>
<li><p>Greedy (doesn&#39;t explore and eventually fails)</p>
</li>
<li><p>no parameter tuning</p>
</li>
<li><p>1000 replicates for each baseline and each MAB instance</p>
</li>
</ul>
<h4 id="scale-of-the-experiments">Scale of the experiments</h4>
<ul>
<li><p>time horizon $T = 100$</p>
</li>
<li><p>$N \in \{ 10, 20 \}$ replicates for each LLM configuration and bandit instance</p>
</li>
<li><p>single experiment on GPT-4 with basic configuration for $T = 500$ for robustness check</p>
</li>
<li><p>in detail</p>
<ul>
<li><p>GPT-3.5</p>
<ul>
<li><p>$N = 20$ replicates across all 48 prompt</p>
</li>
<li><p>about 200K queries</p>
</li>
</ul>
</li>
<li><p>GPT-4</p>
<ul>
<li>$N = 10$ replicates across 10 representative configurations</li>
</ul>
</li>
<li><p>GPT-4 (additional robustness check)</p>
<ul>
<li><p>$T=200$</p>
</li>
<li><p>two for $N = 20$</p>
</li>
<li><p>two for $N = 40$</p>
</li>
</ul>
</li>
<li><p>LLaMA2</p>
<ul>
<li><p>free from query (local model)</p>
</li>
<li><p>hard MAB instance, 32 configs, $N = 10$</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>$N \times T$ LLM queries for each config and MAB instance</p>
<ul>
<li><p>$N$ : significance level, must be large to overcome randomness in rewards</p>
</li>
<li><p>$T$ : effect size, must be large so that good algorithms have enough time to identify the optimal arm</p>
</li>
</ul>
</li>
<li><p>Both exploration failures are less frequent in easier MAB instances</p>
</li>
<li><p>To cover the extremely large prompt space, one would want small $\Delta$ and large $N$, $T$</p>
</li>
<li><p>$N \in \{10, 20\}$, $T = 100$ and $\Delta = 0.2$ do not provide enough statistical power to distinguish between successful and unsuccessful methods</p>
</li>
<li><p>rely on surrogate statistics which can be detected in current moderate scale rather than scale up</p>
</li>
</ul>
<h1 id="3-experimental-results">3. Experimental Results</h1>
<h2 id="31-overview">3.1 Overview</h2>
<ul>
<li><p>All but one LLM config failed to converge to the best arm with significant probability</p>
</li>
<li><p>Suffix Failures</p>
<ul>
<li>LLM never selects the best arm after a small number of initial rounds</li>
</ul>
</li>
<li><p>Uniform-like failures</p>
<ul>
<li><p>LLM chooses all arms at uniform rate</p>
</li>
<li><p>failed to eliminate poorly performing arms</p>
</li>
</ul>
</li>
<li><p>the only exception is GPT-4 with $\text{BSS}\tilde{\text{C}}0$</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/22f9108a-3e96-4097-8ad1-0217cdc3bf04/image.png" alt=""></p>
<ul>
<li><p>Fig 3 : summarize the main set of experiments (hard MAB instance)</p>
</li>
<li><p>two surrogate statistics</p>
<ul>
<li><p>SuffFailFreq : measures suffix failures $\rightarrow$ exploration fail</p>
</li>
<li><p>$K \cdot$ MinFrac : measures uniform-like failures $\rightarrow$ exploitation fail</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/7f6fd825-acfd-49f1-b243-433fb8c47c11/image.png" alt=""></p>
<ul>
<li><p>show another statistic GreedyFrac (how similar a method is to GREEDY)</p>
</li>
<li><p>only GPT-4 with $\text{BSS}\tilde{\text{C}}0$ follows the baseline TS and UCB</p>
</li>
</ul>
<h2 id="32-identifying-failures">3.2 Identifying failures</h2>
<ul>
<li>focus on GPT-4</li>
</ul>
<h4 id="suffix-failures">Suffix Failures</h4>
<ul>
<li><p>most of the LLM configs exhibit bimodal behavior</p>
<ul>
<li><p>large fraction of the replicates choose the best arm very rarely</p>
</li>
<li><p>few replicates converged extremely quickly</p>
</li>
</ul>
</li>
<li><p>Consistent with this, suffix failures occurred many times</p>
</li>
<li><p>suggests long-term failure to explore</p>
<ul>
<li><p>cannot be improved by running more time steps</p>
</li>
<li><p>very similar to greedy and different from UCB and TS</p>
</li>
</ul>
</li>
<li><p>For an experiment replicate $\text{R}$ and round $t$</p>
<ul>
<li><p>SuffFail($t, \text{R}$) = 1 if the best arm is never chosen in rounds $[t, T]$</p>
</li>
<li><p>SuffFailFreq($t$) = mean({SuffFail($t, \text{R}$) : replicates $\text{R}$}) (see the sketch after this list)</p>
</li>
<li><p>SuffFailFreq($T/2$) : the frequency of never choosing the best arm during the last half of the rounds</p>
</li>
</ul>
</li>
<li><p>basic config (GPT-4-$\text{BNRN0}$) in Fig1 (top) for T = 500, Fig 5 for GPT-4 for T = 100</p>
</li>
</ul>
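<p>A minimal sketch of how the suffix-failure statistic defined above could be computed from logged arm choices; the array layout and the names (<code>choices</code>, <code>best_arm</code>) are assumptions for illustration, not from the paper:</p>
<pre><code>import numpy as np

def suff_fail_freq(choices, best_arm, t):
    """choices: (R, T) array of arm indices per replicate and round.
    SuffFail(t, R) = 1 if the best arm is never chosen in rounds [t, T)."""
    suffix = choices[:, t:]                      # rounds t..T-1 for every replicate
    fail = (suffix != best_arm).all(axis=1)      # True if the best arm never appears
    return fail.mean()                           # average over replicates

# example: SuffFailFreq(T/2) over 1000 replicates of a 5-arm, 100-round run
rng = np.random.default_rng(0)
choices = rng.integers(0, 5, size=(1000, 100))
print(suff_fail_freq(choices, best_arm=0, t=50))
</code></pre>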
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e26ce11c-4cae-40ac-a114-abb35e3d4f54/image.png" alt=""></p>
<ul>
<li><p>bimodal behavior is shown in left plot</p>
</li>
<li><p>LLMs have much higher SuffFailFreq than UCB and TS</p>
</li>
<li><p>as T = 100 is not enough, suffix failures are not fully reflected in Fig 5 (right)</p>
</li>
<li><p>in Fig 1, suffix failure makes the larger differences in reward for large $T$</p>
</li>
</ul>
<h4 id="uniform-like-failures">Uniform-like failures</h4>
<ul>
<li><p>in Fig 3 (left), 3 GPT-4 configurations avoid suffix failures</p>
</li>
<li><p>two of these show uniform-like failures (exploitation failures)</p>
</li>
<li><p>For an experiment replicate $\text{R}$ and round $t$,</p>
<ul>
<li><p>$f_a(t, R)$ be the fraction of rounds in which a given arm $a$ is chosen</p>
</li>
<li><p>MinFrac($t, R$) =$\min_a f_a(t, R)$</p>
</li>
<li><p>MinFrac($t$) = mean({MinFrac($t, R$) : replicates $R$ })</p>
</li>
<li><p>MinFrac($t$) $\le 1/K$ for all $t$, so rescale it by multiplying by $K$<br>i.e. $K \cdot$ MinFrac($t$) (see the sketch after this list)</p>
</li>
<li><p>Larger MinFrac($t$) means more uniform selection of arms at time $t$</p>
</li>
</ul>
</li>
<li><p>for LLMs, MinFrac($t$) doesn&#39;t decrease over time and stays larger than that of baselines</p>
</li>
<li><p>the two GPT-4 configurations that avoid suffix failures but exhibit uniform-like failures (BNRND, BSSCD) both use distributional output</p>
</li>
</ul>
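<p>Similarly, a small sketch of the $K \cdot$ MinFrac statistic, under the same assumed <code>choices</code> array layout:</p>
<pre><code>import numpy as np

def k_minfrac(choices, K, t):
    """choices: (R, T) arm indices. MinFrac(t, R) = min_a fraction of rounds
    in [0, t) where arm a was chosen; rescaled by K so uniform play gives ~1."""
    prefix = choices[:, :t]
    fracs = np.stack([(prefix == a).mean(axis=1) for a in range(K)], axis=1)  # (R, K)
    return K * fracs.min(axis=1).mean()

rng = np.random.default_rng(0)
uniform_choices = rng.integers(0, 5, size=(1000, 100))
print(k_minfrac(uniform_choices, K=5, t=100))   # close to 1 for uniform play
</code></pre>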
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/31c29a46-563f-45ca-965e-2c359a38978e/image.png" alt=""></p>
<ul>
<li><p>MinFrac doesn&#39;t decrease for the LLMs while it does for the baselines</p>
</li>
<li><p>in longer $T$, it has much lower reward than baselines</p>
<ul>
<li>poor long-term performance</li>
</ul>
</li>
</ul>
<h4 id="generality-of-the-failures">Generality of the failures</h4>
<ul>
<li><p>all LLMs except GPT-4-$\text{BSS}\tilde{\text{C}}0$ exhibit either a suffix failure or a uniform failure for hard MAB</p>
</li>
<li><p>other experiments have similar result</p>
</li>
<li><p>summary</p>
<ul>
<li><p>GPT-4 performed much better than GPT-3.5</p>
</li>
<li><p>LLaMA 2 performed much worse</p>
</li>
<li><p>all LLMs are sensitive to small changes in the prompt design    </p>
<ul>
<li>different modifications interact with each other</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="33-investigating-successes">3.3 Investigating successes</h2>
<ul>
<li><p>GPT-4-$\text{BSS}\tilde{\text{C}}0$</p>
<ul>
<li><p>no suffix failures</p>
</li>
<li><p>$K \cdot$ MinFrac is slightly larger than TS</p>
</li>
<li><p>reward is comparable to TS</p>
</li>
</ul>
</li>
<li><p>ran this config on hard MAB with $T = 200$ and $N = 40$ + $\text{BSR}\tilde{\text{C}}0$ as an ablation
<img src="https://velog.velcdn.com/images/0404_not_found/post/7bf30684-b1f9-4a23-9535-6cd7762ad799/image.png" alt=""></p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ worked well in longer T</p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ showed non-trivial fraction of suffix failures (Fig 1(b))</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/199c8488-b19c-4898-990e-9c48db541dc0/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/3dd51721-1516-4bcc-bd64-d06b88464b3e/image.png" alt=""></p>
<ul>
<li><p>Fig 8    </p>
<ul>
<li><p>basic config tends to commit to a single arm for several rounds (like greedy)</p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ also commits for long periods but to a lesser extent than the basic config</p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ switches arms frequently and qualitatively appears much more similar to TS</p>
</li>
</ul>
</li>
<li><p>Fig 9</p>
<ul>
<li><p>plotted the fraction of rounds in $[0, t]$ where the optimal arm was pulled </p>
</li>
<li><p>$\text{BSR}\tilde{\text{C}}0$ looks like UCB except that some runs show suffix failures (the fraction goes to 0)</p>
</li>
<li><p>$\text{BSS}\tilde{\text{C}}0$ is similar to TS, with almost all replicates slowly converging to 1</p>
</li>
</ul>
</li>
</ul>
<h2 id="34-root-causes">3.4 Root causes</h2>
<ul>
<li>understand why LLMs behave like this</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/23814952-309d-4b91-9c27-87c10f0e4655/image.png" alt=""></p>
<ul>
<li><p>Fig 13</p>
<ul>
<li><p>with (a) and (c), GPT-4 shows qualitatively different behavior in easy and hard MAB</p>
</li>
<li><p>easy instance is much easier</p>
</li>
<li><p>in easy instance, GPT-4 showed very high GreedyFrac $\rightarrow$ behave like Greedy (as it performed quite well)</p>
</li>
<li><p>GPT-4 performs quite well in low-noise settings</p>
</li>
<li><p>in the hard instance, GPT-4 did something non-trivial (neither Greedy nor uniform)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a706130d-bef5-45b9-9909-e978810a9db3/image.png" alt=""></p>
<ul>
<li><p>Fig 10</p>
<ul>
<li><p>per-round decisions with GPT-3.5</p>
</li>
<li><p>each experiment considers a particular distribution of bandit histories</p>
</li>
<li><p>sampled 50 histories of length $t$ </p>
</li>
<li><p>tracked two statistics for each agent</p>
<ul>
<li>the empirically best arm so far vs. the least-chosen arm so far</li>
</ul>
</li>
<li><p>uniform sampled data + UCB and TS sampled data</p>
</li>
<li><p>per-round performance of both the LLMs and baselines is very sensitive to data source</p>
</li>
<li><p>$\text{BNSN0}$ is too greedy, $\text{BNRN0}$ is too uniform</p>
</li>
<li><p>$\text{BNRN0}$ and $\text{BNRC0}$ fall within the reasonable range set by the baselines, yet they failed in the longitudinal experiments</p>
</li>
<li><p>hard to assess whether LLM agents are too greedy or too uniform based on per-round decisions</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-related-work">4. Related work</h1>
<ul>
<li><p>LLM capability</p>
<ul>
<li><p>general intelligence</p>
</li>
<li><p>causal reasoning</p>
</li>
<li><p>mathematical reasoning</p>
</li>
<li><p>planning</p>
</li>
<li><p>compositionality</p>
</li>
</ul>
</li>
<li><p>In-context capabilities</p>
<ul>
<li><p>theoretical and empirical investigations of ICSL</p>
</li>
<li><p>ICRL studies focus on models trained on trajectory data from another agent</p>
</li>
<li><p>justify with Bayesian meta-reinforcement learning perspective</p>
</li>
<li><p>transformers work like TS and UCB</p>
</li>
</ul>
</li>
<li><p>applying LLM to real-world decision making</p>
<ul>
<li><p>gaming, programming, medicine</p>
</li>
<li><p>generative agent to simulate human behavior in open-world environment</p>
</li>
<li><p>LLM-enabled robots</p>
</li>
</ul>
</li>
<li><p>prior work evaluated LLM performance on a two-armed bandit task used to characterize intelligent agents</p>
<ul>
<li><p>very easy MAB ($K = 2$, $\Delta = 0.6$)</p>
</li>
<li><p>single prompt design</p>
</li>
<li><p>compared to human</p>
</li>
<li><p>GPT-4 performed well</p>
</li>
</ul>
</li>
</ul>
<h2 id="41-further-background-on-mab">4.1 Further background on MAB</h2>
<ul>
<li><p>UCB</p>
<ul>
<li><p>explores by assigning each arm $a$ an index (average reward + bonus)</p>
</li>
<li><p>choose the arm with largest index</p>
</li>
<li><p>the bonus has the form $\sqrt{C/n_a}$; this paper uses $C = 1$</p>
</li>
</ul>
</li>
<li><p>TS</p>
<ul>
<li><p>proceeds as if the arms&#39; mean rewards were initially drawn from some Bayesian prior</p>
</li>
<li><p>computes a Bayesian posterior using the given history</p>
</li>
<li><p>chooses an arm with largest mean reward</p>
</li>
<li><p>chose prior that uniformly draws the mean reward at random from [0, 1] in this paper</p>
</li>
<li><p>updates each arm independently using the Beta-Bernoulli conjugate prior</p>
</li>
</ul>
</li>
<li><p>regret</p>
<ul>
<li><p>difference in expected total reward of the best arm and the algorithm</p>
</li>
<li><p>baselines achieve regret $O(\sqrt{KT \log T})$ which is nearly minimax optimal for $K$ and $T$</p>
</li>
<li><p>they also achieve the instance-optimal regret rate $O({K \over \Delta} \log T)$</p>
</li>
</ul>
</li>
<li><p>$\epsilon$-greedy and greedy</p>
</li>
</ul>
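<p>A minimal sketch of the two baselines as described above (UCB index with bonus $\sqrt{C/n_a}$, $C = 1$, and Thompson Sampling with a Beta-Bernoulli posterior), assuming Bernoulli rewards; the reward gap and horizon are illustrative:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
means = np.array([0.5, 0.5, 0.5, 0.5, 0.7])      # hard-instance style gap Delta = 0.2
K, T, C = len(means), 100, 1.0

def pull(a):                                      # Bernoulli reward for arm a
    return float(rng.random() &lt; means[a])

# UCB: index = average reward + sqrt(C / n_a); unpulled arms get an infinite index
n, s = np.zeros(K), np.zeros(K)
for t in range(T):
    idx = np.where(n == 0, np.inf, s / np.maximum(n, 1) + np.sqrt(C / np.maximum(n, 1)))
    a = int(np.argmax(idx))
    r = pull(a); n[a] += 1; s[a] += r

# Thompson Sampling: Beta-Bernoulli posterior per arm, uniform prior on [0, 1]
alpha, beta = np.ones(K), np.ones(K)
for t in range(T):
    a = int(np.argmax(rng.beta(alpha, beta)))     # sample mean rewards, pick the best
    r = pull(a); alpha[a] += r; beta[a] += 1 - r

print("UCB total reward:", s.sum(), " TS posterior means:", alpha / (alpha + beta))
</code></pre>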
<h1 id="5-discussion-and-open-questions">5. Discussion and open questions</h1>
<ul>
<li>contemporary LLMs do not robustly engage in exploration required for basic statistical RL and decision making problems without further intervention</li>
</ul>
<h4 id="basic-interventions-and-the-need-for-methodological-advancements">Basic interventions and the need for methodological advancements</h4>
<ul>
<li><p>Experiment with other prompts</p>
</li>
<li><p>Experiment with few-shot prompting</p>
</li>
<li><p>Train the LLM to use auxiliary tools</p>
</li>
</ul>
<h4 id="implications-for-complex-decision-making-problems">Implications for complex decision making problems</h4>
<ul>
<li><p>simple MAB provides a clean and controllable setup </p>
</li>
<li><p>in more complex RL and decision making, similar failures also occur</p>
</li>
<li><p>the solution for MAB may not generalize to more complex settings</p>
</li>
<li><p>even for linear contextual bandits, this approach may not be applicable without a substantial intervention</p>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>A paper that checks how much capability LLMs have from the perspective of ICRL rather than ICSL. It is a simple problem, but given how much research is now going into LLM agents, it will likely serve as a baseline. Even though the problem is simple by RL standards, the fact that it takes GPT-4 plus carefully mixed prompting to solve it suggests that it is not easy to approach with LLM capabilities alone.</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a5741371-e53e-450e-95aa-1c63f32677f5/image.png" alt=""></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Training Neural Networks from Scratch with Parallel Low-Rank Adapters]]></title>
            <link>https://velog.io/@0404_not_found/Training-Neural-Networks-from-Scratch-with-Parallel-Low-Rank-Adapters</link>
            <guid>https://velog.io/@0404_not_found/Training-Neural-Networks-from-Scratch-with-Parallel-Low-Rank-Adapters</guid>
            <pubDate>Sun, 24 Mar 2024 12:30:29 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fc330474-d423-45c0-9b85-77e630d87909/image.png" alt=""></p>
<ul>
<li><p>SOTA models&#39; complexity $\rightarrow$ computation / memory / communication bandwidth</p>
<ul>
<li><p>LoRA</p>
</li>
<li><p>quantizing model parameters</p>
</li>
</ul>
</li>
<li><p>Prior work has been limited to fine-tuning $\rightarrow$ tools for pretraining from scratch are absent</p>
</li>
</ul>
<blockquote>
<p>Can neural networks be trained from scratch using Low-Rank Adapters?</p>
</blockquote>
<ul>
<li><p>common computing clusters often suffer from slow cross-node training (falling back to gradient accumulation) because of limited communication speed and bandwidth</p>
<ul>
<li>Low-Rank adapters compress the communication between these processors while preserving essential structural attributes</li>
</ul>
</li>
<li><p>Vanilla LoRA underperforms in training a model from scratch</p>
<ul>
<li>using parallel low-rank updates can bridge this gap</li>
</ul>
</li>
</ul>
<h4 id="difference-to-existing-works">Difference to existing works</h4>
<ul>
<li><p>data and model parallelism</p>
<ul>
<li><p>stores different copies of the LoRA parameters</p>
</li>
<li><p>trained on different shards </p>
<ul>
<li>different from traditional federated learning which replicates the same model across devices</li>
</ul>
</li>
<li><p>their method enables distributed training with infrequent synchronizations allowing for single-device inference</p>
</li>
</ul>
</li>
<li><p>Previous works</p>
<ul>
<li><p>ReLoRA : trains and merges LoRA into main weights</p>
</li>
<li><p>FedLoRA : train LoRA parameters for finetuning within a federated learning framework $\rightarrow$ training multiple LoRA and averaging them</p>
</li>
<li><p>AdaMix : averages all MLP in MoE into a single MLP $\rightarrow$ needs constant synchronization during the forward and backward pass</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-preliminaries">2. Preliminaries</h1>
<ul>
<li>$x$ as a scalar, $\mathbf{x}$ as a vector, $X$ as a matrix, $\mathcal{X}$ as a distribution or a set</li>
<li>$f$ as a function, $F(\cdot)$ as a composition of functions, $\mathcal{L}(\cdot, \cdot)$ as a loss-function</li>
</ul>
<h2 id="21-parameter-efficient-adapters">2.1 Parameter Efficient adapters</h2>
<ul>
<li><p>Adapters : trainable functions that modify existing layers in a neural network</p>
</li>
<li><p>LoRA : subclass of linear adapters</p>
<ul>
<li><p>the linearity of LoRA allows for the trained parameters to be integrated back into the existing weights</p>
</li>
<li><p>the linearity allows models to maintain the original inference cost</p>
</li>
</ul>
</li>
</ul>
<h4 id="lora">LoRA</h4>
<ul>
<li><p>Given input $\mathbf{x} \in \reals^n$ and a linear layer $f(\cdot) : \reals^n \rightarrow \reals^m$ parameterized by the weight $W \in \reals^{m \times n}$</p>
</li>
<li><p>LoRA re-parameterizes the function as</p>
<ul>
<li><p>$f_{\text{lora}}(x) = \mathbf{W}\mathbf{x} + s \mathbf{BAx}$</p>
</li>
<li><p>$\mathbf{B} \in \reals^{m\times r}$, $\mathbf{A} \in \reals^{r \times n}$, $s \in \reals$</p>
</li>
<li><p>rank $r \ll \min(m, n)$</p>
</li>
</ul>
</li>
<li><p>Forward pass incurs an extra computational overhead</p>
</li>
<li><p>the significance of LoRA pertains to the optimizer memory footprint</p>
<ul>
<li><p>AdamW stores two states for each parameter $\rightarrow$ double the memory consumption</p>
</li>
<li><p>using LoRA, the memory cost $\mathcal{O}(r(m+n))$ is less than the original model&#39;s $\mathcal{O}(mn)$</p>
</li>
<li><p>QLoRA saves $W$ in 4-bit precision to achieve more memory saving</p>
</li>
</ul>
</li>
</ul>
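<p>A minimal PyTorch-style sketch of the LoRA re-parameterization $f_{\text{lora}}(x) = \mathbf{W}\mathbf{x} + s \mathbf{BAx}$ described above; the class name and initialization choices are assumptions for illustration:</p>
<pre><code>import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """f_lora(x) = W x + s * B A x, with W frozen and only A, B trainable."""
    def __init__(self, m, n, r, s=1.0):
        super().__init__()
        self.W = nn.Parameter(torch.randn(m, n), requires_grad=False)  # frozen base weight
        self.A = nn.Parameter(torch.randn(r, n) / n ** 0.5)            # r x n
        self.B = nn.Parameter(torch.zeros(m, r))                       # m x r, zero-init
        self.s = s

    def forward(self, x):                          # x: (..., n)
        return x @ self.W.T + self.s * (x @ self.A.T @ self.B.T)

layer = LoRALinear(m=8, n=16, r=2)
print(layer(torch.randn(4, 16)).shape)             # torch.Size([4, 8])
</code></pre>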
<h1 id="3-method">3. Method</h1>
<ul>
<li>standard training performance can be recovered using LoRA</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/acd323f6-ce65-440c-8bba-325edd8eef34/image.png" alt=""></p>
<ul>
<li><p>Low-Rank LoRA shows inferior performance to the models using standard optimization</p>
</li>
<li><p>LoRA is incapable of recovering weights that exceed the rank $r$</p>
</li>
<li><p>Although there is a solution within a low-rank proximity of the initialization, it still needs high-rank updates</p>
</li>
</ul>
<h2 id="31-motivation--multi-head-merging-perspective">3.1 Motivation : Multi-head merging perspective</h2>
<ul>
<li><p>this will show why LoRA heads in parallel can achieve the performance of standard pre-training</p>
</li>
<li><p>elevating the rank $r$ to the $\min(m, n)$ is sufficient to replicate standard pre-training performance</p>
<ul>
<li>it compromises the memory efficiency of low-rank adapters</li>
</ul>
</li>
<li><p>leveraging multiple low-rank adapters in parallel</p>
<ul>
<li><p>given a matrix of the form $\mathbf{BA} \in \reals^{d_1 \times d_2}$ and $\mathbf{B} \in \reals^{d_1 \times d}$, $\mathbf{A} \in \reals^{d \times d_2}$</p>
</li>
<li><p>then it is possible to represent the product as two lower-rank matrices $\mathbf{B_1A_1} + \mathbf{B_2A_2}$</p>
<ul>
<li><p>let $\mathbf{b}_i$ and $\mathbf{a}_i$ be the column vectors</p>
</li>
<li><p>then we can construct $\mathbf{B_1} = [\mathbf{b}_1, ..., \mathbf{b}_{[d/2]}]$, $\mathbf{B_2} = [\mathbf{b}_{[d/2]}, ..., \mathbf{b}_{d}]$ and $\mathbf{A_1} = [\mathbf{a}_1^{\top}, ..., \mathbf{a}_{[d/2]}^{\top}]$, $\mathbf{A_2} = [\mathbf{a}_{[d/2]}^{\top}, ..., \mathbf{a}_{d}^{\top}]$</p>
</li>
<li><p>then this approximates the high-rank matrix into a linear combination of low-rank matrices</p>
</li>
<li><p>the same conclusion can be reached by beginning with a linear combination of rank-1 matrices</p>
</li>
<li><p>This forms the basis for a novel multi-head LoRA</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
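<p>A small numeric sketch of the splitting argument above: a rank-$d$ product $\mathbf{BA}$ can be written exactly as the sum of two lower-rank products by splitting the columns of $\mathbf{B}$ and the rows of $\mathbf{A}$ (shapes chosen arbitrarily for illustration):</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
d1, d2, d = 8, 6, 4
B = rng.standard_normal((d1, d))     # columns b_1 ... b_d
A = rng.standard_normal((d, d2))     # rows    a_1 ... a_d

# split the d rank-1 terms into two halves: BA = B1 A1 + B2 A2
h = d // 2
B1, B2 = B[:, :h], B[:, h:]
A1, A2 = A[:h, :], A[h:, :]

print(np.allclose(B @ A, B1 @ A1 + B2 @ A2))   # True: a higher-rank product
                                                # is a sum of lower-rank products
</code></pre>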
<h4 id="multi-head-lora-mhlora">Multi-head LoRA (MHLoRA)</h4>
<ul>
<li><p>given a matrix $\mathbf{W} \in \reals^{m \times n}$ and constant $N$</p>
</li>
<li><p>$f_{\text{mhlora}}(\mathbf{x}) = \mathbf{Wx} + {s \over N} \displaystyle\sum^N_{n=1} \mathbf{B}_n \mathbf{A}_n \mathbf{x}$</p>
</li>
<li><p>reparameterizes full-rank weights into a linear combination of low-rank weights</p>
</li>
<li><p>single parallel LoRA head can approximate the trajectory of a single step of the multi-head LoRA given that the parallel LoRA heads are periodically merged into the full weights</p>
<ul>
<li><p>using the same rank $r$ for all the LoRA parameters</p>
</li>
<li><p>$\argmin_{\mathbf{B}_n \mathbf{A}_n} \mathcal{L} \left( \mathbf{W} + {s \over N} \displaystyle\sum ^N _{n=1} \mathbf{B}_n \mathbf{A}_n\right) = \argmin_{\hat{\mathbf{B}}_n \hat{\mathbf{A}}_n} \mathcal{L} \left( \hat{\mathbf{W}} + {s \over N} \hat{\mathbf{B}}_n \hat{\mathbf{A}}_n\right)$</p>
</li>
<li><p>used hat for the single parallel LoRA head</p>
</li>
<li><p>when either $\sum_{n=1}^N \mathbf{B}_n \mathbf{A}_n = \hat{\mathbf{B}}_n \hat{\mathbf{A}}_n$ or $\hat{\mathbf{W}} = \mathbf{W} + {s \over N} \sum _{j \not = n}^N \mathbf{B}_j \mathbf{A}_j$</p>
</li>
</ul>
</li>
<li><p>The first scenario is rank deficient $\rightarrow$ unable to recover the original model performance</p>
</li>
<li><p>The latter case necessitates that $\hat{\mathbf{W}}$ accumulates all the information of the LoRA parameters at every iteration $\rightarrow$ if we use a merge operator at every iteration, recovering the exact update is possible</p>
</li>
<li><p>one can recover the exact gradient updates of the MHLoRA</p>
</li>
<li><p>in distributed setting, only the LoRA params/gradients have to be communicated across devices $\rightarrow$ good when the interconnect speed is limited</p>
</li>
</ul>
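<p>A minimal sketch of the multi-head LoRA forward pass $f_{\text{mhlora}}(\mathbf{x}) = \mathbf{Wx} + {s \over N} \sum_n \mathbf{B}_n \mathbf{A}_n \mathbf{x}$; shapes and values are illustrative:</p>
<pre><code>import numpy as np

def mhlora_forward(x, W, heads, s):
    """f_mhlora(x) = W x + (s / N) * sum_n B_n A_n x, with heads = [(B_1, A_1), ...]."""
    N = len(heads)
    out = W @ x
    for B_n, A_n in heads:
        out += (s / N) * (B_n @ (A_n @ x))
    return out

rng = np.random.default_rng(0)
m, n, r, N, s = 8, 16, 2, 4, 1.0
W = rng.standard_normal((m, n))
heads = [(rng.standard_normal((m, r)), rng.standard_normal((r, n))) for _ in range(N)]
x = rng.standard_normal(n)
print(mhlora_forward(x, W, heads, s).shape)    # (8,)
</code></pre>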
<h2 id="32-lora-soup-delayed-lora-merging">3.2 LoRA soup: delayed LoRA merging</h2>
<ul>
<li><p>To reduce the communication cost of LTE</p>
<ul>
<li>local updates</li>
<li>model-averaging</li>
</ul>
</li>
<li><p>allow LoRA parameters to train independently for a longer period before the merge operator</p>
<ul>
<li><p>$\hat{\mathbf{W}} = \mathbf{W} + {s \over N} \sum _{j \not = n}^N \mathbf{B}&#39;_j \mathbf{A}&#39;_j$</p>
</li>
<li><p>$&#39;$ denotes stale estimates of the parameters</p>
</li>
</ul>
</li>
<li><p>Merging every iteration $\rightarrow$ ensures the representation will not diverge</p>
</li>
<li><p>using stale estimates relaxes this equivalence $\rightarrow$ it can still match the standard training performance</p>
<ul>
<li><p>As its estimate is inaccurate, the optimization trajectory diverges from the optimization path of MHLoRA</p>
</li>
<li><p>it doesn&#39;t imply that the model won&#39;t optimize</p>
</li>
<li><p>just different path from MHLoRA</p>
</li>
<li><p>used simple averaging (left more sophisticated merging as future work)</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/abf8e021-d03b-4bb2-a78d-3d620553aab0/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h2 id="33-lora-the-explorer-parallel-low-rank-updates">3.3 LoRA-the-Explorer: parallel low-rank updates</h2>
<ul>
<li><p>achieving an informative update $\Delta \mathbf{W}$ that does not require materialization of the full parameter size during training</p>
</li>
<li><p>parameterizing $\mathbf{W}$ such that it can be stored in low-precision and communicated efficiently (using quantized weights and keeping a high-precision copy)</p>
</li>
<li><p><strong>LoRA-the-Explorer (LTE)</strong> : optimization algorithm that approximates full-rank updates with parallel low-rank updates</p>
<ul>
<li><p>creates $N$ different LoRA heads for each linear layer at initialization</p>
</li>
<li><p>each worker is assigned its own LoRA parameters and creates a local optimizer</p>
</li>
<li><p>independently sample data from the same distribution $\mathbf{x} = \{ \mathbf{x}_1, ..., \mathbf{x}_N \}$</p>
</li>
<li><p>for each LoRA head $n$, optimize the parameters with its own partition for $T$ iterations to get $\delta_{\text{lora}_n} = -\eta \sum_{t=1}^T \nabla_{\text{lora}_n} \mathbf{x}_i[t]$</p>
</li>
<li><p>don&#39;t synchronize the optimizer state across workers</p>
</li>
<li><p>After the optimization, synchronize the LoRA parameters to compute the final update for the main weight $\Delta_{\text{lora}}(\mathbf{x}) = {1 \over N} \sum_{n=1}^N \delta_{\text{lora}_n}$</p>
</li>
<li><p>then update the LoRA parameters with the updated weights $\mathbf{W}$</p>
<ul>
<li>re-initialize the LoRA parameter or use the same value with correction term</li>
</ul>
</li>
<li><p>since it doesn&#39;t train directly on the main parameter $\mathbf{W}$, using quantized parameter $q(\mathbf{W})$ is possible</p>
<ul>
<li>keep the high-precision weight only in the master node or offload it from device during training</li>
</ul>
</li>
</ul>
</li>
</ul>
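<p>A schematic sketch of one LTE outer step as described above: each head trains independently on its own mini-batches for a few local iterations, then the averaged LoRA product is merged into the frozen main weight. The loss, data sampling, and the handling of head re-initialization are placeholders, not the authors&#39; exact implementation:</p>
<pre><code>import torch

def lte_step(W, heads, opts, sample_batch, loss_fn, s=1.0, T_local=4):
    """One LTE outer step: each head trains for T_local iterations on its own data,
    then the averaged LoRA product is merged into the frozen main weight W."""
    N = len(heads)
    for (B, A), opt in zip(heads, opts):           # in practice these run on separate workers
        for _ in range(T_local):
            x, y = sample_batch()                  # each worker samples its own mini-batch
            pred = x @ (W + s / N * (B @ A)).T     # worker sees W plus only its own head
            loss = loss_fn(pred, y)
            opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        delta = torch.stack([B @ A for B, A in heads]).mean(dim=0)
        W += s * delta                             # merge: W &lt;- W + (s/N) * sum_n B_n A_n
    return W                                       # (heads kept as-is here; see Sec. 3.4)

m, n, r, N = 8, 16, 2, 4
W = torch.randn(m, n)                              # main weight, never trained directly
heads = [(torch.zeros(m, r, requires_grad=True),
          torch.randn(r, n, requires_grad=True)) for _ in range(N)]
opts = [torch.optim.AdamW([B, A], lr=2e-4) for B, A in heads]
sample = lambda: (torch.randn(32, n), torch.randn(32, m))
W = lte_step(W, heads, opts, sample, torch.nn.functional.mse_loss)
</code></pre>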
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a60ec52c-0572-43d2-acd5-5c04a5996fb1/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/b4692660-d154-412f-a6f1-d580885a85b9/image.png" alt=""></p>
<h2 id="34-implementation-details">3.4 Implementation details</h2>
<h4 id="not-resetting-matrix-a-and-optimizer-states">Not resetting matrix A and optimizer states</h4>
<ul>
<li><p>investigated whether the matrices $\mathbf{A}_n$ would converge to the same sub-space during training</p>
<ul>
<li><p>If so, resetting $\mathbf{A}_n$ or using a regularizer would be needed</p>
</li>
<li><p>$\mathbf{A}$ is orthogonal to remain consistent throughout training</p>
</li>
<li><p>without reset, it performed better (re-learning $\mathbf{A}$ and re-accumulating the optimizer state waste optimization steps)
<img src="https://velog.velcdn.com/images/0404_not_found/post/b4d95410-90e8-4a33-9b91-3e8fec2a1077/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h4 id="scaling-up-s-and-lowering-learning-rate-eta">scaling up $s$ and lowering learning rate $\eta$</h4>
<ul>
<li><p>that scaling $s$ has the same effect as tuning the lr $\eta$ is a common misconception</p>
</li>
<li><p>in their experiments, setting $s$ in the range 1~4 did not give comparable performance</p>
<ul>
<li><p>using large $s$ and slightly lowering $\eta$ worked best</p>
</li>
<li><p>standard practice : set $s$ inversely proportional to the rank $r$, i.e. $s = {\alpha \over r}$</p>
</li>
<li><p>used $\alpha = 4096, s = 64$ and $\eta = 2 \cdot 10^{-4}$</p>
</li>
<li><p>lr doesn&#39;t scale linearly with $s$</p>
</li>
<li><p>$s$ only affects the forward computation</p>
<ul>
<li>it modifies the contribution of the LoRA parameters in the forward pass $\rightarrow$ implications for the effective gradient</li>
</ul>
</li>
<li><p>$s$ scales quadratically with the alignment of $\bold{B}$ and $\bold{A}$</p>
</li>
</ul>
</li>
</ul>
<h4 id="significance-of-initialization-strategies">Significance of Initialization Strategies</h4>
<ul>
<li><p>used the initialization scheme that utilizes a semi-orthogonal matrix scaled by $\sqrt{d_{out}/d_{in}}$</p>
<ul>
<li><p>originally designed for standard feed-forward models</p>
</li>
<li><p>whereas LoRA operates under the assumption that $\bold{B}$ is zero-initialized with a residual connection</p>
</li>
<li><p>in the ablation study, Kaiming initialization and Xavier initialization perform similarly</p>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<ul>
<li>in the transformer experiments, they mistakenly used the scaling factor $1/\sqrt{d_{out}}$ instead of the standard scaling $1/\sqrt{d_{out}/n_{attn}}$ (they plan to revise the hyper-parameters)</li>
</ul>
<h2 id="41-iterative-lora-merging">4.1 Iterative LoRA Merging</h2>
<ul>
<li><p>iteratively merging LoRA is a key component in recovering the full-rank representation</p>
</li>
<li><p>they assess the effectiveness of merging a single LoRA head in the context of linear networks trained on synthetic least-squares regression datasets
<img src="https://velog.velcdn.com/images/0404_not_found/post/dce34310-2627-4c8c-92c4-017090af6650/image.png" alt=""></p>
</li>
<li><p>without merging, the model performance does not improve</p>
</li>
<li><p>iterative merging recovers the GT solution with the rate increasing with higher merge frequency</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e1241a16-1fdc-485f-91a9-23adf9cda616/image.png" alt=""></p>
<ul>
<li><p>in Vit-S with patch size 32 on ImageNet100</p>
<ul>
<li><p>merging of a single LoRA head outperforms standalone LoRA</p>
</li>
<li><p>frequent merging delays convergence (LoRA parameter re-initialization and momentum state inconsistencies)</p>
</li>
<li><p>performance doesn&#39;t match $\rightarrow$ potential local minima when training with rank-deficient representation</p>
</li>
</ul>
</li>
<li><p>they found the merge iteration of $T = 10$ is still stable when using batch size of 4096</p>
<ul>
<li>with higher $T$, additional training may be required</li>
</ul>
</li>
<li><p>with increased merge iteration, smarter merging techniques may be necessary</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/417e0944-9d27-499d-81c7-1e7b032c4f41/image.png" alt=""></p>
<ul>
<li>to further test the generalizability, they conducted various vision tasks on MLP-Mixer </li>
</ul>
<h2 id="42-lora-parameter-alignment">4.2 LoRA parameter alignment</h2>
<ul>
<li><p>the efficacy of their optimization algorithm</p>
<ul>
<li><p>individual heads explore distinct subspaces within the parameter space
<img src="https://velog.velcdn.com/images/0404_not_found/post/7eb702c7-7ea3-489a-9583-7068248332fe/image.png" alt=""></p>
</li>
<li><p>average cosine similarity and Grassman distance between the heads $\bold{B}_n \bold{A}_n$</p>
</li>
<li><p>conducted with data samples drawn from same distribution</p>
</li>
<li><p>each set of LoRA parameters was exposed to a different set of samples</p>
</li>
<li><p>LoRA heads do not converge to the same representation</p>
</li>
<li><p>this orthogonality is maximized when using different parameters and different data (mini-batches)</p>
</li>
</ul>
</li>
</ul>
<h2 id="43-ablation-study-the-effect-of-lora-heads-rank-and-merge-iteration">4.3 Ablation study: the effect of LoRA heads, rank, and merge iteration</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/77d123a5-7767-41c2-b64e-d203ef5f1557/image.png" alt=""></p>
<ul>
<li><p>monotonic improvement in performance with an increased number of heads and ranks</p>
</li>
<li><p>extending the merge iteration negatively impacts performance</p>
</li>
<li><p>in LS regression, excessive merging hurts model accuracy</p>
</li>
<li><p>with a large enough rank and number of heads, the model converges to better accuracy even when the test loss is similar</p>
</li>
<li><p>averaging of the LoRA heads has a regularization effect similar to model ensembling</p>
</li>
<li><p>ViT-S as the primary architecture</p>
<ul>
<li><p>hidden size = 384</p>
</li>
<li><p>MLP dimension = 1536</p>
</li>
<li><p>number of heads * rank of the LoRA &gt; the largest dimension of the model $\rightarrow$ worked well</p>
</li>
<li><p>number of heads &gt; rank $\rightarrow$ longer iterations were required </p>
</li>
</ul>
</li>
</ul>
<h2 id="44-gradient-noise-with-parallel-updates">4.4 Gradient noise with parallel updates</h2>
<ul>
<li><p>in ablation, they fixed cumulative batch size of 4096 and epoch of 1200</p>
</li>
<li><p>each LoRA head received a reduced batch size of 4096/heads</p>
</li>
<li><p>scaling the rank exerts a greater impact than increasing the number of heads</p>
<ul>
<li><p>proportional scaling of gradient noise with smaller mini-batches</p>
</li>
<li><p>gradient noise contributes to slower convergence in addition to the use of stale parameter estimates</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/efd8ee11-0e5f-4fcc-bbca-4818e834751d/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>increasing the number of heads necessitates more sequential FLOPs, but it offers efficient parallelization</p>
</li>
<li>using a larger batch size for gradient estimation may prove beneficial in distributed training</li>
</ul>
<h2 id="45-performance-scaling-on-imagenet-1k">4.5 Performance Scaling on ImageNet-1K</h2>
<ul>
<li><p>scaled up to ImageNet 1K</p>
<ul>
<li><p>doubled batch size to 8192</p>
</li>
<li><p>didn&#39;t change the way mini-batches were sampled</p>
</li>
<li><p>scheduling the randomness for the mini-batches is not explored</p>
</li>
</ul>
</li>
<li><p>in Initial training, LTE outperformed standard training</p>
<ul>
<li><p>as training completed, standard training overtook LTE</p>
</li>
<li><p>LTE needs additional iterations to achieve comparable performance</p>
</li>
</ul>
</li>
<li><p>standard training appeared to benefit more from a lower lr compared to LTE</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/084a4647-25a0-46f4-ac0e-fa938603442e/image.png" alt=""></p>
<ul>
<li><p>this study is focused on training deep networks with parallel low-rank adapters (not efficiency!)</p>
</li>
<li><p>hypothetical computation analysis for future scaling efforts</p>
<ul>
<li><p>model size $M_{\text{ddp}} = M$ and $M_{\text{lte}}$ for LTE</p>
</li>
<li><p>the number of devices for each method $N_{\text{ddp}}$ and $N_{\text{lte}}$</p>
</li>
<li><p>with quantization, each LTE device requires a memory footprint of $qM + M_{\text{lte}}$</p>
</li>
<li><p>as base model is 16-bit and if we use 4-bit quantizing, $q = 0.25$</p>
</li>
<li><p>with AdamW, DDP necessitates an additional $2M$ parameters (total $3M$)</p>
</li>
<li><p>for LTE, $qM + 3M_{\text{lte}}$ is needed</p>
</li>
<li><p>Assuming training is parameter-bound by the main weights ($r \ll \min(m, n)$), LTE can leverage GPUs with roughly 1/3 the memory of DDP</p>
</li>
<li><p>LTE requires 40% more data and 20% slowdown per iteration with quantization (QLoRA)</p>
</li>
<li><p>on average, each LTE device observes 1/3 less data than a device in DDP</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e16f9afa-59d3-4bed-a619-f2d2d31953f7/image.png" alt=""></p>
</li>
<li><p>Communication bottleneck</p>
<ul>
<li><p>In multi-node systems, the communication scales with the size of the model and is bottlenecked at interconnect speed</p>
</li>
<li><p>using standard all-reduce, the gradients are shared between all devices, for a total communication of $N_{\text{ddp}}(N_{\text{ddp}} - 1)M$</p>
</li>
<li><p>for LTE, since it communicates only every $T$ iterations, the cost is ${1 \over T}N_{\text{lte}}(N_{\text{lte}} - 1)M$</p>
</li>
<li><p>using parameter server method (1 and broadcast), gradients are sent to the main parameter server and averaged</p>
</li>
<li><p>DDP with a parameter server would use $2(N_{\text{ddp}}-1)M$</p>
</li>
<li><p>LTE with parameter server would use ${1 \over T}(N_{\text{lte}} - 1)(qM + M_{\text{lte}})$</p>
</li>
<li><p>LTE can leverage lower-bandwidth communication as the parameters shared between devices are strictly smaller by a factor of $M_{\text{ddp}}/M_{\text{lte}}$</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="5-related-works">5. Related works</h1>
<ul>
<li><p>Training with adapters</p>
<ul>
<li>LoRA</li>
<li>MoE PEFT and averaging</li>
<li>Additive adapters</li>
<li>Adapters for NLP, vision, video, incremental learning, domain adaptation, vision-language, text-to-vision, perceptual learning</li>
</ul>
<ul>
<li><p>Distributed Training and Federated Learning</p>
<ul>
<li><p>Federated learning for low-compute devices, high-latency training, privacy, cross- and in-silo learning</p>
</li>
<li><p>communication efficiency    </p>
<ul>
<li>local steps</li>
<li>decentralized training</li>
<li>gradient checkpointing</li>
<li>reversible gradient computation</li>
<li>gradient or weight compression</li>
</ul>
</li>
<li><p>Combining models in federated learning</p>
<ul>
<li>FedAvg</li>
<li>weight averaging</li>
<li>probabilistic frameworks for merging</li>
<li>updating with stale parameters</li>
</ul>
</li>
<li><p>Server momentum and adaptive methods</p>
</li>
<li><p>bi-level optimization schemes</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Linear mode connectivity and model averaging</p>
<ul>
<li>deep models can be connected through nonlinear means</li>
<li>linear paths with constant energy exist in trained models</li>
<li>for models with different initialization, parameter permutations can be solved to align them linearly</li>
<li>model averaging</li>
<li>model stitching</li>
<li>Anna Karenina principle</li>
<li>model averaging within ensembles</li>
<li>utilizing an average model as a target</li>
</ul>
</li>
</ul>
<h1 id="6-conclusion">6. Conclusion</h1>
<ul>
<li><p>Low-rank adapters for model pre-training</p>
</li>
<li><p>LTE : bi-level optimization method that capitalizes on the memory-efficient properties of LoRA</p>
</li>
<li><p>how to accelerate convergence during the final 10% of the training?</p>
</li>
<li><p>how to dynamically determine the number of ranks or heads?</p>
</li>
<li><p>is heterogeneous parameterization of LoRA feasible where each LoRA has a different rank?</p>
</li>
<li><p>what strategies for merging can achieve higher performance?</p>
</li>
<li><p>This study is showing viability</p>
</li>
<li><p>tests on larger models are needed</p>
</li>
<li><p>this will pave the way for pre-training models in computationally constrained or low-bandwidth environments</p>
<ul>
<li><p>less capable and low-memory devices can train a large model</p>
</li>
<li><p><strong>wisdom of the crowd</strong></p>
</li>
</ul>
</li>
</ul>
<h1 id="7-comment">7. Comment</h1>
<p>The idea is to approximate and recover the full model using adapters, without touching the main parameters. Instead of decomposing a rank-$r$ LoRA into rank-1 pieces and merging them immediately, the heads are merged periodically. Why had no one thought of decomposing LoRA again like this before?</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Is Cosine-Similarity of Embeddings Really About Similarity?]]></title>
            <link>https://velog.io/@0404_not_found/Is-Cosine-Similarity-of-Embeddings-Really-About-Similarity</link>
            <guid>https://velog.io/@0404_not_found/Is-Cosine-Similarity-of-Embeddings-Really-About-Similarity</guid>
            <pubDate>Fri, 15 Mar 2024 21:17:55 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Discrete Entities are embedded to dense real-valued vectors</p>
<ul>
<li><p>word embedding for LLM</p>
</li>
<li><p>recommender system </p>
</li>
</ul>
</li>
<li><p>The embedding vector can be used as input to other models</p>
</li>
<li><p>Also, they can provide a data-driven notion of similarity between entities</p>
</li>
<li><p><strong>Cosine Similarity</strong> has become a very popular measure of semantic similarity</p>
<ul>
<li><p>the norm of the embedding vectors is not as important as the directional alignment between the embedding vectors</p>
</li>
<li><p>the unnormalized dot-product does not work as well in practice</p>
</li>
</ul>
</li>
<li><p>Cosine similarity of the learned embeddings can in fact yield arbitrary results</p>
<ul>
<li>learned embeddings have a DoF that can render arbitrary cosine-similarities even though their dot-products are well-defined and unique</li>
</ul>
</li>
<li><p>study linear Matrix Factorization models, which admit analytical (closed-form) solutions</p>
</li>
</ul>
<h1 id="2-matrix-factorization-models">2. Matrix Factorization Models</h1>
<ul>
<li><p>focus on linear models as they allow for closed-form solutions</p>
</li>
<li><p>matrix $X \in \mathbb{R}^{n \times p}$, containing $n$ data points and $p$ features</p>
</li>
<li><p>the goal is to estimate a low-rank matrix $AB^{\top} \in \Reals ^{p \times p}$ where $A, B \in \Reals^{p \times k}$ with $k \le p$ such that $XAB^{\top}$ is a good approximation of $X \approx XAB^{\top}$</p>
</li>
<li><p>$X$ is a user-item matrix</p>
<ul>
<li>the row $\vec{b_i}$ of $B$ : item-embeddings  ($k$-dimensional)</li>
<li>the row $\vec{x_u} \cdot A$ of $XA$ : user-embeddings</li>
<li>the embedding of user $u$ is the sum of the item-embeddings $\vec{a_j}$ that the user has consumed</li>
</ul>
</li>
<li><p>this is defined in terms of the unnormalized dot-product between two embeddings</p>
<ul>
<li><p>$(XAB^{\top})_{u,i} = \lang \vec{x_u} \cdot A, \vec{b_i} \rang$</p>
</li>
<li><p>once it has been learned, it is common to consider </p>
<ul>
<li><p>two items cosine similarity</p>
</li>
<li><p>two users cosine similarity</p>
</li>
<li><p>user-item cosine similarity</p>
</li>
</ul>
</li>
<li><p>this can lead to arbitrary results and they may not even be unique</p>
</li>
</ul>
</li>
</ul>
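<p>A minimal sketch of the matrix-factorization setup above, assuming a random binary user-item matrix for illustration:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n_users, p_items, k = 100, 20, 5
X = (rng.random((n_users, p_items)) &lt; 0.1).astype(float)   # binary user-item matrix

A = rng.standard_normal((p_items, k)) * 0.1
B = rng.standard_normal((p_items, k)) * 0.1

user_emb = X @ A                       # rows of XA: user embeddings (sum of consumed items a_j)
item_emb = B                           # rows of B: item embeddings
scores = user_emb @ item_emb.T         # (XAB^T)_{u,i} = &lt;x_u A, b_i&gt;, unnormalized dot product
print(scores.shape)                    # (100, 20)
</code></pre>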
<h2 id="21-training">2.1 Training</h2>
<ul>
<li><p>the key factor affecting the utility of cosine similarity is the <strong>regularization</strong> employed when learning the embeddings in $A, B$</p>
<ul>
<li><p>$\min_{A, B} || X - XAB^{\top} || _F ^2 + \lambda ||AB^{\top}||_F^2$</p>
</li>
<li><p>$\min_{A, B} || X - XAB^{\top} || _F ^2 + \lambda (||XA||_F^2 + ||B||_F^2)$</p>
</li>
</ul>
</li>
<li><p>First one applies $||AB^{\top}||_F^2$ to their product</p>
<ul>
<li><p>in Linear models, this L2-regularization is equivalent to learning with denoising (drop-out in the input layer)</p>
</li>
<li><p>the resulting prediction accuracy on test data was superior to the second objective</p>
</li>
<li><p>denoising/drop-out is better than weight decay (second one)</p>
</li>
</ul>
</li>
<li><p>Second one is equivalent to the usual matrix factorization objective</p>
<ul>
<li><p>$|| X - PQ^{\top} || _F ^2 + \lambda (||P||_F^2 + ||Q||_F^2)$</p>
</li>
<li><p>regularizing $P$ and $Q$ separately is similar to weight decay in deep learning</p>
</li>
</ul>
</li>
<li><p>if $\hat{A}, \hat{B}$ are solutions of either objective, then for an arbitrary rotation matrix $R \in \Reals^{k \times k}$, $\hat{A}R, \hat{B}R$ are solutions as well</p>
</li>
<li><p>cosine similarity is invariant under rotation $R$</p>
</li>
<li><p>only the first objective is invariant to rescalings of the columns of $A$ and $B$ (different latent dimensions of the embeddings)</p>
</li>
</ul>
<ul>
<li><p>if $\hat{A}\hat{B}^{\top}$ is a solution of the first objective, then for an arbitrary invertible diagonal matrix $D \in \Reals^{k \times k}$, $\hat{A}DD^{-1}\hat{B}^{\top}$ is a solution as well</p>
</li>
<li><p>Then define a new solution as a function of $D$:</p>
<p>$$
\begin{aligned} 
\hat{A}^{(D)} &amp;:= \hat{A}D \\
\hat{B}^{(D)} &amp;:= \hat{B}D^{-1}
\end{aligned}
$$</p>
</li>
<li><p>this diagonal matrix $D$ affects the normalization of the learned user and item embeddings (rows):</p>
<p>$$
\begin{aligned} 
(X\hat{A}^{(D)})_{\text{(normalized)}} &amp;= \Omega_A X\hat{A}^{(D)} = \Omega_A X\hat{A}D \\
\hat{B}^{(D)}_{\text{(normalized)}} &amp;= \Omega_B \hat{B}^{(D)} = \Omega_B \hat{B}D^{-1}
\end{aligned}
$$</p>
<p>where $\Omega_A, \Omega_B$ are the appropriate diagonal matrices that normalize each learned embedding (row) to unit Euclidean norm</p>
</li>
<li><p>a different choice for $D$ cannot be compensated by the $\Omega$ matrices</p>
</li>
<li><p>they depend on $D$, so they can be written as $\Omega_A(D), \Omega_B(D)$</p>
</li>
<li><p><strong>the cosine similarities of the embeddings depend on this arbitrary matrix $D$</strong></p>
</li>
</ul>
<ul>
<li><p>the cosine similarity becomes</p>
<ul>
<li><p>item - item
$$
\text{cosSim}(\hat{B}^{(D)}, \hat{B}^{(D)}) = \Omega_B(D) \cdot \hat{B} \cdot D^{-2} \cdot \hat{B}^{\top} \cdot \Omega_B(D)
$$</p>
</li>
<li><p>user-user
$$
\text{cosSim}(X\hat{A}^{(D)}, X\hat{A}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot D^{2} \cdot (X\hat{A})^{\top} \cdot \Omega_A(D)
$$</p>
</li>
<li><p>user-item
$$
\text{cosSim}(X\hat{A}^{(D)}, \hat{B}^{(D)}) = \Omega_A(D) \cdot X\hat{A} \cdot \hat{B}^{\top} \cdot \Omega_B(D)
$$</p>
</li>
</ul>
</li>
<li><p>These cosine similarities all depend on arbitrary matrix $D$</p>
</li>
<li><p>user-user and item-item similarities depend directly on $D$, while user-item similarity depends on $D$ only indirectly through its effect on the normalizing matrices</p>
</li>
</ul>
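<p>A small numeric sketch of the rescaling argument above: the product $\hat{A}DD^{-1}\hat{B}^{\top}$ is unchanged, yet the cosine similarities of the rescaled item embeddings $\hat{B}D^{-1}$ change with $D$ (the matrices here are random, for illustration only):</p>
<pre><code>import numpy as np

def cos_sim(M):
    Mn = M / np.linalg.norm(M, axis=1, keepdims=True)   # row normalization (the Omega matrices)
    return Mn @ Mn.T

rng = np.random.default_rng(0)
p, k = 6, 3
A_hat = rng.standard_normal((p, k))
B_hat = rng.standard_normal((p, k))

for scale in (1.0, 10.0):
    D = np.diag(np.array([1.0, scale, scale ** 2]))     # arbitrary diagonal rescaling
    A_d, B_d = A_hat @ D, B_hat @ np.linalg.inv(D)
    assert np.allclose(A_d @ B_d.T, A_hat @ B_hat.T)    # the model A D D^{-1} B^T is unchanged
    print(round(cos_sim(B_d)[0, 1], 3))                  # but item-item cosine similarity moves
</code></pre>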
<h2 id="22-details-on-first-objective">2.2 Details on First Objective</h2>
<ul>
<li><p>The closed-form solution of the first objective is $\hat{A}_{(1)}\hat{B}_{(1)}^{\top} = V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k \cdot V_k^{\top}$, where $X =: U\Sigma V^{\top}$, $\Sigma = \text{dMat}(..., \sigma_i, ...)$, and $V_k$ is the truncated matrix of rank $k$</p>
</li>
<li><p>Since $D$ is arbitrary, w.l.o.g. we may define $\hat{A}_{(1)} = \hat{B}_{(1)} := V_k \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)_k^{{1 \over 2}}$</p>
</li>
<li><p>when we think of the special case of a full-rank MF model, this would be two cases</p>
<ul>
<li><p>choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{{1 \over 2}}$</p>
<ul>
<li><p>$A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$</p>
</li>
<li><p>$B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D^{-1} = V$</p>
</li>
<li><p>given the matrix of normalized singular vectors $V$, the normalization $\Omega_B = I$</p>
</li>
<li><p>Then $\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = VV^{\top} = I$</p>
</li>
<li><p>Cosine similarity between any pair of different item-embeddings is zero</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...) \cdot V^{\top} = \Omega_A \cdot X \cdot \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$</p>
</li>
<li><p>the only difference in user-item embeddings is the normalization $\rightarrow$ the same ranking ($\Omega_A$ is irrelevant)</p>
</li>
</ul>
</li>
<li><p>choose $D = \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{-{1 \over 2}}$</p>
<ul>
<li><p>similar to previous case</p>
</li>
<li><p>$A_{(1)}^{(D)} = \hat{A}_{(1)} \cdot D^{-1} = V$</p>
</li>
<li><p>$B_{(1)}^{(D)} = \hat{B}_{(1)} \cdot D = V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)$</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, X\hat{A}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot X^{\top} \cdot \Omega_A$</p>
</li>
<li><p>for user-user similarities, it is based on the raw data-matrix</p>
</li>
<li><p>it doesn&#39;t use the learned embeddings</p>
</li>
<li><p>$\text{cosSim}(X\hat{A}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_A \cdot X \cdot \hat{A}_{(1)} \cdot \hat{B}_{(1)}^{\top} \cdot \Omega_B$</p>
</li>
<li><p>$\Omega_B$ normalizes the rows of $B$ but this is again the same rankings</p>
</li>
<li><p>$\text{cosSim}(\hat{B}_{(1)}^{(D)}, \hat{B}_{(1)}^{(D)}) = \Omega_B \cdot V \cdot \text{dMat}(..., {1 \over 1 + \lambda/\sigma_i^2}, ...)^{2} \cdot V^{\top} \cdot \Omega_B$</p>
</li>
<li><p>this is very different from the previous choice</p>
</li>
</ul>
</li>
<li><p>Hence, different choices of $D$ result in different cosine-similarities even though the learned model $\hat{A}_{(1)}^{(D)}\hat{B}_{(1)}^{(D)\top} = \hat{A}_{(1)}\hat{B}_{(1)}^{\top}$ is invariant to $D$</p>
</li>
<li><p><strong>the results of cosine-similarity are arbitrary and not unique for this model</strong></p>
</li>
</ul>
</li>
</ul>
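<p>A small sketch of the closed-form solution above, computed from the SVD of a random matrix for illustration; the symmetric split between $\hat{A}_{(1)}$ and $\hat{B}_{(1)}$ is only one valid choice among the many allowed by $D$:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 10))
lam, k = 1.0, 4

U, sigma, Vt = np.linalg.svd(X, full_matrices=False)
d = 1.0 / (1.0 + lam / sigma[:k] ** 2)            # dMat(..., 1 / (1 + lambda / sigma_i^2), ...)_k
V_k = Vt[:k].T

# one symmetric choice of the solution of the first (denoising) objective
A1 = B1 = V_k * np.sqrt(d)                        # any rescaling A1 D, B1 D^{-1} is also valid
print(np.allclose(A1 @ B1.T, V_k @ np.diag(d) @ V_k.T))   # True
</code></pre>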
<h2 id="23-details-on-second-objective">2.3 Details on Second Objective</h2>
<ul>
<li><p>The solution of the second objective is</p>
<ul>
<li><p>$\hat{A}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>$\hat{B}_{(2)} = V_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>$(y)_+ = \max(0, y)$ </p>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>If we use usual notation of MF $P = XA$ and $Q = B$, </p>
<ul>
<li><p>we get $\hat{P} = X\hat{A}_{(2)} = U_k \cdot \text{dMat}(..., \sqrt{\sigma_i \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$</p>
</li>
<li><p>this diagonal matrix is same for user and item embeddings due to its symmetry in the L2-norm regularization</p>
</li>
<li><p>this solution is unique $\rightarrow$ there is no way to choose arbitrary matrix $D$ </p>
</li>
</ul>
</li>
<li><p>In this case, the cosine-similarity yields unique results</p>
</li>
<li><p>is this matrix $\text{dMat}(..., \sqrt{{1 \over \sigma_i} \cdot (1 - {\lambda \over \sigma_i})_+}, ...)_k$ the best possible semantic similarities?</p>
<ul>
<li>comparing this case with 2.2 suggests that the arbitrary diagonal matrix $D$ in 2.2 may be chosen as $D = \text{dMat}(..., \sqrt{{1 \over \sigma_i}}, ...)_k$</li>
</ul>
</li>
</ul>
<h1 id="3-remedies-and-alternatives-to-cosine-similarity">3. Remedies and Alternatives to Cosine-Similarity</h1>
<ul>
<li><p>when a model is trained w.r.t. the dot-product, its effect on cosine-similarity can be opaque and sometimes not even unique</p>
<ul>
<li><p>train model on cosine-similarity $\rightarrow$ use layer normalization</p>
</li>
<li><p>project the embedding back into the original space $\rightarrow$ cosine-similarity works</p>
<ul>
<li>view $X\hat{A}\hat{B}^{\top}$ as the raw data&#39;s smoothed version and the rows of $X\hat{A}\hat{B}^{\top}$ as the users&#39; embeddings in the original space</li>
</ul>
</li>
</ul>
</li>
<li><p>in cosine-similarity, normalization is applied after the embeddings have been learned</p>
<ul>
<li>this can yield worse similarities compared to applying <strong>some normalization or reduction of popularity bias</strong> before or during learning</li>
</ul>
</li>
<li><p>To resolve this, </p>
<ul>
<li><p>standardize data $X$ (zero mean, unit variance)</p>
</li>
<li><p>negative sampling, inverse propensity scaling to account for the different item popularities</p>
<ul>
<li>word2vec is trained by sampling negatives with a probability proportional to their frequency</li>
</ul>
</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<ul>
<li><p>illustrate these findings for low-rank embeddings</p>
</li>
<li><p>Not aware of a good metric for semantic similarity $\rightarrow$ experiments on simulated data $\rightarrow$ ground-truths are known (clustered items data)</p>
</li>
<li><p>generated interactions between 20000 users and 1000 items assigned to 5 clusters with probability $p_c$</p>
</li>
<li><p>sampled the powerlaw-exponent for each cluster $c$, $\beta_c \sim \text{Uniform}(\beta_{min}^{(item)}, \beta_{max}^{(item)})$</p>
<ul>
<li>where $\beta_{min}^{(item)} = 0.25, \beta_{max}^{(item)} = 1.5$</li>
</ul>
</li>
<li><p>assigned a baseline popularity to each item $i$ according to the powerlaw $p_i = \text{PowerLaw}(\beta_c)$ </p>
</li>
<li><p>then generated the items that each user $u$ had interacted with</p>
<ul>
<li><p>firstly, randomly sampled user-cluster preferences $p_{uc}$ </p>
</li>
<li><p>compute the user-item probabilities $p_{ui} = {p_{uc_i}p_i \over \sum_i p_{uc_i}p_i}$</p>
</li>
<li><p>sampled the number of items for this user $k_u \sim \text{PowerLaw}(\beta^{(user)})$ (used $\beta^{(user)} = 0.5$ and sampled $k_u$ items with $p_{ui}$)</p>
</li>
</ul>
</li>
<li><p>Learned the matrices $A, B$ with two training objective ($\lambda = 10000$ and $\lambda = 100$)</p>
<ul>
<li>low-rank constraint $k=50$, $p=1000$ to complement the analytical result for the full-rank case above</li>
</ul>
</li>
</ul>
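<p>A scaled-down sketch of the simulated data generation described above; the exact power-law samplers and the Dirichlet choice for the user-cluster preferences are assumptions, since the paper&#39;s precise definitions are not reproduced in these notes:</p>
<pre><code>import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_clusters = 200, 100, 5       # scaled-down version of the setup above

cluster = rng.integers(0, n_clusters, n_items)                       # item -&gt; cluster assignment
beta_c = rng.uniform(0.25, 1.5, n_clusters)                          # power-law exponent per cluster
p_item = rng.power(beta_c[cluster])                                  # baseline item popularity (assumed form)

X = np.zeros((n_users, n_items))
for u in range(n_users):
    p_uc = rng.dirichlet(np.ones(n_clusters))                        # user-cluster preferences (assumed)
    p_ui = p_uc[cluster] * p_item
    p_ui /= p_ui.sum()
    k_u = 1 + int(20 * rng.power(0.5))                               # number of items for this user
    items = rng.choice(n_items, size=min(k_u, n_items), replace=False, p=p_ui)
    X[u, items] = 1.0
print(X.shape, X.sum(axis=1).mean())
</code></pre>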
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d5fa3e4f-739a-4d38-9acd-87141ad6e13c/image.png" alt=""></p>
<ul>
<li><p>Left one is ground-truth item-item similarities</p>
</li>
<li><p>training with first objective and chose three re-scaling of the singular vectors in $V_k$ (middle three)</p>
</li>
<li><p>Right one is trained with second objective $\rightarrow$ unique solution</p>
</li>
<li><p>the resulting cosine-similarities are vastly different even for reasonable choices of re-scaling (extreme cases were not used)</p>
</li>
</ul>
<h1 id="5-conclusions">5. Conclusions</h1>
<ul>
<li><p>cosine similarities are heavily dependent on the method and regularization technique</p>
</li>
<li><p>in some cases, it can be rendered even meaningless</p>
</li>
<li><p>cosine-similarity of embeddings in deep models is expected to be plagued by similar problems</p>
<ul>
<li>deep model&#39;s different layers may be subject to different regularization $\rightarrow$ may affect $D$</li>
</ul>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<ul>
<li>A reflection on cosine similarity, which people tend to use blindly. The tests feel limited to a rather restricted setting, but as a prompt to question the practice at least once, it was worthwhile.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Beyond Language Models: Byte Models are Digital World Simulators]]></title>
            <link>https://velog.io/@0404_not_found/Beyond-Language-Models-Byte-Models-are-Digital-World-Simulators</link>
            <guid>https://velog.io/@0404_not_found/Beyond-Language-Models-Byte-Models-are-Digital-World-Simulators</guid>
            <pubDate>Sat, 09 Mar 2024 16:51:42 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Deep Learning has focused on interpretable digital media files - text, images, audio</p>
<ul>
<li><p>Text played central role in conveying human intelligence and has led to the emergence of LMs</p>
</li>
<li><p>LMs tokenize text and predict the next token so that they can comprehend human language and intelligence</p>
</li>
<li><p>Recent advancements extend tokenization beyond text</p>
</li>
</ul>
</li>
<li><p>These deep learning models overlook the omnipresent native binary data in the digital world</p>
<ul>
<li><p>Next-Byte Prediction will allow the models to truly understand and simulate all activities in the digital world</p>
</li>
<li><p>It has practical benefits in cybersecurity, computer diagnostics, data compression and even for reverse-engineering a software&#39;s source code from binary representation</p>
</li>
</ul>
</li>
<li><p><strong>bGPT</strong> : model for binary data processing and digital world modelling by next byte prediction</p>
<ul>
<li><p>directly interpreting and manipulating binary data</p>
</li>
<li><p>two-fold advantages</p>
<ul>
<li><p>Interpreting Digital System</p>
</li>
<li><p>Unified Modelling</p>
</li>
</ul>
</li>
</ul>
</li>
<li><p>Experiment in two areas</p>
<ul>
<li><p>well-studied tasks (generative modelling, classification)</p>
</li>
<li><p>relatively underexplored tasks intrinsic to binary-native operations (data conversion, CPU state modelling)</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/77ee79e4-cbcd-4b4e-ab60-63676d3cb8d8/image.png" alt=""></p>
<h1 id="2-background">2. Background</h1>
<h2 id="21-language-models">2.1 Language Models</h2>
<ul>
<li><p>Text Models</p>
<ul>
<li><p>LSTM-based to Transformer-based</p>
</li>
<li><p>Tokenization plays a fundamental role (breaking down into words or subwords)</p>
</li>
<li><p>GPT  models pretrained with self-supervised learning via next token prediction</p>
</li>
<li><p>next token prediction enables the GPT to capture the structure and semantics behind languages</p>
</li>
</ul>
</li>
<li><p>Audio Models</p>
<ul>
<li><p>AudioPaLM : merged text and speech</p>
<ul>
<li>enables speech-to-speech translation and speech recognition</li>
</ul>
</li>
<li><p>MusicGen : generate music by multiple parallel streams of acoustic tokens by EnCodec</p>
</li>
</ul>
</li>
<li><p>Image Models</p>
<ul>
<li><p>iGPT : transformer to predict next pixel</p>
</li>
<li><p>vision-language models : connect text and visual data</p>
</li>
</ul>
</li>
<li><p>Biochemical sequence Models</p>
<ul>
<li><p>Tranception : transformers to predict protein fitness</p>
</li>
<li><p>ProtGPT2 : generates protein sequences</p>
</li>
<li><p>HyenaDNA : extends context lengths in genomic modelling</p>
</li>
</ul>
</li>
</ul>
<h2 id="22-byte-models">2.2 Byte Models</h2>
<ul>
<li><p>Binary data lacks the inherent structure and semantics of human-interpretable data</p>
</li>
<li><p>MalConv, DeepVSA : malware detection and program analysis</p>
<ul>
<li><p>MalConv uses CNN to analyze byte sequences</p>
</li>
<li><p>DeepVSA : value set analysis for post-mortem program analysis</p>
</li>
</ul>
</li>
<li><p>Byte-level Byte Pair Encoding (BBPE) : used for multilingual pretraining, machine translation</p>
</li>
<li><p>ByT5 : transformers for byte sequences</p>
<ul>
<li>token-free encoding that improves noise robustness and spelling sensitivity in multilingual</li>
</ul>
</li>
<li><p>ByteFormer : raw byte sequences from images and audio</p>
</li>
<li><p>MegaByte : modelling long byte sequences across various modalities</p>
</li>
<li><p>MambaByte : used Mamba to excel in byte-level language modelling and outperformed LMs based on subword tokenization</p>
</li>
<li><p>Current research often neglects <strong>native binary data</strong>, focusing on narrow tasks and overlooking broader potential in digital world simulation</p>
</li>
</ul>
<h1 id="3-methodology">3. Methodology</h1>
<h2 id="31-model-architecture">3.1 Model Architecture</h2>
<ul>
<li><p>the high granularity of bytes results in long sequences $\rightarrow$ computational cost</p>
</li>
<li><p>quadratic self-attention scaling $\rightarrow$ computational cost</p>
</li>
<li><p><strong>hierarchical Transformer architecture</strong></p>
<ul>
<li><p>sequence of bytes $B = \{ b_1, b_2, ..., b_T \}$ of length $T$</p>
</li>
<li><p>sequence of patches $\mathcal{P} = [P_1, P_2, ..., P_N]$</p>
</li>
<li><p>each patch contains $S$ bytes</p>
</li>
<li><p>the number of patches $N = \lceil{T \over S} \rceil$ </p>
</li>
<li><p>$P_i = [b_{(i-1)S + 1}, ..., b_{iS}]$ for $1 \le i \le N$</p>
</li>
<li><p>if $T \mod S \not= 0$, the last patch is padded with $e$ (the end-of-patch token, eop) to size $S$</p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0088935f-af55-4926-977a-a4424d3baad3/image.png" alt=""></p>
</li>
</ul>
</li>
</ul>
<h4 id="linear-projection-layer">Linear Projection Layer</h4>
<ul>
<li><p>Each patch $P_i$ from $\mathcal{P}$ is viewed as a matrix of size $S \times 257$ </p>
<ul>
<li>each byte is one-hot encoded (256 values + eop token)</li>
</ul>
</li>
<li><p>Flatten those patches into one-dimensional vectors</p>
<ul>
<li>rows in the matrix are concatenated</li>
</ul>
</li>
<li><p>the projection layer maps each flattened vector into a dense vector $E_i$ of hidden size $H$</p>
<ul>
<li>$E_i = \text{Flatten}(P_i) \cdot W_{\text{linear}}, \quad 1 \le i \le N$</li>
</ul>
</li>
<li><p>$W_{\text{linear}}$ has the shape of $(257\times S, H)$</p>
</li>
<li><p>Dense embedding enables more efficient processing of the byte sequence by reducing the dimension while preserving the essential information</p>
</li>
</ul>
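<p>A minimal sketch of the patching and linear projection described above (one-hot bytes over 257 symbols including eop, flattened per patch, projected to hidden size $H$); the sizes are illustrative:</p>
<pre><code>import torch
import torch.nn as nn

S, H, EOP = 16, 64, 256                            # patch size, hidden size, end-of-patch id

def patchify(byte_seq, S=S, eop=EOP):
    """Split a byte sequence into patches of S bytes, padding the last patch with eop."""
    pad = (-len(byte_seq)) % S
    padded = byte_seq + [eop] * pad
    return [padded[i:i + S] for i in range(0, len(padded), S)]

proj = nn.Linear(257 * S, H)                       # W_linear has shape (257 * S, H)

patches = patchify(list(b"hello, digital world"))
onehot = torch.nn.functional.one_hot(torch.tensor(patches), num_classes=257).float()  # (N, S, 257)
E = proj(onehot.flatten(start_dim=1))              # (N, H): dense patch embeddings
print(onehot.shape, E.shape)
</code></pre>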
<h4 id="patch-level-decoder">Patch-Level Decoder</h4>
<ul>
<li><p>Takes the sequence of embedded patches $\mathcal{E} = \{ E_1, E_2, ..., E_N \}$ and processes it to autoregressively predict the features of the subsequent patch, effectively learning the structure of data</p>
<ul>
<li><p>$\hat{E}_i = \text{Decoder}_{\text{patch}}(\mathcal{E}_{&lt;i} \oplus \mathcal{X}_{&lt;i})$</p>
</li>
<li><p>$\mathcal{E}_{&lt;i}$ for the sequence of patch embedding before the $i$-th patch</p>
</li>
<li><p>$\mathcal{X}_{&lt;i}$ for corresponding positional embeddings</p>
</li>
<li><p>$\oplus$ for element-wise addition</p>
</li>
</ul>
</li>
</ul>
<h4 id="byte-level-decoder">Byte-Level Decoder</h4>
<ul>
<li><p>Takes the predicted feature $\hat{E}_i$ of each patch and autoregressively reconstructs the sequence of bytes within that patch</p>
</li>
<li><p>independent for each patch and operates by conditioning on the feature representation $\hat{E}_i$ of the current patch</p>
</li>
<li><p>$\hat{b}_{i, j} = \text{Decoder}_{\text{byte}}(\hat{E}_i, b_{i, &lt;j}), \quad 1 \le j \le S$</p>
</li>
</ul>
<h2 id="32-training-objectives">3.2 Training Objectives</h2>
<h4 id="generative-modelling">Generative Modelling</h4>
<ul>
<li><p>aims to predict the next byte $b_{i+1}$ based on preceding bytes $\{ b_1, b_2, ..., b_i \}$ without explicit guidance</p>
</li>
<li><p>the objective is minimizing the negative log-likelihood of the next byte prediction across the sequence</p>
</li>
<li><p>$\mathcal{L}_{\text{GEN}}(\theta) = - \displaystyle\sum_{i=1}^{T-1} \log p(b_{i+1}|b_1, b_2, ..., b_i; \theta)$</p>
</li>
<li><p>this loss encourages the model to understand the sequential dependencies in data at the byte level</p>
</li>
</ul>
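<p>A minimal sketch of the generative objective above, written as cross-entropy over shifted byte positions (which equals the summed negative log-likelihood); the logits are a stand-in for bGPT&#39;s per-byte output:</p>
<pre><code>import torch
import torch.nn.functional as F

def next_byte_loss(logits, bytes_seq):
    """L_GEN = - sum_i log p(b_{i+1} | b_1..b_i); logits: (T, 257) model outputs,
    bytes_seq: (T,) ground-truth bytes. Position i predicts byte i+1."""
    return F.cross_entropy(logits[:-1], bytes_seq[1:], reduction="sum")

T = 32
logits = torch.randn(T, 257)                 # stand-in for the byte-level decoder output
bytes_seq = torch.randint(0, 256, (T,))
print(next_byte_loss(logits, bytes_seq))
</code></pre>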
<h4 id="classification">Classification</h4>
<ul>
<li><p>After being pretrained with next byte prediction, it is further trained on labelled datasets for classification</p>
</li>
<li><p>predicts categories from byte sequences</p>
</li>
<li><p>involves extracting a global feature from the byte sequence which is then processed by a classification head</p>
</li>
<li><p>$\mathcal{L}_{\text{CLF}}(\theta) = -\displaystyle\sum_{k=1}^K y_k \log p(y_k | B; \theta)$</p>
</li>
<li><p>$y_k$ is the boolean label for the $k$-th category indicating whether the byte sequence is for that category</p>
</li>
<li><p>$K$ for the total number of categories</p>
</li>
<li><p>$p(y_k | B; \theta)$ is the predicted probability of category $k$ given the byte sequence $B$</p>
</li>
</ul>
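<p>Both objectives above reduce to standard cross-entropy; a hedged sketch with illustrative shapes, not the exact training code:</p>
<pre><code>import torch
import torch.nn.functional as F

def generative_loss(next_byte_logits, byte_ids):
    """L_GEN: negative log-likelihood of the next byte.
    next_byte_logits: (T-1, 257) predictions for positions 2..T; byte_ids: (T,)."""
    return F.cross_entropy(next_byte_logits, byte_ids[1:])

def classification_loss(class_logits, label):
    """L_CLF: cross-entropy over K categories, computed from a global feature
    (e.g. average-pooled patch-level features) passed through a classification head."""
    return F.cross_entropy(class_logits.unsqueeze(0), torch.tensor([label]))
</code></pre>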
<h1 id="4-applications">4. Applications</h1>
<h2 id="41-digital-media-processing">4.1 Digital Media Processing</h2>
<ul>
<li><p>The field of deep learning is steadily advancing its proficiency in both generation and classification of text, audio, and images</p>
</li>
<li><p>These media are typically stored and transmitted as byte sequences $\rightarrow$ bGPT can process them for generative modelling and classification</p>
</li>
<li><p>bGPT is trained with next-byte prediction, uses features from the patch-level decoder, and employs average pooling to derive global features for classification</p>
</li>
<li><p>Data</p>
<ul>
<li>Audio : convert to WAV, including an 8000Hz sampling rate, mono channel, 8-bit depth, trimmed to 1 sec</li>
<li>Image : convert to BMP, 32 * 32, RGB, 24-bit depth</li>
</ul>
</li>
</ul>
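<p>A rough sketch of the preprocessing described above, using PIL and pydub; the exact conversion pipeline the authors used is not specified here:</p>
<pre><code>from PIL import Image
from pydub import AudioSegment

def to_bmp(in_path, out_path):
    # 32 x 32, RGB (24-bit) BMP
    Image.open(in_path).convert("RGB").resize((32, 32)).save(out_path, format="BMP")

def to_wav(in_path, out_path):
    # 8000 Hz, mono, 8-bit WAV, trimmed to 1 second
    audio = AudioSegment.from_file(in_path)
    audio = audio.set_frame_rate(8000).set_channels(1).set_sample_width(1)
    audio[:1000].export(out_path, format="wav")
</code></pre>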
<h2 id="42-algorithm-and-hardware-simulation">4.2 Algorithm and Hardware Simulation</h2>
<h4 id="data-conversion">Data Conversion</h4>
<ul>
<li><p>converting data from one format to another with symbolic music formats (ABC notation) and MIDI files</p>
</li>
<li><p>employs the generative modelling approach on concatenated byte sequences of paired ABC and MIDI files separated by a special patch</p>
</li>
<li><p>bGPT learns to convert text-based ABC notation into binary MIDI performance signals and its reverse</p>
</li>
<li><p>ability to simulate and reverse-engineer the conversion algorithm</p>
</li>
</ul>
<h4 id="cpu-state-modeling">CPU State Modeling</h4>
<ul>
<li><p>given concatenated sequences of low-level machine instructions followed by a series of CPU register states</p>
</li>
<li><p>to accurately predict how the state updates with each instruction until the program halts</p>
</li>
<li><p>interpreting operational data and replicating digital activities within hardware</p>
</li>
<li><p>CPU States dataset (2.1M instances)</p>
<ul>
<li><p>offering a simplified representation of CPU behavior </p>
</li>
<li><p>each instance contains a 1KB memory block with varying numbers of machine instructions followed by a sequence of 16-byte CPU register states</p>
</li>
<li><p>the instruction sequences cover various instruction types (21 types with 43 variants: data movement, logical operations, arithmetic operations)</p>
</li>
<li><p>within each state</p>
<ul>
<li><p>1 byte each for the Program Counter and the Accumulator</p>
</li>
<li><p>4 bytes for Instruction Register</p>
</li>
<li><p>10 bytes for general-purpose registers</p>
</li>
</ul>
</li>
<li><p>instances are randomly generated with 1 to 256 instructions, together with the captured register states</p>
</li>
</ul>
</li>
</ul>
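<p>For intuition, one 16-byte register state could be packed as below; the field order is my assumption for illustration, not necessarily the dataset&#39;s exact layout:</p>
<pre><code>def pack_state(pc, acc, ir, gprs):
    """1-byte PC, 1-byte ACC, 4-byte IR, 10 general-purpose registers = 16 bytes."""
    assert len(ir) == 4 and len(gprs) == 10
    return bytes([pc, acc]) + bytes(ir) + bytes(gprs)

state = pack_state(pc=3, acc=0, ir=[0x01, 0x00, 0x00, 0x2A], gprs=[0] * 10)
assert len(state) == 16
</code></pre>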
<h1 id="5-experiments">5. Experiments</h1>
<h2 id="51-settings">5.1 Settings</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fd2b7912-db4a-4eb3-b001-48cc8b7d7d0b/image.png" alt=""></p>
<ul>
<li>used open-source datasets</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ffa923af-485d-4a07-82be-984403055a25/image.png" alt=""></p>
<ul>
<li><p>the 110M-parameter bGPT matches the scale of standard Transformer-based models</p>
</li>
<li><p>avoided hyperparameter tuning and data augmentation for all evaluations</p>
</li>
<li><p>Acc for classification</p>
</li>
<li><p>Bits-Per-Byte (BPB) for generative modelling (see the short note below)</p>
</li>
</ul>
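<p>Bits-Per-Byte is the average negative log-likelihood per byte expressed in bits (lower is better); a minimal helper:</p>
<pre><code>import math

def bits_per_byte(total_nll_nats, num_bytes):
    """Convert a summed NLL (in nats) over a byte sequence into BPB."""
    return total_nll_nats / (num_bytes * math.log(2))
</code></pre>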
<h2 id="52-digital-media-processing">5.2 Digital Media Processing</h2>
<ul>
<li><p>used standard pre-training and fine-tuning approach</p>
</li>
<li><p>$\text{bGPT}_{\text{image}}$ : using ImageNet</p>
</li>
<li><p>$\text{bGPT}_{\text{wiki}}$ : Wikipedia</p>
</li>
<li><p>$\text{bGPT}_{\text{libri}}$ : LibriSpeech</p>
</li>
<li><p>$\text{bGPT}_{\text{signal}}$ : LibriSpeech + ImageNet</p>
</li>
<li><p>$\text{bGPT}_{\text{mix}}$ : LibriSpeech + ImageNet + Wikipedia</p>
</li>
<li><p>$\text{bGPT}_{\text{random}}$ : randomly initialized, baseline</p>
</li>
<li><p>first fine-tuned with next byte prediction on AGNews, CIFAR-10, Speech Commands v2</p>
</li>
<li><p>then fine-tuned for classification</p>
</li>
</ul>
<h3 id="521-baselines">5.2.1 Baselines</h3>
<ul>
<li><p>GPT2-small for text</p>
<ul>
<li>pretrained on English Wikipedia with the same settings as bGPT</li>
</ul>
</li>
<li><p>ViT-B/16 for image </p>
<ul>
<li><p>pretrained on ImageNet</p>
</li>
<li><p>results are taken from original studies</p>
</li>
</ul>
</li>
<li><p>AST for audio</p>
</li>
</ul>
<h3 id="522-results">5.2.2 Results</h3>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6f1e288a-073f-4df4-b9dc-ab522b8ae3ad/image.png" alt=""></p>
<ul>
<li><p>When the pretraining and fine-tuning data modalities match, the model shows strong performance in downstream tasks</p>
</li>
<li><p>Despite not having modality-specific prior knowledge, bGPT still manages to achieve performance similar to the baselines</p>
</li>
<li><p>but $\text{bGPT}_{\text{image}}$ performs much lower than ViT, as the sequential processing nature of byte models is not well suited to 2D data</p>
<ul>
<li>simply scaling up while retaining this sequential processing may not be enough to close the gap</li>
</ul>
</li>
<li><p>$\text{bGPT}_{\text{signal}}$ and $\text{bGPT}_{\text{mix}}$ show accuracy comparable to the unimodal models, with only a small loss</p>
<ul>
<li>Trade-off in byte models : mixed modality dilutes the depth of domain-specific understanding but it fosters versatility</li>
</ul>
</li>
<li><p>positive transfer (pretrain on Audio/Image and fine-tune on Image/Audio) shows improvements over random initialization</p>
<ul>
<li>audio and image have some shared byte pattern</li>
</ul>
</li>
<li><p>negative transfer (from text to other modalities) shows that the structured patterns learned in pretraining do not carry over</p>
<ul>
<li>text has byte-level organizational patterns distinct from audio and image</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e71cf76f-ff45-4664-92e7-8be5ea184549/image.png" alt=""></p>
<ul>
<li><p>To investigate cross-modal knowledge transfer    </p>
<ul>
<li><p>convert the Speech Commands v2 into 32 * 32 BMP spectrograms</p>
</li>
<li><p>8KB audio to 3KB images</p>
</li>
<li><p>there is some information loss</p>
</li>
</ul>
</li>
<li><p>the image model was chosen for its data format consistency with spectrograms</p>
</li>
<li><p>the libri model was chosen for its similarity in information content</p>
</li>
<li><p>judging by the image and libri models&#39; BPB, the performance disparity seen on CIFAR-10 does not extend to this spectrogram task</p>
<ul>
<li>CIFAR-10 images share fewer patterns with spectrograms than spectrograms share with raw audio</li>
</ul>
</li>
<li><p>the libri model achieves higher accuracy than the image model on spectrograms of speech content</p>
</li>
<li><p>byte models have an inherent capability to discern and translate abstract data features and patterns regardless of modality and format</p>
</li>
</ul>
<h2 id="53-algorithm-and-hardware-simulation">5.3 Algorithm and Hardware Simulation</h2>
<ul>
<li><p>To evaluate bGPT&#39;s ability in simulating algorithms and hardware</p>
</li>
<li><p>Lack of baseline models and widely used datasets $\rightarrow$ evaluating scalability of bGPT on binary data</p>
</li>
<li><p>data conversion and CPU state modelling</p>
</li>
<li><p>training data scale from $10^3$ to $10^6$ entries ($\text{bGPT}^3$ to $\text{bGPT}^6$)</p>
</li>
<li><p>all models are randomly initialized</p>
</li>
<li><p>for data conversion, used the IrishMAN dataset (ABC notation and MIDI files)</p>
</li>
</ul>
<h3 id="531-data-conversion">5.3.1 Data Conversion</h3>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8d3986bb-d321-40af-bc85-6cb5e835bb7a/image.png" alt=""></p>
<ul>
<li>for ABC to MIDI, $\text{BPB}_{\text{abc}}$ assesses generative modelling, as the ABC content is generated from scratch, and $\text{BPB}_{\text{MIDI}}$ evaluates data conversion, as the full ABC byte sequence is given</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ad00def8-93e6-4bdb-a5da-9eee696ac42f/image.png" alt=""></p>
<ul>
<li><p>increased data volume directly enhances model performance in simulating data conversion</p>
</li>
<li><p>from Table 5, the BPB decreases as the training data scale grows</p>
</li>
<li><p>the BPB for ABC is higher than for MIDI in both conversion directions</p>
<ul>
<li><p>ABC to MIDI focuses on simulating an existing algorithm with necessary information while the reverse process requires inferring and reconstructing missing information in MIDI (score structure, musical ornament, expression)</p>
</li>
<li><p>as MIDI is binary and ABC is text, model may find it easier to learn patterns within MIDI files</p>
</li>
</ul>
</li>
</ul>
<h3 id="532-cpu-state-modelling">5.3.2 CPU State Modelling</h3>
<ul>
<li><p>to replicate CPU functionality </p>
</li>
<li><p>selecting the highest probability byte at each step</p>
</li>
<li><p>accuracy $\rightarrow$ byte-wise comparisons with actual states</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/270b1517-f33a-403e-a5e0-61fab5d2282c/image.png" alt=""></p>
<ul>
<li><p>data volume significantly influences modelling performance</p>
</li>
<li><p>efficiency beyond simple memorization (each test case consists of an average of 128 instructions)</p>
</li>
<li><p>After epoch 11, $\text{bGPT}^5$ showed significant improvement of performance $\rightarrow$ deeper understanding of CPU states may stem from a qualitative enhancement in capability</p>
</li>
<li><p>Aligns with emergent abilities in LLMs</p>
</li>
<li><p>Is this learning genuine?</p>
<ul>
<li><p>one concern: are the performance boosts merely due to non-linear metrics or overfitting?</p>
</li>
<li><p>but BPB is linear and smooth</p>
</li>
<li><p>so this improvement seems to stem from a genuine comprehension of CPU operations</p>
</li>
</ul>
</li>
<li><p>bGPT shows strong scalability on native binary data with emergent abilities in data conversion and CPU state modelling</p>
</li>
</ul>
<h1 id="6-conclusions">6. Conclusions</h1>
<ul>
<li><p><strong>bGPT</strong> : as a versatile simulator for the digital world</p>
</li>
<li><p>extending deep learning to binary data processing</p>
</li>
<li><p>effective in modeling digital media data + modality-agnostic knowledge transfer</p>
</li>
<li><p>strong scalability in modelling native binary data and signs of emergent abilities</p>
</li>
<li><p>without modality-specific designs, it shows comparable performance</p>
</li>
<li><p>opportunities for improvement</p>
<ul>
<li><p>currently tested for short audio and low-resolution images</p>
</li>
<li><p>data conversion between ABC and MIDI</p>
</li>
<li><p>only simplified CPUs</p>
</li>
</ul>
</li>
<li><p>Future research</p>
<ul>
<li><p>reducing computational cost</p>
</li>
<li><p>scaling models and datasets to cover a broader range of data</p>
</li>
<li><p>improving model performance for underexplored tasks</p>
</li>
</ul>
</li>
</ul>
<h1 id="7-impact-statements">7. Impact Statements</h1>
<ul>
<li><p>it necessitates a careful examination of its ethical implications</p>
</li>
<li><p>its ability to simulate or reverse-engineer algorithms</p>
<ul>
<li><p>can significantly boost technological innovation in cybersecurity, software, hardware</p>
</li>
<li><p>poses a risk to intellectual property as training bGPT on paired source code and executable software might enable the reverse-engineering of proprietary software</p>
</li>
</ul>
</li>
<li><p>it opens opportunities for advancing our understanding of the digital world, but the ethical, societal, and legal implications must be handled carefully</p>
</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<ul>
<li>Since all computer data ultimately comes down to 0s and 1s, the idea is to work at the byte level and realize multimodality that way. The reverse-engineering-style task via CPU states was also quite interesting. As usual, scale is the problem, but one open question: representing everything as bytes should require a far longer context length than current models use, and the paper does not seem to address this much.</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[The Era of 1-bit LLMs: All LLMs are in 1.58 bits]]></title>
            <link>https://velog.io/@0404_not_found/The-Era-of-1-bit-LLMs-All-LLMs-are-in-1.58-bits</link>
            <guid>https://velog.io/@0404_not_found/The-Era-of-1-bit-LLMs-All-LLMs-are-in-1.58-bits</guid>
            <pubDate>Mon, 04 Mar 2024 13:28:02 GMT</pubDate>
            <description><![CDATA[<h1 id="abstract">Abstract</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0370c0cd-c12d-47d7-a4dc-99e287598225/image.png" alt=""></p>
<ul>
<li><p>BitNet paved the way for a new era of 1-bit LLMs</p>
</li>
<li><p>BitNet b1.58 has every parameter as a <strong>ternary</strong> value in {-1, 0, 1}</p>
<ul>
<li>matches a full-precision Transformer with the same model size</li>
<li>significantly more cost-effective</li>
</ul>
</li>
<li><p>defines a new scaling law and recipe for training</p>
</li>
</ul>
<h1 id="1-the-era-of-1-bit-llms">1. The era of 1-bit LLMs</h1>
<ul>
<li><p>The recent LLMs&#39; size is increasing</p>
<ul>
<li><p>remarkable performance on LLM tasks</p>
</li>
<li><p>high energy consumption</p>
<ul>
<li>challenges for deployment</li>
<li>environmental and economic impact</li>
</ul>
</li>
</ul>
</li>
<li><p>Post-training quantization to create low-bit models for inference</p>
<ul>
<li><p>reduces weights and activations</p>
</li>
<li><p>16 bits to lower bits (4-bits)</p>
</li>
<li><p>sub-optimal</p>
</li>
</ul>
</li>
<li><p>BitNet presents a direction for reducing the cost of LLMs while maintaining their performance</p>
</li>
<li><p>the major computation cost comes from the <strong>floating-point addition and multiplication</strong> </p>
<ul>
<li>BitNet has only integer addition</li>
</ul>
</li>
<li><p>transferring model parameters from DRAM to the memory of an on-chip accelerator (SRAM) can be expensive during inference</p>
<ul>
<li><p>enlarging SRAM to improve throughput $\rightarrow$ significantly higher costs than DRAM</p>
</li>
<li><p>1-bit LLMs have a much lower memory footprint from both a capacity and bandwidth standpoint</p>
</li>
</ul>
</li>
<li><p>BitNet b1.58</p>
<ul>
<li><p>added 0 to original BitNet</p>
</li>
<li><p>retains all the benefits of the original BitNet</p>
</li>
<li><p>included new computation paradigm (no multiplication for matmul)</p>
</li>
<li><p>same energy consumption as the original BitNet</p>
</li>
<li><p>stronger modeling capability $\rightarrow$ explicit support for feature filtering by the inclusion of 0</p>
</li>
<li><p>it can match full-precision baselines in terms of perplexity and end-task performance starting from the 3B size</p>
</li>
</ul>
</li>
</ul>
<h1 id="2-bitnet-b158">2. BitNet b1.58</h1>
<h4 id="recap-bitlinear">Recap: BitLinear</h4>
<ol>
<li>Binarize weights to +1 or -1 with signum function</li>
</ol>
<ul>
<li>Centralize to be zero-mean to increase the capacity within a limited numerical range</li>
<li>Use the scaling factor $\beta$ after binarization to reduce the l2 error between the real-valued and binarized weights.</li>
</ul>
<p>$$
\tilde{W} = \text{Sign}(W - \alpha)
$$
$$
\text{Sign}(W_{ij}) = \begin{cases} +1, &amp; \text{if} \ W_{ij} &gt; 0, \\
-1, &amp; \text{if} \ W_{ij} \le 0
\end{cases}
$$
$$
\alpha = {1 \over nm} \sum_{ij} W_{ij}
$$
$$
\beta = {1 \over nm} ||W||_1
$$</p>
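<p>A small sketch of the weight binarization recapped above (PyTorch, treating $\beta$ as the scale kept for the later dequantization step):</p>
<pre><code>import torch

def binarize_weights(W):
    """Zero-center W (alpha), binarize with sign, and keep beta = mean |W|."""
    alpha = W.mean()
    beta = W.abs().mean()        # (1/nm) * ||W||_1
    W_bin = torch.sign(W - alpha)
    W_bin[W_bin == 0] = 1        # sign() returns 0 at exactly 0; map it to +1
    return W_bin, beta
</code></pre>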
<ol start="2">
<li>Quantize activations to $b$-bit precision with absmax</li>
</ol>
<ul>
<li><p>$Q_b = 2^{b-1}$</p>
</li>
<li><p>$\epsilon$ is a small floating-point number that prevents overflow in clipping
$$
\tilde{x} = \text{Quant}(x) = \text{Clip} \left( x \times {Q_b \over \gamma}, -Q_b + \epsilon , Q_b - \epsilon \right)
$$
$$
\gamma = ||x||_{\infin}
$$</p>
</li>
<li><p>For activations before non-linear functions (ReLU) $\rightarrow$ scale into $[0, Q_b]$ by subtracting the minimum of the inputs
$$
\tilde{x} = \text{Quant}(x) = \text{Clip} \left( (x-\eta) \times {Q_b \over \gamma}, \epsilon, Q_b - \epsilon\right)
$$
$$
\eta = \min_{ij} x_{ij}
$$</p>
</li>
<li><p>quantize with 8-bit </p>
</li>
<li><p>Training $\rightarrow$ quantize per tensor / Inference $\rightarrow$ quantize per token</p>
</li>
</ul>
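<p>And the absmax activation quantization above, sketched per tensor for simplicity:</p>
<pre><code>import torch

def quantize_activations(x, b=8, eps=1e-5):
    """Scale x by Q_b / gamma (gamma = max |x|) and clip to [-Q_b + eps, Q_b - eps]."""
    Qb = 2 ** (b - 1)
    gamma = x.abs().max()
    return torch.clamp(x * Qb / gamma, -Qb + eps, Qb - eps)
</code></pre>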
<ol start="3">
<li>Matrix Multiplication
$$
y = \tilde{W} \tilde{x}
$$</li>
</ol>
<p>The variance of the output $y$ under following assumption</p>
<ul>
<li>the elements in $W$ and $x$ are mutually independent and share same distribution</li>
<li>$W$ and $x$ are independent of each other</li>
</ul>
<p>$$
\begin{aligned} 
\text{Var}(y) &amp;= n\text{Var}(\tilde{w}\tilde{x}) \\
&amp;= nE \left[ \tilde{w}^2 \right]E\left[ \tilde{x}^2 \right] \\
&amp;= n \beta^2 E \left[\tilde{x}^2\right] \approx E\left[\tilde{x}^2\right]
\end{aligned}
$$
In full-precision, $\text{Var}(y) = 1$ with standard initialization method $\rightarrow$ training stability. To preserve this stability, use LayerNorm function.</p>
<ul>
<li>$$
\text{Var}(y) \approx E[\text{LN}(\tilde{x})^2] = 1 \quad \quad \quad (\text{SubLN})
$$</li>
</ul>
<p>Then, the final representation of BitLinear is:
$$
y = \widetilde{W}\widetilde{x} = \widetilde{W} \text{Quant}(\text{LN}(x)) \times {\beta\gamma \over Q_b} \\
\text{LN} (x) = {x - E(x) \over \sqrt{\text{Var}(x) + \epsilon}}
$$
${\beta\gamma \over Q_b}$ means Dequantization to restore original precision</p>
<ol start="4">
<li>Model Parallelism with Group quantization and Normalization</li>
</ol>
<ul>
<li>Calculate all parameters $\alpha, \beta, \gamma, \eta$ within each group (device)</li>
<li>If the number of groups is $G$, then the parameters become
$$
\alpha_g = {G \over nm} \sum_{ij} W_{ij}^{(g)}, \quad \quad \beta_g = {G \over nm} ||W^{(g)}||_1, \\
\gamma_g = ||x^{(g)}||_{\infty}, \quad \quad \eta_g = \min_{ij} x_{ij}^{(g)}
$$</li>
<li>LayerNorm should also be applied with similar way</li>
</ul>
<h4 id="bitnet-b158">BitNet B1.58</h4>
<ul>
<li><p>based on the BitLinear</p>
</li>
<li><p>trained from scratch, 1.58-bit weights and 8-bit activations</p>
</li>
<li><p>adopted <strong>absmean</strong> quantization</p>
<ul>
<li><p>scales the weight by its average absolute value</p>
</li>
<li><p>round each value to the nearest integer among {-1, 0, 1} (a sketch follows this list)
$$
\begin{aligned} 
\tilde{W} &amp;= \text{RoundClip}({W \over \gamma + \epsilon}, -1, 1) \\
\text{RoundClip}(x, a, b) &amp;= \max(a, \min(b, \text{round}(x))) \\
\gamma &amp;= {1 \over mn} \sum_{ij} |W_{ij}|
\end{aligned}
$$</p>
</li>
<li><p>don&#39;t scale the activations before the non-linear functions to the range $[0, Q_b]$</p>
</li>
<li><p>scale all activations to $[-Q_b, Q_b]$ per token to get rid of the zero-point quantization</p>
<ul>
<li>more convenient and simple for both implementation and system-level optimization</li>
</ul>
</li>
</ul>
</li>
</ul>
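<p>A minimal sketch of the absmean quantization above, assuming a plain PyTorch tensor for the weight matrix:</p>
<pre><code>import torch

def absmean_ternarize(W, eps=1e-5):
    """Scale W by its mean absolute value, then RoundClip to {-1, 0, +1}."""
    gamma = W.abs().mean()                        # (1/mn) * sum |W_ij|
    return torch.clamp(torch.round(W / (gamma + eps)), -1, 1)
</code></pre>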
<h4 id="llama-alike-components">LLaMA-alike components</h4>
<ul>
<li><p>used LLaMA alike components</p>
<ul>
<li>RMSNorm</li>
<li>SwiGLU</li>
<li>rotary embedding</li>
<li>removed all biases</li>
</ul>
</li>
<li><p>it can be integrated into the popular open-source software</p>
</li>
</ul>
<h1 id="3-results">3. Results</h1>
<ul>
<li><p>BitNet b1.58 vs FP16 LLaMA</p>
</li>
<li><p>pretrained on RedPajama for 100B tokens</p>
</li>
<li><p>zero-shot performance</p>
<ul>
<li><p>ARC-Easy</p>
</li>
<li><p>ARC-Challenge</p>
</li>
<li><p>Hellaswag</p>
</li>
<li><p>Winogrande</p>
</li>
<li><p>PIQA</p>
</li>
<li><p>OpenbookQA</p>
</li>
<li><p>BoolQ</p>
</li>
</ul>
</li>
<li><p>validation PPL</p>
<ul>
<li><p>WikiText2</p>
</li>
<li><p>C4</p>
</li>
</ul>
</li>
<li><p>runtime GPU memory and latency</p>
<ul>
<li><p>FasterTransformer codebase</p>
</li>
<li><p>2-bit kernel from Ladder in BitNet</p>
</li>
<li><p>the time per output token</p>
</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/433c531f-d56d-405b-a4cf-e1a66e5cc3ae/image.png" alt=""></p>
<ul>
<li><p>BitNet starts to match FP LLaMA at 3B size</p>
</li>
<li><p>BitNet b1.58 3.9B outperforms FP LLaMA 3B</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/506ee24e-0f80-4162-815e-a9c6bc87e631/image.png" alt=""></p>
<ul>
<li><p>the performance gap between BitNet and LLaMA narrows as the model size increases</p>
</li>
<li><p>in terms of zero-shot performance, BitNet starts to match LLaMA at the 3B size</p>
</li>
<li><p>BitNet b1.58 3.9B outperforms LLaMA $\rightarrow$ BitNet b1.58 is a Pareto improvement over the SOTA LLMs</p>
</li>
</ul>
<h4 id="memory-and-latency">Memory and Latency</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8e91fac3-e056-40bc-a24f-2e10ecb88965/image.png" alt=""></p>
<ul>
<li><p>the speed-up increases as the model size scales</p>
<ul>
<li>the proportion of nn.Linear increases as the model size grows</li>
</ul>
</li>
<li><p>for the memory, the trend follows that of the latency</p>
<ul>
<li>as the embedding remains full precision and its proportion gets smaller</li>
</ul>
</li>
<li><p>Both were measured with a 2-bit kernel</p>
<ul>
<li>there is still room for optimization</li>
</ul>
</li>
</ul>
<h4 id="energy">Energy</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ee87a15d-a147-44e7-9599-dabeb032e909/image.png" alt=""></p>
<ul>
<li><p>for LLaMA model, the majority of matmul is FP16 multiplication while for BitNet, it is INT8 addition</p>
</li>
<li><p>BitNet is more efficient when model is large</p>
<ul>
<li>as the percentage of nn.Linear grows with the model size</li>
</ul>
</li>
</ul>
<h4 id="throughput">Throughput</h4>
<ul>
<li><p>compared on two A100 80G cards</p>
</li>
<li><p>BitNet b1.58 and LLaMA 70B</p>
</li>
<li><p>maximum batch size that fits in GPU memory</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/26a061f7-2ec8-47f0-8e44-a23249725c09/image.png" alt=""></p>
<h4 id="bitnet-b158-is-enabling-a-new-scaling-law-wrt-model-performance-and-inference">BitNet b1.58 is enabling a new scaling law w.r.t. model performance and inference</h4>
<ul>
<li><p>in terms of latency, memory usage and energy consumption,</p>
<ul>
<li><p>BitNet 13B &gt; FP16 3B</p>
</li>
<li><p>BitNet 30B &gt; FP16 7B</p>
</li>
<li><p>BitNet 70B &gt; FP16 13B</p>
</li>
</ul>
</li>
</ul>
<h4 id="training-with-2t-tokens">Training with 2T tokens</h4>
<ul>
<li><p>to test scalability in terms of token</p>
</li>
<li><p>same recipe as StableLM 3B</p>
</li>
<li><p>evaluated on </p>
<ul>
<li>Winogrande</li>
<li>PIQA</li>
<li>SciQ</li>
<li>LAMBADA</li>
<li>ARC-easy</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9f572fb5-5c2a-49f6-ba3b-a997ee9dcca9/image.png" alt=""></p>
<ul>
<li>It has strong generalization capabilities</li>
</ul>
<h1 id="4-discussion-and-future-work">4. Discussion and Future Work</h1>
<h4 id="1-bit-moe-llms">1-bit MoE LLMs</h4>
<ul>
<li><p>MoE has high memory consumption and inter-chip communication overhead</p>
</li>
<li><p>BitNet b1.58 can handle them</p>
<ul>
<li><p>reduced memory footprint reduces the number of devices required to deploy MoE models</p>
</li>
<li><p>there would be no overhead if the entire models could be placed on a single chip</p>
</li>
</ul>
</li>
</ul>
<h4 id="native-support-of-long-sequence-in-llms">Native Support of Long Sequence in LLMs</h4>
<ul>
<li><p>the main issue in handling long sequences is the memory consumption introduced by the KV caches</p>
</li>
<li><p>BitNet b1.58 reduces activations from 16-bits to 8-bits</p>
<ul>
<li><p>doubling the sequence length</p>
</li>
<li><p>if reducing to lower than 4 bits is possible, the length would be longer</p>
</li>
</ul>
</li>
</ul>
<h4 id="llms-on-edge-and-mobile">LLMs on Edge and Mobile</h4>
<ul>
<li><p>for Edge and Mobile device, BitNet b1.58 can resolve the issue of memory and computational power</p>
</li>
<li><p>BitNet is more friendly to CPU devices</p>
</li>
</ul>
<h4 id="new-hardware-for-1-bit-llms">New Hardware for 1-bit LLMs</h4>
<ul>
<li>Groq demonstrated promising results and great potential for specific LLMs (LPU)</li>
<li>expect new hardware for 1-bit LLM</li>
</ul>
<h1 id="5-comment">5. Comment</h1>
<p>A comment rewritten after losing it twice. If the original BitNet showed the potential of 1-bit models, this paper feels like a more polished follow-up. It brings on-device inference and ternary semiconductors to mind. It would have been nice if the authors had explained why 0 was not included from the start, and what effect the choice of quantization ranges has.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[CoT Reasoning without Prompting]]></title>
            <link>https://velog.io/@0404_not_found/CoT-Reasoning-without-Prompting</link>
            <guid>https://velog.io/@0404_not_found/CoT-Reasoning-without-Prompting</guid>
            <pubDate>Fri, 23 Feb 2024 12:20:30 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>LLMs&#39; reasoning capabilities are elicited by prompting techniques</p>
<ul>
<li><p>Few shot prompting with intermediate steps augmented demonstration exemplars</p>
</li>
<li><p>Zero shot prompting with specific instructions to show intermediate steps</p>
</li>
</ul>
</li>
<li><p>Can LLMs reason effectively without prompting?</p>
<ul>
<li><p>there exists a task-agnostic way to elicit CoT reasoning by altering the <strong>decoding procedure</strong>
<img src="https://velog.velcdn.com/images/0404_not_found/post/5dddb2a7-893a-4578-80f9-aa451adf4b2b/image.png" alt=""></p>
</li>
<li><p>the LLM generates a wrong answer via standard greedy decoding, but inspecting the alternative top-k tokens unveils inherent CoT paths</p>
<ul>
<li><p>Use standard QA format</p>
</li>
<li><p>LLMs struggle with reasoning when relying solely on greedily decoded paths</p>
</li>
<li><p>CoT reasoning patterns <strong>emerge naturally</strong> within the alternative paths among the top-k tokens</p>
</li>
<li><p>when CoT path is present, the model demonstrates increased confidence in the final answer</p>
</li>
<li><p><strong>CoT-decoding</strong> : a method to sift through the top-k paths by isolating the most reliable paths</p>
</li>
</ul>
</li>
<li><p>CoT decoding elicits reasoning capabilities without explicit prompting</p>
<ul>
<li><p>enhances the model&#39;s reasoning capabilities</p>
</li>
<li><p>paths are more prevalent in tasks frequently represented in the pre-training data and less so in complex, synthetic tasks $\rightarrow$ prompting is still needed there</p>
</li>
</ul>
</li>
</ul>
</li>
</ul>
<ul>
<li><p>Summarized Contributions</p>
<ul>
<li>LLMs inherently possess reasoning capabilities</li>
<li>they generate CoT reasoning when examining alternative top tokens</li>
<li>a mere change in decoding strategy effectively elicits model reasoning</li>
<li>LLM&#39;s confidence in its final answers increases when CoT is in its decoding path</li>
<li>CoT decoding to select more reliable decoding paths</li>
</ul>
</li>
</ul>
<h1 id="2-cot-decoding">2. CoT Decoding</h1>
<h2 id="21-the-presence-of-cot-paths-during-decoding">2.1 The presence of CoT Paths during Decoding</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/c574c59a-04bf-4673-a30c-5a29c3fb1945/image.png" alt=""></p>
<ul>
<li><p>$k$ represents the choice of the $k$-th token at the first decoding step</p>
</li>
<li><p>PaLM-2 Large model example</p>
</li>
<li><p>The greedy decoding often doesn&#39;t contain CoT </p>
<ul>
<li><p>model&#39;s skewed perception of problem difficulty</p>
</li>
<li><p>pretrained on simpler questions</p>
</li>
</ul>
</li>
<li><p>direct answer prompts generally result in low accuracy</p>
</li>
</ul>
<h2 id="22-cot-decoding-for-extracting-cot-paths">2.2 CoT-Decoding for Extracting CoT Paths</h2>
<ul>
<li><p>Extracting CoT paths from the top-$k$ decoded paths is an issue</p>
<ul>
<li><p>CoT Paths don&#39;t consistently outrank non-CoT in the model&#39;s probability assessment</p>
</li>
<li><p>they often don&#39;t represent the predominant answer among all paths $\rightarrow$ Self-consistency is not applicable</p>
</li>
</ul>
</li>
<li><p>the presence of CoT path typically leads to a more confident decoding of the final answer</p>
<ul>
<li><p>characterized by a probability disparity between the top and secondary tokens</p>
</li>
<li><p>$\Delta_{k, \text{answer} } = {1 \over n} \sum_{x_t \in \text{answer}} p(x_t^1 \ | \ x_{&lt;t}) - p(x_t^2 \ | \ x_{&lt;t})$</p>
</li>
<li><p>$x_t^1$, $x_t^2$ means the top two tokens at each decoding step $t$ in the $k$-th decoding path chosen for their maximum post-softmax probabilities from the vocab</p>
</li>
<li><p>Overall confidence in decoding the final answer is approximated by averaging the probability differences for all relevant $x_t$ tokens</p>
<ul>
<li>For the GSM8K question in Table 1, average the probability differences for &#39;6&#39; and &#39;0&#39;</li>
</ul>
</li>
<li><p>This is called CoT-decoding and aims to extract CoT paths (a sketch of the $\Delta$ computation follows this list)</p>
</li>
<li><p>CoT path shows high $\Delta$ value</p>
</li>
</ul>
</li>
<li><p>Additional heuristic about the length of the answer</p>
<ul>
<li><p>longer decoding paths more likely contain CoT</p>
</li>
<li><p>general applicability is limited</p>
</li>
<li><p>Normalizing the probability score by length $\rightarrow$ introduces a length bias (when the decoding paths are of similar lengths, its effectiveness diminishes)</p>
</li>
</ul>
</li>
</ul>
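<p>A small sketch of the $\Delta$ computation above; the per-step top-2 probabilities are assumed to be collected while decoding the $k$-th path:</p>
<pre><code>def answer_confidence(step_top2_probs):
    """Delta_{k, answer}: average gap between the top-1 and top-2 token
    probabilities over the answer tokens of one decoding path.
    step_top2_probs is a list of (p_top1, p_top2) pairs, one per answer token."""
    return sum(p1 - p2 for p1, p2 in step_top2_probs) / len(step_top2_probs)

# e.g. for the answer "60" in Table 1: average the gaps at the tokens "6" and "0"
delta = answer_confidence([(0.90, 0.05), (0.95, 0.02)])   # 0.89
</code></pre>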
<h4 id="identifying-the-answer-spans">Identifying the answer spans</h4>
<ul>
<li><p>for math tasks, one can extract the last numerical value</p>
<ul>
<li>less precise when there are distractive numbers/options and open-ended responses</li>
</ul>
</li>
<li><p>extending the model&#39;s output with the prompt &quot;So the answer is&quot;    </p>
<ul>
<li><p>only token ids are needed</p>
</li>
<li><p>suitable for encompassing mathematical and natural language reasoning</p>
</li>
<li><p>crucial to calculate $\Delta$ over the answer spans from the original decoding path, not those following &quot;So the answer is&quot;</p>
</li>
</ul>
</li>
<li><p>When answer is more open-ended, modify the $\Delta$ calculation</p>
<ul>
<li><p>If the options are defined, aggregating the probability mass over &quot;yes&quot; and compute the probability differences between the aggregated mass on &quot;yes&quot; and &quot;no&quot;</p>
</li>
<li><p>addressing this limitation is left for further research</p>
</li>
</ul>
</li>
</ul>
<h4 id="branching-at-other-decoding-steps">Branching at other decoding steps</h4>
<ul>
<li><p>Is branching viable at later decoding stages?</p>
</li>
<li><p>Early branching significantly enhances the diversity of potential paths</p>
</li>
<li><p>The optimal branching point may vary with the task</p>
<ul>
<li>for year parity task, mid-path branching can effectively yield correct CoT paths</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/97d695f4-a161-4c73-a4be-bd5941a2b86d/image.png" alt=""></p>
<h4 id="aggregation-of-the-decoding-paths">Aggregation of the decoding paths</h4>
<ul>
<li><p>Aggregate the answers over all those paths like self-consistency without CoT prompting</p>
<ul>
<li>to mitigate sensitivity to small differences in the model&#39;s logits particularly when relying solely on the path with the maximum $\Delta$</li>
</ul>
</li>
<li><p>Majority answer may not be correct</p>
</li>
<li><p>weighted aggregation method</p>
<ul>
<li><p>take the answer that maximizes $\tilde{\Delta}_a = \sum_k \Delta_{k, a}$</p>
</li>
<li><p>$\Delta_{k, a}$ means the $k$-th decoding path whose answer is $a$</p>
</li>
<li><p>this enhances the stability of the results</p>
</li>
</ul>
</li>
</ul>
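<p>The weighted aggregation above is a one-liner over the per-path (answer, $\Delta$) pairs:</p>
<pre><code>from collections import defaultdict

def aggregate_answers(paths):
    """Pick the answer a maximizing sum_k Delta_{k, a} over the top-k paths."""
    score = defaultdict(float)
    for answer, delta in paths:
        score[answer] += delta
    return max(score, key=score.get)

best = aggregate_answers([("60", 0.89), ("5", 0.31), ("60", 0.75)])   # "60"
</code></pre>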
<h4 id="sampling-under-the-standard-qa-format">Sampling under the standard QA format</h4>
<ul>
<li><p>Can sampling achieve a similar effect and unveil the CoT reasoning paths?</p>
</li>
<li><p>although sampling works well under few-shot CoT prompting, it doesn&#39;t exhibit the desired behaviour when the model is queried with the standard QA format</p>
</li>
<li><p>less than 30% of the sampled responses contain a correct CoT path</p>
</li>
<li><p>the model tends to provide a direct answer, since the first token is sampled from the model&#39;s probability distribution, which reflects its tendency to answer directly</p>
</li>
<li><p>the rest of the tokens lead to incorrect final answers</p>
</li>
</ul>
<h1 id="3-experiments">3. Experiments</h1>
<ul>
<li>Used standard QA format (Q: (question)\nA:)</li>
<li>$k=10$ as default</li>
<li>PaLM-2 with different scales</li>
<li>Mistral-7B</li>
<li>last numerical value or the available options for Mistral</li>
<li>extend the output with &quot;So the answer is&quot; for PaLM-2</li>
</ul>
<h2 id="31-mathematical-reasoning-tasks">3.1 Mathematical Reasoning Tasks</h2>
<ul>
<li>GSM8K (grade-school math problems)</li>
<li>MultiArith (multi-step arithmetic dataset)</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e02d95a8-e849-4606-b075-71e6c1a00ab1/image.png" alt=""></p>
<ul>
<li><p>CoT Decoding significantly enhances models&#39; reasoning ability</p>
</li>
<li><p>CoT Decoding partially closes the gap between the pre-trained model and instruction-tuned model</p>
</li>
<li><p>Instruction Tuning with sufficient CoT data can also be partially achieved by CoT Decoding</p>
</li>
<li><p>As instruction-tuning data contains CoT annotations, the model is expected to inherently generate CoT paths</p>
</li>
<li><p>Even after instruction-tuning, the model occasionally attempts to directly address a question</p>
</li>
</ul>
<h4 id="scaling-results-and-choice-of-k">Scaling results and choice of $k$</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1ead9dae-ac90-4e45-adc2-81737586d666/image.png" alt=""></p>
<ul>
<li><p>higher $k$ typically result in improved model performance</p>
<ul>
<li>correct CoT paths are often ranked lower</li>
</ul>
</li>
<li><p>for IT models, the effect of $k$ is not significant</p>
<ul>
<li>instruction-tuning brings forth the majority of CoT-paths to the first few paths</li>
</ul>
</li>
</ul>
<h2 id="32-natural-language-reasoning-tasks">3.2 Natural Language Reasoning Tasks</h2>
<ul>
<li><p>year parity task : Was (person) born in an even or odd year?</p>
</li>
<li><p>Even SoTA models like GPT-4 achieves at-chance accuracy (~50%) when prompted directly</p>
<ul>
<li><p>SoTA LLMs can perfectly retrieve the birth year or judge the parity when given the correct year</p>
</li>
<li><p>the limitation lies in the model&#39;s ability in knowledge manipulation</p>
</li>
</ul>
</li>
<li><p>100 celeb names and their birth years</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/0e2ba919-5917-48a0-8297-4b187060b9d1/image.png" alt=""></p>
<ul>
<li><p>When the model is small, it becomes incapable of determining the parity even when given the correct year</p>
<ul>
<li>the performance doesn&#39;t vary significantly for model sizes below &quot;Small&quot; size</li>
</ul>
</li>
</ul>
<h2 id="33-symbolic-reasoning-tasks">3.3 Symbolic Reasoning Tasks</h2>
<ul>
<li><p>Coin Flip with 2, 3, 4 rounds of potential flip</p>
</li>
<li><p>two tasks from Big-Bench-Hard</p>
</li>
<li><p>Web of lies with 3, 4, 5 truth/lie statements</p>
</li>
<li><p>Multi-step arithmetic with various depths and lengths (generated)</p>
</li>
<li><p>existing dataset from (Suzgun et al., 2022)</p>
</li>
<li><p>Sports understanding and Object Counting from Big-Bench</p>
</li>
</ul>
<h4 id="the-presense-of-correct-cot-paths-depends-on-the-taks-prominence-in-the-pre-training-distribution">The presense of correct CoT paths depends on the taks prominence in the pre-training distribution</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/8e43cc2c-9f73-4cca-b292-cb8c57ae6f6e/image.png" alt=""></p>
<ul>
<li><p>The gain from CoT-decoding decreases as task complexity increases</p>
</li>
<li><p>When the task is highly synthetic, the model cannot generate correct CoT paths</p>
<ul>
<li><p>tasks that lack significant representation in the pre-training distribution</p>
</li>
<li><p>tasks that require accurate state tracking (Coin-Flip and Web-of-Lies) $\rightarrow$ the model easily loses track of the states as the task becomes more complex</p>
</li>
<li><p>Multi-step Arithmetic and Object counting</p>
</li>
<li><p>CoT prompting based techniques can &#39;teach&#39; how to solve tasks like above</p>
</li>
</ul>
</li>
</ul>
<h4 id="compared-to-cot-prompting">Compared to CoT Prompting</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/f3e4ce8f-6774-4aba-9bfd-43da2719a299/image.png" alt=""></p>
<ul>
<li><p>the aggregated path approach significantly improves accuracy compared to taking only the maximum-$\Delta$ path</p>
</li>
<li><p>the aggregated path results in a similar performance to few-shot CoT</p>
<ul>
<li>model possesses intrinsic abilities in solving this task effectively</li>
</ul>
</li>
<li><p>CoT prompting effectively promotes the intrinsic CoT path to the top-1 path</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/24ea1a1f-9697-43ab-b07f-678801c61607/image.png" alt=""></p>
<ul>
<li><p>CoT Decoding exhibits a more &#39;free-form&#39; generation in comparison to alternative CoT prompting</p>
<ul>
<li><p>encourage the diversity at the initial decoding step</p>
</li>
<li><p>absence of explicit constraints imposed by prompt</p>
</li>
</ul>
</li>
<li><p>CoT-decoding can reveal the LLM&#39;s intrinsic strategy for solving a problem, without being influenced by prompts</p>
<ul>
<li><p>Few shot CoT follows the standard method of solving this task (profession - evaluation)</p>
</li>
<li><p>it is influenced by the few-shot prompt</p>
</li>
</ul>
</li>
</ul>
<h2 id="34-results-across-model-families">3.4 Results across Model Families</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/968743c6-67c5-45e3-a4d3-652e0f86633e/image.png" alt=""></p>
<ul>
<li>CoT path emerges too</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/32a9e6fe-355e-49e8-a5e7-77050e59b3d1/image.png" alt=""></p>
<ul>
<li>consistent improvements across model families</li>
</ul>
<h1 id="4-conclusion">4. Conclusion</h1>
<ul>
<li><p>inherent capabilities of LLMs in generating CoT paths</p>
</li>
<li><p>exploring alternative top-k tokens reveals the natural existence of reasoning paths</p>
</li>
<li><p>presence of a CoT path correlates with increased model confidence in decoding its final answer</p>
</li>
<li><p>additional computational costs</p>
<ul>
<li>future work may leverage the CoT paths to fine-tune the model</li>
</ul>
</li>
<li><p>focused on branching at the first token</p>
<ul>
<li><p>one can explore branching at any token and find best possible paths</p>
</li>
<li><p>how to reliably identify the best token during the search</p>
</li>
</ul>
</li>
</ul>
<h1 id="5-comment">5. Comment</h1>
<p>A very inventive idea: rather than the single highest-probability token, there may be a path among the remaining top-k tokens where CoT emerges naturally and leads to the correct answer. Setting the computational cost aside, this felt like one of the most original ideas recently. A paper that makes you reconsider whether greedy decoding is always the right choice.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Self-Discover: LLMs Self-Compose Reasoning Structure]]></title>
            <link>https://velog.io/@0404_not_found/Self-Discover-LLMs-Self-Compose-Reasoning-Structure</link>
            <guid>https://velog.io/@0404_not_found/Self-Discover-LLMs-Self-Compose-Reasoning-Structure</guid>
            <pubDate>Fri, 16 Feb 2024 12:48:54 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>To enhance LLMs&#39; capability to reason and solve complex problems via prompting</p>
<ul>
<li><p>Few-shot &amp; Zero-shot CoT $\rightarrow$ how humans solve problems step-by-step</p>
</li>
<li><p>decomposition-based prompting $\rightarrow$ how humans break down problems into subproblems</p>
</li>
<li><p>step-back prompting $\rightarrow$ how humans reflect on the nature of a task to derive general principles</p>
</li>
</ul>
</li>
<li><p>Each method serves as an atomic reasoning module making <strong>an implicit prior assumption of the process</strong> on how to tackle a given task</p>
</li>
<li><p>Instead, each task has a unique intrinsic structure underlying the reasoning process involved in solving it efficiently</p>
<ul>
<li>Least-to-Most Prompting is more effective than CoT at <strong>symbolic manipulation and compositional generalization</strong> due to the decomposition structure of the tasks</li>
</ul>
</li>
<li><p><strong>Self-Discover</strong> $\rightarrow$ how humans devise a reasoning program for problem-solving</p>
<ul>
<li><p>Aims to discover the underlying reasoning structure of each task
<img src="https://velog.velcdn.com/images/0404_not_found/post/58d9a151-561b-4242-b95f-3ec88e4ee358/image.png" alt=""></p>
</li>
<li><p>It composes a coherent reasoning structure intrinsic to the task (Stage 1)</p>
<ul>
<li>Operates at Task Level</li>
<li>uses three actions to guide LLM to generate a reasoning structure for the task</li>
</ul>
</li>
<li><p>Solves instances of the task using the discovered structure (Stage 2)</p>
<ul>
<li>LLM simply follows the self-discovered structure to get the final answer</li>
</ul>
</li>
</ul>
</li>
<li><p>Self-Discover helps to use multiple atomic reasoning modules like CoT</p>
</li>
<li><p>It only needs 3 more inference steps on the task-level (more performant than inference-heavy ensemble approaches like self-consistency)</p>
</li>
<li><p>It conveys LLMs&#39; insights about the task in a more interpretable way</p>
</li>
<li><p>Tested on 25 challenging reasoning tasks, it outperformed the baselines on 21/25</p>
<p>  <img src="https://velog.velcdn.com/images/0404_not_found/post/c1c8d352-f396-4a2a-9799-02359993a80f/image.png" alt=""></p>
<ul>
<li>It achieves superior performance against inference-heavy methods like CoT + Self-Consistency and majority voting</li>
</ul>
</li>
<li><p>Compared Self-Discover with prompts optimized using a training set</p>
<ul>
<li>Performed on par with or better than OPRO</li>
</ul>
</li>
<li><p>Analyzed its effectiveness by breaking down BBH task into 4 categories</p>
<ul>
<li>Self-Discover worked best on tasks requiring world knowledge</li>
<li>it has a moderate performance boost on algorithmic tasks compared to CoT</li>
</ul>
</li>
<li><p>Error analysis on MATH</p>
<ul>
<li>the majority of failures come from computation errors</li>
<li>showed the universality of the reasoning structures by transferring them from PaLM 2 to GPT-4 and from GPT-4 to Llama-2-70B</li>
</ul>
</li>
</ul>
<h1 id="2-self-discovering-reasoning-structures-for-problem-solving">2. Self-Discovering Reasoning Structures for Problem-Solving</h1>
<ul>
<li>How humans use prior knowledge and skills to devise a reasoning program</li>
</ul>
<pre><code>- Search internally for what knowledge and skills might be helpful to solve it

- Attempt to apply relevant knowledge and skills to the task

- Finally, connect multiple skills and pieces of knowledge</code></pre><ul>
<li><p>Given a <strong>task</strong> and a <strong>set of reasoning module descriptions representing high-level problem-solving heuristics</strong> (&quot;Use critical thinking&quot;, &quot;Let&#39;s think step by step&quot;), Stage 1 aims to uncover the intrinsic reasoning structure via meta-reasoning</p>
<ul>
<li><p>Three meta-prompt to guide LLM to select, adapt, implement an actionable reasoning structure without labels or training</p>
</li>
<li><p>Formatted the structure in <strong>key-value pairs</strong> like JSON due to interpretability and performance</p>
</li>
<li><p><img src="https://velog.velcdn.com/images/0404_not_found/post/27f0fab3-fdaf-4631-958b-0ea5f6cf29bb/image.png" alt=""></p>
</li>
<li><p>this operates at the <strong>Task Level</strong>, so this stage is only needed once per task</p>
</li>
<li><p>Use discovered reasoning structure to solve every instance of task</p>
<ul>
<li>Follow the step-by-step reasoning plan in JSON to correctly solve the task. Fill in the values following the keys by reasoning specifically about the task given. Do not simply rephrase the keys</li>
</ul>
</li>
</ul>
</li>
</ul>
<h2 id="21-stage-1--self-discover-task-specific-structures">2.1 Stage 1 : Self-Discover Task-Specific Structures</h2>
<h4 id="select">SELECT</h4>
<ul>
<li><p>Not every reasoning module is helpful</p>
</li>
<li><p>Guide LLM to select module based on task example</p>
</li>
<li><p>given the raw set of reasoning modules $D$ and a few task examples without labels $t_i \in T$, Self-Discover selects a subset of reasoning modules $D_S$ using a model $\mathcal{M}$ and a meta-prompt $p_S$</p>
<ul>
<li>$D_S = \mathcal{M}(p_S \ || \ D \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h4 id="adapt">ADAPT</h4>
<ul>
<li><p>Each reasoning module provides a general description of how to solve problems</p>
</li>
<li><p>Self-Discover aims to tailor each module</p>
<ul>
<li>&quot;break the problem into subproblems&quot; $\rightarrow$ &quot;calculate each arithmetic operation in order&quot; for arithmetic problems</li>
</ul>
</li>
<li><p>given $D_S$ and meta-prompt $p_A$, the model generates the adapted reasoning module descriptions $D_A$</p>
<ul>
<li>$D_A = \mathcal{M}(p_A \ || \ D_S \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h4 id="implement">IMPLEMENT</h4>
<ul>
<li><p>Given the adapted reasoning module descriptions $D_A$, it turns the reasoning modules into an implemented reasoning structure $D_I$ with specific instructions on what to generate for each step</p>
</li>
<li><p>Provide a human-written reasoning structure $S_{\text{human}}$ on another task in addition to meta prompt to better convert the natural language descriptions into a reasoning structure</p>
<ul>
<li>$D_I = \mathcal{M}(p_I \ || \ S_{\text{human}} \ || \ D_A \ || \ t_i)$</li>
</ul>
</li>
</ul>
<h2 id="22-stage-2--tackle-tasks-using-discovered-structures">2.2 Stage 2 : Tackle Tasks Using Discovered Structures</h2>
<ul>
<li>After those stages, use $D_I$, which is uniquely adapted to the task, to solve every instance of $T$ (a hypothetical sketch of the two-stage pipeline follows the figure below)</li>
<li>$A = \mathcal{M}(D_I \ || \ t), \quad \forall t \in T$</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/bc082359-7e12-4b4b-94ca-f800e14a0e07/image.png" alt=""></p>
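<p>A hypothetical sketch of the two-stage pipeline; <code>call_llm</code>, the meta-prompts, and the concatenation format are assumptions standing in for the paper&#39;s actual prompts:</p>
<pre><code>def call_llm(prompt):
    raise NotImplementedError("plug in an LLM API here")

def self_discover(task_examples, seed_modules, p_select, p_adapt, p_implement, s_human):
    # Stage 1 (task level, run once per task)
    d_s = call_llm("\n".join([p_select, seed_modules, task_examples]))      # SELECT
    d_a = call_llm("\n".join([p_adapt, d_s, task_examples]))                # ADAPT
    d_i = call_llm("\n".join([p_implement, s_human, d_a, task_examples]))   # IMPLEMENT
    return d_i                                  # key-value (JSON-like) reasoning structure

def solve(d_i, instance):
    # Stage 2 (instance level): follow the discovered structure to answer
    return call_llm("\n".join([d_i, instance]))
</code></pre>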
<h1 id="3-experiment-setup">3. Experiment Setup</h1>
<h4 id="tasks">Tasks</h4>
<ul>
<li><p>Used diverse reasoning benchmarks challenging for LLMs</p>
<ul>
<li><p>Big-Bench Hard (23 challenging tasks from Big-Bench)</p>
<ul>
<li>Algorithmic and Multi-Step Arithmetic Reasoning</li>
<li>NLU</li>
<li>Use of World Knowledge</li>
<li>Multilingual Knowledge and Reasoning</li>
</ul>
</li>
<li><p>Thinking for Doing (T4D)</p>
<ul>
<li>models must leverage mental state reasoning to determine actions to perform (GPT-4 + CoT reached only 50%)</li>
</ul>
</li>
<li><p>MATH test set (200 samples)</p>
</li>
</ul>
</li>
</ul>
<h4 id="models">Models</h4>
<ul>
<li>GPT-4 (gpt-4-turbo)</li>
<li>GPT-3.5 (chatGPT, gpt-3.5-turbo)</li>
<li>instruction tuned PaLM2-L</li>
<li>Llama2-70B</li>
</ul>
<h4 id="baselines">Baselines</h4>
<ul>
<li><p>Zero-shot prompting</p>
<ul>
<li>Direct Prompting</li>
<li>CoT</li>
<li>Plan-and-Solve (first generate a plan, then solve the problem)</li>
</ul>
</li>
<li><p>use the raw seed reasoning modules passed to Self-Discover</p>
<ul>
<li>CoT-Self-Consistency (sample multiple outputs with CoT and aggregate answer)</li>
<li>Majority voting of each RM</li>
<li>Best of each RM (uses the highest accuracy from each RM)</li>
</ul>
</li>
<li><p>To test the universality of the reasoning structure, compare with prompt optimization that requires a training set (OPRO)</p>
<ul>
<li>showing that when structures or prompts optimized on one model are applied to another, the reasoning structure retains more of the performance</li>
</ul>
</li>
</ul>
<h1 id="4-results">4. Results</h1>
<h2 id="41-does-self-discover-improve-llm-reasoning">4.1 Does Self-Discover Improve LLM Reasoning?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/a95f4c0c-6887-4b84-8d34-9235308f51a9/image.png" alt=""></p>
<ul>
<li>For MATH, upon error analysis, the reasoning structures generated by PaLM 2-L with Self-Discover are correct 87.5% of the time (a human can follow the structures to solve the tasks perfectly)</li>
</ul>
<h2 id="42-which-types-of-problems-do-self-discover-help-the-most">4.2 Which Types of Problems Do Self-Discover Help the Most?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/366baa2a-44c8-4b8b-b0a9-e8e8e80e53e5/image.png" alt=""></p>
<ul>
<li><p>Self Discover improved the performance of World Knowledge task the most (sports understanding, movie recommendation, ruin names)</p>
</li>
<li><p>Using CoT misses the key knowledge</p>
</li>
<li><p>The algorithmic category&#39;s gain is moderate, which is consistent with the MATH result from 4.1</p>
</li>
</ul>
<h2 id="43-how-efficient-is-self-discover">4.3 How Efficient is Self-Discover?</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ae51013a-4a6c-4d2e-bbd9-9855a4899b7b/image.png" alt=""></p>
<ul>
<li>Self-Discover achieves the best performance while requiring 10-40x fewer inference calls compared to Self-Consistency and majority voting</li>
</ul>
<h2 id="44-qualitative-examples">4.4 Qualitative Examples</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/5bfa8594-b79f-4ab4-845c-dfd7964ee269/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/bbf108d8-28e2-4eaf-aad9-1279f8c630b9/image.png" alt=""></p>
<h1 id="5-deep-diving-into-self-discovered-reasoning-structures">5. Deep Diving Into Self-Discovered Reasoning Structures</h1>
<ul>
<li><strong>All actions of Self-Discover are needed</strong></li>
</ul>
<h2 id="51-importance-of-self-discover-actions">5.1 Importance of Self-Discover Actions</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1eae3e46-6866-42f2-a19f-5adde67c1c91/image.png" alt=""></p>
<ul>
<li>S for only SELECT / SA for SELECT and ADAPT</li>
<li>With each added step, the model&#39;s zero-shot reasoning capability improved $\rightarrow$ all three steps are beneficial</li>
</ul>
<h2 id="52-towards-universality-of-discovered-reasoning-structure">5.2 Towards Universality of Discovered Reasoning Structure</h2>
<h4 id="applying-palm-2-l-discovered-structures-to-gpt-4">Applying PaLM 2-L Discovered Structures to GPT-4</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/543008d7-0c28-4bef-9e58-6cbfbbb6602f/image.png" alt=""></p>
<h4 id="applying-gpt-4-discovered-structures-to-llama2-and-chatgpt">Applying GPT-4 Discovered Structures to Llama2 and ChatGPT</h4>
<ul>
<li>Llama2 + Self-Discover (52%) &gt; CoT (42%) on zero-shot disambiguation QA</li>
<li>GPT-3.5 (56%) &gt; CoT (51%) on geometry with 3-shot</li>
</ul>
<h1 id="6-related-work">6. Related Work</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/13661cc1-0066-4509-8bfc-d01c1286e1a4/image.png" alt=""></p>
<h4 id="opro-framework-llms-as-optimizers-yang-et-al-2023">OPRO Framework (LLMs as optimizers, Yang et al., 2023)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/afc45db7-cfe2-4439-858e-28d1c6f5dfab/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/8131d37a-8818-404e-a338-52d5411d72b8/image.png" alt=""></p>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>Self-Discover a reasoning structure for any task</li>
<li>Drastic improvements on challenging tasks</li>
<li>the composed reasoning structure is transferable</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<p>A pipeline that leaves even the structure of how to solve the problem to the LLM&#39;s own judgment. The fact that the gains show up on large models is the impressive part.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Repeat After Me:
Transformers are Better than State Space Models at Copying]]></title>
            <link>https://velog.io/@0404_not_found/Repeat-After-MeTransformers-are-Better-than-State-Space-Models-at-Copying</link>
            <guid>https://velog.io/@0404_not_found/Repeat-After-MeTransformers-are-Better-than-State-Space-Models-at-Copying</guid>
            <pubDate>Wed, 07 Feb 2024 08:20:09 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4acb6292-9ba5-44da-a30c-be920647175c/image.png" alt="">
..?</p>
<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Transformers require $\Omega(L)$ memory and compute to predict the next token of a sequence of length $L$ (using Flash Attention!)</p>
</li>
<li><p>Attempts to make similar architectures but with $O(1)$ memory to predict each token $\rightarrow$ S4 or Mamba / RNNs / models that can be trained in parallel like linear attention / parallel RNNs</p>
<ul>
<li>Collectively refer to these models as <strong>GSSMs (Generalized State Space Models)</strong> </li>
</ul>
</li>
<li><p>Recent work shows GSSMs&#39; strong performance, but it is not clear what these models sacrifice for efficiency</p>
<ul>
<li>One particular capability that is sacrificed is the ability to <strong>retrieve and repeat parts of the input context</strong></li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ba729e11-c72b-4032-ba20-2807f35c65b2/image.png" alt=""></p>
<ul>
<li><p>Theoretical analysis of the copying task</p>
<ul>
<li>Transformer can copy strings of length that is exponential in the number of heads of the transformer</li>
<li>Transformer implements a &#39;storage&#39; mechanism and retrieval of sequences of n-grams</li>
<li>GSSMs cannot accurately copy strings with more bits than the size of the latent state</li>
</ul>
</li>
<li><p>In practice, large GSSM may have enough capacity to represent the entire input in the latent state</p>
<ul>
<li>Transformers are both much more efficient at learning to copy and to generalize better to longer inputs</li>
<li>Copy algorithms learned by Transformers use n-grams to determine where to copy from</li>
</ul>
</li>
</ul>
<h1 id="2-theory-representational-capaciy">2. Theory: Representational Capaciy</h1>
<h2 id="21-setting">2.1 Setting</h2>
<ul>
<li><p>dictionary $\mathbb{D}$ which contains $D$ alphabet tokens</p>
</li>
<li><p>seq2seq model $H : \mathbb{D}^* \rightarrow \mathbb{D}^*$</p>
<ul>
<li>input $x_1, x_2, ... x_i$ as the prompt</li>
<li>$H(x_1, x_2, ... x_i)$ as the generated &#39;answer&#39;</li>
</ul>
</li>
<li><p>sequence to token model $h : \mathbb{D}^* \rightarrow \mathbb{D}$</p>
<ul>
<li>it naturally defines $H$ by autoregressive inference</li>
<li>for every input sequence $x_1, ... ,x_i \in \mathbb{D}$, define $x_{i+j} = h(x_1, ... ,x_{i+j-1})$ recursively and let $H(x_1, ... ,x_i) = (x_{i+1}, x_{i+2}, ... )$</li>
</ul>
</li>
</ul>
<h4 id="gssm">GSSM</h4>
<ul>
<li><p>Finite set $\mathcal{S}$ is a state space</p>
</li>
<li><p>the number of bits required to encode the states of $\mathcal{S}$ as $\text{mem}(\mathcal{S}) = \log(|\mathcal{S}|)$</p>
</li>
<li><p>GSSM is a sequence model defined by an update rule $u : \mathcal{S} \times \mathbb{D} \rightarrow \mathcal{S}$ and some output function $r : \mathcal{S} \rightarrow \mathbb{D}$</p>
<ul>
<li>Let $s_o \in \mathcal{S}$ be some initial state</li>
<li>Given sequence $x_1, ..., x_L$, the state of model at iteration $i$ is denoted by $S_i(x_1, ..., x_i)$</li>
<li>the output token is denoted by $R_i(x_1, ..., x_i)$</li>
<li>The recursive process is
$$
\begin{aligned} 
&amp;1)\quad S_0(\emptyset) = s_0 \\
&amp;2) \quad S_i(x_1, ... ,x_i) = u(S_{i-1}(x_1, ..., x_{i-1}), x_i) \\
&amp;3) \quad R_i(x_1, ..., x_i) = r(S_i(x_1, ..., x_i))
\end{aligned}
$$</li>
</ul>
</li>
<li><p>Note that for any sequence model, there are two types of memory considerations</p>
<ul>
<li>Input-Independent Memory - parameters</li>
<li>Input-Dependent Memory - activations</li>
</ul>
</li>
<li><p>The GSSM definition constrains the input-dependent memory $\text{mem}(\mathcal{S})$</p>
</li>
<li><p>It doesn&#39;t restrict in any way the amount of input-independent memory or the runtime of state updates</p>
</li>
<li><p>Leaving all other considerations unconstrained shows the lower bound on the state space memory</p>
</li>
</ul>
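<p>The GSSM recursion above in a few lines; <code>u</code>, <code>r</code>, and <code>s0</code> are stand-ins for a concrete model:</p>
<pre><code>def run_gssm(x, u, r, s0):
    """s_i = u(s_{i-1}, x_i), output R_i = r(s_i); the only input-dependent
    memory is the current state s, which is the quantity mem(S) bounds."""
    s, outputs = s0, []
    for token in x:
        s = u(s, token)
        outputs.append(r(s))
    return outputs

# Toy instance whose state is just the last token seen (so mem(S) = log D)
out = run_gssm("abcde", u=lambda s, t: t, r=lambda s: s, s0=None)
</code></pre>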
<h4 id="transformers">Transformers</h4>
<ul>
<li><p>input length $L$</p>
</li>
<li><p>dimension $d$</p>
</li>
<li><p>input tokens $\boldsymbol{x}_1, ..., \boldsymbol{x}_L \in \mathbb{R}^d$</p>
</li>
<li><p>an attention head is parametrized as $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$</p>
</li>
<li><p>$\boldsymbol{k}_i = W_k \boldsymbol{x}_i, \quad \boldsymbol{q}_i = W_q \boldsymbol{x}_i, \quad \boldsymbol{v}_i = W_v \boldsymbol{x}_i$</p>
</li>
<li><p>$K_i = [\boldsymbol{k}_1, ..., \boldsymbol{k}_i] \in \mathbb{R}^{d \times i}, \quad V_i = [\boldsymbol{v}_1, ..., \boldsymbol{v}_i] \in \mathbb{R}^{d \times i}$</p>
</li>
<li><p>the output of the head at token $i$ is $\boldsymbol{o}_i = V_i \ \cdot \ \text{softmax}(K_i \cdot \boldsymbol{q}_i) \in \mathbb{R}^d$</p>
</li>
<li><p>with $l$ attention heads, the full dimension should be $dl$</p>
</li>
<li><p>embedding $\Psi : \mathbb{D} \rightarrow \mathbb{R}^d$</p>
</li>
<li><p>MLP $f : \mathbb{R}^{dl} \rightarrow \mathbb{R}^{dl} \ \text{s.t.} \ f(\boldsymbol{x}) = U_1 \sigma (U_2 \boldsymbol{x})$</p>
</li>
<li><p>embedding and MLP are applied at the token level</p>
</li>
<li><p>Attention-block is a set of $l$ heads applied in parallel</p>
</li>
<li><p>transformer-block is an attention-block followed by an MLP on the concatenated output of $l$ heads</p>
</li>
</ul>
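<p>Below is a small numpy sketch of a single causal attention head in the notation above, i.e. $\boldsymbol{o}_i = V_i \cdot \text{softmax}(K_i^\top \boldsymbol{q}_i)$. The dimensions and random weights are illustrative assumptions, not the construction from the paper.</p>
<pre><code class="language-python">import numpy as np

def attention_head(X, W_q, W_k, W_v):
    """Single causal attention head: o_i = V_i . softmax(K_i^T q_i)."""
    Q, K, V = X @ W_q.T, X @ W_k.T, X @ W_v.T      # rows are q_i, k_i, v_i
    outputs = []
    for i in range(X.shape[0]):
        scores = K[: i + 1] @ Q[i]                  # K_i^T q_i, keys 1..i only (causal)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                    # softmax
        outputs.append(V[: i + 1].T @ weights)      # V_i . softmax(...)
    return np.stack(outputs)

d, L = 8, 5
rng = np.random.default_rng(0)
X = rng.normal(size=(L, d))
o = attention_head(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(o.shape)  # (5, 8): one output vector per position
</code></pre>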
<h4 id="the-copy-task">The Copy Task</h4>
<ul>
<li><p>Add two special tokens &lt;BOS&gt; and &lt;COPY&gt; to $\mathbb{D}$</p>
<ul>
<li>$|\mathbb{D}| = D + 2$</li>
</ul>
</li>
<li><p>A length-$L$ copy distribution $\mathcal{D}_L$ over $\mathbb{D}^{L+2}$ generates strings of the form &quot;&lt;BOS&gt;, $x_1, x_2, ..., x_L$, &lt;COPY&gt;&quot; where $\boldsymbol{x} \in (\mathbb{D} \setminus \{ \text{<BOS>}, \text{<COPY>} \})^L$</p>
</li>
<li><p>For some seq2seq model $H$, denote the error of $H$ on a copy distribution 
  $$
  \text{err}_{\mathcal{D}_L}(H) = \underset{\mathcal{D}_L}{\text{Pr}}[H_{1:L}(\text{<BOS>}, \boldsymbol{x}, \text{<COPY>}) \not= \boldsymbol{x}]
  $$</p>
</li>
</ul>
<h2 id="22-transformers-can-copy-inputs-of-exponential-length">2.2 Transformers can copy inputs of exponential length</h2>
<h4 id="construction--hash-based-copying">Construction : Hash-Based Copying</h4>
<ul>
<li><p>Hash sequences of $n$ tokens</p>
</li>
<li><p>At each step of autoregressive decoding, attend to the previous occurrence of the most recent $n$-gram and output the token that followed it (a rough sketch is given after the figures)
<img src="https://velog.velcdn.com/images/0404_not_found/post/1c8112ca-1084-42e8-be72-e03e18c34c06/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/02d02a93-c673-42d6-b5f7-69bb66a74fe1/image.png" alt=""></p>
</li>
</ul>
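<p>The hashing idea can be sketched in a few lines of Python, independently of the transformer construction itself: record the token that follows every $n$-gram, then repeatedly look up the most recent $n$-gram while generating. The dictionary-based table and the function name are assumptions for illustration only.</p>
<pre><code class="language-python">def ngram_copy(prompt, n=3):
    """Copy the prompt by looking up the most recent n-gram and emitting its successor."""
    table = {}
    for i in range(len(prompt) - n):
        table.setdefault(tuple(prompt[i : i + n]), prompt[i + n])  # first occurrence wins
    out = list(prompt[:n])                  # seed with the first n tokens
    while len(out) &lt; len(prompt):
        key = tuple(out[-n:])               # most recent n-gram
        out.append(table[key])              # correct as long as no n-gram repeats
    return out

s = list("abcdefgabx")
assert ngram_copy(s, n=3) == s
</code></pre>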
<h4 id="positional-embedding-hard-alibi">Positional Embedding: Hard-ALiBi</h4>
<ul>
<li><p>To perform the hashing described in the algorithm, it is necessary to leverage local positional information to define a hash and apply it globally on the entire input $\rightarrow$ use Hard version of ALiBi</p>
</li>
<li><p>Alibi : biases the attention scores with a penalty that is proportional to their distance ($m$ is a head-specific slope fixed before training)</p>
</li>
<li><p>add a bias $b_i$ to the $i$-th attention head </p>
<ul>
<li>$\boldsymbol{o}_i = V_i \ \cdot \ \text{softmax}(K_i^\top \boldsymbol{q}_i + b_i)$</li>
<li>$b_{i, j} = \begin{cases} - \infty \quad &amp;j \le i-m \\ 0 \quad &amp;j &gt; i-m\end{cases}$</li>
<li>Allow different heads to use different $m$, and also allow $m = \infty$ (softmax attention with no PE)
<img src="https://velog.velcdn.com/images/0404_not_found/post/8a5c5810-7324-472a-ad5d-e11dec1db367/image.png" alt=""></li>
</ul>
</li>
</ul>
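<p>A small numpy sketch of how the hard-ALiBi bias could be realized as an additive attention mask, under the masking-window reading above (the helper name and the combination with a causal mask are assumptions):</p>
<pre><code class="language-python">import numpy as np

def hard_alibi_bias(L, m):
    """Additive bias for one head: query i may only attend to keys j with j &gt; i - m.
    m = np.inf recovers plain softmax attention with no positional information."""
    i = np.arange(L)[:, None]      # query positions
    j = np.arange(L)[None, :]      # key positions
    window = np.where(j &lt;= i - m, -np.inf, 0.0)
    causal = np.triu(np.full((L, L), -np.inf), k=1)   # no attention to future tokens
    return window + causal

print(hard_alibi_bias(5, m=2))
</code></pre>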
<h4 id="guarantees">Guarantees</h4>
<ul>
<li>The copy algorithm can perfectly copy the input sequence, as long as there are no repeated $n$-gram patterns in the input</li>
<li>Then the error of the algorithm is
$$
p_{\text{n-gram}}(\mathcal{D}_L) = \underset{\mathcal{D}_L}{\text{Pr}} [\exists \ i \not= j \ \text{s.t.} \ x_i, ..., x_{i+n} = x_j, ..., x_{j+n}]
$$</li>
</ul>
<blockquote>
<h4 id="theorem-23">Theorem 2.3.</h4>
<p>  For all $n$, there exists a depth-2 transformer $\mathcal{T}$ of dimension $O(n \log (D))$ s.t. for all $2n \le L \le D^n$ and for any copy distribution $\mathcal{D}_L$, $\text{err}_{\mathcal{D}_L}(\mathcal{T}) &lt; p_{\text{n-gram}} (\mathcal{D}_L)$</p>
</blockquote>
<ul>
<li>The probability of repeated $n$-grams quickly decays when $n$ increases</li>
<li>For the uniform distribution over sequences, this probability decays <strong>exponentially</strong> with $n$</li>
</ul>
<blockquote>
<h4 id="lemma-24">Lemma 2.4.</h4>
<p>  Let $\mathcal{D}_L$ be the copy distribution generated by sampling $\boldsymbol{x}$ from the uniform distribution over the non-special (alphabet) tokens. Then $p_{\text{n-gram}}(\mathcal{D}_L) &lt; L^2D^{-n}$</p>
</blockquote>
<ul>
<li>By combining those, we get that Transformers can copy sequences of tokens drawn from the uniform distribution using a number of params that depends only logarithmically on the input sequence length</li>
</ul>
<blockquote>
<h4 id="corollary-25">Corollary 2.5.</h4>
<p>  Fix some $\epsilon \in (0, 1/2)$ and some $L \ge \Omega(\log (1/\epsilon))$, there exists a depth-2 Transformer $\mathcal{T}$ of dimension $O(\log(L/\epsilon)\log(D))$ s.t. for the uniform copy distribution $\mathcal{D}_L$, $\text{err}_{\mathcal{D}_L}(\mathcal{T}) &lt; \epsilon$</p>
</blockquote>
<ul>
<li>The construction doesn&#39;t limit the precision of the parameters or activations, but the result also holds for finite-precision transformers using $O(\log(\log(L)))$ bits</li>
</ul>
<h2 id="23-state-space-models-cannot-copy-inputs-beyond-memory-size">2.3 State Space Models cannot copy inputs beyond memory size</h2>
<ul>
<li>GSSMs cannot copy uniform input sequences unless the capacity of their state space grows linearly with the sequence length (To be able to copy, the model needs to store it in state space)</li>
</ul>
<blockquote>
<h4 id="theorem-27">Theorem 2.7.</h4>
<p>Fix some GSSM $H$ over state space $\mathcal{S}$. Then for all $L$, for the uniform copy distribution $\mathcal{D}_L$, the model $H$ has error $\text{err}_{\mathcal{D}_L}(H) &gt; 1 - {|\mathcal{S}| \over {D^L}}$</p>
</blockquote>
<blockquote>
<h4 id="corollary-28">Corollary 2.8.</h4>
<p>Fix some $L$. Then every GSSM $H$ with state space $\mathcal{S}$ s.t. $\text{mem}(\mathcal{S}) &lt; L \log (D) - 1$ has error $\text{err}_{\mathcal{D}_L}(H) &gt; 1/2$ for the uniform copy distribution $\mathcal{D}_L$</p>
</blockquote>
<ul>
<li>The Input-dependent memory of Transformers grows linearly with the sequence length (less memory-efficient than GSSM)</li>
<li>Transformers are almost optimal in terms of input-dependent memory (at least copying)</li>
<li>Thm 2.3. says that there exists a transformer which can copy inputs of length $L$ using $\tilde{O}(L)$ input-dependent memory and it is optimal by Corollary 2.8.</li>
</ul>
<h1 id="3-learning-to-copy">3. Learning to Copy</h1>
<ul>
<li><p>The results above may not be observed in practice</p>
<ul>
<li>It&#39;s not clear that transformers can indeed learn to copy from examples</li>
<li>In practice, a GSSM may use a large latent state memory, so these bounds only hold for very long sequences of tokens (also, it may not learn to do so)</li>
</ul>
</li>
</ul>
<h2 id="31-experimental-setup">3.1. Experimental Setup</h2>
<ul>
<li><p>Transformer and Mamba $\approx$ 160M</p>
</li>
<li><p>LSTM $\approx$ 40M</p>
</li>
<li><p>64 Batch</p>
</li>
<li><p>10 batches of 128 examples for test</p>
</li>
<li><p>token space size is 30, namely $\mathcal{V} = \{a, ..., z, \text{<BOS>}, \text{<EOS>}, \text{<COPY>} \}$</p>
</li>
<li><p>All strings are sampled uniformly (a rough sketch follows after this list)</p>
<ul>
<li>sample the length of the sequence</li>
<li>independently sample each position of the string from $\mathcal{V}$</li>
<li>pack the context with i.i.d. sequences during training</li>
<li>fill the context with multiple independent samples of task</li>
</ul>
</li>
<li><p>Positional Information</p>
<ul>
<li>RoPE</li>
<li>NoPE (No Positional Information)</li>
<li>Hard-ALiBi</li>
</ul>
</li>
</ul>
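<p>A possible sketch of the training-string construction described above (uniform length, i.i.d. uniform characters, contexts packed with independent examples). The exact special tokens and packing details are assumptions based on the description:</p>
<pre><code class="language-python">import random
import string

def sample_copy_example(max_len=300, alphabet=string.ascii_lowercase):
    """One copy-task example: &lt;BOS&gt; x_1..x_L &lt;COPY&gt; x_1..x_L &lt;EOS&gt;."""
    L = random.randint(1, max_len)                      # sample the length
    x = [random.choice(alphabet) for _ in range(L)]     # i.i.d. uniform characters
    return ["&lt;BOS&gt;"] + x + ["&lt;COPY&gt;"] + x + ["&lt;EOS&gt;"]

def pack_context(context_size=1024):
    """Fill a training context with multiple independent copy examples."""
    ctx = []
    while True:
        ex = sample_copy_example(max_len=50)
        if len(ctx) + len(ex) &gt; context_size:
            return ctx
        ctx += ex

ctx = pack_context()
print(len(ctx), ctx[:6])
</code></pre>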
<h2 id="32-data-efficiency-on-the-copy-task">3.2. Data Efficiency on the Copy task</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aec8afc1-ac4a-40e5-9f83-739dccb47bae/image.png" alt=""></p>
<ul>
<li>Model gets an input of $L \le 300$ tokens followed by a separator token</li>
<li>record the string-level accuracy</li>
<li>the sharp transition is due to the log-scaled x-axis and the use of string-level accuracy on the y-axis</li>
<li>String-level Accuracy
<img src="https://velog.velcdn.com/images/0404_not_found/post/0d829471-1b26-4fe6-8f19-2ccaddd98837/image.png" alt=""></li>
<li>Character-level Accuracy 
<img src="https://velog.velcdn.com/images/0404_not_found/post/46a0a4d2-efce-4d44-9256-047ffd16f022/image.png" alt=""></li>
</ul>
<h2 id="33-length-generalization-on-the-copy-task">3.3 Length Generalization on the Copy Task</h2>
<ul>
<li><p>Test to generalize out-of-distribution</p>
</li>
<li><p>Understand which function the model has learned</p>
<ul>
<li>model has truly learned the &quot;correct&quot; copy operation vs it just learned to copy sequences of the particular size it was trained on</li>
</ul>
</li>
<li><p>Trained all models on sequences of $\le 50$ tokens and tested them on sequences of up to 100 tokens (string-level accuracy)</p>
</li>
<li><p>Transformers show better generalization to longer inputs compared to GSSMs</p>
<ul>
<li>GSSMs&#39; performance drops to near zero</li>
<li>ALiBi and NoPE dramatically outperform RoPE</li>
<li>The sinusoidal embedding of RoPE creates a more dramatic change than the decay of ALiBi or NoPE</li>
</ul>
</li>
<li><p>Training with Hard-ALiBi on sequences of length less than 50 gives almost perfect generalization up to 1000 tokens</p>
</li>
</ul>
<h2 id="34-transformers-learn-to-use-n-gram-hashing">3.4. Transformers learn to use n-gram hashing</h2>
<ul>
<li><p>To test whether the transformer uses the n-gram storage-and-retrieval mechanism</p>
</li>
<li><p>Train a Hard-ALiBi Transformer on the copy task with a dataset that contains duplicated n-grams</p>
</li>
<li><p>Draw uniform sequences of tokens and randomly replace some n-gram with another n-gram that already appears in the sequence (each example always has two copies of some n-gram)</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/764ce3db-572e-4e45-a3be-931cab0cdae7/image.png" alt=""></p>
<ul>
<li>It seems Transformer relies on something like 5-gram retrieval to do the copy task</li>
</ul>
<h2 id="35-gssms-cannot-arbitrarily-retrieve-from-context">3.5. GSSMs cannot arbitrarily retrieve from context</h2>
<ul>
<li><p>n-gram lookup task : the model should use a given n-gram as a key to look up the $k$ tokens that follow that key</p>
<ul>
<li>suffix key and prefix key</li>
<li>assess length generalization
<img src="https://velog.velcdn.com/images/0404_not_found/post/f72409cd-0894-4bd6-8136-ce876579da4b/image.png" alt=""></li>
</ul>
</li>
<li><p>Suffix key version</p>
<ul>
<li>given a sequence of $L$ input tokens, a separator, and an n-gram from the input sequence</li>
<li>the model must output the sequence of $k$ tokens following the chosen n-gram</li>
<li>it requires the model to be able to &#39;store&#39; the context to find the correct key</li>
<li>train all models on sequences of at most 30 tokens</li>
<li>Transformers perform well</li>
<li>Transformers learn n-gram storage and retrieval
<img src="https://velog.velcdn.com/images/0404_not_found/post/898bb3bc-6df2-4239-b2b7-57f768d59208/image.png" alt=""></li>
</ul>
</li>
<li><p>Prefix key version</p>
<ul>
<li>provide the n-gram key at the beginning and then the full sequence</li>
<li>the model doesn&#39;t have to store the entire input as it can find the key on the fly</li>
<li>good for GSSMs since they can write the key into the state and then ignore inputs that don&#39;t match</li>
<li>GSSMs achieve almost perfect accuracy (outperforming NoPE and ALiBi, but not Hard-ALiBi)</li>
<li>This may be because positional embeddings make it more difficult to perform the hashing lookup over long distances</li>
<li>GSSMs are memory-limited but effective when the task only requires a summary of the input</li>
</ul>
</li>
</ul>
<h1 id="4-pre-trained-models">4. Pre-trained Models</h1>
<ul>
<li>pretrained Transformer, GSSM</li>
<li>copying long strings, retrieval and few-shot QA</li>
<li>Transformer outperforms GSSM even though the GSSM shows lower PPL</li>
</ul>
<h2 id="41-setup">4.1. Setup</h2>
<ul>
<li><p>Pythia transformer models 410M ~ 2.8B</p>
</li>
<li><p>Mamba with similar size</p>
</li>
<li><p>Pretrained on Pile, used same tokenizer</p>
</li>
<li><p>Copy based task  / Information Retrieval (selective copy)</p>
</li>
<li><p>String-Level Accuracy</p>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/ce412bf9-80a7-4f26-8e4f-d22355cd0fd2/image.png" alt=""></p>
<h2 id="42-copying-the-input-text">4.2. Copying the input text</h2>
<ul>
<li>Transformers &gt; GSSM</li>
<li>Random sample from C4 dataset</li>
<li>two copies of sampled string  + first word of the string $\rightarrow$ complete the third copy</li>
<li>Unlike random strings, natural text can often be compressed, so the model can use less memory to copy it</li>
<li>When the input is more difficult to compress, GSSMs suffer due to their limited state size</li>
</ul>
<h2 id="43-retrieval-from-the-input-context">4.3. Retrieval from the input context</h2>
<ul>
<li><p>Phone-book Lookup</p>
<ul>
<li>provide a synthetic phone-book to the model and ask it to return a phone number</li>
<li>randomly sample $L$ names and phone numbers</li>
<li>two-shot examples followed by a question asking for a phone number</li>
<li>Transformer (410M) &gt; GSSM (2.8B) when $L \ge 70$</li>
</ul>
</li>
<li><p>QA</p>
<ul>
<li>2.8B Mamba and Transformer on SQuAD</li>
<li>provided a single demonstration of a QA pair over the same text</li>
<li>Mamba degrades more quickly as the paragraph length grows</li>
</ul>
</li>
</ul>
<h1 id="5-discussion">5. Discussion</h1>
<ul>
<li><p>Transformer &gt; GSSM at copying from their input text</p>
</li>
<li><p>SSM have many advantages over transformers</p>
<ul>
<li>The memory and computational complexity don&#39;t increase with the input length $\rightarrow$ good for long context</li>
<li>Better at tracking state variables across long sequences to make long consistent text</li>
<li>Similar to Human brain</li>
</ul>
</li>
<li><p>Future work is needed on hybrid architectures of SSMs and attention-like mechanisms to enhance retrieval ability</p>
<ul>
<li>Humans have very limited memory but can translate entire novels if allowed to look back at the text</li>
</ul>
</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p>The title was provocative. The retrieval experiments demonstrate the strength of Transformers, and because of this point, adopting SSMs for text, more than for other domains, does not seem easy.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Adaptation with Self-Evaluation to Improve Selective Prediction in LLMs]]></title>
            <link>https://velog.io/@0404_not_found/Adaptation-with-Self-Evaluation-to-Improve-Selective-Prediction-in-LLMs</link>
            <guid>https://velog.io/@0404_not_found/Adaptation-with-Self-Evaluation-to-Improve-Selective-Prediction-in-LLMs</guid>
            <pubDate>Thu, 01 Feb 2024 13:16:11 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/50162f85-4ed1-45d2-8677-433375573360/image.png" alt=""></p>
<ul>
<li><p>LLM is not guaranteed to be accurate for all queries</p>
</li>
<li><p>Understanding which queries they are reliable for is important</p>
</li>
<li><p><strong>Selective Prediction</strong> : the deployment scenario for AI where humans are involved to maintain overall accuracy by reviewing <strong>AI-generated, low-confidence outputs</strong></p>
<ul>
<li>Both human and AI performance are considered together to minimize human involvement cost</li>
<li>AI should use Selective Prediction to assess the accuracy of their prediction and refrain from making wrong predictions</li>
<li>Able to say &quot;I don&#39;t know&quot; when its prediction is not confident</li>
</ul>
</li>
<li><p>Selective Prediction is hard as LLM is trained to predict not the &quot;correct&quot; next token but only the &quot;next&quot; token</p>
</li>
<li><p>It also doesn&#39;t generate a confidence score $\rightarrow$ obtaining a confidence score from the output sequence is not straightforward</p>
</li>
<li><p>Distinguishing correctness from likelihood scores is challenging</p>
<ul>
<li>Using a prompt (Is the proposed answer True or False?) $\rightarrow$ does not generalize to other LLMs</li>
<li>Semantic Entropy or Self-consistency $\rightarrow$ must generate multiple output sequences</li>
<li>Fine-tuning LLMs on the target task can improve the likelihood of the ground truth $\rightarrow$ but this is not the same as minimizing wrong answers, and the model may still generate them</li>
</ul>
</li>
<li><p><strong>ASPIRE</strong> : learns self-evaluate from target-task data</p>
<ul>
<li>training LLMs on a subset of the training data from the QA tasks</li>
<li>define a selection score that combines the likelihood of the generated answer with the learned self-eval score to make selective predictions</li>
<li>less computationally expensive than generating multiple output sequences </li>
</ul>
</li>
</ul>
<h1 id="2-related-work">2. Related Work</h1>
<h4 id="selective-predictions-for-llms">Selective Predictions for LLMs</h4>
<ul>
<li><p>Selective Prediction for classification (NLI) vs Selective Prediction for NLG</p>
<ul>
<li>NLG task has infinite size of the possible answer set</li>
</ul>
</li>
<li><p>Uncertainty Measure for LLMs</p>
</li>
<li><p>Use selective prediction to solve QA task when question is ambiguous</p>
</li>
<li><p>Use auxiliary model to distinguish correct predictions of QA model</p>
</li>
</ul>
<h4 id="parameter-efficient-fine-tuning-peft">Parameter Efficient Fine-Tuning (PEFT)</h4>
<ul>
<li>LoRA</li>
<li>Prefix Tuning</li>
<li>Soft Prompt Tuning $\rightarrow$ used!</li>
<li>P-Tuning</li>
</ul>
<h1 id="3-problem-setup">3. Problem Setup</h1>
<h4 id="notations">Notations</h4>
<ul>
<li>pretrained LLM $f$ for an arbitrary generative modeling task like QA</li>
<li>vocabulary $\mathcal{V}$</li>
<li>the space of sequences of tokens $\mathcal{V}^*$</li>
<li>logits of $f$ on $v \in \mathcal{V}$ given $\mathbf{x} \in \mathcal{V}^*$ is $\bar{f}(v \ | \ \mathbf{x})$</li>
<li>the likelihood of the next token following $\mathbf{x}$ being $v$ is
$$
f(v \ | \ \mathbf{x}) := {\exp(\bar{f} (v \ | \ \mathbf{x})) \over \sum _{v&#39; \in \mathcal{V}} \exp (\bar{f} ( v&#39; \ | \ \mathbf{x}))}
$$
(softmax!)</li>
<li>likelihood of generating $\hat{\mathbf{y}} \in \mathcal{V}^*$ given $\mathbf{x}$ is
$$
f(\hat{\mathbf{y}} \ | \ \mathbf{x}) := \prod_{i=1}^{|\hat{\mathbf{y}}|}f(\hat{y_i} \ | \ \mathbf{x}, \hat{y}_{[i-1]}) 
$$
where $\hat{\mathbf{y}} = (\hat{y_1}, \hat{y_2}, ... \hat{y}_{|\hat{\mathbf{y}}|})$ and $\hat{y}_{[i-1]} = (\hat{y_1}, ... \hat{y}_{i-1}), \hat{y}_{[0]} = \emptyset$</li>
<li>This likelihood can be very small when $|\hat{\mathbf{y}}|$ is very large $\rightarrow$ normalize the likelihood
$$
f_{\text{norm}}(\hat{\mathbf{y}} \ | \ \mathbf{x}) := f(\hat{\mathbf{y}} \ | \ \mathbf{x})^{{1 \over |\hat{\mathbf{y}}|}}
$$</li>
<li>use $f$ to generate the output sequence by solving 
$$
\hat{\mathbf{y}} ^ * = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x}) 
$$</li>
<li>Impossible to solve exactly as the output sequence can be arbitrarily long $\rightarrow$ use a decoding strategy (greedy decoding, beam search) to approximate it</li>
</ul>
<h4 id="evaluate-correctness">Evaluate Correctness</h4>
<ul>
<li><p>set of reference outputs $S$</p>
</li>
<li><p>evaluation metric $M : \mathcal{V}^* \times \mathcal{V}^* \rightarrow \ [0,1]$</p>
<ul>
<li>evaluate the similarity of the generated output $\hat{\mathbf{y}}$ and the reference output $\mathbf{y}_r \in S$</li>
</ul>
</li>
<li><p>threshold $\gamma$</p>
<ul>
<li>if $\max_{\mathbf{y}_r \in S} M(\hat{\mathbf{y}}, \mathbf{y}_r) &gt; \gamma$, then the generated output is correct</li>
</ul>
</li>
<li><p>training dataset $\mathcal{D}^{tr} = \{ (\mathbf{x}^i, S^i) \}_{i=1}^{n_{tr}}$ randomly sampled from a target task distribution</p>
</li>
<li><p>rejection operation $\bot$</p>
</li>
<li><p>selective predictor $f_s : \mathcal{V}^* \rightarrow \mathcal{V}^* \cup \{ \bot \}$ (a minimal sketch follows after this list)</p>
<ul>
<li>should achieve strong selective prediction performance on test dataset</li>
<li>composed of a predictor $\hat{f} : \mathcal{V}^* \rightarrow \mathcal{V}^*$ and a selection scoring function $g : \mathcal{V}^* \rightarrow \mathbb{R}$</li>
<li>$$
f_s(\mathbf{x}; \tau) = \begin{cases}
\hat{f}(\mathbf{x}) \quad &amp;\text{if }g(\mathbf{x}) \ge \tau \\ \bot \quad &amp;\text{if } g(\mathbf{x}) &lt; \tau
\end{cases}
$$</li>
<li>accuracy : the fraction of the accepted inputs where the predictions are correct</li>
<li>coverage : the fraction of the inputs that are accepted</li>
<li>Tune $\tau$ to achieve a certain coverage and manage accuracy-coverage trade-off</li>
</ul>
</li>
<li><p>use AUACC (area under the accuracy-coverage curve) to measure selective prediction performance</p>
</li>
<li><p>use AUROC (area under the receiver operator characteristic curve) to measure the quality of the selection score estimation</p>
<ul>
<li>equivalent to the probability that a randomly chosen correct output sequence has a higher selection score than a randomly chosen incorrect output sequence</li>
</ul>
</li>
</ul>
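<p>A minimal Python sketch of the selective predictor and the accuracy/coverage quantities defined above. The toy predictor, scoring function, and data are assumptions; sweeping $\tau$ traces the accuracy-coverage curve whose area is the AUACC metric.</p>
<pre><code class="language-python">def selective_predict(x, predictor, score_fn, tau):
    """Return the prediction if g(x) &gt;= tau, otherwise abstain (None plays the role of the rejection symbol)."""
    return predictor(x) if score_fn(x) &gt;= tau else None

def accuracy_and_coverage(examples, predictor, score_fn, is_correct, tau):
    accepted = [(x, y) for x, y in examples if score_fn(x) &gt;= tau]
    coverage = len(accepted) / len(examples)
    accuracy = sum(is_correct(predictor(x), y) for x, y in accepted) / len(accepted) if accepted else 1.0
    return accuracy, coverage

exs = [("q1", "a"), ("q2", "b"), ("q3", "c")]
pred = lambda x: "a"                                  # toy predictor
g = lambda x: {"q1": 0.9, "q2": 0.4, "q3": 0.8}[x]    # toy selection score
print(accuracy_and_coverage(exs, pred, g, lambda p, y: p == y, tau=0.5))  # (0.5, 0.666...)
</code></pre>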
<h1 id="4-aspire-framework">4. ASPIRE Framework</h1>
<ul>
<li><p>LLM should have self-evaluation ability</p>
<ul>
<li>Previous work was only applicable to specific LLMs</li>
<li>Collect some training data to employ self-evaluation</li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/aa95889a-0598-42e3-93df-04727a83fa4b/image.png" alt=""></p>
<ul>
<li><p>Start with LoRA</p>
<ul>
<li>model parameters $\theta$ is frozen</li>
<li>adapter $\theta_p$ is added for fine-tuning and updated</li>
<li>it improves prediction accuracy and likelihood of correct output sequences $\rightarrow$ improves selective prediction performance!</li>
</ul>
</li>
<li><p>Fine-tune LLM to learn self-evaluation</p>
<ul>
<li><p>use $\theta_p$ to generate different answers for each example $(\mathbf{x}, \mathbf{y}) \in \mathcal{D}^{tr}$</p>
</li>
<li><p>supposing the decoding algorithm used to generate output sequences for $\mathbf{x}$ is $\mathcal{A}$
where $\mathcal{A}(f, \theta_p, \mathbf{x}) = [\hat{\mathbf{y}}^1, ..., \hat{\mathbf{y}}^k]$</p>
</li>
<li><p>choose output sequences such that $f(\hat{\mathbf{y}}^j \ | \ \mathbf{x}; \theta_p)$ is maximal</p>
</li>
<li><p>use metric $M$ to determine $\hat{\mathbf{y}}^j$ is correct 
i.e. if $M(\hat{\mathbf{y}}^j, \mathbf{y}) &gt; \hat{\gamma}$, it is correct</p>
</li>
<li><p>use threshold $\hat{\gamma}$ different from $\gamma$ for evaluation (choose sufficiently large $\hat{\gamma}$ so that the wrong outputs wouldn&#39;t be labeled as correct outputs)</p>
</li>
<li><p>after sampling high-likelihood outputs, tune $\theta_s$ only for learning self-evaluation ($\theta$ and $\theta_p$ are frozen)</p>
</li>
<li><p>the training objective is
$$
\begin{aligned}
&amp;\min_{\theta_s} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} \ \mathcal{L}_c + \mathcal{L}_w \\
&amp;\mathcal{L}_c = \mathbb{E}_{\hat{\mathbf{y}} \sim S_c(\mathbf{x}, \mathbf{y})} - \log f(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}; \theta_p, \theta_s) \\
&amp;\mathcal{L}_w = \mathbb{E}_{\hat{\mathbf{y}} \sim S_w(\mathbf{x}, \mathbf{y})} - \log f(\text{wrong} \ | \ \mathbf{x}, \hat{\mathbf{y}}; \theta_p, \theta_s)
\end{aligned}
$$</p>
<p>where $S_c(\mathbf{x}, \mathbf{y})$ is a set of &#39;correct&#39; outputs containing the reference $\mathbf{y}$ and the $k_c$ correct outputs with highest likelihood from $\mathcal{A}(f, \theta_p, \mathbf{x})$, and similarly for $S_w$ (if $\mathcal{A}(f, \theta_p, \mathbf{x})$ has no wrong output, add a default wrong output (e.g. an empty string) to $S_w$)</p>
</li>
<li><p>After training $\theta_s$, obtain the prediction solving 
$$
\hat{\mathbf{y}}^* = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x};\theta_p)
$$</p>
</li>
<li><p>Also, the self-eval score is defined as
$$
P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*) = {\exp (\bar{f}(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*; \theta_p, \theta_s)) \over \sum_{z \in \{\text{correct}, \text{wrong}\}} \exp (\bar{f}(z \ | \ \mathbf{x}, \hat{\mathbf{y}}^*; \theta_p, \theta_s))}
$$</p>
</li>
<li><p>Used Beam search decoding</p>
</li>
<li><p>Overall, the selection scoring function is 
$$
g(\mathbf{x}) = (1 - \alpha)\cdot \log f_{\text{norm}} (\hat{\mathbf{y}}^* \ | \ \mathbf{x}; \theta_p) + \alpha \cdot \log P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*)
$$
where $\alpha \in [0,1]$ is a hyperparameter</p>
</li>
</ul>
</li>
</ul>
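<p>A small Python sketch of how the final selection score $g(\mathbf{x})$ above could be computed from the per-token log-likelihoods of the generated answer and the learned self-eval probability. Function and argument names are assumptions for illustration.</p>
<pre><code class="language-python">import math

def selection_score(answer_token_logprobs, p_correct, alpha=0.25):
    """g(x) = (1 - alpha) * log f_norm(y* | x) + alpha * log P(correct | x, y*)."""
    log_f_norm = sum(answer_token_logprobs) / len(answer_token_logprobs)  # length-normalized log-likelihood
    return (1 - alpha) * log_f_norm + alpha * math.log(p_correct)

# Toy numbers: a fairly likely answer that the self-eval head also judges as probably correct.
print(selection_score([-0.2, -0.1, -0.3], p_correct=0.9))
</code></pre>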
<h1 id="5-implementation-via-soft-prompt-tuning">5. Implementation via Soft Prompt Tuning</h1>
<ul>
<li>They could develop prompts that effectively stimulate self-evaluation</li>
<li>it is possible to discover these prompts through soft prompt tuning with targeted training objectives
<img src="https://velog.velcdn.com/images/0404_not_found/post/a9f7bd45-9294-4a81-ac70-086c4e0a7648/image.png" alt=""></li>
</ul>
<h4 id="soft-prompt-tuning">Soft Prompt Tuning</h4>
<ul>
<li>given query $\mathbf{x} = (x_1, ..., x_{m_q})$</li>
<li>get embedding of $\mathbf{x}$  to form a matrix $X \in \mathbb{R}^{m_q \times d_e}$</li>
<li>soft-prompts $\tilde{\theta} \in \mathbb{R}^{l \times d_e}$</li>
<li>concatenate soft-prompts to query to form $[\tilde{\theta}; X] \in \mathbb{R}^{(m_q + l) \times d_e}$</li>
</ul>
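<p>A minimal numpy sketch of the soft-prompt concatenation just described; the shapes follow the notation above, and the specific sizes are made-up assumptions.</p>
<pre><code class="language-python">import numpy as np

def prepend_soft_prompt(X, theta):
    """Concatenate trainable soft-prompt vectors theta (l x d_e) in front of the
    frozen query token embeddings X (m_q x d_e), giving an (m_q + l) x d_e input."""
    return np.concatenate([theta, X], axis=0)

d_e, l, m_q = 16, 50, 7
theta_p = np.random.randn(l, d_e)    # learnable; updated by the theta_p objective below
X = np.random.randn(m_q, d_e)        # embeddings of the query tokens (frozen)
print(prepend_soft_prompt(X, theta_p).shape)   # (57, 16)
</code></pre>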
<h4 id="adapt-to-aspire">Adapt to ASPIRE</h4>
<ul>
<li><p>update $\theta_p$ with 
$$
\min_{\theta_p} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} {1 \over |\mathbf{y}|} \sum_{j=1}^{|\mathbf{y}|} - \log f(y_j \ | \ [\theta_p ; X ; Y_{[j-1]}])
$$</p>
</li>
<li><p>update $\theta_s$ with 
$$
\begin{aligned}
&amp;\min_{\theta_s} \mathbb{E}_{(\mathbf{x}, \mathbf{y}) \sim \mathcal{D}^{tr}} \ \mathcal{L}_c + \mathcal{L}_w \\
&amp;\mathcal{L}_c = \mathbb{E}_{\hat{\mathbf{y}} \sim S_c(\mathbf{x}, \mathbf{y})} - \log f(\text{correct} \ | \ [\theta_p; X; \hat{Y}; \theta_s]) \\
&amp;\mathcal{L}_w = \mathbb{E}_{\hat{\mathbf{y}} \sim S_w(\mathbf{x}, \mathbf{y})} - \log f(\text{wrong} \ | \ [\theta_p; X; \hat{Y}; \theta_s])
\end{aligned}
$$</p>
</li>
<li><p>The Inference objective becomes
$$
  \hat{\mathbf{y}}^* = \argmax_{\hat{\mathbf{y}}} \log f(\hat{\mathbf{y}} \ | \ \mathbf{x};[\theta_p; X])
$$</p>
</li>
<li><p>The self-eval score becomes
$$
  P(\text{correct} \ | \ \mathbf{x}, \hat{\mathbf{y}}^*) = {\exp (\bar{f}(\text{correct} \ | \ [\theta_p; X; \hat{Y}^*; \theta_s])) \over \sum_{z \in \{\text{correct}, \text{wrong}\}} \exp (\bar{f}(z \ | \ [\theta_p; X; \hat{Y}^*; \theta_s]))}
$$</p>
</li>
</ul>
<h4 id="generation-pipeline">Generation Pipeline</h4>
<ul>
<li><p>obtain generated output and the likelihood for the output</p>
</li>
<li><p>obtain self-eval score</p>
</li>
<li><p>cache the states of first stage to reduce computational cost for second stage</p>
</li>
</ul>
<h4 id="computational-complexity">Computational Complexity</h4>
<ul>
<li>At test time : $O(l_{max})$</li>
<li>Predictive entropy and semantic entropy methods : $O(m \cdot l_{max})$</li>
</ul>
<h1 id="6-experiments">6. Experiments</h1>
<ul>
<li>Using a decoding algorithm that can sample diverse high-likelihood outputs is important</li>
<li>more training samples lead to enhanced performance</li>
<li>2k samples are enough to outperform the baselines without soft-prompt tuning</li>
</ul>
<h2 id="61-setup">6.1 Setup</h2>
<ul>
<li><p>free-form QA task : CoQA(zero-shot), SQuAD(zero-shot), TriviaQA (5-shot)</p>
</li>
<li><p>used 50K examples subset</p>
</li>
<li><p>OPT(350M, 1.3B, 2.7B, 30B), GPT-2(M, L, XL)</p>
</li>
<li><p>pretrained LLM and $\theta_p$ trained model</p>
</li>
<li><p>beam-search</p>
</li>
<li><p>selection score $g(\mathbf{x})$ with PPL, Predictive Entropy, Semantic Entropy, Self-eval, P(True)</p>
</li>
<li><p>Rouge-L as the evaluation metric $M$ with relatively large $\gamma = 0.7$ (accepting wrong answer is more costly)</p>
</li>
<li><p>Both stage of training $\theta_p$ and $\theta_s$, 10 epochs with AdamW, batch 8, lr 0.01 and cosine lr scheduling</p>
</li>
<li><p>for ASPIRE, </p>
<ul>
<li>beam search for $\mathcal{A}$</li>
<li>$l = 50$</li>
<li>$\hat{\gamma} = 0.9$</li>
<li>$k=10$</li>
<li>$k_c = 2$</li>
<li>$k_w = 10$</li>
<li>$\alpha=0.25$</li>
</ul>
</li>
</ul>
<h1 id="62-results">6.2 Results</h1>
<h4 id="accuracy">Accuracy</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/e01813bd-fa5a-402b-b5d1-71284a5992d9/image.png" alt=""></p>
<h4 id="methods-to-get-selection-score">Methods to get selection score</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/10370cc5-4132-4915-9c7c-c9fbcbf0fb75/image.png" alt=""></p>
<ul>
<li>After prompt tuning, other methods&#39; AUACC is significantly improved as accuracy became better and PPL became more meaningful</li>
<li>ASPIRE with OPT-2.7B significantly outperforms Self-eval and P(True) with OPT-30B</li>
<li>For the Self-eval and P(True) methods, although the accuracy of OPT-30B is better than adapted OPT-2.7B, it has much worse selective prediction performance
$\rightarrow$ the self-evaluation approach alone is not effective, even for high-capacity LLMs</li>
</ul>
<h2 id="63-empirical-analyses">6.3 Empirical Analyses</h2>
<h4 id="the-effect-of-alpha">The effect of $\alpha$</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/d777925c-116c-4cd1-9c64-dddea11b00eb/image.png" alt=""></p>
<ul>
<li>$\alpha=0.25$ is the best recipe for normalized likelihood and the learned self-eval score</li>
<li>In practice, this value can be chosen based on the performance on the validation data</li>
</ul>
<h4 id="the-choices-of-mathcala">The choices of $\mathcal{A}$</h4>
<ul>
<li>compared beam search and multinomial sampling</li>
<li>used $k$ highest scoring beams as the answer list (beam search)</li>
<li>tested temperature 0.1, 1.0, 2.0 for multinomial sampling
<img src="https://velog.velcdn.com/images/0404_not_found/post/cd18a04f-d993-49b7-bd8a-35a3af6adf74/image.png" alt=""></li>
</ul>
<h4 id="training-sample-efficienty">Training sample efficienty</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1545da21-2c45-45df-9e0f-1e9dd79923e6/image.png" alt=""></p>
<ul>
<li>Fixed the number of steps to be 50K</li>
<li>ASPIRE can significantly improve selective prediction performance even with limited number of training samples</li>
</ul>
<h1 id="7-conclusion">7. Conclusion</h1>
<ul>
<li>Adaptation with self-evaluation to improve selective prediction in LLMs</li>
<li>Soft prompt tuning</li>
<li>Implement via other PEFT approaches and adapt to larger LLMs (Future work)</li>
<li>Didn&#39;t test with larger and stronger LLMs (computational constraints)</li>
</ul>
<h1 id="8-comment">8. Comment</h1>
<p>What I liked is that the confidence is not simply produced by prompting, but is obtained through its own computation and learning. However, the tested models are somewhat old, so I wonder whether this would also work with recent sLLMs.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Spotting LLMs with Binoculars: Zero-Shot Detection of Machine-Generated Text]]></title>
            <link>https://velog.io/@0404_not_found/Spotting-LLMs-with-Binoculars-Zero-Shot-Detection-of-Machine-Generated-Text</link>
            <guid>https://velog.io/@0404_not_found/Spotting-LLMs-with-Binoculars-Zero-Shot-Detection-of-Machine-Generated-Text</guid>
            <pubDate>Fri, 26 Jan 2024 14:08:45 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Introducing a method for detecting LLM-generated text in a zero-shot setting (no training samples from the source LLM) </p>
</li>
<li><p>outperforms existing detectors at detecting ChatGPT-generated text</p>
</li>
<li><p>Because of its <strong>zero-shot</strong> nature, it can spot multiple different LLMs with high accuracy</p>
</li>
<li><p>Prior research (Turnitin) fixated strongly on ChatGPT</p>
</li>
<li><p>More sophisticated actors use a wide range of LLMs beyond just ChatGPT</p>
</li>
<li><p><strong>Binoculars</strong> works by viewing text through two lenses</p>
<ul>
<li>compute the log perplexity of the text in question using an &quot;observer LLM&quot;</li>
<li>compute all the next-token predictions that a &quot;performer LLM&quot; would make and compute their perplexity according to the observer</li>
<li>If the string is written by a machine, the perplexities would be similar.</li>
</ul>
</li>
</ul>
<h1 id="2-the-llm-detection-landscape">2. The LLM Detection Landscape</h1>
<ul>
<li><p>Spam and Fake news analyzing $\rightarrow$ all benefit from signals that quantify whether text is human or machine-generated</p>
</li>
<li><p>Due to the rise of the Transformer models, primitive mechanisms became useless $\rightarrow$ to record or watermark all generated text</p>
</li>
<li><p>Post-hoc detection approaches without cooperation from model owners</p>
<ul>
<li>Fine-tuned pretrained backbone for the binary classification task (adversarial training, abstention)</li>
<li>Linear classifier on top of frozen learned features allowing for the inclusion of commercial API outputs</li>
</ul>
</li>
<li><p>Using statistical signatures that are characteristic of machine-generated text</p>
<ul>
<li>requires none or little training data</li>
<li>easily adapted to newer model families</li>
<li>based on perplexity, perplexity curvature, log rank, intrinsic dimensionality of generated text, n-gram</li>
</ul>
</li>
<li><p>Detection has limitation</p>
<ul>
<li>Fully general-purpose models of language would be, by definition, impossible to detect</li>
<li>Given sufficient examples, text from a model close to the optimum is still technically detectable</li>
<li>In practice, the relative success of detection provides evidence that current language models are imperfect representations of human writing (Detectable!)</li>
</ul>
</li>
<li><p>How do we appropriately and thoroughly evaluate detectors?</p>
<ul>
<li>accuracy on test sets and AUC of classifiers are not well-suited for the high-stakes question of detection</li>
<li>Only detectors with low false-positive rates truly reduce harm</li>
<li>detectors are often only evaluated on relatively easy datasets that are reflective of their training data</li>
</ul>
</li>
</ul>
<h1 id="3-binoculars-how-it-works">3. Binoculars: How it works</h1>
<ul>
<li>perplexity and cross-perplexity (how surprising the next-token predictions of one model are to another model)</li>
</ul>
<h2 id="31-background-and-notation">3.1 Background and Notation</h2>
<ul>
<li>string $s$</li>
<li>a list of token indices $\vec{x}$</li>
<li>tokenizer $T$</li>
<li>$i$-th token ID $x_i$</li>
<li>vocab $V = { 1,2 , ... ,n }$</li>
<li>language model $\mathcal{M}$</li>
<li>number of tokens in $s$,  $L$
$$
\mathcal{M}(T(s)) = \mathcal{M}( \vec{x} ) = Y, \quad
Y_{ij} = P(v_j | x_{0:i-1}) \text{ for all} \ j \in V
$$</li>
<li>Define logPPL as the average negative log-likelihood of all tokens in the given sequence
$$
\log \text{PPL}_{\mathcal{M}}(s) = - {1 \over L} \sum_{i=1}^{L}\log (Y_{ix_{i}})
$$</li>
<li>This logPPL intuitively measures how <strong>surprising</strong> a string is to a language model</li>
<li>As it is used as a loss function, the models are likely to score their own outputs as unsurprising</li>
<li>Define <strong>Cross-Perplexity</strong> as the average per-token cross-entropy between the outputs of two models
$$
\log \text{X-PPL}_{\mathcal{M}_1, \mathcal{M}_2}(s) = - { 1 \over L} \sum_{i=1}^{L} \mathcal{M}_1(s)_i \ \cdot \ \log(\mathcal{M}_2 (s)_i)
\ \text{where } \cdot \text{ means the dot product}
$$</li>
</ul>
<h2 id="32-what-makes-detection-hard-a-primer-on-the-capybara-problem">3.2 What makes detection Hard? A primer on the Capybara problem</h2>
<ul>
<li><p>LLM tends to generate text that is unsurprising to an LLM</p>
</li>
<li><p>As humans are different from machine, human PPL is higher according to an LLM observer</p>
</li>
<li><p>When it faces hand-crafted prompts, this intuition breaks</p>
<ul>
<li>prompt &quot;1, 2, 3, &quot; results in &quot;4, 5, 6&quot; which has very low PPL</li>
<li>But a prompt like &quot;Can you write a few sentences about a capybara that is an astrophysicist?&quot; will yield a response that seems much stranger $\rightarrow$ High PPL (&quot;capybara&quot;, &quot;astrophysicist&quot;)</li>
<li>in the absence of the prompt, LLM detection seems difficult and naive perplexity-based detection fails
<img src="https://velog.velcdn.com/images/0404_not_found/post/5e07a486-6ca0-4a0b-9901-d25e9e979714/image.png" alt=""></li>
</ul>
</li>
</ul>
<h2 id="33-our-detection-score">3.3 Our Detection Score</h2>
<ul>
<li>Binoculars solves the capybara problem by providing a mechanism for estimating the <strong>baseline PPL</strong> induced by the prompt</li>
</ul>
<h4 id="motivation">Motivation</h4>
<ul>
<li><p>LM generates Low-PPL text relative to humans $\rightarrow$ PPL Threshold classifier</p>
</li>
<li><p>Capybara problem $\rightarrow$ prompt matters $\rightarrow$ Cross-PPL</p>
</li>
<li><p>Cross-PPL measures the tokens are surprising <strong>relative to the baseline PPL of an LLM acting on the same string</strong></p>
</li>
<li><p>Expect the next-token choices of humans to have even higher PPL than those of the machine $\rightarrow$ Normalize the observed PPL by the expected PPL of a machine acting on the same text (a small code sketch follows at the end of this section)
$$
B_{\mathcal{M}_1, \mathcal{M}_2} (s) = { \log \text{PPL}_{\mathcal{M}_1} (s) \over \log \text{X-PPL}_{\mathcal{M}_1, \mathcal{M}_2}(s)}
$$</p>
</li>
<li><p>The numerator is simple PPL (how surprising a string is to $\mathcal{M}_1$)</p>
</li>
<li><p>The denominator measures how surprising the token predictions of $\mathcal{M}_2$ are when observed by $\mathcal{M}_1$</p>
</li>
<li><p>Expect humans to diverge from $\mathcal{M}_1$ more than $\mathcal{M}_2$ diverges from $\mathcal{M}_1$</p>
</li>
<li><p>The Binoculars score $B$ is a general mechanism that captures a statistical signature of machine text</p>
</li>
<li><p>It is also capable of detecting generic machine-text generated by a third model altogether</p>
</li>
<li><p>Connection to other approaches </p>
<ul>
<li>Contrastive Decoding : generate high-quality text by maximizing the difference between a weak and a strong model</li>
<li>Speculative Decoding : Use weaker models to plan completions</li>
<li>Both work when pairing a strong model with a much weaker model</li>
<li>But Binoculars works well when pairing two very similar models (Falcon-7B as $\mathcal{M}_1$ and Falcon-7B-instruct as $\mathcal{M}_2$)</li>
</ul>
</li>
</ul>
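<p>A minimal numpy sketch of the Binoculars score as defined above. The two probability arrays stand in for the next-token distributions of the observer $\mathcal{M}_1$ and the performer $\mathcal{M}_2$ on the same tokenized string; the names and random toy inputs are assumptions.</p>
<pre><code class="language-python">import numpy as np

def binoculars_score(probs_m1, probs_m2, token_ids):
    """B = log PPL_{M1}(s) / log X-PPL_{M1,M2}(s).
    probs_m1, probs_m2: (L, |V|) per-position next-token distributions of M1 and M2.
    token_ids: the L observed tokens of the string s."""
    L = len(token_ids)
    log_ppl = -np.mean(np.log(probs_m1[np.arange(L), token_ids]))       # how surprising s is to M1
    log_x_ppl = -np.mean(np.sum(probs_m1 * np.log(probs_m2), axis=1))   # cross-perplexity term above
    return log_ppl / log_x_ppl   # lower scores indicate machine-generated text

rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(10), size=5)   # toy 5-token string over a vocab of 10
p2 = rng.dirichlet(np.ones(10), size=5)
print(binoculars_score(p1, p2, rng.integers(0, 10, size=5)))
</code></pre>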
<h1 id="4-accurate-zero-shot-detection">4. Accurate Zero-Shot Detection</h1>
<h2 id="41-datasets">4.1 Datasets</h2>
<ul>
<li><p>Ghostbuster : Writing Prompts, News, Student Essay datasets (Humans vs ChatGPT)</p>
</li>
<li><p>Drew human samples from CCNews, PubMed, CNN and generated machine text by LLaMA-2-7B and Falcon-7B</p>
<ul>
<li>Take the first 50 tokens of each human sample and use them as a prompt to generate up to 512 tokens</li>
<li>removed human prompt from the generation</li>
</ul>
</li>
<li><p>Orca dataset to check the reliability of the proposed method for instruction-tuned models</p>
</li>
</ul>
<h2 id="42-metrics">4.2 Metrics</h2>
<ul>
<li><p>Binary classification metrics </p>
<ul>
<li>ROC Curve</li>
<li>AUC</li>
</ul>
</li>
<li><p>In high-stakes detection settings, false positives are the most concerning harm (human text labeled as machine-generated)</p>
<ul>
<li>TPR (True-Positive rates) at FPR (False-Positive rates)</li>
<li>standard FPR threshold of 0.01%</li>
<li>when the FPR is below 1%, AUC and TPR@FPR are often uncorrelated</li>
</ul>
</li>
</ul>
<h2 id="43-benchmark-performances">4.3 Benchmark Performances</h2>
<h4 id="ghostbuster-vs-chatgpt">Ghostbuster (vs ChatGPT)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/6fbd47fb-d2a8-4446-8dd7-e3533370816c/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9243f21e-ae90-4f2a-bbc8-5c56d919a382/image.png" alt=""></p>
<ul>
<li>outperforms Ghostbuster in &quot;out-of-domain&quot; settings</li>
<li>Ghostbuster and Binoculars both get stronger when given more text</li>
<li>Binoculars has a clearer advantage in the few-token regime</li>
</ul>
<h4 id="open-source-lms-vs-llama-2-and-falcon">Open source LMs (vs LLaMA-2 and Falcon)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/4f892a91-0b25-49cb-9527-f61fa169f9c4/image.png" alt=""></p>
<ul>
<li>Ghostbuster fails to detect generations from other open-source models</li>
</ul>
<h1 id="5-reliability-in-the-wild">5. Reliability in the Wild</h1>
<h2 id="51-varied-text-sources">5.1 Varied Text Sources</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/2dbb4db3-1343-4c16-8092-378852dc1fbc/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/726d1742-9ee2-4421-b4d5-8f440112c62d/image.png" alt=""></p>
<ul>
<li>used M4 detection dataset</li>
<li>Binoculars generalizes across domains and languages</li>
<li>LR GLTR : Logistic Regression over Giant Language Model Test Room</li>
<li>NELA : News Landscape Classifiers</li>
</ul>
<h2 id="52-other-languages">5.2 Other Languages</h2>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/237d4a4a-f815-4a4a-a4f5-69fc80d583c8/image.png" alt=""></p>
<ul>
<li><p>Evaluating Binoculars on samples from languages that are not well represented in Common Crawl data</p>
<ul>
<li>FPR remains low but machine text is classified as human (poor recall)</li>
<li>Binoculars is a machine-text detector that detects whether text may have been generated by a <strong>similar</strong> language model</li>
<li>Falcon has low capability in low-resource languages, so ChatGPT&#39;s text in those languages looks unlikely to be machine-generated according to this score</li>
</ul>
</li>
<li><p>A stronger multilingual pair of models would make Binoculars more effective at detecting ChatGPT-generated text in those languages</p>
</li>
</ul>
<h4 id="fpr-on-text-written-by-non-native-speakers">FPR on text written by non-native speakers</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/84f69fb6-cfab-48f1-8dbb-809e26736f00/image.png" alt=""></p>
<ul>
<li>LLM detectors are inadvertently biased against non-native English speakers, classifying their writing as machine-generated</li>
<li>Analyzed EssayForum (ESL students&#39; academic writing), comparing original essays and grammar-corrected versions</li>
<li>Binoculars is insensitive to this type of shift</li>
</ul>
<h2 id="53-memorization">5.3 Memorization</h2>
<ul>
<li>Highly memorized examples (famous quotes) are classified as machine-generated by PPL-based detection
<img src="https://velog.velcdn.com/images/0404_not_found/post/0e05d1bb-4fa1-4af8-ad6f-d25747d5f607/image.png" alt=""></li>
<li>Memorized text is human-written but is also readily produced by the machine</li>
<li>Either behavior can be acceptable depending on the application (plagiarism detection vs. removal of LLM-generated text from a training corpus)</li>
</ul>
<h2 id="54-modified-prompting-strategies">5.4 Modified Prompting Strategies</h2>
<ul>
<li><p>For the OpenOrca set, Binoculars detects 92% of GPT-3 samples and 89.57% of GPT-4 samples
<img src="https://velog.velcdn.com/images/0404_not_found/post/aa1a704a-a6ae-4fcf-bfd5-e0a8d455f3d2/image.png" alt=""></p>
</li>
<li><p>Simple detection schemes are fooled by these changes of prompt</p>
</li>
<li><p>This does not affect the performance of the Binoculars score
<img src="https://velog.velcdn.com/images/0404_not_found/post/5f4eeb73-bfbc-4940-9c3e-88448c9bbac2/image.png" alt=""></p>
</li>
</ul>
<h2 id="55-randomized-data">5.5 Randomized Data</h2>
<ul>
<li>Test arbitrary mistakes, hashcodes, or other kinds of random string
<img src="https://velog.velcdn.com/images/0404_not_found/post/a8c1f1ea-53ec-4523-8bf0-207c8f463adb/image.png" alt=""></li>
<li>Confidently scores them as human</li>
<li>LLMs usually don&#39;t generate such things</li>
</ul>
<h1 id="6-discussion-and-limitations">6. Discussion and Limitations</h1>
<ul>
<li>a method for detecting LLM-generated text in the zero-shot case</li>
<li>The transferable detector works in the zero-shot setting</li>
<li>This transferability comes from the similarity between modern LLMs (Transformers!)</li>
<li>Due to VRAM, they didn&#39;t check larger models (30B+)</li>
<li>Didn&#39;t consider explicit efforts to bypass detection</li>
<li>Non-conversational text domains are not included</li>
</ul>
<h1 id="7-comment">7. Comment</h1>
<p>A method that uses cross-PPL rather than plain PPL to check model generations in a relative way. However, loading two models probably requires quite a lot of resources. </p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Sparse Upcycling: Training MoE from Dense Checkpoints]]></title>
            <link>https://velog.io/@0404_not_found/Sparse-Upcycling-Training-MoE-from-Dense-Checkpoints</link>
            <guid>https://velog.io/@0404_not_found/Sparse-Upcycling-Training-MoE-from-Dense-Checkpoints</guid>
            <pubDate>Tue, 16 Jan 2024 12:43:35 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduciton">1. Introduciton</h1>
<ul>
<li><p>Increased scale is one of the main drivers of better performance in DL (NLP, Vision, Speech, RL, Multimodal etc.)</p>
</li>
<li><p>Most SOTA Neural Nets are trained from-scratch (random weights) $\rightarrow$ Cost for training is high</p>
</li>
<li><p><strong>Model Upcycling</strong>: upgrading an existing model with a relatively small additional computational budget</p>
<ul>
<li>focus on upcycling dense models into larger sparsely activated MoEs (starting from a pretrained dense Transformer checkpoint)</li>
<li>less than 40% additional budget across all sizes, for both language and vision</li>
</ul>
</li>
<li><p>Valuable in two scenarios</p>
<ul>
<li>Have access to a pretrained Transformer and want to improve it within a computational budget</li>
<li>Plan to train a large model and don&#39;t know whether dense or MoE would be more effective $\rightarrow$ First train the dense model, then upcycle it into a MoE</li>
</ul>
</li>
<li><p>Central challenge in model upcycling is the initial performance decrease entailed by changing a trained network structure $\rightarrow$ present a model surgery recipe</p>
</li>
</ul>
<h1 id="2-background">2. Background</h1>
<h2 id="21-sparsely-activated-mixture-of-experts-moe">2.1 Sparsely Activated Mixture of Experts (MoE)</h2>
<h4 id="dense-vs-sparse">Dense vs Sparse</h4>
<ul>
<li>Dense model : apply all params to every input</li>
<li>Sparse model : activating a subset of params for each input</li>
<li>MoE models are an accelerator friendly family of sparse models that allow training of models with up to trillions of params</li>
</ul>
<h4 id="moe-model">MoE Model</h4>
<ul>
<li>alternate standard Transformer blocks with MoE blocks</li>
<li>usually replace the MLPs in a Transformer block with a number of &#39;experts&#39; (also MLP) with different params and a router (small neural net, decides which expert should be applied)</li>
<li>There are multiple routing algorithms (Top-K, BASE and Sinkhorn-BASE layers, Hash layers, Expert Choice routing)</li>
</ul>
<h4 id="sparsely-gated-moe-shazeer-et-al-2017">Sparsely Gated MoE (Shazeer et al., 2017)</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/9e1c69c9-829d-423e-b19a-7aa311a6e0e6/image.png" alt=""></p>
<ul>
<li><p>Gating network $G(x) \in \mathbb{R}^n$ and $n$ expert networks $E_1, E_2, ... E_n$</p>
</li>
<li><p>the output $y$ of the MoE module is $y = \sum_{i=1} ^n G(x)_i E_i(x)$</p>
</li>
<li><p>$G(x)$ is a sparse vector: it has non-zero elements only at the indices of the selected experts</p>
</li>
<li><p>The choice of gating function</p>
<ul>
<li><p>Softmax gating : $G_{\sigma}(x) = Softmax(x \cdot W_g)$ where $W_g$ is trainable weight matrix</p>
</li>
<li><p>Noisy Top-K gating (sketched in code after this list)
$$
\begin{aligned}
G(x) &amp;= Softmax(KeepTopK(H(x), k)) \\
H(x)_i &amp;= (x \cdot W_g)_i + StandardNormal() \cdot Softplus((x \cdot W_{noise})_i) \\
KeepTopK(v, k)_i &amp;= \begin{cases} v_i \quad &amp;\text{if } v_i \text{ is in the top } k \text{ elements of } v \\ -\infty \quad &amp;\text{otherwise} \end{cases} \\
Softplus(x) &amp;= \ln ({1 + e^x})
\end{aligned}
$$</p>
</li>
</ul>
</li>
</ul>
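<p>A small numpy sketch of the noisy top-k gate above, assuming the multiplicative-noise form of $H(x)$; matrix names mirror the equations, and the toy dimensions are arbitrary.</p>
<pre><code class="language-python">import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def noisy_topk_gate(x, W_g, W_noise, k, rng):
    """G(x) = softmax(KeepTopK(H(x), k)), sparse over experts."""
    h = x @ W_g + rng.standard_normal(W_g.shape[1]) * softplus(x @ W_noise)
    keep = np.argsort(h)[-k:]              # indices of the k largest entries of H(x)
    masked = np.full_like(h, -np.inf)
    masked[keep] = h[keep]                 # KeepTopK: -inf everywhere else
    e = np.exp(masked - h[keep].max())
    return e / e.sum()                     # non-zero only for the k selected experts

rng = np.random.default_rng(0)
d, n_experts = 8, 4
G = noisy_topk_gate(rng.normal(size=d), rng.normal(size=(d, n_experts)),
                    rng.normal(size=(d, n_experts)), k=2, rng=rng)
print(G)   # exactly two non-zero entries, summing to 1
</code></pre>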
<h4 id="expert-choice-routing-zhou-et-al-2022">Expert Choice routing (Zhou et al., 2022)</h4>
<ul>
<li><p>$E$ for total # of experts</p>
</li>
<li><p>$n$ for total number of tokens</p>
</li>
<li><p>Router output $\bold{R} \in \mathbb{R}^{n \times E}$ : routing probabilities</p>
</li>
<li><p>the row $r_{i} \in \mathbb{R}^E$ corresponds to the $i$-th token and distribution over experts (non-negative and sum to 1)</p>
</li>
<li><p>Every expert $e$ independently chooses the $T$ tokens with highest probabilities (top-T per column) and process</p>
</li>
<li><p>parametrize $T$ as $T = C(n/E)$ where $C$ is a capacity factor to control # of tokens per expert (if $C=1$, some token will be processed by multiple experts while others by none)</p>
</li>
<li><p>This makes a model parameter count increase with minimal FLOPs overhead (router computation)</p>
</li>
<li><p>Letting $C &gt; 1$ usually leads to higher performance at a higher compute cost
<img src="https://velog.velcdn.com/images/0404_not_found/post/922b30be-8ec1-4c82-99c0-ce017f784343/image.png" alt=""></p>
</li>
<li><p>$$ S = Softmax(X \cdot W_g), \quad S \in \mathbb{R}^{n\times e} \\
G,\ I = TopK(S^T, k), \quad P = Onehot(I) \in \mathbb{R}^{e \times k \times n}
$$</p>
</li>
<li><p>$G \in \mathbb{R}^{e \times k}$ holds the routing weight of each expert for its selected tokens, and $I$ is an index matrix where $I[i,j]$ is the $j$-th token selected by the $i$-th expert</p>
</li>
<li><p>Then, apply the MoE and gating function in the dense FFN layer</p>
<ul>
<li>input : $X_{in} = P \ \cdot \ X \in \mathbb{R}^{e \times k \times d}$  where $P$ is permutation matrix</li>
<li>$X_{in}[i] \in \mathbb{R}^{k \times d}$ is input for $i$-th expert</li>
<li>output of each expert 
$X_e[i] = \text{GeLU}(X_{in}[i] \cdot W_1[i]) \cdot W_2[i]^T$</li>
<li>Final output $X_{\text{out}}[l,d] = \sum_{i,j}P[i, j, l]G[i, j]X_e[i, j, d]$ where
$l$ indexes the batch dimension and $d$ the model dimension</li>
</ul>
</li>
</ul>
<h2 id="22-architectures">2.2 Architectures</h2>
<ul>
<li><p>Apply the same sparse upcycling recipe to both language and vision tasks, on T5 and ViT (encoder)</p>
</li>
<li><p>ViT : follow V-MoE, but used global average pooling and Expert Choice Routing</p>
</li>
<li><p>T5 : use Expert Choice Routing for encoder, Top-K routing for decoder with $K=2$</p>
</li>
</ul>
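<p>A rough numpy sketch of Expert Choice routing following the $S$, $G$, $I$ notation above: each expert independently picks its top-$T$ tokens column-wise from the routing probabilities. The capacity factor and shapes are illustrative assumptions.</p>
<pre><code class="language-python">import numpy as np

def expert_choice_route(X, W_g, capacity_factor=2.0):
    """Each expert selects its top-T tokens, T = C * n / E."""
    n, e = X.shape[0], W_g.shape[1]
    S = np.exp(X @ W_g)
    S /= S.sum(axis=1, keepdims=True)            # (n, e) routing probabilities, rows sum to 1
    T = int(capacity_factor * n / e)             # tokens per expert
    I = np.argsort(-S, axis=0)[:T].T             # (e, T): I[i, j] = j-th token chosen by expert i
    G = np.take_along_axis(S.T, I, axis=1)       # (e, T): routing weight of each chosen token
    return G, I

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 16))                     # 8 tokens, model dim 16
W_g = rng.normal(size=(16, 4))                   # 4 experts
G, I = expert_choice_route(X, W_g)
print(I.shape, G.shape)                          # (4, 4) each: every expert picks T = 4 tokens
</code></pre>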
<h1 id="3-the-upcycling-algorithm">3. The upcycling Algorithm</h1>
<h4 id="initialize">Initialize</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/1eaae4d4-c82b-423e-b584-60961c5b8e30/image.png" alt=""></p>
<ul>
<li>Use dense model&#39;s parameters (checkpoint) to initialize new Transformer block (same number and shape)</li>
<li>A subset of the MLP layers are expanded into MoE layer</li>
<li>remaining layers are copied to new model</li>
<li>each MoE have a fixed number of experts</li>
<li>each expert is initialized as a copy of the original MLP</li>
<li>After initializing, continue training it for a number of additional steps (considering budget and resources)</li>
</ul>
<h2 id="design-decisions">Design Decisions</h2>
<p>The performance of upcycled models is heavily influenced by the configuration of the MoE layers</p>
<h4 id="router-type">Router Type</h4>
<ul>
<li>ViT : Expert Choice routing with $C=2$ (encoder)</li>
<li>T5 : Expert Choice routing with $C=2$ (encoder), Top-K routing with $K=2$ (decoder)</li>
</ul>
<h4 id="number-of-layers-to-upcycle">Number of layers to upcycle</h4>
<ul>
<li>Adding more MoE increases the model capacity</li>
<li>replace half of the MLP layers of original model with MoE layers</li>
</ul>
<h4 id="number-of-experts-to-add-in-upcycled-layers">Number of Experts to add in upcycled layers</h4>
<ul>
<li>Adding more experts doesn&#39;t significantly affect the FLOPS (the expert capacity is inversely proportional to the number of experts)</li>
<li>Too many experts cause a larger initial quality drop in the upcycled model (this can be overcome with sufficient upcycling compute)</li>
<li>32 experts was good</li>
</ul>
<h4 id="expert-capacity">Expert capacity</h4>
<ul>
<li>Larger expert capacity generally yields higher quality but increases the FLOPs</li>
<li>$C=2$ was good</li>
</ul>
<h4 id="resuming-optimizer-state-vision">Resuming Optimizer State (Vision)</h4>
<ul>
<li>reusing the optimizer state gives a performance boost for vision models (not language)</li>
</ul>
<h4 id="normalize-weights-after-routing-vision">Normalize weights after routing (Vision)</h4>
<ul>
<li><p>To reduce the performance drop from the upcycling model surgery, normalize the router combine weights of each token to sum to 1</p>
<ul>
<li>Each token was previously only processed by a single expert (original dense model)</li>
<li>it was helpful for vision but hurt performance for language (hypothesized to be because the decoder of T5 uses Top-K routing)</li>
</ul>
</li>
</ul>
<h1 id="4-experiments">4. Experiments</h1>
<h2 id="41-experimental-setup">4.1 Experimental Setup</h2>
<ul>
<li>Vision : V-MoE, ImageNet using 10-shot, 5 different training sets, average accuracy</li>
<li>Language : span corruption task on English C4 (pretrain), a proportional mix of all SuperGLUE (fine-tune), dense baseline starting checkpoint (Base), T5 1.1 checkpoints (L, XL)</li>
</ul>
<h2 id="42-results">4.2 Results</h2>
<h3 id="421-core-result">4.2.1 Core Result</h3>
<h4 id="pretraining">Pretraining</h4>
<ul>
<li>With a small amount of extra training, the upcycled models roughly recover the performance of their original checkpoints
<img src="https://velog.velcdn.com/images/0404_not_found/post/d233a7eb-4c47-4bb9-a590-a7b824a5abb9/image.png" alt=""></li>
</ul>
<h4 id="full-fine-tune">Full Fine-Tune</h4>
<ul>
<li>Still, the upcycled model has faster growth of score</li>
<li>For language, the difference is larger
<img src="https://velog.velcdn.com/images/0404_not_found/post/e5a254d9-649d-4136-b55a-6e0344d30695/image.png" alt=""></li>
</ul>
<h4 id="sparse-upcycling-vs-sparse-models-from-scratch">Sparse upcycling vs Sparse models from scratch</h4>
<ul>
<li>training from scratch takes longer to catch up with the upcycled models</li>
<li>For language, it used 120% of original dense checkpoint&#39;s computation to catch up upcycled models</li>
<li>Larger learning rate + experts can develop and diversify from the beginning</li>
<li>Given Large computation budget (&gt; 100% of original dense), training MoE from scratch may be preferable
<img src="https://velog.velcdn.com/images/0404_not_found/post/12f399fb-9f35-4806-93ac-bcfd85ad1ac9/image.png" alt=""></li>
</ul>
<h4 id="sparse-upcycling-vs-warm-starting">Sparse upcycling vs Warm starting</h4>
<ul>
<li>Dense upcycling (depth tiling) replicates layers from dense Base checkpoint to construct new layer
<img src="https://velog.velcdn.com/images/0404_not_found/post/9ea5d33f-bd77-4652-b51c-5a07488140c5/image.png" alt=""></li>
</ul>
<h3 id="422-ablations">4.2.2 Ablations</h3>
<ul>
<li>Vision : B/16 sparse model with 32 experts, $C=1$, 6 MoE layers at the last few blocks, dense checkpoint with 14 epochs + 7 additional epoch</li>
<li>Language : Base with 32 experts, $C=2$, 6 MoE layers interspersed, 0.5M ~ 1M additional steps</li>
</ul>
<h4 id="amount-of-dense-pretraining">Amount of dense pretraining</h4>
<ul>
<li>Regardless of the amount, upcycled model showed higher performance
<img src="https://velog.velcdn.com/images/0404_not_found/post/aeda3c38-56e3-41b1-8ecd-5a79f725ba74/image.png" alt=""></li>
</ul>
<h4 id="router-type-1">Router type</h4>
<ul>
<li>For vision, Top-K routing and Batch Prioritized Routing match the performance of Expert Choice routing but are slightly slower (on a per-step basis)</li>
<li>Top-K underperforms Expert Choice routing on a per-time basis
<img src="https://velog.velcdn.com/images/0404_not_found/post/5ed51719-4088-4bc4-bd6d-82b3e34daef4/image.png" alt=""></li>
</ul>
<h4 id="expert-capacity-factor">Expert Capacity Factor</h4>
<ul>
<li>The more tokens processed by expert, the greater the computation and performance</li>
<li>$C=2$ was best
<img src="https://velog.velcdn.com/images/0404_not_found/post/3d60620f-4891-4ed7-b1ef-7c4e3523e911/image.png" alt=""></li>
</ul>
<h4 id="number-of-moe-layers">Number of MoE layers</h4>
<ul>
<li>More MoE layers is not always better even on a per step basis
<img src="https://velog.velcdn.com/images/0404_not_found/post/05959dd4-d2f0-4789-940d-5c53bfd78166/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/65ae8773-bdf7-4aaa-aa8d-c21cac29afbd/image.png" alt=""></li>
</ul>
<h4 id="initialization-of-experts">Initialization of Experts</h4>
<ul>
<li>copying MLP layer &gt;&gt; train from scratch</li>
<li>adding small Gaussian noise to each copied MLP layer didn&#39;t help (a small amount has no effect, a large amount hurts the performance)
<img src="https://velog.velcdn.com/images/0404_not_found/post/01eb4339-699e-4353-a6a5-bedf57c9cd42/image.png" alt=""></li>
</ul>
<h4 id="number-of-experts">Number of Experts</h4>
<ul>
<li>Adding more experts increases the model parameter count and quality</li>
<li>Using a very large number of experts shows a large initial quality drop (Fig 10, left two)</li>
</ul>
<h1 id="5-conclusion">5. Conclusion</h1>
<ul>
<li>Provided Simple recipe to reuse pretrained dense checkpoints to initialize more powerful sparse models</li>
<li>Smooth transition from dense to MoE</li>
<li>Applicable for vision and language</li>
<li>Upcycling of model</li>
</ul>
<h1 id="6-comment">6. Comment</h1>
<p> This was a different kind of MoE from what I had expected. It was interesting to see the idea of using a router to select experts, as well as the novel Expert Choice idea.</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[SOLAR 10.7B: Scaling LLMs with Simple yet Effective Depth Up-Scaling]]></title>
            <link>https://velog.io/@0404_not_found/SOLAR-10.7B-Scaling-LLMs-with-Simple-yet-Effective-Depth-Up-Scaling</link>
            <guid>https://velog.io/@0404_not_found/SOLAR-10.7B-Scaling-LLMs-with-Simple-yet-Effective-Depth-Up-Scaling</guid>
            <pubDate>Tue, 09 Jan 2024 12:39:22 GMT</pubDate>
            <description><![CDATA[<h1 id="1-introduction">1. Introduction</h1>
<ul>
<li><p>Recent LLMs are scaled up following the performance scaling law $\rightarrow$ MoE</p>
<ul>
<li>Often require non-trivial changes to the training and inference framework</li>
<li>hinders widespread applicability</li>
</ul>
</li>
<li><p>Scaling up while retaining <strong>simplicity</strong> is important</p>
</li>
</ul>
<h4 id="depth-up-scaling-dus">Depth Up-Scaling (DUS)</h4>
<ul>
<li>Scaling the base model along the depth dimension and continually pretraining the scaled model</li>
<li>Not using MoE</li>
<li>No additional modules</li>
<li>No changes to the training or inference framework</li>
<li>Applicable to any Transformer architecture</li>
<li>Solar &gt; Mistral 7B, LLaMA 7b</li>
<li>Solar-Instruct &gt; Mixtral-8x7b</li>
</ul>
<h1 id="2-depth-up-scaling">2. Depth Up-Scaling</h1>
<ul>
<li>Use pretrained weights of base models to scale up</li>
<li>continually pretrain the scaled model
<img src="https://velog.velcdn.com/images/0404_not_found/post/d06a457f-037f-4b92-b2bf-99589166ece7/image.png" alt=""></li>
</ul>
<h4 id="base-model">Base Model</h4>
<ul>
<li>Any $n$-layer transformer architecture is OK (used 32-layer Llama 2)</li>
<li>Initialized Llama-2 architecture + pretrained weights from Mistral-7B</li>
</ul>
<h4 id="depthwise-scaling">Depthwise Scaling</h4>
<ul>
<li>From the base model with $n$ layers, set the target layer count $s$ for the scaled model (largely dictated by the available hardware)</li>
<li>Copy the base model</li>
<li>Remove the final $m$ layers from the original model and the initial $m$ layers from the duplicated model</li>
<li>Concatenate the two to form a model with $s = 2 \cdot(n-m)$ layers ($n=32$, $m=8$, $s=48$ for SOLAR; see the sketch after this list)</li>
</ul>
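<p>A minimal sketch of the depthwise scaling step above for $n=32$, $m=8$, $s=48$, treating the model as an ordered list of transformer-block identifiers; the helper name <code>depth_up_scale</code> is illustrative, not from the paper.</p>
<pre><code class="language-python"># Depthwise scaling: drop the final m layers from one copy and the initial m
# layers from the other, then concatenate to get s = 2 * (n - m) layers.
from copy import deepcopy

def depth_up_scale(base_layers, m):
    n = len(base_layers)
    first = deepcopy(base_layers[: n - m])  # original copy minus its final m layers
    second = deepcopy(base_layers[m:])      # duplicate minus its initial m layers
    return first + second

base = [f"layer_{i}" for i in range(32)]
scaled = depth_up_scale(base, m=8)
print(len(scaled))  # 48
</code></pre>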
<h4 id="continued-pretraining">Continued Pretraining</h4>
<ul>
<li><p>The performance of the scaled model initially drops below that of the base model</p>
</li>
<li><p>Rapid performance recovery is then observed during continued pretraining</p>
</li>
<li><p>The particular way of depthwise scaling isolates the heterogeneity in the scaled model</p>
<ul>
<li>if we simply repeated all layers so that the total layer count becomes $2n$, the <strong>layer distance</strong>, i.e. the difference in layer indices, would be too large at the seam</li>
<li>SOLAR instead sacrifices the $2m$ middle layers, thereby reducing the discrepancy at the seam</li>
<li>the success of DUS is obtained by both depthwise scaling and continued pretraining</li>
</ul>
</li>
</ul>
<h4 id="comparison-to-other-up-scaling-methods">Comparison to other up-scaling methods</h4>
<ul>
<li>DUS does not require a distinct training framework, additional modules (e.g., gating networks or dynamic expert selection), or specialized CUDA kernels</li>
<li>It integrates seamlessly into existing training and inference frameworks with high efficiency</li>
</ul>
<h1 id="3-training-details">3. Training Details</h1>
<h4 id="instruction-tuning">Instruction Tuning</h4>
<ul>
<li><p>QA Format + synthesized math QA dataset</p>
<ul>
<li>seed math data from <em>Math dataset</em> only to avoid contamination</li>
<li>using a process similar to MetaMath, rephrase the question and answers of the seed data $\rightarrow$ <em>Synth. Math-Instruct</em></li>
</ul>
</li>
</ul>
<h4 id="alignment-tuning">Alignment Tuning</h4>
<ul>
<li>The instruction-tuned model is further fine-tuned with DPO to be more aligned with the preferences of humans or of strong AI such as GPT-4</li>
<li>Open-Source + Synth.Math-Instruct</li>
<li>Speculated that the rephrased answer is better than the original answer</li>
<li>Made DPO tuples of {prompt (rephrased question), chosen (rephrased answer), rejected (original answer)} $\rightarrow$ <em>Synth.Math-Alignment</em> (see the sketch after this list)</li>
</ul>
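<p>A sketch of building DPO tuples in the <em>Synth.Math-Alignment</em> style, where the rephrased answer is treated as chosen and the original answer as rejected; the field names and the example sample are made up for illustration.</p>
<pre><code class="language-python"># Build {prompt, chosen, rejected} DPO tuples from rephrased math QA samples.
def build_dpo_tuples(samples):
    tuples = []
    for s in samples:
        tuples.append({
            "prompt": s["rephrased_question"],
            "chosen": s["rephrased_answer"],    # rephrased answer assumed better
            "rejected": s["original_answer"],
        })
    return tuples

samples = [{
    "rephrased_question": "A train covers 120 km in 2 hours. What is its speed?",
    "rephrased_answer": "Speed = distance / time = 120 / 2 = 60 km/h.",
    "original_answer": "60",
}]
print(build_dpo_tuples(samples)[0]["chosen"])
</code></pre>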
<h1 id="4-result">4. Result</h1>
<h4 id="training-dataset">Training Dataset</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/fe3e6c19-c11d-46ec-acff-d47a1cc3ca67/image.png" alt=""></p>
<ul>
<li>Didn&#39;t always use all of the datasets</li>
<li>Synth. Math-Instruct can be replaced with MetaMathQA</li>
</ul>
<h4 id="result">Result</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/133fe857-9396-4579-b757-ad2ca2540c63/image.png" alt=""></p>
<ul>
<li>Merged some of the models trained during the instruction and alignment tuning stages</li>
<li>Implemented their own merging method (a generic merging sketch follows below)</li>
</ul>
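<p>The paper uses its own (unspecified) merging method; as a generic stand-in only, the sketch below shows plain elementwise weight interpolation of two fine-tuned checkpoints that share an architecture. This is not SOLAR&#39;s actual procedure.</p>
<pre><code class="language-python"># Generic checkpoint merging by elementwise weight interpolation (illustrative only).
import numpy as np

def interpolate_checkpoints(state_dict_a, state_dict_b, alpha=0.5):
    return {
        name: alpha * state_dict_a[name] + (1.0 - alpha) * state_dict_b[name]
        for name in state_dict_a
    }

ckpt_a = {"w": np.ones((2, 2))}
ckpt_b = {"w": np.zeros((2, 2))}
merged = interpolate_checkpoints(ckpt_a, ckpt_b)  # elementwise mean of the weights
print(merged["w"])
</code></pre>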
<h4 id="ablation-on-instruction-tuning">Ablation on Instruction Tuning</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/90b4d8b3-4315-4c98-8c8b-2c62ecb077b6/image.png" alt=""></p>
<ul>
<li>Alpaca-GPT4 and OpenOrca make the model behave differently (SFT v1 vs. SFT v2)</li>
<li>Synth. Math-Instruct was helpful (SFT v3, SFT v4)</li>
<li>Merging models that specialize in different tasks is a promising way to obtain a model that performs well generally</li>
</ul>
<h4 id="ablation-on-alignment-tuning">Ablation on Alignment Tuning</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/803a90ef-406b-4cf0-9aab-663ce0662178/image.png" alt=""></p>
<ul>
<li>Adding Synth. Math-Alignment was helpful</li>
<li>Merging was not beneficial here, as DPO v2 is a strict improvement over DPO v1</li>
</ul>
<h4 id="ablation-on-sft-base-models">Ablation on SFT base models</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/71bf037a-5f90-41fb-98a4-87ecd601fb94/image.png" alt=""></p>
<ul>
<li>the performance gaps in certain tasks in the SFT base models don&#39;t always carry over to the alignment-tuned models</li>
</ul>
<h4 id="ablation-on-merge-methods">Ablation on Merge Methods</h4>
<p><img src="https://velog.velcdn.com/images/0404_not_found/post/97d8f80a-a251-4b9a-bc6f-225a3cab8dda/image.png" alt="">
<img src="https://velog.velcdn.com/images/0404_not_found/post/59facc7d-b11f-435a-93e3-2a54a737d28b/image.png" alt=""></p>
<ul>
<li>As long as the merge candidates have sufficiently different strengths, the merging method may not be as crucial.</li>
<li>Merge 1 is SOLAR 10.7B-Instruct</li>
</ul>
<h1 id="5-conclusion">5. Conclusion</h1>
<ul>
<li>Depth up-scaled model SOLAR 10.7B</li>
<li>DUS is effective for scaling LLMs up from smaller base models</li>
</ul>
]]></description>
        </item>
    </channel>
</rss>