<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
    <channel>
        <title>6X10</title>
        <link>https://velog.io/</link>
        <description></description>
        <lastBuildDate>Thu, 13 Jun 2024 07:30:33 GMT</lastBuildDate>
        <docs>https://validator.w3.org/feed/docs/rss2.html</docs>
        <generator>https://github.com/jpmonette/feed</generator>
        <copyright>Copyright (C) 2019. 6X10. All rights reserved.</copyright>
        <atom:link href="https://v2.velog.io/rss/kyung_yu" rel="self" type="application/rss+xml"/>
        <item>
            <title><![CDATA[[Paper Review] On structuring probabilistic dependences in stochastic language modeling (KN Smoothing)]]></title>
            <link>https://velog.io/@kyung_yu/Paper-Review-On-structuring-probabilistic-dependences-in-stochastic-language-modeling-KN-Smoothing</link>
            <guid>https://velog.io/@kyung_yu/Paper-Review-On-structuring-probabilistic-dependences-in-stochastic-language-modeling-KN-Smoothing</guid>
            <pubDate>Thu, 13 Jun 2024 07:30:33 GMT</pubDate>
            <description><![CDATA[<h2 id="before-the-beginning">Before the beginning,,,</h2>
<hr>
<h3 id="n-gram-model">N-gram model</h3>
<p>Using the definition of conditional probabilities, we can decompose the joint probability of an entire sequence of words with the chain rule of probability:
<img src="https://velog.velcdn.com/images/kyung_yu/post/7a39dd16-1f10-4878-b709-04b09a78d3d5/image.png" alt=""></p>
<p><strong>The intuition</strong> of the n-gram model is that instead of computing the probability of a word given its entire history, we can <strong>approximate</strong> the history by just the last few words.
<img src="https://velog.velcdn.com/images/kyung_yu/post/8d14d152-d8dc-48e8-9e75-da98b1eb436f/image.png" alt=""></p>
<p><strong>The assumption</strong> that the probability of a word depends only on the previous word is called a Markov assumption. Markov models are the class of probabilistic models that assume <strong>we can predict the probability of some future unit without looking too far into the past</strong>. We can generalize the bigram (which looks one word into the past) to the trigram (which looks two words into the past) and thus to the n-gram (which looks n − 1 words into the past).</p>
<p><strong>General equation for this n-gram approximation</strong> to the conditional probability of the next word in a sequence. We’ll use N here to mean the n-gram size, so N = 2 means bigrams and N = 3 means trigrams. Then we approximate the probability of a word given its entire context as follows:
<img src="https://velog.velcdn.com/images/kyung_yu/post/4a007226-2ff5-46aa-99f1-f2ff60819a6d/image.png" alt=""></p>
<p>To estimate probabilities, we use <strong>maximum likelihood estimation</strong> or <strong>MLE</strong>. For the general case of MLE n-gram parameter estimation:
<img src="https://velog.velcdn.com/images/kyung_yu/post/faab4be9-ed22-445f-a288-07ebd853126e/image.png" alt="">
where $C(w_{n-N+1:n-1}w_n)$ is the count of the n-gram $w_{n-N+1:n-1}w_n$,
and $C(w_{n-N+1:n-1})$ is the count of the prefix $w_{n-N+1:n-1}$: the sum of the counts of all n-grams that start with a given prefix must be equal to the count of that prefix itself.</p>
<p>=&gt; The equation estimates the n-gram probability by dividing the observed frequency of a particular sequence by the observed frequency of its prefix.
This ratio is called a <strong>relative frequency</strong>.</p>
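<p>As a sketch (my own toy example, not from the text), the relative-frequency estimate above can be computed for bigrams in a few lines of Python:</p>

```python
from collections import Counter

def mle_bigram_probs(tokens):
    """Estimate P(w_n | w_{n-1}) by relative frequency:
    count(w_{n-1} w_n) / count(w_{n-1})."""
    bigrams = Counter(zip(tokens, tokens[1:]))
    prefixes = Counter(tokens[:-1])  # prefix (unigram) counts
    return {(w1, w2): c / prefixes[w1] for (w1, w2), c in bigrams.items()}

corpus = "the cat sat on the mat".split()
probs = mle_bigram_probs(corpus)
# P(cat | the) = count(the cat) / count(the) = 1 / 2
print(probs[("the", "cat")])  # 0.5
```

<p>Note that the probabilities of all continuations of a given prefix sum to 1, exactly because the prefix count equals the sum of the n-gram counts starting with it.</p>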
<p><strong>We always represent and compute language model probabilities in log space, as log probabilities.</strong> This is because: </p>
<ol>
<li>Since probabilities are (by definition) less than or equal to 1, the more probabilities we multiply together, the smaller the product becomes. </li>
<li>Adding in log space is equivalent to multiplying in linear space, so we combine log probabilities by adding them.
<img src="https://velog.velcdn.com/images/kyung_yu/post/8ef5d33d-7822-432c-a880-4505cf5be726/image.png" alt=""></li>
</ol>
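<p>A quick check of both points (a toy example of my own): multiplying the probabilities directly and summing their logs give the same result, but the log-space version avoids underflow for long sequences.</p>

```python
import math

probs = [0.1, 0.05, 0.2, 0.01]  # toy per-word probabilities

# Multiplying many small probabilities drives the product toward 0;
# summing log probabilities is numerically stable.
product = 1.0
for p in probs:
    product *= p

log_sum = sum(math.log(p) for p in probs)

# The two agree up to floating-point error:
assert math.isclose(math.exp(log_sum), product)
print(log_sum)
```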
<h2 id="knerser-ney-smoothing">Kneser-Ney Smoothing</h2>
<hr>
<h3 id="2-smoothing-methods">2. Smoothing methods</h3>
<h4 id="21-multinomial-distribution-and-maximum-likelihood-estimation">2.1. Multinomial distribution and maximum likelihood estimation</h4>
<p>Here, we will use the term <strong>event</strong> or <strong>event class</strong> to describe the type of observations we are considering. When we are looking at word bigrams, the events will be the possible word pairs. In discussing conditional bigram probabilities for a fixed predecessor word, the events will be the single words that may follow the fixed predecessor word.</p>
<p>Let us denote </p>
<ul>
<li>the event classes under consideration by $k=1,..., K$,</li>
<li>their sample counts by $N(k)$, i.e. how often class $k$ was observed in the training text, </li>
<li>the corresponding probabilities by $p(k)$, i.e. the chance of observing class $k$.</li>
</ul>
<p>For a set of $N$ observations (in training or testing), we have $N$ independent trials with $K$ possible outcomes, where the sample counts $N(k)$ denote the number of trials resulting in class or outcome $k$. The distribution of the sample counts $N(k)$ is referred to as multinomial (or polynomial) distribution (Lehmann, 1983):
<img src="https://velog.velcdn.com/images/kyung_yu/post/7fa9f140-4da3-4f08-93cd-8527bb5e8ab9/image.png" alt=""></p>
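<p>As a quick illustration (my own toy numbers, not from the paper), the multinomial probability of a particular vector of sample counts $N(k)$ can be computed directly:</p>

```python
from math import factorial, prod

def multinomial_pmf(counts, probs):
    """P(N(1)=counts[0], ..., N(K)=counts[-1]) for N = sum(counts)
    independent trials with class probabilities probs."""
    n = sum(counts)
    coeff = factorial(n) / prod(factorial(c) for c in counts)
    return coeff * prod(p ** c for p, c in zip(probs, counts))

# e.g. N = 4 trials over K = 3 classes with p = (0.5, 0.3, 0.2)
print(multinomial_pmf((2, 1, 1), (0.5, 0.3, 0.2)))  # ≈ 0.18
```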
<p>As we will see later, it is helpful to <strong>partition the event classes $k$ into equivalence classes according to their sample counts $r = N(k)$</strong> (Good, 1953), which is referred to as the symmetry requirement (Nadas, 1985). Let $n_r$ be the number of event classes which occurred in the training text exactly $r$ times and $p_r&gt;0$ be the corresponding probability, i.e. we define:
<img src="https://velog.velcdn.com/images/kyung_yu/post/080a4a2a-b520-44cd-ae3f-7b4862ce738f/image.png" alt="">
where we have: $n_r=0$ for $r&gt;R$. </p>
<p>The log-likelihood $F$ can then be written in terms of the sample counts $N(k)$ and of the equivalence class counts $n_r$:
<img src="https://velog.velcdn.com/images/kyung_yu/post/e8828b02-c1d0-4db6-989d-6a3633543823/image.png" alt=""></p>
<h4 id="22-floor-method-and-linear-interpolation">2.2. Floor method and linear interpolation</h4>
<p><strong>To guarantee non-zero probability estimates</strong>, we can modify the sample counts $N(k)$ by adding a floor value which is chosen proportional to the probabilities $q(k)$ of a less specific distribution, e.g. the unigram distribution or the uniform distribution. After renormalization, we then obtain the probability estimates: </p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/6e8d2f8a-ebdd-4379-8833-563d71fde10b/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/0f6be949-a8f7-47b7-afc5-78bcdaad8590/image.png" alt=""></p>
<p>-&gt; <strong>linear interpolation</strong>: each sample count $N(k)$ is reduced (&quot;discounted&quot;) by a value $\lambda N(k)$. However, it can be argued that the higher the sample count $N(k)$, the higher its contribution should be in the interpolation model. </p>
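<p>A minimal sketch of this linear interpolation (toy counts of my own, with a uniform distribution playing the role of the less specific $q(k)$):</p>

```python
def interpolate(counts, q, lam=0.1):
    """Linear interpolation of the relative-frequency estimate with a
    less specific distribution q (e.g. unigram or uniform):
    p(k) = (1 - lam) * N(k)/N + lam * q(k)."""
    n = sum(counts.values())
    return {k: (1 - lam) * counts.get(k, 0) / n + lam * qk
            for k, qk in q.items()}

counts = {"a": 3, "b": 1, "c": 0}
uniform = {k: 1 / 3 for k in counts}   # the less specific distribution
p = interpolate(counts, uniform, lam=0.3)
print(p)   # the unseen event "c" now has non-zero probability
assert abs(sum(p.values()) - 1.0) < 1e-12
```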
<h4 id="24-discounting-models">2.4 Discounting models</h4>
<p>We introduce a general discounting function $d: k \to d(k)$ whose value has to be subtracted from every sample count $N(k)$, and obtain the following smoothing formula:</p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/62b5d7b2-ae23-4d1f-b401-23384dd30533/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/044b843b-c7d4-425f-806c-c4bbe3629015/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/819ed825-fba5-4a8a-a951-4fb8a7b65eec/image.png" alt=""></p>
<p>A constant value $D$ with the range constraint $0&lt;D&lt;1$ can be intuitively interpreted as a correction that lies well within the range of the &quot;discretization noise&quot; of the discrete counts $N(k)$. This nonlinear interpolation model has the following properties:</p>
<ul>
<li>The weight of the more general distribution $q(k)$ is proportional to $(K-n_0)$, which is the total number of different events $k$ that were observed. This property is particularly attractive for modelling conditional probabilities: if a given word predecessor is followed by only one or a few different words, the smoothing effect will be much smaller than in the case where it is followed by many different words.</li>
<li>Setting $D=1$ amounts to pooling all singletons, i.e. events $k$ with $N(k)=1$, with the unseen events. As Katz (Katz, 1987) pointed out in a similar context, there is not much difference between seeing an event just once and not at all.</li>
<li>Although the interpolation is nonlinear, there is always a smoothing effect in that a weighted average between the two distributions is computed. This is different from Katz&#39;s backing-off approach (Katz, 1987), where a choice must be made between the more specific and the more general distribution.</li>
</ul>
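<p>The absolute discounting model described above can be sketched as follows (a toy implementation under my own naming; $q(k)$ is taken to be uniform):</p>

```python
def absolute_discount(counts, q, D=0.5):
    """Nonlinear interpolation by absolute discounting: subtract a
    constant D (0 < D < 1) from every positive count and give the
    freed mass D*S/N to the more general distribution q, where S is
    the number of distinct observed events (i.e. K - n_0)."""
    n = sum(counts.values())
    seen = sum(1 for c in counts.values() if c > 0)
    backoff = D * seen / n
    return {k: max(counts.get(k, 0) - D, 0) / n + backoff * qk
            for k, qk in q.items()}

counts = {"a": 3, "b": 1, "c": 0}
uniform = {k: 1 / 3 for k in counts}
p = absolute_discount(counts, uniform, D=0.5)
assert abs(sum(p.values()) - 1.0) < 1e-12
print(p)
```

<p>Unlike linear interpolation, the correction here is a constant per event, so high counts keep almost all of their probability mass.</p>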
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] RNN + Long Short-Term Memory]]></title>
            <link>https://velog.io/@kyung_yu/Paper-Review-Long-Short-Term-Memory</link>
            <guid>https://velog.io/@kyung_yu/Paper-Review-Long-Short-Term-Memory</guid>
            <pubDate>Sat, 13 Apr 2024 14:04:10 GMT</pubDate>
            <description><![CDATA[<h2 id="before-the-beginning">Before the beginning,,,</h2>
<hr>
<h3 id="what-is-recurrent-neural-networks-rnn">What is Recurrent Neural Networks (RNN)?</h3>
<blockquote>
<p>RNNs use the idea of processing sequential information. The term <strong>“recurrent”</strong> applies as they perform the same task over each instance of the sequence such that the output is dependent on the previous computations and results. Generally, a fixed-size vector is produced to represent a sequence by feeding tokens one by one to a recurrent unit. In a way, RNNs have <strong>“memory”</strong> over previous computations and use this information in current processing. </p>
</blockquote>
<ul>
<li>Structure of Simple RNN
<img src="https://velog.velcdn.com/images/kyung_yu/post/733dd196-5448-4428-a92e-173e2f7c3230/image.png" alt=""></li>
</ul>
<p>$x_t$: the input to the network at time step $t$
$s_t$: the hidden state at time step $t$ </p>
<p>The calculation of $s_t$ is given by the equation:
<img src="https://velog.velcdn.com/images/kyung_yu/post/d3b41ff6-f79b-457f-9187-34bd0a2f5dbb/image.png" alt=""></p>
<p>-&gt; $s_t$ is calculated based on the current input and the previous time step&#39;s hidden state
-&gt; $s_t$ is considered as the network’s memory element that accumulates information from other time steps</p>
<p>The function $f$: a non-linear transformation such as $\tanh$ or ReLU
$U, V, W$: weights that are shared across time</p>
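<p>A minimal numpy sketch of the recurrence (assuming the usual form $s_t = \tanh(Ux_t + Ws_{t-1})$; toy dimensions and random weights of my own, with the output projection $V$ omitted):</p>

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 3, 4

# Weights shared across all time steps, as in the text.
U = rng.normal(size=(d_hid, d_in))   # input -> hidden
W = rng.normal(size=(d_hid, d_hid))  # hidden -> hidden

def rnn_states(xs, U, W):
    """s_t = tanh(U x_t + W s_{t-1}), with s_0 = 0."""
    s = np.zeros(d_hid)
    states = []
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    return states

xs = [rng.normal(size=d_in) for _ in range(5)]
states = rnn_states(xs, U, W)
print(states[-1].shape)  # (4,)
```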
<ul>
<li>Properties of RNN<ul>
<li>Pros<ul>
<li>Given that an RNN performs sequential processing by modeling units in sequence, it has the ability to capture the inherent sequential nature present in language, where units are characters, words or even sentences.</li>
<li>RNNs have flexible computational steps that provide better modeling capability and create the possibility to capture unbounded context.</li>
</ul>
</li>
<li>Cons<ul>
<li>Simple RNN networks suffer from the infamous vanishing gradient problem, which makes it really hard to learn and tune the parameters of the earlier layers in the network.</li>
</ul>
</li>
</ul>
</li>
</ul>
<h3 id="conventional-bptt-eg-williams-and-zipser-1992">Conventional BPTT (e.g., Williams and Zipser 1992)</h3>
<ul>
<li><p>Network Architecture</p>
<ul>
<li><p>Let the network have $n$ units, with $m$ external input lines. </p>
</li>
<li><p>Let $y(t)$ denote the $n$-tuple of outputs of the units in the network at time $t$.</p>
</li>
<li><p>Let $x^{net}(t)$ denote the $m$-tuple of external input signals to the network at time t. </p>
</li>
<li><p>We also define $x(t)$ to be the $(m+ n)$-tuple obtained by concatenating $x^{net}(t)$ and $y(t)$ in some convenient fashion. </p>
</li>
<li><p>Let $U$ denote the set of indices $k$ such that $x_k$, the $k^{th}$ component of $x$, is the output of a unit in the network.</p>
</li>
<li><p>Let $I$ denote the set of indices $k$ for which $x_k$ is an external input. </p>
</li>
<li><p>Furthermore, we assume that the indices on $y$ and $x^{net}$ are chosen to correspond to those of x, so that
<img src="https://velog.velcdn.com/images/kyung_yu/post/dedef3f5-77db-4381-bff0-a7a51291013c/image.png" alt=""></p>
</li>
<li><p>Let $W$ denote the weight matrix for the network, with a unique weight between every pair of units and also from each input line to each unit.</p>
</li>
<li><p>The element $w_{ij}$ represents the weight on the connection to the $i^{th}$ unit from either the $j^{th}$ unit, if $j \in U$, or the $j^{th}$ input line, if $j \in I$. </p>
</li>
<li><p>Furthermore, note that to accommodate a bias for each unit we simply include among the $m$ input lines one input whose value is always 1; the corresponding column of the weight matrix contains as its $i^{th}$ element the bias for unit $i$. In general, our naming convention dictates that we regard the weight $w_{ij}$ as having $x_j$ as its “presynaptic” signal and $y_i$ as its “postsynaptic” signal. 
<img src="https://velog.velcdn.com/images/kyung_yu/post/5b1d321d-e07a-4cb6-9916-c03835078dd7/image.png" alt=""></p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/2af13e48-74e6-4b81-a5cc-4c205e180a6f/image.png" alt=""></p>
</li>
<li><p>For each k, the intermediate variable $s_{k}(t)$ represents the net input to the $k^{th}$ unit at time $t$. Its value at time $t+1$ is computed in terms of both the state of and input to the network at time $t$ by
<img src="https://velog.velcdn.com/images/kyung_yu/post/091f78b5-603c-49b6-94e4-ac6c5bf07281/image.png" alt=""></p>
<p>The longer one clarifies how the unit outputs and the external inputs are both used in the computation, while the more compact expression illustrates why we introduced $x$ and the corresponding indexing convention above. </p>
</li>
<li><p>The output of such a unit at time $t+1$ is then expressed in terms of the net input by
<img src="https://velog.velcdn.com/images/kyung_yu/post/87099c41-dfc4-41bd-a380-83a82fa0bbd1/image.png" alt=""></p>
<p>where $f_k$ is the unit&#39;s squashing function. </p>
</li>
<li><p>In those cases where a specific assumption about these squashing functions is required (such as differentiability), it will be assumed that all units use the logistic function.</p>
</li>
</ul>
</li>
<li><p>Network Performance Measure</p>
<ul>
<li><p>Assume that the task to be performed by the network is <em>a sequential supervised learning task</em>, meaning that certain of the units’ output values are to match specified target values (which we also call <em>teacher signals</em>) at specified times. </p>
</li>
<li><p>Let $T(t)$ denote the set of indices $k \in U$ for which there exists a specified target value $d_k(t)$ that the output of the $k^{th}$ unit should match at time $t$. Then define a time-varying $n$-tuple $e$ by 
<img src="https://velog.velcdn.com/images/kyung_yu/post/b6debf3f-00ea-4c55-90ec-c859c36319f5/image.png" alt=""></p>
<p>-&gt; Note that this formulation allows for the possibility that target values are specified for different units at different times. </p>
<ul>
<li><p>Define the negative of the overall network error at time $t$:
<img src="https://velog.velcdn.com/images/kyung_yu/post/ce9a4e0a-720c-4815-a24b-6bbbee911033/image.png" alt=""></p>
</li>
<li><p>A natural objective of learning might be to maximize the negative of the total error over some appropriate time period $(t&#39;,t]$.
<img src="https://velog.velcdn.com/images/kyung_yu/post/6c5444ed-2f64-4934-8ad2-db131eb0721c/image.png" alt=""></p>
</li>
<li><p>One natural way to make the weight changes is along a constant positive multiple of the performance measure gradient, so that 
<img src="https://velog.velcdn.com/images/kyung_yu/post/00d8aef1-c307-4ae4-85be-b703ddd213c9/image.png" alt=""></p>
</li>
</ul>
<p>for each $i$ and $j$, where $\eta$ is a positive learning rate parameter. </p>
</li>
</ul>
</li>
</ul>
<h2 id="lstm">LSTM</h2>
<hr>
<h3 id="1-introduction">1. Introduction</h3>
<ul>
<li>With conventional &quot;Back-Propagation Through Time&quot; (BPTT) or &quot;Real-Time Recurrent Learning&quot; (RTRL), error signals &quot;flowing backwards in time&quot; tend to either (1) blow up or (2) vanish
(For more on BPTT, see: <a href="https://velog.io/@jody1188/BPTT">https://velog.io/@jody1188/BPTT</a>)</li>
<li><blockquote>
<p>LSTM is designed to overcome these error back-flow problems. It can learn to bridge time intervals in excess of 1000 steps even in case of noisy, incompressible input sequences, without loss of short time lag capabilities. </p>
</blockquote>
</li>
</ul>
<h3 id="3-constant-error-backprop">3. Constant Error Backprop</h3>
<h4 id="31-exponentially-decaying-error">3.1 Exponentially Decaying Error</h4>
<h4 id="32-constant-error-flow-naive-approach">3.2 Constant Error Flow: NAIVE APPROACH</h4>
<ul>
<li>A single unit<ul>
<li>To avoid vanishing error signals, at time $t$, $j$&#39;s local error back flow is 
<img src="https://velog.velcdn.com/images/kyung_yu/post/6c6a22a6-8593-43b7-ad1c-8d408547c413/image.png" alt="">  <ul>
<li>To enforce <em>constant</em> error flow through $j$, we require
<img src="https://velog.velcdn.com/images/kyung_yu/post/b4f1f52b-0e6f-4d51-8a3f-106d7e1f2925/image.png" alt=""></li>
</ul>
</li>
</ul>
</li>
<li>The constant error carrousel<ul>
<li>Integrating the differential equation above, we obtain
<img src="https://velog.velcdn.com/images/kyung_yu/post/3d1b945c-55f0-4a09-af5a-143d8a94252b/image.png" alt=""></li>
<li>This means: $f_j$ has to be linear, and unit $j$&#39;s activation has to remain constant:
<img src="https://velog.velcdn.com/images/kyung_yu/post/3b336fec-367e-4212-a0df-d4b9c971a205/image.png" alt=""></li>
<li>In the experiments, this will be ensured by using the identity function for $f_j$: $f_j(x)=x, \forall x$, and by setting $w_{jj}=1.0$. We refer to this as the <strong>constant error carrousel (CEC)</strong>.</li>
</ul>
</li>
</ul>
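<p>A tiny numerical check of the CEC idea (my own toy sketch): backpropagating an error signal through a single self-connected unit multiplies it by $f'_j \cdot w_{jj}$ at every step, so only $f'_j \cdot w_{jj} = 1$ keeps it constant.</p>

```python
def backflow(w, f_prime, steps):
    """Error signal after backpropagating `steps` times through a
    single self-connected unit: each step multiplies by f'(net) * w."""
    e = 1.0
    for _ in range(steps):
        e *= f_prime * w
    return e

# CEC: identity activation (f' = 1) and w_jj = 1.0 -> constant error.
assert backflow(1.0, 1.0, 1000) == 1.0
# |f' * w| < 1 vanishes, |f' * w| > 1 blows up:
print(backflow(0.9, 1.0, 100))   # ~2.7e-5
print(backflow(1.1, 1.0, 100))   # ~1.4e4
```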
<h3 id="4-long-short-term-memory">4. LONG SHORT-TERM MEMORY</h3>
<ul>
<li>Memory cells and gate units<ul>
<li>To construct an architecture that allows for <strong>constant error flow</strong> through special, self-connected units without the disadvantages of the naive approach, we extend the <strong>constant error carrousel CEC</strong> embodied by the self-connected, linear unit $j$ from Section 3.2 by introducing additional features.</li>
<li>A <strong>multiplicative $input$ $gate$ $unit$</strong> is introduced to <strong>protect the memory contents</strong> stored in $j$ from perturbation by irrelevant inputs. Likewise, a <strong>multiplicative $output$ $gate$ $unit$</strong> is introduced which protects other units from perturbation by currently <strong>irrelevant memory contents</strong> stored in $j$. </li>
</ul>
</li>
</ul>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/010e2b05-c165-4b7d-ab71-cafaeecc19b2/image.png" alt=""></p>
<ul>
<li>$c_j$: the $j^{th}$ memory cell</li>
</ul>
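<p>As a rough sketch (toy scalar weights and my own naming; like the original architecture, no forget gate), one step of a single memory cell with input and output gates might look like:</p>

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, c_prev, w_in, w_out, w_c):
    """One step of a single memory cell in the spirit of the original
    LSTM: the input gate protects the cell state from irrelevant
    inputs, the output gate protects other units from irrelevant
    cell contents. Weights here are toy scalars."""
    g_in = sigmoid(w_in * x)     # input gate activation
    g_out = sigmoid(w_out * x)   # output gate activation
    c = c_prev + g_in * np.tanh(w_c * x)  # CEC: self-weight fixed at 1.0
    y = g_out * np.tanh(c)
    return c, y

c = 0.0
for x in [0.5, -1.0, 2.0]:
    c, y = lstm_cell_step(x, c, w_in=1.0, w_out=1.0, w_c=1.0)
print(c, y)
```

<p>The key design choice is visible in the update of c: the cell state is carried forward with weight 1.0, so error flowing back along it neither blows up nor vanishes.</p>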
<p>&lt;References&gt;</p>
<ul>
<li><a href="https://arxiv.org/pdf/1708.02709.pdf">https://arxiv.org/pdf/1708.02709.pdf</a></li>
<li><a href="https://gwern.net/doc/ai/nn/rnn/1995-williams.pdf">https://gwern.net/doc/ai/nn/rnn/1995-williams.pdf</a></li>
<li><a href="https://www.semanticscholar.org/paper/13-Gradient-Based-Learning-Algorithms-for-Recurrent-Williams-Zipser/4983823eb66ed5d8557f20dd5c8a09ed66f05c25">https://www.semanticscholar.org/paper/13-Gradient-Based-Learning-Algorithms-for-Recurrent-Williams-Zipser/4983823eb66ed5d8557f20dd5c8a09ed66f05c25</a></li>
<li><a href="https://www.bioinf.jku.at/publications/older/2604.pdf">https://www.bioinf.jku.at/publications/older/2604.pdf</a></li>
<li><a href="https://medium.com/@mfouzan144/understanding-lstm-gru-and-rnn-architectures-e0b3a0c1d741">https://medium.com/@mfouzan144/understanding-lstm-gru-and-rnn-architectures-e0b3a0c1d741</a></li>
<li><a href="https://velog.io/@choonsik_mom/pytorch%EB%A1%9C-LSTM-%EA%B5%AC%ED%98%84%ED%95%98%EA%B8%B0">https://velog.io/@choonsik_mom/pytorch로-LSTM-구현하기</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[What is Apache Airflow?]]></title>
            <link>https://velog.io/@kyung_yu/What-is-the-Apache-Airflow</link>
            <guid>https://velog.io/@kyung_yu/What-is-the-Apache-Airflow</guid>
            <pubDate>Sat, 16 Mar 2024 13:38:43 GMT</pubDate>
            <description><![CDATA[<p>First of all, we need to understand data pipelines.</p>
<h3 id="data-pipelines">Data Pipelines</h3>
<blockquote>
<p>A data pipeline is a method in which raw data is ingested from various data sources—APIs, SQL and NoSQL databases, files, et cetera, and then ported to data store, like a data lake or data warehouse, for analysis. Before data flows into a data repository, it usually undergoes some data processing. This is inclusive of data transformations, such as filtering, masking, and aggregations, which ensure appropriate data integration and standardization.</p>
</blockquote>
<p>There are two main types of data pipelines: batch processing and streaming data.</p>
<ul>
<li>Batch processing: loads “batches” of data into a repository during set time intervals, typically scheduled during off-peak business hours</li>
<li>Streaming data: used when data must be continuously updated
(<a href="https://aws.amazon.com/what-is/streaming-data/">Click if you want to know more about streaming data</a>)</li>
</ul>
<h3 id="etl-pipeline">ETL pipeline</h3>
<p>An ETL pipeline is a subcategory of data pipelines. </p>
<blockquote>
<p>• ETL pipelines follow a specific sequence. As the abbreviation implies, they <strong>Extract</strong> data, <strong>Transform</strong> data, and then <strong>Load</strong> and store data in a data repository. All data pipelines do not need to follow this sequence. In fact, ELT pipelines have become more popular with the advent of cloud-native tools. While data ingestion still occurs first with this type of pipeline, any transformations are applied after the data has been loaded into the cloud data warehouse.<br><br/>• ETL pipelines also tend to imply the use of batch processing, but as we noted above, the scope of data pipelines is broader. They can also be inclusive of stream processing. </p>
</blockquote>
<p>Then, </p>
<h3 id="what-is-the-apache-airflow">What is Apache Airflow?</h3>
<blockquote>
<p>Apache Airflow™ is an open-source platform for <strong>developing</strong>, <strong>scheduling</strong>, and <strong>monitoring batch-oriented workflows</strong>. Airflow’s extensible Python framework enables you to build workflows connecting with virtually any technology. A web interface helps manage the state of your workflows. Airflow is deployable in many ways, varying from a single process on your laptop to a distributed setup to support even the biggest workflows.</p>
</blockquote>
<p>That is, Apache Airflow is a tool which can help develop, schedule, and monitor batch processing with Python.</p>
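<p>A minimal DAG sketch (assuming Airflow 2.x; the DAG id, task names and callables below are illustrative, not from the docs):</p>

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("pulling raw data")

def load():
    print("writing to the warehouse")

with DAG(
    dag_id="daily_batch",          # illustrative name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",             # batch processing on a set interval
    catchup=False,
) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2                       # run extract before load
```

<p>The scheduler reads this file from the DAG folder, and the <code>&gt;&gt;</code> operator declares the task ordering that the executor respects.</p>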
<h3 id="airflow-components">Airflow Components</h3>
<h4 id="core-components">Core Components</h4>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/61c7810c-4541-49f6-8583-0f72e7a8c4d6/image.png" alt=""></p>
<ul>
<li>A <strong>folder of DAG files</strong> is read by the scheduler to figure out what tasks to run and when to run them.</li>
</ul>
<ul>
<li>A <strong>scheduler</strong> handles both triggering scheduled workflows and submitting tasks to the executor to run. It
(1) reads DAG files and parses the DAGs,
(2) checks their schedule interval and start date, and
(3) (if a DAG’s schedule has passed) starts scheduling the DAG’s tasks for execution by passing them to the Airflow workers.
(The <strong>executor</strong> is a configuration property of the scheduler and runs within the scheduler process. Executors are the mechanism by which task instances get run.)</li>
</ul>
<ul>
<li>A <strong>metadata database</strong> is used to store the state of workflows (DAGs) and their tasks. It stores crucial information such as the configuration of your Airflow environment&#39;s roles and permissions, as well as all metadata for past and present DAG and task runs.</li>
</ul>
<ul>
<li>A <strong>webserver</strong> presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.</li>
</ul>
<h3 id="situational-components">Situational Components</h3>
<ul>
<li>A <strong>worker</strong> executes the tasks given to it by the scheduler. (In a basic installation, the worker may be part of the scheduler rather than a separate component.) It is the process that executes tasks, as defined by the scheduler&#39;s executor.</li>
</ul>
<ul>
<li>A <strong>triggerer</strong> executes deferred tasks in an asyncio event loop. In a basic installation where deferred tasks are not used, a triggerer is not necessary.</li>
</ul>
<ul>
<li>A <strong>dag processor</strong> parses DAG files and serializes them into the metadata database. By default, the dag processor process is part of the scheduler, but it can be run as a separate component for scalability and security reasons. If a dag processor is present, the scheduler does not need to read DAG files directly. </li>
</ul>
<ul>
<li><strong>Plugins</strong> are a way to extend Airflow’s functionality (similar to installed packages). <a href="https://airflow.apache.org/docs/apache-airflow/stable/authoring-and-scheduling/plugins.html">Plugins</a> are read by the scheduler, dag processor, triggerer and webserver.</li>
</ul>
<h4 id="references">References</h4>
<ul>
<li><a href="https://www.ibm.com/topics/data-pipeline">https://www.ibm.com/topics/data-pipeline</a></li>
<li><a href="https://airflow.apache.org/docs/apache-airflow/stable/index.html">https://airflow.apache.org/docs/apache-airflow/stable/index.html</a></li>
<li><a href="https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html">https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/overview.html</a></li>
<li><a href="https://docs.astronomer.io/learn/airflow-components">https://docs.astronomer.io/learn/airflow-components</a></li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Process VS Thread]]></title>
            <link>https://velog.io/@kyung_yu/Process-VS-Thread</link>
            <guid>https://velog.io/@kyung_yu/Process-VS-Thread</guid>
            <pubDate>Tue, 12 Mar 2024 01:13:30 GMT</pubDate>
            <description><![CDATA[<h3 id="process">Process</h3>
<ul>
<li>운영체제로부터 시스템 자원(CPU)을 할당받은 작업의 단위
(모든 program은 운영체제가 실행되기 위한 메모리 공간을 할당해 줘야 실행될 수 ㅇ)</li>
<li>작업 중인 Program(Static Program, 코드 덩어리)</li>
<li><blockquote>
<p>See the process list in the task manager</p>
</blockquote>
</li>
<li>Limitation: as technology advanced, programs became more complex, and running a program as a single process hit its limits. But creating multiple processes of the same program wastes that much more memory, and CPU allocation can end up duplicated across them.</li>
</ul>
<h3 id="thread">Thread</h3>
<ul>
<li><p>The unit of execution flow that uses the resources allocated to its process
<img src="https://velog.velcdn.com/images/kyung_yu/post/a66e52ec-772b-42b9-9eb5-1f7afc3073d7/image.png" alt=""></p>
</li>
<li><p>Multithreading makes multitasking possible when it breaks programs into smaller, executable threads. Each thread has the programming elements needed to execute the main program, and the computer executes each thread one at a time. 
(See: <a href="https://www.techtarget.com/whatis/definition/multithreading">https://www.techtarget.com/whatis/definition/multithreading</a>)</p>
</li>
<li><p>On Linux, the nproc command shows just the number of cores. When hyper-threading is enabled, the value reported by nproc is twice the number of physical cores.
<br/>Ex. 6 cores, 12 threads
A single CPU can have multiple cores, and a core is literally a CPU core unit: the number of cores is the number of silicon units that fetch instructions from memory, decode them, and execute them. If 6 cores is the number of physical cores, 12 threads is the number of logical cores. In this case each physical core runs two or more threads simultaneously, which means the operating system can handle 12 tasks at once; this technology is called Hyper-Threading.
(Note that a CPU thread != a process&#39;s thread: strictly speaking, CPU threads are hardware threads, while a program&#39;s threads are software threads.)</p>
</li>
</ul>
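<p>A small Python illustration of threads sharing one process&#39;s memory (my own toy example): all workers append to the same list, which separate processes could not do without explicit IPC.</p>

```python
import threading

# Threads within one process share its memory: every worker appends to
# the same list, while separate processes would each get their own copy.
results = []
lock = threading.Lock()

def worker(name):
    with lock:                 # serialize access to the shared state
        results.append(name)

threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))  # [0, 1, 2, 3]
```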
<p>&lt;References&gt;
<a href="https://inpa.tistory.com/entry/%F0%9F%91%A9%E2%80%8D%F0%9F%92%BB-%ED%94%84%EB%A1%9C%EC%84%B8%EC%8A%A4-%E2%9A%94%EF%B8%8F-%EC%93%B0%EB%A0%88%EB%93%9C-%EC%B0%A8%EC%9D%B4#">https://inpa.tistory.com/entry/👩%E2%80%8D💻-프로세스-⚔%EF%B8%8F-쓰레드-차이#</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Linux Commands]]></title>
            <link>https://velog.io/@kyung_yu/Linux-commands</link>
            <guid>https://velog.io/@kyung_yu/Linux-commands</guid>
            <pubDate>Sun, 10 Mar 2024 14:08:43 GMT</pubDate>
            <description><![CDATA[<p>! A command and its arguments are separated by one or more blanks
! cmd + l : clears the terminal screen
! Process: a program in execution</p>
<h3 id="linux-directory-구조--역할">Linux Directory Structure &amp; Roles</h3>
<p>- The Linux <strong>file system</strong> hierarchy standard (the rules for the tree structure)
<img src="https://velog.velcdn.com/images/kyung_yu/post/b9e6f607-09f3-432f-87d3-93943d14cfd3/image.png" alt=""></p>
<ul>
<li><p>Basic Directories
<img src="https://velog.velcdn.com/images/kyung_yu/post/f00d965c-539f-4c80-afb4-222f2b490515/image.png" alt="">
Ex. /libXX/*.so, etc/passwd (holds the information of login-capable users as ASCII text)</p>
</li>
<li><p>Related commands</p>
<ul>
<li>Change directory: cd &lt;directory name&gt;
   Ex. cd / : move to the top-level (root) directory
   (folders: blue, executable files: green (or white), non-executable files: light blue)
   &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;cd : move to the home directory (a sub-directory of root)</li>
<li>List files: ls &lt;options&gt; &lt;file name&gt;
  Ex. ls -l : shows, for example, the real location of files behind shortcut (symbolic) links via the arrow (-&gt;)</li>
<li>Show the current working directory: pwd</li>
</ul>
</li>
</ul>
<h3 id="basic-명령어">Basic Commands</h3>
<h4 id="명령어-도움말-보기">&gt; Viewing command help</h4>
<ul>
<li>man &lt;options&gt; keyword (command or file name)
(On Ubuntu 20.04 the manuals are collected under /usr/share/man)
  <strong>+ option</strong>
  &nbsp;&nbsp;&nbsp;&nbsp;-k : search the list of manual pages
  &nbsp;&nbsp;&nbsp;&nbsp;-s [section number] : search and print the manual from the given section (may be omitted)
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(1) User Commands
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(2) System Calls
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(3) Subroutines
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(4) Devices
  &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;(5) File Formats
  Ex. man (-s) 5 passwd : look up passwd in section 5 of the manual
  ! space : next page
  ! enter : next line
  ! b : backward
  ! q : quit</li>
</ul>
<h4 id="파일-목록-보기">&gt; 파일 목록 보기</h4>
<ul>
<li>ls &lt;옵션&gt; &lt;파일|디렉토리&gt;
  <strong>+ option</strong>
  &nbsp;&nbsp;&nbsp;&nbsp;-a : (all) dot(.)로 시작하는 숨겨진 파일까지 모두(ex. config, log나 cache 정보 들어있는)
  &nbsp;&nbsp;&nbsp;&nbsp;-l <file> : (long) file/dir의 자세한 정보(permission, owner, size, last modified, file name, type 등)
  &nbsp;&nbsp;&nbsp;&nbsp;-R : (recursive) 하위 디렉토리까지 모두 출력
  &nbsp;&nbsp;&nbsp;&nbsp;-d : dir&#39;s content가 아닌 dir 자체를 출력
  Ex. ls (자동적으로 ls --color=auto 적용돼서 파란색: 바로가기 symboling link, 초록색: binary 프로그램(실행 ㄱㄴ), 빨간색: special한 permission이 있음)
Ex. ls -l /tmp (파란색: dir)
Ex. ls -ld /tmp (파일 내용 출력 없이 디렉토리 자체의 목록)
Ex. ls -l /etc/passwd /etc/hosts /etc/hostname<br/>
#### > Directory 생성/삭제/이동</li>
<li>mkdir &lt;옵션&gt; <dir_name>
<strong>+ option</strong>
&nbsp;&nbsp;&nbsp;&nbsp;-m : 퍼미션 설정
&nbsp;&nbsp;&nbsp;&nbsp;-p : 존재하지 않는 parent directories까지 같이 생성
Ex. mkdir ~/tmp-dir
Ex. mkdir -p ~/dir/subdir/subsubdir (subdir 없는 상태에서)
Ex. mkdir -m 777 share<br/></li>
<li>rmdir &lt;옵션&gt; <dir_name> 
<strong>+ option</strong>
&nbsp;&nbsp;&nbsp;&nbsp;-p : 비어있는 parent directories를 함께 삭제
Ex. rmdir share
Ex. rmdir -p ~/dir/subdir/subsubdir
! Empty dir만 지울 수 있음!!!<br/></li>
<li>cd <dir_name>
<strong>+ argument</strong>
&nbsp;&nbsp;&nbsp;&nbsp; ~ : Home dir로 이동
&nbsp;&nbsp;&nbsp;&nbsp; .. : 하나 상위 dir로 이동<h4 id="file-복사-이동-삭제">&gt; File 복사, 이동, 삭제</h4>
</li>
<li>cp &lt;options&gt; &lt;source_file&gt; &lt;destination_file&gt;
<strong>+ option</strong>
&nbsp;&nbsp;&nbsp;&nbsp;-i : (interactive) ask before overwriting (any answer other than y cancels the command)
&nbsp;&nbsp;&nbsp;&nbsp;-f : overwrite without asking (default)
&nbsp;&nbsp;&nbsp;&nbsp;-r : copy an entire directory (including subdirectories)
Ex. cp /etc/passwd . (copy into the current directory under the same name)
Ex. cp /etc/passwd /etc/hosts conf.d/ (copy the passwd and hosts files into the conf.d dir! with multiple sources the destination must be a directory, not a file)
Ex. cp -r conf.d conf.d.backup<br/></li>
<li>mv &lt;options&gt; &lt;source_file&gt; &lt;new_name (rename) or directory (move)&gt;
<strong>+ option</strong>
&nbsp;&nbsp;&nbsp;&nbsp;-i : (interactive) ask before overwriting when renaming
&nbsp;&nbsp;&nbsp;&nbsp;-f : overwrite without asking when renaming (default)
Ex. mv hosts hosts.file
Ex. mv passwd /tmp/ (ls /tmp then shows passwd)
Ex. mv conf.d.backup admin.d (renaming a dir works without -r)
Ex. mv home/* . (move every file in the home dir into the current dir)<br/></li>
<li>rm &lt;options&gt; &lt;file_name or dir_name&gt;
<strong>+ option</strong>
&nbsp;&nbsp;&nbsp;&nbsp;-i : (interactive) ask once more before deleting a file
&nbsp;&nbsp;&nbsp;&nbsp;-f : delete without asking (Linux has no trash can, so don't do this..)
&nbsp;&nbsp;&nbsp;&nbsp;-r : delete a directory together with its contents
Ex. rm -ri admin.d/ (delete the admin.d dir, asking first)</li>
</ul>
<h4 id="기타">&gt; Miscellaneous</h4>
<ul>
<li>tree (shows the directory structure)</li>
<li>cat &lt;file_name&gt; : print the file's contents</li>
</ul>
<p>&lt;References&gt;
<a href="http://daddynkidsmakers.blogspot.com/2018/03/">http://daddynkidsmakers.blogspot.com/2018/03/</a>
<a href="https://www.youtube.com/@ttabae-learn">https://www.youtube.com/@ttabae-learn</a>
<a href="https://seosh817.tistory.com/157#mv%20%5B%ED%8C%8C%EC%9D%BC%EB%AA%85%5D%5B%EB%94%94%EB%A0%89%ED%86%A0%EB%A6%AC%5D%EC%9D%BC%20%EA%B2%BD%EC%9A%B0-1">https://seosh817.tistory.com/157#mv%20%5B파일명%5D%5B디렉토리%5D일%20경우-1</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Paper Review] Beam Search +@]]></title>
            <link>https://velog.io/@kyung_yu/Beam-Search</link>
            <guid>https://velog.io/@kyung_yu/Beam-Search</guid>
            <pubDate>Wed, 06 Mar 2024 16:46:06 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/kyung_yu/post/02391af5-6397-4e0a-ae90-d84b88def18f/image.png" alt=""></p>
<p>Decoding: converting encoded vectors back into text (in text generation tasks we must decide how to emit tokens after the given tokens, so a decoding method is needed)</p>
<h3 id="greedy-search">Greedy Search</h3>
<ul>
<li>At each time step, compute and select only the single token with the highest probability as the next token (it does not look at the probability of the whole sentence as a sequence)</li>
<li>Usually we decode until the model produces an &lt;END&gt; token</li>
<li>Problem: the token that is optimal at a given time step may not be optimal for the whole sentence + there is no way to go back</li>
</ul>
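<p>As a rough sketch of the idea (the toy next-token table below is hypothetical, standing in for a real model's distribution):</p>

```python
def greedy_decode(step_probs, end_token="<END>", max_len=10):
    """At every step take only the single most probable next token."""
    seq = []
    for _ in range(max_len):
        probs = step_probs(seq)             # next-token distribution
        token = max(probs, key=probs.get)   # argmax; no way to go back
        if token == end_token:
            break
        seq.append(token)
    return seq

def toy_probs(prefix):
    # Hypothetical toy distribution; continuations depend on the prefix.
    if not prefix:
        return {"a": 0.55, "the": 0.45}
    if prefix == ["a"]:
        return {"cat": 0.4, "dog": 0.3, "<END>": 0.3}
    if prefix == ["the"]:
        return {"cat": 0.9, "<END>": 0.1}
    return {"<END>": 1.0}

print(greedy_decode(toy_probs))  # -> ['a', 'cat']
```

<p>Here greedy commits to "a" (0.55 &gt; 0.45) and ends up with "a cat" (joint 0.55 x 0.4 = 0.22), even though "the cat" (0.45 x 0.9 = 0.405) is more probable overall: the locally optimal token is not globally optimal.</p>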
<h3 id="exhaustive-search">Exhaustive Search</h3>
<ul>
<li>Considers every possible sequence
<img src="https://velog.velcdn.com/images/kyung_yu/post/3c373fc1-0abd-4f60-bdea-09eea7784dab/image.png" alt=""></li>
<li><blockquote>
<p>Select the sequence that maximizes the joint probability</p>
</blockquote>
</li>
<li><blockquote>
<p>Complexity is O(V^t), where V is the vocabulary size and t is the time step</p>
</blockquote>
</li>
</ul>
<h3 id="beam-search">Beam Search</h3>
<ul>
<li>On each time step of the decoder, we keep track of the k most probable partial translations (which we call hypotheses), where k is the beam size</li>
<li>A hypothesis y1,...,yt has a score of its log probability:
<img src="https://velog.velcdn.com/images/kyung_yu/post/3d97bc53-a7ce-43e0-b894-eebbe861cc34/image.png" alt=""></li>
<li><blockquote>
<p>Taking the log of the joint probability turns the product into a sum </p>
</blockquote>
</li>
<li><blockquote>
<p>Since log is a monotonically increasing function, the probability and the score rank hypotheses identically.</p>
</blockquote>
</li>
<li>For each of the k hypotheses, find top k next words and calculate scores</li>
<li>What if different hypotheses may produce &lt;END&gt; tokens on different timesteps? When a hypothesis produces &lt;END&gt;, that hypothesis is complete; place it aside and continue exploring other hypotheses via beam search
(Usually we continue beam search until:<ul>
<li>we reach time step T (where T is some pre-defined cutoff), or</li>
<li>we have at least n completed hypotheses (where n is the pre-defined cutoff)</li>
</ul>
</li>
<li>Problem: longer hypotheses have lower scores (since the score is computed by always adding negative values)
-&gt; Normalize by length
<img src="https://velog.velcdn.com/images/kyung_yu/post/a8a2edba-cbeb-40ae-8baa-6bfb5354280e/image.png" alt=""></li>
</ul>
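<p>A minimal sketch of the procedure above (the toy next-token table is hypothetical; scores are summed log-probabilities, and completed hypotheses are compared after length normalization):</p>

```python
import math

def beam_search(step_probs, k=2, end_token="<END>", max_len=10):
    """Keep the k highest-scoring partial hypotheses at each time step.
    Score = sum of log-probabilities; completed hypotheses are
    length-normalized before the final comparison."""
    beam = [([], 0.0)]                      # (tokens, summed log-prob)
    completed = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beam:
            for tok, p in step_probs(seq).items():
                candidates.append((seq + [tok], score + math.log(p)))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beam = []
        for seq, score in candidates[:k]:   # keep only the k best
            if seq[-1] == end_token:        # finished: set it aside
                completed.append((seq[:-1], score / len(seq)))
            else:
                beam.append((seq, score))
        if not beam:
            break
    return max(completed, key=lambda c: c[1])[0] if completed else beam[0][0]

def toy_probs(prefix):
    # Hypothetical toy distribution; continuations depend on the prefix.
    if not prefix:
        return {"a": 0.55, "the": 0.45}
    if prefix == ["a"]:
        return {"cat": 0.4, "dog": 0.3, "<END>": 0.3}
    if prefix == ["the"]:
        return {"cat": 0.9, "<END>": 0.1}
    return {"<END>": 1.0}

print(beam_search(toy_probs, k=2))  # -> ['the', 'cat']
```

<p>With this toy distribution, greedy decoding would commit to "a" at the first step, while a beam of size 2 keeps "the" alive and recovers the higher-probability "the cat".</p>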
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/b9a5ca21-0a59-4ac4-8c78-03bf891211fd/image.png" alt=""></p>
<p>-&gt; a beam of size k contains a set of active hypotheses
-&gt; At each time step t, the decoder model produces a distribution over the target-language vocabulary V_T. This produces a large matrix of dimensions k X |V_T|. Conceptually, a (row, column) entry (i,j) in this matrix contains the state obtained by starting from the ith state in the beam and generating the target word corresponding to the jth word of V_T. </p>
<h3 id="gbsgrid-beam-search">GBS(Grid Beam Search)</h3>
<p>  ≒ adding an additional dimension to the beam (base beam size k X the number of constraints)</p>
<ul>
<li>For C constraints, this is accomplished by maintaining C+1 separate beams <strong>OR</strong> banks, B_0, B_1, ..., B_C, where B_i groups together hypotheses that have generated (or met) i of the constraints. </li>
<li>Decoding proceeds as with standard beam search, but with the addition of bookkeeping that tracks the number of constraints met by each hypothesis, such that each bank is filled at each time step. </li>
<li>When beam search is complete, the hypothesis returned is the highest-scoring one in bank B_C.
=&gt; Impractical, because the beam size changes for every sentence
=&gt; Decoding complexity is linear in the number of constraints, because the effective beam size is k * (C+1)</li>
</ul>
<h3 id="dba-fast-lexically-constrained-decoding-via-dynamic-beam-allocation-for-nmt">DBA (fast lexically-constrained decoding via Dynamic Beam Allocation for NMT)</h3>
<ul>
<li><p>We maintain a single beam of size k, as with unconstrained decoding, and dynamically allocate the slots of this beam across the constraint banks at each time step. 
<img src="https://velog.velcdn.com/images/kyung_yu/post/ffdf6747-5187-454e-abc9-42e79d7b913c/image.png" alt=""></p>
</li>
<li><blockquote>
<p>Replace the KBEST function of <strong>Algorithm 1</strong> with the following</p>
</blockquote>
</li>
<li><blockquote>
<p>candidates = (current beam, new word, expected constraints set)</p>
</blockquote>
</li>
<li><blockquote>
<p>candidates (1) from the k best items, (2) from extending each hypothesis with all unfilled constraints, (3) from its single-best token (to ensure consideration of partially-completed hypotheses)
-&gt; When allocating a size-k beam across C+1 constraint banks (the portion of the beam reserved for items having met the same number of constraints), where C may be greater than k, we use a <strong>simple allocation strategy</strong>, setting each bin size to [k/C], irrespective of the timestep. Any remaining slots are assigned to the &quot;topmost&quot; or maximally constrained bank, C. But! There is <strong>bank adjustment</strong>. An overfilled bank is one that has been allocated more slots than it has candidates to fill. Each such overfilled bank, in turn, gives its extra allotments to banks that have more candidates than slots, looking first to its immediate neighbors and moving outward until it has distributed all of its extra slots. In this way, the beam is filled, up to the minimum of the beam size or the number of candidates.  </p>
</blockquote>
</li>
<li><p>Hypotheses are not allowed to generate the end-of-sentence token, &lt;/s&gt;, unless they have met all of their constraints. When beam search is finished, the highest-scoring completed item is returned. 
<img src="https://velog.velcdn.com/images/kyung_yu/post/92a1bba5-a6b7-4516-b34e-261152c7f08c/image.png" alt=""></p>
</li>
<li><blockquote>
<p>The words shaded in gray are the candidate words of beam h that KBEST-DBA includes in the candidates list</p>
</blockquote>
</li>
</ul>
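<p>The static allocation plus bank adjustment just described can be sketched as follows (a hypothetical helper, not the paper's reference code; the even per-bank split with the remainder sent to the top bank is a simplification of the bin-size rule quoted above):</p>

```python
def allocate_banks(k, counts):
    """Split a beam of size k across C+1 constraint banks.
    counts[i] = number of candidates that have met i constraints.
    Each bank gets an even share of the k slots; the remainder goes to
    the topmost (maximally constrained) bank; overfilled banks then hand
    their unused slots to the nearest banks with surplus candidates."""
    C = len(counts) - 1
    slots = [k // (C + 1)] * (C + 1)
    slots[C] += k - sum(slots)              # remainder to bank C
    for i in range(C + 1):                  # bank adjustment
        extra = slots[i] - counts[i]        # slots bank i cannot fill
        for dist in range(1, C + 1):        # neighbors first, then outward
            for j in (i - dist, i + dist):
                if 0 <= j <= C and extra > 0 and counts[j] > slots[j]:
                    move = min(extra, counts[j] - slots[j])
                    slots[i] -= move
                    slots[j] += move
                    extra -= move
    return slots

# k=5, C=2: banks B0..B2 start as [1, 1, 3]; B2 has no candidates,
# so its slots flow back to B0, which has surplus candidates.
print(allocate_banks(5, [4, 1, 0]))  # -> [4, 1, 0]
```

<p>The slot totals always stay within the beam size, and a bank keeps an unusable slot only when no other bank has surplus candidates to take it.</p>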
<p>&lt;References&gt;
<a href="https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/">https://aws.amazon.com/blogs/machine-learning/train-neural-machine-translation-models-with-sockeye/</a>
<a href="https://terms.naver.com/entry.naver?docId=6653720&amp;cid=69974&amp;categoryId=69974">https://terms.naver.com/entry.naver?docId=6653720&amp;cid=69974&amp;categoryId=69974</a>
<a href="https://blog.naver.com/eungang0301/223242740713">https://blog.naver.com/eungang0301/223242740713</a>
<a href="https://www.boostcourse.org/ai330/lecture/1455757?isDesc=false">https://www.boostcourse.org/ai330/lecture/1455757?isDesc=false</a>
<a href="https://showmiso.tistory.com/156">https://showmiso.tistory.com/156</a>
<a href="https://arxiv.org/pdf/1804.06609.pdf">https://arxiv.org/pdf/1804.06609.pdf</a></p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[Anaconda(Miniconda)] 가상환경]]></title>
            <link>https://velog.io/@kyung_yu/AnacondaMiniconda-%EA%B0%80%EC%83%81%ED%99%98%EA%B2%BD</link>
            <guid>https://velog.io/@kyung_yu/AnacondaMiniconda-%EA%B0%80%EC%83%81%ED%99%98%EA%B2%BD</guid>
            <pubDate>Mon, 04 Mar 2024 15:59:08 GMT</pubDate>
            <description><![CDATA[<p>참고: <a href="https://chancoding.tistory.com/85">https://chancoding.tistory.com/85</a></p>
<h3 id="in-vs-code">In VS code</h3>
<p>conda create -n crawler python=3.11 #create a virtual environment named crawler
conda activate crawler #activate the virtual environment
which python #check which python is in use</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Introduction to Unix & Linux]]></title>
            <link>https://velog.io/@kyung_yu/Introduction-to-Unix-Linux</link>
            <guid>https://velog.io/@kyung_yu/Introduction-to-Unix-Linux</guid>
            <pubDate>Fri, 01 Mar 2024 15:08:08 GMT</pubDate>
            <description><![CDATA[<p><strong>REFERENCE</strong>: 시스템소프트웨어 (<a href="http://www.kocw.net/home/cview.do?cid=8562026226b093ea">http://www.kocw.net/home/cview.do?cid=8562026226b093ea</a>)</p>
<ul>
<li>UNIX<ul>
<li>Supports multiple users and multitasking (for large, server-class computers)</li>
<li>Highly portable (operating systems used to be written in assembly, but UNIX is written in C, a high-level language -&gt; machine-independent)</li>
<li>Open source code</li>
<li>Well suited to program development</li>
</ul>
</li>
</ul>
<ul>
<li>Linux <ul>
<li>A UNIX operating system usable in a variety of computing environments, including PCs</li>
<li>The kernel was developed with contributions from many developers on the Internet, led by Linus Torvalds</li>
<li>Free software + modularity</li>
<li>Distributed openly under the GNU General Public License (GPL)
  -&gt; the source code of every program is also open
  -&gt; in the GNU spirit, modified versions must be published as well</li>
</ul>
</li>
</ul>
]]></description>
        </item>
        <item>
            <title><![CDATA[Parameter]]></title>
            <link>https://velog.io/@kyung_yu/Parameter</link>
            <guid>https://velog.io/@kyung_yu/Parameter</guid>
            <pubDate>Fri, 01 Mar 2024 14:37:35 GMT</pubDate>
            <description><![CDATA[<p><strong>REFERENCE:</strong> [Fast Campus] 세계 3등에게 배우는 실무 밀착 데이터 시각화 올인원 패키지</p>
<p>A <strong>Parameter</strong> is not a value belonging to the dataset; it is a variable that lets the user control worksheets and dashboards. 
(<em>Reference: <a href="https://blog.naver.com/llyjin-/223088885357">https://blog.naver.com/llyjin-/223088885357</a></em>)
A parameter must be placed inside a calculated field, and it only takes effect when applied to a worksheet. 
With parameters you can adjust the level of detail of a view, switch measures, change colors, highlight data, run what-if scenario analyses, and change the display view. 
(<em>What-if scenario analysis reference: <a href="https://brunch.co.kr/@cheonmyung/91">https://brunch.co.kr/@cheonmyung/91</a></em>)</p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/4d306c92-7b9b-4a86-9bbc-4cfe2dfab7d0/image.gif" alt=""></p>
<ol>
<li>Create parameters (string, list) named Select Granularity and Select a measure,</li>
<li>build a Calculated Field for each of them using a CASE function, then</li>
<li>place the Field that sets the dimension on the Rows shelf and the Field that sets the measure on the Columns shelf. </li>
</ol>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/7cec3afb-b272-4ff3-8c14-78f4a7d70079/image.png" alt="">
As in this picture, every variable you want in the label must be dragged onto the Marks card, specifically onto Label. 
(Dimension: the Calculated Field that sets the dimension,
Measures: the Calculated Field that sets the measure)</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[[User Interaction] Filter & Highlight Actions Demo]]></title>
            <link>https://velog.io/@kyung_yu/User-Interaction-Filter-Highlight-Actions-Demo</link>
            <guid>https://velog.io/@kyung_yu/User-Interaction-Filter-Highlight-Actions-Demo</guid>
            <pubDate>Fri, 23 Feb 2024 17:06:38 GMT</pubDate>
            <description><![CDATA[<p><strong>REFERENCE:</strong> [Fast Campus] 세계 3등에게 배우는 실무 밀착 데이터 시각화 올인원 패키지</p>
<ul>
<li>Filtering by Region, Year, Category (When Click Sales Trend) </li>
<li>Highlighting by clicking Sales Trend</li>
</ul>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/4658623b-f5fa-4353-9ccb-e0c9758fb360/image.gif" alt=""></p>
<figcaption style="text-align:center; font-size:15px; color:#808080; margin-top:10px">
    Before removing the tooltips
  </figcaption>

<p><img src="https://velog.velcdn.com/images/kyung_yu/post/cd0ee339-6662-4a7f-a1b6-ed5c309fe215/image.gif" alt=""></p>
<figcaption style="text-align:center; font-size:15px; color:#808080; margin-top:10px">
    After cleaning up the tooltips
  </figcaption>

<p>An interaction dashboard carries a lot of information, and decorating it tends to leave meaningless values such as min(0) lying around, so it seems cleaner not to show extra information on mouse-over</p>
<p>But I still need to figure out why an empty space appears in the area part even though the line itself is drawn correctly</p>
<p>I originally meant to break it down worksheet by worksheet and explain what I did, but the whole Worksheet part was wiped out while editing on velog, so that's it for today...</p>
<p>+<em>2024.03.01</em></p>
<p><img src="https://velog.velcdn.com/images/kyung_yu/post/0eb65230-224d-44ef-941d-988e0199615c/image.gif" alt="">
At first I thought the empty space appeared because data was missing, but that wasn't it... the Filters of the Sales Trend worksheet contained Exclusions that filtered out the data for the (Category, month, year) where the gap appeared. After removing that filter, the empty space disappeared and everything works fine</p>
]]></description>
        </item>
        <item>
            <title><![CDATA[Kernel? Shell? Terminal?]]></title>
            <link>https://velog.io/@kyung_yu/Kernel-Shell-Terminal</link>
            <guid>https://velog.io/@kyung_yu/Kernel-Shell-Terminal</guid>
            <pubDate>Thu, 22 Feb 2024 13:27:03 GMT</pubDate>
            <description><![CDATA[<p><img src="https://velog.velcdn.com/images/kyung_yu/post/f4d8e4fd-eb3e-457c-981f-c52248408537/image.png" alt=""></p>
<p>Here, System Software is software that directly controls, integrates, and manages the individual hardware (HW) components of a computer system, helping users operate the system even without knowing all the physical characteristics or structure of the hardware. (Machine dependent: it depends on the Instruction Set Architecture (ISA))
(Reference on the components of System Software: <a href="https://devpad.tistory.com/119">https://devpad.tistory.com/119</a>)</p>
<p>System Software includes the Operating System (System Software in the narrow sense), and the Kernel plays the core role within the operating system. 
<img src="https://velog.velcdn.com/images/kyung_yu/post/66e85996-aab1-46c2-9fda-22f37a8caa80/image.png" alt="">
An operating system has a User Mode and a Kernel Mode. User mode is the region used by application programs, while kernel mode is the region used by the operating system, where instructions requiring special privileges can run. The regions are split for security: instructions that can affect critical data are executed only in kernel mode.
(Reference on the roles of the OS and the kernel: <a href="https://smallrich.tistory.com/77">https://smallrich.tistory.com/77</a>)</p>
<p><strong>BUT the kernel does not interact with the user directly!</strong>
-&gt; The Shell, the interface between the operating system and the user, is how the two interact!
(Reference on the Shell types used in terminals on Unix and Unix-like operating systems: <a href="https://blog.naver.com/jdockko1/222968207191">https://blog.naver.com/jdockko1/222968207191</a>)</p>
]]></description>
        </item>
    </channel>
</rss>