kjh_b_d.log

Automata and Formal Languages - Introduction

Thu, 23 Mar 2023 22:43:47 GMT

Automata and Formal Languages - CSI3109 This post retrived from CSI3109 Yonsei University 2023-spring Used text book : Introduction to Automata Theory, Languages, and Computation (3rd edition) by John E. Hopcroft,, Rajeev Motwani and Jeffery D. Ullman

What we will learn?

Formal language and automata theory. Fundamental knowledge on computation and computability. Finite-state automata andPushdown automata, Turing machines.

Finite-state automata : Regular Languages
Context-free automata : Context-free languages
Turing machine : Unrestricted languages

We can understand each pairs' meaning as Computability. So, we learn Mathematical study of computing machines, their fundamental capabilities and their limitations.

More... What problem can be solved (computed) , in PRINCIPLE and in PRACTICE, by computer and what problems cannot? Computability theory for first one, Complexity theory(Algorithm Design Area) for second one.

Examples of UNSOLVABLE problems

Given two programs, determine whether or not they compute the same function.
The general problem of whether or not a program terminates under all inputs.

Terms

Alphabets, Strings and Languages

Alphabets

Finite, nonempty set of symbols. Denoted by $\Sigma$.

Strings

Finite sequence of symbols from Alphabet. Denoted usually by $w$ Empty string is the string with zero occurrences of symbols. Denoted by $\lambda$. Length of string $w$ is denoted by |$w$| Occurrence $|w|_{\sigma}$ is the number of occurrence of $\sigma$ in $w$. Power $\Sigma^{k}$ of an alphabet is the set of strings of length $k$. The set of all strings over $\Sigma$ is denoted by $\Sigma^{}$. $\Sigma^ - {\lambda}$ = $\Sigma^+$ $w_x \cdot w_y$ means concatenation of given two string $w_x$ and $w_y$. * $\lambda w = w\lambda = w$

Languages

Set of strings all of which are chosen from $\Sigma^$. $L \subseteq \Sigma^$ $\empty$ is empty languages over any alphabets.

Some notations and definitions about SETs

Power set

$2^A$ : The set of all subsets of a set $A$ is the power set of $A$. e.g $2^{a,b}$ = ${ \empty , {a}, {b} {a,b}}$

Partition

A partition of $A$ is any set ${A_1, A_2, ... }$ of nonempty subsets of $A$ such that

$A = A_1 \cup A_2 \cup ...$ and
$A_i \cap A_j = \empty$ for all $i \neq j$.

Cartesian Product and Functions

$A \times B$ of two sets $A$ and $B$ is the set of all possible ordered pairs $(a,b)$ with $a \in A$ and $b \in B$.

Function

$f : A \rightarrow B$, is a binary relation $R$ on $A$ and $B$ such that, for each $a \in A$, if $(a,b) \in R$ and $(a,c) \in R$, then $b = c$ and, for each $a \in A$, there is either exactly one $b \in B$ such that $f(a) = b$ or there is no $b \in B$ such that $f(a) = b$. Such a function is said to be a partial function. If $\forall a \in A$, there is $b \in B$ such that $f(a) = b$, we call $f$ as total

$f : A \rightarrow B$ is ${\color{red}\texttt{one-to-one}}$ if, for any distinct $a, a' \in A$, $f(a) \neq f(a')$. $f : A \rightarrow B$ is ${\color{red}\texttt{onto}}$ if, for all $b \in B$, there is some $a \in A$ such that $f(a) = b$. A total function $f$ is ${\color{red} \texttt{bijection}}$ if it is both one-to-one and onto.

Computer Network - Application Layer

Tue, 25 Oct 2022 00:40:31 GMT

Some Network Applications : e-mail, web, text messaging, remote login, P2P file sharing, multi-user network games, streaming stored video, voice over IP, real-time video conferencing, social networking, search ...

Principle of network applications

Important Themes overall Control vs Messages Centralized vs Decentralized Stateless vs Stateful Reliable vs Unreliable "COMPLEXITY AT NETWORK EDGE"

Program (application)은 end-system에서만 실행된다.
Network-core device(router)를 위한 software를 만들 필요는 없다.
Application Structures
Client-server와 peer-to-peer(P2P) 두가지가 있다.
Client-Server architecture
client : server가 n:n, n:1, 1:n, 1:1 모두 가능하다.

Server :

Always-on host
Permanent IP address
Usually as Data Center (2022년 10월에 kakako data center에 문제가 생겨 먹통이 되는 일이 있었다.)

Client :

communicate with server
may be intermittently connected
may have dynamic IP address
do not communicate directly with each other

P2P architecture

no always-on server peer request services from other peers, provide service in return to other peers - self scalability -

Process communicating

Process : program running within a host

same host : inter-process communication (shared memory, pipe 등의 OS 기법)
diff host : Communicate via Network. send, receive. (P2P는 server process와 client process 둘 다 가지고 있다.)

client process : initiate communication server process : wait to be contacted

sockets - kind of interface

process는 socket을 통해 message를 보내고 받는다. socket은 문과 유사한데 sending process가 message를 넣고 OS에게 보내달라고 요청한다.

1개의 process는 1개 이상의 socket이 필요하다. 그리고 그러한 process는 identifier를 가지고 있어야 한다. Host device는 32-bit의 unique한 IP address를 가지고 있는데 하나의 host device에서는 여러개의 process가 돌아가고 있어서 IP addr만으로는 identifier라고 하기에는 부족하다. 그래서 IP + Port number로 identifier를 만든다. (HTTP의 port number는 80, mail server는 25로 정해져 있음)

Application level protocol은 다음과 같은 것들을 정의한다.

Types of message exchanged (request, response)
message syntax == Grammar
message semantics == Meaning
Rules for when and how to send & respond

Open protocols : defined in RFCs, allows for interoperability (HTTP, SMTP) Proprietary protocols (Skype)

Application level에서 필요로하는 transport의 기능들은 4가지로 분류할 수 있다. 먼저 data가 손실없이 100% 보내진다는 걸 보증하는 data intergrity(TCP/IP가 보증한다.). Low delay를 원하는 timing, minimum amount of throughput을 필요로하는 throughput, 그리고 security이다.

TCP와 UDP 뒤에 조금 더 자세히 다루겠지만 앞으로 계속 나올 용어이기에 미리 정리 해야할 듯 하다.

TCP는 reliable transport이고, flow control이 가능하다. flow control은 너무 많은 양의 data (receiver가 감당하지 못할 정도의)를 보내지 않는 것이다. 너무 많은 송신자가 한번에 송신하는 것을 control하는 congestion control이 가능하다. 대부분이 TCP를 사용한다.

UDP는 위의 기능중 아무것도 가능하지 않다. 그러나 TCP의 flow control방식이 맞지 않거나 본인만의 고유한 reliability를 만들고 싶다면 UDP를 사용하기도 한다.

둘 다 Encryption이 없어 security를 보장하지 않지만 SSL로 이를 보완한다.

Web and HTTP

Webpage는 보통 HTML file이 base다. HTML file이 HTML file이나 JPEG image, Java applet, audio file 등을 referenced object로 가지고 있는 "File"이다. URL로 주소가 지정되고 URL은 host name + path name으로 구성된다.

HTTP

Key words : HTTP, non-persistent & persistent, Cookie, Cache(Proxy server)

HTTP : HyperText Transfer Protocol의 약자이다. Web의 application layer protocol이고 client/server model이다. (P2P가 아니다.) 여기서 client는 browser, server는 Web server이다.

HTTP는 TCP를 사용한다. Server는 항상 80번 port를 열어두고 client의 연결을 기다린다. Connect된 이후에는 Browser와 Web server간의 message를 교환하는 역할을 한다. 그리고 TCP 서버를 닫는다.

HTTP는 stateless하다. 즉, 과거에 어떤 정보가 오고 갔는지 알 필요가 없다는 뜻이다.

HTTP의 연결 방식에는 두가지가 있다. 과거에 쓰였던 non-persistent HTTP와 최신의 persistent HTTP가 있다. 둘의 차이는 하나 이상의 TCP connection(non-persistent)을 사용하는 것과 하나만의 TCP connection(persistent)을 사용하는 것에 있다. Webpage의 경우 하나의 파일이 아니라 여러개의 파일로 구성되어 있기에 multiple object를 받아야 한다. 따라서 multiple object를 하나의 TCP로 받을지, 여러개의 TCP로 받을지에 대한 논의가 필요하다.

Non-persistent HTTP는 HTTP client가 TCP connection을 initiate해서 연결한다. server와 request-response를 거친 후 server에서 HTTP connection을 닫는다. 따라서 n개의 object를 받기 위해서는 n번의 연결이 필요하다.

Round-trip time은 packet의 client -> sever -> client 왕복 시간이다. HTTP response time에서 non-persistent인 경우 2RTT + file transmission time이 된다. (1 RTT for connection, 1 RTT for request-response). object마다 2RTT가 필요하기에 시간적 문제도 있고, 너무 많은 TCP 연결은 OS에게 부담을 줄 수 있다.

Persistent HTTP 는 server에서 response후에 connection을 유지하는 것이다. 처음 connection할 때의 RTT 말고는 연결을 위한 RTT를 소비하지 않기에 시간적인 이득을 볼 수 있다.

Some response status codes in HTTP

200 OK
301 Moved Permanently
400 Bad Request
404 Not Found
505 HTTP Version Not Supported

Cookies

Web site들은 cookie를 사용한다. User의 ID를 만들어서 server뿐만 아니라 user에게도 저장하는 방법인데, 첫 접속 후 나중에 다시 접속할 때 user가 저장한 cookie값도 함께 보낸다. 이 경우 request-response를 한 단계 줄일 수 있는 장점이 있다. Cookie는 authorization, recommendation등에 쓰일 수 있다.

Caches (proxy server)

origin server에 갔다오지 않고도 HTTP response를 받기 위한 전략이다. 이 경우 request-response delay를 줄일 수 있으며 traffic이 복잡해 지는 것도 막을 수 있다.

proxy server를 client 근처에 두고 origin server의 data들을 저장해 놓다가, client가 request할 때 proxy server에 존재하는 data일 경우 origin server까지 가지 않고 바로 response하는 방법이다. Origin server를 main memory, proxy server를 cache라고 생각하자. Hit한 경우 좋지만, 만약 miss가 일어난 경우 proxy server가 client역할을 해서 data를 받아온다.

Conditional GET Proxy server는 origin server의 object에 변화가 있을 때에만 update를 하면 된다. proxy server가 특정 date 이후 변경된 data가 있는지 request하고 만약 있다면 origin server에서 respond 해준다(없어도 없다고 respond는 해준다).

이렇게 data가 오는 와중에 client가 접근한다면...bad scenario이다. conditional get과는 다른 해결책으로 origin server에서 update가 있을 때 proxy server에 push하는 방법도 있다.

Electronic mail

User agent, mail servers, protocol(SMTP : Simple Mail Transfer Protocol).

User agent : mail 'reader'

Mail server :

mail box : incoming message storage
message queue : outgoing messages queue
mail server 끼리는 P2P style

SMTP protocol : TCP를 사용하고 port number는 25다. 3가지 phase로 이루어져있다.

hand shaking (greeting)
stransfer of messages
closure Persistent connection을 사용한다.

** message는 7-bit ASCII ** ** mail을 읽을 때에는 SMTP 대신 POP3, IMAP등이 사용된다.**

SMTP mail message format header : To, From, Subject body : literal message

receiver의 mail server에 있는 data에 어떻게 retrieve 할 것인가? -> mail access protocol

POP(Post Office Protocol) : authorization - download POP3 protocol [ download and keep, stateless ]

  Authorization phase -> user : username, pass : password AND server response(+OK or -ERR)
  Transaction phase -> list : list message numbers, retr : retrieved message by number (mail을 local에 저장), dele : delete, quit

IMAP(Internet Mail Access Protocol) : more features, including manipulation of stored messages on server [keeps all messages in one place - server]
HTTP : gmail, Yahoo! Mail etc

HTTP & SMTP HTTP : Pull SMTP : Push

HTTP : each object encapsulated in ints own response message SMTP : multiple objects sent in multipart message (MIME)

DNS

Hierarchical database, TDC, name resolution

Domain Name System IP address(32-bit) for datagrams, name(www.yonsei.ac.kr) for humans.

Goal : Host name <-> IP address

host alising : 하나의 컴퓨터가 여러개의 logical name(Domain). load distribution : 하나의 Domain에 여러개의 IP -> 서버에 가해지는 부담을 줄인다. mail server의 ip주소도 DNS가 알려준다.

Distributed database

Application-layer protocol hosts, name servers communicate to resolve names.

How it works : $;$$;$Distributed, Hierarchical database $;$$;$ 이러한 root server가 전 세계에 13개 복제되어 있다. (~~map보면 거의다 미국이나 유럽이다.~~)

Top-Level Domain servers : .com, .org, .kr~~ Authoritative DNS servers : yonsei.ac, naver ...

End user는 root에 직접 접근할 수 없다. Local DNS name server에 접근해서 활용. Local DNS server는 proxy server처럼 작동한다.

PROTOCOL : name resolution에는 두가지 방법이 있다. Iterated query와 recursive query. Iterated query는 부서 전화 연결 생각하면 될듯 하다. 114에 전화 -> 회사 대표 번호 연결 -> 회사 부서 연결 ("I don't know this name, but ask this server"). Recursive query는 찾을때까지 알아서 recursive하게 간다. 114에 전화 -> 114가 회사 대표 번호로 전화해서 물어본다 -> 회사 대표번호가 회사 부서에 물어본다 -> pop pop pop 이러한 mapping 과정이 time consume이 클 수 밖에. 그래서 cahch로 저장한다. 모든 caching과 마찬가지로 out-of-date problem 발생할 수 있다.

P2P applications

Keywords : distribute file comparison, tit-for-tat

No always-on server File distribution time을 client-server architecture와 peer-to-peer architecture의 비교 File 양 F User 수 N upload capacity $u_s$ minimum client download rate $d_{min}$ Time to distribute F to N in client-server : $D_{C-S} \geq max(NF/u_{s}, F/d_{min}) $ Time to distrubute F to N in peer-to-peer : $D_{P2P} \geq max(F/u_s, F/d_{min}, NF/(u_s + \Sigma u_i))$ client도 서버처럼 작동해서 i개의 서버가 있는 것 처럼 보인다.

P2P file distribution example : BitTorrent

file이 256Kb chunks로 나뉘어 있고, peers가 sent/receive한다. Chunk를 교환하는 peers를 churn라고 한다.

tit-for-tat -> 토렌트를 유지하는 중요한 원리 가장 많은 chunk를 제공하준 peer top 4에게 chunk보냄. 10초마다 peer 재평가 (top 4) 30초마다 random peer에게 보낸다 (신규 가입 유저 배척을 막기 위해)

Video streaming and content distribution networks(CDN)

Challenge and Options, CDN

Video는 file이 크기에 major traffic consumer이다(Netfilx와 Youtube가 젗네 traffic의 50% 이상이다). User도 많고 video도 너무 많다. 또한 3G vs 5G의 차이도 심하다.

Solution : Distributed, Application-level infrastructure.

DASH protocol : Dynamic, Adaptive Streaming over HTTP encode 속도에 따라서 다른 quality의 video 제공

CDN : how to stream content to hundreds of thousands of simultaneous user? (Netflix 같은 것들) Option 1 : single, large "mega-server" - 단점이 더 많다. Option 2 : store/serve multiple copies of videos at multiple geographically distributed sites (CDN)- Web cache와 비슷한 원리

OTT : "over the top" CDN중 하나다. (여담. Netflix의 Original Contents는 AWS에 저장되어 있다.)

Socket programming with UDP and TCP

Socket Programming? -> Application - OS(end-end-transport protocol) 연결하는 Door.

Socket type : UDP (unreliable datagram) / TCP (reliable, byte stream-oriented)

UDP No connection between client & server no greeting. IP dst + port # is everything.

may be lost or out-of-order

server가 죽어도 client는 알 수 없다. (connection이 없기 때문)

socket programming client 와 server 모두 socket 생성. IP, port num, socket type(DGRAM) 필요. (server는 bind 꼭 해줘야함) data send / receive

TCP Client must contact server. server가 실행중이여야 하고, socket을 만들어 놔야 한다(welcome socket 느낌). 1 client : 1 socket. Connection을 확정한 후 data transfer 시작.

socket programming socket 생성. server가 먼저. IP, port num, socket type(STREAM). 얘도 server는 bind 먼저 해줘야 한다 Connection request, accept and data send / receive.

시험공부 하면서 적느라 엉망이다. 나중에 다시 와서 수정할듯

Computer Network - Introduction

Mon, 24 Oct 2022 18:23:47 GMT

This post retrieved from Yonsei University Computer Network course (CSI4106)

Credits :

우선 Computer Network를 공부하기에 좋다고 생각하는 흐름에 대해 얘기해보자. Computer Network를 짧게나마 공부하며 느낀 전체적으로 관통하고 있는 흐름은 다음과 같다.

Network Design 소개 + 몇몇 중요한 point
해당 Design의 pros and cons
Problem and Solution ( and more -not solved- problems )

어떤 챕터든 이러한 형태를 띄고 있는 것은 비슷했던 것 같다. Computer Network도 굳이 대분류로 분류했을 때 System 분야에 속하는 것 같아서 대부분은 Memory나 Bandwidth같은 현실적으로 Algorithm으로 해결이 어려운 부분들의 문제점이 많은 것 같다.

Operating System에서 Scheduling Policy들을 배울 때 어느 것이 가장 좋다고 말하기 어려웠듯이 여러 Policy들을 소개하고 있지만 어느것이 가장 좋다고 말하기 애매한 경우가 대부분이다. 각각의 장단점을 알고 어떤 문제점이 있으며 어떻게 보완할 수 있고 그럼에도 불구하고 가질 수 밖에 없는 단점이 무엇인지에 초점을 맞추면 공부하기 편한 것 같다. Introduction에서는 거의 모든 분야를 훑어보는 느낌이기에 이런 방법론적인 관점이 필요 없는 것 같지만 본격적으로 공부할 때에는 기억해 놓으면 편할 것 같다.

Introduction Overview

Terminology의 느낌 정도만 받는 걸로도 충분하다고 한다.

What is the Internet?

두가지 시각이 있다. "Nuts and Bolts view"와 "Service view". 전자는 구성 요소에 조금 더 집중하고 후자는 무슨 역할을 하는지에 조금 더 집중한다.

Nuts and Bolts view

Internet은 많은 computing device들의 집합이다. device들을 host라고 하고 host들은 Network의 end system이다. 이러한 host들이 communication을 위해 연결되어 있는데 이러한 연결의 medium을 communication links라고 한다. copper나 fiber를 많이 쓰는데 wireless로는 radio나 satellite도 쓴다. bandwidth가 중요하다! chunks나 data인 packet을 주고받는 내부 작용이 일어난다.

Internet은 networks of network다. 이러한 거대한 구조를 유지하기 위해서는 강력한 rule이 있어야한다. Network를 만든 이유는 sending, receiving이기에 이러한 sending receiving을 지배하는 rule을 protocol이라고 한다. (TCP, IP, HTTP, 802.11, Skype...)

Protocol은 fotmat, order of messages sent and received, actions taken on message transmission, receipt를 결정한다.

cf. $Internet \subset Computer;Network$

A Service view

Internet은 application에 services와 API를 제공하는 infrastructure다.

Network Edge

hosts : clients and servers

Host : sends packets of data

간단한 성능 측정 방법. 더 간단한 계산식

Host sending function : $;$$;$ 1. takes applicatoin message $;$$;$ 2. breaks into smaller chunks, known as packets of length L bits $;$$;$ 3. transmits packet into access network at transmission rate R $;$$;$$;$$;$ link transmission rate == link capacity, link bandwidth R : (bits/sec) L/R의 단위 계산식은 sec가 되는데 이를 transmission delay라고 한다.

Access networks and physical media

End system을 edge router에 연결해야 Network를 제대로 형성할 수 있다. 어떻게 연결할 것인가? 에 대한 얘기다. Router의 종류는 두가지 인데 part of Access Network와 part of Core Network이다. 여기서 Access Network의 router를 edge router라고 부른다.

Residential access nets(아파트 동마다 하나씩 있다고 생각하면 된다.) Institutional access networks(학교나 회사에 있다. 좀 다른 얘기긴 한데 yonsei web WiFi 너무 느리다), mobile access networks등이 있다.

연결 방법은 DSL과 cable network, home network가 있다.

DSL은 Digital Subscriber Line의 약자이다. 인터넷 초기 시절에 새로운 media를 설치하기에는 부담이 있어서 기존에 있던 telephone line을 사용했다. Frequency를 이용해 internet인지 telephone인지 구분했는데, 집과 센터에 muliplexer를 두고 신호를 transmission했다고 한다. Upstream보다 Downstream 속도가 훨씬 빠른데, 인터넷 초기임을 생각하면 당연한 얘기다. 초기에는 UCC가 많지 않았으니 Downstream 속도에 집중해야 했다. 하지만 그래도 old technology기에 많이 느리다.

Cable network는 TV에 사용하던 media라고 생각하면 된다. DSL과 다르게 하나의 선을 여러 집이 공유하고 있고, coaxial cable을 사용한다. 이것도 Unused frequency를 internet에 사용했다. Video data가 오고가야 하기 때문에 Telephone line보다 Capacity가 더 크다(-> 더 빠르다). 현재는 physical cable을 광섬유(fiber)로 교체해서 사용하고 있다.

Home network는 router(firewall, NAT)가 집 안에 있는 경우이다. Wireless access point도 있는데 router와 함께 공유기에 들어있다. Wireless access point로 WiFi가 작동한다.

Enterprise access networks (Ethernet) 학교나 회사에서 많이 사용한다.

Wireless access networks

Wireless access network는 공유되며, end system과 router를 연결해준다. (Access point같은 base station을 경유한다.)

wireless LANs(Local Access Network)는 빌딩에 있고 30m 이내 범위를 케어한다.

wide-area wireless access는 cellular를 생각하면 된다. 3G, 4G:LTE, 5G(?), 6G(2022-2 기준 Concept 잡는 중이라고 한다.)

Physical Media

이러한 연결들이 텔레파시로 이루어지지는 않는다. 여러 physical media가 존재하는데, guided media, unguided media로 분류할 수 있다. guided media는 "guided"에서 특정 line을 따라 흘러가는 모습을 상상할 수 있다. solid media이고 copper, fiber, coax가 있다. 안전하고, 빠르다. unguided media는 자유롭게 퍼지는 형태다. "흘러가는"것과 구분하자. radio가 있다.

Twisted Pair(TP) : DSL에서 쓰던 전선이다. copper다.

Coaxial cable : Coaxial은 동축의 라는 뜻이다. copper선 두개 꼬아놓은 형태이고 TP cable보다 capacity와 safety가 높다. 정보를 양쪽에서 보낼 수 있다.

Fiber optic cable : coaxial cable보다 higher capacity. 주로 core network에서 사용하고 가끔 access network에서도 사용한다. electric voltage가 전송되는 것이 아닌 light가 전송된다. High speed, Low error

Radio : 전자기 스펙트럼으로 전달된다. link type에 여러가지가 있는데 terrestrial microwave, LAN, wide-area, satellite가 있다.

Network Core

Interconnected routers의 mesh형으로 연결된 것이다. Ring이나 Tree처럼 계획적이지 않고 mesh형태인 것 정도를 기억하면 될듯하다.

Packet-switching

Packet을 옮긴다고 생각하면 된다. src에서 dst로.

Store-and-Forward를 이용하는데 이는 entire packet이 도착한 이후에 next link로의 이동을 시작한다는 의미다. 하나의 packet이 이동할 때에는 link를 fully하게 사용하므로 router에 packet들이 쌓일 수 밖에 없는 구조이다.

Two key network core functions

Rounting : src-dst 결정하기
Forwarding : Move packet from src to dst routing table에 header에 따른 output link가 저장되어 있다. -> 모든 packet에 ouput link에 대한 정보(header value)가 저장되어 있다.

Circuit switching

Router의 경로가 정해져 있는 경우. packet-switching과 다른 점은 dedicated하다는 점이다. 그래서 performance가 guaranteed하다. 또한 store가 필요 없으며 circuit의 순서를 기록한 table만 필요하다. data가 packet단위로 쪼개지지도 않는다.

다만 독점적이고, 사용하지 않는 circuit은 idle하기 때문에 효율적이지 않다. circuit을 사용하는 동안 다른 사람이 작업할 수 없다. 또한 circuit중 하나만 고장 나더라도 이용이 불가능하고 교체를 필요로 한다.

circuit switching에는 두가지 여러 user를 관리하는 방법이 있는데 FDM(Frequency division multiplexing)과 TDM(Time division multiplexing)이다. 이건 그림으로 보는게 이해가 빠를듯. 1/4Bandwidth x Time VS 1/4Time x Bandwidth냐의 차이이다.

Packet-switching VS Circuit-switching something : (packet, circuit) Break data into chunks : (Y, N) Store(Queueing) : (Y, N) Queueing delay : (Y, N) Dedicated : (N, Y) Many users at the same time : (Y, N) Guaranteed Performance : (N, Y)

packet-switching이 여러 user들을 만족시킬 수 있기에 많이 사용되지만 너무 많은 data가 오고가는 경우 delay와 loss가 커진다. 그래서 packet-switching을 사용하지만 circuit-like behavior를 갖게 하는 것을 목표로 한다. unsolved problem.

Delay, Loss, Throughput in Network

Delay와 Loss는 store-and-forward 때문에 생긴다.

End-to-end delay : propagation delay + transmission delay + queueing delay (+ nodal processing delay)

Nodal Processing : packet이 이미 corrupted 되었는지, 그렇다면 packet drop. Next packet 결정 (in routing table). 소요되는 시간이 매우 짧다.

Propagation Delay : 한 Node에서 다음 Node까지 data가 이동할 때 필요한 시간. length of pyhysical link / propagation speed. 위성 전송과 같은 경우에는 d가 커서 값이 클 수도 있다.

Transmission Delay : 어떤 Node에서 data를 전송하는데 걸리는 시간. packet length / link bandwidth = bits / bps

Queueing delay : Router에 여러개의 packet이 한 번에 들어오면 next link로 보내는 순서에 따라 저장을 해놔야 한다. 이때 저장하는 자료구조가 queue이다. 거기서 기다리는 시간이다.

R : link bandwidth (bps) L : packet length (bits) a : average packet arrival rate aL : 모든 packet이 도달하는데 걸리는 시간

La/R ~ 0 : avg.queueing delay small La/R -> 1 : avg.queueing delay large La/R > 1 : more "work" arriving -> packet loss TCP의 목표는 a를 줄이는 것이다.

Linux에서 traceroute [URL]로 실제 delay를 확인할 수 있다.

Queueing loss : Router도 memory의 size가 정해져있다. 따라서 queue의 capacity를 초과할 만큼 packet이 들어오면 packet loss가 발생한다. 매우 critical 하다.

Throughput : 얼마나 많은 bit을 보낼 수 있는가? (bits/time unit) -> bps src에서 dst까지 성공적으로 보내진 packet만 고려한다. Instantaneous : rate at given point in time Average : rate over longer period of time Total_Bits-Time Graph에서 미분값 vs end-to-end 기울기값 이라고 이해하면 될듯하다.

Protocol Layers, Service Models

Layers : each layer implements a service. 각 layer마다 action이 있다. 윗 단계의 layer는 아랫 단계에 의존한다. - 아랫 단계가 준비되어 있지 않으면 윗 단계를 실행할 수 없다. -

-이륙 할 수 없다면 gate로 들어갈 이유가 없다.-

Layering : reuse! common functionality of under line layers를 공유한다. Modularization하기 편하다.

Nework는 매우 복잡한 형태이다. 이를 조금 더 쉽기 다루기 위한 개념이라고 생각하자. 특정 layer의 service를 바꾸기 위해 모든 layer의 기능을 다 알 필요는 없다. Object Oriented Programming 개념과 비슷한 느낌이 있다. 그러나 Accessible한 정보는 interface parameter밖에 없기에 어느정도의 성능 저하도 감수해야 한다. Internet Protocol Stack

Application : supporting network applications (HTTP, SMTP, FTP) Transport : process-process data transtfer (TCP, UDP) Network : routing of datagrams from src to dst (IP, routing protocol) Link : data transfer between neighboring network elements (Ethernet, WiFi(802.111), PPP) Physical : bits "on the wire"

Ex) application : Web browser가 HTTP protocol message 생성 Transport : TCP는 message를 받아 Header 생성. 결과는 segment가 된다. Network : Transport의 segment를 받고 고유의 encapsulation을 수행해 own header를 붙인다. -> datagram (이걸 packet이라고 부른다.) Link : Frame의 header를 이용해 switching decision(dst addr)을 정함.

End-system : layer 7 device Switch : layer 2 device (link + physical) Router : layer 3 device (network + link + physical)

ISO/OSI reference model 현재는 Internet protocol stack쪽으로 기울었고 ISO model은 이론적인 부분으로만 배운다.

Internet protocol stack + 아래의 layer들

Presentation : allow applications to interpret meaning of data (encryption, compression, machine-specific conventions(machine마다 다른 binary data 처리)) Session : synchronization, checkpointing, recovery of data exchange

Internet protocol stack에서는 application이 이러한 부분들을 포함한다.

Network under attack : Security

초창기 internet은 자료 공유 / 연구 목적이었기에 보안에 대해선 크게 중요하지 않았다. -> 취약하다.

DDoS(Distributed Denial of Service) : packet 많이 보내서 마비 시키는 공격 Sniffing : Read data Spoofing : Send false data (pretend)

History

1961~1972 : Early packet-switching principles

1961 : queueing theory가 packet-switching에 좋다! 1964 : packet-switching in military nets 1967 : ARPAnet conceived by Advanced Research Projects Agency 1969 : ARPAnet first node operational 1972 : ARPAnet public demo -> 대학들끼리 network 형성, first e-mail program

1972~1980 : Internetworking, new and proprietary nets

1970 : ALOHAnet satellite network in Hawaii 1974 : Architecture for interconnecting networks 1976 : Ethernet at Xerox PARC late 70's : switching fixed length packets

1980~1990 : New protocols, a proliferation of networks

1983 : deployment of TCP/IP 1982 : smtp e-mail protocol defined 1983 : DNS defined for name-to-IP-address translation 1985 : ftp protocol defined 1988 : TCP congestion control

1990~2000's : commercialization, the Web, new apps

1991 : NSF lifts restrictions on commercial use of NSFnet early 1990s : Web, hypertext Nelson, HTML, HTTP ... late 1990's ~ 2000's : instant messaging(like kakaotalk)

2005~present

~5B device attached to Internet emergence of SNS service providers(Google, Microsoft) create their own networks Cloud

Compiler Design - Syntax Analysis

Sun, 23 Oct 2022 06:26:52 GMT

Syntax and Semantic

In English Language

Syntax : The form or structure of English sentences No concern about meaning, specified by only grammar

Semantic : The meaning of English sentences

In Programming Language

Syntax : The form or structure of programs No concern about meaning, specified by context-free grammar <- Now... we need to know this...

Semantic : The meaning of programs

Context-Free Grammar

Grammar를 정의하는 방법중 하나로, Recursion을 이용한 정의라고 생각하자. 어떤 format을 사용하는지 간단한 예시로 함께 보자. 편의상 간단한 English를 context-free grammar로 표현해본다.

Sentence ::= Subject Verb Object . Subject ::= I | A Noun | The Noun Object ::= me | a Noun | the Noun Noun ::= cat | mat | rat | $\epsilon$ Verb ::= like | is | see | sees 여기서 Red text가 아닌 것들은 non-terminal ($\epsilon$ 제외), red text로 표시한 것들은 Terminal이라고 부른다. Terminal은 더이상 grammar를 참고하지 않는 마지막 단계라고 생각할 수 있다. (Recursion을 이용한 정의라고 생각했을 때 return이 시작되는 단계라고 생각할 수 있다.)

Derivation

Derivation step마다 nonterminal을 그러한 nonterminal을 설명(정의)하고 있는 right-hand side로 대체한다. 이때 나타나는 string들을 sentential forms라고 한다. sentence는 sentential forms가 모두 terminal인 string이라고 볼 수 있다.

derivation이 불가능한 sentence는 illegal하다고 볼 수 있다. -> CFG specified all valid strings(Programs) of a given programming language!

Elements of CFG

Terminal Symbols : terminal string들이다. "while", "if", "<=", ID, INTLITERAL과 같은 것들이 있다.
Nonterminal Symbols : nonterminal string들이다. Program, Command, Expression, Declaration등이 있다.
Start Symbol : 보통 첫 production의 left-hand side이다.
Productions : nonterminal ::= (non)terminals 꼴의 모든 것들

Leftmost & Rightmost Derivation

Derivation step마다 두가지의 'choice'가 생긴다.

어떤 nonterminal을 derivation할 것인지
그 nonterminal을 어떤 (non)terminal로 derivation할 것인지

유용한(하다고 알려진) 두가지 derivation을 소개하고 있다.

Leftmost derivation : always replace leftmost nonterminal
Rightmost derivatoin : always replace rightmost nonterminal

BNF, EBNF and more CFG

특별할 것 없다. 그냥 CFG를 표현하는 방법일 뿐이다. Full name은 Backus-Naur Form. 위에서 들었던 예시도 BNF에 포함되는 듯 하다.

Program ::= Command Command ::= ~~~ $;;;;;;;;;;;;;;;;$| ~~~ ; ~~

EBNF는 Extended BNF의 약자이다. BNF에 REs를 합친것. $X^*$같은 표현들을 차용한 것이다.

CFG의 관용적인 부분들 - 위의 Elements of CFG와 큰 차이는 없다.

Start Symbol : The letter S, whenever it appears (예시를 봐야 납득할듯)
Non-terminals : lower case names, capital letters(A, B, ...)
Terminals : boldface names such as digits, operators, ...

A little bit usefull theory

Parser를 구현할 때 필요하다고 한다. 기본적인 집합 규칙과 비슷하다. Grammar Transformations

N ::= XY | XZ -> N ::= X(Y|Z) 결합법칙 성립함.
N ::= X | NY -> N ::= XY$^*$
N ::= X, M ::= $\alpha$N$\beta$ -> N ::= X, M ::= $\alpha$X$\beta$ 치환도 됨

Parsing

Derivation and parse tree

예시 한 번 보면 이해 가능할 듯 하다. 결국 terminal만 남아야 한다.

Ambiguous Grammars

이것도 예시로 보는게 편할듯! 어떤 nonterminal을 어떤 (non)terminal로 replace하냐에 따라 결과가 달라질 수 있다는 얘기다. 이렇게 several distinct parse tree가 존재한다면 ambiguous 하다고 얘기한다. 이러한 경우를 대비하기위해 CFG 단계에서 연산들에 우선순위를 고려한다.

IF ELSE문 또한 ambiguous grammar가 될 수 있는데 아래와 같은 예시가 있다. IF $\underline{a}$ THEN IF $\underline{b}$ THEN x = false; ELSE x = true; 위에서 ELSE가 첫 번째 IF에 속하는지 두 번째 IF에 속하는지 모호하기 때문이다. 이를 방지하기 위해 아래와 같은 CFG를 사용한다. stmt ::= IF expr THEN expr stmt END IF $;$$;$$;$$;$$;$$;$$;$$;$$;$| IF expr THEN stmt ELSE stmt END IF

cf. 여기까지가 2022-2 CSI4104 Midterm 범위였다.

Recursive Descent Parsing

Compiler Design - Lexical Analysis

Sun, 23 Oct 2022 00:30:08 GMT

Lexical Anaysis 는 Compilation의 첫 번째 단계이다. given sequence of string - code - 에서 token을 분리해내는 단계이다.

Role of a Scanner

Scanner의 역할은 Stream of characters를 words로 바꾸는 것이고, 이러한 word들을 Token이라고 부른다. Token은 syntax의 unit이다. -Introduction의 예시 참고!- Token은 보통 로 표기한다. (Tuple)

여기서 Lexeme이라는 개념이 나오는데, word를 구성하는 character들이다. Lexeme이라는 개념을 이해하기 어려웠는데, Pattern과 비교하여 예시를 보면 비교적 쉽게 이해할 수 있다.

Pattern은 특정한 Token type을 정의하는 규칙이다. 이 또한 예시로 함께 살펴보자.

Token type : ID (Identifier) Pattern : A string of letters, digits, and underscores, beginning with a letter or underscore Set of Lexemes : {var, _var, ptr_1, foo, foo123}

Token type은 어떠한 종류의 token인지를 나타낸다. 예시처럼 identifier가 될 수도 있고 INTLITERAL -Integer 값- 이나 if while과 같은 제어문, +나 * 같은 arithmetic 기호가 될 수도 있다. Pattern은 이러한 token type의 규칙의 정의이다.

Identifier의 경우 대부분의 high-level 언어에서 문자나 _로 시작하는 것을 규칙으로 하고 그 뒤로 문자와 숫자, _의 어떠한 조합도 가능하다. INTLITERAL의 경우 pattern은 decimal digits의 조합이 될 것이다.

Set of lexemes는 이러한 pattern으로 정의된 token type의 possible sets다. 즉 $\alpha \isin S$인 $\alpha$는 Pattern을 만족하는 string이라고 볼 수 있다. Pattern $X$에 의해 정의되는 set of lexemes를 $L(x)$라고 정의한다.

기본적인 $L(X)$에 대한 연산들 $L \cup M = {s|s\isin L \texttt{ or } s\isin M}$ $LM = {st|s \isin L \texttt{ and } t \isin M}$ $L^* = \bigcup_{0\leq i \leq \infty }^{}L^i$ $L^0 = {\epsilon}$ $L^i = \underset{\texttt{i times concatenation}}{\underbrace{LL...L}}$ $\epsilon = \texttt{empty string}$

Regular Expression

이제 Pattern이 무엇인지 알았으니 그 중요성 또한 알 수 있다. Token type을 정의하고 set of lexemes를 규제하는 것이 pattern이기에 이러한 pattern에 대한 formal한 notation을 정의할 필요성을 느낄 수 있다.

Scanner를 만들기 위해서는 Pattern에 대한 정보인 specification 을 명확히 해야 하는데, natural language로 이를 표현하면 서로 다른 기준의 specification들이 형성된다. 이는 수학적(일관적)이지 않다! 이러한 통용되는 specification을 만든 것을 Regular Expression(RE) 이라고 부른다. Lexeme은 finite set인 $\Sigma$에 속하는 character들로 만들어지고, 이러한 finite set $\Sigma$를 alphabet 이라고 부른다.

$\Sigma$의 character를 쓸 때에는 구분하기 위해서 $\underline{underline}$을 사용한다. alphabet $\Sigma$에 대해서 inductive definition은 아래와 같다.

R1 : $\epsilon$ is a RE denoting the set {$\epsilon$} R2 : If $\underline{a}$ is in $\Sigma$, then $\underline{a}$ is a RE denoting the set {$\underline{a}$} If x and y are REs denoting $L(x)$ and $L(y)$ then $\texttt{ }$$\texttt{ }$ R3: $x|y$ is a RE denoting $L(x) \cup L(y)$ -> "alternation" $\texttt{ }$$\texttt{ }$ R4: $xy$ is a RE denoting $L(x)L(y)$ -> "concatenation" $\texttt{ }$$\texttt{ }$ R5: $x^$ is a RE denoting $L(x)^$ -> "repetation" or "closure"

되게 당연해 보이는 얘기들이다. 그냥 한 번 보고 넘어가면 될 듯 하다. 다만 사칙연산처럼 우선순위(Precedence)를 가지는데, closure > concatenation > alternation 순이다. 다만, parenthesis가 있다면 parenthesis가 먼저이다.

Shorthand Notation도 존재한다. $r^+$는 많이 쓸듯? $r^+ = rr^*$ $r? = \epsilon | r$

Automata theory와 Theory of algorithm의 결과들로 recognizer를 만들 수 있다. 이때 RE로 표현하여 사용한다.

"We study REs and associated theory to automate scanner construction"

Automata - DFA

Automata theory를 어떻게 활용할 것인지에 대해 논의할 차례이다. RE는 set of lexemes를 묘사하는 pattern을 수식적으로 나타내는 방법론이였다. 그리고 RE에는 '순서'가 있다. $\underline{a}|(\underline{a}|\underline{b})$와 같은 RE에서는 $\underline{a}$ 이 나온 이후에 다른 경우의 수가 고려될 수 있다. 이러한 순서를 가지고 있는 RE는 Finite Automata로 나타낼 수 있다.

예를 들어 Register를 표현하는 RE는 $\underline{r}(\underline{0}|\underline{1}|\underline{2}|...|\underline{9})^{+}$와 같은 shorthand notation을 포함한 RE로 나타낼 수 있고 이를 Finite Automata로 나타내면 아래 그림과 같다. - 아래와 같은 형태를 Finite Automata라고 한다. - 여기서 Accepting state에 최종적으로 도달해야지만 Register로 받아들여진다. 만약 다른 state로 전이하는 과정에서 경우의 수에 해당하지 않는 $\alpha \isin \Sigma$가 input으로 들어오면 $S_e$라는 error state로 진입한다.

Finite Automata를 구성하는 요소 5개는 $<\Sigma , S, \delta, F, I>$이고 이는 각각 아래와 같은 의미이다.

$\Sigma$ is an alphabet
$S$ is a finite set of states
$\delta$ is a state transition function $\delta : S \times \Sigma -> S$
$F$ is a set of final or accepting states
$I$ is the start state

여기서 $\delta$는 모든 $\alpha \isin \Sigma$에 대하여 각각의 $S$, $\alpha$조합에서 다음 state가 어디인지를 나타낸 table이라고 생각할 수 있다.

이렇게 특정 $\underline{x}$를 판단하는 방법으로 FA를 사용할 수 있다. 그리고 그러한 FA중 diterministic한 FA를 DFA라고 부른다.

DFA accepts input string $\underline{x}$ $\textit{iff}$ $\underline{x}$ leaves it in an accepting state $(S_2)$

그렇다면 RE로 나타낸 pattern들은 모두 FA로 나타냈을 때 deterministic할까?

Automata - NFA

$(\underline{a}|\underline{b})^*\underline{abb}$와 같은 경우를 생각해보자. 이러한 pattern은 아래와 같은 FA로 나타낼 수 있다. $S_1$에서 input으로 $\underline{a}$가 들어왔을 때 우리는 $S_1$으로 갈 수도 있고, $S_2$로 갈 수도 있다. 여기서 어떤 선택을 해야 하는가? 단순한 '예측'으로 시도해볼 수 밖에 없다. 이러한 FA를 Non-deterministic Finite Automata라고 한다.

NFA accepts input string $\underline{x}$ $\textit{iff}$ $\exist , path ; s.t \texttt{ Through the transition graph from }S_0 \texttt{ to Final State}$

DFA는 NFA의 special case라고 볼 수 있다. DFA는 $\epsilon$ transition이 없으며 transition function은 single-valued이다.

Definition of DFA & NFA A FA is a DFA if

No state has an $\epsilon$-transition, i.e., there is no transition on input $\epsilon$
For each state s and input symbol $\underline{a}$, there is at most one edge labeled $\underline{a}$ leaving state s.

A FA is an NFA
if it contatins $\epsilon$-transition or
if it has several possible transitions from a state s on a given input symbol $\underline{a}$.

To convert a specification to code:

Write down the RE for the input language
Build a big NFA
Build the DFA that simulates the NFA
Minimize the number of states in the DFA
Generate the scanner code

DFA와 NFA는 Scanner Construction을 위해 알아야 했다. 이제 NFA를 simulation할 수 있는 DFA를 만드는 방법과 그러한 DFA를 minimize하는 방법을 소개한다. 이러한 과정을 이해하면 Scanner Generation을 수행할 수 있다! 아마도..

Automating Scanner Construction

RE -> NFA (Thompson's construction)

RE를 인수분해 한다고 생각하고 각각의 NFA를 만든다. 그 후 만들어진 NFA -어떤 RE를 표현하는-를 다른 NFA와 순서에 맞게 연결한다. 이때 NFA와 NFA를 $\epsilon$-transition으로 연결한다. 이때 symbol $\underline{a}$가 RE에서 여러번 등장하더라도 각각에 맞는 NFA를 만들어야 하는 점을 주의해야 한다.

예시 몇개 보면 이해할 수 있는 정도이다. 아래 그림을 참고하자. Reference : New Algorithms for Regular Expression Matching - Scientific Figure on ResearchGate. Available from: https://www.researchgate.net/figure/Thompsons-NFA-construction-The-regular-expression-for-a-character-a-S-corresponds-to_fig1_1959575 [accessed 22 Oct, 2022]

NFA -> DFA (subset construction)

NFA의 정의에서 알 수 있듯이 FA가 $\epsilon$-transition을 가지고 있거나 한 $(S,\alpha)$쌍에서 두가지 이상의 transition이 가능하다면 NFA이다. NFA를 DFA로 바꾸기 위해서는 이 두가지를 없애야 한다. $(\underline{a}|\underline{b})^\underline{abb}$의 예시에서 $\underline{a}$의 transition이 두가지 방향으로 가능했다. subset construction에서는 이 두가지 방향으로의 transition을 *'parallel'**하게 진행한다. 그리고 이러한 parallel한 transition을 n가지 State를 합친 virtual state로 표현한다.

어떤 State 에서 같은 $\alpha \isin \Sigma$에 대해 n개의 transition 이 가능한 경우 그 가능한 State들을 모두 합친 virtual state를 만든다.
virtual state 혹은 state에서 모든 $\alpha \isin \Sigma$에 대해 어떤 transition이 가능한지 보고 이를 반복한다.
반복하면서 $\epsilon$-transition이 있는 경우 합치는게 가능하다면 합쳐준다.

여기서 가장 어려웠던 부분이 만약 $S_i$ 과 $S_j$를 합친 virtual state가 존재하고, 그러한 virtual state에서 어떤 input $\alpha \isin \Sigma$에 대해 $S_i$는 $S_e$가 아닌 State로 transition이 가능하지만 $S_j$가 $S_e$로 transition하는 경우에는 어떻게 하지? 였다.

해답은 그렇게 어렵지 않았는데, 그냥 virtual state로 transition하게한 input이 $S_i$를 통과하는 path를 가진다고 생각하면 되는 일이었다. 이부분 말고는 어려운 부분은 없는 것 같다. 공부한 강의안에는 복잡한 알고리즘이 적혀 있었기에 정확한 알고리즘을 공부하고 싶은 사람은 참고하자. subset construction도 예제 문제 몇개 풀어보면 금방 감을 잡을 수 있을 것 같다.

DFA -> minimal DFA (Hopcroft's algorithm)

NFA에서 simulation을 위한 DFA를 만들었으니 그 DFA를 minimal하게 만들 차례이다. minimal하게 만들어야 하는 이유는 여러가지가 있겠지만 computation cost를 줄이기 위함이 큰 이유중 하나가 아닐까라고 추측했다.

먼저 State의 distinguishable에 관해 논해야 한다. 그 후에 distinguishable하지 않은 state들을 합치는 과정을 통해 DFA를 minimize한다.

Distinguishable에 대한 확인은 쉽다. Partition과 같은 새로운 개념을 활용하여 배웠지만 그 전에 이해를 돕기 위해 나름대로 이해한 내용을 전달하고자 한다. 어떤 $S_i$와 $S_j$가 $\alpha \isin \Sigma$에 대해서 같은 transition을 가진다면 이를 distinguishable하지 않다고 한다. 이렇게 distinguishable하지 않은 State들을 합치는 과정이 mimimazation이라고 볼 수 있다.

아래는 조금 더 수식적이고 정밀한 정리이다.

Partition

Partition은 at least one State가 포함된 집합이다. 같은 Partition에 존재하는 State는 not distinguishable하다. 어떤 State든 하나의 Partition에만 속한다.

처음 Partition은 2개의 집합으로 나뉜다. Final State $F$와 $S-F$이다. 이 둘은 $\epsilon$으로 distinguishable하다.

$p_i \isin P$와 $\alpha \isin \Sigma$에 대하여 distinguishable한지 판단한다. 이를 반복한다.

알고리즘을 보면 머리가 복잡하지만 간단한 예시를 통해 쉽게 접근이 가능했다. $(\underline{a}|\underline{b})^{*}\underline{a}\underline{b}\underline{b}$에 대한 예시이다.

Cf> Limitations of Regular Languages

$\underline{0}^n|\underline{1}^n$과 같은 것을 REs로 표현할 수 없다.

Context free grammer

Programming Languages exhibit self embedding -> Will handle it in Syntax Analysis

Compiler Design - Introduction

Thu, 20 Oct 2022 17:54:37 GMT

This post retrieved from Yonsei University Copiler Design course (CSI4104)

Introduction of Compiler Design

Why we need to know compiler?

Starting Question : High-level로 작성한 code들이 어떻게 machine code로 변환되는가?

C++, C, Fortran등 많은 Programming Language들이 존재한다. 이러한 코드들의 의도에 맞게 hardware들이 작동하게 하는 것은 중요한 문제이다.

Compiler가 Source Progarm들을 Assembly Program으로 변환하고, Assembler가 Assembly Program을 Executable Object Program으로 변환한다. Binary File인 EOP는 Memory에 저장되어 Processor가 해당 file의 instruction을 읽고 해석하여 실행한다(fetch, decode, execution).

Compiler Design(CSI4104)에서는 Assembler 이전의 과정에 집중한다. 내가 작성한 코드가 어떻게 실행이 되는지를 이해하는 첫번째 과정이라고 생각할 수 있다.

Compilation

Compilation은 Analysis와 Synthesis 단계로 나눌 수 있다.

Analysis는 High-level로 작성된 code를 해석하고 tree structure로 기록하는 단계이고, Synthesis는 tree structure를 target program으로 변환하는 단계이다.

Compliler Structure

위 사진은 Compiler의 전체적인 flow를 보여준다. Lexical Analysis는 Scanner 에서 이루어지며 Character string에서 token을 만드는 과정이다.

position = inital + velocity * 60; 과 같은 high-level 코드를 <=> <+> <*> 와 같은 형식으로 바꿀 수 있다. 이때 바뀐 형식에서 괄호로 구분한 것들을 Token이라고 부른다. Machine(Hardware) 관점에서는 variable name에는 관심이 없다는 것을 생각하자. 여기서 id는 identifier이다.

이렇게 Token으로 구분한 후에는 Parser 가 Syntax Analysis를 수행한다. 이 과정을 거치면 Abstract Syntax Tree로 나타낸다.

위에서 변환된 ...를 고려했을 때 아래와 같은 tree구조를 가진다. 여담으로 Java Virtual Machine에서는 stack을 사용하여 연산을 진행하는데, 위 tree에서 root node인 <=>의 오른쪽 sub tree를 후위순회한 것으로 계산한다. , , <*>, <+>의 순서에서 identifier인 경우에는 stack에 push하고 operator인 경우 operation에 필요한 수 만큼 operand를 stack에서 pop한 후 연산하여 다시 push한다. 이를 수행하는 JVM bytecode는 아래와 같다. JVM에 대해서는 후에 다시 다룰 예정이다.

fload_2
fload_3
bipush 60
i2f
fmul
fadd
fstore_1

Semantic Analysis는 program이 의미하는 바를 AST를 통해 확인한다.(라고 적혀있긴 하지만 code의 legal, illegal을 조사하는 것으로 이해했다.) Static한 semantic check와 Dynamic한 semantic check를 진행하는데, Static Semantic Check는 아래의 것들을 확인한다.

모든 identifier가 사용되기 전에 선언되었는지
모든 identifier가 적절하게 사용되었는지 (ex. int + string)
call argument가 적절한지
type checking

Dynamic Semantic check는 run-time에 걸쳐서 진행되고 아래와 같은 것들을 확인한다. segmentation fault 에러의 경우들이라고 이해했다.

Array index 범위
Arithmetic Error (ex. divide by zero)

이후의 과정들은 나중에 이어서 자세한 내용들을 작성할 예정이다. 이후 Intermediate Code Generator가 Intermediate Representation(IR)을 만든다. 여기까지가 Compiler Frontend이고 Analysis 과정이다. 그 이후에는 Code Optimizer와 Code Generator를 거쳐 Target Assembly Language를 반환한다.

Code Optimizer의 경우 GCC compiler에서 -O3, -O2와 같은 option으로 활용되는 것이 아닌가 싶다.

Future Road

Reference

Yonsei University CSI4104 - Compiler Design Lecture Slide, L00_Course_Introduction

kjh_b_d.log

Automata and Formal Languages - Introduction

What we will learn?

Terms

Alphabets, Strings and Languages

Alphabets

Strings

Languages

Some notations and definitions about SETs

Power set

Partition

Cartesian Product and Functions

Function

Computer Network - Application Layer

Principle of network applications

Application Structures

Client-Server architecture

P2P architecture

Process communicating

sockets - kind of interface

Web and HTTP

HTTP

Cookies

Caches (proxy server)

Electronic mail

DNS

P2P applications

P2P file distribution example : BitTorrent

Video streaming and content distribution networks(CDN)

Socket programming with UDP and TCP

Computer Network - Introduction

Introduction Overview

What is the Internet?

Nuts and Bolts view

A Service view

Network Edge

Access networks and physical media

Wireless access networks

Physical Media

Network Core

Packet-switching

Circuit switching

Delay, Loss, Throughput in Network

Protocol Layers, Service Models

Network under attack : Security

History

1961~1972 : Early packet-switching principles

1972~1980 : Internetworking, new and proprietary nets

1980~1990 : New protocols, a proliferation of networks

1990~2000's : commercialization, the Web, new apps

2005~present

Compiler Design - Syntax Analysis

Syntax and Semantic

In English Language

In Programming Language

Context-Free Grammar

Derivation

Leftmost & Rightmost Derivation

BNF, EBNF and more CFG

CFG의 관용적인 부분들 - 위의 Elements of CFG와 큰 차이는 없다.

A little bit usefull theory

Parsing

Derivation and parse tree

Ambiguous Grammars

Recursive Descent Parsing

Compiler Design - Lexical Analysis

Role of a Scanner

Regular Expression

Automata - DFA

Automata - NFA

Automating Scanner Construction

RE -> NFA (Thompson's construction)

NFA -> DFA (subset construction)

DFA -> minimal DFA (Hopcroft's algorithm)

Partition

Cf> Limitations of Regular Languages

Context free grammer

Compiler Design - Introduction

Introduction of Compiler Design

Why we need to know compiler?