_gyullbb.log

Cilium - Cilium Security

Sat, 06 Sep 2025 13:06:36 GMT

Cilium provides security at several layers. Each layer can be used on its own or combined with the others. This post looks at each layer in turn.

Layer 3 (Identity-Based)

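This layer defines connectivity policy between endpoints. The IP-based security model traditionally used in Kubernetes has to update the security rules on every node whenever a pod is created or deleted, which limits it in large environments. To overcome this, Cilium introduced identities: security is fully decoupled from IP addresses, and communication rules are defined in terms of identity.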

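Traditional Kubernetes

Security policy is applied based on the IP address assigned to each pod
Security rules (IP filters) on every node must be updated whenever a pod is created or deleted
Scalability and flexibility are limited in large environments

Cilium

Security policy is applied through label-based identities
No security rules need to be reconfigured when pods are created or deleted; only the identity is resolved
Better scalability and easier management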

Identity

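An identity is an object assigned to every endpoint. It is used to enforce basic connectivity between endpoints and to apply Layer 3 security.

Cilium assigns each endpoint an identifier that is unique across the cluster, derived from the endpoint's security relevant labels. Endpoints with the same set of labels share the same identity. Because every endpoint that belongs to the same set of security labels shares one identity, policy enforcement scales to very large deployments even as the application scales out.

When the labels of a pod or container change, the identity is re-resolved and updated automatically if needed.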
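As a rough illustration of the scaling point, the manifest below is a sketch only: the Deployment name and replica count are hypothetical and not part of the original setup. All three replicas carry the same `role: backend` label, so under the model described here they would be expected to share one identity instead of receiving one each.

```yaml
# Hypothetical Deployment, used only to illustrate identity sharing.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: backend
spec:
  replicas: 3              # three endpoints, one label set
  selector:
    matchLabels:
      role: backend
  template:
    metadata:
      labels:
        role: backend      # label shared by every replica
    spec:
      containers:
      - name: nginx
        image: nginx
        ports:
        - containerPort: 80
```

Changing `replicas` changes the number of endpoints but not the label set, which is why no policy rule has to be rewritten when the application scales.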

Cilium Code

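The following walks through the code to see how an identity is assigned to an endpoint.

The code below detects changes to an endpoint's security labels and assigns an identity. There are local and global identities: a local identity is a numeric ID that is unique only within a single node, while a global identity is unique across the whole cluster. A local identity is assigned directly on the node, without controller synchronization or a KV store lookup. A global identity goes through controller synchronization, and the controller performs the identity allocation.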

//cilium/pkg/endpoint/endpoint.go
func (e *Endpoint) runIdentityResolver(ctx context.Context, blocking bool, updateJitter time.Duration) (regenTriggered bool) {
    ...
    newLabels := e.labels.IdentityLabels()
    e.runlock()
    ...
    // Local identity: allocate directly on the node, without controller
    // synchronization or a KV store lookup.
    if blocking || identity.IdentityAllocationIsLocal(newLabels) {
        scopedLog.Info("Resolving identity labels (blocking)")
        regenTriggered, err = e.identityLabelsChanged(ctx)
        ...
    }
    ...
    ctrlName := resolveIdentity + "-" + strconv.FormatUint(uint64(e.ID), 10)
    // Global identity: run controller synchronization; the controller
    // performs the identity allocation.
    e.controllers.UpdateController(ctrlName,
        controller.ControllerParams{
            Group: resolveIdentityControllerGroup,
            DoFunc: func(ctx context.Context) error {
                _, err := e.identityLabelsChanged(ctx)
                ...
            },
            ...
        },
    )
}

If the security labels have changed, or an identity is needed for the first time, a new identity is allocated.

//cilium/pkg/endpoint/endpoint.go
func (e *Endpoint) identityLabelsChanged(ctx context.Context) (regenTriggered bool, err error) {
  ...
    allocatedIdentity, _, err := e.allocator.AllocateIdentity(ctx, newLabels, notifySelectorCache, identity.InvalidIdentity)
}

For a global identity, the controller obtains a free ID from the idPool through the KV store.

//cilium/pkg/identity/cache/allocator.go
func (m *CachingIdentityAllocator) AllocateIdentity(ctx context.Context, lbls labels.Labels, notifyOwner bool, oldNID identity.NumericIdentity) (id *identity.Identity, allocated bool, err error) {
    idp, allocated, isNewLocally, err := m.IdentityAllocator.Allocate(ctx, &key.GlobalIdentity{LabelArray: lbls.LabelArray()})
}
//cilium/pkg/allocator/allocator.go
func (a *Allocator) Allocate(ctx context.Context, key AllocatorKey) (idpool.ID, bool, bool, error) {
    ...
    // Allocate a free ID from the idPool via the KV store.
    kvstore.Trace(a.logger, "Allocating from kvstore", fieldKey, key)
    value, isNew, firstUse, err = a.lockedAllocate(ctx, key)
    if err == nil {
        a.mainCache.insert(key, value)
    }
    ...
}

For a local identity, reserved and well-known identities are checked first. If the labels map to a reserved identity, that pre-allocated identity is returned as is, with no further allocation.

//cilium/pkg/identity/cache/allocator.go
func (m *CachingIdentityAllocator) AllocateLocalIdentity(lbls labels.Labels, notifyOwner bool, oldNID identity.NumericIdentity) (id *identity.Identity, allocated bool, err error) {

    // If this is a reserved, pre-allocated identity, just return that and be done
    if reservedIdentity := identity.LookupReservedIdentityByLabels(lbls); reservedIdentity != nil {
        m.logger.Debug(
            "Resolving reserved identity",
            logfields.Identity, reservedIdentity.ID,
            logfields.IdentityLabels, lbls,
            logfields.New, false,
        )
        return reservedIdentity, false, nil
    }
    ...
}

Reserved Identity

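As the code above shows, Cilium has reserved identities. They are assigned to well-known endpoints that Cilium needs in order to handle network traffic, or whose security identity is clearly defined. The numeric IDs of the reserved identities are defined in the code below.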

//cilium/pkg/identity/numericidentity.go
const (
    // IdentityUnknown represents an unknown identity
    IdentityUnknown NumericIdentity = iota

    // ReservedIdentityHost represents the local host
    ReservedIdentityHost

    // ReservedIdentityWorld represents any endpoint outside of the cluster
    ReservedIdentityWorld

    // ReservedIdentityUnmanaged represents unmanaged endpoints.
    ReservedIdentityUnmanaged

    // ReservedIdentityHealth represents the local cilium-health endpoint
    ReservedIdentityHealth

    // ReservedIdentityInit is the identity given to endpoints that have not
    // received any labels yet.
    ReservedIdentityInit

    // ReservedIdentityRemoteNode is the identity given to all nodes in
    // local and remote clusters except for the local node.
    ReservedIdentityRemoteNode

    // ReservedIdentityKubeAPIServer is the identity given to remote node(s) which
    // have backend(s) serving the kube-apiserver running.
    ReservedIdentityKubeAPIServer

    // ReservedIdentityIngress is the identity given to the IP used as the source
    // address for connections from Ingress proxies.
    ReservedIdentityIngress

    // ReservedIdentityWorldIPv4 represents any endpoint outside of the cluster
    // for IPv4 address only.
    ReservedIdentityWorldIPv4

    // ReservedIdentityWorldIPv6 represents any endpoint outside of the cluster
    // for IPv6 address only.
    ReservedIdentityWorldIPv6

    // ReservedEncryptedOverlay represents overlay traffic which must be IPSec
    // encrypted before it leaves the host
    ReservedEncryptedOverlay
)

// Special identities for well-known cluster components
// Each component has two identities. The first one is used for Kubernetes <1.21
// or when the NamespaceDefaultLabelName feature gate is disabled. The second
// one is used for Kubernetes >= 1.21 and when the NamespaceDefaultLabelName is
// enabled.
const (
    DeprecatedETCDOperator NumericIdentity = iota + 100
    DeprecatedCiliumKVStore

    // ReservedKubeDNS is the reserved identity used for kube-dns.
    ReservedKubeDNS

    // ReservedEKSKubeDNS is the reserved identity used for kube-dns on EKS
    ReservedEKSKubeDNS

    // ReservedCoreDNS is the reserved identity used for CoreDNS
    ReservedCoreDNS

    // ReservedCiliumOperator is the reserved identity used for the Cilium operator
    ReservedCiliumOperator

    // ReservedEKSCoreDNS is the reserved identity used for CoreDNS on EKS
    ReservedEKSCoreDNS

    DeprecatedCiliumEtcdOperator

    // Second identities for all above components
    DeprecatedETCDOperator2
    DeprecatedCiliumKVStore2
    ReservedKubeDNS2
    ReservedEKSKubeDNS2
    ReservedCoreDNS2
    ReservedCiliumOperator2
    ReservedEKSCoreDNS2
    DeprecatedCiliumEtcdOperator2
)

Reserved Identity	Numeric ID	Description
IdentityUnknown	0	Unknown identity
ReservedIdentityHost	1	The local host
ReservedIdentityWorld	2	Any endpoint outside the cluster
ReservedIdentityUnmanaged	3	Endpoints not managed by Cilium
ReservedIdentityHealth	4	The local cilium-health endpoint
ReservedIdentityInit	5	Endpoints that have not received any labels yet
ReservedIdentityRemoteNode	6	All nodes except the local node
ReservedIdentityKubeAPIServer	7	Remote nodes that run kube-apiserver backends
ReservedIdentityIngress	8	The IP used as the source address for connections from Ingress proxies
ReservedIdentityWorldIPv4	9	IPv4 endpoints outside the cluster
ReservedIdentityWorldIPv6	10	IPv6 endpoints outside the cluster
ReservedEncryptedOverlay	11	Overlay traffic that must be IPSec encrypted before it leaves the host

Component Identity	Numeric ID	Description
ReservedKubeDNS	102	kube-dns
ReservedEKSKubeDNS	103	EKS kube-dns
ReservedCoreDNS	104	CoreDNS
ReservedCiliumOperator	105	Cilium Operator
ReservedEKSCoreDNS	106	EKS CoreDNS

Layer 3 policy methods

Layer 3 connectivity rules can be defined in several ways.

Endpoints based

Connectivity is defined based on the labels of two endpoints managed by Cilium.

Goal: allow traffic `role: frontend` → `role: backend`

Deployment YAML

# frontend pod
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    role: frontend
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
---
# other pod
apiVersion: v1
kind: Pod
metadata:
  name: other
  labels:
    role: other
spec:
  containers:
  - name: nginx
    image: nginx
    ports:
    - containerPort: 80
---
# backend pod
apiVersion: v1
kind: Pod
metadata:
  name: backend
  labels:
    role: backend
spec:
  containers:
  - name: backend
    image: nginx
    ports:
    - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: backend
spec:
  selector:
    role: backend
  ports:
    - port: 80
      targetPort: 80
---
#NetworkPolicy
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend

Verify that the policy is applied

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system exec ds/cilium -- cilium policy get
[
  {
    "endpointSelector": {
      "matchLabels": {
        "any:role": "backend",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "any:role": "frontend",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          }
        ]
      }
    ],
    ...

Connectivity check

frontend -> backend traffic is allowed, while other -> backend traffic is blocked.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it frontend -- curl -s http://backend



Welcome to nginx!



Welcome to nginx!
If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.

For online documentation and support please refer to
nginx.org.

Commercial support is available at
nginx.com.

Thank you for using nginx.



(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it other -- curl -sv http://backend
*   Trying 10.96.27.61:80...

^Ccommand terminated with exit code 130

Services based

Connectivity is defined using the Service concept.

Goal: allow traffic `role: frontend` → `Service: backend`

Deploy a CiliumNetworkPolicy that lets the role: frontend pod talk to the Service named backend. Note that if the policy lists only the backend service, DNS lookups through CoreDNS are blocked. The kube-dns service must be allowed as well for traffic to the backend service to work properly.

Deployment YAML

#NetworkPolicy (Service based)
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend-service
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  - toServices:
    - k8sService:
        serviceName: backend
        namespace: default
    - k8sService:
        serviceName: kube-dns
        namespace: kube-system

Verify that the policy is applied

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system exec ds/cilium -- cilium policy get
[
  {
    "endpointSelector": {
      "matchLabels": {
        "any:role": "backend",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "any:role": "frontend",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          }
        ]
      }
    ],
...
  {
    "endpointSelector": {
      "matchLabels": {
        "any:role": "frontend",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "egress": [
      {
        "toEndpoints": [
          {
            "matchLabels": {
              "any:role": "backend",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          },
          {
            "matchLabels": {
              "any:k8s-app": "kube-dns",
              "k8s:io.kubernetes.pod.namespace": "kube-system"
            }
          }
        ],
        "toServices": [
          {
            "k8sService": {
              "serviceName": "backend",
              "namespace": "default"
            }
          },
          {
            "k8sService": {
              "serviceName": "kube-dns",
              "namespace": "kube-system"
            }
          }
        ]
      }
    ],

Connectivity check

frontend -> backend service traffic is allowed, while other -> backend service traffic is blocked.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it frontend -- curl -sv http://backend
*   Trying 10.96.65.95:80...
* Connected to backend (10.96.65.95) port 80 (#0)
> GET / HTTP/1.1
> Host: backend
> User-Agent: curl/7.88.1
> Accept: */*
>
< HTTP/1.1 200 OK

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it other -- curl -sv http://backend
*   Trying 10.96.65.95:80...
^Ccommand terminated with exit code 130

Entities based

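With fromEntities and toEntities, pod traffic can be controlled per entity (a logical group) rather than by actual IP addresses.

Entity	Description
`host`	The local host (node) and containers in hostNetwork mode
`remote-node`	Other nodes and hostNetwork-mode containers on remote nodes
`kube-apiserver`	The Kubernetes API server (whether deployed inside or outside the cluster)
`ingress`	The Cilium Envoy ingress proxy (including pod-to-pod hairpin traffic)
`cluster`	All endpoints inside the cluster (including host, remote-node, and init)
`init`	Endpoints in the bootstrap phase that have no identity assigned yet
`health`	Cilium health check endpoints
`unmanaged`	Endpoints not managed by Cilium (also part of the cluster entity)
`world`	Everything outside the cluster (including the internet; equivalent to CIDR 0.0.0.0/0)
`all`	All endpoints, cluster + world

Goal: the pod may reach only the local host and hostNetwork-mode containers

Deployment YAML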

# dev Pod
apiVersion: v1
kind: Pod
metadata:
  name: dev-pod
  labels:
    role: dev
spec:
  containers:
  - name: curl
    image: curlimages/curl
    command: ["sleep", "3600"]
---
# Allow dev → host
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: dev-to-host
spec:
  endpointSelector:
    matchLabels:
      role: dev
  egress:
    - toEntities:
      - host
---
# same-pod (nginx, uses hostNetwork, pinned to k8s-w1)
apiVersion: v1
kind: Pod
metadata:
  name: same-pod
  labels:
    app: same
spec:
  nodeName: k8s-w1       # pin to a specific node
  hostNetwork: true      # enable hostNetwork mode
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80 
---
# diff-pod (nginx, uses hostNetwork, pinned to k8s-w2)
apiVersion: v1
kind: Pod
metadata:
  name: diff-pod
  labels:
    app: same
spec:
  nodeName: k8s-w2       # pin to a specific node
  hostNetwork: true      # enable hostNetwork mode
  dnsPolicy: ClusterFirstWithHostNet
  containers:
  - name: nginx
    image: nginx:1.25
    ports:
    - containerPort: 80

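Connectivity check

The check shows that the pod can reach the hostNetwork pod running on its own node but cannot reach the hostNetwork pod on another node. In this run dev-pod was scheduled on k8s-w2, so diff-pod (192.168.10.102) is the pod on the same node and same-pod (192.168.10.101) is the one on the other node.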

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME       READY   STATUS    RESTARTS   AGE   IP               NODE     NOMINATED NODE   READINESS GATES
dev-pod    1/1     Running   0          5s    172.20.2.232     k8s-w2              
diff-pod   1/1     Running   0          44s   192.168.10.102   k8s-w2              
same-pod   1/1     Running   0          44s   192.168.10.101   k8s-w1              

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it dev-pod -- curl http://192.168.10.102



Welcome to nginx!



Welcome to nginx!

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it dev-pod -- curl http://192.168.10.101

^Ccommand terminated with exit code 130

Node based

Connectivity is defined so that only specific nodes are allowed or denied.
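A minimal sketch of a node-based rule follows. It assumes that node selectors are enabled in the Cilium agent (as far as the Cilium documentation describes, they are disabled by default) and that control-plane nodes carry the standard `node-role.kubernetes.io/control-plane` label. The policy name is hypothetical.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-backend-from-control-plane   # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  # Only traffic originating from nodes with this label is allowed in.
  - fromNodes:
    - matchLabels:
        node-role.kubernetes.io/control-plane: ""
```

The egress counterpart uses `toNodes` with the same selector shape.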

IP/CIDR based

Connectivity to external services is defined using IP addresses or subnets.
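As a sketch of what such a rule can look like, the policy below allows the `role: frontend` pod to reach one external host and one external subnet with a carve-out. The addresses come from the reserved documentation ranges and the policy name is hypothetical; neither is part of the original lab.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-external-cidr    # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  - toCIDR:
    - 203.0.113.10/32          # a single external host
  - toCIDRSet:
    - cidr: 198.51.100.0/24    # a whole subnet
      except:
      - 198.51.100.128/25      # minus this half of it
```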

DNS based

Connectivity to peers outside the cluster is defined through DNS names, which are resolved to IPs.
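A hedged sketch of a DNS-based rule is shown below. The FQDN `api.example.com` and the policy name are placeholders. The first egress rule lets the pod query kube-dns and asks Cilium to inspect those DNS answers, which is how the name-to-IP mapping used by `toFQDNs` is learned; the second rule allows traffic to whatever IPs the name resolved to.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-fqdn             # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      role: frontend
  egress:
  # Allow DNS lookups through kube-dns and let Cilium observe the answers.
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # Allow traffic to the IPs that the name below resolves to.
  - toFQDNs:
    - matchName: "api.example.com"         # placeholder domain
```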

Layer 4 (Restriction of accessible ports)

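This policy restricts the port ranges and protocols (TCP, UDP, and so on) an endpoint can access. It is applied together with L3 policy: once L3 policy has decided whether communication is allowed, L4 policy provides finer control at the port level.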
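Building on the frontend/backend labels used earlier, a minimal sketch of an L3 rule narrowed by an L4 rule could look like this. The policy name is hypothetical; the intent is that frontend may reach backend, but only on TCP port 80.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-to-backend-tcp80    # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:             # L3: who may connect
    - matchLabels:
        role: frontend
    toPorts:                   # L4: on which port and protocol
    - ports:
      - port: "80"
        protocol: TCP
```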

Layer 7 (Application protocol level)

This policy controls traffic at the application protocol level. It is used for fine-grained access control at the application level on traffic already allowed by L3 (Layer 3) and L4 (Layer 4) policy.
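As an illustrative sketch, an HTTP rule can be attached to a port rule so that only certain requests pass. The path `/public/.*` and the policy name are assumptions made for this example; the path field is treated as a regular expression.

```yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: allow-frontend-get-public          # hypothetical name
spec:
  endpointSelector:
    matchLabels:
      role: backend
  ingress:
  - fromEndpoints:
    - matchLabels:
        role: frontend
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:                  # L7: only GET requests under /public/ are allowed
        - method: "GET"
          path: "/public/.*"
```

With a rule like this in place, other methods or paths on the same port are rejected at the HTTP level, while the TCP connection itself is still permitted by the L3/L4 part.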

Cilium - SR-IOV + Multus

Sat, 30 Aug 2025 16:40:04 GMT

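Network optimization in GPU environments: Cilium + SR-IOV + Multus

GPU workloads demand large-scale computation and fast data transfer at the same time. In distributed training, or when handling large datasets in particular, network latency and bandwidth become key factors in performance. To meet these requirements on Kubernetes, combining SR-IOV, Multus, and Cilium is effective.

This post introduces each technology and walks through a hands-on setup.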

SR-IOV

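To optimize network performance in GPU environments, it is important to use RDMA (Remote Direct Memory Access). RDMA allows memory to be accessed directly over the network without CPU involvement, minimizing latency and maximizing bandwidth utilization.

SR-IOV makes RDMA usable at the pod level by assigning a virtual function (VF) of the physical NIC directly to a pod. RDMA traffic in this setup typically uses the RoCE protocol.

RoCE (RDMA over Converged Ethernet)

RoCE is a technology that supports RDMA over Ethernet networks. Whereas RDMA was originally available only on specialized networks such as InfiniBand, RoCE brings RDMA performance to standard Ethernet.

There are two versions of RoCE.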

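1. RoCE v1

Characteristics

Operates at Ethernet Layer 2.
Communication is possible only within the same broadcast domain.
Not routable, because it is based on L2 frames.

Advantage: provides low latency with a simple setup.
Limitation: unsuitable for large clusters or for traffic that crosses network segments.

2. RoCE v2

Characteristics

An RDMA protocol that runs on top of UDP/IP.
L3 routable, so RDMA works across different network subnets.
Combined with QoS and congestion control mechanisms, it suits large cluster environments.

Advantage: scales well in large data center environments.
Limitation: more complex than RoCE v1 and requires network configuration (QoS, PFC, and so on).

In summary, RoCE v1 suits cases where low-latency GPU-to-GPU communication is needed within the same node or the same network segment. RoCE v2 is used when training data is exchanged between different nodes in a large GPU cluster. By assigning a VF to a pod through SR-IOV and configuring that VF as a RoCE-capable network interface, Kubernetes GPU workloads can also take advantage of RDMA performance.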

Multus

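By default, Kubernetes gives a pod a single network interface, managed by the CNI plugin (Cilium, Calico, and so on). That interface mainly carries Kubernetes service traffic and basic pod-to-pod communication. In GPU environments, however, a single network interface struggles to handle service traffic and high-speed data transfer at once, so traffic has to be separated and optimized across multiple network interfaces.

Multus CNI is a meta-plugin that lets Kubernetes attach multiple network interfaces to a pod. In other words, a single pod can use several CNI plugins at the same time.

The role of Multus

1. Separating the data path from the control path

eth0 carries Kubernetes service and management traffic. Additional interfaces such as net1 carry SR-IOV based high-performance RDMA/RoCE traffic. This helps keep services stable while preserving data-path performance.

2. Assigning specialized networks

GPU training workloads need a high-performance RDMA network. With Multus, an SR-IOV NIC VF can be attached only to specific pods, so the performance-optimized network is used only where it is needed.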

Multus CNI Plugin

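Multus CNI is a meta-plugin that calls other CNIs to attach multiple networks to a pod; it does not configure networks itself. When adding networks through Multus, CNI plugins such as the following can be combined.

Plugin	Description	Use case
SR-IOV CNI	Uses the SR-IOV capability of the physical NIC to assign a virtual function (VF) directly to a pod.	Dedicated RoCE/RDMA network in GPU/HPC environments
macvlan CNI	Gives the pod its own MAC address and connects it directly to the physical network.	When a pod must communicate directly with an external network
ipvlan CNI	Shares the host NIC's MAC and assigns only an IP to the pod.	Large pod environments with limits on MAC addresses
bridge CNI	Connects the pod interface to a host bridge.	Simple bridge networking in test/development environments
vlan CNI	Tags pod traffic with a VLAN to separate networks.	Network isolation in multi-tenant environments
host-device CNI	Assigns a physical NIC of the node directly to the pod.	When a physical NIC should be fully dedicated to a specific pod
loopback CNI	Creates the `lo` interface inside the pod.	Provided to every pod by default, for internal communication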

IPAM Plugin

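An IPAM (IP Address Management) plugin decides how IP addresses are assigned to each network interface. Once the CNI plugin has created an interface, the IPAM plugin is called to determine the IP, gateway, routes, and so on for that interface. This allows "creating the network interface" and "assigning the IP" to be managed separately.

The IPAM plugin types are as follows.

host-local: stores IPAM state locally on the node and allocates IPs sequentially.
static: assigns IPs predefined in the configuration file.
dhcp: obtains IPs from an external DHCP server.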
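To make the split between "interface creation" and "IP assignment" concrete, here is a minimal sketch of a NetworkAttachmentDefinition that pairs the macvlan CNI with the host-local IPAM plugin. The resource name, the parent interface `eth1`, and the address range are all hypothetical and would need to match the actual node and network.

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-net            # hypothetical name
spec:
  # "type" selects the CNI plugin that creates the interface;
  # the nested "ipam" block selects the plugin that assigns its address.
  config: |
    {
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.50.0/24",
        "rangeStart": "192.168.50.10",
        "rangeEnd": "192.168.50.50",
        "gateway": "192.168.50.1"
      }
    }
```

A pod would then request this network through the `k8s.v1.cni.cncf.io/networks` annotation, and Multus would attach it as an additional interface next to eth0.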

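Hands-on

*This section is still at the stage of investigating a connectivity issue and will be updated once the issue is resolved.

Setup overview

SR-IOV

Assigns VLAN 100 on each node's NIC directly to the pod as net1
The pod can send RoCE v2 traffic directly through the NIC VF

Multus

Attaches the net1 interface to the pod

RoCE v2

Performs UDP/IP-based RDMA communication over net1
Data is transferred between pods on different nodes via routing

Pod communication

ping and RDMA tests use net1 (VLAN 100)

Preliminary checks

Cluster creation

Cilium CNI and Multus CNI are deployed when the cluster is created. If the deployment succeeded, the multus DaemonSet is running and the CNI files can be seen on the node.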

~# ls /opt/cni/bin/multus
/opt/cni/bin/multus

~# tree /etc/cni/net.d/ | grep multus
├── 00-multus.conf
├── multus.d
│   └── multus.kubeconfig

Checking PF (Physical Function) assignment

To split a physical NIC and attach virtual NIC slices to pods, first check which physical NIC is currently mapped. The physical NIC is called the PF (Physical Function) and a virtual NIC is called a VF (Virtual Function).

Check which physical NIC is connected in the RDMA stack, and which network interface name the kernel uses for it.

# Check the name of the RDMA-capable NIC (HCA, Host Channel Adapter)
~# rdma link show
...
link mlx5_8/1 state ACTIVE physical_state LINK_UP netdev enp157s0np0

# Check the supported RDMA type
~# cat /sys/class/infiniband/mlx5_8/ports/1/gid_attrs/types/1
RoCE v2

# Check the address (PCI) the kernel uses to identify the NIC
~# readlink /sys/class/infiniband/mlx5_8/device
../../../0000:9d:00.0

# Check the PF
~# ls /sys/bus/pci/devices/0000:9d:00.0/net
enp157s0np0

Checking SR-IOV support and IOMMU virtualization on the GPU worker node

To access memory directly over the network, confirm that SR-IOV is supported and that IOMMU (Input-Output Memory Management Unit) virtualization is enabled.

~# ls /sys/class/net/enp157s0np0/device | grep sriov
...
sriov_totalvfs

~# cat /proc/cmdline | grep iommu
BOOT_IMAGE=/boot/vmlinuz-5.15.0-91-generic root=UUID=de68bbf0-de79-441c-8e44-57c29618114e ro intel_iommu=on iommu=pt pci=pcie_bus_perf pcie_acs_override=downstream,multifunction console=tty1 console=ttyS0

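Deploying the NVIDIA Network Operator

The NVIDIA Network Operator is a Kubernetes operator that automates the configuration and management of RDMA-based high-speed networking on NVIDIA GPU servers. It is widely used on Kubernetes because it can configure, monitor, and automate RDMA-capable devices at the pod, node, and NIC level.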

helm install network-operator nvidia/network-operator \
  --namespace network-operator \
  --create-namespace \
  --set deployCR=true \
  --set nfd.enabled=true \
  --set sriovNetworkOperator.enabled=true \
  --set rdmaSharedDevicePlugin.enabled=true \
  --set secondaryNetwork.enabled=true

Deploying the NVIDIA Network Operator Helm chart creates resources such as the following.

Resource	Role
DaemonSet: nvidia-network-operator	Deployed to every GPU node in the cluster; configures and manages RoCE, SR-IOV, and VFs
ConfigMap: nvidia-network-config	Stores NNO settings (SR-IOV, RDMA VF, QoS, VLAN, and so on)
DaemonSet: node-feature-discovery (NFD)	Labels nodes with hardware features such as SR-IOV, RDMA, and GPU
DaemonSet: sriov-network-operator	Watches SR-IOV network CRDs, creates and manages PFs/VFs, and assigns VFs to pods
DaemonSet: sriov-network-config-daemon	Applies L2/L3 network settings on the node, such as SR-IOV PF/VF, VLAN, and QoS
NetworkAttachmentDefinition (NAD)	Definition used by Multus CNI to attach additional network interfaces to a pod
SriovNetworkNodePolicy (SR-IOV NNP)	CRD that defines per-node SR-IOV network policy. `sriov-network-operator` creates and manages PFs/VFs based on it

Creating the SriovNetwork resources

Create a SriovNetworkNodePolicy for each of node1 and node2. It specifies which NIC to use on each node and how many VFs to create. The sriov-network-operator creates the VFs and assigns VLANs based on this resource.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rocebgr1
  namespace: network-operator
spec:
  deviceType: netdevice
  isRdma: true
  linkType: eth
  mtu: 9000
  nicSelector:
    pfNames:
      - enp157s0np0
  nodeSelector:
    kubernetes.io/hostname: node1
  numVfs: 8
  priority: 99
  resourceName: rocebgr1
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetworkNodePolicy
metadata:
  name: rocebgr2
  namespace: network-operator
spec:
  deviceType: netdevice
  isRdma: true
  linkType: eth
  mtu: 9000
  nicSelector:
    pfNames:
      - enp157s0np0
  nodeSelector:
    kubernetes.io/hostname: node2
  numVfs: 8
  priority: 99
  resourceName: rocebgr2

Whether the node and SriovNetwork are linked correctly can be checked through the SriovNetworkNodeState CR.

> kubectl get sriovnetworknodestate -A
NAMESPACE          NAME    SYNC STATUS   DESIRED SYNC STATE   CURRENT SYNC STATE   AGE
network-operator   node1   Succeeded     Idle                 Idle                 38h
network-operator   node2   Succeeded     Idle                 Idle                 38h

Check that the VFs were actually created.

> kubectl describe node | grep -A10 "Allocatable" | grep 'nvidia.com/roce'
  nvidia.com/rocebgr1:  8
  nvidia.com/rocebgr2:  8

~# lspci | grep 9d
9d:00.0 Ethernet controller: Mellanox Technologies MT2910 Family [ConnectX-7]
9d:00.1 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.2 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.3 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.4 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.5 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.6 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:00.7 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
9d:01.0 Ethernet controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function

Once a policy exists for each node, create the SriovNetwork. This resource defines the VF that a container attaches to.

apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rocebgr1
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.255.0.0/30",
      "rangeStart": "10.255.0.1",
      "rangeEnd": "10.255.0.1",
      "gateway": "10.255.0.2",
      "routes": [
        { "dst": "0.0.0.0/0", "gw": "10.255.0.2" }
      ]
    }
  linkState: enable
  logLevel: info
  networkNamespace: default
  resourceName: rocebgr1
  spoofChk: 'off'
  trust: 'on'
  vlan: 100
---
apiVersion: sriovnetwork.openshift.io/v1
kind: SriovNetwork
metadata:
  name: rocebgr2
  namespace: network-operator
spec:
  ipam: |
    {
      "type": "host-local",
      "subnet": "10.255.0.4/30",
      "rangeStart": "10.255.0.5",
      "rangeEnd": "10.255.0.5",
      "gateway": "10.255.0.6",
      "routes": [
        { "dst": "0.0.0.0/0", "gw": "10.255.0.6" }
      ]
    }
  linkState: enable
  logLevel: info
  networkNamespace: default
  resourceName: rocebgr2
  spoofChk: 'off'
  trust: 'on'
  vlan: 100

Creating a SriovNetwork automatically creates a NetworkAttachmentDefinition resource based on it.

Creating the Deployments and checking status

Using the SriovNetworks created above, deploy resources so that a pod runs on each node.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: rocebgr1
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: roce-test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: |
          [
            { "name": "rocebgr1" }
          ]
      labels:
        app: roce-test
    spec:
      containers:
        - command:
            - sleep
            - infinity
          image: >-
            registry.hcloud.hmc.co.kr/library/nvidia/cuda:12.2.0-base-ubuntu22.04-roce
          imagePullPolicy: IfNotPresent
          name: test
          resources:
            limits:
              nvidia.com/rocebgr1: '1'
            requests:
              nvidia.com/rocebgr1: '1'
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RAWIO
                - NET_ADMIN
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rocebgr2
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: roce-test
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        k8s.v1.cni.cncf.io/networks: |
          [
            { "name": "rocebgr2" }
          ]
      labels:
        app: roce-test
    spec:
      containers:
        - command:
            - sleep
            - infinity
          image: >-
            registry.hcloud.hmc.co.kr/library/nvidia/cuda:12.2.0-base-ubuntu22.04-roce
          imagePullPolicy: IfNotPresent
          name: test
          resources:
            limits:
              nvidia.com/rocebgr2: '1'
            requests:
              nvidia.com/rocebgr2: '1'
          securityContext:
            capabilities:
              add:
                - IPC_LOCK
                - SYS_RAWIO
                - NET_ADMIN
            privileged: true
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File

Inside the created pod, a new net1 interface is visible in addition to lo and eth0.

/# ip a
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
       valid_lft forever preferred_lft forever
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
85: net1:  mtu 9000 qdisc mq state UP group default qlen 1000
    link/ether 1a:16:7a:56:dc:60 brd ff:ff:ff:ff:ff:ff
    altname enp157s0v6
    inet 10.255.0.1/30 brd 10.255.0.3 scope global net1
       valid_lft forever preferred_lft forever
    inet6 fe80::1816:7aff:fe56:dc60/64 scope link 
       valid_lft forever preferred_lft forever
171: eth0@if172:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether a2:39:a1:e4:2b:5c brd ff:ff:ff:ff:ff:ff link-netnsid 0
    inet 10.42.4.24/32 scope global eth0
       valid_lft forever preferred_lft forever
    inet6 fe80::a039:a1ff:fee4:2b5c/64 scope link 
       valid_lft forever preferred_lft forever

On the node, one of the VFs has been assigned VLAN 100.

~# ip link show enp157s0np0
9: enp157s0np0:  mtu 9000 qdisc mq state UP mode DEFAULT group default qlen 1000
    link/ether c4:70:bd:e8:3f:b6 brd ff:ff:ff:ff:ff:ff
    vf 0     link/ether 32:16:09:43:16:22 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 1     link/ether 36:8c:40:53:d5:84 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 2     link/ether d6:b1:6e:c3:78:f7 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 3     link/ether 2a:6f:cd:d6:3f:e8 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 4     link/ether 06:6a:5b:cf:67:10 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 5     link/ether a2:35:0e:45:65:14 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off
    vf 6     link/ether 1a:16:7a:56:dc:60 brd ff:ff:ff:ff:ff:ff, vlan 100, spoof checking off, link-state enable, trust on, query_rss off
    vf 7     link/ether d2:ff:ba:3d:b7:10 brd ff:ff:ff:ff:ff:ff, spoof checking off, link-state auto, trust off, query_rss off

Pod communication test

net1 ping between pods on the same node succeeds, but ping between pods on different nodes loses every packet.

Node routing configuration

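Various routing configurations are being tested, but communication currently still fails. This will be updated once it is resolved.

Why ping between pods on different nodes fails

The pod's net1 is a virtual interface that exists only inside the pod; the node has no corresponding net1 interface. The node routing table has no route to the pod subnet, so when a ping is sent to a pod IP on another node, the node does not know where to send the packet and transmission fails.

In addition, even if the VF carries a VLAN tag, without an L3 route on the node the ARP request is not delivered and communication is not possible.

Adding routes

An L3 route to the pod subnet has to be added explicitly on the node.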

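Example 1) Create a VLAN interface on the PF
ip link add link <PF> name <PF>.100 type vlan id 100
ip addr add <IP>/24 dev <PF>.100
ip link set dev <PF>.100 up

Example 2) Add a route to the pod subnet in the node routing table
ip route add <pod subnet> via <gateway> dev <interface>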

Cilium - Cilium Service Mesh(2)-Gateway-API

Sat, 23 Aug 2025 16:43:36 GMT

Gateway API

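Gateway API is a standard Kubernetes API for managing the entry and routing of service traffic. It addresses the limitations of the existing Ingress resource and provides advanced L7 traffic management, multi-tenancy, and extensibility. Gateway API is delivered as CRDs (Custom Resource Definitions) and is used like a native Kubernetes resource.

Components

GatewayClass

GatewayClass defines the type of a particular Gateway implementation (controller). For example, a GatewayClass is set up with a controller name such as istio, nginx, or haproxy.

Gateway

Gateway defines the actual traffic entry point. It receives traffic coming from outside through NodePort, LoadBalancer, HostNetwork, and so on. A Gateway can have multiple Listeners, so it can handle different ports and protocols at the same time. A Gateway references a GatewayClass and is managed by the corresponding controller.

Listener

Listener defines the port and protocol on which the Gateway accepts traffic. A Listener can carry L7 attributes such as hostname and TLS certificates. Listeners belong to a Gateway and set the detailed conditions under which traffic is routed.

Route (HTTPRoute, TCPRoute, and so on)

Route defines the rules that forward traffic entering the Gateway to the actual services.

HTTPRoute: routes HTTP/HTTPS requests to a service.
TCPRoute: routes TCP requests to a service.

A Route attaches to a Gateway Listener and splits traffic based on conditions such as hostname, path, and headers. Multiple Routes can be combined to manage per-service traffic in a multi-tenant environment.

BackendRef

BackendRef specifies the target service or resource that a Route forwards traffic to. Several BackendRefs can be listed to distribute traffic or assign weights.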
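The way these components reference each other can be sketched with a single HTTPRoute. Every name below (`example-gateway`, the `web` listener, `app.example.com`, and the two backend Services) is hypothetical and chosen only for illustration; the 90/10 weights show the traffic-splitting idea.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: example-route            # hypothetical name
spec:
  parentRefs:
  - name: example-gateway        # the Gateway this Route attaches to
    sectionName: web             # a specific Listener on that Gateway
  hostnames:
  - "app.example.com"
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:                 # weighted split across two Services
    - name: api-v1
      port: 80
      weight: 90
    - name: api-v2
      port: 80
      weight: 10
```

The Gateway in turn names a GatewayClass through `gatewayClassName`, which is how the request path runs from controller to Gateway to Listener to Route to backend.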

Configuring Gateway API

To use Gateway API, Cilium must be configured with NodePort enabled: either set nodePort.enabled=true or enable kubeProxyReplacement=true.

The L7 proxy must be enabled. This is set with l7Proxy=true, which is enabled by default.

The CRDs used by Gateway API v1.2.0 must be installed in advance.

# Install the CRDs
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/standard/gateway.networking.k8s.io_gatewayclasses.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/standard/gateway.networking.k8s.io_gateways.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/standard/gateway.networking.k8s.io_httproutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/standard/gateway.networking.k8s.io_referencegrants.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/standard/gateway.networking.k8s.io_grpcroutes.yaml
kubectl apply -f https://raw.githubusercontent.com/kubernetes-sigs/gateway-api/v1.2.0/config/crd/experimental/gateway.networking.k8s.io_tlsroutes.yaml

# Verify
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get crd | grep gateway.networking.k8s.io
gatewayclasses.gateway.networking.k8s.io     2025-08-23T15:20:10Z
gateways.gateway.networking.k8s.io           2025-08-23T15:20:10Z
grpcroutes.gateway.networking.k8s.io         2025-08-23T15:20:12Z
httproutes.gateway.networking.k8s.io         2025-08-23T15:20:11Z
referencegrants.gateway.networking.k8s.io    2025-08-23T15:20:11Z
tlsroutes.gateway.networking.k8s.io          2025-08-23T15:20:12Z

Next, configure Cilium Gateway API. In this setup it is not run alongside Cilium Ingress, so ingressController is set to false.

(⎈|HomeLab:N/A) root@k8s-ctr:~# helm upgrade cilium cilium/cilium --version 1.18.1 --namespace kube-system --reuse-values \
--set ingressController.enabled=false --set gatewayAPI.enabled=true

kubectl -n kube-system rollout restart deployment/cilium-operator
kubectl -n kube-system rollout restart ds/cilium

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get GatewayClass
NAME     CONTROLLER                     ACCEPTED   AGE
cilium   io.cilium/gateway-controller   True       60s

Deploy a Gateway and an HTTPRoute so that their configuration is pushed to Envoy.

cat << EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway
spec:
  gatewayClassName: cilium
  listeners:
  - protocol: HTTP
    port: 80
    name: web-gw
    allowedRoutes:
      namespaces:
        from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: http-app-1
spec:
  parentRefs:
  - name: my-gateway
    namespace: default
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /details
    backendRefs:
    - name: details
      port: 9080
  - matches:
    - headers:
      - type: Exact
        name: magic
        value: foo
      queryParams:
      - type: Exact
        name: great
        value: example
      path:
        type: PathPrefix
        value: /
      method: GET
    backendRefs:
    - name: productpage
      port: 9080
EOF

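A LoadBalancer Service for the Gateway has been created and an external IP has been assigned to it. The same external IP is assigned to the Gateway.

The Gateway shows PROGRAMMED as True, which means the Gateway configuration has been pushed to Envoy successfully by the Cilium operator and the Cilium agent.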

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep cilium-gateway-my-gateway
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME                                TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
service/cilium-gateway-my-gateway   LoadBalancer   10.96.184.183   192.168.10.211   80:31179/TCP   12s

NAME                                  ENDPOINTS              AGE
endpoints/cilium-gateway-my-gateway   192.192.192.192:9999   12s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get gateway
NAME         CLASS    ADDRESS          PROGRAMMED   AGE
my-gateway   cilium   192.168.10.211   True         57s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get httproutes -A
NAMESPACE   NAME         HOSTNAMES   AGE
default     http-app-1               118s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl logs -n kube-system deployments/cilium-operator | grep gateway
time=2025-08-23T15:30:18.512590875Z level=info source=/go/src/github.com/cilium/cilium/operator/pkg/gateway-api/gateway_reconcile.go:47 msg="Reconciling Gateway" module=operator.operator-controlplane.leader-lifecycle.gateway-api controller=gateway resource=default/my-gateway
time=2025-08-23T15:30:18.517371014Z level=info source=/go/src/github.com/cilium/cilium/operator/pkg/gateway-api/httproute_reconcile.go:143 msg="Successfully reconciled HTTPRoute" module=operator.operator-controlplane.leader-lifecycle.gateway-api controller=httpRoute parentResource=default/http-app-1
time=2025-08-23T15:30:18.519187964Z level=info source=/go/src/github.com/cilium/cilium/operator/pkg/gateway-api/gateway_reconcile.go:202 msg="Successfully reconciled Gateway" module=operator.operator-controlplane.leader-lifecycle.gateway-api controller=gateway resource=default/my-gateway

Verifying connectivity

Let's verify traffic against the HTTPRoute configuration. Calling curl -s --fail -v http://"$GATEWAY"/details/1 should reach details-v1, and calling curl -s -v -H 'magic: foo' http://"$GATEWAY"\?great\=example should reach productpage-v1.

root@router:~# GATEWAY=192.168.10.211

root@router:~# curl -s --fail -v http://"$GATEWAY"/details/1
*   Trying 192.168.10.211:80...
* Connected to 192.168.10.211 (192.168.10.211) port 80
> GET /details/1 HTTP/1.1
> Host: 192.168.10.211
> User-Agent: curl/8.5.0
> Accept: */*
>
< HTTP/1.1 200 OK
< content-type: application/json
< server: envoy
< date: Sat, 23 Aug 2025 15:40:51 GMT
< content-length: 178
< x-envoy-upstream-service-time: 7
<
* Connection #0 to host 192.168.10.211 left intact
{"id":1,"author":"William Shakespeare","year":1595,"type":"paperback","pages":200,"publisher":"PublisherA","language":"English","ISBN-10":"1234567890","ISBN-13":"123-1234567890"}

# The call fails because the header is missing
root@router:~# curl -s -v http://"$GATEWAY"\?great\=example
...
< HTTP/1.1 404 Not Found
< date: Sat, 23 Aug 2025 15:41:18 GMT
< server: envoy
< content-length: 0
<
* Connection #0 to host 192.168.10.211 left intact

# The call succeeds once the header is added
root@router:~# curl -s -v -H 'magic: foo' http://"$GATEWAY"\?great\=example
*   Trying 192.168.10.211:80...
* Connected to 192.168.10.211 (192.168.10.211) port 80
> GET /?great=example HTTP/1.1
> Host: 192.168.10.211
> User-Agent: curl/8.5.0
> Accept: */*
> magic: foo
>
< HTTP/1.1 200 OK
< server: envoy
< date: Sat, 23 Aug 2025 15:41:20 GMT
< content-type: text/html; charset=utf-8
< content-length: 2080
< x-envoy-upstream-service-time: 10
...
* Connection #0 to host 192.168.10.211 left intact
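
The three results above follow from how the conditions inside one HTTPRoute match are combined: path, method, headers, and query parameters must all match. The Python sketch below models that for the two rules of http-app-1. It is a simplification under stated assumptions: rules are tried in order and the first match wins, whereas the real Gateway API precedence rules are more involved, and the function names are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs

# Simplified copies of the two rules in the http-app-1 HTTPRoute.
RULES = [
    {"backend": "details", "path_prefix": "/details"},
    {
        "backend": "productpage",
        "path_prefix": "/",
        "method": "GET",
        "headers": {"magic": "foo"},
        "query": {"great": "example"},
    },
]

def path_prefix_matches(prefix, path):
    """PathPrefix matches whole path elements, so /details does not match /detailsx."""
    if prefix == "/":
        return True
    prefix = prefix.rstrip("/")
    return path == prefix or path.startswith(prefix + "/")

def route(method, url, headers):
    """Return the backend name for a request, or None (the gateway answers 404)."""
    parts = urlsplit(url)
    path = parts.path or "/"
    query = {k: v[0] for k, v in parse_qs(parts.query).items()}
    lowered = {k.lower(): v for k, v in headers.items()}
    for rule in RULES:
        if not path_prefix_matches(rule["path_prefix"], path):
            continue
        if "method" in rule and rule["method"] != method:
            continue
        if any(lowered.get(k.lower()) != v for k, v in rule.get("headers", {}).items()):
            continue
        if any(query.get(k) != v for k, v in rule.get("query", {}).items()):
            continue
        return rule["backend"]
    return None

if __name__ == "__main__":
    print(route("GET", "http://192.168.10.211/details/1", {}))
    print(route("GET", "http://192.168.10.211?great=example", {}))
    print(route("GET", "http://192.168.10.211?great=example", {"magic": "foo"}))
```

The second call returns None because the magic header is absent, which corresponds to the 404 seen in the curl output.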

TPROXY

Gateway API traffic is also delivered inside the kernel via TPROXY, which can be confirmed from the packet counter of the TPROXY rule below.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-default-cilium-gateway-my-gateway        k8s-ctr                                                                     15m

(⎈|HomeLab:N/A) root@k8s-ctr:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x13320200 /* cilium: TPROXY to host default/cilium-gateway-my-gateway/listener proxy */ TPROXY redirect 127.0.0.1:12819 mark 0x200/0xffffffff

# Call the Gateway
root@router:~# curl -s -v -H 'magic: foo' http://"$GATEWAY"\?great\=example

# The TPROXY pkts counter increases
(⎈|HomeLab:N/A) root@k8s-ctr:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
4        1    60 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x13320200 /* cilium: TPROXY to host default/cilium-gateway-my-gateway/listener proxy */ TPROXY redirect 127.0.0.1:12819 mark 0x200/0xffffffff
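
The mark value in the TPROXY rule appears to carry the proxy port. A minimal Python sketch, assuming the layout 0x0200 | (port << 16) with the port held in network byte order on a little-endian host; this is an inference from the numbers shown in this post rather than a documented interface, and the function name is hypothetical.

```python
MARK_MAGIC_TO_PROXY = 0x0200  # low 16 bits of the marks seen in the rules

def proxy_port_from_mark(mark):
    """Extract the proxy port from a mark of the form MAGIC | (port_be16 << 16)."""
    if mark & 0xFFFF != MARK_MAGIC_TO_PROXY:
        raise ValueError("not a to-proxy mark")
    port_be = (mark >> 16) & 0xFFFF
    # Assumption: the upper half is the port in network byte order as stored
    # by a little-endian host, so swap the two bytes to get the usual number.
    return ((port_be & 0xFF) << 8) | (port_be >> 8)

if __name__ == "__main__":
    print(proxy_port_from_mark(0x13320200))
```

For the rule above, mark 0x13320200 decodes to 12819, the port in TPROXY redirect 127.0.0.1:12819.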

TLS Route

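TLSRoute is the resource that routes HTTPS/TLS traffic through a Gateway. Together with the TLS mode of the Gateway listener, it determines where the TLS connection is terminated and which protocol is used toward the Pod.

Terminate

The Gateway terminates TLS.
The client and the Gateway communicate over HTTPS (encrypted), while the Gateway forwards to the backend Pod over HTTP.
Advantage: TLS management is centralized on the Gateway → Pods carry no TLS processing burden

Passthrough

TLS is not terminated at the Gateway and is forwarded as-is.
Communication between the client and the Pod stays HTTPS end to end.
Advantage: end-to-end encryption is preserved

TLS Route setup

Deploy the sample app and the Gateway. Before that, generate a TLS certificate and private key.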

apt install mkcert -y

mkcert '*.cilium.rocks'

kubectl create secret tls demo-cert --key=_wildcard.cilium.rocks-key.pem --cert=_wildcard.cilium.rocks.pem

mkcert -install

mkcert -CAROOT

tail -n 50 /etc/ssl/certs/ca-certificates.crt
vi 1.pem

cat <<'EOF' > nginx.conf
events {
}

http {
  log_format main '$remote_addr - $remote_user [$time_local]  $status '
  '"$request" $body_bytes_sent "$http_referer" '
  '"$http_user_agent" "$http_x_forwarded_for"';
  access_log /var/log/nginx/access.log main;
  error_log  /var/log/nginx/error.log;

  server {
    listen 443 ssl;

    root /usr/share/nginx/html;
    index index.html;

    server_name nginx.cilium.rocks;
    ssl_certificate /etc/nginx-server-certs/tls.crt;
    ssl_certificate_key /etc/nginx-server-certs/tls.key;
  }
}
EOF

kubectl create configmap nginx-configmap --from-file=nginx.conf=./nginx.conf

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
  name: my-nginx
  labels:
    run: my-nginx
spec:
  ports:
    - port: 443
      protocol: TCP
  selector:
    run: my-nginx
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-nginx
spec:
  selector:
    matchLabels:
      run: my-nginx
  replicas: 1
  template:
    metadata:
      labels:
        run: my-nginx
    spec:
      containers:
        - name: my-nginx
          image: nginx
          ports:
            - containerPort: 443
          volumeMounts:
            - name: nginx-config
              mountPath: /etc/nginx
              readOnly: true
            - name: nginx-server-certs
              mountPath: /etc/nginx-server-certs
              readOnly: true
      volumes:
        - name: nginx-config
          configMap:
            name: nginx-configmap
        - name: nginx-server-certs
          secret:
            secretName: demo-cert
EOF

cat << EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: cilium-tls-gateway
spec:
  gatewayClassName: cilium
  listeners:
    - name: https
      hostname: "nginx.cilium.rocks"
      port: 443
      protocol: TLS
      tls:
        mode: Passthrough
      allowedRoutes:
        namespaces:
          from: All
---
apiVersion: gateway.networking.k8s.io/v1alpha2
kind: TLSRoute
metadata:
  name: nginx
spec:
  parentRefs:
    - name: cilium-tls-gateway
  hostnames:
    - "nginx.cilium.rocks"
  rules:
    - backendRefs:
        - name: my-nginx
          port: 443
EOF

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get gateway cilium-tls-gateway
NAME                 CLASS    ADDRESS          PROGRAMMED   AGE
cilium-tls-gateway   cilium   192.168.10.211   True         6s

root@router:~# GATEWAY=192.168.10.211

Sending a TLS request succeeds, as shown below.

(⎈|HomeLab:N/A) root@k8s-ctr:~# curl -v --resolve "nginx.cilium.rocks:443:$GATEWAY" "https://nginx.cilium.rocks:443"
* Added nginx.cilium.rocks:443:192.168.10.211 to DNS cache
* Hostname nginx.cilium.rocks was found in DNS cache
*   Trying 192.168.10.211:443...
* Connected to nginx.cilium.rocks (192.168.10.211) port 443
* ALPN: curl offers h2,http/1.1
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/certs/ca-certificates.crt
*  CApath: /etc/ssl/certs
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384 / X25519 / RSASSA-PSS
* ALPN: server accepted http/1.1
* Server certificate:
*  subject: O=mkcert development certificate; OU=root@k8s-ctr
*  start date: Aug 23 16:34:26 2025 GMT
*  expire date: Nov 23 16:34:26 2027 GMT
*  subjectAltName: host "nginx.cilium.rocks" matched cert's "*.cilium.rocks"
*  issuer: O=mkcert development CA; OU=root@k8s-ctr; CN=mkcert root@k8s-ctr
*  SSL certificate verify ok.
*   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
*   Certificate level 1: Public key type RSA (3072/128 Bits/secBits), signed using sha256WithRSAEncryption
* using HTTP/1.x
> GET / HTTP/1.1
> Host: nginx.cilium.rocks
> User-Agent: curl/8.5.0
> Accept: */*
>
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
< HTTP/1.1 200 OK
< Server: nginx/1.29.1
< Date: Sat, 23 Aug 2025 16:42:01 GMT
< Content-Type: text/html
< Content-Length: 615
< Last-Modified: Wed, 13 Aug 2025 14:33:41 GMT
< Connection: keep-alive
< ETag: "689ca245-267"
< Accept-Ranges: bytes
<



Welcome to nginx!



Welcome to nginx!
If you see this page, the nginx web server is successfully installed and
working. Further configuration is required.

For online documentation and support please refer to
nginx.org.

Commercial support is available at
nginx.com.

Thank you for using nginx.


* Connection #0 to host nginx.cilium.rocks left intact

Cilium - Cilium Service Mesh(1)-Cilium-Ingress

Sat, 23 Aug 2025 12:44:32 GMT

Cilium Service Mesh

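Cilium Service Mesh provides a service mesh architecture that does not require sidecar proxies (such as Envoy) and instead uses eBPF to control network and application-layer traffic at the kernel level. This minimizes application performance loss while providing L3 to L7 security policy, traffic management, and monitoring in an integrated way.

L3 (Service-to-Service, IP-based)

Cilium uses eBPF to enforce policy at the level of Pod IPs and Service IPs. CiliumNetworkPolicy (CNP), which extends Kubernetes NetworkPolicy, allows policies based on CIDR, labels, and namespaces. At L3, Cilium performs service-to-service access control and can control both traffic inside the cluster and traffic crossing its boundary.

L7 (application layer: HTTP, gRPC, etc.)

Cilium provides eBPF-based L7 filtering, so L7 policies can control individual requests (path, method, headers, and so on). For example, the /admin path can be blocked while only /api is allowed. L7 logs and metrics make it possible to observe call patterns between services and can also be used to detect security anomalies.

TPROXY(Transparent Proxy)

Unlike the conventional sidecar approach, Cilium uses Transparent Proxy (TPROXY) to intercept L7 traffic directly in the Pod's network stack. Rather than redirecting traffic to the L7 proxy at the iptables/nftables level, the eBPF program hands traffic to the proxy inside the kernel via TPROXY. As a result, L7 policy and observability can be applied without sidecars, and CPU/memory overhead is greatly reduced. In particular, a shared per-node L7 proxy can be used instead of placing a separate proxy process such as Envoy in every Pod, which benefits both performance and manageability.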

K8S Ingress Support

Cilium Ingress basics

Cilium supports the standard Kubernetes Ingress resource through ingressClassName: cilium. It provides path-based routing and TLS termination, and for backward compatibility it also supports the kubernetes.io/ingress.class: cilium annotation. The Ingress controller creates a Service of type LoadBalancer, so the environment needs LoadBalancer support. (It can also be exposed in NodePort or HostNetwork mode if needed.)

LoadBalancer modes

Dedicated: a dedicated LoadBalancer is created per Ingress resource. (Helps avoid conflicts.)
Shared: all Ingress resources share a single LoadBalancer. (Saves resources.)
Changing the mode assigns a new LoadBalancer IP, so existing active connections may be dropped.

Prerequisites

NodePort enabled: nodePort.enabled=true or kubeProxyReplacement=true is required.
L7 proxy enabled: l7Proxy=true (the default).
The Ingress controller creates a LoadBalancer Service by default, so an alternative NodePort/HostNetwork configuration may be needed.

Source IP Visibility

By default, Envoy appends the source IP of an HTTP connection to the X-Forwarded-For header. Regardless of externalTrafficPolicy, Cilium Ingress always uses TPROXY to deliver traffic to Envoy, so the source IP is preserved. In other words, unlike typical Ingress controllers, the client IP is visible even without externalTrafficPolicy: Local.
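
On the backend side, the client address can then be read from the X-Forwarded-For header. A minimal Python sketch, assuming the common convention that each proxy appends the address it saw, so the left-most entry is the original client; the addresses used are hypothetical and the function name is made up for this example.

```python
def client_ip_from_xff(xff_header):
    """Return the left-most address of an X-Forwarded-For header value, or None."""
    hops = [hop.strip() for hop in xff_header.split(",") if hop.strip()]
    return hops[0] if hops else None

if __name__ == "__main__":
    # Hypothetical header value: client first, then an intermediate proxy.
    print(client_ip_from_xff("192.168.10.200, 172.20.0.147"))
```

Because clients can send their own X-Forwarded-For value, the left-most entry should only be trusted when every hop in front of the backend is under your control.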

TPROXY(Transparent Proxy)

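Understanding Cilium Ingress properly requires understanding TPROXY. TPROXY is a kernel-level feature that can assign traffic directly to a socket without changing the original destination IP/port.

Unlike a regular REDIRECT (DNAT), the original destination (IP/port) sent by the client is kept, so it can be passed on unchanged. Because an L7 proxy such as Envoy receives incoming traffic with its destination address intact, it can apply L7 policy and routing.

Let's walk through how traffic is delivered via TPROXY.

1) Packet marking

When a packet subject to an L7 policy arrives, Cilium marks it with MARK_MAGIC_TO_PROXY, that is, 0x0200.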

//cilium/bpf/lib/proxy.h
#ifdef ENABLE_TPROXY
    if (!from_host)
        ctx->mark |= MARK_MAGIC_TO_PROXY;
    else
#endif
        ctx->mark = MARK_MAGIC_TO_PROXY | proxy_port << 16;

    cilium_dbg_capture(ctx, DBG_CAPTURE_PROXY_PRE, proxy_port);

//cilium/bpf/lib/common.h
#define MARK_MAGIC_TO_PROXY        0x0200
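
The branch in the snippet above can be mirrored in a few lines of Python to see what value ends up in the mark. This is only a sketch: the function names are hypothetical, `enable_tproxy` stands in for the ENABLE_TPROXY compile-time switch, and the byte swap assumes a little-endian host holding the port in network byte order.

```python
MARK_MAGIC_TO_PROXY = 0x0200  # from bpf/lib/common.h as quoted above

def htons_le(port):
    """Byte-swap a 16-bit port, as htons() does on a little-endian host."""
    return ((port & 0xFF) << 8) | (port >> 8)

def to_proxy_mark(current_mark, proxy_port_be, from_host, enable_tproxy):
    """Mirror the quoted branch: OR in the magic, or overwrite with magic plus port."""
    if enable_tproxy and not from_host:
        # ENABLE_TPROXY and not from host: keep existing bits, add the magic.
        return current_mark | MARK_MAGIC_TO_PROXY
    # Otherwise the mark is replaced and carries the proxy port in bits 16-31.
    return MARK_MAGIC_TO_PROXY | (proxy_port_be << 16)

if __name__ == "__main__":
    print(hex(to_proxy_mark(0, htons_le(33531), from_host=False, enable_tproxy=False)))
```

With a proxy port of 33531 the second branch yields 0xfb820200, which is the mark matched by the cilium-dns-egress iptables rules in the next step.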

2) Cilium Proxy Redirection

Packets carrying the mark are redirected by iptables to the Cilium host proxy. On the node, you can see that egress traffic leaving a pod is sent by iptables to the Cilium proxy port 33531 (the cilium-dns-egress proxy).

(⎈|HomeLab:N/A) root@k8s-ctr:~# iptables -t mangle -S | grep -i proxy
...
-A CILIUM_PRE_mangle -p tcp -m mark --mark 0xfb820200 -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 33531 --on-ip 127.0.0.1 --tproxy-mark 0x200/0xffffffff
-A CILIUM_PRE_mangle -p udp -m mark --mark 0xfb820200 -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 33531 --on-ip 127.0.0.1 --tproxy-mark 0x200/0xffffffff

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat /var/run/cilium/state/proxy_ports_state.json
{"cilium-dns-egress":{"type":"dns","ingress":false,"port":33531}}

(⎈|HomeLab:N/A) root@k8s-ctr:~# ss -nltup | grep cilium
udp   UNCONN 0      0           127.0.0.1:33531      0.0.0.0:*    users:(("cilium-agent",pid=5430,fd=48))

As the lab below shows, when cilium-ingress is enabled and an Ingress resource of class cilium is created, the cilium DaemonSet creates new iptables rules for cilium-ingress.

(⎈|HomeLab:N/A) root@k8s-ctr:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
1        0     0 MARK       0    --  *      !lo     0.0.0.0/0            0.0.0.0/0            socket --transparent mark match ! 0xe00/0xf00 mark match ! 0x800/0xf00 /* cilium: any->pod redirect proxied traffic to host proxy */ MARK set 0x200
2        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0xfb820200 /* cilium: TPROXY to host cilium-dns-egress proxy */ TPROXY redirect 127.0.0.1:33531 mark 0x200/0xffffffff
3        0     0 TPROXY     17   --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0xfb820200 /* cilium: TPROXY to host cilium-dns-egress proxy */ TPROXY redirect 127.0.0.1:33531 mark 0x200/0xffffffff
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x17380200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:14359 mark 0x200/0xffffffff
5        0     0 TPROXY     17   --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x17380200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:14359 mark 0x200/0xffffffff

Checking the redirect port shows that it belongs to cilium-envoy.

(⎈|HomeLab:N/A) root@k8s-ctr:~# ss -nltup | grep cilium | grep 14359
tcp   LISTEN 0      4096        127.0.0.1:14359      0.0.0.0:*    users:(("cilium-envoy",pid=15257,fd=63))
tcp   LISTEN 0      4096        127.0.0.1:14359      0.0.0.0:*    users:(("cilium-envoy",pid=15257,fd=56))
tcp   LISTEN 0      4096        127.0.0.1:14359      0.0.0.0:*    users:(("cilium-envoy",pid=15257,fd=55))
tcp   LISTEN 0      4096        127.0.0.1:14359      0.0.0.0:*    users:(("cilium-envoy",pid=15257,fd=52))

Traffic entering through Ingress is sent to cilium-envoy by these iptables rules.

3) Socket lookup

If no socket is already connected, a new tuple is built and looked up in the kernel.

//cilium/bpf/lib/proxy.h
static __always_inline int                            \
NAME(struct __ctx_buff *ctx, const CT_TUPLE_TYPE * ct_tuple,            \
     __be16 proxy_port, void *tproxy_addr)                    \
{                                        \
    struct bpf_sock_tuple *tuple = (struct bpf_sock_tuple *)ct_tuple;    \
    __u8 nexthdr = ct_tuple->nexthdr;                    \
    __u32 len = sizeof(tuple->SK_FIELD);                    \
    __u16 port;                                \
    int result;                                \
...                        \
    /* If there is no locally connected socket, assign a new TPROXY socket. */ \
    /* Set the destination port to proxy_port, i.e. the Cilium proxy port. */ \
    tuple->SK_FIELD.dport = proxy_port;     \
    /* Wildcard the source port. */ \
    tuple->SK_FIELD.sport = 0;    \
    /* Set the destination address to the local IP the TPROXY socket is bound to, typically a loopback address such as 127.0.0.1 or ::1. */ \
    memcpy(&tuple->SK_FIELD.daddr, tproxy_addr, sizeof(tuple->SK_FIELD.daddr)); \
    /* Wildcard the source address. */ \
    memset(&tuple->SK_FIELD.saddr, 0, sizeof(tuple->SK_FIELD.saddr));    \
    cilium_dbg3(ctx, DBG_LOOKUP_CODE,                    \
            tuple->SK_FIELD.SADDR_DBG, tuple->SK_FIELD.DADDR_DBG,    \
            combine_ports(tuple->SK_FIELD.dport, tuple->SK_FIELD.sport));    \
    result = assign_socket(ctx, tuple, len, nexthdr, false);        \
    if (result == CTX_ACT_OK)                        \
        goto out;    \
...
}
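
The lookup order described in this step can be pictured with a toy Python model: an already established socket wins, otherwise the tuple is rewritten to the proxy address and port with the source wildcarded, and the listening socket is found. Everything here (dictionary-based socket tables, names, addresses) is hypothetical and only mirrors the idea, not the kernel or Cilium data structures.

```python
def lookup_socket(established, listeners, pkt, proxy_addr, proxy_port):
    """Toy model: prefer an established socket, else the proxy's listening socket."""
    five_tuple = (pkt["saddr"], pkt["sport"], pkt["daddr"], pkt["dport"])
    if five_tuple in established:
        return established[five_tuple]
    # No established socket: look up the listener on (proxy_addr, proxy_port),
    # with source address and port treated as wildcards.
    return listeners.get((proxy_addr, proxy_port))

if __name__ == "__main__":
    established = {("10.0.0.5", 40000, "10.96.0.10", 80): "established-sock"}
    listeners = {("127.0.0.1", 14359): "envoy-listener"}
    new_pkt = {"saddr": "10.0.0.6", "sport": 40001, "daddr": "10.96.0.10", "dport": 80}
    print(lookup_socket(established, listeners, new_pkt, "127.0.0.1", 14359))
```

Note that the packet's own destination is never modified in this model; only the lookup key changes, which is the point of TPROXY.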

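4) Socket binding

Using the tuple it built, Cilium looks up the TPROXY listening socket in the kernel and assigns the packet directly to that socket. In other words, Cilium assigns the packet to the TPROXY socket without iptables REDIRECT/DNAT. Because the packet is handed straight to the Envoy socket at this step, L7 processing is possible without NAT, with the original destination IP/port preserved.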

//cilium/bpf/lib/proxy.h
assign_socket_tcp(struct __ctx_buff *ctx,
          struct bpf_sock_tuple *tuple, __u32 len, bool established)
{
    int result = DROP_PROXY_LOOKUP_FAILED;
    struct bpf_sock *sk;
    __u32 dbg_ctx;

/* Look up a TCP socket in the kernel that matches the tuple. */
    sk = skc_lookup_tcp(ctx, tuple, len, BPF_F_CURRENT_NETNS, 0);
    if (!sk)
        goto out;
...
  /* Assign the packet directly to the socket. */
    result = sk_assign(ctx, sk, 0);
...

Checking the Cilium Envoy and cilium-ingress configuration

Cilium install command

helm install cilium cilium/cilium --version $2 --namespace kube-system \
--set k8sServiceHost=192.168.10.100 --set k8sServicePort=6443 \
--set ipam.mode="cluster-pool" --set ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} --set ipv4NativeRoutingCIDR=172.20.0.0/16 \
--set routingMode=native --set autoDirectNodeRoutes=true --set endpointRoutes.enabled=true --set directRoutingSkipUnreachable=true \
--set kubeProxyReplacement=true --set bpf.masquerade=true --set installNoConntrackIptablesRules=true \
--set endpointHealthChecking.enabled=false --set healthChecking=false \
--set hubble.enabled=true --set hubble.relay.enabled=true --set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort --set hubble.ui.service.nodePort=30003 \
--set prometheus.enabled=true --set operator.prometheus.enabled=true --set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
--set ingressController.enabled=true --set ingressController.loadbalancerMode=shared --set loadBalancer.l7.backend=envoy \
--set localRedirectPolicy=true --set l2announcements.enabled=true \
--set operator.replicas=1 --set debug.enabled=true >/dev/null 2>&1

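Let's check the Kubernetes Ingress support configuration.

Because ingressController was set to true when Cilium was installed, each node has an IP that Cilium reserves for Ingress. This is not an IP owned by an actual Pod but a virtual IP managed by the Cilium agent, used to identify traffic that entered through Ingress when it is handed to Envoy (the L7 proxy).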

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep -E '^loadbalancer|l7'
enable-l7-proxy                                   true
loadbalancer-l7                                   envoy
loadbalancer-l7-algorithm                         round_robin
loadbalancer-l7-ports

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium ip list | grep ingress
172.20.0.147/32     reserved:ingress
172.20.1.202/32     reserved:ingress

Cilium Envoy runs as a DaemonSet with hostNetwork, one instance per node, and listens on port 9964 on each node. The endpoints of the cilium-envoy Service are port 9964 of the Cilium Envoy pods. Envoy is connected to the Cilium agent, and the admin.sock Unix socket allows status queries and control between the Cilium agent and Envoy.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system cilium-envoy
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME                   TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)    AGE
service/cilium-envoy   ClusterIP   None         <none>        9964/TCP   18h

NAME                     ENDPOINTS                                 AGE
endpoints/cilium-envoy   192.168.10.100:9964,192.168.10.101:9964   18h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -n kube-system -l k8s-app=cilium-envoy -owide
NAME                 READY   STATUS    RESTARTS   AGE   IP               NODE      NOMINATED NODE   READINESS GATES
cilium-envoy-hw9kw   1/1     Running   0          17h   192.168.10.100   k8s-ctr   <none>           <none>
cilium-envoy-nvvw8   1/1     Running   0          17h   192.168.10.101   k8s-w1    <none>           <none>

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe pod -n kube-system -l k8s-app=cilium-envoy
...
Volumes:
  envoy-sockets:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cilium/envoy/sockets
    HostPathType:  DirectoryOrCreate
  envoy-artifacts:
    Type:          HostPath (bare host directory volume)
    Path:          /var/run/cilium/envoy/artifacts
    HostPathType:  DirectoryOrCreate
  envoy-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      cilium-envoy-config
    Optional:  false
  bpf-maps:
    Type:          HostPath (bare host directory volume)
    Path:          /sys/fs/bpf
    HostPathType:  DirectoryOrCreate
...

(⎈|HomeLab:N/A) root@k8s-ctr:~# ls -al /var/run/cilium/envoy/sockets
total 0
drwxr-xr-x 3 root root 120 Aug 22 19:30 .
drwxr-xr-x 4 root root  80 Aug 22 19:29 ..
srw-rw---- 1 root 1337   0 Aug 22 19:30 access_log.sock
srwxr-xr-x 1 root root   0 Aug 22 19:29 admin.sock
drwxr-xr-x 3 root root  60 Aug 22 19:30 envoy
srw-rw---- 1 root 1337   0 Aug 22 19:30 xds.sock

(⎈|HomeLab:N/A) root@k8s-ctr:/var/run/cilium/envoy/sockets# kubectl exec -it -n kube-system ds/cilium-envoy -- cat /var/run/cilium/envoy/bootstrap-config.json > config.json

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat config.json | jq
        "connectTimeout": "2s",
        "loadAssignment": {
          "clusterName": "/envoy-admin",
          "endpoints": [
            {
              "lbEndpoints": [
                {
                  "endpoint": {
                    "address": {
                      "pipe": {
                        "path": "/var/run/cilium/envoy/sockets/admin.sock"
                      }
                    }
                  }
                }
              ]
            }
          ]
        },
        "name": "/envoy-admin",
        "type": "STATIC"
      }
    ],
    "listeners": [
      {
        "address": {
          "socketAddress": {
            "address": "0.0.0.0",
            "portValue": 9964
          }

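Cilium ingress is created as a LoadBalancer Service so that it can accept external traffic. The endpoint 192.192.192.192:9999 is a placeholder endpoint that lets packets arriving from outside follow the Cilium eBPF path and be redirected to Envoy; no traffic is actually sent to this IP.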

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system cilium-ingress
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
service/cilium-ingress   LoadBalancer   10.96.132.142        80:30861/TCP,443:31103/TCP   18h

NAME                       ENDPOINTS              AGE
endpoints/cilium-ingress   192.192.192.192:9999   18h

On a node, /sys/fs/bpf/cilium is the virtual filesystem where Cilium places its eBPF objects. Let's look at how things are laid out through the files under /sys/fs/bpf/cilium on the control-plane node.

/sys/fs/bpf/cilium
├── devices
...
│   ├── eth0
│   │   └── links
│   │       ├── cil_from_netdev
│   │       └── cil_to_netdev

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system exec -it ds/cilium -- cilium endpoint list
1215       Disabled           Disabled          28440      k8s:app.kubernetes.io/name=hubble-ui                                                       172.20.0.36    ready
                                                           k8s:app.kubernetes.io/part-of=cilium
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=kube-system
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=hubble-ui
                                                           k8s:io.kubernetes.pod.namespace=kube-system
                                                           k8s:k8s-app=hubble-ui
...

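Entries such as devices/eth0/links/cil_* are eBPF hooks attached to the ingress and egress of the node's network interface. Through them Cilium steps into the RX/TX paths of the kernel network stack and inserts the TPROXY handling that sends packets arriving from outside to Envoy. (Example: when an external client connects to the Ingress LB IP, the flow is eth0 RX → cil_from_netdev → eBPF → Envoy (TPROXY).)

cil_from_netdev: pulls packets arriving from the NIC into the Cilium datapath.
cil_to_netdev: sends packets leaving the Cilium datapath out to the actual NIC.

L2 Announcement setup and assigning an external IP to cilium-ingress

To test calls from outside through cilium-ingress, add LB IPAM and L2 Announcement configuration.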

cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2" 
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-lb-ippool"
spec:
  blocks:
  - start: "192.168.10.211"
    stop:  "192.168.10.215"
EOF
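
To see how many addresses the start/stop block above makes available, here is a small Python sketch using the standard ipaddress module. The helper name is hypothetical, and the sketch assumes both ends of the range are inclusive.

```python
import ipaddress

def pool_addresses(start, stop):
    """List every IPv4 address from start to stop, inclusive."""
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(stop))
    if last < first:
        raise ValueError("stop must not be lower than start")
    return [str(ipaddress.IPv4Address(n)) for n in range(first, last + 1)]

if __name__ == "__main__":
    for addr in pool_addresses("192.168.10.211", "192.168.10.215"):
        print(addr)
```

Under that assumption the pool holds five addresses, beginning with 192.168.10.211, which is the external IP that shows up on cilium-ingress in the output further down.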

cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2alpha1"
kind: CiliumL2AnnouncementPolicy
metadata:
  name: policy1
spec:
  interfaces:
  - eth1
  externalIPs: true
  loadBalancerIPs: true
EOF

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep cilium-ingress -n kube-system
NAME                     TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
service/cilium-ingress   LoadBalancer   10.96.132.142   192.168.10.211   80:30861/TCP,443:31103/TCP   18h

NAME                       ENDPOINTS              AGE
endpoints/cilium-ingress   192.192.192.192:9999   18h

# Current leader node for the L2 announcement = k8s-w1
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-kube-system-cilium-ingress   k8s-w1                                                                      55s

The external IP of cilium-ingress is reachable from both outside and inside the cluster.

(⎈|HomeLab:N/A) root@k8s-ctr:~# LBIP=$(kubectl get svc -n kube-system cilium-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
(⎈|HomeLab:N/A) root@k8s-ctr:~# echo $LBIP
192.168.10.211
(⎈|HomeLab:N/A) root@k8s-ctr:~# arping -i eth1 $LBIP -c 2
ARPING 192.168.10.211
60 bytes from 08:00:27:b0:11:69 (192.168.10.211): index=0 time=249.708 usec
60 bytes from 08:00:27:b0:11:69 (192.168.10.211): index=1 time=315.533 usec

--- 192.168.10.211 statistics ---
2 packets transmitted, 2 packets received,   0% unanswered (0 extra)


root@router:~# LBIP=192.168.10.211
root@router:~# arping -i eth1 $LBIP -c 2
ARPING 192.168.10.211
60 bytes from 08:00:27:b0:11:69 (192.168.10.211): index=0 time=533.017 usec
60 bytes from 08:00:27:b0:11:69 (192.168.10.211): index=1 time=270.677 usec
^C
--- 192.168.10.211 statistics ---
2 packets transmitted, 2 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.271/0.402/0.533/0.131 ms

Ingress HTTP Example

The lab uses bookinfo, well known as an Istio example. After deploying it, you can see that, unlike in the Istio lab, there are no sidecar containers.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod,svc,ep
NAME                                  READY   STATUS    RESTARTS   AGE
pod/details-v1-766844796b-bfgg4       1/1     Running   0          53s
pod/productpage-v1-54bb874995-579qn   1/1     Running   0          53s
pod/ratings-v1-5dc79b6bcd-csprv       1/1     Running   0          53s
pod/reviews-v1-598b896c9d-dnqln       1/1     Running   0          53s
pod/reviews-v2-556d6457d-46p47        1/1     Running   0          53s
pod/reviews-v3-564544b4d6-b5bps       1/1     Running   0          53s

NAME                  TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/details       ClusterIP   10.96.159.26    <none>        9080/TCP   53s
service/kubernetes    ClusterIP   10.96.0.1       <none>        443/TCP    24h
service/productpage   ClusterIP   10.96.28.207    <none>        9080/TCP   53s
service/ratings       ClusterIP   10.96.106.180   <none>        9080/TCP   53s
service/reviews       ClusterIP   10.96.212.117   <none>        9080/TCP   53s

NAME                    ENDPOINTS                                              AGE
endpoints/details       172.20.1.142:9080                                      53s
endpoints/kubernetes    192.168.10.100:6443                                    24h
endpoints/productpage   172.20.1.154:9080                                      53s
endpoints/ratings       172.20.1.21:9080                                       53s
endpoints/reviews       172.20.1.139:9080,172.20.1.186:9080,172.20.1.26:9080   53s

Deploy an Ingress and check connectivity. The Ingress address is taken from the external IP of the cilium-ingress Service.

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: basic-ingress
  namespace: default
spec:
  ingressClassName: cilium
  rules:
  - http:
      paths:
      - backend:
          service:
            name: details
            port:
              number: 9080
        path: /details
        pathType: Prefix
      - backend:
          service:
            name: productpage
            port:
              number: 9080
        path: /
        pathType: Prefix
EOF

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ingress
NAME            CLASS    HOSTS   ADDRESS          PORTS   AGE
basic-ingress   cilium   *       192.168.10.211   80      74s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc -n kube-system cilium-ingress
NAME             TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)                      AGE
cilium-ingress   LoadBalancer   10.96.132.142   192.168.10.211   80:30861/TCP,443:31103/TCP   24h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe ingress
...
Address:          192.168.10.211
Ingress Class:    cilium
Rules:
  Host        Path  Backends
  ----        ----  --------
  *
              /details   details:9080 (172.20.1.142:9080)
              /          productpage:9080 (172.20.1.154:9080)

(⎈|HomeLab:N/A) root@k8s-ctr:~# LBIP=$(kubectl get svc -n kube-system cilium-ingress -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
(⎈|HomeLab:N/A) root@k8s-ctr:~# echo $LBIP
192.168.10.211
(⎈|HomeLab:N/A) root@k8s-ctr:~# curl -so /dev/null -w "%{http_code}\n" http://$LBIP/
200

(⎈|HomeLab:N/A) root@k8s-ctr:~# curl -so /dev/null -w "%{http_code}\n" http://$LBIP/details/1
200

(⎈|HomeLab:N/A) root@k8s-ctr:~# curl -so /dev/null -w "%{http_code}\n" http://$LBIP/ratings
404

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f -t l7
Aug 23 10:52:09.731: 192.168.10.200:43330 (ingress) -> default/productpage-v1-54bb874995-579qn:9080 (ID:49314) http-request FORWARDED (HTTP/1.1 GET http://192.168.10.211/)
Aug 23 10:52:09.732: 192.168.10.200:43330 (ingress) <- default/productpage-v1-54bb874995-579qn:9080 (ID:49314) http-response FORWARDED (HTTP/1.1 200 2ms (GET http://192.168.10.211/))
Aug 23 10:52:37.857: 192.168.10.200:43452 (ingress) -> default/details-v1-766844796b-bfgg4:9080 (ID:20097) http-request FORWARDED (HTTP/1.1 GET http://192.168.10.211/details/1)
Aug 23 10:52:37.860: 192.168.10.200:43452 (ingress) <- default/details-v1-766844796b-bfgg4:9080 (ID:20097) http-response FORWARDED (HTTP/1.1 200 3ms (GET http://192.168.10.211/details/1))
Aug 23 10:53:11.823: 192.168.10.200:47462 (ingress) -> default/productpage-v1-54bb874995-579qn:9080 (ID:49314) http-request FORWARDED (HTTP/1.1 GET http://192.168.10.211/ratings)
Aug 23 10:53:11.832: 192.168.10.200:47462 (ingress) <- default/productpage-v1-54bb874995-579qn:9080 (ID:49314) http-response FORWARDED (HTTP/1.1 404 10ms (GET http://192.168.10.211/ratings))
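
The three curl results above follow from how pathType: Prefix rules are matched. The sketch below is a simplified model of the Kubernetes Prefix semantics (element-wise match, longest matching path wins), not the actual Cilium or Envoy implementation; the RULES table and the function names are made up for illustration.

```python
from typing import Dict, List, Optional

# Hypothetical table mirroring the two paths of basic-ingress.
RULES: Dict[str, str] = {"/details": "details", "/": "productpage"}


def split_path(path: str) -> List[str]:
    """Split a URL path into its non-empty elements."""
    return [element for element in path.split("/") if element]


def prefix_matches(rule_path: str, request_path: str) -> bool:
    """Element-wise prefix match: /details matches /details/1 but not /detailsX."""
    rule = split_path(rule_path)
    return split_path(request_path)[: len(rule)] == rule


def select_backend(rules: Dict[str, str], request_path: str) -> Optional[str]:
    """Return the backend of the longest matching rule, or None if nothing matches."""
    matching = [path for path in rules if prefix_matches(path, request_path)]
    if not matching:
        return None
    longest = max(matching, key=lambda path: len(split_path(path)))
    return rules[longest]


for request in ("/", "/details/1", "/ratings"):
    print(request, "->", select_backend(RULES, request))
```

Under this model /ratings is not rejected by the Ingress. It falls through to the catch-all / rule, and productpage itself answers 404, which is consistent with the Hubble flow showing the request forwarded to productpage.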

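On the worker node running the pod, capture traffic on the pod's veth interface to confirm that the client IP is preserved (in the X-Forwarded-For header) thanks to TPROXY.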

root@k8s-w1:~# PROID=172.20.1.154

root@k8s-w1:~# ip route | grep $PROID
172.20.1.154 dev lxc24276eae3fab proto kernel scope link

root@k8s-w1:~# PROVETH=lxc24276eae3fab

root@k8s-w1:~# ngrep -tW byline -d $PROVETH '' 'tcp port 9080'
lxc24276eae3fab: no IPv4 address assigned: Cannot assign requested address
interface: lxc24276eae3fab
filter: ( tcp port 9080 ) and ((ip || ip6) || (vlan && (ip || ip6)))
####
T 2025/08/23 19:56:14.533041 10.0.2.15:56674 -> 172.20.1.154:9080 [AP] #4
GET / HTTP/1.1.
host: 192.168.10.211.
user-agent: curl/8.5.0.
accept: */*.
x-forwarded-for: 192.168.10.200. # -> XFF carries the client IP.
x-forwarded-proto: http.
x-envoy-internal: true.
x-request-id: eabadda9-e958-4038-8446-aa5c5807c527.
.

##
T 2025/08/23 19:56:14.548321 172.20.1.154:9080 -> 10.0.2.15:56674 [AP] #6
HTTP/1.1 200 OK.
Server: gunicorn.
Date: Sat, 23 Aug 2025 10:56:14 GMT.
Connection: keep-alive.
Content-Type: text/html; charset=utf-8.
Content-Length: 2080.
...

In addition, we can trace how packets arriving for the Ingress are handed to cilium-envoy via TPROXY.

The iptables mangle table on the worker node contains a rule that redirects Ingress traffic to 127.0.0.1:10061, which is the TCP port cilium-envoy listens on.

root@k8s-w1:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
...
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4d270200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:10061 mark 0x200/0xffffffff
5        0     0 TPROXY     17   --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4d270200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:10061 mark 0x200/0xffffffff

root@k8s-w1:~# ss -nltup | grep 10061
tcp   LISTEN 0      4096        127.0.0.1:10061      0.0.0.0:*    users:(("cilium-envoy",pid=11198,fd=63))
tcp   LISTEN 0      4096        127.0.0.1:10061      0.0.0.0:*    users:(("cilium-envoy",pid=11198,fd=56))
tcp   LISTEN 0      4096        127.0.0.1:10061      0.0.0.0:*    users:(("cilium-envoy",pid=11198,fd=55))
tcp   LISTEN 0      4096        127.0.0.1:10061      0.0.0.0:*    users:(("cilium-envoy",pid=11198,fd=52))

When the LB is called from outside, the pkts counter of the iptables rule TPROXY to host kube-system/cilium-ingress/listener proxy increases by one.

root@k8s-w1:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
...
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4d270200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:10061 mark 0x200/0xffffffff

root@router:~# curl -so /dev/null -w "%{http_code}\n" http://$LBIP/
200

root@k8s-w1:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
num   pkts bytes target     prot opt in     out     source               destination
...
4        1    60 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x4d270200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:10061 mark 0x200/0xffffffff
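
The mark in each TPROXY rule and the redirect port appear to be related: the upper 16 bits of the mark, byte-swapped, give the listener port. This relationship is inferred purely from the numbers in this post's iptables listings, not taken from the Cilium documentation, and the constant and function names below are chosen here for illustration.

```python
# Low 16 bits shared by all TPROXY rules in the listings; the constant name is ours.
TO_PROXY_MAGIC = 0x0200


def proxy_port_from_mark(mark: int) -> int:
    """Recover the listener port from a mark such as 0x4d270200."""
    if mark & 0xFFFF != TO_PROXY_MAGIC:
        raise ValueError(f"mark {mark:#x} does not end in {TO_PROXY_MAGIC:#06x}")
    upper = (mark >> 16) & 0xFFFF                # e.g. 0x4d27
    return ((upper & 0xFF) << 8) | (upper >> 8)  # byte swap -> 0x274d


# The three marks that appear in the iptables listings of this post.
for mark in (0x4D270200, 0x17380200, 0x94480200):
    print(f"{mark:#010x} -> port {proxy_port_from_mark(mark)}")
```

For 0x4d270200 this yields 10061, the port cilium-envoy is listening on in the ss output above.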

Dedicated Mode

Next, create an Ingress in dedicated mode. In dedicated mode, a dedicated LoadBalancer is created for each Ingress resource.

# Deploy the sample application
cat << EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: webpod
spec:
  replicas: 2
  selector:
    matchLabels:
      app: webpod
  template:
    metadata:
      labels:
        app: webpod
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - sample-app
            topologyKey: "kubernetes.io/hostname"
      containers:
      - name: webpod
        image: traefik/whoami
        ports:
        - containerPort: 80
---
apiVersion: v1
kind: Service
metadata:
  name: webpod
  labels:
    app: webpod
spec:
  selector:
    app: webpod
  ports:
  - protocol: TCP
    port: 80
    targetPort: 80
  type: ClusterIP
EOF


# Deploy curl-pod on the k8s-ctr node
cat <


As in the previous exercise, capture veth traffic on the node running the pod to confirm that the client IP is preserved thanks to TPROXY.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -l app=webpod -owide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
webpod-697b545f57-8qgpp   1/1     Running   0          76s   172.20.0.192   k8s-ctr              
webpod-697b545f57-gkzn6   1/1     Running   0          76s   172.20.1.230   k8s-w1               

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip link | grep 172.20.0.192
(⎈|HomeLab:N/A) root@k8s-ctr:~# WEBIP=172.20.0.192
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route | grep $WEBIP
172.20.0.192 dev lxc02bb82001369 proto kernel scope link
(⎈|HomeLab:N/A) root@k8s-ctr:~# WPODVETH=lxc02bb82001369

root@router:~# LB2IP=192.168.10.212
root@router:~# curl -so /dev/null -w "%{http_code}\n" http://$LB2IP/
200
(⎈|HomeLab:N/A) root@k8s-ctr:~# ngrep -tW byline -d $WPODVETH '' 'tcp port 80'
lxc02bb82001369: no IPv4 address assigned: Cannot assign requested address
interface: lxc02bb82001369
filter: ( tcp port 80 ) and ((ip || ip6) || (vlan && (ip || ip6)))
...
T 2025/08/23 21:40:53.991963 172.20.0.192:80 -> 10.0.2.15:36594 [AP] #6
HTTP/1.1 200 OK.
Date: Sat, 23 Aug 2025 12:40:53 GMT.
Content-Length: 341.
Content-Type: text/plain; charset=utf-8.
.
Hostname: webpod-697b545f57-8qgpp
IP: 127.0.0.1
IP: ::1
IP: 172.20.0.192
IP: fe80::5c51:1ff:fee8:c410
RemoteAddr: 10.0.2.15:36594
GET / HTTP/1.1.
Host: 192.168.10.212.
User-Agent: curl/8.5.0.
Accept: */*.
X-Envoy-Internal: true.
X-Forwarded-For: 192.168.10.200.
X-Forwarded-Proto: http.
X-Request-Id: 135499d1-ed5d-457b-9815-8b5fb49d63c9.
.

###
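One thing to note here is that, because of the L2 Announcement configuration, a request served by the pod on k8s-w1 still enters through k8s-ctr before being forwarded to k8s-w1, which is why RemoteAddr differs between the two pods.
For the pod on the leader node k8s-ctr, the L2 leader node forwards the packet directly, so the RemoteAddr the pod sees is the IP of k8s-ctr's first NIC (10.0.2.15).
For the pod on k8s-w1, the packet is forwarded from the L2 leader node (k8s-ctr) to k8s-w1, so RemoteAddr is the source IP used after passing through the leader: 172.20.0.147, an address in k8s-ctr's PodCIDR rather than the Ingress EXTERNAL-IP.
In both cases, however, the real client IP is preserved through TPROXY, so the X-Forwarded-For value matches the eth1 IP of the external router node (192.168.10.200).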
# k8s-ctr
(⎈|HomeLab:N/A) root@k8s-ctr:~# ngrep -tW byline -d $WPODVETH '' 'tcp port 80'
...
Hostname: webpod-697b545f57-8qgpp
IP: 127.0.0.1
IP: ::1
IP: 172.20.0.192
IP: fe80::5c51:1ff:fee8:c410
RemoteAddr: 10.0.2.15:36594
GET / HTTP/1.1.
Host: 192.168.10.212.
User-Agent: curl/8.5.0.
Accept: */*.
X-Envoy-Internal: true.
X-Forwarded-For: 192.168.10.200.
X-Forwarded-Proto: http.
X-Request-Id: 135499d1-ed5d-457b-9815-8b5fb49d63c9.

# k8s-w1
root@k8s-w1:~# ngrep -tW byline -d $WPODVETH '' 'tcp port 80'
...
Hostname: webpod-697b545f57-gkzn6
IP: 127.0.0.1
IP: ::1
IP: 172.20.1.230
IP: fe80::8c39:e6ff:fe38:4d79
RemoteAddr: 172.20.0.147:35985
GET / HTTP/1.1.
Host: 192.168.10.212.
User-Agent: curl/8.5.0.
Accept: */*.
X-Envoy-Internal: true.
X-Forwarded-For: 192.168.10.200.
X-Forwarded-Proto: http.
X-Request-Id: 7958092f-8b81-47a1-81c3-31d3e6725660.
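
A backend that wants the real client address therefore has to read X-Forwarded-For rather than RemoteAddr. A minimal sketch of that lookup, using the values from the two captures above, could look like this; client_ip is a hypothetical helper, and trusting the header is only reasonable when all traffic reaches the pod through the ingress proxy.

```python
from typing import Optional


def client_ip(remote_addr: str, x_forwarded_for: Optional[str]) -> str:
    """Prefer the first X-Forwarded-For entry; fall back to the TCP peer address."""
    if x_forwarded_for:
        return x_forwarded_for.split(",")[0].strip()
    return remote_addr.rsplit(":", 1)[0]  # strip the ":port" suffix


# Values copied from the two captures above.
print(client_ip("10.0.2.15:36594", "192.168.10.200"))     # pod on k8s-ctr
print(client_ip("172.20.0.147:35985", "192.168.10.200"))  # pod on k8s-w1
```

Both calls resolve to the router's address even though the TCP peer differs per node.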
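ingress-nginx and Cilium Ingress can coexist
ingress-nginx and Cilium Ingress can run side by side in the same cluster, each serving the Ingress resources of its own IngressClass.
# Install the ingress-nginx controller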
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx --create-namespace -n ingress-nginx

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ingressclasses.networking.k8s.io
NAME     CONTROLLER                     PARAMETERS   AGE
cilium   cilium.io/ingress-controller          26h
nginx    k8s.io/ingress-nginx                  51s

# Create the Ingress
cat << EOF | kubectl apply -f -
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: webpod-ingress-nginx
  namespace: default
spec:
  ingressClassName: nginx
  rules:
  - host: nginx.webpod.local
    http:
      paths:
      - backend:
          service:
            name: webpod
            port:
              number: 80
        path: /
        pathType: Prefix
EOF

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ingress
NAME                   CLASS    HOSTS                ADDRESS          PORTS   AGE
basic-ingress          cilium   *                    192.168.10.211   80      44m
webpod-ingress         cilium   *                    192.168.10.212   80      24m
webpod-ingress-nginx   nginx    nginx.webpod.local   192.168.10.213   80      12s
RemoteAddr
Call this LB and check the traffic.
On both k8s-ctr and k8s-w1, RemoteAddr is the pod IP of ingress-nginx-controller (172.20.1.35).
root@router:~# LB3IP=192.168.10.213
root@router:~# curl -H "Host: nginx.webpod.local" $LB3IP

# k8s-ctr
(⎈|HomeLab:N/A) root@k8s-ctr:~# ngrep -tW byline -d $WPODVETH '' 'tcp port 80'
Hostname: webpod-697b545f57-8qgpp
IP: 127.0.0.1
IP: ::1
IP: 172.20.0.192
IP: fe80::5c51:1ff:fee8:c410
RemoteAddr: 172.20.1.35:53392
GET / HTTP/1.1.
Host: nginx.webpod.local.
User-Agent: curl/8.5.0.
Accept: */*.
X-Forwarded-For: 192.168.10.200.
X-Forwarded-Host: nginx.webpod.local.
X-Forwarded-Port: 80.
X-Forwarded-Proto: http.
X-Forwarded-Scheme: http.
X-Real-Ip: 192.168.10.200.
X-Request-Id: 02a94cc39febb17e48af05ccff0d4f9e.
X-Scheme: http.

# k8s-w1
root@k8s-w1:~# ngrep -tW byline -d $WPODVETH '' 'tcp port 80'
Hostname: webpod-697b545f57-gkzn6
IP: 127.0.0.1
IP: ::1
IP: 172.20.1.230
IP: fe80::8c39:e6ff:fe38:4d79
RemoteAddr: 172.20.1.35:35976
GET / HTTP/1.1.
Host: nginx.webpod.local.
User-Agent: curl/8.5.0.
Accept: */*.
X-Forwarded-For: 192.168.10.200.
X-Forwarded-Host: nginx.webpod.local.
X-Forwarded-Port: 80.
X-Forwarded-Proto: http.
X-Forwarded-Scheme: http.
X-Real-Ip: 192.168.10.200.
X-Request-Id: 7adfa99642ab225d4dd4c4c22b3d8408.
X-Scheme: http.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -n ingress-nginx -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP            NODE     NOMINATED NODE   READINESS GATES
ingress-nginx-controller-67bbdf7d8d-f9bmk   1/1     Running   0          15m   172.20.1.35   k8s-w1              
TPROXY
Checking whether TPROXY is involved shows that an Ingress of the nginx class does not use the kernel TPROXY feature.
# Call the nginx-class Ingress
(⎈|HomeLab:N/A) root@k8s-ctr:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
...
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x17380200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:14359 mark 0x200/0xffffffff
6        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x94480200 /* cilium: TPROXY to host default/cilium-ingress-default-webpod-ingress/listener proxy */ TPROXY redirect 127.0.0.1:18580 mark 0x200/0xffffffff

root@router:~# curl -H "Host: nginx.webpod.local" $LB3IP

# pkts unchanged.
Chain CILIUM_PRE_mangle (1 references)
...
4        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x17380200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:14359 mark 0x200/0xffffffff
6        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x94480200 /* cilium: TPROXY to host default/cilium-ingress-default-webpod-ingress/listener proxy */ TPROXY redirect 127.0.0.1:18580 mark 0x200/0xffffffff
# Call the cilium-class Ingress
root@router:~# curl -so /dev/null -w "%{http_code}\n" http://$LB2IP/

# pkts of the TPROXY rule for the dedicated webpod-ingress listener increases by 1
(⎈|HomeLab:N/A) root@k8s-ctr:~# sudo iptables -t mangle -L CILIUM_PRE_mangle --line-numbers -n -v
Chain CILIUM_PRE_mangle (1 references)
...
4        1    60 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x94480200 /* cilium: TPROXY to host default/cilium-ingress-default-webpod-ingress/listener proxy */ TPROXY redirect 127.0.0.1:18580 mark 0x200/0xffffffff
6        0     0 TPROXY     6    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match 0x17380200 /* cilium: TPROXY to host kube-system/cilium-ingress/listener proxy */ TPROXY redirect 127.0.0.1:14359 mark 0x200/0xffffffff
Hubble Observe
Monitor webpod with hubble observe.
nginx class Ingress
(⎈|HomeLab:N/A) root@k8s-ctr:~#  kubectl exec -n kube-system -c cilium-agent -it ds/cilium -- cilium-dbg endpoint list | grep webpod
3644       Disabled           Disabled          59436      k8s:app=webpod                                                                        172.20.1.230   ready

# Call the nginx-class Ingress
root@router:~# curl -H "Host: nginx.webpod.local" $LB3IP

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f  --protocol tcp --from-identity  59436
Aug 23 13:43:51.260: ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk:34564 (ID:47827) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-endpoint FORWARDED (TCP Flags: SYN, ACK)
Aug 23 13:43:51.261: default/webpod-697b545f57-8qgpp:80 (ID:59436) <> ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk (ID:47827) pre-xlate-rev TRACED (TCP)
Aug 23 13:43:51.261: default/webpod-697b545f57-8qgpp:80 (ID:59436) <> ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk (ID:47827) pre-xlate-rev TRACED (TCP)
Aug 23 13:43:51.272: ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk:34564 (ID:47827) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Aug 23 13:43:51.337: ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk:34564 (ID:47827) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: SYN, ACK)
Aug 23 13:43:51.344: ingress-nginx/ingress-nginx-controller-67bbdf7d8d-f9bmk:34564 (ID:47827) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK, PSH)

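The ingress-nginx-controller pod (on k8s-w1) sends a TCP SYN to the backend (webpod)
The default/webpod pod (on k8s-ctr) replies (SYN, ACK followed by ACK, PSH)
The reply is FORWARDED to-network toward the ingress-nginx-controller pod
After nginx processes it, eBPF routes the response packet out to the client

Characteristics

No Envoy/TPROXY involved
TCP flow: nginx pod → backend pod → nginx pod → client
nginx is L7-aware and performs Host-based routing
eBPF only handles the pod→pod and pod→network paths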

cilium class Ingress
# Call the cilium-class Ingress
root@router:~# curl -so /dev/null -w "%{http_code}\n" http://$LB2IP/

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f  --from-identity  59436
# The webpod on k8s-w1 was called
Aug 23 14:29:24.934: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: SYN, ACK)
Aug 23 14:29:24.935: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: ACK, PSH)
Aug 23 14:29:24.935: 192.168.10.200:40780 (ingress) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) http-response FORWARDED (HTTP/1.1 200 1ms (GET http://192.168.10.212/))
Aug 23 14:29:39.941: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: ACK)
Aug 23 14:29:55.301: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: ACK)
Aug 23 14:30:10.661: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: ACK)
Aug 23 14:30:24.937: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: ACK, FIN)

# The webpod on k8s-ctr was called
Aug 23 14:30:28.342: 192.168.10.200:42700 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) http-response FORWARDED (HTTP/1.1 200 4ms (GET http://192.168.10.212/))
Aug 23 14:30:28.416: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: SYN, ACK)
Aug 23 14:30:28.418: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK, PSH)
Aug 23 14:30:43.614: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK)
Aug 23 14:30:58.975: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK)
Aug 23 14:31:14.334: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK)
Aug 23 14:31:28.420: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: ACK, FIN)
1) External request

External router → calls the Ingress LB IP
L2 Announcement: traffic is delivered to the leader node, k8s-w1 (in this capture the leader is k8s-w1, whereas the earlier ngrep capture was described with k8s-ctr as leader)
eBPF LB + TPROXY: the packet is redirected to the Envoy socket (TPROXY)
Envoy parses the L7 request and selects an appropriate backend pod

2) Request delivery (Envoy → webpod)
Envoy forwards the request to the selected pod.
2-1) Backend pod on k8s-w1 (leader node)

Envoy → webpod on the same node
Request path: router → k8s-w1 node → webpod (k8s-w1)

2-2) Backend pod on k8s-ctr (remote node)

Envoy forwards the request to the remote pod IP
Request path: router → Ingress EXT-IP (k8s-w1) → k8s-ctr webpod

3) Response delivery (webpod → client)
3-1) Backend pod on k8s-w1 (leader node)

Response path: webpod → k8s-w1 host stack (Envoy, to-stack) → router
TPROXY redirects to the host stack, and Envoy handles the HTTP response

3-2) Backend pod on k8s-ctr (remote node)

Response path: webpod (k8s-ctr) → k8s-ctr host stack → k8s-w1 Envoy (to-network) → client
On the way back from k8s-ctr to k8s-w1 the flow is reported as to-network, and Envoy handles the HTTP response
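
The difference between cases 3-1 and 3-2 can be read directly off the observation point in each Hubble line. The sketch below extracts it with a regular expression and applies the interpretation used in this post (to-stack means Envoy is on the same node, to-network means the reply leaves for Envoy on another node). The regex and the ENVOY_LOCATION mapping are assumptions based on the output format shown above, not an official Hubble API.

```python
import re
from collections import Counter
from typing import Iterable, Optional

# Matches "... (ID:59436) to-stack FORWARDED (...)" and captures the observation point.
FLOW_RE = re.compile(r"\(ID:\d+\) (?P<point>[a-z-]+) (?P<verdict>[A-Z]+) ")

# Interpretation used in this post, not an official Hubble mapping.
ENVOY_LOCATION = {"to-stack": "same node", "to-network": "remote node"}


def observation_point(line: str) -> Optional[str]:
    """Return the observation point of a hubble observe line, or None."""
    match = FLOW_RE.search(line)
    return match.group("point") if match else None


def summarize(lines: Iterable[str]) -> Counter:
    """Count flows per observation point, skipping lines that do not parse."""
    return Counter(p for p in map(observation_point, lines) if p is not None)


# Two lines copied from the capture above.
SAMPLE = [
    "Aug 23 14:29:24.934: 10.0.2.15:59674 (host) <- default/webpod-697b545f57-gkzn6:80 (ID:59436) to-stack FORWARDED (TCP Flags: SYN, ACK)",
    "Aug 23 14:30:28.416: 172.20.1.202:38801 (ingress) <- default/webpod-697b545f57-8qgpp:80 (ID:59436) to-network FORWARDED (TCP Flags: SYN, ACK)",
]
for line in SAMPLE:
    point = observation_point(line)
    print(point, "->", ENVOY_LOCATION.get(point, "n/a"))
```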



Cilium - BGP ControlPlane
Sat, 16 Aug 2025 17:21:48 GMT
BGP ControlPlane
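As seen in the previous post, in an environment where traffic between different network ranges has to pass through a router, configuring routes by hand becomes inefficient as the number of nodes grows.
Two common ways to address this limitation are overlay networks and dynamic routing with BGP. This post looks at advertising addresses with BGP (Border Gateway Protocol).
The lab environment is as follows.

Kubernetes cluster nodes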

k8s-ctr(IP: 192.168.10.100, podCIDR: 172.20.0.0/24)
k8s-w1(IP: 192.168.10.101, podCIDR: 172.20.1.0/24)
k8s-w0(IP: 192.168.20.100, podCIDR: 172.20.2.0/24)

Router node

router(IP: 192.168.10.200)

autoDirectNodeRoutes = false

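No per-node PodCIDR routes are installed.

Baseline connectivity test
Because Cilium is currently configured with autoDirectNodeRoutes=false, no per-node routes are installed for the PodCIDRs, so pods can only communicate with pods on the same node.
# No per-node PodCIDR routes.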
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.0.0/24 via 172.20.0.167 dev cilium_host proto kernel src 172.20.0.167
172.20.0.167 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100
192.168.20.0/24 via 192.168.10.200 dev eth1 proto static

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get endpointslices -l app=webpod
NAME           ADDRESSTYPE   PORTS   ENDPOINTS                             AGE
webpod-2zb7b   IPv4          80      172.20.0.28,172.20.1.76,172.20.2.68   21s

# Only the pod on the same node is reachable.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
---
---
---
---
---
---
Hostname: webpod-697b545f57-qmxbk
---
---
Hostname: webpod-697b545f57-qmxbk
---
^Ccommand terminated with exit code 130
Cilium BGP Control Plane
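Through its BGP Control Plane feature, Cilium lets cluster nodes dynamically advertise network information such as Pod CIDRs and Service IPs to external routers. It provides several CRDs for this purpose, each with the following role.

CiliumBGPClusterConfig

Defines the BGP instances and peer settings that apply cluster-wide.
This makes it possible to apply one BGP configuration to many nodes at once and to keep the same peering policy on every node.

CiliumBGPPeerConfig

Defines a set of BGP peering settings that can be shared by multiple peers.
For example, frequently repeated settings such as hold time, keepalive interval, and eBGP multihop can be modularized and reused.

CiliumBGPAdvertisement

Defines what is injected into the BGP routing table.
It can be configured to announce prefixes such as Pod CIDRs, Service IPs, or LoadBalancer IP ranges to external routers.

CiliumBGPNodeConfigOverride

Defines fine-grained BGP settings that apply only to a specific node.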
FRR
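For Kubernetes nodes to advertise Pod CIDRs or Service IP ranges externally through the Cilium BGP Control Plane, a BGP peer is needed to receive them. In a typical data center or virtualized environment an L3 router takes this role: it receives BGP updates from the cluster nodes and propagates the routes to the external network.
FRR (FRRouting) is open-source routing software that runs on Linux. It supports a variety of routing protocols, including BGP, OSPF, IS-IS, and RIP, and integrates with the kernel routing table so that dynamically learned routes are applied system-wide.
FRR performs the following functions.
Hands-on
(1) FRR configuration
First, apply the FRR configuration on the router node so that it can take part in BGP.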
root@router:~# cat /etc/frr/frr.conf
frr version 8.4.4
frr defaults traditional
hostname router
log syslog informational
no ipv6 forwarding
service integrated-vtysh-config
!
router bgp 65000
 bgp router-id 192.168.10.200
 no bgp ebgp-requires-policy
 bgp graceful-restart
 bgp bestpath as-path multipath-relax
 neighbor CILIUM peer-group
 neighbor CILIUM remote-as external
 neighbor 192.168.10.100 peer-group CILIUM
 neighbor 192.168.10.101 peer-group CILIUM
 neighbor 192.168.20.100 peer-group CILIUM
 !
 address-family ipv4 unicast
  network 10.10.1.0/24
  maximum-paths 4
 exit-address-family
exit
!
(2) Cilium BGP Control Plane configuration
Deploy the CRs so that Cilium treats the nodes as BGP speakers and can advertise their PodCIDRs externally.
# Label the nodes so that Cilium selects them for BGP
kubectl label nodes k8s-ctr k8s-w0 k8s-w1 enable-bgp=true

# Config Cilium BGP
# --------------------------------------------------------
# CiliumBGPAdvertisement
# - Defines which network prefixes are advertised over BGP.
# - Here, the PodCIDR assigned to each node is announced to the external router.
# --------------------------------------------------------
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements
  labels:
    advertise: bgp              # can be referenced later from the PeerConfig via matchLabels
spec:
  advertisements:
    - advertisementType: "PodCIDR"   # advertise each node's PodCIDR
---
# --------------------------------------------------------
# CiliumBGPPeerConfig
# - Defines the common settings used for BGP peering.
# --------------------------------------------------------
apiVersion: cilium.io/v2
kind: CiliumBGPPeerConfig
metadata:
  name: cilium-peer
spec:
  timers:
    holdTimeSeconds: 9          # time after which the session with the peer is considered down
    keepAliveTimeSeconds: 3     # interval at which keepalive messages are sent to the peer
  ebgpMultihop: 2               # number of hops allowed for eBGP peering (default is 1)
  gracefulRestart:
    enabled: true               # enable Graceful Restart
    restartTimeSeconds: 15      # time to wait for routes to be relearned when the session recovers
  families:
    - afi: ipv4
      safi: unicast
      advertisements:
        matchLabels:            # label-based selection of the advertisement resources to apply
          advertise: "bgp"      # links to the CiliumBGPAdvertisement above
---
# --------------------------------------------------------
# CiliumBGPClusterConfig
# - Defines the cluster-wide BGP instances.
# - A node selector determines which nodes the configuration applies to.
# --------------------------------------------------------
apiVersion: cilium.io/v2
kind: CiliumBGPClusterConfig
metadata:
  name: cilium-bgp
spec:
  nodeSelector:
    matchLabels:
      "enable-bgp": "true"      # applies only to nodes labeled enable-bgp=true
  bgpInstances:
  - name: "instance-65001"      # BGP instance name
    localASN: 65001             # ASN of the local node (Kubernetes node)
    peers:
    - name: "tor-switch"        # peer name
      peerASN: 65000            # ASN of the router (peer)
      peerAddress: 192.168.10.200  # IP address of the router
      peerConfigRef:
        name: "cilium-peer"     # references the CiliumBGPPeerConfig defined above
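
The three resources are tied together by label selectors: the nodeSelector picks the nodes, and the advertisements selector of the peer config picks the CiliumBGPAdvertisement. The following is a minimal sketch of equality-based matchLabels selection using the labels from the manifests above. It only illustrates the matching rule and is not Cilium's code; the k8s-extra node is hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]


def match_labels(selector: Labels, labels: Labels) -> bool:
    """True when every key/value pair of the selector is present in labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def select(selector: Labels, objects: Dict[str, Labels]) -> List[str]:
    """Names of the objects whose labels satisfy the selector, in sorted order."""
    return sorted(name for name, labels in objects.items() if match_labels(selector, labels))


# Labels taken from the manifests above; "k8s-extra" is a hypothetical unlabeled node.
nodes = {
    "k8s-ctr": {"enable-bgp": "true"},
    "k8s-w0": {"enable-bgp": "true"},
    "k8s-w1": {"enable-bgp": "true"},
    "k8s-extra": {},
}
advertisements = {"bgp-advertisements": {"advertise": "bgp"}}

print(select({"enable-bgp": "true"}, nodes))          # nodeSelector of cilium-bgp
print(select({"advertise": "bgp"}, advertisements))   # advertisements selector of cilium-peer
```

A node without the enable-bgp=true label is simply not selected, so no BGP instance would be created for it.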
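(3) Connectivity check after configuration
On the router node, we can confirm that Cilium advertised each node's PodCIDR over BGP and that the router received the prefixes and installed them in its routing table.

1. Check the routing table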
root@router:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.10.1.0/24 dev loop1 proto kernel scope link src 10.10.1.200
10.10.2.0/24 dev loop2 proto kernel scope link src 10.10.2.200
172.20.0.0/24 nhid 32 via 192.168.10.100 dev eth1 proto bgp metric 20
172.20.1.0/24 nhid 30 via 192.168.10.101 dev eth1 proto bgp metric 20
172.20.2.0/24 nhid 31 via 192.168.20.100 dev eth2 proto bgp metric 20
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.200
192.168.20.0/24 dev eth2 proto kernel scope link src 192.168.20.200

The 172.20.0.0/24, 172.20.1.0/24, and 172.20.2.0/24 ranges have been added as BGP routes (proto bgp).
These are the PodCIDRs of the control-plane node (k8s-ctr) and the worker nodes (k8s-w1 and k8s-w0), respectively.

2. Check the BGP session state
root@router:~# vtysh -c 'show ip bgp summary'



Neighbor        V   AS     MsgRcvd   MsgSent   Up/Down State/PfxRcd
192.168.10.100  4  65001        76        79  00:03:38            1
192.168.10.101  4  65001        76        79  00:03:38            1
192.168.20.100  4  65001        76        79  00:03:38            1

BGP sessions are established with all three nodes (ASN 65001).
A State/PfxRcd value of 1 means that one prefix (the PodCIDR) was received from each node.

3. Check the BGP routing table
root@router:~# vtysh -c 'show ip bgp'
*> 10.10.1.0/24     0.0.0.0                  0         32768 i
*> 172.20.0.0/24    192.168.10.100                         0 65001 i
*> 172.20.1.0/24    192.168.10.101                         0 65001 i
*> 172.20.2.0/24    192.168.20.100                         0 65001 i
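The 172.20.0.0/24, 172.20.1.0/24, and 172.20.2.0/24 prefixes are registered with each node's IP (192.168.x.x) as NextHop. In other words, the router has correctly learned the PodCIDRs from the Cilium nodes.

4. Check the advertisement state on the Cilium side
(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium bgp routes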
Node      VRouter   Prefix          NextHop   Age      Attrs
k8s-ctr   65001     172.20.0.0/24   0.0.0.0   5m49s    [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w0    65001     172.20.2.0/24   0.0.0.0   15m48s   [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w1    65001     172.20.1.0/24   0.0.0.0   15m48s   [{Origin: i} {Nexthop: 0.0.0.0}]
This shows that each node is advertising its own PodCIDR as a BGP route, which matches what was seen on the router.

However, when curl-pod tries to reach webpod, it can still only communicate with the pod on the same node.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
---
---
---
---
Hostname: webpod-697b545f57-phntn
---
Hostname: webpod-697b545f57-phntn
---
Hostname: webpod-697b545f57-phntn
---
---
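This is because, inside the cluster, Cilium does not automatically add routes toward the other nodes for their PodCIDR ranges to the node's routing table.
Why Cilium does not add physical network routes to the routing table
Cilium basically assumes that all nodes are L3-reachable from one another.
For that reason, BGP advertisement only announces the PodCIDRs to the outside (the router); it does not solve the physical network path itself.
The advertised PodCIDRs are only meaningful when the node IP ranges can already reach each other.
When using Cilium BGP advertisement, the physical router therefore has to be configured so that the subnets the nodes belong to are routed to one another.
If routing cannot be guaranteed at the physical network level, an overlay tunneling method such as VXLAN or Geneve should be used instead.
In that case Pod traffic is encapsulated and delivered even when direct routing between nodes is not possible.
Cilium BGP communication summary

Direct Routing mode: PodCIDR traffic between nodes is delivered over a direct L2/L3 path, without a VXLAN/Geneve tunnel.
autoDirectNodeRoutes=false: the Cilium agent does not automatically add PodCIDR routes to the kernel.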


1. While the BGP configuration is reconciled, the CiliumNode's PodCIDR information is picked up and advertised in real time
//pkg/bgpv1/manager/manager.go
func (m *BGPRouterManager) reconcileBGPConfig(ctx context.Context,
    sc *instance.ServerWithConfig,
    newc *v2alpha1.CiliumBGPVirtualRouter,
    ciliumNode *v2.CiliumNode) error {
...
    for _, r := range m.Reconcilers {
        // Reconcile the BGP configuration, storing the CiliumNode's podCIDR information in real time
        if err := r.Reconcile(ctx, reconciler.ReconcileParams{
            CurrentServer: sc,
            DesiredConfig: newc,
            CiliumNode:    ciliumNode,
        }); err != nil {
            return fmt.Errorf("reconciliation of virtual router with local ASN %v failed: %w", newc.LocalASN, err)
        }
    }
...
}



//pkg/bgpv1/manager/reconciler/pod_cidr.go
func (r *ExportPodCIDRReconciler) Reconcile(ctx context.Context, p ReconcileParams) error {
...
    advertisements, err := exportAdvertisementsReconciler(&advertisementsReconcilerParams{
        logger:    r.Logger,
        ctx:       ctx,
        name:      "pod CIDR",
        component: "exportPodCIDRReconciler",
        enabled:   *p.DesiredConfig.ExportPodCIDR,
        sc:        p.CurrentServer,
        newc:      p.DesiredConfig,

        currentAdvertisements: r.getMetadata(p.CurrentServer),
        toAdvertise:           toAdvertise,
    })
...
    // Store, in real time, the CiliumNode podCIDR information that has to be advertised
    r.storeMetadata(p.CurrentServer, advertisements)
    return nil
}
// cf) If no NextHop is specified, 0.0.0.0 is returned
// NextHopFromPathAttributes returns the next hop address determined by the list of provided BGP path attributes.
func NextHopFromPathAttributes(pathAttributes []bgppacket.PathAttributeInterface) string {
    for _, a := range pathAttributes {
        switch attr := a.(type) {
        case *bgppacket.PathAttributeNextHop:
            return attr.Value.String()
        case *bgppacket.PathAttributeMpReachNLRI:
            return attr.Nexthop.String()
        }
    }
    return "0.0.0.0"
}

2. Check the Cilium BGP state
(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium bgp peers
Node      Local AS   Peer AS   Peer Address     Session State   Uptime      Family         Received   Advertised
k8s-ctr   65001      65000     192.168.10.200   established     37h44m51s   ipv4/unicast   5          3
k8s-w0    65001      65000     192.168.10.200   established     37h44m48s   ipv4/unicast   5          3
k8s-w1    65001      65000     192.168.10.200   established     37h44m48s   ipv4/unicast   5          3

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium bgp routes
(Defaulting to `available ipv4 unicast` routes, please see help for more options)

Node      VRouter   Prefix          NextHop   Age         Attrs
k8s-ctr   65001     172.16.1.1/32   0.0.0.0   37h23m15s   [{Origin: i} {Nexthop: 0.0.0.0}]
          65001     172.20.0.0/24   0.0.0.0   38h3m45s    [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w0    65001     172.16.1.1/32   0.0.0.0   37h23m14s   [{Origin: i} {Nexthop: 0.0.0.0}]
          65001     172.20.2.0/24   0.0.0.0   38h13m42s   [{Origin: i} {Nexthop: 0.0.0.0}]
k8s-w1    65001     172.16.1.1/32   0.0.0.0   37h23m13s   [{Origin: i} {Nexthop: 0.0.0.0}]
          65001     172.20.1.0/24   0.0.0.0   38h13m42s   [{Origin: i} {Nexthop: 0.0.0.0}]
```

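Suppose a packet is sent from k8s-ctr, that is, from 172.20.0.0/24. Cilium consults the Prefix and NextHop stored in its BGP routes.
Here the NextHop is 0.0.0.0, and the routing table on the k8s-ctr node shows that such traffic leaves through eth0 via the default route.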

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
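For traffic to actually flow in this lab, packets must leave through eth1 toward the router node, which holds the podCIDR BGP routes. A static route toward the router therefore has to be added for the podCIDR range on every node.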
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route add 172.20.0.0/16 via 192.168.10.200
root@k8s-w0:~# ip route add 172.20.0.0/16 via 192.168.20.200
root@k8s-w1:~# ip route add 172.20.0.0/16 via 192.168.10.200
After the routes are added, communication works normally.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
Hostname: webpod-697b545f57-phntn
---
Hostname: webpod-697b545f57-hhjqw
---
Hostname: webpod-697b545f57-8d2xs
---
Hostname: webpod-697b545f57-8d2xs
---
Hostname: webpod-697b545f57-phntn
---
Hostname: webpod-697b545f57-hhjqw
---
In summary:

With autoDirectNodeRoutes=false, advertising the PodCIDR alone does not guarantee node-to-node connectivity.
The path from the PodCIDR range to the actual physical node must be specified explicitly in the routing table.

Service IP advertisement
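Just as the PodCIDR is advertised over BGP, Kubernetes service IPs can also be advertised to an external router over BGP.

External IP: a service IP reachable from outside the cluster
Cluster IP: a service IP used only inside the cluster

Either can be advertised over BGP as needed, which makes the service reachable from external networks.
One concept to understand before advertising service IPs is the Traffic Policy, the setting that determines how service traffic is distributed.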
External Traffic Policy

Cluster
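Traffic arriving from outside is distributed across the service Pods in the whole cluster.
Local
External traffic is delivered only to service Pods on the node that received it.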

Internal Traffic Policy

Cluster
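Traffic from Pods inside the cluster to the service is distributed across all Pods.
Local
Internal traffic is delivered only to Pods on the same node.

When a service IP is advertised, the External / Internal Traffic Policy determines which Pods receive the traffic. Each case is examined below.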
External IP
External IP + External Traffic Policy (Cluster)
Set the Service type to LoadBalancer and assign it an External IP.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc
NAME         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
kubernetes   ClusterIP   10.96.0.1               443/TCP   21h
webpod       ClusterIP   10.96.252.129           80/TCP    8h

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2"
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-pool"
spec:
  allowFirstLastIPs: "No"
  blocks:
  - cidr: "172.16.1.0/24"
EOF
ciliumloadbalancerippool.cilium.io/cilium-pool created

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ippool
NAME          DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-pool   false      False         254             7s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl patch svc webpod -p '{"spec": {"type": "LoadBalancer"}}'
service/webpod patched

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc
NAME         TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
kubernetes   ClusterIP      10.96.0.1               443/TCP        21h
webpod       LoadBalancer   10.96.252.129   172.16.1.1    80:31726/TCP   8h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ippool
NAME          DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-pool   false      False         253             16s
To advertise the LB IP over BGP, deploy a CiliumBGPAdvertisement CR with advertisementType set to Service and the address type set to LoadBalancerIP.
cat << EOF | kubectl apply -f -
apiVersion: cilium.io/v2
kind: CiliumBGPAdvertisement
metadata:
  name: bgp-advertisements-lb-exip-webpod
  labels:
    advertise: bgp
spec:
  advertisements:
    - advertisementType: "Service"
      service:
        addresses:
          - LoadBalancerIP
      selector:             
        matchExpressions:
          - { key: app, operator: In, values: [ webpod ] }
EOF

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe svc webpod | grep 'Traffic Policy'
External Traffic Policy:  Cluster
Internal Traffic Policy:  Cluster

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bgp route-policies
VRouter   Policy Name                                             Type     Match Peers         Match Families   Match Prefixes (Min..Max Len)   RIB Action   Path Actions
65001     allow-local                                             import                                                                        accept
65001     tor-switch-ipv4-PodCIDR                                 export   192.168.10.200/32                    172.20.1.0/24 (24..24)          accept
65001     tor-switch-ipv4-Service-webpod-default-LoadBalancerIP   export   192.168.10.200/32                    172.16.1.1/32 (32..32)          accept

root@router:~# sudo vtysh -c 'show ip bgp 172.16.1.1/32'
BGP routing table entry for 172.16.1.1/32, version 5
Paths: (3 available, best #1, table default)
  Advertised to non peer-group peers:
  192.168.10.100 192.168.10.101 192.168.20.100
  65001
    192.168.10.100 from 192.168.10.100 (192.168.10.100)
      Origin IGP, valid, external, multipath, best (Router ID)
      Last update: Fri Aug 15 06:41:04 2025
  65001
    192.168.10.101 from 192.168.10.101 (192.168.10.101)
      Origin IGP, valid, external, multipath
      Last update: Fri Aug 15 06:41:04 2025
  65001
    192.168.20.100 from 192.168.20.100 (192.168.20.100)
      Origin IGP, valid, external, multipath
      Last update: Fri Aug 15 06:41:04 2025
Next, test connectivity through the LB IP. To make the traffic path easier to observe, reduce the webpod replica count by one and then call the LB IP from the external router.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl scale deployment webpod --replicas 2
deployment.apps/webpod scaled
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS      AGE   IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   1 (39h ago)   47h   172.20.0.247   k8s-ctr              
webpod-697b545f57-8d2xs   1/1     Running   0             47h   172.20.2.68    k8s-w0               
webpod-697b545f57-hhjqw   1/1     Running   0             47h   172.20.1.76    k8s-w1               

(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w0:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w1:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

root@router:~# LBIP=172.16.1.1
root@router:~# curl -s $LBIP
Even though the Pods run only on k8s-w0 and k8s-w1, the captures show that for some requests packets are seen on both k8s-w0 and k8s-w1.
This is because External Traffic Policy is set to Cluster.
How it works
1) When External Traffic Policy = Cluster, every node that matches the nodeSelector advertises the LB's External IP.
//pkg/bgpv1/manager/reconciler/service.go
func (r *ServiceReconciler) fullReconciliation(ctx context.Context, p ReconcileParams, pathRefs pathReferencesMap) error {
...
  // Identify the services that have an endpoint on this node, i.e. the services this node can serve when External Traffic Policy = Local.
    ls, err := r.populateLocalServices(p.CiliumNode.Name)
    if err != nil {
        return err
    }
    for _, svc := range toReconcile {
        if err := r.reconcileService(ctx, p.CurrentServer, p.DesiredConfig, svc, ls, pathRefs); err != nil {
            return fmt.Errorf("failed to reconcile service %s/%s: %w", svc.Namespace, svc.Name, err)
        }
    }
...
    return nil
}

// Function that reconciles a Service
func (r *ServiceReconciler) reconcileService(ctx context.Context, sc *instance.ServerWithConfig, newc *v2alpha1api.CiliumBGPVirtualRouter, svc *slim_corev1.Service, ls localServices, pathRefs pathReferencesMap) error {
  // Returns the list of routes that should be advertised for the given service.
    desiredRoutes, err := r.svcDesiredRoutes(newc, svc, ls)
...
    return r.reconcileServiceRoutes(ctx, sc, svc, desiredRoutes, pathRefs)
}

// Returns the final list of routes to advertise, depending on how the Service is advertised.
func (r *ServiceReconciler) svcDesiredRoutes(newc *v2alpha1api.CiliumBGPVirtualRouter, svc *slim_corev1.Service, ls localServices) ([]netip.Prefix, error) {
...
    var desiredRoutes []netip.Prefix
    for _, svcAdv := range newc.ServiceAdvertisements {
        switch svcAdv {
        case v2alpha1api.BGPLoadBalancerIPAddr:
            desiredRoutes = append(desiredRoutes, r.lbSvcDesiredRoutes(svc, ls)...)
        case v2alpha1api.BGPClusterIPAddr:
            desiredRoutes = append(desiredRoutes, r.clusterIPDesiredRoutes(svc, ls)...)
    // This is the case taken in this scenario.
        case v2alpha1api.BGPExternalIPAddr:
            desiredRoutes = append(desiredRoutes, r.externalIPDesiredRoutes(svc, ls)...)
        }
    }
}
Through externalIPDesiredRoutes below, the cilium-agent on each node advertises the LB's External IP.
func (r *ServiceReconciler) externalIPDesiredRoutes(svc *slim_corev1.Service, ls localServices) []netip.Prefix {
    var desiredRoutes []netip.Prefix
...
    for _, extIP := range svc.Spec.ExternalIPs {
        if extIP == "" {
            continue
        }
        addr, err := netip.ParseAddr(extIP)
        if err != nil {
            continue
        }
        desiredRoutes = append(desiredRoutes, netip.PrefixFrom(addr, addr.BitLen()))
    }
    return desiredRoutes
}
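2) The external router sees the advertised External IP and forwards the packet to a node.
At this point the router does not know which node hosts the Pod; it forwards traffic purely based on the BGP advertisement of the External IP.
3) The Cilium LB receives the packet.
4) The Cilium agent resolves the backends from its service list.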
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg service list
ID   Frontend               Service Type   Backend
...
16   172.16.1.1:80/TCP      LoadBalancer   1 => 172.20.1.76:80/TCP (active)
                                           2 => 172.20.2.68:80/TCP (active)

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bpf lb list | grep 172.16.1.1
172.16.1.1:80/TCP (1)          172.20.1.76:80/TCP (16) (1)
172.16.1.1:80/TCP (2)          172.20.2.68:80/TCP (16) (2)
172.16.1.1:80/TCP (0)          0.0.0.0:0 (16) (0) [LoadBalancer]                                          
5) The packet is forwarded to the node hosting the Pod, based on the bpf ipcache map.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bpf ipcache list
IP PREFIX/ADDRESS   IDENTITY
172.20.1.76/32      identity=28377 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.2.68/32      identity=28377 encryptkey=0 tunnelendpoint=192.168.20.100 flags=hastunnel
...

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route get 172.20.1.76
172.20.1.76 via 192.168.10.200 dev eth1 src 192.168.10.100 uid 0
    cache

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route get 172.20.2.68
172.20.2.68 via 192.168.10.200 dev eth1 src 192.168.10.100 uid 0
    cache
External IP + External Traffic Policy (Local)
This time, change the service's External Traffic Policy to Local and check connectivity again.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl patch service webpod -p '{"spec":{"externalTrafficPolicy":"Local"}}'
service/webpod patched
After switching External Traffic Policy to Local, the BGP route on the router node changes so that only the nodes hosting a Pod remain as next hops.
# External Traffic Policy = Cluster
root@router:~# ip -c route
...
172.16.1.1 nhid 101 proto bgp metric 20
    nexthop via 192.168.10.100 dev eth1 weight 1
    nexthop via 192.168.20.100 dev eth2 weight 1
    nexthop via 192.168.10.101 dev eth1 weight 1

# External Traffic Policy = Local
root@router:~# ip -c route
...
172.16.1.1 nhid 105 proto bgp metric 20
    nexthop via 192.168.20.100 dev eth2 weight 1
    nexthop via 192.168.10.101 dev eth1 weight 1
As before, call the LB IP while webpod runs only on k8s-w0 and k8s-w1.
(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w0:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w1:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

root@router:~# LBIP=172.16.1.1
root@router:~# curl -s $LBIP
This time, traffic goes to only one of the nodes hosting a Pod.
How it works
1) When the service has External Traffic Policy = Local, only the nodes hosting a Pod advertise the ExternalIP.
//pkg/bgpv1/manager/reconciler/service.go
func (r *ServiceReconciler) externalIPDesiredRoutes(svc *slim_corev1.Service, ls localServices) []netip.Prefix {
    var desiredRoutes []netip.Prefix
    // When externalTrafficPolicy == Local and the node hosts no endpoint Pod, desiredRoutes is returned empty.
    // In other words, with externalTrafficPolicy == Local only nodes hosting an endpoint Pod advertise themselves.
    if svc.Spec.ExternalTrafficPolicy == slim_corev1.ServiceExternalTrafficPolicyLocal &&
        !hasLocalEndpoints(svc, ls) {
        return desiredRoutes
    }

    // With externalTrafficPolicy == Local, a node that does host an endpoint Pod advertises the External IP.
    for _, extIP := range svc.Spec.ExternalIPs {
        if extIP == "" {
            continue
        }
        addr, err := netip.ParseAddr(extIP)
        if err != nil {
            continue
        }
        desiredRoutes = append(desiredRoutes, netip.PrefixFrom(addr, addr.BitLen()))
    }
    return desiredRoutes  
...
}

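2) The external router sees the advertised External IP and forwards the packet to a node.
Only the nodes hosting a Pod are advertised at this point.
3) The packet arrives at a node hosting a Pod.
4) The Cilium LB checks the backends.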
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg service list
ID   Frontend               Service Type   Backend
...
16   172.16.1.1:80/TCP      LoadBalancer   1 => 172.20.1.76:80/TCP (active)
                                           2 => 172.20.2.68:80/TCP (active)

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bpf lb list | grep 172.16.1.1
172.16.1.1:80/TCP (1)          172.20.1.76:80/TCP (16) (1)
172.16.1.1:80/TCP (2)          172.20.2.68:80/TCP (16) (2)
172.16.1.1:80/TCP (0)          0.0.0.0:0 (16) (0) [LoadBalancer]                                          
5) When the service has External Traffic Policy = Local, Cilium selects only the Pods on that node as backends.
After checking the Pod IP to node mapping in the ipcache map, NAT/DNAT is applied only toward local Pods.
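
The two decisions walked through above (which nodes advertise the IP, and which backends a node may pick) can be sketched together. This is a simplified illustration, not Cilium's implementation: `backend`, `shouldAdvertise`, and `eligibleBackends` are hypothetical names, and the policy is passed as a plain string.

```go
package main

import "fmt"

// backend is a hypothetical view of a service endpoint and the node it runs on.
type backend struct {
	podIP string
	node  string
}

// shouldAdvertise mirrors the check in externalIPDesiredRoutes quoted above:
// with policy "Local", a node without a local endpoint advertises nothing.
func shouldAdvertise(policy, node string, backends []backend) bool {
	if policy != "Local" {
		return true
	}
	for _, b := range backends {
		if b.node == node {
			return true
		}
	}
	return false
}

// eligibleBackends returns the backends a node may forward to under the policy.
func eligibleBackends(policy, node string, backends []backend) []backend {
	if policy != "Local" {
		return backends
	}
	var local []backend
	for _, b := range backends {
		if b.node == node {
			local = append(local, b)
		}
	}
	return local
}

func main() {
	backends := []backend{
		{"172.20.2.68", "k8s-w0"},
		{"172.20.1.76", "k8s-w1"},
	}
	for _, policy := range []string{"Cluster", "Local"} {
		for _, node := range []string{"k8s-ctr", "k8s-w0", "k8s-w1"} {
			fmt.Printf("%-7s %-7s advertise=%-5v backends=%d\n",
				policy, node,
				shouldAdvertise(policy, node, backends),
				len(eligibleBackends(policy, node, backends)))
		}
	}
}
```

With Cluster, all three nodes advertise and each sees both backends; with Local, k8s-ctr drops out of the advertisement and each worker only sees its own Pod, which matches the router's next-hop list shown earlier.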
ECMP Hash Policy
By default the Linux kernel uses an L3 hash (source and destination IP) for multipath routes. For finer-grained load balancing, switch to the L4 hash, which also includes the ports.
(⎈|HomeLab:N/A) root@k8s-ctr:~# sysctl net.ipv4.fib_multipath_hash_policy
net.ipv4.fib_multipath_hash_policy = 0
(⎈|HomeLab:N/A) root@k8s-ctr:~# sysctl -w net.ipv4.fib_multipath_hash_policy=1
net.ipv4.fib_multipath_hash_policy = 1
Afterwards, calls to the LB IP from the external router node are load-balanced across the two nodes hosting a Pod.
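
Why the hash policy matters can be shown with a small sketch. The code below is an illustration only: the kernel uses its own hash function and, for the L4 policy, also the protocol number, while this example uses FNV-1a as a stand-in; `flow` and `pickNextHop` are hypothetical names.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// flow is a hypothetical subset of the 5-tuple, for illustration only.
type flow struct {
	srcIP, dstIP     string
	srcPort, dstPort uint16
}

// pickNextHop hashes the flow and picks one of n equal-cost next hops.
// l4=false hashes only the addresses (like policy 0);
// l4=true also hashes the ports (like policy 1).
func pickNextHop(f flow, n int, l4 bool) int {
	h := fnv.New32a()
	h.Write([]byte(f.srcIP))
	h.Write([]byte(f.dstIP))
	if l4 {
		h.Write([]byte{
			byte(f.srcPort >> 8), byte(f.srcPort),
			byte(f.dstPort >> 8), byte(f.dstPort),
		})
	}
	return int(h.Sum32() % uint32(n))
}

func main() {
	nextHops := []string{"192.168.20.100", "192.168.10.101"}
	// Repeated curl calls from the router differ only in the source port.
	for port := uint16(40000); port < 40004; port++ {
		f := flow{"192.168.10.200", "172.16.1.1", port, 80}
		l3 := nextHops[pickNextHop(f, len(nextHops), false)]
		l4 := nextHops[pickNextHop(f, len(nextHops), true)]
		fmt.Printf("srcPort=%d l3=%s l4=%s\n", port, l3, l4)
	}
}
```

With the L3 hash every request between the same two addresses lands on the same next hop, because the changing source port is not part of the hash input; with the L4 hash the port contributes, so separate connections can spread across next hops.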
Cluster IP
This time, advertise the Cluster IP instead of the ExternalIP.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl edit ciliumbgpadvertisement
...
  spec:
    advertisements:
    - advertisementType: Service
      selector:
        matchExpressions:
        - key: app
          operator: In
          values:
          - webpod
      service:
        addresses:
        - ClusterIP
...

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc webpod -o wide
NAME     TYPE           CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE    SELECTOR
webpod   LoadBalancer   10.96.252.129   172.16.1.1    80:31726/TCP   2d3h   app=webpod

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system ds/cilium -- cilium service list
ID   Frontend               Service Type   Backend
...
12   10.96.252.129:80/TCP    ClusterIP     1 => 172.20.0.103:80/TCP (active)
                                           2 => 172.20.1.80:80/TCP (active)
                                           3 => 172.20.1.81:80/TCP (active)
The BGP routes on the external router now show that the advertisement has switched to the Cluster IP.
root@router:~# ip -c route
...
10.96.252.129 nhid 117 proto bgp metric 20
    nexthop via 192.168.10.100 dev eth1 weight 1
    nexthop via 192.168.20.100 dev eth2 weight 1
    nexthop via 192.168.10.101 dev eth1 weight 1
Cluster IP + Internal Traffic Policy (Cluster)
Calling the Cluster IP directly from the external router does not work.
root@router:~# curl -s $LBIP
^C
This is because a ClusterIP is intended for communication inside the cluster. Setting bpf.lbExternalClusterIP=true makes the internal ClusterIP reachable from outside as well.
(⎈|HomeLab:N/A) root@k8s-ctr:~# helm upgrade cilium cilium/cilium --version 1.18.0 --namespace kube-system --reuse-values --set  bpf.lbExternalClusterIP=true
Release "cilium" has been upgraded. Happy Helming!
NAME: cilium
LAST DEPLOYED: Sun Aug 17 02:53:00 2025
NAMESPACE: kube-system
STATUS: deployed
REVISION: 2
TEST SUITE: None
NOTES:
You have successfully installed Cilium with Hubble Relay and Hubble UI.

Your release version is 1.18.0.

For any further help, visit https://docs.cilium.io/en/v1.18/gettinghelp\

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system rollout restart ds/cilium
daemonset.apps/cilium restarted
bpf lb list shows that before bpf.lbExternalClusterIP=true is set the ClusterIP entry is flagged non-routable, and that the flag is gone once the setting is applied.
# Before applying the setting
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list | grep 10.96.252.129
10.96.252.129:80/TCP (0)        0.0.0.0:0 (11) (0) [ClusterIP, non-routable]
10.96.252.129:80/TCP (2)        172.20.1.80:80/TCP (11) (2)
10.96.252.129:80/TCP (1)        172.20.0.103:80/TCP (11) (1)
10.96.252.129:80/TCP (3)        172.20.1.81:80/TCP (11) (3)

# After applying the setting
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system ds/cilium -- cilium-dbg bpf lb list | grep 10.96.252.129
10.96.252.129:80/TCP (1)        172.20.0.103:80/TCP (12) (1)
10.96.252.129:80/TCP (2)        172.20.1.80:80/TCP (12) (2)
10.96.252.129:80/TCP (0)        0.0.0.0:0 (12) (0) [ClusterIP]
10.96.252.129:80/TCP (3)        172.20.1.81:80/TCP (12) (3)
Checking from the external router confirms that the service is now reachable from outside through the ClusterIP.
root@router:~# curl -s $LBIP
Hostname: webpod-697b545f57-hcqv2
IP: 127.0.0.1
IP: ::1
IP: 172.20.0.103
IP: fe80::7c3e:18ff:fe18:9940
RemoteAddr: 192.168.10.200:48558
GET / HTTP/1.1
Host: 10.96.252.129
User-Agent: curl/8.5.0
Accept: */*
As in the earlier test, scale the Pods down to 2 and check connectivity. The Pods run on k8s-ctr and k8s-w1.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl scale deployment/webpod --replicas=2
deployment.apps/webpod scaled

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          49m   172.20.0.23    k8s-ctr              
webpod-697b545f57-dm2hg   1/1     Running   0          49m   172.20.1.80    k8s-w1               
webpod-697b545f57-hcqv2   1/1     Running   0          49m   172.20.0.103   k8s-ctr              
As in the External Traffic Policy (Cluster) case above, for some requests packets are seen on both k8s-ctr and k8s-w1.
Capture packets on k8s-ctr to verify.
(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 -w /tmp/dsr.pcap
tcpdump: listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C139 packets captured
141 packets received by filter
0 packets dropped by kernel

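(1) The router forwards the request to any of the nodes that are Cilium BGP peers.
(2) Because the service has InternalTrafficPolicy: Cluster, the request is directed to one of the Pods, here the Pod on k8s-w1.
(3) The request enters through k8s-ctr and is forwarded to k8s-w1.
(4) The Pod on k8s-w1 handles the request and sends the response back to the node that performed NAT (k8s-ctr).
(5) That node looks up the connection state recorded when it NATed the inbound request, performs reverse NAT, and returns the final response.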
Cluster IP + Internal Traffic Policy (Local)
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl patch service webpod -p '{"spec":{"internalTrafficPolicy":"Local"}}'
After switching Internal Traffic Policy to Local, the BGP route on the router node changes so that only the nodes hosting a Pod remain as next hops.
root@router:~# ip -c route
...
10.96.122.24 nhid 77 proto bgp metric 20
    nexthop via 192.168.10.101 dev eth1 weight 1
    nexthop via 192.168.10.100 dev eth1 weight 1
With webpod running on k8s-ctr and k8s-w1, call the ClusterIP (stored in the LBIP variable).
(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w0:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'
root@k8s-w1:~# tcpdump -i eth1 -A -s 0 -nn 'tcp port 80'

root@router:~# LBIP=10.96.252.129
root@router:~# curl -s $LBIP
How it works
1) When the service has Internal Traffic Policy = Local, only the nodes hosting a Pod advertise the ClusterIP.
func (r *ServiceReconciler) clusterIPDesiredRoutes(svc *slim_corev1.Service, ls localServices) []netip.Prefix {
    var desiredRoutes []netip.Prefix
    // When InternalTrafficPolicy == Local and the node hosts no endpoint Pod, desiredRoutes is returned empty.
    // In other words, with InternalTrafficPolicy == Local only nodes hosting an endpoint Pod advertise themselves.
    if svc.Spec.InternalTrafficPolicy != nil && *svc.Spec.InternalTrafficPolicy == slim_corev1.ServiceInternalTrafficPolicyLocal &&
        !hasLocalEndpoints(svc, ls) {
        return desiredRoutes
    }

    // Validate the ClusterIP.
    if svc.Spec.ClusterIP == "" || len(svc.Spec.ClusterIPs) == 0 || svc.Spec.ClusterIP == corev1.ClusterIPNone {
        return desiredRoutes
    }
    ips := sets.New[string]()
    if svc.Spec.ClusterIP != "" {
        ips.Insert(svc.Spec.ClusterIP)
    }

    // Collect the ClusterIPs so they can be added to desiredRoutes for advertisement.
    for _, clusterIP := range svc.Spec.ClusterIPs {
        if clusterIP == "" || clusterIP == corev1.ClusterIPNone {
            continue
        }
        ips.Insert(clusterIP)
    }
...
    return desiredRoutes
}
2) Only local Pods are registered as backends in the BPF map.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bpf lb list | grep 10.96.122.24
10.96.122.24:80/TCP (1)          172.20.0.103:80/TCP (12) (1)
10.96.122.24:80/TCP (0)          0.0.0.0:0 (12) (0) [ClusterIP, InternalLocal]

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium-dbg bpf ipcache list | grep 14076
172.20.0.103/32     identity=14076 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.1.80/32      identity=14076 encryptkey=0 tunnelendpoint=192.168.10.101 flags=hastunnel
3) When the ClusterIP is called from the router, all traffic goes to the Pod on k8s-ctr.

Likely cause
In bpf ipcache list, a Pod IP appears to be treated as local only when tunnelendpoint=0.0.0.0, so only the Pod on k8s-ctr seems to be recognized as a backend.

Capture packets on k8s-ctr to verify.
(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 -w /tmp/dsr2.pcap
tcpdump: listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C139 packets captured
141 packets received by filter
0 packets dropped by kernel

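(1) The router forwards the request to a Cilium BGP peer node (only nodes hosting a Pod).
(2) For the InternalTrafficPolicy: Local service, the only valid backend is the Pod on k8s-ctr.
(3) The Pod on that node handles the request and returns the final response.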



Cilium - Networking - Pod-to-Pod Communication (2) Encapsulation, Service LB-IPAM, L2 Announcement
Sat, 09 Aug 2025 10:54:02 GMT
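This hands-on creates a situation in a Native Routing environment where communication fails because of incorrect routing, and then fixes it by changing the routing configuration.
Lab environment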

Kubernetes cluster nodes

k8s-ctr(IP: 192.168.10.100, podCIDR: 172.20.0.98)
k8s-w1(IP: 192.168.10.101, podCIDR: 172.20.1.60)
k8s-w0(IP: 192.168.20.100, podCIDR: 172.20.2.159)

Router node

router(IP: 192.168.10.200)

Initial state
autoDirectNodeRoutes
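In a Native Routing Mode cluster with autoDirectNodeRoutes=true, nodes on the same network segment automatically add each other's podCIDR to their routing tables.
However, the k8s-w0 node sits on a different network segment, so its podCIDR is not added to the other nodes' routing tables automatically.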
k8s-ctr
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.0.0/24 via 172.20.0.98 dev cilium_host proto kernel src 172.20.0.98
172.20.0.98 dev cilium_host proto kernel scope link
172.20.1.0/24 via 192.168.10.101 dev eth1 proto kernel <- added by autoDirectNodeRoutes
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100

k8s-w1
root@k8s-w1:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.0.0/24 via 192.168.10.100 dev eth1 proto kernel <- added by autoDirectNodeRoutes
172.20.1.0/24 via 172.20.1.60 dev cilium_host proto kernel src 172.20.1.60
172.20.1.60 dev cilium_host proto kernel scope link
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.101

k8s-w0
root@k8s-w0:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
172.20.2.0/24 via 172.20.2.159 dev cilium_host proto kernel src 172.20.2.159
172.20.2.159 dev cilium_host proto kernel scope link
192.168.20.0/24 dev eth1 proto kernel scope link src 192.168.20.100
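* No routes for k8s-ctr or k8s-w1
Checking Pod communication
When the webpod Pods deployed on k8s-w1 and k8s-w0 are called, only some requests get a response and the rest time out.
In particular, pinging the IP of the webpod running on k8s-w0 directly results in 100% packet loss.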
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumnode -o json | jq '.items[].spec.addresses'
[
  {
    "ip": "192.168.10.100",
    "type": "InternalIP"
  },
  {
    "ip": "172.20.0.98",
    "type": "CiliumInternalIP"
  }
]
[
  {
    "ip": "192.168.20.100",
    "type": "InternalIP"
  },
  {
    "ip": "172.20.2.159",
    "type": "CiliumInternalIP"
  }
]
[
  {
    "ip": "192.168.10.101",
    "type": "InternalIP"
  },
  {
    "ip": "172.20.1.60",
    "type": "CiliumInternalIP"
  }
]

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS      AGE     IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   1 (20m ago)   6h14m   172.20.0.201   k8s-ctr              
webpod-696774f575-5nvlc   1/1     Running   1 (18m ago)   6h13m   172.20.2.50    k8s-w0               
webpod-696774f575-dsblb   1/1     Running   1 (20m ago)   6h12m   172.20.0.130   k8s-ctr              
webpod-696774f575-wssxg   1/1     Running   1 (19m ago)   6h12m   172.20.1.184   k8s-w1               

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- sh -c 'while true; do curl -s --connect-timeout 1 webpod | grep Hostname; echo "---" ; sleep 1; done'
Hostname: webpod-696774f575-wssxg
---
---
Hostname: webpod-696774f575-dsblb
---
---
Hostname: webpod-696774f575-dsblb
---
---
Hostname: webpod-696774f575-dsblb
---
---
---

(⎈|HomeLab:N/A) root@k8s-ctr:~# export WEBPOD=$(kubectl get pod -l app=webpod --field-selector spec.nodeName=k8s-w0 -o jsonpath='{.items[0].status.podIP}')
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- ping -c 2 -w 1 -W 1 $WEBPOD
PING 172.20.2.50 (172.20.2.50) 56(84) bytes of data.

--- 172.20.2.50 ping statistics ---
1 packets transmitted, 0 received, 100% packet loss, time 0ms
Checking the packet flow with tcpdump on the router node shows the request arriving on eth1 (internal network) but leaving again through eth0 (external network).
root@router:~# tcpdump -i any icmp -nn
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
18:11:47.232880 eth1  In  IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 7, seq 1, length 64
18:11:47.232892 eth0  Out IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 7, seq 1, length 64
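The router's ip route output shows no route toward the cluster Pod CIDR ranges.
In other words, traffic destined for the Pod on k8s-w0 cannot find the correct internal path, escapes to the external network, and the communication fails.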
root@router:~# ip route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.10.1.0/24 dev loop1 proto kernel scope link src 10.10.1.200
10.10.2.0/24 dev loop2 proto kernel scope link src 10.10.2.200
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.200
192.168.20.0/24 dev eth2 proto kernel scope link src 192.168.20.200
Fixing Pod communication
Manually add routing rules so that traffic destined for each node's Pod CIDR is forwarded to that node.
root@router:~# ip route add 172.20.1.0/24 via 192.168.10.101
root@router:~# ip route add 172.20.0.0/24 via 192.168.10.100
root@router:~# ip route add 172.20.2.0/24 via 192.168.20.100

root@router:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.10.1.0/24 dev loop1 proto kernel scope link src 10.10.1.200
10.10.2.0/24 dev loop2 proto kernel scope link src 10.10.2.200
172.20.0.0/24 via 192.168.10.100 dev eth1
172.20.1.0/24 via 192.168.10.101 dev eth1
172.20.2.0/24 via 192.168.20.100 dev eth2
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.200
192.168.20.0/24 dev eth2 proto kernel scope link src 192.168.20.200
After the routing rules are added, communication with the Pod on k8s-w0 works normally.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- ping -c 2 -w 1 -W 1 $WEBPOD
PING 172.20.2.54 (172.20.2.54) 56(84) bytes of data.
64 bytes from 172.20.2.54: icmp_seq=1 ttl=61 time=0.958 ms

--- 172.20.2.54 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.958/0.958/0.958/0.000 ms
command terminated with exit code 1

root@router:~# tcpdump -i any icmp -nn
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
18:20:48.457587 eth1  In  IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 19, seq 1, length 64
18:20:48.457605 eth2  Out IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 19, seq 1, length 64
18:20:48.458117 eth2  In  IP 172.20.2.54 > 172.20.0.163: ICMP echo reply, id 19, seq 1, length 64
18:20:48.458118 eth1  Out IP 172.20.2.54 > 172.20.0.163: ICMP echo reply, id 19, seq 1, length 64
^C
4 packets captured
4 packets received by filter
0 packets dropped by kernel
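This lab has only a few nodes, so adding routes by hand is not a problem.
Once the cluster grows past 100 nodes and each node's Pod CIDR can change, however, maintaining manual routes for every node is practically impossible.
Dynamic routing via BGP or an Overlay network is used to solve this problem.
This post looks at how communication works over an Overlay network.
Overlay network
Cilium provides encapsulation-based networking by default. It automatically builds UDP-based tunnels (VXLAN or Geneve) between nodes and encapsulates Pod traffic, so as long as the physical nodes can reach each other over IP, no per-node Pod CIDR routes have to be configured manually.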
VXLAN

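VXLAN is an L2-over-L3 overlay protocol with a fixed 8-byte header.
It encapsulates the inner L2 frame in an outer UDP packet and is widely used in data center and cloud environments.
VXLAN operates through network endpoints called VTEPs (VXLAN Tunnel Endpoints), which connect the overlay network to the physical network.
1) Ingress VTEP (sender side)
When a packet enters the overlay network, the Ingress VTEP performs the following steps.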

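Receive the original L2 frame
Place the original L2 frame in the payload of a UDP packet and add the headers
VXLAN header (8 bytes): identifies the target overlay segment by VNI (network ID)
Outer UDP/IP header: destination set to the IP address of the Egress VTEP
Outer Ethernet header: destination set to the next-hop MAC address on the physical network
Transmit onto the physical network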

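The encapsulated packet is then routed across an ordinary L3 network.
2) Egress VTEP (receiver side)
When the packet arrives at the other end of the tunnel, the Egress VTEP performs the following steps.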

Strip the Outer Ethernet, Outer IP, Outer UDP, and VXLAN headers in order
Restore the original L2 frame
Deliver it to the destination
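
The VXLAN-specific part of the steps above can be sketched in a few lines. This is a minimal illustration that builds and strips only the 8-byte VXLAN header (layout per RFC 7348); the outer Ethernet/IP/UDP headers are left out, and `encapVXLAN` and `decapVXLAN` are hypothetical helper names, not part of any library.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

const vxlanHeaderLen = 8

// encapVXLAN prepends the 8-byte VXLAN header to an inner L2 frame.
// Layout: flags(1) | reserved(3) | VNI(3) | reserved(1).
func encapVXLAN(vni uint32, innerFrame []byte) []byte {
	hdr := make([]byte, vxlanHeaderLen, vxlanHeaderLen+len(innerFrame))
	hdr[0] = 0x08 // I flag: the VNI field is valid
	// The VNI is 24 bits wide; the lowest byte of the word is reserved.
	binary.BigEndian.PutUint32(hdr[4:8], (vni&0xFFFFFF)<<8)
	return append(hdr, innerFrame...)
}

// decapVXLAN strips the header and returns the VNI and the original frame.
func decapVXLAN(pkt []byte) (uint32, []byte, error) {
	if len(pkt) < vxlanHeaderLen || pkt[0]&0x08 == 0 {
		return 0, nil, fmt.Errorf("not a valid VXLAN payload")
	}
	vni := binary.BigEndian.Uint32(pkt[4:8]) >> 8
	return vni, pkt[vxlanHeaderLen:], nil
}

func main() {
	inner := []byte("inner-ethernet-frame") // placeholder bytes, not a real frame
	pkt := encapVXLAN(42, inner)

	vni, frame, err := decapVXLAN(pkt)
	fmt.Println(vni, string(frame), err) // 42 inner-ethernet-frame <nil>
}
```

In a real datapath the result of `encapVXLAN` would become the payload of a UDP datagram addressed to the Egress VTEP on port 4789.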

GENEVE
Geneve is a newer encapsulation protocol (an IETF standard) that improves on VXLAN's extensibility and flexibility.

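(Reference: https://tetrate.io/blog/using-geneve-tunnels-to-implement-istio-ambient-mesh-traffic-interception)
The biggest differences between VXLAN and GENEVE are the port they use (VXLAN: 4789/UDP, GENEVE: 6081/UDP) and the header structure.
Because the VXLAN header is fixed, it cannot carry additional metadata beyond the encapsulation information (policy, QoS, security tags, traffic analysis, and so on).
GENEVE addresses this limitation of VXLAN with TLV-based extensibility.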
TLV(Type-Length-Value)
A TLV is the basic unit of the Geneve options field, and several of them can be chained one after another.
This allows new network functions to be added flexibly as needed, like assembling blocks.



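| Field | Description |
| --- | --- |
| Type | Identifies the kind of option (for example security tag, QoS, policy ID) |
| Length | Length of the data (in 4-byte units) |
| Value | The actual metadata value |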

Advantages of the TLV structure

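Flexibility: options are added only when needed, which avoids unnecessary overhead
Extensibility: a new feature can be added by defining a new Type, without revising the protocol
Compatibility: devices that do not understand a given TLV can still process the base packet
Metadata variety: data for policy, security, traffic analysis, fault tracing, and other purposes can be embedded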
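
To make the TLV layout concrete, here is a small sketch that encodes and parses Geneve options following the option format in RFC 8926 (16-bit option class, 8-bit type, 5-bit length in 4-byte words, then the data). The `geneveOption` type and its helpers are hypothetical, and the class and type values in `main` are made up for illustration.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// geneveOption is one TLV in the Geneve options area.
type geneveOption struct {
	Class uint16
	Type  uint8
	Data  []byte // must be a multiple of 4 bytes, at most 124 bytes
}

// marshal encodes: class(2) | type(1) | reserved(3 bits)+length(5 bits) | data.
func (o geneveOption) marshal() ([]byte, error) {
	if len(o.Data)%4 != 0 || len(o.Data) > 124 {
		return nil, errors.New("option data must be a multiple of 4 bytes, max 124")
	}
	buf := make([]byte, 4+len(o.Data))
	binary.BigEndian.PutUint16(buf[0:2], o.Class)
	buf[2] = o.Type
	buf[3] = uint8(len(o.Data) / 4) // length in 4-byte words, option header excluded
	copy(buf[4:], o.Data)
	return buf, nil
}

// parseOptions walks a byte slice of concatenated TLVs.
func parseOptions(b []byte) ([]geneveOption, error) {
	var opts []geneveOption
	for len(b) > 0 {
		if len(b) < 4 {
			return nil, errors.New("truncated option header")
		}
		n := int(b[3]&0x1f) * 4
		if len(b) < 4+n {
			return nil, errors.New("truncated option data")
		}
		opts = append(opts, geneveOption{
			Class: binary.BigEndian.Uint16(b[0:2]),
			Type:  b[2],
			Data:  b[4 : 4+n],
		})
		b = b[4+n:]
	}
	return opts, nil
}

func main() {
	// Two chained options; the second one carries no data at all.
	a, _ := geneveOption{Class: 0x0101, Type: 1, Data: []byte{0, 0x12, 0x04, 0}}.marshal()
	b, _ := geneveOption{Class: 0x0102, Type: 2}.marshal()

	opts, err := parseOptions(append(a, b...))
	fmt.Println(len(opts), err) // 2 <nil>
}
```

Because every option states its own length, a parser can skip over options whose class or type it does not recognize, which is what makes the chained format extensible.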

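TLV option example: Service Chaining
Service chaining is one example of how TLV options can be used.
Service chaining makes a packet pass through several network functions, such as a firewall, IDS, load balancer, or WAN optimizer, in a predefined order as it crosses the network.
VXLAN carries only a VNI, so it cannot express the order in which services should be traversed. With Geneve TLV options, the packet can carry the path information (SPID) and the current step (SI), which makes service chaining possible.
Example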
Option Class: 0x0101 (Service Chaining)
Type: Path + Index
Length: 8 bytes
Option Data:
    SPID = 0x0012
    SI   = 0x04
SPID (Service Path ID)

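A number that identifies which service path this is
Example: 0x0012 → firewall → IDS → WAN → LB

SI (Service Index)

A counter that shows how far along the service path the packet is
It starts at the full path length and decreases by 1 each time a service is visited

How it works

Check the SPID → look up the predefined path (service chain)
Read the SI value to determine the current step
Perform the service function, then decrease SI by 1
When SI reaches 0, the chain ends and the packet is delivered to its final destination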
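
The four steps above can be simulated with a short Python sketch. The path table, its contents, and the function name are hypothetical; the only point is to show SI starting at the chain length and counting down to 0.

```python
# Hypothetical path table: SPID -> ordered list of service functions.
SERVICE_PATHS = {0x0012: ["firewall", "ids", "wan-optimizer", "load-balancer"]}


def walk_service_chain(spid: int) -> list[str]:
    """Return the service functions visited, in order, for a given SPID."""
    path = SERVICE_PATHS[spid]   # step 1: SPID selects the predefined chain
    si = len(path)               # SI starts at the full chain length
    visited = []
    while si > 0:                # step 4: SI == 0 means the chain is finished
        current = path[len(path) - si]  # step 2: SI tells us which hop is next
        visited.append(current)         # step 3: "run" the service function...
        si -= 1                         # ...then decrement SI by one
    return visited
```

For SPID 0x0012 this visits the firewall, IDS, WAN optimizer, and load balancer in that order before the packet would be handed to its final destination.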

MTU 1450
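When the physical network uses the standard 1500-byte Ethernet MTU, the MTU for VXLAN and Geneve is typically set to 1450.
As the VXLAN and GENEVE packet diagrams above show, VXLAN adds a fixed 50 bytes of headers on top of the original L2 frame, and GENEVE likewise adds at least 50 bytes.
The MTU is set to 1450 to account for the outer IP/UDP/protocol headers added by encapsulation, so that packets are not fragmented.
With GENEVE the header can grow depending on the TLV options, so the options must be taken into account when choosing the MTU.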
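
The 1450 figure can be checked with simple arithmetic. The sketch below is one common way to account for the 50 bytes, assuming an IPv4 underlay: the outer IP, UDP, and tunnel headers plus the inner Ethernet header all have to fit inside the underlay MTU. The constant and function names are made up for this example.

```python
OUTER_IPV4 = 20          # outer IPv4 header without options
OUTER_UDP = 8            # outer UDP header
TUNNEL_BASE_HEADER = 8   # VXLAN header, or Geneve header without options
INNER_ETHERNET = 14      # Ethernet header of the encapsulated frame


def overlay_mtu(underlay_mtu: int, geneve_option_bytes: int = 0) -> int:
    """MTU left for the inner IP packet after encapsulation overhead."""
    overhead = (OUTER_IPV4 + OUTER_UDP + TUNNEL_BASE_HEADER
                + geneve_option_bytes + INNER_ETHERNET)
    return underlay_mtu - overhead
```

`overlay_mtu(1500)` gives 1450, and every 4 bytes of Geneve options lowers the usable MTU by the same amount, for example `overlay_mtu(1500, 8)` gives 1442.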
The previous post covered a hands-on test over VXLAN, so this time the test uses GENEVE.
Verifying GENEVE communication
GENEVE configuration
# Check kernel modules
(⎈|HomeLab:N/A) root@k8s-ctr:~# grep -E 'CONFIG_VXLAN=y|CONFIG_VXLAN=m|CONFIG_GENEVE=y|CONFIG_GENEVE=m|CONFIG_FIB_RULES=y' /boot/config-$(uname -r)
CONFIG_FIB_RULES=y
CONFIG_VXLAN=m
CONFIG_GENEVE=m

(⎈|HomeLab:N/A) root@k8s-ctr:~# lsmod | grep -E 'vxlan|geneve'
(⎈|HomeLab:N/A) root@k8s-ctr:~# modprobe geneve
(⎈|HomeLab:N/A) root@k8s-ctr:~# lsmod | grep -E 'vxlan|geneve'
geneve                 45056  0
ip6_udp_tunnel         16384  1 geneve
udp_tunnel             36864  1 geneve

(⎈|HomeLab:N/A) root@k8s-ctr:~# helm upgrade cilium cilium/cilium --namespace kube-system --version 1.18.0 --reuse-values --set routingMode=tunnel --set tunnelProtocol=geneve --set autoDirectNodeRoutes=false --set installNoConntrackIptablesRules=false

# Check the Cilium configuration
(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium features status | grep datapath_network
Yes      cilium_feature_datapath_network                                         mode=overlay-geneve                               1        1       1

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -- cilium status | grep ^Routing
Routing:                 Network: Tunnel [geneve]   Host: BPF

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep tunnel
routing-mode                                      tunnel
tunnel-protocol                                   geneve
tunnel-source-port-range                          0-0

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c addr show cilium_geneve
26: cilium_geneve:  mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 0e:73:b0:50:47:a6 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::c73:b0ff:fe50:47a6/64 scope link
       valid_lft forever preferred_lft forever
Checking the GENEVE configuration
The cluster nodes sit on different subnets, yet every node has a route registered for each node's Pod CIDR.
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route | grep cilium_host
172.20.0.0/24 via 172.20.0.74 dev cilium_host proto kernel src 172.20.0.74 <- Pod CIDR
172.20.0.74 dev cilium_host proto kernel scope link
172.20.1.0/24 via 172.20.0.74 dev cilium_host proto kernel src 172.20.0.74 mtu 1450 <- Pod CIDR
172.20.2.0/24 via 172.20.0.74 dev cilium_host proto kernel src 172.20.0.74 mtu 1450 <- Pod CIDR

root@k8s-w1:~# ip -c route | grep cilium_host
172.20.0.0/24 via 172.20.1.41 dev cilium_host proto kernel src 172.20.1.41 mtu 1450 <- Pod CIDR
172.20.1.0/24 via 172.20.1.41 dev cilium_host proto kernel src 172.20.1.41 <- Pod CIDR
172.20.1.41 dev cilium_host proto kernel scope link
172.20.2.0/24 via 172.20.1.41 dev cilium_host proto kernel src 172.20.1.41 mtu 1450 <- Pod CIDR

root@k8s-w0:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.10.0.0/16 via 192.168.20.200 dev eth1 proto static
172.20.0.0/24 via 172.20.2.204 dev cilium_host proto kernel src 172.20.2.204 mtu 1450 <- Pod CIDR
172.20.0.0/16 via 192.168.20.200 dev eth1 proto static
172.20.1.0/24 via 172.20.2.204 dev cilium_host proto kernel src 172.20.2.204 mtu 1450 <- Pod CIDR
172.20.2.0/24 via 172.20.2.204 dev cilium_host proto kernel src 172.20.2.204 <- Pod CIDR
172.20.2.204 dev cilium_host proto kernel scope link
192.168.10.0/24 via 192.168.20.200 dev eth1 proto static
192.168.20.0/24 dev eth1 proto kernel scope link src 192.168.20.100

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route get 172.20.1.10
172.20.1.10 dev cilium_host src 172.20.0.74 uid 0
    cache mtu 1450

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip route get 172.20.2.10
172.20.2.10 dev cilium_host src 172.20.0.74 uid 0
    cache mtu 1450
In the route entries, src is set to addresses such as 172.20.0.74, 172.20.1.41, and 172.20.2.204. These are the IPs that act as the router in each node's cilium-agent.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it $CILIUMPOD0 -n kube-system -c cilium-agent -- cilium status --all-addresses | grep router
  172.20.0.74 (router)
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it $CILIUMPOD1 -n kube-system -c cilium-agent -- cilium status --all-addresses | grep router
  172.20.1.41 (router)
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it $CILIUMPOD2 -n kube-system -c cilium-agent -- cilium status --all-addresses | grep router
  172.20.2.204 (router)
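Cilium creates an eBPF-based datapath and a virtual interface (cilium_host) on each node, and this cilium_host interface acts as the default router for traffic leaving the node toward other nodes or the outside.
The Cilium Router IP is the IP assigned to this cilium_host interface.
The router IP reported by cilium status and the IP on cilium_host are the same.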
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip addr show cilium_host
5: cilium_host@cilium_net:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether aa:71:74:33:f7:d6 brd ff:ff:ff:ff:ff:ff
    inet 172.20.0.74/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::a871:74ff:fe33:f7d6/64 scope link
       valid_lft forever preferred_lft forever

root@k8s-w1:~# ip addr show cilium_host
5: cilium_host@cilium_net:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 16:39:18:15:d2:cd brd ff:ff:ff:ff:ff:ff
    inet 172.20.1.41/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::1439:18ff:fe15:d2cd/64 scope link
       valid_lft forever preferred_lft forever

root@k8s-w0:~# ip addr show cilium_host
5: cilium_host@cilium_net:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether fa:b0:9f:c2:1b:84 brd ff:ff:ff:ff:ff:ff
    inet 172.20.2.204/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::f8b0:9fff:fec2:1b84/64 scope link
       valid_lft forever preferred_lft forever
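There is one Cilium Router IP per node, and every Pod on that node uses it as its gateway.
The bpf ipcache shows that Cilium sets a hastunnel flag on entries for Pods running on other nodes. When this flag is present, Cilium treats the destination as an encapsulation target and encapsulates the packet.
# Pods on other nodes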
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system $CILIUMPOD0 -- cilium-dbg bpf ipcache list | grep hastunnel
172.20.1.41/32      identity=6 encryptkey=0 tunnelendpoint=192.168.10.101 flags=hastunnel
172.20.1.73/32      identity=11761 encryptkey=0 tunnelendpoint=192.168.10.101 flags=hastunnel
172.20.2.54/32      identity=11761 encryptkey=0 tunnelendpoint=192.168.20.100 flags=hastunnel
172.20.2.204/32     identity=6 encryptkey=0 tunnelendpoint=192.168.20.100 flags=hastunnel
172.20.2.0/24       identity=2 encryptkey=0 tunnelendpoint=192.168.20.100 flags=hastunnel
172.20.1.0/24       identity=2 encryptkey=0 tunnelendpoint=192.168.10.101 flags=hastunnel

# Pods on the same node
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system $CILIUMPOD0 -- cilium-dbg bpf ipcache list | grep -v hastunnel
IP PREFIX/ADDRESS   IDENTITY
172.20.0.63/32      identity=11761 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.152/32     identity=22419 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.163/32     identity=29101 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.210/32     identity=4046 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.224/32     identity=57021 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
192.168.10.101/32   identity=6 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
192.168.20.100/32   identity=6 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
0.0.0.0/0           identity=2 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
10.0.2.15/32        identity=1 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.81/32      identity=35319 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.242/32     identity=57021 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.74/32      identity=1 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.77/32      identity=31483 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.25/32      identity=14322 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
172.20.0.186/32     identity=16348 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
192.168.10.100/32   identity=1 encryptkey=0 tunnelendpoint=0.0.0.0 flags=
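
The decision described above (encapsulate only when the ipcache entry carries hastunnel) can be mirrored with a small parser over the `cilium-dbg bpf ipcache list` output. This is a reading aid, not Cilium's datapath logic: the function names are hypothetical, and treating `flags=` as a comma-separated list is an assumption.

```python
def parse_ipcache_line(line: str) -> dict:
    """Split one ipcache line into its prefix and key=value fields."""
    prefix, *fields = line.split()
    entry = {"prefix": prefix}
    for field in fields:
        key, _, value = field.partition("=")
        entry[key] = value
    return entry


def needs_encapsulation(entry: dict) -> bool:
    """True when the entry is flagged as reachable through a tunnel."""
    # Assumption: multiple flags, if present, are comma-separated.
    return "hastunnel" in entry.get("flags", "").split(",")
```

For a remote Pod such as 172.20.1.73/32 the entry also gives the tunnel endpoint (192.168.10.101), which is the node IP used as the outer destination address.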
Verifying GENEVE communication
Let's confirm GENEVE communication with Pod-to-Pod traffic across different nodes.
GENEVE uses 6081/UDP as its default port.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- ping -c 2 -w 1 -W 1 $WEBPOD
PING 172.20.2.54 (172.20.2.54) 56(84) bytes of data.
64 bytes from 172.20.2.54: icmp_seq=1 ttl=63 time=0.896 ms
--- 172.20.2.54 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.896/0.896/0.896/0.000 ms
command terminated with exit code 1

root@router:~# tcpdump -i any udp port 6081 -nn
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
23:59:32.251108 eth1  In  IP 192.168.10.100.31322 > 192.168.20.100.6081: Geneve, Flags [none], vni 0x71ad: IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 37, seq 1, length 64
23:59:32.251125 eth2  Out IP 192.168.10.100.31322 > 192.168.20.100.6081: Geneve, Flags [none], vni 0x71ad: IP 172.20.0.163 > 172.20.2.54: ICMP echo request, id 37, seq 1, length 64
23:59:32.251597 eth2  In  IP 192.168.20.100.29257 > 192.168.10.100.6081: Geneve, Flags [none], vni 0x2df1: IP 172.20.2.54 > 172.20.0.163: ICMP echo reply, id 37, seq 1, length 64
23:59:32.251607 eth1  Out IP 192.168.20.100.29257 > 192.168.10.100.6081: Geneve, Flags [none], vni 0x2df1: IP 172.20.2.54 > 172.20.0.163: ICMP echo reply, id 37, seq 1, length 64
This confirms that Pod-range traffic is encapsulated and carried between nodes.
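
One detail in the capture is worth a closer look: the vni values are not arbitrary. In tunnel mode Cilium places the source security identity in the VNI field, so converting the hex values should give the identities already seen in the ipcache output (29101 for 172.20.0.163, 11761 for 172.20.2.54). A minimal check, using a hypothetical helper name:

```python
def vni_to_identity(vni_hex: str) -> int:
    """Convert the hex VNI printed by tcpdump into a decimal identity number."""
    return int(vni_hex, 16)


request_identity = vni_to_identity("0x71ad")  # echo request sent by 172.20.0.163
reply_identity = vni_to_identity("0x2df1")    # echo reply sent by 172.20.2.54
```

The request carries the identity of the sender (curl-pod) and the reply carries the identity of webpod, which lets the receiving node apply identity-based policy without looking up the inner source IP.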
As with the VXLAN setup covered in the previous post, monitoring with hubble shows the to-overlay message, which indicates that the packet is being encapsulated.
(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --protocol tcp --pod curl-pod
Aug  8 15:05:43.118: default/curl-pod:36548 (ID:29101) -> default/webpod-697b545f57-9c7vb:80 (ID:11761) to-endpoint FORWARDED (TCP Flags: SYN)
Aug  8 15:05:43.118: default/curl-pod:36548 (ID:29101) <- default/webpod-697b545f57-9c7vb:80 (ID:11761) to-overlay FORWARDED (TCP Flags: SYN, ACK)
Aug  8 15:05:43.119: default/curl-pod:36548 (ID:29101) -> default/webpod-697b545f57-9c7vb:80 (ID:11761) to-endpoint FORWARDED (TCP Flags: ACK)
Aug  8 15:05:43.119: default/curl-pod:36548 (ID:29101) -> default/webpod-697b545f57-9c7vb:80 (ID:11761) to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Aug  8 15:05:43.120: default/curl-pod:36548 (ID:29101) <> default/webpod-697b545f57-9c7vb (ID:11761) pre-xlate-rev TRACED (TCP)
Aug  8 15:05:43.120: default/curl-pod:36548 (ID:29101) <> default/webpod-697b545f57-9c7vb (ID:11761) pre-xlate-rev TRACED (TCP)
Aug  8 15:05:43.121: default/curl-pod:36548 (ID:29101) <> default/webpod-697b545f57-9c7vb (ID:11761) pre-xlate-rev TRACED (TCP)
Aug  8 15:05:43.121: default/curl-pod (ID:29101) <> 10.96.107.144:80 (world) pre-xlate-fwd TRACED (TCP)
...
Service LB-IPAM
LB IPAM is the Cilium feature that assigns IP addresses to Services of type LoadBalancer.
LB IPAM works together with features such as the Cilium BGP Control Plane and L2 Announcements / L2 Aware LB (beta): the BGP Control Plane advertises the addresses assigned by LB IPAM over BGP, while L2 Announcements / L2 Aware LB (beta) announces them on the local network.
Let's create a LoadBalancer IP pool in Cilium and have LB IPs assigned from that pool.
(⎈|HomeLab:N/A) root@k8s-ctr:~# cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2"  # v1.17 : cilium.io/v2alpha1
kind: CiliumLoadBalancerIPPool
metadata:
  name: "cilium-lb-ippool"
spec:
  blocks:
  - start: "192.168.10.211"
    stop:  "192.168.10.215"
EOF
ciliumloadbalancerippool.cilium.io/cilium-lb-ippool created

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl api-resources | grep -i CiliumLoadBalancerIPPool
ciliumloadbalancerippools           ippools,ippool,lbippool,lbippools   cilium.io/v2                      false        CiliumLoadBalancerIPPool

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ippools
NAME               DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-lb-ippool   false      False         5               74s
Changing the Service of the previously deployed webpod to a LoadBalancer Service makes LB IPAM assign an ExternalIP automatically.
The kubectl get ippools command shows the number of available IPs.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc webpod
NAME     TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
webpod   ClusterIP   10.96.107.144   <none>        80/TCP    26h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl patch svc webpod -p '{"spec":{"type":"LoadBalancer"}}'
service/webpod patched

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc webpod
NAME     TYPE           CLUSTER-IP      EXTERNAL-IP      PORT(S)        AGE
webpod   LoadBalancer   10.96.107.144   192.168.10.211   80:31968/TCP   26h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ippools
NAME               DISABLED   CONFLICTING   IPS AVAILABLE   AGE
cilium-lb-ippool   false      False         4               3m11s
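
The IPS AVAILABLE column follows directly from the start/stop block in the pool. The sketch below reproduces the count with Python's ipaddress module; the function names are hypothetical and it only models a single start/stop block, not how LB IPAM tracks allocations internally.

```python
import ipaddress


def pool_addresses(start: str, stop: str) -> list[ipaddress.IPv4Address]:
    """Every address in an inclusive start/stop block."""
    first = int(ipaddress.IPv4Address(start))
    last = int(ipaddress.IPv4Address(stop))
    return [ipaddress.IPv4Address(n) for n in range(first, last + 1)]


def available_count(start: str, stop: str, allocated: set[str]) -> int:
    """Addresses in the block that are not yet handed out to a Service."""
    pool = {str(ip) for ip in pool_addresses(start, stop)}
    return len(pool - allocated)
```

With the block 192.168.10.211 to 192.168.10.215 this gives 5 addresses before the patch and 4 once 192.168.10.211 has been assigned to webpod.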
Communication from a Kubernetes node
From a Kubernetes node, communication with the Pods through the LoadBalancer's ExternalIP works normally.
(⎈|HomeLab:N/A) root@k8s-ctr:~# LBIP=$(kubectl get svc webpod -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

(⎈|HomeLab:N/A) root@k8s-ctr:~# curl -s $LBIP
Hostname: webpod-697b545f57-9c7vb
IP: 127.0.0.1
IP: ::1
IP: 172.20.2.54
IP: fe80::a8cc:5fff:fee9:7bf7
RemoteAddr: 172.20.0.74:52054
GET / HTTP/1.1
Host: 192.168.10.211
User-Agent: curl/8.5.0
Accept: */*

(⎈|HomeLab:N/A) root@k8s-ctr:~# for i in {1..100};  do kubectl exec -it curl-pod -- curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr

     36 Hostname: webpod-697b545f57-9r7tp
     33 Hostname: webpod-697b545f57-9c7vb
     31 Hostname: webpod-697b545f57-wls7t
External communication
In this lab, trying to reach the LoadBalancer ExternalIP from the Router node, which sits outside the cluster, does not work.
root@router:~# LBIP=192.168.10.211

root@router:~# curl --connect-timeout 1 $LBIP
curl: (28) Failed to connect to 192.168.10.211 port 80 after 1001 ms: Timeout was reached

root@router:~# arping -i eth1 $LBIP -c 1
ARPING 192.168.10.211
Timeout

--- 192.168.10.211 statistics ---
1 packets transmitted, 0 packets received, 100% unanswered (0 extra)
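The ExternalIP of a LoadBalancer Service is not an address that any cluster node actually owns on an interface, so by default no node answers for it.
The ARP ping sent from outside gets no reply because the ExternalIP is not assigned to any node interface: no node answers the ARP request, so the external router cannot learn a MAC address for it.
For the ExternalIP to be reachable, something must tell the external network that a particular cluster node owns the LoadBalancer IP. L2 Announcement is a common way to do this.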
Cilium L2 Announcement
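Cilium L2 Announcement lets a particular node in the cluster announce a LoadBalancer IP to the external network as if the node owned it.
When that node receives an ARP request for the LoadBalancer IP, it answers the request itself, so the external router can learn the MAC address associated with the LoadBalancer IP.
As a result, packets sent to the LoadBalancer IP from the external network reach the right node and communication becomes possible.
Let's enable L2 Announcement in Cilium and observe how the state changes.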
# router
root@router:~# arping -i eth1 $LBIP -c 100000
ARPING 192.168.10.211
Timeout
Timeout
Timeout
Timeout
Timeout

(⎈|HomeLab:N/A) root@k8s-ctr:~# helm upgrade cilium cilium/cilium --namespace kube-system --version 1.18.0 --reuse-values \
   --set l2announcements.enabled=true --set l2NeighDiscovery.enabled=true && watch -d kubectl get pod -A

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl rollout restart -n kube-system ds/cilium
daemonset.apps/cilium restarted

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg config --all | grep EnableL2Announcements
EnableL2Announcements             : true

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep enable-l2
enable-l2-announcements                           true
enable-l2-neigh-discovery                         true

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip neigh show | grep -i reachable
192.168.10.101 dev eth1 lladdr 08:00:27:af:e6:df managed extern_learn REACHABLE
10.0.2.2 dev eth0 lladdr 52:55:0a:00:02:02 managed extern_learn REACHABLE
192.168.10.200 dev eth1 lladdr 08:00:27:0f:bd:dd managed extern_learn REACHABLE
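L2 ARP mode policy
A CiliumL2AnnouncementPolicy specifies which Services and Nodes take part in ARP announcement.
One constraint applies: a LoadBalancer IP can only be announced by nodes on the same network range as the LoadBalancer IP pool.
For this reason the policy must exclude k8s-w0, which sits on a different network range in this lab.
If k8s-w0 is allowed to become the leader, the announcement fails during leader selection.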
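
A quick way to see which nodes are eligible is to test whether the LoadBalancer IP falls inside each node's interface subnet. The sketch below uses the node addresses that appear in this lab's output; the /24 prefix for k8s-w1 and the function name are assumptions.

```python
import ipaddress

# eth1 addresses from this lab (the prefix length for k8s-w1 is assumed to be /24).
NODE_INTERFACES = {
    "k8s-ctr": "192.168.10.100/24",
    "k8s-w1": "192.168.10.101/24",
    "k8s-w0": "192.168.20.100/24",
}


def nodes_on_same_subnet(lb_ip: str, node_interfaces: dict[str, str]) -> list[str]:
    """Nodes whose interface subnet contains the LoadBalancer IP."""
    target = ipaddress.ip_address(lb_ip)
    return [
        name
        for name, cidr in node_interfaces.items()
        if target in ipaddress.ip_interface(cidr).network
    ]
```

For 192.168.10.211 only k8s-ctr and k8s-w1 qualify, which is why the nodeSelector in the policy uses NotIn to exclude k8s-w0.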
cat << EOF | kubectl apply -f -
apiVersion: "cilium.io/v2alpha1"  # not v2
kind: CiliumL2AnnouncementPolicy
metadata:
  name: policy1
spec:
  serviceSelector:
    matchLabels:
      app: webpod
  nodeSelector:
    matchExpressions:
      - key: kubernetes.io/hostname
        operator: NotIn
        values:
          - k8s-w0
  interfaces:
  - ^eth[1-9]+
  externalIPs: true
  loadBalancerIPs: true
EOF
Checking leader election and status
Right after the policy is applied, Cilium creates a Lease resource to elect the leader node that will handle ARP announcement.
In this lab, k8s-w1 was elected leader.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-default-webpod       k8s-w1                                                                      4s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease/cilium-l2announce-default-webpod -o yaml | yq
{
  "apiVersion": "coordination.k8s.io/v1",
  "kind": "Lease",
  "metadata": {
    "creationTimestamp": "2025-08-09T12:35:25Z",
    "name": "cilium-l2announce-default-webpod",
    "namespace": "kube-system",
    "resourceVersion": "76902",
    "uid": "98cf5f36-f611-47b4-aad6-de3f83312fc9"
  },
  "spec": {
    "acquireTime": "2025-08-09T12:35:25.085954Z",
    "holderIdentity": "k8s-w1",
    "leaseDurationSeconds": 15,
    "leaseTransitions": 0,
    "renewTime": "2025-08-09T12:35:45.144393Z"
  }
}

(⎈|HomeLab:N/A) root@k8s-ctr:~# export CILIUMPOD1=$(kubectl get -l k8s-app=cilium pods -n kube-system --field-selector spec.nodeName=k8s-w1  -o jsonpath='{.items[0].metadata.name}')
Query the leader node's Cilium agent for the interface and IP on which ARP is currently being announced.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system $CILIUMPOD1 -- cilium-dbg shell -- db/show l2-announce
IP               NetworkInterface
192.168.10.211   eth1
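From k8s-ctr (not the leader), an ARP request for the external LoadBalancer IP now gets a reply, and the replying MAC address (08:00:27:af:e6:df) belongs to the leader node k8s-w1.
HTTP requests also succeed.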
(⎈|HomeLab:N/A) root@k8s-ctr:~# arping -i eth1 $LBIP -c 1000
ARPING 192.168.10.211
60 bytes from 08:00:27:af:e6:df (192.168.10.211): index=0 time=362.306 usec
60 bytes from 08:00:27:af:e6:df (192.168.10.211): index=1 time=330.916 usec
60 bytes from 08:00:27:af:e6:df (192.168.10.211): index=2 time=260.588 usec
60 bytes from 08:00:27:af:e6:df (192.168.10.211): index=3 time=255.961 usec
^C
--- 192.168.10.211 statistics ---
4 packets transmitted, 4 packets received,   0% unanswered (0 extra)
rtt min/avg/max/std-dev = 0.256/0.302/0.362/0.046 ms

(⎈|HomeLab:N/A) root@k8s-ctr:~# curl --connect-timeout 1 $LBIP
Hostname: webpod-697b545f57-wls7t
IP: 127.0.0.1
IP: ::1
IP: 172.20.1.73
IP: fe80::a076:2dff:fe73:24ac
RemoteAddr: 172.20.0.74:46438
GET / HTTP/1.1
Host: 192.168.10.211
User-Agent: curl/8.5.0
Accept: */*
The ARP table on k8s-ctr shows that the MAC address answering for the LoadBalancer IP matches the eth1 MAC address of k8s-w1, the current leader.
On the Router node, the ARP table maps the LoadBalancer IP (192.168.10.211) to the leader node's MAC address.
(⎈|HomeLab:N/A) root@k8s-ctr:~# arp -a | grep '08:00:27:af:e6:df'
k8s-w1 (192.168.10.101) at 08:00:27:af:e6:df [ether] on eth1

root@router:~# arp -a
? (192.168.10.101) at 08:00:27:af:e6:df [ether] on eth1
? (192.168.20.100) at 08:00:27:3a:be:fa [ether] on eth2
? (192.168.10.100) at 08:00:27:04:d1:9c [ether] on eth1
? (192.168.10.211) at 08:00:27:af:e6:df [ether] on eth1
? (10.0.2.3) at 52:55:0a:00:02:03 [ether] on eth0
_gateway (10.0.2.2) at 52:55:0a:00:02:02 [ether] on eth0
Sending many requests to the LoadBalancer IP shows that traffic is distributed across all Pods as expected.
(⎈|HomeLab:N/A) root@k8s-ctr:~# for i in {1..100};  do kubectl exec -it curl-pod -- curl -s $LBIP | grep Hostname; done | sort | uniq -c | sort -nr
     43 Hostname: webpod-697b545f57-wls7t
     29 Hostname: webpod-697b545f57-9r7tp
     28 Hostname: webpod-697b545f57-9c7vb
Failover
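The Cilium agent on the leader node renews the Lease periodically to assert leadership. If the leader stops working properly, for example because of a network failure, the renewals fail.
Once leaseDurationSeconds passes without a renewal, the Lease is treated as expired.
When the Lease expires, the Cilium agents on the other nodes ask the API server for the Lease.
The Kubernetes API server accepts the first of those requests, and the node that wins becomes the new leader.
The new leader's Cilium agent immediately starts sending ARP replies for the LoadBalancer IPs it now owns.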
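
The renew/expire/take-over cycle can be modelled with a tiny in-memory Lease. This is a simplified sketch, not the Kubernetes or Cilium implementation: the class and method names are hypothetical, time is passed in explicitly, and the 15-second duration is the leaseDurationSeconds value from the Lease shown earlier.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0  # leaseDurationSeconds

    def expired(self, now: float) -> bool:
        """A lease with no holder, or one not renewed within `duration`, is free."""
        return self.holder is None or now - self.renew_time > self.duration

    def try_acquire(self, node: str, now: float) -> bool:
        """Renew if we already hold the lease; take over only if it has expired."""
        if self.holder == node or self.expired(now):
            self.holder = node
            self.renew_time = now
            return True
        return False
```

While k8s-w1 keeps renewing, requests from other nodes are rejected. Once more than 15 seconds pass without a renewal, the first node to call `try_acquire` becomes the holder and later requests are rejected again.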
Let's verify this in the lab.
# ARP state before failover: leader node = k8s-w1
root@router:~# arp -a
? (192.168.10.101) at 08:00:27:af:e6:df [ether] on eth1
? (192.168.20.100) at 08:00:27:3a:be:fa [ether] on eth2
? (192.168.10.100) at 08:00:27:04:d1:9c [ether] on eth1
? (192.168.10.211) at 08:00:27:af:e6:df [ether] on eth1
? (10.0.2.3) at 52:55:0a:00:02:03 [ether] on eth0
_gateway (10.0.2.2) at 52:55:0a:00:02:02 [ether] on eth0

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-default-webpod       k8s-w1                                                                      11m

# Trigger a failure by rebooting k8s-w1
k8s-w1 reboot

# Confirm that a new leader node is elected
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-default-webpod                                                                                   11m
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl -n kube-system get lease | grep "cilium-l2announce"
cilium-l2announce-default-webpod       k8s-ctr

# Confirm ARP replies after the new leader is elected
root@router:~# arping -i eth1 $LBIP -c 100000
ARPING 192.168.10.211
60 bytes from 08:00:27:04:d1:9c (192.168.10.211): index=0 time=297.774 usec
60 bytes from 08:00:27:04:d1:9c (192.168.10.211): index=1 time=262.590 usec
60 bytes from 08:00:27:04:d1:9c (192.168.10.211): index=2 time=233.700 usec
60 bytes from 08:00:27:04:d1:9c (192.168.10.211): index=3 time=276.347 usec

root@router:~# arp -a
? (192.168.10.101) at 08:00:27:af:e6:df [ether] on eth1
? (192.168.20.100) at 08:00:27:3a:be:fa [ether] on eth2
? (192.168.10.100) at 08:00:27:04:d1:9c [ether] on eth1
? (192.168.10.211) at 08:00:27:04:d1:9c [ether] on eth1
? (10.0.2.3) at 52:55:0a:00:02:03 [ether] on eth0
_gateway (10.0.2.2) at 52:55:0a:00:02:02 [ether] on eth0

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip addr show dev eth1
3: eth1:  mtu 1500 qdisc fq_codel state UP group default qlen 1000
    link/ether 08:00:27:04:d1:9c brd ff:ff:ff:ff:ff:ff
    altname enp0s9
    inet 192.168.10.100/24 brd 192.168.10.255 scope global eth1
       valid_lft forever preferred_lft forever
    inet6 fe80::a00:27ff:fe04:d19c/64 scope link
       valid_lft forever preferred_lft forever



Cilium - Networking - Pod-to-Pod Communication (1) IPAM, Routing, Masquerading
Sat, 02 Aug 2025 16:07:23 GMT
IPAM (IP Address Management)
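IPAM stands for IP Address Management: a system that automates IP address allocation on a network, tracks usage, and manages IPs effectively without conflicts or waste.
Cilium provides high-performance networking built on eBPF, and supports a range of IP allocation policies and strategies through several IPAM modes.
Cilium offers the following IPAM modes (see the official documentation).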

Cluster Scope (Default)
Kubernetes Host Scope
Multi-Pool (Beta)
CRD-Backed
Other external IPAM (Azure IPAM, AWS ENI, Google Kubernetes Engine ...)

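As the official documentation states, it is recommended not to change the IPAM mode after the initial setup.
Changing the IPAM mode in a live environment can cause persistent connectivity disruption for existing workloads. The safest way to change it is to install a new Kubernetes cluster with the new IPAM configuration.
Let's look at each IPAM mode in detail.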
Kubernetes Host Scope
ipam:
  mode: kubernetes
Kubernetes Host Scope IPAM mode uses a per-node IP pool.
Each node has its own PodCIDR, and Cilium assigns Pod IPs from that range.
When Kubernetes' kube-controller-manager assigns per-node PodCIDRs via the --allocate-node-cidrs option,
the Cilium agent reads the spec.podCIDR or spec.podCIDRs field of the v1.Node resource and assigns IPs accordingly.
Checking the settings
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl cluster-info dump | grep -m 2 -E "cluster-cidr|service-cluster-ip-range"
                            "--service-cluster-ip-range=10.96.0.0/16",
                            "--cluster-cidr=10.244.0.0/16",
(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep kubernetes
ipam                                              kubernetes
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
k8s-ctr    10.244.0.0/24
k8s-w1    10.244.1.0/24
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe pod -n kube-system kube-controller-manager-k8s-ctr | grep kube-controller-manager -A3
...
    Command:
      kube-controller-manager
      --allocate-node-cidrs=true
      --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
      --authorization-kubeconfig=/etc/kubernetes/controller-manager.conf
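
The relationship between the cluster CIDR (10.244.0.0/16) and the per-node PodCIDRs (10.244.0.0/24, 10.244.1.0/24) in the output above can be illustrated with Python's ipaddress module. This sketch simply hands out /24 blocks in order; the real allocator in kube-controller-manager is not guaranteed to assign them this way, and the function name is hypothetical.

```python
import ipaddress


def allocate_node_cidrs(cluster_cidr: str, node_names: list[str],
                        node_mask: int = 24) -> dict[str, str]:
    """Give each node the next unused /node_mask block from the cluster CIDR."""
    subnets = ipaddress.ip_network(cluster_cidr).subnets(new_prefix=node_mask)
    return {name: str(next(subnets)) for name in node_names}
```

A /16 cluster CIDR split into /24 blocks leaves room for 256 nodes with up to 254 usable Pod addresses each.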
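Let's deploy a sample application under the Kubernetes Host Scope IPAM configuration and look at the details.
Before monitoring with Hubble, this section summarizes the bpf trace messages worth knowing, alongside the code. It continues the Cilium service communication code analysis from the first Cilium post.
BPF trace messages
1) pre-xlate-fwd
This message is emitted after the ClusterIP is received and before NAT is performed.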
//cilium/bpf/bpf_sock.c
static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full,
                         const bool udp_only)
{
  ...
  // Try to match a service.
    svc = lb4_lookup_service(&key, true);
    if (!svc) {
        /* Restore the original key's protocol as lb4_lookup_service
         * has overwritten it.
         */
        lb4_key_set_protocol(&key, protocol);
        svc = sock4_wildcard_lookup_full(&key, in_hostns);
    }
  ...
  // Send the pre-xlate-fwd message.
    send_trace_sock_notify4(ctx_full, XLATE_PRE_DIRECTION_FWD, dst_ip,
                bpf_ntohs(dst_port));
...
}
2) post-xlate-fwd
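The pre-xlate-fwd message is sent just before NAT. The post-xlate-fwd message is sent once the backend has been chosen, at the point where the NAT (service IP -> backend Pod IP) is actually applied.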
//cilium/bpf/bpf_sock.c
static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full,
                         const bool udp_only)
{
  ...
  // Send the pre-xlate-fwd message.
    send_trace_sock_notify4(ctx_full, XLATE_PRE_DIRECTION_FWD, dst_ip,
                bpf_ntohs(dst_port));
    ...
    // Look up the backend information.
        backend_id = backend_slot->backend_id;
        backend = __lb4_lookup_backend(backend_id);
  ...
  // Send the post-xlate-fwd message.
    send_trace_sock_notify4(ctx_full, XLATE_POST_DIRECTION_FWD, backend->address,
                bpf_ntohs(backend->port));
...
  // Perform the NAT.
    ctx->user_ip4 = backend->address;
    ctx_set_port(ctx, backend->port);
}
3) pre-xlate-rev
Before the RevNAT translation, the pre-xlate-rev message is sent together with the packet's current destination IP/port.
//cilium/bpf/bpf_sock.c
static __always_inline int __sock4_xlate_rev(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full)
{
  ...
  // Send the pre-xlate-rev message.
    send_trace_sock_notify4(ctx_full, XLATE_PRE_DIRECTION_REV, dst_ip,
                bpf_ntohs(dst_port));
...
  // Look up the RevNAT mapping table.
    val = map_lookup_elem(&cilium_lb4_reverse_sk, &key);
...
}
4) post-xlate-rev
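After the RevNAT translation, the post-xlate-rev message is sent to mark the point at which the packet's destination IP/port has been changed back to the original service IP/port.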
//cilium/bpf/bpf_sock.c
static __always_inline int __sock4_xlate_rev(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full)
{
  ...
  // Look up the RevNAT mapping table.
    val = map_lookup_elem(&cilium_lb4_reverse_sk, &key);
...
    // Perform the RevNAT translation.
        ctx->user_ip4 = val->address;
        ctx_set_port(ctx, val->port);
    // Send the post-xlate-rev message.
        send_trace_sock_notify4(ctx_full, XLATE_POST_DIRECTION_REV, val->address,
                    bpf_ntohs(val->port));
        return 0;
    }
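
The four socket-level trace points can be tied together with a small Python model. It is only a sketch of the order of events in the C excerpts above: the dictionaries stand in for the BPF service and reverse-NAT maps, the addresses and the cookie key are hypothetical, and backend selection is reduced to a random choice.

```python
import random

# Hypothetical in-memory stand-ins for the BPF service and reverse-NAT maps.
SERVICES = {("10.96.0.100", 80): [("10.244.0.111", 80), ("10.244.1.47", 80)]}
REVERSE_NAT = {}
TRACE = []


def xlate_fwd(cookie: int, dst: tuple[str, int]) -> tuple[str, int]:
    """Forward translation: service address -> backend address."""
    backends = SERVICES.get(dst)
    if not backends:
        return dst  # not a service address: leave the destination untouched
    TRACE.append(("pre-xlate-fwd", dst))       # still the ClusterIP
    backend = random.choice(backends)
    TRACE.append(("post-xlate-fwd", backend))  # backend chosen, NAT applied next
    REVERSE_NAT[(cookie, backend)] = dst       # remember the original service address
    return backend


def xlate_rev(cookie: int, peer: tuple[str, int]) -> tuple[str, int]:
    """Reverse translation: backend address -> original service address."""
    TRACE.append(("pre-xlate-rev", peer))
    original = REVERSE_NAT.get((cookie, peer))
    if original is None:
        return peer  # no reverse-NAT entry: nothing to translate back
    TRACE.append(("post-xlate-rev", original))
    return original
```

In this model, a reverse lookup that finds no entry records only pre-xlate-rev and never reaches post-xlate-rev, because the second trace call sits behind the map lookup.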
5) to-endpoint
This event is emitted when traffic is delivered to the actual destination Pod after passing policy.
//cilium/bpf/bpf_lxc.c
static __always_inline int
ipv4_policy(struct __ctx_buff *ctx, struct iphdr *ip4, __u32 src_label,
        struct ipv4_ct_tuple *tuple_out, __s8 *ext_err, __u16 *proxy_port,
        bool from_tunnel)
{
    // 1) If the policy verdict allows the packet,
        if (verdict != CTX_ACT_OK || ret != CT_ESTABLISHED)
  ...
  // 2) the conntrack entry is newly created,
    if (ret == CT_NEW) 
  ...
  // 3) and the packet is not redirected to the proxy but delivered straight to the Pod,
    if (*proxy_port > 0)
        goto redirect_to_proxy;

    // then send TRACE_TO_LXC, i.e. the to-endpoint message.
    send_trace_notify4(ctx, TRACE_TO_LXC, src_label, SECLABEL_IPV4, orig_sip,
               LXC_ID, ifindex, trace.reason, trace.monitor);
...

}
For reference, the string values for the TRACE constants are defined here.
//cilium/pkg/monitoring/api/types.go
var TraceObservationPoints = map[uint8]string{
    TraceToLxc:       "to-endpoint",
    TraceToNetwork:   "to-network",
...
}
6) to-network
Only when not in tunnel mode, a TRACE_TO_NETWORK message is emitted right before the packet leaves the node.
//cilium/bpf/bpf_lxc.c
static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *dst_sec_identity,
                        __s8 *ext_err)
{
  ...
  // End of the tunnel-mode block
  #endif /* TUNNEL_MODE */

  // Not in tunnel mode, and host routing is required
    if (is_defined(ENABLE_HOST_ROUTING)) {
        int oif = 0;

    // Look up the routing table and fill in the interface index for the destination IP.
        ret = fib_redirect_v4(ctx, ETH_HLEN, ip4, false, false, ext_err, &oif);
        if (fib_ok(ret))
      // Send TRACE_TO_NETWORK, i.e. the to-network message.
            send_trace_notify(ctx, TRACE_TO_NETWORK, SECLABEL_IPV4,
                      *dst_sec_identity, TRACE_EP_ID_UNKNOWN, oif,
                      trace.reason, trace.monitor, bpf_htons(ETH_P_IP));
        return ret;
    }
  // Hand the packet over to the kernel network stack.
    goto pass_to_stack;
}
Sample Application
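Let's deploy a sample application and monitor its traffic.

Communication on the same node
In kubernetes IPAM mode each node has its own Pod CIDR, and for Pod-to-Pod traffic on the same node Cilium handles local routing using only that node's interfaces and the endpoint map.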
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE     IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          5d23h   10.244.0.164   k8s-ctr   <none>           <none>
webpod-697b545f57-hkjzn   1/1     Running   0          5d23h   10.244.0.111   k8s-ctr   <none>           <none>
webpod-697b545f57-qz7vp   1/1     Running   0          5d23h   10.244.1.47    k8s-w1    <none>           <none>

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- curl 10.244.0.111 | grep Hostname
Hostname: webpod-697b545f57-hkjzn
Observing curl-pod calling a webpod on the same node with hubble gives the following.

Communication scenario:
curl-pod → ClusterIP(web-svc) → webpod (Pod on the same node)

IPAM mode: kubernetes
→ Each node is assigned a PodCIDR, and Pod IPs are always allocated from that range.
Routing: Native Routing

Monitoring flow:
1) curl-pod calls the webpod ClusterIP
2) pre-xlate-fwd

Emitted at the bpf_sock layer right after the request is received.
Reports the ClusterIP as-is, because NAT has not happened yet.

3) post-xlate-fwd

Emitted at the bpf_sock layer after the service IP → backend Pod IP NAT.

4) to-endpoint

The packet is delivered to webpod through the Cilium endpoint map.
The response sent by webpod also reaches curl-pod.

5) pre-xlate-rev

Emitted when webpod sends its response.
The bpf_sock layer then checks whether Reverse NAT applies.

❌ cf) No post-xlate-rev

I asked ChatGPT why it is missing and got the following answer.

In kubernetes IPAM mode, if Cilium can compute a direct IP path for Pod-to-Pod traffic, even across nodes, it can skip Reverse NAT handling on the response.
In that case Cilium does not create an entry in cilium_lb4_reverse_sk, so the post-xlate-rev trace event is not emitted. This is an optimization that can occur not only for traffic on the same node but also for traffic between different nodes.

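Communication between different nodes

Communication scenario:
curl-pod → ClusterIP(web-svc) → webpod (Pod on a different node)

Routing: Native Routing

Monitoring flow:
1) curl-pod calls the webpod ClusterIP
2) pre-xlate-fwd

Emitted at the bpf_sock layer right after the request is received.
Reports the ClusterIP as-is, because NAT has not happened yet.

3) post-xlate-fwd

Emitted at the bpf_sock layer after the service IP → backend Pod IP NAT.

4) to-network

Emitted when the packet from curl-pod to webpod is handed to the physical network interface (eth0, etc.) before heading to the other node (Native Routing).

5) to-endpoint

The response packet sent by webpod reaches curl-pod through the Cilium endpoint map.

6) pre-xlate-rev

Emitted when webpod sends its response.
The bpf_sock layer then checks whether Reverse NAT applies.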

Cluster Scope (Default)
ipam:
  podCIDRs: 10.1.1.0/24 (example)
Cluster Scope IPAM 모드는 각 노드에 노드별 PodCIDR을 할당하고 각 노드에 호스트 범위 할당기를 사용하여 IP를 할당하는 방식이다.
Kubernetes Host Scope IPAM 모드와의 차이점은 Kubernetes가 Kubernetes v1.Node 리소스를 통해 노드별 PodCIDR을 할당하는 대신, Cilium 운영자가 v2.CiliumNode 리소스(CRD)를 통해 노드별 PodCIDR을 관리한다는 점이다.
각 노드는 고유한 PodCIDR을 가지고 있으며, Cilium은 해당 범위 내에서 Pod에 IP를 할당한다.
Kubernetes의 kube-controller-manager가 --allocate-node-cidrs 옵션을 통해 노드별 PodCIDR을 할당하면,
Cilium 에이전트는 v1.Node 리소스의 spec.podCIDR 또는 spec.podCIDRs 필드를 참조하여 IP를 배정한다.
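To make the per-node allocation concrete, here is a minimal Go sketch of how a cluster-wide pool could be carved into per-node PodCIDRs. It is not Cilium's operator code: `carveNodeCIDRs`, the node names, the in-order assignment, and the /24 block size are all assumptions made for illustration.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"net/netip"
)

// carveNodeCIDRs is a hypothetical helper: it splits an IPv4 pool into
// consecutive blocks of maskSize bits and gives one block to each node in order.
func carveNodeCIDRs(poolCIDR string, maskSize int, nodes []string) (map[string]string, error) {
	pool, err := netip.ParsePrefix(poolCIDR)
	if err != nil {
		return nil, err
	}
	if !pool.Addr().Is4() || maskSize < pool.Bits() || maskSize > 32 {
		return nil, fmt.Errorf("invalid pool %s or mask size %d", pool, maskSize)
	}
	// Number of maskSize-bit blocks that fit in the pool.
	available := uint64(1) << (maskSize - pool.Bits())
	if uint64(len(nodes)) > available {
		return nil, fmt.Errorf("pool %s holds only %d /%d blocks", pool, available, maskSize)
	}
	first := pool.Masked().Addr().As4()
	base := binary.BigEndian.Uint32(first[:])
	step := uint32(1) << (32 - maskSize) // addresses per block
	out := make(map[string]string, len(nodes))
	for i, node := range nodes {
		var b [4]byte
		binary.BigEndian.PutUint32(b[:], base+uint32(i)*step)
		out[node] = netip.PrefixFrom(netip.AddrFrom4(b), maskSize).String()
	}
	return out, nil
}

func main() {
	nodes := []string{"node-a", "node-b"} // hypothetical node names
	cidrs, err := carveNodeCIDRs("172.20.0.0/16", 24, nodes)
	if err != nil {
		panic(err)
	}
	for _, n := range nodes {
		fmt.Println(n, cidrs[n])
	}
}
```

With a 172.20.0.0/16 pool and /24 blocks, the first two nodes receive 172.20.0.0/24 and 172.20.1.0/24. Which node gets which block in a real cluster is decided by the operator, not by list order as in this sketch.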
Let's migrate the Kubernetes Host Scope IPAM mode installed earlier to Cluster Scope mode.

# Switch the configuration to Cluster Scope
helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
--set ipam.mode="cluster-pool" --set ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} --set ipv4NativeRoutingCIDR=172.20.0.0/16

kubectl -n kube-system rollout restart deploy/cilium-operator # the operator must be restarted
kubectl -n kube-system rollout restart ds/cilium

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep ^ipam
ipam                                              cluster-pool

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumnode -o json | grep podCIDRs -A2
                    "podCIDRs": [
                        "172.20.1.0/24"
                    ],
--
                    "podCIDRs": [
                        "172.20.0.0/24"
                    ],

# Pods must be restarted to get new Pod IPs
# 
kubectl delete ciliumnode k8s-w1
kubectl -n kube-system rollout restart ds/cilium

#
kubectl delete ciliumnode k8s-ctr
kubectl -n kube-system rollout restart ds/cilium

#
kubectl -n kube-system rollout restart deploy/hubble-relay deploy/hubble-ui
kubectl -n cilium-monitoring rollout restart deploy/prometheus deploy/grafana
kubectl rollout restart deploy/webpod
kubectl delete pod curl-pod

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumendpoints.cilium.io -A
NAMESPACE           NAME                            SECURITY IDENTITY   ENDPOINT STATE   IPV4           IPV6
cilium-monitoring   grafana-5554b5b99d-ngp5c        63656               ready            172.20.0.34
cilium-monitoring   prometheus-56564cbb6f-pgtzp     60653               ready            172.20.0.1
default             curl-pod                        28108               ready            172.20.1.147
default             webpod-5f99d864bd-6mbxt         5610                ready            172.20.0.125
default             webpod-5f99d864bd-7kwbv         5610                ready            172.20.1.88
kube-system         coredns-674b8bbfcf-n4wn5        63747               ready            172.20.1.239
kube-system         coredns-674b8bbfcf-wnqpt        63747               ready            172.20.0.195
kube-system         hubble-relay-7767cb5659-r6kjd   25120               ready            172.20.0.7
kube-system         hubble-ui-869b77984-z9crh       36288               ready            172.20.0.149
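As the exercise above showed, changing the IPAM mode requires restarting Cilium-related resources, so migrating the IPAM mode on a live cluster should be avoided as far as possible.
Routing
There are two main routing modes.
One is Native Routing, which has been used throughout so far, and the other is Encapsulation.
Native Routing vs Encapsulation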



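| Aspect | Native Routing (direct routing) | Encapsulation (e.g. VXLAN) |
|---|---|---|
| Traffic handling | Routes for Pod IPs are installed directly in the routing table, and traffic is routed directly between nodes | Traffic is encapsulated in a tunnel protocol such as VXLAN and delivered through the tunnel |
| Network overhead | Low (no encapsulation) | Encapsulation overhead (UDP header + VXLAN header added) |
| MTU | Default MTU (usually 1500) | MTU reduced by encapsulation; Path MTU issues possible |
| Routing configuration complexity | PodCIDR routes must be added to the node routing tables | Simple routing table; packets are passed between tunnel endpoints |
| Network policy enforcement point | Policy can be applied at both the node and the Pod | Policy must be applied at the tunnel endpoints because of encapsulation (packets are modified inside the tunnel) |
| Network requirements | Every node in the cluster must be able to route the PodCIDRs | Intermediate network devices (routers, switches) do not need to support the tunnel protocol |
| Multi-tenancy / complex networks | Routing is hard to manage in complex network environments | Tunnels allow isolation and network separation in complex network environments |
| Debugging difficulty | Relatively easy | Encapsulation makes traffic analysis and debugging harder |
| Performance impact | Generally lower latency and CPU overhead | CPU overhead and slight latency from encapsulation/decapsulation |
| Node-to-node traffic flow | Delivered directly based on node IP | Encapsulated packets delivered between tunnel endpoints |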


Native Routing was covered in the previous post and in the exercise above, so this time we walk through a VXLAN encapsulation exercise.
vxlan encapsulation
Installation
helm install cilium cilium/cilium --version 1.17.6 --namespace kube-system \
  --set k8sServiceHost=192.168.10.100 \
  --set k8sServicePort=6443 \
  --set ipam.mode="kubernetes" \
  --set k8s.requireIPv4PodCIDR=true \
  --set ipv4NativeRoutingCIDR=10.244.0.0/16 \
  --set routingMode=tunnel \
  --set encapsulation=vxlan \
  --set autoDirectNodeRoutes=false \
  --set endpointRoutes.enabled=true \
  --set kubeProxyReplacement=true \
  --set bpf.masquerade=true \
  --set installNoConntrackIptablesRules=false \
  --set endpointHealthChecking.enabled=false \
  --set healthChecking=false \
  --set hubble.enabled=true \
  --set hubble.relay.enabled=true \
  --set hubble.ui.enabled=true \
  --set hubble.ui.service.type=NodePort \
  --set hubble.ui.service.nodePort=30003 \
  --set prometheus.enabled=true \
  --set operator.prometheus.enabled=true \
  --set hubble.metrics.enableOpenMetrics=true \
  --set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
  --set operator.replicas=1 \
  --set debug.enabled=true
Status check
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.10.0.0/16 via 192.168.10.200 dev eth1 proto static
10.244.0.4 dev cilium_host proto kernel scope link
10.244.0.37 dev lxceae0fb68caf5 proto kernel scope link
10.244.0.145 dev lxcd1114ba6891e proto kernel scope link
10.244.0.169 dev lxc8810a263de82 proto kernel scope link
10.244.0.203 dev lxc1fa09f003b38 proto kernel scope link
10.244.0.209 dev lxc158567b48672 proto kernel scope link
10.244.1.0/24 via 10.244.0.4 dev cilium_host proto kernel src 10.244.0.4 mtu 1450
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip addr show 
4: cilium_net@cilium_host:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 6a:87:2e:a2:87:d7 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::6887:2eff:fea2:87d7/64 scope link
       valid_lft forever preferred_lft forever
5: cilium_host@cilium_net:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether e2:0f:04:be:e7:f1 brd ff:ff:ff:ff:ff:ff
    inet 10.244.0.4/32 scope global cilium_host
       valid_lft forever preferred_lft forever
    inet6 fe80::e00f:4ff:febe:e7f1/64 scope link
       valid_lft forever preferred_lft forever
24: cilium_vxlan:  mtu 1500 qdisc noqueue state UNKNOWN group default
    link/ether 36:da:0d:c5:8e:26 brd ff:ff:ff:ff:ff:ff
    inet6 fe80::34da:dff:fec5:8e26/64 scope link
       valid_lft forever preferred_lft forever
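Connectivity check (1) - tcpdump
Cilium's VXLAN mode sends encapsulated traffic over UDP port 8472 by default. Because the actual HTTP traffic is carried inside the VXLAN encapsulation, HTTP (TCP/80) packets are not directly visible on eth1; they can be seen by capturing UDP port 8472 instead.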
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it curl-pod -- curl webpod | grep Hostname

(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 udp port 8472 -nn | grep 10.244.1.108
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [S], seq 384054985, win 64860, options [mss 1410,sackOK,TS val 3044357904 ecr 0,nop,wscale 7], length 0
IP 10.244.1.108.80 > 10.244.0.25.42626: Flags [S.], seq 1361241289, ack 384054986, win 64308, options [mss 1410,sackOK,TS val 1148259990 ecr 3044357904,nop,wscale 7], length 0
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [.], ack 1, win 507, options [nop,nop,TS val 3044357905 ecr 1148259990], length 0
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [P.], seq 1:71, ack 1, win 507, options [nop,nop,TS val 3044357905 ecr 1148259990], length 70: HTTP: GET / HTTP/1.1
IP 10.244.1.108.80 > 10.244.0.25.42626: Flags [.], ack 71, win 502, options [nop,nop,TS val 1148259991 ecr 3044357905], length 0
IP 10.244.1.108.80 > 10.244.0.25.42626: Flags [P.], seq 1:322, ack 71, win 502, options [nop,nop,TS val 1148259993 ecr 3044357905], length 321: HTTP: HTTP/1.1 200 OK
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [.], ack 322, win 505, options [nop,nop,TS val 3044357908 ecr 1148259993], length 0
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [F.], seq 71, ack 322, win 505, options [nop,nop,TS val 3044357908 ecr 1148259993], length 0
IP 10.244.1.108.80 > 10.244.0.25.42626: Flags [F.], seq 322, ack 72, win 502, options [nop,nop,TS val 1148259994 ecr 3044357908], length 0
IP 10.244.0.25.42626 > 10.244.1.108.80: Flags [.], ack 323, win 505, options [nop,nop,TS val 3044357909 ecr 1148259994], length 0
^C73 packets captured
73 packets received by filter
0 packets dropped by kernel

(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 tcp port 80 -nn | grep 10.244.1.108
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth1, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C0 packets captured
0 packets received by filter
0 packets dropped by kernel
Opening the udp port 8472 capture in termshark shows that the packets are encapsulated: the Src and Dst IPs in the outer packet header are the nodes' eth1 IPs.
(⎈|HomeLab:N/A) root@k8s-ctr:~# tcpdump -i eth1 udp port 8472 -w /tmp/icmp.pcap

(⎈|HomeLab:N/A) root@k8s-ctr:~# termshark -r /tmp/icmp.pcap
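
The `mtu 1450` on the 10.244.1.0/24 route and the `mss 1410` in the capture above both follow from the VXLAN overhead. The arithmetic can be sketched as below, assuming IPv4 on both the outer and inner packet with no IP or TCP options counted in the MSS, and a 1500-byte link MTU; the constant and function names are illustrative only.

```go
package main

import "fmt"

// Bytes added in front of the pod's IP packet when it is sent through VXLAN.
const (
	outerIPv4     = 20 // outer IPv4 header (node eth1 IPs), no options
	outerUDP      = 8  // outer UDP header (destination port 8472 in this lab)
	vxlanHeader   = 8  // VXLAN header carrying the 24-bit VNI
	innerEthernet = 14 // Ethernet header of the encapsulated inner frame
	innerIPv4     = 20 // pod IPv4 header, no options
	innerTCP      = 20 // pod TCP header, options not counted in the MSS
)

// vxlanOverhead is the number of bytes the tunnel takes out of the link MTU.
func vxlanOverhead() int {
	return outerIPv4 + outerUDP + vxlanHeader + innerEthernet
}

// podMTU is the largest pod IP packet that still fits in one link-MTU frame.
func podMTU(linkMTU int) int {
	return linkMTU - vxlanOverhead()
}

// tcpMSS is the TCP payload size advertised for a given MTU.
func tcpMSS(mtu int) int {
	return mtu - innerIPv4 - innerTCP
}

func main() {
	mtu := podMTU(1500)
	// prints: overhead=50 podMTU=1450 mss=1410
	fmt.Printf("overhead=%d podMTU=%d mss=%d\n", vxlanOverhead(), mtu, tcpMSS(mtu))
}
```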

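Connectivity check (2) - hubble
Monitor Pod-to-Pod traffic across different nodes in vxlan mode.

Communication scenario:
curl-pod → ClusterIP(web-svc) → webpod (Pod on a different node)


Routing: vxlan

Monitoring flow:
1) curl-pod calls the ClusterIP of webpod
2) pre-xlate-fwd

Emitted at the bpf_sock layer right after the packet is received.
The service IP is still in place; NAT has not happened yet.

3) post-xlate-fwd

Emitted at the bpf_sock layer after the service IP has been NATed to the backend pod IP.

4) to-overlay

Emitted when curl-pod sends a packet to a webpod on another node, just before the encapsulation info is set and the packet leaves for the other node (vxlan routing).
After that, the packet is encapsulated and a redirect to the tunnel interface is returned.

5) to-endpoint

The response packet from webpod reaches curl-pod through the Cilium endpoint map.

6) pre-xlate-rev

Emitted when webpod sends its response.
After this, the bpf_sock layer checks whether Reverse NAT is needed.

As with the events analyzed above, let's look at the BPF code that emits the to-overlay trace message.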
to-overlay
//cilium/bpf/bpf_lxc.c
// In tunnel mode
#if defined(TUNNEL_MODE)
...
        // If tunnel endpoint information exists for the destination
        if (info && info->flag_has_tunnel_ep) {
            // Encapsulate and redirect
            ret = encap_and_redirect_lxc(ctx, info, SECLABEL_IPV4,
                             *dst_sec_identity, &trace,
                             bpf_htons(ETH_P_IP));
#endif /* TUNNEL_MODE */
//cilium/bpf/lib/encap.h
encap_and_redirect_lxc(struct __ctx_buff *ctx, struct remote_endpoint_info *info,
               __u32 seclabel, __u32 dstid, const struct trace_ctx *trace, __be16 proto)
{
    return encap_and_redirect_with_nodeid(ctx, info, seclabel, dstid, trace, proto);
}

// Following the call chain leads to __encap_with_nodeid4, which emits the to-overlay trace message and then sets the encapsulation info.

__encap_with_nodeid4(struct __ctx_buff *ctx, __u32 src_ip, __be16 src_port,
             __be32 tunnel_endpoint,
             __u32 seclabel, __u32 dstid, __u32 vni,
             enum trace_reason ct_reason, __u32 monitor, int *ifindex,
             __be16 proto)
{
...
    send_trace_notify(ctx, TRACE_TO_OVERLAY, seclabel, dstid, TRACE_EP_ID_UNKNOWN,
              *ifindex, ct_reason, monitor, proto);

    return ctx_set_encap_info4(ctx, src_ip, src_port, tunnel_endpoint, seclabel, vni,
                   NULL, 0);
}
Masquerading
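Masquerading is an SNAT (Source NAT) operation that rewrites the source IP address of a packet to the node's IP.
Cilium automatically masquerades the source IP of all traffic leaving the cluster. Traffic leaving a Pod carries the Pod IP as its source, and if it were sent outside unchanged, the external peer would have no way to reply.
So the source IP is rewritten (SNAT) to the IP of the node hosting the Pod, and when the reply comes back the address is translated back to the original Pod IP.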
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent  -- cilium status | grep Masquerading
Masquerading:            BPF   [eth0, eth1]   10.244.0.0/16 [IPv4: Enabled, IPv6: Disabled]

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view  | grep ipv4-native-routing-cidr
ipv4-native-routing-cidr                          10.244.0.0/16
The ipv4-native-routing-cidr range (10.244.0.0/16) is the network that is natively routed (routed directly, without tunneling); traffic leaving this range is SNATed (masqueraded).
As an exception, traffic is not masqueraded when the destination IP is a node IP inside the cluster.
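The two rules above can be sketched as a small decision function. This is a simplification for illustration and not Cilium's datapath code: `needsMasquerade` is a hypothetical name, the node IP list is taken from this lab, and 192.168.10.200 is treated as an external host by assumption.

```go
package main

import (
	"fmt"
	"net/netip"
)

// needsMasquerade reports whether traffic from a pod to dst would be SNATed
// under the rules described above: destinations inside the native routing CIDR
// and cluster node IPs keep the pod IP, everything else is masqueraded.
func needsMasquerade(dst, nativeRoutingCIDR string, nodeIPs []string) (bool, error) {
	addr, err := netip.ParseAddr(dst)
	if err != nil {
		return false, err
	}
	prefix, err := netip.ParsePrefix(nativeRoutingCIDR)
	if err != nil {
		return false, err
	}
	if prefix.Contains(addr) {
		return false, nil // natively routed, source stays the pod IP
	}
	for _, n := range nodeIPs {
		nodeAddr, err := netip.ParseAddr(n)
		if err != nil {
			return false, err
		}
		if nodeAddr == addr {
			return false, nil // cluster node IP, exempt from masquerading
		}
	}
	return true, nil
}

func main() {
	nodes := []string{"192.168.10.100", "192.168.10.101", "192.168.10.102"}
	for _, dst := range []string{"10.244.1.47", "192.168.10.101", "192.168.10.200"} {
		snat, err := needsMasquerade(dst, "10.244.0.0/16", nodes)
		if err != nil {
			panic(err)
		}
		fmt.Printf("dst=%s masquerade=%v\n", dst, snat)
	}
}
```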
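Exercise
The following exercise compares a call from a Pod to a node inside the cluster with a call to a server outside the cluster.
For the in-cluster node the Pod IP shows up as the source, whereas for the external server the source is SNATed to the Node IP.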

ipMasqAgent
ipMasqAgent is the configuration section for the agent that controls IP masquerading (i.e. SNAT) in a Kubernetes cluster.
The following settings are available under ipMasqAgent.



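| Setting | Description | Example value |
|---|---|---|
| enabled | Whether the ip-masq-agent feature is enabled | true, false |
| config.nonMasqueradeCIDRs | List of CIDRs to which SNAT is not applied | {10.10.1.0/24,10.10.2.0/24} |
| config.masqLinkLocal | Whether SNAT is applied to link-local addresses (169.254.0.0/16) | true, false |
| config.masqLinkLocalIPv6 | Whether SNAT is applied to IPv6 link-local addresses (fe80::/10) | true, false |
| config.masqAgentConfigPath | Path to a user-defined config file | /etc/cilium/masq-agent.json |
| config.masqOutBoundCIDRs | List of external CIDRs to which SNAT is always applied | {0.0.0.0/0} |
| config.masqOutBoundPortRanges | List of port ranges to which SNAT is always applied | {80,443,1000-2000} |
| config.refreshInterval | Refresh interval for iptables rules | 1m, 30s |
| installIptablesRules | Whether ip-masq-agent installs the iptables rules itself | true, false |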





Cilium - (Observability) Hubble
Sat, 26 Jul 2025 18:18:07 GMT
Hubble
Hubble is an observability and monitoring platform built on Cilium's eBPF flows that lets you observe and analyze network security policies, service flows, and L3-L7 traffic.
Hubble components
The official Cilium documentation covers observability with Hubble in detail.



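| Component | Description | Observation scope | Connection | Where it is deployed/used | Key features |
|---|---|---|---|---|---|
| Hubble API | gRPC API that exposes the network traffic observed on the local node where the Cilium agent runs | Single Cilium node (local) | Unix domain socket (/var/run/cilium/hubble.sock) | Inside each Cilium agent Pod | Provides L3-L7 network events. Not directly reachable from outside |
| Hubble Relay | Aggregates the Hubble APIs of multiple Cilium nodes and serves unified traffic data for the whole cluster, or for multiple clusters in a ClusterMesh | Whole cluster or ClusterMesh | Internal: talks to the Hubble API. External: gRPC (connections from the CLI and UI) | Runs as a separate Pod (Deployment, etc.) | Centralized data collector. Security and authentication can be configured. Main backend for the CLI and UI |
| Hubble UI | Web UI that automatically discovers and visualizes communication flows between services in the cluster | Whole cluster or ClusterMesh | Talks to Hubble Relay over gRPC or HTTP | Deployed as a Pod and accessed from a web browser | Service dependency map, filtering UI, L3/L4/L7 data visualization. UX similar to Grafana |
| Hubble CLI | CLI tool that uses the hubble command to reach the Hubble API or Hubble Relay and query traffic flows | Local node or whole cluster | (1) Unix domain socket (direct API connection), (2) Hubble Relay address (gRPC) | Inside a Cilium Pod or on an external client | Live flow queries, filtering, JSON output, and other commands |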


Installing Hubble
Install Hubble and then check the basic setup.
Installing Hubble with helm deploys the hubble-relay and hubble-ui Deployments, Secrets, and related resources to the cluster.
(⎈|HomeLab:N/A) root@k8s-ctr:~# helm upgrade cilium cilium/cilium --namespace kube-system --reuse-values \
--set hubble.enabled=true \
--set hubble.relay.enabled=true \
--set hubble.ui.enabled=true \
--set hubble.ui.service.type=NodePort \
--set hubble.ui.service.nodePort=31234 \
--set hubble.export.static.enabled=true \
--set hubble.export.static.filePath=/var/run/cilium/hubble/events.log \
--set prometheus.enabled=true \
--set operator.prometheus.enabled=true \
--set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}"

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium status
    /¯¯\
 /¯¯\__/¯¯\    Cilium:             OK
 \__/¯¯\__/    Operator:           OK
 /¯¯\__/¯¯\    Envoy DaemonSet:    OK
 \__/¯¯\__/    Hubble Relay:       OK
    \__/       ClusterMesh:        disabled

...
Deployment             hubble-relay             Desired: 1, Ready: 1/1, Available: 1/1
Deployment             hubble-ui                Desired: 1, Ready: 1/1, Available: 1/1
Containers:            ...
                       hubble-relay             Running: 1
                       hubble-ui                Running: 1
...

(⎈|HomeLab:N/A) root@k8s-ctr:~# cilium config view | grep -i hubble
enable-hubble                                     true
enable-hubble-open-metrics                        true
hubble-disable-tls                                false
hubble-export-allowlist
hubble-export-denylist
hubble-export-fieldmask
hubble-export-file-max-backups                    5
hubble-export-file-max-size-mb                    10
hubble-export-file-path                           /var/run/cilium/hubble/events.log
hubble-listen-address                             :4244
hubble-metrics                                    dns drop tcp flow port-distribution icmp httpV2:exemplars=true;labelsContext=source_ip,source_namespace,source_workload,destination_ip,destination_namespace,destination_workload,traffic_direction
hubble-metrics-server                             :9965
hubble-metrics-server-enable-tls                  false
hubble-socket-path                                /var/run/cilium/hubble.sock
hubble-tls-cert-file                              /var/lib/cilium/tls/hubble/server.crt
hubble-tls-client-ca-files                        /var/lib/cilium/tls/hubble/client-ca.crt
hubble-tls-key-file                               /var/lib/cilium/tls/hubble/server.key

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get secret -n kube-system | grep -iE 'cilium-ca|hubble'
cilium-ca                      Opaque                          2      4d14h
hubble-relay-client-certs      kubernetes.io/tls               3      4d14h
hubble-server-certs            kubernetes.io/tls               3      4d14h
Hubble architecture
The Hubble architecture can be summarized as follows.

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system | grep -i hubble-relay
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
service/hubble-relay     ClusterIP   10.96.173.198           80/TCP                   56s
endpoints/hubble-relay     172.20.1.6:4245                                               56s

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system | grep -i hubble-peer
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
service/hubble-peer      ClusterIP   10.96.21.177            443/TCP                  60s
endpoints/hubble-peer      192.168.10.100:4244,192.168.10.101:4244,192.168.10.102:4244   60s
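Hubble Relay looks up the kube-system/hubble-peer Endpoints resource and connects over gRPC to the Hubble API (port 4244) exposed by the cilium-agent on each node. Through these connections it collects network flow data from every node and serves it centrally.
Comparing the state before and after Hubble is deployed shows that port 4244, the Hubble API port, is newly opened.
See the official documentation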
(⎈|HomeLab:N/A) root@k8s-ctr:~# ss -tnlp | grep 4244
LISTEN 0      4096                *:4244             *:*    users:(("cilium-agent",pid=6955,fd=52))

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe pod -n kube-system -l k8s-app=hubble-relay
Name:             hubble-relay-5dcd46f5c-6pmq9
...
Containers:
  hubble-relay:
...
    Port:          4245/TCP
    Host Port:     0/TCP

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system hubble-relay
NAME                   TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/hubble-relay   ClusterIP   10.96.180.56           80/TCP    4d14h

NAME                     ENDPOINTS          AGE
endpoints/hubble-relay   172.20.2.58:4245   4d14h
Hubble architecture - code walkthrough
Let's go through the code to see how Hubble Relay discovers the Hubble API instances, i.e. its peers, and communicates with them over gRPC.
//cilium/pkg/hubble/relay/pool/manager.go
func (m *PeerManager) Start() {
    m.wg.Add(3)
    go func() {
        defer m.wg.Done()
        // Hubble Relay keeps its peer list up to date in real time over a gRPC connection to the Hubble agent
        m.watchNotifications()
    }()
    go func() {
        defer m.wg.Done()
        // Establishes the gRPC connections to the peers
        m.manageConnections()
    }()
    go func() {
        defer m.wg.Done()
        // Reports connection status
        m.reportConnectionStatus()
    }()
}
1. watchNotifications
Hubble Relay builds the peers it will connect to over gRPC, falling back to the default ServerPort defined in the code when a peer address carries no port.
//cilium/pkg/hubble/relay/pool/manager.go
func (m *PeerManager) watchNotifications() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()
    go func() {
        <-m.stop
        cancel()
    }()
connect:
    for {
        cl, err := m.opts.peerClientBuilder.Client(m.opts.peerServiceAddress)
...
        client, err := cl.Notify(ctx, &peerpb.NotifyRequest{})
...
            cn, err := client.Recv()
...
            // Build a new peer from the notification
            p := peerTypes.FromChangeNotification(cn)
            switch cn.GetType() {
            case peerpb.ChangeNotificationType_PEER_ADDED:
                m.upsert(p)
            case peerpb.ChangeNotificationType_PEER_DELETED:
                m.remove(p)
            case peerpb.ChangeNotificationType_PEER_UPDATED:
                m.upsert(p)
            }
        }
    }
}

//cilium/pkg/hubble/peer/types/peer.go
// FromChangeNotification creates a new Peer from a ChangeNotification.
func FromChangeNotification(cn *peerpb.ChangeNotification) *Peer {
    if cn == nil {
        return (*Peer)(nil)
    }
    var err error
    var addr net.Addr
    switch a := cn.GetAddress(); {
...
    default:
        var host, port string
        if host, port, err = net.SplitHostPort(a); err == nil {
...
        // If the address is a bare IP rather than an explicit IP:port pair, the peer's port is set to the default server port.
        } else if ip := net.ParseIP(a); ip != nil {
            err = nil
            addr = &net.TCPAddr{
                IP:   ip,
                Port: defaults.ServerPort,
            }
        }
    }
...
    return &Peer{
        Name:          cn.GetName(),
        Address:       addr,
        TLSEnabled:    tlsEnabled,
        TLSServerName: tlsServerName,
    }
}

//cilium/pkg/hubble/defaults
// The default server port is hard-coded as 4244.
const (
    // ServerPort is the default port for hubble server when a provided
    // listen address does not include one.
    ServerPort = 4244
  ...
)
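The port fallback in the FromChangeNotification excerpt above can be reproduced with the standard library alone. The sketch below is a simplified stand-in, not the Cilium function: `resolvePeerAddr` and `defaultServerPort` are names made up here, and the cases elided in the original (`...`) are not handled.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// defaultServerPort mirrors the value of defaults.ServerPort quoted above.
const defaultServerPort = 4244

// resolvePeerAddr returns "ip:port" for a peer address. An explicit port is
// kept; a bare IP gets the default server port appended.
func resolvePeerAddr(a string) (string, error) {
	if host, port, err := net.SplitHostPort(a); err == nil {
		if net.ParseIP(host) == nil {
			return "", fmt.Errorf("invalid IP %q", host)
		}
		if _, err := strconv.ParseUint(port, 10, 16); err != nil {
			return "", fmt.Errorf("invalid port %q", port)
		}
		return net.JoinHostPort(host, port), nil
	}
	if ip := net.ParseIP(a); ip != nil {
		return net.JoinHostPort(ip.String(), strconv.Itoa(defaultServerPort)), nil
	}
	return "", fmt.Errorf("unrecognized peer address %q", a)
}

func main() {
	for _, a := range []string{"192.168.10.100", "192.168.10.101:4245"} {
		addr, err := resolvePeerAddr(a)
		if err != nil {
			panic(err)
		}
		fmt.Println(a, "->", addr)
	}
}
```

A bare node IP such as 192.168.10.100 resolves to 192.168.10.100:4244, which matches the port shown in the hubble-peer endpoints earlier.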
2. manageConnections
Checks peer state and attempts connections both when peer information is updated and periodically.
//cilium/pkg/hubble/relay/pool/manager.go
func (m *PeerManager) manageConnections() {
    for {
        select {
        case <-m.stop:
            return
        // Attempt a connection when peer information is updated
        case name := <-m.updated:
            m.mu.RLock()
            p := m.peers[name]
            m.mu.RUnlock()
            m.wg.Add(1)
            go func(p *peer) {
                defer m.wg.Done()
                // a connection request has been made, make sure to attempt a connection
                m.connect(p, true)
            }(p)
        // Periodically attempt connections to all peers
        case <-time.After(m.opts.connCheckInterval):
            m.mu.RLock()
            for _, p := range m.peers {
                m.wg.Add(1)
                go func(p *peer) {
                    defer m.wg.Done()
                    m.connect(p, false)
                }(p)
            }
            m.mu.RUnlock()
        }
    }
}
...

func (m *PeerManager) connect(p *peer, ignoreBackoff bool) {
...
    // Create the actual gRPC connection.
    scopedLog.Info("Connecting")
    conn, err := m.opts.clientConnBuilder.ClientConn(p.Address.String(), p.TLSServerName)
    if err != nil {
        duration := m.opts.backoff.Duration(p.connAttempts)
        p.nextConnAttempt = now.Add(duration)
        p.connAttempts++
        scopedLog.Warn(
            "Failed to create gRPC client",
            logfields.Error, err,
            logfields.NextTryIn, duration,
        )
        return
    }
    p.nextConnAttempt = time.Time{}
    p.connAttempts = 0
    p.conn = conn
    scopedLog.Info("Connected")
}
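On failure, connect asks `m.opts.backoff.Duration(p.connAttempts)` how long to wait before the next attempt. The backoff implementation itself is not shown in the excerpt, so the following is only a sketch of one common choice, capped exponential backoff; the function name, the 1-second base, and the 90-second ceiling are assumptions and not Hubble Relay's actual values.

```go
package main

import (
	"fmt"
	"time"
)

// backoffDuration returns base * 2^attempt, capped at ceiling.
func backoffDuration(attempt int, base, ceiling time.Duration) time.Duration {
	d := base
	// Stop doubling as soon as the ceiling is reached to avoid overflow.
	for i := 0; i < attempt && d < ceiling; i++ {
		d *= 2
	}
	if d > ceiling {
		return ceiling
	}
	return d
}

func main() {
	// Hypothetical parameters: 1s base delay, 90s ceiling.
	for attempt := 0; attempt <= 8; attempt++ {
		wait := backoffDuration(attempt, time.Second, 90*time.Second)
		fmt.Printf("attempt %d failed, next try in %s\n", attempt, wait)
	}
}
```

In the excerpt, a successful connection resets `connAttempts` to 0, so with a scheme like this the delay would start again from the base value after the next failure.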
Accessing the Hubble UI
Let's confirm this by opening the Hubble UI.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep -n kube-system hubble-ui
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME                TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)        AGE
service/hubble-ui   NodePort   10.96.137.247           80:31234/TCP   3m43s

NAME                  ENDPOINTS           AGE
endpoints/hubble-ui   172.20.2.101:8081   3m43s

Using the Hubble CLI
Use the Hubble CLI to watch traffic in real time.
Installing the Hubble CLI
HUBBLE_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/hubble/master/stable.txt)
HUBBLE_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then HUBBLE_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/hubble/releases/download/$HUBBLE_VERSION/hubble-linux-${HUBBLE_ARCH}.tar.gz{,.sha256sum}
sudo tar xzvfC hubble-linux-${HUBBLE_ARCH}.tar.gz /usr/local/bin
which hubble
hubble status
Monitoring
To let the Hubble CLI on the local machine talk to Hubble Relay, start port-forwarding in the background and then check the monitoring output.
cilium hubble port-forward&
Hubble Relay is available at 127.0.0.1:4245

# Now you can validate that you can access the Hubble API via the installed CLI
hubble status
Healthcheck (via localhost:4245): Ok
Current/Max Flows: 12,285/12,285 (100.00%)
Flows/s: 41.20

# Check the default address of the hubble (api) server
hubble config view 
...
port-forward-port: "4245"
server: localhost:4245
The following are some of the hubble observe options. They can be combined to monitor exactly the traffic you are interested in.



| Option | Description | Example |
|---|---|---|
| --from-pod | Source Pod (namespace/pod format) | --from-pod kube-system/cilium-abc |
| --to-pod | Destination Pod | --to-pod default/myapp |
| --from-ip | Source IP address | --from-ip 10.0.0.12 |
| --to-ip | Destination IP address | --to-ip 10.0.1.25 |
| --from-fqdn | Source Fully Qualified Domain Name (FQDN) | --from-fqdn api.example.com |
| --to-fqdn | Destination FQDN | --to-fqdn google.com |
| --selector | Label selector (applies to both source and destination) | --selector k8s:app=frontend |


(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f
Jul 25 15:13:25.416: 192.168.10.100:49996 (kube-apiserver) <- kube-system/hubble-ui-76d4965bb6-vbjtc:8081 (ID:64472) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:13:26.683: 127.0.0.1:32926 (world) <> kube-system/coredns-674b8bbfcf-pdpmn (ID:3810) pre-xlate-rev TRACED (TCP)
Jul 25 15:13:26.683: 127.0.0.1:32926 (world) <> kube-system/coredns-674b8bbfcf-pdpmn (ID:3810) pre-xlate-rev TRACED (TCP)
Jul 25 15:13:27.434: 127.0.0.1:34640 (world) <> 192.168.10.102 (host) pre-xlate-rev TRACED (TCP)
Jul 25 15:13:27.484: 127.0.0.1:55200 (world) <> kube-system/hubble-relay-5dcd46f5c-78fnx (ID:29925) pre-xlate-rev TRACED (TCP)
Jul 25 15:13:27.484: 127.0.0.1:55200 (world) <> kube-system/hubble-relay-5dcd46f5c-78fnx (ID:29925) pre-xlate-rev TRACED (TCP)

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe --from-pod kube-system/coredns-674b8bbfcf-pdpmn -f
Jul 25 15:14:19.621: kube-system/coredns-674b8bbfcf-pdpmn:53128 (ID:3810) -> 192.168.10.100:6443 (host) to-stack FORWARDED (TCP Flags: ACK)
Jul 25 15:14:23.375: 10.0.2.15:49350 (host) <- kube-system/coredns-674b8bbfcf-pdpmn:8080 (ID:3810) to-stack FORWARDED (TCP Flags: SYN, ACK)
Jul 25 15:14:23.375: 10.0.2.15:49350 (host) <- kube-system/coredns-674b8bbfcf-pdpmn:8080 (ID:3810) to-stack FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:14:23.375: 10.0.2.15:49350 (host) <- kube-system/coredns-674b8bbfcf-pdpmn:8080 (ID:3810) to-stack FORWARDED (TCP Flags: ACK, FIN)

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe --to-ip 192.168.10.100 -f
Jul 25 15:15:11.670: 192.168.10.102:37974 (host) -> 192.168.10.100:6443 (kube-apiserver) to-network FORWARDED (TCP Flags: ACK)
Jul 25 15:15:13.414: 192.168.10.100:49953 (kube-apiserver) <- kube-system/hubble-ui-76d4965bb6-vbjtc:8081 (ID:64472) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:15:13.944: kube-system/hubble-ui-76d4965bb6-vbjtc:47176 (ID:64472) -> 192.168.10.100:6443 (kube-apiserver) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:15:14.882: 192.168.10.102:56384 (host) -> 192.168.10.100:6443 (kube-apiserver) to-network FORWARDED (TCP Flags: ACK)
Jul 25 15:15:18.256: 192.168.10.101:49758 (host) -> 192.168.10.100:6443 (kube-apiserver) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:15:19.404: 192.168.10.100:49953 (kube-apiserver) <- kube-system/hubble-ui-76d4965bb6-vbjtc:8081 (ID:64472) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:15:19.405: kube-system/hubble-relay-5dcd46f5c-78fnx:51508 (ID:29925) -> 192.168.10.100:4244 (kube-apiserver) to-network FORWARDED (TCP Flags: ACK, PSH)
Starwars Demo
Use the demo provided by Cilium to test various security policies for access control.
Demo setup
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod --show-labels
NAME                        READY   STATUS    RESTARTS   AGE   LABELS
deathstar-8c4c77fb7-7rqzr   1/1     Running   0          22m   app.kubernetes.io/name=deathstar,class=deathstar,org=empire,pod-template-hash=8c4c77fb7
deathstar-8c4c77fb7-8wdts   1/1     Running   0          22m   app.kubernetes.io/name=deathstar,class=deathstar,org=empire,pod-template-hash=8c4c77fb7
tiefighter                  1/1     Running   0          22m   app.kubernetes.io/name=tiefighter,class=tiefighter,org=empire
xwing                       1/1     Running   0          22m   app.kubernetes.io/name=xwing,class=xwing,org=alliance

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get deploy,svc,ep deathstar
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/deathstar   2/2     2            2           22m

NAME                TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)   AGE
service/deathstar   ClusterIP   10.96.60.53           80/TCP    22m

NAME                  ENDPOINTS                       AGE
endpoints/deathstar   172.20.1.85:80,172.20.2.33:80   22m
Scenario 1) No policy

1-1) tiefighter -> deathstar : request-landing
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --protocol tcp --from-identity $TIEFIGHTERID
Jul 25 15:56:06.052: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: SYN)
Jul 25 15:56:06.052: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:56:06.053: default/tiefighter:51000 (ID:62396) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:56:06.054: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, FIN)
Jul 25 15:56:06.098: default/tiefighter (ID:62396) <> 10.96.60.53:80 (world) pre-xlate-fwd TRACED (TCP)
Jul 25 15:56:06.098: default/tiefighter (ID:62396) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) post-xlate-fwd TRANSLATED (TCP)
Jul 25 15:56:06.098: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: SYN)
Jul 25 15:56:06.099: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: ACK)
Jul 25 15:56:06.099: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:56:06.100: default/tiefighter:51000 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: ACK, FIN)
1-2) xwing -> deathstar : request-landing
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --protocol tcp --from-identity $XWINGID
Jul 25 15:53:42.883: default/xwing (ID:6305) <> 10.96.60.53:80 (world) pre-xlate-fwd TRACED (TCP)
Jul 25 15:53:42.883: default/xwing (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) post-xlate-fwd TRANSLATED (TCP)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: SYN)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:53:42.883: default/xwing:37840 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts (ID:50122) pre-xlate-rev TRACED (TCP)
Jul 25 15:53:42.884: default/xwing:37840 (ID:6305) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, FIN)
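Scenario 2) Only the group that belongs to empire may call deathstar
Applying an L3/L4 policy
Cilium defines security policy by pod labels rather than by IP address.
For example, you can create a policy that restricts access so that only the group with the org=empire label can reach the deathstar service.
This is an L3/L4 network security policy; it operates at the IP and TCP level.
In Cilium, explicitly allowing only the request traffic is enough: the corresponding reply traffic is allowed automatically.
This is because Cilium tracks TCP/UDP connection state in its own conntrack (connection tracking) table, kept in eBPF maps rather than in the kernel's netfilter conntrack, and the eBPF program checks that state to see whether a packet belongs to a connection that was already allowed.
In other words, if policy allows the client-to-server direction, the replies to those requests are allowed automatically by conntrack.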
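The request/reply behavior described above can be sketched with a toy stateful filter. This is a minimal model, not Cilium's datapath: the `Flow` and `StatefulFilter` names are hypothetical, a single allow list is applied to every packet, and the port numbers are taken from the Hubble output in this post.

```python
from typing import NamedTuple


class Flow(NamedTuple):
    src: str
    sport: int
    dst: str
    dport: int
    proto: str

    def reply(self) -> "Flow":
        # The reply to a flow swaps source and destination.
        return Flow(self.dst, self.dport, self.src, self.sport, self.proto)


class StatefulFilter:
    """Toy model: ingress allow rules plus a connection-tracking table."""

    def __init__(self, allowed):
        # allowed: set of (source label, destination port, protocol)
        self.allowed = set(allowed)
        self.conntrack = set()

    def handle(self, flow: Flow, src_label: str) -> str:
        if flow in self.conntrack:
            return "FORWARDED (established)"
        if flow.reply() in self.conntrack:
            # Reply to a tracked connection: no policy rule is needed.
            return "FORWARDED (reply)"
        if (src_label, flow.dport, flow.proto) in self.allowed:
            self.conntrack.add(flow)  # remember the allowed connection
            return "FORWARDED (new)"
        return "DROPPED"


fw = StatefulFilter({("org=empire", 80, "TCP")})
request = Flow("tiefighter", 50142, "deathstar", 80, "TCP")
print(fw.handle(request, "org=empire"))          # request direction
print(fw.handle(request.reply(), "org=empire"))  # reply direction
print(fw.handle(Flow("xwing", 38870, "deathstar", 80, "TCP"), "org=alliance"))
```

Only the request direction appears in the allow list; the reply is forwarded because its reversed tuple is already in the tracking table.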
Let's follow the overall flow through the Cilium code.
Code: conntrack-based connection state tracking
1) Building the L4Policy struct from a CiliumNetworkPolicy
// cilium/pkg/policy/repository.go
func (p *Repository) resolvePolicyLocked(securityIdentity *identity.Identity) (*selectorPolicy, error) {
    ...
        // Returns whether policy enforcement applies and the list of matching rules.
        matchingRules := p.computePolicyEnforcementAndRules(securityIdentity)
    ...
    // Analyze the ingress and egress sections and record the user-defined policy as an L4Policy (builds the L4Policy struct).
    if ingressEnabled {
        newL4IngressPolicy, err := matchingRules.resolveL4IngressPolicy(&policyCtx)
        if err != nil {
            return nil, err
        }
        calculatedPolicy.L4Policy.Ingress.PortRules = newL4IngressPolicy
    }

    if egressEnabled {
        newL4EgressPolicy, err := matchingRules.resolveL4EgressPolicy(&policyCtx)
        if err != nil {
            return nil, err
        }
        calculatedPolicy.L4Policy.Egress.PortRules = newL4EgressPolicy
    }
2) Applying the generated policy to the actual PolicyMap
// cilium/pkg/endpoint/bpf.go
func (e *Endpoint) runPreCompilationSteps(regenContext *regenerationContext) (preCompilationError error) {
...
// Evaluate the stored policy and generate the endpointPolicy.
err := e.regeneratePolicy(stats, datapathRegenCtxt)
...
        // Apply the endpointPolicy to the PolicyMap (eBPF map).
        err = e.applyPolicyMapChangesLocked(regenContext, e.desiredPolicy != e.realizedPolicy)
...
}

// cilium/pkg/endpoint/policy.go
// Evaluate the stored policy and generate the endpointPolicy.
func (e *Endpoint) regeneratePolicy(stats *regenerationStatistics, datapathRegenCtxt *datapathRegenerationContext) error {
...
    //1. selectorPolicy = the policy extracted from the policy repository
    selectorPolicy, result.policyRevision, err = e.policyRepo.GetSelectorPolicy(securityIdentity, skipPolicyRevision, stats, e.GetID())
...
    //2. Distill the selectorPolicy into an endpointPolicy, a form that can be handed to BPF
    result.endpointPolicy = selectorPolicy.DistillPolicy(e.getLogger(), e, desiredRedirects)
3) When the eBPF program receives a packet: conntrack lookup and policy check against the PolicyMap
//cilium/bpf/bpf_lxc.c
static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *dst_sec_identity,
                        __s8 *ext_err)
                        ...
    switch (ct_status) {
    case CT_NEW:
    case CT_ESTABLISHED:

        // Check the connection against the PolicyMap (cilium_policy_v2).
        verdict = policy_can_egress4(ctx, &cilium_policy_v2, tuple, l4_off, SECLABEL_IPV4,
                         *dst_sec_identity, &policy_match_type, &audited,
                         ext_err, &proxy_port);
    switch (ct_status) {
    // For a new connection, create a new conntrack entry.
    case CT_NEW:
ct_recreate4:
...
        break;

    // If a conntrack entry already exists, allow the traffic (recreating the entry if needed).
    case CT_ESTABLISHED:
...
        break;

Creating the CiliumNetworkPolicy
# sw_l3_l4_policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "rule1"
spec:
  description: "L3-L4 policy to restrict deathstar access to empire ships only"
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/1.17.6/examples/minikube/sw_l3_l4_policy.yaml
ciliumnetworkpolicy.cilium.io/rule1 created

# Confirm that policy enforcement is enabled on ingress
(⎈|HomeLab:N/A) root@k8s-ctr:~# c1 endpoint list
ENDPOINT   POLICY (ingress)   POLICY (egress)   IDENTITY   LABELS (source:key[=value])                                                  IPv6   IPv4          STATUS
           ENFORCEMENT        ENFORCEMENT
...                                                                                  ready
1224       Enabled            Disabled          50122      k8s:app.kubernetes.io/name=deathstar                                                172.20.1.85   ready
                                                           k8s:class=deathstar
                                                           k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default
                                                           k8s:io.cilium.k8s.policy.cluster=default
                                                           k8s:io.cilium.k8s.policy.serviceaccount=default
                                                           k8s:io.kubernetes.pod.namespace=default
                                                           k8s:org=empire

2-1) tiefighter -> deathstar : request-landing
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec tiefighter -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing
Ship landed

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --protocol tcp --from-identity $DEATHSTARID
Jul 25 16:07:58.104: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: SYN, ACK)
Jul 25 16:07:58.106: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: ACK, PSH)
Jul 25 16:07:58.107: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-network FORWARDED (TCP Flags: ACK, FIN)
Jul 25 16:07:58.151: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: SYN, ACK)
Jul 25 16:07:58.151: default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) <> default/tiefighter (ID:62396) pre-xlate-rev TRACED (TCP)
Jul 25 16:07:58.151: default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) <> default/tiefighter (ID:62396) pre-xlate-rev TRACED (TCP)
Jul 25 16:07:58.153: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, PSH)
Jul 25 16:07:58.154: default/tiefighter:50142 (ID:62396) <- default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) to-endpoint FORWARDED (TCP Flags: ACK, FIN)
2-2) xwing -> deathstar : request-landing
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec xwing -- curl -s -XPOST deathstar.default.svc.cluster.local/v1/request-landing

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --type drop
Jul 25 16:05:58.942: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
Jul 25 16:05:59.962: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
Jul 25 16:06:00.987: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
Jul 25 16:06:02.011: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
Jul 25 16:06:03.034: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
Jul 25 16:06:04.059: default/xwing:38870 (ID:6305) <> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) Policy denied DROPPED (TCP Flags: SYN)
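The two outcomes in scenario 2 follow from plain label matching. The sketch below is a simplified model of how rule1 selects endpoints, not Cilium's policy engine; the function names are hypothetical, and the xwing labels (org=alliance, class=xwing) are an assumption taken from the upstream Star Wars demo rather than from the output shown here.

```python
def selects(selector: dict, labels: dict) -> bool:
    """True when every key/value in the selector is present in the labels."""
    return all(labels.get(key) == value for key, value in selector.items())


RULE1 = {
    "endpointSelector": {"org": "empire", "class": "deathstar"},
    "ingress": [
        {"fromEndpoints": [{"org": "empire"}], "toPorts": [("80", "TCP")]},
    ],
}


def ingress_verdict(policy, src_labels, dst_labels, port, proto) -> str:
    if not selects(policy["endpointSelector"], dst_labels):
        return "FORWARDED"  # this policy does not select the destination
    for rule in policy["ingress"]:
        from_ok = any(selects(sel, src_labels) for sel in rule["fromEndpoints"])
        if from_ok and (port, proto) in rule["toPorts"]:
            return "FORWARDED"
    return "DROPPED"  # selected endpoint, but no matching allow rule


DEATHSTAR = {"org": "empire", "class": "deathstar"}
TIEFIGHTER = {"org": "empire", "class": "tiefighter"}
XWING = {"org": "alliance", "class": "xwing"}  # assumed labels

print(ingress_verdict(RULE1, TIEFIGHTER, DEATHSTAR, "80", "TCP"))
print(ingress_verdict(RULE1, XWING, DEATHSTAR, "80", "TCP"))
```

Because the decision depends only on labels, adding or removing empire pods needs no rule change.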
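Scenario 3) Only the empire group may call deathstar, and only with an explicit request (request-landing)
Applying an L7 policy
L7 processing is handled by the cilium-envoy DaemonSet.
Official documentation reference 1
Official documentation reference 2

Unlike L3/L4, an L7 policy cannot be enforced with a simple eBPF map lookup, so Cilium is designed to integrate with Envoy Proxy for L7 processing.
When a user defines L7 rules such as an HTTP method or path in a CiliumNetworkPolicy, Cilium analyzes the policy and sends the matching traffic to Envoy Proxy.
proxy_port is the port to which the eBPF code redirects traffic; it is assigned only when an L7 policy exists.
If proxy_port is greater than 0, the connection is flagged for proxy redirection and the traffic is then redirected to Envoy Proxy, as in the code below.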
//cilium/bpf/bpf_lxc.c
        ct_state_new.proxy_redirect = *proxy_port > 0;

        /* ext_err may contain a value from __policy_can_access, and
         * ct_create6 overwrites it only if it returns an error itself.
         * As the error from __policy_can_access is dropped in that
         * case, it's OK to return ext_err from ct_create6 along with
         * its error code.
         */
        ret = ct_create6(get_ct_map6(tuple), &cilium_ct_any6_global, tuple, ctx, CT_INGRESS,
                 &ct_state_new, ext_err);
        if (IS_ERR(ret))
            return ret;
    }

    if (*proxy_port > 0)
        goto redirect_to_proxy;

...
redirect_to_proxy:
    send_trace_notify4(ctx, TRACE_TO_PROXY, src_label, SECLABEL_IPV4, orig_sip,
               bpf_ntohs(*proxy_port), ifindex, trace.reason,
               trace.monitor);
    if (tuple_out)
        *tuple_out = *tuple;
    return POLICY_ACT_PROXY_REDIRECT;
}
Envoy then inspects the traffic it receives against the policy and, if the request is allowed, forwards it to the original destination (the pod), again through eBPF.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ds -n kube-system cilium-envoy
NAME           DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
cilium-envoy   3         3         3       3            3           kubernetes.io/os=linux   23h

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl describe ds -n kube-system cilium-envoy | grep -i mount -A4
    Mounts:
      /sys/fs/bpf from bpf-maps (rw)
      /var/run/cilium/envoy/ from envoy-config (ro)
      /var/run/cilium/envoy/artifacts from envoy-artifacts (ro)
      /var/run/cilium/envoy/sockets from envoy-sockets (rw)

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- ss -xnp | grep -i envoy
u_str ESTAB 0      0                                                /var/run/cilium/envoy/sockets/admin.sock 28795            * 29729
u_str ESTAB 0      0                                                /var/run/cilium/envoy/sockets/admin.sock 28789            * 29726
u_str ESTAB 0      0                                                  /var/run/cilium/envoy/sockets/xds.sock 35039            * 35038 users:(("cilium-agent",pid=1,fd=72))


Updating the CiliumNetworkPolicy
# sw_l3_l4_l7_policy.yaml
apiVersion: "cilium.io/v2"
kind: CiliumNetworkPolicy
metadata:
  name: "rule1"
spec:
  description: "L7 policy to restrict access to specific HTTP call"
  endpointSelector:
    matchLabels:
      org: empire
      class: deathstar
  ingress:
  - fromEndpoints:
    - matchLabels:
        org: empire
    toPorts:
    - ports:
      - port: "80"
        protocol: TCP
      rules:
        http:
        - method: "POST"
          path: "/v1/request-landing"

kubectl apply -f https://raw.githubusercontent.com/cilium/cilium/1.17.6/examples/minikube/sw_l3_l4_l7_policy.yaml

(⎈|HomeLab:N/A) root@k8s-ctr:~# c0 policy get
[
  {
    "endpointSelector": {
      "matchLabels": {
        "any:class": "deathstar",
        "any:org": "empire",
        "k8s:io.kubernetes.pod.namespace": "default"
      }
    },
    "ingress": [
      {
        "fromEndpoints": [
          {
            "matchLabels": {
              "any:org": "empire",
              "k8s:io.kubernetes.pod.namespace": "default"
            }
          }
        ],
        "toPorts": [
          {
            "ports": [
              {
                "port": "80",
                "protocol": "TCP"
              }
            ],
            "rules": {
              "http": [
                {
                  "path": "/v1/request-landing",
                  "method": "POST"
                }
              ]
            }
          }
        ]
      }
    ],
    "labels": [
      {
        "key": "io.cilium.k8s.policy.derived-from",
        "value": "CiliumNetworkPolicy",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.name",
        "value": "rule1",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.namespace",
        "value": "default",
        "source": "k8s"
      },
      {
        "key": "io.cilium.k8s.policy.uid",
        "value": "c07db93d-ea58-448b-aee1-3a4701800f13",
        "source": "k8s"
      }
    ],
    "enableDefaultDeny": {
      "ingress": true,
      "egress": false
    },
    "description": "L7 policy to restrict access to specific HTTP call"
  }
]
Revision: 3
3-1) tiefighter -> deathstar : exhaust-port

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec tiefighter -- curl -s -XPUT deathstar.default.svc.cluster.local/v1/exhaust-port
Access denied

(⎈|HomeLab:N/A) root@k8s-ctr:~# hubble observe -f --pod deathstar --verdict DROPPED
Jul 26 14:37:11.689: default/tiefighter:49534 (ID:62396) -> default/deathstar-8c4c77fb7-8wdts:80 (ID:50122) http-request DROPPED (HTTP/1.1 PUT http://deathstar.default.svc.cluster.local/v1/exhaust-port)

(⎈|HomeLab:N/A) root@k8s-ctr:~# c1 monitor -v --type l7
CPU 01: [pre-xlate-rev] cgroup_id: 7275 sock_cookie: 9612, dst [172.20.1.39]:56024 tcp
<- Request http from 1899 ([k8s:app.kubernetes.io/name=tiefighter k8s:class=tiefighter k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=empire]) to 1224 ([k8s:app.kubernetes.io/name=deathstar k8s:class=deathstar k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=empire]), identity 62396->50122, verdict Denied PUT http://deathstar.default.svc.cluster.local/v1/exhaust-port => 0
<- Response http to 1899 ([k8s:app.kubernetes.io/name=tiefighter k8s:class=tiefighter k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=empire]) from 1224 ([k8s:app.kubernetes.io/name=deathstar k8s:class=deathstar k8s:io.cilium.k8s.namespace.labels.kubernetes.io/metadata.name=default k8s:io.cilium.k8s.policy.cluster=default k8s:io.cilium.k8s.policy.serviceaccount=default k8s:io.kubernetes.pod.namespace=default k8s:org=empire]), identity 50122->62396, verdict Forwarded PUT http://deathstar.default.svc.cluster.local/v1/exhaust-port => 403
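The L7 decision seen in the monitor output can be approximated as follows. This is a rough sketch rather than Envoy's implementation: `l7_verdict` is a hypothetical helper, and treating `method` and `path` as fully anchored regular expressions is an assumption about how the rule fields are matched.

```python
import re

# The http rules from sw_l3_l4_l7_policy.yaml.
HTTP_RULES = [{"method": "POST", "path": "/v1/request-landing"}]


def l7_verdict(method: str, path: str, rules=HTTP_RULES) -> str:
    """Return the proxy's verdict for one HTTP request."""
    for rule in rules:
        if re.fullmatch(rule["method"], method) and re.fullmatch(rule["path"], path):
            return "Forwarded"
    # No rule matched: the proxy answers with 403 itself.
    return "Denied (403)"


print(l7_verdict("POST", "/v1/request-landing"))
print(l7_verdict("PUT", "/v1/exhaust-port"))
```

The denied request still passed the L3/L4 check (tiefighter carries org=empire); only the HTTP method and path were rejected, which is why the client receives "Access denied" instead of a timeout.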



Cilium - Basic Cilium Installation and Connectivity Check (2)
Sat, 19 Jul 2025 16:54:55 GMT
Migration to Cilium
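Many Kubernetes clusters today run a traditional CNI (Container Network Interface) plugin such as Flannel or Calico.
As eBPF-based high-performance networking has gained attention, however, more teams are considering switching their CNI plugin to Cilium. The following is a simple demo of migrating from Flannel and from Calico to Cilium.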
Flannel to Cilium
Inspecting the existing Flannel cluster
(⎈|HomeLab:N/A) root@k8s-ctr:~# helm list -A
NAME       NAMESPACE       REVISION    UPDATED                                    STATUS      CHART              APP VERSION
flannel    kube-flannel    1           2025-07-19 19:42:16.187290147 +0900 KST    deployed    flannel-v0.27.1    v0.27.1

(⎈|HomeLab:N/A) root@k8s-ctr:~# tree /opt/cni/bin/ | grep flannel
├── flannel

(⎈|HomeLab:N/A) root@k8s-ctr:~# tree /etc/cni/net.d/
/etc/cni/net.d/
└── 10-flannel.conflist

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat /etc/cni/net.d/10-flannel.conflist | jq
{
  "name": "cbr0",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "type": "flannel",
      "delegate": {
        "hairpinMode": true,
        "isDefaultGateway": true
      }
    },
    {
      "type": "portmap",
      "capabilities": {
        "portMappings": true
      }
    }
  ]
}

(⎈|HomeLab:N/A) root@k8s-ctr:~# kc describe cm -n kube-flannel kube-flannel-cfg
...
Data
====
net-conf.json:
----
{
  "Network": "10.244.0.0/16",
  "Backend": {
    "Type": "vxlan"
  }
}

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route | grep 10.244.
10.244.0.0/24 dev cni0 proto kernel scope link src 10.244.0.1
10.244.1.0/24 via 10.244.1.0 dev flannel.1 onlink
10.244.3.0/24 via 10.244.3.0 dev flannel.1 onlink

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get nodes
NAME      STATUS   ROLES           AGE     VERSION
k8s-ctr   Ready    control-plane   3h41m   v1.33.2
k8s-w1    Ready    <none>          3h40m   v1.33.2
k8s-w2    Ready    <none>          3h39m   v1.33.2

(⎈|HomeLab:N/A) root@k8s-ctr:~# brctl show
bridge name    bridge id        STP enabled    interfaces
cni0        8000.5e70c06d50fe    no        veth10a6853e
                            veth75d3a172
                            vethbf10bd3c

(⎈|HomeLab:N/A) root@k8s-ctr:~# iptables -t nat -S | wc -l
77

(⎈|HomeLab:N/A) root@k8s-ctr:~# iptables -t filter -S | wc -l
30

(⎈|HomeLab:N/A) root@k8s-ctr:~# iptables -t nat -S > nat_flannel
(⎈|HomeLab:N/A) root@k8s-ctr:~# iptables -t filter -S > filter_flannel

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pods -o wide -A
NAMESPACE      NAME                              READY   STATUS    RESTARTS   AGE     IP               NODE      NOMINATED NODE   READINESS GATES
default        curl-pod                          1/1     Running   0          91m     10.244.0.4       k8s-ctr   <none>           <none>
default        webpod-697b545f57-45wbp           1/1     Running   0          91m     10.244.1.2       k8s-w1    <none>           <none>
default        webpod-697b545f57-bvz97           1/1     Running   0          91m     10.244.3.2       k8s-w2    <none>           <none>
kube-flannel   kube-flannel-ds-6jj2z             1/1     Running   0          93m     192.168.10.102   k8s-w2    <none>           <none>
kube-flannel   kube-flannel-ds-8lnj8             1/1     Running   0          93m     192.168.10.101   k8s-w1    <none>           <none>
kube-flannel   kube-flannel-ds-l69mp             1/1     Running   0          93m     192.168.10.100   k8s-ctr   <none>           <none>
kube-system    coredns-674b8bbfcf-sk67g          1/1     Running   0          3h41m   10.244.0.2       k8s-ctr   <none>           <none>
kube-system    coredns-674b8bbfcf-xlw52          1/1     Running   0          3h41m   10.244.0.3       k8s-ctr   <none>           <none>
kube-system    etcd-k8s-ctr                      1/1     Running   0          3h41m   192.168.10.100   k8s-ctr   <none>           <none>
kube-system    kube-apiserver-k8s-ctr            1/1     Running   0          3h41m   192.168.10.100   k8s-ctr   <none>           <none>
kube-system    kube-controller-manager-k8s-ctr   1/1     Running   0          3h41m   192.168.10.100   k8s-ctr   <none>           <none>
kube-system    kube-proxy-dn95s                  1/1     Running   0          3h41m   192.168.10.100   k8s-ctr   <none>           <none>
kube-system    kube-proxy-kr4xv                  1/1     Running   0          3h40m   192.168.10.101   k8s-w1    <none>           <none>
kube-system    kube-proxy-ldx7j                  1/1     Running   0          3h39m   192.168.10.102   k8s-w2    <none>           <none>
kube-system    kube-scheduler-k8s-ctr            1/1     Running   0          3h41m   192.168.10.100   k8s-ctr   <none>           <none>
Removing the existing Flannel CNI
# helm uninstall -n kube-flannel flannel
# helm list -A

# kubectl get all -n kube-flannel
# kubectl delete ns kube-flannel

# kubectl get pod -A -owide

Removing the virtual interfaces
# ip link del flannel.1
# ip link del cni0

Verifying removal
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c link
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
2: eth0:  mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:71:19:d8 brd ff:ff:ff:ff:ff:ff
    altname enp0s8
3: eth1:  mtu 1500 qdisc fq_codel state UP mode DEFAULT group default qlen 1000
    link/ether 08:00:27:d8:a8:88 brd ff:ff:ff:ff:ff:ff
    altname enp0s9
6: veth75d3a172@if2:  mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 5e:94:82:fd:f5:19 brd ff:ff:ff:ff:ff:ff link-netns cni-51adc222-7922-b776-8c89-11b2530104a7
7: vethbf10bd3c@if2:  mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether 0e:f0:10:d4:33:65 brd ff:ff:ff:ff:ff:ff link-netns cni-ad9cc013-1136-0f8d-3bdc-8335414f68b8
8: veth10a6853e@if2:  mtu 1450 qdisc noqueue state UP mode DEFAULT group default qlen 1000
    link/ether aa:22:ce:ca:55:5d brd ff:ff:ff:ff:ff:ff link-netns cni-70f28269-5810-35aa-459d-42d545727521

(⎈|HomeLab:N/A) root@k8s-ctr:~# brctl show

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route
default via 10.0.2.2 dev eth0 proto dhcp src 10.0.2.15 metric 100
10.0.2.0/24 dev eth0 proto kernel scope link src 10.0.2.15 metric 100
10.0.2.2 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
10.0.2.3 dev eth0 proto dhcp scope link src 10.0.2.15 metric 100
192.168.10.0/24 dev eth1 proto kernel scope link src 192.168.10.100
Removing the existing kube-proxy and checking per-node pod IPAM (pod CIDR)
When kube-controller-manager runs with --allocate-node-cidrs=true, it automatically assigns each node a pod CIDR carved out of the range given by the --cluster-cidr flag.
# kubectl -n kube-system delete ds kube-proxy
# kubectl -n kube-system delete cm kube-proxy

# iptables-save | grep -v KUBE | grep -v FLANNEL | iptables-restore
# iptables-save

Verifying removal
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
k8s-ctr    10.244.0.0/24
k8s-w1    10.244.1.0/24
k8s-w2    10.244.3.0/24

(⎈|HomeLab:N/A) root@k8s-ctr:~# kc describe pod -n kube-system kube-controller-manager-k8s-ctr | grep -e cidr -e 'service-cluster-ip-range'
      --allocate-node-cidrs=true
      --cluster-cidr=10.244.0.0/16
      --service-cluster-ip-range=10.96.0.0/16

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE    IP           NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          105m   10.244.0.4   k8s-ctr   <none>           <none>
webpod-697b545f57-45wbp   1/1     Running   0          105m   10.244.1.2   k8s-w1    <none>           <none>
webpod-697b545f57-bvz97   1/1     Running   0          105m   10.244.3.2   k8s-w2    <none>           <none>
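The relationship between --cluster-cidr, the per-node pod CIDRs, and the pod IPs above can be checked with Python's standard ipaddress module. This is a small sketch over the values printed in this post; the /24 node mask is read from the podCIDR output, and the helper names are hypothetical.

```python
import ipaddress

CLUSTER_CIDR = ipaddress.ip_network("10.244.0.0/16")
NODE_MASK = 24  # taken from the podCIDR values shown above

NODE_CIDRS = {
    "k8s-ctr": "10.244.0.0/24",
    "k8s-w1": "10.244.1.0/24",
    "k8s-w2": "10.244.3.0/24",
}
PODS = {
    "curl-pod": ("k8s-ctr", "10.244.0.4"),
    "webpod-697b545f57-45wbp": ("k8s-w1", "10.244.1.2"),
    "webpod-697b545f57-bvz97": ("k8s-w2", "10.244.3.2"),
}


def node_cidr_capacity(cluster: ipaddress.IPv4Network, mask: int) -> int:
    """How many node-sized CIDRs fit into the cluster CIDR."""
    return 2 ** (mask - cluster.prefixlen)


def pod_in_node_cidr(node: str, ip: str) -> bool:
    return ipaddress.ip_address(ip) in ipaddress.ip_network(NODE_CIDRS[node])


print(node_cidr_capacity(CLUSTER_CIDR, NODE_MASK))
for cidr in NODE_CIDRS.values():
    print(cidr, ipaddress.ip_network(cidr).subnet_of(CLUSTER_CIDR))
for name, (node, ip) in PODS.items():
    print(name, pod_in_node_cidr(node, ip))
```

These node.spec.podCIDR values stay on the Node objects after Flannel is removed; the Cilium install that follows uses cluster-pool IPAM with its own 172.20.0.0/16 range instead.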
Installing Cilium
# helm repo add cilium https://helm.cilium.io/

# 
helm install cilium cilium/cilium --version 1.17.5 --namespace kube-system \
--set k8sServiceHost=192.168.10.100 --set k8sServicePort=6443 \
--set kubeProxyReplacement=true \
--set routingMode=native \
--set autoDirectNodeRoutes=true \
--set ipam.mode="cluster-pool" \
--set ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} \
--set ipv4NativeRoutingCIDR=172.20.0.0/16 \
--set endpointRoutes.enabled=true \
--set installNoConntrackIptablesRules=true \
--set bpf.masquerade=true \
--set ipv6.enabled=false

Option | Description
--- | ---
kubeProxyReplacement=true | Cilium fully replaces kube-proxy and performs the kube-proxy functions itself.
routingMode=native | Sets Cilium's routing mode to native (uses the Linux kernel's native routing).
autoDirectNodeRoutes=true | Automatically installs direct routes for traffic between nodes.
ipam.mode="cluster-pool" | Sets the IP allocation mode to cluster-pool (Cilium manages pod IP allocation itself).
ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"} | Sets the IPv4 CIDR pool used for Pods in the cluster.
ipv4NativeRoutingCIDR=172.20.0.0/16 | Sets the IPv4 CIDR range that is treated as natively routable.
endpointRoutes.enabled=true | Creates a separate route for each Pod to optimize the network path.
installNoConntrackIptablesRules=true | Installs iptables rules that let pod traffic skip netfilter connection tracking (the eBPF path is preferred).
bpf.masquerade=true | Performs IP masquerading with eBPF (in place of iptables-based NAT).

Verifying the installation
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumnodes -o json | grep podCIDRs -A2
                    "podCIDRs": [
                        "172.20.0.0/24"
                    ],
--
                    "podCIDRs": [
                        "172.20.1.0/24"
                    ],
--
                    "podCIDRs": [
                        "172.20.2.0/24"
                    ],
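The three podCIDRs above are consistent with slicing the cluster pool into per-node blocks. The sketch below assumes the default per-node mask size of 24 for cluster-pool IPAM, which matches the output but was not set explicitly in the helm command. `needs_masquerade` is a hypothetical, simplified reading of ipv4NativeRoutingCIDR that ignores any other exclusions Cilium applies.

```python
import ipaddress
from itertools import islice

POOL = ipaddress.ip_network("172.20.0.0/16")                 # clusterPoolIPv4PodCIDRList
NATIVE_ROUTING_CIDR = ipaddress.ip_network("172.20.0.0/16")  # ipv4NativeRoutingCIDR
MASK_SIZE = 24  # assumed default per-node mask size


def first_node_cidrs(count: int) -> list:
    """First `count` per-node blocks carved out of the pool."""
    return [str(net) for net in islice(POOL.subnets(new_prefix=MASK_SIZE), count)]


def needs_masquerade(dst_ip: str) -> bool:
    """Simplified rule: destinations outside the native-routing CIDR are masqueraded."""
    return ipaddress.ip_address(dst_ip) not in NATIVE_ROUTING_CIDR


print(first_node_cidrs(3))
print(needs_masquerade("172.20.2.150"))  # pod on another node
print(needs_masquerade("8.8.8.8"))       # external destination
```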

Checking IPs after redeploying the pods
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -owide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          33s   172.20.0.150   k8s-ctr   <none>           <none>
webpod-85ccc4b7dd-qphtq   1/1     Running   0          74s   172.20.2.150   k8s-w2    <none>           <none>
webpod-85ccc4b7dd-tqc9r   1/1     Running   0          71s   172.20.1.165   k8s-w1    <none>           <none>

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumendpoints
NAME                      SECURITY IDENTITY   ENDPOINT STATE   IPV4           IPV6
curl-pod                  16024               ready            172.20.0.150
webpod-85ccc4b7dd-qphtq   47031               ready            172.20.2.150
webpod-85ccc4b7dd-tqc9r   47031               ready            172.20.1.165

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg endpoint list
Calico to Cilium
Installing Calico
#
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm repo update
helm install calico projectcalico/tigera-operator \
  --namespace tigera-operator \
  --create-namespace

cat <<EOF > installation.yaml
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    ipPools:
    - blockSize: 26
      cidr: 10.244.0.0/16
      encapsulation: VXLAN
      natOutgoing: Enabled
      nodeSelector: all()
EOF

kubectl apply -f installation.yaml
Inspecting the existing Calico cluster
(⎈|HomeLab:N/A) root@k8s-ctr:~# helm list -A
NAME      NAMESPACE          REVISION    UPDATED                                    STATUS      CHART                      APP VERSION
calico    tigera-operator    1           2025-07-19 22:45:44.341876006 +0900 KST    deployed    tigera-operator-v3.30.2    v3.30.2

(⎈|HomeLab:N/A) root@k8s-ctr:~# tree /opt/cni/bin/ | grep calico
├── calico
├── calico-ipam

(⎈|HomeLab:N/A) root@k8s-ctr:~# tree /etc/cni/net.d/
/etc/cni/net.d/
├── 10-calico.conflist
└── calico-kubeconfig

(⎈|HomeLab:N/A) root@k8s-ctr:~# cat /etc/cni/net.d/10-calico.conflist | jq
{
  "name": "k8s-pod-network",
  "cniVersion": "0.3.1",
  "plugins": [
    {
      "container_settings": {
        "allow_ip_forwarding": false
      },
      "datastore_type": "kubernetes",
      "endpoint_status_dir": "/var/run/calico/endpoint-status",
      "ipam": {
        "assign_ipv4": "true",
        "assign_ipv6": "false",
        "type": "calico-ipam"
      },
      "kubernetes": {
        "k8s_api_root": "https://10.96.0.1:443",
        "kubeconfig": "/etc/cni/net.d/calico-kubeconfig"
      },
      "log_file_max_age": 30,
      "log_file_max_count": 10,
      "log_file_max_size": 100,
      "log_file_path": "/var/log/calico/cni/cni.log",
      "log_level": "Info",
      "mtu": 0,
      "nodename_file_optional": false,
      "policy": {
        "type": "k8s"
      },
      "policy_setup_timeout_seconds": 0,
      "type": "calico"
    },
    {
      "capabilities": {
        "portMappings": true
      },
      "snat": true,
      "type": "portmap"
    }
  ]
}

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route | grep 10.244.
10.244.46.0/26 via 10.244.46.8 dev vxlan.calico onlink
10.244.46.8 dev vxlan.calico scope link
blackhole 10.244.78.64/26 proto 80
10.244.78.65 dev cali4aab08373ba scope link
10.244.78.66 dev cali33014ef5b3d scope link
10.244.228.64/26 via 10.244.228.66 dev vxlan.calico onlink
10.244.228.66 dev vxlan.calico scope link

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide -A | grep 10.244 | grep ctr
calico-system      csi-node-driver-mvzvw                      2/2     Running   0          12m   10.244.78.65     k8s-ctr   <none>           <none>
calico-system      whisker-7f5b7c657b-975qr                   2/2     Running   0          12m   10.244.78.66     k8s-ctr   <none>           <none>

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip link show vxlan.calico
9: vxlan.calico:  mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
    link/ether 66:65:25:b2:65:0a brd ff:ff:ff:ff:ff:ff

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip link show | grep cali
4: cali4aab08373ba@if2:  mtu 1500 qdisc noqueue state UP mode DEFAULT group default qlen 1000
5: cali33014ef5b3d@if2:  mtu 1480 qdisc noqueue state UP mode DEFAULT group default qlen 1000
9: vxlan.calico:  mtu 1450 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000

Deploying a sample Pod and Service
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pods -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          10s   10.244.78.68    k8s-ctr   <none>           <none>
webpod-697b545f57-fbmmm   1/1     Running   0          10s   10.244.228.67   k8s-w1    <none>           <none>
webpod-697b545f57-fg6bg   1/1     Running   0          10s   10.244.46.9     k8s-w2    <none>           <none>
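With blockSize: 26, Calico IPAM hands out addresses in /26 blocks, which is why the routing table earlier lists /26 prefixes. A short sketch with the standard ipaddress module maps each sample pod IP back to its block; `block_of` is a hypothetical helper name.

```python
import ipaddress


def block_of(ip: str, block_size: int = 26) -> ipaddress.IPv4Network:
    """Return the IPAM block of the given size that contains the IP."""
    return ipaddress.ip_network(f"{ip}/{block_size}", strict=False)


for pod_ip in ("10.244.78.68", "10.244.228.67", "10.244.46.9"):
    block = block_of(pod_ip)
    print(pod_ip, "->", block, f"({block.num_addresses} addresses)")
```

The three results match the routes seen on k8s-ctr: the local block 10.244.78.64/26 (the blackhole route) and the two remote blocks reached through vxlan.calico.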
Removing the existing Calico CNI
# helm uninstall -n tigera-operator calico
# helm list -A

# kubectl get all -n calico-system
# kubectl delete ns tigera-operator

# ip link del vxlan.calico
Removing the existing kube-proxy and checking per-node pod IPAM (pod CIDR)
# kubectl -n kube-system delete ds kube-proxy
# kubectl -n kube-system delete cm kube-proxy

# iptables-save | grep -v KUBE | grep -vi cali | iptables-restore
# iptables-save

# reboot

Verifying removal
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.podCIDR}{"\n"}{end}'
k8s-ctr    10.244.0.0/24
k8s-w1    10.244.1.0/24
k8s-w2    10.244.2.0/24

(⎈|HomeLab:N/A) root@k8s-ctr:~# kc describe pod -n kube-system kube-controller-manager-k8s-ctr | grep -e cidr -e 'service-cluster-ip-range'
      --allocate-node-cidrs=true
      --cluster-cidr=10.244.0.0/16
      --service-cluster-ip-range=10.96.0.0/16

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          21m   10.244.78.68    k8s-ctr   <none>           <none>
webpod-697b545f57-fbmmm   1/1     Running   0          21m   10.244.228.67   k8s-w1    <none>           <none>
webpod-697b545f57-fg6bg   1/1     Running   0          21m   10.244.46.9     k8s-w2    <none>           <none>
Installing Cilium
The steps are the same as in the Flannel to Cilium migration above, so they are omitted here.
Verifying the installation
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumnodes -o json | grep podCIDRs -A2
                    "podCIDRs": [
                        "172.20.0.0/24"
                    ],
--
                    "podCIDRs": [
                        "172.20.2.0/24"
                    ],
--
                    "podCIDRs": [
                        "172.20.1.0/24"
                    ],

Checking IPs after redeploying the pods
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -owide
NAME                      READY   STATUS    RESTARTS   AGE   IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0          20s   172.20.0.6     k8s-ctr   <none>           <none>
webpod-659cd747f8-v9xkm   1/1     Running   0          64s   172.20.1.172   k8s-w2    <none>           <none>
webpod-659cd747f8-zqbk8   1/1     Running   0          67s   172.20.2.151   k8s-w1    <none>           <none>

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumendpoints
NAME                      SECURITY IDENTITY   ENDPOINT STATE   IPV4           IPV6
curl-pod                  61805               ready            172.20.0.6
webpod-659cd747f8-v9xkm   11362               ready            172.20.1.172
webpod-659cd747f8-zqbk8   11362               ready            172.20.2.151

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg endpoint list



AWS EKS : VPC CNI
Sat, 02 Nov 2024 19:16:38 GMT
AWS VPC CNI
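AWS VPC CNI is the network plugin that lets Pods on Amazon EKS use VPC IP addresses directly.
As a result, traffic between Pods and other networks flows natively inside the VPC, with no additional address translation.
A key property of VPC CNI is that Pods use the same IP range as the node network (the VPC subnet).
Each Pod can talk directly to other resources in the VPC without a separate NAT or proxy. Because Pods and nodes share the same range, traffic is routed directly, unlike typical K8s CNIs that communicate over an overlay (VXLAN, IP-IP, etc.).
AWS VPC CNI has two main components: the CNI binary and IP address management (IPAM).
CNI Binary
The CNI binary creates and deletes the network interface of a Pod.
When a Pod starts, it assigns the Pod an IP attached to an ENI (Elastic Network Interface).
When a Pod terminates, it cleans up the network resources.
This enables smooth communication between Pods and other networks, and ensures that every Pod uses an IP from the VPC network range.
IP Address Management (L-IPAM)
IPAM manages the IP addresses within the VPC subnet that are handed out to Pods.
It reserves several IPs in advance as a pool on the ENIs attached to the node and assigns one immediately when a Pod is created.
Unused IPs are reused or returned to save network resources.
This is managed dynamically and adjusts to the state of the VPC subnet and the node's resources.
Through these two components, AWS VPC CNI efficiently supports Pod-to-Pod communication and the management of network resources in the VPC.

Because IPs are held in advance, a Pod can be given an IP from the warm pool right away, so Pods start faster and the number of AWS API calls goes down.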
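The warm-pool idea can be sketched as a toy model. This is only an illustration, not the actual ipamd implementation: the `WarmPool` class, its method names, and the `api_calls` counter are all hypothetical.

```python
from collections import deque


class WarmPool:
    """Toy model of a per-node pool of pre-allocated Pod IPs (hypothetical)."""

    def __init__(self, addresses):
        self.free = deque(addresses)   # IPs already attached to the node's ENIs
        self.assigned = {}             # pod name -> IP
        self.api_calls = 0             # calls that would go to the cloud API

    def refill(self, addresses):
        # Hypothetical stand-in for one batched API call that attaches more IPs.
        self.api_calls += 1
        self.free.extend(addresses)

    def assign(self, pod):
        # Assignment is a purely local operation: no API call is needed.
        if not self.free:
            raise RuntimeError("warm pool is empty; refill before assigning")
        ip = self.free.popleft()
        self.assigned[pod] = ip
        return ip

    def release(self, pod):
        # Return the IP to the pool so a later Pod can reuse it.
        self.free.append(self.assigned.pop(pod))
```

In this model, starting a Pod only touches local state, which is why a pre-filled pool shortens Pod startup and keeps the API call count low.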

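Pod communication
Checking the basic network configuration
Let's check the basic network configuration of the node that was created (t3.medium).

There are currently 2 ENIs, and each ENI can hold 5 secondary private IPs in addition to its own primary IP.
The coredns pod uses a veth pair: the eniY@ifN interface on the host is connected to eth0 inside the pod.
After deploying pods, let's check the routing and IP information on each worker node.


After the pods are created, an eniY@ifN interface is added on the worker node and a matching entry appears in the routing table.
Pod-to-pod communication across nodes
As explained above, with AWS VPC CNI Pods use the same IP range as the node network, so pods can talk to each other directly in a VPC-native way, without any overlay technology.

While Pod1 talks to Pod2, capture packets on eth0 of the nodes where the pods are running.

The capture shows that, even though the pods are on different nodes, they communicate directly with no NAT.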
[ec2-user@ip-192-168-1-193 ~]$ ip route show table main
default via 192.168.1.1 dev eth0
169.254.169.254 dev eth0
192.168.1.0/24 dev eth0 proto kernel scope link src 192.168.1.193
192.168.1.72 dev eni6376c4bd6b9 scope link
The node's routing table contains the entry default via 192.168.1.1 dev eth0, which means that packets not destined for the local network are sent out the eth0 interface via the default gateway (192.168.1.1).
Because this is the default route, all traffic from this server to external networks or the internet leaves through eth0.
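How the kernel picks one of these routes can be sketched with a longest-prefix match. This is a minimal model using the four routes printed above; the `ROUTES` list and the `lookup` helper are illustrative names, and the real kernel lookup also considers policy rules and other routing tables.

```python
import ipaddress

# Routes copied from the `ip route show table main` output above.
ROUTES = [
    ("0.0.0.0/0", "eth0"),                  # default via 192.168.1.1
    ("169.254.169.254/32", "eth0"),
    ("192.168.1.0/24", "eth0"),
    ("192.168.1.72/32", "eni6376c4bd6b9"),  # host route to a local pod
]


def lookup(dst):
    """Return the outgoing device for dst using longest-prefix match."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(prefix), dev)
        for prefix, dev in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    # The most specific (longest) prefix wins.
    return max(matches, key=lambda m: m[0].prefixlen)[1]


print(lookup("192.168.1.72"))  # eni6376c4bd6b9
print(lookup("8.8.8.8"))       # eth0
```

The /32 host route is what steers traffic for the local pod into its eni interface, while everything else falls through to the subnet route or the default route on eth0.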
Pod-to-external communication

While a pod talks to an external address, capture packets on eth0 of the worker node where the pod runs.

When a pod talks to the outside, it is SNATed by the 'AWS-SNAT-CHAIN-0' rules.
Let's look at those rules.
[ec2-user@ip-192-168-2-90 ~]$ sudo iptables -t nat -S | grep 'A AWS-SNAT-CHAIN'
-A AWS-SNAT-CHAIN-0 -d 192.168.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j RETURN
-A AWS-SNAT-CHAIN-0 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 192.168.2.90 --random-fully
-A AWS-SNAT-CHAIN-0 -d 192.168.0.0/16 -m comment --comment "AWS SNAT CHAIN" -j RETURN
This rule excludes packets whose destination IP is within 192.168.0.0/16 from SNAT. In other words, traffic that stays inside the VPC CIDR is not SNATed.
-A AWS-SNAT-CHAIN-0 ! -o vlan+ -m comment --comment "AWS, SNAT" -m addrtype ! --dst-type LOCAL -j SNAT --to-source 192.168.2.90 --random-fully
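This rule applies SNAT to traffic that leaves through any interface other than vlan+ and whose destination is not a local address. For traffic leaving the VPC, the source IP is rewritten to the node IP 192.168.2.90, so management and routing stay consistent.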
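The combined effect of the two rules can be sketched as a small decision function. This is a simplified model: it ignores the `! -o vlan+` and `! --dst-type LOCAL` conditions, and `source_after_snat` is an illustrative name, not part of any AWS tooling.

```python
import ipaddress

VPC_CIDR = ipaddress.ip_network("192.168.0.0/16")  # from the first rule
NODE_IP = "192.168.2.90"                           # --to-source in the second rule


def source_after_snat(pod_ip, dst_ip):
    """Return the source address a packet carries after AWS-SNAT-CHAIN-0."""
    if ipaddress.ip_address(dst_ip) in VPC_CIDR:
        return pod_ip      # rule 1: RETURN, the pod IP is kept
    return NODE_IP         # rule 2: SNAT to the node IP


print(source_after_snat("192.168.2.100", "192.168.1.72"))  # 192.168.2.100
print(source_after_snat("192.168.2.100", "8.8.8.8"))       # 192.168.2.90
```

The pod address 192.168.2.100 is made up for the example; only the CIDR and the node IP come from the rules shown above.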
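Limit on the number of pods per node
AWS limits the number of pods that can be created on a worker node according to its instance type.
The limit is derived from the maximum number of ENIs and the maximum number of IPs per ENI for the instance type. The aws-node and kube-proxy pods use the host's IP, so they do not consume pod IPs and are excluded from the IP-based count.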
(bgrtest@myeks:default) [root@myeks-bastion ~]# aws ec2 describe-instance-types --filters Name=instance-type,Values=t3.* \
>  --query "InstanceTypes[].{Type: InstanceType, MaxENI: NetworkInfo.MaximumNetworkInterfaces, IPv4addr: NetworkInfo.Ipv4AddressesPerInterface}" \
>  --output table
--------------------------------------
|        DescribeInstanceTypes       |
+----------+----------+--------------+
| IPv4addr | MaxENI   |    Type      |
+----------+----------+--------------+
|  12      |  3       |  t3.large    |
|  6       |  3       |  t3.medium   |
|  15      |  4       |  t3.xlarge   |
|  15      |  4       |  t3.2xlarge  |
|  2       |  2       |  t3.micro    |
|  2       |  2       |  t3.nano     |
|  4       |  3       |  t3.small    |
+----------+----------+--------------+

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl describe node | grep Allocatable: -A6
Allocatable:
  cpu:                1930m
  ephemeral-storage:  27905944324
  hugepages-1Gi:      0
  hugepages-2Mi:      0
  memory:             3388312Ki
  pods:               17
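The value pods: 17 can be reproduced from the t3.medium row of the table above. The sketch below uses the commonly cited formula for secondary-IP mode, ENIs x (IPs per ENI - 1) + 2; it assumes features such as prefix delegation are not in use, and `max_pods` is an illustrative name.

```python
def max_pods(max_eni, ipv4_per_eni):
    """Pod limit for a node that assigns secondary ENI IPs to pods."""
    # Each ENI keeps one address as its own primary IP; the rest go to pods.
    pod_ips = max_eni * (ipv4_per_eni - 1)
    # Add 2 for the host-network pods (aws-node, kube-proxy).
    return pod_ips + 2


print(max_pods(3, 6))   # t3.medium -> 17
print(max_pods(3, 12))  # t3.large  -> 35
```

With 3 worker nodes of type t3.medium, at most 3 x 17 pods fit in the cluster, including the system pods already running.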
Let's see what happens when more pods than the limit are created.
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl scale deployment nginx-deployment --replicas=50
deployment.apps/nginx-deployment scaled


Once the limit is exceeded, the extra pods stay in Pending and never start.
You can also see that pod IPs on the worker nodes are allocated only up to the limit.

Service & AWS LoadBalancer Controller
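To expose Kubernetes Service traffic externally on AWS, you can use the AWS Load Balancer Controller together with a Network Load Balancer (NLB).
AWS Load Balancer Controller
AWS Load Balancer Controller is a controller that automatically connects Service traffic in a Kubernetes cluster to an AWS Network Load Balancer (NLB). When a Service of type LoadBalancer is created, it provisions an NLB automatically and serves the Service traffic through it.
NLB
NLB has two target modes: instance type and IP type.

Instance type


An instance-type load balancer uses Amazon EC2 instances as targets.
As with the older Classic Load Balancer (CLB), targets are registered by EC2 instance ID; in a Kubernetes cluster, traffic then reaches pods through the node's NodePort.
Traffic is distributed according to instance health, and an instance that is stopped or terminated is automatically removed from the load balancer.
Unless an Elastic IP (EIP) is used, IP addresses are assigned dynamically, so an instance's IP address can change.


IP type


An IP-type load balancer uses specific IP addresses as targets, so it can balance across not only instances but also other IP targets such as on-premises servers.
It is mainly used with Application Load Balancer (ALB) and Network Load Balancer (NLB), and lets the load balancer distribute traffic to IP addresses outside the VPC as well.
Each target IP can be an Elastic IP or a static IP, which is useful when traffic must reach a fixed target server whose address does not change.
IP-based load balancing also works well in containerized environments and serverless architectures.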

Deploying AWS LoadBalancer Controller and a Service/Pod
Let's deploy AWS LoadBalancer Controller and check that connections are distributed.
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get crd
NAME                                         CREATED AT
cninodes.vpcresources.k8s.aws                2024-11-02T13:55:05Z
eniconfigs.crd.k8s.amazonaws.com             2024-11-02T13:58:33Z
ingressclassparams.elbv2.k8s.aws             2024-11-02T18:31:35Z
policyendpoints.networking.k8s.aws           2024-11-02T13:55:06Z
securitygrouppolicies.vpcresources.k8s.aws   2024-11-02T13:55:05Z
targetgroupbindings.elbv2.k8s.aws            2024-11-02T18:31:35Z


(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get deployment -n kube-system aws-load-balancer-controller
NAME                           READY   UP-TO-DATE   AVAILABLE   AGE
aws-load-balancer-controller   2/2     2            2           5m10s
Every 2.0s: kubectl get pod,svc,ep                                                                             Sun Nov  3 03:37:11 2024

NAME                                READY   STATUS    RESTARTS   AGE
pod/deploy-echo-857b6cfb88-mgdcm    1/1     Running   0          5m12s
pod/deploy-echo-857b6cfb88-z9qv7    1/1     Running   0          5m12s
pod/netshoot-pod-74b7555dc7-7hpkt   1/1     Running   0          175m
pod/netshoot-pod-74b7555dc7-mjs77   1/1     Running   0          175m
pod/netshoot-pod-74b7555dc7-s76j2   1/1     Running   0          175m

NAME                      TYPE           CLUSTER-IP     EXTERNAL-IP                                                                         PORT(S)        AGE
service/kubernetes        ClusterIP      10.100.0.1                                                                                         443/TCP        4h42m
service/svc-nlb-ip-type   LoadBalancer   10.100.12.82   k8s-default-svcnlbip-4588d22da5-bd009cf4bd83d981.elb.ap-northeast-2.amazonaws.com   80:31013/TCP   5m12s

NAME                        ENDPOINTS                               AGE
endpoints/kubernetes        192.168.2.143:443,192.168.3.204:443     4h42m
endpoints/svc-nlb-ip-type   192.168.1.170:8080,192.168.3.113:8080   5m12s

(bgrtest@myeks:default) [root@myeks-bastion ~]# for i in {1..100}; do curl -s $NLB | grep Hostname ; done | sort | uniq -c | sort -nr
     50 Hostname: deploy-echo-857b6cfb88-z9qv7
     50 Hostname: deploy-echo-857b6cfb88-mgdcm
Distribution keeps working correctly when the number of pods changes.
At that point, make sure that all NLB targets have been registered and are healthy.
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl scale deployment deploy-echo --replicas=1

deployment.apps/deploy-echo scaled

(bgrtest@myeks:default) [root@myeks-bastion ~]# for i in {1..100}; do curl -s $NLB | grep Hostname ; done | sort | uniq -c | sort -nr
    100 Hostname: deploy-echo-857b6cfb88-mgdcm

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl scale deployment deploy-echo --replicas=3
deployment.apps/deploy-echo scaled


# NLB targets are still in the initial state: the new targets have not been registered yet
(bgrtest@myeks:default) [root@myeks-bastion ~]# for i in {1..100}; do curl -s $NLB | grep Hostname ; done | sort | uniq -c | sort -nr
    100 Hostname: deploy-echo-857b6cfb88-mgdcm

(bgrtest@myeks:default) [root@myeks-bastion ~]# for i in {1..100}; do curl -s $NLB | grep Hostname ; done | sort | uniq -c | sort -nr
     41 Hostname: deploy-echo-857b6cfb88-cqt65
     31 Hostname: deploy-echo-857b6cfb88-mgdcm
     28 Hostname: deploy-echo-857b6cfb88-znshb     
Ingress
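AWS Load Balancer Controller integrates with the AWS Application Load Balancer (ALB) in Kubernetes and manages traffic through Ingress resources. When running in IP mode with AWS VPC CNI, the ALB routes traffic directly to the individual Pod IPs behind a Kubernetes Service.
ALB IP mode
The ALB's default way of reaching pods is through NodePort, but in IP mode Pod IPs are registered directly in the ALB target group. Traffic is routed straight to the Pod without passing through NodePort, which reduces latency and improves performance.
AWS VPC CNI assigns VPC IPs directly to Pods so that they are attached directly to the VPC network. Combined with an IP-mode ALB, each Pod has its own VPC-routable IP, which improves network performance and flexibility.
Hands-on
Check the targets registered in the ALB target group. You can see that pod IPs are registered directly as ALB targets.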

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get pod -n game-2048 -o wide
NAME                              READY   STATUS    RESTARTS   AGE     IP              NODE                                               NOMINATED NODE   READINESS GATES
deployment-2048-85f8c7d69-k6t8b   1/1     Running   0          3m45s   192.168.3.113   ip-192-168-3-157.ap-northeast-2.compute.internal              
deployment-2048-85f8c7d69-nrpsx   1/1     Running   0          3m45s   192.168.1.251   ip-192-168-1-193.ap-northeast-2.compute.internal              
Check access through the Ingress.
(bgrtest@myeks:default) [root@myeks-bastion ~]# # Check the Ingress

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl describe ingress -n game-2048 ingress-2048
Name:             ingress-2048
Labels:           
Namespace:        game-2048
Address:          k8s-game2048-ingress2-70d50ce3fd-900250973.ap-northeast-2.elb.amazonaws.com
Ingress Class:    alb
Default backend:  
Rules:
  Host        Path  Backends
  ----        ----  --------
  *
              /   service-2048:80 (192.168.1.251:80,192.168.3.113:80)
Annotations:  alb.ingress.kubernetes.io/scheme: internet-facing
              alb.ingress.kubernetes.io/target-type: ip
Events:
  Type    Reason                  Age    From     Message
  ----    ------                  ----   ----     -------
  Normal  SuccessfullyReconciled  4m46s  ingress  Successfully reconciled

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get ingress -n game-2048 ingress-2048 -o jsonpath="{.status.loadBalancer.ingress[*].hostname}{'\n'}"
k8s-game2048-ingress2-70d50ce3fd-900250973.ap-northeast-2.elb.amazonaws.com

(bgrtest@myeks:default) [root@myeks-bastion ~]# # Play the game: open the ALB address in a web browser
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get ingress -n game-2048 ingress-2048 -o jsonpath={.status.loadBalancer.ingress[0].hostname} | awk '{ print "Game URL = http://"$1 }'
Game URL = http://k8s-game2048-ingress2-70d50ce3fd-900250973.ap-northeast-2.elb.amazonaws.com
Open the ALB address in a browser and you can even enjoy a fun game.

Topology aware routing
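Topology Aware Routing sends packets with the structure of the network in mind. It offers the following benefits:

Lower latency: routes traffic between nodes that are close to each other, reducing delay.
Load balancing: optimizes how traffic is spread across nodes so that it does not pile up on a particular node.
Failure recovery: when a path fails, an alternative path can be found quickly and traffic rebalanced.
Cost efficiency: uses the shortest paths to reduce data transfer costs.

In Kubernetes, Topology Aware Routing is enabled by setting the service.kubernetes.io/topology-mode annotation on a Service, after which traffic is kept within the same zone based on the hints recorded in the EndpointSlices (the older topologyKeys field has been deprecated and removed). This helps with performance optimization and resource management in large distributed systems.
Let's check how the nodes are currently spread across AZs.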
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get node --label-columns=topology.kubernetes.io/zone

NAME                                               STATUS   ROLES    AGE     VERSION               ZONE
ip-192-168-1-193.ap-northeast-2.compute.internal   Ready       5h14m   v1.30.4-eks-a737599   ap-northeast-2a
ip-192-168-2-90.ap-northeast-2.compute.internal    Ready       5h14m   v1.30.4-eks-a737599   ap-northeast-2b
ip-192-168-3-157.ap-northeast-2.compute.internal   Ready       5h14m   v1.30.4-eks-a737599   ap-northeast-2c
Before applying Topology aware routing
When the test pod (netshoot-pod) connects to the ClusterIP, load is balanced at random regardless of AZ (zone).
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get pod -l app=deploy-websrv -owide
NAME                           READY   STATUS    RESTARTS   AGE    IP              NODE                                               NOMINATED NODE   READINESS GATES
deploy-echo-859cc9b57d-2sk67   1/1     Running   0          115s   192.168.2.215   ip-192-168-2-90.ap-northeast-2.compute.internal               
deploy-echo-859cc9b57d-6dczc   1/1     Running   0          115s   192.168.3.174   ip-192-168-3-157.ap-northeast-2.compute.internal              
deploy-echo-859cc9b57d-dn4cz   1/1     Running   0          115s   192.168.1.170   ip-192-168-1-193.ap-northeast-2.compute.internal              

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl exec -it netshoot-pod -- zsh -c "for i in {1..100}; do curl -s svc-clusterip | grep Hostname; done | sort | uniq -c | sort -nr"
      37 Hostname: deploy-echo-859cc9b57d-dn4cz
     33 Hostname: deploy-echo-859cc9b57d-6dczc
     30 Hostname: deploy-echo-859cc9b57d-2sk67
After applying Topology aware routing
After setting the topology mode, connecting from the test pod (netshoot-pod) to the ClusterIP shows that it reaches only the destination pod in the same AZ (zone).
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl annotate service svc-clusterip "service.kubernetes.io/topology-mode=auto"
service/svc-clusterip annotated
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl exec -it netshoot-pod -- zsh -c "for i in {1..100}; do curl -s svc-clusterip | grep Hostname; done | sort | uniq -c | sort -nr"
    100 Hostname: deploy-echo-859cc9b57d-6dczc
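Checking the endpointslices shows a hints field that was not there before.
Topology hints record which zone each endpoint should serve, to help routing decisions. Once each endpoint carries this zone information as a hint, the cluster uses it to route traffic over the nearest path.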
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get endpointslices -l kubernetes.io/service-name=svc-clusterip -o yaml
apiVersion: v1
items:
- addressType: IPv4
  apiVersion: discovery.k8s.io/v1
  endpoints:
  - addresses:
    - 192.168.3.174
    conditions:
      ready: true
      serving: true
      terminating: false
    hints:
      forZones:
      - name: ap-northeast-2c
    nodeName: ip-192-168-3-157.ap-northeast-2.compute.internal
...
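How a node could choose endpoints from these hints can be sketched as follows. This is a simplified model of the documented kube-proxy behavior, not its actual code. The `endpoints_for_zone` helper and the `EXAMPLE` data are illustrative: only the 192.168.3.174 hint appears in the output above, and the other two zones are assumed from the node placement shown earlier.

```python
def endpoints_for_zone(endpoints, node_zone):
    """Pick the endpoints a node should use, given EndpointSlice hints."""
    everything = [ep["address"] for ep in endpoints]
    # If any endpoint lacks hints, ignore hints and use every endpoint.
    if any(not ep.get("forZones") for ep in endpoints):
        return everything
    local = [ep["address"] for ep in endpoints if node_zone in ep["forZones"]]
    # With no endpoint hinted for this zone, fall back to all of them.
    return local or everything


# Assumed hints: one endpoint per zone, matching the pod placement above.
EXAMPLE = [
    {"address": "192.168.1.170", "forZones": ["ap-northeast-2a"]},
    {"address": "192.168.2.215", "forZones": ["ap-northeast-2b"]},
    {"address": "192.168.3.174", "forZones": ["ap-northeast-2c"]},
]

print(endpoints_for_zone(EXAMPLE, "ap-northeast-2c"))  # ['192.168.3.174']
```

The fallback branches matter in practice: when hints are absent, every node ends up using all endpoints again, regardless of zone.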
iptables
Looking at iptables, each node lists only one SEP (endpoint) pod: the single pod deployed in the same AZ as that node.
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N1 sudo iptables -v --numeric --table nat --list KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SVC-UAGC4PYEYZJJEW6D  tcp  --  *      *       0.0.0.0/0            10.100.245.133       /* kube-system/aws-load-balancer-webhook-service:webhook-server cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-KBDEBIL6IU6WL7RF  tcp  --  *      *       0.0.0.0/0            10.100.229.199       /* default/svc-clusterip:svc-webport cluster IP */ tcp dpt:80
    0     0 KUBE-SVC-NPX46M4PTMTKRN6Y  tcp  --  *      *       0.0.0.0/0            10.100.0.1           /* default/kubernetes:https cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-I7SKRZYQ7PWYV5X7  tcp  --  *      *       0.0.0.0/0            10.100.238.197       /* kube-system/eks-extension-metrics-api:metrics-api cluster IP */ tcp dpt:443
    0     0 KUBE-SVC-TCOU7JCQXEZGVUNU  udp  --  *      *       0.0.0.0/0            10.100.0.10          /* kube-system/kube-dns:dns cluster IP */ udp dpt:53
    0     0 KUBE-SVC-ERIFXISQEP7F7OF4  tcp  --  *      *       0.0.0.0/0            10.100.0.10          /* kube-system/kube-dns:dns-tcp cluster IP */ tcp dpt:53
    0     0 KUBE-SVC-JD5MR3NA4I4DYORP  tcp  --  *      *       0.0.0.0/0            10.100.0.10          /* kube-system/kube-dns:metrics cluster IP */ tcp dpt:9153
   39  2340 KUBE-NODEPORTS  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL

(bgrtest@myeks:default) [root@myeks-bastion ~]#
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N1 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-2ZUOF3KQXKLOSGMW  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.1.170:8080 */
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N2 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-SX6FOWFFGSKOV6K3  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.2.215:8080 */
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N3 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
  200 12000 KUBE-SEP-L65MDBTORLAOI3KR  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.3.174:8080 */
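What if there is no destination pod in the same AZ?
Let's scale the deployment down to 1 pod to create a situation where some nodes have no destination pod in their AZ.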
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl scale deployment deploy-echo --replicas 1

(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get pod -o wide
NAME                            READY   STATUS    RESTARTS   AGE     IP              NODE                                               NOMINATED NODE   READINESS GATES
deploy-echo-859cc9b57d-dn4cz    1/1     Running   0          12m     192.168.1.170   ip-192-168-1-193.ap-northeast-2.compute.internal              
netshoot-pod                    1/1     Running   0          11m     192.168.3.37    ip-192-168-3-157.ap-northeast-2.compute.internal              
netshoot-pod-74b7555dc7-7hpkt   1/1     Running   0          3h48m   192.168.2.196   ip-192-168-2-90.ap-northeast-2.compute.internal               
netshoot-pod-74b7555dc7-mjs77   1/1     Running   0          3h48m   192.168.3.12    ip-192-168-3-157.ap-northeast-2.compute.internal              
netshoot-pod-74b7555dc7-s76j2   1/1     Running   0          3h48m   192.168.1.72    ip-192-168-1-193.ap-northeast-2.compute.internal   
Unlike before, the hint information is gone.
(bgrtest@myeks:default) [root@myeks-bastion ~]# kubectl get endpointslices -l kubernetes.io/service-name=svc-clusterip -o yaml | grep -i hint

Checking iptables shows that, regardless of AZ, all 3 nodes have a single SEP rule for the service, pointing at the one remaining pod.
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N1 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-2ZUOF3KQXKLOSGMW  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.1.170:8080 */
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N2 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-2ZUOF3KQXKLOSGMW  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.1.170:8080 */
(bgrtest@myeks:default) [root@myeks-bastion ~]# ssh ec2-user@$N3 sudo iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 KUBE-SEP-2ZUOF3KQXKLOSGMW  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 192.168.1.170:8080 */



Cilium - Basic Cilium installation and connectivity check (1)
Sat, 26 Oct 2024 19:19:40 GMT
BPF/eBPF
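iptables has long been at the core of Linux network filtering and firewalling, but it has some clear limitations.

Performance degradation: as rules pile up, the number of rules each packet must be checked against grows, and performance drops.
Complex maintenance: the more rules there are, the harder they are to manage and the more likely errors become.
Limited scalability: frequent transitions between the kernel and user space add overhead, and iptables can become a bottleneck on high-speed networks.

To address these limitations, BPF (Berkeley Packet Filter) and its extension eBPF (extended BPF) came into use. eBPF lets user-defined programs run inside the kernel, which brings advantages that iptables lacks, such as better performance, flexible programmability, and scalability.
eBPF kernel hooks
eBPF lets you hook, that is attach, programs at various points in the Linux kernel so that custom logic runs when a specific event occurs.
Let's look at examples of eBPF hooks available at each layer.
1. Driver layer

BPF programs run directly at the interface with the network device driver or hardware.
1-1. XDP
Location: runs directly at the NIC (Network Interface Card) driver level
Purpose: very high-speed packet processing
Example: a high-performance load balancer can distribute packets right at the NIC
2. Network layer

2-1. TC (Traffic Control)

Location: the ingress (receive) and egress (transmit) points of network traffic
Purpose: packet filtering, QoS (quality of service), load balancing, policy-based routing
Example: Cilium's network security policy enforcement

2-2. XDP (eXpress Data Path)

Location: runs right after a packet is received on the network interface, before it reaches the kernel network stack
Purpose: high-performance network filtering and policy enforcement
Example: DDoS mitigation (dropping malicious packets before they reach the stack), L4 load balancing (making the initial routing decision toward a destination server)
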
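3. Socket layer

Hooking socket events makes it possible to control application-layer communication.
3-1. Socket Filtering

Location: the send/receive points of a socket
Purpose: filter or modify packets for a specific application socket
Example: monitor or block socket traffic for a specific port or protocol

3-2. Socket Options

Location: BPF programs attached at the point where socket options are set
Purpose: implement custom per-application filtering logic
Example: apply firewall rules to a specific application's traffic

4. System call layer

Hooking the system call interface controls the interaction between processes and the kernel.
4-1. kprobe / kretprobe

Location: the entry (kprobe) and return (kretprobe) points of kernel functions
Purpose: trace or log parameters when a kernel function is called
Example: attach a kprobe to sys_execve to log process creation

4-2. tracepoints

Location: points where specific kernel events occur
Purpose: trace events such as I/O operations, network traffic, and process creation
Example: system call monitoring, tracing access to specific files
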
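5. Filesystem and process layer

Hooking file access and process events applies security policy or enables monitoring.
5-1. LSM (Linux Security Modules) Hooks

Location: security-relevant points in the kernel such as file access, process privilege changes, and network connections
Purpose: security policy enforcement (e.g., SELinux, AppArmor)
Example: block a specific process from accessing files, restrict network use

5-2. cgroup BPF

Location: resource-limit and network-traffic control points attached to a cgroup
Purpose: limit network and CPU use for a specific container or process group
Example: network policy for Kubernetes containers

6. Userspace layer

eBPF programs run inside the kernel, but through communication between userspace and the kernel they can trace and control application-level traffic.
6-1. uprobe / uretprobe

Location: hooks a specific userspace function for monitoring.
Example: attach a uprobe to a database client function to measure query latency

6-2. Envoy + eBPF
Envoy and eBPF work together to control an application's L7 traffic.

Example: HTTP/gRPC call tracing, security policy enforcement.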

Cilium
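Cilium is a CNI plugin that provides Pod networking and security based on eBPF.
Cilium's eBPF programs work by programming the Linux kernel directly, without requiring additional apps or configuration changes.
In Kubernetes, Cilium can act as a 100% kube-proxy replacement. That should reduce how much one has to worry about managing kube-proxy, iptables, and conntrack, and about the many environment combinations in which kernel bugs surface.

An issue I recently ran into in operations
When sending a large payload between services, the TCP connection was reset and dropped with a Connection reset by peer error.
Cause: large payload transfers sometimes produce out-of-order packets, and although nothing is wrong with those packets, a conntrack bug marks them INVALID. The INVALID packets are not dropped; they are sent on to the node without being DNATed correctly, and the node, seeing a packet of unclear origin, responds with a connection reset.
Fix 1: the issue is resolved in kernel 6 and later, but this cluster is operationally sensitive in many ways, so kernel upgrades cannot be done freely.
Fix 2: tcp_be_liberal=1 can be applied through kube-proxy. It is a workaround option that relaxes INVALID marking and is available from K8s 1.29. An immediate K8s upgrade is likewise difficult for operational reasons.
Fix 3: set net.netfilter.nf_conntrack_tcp_be_liberal=1 on the nodes. This option relaxes INVALID marking. It does not fix the underlying kernel problem but accepts out-of-order packets, so its downside is weaker security. However, it is the simplest method and can be applied right away without any upgrade, so this is the one I chose.
Kernel and Kubernetes upgrades often cannot be rolled out immediately in production because of stability concerns, so the most definitive fix sometimes cannot be applied right away. eBPF makes it possible to develop custom in-kernel logic on top of the existing kernel, so I think it might allow issues like this to be handled a bit more flexibly.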

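Cilium components

Cilium Agent: runs as a DaemonSet. Based on configuration from the K8S API it handles network setup, network policy, service load balancing, and monitoring, and manages the eBPF programs.
Cilium Client (CLI): the Cilium command-line tool. It can access the eBPF maps directly to inspect state.
Cilium Operator: manages tasks that should be handled once for the whole K8S cluster rather than once per node.
Hubble: the network and security monitoring platform, made up of Server, Relay, Client, and Graphical UI.
Data Store: the store that holds and propagates state between Cilium Agents. Choose one of two kinds (K8S CRDs, Key-Value Store).
cf) eBPF map

An eBPF map is a data structure for storing and exchanging data between eBPF programs and the kernel or user space. In other words, it is shared memory for eBPF (a key-value data store).
Every eBPF map has an upper capacity limit, and there are several limit-related options.
eBPF Map
With kube-proxy, the maximum size of the CT (connection tracking) table is determined by the number of CPU cores on the Linux host, whereas Cilium keeps its own connection tracking tables in BPF maps, whose maximum size is determined by memory.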
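The idea of a key-value store with a hard capacity can be sketched in a few lines. This is a userspace toy, not a real eBPF map: `BoundedMap` and its methods are hypothetical names, and it only mimics how a plain BPF hash map refuses new keys once max_entries is reached (LRU map types evict old entries instead).

```python
class BoundedMap:
    """Toy key-value store with a fixed capacity, loosely modeled on an eBPF hash map."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self.entries = {}

    def update(self, key, value):
        # Updating an existing key never needs extra room.
        if key not in self.entries and len(self.entries) >= self.max_entries:
            raise OverflowError("map is full")  # a plain BPF hash map reports E2BIG here
        self.entries[key] = value

    def lookup(self, key):
        return self.entries.get(key)

    def delete(self, key):
        self.entries.pop(key, None)
```

A connection tracking table built on such a map can hold at most max_entries flows, which is why the map size limits matter when sizing a node.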




Installing Cilium
Checking Cilium system requirements
1. CPU architecture
Hosts using the AMD64 or AArch64 CPU architecture
# arch
aarch64
2. Kernel version
Linux kernel 5.4 or later, or an equivalent (e.g., 4.18 on RHEL 8.6)
# uname -r
6.8.0-53-generic
Minimum kernel versions for advanced features - Docs



Cilium Feature                                    Minimum Kernel Version
WireGuard Transparent Encryption                  >= 5.6
Full support for Session Affinity                 >= 5.7
BPF-based proxy redirection                       >= 5.7
Socket-level LB bypass in pod netns               >= 5.7
L3 devices                                        >= 5.8
BPF-based host routing                            >= 5.10
Multicast Support in Cilium (Beta) (AMD64)        >= 5.10
IPv6 BIG TCP support                              >= 5.19
Multicast Support in Cilium (Beta) (AArch64)      >= 6.0
IPv4 BIG TCP support                              >= 6.3
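Checking a node against these minimums can be sketched as a comparison of the `uname -r` output with each listed version. The helper names below are illustrative, and the two Multicast rows are left out because their minimum depends on the CPU architecture.

```python
import re

# Minimum kernel versions copied from the feature list above.
MIN_KERNEL = {
    "WireGuard Transparent Encryption": (5, 6),
    "Full support for Session Affinity": (5, 7),
    "BPF-based proxy redirection": (5, 7),
    "Socket-level LB bypass in pod netns": (5, 7),
    "L3 devices": (5, 8),
    "BPF-based host routing": (5, 10),
    "IPv6 BIG TCP support": (5, 19),
    "IPv4 BIG TCP support": (6, 3),
}


def kernel_version(release):
    """Parse the (major, minor) pair out of a `uname -r` string."""
    major, minor = re.match(r"(\d+)\.(\d+)", release).groups()
    return int(major), int(minor)


def unsupported_features(release):
    """List the features whose minimum kernel is newer than this release."""
    have = kernel_version(release)
    return [name for name, need in MIN_KERNEL.items() if have < need]


print(unsupported_features("6.8.0-53-generic"))  # []
```

For the 6.8.0-53-generic kernel shown above, every listed feature is within range. Comparing tuples rather than strings avoids mistakes such as treating 5.10 as older than 5.8.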


3. Required kernel configuration options
# [Kernel config options] Base requirements
grep -E 'CONFIG_BPF|CONFIG_BPF_SYSCALL|CONFIG_NET_CLS_BPF|CONFIG_BPF_JIT|CONFIG_NET_CLS_ACT|CONFIG_NET_SCH_INGRESS|CONFIG_CRYPTO_SHA1|CONFIG_CRYPTO_USER_API_HASH|CONFIG_CGROUPS|CONFIG_CGROUP_BPF|CONFIG_PERF_EVENTS|CONFIG_SCHEDSTATS' /boot/config-$(uname -r)
CONFIG_BPF=y
CONFIG_BPF_SYSCALL=y
CONFIG_BPF_JIT=y
CONFIG_NET_CLS_BPF=m
CONFIG_NET_CLS_ACT=y
CONFIG_NET_SCH_INGRESS=m
CONFIG_CRYPTO_SHA1=y
CONFIG_CRYPTO_USER_API_HASH=m
CONFIG_CGROUPS=y
CONFIG_CGROUP_BPF=y
CONFIG_PERF_EVENTS=y
CONFIG_SCHEDSTATS=y


# [Kernel config options] Requirements for Tunneling and Routing
grep -E 'CONFIG_VXLAN=y|CONFIG_VXLAN=m|CONFIG_GENEVE=y|CONFIG_GENEVE=m|CONFIG_FIB_RULES=y' /boot/config-$(uname -r)
CONFIG_FIB_RULES=y # built into the kernel
CONFIG_VXLAN=m # compiled as a module → loaded into the kernel when used
CONFIG_GENEVE=m # compiled as a module → loaded into the kernel when used

## (Reference) Loading the kernel module
lsmod | grep -E 'vxlan|geneve'
modprobe geneve
lsmod | grep -E 'vxlan|geneve'


# [Kernel config options] Requirements for L7 and FQDN Policies
grep -E 'CONFIG_NETFILTER_XT_TARGET_TPROXY|CONFIG_NETFILTER_XT_TARGET_MARK|CONFIG_NETFILTER_XT_TARGET_CT|CONFIG_NETFILTER_XT_MATCH_MARK|CONFIG_NETFILTER_XT_MATCH_SOCKET' /boot/config-$(uname -r)
CONFIG_NETFILTER_XT_TARGET_CT=m
CONFIG_NETFILTER_XT_TARGET_MARK=m
CONFIG_NETFILTER_XT_TARGET_TPROXY=m
CONFIG_NETFILTER_XT_MATCH_MARK=m
CONFIG_NETFILTER_XT_MATCH_SOCKET=m

...

# [Kernel config options] Requirements for Netkit Device Mode
grep -E 'CONFIG_NETKIT=y|CONFIG_NETKIT=m' /boot/config-$(uname -r)
Installing Cilium
helm install cilium cilium/cilium --version 1.16.3 --namespace kube-system \
--set k8sServiceHost=192.168.10.10 --set k8sServicePort=6443 --set debug.enabled=true \
--set rollOutCiliumPods=true --set routingMode=native --set autoDirectNodeRoutes=true \
--set bpf.masquerade=true --set bpf.hostRouting=true --set endpointRoutes.enabled=true \
--set ipam.mode=kubernetes --set k8s.requireIPv4PodCIDR=true --set kubeProxyReplacement=true \
--set ipv4NativeRoutingCIDR=192.168.0.0/16 --set installNoConntrackIptablesRules=true \
--set hubble.ui.enabled=true --set hubble.relay.enabled=true --set prometheus.enabled=true --set operator.prometheus.enabled=true --set hubble.metrics.enableOpenMetrics=true \
--set hubble.metrics.enabled="{dns:query;ignoreAAAA,drop,tcp,flow,port-distribution,icmp,httpV2:exemplars=true;labelsContext=source_ip\,source_namespace\,source_workload\,destination_ip\,destination_namespace\,destination_workload\,traffic_direction}" \
--set operator.replicas=1



Option                             Description
rollOutCiliumPods                  Rolls out the Cilium agent pods automatically when the ConfigMap is updated.
routingMode                        Sets how packets are routed between nodes; set to native routing here.
autoDirectNodeRoutes               Automatically installs routes between nodes so that packets are forwarded directly.
bpf.masquerade                     Enables IP masquerading implemented in BPF.
bpf.hostRouting                    Enables host routing so that packets are forwarded directly.
endpointRoutes.enabled             Enables per-endpoint routes to optimize traffic flow between Pods.
ipam.mode                          Sets the IP address allocation mode to Kubernetes.
k8s.requireIPv4PodCIDR             Requires an IPv4 Pod CIDR, ensuring IPv4 addressing in the cluster.
kubeProxyReplacement               Has Cilium take over the role of kube-proxy.
ipv4NativeRoutingCIDR              Sets the CIDR block used for IPv4 native routing.
installNoConntrackIptablesRules    Installs iptables rules that skip netfilter connection tracking for pod traffic, reducing overhead.
hubble.ui.enabled                  Enables Hubble UI for network monitoring and visualization.
hubble.relay.enabled               Enables Hubble Relay, which collects and relays events from the nodes.
prometheus.enabled                 Enables Prometheus monitoring so that metrics can be collected.
operator.prometheus.enabled        Enables Prometheus monitoring for the Cilium Operator.
hubble.metrics.enableOpenMetrics   Enables the OpenMetrics format for Hubble metrics to improve compatibility.
hubble.metrics.enabled             Defines the metrics and events that Hubble collects.
operator.replicas                  Sets the number of Cilium Operator replicas (1 here; high availability needs more than one).

Configuring and verifying Cilium
Installing the Cilium CLI
#
CILIUM_CLI_VERSION=$(curl -s https://raw.githubusercontent.com/cilium/cilium-cli/main/stable.txt)
CLI_ARCH=amd64
if [ "$(uname -m)" = "aarch64" ]; then CLI_ARCH=arm64; fi
curl -L --fail --remote-name-all https://github.com/cilium/cilium-cli/releases/download/${CILIUM_CLI_VERSION}/cilium-linux-${CLI_ARCH}.tar.gz >/dev/null 2>&1
tar xzvfC cilium-linux-${CLI_ARCH}.tar.gz /usr/local/bin
rm cilium-linux-${CLI_ARCH}.tar.gz

cilium config set debug true

#
kubectl exec -n kube-system -c cilium-agent -it ds/cilium -- cilium-dbg status --verbose
...
KubeProxyReplacement:   True   [eth0    10.0.2.15 fd17:625c:f037:2:a00:27ff:fe71:19d8 fe80::a00:27ff:fe71:19d8, eth1   192.168.10.102 fe80::a00:27ff:fe6d:c80b (Direct Routing)]
...
Routing:                Network: Native   Host: BPF
Attach Mode:            TCX
Device Mode:            veth
Masquerading:           BPF   [eth0, eth1]   172.20.0.0/16 [IPv4: Enabled, IPv6: Disabled]
...
KubeProxyReplacement Details:
  Status:                 True
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Socket LB Coverage:     Full
  Devices:                eth0    10.0.2.15 fd17:625c:f037:2:a00:27ff:fe71:19d8 fe80::a00:27ff:fe71:19d8, eth1   192.168.10.102 fe80::a00:27ff:fe6d:c80b (Direct Routing)
  Mode:                   SNAT
  Backend Selection:      Random
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled
  Annotations:
  - service.cilium.io/node
  - service.cilium.io/src-ranges-policy
  - service.cilium.io/type
Basic network information
cilium-operator, cilium, and cilium-envoy are installed.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get node,pod,svc -A -o wide
NAME          STATUS   ROLES           AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
node/k8s-s    Ready    control-plane   141m   v1.30.6   192.168.10.10            Ubuntu 22.04.5 LTS   6.8.0-1015-aws   containerd://1.7.22
node/k8s-w1   Ready              140m   v1.30.6   192.168.10.101           Ubuntu 22.04.5 LTS   6.8.0-1015-aws   containerd://1.7.22
node/k8s-w2   Ready              140m   v1.30.6   192.168.10.102           Ubuntu 22.04.5 LTS   6.8.0-1015-aws   containerd://1.7.22

NAMESPACE     NAME                                   READY   STATUS    RESTARTS   AGE    IP               NODE     NOMINATED NODE   READINESS GATES
kube-system   pod/cilium-7qtr8                       1/1     Running   0          100s   192.168.10.102   k8s-w2              
kube-system   pod/cilium-envoy-g29t8                 1/1     Running   0          100s   192.168.10.102   k8s-w2              
kube-system   pod/cilium-envoy-gksj4                 1/1     Running   0          100s   192.168.10.10    k8s-s               
kube-system   pod/cilium-envoy-m546t                 1/1     Running   0          100s   192.168.10.101   k8s-w1              
kube-system   pod/cilium-f6hl2                       1/1     Running   0          100s   192.168.10.10    k8s-s               
kube-system   pod/cilium-operator-76bb588dbc-gx6bn   1/1     Running   0          100s   192.168.10.101   k8s-w1              
kube-system   pod/cilium-tgkbh                       1/1     Running   0          100s   192.168.10.101   k8s-w1              
kube-system   pod/coredns-55cb58b774-7tj27           1/1     Running   0          140m   172.16.1.13      k8s-w1              
kube-system   pod/coredns-55cb58b774-pldd7           1/1     Running   0          140m   172.16.1.217     k8s-w1              
kube-system   pod/etcd-k8s-s                         1/1     Running   0          141m   192.168.10.10    k8s-s               
kube-system   pod/hubble-relay-88f7f89d4-zpgdl       1/1     Running   0          100s   172.16.1.150     k8s-w1              
kube-system   pod/hubble-ui-59bb4cb67b-dk4hp         2/2     Running   0          100s   172.16.1.214     k8s-w1              
kube-system   pod/kube-apiserver-k8s-s               1/1     Running   0          141m   192.168.10.10    k8s-s               
kube-system   pod/kube-controller-manager-k8s-s      1/1     Running   0          141m   192.168.10.10    k8s-s               
kube-system   pod/kube-scheduler-k8s-s               1/1     Running   0          141m   192.168.10.10    k8s-s               

NAMESPACE     NAME                     TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE    SELECTOR
default       service/kubernetes       ClusterIP   10.10.0.1              443/TCP                  141m   
kube-system   service/cilium-envoy     ClusterIP   None                   9964/TCP                 101s   k8s-app=cilium-envoy
kube-system   service/hubble-metrics   ClusterIP   None                   9965/TCP                 101s   k8s-app=cilium
kube-system   service/hubble-peer      ClusterIP   10.10.77.48            443/TCP                  101s   k8s-app=cilium
kube-system   service/hubble-relay     ClusterIP   10.10.24.62            80/TCP                   101s   k8s-app=hubble-relay
kube-system   service/hubble-ui        ClusterIP   10.10.202.89           80/TCP                   101s   k8s-app=hubble-ui
kube-system   service/kube-dns         ClusterIP   10.10.0.10             53/UDP,53/TCP,9153/TCP   141m   k8s-app=kube-dns
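The cilium_net and cilium_host interfaces enable Pod-to-Pod communication and communication between Pods and the outside.

cilium_net : Handles traffic between Pods managed by Cilium.
cilium_host : Handles traffic between Cilium and the host, which makes communication with the outside possible.
lxc_health : Interface dedicated to health checks.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# ip -c addr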
1: lo:  mtu 65536 qdisc noqueue state UNKNOWN group default qlen 1000
  link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  inet 127.0.0.1/8 scope host lo
     valid_lft forever preferred_lft forever
  inet6 ::1/128 scope host
     valid_lft forever preferred_lft forever
2: eth0:  mtu 9001 qdisc mq state UP group default qlen 1000
  link/ether 02:95:92:79:ac:d1 brd ff:ff:ff:ff:ff:ff
  inet 192.168.10.10/24 metric 100 brd 192.168.10.255 scope global dynamic eth0
     valid_lft 2194sec preferred_lft 2194sec
  inet6 fe80::95:92ff:fe79:acd1/64 scope link
     valid_lft forever preferred_lft forever
3: cilium_net@cilium_host:  mtu 9001 qdisc noqueue state UP group default qlen 1000
  link/ether 02:f7:b5:46:ac:49 brd ff:ff:ff:ff:ff:ff
  inet6 fe80::f7:b5ff:fe46:ac49/64 scope link
     valid_lft forever preferred_lft forever
4: cilium_host@cilium_net:  mtu 9001 qdisc noqueue state UP group default qlen 1000
  link/ether 52:5b:5b:86:6a:6d brd ff:ff:ff:ff:ff:ff
  inet 172.16.0.70/32 scope global cilium_host
     valid_lft forever preferred_lft forever
  inet6 fe80::505b:5bff:fe86:6a6d/64 scope link
     valid_lft forever preferred_lft forever
6: lxc_health@if5:  mtu 9001 qdisc noqueue state UP group default qlen 1000
  link/ether 7a:98:19:81:96:4e brd ff:ff:ff:ff:ff:ff link-netnsid 0
  inet6 fe80::7898:19ff:fe81:964e/64 scope link
     valid_lft forever preferred_lft forever


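Of these, lxc_health is the host side of a veth pair. Its peer is an internal health endpoint that Cilium creates automatically on every node to check connectivity within the cluster.
The endpoint side of the pair is assigned an IP from the Pod CIDR and runs as cilium-health-responder.
In the Cilium endpoint list it appears with the label reserved:health, one of the identities Cilium reserves so that it never collides with identities allocated to workloads.
Because it is managed inside Cilium rather than registered with Kubernetes as a Pod, it does not show up in kubectl get pods.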
(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c addr show lxc_health
26: lxc_health@if25:  mtu 1500 qdisc noqueue state UP group default qlen 1000
    link/ether 26:53:c1:ce:2d:54 brd ff:ff:ff:ff:ff:ff link-netnsid 4
    inet6 fe80::2453:c1ff:fece:2d54/64 scope link
       valid_lft forever preferred_lft forever

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg status --verbose
...
Cluster health:   3/3 reachable   (2025-07-19T15:57:07Z)
Name              IP              Node   Endpoints
  k8s-w2 (localhost):
    Host connectivity to 192.168.10.102: <- Node IP
      ICMP to stack:   OK, RTT=136.792µs
      HTTP to agent:   OK, RTT=481.291µs
    Endpoint connectivity to 172.20.1.204: <- Health IP
      ICMP to stack:   OK, RTT=303.375µs
      HTTP to agent:   OK, RTT=516.209µs
  k8s-ctr:
    Host connectivity to 192.168.10.100: 
      ICMP to stack:   OK, RTT=700.542µs
      HTTP to agent:   OK, RTT=997.667µs
    Endpoint connectivity to 172.20.0.198: 
      ICMP to stack:   OK, RTT=569.584µs
      HTTP to agent:   OK, RTT=1.537167ms
  k8s-w1:
    Host connectivity to 192.168.10.101:
      ICMP to stack:   OK, RTT=296.417µs
      HTTP to agent:   OK, RTT=893.958µs
    Endpoint connectivity to 172.20.2.52:
      ICMP to stack:   OK, RTT=653.917µs
      HTTP to agent:   OK, RTT=781.833µs

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get node k8s-w2 -o wide
NAME     STATUS   ROLES    AGE    VERSION   INTERNAL-IP      EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-w2   Ready       3h3m   v1.33.2   192.168.10.102           Ubuntu 24.04.2 LTS   6.8.0-53-generic   containerd://1.7.27

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg endpoint list | grep health
2251       Disabled           Disabled          4          reserved:health                                                                 172.20.1.204   ready
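Checking the routing configuration
1. NativeRouting + autoDirectNodeRoutes=true
With the options specified at Cilium install time, routingMode=native and autoDirectNodeRoutes=true, Pod traffic between nodes is routed directly instead of going through an overlay (VXLAN, Geneve, etc.).

Cilium automatically adds routes to the Pod CIDR of every other node.
When node A sends a packet to node B's Pod CIDR, the packet is delivered directly through eth1 (the designated NIC).
Flannel and Calico are commonly run with VXLAN or IPIP encapsulation, whereas Cilium's native routing skips encapsulation, which gives a simpler path with less per-packet overhead.

# The route for each remote Pod CIDR points directly at the node's eth1.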
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumnodes -o json
...
                "name": "k8s-w1",
                "ipam": {
                    "podCIDRs": [
                        "172.20.2.0/24"
                    ],

root@k8s-w1:~# ip -c route | grep 172.20 | grep eth1
172.20.0.0/24 via 192.168.10.100 dev eth1 proto kernel
172.20.1.0/24 via 192.168.10.102 dev eth1 proto kernel
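The routes above can be reproduced with a small model. This is only a sketch of what autoDirectNodeRoutes=true does conceptually, not Cilium's implementation: the function name `direct_node_routes` and the `NODES` table are made up for illustration, and the pairing of k8s-ctr and k8s-w2 with their Pod CIDRs is inferred from the two routes printed on k8s-w1.

```python
import ipaddress

# Node IPs and Pod CIDRs. k8s-w1's CIDR and the two routes come from the
# output above; the CIDRs of k8s-ctr and k8s-w2 are inferred from those routes.
NODES = {
    "k8s-ctr": {"ip": "192.168.10.100", "pod_cidr": "172.20.0.0/24"},
    "k8s-w1": {"ip": "192.168.10.101", "pod_cidr": "172.20.2.0/24"},
    "k8s-w2": {"ip": "192.168.10.102", "pod_cidr": "172.20.1.0/24"},
}


def direct_node_routes(local_node, nodes, device="eth1"):
    """Return the routes a node needs to reach every other node's Pod CIDR."""
    routes = []
    for name, info in nodes.items():
        if name == local_node:
            continue  # the local Pod CIDR is reached through local interfaces
        cidr = ipaddress.ip_network(info["pod_cidr"])
        routes.append((cidr, f"{cidr} via {info['ip']} dev {device}"))
    # Sort by destination network so the result reads like `ip route` output.
    return [text for _, text in sorted(routes, key=lambda item: item[0])]


for route in direct_node_routes("k8s-w1", NODES):
    print(route)
```

Each remote Pod CIDR gets exactly one route whose next hop is the owning node's IP, which is why no encapsulation header is needed as long as the nodes share an L2 segment.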
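2. endpointRoutes.enabled=true + Non-hostNetwork Pods
With endpointRoutes.enabled=true, Cilium installs a separate route per Pod that points at the Pod's own interface (lxcXXXX).
An lxcXXXX interface is the host side of the veth pair that connects a Pod to the host network. Each Pod communicates through its own interface, and network policy enforcement and monitoring happen on that path.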
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get ciliumendpoints -A
NAMESPACE     NAME                       SECURITY IDENTITY   ENDPOINT STATE   IPV4           IPV6
default       curl-pod                   61805               ready            172.20.0.6
default       webpod-659cd747f8-v9xkm    11362               ready            172.20.1.172
default       webpod-659cd747f8-zqbk8    11362               ready            172.20.2.144
kube-system   coredns-674b8bbfcf-hsz8b   57947               ready            172.20.0.64
kube-system   coredns-674b8bbfcf-ntlf2   57947               ready            172.20.0.71

(⎈|HomeLab:N/A) root@k8s-ctr:~# ip -c route | grep lxc
172.20.0.6 dev lxca2ec5268c2d5 proto kernel scope link
172.20.0.64 dev lxc2055df1042a6 proto kernel scope link
172.20.0.71 dev lxc729021c580c0 proto kernel scope link
172.20.0.198 dev lxc_health proto kernel scope link
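To relate the per-endpoint routes to the ciliumendpoints list, the `ip route` output can be turned into an IP-to-device table. The helper below is a hypothetical parsing sketch (the name `endpoint_routes` is not a Cilium tool); it assumes each line has the shape shown above.

```python
SAMPLE = """\
172.20.0.6 dev lxca2ec5268c2d5 proto kernel scope link
172.20.0.64 dev lxc2055df1042a6 proto kernel scope link
172.20.0.71 dev lxc729021c580c0 proto kernel scope link
172.20.0.198 dev lxc_health proto kernel scope link
"""


def endpoint_routes(ip_route_output):
    """Map each endpoint IP to its host-side lxc device."""
    routes = {}
    for line in ip_route_output.splitlines():
        fields = line.split()
        # Expected shape: "<ip> dev <device> proto kernel scope link"
        if len(fields) >= 3 and fields[1] == "dev" and fields[2].startswith("lxc"):
            routes[fields[0]] = fields[2]
    return routes


for ip, device in endpoint_routes(SAMPLE).items():
    print(ip, "->", device)
```

Only the endpoints local to k8s-ctr (curl-pod, the two coredns Pods, and the health endpoint) appear, since Pods on other nodes are reached through the per-node Pod CIDR routes instead.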
3. hostNetwork Pods
A Pod with hostNetwork: true uses the node's network namespace as is, so no separate lxcXXXX interface is created for it.
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide -A | grep k8s-ctr | grep -v 172.20
kube-system   cilium-ctqxp                       1/1     Running   0              128m    192.168.10.100   k8s-ctr              
kube-system   cilium-envoy-jm92v                 1/1     Running   0              141m    192.168.10.100   k8s-ctr              
kube-system   etcd-k8s-ctr                       1/1     Running   0              3h49m   192.168.10.100   k8s-ctr              
kube-system   kube-apiserver-k8s-ctr             1/1     Running   0              3h49m   192.168.10.100   k8s-ctr              
kube-system   kube-controller-manager-k8s-ctr    1/1     Running   0              3h49m   192.168.10.100   k8s-ctr              
kube-system   kube-scheduler-k8s-ctr             1/1     Running   0              3h49m   192.168.10.100   k8s-ctr              

# Check the node's network namespace
(⎈|HomeLab:N/A) root@k8s-ctr:~# readlink /proc/1/ns/net
net:[4026531840]

# Check the network namespace of the cilium Pod (a hostNetwork Pod)
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec -n kube-system ds/cilium -c cilium-agent -- readlink /proc/1/ns/net
net:[4026531840]

# Check the network namespace of curl-pod (a non-hostNetwork Pod)
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec curl-pod -- readlink /proc/1/ns/net
net:[4026532218]
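The comparison above boils down to checking whether two `readlink /proc/<pid>/ns/net` results carry the same inode number. A minimal sketch of that check, with hypothetical helper names (`netns_inode`, `shares_host_netns`) and the values copied from the output above:

```python
import re


def netns_inode(link):
    """Extract the inode number from a string such as 'net:[4026531840]'."""
    match = re.fullmatch(r"net:\[(\d+)\]", link.strip())
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return int(match.group(1))


def shares_host_netns(host_link, pod_link):
    """Two processes share a network namespace when the inodes are equal."""
    return netns_inode(host_link) == netns_inode(pod_link)


HOST = "net:[4026531840]"
print(shares_host_netns(HOST, "net:[4026531840]"))  # cilium agent (hostNetwork)
print(shares_host_netns(HOST, "net:[4026532218]"))  # curl-pod
```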
Checking iptables rules
Looking at the iptables rules, there is little beyond the Cilium-related NAT chains. Cilium was installed with the installNoConntrackIptablesRules option so that packets are processed by eBPF instead of going through conntrack-based iptables rules, and conntrack -L accordingly shows very few tracked connections.
Filtering the raw table for notrack shows the rules that make traffic bypass conntrack (Pod network range traffic and L7 proxy traffic).
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# iptables -t nat -S
-P PREROUTING ACCEPT
-P INPUT ACCEPT
-P OUTPUT ACCEPT
-P POSTROUTING ACCEPT
-N CILIUM_OUTPUT_nat
-N CILIUM_POST_nat
-N CILIUM_PRE_nat
-N KUBE-KUBELET-CANARY
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_nat" -j CILIUM_PRE_nat
-A OUTPUT -m comment --comment "cilium-feeder: CILIUM_OUTPUT_nat" -j CILIUM_OUTPUT_nat
-A POSTROUTING -m comment --comment "cilium-feeder: CILIUM_POST_nat" -j CILIUM_POST_nat

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# conntrack -L | grep -v 2379 | wc -l
conntrack v1.4.6 (conntrack-tools): 125 flow entries have been shown.
54

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# iptables -t raw -S | grep notrack
-A CILIUM_OUTPUT_raw -d 192.168.0.0/16 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -s 192.168.0.0/16 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -o lxc+ -m comment --comment "cilium: NOTRACK for proxy return traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -o cilium_host -m comment --comment "cilium: NOTRACK for proxy return traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -o lxc+ -m comment --comment "cilium: NOTRACK for L7 proxy upstream traffic" -j CT --notrack
-A CILIUM_OUTPUT_raw -o cilium_host -m comment --comment "cilium: NOTRACK for L7 proxy upstream traffic" -j CT --notrack
-A CILIUM_PRE_raw -d 192.168.0.0/16 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_PRE_raw -s 192.168.0.0/16 -m comment --comment "cilium: NOTRACK for pod traffic" -j CT --notrack
-A CILIUM_PRE_raw -m comment --comment "cilium: NOTRACK for proxy traffic" -j CT --notrack
Cilium CRDs
Among the Cilium CRDs are ciliumnodes and ciliumendpoints.

ciliumnodes : Stores information about each node in the cluster where Cilium is installed.
ciliumendpoints : Stores network information for each Pod in the cluster. Through ciliumendpoints, Cilium tracks whether policy is enforced on a Pod and dynamically adds the policies it needs.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get crd
NAME                                         CREATED AT
ciliumcidrgroups.cilium.io                   2024-10-25T15:00:57Z
ciliumclusterwidenetworkpolicies.cilium.io   2024-10-25T15:00:58Z
ciliumendpoints.cilium.io                    2024-10-25T15:00:57Z
ciliumexternalworkloads.cilium.io            2024-10-25T15:00:57Z
ciliumidentities.cilium.io                   2024-10-25T15:00:57Z
ciliuml2announcementpolicies.cilium.io       2024-10-25T15:00:57Z
ciliumloadbalancerippools.cilium.io          2024-10-25T15:00:57Z
ciliumnetworkpolicies.cilium.io              2024-10-25T15:00:58Z
ciliumnodeconfigs.cilium.io                  2024-10-25T15:00:57Z
ciliumnodes.cilium.io                        2024-10-25T15:00:57Z
ciliumpodippools.cilium.io                   2024-10-25T15:00:57Z



(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get ciliumnodes
NAME     CILIUMINTERNALIP   INTERNALIP       AGE
k8s-s    172.16.0.70        192.168.10.10    15m
k8s-w1   172.16.1.153       192.168.10.101   15m
k8s-w2   172.16.2.237       192.168.10.102   15m
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get ciliumendpoints -A
NAMESPACE     NAME                           SECURITY IDENTITY   ENDPOINT STATE   IPV4           IPV6
kube-system   coredns-55cb58b774-7tj27       24672               ready            172.16.1.13
kube-system   coredns-55cb58b774-pldd7       24672               ready            172.16.1.217
kube-system   hubble-relay-88f7f89d4-zpgdl   37739               ready            172.16.1.150
kube-system   hubble-ui-59bb4cb67b-dk4hp     42419               ready            172.16.1.214
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get pod -o wide -A | grep -v 192.168
NAMESPACE     NAME                               READY   STATUS    RESTARTS   AGE    IP               NODE     NOMINATED NODE   READINESS GATES
kube-system   coredns-55cb58b774-7tj27           1/1     Running   0          157m   172.16.1.13      k8s-w1              
kube-system   coredns-55cb58b774-pldd7           1/1     Running   0          157m   172.16.1.217     k8s-w1              
kube-system   hubble-relay-88f7f89d4-zpgdl       1/1     Running   0          18m    172.16.1.150     k8s-w1              
kube-system   hubble-ui-59bb4cb67b-dk4hp         2/2     Running   0          18m    172.16.1.214     k8s-w1              

The following command shows the various settings applied in the Cilium config.
```bash
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get cm -n kube-system cilium-config -o json | jq
{
  "apiVersion": "v1",
  "data": {
    "agent-not-ready-taint-key": "node.cilium.io/agent-not-ready",
    "arping-refresh-period": "30s",
    "auto-direct-node-routes": "true",
    "bpf-events-drop-enabled": "true",
    "bpf-events-policy-verdict-enabled": "true",
    ...
```
Checking Pod communication between nodes with Cilium
Create Pods and check Pod-to-Pod communication across nodes.
# Cilium Pod names
export CILIUMPOD0=$(kubectl get -l k8s-app=cilium pods -n kube-system --field-selector spec.nodeName=k8s-ctr -o jsonpath='{.items[0].metadata.name}')
export CILIUMPOD1=$(kubectl get -l k8s-app=cilium pods -n kube-system --field-selector spec.nodeName=k8s-w1  -o jsonpath='{.items[0].metadata.name}')
export CILIUMPOD2=$(kubectl get -l k8s-app=cilium pods -n kube-system --field-selector spec.nodeName=k8s-w2  -o jsonpath='{.items[0].metadata.name}')
echo $CILIUMPOD0 $CILIUMPOD1 $CILIUMPOD2

# Define aliases
alias c0="kubectl exec -it $CILIUMPOD0 -n kube-system -c cilium-agent -- cilium"
alias c1="kubectl exec -it $CILIUMPOD1 -n kube-system -c cilium-agent -- cilium"
alias c2="kubectl exec -it $CILIUMPOD2 -n kube-system -c cilium-agent -- cilium"

alias c0bpf="kubectl exec -it $CILIUMPOD0 -n kube-system -c cilium-agent -- bpftool"
alias c1bpf="kubectl exec -it $CILIUMPOD1 -n kube-system -c cilium-agent -- bpftool"
alias c2bpf="kubectl exec -it $CILIUMPOD2 -n kube-system -c cilium-agent -- bpftool"

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get pod -o wide
NAME      READY   STATUS    RESTARTS   AGE    IP             NODE     NOMINATED NODE   READINESS GATES
netpod    1/1     Running   0          115s   172.16.0.110   k8s-s               
webpod1   1/1     Running   0          115s   172.16.1.244   k8s-w1              
webpod2   1/1     Running   0          115s   172.16.2.198   k8s-w2              

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 status --verbose | grep Allocated -A5
Allocated addresses:
  172.16.0.110 (default/netpod)
  172.16.0.250 (health)
  172.16.0.70 (router)
IPv4 BIG TCP:           Disabled
IPv6 BIG TCP:           Disabled

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c1 status --verbose | grep Allocated -A5
Allocated addresses:
  172.16.1.106 (health)
  172.16.1.13 (kube-system/coredns-55cb58b774-7tj27)
  172.16.1.150 (kube-system/hubble-relay-88f7f89d4-zpgdl)
  172.16.1.153 (router)
  172.16.1.214 (kube-system/hubble-ui-59bb4cb67b-dk4hp)

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c2 status --verbose | grep Allocated -A5
Allocated addresses:
  172.16.2.198 (default/webpod2)
  172.16.2.237 (router)
  172.16.2.246 (health)
IPv4 BIG TCP:           Disabled
IPv6 BIG TCP:           Disabled
Checking Cilium information
# Check endpoint information
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get pod -o wide
NAME                      READY   STATUS    RESTARTS       AGE    IP             NODE      NOMINATED NODE   READINESS GATES
curl-pod                  1/1     Running   0              163m   172.20.0.6     k8s-ctr              
webpod-659cd747f8-v9xkm   1/1     Running   0              164m   172.20.1.172   k8s-w2               
webpod-659cd747f8-zqbk8   1/1     Running   1 (139m ago)   164m   172.20.2.144   k8s-w1               

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl get svc,ep webpod
Warning: v1 Endpoints is deprecated in v1.33+; use discovery.k8s.io/v1 EndpointSlice
NAME             TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)   AGE
service/webpod   ClusterIP   10.96.251.68           80/TCP    3h9m

NAME               ENDPOINTS                         AGE
endpoints/webpod   172.20.1.172:80,172.20.2.144:80   3h9m
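The lxc... interface that is created along with a Pod has eBPF programs attached to its ingress and egress.
In other words, packets entering and leaving the Pod are intercepted and processed at BPF hooks.
# BPF maps: show where to send a packet when talking to a destination Pod.
# tunnelendpoint = the node the Pod is running on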
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0 map get cilium_ipcache | grep 172.20.1.172
172.20.1.172/32     identity=11362 encryptkey=0 tunnelendpoint=192.168.10.102 flags=   sync

# Set a variable to curl-pod's lxc interface
# bpf net show
(⎈|HomeLab:N/A) root@k8s-ctr:~# LXC=lxca2ec5268c2d5
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0bpf net show | grep $LXC
lxca2ec5268c2d5(24) tcx/ingress cil_from_container prog_id 1609 link_id 33
lxca2ec5268c2d5(24) tcx/egress cil_to_container prog_id 1600 link_id 34
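Let's look at the BPF program information.
In the example below, prog_id 1609 is the eBPF program that handles packets leaving the Pod,
and prog_id 1600 is the eBPF program that handles packets entering the Pod.
Each program lists the eBPF maps it uses (map_ids).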
# bpf prog show
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0bpf prog show id 1609
1609: sched_cls  name cil_from_container  tag 41989045bb171bee  gpl
    loaded_at 2025-07-19T14:37:08+0000  uid 0
    xlated 752B  jited 784B  memlock 4096B  map_ids 272,271,90
    btf_id 563

# bpf prog show    
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0bpf prog show id 1600
1600: sched_cls  name cil_to_container  tag 0b3125767ba1861c  gpl
    loaded_at 2025-07-19T14:37:08+0000  uid 0
    xlated 1448B  jited 1144B  memlock 4096B  map_ids 272,90,271
    btf_id 554
Let's look at the eBPF maps in detail.

90: cilium_metrics
→ Per-CPU hash map for the Pod's metrics (packet counts, bytes, etc.)

271: cilium_calls_02
→ prog_array that links program-to-program calls (builds the BPF program chain)

272: .rodata.config
→ Read-only configuration values that the BPF program consults
# bpf map list
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0bpf map list | grep -e 272: -e 271: -e ^90: -A2
90: percpu_hash  name cilium_metrics  flags 0x1
  key 8B  value 16B  max_entries 1024  memlock 19024B
...
271: prog_array  name cilium_calls_02  flags 0x0
  key 4B  value 4B  max_entries 50  memlock 720B
  owner_prog_type sched_cls  owner jited
272: array  name .rodata.config  flags 0x480
  key 4B  value 64B  max_entries 1  memlock 8192B
  btf_id 552  frozen


Analyzing the communication path
Trace how netpod communicates with webpod1.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# p0 ping -c 1 $WEBPOD1IP
PING 172.16.2.42 (172.16.2.42) 56(84) bytes of data.
64 bytes from 172.16.2.42: icmp_seq=1 ttl=62 time=0.514 ms

--- 172.16.2.42 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 0.514/0.514/0.514/0.000 ms

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# hubble observe --pod netpod
Oct 26 15:05:12.844: default/netpod (ID:46040) -> default/webpod1 (ID:39469) to-network FORWARDED (ICMPv4 EchoRequest)
Oct 26 15:05:12.844: default/netpod (ID:46040) <- default/webpod1 (ID:39469) to-endpoint FORWARDED (ICMPv4 EchoReply)

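1) Running ping in netpod sends an ICMP packet to webpod1's IP address, 172.16.2.42.
2) When the packet reaches the lxc interface that connects netpod to k8s-s, Cilium's eBPF program intercepts it.

It reads the destination IP (172.16.2.42) and looks it up in the Cilium ipcache to obtain the security identity (39469 in the Hubble output above) and the destination node (k8s-w1).

3) Recognizing that the destination node is on the same subnet, it applies Direct Routing mode.

The packet is sent to the k8s-w1 node without tunneling.

4) When the packet arrives at the cilium_host interface on k8s-w1, the Cilium eBPF program on k8s-w1 intercepts it.

It looks up cilium_ipcache and confirms that the destination is webpod1.

5) The packet is delivered to webpod1 over the veth pair behind its lxc interface, using the ARP reply, generated by Cilium's eBPF program, that carries the destination MAC address.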
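Steps 2 and 3 can be sketched as a longest-prefix-match lookup followed by a forwarding decision. This is a userspace model under stated assumptions, not the eBPF datapath: `IPCACHE`, `ipcache_lookup`, and `forwarding_decision` are hypothetical names, the webpod1 entry reuses the IP and identity from the ping and Hubble output, and the catch-all entry stands for the reserved world identity (2).

```python
import ipaddress

# Illustrative ipcache contents: prefix -> (security identity, node IP).
IPCACHE = {
    "172.16.2.42/32": (39469, "192.168.10.101"),  # webpod1 on k8s-w1
    "0.0.0.0/0": (2, None),                       # reserved world identity
}


def ipcache_lookup(dst_ip, ipcache=IPCACHE):
    """Return the value of the most specific prefix containing dst_ip."""
    addr = ipaddress.ip_address(dst_ip)
    best = None
    for prefix, value in ipcache.items():
        net = ipaddress.ip_network(prefix)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, value)
    return None if best is None else best[1]


def forwarding_decision(dst_ip, local_node_ip, ipcache=IPCACHE):
    """Return (identity, action) for a packet leaving a local Pod."""
    entry = ipcache_lookup(dst_ip, ipcache)
    if entry is None:
        return None, "no ipcache entry"
    identity, node_ip = entry
    if node_ip is None:
        return identity, "outside the cluster"
    if node_ip == local_node_ip:
        return identity, "local endpoint"
    return identity, f"route to node {node_ip} without encapsulation"


print(forwarding_decision("172.16.2.42", "192.168.10.10"))
print(forwarding_decision("8.8.8.8", "192.168.10.10"))
```

The identity returned by the lookup is what policy is evaluated against, while the node IP only decides where the packet goes next.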
Let's see which Cilium code implements each of these steps.
Communication path code
* Cilium eBPF validates the packet and then processes it as IPv4 *
//cilium/bpf/bpf_lxc.c
static __always_inline int handle_ipv4_from_lxc(struct __ctx_buff *ctx, __u32 *dst_sec_identity,
                        __s8 *ext_err)
{
    struct ct_state *ct_state, ct_state_new = {};
    struct ipv4_ct_tuple *tuple;
...
    // Drop the packet if validation fails
    if (!revalidate_data(ctx, &data, &data_end, &ip4))
        return DROP_INVALID;
...

# ifdef ENABLE_HIGH_SCALE_IPCACHE
    if (identity_is_world_ipv4(identity)) {
        struct endpoint_info *ep;
        void *data, *data_end;
        struct iphdr *ip4;

        if (!revalidate_data(ctx, &data, &data_end, &ip4)) {
            ret = DROP_INVALID;
            goto out;
        }

        // Look up the local endpoint by the packet's source IP (ip4->saddr) in the endpoints map.
        // If an endpoint exists, its security ID becomes the packet's identity.
        ep = __lookup_ip4_endpoint(ip4->saddr);
        if (ep)
            identity = ep->sec_id;
    }
# endif /* ENABLE_HIGH_SCALE_IPCACHE */
        // Store the identity in the packet metadata
        ctx_store_meta(ctx, CB_SRC_LABEL, identity);
        ret = tail_call_internal(ctx, CILIUM_CALL_IPV4_CT_INGRESS, &ext_err);
        break;
...
}
* Endpoint eBPF map (ENDPOINTS_MAP) lookup *
//cilium/bpf/lib/eps.h
static __always_inline __maybe_unused struct endpoint_info *
__lookup_ip4_endpoint(__u32 ip)
{
    struct endpoint_key key = {};

    key.ip4 = ip;
    key.family = ENDPOINT_KEY_IPV4;

    return map_lookup_elem(&ENDPOINTS_MAP, &key);
}
Once traffic is generated, the related packet flow can be seen in the Hubble UI.
  
Checking service communication with Cilium
Create a ClusterIP service. Checking iptables confirms that KUBE-SVC rules are no longer created.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get svc,ep svc
NAME          TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)   AGE
service/svc   ClusterIP   10.10.254.180           80/TCP    2m1s

NAME            ENDPOINTS                       AGE
endpoints/svc   172.16.1.79:80,172.16.2.42:80   2m1s

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# iptables-save | grep KUBE-SVC
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# iptables-save | grep CILIUM
:CILIUM_POST_mangle - [0:0]
:CILIUM_PRE_mangle - [0:0]
-A PREROUTING -m comment --comment "cilium-feeder: CILIUM_PRE_mangle" -j CILIUM_PRE_mangle
-A POSTROUTING -m comment --comment "cilium-feeder: CILIUM_POST_mangle" -j CILIUM_POST_mangle
-A CILIUM_PRE_mangle ! -o lo -m socket --transparent -m comment --comment "cilium: any->pod redirect proxied traffic to host proxy" -j MARK --set-xmark 0x200/0xffffffff
-A CILIUM_PRE_mangle -p tcp -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 41489 --on-ip 127.0.0.1 --tproxy-mark 0x200/0xffffffff
-A CILIUM_PRE_mangle -p udp -m comment --comment "cilium: TPROXY to host cilium-dns-egress proxy" -j TPROXY --on-port 41489 --on-ip 127.0.0.1 --tproxy-mark 0x200/0xffffffff
:CILIUM_OUTPUT_raw - [0:0]
...
Call the service from netpod and, at the same time, capture the traffic inside the Pod with tcpdump in netpod.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# while true; do kubectl exec netpod -- curl -s $SVCIP | grep Hostname;echo "-----";sleep 1;done
Hostname: webpod2
-----
Hostname: webpod2
-----
Hostname: webpod2
-----
Hostname: webpod2
-----
Even though the capture was taken inside the Pod, it shows the already-DNATed webpod IP rather than the service IP.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl exec netpod -- tcpdump -enni any -q
tcpdump: data link type LINUX_SLL2
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on any, link-type LINUX_SLL2 (Linux cooked v2), snapshot length 262144 bytes
17:20:38.927364 eth0  Out ifindex 11 ba:11:4d:8c:87:38 172.16.0.214.60718 > 172.16.1.79.80: tcp 0
17:20:38.927861 eth0  In  ifindex 11 1a:09:90:79:58:0e 172.16.1.79.80 > 172.16.0.214.60718: tcp 0
17:20:38.927916 eth0  Out ifindex 11 ba:11:4d:8c:87:38 172.16.0.214.60718 > 172.16.1.79.80: tcp 0
17:20:38.927974 eth0  Out ifindex 11 ba:11:4d:8c:87:38 172.16.0.214.60718 > 172.16.1.79.80: tcp 76
17:20:38.928318 eth0  In  ifindex 11 1a:09:90:79:58:0e 172.16.1.79.80 > 172.16.0.214.60718: tcp 0
17:20:38.929140 eth0  In  ifindex 11 1a:09:90:79:58:0e 172.16.1.79.80 > 172.16.0.214.60718: tcp 312

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# kubectl get pod -o wide
NAME      READY   STATUS    RESTARTS   AGE     IP             NODE     NOMINATED NODE   READINESS GATES
netpod    1/1     Running   0          3h49m   172.16.0.214   k8s-s               
webpod1   1/1     Running   0          3h49m   172.16.2.42    k8s-w1              
webpod2   1/1     Running   0          3h49m   172.16.1.79    k8s-w2              
The service information available from Cilium includes the ClusterIP and the backend endpoints.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 service list
ID   Frontend              Service Type   Backend
9    10.10.254.180:80      ClusterIP      1 => 172.16.1.79:80 (active)
                                          2 => 172.16.2.42:80 (active)

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 bpf lb list
SERVICE ADDRESS           BACKEND ADDRESS (REVNAT_ID) (SLOT)
10.10.254.180:80 (1)      172.16.1.79:80 (9) (1)
10.10.254.180:80 (0)      0.0.0.0:0 (9) (0) [ClusterIP, non-routable]
10.10.254.180:80 (2)      172.16.2.42:80 (9) (2)

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 map get cilium_lb4_services_v2
Key                       Value                State   Error
10.10.254.180:80 (0)      0 2 (9) [0x0 0x0]    sync
10.10.254.180:80 (1)      9 0 (9) [0x0 0x0]    sync
10.10.254.180:80 (2)      10 0 (9) [0x0 0x0]   sync

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 map get cilium_lb4_backends_v3
Key   Value                 State   Error
10    ANY://172.16.2.42     sync
9     ANY://172.16.1.79     sync

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 map get cilium_lb4_reverse_nat
Key   Value                 State   Error
9     10.10.254.180:80      sync

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-s:~# c0 map get cilium_lb4_reverse_sk
Key                           Value                         State   Error
[172.16.2.42]:20480, 130299   [10.10.254.180]:20480, 2304
[172.16.1.79]:20480, 125195   [10.10.254.180]:20480, 2304
[172.16.1.79]:20480, 128771   [10.10.254.180]:20480, 2304
[172.16.1.79]:20480, 128778   [10.10.254.180]:20480, 2304
[172.16.1.79]:20480, 125202   [10.10.254.180]:20480, 2304
[172.16.1.79]:20480, 130277   [10.10.254.180]:20480, 2304
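The numbers 20480 and 2304 in the cilium_lb4_reverse_sk dump look odd until they are byte-swapped: read as 16-bit values stored in network byte order, they become port 80 and reverse NAT ID 9, which matches the other maps above. The sketch below decodes one entry under that assumption. It also assumes the trailing number in the key is the socket cookie; the helper names are hypothetical and this is not how the cilium CLI itself is implemented.

```python
import re

ENTRY = re.compile(r"\[([^\]]+)\]:(\d+), (\d+)")


def from_network_order(value):
    """Byte-swap a 16-bit value that was printed without ntohs()."""
    return int.from_bytes(value.to_bytes(2, "little"), "big")


def decode_reverse_sk(key, value):
    """Decode one key/value pair from the cilium_lb4_reverse_sk dump."""
    backend_ip, backend_port, cookie = ENTRY.fullmatch(key).groups()
    service_ip, service_port, rev_nat = ENTRY.fullmatch(value).groups()
    return {
        "cookie": int(cookie),  # assumed to be the socket cookie
        "backend": (backend_ip, from_network_order(int(backend_port))),
        "service": (service_ip, from_network_order(int(service_port))),
        "rev_nat_id": from_network_order(int(rev_nat)),
    }


print(decode_reverse_sk("[172.16.1.79]:20480, 125195",
                        "[10.10.254.180]:20480, 2304"))
```

Read this way, each entry ties one socket's chosen backend back to the service frontend 10.10.254.180:80 it originally connected to.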
Socket-Based LoadBalancing
Socket-based LoadBalancing (SLB) is the technique Cilium uses, when replacing kube-proxy, to route service traffic at the socket level in the kernel without iptables.
kube-proxy translates service IP → Pod IP with iptables or IPVS, whereas SLB handles the traffic with eBPF at the moment the socket is created and connected.
To verify this, let's analyze the actual syscalls with strace.
(⎈|HomeLab:N/A) root@k8s-ctr:~# c0 status --verbose | grep -i kubeproxyreplacement -A30
...
--
  KubeProxyReplacement Details:
  Status:                 True
  Socket LB:              Enabled
  Socket LB Tracing:      Enabled
  Socket LB Coverage:     Full
  Devices:                eth0    10.0.2.15 fd17:625c:f037:2:a00:27ff:fe71:19d8 fe80::a00:27ff:fe71:19d8, eth1   192.168.10.100 fe80::a00:27ff:fefd:46df (Direct Routing)
  Mode:                   SNAT
  Backend Selection:      Random
  Session Affinity:       Enabled
  Graceful Termination:   Enabled
  NAT46/64 Support:       Disabled
  XDP Acceleration:       Disabled
  Services:
  - ClusterIP:      Enabled
  - NodePort:       Enabled (Range: 30000-32767)
  - LoadBalancer:   Enabled
  - externalIPs:    Enabled
  - HostPort:       Enabled
...

# Check the syscalls that are made
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec curl-pod -- strace -c curl -s webpod
Hostname: webpod-659cd747f8-zqbk8
...

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 ...
  2.33    0.000287          95         3         1 connect
  0.63    0.000077          19         4           socket
  ...
  0.11    0.000013           2         5           getsockname
...
  0.02    0.000002           2         1           getsockopt
...

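The connect, getsockname, and getsockopt system calls are invoked.

| syscall | Description |
| --- | --- |
| connect() | Requests a connection to the service ClusterIP through the socket |
| getsockname() | Returns the local address of the connected socket. *User space cannot see the destination IP that BPF redirects to, because the redirect happens at the kernel socket level, not in user space. |
| getsockopt() | Checks the socket's error state or options |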

# Check the syscalls in detail
(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec curl-pod -- strace -e trace=connect curl -s webpod
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.96.251.68")}, 16) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.96.251.68")}, 16) = -1 EINPROGRESS (Operation in progress)

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec curl-pod -- strace -e trace=getsockname curl -s webpod
getsockname(4, {sa_family=AF_INET, sin_port=htons(33247), sin_addr=inet_addr("172.20.0.6")}, [128 => 16]) = 0
getsockname(5, {sa_family=AF_INET, sin_port=htons(57096), sin_addr=inet_addr("172.20.0.6")}, [16]) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(60714), sin_addr=inet_addr("172.20.0.6")}, [128 => 16]) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(60714), sin_addr=inet_addr("172.20.0.6")}, [128 => 16]) = 0
getsockname(4, {sa_family=AF_INET, sin_port=htons(60714), sin_addr=inet_addr("172.20.0.6")}, [128 => 16]) = 0

(⎈|HomeLab:N/A) root@k8s-ctr:~# kubectl exec curl-pod -- strace -e trace=getsockopt curl -s webpod
getsockopt(4, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
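connect()

The sin_addr=inet_addr("10.96.251.68") above is the ClusterIP of the webpod service.
Cilium detects this IP at the time of the connect syscall, picks one of the actual Pod IPs behind the service, and redirects the connection to it. This happens in the kernel's eBPF layer.

getsockopt()

A routine call that checks the socket for errors; here it confirms that the connection succeeded.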
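One way to make the point concrete is to extract the destinations from the strace output and check that none of them is a Pod IP, which is what one would expect if the rewrite happens below the syscall boundary. The parser is a hypothetical sketch; `POD_CIDR` is an assumption based on the 172.20.x.x Pod addresses shown earlier.

```python
import ipaddress
import re

STRACE = """\
connect(4, {sa_family=AF_INET, sin_port=htons(53), sin_addr=inet_addr("10.96.0.10")}, 16) = 0
connect(5, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.96.251.68")}, 16) = 0
connect(4, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("10.96.251.68")}, 16) = -1 EINPROGRESS (Operation in progress)
"""

CONNECT = re.compile(r'sin_port=htons\((\d+)\), sin_addr=inet_addr\("([^"]+)"\)')
POD_CIDR = ipaddress.ip_network("172.20.0.0/16")  # assumption, see lead-in


def connect_targets(strace_output):
    """Return the (ip, port) pairs passed to connect() by the application."""
    targets = []
    for line in strace_output.splitlines():
        match = CONNECT.search(line)
        if line.startswith("connect(") and match:
            targets.append((match.group(2), int(match.group(1))))
    return targets


def pod_ips_seen(targets):
    """Return the targets that fall inside the Pod CIDR."""
    return [ip for ip, _ in targets if ipaddress.ip_address(ip) in POD_CIDR]


targets = connect_targets(STRACE)
print(targets)
print(pod_ips_seen(targets))
```

The application only ever names the kube-dns and webpod ClusterIPs, while the reply in the earlier curl output came from a webpod Pod.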

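Analyzing the communication path
When a Pod calls the connect() system call on a socket and the destination address is a service address, the socket's destination address is rewritten to a backend address of that service. (*The rest of the path is the same as Pod-to-Pod communication.)
This is handled by a BPF program attached to a cgroup, which requires the cgroup filesystem to be mounted so that it can be used for Pods.
The init containers of the cilium Pod show this cgroup mount being performed.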
  - command:
    - sh
    - -ec
    - |
      cp /usr/bin/cilium-mount /hostbin/cilium-mount;
      nsenter --cgroup=/hostproc/1/ns/cgroup --mount=/hostproc/1/ns/mnt "${BIN_PATH}/cilium-mount" $CGROUP_ROOT;
Analyzing the code
When connect() is called on a socket, lb4_lookup_service is called to check whether a service exists for dst_ip and dst_port.
//cilium/bpf/bpf_sock.c
__section("cgroup/connect4")
int cil_sock4_connect(struct bpf_sock_addr *ctx)
{
    int err;
...

    err = __sock4_xlate_fwd(ctx, ctx, false);
...
}

static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full,
                         const bool udp_only)
{
    struct lb4_key key = {
        .address    = dst_ip,
        .dport        = dst_port,
#if defined(ENABLE_SERVICE_PROTOCOL_DIFFERENTIATION)
        .proto        = protocol,
#endif
    };
  ...
    // Check whether the service exists
    svc = lb4_lookup_service(&key, true);
}
서비스가 존재하면 해당 서비스에 연결된 백엔드 정보를 backend_slot에 저장한 후, 소켓의 목적지 주소와 포트를 백엔드 주소와 포트로 변경한다.
//cilium/bpf/bpf_sock.c
static __always_inline int __sock4_xlate_fwd(struct bpf_sock_addr *ctx,
                         struct bpf_sock_addr *ctx_full,
                         const bool udp_only)
{
  ...
    // Return an error if no matching service exists
    if (!svc)
        return -ENXIO;
    // Return an error if the service has no backends
    if (svc->count == 0 && !lb4_svc_is_l7loadbalancer(svc))
        return -EHOSTUNREACH;
...
    if (backend_id == 0) {
        backend_from_affinity = false;
        // Select one of the backends attached to the service and store its slot in key.backend_slot
        key.backend_slot = (sock_select_slot(ctx_full) % svc->count) + 1;
        backend_slot = __lb4_lookup_backend_slot(&key);
...
        // Use the backend ID to fetch the backend information
        backend_id = backend_slot->backend_id;
        backend = __lb4_lookup_backend(backend_id);
    }
    ...
    // Register an entry in the reverse NAT table so that replies are mapped back to the correct origin
    if (sock4_update_revnat(ctx_full, backend, &orig_key,
                svc->rev_nat_index) < 0) {
        update_metrics(0, METRIC_EGRESS, REASON_LB_REVNAT_UPDATE);
        return -ENOMEM;
    }
    // Set the socket's destination address to the backend IP and the destination port to the backend port
    ctx->user_ip4 = backend->address;
    ctx_set_port(ctx, backend->port);

    return 0;
}
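
To make the connect-time translation easier to follow, here is a minimal user-space sketch in Go of the same decision sequence: look up the service, fail if it is missing or has no backends, then pick a backend slot from a per-socket value. It is only an illustration and not Cilium's implementation; the types svcKey, backend and service, the function translate, and the backend addresses are all hypothetical, and the reverse NAT bookkeeping is left out.

```go
package main

import (
	"errors"
	"fmt"
)

// svcKey, backend and service are hypothetical stand-ins for the BPF map
// key/value types; they do not mirror Cilium's real struct layouts.
type svcKey struct {
	addr string
	port uint16
}

type backend struct {
	addr string
	port uint16
}

type service struct {
	backends []backend // stands in for backend slots 1..count
}

var (
	errNoService = errors.New("no service for destination") // plays the role of -ENXIO
	errNoBackend = errors.New("service has no backends")    // plays the role of -EHOSTUNREACH
)

// translate mimics the connect-time rewrite: look up the service for the
// destination, pick a backend slot from a per-socket value, and return the
// backend that should replace the original destination.
func translate(services map[svcKey]service, dst svcKey, sockCookie uint64) (backend, error) {
	svc, ok := services[dst]
	if !ok {
		return backend{}, errNoService
	}
	if len(svc.backends) == 0 {
		return backend{}, errNoBackend
	}
	slot := sockCookie % uint64(len(svc.backends))
	return svc.backends[slot], nil
}

func main() {
	// The ClusterIP comes from the strace output; the backend IPs are made up.
	services := map[svcKey]service{
		{addr: "10.96.251.68", port: 80}: {
			backends: []backend{
				{addr: "172.20.0.10", port: 80},
				{addr: "172.20.1.20", port: 80},
			},
		},
	}

	b, err := translate(services, svcKey{addr: "10.96.251.68", port: 80}, 7)
	fmt.Println(b, err)

	_, err = translate(services, svcKey{addr: "10.96.0.1", port: 443}, 7)
	fmt.Println(err)
}
```

Because the rewrite happens once per connect() rather than per packet, the application keeps talking to what it believes is the service address while the kernel socket already points at the chosen backend.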



Istio
Fri, 18 Oct 2024 16:37:25 GMT
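Istio components
Istio consists of two parts: the control plane, a central management system that configures network policy and delivers it to the data plane, and the data plane, which controls communication between services and handles the actual traffic. Each is described in detail below.
Control plane (Istiod)
Istiod
Istiod is the core component that runs the control plane's management functions. Broadly, it is responsible for the following.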

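Service discovery: retrieves service information from the Kubernetes API server
Traffic control:
  Traffic shifting: provides canary deployments
  Circuit breaker: blocks access when the destination microservice is unhealthy and returns a request error to the source microservice
  Fault injection: deliberately delays or fails requests
  Rate limit: limits the number of requests
Security: generates and distributes TLS certificates and manages authentication/authorization between services
Integration with monitoring and logging systems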

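These functions were originally implemented by separate components (Pilot, Galley, and Citadel), but since Istio 1.5 all of them have been consolidated into the single Istiod process. Let's look at how each of these former components is implemented inside Istiod.
Because the codebase is large, only part of the implementation is analyzed here, with the aim of understanding Istiod's overall functionality.
Pilot
Pilot is responsible for service discovery, traffic control, and delivering configuration to Envoy.
* Service discovery *

Collects service information from the Kubernetes cluster
Code location: istio/pilot/pkg/serviceregistry/kube/controller/controller.go

1) Create the controller that performs service discovery and register the event handler that watches for Service events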
// NewController creates a new Kubernetes controller
// Created by bootstrap and multicluster (see multicluster.Controller).
func NewController(kubeClient kubelib.Client, options Options) *Controller {
    c := &Controller{
        opts:                     options,
        client:                   kubeClient,
        queue:                    queue.NewQueueWithID(1*time.Second, string(options.ClusterID)),
        servicesMap:              make(map[host.Name]*model.Service),
        nodeSelectorsForServices: make(map[host.Name]labels.Instance),
        nodeInfoMap:              make(map[string]kubernetesNode),
        workloadInstancesIndex:   workloadinstances.NewIndex(),
        initialSyncTimedout:      atomic.NewBool(false),

        configCluster: options.ConfigCluster,
    }
  ...
  c.services = kclient.NewFiltered[*v1.Service](kubeClient, kclient.Filter{ObjectFilter: kubeClient.ObjectFilter()})

    // Register the event handler
    registerHandlers[*v1.Service](c, c.services, "Services", c.onServiceEvent, nil)

    c.endpoints = newEndpointSliceController(c)
}
2) Run a function when an event occurs on a service

When a service delete/add/update is detected, the function corresponding to that event is executed.
func (c *Controller) onServiceEvent(pre, curr *v1.Service, event model.Event) error {
  log.Debugf("Handle event %s for service %s in namespace %s", event, curr.Name, curr.Namespace)

  // Create the standard (cluster.local) service.
  svcConv := kube.ConvertService(*curr, c.opts.DomainSuffix, c.Cluster(), c.meshWatcher.Mesh())

  switch event {
  // On a delete event, call deleteService
  case model.EventDelete:
      c.deleteService(svcConv)
  default:
  // On any other event (add/update), call addOrUpdateService
      c.addOrUpdateService(pre, curr, svcConv, event, false)
  }

  return nil
}

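For any event other than a delete, addOrUpdateService is executed.

The latest service information is stored in servicesMap, a map from the service hostname (e.g. test.default.svc.cluster.local) to the service object managed by Istio.

The reason for keeping a separate servicesMap is to detect Service changes quickly and to push information to Envoy only when something has actually changed.
When a Service is checked, its previous state is compared with its latest state, and a push to Envoy over xDS happens only if they differ. With the map, the previous state can be found directly by hostname and compared with the latest Service without listing every Service.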
func (c *Controller) addOrUpdateService(pre, curr *v1.Service, currConv *model.Service, event model.Event, updateEDSCache bool) {
...
c.Lock()
// prevConv is the previous state in servicesMap of the service with the same hostname as the one being processed
prevConv := c.servicesMap[currConv.Hostname]
c.servicesMap[currConv.Hostname] = currConv
c.Unlock()

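By default needsFullPush is false. If the service has an external IP, that is, if the service acts as a gateway, a full push is performed to update all services and all endpoints (EDS). This is because a change to a service that is reachable from outside the cluster can change how traffic is routed.

A full push also happens for NodePort services, because a change to a NodePort service affects the rules on every node.

For all other services, only the changed service is updated.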
...
// If the service acts as a gateway, update all EDS endpoints
    needsFullPush = c.extractGatewaysFromService(currConv)
...
// If the service is a NodePort service, update all EDS endpoints
    needsFullPush = c.updateServiceNodePortAddresses(currConv)
...
// This full push needed to update ALL ends endpoints, even though we do a full push on service add/update
// as that full push is only triggered for the specific service.
if needsFullPush {
    // networks are different, we need to update all eds endpoints
    c.opts.XDSUpdater.ConfigUpdate(&model.PushRequest{Full: true, Reason: model.NewReasonStats(model.NetworksTrigger)})
}

The endpoint information of the newly converted Service is collected and written to the cache used for Envoy.
// We also need to update when the Service changes. For Kubernetes, a service change will result in Endpoint updates,
// but workload entries will also need to be updated.
// TODO(nmittler): Build different sets of endpoints for cluster.local and clusterset.local.
if updateEDSCache || features.EnableK8SServiceSelectWorkloadEntries {
    endpoints := c.buildEndpointsForService(currConv, updateEDSCache)
    if len(endpoints) > 0 {
        c.opts.XDSUpdater.EDSCacheUpdate(shard, string(currConv.Hostname), ns, endpoints)
    }
}

prevConv and currConv are compared, and if nothing has changed, no push to Envoy takes place.
If there is a change, the Service update is announced to Envoy over xDS, and NotifyServiceHandlers is called to publish the service change event.
// filter out same service event
// If prevConv and currConv are identical, do not push to Envoy.
if event == model.EventUpdate && !serviceUpdateNeedsPush(pre, curr, prevConv, currConv) {
    return
}

c.opts.XDSUpdater.SvcUpdate(shard, string(currConv.Hostname), ns, event)
c.handlers.NotifyServiceHandlers(prevConv, currConv, event)
}
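
The "swap the map entry, then push only if something changed" idea can be sketched in isolation. The Go program below is a simplified illustration under stated assumptions, not Istio's code: the svc type, the registry type, and the use of reflect.DeepEqual as the change check are hypothetical stand-ins for model.Service, the controller, and serviceUpdateNeedsPush.

```go
package main

import (
	"fmt"
	"reflect"
	"sync"
)

// svc is a hypothetical, trimmed-down stand-in for model.Service.
type svc struct {
	Hostname string
	Ports    []int
}

// registry keeps the last converted state per hostname, like servicesMap.
type registry struct {
	mu       sync.Mutex
	services map[string]svc
	pushes   int // how many times a push would have been triggered
}

// addOrUpdate stores the new state and reports whether a push is needed.
// The previous state is fetched by hostname, so no full listing is required.
func (r *registry) addOrUpdate(curr svc) bool {
	r.mu.Lock()
	defer r.mu.Unlock()

	prev, existed := r.services[curr.Hostname]
	r.services[curr.Hostname] = curr

	if existed && reflect.DeepEqual(prev, curr) {
		return false // same service event: nothing to push
	}
	r.pushes++
	return true
}

func main() {
	r := &registry{services: map[string]svc{}}
	host := "test.default.svc.cluster.local"

	fmt.Println(r.addOrUpdate(svc{Hostname: host, Ports: []int{80}}))      // new service
	fmt.Println(r.addOrUpdate(svc{Hostname: host, Ports: []int{80}}))      // unchanged
	fmt.Println(r.addOrUpdate(svc{Hostname: host, Ports: []int{80, 443}})) // port added
	fmt.Println("pushes:", r.pushes)
}
```

Keying the map by hostname is what makes the comparison cheap: the previous state is a single lookup away, regardless of how many services the cluster has.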




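* Traffic control *
Changed endpoints are cached, and routing information is regenerated according to VirtualService and DestinationRule resources.
Traffic shifting and fault injection can be implemented with a VirtualService, circuit breakers are configured with a DestinationRule, and rate limits can be configured with an EnvoyFilter.
As a representative example, let's look at the code that processes a VirtualService.
1) VirtualService processing: generating routing rules

After the http route rules are extracted from the VirtualService spec, the slice of routing rules is built by handling two cases separately: rules with a match (for example, an explicit URI path) and rules without any match.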
//istio/pilot/pkg/networking/core/route/route.go
func BuildHTTPRoutesForVirtualService(
  node *model.Proxy,
  virtualService config.Config,
  serviceRegistry map[host.Name]*model.Service,
  hashByDestination DestinationHashMap,
  listenPort int,
  gatewayNames sets.String,
  opts RouteOptions,
) ([]*route.Route, error) {
  vs, ok := virtualService.Spec.(*networking.VirtualService)
...

  out := make([]*route.Route, 0, len(vs.Http))

  catchall := false
  for _, http := range vs.Http {
  // No match conditions: generate a single routing rule for this http entry
      if len(http.Match) == 0 {
          if r := translateRoute(node, http, nil, listenPort, virtualService, serviceRegistry,
              hashByDestination, gatewayNames, opts); r != nil {
              out = append(out, r)
          }
          catchall = true
      } else {
      // Match conditions present: generate one routing rule per match
          for _, match := range http.Match {
              if r := translateRoute(node, http, match, listenPort, virtualService, serviceRegistry,
                  hashByDestination, gatewayNames, opts); r != nil {
                  out = append(out, r)
...
  return out, nil
}
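
The split between rules with and without match conditions can be sketched as follows. This is a simplified illustration and not Istio's implementation: httpRule, route and buildRoutes are hypothetical names, only prefix matches are modeled, and stopping after the first catch-all rule is an assumption of this sketch (routes are evaluated in order, so nothing after a catch-all could match).

```go
package main

import "fmt"

// httpRule is a hypothetical, trimmed-down stand-in for one http entry of a
// VirtualService: zero prefixes means the entry has no match block.
type httpRule struct {
	Prefixes []string
	Cluster  string
}

// route is a hypothetical stand-in for a generated Envoy route.
type route struct {
	Prefix  string
	Cluster string
}

// buildRoutes emits one route per match condition, or a single "/" route for
// an entry without matches, and stops once such a catch-all has been emitted.
func buildRoutes(rules []httpRule) []route {
	out := make([]route, 0, len(rules))
	for _, rule := range rules {
		if len(rule.Prefixes) == 0 {
			out = append(out, route{Prefix: "/", Cluster: rule.Cluster})
			break // assumption: later entries are unreachable
		}
		for _, p := range rule.Prefixes {
			out = append(out, route{Prefix: p, Cluster: rule.Cluster})
		}
	}
	return out
}

func main() {
	rules := []httpRule{
		{Prefixes: []string{"/api", "/v2"}, Cluster: "backend-a"},
		{Cluster: "backend-b"}, // no match block: catch-all
		{Prefixes: []string{"/never"}, Cluster: "backend-c"},
	}
	for _, r := range buildRoutes(rules) {
		fmt.Printf("%s -> %s\n", r.Prefix, r.Cluster)
	}
}
```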


2) When an RDS push is needed, the routing rules are generated through BuildHTTPRoutes and the generated rules are delivered to Envoy over xDS.
//istio/pilot/pkg/xds/xdsgen.go
func (s *DiscoveryServer) pushXds(con *Connection, w *model.WatchedResource, req *model.PushRequest) error {
...
    res, logdata, err := gen.Generate(con.proxy, w, req)
...
}

//istio/pilot/pkg/xds/rds.go
func (c RdsGenerator) Generate(proxy *model.Proxy, w *model.WatchedResource, req *model.PushRequest) (model.Resources, model.XdsLogDetails, error) {
...
// Call BuildHTTPRoutes, which generates the routing rules
    resources, logDetails := c.ConfigGenerator.BuildHTTPRoutes(proxy, req, w.ResourceNames)
    return resources, logDetails, nil
}

//istio/pilot/pkg/networking/core/httproute.go
func (configgen *ConfigGeneratorImpl) BuildHTTPRoutes(
    node *model.Proxy,
    req *model.PushRequest,
    routeNames []string,
) ([]*discovery.Resource, model.XdsLogDetails) {
...
// Following the call chain from here eventually reaches functions such as BuildHTTPRoutesForVirtualService (shown above), which generate the routing rules.
}
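* Delivering configuration to Envoy *
istiod delivers the configuration it has collected to Envoy through the xDS sync API.

LDS (Listener Discovery Service)
RDS (Route Discovery Service)
CDS (Cluster Discovery Service)
EDS (Endpoint Discovery Service)
The resources above are delivered to Envoy over the xDS API through a single aggregated stream called ADS (Aggregated Discovery Service).

Code location: istio/pilot/pkg/xds/ads.go
Code location: istio/pkg/xds/server.go
Code location: istio/pilot/pkg/xds/discovery.go
1) Establishing the connection with Envoy

Before the connection with Envoy is established, the gRPC context and the client information are initialized.
//istio/pilot/pkg/xds/ads.go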
// StreamAggregatedResources implements the ADS interface.
func (s *DiscoveryServer) StreamAggregatedResources(stream DiscoveryStream) error {
  return s.Stream(stream)
}



func (s *DiscoveryServer) Stream(stream DiscoveryStream) error {
...
    ctx := stream.Context()
    peerAddr := "0.0.0.0"
    if peerInfo, ok := peer.FromContext(ctx); ok {
        peerAddr = peerInfo.Addr.String()
    }

2) The client is authenticated. On success, a log entry marking the request as an authenticated xDS request is written together with the identity information.
//istio/pilot/pkg/xds/ads.go
    ids, err := s.authenticate(ctx)
    if err != nil {
        return status.Error(codes.Unauthenticated, err.Error())
    }
    if ids != nil {
        log.Debugf("Authenticated XDS: %v with identity %v", peerAddr, ids)
    } else {
        log.Debugf("Unauthenticated XDS: %s", peerAddr)
    }

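The xDS resources to be delivered to Envoy are initialized, and a connection object for the client (Envoy) is created.
Once the connection object exists, the xDS stream with Envoy is started through xds.Stream.
//istio/pilot/pkg/xds/ads.go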
// InitContext returns immediately if the context was already initialized.
if err = s.globalPushContext().InitContext(s.Env, nil, nil); err != nil {
    // Error accessing the data - log and close, maybe a different pilot replica
    // has more luck
    log.Warnf("Error reading config %v", err)
    return status.Error(codes.Unavailable, "error reading config")
}
con := newConnection(peerAddr, stream)
con.ids = ids
con.s = s
return xds.Stream(con)
}




3) When an LDS, RDS, CDS, or EDS related update occurs, a request is put into pushChannel through the ConfigUpdate function.
As a representative example, let's look at the EDS function.
//EDS
//istio/pilot/pkg/xds/eds.go
func (s *DiscoveryServer) EDSUpdate(shard model.ShardKey, serviceName string, namespace string,
    istioEndpoints []*model.IstioEndpoint,
) {
    inboundEDSUpdates.Increment()
    // Update the endpoint shards
    pushType := s.Env.EndpointIndex.UpdateServiceEndpoints(shard, serviceName, namespace, istioEndpoints)
    if pushType == model.IncrementalPush || pushType == model.FullPush {
        // Trigger a push
        // When the endpoints have been updated, call ConfigUpdate
        s.ConfigUpdate(&model.PushRequest{
            Full:           pushType == model.FullPush,
            ConfigsUpdated: sets.New(model.ConfigKey{Kind: kind.ServiceEntry, Name: serviceName, Namespace: namespace}),
            Reason:         model.NewReasonStats(model.EndpointUpdate),
        })
    }
}

//istio/pilot/pkg/xds/discovery.go
func (s *DiscoveryServer) ConfigUpdate(req *model.PushRequest) {
...
// Put the request into pushChannel.
    s.pushChannel <- req
}
4) When Istiod has an event to push to Envoy (pushChannel), the event and the corresponding configuration (xDS) are delivered over the xDS stream that was started earlier.
//istio/pkg/xds/server.go
func Stream(ctx ConnectionContext) error {
...
        select {
...
// The request created by the EDS change above was put into the push channel. It is finally delivered to Envoy over xDS through the Push function.
        case pushEv := <-con.pushChannel:
            err := ctx.Push(pushEv)
            if err != nil {
                return err
            }
        case <-con.stop:
            return nil
}

//istio/pilot/pkg/xds/ads.go
func (conn *Connection) Push(ev any) error {
    pushEv := ev.(*Event)
    err := conn.s.pushConnection(conn, pushEv)
    pushEv.done()
    return err
}

func (s *DiscoveryServer) pushConnection(con *Connection, pushEv *Event) error {
    pushRequest := pushEv.pushRequest
...
    wrl := con.watchedResourcesByOrder()
    for _, w := range wrl {
        if err := s.pushXds(con, w, pushRequest); err != nil {
            return err
        }
    }
...
}
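
The producer/consumer relationship between steps 3) and 4) is a plain Go channel pattern, sketched below. This is an illustration only and not Istio's code: pushRequest and stream are hypothetical, the real server debounces and fans requests out per connection, and here a single unbuffered channel connects one producer to one stream.

```go
package main

import "fmt"

// pushRequest is a hypothetical stand-in for model.PushRequest.
type pushRequest struct {
	Full   bool
	Reason string
}

// stream drains the push channel until stop is closed, mirroring the select
// loop of the stream handler, and hands every request to the push callback.
func stream(pushCh <-chan pushRequest, stop <-chan struct{}, push func(pushRequest) error) error {
	for {
		select {
		case req := <-pushCh:
			if err := push(req); err != nil {
				return err
			}
		case <-stop:
			return nil
		}
	}
}

func main() {
	pushCh := make(chan pushRequest)
	stop := make(chan struct{})
	done := make(chan error)

	var sent []string
	go func() {
		done <- stream(pushCh, stop, func(r pushRequest) error {
			sent = append(sent, r.Reason) // stands in for sending xDS resources to Envoy
			return nil
		})
	}()

	// Producer side: what ConfigUpdate does in the text above.
	pushCh <- pushRequest{Reason: "endpoint update"}
	pushCh <- pushRequest{Full: true, Reason: "networks trigger"}
	close(stop)

	err := <-done
	fmt.Println(sent, err)
}
```

Decoupling the two sides through a channel means that whatever detects a change (service, endpoint, or config watcher) never has to know which Envoy connections exist; it only enqueues a request.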
Galley
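Galley validates and transforms Istio configuration data.
In Istiod this functionality is provided by the code below. As a representative example, let's look at the code that validates an HTTP header name.
1) If the HTTP header name is empty, or if it does not match validHeaderRegex, an error is returned.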
//istio/pkg/config/validation

var validHeaderRegex = regexp.MustCompile("^[-_A-Za-z0-9]+$")
// ValidateHTTPHeaderName validates a header name
func ValidateHTTPHeaderName(name string) error {
    if name == "" {
        return fmt.Errorf("header name cannot be empty")
    }
    if !validHeaderRegex.MatchString(name) {
        return fmt.Errorf("header name %s is not a valid header name", name)
    }
    return nil
}
Beyond this, validation is provided for many other items such as Metadata, Weight, Percent, Gateway, Server, and Port.
Citadel
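Citadel provides security management and authentication for Istio.
In Istiod this functionality is provided by the code below. As a representative example, let's look at the code that periodically creates/renews the self-signed CA certificate and stores it as a Kubernetes Secret.
1) The existing CA certificate is loaded; if none exists, a new certificate is generated. The certificate is then stored as a Kubernetes Secret.
//istio/security/pkg/pki/ca/ca.go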
func NewSelfSignedIstioCAOptions(ctx context.Context,
    rootCertGracePeriodPercentile int, caCertTTL, rootCertCheckInverval, defaultCertTTL,
    maxCertTTL time.Duration, org string, useCacertsSecretName, dualUse bool, namespace string, client corev1.CoreV1Interface,
    rootCertFile string, enableJitter bool, caRSAKeySize int,
) (caOpts *IstioCAOptions, err error) {
...
// Load the existing certificate
        err := loadSelfSignedCaSecret(client, namespace, caCertName, rootCertFile, caOpts)
    ...
            pkiCaLog.Infof("CASecret %s not found, will create one", caCertName)
      ...
// If none exists, generate a new certificate
            pemCert, pemKey, ckErr := util.GenCertKeyFromOptions(options)
      ...
            if caOpts.KeyCertBundle, err = util.NewVerifiedKeyCertBundleFromPem(pemCert, pemKey, nil, rootCerts); err != nil {
                pkiCaLog.Warnf("failed to create CA KeyCertBundle (%v)", err)
                return fmt.Errorf("failed to create CA KeyCertBundle (%v)", err)
            }
...
// Create a Kubernetes Secret from the generated certificate
            secret := BuildSecret(caCertName, namespace, nil, nil, pemCert, pemCert, pemKey, istioCASecretType)
            _, err = client.Secrets(namespace).Create(context.TODO(), secret, metav1.CreateOptions{})
}
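Data plane (envoy proxy)
Envoy Proxy is deployed as a sidecar next to each application and controls and handles communication between services. It operates on the configuration it receives from Istiod, described above, and Istiod and Envoy communicate using the xDS protocol.
Its features are covered in more detail in the Envoy section below.
Envoy
Envoy has the following building blocks.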

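Listener: defines how Envoy accepts incoming traffic (e.g. IP/port)
Route: maps incoming requests to a specific cluster according to rules (e.g. headers, path)
Cluster: a logical grouping of upstream endpoints (service instances)
Endpoint: a specific backend instance (IP/port) within a cluster
Filter: processes traffic at various stages, for example authentication or rate limiting
Upstream: the target service or server to which Envoy forwards requests
Downstream: the client or service that initiates a request to Envoy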
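
How these pieces relate can be shown with a toy model: a downstream request arrives at a listener, a route maps its path to a cluster, and the cluster picks one of its endpoints as the upstream. The Go sketch below is purely illustrative; the types and the round-robin choice are assumptions and do not reflect Envoy's internal data structures or its configurable load-balancing policies.

```go
package main

import (
	"fmt"
	"strings"
)

// Hypothetical, minimal models of the Envoy concepts listed above.
type endpoint struct{ Addr string }

type cluster struct {
	Name      string
	Endpoints []endpoint
	next      int // round-robin cursor
}

type routeRule struct {
	Prefix  string
	Cluster string
}

type listener struct {
	Addr     string
	Routes   []routeRule
	Clusters map[string]*cluster
}

// pick returns the cluster's next endpoint in round-robin order.
func (c *cluster) pick() endpoint {
	ep := c.Endpoints[c.next%len(c.Endpoints)]
	c.next++
	return ep
}

// handle resolves a downstream request path to an upstream endpoint by
// walking the routes in order and using the first matching prefix.
func (l *listener) handle(path string) (endpoint, bool) {
	for _, r := range l.Routes {
		if strings.HasPrefix(path, r.Prefix) {
			c, ok := l.Clusters[r.Cluster]
			if !ok || len(c.Endpoints) == 0 {
				return endpoint{}, false
			}
			return c.pick(), true
		}
	}
	return endpoint{}, false
}

func main() {
	l := &listener{
		Addr:   "0.0.0.0:10000",
		Routes: []routeRule{{Prefix: "/", Cluster: "service_a"}},
		Clusters: map[string]*cluster{
			"service_a": {
				Name:      "service_a",
				Endpoints: []endpoint{{Addr: "10.0.0.1:80"}, {Addr: "10.0.0.2:80"}},
			},
		},
	}
	for _, p := range []string{"/", "/docs", "/api"} {
		ep, ok := l.handle(p)
		fmt.Println(p, "->", ep.Addr, ok)
	}
}
```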

Envoy proxy hands-on
Install envoy and run a simple envoy proxy exercise.
# wget -O- https://apt.envoyproxy.io/signing.key | sudo gpg --dearmor -o /etc/apt/keyrings/envoy-keyring.gpg

# echo "deb [signed-by=/etc/apt/keyrings/envoy-keyring.gpg] https://apt.envoyproxy.io jammy main" | sudo tee /etc/apt/sources.list.d/envoy.list

# sudo apt-get update && sudo apt-get install envoy -y

# envoy --version
envoy  version: e3b4a6e9570da15ac1caffdded17a8bebdc7dfc9/1.32.0/Clean/RELEASE/BoringSSL

static_resources:

  listeners:
  - name: listener_0
    address:
      socket_address:
        address: 0.0.0.0
        port_value: 10000
    filter_chains:
    - filters:
      - name: envoy.filters.network.http_connection_manager
        typed_config:
          "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
          stat_prefix: ingress_http
          access_log:
          - name: envoy.access_loggers.stdout
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
          http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
          route_config:
            name: local_route
            virtual_hosts:
            - name: local_service
              domains: ["*"]
              routes:
              - match:
                  prefix: "/"
                route:
                  host_rewrite_literal: www.envoyproxy.io
                  cluster: service_envoyproxy_io

  clusters:
  - name: service_envoyproxy_io
    type: LOGICAL_DNS
    # Comment out the following line to test on v6 networks
    dns_lookup_family: V4_ONLY
    load_assignment:
      cluster_name: service_envoyproxy_io
      endpoints:
      - lb_endpoints:
        - endpoint:
            address:
              socket_address:
                address: www.envoyproxy.io
                port_value: 443
    transport_socket:
      name: envoy.transport_sockets.tls
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext
        sni: www.envoyproxy.io
Verifying connectivity
root@testpc:~# envoy -c envoy-demo.yaml

root@testpc:~# ss -tnlp
State      Recv-Q     Send-Q          Local Address:Port            Peer Address:Port     Process
LISTEN     0          4096                  0.0.0.0:10000                0.0.0.0:*         users:(("envoy",pid=3407,fd=25))

root@testpc:~# curl -s http://127.0.0.1:10000 | grep -o "<title>.*</title>"
<title>Envoy proxy - home</title>

(⎈|default:N/A) root@k3s-s:~# curl -s http://192.168.10.200:10000 | grep -o "<title>.*</title>"
<title>Envoy proxy - home</title>

root@testpc:~# 
...
[2024-10-18T23:24:30.719Z] "GET / HTTP/1.1" 200 - 0 15795 355 284 "-" "curl/7.81.0" "6af88bbd-742c-4eb2-be25-af8c364b23c0" "www.envoyproxy.io" "46.137.195.11:443"
Overriding the admin page settings
cat <<EOT> envoy-override.yaml
admin:
  address:
    socket_address:
      address: 0.0.0.0
      port_value: 9902
EOT
envoy -c envoy-demo.yaml --config-yaml "$(cat envoy-override.yaml)"

# Print the external access URL of the envoy admin page
echo -e "http://$(curl -s ipinfo.io/ip):9902"

Installing Istio
Deploy the yaml below to install the Istio base component and Istiod.
To reduce complexity, the ingressgateway is enabled and the egressgateway is disabled.
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  components:
    base:
      enabled: true
    egressGateways:
    - enabled: false
      name: istio-egressgateway
    ingressGateways:
    - enabled: true
      name: istio-ingressgateway
    pilot:
      enabled: true
  hub: docker.io/istio
  profile: demo
  tag: 1.23.2
  values:
    defaultRevision: ""
    gateways:
      istio-egressgateway: {}
      istio-ingressgateway: {}
    global:
      configValidation: true
      istioNamespace: istio-system
    profile: demo
Deploying Istio creates the following resources.
*The istio-ingressgateway service was changed to NodePort.
* Service *
NAME                   TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                                                                      AGE
istio-ingressgateway   NodePort    10.10.200.37    <none>        15021:32488/TCP,80:32102/TCP,443:31354/TCP,31400:32683/TCP,15443:30575/TCP   17m
istiod                 ClusterIP   10.10.200.238   <none>        15010/TCP,15012/TCP,443/TCP,15014/TCP
* pod *
NAME                                        READY   STATUS    RESTARTS   AGE
pod/istio-ingressgateway-5f9f654d46-w87wh   1/1     Running   0          4m35s
pod/istiod-7f8b586864-jl9fd                 1/1     Running   0          4m51s
* pdb *
NAME                                              MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
poddisruptionbudget.policy/istio-ingressgateway   1               N/A               0                     4m35s
poddisruptionbudget.policy/istiod                 1               N/A               0                     4m51s
* crd *
#kubectl get crd | grep istio.io | sort
authorizationpolicies.security.istio.io    2024-10-19T15:15:33Z
destinationrules.networking.istio.io       2024-10-19T15:15:33Z
envoyfilters.networking.istio.io           2024-10-19T15:15:34Z
gateways.networking.istio.io               2024-10-19T15:15:34Z
peerauthentications.security.istio.io      2024-10-19T15:15:34Z
proxyconfigs.networking.istio.io           2024-10-19T15:15:34Z
requestauthentications.security.istio.io   2024-10-19T15:15:34Z
serviceentries.networking.istio.io         2024-10-19T15:15:34Z
sidecars.networking.istio.io               2024-10-19T15:15:34Z
telemetries.telemetry.istio.io             2024-10-19T15:15:34Z
virtualservices.networking.istio.io        2024-10-19T15:15:34Z
wasmplugins.extensions.istio.io            2024-10-19T15:15:34Z
workloadentries.networking.istio.io        2024-10-19T15:15:34Z
workloadgroups.networking.istio.io         2024-10-19T15:15:34Z



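| Name | Description |
| poddisruptionbudget.policy/istio-ingressgateway | Sets the minimum number of available Pods to keep the Istio ingress gateway stable. |
| poddisruptionbudget.policy/istiod | Ensures that the Istio control plane (istiod) keeps at least a given number of Pods. |
| authorizationpolicies.security.istio.io | Defines access control rules for service-to-service communication and external requests. |
| destinationrules.networking.istio.io | Defines policies applied when a service is called, such as load balancing and connection settings. |
| envoyfilters.networking.istio.io | Adds filters to customize the behavior of the Envoy proxy. |
| gateways.networking.istio.io | Defines a Gateway for routing external traffic to internal services. |
| peerauthentications.security.istio.io | Configures the authentication policy for service-to-service communication (mTLS, etc.). |
| proxyconfigs.networking.istio.io | Controls proxy settings and customizes sidecar and ingress behavior. |
| requestauthentications.security.istio.io | Defines rules for authenticating incoming requests. |
| serviceentries.networking.istio.io | Creates an internal service entry so that an external service can be accessed. |
| sidecars.networking.istio.io | Defines sidecar proxy behavior for a specific namespace or service. |
| telemetries.telemetry.istio.io | Defines telemetry settings for monitoring and metrics collection. |
| virtualservices.networking.istio.io | Defines rules for routing requests to a specific service. |
| wasmplugins.extensions.istio.io | Extends the Istio proxy with WebAssembly (Wasm) plugins. |
| workloadentries.networking.istio.io | Brings a workload outside the cluster into the Istio mesh. |
| workloadgroups.networking.istio.io | Defines a group of external workloads with similar characteristics. |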


istio-ingressgateway
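istio-ingressgateway routes external traffic to services inside the Istio mesh. Running ps -ef in the deployment shows the following processes.

pilot-agent: manages the sidecar container so that the Envoy proxy runs properly.
envoy: processes traffic according to the rules and routes requests.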

(⎈|default:N/A) root@k3s-s:~# kubectl exec -it deployment.apps/istio-ingressgateway -n istio-system -- ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
istio-p+       1       0  0 15:16 ?        00:00:00 /usr/local/bin/pilot-agent proxy router --domain istio-system.svc.cluster.local -
istio-p+      12       1  0 15:16 ?        00:00:02 /usr/local/bin/envoy -c etc/istio/proxy/envoy-rev.json --drain-time-s 45 --drain-
istio-p+      61       0  0 15:36 pts/0    00:00:00 ps -ef
Next, inspect the envoy configuration file used by the envoy process.
(⎈|default:N/A) root@k3s-s:~# kubectl exec -it deployment.apps/istio-ingressgateway -n istio-system -- cat /etc/istio/proxy/envoy-rev.json



| Component | Description | Key values/settings |
| application_log_config | Defines the application log format | %Y-%m-%dT%T.%fZ\t%l\tenvoy %n %g:%#\t%v\tthread=%t |
| node | Envoy node and metadata information | id: router~172.16.1.3~...; cluster: istio-ingressgateway.istio-system; instance_ip: 172.16.1.3 |
| annotations | Istio and Kubernetes metadata | istio.io/rev: default, prometheus.io/scrape: true |
| layered_runtime | Defines Envoy runtime settings | overload.global_downstream_max_connections: 2147483647; re2.max_program_size.error_level: 32768 |
| bootstrap_extensions | Internal listener configuration | buffer_size_kb: 64 |
| admin | Admin interface settings | address: 127.0.0.1:15000; profile_path: /var/lib/istio/data/envoy.prof |
| dynamic_resources | Dynamic resource management settings (LDS, CDS, ADS) | api_type: DELTA_GRPC; discovery_address: istiod.istio-system.svc:15012 |
| static_resources | Static resource (cluster, listener) settings | Clusters: prometheus_stats, agent, xds-grpc; Listeners: 0.0.0.0:15090, 0.0.0.0:15021 |
| Cluster settings | Cluster information for connecting to xDS, Prometheus, etc. | Circuit breakers: max 100,000 connections/requests |
| Listener settings | Network listener configuration | Listener ports: 15090, 15021; HTTP filters: Router, Health Check |
| proxy_config | Detailed settings for proxy behavior | binaryPath: /usr/local/bin/envoy; concurrency: 2; statusPort: 15020 |
| Metadata | Information about the Pod and service | Pod name: istio-ingressgateway-5f9f654d46-w87wh; Service account: istio-ingressgateway-service-account |


Istio connection test
Because no external exposure has been configured on the Istio ingress gateway yet, a connection attempt through the NodePort fails.
(⎈|default:N/A) root@k3s-s:~# export IGWHTTP=$(kubectl get service -n istio-system istio-ingressgateway -o jsonpath='{.spec.ports[1].nodePort}')
echo $IGWHTTP
32102

(⎈|default:N/A) root@k3s-s:~# export MYDOMAIN=www.gyuri.dev

(⎈|default:N/A) root@k3s-s:~# echo -e "192.168.10.10 $MYDOMAIN" >> /etc/hosts
echo -e "export MYDOMAIN=$MYDOMAIN" >> /etc/profile

(⎈|default:N/A) root@k3s-s:~# curl -v -s $MYDOMAIN:$IGWHTTP
*   Trying 192.168.10.10:32102...
* connect to 192.168.10.10 port 32102 failed: Connection refused
* Failed to connect to www.gyuri.dev port 32102 after 1 ms: Connection refused
* Closing connection 0
Exposing Istio externally

* Deploying the nginx Deployment *
Deploy the nginx Deployment. The istio-injection=enabled label is set on the default namespace so that the Istio sidecar is attached to pods deployed in that namespace.
cat <

(⎈|default:N/A) root@k3s-s:~# kubectl label namespace default istio-injection=enabled
namespace/default labeled

(⎈|default:N/A) root@k3s-s:~# kubectl get pod -o wide
NAME                                 READY   STATUS    RESTARTS   AGE   IP           NODE     NOMINATED NODE   READINESS GATES
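pod/deploy-websrv-778ffd6947-zdgdf   2/2     Running   0          15s   172.16.1.4   k3s-w1   <none>           <none>
* Deploying the Gateway and VirtualService *
A Gateway is deployed to manage traffic entering through the selected ingress gateway, and a VirtualService is deployed to configure the hosts to accept and the routing policy toward the destination.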
apiVersion: networking.istio.io/v1
kind: Gateway
metadata:
  name: test-gateway
spec:
  selector:
    istio: ingressgateway
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata:
  name: nginx-service
spec:
  hosts:
  - "$MYDOMAIN"
  gateways:
  - test-gateway
  http:
  - route:
    - destination:
        host: svc-clusterip
        port:
          number: 80
EOF
Check that the Istio proxies are working properly.
(⎈|default:N/A) root@k3s-s:~# kubectl get gw,vs
NAME                                       AGE
gateway.networking.istio.io/test-gateway   2s

NAME                                               GATEWAYS           HOSTS               AGE
virtualservice.networking.istio.io/nginx-service   ["test-gateway"]   ["www.gyuri.dev"]   2s

(⎈|default:N/A) root@k3s-s:~# istioctl proxy-status
NAME                                                   CLUSTER        CDS                LDS                EDS                RDS                ECDS        ISTIOD                      VERSION
deploy-websrv-778ffd6947-zdgdf.default                 Kubernetes     SYNCED (4m18s)     SYNCED (4m18s)     SYNCED (4m18s)     SYNCED (4m18s)     IGNORED     istiod-7f8b586864-jl9fd     1.23.2
istio-ingressgateway-5f9f654d46-w87wh.istio-system     Kubernetes     SYNCED (46s)       SYNCED (46s)       SYNCED (4m18s)     SYNCED (46s)       IGNORED     istiod-7f8b586864-jl9fd     1.23.2

* Connection test *
(⎈|default:N/A) root@k3s-s:~# curl -s $MYDOMAIN:$IGWHTTP | grep -o "<title>.*</title>"
<title>Welcome to nginx!</title>
(⎈|default:N/A) root@k3s-s:~# kubetail -n istio-system -l app=istio-ingressgateway -f
Will tail 1 logs...
istio-ingressgateway-5f9f654d46-w87wh
[istio-ingressgateway-5f9f654d46-w87wh] [2024-10-19T16:11:05.545Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 615 8 8 "172.16.0.0" "curl/7.81.0" "2ee7d856-eb06-94a5-8a08-d2f6ab77c9fd" "www.gyuri.dev:32102" "172.16.1.4:80" outbound|80||svc-clusterip.default.svc.cluster.local 172.16.1.3:33970 172.16.1.3:8080 172.16.0.0:52692 - -
[istio-ingressgateway-5f9f654d46-w87wh] [2024-10-19T16:11:14.050Z] "GET / HTTP/1.1" 200 - via_upstream - "-" 0 615 2 1 "172.16.0.0" "curl/7.81.0" "7eff26f0-f971-96c1-8365-9de15f2cfac0" "www.gyuri.dev:32102" "172.16.1.4:80" outbound|80||svc-clusterip.default.svc.cluster.local 172.16.1.3:33970 172.16.1.3:8080 172.16.0.0:4801 - -



Ingress
Sat, 12 Oct 2024 12:07:22 GMT
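What is Ingress
Ingress is a Kubernetes resource for routing network traffic from outside the cluster to services inside the cluster. It provides features such as domain-based routing, path-based routing, TLS termination, and load balancing.
What is an ingress controller
An Ingress Controller is the component that makes the defined Ingress resources actually work. It is deployed inside the cluster, interprets the Ingress objects created by users, and configures a proxy server that reflects the routing rules.
Let's look at how it works using the most widely used one, the nginx ingress controller.
Installing the ingress controller
The ingress controller is installed with a NodePort service.
* yaml file *
cat <<EOT> ingress-nginx-values.yaml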
controller:
  service:
    type: NodePort
    nodePorts:
      http: 30080
      https: 30443
  nodeSelector:
    kubernetes.io/hostname: "k3s-s"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
EOT
Deploying ingress-nginx-controller
These are the deployed ingress-nginx-controller resources.
(⎈|default:N/A) root@k3s-s:~# kubectl get pod -n ingress -o wide
NAME                                        READY   STATUS    RESTARTS   AGE   IP              NODE    NOMINATED NODE   READINESS GATES
ingress-nginx-controller-7c68d6f654-brpgm   1/1     Running   0          65s   192.168.10.10   k3s-s   <none>           <none>
(⎈|default:N/A) root@k3s-s:~# kubectl get svc,ep -n ingress -o wide
NAME                                         TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE   SELECTOR
service/ingress-nginx-controller             NodePort    10.10.200.211   <none>        80:30080/TCP,443:30443/TCP   77s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-admission   ClusterIP   10.10.200.203   <none>        443/TCP                      77s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
service/ingress-nginx-controller-metrics     ClusterIP   10.10.200.72    <none>        10254/TCP                    77s   app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx

NAME                                           ENDPOINTS                            AGE
endpoints/ingress-nginx-controller             192.168.10.10:443,192.168.10.10:80   77s
endpoints/ingress-nginx-controller-admission   192.168.10.10:8443                   77s
endpoints/ingress-nginx-controller-metrics     192.168.10.10:10254                  77s

(⎈|default:N/A) root@k3s-s:~# kc get svc -n ingress ingress-nginx-controller
NAME                       TYPE       CLUSTER-IP      EXTERNAL-IP   PORT(S)                      AGE
ingress-nginx-controller   NodePort   10.10.200.157   <none>        80:30080/TCP,443:30443/TCP   12m

(⎈|default:N/A) root@k3s-s:~# kc describe svc -n ingress ingress-nginx-controller
Name:                     ingress-nginx-controller
Namespace:                ingress
Labels:                   app.kubernetes.io/component=controller
                          app.kubernetes.io/instance=ingress-nginx
                          app.kubernetes.io/managed-by=Helm
                          app.kubernetes.io/name=ingress-nginx
                          app.kubernetes.io/part-of=ingress-nginx
                          app.kubernetes.io/version=1.11.2
                          helm.sh/chart=ingress-nginx-4.11.2
Annotations:              meta.helm.sh/release-name: ingress-nginx
                          meta.helm.sh/release-namespace: ingress
Selector:                 app.kubernetes.io/component=controller,app.kubernetes.io/instance=ingress-nginx,app.kubernetes.io/name=ingress-nginx
Type:                     NodePort
IP Family Policy:         SingleStack
IP Families:              IPv4
IP:                       10.10.200.157
IPs:                      10.10.200.157
Port:                     http  80/TCP
TargetPort:               http/TCP
NodePort:                 http  30080/TCP
Endpoints:                172.16.0.4:80 #ingress-nginx-controller pod IP
Port:                     https  443/TCP
TargetPort:               https/TCP
NodePort:                 https  30443/TCP
Endpoints:                172.16.0.4:443 #ingress-nginx-controller pod IP
Session Affinity:         None
External Traffic Policy:  Local
Events:                   <none>
How the ingress controller works
To observe how the ingress controller behaves, deploy Services, Deployments, and an Ingress.

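When an Ingress is created while ingress-nginx-controller is running, the controller builds nginx.conf from that Ingress so that it can act as a proxy.
When a request arrives on a path configured in nginx.conf, ingress-nginx-controller sends it directly to one of the Endpoints of the Service mapped to that path, which it tracks by watching the cluster.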
(⎈|default:N/A) root@k3s-s:~# kubectl get pod,svc,ep
NAME                                    READY   STATUS    RESTARTS   AGE
pod/deploy1-websrv-5c6b88bd77-r4b5k     1/1     Running   0          11m
pod/deploy2-guestsrv-649875f78b-2fl9v   1/1     Running   0          11m
pod/deploy2-guestsrv-649875f78b-w9s8q   1/1     Running   0          11m
pod/deploy3-adminsrv-7c8f8b8c87-5pf7g   1/1     Running   0          11m
pod/deploy3-adminsrv-7c8f8b8c87-76sjm   1/1     Running   0          11m
pod/deploy3-adminsrv-7c8f8b8c87-qq2fj   1/1     Running   0          11m

NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)          AGE
service/kubernetes   ClusterIP   10.10.200.1     <none>        443/TCP          151m
service/svc1-web     ClusterIP   10.10.200.245   <none>        9001/TCP         11m
service/svc2-guest   NodePort    10.10.200.148   <none>        9002:30832/TCP   11m
service/svc3-admin   ClusterIP   10.10.200.241   <none>        9003/TCP         11m

NAME                   ENDPOINTS                                         AGE
endpoints/kubernetes   192.168.10.10:6443                                151m
endpoints/svc1-web     172.16.2.6:80                                     11m
endpoints/svc2-guest   172.16.1.3:8080,172.16.3.6:8080                   11m
endpoints/svc3-admin   172.16.1.4:8080,172.16.2.7:8080,172.16.3.5:8080   11m
(⎈|default:N/A) root@k3s-s:~# kc describe ingress ingress-1
  Host        Path  Backends
  ----        ----  --------
  *           
              /        svc1-web:80 ()
              /guest   svc2-guest:8080 ()
              /admin   svc3-admin:8080 ()

(⎈|default:N/A) root@k3s-s:~# kubectl exec deploy/ingress-nginx-controller -n ingress -it -- cat /etc/nginx/nginx.conf | grep location -A5
                location = /guest {

                        set $namespace      "default";
                        set $ingress_name   "ingress-1";
                        set $service_name   "svc2-guest";
                        set $service_port   "8080";
                        set $location_path  "/guest";
                        set $global_rate_limit_exceeding n;

                        rewrite_by_lua_block {
                                lua_ingress.rewrite({
                                        force_ssl_redirect = false,
...
--
                location = /admin {

                        set $namespace      "default";
                        set $ingress_name   "ingress-1";
                        set $service_name   "svc3-admin";
                        set $service_port   "8080";
                        set $location_path  "/admin";
                        set $global_rate_limit_exceeding n;

                        rewrite_by_lua_block {
                                lua_ingress.rewrite({
                                        force_ssl_redirect = false,
--
                location / {

                        set $namespace      "default";
                        set $ingress_name   "ingress-1";
                        set $service_name   "svc1-web";
                        set $service_port   "80";
                        set $location_path  "/";
                        set $global_rate_limit_exceeding n;

                        rewrite_by_lua_block {
                                lua_ingress.rewrite({
                                        force_ssl_redirect = false,
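Verifying that ingress-nginx-controller monitors Service Endpoints
The ingress-nginx-controller source shows that, at log level 3, it logs the collection of each Service's Endpoints. (In the run below the equivalent message is emitted from endpointslices.go, because this version reads EndpointSlices.)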
//ingress-nginx/internal/ingress/controller/endpoints.go
func getEndpoints(s *corev1.Service, port *corev1.ServicePort, proto corev1.Protocol,
    getServiceEndpoints func(string) (*corev1.Endpoints, error)) []ingress.Endpoint {
    ...
      klog.V(3).Infof("Getting Endpoints for Service %q and port %v", svcKey, port.String())
    ...
  }
Redeploy ingress-nginx-controller at log level 3 and check the logs.
(⎈|default:N/A) root@k3s-s:~# cat ingress-nginx-values.yaml 
controller:
  service:
    type: NodePort
    nodePorts:
      http: 30080
      https: 30443
  nodeSelector:
    kubernetes.io/hostname: "k3s-s"
  metrics:
    enabled: true
    serviceMonitor:
      enabled: true
  extraArgs: # added
    v: "3" # added

(⎈|default:N/A) root@k3s-s:~# helm upgrade ingress-nginx ingress-nginx/ingress-nginx -f ingress-nginx-values.yaml --namespace ingress --version 4.11.2

(⎈|default:N/A) root@k3s-s:~# kubectl get pod -n ingress
NAME                                        READY   STATUS    RESTARTS   AGE
ingress-nginx-controller-5b8b86dff5-5stvf   1/1     Running   0          11m

(⎈|default:N/A) root@k3s-s:~# kubectl logs ingress-nginx-controller-5b8b86dff5-5stvf -n ingress
The following log lines create an upstream for each Service and look up that Service's Endpoints.
# default-svc1-web-80
I1012 03:21:29.189148       7 controller.go:1075] Creating upstream "default-svc1-web-80"
I1012 03:21:29.189160       7 controller.go:1190] Obtaining ports information for Service "default/svc1-web"
I1012 03:21:29.189175       7 endpointslices.go:79] Getting Endpoints from endpointSlices for Service "default/svc1-web" and port &ServicePort{Name:web-port,Protocol:TCP,Port:9001,TargetPort:{0 80 },NodePort:0,AppProtocol:nil,}
I1012 03:21:29.189188       7 endpointslices.go:166] Endpoints found for Service "default/svc1-web": [{172.16.2.6 80 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy1-websrv-5c6b88bd77-r4b5k,UID:0792180e-010c-48d6-961b-282ca789ca58,APIVersion:,ResourceVersion:,FieldPath:,}}]

# default-svc2-guest-8080
I1012 03:21:29.189213       7 controller.go:1075] Creating upstream "default-svc2-guest-8080"
I1012 03:21:29.189218       7 controller.go:1190] Obtaining ports information for Service "default/svc2-guest"
I1012 03:21:29.189226       7 endpointslices.go:79] Getting Endpoints from endpointSlices for Service "default/svc2-guest" and port &ServicePort{Name:guest-port,Protocol:TCP,Port:9002,TargetPort:{0 8080 },NodePort:30832,AppProtocol:nil,}
I1012 03:21:29.189245       7 endpointslices.go:166] Endpoints found for Service "default/svc2-guest": [{172.16.3.6 8080 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy2-guestsrv-649875f78b-2fl9v,UID:c485d507-259b-4dea-aa4e-d6f8c9e57557,APIVersion:,ResourceVersion:,FieldPath:,}} {172.16.1.3 8080 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy2-guestsrv-649875f78b-w9s8q,UID:cf983782-0a43-4f5b-911a-ff8838d94801,APIVersion:,ResourceVersion:,FieldPath:,}}]

# default-svc3-admin-8080
I1012 03:21:29.189258       7 controller.go:1075] Creating upstream "default-svc3-admin-8080"
I1012 03:21:29.189263       7 controller.go:1190] Obtaining ports information for Service "default/svc3-admin"
I1012 03:21:29.189270       7 endpointslices.go:79] Getting Endpoints from endpointSlices for Service "default/svc3-admin" and port &ServicePort{Name:admin-port,Protocol:TCP,Port:9003,TargetPort:{0 8080 },NodePort:0,AppProtocol:nil,}
I1012 03:21:29.189279       7 endpointslices.go:166] Endpoints found for Service "default/svc3-admin": [{172.16.3.5 8080 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy3-adminsrv-7c8f8b8c87-76sjm,UID:789812d4-ee0a-4e5b-9e79-62b495bdb40c,APIVersion:,ResourceVersion:,FieldPath:,}} {172.16.1.4 8080 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy3-adminsrv-7c8f8b8c87-5pf7g,UID:dfb70e2a-549d-499f-8435-a625b20b7889,APIVersion:,ResourceVersion:,FieldPath:,}} {172.16.2.7 8080 &ObjectReference{Kind:Pod,Namespace:default,Name:deploy3-adminsrv-7c8f8b8c87-qq2fj,UID:5a511298-8b6e-433a-a33c-2e1898d7f5e0,APIVersion:,ResourceVersion:,FieldPath:,}}]
The following log lines show ingress-nginx-controller pointing each path (/, /guest, /admin) at its own upstream; the values come from the deployed Ingress.
I1012 03:21:29.189307       7 controller.go:811] Replacing location "/" for server "_" with upstream "upstream-default-backend" to use upstream "default-svc1-web-80" (Ingress "default/ingress-1")
I1012 03:21:29.189328       7 controller.go:831] Adding location "/guest" for server "_" with upstream "default-svc2-guest-8080" (Ingress "default/ingress-1")
I1012 03:21:29.189340       7 controller.go:831] Adding location "/admin" for server "_" with upstream "default-svc3-admin-8080" (Ingress "default/ingress-1")
Endpoint monitoring permissions of the ingress controller
The ingress controller can track Service endpoints because the ingress-nginx ClusterRole grants it permission to list and watch them.
(⎈|default:N/A) root@k3s-s:~# kc describe clusterroles ingress-nginx
Name:         ingress-nginx
Labels:       app.kubernetes.io/instance=ingress-nginx
              app.kubernetes.io/managed-by=Helm
              app.kubernetes.io/name=ingress-nginx
              app.kubernetes.io/part-of=ingress-nginx
              app.kubernetes.io/version=1.11.2
              helm.sh/chart=ingress-nginx-4.11.2
Annotations:  meta.helm.sh/release-name: ingress-nginx
              meta.helm.sh/release-namespace: ingress
PolicyRule:
  Resources                           Non-Resource URLs  Resource Names  Verbs
  ---------                           -----------------  --------------  -----
  events                              []                 []              [create patch]
  services                            []                 []              [get list watch]
  ingressclasses.networking.k8s.io    []                 []              [get list watch]
  ingresses.networking.k8s.io         []                 []              [get list watch]
  nodes                               []                 []              [list watch get]
  endpointslices.discovery.k8s.io     []                 []              [list watch get]
  configmaps                          []                 []              [list watch]
  endpoints                           []                 []              [list watch]
  namespaces                          []                 []              [list watch]
  pods                                []                 []              [list watch]
  secrets                             []                 []              [list watch]
  leases.coordination.k8s.io          []                 []              [list watch]
  ingresses.networking.k8s.io/status  []                 []              [update]
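In short, ingress-nginx-controller configures traffic routing from the values written in the Ingress, watches the EndpointSlices of each Service, and sends traffic directly to those Endpoints.
Verifying Ingress access
By default, ingress-nginx-controller uses a round-robin load-balancing algorithm.
Call the backend Services through the paths configured in the Ingress.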
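The flow summarized above (path, then upstream, then pod endpoints) can be sketched in Go. This is only an illustrative model, not ingress-nginx code: the `route` type, `matchUpstream`, and the in-memory tables are hypothetical names, and the data is copied from the log lines shown earlier.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a hypothetical, simplified stand-in for one nginx location.
type route struct {
	path     string // location path taken from the Ingress rule
	upstream string // upstream name as it appears in the controller log
}

// routes mirrors the "Replacing/Adding location" log lines for ingress-1.
var routes = []route{
	{"/", "default-svc1-web-80"},
	{"/guest", "default-svc2-guest-8080"},
	{"/admin", "default-svc3-admin-8080"},
}

// endpointsByUpstream mirrors the "Endpoints found for Service" log lines:
// every upstream resolves to pod IP:port pairs, not to the Service ClusterIP.
var endpointsByUpstream = map[string][]string{
	"default-svc1-web-80":     {"172.16.2.6:80"},
	"default-svc2-guest-8080": {"172.16.3.6:8080", "172.16.1.3:8080"},
	"default-svc3-admin-8080": {"172.16.3.5:8080", "172.16.1.4:8080", "172.16.2.7:8080"},
}

// matchUpstream returns the upstream of the most specific matching route.
// "/" matches everything; other paths match exactly or as a path prefix.
func matchUpstream(reqPath string) string {
	best, bestLen := "", -1
	for _, r := range routes {
		matches := r.path == "/" || reqPath == r.path || strings.HasPrefix(reqPath, r.path+"/")
		if matches && len(r.path) > bestLen {
			best, bestLen = r.upstream, len(r.path)
		}
	}
	return best
}

func main() {
	for _, p := range []string{"/", "/guest", "/admin"} {
		up := matchUpstream(p)
		fmt.Println(p, "->", up, "->", endpointsByUpstream[up])
	}
}
```

The point of the model is the second table: the proxy picks a pod address itself, so the Service ClusterIP never appears on the request path.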
(⎈|default:N/A) root@k3s-s:~# MYIP=$(curl -s ipinfo.io/ip)
(⎈|default:N/A) root@k3s-s:~# echo $MYIP
3.35.217.39
 ggyul 🐵  
# MYIP=3.35.217.39

 ggyul 🐵  
# curl -s $MYIP:30080
...
Welcome to nginx!

 ggyul 🐵  
# curl -s $MYIP:30080/guest
Hello Kubernetes bootcamp! | Running on: deploy2-guestsrv-649875f78b-w9s8q | v=1

 ggyul 🐵  
# for i in {1..100}; do curl -s $MYIP:30080/guest ; done | sort | uniq -c | sort -nr
  50 Hello Kubernetes bootcamp! | Running on: deploy2-guestsrv-649875f78b-w9s8q | v=1
  50 Hello Kubernetes bootcamp! | Running on: deploy2-guestsrv-649875f78b-2fl9v | v=1

 ggyul 🐵  
# curl -s $MYIP:30080/admin
Hostname: deploy3-adminsrv-7c8f8b8c87-76sjm

Pod Information:
    -no pod information available-

Server values:
    server_version=nginx: 1.13.0 - lua: 10008

Request Information:
    client_address=172.16.0.7
    method=GET
    real path=/admin
    query=
    request_version=1.1
    request_uri=http://3.35.217.39:8080/admin

Request Headers:
    accept=*/*
    host=3.35.217.39:30080
    user-agent=curl/7.79.1
    x-forwarded-for=175.116.31.155
    x-forwarded-host=3.35.217.39:30080
    x-forwarded-port=80
    x-forwarded-proto=http
    x-forwarded-scheme=http
    x-real-ip=175.116.31.155
    x-request-id=63a5cdff6192f4430ba75ec92ad418bd
    x-scheme=http

Request Body:
    -no body in request-

 ggyul 🐵  
# for i in {1..100}; do curl -s $MYIP:30080/admin | grep Hostname ; done | sort | uniq -c | sort -nr
  34 Hostname: deploy3-adminsrv-7c8f8b8c87-qq2fj
  33 Hostname: deploy3-adminsrv-7c8f8b8c87-76sjm
  33 Hostname: deploy3-adminsrv-7c8f8b8c87-5pf7g
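The 50/50 and 34/33/33 splits above are what a plain round-robin rotation produces. The following Go sketch is a generic rotation, not the balancer that ingress-nginx actually ships; `roundRobin` and `distribute` are hypothetical names and the backend labels are placeholders.

```go
package main

import "fmt"

// roundRobin hands out backends in a fixed rotation.
type roundRobin struct {
	backends []string
	next     int
}

// pick returns the current backend and advances the rotation by one.
func (r *roundRobin) pick() string {
	b := r.backends[r.next]
	r.next = (r.next + 1) % len(r.backends)
	return b
}

// distribute sends n requests through the rotation and counts hits per backend.
func distribute(backends []string, n int) map[string]int {
	rr := &roundRobin{backends: backends}
	counts := make(map[string]int)
	for i := 0; i < n; i++ {
		counts[rr.pick()]++
	}
	return counts
}

func main() {
	fmt.Println(distribute([]string{"guest-a", "guest-b"}, 100))
	// prints map[guest-a:50 guest-b:50]
	fmt.Println(distribute([]string{"admin-a", "admin-b", "admin-c"}, 100))
	// prints map[admin-a:34 admin-b:33 admin-c:33]
}
```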
Calling /admin shows the client_address and x-forwarded-for values.

When externalTrafficPolicy on the ingress-nginx-controller Service is Local, client_address is the ingress-nginx-controller pod IP and x-forwarded-for is the IP of the calling client.
When externalTrafficPolicy on the ingress-nginx-controller Service is Cluster (the default), client_address is still the ingress-nginx-controller pod IP, but x-forwarded-for is the cni0 address of the control-plane node.


(⎈|default:N/A) root@k3s-s:~# kubectl get pod -n ingress -o wide
NAME                                        READY   STATUS    RESTARTS   AGE    IP           NODE    NOMINATED NODE   READINESS GATES
ingress-nginx-controller-5b8b86dff5-5stvf   1/1     Running   0          144m   172.16.0.7   k3s-s   <none>           <none>

 ggyul 🐵  
# curl -s $MYIP:30080/admin | egrep '(client_address|x-forwarded-for)'
    client_address=172.16.0.7
    x-forwarded-for=175.116.31.155

(⎈|default:N/A) root@k3s-s:~# kubectl patch svc -n ingress ingress-nginx-controller -p '{"spec":{"externalTrafficPolicy": "Cluster"}}'
service/ingress-nginx-controller patched

 ggyul 🐵  
# curl -s $MYIP:30080/admin | egrep '(client_address|x-forwarded-for)'
    client_address=172.16.0.7
    x-forwarded-for=172.16.0.1

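ingress-nginx-controller acts as a proxy between the client and the backend Service, so from the backend's point of view the request comes from ingress-nginx-controller, and client_address is the ingress-nginx-controller pod IP.

As for x-forwarded-for: when ExternalTrafficPolicy on the ingress-nginx-controller Service is Local, a request arriving at k3s-s is delivered only to the ingress-nginx-controller on that node. No SNAT takes place, the original source IP is preserved, and the IP of the calling client is returned.

When ExternalTrafficPolicy on the ingress-nginx-controller Service is Cluster, the request is SNATed, so an address of k3s-s is returned instead (its cni0 address, 172.16.0.1 in the output above).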


Packet capture
Capture traffic on the pod's veth interface to check the IP information used for pod-to-pod communication over flannel VXLAN.
(⎈|default:N/A) root@k3s-s:~# tcpdump -i vetha126644e tcp port 8080 -w /tmp/ingress-nginx.pcap

 ggyul 🐵  
# curl -s $MYIP:30080/admin

 ggyul 🐵  
# sftp -i kp-bgr.pem ubuntu@$MYIP
Connected to 3.35.217.39.
sftp> get /tmp/ingress-nginx.pcap
Fetching /tmp/ingress-nginx.pcap to ingress-nginx.pcap
/tmp/ingress-nginx.pcap                                                                            100% 2095   201.2KB/s   00:00

Host-based routing

* yaml *
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-2
spec:
  ingressClassName: nginx
  rules:
  - host: gyull.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: svc3-admin
            port:
              number: 8080
  - host: "*.gyull.com"
    http:
      paths:
      - path: /echo
        pathType: Prefix
        backend:
          service:
            name: svc3-admin
            port:
              number: 8080
* Check resources *
(⎈|default:N/A) root@k3s-s:~# kubectl describe ingress ingress-2
Name:             ingress-2
...
Rules:
  Host         Path  Backends
  ----         ----  --------
  gyull.com
               /   svc3-admin:8080 ()
  *.gyull.com
               /echo   svc3-admin:8080 ()
...
Verifying access
* Domain setup *
 ggyul 🐵  
# MYDOMAIN1=gyull.com

 ggyul 🐵  
# MYDOMAIN2=test.gyull.com

 ggyul 🐵  
# echo $MYIP $MYDOMAIN1 $MYDOMAIN2
3.35.217.39 gyull.com test.gyull.com

 ggyul 🐵  
# echo "$MYIP $MYDOMAIN1" | sudo tee -a /etc/hosts
echo "$MYIP $MYDOMAIN2" | sudo tee -a /etc/hosts
* Accessing MYDOMAIN1 *
ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN1:30080/
200

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN1:30080/gyull
200

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN1:30080/echo
200

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN1:30080/echo/5
200
* Accessing MYDOMAIN2 *
 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN2:30080/
404

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN2:30080/gyull
404

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN2:30080/echo
200

 ggyul 🐵  
# curl -o /dev/null -s -w "%{http_code}\n" $MYDOMAIN2:30080/echo/5
200
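The eight status codes above follow from two matching rules of the Ingress API: a wildcard host such as *.gyull.com covers exactly one extra DNS label (so it does not match gyull.com itself), and a Prefix path matches element by element. The Go sketch below models only whether some rule matches; `hostMatches`, `pathPrefixMatches`, and `status` are hypothetical helpers, and the host is passed without the :30080 port.

```go
package main

import (
	"fmt"
	"strings"
)

// hostMatches reports whether reqHost satisfies ruleHost.
// A leading "*." stands for exactly one DNS label.
func hostMatches(ruleHost, reqHost string) bool {
	if !strings.HasPrefix(ruleHost, "*.") {
		return ruleHost == reqHost
	}
	suffix := ruleHost[1:] // e.g. ".gyull.com"
	if !strings.HasSuffix(reqHost, suffix) {
		return false
	}
	label := strings.TrimSuffix(reqHost, suffix)
	return label != "" && !strings.Contains(label, ".")
}

// pathPrefixMatches implements pathType Prefix: whole path elements must match.
func pathPrefixMatches(rulePath, reqPath string) bool {
	if rulePath == "/" {
		return true
	}
	rulePath = strings.TrimSuffix(rulePath, "/")
	return reqPath == rulePath || strings.HasPrefix(reqPath, rulePath+"/")
}

type rule struct{ host, path string }

// ingress2 mirrors the two rules of ingress-2.
var ingress2 = []rule{
	{"gyull.com", "/"},
	{"*.gyull.com", "/echo"},
}

// status returns 200 if any rule matches the request and 404 otherwise.
func status(rules []rule, host, path string) int {
	for _, r := range rules {
		if hostMatches(r.host, host) && pathPrefixMatches(r.path, path) {
			return 200
		}
	}
	return 404
}

func main() {
	for _, host := range []string{"gyull.com", "test.gyull.com"} {
		for _, path := range []string{"/", "/gyull", "/echo", "/echo/5"} {
			fmt.Println(host, path, status(ingress2, host, path))
		}
	}
}
```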
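Canary deployment

In the NGINX Ingress Controller, canary deployments are implemented through Ingress annotations. They let traffic be split dynamically between the existing service and the new canary service, which improves stability and reduces risk when a new version is rolled out.
Upstream creation code for canary Ingresses
The code with which ingress-nginx recognizes a canary Ingress and creates its upstream is shown below.
* Ingress parse -> return canary config *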
// ingress-nginx/internal/ingress/annotations/canary/main.go
const (
    canaryAnnotation                = "canary"
    canaryWeightAnnotation          = "canary-weight" // name of the canary-weight annotation
    canaryWeightTotalAnnotation     = "canary-weight-total"
    canaryByHeaderAnnotation        = "canary-by-header"
    canaryByHeaderValueAnnotation   = "canary-by-header-value"
    canaryByHeaderPatternAnnotation = "canary-by-header-pattern"
    canaryByCookieAnnotation        = "canary-by-cookie"
)

func (c canary) Parse(ing *networking.Ingress) (interface{}, error) {
    ...
    // Read the canary-weight annotation value and store it in the config.
    config.Weight, err = parser.GetIntAnnotation(canaryWeightAnnotation, ing, c.annotationConfig.Annotations)
    ...
    // WeightTotal defaults to 100.
    config.WeightTotal, err = parser.GetIntAnnotation(canaryWeightTotalAnnotation, ing, c.annotationConfig.Annotations)
    if err != nil {
        if errors.IsValidationError(err) {
            klog.Warningf("%s is invalid, defaulting to '100'", canaryWeightTotalAnnotation)
        }
        config.WeightTotal = 100
    }

    ...
}
* Upstream creation *
// ingress-nginx/internal/ingress/controller/controller.go
func (n *NGINXController) createUpstreams(data []*ingress.Ingress, du *ingress.Backend) map[string]*ingress.Backend {
...
// configure traffic shaping for canary
            if anns.Canary.Enabled {
                upstreams[defBackend].NoServer = true
                upstreams[defBackend].TrafficShapingPolicy = newTrafficShapingPolicy(&anns.Canary) // attach the parsed canary weights to the upstream as its traffic-shaping policy
            }
      ...
}

// newTrafficShapingPolicy creates new ingress.TrafficShapingPolicy instance using canary configuration
func newTrafficShapingPolicy(cfg *canary.Config) ingress.TrafficShapingPolicy {
    return ingress.TrafficShapingPolicy{ // copy the weights obtained from parsing into the policy
        Weight:        cfg.Weight,
        WeightTotal:   cfg.WeightTotal,
        Header:        cfg.Header,
        HeaderValue:   cfg.HeaderValue,
        HeaderPattern: cfg.HeaderPattern,
        Cookie:        cfg.Cookie,
    }
}
* Check resources *
(⎈|default:N/A) root@k3s-s:~# kubectl get svc,ep,pod
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubernetes   ClusterIP   10.10.200.1     <none>        443/TCP    5m51s
service/svc-v1       ClusterIP   10.10.200.183   <none>        9001/TCP   5m18s
service/svc-v2       ClusterIP   10.10.200.225   <none>        9001/TCP   5m18s

NAME                   ENDPOINTS                                           AGE
endpoints/kubernetes   192.168.10.10:6443                                  5m51s
endpoints/svc-v1       172.16.1.5:8080,172.16.2.11:8080,172.16.3.9:8080    5m18s
endpoints/svc-v2       172.16.1.6:8080,172.16.2.10:8080,172.16.3.10:8080   5m18s

NAME                         READY   STATUS    RESTARTS   AGE
pod/dp-v1-8684d45558-fzdt4   1/1     Running   0          5m18s
pod/dp-v1-8684d45558-gvxt9   1/1     Running   0          5m18s
pod/dp-v1-8684d45558-k2qfg   1/1     Running   0          5m18s
pod/dp-v2-7757c4bdc-8kf24    1/1     Running   0          5m18s
pod/dp-v2-7757c4bdc-96m7j    1/1     Running   0          5m18s
pod/dp-v2-7757c4bdc-fwx54    1/1     Running   0          5m18s

(⎈|default:N/A) root@k3s-s:~# for pod in $(kubectl get pod -o wide -l app=svc-v1 |awk 'NR>1 {print $6}'); do curl -s $pod:8080 | egrep '(Hostname|nginx)'; done
Hostname: dp-v1-8684d45558-fzdt4
    server_version=nginx: 1.13.0 - lua: 10008
Hostname: dp-v1-8684d45558-gvxt9
    server_version=nginx: 1.13.0 - lua: 10008
Hostname: dp-v1-8684d45558-k2qfg
    server_version=nginx: 1.13.0 - lua: 10008

(⎈|default:N/A) root@k3s-s:~# for pod in $(kubectl get pod -o wide -l app=svc-v2 |awk 'NR>1 {print $6}'); do curl -s $pod:8080 | egrep '(Hostname|nginx)'; done
Hostname: dp-v2-7757c4bdc-8kf24
    server_version=nginx: 1.13.1 - lua: 10008
Hostname: dp-v2-7757c4bdc-96m7j
    server_version=nginx: 1.13.1 - lua: 10008
Hostname: dp-v2-7757c4bdc-fwx54
    server_version=nginx: 1.13.1 - lua: 10008
Canary access test
With a total weight of 100, roughly 90% of the traffic is handled by the pods behind ingress-canary-v1 (nginx 1.13.0) and roughly 10% by the pods behind ingress-canary-v2 (nginx 1.13.1).
 ggyul 🐵  
# for i in {1..100};  do curl -s $MYDOMAIN1:30080 | grep nginx ; done | sort | uniq -c | sort -nr
  90     server_version=nginx: 1.13.0 - lua: 10008
  10     server_version=nginx: 1.13.1 - lua: 10008

 ggyul 🐵  
# for i in {1..1000}; do curl -s $MYDOMAIN1:30080 | grep nginx ; done | sort | uniq -c | sort -nr
 876     server_version=nginx: 1.13.0 - lua: 10008
 124     server_version=nginx: 1.13.1 - lua: 10008
HTTPS handling
HTTPS can be handled by attaching a certificate-based TLS Secret to the Ingress.

* yaml *
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: https
spec:
  ingressClassName: nginx
  tls:
  - hosts:
    - gyull.com
    secretName: secret-https
  rules:
  - host: gyull.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: svc-https
            port:
              number: 8080
* Check resources *
(⎈|default:N/A) root@k3s-s:~# kubectl describe ingress
Name:             https
Labels:           <none>
Namespace:        default
Address:          10.10.200.170
Ingress Class:    nginx
Default backend:  <default>
TLS:
  secret-https terminates gyull.com
Rules:
  Host        Path  Backends
  ----        ----  --------
  gyull.com
              /   svc-https:8080 (172.16.3.11:8080)

(⎈|default:N/A) root@k3s-s:~# kubectl get svc,ep,pod
NAME                 TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/kubernetes   ClusterIP   10.10.200.1     <none>        443/TCP    3m11s
service/svc-https    ClusterIP   10.10.200.182   <none>        8080/TCP   3m4s

NAME                   ENDPOINTS            AGE
endpoints/kubernetes   192.168.10.10:6443   3m11s
endpoints/svc-https    172.16.3.11:8080     3m4s

NAME            READY   STATUS    RESTARTS   AGE
pod/pod-https   1/1     Running   0          3m4s            
* Create the certificate *
(⎈|default:N/A) root@k3s-s:~# MYDOMAIN1=gyull.com
(⎈|default:N/A) root@k3s-s:~# openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout tls.key -out tls.crt -subj "/CN=$MYDOMAIN1/O=$MYDOMAIN1"
(⎈|default:N/A) root@k3s-s:~# tree
.
├...
├── tls.crt
└── tls.key

(⎈|default:N/A) root@k3s-s:~# kubectl create secret tls secret-https --key tls.key --cert tls.crt
secret/secret-https created

(⎈|default:N/A) root@k3s-s:~# kubectl get secrets secret-https
NAME           TYPE                DATA   AGE
secret-https   kubernetes.io/tls   2      9s
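* Verify access *

* Packet capture *
The capture shows packets flowing to port 443 of the ingress-nginx-controller pod. Filtering on the NodePort values captures nothing on the pod's veth interface, because the destination port has already been translated to 443 by the time the packet reaches the pod.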
(⎈|default:N/A) root@k3s-s:~# export IngHttp=$(kubectl get service -n ingress ingress-nginx-controller -o jsonpath='{.spec.ports[0].nodePort}')
(⎈|default:N/A) root@k3s-s:~# export IngHttps=$(kubectl get service -n ingress ingress-nginx-controller -o jsonpath='{.spec.ports[1].nodePort}')

(⎈|default:N/A) root@k3s-s:~# tcpdump -i vetha126644e tcp port $IngHttp -nn
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vetha126644e, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

(⎈|default:N/A) root@k3s-s:~# tcpdump -i vetha126644e tcp port $IngHttps -nn
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vetha126644e, link-type EN10MB (Ethernet), snapshot length 262144 bytes
^C
0 packets captured
0 packets received by filter
0 packets dropped by kernel

(⎈|default:N/A) root@k3s-s:~# tcpdump -i vetha126644e tcp port 443 -nn
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on vetha126644e, link-type EN10MB (Ethernet), snapshot length 262144 bytes
00:03:10.784942 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [S], seq 1383706545, win 65535, options [mss 1460,nop,wscale 6,nop,nop,TS val 723061224 ecr 0,sackOK,eol], length 0
00:03:10.784958 IP 172.16.0.7.443 > 175.116.31.155.63439: Flags [S.], seq 814443817, ack 1383706546, win 62293, options [mss 8911,sackOK,TS val 4107631988 ecr 723061224,nop,wscale 7], length 0
00:03:10.789822 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [.], ack 1, win 2058, options [nop,nop,TS val 723061229 ecr 4107631988], length 0
00:03:10.790540 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [.], seq 1:1449, ack 1, win 2058, options [nop,nop,TS val 723061229 ecr 4107631988], length 1448
00:03:10.790544 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [P.], seq 1449:1879, ack 1, win 2058, options [nop,nop,TS val 723061229 ecr 4107631988], length 430
00:03:10.790554 IP 172.16.0.7.443 > 175.116.31.155.63439: Flags [.], ack 1449, win 476, options [nop,nop,TS val 4107631994 ecr 723061229], length 0
00:03:10.790575 IP 172.16.0.7.443 > 175.116.31.155.63439: Flags [.], ack 1879, win 473, options [nop,nop,TS val 4107631994 ecr 723061229], length 0
00:03:10.791059 IP 172.16.0.7.443 > 175.116.31.155.63439: Flags [P.], seq 1:255, ack 1879, win 473, options [nop,nop,TS val 4107631994 ecr 723061229], length 254
00:03:10.796129 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [.], ack 255, win 2054, options [nop,nop,TS val 723061235 ecr 4107631994], length 0
00:03:10.796132 IP 175.116.31.155.63439 > 172.16.0.7.443: Flags [P.], seq 1879:1909, ack 255, win 2054, options [nop,nop,TS val 723061235 ecr 4107631994], length 30



IPVS
Sat, 05 Oct 2024 16:09:27 GMT
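What is IPVS
IPVS is a software load balancer that runs inside the Linux kernel. It is built on top of Netfilter and can handle TCP and UDP requests.
Checking the IPVS configuration
strictARP
After creating a cluster whose kube-proxy runs in IPVS mode, the kube-proxy ConfigMap shows the strictARP: true setting.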
# kubectl describe cm -n kube-system kube-proxy
ipvs:
  excludeCIDRs: null
  minSyncPeriod: 0s
  scheduler: ""
  strictARP: true
  syncPeriod: 0s
  tcpFinTimeout: 0s
  tcpTimeout: 0s
  udpTimeout: 0s
* strictARP: true *

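A setting that restricts a node to answering ARP requests only for IP addresses it actually owns.
IPVS handles traffic both between pods inside the cluster and between external clients and pods. If ARP replies are not handled correctly, traffic can be delivered to the wrong node, which is why strictARP is enabled here.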

kube-ipvs0
Checking the IP information on the cluster nodes shows a new interface named kube-ipvs0.
kube-ipvs0 holds the same set of addresses on every node: the ClusterIPs of the Services deployed in the cluster.
# for i in control-plane worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i ip -br -c addr show kube-ipvs0; echo; done

>> node myk8s-control-plane <<
kube-ipvs0       DOWN           10.200.1.10/32 10.200.1.1/32 

>> node myk8s-worker <<
kube-ipvs0       DOWN           10.200.1.10/32 10.200.1.1/32 

>> node myk8s-worker2 <<
kube-ipvs0       DOWN           10.200.1.1/32 10.200.1.10/32 

>> node myk8s-worker3 <<
kube-ipvs0       DOWN           10.200.1.10/32 10.200.1.1/32 

# kubectl get svc,ep -A                        
NAMESPACE     NAME                 TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)                  AGE
default       service/kubernetes   ClusterIP   10.200.1.1    <none>        443/TCP                  11m
kube-system   service/kube-dns     ClusterIP   10.200.1.10   <none>        53/UDP,53/TCP,9153/TCP   11m

NAMESPACE     NAME                   ENDPOINTS                                            AGE
default       endpoints/kubernetes   172.18.0.5:6443                                      11m
kube-system   endpoints/kube-dns     10.10.0.2:53,10.10.0.4:53,10.10.0.2:53 + 3 more...   11m
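What is kube-ipvs0?
It is the virtual (dummy) network interface that holds the Service ClusterIPs when Kubernetes load-balances in IPVS mode.
Because the ClusterIPs are bound to this interface, the node accepts traffic addressed to them, and IPVS forwards that traffic to the pods behind the Service.
How kube-ipvs0 tracks Service IPs
kube-proxy watches the Kubernetes API server for Services and Endpoints, and notices when a Service is added or removed or when endpoints change.
When something changes, it updates the ipset entries and the IPVS virtual servers and reflects the change on the IPVS interface.
As a result, kube-proxy registers each Service's ClusterIP on the kube-ipvs0 interface, and all traffic that reaches a ClusterIP is delivered to an appropriate pod by IPVS.
Let's look at the code that handles this.
* After syncing Services and Endpoints, add the Service ClusterIP to the IPVS table *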
https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/ipvs/proxier.go
func (proxier *Proxier) syncProxyRules() {
  ...
  // Build IPVS rules for each service.
    for svcPortName, svcPort := range proxier.svcPortMap {
        svcInfo, ok := svcPort.(*servicePortInfo)
    ...

    // Get the Service's ClusterIP.
    // Capture the clusterIP.
        // ipset call
        entry := &utilipset.Entry{
            IP:       svcInfo.ClusterIP().String(),
            Port:     svcInfo.Port(),
            Protocol: protocol,
            SetType:  utilipset.HashIPPort,
        }

    // Store the ClusterIP information in the ipset data structure.
        // add service Cluster IP:Port to kubeServiceAccess ip set for the purpose of solving hairpin.
        // proxier.kubeServiceAccessSet.activeEntries.Insert(entry.String())
    ...
        proxier.ipsetList[kubeClusterIPSet].activeEntries.Insert(entry.String())

    ...
    // Configure an IPVS virtual server with the ClusterIP, port, protocol, and scheduling method.
        // ipvs call
        serv := &utilipvs.VirtualServer{
            Address:   svcInfo.ClusterIP(),
            Port:      uint16(svcInfo.Port()),
            Protocol:  string(svcInfo.Protocol()),
            Scheduler: proxier.ipvsScheduler,
        }
    ...

    // Bind the IPVS virtual server address to the kube-ipvs0 dummy interface. This is why the Service ClusterIPs are visible on kube-ipvs0.
    // We need to bind ClusterIP to dummy interface, so set `bindAddr` parameter to `true` in syncService()
        if err := proxier.syncService(svcPortNameString, serv, true, alreadyBoundAddrs); err == nil {
            activeIPVSServices.Insert(serv.String())
            activeBindAddrs.Insert(serv.Address.String())
}
ipset
An ipset is a set of IP addresses and ports that allows them to be matched quickly during filtering.
As the code above shows, the entries are written by kube-proxy.
# docker exec -it myk8s-worker ipset -L | grep -i kube-cluster-ip -A12
Name: KUBE-CLUSTER-IP
Type: hash:ip,port
Revision: 7
Header: family inet hashsize 1024 maxelem 65536 bucketsize 12 initval 0x2c7e0b73
Size in memory: 408
References: 3
Number of entries: 4
Members:
10.200.1.10,udp:53
10.200.1.1,tcp:443
10.200.1.10,tcp:53
10.200.1.10,tcp:9153
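The relationship between the Service list, the KUBE-CLUSTER-IP members above, and the addresses on kube-ipvs0 can be sketched as follows. This is a simplified model of the sync step, not kube-proxy code: `servicePort`, `ipsetEntry`, and `syncClusterIPs` are hypothetical names, and the data is taken from the kubectl output shown earlier.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// servicePort is one (ClusterIP, protocol, port) tuple of a Service.
type servicePort struct {
	clusterIP string
	protocol  string
	port      int
}

// services mirrors the kubernetes and kube-dns Services of the test cluster.
var services = []servicePort{
	{"10.200.1.1", "TCP", 443},
	{"10.200.1.10", "UDP", 53},
	{"10.200.1.10", "TCP", 53},
	{"10.200.1.10", "TCP", 9153},
}

// ipsetEntry renders a hash:ip,port member in the form that `ipset -L` prints.
func ipsetEntry(sp servicePort) string {
	return fmt.Sprintf("%s,%s:%d", sp.clusterIP, strings.ToLower(sp.protocol), sp.port)
}

// syncClusterIPs returns the ipset members (one per service port) and the
// distinct addresses that would be bound to the dummy interface.
func syncClusterIPs(ports []servicePort) (members, bindAddrs []string) {
	seen := map[string]bool{}
	for _, sp := range ports {
		members = append(members, ipsetEntry(sp))
		if !seen[sp.clusterIP] {
			seen[sp.clusterIP] = true
			bindAddrs = append(bindAddrs, sp.clusterIP)
		}
	}
	sort.Strings(members)
	sort.Strings(bindAddrs)
	return members, bindAddrs
}

func main() {
	members, bindAddrs := syncClusterIPs(services)
	fmt.Println("ipset members:", members)
	fmt.Println("kube-ipvs0 addresses:", bindAddrs)
}
```

Four service ports produce four set members but only two bound addresses, which matches the two /32 addresses seen on kube-ipvs0.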
Comparing the number of iptables rules
Compare the number of iptables rules in iptables mode and in IPVS mode.
* iptables *
# iptables -nvL -t nat | wc -l
327
* ipvs *
# docker exec -it myk8s-worker iptables -nvL -t nat | wc -l 
61
Compared with iptables mode, the number of rules drops sharply, because IPVS handles the Service load-balancing rules separately.
* KUBE-SERVICES comparison *
In IPVS mode, the iptables KUBE-SERVICES chain looks like this.
Unlike in iptables mode, the rules that load-balance from a Service to its Endpoints are gone.
# docker exec -it myk8s-worker iptables -nvL -t nat
...
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 RETURN     0    --  *      *       127.0.0.0/8          0.0.0.0/0           
    0     0 KUBE-MARK-MASQ  0    --  *      *      !10.10.0.0/16         0.0.0.0/0            /* Kubernetes service cluster ip + port for masquerade purpose */ match-set KUBE-CLUSTER-IP dst,dst
    0     0 KUBE-NODE-PORT  0    --  *      *       0.0.0.0/0            0.0.0.0/0            ADDRTYPE match dst-type LOCAL
    0     0 ACCEPT     0    --  *      *       0.0.0.0/0            0.0.0.0/0            match-set KUBE-CLUSTER-IP dst,dst
Accessing the service
Verifying load balancing - ipvsadm
Running ipvsadm on each node for the ClusterIP shows that the scheduler is Round Robin (rr) and that traffic is spread across the endpoints with equal weight.
# CIP=$(kubectl get svc svc-clusterip -o jsonpath="{.spec.clusterIP}")
# CPORT=$(kubectl get svc svc-clusterip -o jsonpath="{.spec.ports[0].port}")
# echo $CIP $CPORT
10.200.1.33 9000

# kubectl get svc,ep                                       
NAME                    TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE
service/kubernetes      ClusterIP   10.200.1.1            443/TCP    88m
service/svc-clusterip   ClusterIP   10.200.1.33           9000/TCP   17m

NAME                      ENDPOINTS                                AGE
endpoints/kubernetes      172.18.0.5:6443                          88m
endpoints/svc-clusterip   10.10.1.2:80,10.10.2.2:80,10.10.3.2:80   17m

# for i in control-plane worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i ipvsadm -Ln -t $CIP:$CPORT ; echo; done
>> node myk8s-control-plane <<
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.200.1.33:9000 rr
  -> 10.10.1.2:80                 Masq    1      0          0         
  -> 10.10.2.2:80                 Masq    1      0          0         
  -> 10.10.3.2:80                 Masq    1      0          0         

>> node myk8s-worker <<
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.200.1.33:9000 rr
  -> 10.10.1.2:80                 Masq    1      0          0         
  -> 10.10.2.2:80                 Masq    1      0          0         
  -> 10.10.3.2:80                 Masq    1      0          0         

>> node myk8s-worker2 <<
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.200.1.33:9000 rr
  -> 10.10.1.2:80                 Masq    1      0          0         
  -> 10.10.2.2:80                 Masq    1      0          0         
  -> 10.10.3.2:80                 Masq    1      0          0         

>> node myk8s-worker3 <<
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.200.1.33:9000 rr
  -> 10.10.1.2:80                 Masq    1      0          0         
  -> 10.10.2.2:80                 Masq    1      0          0         
  -> 10.10.3.2:80                 Masq    1      0          0     
Verifying load balancing - monitoring with ipvsadm rate
The ipvsadm --stats and --rate options show how many connections were actually distributed to each endpoint.
# kubectl exec -it net-pod -- zsh -c "for i in {1..10000}; do curl -s $SVC1:9000 | grep Hostname; sleep 0.01; done"
Hostname: webpod2
Hostname: webpod3
Hostname: webpod1
Hostname: webpod2
Hostname: webpod3
...
^Ccommand terminated with exit code 130

# watch -d "docker exec -it myk8s-control-plane ipvsadm -Ln -t $CIP:$CPORT --stats; echo; docker exec -it myk8s-control-plane ipvsadm -Ln -t $CIP:$CPORT --rate"
Every 2.0s: docker exec -it myk8s-control-plane ipvsadm -Ln -t 10.200.1.33:9000 ...  baggyuliui-MacBookPro.local: Sun Oct  6 01:33:47 2024

Prot LocalAddress:Port               Conns   InPkts  OutPkts  InBytes OutBytes
  -> RemoteAddress:Port
TCP  10.200.1.33:9000                  725     4350     2900   289275   380625
  -> 10.10.1.2:80                      242     1452      968    96558   127050
  -> 10.10.2.2:80                      241     1446      964    96159   126525
  -> 10.10.3.2:80                      242     1452      968    96558   127050

Prot LocalAddress:Port                 CPS    InPPS   OutPPS    InBPS   OutBPS
  -> RemoteAddress:Port
TCP  10.200.1.33:9000                    2       13        9      882     1160
  -> 10.10.1.2:80                        1        4        3      295      388
  -> 10.10.2.2:80                        1        4        3      293      386
  -> 10.10.3.2:80                        1        4        3      294      386
Looking at Conns, the connections are distributed almost evenly.
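The near-equal Conns values are what a round-robin scheduler with equal weights produces. The following is a minimal sketch of that scheduling idea, assuming a simple cyclic picker; the `roundRobin` type and `distribute` function are hypothetical names, not IPVS code.

```go
package main

import "fmt"

// roundRobin hands out backends in a fixed cyclic order, all with equal weight.
type roundRobin struct {
	backends []string
	next     int
}

// pick returns the next backend and advances the cursor.
func (r *roundRobin) pick() string {
	b := r.backends[r.next]
	r.next = (r.next + 1) % len(r.backends)
	return b
}

// distribute sends conns connections through the scheduler and counts them per backend.
func distribute(backends []string, conns int) map[string]int {
	rr := &roundRobin{backends: backends}
	counts := make(map[string]int, len(backends))
	for i := 0; i < conns; i++ {
		counts[rr.pick()]++
	}
	return counts
}

func main() {
	backends := []string{"10.10.1.2:80", "10.10.2.2:80", "10.10.3.2:80"}
	counts := distribute(backends, 725) // 725 is the Conns total from the output above
	for _, b := range backends {
		fmt.Println(b, counts[b])
	}
}
```

With three backends, the per-backend counts can differ by at most one, which matches the 242/241/242 split seen in the stats.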



Service-(3)LoadBalancer
Fri, 04 Oct 2024 15:40:46 GMT
LoadBalancer
A LoadBalancer-type Service works in one of two broad ways.
Let's look at each of them.
LoadBalancer - Nodeport - Pod

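When an external client connects to the LoadBalancer, the LoadBalancer forwards the traffic to a node, using the node's NodePort as the destination port. The node's iptables rules then deliver it to a Pod through random load balancing.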
LoadBalancer Controller (LoadBalancer - Pod)

The LoadBalancer Controller collects the IP addresses of the Pods attached to a Service through the Kubernetes Endpoints API, and the LB then connects directly to those Pods.
A representative example of a LoadBalancer Controller is the AWS LoadBalancer Controller.
Let's briefly look at the code in which the AWS LoadBalancer Controller, through the TargetGroupBinding CRD, collects the Service Endpoint addresses.
AWS LoadBalancer Controller
https://github.com/kubernetes-sigs/aws-load-balancer-controller/blob/main/pkg/targetgroupbinding/resource_manager.go
The following code returns the slice of Endpoints that the Service points to.
package targetgroupbinding
...
func (m *defaultResourceManager) reconcileWithIPTargetType(ctx context.Context, tgb *elbv2api.TargetGroupBinding) error {
    svcKey := buildServiceReferenceKey(tgb, tgb.Spec.ServiceRef)
...
    var endpoints []backend.PodEndpoint
    var containsPotentialReadyEndpoints bool
    var err error
    // Return the slice of Endpoints that the Service points to. ResolvePodEndpoints lives in pkg/backend/endpoint_resolver.go.
    endpoints, containsPotentialReadyEndpoints, err = m.endpointResolver.ResolvePodEndpoints(ctx, svcKey, tgb.Spec.ServiceRef.Port, resolveOpts...) 
  ...
  if err := m.networkingManager.ReconcileForPodEndpoints(ctx, tgb, endpoints); err != nil {
        m.eventRecorder.Event(tgb, corev1.EventTypeWarning, k8s.TargetGroupBindingEventReasonFailedNetworkReconcile, err.Error())
        needNetworkingRequeue = true
    }
  ...
    // Register the Endpoints that are not yet in the AWS TargetGroup.
  if len(unmatchedEndpoints) > 0 {
        if err := m.registerPodEndpoints(ctx, tgARN, vpcID, unmatchedEndpoints); err != nil {
            return err
        }
    }

func (r *defaultEndpointResolver) ResolvePodEndpoints(ctx context.Context, svcKey types.NamespacedName, port intstr.IntOrString, opts ...EndpointResolveOption) ([]PodEndpoint, bool, error) {
    resolveOpts := defaultEndpointResolveOptions()
    resolveOpts.ApplyOptions(opts)

    _, svcPort, err := r.findServiceAndServicePort(ctx, svcKey, port)
    if err != nil {
        return nil, false, err
    }
    endpointsDataList, err := r.computeServiceEndpointsData(ctx, svcKey)
    if err != nil {
        return nil, false, err
    }
    return r.resolvePodEndpointsWithEndpointsData(ctx, svcKey, svcPort, endpointsDataList, resolveOpts.PodReadinessGates)
}

...
// Returns the Pod Endpoints that are actually in the ready state.
func (r *defaultEndpointResolver) resolvePodEndpointsWithEndpointsData(ctx context.Context, svcKey types.NamespacedName, svcPort corev1.ServicePort, endpointsDataList []EndpointsData, podReadinessGates []corev1.PodConditionType) ([]PodEndpoint, bool, error) {
...
The returned Endpoints are then registered in the AWS TargetGroup.
func (m *defaultResourceManager) registerPodEndpoints(ctx context.Context, tgARN, tgVpcID string, endpoints []backend.PodEndpoint) error {
    vpcID := m.vpcID
    // Target group is in a different VPC from the cluster's VPC
    if tgVpcID != "" && tgVpcID != m.vpcID {
        vpcID = tgVpcID
        m.logger.Info("registering endpoints using the targetGroup's vpcID", tgVpcID,
            "which is different from the cluster's vpcID", m.vpcID)
    }
  ...

    sdkTargets := make([]elbv2types.TargetDescription, 0, len(endpoints))
    for _, endpoint := range endpoints {
        target := elbv2types.TargetDescription{
            Id:   awssdk.String(endpoint.IP),
            Port: awssdk.Int32(endpoint.Port),
        }
        podIP, err := netip.ParseAddr(endpoint.IP)
        if err != nil {
            return err
        }
        if !networking.IsIPWithinCIDRs(podIP, vpcCIDRs) {
            target.AvailabilityZone = awssdk.String("all")
        }
        sdkTargets = append(sdkTargets, target)
    }
    // Register the Endpoints in the AWS TargetGroup.
    return m.targetsManager.RegisterTargets(ctx, tgARN, sdkTargets)
}
Because an AWS load balancer distributes traffic to the targets of a TargetGroup, it can send traffic directly to Pod IPs.
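The target-building loop in registerPodEndpoints can be tried out without the AWS SDK. The sketch below is a simplified, self-contained model of that step: `podEndpoint`, `target`, `mustPrefixes`, and `buildTargets` are hypothetical stand-ins for the controller's own types and helpers, and the CIDR is only an example value. It shows how a Pod IP outside the VPC CIDRs gets the availability zone "all", as in the excerpt above.

```go
package main

import (
	"fmt"
	"net/netip"
)

// podEndpoint and target are simplified, hypothetical stand-ins for the controller's types.
type podEndpoint struct {
	IP   string
	Port int32
}

type target struct {
	ID               string
	Port             int32
	AvailabilityZone string // left empty when the IP is inside the VPC CIDRs
}

// mustPrefixes parses CIDR strings and panics on invalid input (literals only).
func mustPrefixes(cidrs ...string) []netip.Prefix {
	out := make([]netip.Prefix, 0, len(cidrs))
	for _, c := range cidrs {
		out = append(out, netip.MustParsePrefix(c))
	}
	return out
}

// withinCIDRs reports whether ip falls inside any of the given prefixes.
func withinCIDRs(ip netip.Addr, cidrs []netip.Prefix) bool {
	for _, c := range cidrs {
		if c.Contains(ip) {
			return true
		}
	}
	return false
}

// buildTargets turns Pod endpoints into target descriptions, one per endpoint.
func buildTargets(endpoints []podEndpoint, vpcCIDRs []netip.Prefix) ([]target, error) {
	targets := make([]target, 0, len(endpoints))
	for _, ep := range endpoints {
		ip, err := netip.ParseAddr(ep.IP)
		if err != nil {
			return nil, fmt.Errorf("parse pod IP %q: %w", ep.IP, err)
		}
		t := target{ID: ep.IP, Port: ep.Port}
		if !withinCIDRs(ip, vpcCIDRs) {
			t.AvailabilityZone = "all"
		}
		targets = append(targets, t)
	}
	return targets, nil
}

func main() {
	vpc := mustPrefixes("10.10.0.0/16") // example CIDR, not taken from a real VPC
	targets, err := buildTargets([]podEndpoint{{"10.10.1.2", 80}, {"192.168.5.7", 80}}, vpc)
	if err != nil {
		panic(err)
	}
	for _, t := range targets {
		fmt.Printf("%+v\n", t)
	}
}
```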
MetalLB
This section covers MetalLB, a LoadBalancer implementation widely used in on-premises environments.
MetalLB operates in either Layer2 mode or BGP mode.
Here we look at Layer2 mode.
Layer2 Mode

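When MetalLB is installed, a speaker Pod runs on every node as a DaemonSet. When a LoadBalancer Service is created, one of the speaker Pods is elected as the leader for that Service.
The leader speaker Pod advertises the Service's External IP with ARP messages.
When a client connects to the LoadBalancer Service, the traffic arrives at the node running the leader speaker Pod, and that node's iptables rules deliver it to the Pods.
On the way from the client to a Pod, DNAT happens twice.
Traffic sent from the client to the LB is DNATed to NodeIP:Port on its way from the LB to the node, and then DNATed to PodIP:Port by the node's iptables rules on its way to the Pod.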
Checking MetalLB resources
After installing MetalLB and the L2-related resources, check what was installed.
# kubectl get crd | grep metallb  
bfdprofiles.metallb.io                      2024-10-03T07:15:10Z
bgpadvertisements.metallb.io                2024-10-03T07:15:10Z
bgppeers.metallb.io                         2024-10-03T07:15:10Z
communities.metallb.io                      2024-10-03T07:15:10Z
ipaddresspools.metallb.io                   2024-10-03T07:15:10Z
l2advertisements.metallb.io                 2024-10-03T07:15:10Z
servicel2statuses.metallb.io                2024-10-03T07:15:10Z



| CRD Name | Description |
| --- | --- |
| bfdprofiles.metallb.io | Defines BFD (Bidirectional Forwarding Detection) profiles, which are used to detect failures of BGP sessions. |
| bgpadvertisements.metallb.io | Manages BGP (Border Gateway Protocol) advertisements; used when MetalLB advertises the cluster's IP addresses. |
| bgppeers.metallb.io | Defines the BGP peers MetalLB connects to. This CRD sets up the relationship with other BGP routers. |
| communities.metallb.io | Defines BGP communities that can be attached to BGP advertisements to implement specific routing policies. |
| ipaddresspools.metallb.io | Defines the IP address pools MetalLB uses. Service IPs are allocated from these pools. |
| l2advertisements.metallb.io | Advertises IP addresses in L2 mode, using L2 broadcast. |
| servicel2statuses.metallb.io | Tracks the status of Services in L2 mode and exposes status-related metadata. |


The Speaker Pods run as a DaemonSet, and a single MetalLB controller is created.
The Speaker Pods use the same network namespace as the host.
# kubectl get ds,deploy -n metallb-system
NAME                     DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
daemonset.apps/speaker   4         4         4       4            4           kubernetes.io/os=linux   32h

NAME                         READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/controller   1/1     1            1           32h

# kubectl get pod -o wide -n metallb-system | grep speaker 
speaker-2mfpd                 2/2     Running   0          32h   172.18.0.4   myk8s-worker2                    
speaker-bvm4p                 2/2     Running   0          32h   172.18.0.3   myk8s-worker3                    
speaker-f5nhm                 2/2     Running   0          32h   172.18.0.5   myk8s-control-plane              
speaker-sntgs                 2/2     Running   0          32h   172.18.0.2   myk8s-worker                     

# kubectl get nodes -o wide                               
NAME                  STATUS   ROLES           AGE    VERSION   INTERNAL-IP   EXTERNAL-IP   OS-IMAGE                         KERNEL-VERSION    CONTAINER-RUNTIME
myk8s-control-plane   Ready    control-plane   3d3h   v1.31.0   172.18.0.5            Debian GNU/Linux 12 (bookworm)   6.4.16-linuxkit   containerd://1.7.18
myk8s-worker          Ready              3d3h   v1.31.0   172.18.0.2            Debian GNU/Linux 12 (bookworm)   6.4.16-linuxkit   containerd://1.7.18
myk8s-worker2         Ready              3d3h   v1.31.0   172.18.0.4            Debian GNU/Linux 12 (bookworm)   6.4.16-linuxkit   containerd://1.7.18
myk8s-worker3         Ready              3d3h   v1.31.0   172.18.0.3            Debian GNU/Linux 12 (bookworm)   6.4.16-linuxkit   containerd://1.7.18
Deploy ipaddresspools.metallb.io and l2advertisements.metallb.io resources to manage the Services' IPAddressPool and to allow LoadBalancer IPs from that pool to be used in Layer2 mode.
cat <

MetalLB resources and Services
Checking the leader Speaker Pod for each Service
The events of each Service show which Speaker Pod is its leader.
# kubectl describe svc svc1 | grep Events: -A5
Events:
  Type    Reason        Age   From                Message
  ----    ------        ----  ----                -------
  Normal  IPAllocated   39s   metallb-controller  Assigned IP ["172.18.255.200"]
  Normal  nodeAssigned  39s   metallb-speaker     announcing from node "myk8s-worker" with protocol "layer2"

# kubectl describe svc svc2 | grep Events: -A5
Events:
  Type    Reason        Age   From                Message
  ----    ------        ----  ----                -------
  Normal  IPAllocated   51s   metallb-controller  Assigned IP ["172.18.255.201"]
  Normal  nodeAssigned  51s   metallb-speaker     announcing from node "myk8s-worker" with protocol "layer2"

# kubectl describe svc svc3 | grep Events: -A5
Events:
  Type    Reason        Age   From                Message
  ----    ------        ----  ----                -------
  Normal  IPAllocated   55s   metallb-controller  Assigned IP ["172.18.255.202"]
  Normal  nodeAssigned  55s   metallb-speaker     announcing from node "myk8s-worker3" with protocol "layer2"
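One way to think about per-Service leader selection is a deterministic, hash-based pick that every speaker can compute on its own from the same node list. The sketch below illustrates that idea under stated assumptions: hashing "node#ip" with SHA-256 and taking the smallest digest is an illustrative scheme, and `electLeader` and `without` are hypothetical helpers. MetalLB's actual election logic may differ in detail.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// electLeader deterministically picks one node for a service IP by hashing
// "node#ip" and choosing the smallest digest. The result does not depend on
// the order of the input list. (Illustrative scheme, not MetalLB source.)
func electLeader(nodes []string, serviceIP string) string {
	if len(nodes) == 0 {
		return ""
	}
	sorted := append([]string(nil), nodes...) // copy so the caller's slice is untouched
	sort.Slice(sorted, func(i, j int) bool {
		hi := sha256.Sum256([]byte(sorted[i] + "#" + serviceIP))
		hj := sha256.Sum256([]byte(sorted[j] + "#" + serviceIP))
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return sorted[0]
}

// without returns a copy of nodes with the named node removed (a failed node).
func without(nodes []string, name string) []string {
	out := make([]string, 0, len(nodes))
	for _, n := range nodes {
		if n != name {
			out = append(out, n)
		}
	}
	return out
}

func main() {
	nodes := []string{"myk8s-control-plane", "myk8s-worker", "myk8s-worker2", "myk8s-worker3"}
	for _, ip := range []string{"172.18.255.200", "172.18.255.201", "172.18.255.202"} {
		leader := electLeader(nodes, ip)
		fallback := electLeader(without(nodes, leader), ip)
		fmt.Printf("%s: leader=%s, next leader if it fails=%s\n", ip, leader, fallback)
	}
}
```

A useful property of such a scheme is that the pick depends only on the set of healthy nodes, so when a failed node comes back, the original leader is chosen again.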

Running an ARP tcpdump on a node while the Services are being created shows the leader speaker Pod sending ARP messages to the other nodes to announce itself as the leader.
# docker exec -it myk8s-worker tcpdump -i eth0 arp
tcpdump: verbose output suppressed, use -v[v]... for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), snapshot length 262144 bytes
21:28:14.714811 ARP, Request who-has myk8s-worker3.kind tell myk8s-worker, length 28
21:28:14.715200 ARP, Reply myk8s-worker3.kind is-at 02:42:ac:12:00:03 (oui Unknown), length 28
21:28:19.834218 ARP, Request who-has myk8s-worker tell myk8s-worker2.kind, length 28
21:28:19.834233 ARP, Reply myk8s-worker is-at 02:42:ac:12:00:02 (oui Unknown), length 28
21:28:29.579059 ARP, Request who-has 172.18.255.200 (Broadcast) tell 172.18.255.200, length 46
21:28:29.579226 ARP, Reply 172.18.255.200 is-at 02:42:ac:12:00:02 (oui Unknown), length 46
...
21:28:29.610111 ARP, Request who-has 172.18.255.201 (Broadcast) tell 172.18.255.201, length 46
21:28:29.610142 ARP, Reply 172.18.255.201 is-at 02:42:ac:12:00:02 (oui Unknown), length 46
...
21:28:29.647648 ARP, Request who-has 172.18.255.202 (Broadcast) tell 172.18.255.202, length 46
21:28:29.647748 ARP, Reply 172.18.255.202 is-at 02:42:ac:12:00:03 (oui Unknown), length 46
From mypc, arping reveals which leader speaker Pod is responsible for each Service's EXTERNAL-IP.
# docker exec -it mypc arping -I eth0 -f -c 1 $SVC1EXIP
ARPING 172.18.255.200 from 172.18.0.6 eth0
Unicast reply from 172.18.255.200 [02:42:AC:12:00:02]  1.858ms # reply received from 172.18.255.200
Sent 1 probes (1 broadcast(s)) # the ARP request was sent as a broadcast
Received 1 response(s) # one response was received

# docker exec -it mypc arping -I eth0 -f -c 1 $SVC2EXIP
ARPING 172.18.255.201 from 172.18.0.6 eth0
Unicast reply from 172.18.255.201 [02:42:AC:12:00:02]  4.066ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)

# docker exec -it mypc arping -I eth0 -f -c 1 $SVC3EXIP
ARPING 172.18.255.202 from 172.18.0.6 eth0
Unicast reply from 172.18.255.202 [02:42:AC:12:00:03]  1.900ms
Sent 1 probes (1 broadcast(s))
Received 1 response(s)
Check the ARP table on mypc to see which node acts as the leader speaker for each Service.
# docker exec -it mypc ip -c neigh | sort
172.18.0.1 dev eth0 lladdr 02:42:58:36:2d:23 STALE
172.18.0.2 dev eth0 lladdr 02:42:ac:12:00:02 STALE
172.18.0.3 dev eth0 lladdr 02:42:ac:12:00:03 STALE
172.18.0.4 dev eth0 lladdr 02:42:ac:12:00:04 STALE
172.18.0.5 dev eth0 lladdr 02:42:ac:12:00:05 STALE
172.18.255.200 dev eth0 lladdr 02:42:ac:12:00:02 STALE # -> myk8s-worker
172.18.255.201 dev eth0 lladdr 02:42:ac:12:00:02 STALE # -> myk8s-worker
172.18.255.202 dev eth0 lladdr 02:42:ac:12:00:03 STALE # -> myk8s-worker3
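In the neighbor table above, each node's MAC address has the form 02:42 followed by the four bytes of its IPv4 address (for example 02:42:ac:12:00:02 for 172.18.0.2), which makes it easy to map an External IP's lladdr back to a node. The helper below is a small sketch based on that observed pattern; `nodeIPFromMAC` is a hypothetical name, and the pattern is an assumption that holds for this lab output, not a guarantee for other environments.

```go
package main

import (
	"fmt"
	"net"
	"net/netip"
)

// nodeIPFromMAC decodes the last four bytes of a 02:42:xx:xx:xx:xx MAC address
// as an IPv4 address. It returns an error for MACs that do not follow the pattern.
func nodeIPFromMAC(mac string) (netip.Addr, error) {
	hw, err := net.ParseMAC(mac)
	if err != nil {
		return netip.Addr{}, err
	}
	if len(hw) != 6 || hw[0] != 0x02 || hw[1] != 0x42 {
		return netip.Addr{}, fmt.Errorf("MAC %s does not follow the 02:42 + IPv4 pattern", mac)
	}
	return netip.AddrFrom4([4]byte{hw[2], hw[3], hw[4], hw[5]}), nil
}

func main() {
	// lladdr values of the three External IPs from the neighbor table above.
	for _, mac := range []string{"02:42:ac:12:00:02", "02:42:ac:12:00:02", "02:42:ac:12:00:03"} {
		ip, err := nodeIPFromMAC(mac)
		if err != nil {
			panic(err)
		}
		fmt.Println(mac, "->", ip)
	}
}
```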
Accessing the service

From mypc, call each Service and check the responses.
# for i in $SVC1EXIP $SVC2EXIP $SVC3EXIP; do echo ">> Access Service External-IP : $i <<" ;docker exec -it mypc curl -s $i | egrep 'Hostname|RemoteAddr|Host:' ; echo ; done

>> Access Service External-IP : 172.18.255.200 <<
Hostname: webpod1
RemoteAddr: 10.10.3.1:32676
Host: 172.18.255.200

>> Access Service External-IP : 172.18.255.201 <<
Hostname: webpod2
RemoteAddr: 172.18.0.2:47331
Host: 172.18.255.201

>> Access Service External-IP : 172.18.255.202 <<
Hostname: webpod1
RemoteAddr: 172.18.0.3:53125
Host: 172.18.255.202
As registered in the ARP table, traffic sent to an External-IP (Access Service External-IP) is delivered to the node hosting that Service's leader speaker Pod.
Next, call each Service repeatedly to check the load balancing.
# docker exec mypc zsh -c  "for i in {1..1000}; do curl -s --connect-timeout 1 $SVC1EXIP | grep Hostname; done | sort | uniq -c | sort -nr"
    504 Hostname: webpod1
    496 Hostname: webpod2

# docker exec mypc zsh -c  "for i in {1..1000}; do curl -s --connect-timeout 1 $SVC2EXIP | grep Hostname; done | sort | uniq -c | sort -nr"
    515 Hostname: webpod1
    485 Hostname: webpod2

# docker exec mypc zsh -c  "for i in {1..1000}; do curl -s --connect-timeout 1 $SVC3EXIP | grep Hostname; done | sort | uniq -c | sort -nr"
    510 Hostname: webpod1
    490 Hostname: webpod2
The load is distributed almost evenly.
Checking iptables
Check the iptables rules on myk8s-worker.

- PREROUTING chain: ensures that traffic arriving at the node is processed by the Kubernetes Service iptables rules.
```bash
# docker exec -it myk8s-worker bash

# iptables -t nat -S PREROUTING
-P PREROUTING ACCEPT
-A PREROUTING -m comment --comment "kubernetes service portals" -j KUBE-SERVICES
-A PREROUTING -d 192.168.65.254/32 -j DOCKER_OUTPUT
```

- KUBE-SERVICES chain: requests to the Service's ExternalIP 172.18.255.200 are handled by the KUBE-EXT-DLGPAL4ZCYSJ7UPR chain.

- KUBE-EXT-DLGPAL4ZCYSJ7UPR chain:
    - Marks the traffic (via KUBE-MARK-MASQ) so that it is SNATed when the packet leaves the node.
    - Then hands the traffic over to KUBE-SVC-DLGPAL4ZCYSJ7UPR.
```bash
# SVC1EXIP=172.18.255.200

# iptables -t nat -S KUBE-SERVICES |grep $SVC1EXIP
-A KUBE-SERVICES -d 172.18.255.200/32 -p tcp -m comment --comment "default/svc1:svc1-webport loadbalancer IP" -m tcp --dport 80 -j KUBE-EXT-DLGPAL4ZCYSJ7UPR

# iptables -t nat -S KUBE-EXT-DLGPAL4ZCYSJ7UPR
-N KUBE-EXT-DLGPAL4ZCYSJ7UPR
-A KUBE-EXT-DLGPAL4ZCYSJ7UPR -m comment --comment "masquerade traffic for default/svc1:svc1-webport external destinations" -j KUBE-MARK-MASQ
-A KUBE-EXT-DLGPAL4ZCYSJ7UPR -j KUBE-SVC-DLGPAL4ZCYSJ7UPR

# iptables -t nat -S KUBE-MARK-MASQ
-N KUBE-MARK-MASQ
-A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
```

- KUBE-SVC-DLGPAL4ZCYSJ7UPR chain:
    - When a packet whose source is not in 10.10.0.0/16 (an external packet) arrives for ClusterIP:80, it is marked via the KUBE-MARK-MASQ chain so that it is SNATed when it leaves the node.
    - Sends the packet to one of the KUBE-SEP-YYY chains according to the probability.
```bash
# iptables -t nat -S KUBE-SVC-DLGPAL4ZCYSJ7UPR
-N KUBE-SVC-DLGPAL4ZCYSJ7UPR
-A KUBE-SVC-DLGPAL4ZCYSJ7UPR ! -s 10.10.0.0/16 -d 10.200.1.217/32 -p tcp -m comment --comment "default/svc1:svc1-webport cluster IP" -m tcp --dport 80 -j KUBE-MARK-MASQ
-A KUBE-SVC-DLGPAL4ZCYSJ7UPR -m comment --comment "default/svc1:svc1-webport -> 10.10.1.5:80" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-NZKLDL653RQ74MZI
-A KUBE-SVC-DLGPAL4ZCYSJ7UPR -m comment --comment "default/svc1:svc1-webport -> 10.10.3.3:80" -j KUBE-SEP-L6SWHLHTULTEJGNO

# iptables -t nat -S KUBE-SEP-NZKLDL653RQ74MZI
-N KUBE-SEP-NZKLDL653RQ74MZI
-A KUBE-SEP-NZKLDL653RQ74MZI -s 10.10.1.5/32 -m comment --comment "default/svc1:svc1-webport" -j KUBE-MARK-MASQ
-A KUBE-SEP-NZKLDL653RQ74MZI -p tcp -m comment --comment "default/svc1:svc1-webport" -m tcp -j DNAT --to-destination 10.10.1.5:80

# iptables -t nat -S KUBE-SEP-L6SWHLHTULTEJGNO
-N KUBE-SEP-L6SWHLHTULTEJGNO
-A KUBE-SEP-L6SWHLHTULTEJGNO -s 10.10.3.3/32 -m comment --comment "default/svc1:svc1-webport" -j KUBE-MARK-MASQ
-A KUBE-SEP-L6SWHLHTULTEJGNO -p tcp -m comment --comment "default/svc1:svc1-webport" -m tcp -j DNAT --to-destination 10.10.3.3:80
```

- POSTROUTING chain: rules applied just before a packet leaves through the NIC.

- KUBE-POSTROUTING chain:
    - Packets that do not carry the 0x4000 mark hit RETURN and therefore skip the SNAT (MASQUERADE) rule.
```bash
# iptables -t nat -S POSTROUTING
-P POSTROUTING ACCEPT
-A POSTROUTING -m comment --comment "kubernetes postrouting rules" -j KUBE-POSTROUTING
-A POSTROUTING -d 192.168.65.254/32 -j DOCKER_POSTROUTING
-A POSTROUTING -m addrtype ! --dst-type LOCAL -m comment --comment "kind-masq-agent: ensure nat POSTROUTING directs all non-LOCAL destination traffic to our custom KIND-MASQ-AGENT chain" -j KIND-MASQ-AGENT

# iptables -t nat -S KUBE-POSTROUTING
-N KUBE-POSTROUTING
-A KUBE-POSTROUTING -m mark ! --mark 0x4000/0x4000 -j RETURN
-A KUBE-POSTROUTING -j MARK --set-xmark 0x4000/0x0
-A KUBE-POSTROUTING -m comment --comment "kubernetes service traffic requiring SNAT" -j MASQUERADE --random-fully
```
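The interplay between KUBE-MARK-MASQ and KUBE-POSTROUTING comes down to one bit of the packet mark. The sketch below models the `--set-xmark value/mask` semantics (clear the mask bits, then XOR the value) and the `--mark value/mask` match, using the rule values shown above. The function names are hypothetical, and this is a simplified model rather than Netfilter code.

```go
package main

import "fmt"

const masqBit uint32 = 0x4000

// setXmark models the MARK target's --set-xmark value/mask:
// zero the bits given by mask, then XOR value into the mark.
func setXmark(mark, value, mask uint32) uint32 {
	return (mark &^ mask) ^ value
}

// kubeMarkMasq models: -A KUBE-MARK-MASQ -j MARK --set-xmark 0x4000/0x4000
func kubeMarkMasq(mark uint32) uint32 {
	return setXmark(mark, masqBit, masqBit)
}

// kubePostrouting models the KUBE-POSTROUTING chain. It returns the resulting
// mark and whether the MASQUERADE (SNAT) rule is reached.
func kubePostrouting(mark uint32) (uint32, bool) {
	// -m mark ! --mark 0x4000/0x4000 -j RETURN
	if mark&masqBit != masqBit {
		return mark, false
	}
	// -j MARK --set-xmark 0x4000/0x0: with an empty mask this XORs the bit, clearing it here.
	mark = setXmark(mark, masqBit, 0x0)
	// -j MASQUERADE --random-fully
	return mark, true
}

func main() {
	external := kubeMarkMasq(0) // packet that went through KUBE-MARK-MASQ
	internal := uint32(0)       // packet that was never marked
	for _, m := range []uint32{external, internal} {
		after, snat := kubePostrouting(m)
		fmt.Printf("mark in=%#x out=%#x snat=%v\n", m, after, snat)
	}
}
```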
Failover
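In MetalLB L2 mode, when the node hosting the leader speaker Pod fails, the remaining speaker Pods detect the failure of that node.
A new leader speaker Pod is elected for the ExternalIPs of the Services that were tied to the failed speaker, and once elected, it advertises the Service ExternalIPs via GARP.
Detecting the failure, electing a new leader speaker Pod, and advertising takes about 20 seconds at best and up to 1 minute, so the drawback is a prolonged outage.
Let's test the failover process.
* Triggering the failure *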
# docker stop myk8s-worker --signal 9

myk8s-worker
* Checking SVC1 call status *
Even with few Services running, it takes about 30 seconds from the failure until service is restored.
# docker exec -it mypc zsh -c "while true; do curl -s --connect-timeout 1 $SVC1EXIP | egrep 'Hostname|RemoteAddr'; date '+%Y-%m-%d %H:%M:%S' ; echo ;  sleep 1; done"

Hostname: webpod1
RemoteAddr: 10.10.3.1:33831
2024-10-05 14:27:21

Hostname: webpod2
RemoteAddr: 172.18.0.2:25693
2024-10-05 14:27:22

# Outage begins
2024-10-05 14:27:24

2024-10-05 14:27:26

...

Hostname: webpod2
RemoteAddr: 10.10.1.1:54046
2024-10-05 14:27:44

2024-10-05 14:27:46
...

# Recovery complete
Hostname: webpod2
RemoteAddr: 10.10.1.1:5946
2024-10-05 14:27:55

Hostname: webpod2
RemoteAddr: 10.10.1.1:18744
2024-10-05 14:27:56
* Confirming the new leader speaker Pod *
The new leader speaker Pod now owns the Service's External IP.
172.18.0.1 dev eth0 lladdr 02:42:58:36:2d:23 STALE
172.18.0.2 dev eth0 lladdr 02:42:ac:12:00:02 STALE
172.18.0.3 dev eth0 lladdr 02:42:ac:12:00:03 STALE
172.18.0.4 dev eth0 lladdr 02:42:ac:12:00:04 STALE
172.18.0.5 dev eth0 lladdr 02:42:ac:12:00:05 STALE
172.18.255.200 dev eth0 lladdr 02:42:ac:12:00:04 REACHABLE # myk8s-worker -> myk8s-worker2
172.18.255.201 dev eth0 lladdr 02:42:ac:12:00:03 STALE # myk8s-worker -> myk8s-worker3
172.18.255.202 dev eth0 lladdr 02:42:ac:12:00:03 STALE # myk8s-worker3
* After restoring the failed node *
Once the failed node is restored, leadership returns to the original speaker node.
# docker start myk8s-worker

# kubectl get nodes
NAME                  STATUS   ROLES           AGE    VERSION
myk8s-control-plane   Ready    control-plane   4d2h   v1.31.0
myk8s-worker          Ready              4d2h   v1.31.0
myk8s-worker2         Ready              4d2h   v1.31.0
myk8s-worker3         Ready              4d2h   v1.31.0

# docker exec -it mypc ip -c neigh | sort
172.18.0.1 dev eth0 lladdr 02:42:58:36:2d:23 STALE
172.18.0.2 dev eth0 lladdr 02:42:ac:12:00:02 STALE
172.18.0.3 dev eth0 lladdr 02:42:ac:12:00:03 STALE
172.18.0.4 dev eth0 lladdr 02:42:ac:12:00:04 STALE
172.18.0.5 dev eth0 lladdr 02:42:ac:12:00:05 STALE
172.18.255.200 dev eth0 lladdr 02:42:ac:12:00:02 REACHABLE # myk8s-worker2 -> myk8s-worker
172.18.255.201 dev eth0 lladdr 02:42:ac:12:00:02 STALE # myk8s-worker3 -> myk8s-worker
172.18.255.202 dev eth0 lladdr 02:42:ac:12:00:03 STALE # myk8s-worker3



Service-(1)ClusterIP
Fri, 27 Sep 2024 17:18:42 GMT
Service
This series covers three Kubernetes Service types: ClusterIP, NodePort, and LoadBalancer. Services are implemented by kube-proxy.
Let's look at each in detail.
kube-proxy
What is kube-proxy?
It manages the configuration that makes Service communication work and is deployed as a DaemonSet on every Kubernetes node.
Service proxying can run in iptables proxy mode, IPVS proxy mode, nftables mode, or eBPF mode + XDP.
clusterIP

A ClusterIP is a Service for communication inside the cluster. kube-proxy applies the iptables load-balancing rules to every node, and when a client then connects to the ClusterIP, the node's iptables rules DNAT the traffic so that it reaches its destination.
ClusterIP iptables rules and load-balanced access
When a Service is created, kube-proxy adds iptables rules on every node
$ kubectl get svc -A                                         
NAMESPACE     NAME            TYPE        CLUSTER-IP     EXTERNAL-IP   PORT(S)                  AGE
default       kubernetes      ClusterIP   10.200.1.1             443/TCP                  153m
default       svc-clusterip   ClusterIP   10.200.1.145           9000/TCP                 139m
kube-system   kube-dns        ClusterIP   10.200.1.10            53/UDP,53/TCP,9153/TCP   153m
kube-system   kube-ops-view   NodePort    10.200.1.108           8080:30000/TCP           151m

$ docker exec -it myk8s-control-plane iptables -t nat -S | grep 10.200.1.145
-A KUBE-SERVICES -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-SVC-KBDEBIL6IU6WL7RF
-A KUBE-SVC-KBDEBIL6IU6WL7RF ! -s 10.10.0.0/16 -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-MARK-MASQ

$ for i in worker worker2 worker3; do echo ">> node myk8s-$i <<"; docker exec -it myk8s-$i iptables -t nat -S | grep 10.200.1.145; echo; done
>> node myk8s-worker <<
-A KUBE-SERVICES -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-SVC-KBDEBIL6IU6WL7RF
-A KUBE-SVC-KBDEBIL6IU6WL7RF ! -s 10.10.0.0/16 -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-MARK-MASQ

>> node myk8s-worker2 <<
-A KUBE-SERVICES -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-SVC-KBDEBIL6IU6WL7RF
-A KUBE-SVC-KBDEBIL6IU6WL7RF ! -s 10.10.0.0/16 -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-MARK-MASQ

>> node myk8s-worker3 <<
-A KUBE-SERVICES -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-SVC-KBDEBIL6IU6WL7RF
-A KUBE-SVC-KBDEBIL6IU6WL7RF ! -s 10.10.0.0/16 -d 10.200.1.145/32 -p tcp -m comment --comment "default/svc-clusterip:svc-webport cluster IP" -m tcp --dport 9000 -j KUBE-MARK-MASQ
Neither the Linux TCP listening sockets nor the routing table contain any Service-related entries -> this confirms that the traffic is handled by iptables rules
$ docker exec -it myk8s-control-plane ss -tnlp
State    Recv-Q   Send-Q       Local Address:Port        Peer Address:Port   Process                                     
LISTEN   0        4096             127.0.0.1:37521            0.0.0.0:*       users:(("containerd",pid=104,fd=10))       
LISTEN   0        4096            127.0.0.11:35057            0.0.0.0:*                                                  
LISTEN   0        4096            172.18.0.4:2379             0.0.0.0:*       users:(("etcd",pid=656,fd=9))              
LISTEN   0        4096            172.18.0.4:2380             0.0.0.0:*       users:(("etcd",pid=656,fd=7))              
LISTEN   0        4096             127.0.0.1:2381             0.0.0.0:*       users:(("etcd",pid=656,fd=16))             
LISTEN   0        4096             127.0.0.1:2379             0.0.0.0:*       users:(("etcd",pid=656,fd=8))              
LISTEN   0        4096             127.0.0.1:10257            0.0.0.0:*       users:(("kube-controller",pid=562,fd=3))   
LISTEN   0        4096             127.0.0.1:10259            0.0.0.0:*       users:(("kube-scheduler",pid=550,fd=3))    
LISTEN   0        4096             127.0.0.1:10249            0.0.0.0:*       users:(("kube-proxy",pid=830,fd=13))       
LISTEN   0        4096             127.0.0.1:10248            0.0.0.0:*       users:(("kubelet",pid=723,fd=21))          
LISTEN   0        4096                     *:10250                  *:*       users:(("kubelet",pid=723,fd=9))           
LISTEN   0        4096                     *:10256                  *:*       users:(("kube-proxy",pid=830,fd=12))       
LISTEN   0        4096                     *:6443                   *:*       users:(("kube-apiserver",pid=570,fd=3))    

$ docker exec -it myk8s-control-plane ip -c route
default via 172.18.0.1 dev eth0 
10.10.0.2 dev vethee7e19dd scope host 
10.10.0.3 dev vethb5c52334 scope host 
10.10.0.4 dev vetha420baee scope host 
10.10.0.5 dev vethc9967629 scope host 
10.10.0.6 dev veth321e7cdd scope host 
10.10.1.0/24 via 172.18.0.5 dev eth0 
10.10.2.0/24 via 172.18.0.2 dev eth0 
10.10.3.0/24 via 172.18.0.3 dev eth0 
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.4 
Verifying ClusterIP load balancing
Curling the ClusterIP shows the connections being distributed almost evenly across the 3 Pods.
$ kubectl exec -it net-pod -- zsh -c "for i in {1..1000}; do curl -s $SVC1:9000 | grep Hostname; done | sort | uniq -c | sort -nr"
    345 Hostname: webpod3
    328 Hostname: webpod2
    327 Hostname: webpod1

$ kubectl exec -it net-pod -- zsh -c "for i in {1..100};   do curl -s $SVC1:9000 | grep Hostname; sleep 0.1; done"
Hostname: webpod2
Hostname: webpod1
Hostname: webpod3
Hostname: webpod2
Hostname: webpod3
Hostname: webpod2
Hostname: webpod3
Hostname: webpod1
Hostname: webpod1
Hostname: webpod2
Hostname: webpod2
Hostname: webpod2
Hostname: webpod3
Hostname: webpod1
Hostname: webpod3
Hostname: webpod3
Hostname: webpod2
Hostname: webpod3
Hostname: webpod2
Hostname: webpod3
...
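Load-balancing code analysis
Let's analyze the code that, in iptables mode, derives the distribution ratio from the number of Pods backing a Service when the Service is created.
The relevant code is in proxy/iptables/proxier.go in Kubernetes.
https://github.com/kubernetes/kubernetes/blob/master/pkg/proxy/iptables/proxier.go
1) writeServiceToEndpointRules
This is the function kube-proxy uses to generate iptables rules once a Service has been created.
Its parameters are as follows.

natRules : LineBuffer that collects the iptables rules
svcPortNameString : string holding the Service name and port information
svcInfo : object holding Service-related information such as SessionAffinityType and StickyMaxAgeSeconds
svcChain : name of the Service's iptables chain
endpoints : slice of the endpoints attached to the Service
args : slice used to build the final iptables rule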
func (proxier *Proxier) writeServiceToEndpointRules(natRules proxyutil.LineBuffer, svcPortNameString string, svcInfo proxy.ServicePort, svcChain utiliptables.Chain, endpoints []proxy.Endpoint, args []string) {
...

    // Now write loadbalancing rules.
    numEndpoints := len(endpoints)

    // Iterate over the Service endpoints and generate one iptables rule per endpoint.
    for i, ep := range endpoints {
        epInfo, ok := ep.(*endpointInfo)
        if !ok {
            continue
        }
        // Example: "default/svc-clusterip:svc-webport -> 10.10.2.2:80"
        comment := fmt.Sprintf(`"%s -> %s"`, svcPortNameString, epInfo.String())

        // Example: -A KUBE-SVC-KBDEBIL6IU6WL7RF
        args = append(args[:0], "-A", string(svcChain))
        args = proxier.appendServiceCommentLocked(args, comment)
        // Because of the i < (numEndpoints - 1) condition, the rule for the last endpoint gets no probability, which guarantees that all remaining traffic goes to that endpoint.
        if i < (numEndpoints - 1) {
            // Each rule is a probabilistic match.
            // Example: -m statistic --mode random --probability 0.33333333349
            args = append(args,
                "-m", "statistic",
                "--mode", "random",
                // proxier.probability receives the number of endpoints still remaining, and that number determines the probability.
                "--probability", proxier.probability(numEndpoints-i))
        }
        // The final (or only if n == 1) rule is a guaranteed match.
        // Example: -j KUBE-SEP-TBW2IYJKUCAC7GB3
        natRules.Write(args, "-j", string(epInfo.ChainName))
    }
}
2) probability
Called from writeServiceToEndpointRules, this code computes and returns the distribution probability for a Service Endpoint.
type Proxier struct {
  ...
  // Since converting probabilities (floats) to strings is expensive
    // and we are using only probabilities in the format of 1/n, we are
    // precomputing some number of those and cache for future reuse.
    precomputedProbabilities []string
  ...
}

// Returns precomputedProbabilities[n], the cached probability string for n. If the n-th value is not cached yet, it calls precomputeProbabilities to fill the cache first.
func (proxier *Proxier) probability(n int) string {
    if n >= len(proxier.precomputedProbabilities) {
        proxier.precomputeProbabilities(n)
    }
    return proxier.precomputedProbabilities[n]
}

func (proxier *Proxier) precomputeProbabilities(numberOfPrecomputed int) {
    if len(proxier.precomputedProbabilities) == 0 {
        proxier.precomputedProbabilities = append(proxier.precomputedProbabilities, "")
    }
    // Store the probability value computed for each endpoint count in precomputedProbabilities.
    for i := len(proxier.precomputedProbabilities); i <= numberOfPrecomputed; i++ {
        proxier.precomputedProbabilities = append(proxier.precomputedProbabilities, computeProbability(i))
    }
}

// Computes the probability for a given number of Service Endpoints: for n endpoints it returns 1/n formatted with 10 decimal places.
func computeProbability(n int) string {
    return fmt.Sprintf("%0.10f", 1.0/float64(n))
}
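To see why the sequence of probabilities 1/n, 1/(n-1), ..., 1 ends up spreading traffic evenly, here is a minimal Python sketch (not kube-proxy code). `compute_probability` mirrors `computeProbability` above; `rule_probabilities` and `effective_shares` are hypothetical helpers introduced only for this illustration. Each rule only sees packets that earlier rules did not match, so its overall share is its own probability multiplied by the fraction of traffic still remaining.

```python
from fractions import Fraction


def compute_probability(n: int) -> str:
    """Mirror of computeProbability: 1/n formatted with 10 decimal places."""
    return "%0.10f" % (1.0 / n)


def rule_probabilities(num_endpoints: int) -> list:
    """Probability attached to each rule, in order.

    The last rule carries no probability, so it always matches (modelled as 1).
    """
    probs = []
    for i in range(num_endpoints):
        if i < num_endpoints - 1:
            probs.append(Fraction(1, num_endpoints - i))
        else:
            probs.append(Fraction(1))
    return probs


def effective_shares(num_endpoints: int) -> list:
    """Overall share of traffic per endpoint when rules are evaluated in order."""
    shares = []
    remaining = Fraction(1)  # fraction of packets not matched by earlier rules
    for p in rule_probabilities(num_endpoints):
        shares.append(remaining * p)
        remaining *= 1 - p
    return shares


if __name__ == "__main__":
    print(compute_probability(3), compute_probability(2))
    print(effective_shares(3))
```

With three endpoints the rules carry 1/3, 1/2 and "always", and every endpoint receives exactly one third of the traffic.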
3) Netfilter routes the actual traffic based on the iptables rules
The iptables options -m statistic --mode random --probability are processed by Netfilter.
The match is implemented in net/netfilter/xt_statistic.c.
https://github.com/torvalds/linux/blob/master/net/netfilter/xt_statistic.c
Let's look at how the statistic match is processed when its mode is random.
// Runs whenever a statistic match is evaluated
static bool
statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
{
    const struct xt_statistic_info *info = par->matchinfo;
    // ret tells whether the statistic rule matches the packet. It starts as false unless the rule is inverted (XT_STATISTIC_INVERT).
    bool ret = info->flags & XT_STATISTIC_INVERT;
    int nval, oval;

    switch (info->mode) {
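    // random mode: if a randomly generated number (0 to the maximum value) is smaller than the probability given in the iptables rule, ret is flipped (true -> false, false -> true).
    // If ret is true, the packet is considered to match the rule and is sent to the rule's target; if false, it does not match and evaluation moves on to the next rule.
    // In effect, a random number below the configured probability routes the packet through this rule, and a larger one passes it on to the next rule, so traffic is spread randomly in line with the probability values.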
    case XT_STATISTIC_MODE_RANDOM:
        if ((get_random_u32() & 0x7FFFFFFF) < info->u.random.probability)
            ret = !ret;
        break;
...
    }

    return ret;
}
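The kernel compares a 31-bit random number against an integer, not against the float written in the rule. The sketch below is an illustration in Python and rests on one assumption that is not shown in this post: that the iptables userspace extension scales the probability by 2^31 and rounds to the nearest integer before handing it to the kernel. Under that assumption it also explains why kube-proxy writes 0.3333333333 while the rule listing prints 0.33333333349. The function names are hypothetical.

```python
import random

SCALE = 0x80000000  # 2**31, consistent with the 0x7FFFFFFF mask in statistic_mt


def to_kernel_threshold(probability: str) -> int:
    """Assumed userspace conversion: scale by 2**31 and round to nearest."""
    return round(SCALE * float(probability))


def displayed_probability(threshold: int) -> str:
    """Convert the integer threshold back to the float shown in rule listings."""
    return "%.11f" % (threshold / SCALE)


def statistic_match(threshold: int, invert: bool = False, rng=random) -> bool:
    """Python rendering of the XT_STATISTIC_MODE_RANDOM branch."""
    ret = invert
    if (rng.getrandbits(32) & 0x7FFFFFFF) < threshold:
        ret = not ret
    return ret


if __name__ == "__main__":
    t = to_kernel_threshold("0.3333333333")
    print(t, displayed_probability(t))
    hits = sum(statistic_match(t) for _ in range(100_000))
    print(hits / 100_000)  # close to one third, varies per run
```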

conntrack
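When a pod inside the cluster talks to ClusterIP:port, the Netfilter connection tracking table can be inspected with conntrack on the control-plane node.
Each entry shows the original tuple (src = client pod IP, dst = ClusterIP) followed by the reply tuple (src = IP of the pod backing the service, dst = client pod IP), which reveals the DNAT that was applied.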
# SVC1=$(kubectl get svc svc-clusterip -o jsonpath={.spec.clusterIP})
# kubectl exec -it net-pod -- zsh -c "for i in {1..100};   do curl -s $SVC1:9000 | grep Hostname; sleep 1; done"
Hostname: webpod2
Hostname: webpod3
Hostname: webpod3
Hostname: webpod3
Hostname: webpod2
...

# docker exec -it myk8s-control-plane sh  
# conntrack -L --src 10.10.0.6
# conntrack -L --dst 10.200.1.189
tcp      6 2 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=50638 dport=9000 src=10.10.3.3 dst=10.10.0.6 sport=80 dport=50638 [ASSURED] mark=0 use=1
tcp      6 118 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=39762 dport=9000 src=10.10.1.2 dst=10.10.0.6 sport=80 dport=39762 [ASSURED] mark=0 use=1
tcp      6 0 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=50624 dport=9000 src=10.10.3.3 dst=10.10.0.6 sport=80 dport=50624 [ASSURED] mark=0 use=1
tcp      6 12 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=45118 dport=9000 src=10.10.3.3 dst=10.10.0.6 sport=80 dport=45118 [ASSURED] mark=0 use=1
tcp      6 15 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=45152 dport=9000 src=10.10.3.3 dst=10.10.0.6 sport=80 dport=45152 [ASSURED] mark=0 use=1
tcp      6 20 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=58170 dport=9000 src=10.10.2.2 dst=10.10.0.6 sport=80 dport=58170 [ASSURED] mark=0 use=1
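The two tuples in each conntrack entry can be pulled apart mechanically. The following is a small Python sketch for reading the output above; `parse_conntrack_line` and `dnat_target` are hypothetical helper names, and the parsing assumes the `src= dst= sport= dport=` layout shown in these lines.

```python
import re


def parse_conntrack_line(line: str) -> dict:
    """Split one `conntrack -L` line into its original and reply tuples."""
    pairs = re.findall(r"(src|dst|sport|dport)=(\S+)", line)
    original = dict(pairs[:4])  # first four key=value pairs: original direction
    reply = dict(pairs[4:8])    # next four: expected reply direction
    return {"original": original, "reply": reply}


def dnat_target(entry: dict) -> str:
    """The backend chosen by DNAT is the source of the reply tuple."""
    return f"{entry['reply']['src']}:{entry['reply']['sport']}"


if __name__ == "__main__":
    sample = (
        "tcp      6 2 TIME_WAIT src=10.10.0.6 dst=10.200.1.189 sport=50638 "
        "dport=9000 src=10.10.3.3 dst=10.10.0.6 sport=80 dport=50638 "
        "[ASSURED] mark=0 use=1"
    )
    entry = parse_conntrack_line(sample)
    print(entry["original"]["dst"], "->", dnat_target(entry))
```

For the first entry above, the client addressed 10.200.1.189:9000 and the connection was actually served by 10.10.3.3:80.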
Packet dump
When a pod inside the cluster talks to a ClusterIP service, check the packet dump on each worker node.
eth0 interface
tcpdump -i eth0 tcp port 80 -nn
Packets between the client pod and the service pods are captured.

tcpdump -i eth0 tcp port 9000 -nn
No packets are captured.

veth interface
tcpdump -i $VETH tcp port 80 -nn
Packets between the client pod and the service pods are captured.

tcpdump -i $VETH tcp port 9000 -nn
No packets are captured.

With ClusterIP, when a client pod connects to ClusterIP:port, the packet (source = client pod IP:random port | destination = ClusterIP:port) is DNATed by the iptables rules into (source = client pod IP:random port | destination = target pod IP:pod port).
Because the packet has already been DNATed by the time it leaves through the NIC toward another node, the captures above show traffic on the pod port (80) and nothing on the ClusterIP port (9000).
iptables rules
Check the iptables rules on the control-plane node.
PREROUTING chain: hands incoming traffic to the Kubernetes service chains so that it is processed by the Kubernetes Service iptables rules.
# iptables -v --numeric --table nat --list PREROUTING            
Chain PREROUTING (policy ACCEPT 0 packets, 0 bytes)
 pkts bytes target     prot opt in     out     source               destination         
  387 23318 KUBE-SERVICES  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service portals */
KUBE-SERVICES chain: requests to ClusterIP 10.200.1.189 on tcp port 9000 are handled by the KUBE-SVC-KBDEBIL6IU6WL7RF chain.
# iptables -v --numeric --table nat --list KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
 pkts bytes target     prot opt in     out     source               destination         
    0     0 KUBE-SVC-KBDEBIL6IU6WL7RF  6    --  *      *       0.0.0.0/0            10.200.1.189         /* default/svc-clusterip:svc-webport cluster IP */ tcp dpt:9000
KUBE-SVC-XXX chain:

When a packet whose source is outside 10.10.0.0/16 (that is, not from a pod) arrives for ClusterIP:9000, it is sent to the KUBE-MARK-MASQ chain, which marks it so that it is SNATed in POSTROUTING.
This rewrites the source IP to the node's IP, so the backend pod replies to the node and the reply can be translated back and delivered to the original client.
Packets are then sent to one of the KUBE-SEP-YYY chains according to the probability.
# iptables -v --numeric --table nat --list KUBE-SVC-KBDEBIL6IU6WL7RF
Chain KUBE-SVC-KBDEBIL6IU6WL7RF (1 references)
pkts bytes target     prot opt in     out     source               destination         
  0     0 KUBE-MARK-MASQ  6    --  *      *      !10.10.0.0/16         10.200.1.189         /* default/svc-clusterip:svc-webport cluster IP */ tcp dpt:9000
 67  4020 KUBE-SEP-TBW2IYJKUCAC7GB3  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 10.10.1.2:80 */ statistic mode random probability 0.33333333349
 70  4200 KUBE-SEP-DOIEFYKPESCDTYCH  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 10.10.2.2:80 */ statistic mode random probability 0.50000000000
 97  5820 KUBE-SEP-K7ALM6KJRBAYOHKX  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport -> 10.10.3.3:80 */


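KUBE-SEP-YYY chain:

If the packet's source is the endpoint pod itself (10.10.1.2 here, i.e. hairpin traffic), it is sent to the KUBE-MARK-MASQ chain and marked so that SNAT is applied.
All tcp packets reaching the chain are then DNATed to Pod IP:Pod port by the DNAT rule.
# iptables -v --numeric --table nat --list KUBE-SEP-TBW2IYJKUCAC7GB3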
Chain KUBE-SEP-TBW2IYJKUCAC7GB3 (1 references)
pkts bytes target     prot opt in     out     source               destination         
  0     0 KUBE-MARK-MASQ  0    --  *      *       10.10.1.2            0.0.0.0/0            /* default/svc-clusterip:svc-webport */
 67  4020 DNAT       6    --  *      *       0.0.0.0/0            0.0.0.0/0            /* default/svc-clusterip:svc-webport */ tcp to:10.10.1.2:80


POSTROUTING chain: rules applied just before a packet leaves through the NIC.
# iptables -v --numeric --table nat --list POSTROUTING 
Chain POSTROUTING (policy ACCEPT 10724 packets, 643K bytes)
pkts bytes target     prot opt in     out     source               destination         
183K   11M KUBE-POSTROUTING  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes postrouting rules */

KUBE-POSTROUTING chain: packets without the 0x4000 mark hit RETURN and are not SNATed; only marked packets reach the MASQUERADE rule.
# iptables -v --numeric --table nat --list KUBE-POSTROUTING
Chain KUBE-POSTROUTING (1 references)
 pkts bytes target     prot opt in     out     source               destination
 3903  234K RETURN     0    --  *      *       0.0.0.0/0            0.0.0.0/0            mark match ! 0x4000/0x4000
    0     0 MARK       0    --  *      *       0.0.0.0/0            0.0.0.0/0            MARK xor 0x4000
    0     0 MASQUERADE  0    --  *      *       0.0.0.0/0            0.0.0.0/0            /* kubernetes service traffic requiring SNAT */ random-fully
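The mark handling in KUBE-POSTROUTING is plain bit arithmetic. The Python sketch below walks a packet mark through the three rules listed above. `kube_mark_masq` assumes that KUBE-MARK-MASQ ORs 0x4000 into the mark; that chain is not listed in this post, so treat it as an assumption. The function names are hypothetical.

```python
MASQ_MARK = 0x4000


def kube_mark_masq(mark: int) -> int:
    """Assumed behaviour of KUBE-MARK-MASQ: set the 0x4000 bit on the mark."""
    return mark | MASQ_MARK


def kube_postrouting(mark: int):
    """Return (action, resulting mark) following the KUBE-POSTROUTING rules."""
    # Rule 1: mark match ! 0x4000/0x4000 -> RETURN, so no SNAT is applied
    if (mark & MASQ_MARK) != MASQ_MARK:
        return ("RETURN", mark)
    # Rule 2: MARK xor 0x4000 clears the bit again
    mark ^= MASQ_MARK
    # Rule 3: MASQUERADE rewrites the source address to the node's address
    return ("MASQUERADE", mark)


if __name__ == "__main__":
    print(kube_postrouting(0x0))                  # unmarked pod-to-pod traffic
    print(kube_postrouting(kube_mark_masq(0x0)))  # traffic marked for SNAT
```

In the counters above, all 3903 packets hit the RETURN rule, which fits the earlier conntrack output where the client pod IP is preserved.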

### Issues encountered while operating iptables
When something goes wrong at the Kubernetes network layer, analyzing iptables is usually the first step toward resolving it.
Below are a few cases from operations where iptables helped with the diagnosis.

#### 1) nginx-ingress-controller iptables duplicate
(1) Symptom
While upgrading Kubernetes step by step from 1.19 to 1.23, ingress traffic started failing during the 1.21 -> 1.22 upgrade.

(2) Analysis
After the 1.21 -> 1.22 upgrade, whenever the nginx-ingress-controller pod was recreated, the old entry in the CNI-HOSTPORT-DNAT chain was not removed and replaced by the new one. Entries kept piling up in the chain, and traffic eventually failed.
```bash
# iptables -t nat -L CNI-HOSTPORT-DNAT 
Chain CNI-HOSTPORT-DNAT (2 references)
target prot opt source destination 
CNI-DN-927a1e1c4bd5758f249e2 tcp -- anywhere anywhere /* dnat name: "k8s-pod-network" id: "4c75c2d6bf09093eb41fed19414ce577a5d4f0e0196a435d63b2c3b9795bb2c5" */ multiport dports http,https,pcsync-https 
CNI-DN-35351e11e6944203cb9a5 tcp -- anywhere anywhere /* dnat name: "k8s-pod-network" id: "3de4fe737cdd5b3787ca3e9d50ea40e4815179e1be62e7002e137a8e18916c3f" */ multiport dports http,https,pcsync-https
```
(3) Cause
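This issue affects kubelets that use dockershim.
For pods that use hostPort, the kubelet's dockershim stores the port-mapping information under /var/lib/dockershim/sandbox inside the kubelet container. Whenever the pod changes, the old entry is deleted and the latest information is registered in iptables.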
# docker exec -it kubelet /bin/sh
sh-5.1# cd /var/lib/dockershim/sandbox/ 
sh-5.1# ls 
7dded8712b8dbafacafda34d56a4f28aa5445f96c2274e0e509028a55b1d29ec 

sh-5.1# cat 7dded8712b8dbafacafda34d56a4f28aa5445f96c2274e0e509028a55b1d29ec
# Information and port mappings of the hostPort pod currently running on this node
{"version":"v1","name":"nginx-ingress-controller-dd8rh","namespace":"ingress-nginx","data":{"port_mappings":[{"protocol":"tcp","container_port":80,"host_port":80},{"protocol":"tcp","container_port":443,"host_port":443},{"protocol":"tcp","container_port":8443,"host_port":8443}]},"checksum":2543962669}
In Kubernetes 1.22 the portmap plugin was upgraded from v0.8.6 to v1.0.0, and in that version the code was changed to do nothing when no port-mapping information is supplied.
The kubelet stores pod metadata in /var/lib/dockershim, but right after the upgrade the kubelet is restarted and the metadata in that directory is lost.
For a hostPort pod, the portmap binary changes the iptables rules based on the portMappings in the pod metadata.
Normally the kubelet can terminate a pod even without its metadata. With hostPort and the new portmap plugin, however, the lost metadata means an empty portMappings list is returned, the rules of the removed pod are never cleaned up from iptables, and duplicate iptables entries are the result.
(4) Resolution
The simplest fix would have been to upgrade the portmap plugin, but upgrading only that plugin was not possible in this environment. As a workaround, the pod metadata under the kubelet's /var/lib/dockershim was persisted on the VM, so that the portMappings list is returned correctly and the rules of deleted pods are removed from iptables.
Adding the following setting to the cluster configuration creates /var/lib/dockershim on the VM and records the pod metadata beneath it.
services:
  kubelet:
    extra_binds:
      - "/var/lib/dockershim:/var/lib/dockershim"
After this change the upgrade went through: rules were removed from and added to iptables correctly based on the pod metadata, and ingress traffic worked normally.
#### 2) kube-proxy upgrade failure
(1) Symptom
After a Kubernetes version upgrade, every pod failed to start on the worker nodes.
(2) Analysis
Inspecting iptables on a worker node showed that the Kubernetes-related rules were gone.
kube-proxy, which is responsible for updating iptables, had not been updated to the new version.
(3) Cause
Kubernetes has the concepts of static pods and mirror pods.
A static pod is managed by the kubelet directly from the node's file system: the kubelet periodically scans the YAML files in a path such as /etc/kubernetes/manifests and creates the static pods defined there.
A mirror pod is a resource created to reflect the state of a static pod in the Kubernetes API server, so that the static pod can be queried through the API.
When the kubelet creates a static pod, it registers a mirror pod for it in the API server and keeps the two in sync.
The issue was a timing problem between the kubelet and the API server: before the kubelet created the new kube-proxy static pod, the existing mirror pod was registered with the API server first. The API server therefore considered the new kube-proxy pod to exist already, and the new pod was never created.
(4) Resolution
Delete or move the kube-proxy pod manifest file.
This forces the kubelet through an explicit delete-then-create sequence (static pod deleted -> mirror pod deleted -> new static pod created -> new mirror pod created), after which kube-proxy is updated normally.



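Calico CNI Communication Flow
Sat, 21 Sep 2024 15:24:26 GMT
Pod-to-pod communication on the same node

Pods on the same node communicate directly inside the node, routed by calico-node acting as a virtual router.
Create two pods on the same node and monitor the traffic between them.
State before creating the pods
# IPIP tunnel interface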
root@k8s-w1:~# ip -c -d addr show tunl0
5: tunl0@NONE:  mtu 8981 qdisc noqueue state UNKNOWN group default qlen 1000
    link/ipip 0.0.0.0 brd 0.0.0.0 promiscuity 0 minmtu 0 maxmtu 0
    ipip any remote any local any ttl inherit nopmtudisc numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
    inet 172.16.158.0/32 scope global tunl0
       valid_lft forever preferred_lft forever

# No pods have been created yet, so no pod network namespaces exist.
root@k8s-w1:~# lsns -t net
        NS TYPE NPROCS   PID USER     NETNSID NSFS                                                COMMAND
4026531840 net     137     1 root  unassigned                                                     /sbin/init

# Routes exchanged between the nodes through the BGP routing protocol
# blackhole = traffic to this node's pod CIDR (172.16.158.0/24 here) is discarded unless a more specific route (longest prefix match) exists.
root@k8s-w1:~# ip -c route | grep bird
172.16.34.0/24 via 192.168.20.100 dev tunl0 proto bird onlink
172.16.116.0/24 via 192.168.10.10 dev tunl0 proto bird onlink
blackhole 172.16.158.0/24 proto bird
172.16.184.0/24 via 192.168.10.102 dev tunl0 proto bird onlink

# The tunl0 interface provides reachability to the pod ranges of the other worker nodes.
root@k8s-w1:~# route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         192.168.10.1    0.0.0.0         UG    100    0        0 eth0
172.16.34.0     192.168.20.100  255.255.255.0   UG    0      0        0 tunl0
172.16.116.0    192.168.10.10   255.255.255.0   UG    0      0        0 tunl0
172.16.158.0    0.0.0.0         255.255.255.0   U     0      0        0 *
172.16.184.0    192.168.10.102  255.255.255.0   UG    0      0        0 tunl0
192.168.0.2     192.168.10.1    255.255.255.255 UGH   100    0        0 eth0
192.168.10.0    0.0.0.0         255.255.255.0   U     100    0        0 eth0
192.168.10.1    0.0.0.0         255.255.255.255 UH    100    0        0 eth0
State after creating the pods
Listing the workload endpoints with calicoctl shows the pod information.
On k8s-w1, where the pods run, a /32 host route for each pod IP has been added to the routing table.
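How the blackhole route and the per-pod /32 routes interact comes down to longest prefix match. The Python sketch below models a reduced version of the k8s-w1 routing table shown earlier. The /32 entry and its interface name `cali-example` are hypothetical placeholders, since the actual route output is not reproduced in this post.

```python
import ipaddress

# Reduced routing table of k8s-w1; the /32 entry is a hypothetical example.
ROUTES = [
    ("0.0.0.0/0", "via 192.168.10.1 dev eth0"),
    ("172.16.34.0/24", "via 192.168.20.100 dev tunl0"),
    ("172.16.158.0/24", "blackhole"),
    ("172.16.158.1/32", "dev cali-example"),
]


def lookup(dst: str) -> str:
    """Pick the matching route with the longest prefix, as the kernel does."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(prefix), action)
        for prefix, action in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    _, action = max(candidates, key=lambda item: item[0].prefixlen)
    return action


if __name__ == "__main__":
    for dst in ("172.16.158.1", "172.16.158.77", "172.16.34.9", "8.8.8.8"):
        print(dst, "->", lookup(dst))
```

An address inside the local block that has no /32 route falls through to the blackhole entry and is dropped, while an address with a /32 route is delivered to the pod's interface.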


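Traffic monitoring
Monitoring the traffic shows the following sequence.
1) Pod 1 sends an ARP request to learn the MAC address of its gateway IP, 169.254.1.1.
2) The host-side veth peer (the cali* interface), which has proxy ARP enabled, answers with its own MAC address (ee:ee:ee:ee:ee:ee), and normal communication follows.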

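Pod -> external communication

When a pod talks to the outside (the internet), the traffic is MASQUERADEd (its source IP is rewritten) to the IP address of the node's network interface before it leaves.
Send traffic from a pod to an external address and monitor it.
State before creating the pods
# Default setting: NAT = true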
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl get ippool -o wide
NAME                  CIDR            NAT    IPIPMODE   VXLANMODE   DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   172.16.0.0/16   true   Always     Never       false      false              all()

# Confirm that the MASQUERADE rule for outbound traffic exists on the node
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# iptables -n -t nat --list cali-nat-outgoing
Chain cali-nat-outgoing (1 references)
target     prot opt source               destination
MASQUERADE  all  --  0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# ipset list cali40masq-ipam-pools
Name: cali40masq-ipam-pools
Type: hash:net
Revision: 7
Header: family inet hashsize 1024 maxelem 1048576 bucketsize 12 initval 0x7c23000b
Size in memory: 504
References: 1
Number of entries: 1
Members:
172.16.0.0/16
Traffic monitoring
1) On the pod's veth interface, packets between the pod IP and the external address are visible.
00:58:58.317730 IP 172.16.34.8 > 8.8.8.8: ICMP echo request, id 31, seq 2, length 64
00:58:58.353223 IP 8.8.8.8 > 172.16.34.8: ICMP echo reply, id 31, seq 2, length 64
2) On eth0, the source IP has been rewritten to the IP of the node's eth0 interface before the packet leaves.
00:59:00.320892 IP 192.168.20.100 > 8.8.8.8: ICMP echo request, id 21320, seq 4, length 64
00:59:00.356195 IP 8.8.8.8 > 192.168.20.100: ICMP echo reply, id 21320, seq 4, length 64
3) The packet counter of the nat MASQUERADE rule increases with external traffic.
root@k8s-w0:~# iptables -n -v -t nat --list cali-nat-outgoing
Chain cali-nat-outgoing (1 references)
 pkts bytes target     prot opt in     out     source               destination
    7   517 MASQUERADE  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:flqWnvo8yq4ULQLa */ match-set cali40masq-ipam-pools src ! match-set cali40all-ipam-pools dst random-fully
Unlike the pod's veth interface and eth0, the tunnel interface shows no packets.
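The condition in the cali-nat-outgoing rule reads: source in cali40masq-ipam-pools and destination not in cali40all-ipam-pools. A minimal Python sketch of that decision follows. The content of cali40all-ipam-pools is not listed in this post, so it is assumed here to hold the same single pool, 172.16.0.0/16; `needs_masquerade` is a hypothetical name.

```python
import ipaddress

MASQ_POOLS = [ipaddress.ip_network("172.16.0.0/16")]  # cali40masq-ipam-pools
ALL_POOLS = [ipaddress.ip_network("172.16.0.0/16")]   # cali40all-ipam-pools (assumed)


def needs_masquerade(src: str, dst: str) -> bool:
    """True when the source is a pool address and the destination is not."""
    s = ipaddress.ip_address(src)
    d = ipaddress.ip_address(dst)
    src_in_masq_pool = any(s in net for net in MASQ_POOLS)
    dst_in_any_pool = any(d in net for net in ALL_POOLS)
    return src_in_masq_pool and not dst_in_any_pool


if __name__ == "__main__":
    print(needs_masquerade("172.16.34.8", "8.8.8.8"))        # pod -> internet
    print(needs_masquerade("172.16.34.9", "172.16.158.16"))  # pod -> pod
```

This is why pod-to-internet traffic leaves with the node's IP while pod-to-pod traffic keeps the pod IPs.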

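Pod-to-pod communication across nodes (IPIP)

Traffic between pods on different nodes goes through the tunl0 interface, where it is encapsulated in an outer IP header. On the destination node the outer header is removed at tunl0 and the packet is delivered to the local pod.
Send traffic from a pod to a pod on another node and monitor it.
Traffic monitoring
1) On the pod's veth interface, packets between the pod IP and the remote pod IP are visible.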
01:36:23.770105 IP 172.16.34.9 > 172.16.158.16: ICMP echo request, id 22, seq 5, length 64
01:36:23.772248 IP 172.16.158.16 > 172.16.34.9: ICMP echo reply, id 22, seq 5, length 64
2) On the tunl0 interface, the same packets between the pod IP and the remote pod IP are visible.
01:36:22.768870 IP 172.16.34.9 > 172.16.158.16: ICMP echo request, id 22, seq 4, length 64
01:36:22.771009 IP 172.16.158.16 > 172.16.34.9: ICMP echo reply, id 22, seq 4, length 64
3) On eth0, the packets appear as traffic between the two node IPs. Inside the outer IP header there is a second IP header that carries the source and destination pod IPs.
01:36:19.764191 IP 192.168.20.100 > 192.168.10.101: IP 172.16.34.9 > 172.16.158.16: ICMP echo request, id 22, seq 1, length 64
01:36:19.765890 IP 192.168.10.101 > 192.168.20.100: IP 172.16.158.16 > 172.16.34.9: ICMP echo reply, id 22, seq 1, length 64
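The extra outer header also explains the tunl0 MTU of 8981 seen earlier. A short arithmetic sketch in Python, assuming an outer IPv4 header without options (20 bytes) and an eth0 MTU of 9001, the value ifconfig reports on these nodes; the helper names are hypothetical.

```python
IPV4_HEADER_BYTES = 20  # IPv4 header without options
IPPROTO_IPIP = 4        # protocol number carried in the outer header for IP-in-IP


def tunnel_mtu(underlay_mtu: int) -> int:
    """Largest inner packet that still fits once the outer header is added."""
    return underlay_mtu - IPV4_HEADER_BYTES


def outer_packet_length(inner_packet_length: int) -> int:
    """Size on the wire (IP layer) after IPIP encapsulation."""
    return inner_packet_length + IPV4_HEADER_BYTES


if __name__ == "__main__":
    print(tunnel_mtu(9001))
    print(outer_packet_length(tunnel_mtu(9001)))
```

An inner packet at the tunnel MTU grows back to exactly the eth0 MTU once encapsulated, so it is not fragmented on the way out.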




Calico CNI Components
Sat, 21 Sep 2024 13:28:24 GMT
Calico CNI Components
Calico provides network connectivity in Kubernetes through the components below.

Calico-kube-controllers
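It periodically queries the Kubernetes API server for cluster network resources such as pods, namespaces, and network policies, and updates the Calico datastore when it detects changes such as a newly created network policy.
Calico datastore
The Calico datastore can be either the Kubernetes API datastore (kdd) or etcd. Policies passed on by calico-kube-controllers are recorded in the Calico datastore.
The recorded resources are as follows.

Network Policies: policies that allow or deny network traffic
Workload Endpoints: the IP address associated with a pod's network interface, its network policies, and related information
Host Endpoints: information about services and network namespaces running on the host / BGP peering information between nodes
IPAM: IP address allocation management
Namespace/Node 
GlobalNetworkPolicy: network policies applied globally
Profile: network policy for a specific group of pods or a namespace in the cluster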

Calicoctl
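Users can view or update the resources stored in the Calico datastore through the calicoctl CLI.
CNI Plugin
When a pod is created in the Kubernetes cluster, this plugin creates the pod's network interface and connects it to the node's network (virtual bridge / overlay network), which is how the pod's traffic gets wired up.
CNI IPAM Plugin
The plugin that assigns, reclaims, and manages pod IPs. Once the CNI plugin has set up the pod's network interface, the CNI IPAM plugin assigns the pod an IP from the IP pool stored in the Calico datastore. 
When a pod is deleted, its IP address is reclaimed and the Calico datastore is updated so the address can be reused.
Calico Node Daemonset (Pod)
The calico-node pod runs as a daemonset on every node; it controls network traffic on the node and enforces Calico network policy. It works through the bird, felix, and confd programs inside it.
bird
An open source software routing daemon that handles BGP routing in Calico. It advertises the node's pod CIDR to the other nodes over BGP, which makes the pod ranges of other nodes reachable.
felix
Calico's network policy engine. It programs the host's routing table for local workloads and manages the iptables rules, while routes to other nodes' pod CIDRs learned over BGP are installed by bird (they appear as proto bird in ip route).
confd
Calico's configuration management tool. It manages the BGP-related settings and dynamically updates Calico's configuration files.
It also reads network settings from the Calico datastore and adjusts the Calico BGP configuration (e.g. updating the routing configuration when a new node joins the cluster).
Checking the Calico components
calicoctl
The output of calicoctl version has a Cluster Type field that shows which setup is in use. The test below shows k8s, bgp routing, kubeadm, and kdd (the Kubernetes API as the Calico datastore).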
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl version
Client Version:    v3.28.1
Git commit:        601856343
Cluster Version:   v3.28.1
Cluster Type:      k8s,bgp,kubeadm,kdd
calico-node daemonset
calico-node runs on every node as a daemonset. 
The IP of each calico-node pod is the same as the host's eth0 IP, because the calico-node pod shares the host's network namespace.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get daemonset -n kube-system
NAME          DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR            AGE
calico-node   4         4         4       4            4           kubernetes.io/os=linux   6m47s
kube-proxy    4         4         4       4            4           kubernetes.io/os=linux   13m

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get pod -n kube-system -l k8s-app=calico-node -o wide
NAME                READY   STATUS    RESTARTS   AGE   IP               NODE     NOMINATED NODE   READINESS GATES
calico-node-q2b9f   1/1     Running   0          16m   192.168.10.10    k8s-m               
calico-node-t2njm   1/1     Running   0          16m   192.168.10.102   k8s-w2              
calico-node-vkmct   1/1     Running   0          16m   192.168.20.100   k8s-w0              
calico-node-vw2wk   1/1     Running   0          16m   192.168.10.101   k8s-w1              

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# ifconfig eth0
eth0: flags=4163  mtu 9001
        inet 192.168.10.10  netmask 255.255.255.0  broadcast 192.168.10.255
        inet6 fe80::6b:2dff:feb1:1e03  prefixlen 64  scopeid 0x20
        ether 02:6b:2d:b1:1e:03  txqueuelen 1000  (Ethernet)
        RX packets 467466  bytes 674909756 (674.9 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 60071  bytes 10813763 (10.8 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
Calico IPAM
Which IPAM Calico is currently using can be checked in the CNI configuration.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# cat /etc/cni/net.d/10-calico.conflist | jq
...
  "plugins": [
    {
      "type": "calico",
      "log_level": "info",
      "log_file_path": "/var/log/calico/cni/cni.log",
      "datastore_type": "kubernetes",
      "nodename": "k8s-m",
      "mtu": 0,
      "ipam": {
        "type": "calico-ipam"
      },
      "policy": {
        "type": "k8s"
      },
...
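Check the pod CIDR of each node, then the Blocks, which are the ranges each node in the cluster holds for assigning IPs to its pods. As the output shows, with calico-ipam the pods receive addresses from these Blocks rather than from the nodes' spec.podCIDR values.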
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl ipam show
+----------+---------------+-----------+------------+--------------+
| GROUPING |     CIDR      | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+---------------+-----------+------------+--------------+
| IP Pool  | 172.16.0.0/16 |     65536 | 7 (0%)     | 65529 (100%) |
+----------+---------------+-----------+------------+--------------+

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get nodes -o jsonpath='{.items[*].spec.podCIDR}' ;echo
172.16.0.0/24 172.16.1.0/24 172.16.3.0/24 172.16.2.0/24

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl ipam show --show-blocks
+----------+-----------------+-----------+------------+--------------+
| GROUPING |      CIDR       | IPS TOTAL | IPS IN USE |   IPS FREE   |
+----------+-----------------+-----------+------------+--------------+
| IP Pool  | 172.16.0.0/16   |     65536 | 7 (0%)     | 65529 (100%) |
| Block    | 172.16.116.0/24 |       256 | 1 (0%)     | 255 (100%)   |
| Block    | 172.16.158.0/24 |       256 | 1 (0%)     | 255 (100%)   |
| Block    | 172.16.184.0/24 |       256 | 1 (0%)     | 255 (100%)   |
| Block    | 172.16.34.0/24  |       256 | 4 (2%)     | 252 (98%)    |
+----------+-----------------+-----------+------------+--------------+
The first address of each Block CIDR shown above is the IP of the tunl0 interface on the node that owns the block.
# k8s-m
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# ifconfig tunl0
tunl0: flags=193  mtu 8981
        inet 172.16.116.0  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# k8s-w0
root@k8s-w0:~# ifconfig tunl0
tunl0: flags=193  mtu 8981
        inet 172.16.34.0  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# k8s-w1
root@k8s-w1:~# ifconfig tunl0
tunl0: flags=193  mtu 8981
        inet 172.16.158.0  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0

# k8s-w2
root@k8s-w2:~# ifconfig tunl0
tunl0: flags=193  mtu 8981
        inet 172.16.184.0  netmask 255.255.255.255
        tunnel   txqueuelen 1000  (IPIP Tunnel)
        RX packets 0  bytes 0 (0.0 B)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 0  bytes 0 (0.0 B)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
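The relationship between the IP pool, the per-node blocks, and the tunl0 addresses can be checked with a few lines of Python. The block-to-node mapping below is read off the `ifconfig tunl0` outputs above; that the tunnel address is the first address of the block is what this cluster shows, not a general guarantee. `tunnel_address` is a hypothetical helper.

```python
import ipaddress

POOL = ipaddress.ip_network("172.16.0.0/16")

# Block per node, taken from the tunl0 addresses shown above.
BLOCKS = {
    "k8s-m": "172.16.116.0/24",
    "k8s-w0": "172.16.34.0/24",
    "k8s-w1": "172.16.158.0/24",
    "k8s-w2": "172.16.184.0/24",
}


def tunnel_address(block: str) -> str:
    """First address of the block, which tunl0 uses in this cluster."""
    return str(ipaddress.ip_network(block).network_address)


if __name__ == "__main__":
    print("pool size:", POOL.num_addresses)
    for node, block in BLOCKS.items():
        net = ipaddress.ip_network(block)
        print(node, block, net.subnet_of(POOL), net.num_addresses, tunnel_address(block))
```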
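Check the remaining settings.
When the StrictAffinity property is false (as it is here), a node that has used up all assignable addresses in its own blocks can borrow addresses from another node's block; setting it to true disables borrowing. Borrowed addresses are listed with the --show-borrowed flag.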
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl ipam show --show-borrowed
+----+----------------+-------+-------------+------+--------------+
| IP | BORROWING-NODE | BLOCK | BLOCK OWNER | TYPE | ALLOCATED-TO |
+----+----------------+-------+-------------+------+--------------+
+----+----------------+-------+-------------+------+--------------+

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl ipam show --show-configuration
+--------------------+-------+
|      PROPERTY      | VALUE |
+--------------------+-------+
| StrictAffinity     | false |
| AutoAllocateBlocks | true  |
| MaxBlocksPerHost   |     0 |
+--------------------+-------+
Node BGP peer
Check which nodes bird has established BGP sessions with. The output lists every node except the local one as a BGP peer.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl node status
Calico process is running.

IPv4 BGP status
+----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+----------------+-------------------+-------+----------+-------------+
| 192.168.20.100 | node-to-node mesh | up    | 03:51:03 | Established |
| 192.168.10.101 | node-to-node mesh | up    | 03:51:04 | Established |
| 192.168.10.102 | node-to-node mesh | up    | 03:51:04 | Established |
+----------------+-------------------+-------+----------+-------------+

IPv6 BGP status
No IPv6 peers found.
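BGP connectivity is verified by bird periodically exchanging messages with each node (peer router) to signal that it is alive (BGP Keepalive).
Let's look at the BGP message types in more detail and monitor them with a packet capture.
BGP Message
The BGP message types are Open, Update, Notification, Keepalive, and Route-Refresh.
Open
The first message exchanged between BGP peers once the TCP connection is established. It mainly announces the sender's own information.
Update
Used between peer routers that have already established a peering to advertise newly available routes or to withdraw previously announced ones.
Notification
Sent when an error occurs or when a peer wants to close the connection. The session is torn down after this message is sent.
Keepalive
Sent in response to a BGP Open message and then periodically to tell the other node (peer router) that the sender is alive. It is a 19-byte BGP message.
Route-Refresh
Asks the other node (peer router) to send its routing information again.
Packet capture
BGP messages use TCP port 179, the port bird listens on.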
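The 19 bytes of a Keepalive are exactly the common BGP header: a 16-byte marker of all ones, a 2-byte length, and a 1-byte type, with no body. The Python sketch below builds and parses that header with the standard `struct` module; the function names are hypothetical and the sketch covers only the header, not the body of the other message types.

```python
import struct

BGP_MARKER = b"\xff" * 16
MESSAGE_TYPES = {
    1: "OPEN",
    2: "UPDATE",
    3: "NOTIFICATION",
    4: "KEEPALIVE",
    5: "ROUTE-REFRESH",
}


def build_keepalive() -> bytes:
    """Marker + length (19) + type (4); a Keepalive has no payload."""
    return BGP_MARKER + struct.pack("!HB", 19, 4)


def parse_header(message: bytes):
    """Return (total length, message type name) from the 19-byte header."""
    marker, length, msg_type = struct.unpack("!16sHB", message[:19])
    if marker != BGP_MARKER:
        raise ValueError("not a BGP message: bad marker")
    return length, MESSAGE_TYPES.get(msg_type, "UNKNOWN")


if __name__ == "__main__":
    msg = build_keepalive()
    print(len(msg), parse_header(msg))
```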
root@k8s-w1:~# netstat -tnlp | grep bird
tcp        0      0 0.0.0.0:179             0.0.0.0:*               LISTEN      3698/bird
The interface used for BGP is eth0.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl node status
Calico process is running.

IPv4 BGP status
+----------------+-------------------+-------+----------+-------------+
|  PEER ADDRESS  |     PEER TYPE     | STATE |  SINCE   |    INFO     |
+----------------+-------------------+-------+----------+-------------+
| 192.168.20.100 | node-to-node mesh | up    | 03:51:03 | Established |
| 192.168.10.101 | node-to-node mesh | up    | 03:51:04 | Established |
| 192.168.10.102 | node-to-node mesh | up    | 03:51:04 | Established |
+----------------+-------------------+-------+----------+-------------+

root@k8s-w1:~# ifconfig eth0
eth0: flags=4163  mtu 9001
        inet 192.168.10.101  netmask 255.255.255.0  broadcast 192.168.10.255
        inet6 fe80::5f:26ff:fe41:132b  prefixlen 64  scopeid 0x20
        ether 02:5f:26:41:13:2b  txqueuelen 1000  (Ethernet)
        RX packets 458793  bytes 667403787 (667.4 MB)
        RX errors 0  dropped 0  overruns 0  frame 0
        TX packets 66191  bytes 5712673 (5.7 MB)
        TX errors 0  dropped 0 overruns 0  carrier 0  collisions 0
Start the capture with the command below, then delete one calico-node pod. While calico-node is deleted and recreated, the BGP Open, Update, and Keepalive messages that set up the BGP session and update the routing information can be observed.
root@k8s-w1:~# tcpdump -i eth0 'tcp port 179' -vvv -w bgp_update_capture.pcap

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get pod -A -o wide | grep calico-node
kube-system   calico-node-7fbzp                          1/1     Running   0          58m   192.168.10.101   k8s-w1              
kube-system   calico-node-nvqnx                          1/1     Running   0          58m   192.168.10.102   k8s-w2              
kube-system   calico-node-wbx5r                          1/1     Running   0          58m   192.168.20.100   k8s-w0              
kube-system   calico-node-xr6vl                          1/1     Running   0          58m   192.168.10.10    k8s-m               
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl delete pod calico-node-nvqnx -n kube-system --force
Captured packets

OPEN Message
The ASN in the OPEN message defaults to 64512 when nothing is configured.
https://docs.tigera.io/calico/latest/reference/resources/bgpconfig#:~:text=true-,asNumber,-The%20default%20local

UPDATE Message

Keepalive Message

ip pool, Calico Mode
Check the IP range used by the cluster and the Calico mode.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl get ippool -o wide
NAME                  CIDR            NAT    IPIPMODE   VXLANMODE   DISABLED   DISABLEBGPEXPORT   SELECTOR
default-ipv4-ippool   172.16.0.0/16   true   Always     Never       false      false              all()
Workload Endpoint
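This lists the pods that received their network setup through Calico. Pods that use the host network namespace are not listed.
For each pod it shows the endpoint information: which node it runs on, which interface it is attached to, and which IP it was assigned.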
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl get wep -A -o wide
NAMESPACE     NAME                                                            WORKLOAD                                   NODE     NETWORKS         INTERFACE         PROFILES                                                  NATS
kube-system   k8s--w0-k8s-calico--kube--controllers--77d59654f4--pjmtr-eth0   calico-kube-controllers-77d59654f4-pjmtr   k8s-w0   172.16.34.1/32   cali2b966b54fdf   kns.kube-system,ksa.kube-system.calico-kube-controllers
kube-system   k8s--w0-k8s-coredns--55cb58b774--5ttdh-eth0                     coredns-55cb58b774-5ttdh                   k8s-w0   172.16.34.3/32   cali8aed9290dbd   kns.kube-system,ksa.kube-system.coredns
kube-system   k8s--w0-k8s-coredns--55cb58b774--7x77t-eth0                     coredns-55cb58b774-7x77t                   k8s-w0   172.16.34.2/32   calic153b0d7173   kns.kube-system,ksa.kube-system.coredns

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get pod -n kube-system -o wide | grep 172
calico-kube-controllers-77d59654f4-pjmtr   1/1     Running   0          78m   172.16.34.1      k8s-w0              
coredns-55cb58b774-5ttdh                   1/1     Running   0          85m   172.16.34.3      k8s-w0              
coredns-55cb58b774-7x77t                   1/1     Running   0          85m   172.16.34.2      k8s-w0              
Checking the Calico processes on a node
The processes that make up calico-node can be seen running on the node.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# ps axf
...
   4063 ?        Ss     0:00  \_ /usr/local/bin/runsvdir -P /etc/service/enabled
   4137 ?        Ss     0:00      \_ runsv node-status-reporter
   4159 ?        Sl     0:00      |   \_ calico-node -status-reporter
   4138 ?        Ss     0:00      \_ runsv bird6
   4339 ?        S      0:00      |   \_ bird6 -R -s /var/run/calico/bird6.ctl -d -c /etc/calico/confd/config/bird6.cfg
   4139 ?        Ss     0:00      \_ runsv bird
   4340 ?        S      0:00      |   \_ bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
   4140 ?        Ss     0:00      \_ runsv felix
   4145 ?        Sl     0:50      |   \_ calico-node -felix
   4141 ?        Ss     0:00      \_ runsv cni
   4149 ?        Sl     0:00      |   \_ calico-node -monitor-token
   4142 ?        Ss     0:00      \_ runsv confd
   4150 ?        Sl     0:00      |   \_ calico-node -confd
   4143 ?        Ss     0:00      \_ runsv monitor-addresses
   4146 ?        Sl     0:00      |   \_ calico-node -monitor-addresses
   4144 ?        Ss     0:00      \_ runsv allocate-tunnel-addrs
   4151 ?        Sl     0:00          \_ calico-node -allocate-tunnel-addrs
felix
felix updates iptables based on the state it watches in the Calico datastore.
Adding a pod while monitoring the iptables rules shows that rules for the cali interface attached to the pod are added.
# Before adding the pod
root@k8s-w0:~# iptables -t filter -S | grep cali | wc -l
117
root@k8s-w0:~# iptables -t nat -S | grep cali | wc -l
15

# After adding the pod
root@k8s-w0:~# iptables -t filter -S | grep cali | wc -l
147
root@k8s-w0:~# iptables -t nat -S | grep cali | wc -l
15

# check the pod workload endpoints (wep)
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# calicoctl get wep -A
NAMESPACE     WORKLOAD                                   NODE     NETWORKS          INTERFACE
default       pod1                                       k8s-w1   172.16.158.1/32   calice0906292e2
default       pod2                                       k8s-w0   172.16.34.4/32    calibd2348b4f67
kube-system   calico-kube-controllers-77d59654f4-pjmtr   k8s-w0   172.16.34.1/32    cali2b966b54fdf
kube-system   coredns-55cb58b774-5ttdh                   k8s-w0   172.16.34.3/32    cali8aed9290dbd
kube-system   coredns-55cb58b774-7x77t                   k8s-w0   172.16.34.2/32    calic153b0d7173

# check the iptables rules added for the pod
root@k8s-w0:~# iptables -v --numeric --table filter --list cali-tw-calibd2348b4f67
Chain cali-tw-calibd2348b4f67 (1 references)
 pkts bytes target     prot opt in     out     source               destination
    0     0 ACCEPT     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:XbY6-cDccVq4BXbw */ ctstate RELATED,ESTABLISHED
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:N4hE4yZtnqCjJbJL */ ctstate INVALID
    0     0 MARK       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:9XJDTASzmA98NbgL */ MARK and 0xfffcffff
    0     0 cali-pri-kns.default  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:B21wAo2MmHxTE20i */
    0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:z0vNGDZb9fZZTnBN */ /* Return if profile accepted */
    0     0 cali-pri-ksa.default.default  all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:Xs37buFlu3zcQDO1 */
    0     0 RETURN     all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:TQSFKOw4Wln6EBF3 */ /* Return if profile accepted */
    0     0 DROP       all  --  *      *       0.0.0.0/0            0.0.0.0/0            /* cali:GlPBajbc7Ht3AzxB */ /* Drop if no profiles matched */
bird
Exec into calico-node and check the BIRD routing information.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl get pod -n kube-system -l k8s-app=calico-node -o name
pod/calico-node-7fbzp
pod/calico-node-klzj6
pod/calico-node-wbx5r
pod/calico-node-x7j6w

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# kubectl exec -it calico-node-7fbzp -n kube-system -- birdcl
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), mount-bpffs (init)
BIRD v0.3.3+birdv1.6.8 ready.
bird> show route
0.0.0.0/0          via 192.168.10.1 on eth0 [kernel1 06:01:14] * (10)
172.16.184.0/24    via 192.168.10.102 on eth0 [Mesh_192_168_10_102 06:01:15] * (100/0) [i]
172.16.158.5/32    dev cali1f7bbbd982a [kernel1 08:10:55] * (10)
192.168.0.2/32     via 192.168.10.1 on eth0 [kernel1 06:01:14] * (10)
192.168.10.0/24    dev eth0 [direct1 06:01:13] * (240)
192.168.10.1/32    dev eth0 [kernel1 06:01:14] * (10)
172.16.158.0/24    blackhole [static1 06:01:13] * (200)
172.16.158.0/32    dev tunl0 [direct1 06:01:13] * (240)
172.16.116.0/24    via 192.168.10.10 on eth0 [Mesh_192_168_10_10 06:01:15] * (100/0) [i]
172.16.34.0/24     via 192.168.10.1 on eth0 [Mesh_192_168_20_100 06:01:15 from 192.168.20.100] * (100/?) [i]

bird> show protocol
name     proto    table    state  since       info
static1  Static   master   up     06:01:13
kernel1  Kernel   master   up     06:01:13
device1  Device   master   up     06:01:13
direct1  Direct   master   up     06:01:13
Mesh_192_168_10_10 BGP      master   up     08:13:47    Established
Mesh_192_168_20_100 BGP      master   up     06:01:15    Established
Mesh_192_168_10_102 BGP      master   up     06:59:50    Established

bird> show route all
0.0.0.0/0          via 192.168.10.1 on eth0 [kernel1 06:01:14] * (10)
    Type: inherit unicast univ
    Kernel.source: 16
    Kernel.metric: 100
    Kernel.prefsrc: 192.168.10.101
172.16.184.0/24    via 192.168.10.102 on eth0 [Mesh_192_168_10_102 06:01:15] * (100/0) [i]
    Type: BGP unicast univ
    BGP.origin: IGP
    BGP.as_path:
    BGP.next_hop: 192.168.10.102
    BGP.local_pref: 100
...

bird> show status
BIRD v0.3.3+birdv1.6.8
Router ID is 192.168.10.101
Current server time is 2024-09-21 12:57:08
Last reboot on 2024-09-21 06:01:13
Last reconfiguration on 2024-09-21 06:01:13
Daemon is up and running
confd
Locate the bird.cfg file that confd generates and check its configuration.
(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# crictl ps | grep calico
8d8b4ac833d53       8bbeb9e1ee328       5 hours ago         Running             calico-node               0                   25ce6edc0464c       calico-node-klzj6

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# ps -ef | grep bird.cfg
root       48060   47880  0 17:13 ?        00:00:01 bird -R -s /var/run/calico/bird.ctl -d -c /etc/calico/confd/config/bird.cfg
root      134271  134145  0 21:48 pts/1    00:00:00 grep bird.cfg

(⎈|kubernetes-admin@kubernetes:N/A) root@k8s-m:~# find / -name bird.cfg
/run/containerd/io.containerd.runtime.v2.task/k8s.io/8d8b4ac833d5307a3ab2ca69313ce5d4e50a8ef0ac2af1c16de6102075e1de22/rootfs/etc/calico/confd/config/bird.cfg
/var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/64/fs/etc/calico/confd/config/bird.cfg





Building a Docker-in-Docker Kubernetes Cluster with Kind

Sat, 07 Sep 2024 16:47:01 GMT
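What is Docker in Docker
Docker in Docker means running Docker inside a container.
Kind applies this idea: each container acts as a node and runs further containers inside it to form a cluster, which makes it easy to create clusters for testing.
Creating a multi control-plane + worker cluster with Kind
1. Create the cluster
After installing Kind, create the cluster with the YAML below.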
# brew install kind

# cat <<EOT > kind.yaml
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: control-plane
- role: control-plane
- role: worker
  extraPortMappings:
  - containerPort: 31000
    hostPort: 31000
  - containerPort: 31001
    hostPort: 31001
EOT

# kind create cluster --config kind.yaml
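Creating the cluster starts containers that act as nodes (control-plane, worker) and a container that acts as a load balancer (external-load-balancer); each node container runs a container runtime (containerd, as crictl shows later) and the Kubernetes components.
On macOS, unless specified otherwise, the cluster's kubeconfig is written to ~/.kube/config, and its API server address is set to 127.0.0.1:{{external-load-balancer port}}.
docker ps shows that the kind-external-load-balancer container publishes a host port and forwards traffic arriving on it to port 6443 inside the container, from where it is passed on to kube-apiserver.
With this API server setting, kubectl can query the details of the created cluster.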
# kubectl get nodes                                                                       
NAME                  STATUS   ROLES           AGE     VERSION
kind-control-plane    Ready    control-plane   6h30m   v1.27.3
kind-control-plane2   Ready    control-plane   6h30m   v1.27.3
kind-control-plane3   Ready    control-plane   6h29m   v1.27.3
kind-worker           Ready    <none>          6h29m   v1.27.3

# docker ps                            
CONTAINER ID   IMAGE                                COMMAND                  CREATED       STATUS       PORTS                                  NAMES
92c0d9e82326   kindest/haproxy:v20230606-42a2262b   "haproxy -W -db -f /…"   7 hours ago   Up 7 hours   127.0.0.1:64936->6443/tcp              kind-external-load-balancer
f783aa78721d   kindest/node:v1.27.3                 "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   127.0.0.1:64933->6443/tcp              kind-control-plane2
251c2a851d6b   kindest/node:v1.27.3                 "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   0.0.0.0:31000-31001->31000-31001/tcp   kind-worker
8c134c320691   kindest/node:v1.27.3                 "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   127.0.0.1:64934->6443/tcp              kind-control-plane3
be1adbe4267b   kindest/node:v1.27.3                 "/usr/local/bin/entr…"   7 hours ago   Up 7 hours   127.0.0.1:64935->6443/tcp              kind-control-plane

# cat ~/.kube/config
- cluster:
    certificate-authority-data: LS0tLS1...
    server: https://127.0.0.1:64936
  name: kind-kind
  ...
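
The link between the kubeconfig server address and the load balancer container can be checked mechanically. The following is a minimal Python sketch, assuming PORTS strings shaped like the ones in the `docker ps` output above; `host_port_for` and `container_behind` are hypothetical helper names, not part of kind or Docker.

```python
import re
from typing import Optional
from urllib.parse import urlparse

# PORTS column values copied from the `docker ps` output above.
PORTS = {
    "kind-external-load-balancer": "127.0.0.1:64936->6443/tcp",
    "kind-control-plane": "127.0.0.1:64935->6443/tcp",
    "kind-control-plane2": "127.0.0.1:64933->6443/tcp",
    "kind-control-plane3": "127.0.0.1:64934->6443/tcp",
}


def host_port_for(mapping: str, container_port: int) -> Optional[int]:
    """Return the host port published for container_port, or None if it does not match."""
    m = re.fullmatch(r"([\d.]+):(\d+)->(\d+)/(tcp|udp)", mapping)
    if m and int(m.group(3)) == container_port:
        return int(m.group(2))
    return None


def container_behind(server_url: str) -> Optional[str]:
    """Find the container whose published 6443 port matches the kubeconfig server URL."""
    port = urlparse(server_url).port
    for name, mapping in PORTS.items():
        if host_port_for(mapping, 6443) == port:
            return name
    return None


# The server value from ~/.kube/config resolves to the HAProxy container,
# not to any single control-plane node.
print(container_behind("https://127.0.0.1:64936"))  # kind-external-load-balancer
```

Each control-plane container also publishes its own 6443 port, so pointing kubectl at 64935 instead would, under the same assumptions, bypass the load balancer and talk to kind-control-plane directly.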


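2. Inspecting the multi control-plane structure
When Kind creates a cluster with several control planes, it starts one more container named external-load-balancer in addition to the node containers.
This container is the load balancer that sits in front of the control-plane nodes' API servers.
Trying to exec into it to inspect its configuration fails with the error below.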
# docker exec -it kind-external-load-balancer /bin/sh
OCI runtime exec failed: exec failed: unable to start container process: exec: "/bin/sh": stat /bin/sh: no such file or directory: unknown
The history of the kindest/haproxy image shows no shell commands or package installation apart from the ENTRYPOINT setting, so the image cannot be used for anything other than its haproxy role (similar to a distroless image).
# docker history kindest/haproxy:v20230606-42a2262b --no-trunc
IMAGE                                                                     CREATED         CREATED BY                                                                    SIZE      COMMENT
sha256:816da8fb6d42bb9e25f1b0a21a7d8c81b5334c53b13fcdc61ef6a9961574d702   15 months ago   ENTRYPOINT ["haproxy" "-W" "-db" "-f" "/usr/local/etc/haproxy/haproxy.cfg"]   0B        buildkit.dockerfile.v0
<missing>                                                                 15 months ago   STOPSIGNAL SIGUSR1                                                            0B        buildkit.dockerfile.v0
<missing>                                                                 15 months ago   COPY haproxy.cfg /usr/local/etc/haproxy/haproxy.cfg # buildkit                916B      buildkit.dockerfile.v0
<missing>                                                                 15 months ago   COPY /opt/stage/ / # buildkit                                                 13MB      buildkit.dockerfile.v0
<missing>                                                                 15 months ago   ARG STAGE_DIR=/opt/stage                                                      0B        buildkit.dockerfile.v0
<missing>                                                                 N/A                                                                                           219kB     
<missing>                                                                 N/A                                                                                           346B      
<missing>                                                                 N/A                                                                                           497B      
<missing>                                                                 N/A                                                                                           0B        
<missing>                                                                 N/A                                                                                           64B       
<missing>                                                                 N/A                                                                                           149B      
<missing>                                                                 N/A                                                                                           1.93MB    
<missing>                                                                 N/A                                                                                           29.4kB    
<missing>                                                                 N/A                                                                                           270kB     
Each layer does the following.

ARG STAGE_DIR=/opt/stage : declares /opt/stage as the staging directory used during the Dockerfile build.
COPY /opt/stage/ / # buildkit : copies the files under STAGE_DIR to / in the container filesystem.
COPY haproxy.cfg /usr/local/etc/haproxy/haproxy.cfg # buildkit : copies the haproxy.cfg configuration into /usr/local/etc/haproxy.
STOPSIGNAL SIGUSR1 : a container is stopped with SIGTERM by default, but STOPSIGNAL changes the stop signal. HAProxy performs a graceful stop when it receives SIGUSR1, so with this setting HAProxy does not quit immediately and instead finishes the existing connections before exiting.
HAProxy supports a graceful and a hard stop. The hard stop is simple, when the SIGTERM signal is sent to the haproxy process, it immediately quits and all established connections are closed. The graceful stop is triggered when the SIGUSR1 signal is sent to the haproxy process. It consists in only unbinding from listening ports, but continue to process existing connections until they close. Once the last connection is closed, the process leaves.
https://cbonte.github.io/haproxy-dconv/2.0/management.html#4


ENTRYPOINT ["haproxy" "-W" "-db" "-f" "/usr/local/etc/haproxy/haproxy.cfg"] : starts HAProxy with the copied haproxy configuration when the container starts.

Enter each control-plane node (which is actually a container) and check /etc/kubernetes/admin.conf.
# docker exec -it kind-control-plane /bin/sh
# cat /etc/kubernetes/admin.conf
apiVersion: v1
clusters:
- cluster:
    certificate-authority-data: LS0tLS...
    server: https://kind-external-load-balancer:6443
  name: kind
admin.conf points at port 6443 of the external-load-balancer container as the API server address.
From the control-plane container, check the DNS lookup for that container name.
# apt update && apt install dnsutils
# nslookup kind-external-load-balancer
Server:         192.168.65.254
Address:        192.168.65.254#53

Non-authoritative answer:
Name:   kind-external-load-balancer
Address: 172.18.0.6
Name:   kind-external-load-balancer
Address: fc00:f853:ccd:e793::6
The control-plane container can resolve the load balancer container's name thanks to the kind network and Docker's embedded DNS server.
List the Docker networks and inspect the containers attached to the kind network.
# docker network ls                         
NETWORK ID     NAME      DRIVER    SCOPE
adcd45d03761   bridge    bridge    local
5729ccff2cc6   host      host      local
8f9eb4754dc5   kind      bridge    local
510f916ad4df   none      null      local

# docker network inspect kind
[
    {
        "Name": "kind",
        "Id": "8f9eb4754dc5ec22fb20fc9a39371217916d86246f7cdd66276a595cdd08b2ce",
        "Created": "2024-09-01T11:15:53.672353221Z",
        "Scope": "local",
        "Driver": "bridge",
        "EnableIPv6": true,
        "IPAM": {
            "Driver": "default",
            "Options": {},
            "Config": [
                {
                    "Subnet": "172.18.0.0/16",
                    "Gateway": "172.18.0.1"
                },
                {
                    "Subnet": "fc00:f853:ccd:e793::/64"
                }
            ]
        },
        "Internal": false,
        "Attachable": false,
        "Ingress": false,
        "ConfigFrom": {
            "Network": ""
        },
        "ConfigOnly": false,
        "Containers": {
            "251c2a851d6b977425ce387c088525ae1b42e5d9c4e9c1c92cf98c6df8eb03e6": {
                "Name": "kind-worker",
                "EndpointID": "9f1e99f5b8fac4cd728c9aad7fa231c6e8d95e5edb57c7627782ed3e55c741b8",
                "MacAddress": "02:42:ac:12:00:02",
                "IPv4Address": "172.18.0.2/16",
                "IPv6Address": "fc00:f853:ccd:e793::2/64"
            },
            "8c134c320691e72191fa0826a696ff013cfd600c41413e01307ef9cc2c8db35f": {
                "Name": "kind-control-plane3",
                "EndpointID": "64cf903b80680ee12eaf6dc79abcd10b265833fe6e4f0d2e87977457e145810d",
                "MacAddress": "02:42:ac:12:00:03",
                "IPv4Address": "172.18.0.3/16",
                "IPv6Address": "fc00:f853:ccd:e793::3/64"
            },
            "92c0d9e82326fad89c789211856ad9dc03da0f9b8f70beca086e291f31e0aac3": {
                "Name": "kind-external-load-balancer",
                "EndpointID": "c28b1bec9d281a9f3011fa2ca0e0ac26c7fc238dcd98c1915a42fe53921924ba",
                "MacAddress": "02:42:ac:12:00:06",
                "IPv4Address": "172.18.0.6/16",
                "IPv6Address": "fc00:f853:ccd:e793::6/64"
            },
            "be1adbe4267bc775bcfa537d26e8cb44ed88fba3794517d67762e40993694dbd": {
                "Name": "kind-control-plane",
                "EndpointID": "3db490ea8a13c39ba0f6f4e6d96607a3a25596e587e67f3a877dc32efdfb17f5",
                "MacAddress": "02:42:ac:12:00:04",
                "IPv4Address": "172.18.0.4/16",
                "IPv6Address": "fc00:f853:ccd:e793::4/64"
            },
            "f783aa78721d2218d28f08ebfe007da951b7e843faca2e453b2c01e7ed45b386": {
                "Name": "kind-control-plane2",
                "EndpointID": "a14df9bcc08714558729021735ade7c22d58723f44842fd652e8f359c077a840",
                "MacAddress": "02:42:ac:12:00:05",
                "IPv4Address": "172.18.0.5/16",
                "IPv6Address": "fc00:f853:ccd:e793::5/64"
            }
        },
        "Options": {
            "com.docker.network.bridge.enable_ip_masquerade": "true",
            "com.docker.network.driver.mtu": "1500"
        },
        "Labels": {}
    }
]
The kind network connects kind-control-plane, kind-control-plane2, kind-control-plane3, kind-worker, and kind-external-load-balancer, and lists the MAC, IPv4, and IPv6 address of each.
Docker's embedded DNS server maps each container name to its container IP.
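
The name-to-address table that the embedded DNS effectively serves can be derived from the inspect output. A minimal Python sketch follows, assuming a trimmed copy of the `docker network inspect kind` JSON shown above (container IDs shortened to 12 characters); `name_table` is a hypothetical helper, and this is only a model of the mapping, not how Docker implements its resolver.

```python
import json

# Trimmed sample of `docker network inspect kind`; values copied from the output above.
INSPECT = json.loads("""
[{
  "Name": "kind",
  "Containers": {
    "251c2a851d6b": {"Name": "kind-worker",
                     "IPv4Address": "172.18.0.2/16",
                     "IPv6Address": "fc00:f853:ccd:e793::2/64"},
    "92c0d9e82326": {"Name": "kind-external-load-balancer",
                     "IPv4Address": "172.18.0.6/16",
                     "IPv6Address": "fc00:f853:ccd:e793::6/64"},
    "be1adbe4267b": {"Name": "kind-control-plane",
                     "IPv4Address": "172.18.0.4/16",
                     "IPv6Address": "fc00:f853:ccd:e793::4/64"}
  }
}]
""")


def name_table(networks):
    """Map each container name to its (IPv4, IPv6) address with the prefix length removed."""
    table = {}
    for net in networks:
        for info in net["Containers"].values():
            v4 = info["IPv4Address"].split("/")[0]
            v6 = info["IPv6Address"].split("/")[0]
            table[info["Name"]] = (v4, v6)
    return table


TABLE = name_table(INSPECT)
# Same two addresses that nslookup returned for the load balancer name.
print(TABLE["kind-external-load-balancer"])  # ('172.18.0.6', 'fc00:f853:ccd:e793::6')
```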
cf) host.docker.internal
/etc/resolv.conf in kind-control-plane has its nameserver set to 192.168.65.254. When Docker Desktop runs on macOS, this is the IP of host.docker.internal, the DNS name a container can use to reach the host machine.
# docker exec -it kind-control-plane /bin/sh
# cat /etc/resolv.conf
nameserver 192.168.65.254
options ndots:0

# nslookup host.docker.internal
Server:         192.168.65.254
Address:        192.168.65.254#53

Non-authoritative answer:
Name:   host.docker.internal
Address: 192.168.65.254

The switch in the container's /etc/resolv.conf from Docker's default embedded DNS, 127.0.0.11, to 192.168.65.254 is done in the entrypoint.
# docker exec -it kind-control-plane /bin/sh
# cat /usr/local/bin/entrypoint
...
enable_network_magic(){
  # well-known docker embedded DNS is at 127.0.0.11:53
  local docker_embedded_dns_ip='127.0.0.11'

  # first we need to detect an IP to use for reaching the docker host
  local docker_host_ip
  docker_host_ip="$( (head -n1 <(timeout 5 getent ahostsv4 'host.docker.internal') | cut -d' ' -f1) || true)"
  # if the ip doesn't exist or is a loopback address use the default gateway
  if [[ -z "${docker_host_ip}" ]] || [[ $docker_host_ip =~ ^127\.[0-9]+\.[0-9]+\.[0-9]+$ ]]; then
    docker_host_ip=$(ip -4 route show default | cut -d' ' -f3)
  fi

  # patch docker's iptables rules to switch out the DNS IP
  iptables-save \
    | sed \
      `# switch docker DNS DNAT rules to our chosen IP` \
      -e "s/-d ${docker_embedded_dns_ip}/-d ${docker_host_ip}/g" \
      `# we need to also apply these rules to non-local traffic (from pods)` \
      -e 's/-A OUTPUT \(.*\) -j DOCKER_OUTPUT/\0\n-A PREROUTING \1 -j DOCKER_OUTPUT/' \
      `# switch docker DNS SNAT rules rules to our chosen IP` \
      -e "s/--to-source :53/--to-source ${docker_host_ip}:53/g"\
      `# nftables incompatibility between 1.8.8 and 1.8.7 omit the --dport flag on DNAT rules` \
      `# ensure --dport on DNS rules, due to https://github.com/kubernetes-sigs/kind/issues/3054` \
      -e "s/p -j DNAT --to-destination ${docker_embedded_dns_ip}/p --dport 53 -j DNAT --to-destination ${docker_embedded_dns_ip}/g" \
    | iptables-restore

  # now we can ensure that DNS is configured to use our IP
  cp /etc/resolv.conf /etc/resolv.conf.original
  replaced="$(sed -e "s/${docker_embedded_dns_ip}/${docker_host_ip}/g" /etc/resolv.conf.original)"
  if [[ "${KIND_DNS_SEARCH+x}" == "" ]]; then
    # No DNS search set, just pass through as is
    echo "$replaced" >/etc/resolv.conf
  elif [[ -z "$KIND_DNS_SEARCH" ]]; then
    # Empty search - remove all current search clauses
    echo "$replaced" | grep -v "^search" >/etc/resolv.conf
  else
    # Search set - remove all current search clauses, and add the configured search
    {
      echo "search $KIND_DNS_SEARCH";
      echo "$replaced" | grep -v "^search";
    } >/etc/resolv.conf
  fi
The code patches the iptables rules, and iptables shows that the patched rules are in place.
Docker's embedded DNS server is at 127.0.0.11. DNS requests sent to 192.168.65.254:53 are DNATed to 127.0.0.11 port 39871 (TCP) or 50966 (UDP), and the DNS replies leaving 127.0.0.11 ports 39871 and 50966 are SNATed to 192.168.65.254:53.
# docker exec -it kind-control-plane /bin/sh

# iptables -t nat -S | grep 192
-A PREROUTING -d 192.168.65.254/32 -j DOCKER_OUTPUT
-A OUTPUT -d 192.168.65.254/32 -j DOCKER_OUTPUT
-A POSTROUTING -d 192.168.65.254/32 -j DOCKER_POSTROUTING
-A DOCKER_OUTPUT -d 192.168.65.254/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:39871
-A DOCKER_OUTPUT -d 192.168.65.254/32 -p udp -m udp --dport 53 -j DNAT --to-destination 127.0.0.11:50966
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 39871 -j SNAT --to-source 192.168.65.254:53
-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p udp -m udp --sport 50966 -j SNAT --to-source 192.168.65.254:53
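
The effect of the entrypoint's sed expressions can be reproduced on a few sample rules. This is a rough Python sketch, not the kind implementation: `BEFORE` is a hypothetical pre-patch rule set reconstructed by reversing the output above, `patch_rules` is an illustrative name, and the fourth expression (the `--dport` fix) is left out because these sample rules already carry `--dport 53`.

```python
import re

DOCKER_DNS = "127.0.0.11"

# Hypothetical rules as Docker might have written them before the patch
# (reconstructed from the patched output above, not captured from a real host).
BEFORE = "\n".join([
    "-A OUTPUT -d 127.0.0.11/32 -j DOCKER_OUTPUT",
    "-A POSTROUTING -d 127.0.0.11/32 -j DOCKER_POSTROUTING",
    "-A DOCKER_OUTPUT -d 127.0.0.11/32 -p tcp -m tcp --dport 53 -j DNAT --to-destination 127.0.0.11:39871",
    "-A DOCKER_POSTROUTING -s 127.0.0.11/32 -p tcp -m tcp --sport 39871 -j SNAT --to-source :53",
])


def patch_rules(rules: str, host_ip: str) -> str:
    """Apply the first three sed substitutions from enable_network_magic."""
    # 1) match traffic addressed to the host IP instead of the embedded DNS IP
    out = rules.replace(f"-d {DOCKER_DNS}", f"-d {host_ip}")
    # 2) duplicate the OUTPUT jump into PREROUTING so non-local (pod) traffic is covered;
    #    \g<0> is the whole match, like \0 in the sed expression
    out = re.sub(r"-A OUTPUT (.*) -j DOCKER_OUTPUT",
                 r"\g<0>\n-A PREROUTING \1 -j DOCKER_OUTPUT", out)
    # 3) make DNS replies appear to come from the host IP
    out = out.replace("--to-source :53", f"--to-source {host_ip}:53")
    return out


PATCHED = patch_rules(BEFORE, "192.168.65.254").splitlines()
for line in PATCHED:
    print(line)
```

The DNAT target (`--to-destination 127.0.0.11:39871`) is left untouched, which is why requests sent to 192.168.65.254 still end up at the embedded DNS server.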
host.docker.internal lets a container reach the host; it is a default setting when Docker Desktop runs on Windows or macOS rather than Linux.
kind-control-plane can be used to confirm that a container can talk to the host directly.
Inside the container, kube-apiserver is reachable at localhost:6443, but connecting to localhost:64936, the port published on the host for port forwarding, naturally fails because nothing listens on that port inside the container.
Connecting to host.docker.internal:64936 succeeds, because the settings described above let the container reach the host.
# docker ps | grep kind 
92c0d9e82326   kindest/haproxy:v20230606-42a2262b   "haproxy -W -db -f /…"   10 hours ago   Up 10 hours   127.0.0.1:64936->6443/tcp              kind-external-load-balancer
...

# docker exec -it kind-control-plane curl -k https://localhost:6443/livez ;echo
ok

# docker exec -it kind-control-plane curl -k https://localhost:64936/livez ;echo
curl: (7) Failed to connect to localhost port 64936: Connection refused

# docker exec -it kind-control-plane curl -k https://host.docker.internal:64936/livez ;echo
ok
3. Additional control-plane information
Network information
root@kind-control-plane:/# ip -br -c -4 addr
lo               UNKNOWN        127.0.0.1/8 
veth1ad4d02b@if4 UP             10.244.0.1/32 
veth935c07f1@if4 UP             10.244.0.1/32 
veth5c6482c3@if4 UP             10.244.0.1/32 
eth0@if28        UP             172.18.0.4/16 

root@kind-control-plane:/# ip -c route
default via 172.18.0.1 dev eth0 
10.244.0.2 dev veth1ad4d02b scope host 
10.244.0.3 dev veth935c07f1 scope host 
10.244.0.4 dev veth5c6482c3 scope host 
10.244.1.0/24 via 172.18.0.5 dev eth0 
10.244.2.0/24 via 172.18.0.3 dev eth0 
10.244.3.0/24 via 172.18.0.2 dev eth0 
172.18.0.0/16 dev eth0 proto kernel scope link src 172.18.0.4 
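
How this routing table sends Pod traffic to the right node can be illustrated with a longest-prefix-match lookup. The sketch below is plain Python using the standard `ipaddress` module; the route list is copied from the output above (host routes written as /32), and `lookup` is a hypothetical helper that mimics, in simplified form, what the kernel does.

```python
import ipaddress

# (destination, next hop / device) pairs copied from the `ip -c route` output above.
ROUTES = [
    ("0.0.0.0/0", "via 172.18.0.1 dev eth0"),
    ("10.244.0.2/32", "dev veth1ad4d02b"),
    ("10.244.0.3/32", "dev veth935c07f1"),
    ("10.244.0.4/32", "dev veth5c6482c3"),
    ("10.244.1.0/24", "via 172.18.0.5 dev eth0"),
    ("10.244.2.0/24", "via 172.18.0.3 dev eth0"),
    ("10.244.3.0/24", "via 172.18.0.2 dev eth0"),
    ("172.18.0.0/16", "dev eth0"),
]


def lookup(dst: str) -> str:
    """Return the route action for dst, choosing the most specific matching prefix."""
    addr = ipaddress.ip_address(dst)
    matches = [
        (ipaddress.ip_network(net), action)
        for net, action in ROUTES
        if addr in ipaddress.ip_network(net)
    ]
    # The default route 0.0.0.0/0 always matches, so matches is never empty.
    _, action = max(matches, key=lambda m: m[0].prefixlen)
    return action


print(lookup("10.244.0.2"))  # dev veth1ad4d02b        (local Pod, via its veth)
print(lookup("10.244.3.3"))  # via 172.18.0.2 dev eth0 (Pod CIDR of kind-worker)
print(lookup("8.8.8.8"))     # via 172.18.0.1 dev eth0 (default route)
```

172.18.0.2 is the address of kind-worker in the kind network shown earlier, so traffic for 10.244.3.0/24 is handed to the worker node container.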
Processes -> PID 1 is /sbin/init.
root@kind-control-plane:/# ps -ef
UID          PID    PPID  C STIME TTY          TIME CMD
root           1       0  0 08:42 ?        00:00:07 /sbin/init
Checking the DinD containers
root@kind-control-plane:/# crictl version
Version:  0.1.0
RuntimeName:  containerd
RuntimeVersion:  v1.7.1
RuntimeApiVersion:  v1

root@kind-control-plane:/# crictl ps
CONTAINER           IMAGE               CREATED             STATE               NAME                      ATTEMPT             POD ID              POD
3441550cbf2dc       6234a065dec4c       11 hours ago        Running             kube-scheduler            1                   71cbab0fc2502       kube-scheduler-kind-control-plane
1c38faa646c43       aea4f169db16d       11 hours ago        Running             kube-controller-manager   1                   b66a90e7e4e32       kube-controller-manager-kind-control-plane
f2877ad64ab44       97e04611ad434       11 hours ago        Running             coredns                   0                   d9b94fd54fa61       coredns-5d78c9869d-f2k4x
bf6cc889eae28       97e04611ad434       11 hours ago        Running             coredns                   0                   cf2988f63fe8d       coredns-5d78c9869d-bq4s2
1e7cb2240f203       eec7db0a07d0d       11 hours ago        Running             local-path-provisioner    0                   a2afb342ffb08       local-path-provisioner-6bc4bddd6b-2nwbx
7ab97a2daa31f       b18bf71b941ba       11 hours ago        Running             kindnet-cni               0                   d98e993387c86       kindnet-29rgk
dd048a591c6d7       278dd40f83dfb       11 hours ago        Running             kube-proxy                0                   cfb9fd3d19f32       kube-proxy-dqc4t
6824a0ee4ea01       24bc64e911039       11 hours ago        Running             etcd                      0                   658f527c26814       etcd-kind-control-plane
e1820cead3fd8       634c53edb5c14       11 hours ago        Running             kube-apiserver            0                   af772ba71196f       kube-apiserver-kind-control-plane
4. Additional worker node information
The cluster was created with hostPort to containerPort mappings on the worker node.
# docker port kind-worker
31000/tcp -> 0.0.0.0:31000
31001/tcp -> 0.0.0.0:31001
Deploy an nginx Deployment and Service to check the port mapping.
# cat <...
...
...        80:31001/TCP   38s

NAME                      ENDPOINTS                     AGE
endpoints/deploy-websrv   10.244.3.3:80,10.244.3.4:80   38s

# curl -s localhost:31001 | grep -o "<title>.*</title>"
<title>Welcome to nginx!</title>



Container Isolation
Sat, 31 Aug 2024 14:00:11 GMT
1. Process isolation with chroot
https://velog.io/@_gyullbb/1-1.-%EC%BB%A8%ED%85%8C%EC%9D%B4%EB%84%88-%EA%B2%A9%EB%A6%AC
2. Process isolation with pivot_root + mnt
https://velog.io/@_gyullbb/1-2.-%EC%BB%A8%ED%85%8C%EC%9D%B4%EB%84%88-%EA%B2%A9%EB%A6%ACpivotroot
3. overlay filesystem - solving image duplication
When the required environment is packaged as an image, building a complete image for every slightly different environment leads to wasted storage, slower distribution and transfer, more things to manage, and greater security exposure.
Containers avoid this duplication by stacking layers with the overlay filesystem.
The overlay filesystem is structured as follows.

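Lower Dir(RO) : the read-only layer, corresponding to the "image" pulled from an image registry
Upper Dir(RW) : handles writes of new files and changes to Lower Dir files. When a Lower Dir file is changed, the file is first copied to Upper Dir and then modified there (CoW, Copy-On-Write)
Merged View : the directory where the overlay filesystem is mounted, providing a unified view of Lower Dir and Upper Dir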
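
The three roles can be modeled in a few lines of Python. This is only a conceptual sketch with dictionaries standing in for directories: the `Overlay` class, its method names, and the `WHITEOUT` sentinel are illustrative assumptions, and the real overlayfs works on inodes and uses a character device as the whiteout marker.

```python
WHITEOUT = object()  # stand-in for the whiteout marker that hides a lower file


class Overlay:
    """Toy model of the merged view: one writable upper dict over read-only lower dicts."""

    def __init__(self, lowers):
        self.lowers = lowers  # list of dicts, topmost layer first (like lowerdir=a:b)
        self.upper = {}       # writable layer

    def read(self, path):
        # The upper layer wins; a whiteout there hides any lower copy.
        if path in self.upper:
            if self.upper[path] is WHITEOUT:
                raise FileNotFoundError(path)
            return self.upper[path]
        for layer in self.lowers:
            if path in layer:
                return layer[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        # Copy-on-write, simplified: the new content lands in upper, lower is untouched.
        self.upper[path] = data

    def remove(self, path):
        self.read(path)  # raises FileNotFoundError if the file is not visible
        if any(path in layer for layer in self.lowers):
            self.upper[path] = WHITEOUT  # lower copy stays, but is hidden
        else:
            del self.upper[path]


lowdir1 = {"originalfile1": ""}
lowdir2 = {"bin/rm": "<binary>"}
ov = Overlay([lowdir2, lowdir1])

ov.write("originalfile1", "change1")   # copy-up into upper
lowdir1["originalfile1"] = "change2"   # editing the lower layer no longer shows through
print(ov.read("originalfile1"))        # change1

ov.remove("originalfile1")             # whiteout in upper; lowdir1 still holds "change2"
print("originalfile1" in lowdir1)      # True
```

The file and layer names follow the walkthrough in this post, so the printed values can be compared with the shell results there.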

Each property is verified with the examples below.
3-1. Setting up lowdir, upperdir, and the merged view
lowdir1 setup (/bin/{bash, ls, mkdir, mount, ps, sh})
Set it up so that the bash, ls, mkdir, mount, ps, and sh commands work, and additionally create a file named originalfile1.
# ldd /bin/bash
    linux-vdso.so.1 (0x00007ffe8f5ed000)
    libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f56fd30e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f56fd000000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f56fd49d000)
# ldd /bin/ls
    linux-vdso.so.1 (0x00007ffc1f5fa000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f4a8c2b3000)
    libcap.so.2 => /lib64/libcap.so.2 (0x00007f4a8c2a9000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f4a8c000000)
    libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f4a8c20d000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f4a8c30a000)
# ldd /bin/mkdir
    linux-vdso.so.1 (0x00007ffe93d6d000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f949bc09000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f949b800000)
    libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f949bb6d000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f949bc4e000)
# ldd /bin/mount
    linux-vdso.so.1 (0x00007ffe91d3f000)
    libmount.so.1 => /lib64/libmount.so.1 (0x00007f368692a000)
    libselinux.so.1 => /lib64/libselinux.so.1 (0x00007f36868fd000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f3686600000)
    libblkid.so.1 => /lib64/libblkid.so.1 (0x00007f36868c5000)
    libpcre2-8.so.0 => /lib64/libpcre2-8.so.0 (0x00007f3686829000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f3686982000)
# ldd /bin/ps
    linux-vdso.so.1 (0x00007ffff37ae000)
    libprocps.so.8 => /lib64/libprocps.so.8 (0x00007fd08e8ac000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fd08e600000)
    libsystemd.so.0 => /lib64/libsystemd.so.0 (0x00007fd08e523000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fd08e930000)
    libcap.so.2 => /lib64/libcap.so.2 (0x00007fd08e8a2000)
    libgcrypt.so.20 => /lib64/libgcrypt.so.20 (0x00007fd08e3ea000)
    liblzma.so.5 => /lib64/liblzma.so.5 (0x00007fd08e876000)
    libzstd.so.1 => /lib64/libzstd.so.1 (0x00007fd08e313000)
    liblz4.so.1 => /lib64/liblz4.so.1 (0x00007fd08e850000)
    libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007fd08e835000)
    libgpg-error.so.0 => /lib64/libgpg-error.so.0 (0x00007fd08e80f000)
# ldd /bin/sh
    linux-vdso.so.1 (0x00007ffc153f6000)
    libtinfo.so.6 => /lib64/libtinfo.so.6 (0x00007f04ba5a4000)
    libc.so.6 => /lib64/libc.so.6 (0x00007f04ba200000)
    /lib64/ld-linux-x86-64.so.2 (0x00007f04ba733000)

# mkdir -p /tmp/lowdir1
# mkdir -p /tmp/lowdir1/{bin,lib64}
# cp /lib64/{libtinfo.so.6,libc.so.6,ld-linux-x86-64.so.2,libselinux.so.1,libcap.so.2,libpcre2-8.so.0,libmount.so.1,libblkid.so.1,libprocps.so.8,libsystemd.so.0,libgcrypt.so.20,liblzma.so.5,libzstd.so.1,liblz4.so.1,libgcc_s.so.1,libgpg-error.so.0} /tmp/lowdir1/lib64/
# cp /bin/{bash,ls,mkdir,mount,ps,sh} /tmp/lowdir1/bin/
# mkdir -p /tmp/lowdir1/proc
# mount -t proc proc /tmp/lowdir1/proc
# touch /tmp/lowdir1/originalfile1

# tree -L 2 /tmp/lowdir1/
/tmp/lowdir1/
├── bin
│   ├── bash
│   ├── ls
│   ├── mkdir
│   ├── mount
│   ├── ps
│   └── sh
├── lib64
│   ├── ld-linux-x86-64.so.2
│   ├── libblkid.so.1
│   ├── libc.so.6
│   ├── libcap.so.2
│   ├── libgcc_s.so.1
│   ├── libgcrypt.so.20
│   ├── libgpg-error.so.0
│   ├── liblz4.so.1
│   ├── liblzma.so.5
│   ├── libmount.so.1
│   ├── libpcre2-8.so.0
│   ├── libprocps.so.8
│   ├── libselinux.so.1
│   ├── libsystemd.so.0
│   ├── libtinfo.so.6
│   └── libzstd.so.1
├── originalfile1
└── proc
    ├── 1
lowdir2 setup (/bin/rm)
# ldd /bin/rm
    linux-vdso.so.1 (0x00007ffce975e000)
    libc.so.6 => /lib64/libc.so.6 (0x00007fa4d1400000)
    /lib64/ld-linux-x86-64.so.2 (0x00007fa4d1808000)
# mkdir -p /tmp/lowdir2
# mkdir -p /tmp/lowdir2/{bin,lib64}
# cp /lib64/{libc.so.6,ld-linux-x86-64.so.2} /tmp/lowdir2/lib64/
# cp /bin/rm /tmp/lowdir2/bin/

# tree -L 2 /tmp/lowdir2
/tmp/lowdir2
├── bin
│   └── rm
└── lib64
    ├── ld-linux-x86-64.so.2
    └── libc.so.6
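upperdir & merged view setup
The mount options specify lowerdir, upperdir, and workdir, and the result is merged at a chosen directory.

lowerdir can list several image layers, separated by colons (:) and written in order from the top layer to the bottom layer.

After mounting, the filesystems of lowdir1 and lowdir2 appear combined under rootfs/merge. The mount command uses relative paths, so it assumes the current directory is /tmp.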
# mkdir -p /tmp/rootfs/{container,work,merge}
# tree /tmp/rootfs
/tmp/rootfs
├── container
├── merge
└── work
3 directories, 0 files

# mount -t overlay overlay -o lowerdir=lowdir2:lowdir1,upperdir=rootfs/container,workdir=rootfs/work rootfs/merge
# tree /tmp/rootfs
/tmp/rootfs
├── container
├── merge
│   ├── bin
│   │   ├── bash
│   │   ├── ls
│   │   ├── mkdir
│   │   ├── mount
│   │   ├── ps
│   │   ├── rm
│   │   └── sh
│   ├── lib64
│   │   ├── ld-linux-x86-64.so.2
│   │   ├── libblkid.so.1
│   │   ├── libc.so.6
│   │   ├── libcap.so.2
│   │   ├── libgcc_s.so.1
│   │   ├── libgcrypt.so.20
│   │   ├── libgpg-error.so.0
│   │   ├── liblz4.so.1
│   │   ├── liblzma.so.5
│   │   ├── libmount.so.1
│   │   ├── libpcre2-8.so.0
│   │   ├── libprocps.so.8
│   │   ├── libselinux.so.1
│   │   ├── libsystemd.so.0
│   │   ├── libtinfo.so.6
│   │   └── libzstd.so.1
│   ├── originalfile1
│   └── proc
└── work
    └── work
3-2. Verifying the structural properties of the overlay filesystem
Modifying a file in the merged view
Modify originalfile1, which appears in merge because of lowdir1.

originalfile1 now appears in upperdir.
From this point merge shows upperdir's originalfile1 (a copy of lowdir1's originalfile1), so modifying originalfile1 in lowdir1 no longer changes what merge shows; only modifying upperdir's originalfile1 changes it.
# echo "change1" > /tmp/rootfs/merge/originalfile1
# tree /tmp/rootfs/
/tmp/rootfs/
├── container
│   └── originalfile1
├── merge
│   ├── bin
│   │   ├── bash
│   │   ├── ls
│   │   ├── mkdir
│   │   ├── mount
│   │   ├── ps
│   │   ├── rm
│   │   └── sh
│   ├── lib64
│   │   ├── ld-linux-x86-64.so.2
│   │   ├── libblkid.so.1
│   │   ├── libc.so.6
│   │   ├── libcap.so.2
│   │   ├── libgcc_s.so.1
│   │   ├── libgcrypt.so.20
│   │   ├── libgpg-error.so.0
│   │   ├── liblz4.so.1
│   │   ├── liblzma.so.5
│   │   ├── libmount.so.1
│   │   ├── libpcre2-8.so.0
│   │   ├── libprocps.so.8
│   │   ├── libselinux.so.1
│   │   ├── libsystemd.so.0
│   │   ├── libtinfo.so.6
│   │   └── libzstd.so.1
│   ├── originalfile1
│   └── proc
└── work
    └── work

7 directories, 25 files
# cat /tmp/rootfs/merge/originalfile1
change1

# echo "change2" > /tmp/lowdir1/originalfile1
# cat /tmp/rootfs/merge/originalfile1
change1

# echo "change3" > /tmp/rootfs/container/originalfile1
# cat /tmp/rootfs/merge/originalfile1
change3
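Deleting a file in the merged view
Deleting originalfile1 in merge removes it from the merged view, and originalfile1 under the container directory (the upperdir) is replaced by a deletion marker, a whiteout.

The originalfile1 in lowdir1 is not actually deleted. Instead, the copy in upperdir, that is, originalfile1 under the container directory, becomes a whiteout (a character special file, as the rm prompt further below shows), which hides the file from the merged view.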
# rm /tmp/rootfs/merge/originalfile1
rm: remove regular file '/tmp/rootfs/merge/originalfile1'? y

# tree /tmp/rootfs
/tmp/rootfs
├── container
│   └── originalfile1
├── merge
│   ├── bin
│   │   ├── bash
│   │   ├── ls
│   │   ├── mkdir
│   │   ├── mount
│   │   ├── ps
│   │   ├── rm
│   │   └── sh
│   ├── lib64
│   │   ├── ld-linux-x86-64.so.2
│   │   ├── libblkid.so.1
│   │   ├── libc.so.6
│   │   ├── libcap.so.2
│   │   ├── libgcc_s.so.1
│   │   ├── libgcrypt.so.20
│   │   ├── libgpg-error.so.0
│   │   ├── liblz4.so.1
│   │   ├── liblzma.so.5
│   │   ├── libmount.so.1
│   │   ├── libpcre2-8.so.0
│   │   ├── libprocps.so.8
│   │   ├── libselinux.so.1
│   │   ├── libsystemd.so.0
│   │   ├── libtinfo.so.6
│   │   └── libzstd.so.1
│   └── proc
└── work
    └── work
        └── #5

# tree -L 1 /tmp/lowdir1
/tmp/lowdir1
├── bin
├── lib64
├── originalfile1
└── proc

3 directories, 1 file

# cat /tmp/lowdir1/originalfile1
change2
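
The copy-up and whiteout behavior observed above can be sketched as a small in-memory model. This is only an illustration of the lookup order, not how the kernel implements overlayfs; the class name `OverlayModel` and the initial file content `"original"` are assumptions made for the example.

```python
class OverlayModel:
    """Toy model of overlay lookup order (upper first, then lower layers)."""

    WHITEOUT = object()  # stands in for the whiteout marker left in upperdir

    def __init__(self, lowers):
        # lowers: list of dicts (name -> content), highest-priority layer first
        self.lowers = lowers
        self.upper = {}

    def read(self, name):
        # An entry in upper always wins, including a whiteout.
        if name in self.upper:
            entry = self.upper[name]
            if entry is self.WHITEOUT:
                raise FileNotFoundError(name)
            return entry
        for layer in self.lowers:
            if name in layer:
                return layer[name]
        raise FileNotFoundError(name)

    def write(self, name, data):
        # Writing through the merged view lands in upper (copy-up);
        # the lower layers are never touched.
        self.upper[name] = data

    def remove(self, name):
        self.read(name)  # raises if the file is not visible
        self.upper[name] = self.WHITEOUT  # hide it instead of deleting lower


lowdir1 = {"originalfile1": "original"}  # hypothetical initial content
fs = OverlayModel([lowdir1])

fs.write("originalfile1", "change1")   # copy-up into upper
lowdir1["originalfile1"] = "change2"   # editing lower is now invisible
print(fs.read("originalfile1"))        # still the upper copy

fs.remove("originalfile1")             # leaves a whiteout in upper
print(lowdir1["originalfile1"])        # lower copy is untouched
```

The model also reproduces the last test in this section: dropping the whiteout from `upper` makes the lower copy visible again.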
Modifying a file in lowdir
In this test, modifying a lowdir file that has no counterpart in upperdir shows up in the merged view as well.

A lowdir is not supposed to be modified. In this test, however, lowdir lives on the host's root filesystem, so when a user edits it directly, the merged view returns the modified lowdir file.
# rm container/originalfile1
rm: remove character special file 'container/originalfile1'? y
# umount /tmp/rootfs/merge/

# mount -t overlay overlay -o lowerdir=lowdir2:lowdir1,upperdir=rootfs/container,workdir=rootfs/work rootfs/merge

# cat /tmp/rootfs/merge/originalfile1
change2

# echo "change4" > /tmp/lowdir1/originalfile1
# cat /tmp/rootfs/merge/originalfile1
change4
4. Isolation tests for each type of Linux namespace
The namespace types available on Linux are listed below.
Of these, the time namespace is available only on Linux kernel 5.6 and later.



Namespace	Flag	Page	Isolates
Cgroup	CLONE_NEWCGROUP	cgroup_namespaces(7)	Cgroup root directory
IPC	CLONE_NEWIPC	ipc_namespaces(7)	System V IPC, POSIX message queues
Network	CLONE_NEWNET	network_namespaces(7)	Network devices, stacks, ports, etc.
Mount	CLONE_NEWNS	mount_namespaces(7)	Mount points
PID	CLONE_NEWPID	pid_namespaces(7)	Process IDs
Time	CLONE_NEWTIME	time_namespaces(7)	Boot and monotonic clocks
User	CLONE_NEWUSER	user_namespaces(7)	User and group IDs
UTS	CLONE_NEWUTS	uts_namespaces(7)	Hostname and NIS domain name


Example:
# lsns -p $$

        NS TYPE   NPROCS PID USER COMMAND
4026531834 time      126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531838 uts       123   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       126   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531841 mnt       116   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
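Among the namespace APIs, the unshare system call moves a process into a new namespace.
The sections below run an isolation test for each namespace.
4-1. mnt
Isolates the filesystem mount points seen by a process.
Verifying mnt namespace isolation with unshare
Terminal 1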
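
The namespace IDs that lsns prints are the inode numbers behind the symlinks in /proc/<pid>/ns. As a rough sketch (Linux only), the same information can be read directly; the function names `parse_ns_link` and `namespaces_of` are made up for this example.

```python
import os
import re

# A namespace symlink target looks like "mnt:[4026531841]".
NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(link):
    """Split a /proc/<pid>/ns symlink target into (type, inode)."""
    match = NS_LINK.match(link)
    if match is None:
        raise ValueError(f"unexpected namespace link: {link!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid="self"):
    """Return {entry name: namespace inode} for a process (Linux only)."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # skip entries we cannot read or parse
        result[name] = inode
    return result


if __name__ == "__main__" and os.path.isdir("/proc/self/ns"):
    for name, inode in namespaces_of().items():
        print(f"{inode} {name}")
```

Two processes are in the same namespace of a given type exactly when these inode numbers match, which is the comparison every test below relies on.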
# lsns -t mnt -p 1
        NS TYPE NPROCS PID USER COMMAND
4026531841 mnt     123   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
# lsns -t mnt -p $$
        NS TYPE NPROCS PID USER COMMAND
4026531841 mnt     123   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31

# unshare -m
# lsns -p 1
        NS TYPE   NPROCS PID USER COMMAND
4026531834 time      135   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    135   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       135   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      135   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531838 uts       132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       135   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531841 mnt       122   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
# lsns -p $$
        NS TYPE   NPROCS   PID USER COMMAND
4026531834 time      135     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    135     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       135     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      135     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531838 uts       132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       135     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       134     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026532223 mnt         2  2677 root -bash



Terminal 2
Confirming that the mnt namespace is separate
# lsns -p $$
        NS TYPE   NPROCS PID USER COMMAND
4026531834 time      134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531838 uts       131   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       134   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531841 mnt       123   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31

# ps -ef | grep 2677
root        2677    2604  0 03:50 pts/0    00:00:00 -bash
root        2705    2646  0 03:53 pts/1    00:00:00 grep --color=auto 2677
4-2. uts
Isolates the hostname and the NIS (Network Information Service) domain name of a process.
*NIS domain name: the name of a network domain whose members share network resources such as hostnames and usernames.


Terminal 1
# unshare -u
# lsns -p 1
        NS TYPE   NPROCS PID USER COMMAND
4026531834 time      132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531838 uts       127   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       132   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       131   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531841 mnt       121   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
# lsns -p $$
        NS TYPE   NPROCS   PID USER COMMAND
4026531834 time      132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531835 cgroup    132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531836 pid       132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531837 user      132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531839 ipc       132     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531840 net       131     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026531841 mnt       121     1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 31
4026532223 uts         2  2747 root -bash

# hostname bgr
# hostname
bgr

# nisdomainname
nisdomainname: Local domain name not set

# nisdomainname superbgr
# nisdomainname
superbgr



Terminal 2
# hostname
bgrprac.novalocal

# nisdomainname
nisdomainname: Local domain name not set
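4-3. ipc
IPC is the communication that takes place when independent processes share resources with each other. The ipc namespace allows these communication resources to be separated and managed per namespace.
Inter-process communication works in one of two ways: message passing or shared memory.
Message passing

Data is exchanged through a shared area inside the kernel.
Because the kernel (the operating system) is involved in every message, the dependency on the kernel is higher and throughput can be lower.
On the other hand, the kernel manages message delivery, so no synchronization problem arises.
Once a process places a resource in the kernel's message queue, it reaches the other process through either direct or indirect communication.
Direct communication: process A places resource a in the kernel message queue -> the kernel delivers resource a to process B (system calls: msgsnd(), msgrcv())
Indirect communication: process A places resource a in the kernel message queue -> the kernel alerts process B to read the resource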



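Shared memory

Processes share part of their address space.
The operating system provides system calls for shared memory, so different processes can share a portion of their address space.
A process asks the kernel for shared memory through a system call -> the memory region is allocated to that process -> other processes can access the region
Because a process reaches the shared memory directly, without an intermediary, this method is fast.
However, since the kernel does not manage every access, data consistency (synchronization) problems can occur. -> A semaphore is required.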
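
As a minimal sketch of the shared memory method, the snippet below creates a segment and attaches to it a second time by name. It assumes Python 3.8+ and uses `multiprocessing.shared_memory`, which is backed by POSIX shared memory rather than the System V segments that `ipcs -m` lists in the tests below; the second handle stands in for what a second process would do. The helper name `share_bytes` is made up for the example.

```python
from multiprocessing import shared_memory


def share_bytes(payload: bytes) -> bytes:
    """Write payload into a new segment and read it back via a second handle.

    payload must be non-empty, because a zero-sized segment is rejected.
    """
    writer = shared_memory.SharedMemory(create=True, size=len(payload))
    try:
        writer.buf[:len(payload)] = payload
        # A second process would attach the same way, using only the name.
        reader = shared_memory.SharedMemory(name=writer.name)
        try:
            return bytes(reader.buf[:len(payload)])
        finally:
            reader.close()
    finally:
        writer.close()
        writer.unlink()  # remove the segment once nobody needs it


if __name__ == "__main__":
    print(share_bytes(b"hello from shared memory"))
```

Nothing in this sketch coordinates concurrent writers. That is exactly the synchronization gap described above, which real code would close with a semaphore or lock.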

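Docker ipc modes
Docker's --ipc modes include private, shareable, and host (as well as none and container:<name>).
With shareable mode, a container's IPC namespace, including its shared memory, can be shared with other containers.
With host mode, the container accesses the host's IPC resources directly. This option is undesirable from an operational standpoint, because the container may end up using more resources than were allocated to it.
The shareable and host modes are used below to check how the ipc namespace is isolated.
(1) ipc mode: shareable
Terminal 1
Create a shared memory segment inside the Docker container with ipcmk.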
# docker run --rm --name test1 --ipc=shareable -it ubuntu bash
root@8ada959737a1:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status

root@8ada959737a1:/# ipcmk -M 2000
Shared memory id: 0
root@8ada959737a1:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x6094dcfd 0          root       644        2000       0

root@8ada959737a1:/# lsns -p $$
        NS TYPE   NPROCS PID USER COMMAND
4026531834 time        2   1 root bash
4026531835 cgroup      2   1 root bash
4026531837 user        2   1 root bash
4026532719 mnt         2   1 root bash
4026532720 uts         2   1 root bash
4026532721 ipc         2   1 root bash
4026532722 pid         2   1 root bash
4026532724 net         2   1 root bash
root@8ada959737a1:/# lsns -t ipc -p $$
        NS TYPE NPROCS PID USER COMMAND
4026532721 ipc       2   1 root bash



Terminal 2
Because test1 runs in shareable mode, the shared memory of the test1 container is accessible from test2.

Both containers belong to the same ipc namespace.
# docker run --rm --name test2 --ipc=container:test1 -it ubuntu bash
root@055126df0f88:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x6094dcfd 0          root       644        2000       0

root@055126df0f88:/# lsns -p $$
        NS TYPE   NPROCS PID USER COMMAND
4026531834 time        2   1 root bash
4026531835 cgroup      2   1 root bash
4026531837 user        2   1 root bash
4026532721 ipc         2   1 root bash
4026532794 mnt         2   1 root bash
4026532795 uts         2   1 root bash
4026532796 pid         2   1 root bash
4026532798 net         2   1 root bash
root@055126df0f88:/# lsns -t ipc -p $$
        NS TYPE NPROCS PID USER COMMAND
4026532721 ipc       2   1 root bash



(2) ipc mode: host
Terminal 1
# docker run --rm --name test3 --ipc=host -it ubuntu bash
root@11d0e13b0220:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status



Terminal 2
Run on the host VM
# ipcmk -M 2000
Shared memory id: 4
# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x681930da 4          root       644        2000       0

# lsns -t ipc -p $$
        NS TYPE NPROCS PID USER COMMAND
4026531839 ipc     189   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16




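Terminal 1
With --ipc=host, the test3 container does not get an IPC namespace of its own: lsns reports the same namespace ID as the host VM (4026531839). This is why the container can see the shared memory segment created on the host VM.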
root@11d0e13b0220:/# ipcs -m

------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x681930da 4          root       644        2000       0

root@11d0e13b0220:/# lsns -t ipc -p $$
        NS TYPE NPROCS PID USER COMMAND
4026531839 ipc       2   1 root bash
4-4. pid
The PID number space can be isolated with unshare.


Terminal 1
# echo $$
3156706

# unshare -fp --mount-proc /bin/sh
sh-4.4# echo $$
1
sh-4.4# lsns -t pid -p 1
        NS TYPE NPROCS PID USER COMMAND
4026532718 pid       2   1 root /bin/sh



Terminal 2
The isolated pid namespace can be inspected from the host VM.
# lsns -t pid -p 1
        NS TYPE NPROCS PID USER COMMAND
4026531836 pid     188   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16

# ps aux | grep '/bin/sh'
root        2566  0.0  0.0   2316  1720 ?        Ss   Aug28   0:00 /bin/sh /usr/bin/nginx-proxy CP_HOSTS=172.21.27.183,172.21.26.91,172.21.27.194
root     3159289  0.0  0.0 217084   932 pts/0    S    10:22   0:00 unshare -fp --mount-proc /bin/sh
root     3159290  0.0  0.0 224784  3636 pts/0    S+   10:22   0:00 /bin/sh
root     3160579  0.0  0.0 221912  1136 pts/1    S+   10:24   0:00 grep --color=auto /bin/sh

# lsns -t pid -p  3159290
        NS TYPE NPROCS     PID USER COMMAND
4026532718 pid       1 3159290 root /bin/sh
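
The same process has two PIDs here: 3159290 on the host and 1 inside the new namespace. On kernels that expose it (4.1 and later), the `NSpid` line in /proc/<pid>/status lists both values, outermost namespace first. The sketch below only parses that line; the sample text is an illustration built from the PIDs seen above, not captured output.

```python
def parse_nspid(status_text):
    """Return the PIDs from the NSpid line, outermost namespace first."""
    for line in status_text.splitlines():
        if line.startswith("NSpid:"):
            return [int(field) for field in line.split()[1:]]
    return []  # older kernels do not provide the field


# Illustrative excerpt of /proc/3159290/status as seen from the host.
sample_status = "Name:\tsh\nPid:\t3159290\nNSpid:\t3159290\t1\n"

if __name__ == "__main__":
    host_pid, inner_pid = parse_nspid(sample_status)
    print(f"host PID {host_pid} is PID {inner_pid} inside the namespace")
```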
Terminating the isolated process from the host VM
Terminal 1
# unshare -fp --mount-proc /bin/sh
sh-4.4# echo $$
1
sh-4.4# lsns -t pid -p 1
        NS TYPE NPROCS PID USER COMMAND
4026532718 pid       2   1 root /bin/sh
sh-4.4# sleep 10000
Killed
sh-4.4# Killed



Terminal 2
# kill -SIGKILL $(pgrep sleep)

# ps aux | grep '/bin/sh'
root        2566  0.0  0.0   2316  1720 ?        Ss   Aug28   0:00 /bin/sh /usr/bin/nginx-proxy CP_HOSTS=172.21.27.183,172.21.26.91,172.21.27.194
root     3159289  0.0  0.0 217084   932 pts/0    S    10:22   0:00 unshare -fp --mount-proc /bin/sh
root     3159290  0.0  0.0 224784  3784 pts/0    S+   10:22   0:00 /bin/sh
root     3165529  0.0  0.0 221912  1176 pts/1    S+   10:32   0:00 grep --color=auto /bin/sh

# kill -SIGKILL 3159290
4-5. user
A Linux kernel feature that isolates user and group ID mappings. Even when processes use numerically identical user and group IDs, they can hold different privileges.


Terminal 1
# unshare -U --map-root-user /bin/sh
sh-4.4# whoami
root

sh-4.4# id
uid=0(root) gid=0(root) groups=0(root)

sh-4.4# readlink /proc/$$/ns/user
user:[4026532717]

sh-4.4# lsns -p $$
        NS TYPE   NPROCS     PID USER COMMAND
4026531835 cgroup      2 3176595 root /bin/sh
4026531836 pid         2 3176595 root /bin/sh
4026531838 uts         2 3176595 root /bin/sh
4026531839 ipc         2 3176595 root /bin/sh
4026531840 mnt         2 3176595 root /bin/sh
4026531992 net         2 3176595 root /bin/sh
4026532717 user        2 3176595 root /bin/sh



Terminal 2
The namespace isolated by unshare can be confirmed from the host VM.
# readlink /proc/$$/ns/user
user:[4026531837]

# lsns -p $$
        NS TYPE   NPROCS PID USER COMMAND
4026531835 cgroup    237   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531836 pid       188   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531837 user      236   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531838 uts       200   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531839 ipc       186   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531840 mnt       180   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16
4026531992 net       214   1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 16

# ps -ef | grep "/bin/sh"
root     3176595 3176498  0 10:49 pts/0    00:00:00 /bin/sh
root     3177167 3119323  0 10:50 pts/1    00:00:00 grep --color=auto /bin/sh

# lsns -t user -p 3176595
        NS TYPE  NPROCS     PID USER COMMAND
4026532717 user       1 3176595 root /bin/sh
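
The mapping behind --map-root-user is stored in /proc/<pid>/uid_map, where each line holds three numbers: the first ID inside the namespace, the first ID outside, and the length of the range. A small sketch of resolving an inside UID to the outside UID follows. The function name and the sample map for an unprivileged user with UID 1000 are assumptions for illustration; in the session above unshare ran as root, so the map would be `0 0 1`.

```python
def map_uid_to_parent(uid_map_text, inside_uid):
    """Translate a UID inside a user namespace to the UID outside it.

    Each uid_map line is "<inside-start> <outside-start> <length>".
    Returns None when the UID is not covered by any range.
    """
    for line in uid_map_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        inside_start, outside_start, length = (int(f) for f in fields)
        if inside_start <= inside_uid < inside_start + length:
            return outside_start + (inside_uid - inside_start)
    return None


if __name__ == "__main__":
    # Hypothetical map: root inside the namespace is UID 1000 outside.
    print(map_uid_to_parent("0 1000 1", 0))
    # The session above (unshare run as root) corresponds to this map.
    print(map_uid_to_parent("0 0 1", 0))
```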
A user namespace use case in k8s
Problem


A Stale File Handle error occurred on the host filesystem because of a problem between the NFS server and client.
The telegraf DaemonSet needed to raise an alert whenever df -h 2>&1 | grep -i 'stale file handle' | wc -l returned 1 or more.

This was set up with inputs.exec in the telegraf ConfigMap, with the host root filesystem mounted read-only in the telegraf DaemonSet.

With its default configuration telegraf runs as the telegraf user and group. When it ran df -h, the telegraf user had no access below /var/lib/kubelet, so the command returned Permission denied and the error could not be detected.

Common Issues in the official telegraf GitHub repository: https://github.com/influxdata/telegraf/blob/master/plugins/inputs/exec/README.md#common-issues




Confirming the problem
# docker exec --user telegraf -it telegraf-ds sh

/ $ df -h
df: /hostfs/var/lib/kubelet/pods/...: Permission denied
df: /hostfs/var/lib/kubelet/pods/...: Permission denied
df: /hostfs/var/lib/kubelet/pods/...: Permission denied

/ $ id
uid=100(telegraf) gid=101(telegraf) groups=101(telegraf)



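Solution

Set the root group in securityContext fsGroup so that the container user (telegraf) belongs to both the telegraf group and the root group.

The container still runs as the telegraf user (100) with only the minimum permissions it needs, which lowers the security risk.

The user has UID 100 and is a member of the root group, but that does not give it root privileges. If a file's group owner is root (GID 0) and the file grants permissions to its group, the telegraf user can access that root-owned file through the root group's permissions.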



spec: 
  containers: 
  ... 
    securityContext: 
      allowPrivilegeEscalation: false 
      privileged: false 
      readOnlyRootFilesystem: false 
      runAsNonRoot: false 
      runAsUser: 100 
      capabilities: {} 
  dnsPolicy: ClusterFirst 
  restartPolicy: Always 
  securityContext: 
    fsGroup: 0
    ...



Verifying the solution
# docker exec --user telegraf -it telegraf-ds sh

/ $ df -h 2>&1 | grep -i 'stale' 
df: ‘/var/lib/kubelet/pods/c5fa7996-d8b0-4340-a580-96620af88aad/volume-subpaths/pvc-3dfe4830-bb98-43ef-b34a-8472591dc73e/vauth/1’: Stale file handle
df: ‘/var/lib/kubelet/pods/c5fa7996-d8b0-4340-a580-96620af88aad/volume-subpaths/pvc-3dfe4830-bb98-43ef-b34a-8472591dc73e/auth-sidecar/1’: Stale file handle
...

/ $ id
uid=100(telegraf) gid=101(telegraf) groups=0(root),101(telegraf)
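
Why adding GID 0 as a supplementary group is enough can be sketched with the classic owner/group/other check. This is a simplified model that ignores capabilities, ACLs, and the execute bits needed on parent directories; the root:root path with mode 0750 is a hypothetical example, not a value taken from the cluster.

```python
import stat


def can_read(uid, gids, st_uid, st_gid, mode):
    """Simplified Unix read-permission check (no ACLs, no capabilities)."""
    if uid == 0:
        return True  # simplification: treat UID 0 as all-powerful
    if uid == st_uid:
        return bool(mode & stat.S_IRUSR)
    if st_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


if __name__ == "__main__":
    # Hypothetical path owned by root:root with mode 0750.
    owner, group, mode = 0, 0, 0o750
    print(can_read(100, {101}, owner, group, mode))     # before fsGroup: 0
    print(can_read(100, {0, 101}, owner, group, mode))  # after fsGroup: 0
```

The process stays UID 100 in both calls. Only the supplementary group set changes, which matches the difference between the two `id` outputs above.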
4-6. time
The time namespace is available on Linux kernel 5.6 and later.
https://www.man7.org/linux/man-pages/man7/time_namespaces.7.html


Terminal 1
# which uptime
/bin/uptime

# ldd /bin/uptime

# uptime
 13:00:44 up 1 day,  5:28,  2 users,  load average: 0.00, 0.00, 0.00

# unshare --mount-proc -T --boottime=86400 --root=/tmp/myroot/
-bash-5.1# uptime
 13:01:00 up 2 days,  5:29,  0 users,  load average: 0.00, 0.00, 0.00

-bash-5.1# lsns -p $$
        NS TYPE   NPROCS   PID USER COMMAND
4026531835 cgroup    131     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026531836 pid       131     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026531837 user      131     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026531838 uts       128     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026531839 ipc       131     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026531840 net       131     1 0    /usr/lib/systemd/systemd --switched-root --system --deserial
4026532223 mnt         2  3291 0    -bash
4026532224 time        2  3291 0    -bash



Terminal 2
$ uptime
 13:01:04 up 1 day,  5:29,  2 users,  load average: 0.00, 0.00, 0.00
4-7. cgroup
cgroup v2 can be used to set limits on CPU, memory, and disk I/O.
4-7-1. CPU
Compare the CPU usage of a process with and without a CPU limit.
Without a CPU limit
# sudo mount -t cgroup2 none /sys/fs/cgroup
mount: /sys/fs/cgroup: none already mounted on /run/credentials/systemd-tmpfiles-setup-dev.service.

# mkdir /sys/fs/cgroup/parent

# echo "+cpu" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
+cpu

# echo "+cpu" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
+cpu

# mkdir /sys/fs/cgroup/parent/child

# echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs

# cat /proc/$$/cgroup
0::/parent/child

# stress -c 1
stress: info: [3350] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

top - 13:16:36 up 1 day,  5:44,  2 users,  load average: 0.08, 0.02, 0.01
Tasks: 129 total,   2 running, 127 sleeping,   0 stopped,   0 zombie
%Cpu(s): 25.0 us,  0.1 sy,  0.0 ni, 74.7 id,  0.0 wa,  0.1 hi,  0.1 si,  0.0 st
MiB Mem :  15737.8 total,  15247.5 free,    458.4 used,    322.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  15279.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3351 root      20   0    3516    112      0 R  99.3   0.0   0:05.73 stress



With a CPU limit
# echo 100000 1000000 > /sys/fs/cgroup/parent/cpu.max

# echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs

# stress -c 1
stress: info: [3354] dispatching hogs: 1 cpu, 0 io, 0 vm, 0 hdd

top - 13:19:24 up 1 day,  5:47,  2 users,  load average: 0.01, 0.02, 0.00
Tasks: 130 total,   2 running, 128 sleeping,   0 stopped,   0 zombie
%Cpu(s):  2.6 us,  0.0 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  15737.8 total,  15247.5 free,    458.4 used,    322.9 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  15279.4 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3355 root      20   0    3516    112      0 R  10.0   0.0   0:01.51 stress
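
The value written to cpu.max is a quota and a period, both in microseconds, so `100000 1000000` allows 100 ms of CPU time in every 1 s window. That matches the roughly 10% that top reports. A minimal sketch of the arithmetic (the function name is made up for the example):

```python
def cpu_max_fraction(value):
    """Convert a cpu.max value ("<quota> <period>") to a fraction of one CPU.

    Returns None when the quota is "max", meaning no limit.
    """
    quota, period = value.split()
    if quota == "max":
        return None
    return int(quota) / int(period)


if __name__ == "__main__":
    fraction = cpu_max_fraction("100000 1000000")
    print(f"{fraction:.0%} of one CPU")
```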
4-7-2. MEM
Compare the memory usage of a process with and without a memory limit.
Without a memory limit
# mkdir /sys/fs/cgroup/parent

# echo "+memory" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
+memory

# echo "+memory" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
+memory

# mkdir /sys/fs/cgroup/parent/child

# echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs

# stress-ng --vm 1 --vm-bytes 100% --vm-keep --timeout 60s
stress-ng: info:  [3403] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [3403] dispatching hogs: 1 vm
...
stress-ng: info:  [3403] successful run completed in 18.24 secs

top - 13:23:40 up 1 day,  5:51,  2 users,  load average: 0.15, 0.03, 0.01
Tasks: 136 total,   2 running, 134 sleeping,   0 stopped,   0 zombie
%Cpu(s): 18.8 us,  6.2 sy,  0.0 ni, 75.0 id,  0.0 wa,  0.1 hi,  0.0 si,  0.0 st
MiB Mem :  15737.8 total,    163.2 free,  15668.6 used,    158.1 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.     69.1 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3405 root      20   0   14.9g  14.9g    984 R  99.7  96.8   0:10.42 stress-ng-vm



With a memory limit
# echo "500M" | sudo tee /sys/fs/cgroup/parent/memory.max
500M

# stress-ng --vm 1 --vm-bytes 100% --vm-keep --timeout 60s
stress-ng: info:  [3421] setting to a 1 min, 0 secs run per stressor
stress-ng: info:  [3421] dispatching hogs: 1 vm
...
stress-ng: info:  [3421] successful run completed in 30.25 secs

top - 13:25:29 up 1 day,  5:53,  2 users,  load average: 0.39, 0.13, 0.04
Tasks: 135 total,   2 running, 133 sleeping,   0 stopped,   0 zombie
%Cpu(s):  3.2 us, 21.6 sy,  0.0 ni, 74.9 id,  0.1 wa,  0.2 hi,  0.0 si,  0.0 st
MiB Mem :  15737.8 total,  14918.1 free,    911.0 used,    163.5 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.  14826.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
   3538 root      20   0   15.1g 480316    412 R   6.0   3.0   0:00.18 stress-ng-vm

# dmesg
[107618.823784] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/parent,task_memcg=/parent/child,task=stress-ng-vm,pid=3546,uid=0
[107618.824690] Memory cgroup out of memory: Killed process 3546 (stress-ng-vm) total-vm:15818924kB, anon-rss:479904kB, file-rss:396kB, shmem-rss:36kB, UID:0 pgtables:30948kB oom_score_adj:1000
[107619.036717] Memory cgroup out of memory: Killed process 3547 (stress-ng-vm) total-vm:15818924kB, anon-rss:479904kB, file-rss:448kB, shmem-rss:36kB, UID:0 pgtables:30948kB oom_score_adj:1000
[107619.251595] Memory cgroup out of memory: Killed process 3548 (stress-ng-vm) total-vm:15818924kB, anon-rss:479904kB, file-rss:388kB, shmem-rss:36kB, UID:0 pgtables:30948kB oom_score_adj:1000
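
The suffix in `500M` is interpreted in powers of 1024, so the limit is 524288000 bytes (512000 KiB). The RES value of 480316 KiB in top and the anon-rss of 479904 kB in the OOM log both sit just below that ceiling. A small sketch of the conversion, with a helper name chosen for this example:

```python
_SUFFIXES = {"K": 1024, "M": 1024 ** 2, "G": 1024 ** 3}


def parse_memory_limit(text):
    """Convert a memory.max style value such as "500M" to bytes.

    Returns None for "max", which means no limit.
    """
    text = text.strip()
    if text == "max":
        return None
    suffix = text[-1].upper()
    if suffix in _SUFFIXES:
        return int(text[:-1]) * _SUFFIXES[suffix]
    return int(text)


if __name__ == "__main__":
    limit = parse_memory_limit("500M")
    res_bytes = 480316 * 1024  # RES column from top, reported in KiB
    print(limit, res_bytes < limit)
```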
4-7-3. DISK IO
Compare the I/O performance of a process with and without an I/O limit.
Without an I/O limit
# mkdir /sys/fs/cgroup/parent
# echo "+io" | sudo tee /sys/fs/cgroup/cgroup.subtree_control
+io
# echo "+io" | sudo tee /sys/fs/cgroup/parent/cgroup.subtree_control
+io

# mkdir /sys/fs/cgroup/parent/child
# echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs

# dd if=/dev/zero of=/tmp/file1 bs=512M count=1
1+0 records in
1+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 0.269322 s, 2.0 GB/s

# iostat -d 1
Device             tps    kB_read/s    kB_wrtn/s    kB_dscd/s    kB_read    kB_wrtn    kB_dscd
vda               3.00        24.00         0.00         0.00         24          0          0



With an I/O limit
# cat /proc/partitions
major minor  #blocks  name
 252        0   52428800 vda
 252        1     101376 vda1
 252        2    1024000 vda2
 252        3       4096 vda3
 252        4       1024 vda4
 252        5   51296239 vda5

# echo "252:0 wbps=10485760" > /sys/fs/cgroup/parent/io.max

# echo $$ > /sys/fs/cgroup/parent/child/cgroup.procs

# dd if=/dev/zero of=/tmp/file1 bs=512M count=1
Killed

# journalctl -f
Aug 31 13:42:17 bgrprac.novalocal kernel: INFO: task dd:3653 blocked for more than 122 seconds.
Aug 31 13:42:17 bgrprac.novalocal kernel:       Not tainted 5.14.0-284.11.1.el9_2.x86_64 #1
Aug 31 13:42:17 bgrprac.novalocal kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.

# ps -ef | grep dd
root           2       0  0 Aug30 ?        00:00:00 [kthreadd]
root          76       2  0 Aug30 ?        00:00:00 [ipv6_addrconf]
root        3653    3624  0 13:39 pts/0    00:00:00 dd if=/dev/zero of=/tmp/file1 bs=512M count=1
root        3661    3299  0 13:43 pts/1    00:00:00 grep --color=auto dd

# kill -9 3624
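
The io.max entry is keyed by the device's major:minor pair from /proc/partitions, and wbps is a write limit in bytes per second. At 10485760 bytes/s (10 MiB/s), writing the 512 MiB test file cannot finish in under about 51 seconds, compared with 0.27 seconds without the limit. A rough sketch of the arithmetic follows; the helper names are made up, and the estimate is only a lower bound that ignores page cache and writeback behavior.

```python
def io_max_line(major, minor, **limits):
    """Build an io.max line such as "252:0 wbps=10485760"."""
    settings = " ".join(f"{key}={value}" for key, value in limits.items())
    return f"{major}:{minor} {settings}"


def min_write_seconds(total_bytes, wbps):
    """Lower bound on the time needed to write total_bytes at wbps bytes/s."""
    return total_bytes / wbps


if __name__ == "__main__":
    print(io_max_line(252, 0, wbps=10 * 1024 * 1024))
    print(min_write_seconds(512 * 1024 * 1024, 10 * 1024 * 1024))
```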
4-8. net
Network resources such as network interfaces and iptables rules can be isolated and used independently.
References
https://tech.kakaoenterprise.com/171
https://tech.kakaoenterprise.com/154

Setting	Description	Example values
`enabled`	Whether the ip-masq-agent feature is enabled	`true`, `false`
`config.nonMasqueradeCIDRs`	List of CIDRs to which SNAT is not applied	`{10.10.1.0/24,10.10.2.0/24}`
`config.masqLinkLocal`	Whether SNAT is applied to link-local addresses (169.254.0.0/16)	`true`, `false`
`config.masqLinkLocalIPv6`	Whether SNAT is applied to IPv6 link-local addresses (fe80::/10)	`true`, `false`
`config.masqAgentConfigPath`	Path to a user-defined config file	`/etc/cilium/masq-agent.json`
`config.masqOutBoundCIDRs`	List of external CIDRs to which SNAT is always applied	`{0.0.0.0/0}`
`config.masqOutBoundPortRanges`	List of port ranges to which SNAT is always applied	`{80,443,1000-2000}`
`config.refreshInterval`	Refresh interval for iptables rules	`1m`, `30s`
`installIptablesRules`	Whether ip-masq-agent installs the iptables rules itself	`true`, `false`

Option	Description	Example
`--from-pod`	Source Pod (`namespace/pod` format)	`--from-pod kube-system/cilium-abc`
`--to-pod`	Destination Pod	`--to-pod default/myapp`
`--from-ip`	Source IP address	`--from-ip 10.0.0.12`
`--to-ip`	Destination IP address	`--to-ip 10.0.1.25`
`--from-fqdn`	Source Fully Qualified Domain Name (FQDN)	`--from-fqdn api.example.com`
`--to-fqdn`	Destination FQDN	`--to-fqdn google.com`
`--selector`	Label selector (applies to both source and destination)	`--selector k8s:app=frontend`

Component	Description	Key values/settings
application_log_config	Defines the application log format	`%Y-%m-%dT%T.%fZ\t%l\tenvoy %n %g:%#\t%v\tthread=%t`
node	Envoy node information and related metadata	id: `router~172.16.1.3~...` cluster: `istio-ingressgateway.istio-system` instance_ip: `172.16.1.3`
annotations	Istio and Kubernetes metadata	`istio.io/rev: default`, `prometheus.io/scrape: true`
layered_runtime	Defines Envoy runtime settings	- `overload.global_downstream_max_connections`: 2147483647 - `re2.max_program_size.error_level`: 32768
bootstrap_extensions	Internal listener configuration	`buffer_size_kb`: 64
admin	Settings for the admin interface	address: `127.0.0.1:15000` profile_path: `/var/lib/istio/data/envoy.prof`
dynamic_resources	Dynamic resource management settings (LDS, CDS, ADS)	api_type: `DELTA_GRPC` discovery_address: `istiod.istio-system.svc:15012`
static_resources	Static resource settings (clusters, listeners)	Clusters: `prometheus_stats`, `agent`, `xds-grpc` Listeners: `0.0.0.0:15090`, `0.0.0.0:15021`
Cluster settings	Cluster information for connecting to xDS, Prometheus, and so on	Circuit Breakers: Max 100,000 connections/requests
Listener settings	Network listener configuration	Listener ports: 15090, 15021 HTTP filters: Router, Health Check
proxy_config	Detailed settings for proxy behavior	binaryPath: `/usr/local/bin/envoy` concurrency: 2 statusPort: 15020
Metadata	Information about the Pod and service	Pod name: `istio-ingressgateway-5f9f654d46-w87wh` Service account: `istio-ingressgateway-service-account`

Field	Description
Type	Identifies the kind of option (for example: security tag, QoS, policy ID)
Length	Data length (in 4-byte units)
Value	The actual metadata value
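
As a sketch of how such a Type/Length/Value option is laid out on the wire, the snippet below assumes the table describes Geneve-style options as defined in RFC 8926: a 16-bit option class, an 8-bit type, and a 5-bit length counted in 4-byte words, followed by the value. The option class, type, and value used in the demo are arbitrary illustrative numbers.

```python
import struct


def pack_geneve_option(opt_class, opt_type, data):
    """Pack one Geneve-style TLV option (layout assumed from RFC 8926)."""
    if len(data) % 4:
        raise ValueError("option data must be a multiple of 4 bytes")
    length_words = len(data) // 4
    if length_words > 0x1F:
        raise ValueError("option data too long for the 5-bit length field")
    # Class (16 bits), Type (8 bits), 3 reserved bits + Length (5 bits).
    return struct.pack("!HBB", opt_class, opt_type, length_words) + data


def unpack_geneve_option(raw):
    """Return (class, type, value) for an option at the start of raw."""
    opt_class, opt_type, length_byte = struct.unpack("!HBB", raw[:4])
    length_words = length_byte & 0x1F
    return opt_class, opt_type, raw[4:4 + 4 * length_words]


if __name__ == "__main__":
    # Arbitrary illustrative values: class 0x0102, type 0x80, 4-byte value 42.
    option = pack_geneve_option(0x0102, 0x80, (42).to_bytes(4, "big"))
    print(option.hex(), unpack_geneve_option(option))
```

Counting the length in 4-byte words keeps every option aligned, which is why the value has to be padded to a multiple of four bytes.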

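Aspect	Native Routing (direct routing)	Encapsulation (for example, VXLAN)
Traffic handling	Routes for Pod IPs are installed directly in the routing table, and traffic is routed directly between nodes	Traffic is encapsulated in a tunnel protocol such as VXLAN and delivered through the tunnel
Network overhead	Low (no encapsulation)	Encapsulation overhead (added UDP and VXLAN headers)
MTU issues	Uses the default MTU (usually 1500)	Encapsulation reduces the MTU; Path MTU problems can occur
Routing configuration complexity	Pod CIDR routes must be added to the node routing table	Low routing table complexity; packets are delivered between tunnel endpoints
Where network policy is applied	Policy can be applied on both nodes and Pods	Because of encapsulation, policy must be applied at the tunnel endpoints (packets are altered inside the tunnel)
Network environment requirements	Every node in the cluster must be able to route the Pod CIDRs	The intermediate network (routers, switches) does not need to support the tunnel protocol
Multi-tenancy / complex networks	Routing is hard to manage in complex network environments	Tunnels allow isolation and network separation in complex network environments
Debugging difficulty	Relatively easy	Encapsulation makes traffic analysis and debugging harder
Performance impact	Generally lower latency and CPU overhead	CPU overhead and slight latency from encapsulation and decapsulation
Traffic flow between nodes	Delivered directly based on node IPs	Encapsulated packets are delivered between tunnel endpoints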
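
The MTU row can be made concrete with a short calculation. The sketch assumes VXLAN over an IPv4 underlay with no extra options, where each encapsulated packet carries an outer IPv4 header, a UDP header, a VXLAN header, and the inner Ethernet header.

```python
# Bytes added inside the underlay MTU by VXLAN over IPv4 (assumed layout):
# outer IPv4 (20) + UDP (8) + VXLAN (8) + inner Ethernet header (14).
VXLAN_OVERHEAD_IPV4 = 20 + 8 + 8 + 14


def inner_mtu(underlay_mtu, overhead=VXLAN_OVERHEAD_IPV4):
    """Largest inner IP packet that fits without fragmenting the outer packet."""
    return underlay_mtu - overhead


if __name__ == "__main__":
    print(inner_mtu(1500))  # Pod-facing MTU on a standard 1500-byte underlay
    print(inner_mtu(9000))  # the same calculation on a jumbo-frame underlay
```

With native routing there is no such overhead, so Pods can use the full underlay MTU.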

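Component	Description	Observation scope	Connection	Where deployed/used	Key characteristics
Hubble API	gRPC API that exposes the network traffic observed on the local node where the Cilium agent runs	Single Cilium node (local)	Unix domain socket (`/var/run/cilium/hubble.sock`)	Inside each Cilium agent Pod	Provides L3 to L7 network events. Not directly reachable from outside
Hubble Relay	Aggregates the Hubble APIs of multiple Cilium nodes and provides a unified view of traffic for the whole cluster, or for multiple clusters in a ClusterMesh	Whole cluster or ClusterMesh	Internal: talks to the Hubble API External: gRPC (used by the CLI and UI)	Runs as a separate Pod (Deployment, etc.)	Central data collector. Security and authentication can be configured. Main backend for the CLI and UI
Hubble UI	Web UI that automatically discovers and visualizes communication flows between services in the cluster	Whole cluster or ClusterMesh	Talks to Hubble Relay over gRPC or HTTP	Deployed as a Pod and accessed from a web browser	Service dependency map, filtering UI, L3/L4/L7 data visualization. UX similar to Grafana
Hubble CLI	CLI tool that queries traffic flows through the `hubble` command, connecting to either the Hubble API or Hubble Relay	Local node or whole cluster	① Unix domain socket (direct API connection) ② Hubble Relay address (gRPC)	Inside a Cilium Pod or on an external client	Supports live flow queries, filtering, JSON output, and other commands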

Option	Description
`kubeProxyReplacement=true`	Fully replaces kube-proxy so that Cilium performs the kube-proxy functions itself.
`routingMode=native`	Sets Cilium's routing mode to native (uses Linux kernel native routing).
`autoDirectNodeRoutes=true`	Automatically installs direct routes for traffic between nodes.
`ipam.mode="cluster-pool"`	Sets the IP address allocation mode to cluster pool (Cilium manages IP allocation itself).
`ipam.operator.clusterPoolIPv4PodCIDRList={"172.20.0.0/16"}`	Sets the IPv4 CIDR pool used for Pods in the cluster.
`ipv4NativeRoutingCIDR=172.20.0.0/16`	Sets the IPv4 CIDR range used for native routing on the nodes.
`endpointRoutes.enabled=true`	Creates a separate route for each Pod to optimize the network path.
`installNoConntrackIptablesRules=true`	Installs iptables rules that let Pod traffic skip netfilter connection tracking (conntrack), favoring the eBPF datapath.
`bpf.masquerade=true`	Performs IP masquerading with eBPF instead of iptables-based masquerading.

Cilium Feature	Minimum Kernel Version
WireGuard Transparent Encryption	>= 5.6
Full support for Session Affinity	>= 5.7
BPF-based proxy redirection	>= 5.7
Socket-level LB bypass in pod netns	>= 5.7
L3 devices	>= 5.8
BPF-based host routing	>= 5.10
Multicast Support in Cilium (Beta) (AMD64)	>= 5.10
IPv6 BIG TCP support	>= 5.19
Multicast Support in Cilium (Beta) (AArch64)	>= 6.0
IPv4 BIG TCP support	>= 6.3
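
A quick way to read this table against a running node is to compare the kernel release string with each minimum. The sketch below hard-codes the rows above and checks only the major.minor prefix. Distribution kernels often backport features, so the result is a rough guide rather than a definitive answer; the function names are made up for the example.

```python
import re

# Minimum kernel versions copied from the table above.
MIN_KERNEL = {
    "WireGuard Transparent Encryption": (5, 6),
    "Full support for Session Affinity": (5, 7),
    "BPF-based proxy redirection": (5, 7),
    "Socket-level LB bypass in pod netns": (5, 7),
    "L3 devices": (5, 8),
    "BPF-based host routing": (5, 10),
    "Multicast Support in Cilium (Beta) (AMD64)": (5, 10),
    "IPv6 BIG TCP support": (5, 19),
    "Multicast Support in Cilium (Beta) (AArch64)": (6, 0),
    "IPv4 BIG TCP support": (6, 3),
}


def parse_release(release):
    """Extract (major, minor) from a release string such as `uname -r` output."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def features_meeting_minimum(release):
    """List the features whose minimum version the given kernel satisfies."""
    version = parse_release(release)
    return [name for name, minimum in MIN_KERNEL.items() if version >= minimum]


if __name__ == "__main__":
    # Kernel release string that appears in the journalctl output earlier.
    for feature in features_meeting_minimum("5.14.0-284.11.1.el9_2.x86_64"):
        print(feature)
```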

Option	Description
`rollOutCiliumPods`	Rolls out the Cilium Pods so that a new version or configuration is deployed in a managed way.
`routingMode`	Sets how packets are routed; native routing mode is used here.
`autoDirectNodeRoutes`	Automatically sets up routes between nodes to optimize packet delivery.
`bpf.masquerade`	Enables IP masquerading using BPF.
`bpf.hostRouting`	Enables host routing so that packets are forwarded directly.
`endpointRoutes.enabled`	Enables endpoint routes to optimize traffic flow between Pods.
`ipam.mode`	Sets the IP address allocation mode to Kubernetes.
`k8s.requireIPv4PodCIDR`	Requires an IPv4 Pod CIDR, which guarantees that IPv4 addresses are used within the cluster.
`kubeProxyReplacement`	Makes Cilium take over the role of kube-proxy.
`ipv4NativeRoutingCIDR`	Sets the CIDR block used for IPv4 native routing.
`installNoConntrackIptablesRules`	Installs iptables rules that let Pod traffic skip netfilter connection tracking (conntrack), which improves Cilium's efficiency.
`hubble.ui.enabled`	Enables Hubble UI for network monitoring and visualization.
`hubble.relay.enabled`	Enables Hubble Relay for metric collection and event forwarding.
`prometheus.enabled`	Enables Prometheus monitoring for metric collection.
`operator.prometheus.enabled`	Enables Prometheus monitoring for the Cilium Operator.
`hubble.metrics.enableOpenMetrics`	Enables the OpenMetrics format for Hubble metrics to improve compatibility.
`hubble.metrics.enabled`	Defines the metrics and events that Hubble collects.
`operator.replicas`	Sets the number of Cilium Operator replicas to maintain high availability.

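syscall	Description
`connect()`	Attempts to connect to the Service ClusterIP through a socket.
`getsockname()`	Checks the local address of the connected socket. *User space cannot see the destination IP that BPF redirects to, because the redirect happens at the kernel socket level rather than in user space.
`getsockopt()`	Checks the socket error state or socket options.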

Name	Description
poddisruptionbudget.policy/istio-ingressgateway	Sets the minimum number of available Pods to keep the Istio Ingress Gateway stable.
poddisruptionbudget.policy/istiod	Ensures that the Istio control plane (`istiod`) keeps at least a given number of Pods running.
authorizationpolicies.security.istio.io	Defines access control rules for service-to-service communication and external requests.
destinationrules.networking.istio.io	Defines policies such as load balancing and connection settings that apply when a service is called.
envoyfilters.networking.istio.io	Adds filters to customize the behavior of the Envoy proxy.
gateways.networking.istio.io	Defines a Gateway that routes external traffic to internal services.
peerauthentications.security.istio.io	Sets the authentication policy for service-to-service communication (mTLS, for example).
proxyconfigs.networking.istio.io	Controls proxy settings and customizes sidecar and ingress behavior.
requestauthentications.security.istio.io	Defines rules for authenticating incoming requests.
serviceentries.networking.istio.io	Creates an internal virtual service entry to allow access to an external service.
sidecars.networking.istio.io	Defines sidecar proxy behavior for a specific namespace or service.
telemetries.telemetry.istio.io	Defines telemetry settings for monitoring and metrics collection.
virtualservices.networking.istio.io	Defines rules for routing requests to a specific service.
wasmplugins.extensions.istio.io	Extends the Istio proxy with WebAssembly (Wasm) plugins.
workloadentries.networking.istio.io	Adds a workload that runs outside the cluster to the Istio mesh.
workloadgroups.networking.istio.io	Defines a group of external workloads that share similar characteristics.

CRD Name	Description
`bfdprofiles.metallb.io`	Defines a BFD (Bidirectional Forwarding Detection) profile, which is used to detect failures of BGP sessions.
`bgpadvertisements.metallb.io`	Controls BGP (Border Gateway Protocol) advertisements; used when MetalLB advertises the cluster's IP addresses.
`bgppeers.metallb.io`	Defines the BGP peers that MetalLB connects to, which establishes the relationship with other BGP routers.
`communities.metallb.io`	Defines BGP communities that can be attached to BGP advertisements to implement specific routing policies.
`ipaddresspools.metallb.io`	Defines the IP address pools that MetalLB uses; addresses are allocated to services from these pools.
`l2advertisements.metallb.io`	Advertises IP addresses in L2 mode, using ARP (IPv4) or NDP (IPv6) on the local network segment.
`servicel2statuses.metallb.io`	Tracks the status of services in L2 mode and exposes the related metadata.

Namespace	Flag	Page	Isolates
Cgroup	CLONE_NEWCGROUP	cgroup_namespaces(7)	Cgroup root directory
IPC	CLONE_NEWIPC	ipc_namespaces(7)	System V IPC, POSIX message queues
Network	CLONE_NEWNET	network_namespaces(7)	Network devices, stacks, ports, etc.
Mount	CLONE_NEWNS	mount_namespaces(7)	Mount points
PID	CLONE_NEWPID	pid_namespaces(7)	Process IDs
Time	CLONE_NEWTIME	time_namespaces(7)	Boot and monotonic clocks
User	CLONE_NEWUSER	user_namespaces(7)	User and group IDs
UTS	CLONE_NEWUTS	uts_namespaces(7)	Hostname and NIS domain name
