[EKS Best Practices] Cluster Autoscaling - Karpenter

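Source: Cluster Autoscaling - Karpenter

Karpenter Overview

Karpenter is an open-source project that automates node lifecycle management for Kubernetes clusters. It provisions and deprovisions nodes based on the scheduling requirements of Pods, which enables efficient scaling and cost optimization.

Core Workflow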

flowchart LR
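    A[Detect Pending Pods] --> B[Evaluate scheduling requirements]
    B --> C{Analyze resource requests, Node Selectors,\nAffinities, Tolerations}
    C --> D[Provision a right-sized\nnew node]
    D --> E[Schedule Pods]
    E --> F{Monitor node\nutilization}
    F -->|No longer needed| G[Remove node automatically]
    F -->|In use| E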

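Karpenter's Four Core Functions

Function	Description
Pending Pod monitoring	Detects Pods that the Kubernetes scheduler could not schedule because of insufficient resources
Scheduling requirement evaluation	Analyzes resource requests, node selectors, affinities, tolerations, and similar constraints
Node provisioning	Automatically creates right-sized nodes that meet those requirements
Node removal	Automatically deletes nodes that are no longer needed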

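Why Use Karpenter

Karpenter vs. Cluster Autoscaler

Item	Cluster Autoscaler (CAS)	Karpenter
Node Group management	Can require dozens of Node Groups to cover varied requirements	A single NodePool can accommodate diverse workloads
API dependency	Must go back and forth between the AWS API and the Kubernetes API	Works closer to the Kubernetes native API
Kubernetes version coupling	Tightly coupled to the Kubernetes version	Loosely coupled
Instance flexibility	Limited to what each Node Group defines	Flexible configuration through NodePool options
Scaling speed	Relatively slower because it goes through ASGs	Fast node launch and Pod scheduling
AZ targeting	Limited	Supports scheduling into specific AZs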

flowchart TB
    subgraph CAS["Cluster Autoscaler approach"]
        direction TB
        K8s1[Kubernetes API] --> Bridge[CAS Bridge] --> ASG[AWS ASG API]
        ASG --> NG1[Node Group 1\n- m5.large]
        ASG --> NG2[Node Group 2\n- c5.xlarge]
        ASG --> NG3[Node Group 3\n- r5.large]
        ASG --> NG4[Node Group N\n- ...]
    end

    subgraph KAR["Karpenter approach"]
        direction TB
        K8s2[Kubernetes API] --> KARP[Karpenter Controller]
        KARP --> NP[Single NodePool]
        NP --> I1[m5.large]
        NP --> I2[c5.xlarge]
        NP --> I3[r5.large]
        NP --> I4[Selected automatically\nas needed]
    end

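When to Use Karpenter

Scenario	Recommended solution
Workloads with highly variable load and frequent spikes	Karpenter
A mix of diverse compute requirements	Karpenter
Static, steady workloads	MNG / ASG
Mixed environments	Karpenter and MNG can run side by side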

Karpenter Best Practices

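1. Pin AMIs in Production

In production clusters, always pin a validated AMI version. If the AMI is not pinned, a newly released and untested AMI can roll out to production nodes.

# Pin the AMI version in the EC2NodeClass (recommended)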
amiSelectorTerms:
  - alias: al2023@v20240807

Environment	AMI strategy
Production	Pin a specific, tested version
Non-production	Use for testing the latest versions
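
To show where the `amiSelectorTerms` fragment above sits, here is a minimal EC2NodeClass sketch. The IAM role name, the `karpenter.sh/discovery` tag value, and the cluster name `my-cluster` are placeholders that do not come from the original; adjust them to your environment.

```yaml
apiVersion: karpenter.k8s.aws/v1
kind: EC2NodeClass
metadata:
  name: default
spec:
  # Pinned, validated AMI version (same alias as the fragment above)
  amiSelectorTerms:
    - alias: al2023@v20240807
  # Placeholder IAM role name for the nodes (assumption)
  role: KarpenterNodeRole-my-cluster
  # Placeholder discovery tags (assumption)
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
```

A non-production EC2NodeClass could use the `al2023@latest` alias instead, so that new AMI releases are exercised there before a version is promoted to production.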

2. Karpenter Controller Placement

flowchart LR
    subgraph Placement["Karpenter Controller placement options"]
        A["Option 1: Small Managed Node Group\nwith at least one worker node"]
        B["Option 2: EKS Fargate\nCreate a Fargate Profile for\nthe karpenter namespace"]
    end
    C[/"Do not run Karpenter on nodes\nthat Karpenter manages!"/]

    style C fill:#ff6b6b,color:#fff

Caution: Do not run the Karpenter Controller on nodes that Karpenter manages. If Karpenter removes the node it is running on, the Controller can go down with it.
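
For the Managed Node Group option, one way to enforce this is a node affinity on the Controller's Pod spec. The fragment below is a sketch: the Node Group name `karpenter-system` is hypothetical, `karpenter.sh/nodepool` is the label Karpenter sets on nodes it launches, and `eks.amazonaws.com/nodegroup` is the label EKS sets on Managed Node Group nodes.

```yaml
# Pod-spec fragment for the Karpenter Controller Deployment (sketch)
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            # Never land on a node launched by Karpenter
            - key: karpenter.sh/nodepool
              operator: DoesNotExist
            # Only land on the dedicated Managed Node Group (hypothetical name)
            - key: eks.amazonaws.com/nodegroup
              operator: In
              values:
                - karpenter-system
```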

3. Exclude Unneeded Instance Types

# Example: exclude large Graviton instances
- key: node.kubernetes.io/instance-type
  operator: NotIn
  values:
    - m6g.16xlarge
    - m6gd.16xlarge
    - r6g.16xlarge
    - r6gd.16xlarge
    - c6g.16xlarge
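
The requirement above belongs under `spec.template.spec.requirements` of a NodePool. A minimal sketch of that placement follows, assuming an EC2NodeClass named `default`; the NodePool name is a placeholder.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default   # placeholder name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        # Exclude instance types the workloads never need
        - key: node.kubernetes.io/instance-type
          operator: NotIn
          values:
            - m6g.16xlarge
            - m6gd.16xlarge
            - r6g.16xlarge
            - r6gd.16xlarge
            - c6g.16xlarge
```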

4. Enable Interruption Handling When Using Spot

sequenceDiagram
    participant SQS as SQS Queue
    participant KAR as Karpenter Controller
    participant Node as Affected Node
    participant NewNode as New Node

    SQS->>KAR: Interruption event received
    KAR->>Node: Apply taint
    KAR->>NewNode: Start provisioning new node
    KAR->>Node: Drain (move Pods)
    NewNode-->>KAR: Ready
    KAR->>Node: Terminate
    Note over Node,NewNode: On the 2-minute Spot warning, quickly prepare a new node

Setting	Details
How to enable	Pass the SQS queue name in the `--interruption-queue` CLI argument
Caution	Do not use together with Node Termination Handler
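
When Karpenter is installed with its Helm chart, the queue name is usually supplied through chart values rather than by editing the argument directly. A sketch for recent chart versions, assuming a cluster and an SQS queue that are both named `my-cluster` (placeholders):

```yaml
# values.yaml fragment for the Karpenter Helm chart (sketch)
settings:
  clusterName: my-cluster        # placeholder cluster name
  interruptionQueue: my-cluster  # placeholder SQS queue name
```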

5. Private Clusters (No Internet Access)

VPC Endpoint	Why it is needed	Symptom when missing
STS (Regional)	Obtaining credentials through IRSA	`WebIdentityErr: failed to retrieve credentials ... i/o timeout`
SSM	Launch Template configuration and AMI lookup	`getting ssm parameter ... i/o timeout`
Price List API	Pricing data lookup (no VPC Endpoint exists)	Falls back to the On-Demand pricing data embedded in the Karpenter binary
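
The STS and SSM endpoints from the table could be created as follows. This is a CloudFormation sketch, not a complete template: the parameter names are placeholders, and the security group is assumed to allow HTTPS from the subnets where Karpenter runs.

```yaml
# CloudFormation fragment (sketch); parameter names are placeholders
Parameters:
  VpcId:
    Type: AWS::EC2::VPC::Id
  PrivateSubnetIds:
    Type: List<AWS::EC2::Subnet::Id>
  EndpointSecurityGroupId:
    Type: AWS::EC2::SecurityGroup::Id

Resources:
  # Regional STS endpoint, used by IRSA to obtain credentials
  StsEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sts
      VpcId: !Ref VpcId
      SubnetIds: !Ref PrivateSubnetIds
      SecurityGroupIds:
        - !Ref EndpointSecurityGroupId
      PrivateDnsEnabled: true
  # SSM endpoint, used for AMI parameter lookups
  SsmEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Properties:
      VpcEndpointType: Interface
      ServiceName: !Sub com.amazonaws.${AWS::Region}.ssm
      VpcId: !Ref VpcId
      SubnetIds: !Ref PrivateSubnetIds
      SecurityGroupIds:
        - !Ref EndpointSecurityGroupId
      PrivateDnsEnabled: true
```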

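NodePool Best Practices

1. When to Create Multiple NodePools

Scenario	Description
Workload separation by team	Teams need to run on different worker nodes
Different OS requirements	One team uses Bottlerocket, another uses Amazon Linux
Different hardware requirements	Some teams need GPU nodes and others do not

2. Make NodePools Mutually Exclusive or Weighted

When multiple NodePools match the same Pod, Karpenter picks one at random, so NodePools should either be mutually exclusive or be assigned weights.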
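
As a sketch of the weighted approach, the two NodePools below overlap, but `spec.weight` tells Karpenter to consider the higher-weight one first. The NodePool names and the Spot-first, On-Demand-fallback split are illustrative assumptions, not taken from the original.

```yaml
# Preferred NodePool: the higher weight is considered first
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: preferred   # hypothetical name
spec:
  weight: 50
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
---
# Fallback NodePool: no weight set, so it ranks below the weighted one
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: fallback   # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
```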

Pattern A: GPU NodePool separated by Taint/Toleration

# GPU NodePool - the taint keeps general workloads off these nodes
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpu
spec:
  disruption:
    consolidateAfter: 1m
    consolidationPolicy: WhenEmptyOrUnderutilized
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - p3.8xlarge
            - p3.16xlarge
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      taints:
        - effect: NoSchedule
          key: nvidia.com/gpu
          value: "true"
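
A workload that should land on this NodePool needs a matching toleration. A minimal Pod sketch follows; the Pod name and image are placeholders.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-job   # hypothetical name
spec:
  tolerations:
    # Matches the taint on the gpu NodePool
    - key: nvidia.com/gpu
      operator: Equal
      value: "true"
      effect: NoSchedule
  containers:
    - name: trainer
      image: example.com/trainer:latest   # placeholder image
      resources:
        limits:
          nvidia.com/gpu: 1   # GPU request steers the Pod to GPU instance types
```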

Pattern B: General-purpose NodePool separated by Label/NodeAffinity

# General-purpose NodePool - labeled per team
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: generalcompute
spec:
  template:
    metadata:
      labels:
        billing-team: my-team
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values:
            - m5.large
            - m5.xlarge
            - m5.2xlarge
            - c5.large
            - c5.xlarge
            - r5.large
            - r5.xlarge
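
On the workload side, a Pod selects this NodePool by requiring the `billing-team` label. A minimal sketch, with a placeholder Pod name, image, and resource requests:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: billing-api   # hypothetical name
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              # Matches the label applied by the generalcompute NodePool
              - key: billing-team
                operator: In
                values: ["my-team"]
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      resources:
        requests:
          cpu: 500m
          memory: 512Mi
```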

flowchart TB
    subgraph Cluster["EKS Cluster"]
        subgraph NP1["NodePool: gpu"]
            T1["Taint: nvidia.com/gpu=true:NoSchedule"]
            N1["p3.8xlarge / p3.16xlarge"]
        end

        subgraph NP2["NodePool: generalcompute"]
            L1["Label: billing-team=my-team"]
            N2["m5 / c5 / r5 families"]
        end
    end

    GPU_WL["GPU Workload\n(has Toleration)"] --> NP1
    GENERAL_WL["General Workload\n(matching NodeAffinity)"] --> NP2

    style NP1 fill:#e8d5f5
    style NP2 fill:#d5e8f5

3. Do Not Over-Constrain Instance Types When Using Spot

flowchart LR
    A["Allow many Instance Types"] --> B["Access to deeper Spot pools"]
    B --> C["Lower interruption risk"]

    E["Allow only a few Instance Types"] --> F["Limited Spot pools"]
    F --> G["Higher interruption risk"]

    style A fill:#90EE90
    style E fill:#FFB6C1

# Example using ec2-instance-selector
$ ec2-instance-selector --memory 4 --vcpus 2 --cpu-architecture x86_64 -r ap-southeast-1
c5.large
c5a.large
c5ad.large
c5d.large
c6i.large
t2.medium
t3.medium
t3a.medium
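
Instead of listing individual instance types, a Spot NodePool can constrain by broad attributes so that many types stay eligible. The sketch below uses Karpenter's well-known `karpenter.k8s.aws/instance-category` and `karpenter.k8s.aws/instance-generation` labels; the NodePool name and the chosen categories are illustrative assumptions.

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: spot-flexible   # hypothetical name
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        # Broad families rather than a short list of instance types
        - key: karpenter.k8s.aws/instance-category
          operator: In
          values: ["c", "m", "r"]
        - key: karpenter.k8s.aws/instance-generation
          operator: Gt
          values: ["2"]
```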

Pod Scheduling Best Practices

1. Ensure High Availability (HA)

Mechanism	Purpose
Topology Spread Constraints	Spread Pods across nodes and AZs
Pod Disruption Budgets (PDB)	Guarantee a minimum number of available Pods by limiting evictions and deletions
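
The two mechanisms are typically combined on the same workload. The following is a sketch with a hypothetical `web` Deployment; the replica count, `maxSkew`, and `minAvailable` values are illustrative.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web   # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      topologySpreadConstraints:
        # Spread evenly across AZs (hard requirement)
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app: web
        # Prefer spreading across nodes (soft requirement)
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app: web
      containers:
        - name: web
          image: example.com/web:latest   # placeholder image
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web
spec:
  # Keep at least 2 Pods available during voluntary disruptions
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```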

2. Monitor Costs and Set Resource Limits

# Set Resource Limits on the NodePool
spec:
  limits:
    cpu: 1000        # up to 1000 vCPUs
    memory: 1000Gi   # up to 1000Gi of memory

flowchart TB
    subgraph CostControl["Layered cost controls"]
        direction TB
        L1["Layer 1: NodePool Resource Limits\ncpu: 1000, memory: 1000Gi"]
        L2["Layer 2: CloudWatch Billing Alarm\nAlerts when a threshold is exceeded"]
        L3["Layer 3: AWS Cost Anomaly Detection\nML-based detection of unusual spend"]
        L4["Layer 4: AWS Budgets Actions\nEmail / SNS / Slack notifications"]

        L1 --> L2 --> L3 --> L4
    end

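3. Use the `karpenter.sh/do-not-disrupt` Annotation

Add this annotation to Pods that must not be interrupted by voluntary disruption such as consolidation.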

metadata:
  annotations:
    karpenter.sh/do-not-disrupt: "true"

4. Set requests = limits for Non-CPU Resources When Using Consolidation

flowchart LR
    subgraph Problem["Problem when requests and limits differ"]
        P1["Pod A\nrequest: 256Mi\nlimit: 512Mi"] --> Node["Node\nallocatable: 1Gi"]
        P2["Pod B\nrequest: 256Mi\nlimit: 512Mi"] --> Node
        P3["Pod C\nrequest: 256Mi\nlimit: 512Mi"] --> Node
        P4["Pod D\nrequest: 256Mi\nlimit: 512Mi"] --> Node
        Node -->|"4 Pods x 256Mi request = 1Gi\nbut 4 Pods x 512Mi burst = 2Gi"| OOM["OOM Kill occurs!"]
    end

    style OOM fill:#ff6b6b,color:#fff

Resource	Recommended setting
Memory	`requests = limits` (prevents OOM on burst)
Other non-CPU resources	`requests = limits`
CPU	Compressible resource, so considered separately
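
Applied to a Pod, the table translates into a memory limit equal to the memory request and a CPU request without a CPU limit. A minimal sketch with placeholder name, image, and sizes:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app   # hypothetical name
spec:
  containers:
    - name: app
      image: example.com/app:latest   # placeholder image
      resources:
        requests:
          cpu: 250m
          memory: 512Mi
        limits:
          memory: 512Mi   # equal to the request; no CPU limit is set
```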

5. Set Default Resources with LimitRange

flowchart LR
    A["Pod created\n(no resources specified)"] --> B{"LimitRange\nexists?"}
    B -->|Yes| C["Default requests/limits\napplied automatically"]
    B -->|No| D["Unbounded resource usage\nScheduling problems"]
    C --> E["Karpenter can choose\nthe right node size"]
    D --> F["Karpenter cannot determine\nthe right node"]

    style E fill:#90EE90
    style F fill:#FFB6C1
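
A LimitRange that supplies these defaults might look like the sketch below. The namespace, object name, and sizes are placeholders chosen for illustration.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources   # hypothetical name
  namespace: my-namespace   # placeholder namespace
spec:
  limits:
    - type: Container
      # Applied when a container specifies no requests
      defaultRequest:
        cpu: 250m
        memory: 256Mi
      # Applied when a container specifies no limits
      default:
        memory: 256Mi
```

Containers created in that namespace without a resources block receive these values, which gives Karpenter concrete requests to size nodes against.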

CoreDNS Recommendations

Setting	Purpose
CoreDNS lameduck duration	Adds a grace period before shutdown so in-flight DNS queries have time to complete
CoreDNS readiness probe	Prevents DNS queries from being routed to a CoreDNS Pod that is not ready yet
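
Both settings can be sketched as follows. The first document sets `lameduck` in the `health` plugin and enables the `ready` plugin in the Corefile; the second is a container-spec fragment that probes the `ready` plugin's default `/ready` endpoint on port 8181. The 30s duration and the rest of the Corefile are illustrative assumptions; compare them with the Corefile your cluster actually runs before changing anything.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health {
            lameduck 30s
        }
        ready
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
---
# Container-spec fragment for the CoreDNS Deployment (sketch)
readinessProbe:
  httpGet:
    path: /ready
    port: 8181
    scheme: HTTP
```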

Full Best Practices Checklist

Category	Best Practice	Priority
Karpenter operations	Pin the AMI version in production	Required
Karpenter operations	Run the Controller on an MNG or Fargate	Required
Karpenter operations	Enable Interruption Handling when using Spot	High
Karpenter operations	Create STS/SSM VPC Endpoints for private clusters	Environment-dependent
NodePool	Create multiple NodePools per team/workload	Recommended
NodePool	Make NodePools mutually exclusive or weighted	High
NodePool	Allow a wide range of Instance Types with Spot	High
NodePool	Set Resource Limits	High
Pod scheduling	Ensure HA with Topology Spread + PDB	High
Pod scheduling	Set up cost alerts (Billing Alarm)	High
Pod scheduling	Set non-CPU requests = limits when using Consolidation	High
Pod scheduling	Set default resources with LimitRange	Recommended
Pod scheduling	Set accurate Resource Requests on every workload	Required
CoreDNS	Configure lameduck duration + readiness probe	Recommended

Further Learning Resources

Resource	Link
Karpenter Immersion Day Workshop	catalog.workshops.aws/karpenter
Karpenter Cost Optimization Workshop	ec2spotworkshops.com/karpenter
EKS Workshop - Karpenter	eksworkshop.com - Karpenter
Karpenter vs Cluster Autoscaler (video)	YouTube
Spot + Karpenter tutorial	community.aws