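Kubernetes Advanced Scheduling and Resource Management

The Scheduling Process

Scheduling places each Pod on a suitable Node:

  • Satisfy the Pod's resource requirements
  • Satisfy the Pod's special relationship requirements
  • Satisfy the Node's constraints
  • Use cluster resources efficiently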

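Basic Scheduling Capabilities

  • Resource scheduling – satisfy the Pod's resource requirements
    • Resource request/limit
      • CPU: 1 = 1000m
      • Memory: 1Gi = 1024Mi
      • Storage
      • GPU
      • FPGA
    • QoS
      • Guaranteed (high)
      • Burstable (medium)
      • BestEffort (low)
    • Resource quotas
  • Relationship scheduling – satisfy special relationships and constraints between Pods and Nodes
    • Pod-to-Pod relationships
      • PodAffinity
      • PodAntiAffinity
    • The Pod chooses the Nodes that suit it
      • NodeSelector
      • NodeAffinity
    • Restricting which Pods may be scheduled onto certain Nodes
      • Taint
      • Tolerations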
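
The unit rules listed above (1 CPU = 1000m, 1Gi = 1024Mi) can be sketched as a small conversion helper. This is only an illustration, not the Kubernetes quantity parser: the function names are hypothetical, and only plain numbers, the `m` suffix, and the binary suffixes `Ki`/`Mi`/`Gi`/`Ti` are handled.

```python
# Hypothetical helpers; only a subset of Kubernetes quantity syntax is handled.
_BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def cpu_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as "2" or "500m" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as "1Gi" or "128" to bytes."""
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)
```

With these helpers, `cpu_millicores("2")` gives 2000 millicores and `memory_bytes("1Gi")` equals 1024 times `memory_bytes("1Mi")`.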

 

Resource Scheduling

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi

 

A Pod's QoS class cannot be set manually; Kubernetes derives it from the Pod's requests and limits.

Guaranteed

For CPU and memory, request must equal limit.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi
      limits:
        cpu: 2
        memory: 1Gi

 

Burstable

The CPU/memory request and limit differ (for example, only requests are set).

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
    resources:
      requests:
        cpu: 2
        memory: 1Gi

 

BestEffort

No request or limit is set for any resource.

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
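
The three cases above can be summarized as a classification rule. The following is a simplified sketch, not the kubelet's implementation: `qos_class` is a hypothetical name, containers are plain dicts, and quantities are assumed to be already normalized so that equal amounts compare equal as strings.

```python
def qos_class(containers):
    """Classify a Pod from its containers' CPU/memory requests and limits.

    containers: list of dicts such as
        {"requests": {"cpu": "2"}, "limits": {"cpu": "2"}}
    """
    any_set = False
    guaranteed = True
    for container in containers:
        requests = container.get("requests", {})
        limits = container.get("limits", {})
        for resource in ("cpu", "memory"):
            limit = limits.get(resource)
            # An omitted request defaults to the limit.
            request = requests.get(resource, limit)
            if request is not None or limit is not None:
                any_set = True
            if limit is None or request != limit:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

Applied to the three manifests above (with `resources` spelled correctly), this sketch returns Guaranteed, Burstable, and BestEffort respectively.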

 

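QoS Classes Compared

  • Different scheduling behavior
    • The scheduler schedules based on request
  • Different runtime behavior
    • CPU time is weighted by request
    • Memory: the OOMScore is assigned by QoS class
      • Guaranteed -998
      • Burstable 2~999
      • BestEffort 1000
    • Eviction
      • BestEffort Pods are evicted first
      • Performed by the kubelet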
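
To make the OOMScore ranges above concrete, here is a rough sketch of how a score could be derived per QoS class. It assumes a Burstable score that shrinks as the memory request grows relative to node capacity, clamped to the 2~999 range listed above. The function name and parameters are hypothetical, and the exact constants may differ between kubelet versions, so treat the numbers as illustrative.

```python
def oom_score_adj(qos, memory_request=0, node_memory=1, guaranteed_adj=-998):
    """Return an illustrative OOM score adjustment for a container.

    memory_request and node_memory are in bytes; guaranteed_adj uses the
    value listed in this article and is an assumption, not a fixed constant.
    """
    if qos == "Guaranteed":
        return guaranteed_adj
    if qos == "BestEffort":
        return 1000
    # Burstable: a larger share of node memory yields a lower (safer) score.
    score = 1000 - (1000 * memory_request) // node_memory
    lowest = 1000 + guaranteed_adj  # 2 with the default above
    return max(lowest, min(score, 999))
```

For example, a Burstable container requesting 1Gi on a 4Gi node gets a score of 750 in this sketch, so it is killed before Guaranteed containers but after BestEffort ones.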

 

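Resource Quotas

A ResourceQuota limits resource usage per Namespace. Once a request would push usage past the quota, creating the new object is rejected.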

apiVersion: v1
kind: ResourceQuota
metadata:
  name: demo-quota
  namespace: demo-ns
spec:
  hard:
    cpu: 1000
    memory: 200Gi
    pods: 10
  scopeSelector:
    matchExpressions:
    - operator: Exists
      scopeName: NotBestEffort

 

scope:

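  • Terminating/NotTerminating
  • BestEffort/NotBestEffort
  • PriorityClass

Pods should declare reasonable resource requirements:

  • CPU/Mem/EphemeralStorage/GPU

Use request and limit to choose a QoS class that fits each workload:

  • Guaranteed: sensitive workloads that need guarantees
  • Burstable: less sensitive workloads that need elasticity
  • BestEffort: workloads that can tolerate disruption

Configure a ResourceQuota for each namespace to prevent overuse and keep resources available to other users.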

 

Affinity Scheduling

Pod – Pod

  • Pod affinity (PodAffinity)
    • Must be co-located with certain Pods: requiredDuringSchedulingIgnoredDuringExecution
    • Prefer to be co-located with certain Pods: preferredDuringSchedulingIgnoredDuringExecution
  • Pod anti-affinity (PodAntiAffinity)
    • Must not be co-located with certain Pods: requiredDuringSchedulingIgnoredDuringExecution
    • Prefer not to be co-located with certain Pods: preferredDuringSchedulingIgnoredDuringExecution
  • operator
    • In
    • NotIn
    • Exists
    • DoesNotExist
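
The four operators listed above can be illustrated with a small matcher over a label map. This is a sketch under the assumption that an expression is a dict with `key`, `operator`, and optional `values`; the function name is hypothetical. Note that `NotIn` also matches when the key is absent.

```python
def matches_expression(expr, labels):
    """Evaluate one matchExpressions entry against a dict of labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {operator}")
```

A label selector with several expressions matches only when every expression matches, for example `all(matches_expression(e, labels) for e in expressions)`.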

 

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: In
            values:
            - v1
        topologyKey: "kubernetes.io/hostname"

 

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
          - key: k1
            operator: In
            values:
            - v1
        topologyKey: "kubernetes.io/hostname"

 

Pod – Node

  • NodeSelector
    • Must be scheduled onto a Node carrying certain labels
    • Map[string]string
  • NodeAffinity
    • Must be scheduled onto certain Nodes: requiredDuringSchedulingIgnoredDuringExecution
    • Prefer to be scheduled onto certain Nodes: preferredDuringSchedulingIgnoredDuringExecution
    • operator

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  nodeSelector:
    k1: v1
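
Because a nodeSelector is a plain string-to-string map, its matching rule is easy to sketch: every key/value pair must appear among the Node's labels. The function name below is hypothetical.

```python
def node_selector_matches(node_selector, node_labels):
    """True when every selector entry is present, with the same value, on the Node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())
```

With the manifest above, a Node labeled `k1=v1` is eligible, while a Node labeled `k1=v2` or lacking `k1` is not.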

 

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: k1
            operator: In
            values:
            - v1

 

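Node Taints and Tolerations

  • Taint (Node)
    • A Node can have multiple Taints
    • Effect (what the Taint does)
      • NoSchedule: new Pods are not scheduled onto the Node
      • PreferNoSchedule: the scheduler tries to avoid the Node
      • NoExecute: Pods that do not tolerate the Taint are evicted
  • Toleration (Pod)
    • A Pod can have multiple Tolerations
    • Effect may be empty, which matches all effects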
    • operator
      • Exists
      • Equal

 

apiVersion: v1
kind: Node
metadata:
  name: demo-node
spec:
  taints:
  - key: k1
    value: v1
    effect: NoSchedule

 

apiVersion: v1
kind: Pod
metadata:
  namespace: demo-ns
  name: demo-pod
spec:
  containers:
  - image: nginx:latest
    name: demo-container
  tolerations:
  - key: k1
    operator: Equal
    value: v1
    effect: NoSchedule

 

Advanced Scheduling Capabilities in Kubernetes

  • Priority and preemption scheduling
    • Priority
    • Preemption

 

Priority Scheduling Configuration

apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high
value: 10000
globalDefault: false
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: low
value: 100
globalDefault: false

 

Priorities:

  • Default priority: DefaultPriorityWhenNoDefaultClassExists=0
  • Maximum user-configurable priority: HighestUserDefinablePriority=1000000000
  • System-level priority: SystemCriticalPriority=2000000000
  • Built-in system priority classes
    • system-cluster-critical
    • system-node-critical

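Priority scheduling process:

  1. Pod2 and Pod1 enter the scheduling queue in that order; neither has been scheduled yet
  2. When scheduling runs, the PriorityQueue dequeues the higher-priority Pod1 first and schedules it
  3. Once Pod1 is scheduled, Pod2 is scheduled in the next round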
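
The queue ordering in the steps above can be sketched with a heap. This is a toy model, not the scheduler's PriorityQueue: the class below is hypothetical, and it assumes Pod1 uses the `high` class (10000) and Pod2 the `low` class (100) from the configuration shown earlier.

```python
import heapq
import itertools


class PriorityQueue:
    """Pops the highest-priority Pod first; arrival order breaks ties."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()

    def push(self, name, priority):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._arrival), name))

    def pop(self):
        return heapq.heappop(self._heap)[2]


queue = PriorityQueue()
queue.push("pod2", 100)    # arrives first, lower priority
queue.push("pod1", 10000)  # arrives second, higher priority
order = [queue.pop(), queue.pop()]
print(order)
```

Although Pod2 arrived first, Pod1 leaves the queue first because of its higher priority.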

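Priority preemption process:

  1. Pod2 is scheduled first and, once scheduled, runs on Node1
  2. Pod1 is scheduled afterwards; scheduling fails because Node1 lacks resources, so preemption begins
  3. The preemption algorithm selects Pod2 as the victim that yields to Pod1
  4. Pod2 is evicted from Node1 and Pod1 is scheduled onto Node1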
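
The preemption steps above can be approximated with a very small victim-selection sketch. It is a deliberate simplification of the real algorithm: it considers a single Node and CPU only, evicts the lowest-priority Pods first, and uses hypothetical names and tuple layouts.

```python
def pick_victims(free_cpu, running, incoming):
    """Choose lower-priority Pods to evict so that `incoming` fits on the Node.

    running:  list of (name, priority, cpu) tuples for Pods on the Node
    incoming: (name, priority, cpu) tuple for the Pod being scheduled
    Returns a list of victim names, or None if preemption cannot help.
    """
    _, priority, needed = incoming
    # Only Pods with a strictly lower priority may be preempted.
    candidates = sorted((p for p in running if p[1] < priority), key=lambda p: p[1])
    victims = []
    for name, _, cpu in candidates:
        if free_cpu >= needed:
            break
        victims.append(name)
        free_cpu += cpu
    return victims if free_cpu >= needed else None
```

In the scenario above, Node1 has no free CPU and runs Pod2 at priority 100, so scheduling Pod1 at priority 10000 selects Pod2 as the only victim.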

 
