In-Depth Analysis of Kubernetes Scheduling

Shingai Zivuku
11 min read · Sep 12, 2023

Kubernetes is a distributed system for managing and orchestrating containerized applications. One of its key capabilities is scheduling containers onto cluster nodes automatically.


In this article, I will provide an in-depth look at how Kubernetes scheduling works:

  1. The scheduling process
  2. Scheduling algorithms
  3. Scheduler components
  4. Scheduling in action

What is Kubernetes Scheduling?

Kubernetes scheduling refers to the automated process of placing containerized workloads, known as Pods, onto nodes within a Kubernetes cluster. The Kubernetes scheduler is the core component that handles this placement process.

When a new Pod is created, it is initially unscheduled. The scheduler will then match the Pod to an appropriate node based on the Pod’s resource requirements, hardware constraints, affinity rules, current node utilization, and other factors. The goal is to optimally place Pods for performance and high availability.

The scheduler continuously monitors the cluster and handles scheduling as new Pods are created or existing Pods are rescheduled. Key Kubernetes resources like Node and Pod spec objects provide the scheduler with cluster information and scheduling instructions.

How Kubernetes Scheduling Works

The Kubernetes scheduler follows a multi-step process to automatically assign Pods to the most suitable nodes. Here is an overview of how Kubernetes scheduling works:

Pod Creation: Developers or system admins create Pod YAML definition files that specify the containers, volumes, and resource requirements for the workload. These Pods are submitted to the Kubernetes API server.

Scheduler Monitoring: The kube-scheduler process, running on the control plane, watches the API server for newly created Pods that are not yet scheduled onto a node.

Node Selection: For each unscheduled Pod, the scheduler runs algorithms to evaluate the fitness of each node, scoring nodes based on factors like resource usage, hardware constraints, and affinity policies. It selects an optimal node to host the Pod.

Pod Binding: The scheduler updates the API server, binding the Pod to the chosen node. This scheduling information is stored in etcd for persistence.

Pod Launch: The kubelet agent on the target node sees the newly bound Pod via the API server and launches it using the container runtime. If the node later fails, a replacement Pod (created by a controller such as a ReplicaSet) is scheduled onto a different node.
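For concreteness, here is a minimal Pod manifest of the kind submitted in the first step (the names and image are illustrative). At creation time its spec.nodeName field is empty; the scheduler fills it in when it binds the Pod:

apiVersion: v1
kind: Pod
metadata:
  name: example-pod          # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.25        # any container image
    resources:
      requests:              # used by the scheduler to find capacity
        cpu: "250m"
        memory: "64Mi"
# spec.nodeName is unset on creation; the scheduler sets it at binding time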

Other Factors

When the Kubernetes scheduler selects nodes for Pods, it takes into account the following factors:

Node resources: The scheduler needs to consider the resource usage of the node, such as CPU, memory, disk, etc.

Pod resource requirements: The scheduler needs to consider the resource requirements of Pods, such as CPU, memory, and disk.

Affinity and anti-affinity: The scheduler can select nodes according to the affinity and anti-affinity rules specified by the Pod.

Node taints and tolerations: Nodes can set taints to repel Pods; only Pods that declare a matching toleration may be scheduled onto a tainted node (see the sketch below).

Node labels: The scheduler can use node labels to select the best node for a Pod (for example, via a nodeSelector).
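To illustrate the taint mechanism mentioned above: suppose a node has been tainted with dedicated=gpu:NoSchedule (both key and value here are illustrative). Only Pods that declare a matching toleration, like the one below, may be scheduled onto it:

apiVersion: v1
kind: Pod
metadata:
  name: tolerating-pod       # illustrative name
spec:
  containers:
  - name: app
    image: nginx:1.25
  tolerations:
  - key: "dedicated"         # matches the node taint's key
    operator: "Equal"
    value: "gpu"             # matches the node taint's value
    effect: "NoSchedule"     # tolerates the NoSchedule effect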

Scheduling Algorithms

A scheduling algorithm is the method used to decide which node a Pod should be placed on. The kube-scheduler does not expose these as standalone, selectable algorithms; instead, it runs a filter-and-score pipeline whose scoring plugins can be configured to realize different placement strategies. The strategies below are common in scheduling systems, and several map onto configurable kube-scheduler behavior:

Random placement: Schedule the Pod onto any feasible node. This is the simplest possible strategy, but because it ignores load, it rarely produces good placements.

Least loaded (spreading): Schedule the Pod onto the node carrying the least load. This keeps all nodes evenly utilized; it corresponds to the NodeResourcesFit plugin's LeastAllocated scoring strategy, which is the kube-scheduler default.

Greedy (most free resources): Schedule the Pod onto the node that can offer it the most resources, giving the workload the largest headroom to run.

Best fit (bin packing): Schedule the Pod onto the node that satisfies the Pod's requirements while leaving the least unused capacity. This uses resources efficiently; it corresponds to the MostAllocated scoring strategy.

Weighted load: Weight each node's load by the relative importance of its resources, then pick the node with the lowest weighted load. The NodeResourcesBalancedAllocation plugin applies a related idea, favoring nodes whose CPU and memory utilization remain balanced after placement.
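In practice, such strategies are selected through the kube-scheduler's configuration file rather than chosen as standalone algorithms. A minimal sketch, assuming a recent Kubernetes version that serves the kubescheduler.config.k8s.io/v1 API, switching resource scoring to bin packing:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: MostAllocated    # bin packing; the default is LeastAllocated
        resources:
        - name: cpu
          weight: 1
        - name: memory
          weight: 1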

Scheduling Components

The scheduling component in Kubernetes is responsible for scheduling Pods to nodes in the cluster. This is done to achieve load balancing and maximize resource utilization.

The scheduling component consists of two main parts:

  1. The scheduler: This is the core component that selects nodes for Pods. The scheduler takes into account factors such as node resource utilization, Pod resource requirements, affinity/anti-affinity rules, and other criteria to make its decisions.
  2. The scheduler extender: This is an external webhook that extends the scheduler with additional scheduling policies and rules. During filtering and scoring, the kube-scheduler calls out to the extender over HTTP, so it can veto candidate nodes or adjust their scores, for example to enforce custom placement rules that the built-in plugins cannot express (a configuration sketch follows below).
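An extender is registered in the scheduler's configuration file, and kube-scheduler then calls its HTTP endpoints during filtering and scoring. A minimal sketch, with a placeholder endpoint URL:

apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
extenders:
- urlPrefix: "http://scheduler-extender.example:8888"  # placeholder endpoint
  filterVerb: "filter"          # called to veto candidate nodes
  prioritizeVerb: "prioritize"  # called to score the remaining nodes
  weight: 1                     # how much the extender's scores count
  enableHTTPS: false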

Scheduler Implementations

Kubernetes supports multiple scheduler implementations, including:

Default Scheduler: The kube-scheduler process that comes built-in with Kubernetes. It handles core scheduling functions like filtering nodes and ranking node suitability for pod placement.

Custom Scheduler: Fully separate scheduler implementations that can completely replace the default kube-scheduler. Custom schedulers allow implementing non-standard scheduling logic from scratch.

Scheduler Extenders: Plugins that extend the default kube-scheduler to modify or enhance its logic through additional code and policies. Extenders augment the existing scheduler rather than replace it.
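A Pod opts into a custom scheduler by naming it in spec.schedulerName; Pods that omit the field are handled by the default kube-scheduler. The scheduler name below is illustrative:

apiVersion: v1
kind: Pod
metadata:
  name: custom-scheduled-pod            # illustrative name
spec:
  schedulerName: my-custom-scheduler    # must match the custom scheduler's name
  containers:
  - name: app
    image: nginx:1.25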

In short, the scheduling component sits at the core of every Kubernetes cluster, balancing load and maximizing resource utilization. Choose the scheduler implementation and scheduling strategy that fit your actual needs to get the most out of the cluster's resources.

Scheduling in Action

When we create a Pod, it is added to the scheduler's queue of unscheduled Pods. The scheduler takes Pods from this queue, checks each Pod's scheduling requirements, and then schedules it onto the most suitable node.

The scheduler’s scheduling decisions are based on factors such as node resource utilization, Pod resource requirements, and affinity/anti-affinity rules.

The following are the detailed steps of the Pod scheduling process:

  1. Get the scheduling requirements of the Pod
  2. Choose the appropriate node
  3. Assign the Pod to a node
  4. Save the scheduling information
  5. Start the Pod

Get the Scheduling Requirements of the Pod

The scheduler first retrieves the Pod’s scheduling requirements. This includes information such as the Pod’s container image, resource requirements, affinity/anti-affinity rules, and other criteria. The scheduler uses this information to make its scheduling decisions.

Choose the Appropriate Node

The scheduler selects the most suitable node from the available nodes based on the Pod's scheduling requirements and the cluster's resource conditions, weighing factors such as node resource utilization, the Pod's resource requests, and affinity/anti-affinity rules. If no available node can meet the Pod's needs, the Pod remains in the Pending state until one can.

Assign Pods to Nodes

Once the scheduler has selected a node for a Pod, it binds the Pod to that node by setting the Pod object's spec.nodeName field to the node's name. The kubelet, the agent that runs on each node, watches for Pods bound to its node and picks them up from there.
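Under the hood, binding is performed by submitting a Binding object to the Pod's binding subresource on the API server, which then records the node name on the Pod. Roughly, with illustrative names:

apiVersion: v1
kind: Binding
metadata:
  name: my-pod          # the Pod being bound
  namespace: default
target:
  apiVersion: v1
  kind: Node
  name: node-1          # the node chosen by the scheduler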

Save Scheduling Information

The binding between the Pod and the node is persisted through the API server into the etcd database, along with the rest of the Pod object: its name, namespace, assigned node name, and related metadata. Because this state is durable, the assignment survives restarts of the scheduler and other control plane components.

Start Pod

When a Pod is assigned to a node, the kubelet on that node retrieves the Pod's specification from the API server (which in turn reads it from etcd). This includes the Pod's container image, resource requirements, and other relevant data. The kubelet then starts the Pod's containers through the container runtime, and the Pod begins running.

How does the scheduler choose the most suitable node?

When the scheduler selects the most suitable node, it will select it based on certain strategies and algorithms. Here are some of the main considerations for the scheduler in node selection:

  • Resource utilization
  • Pod resource requirements
  • Affinity and anti-affinity rules
  • Node labels and annotations
  • Node load balancing

Resource utilization: The scheduler checks the resource usage of each node in the cluster, including CPU, memory, disk, and network usage. By default, scoring favors nodes with lower resource utilization, which helps ensure the Pod can get enough resources.

Pod resource requirements: The scheduler will check the resource requirements of the Pod, including requirements for CPU and memory. The scheduler will select nodes that can meet the resource requirements of the Pod to prevent the Pod from not running properly due to insufficient resources.

Affinity and anti-affinity rules: The scheduler checks the Pod’s affinity and anti-affinity rules. Affinity rules specify which nodes a Pod should be scheduled on, while anti-affinity rules specify which nodes a Pod should not be scheduled on. The scheduler will select nodes based on these rules to ensure that Pods are scheduled to the most appropriate node.

Node labels and annotations: The scheduler checks the node's labels and annotations. A node's labels identify its features and properties, while annotations can provide additional information about it. The scheduler can use this information to select the most suitable node.

Node load balancing: The scheduler attempts to balance the load across the cluster so that one node is not overloaded and other nodes are underutilized. The scheduler selects nodes with optimal load balancing to ensure maximum resource utilization in the cluster.

How does the scheduler check the resource requirements of Pods?

When the scheduler checks the resource requirements of a Pod, it reads the following fields in the Pod's definition:

spec.containers[*].resources.requests
spec.containers[*].resources.limits

These fields specify the Pod's resource requests and resource limits:

  • resources.requests specifies the minimum amount of resources the Pod needs, such as its CPU and memory demand. The scheduler uses requests to find a node with enough unreserved capacity; a Pod may use more than it requested if the node has spare resources.
  • resources.limits specifies the maximum amount of resources the Pod may use, such as caps on CPU and memory. A container that exceeds its CPU limit is throttled, while one that exceeds its memory limit is terminated (OOM-killed) and restarted according to the Pod's restart policy.

The scheduler selects the most suitable node based on these values. Note that placement decisions are driven by the requests (limits are enforced at runtime by the kubelet and container runtime): the scheduler weighs each node's available capacity against the Pod's requests to ensure the Pod can get enough resources to run normally.

How to set the affinity and anti-affinity rules of Pod?

In Kubernetes, affinity and anti-affinity are rules used to specify the relationship between Pods and Nodes. By setting affinity and anti-affinity rules, the scheduler can allocate Pods to the most appropriate nodes.

Affinity rules are used to indicate which nodes a Pod should be scheduled to, while anti-affinity rules are used to indicate which nodes a Pod should not be scheduled to. Here’s how to set affinity and anti-affinity rules for Pods:

Set affinity and anti-affinity rules through labels

Pod affinity and anti-affinity rules are driven by labels. First, define a label selector in the Pod, then use node affinity and Pod affinity to specify affinity rules, and Pod anti-affinity to specify anti-affinity rules (node anti-affinity is expressed with negative operators such as NotIn in a node affinity term).

For example, here is a Pod definition using a label selector and a node affinity rule:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: my-label
            operator: In
            values:
            - my-value

In the example above, matchExpressions defines a label selector that matches nodes labeled my-label=my-value. The nodeAffinity rule then requires the Pod to be scheduled onto a node carrying that label.

Set affinity and anti-affinity rules through topology fields

You can use topology to specify Pod affinity and anti-affinity rules. Topology refers to how nodes are grouped, for example into zones, regions, or racks. Using a topology key ensures that Pods are scheduled relative to the topology domains in which other Pods already run.

For example, here is a Pod definition using Topology and affinity rules:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
  affinity:
    podAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: rack

In the example above, matchLabels specifies a label selector that matches Pods labeled app=my-app. The topologyKey then requires this Pod to be scheduled onto a node in the same rack (as identified by the nodes' rack label) as those already-scheduled app=my-app Pods.
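Anti-affinity uses the same structure under podAntiAffinity. For example, here is a sketch that keeps this Pod off any node already running a Pod labeled app=my-app, using the standard kubernetes.io/hostname topology key:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod-2             # illustrative name
spec:
  containers:
  - name: my-container
    image: my-image
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: my-app
        topologyKey: kubernetes.io/hostname  # treat each node as its own domain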

Note that correctly setting affinity and anti-affinity rules requires understanding the cluster topology and resource usage, otherwise, Pods may not be scheduled correctly.

How to set resource limits and requests for Pods?

In Kubernetes, you can control the amount of resources used by a Pod by setting resource limits and requests for the Pod. Resource Limits specify the maximum amount of resources that a Pod can use, while Resource Requests specify the minimum amount of resources required when a Pod starts.

Setting resource limits and requests for Pods can ensure that Pods do not use too many resources when running, and can improve the success rate of Pod scheduling in the cluster.

Here is an example of how to set resource limits and requests for a Pod:

apiVersion: v1
kind: Pod
metadata:
  name: my-pod
spec:
  containers:
  - name: my-container
    image: my-image
    resources:
      limits:
        cpu: "1"
        memory: "500Mi"
      requests:
        cpu: "0.5"
        memory: "250Mi"

In the example above, the container my-container sets both resource limits and requests: a limit of 1 CPU core and 500 MiB of memory, and a request of 0.5 CPU core and 250 MiB of memory.

Resource limits and requests are most commonly set for CPU and memory. CPU is measured in cores and memory in bytes, and they can be specified using the following formats:

  • CPU: Use a fraction or an integer number of cores, or millicores. For example, 0.5 (equivalently 500m) means half a core, 1 means one core, and 2 means two cores.
  • Memory: Express memory in bytes or with binary suffixes such as Ki, Mi, and Gi. For example, 1Gi means 1 gibibyte (2^30 bytes) and 500Mi means 500 mebibytes.
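For instance, the following container-level fragment (placed under spec.containers[*]) shows the two notations side by side, requesting half a core and half a gibibyte:

resources:
  requests:
    cpu: "500m"        # 500 millicores, the same as 0.5
    memory: "512Mi"    # 512 MiB, i.e. half of 1Gi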

In addition to CPU and memory, resource limits and requests can also be set for other resources, such as GPUs and ephemeral storage. Different resources use different units depending on the resource type; the details are covered in the Kubernetes documentation.
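For example, assuming the NVIDIA device plugin is installed (it registers the nvidia.com/gpu resource name), a container requests one GPU as shown below; note that extended resources such as GPUs are specified under limits:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod              # illustrative name
spec:
  containers:
  - name: cuda-app
    image: my-gpu-image      # illustrative image
    resources:
      limits:
        nvidia.com/gpu: 1    # requires the device plugin to be installed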

In summary, setting resource limits and requests for a Pod controls the amount of resources the Pod uses. They can be set for CPU and memory as well as other resource types. When choosing values, weigh the Pod's actual needs against the cluster's available capacity so that the Pod neither fails to run properly nor crowds out other Pods.

Conclusion

Kubernetes scheduling is one of the core functions of Kubernetes. The kube-scheduler component implements scheduling functions like filtering Nodes and ranking suitability for Pod placement using configurable algorithms. Key factors in scheduling decisions include Node resource capacity, Pod resource demands, affinity/anti-affinity rules, and custom pluggable logic via extenders.

You can choose between default or custom scheduling algorithms tailored to your cluster’s needs. Monitoring kube-scheduler logs provides visibility into how Pods are being assigned to Nodes. Overall, intelligent Kubernetes scheduling enables efficient workload distribution and resource optimization in cluster environments. Automated placement of Pods onto healthy Nodes is a critical function that administrators rely on for production Kubernetes deployments.

In summary, Kubernetes scheduling brings automation, flexibility, and resilience to operating containerized applications across many Nodes. The scheduler is fundamental for taking full advantage of Kubernetes’ distributed capabilities.
