Taming the Wild West of Research Computing: How Policies Saved Us a Thousand Headaches


Originally posted on Medium


At IBM Research, one of our focus areas is the convergence of High-Performance Computing (HPC), Artificial Intelligence (AI), hybrid cloud and quantum computing — a field we call accelerated discovery.

To support the diverse needs of the researchers working in this area, our team manages two large bare-metal OpenShift® clusters. These clusters are a critical resource, providing access to powerful computing capabilities, including GPUs, which are essential for conducting this type of research. However, the inherently bursty nature of research workloads, which often concentrate near project and conference deadlines, complicates the task of managing resources effectively.



The Challenge

Researchers, while skilled in their respective domains, often lack hands-on experience with Kubernetes and OpenShift®, leading to suboptimal resource utilization and unforeseen consequences.

For example, they may inadvertently monopolize GPU resources by launching interactive pods that persist indefinitely after their workload completes, or by launching CPU-only Jobs with high parallelism that end up co-scheduled on our subset of GPU-enabled nodes. The result is resource contention: these nodes become CPU- or memory-constrained, effectively idling the GPUs and blocking the execution of GPU-dependent workloads.

As cluster administrators, it is our duty to implement resource governance and provide tooling that fosters responsible and efficient behavior among our users.



The Goals

To address these challenges, we set out to:

  • Establish a streamlined process for defining, modifying, and rolling out groups, resources, and policies allocated to each research project, ensuring consistency and visibility across administrators, and enabling quick responses to changing demands and requirements.
  • Implement restrictions on user actions that go beyond traditional Role-Based Access Control (RBAC) rules, including restrictions on pod-level actions such as exec-ing into containers.
  • Automatically validate and amend user-created resources to comply with our policies and set of best practices.
  • Provide a fair-share scheduling experience for GPU access, leveraging a batch-scheduling paradigm familiar to researchers from traditional HPC environments.



The Components of our Solution

To minimize custom development and maximize the use of proven, community-driven solutions, we prioritized the adoption of open-source projects in our implementation. Drawing from the Cloud Native Computing Foundation (CNCF) Landscape, we selected Kyverno as our policy engine for designing and enforcing fine-grained access controls and resource governance, Kueue for implementing job queuing and scheduling, and Argo CD for our GitOps-based configuration management and automation framework:

  • Argo CD is a declarative, GitOps-based continuous delivery tool for Kubernetes that automates and streamlines application deployment and lifecycle management, ensuring a transparent and auditable process. By creating Argo Applications that are connected to a source Git repository, Argo CD watches for changes and maintains a consistent and up-to-date deployment state, aligning with our goal of simplifying and automating rollouts of configuration changes.
  • Kyverno offers a robust Policy-as-Code (PaC) framework for Kubernetes and cloud-native environments, enabling the management of the entire policy lifecycle. Policies are defined as Kubernetes resources, using a declarative YAML syntax that aligns with existing Kubernetes configuration files and eliminates the need to learn an additional policy language such as Open Policy Agent's Rego. Kyverno supports four policy types (validate, mutate, generate, and cleanup), which collectively enable us to achieve two of our objectives.
  • Kueue is a cloud-native job queueing system for batch, HPC, AI/ML, and similar workloads within a Kubernetes cluster. By partitioning available resources into separate, configurable Queues, Kueue enables the creation of a multi-tenant batch service with hierarchical resource sharing and quotas. Kueue’s quota-based scheduling logic determines when jobs should be queued and when and where they should be executed, allowing us to achieve our last objective.



The Solution



Automating Cluster and Project Configuration with GitOps and Argo CD

In line with Kubernetes’ best practices, we provision a dedicated Namespace and associated Group for each research project, complete with necessary RoleBindings, ResourceQuotas, and other critical configurations. However, our manual update process had introduced knowledge gaps among administrators, highlighting the need for a more streamlined and transparent approach.
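
As a rough sketch, the per-project resources look something like the following (all names below are placeholders, and the exact bindings and quotas vary from project to project):

apiVersion: v1
kind: Namespace
metadata:
  name: project-alpha          # placeholder project name
---
apiVersion: user.openshift.io/v1
kind: Group
metadata:
  name: project-alpha-users
users:
  - researcher-one             # placeholder user IDs
  - researcher-two
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: project-alpha-edit
  namespace: project-alpha
subjects:
  - kind: Group
    apiGroup: rbac.authorization.k8s.io
    name: project-alpha-users
roleRef:
  kind: ClusterRole
  apiGroup: rbac.authorization.k8s.io
  name: edit                   # built-in edit role, scoped to the project namespace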

To address this challenge, we implemented a GitOps approach, designating a centralized Git repository as the single source of truth for our clusters. This repository contains all definitions and configurations, ensuring consistency, auditability, and version control. To streamline management, we created a set of reusable “base” manifests that apply to all namespaces, supplemented by overlays stored in dedicated folders for each namespace.

This enabled us to leverage Argo CD’s ApplicationSets and git directory Generators, allowing us to create and update namespaces with minimal effort, typically requiring only a simple Kustomization file. This setup also facilitated easy deployment and tracking of quota increases and policy exceptions, making it simple to respond to user requests and deadlines.

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: user-projects
spec:
  goTemplate: true
  goTemplateOptions: ["missingkey=error"]
  syncPolicy:
      automated: {}
  generators:
  - git:
      repoURL: https://gitsource/org/repo.git
      revision: HEAD
      directories:
      - path: user-projects/*
  template:
    metadata:
      name: '{{.path.basename}}'
    spec:
      project: "default"
      source:
        repoURL: https://gitsource/org/repo.git
        targetRevision: HEAD
        path: '{{.path.path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{.path.basename}}'
      syncPolicy:
        syncOptions:
        - CreateNamespace=true
        automated: {}
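
With this in place, onboarding a new project usually amounts to adding a directory under user-projects/ that contains a small Kustomization; a minimal sketch, assuming shared base manifests live in a base/ directory at the repository root (the paths and file names are illustrative):

# user-projects/project-alpha/kustomization.yaml (illustrative layout)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: project-alpha
resources:
  - ../../base                  # shared manifests applied to every project
patches:
  - path: resource-quota.yaml   # optional project-specific quota override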



Preventing GPU Resource Hogging with Kyverno Policies

We then addressed the most significant obstacle to accessing GPUs on our cluster: the proliferation of interactive pods. Initially, our goal was to prevent users from running commands that occupy resources indefinitely, such as sleep infinity or tail -f /dev/null. We realized, however, that rather than blocking these commands directly, removing the ability to exec into the resulting "sleeping" pods and use them interactively would remove the incentive for this behavior altogether.

To solve this, we created a Kyverno ClusterPolicy that restricts pod exec access to cluster administrators only. By enforcing this policy, we effectively encouraged researchers to adopt a more declarative and Kubernetes-native approach to pod creation, where pods are designed to execute a specific command and then complete, rather than persist indefinitely.

apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: exec-only-from-cluster-admins
spec:
  validationFailureAction: Enforce
  background: false
  rules:
    - name: exec-only-from-cluster-admins
      context:
        - name: exec-namespace-exceptions
          configMap:
            name: exec-namespace-exceptions
            namespace: kyverno-admin
      match:
        any:
          - resources:
              kinds:
                - Pod/exec
      preconditions:
        all:
          - key: "{{ request.operation || 'BACKGROUND' }}"
            operator: Equals
            value: CONNECT
          - key: "{{ request.clusterRoles.contains(@, 'cluster-admin') }}"
            operator: NotEquals
            value: true
          - key: "{{ request.namespace }}"
            operator: AnyNotIn
            value:
              '{{ "exec-namespace-exceptions".data."exceptions" | parse_json(@) }}'
      validate:
        message:
          Executing a command in a container is forbidden for Pods running in
          this Namespace. To request an exception, reach out to the admins.
        deny: {}



Reserving GPU Nodes for GPU Workloads Automatically with Affinity Rules

To further optimize resource utilization and prevent over-allocation, we implemented resource quotas to establish a baseline enforcement mechanism that limits pods and resources in use by a single namespace. This ensured that each namespace had a defined ceiling for resource consumption, preventing any one namespace from monopolizing cluster resources.
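
Such a quota is a standard Kubernetes ResourceQuota in each project namespace; a minimal sketch (the actual limits differ from project to project):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: project-quota
  namespace: project-alpha        # placeholder project namespace
spec:
  hard:
    pods: "50"
    requests.cpu: "64"
    requests.memory: 256Gi
    requests.nvidia.com/gpu: "4"  # extended resources are limited via requests.*
    limits.cpu: "64"
    limits.memory: 256Gi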

However, we soon realized that this alone was not sufficient to address the issue of GPU-enabled nodes being starved of CPU and memory by non-GPU pods. To mitigate this, we created a Kyverno Policy that dynamically adds affinity rules to non-GPU pods, preventing them from being scheduled on GPU-enabled nodes.

This policy is particularly effective because it operates transparently, without requiring any manual intervention or configuration from users. By automatically enforcing this scheduling constraint, we can ensure that GPU-enabled nodes are reserved for workloads that actually require GPU resources, while non-GPU pods are scheduled on more suitable nodes, thereby optimizing overall resource utilization and reducing contention for GPU resources.

apiVersion: kyverno.io/v1
kind: Policy
metadata:
  name: avoid-gpu-nodes-for-pods-with-no-gpus
spec:
  background: false
  rules:
    - name: avoid-gpu-nodes-for-pods-with-no-gpus
      match:
        any:
          - resources:
              kinds:
                - Pod
      context:
        - name: gpu_requests
          variable:
            # Use || [0] to return an array with just a zero if the field does not exist.
            # This is required because the sum function in Kyverno requires at least one element
            # in the array, which would otherwise be empty.
            jmesPath: 'request.object.spec.containers[].resources.requests."nvidia.com/gpu" || [0]'
            default: [0]
      preconditions:
        all:
          - key: "{{ sum(gpu_requests) }}"
            operator: Equals
            value: 0
      mutate:
        patchStrategicMerge:
          spec:
            # Keep pods that request no GPUs off nodes exposing the
            # nvidia.com/gpu.product label.
            affinity:
              nodeAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  nodeSelectorTerms:
                    - matchExpressions:
                        - key: nvidia.com/gpu.product
                          operator: DoesNotExist



Enabling Fair Sharing of GPU Resources with Kueue

To ensure fair and efficient access to GPU resources, we utilized Kueue to define ResourceFlavors for each of our distinct GPU types, enabling granular categorization and management of these resources. Corresponding ClusterQueues were then created for each ResourceFlavor, allowing users to submit jobs that specifically target a particular GPU type. Furthermore, we established a generic GPU ClusterQueue that leverages Kueue’s cohort mechanism and resource borrowing capabilities, providing users with the flexibility to access any available GPU without the need to specify a particular type.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-a100-80gb-pcie"
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-A100-80GB-PCIe
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "nvidia-h100-80gb-pcie"
spec:
  nodeLabels:
    nvidia.com/gpu.product: NVIDIA-H100-PCIe
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "a100-cluster-queue"
spec:
  cohort: gpu-cohort
  namespaceSelector:
    matchLabels:
      kueue-enable: gpu-cluster-queue
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "nvidia-a100-80gb-pcie"
          resources:
            - name: "cpu"
              nominalQuota: 394
            - name: "memory"
              nominalQuota: 800Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 8
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "h100-cluster-queue"
spec:
  cohort: gpu-cohort
  namespaceSelector:
    matchLabels:
      kueue-enable: gpu-cluster-queue
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "nvidia-h100-80gb-pcie"
          resources:
            - name: "cpu"
              nominalQuota: 160
            - name: "memory"
              nominalQuota: 400Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "gpu-cluster-queue"
spec:
  cohort: gpu-cohort
  namespaceSelector:
    matchLabels:
      kueue-enable: gpu-cluster-queue
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: "nvidia-a100-80gb-pcie"
          resources:
            - name: "cpu"
              nominalQuota: 0
            - name: "memory"
              nominalQuota: 0Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 0
        - name: "nvidia-h100-80gb-pcie"
          resources:
            - name: "cpu"
              nominalQuota: 0
            - name: "memory"
              nominalQuota: 0Gi
            - name: "nvidia.com/gpu"
              nominalQuota: 0
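
On the consumer side, each participating namespace gets a LocalQueue that points at one of these ClusterQueues, and users submit batch Jobs that reference the queue by name; a minimal sketch (the namespace, image, and resource requests are illustrative):

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: gpu-queue
  namespace: project-alpha        # placeholder project namespace
spec:
  clusterQueue: gpu-cluster-queue
---
apiVersion: batch/v1
kind: Job
metadata:
  name: train-model
  namespace: project-alpha
  labels:
    kueue.x-k8s.io/queue-name: gpu-queue   # tells Kueue which LocalQueue to use
spec:
  suspend: true                   # Kueue admits and unsuspends the Job once quota is available
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: train
          image: registry.example.com/project-alpha/train:latest   # placeholder image
          resources:
            requests:
              cpu: "8"
              memory: 64Gi
              nvidia.com/gpu: 1
            limits:
              nvidia.com/gpu: 1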



Conclusion

The adoption of Kyverno, Kueue, and GitOps has been instrumental in transforming our OpenShift® clusters into a more stable, efficient, and fair shared environment for our colleagues. By establishing and implementing clear policies and controls on resource utilization, we have significantly reduced the occurrence of issues related to GPU usage, such as resource contention and over-allocation. This, in turn, has improved the overall user experience, enabling researchers to focus on their work without interruptions or delays. Moreover, the automation and standardization provided by these practices have greatly reduced the administrative burden on our team, freeing us up to concentrate on our own research.



TL;DR

We enhanced the user experience for researchers using our clusters by leveraging open source technologies, resulting in:

  • Streamlined cluster management: we leveraged Argo CD to establish a fully automated GitOps setup, ensuring seamless configuration management across our clusters and projects.
  • GPU resource optimization: we utilized Kyverno to prevent users from monopolizing GPU resources with interactive pods and dynamically added affinity rules to CPU-only pods, ensuring efficient node utilization.
  • Fair-sharing and HPC-like experience: we deployed Kueue to provide a fair-sharing, HPC-like experience for our users, promoting efficient resource allocation and minimizing contention.

These improvements not only reduced resource contention and over-allocation but also increased efficiency, fairness, and user satisfaction, while decreasing the administrative burden on our team.


