Kubernetes Orchestration Engineer – GPU Hypercomputing & AI Workloads

NETS-International Group
Riyadh
Date posted: 3/7/2025

Job Description

Riyadh, Saudi Arabia

contractual

Company Description

NETS is a leading global Solutions Provider and Systems Integrator dedicated to empowering the future through our integrated approach and commitment to delivering Innovative, Intelligent, and Integrated Solutions (NETS 3 I’s) Effectively, Efficiently, and Economically (NETS 3 E’s). Our service portfolio covers three verticals, namely Infrastructure, Digital, and Managed Solutions. NETS Services include Access Networks (Fixed and Wireless), Enterprise Data Networks, Cloud Solutions, Cyber Security, Automation, Resource Outsourcing, and Managed Services. NETS brings over four decades of proven domain expertise, service specialization, and industry leadership, having delivered more than 3,000 successful projects. Our 1,000+ highly skilled professional staff, collaboration with over 50 leading global technology partners, 100+ NETS OEM Partners, and NETS Reach, with offices in the UK, UAE, USA, Saudi Arabia, and Pakistan, have made us the preferred trusted partner to over 200 long-standing customers, including Fortune 500 companies, across 25+ countries.

Job Description

Role Overview:

We are seeking a highly skilled Kubernetes Orchestration Engineer to lead the deployment and management of GPU-optimized Kubernetes environments that power AI/ML and hypercomputing workloads. This role is critical to ensuring scalable, reliable, and high-performance infrastructure across on-premises and hybrid cloud environments.

As a core member of our infrastructure engineering team, you will work at the intersection of container orchestration, GPU resource management, and AI application scaling, enabling large-scale distributed training and inference across GPU clusters.

Key Responsibilities

Deploy and manage production-grade Kubernetes clusters tailored for GPU-intensive workloads in AI/ML environments.

Build and maintain GPU node pools using the NVIDIA Device Plugin, CRI-O, or other container runtimes that support GPU scheduling.

Orchestrate containerized distributed model training using Kubernetes and frameworks like PyTorch, TensorFlow, or Hugging Face.

Design and implement Kubernetes Operators to automate scaling, health monitoring, and lifecycle management of AI/ML services.

Use Helm charts to deploy and scale GPU-accelerated applications across clusters.

Integrate observability tools such as Prometheus, Grafana, and Kibana to monitor GPU, memory, and pod resource utilization.

Configure and troubleshoot CNI plugins (e.g., Calico, Flannel, Cilium) to ensure high-performance pod networking.

Collaborate with DevOps, AI, and infrastructure teams to support hybrid and multi-cloud Kubernetes deployments.

Requirements

Required Skills & Expertise:

Strong experience with Kubernetes (K8s) and container orchestration in production environments.

Expertise in managing GPU workloads in Kubernetes using NVIDIA GPU Operator, vGPU, and device plugin configurations.

Proficiency with container runtimes such as Docker and CRI-O, and orchestration tools like Helm and Kubernetes Operators.

Solid understanding of networking within Kubernetes and service mesh integration (e.g., Istio, Linkerd).

Familiarity with hybrid/multi-cloud Kubernetes platforms (e.g., GKE, EKS, AKS).

Strong scripting and automation skills (e.g., YAML, Helm templating, Bash, Python).

Certifications

Certified Kubernetes Administrator (CKA) – Required

Certified Kubernetes Application Developer (CKAD) – Preferred

NVIDIA Certified Kubernetes Specialist – Nice to have
