
SENIOR SITE RELIABILITY ENGINEER / KUBERNETES (REMOTE)

Pragmatike
Full-time, Senior

Job description

Location: Fully remote, EU timezone (CET ±2h)
Start date: ASAP
Languages: Fluent English is mandatory
Industry: Cloud Computing

We are hiring at Pragmatike to expand our team and drive the growth of our internal projects. Our focus is on developing cutting-edge solutions in Cloud Computing while fostering a culture of collaboration and innovation. Joining us means being part of a passionate team where your ideas and skills directly contribute to shaping tomorrow's technologies. If you're excited about working on ambitious projects in a dynamic and flexible environment, we'd love to hear from you!

RESPONSIBILITIES
- Operate and maintain Linux-based infrastructure (Debian/Ubuntu).
- Deploy, manage, and scale Kubernetes clusters across bare-metal, virtualized, and on-prem environments.
- Oversee the full cluster lifecycle: upgrades, node pools, networking, storage, and security hardening.
- Implement automation for provisioning and operations using Ansible, Bash/Python, and GitOps workflows.
- Design and maintain networking architecture, including VLANs, L2/L3 routing, VPNs, and multi-site connectivity.
- Build automated deployment workflows (PXE boot, Preseed, cloud-init).
- Deploy and maintain observability stacks (Prometheus/Grafana, Loki, ELK, Graylog).
- Lead incident response and escalation activities across the platform.
- Improve system availability and reduce latency at all levels.
- Define and implement SLOs/SLIs at multiple infrastructure levels (physical network/hardware, platform virtualization, software services).
- Optimize alerting and monitoring pipelines to provide actionable insights.
- Establish and maintain on-call schedules to ensure coverage across timezones.
- Develop Standard Operating Procedures (SOPs) for repeatable operations and maintenance tasks.
- Coordinate physical maintenance for Policlouds (periodic maintenance, hardware issues, DC-Ops).
- Manage virtualization and orchestration layers (OpenStack, Proxmox, VMware).
- Help develop and maintain the overall architecture across all products.
- Plan resources for future initiatives, accounting for demand and growth projections.
- Work with development teams to improve overall quality and optimize resource utilization.
- Collaborate with cross-functional stakeholders (Hivenet, Policloud, Customer Success teams).

REQUIREMENTS
- Expert-level, hands-on experience operating Kubernetes in production environments.
- Strong network engineering skills (VLANs, L2/L3 routing, VPNs, multi-site connectivity); this is essential for the role.
- Strong proficiency with Linux systems administration (Debian/Ubuntu).
- Solid understanding of networking fundamentals and ability to design complex network architectures.
- Experience building and maintaining automation workflows (Ansible, Bash/Python, Git-based).
- Experience with observability stacks such as Prometheus, Grafana, ELK, Loki, or Graylog.
- Background with virtualization technologies (OpenStack, Proxmox, VMware).
- Experience with bare-metal provisioning and MAAS (Metal as a Service).
- Strong understanding of distributed systems and container orchestration.
- Process-oriented mindset with the ability to develop SOPs and operational procedures from scratch.
- Experience with incident response, escalation procedures, and on-call rotations.
- Ability to work autonomously in a fast-paced, engineering-driven environment.
- Strong technical skills combined with alignment to team values.

NICE TO HAVE
- Experience with service mesh (Istio, Linkerd) or advanced CNI implementations.
- Knowledge of Cloudflare APIs, DNS automation, or tunnel configurations.
- Experience with GPU infrastructure, node preparation, or resource scheduling.
- Familiarity with security best practices (RBAC, firewalls, network policies).
- Exposure to IT asset management or license tracking workflows.
- Experience working in multi-timezone environments and coordinating across distributed teams.
- Background establishing reliability practices and SRE frameworks in growing organizations.

Why Join Us:
- 100% remote work with flexible hours
- High-impact role with autonomy and ownership
- Collaborative and international engineering team
- Cutting-edge tech stack with a strong focus on reliability and automation