DEVOPS ENGINEER - PLATFORM RELIABILITY (REMOTE, CHINA)

BJAK Career

Full-time, mid-level

Job description

BJAK’s automation systems support customer journeys across quote generation, policy issuance, claims, payments, renewals and insurer integrations. These systems are business-critical: reliability, uptime and safe deployments directly impact customers and operations.

We're looking for a DevOps Engineer based in China to strengthen platform reliability, improve infrastructure resilience and ensure BJAK’s AI automation systems run safely and consistently at scale. This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to build and maintain highly reliable production systems.

THE MISSION

Build and maintain a highly reliable platform for BJAK’s AI automation systems by improving infrastructure stability, deployment safety and operational resilience across all services.

WHAT YOU’LL OWN

- Own and improve platform reliability across production systems and environments.
- Manage cloud infrastructure, deployment pipelines and runtime environments.
- Design and improve CI/CD workflows to enable safe, fast and repeatable releases.
- Build and enhance monitoring, alerting, logging and system observability.
- Lead incident response efforts and perform structured root cause analysis.
- Improve system resilience through redundancy, failover and recovery mechanisms.
- Work with engineering teams to reduce production risk through better deployment and system design practices.
- Strengthen infrastructure security, access control and secrets management.
- Support reliability for business-critical workflows across multiple countries and services.
- Continuously improve operational discipline, uptime and system stability.

WHAT WE'RE LOOKING FOR

- Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.
- Strong understanding of cloud infrastructure, CI/CD pipelines and deployment systems.
- Experience with production monitoring, alerting and incident management practices.
- Ability to troubleshoot infrastructure and production issues in a structured and calm manner.
- Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).
- Experience supporting business-critical or high-availability systems.
- Strong ownership mindset during incidents and operational failures.
- Practical judgment on reliability, performance, security and cost trade-offs.
- Comfortable working closely with engineering teams in fast-paced environments.
- Low ego, disciplined and focused on long-term system stability.

BONUS POINTS

- Experience with AWS, GCP, Azure or similar cloud platforms.
- Experience with Kubernetes, Docker or container orchestration.
- Experience with infrastructure-as-code tools (Terraform, Ansible, Pulumi, etc.).
- Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
- Experience with zero-downtime deployments, blue-green or canary release strategies.
- Experience supporting distributed or high-traffic production systems.
- Strong knowledge of security best practices in cloud infrastructure.
- Experience in fintech, insurance or regulated industry environments.
- Contributions to platform reliability or infrastructure scaling initiatives.

THE KIND OF BUILDER WE WANT

- Calm and structured under pressure, especially during production incidents.
- Hands-on with infrastructure and deeply familiar with production systems.
- Thinks in failure modes, system risks and recovery paths.
- Proactive in preventing incidents, not just reacting to them.
- Strong focus on uptime, reliability and operational discipline.
- Careful and deliberate when making production changes.
- Builds systems engineers can trust to deploy and operate safely.

THIS ROLE IS NOT FOR

- People who only react after systems fail instead of preventing them.
- Engineers who are careless with production changes or access control.
- Individuals who ignore monitoring, alerting or operational discipline.
- People who make risky infrastructure changes without proper evaluation.
- Candidates who cannot stay calm during incidents or outages.

SUCCESS IN THIS ROLE

You'll be successful if you can:

- Improve platform uptime, stability and deployment safety.
- Reduce production incidents and infrastructure-related failures.
- Strengthen monitoring, alerting and system visibility across services.
- Enable engineers to deploy with confidence and lower operational risk.
- Improve resilience of BJAK’s AI automation platform as it scales.

WHY JOIN BJAK

- Build Reliable AI Platform Infrastructure – Support systems powering end-to-end insurance automation.
- High-Impact Engineering – Solve real-world reliability and scaling challenges.
- Global Engineering Team – Work with experienced engineers across multiple countries.
- Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
- International Exposure – Build systems used across Southeast Asian markets.
- Learning & Development Budget – Support continuous technical growth and certifications.
- High Ownership Environment – Strong autonomy over infrastructure and reliability strategy.
- Modern Engineering Culture – Focus on stability, observability and engineering excellence.
- Competitive Compensation – Attractive salary package based on experience and impact.

INTERVIEW PROCESS

We assess infrastructure depth, reliability thinking and production problem-solving ability. The process usually includes application review, two interviews and a technical scenario or systems discussion.