DEVOPS ENGINEER - PLATFORM RELIABILITY (REMOTE, CHINA)
Bjakcareer
Full-time, mid-level
Job description
BJAK’s automation systems support customer journeys across quote generation, policy issuance, claims, payments, renewals and insurer integrations. These systems are business-critical: reliability, uptime and safe deployments directly affect customers and operations.
We're looking for a DevOps Engineer based in China to strengthen platform reliability, improve infrastructure resilience and ensure BJAK’s AI automation systems run safely and consistently at scale.
This is a fully remote position where you'll collaborate closely with our Malaysia-based engineering, product and operations teams to build and maintain highly reliable production systems.
THE MISSION
Build and maintain a highly reliable platform for BJAK’s AI automation systems by improving infrastructure stability, deployment safety and operational resilience across all services.
WHAT YOU’LL OWN
- Own and improve platform reliability across production systems and environments.
- Manage cloud infrastructure, deployment pipelines and runtime environments.
- Design and improve CI/CD workflows to enable safe, fast and repeatable releases.
- Build and enhance monitoring, alerting, logging and system observability.
- Lead incident response efforts and perform structured root cause analysis.
- Improve system resilience through redundancy, failover and recovery mechanisms.
- Work with engineering teams to reduce production risk through better deployment and system design practices.
- Strengthen infrastructure security, access control and secrets management.
- Support reliability for business-critical workflows across multiple countries and services.
- Continuously improve operational discipline, uptime and system stability.
WHAT WE'RE LOOKING FOR
- Experience in DevOps, SRE, platform engineering or infrastructure-focused roles.
- Strong understanding of cloud infrastructure, CI/CD pipelines and deployment systems.
- Experience with production monitoring, alerting and incident management practices.
- Ability to troubleshoot infrastructure and production issues in a structured and calm manner.
- Strong understanding of reliability engineering principles (availability, fault tolerance, recovery).
- Experience supporting business-critical or high-availability systems.
- Strong ownership mindset during incidents and operational failures.
- Practical judgment on reliability, performance, security and cost trade-offs.
- Comfortable working closely with engineering teams in fast-paced environments.
- Low ego, disciplined and focused on long-term system stability.
BONUS POINTS
- Experience with AWS, GCP, Azure or similar cloud platforms.
- Experience with Kubernetes, Docker or container orchestration.
- Experience with infrastructure-as-code tools (Terraform, Ansible, Pulumi, etc.).
- Experience with observability stacks (Prometheus, Grafana, ELK, Datadog, etc.).
- Experience with zero-downtime deployments, blue-green or canary release strategies.
- Experience supporting distributed or high-traffic production systems.
- Strong knowledge of security best practices in cloud infrastructure.
- Experience in fintech, insurance or regulated industry environments.
- Contributions to platform reliability or infrastructure scaling initiatives.
THE KIND OF BUILDER WE WANT
- Calm and structured under pressure, especially during production incidents.
- Hands-on with infrastructure and deeply familiar with production systems.
- Thinks in failure modes, system risks and recovery paths.
- Proactive in preventing incidents, not just reacting to them.
- Strong focus on uptime, reliability and operational discipline.
- Careful and deliberate when making production changes.
- Builds systems engineers can trust to deploy and operate safely.
THIS ROLE IS NOT FOR
- People who only react after systems fail instead of preventing failures.
- Engineers who are careless with production changes or access control.
- Individuals who ignore monitoring, alerting or operational discipline.
- People who make risky infrastructure changes without proper evaluation.
- Candidates who cannot stay calm during incidents or outages.
SUCCESS IN THIS ROLE
You'll be successful if you can:
- Improve platform uptime, stability and deployment safety.
- Reduce production incidents and infrastructure-related failures.
- Strengthen monitoring, alerting and system visibility across services.
- Enable engineers to deploy with confidence and lower operational risk.
- Improve resilience of BJAK’s AI automation platform as it scales.
WHY JOIN BJAK
- Build Reliable AI Platform Infrastructure – Support systems powering end-to-end insurance automation.
- High-Impact Engineering – Solve real-world reliability and scaling challenges.
- Global Engineering Team – Work with experienced engineers across multiple countries.
- Fully Remote – Work remotely from China while collaborating with our Malaysia-based teams.
- International Exposure – Build systems used across Southeast Asian markets.
- Learning & Development Budget – Support continuous technical growth and certifications.
- High Ownership Environment – Strong autonomy over infrastructure and reliability strategy.
- Modern Engineering Culture – Focus on stability, observability and engineering excellence.
- Competitive Compensation – Attractive salary package based on experience and impact.
INTERVIEW PROCESS
We assess infrastructure depth, reliability thinking and production problem-solving ability. The process usually includes application review, two interviews and a technical scenario or systems discussion.