
Senior Site Reliability Engineer
About Us
With the move to the cloud, Kubernetes has become widely adopted by DevOps and Platform Engineering teams, but it has also added complexity. While scaling Kubernetes at Intuit, the Akuity founders started building Argo CD to streamline Kubernetes adoption. Argo CD helps developers own, understand, and manage their Kubernetes deployments via GitOps.
Today, Argo CD is the third most popular project in the CNCF (Cloud Native Computing Foundation) and is used by 70% of companies running Kubernetes in production. Users include Intuit, BlackRock, Tesla, Major League Baseball, Peloton, and many more.
The team founded Akuity in 2021 to enable enterprises to ship software faster and more reliably with modern GitOps best practices. The Akuity Platform enables teams to manage development and deployment across hundreds – if not thousands – of Kubernetes clusters from a single control plane.
Our mission is to simplify the software delivery process so DevOps and Platform Engineering teams can move fast and deploy code effortlessly without fear of breaking things.
The Role
We are looking for a Senior SRE to help us keep the Akuity platform running at the level our enterprise customers expect. This is a high-ownership role: you won’t just respond to incidents; you’ll shape how we define and defend reliability across the entire platform. You’ll work closely with engineering, infrastructure, and product to build the systems and culture that let us scale with confidence.
What You’ll Own
Platform Reliability & SLAs
Own SLI/SLO/SLA definitions for the Akuity SaaS platform and drive continuous improvement against them
Design, instrument, and maintain observability systems (metrics, logs, traces) across multi-region AWS infrastructure
Identify reliability gaps, lead blameless post-mortems, and close the loop with permanent fixes
Partner with engineering teams to build reliability into new features before they ship to production
On-Call & Incident Response
Participate in an on-call rotation and act as incident commander for high-severity production events
Build and maintain runbooks, escalation paths, and incident playbooks to keep mean time to resolution low
Drive improvements to alerting fidelity: reduce noise, increase signal, eliminate toil
Lead post-incident reviews with clear timelines, root cause analysis, and follow-through on action items
What We’re Looking For
Required
5+ years of SRE, platform engineering, or production operations experience in a SaaS environment
Deep hands-on Kubernetes expertise (scheduler, networking, storage, autoscaling) with strong debugging skills
Strong AWS fundamentals across compute (EC2, EKS), networking (VPC, NLB, Route53), storage (S3, RDS), and IAM
Experience defining and operating against SLOs in production (error budgets, reliability metrics)
Proficiency with observability tooling (Prometheus, Grafana, OpenTelemetry, Datadog, or equivalent)
Solid scripting and automation skills (Go, Python, Bash, or similar)
Strong written communication: clear runbooks, sharp incident reports, thoughtful post-mortems
Must live in a US time zone (Pacific through Eastern); this includes Canada and other regions in those time zones
Strong Advantage
Experience with Argo CD, Kargo, or GitOps-based delivery workflows
Familiarity with multi-region, multi-cluster Kubernetes deployments
Experience with compliance-adjacent infrastructure (SOC 2, ISO 27001, HIPAA, PCI DSS)
Background operating infrastructure for other platform or developer tooling companies
Our Stack
Kubernetes (EKS): multi-region, enterprise-grade clusters serving Argo CD and Kargo workloads
AWS: primary cloud provider across all production and DR environments
Argo CD & Kargo: GitOps delivery tools we build and run ourselves
Prometheus, Grafana, OpenTelemetry: observability stack
Terraform & GitOps-driven infrastructure management
What We Offer
Competitive compensation, commensurate with experience
Equity participation in a well-funded, growing company
Fully remote: work from anywhere in a US time zone (Pacific through Eastern), including Canada and other regions in those time zones
Home office stipend and equipment budget
Flexible time off and a culture that respects it
Work directly with the engineers who built Argo CD and Kargo — you’ll learn a lot here
US-based employees receive full benefits, including comprehensive health, dental, and vision coverage. Candidates based outside the US will be engaged as contractors.
