About Bobsled

We are looking for an experienced Site Reliability Engineer to drive the reliability, scalability, and operational excellence of Bobsled's data-sharing platform. You'll apply your expertise to complex technical and business challenges, ensuring that our infrastructure and pipelines are highly available and performant.

Please note: This role is open exclusively to candidates located in the Central Time (CT) or Eastern Time (ET) zones in the USA or Canada, as you will be working closely with European engineers.

Key Responsibilities

Infrastructure Reliability: Design, build, and maintain highly available, scalable infrastructure using modern Infrastructure as Code (IaC) tools such as Terraform or Pulumi.
Multi-Cloud Operations: Manage and optimize Bobsled's infrastructure across GCP, AWS, Azure, and other cloud providers.
CI/CD Pipelines: Build and maintain robust pipelines that ensure safe, reliable, and automated deployment of infrastructure and applications.
Monitoring & Observability: Develop comprehensive monitoring, logging, and alerting systems to ensure visibility into infrastructure and application health.
Incident Response: Establish and continuously improve incident response processes, ensuring rapid detection and resolution of production issues.
Performance Optimization: Identify and resolve performance bottlenecks, and drive capacity planning and cost optimization across our cloud environments.
On-Call & Reliability: Participate in on-call rotations and drive improvements to reduce toil and improve system reliability.

Preferred Qualifications

8+ years of experience in SRE, DevOps, or Platform Engineering, managing distributed cloud-native systems in production.
Proficiency in Infrastructure as Code (IaC) tools such as Terraform or Pulumi.
Experience with TypeScript or other modern programming languages (our stack is heavily TypeScript-based).
Strong background in cloud platforms (GCP, AWS, Azure); hands-on experience with at least one is required.
Experience with monitoring and observability tools (Prometheus, Grafana, Datadog, etc.).
Understanding of CI/CD best practices and experience with pipeline tools like GitHub Actions.
Strong troubleshooting skills and experience with incident management.

Nice To Have

Experience with cloud security solutions, IAM, secrets management (HashiCorp Vault, GCP Secret Manager), identity-based authentication, and Zero Trust.
Knowledge of security compliance frameworks (SOC 2, ISO 27001).
Experience with Kubernetes, serverless architectures, or container security.
Exposure to data platforms such as Snowflake and Databricks, and to managed Spark services such as AWS EMR and GCP Dataproc.

Compensation & Benefits

Competitive Salary and Equity
Health Insurance: Medical (100% paid), dental, and vision benefits for you and your family
Generous PTO policy and paid parental leave
Fully upgraded Apple MacBook and 4K monitor (for engineering team only)
Home office stipend of $1,000
Flexible work hours in a fully remote work environment
Fully sponsored individual coaching for all employees to help foster a culture of personal reflection and growth (optional but encouraged)

Site Reliability Engineer