Site Reliability Engineer

Why deepset

At deepset, we’re making sovereign AI accessible to every organization. With Haystack, thousands of developers build advanced AI applications, while our Enterprise platform helps teams scale across use cases, users, and environments. We’re remote-first, flexible, and built on trust and ownership. You’ll work alongside strong technical talent, take on meaningful challenges, and help turn complex AI into solutions that are practical, reliable, and ready for the real world.

What you will do

Build and operate infrastructure

Design, configure, and evolve infrastructure that runs both in our cloud and inside customer environments (SaaS, private cloud, on-prem).

Make self-hosted production-ready

Help us deliver a production-grade, self-hosted platform that can be deployed on any Kubernetes setup in weeks, not months.

Drive automation & platform maturity

Improve CI/CD pipelines, GitHub workflows, and GitOps setups so teams can ship faster with confidence.

Reduce complexity and cost

Continuously simplify systems and optimize infrastructure spend without compromising performance or reliability.

Shape how we build

Champion best practices in reliability, scalability, and security across the organization, not as rules, but as working systems.

Requirements

  • 2-5 years of experience working with large-scale production infrastructure

  • Experience with distributed or service-oriented architectures

  • Hands-on expertise with:

    • AWS

    • Kubernetes

    • CI/CD and GitOps (e.g. ArgoCD)

  • Working knowledge of Infrastructure as Code (Terraform preferred)

  • Solid troubleshooting skills: you can debug across systems, not just within one layer

  • A pragmatic mindset: you balance speed, simplicity, and reliability

  • Ownership and accountability: you take responsibility for systems end-to-end

  • Ability to work independently while staying aligned with the team’s goals

Nice to have

  • Familiarity with observability stacks (e.g. Datadog, Prometheus)

  • Experience optimizing cloud costs at scale

  • Interest or experience in Machine Learning / LLM systems

  • Experience using AI agents to improve developer experience and platform tooling

  • Contributions to SRE practices like postmortems, SLIs/SLOs, and reliability engineering culture

Benefits

  • Remote-first setup with flexible hours & tech of your choice

  • 30 days of vacation, plus extra days for family sick leave

  • Competitive salary & stock options for every team member

  • Monthly sports & mental health support allowance with Oliva

  • Annual learning & development budget

  • Monthly team socials & in-person meetups

  • Dog-friendly Berlin HQ
