Site Reliability Engineer, Remote Job, May 2025

About HighLevel:

HighLevel is a cloud-based, all-in-one white-label marketing and sales platform that helps marketing agencies, entrepreneurs, and businesses elevate their digital presence. We are proud to support a global and growing community of over 2 million businesses, from marketing agencies to entrepreneurs to small businesses and beyond. Our platform empowers users across industries to streamline operations, drive growth, and crush their goals.

HighLevel processes over 15 billion API hits and handles more than 2.5 billion message events every day. Our platform manages 470 terabytes of data distributed across five databases, operates a network of over 250 microservices, and supports over 1 million domain names.

Our People

With over 1,500 team members across 15+ countries, we operate in a global, remote-first environment. We are building more than software; we are building a global community rooted in creativity, collaboration, and impact. We take pride in cultivating a culture where innovation thrives, ideas are celebrated, and people come first, no matter where they call home.

Our Impact

Every month, our platform powers over 1.5 billion messages, helps generate over 200 million leads, and facilitates over 20 million conversations for the more than 2 million businesses we serve. Behind those numbers are real people growing their companies, connecting with customers, and making their mark - and we get to help make that happen.

Learn more about us on our YouTube channel or blog.

About the Role:

We are looking for a Site Reliability Engineer to join our team and help ensure the availability, performance, and scalability of our critical systems. You will work closely with development and operations teams to automate processes, enhance system reliability, and improve observability.

Requirements:

Experience: 4+ years in Site Reliability Engineering, DevOps, or Cloud Infrastructure roles
Cloud Expertise: Hands-on experience with GCP and AWS
Infrastructure as Code (IaC): Terraform, Helm, or equivalent tools
Containerisation & Orchestration: Docker, Kubernetes (GKE)
Observability: Experience with Prometheus, Grafana, ELK, OpenTelemetry, or similar monitoring/logging tools
Programming/Scripting: Proficiency in Python or Bash/shell scripting. Basic understanding of consuming APIs and parsing and manipulating JSON
CI/CD Pipelines: Hands-on experience with Jenkins, GitHub Actions, ArgoCD, or similar tools
Incident Management: Experience with on-call rotations, SLOs, SLIs, SLAs, escalation policies, and incident resolution
Databases: Experience monitoring MongoDB, Redis, Elasticsearch (ES), queue-based systems, etc.

Responsibilities:

Develop and improve observability using monitoring, logging, tracing, and alerting tools (Prometheus, Grafana, ELK, OpenTelemetry, etc.)
Optimize system performance, troubleshoot incidents, and conduct post-mortems/RCA to prevent future issues
Collaborate with developers to enhance application reliability, scalability, and performance
Drive cost optimisation efforts in cloud environments
Monitor multiple databases and data stores (MongoDB, Redis, Elasticsearch (ES), queue-based systems, etc.)

EEO Statement:

The company is an Equal Opportunity Employer. As an employer subject to affirmative action regulations, we invite you to voluntarily provide the following demographic information. This information is used solely for compliance with government recordkeeping, reporting, and other legal requirements. Providing this information is voluntary and refusal to do so will not affect your application status. This data will be kept separate from your application and will not be used in the hiring decision.
