Location: Los Angeles, CA (Marina del Rey)   |   Full-Time   |   $150,000 - $170,000
SRE, Site Reliability Engineer, EKS, GKE, Kubernetes, GCP, AWS, GitOps, ArgoCD, GitHub Actions, Python, React, Node.js, Observability, Prometheus, Grafana, OpenTelemetry, Datadog, ML Workloads, Infrastructure, AI Engineer, Back End Engineer, Staff Engineer
**About Zefr:**
Zefr is the leading global technology company enabling responsible marketing in walled garden social environments. Zefr’s solutions empower brands to manage their content adjacency on scaled platforms such as YouTube, Meta, TikTok, and Snap, in accordance with industry standard frameworks. Through its patented AI technology, Zefr offers brands and agencies more accurate and transparent solutions for social walled gardens. Zefr is an E-Verified equal opportunity employer that embraces diversity and inclusion in the workplace, committed to building a team reflective of a variety of backgrounds, skills, and perspectives.

**The Role: Senior Site Reliability Engineer**
We are seeking a Senior Site Reliability Engineer who combines technical expertise with strong leadership and a passion for continuous improvement and innovation. This role is crucial for ensuring the continuous health and efficiency of our infrastructure, including the systems that support critical ML workloads. You will directly contribute to Zefr’s commitment to providing a consistently high-quality user experience. We expect to learn from you and have you learn from us, fostering an environment of mutual growth.

**Key Responsibilities:**
- Ensure the continuous health and efficiency of Zefr’s infrastructure.
- Oversee and optimize infrastructure supporting critical Machine Learning (ML) workloads.
- Drive continuous improvement and innovation within the SRE domain.
- Contribute directly to providing a consistently high-quality user experience for our clients.
- Apply strong leadership in technical projects and initiatives.

**Technical Stack & Skills:**
Candidates should have strong experience with:
- **Clouds:** Google Cloud Platform (GCP), Amazon Web Services (AWS)
- **Kubernetes:** Specifically GKE and EKS
- **GitOps:** ArgoCD, GitHub Actions
- **Languages and frameworks:** Python, React, Node.js
- **Observability:** Prometheus, Grafana, OpenTelemetry, Datadog

**Ideal Candidate:**
The ideal candidate is a highly technical individual with a strong background in site reliability engineering, coupled with excellent leadership abilities. They possess a deep passion for continuous improvement, innovation, and maintaining high-quality, efficient systems. Experience with critical ML workloads and the listed tech stack is essential. We are looking for someone who is eager to contribute their expertise while also being open to learning and developing new skills within our dynamic environment.
Post Date: May 30, 2025