Senior Site Reliability Engineer, Supply

Foundry (Headquarters: Palo Alto / San Francisco Bay Area)

Location: Palo Alto / San Francisco Bay Area | Full-Time | $170,000 - $230,000

Python, Bash, Ansible, Grafana, Prometheus, Linux, GPU, Cloud, Staff Engineer, Cyber Security

**About Foundry:** Foundry is revolutionizing the AI compute landscape by providing a flexible and accessible cloud platform for machine learning developers. Our infrastructure is designed to handle the most demanding AI workloads, from training large language models to running real-time inference at scale.

**About The Role:** As a Senior Site Reliability Engineer on the Supply team, you will be responsible for managing the provisioning, scaling, and health of Foundry's global GPU fleet. You will design and implement systems that ensure the reliability, availability, and performance of our compute infrastructure across multiple cloud environments and on-premises data centers. This role involves working closely with engineering teams to build robust monitoring, alerting, and incident response frameworks, guided by Service Level Indicators (SLIs) and Service Level Objectives (SLOs). You will also serve as the primary technical liaison with our compute partners, ensuring smooth operations and resolving technical challenges as they arise.

**Key Responsibilities:**
- Design and implement systems for GPU provisioning, spot bidding, and node pool health management.
- Build and maintain monitoring and observability tools using platforms like Grafana and Prometheus.
- Define and enforce SLIs and SLOs to ensure system reliability and performance.
- Lead or participate in incident response and root cause analysis to prevent recurrence of issues.
- Collaborate with internal teams and external partners to resolve technical challenges and optimize infrastructure operations.
- Develop automation scripts and tools to streamline routine maintenance and operational tasks.

**Required Skills and Ideal Candidate:**
- Bachelor’s degree in Computer Science, Computer Engineering, or a related field, or equivalent professional experience.
- Experience with Linux systems administration and command-line interfaces.
- Proficiency in scripting and automation (Python, Bash, or similar).
- Deep understanding of key infrastructure metrics and data center operations.
- Strong written and verbal communication skills, with the ability to translate technical concepts for various audiences.
- Project management experience and the ability to handle multiple priorities effectively.
- Demonstrated experience in incident response and root cause analysis.

We are looking for a highly skilled SRE who can thrive in a fast-paced environment and has a passion for building reliable systems.

Post Date: July 17, 2025