Foundry Job - General Software Engineer, Infrastructure

About Foundry: Foundry is revolutionizing the AI compute landscape by providing a flexible and accessible cloud platform for machine learning developers. Our infrastructure is designed to handle the most demanding AI workloads, from training large language models to running real-time inference at scale. We are building a new type of public cloud that makes accessing high-performance GPUs as simple as flipping a switch, eliminating the traditional friction of procurement, limited quotas, and clunky tooling.

About The Role: As a General Software Engineer on the Infrastructure team, you will play a pivotal role in building and scaling the core systems that power Foundry’s platform. You will own the design and development of critical components for our batch and streaming workload engine, which enables ML engineers to train, fine-tune, and serve state-of-the-art models with ease. This role involves deep 0→1 ownership: you will drive the technical strategy and execution for projects focused on GPU scheduling, fault-tolerant execution, and rich job dependency graphs (DAGs). Your work will directly impact the scalability, reliability, and performance of Foundry’s infrastructure, which serves customers ranging from startups to large enterprises.

Key Responsibilities:

Lead the design and development of systems that optimize GPU scheduling and resource allocation for AI workloads.
Build fault-tolerant execution frameworks that ensure high availability and reliability for long-running ML jobs.
Design and implement rich job dependency graphs (DAGs) to manage complex workflows in distributed environments.
Collaborate with cross-functional teams, including product, security, and operations, to align on technical requirements and ensure seamless integration.
Drive the implementation of innovative features that improve the efficiency and usability of our infrastructure.
Mentor junior engineers and contribute to a culture of engineering excellence.

Required Skills and Ideal Candidate:

Proven experience building scalable distributed systems, particularly for AI/ML infrastructure.
Strong expertise in languages such as Go, Python, or Java, with a focus on concurrent and distributed programming.
Deep understanding of Kubernetes and container orchestration, as well as experience with cloud platforms (AWS, GCP, Azure).
Familiarity with GPU computing, including scheduling, resource management, and optimization techniques.
Excellent problem-solving skills and the ability to tackle complex technical challenges with creative solutions.
A track record of ownership and leadership in 0→1 projects.

We are looking for someone who is passionate about infrastructure and eager to make a significant impact on the future of AI compute.