Senior Backend Developer - Machine Learning Platform
Location: Montreal, QC, Canada; Remote; Hybrid | Full-Time
Backend
Machine Learning Platform
Java
Python
AWS
Kubernetes
ECS
CI/CD
Terraform
Distributed Systems
Scalable Systems
Developer Experience
Observability
AI Engineer
Back End Engineer
Data Engineer
Company:
Coveo is a Quebec-based company and a pioneer in AI-powered search and recommendations. Coveo uses AI technologies and intelligent search to personalize every digital experience for customers, partners, dealers, and employees. Coveo combines unified content, unified behavioral interaction data, and machine learning to deliver relevant information and recommendations across every business interaction, making websites, e-commerce, contact centers, and intranets efficient, effortless, and content-rich, and thereby boosting conversion.

At Coveo, we build AI-powered systems that bring hyper-personalization to every enterprise experience, whether it's e-commerce, customer service, or internal workplace tools. Our platform unifies relevance across the stack, helping users find what they need, when they need it. We offer competitive salaries, top-tier equipment, great offices, and a team that genuinely values your input.

About the Role:
Are you ready to play a key role in simplifying the deployment of Machine Learning models? Are you passionate about cloud-native technologies, automation, and developer experience? Coveo is looking for a Senior Developer to join our ML Model Training team!

Your mission? Build and evolve the infrastructure that powers thousands of model rebuilds every day, enabling our Data Scientists and Applied Scientists to train their models at scale, reliably, and efficiently. You’ll focus on simplifying the ML model development experience, designing tools and systems that abstract away complexity while giving internal users the visibility and control they need to iterate with confidence. Your work will directly impact how fast, how often, and how safely models are trained across Coveo’s AI ecosystem.

Responsibilities:
- Design simple, powerful interfaces and tools that enable scientists to configure and launch training jobs with minimal friction, whether for prototyping or production.
- Develop smart orchestration and automation mechanisms to prioritize, batch, retry, or roll back training jobs at massive scale.
- Champion performance and cost optimization, helping the organization manage compute usage responsibly without sacrificing velocity or quality.
- Implement robust observability layers so users can monitor performance, track metrics, and debug model training workflows.
- Collaborate with applied scientists and data engineers to understand their needs, improve developer experience, and continuously raise the bar on reliability and efficiency.

Qualifications/Technical Skills:
- 8+ years of backend or platform engineering experience, with a strong focus on cloud-native and distributed systems (Java, Python, AWS preferred).
- Deep understanding of scalable system design, CI/CD, and container orchestration (Kubernetes, ECS, or similar).
- Familiarity with Terraform and Kubernetes for infrastructure automation and container orchestration.
- Experience building ML infrastructure or internal platforms used by data science teams.
- Hands-on experience with job orchestration, task queues, or pipelines at scale.
- Solid grasp of observability practices (logs, metrics, traces) and of how to build systems that are easy to monitor and debug.

Who You Are/Ideal Candidate:
- Passion for developer experience: you care about ergonomics and eliminating friction for internal users.
- A problem-solving mindset, with the resourcefulness to analyze, optimize, and debug large-scale systems while continuously embracing a growth-oriented approach.
Post Date: June 3, 2025