trivago Job - Senior Site Reliability Engineer - Observability

trivago is a leading metasearch engine that compares accommodation offers from many booking sites, helping travelers find the best deals. We are seeking a Senior Site Reliability Engineer to join our Observability team. In this role, you will play a critical part in designing, building, and maintaining our monitoring and observability infrastructure. You will be responsible for ensuring that our systems remain reliable, performant, and scalable, with observability as your main focus.

Your responsibilities will include designing and implementing monitoring solutions using tools like Prometheus, Thanos, and Grafana, as well as setting up alerting so that issues are detected and resolved quickly. You will also help improve our observability practices across tracing, logging, and metrics collection. This role requires strong technical skills in systems architecture and infrastructure management, along with a deep understanding of how to build scalable and reliable systems.

We are looking for someone with extensive experience in designing and managing large-scale distributed systems, particularly with containerization technologies like Docker and orchestration tools like Kubernetes. Proficiency in scripting languages (e.g., Python, Bash) and familiarity with cloud platforms (AWS, GCP, or Azure) are essential. Additionally, you should have a proven track record of building and maintaining observability pipelines and a strong understanding of SRE best practices.

If you thrive in a fast-paced environment and are passionate about building reliable systems, this role offers a unique opportunity to make a significant impact at a global company. You will work closely with engineering teams across the organization to ensure system reliability and drive innovation in our observability practices.

Join us and help shape the future of travel technology at trivago!