The goal of this role is to ensure the reliability, scalability, monitoring, and performance of our on-premises services. Responsibilities include designing and managing our infrastructure and implementing best practices. You will work within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.
Responsibilities
- Design and maintain monitoring infrastructure
- Create custom dashboards, alerts, and visualization solutions
- Implement distributed tracing and log aggregation systems
- Establish monitoring best practices and SLI/SLO frameworks
- Maintain security compliance for on-premises monitoring tools
- Automate deployment and configuration management
- Collaborate with development teams on application instrumentation
- Participate in on-duty rotations
Requirements
- Core Technologies: advanced Grafana, Prometheus (PromQL), OpenTelemetry, Elasticsearch
- Infrastructure: Linux administration, networking, on-premises security
- Programming: Python, Bash, or Go for automation
- Experience: 3+ years in monitoring/observability, 2+ years running Grafana/Prometheus in production, strong Linux system administration experience, and a proven track record with on-premises infrastructure solutions
- Security: enterprise security practices, compliance requirements
- Ability to balance technical trade-offs with business needs and prioritize effectively.
- Participation in on-duty rotations (24/7 incident support)
Key Deliverables
- Reduced MTTD/MTTR through effective monitoring
- Comprehensive observability across all systems
- Automated monitoring, deployment, and management
- Security-compliant monitoring practices
Languages
- English (C1)
- Additional languages: German, French, Dutch