Vacancy

Site Reliability Engineer

Brussels

The goal of this role is to ensure the reliability, scalability, monitoring, and performance of our on-premises services. Responsibilities include designing and managing our infrastructure and implementing best practices. You will work within cross-functional teams to improve systems and processes and to ensure uptime and efficiency.

  • Design and maintain monitoring infrastructure
  • Create custom dashboards, alerts, and visualization solutions
  • Implement distributed tracing and log aggregation systems
  • Establish monitoring best practices and SLI/SLO frameworks
  • Maintain security compliance for on-premises monitoring tools
  • Automate deployment and configuration management
  • Collaborate with development teams on application instrumentation
  • Participate in on-duty rotations

Requirements

  • Core Technologies
    • Advanced Grafana
    • Prometheus (PromQL)
    • OpenTelemetry
    • Elasticsearch
  • Infrastructure
    • Linux administration
    • Networking
    • On-premises security
  • Programming
    • Python, Bash, or Go for automation
  • Experience
    • 3+ years in monitoring/observability
    • 2+ years running Grafana/Prometheus in production
    • Strong Linux system administration experience
    • Proven track record with on-premises infrastructure solutions
  • Security
    • Enterprise security practices
    • Compliance requirements
  • Ability to balance technical trade-offs with business needs and prioritize effectively.
  • Participation in on-duty rotations (24/7 incident support)

Key Deliverables

  • Reduced MTTD/MTTR through effective monitoring

  • Comprehensive observability across all systems

  • Automated monitoring, deployment, and management

  • Security-compliant monitoring practices

Languages

  • English (C1).
  • Additional languages: German, French, Dutch.