Reliability Engineer

Odixcity Consulting

Lagos, Nigeria · Permanent

Published 1 month ago · Expires 3 weeks from now


Job description

A reputable organization seeks a dedicated individual for this role.

Position Overview:

We are looking for a proactive, hands-on Reliability Engineer to join our team. You will be crucial in ensuring our core services are stable, scalable, and efficient.

Key Responsibilities:

- Closely monitor system health, performance, and availability using tools like Grafana, Prometheus, Datadog, or New Relic. Respond to and resolve incidents.
- Lead and document post-incident reviews to identify root causes and preventive actions.
- Write scripts (Python, Bash) and use configuration management tools to automate operational tasks, deployments, and recovery procedures.
- Build the internal platforms and tools that make reliability a default for every engineering team: self-healing systems, automated canary analysis, and performance tracing at scale.
- Work with software teams to define Service Level Objectives (SLOs) and Error Budgets. Implement improvements to reduce manual toil, improve system resilience, and prevent recurring issues.
- Manage and optimize cloud resources (AWS, Google Cloud, or Azure) to ensure cost-effectiveness and performance. Implement Infrastructure as Code (IaC) principles.
- Lead the design and implementation of chaos engineering practices, disaster recovery automation, and capacity planning.

Requirements:

- 3-5 years of experience in a DevOps, SRE, Linux System Administration, or Backend Engineering role.
- Proficiency in a scripting language: Python or Go.
- Solid experience with cloud platforms (Azure, Google Cloud, AWS, etc.).
- Experience with containerization and orchestration (Docker, Kubernetes).
- Practical knowledge of monitoring/observability tools.
- Familiarity with CI/CD pipelines (GitLab CI, Jenkins, GitHub Actions).

Core Skills:

- Excellent problem-solving and troubleshooting skills under pressure.
- Strong understanding of network fundamentals (TCP/IP, DNS, HTTP/S).
- Knowledge of database performance and reliability (PostgreSQL, MySQL, MongoDB).
- A systematic approach to automation and a desire to eliminate manual work.
- Good communication skills to collaborate with both technical and non-technical teams.
- Understanding of security best practices in infrastructure.
