The ideal candidate will be responsible for ensuring the reliability, scalability, and performance of systems through incident management, automation, collaboration with developers and infrastructure teams, and continuous improvement initiatives. If you have a passion for building resilient systems, thrive in a fast-paced environment, and possess expertise in SRE/DevOps practices, we want to hear from you.
Key Responsibilities:
- Manage and respond to incidents, ensuring timely resolution and minimal impact on system performance and availability.
- Develop and implement automation tools and processes to streamline operational tasks and enhance system reliability.
- Collaborate closely with developers and infrastructure teams to optimize system architecture and improve deployment processes.
- Lead initiatives to continuously improve system reliability, scalability, and performance.
- Implement and maintain monitoring and alerting systems to ensure proactive detection of issues and timely response.
- Participate in the on-call rotation and contribute to the overall reliability of our systems.
Main Requirements:
- At least 5 years of experience in Site Reliability Engineering (SRE) or DevOps roles, with a proven track record of managing complex systems in production environments.
- Strong ability to design, build, and maintain scalable systems that meet performance and reliability requirements.
- Proficiency in Linux system administration and shell scripting.
- Experience with containerization technologies such as Docker and orchestration platforms like Kubernetes.
- Familiarity with cloud platforms such as AWS or Azure, including infrastructure provisioning and management.
- Solid understanding of Continuous Integration/Continuous Deployment (CI/CD) pipelines and observability concepts.
- Excellent communication and collaboration skills, with the ability to work effectively across teams and departments.