Job Description:
"1. System Reliability
• Lead efforts to enhance the reliability, availability, and performance of critical systems
• Perform in-depth analysis of system behavior, identifying areas for improvement and
implementing solutions
2. Automation Frameworks
• Design, implement, and maintain automation tools and frameworks to streamline operational
processes
• Drive the integration of observability automation into the CI/CD pipeline
3. Tooling and Technology Evaluation
• Evaluate and recommend new tools, technologies, and methodologies to enhance SRE
capabilities
• Stay abreast of industry trends and emerging technologies
4. Incident Management
• Lead incident response, ensuring timely and effective resolution of system issues
• Conduct post-incident reviews and implement preventive measures to mitigate future
occurrences
5. Scalability and Performance
• Collaborate with cross-functional engineering teams to conduct capacity planning and scalability
assessments, and design solutions that handle current and future growth
• Implement and maintain monitoring solutions to proactively identify and address
capacity-related issues
• Implement performance optimization strategies to keep system response times low
6. Collaboration and Knowledge Sharing
• Collaborate with development and operations teams to promote a culture of reliability and
operational excellence
• Mentor junior team members and actively contribute to knowledge-sharing initiatives"