11432 Lackland Rd, St. Louis, MO 63146
We are seeking a seasoned Site Reliability Engineer (SRE) responsible for overseeing the availability, scalability, and performance of the technology infrastructure. This is a highly visible and critical role: the SRE will work with other Tribe SREs to define and implement the standards for global infrastructure services, including Cloud transformation. The ideal candidate has experience transitioning legacy systems to the Cloud (GCP) and delivering on an infrastructure service vision.
• Articulate a strategic vision for the business unit regarding infrastructure services and work with technology leadership to execute on that vision.
• Champion SRE standards to ensure availability, scalability and performance of infrastructure services.
• Split time between “ops” work (i.e., being on call) and development work.
• Provide stable, secure, and compliant infrastructure environments.
• Promote infrastructure-as-code, CI/CD, and design-for-failure approaches to infrastructure services.
• Focus on understanding business and customer engagement, service performance and continuous service improvement.
• Lead the effort to transition the legacy team into the SRE discipline.
• Foster a culture of ownership, collaboration and trust within and outside the team.
• Influence a team of talented engineers in support of the business and technology goals.
• Automate and develop solutions for tickets generated by support issues and data/reporting requests.
• Engage in and improve the software development lifecycle – from inception and design, through development, deployment, operation and refinement.
• Influence and design infrastructure, architecture, standards and methods for large-scale systems.
• Support services prior to production via infrastructure design, software platform development, load testing, capacity planning and launch reviews.
• Maintain services during deployment and in production by measuring and monitoring key performance and service level indicators including availability, latency, and overall system health.
• Automate system scalability and continually work to improve system resiliency, performance and efficiency.
• Practice sustainable incident response as part of an on-call rotation and through blameless postmortems.
• Remediate tasks within corrective action plans via sustainable, preventative, and automated measures whenever possible.
• 2-3 years of relevant work experience supporting automation processes in cloud (preferably AWS).
• 1-3 years of experience in a leadership role.
• Proficiency with continuous integration and continuous delivery tooling and practices.
• Proficiency programming in Java and Node.js.
• Technical degree required (3 years); Bachelor's preferred (4–6 years) in an IT-related field.
• Configuring, building, and supporting apps and operations in a cloud environment (GCP, AWS, and/or Azure).
• Experience developing and/or administering software in Public Cloud (AWS, Azure or GCP).
• Experience monitoring infrastructure and application uptime and availability to ensure functional and performance objectives are met.
• Experience building CI/CD pipelines with Jenkins from scratch.
• Demonstrable cross-functional knowledge of systems, storage, networking, security and databases.
• System administration skills, including automation and orchestration of Linux/Windows using Chef, Puppet, Ansible, SaltStack and/or containers (Docker, Kubernetes, etc.).