Site Reliability Engineer
St Louis, MO 63105
Site Reliability Engineering (SRE) is a discipline that combines software and systems engineering to build and run large-scale, distributed, fault-tolerant systems. SRE ensures that internal and external services meet or exceed reliability and performance expectations while adhering to engineering principles.
- Engage in and improve the software development life cycle – from inception and design, through development, deployment, operation and refinement
- Influence and design infrastructure, architecture, standards and methods for large-scale systems
- Support services prior to production via infrastructure design, software platform development, load testing, capacity planning and launch reviews
- Maintain services during deployment and in production by measuring and monitoring key performance and service level indicators including availability, latency, and overall system health
- Automate system scalability and continually work to improve system resiliency, performance and efficiency
- Practice sustainable incident response as part of an on-call rotation and through blameless postmortems
- Remediate tasks within the corrective action plan using sustainable, preventative, and automated measures whenever possible
- Work with the cloud operations team to resolve trouble tickets, develop and run scripts, and troubleshoot issues
- Create new tools and scripts for auto-remediation of incidents
- Design and implement Big Data technologies, including Hadoop, MongoDB, Kafka, RabbitMQ, ZooKeeper, Spark, and ELK
- BS degree in Computer Science or related technical field involving coding (e.g., physics or mathematics), or equivalent practical experience
- 7 years of experience in software development and maintenance
- 2+ years of experience developing and/or administering software in public cloud
- Experience monitoring infrastructure and application uptime and availability to ensure functional and performance objectives are met
- Experience with Web service technologies, including REST, SOAP, JSON, XML
- Experience with database (RDBMS, NoSQL) technologies is a plus
- Demonstrable cross-functional knowledge with systems, storage, networking, security and databases
- System administration skills, including automation and orchestration of Linux/Windows using Chef, Puppet, Ansible, SaltStack and/or containers (Docker, Kubernetes, etc.)
- Proficiency with continuous integration and continuous delivery tooling and practices
- Strong analytical and troubleshooting skills
- Expertise in designing, analyzing and troubleshooting large-scale distributed systems
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive
- Experience managing infrastructure as code with tools such as Terraform or CloudFormation
- A passion for automation with a desire to eliminate toil whenever possible
- Experience building software or maintaining systems in a highly secure, regulated or compliant industry
- Experience with and passion for working within a DevOps culture and as part of a team