11432 Lackland Road, St. Louis, MO 63146
Site Reliability Engineering (SRE) is an engineering discipline that combines software and systems engineering to build and run large-scale, massively distributed, fault-tolerant systems. The SRE ensures that services, both internally critical and externally visible, have reliability and uptime appropriate to users' and customers’ needs and a fast rate of improvement, while keeping an ever-watchful eye on capacity and performance.
What You’ll Do
- You will be responsible for mission-critical business functions and partner with other infrastructure, operations, and development teams to identify and implement automation opportunities that drive down toil, reduce technical debt, and improve system reliability.
- You will support the production operations of our systems, as well as the development and engineering of solutions that maximize system reliability and automation.
- You will be responsible for root cause analysis of incidents and proactive prevention of recurrence through the creative design and development of technical solutions as well as process improvements.
- You will engage in and improve the whole lifecycle of software development services, from inception and design through deployment, operation, and refinement.
- You will support services before they go live through activities such as system design consulting, developing software platforms and frameworks, capacity planning and launch reviews.
- You will maintain services once they are live by measuring and monitoring availability, latency, and overall system health in a 24x7 environment.
- You will scale systems sustainably through mechanisms like automation and evolve systems by pushing for changes that improve reliability and velocity.
- You will practice sustainable incident response and blameless postmortems.
- You will influence and create new designs, architecture, standards, and methods for large-scale systems.
- You will bind and orchestrate the system infrastructure with the application layer to enable High Availability/Clustering, load balancing, and integration.
- You will provide technical guidance or support for the development or troubleshooting of systems.
- You will be responsible for establishing end-to-end monitoring and alerting on all critical aspects to ensure SLOs, SLIs, and SLAs are met and to provide proactive notification of possible issues for all systems.
- You will develop automated solutions to address potential problems before they result in a service interruption and demonstrate a passion for automation, including CI/CD automation.
- You will establish performance baselines and capacity thresholds, correlate events, and define monitoring/alerting criteria.
- Bachelor of Science degree in Computer Science, Engineering, or equivalent relevant experience;
- Good understanding of Site Reliability Engineering (SRE) and DevOps philosophies, technologies, platforms and tools, SLA management, incident resolution, and automation;
- Systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive;
- Ability to debug and optimize code and automate routine tasks;
- 3-5 years of experience in one or more of the following: Amazon Web Services, Google Cloud Platform, Kubernetes, etc.;
- 3-5 years of experience building JavaEE applications using tools like Maven/Ant, Subversion, JIRA, Jenkins, Bitbucket, and Chef;
- 3-5 years of experience in continuous integration tools (Jenkins, SonarQube, JIRA, Nexus, Confluence, Git/Bitbucket, Maven, Gradle; RunDeck is a plus);
- 3-5 years of experience as an SCM/release engineer, or in a position with similar skill sets and responsibilities (Software Engineer, Systems Engineer, Systems Administrator);
- 3-5 years of experience performing source code control management with Subversion/Git, including branching, merging, tagging, etc.;
- 3-5 years of experience configuring and administering JavaEE application servers (Tomcat, WebSphere, WebLogic, etc.);
- 3-5 years of experience with scripting languages such as Python, Perl, and Unix shells (bash, ksh);
- 3-5 years of experience configuring, building, and supporting apps and operations in a public cloud environment (AWS, GCP);
- 3-5 years of experience with monitoring and logging tools (Elasticsearch, ELK, AppDynamics, Splunk, etc.);
- Ability to collaborate well with team members, developers, QA, and ownership teams to resolve issues;
- Knowledge of Agile / Scrum methodologies and principles;
- Excellent written and verbal communication skills, with the ability to communicate with team members at various levels, including business leaders;
- A real passion for and the ability to learn new technologies.
Extra Points For Any Of The Following:
- You have expertise designing, analyzing, and troubleshooting large-scale distributed systems.
- You take a systematic problem-solving approach, coupled with strong communication skills and a sense of ownership and drive.
- You have experience managing infrastructure as code via tools such as Terraform or CloudFormation.
- You are passionate about automation, with a desire to eliminate toil whenever possible.
- You’ve built software or maintained systems in a highly secure, regulated, or compliant industry.
- You thrive in, and have experience with and passion for, working within a DevOps culture and as part of a team.
- You've created automation using Chef, Puppet, or another SCM tool; experience with Docker and container scheduler services such as ECS or Kubernetes is desirable.
- You've worked with Nginx, Tomcat, HAProxy, Redis, Elasticsearch, MongoDB, RabbitMQ, Kafka, and ZooKeeper.