44223- Site Reliability Engineer
Riverwoods, IL 60015 US
Job Description
Location: Hybrid; candidates local to Riverwoods only, please
Duration: 6-month contract-to-hire
Job Responsibilities:
- Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply engineering principles, operational discipline, and mature automation to our operating environments.
- SREs specialize in systems (operating systems, networks, observability), while implementing best practices to continuously improve availability, reliability, and scalability.
- Develop and run the SRE team's own tooling and observability using automation such as CI/CD and Kubernetes.
- Build monitoring that alerts on symptoms rather than on outages.
- Document every action so your findings turn into repeatable actions and then into automation.
- Debug production issues across services and levels of the stack.
- Plan the growth and reliability of services.
- Use your on-call shift to prevent incidents from ever happening.
- Join an on-call rotation to respond to “Code Red” incidents and help restore customer-impacting services.
- Have an urge to deliver quickly and effectively and to iterate fast.
- Think about systems: edge cases, failure modes, behaviors, specific implementations.
- As an engineer, when you see something broken, you cannot help but fix it.
- Have an urge to document all the things so you do not need to learn the same thing twice.
- Strong knowledge of SDLC (System Development Life Cycle)
- Strong knowledge of git, Docker, Kubernetes, Jenkins, AWS (Amazon Web Services) or similar technologies
- Know how to use configuration management systems like Chef and Ansible
- Have strong programming skills in one or more of the following languages: C, Ruby, Python, Java
- Good understanding of hybrid infrastructure
- Automation like CI/CD, self-healing of services, end-to-end or performance testing
- Improve monitoring (Datadog, AppDynamics, etc.) and build new smart metrics
- Develop a relationship with a product group and help define their SLO/SLI
- Work directly with AppDev to improve the product's non-functional qualities and production readiness
- Improve operability, latency, capacity planning, change management, and MTTR (Mean Time to Repair)
- Leveling of Site Reliability Engineering
- Configuration management: use Chef and Ansible to effectively manage our infrastructure
- Infrastructure as code: use Terraform and GitLab CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals
- Systems: manage, configure, and troubleshoot operating system issues, storage (block and object), and networking, including VPC (Virtual Private Cloud), proxies, and CDN (Content Delivery Network); administer high-availability PostgreSQL and Redis clusters
- Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
- Engineering practices: availability, reliability, and scalability, as well as disaster recovery
- Use git and contribute code through it
- Experience coding in one or more of the following languages: C, Ruby, Python, Shell, Java
- Planning: familiar with agile methodologies; use epics and issues to drive projects
- Organization: workload organization, OKR (Objectives and Key Results) leadership
- Management: a manager of one, able to self-organize and report asynchronously
- Leading and contributing to scope and designs for issues, epics, and OKRs
- Contributing to the Handbook, creating and updating runbooks and general documentation, and writing blogs
- Completing Root Cause Analysis (RCA) investigations and performing readiness reviews
- Improving team practices through code reviews, handoffs of work and incidents
- Knowledge sharing and mentoring
- Self-awareness, handling conflict in the team, and providing and receiving feedback
- Maintaining good relationships with other engineering teams that help improve the product
- Accountability: willing to proactively step in and do the right thing while providing candid and constructive feedback
- Site Reliability Engineer
- 5+ years of experience; BE/B.Sc degree