43102 - Site Reliability Engineer - No Third Parties Please

Riverwoods, IL 60015

Posted: 11/21/2022 Industry: IT/Software Development Job Number: 39179

Job Description


Site Reliability Engineer - No third parties, please #43102

Location: REMOTE

Duration: 6-MONTH CONTRACT TO HIRE

Job Responsibilities:
  • Site Reliability Engineers (SREs) are responsible for keeping production systems running smoothly. SREs are a blend of pragmatic operators and software craftspeople who apply engineering principles, operational discipline, and mature automation to our operating environments.
  • SREs specialize in systems (operating systems, networks, observability), while implementing best practices to continuously improve availability, reliability, and scalability.

As an SRE you will:
  • Develop and run the SRE team's own tooling and observability using automation such as CI/CD and Kubernetes.
  • Build monitoring that alerts on symptoms rather than on outages.
  • Document every action so your findings turn into repeatable actions and then into automation.
  • Debug production issues across services and levels of the stack.
  • Plan the growth and reliability of services.
  • Use your on-call shift to prevent incidents from ever happening.
  • Join an on-call rotation to respond to “Code Red” incidents and help restore customer-impacting services.

You may be a fit for this role if you have some of these inclinations:
  • Have an urge to deliver quickly and effectively and to iterate fast.
  • Think about systems: edge cases, failure modes, behaviors, specific implementations.
  • As an engineer, when you see something broken, you cannot help but fix it.
  • Have an urge to document all the things so you do not need to learn the same thing twice.
  • Strong knowledge of SDLC (System Development Life Cycle)
  • Strong knowledge of git, Docker, Kubernetes, Jenkins, AWS (Amazon Web Services) or similar technologies
  • Know how to use configuration management systems such as Chef or Ansible
  • Have strong programming skills in one or more of the following languages: C, Ruby, Python, Java
  • Good understanding of hybrid infrastructure

Projects you could work on:
  • Automation like CI/CD, self-healing of services, end-to-end or performance testing
  • Improve monitoring (Datadog, AppD, etc.) and build new smart metrics
  • Develop a relationship with a product group and help define their SLO/SLI
  • Work directly with AppDev to improve the product through non-functional requirements and production readiness
  • Improve operability, latency, capacity planning, change management, and MTTR (Mean Time to Repair)
  • Leveling of Site Reliability Engineering
  • Configuration management: use Chef and Ansible to effectively manage our infrastructure
  • Infrastructure as code: use Terraform and GitLab CI/CD for automation, containerize our environments (Kubernetes), and leverage cloud technologies to meet our goals
  • Systems: manage, configure, and troubleshoot operating system issues, storage (block and object), and networking, including VPC (Virtual Private Cloud), proxies, and CDN (Content Delivery Network); administer high-availability PostgreSQL and Redis clusters
  • Monitoring and instrumentation: implement metrics in Prometheus and Grafana, log management and related systems, and Slack/PagerDuty integrations
  • Engineering practices: availability, reliability, and scalability, as well as disaster recovery
  • Use git and contribute code through it
  • Experience coding in one or more of the following languages: C, Ruby, Python, Shell, Java
  • Planning: familiar with agile methodologies; use epics and issues to drive projects
  • Organization: workload organization, OKR (Objective and Key Result) leadership
  • Management: a manager of one, able to self-organize and report asynchronously
  • Leading and contributing to scope and designs for issues, epics, and OKRs
  • Contributing to the Handbook, creating and updating runbooks and general documentation, and writing blog posts
  • Completing Root Cause Analysis (RCA) investigations and performing readiness reviews
  • Improving team practices through code reviews, handoffs of work and incidents
  • Knowledge sharing, mentoring.
  • Self-awareness, handling conflict in the team, and providing and receiving feedback
  • Maintaining good relationships with other engineering teams that help improve the product
  • Accountability: willing to proactively step in and do the right thing while providing candid and constructive feedback

Required:
  • 5+ years of experience, BE/B.Sc

Job Requirements

AWS API Chef/Ansible CI/CD GIT/Docker/Kubernetes

Meet Your Recruiter

Ryan Demar
