OVERVIEW

The Senior Site Reliability Engineer will assist with the design, development, and implementation of cloud architecture across cloud, hybrid, and on-premises systems. This position contributes directly to the implementation of enterprise cloud architecture, working closely with staff to enhance existing designs and develop new designs and strategies for all types of cloud-based applications. The Senior Site Reliability Engineer will collaborate with both Information Technology and Business Units to maintain open lines of communication and a clear understanding of each project's objectives. The successful candidate has excellent interpersonal and communication skills, which are required for collaborating with internal business units and resources as well as external partners and integrators.

RESPONSIBILITIES

  • Develop and monitor dashboards daily to detect application and infrastructure problems and potential security incidents
  • Run the production environment by monitoring availability and taking a holistic view of system health
  • Build software and systems to manage platform infrastructure and applications
  • Improve reliability, quality, and time-to-market of our suite of software solutions by creating sustainable systems and services through automation and uplifts
  • Provide primary operational support and engineering for multiple large, distributed software applications
  • Gather and analyze metrics from both operating systems and applications to assist in performance tuning and fault finding
  • Ensure that solution sizing, technology fit, and disaster recovery (DR) are assessed and accounted for

EDUCATION AND EXPERIENCE QUALIFICATIONS

  • Four-year degree in IT or a related field preferred; equivalent experience may be substituted
  • 4-6 years of experience architecting and/or engineering in cloud environments
  • 4-6 years of experience with the Azure and/or AWS cloud platforms
  • 2-4 years of experience with CI/CD automation
  • 4-6 years in an operations support role

REQUIRED KNOWLEDGE, SKILLS, OR ABILITIES

  • Hands-on experience with Microsoft Azure is required, specifically Azure Security Center, Azure Monitor, Azure Key Vault, Azure Kubernetes Service, Azure Dedicated HSM, Blob Storage, Azure Backup, Azure Functions, Virtual Machines, Service Fabric, and Container Instances
  • Hands-on experience with Python, Bash, and/or PowerShell with a focus on orchestration and automation of underlying services, systems, provisioning, and security hardening
  • Detailed understanding of Windows and Linux operating systems, including processes, memory allocation, and networking, and of how applications function and affect other OS components and cloud services
  • Expert-level debugging and troubleshooting skills
  • Experience developing and/or maintaining production-grade cloud solutions in virtualized environments such as Pivotal Cloud Foundry and Kubernetes
  • Experience creating and deploying AWS and Azure Resource Manager (ARM) templates
  • Ability to quickly learn new AWS/Azure technologies and create internal training documentation
  • Experience with database technologies (SQL, cluster technology and creation, Always On, migration, log shipping)
  • Hands-on experience with log aggregation tools
  • Experience architecting solutions within Azure
  • Working knowledge of common, industry-standard cloud-native/cloud-friendly authentication mechanisms (OAuth, OpenID, etc.)
  • Experience with automation systems: e.g. Ansible, Jenkins, Chef, Git
  • Experience with monitoring solutions: e.g. Splunk, SolarWinds, Nagios
  • Experience with Atlassian tools such as Jira and Confluence
  • Experience with APM tools: e.g. Dynatrace, AppDynamics, New Relic, Stackify, Raygun
  • Experience with cloud security
  • Experience with C#, Python, Node.js, JSON, Java, etc.
  • Experience working with cloud security and governance tools, cloud access security brokers (CASBs), and server virtualization technologies
  • Experience with enterprise applications (architecture, development, support, and troubleshooting)
  • Experience with enterprise architecture and working as part of a cross-functional team to implement solutions
  • Experience handling production support incidents and liaising between development and operations teams to perform deep-dive root cause analysis (RCA) and implement code fixes
  • Strong interpersonal and communication skills; ability to work in a team environment
  • Ability to work independently with minimal direction; self-starter/self-motivated
  • Technical writing experience