Listing Description
This position offers an excellent bonus and 401(k) matching ranging from 10% to 17%.
This position is based in Owings Mills, MD, with the ability to work remotely 2 days/week.
In this role as Principal Site Reliability Engineer, Cloud Infrastructure, you will build, develop, and lead a team of Site Reliability Engineers (SREs) focused on the observability, sustainability, scalability, measurability, and recoverability of T. Rowe Price’s innovative cloud and on-prem solutions by leveraging automation and best-of-breed tools. The successful candidate will have a strong operations and engineering background, be hands-on when needed, and have expertise in cloud environments (public and private), infrastructure operations, DevOps practices, CI/CD toolchains and systems, code build and deployment, incident response, and 24x7 monitoring and support.
The candidate will also have extensive experience in building and running an SRE function within a complex, distributed environment. They will have a demonstrated ability to work horizontally and vertically within an organization with diverse partners and sponsor groups.
Role summary and job responsibilities
- Possesses extensive knowledge in own area of expertise and in-depth knowledge of the broader portfolio, for a comprehensive understanding of upstream and downstream impacts across the technology infrastructure
- Holds overall responsibility for the design of technology solutions to prevent or minimize service disruptions
- Prevents technology service disruptions through technology solution recommendations and automation
- Fosters a culture of deep learning through blameless post-mortems to improve the shared goal of reliability across services
- Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and driving strategic growth in this area within Global Technology
- Analyzes incidents impacting technology availability for high-level trends across the broad portfolio
- Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment
- Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk
- Holds overall responsibility for creation and execution of road maps for applications and technology platforms
- Demonstrates outstanding awareness of the complexities of the tech and asset management industries
- May lead initiatives of varying degrees of complexity that span multi-functional areas
- Contributes to definition of target state architecture and design of the technology environment
Requirements
- 10+ years of relevant technology experience
- 5+ years building and supporting solutions in Amazon Web Services (AWS)
- 5+ years of experience building and running a DevOps and/or SRE function
- Experience with implementation and operation of the chaos model at scale
- Strategic and program-level implementation experience
- Demonstrable experience implementing new technology, tools, and platforms
- System administration and scripting experience
- Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents
- Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core)
- Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)
- Proficiency with defining, right-sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability
- Experience with implementing and managing error budgets
- Proficiency with understanding and explaining incident situations and their recovery plans to prevent recurrence
- Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application-specific logging
- Knowledge/experience with observability tools such as New Relic, Elastic Stack, Prometheus, Grafana, Splunk, and cloud-native tools is desirable
- Knowledge/experience with cloud management tools such as Ansible, Terraform, Vault, and Vagrant
- Works independently, with guidance in only the most complex situations
- Makes sound decisions with limited facts or resources
- Balances strategic and pragmatic concerns when solving problems
- Adjusts communication style and materials to suit a given audience
- Able to clearly articulate operational principles, practices, and policies
- Stays abreast of industry trends and technologies
- Accountable for work of self and others; sets standards around which others will operate
- Maintains a broad internal professional network and knows when to engage/activate it
- Develops or mentors diverse talent on the team
- Ability to be on-call and/or work during off-hours
Listing Details
- Salary: $140,000 - $175,000
- Citizenship: US Citizen
- Incentives: Bonus
- Education: Not Provided
- Travel: No Travel
- Telework: Optional Telecommute