Principal Site Reliability Engineer - T. Rowe Price, New York, NY, USA

Listing Description

Role summary and job responsibilities

Possesses extensive knowledge of own area of expertise and in-depth knowledge of the broader portfolio, enabling a comprehensive understanding of upstream and downstream impacts across the technology infrastructure

Overall responsibility for the design of technology solutions to prevent or minimize service disruptions

Prevents technology service disruptions through technology solution recommendations and automations

Fosters a culture of deep learning through blameless post-mortems to improve the shared goal of reliability across services

Transforms operations teams by facilitating internal change to adopt SRE standard methodologies across the organization and driving strategic growth in this area within Global Technology

Analyzes incidents impacting technology availability for high-level trends across the broad portfolio

Drives initiatives to reduce or prevent technology failures in a complex, distributed technology environment

Pulls together information from disconnected systems into cohesive views of the technology portfolio for identifying trends, redundancies, and risk

Overall responsibility for creation and execution of road maps for applications and technology platforms

Demonstrates outstanding awareness of the complexities of the tech and asset management industries

May lead initiatives of varying degrees of complexity that span multiple functional areas

Contributes to the definition of target state architecture and design of the technology environment

Requirements

10+ years of relevant technology experience

5+ years building and supporting solutions in Amazon Web Services (AWS)

5+ years of experience building and running a DevOps and/or SRE function

Experience with implementation and operation of the chaos model at scale

Strategic and program-level implementation experience

Demonstrable experience implementing new technology, tools, and platforms

System administration and scripting experience

Demonstrable experience leveraging automation to proactively prevent or quickly remediate incidents

Fluent in multiple programming languages (e.g., Python, Java, Go, Node.js, .NET Core)

Proficiency with database development (SQL Server, PostgreSQL, MySQL, etc.)

Proficiency with defining, right-sizing, tracking, and reporting on Service Level Objectives (SLOs), Service Level Indicators (SLIs), system availability, and the progress and outcomes related to reliability

Experience with implementing and managing Error Budgets

Proficiency in understanding and explaining incidents and their recovery plans in order to prevent recurrence

Knowledge/experience driving dashboard standardization across the ecosystem for observability, APM and infrastructure monitoring, and application-specific logging

Knowledge/experience with observability tools such as New Relic, Elastic Stack, Prometheus, Grafana, Splunk, and cloud native tools is desirable

Knowledge/experience with cloud management tools such as Ansible, Terraform, Vault, and Vagrant

Works independently, with guidance in only the most complex situations

Makes sound decisions with limited facts or resources

Balances strategic and pragmatic concerns when solving problems

Adjusts communication style and materials to suit a given audience

Able to clearly articulate operational principles, practices, and policies

Stays abreast of industry trends and technologies

Accountable for work of self and others; sets standards around which others will operate

Maintains a broad internal professional network and knows when to engage/activate it

Develops or mentors diverse talent on the team

Ability to be on-call and/or work during off-hours

Listing Details

Salary: $200,000 - $220,000
Citizenship: No Requirements
Incentives: Bonus

Education: Bachelor's Degree
Travel: No Travel
Telework: Full Telecommute
