Job posting has expired

Senior Site Reliability Engineering Manager - CTJ - Top Secret

Microsoft
United States, Nevada, Reno
6840 Sierra Center Parkway
Oct 01, 2025
Overview

Are you interested in working on cutting-edge cloud security products? Would you like to be part of one of the world's most advanced cyber-security solutions and protect millions of computers from thousands of active attack attempts every month? Look no further than the Microsoft Defender engineering team.

We are looking for a Senior Site Reliability Engineering (SRE) Manager. You will build and deliver cloud solutions at a scale that few companies in the industry are required to support. Leveraging state-of-the-art technologies, you will be instrumental in delivering holistic protection within government environments.

The Microsoft Defender team is responsible for delivering a constantly evolving set of services and solutions to address the challenging landscape of ever-evolving attackers. This team provides on-call operational support and improves the operational posture of the Microsoft Defender products within US Government clouds. You will operate our production services and work closely with other engineering teams to ensure services and systems are highly stable, meet performance SLAs, and meet the expectations of internal and external customers and users.

Responsibilities

Lead Reliability Strategy
Drive the vision and execution of reliability, performance, and security across critical systems and services. Influence product design and engineering decisions to ensure resilient, scalable infrastructure.

Build and Scale Automation
Champion intelligent automation (AI/ML-powered) for monitoring, deployment, and incident response to reduce manual overhead and accelerate safe delivery.

Drive Operational Excellence
Use telemetry and service-level data to guide improvements in availability, efficiency, and cost. Lead post-incident reviews and service improvement plans that restore customer trust and drive long-term resilience.

Foster Engineering Partnerships
Collaborate deeply with product engineering and security teams from early development through production to align on reliability goals and prevent recurrence of issues.

Grow and Empower Teams
Attract, mentor, and develop high-performing SRE talent. Create a culture of inclusion, learning, and accountability that supports career growth and innovation.

Shape Technical Direction
Guide architecture and tooling decisions across distributed systems and cloud infrastructure. Promote adoption of best practices and scalable solutions across teams.