Site Reliability Engineer II

Microsoft
United States, Texas, Irving
7000 State Highway 161
May 20, 2025
Overview

Interested in a start-up-like environment while helping extend Azure's core enterprise capabilities for mission-critical workloads? Passionate about cloud computing technology and driving growth and maturity for highly visible and ambitious programs? Then the Azure Specialized team is the right place for you.

The Site Reliability Engineering (SRE) team in Azure Specialized is embedded directly in the product engineering team, and you will work closely with engineers, operations, industry vendors, and workload partners to ensure mission-critical systems continue to work optimally for our customers. Customers around the world depend on us to run their mission-critical workloads and place their trust in us to deliver the services they need to work every day. To make this work for our growing customer base, we need continual effort to make Azure highly reliable. Join a growing team that owns the reliability of Azure Specialized.

Our SRE team represents a deep investment in improving the availability, reliability, and operational efficiency of our systems and services. We are hiring highly motivated site reliability engineers to help drive our Azure special projects focused on enabling global-scale offerings. In this role you will help Microsoft and Azure become a world leader at running and operating mission-critical workloads such as AI supercomputers, payment systems for fintech, and in-memory HANA databases, all running on dedicated hardware. We're a small, agile, nimble team in Azure focused on bringing the state of the art of mission-critical software into Microsoft and providing bare-metal machines in the Azure Cloud. Come join us, be part of this platform, and help us scale massively in the coming years.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.
Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Work closely with product engineering to ensure that the right set of service capabilities is being built to manage the service end to end. Examples include deployment systems, diagnostic capabilities, and runtime operational insights into key service behaviors. Identify monitoring gaps and drive implementation. Consume and extend telemetry using queries, dashboards, and alerts to monitor reliability. Be part of the on-call rotation: monitor all customer-reported incidents (CRIs), triage them, participate in root-cause analysis, track monitoring gaps, and help drive work to ensure these incidents are auto-detected in the future and have reduced time to mitigation and resolution.

Coordinate large-scale, fleet-wide maintenance and updates using safe deployment practices. Identify the impact of these system changes, and coordinate closely with customer-facing teams and with customers directly to plan maintenance windows and downtime. Work with the customer support team on updated troubleshooting guides.

Work closely with third-party hardware vendors and appliance providers to ensure the quality and reliability of systems provided to Microsoft.