Data Center Engagement Lead

Microsoft
United States, Texas, Irving
7000 State Highway 161
Nov 26, 2024
Overview

The Microsoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is looking for thought leaders to help design, deploy, and support flagship AI supercomputers. Azure is enabling the largest supercomputing deployments to tackle complex computational problems in the public cloud, as evidenced by the HPC products that have already made their mark on the Top500, MLPerf, and Graph500 rankings. Our team directly supports Azure's top-tier AI customers, enabling breakthroughs such as ChatGPT.

As the Data Center Engagement Lead, you will play a pivotal role in shaping next-generation supercomputing systems. You will contribute to the design process, oversee buildout and validation pipelines, ensure timely delivery, and proactively drive operational excellence. Additionally, you will engage deeply with strategic customers, directly influencing their business outcomes while indirectly benefiting the broader Azure ecosystem. Your work will enable the next wave of growth and innovation in AI and high-performance computing (HPC) in the cloud.

Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.

Responsibilities

Partner with cross-organizational teams to drive the architecture, design, development, and deployment of end-to-end solutions to manage core infrastructure, including current and next-generation datacenter, IT hardware, power, and cooling technologies.

Drive operational excellence by developing strategies and execution plans to improve key metrics such as Job Mean Time to Interrupt, Nodes in Service, and Mean Time to Resolve on flagship supercomputers.

Drive prioritization across key issues and tactical decision-making, mindful of resourcing and staffing constraints.

Monitor SLAs across partner teams and champion efforts to improve efficiency within staffing and resourcing constraints.

Define and drive the development of the integrated telemetry and data pipelines needed to provide real-time alerting and monitoring of job-impacting incidents.

Partner with teams on continuous learning and continuous improvement programs by leading the resolution of complex incidents, driving root cause analyses, and championing initiatives to minimize future customer impact.

Lead and grow a team of engineers to build scalable services while championing a growth mindset, diversity and inclusion, and our model, coach, care management philosophy.