New
Principal Supercomputing Software Engineer
Microsoft | |
United States, Texas, Irving | |
7000 State Highway 161 (Show on map) | |
Nov 26, 2024 | |
OverviewMicrosoft Azure Artificial Intelligence/High Performance Computing (AI/HPC) team is looking for systems engineers, architects and thought leaders to enable customers in deploying, monitoring, profiling, and debugging their applications on hyperscale cloud infrastructure. Azure is enabling the largest supercomputing deployments to tackle complex computational problems in public cloud, evident from the various HPC products that have already made the mark on Top500, MLPerf and Graph500 rankings. At this supercomputing scale, we need specialized tools and techniques to maintain the reliability, runtime performance, health of the system and running jobs continuing to meet the Service Level Agreements (SLAs) of customers. Your job would be to build and use state-of-the-art tools and techniques, find operational gaps and instrument features to achieve the smooth operation of cloud-native supercomputers. As a Principal Supercomputing Engineer, you would also bring to the table establishing best practices drive architectural changes and influence roadmap of relevant software and hardware components. Your work will directly impact business goals of a wide range of users and facilitate the next wave of growth and innovation in AI, and HPC in the cloud in general. Microsoft's mission is to empower every person and every organization on the planet to achieve more. As employees we come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals. Each day we build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.
ResponsibilitiesBe part of a comprehensive systems management team focused on operational excellence and customer successAnalyze key system metrics and telemetry to proactively identify and debug HPC system issues, build appropriate tooling, help develop processes and ensure that solutions are responsive to emerging user needsPartner with customers, vendors, and other teams within Azure to drive comprehensive solutions for operating world class Supercomputers in the public cloud environmentEnsure that the Azure platform is performant, scalable and resilientFoster test-driven engineering culture to reduce regressions and bugs in production and will set a higher bar for infrastructure quality |