Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI

Microsoft
$119,800.00 - $234,700.00 / yr
United States, Washington, Redmond
Jan 14, 2026
Overview
The AI Platform organization builds the end-to-end Azure AI stack, from the infrastructure layer to the PaaS and user experience offerings for AI application builders, researchers, and major partner groups across Microsoft. The platform is core to Azure's innovation, differentiation, and operational efficiency, as well as the AI-related capabilities of all of Microsoft's flagship products, from M365 and Teams to GitHub Copilot and Bing Copilot. We are the team building the Azure OpenAI service, AI Foundry, Azure ML Studio, Cognitive Services, and the global Azure infrastructure for managing the GPU and NPU capacity running the largest AI workloads on the planet.

One of the major, mature offerings of AI Platform is Azure ML Services. It provides data scientists and developers a rich experience for defining, training, fine-tuning, deploying, monitoring, and consuming machine learning models. We provide the infrastructure and workload management capabilities powering Azure ML Services, and we engage directly with some of the major internal research and applied ML groups using these services, including Microsoft Research and the Bing WebXT team.

As part of AI Platform, the AI Infra team is looking for a Senior Software Engineer - AI Infrastructure (Scheduler) - CoreAI. The scheduler is the "brains" of the AI Infra control plane. It governs access to the GPU and NPU capacity of the platform according to a complex system of workload preference rules, placement constraints, optimization objectives, and dynamically interacting policies that aim to maximize hardware utilization and fulfill the greatly varying needs of users and the AI Platform partner services in terms of workload types, prioritization, and capacity targeting flexibility. The scheduler's set of capabilities is broad and ambitious. It manages quota, capacity reservations, SLA tiers, preemption, auto-scaling, and a wide range of configurable policies.

Global scheduling is a distinctive major feature that overcomes the regional segmentation of the Azure compute fleet by treating the GPU capacity as a single global virtual pool, which greatly increases capacity availability and utilization for major classes of ML workloads. We have achieved this capability without introducing a global single point of failure: regional instances of the scheduler service interact via peer-to-peer protocols to share capacity inventory and coordinate the handoff of jobs for scheduling. Our system manages a significant amount of GPU capacity even outside Azure datacenters, through a unified model and operational process and highly generalized, flexible workload scheduling capabilities.

To manage the inherent complexity of the Scheduler subsystem and enable it to meet the stringent expectations of high service reliability, availability, and throughput, we emphasize rigorous engineering, utmost precision and quality, and ownership, from feature design to livesite. A quality mindset, attention to detail, development process rigor, and data-driven design and problem-solving skills are key for success in our mission-critical control plane space.

Responsibilities
Work on the design and development of the core AI Infrastructure distributed and in-cluster services that support large-scale AI training and inferencing.
Develop, test, and maintain control plane services written in C#, hosted on Service Fabric or Kubernetes (AKS) clusters.
Enhance systems and applications to ensure high stability, efficiency, and maintainability, low latency, and tight cloud security.
Provide operational support and DRI (on-call) responsibilities for the service.
Develop and foster a deep understanding of the machine learning concepts, use cases, and relevant services used by our customers.
Collaborate closely with service engineers, product managers, and internal applied research and data science teams within Microsoft to build better solutions together.
Provide vision, expertise, and technical leadership to other team members. Help to grow talent in these areas.
Embody our culture and values.

Qualifications

Required Qualifications
Bachelor's Degree in Computer Science or related technical field AND 4+ years technical engineering experience with coding in languages including, but not limited to, C++, C#, Java, Scala, Rust, Go, TypeScript; OR equivalent experience.

Other Requirements
Ability to meet Microsoft, customer, and/or government security screening requirements is required for this role. These requirements include but are not limited to the following specialized security screenings:
Microsoft Cloud Background Check: This position will be required to pass the Microsoft Cloud background check upon hire/transfer and every two years thereafter.

Preferred/Additional Qualifications
Master's degree in Computer Science or a related technical field
OOP proficiency and practical familiarity with common code design patterns
3+ years of experience with large-scale services in a distributed environment, including concurrency management and stateful resource management
Hands-on experience with public cloud services at the IaaS level
Advanced knowledge of C# and .NET
Proficiency with use of complex data structures and algorithms, preferably in the setting of a resource allocator/scheduler, workflow/execution orchestration engine, database engine, or similar
Experience with managing the evolution of a large, complex codebase
Proficiency and thoroughness in unit testing and testability techniques
Knowledge of AI infrastructure, major use cases, and AI workload management
Demonstrated major design contributions and technical leadership
Excellent technical communication skills: verbal and written; product documentation experience
First-hand experience with building large-scale, multi-tenant global services with high availability
Experience with building and operating "stateful" and critical control plane services; handling challenges with data size and data partitioning; advanced use of a NoSQL cloud database
Experience with mapping complex object models to relational and non-relational datastores
DevOps experience with microservices architecture in a complex infrastructure and operational environment
Service reliability and fundamentals engineering; instrumentation for KPIs or performance analysis; demonstrated service and code quality mindset
Performance engineering: work on scalability, profiling; CPU, memory, and I/O use optimization techniques
Applied cryptography and compliant handling of customer data
Network security: endpoint protection, federated authentication, RBAC
Applied knowledge of Kubernetes: service model, workload packaging and deployment, programmatic extensibility (CRDs, operators); or equivalent knowledge of Service Fabric; experience with any service mesh
Server-side Windows programming and performance engineering
Data analytics skills, in particular with Kusto
Work in a geo-distributed team

#AIPLATFORM #CoreAI

Software Engineering IC4 - The typical base pay range for this role across the U.S. is USD $119,800 - $234,700 per year. There is a different range applicable to specific work locations, within the San Francisco Bay area and New York City metropolitan area, and the base pay range for this role in those locations is USD $158,400 - $258,000 per year.

Certain roles may be eligible for benefits and other compensation. Find additional benefits and pay information here: https://careers.microsoft.com/us/en/us-corporate-pay

This position will be open for a minimum of 5 days, with applications accepted on an ongoing basis until the position is filled.

Microsoft is an equal opportunity employer.
All qualified applicants will receive consideration for employment without regard to age, ancestry, citizenship, color, family or medical care leave, gender identity or expression, genetic information, immigration status, marital status, medical condition, national origin, physical or mental disability, political affiliation, protected veteran or military status, race, ethnicity, religion, sex (including pregnancy), sexual orientation, or any other characteristic protected by applicable local laws, regulations and ordinances. If you need assistance with religious accommodations and/or a reasonable accommodation due to a disability during the application process, read more about requesting accommodations.