
Sr. Data Center Platform Systems Engineer

Advanced Micro Devices, Inc.
USD $117,600.00/Yr.-USD $176,400.00/Yr.
United States, Texas, Austin
7171 Southwest Parkway
Nov 15, 2024


WHAT YOU DO AT AMD CHANGES EVERYTHING

We care deeply about transforming lives with AMD technology to enrich our industry, our communities, and the world. Our mission is to build great products that accelerate next-generation computing experiences - the building blocks for the data center, artificial intelligence, PCs, gaming and embedded. Underpinning our mission is the AMD culture. We push the limits of innovation to solve the world's most important challenges. We strive for execution excellence while being direct, humble, collaborative, and inclusive of diverse perspectives.

AMD together we advance_

THE ROLE:

AMD is looking for a senior platform engineer to join our growing team. As a key contributor you will be part of a leading team to drive and enhance AMD's abilities to deliver the highest quality, industry-leading technologies to market.

THE PERSON:

The Software Platform Architecture (SPA) team has an open position for a Data Center Platform Systems Engineer. SPA is the hardware-accelerated, software-focused wing of the newly formed Cluster Platform Engineering (CPE) team at AMD and rolls up through the Data Center GPU (DCGPU) business unit. This role will be responsible for helping to select, curate, design, automate, and document all software underpinning an entire full-stack AI-focused platform. This work is not net-new code development; it is instead focused on choosing the right software properties and how data and operations flow through them, to ease the adoption and operation of large-scale GPU-accelerated AI (Artificial Intelligence) and HPC (High Performance Computing) cluster systems within AMD. SPA works closely with the Site Reliability Engineering (SRE) and Data Center Operations (DCOps) teams, who tackle day-to-day commissioning and operations of the clusters under CPE's control. SPA's work is measured by how much we reduce operational toil while increasing the rigor and repeatability of processes for the SRE and DCOps teams. SPA has design responsibility for the full Day 0 - Day 2 software platform.

KEY RESPONSIBILITIES:

  • The Platform Systems Engineer role in SPA cuts across all hardware and software infrastructure, up through platform software, consumption portals, and ultimately the real goal: having the AI application software experience be optimized for AMD. The AI applications in focus are those that best leverage the AMD Instinct GPU and AMD EPYC CPU in cluster systems.
  • Work with all CPE teams to validate that SPA's platform designs are Day 0 - Day 2 ready and able to integrate with other teams' workflows
  • Work with the Release Engineering team to automate the application of updates and system configuration management tools.
  • Maintain tight interaction with the SRE team to continually improve how SPA's designs are integrated into an operational change process and cadence
  • Ensure that all applications and infrastructure elements expose/export telemetry that is centrally managed and used to guide the management of the entire platform
  • Write the glue-code necessary to connect systems to each other if no native mechanisms exist
  • Ensure all platform designs reflect Security as a core principle, provide input to Policy and Guidelines, and participate in platform and project retrospectives/blameless post-mortems

PREFERRED EXPERIENCE:

  • Experience in full-stack (infra, platform, application) multi-site, multi-region solutions at scale
  • Strong multi-distro Linux knowledge across deployment, configuration, and management
  • Cloud Native platform implementation
  • Kubernetes as application dial-tone all the way up through Service Mesh and multi-tenant application deployment and management
  • Strong knowledge of multiple virtualization and containerization technologies like KVM, Xen, and Kubernetes - OpenShift a bonus
  • Experience with automation platforms at scale using Ansible, Terraform / OpenTofu
  • Some experience with application and platform telemetry frameworks, such as OpenTelemetry
  • Strong networking knowledge with a primary focus on L3 and path-vector routing protocols
  • Experience with RDMA/RoCE and InfiniBand a plus
  • Demonstrated record of successfully building and delivering complex operational solutions at scale, with the ability to learn new systems quickly in a rapidly changing environment
  • Python and Golang experience a plus
  • Platform message-bus (such as Kafka) experience

ACADEMIC CREDENTIALS:

  • Bachelor's or Master's degree in Computer/Software Engineering, Computer Science, or related technical discipline preferred


#LI-RW1

#LI-HYBRID

At AMD, your base pay is one part of your total rewards package. Your base pay will depend on where your skills, qualifications, experience, and location fit into the hiring range for the position. You may be eligible for incentives based upon your role such as either an annual bonus or sales incentive. Many AMD employees have the opportunity to own shares of AMD stock, as well as a discount when purchasing AMD stock if voluntarily participating in AMD's Employee Stock Purchase Plan. You'll also be eligible for competitive benefits described in more detail here.

AMD does not accept unsolicited resumes from headhunters, recruitment agencies, or fee-based recruitment services. AMD and its subsidiaries are equal opportunity, inclusive employers and will consider all applicants without regard to age, ancestry, color, marital status, medical condition, mental or physical disability, national origin, race, religion, political and/or third-party affiliation, sex, pregnancy, sexual orientation, gender identity, military or veteran status, or any other characteristic protected by law. We encourage applications from all qualified candidates and will accommodate applicants' needs under the respective laws throughout all stages of the recruitment and selection process.
