Description
The Cloud Engineer will design, build, and operate infrastructure and applications supporting UCLA Health's Analytics Platform across both on-premises and multi-cloud environments (AWS, Azure, GCP). This role focuses on enabling secure, scalable AI/ML and GenAI platforms, with an emphasis on automation, reliability, and compliance in a regulated healthcare setting.
Key Responsibilities
- Design, implement, and manage cloud and hybrid infrastructure supporting analytics and AI/ML workloads
- Build and operate MLOps capabilities, including:
  - Model training and inference platforms
  - Model and artifact management
  - CI/CD and deployment pipelines
  - Observability and monitoring solutions
  - Cost optimization controls
- Develop and maintain automation and infrastructure-as-code (IaC) solutions for provisioning and configuration
- Troubleshoot and resolve complex system and environment issues across cloud and on-prem platforms
- Establish platform guardrails to ensure secure, reliable, and compliant operations
- Collaborate with cross-functional teams to:
  - Gather requirements
  - Design and prototype solutions
  - Implement and test deployments
  - Support ongoing operations and enhancements
- Apply security, privacy, and governance controls aligned with healthcare data regulations
- Execute release, deployment, and configuration management processes
What You'll Bring
- Strong background in cloud engineering and platform operations
- Experience with multi-cloud environments (AWS, Azure, GCP)
- Proficiency in:
  - Automation, scripting, and infrastructure-as-code
  - CI/CD pipeline development and optimization
  - Monitoring and observability tools
- Experience supporting AI/ML or data platform workloads (preferred)
- Ability to troubleshoot complex systems and drive solutions independently
- Strong collaboration skills and the ability to translate business requirements into technical solutions
Salary Range: $128,500 - $298,100 annually.
Qualifications
- BS/MS in Computer Science (or equivalent)
- AWS Certified Cloud Engineer, Architect, Administrator Certifications required
- 7+ years of advanced knowledge and experience as an AWS Cloud Engineer in all core services and offerings. AWS experience a plus
- 15+ years of advanced knowledge and experience with Microsoft technologies such as Windows Server, and with Linux-based servers; enterprise system support experience; and a strong background in systems engineering and administration for both operating systems
- 15+ years of advanced knowledge and experience with enterprise-scale Windows technologies such as server platforms, desktop platforms, Exchange environments, Active Directory, IIS, Windows clustering, virtualization, and collaboration tools. AWS Certification or equivalent experience preferred
- Working knowledge of DevOps-like work or experience in a real-time operational role
- Advanced knowledge of analytics and AI/ML platform services across AWS, Azure, and GCP (e.g., AWS SageMaker/Bedrock, Azure Machine Learning/Azure OpenAI, Google Vertex AI) and how to operate them securely at enterprise scale
- Experience enabling teams to build and deploy ML/AI solutions by providing reusable platform capabilities (reference architectures, templates, SDK/CLI standards, self-service onboarding, and guardrails) rather than only project-specific implementations
- Hands-on experience operationalizing ML/AI workloads on cloud platforms (AWS/Azure/GCP): managed training/inference, batch vs. real-time serving, feature/metadata management, model registry, and cost/performance optimization
- Strong MLOps/platform engineering experience: CI/CD for ML and GenAI, automated validation gates, reproducible pipelines, environment promotion, artifact/version management, and production monitoring (drift, data quality, latency, cost) using cloud-native and/or enterprise tooling (e.g., Azure DevOps/GitHub Actions, SageMaker Pipelines, Vertex AI Pipelines, MLflow, Terraform)
- GenAI platform experience (AWS/Azure/GCP): deploying and governing LLM applications using managed services (e.g., Bedrock/Azure OpenAI/Vertex AI), RAG architectures, embeddings and vector databases/search, prompt/version management, and evaluation/guardrails for safety and groundedness
- Responsible AI and governance experience for regulated environments: PHI/PII protections, access controls, encryption and key management, audit logging, model/endpoint risk assessments, bias/fairness considerations, and policy enforcement aligned to HIPAA and secure SDLC
- Strong data engineering foundations that support AI platforms: standardized data ingestion/ETL/ELT, data quality/lineage, dataset and feature pipeline design, schema/version management, and integration with lake/lakehouse platforms (e.g., S3/ADLS/GCS with Spark/Databricks/BigQuery/Synapse) for feature and training data readiness
- Experience operating scalable training/inference platforms (GPU/accelerated workloads): capacity planning/quotas, cluster or managed compute configuration, distributed training concepts, performance tuning, and chargeback/showback in cloud environments