
Lead Machine Learning Platform Engineer
Paramount
New York, NY
This is a Full Time Job
#WeAreParamount on a mission to unleash the power of content… you in?
We’ve got the brands, we’ve got the stars, we’ve got the power to achieve our mission to entertain the planet – now all we’re missing is… YOU! Becoming a part of Paramount means joining a team of passionate people who not only recognize the power of content but also enjoy a touch of fun and uniqueness. Together, we co-create moments that matter – both for our audiences and our employees – and aim to leave a positive mark on culture.
Overview
We are seeking a Senior Lead / Lead ML Platform Engineer to architect and own the technical direction for our Training and Inference infrastructure. This is a high-leverage role designed for an expert who understands the deep technical stack required to move ML models from research to global production. You will be responsible for the "engine room" of the AMLG, ensuring that our MLEs can train massive models efficiently and serve them reliably at low latency. This role requires a unique blend of expertise in distributed systems and hardware acceleration. You will lead the adoption and optimization of Anyscale (Ray) for distributed training and manage a high-performance Kubernetes-based inference environment. You aren't just managing clusters; you are building a seamless, scalable platform that abstracts the complexity of GPUs and distributed compute for the entire organization.
Why This Role Matters
The ML Platform Lead is the force-multiplier for every other ML pod. In this role, you will directly shape:
• The Training Foundation: Establishing AnyScale/Ray as the standard for distributed compute, enabling MLEs to train models on petabytes of data without managing infrastructure.
• Inference at Scale: Architecting the serving layer that handles billions of requests per day, optimizing for both p99 latency and GPU utilization.
• Operational Excellence: Setting the organizational standards for how ML models are deployed, monitored, and scaled across the enterprise.
Key Responsibilities
• Technical Roadmap & Strategy: Own the long-term architectural direction for the Training and Inference domains, ensuring the platform scales 10x over a 1–3 year horizon.
• Distributed Training Leadership: Lead the implementation and optimization of Ray/AnyScale, providing a unified compute layer for batch processing, model training, and reinforcement learning.
• High-Performance Inference: Design and maintain K8s-based inference servers (e.g., Triton, TorchServe, or vLLM) optimized for GPU memory management and high throughput.
• Hardware & Cost Optimization: Navigate the trade-offs between different GPU instances (A100s, H100s, T4s), optimizing for cost, availability, and performance.
• Cross-Team Standardization: Solve high-leverage problems that affect multiple pods (e.g., Entry, Session, Presentation), establishing reusable patterns for CI/CD, model versioning, and canary deployments.
• Reliability Engineering: Define and enforce SLIs/SLOs for the platform, ensuring that infrastructure failures never interrupt the user-facing personalization experience.
• Mentorship & Coaching: Act as a technical mentor to senior engineers across the ML Platform and Applied ML pods, raising the bar for system design and operational rigor.
Basic Qualifications
• 6–8 years of experience in ML Infrastructure, Platform Engineering, or high-scale Backend Engineering.
• Orchestration & Serving: Extensive experience with Kubernetes (K8s) and serving frameworks for large-scale ML models.
• Hardware Proficiency: Strong knowledge of GPU architecture, CUDA, and optimizing ML workloads for hardware acceleration.
• Leadership (IC4/5): Proven track record of owning the technical direction for a major domain and driving impact across multiple teams.
Preferred Qualifications
• Experience with Infra-as-Code (Terraform/Pulumi) and building automated MLOps pipelines.
• Distributed Systems Mastery: Deep expertise with Ray (AnyScale) or similar distributed compute frameworks.
• Familiarity with ML observability tools (Prometheus, Grafana, Weights & Biases, or MLFlow).
• Experience managing multi-cloud or hybrid-cloud ML environments.
• Deep knowledge of Python and C++ for performance-critical systems.
What Success Looks Like
In your first 6–12 months, you will:
• Unify the Compute Layer: Successfully transition the majority of AMLG training workloads to a governed AnyScale/Ray environment.
• Optimize Inference ROI: Measurably improve GPU utilization and reduce inference costs through better auto-scaling and server optimization.
• Establish Durable Standards: Author the "Gold Standard" for ML deployments that is adopted by at least three other pods in the organization.
• Reduce Systemic Risk: Implement a self-healing infrastructure layer that significantly reduces manual intervention for cluster-related failures.
#LI-KA1