Training a foundation model is one problem. Shipping it with 2–10× better throughput, sub-second latency, and unit economics that actually work is another. We've spent the last seven years exclusively recruiting the engineers who can do both.
ML infrastructure is a narrow, high-leverage discipline. The people who do it well are rare: they don't apply to jobs, and they're consistently the hardest hire on any frontier AI org chart. We source across the full stack:
Training infrastructure — distributed training (FSDP, ZeRO, tensor and pipeline parallelism), NCCL tuning, multi-node GPU orchestration, high-throughput dataloaders, operator fusion, activation rematerialization, checkpoint hygiene, determinism under failure.
Kernels and compilers — Triton, CUDA, CUTLASS, FlashAttention-family speedups, sequence packing, KV-cache optimization, MLIR, LLVM IR, JIT codegen, torch.compile / Inductor, XLA, JAX, TensorRT, IREE.
Inference and serving — vLLM, TensorRT-LLM, SGLang, Triton Inference Server, KServe. Continuous batching, speculative decoding, prefix caching, prefill/decode disaggregation, quantization (GPTQ, AWQ, FP8), distillation.
Platform and orchestration — Kubernetes-based GPU orchestration, Ray, Dagster, SLURM, Terraform, cloud-scale observability (Prometheus, Grafana, OpenTelemetry), autoscaling on custom metrics, zero-downtime model deploys.
Real-time media inference — WebRTC at scale, long-lived connections, latency-critical video and audio pipelines, token-level metering and SLA enforcement.
Our ML infrastructure candidates come from a tight network we've built over seven years: engineers at frontier labs, inference-platform startups, and hyperscaler GPU-infra teams, along with ex-FAANG engineers who've recently moved to scale-up AI companies.
