CloudWalk

Machine Learning Engineer (Distributed Training)

Remote, Full-Time
CloudWalk is reimagining financial infrastructure through AI, blockchain, and intelligent design. Our systems serve millions of entrepreneurs across Brazil and the US, and we’re now pushing the frontier of large-scale model training on our own GPU cluster.

We’re looking for a Machine Learning Engineer focused on distributed training: someone who loves experimenting with new frameworks, optimizing performance, and scaling frontier models. You’ll work with DeepSpeed, FSDP, and Accelerate, plus emerging tools like Unsloth, Axolotl, and TorchTitan that are reshaping how models are trained and fine-tuned at scale.
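
For flavor, here’s a minimal sketch of the kind of loop this role scales up: a toy model trained through Accelerate, which dispatches to DDP, FSDP, or DeepSpeed depending on the launch configuration. The model, data, and hyperparameters are illustrative placeholders, not CloudWalk code.

```python
# Minimal Accelerate training loop (illustrative only; toy model and data).
import torch
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

accelerator = Accelerator()  # backend (DDP/FSDP/DeepSpeed) comes from `accelerate config`

model = torch.nn.Linear(512, 512)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
dataset = TensorDataset(torch.randn(1024, 512), torch.randn(1024, 512))
loader = DataLoader(dataset, batch_size=32)

# Accelerate wraps the model, optimizer, and dataloader for the chosen backend.
model, optimizer, loader = accelerator.prepare(model, optimizer, loader)

model.train()
for x, y in loader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # handles scaling/sharding details per backend
    optimizer.step()
```

Launched locally with `accelerate launch train.py`, the same script scales to multi-GPU and multi-node runs without code changes.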

Your mission is to design and evolve the backbone of our training pipeline: orchestrating multi-node jobs, managing checkpointing and experiment tracking, and squeezing every bit of efficiency from our GPUs. You’ll collaborate closely with infra and research teams to turn prototypes into robust, high-throughput systems.
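
As a deliberately simplified example of that backbone work, the sketch below shows rank-aware checkpointing in a torchrun-launched PyTorch job; the launch flags, paths, and model are hypothetical stand-ins, not our actual pipeline.

```python
# Rank-aware checkpointing sketch (illustrative only).
# Example launch, two nodes with eight GPUs each:
#   torchrun --nnodes=2 --nproc_per_node=8 \
#            --rdzv_backend=c10d --rdzv_endpoint=$HEAD_NODE:29500 train.py
import os
import torch
import torch.distributed as dist

def save_checkpoint(model, optimizer, step, path="checkpoints"):
    """Rank 0 writes a resumable checkpoint; every rank syncs at the barrier."""
    if dist.get_rank() == 0:
        os.makedirs(path, exist_ok=True)
        torch.save(
            {"step": step,
             "model": model.state_dict(),
             "optimizer": optimizer.state_dict()},
            os.path.join(path, f"step_{step}.pt"),
        )
    dist.barrier()  # keep ranks in sync so none races ahead of the save

if __name__ == "__main__":
    # torchrun sets RANK, WORLD_SIZE, and MASTER_ADDR; gloo keeps this CPU-runnable.
    dist.init_process_group(backend="gloo")
    model = torch.nn.Linear(8, 8)
    optimizer = torch.optim.AdamW(model.parameters())
    save_checkpoint(model, optimizer, step=0)
    dist.destroy_process_group()
```

Real FSDP or DeepSpeed jobs need sharded state-dict handling on top of this, which is exactly the kind of detail this role owns.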

We’re looking for someone fluent in PyTorch and distributed systems, with hands-on experience scaling large models using DeepSpeed or Accelerate. You know your way around GPUs, containers, and schedulers like Kubernetes or Slurm, and you care deeply about reliability, speed, and reproducibility. Experience with Ray clusters, MLflow, or W&B tracking is a plus.
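
On the tracking side, a minimal Weights & Biases logging sketch looks like the following; the project name, config, and loss values are made up for illustration.

```python
# Minimal W&B experiment tracking sketch (hypothetical project and metrics).
import wandb

run = wandb.init(
    project="distributed-training",          # hypothetical project name
    config={"lr": 1e-4, "world_size": 16},   # hyperparameters to record
)
for step in range(100):
    loss = 1.0 / (step + 1)  # stand-in for a real training loss
    wandb.log({"train/loss": loss}, step=step)
run.finish()
```

In a multi-node job, the usual convention is for only rank 0 to initialize and log the run.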

Our process is simple: a technical deep-dive on distributed systems and LLM training, followed by a cultural interview.

We offer a competitive salary, equity, and the opportunity to shape the next generation of large-scale AI infrastructure at CloudWalk.