xAI’s mission is to create AI systems that can accurately understand the universe and aid humanity in its pursuit of knowledge. Our team is small, highly motivated, and focused on engineering excellence. This organization is for individuals who appreciate challenging themselves and thrive on curiosity. We operate with a flat organizational structure. All employees are expected to be hands-on and to contribute directly to the company’s mission. Leadership is given to those who show initiative and consistently deliver excellence. Work ethic and strong prioritization skills are important. All engineers are expected to have strong communication skills. They should be able to concisely and accurately share knowledge with their teammates.
As a Site Reliability Engineer (SRE): Storage at xAI, you will play a pivotal role in ensuring the reliability, scalability, and performance of our petabyte-to-exabyte-scale storage infrastructure, including filesystems and our internal storage product supporting the Colossus superclusters in Memphis, the world's largest AI training clusters with hundreds of thousands of liquid-cooled GPUs. We're deploying multiple exabytes of storage this year across several sites to fuel Grok's training and advanced AI workloads. You will collaborate with storage engineers, software engineers, and hardware storage teams to deploy, troubleshoot, and optimize storage for 24/7 AI I/O demands such as checkpointing, dataset streaming, and long-term archival storage, and to ensure maximum uptime. This is a hands-on technical position in a dynamic environment, offering the opportunity to tackle complex challenges at the intersection of storage systems, hardware integration, and reliability engineering.
xAI is an equal opportunity employer.