Zoox

PhD Research Intern, Multi-Modal Foundation Encoder for Perception

Foster City, CA Full Time

About Our Internship Program

Zoox’s internship program offers hands-on experience with cutting-edge technology, mentorship from some of the industry’s brightest minds, and the opportunity to make meaningful contributions to real projects. We seek interns who demonstrate strong academic performance, engagement beyond the classroom, intellectual curiosity, and a genuine interest in Zoox’s mission.


Project Overview

During this internship, you will lead the development of a multi-modal (vision, LiDAR, radar, and language) temporal foundation encoder to support 3D object detection and tracking, 3D segmentation (occupancy), and live maps. This Multi-Modal Foundation Encoder (MMFE) is key to achieving End-to-End Perception at Zoox.

Your research will aim to significantly improve system performance on long-tail events and rare classes by using a large-capacity foundation model to learn rich representations across sensor modalities. The project also aims to improve perception in adverse environmental conditions such as medium to heavy rain and fog (for example, by reducing false positives from water splashes or dust particles), achieve long-range sensing for highway driving, and build robustness to occlusion.

This is a highly research-driven role with the goal of publication. You will have the opportunity to explore novel directions such as tri-modal foundation models with self-supervised pre-training, radar-language grounding for zero-shot detection, efficient sensor fusion via sparse cross-attention, or integrating 3D Gaussian splats for dynamic agent geometry and streaming sparse Gaussian occupancy prediction.