About the Role
Together AI is building the next-generation AI compute platform, and networking is at the center of that mission. As a Network Architect, you will define and evolve the global network architecture that powers our AI training, inference, and research platforms. This is a deeply technical and strategic role: you will own the end-to-end routing, topology, traffic engineering, and control-plane strategy for a global network spanning self-built data centers, partner colocation facilities, cloud environments, and high-capacity backbone fabrics.
You will collaborate closely with infrastructure engineering, compute systems, hardware, and operations teams to design architectures that deliver massive east–west bandwidth, low latency, high resiliency, and predictable performance at multi-terabit scale. Your work directly influences how we build, scale, and operate the physical and logical networks that underpin cutting-edge AI workloads.
This is a role for architects who are hands-on enough to validate designs in production, experienced enough to reason about systems at huge scale, and creative enough to develop architectures that don’t exist yet.
Responsibilities
- Define and evolve Together AI’s global routing and backbone architecture, spanning self-built data centers, partner colocation sites, PoPs, cloud regions, and interconnect fabrics.
- Establish the end-to-end topology strategy for high-bandwidth AI workloads: east–west fabrics, spine/superspine/core, DCI, and cross-region interconnect.
- Design traffic engineering, load balancing, and capacity planning models to ensure low latency, deterministic performance, and fault tolerance at scale.
- Develop the multicloud interconnect and peering strategy, including BGP policy frameworks, route leak mitigation, and security posture across heterogeneous networks.
- Architect the control-plane stack for programmability, stability, and automation—including routing design, provisioning, configuration management, and state consistency.
- Establish foundational observability primitives for a global backbone (telemetry, flow sampling, path validation, synthetic testing, health models).
- Work closely with compute, storage, hardware, and data platform teams to ensure network design meets the performance demands of distributed AI training workloads.
- Collaborate with operations and NOC teams to ensure designs are supportable, debuggable, and resilient under real-world failure conditions.
- Provide architectural direction and mentorship to engineers across the org, influencing long-term strategy for both physical and virtual network domains.
- Model evolving topologies for next-generation workloads (multi-Tbps east–west, high fan-in/fan-out distributed systems, GPU cluster fabrics).
- Evaluate and guide the adoption of emerging technologies: advanced optical transport, RoCEv2, high-speed Ethernet fabrics, InfiniBand overlays, EVPN/VXLAN, SR-MPLS/SRv6, programmable data planes, and hardware offload.
Requirements
- Deep experience designing and operating large-scale GPU clusters or HPC-style compute fabrics, with a clear understanding of the unique demands these workloads place on network design (east–west dominance, congestion behavior, fan-in/fan-out patterns, loss sensitivity).
- Fluency in building high-throughput data center fabrics (leaf–spine/superspine/core) that support tens of thousands of GPUs, multi-terabit east–west traffic, and strict performance SLAs.
- Experience architecting or operating RoCEv2 or lossless Ethernet environments at scale, including PFC/ECN tuning, congestion control, and end-to-end stability considerations.
- Experience designing backbone and DCI architectures that support GPU training clusters across multiple regions, interconnect heterogeneous fabrics, and handle high-volume synchronization traffic.
- A track record of leading architecture for networks spanning multiple clouds, private backbones, and diverse PoPs, and an understanding of how AI workloads behave across these domains.
- A design approach grounded in operational realities: observability, capacity modeling, automation, telemetry, and failure-mode analysis for GPU-heavy environments.
- Comfort setting architectural direction in fast-moving environments where compute, storage, and network evolution are tightly coupled.
About Together AI
Together AI is a research-driven artificial intelligence company. We believe open and transparent AI systems will drive innovation and create the best outcomes for society, and together we are on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. We have contributed leading open-source research, models, and datasets to advance the frontier of AI, and our team has been behind technological advancements such as FlashAttention, Hyena, FlexGen, and RedPajama. We invite you to join a passionate group of researchers and engineers on our journey to build the next generation of AI infrastructure.
Compensation
We offer competitive compensation, startup equity, health insurance, and other competitive benefits. The US base salary range for this full-time position is $250,000 - $280,000 + equity + benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be determined by experience, skills, and job-related knowledge.
Equal Opportunity
Together AI is an Equal Opportunity Employer and is proud to offer equal employment opportunity to everyone regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
Please see our privacy policy at https://www.together.ai/privacy