Amazon Unveils $50 Billion AI Chip Lab Powering OpenAI Deal and Challenging Nvidia
By admin | Mar 22, 2026 | 24 min read
Following Amazon CEO Andy Jassy's announcement of AWS's landmark $50 billion investment agreement with OpenAI, Amazon extended an invitation for a private tour of the chip development lab central to the deal, covering most of the associated costs. Industry observers are closely monitoring Amazon's Trainium chip, developed at this facility, for its potential to lower the cost of AI inference and possibly challenge Nvidia's near-monopoly. Intrigued, I accepted the offer. My guides for the visit were the lab's director, Kristopher King, and director of engineering Mark Carroll, alongside the team's PR representative who coordinated the trip, Doron Aronson.

AWS has served as Anthropic's primary cloud platform since the AI lab's inception, a partnership robust enough to endure Anthropic's subsequent addition of Microsoft as a cloud partner and Amazon's own growing alliance with OpenAI. The OpenAI agreement designates AWS as the exclusive provider for the model maker's new AI agent builder, Frontier, which could become a significant part of OpenAI's business if agents gain the traction Silicon Valley anticipates. Whether this exclusivity holds remains to be seen: a recent Financial Times report suggested Microsoft may believe OpenAI's deal with Amazon conflicts with its own agreement, which grants Microsoft access to all of OpenAI's models and technology.

A key appeal of AWS for OpenAI is the cloud giant's commitment to supply 2 gigawatts of Trainium computing capacity. This is a massive undertaking, especially since Anthropic and Amazon's own Bedrock service are already consuming Trainium chips faster than production can keep up. Some 1.4 million Trainium chips are deployed across all three generations, with Anthropic's Claude running on over 1 million of the deployed Trainium2 chips.

Notably, while Trainium was initially focused on faster, cheaper model training, it is now also optimized and used for inference: the process of running an AI model to generate responses, and currently the industry's most significant performance bottleneck. Trainium2, for example, handles the majority of inference traffic on Amazon's Bedrock service, which supports AI application development for Amazon's enterprise customers and offers a choice of multiple models. "Our customer base is just expanding as fast as we can get capacity out there," King stated, adding, "Bedrock could be as big as EC2 one day," referring to AWS's massive compute cloud service.

**Trainium vs. Nvidia**
Beyond providing an alternative to Nvidia's backlogged and difficult-to-acquire GPUs, Amazon claims its new chips, running on its specialized Trn3 UltraServers, cost up to 50% less to operate for comparable performance versus classic cloud servers. Alongside Trainium3, released in December, this AWS team also developed new Neuron switches, a combination Carroll describes as transformative. "What that gives us is something huge," Carroll said. The switches enable every Trainium3 chip to communicate directly with every other chip in a mesh configuration, reducing latency. "That’s why Trainium3 is breaking all kinds of records," particularly in "price per power," he explained. When processing trillions of tokens daily, such efficiencies yield substantial benefits.

In fact, Amazon's chip team earned praise from Apple in 2024. In an unusual display of openness, Apple's director of AI publicly detailed how the company used another of the team's chips: Graviton, a low-power, ARM-based server CPU and the team's first breakout design. Apple also commended Inferentia, a chip specifically engineered for inference, and acknowledged Trainium, which was new at the time. These chips exemplify Amazon's classic strategy: identify what customers want to buy, then build a competitively priced in-house alternative.

Historically, the challenge with chips has been switching costs. Applications designed for Nvidia's chips require re-architecting to function on others, a time-intensive process that deters developers from switching. However, the AWS chip team proudly noted that Trainium now supports PyTorch, a popular open-source framework for building AI models. This includes many models hosted on Hugging Face, a vast library where developers share open-source models. Carroll explained that the transition requires "basically a one-line change, and then recompile, and then run on Trainium." In essence, Amazon is strategically working to erode Nvidia's market dominance.
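Carroll's "one-line change" reflects how PyTorch reaches non-Nvidia accelerators through compiler backends such as XLA. As a rough illustration, not AWS's actual porting guide, here is a minimal sketch assuming the AWS Neuron SDK exposes Trainium as a torch-xla device; on a machine without that stack, the code falls back to the CPU:

```python
import torch
import torch.nn as nn

# A small stand-in model; any ordinary PyTorch model works the same way.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
x = torch.randn(8, 128)

try:
    # On a Trainium instance with the Neuron SDK installed, this is the
    # essence of the "one-line change": target the XLA device instead of CUDA.
    import torch_xla.core.xla_model as xm
    device = xm.xla_device()
except ImportError:
    device = torch.device("cpu")  # fallback when no Neuron/XLA stack is present

model = model.to(device)
out = model(x.to(device))
print(tuple(out.shape))  # (8, 10)
```

The heavier "recompile" step Carroll mentions happens inside the XLA compiler, which lowers the same PyTorch graph to the target chip's instruction set rather than requiring the application to be rewritten.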
AWS also recently announced a partnership with Cerebras Systems, integrating that company's inference chip on servers running Trainium for what Amazon promises will be superpowered, low-latency AI performance. Amazon's ambitions extend beyond the chips themselves to include the server design that hosts them. In addition to networking components, this team has created "Nitro," a hardware-software combination that provides virtualization technology; new state-of-the-art liquid cooling; and the server sleds that house this equipment. All these efforts aim to control cost and optimize performance.
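The latency benefit of the all-to-all Neuron switch topology described above can be made concrete with a toy comparison. The numbers here are illustrative, not AWS's actual topology parameters: in a fully connected mesh any two chips are one hop apart, while in a simpler ring topology the worst-case path grows with cluster size.

```python
# Toy illustration of why all-to-all connectivity bounds latency:
# worst-case hop counts for a ring versus a full mesh of n chips.

def ring_hops(n: int) -> int:
    """Worst-case hops between two nodes on a bidirectional ring of n nodes."""
    return n // 2

def mesh_hops(n: int) -> int:
    """Worst-case hops in a fully connected (all-to-all) mesh."""
    return 1 if n > 1 else 0

for n in (4, 16, 64):
    print(f"{n} chips: ring worst case {ring_hops(n)} hops, mesh {mesh_hops(n)} hop")
```

In a ring, the worst-case path scales linearly with the number of chips; in a full mesh it stays at a single hop, which is why direct chip-to-chip links matter as clusters grow.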

**Working 24/7 on the “Bring-Up”**
Amazon's custom chip-designing unit originated with the cloud giant's acquisition of Israeli chip designer Annapurna Labs in January 2015 for approximately $350 million. Consequently, this team has over a decade of experience designing chips for AWS. The unit has preserved its Annapurna heritage and name—its logo is prominently displayed throughout the office. This chip lab is situated in a sleek, chrome-windowed building in Austin's upscale "The Domain" district, a walkable area filled with shops and restaurants often dubbed Austin's Silicon Valley. The offices feature a classic tech corporate atmosphere: cubicle desks, communal areas, and conference rooms. However, tucked away on a high floor is the actual lab, offering expansive city views. The shelving-filled lab, roughly the size of two large conference rooms, is a noisy industrial space due to equipment fans. It resembles a cross between a high school shop class and a Hollywood set for a high-end lab, though the engineers wear jeans, not white coats.


This is not where chips are manufactured, so no white hazmat suits were needed. Trainium3 is a state-of-the-art 3-nanometer chip produced by TSMC, a leader at that process node, with other chips in the lineup produced by Marvell. This room is where the "bring-up" magic happens. "A silicon bring-up is when you get the chip for the first time, and it’s like a big overnight party. You stay here, like a lock-in," King explained. After 18 months of work, the chip is powered on for the first time to verify it functions as designed. The team even filmed portions of the Trainium3 bring-up and posted it online. Spoiler: it's never without issues.

For Trainium3, the prototype chip was originally air-cooled, like prior versions. The current chip is liquid-cooled, offering energy advantages and representing a significant engineering achievement. During the bring-up, the dimensions for attaching the chip to the air-cooling heat sink were incorrect, preventing activation. Unfazed, the team "immediately got a grinder and just started grinding off the metal," King recalled. To avoid disrupting the bring-up pizza-party atmosphere with noise, they discreetly performed the grinding in a conference room. Staying up all night solving problems "is what silicon bring-up is all about," King said.

The lab even includes a welding station, where hardware lab engineer and master welder Isaac Guevara demonstrated welding tiny integrated-circuit components under a microscope. The work is so exceptionally challenging that senior leader Carroll openly admitted he couldn't do it, eliciting laughter from Guevara and the other engineers present.

The lab also contains both custom-made and commercial tools for testing and analyzing chip issues. Here, signal engineer Arvind Srinivasan demonstrates how the lab tests each tiny component on the chip:

**Sleds Are the Star of the Lab**
The highlight of the lab is an entire row displaying each generation of the "sleds" the team designed.

Sleds are the trays that house the Trainium AI chips, Graviton CPU chips, and supporting boards and components. Stack them together on a rack with the custom-designed networking component, and you get the systems central to Anthropic Claude's success. Here is the sled showcased during the AWS re:Invent conference in December:

**Proven by Anthropic and OpenAI**
I anticipated my guides would enthusiastically discuss the OpenAI deal during the tour, but they did not. This reticence might relate to the potential legal complexities surrounding the agreement. However, the impression I gathered was that these hands-on engineers, who are currently designing Trainium4, haven't yet had extensive collaboration with OpenAI. Their daily work has primarily focused on meeting the needs of Anthropic and Amazon. Currently, the largest deployment of Trainium2 chips is in Project Rainier, one of the world's largest AI compute clusters, which launched in late 2025 with 500,000 chips and is used by Anthropic. Nevertheless, a wall monitor in the main office displayed a quote about OpenAI's planned use of Trainium, indicating a subtle sense of pride.

Beyond this lab, the team also operates its own private data center for quality assurance and testing. Located a short drive away, it does not run customer workloads and is housed at a co-location facility rather than an AWS data center. Security is stringent, with strict protocols for building entry and access to Amazon's designated area. The data center's cooling system is so loud that earplugs are mandatory, and the air carries the acrid scent of heated metal, making it an unpleasant environment for most.

This data center features rows of servers filled with sleds integrating all of Amazon's newest custom chips: Graviton CPU, liquid-cooled Trainium3, and Amazon Nitro, all operating smoothly. The engineers noted that the liquid cooling uses a closed system, meaning the coolant is reused, which should help reduce environmental impact. Here is a current Trn3 UltraServer: multiple sleds are positioned on top and bottom, with Neuron switches in the middle. Hardware development engineer David Martinez-Darrow is shown here performing maintenance on a sled:

While the team has always attracted attention, scrutiny has intensified recently. Amazon CEO Andy Jassy monitors this lab closely, publicly praising its products with evident pride. In December, he stated that Trainium is already a multibillion-dollar business for AWS and highlighted it as a piece of AWS technology he is most excited about. He also mentioned the chip when announcing the OpenAI agreement. The team feels this pressure. Engineers work around the clock for three to four weeks during each bring-up event to resolve any issues, ensuring the chips can be mass-produced and deployed to data centers. "It’s very important that we get as fast as possible to prove that it’s actually going to work," Carroll said. "So far, we’ve been doing really well."
*Disclosure: Amazon provided airfare and covered one night at a local hotel. True to its Leadership Principle of Frugality, this included a middle seat at the back of the plane and a modest room. (Yes, I checked a bag for an overnight trip. I’m high maintenance that way.)
