Epoch AI Narrations

“AI doesn’t get better at this board game with practice” by Benjamin Ou, Greg Burnham

Thu, 02 Jul 2026 15:50:37 GMT

Subtitle: Our latest benchmark suggests AI struggles to learn from experience.

Can AI systems improve at challenging tasks on the fly, performing them over and over and learning from mistakes? It's one of the biggest open questions in AI capabilities right now, with large economic and safety implications. Our latest benchmark, EBR-bench, tests AI systems for this ability by having them play Earthborne Rangers, a complex board game, repeatedly. So far, we see little evidence of AI learning from experience. With EBR-bench as part of our benchmarking suite, we have a new tool for detecting if and when that changes.

Learning to play games is a proxy for important capabilities

An AI system that could pick up unfamiliar tasks on the fly would be much more capable than we’re used to. Even if it didn’t perform well out of the box on some economically relevant task, it could still learn “on the job”. It would also be harder to determine whether it had dangerous capabilities prior to release, since it could gain such capabilities through learning. We think learning to play games is a reasonable proxy for these more impactful kinds of learning. Whether it's a [...]

---

Outline:

(00:53) Learning to play games is a proxy for important capabilities

(02:41) AI doesn't improve at EBR with repeated play

(04:20) Agents manage tactical execution poorly

(06:38) Agents underexplore strategic options

(07:55) Agents struggle even when given an explicit strategy guide

(09:18) Elicitation gaps may remain

(10:04) Out-of-distribution generalization remains limited--for now

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
July 1st, 2026

Source:
https://epoch.ai/publications/earthborne-rangers-benchmark

---

Narrated by TYPE III AUDIO.

---

Images from the article:

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

“What we learned from 1,604 Chinese AI job postings” by Cheryl Wu, Jean-Stanislas Denain, Anson Ho

Wed, 24 Jun 2026 23:49:31 GMT

Subtitle: Inferring Chinese AI labs’ strategies from their job descriptions.

What's going on inside Chinese AI companies like Alibaba and DeepSeek? Western observers usually answer this question in two ways: (1) read through their technical papers, and (2) voraciously consume news reports and follow everything about Chinese AI on X. But there's a third approach that people have rarely explored: scrape Chinese AI job postings.

Chinese labs need to hire the right people, so in their job descriptions they need to reveal what skills or expertise they’re looking for. These give us direct clues into what constraints they face and what they hope to build.

So, as we previously did for Western labs, we scoured over 1,600 job postings across six of the most notable Chinese AI companies: DeepSeek, MiniMax, Moonshot, Z.ai, ByteDance, and Alibaba. Here's what we found.

Chinese AI labs still rely on Nvidia, but they’re exploring domestic alternatives

Many people care about whether Chinese companies still use Nvidia because it means they still depend on “Western” AI infrastructure. And at least for now, that seems true.

Consider ByteDance. One of its open roles is called “Inference GPU Performance Optimization Expert”. Whoever [...]

---

Outline:

(01:15) Chinese AI labs still rely on Nvidia, but they're exploring domestic alternatives

(03:57) Chinese startups are renting domestic cloud compute, and building data centers too

(05:50) Chinese AI startups have pretty varied commercial strategies

(08:08) Startups stay model-centric; platform companies make a wider range of research bets

(09:50) Job postings are more spread out than in the US

(11:07) Chinese AI jobs require less prior experience

(12:41) The complex reality of Chinese AI firms

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
June 24th, 2026

Source:
https://epoch.ai/gradient-updates/what-we-learned-from-1604-chinese-ai-job-postings

---

Narrated by TYPE III AUDIO.

“Toward an O*NET for AI R&D” by Jean-Stanislas Denain, Joe Kwon, Anson Ho

Thu, 18 Jun 2026 07:32:19 GMT

Subtitle: Proposing a new way to track AI research automation.

What trends are we extrapolating?

A common way that experts forecast AI timelines is so simple it's hard to believe: trend extrapolation. Sure, they also use numerical models that bake in things like runaway feedback loops, but the bread and butter of AI forecasting is to draw a line on a graph and extend it as far as you dare. Somehow this works well enough to be a state-of-the-art approach. However, the trends they extrapolate share a common weakness: they lean heavily on easy-to-measure things, not what we directly care about, namely how close AI is to doing AI research itself.

Many experts want to know when we’ll fully automate AI research, because this would massively speed up AI progress, kicking off an “intelligence explosion”. If that's right, it's hugely important to know how close we are. But historically, there hasn’t been much direct evidence to point to, because full automation of AI R&D has been so hard to measure. Instead, researchers have been forced to rely on proxies.

One such proxy is in key AI inputs like compute, data, and energy. Take Situational Awareness [...]

---

Outline:

(00:19) What trends are we extrapolating?

(04:14) An O*NET for AI R&D

(08:28) A first proposal

(11:59) What's next?

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
June 17th, 2026

Source:
https://epoch.ai/gradient-updates/toward-an-onet-for-ai-rnd

---

Narrated by TYPE III AUDIO.

“Are Mythos’ cyber capabilities overhyped?” by Timothée Chauvin, Alexander Barry, Jean-Stanislas Denain, Anson Ho

Fri, 12 Jun 2026 07:05:57 GMT

Subtitle: Compiling all the public evidence on Mythos Preview's cyber abilities.

If what Anthropic says is true, then the Claude Mythos family is a massive leap forward in AI's cyber capabilities. When they announced Mythos Preview, they considered it so dangerous that they had to launch a $100+ million initiative to “secure the world's most critical software”. Then on Tuesday, they one-upped themselves by releasing Claude Mythos 5, which improves modestly on cyber benchmarks.

But skeptics have argued that Anthropic was exaggerating — or at least, people should chill out about Mythos. For instance, some people have pointed out that GPT-5.5 is on par with Mythos Preview on a range of cyber benchmarks, and yet its launch didn’t lead to a cyber catastrophe.

So is Mythos actually a big leap for cyber capabilities? To figure this out, we looked at all the public evidence we could get our hands on. Most of this evidence applies to Mythos Preview, but the conclusions should hold for Mythos 5 too. This post describes what we found.

Discovering vs exploiting code vulnerabilities

To start off, let's take a closer look at what Anthropic actually claimed when they released [...]

---

Outline:

(01:21) Discovering vs exploiting code vulnerabilities

(02:49) Mythos Preview was a major advance in exploit development

(05:40) It's unclear how large of a practical advance Mythos Preview is in vulnerability discovery

(10:34) Conclusion

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
June 11th, 2026

Source:
https://epoch.ai/gradient-updates/are-mythos-cyber-capabilities-overhyped

---

Narrated by TYPE III AUDIO.

“Controlling the capital after AGI” by Phil Trammell, Anson Ho

Wed, 10 Jun 2026 05:52:49 GMT

Subtitle: A simple taxonomy of the main proposals for post-AGI universal redistribution.

Introduction

AGI might generate immense economic output, but it could take many people's jobs in the process and leave them with no way to earn a decent living. Those with little savings during the “AGI transition” would then be unable to support themselves on the other side. Less drastically, even if many well-paying jobs remain after AGI, the capital share may greatly increase, which would tend to greatly increase inequality.

If this happens, how might the gains be redistributed? More concretely, putting aside the question of how the state raises tax revenues after AGI, and what percentage of GDP is raised, how do existing proposals for redistributing this revenue differ?

Proposals for universal benefits abound, including:

Universal basic income (UBI): The government pays everyone cash. This is the best known, and has been endorsed by Elon Musk, Vinod Khosla, Geoffrey Hinton, and many others. As part of this, the government might impose restrictions on the extent to which people could borrow against their future payments, just as it is illegal today to borrow against your social security, to prevent people from impoverishing themselves in [...]

---

Outline:

(00:18) Introduction

(02:37) The main axis: control of the capital

(05:26) Why care who controls the capital?

(08:32) Why have the state give people control of capital, instead of letting people buy it themselves?

(11:41) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
June 9th, 2026

Source:
https://epoch.ai/gradient-updates/controlling-the-capital-after-agi

---

Narrated by TYPE III AUDIO.

“Is a compute crunch coming?” by Luke Emberson, Jaime Sevilla

Mon, 25 May 2026 00:00:00 GMT

Subtitle: We estimated trends in global inference capacity and found that token demand appears to be growing much faster than supply.

Much has been made about AI-driven capex in the past year. Hyperscalers have been clamoring to construct massive data centers, spending hundreds of billions in the process. The St. Louis Fed estimates that AI-related investment contributed about 1 percentage point — almost 40% of the total — to US real GDP growth in the first three quarters of 2025, exceeding the IT investment contribution at the height of the dot-com boom. Whether the current AI buildout constitutes a bubble depends largely on whether there will be sufficient demand for the computing infrastructure being built.

It's tough to estimate future demand for tokens, as it depends heavily on hard-to-forecast trends in capabilities and diffusion. However, we have a much more concrete picture of the supply side. In this article, we do our best to answer how many tokens per second the world could produce with the chips we have today.

To do this, we dig into the technical details of inference. We model prefill and decode runtimes, account for two common efficiency techniques (chunked prefill and [...]

---

Outline:

(03:30) Introducing our setting

(05:13) Inference settings

(05:17) Hardware specs

(05:27) What happens during inference?

(06:59) Prefill

(09:46) Decode

(13:48) Chunked prefill

(16:21) Speculative decoding

(18:24) Calibrating against inference benchmarks

(21:18) The present and future of inference

The original text contained 17 footnotes which were omitted from this narration.

---

First published:
May 25th, 2026

Source:
https://epoch.ai/gradient-updates/is-a-compute-crunch-coming

---

Narrated by TYPE III AUDIO.

“Frontier labs don’t use most AI compute (yet)” by Josh You

Wed, 20 May 2026 00:00:00 GMT

Subtitle: But Anthropic and OpenAI may rapidly grow their compute share in the next few years. After that, continued scaling would require an economic transformation.

Disclaimer: the estimates of frontier developer compute discussed below are more tentative than our standard data work.

OpenAI kicked off the AI boom when it launched ChatGPT in 2022. Frontier LLMs soon accrued hundreds of millions of users and billions in revenue, sparking a massive investment boom in AI compute infrastructure, with Nvidia's AI-related sales spiking more than fourfold in 2023. Global AI computing power has now grown to the equivalent of around 20 million Nvidia H100s, funded by hundreds of billions of dollars in annual capital expenditures.

Yet while OpenAI launched the compute boom, they don’t dominate AI compute usage. I estimate that the compute OpenAI uses for research, training, and inference as of the end of 2025 made up around 10% to 15% of the world's operational AI compute supply, and this share was even smaller a year ago. Even after adding the other most well-resourced frontier developers — Anthropic, xAI, and the AI labs within Google and Meta — the combined total is probably still under half of [...]

---

Outline:

(02:25) Most AI compute probably doesn't go to frontier AI

(06:28) Will Anthropic and OpenAI absorb the rest of global AI compute?

(11:39) What happens if frontier labs run out of headroom?

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
May 20th, 2026

Source:
https://epoch.ai/gradient-updates/frontier-labs-dont-use-most-ai-compute

---

Narrated by TYPE III AUDIO.

“The economics of superstar AI researchers” by Anson Ho

Wed, 13 May 2026 19:52:55 GMT

Subtitle: What might explain AI researcher pay, and why it matters.

AI is one of those fields where the best winds up much better off than the rest. Superstar researchers at frontier labs earn over ten times more than most of their colleagues, who earn measly million-dollar salaries. They might even earn over a hundred times more than your average AI postdoc:

Ballpark estimates of AI researcher compensation. Postdoc compensation is estimated using NSF report data. For tenure-track professors, I anchor on this Taulbee 2024 survey of computer scientists. Compensation for frontier lab researchers is estimated from Levels.fyi for L4-L5 OpenAI researchers, and news reports for superstars.

So why are the differences in pay so large? The naive explanation is that some researchers are just vastly superior. Perhaps the superstar researchers have excellent research taste in designing algorithms and experiments. Or they have a knack for pulling off “yolo runs” — training runs that implement many ambitious changes all at once, relying on deep intuition, whereas most people would need to systematically test the individual changes to make sure they work. Under this framing, superstars are the “10× researchers” that Silicon Valley so deeply reveres [...]

---

Outline:

(01:38) The superstar effect

(04:31) Why this applies to AI

(05:34) Race dynamics amplify the effect

(06:24) Reality is complicated, and so is managing an army of AIs

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
May 13th, 2026

Source:
https://epoch.ai/gradient-updates/economics-of-superstar-ai-researchers

---

Narrated by TYPE III AUDIO.

“Introducing the AI Chip Components Explorer” by Venkat Somala

Fri, 08 May 2026 00:00:00 GMT

Subtitle: Our new AI Chip Components explorer tracks how much advanced-node logic, memory, and advanced packaging capacity is consumed by leading AI chip designers.

AI compute capacity is growing exponentially. But as spending on AI chips climbs into the hundreds of billions, the semiconductor supply chain is increasingly strained. To help researchers, policymakers, and the public understand semiconductor inputs and production constraints, we are launching the AI Chip Components explorer.

Building on our AI Chip Sales explorer, which tracks completed chips, the AI Chip Components explorer looks further up the supply chain at the components used to build chips. We estimate how much global chip component supply is consumed by the four leading US AI chip designers: Nvidia, AMD, Google, and Amazon. We further break down the consumption of components by chip type, designer, and quarter. Our scope is limited to the chip itself. We do not cover rack-level components or networking equipment, which are also significant inputs to AI infrastructure.

The explorer tracks three critical chip components

Modern AI chips rely on several specialized inputs: advanced logic wafers that perform the core computation, high-bandwidth memory (HBM) that stores data and feeds it to the compute [...]

---

Outline:

(01:19) The explorer tracks three critical chip components

(02:17) Packaging was the major bottleneck in late 2024 and early 2025

(03:44) Memory became the bottleneck in 2025

(05:22) Advanced logic was a softer constraint in 2024 and 2025

(06:46) Chip Component Spend More than Doubled from 2024 to 2025

---

First published:
May 8th, 2026

Source:
https://epoch.ai/publications/introducing-the-ai-chip-components-explorer

---

Narrated by TYPE III AUDIO.

“RIP Classic Reasoning Benchmarks. What’s Next?” by Greg Burnham

Tue, 05 May 2026 00:00:00 GMT

Subtitle: Give up at least one of: text only, short time horizon, easy to grade, and expert human superiority.

There's a familiar recipe for reasoning benchmarks: tasks are text-only, output is easy to grade, and expert humans can do the tasks in several hours. Unfortunately, this recipe is now obsolete. As an emblematic case, consider GPQA: a benchmark consisting of graduate-level science questions. It had remarkable staying power but by now it's clearly saturated.

The same is true for many classical reasoning benchmarks, whether in science, math, or coding. What's next? I think the old recipe points to a new recipe. Just relax one of the elements: text only, easy to grade, short time horizon, and expert human superiority. I see each of these categories as extremely fruitful to pursue, and far from saturated. The tradeoff is just that it takes more time and money to create such benchmarks.

Keep the classic format, but make it multimodal

It's hard to say precisely, but to my eyes AI visual and spatial reasoning lags behind text-only reasoning. Still growing rapidly, but from a lower base. At any rate, it still seems comparably easy to create meaningful multimodal reasoning [...]

---

Outline:

(01:24) Keep the classic format, but make it multimodal

(02:41) Keep the classic format, but push the time horizons

(04:38) Bite the bullet on hard-to-grade outputs

(06:44) Target well above human expert ability

(08:24) What about common sense?

(09:48) Reasoning benchmarks aren't dead yet

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
May 5th, 2026

Source:
https://epoch.ai/gradient-updates/rip-classic-benchmarks

---

Narrated by TYPE III AUDIO.

“What you need to know about AI chips” by Epoch AI

Fri, 01 May 2026 00:00:00 GMT

Subtitle: A look at the specialized hardware driving modern AI — why chips cost tens of thousands of dollars each, and why demand continues to outstrip supply.

One of the biggest factors shaping AI progress today is access to a very specific kind of computer chip, manufactured almost entirely by a single company in Taiwan. These specialized AI chips, also sometimes called AI accelerators, power every frontier AI product, from chatbots to image generators, and are the most important physical input to the training and deployment of AI systems. Some prominent examples of AI chips are Nvidia's Blackwell and Hopper GPUs (named for the graphics chips they descend from), Google's TPU, and Amazon's Trainium series.

An Nvidia Blackwell GPU. Credit: Nvidia.

Who manufactures AI chips, who can buy them, and whether there is enough electricity to power them at scale — these questions are shaping which companies can build the most capable AI, which countries can support an AI industry, and how fast the technology advances.

Why AI companies want more chips than they can get

When a company wants to train a new AI model, one of the most important things they need is [...]

---

Outline:

(01:19) Why AI companies want more chips than they can get

(06:09) AI chips get more cost-effective every year

(09:09) Electricity efficiency is increasing, but so is total consumption

(11:07) AI chips sit at the center of progress in AI

---

First published:
May 1st, 2026

Source:
https://epoch.ai/publications/chips-topic-overview

---

Narrated by TYPE III AUDIO.

“Diversion and resale: estimating compute smuggling to China” by Isabel Juniewicz

Wed, 29 Apr 2026 00:00:00 GMT

Subtitle: We estimate that between 290,000 and 1.6 million H100-equivalents (H100e) were smuggled to China through 2025. Our median estimate of 660,000 H100e would be roughly a third of China's total compute.

Key takeaways

Substantial quantities of AI chips have been sent to China in violation of US export controls. Evidence of diverted or missing chips, drawn from indictments and investigative reporting, points to nearly 300,000 Nvidia H100-equivalents by the end of 2025. This would equal roughly a quarter of the compute China acquired through legal channels or domestic production. Because much smuggling goes undetected, the true total is likely higher.
We estimate, with 90% confidence, that between 290,000 and 1.6 million H100-equivalents of compute were smuggled through the end of 2025. Our median estimate of 660,000 represents roughly 3% of the global compute stockpile, comparable to what xAI, a leading US AI lab, had at the time. The upper bound of our estimate would mean that, by the end of 2025, the majority of China's AI compute had been smuggled.
We are uncertain about many variables, notably the magnitude of undetected smuggling and the proportion of chips allegedly diverted or missing that ultimately reached [...]

---

Outline:

(00:32) Key takeaways

(01:45) Overview

(07:13) Evidence on smuggled chips

(07:17) Compute diversion

(07:47) Compute resale

(10:02) Estimation methodology

(10:15) Compute diversion

(12:27) Compute resale

(15:15) Combined results

(16:43) Comparison to other estimates

(18:51) Quarterly extrapolation

(19:35) Conclusion

(20:58) Acknowledgements

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
April 29th, 2026

Source:
https://epoch.ai/publications/chip-smuggling

---

Narrated by TYPE III AUDIO.

“How Fast Could Robot Production Scale Up?” by Jean-Stanislas Denain, Yann Rivière

Wed, 22 Apr 2026 00:00:00 GMT

Subtitle: We look at reference classes, factory buildout timelines, and upstream component supply to estimate plausible production rates for humanoids, quadrupeds, robotic arms, wheeled robots, and drones.

Suppose that in the next few years, robotics capabilities take a large leap forward. Humanoid robots or mobile manipulators become able to perform most manual tasks that humans can. The potential market is enormous: billions of people do physical work, and a robot that could substitute for a human worker at a fraction of the cost would face nearly unlimited demand.

But robots are physical objects. While software can be copied and deployed nearly instantly, each robot must be manufactured from real components in real factories by real workers. Even if capabilities jumped overnight, production would take time to catch up.

How much time? In this post, we aim to produce numbers useful for people trying to answer that question. We focus on five form factors: humanoids, quadrupeds, robotic arms, wheeled robots, and drones. While the future of robotics may involve form factors that don’t yet exist at scale, or coordinated fleets of different kinds of robots, we believe that our analysis of existing form factors is still useful [...]

---

Outline:

(04:59) Reference classes for robot production scaling

(05:03) Where robot production stands today

(08:38) How fast can production scale following demand shocks?

(12:19) Inside view: what are the actual constraints?

(12:42) Building factories

(16:33) Getting the components

(17:24) Three component tiers

(21:59) Putting it together

(25:40) Acknowledgements

---

First published:
April 22nd, 2026

Source:
https://epoch.ai/publications/how-fast-could-robot-production-scale-up

---

Narrated by TYPE III AUDIO.

“OpenAI Stargate: where the US sites stand” by Elliot Stewart, Ben Cottier

Fri, 17 Apr 2026 00:00:00 GMT

Subtitle: The $500 billion AI data center initiative is projected to exceed 9 gigawatts of capacity by 2029, with 0.3 gigawatts already operational in Abilene and six more US sites under active construction.

Introduction

The United States is in the middle of an unprecedented build-out of AI infrastructure. No project illustrates the scale of that effort more than Stargate, a $500 billion endeavor involving AI developer OpenAI, cloud provider Oracle, and investment company SoftBank.

Stargate has seven locations across the US, all of which are now showing active development. The most advanced site, in Abilene, Texas, is already operating at an estimated capacity of 0.3 gigawatts (GW). The six other sites include two more in Texas, as well as facilities in New Mexico, Wisconsin, Michigan, and Ohio. Together, the seven sites add up to over 9 gigawatts of planned capacity, which is comparable to the peak power demand of New York City. This will be enough to power the equivalent of 20 million Nvidia H100 GPUs, which was the total amount of AI compute in the world by the end of 2025.

| Site | Current capacity (GW) | Projected capacity (GW) | Construction began | Projected completion | Power sources |
| --- | --- | --- | --- | --- | --- |
| Abilene, Texas | 0.3 | 1.2 | Q2 2024 | Q4 2026 | On-site gas, Grid |
| Shackelford County [...]

---

Outline:

(00:29) Introduction

(02:16) The sites

(02:18) Abilene, Texas

(03:37) Shackelford County, Texas

(04:39) Doña Ana County, New Mexico

(05:32) Milam County, Texas

(06:38) Port Washington, Wisconsin

(07:35) Saline Township, Michigan

(08:28) Lordstown, Ohio

(09:39) The road ahead

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
April 17th, 2026

Source:
https://epoch.ai/publications/openai-stargate-where-the-us-sites-stand

---

Narrated by TYPE III AUDIO.

“Have AI Capabilities Accelerated?” by Jean-Stanislas Denain, Alexander Barry

Thu, 16 Apr 2026 00:00:00 GMT

Subtitle: We investigate progress trends on four capability metrics to determine whether AI capabilities have recently accelerated. Three of four metrics show strong evidence of acceleration, driven by reasoning models.

Introduction

We investigated progress trends on four capability metrics to determine whether AI capabilities have recently accelerated. We do this by fitting several candidate curves to historical data (for example, a simple linear trend vs. a hyperbolic trend) and comparing how well each curve predicts data it hasn’t seen yet.
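As a rough illustration of the comparison described above, the sketch below fits two candidate curves to a toy series and scores each on points it has not yet seen. Everything in it is an assumption made for illustration: the data are synthetic, the rolling one-step-ahead scheme is only one way to hold out data, and the hyperbolic candidate uses a fixed, hypothetical singularity date `T_SING`. None of these choices come from the article.

```python
from typing import Callable, Dict, List, Tuple


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Hypothetical singularity date for the hyperbolic candidate (an assumption).
T_SING = 30.0

# Each candidate is a straight line in a transformed time feature.
FEATURES: Dict[str, Callable[[float], float]] = {
    "linear": lambda t: t,
    "hyperbolic": lambda t: 1.0 / (T_SING - t),
}


def rolling_forecast_mse(
    ts: List[float],
    ys: List[float],
    feature: Callable[[float], float],
    min_train: int = 5,
) -> float:
    """Fit on the first k points, predict point k, and average squared errors."""
    errors = []
    for k in range(min_train, len(ts)):
        xs = [feature(t) for t in ts[:k]]
        intercept, slope = fit_line(xs, ys[:k])
        prediction = intercept + slope * feature(ts[k])
        errors.append((prediction - ys[k]) ** 2)
    return sum(errors) / len(errors)


# Synthetic, noise-free series that accelerates over time (not real data).
ts = [float(t) for t in range(20)]
ys = [1.0 + 10.0 / (T_SING - t) for t in ts]

scores = {name: rolling_forecast_mse(ts, ys, f) for name, f in FEATURES.items()}
best = min(scores, key=scores.get)
print(scores, best)
```

Because the toy series is generated from a hyperbolic form, the hyperbolic candidate should score better here; on real metrics the ranking is an empirical question.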

The following interactive plot shows how each candidate curve fits the historical data. Use the tabs to switch between the time series view and the cross-validation accuracy of each curve.

There's a chart here. The chart title reads: Performance over time: Epoch Capabilities Index

Three of four metrics show acceleration, seemingly driven by reasoning models

Three of the four metrics (ECI, log METR 50% time horizon, and a math-focused index we constructed from several math benchmarks) show strong evidence that progress has sped up relative to a global linear trend fit to data from 2023 onward.

The best-performing model across these three metrics was a pair of independent linear trends: one for reasoning models and [...]

---

Outline:

(00:27) Introduction

(01:11) Three of four metrics show acceleration, seemingly driven by reasoning models

(04:14) Methodology

(04:17) AI Capability Metrics

(07:15) Dataset preparation modes

(08:15) Candidate fits

(10:16) Assessing fit quality

(11:44) Constructing "best-performing fits" sets

---

First published:
April 16th, 2026

Source:
https://epoch.ai/publications/have-ai-capabilities-accelerated

---

Narrated by TYPE III AUDIO.

“MirrorCode: Evidence that AI can already do some weeks-long coding tasks” by Tom Adamczewski, David Rein, David Owen, Florian Brand

Fri, 10 Apr 2026 00:00:00 GMT

Subtitle: In our new benchmark, MirrorCode, Claude Opus 4.6 autonomously reimplemented a 16,000-line bioinformatics toolkit — a task we believe would take a human engineer weeks.

Introduction

We present early results from MirrorCode, a benchmark (co-developed with METR) of long-horizon coding tasks derived from real software applications. We find that AI models can autonomously reimplement complex existing software without access to the original program's source code, provided there is a detailed, checkable specification. For example, Claude Opus 4.6 successfully reimplemented gotree — a bioinformatics toolkit with ~16,000 lines of Go and 40+ commands. We guess this same task would take a human engineer without AI assistance 2–17 weeks. We see continued gains from inference scaling on larger projects, suggesting they may be solvable given enough tokens.

AI models are increasingly capable at autonomous coding. Several notable software engineering (SWE) benchmarks have seen rapid progress. However, these usually measure fairly short coding tasks; for example, only about 100 of the 731 SWE-bench Pro tasks involve diffs larger than 100 lines. Meanwhile, recent demos of AI coding (for example, to develop a new C compiler or a new browser) are impressive but hard to evaluate. The completeness of the resulting [...]

---

Outline:

(00:29) Introduction

(04:01) Methodology

(09:34) Preliminary results

(09:37) Recent AI models can fully reimplement real programs

(12:05) Opus 4.6 solved gotree through perseverance, and its engineering was better than older models

(14:08) Further inference scaling might solve Pkl

(16:13) Limitations

(21:13) Discussion and conclusion

(24:31) Data

The original text contained 25 footnotes which were omitted from this narration.

---

First published:
April 10th, 2026

Source:
https://epoch.ai/publications/mirrorcode-preliminary-results

---

Narrated by TYPE III AUDIO.

---

“What does the war in Iran mean for AI?” by Josh You

Fri, 10 Apr 2026 00:00:00 GMT

Subtitle: A prolonged Hormuz crisis probably won't derail the compute buildout, but it could slow data center expansion and disrupt Gulf investment flows into AI.

Disclaimer: My background is in the economics of AI and compute. I’m not an expert in war, diplomacy, or oil and gas, but I’m familiar with what economic inputs matter for AI. I am also writing about a very dynamic situation. So specific claims about the situation in Iran and Hormuz and its impacts on supply chains should be read as tentative and based on relatively quick research.

Since the US and Israel went to war with Iran at the end of February, shipping through the Strait of Hormuz, the sole sea route out of the Persian Gulf, has mostly shut down. This has disrupted around 10% of the world's supply of oil, as well as exports of natural gas, helium, urea, aluminum, and other goods. Iran has also struck targets in the Gulf states, notably oil and gas facilities and a few data centers.

On April 8, the US and Iran agreed to a two-week ceasefire that would reopen the Strait of Hormuz, though it is not clear [...]

---

Outline:

(03:10) The Iran War's impact on energy

(04:53) Energy for fabs

(08:22) Energy for data centers

(10:37) Helium

(12:06) Gulf data centers and investment flows

(16:47) Takeaways

The original text contained 15 footnotes which were omitted from this narration.

---

First published:
April 10th, 2026

Source:
https://epoch.ai/gradient-updates/war-in-iran-and-ai

---

Narrated by TYPE III AUDIO.

---

“AI is a common workplace tool: half of employed AI users now use it for work” by Caroline Falkman Olsson, Yafah Edelman

Thu, 09 Apr 2026 00:00:00 GMT

Subtitle: We surveyed over 2,000 Americans on how they use AI at work: who uses it, how much, which services, and whether it's replacing or creating tasks.

What we found

AI is becoming a mainstream work tool. Half of employed Americans who used AI in the past week reported using AI tools at least as much for work as for personal tasks.
AI is changing what people do at work. It has replaced existing tasks for 27% of employed AI work users and created new ones for 21%.
AI work use is higher among paid subscribers. Employer-paid subscribers are far more likely to use AI for work than free-tier users, and self-payers fall in between.

AI tools have moved from a niche technology to a part of everyday life. In a new Epoch AI/Ipsos survey of over 2,000 U.S. adults, half reported using AI tools in the past week.

But adoption rate alone does not capture the full picture of how AI is used. Among employed users, it has become a work tool that is already changing the tasks they perform, with substantially higher workplace use among paid subscribers than free-tier users.

[...]

---

Outline:

(00:27) What we found

(01:33) AI is now a workplace tool, not just a personal one

(02:41) About this survey

(03:36) AI both creates and replaces tasks at work

(05:09) Work use is higher among paid subscribers

(08:25) Conclusion

---

First published:
April 9th, 2026

Source:
https://epoch.ai/publications/half-of-employed-ai-users-now-use-it-for-work

---

Narrated by TYPE III AUDIO.

---

“Keeping up with the GPTs” by Anson Ho

Tue, 07 Apr 2026 00:00:00 GMT

Subtitle: Can Chinese and open model companies compete with the frontier through e.g. distillation and talent?

If the last decade of AI has taught us one lesson, it's that scaling compute builds better models. This sounds great — until you realize your competitors have ten times more compute than you.

This is the situation that many Chinese and open model companies find themselves in; relative to frontier companies, they’re “compute-poor”. Just last year, Anthropic spent over ten times more on compute than Minimax and Zhipu AI combined, and the gap is even wider for OpenAI:

Source: Epoch's data on AI companies and Data Insights.

You don’t need to be an AI expert to see that this is a huge handicap. With less compute, it's harder to run experiments, train bigger models, and serve many users.

But compute-poor AI labs have an ace up their sleeve. Even lacking frontier-level compute, they can try to use theirs more efficiently to punch well above their weight. That's how DeepSeek was on the heels of OpenAI despite using a fraction of the training compute (at least on some benchmarks), driving the stock market bananas.

The big [...]

---

Outline:

(01:35) Breaking down the efficiency gains

(02:45) Approach 1: Innovate faster than the compute-rich labs

(05:17) Approach 2: Replicate innovations from frontier labs

(08:44) Approach 3: Leverage the capabilities of frontier models

(14:06) Putting things together

(15:29) What does this mean for the future of AI?

(15:59) Compute-poor = Chinese AI labs?

(20:15) Compute-poor = Open models?

(23:47) The bottom line

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
April 7th, 2026

Source:
https://epoch.ai/gradient-updates/keeping-up-with-the-gpts

---

Narrated by TYPE III AUDIO.

---

“Introducing the AI Chip Owners Explorer” by Josh You, Venkat Somala

Mon, 06 Apr 2026 00:00:00 GMT

Subtitle: We announce our new AI Chip Owners explorer, showing which companies own the world's leading AI chips.

Computing capacity (“compute”) is a critical input to the development, training, and deployment of AI systems. How much AI-optimized compute exists in the world, and who owns it? Earlier this year, we launched the AI Chip Sales explorer to track the first question. Today, we’re launching our AI Chip Owners explorer to track the second.

Our AI Chip Owners explorer contains interactive visualizations of our analysis of the number of leading AI chips owned by the largest US hyperscalers and cloud companies, one frontier AI developer (xAI), and Chinese customers, with breakdowns by chip family, chip model, and shifts in ownership over time. We build upon our estimates of the total volumes of Nvidia, Google TPU, Amazon Trainium, AMD, and Huawei chips from the AI Chip Sales explorer, and distribute these chips among major owners using estimates from analysts and industry researchers, company financial disclosures, capital spending, and our analysis of frontier-scale AI data centers.

The AI Chip Owners explorer is intended as a resource for researchers, policymakers, and anyone tracking the strategic landscape of AI compute. You can [...]

---

Outline:

(01:36) Hyperscalers own the majority of global AI compute

(03:41) Chinese customers own just 5% of global AI compute

---

First published:
April 6th, 2026

Source:
https://epoch.ai/publications/introducing-the-ai-chip-owners-explorer

---

Narrated by TYPE III AUDIO.

---

“What do frontier AI companies’ job postings reveal about their plans?” by Jean-Stanislas Denain, Campbell Hutcheson

Tue, 24 Mar 2026 00:00:00 GMT

Subtitle: A fast increase in go-to-market roles, and hints about upcoming products.

AI companies guard their strategies closely. Their hiring pages, however, are public.

And those posts contain clues about what products a company is developing, who it hopes to sell them to, and which bottlenecks it sees coming. A posting for a “Camera ISP Software Engineer” suggests a device with a camera. A search for “Forward Deployed Engineers” hints at the challenges of deploying AI inside companies. A cluster of roles mentioning robotics implies ambitions well beyond chatbots.

We analyzed open roles at the leading foundation labs, including OpenAI, Anthropic, xAI and Google DeepMind. Here is what we found:

First, sales and sales-related hiring has increased sharply at both Anthropic and OpenAI over the past year. Anthropic's go-to-market share of open roles grew from 17% to 31% and OpenAI's from 18% to 28%. This increase has been particularly concentrated in technical roles that help clients deploy AI to their companies.
Second, open roles can provide insight into the product roadmap at the labs. For example, OpenAI and DeepMind are both investing in hardware products, such as robotics and consumer devices. In [...]

---

Outline:

(02:29) Go-to-market is the top hiring category at OpenAI and Anthropic

(06:09) Job postings shed light on new product bets at OpenAI and DeepMind

(08:46) Job postings also offer clues about how labs secure compute and data

(10:53) Conclusion

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 24th, 2026

Source:
https://epoch.ai/gradient-updates/ai-lab-job-postings

---

Narrated by TYPE III AUDIO.

---

“Final training runs account for a minority of R&D compute spending” by Jean-Stanislas Denain, Cheryl Wu

Mon, 23 Mar 2026 00:00:00 GMT

Subtitle: New evidence following the MiniMax and Z.ai IPOs.

In the popular picture of how AI companies use compute, there are two big buckets: training and inference. But in reality, the R&D side is more complex. The final training run — the one that produces the model with a name — is only the last step in a long, expensive process of exploration. Before that run begins, companies burn through compute on: running experiments at various scales, generating synthetic data, testing which ideas work before committing to a final run, and training models that are never released.

This distinction matters. When people discuss compute thresholds or the cost of training a frontier model, they often mean the final training run. However, the full cost of developing that model is much higher. And if most of the spending is exploration rather than execution, then a competitor who learns what works from the frontier could replicate the results for a fraction of the original cost.

So there's more to R&D compute than final training runs, but how much? Last year, we estimated the breakdown of OpenAI's 2024 compute spending at around $5 billion on R&D compute that [...]

---

Outline:

(02:54) Breaking down MiniMax and Z.ai's compute spending

(06:35) Final training runs are a small fraction of R&D compute spending

(08:30) R&D compute and catch-up growth

(10:42) Conclusion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
March 23rd, 2026

Source:
https://epoch.ai/gradient-updates/r-and-d-vs-training-compute

---

Narrated by TYPE III AUDIO.

---

“The least understood driver of AI progress” by Anson Ho

Wed, 25 Feb 2026 00:00:00 GMT

Subtitle: An opinionated guide to “algorithmic progress” and why it matters.

AI software progress is one of those things that everyone vaguely knows about, but only a handful of people in the world truly understand its significance. Consider that many of the most fervent debates in AI to date depend enormously on it: How did DeepSeek seem to catch up to OpenAI's o1 within months while using less training compute? When will the world develop AGI? And if we automate AI research, will AI progress accelerate like crazy a la Situational Awareness and AI 2027?

I don’t know your stances on these questions, but I do know that you can’t have a well-informed opinion on them without understanding software progress. So I figured I should write a post describing the most important things that you need to know, starting from the basics and leading up to the current frontier.

Here are the main takeaways, one for each section of the post:

AI software progress is about reducing the training compute you need to get to the same level of capability, through better algorithms or data. This is commonly called “algorithmic progress” including by [...]

---

Outline:

(03:40) 1. AI software progress: Doing more with what we have

(07:42) 2. How fast is AI software progress?

(14:27) 3. What drives software progress? (Or, why all the estimates we just saw are misleading)

(23:37) 4. How this impacts the software intelligence explosion debate

(30:03) 5. Conclusion

The original text contained 29 footnotes which were omitted from this narration.

---

First published:
February 25th, 2026

Source:
https://epoch.ai/gradient-updates/the-least-understood-driver-of-ai-progress

---

Narrated by TYPE III AUDIO.

---

“Expanding our analysis of biological AI models” by David Atanasov, Niccolò Zanichelli, Jean-Stanislas Denain

Fri, 20 Feb 2026 00:00:00 GMT

Subtitle: We release a database of over 1,100 biological AI models across nine categories. We analyze their safeguards, accessibility, training data sources, and the foundation models they build on.

This report presents an expanded database of AI models in biology, commissioned by Sentinel Bio and building on our 2024 collaboration, in which Sentinel Bio funded Epoch AI to collect and organize information about AI models in biology. The goal of this new report is to expand coverage to new categories of biology-relevant AI models and to capture releases since September 2024.

To build the database, we searched major academic databases and preprint servers for papers introducing AI models in biology, then used language models to filter candidates and extract structured metadata from the remaining papers. The most important models received additional manual review. The full methodology is described in the Appendix.

Key findings

The final database contains 1,196 models, of which 1,124 were annotated using AI assistance only while 72 received dedicated manual annotation. We also manually checked every entry for which we reported safeguards being used. Here are the main findings from our analysis:

Pre-release risk assessments and risk-related evaluations are rare. Only 2.5% of [...]

---

Outline:

(01:18) Key findings

(03:52) Data access

(04:14) Methodology overview

(05:45) Analysis

(05:48) 1. Dataset Overview

(06:15) Category distribution

(07:33) Geographic distribution

(08:38) Institutional distribution

(09:47) Notable models

(10:50) 2. Risk management practices

(15:31) 3. Accessibility

(16:58) 4. Building Block Models

(18:38) 5. Training Data, Parameters, and Compute

(18:43) Training datasets

(19:38) Parameters, data size, and compute

The original text contained 1 footnote which was omitted from this narration.

---

First published:
February 20th, 2026

Source:
https://epoch.ai/publications/expanding-our-analysis-of-biological-ai-models

---

Narrated by TYPE III AUDIO.

---

“How persistent is the inference cost burden?” by Jean-Stanislas Denain

Mon, 16 Feb 2026 00:00:00 GMT

Subtitle: Toby Ord argues that RL scaling primarily increases inference costs, creating a persistent economic burden. While the framing is useful, the cost to reach a given capability level falls fast, and the RL scaling data is thin.

Toby Ord has written a thoughtful post on how RL and inference compute scale for frontier AI models.

As I understand it, the core of his argument is

(1) RL scaling primarily bears fruit by enabling models to productively use longer outputs, which means you need to scale inference compute to realize the gains
(2) RL scaling itself delivers poor returns, requiring roughly 10,000x more compute to match what 100x more inference provides.

Combined with the fact that inference costs are per-use and can’t be amortized like training costs, this paints a picture of a significant and persistent economic burden as we shift away from pretraining scaling.

There's a lot I agree with in Toby's analysis, and I find the framing useful. However, I think both claims above may be overstated. On (1): even though inference costs are per-use, the dollar cost to reach a given capability level falls rapidly over time, so [...]

---

Outline:

(01:41) What I agree with

(02:46) Fixed-capability costs fall fast

(06:33) The returns to RL scaling might be higher

(09:07) Conclusion

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
February 16th, 2026

Source:
https://epoch.ai/gradient-updates/how-persistent-is-the-inference-cost-burden

---

Narrated by TYPE III AUDIO.

---

“What do “economic value” benchmarks tell us?” by Florian Brand, Greg Burnham

Fri, 13 Feb 2026 00:00:00 GMT

Subtitle: These benchmarks track a wide range of digital work. Progress will correlate with economic utility, but tasks are too self-contained to indicate full automation.

Introduction

We review three recently developed benchmarks that aim to measure whether AI systems can perform real-world, digital, non-coding tasks of economic value: Remote Labor Index (RLI), GDPval, and APEX-Agents.

We expect progress on these benchmarks to correlate with real utility. However, the benchmark tasks are well-defined and relatively self-contained. High scores on the benchmarks, therefore, would not imply end-to-end automation of digital professions. Instead, it would imply a shift in how these jobs are done, away from manual execution and toward delegating work to AI, similar to the effect that coding agents have on software engineering today.

The benchmarks also have important differences. We give a short take-away for each.

RLI measures AI ability to do multimedia projects that take humans several days. The first batch of evaluations likely under-elicited models, but this has been improved recently, and top scores are still very low (<5%).
APEX-Agents measures AI ability to do tasks across classically high-paid white collar jobs that take experts a couple hours. Task instructions are self-contained, but the task [...]

---

Outline:

(00:24) Introduction

(02:13) Example tasks

(04:46) Tasks are sourced in different ways, from different fields

(08:19) Lack of interaction and environment messiness affect task realism

(09:59) Estimated human time-to-complete varies substantially

(11:49) Web access makes evaluation a bit less reliable for GDPval

(13:12) Evaluation strategies differ substantially

(14:42) Models have made different amounts of progress on the benchmarks

(16:02) The chosen scaffolds may under-elicit capabilities

(19:28) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
February 13th, 2026

Source:
https://epoch.ai/publications/what-do-economic-value-benchmarks-tell-us

---

Narrated by TYPE III AUDIO.

---

“Where Autonomy Works: Evaluating Robot Capabilities in 2026” by Yann Rivière, Jean-Stanislas Denain

Tue, 10 Feb 2026 00:00:00 GMT

Subtitle: We assess the current state of autonomous robotics by evaluating robot performance on concrete tasks across industrial, household, and navigation domains.

Introduction

Impressive demos are not hard to come by in autonomous robotics. Forming a precise understanding of real-world capabilities is much harder: a task that looks solved in a demonstration may be brittle in deployment. This report assesses robot performance on concrete tasks across three domains (industrial, household, and navigation). For each task, we review the available evidence on reliability, speed, cost, and the ability to adapt (“transfer”) to new environments and objects.

Key takeaways

Navigation is deployed commercially, while most industrial and household tasks are not. Autonomous robots already deliver food in multiple cities, transport goods in warehouses, and inspect infrastructure in remote environments with high reliability. Most tasks requiring robots to handle, assemble, or manipulate objects remain largely in the lab.
Manipulation is commercially deployed in controlled environments with simple tasks, but mostly not beyond. Warehouse picking is the clearest example: robots can handle thousands of object types reliably, because the environment is stable, can be designed around the robot, and the task itself is straightforward. The further we move from that [...]

---

Outline:

(00:27) Introduction

(01:01) Key takeaways

(02:50) Foundation models have become the default for robot manipulation

(05:20) Methodology

(07:21) 1. Industrial applications

(10:13) Tasks we assess

(11:10) Industrial applications: Takeaways

(12:37) 1. Connect cables inside a PC case

(14:39) 2. Assemble IKEA furniture

(16:08) 3. Insert small objects with high precision

(19:01) 4. Sort and handle packages on a conveyor belt

(20:50) 5. Pick and place items of varying fragility and weight

(23:13) 2. Household

(24:42) Tasks we assess

(25:15) Household tasks: Takeaways

(26:59) 1. Cook simple meals

(30:01) 2. Water plants

(31:58) 3. Clean a kitchen

(35:13) 4. Take out the trash

(36:45) 5. Fold basic laundry

(39:59) 6. Tidy a bedroom

(42:18) 3. Navigation

(43:31) Tasks we assess

(44:12) Navigation: Takeaways

(45:23) 1. Transport a human in a building

(47:49) 2. Deliver food

(49:42) 3. Maneuver underwater

(53:30) Looking ahead

(54:35) Acknowledgements

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
February 10th, 2026

Source:
https://epoch.ai/publications/where-autonomy-works-evaluating-robot-capabilities-in-2026

---

Narrated by TYPE III AUDIO.

---

“How close is AI to taking my job?” by Anson Ho

Fri, 06 Feb 2026 00:00:00 GMT

Subtitle: Beyond benchmarks as leading indicators for task automation.

1. Searching under the streetlight

How can we anticipate when AI will be able to do our jobs? AI researchers have mainly tried to answer this question by building complex AI benchmarks. The problem is that this approach is fundamentally flawed.

A good example of this is OpenAI's GDPval. On paper, it's a cool benchmark that captures AI performance on a wide range of real-world job tasks in the US economy. The benchmark tasks were meticulously constructed to be realistic, involving the hard work of hundreds of experts and likely millions of dollars, placing it among the most expensive economics papers of all time. If there's one benchmark that could be the leading indicator of AI job automation, it's GDPval.

Unfortunately, the benchmark seems to have fallen prey to the same issue plaguing most other benchmarks. Shortly after release, AI models beat the human baseline: GPT-5.2 reached parity with industry experts, and Claude Opus 4.6 likely does even better. And yet, the actual economic impacts of AI remain muted. The benchmark doesn’t fully reflect the economic effects, and so it's falling short in its role [...]

---

Outline:

(00:16) 1. Searching under the streetlight

(03:09) 2. Trying to automate my job away (for science)

(04:48) Task 1: Replicating an interactive web interface for an economic model

(09:05) Task 2: Writing an article

(14:43) Task 3: Publishing an article

(19:56) 3. What this all means for my job, and perhaps yours too

(23:18) What about you and your job?

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
February 6th, 2026

Source:
https://epoch.ai/gradient-updates/how-close-is-ai-to-taking-my-job

---

Narrated by TYPE III AUDIO.

---

“Can AI companies become profitable?” by Jaime Sevilla, Hannah Petrovic, Anson Ho

Wed, 28 Jan 2026 00:00:00 GMT

Subtitle: Lessons from GPT-5's economics.

This post was written in collaboration with Exponential View.

Update (March 6, 2026): We’ve revised our estimates based on new information and feedback from people familiar with the matter. In particular, we’ve 1) increased our estimate of inference costs given new information, and 2) lowered our estimate of sales and marketing spending after excluding inference compute costs associated with serving free users. This article reflects these updated figures.

Are AI models profitable? If you ask Sam Altman and Dario Amodei, the answer seems to be yes — it just doesn’t appear that way on the surface.

Here's the idea: running each AI model generates enough revenue to cover its own R&D costs. But that surplus gets outweighed by the costs of developing the next big model. So, despite making money on each model, companies can lose money each year.

This is big if true. In fast-growing tech sectors, investors typically accept losses today in exchange for big profits down the line. So if AI models are already covering their own costs, that would paint a healthy financial outlook for AI companies.

But we can’t take Altman and [...]

---

Outline:

(03:29) Part I: How profitable is running AI models?

(08:13) Part II: Are models profitable over their lifecycle?

(11:07) Part III: Will AI models become profitable?

The original text contained 26 footnotes which were omitted from this narration.

---

First published:
January 28th, 2026

Source:
https://epoch.ai/gradient-updates/can-ai-companies-become-profitable

---

Narrated by TYPE III AUDIO.

---

“How well did forecasters predict 2025 AI progress?” by Anson Ho

Fri, 16 Jan 2026 00:00:00 GMT

Subtitle: Mostly right about benchmarks, mixed results on real-world impacts.

This post was written in collaboration between the AI Futures Project, the AI Digest, and Epoch AI. It analyzes the results of the 2025 AI Digest survey. You can take the 2026 AI forecasting survey here.

Every other AI paper I read seems to start with some version of “AI progress has been fast”. And sure, that's obviously true — a year ago there was no GPT-5, no DeepSeek-R1, and not even Claude 3.7 Sonnet! But few people seem to say exactly how fast things have been, and whether people saw it coming. So when the AI Digest released a survey for people to forecast AI progress over the last year, I was excited.

The survey helps track something akin to an “AI 2027 worldview”, where automating AI R&D leads to a surge in AI capabilities and hence a range of risks to humanity, especially from handing power off to AI systems. You can see this in the question topics: about half of the survey is about forecasting performance on benchmarks related to AI R&D. The other half looks at real-world impacts — think AI-enabled [...]

---

Outline:

(01:48) Demographics: Junior, short-ish timelines, high risk of AI catastrophe

(04:11) Benchmarks related to AI R&D: The median forecast was on the money (for the most part)

(08:46) OpenAI preparedness scores: mixed results on risks

(11:55) AI's prominence: underestimated revenue, and overestimated public perception

(12:32) Frontier lab revenues

(15:32) Public attention on AI

(16:59) Takeaways from the survey

The original text contained 1 footnote which was omitted from this narration.

---

First published:
January 16th, 2026

Source:
https://epoch.ai/gradient-updates/how-well-did-forecasters-predict-2025-ai-progress

---

Narrated by TYPE III AUDIO.

---

“Epoch AI 2025 impact report” by The Epoch AI Team

Fri, 16 Jan 2026 00:00:00 GMT

Subtitle: In 2025, Epoch AI published over a hundred outputs, more than doubled its reach and raised over ten million dollars.

In 2025, we saw AI continue to increase in scale and importance. AI companies reached annual revenues totalling tens of billions of dollars, and are building data centers that individually cost comparable amounts. Leading benchmarks show capabilities accelerating, propped up by the establishment of reasoning models, such as OpenAI's oN model series. And we have seen an incredible diffusion of capabilities, with Chinese open weight models such as DeepSeek R1 closing the gap with US frontier models released only months before.

Epoch AI has responded with new and expanded initiatives to advance its mission of sharing up-to-date information about – and making sense of – the trajectory of AI. We are excited to share a recap of our work in 2025, and our plans for 2026.

We are raising $3 million to execute a more ambitious version of our plans. Donations can be made directly through our website. For those considering a substantial contribution, or commissioning a project, please contact us at donate@epoch.ai.

Highlights from 2025

AI data centers & compute clusters

AI [...]

---

Outline:

(01:29) Highlights from 2025

(01:32) AI data centers & compute clusters

(02:29) The Benchmarking Hub & the Epoch Capabilities Index (ECI)

(03:56) FrontierMath Tier 4

(05:17) Growth and AI Transition Endogenous (GATE) model

(06:19) Data Insights & Gradient Updates

(07:44) AI in 2030

(08:44) Epoch AI by the numbers

(08:47) Outputs

(09:32) Reach

(10:22) Finances and organization

(10:53) Press and citations

(13:17) Paid engagements

(14:43) Events and other engagements

(15:46) Testimonials from our audience

(19:32) Governance and Transparency

(20:23) Our plans for 2026

(21:02) Data & Trends

(22:54) Evaluations & Benchmarks

(24:55) Research & Consultations

(26:03) Website and communications

(26:55) Support our work

---

First published:
January 16th, 2026

Source:
https://epoch.ai/publications/epoch-impact-report-2025

---

Narrated by TYPE III AUDIO.

---

“Introducing the AI Chip Sales Data Explorer” by The Epoch AI Team

Tue, 13 Jan 2026 00:00:00 GMT

Subtitle: We announce our new AI Chip Sales data explorer, which uses financial reports, company disclosures, and more to estimate compute, power usage, and spending over time for a wide variety of AI chips.

Introduction

Discussions about AI progress increasingly hinge on computing capacity – aka compute – which is essential in order to develop, train, and deploy AI systems. But public data on the total capacity of AI computing hardware can be fragmented and incomplete.

To address this, we are releasing a new AI Chip Sales data explorer, estimating and visualizing both the number and capacity of AI accelerators that have been sold or delivered in recent years. We leverage data and evidence from earnings reports, company disclosures, analyst coverage, and media reporting to produce estimates of AI chip counts across major vendors: Nvidia, Google, Amazon, AMD, and Huawei, broken down by AI chip model.

We believe this release provides the most complete publicly available picture to date on the global stock of AI compute.

Compute

We find that cumulative global AI compute capacity has reached the equivalent of more than 15 million Nvidia H100 GPUs, measured using each chip's respective peak specifications in 8-bit operations [...]

---

Outline:

(00:26) Introduction

(01:23) Compute

(02:18) Costs and power

---

First published:
January 13th, 2026

Source:
https://epoch.ai/publications/introducing-the-ai-chip-sales-data-explorer

---

Narrated by TYPE III AUDIO.

---


“An FAQ on Reinforcement Learning Environments” by Jean-Stanislas Denain, Chris Barber

Mon, 12 Jan 2026 00:00:00 GMT

Subtitle: We interviewed 18 people across RL environment startups, neolabs, and frontier labs about the state of the field and where it's headed.

This post is a collaboration between guest author Chris Barber and JS Denain from Epoch AI.

Reinforcement learning (RL) environments have become central to how frontier AI labs train their models. In September 2025, The Information reported that Anthropic had discussed spending over $1 billion on RL environments over the following year. As Andrej Karpathy put it in his 2025 year-in-review: by training LLMs on a wide range of verifiable tasks across different environments, “the LLMs spontaneously develop strategies that look like ‘reasoning’ to humans.”

This wave of RL for capabilities started with OpenAI's o1, which was trained on math and coding problems with verifiable answers. Since then, labs have expanded the range of tasks they train on, all the while scaling the amount of compute spent on RL training.

Without diverse, high-quality environments and tasks to train on, throwing more compute at RL risks wasting much of it. As a result, creating those tasks and environments has become a key bottleneck for scaling capabilities, and a growing market that remains largely [...]

---

Outline:

(02:47) What are RL environments and tasks?

(05:50) How are RL environments used by labs?

(08:06) Which companies build RL Environments?

(10:08) How much do environments and tasks cost?

(12:33) What domains do RL environments cover?

(16:22) What are the top priorities and challenges?

The original text contained 10 footnotes which were omitted from this narration.

---

First published:
January 12th, 2026

Source:
https://epoch.ai/gradient-updates/state-of-rl-envs

---

Narrated by TYPE III AUDIO.

---


“How far can decentralized training over the internet scale?” by Jaime Sevilla

Mon, 29 Dec 2025 00:00:00 GMT

Subtitle: Decentralized training over the internet promises to scale training to the limits of the internet.

Previously, I discussed decentralized training in the context of hyperscalers. Microsoft, Google and other giants are building interconnected gigawatt-scale datacenters, which could be used to train models at an unprecedented computational scale. The decentralization could sidestep the difficulty of securing 10 gigawatts of power in a single location by splitting one massive run into ten more manageable gigawatt-scale blocks.

But when people think of decentralized training, they don’t first think of gigantic datacenters, owned by the same company, training models across large distances. Instead, they imagine thousands of small datacenters, or individual consumers, pooling their spare compute over the internet to orchestrate a training run larger than any single actor could manage alone.

Many companies are pursuing this vision: Pluralis Research, Prime Intellect and Nous Research have already successfully decentrally trained models at scale. But in practice, training decentrally over the internet has lagged far behind more centralized training. Even their largest models (Pluralis’ 8B Protocol Model, Prime Intellect's INTELLECT-1, and Nous’ Consilience 40B) have been trained with 1,000x less compute than today's frontier models (such as xAI's Grok [...]

---

Outline:

(02:42) Is decentralized training over the internet feasible?

(04:04) Decentralized data parallelism

(08:26) Decentralized model parallelism

(10:28) Decentralized RL training

(12:49) Putting it all together: decentralized internet training at frontier scale is likely feasible

(15:15) Can decentralized developers amass the necessary compute?

(20:22) Conclusion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
December 29th, 2025

Source:
https://epoch.ai/gradient-updates/how-far-can-decentralized-training-over-the-internet-scale

---

Narrated by TYPE III AUDIO.

---


“Why benchmarking is hard” by Florian Brand, Jean-Stanislas Denain

Tue, 23 Dec 2025 00:00:00 GMT

Subtitle: Running benchmarks involves many moving parts, each of which can influence the final score. The two most impactful components are scaffolds and API providers.

This post is part of our Gradient Updates newsletter, which shares more opinionated or informal takes about big questions in AI progress. These posts solely represent the views of the authors, and do not necessarily reflect the views of Epoch AI as a whole.

Benchmarks play a crucial role in the AI landscape: They inform everyone, from AI researchers to the general public, about the current state of capabilities and the overall rate of progress. Third-party organizations, such as Epoch AI, independently run and collate benchmark results on a page like the benchmarking hub.

However, benchmarking isn’t easy: at each stage of the benchmarking pipeline, there are many moving parts and degrees of freedom that can affect the final result: this makes it hard to compare any two evaluation scores. Moreover, each stage can introduce bugs or mistakes that make the results costly to obtain or invalid.

In this post, we dive into the different steps of the benchmarking process, which we split into two main parts:

Benchmark [...]

---

Outline:

(02:03) Main takeaways

(02:33) The Benchmark Setup

(02:55) Prompts & Sampling Parameters

(06:12) Scaffolds continue to have an outsized impact

(08:01) Execution Environment

(09:15) Scoring

(10:01) Model Access

(10:15) API & SDK

(11:10) API Aggregator

(11:41) Model Provider

(15:29) Model Deployment

(16:04) Conclusion

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
December 23rd, 2025

Source:
https://epoch.ai/gradient-updates/why-benchmarking-is-hard

---

Narrated by TYPE III AUDIO.

---


“Top 10 Data Insights and Gradient Updates of 2025” by The Epoch AI Team

Tue, 23 Dec 2025 00:00:00 GMT

Subtitle: In 2025 we released over 70 short form investigations of AI. We review the 10 most popular ones on our website.

Introduction

In 2025, we ramped up our public communication to keep pace with rapid developments in AI.

Our Data Insights offer short, visual, self-contained investigations of key trends and metrics in AI.

Gradient Updates is our outlet for leading-edge commentary by specific authors (also offered as a newsletter on Substack), without necessarily representing the views of Epoch AI as a whole.

Over the year, we published 36 Data Insights and 37 Gradient Updates.

Here, we bring you our top 10 most popular Data Insights and Gradient Updates in 2025.

Most popular Data Insights

LLM inference prices have fallen rapidly but unequally across tasks

🠊 In short: Between April 2023 and March 2025, we saw a drop of more than 10x in the price per token at an equivalent performance level.

There's a chart here. The chart title reads: LLM inference prices have fallen 9x to 900x per year, depending on the task

🠊 Why this matters: API cost reductions indicate a more competitive market and large gains in efficiency, making AI more affordable to [...]

---

Outline:

(00:23) Introduction

(01:06) Most popular Data Insights

(01:10) LLM inference prices have fallen rapidly but unequally across tasks

(01:55) Frontier AI performance becomes accessible on consumer hardware within a year

(03:03) Most of OpenAI's 2024 compute went to experiments

(03:52) The stock of computing power from NVIDIA chips is doubling every 10 months

(04:38) GPT-5 and GPT-4 were both major leaps in benchmarks from the previous generation

(05:21) Most popular Gradient Updates

(05:25) How much energy does ChatGPT use?

(06:20) How has DeepSeek improved the Transformer architecture?

(07:29) How far can reasoning models scale?

(08:21) How big could an "AI Manhattan Project" get?

(09:20) Most AI value will come from broad automation, not from R&D

The original text contained 1 footnote which was omitted from this narration.

---

First published:
December 23rd, 2025

Source:
https://epoch.ai/publications/top-10-data-insights-and-gradient-updates-of-2025

---

Narrated by TYPE III AUDIO.

---


“The changing drivers of LLM adoption” by Jean-Stanislas Denain, Anson Ho

Fri, 19 Dec 2025 00:00:00 GMT

Subtitle: Public data as well as our original polling suggest LLM adoption is roughly on trend, but the underlying drivers are shifting.

In the world of AI, half a year is a very long time. Back in July, we saw LLMs being adopted faster than almost any other technology in history. Five months later we’re still seeing rapid growth, but we’re also seeing early winds of change — both in who uses AI and how they do so.

Using the latest public data and a poll of US adults we conducted with Blue Rose Research, this post shares an updated picture of the state of LLM adoption.

How quickly are consumers adopting LLMs?

More people are using LLMs — but they’re increasingly using different LLMs, different products, and in different places

Through the first half of 2025, ChatGPT's user base grew at a remarkable pace, from under 400 million weekly active users in January to nearly 800 million by August — roughly 50 million new users per month. Since then, growth may have slowed slightly, though it's a bit soon to tell how much of this is noise versus a lasting trend change:

Does this mean [...]

---

Outline:

(00:52) How quickly are consumers adopting LLMs?

(00:56) More people are using LLMs -- but they're increasingly using different LLMs, different products, and in different places

(03:57) Consumers are using LLMs much more intensely on AI apps, while web traffic has stagnated

(07:13) AI company revenues have continued to grow incredibly fast, in line with previous trends

(08:07) How embedded is AI in daily tasks and jobs?

(08:24) AI has entered the workplace beyond formal enterprise adoption

(10:11) Most consumers use AI to seek information

(12:41) AI use is stratified by income and job type, and less so by gender

(14:44) Conclusion

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
December 19th, 2025

Source:
https://epoch.ai/gradient-updates/the-changing-drivers-of-llm-adoption

---

Narrated by TYPE III AUDIO.

---


“Is almost everyone wrong about America’s AI power problem?” by Anson Ho, Yafah Edelman, Josh You, Jean-Stanislas Denain

Wed, 17 Dec 2025 00:00:00 GMT

Subtitle: Why power is less of a bottleneck than you think.

In AI circles, there's a common argument that goes: “The US is horrible at building power, but China's great at it. And since power is so important for the AI race, China wins by default.”

This line of reasoning is everywhere. NVIDIA CEO Jensen Huang used it to argue that “China is going to win the AI race” last month. It features in Situational Awareness, a series of essays about how the world's in a fierce race to AGI, which received a seal of endorsement from Ivanka Trump. There's even an entire Dwarkesh podcast episode called “China is killing the US on energy. Does that mean they’ll win AGI?”.

But we think this argument is overstated: power bottlenecks likely won’t dramatically or permanently impede the data center buildout in the US. Claims about America's AI power predicament are partially based on a misunderstanding, and there are multiple promising approaches to meet America's AI power demands. That means that people are overrating the strength of the power bottleneck, and how much that impacts the “race to AGI”.

So why do we believe this, and [...]

---

Outline:

(01:28) America's AI power predicament -- or not?

(04:26) Plucking fruit from the power tree

(04:45) Natural gas: Digging power out of the ground

(09:15) Solar: Raining power down from the sky

(13:13) Demand response: Acquiring power "out of thin air"

(15:55) Adding up the numbers

(18:20) So is everyone wrong about this?

The original text contained 16 footnotes which were omitted from this narration.

---

First published:
December 17th, 2025

Source:
https://epoch.ai/gradient-updates/is-almost-everyone-wrong-about-americas-ai-power-problem

---

Narrated by TYPE III AUDIO.

---


“A Rosetta Stone for AI benchmarks” by Anson Ho, Jean-Stanislas Denain, David Atanasov, Samuel Albanie, Rohin Shah

Tue, 02 Dec 2025 00:00:00 GMT

Subtitle: Most benchmarks saturate too quickly to study long-run AI trends. We solve this using a statistical framework that stitches benchmarks together, with big implications for algorithmic progress and AI forecasting.

Introduction

We rely on benchmarks to measure AI capabilities, but even the best benchmarks are just narrow glimpses into what AI can do.

Consider a benchmark. If a model is really bad, it will score 0% on the benchmark. But the same is true for a model that's extremely bad — the benchmark offers no signal to distinguish these two models, even though one is much better than the other.

Similarly, a model that's really good will score 100% — but so will a model that's extremely good. We can’t tell these good models apart either.

We can only compare models when they’re in the middle — not too good and not too bad. And since models improve so quickly, their time in the middle is really short, so we can’t see long-run trends in whether AI progress is speeding up, slowing down, or hitting a wall.

A New Approach

So how do we solve this? We propose a new approach: using a statistical model, we [...]

---

Outline:

(00:29) Introduction

(01:33) A New Approach

(03:55) What this framework tells us

(07:10) Next steps

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
December 2nd, 2025

Source:
https://epoch.ai/publications/a-rosetta-stone-for-ai-benchmarks

---

Narrated by TYPE III AUDIO.

---


“Benchmark Scores = General Capability + Claudiness” by Greg Burnham

Thu, 20 Nov 2025 00:00:00 GMT

Subtitle: Is this because skills generalize very well, or because developers are pushing on all benchmarks at once?

The Gemini 3 release included a massive table showing how the model was state-of-the-art on nineteen diverse benchmarks. Such tables are commonplace by now, but they add up to an odd statistical situation. Benchmarks ostensibly measure different things, but since models tend to improve on many benchmarks at once, the dataset of benchmark scores is dominated by a single “General Capability” dimension.

In this post, I’ll describe the statistics of this dataset, look into what's left when you factor out this dominant dimension (hint: it's “Claudiness”), and discuss how this relates to an important question about cross-task generalization.

Benchmarking data is dominated by a single underlying dimension

This is one of the lessons of our recent work on the Epoch Capabilities Index (ECI), which combines thirty-nine benchmarks into a single capabilities score. If benchmarks were generally uncorrelated with each other, you’d expect to see large residuals: the benchmark scores predicted by a model's ECI number wouldn’t match the model's actual benchmark scores. As it turns out, we see a very good match. In other words, our nominally high-dimensional [...]

---

Outline:

(01:00) Benchmarking data is dominated by a single underlying dimension

(03:08) Benchmarking data shows a smaller "Claudiness" dimension

(04:16) Is the "general capability" dimension deep, or contingent?

(06:09) A trillion dollar question

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
November 20th, 2025

Source:
https://epoch.ai/gradient-updates/benchmark-scores-general-capability-claudiness

---

Narrated by TYPE III AUDIO.

---


“The software intelligence explosion debate needs experiments” by Anson Ho, Parker Whitfill

Fri, 14 Nov 2025 00:00:00 GMT

Subtitle: The existing debate rests on data and assumptions that are shakier than most people realize. To make progress, we need better evidence, and experiments are the best way to get it on the margin.

Suppose you had a million AIs, each surpassing humanity's best AI researchers. If they all worked on advancing AI, how much would AI progress accelerate?

This might sound like science fiction, but it may be the most consequential question about the future of AI. The problem is that the experts disagree wildly on the answer.

Some foresee a positive feedback loop. These AIs are smart enough to find new algorithms to make smarter AIs, which make even smarter AIs, and so on. Very soon, we could see multiple years of AI progress compressed into a single year just through software advances: a “software intelligence explosion”.

Others agree that AI progress would speed up, but think that something will block the explosive feedback loop. For example, increasing difficulty in finding new algorithms might bottleneck AI self-improvement, or software improvements might depend heavily on physical resources like compute, which can’t be scaled as easily.

And we really need to know [...]

---

Outline:

(01:55) Flawed data

(06:42) Flawed models

(09:32) To make progress, we need experiments

The original text contained 19 footnotes which were omitted from this narration.

---

First published:
November 14th, 2025

Source:
https://epoch.ai/gradient-updates/the-software-intelligence-explosion-debate-needs-experiments

---

Narrated by TYPE III AUDIO.

---


“Introducing the Frontier Data Centers Hub” by The Epoch AI Team

Tue, 04 Nov 2025 00:00:00 GMT

Subtitle: We announce our new Frontier Data Centers Hub, a database tracking large AI data centers using satellite and permit data to show compute, power use, and construction timelines.

Introduction

Companies are building AI data centers at an unprecedented scale. These facilities have the power capacity of small countries and cost tens of billions to construct. Yet until now, the details of their true capacity and progress have remained opaque. To help the public, researchers, policymakers, and investors understand the scale of this new infrastructure wave, Epoch AI has created the Frontier Data Centers Hub.

This open database tracks the construction and capacity of major AI data centers using satellite imagery, public permits, and other open sources. It's the most detailed public resource to date on how much power, land, and hardware the largest AI companies are deploying — and when.

The 13 large U.S. data centers tracked in the hub account for a substantial share of total compute stock globally: about 2.5 million (~15%) of the roughly 15 million H100-equivalents that have been delivered to customers in the past several years as of late 2025.

You can read more about how AI data centers work in our [...]

---

Outline:

(00:24) Introduction

(01:40) Power

(04:23) Compute

(06:21) Cost

---

First published:
November 4th, 2025

Source:
https://epoch.ai/publications/introducing-the-frontier-data-centers-hub

---

Narrated by TYPE III AUDIO.

---


“What you need to know about AI data centers” by Ben Cottier, Yafah Edelman

Tue, 04 Nov 2025 00:00:00 GMT

Subtitle: AI companies are planning a buildout of data centers that will rank among the largest infrastructure projects in history. We examine their power demands, what makes AI data centers special, and what all this means for AI policy and the future of AI.

This report accompanies our Frontier Data Centers Hub.

Introduction

It's difficult to appreciate the historic scale of AI data centers. They represent some of the largest infrastructure projects humanity has ever created.

To get a sense of the scale, consider that OpenAI's Stargate Abilene data center will need:

Enough electricity to serve the population of Seattle
More than 250× the computing power of the supercomputer that trained GPT-4
A plot of land larger than 450 soccer fields
$32 billion in construction and IT equipment costs
A few thousand construction workers
Around two years for construction

And that's just a small part of the picture. Companies are currently building many other data centers like Stargate Abilene. By the end of 2027, AI data centers could collectively see hundreds of billions in investment, rivalling the Apollo program and Manhattan Project.

This raises many questions, such as:

[...]

---

Outline:

(00:32) Introduction

(01:56) Power - the most important thing to know about an AI data center

(02:02) Power determines where AI data centers are built

(05:07) Where power comes from

(07:07) What's so special about AI data centers?

(07:11) AI data centers have exceptionally high power densities

(08:52) Huge power densities call for unique cooling systems

(11:03) What does this all mean for AI progress and policy?

(11:08) AI's broad climate impact isn't very big (yet)

(12:31) Companies probably won't need to decentralize AI training over the next two years

(13:45) Gigawatt-scale AI data centers are hard to secure

(15:06) Conclusion

The original text contained 25 footnotes which were omitted from this narration.

---

First published:
November 4th, 2025

Source:
https://epoch.ai/publications/what-you-need-to-know-about-ai-data-centers

---

Narrated by TYPE III AUDIO.

---


“What does OSWorld tell us about AI’s ability to use computers?” by Florian Brand, Greg Burnham

Thu, 30 Oct 2025 00:00:00 GMT

Subtitle: We review OSWorld, a prominent computer use benchmark. Its tasks are relatively simple, many don’t require GUIs, and success often hinges on interpreting ambiguous instructions. It is also not stable over time.

OSWorld Computer use

OSWorld is a benchmark for evaluating large language models on computer use tasks. A model is given task instructions and an Ubuntu virtual machine and must execute actions to perform the task.

Size: 361 computer use tasks
Data sourcing: Humans, forums, tutorials, etc.
Scoring method: Evaluation function
Contamination risk: Medium

If AI systems are to be digital coworkers, they will need to be able to use computers. In this article we review OSWorld, a popular benchmark designed to measure progress toward this milestone. What is in this benchmark, how should we interpret progress, and what will it mean when AI systems score near-perfect, i.e. have saturated it?

Main Takeaways

Saturation on OSWorld means a model can execute simple, realistic tasks in Linux-based environments using popular open-source applications. These include things like adding page numbers to a document or exporting a CSV file from [...]

---

Outline:

(03:21) OSWorld setup and evaluation

(04:56) OSWorld task instructions are continually updated, making through-time comparisons of uncertain value

(05:42) Saturation on OSWorld means a model can do simple, realistic tasks in Linux-based environments using popular open-source applications

(06:14) Most tasks are simple

(07:49) Tasks use a variety of applications

(08:30) The titular OS is Linux and applications are open source, but this is probably not a major issue

(09:31) Terminal use and Python scripting can go a long way

(09:47) About 15% of tasks can be completed using only a terminal

(10:59) About 30% of tasks can substitute terminal use and Python scripting for much GUI use

(12:58) Many instructions are borderline-ambiguous and discerning the instruction's intent is a significant part of succeeding

(16:05) About 10% of OSWorld tasks have serious errors

(18:11) About 10% of OSWorld tasks rely on live data from the Internet, and thus their difficulty may change over time

(19:10) Conclusion

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
October 30th, 2025

Source:
https://epoch.ai/publications/what-does-osworld-tell-us-about-ais-ability-to-use-computers

---

Narrated by TYPE III AUDIO.

---


“Could decentralized training solve AI’s power problem?” by Jaime Sevilla, Anton Troynikov

Tue, 28 Oct 2025 00:00:00 GMT

Subtitle: We illustrate a decentralized 10-gigawatt training run across a dozen sites spanning thousands of kilometers. Developers are likely to scale datacenters to multi-gigawatt levels before adopting decentralized training.

Introduction

In their quest to make smarter AI, companies vie to build — and power — the largest datacenters.

xAI built the 350-megawatt Colossus datacenter in Memphis, which they plan to expand to 1.5 gigawatts, while OpenAI's 240-megawatt Abilene Stargate datacenter is planned to reach 1.2 gigawatts. At full scale, these single datacenters will rival the most power-hungry facilities in the world today, such as the 1.2-gigawatt Maaden or the 1.6-gigawatt Bahrain aluminium smelters.

And if trends continue, companies might soon exceed this, with training clusters projected to reach 10 gigawatts by the end of the decade. This is larger than the capacity of the US's largest power plant, the Grand Coulee Dam, and nearly matches the total installed power capacity for all NVIDIA AI chips at the end of 2024.

But utilities are already struggling to supply the power AI hyperscalers demand. John Ketchum, CEO of the largest US utility, NextEra, stated last year that while some sites could readily support one gigawatt [...]

---

Outline:

(00:28) Introduction

(03:45) Planning a ten-gigawatt training run

(05:37) What kind of model would we train?

(07:17) How would we decentralize training?

(09:15) Can decentralized training maintain sufficient throughput at very large scales?

(10:25) How long to process each batch?

(11:47) How much time will we spend on the network?

(13:07) Would the propagation latency be low enough?

(14:48) Could bandwidth be a bottleneck?

(17:26) Is the network cost feasible?

(20:09) Will companies actually turn to decentralized training at scale?

The original text contained 28 footnotes which were omitted from this narration.

---

First published:
October 28th, 2025

Source:
https://epoch.ai/publications/could-decentralized-training-solve-ais-power-problem

---

Narrated by TYPE III AUDIO.

---


“Less than 70% of FrontierMath is within reach for today’s models” by Greg Burnham

Fri, 17 Oct 2025 00:00:00 GMT

Subtitle: 57% of problems have been solved at least once.

The best we have seen a model perform on a single run of FrontierMath is 29%. If you want to use a model to solve FrontierMath-style problems, that's the number to consider.

But there's another way to gauge state-of-the-art performance: how many FrontierMath problems have been solved by any model, on any run, even once? This tells us more about what is “within reach” for today's models. It's also more forward-looking: if today's models can generate the right ideas to solve a problem at all, then that makes it more likely that tomorrow's models will be able to solve the problem reliably. In other words, we can get a view of the future that's a bit more concrete than just extrapolating accuracy trends.

To make matters more interesting, there's some empirical evidence that if you run an LLM on a benchmark N times, the percentage of problems correctly solved at least once (known as pass@N) increases proportionally to log(N). If that's true in general, then, since log(N) is unbounded, we should expect pass@N to approach 100% as the models are given more tries. Could FrontierMath's saturation [...]
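As a rough illustration of what pass@N measures, the sketch below computes it from a table of per-run results. The function name `pass_at_n`, the data layout, and the sample results are hypothetical and not taken from Epoch's evaluation code. For simplicity it uses the first N runs in order rather than averaging over all subsets of runs.

```python
from typing import Sequence


def pass_at_n(results: Sequence[Sequence[bool]], n: int) -> float:
    """Fraction of problems solved at least once in the first n runs.

    results[r][p] is True if run r solved problem p (hypothetical layout).
    """
    if not 1 <= n <= len(results):
        raise ValueError("n must be between 1 and the number of runs")
    num_problems = len(results[0])
    # A problem counts as solved if any of the first n runs solved it.
    solved = [any(run[p] for run in results[:n]) for p in range(num_problems)]
    return sum(solved) / num_problems


# Made-up data: 3 runs over 4 problems.
runs = [
    [True, False, False, False],
    [False, True, False, False],
    [True, False, False, False],
]
curve = [pass_at_n(runs, n) for n in range(1, len(runs) + 1)]
print(curve)
```

In this toy data the curve rises and then stops growing, because the last two problems are never solved by any run. That is the kind of plateau that would separate problems "within reach" from those that are not.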

---

Outline:

(02:26) GPT-5's pass@N caps out below 50%

(03:14) GPT-5 Pass@N on FrontierMath Tiers 1-3

(04:21) Pass@the-kitchen-sink likely caps out below 70%

(06:48) ChatGPT Agent Pass@N on FrontierMath Tiers 1-3

(08:52) This gives us something to watch as models improve on FrontierMath

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
October 17th, 2025

Source:
https://epoch.ai/gradient-updates/less-than-70-percent-of-frontiermath-is-within-reach-for-todays-models

---

Narrated by TYPE III AUDIO.

---


“OpenAI is projecting unprecedented revenue growth” by Greg Burnham

Wed, 15 Oct 2025 00:00:00 GMT

Subtitle: No company has gone from 10 billion dollars to 100 billion dollars as fast as OpenAI projects to do.

Epoch's new AI companies database shows the remarkable level and pace of growth of OpenAI's revenue. It first exceeded $1 billion in 2023 and will exceed $10 billion in 2025. This is impressive, but not unprecedented — a few other companies have matched this growth rate historically.

OpenAI's projections, however, are a different story. According to The Information, in Q3 2025 OpenAI projected its 2028 revenue to be $100 billion. I couldn’t find any examples of a company growing its revenue from around $10 billion to $100 billion in such a short period of time.
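To put the projection in annual terms: going from about $10 billion in 2025 to $100 billion in 2028 is a tenfold increase over three years. The short sketch below works out the implied compound annual growth rate. The function name is hypothetical, and the inputs are the rounded figures quoted above rather than exact revenue numbers.

```python
def implied_annual_growth(start: float, end: float, years: float) -> float:
    """Compound annual growth rate needed to go from start to end in `years`."""
    return (end / start) ** (1 / years) - 1


# Rounded figures from the text: ~$10B in 2025, $100B projected for 2028.
growth = implied_annual_growth(10e9, 100e9, 3)
print(f"{growth:.0%} per year")  # roughly 115%: revenue slightly more than doubles each year
```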

What happens if OpenAI falls short of these projections? At a minimum, it would likely have to scale back its plans for large compute build-outs. The recently announced deals with Nvidia, AMD, and Broadcom imply expenditures of roughly $1.3 trillion within the next decade, and some of this is presumably expected to be financed by revenue or debt raised against revenue.

But the second-order effects of a miss could be larger. This is because investors and other companies are increasingly betting [...]

---

Outline:

(02:18) OpenAI's revenue grew very quickly from $1B to $10B

(03:51) No company has grown its revenue from $10 billion to $100 billion in three years

(06:34) Can OpenAI do it?

(09:29) What if not?

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
October 15th, 2025

Source:
https://epoch.ai/gradient-updates/openai-is-projecting-unprecedented-revenue-growth

---

Narrated by TYPE III AUDIO.

“Evaluating Gemini 2.5 Deep Think’s math capabilities” by Greg Burnham

Thu, 09 Oct 2025 00:00:00 GMT

Subtitle: It has improved at using background knowledge and doing precise computations. It can be a helpful research assistant and may take a more conceptual approach to geometry. It shows limited creativity and sometimes struggles with citations.

Introduction

We evaluated the math capabilities of Gemini 2.5 Deep Think (hereafter, Deep Think), the publicly available version of the model that got a gold medal-equivalent score on the International Mathematical Olympiad (IMO). What are its strengths and weaknesses, both in absolute terms and relative to other models? To our knowledge, this is the most comprehensive third-party evaluation conducted to date on such a “high compute” model setting.

Note: This work was commissioned by Google. Epoch maintained editorial control over the output. We offer timely and in-depth evaluation as a service to model developers; email info@epoch.ai for details.

Executive Summary

Deep Think set a new record on FrontierMath Tiers 1–3 (29%) and Tier 4 (10%), representing an improved ability to solve short-answer math problems that require deep background knowledge and precise execution of computations.
Two professional mathematicians characterized Deep Think as a generally helpful research assistant, broadly on par with the best available models.
While this [...]

---

Outline:

(00:28) Introduction

(01:13) Executive Summary

(02:32) Methodology

(03:46) Deep Think's performance on FrontierMath indicates advances in background knowledge and executing complex computations

(06:48) Deep Think Performance vs. Problem Ratings

(10:44) Mathematicians characterized Deep Think as a helpful research assistant, though one noted a weakness at citing the literature

(16:01) Deep Think did well on the 2025 IMO, but failed to solve two older IMO problems requiring more creative and intricate proofs

(26:17) Deep Think's approach to geometry is more conceptual than we have seen from other models

(31:06) We observed Deep Think making one mistake that is reminiscent of classical human cognitive biases

(32:59) Conclusion

The original text contained 19 footnotes which were omitted from this narration.

---

First published:
October 9th, 2025

Source:
https://epoch.ai/publications/deep-think-math

---

Narrated by TYPE III AUDIO.

“How many digital workers could OpenAI deploy?” by Jean-Stanislas Denain, Anson Ho, Jaime Sevilla

Fri, 03 Oct 2025 00:00:00 GMT

Subtitle: OpenAI has the inference compute to deploy tens of millions of digital workers, but only on a narrow set of tasks – for now.

The core argument for how AI could drive explosive economic growth is that you can dramatically scale up the number of AI “digital workers”. The idea is that growth is constrained by labor, so rapidly expanding the workforce would hugely accelerate growth rates.

This is where AI comes in. While you can’t double the human population each year, you can double the number of AI chips – as we’ve seen with NVIDIA and OpenAI. So if AI can fully substitute for human workers, the workforce could grow many times faster than today. As a result, the economy could grow many times faster too.

If this framing is right, then to know how far we are from explosive growth, we need to answer three questions. First, how many AI “digital workers” can be deployed today? Second, how far is AI from fully substituting for human workers? And third, how are both of these changing over time?

In this post, we’ll take a stab at the first question: On the tasks that [...]

---

Outline:

(02:16) Estimating the number of digital workers that frontier labs can deploy

(06:26) What do these numbers tell us?

The original text contained 14 footnotes which were omitted from this narration.

---

First published:
October 3rd, 2025

Source:
https://epoch.ai/gradient-updates/how-many-digital-workers-could-openai-deploy

---

Narrated by TYPE III AUDIO.

“Introducing the AI Companies Data Hub” by The Epoch AI Team

Tue, 30 Sep 2025 00:00:00 GMT

Subtitle: Our new AI Companies Data Hub tracks key economic and operational data, including frontier AI companies’ revenue, funding, valuations, staff counts, compute spending, and product usage.

Introduction

The AI industry has changed rapidly in recent years, with frontier companies like OpenAI and Anthropic seeing fast exponential growth in their revenues and valuations. This growth has important implications for AI's trajectory: AI companies are continually improving their technology by scaling up compute and labor inputs, and their revenue and usage track how AI is already impacting the world.

To help researchers, policymakers, and the public understand these trends, we have created our AI Companies Data Hub. The hub tracks financial and operational metrics—revenue, funding, staff, usage, and compute spend—for the key companies developing frontier AI models, along with interactive visualizations. This supplements our data hubs on AI models and GPU clusters and provides a more holistic view of the resource inputs and economic impact of the AI industry.

Our data shows that the combined revenues of OpenAI and Anthropic have grown around 10x since early 2024. OpenAI's annualized revenue reached $13 billion in August 2025, up from $5 billion at the beginning of the year, while Anthropic's revenue [...]

---

First published:
September 30th, 2025

Source:
https://epoch.ai/publications/introducing-the-ai-companies-data-hub

---

Narrated by TYPE III AUDIO.

“Why GPT-5 used less training compute than GPT-4.5 (but GPT-6 probably won’t)” by Yafah Edelman, Jean-Stanislas Denain, Jaime Sevilla, Anson Ho

Fri, 26 Sep 2025 00:00:00 GMT

Subtitle: OpenAI focused on scaling post-training on a smaller model.

Out of all the GPT models, GPT-5 is the odd one out. Unlike all previous versions of GPT, it was likely trained on less compute than its immediate predecessor, GPT-4.5.

While the exact numbers are uncertain, GPT-4.5 very likely used more training compute than GPT-5.

But this leads to a puzzle: Models trained with more compute tend to be better, so why did OpenAI train GPT-5 with less compute than GPT-4.5? And what will this mean for future OpenAI models?

In this post, we’ll argue that the answers to these questions are the following:

GPT-5 used less training compute than GPT-4.5 because OpenAI focused on scaling post-training. New post-training techniques made it possible to outperform GPT-4.5 with less training compute, but these methods likely weren’t yet mature enough to be applied at GPT-4.5's compute scale. Doing so would’ve taken more time (and compute), which OpenAI likely chose not to do due to strong market pressures.
OpenAI's next flagship model (“GPT-6”) will probably be trained on more compute than GPT-4.5: When OpenAI figures out how to productively scale post-training, they’ll likely shift [...]

---

Outline:

(02:22) GPT-5 used less training compute than GPT-4.5 because OpenAI focused on scaling post-training

(05:58) GPT-6 will probably be trained on more compute than GPT-4.5

The original text contained 12 footnotes which were omitted from this narration.

---

First published:
September 26th, 2025

Source:
https://epoch.ai/gradient-updates/why-gpt5-used-less-training-compute-than-gpt45-but-gpt6-probably-wont

---

Narrated by TYPE III AUDIO.

“The huge potential implications of long-context inference” by Jean-Stanislas Denain, Anson Ho

Fri, 19 Sep 2025 00:00:00 GMT

Subtitle: Continual learning, scaling RL, and research feedback loops.

On paper, modern LLMs can ingest many books’ worth of text in one go. For example, Gemini 2.5 Pro has a “context window” of 1 million tokens, enough to stuff in ten copies of Harry Potter and the Philosopher's Stone. But what if we could do lots of inference with much longer contexts? What if LLMs could take in 10 billion tokens of context, and we had the hardware and algorithms to make this usable in practice?

The naive use case is being able to take in ever-longer documents. But we think the implications of long context inference could be much greater:

It provides an angle of attack on the ability to continually learn new knowledge after the model is deployed, one of the biggest bottlenecks to the real-world utility of current AI systems.
It supports a ton of RL scaling: doing more reasoning, verifying model outputs, and generating high-quality RL environments.
But there are also bottlenecks. As RL scales to longer runs, research iteration cycles slow down. And you’ll also need a lot of hardware and algorithmic progress so that long-context inference [...]

---

Outline:

(01:59) Extremely long context inference provides an angle of attack on continual learning

(06:23) Being able to do lots of long context inference supports more RL scaling

(09:32) Bottlenecks: Slower research iteration times and potentially rising costs

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
September 19th, 2025

Source:
https://epoch.ai/gradient-updates/the-huge-potential-implications-of-long-context-inference

---

Narrated by TYPE III AUDIO.

“What will AI look like in 2030?” by David Owen

Tue, 16 Sep 2025 00:00:00 GMT

Subtitle: If scaling persists to 2030, AI investments will reach hundreds of billions of dollars and require gigawatts of power. Benchmarks suggest AI could improve productivity in valuable areas such as scientific R&D.

This report was commissioned by Google DeepMind. All points of view and conclusions expressed are those of the authors and do not necessarily reflect the position or endorsement of Google DeepMind.

Introduction

What will happen if AI scaling persists to 2030? We are releasing a report that examines what this scale-up would involve in terms of compute, investment, data, hardware, and energy. We further examine the future AI capabilities this scaling will enable, particularly in scientific R&D, which is a focus for leading AI developers. We argue that AI scaling is likely to continue through 2030, despite requiring unprecedented infrastructure, and will deliver transformative capabilities across science and beyond.

Scaling is likely to continue until 2030: On current trends, frontier AI models in 2030 will require investments of hundreds of billions of dollars, and gigawatts of electrical power. Although these are daunting challenges, they are surmountable. Such investments will be justified if AI can generate corresponding economic returns by increasing productivity. If AI lab [...]

---

Outline:

(00:39) Introduction

(02:25) Scaling is likely to continue to 2030

(06:17) AI will accelerate scientific R&D across several domains

(13:06) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
September 16th, 2025

Source:
https://epoch.ai/publications/what-will-ai-look-like-in-2030

---

Narrated by TYPE III AUDIO.

“Three challenges facing compute-based AI policies” by Venkat Somala, Anson Ho, Séb Krier

Thu, 11 Sep 2025 00:00:00 GMT

Subtitle: 'Training compute' is constantly evolving, and compute-based AI policies must adapt to remain relevant.

This week's post is a collaboration between writers from Google DeepMind's AI Policy Perspectives substack, and Epoch AI.

When the EU AI Act was drafted, pre-training compute was a reasonable proxy for model capabilities. At the time, pre-training accounted for 90-99% of total training compute, and the relationship was relatively reliable: more compute meant larger models pre-trained on more data, which consistently translated to stronger capabilities.

This simple proxy has been steadily breaking down. While pre-training compute remains a primary driver of capabilities, modern AI development leans heavily on distillation, synthetic data generation, reward models, and reasoning post-training. These methods can consume significant compute and drive capability gains, yet are often unaccounted for in current regulatory frameworks.

The standard approach for measuring compute, used by the now-defunct Biden AI executive order, is to sum compute across two stages: “pre-training” and “post-training.” If the sum crosses some predefined threshold, the model is subject to additional scrutiny. But as training methods continue to evolve, this metric risks measuring an increasingly narrow slice of the factors that produce advanced capabilities.

The [...]

---

Outline:

(03:07) 1. Not all uses of compute contribute equally to model capabilities

(06:36) 2. AI labs can use compute for methods besides pre/post-training

(07:08) Knowledge distillation

(08:47) Synthetic data generation

(10:11) Reward models

(11:17) The diversity in training methods challenges standardized compute metrics

(12:54) 3. When deployed, an AI model's downstream capabilities depend on more than the compute used to train it

(14:44) What does this mean for AI public policy?

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
September 11th, 2025

Source:
https://epoch.ai/gradient-updates/three-issues-undermining-compute-based-ai-policies

---

Narrated by TYPE III AUDIO.

“Compute scaling will slow down due to increasing lead times” by Yafah Edelman, Anson Ho

Fri, 05 Sep 2025 00:00:00 GMT

Subtitle: A heavily underappreciated dynamic when thinking about AI timelines.

The massive compute scaling that has driven AI progress since 2020 is likely to slow down soon, due to increasing economic uncertainty and longer development cycles.

While investors could theoretically scale compute by several orders of magnitude, the required hundreds of billions of dollars, combined with uncertain returns, will push them toward incremental scaling: investing, deploying products to gauge returns, then reevaluating further investment. Additionally, as the required compute grows larger, the time between project initiation and product deployment (i.e. “lead time”) lengthens significantly, creating a feedback loop that naturally slows the pace of compute scaling.

In particular, our current best guess is that every additional 10× increase in compute scale lengthens lead times by around a year. For example, OpenAI currently likely has over $15 billion worth of compute, and this compute stock has been growing by around 2.2× each year. At that pace, current trends would predict a trillion dollar cluster around 2030, but longer lead times would delay this to around 2035.
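The direct extrapolation in this example can be checked with a short calculation. The sketch below covers only the unadjusted trend (a $15 billion compute stock growing 2.2× per year) and does not attempt to model the lead-time adjustment. The 2025 start year is an assumption based on the publication date, and the function name is illustrative.

```python
import math


def years_to_reach(start_value: float, target_value: float, annual_multiple: float) -> float:
    """Years of constant multiplicative growth needed to go from start to target."""
    return math.log(target_value / start_value) / math.log(annual_multiple)


START_YEAR = 2025  # assumption: taken from the article's publication date

# Figures from the text: ~$15B of compute today, growing ~2.2x per year, $1T target.
years = years_to_reach(15e9, 1e12, 2.2)
print(f"About {years:.1f} years, i.e. around {START_YEAR + years:.0f}")
```

Since the text says OpenAI has "over" $15 billion of compute, the true starting point is somewhat higher and the unadjusted date somewhat earlier than this rough figure suggests.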

The “extrapolation with lead times” is determined by taking the direct extrapolation, and adjusting it such that each additional 10× [...]

---

Outline:

(02:31) Uncertainties about investment returns prevent a "YOLO scaleup"

(04:23) Lead times are getting longer, causing investments (and hence compute scaling) to slow down

(11:28) What does this mean for AI progress over the next few years?

The original text contained 14 footnotes which were omitted from this narration.

---

First published:
September 5th, 2025

Source:
https://epoch.ai/gradient-updates/compute-scaling-will-slow-down-due-to-increasing-lead-times

---

Narrated by TYPE III AUDIO.

“Why future AI agents will be trained to work together” by Anson Ho, Jean-Stanislas Denain

Fri, 22 Aug 2025 00:00:00 GMT

Subtitle: Many multi-agent setups are based on fancy prompts, but this is unlikely to persist.

We’re moving towards a world where multi-agent systems will be near-ubiquitous, and where they won’t just look like prompt engineering on steroids.

Over the last few years, we’ve increasingly seen AI systems spin up multiple LLM instances to solve problems. OpenAI has a multi-agent team which was involved in their recent IMO gold medal. Grok 4 Heavy involves multiple agents working in parallel on the same task. Claude Research coordinates multiple instances of Claude 4. Claude Code uses the Task tool to delegate subtasks to subagents. And over the last year, all of OpenAI, Anthropic, and Google DeepMind have had job postings looking for expertise in multi-agent systems.

Anthropic's Claude Research is based on a multi-agent setup.

We expect this trend toward multi-agent systems to continue: as task lengths increase, the benefits of parallelization will become too large to ignore. Importantly, while these LLM instances will work in parallel, they often won’t work independently: they’ll need to coordinate to avoid stepping on each other's toes, and performance improves when they can share key context and learnings. Currently [...]

---

Outline:

(01:51) The enormous gains from parallelization

(06:39) Parallel LLM instances will interact and coordinate

(08:13) Moving away from hard-coded multi-agent systems

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
August 22nd, 2025

Source:
https://epoch.ai/gradient-updates/why-future-ai-agents-will-be-trained-to-work-together

---

Narrated by TYPE III AUDIO.

“How much power will frontier AI training demand in 2030?” by Josh You, David Owen

Mon, 11 Aug 2025 00:00:00 GMT

Subtitle: The power required to train the largest frontier models is growing by more than 2x per year, and is on trend to reaching multiple gigawatts by 2030.

Introduction

The electrical power required to train individual frontier AI models has been growing rapidly over time, driven by the growth in total training compute and the size of training clusters. Previously, we found that the power required to train a frontier model has been more than doubling every year. If trends continue, how high could these power demands become?

In a new white paper, “Scaling Intelligence: The Exponential Growth of AI's Power Needs”, written in collaboration with EPRI, we analyze the factors driving power growth for frontier training, and forecast this growth out to 2030. We conclude that the largest individual frontier training runs in 2030 will likely draw 4-16 gigawatts (GW) of power, or enough to power millions of US homes.

Forecasting power demands using model training compute

Power demands for frontier training runs have historically grown at a rate of 2.2x per year, with the largest runs now exceeding 100 megawatts. This has primarily been driven by frontier training compute, which has been growing at 4 [...]
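As a rough sanity check on the forecast, the sketch below compounds the historical 2.2x yearly growth rate from a 100 megawatt starting point. It assumes that 100 MW describes 2025 (the publication year) and that the historical rate simply continues; the white paper's own forecast relies on a more detailed analysis, so this is only an illustration with hypothetical names.

```python
def extrapolate_power_gw(base_mw: float, annual_multiple: float, years: int) -> float:
    """Power in gigawatts after compounding growth for a number of years."""
    return base_mw * annual_multiple ** years / 1000


# Assumptions: 100 MW in 2025 (the text says the largest runs now exceed this),
# growing at the historical 2.2x per year through 2030.
power_2030_gw = extrapolate_power_gw(100, 2.2, 2030 - 2025)
print(f"Naive 2030 estimate: about {power_2030_gw:.1f} GW")
```

The naive figure lands inside the 4-16 GW range quoted above, near its lower end, which is what one would expect given that 100 MW is a floor rather than a point estimate.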

---

Outline:

(00:23) Introduction

(01:14) Forecasting power demands using model training compute

(05:16) Implications for the energy sector

The original text contained 1 footnote which was omitted from this narration.

---

First published:
August 11th, 2025

Source:
https://epoch.ai/publications/power-demands-of-frontier-ai-training

---

Narrated by TYPE III AUDIO.

“We didn’t learn much from the IMO” by Greg Burnham

Thu, 07 Aug 2025 00:00:00 GMT

Subtitle: The problems gave AI only a slim chance to show new capabilities.

A few weeks ago I laid out what I thought the IMO might tell us about AI math capabilities. The IMO has now happened, with Google and OpenAI both announcing experimental LLMs that solved the same 5 of the IMO's 6 problems—just enough for a gold medal. What did we learn?

There was understandably a lot of excitement about the gold medals, but I think a closer look shows that this achievement tells us little about capabilities progress. This is due to bad luck: the 5 solved problems happen to be no harder than problems AI systems could already solve, and the one unsolved problem was much harder than anything any system has solved.

I’ll use this post to make the case that we didn’t learn much from the IMO. I’ll take both an “outside” view, using statistics about the problems and the performance of prior AI systems, and an “inside” view, taking a closer look at the specific problems and the AI solutions.

Viewed from outside, the sample of problems looks uninformative

The main issue is the difficulty distribution of [...]

---

Outline:

(01:12) Viewed from outside, the sample of problems looks uninformative

(02:49) AI systems were already performing at this level

(03:12) Prior to the IMO, Deep Think had already solved a "medium-hard" problem, and older models seemed close to doing so

(04:30) Some currently-available models did decently on the IMO

(05:31) We can't even conclude that LLMs caught up to AlphaProof

(06:22) An inside view confirms that these easy-to-medium problems didn't require new capabilities

(12:21) The IMO was at least an interesting case study in reliability on hard-to-verify domains

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
August 7th, 2025

Source:
https://epoch.ai/gradient-updates/we-didnt-learn-much-from-the-imo

---

Narrated by TYPE III AUDIO.

“Quantifying the algorithmic improvement from reasoning models” by Anson Ho, Arden Berg

Sat, 02 Aug 2025 00:00:00 GMT

Subtitle: Reasoning models were as big of an improvement as the Transformer, at least on some benchmarks.

Almost a year ago, OpenAI introduced o1, the world's first “reasoning model”. Compared to its likely predecessor GPT-4o, o1 is more heavily optimized to do multi-step reasoning when solving problems. So it's perhaps no surprise that it does much better on common math and science benchmarks.

o1 performs far better than GPT-4o on GPQA diamond (PhD-level multiple-choice science questions) and MATH level 5 (high-school math competition problems). Data is taken from Epoch AI's benchmarking hub.

By itself, this performance improvement was already a big deal. But what's even more important was how it was achieved: This wasn’t achieved by using a lot more training compute. Instead, this was the byproduct of a major algorithmic innovation. o1 went through a period of “reasoning training”, where its chain-of-thought was fine-tuned on reasoning traces and optimized using reinforcement learning. This allows the model to spend more time “reasoning” before responding to user queries.

But how can we quantify the importance of this algorithmic innovation? One way to do this is to interpret its importance in terms of a hypothetical increase [...]

---

Outline:

(02:02) On GPQA, MATH, and Mock AIME, early reasoning models yielded on the order of a 10x increase in compute equivalent gain

(06:35) How much should we believe these estimates?

(08:37) A wide range of common benchmarks saw performance improvements due to reasoning

(12:29) Big open questions remain about generalization and test-time scaling

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
August 2nd, 2025

Source:
https://epoch.ai/gradient-updates/quantifying-the-algorithmic-improvement-from-reasoning-models

---

Narrated by TYPE III AUDIO.

“Why China isn’t about to leap ahead of the West on compute” by Veronika Blablová, Robi Rahman

Sat, 26 Jul 2025 00:00:00 GMT

Subtitle: Chinese hardware is closing the gap, but major bottlenecks remain.

We keep hearing that China is catching up with the West in AI compute. A great example of this comes from NVIDIA's CEO Jensen Huang, who recently claimed that China has made “enormous progress” in the last few years, and that “China is right behind us. We’re very, very close.”

And China has indeed been making a ton of progress. As we’ll see, Chinese hardware has been closing the gap across a range of metrics relating to computational power and data transfer, both of which are crucial aspects of AI workloads.

But despite this progress, we don’t think China is about to leap ahead of the West on AI compute. China's top developers—including Alibaba, ByteDance, Baidu, and DeepSeek—still rely primarily on NVIDIA chips. And major roadblocks still remain before China can leap ahead.

The first bottleneck lies in chip manufacturing. U.S. export controls of chipmaking equipment make it more costly for China to produce chips at the massive scale needed for frontier model training and inference.

The second bottleneck lies in China's weaker software ecosystem. Unlike NVIDIA's CUDA stack, Chinese chips operate [...]

---

Outline:

(01:53) On paper, China's hardware is closing the gap

(05:47) China still has to overcome major bottlenecks in domestic AI compute

(05:53) In practice, reliance on Western chips persists

(07:01) Manufacturing Limitations and Export Controls

(10:09) Software Ecosystem Gaps

(11:21) Will China overcome these bottlenecks?

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
July 26th, 2025

Source:
https://epoch.ai/gradient-updates/why-china-isnt-about-to-leap-ahead-of-the-west-on-compute

---

Narrated by TYPE III AUDIO.

“Evaluating Grok 4’s math capabilities” by Greg Burnham

Fri, 25 Jul 2025 00:00:00 GMT

Subtitle: It's good at involved computations, improving at proofs from a low base, and useful for literature search. It still favors low-level grinds and leans on background knowledge.

Introduction

xAI commissioned Epoch AI to evaluate Grok 4's math capabilities. What are its strengths and weaknesses, absolutely and relative to other models? This report goes beyond headline numbers, aiming to characterize how Grok 4 approaches mathematical tasks. Such qualitative investigation informs a broader understanding of progress: it helps identify signs of novel capabilities before they show up in headline numbers, and suggests additional benchmarks that would be useful going forward.

Note: while this work was compensated, Epoch maintains full editorial control over the output. We offer timely and in-depth evaluation as a service to model developers; email info@epoch.ai for details.

Executive Summary

Grok 4 is state-of-the-art at “grinding out” solutions on medium-hard high school math competitions.
Grok 4 is near the state-of-the-art at solving proof-based problems from challenging high school math competitions, though much headroom remains on proofs in general.
Professional mathematicians say Grok 4 may be the best available model for mathematical literature search.
Grok 4 shows an interesting tendency [...]

---

Outline:

(00:23) Introduction

(01:12) Executive Summary

(02:35) Methodology

(04:35) Grok 4 is state-of-the-art at "grinding out" solutions on medium-hard high school math competitions

(06:15) Solving these problems requires moderate knowledge and high diligence

(07:11) What does it look like to "grind out" a problem?

(09:11) Grok 4 is at the frontier of "grinding out" problems

(12:01) While a few problems on these competitions remain unsolved by AI, that probably won't remain the case for long

(14:14) Disregard Settings Involving Coding Tools

(15:06) Grok 4 is near the frontier of solving proof-based problems, but much headroom remains

(16:23) Self-Reported Grading Makes Interpretation Harder

(17:25) Solving these problems requires deeper mathematical skills

(18:38) Grok 4 Heavy made novel progress on a challenging USAMO problem, thanks in part to its background knowledge

(23:36) Grok 4 did not shine on the 2025 IMO

(28:58) Mathematicians say Grok 4's proof-writing abilities are hit or miss

(30:42) Sense Check: Grok 4 did fine on FrontierMath

(32:48) Grok 4 is good at mathematical literature search

(36:42) Grok 4 shows a tendency to catch its own mistakes

(36:58) Grok 4 gets my favorite "trick" question

(39:45) Grok 4 also usually gets a counterintuitive geometry problem

(42:24) That said, Grok 4 can still make simple mistakes

(43:26) Grok 4's mathematical reasoning is not very human-like

(43:52) Grok 4 relies on Cartesian coordinates where humans would use spatial intuition

(45:21) Grok 4 doesn't solve problems with off-the-beaten-path thinking

(49:33) Conclusion

The original text contained 13 footnotes which were omitted from this narration.

---

First published:
July 25th, 2025

Source:
https://epoch.ai/publications/grok-4-math

---

Narrated by TYPE III AUDIO.

“After the ChatGPT moment: Measuring AI’s adoption” by Arden Berg, Anson Ho

Thu, 17 Jul 2025 00:00:00 GMT

Subtitle: How quickly has AI been diffusing through the economy?

In February 2023, ChatGPT made headlines for purportedly being the fastest-growing consumer app in history. It reached 100 million users within two months, years faster than both Instagram and Netflix, making it a clear example of speedy technology adoption.

Two years on, work on AI has been awarded two Nobel Prizes, and major AI companies have collectively grown their annualized revenues over ten-fold to reach multi-billion-dollar scales. Two years is a long time in the world of AI.

With all these changes, it's time to take a new look at the evidence on AI diffusion. How fast has AI been diffusing throughout the economy? How many people are using AI systems in the US, and how are they doing so?

AI is being adopted faster than most technologies in history

Technologies are being adopted more quickly over time

To put the speed of AI adoption into context, we can first look at data on other technologies as a reference point. Conveniently for us, Nicholas Felton and Karl Hartig prepared a graph that shows this for a range of technologies, ranging from electricity to the internet.

[...]

---

Outline:

(01:05) AI is being adopted faster than most technologies in history

(01:11) Technologies are being adopted more quickly over time

(04:06) AI system adoption is likely faster than these historical trends would predict

(07:18) Average AI use has likely been increasing, but it's unclear by how much

(07:49) Most users don't use state-of-the-art AI systems very much, and the fraction of users that do has likely been declining

(10:12) The average number of tokens processed per user has probably been growing a lot

(11:41) Surveys provide mixed evidence about increases in the frequency of AI use

(12:47) Overall verdict

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
July 17th, 2025

Source:
https://epoch.ai/gradient-updates/after-the-chatgpt-moment-measuring-ais-adoption

---

Narrated by TYPE III AUDIO.

“How to run SWE-bench Verified in one hour on one machine” by Tom Adamczewski

Thu, 10 Jul 2025 00:00:00 GMT

Subtitle: We are releasing a public registry of optimized Docker images for SWE-bench. This allows us to run SWE-bench Verified in 62 minutes on a single GitHub Actions VM.

Introduction

We are releasing a public registry of Docker images for SWE-bench, to help the community run more efficient and reproducible SWE-bench evaluations. By making better use of layer caching, we reduced the total size of the registry to 67 GiB for all 2290 SWE-bench images (10x reduction), and to 30 GiB for 500 SWE-bench Verified images (6x reduction). This allows us to run SWE-bench Verified in 62 minutes on a single GitHub Actions VM with 32 cores and 128 GB of RAM.

We’re hiring an experienced engineer to lead our benchmarking efforts and be my new manager. Details at the bottom of the post.

Background

SWE-bench is a benchmark designed to evaluate large language models on real-world software engineering tasks. It consists of 2,294 GitHub issues from 12 popular Python repositories, paired with the actual pull requests that resolved those issues.

For each task, the AI system is given access to the repo in its state immediately before the pull request was merged, along with the issue description. [...]

---

Outline:

(00:24) Introduction

(01:10) Background

(02:32) SWE-bench and Docker

(04:46) Docker layering

(06:48) Anatomy of a SWE-bench Dockerfile

(08:52) Moving the git clone operation

(11:28) Should the git history be included?

(12:32) The matplotlib 1.9 GB top layer

(13:35) Disabling the pip cache (or how to go insane)

(16:46) Impact on size

(19:23) Running SWE-bench Verified in about an hour

(21:09) How to use our image registry

(22:02) Come be my boss?

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
July 10th, 2025

Source:
https://epoch.ai/publications/swebench-docker

---

Narrated by TYPE III AUDIO.

“What will the IMO tell us about AI math capabilities?” by Greg Burnham

Tue, 08 Jul 2025 00:00:00 GMT

Subtitle: Most discussion about AI and the IMO focuses on gold medals, but that's not the thing to pay most attention to.

This year's International Mathematical Olympiad (IMO) will take place on July 15th and 16th in Sunshine Coast, Australia. It is the pinnacle of high school math competitions. Much like the Olympic Games, the stakes are national pride and personal glory.

AI model developers must be gauging their own chances for pride and glory. No AI system has yet achieved a score equivalent to an IMO gold medal, much less a perfect score. Could this be the year?

In this post, I’ll say what I think different results might mean for AI math capabilities. In particular, I think there are some important distinctions between results that might generate hype and results that will actually tell us something new.

Here are the key background facts I have in mind.

The baseline is high. Google's AlphaProof, a specialized system that outputs formal math proofs, solved 4/6 problems on the 2024 IMO. On the 2025 USAMO, a contest similar to the IMO, the best general-purpose LLMs can already solve 2/6 problems. Progress means doing better [...]

---

Outline:

(04:41) IMO Background

(07:52) AlphaProof already set a high bar, but some key abilities were missing

(08:35) AlphaProof solved its hardest problem in a surprisingly uninteresting way

(10:08) AlphaProof didn't solve problems that required more creativity

(11:18) Geometry won't tell us much

(12:05) If an AlphaProof-like system scores well, we'll have to look at the specific problems

(14:12) Closing Notes on AlphaProof

(15:18) General-purpose LLMs have more headroom

(16:12) Deep Think suggests we'll see something better than this

(18:19) All bets are off if obscure background knowledge cracks problems

(19:30) Geometry still probably won't tell us much

(20:26) Closing Notes on LLMs

(21:01) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

---

First published:
July 8th, 2025

Source:
https://epoch.ai/gradient-updates/what-will-the-imo-tell-us-about-ai-math-capabilities

---

Narrated by TYPE III AUDIO.

“How big could an “AI Manhattan Project” get?” by Arden Berg, Anson Ho

Wed, 02 Jul 2025 00:00:00 GMT

Subtitle: An AI Manhattan Project could accelerate compute scaling by two years.

Over the last year, the possibility of an AI national project has steadily grown.

In November, the US-China Economic and Security Review Commission stated that its top recommendation to Congress was to “establish and fund a Manhattan Project-like program dedicated to racing to and acquiring an Artificial General Intelligence capability.” Over the last few months, the US Department of Energy has also repeatedly compared AI to the Manhattan Project and indicated that it would use its power to help the project succeed, recently tweeting this:

But what would a “Manhattan Project for AI” actually entail? It's not entirely clear, but we think that three distinct features capture much of the essence of what people are referring to:

It's a project initiated by the US government
Private sector AI resources (e.g. compute) are consolidated
Total compute investments reach a similar fraction of US GDP as the peak of the Manhattan Project or the Apollo program

In addition to these core properties, for the purposes of this analysis we focus primarily on the physical bottlenecks to this scaling, thus assuming [...]

---

Outline:

(02:55) How much compute could a national project muster?

(05:44) Will there be enough power to support this?

(08:23) Discussion

The original text contained 14 footnotes which were omitted from this narration.

---

First published:
July 2nd, 2025

Source:
https://epoch.ai/gradient-updates/how-big-could-an-ai-manhattan-project-get

---

Narrated by TYPE III AUDIO.

“AI and explosive growth redux” by Andrei Potlogea, Anson Ho

Fri, 20 Jun 2025 00:00:00 GMT

Subtitle: GATE model shows AI-driven growth surges more easily than expected and supports much larger investments—advocating moderate optimism.

The debate around the macroeconomic effects of AI has shown no sign of convergence.

On the one hand, renowned economists like Daron Acemoglu envision that AI will only increase US GDP by <2% over ten years. On the other hand, others argue that AI could plausibly drive “explosive growth”, with GWP growth rates north of 30% per year.

So who's right? To shed light on this debate, we recently released the Growth and AI Transition Endogenous (GATE) model, an integrated assessment model of AI automation designed to bridge the gap between economists and AI practitioners. But while we discussed how the model is laid out on a technical level, we’ve yet to detail how to interpret the model's predictions, and our most substantial takeaways from the model.

As such, in this post we’ll explain our two biggest updates from our work on GATE. Importantly, these are qualitative updates – the model was designed to provide high-level qualitative insights, not make precise quantitative predictions:

Significant AI-driven growth accelerations happen more easily than we thought: Skeptics of [...]

---

Outline:

(02:15) 1. Significant growth accelerations more plausible than we thought, as Baumol effects are overrated

(07:02) 2. We underestimated just by how much the world could be underinvesting in AI today

(10:04) A blow against the skeptics, many blows against the overconfident

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
June 20th, 2025

Source:
https://epoch.ai/gradient-updates/ai-and-explosive-growth-redux

---

Narrated by TYPE III AUDIO.

“Inference economics of language models” by Ege Erdil

Tue, 17 Jun 2025 00:00:00 GMT

Subtitle: We investigate how speed trades off against cost in language model inference. We find that inference latency scales with the square root of model size and the cube root of memory bandwidth, among other results.

Introduction

As the capabilities of AI models have expanded, and as the recent paradigm of test-time compute scaling has taken off, the demand for AI inference has grown enormously. Inference revenue at major AI companies such as OpenAI and Anthropic has been growing at a rate of 3x per year or more, even as their models continue to become smaller and cheaper compared to 2023.

A few years ago, the benchmark for whether a language model was fast enough was “human reading speed”: if a model could generate 10 tokens per second when responding to a user, that was good enough. Now, as models are asked to reason at length about complex problems and are placed inside elaborate agentic loops, this benchmark has become obsolete. The benefits to serving models faster for inference are greater than ever before. Despite this, there has been little work investigating how language models can be served quickly at scale and how much we can increase their [...]

---

Outline:

(00:25) Introduction

(01:51) How does the model work?

(03:56) Some takeaways from the model

(07:07) Conclusion

---

First published:
June 17th, 2025

Source:
https://epoch.ai/publications/inference-economics-of-language-models

---

Narrated by TYPE III AUDIO.

“Do the biorisk evaluations of AI labs actually measure the risk of developing bioweapons?” by Anson Ho, Arden Berg

Fri, 13 Jun 2025 00:00:00 GMT

Subtitle: Assessing if AI labs' biorisk evaluations effectively measure models' potential to enable amateur bioweapons development.

With the recent release of Claude Opus 4, Anthropic activated their AI Safety Level 3 protections. This threshold was designed to pertain to models that can significantly help individuals or groups with basic technical backgrounds create/obtain and deploy CBRN weapons, such as pandemic bioweapons.

Their reasoning was as follows:

“We are deploying Claude Opus 4 with our ASL-3 measures as a precautionary and provisional action. […] due to continued improvements in CBRN-related knowledge and capabilities, we have determined that clearly ruling out ASL-3 risks is not possible”.

But how exactly did they come to this conclusion? And more generally, do existing AI biorisk evaluations provide strong evidence of whether LLMs can aid amateurs in developing bioweapons?

To answer these questions, we analyzed the biorisk evaluations (or lack thereof) of 8 notable AI labs. Here's what we found:

Publicly described benchmarks are common but saturate rapidly, with uncertain implications for biorisk: The most common LLM biorisk evaluations reported in the most recent model cards are publicly described benchmarks (i.e. those that have been clearly described in a public [...]

---

Outline:

(03:28) 1. Publicly described benchmarks are common but saturate rapidly, with uncertain implications for biorisk

(10:08) 2. We know little about most other AI biorisk evaluations

(10:14) Evaluations are generally light on detail, and often by design

(13:24) In practice, many biorisk evaluations only tell us about one model

(16:06) 3. Existing evaluations do not fully address the positions most skeptical of LLM-driven biorisk

(16:58) The need for somatic tacit knowledge

(18:57) The importance of infrastructure access

(20:09) These objections are valid but do not rule out AI biorisk concerns

(21:29) Discussion

The original text contained 20 footnotes which were omitted from this narration.

---

First published:
June 13th, 2025

Source:
https://epoch.ai/gradient-updates/do-the-biorisk-evaluations-of-ai-labs-actually-measure-the-risk-of-developing-bioweapons

---

Narrated by TYPE III AUDIO.

“What skills does SWE-bench Verified evaluate?” by Florian Brand, Jean-Stanislas Denain

Fri, 13 Jun 2025 00:00:00 GMT

Subtitle: We take a deep dive into SWE-bench Verified, a prominent agentic coding benchmark. While one of the best public tests of AI coding agents, it is limited by its focus on simple bug fixes in familiar open-source repositories.

SWE-bench Verified Coding

SWE-bench is a benchmark for evaluating large language models on real-world software issues collected from GitHub. Given a codebase and an issue, a language model is tasked with generating a patch that resolves the described problem.

Size: 500 Python-only coding problems with issue descriptions
Data sourcing: Scraping of GitHub issues followed by human filtering
Scoring method: Unit tests
Contamination risk: High

Main takeaways

SWE-bench Verified tests AI's real-world agentic coding skills, the kind required for coding tools like Cursor or Claude Code.
Most of the problems are relatively simple, needing less than 1 hour to fix for a human engineer.
The benchmark has a high contamination risk, and the tests might not generalize well to real-world, closed-source codebases.
The scaffold built around a model [...]

---

Outline:

(01:01) Main takeaways

(01:46) Introduction

(03:07) Anatomy of a benchmark sample

(04:14) How models are evaluated

(04:47) The error rate in SWE-bench Verified is relatively low

(07:34) Most tasks are simple bug fixes

(11:35) The low diversity of codebases limits external validity

(12:50) Half the benchmark tests issues from before 2020

(15:08) Scaffolds matter as much as models

(17:18) Conclusion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
June 13th, 2025

Source:
https://epoch.ai/publications/what-skills-does-swe-bench-verified-evaluate

---

Narrated by TYPE III AUDIO.

“Beyond benchmark scores: Analyzing o3-mini’s mathematical reasoning” by Anson Ho, Jean-Stanislas Denain, Elliot Glazer

Fri, 06 Jun 2025 00:00:00 GMT

Subtitle: Examining o3-mini's math reasoning: an erudite, vibes-based solver that excels in knowledge but lacks precision, creativity, and formal human rigor.

If you’re reading this, you’ll no doubt have heard of the impressive progress that state-of-the-art language models have been able to make in solving math problems. For instance, we recently found that o4-mini outperformed the average team of mathematicians in our human baseline competition.

However, these numbers alone provide limited insight into what exactly these models are or aren’t able to do, and why. How do reasoning models solve complex math problems? Do they reason similarly to human mathematicians? And where do they fall short?

To answer these questions, we asked fourteen mathematicians to analyze 29 of o3-mini-high's raw, unsummarized reasoning traces on FrontierMath problems, which OpenAI shared with us. Our goal in this post is to share the main takeaways from this survey, and discuss what this means for future developments at the intersection of AI and math.

How does o3-mini-high solve FrontierMath problems?

Extreme erudition – and it's not just memorization

Out of the 29 reasoning traces, 13 of them resulted in a correct response – but how does o3-mini-high solve these [...]

---

Outline:

(01:23) How does o3-mini-high solve FrontierMath problems?

(01:28) Extreme erudition - and it's not just memorization

(02:57) A "vibes-based inductive reasoner"

(04:11) Where o3-mini-high fails

(04:14) Lack of precision

(06:15) Lack of creativity and depth of understanding

(07:51) Hallucinations

(08:57) Does o3-mini-high reason like a human mathematician?

(10:48) Discussion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
June 6th, 2025

Source:
https://epoch.ai/gradient-updates/beyond-benchmark-scores-analysing-o3-mini-math-reasoning

---

Narrated by TYPE III AUDIO.

“What is Epoch?” by Jaime Sevilla

Thu, 05 Jun 2025 00:00:00 GMT

Subtitle: Our director explains Epoch AI's mission and how we decide our priorities. In short, we work on projects to understand the trajectory of AI, share this knowledge publicly, and inform important decisions about AI.

Introduction

Since we started Epoch three years ago, we have engaged in hundreds of projects and achieved a wide audience. Yet, one question I often get asked is, ‘What is Epoch?’

In a way, this is an easy question to answer. We are a nonprofit research organization with the mission of improving society's understanding of the trajectory of AI. Simply put, we are doing what we can so that decisions about AI are informed by the best possible evidence.

To achieve this, we are curating data and conducting high-quality research into some of the most significant trends in AI. We share most of this work publicly, aimed at a broad audience, including AI policy experts, journalists and AI developers. Importantly, we are committed to always sharing what the data says, rather than tailoring it to fit a narrative.

We work on this mission because we believe that if we all collectively know more about AI, we will make better decisions on average. I [...]

---

Outline:

(00:24) Introduction

(01:57) What we do

(02:46) We curate and analyze data on AI trends

(04:21) We develop benchmarks to measure advanced AI capabilities

(06:03) We provide independent evaluations of AI models

(06:54) We provide consultations and commissioned research

(10:30) What we are not

(11:04) We are not an AI development company

(12:00) We are not an AI policy think tank

(13:05) We are not a company incubator

(14:18) Closing words

---

First published:
June 5th, 2025

Source:
https://epoch.ai/publications/what-is-epoch

---

Narrated by TYPE III AUDIO.

“GPQA Diamond: What’s left?” by Greg Burnham

Fri, 30 May 2025 00:00:00 GMT

Subtitle: Investigate GPQA Diamond benchmark's validity: uncover flawed questions, model challenges, and why it still informs AI evaluation.

A specter looms whenever AI systems approach 100% on a benchmark: what if the rest of the benchmark is flawed?

Recently, these concerns have been leveled at GPQA Diamond, a popular benchmark consisting of graduate-level multiple-choice science questions. Scores from state-of-the-art models are clustered in a narrow band, around 83%. This led one of the creators of the benchmark to speculate that there's something wrong with the other 17%.

I’ll use this post to investigate. I’ll start by looking at the questions that models consistently get wrong: is something wrong with these questions, or are they just hard for the models? To assess this, I’ll look at the scientific subdomains of these questions and then go through a small set of outliers in more detail.

All in all, I think it's likely that at least 90% of the benchmark is valid: GPQA Diamond has a bit more juice left. Regardless of that conclusion, though, I also just think it's good to dig into benchmarks. This post is as much about the journey as the destination.

What [...]

---

Outline:

(01:28) What could be wrong with GPQA Diamond?

(02:53) Models Tend to Get the Same Questions Wrong

(04:16) Most Unsolved Questions Are in Organic Chemistry

(05:32) Deep Dive: The Most Consistently Wrong Questions

(06:15) Guess the Pattern

(08:06) Know-How Questions

(10:43) Computational Questions, With a Twist

(13:06) Silver Fluorides

(14:24) Reports of Its Death Are Probably Exaggerated

(16:28) If Not Now, Soon

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
May 30th, 2025

Source:
https://epoch.ai/gradient-updates/gpqa-diamond-whats-left

---

Narrated by TYPE III AUDIO.

“How many AI models will exceed compute thresholds?” by Ben Cottier, David Owen

Fri, 30 May 2025 00:00:00 GMT

Subtitle: We project how many notable AI models will exceed training compute thresholds, with results accessible in an interactive tool. Model counts rapidly increase from 10 above 1e26 FLOP by 2026, to over 200 by 2030.

Executive summary

The compute used to train AI models has been a key driver of AI progress, informing many predictions of AI's future capabilities. However, the number of AI models that will surpass different compute levels has received less attention. This is relevant to compute-based AI regulation, as well as AI development and deployment more broadly. We develop a projective model that relates key inputs such as investment and the distribution of compute to the number of notable AI models: models that are state of the art, highly cited, or otherwise historically notable. The projections can be explored in a new interactive tool.

There's a chart here. The chart title reads: Cumulative number of notable AI models by year

Our modeling shows that the number of notable AI models above a given compute threshold rapidly accelerates over time. For example, the first model in our dataset estimated to use over 10 to the 26 FLOP was Grok-3 from xAI, released [...]

---

Outline:

(00:30) Executive summary

(03:41) Introduction

(05:55) Methodology

(05:58) Overview

(09:25) Dataset and inclusion criteria

(13:14) Scenarios based on AI investment and model development

(18:22) Investment in the largest training run

(19:13) Total number of models per year

(20:06) Number of models near the largest training run

(20:57) Distribution of compute over AI models

(23:11) Hardware price-performance

(25:29) Training run duration

(26:44) Sensitivity analysis

(29:05) Limitations

(33:57) Results

(36:21) Conclusion

(38:34) Acknowledgements

The original text contained 16 footnotes which were omitted from this narration.

---

First published:
May 30th, 2025

Source:
https://epoch.ai/publications/model-counts-compute-thresholds

---

Narrated by TYPE III AUDIO.

“Is AI already superhuman on FrontierMath?” by Anson Ho

Fri, 23 May 2025 00:00:00 GMT

Subtitle: How do humans and AIs compare on FrontierMath? We ran a competition at MIT to put this to the test.

How well do humans perform on FrontierMath?

This is a benchmark that we released last year, designed to test the limits of AI's math capabilities. It contains 300 questions that range in difficulty from upper-undergraduate level, to those that even Fields Medallists find challenging.

To figure out a human baseline, we organized a competition at MIT, with around forty exceptional math undergrads and subject matter experts taking part. The participants were split into eight teams of four or five people, and given 4.5 hours to solve 23 questions with internet access. They were then pitted against the current state-of-the-art AI system on FrontierMath, namely o4-mini-medium.

The result? o4-mini-medium outperformed the average human team, but did worse than the combined score across all teams, where we look at the fraction of problems solved by at least one team. So AIs aren’t yet unambiguously superhuman on FrontierMath – but I think they soon will be.
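
The two human baselines compared above can be made concrete with a small sketch. The team results below are hypothetical toy data, not the competition's actual results, and the function names are illustrative assumptions. The sketch only shows how an average team score differs from a combined score that counts a problem as solved if at least one team solved it.

```python
def average_team_score(solved_by_team, num_problems):
    """Mean fraction of problems solved per team."""
    fractions = [len(solved) / num_problems for solved in solved_by_team]
    return sum(fractions) / len(fractions)


def combined_score(solved_by_team, num_problems):
    """Fraction of problems solved by at least one team."""
    solved_by_anyone = set().union(*solved_by_team)
    return len(solved_by_anyone) / num_problems


# Hypothetical data: each set holds the IDs of problems one team solved.
teams = [{1, 2}, {2, 3}, {1, 4, 5}]
num_problems = 10

print(average_team_score(teams, num_problems))
print(combined_score(teams, num_problems))
```

Because different teams solve different problems, the combined score is always at least as high as the best single team's score, which is why it is the harder baseline for an AI system to beat.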

Figure 1: o4-mini-medium scored 22% on the FrontierMath human baseline competition, outperforming the average team (19%) but falling short of the [...]

---

Outline:

(02:21) 1. Subject matter expertise was underrepresented

(03:22) 2. The competition was more designed to reflect reasoning capabilities than broad knowledge

(05:30) 3. The definition of "human baseline" is somewhat ambiguous

(08:13) 4. AIs aren't yet superhuman on FrontierMath, but they probably soon will be

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
May 23rd, 2025

Source:
https://epoch.ai/gradient-updates/is-ai-already-superhuman-on-frontiermath

---

Narrated by TYPE III AUDIO.

“How fast can algorithms advance capabilities?” by Henry Josephson

Fri, 16 May 2025 00:00:00 GMT

Subtitle: This week's issue is a guest post by Henry Josephson, who is a research manager at UChicago's XLab and an AI governance intern at Google DeepMind.

In the AI 2027 scenario, the authors predict a fast takeoff of AI systems recursively self-improving until we have superintelligence in just a few years.

Could this really happen? Whether it's possible may depend on whether a software intelligence explosion (a series of rapid algorithmic advances that lead to greater AI capabilities) occurs.

A key crux in the debate about the possibility of a software intelligence explosion comes down to whether key algorithmic improvements scale from small models to larger models. If the most important algorithmic advances need a large amount of compute to demonstrate their effectiveness, then we should think that a software-only intelligence explosion is less likely. And so a fast takeoff could be bottlenecked by compute constraints.

In a recent preprint, my team at UChicago's XLab — Spencer Guo, Teddy Foley, Jack Sanderson, Anqi Qu, and [...]

---

Outline:

(01:52) Are the best algorithmic improvements compute-dependent?

(07:49) Can Capabilities Advance With Frozen Compute? DeepSeek-V3

(08:55) What This Means for AI Progress

(12:59) Limitations

(14:20) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
May 16th, 2025

Source:
https://epoch.ai/gradient-updates/how-fast-can-algorithms-advance-capabilities

---

Narrated by TYPE III AUDIO.

“How far can reasoning models scale?” by Josh You

Fri, 09 May 2025 00:00:00 GMT

Subtitle: Available evidence suggests that rapid growth in reasoning training can continue for a year or so.

Reasoning models like OpenAI's o3 are less than a year old, but they’ve already seen rapid improvements in capabilities, and OpenAI researchers are very optimistic that this progress will continue. But it's not clear how much further the techniques used to train reasoning models can scale.

After looking into the question, I think there is room to scale reasoning training further, but it's unlikely that OpenAI or other frontier AI developers can scale by many orders of magnitude.

If reasoning training continues to scale at 10× every few months, in line with the jump from o1 to o3, it will reach the frontier of total training compute before long, perhaps within a year. At that point, the scaling rate will slow and converge with the overall growth rate in training compute of ~4× per year. Progress in reasoning models may slow down after this point as well.

Figure 1: An illustration of a possible trajectory for reasoning compute growth, if scale-ups similar to the jump between o1 and o3 continue.
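
The catch-up argument above can be sketched numerically. This is a rough illustration under stated assumptions, not the author's calculation: "every few months" is taken to mean every 4 months, and the starting gap between reasoning compute and frontier training compute (100×) is a purely hypothetical input. Only the 10× and ~4× per year growth factors come from the text.

```python
import math


def months_until_catch_up(
    initial_gap,
    reasoning_factor=10.0,          # growth per period, from the text
    reasoning_period_months=4.0,    # assumption: "every few months" = 4
    frontier_factor_per_year=4.0,   # overall training compute growth, from the text
):
    """Months until reasoning compute reaches the frontier of training compute.

    initial_gap is how many times larger frontier compute is today
    (a hypothetical input, not a figure from the source).
    """
    reasoning_rate = math.log(reasoning_factor) / reasoning_period_months
    frontier_rate = math.log(frontier_factor_per_year) / 12.0
    if reasoning_rate <= frontier_rate:
        raise ValueError("reasoning compute never catches up at these rates")
    return math.log(initial_gap) / (reasoning_rate - frontier_rate)


# Hypothetical 100x starting gap.
print(round(months_until_catch_up(100.0), 1))
```

With these assumed inputs the gap closes in about 10 months, which is in the same range as the "perhaps within a year" estimate above. A larger starting gap or a slower scale-up cadence pushes the date out.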

How much compute is used for frontier [...]

---

Outline:

(01:25) How much compute is used for frontier reasoning training?

(02:59) Scaling from o1 to o3

(04:17) Insights from DeepSeek-R1

(05:27) Insights from other reasoning models

(06:39) What can we conclude?

(09:35) What does reasoning compute scale mean for AI progress?

(11:24) Can reasoning actually scale?

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
May 9th, 2025

Source:
https://epoch.ai/gradient-updates/how-far-can-reasoning-models-scale

---

Narrated by TYPE III AUDIO.

“Where’s my ten minute AGI?” by Anson Ho

Fri, 02 May 2025 00:00:00 GMT

Subtitle: Why don't AIs automate more real-world tasks if they can handle 1-hour ones? Anson Ho explores key capability and context bottlenecks.

Recently, METR released a paper arguing that the length of tasks that AIs can do is doubling every 7 months.

We can see this in the following graph, where the best AI system is able to do roughly hour-long tasks at a 50% success rate on average:

METR's research finds that AIs are rapidly able to do longer and longer tasks, where length is measured by the time it takes for a human with requisite expertise to do the task.
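
As a rough illustration of what a 7-month doubling time implies, the sketch below extrapolates a task horizon forward in time. The function name and the 1-hour starting point are assumptions chosen to match the "roughly hour-long tasks" figure above. This is a plain exponential extrapolation, not METR's methodology.

```python
def task_horizon_hours(months_elapsed, initial_horizon_hours=1.0, doubling_time_months=7.0):
    """Task horizon after months_elapsed, assuming steady exponential doubling."""
    return initial_horizon_hours * 2 ** (months_elapsed / doubling_time_months)


# Starting from a 1-hour horizon, two doubling periods (14 months) give 4 hours.
for months in (0, 7, 14, 28):
    print(months, task_horizon_hours(months))
```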

But there's a big problem here – if AIs are actually able to perform most tasks on 1-hour task horizons, why don’t we see more real-world task automation? For example, most emails take less than an hour to write, but crafting emails remains an important part of the lives of billions of people every day.

Some of this could be due to people underusing AI systems, but in this post I want to focus on reasons that are more fundamental to the capabilities of AI systems. In particular, I think there are [...]

---

Outline:

(01:48) 1. Time-horizon estimates are very domain-specific

(04:15) 2. Task reliability strongly influences task horizons

(07:19) 3. Real-world tasks are bundled together and hard to separate out

(09:56) Discussion

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
May 2nd, 2025

Source:
https://epoch.ai/gradient-updates/where-is-my-ten-minute-agi

---

Narrated by TYPE III AUDIO.

“The case for multi-decade AI timelines” by Ege Erdil

Sat, 26 Apr 2025 00:00:00 GMT

Subtitle: In this Gradient Updates weekly issue, Ege discusses the case for multi-decade AI timelines.

The date at which transformative AI capabilities will be reached is among the most discussed questions about AI. Opinions vary widely, with industry insiders typically expecting far faster progress than external observers. For instance, Dario Amodei thinks there might be only 2 to 3 years left until AI surpasses “almost all humans at almost everything”, while economists such as William Nordhaus still believe we might have more than 100 years left.

Compared to most people in the world, my own median timelines of ~20 years until full automation of remote work would be considered quite aggressive. However, most people in the field of AI (and even many others at Epoch) have much shorter timelines than this, and timelines on the order of 1 to 10 years, as seen in the recent AI 2027 report, are often seen as a “default position” that one has to present arguments against. In this issue, I’ll elaborate on the key reasons behind my relatively bearish views. I’ll first explain why I find some common short timelines arguments unconvincing, then elaborate on how I arrive at [...]

---

Outline:

(03:10) Trend extrapolations don't point towards short timelines

(08:39) A software singularity is unlikely

(12:15) AI agents will need a lot of compute to automate all remote work

(18:17) Conclusion

---

First published:
April 26th, 2025

Source:
https://epoch.ai/gradient-updates/the-case-for-multi-decade-ai-timelines

---

Narrated by TYPE III AUDIO.

“Trends in AI supercomputers” by Konstantin F. Pilz, Robi Rahman, James Sanders, Lennart Heim

Wed, 23 Apr 2025 00:00:00 GMT

Subtitle: AI supercomputers double in performance every 9 months, cost billions of dollars, and require as much power as mid-sized cities. Companies now own 80% of all AI supercomputers, while governments’ share has declined.

Introduction

Frontier AI development relies on powerful AI supercomputers. To train AI models with exponentially more compute, companies have developed massive systems like xAI's Colossus, that contain up to 200,000 specialized AI chips, cost billions of dollars to build, and require hundreds of MW of power—equivalent to a medium-sized city.

However, public data on these systems is limited.

We curated a dataset of over 500 AI supercomputers (sometimes called GPU clusters or AI data centers) from 2019 to 2025 and analyzed key trends in performance, power needs, hardware cost, and ownership. We found:

Computational performance grew 2.5x per year, driven by using more and better chips in the leading AI supercomputers.
Power requirements and hardware costs doubled every year. If current trends continue, the largest AI supercomputer in 2030 would cost hundreds of billions of dollars and require 9 gigawatts of power.
The rapid growth in AI supercomputers coincided with a shift to private ownership. In our dataset, industry owned about [...]
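As a quick consistency check on the growth figures listed above, a constant annual growth factor g implies a doubling time of ln(2) / ln(g) years. The following is a minimal sketch of that conversion; the function name is illustrative and not from the report.

```python
import math

def doubling_time_months(annual_growth_factor: float) -> float:
    """Months needed to double under constant exponential growth."""
    return 12 * math.log(2) / math.log(annual_growth_factor)

# Growth rates quoted in the summary above.
performance = doubling_time_months(2.5)     # performance grew 2.5x per year
power_and_cost = doubling_time_months(2.0)  # power and cost doubled every year

print(f"Performance doubles roughly every {performance:.1f} months")
print(f"Power and hardware cost double every {power_and_cost:.1f} months")
```

Under these assumptions, 2.5x annual growth corresponds to a doubling time of about nine months, which matches the figure in the subtitle.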

---

Outline:

(00:29) Introduction

(02:10) Computational performance, energy, and cost trends

(04:37) Locations and public/private sector share of AI supercomputers

(06:31) Our dataset

---

First published:
April 23rd, 2025

Source:
https://epoch.ai/publications/trends-in-ai-supercomputers

---

Narrated by TYPE III AUDIO.

“The real reason AI benchmarks haven’t reflected economic impacts” by Anson Ho, Jean-Stanislas Denain

Fri, 28 Mar 2025 00:00:00 GMT

Subtitle: The real reason that AI benchmarks haven’t reflected real-world impacts historically is that they weren’t optimized for this, not because of fundamental limitations – but this might be changing.

Figure 1: This graph demonstrates rapid AI progress across key benchmarks, which have been useful indicators for driving capabilities forward. However, for most of this period, benchmark realism was not a priority. This explains why high benchmark scores often provide limited insight into AI systems’ real-world impact.

Back in the prehistoric days of March 2023, OpenAI released GPT-4, and its benchmark results raised a lot of questions and speculation about the future of the legal profession.

In response to these concerns, Narayanan and Kapoor wrote a blog post pointing out that this is an instance of a more general problem, where AI benchmarks fail to reflect the complexities of the real world. And if benchmarks fail at this, it can lead to misleading conclusions about the current and future impacts of AI systems.

This is undoubtedly true, but it seems to only give a partial answer, because it doesn’t tell us why existing benchmarks don’t capture the complexities of the real world. For example, surely [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 28th, 2025

Source:
https://epoch.ai/gradient-updates/the-real-reason-ai-benchmarks-havent-reflected-economic-impacts

---

Narrated by TYPE III AUDIO.

“GATE: Modeling the trajectory of AI and automation” by The Epoch AI Team

Fri, 21 Mar 2025 00:00:00 GMT

Subtitle: We introduce a compute-centric model of AI automation and its economic effects, illustrating key dynamics of AI development. The model suggests large AI investments and subsequent economic growth.

Introduction

The rapid progress and adoption of large language models in recent years have sparked extensive discussions about how artificial intelligence (AI) will shape the future of our economy. Central to these discussions are questions about whether AI will substantially accelerate economic growth, how much investment AI will attract, the timing and scale of these investments, and the speed at which automation will transform labor markets.

To advance this discussion, we introduce the Growth and AI Transition Endogenous (GATE) model. GATE brings together concepts from machine learning and economic growth theory to illustrate the key dynamics of AI development, task automation and their downstream macroeconomic effects. It draws heavily on scaling laws—empirical regularities relating compute scaling to performance for both training and inference—and semi-endogenous growth—a theory that explains economic growth as a result of R&D efforts that generate scientific advances. You can find a technical description of the model in our whitepaper.

Alongside our paper, we are releasing an interactive model. This tool lets you simulate a variety of [...]

---

Outline:

(00:27) Introduction

(02:17) About GATE

(05:29) Preliminary insights

(09:16) Conclusion and next steps

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 21st, 2025

Source:
https://epoch.ai/publications/announcing-gate

---

Narrated by TYPE III AUDIO.

“Most AI value will come from broad automation, not from R&D” by Ege Erdil, Matthew Barnett

Fri, 21 Mar 2025 00:00:00 GMT

Subtitle: AI's biggest impact will come from broad labor automation—not R&D—driving economic growth through scale, not scientific breakthroughs.

A popular view about the future impact of AI on the economy is that it will be primarily mediated through AI automation of R&D. In some form or another, this view has been expressed by many influential figures in the industry:

In his essay “Machines of Loving Grace”, Dario Amodei lists five ways in which AI can benefit humanity in a scenario where AI goes well. He considers biology R&D, neuroscience R&D, and economics R&D as three of these ways. There's no point at which he clearly argues that AI will lead to high rates of economic growth due to being broadly deployed throughout the economy as opposed to speeding up R&D and perhaps improving economic governance.
Demis Hassabis, CEO of DeepMind, is also bullish on R&D as the main channel through which AI will benefit society. In a recent interview, he provides specific mechanisms through which this could happen: AI could cure all diseases and “solve energy”. He mentions “radical abundance” as a possibility as well, but beyond the R&D channel doesn’t name [...]

---

Outline:

(03:44) The primary economic impact of AI will be its ability to broadly automate labor

(09:56) Automating AI R&D alone likely won't dramatically accelerate AI progress

(14:29) Fully automating R&D requires a very broad set of abilities

(19:11) AI takeoff will likely be diffuse and salient

(21:44) Key takeaways

---

First published:
March 21st, 2025

Source:
https://epoch.ai/gradient-updates/most-ai-value-will-come-from-broad-automation-not-from-r-d

---

Narrated by TYPE III AUDIO.

“Train once, deploy many: AI and increasing returns” by Ege Erdil, Tamay Besiroglu

Fri, 07 Mar 2025 00:00:00 GMT

Subtitle: AI's “train-once-deploy-many” advantage yields increasing returns: doubling compute more than doubles output by increasing models' inference efficiency and enabling more deployed inference instances.

Introduction

A significant advantage of AI models over human intelligence is the ability to train a model once and then serve arbitrarily many copies of it for inference. This ‘train-once-deploy-many’ property means we can justify spending far more resources to train a single AI model than we could ever spend training a single human (something that AI labs have recently started doing). For example, it's common for frontier models to be trained on tens of thousands of GPUs, yet each instance during inference requires only a few dozen GPUs.

This difference suggests that AI systems exhibit increasing returns to scale when we add more compute for training and inference. If we set aside price effects for the moment, then with twice the compute, we can double economic output simply by running twice as many copies of the models we’re using for inference. In addition, we can use the same extra compute to train models using twice the training compute, and we expect these larger models to be more efficient at converting inference compute into [...]

---

Outline:

(00:28) Introduction

(03:59) The basic argument

(06:02) Doesn't this assume an infinite span for the tradeoff?

(07:35) What does this imply about an AI-only economy?

---

First published:
March 7th, 2025

Source:
https://epoch.ai/publications/train-once-deploy-many-ai-and-increasing-returns

---

Narrated by TYPE III AUDIO.

“What AI can currently do is not the story” by Ege Erdil

Fri, 07 Mar 2025 00:00:00 GMT

Subtitle: Forecasting AI progress requires more than extrapolating current capabilities; understanding fundamental task difficulty is key to predicting future breakthroughs.

When trying to forecast future capabilities of AI systems and the economic and social impacts these capabilities will have, there are two different common methods that people use:

Look at past AI capabilities along with how fast they’ve changed and try to extrapolate that knowledge to the future.
Use first principles reasoning based on the capabilities and resource use of the human brain, the availability of training data across different domains, how expensive it is to get reward signals on different tasks, etc. to estimate the difficulty of automating tasks.

There are further details about how each method may be used in practice, but they represent two fundamentally different ways of forecasting AI capabilities. The first method is often preferred by economists: for instance, Robin Hanson used a variant of it in 2012 by asking AI experts how much progress towards human-level capabilities we had made over the past 20 years and extrapolated their answer to reach human-level AI timelines of a century or longer.

People following this [...]

---

Outline:

(05:27) The perils of extrapolation

(10:34) What is the alternative?

(15:07) Conclusion

---

First published:
March 7th, 2025

Source:
https://epoch.ai/gradient-updates/what-ai-can-currently-do-is-not-the-story

---

Narrated by TYPE III AUDIO.

“The promise of reasoning models” by Matthew Barnett

Fri, 28 Feb 2025 00:00:00 GMT

Subtitle: AI reasoning models will achieve superhuman performance in math and coding, yet their economic applications will lag behind, limiting real-world impact.

Perhaps the most significant AI development of the past year has been the rise of reasoning models—LLMs trained via reinforcement learning to solve complex problems, such as OpenAI's o1, DeepSeek-R1, and Claude 3.7 Sonnet. These models have already demonstrated remarkable success, significantly enhancing AI capabilities in mathematical problem-solving, scientific reasoning, and coding.

In this article, I aim to present a clear conceptual framework for understanding the impacts reasoning models may have on the world. My core thesis is that the primary consequence of reasoning models will be the creation of AIs that are narrowly superhuman at “pure reasoning tasks”—abstract tasks with correct answers that can be cheaply verified. For example, I would guess that in the next three years, AIs will likely be developed that are capable of outperforming top human mathematicians at proving arbitrary mathematical theorems. At the same time, I predict that economically valuable AI capabilities will lag behind, with reliable computer-control agents arriving significantly later than high-quality reasoning models.

I also address some broader speculation about the downstream implications of reasoning [...]

---

Outline:

(02:48) A brief primer on reasoning models

(06:51) What I think reasoning models will be able to do

(12:14) What I suspect AI labs will struggle with in the near term

(15:44) Will reasoning models upset the business model of AI labs?

(21:31) Conclusion

---

First published:
February 28th, 2025

Source:
https://epoch.ai/gradient-updates/the-promise-of-reasoning-models

---

Narrated by TYPE III AUDIO.

“AI progress is about to speed up” by Ege Erdil

Fri, 21 Feb 2025 00:00:00 GMT

Subtitle: AI progress is accelerating, with next-gen models surpassing GPT-4 in compute power, driving major leaps in reasoning, coding, and math capabilities.

AI is a field where progress happens remarkably quickly compared to the standards of other industries. Even in the past two years, we’ve seen impressive capability gains and cost reductions over models such as GPT-4 that would have been unprecedented in almost any other domain. However, there's a general sense among observers that progress has been slower than they’ve expected since GPT-4 was released.

I think this is mostly because compute growth has been slow since then, so we’ve been seeing gains from algorithmic progress and improved data quality rather than large compute scale-ups. The chart below from our article on the compute cost of training frontier models shows this clearly:

The release of GPT-4 in March 2023 stands out because GPT-4 represented a 10x compute scale-up over the models we had seen before. Since then, we’ve not seen another scale-up of this magnitude: all currently available frontier models, with the exception of Grok 3, have been trained on a compute budget similar to GPT-4 or less. For instance, Dario Amodei, CEO of Anthropic [...]

---

Outline:

(03:06) What to expect from the new models

(06:15) Programming

(07:20) Math

(08:07) Agents

(10:28) What should we make of Grok 3?

(13:14) Why has spending not grown faster before?

(16:14) Concluding thoughts

---

First published:
February 21st, 2025

Source:
https://epoch.ai/gradient-updates/ai-progress-is-about-to-speed-up

---

Narrated by TYPE III AUDIO.

“Algorithmic progress likely spurs more spending on compute, not less” by Matthew Barnett

Fri, 14 Feb 2025 00:00:00 GMT

Subtitle: Algorithmic progress in AI may not reduce compute spending—instead, it could drive higher investment as efficiency unlocks new opportunities.

In recent weeks, there has been widespread speculation about the economic implications of algorithmic progress—improvements to machine learning methods that allow us to develop and deploy models using fewer resources. Many have suggested that algorithmic progress, as observed in DeepSeek's training of V3 and R1, will reduce demand for high-performance GPUs going forward. Their argument is that it enables AI labs to “do more with less”, and build AI products without needing as much compute.

Here, I argue the opposite: rather than decreasing overall spending, algorithmic progress is likely to increase AI compute spending, both in inflation-adjusted terms and in terms of the share of GDP spent on compute. This is particularly true for forms of algorithmic progress that allow AI labs to improve frontier performance at the same time they can increase computational efficiency, which is largely true for DeepSeek's recent innovations.

I approach this argument from an empirical perspective. Most directly, data from machine learning trends indicates that algorithmic progress has coincided with a rapidly increasing rate of investment in computers used for training [...]

---

Outline:

(02:57) Introduction

(08:00) The empirical data

(15:57) The case for AI as an unusual computing product

(21:30) Conclusion

---

First published:
February 14th, 2025

Source:
https://epoch.ai/gradient-updates/algorithmic-progress-likely-spurs-more-spending-on-compute-not-less

---

Narrated by TYPE III AUDIO.

“A more systematic and transparent AI benchmarking hub” by Tom Adamczewski

Fri, 07 Feb 2025 00:00:00 GMT

Subtitle: We've overhauled our AI benchmarking infrastructure to provide more transparent, systematic, and up-to-date evaluations of AI model capabilities.

Introduction

Back in November, we released the Epoch AI Benchmarking Hub. This platform hosts the results of evaluations of notable AI models conducted by Epoch AI, including visualizations and additional analysis. Its goal is to shed light on what today's AI systems are capable of—and where they are headed. Data on the platform is publicly accessible under a permissive license, allowing other members of the research community to use it for their own analyses.

Today, we are publishing a major update of the Benchmarking Hub. Our visualisations still look very similar, but we have completely overhauled the process by which we run AI benchmarks and share results. We’ve made significant engineering investments in our infrastructure that allow us to be more transparent, systematic, and up-to-date.

The most noticeable changes for you as a user are:

You have access to richer data about each evaluation and the model being evaluated
The database will be much more frequently updated

Key features of the AI Benchmarking Hub

Our database fills a gap in the publicly available data about [...]

---

Outline:

(00:23) Introduction

(01:26) Key features of the AI Benchmarking Hub

(03:38) The Epoch AI client library

(04:26) Evaluation platform

(04:54) Next steps

---

First published:
February 7th, 2025

Source:
https://epoch.ai/publications/benchmarking-hub-update

---

Narrated by TYPE III AUDIO.

“How much energy does ChatGPT use?” by Josh You

Fri, 07 Feb 2025 00:00:00 GMT

Subtitle: This Gradient Updates issue explores how much energy ChatGPT uses per query, revealing it's 10x less than common estimates.

Credit to Alex Erben and Ege Erdil for substantial help with research and calculations. In this issue, “we” refers to our collective judgment.

A commonly cited claim is that powering an individual ChatGPT query requires around 3 watt-hours of electricity, or 10 times as much as a Google search. This is often brought up to express concern over AI's impact on the environment, climate change, or the electric grid.

However, we believe that this figure of 3 watt-hours per query is likely an overestimate. In this issue, we revisit this question using a similar methodology, but using up-to-date facts and clearer assumptions. We find that typical ChatGPT queries using GPT-4o likely consume roughly 0.3 watt-hours, which is ten times less than the older estimate. This difference comes from more efficient models and hardware compared to early 2023, and an overly pessimistic estimate of token counts in the original estimate.

For context, 0.3 watt-hours is less than the amount of electricity that an LED lightbulb or a laptop consumes in a few minutes. And even for a [...]
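The comparison above can be checked with a short calculation. This is a minimal sketch: the 0.3 and 3 watt-hour figures come from the text, while the device power draws (a 10 W LED bulb and a 50 W laptop) are illustrative assumptions, not values from the article.

```python
QUERY_WH = 0.3         # estimated energy per GPT-4o query (from the text)
OLD_ESTIMATE_WH = 3.0  # commonly cited older estimate (from the text)

# Assumed device power draws in watts; illustrative values only.
LED_BULB_W = 10.0
LAPTOP_W = 50.0

def minutes_to_consume(energy_wh: float, power_w: float) -> float:
    """Minutes a device drawing power_w watts takes to use energy_wh watt-hours."""
    return energy_wh / power_w * 60

ratio = OLD_ESTIMATE_WH / QUERY_WH
led_minutes = minutes_to_consume(QUERY_WH, LED_BULB_W)
laptop_minutes = minutes_to_consume(QUERY_WH, LAPTOP_W)

print(f"Older estimate is {ratio:.0f}x the new one")
print(f"LED bulb: {led_minutes:.1f} min, laptop: {laptop_minutes:.2f} min")
```

With these assumed wattages, one query corresponds to under two minutes of LED lighting or well under a minute of laptop use.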

---

Outline:

(02:36) Estimating the energy cost of a query

(08:33) Why is our estimate different from others?

(10:33) What about other models besides GPT-4o?

(14:55) Training and other upstream costs

(18:20) Discussion

The original text contained 23 footnotes which were omitted from this narration.

---

First published:
February 7th, 2025

Source:
https://epoch.ai/gradient-updates/how-much-energy-does-chatgpt-use

---

Narrated by TYPE III AUDIO.

“What went into training DeepSeek-R1?” by Ege Erdil

Fri, 31 Jan 2025 00:00:00 GMT

Subtitle: This Gradient Updates issue explores DeepSeek-R1's architecture, training cost, and pricing, showing how it rivals OpenAI's o1 at 30x lower cost.

On January 20th, 2025, DeepSeek released their latest open-weights reasoning model, DeepSeek-R1, which is on par with OpenAI's o1 in benchmark performance. The release has generated a significant amount of controversy, most notably about the possibility that DeepSeek might have underreported or misrepresented the training cost of their model. I find this claim implausible for reasons that I will explore in this issue.

Aside from the point about the model's training cost, I also want to clarify what we actually know about the model's architecture, training process, performance, and pricing.

Architecture

DeepSeek R1's architecture is identical to DeepSeek v3, an earlier model that the company released in December 2024. I covered the key architectural details of this model in a Gradient Updates issue from two weeks ago, so I will only provide a brief high-level summary here.

Overall, the model is a very sparse mixture-of-experts, with 671 billion total parameters but only 37 billion active per token. The experts are divided into two classes: one “shared expert” which every token is always routed to [...]

---

Outline:

(01:01) Architecture

(04:13) Training

(04:27) Pre-training

(10:20) RL training for R1-Zero

(17:20) Subsequent training for R1

(19:50) Performance and pricing

(21:29) Conclusion

---

First published:
January 31st, 2025

Source:
https://epoch.ai/gradient-updates/what-went-into-training-deepseek-r1

---

Narrated by TYPE III AUDIO.

“Announcing our expanded biology AI coverage” by Pablo Villalobos, David Atanasov

Wed, 29 Jan 2025 00:00:00 GMT

Subtitle: We've expanded our Biology AI Dataset, now covering 360+ models. Our analysis reveals rapid scaling from 2017-2021, followed by a notable slowdown in biological model development.

We’re pleased to announce an expansion of our Biological Model Dataset, a component of Epoch AI's larger database of machine learning models. As the role of AI in biology continues to grow—powering advances in drug design, protein engineering, and genomics—the opportunities and governance challenges posed by biological AI models increase the importance of tracking advances in this field.

Our goal with this project is to provide a comprehensive resource for researchers and policymakers. To this end, we have curated information from over 360 models in this update, prioritizing recent models at the frontier of capability, scale, or scientific impact. Alongside details on their developers, intended tasks, and training datasets, we’ve included new estimates of the training compute that went into developing them.

There's a chart here. The chart title reads: Training compute of biological models

There's a chart here. The chart title reads: Training dataset size of biological models

Analyzing compute and data trends can help us understand how invested the field is [...]

---

First published:
January 29th, 2025

Source:
https://epoch.ai/publications/announcing-expanded-biology-ai-coverage

---

Narrated by TYPE III AUDIO.

“AGI could drive wages below subsistence level” by Matthew Barnett

Fri, 24 Jan 2025 00:00:00 GMT

Subtitle: This Gradient Updates issue explores how AGI could disrupt labor markets, potentially driving wages below subsistence levels, and challenge historical economic trends.

Historically, many have feared that automation would lead to mass unemployment and lower wages. Yet, despite massive improvements in automation in the last two centuries, average wages have risen, living standards have improved, and high unemployment has not become a persistent, long-term issue as many had expected. This historical pattern has led most economists to adopt the following optimistic view: automation typically creates at least as many opportunities as it destroys, and its overall impacts on wages are positive.

But artificial general intelligence (AGI)—defined here as a technology that can functionally substitute for human workers in all labor tasks—may defy these historical precedents. Unlike past technologies, which typically automated specific tasks within industries, AGI has the potential to replace human labor across the entire spectrum of work, including physical tasks, and any new tasks that could be created in the future. Because of this, AGI might disrupt labor markets in an unprecedented way.

In fact, there is a straightforward case for why developing AGI could drive human wages below subsistence level—the bare minimum [...]

---

Outline:

(03:39) Why wages will plausibly fall after AGI

(09:37) Returns to scale

(14:34) Reasons for temporary optimism

(16:34) Long-run pessimism about wages

(22:37) Comparative advantage doesn't save us

(25:46) Conclusion

---

First published:
January 24th, 2025

Source:
https://epoch.ai/gradient-updates/agi-could-drive-wages-below-subsistence-level

---

Narrated by TYPE III AUDIO.

“Clarifying the creation and use of the FrontierMath benchmark” by Tamay Besiroglu, Jaime Sevilla

Thu, 23 Jan 2025 00:00:00 GMT

Subtitle: We clarify that OpenAI commissioned Epoch AI to produce 300 math questions for the FrontierMath benchmark. They own these and have access to the statements and solutions, except for a 50-question holdout set.

FrontierMath is a benchmark we created to evaluate the mathematical capabilities of frontier AI models. We saw a need for high-quality, challenging mathematical problems that could meaningfully test the limits of these systems. This remains our core mission—to help the AI community and the public at large accurately understand and measure AI capabilities.

Building high-quality evaluations at this scale requires substantial resources. After approaching several potential funders, we partnered with OpenAI, who provided both the necessary funding and technical expertise to develop the benchmark. Working with industry sponsors helps make the benchmark more impactful for the AI field.

However, we recognize we have not communicated clearly enough about the relationship between FrontierMath and OpenAI, leading to questions and concerns among contributors, researchers, and the public. To address these issues, here are the facts:

OpenAI commissioned Epoch AI to produce 300 advanced math problems for AI evaluation that form the core of the FrontierMath benchmark. As is typical of commissioned [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
January 23rd, 2025

Source:
https://epoch.ai/publications/openai-and-frontiermath

---

Narrated by TYPE III AUDIO.

“Epoch AI 2024 impact report” by The Epoch AI Team

Fri, 17 Jan 2025 00:00:00 GMT

Subtitle: In 2024, Epoch published influential research, launched FrontierMath, expanded its AI data hub, engaged with policy and industry leaders, raised 7 million dollars, and more.

2024 Impact Report

2024 has proven yet another impactful year for AI. The release of OpenAI's o1 established inference-time scaling as a crucial driver of progress, later culminating in the announcement of OpenAI's o3, for which OpenAI claims large advances in math, reasoning and coding benchmarks. We also saw many other major LLM releases, including highly capable models from OpenAI competitors, such as Google's Gemini 1.5 and 2.0 and Anthropic's Claude 3.5, a proliferation of GPT-4-scale models such as Mistral Large (2), GLM-4, Doubao Pro, Nemotron-4 340B and Grok 2, and Llama 3.1 405B and DeepSeek v3, the first downloadable-weight models comparable to GPT-4 in performance. Lastly, we have seen large advances in video generation, including the release of models such as Sora, ImageGen and Veo 2, as well as some early results in computer interaction through GPT-4o and Claude 3.5 Sonnet.

Amidst all these developments, Epoch AI's mission of informing the public about ongoing developments remains as critical as ever. Throughout the year, we have updated our data page with [...]

---

Outline:

(00:25) 2024 Impact Report

(02:52) Highlights from 2024

(02:59) FrontierMath

(04:04) Can AI Scaling Continue Through 2030?

(05:13) Epoch AI's Data Hub

(06:38) Press and citations

(08:40) What people are saying about Epoch AI

(10:08) 2024 in numbers

(10:11) Outputs

(10:23) Data collection

(10:47) Engagement

(11:16) Social media

(11:48) Company

(12:09) Our plans for 2025

(12:17) Curating Data on AI

(12:56) Measuring AI capabilities

(13:53) Modeling the impact of AI

(15:08) Communications

(15:29) Hiring

(15:57) Feedback

(16:16) Partnerships

(16:45) Support our work

---

First published:
January 17th, 2025

Source:
https://epoch.ai/publications/epoch-impact-report-2024

---

Narrated by TYPE III AUDIO.

“How has DeepSeek improved the Transformer architecture?” by Ege Erdil

Fri, 17 Jan 2025 00:00:00 GMT

Subtitle: This Gradient Updates issue goes over the major changes that went into DeepSeek's most recent model.

DeepSeek has recently released DeepSeek v3, which is currently state-of-the-art in benchmark performance among open-weight models, alongside a technical report describing in some detail the training of the model. Impressively, they’ve achieved this SOTA performance by only using 2.8 million H800 hours of training hardware time—equivalent to about 4e24 FLOP if we assume 40% MFU. This is about ten times less training compute than the similarly performing Llama 3.1 405B.
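The conversion from GPU hours to training compute can be reproduced with a short calculation. This is a minimal sketch: the GPU hours and the 40% MFU come from the text, while the peak throughput of roughly 989 TFLOP/s per H800 (dense BF16) is an assumption introduced here for illustration.

```python
# Assumed peak dense BF16 throughput of one H800 GPU, in FLOP/s.
# This value is an assumption for illustration; it is not stated in the text.
H800_PEAK_FLOPS = 989e12

GPU_HOURS = 2.8e6  # training hardware time quoted in the text
MFU = 0.40         # model FLOP utilization assumed in the text

def training_flop(gpu_hours: float, peak_flops: float, mfu: float) -> float:
    """Total training FLOP = GPU-seconds x peak FLOP/s x utilization."""
    return gpu_hours * 3600 * peak_flops * mfu

total = training_flop(GPU_HOURS, H800_PEAK_FLOPS, MFU)
print(f"Estimated training compute: {total:.1e} FLOP")
```

Under the assumed peak throughput, the result lands at roughly 4e24 FLOP, in line with the figure quoted above.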

In this issue, I’ll cover some of the important architectural improvements that DeepSeek highlight in their report and why we should expect them to result in better performance compared to a vanilla Transformer. The full technical report contains plenty of non-architectural details as well, and I strongly recommend reading it if you want to get a better idea of the engineering problems that have to be solved when orchestrating a moderate-sized training run.

Figure 1: The DeepSeek v3 architecture with its two most important improvements: DeepSeekMoE and multi-head latent attention (MLA). Multi-token prediction is not shown. From the DeepSeek v3 technical report.

Multi-head latent attention (MLA)

Multi-head latent [...]

---

Outline:

(01:38) Multi-head latent attention (MLA)

(02:12) What is the KV cache and why does it matter?

(05:40) Beating grouped-query attention

(09:01) Mixture-of-experts innovations

(11:46) Auxiliary-loss-free load balancing

(12:57) Shared experts

(15:16) Multi-token prediction

(17:46) Conclusion

---

First published:
January 17th, 2025

Source:
https://epoch.ai/gradient-updates/how-has-deepseek-improved-the-transformer-architecture

---

Narrated by TYPE III AUDIO.

“FrontierMath competition: Setting benchmarks for AI evaluation” by Tamay Besiroglu, Elliot Glazer, Caroline Falkman Olsson

Thu, 16 Jan 2025 00:00:00 GMT

Subtitle: We are hosting a competition to establish rigorous human performance baselines for FrontierMath. With a prize pool of $10,000, your participation will contribute directly to measuring AI progress in solving challenging mathematical problems.

We’re launching a competition to establish rigorous human performance baselines for FrontierMath, our benchmark for evaluating AI mathematical capabilities. The results will provide crucial reference points for measuring AI progress in tackling very difficult mathematics problems.

Competition Overview

Format: 4.5 hours solving novel mathematics problems alongside leading mathematicians
Guaranteed payment: $150 per participant
Prize pool: $10,000 distributed across top performers
Recognition: Participants acknowledged in FrontierMath baseline publication
Date: March 30, 2025
Time: Full event, 11 AM-7 PM; competition, 12:30 PM-5 PM.
Location: Cambridge, MA

Sign Up

We’re now encouraging interested mathematicians to sign up for the event. Please express your interest in competing below. Note that sign up does not guarantee participation, as we may need to select among applicants.

Get involved

Why Participate

This competition offers a unique opportunity to contribute to AI progress measurement while competing for substantial prizes. Results will directly inform our understanding of AI capabilities by providing [...]

---

Outline:

(00:51) Competition Overview

(01:28) Sign Up

(01:48) Why Participate

(02:06) Contact us

---

First published:
January 16th, 2025

Source:
https://epoch.ai/publications/frontiermath-competition-setting-benchmarks-for-ai-evaluation

---

Narrated by TYPE III AUDIO.

“The economic consequences of automating remote work” by Matthew Barnett

Fri, 10 Jan 2025 00:00:00 GMT

Subtitle: This Gradient Updates issue investigates the economic consequences of fully automating remote work.

Recent AI progress has shown great promise in automating cognitive tasks, like those in natural language processing and vision. By contrast, progress in general-purpose robotics has lagged. While we already have access to intelligent virtual assistants at our fingertips, robots capable of fully cleaning a standard suburban home still appear years away.

Given the relatively slow pace of progress in robotics, a large share of knowledge work might be automated before physical jobs are overtaken by AIs. This naturally prompts the question: what might be the economic impact if only remote work—defined as work that can be performed entirely from home using digital tools, a computer, and an internet connection—were to be automated?

To investigate this question, I conduct a three-part analysis. First, I use GPT-4o to classify the tasks involved in US occupations according to the O*NET database, finding that around 34% of job tasks can be performed remotely. This contrasts with previous findings from a study by Dingel & Neiman, which found that 37% of US occupations—not tasks—can be performed entirely remotely.

Second, I use data from the transition [...]

---

Outline:

(02:34) What fraction of present economic work can be done remotely?

(06:49) How much would the economy grow if remote work were automated?

(07:20) A brief primer on the elasticity of substitution

(10:01) The production function

(12:50) Using the pandemic to estimate the elasticity of substitution

(18:32) Estimating the elasticity of substitution via proxy

(21:02) Putting it all together

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
January 10th, 2025

Source:
https://epoch.ai/gradient-updates/consequences-of-automating-remote-work

---

Narrated by TYPE III AUDIO.

“Moravec’s paradox and its implications” by Ege Erdil

Fri, 27 Dec 2024 00:00:00 GMT

Subtitle: This Gradient Updates issue explains Moravec's paradox and offers a speculative picture of how hard various economic tasks are to automate based on the paradox.

Since the birth of the field of artificial intelligence in the 20th century, researchers have observed that the difficulty of a task for humans at best weakly correlates with its difficulty for AI systems. For example, humans find it difficult to multiply ten-digit numbers in their heads but easy to draw boxes around each individual cat in a photograph. In contrast, for AI systems the difficulty is reversed: they could do the former task in the 1950s, and it took until the 2010s for segmentation algorithms to match human performance on the latter task.

The specific observation that it's easy to build AI systems that perform formal reasoning tasks but difficult to build AI systems whose perception and motor skills are comparable to a human is called Moravec's paradox. Moravec himself offered an evolutionary explanation for the paradox: we should expect cognitive skills that have been around for longer to be more difficult to reproduce in AI systems, because evolution is likely to have applied significantly more optimization pressure on older [...]

---

Outline:

(02:05) How does the brain work?

(05:05) What explains performance differences between AIs and the human brain?

(10:11) Which tasks will be automated next according to this picture?

(13:24) Concluding thoughts

---

First published:
December 27th, 2024

Source:
https://epoch.ai/gradient-updates/moravec-s-paradox

---

Narrated by TYPE III AUDIO.

“How do mixture-of-experts models compare to dense models in inference?” by Ege Erdil

Fri, 20 Dec 2024 00:00:00 GMT

Subtitle: This Gradient Updates issue explores how mixture-of-experts models compare to dense models in inference, focusing on costs, efficiency, and decoding dynamics.

In last week's Gradient Updates issue, I discussed how we can guess that GPT-4o and Claude 3.5 Sonnet have significantly fewer parameters than GPT-4. The most common question I’ve received from readers was some version of the following:

Don’t active parameters matter more for inference economics than total parameters? If so, how can you infer the total number of parameters of a model by only looking at its inference cost and speed?

In response to this, I’ve decided to make this issue specifically about the question of inference with mixture-of-experts (MoE) models. The basic takeaway is that MoEs are more efficient at inference than dense models of the same total parameter count, but less efficient than dense models with the same active parameter count. A rough rule of thumb is that an 8-way sparse model has the same short-context decoding economics as a dense model half its size.

For the sake of brevity, I’ll assume familiarity with basic concepts in Transformer inference, so I expect this issue to be confusing to readers without [...]

---

Outline:

(01:45) Advantages of MoE models

(03:00) MoEs have fewer active parameters than dense models

(06:39) MoEs are shallower and wider than dense models

(08:14) At fixed model depth, MoE models need less network communication than dense models

(10:00) MoEs have smaller attention blocks than dense models

(11:53) Estimating the MoE inference edge

(15:09) Conclusion

---

First published:
December 20th, 2024

Source:
https://epoch.ai/gradient-updates/moe-vs-dense-models-inference

---

Narrated by TYPE III AUDIO.

“Announcing Gradient Updates: Our new weekly newsletter” by Ege Erdil

Fri, 13 Dec 2024 00:00:00 GMT

Subtitle: We are announcing Gradient Updates, our new weekly newsletter focused on timely and important questions in AI.

Last Friday, we released the first issue of our new weekly newsletter, Gradient Updates, led and mainly written by senior researcher Ege Erdil. Each issue will offer in-depth commentary on timely and enduring questions in AI. Rather than delivering a roundup of the week's headlines, Gradient Updates focuses on a single, carefully chosen topic each week. For instance, our inaugural issue examined the impact of U.S. export controls on Chinese AI capabilities.

With Gradient Updates, we aim to share insights and explorations that are less formal than a full-length paper or technical report, but more substantial than typical industry news briefs. You won’t find the latest investments or product releases covered here; instead, expect content that grapples with broader themes—from the economic implications of vertical disintegration within the AI sector to the potential of synthetic data and test-time compute scaling to surpass current pretraining limitations.

You can read our first two issues now, and subscribe to receive new issues as they’re published.

Issue #1: What did US export controls mean for China's AI capabilities? In this issue [...]

---

First published:
December 13th, 2024

Source:
https://epoch.ai/publications/announcing-gradient-updates-newsletter

---

Narrated by TYPE III AUDIO.

“Frontier language models have become much smaller” by Ege Erdil

Fri, 13 Dec 2024 00:00:00 GMT

Subtitle: In this Gradient Updates weekly issue, Ege discusses how frontier language models have unexpectedly reversed course on scaling, with current models an order of magnitude smaller than GPT-4.

Between the release of the original Transformer in 2017 and the release of GPT-4, language models at the frontier of capabilities became much larger. Parameter counts were scaled up by 1000 times from 117 million to 175 billion between GPT-1 and GPT-3 in the span of two years and by another 10 times from 175 billion to 1.8 trillion between GPT-3 and GPT-4 in the span of the next two years and nine months.

If the post GPT-3 trend had continued, given that GPT-4 was released in March 2023, by now we could have expected to see models with close to 10 trillion parameters, around 4 times bigger than GPT-4. However, in 2023, the trend of frontier language models becoming bigger reversed. Let alone reaching the 10 trillion parameter mark, current frontier models such as the original GPT-4o and Claude 3.5 Sonnet are probably an order of magnitude smaller than GPT-4, with 4o having around 200 billion and 3.5 Sonnet around 400 billion parameters.

In this issue [...]

---

Outline:

(01:51) How do we know this has happened?

(05:51) Why did this happen?

(09:53) Should we expect frontier models to keep getting smaller?

---

First published:
December 13th, 2024

Source:
https://epoch.ai/gradient-updates/frontier-language-models-have-become-much-smaller

---

Narrated by TYPE III AUDIO.

“What did US export controls mean for China’s AI capabilities?” by Ege Erdil

Fri, 06 Dec 2024 00:00:00 GMT

Subtitle: Export controls on China give the US a hardware lead of around 4 years in training frontier models, but essentially no lead in serving those models to users.

Four days ago, the US government announced new rules around the export of powerful chips and semiconductor manufacturing equipment to China. This recent update is part of a broader trend of increasing export restrictions, following the announcement of the first export controls by the US Bureau of Industry and Security (BIS) in 2022 and the first revision in 2023.

The BIS documents in which these export controls are announced are quite long. The most recent update takes up two documents, totaling 210 pages, and previous updates weren’t any shorter. In this issue, I aim to cut through most of this noise and give you a broad picture of what impact the chip export controls have had on China's ability to train and serve powerful AI models.

The high-level takeaway is that current export controls on China give the US a hardware lead of around 4 years when it comes to training frontier models, but the US has essentially no lead when it comes to serving those models [...]

---

Outline:

(01:47) The October 2022 export controls

(05:43) The October 2023 update

(09:11) The December 2024 update

(10:48) What's going to happen next?

---

First published:
December 6th, 2024

Source:
https://epoch.ai/gradient-updates/us-export-controls-china-ai

---

Narrated by TYPE III AUDIO.

“What is the future of AI in mathematics? Interviews with leading mathematicians” by Anson Ho, Tamay Besiroglu

Wed, 04 Dec 2024 00:00:00 GMT

Subtitle: How will AI transform mathematics? Fields Medalists and other leading mathematicians discuss whether they expect AI to automate advanced math research.

Recent advances in artificial intelligence are beginning to influence the field of mathematics, prompting important questions about how mathematical research might evolve. Systems like Google DeepMind's AlphaProof have demonstrated capabilities approaching gold medal-level performance at the International Mathematical Olympiad. Professor Timothy Gowers described these abilities as “very impressive, and well beyond what I thought was state of the art.”

Developments like these raise several important questions: How will the nature of mathematics research evolve over the next decade? Can mathematics research be fully automated, and when might this happen? To answer these questions, we interviewed four distinguished mathematicians about the implications of AI progress in mathematics: Fields Medalists Prof Terence Tao, Prof Timothy Gowers, and Prof Richard Borcherds, as well as IMO expert Evan Chen.

In our conversations, the mathematicians highlighted key themes regarding AI's potential impact on mathematics. They discussed how AI could assist in proof development and verification, facilitate experimental approaches by exploring vast numbers of potential statements, generate novel conjectures by synthesizing information across fields, reduce barriers to entry into specialized [...]

---

Outline:

(02:04) AI augmentation of mathematics research

(06:17) Challenges to AI systems achieving deep research competence in math

(08:31) Fully automating mathematics research

(11:13) Conclusion

---

First published:
December 4th, 2024

Source:
https://epoch.ai/frontiermath/tiers-1-4/expert-perspectives

---

Narrated by TYPE III AUDIO.

“Introducing the distributed training interactive simulator” by Ege Erdil, Tamay Besiroglu

Fri, 29 Nov 2024 00:00:00 GMT

Subtitle: We introduce an interactive simulation tool which can simulate distributed training runs of large language models under ideal conditions.

Recently, we published the results of our investigation into data movement bottlenecks in distributed training of deep learning models, introducing a detailed model of the bandwidth and latency costs of different modes of parallelism in GPU clusters. Today, we’re also releasing an interactive simulator with a user-friendly interface implementing the same model.

This post will introduce the simulation tool's features by using it to answer the following question: how big of a neural network training run could we have done using the GTX 580 GPUs that were used to train AlexNet in 2012? In other words, instead of training AlexNet on two GTX 580 3GB GPUs for six days, how far could an AI lab have scaled training if they wanted to train the largest possible model using just GTX 580 3GB GPUs, disregarding cost considerations?

How to use the tool

Before getting into the specific experiment to answer this question, let's see how to get started with using the tool. On launch, you’ll be greeted with the following interface:

Figure 1: The user interface as [...]

---

Outline:

(01:20) How to use the tool

(05:00) What would have been the largest feasible training run in 2012?

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
November 29th, 2024

Source:
https://epoch.ai/publications/introducing-the-distributed-training-interactive-simulator

---

Narrated by TYPE III AUDIO.

“Introducing Epoch AI’s AI benchmarking hub” by The Epoch AI Team

Wed, 27 Nov 2024 00:00:00 GMT

Subtitle: We are launching the AI Benchmarking Hub: a platform presenting our evaluations of leading models on challenging benchmarks, with analysis of trends in AI capabilities.

Epoch AI is launching our AI Benchmarking Hub—a platform for comprehensively understanding AI capabilities.

By evaluating leading AI models ourselves and carefully analyzing the results, we aim to shed light on the main trends in AI capabilities. Our clear visuals and detailed findings can help researchers, developers, and decision-makers better understand what today's AI systems can actually do—and where they’re headed.

Key Features of the AI Benchmarking Hub

Challenging benchmarks: Our goal is to track model performance on the hardest and most informative benchmarks. For this first release, the hub features results from two benchmarks:

GPQA Diamond: This is a higher-quality, challenging subset of the GPQA benchmark, which tests models’ ability to answer PhD-level multiple choice questions about chemistry, physics, and biology.
MATH Level 5: This is a subset of the hardest questions from the MATH benchmark, a dataset of high-school level competition math problems.

We plan to rapidly expand our suite of benchmarks to create a thorough picture of AI progress, by adding benchmarks such as [...]

---

Outline:

(01:05) Key Features of the AI Benchmarking Hub

(02:50) What's Next?

---

First published:
November 27th, 2024

Source:
https://epoch.ai/publications/introducing-benchmarks-dashboard

---

Narrated by TYPE III AUDIO.

“Hardware failures won’t limit AI scaling” by Alexander Erben, Ege Erdil

Fri, 22 Nov 2024 00:00:00 GMT

Subtitle: Our analysis shows hardware failures won't limit AI training scale. GPU memory-based checkpointing enables training beyond millions of GPUs.

Introduction

Computational power has become the primary driving force in the race for improved model capabilities. Leading tech companies (Google, Microsoft, Meta, and Amazon) already possess AI computing infrastructure equivalent to more than two million NVIDIA H100 GPUs, and frontier AI model training costs are projected to exceed $1 billion per model by 2027. This scale of computation brings with it an increasingly important challenge in distributed computing: hardware failures.

To understand why, consider that if a single H100 GPU fails on average once every 50,000 hours (about 6 years), a cluster of 100,000 GPUs will face a failure every 30 minutes, and a million-GPU cluster will see failures every 3 minutes.
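
The arithmetic behind these figures can be sketched as follows, under the simplifying assumption that GPUs fail independently at a constant rate, so the expected time between failures in a cluster is the per-GPU figure divided by the number of GPUs. The function name is illustrative.

```python
# Expected time between failures in a cluster, assuming independent GPUs
# that each fail on average once every MTBF_HOURS_PER_GPU hours.
MTBF_HOURS_PER_GPU = 50_000  # per-GPU figure from the text


def minutes_between_failures(num_gpus: int, mtbf_hours: float = MTBF_HOURS_PER_GPU) -> float:
    """Cluster-level mean time between failures, in minutes."""
    return mtbf_hours * 60 / num_gpus


for n in (100_000, 1_000_000):
    print(f"{n:>9,} GPUs: one failure every {minutes_between_failures(n):.0f} minutes")
```

Under this assumption the failure interval shrinks in direct proportion to cluster size, which is why checkpoint cost matters more as clusters grow.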

To combat these failures, engineers resort to one crucial technique: checkpointing. Checkpointing saves training progress to a resilient storage, and in the event of a failure, training can restart by loading saved progress into a working set of GPUs. This opens up an opportunity for optimization, as excessively frequent checkpointing wastes valuable computation time, while insufficient checkpointing risks losing significant progress in the event of a [...]

---

Outline:

(00:24) Introduction

(02:43) Background and Current Practices

(07:30) Alternative Checkpointing Strategies Using GPU Memory

(11:48) Chinchilla-based Model Sizes

(16:02) Idle Spares, Maintenance and Catastrophic Failures

(18:41) Conclusion

(20:18) Acknowledgements

The original text contained 15 footnotes which were omitted from this narration.

---

First published:
November 22nd, 2024

Source:
https://epoch.ai/publications/hardware-failures-wont-limit-ai-scaling

---

Narrated by TYPE III AUDIO.

“FrontierMath: A benchmark for evaluating advanced mathematical reasoning in AI” by Tamay Besiroglu, Elliot Glazer, Caroline Falkman Olsson

Fri, 08 Nov 2024 00:00:00 GMT

Subtitle: FrontierMath: a new benchmark of expert-level math problems designed to measure AI's mathematical abilities. See how leading AI models perform against the collective mathematics community.

This announcement was originally published on November 8, 2024. For the latest benchmark results on FrontierMath, please see AI Benchmarking.

We’re introducing FrontierMath, a benchmark of hundreds of original, expert-crafted mathematics problems designed to evaluate advanced reasoning capabilities in AI systems. These problems span major branches of modern mathematics, from computational number theory to abstract algebraic geometry, and typically require hours or days for expert mathematicians to solve.

Figure 1. While leading AI models now achieve near-perfect scores on traditional benchmarks like GSM-8k and MATH, they solve less than 2% of FrontierMath problems, revealing a substantial gap between current AI capabilities and the collective prowess of the mathematics community. MMLU scores shown are for the College Mathematics category of the benchmark.

To understand and measure progress in artificial intelligence, we need carefully designed benchmarks that can assess how well AI systems engage in complex scientific reasoning. Mathematics offers a unique opportunity for this assessment: it requires extended chains of precise reasoning, with each step building exactly on what came [...]

---

Outline:

(02:02) The FrontierMath Benchmark

(05:22) Current Performance on FrontierMath

(06:37) Our next steps

(07:49) Conclusion

(08:52) Conflict of interest statement

The original text contained 1 footnote which was omitted from this narration.

---

First published:
November 8th, 2024

Source:
https://epoch.ai/frontiermath/tiers-1-4/the-benchmark

---

Narrated by TYPE III AUDIO.

“How far behind are open models?” by Ben Cottier, Josh You, Natalia Martemianova, David Owen

Mon, 04 Nov 2024 00:00:00 GMT

Subtitle: We compare open and closed AI models, and study how openness has evolved. The best open model today is on par with closed models in performance and training compute, but with a lag of about one year.

Introduction

Openness has long been a norm in AI research, fostering collaboration in the field. However, rapid advances in AI have sparked concerns about the potential consequences of releasing the most capable models. In addition, businesses that sell access to models like ChatGPT have a commercial incentive to keep models private.

Industry AI labs have responded to these developments in various ways. Some models remain unreleased, such as Google DeepMind's Chinchilla model. Alternatively, models like GPT-4o have structured access, controlling how users can interact with the model. Other models have their weights available to download with restrictions on the terms of use, such as Meta's Llama models.

Publishing models, code, and datasets enables innovation and external scrutiny, but is irrevocable and risks misuse if a model's safeguards are bypassed. There is an ongoing debate about whether this trade-off is acceptable, or avoidable. Supporters of open source AI have argued that openness benefits society, as well as the model developer, through [...]

---

Outline:

(00:27) Introduction

(02:57) Summary of findings

(05:39) What is an "open" AI model?

(08:27) Degrees of accessibility

(09:23) Benchmark performance

(10:02) Open models have lagged on benchmarks by 5 to 22 months

(12:10) Newer, open LLMs use less compute to match older, closed LLMs

(15:00) Training compute

(16:13) The scaling of open models lags by about 15 months

(18:00) Will open AI models catch up or fall further behind?

(23:08) Most notable AI models released between 2019 and 2023 were open

(27:30) Related work

(29:26) Discussion

The original text contained 42 footnotes which were omitted from this narration.

---

First published:
November 4th, 2024

Source:
https://epoch.ai/publications/open-models-report

---

Narrated by TYPE III AUDIO.

“Data movement bottlenecks to large-scale model training: Scaling past 1e28 FLOP” by Ege Erdil

Sat, 02 Nov 2024 00:00:00 GMT

Subtitle: Data movement bottlenecks limit LLM scaling beyond 2e28 FLOP, with a "latency wall" at 2e31 FLOP. We may hit these in ~3 years. Aggressive batch size scaling could potentially overcome these limits.

Introduction

Over the past five years, the performance of large language models (LLMs) has improved dramatically, driven largely by rapid scaling in training compute budgets to handle larger models and training datasets. Our own estimates suggest that the training compute used by frontier AI models has grown by 4-5 times every year from 2010 to 2024. This rapid pace of scaling far outpaces Moore's law, and sustaining it has required scaling along three dimensions: First, making training runs last longer; second, increasing the number of GPUs participating in each training run; and third, utilizing more performant GPUs.

It's relatively easy to scale the duration that a GPU cluster is used to train a model. However, in practice training runs rarely exceed 6 months. This is because both the hardware and software used for a training run risk becoming obsolete at timescales longer than this, and no lab would want to release a model which has become outdated immediately upon release. This sets a practical limit [...]

---

Outline:

(00:34) Introduction

(04:18) Data movement patterns during distributed training

(04:43) Intra-GPU data movement

(06:41) Inter-GPU data movement

(10:10) The limits to scaling

(11:26) Understanding the latency wall

(14:21) Possible ways to overcome the limits

(16:15) Increasing the batch size

(18:18) Reducing the model depth

(19:43) Conclusion

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
November 2nd, 2024

Source:
https://epoch.ai/publications/data-movement-bottlenecks-scaling-past-1e28-flop

---

Narrated by TYPE III AUDIO.

“Introducing Epoch AI’s machine learning hardware database” by The Epoch AI Team

Wed, 23 Oct 2024 00:00:00 GMT

Subtitle: Our new database covers hardware used to train AI models, featuring over 100 accelerators (GPUs and TPUs) across the deep learning era.

Modern AI models are trained on large supercomputing clusters using specialized hardware. For the leading AI models of today, hardware spending can reach billions of dollars. To explore ML hardware trends in detail, we have added a new Machine Learning Hardware database in our data hub. This covers key trends, such as how hardware performance has improved over time, or how AI clusters have grown larger for training leading models. It also features an interactive visualization, allowing users to explore their own questions using the database.

For example, we can use this data to plot how GPU performance has improved in two distinct ways: the raw number of operations per second has increased around 20% per year, but innovations such as tensor cores and reduced precision number formats have provided further improvements.

There's a chart here. The chart title reads: Computational performance improves 15x when switching from FP32 to INT8

We also track properties such as hardware prices or energy consumption, allowing us to chart how leading AI chips have [...]

---

First published:
October 23rd, 2024

Source:
https://epoch.ai/publications/introducing-epoch-ai-machine-learning-hardware-database

---

Narrated by TYPE III AUDIO.

“Interviewing AI researchers on automation of AI R&D” by David Owen

Tue, 27 Aug 2024 00:00:00 GMT

Subtitle: AI could accelerate AI R&D, especially in coding and debugging tasks. We explore AI researchers’ differing predictions on automation, and their suggestions for designing AI R&D evaluations.

Introduction

The question of when and how AI might automate AI R&D is crucial for AI forecasting: if AI could automate the tasks involved in AI research, it could drastically accelerate AI progress. There is a long history of researchers considering this question in the abstract, and describing its importance for how AI will shape the future. However, AI researchers disagree substantially on timelines for automating AI R&D; for instance, researchers’ predictions for when all AI researcher tasks will be automated vary between years and centuries.

In this work, we interviewed AI researchers with three goals:

Characterize AI R&D work tasks in detail, to better understand how automation might take place.
Clarify the reasoning underpinning researchers’ predictions about automation, to see where and why they disagree.
Collect their views on evaluations intended to measure how capable AI systems are at performing AI R&D, to better understand how society can track AI progress in this critical area.

To do this, we used qualitative interviews. We asked open-ended questions to [...]

---

Outline:

(00:26) Introduction

(02:27) Summary of findings

(03:49) Key results

(03:51) Hypotheses are fast, implementation is time-consuming; both are crucial

(05:43) Predictions differ on automation pace, but agree that engineering will be the main driver of R&D automation

(09:10) Existing R&D evaluations may be a promising start

(11:24) Discussion

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
August 27th, 2024

Source:
https://epoch.ai/publications/interviewing-ai-researchers-on-automation-of-ai-rnd

---

Narrated by TYPE III AUDIO.

“Can AI scaling continue through 2030?” by Jaime Sevilla, Tamay Besiroglu, Ben Cottier, Josh You, Edu Roldán, Pablo Villalobos, Ege Erdil

Tue, 20 Aug 2024 00:00:00 GMT

Subtitle: We investigate the scalability of AI training runs. We identify electric power, chip manufacturing, data and latency as constraints. We conclude that 2e29 FLOP training runs will likely be feasible by 2030.

Introduction

In recent years, the capabilities of AI models have significantly improved. Our research suggests that growth in the computational resources used for training accounts for a significant portion of these performance improvements. The consistent and predictable improvements from scaling have led AI labs to aggressively expand the scale of training, with training compute expanding at a rate of approximately 4x per year.

To put this 4x annual growth in AI training compute into perspective, it outpaces even some of the fastest technological expansions in recent history. It surpasses the peak growth rates of mobile phone adoption (2x per year, 1980-1987), solar energy capacity installation (1.5x per year, 2001-2010), and human genome sequencing (3.3x per year, 2008-2015).
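To see how these annual rates compound, the short sketch below raises each growth factor to the length of its stated period. The function name `cumulative_growth` and the choice to count each period as end year minus start year are assumptions made for illustration, not something taken from the article; the 7-year window used for AI training compute is likewise only a hypothetical comparison horizon.

```python
def cumulative_growth(annual_factor: float, years: int) -> float:
    """Total multiplicative growth after compounding `annual_factor` for `years` years."""
    return annual_factor ** years


# (annual growth factor, start year, end year) as quoted in the text.
trends = {
    "mobile phone adoption": (2.0, 1980, 1987),
    "solar energy capacity": (1.5, 2001, 2010),
    "human genome sequencing": (3.3, 2008, 2015),
}

for name, (factor, start, end) in trends.items():
    # Assumption: the period spans (end - start) years of compounding.
    span = end - start
    total = cumulative_growth(factor, span)
    print(f"{name}: {factor}x/year over {span} years -> {total:,.0f}x")

# Hypothetical comparison window: 4x/year sustained for 7 years.
print(f"AI training compute: 4x/year over 7 years -> {cumulative_growth(4.0, 7):,.0f}x")
```

Under these assumptions, seven years at 4x per year compounds to a far larger total than the same span at 2x or 3.3x, which is the point of the comparison.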

Here, we examine whether it is technically feasible for the current rapid pace of AI training scaling—approximately 4x per year—to continue through 2030. We investigate four key factors that might constrain scaling: power availability, chip manufacturing capacity, data scarcity, and the “latency wall”, a fundamental speed limit imposed by unavoidable delays [...]

---

Outline:

(00:31) Introduction

(08:17) What constrains AI scaling this decade

(08:21) Power constraints

(10:37) The current trend of AI power demand

(15:35) Power constraints for geographically localized training runs

(20:37) Power constraints for geographically distributed training

(28:36) Feasibility of geographically distributed training

(34:01) Modeling energy bottlenecks

(35:02) Chip manufacturing capacity

(36:48) Current production and projections

(42:27) Modeling GPU production and compute availability

(47:45) Data scarcity

(50:00) Multimodality

(55:29) Synthetic data

(01:00:36) Latency wall

(01:03:19) Latency wall given intranode latencies

(01:05:40) Latency wall given latencies between nodes

(01:07:50) How can these latencies be reduced?

(01:10:01) What constraint is the most limiting?

(01:12:00) Will labs attempt to scale to these new heights?

(01:15:54) Conclusion

The original text contained 107 footnotes which were omitted from this narration.

---

First published:
August 20th, 2024

Source:
https://epoch.ai/publications/can-ai-scaling-continue-through-2030

---

Narrated by TYPE III AUDIO.


“Announcing Epoch AI’s data hub” by The Epoch AI Team

Wed, 19 Jun 2024 00:00:00 GMT

Subtitle: We are launching a hub for data and visualizations, to make our databases more accessible for users and researchers. It currently features our data on notable and large-scale AI models.

Our data on the trajectory of AI has been valuable for policymakers, journalists, and other stakeholders. Our research, for example on training compute or model development costs, relies on our data. Now, we are creating a new Data on AI page as a central hub for all of this data. This includes key insights, interactive visualizations, and detailed documentation to inform users on how the data were collected.

Currently, the page hosts two datasets. Our collection of notable AI models tracks key factors driving machine learning progress in over 800 historically significant AI models. The database contains information on development, training details, and more. Among other key information, we track training compute, dataset sizes, parameter counts, and training hardware. A new interactive visualization tool allows users to explore these details, examining trends in time and breakdowns by machine learning domain.

Figure 1: Visualization of notable AI models and the overall trend of training compute over time.

The second dataset, devoted to large-scale AI models, features [...]

---

First published:
June 19th, 2024

Source:
https://epoch.ai/publications/announcing-epoch-ai-data-hub

---

Narrated by TYPE III AUDIO.


“Will we run out of data? Limits of LLM scaling based on human-generated data” by Pablo Villalobos, Anson Ho, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Marius Hobbhahn

Thu, 06 Jun 2024 00:00:00 GMT

Subtitle: We estimate the effective stock of quality- and repetition-adjusted human-generated public text for AI training at around 300 trillion tokens. If trends continue, language models will fully utilize this stock between 2026 and 2032, or even earlier if intensely overtrained.

Introduction

Scaling has been a key factor driving progress in AI. Models are growing in parameters and being trained on increasingly enormous datasets, leading to exponential growth in training compute and dramatic increases in performance. For example, five years and four orders of magnitude of compute separate the barely coherent GPT-2 from the powerful GPT-4.

So far, AI developers have not faced major limits to scaling beyond simply procuring AI chips, which are scarce but rapidly growing in supply. If chips are the only bottleneck, then AI systems are likely to continue growing exponentially in compute and expanding the frontier of capabilities. As such, a key question in forecasting AI progress is whether inputs other than raw compute could become binding constraints.

In particular, scaling requires growing training datasets. The most powerful AI systems to date are language models that are primarily trained on trillions of words of human-generated text from the internet. However, there is [...]

---

Outline:

(00:36) Introduction

(02:03) Results

(04:26) Comparison with previous estimates

(06:17) Discussion

The original text contained 6 footnotes which were omitted from this narration.

---

First published:
June 6th, 2024

Source:
https://epoch.ai/publications/will-we-run-out-of-data-limits-of-llm-scaling-based-on-human-generated-data

---

Narrated by TYPE III AUDIO.


“How much does it cost to train frontier AI models?” by Ben Cottier, Robi Rahman, Loredana Fattorini, Nestor Maslej, David Owen

Mon, 03 Jun 2024 00:00:00 GMT

Subtitle: The cost of training frontier AI models has grown by a factor of 2 to 3x per year for the past eight years, suggesting that the largest models will cost over a billion dollars by 2027.

Summary of findings

The costs of training frontier AI models have grown dramatically in recent years, but there is limited public data on the magnitude and growth of these expenses. In our new paper, we develop a detailed cost model to address this gap, estimating training costs for up to 45 frontier models using three different approaches that account for hardware and energy expenditures, cloud rental costs, and R&D staff expenses, respectively. This work builds upon the cost estimates featured in the 2024 AI Index.

Our analysis reveals that the amortized hardware and energy cost for the final training run of frontier models has grown rapidly, at a rate of 2.4x per year since 2016 (95% CI: 2.0x to 3.1x). We also estimated a cost breakdown to develop key frontier models such as GPT-4 and Gemini Ultra, including R&D staff costs and compute for experiments. We found that most of the development cost is for the hardware at 47–67%, but R&D [...]

---

Outline:

(00:27) Summary of findings

(01:59) Key Results

(05:19) Implications

(06:17) Webinar

---

First published:
June 3rd, 2024

Source:
https://epoch.ai/publications/how-much-does-it-cost-to-train-frontier-ai-models

---

Narrated by TYPE III AUDIO.


“Training compute of frontier AI models grows by 4-5x per year” by Jaime Sevilla, Edu Roldán

Tue, 28 May 2024 00:00:00 GMT

Subtitle: Our expanded AI model database shows that the compute used to train recent models grew 4 to 5 times yearly from 2010 to May 2024. We find similar growth in frontier models, recent large language models, and models from leading companies.

Introduction

Over the last ten years, we have witnessed a dramatic increase in the computational resources dedicated to training state-of-the-art AI models. This strategy has been incredibly productive, translating into large gains in generality and performance. For example, we estimate that about two-thirds of the improvements in performance in language models in the last decade have been due to increases in model scale.

Given the central role of scaling, it is important to track how the computational resources (‘compute’) used to train models have grown in recent years. In this short article, we provide an updated view of the trends so far, having collected three times more data since our last analysis.

We tentatively conclude that compute growth in recent years is currently best described as increasing by a factor of 4 to 5 times per year. We find consistent growth between recent notable models, the running top 10 of models by compute, recent large language [...]

---

Outline:

(00:31) Introduction

(03:06) The overall trend of training compute growth has held

(05:17) Frontier model growth has slowed, and now aligns with overall trends

(09:31) Language models caught up to the frontier around 2020

(11:43) Leading companies are growing their top models by 5x/year

(13:20) The scale of the largest models today can be retrodicted using a 4-5x/year growth rate

(15:49) Conclusion

The original text contained 16 footnotes which were omitted from this narration.

---

First published:
May 28th, 2024

Source:
https://epoch.ai/publications/training-compute-of-frontier-ai-models-grows-by-4-5x-per-year

---

Narrated by TYPE III AUDIO.


“Do the returns to software R&D point towards a singularity?” by Tamay Besiroglu, Ege Erdil, Anson Ho

Fri, 17 May 2024 00:00:00 GMT

Subtitle: The returns to R&D are crucial in determining the dynamics of growth and potentially the pace of AI development. Our new paper offers new empirical techniques and estimates for this crucial parameter.

Returns to R&D and hyperbolic technological progress

Improvements in AI have predominantly been driven by two factors. First, advancements in hardware performance and substantial investments in larger clusters have increased the computing power for training AI models. This has resulted in improved performance given the abundance of data that we can use to train larger AI systems. Second, progress on the “software” side (training techniques, architectures, algorithm implementations, etc.) has resulted in the compute being used more efficiently (see our work on algorithmic progress). This means that AI model performance surpasses what we’d expect from merely increasing computing resources. At Epoch, we have extensively researched each of these trends.

The combination of the scaling of compute and improvements in training techniques has effectively increased the total budget of “effective compute,” which refers to the computational resources available for AI development when accounting for improvements on the “software” side. This increase in effective compute has been a key driver in the development of capable AI systems.

[...]

---

Outline:

(00:28) Returns to R&D and hyperbolic technological progress

(04:17) Hyperbolic growth in a model of idea production

(07:01) Do the returns to software R&D point towards a software singularity?

(07:36) Computer chess

(09:26) Other domains of software

(12:05) Do our estimates point to a singularity?

(13:39) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
May 17th, 2024

Source:
https://epoch.ai/publications/do-the-returns-to-software-rnd-point-towards-a-singularity

---

Narrated by TYPE III AUDIO.


“Chinchilla scaling: A replication attempt” by Tamay Besiroglu, Ege Erdil, Matthew Barnett, Josh You

Wed, 17 Apr 2024 00:00:00 GMT

Subtitle: We replicate Hoffmann et al.'s estimation of a parametric scaling law and find issues with their estimates. Our estimates fit the data better and align with Hoffmann's other approaches.

Summary

Hoffmann et al. (2022) investigate the optimal model size and number of tokens for training a transformer language model under a given compute budget. The authors train over 400 language models and find that for compute-optimal training, the model size and number of training tokens should scale at equal rates: for every doubling of model size, the number of training tokens should also be doubled. They train a 70B model called Chinchilla, compute-optimally according to their results, on 1.4 trillion tokens for a ratio of 20 tokens per parameter. For this reason, the scaling laws they propose are often called “Chinchilla scaling laws”.
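The figures in this summary can be checked with a few lines of arithmetic. The sketch below is illustrative only: the helper `scale_equally` is a hypothetical name, and the compute estimate uses the common rule of thumb of roughly 6 FLOP per parameter per token, which is an assumption not stated in the text above.

```python
# Figures quoted in the text for Chinchilla.
params = 70_000_000_000         # 70B parameters
tokens = 1_400_000_000_000      # 1.4 trillion training tokens

tokens_per_param = tokens / params
print(tokens_per_param)         # tokens per parameter


def scale_equally(params: int, tokens: int, factor: int) -> tuple[int, int]:
    """Scale model size and training tokens by the same factor (equal-rate scaling)."""
    return params * factor, tokens * factor


# Doubling model size while doubling tokens leaves the ratio unchanged.
big_params, big_tokens = scale_equally(params, tokens, 2)
print(big_tokens / big_params)

# Assumption (not from the text): training compute ~ 6 * params * tokens FLOP.
approx_flop = 6 * params * tokens
print(f"{approx_flop:.2e} FLOP")
```

The ratio stays fixed under equal-rate scaling, which is what "scale at equal rates" implies for the tokens-per-parameter figure.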

There's a chart here. The chart title reads: Optimal ratio of training tokens to model parameters

In their paper, Hoffmann et al. use three different methods to derive the optimal scaling policy. In one of their approaches, they estimate the following parametric scaling law:

L equals E plus the fraction with numerator [...]

---

Outline:

(00:27) Summary

(03:59) Implications of this work

---

First published:
April 17th, 2024

Source:
https://epoch.ai/publications/chinchilla-scaling-a-replication-attempt

---

Narrated by TYPE III AUDIO.


“Tracking large-scale AI models” by Robi Rahman, David Owen, Josh You

Fri, 05 Apr 2024 00:00:00 GMT

Subtitle: We present a dataset of 81 large-scale models, from AlphaGo to Gemini, developed across 18 countries, at the leading edge of scale and capabilities.

Update

Explore our Large-scale AI models dataset through interactive visualizations and documentation on our dedicated data page.

We present a new dataset tracking AI models with training compute over 10 to the 23 floating point operations (FLOP). This corresponds to training costs of hundreds of thousands of dollars or more. We have identified 81 such models, and another 86 models that may exceed the 10 to the 23 FLOP threshold but don’t have confirmed training details.

Our previous work has examined the crucial role of training compute in the development of modern AI, and how it drives model capabilities. Existing AI regulation explicitly acknowledges the importance of training compute: both the recent US Executive Order on AI development and the EU AI Act establish reporting requirements based on compute thresholds. Motivated by these developments, we plan to track models with training compute above 10 to the 23 FLOP by updating this dataset on an ongoing basis. We call models above this threshold “large-scale models”.

The dataset offers insight into several recent [...]

---

Outline:

(03:05) Results

(03:08) There are few models at the leading edge, but the frontier advances rapidly

(04:38) Most large-scale AI models are language models

(08:01) Most large-scale models are developed by US companies

(09:33) Downloadable models are common, but have lower training compute

(11:15) Methods for finding large-scale models

(11:50) Benchmarks and Repositories

(14:12) Non-English news and websites

(16:05) Other sources

(17:07) Unconfirmed large-scale models

(18:25) Outcomes and limitations

(20:03) Conclusion

The original text contained 9 footnotes which were omitted from this narration.

---

First published:
April 5th, 2024

Source:
https://epoch.ai/publications/tracking-large-scale-ai-models

---

Narrated by TYPE III AUDIO.


“Optimally allocating compute between inference and training” by Ege Erdil

Fri, 29 Mar 2024 00:00:00 GMT

Subtitle: Our analysis indicates that AI labs should spend comparable resources on training and running inference, assuming they can flexibly balance compute between these tasks to maintain model performance.

Introduction

Sam Altman recently claimed that OpenAI currently generates around 100 billion tokens per day, or about 36 trillion tokens per year. Given that modern language models are trained on the order of 10 trillion tokens and tokens seen during training are around three times more expensive than tokens seen or generated during inference, a naive analysis suggests OpenAI's annual inference costs are on the same order as their annual model training costs.
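The naive comparison can be written out as a back-of-the-envelope calculation. This is a rough sketch using only the figures quoted above; the variable names are illustrative, and it assumes, for simplicity, one training run per year and the same per-token inference cost for all generated tokens.

```python
# Figures quoted in the text.
TOKENS_PER_DAY = 100_000_000_000        # ~100 billion tokens generated per day
TRAINING_TOKENS = 10_000_000_000_000    # ~10 trillion training tokens
TRAINING_COST_MULTIPLIER = 3            # a training token costs ~3x an inference token

# Annual inference volume, in tokens.
annual_inference_tokens = TOKENS_PER_DAY * 365

# Training cost expressed in "inference-token equivalents".
# Assumption: one training run per year, purely for illustration.
training_equiv_tokens = TRAINING_TOKENS * TRAINING_COST_MULTIPLIER

ratio = annual_inference_tokens / training_equiv_tokens
print(f"annual inference tokens: {annual_inference_tokens:.3e}")
print(f"training (inference-equivalent) tokens: {training_equiv_tokens:.3e}")
print(f"inference / training cost ratio: {ratio:.2f}")
```

A ratio close to 1 is what "on the same order" means here: neither quantity dominates the other.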

This seems like an odd coincidence at first: why should one of these not completely dominate the other? However, there's a good reason to suppose that these two quantities should be on a similar order of magnitude: the training-inference compute tradeoff. In this post, I will briefly explain what this tradeoff is about and why the current empirical evidence about it implies we should see rough parity in how much compute is spent on training versus inference.

The tradeoff

In general, it's possible to get a model to perform better by one of two [...]

---

Outline:

(00:25) Introduction

(01:24) The tradeoff

(04:03) Why we expect investment parity

(08:10) Conclusion

The original text contained 3 footnotes which were omitted from this narration.

---

First published:
March 29th, 2024

Source:
https://epoch.ai/publications/optimally-allocating-compute-between-inference-and-training

---

Narrated by TYPE III AUDIO.


“Algorithmic progress in language models” by Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, Zifan Carl Guo, David Atkinson, Neil Thompson, Jaime Sevilla

Tue, 12 Mar 2024 00:00:00 GMT

Subtitle: Progress in pretrained language model performance surpasses what we’d expect from merely increasing computing resources, occurring at a pace equivalent to doubling computational power every 5 to 14 months.

Overview

In 2012, the best language models were recurrent networks that struggled to form coherent sentences. Fast forward to today and language models like GPT-4 assist hundreds of millions of active users and are able to perform tasks across a wide range of domains.

Clearly, progress has been rapid—but what made this possible? One reason is that the compute used to train language models has been scaled up drastically, resulting in better performance. But that's only part of the puzzle. AI practitioners have developed better model architectures, optimizers, and other algorithmic innovations that reduce the compute required to reach certain performance levels—what we refer to as algorithmic progress.
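The subtitle expresses algorithmic progress as a compute-equivalent doubling time of 5 to 14 months. A doubling time can be converted into an annual growth factor, as in the minimal sketch below; the function name is hypothetical, and the conversion assumes steady exponential growth.

```python
def annual_factor_from_doubling(months_to_double: float) -> float:
    """Annual growth factor implied by a doubling time in months (steady exponential growth)."""
    return 2 ** (12 / months_to_double)


# Doubling times quoted in the subtitle: 5 to 14 months.
for months in (5, 14):
    factor = annual_factor_from_doubling(months)
    print(f"doubling every {months} months -> {factor:.2f}x per year")
```

Shorter doubling times imply larger annual factors, so the 5-month end of the range corresponds to the faster rate of progress.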

Figure 1. Performance of 231 language models (measured in log perplexity) used in our work against their date and scale (measured in FLOP). Models are becoming both larger and more proficient. It's unclear to what degree the better results are driven by improvements in scale or in efficiency.

In our new paper, we conduct the most comprehensive analysis of algorithmic [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 12th, 2024

Source:
https://epoch.ai/publications/algorithmic-progress-in-language-models

---

Narrated by TYPE III AUDIO.


“Epoch AI 2023 impact report” by The Epoch AI Team

Fri, 19 Jan 2024 00:00:00 GMT

Subtitle: In 2023, Epoch published almost 20 reports on developments in AI, added hundreds of new models to our database, had a direct impact on government policies, raised over $7 million in funds, and more.

About Epoch AI

Epoch AI is a multidisciplinary research institute investigating the trajectory of Artificial Intelligence (AI). We produce papers and reports on the drivers, trajectory and consequences of the development and deployment of AI.

In the past, we have produced cutting-edge AI forecasting work such as Compute Trends Across Three Eras of Machine Learning (Sevilla et al., 2022), Revisiting Algorithmic Progress (Besiroglu and Erdil, 2022) and Will We Run Out of ML Data? Evidence From Projecting Dataset Size Trends (Villalobos et al., 2022). We also maintain a database of notable ML models, widely regarded as the most comprehensive public database of its kind in existence.

Testimonials

Epoch does the most thoughtful and best-researched survey work in the industry. Several times I have thought I found errors in their results, only to discover when going through their notebooks that they had it right. They are my go-to resource for field-wide trends.

Nat Friedman - Former CEO of GitHub

I feel like I [...]

---

Outline:

(00:27) About Epoch AI

(01:17) Testimonials

(01:51) This year and future goals

(02:59) Summary of key findings

(03:02) ML trends

(04:54) Algorithmic improvements

(05:55) Economics of AI

(06:44) Other topics

(07:39) Overview of research projects

(08:04) Overview of non-research activities

(08:15) Fundraising

(08:53) Hiring

(09:10) Website

(09:27) Mentorship Programs

(09:47) Overview of impact

(09:50) Media and scientific communication

(10:17) Conferences and workshops

(10:52) Mentions in industry, research and scientific publications

(11:36) Engagement with policymakers

(12:17) Next year's goals and funding needs

(12:21) Research goals

(12:24) Model of AI and growth

(13:11) Drivers in AI

(13:30) Data on machine learning

(13:53) Organizational goals

(13:57) Stakeholder engagement

(14:19) Hiring

(14:43) Project management

(15:11) Diversifying our funding

(15:28) Website and brand

---

First published:
January 19th, 2024

Source:
https://epoch.ai/publications/epoch-impact-report-2023

---

Narrated by TYPE III AUDIO.


“Biological sequence models in the context of the AI directives” by Nicole Maug, Aidan O’Gara, Tamay Besiroglu

Wed, 17 Jan 2024 00:00:00 GMT

Subtitle: The expanded Epoch database now includes biological sequence models, revealing potential regulatory gaps in the White House's Executive Order on AI and the growth of the compute used in their training.

Executive summary

The White House's recent Executive Order on AI addresses risks from AI applied to biology. Developers training machine learning models on primarily biological sequence data using more than 1e23 operations are required by the Executive Order to report to the federal government about their cybersecurity, red-teaming, and other risk management efforts made during the development of these models.

This report provides an overview of our newly curated dataset focused on biological sequence models. Our dataset contains comprehensive information on nearly a hundred biological sequence models, including the specific datasets used for their training, their intended tasks, and the availability of the model weights. Additionally, our analysis covers 30 biological sequence datasets, collectively containing billions of protein sequences. Our focus in this analysis includes:

The training compute trends of biological sequence models

xTrimoPGLM-100B (Chen et al., 2023), a 100B-parameter protein language model, exceeds the Executive Order's reporting threshold of 1e23 operations by a factor of six. Over a dozen models are within a factor [...]

---

Outline:

(00:28) Executive summary

(04:19) The AI Executive Order

(07:22) Models trained on biological sequence data

(12:56) Training compute

(15:39) Large language models trained on biological data

(17:57) Biological sequence data

(23:16) Trends in biological sequence data

(26:05) Sources of biological sequence data

(28:41) Discussion

(31:23) Acknowledgments

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
January 17th, 2024

Source:
https://epoch.ai/publications/biological-sequence-models-in-the-context-of-the-ai-directives

---

Narrated by TYPE III AUDIO.


“Limits to the energy efficiency of CMOS microprocessors” by Anson Ho, Ege Erdil, Tamay Besiroglu

Fri, 15 Dec 2023 00:00:00 GMT

Subtitle: How far can the energy efficiency of CMOS microprocessors be pushed before we hit physical limits? Using a simple model, we find that there is room for a further 50 to 1000x improvement in energy efficiency.

Summary

Compared to the ENIAC in 1946, modern microprocessors can perform roughly a quadrillion times more computations for every unit of dissipated energy. This illustrates one of the most important trends in the history of computing – Koomey's Law – under which the energy efficiency of computers has increased drastically over the last eight decades.
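As a rough illustration of what a quadrillion-fold improvement implies, the sketch below converts it into an average doubling time. The end year of 2023 is an assumption (the text only says "modern"), and the calculation treats the improvement as a steady exponential, which is a simplification.

```python
import math

improvement_factor = 1e15   # "roughly a quadrillion times", as quoted in the text
start_year = 1946           # ENIAC
end_year = 2023             # Assumption: "modern" taken as the publication year

# Number of doublings needed to reach the total improvement.
doublings = math.log2(improvement_factor)

# Average time per doubling under steady exponential growth.
years_per_doubling = (end_year - start_year) / doublings

print(f"doublings: {doublings:.1f}")
print(f"average years per doubling: {years_per_doubling:.2f}")
```

Changing the assumed end year by a few years shifts the result only slightly, since the number of doublings is fixed by the total improvement.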

But how much longer can this progress persist before hitting physical limits? The answer to this question has important implications for forecasts of future progress in hardware: if we are close to the fundamental limits, the returns to hardware R&D within the existing paradigm of Complementary Metal-Oxide Semiconductor (CMOS) processors may rapidly diminish in the near future.

In our new paper, published in the IEEE International Conference on Rebooting Computing, we propose a simple model of energy efficiency to shed light on this question. This allows us to estimate an upper bound to the energy efficiency of CMOS microprocessors, measured in Floating Point Operations per Joule [...]

---

Outline:

(00:30) Summary

(04:47) Implications

---

First published:
December 15th, 2023

Source:
https://epoch.ai/publications/limits-to-the-energy-efficiency-of-cmos-microprocessors

---

Narrated by TYPE III AUDIO.

“AI capabilities can be significantly improved without expensive retraining” by Tom Davidson, Jean-Stanislas Denain, Pablo Villalobos, Guillem Bas

Tue, 12 Dec 2023 00:00:00 GMT

Subtitle: While scaling compute for training is key to improving LLM performance, some post-training enhancements can offer gains equivalent to training with 5 to 20x more compute at less than 1% the cost.

The massive computation used to train LLMs and similar foundation models has been one of the main drivers of AI progress in recent years, which has led to the recognition of the “Bitter Lesson”: that general methods that better leverage computational power are ultimately the most effective (Sutton, 2019). The cost of training frontier models has now become so high that only a handful of actors can afford it (Epoch AI, 2023).

Our study explores methods of improving performance after training that don’t rely on access to vast computing resources. We divide these enhancements into five categories, presented in Table 1.

You can read the full paper here. This article was a collaboration between Epoch AI, Open Philanthropy, UC Berkeley, and ORCG.

“Who is leading in AI? An analysis of industry AI research” by Ben Cottier, Tamay Besiroglu, David Owen

Mon, 27 Nov 2023 00:00:00 GMT

Subtitle: Industry emerged as a driving force in AI, but which companies are steering the field? We compare leading AI companies on research impact, training runs, and contributions to algorithmic innovations.

The private sector's pivotal role in AI research and development is marked by its substantial resource investments and significant influence, evidenced by higher citation rates for industry-involved research articles and dominance in compute-intensive training runs. Recent analyses, such as the 2023 Stanford AI Index, highlight the varying research impacts of institutions across countries, with US tech giants leading in terms of research impact. Our study is informed by three datasets—OpenAlex for AI publications, the PCD database for AI training compute data, and a new dataset on key algorithmic innovations in large language models. We offer a comprehensive comparison of leading companies in publications, citations, unique authors, training runs, and algorithmic innovation adoption. This multi-faceted approach provides a nuanced understanding of industry's influence on AI development, contributing to policy discussions on key industry players.

You can read the full paper here.

Key results

Bibliometric output and impact. Our analysis shows that Google and Microsoft lead in total publications and citations over the past 13 years, reflecting [...]

---

Outline:

(01:31) Key results

(04:47) Policy implications

---

First published:
November 27th, 2023

Source:
https://epoch.ai/publications/who-is-leading-in-ai-an-analysis-of-industry-ai-research

---

Narrated by TYPE III AUDIO.


“Challenges in predicting AI automation” by David Owen, Tamay Besiroglu

Fri, 24 Nov 2023 00:00:00 GMT

Subtitle: Economists have proposed several different approaches to predicting AI automation of economically valuable tasks. There is vast disagreement between different approaches and no clear winner.

Introduction

The automation of tasks by AI systems has the potential to generate tremendous economic value (Erdil and Besiroglu, 2023). The prospect of capturing this value can incentivize greater investments into developing AI capabilities. Accurately predicting when tasks are likely to be automated by AI could therefore help forecast the trajectory of AI investment and AI development. Understanding the impact of automation on the economy and labor force is also important for policymakers; governments may need to implement policies to help workers transition and ensure the benefits of automation are broadly shared.

This review examines the literature on predicting AI automation, focusing on the economics literature on AI-driven automation of occupational tasks. We also review the nascent literature on empirical validation of these predictions, examining whether we should put more trust in some predictions than others. We hope this review will help researchers engage with this important problem. We also hope that clarifying the challenges faced by existing predictions will surface promising directions for future work.

In [...]

---

Outline:

(00:28) Introduction

(04:47) How does automation happen?

(08:19) What do we want from automation predictions?

(11:15) An overview of automatability prediction methodologies

(13:04) Task feature analysis

(13:08) Background

(15:15) Task-focused analyses

(18:27) Response to generative AI and LLMs

(19:52) Task-patent mapping

(20:59) Automation forecasting surveys

(22:50) Overview of predictions

(27:20) Empirical evidence and comparison

(29:12) Studies of economic effects from AI automation

(31:49) Case studies on AI in specific applications

(36:02) Comparison of prediction methodologies

(39:48) Discussion

(43:39) Acknowledgments

---

First published:
November 24th, 2023

Source:
https://epoch.ai/publications/challenges-in-predicting-ai-automation

---

Narrated by TYPE III AUDIO.


“Trends in machine learning hardware” by Marius Hobbhahn, Lennart Heim, Gökçe Aydos

Thu, 09 Nov 2023 00:00:00 GMT

Subtitle: FLOP/s performance in 47 ML hardware accelerators doubled every 2.3 years. Switching from FP32 to tensor-FP16 led to a further 10x performance increase. Memory capacity and bandwidth doubled every 4 years.

Executive summary

There's a chart here.

We study trends in GPU computational performance across different number representations, in memory capacity and bandwidth, and in interconnect bandwidth, using a dataset of 47 ML accelerators (GPUs and other AI chips) commonly used in ML experiments from 2010 to 2023, plus 1,948 additional GPUs from 2006 to 2021. Our main findings are:

Lower-precision number formats like 16-bit floating point (FP16) and 8-bit integers (INT8), combined with specialized tensor core units, can provide order-of-magnitude performance improvements for machine learning workloads compared to the traditionally used 32-bit floating point (FP32). For example, we estimate, based on limited data, that tensor-FP16 can provide a roughly 10x speedup over FP32.
Given that the overall performance of large hardware clusters for state-of-the-art ML model training and inference depends on factors beyond just computational performance, we investigate memory capacity, memory bandwidth and interconnects, and find that:
1. Memory capacity is doubling every ~4 years and memory bandwidth every ~4.1 years. They have increased at [...]
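
The doubling times quoted in this summary can be converted into annual growth factors with a little arithmetic. The following is a minimal sketch and is not taken from the report: the function names are illustrative, and the inputs are the rounded figures given above (2.3 years for FLOP/s, roughly 4 years for memory capacity).

```python
import math


def annual_growth_factor(doubling_time_years: float) -> float:
    """Multiplicative growth per year implied by a doubling time."""
    return 2.0 ** (1.0 / doubling_time_years)


def years_to_multiply(factor: float, doubling_time_years: float) -> float:
    """Years needed to grow by `factor` at a constant doubling time."""
    return doubling_time_years * math.log2(factor)


# Rounded figures from the summary above.
flops_growth = annual_growth_factor(2.3)    # FLOP/s performance
memory_growth = annual_growth_factor(4.0)   # memory capacity
years_for_10x_flops = years_to_multiply(10.0, 2.3)

print(f"FLOP/s grows about {flops_growth:.2f}x per year")
print(f"Memory capacity grows about {memory_growth:.2f}x per year")
print(f"A 10x FLOP/s gain takes about {years_for_10x_flops:.1f} years")
```

Under these assumptions, a 10x gain from the hardware trend alone takes several years, which gives a sense of scale for the one-off 10x gain the summary attributes to moving from FP32 to tensor-FP16.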

---

Outline:

(00:31) Executive summary

(03:30) Introduction

(05:25) Terminology

(06:26) Dataset

(07:17) Trends of primary performance metrics

(07:39) Number representations

(09:10) Computational performance for FP32 and FP16

(10:20) Computational performance gains through hardware support for less precise number formats

(12:54) Memory capacity and bandwidth

(17:11) Interconnect bandwidth

(21:24) Computational price-performance

(26:08) Energy efficiency

(27:51) Conclusions

The original text contained 26 footnotes which were omitted from this narration.

---

First published:
November 9th, 2023

Source:
https://epoch.ai/publications/trends-in-machine-learning-hardware

---

Narrated by TYPE III AUDIO.

“Announcing Epoch AI’s updated parameter, compute and data trends database” by The Epoch AI Team

Mon, 23 Oct 2023 00:00:00 GMT

Subtitle: Our expanded database, which tracks the parameters, datasets, training compute, and other details of notable machine learning systems, now spans over 700 notable machine learning models.

Machine learning is advancing at breakneck speed, but what's driving its progress? It is widely recognized that the performance of machine learning models is closely related to the amount of training data, compute, and number of parameters in the model. At Epoch AI, we’re investigating the key inputs that enable today's AIs to reach new heights.

Our recently expanded Parameter, Compute and Data Trends database traces these details for hundreds of landmark ML systems and research papers. Our database allows everyone to understand:

How models have swelled from mere dozens of parameters in early networks to over half a trillion in systems like Minerva today.
How training compute has increased by nearly 8 orders of magnitude from 2012's AlexNet to 2023's GPT-4.
How datasets have grown by billions of times, from 200 thousand words for early language models to the 1.9 trillion words used to train Flan-PaLM.

There's a chart here.

Building on a model and parameter dataset we first introduced in 2021 [...]

---

First published:
October 23rd, 2023

Source:
https://epoch.ai/publications/announcing-updated-pcd-database

---

Narrated by TYPE III AUDIO.

“Explosive growth from AI: A review of the arguments” by Ege Erdil, Tamay Besiroglu

Sat, 23 Sep 2023 00:00:00 GMT

Subtitle: Our new article examines why we might (or might not) expect growth on the order of ten-fold the growth rates common in today's frontier economies once advanced AI systems are widely deployed.

The extent to which AI can automate economically valuable tasks is perhaps the most important measure of the capabilities of AI systems. As we have previously investigated, the rapid automation of such tasks has the potential to accelerate economic growth and technological development. We think the potential for explosive growth serves as a critical factor underlining AI's transformative impact on society. Yet questions remain about why extreme accelerations might or might not occur and, if they do, how long they could last.

In our new article, available as a preprint on arXiv, we take stock of the key arguments for why we might or might not expect growth that is on the order of ten-fold the growth rates common in today's frontier economies once advanced AI systems are widely deployed. We spell out these arguments in detail and tentatively assess their force. We aim to elucidate why certain mechanisms—such as regulation, the R&D challenges with developing capable AI, or constraints on other inputs—could [...]

---

Outline:

(03:22) Why we might see explosive growth from AI

(07:36) Why we might not see explosive growth from AI

(16:08) Conclusion

---

First published:
September 23rd, 2023

Source:
https://epoch.ai/publications/explosive-growth-from-ai-a-review-of-the-arguments

---

Narrated by TYPE III AUDIO.

“Trading off compute in training and inference” by Pablo Villalobos, David Atkinson

Fri, 28 Jul 2023 00:00:00 GMT

Subtitle: We explore several techniques that induce a tradeoff between spending more resources on training or on inference and characterize the properties of this tradeoff. We outline some implications for AI governance.

Key takeaways

In current machine learning systems, the performance of a system is closely related to how much compute is spent during the training process. However, it is also possible to augment the capabilities of a trained model at the cost of increasing compute usage during inference or reduce compute usage during inference at the cost of lower performance. For example, models can be pruned to reduce their inference cost, or instructed to reason via chains of thought, which increases their inference cost.

Based on evidence from five concrete techniques (model scaling, Monte Carlo Tree Search, pruning, resampling, and chain of thought), we expect that, relative to most current models (e.g., GPT-4), it is possible to:

Increase the amount of compute per inference by 1-2 orders of magnitude (OOM), in exchange for saving ~1 OOM in training compute while maintaining performance. We expect this to be the case in most language tasks that don’t require specific factual knowledge or very concrete skills (e.g., knowing how [...]

---

Outline:

(00:26) Key takeaways

(03:46) Overview

(03:49) Introduction

(05:51) The tradeoff

(05:54) Individual techniques

(08:49) Combining techniques

(10:01) Implications

(11:51) Conclusions

(13:25) Full report

(13:27) Background

(14:26) Contributions

(15:56) Techniques

(15:59) Varying the scaling policy

(18:09) Monte Carlo Tree Search

(21:47) Pruning

(24:22) Repeated sampling and filtering

(26:05) Unlimited Trials

(28:36) Limited trials

(29:24) Chain of thought and model cascades

(30:53) Combining tradeoffs

(32:01) Modeling the tradeoff

(35:21) Efficiency and optimal scaling

(35:54) Conclusion

(37:15) Acknowledgements

(37:32) Bibliography

The original text contained 19 footnotes which were omitted from this narration.

---

First published:
July 28th, 2023

Source:
https://epoch.ai/publications/trading-off-compute-in-training-and-inference

---

Narrated by TYPE III AUDIO.

“The limited benefit of recycling foundation models” by Matthew Barnett

Fri, 07 Jul 2023 00:00:00 GMT

Subtitle: While reusing pretrained models often saves training costs on large training runs, it is unlikely that model recycling will result in more than a modest increase in AI capabilities.

Introduction

Reusing pre-trained models often saves training costs. In the current large-scale paradigm, foundation models are routinely re-used by fine-tuning a general foundation model on specific tasks (Bommasani et al. 2022). But it is also possible to save costs by re-using an older foundation model to train a newer foundation model. Let's call this practice “foundation model recycling”, which can be distinguished from the more general phenomenon of model re-use (Jiang et al. 2023). This short report investigates the benefits and implications of recycling foundation models.

There appear to be two general ways of recycling foundation models. The first method is by employing a student-teacher setup with the older model as a teacher for at least some time during training. In deep reinforcement learning, this method is generally known as kickstarting, and helps cut down on training costs by providing a dense reward signal during the early stages of training (Schmitt et al. 2018). The second method is to augment the older model with a modified architecture or [...]

---

Outline:

(00:24) Introduction

(04:25) Modeling model recycling

(06:32) Kickstarting

(10:48) Model augmentation

(13:56) Discussion

(16:06) Acknowledgements

---

First published:
July 7th, 2023

Source:
https://epoch.ai/publications/the-limited-benefit-of-recycling-foundation-models

---

Narrated by TYPE III AUDIO.

“How predictable is language model benchmark performance?” by David Owen

Fri, 09 Jun 2023 00:00:00 GMT

Subtitle: We investigate large language model performance across five orders of magnitude of compute scaling, finding that compute-focused extrapolations are a promising way to forecast AI capabilities.

Executive summary

We investigate large language model performance across five orders of magnitude of compute scaling in 11 recent model architectures at 36 different model sizes.
We present data on performance in BIG-Bench and MMLU, covering a range of model sizes and architectures.
We examine trends in performance, showing a fairly smooth relationship between overall performance and scale, consistent with an S-curve.
We outline an approach for predicting benchmark performance based on compute scaling.
We back-test predictability of aggregate benchmark performance using this approach, showing that performance is moderately predictable from compute scaling.
We show that individual benchmark tasks are less predictable, but remain more predictable than chance or a simple per-task average baseline.
We conclude that compute-based extrapolations are a promising way to forecast AI capabilities.
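
To make the S-curve idea concrete, here is a minimal sketch of a benchmark score modeled as a sigmoid in the logarithm of training compute. It is an illustration only, not the fitting procedure used in the report: the function name and all four parameter values (floor, ceiling, midpoint, steepness) are hypothetical and were chosen for readability.

```python
import math


def s_curve(log10_compute: float, floor: float, ceiling: float,
            midpoint: float, steepness: float) -> float:
    """Benchmark score as a sigmoid in log10(training compute)."""
    z = steepness * (log10_compute - midpoint)
    return floor + (ceiling - floor) / (1.0 + math.exp(-z))


# Hypothetical parameters: chance-level floor of 0.25, ceiling of 0.90,
# midpoint at 1e23 FLOP, and an arbitrary steepness.
FLOOR, CEILING, MIDPOINT, STEEPNESS = 0.25, 0.90, 23.0, 1.2

# Evaluate across five orders of magnitude of compute (1e19 to 1e24 FLOP).
scores = [s_curve(c, FLOOR, CEILING, MIDPOINT, STEEPNESS) for c in range(19, 25)]

for log_c, score in zip(range(19, 25), scores):
    print(f"1e{log_c} FLOP -> predicted score {score:.3f}")
```

In a forecasting setting, the parameters would be fitted on smaller models and the curve extrapolated to larger compute budgets; the sketch shows only the functional form.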

Background

Scaling laws allow prediction of a model's loss from model and dataset sizes. However, scaling does not directly predict a model's performance on downstream tasks - as assessed through benchmarks. To bridge this gap [...]

---

Outline:

(00:24) Executive summary

(01:27) Background

(02:59) Results

---

First published:
June 9th, 2023

Source:
https://epoch.ai/publications/how-predictable-is-language-model-benchmark-performance

---

Narrated by TYPE III AUDIO.

“Epoch AI and FRI mentorship program summer 2023” by The Epoch AI Team

Thu, 08 Jun 2023 00:00:00 GMT

Subtitle: We are launching the Epoch and FRI mentorship program for women, non-binary people, and transgender people of all genders to provide guidance to individuals who want to contribute to AI forecasting.

We are thrilled to announce the Epoch AI and Forecasting Research Institute (FRI) mentorship program for women, non-binary people, and trans people of all genders. This program aims to provide guidance to individuals who want to contribute to the field of AI forecasting.

The program mentors are Tegan McCaslin (FRI), Molly G Hickman (FRI), Avital Morris (FRI), David Owen (Epoch AI), Ben Cottier (Epoch AI) and Pablo Villalobos (Epoch AI).

Ajeya Cotra (Open Philanthropy) is a research advisor to the project. The program is coordinated by Jaime Sevilla (Epoch AI), with the support of Maria de la Lama (Epoch AI). Kathryn Mecrow-Flynn (Magnify Mentoring) is a program advisor.

Throughout the program, the mentees will work in pairs under the guidance of a mentor to produce original research on Artificial Intelligence Forecasting. Each mentor will offer to guide a selection of projects within their expertise. Examples of projects on offer might include studying trends on the context size of Large Language Models, estimating the [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
June 8th, 2023

Source:
https://epoch.ai/publications/epoch-and-fri-mentorship-program-summer-2023

---

Narrated by TYPE III AUDIO.

“Direct Approach interactive model” by David Atkinson, Matthew Barnett, Edu Roldán, Ben Cottier, Tamay Besiroglu

Wed, 31 May 2023 00:00:00 GMT

Subtitle: We combine the Direct Approach framework with simple models of progress in algorithms, investment, and compute costs to produce a user-adjustable forecast of when TAI will be achieved.

Summary: This post presents an interactive model for forecasting transformative AI, by which we mean AI that, if deployed widely, would precipitate a change comparable to the industrial revolution. In addition to showcasing the results of the Direct Approach, we present a simple extrapolative model of key inputs (algorithmic progress, investment, hardware efficiency) that produces a user-adjustable forecast of the date transformative AI will be deployed. This model contains four parts:

Compute requirements estimated using the Direct Approach framework
Projected algorithmic progress
Projected investment in training transformative AI models
Projected compute availability and cost

These components are combined to estimate the probability that the compute needs for transformative AI will be met in a given future year. Under default parameter values calibrated on historical estimates, the simple extrapolative model assigns a high chance of the development of transformative AI by 2050.
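
One way to picture how four such components might combine is a small Monte Carlo simulation. The sketch below is a simplified illustration under stated assumptions, not the interactive model itself: every distribution, starting value, and name in it is a hypothetical placeholder, and the three growth trends are simply summed as constant rates in orders of magnitude (OOM) per year.

```python
import random


def prob_requirement_met(years, n_samples=20_000, seed=0):
    """Estimate P(available effective compute >= requirement) for each year.

    All numbers below are hypothetical placeholders, not calibrated values.
    """
    rng = random.Random(seed)
    base_year = 2023
    start_oom = 25.0  # assumed log10 FLOP of the largest training run today

    samples = []
    for _ in range(n_samples):
        required_oom = rng.gauss(35.0, 4.0)   # compute requirement (log10 FLOP)
        algorithmic = rng.uniform(0.2, 0.6)   # OOM/year from algorithmic progress
        investment = rng.uniform(0.1, 0.3)    # OOM/year from growing spending
        hardware = rng.uniform(0.1, 0.2)      # OOM/year from cheaper compute
        samples.append((required_oom, algorithmic + investment + hardware))

    result = {}
    for year in years:
        elapsed = year - base_year
        met = sum(1 for req, rate in samples if start_oom + rate * elapsed >= req)
        result[year] = met / n_samples
    return result


probabilities = prob_requirement_met([2030, 2040, 2050])
for year, p in probabilities.items():
    print(f"{year}: estimated probability {p:.2f}")
```

Because the same sampled scenarios are reused for every year and all growth rates are positive, the estimated probability can only rise over time. A more faithful model would let growth rates slow, for example as investment approaches a ceiling.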

We take this to mean that current trends of algorithmic progress and compute scaling, if continued, will likely lead to [...]

---

Outline:

(03:33) Compute requirements

(07:22) Algorithmic Progress

(09:08) Investment

(11:23) Compute

(12:40) Conclusion

The original text contained 7 footnotes which were omitted from this narration.

---

First published:
May 31st, 2023

Source:
https://epoch.ai/publications/direct-approach-interactive-model

---

Narrated by TYPE III AUDIO.

“A compute-based framework for thinking about the future of AI” by Matthew Barnett

Wed, 31 May 2023 00:00:00 GMT

Subtitle: AI's potential to automate labor is likely to alter the course of human history within decades, with the availability of compute being the most important factor driving rapid progress in AI capabilities.

How should we expect AI to unfold over the coming decades? In this article, I explain and defend a compute-based framework for thinking about AI automation. This framework makes the following claims, which I defend throughout the article:

The most salient impact of AI will be its ability to automate labor, which is likely to trigger a productivity explosion later this century, greatly altering the course of history.
The availability of useful compute is the most important factor that determines progress in AI, a trend which will likely continue into the foreseeable future.
AI performance is likely to become relatively predictable on most important, general measures of performance, at least when predicting over short time horizons.

While none of these ideas are new, my goal is to provide a single article that articulates and defends the framework as a cohesive whole. In doing so, I present the perspective that Epoch AI researchers find most illuminating about the future of [...]

---

Outline:

(01:37) Summary

(02:37) Part 1: Widespread automation from AI

(07:24) Simple models of explosive growth

(11:44) Part 2: A compute-centered theory of AI automation

(18:11) What about algorithmic progress?

(21:24) Part 3: Predictability of AI performance

(23:08) Why predicting AI performance may be tractable

(25:17) Predicting performance via a theoretical model

(30:42) Part 4: Modeling AI timelines

(33:14) Against very short timelines

(36:50) My personal AI Timelines

(39:02) Conclusion

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
May 31st, 2023

Source:
https://epoch.ai/publications/a-compute-based-framework-for-thinking-about-the-future-of-ai

---

Narrated by TYPE III AUDIO.

“Please report your compute” by Jaime Sevilla, Anson Ho, Tamay Besiroglu

Wed, 26 Apr 2023 00:00:00 GMT

Subtitle: Compute is essential for AI performance, but researchers often fail to report it. Adopting reporting norms would support research, enhance forecasts of AI's impacts and developments, and assist policymakers.

We’ve recently published an opinion piece in the Communications of the ACM, where we ask machine learning researchers and engineers to consistently report their compute usage.

If this norm were widely adopted, it would allow us to better discern how much of the improvement in AI has been driven by scale rather than by novel algorithms, help forecast the emergence of novel AI capabilities, and provide a tangible lever around which external and internal regulation could be developed.

To facilitate the estimation and reporting of compute usage, we have prepared an interactive calculator that you can find in our report Estimating Training Compute of Deep Learning Models.

You can read the opinion piece here.

---

First published:
April 26th, 2023

Source:
https://epoch.ai/publications/please-report-your-compute

---

Narrated by TYPE III AUDIO.

“The Direct Approach” by Matthew Barnett, Tamay Besiroglu

Tue, 25 Apr 2023 00:00:00 GMT

Subtitle: Empirical scaling laws can help predict the cross-entropy loss associated with training inputs, such as compute and data. However, in order to predict when AI will achieve some subjective level of performance, it is necessary to devise a way of interpreting the cross-entropy loss of a model. This blog post provides a discussion of one such theoretical method, which we call the Direct Approach.

Overview

Empirical scaling laws can help predict the cross-entropy loss associated with training inputs, such as compute and data. However, in order to predict when AI will achieve some subjective level of performance, we need to interpret the cross-entropy loss of a model. This blog post discusses one such theoretical method, which we call the Direct Approach. The key to understanding the Direct Approach is that scaling laws can be used to forecast KL divergence of the true distribution from the model, which can in turn tell us how distinguishable the model is from the true distribution. Arguably, indistinguishability over sufficiently long sequences implies competence on the tasks implicit in the data distribution; if true, we can use scaling laws to upper bound the training compute necessary to achieve a particular level of [...]

---

Outline:

(00:36) Overview

(06:44) Questions and answers

(15:59) Reviews

The original text contained 1 footnote which was omitted from this narration.

---

First published:
April 25th, 2023

Source:
https://epoch.ai/publications/the-direct-approach

---

Narrated by TYPE III AUDIO.

“Power laws in speedrunning and machine learning” by Ege Erdil, Jaime Sevilla

Fri, 21 Apr 2023 00:00:00 GMT

Subtitle: We develop a model for predicting record improvements in video game speedrunning and apply it to predicting machine learning benchmarks. This model suggests that machine learning benchmarks are not close to saturation, and that large sudden improvements are infrequent, but not ruled out.

Overview

Anticipating the future of AI necessarily requires anticipating results in performance. Ideally, we would like to understand the dynamics of improvements. This would help us preempt capabilities and understand how plausible sudden improvements are.

However, this problem is notoriously difficult. Among other reasons, machine learning benchmarks use many different metrics for measuring performance. And the history of improvements in all domains is limited, spanning around a dozen improvements in the longest-running benchmarks.

To circumvent these problems, we follow Sevilla (2021) and study video game speedrunning. Using data from speedrun.com, we investigate a previously noted regularity in world record progressions - an astounding fit to a power law pattern.
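
For readers unfamiliar with power-law fits, the sketch below shows the basic idea using ordinary least squares in log-log space. It is not the random effects model developed in the article, and the data are synthetic. The choice of the record's position in the progression as the independent variable is also an assumption, since this summary does not say which variable the power law relates to record times.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b), done as a line in log-log space."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((x - mean_x) ** 2 for x in log_x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic progression: the n-th record time follows 600 * n**(-0.3) seconds.
record_index = list(range(1, 21))
record_time = [600.0 * n ** -0.3 for n in record_index]

a, b = fit_power_law(record_index, record_time)
print(f"fitted scale a = {a:.1f}, fitted exponent b = {b:.3f}")
```

On noise-free synthetic data the fit recovers the generating parameters; real record progressions are noisy, which is one reason a more careful statistical model is needed.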

There's a chart here.

Exploiting this regularity, we develop a random effects model for predicting the size of successive record improvements. We show that this model is significantly better than a baseline of predicting no improvement, and has a performance comparable to a model fit [...]

---

First published:
April 21st, 2023

Source:
https://epoch.ai/publications/power-laws-in-speedrunning-and-machine-learning

---

Narrated by TYPE III AUDIO.

“Announcing Epoch AI’s dashboard of key trends and figures in machine learning” by The Epoch AI Team

Wed, 12 Apr 2023 00:00:00 GMT

Subtitle: We are launching a dashboard that provides key data from our research on machine learning, aiming to serve as a valuable resource for understanding the present and future of the field.

Developments in machine learning have been happening extraordinarily fast, and as their impacts become increasingly visible, it becomes ever more important to develop a quantitative understanding of these changes. However, relevant data has thus far been scattered across multiple papers, has required expertise to gather accurately, or has been otherwise hard to obtain.

Given this, Epoch AI is thrilled to announce the launch of our new dashboard, which covers key numbers and figures from our research to help understand the present and future of machine learning. This includes:

Training compute requirements
Model size, measured by the number of trainable parameters
The availability and use of data for training
Trends in hardware efficiency
Algorithmic improvements for achieving better performance with fewer resources
The growth of investment in training runs over time

Our dashboard gathers all of this information in a single, accessible place. The numbers and figures are accompanied by further information such as confidence intervals, labels [...]

---

First published:
April 12th, 2023

Source:
https://epoch.ai/publications/announcing-trends-dashboard

---

Narrated by TYPE III AUDIO.

“Epoch AI 2022 impact report” by The Epoch AI Team

Wed, 01 Feb 2023 00:00:00 GMT

Subtitle: Our impact report for 2022.

Epoch AI is a research group forecasting the development of transformative Artificial Intelligence. We try to understand how progress in AI happens and what economic impacts we might see from advanced AI.

We want to enable better governance during this economic transition by gathering information about the timing of new developments, studying which levers can be used to influence AI progress and making current and past trends in ML more understandable.

Founded in April of 2022, Epoch AI currently has a staff of 13 people, corresponding to 9 FTEs. We have received 1.96 million dollars in funding through a grant from Open Philanthropy. We are fiscally sponsored and operationally supported by Rethink Priorities, whose Special Projects team has been a core part of our success as an organisation.

Epoch AI is fundraising a total of 6.07 million dollars over 2 years, or approximately 2.64 million dollars for October 2023 to September 2024, and 3.42 million dollars for October 2024 to September 2025. A detailed budget can be found in the full report.

With this funding, we expect to continue and expand our research capacity in understanding the future [...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
February 1st, 2023

Source:
https://epoch.ai/publications/epoch-impact-report-2022

---

Narrated by TYPE III AUDIO.

“Trends in the dollar training cost of machine learning systems” by Ben Cottier

Tue, 31 Jan 2023 00:00:00 GMT

Subtitle: I combine training compute and GPU price-performance data to estimate the cost of compute in US dollars for the final training run of 124 machine learning systems published between 2009 and 2022, and find that the cost has grown by approximately 0.5 orders of magnitude per year.

Important caveats about the results in this report

The cost estimates have large uncertainty bounds—the true costs could be several times larger or smaller. The cost estimates are themselves built on top of estimates (e.g. training compute estimates, GPU price-performance estimates, etc.). See the Methods section and Appendix J for discussion of the uncertainties in the respective estimates.
Although the estimated growth rates in cost are more robust than any individual cost estimate, these growth rates should also be interpreted with caution—especially when extrapolated into the future.
The cost estimates only cover the compute for the final training runs of ML systems—nothing more.
The cost estimates are for notable publicly known ML systems according to the criteria discussed in Sevilla et al. (2022, p.16). The improvements in performance over time are irregular—this means that a 2x increase in compute budget did not always lead to the [...]
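
The headline growth rate of roughly 0.5 orders of magnitude (OOMs) per year can be restated as a doubling time or an annual multiplier. The sketch below does that conversion and, as a purely hypothetical exercise, computes how long a constant trend would take to go from one cost level to another. The function names and the dollar figures in the example are illustrative assumptions, not estimates from the report, and the caveats above about extrapolation apply in full.

```python
import math


def doubling_time_years(oom_per_year: float) -> float:
    """Years for cost to double at a given growth rate in OOMs per year."""
    return math.log10(2.0) / oom_per_year


def annual_multiplier(oom_per_year: float) -> float:
    """Factor by which cost grows each year."""
    return 10.0 ** oom_per_year


def years_until(start_cost: float, limit: float, oom_per_year: float) -> float:
    """Years for cost to grow from start_cost to limit at a constant rate."""
    return math.log10(limit / start_cost) / oom_per_year


rate = 0.5  # OOMs per year, the approximate figure quoted in the subtitle
doubling = doubling_time_years(rate)
multiplier = annual_multiplier(rate)

# Hypothetical example: from a $1 million run to a $1 billion spending limit.
years_to_limit = years_until(1e6, 1e9, rate)

print(f"Doubling time: {doubling * 12:.1f} months")
print(f"Annual multiplier: {multiplier:.2f}x")
print(f"Years from $1M to $1B at a constant rate: {years_to_limit:.1f}")
```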

---

Outline:

(00:33) Important caveats about the results in this report

(02:05) Summary

(06:09) Why study dollar training costs?

(08:04) Method

(08:07) Background on methods to estimate the dollar cost of training compute

(11:11) Estimating training cost from training compute and GPU price-performance

(15:11) Method 1: Using the overall GPU price-performance trend

(15:54) Method 2: Using the price-performance of actual hardware used to train ML systems

(17:21) Dataset

(17:33) Code

(17:40) Large-scale systems

(18:42) Results

(18:45) Method 1: Using the overall GPU price-performance trend for all ML systems (n=124)

(18:54) Growth rate of training cost for all ML systems: 0.51 OOMs/year

(21:21) Growth rate of training cost for large-scale ML systems: 0.2 OOMs/year

(24:05) Method 2: Using the price-performance of NVIDIA GPUs used to train ML systems (n=48)

(24:14) Growth rate of training cost for all ML systems: 0.44 OOMs/year

(25:55) Growth rate of training cost for large-scale ML systems: 0.2 OOMs/year

(27:15) Summary and comparison of all regression results

(28:56) Predictions of when a spending limit will be reached

(34:11) Recommended future work

(34:14) Include systems trained with Google TPUs for Method 2

(34:46) Estimate more reliable bounds on cost using cloud compute prices and profit margins

(37:28) Investigate investment, allocation of spending, and revenue

The original text contained 95 footnotes which were omitted from this narration.

---

First published:
January 31st, 2023

Source:
https://epoch.ai/publications/trends-in-the-dollar-training-cost-of-machine-learning-systems

---

Narrated by TYPE III AUDIO.

“Scaling laws literature review” by Pablo Villalobos

Thu, 26 Jan 2023 00:00:00 GMT

Subtitle: I have collected a database of scaling laws for different tasks and architectures, and reviewed dozens of papers in the scaling law literature.

Common shape of a scaling law, taken from Hestness et al. (2017)

Executive summary

Scaling laws are predictable relations between the scale of a model and performance or other useful properties.
I have collected a database of scaling laws for different tasks and architectures, and reviewed dozens of papers in the scaling law literature.
My main takeaways are:
- Functional forms: a basic power law can effectively model the scaling behavior in the power-law region but not the transitions to the other two regions. For this, either the M4 estimator or the BNSL estimator introduced below seems to be the best option right now.
- Transfer learning: there is not a simple universal scaling law for transfer learning between arbitrary tasks. When the tasks are similar enough, upstream loss and downstream performance are closely related, but when tasks are very different, the details of the architecture and hyperparameters become very relevant.

See the full table of scaling laws here.

Introduction

The term “scaling laws” in deep [...]

---

Outline:

(00:29) Executive summary

(01:31) Introduction

(03:03) Overview

(06:13) Takeaways

(06:15) Upstream loss

(07:27) Transfer and downstream accuracy

(08:00) Theoretical analyses

---

First published:
January 26th, 2023

Source:
https://epoch.ai/publications/scaling-laws-literature-review

---

Narrated by TYPE III AUDIO.

“An interactive model of AI takeoff speeds” by Jaime Sevilla, Edu Roldán

Tue, 24 Jan 2023 00:00:00 GMT

Subtitle: We have developed an interactive website showcasing a new model of AI takeoff speeds.

Tom Davidson from Open Philanthropy has released What a compute-centric framework says about AI takeoff speeds, a report investigating how fast AI capabilities might transform the economy.

Epoch AI has supported this project by coding the model and running the simulation experiments required for the investigation. As a supplement to the report, we have developed an interactive website presenting the model and some of the report's results.

This website includes several sections and features, which we briefly describe below.

Playground

In the playground, you will find an interface to enter the parameters of the Full Takeoff Model, and see how these affect the results. It includes graphs of the most important variables of the model, as well as tables summarising the results.

Reports

In this section we show four reports:

The Monte Carlo analysis shows a summary of 10 thousand samples of the model's parameter values.
The aggressive Monte Carlo analysis is the same, but using a more aggressive distribution for the amount of 2022 FLOP required to automate all productive tasks.
The parameter importance analysis [...]

---

Outline:

(00:54) Playground

(01:25) Reports

(02:51) Description

---

First published:
January 24th, 2023

Source:
https://epoch.ai/publications/interactive-model-of-takeoff-speeds

---

Narrated by TYPE III AUDIO.

“Literature review of transformative artificial intelligence timelines” by Keith Wynroe, David Atkinson, Jaime Sevilla

Tue, 17 Jan 2023 00:00:00 GMT

Subtitle: We summarize and compare several models and forecasts predicting when transformative AI will be developed.

Previous work: Grokking “Forecasting TAI with biological anchors”, Grokking “Semi-informative priors over AI timelines”

Highlights

The review includes quantitative models, including both outside and inside view, and judgment-based forecasts by (teams of) experts.
While we do not necessarily endorse their conclusions, the inside-view model the Epoch AI team found most compelling is Ajeya Cotra's “Forecasting TAI with biological anchors”, the best-rated outside-view model was Tom Davidson's “Semi-informative priors over AI timelines”, and the best-rated judgment-based forecast was Samotsvety's AGI Timelines Forecast.
The inside-view models we reviewed predicted shorter timelines (e.g. bioanchors has a median of 2052) while the outside-view models predicted longer timelines (e.g. semi-informative priors has a median over 2100). The judgment-based forecasts are skewed towards agreement with the inside-view models, and are often more aggressive (e.g. Samotsvety assigned a median of 2043).

Introduction

Over the last few years, we have seen many attempts to quantitatively forecast the arrival of transformative and/or general Artificial Intelligence (TAI/AGI) using very different methodologies and assumptions. Keeping track of and assessing these models’ relative strengths can be daunting for [...]

---

Outline:

(00:32) Highlights

(01:32) Introduction

(04:33) Results

(04:36) Model based forecasts

(05:02) Ajeya Cotra's bio anchors

(05:11) Semi-informative priors

(05:20) Insights-based model

(05:29) Whole Brain Emulation

(05:38) Phase transitions and AGI

(05:48) Weighted Linear Average of Probabilities

(05:54) Judgment based forecasts

(06:20) AI Impacts survey (2022)

(06:31) Metaculus

(06:39) Samotsvety report

(06:48) Ajeya Cotra

(06:57) Holden Karnofsky

(07:05) Weighted Geometric Average of Odds

(07:52) Model-based forecasts

(08:18) Forecasting TAI with biological anchors (inside view)

(10:14) Semi-informative priors over AI timelines (outside view)

(12:36) Insight-based AI timelines model (outside view)

(13:57) Whole Brain Emulation (inside view)

(16:13) Phase Transitions and AGI (outside view)

(17:25) Judgment-based forecasts

(17:57) AI Impacts Survey (2022)

(18:41) Metaculus: Date of Artificial General Intelligence as of 2022-10-11

(19:19) Samotsvety's AGI Timelines Forecasts

(20:41) Two-year update on my personal AI timelines

(21:28) Forecasting transformative AI: what's the burden of proof?

(22:39) Conclusion

The original text contained 1 footnote which was omitted from this narration.

---

First published:
January 17th, 2023

Source:
https://epoch.ai/publications/literature-review-of-transformative-artificial-intelligence-timelines

---

Narrated by TYPE III AUDIO.

“Revisiting algorithmic progress” by Ege Erdil, Tamay Besiroglu

Mon, 12 Dec 2022 00:00:00 GMT

Subtitle: We use a dataset of over a hundred computer vision models from the last decade to investigate how better algorithms and architectures have enabled researchers to use compute and data more efficiently. We find that every 9 months, the introduction of better algorithms contributes the equivalent of a doubling of compute budgets.

Overview

How much progress in ML depends on algorithmic progress, scaling compute, or scaling relevant datasets is relatively poorly understood. In our paper, we make progress on this question by investigating algorithmic progress in image classification on ImageNet, perhaps the most well-known test bed for computer vision.

Using a dataset of a hundred computer vision models, we estimate a model, informed by neural scaling laws, that enables us to analyse the rate and nature of algorithmic advances. We use Shapley values to decompose the various drivers of progress in computer vision and estimate the relative importance of algorithms, compute, and data.

Our main results include:

Every nine months, the introduction of better algorithms contributes the equivalent of a doubling of compute budgets. This is much faster than the gains from Moore's law; that said, there's uncertainty (our 95% CI spans 4 to 25 months).
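
To make the doubling-time claim concrete, the following minimal sketch converts a doubling time into a compute-equivalent multiplier over a given period. The function name and default argument are illustrative assumptions, not taken from the paper; only the 9-month central estimate and the 4-to-25-month interval come from the text above.

```python
def compute_equivalent_multiplier(months_elapsed, doubling_time_months=9.0):
    """Factor by which algorithmic progress multiplies effective compute.

    Assumes smooth exponential progress: one doubling of compute-equivalent
    gains every `doubling_time_months` (hypothetical helper for illustration).
    """
    if doubling_time_months <= 0:
        raise ValueError("doubling_time_months must be positive")
    return 2.0 ** (months_elapsed / doubling_time_months)


# Gain over one year at the central estimate and at both ends of the 95% CI.
for doubling_time in (4, 9, 25):
    factor = compute_equivalent_multiplier(12, doubling_time)
    print(f"doubling every {doubling_time} months: x{factor:.2f} per year")
```

Under these assumptions, the width of the confidence interval translates into a large spread in the implied yearly gain, which is why the uncertainty matters for forecasting.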

[...]

The original text contained 1 footnote which was omitted from this narration.

---

First published:
December 12th, 2022

Source:
https://epoch.ai/publications/revisiting-algorithmic-progress

---

Narrated by TYPE III AUDIO.

“Predicting GPU performance” by Marius Hobbhahn, Tamay Besiroglu

Thu, 01 Dec 2022 00:00:00 GMT

Subtitle: We develop a simple model that predicts progress in the performance of field-effect transistor-based GPUs under the assumption that transistors can no longer miniaturize after scaling down to roughly the size of a single silicon atom. Our model forecasts that the current paradigm of field-effect transistor-based GPUs will plateau sometime between 2027 and 2035, offering a performance of between 1e14 and 1e15 FLOP/s in FP32.

Executive summary

We develop a simple model that predicts progress in the performance of field-effect transistor-based GPUs under the assumption that transistors can no longer miniaturize after scaling down to roughly the size of a single silicon atom. We construct a composite model from a performance model (a model of how GPU performance relates to the features of that GPU) and a feature model (a model of how GPU features change over time given the constraints imposed by the physical limits of miniaturization), each of which is fit on a dataset of 1948 GPUs released between 2006 and 2021. We find that almost all progress can be explained by two variables: transistor size and the number of cores. Using estimates of the physical limits informed by the relevant literature, our model predicts [...]

---

Outline:

(00:45) Executive summary

(03:10) Introduction

(06:03) A simple model of GPU performance

(08:13) Feature and model selection

(12:05) Physical limits of transistor miniaturization

(14:21) Limits on the number of cores

(15:42) Predictions

(18:26) Limitations

The original text contained 4 footnotes which were omitted from this narration.

---

First published:
December 1st, 2022

Source:
https://epoch.ai/publications/predicting-gpu-performance

---

Narrated by TYPE III AUDIO.

“Will we run out of ML data? Evidence from projecting dataset size trends” by Pablo Villalobos, Jaime Sevilla, Lennart Heim, Tamay Besiroglu, Marius Hobbhahn, Anson Ho

Thu, 10 Nov 2022 00:00:00 GMT

Subtitle: Based on our previous analysis of trends in dataset size, we project the growth of dataset size in the language and vision domains. We explore the limits of this trend by estimating the total stock of available unlabeled data over the next decades.

Our projections predict that we will have exhausted the stock of low-quality language data by 2030 to 2050, high-quality language data before 2026, and vision data by 2030 to 2060. This might slow down ML progress.

All of our conclusions rely on the unrealistic assumptions that current trends in ML data usage and production will continue and that there will be no major innovations in data efficiency. Relaxing these and other assumptions would be promising future work.

There's a chart here. The chart title reads: Low-quality language data

There's a chart here. The chart title reads: High-quality language data

There's a chart here. The chart title reads: Image data

Data stock | Historical projection | Compute projection
Low-quality language stock | 2032.4 [2028.4 ; 2039.2] | 2040.5 [2034.6 ; 2048.9]
High-quality language stock | 2024.5 [2023.5 ; 2025.7] | 2024.1 [2023.2 ; 2025.3]
Image stock | 2046 [2037 ; 2062.8] | 2038.8 [2032 ; 2049.8]

Table 1: Median and 90% [...]

---

Outline:

(01:32) Background

(02:47) Results

The original text contained 2 footnotes which were omitted from this narration.

---

First published:
November 10th, 2022

Source:
https://epoch.ai/publications/will-we-run-out-of-ml-data-evidence-from-projecting-dataset

---

Narrated by TYPE III AUDIO.

“Trends in training dataset sizes” by Pablo Villalobos, Anson Ho

Tue, 20 Sep 2022 00:00:00 GMT

Subtitle: We collected a database of notable ML models and their training dataset sizes. We use this database to find historical growth trends in dataset size for different domains, particularly language and vision.

Key takeaways

We collected over 200 notable ML models and estimated their training dataset size.
Vision and language datasets have historically grown at 0.1 and 0.2 orders of magnitude (OOMs) per year, respectively.
There seems to be some transition around 2014-2015, after which training datasets became much bigger and (in the case of language) smaller datasets disappeared. This might be just an artefact of our small sample size.
We also provide trends for games, speech, recommendation and drawing, but since our sample size is very small in these domains we would advise some level of scepticism.
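
Growth rates in orders of magnitude per year can be hard to read at a glance. As a minimal sketch (the helper names are illustrative, not from the article), the conversion to a yearly growth factor and a doubling time is:

```python
import math


def yearly_growth_factor(ooms_per_year):
    """Multiplicative growth per year for a rate given in OOMs/year."""
    return 10 ** ooms_per_year


def doubling_time_years(ooms_per_year):
    """Years needed for dataset size to double at the given rate."""
    if ooms_per_year <= 0:
        raise ValueError("growth rate must be positive")
    return math.log10(2) / ooms_per_year


# Rounded rates from the takeaways: vision ~0.1, language ~0.2 OOMs/year.
for domain, rate in (("vision", 0.1), ("language", 0.2)):
    print(
        f"{domain}: x{yearly_growth_factor(rate):.2f} per year, "
        f"doubling every {doubling_time_years(rate):.1f} years"
    )
```

The conversion assumes a constant exponential trend, which the takeaways themselves caution may not hold across the 2014-2015 transition.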

Figure 1: Training datasets for language (left) and vision (right).

Domain | Scale (data points) | Yearly growth (OOMs/year) | Yearly growth (OOMs/year) (95% CI) | #systems
Language | 1e2 - 2e12 | 0.22 | [0.18 ; 0.28] | 79
Vision | 2e3 - 3e9 | 0.09 | [0.08 ; 0.11] | 55
Speech | 9e2 - 3e12 | 0.21 | [0.17 ; 0.30] | 13
Games | 7e5 - 4e11 | 0.09 | [0.08 ; 0.15] | 12
Recommendation | 1e8 - 1e10 | 0.05 | [0.00 ; 0.47] | 11
Drawing | 6e4 - 4e9 | 0.43 | [0.17 ; 0.64] | 10

Table 1: Summary of trends for each domain. Scale is the maximum and minimum observed dataset size, and yearly growth is [...]

---

Outline:

(00:26) Key takeaways

(01:35) Introduction

(02:13) Methods

(02:16) Measuring dataset size

(04:16) Our database

(05:20) Dataset size trends

(05:23) Vision

(06:33) Language

(07:38) Other domains

(08:22) Conclusion

---

First published:
September 20th, 2022

Source:
https://epoch.ai/publications/trends-in-training-dataset-sizes

---

Narrated by TYPE III AUDIO.

“The longest training run” by Jaime Sevilla, Tamay Besiroglu, Owen Dudney, Anson Ho

Wed, 17 Aug 2022 00:00:00 GMT

Subtitle: Training runs of large machine learning systems are likely to last less than 14-15 months. This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.

In short: Training runs of large machine learning systems are likely to last less than 14-15 months. This is because longer runs will be outcompeted by runs that start later and therefore use better hardware and better algorithms.

Larger compute budgets and a better understanding of how to effectively use compute (through, for example, using scaling laws) are two major driving forces of progress in recent machine learning.

There are many ways to increase your effective compute budget: better hardware, rising investments in AI R&D and improvements in algorithmic efficiency. In this article we investigate one often-overlooked but plausibly important factor: how long—in terms of wall-clock time—you are willing to train your model for.

Here we explore a simple mathematical framework for estimating the optimal duration of a training run. A researcher is tasked with training a model by some deadline, and must decide when to start their training run. The researcher is faced with a key problem [...]

---

Outline:

(02:52) A simple framework for training run lengths

(06:28) Accounting for increasing dollar-budgets

(07:48) Accounting for increased algorithmic efficiency

(09:29) Accounting for hardware swapping

(12:34) Accounting for stochasticity

(13:28) Fixed deadlines

(14:28) Renting hardware

(15:41) Conclusion

(18:00) Acknowledgements

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
August 17th, 2022

Source:
https://epoch.ai/publications/the-longest-training-run

---

Narrated by TYPE III AUDIO.

“A time-invariant version of Laplace’s rule” by Jaime Sevilla, Ege Erdil

Fri, 15 Jul 2022 00:00:00 GMT

Subtitle: We explore how to estimate the probability of an event given information of past occurrences. We explain a problem with the naive application of Laplace's rule in this context, and suggest a modification to correct it.

What is the probability that the sun will rise tomorrow? What are the chances of a pandemic happening next year? What are the odds of survival of a new surgery that has been successfully executed only once?

These and many other questions can be answered by appealing to a general rule: Laplace's rule of succession. This rule describes the probability of a positive outcome given information about past successes. The versatility and generality of the rule make it an invaluable tool to forecasters, who use it to estimate base rates.

Laplace's rule can be stated in simple terms. If we have repeated an experiment \(T\) times, and observed \(S\) successes, we can estimate the posterior probability of obtaining a success in the next trial as \(p = \frac{S+1}{T+2}\).
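
As a minimal sketch of the rule as stated here (the function name is an illustrative choice, and this is the standard rule, not the time-invariant modification the article goes on to propose):

```python
from fractions import Fraction


def laplace_rule(successes, trials):
    """Posterior probability of success on the next trial: (S + 1) / (T + 2)."""
    if trials < 0 or not 0 <= successes <= trials:
        raise ValueError("need 0 <= successes <= trials")
    return Fraction(successes + 1, trials + 2)


# A new surgery that has been attempted once and succeeded once.
print(laplace_rule(successes=1, trials=1))  # 2/3

# With no observations at all, the rule falls back to even odds.
print(laplace_rule(successes=0, trials=0))  # 1/2
```

Note that the answer depends on what counts as a trial, which is exactly where the difficulty with observations over a time period arises.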

However, there is a fatal problem when applying the rule to observations over a time period, where the definition of what constitutes a trial is not as clear. For example [...]

---

Outline:

(04:20) A refresher: Laplace's rule of succession

(05:36) The problem of time

(07:08) A problem of priors

(08:24) Continuous improvement

(09:47) The scale invariant prior

(13:01) Inference with the scale-invariant prior

(14:03) Unprecedented success

(16:34) Adjusting for a variable observation period

(18:10) Putting it all together

(18:41) An example: Earthquakes in Chile

(23:38) Conclusion

(25:48) Acknowledgements

The original text contained 11 footnotes which were omitted from this narration.

---

First published:
July 15th, 2022

Source:
https://epoch.ai/publications/a-time-invariant-version-of-laplace-s-rule

---

Narrated by TYPE III AUDIO.

“Machine learning model sizes and the parameter gap” by Pablo Villalobos, Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Anson Ho, Marius Hobbhahn

Tue, 05 Jul 2022 00:00:00 GMT

Subtitle: The model size of notable machine learning systems has grown ten times faster than before since 2018. After 2020 growth has not been entirely continuous: there was a jump of one order of magnitude which persists until today. This is relevant for forecasting model size and thus AI capabilities.

Summary: The model size of notable machine learning systems has grown ten times faster than before since 2018. After 2020 growth has not been entirely continuous: there was a jump of one order of magnitude which persists until today. This is relevant for forecasting model size and thus AI capabilities.

Trends in model size

In current ML systems, model size (number of parameters) is related to performance via known scaling laws. We used our dataset to analyze trends in the model size of 237 milestone machine learning systems. The systems are categorized into Language, Vision, Games and Other according to the task they solve.

Model size slowly increased by 7 orders of magnitude from the 1950s to around 2018. Since 2018, growth has accelerated for language models, with model size increasing by another 4 orders of magnitude in the four years from 2018 to 2022 (see Figure [...]

---

Outline:

(00:57) Trends in model size

(02:23) The parameter gap

---

First published:
July 5th, 2022

Source:
https://epoch.ai/publications/machine-learning-model-sizes-and-the-parameter-gap

---

Narrated by TYPE III AUDIO.

“Trends in GPU price-performance” by Marius Hobbhahn, Tamay Besiroglu

Mon, 27 Jun 2022 00:00:00 GMT

Subtitle: Using a dataset of 470 models of graphics processing units released between 2006 and 2021, we find that the amount of floating-point operations/second per $ doubles every ~2.5 years.

We would like to thank Alyssa Vance, Ashwin Acharya, Jessica Taylor and the Epoch AI team for helpful feedback and comments.

Executive Summary

Using a dataset of 470 models of graphics processing units (GPUs) released between 2006 and 2021, we find that the amount of floating-point operations/second per $ (hereafter FLOP/s per $) doubles every ~2.5 years. For top GPUs at any point in time, we find a slower rate of improvement (FLOP/s per $ doubles every 2.95 years), while for models of GPU typically used in ML research, we find a faster rate of improvement (FLOP/s per $ doubles every 2.07 years). GPU price-performance improvements have generally been slightly slower than the 2-year doubling time associated with Moore's law, much slower than what is implied by Huang's law, yet considerably faster than was generally found in prior work on trends in GPU price-performance. We aim to provide a more precise characterization of GPU price-performance trends based on more or higher-quality data, that is more robust [...]

---

Outline:

(00:36) Executive Summary

(02:11) Introduction

(05:07) Dataset

(07:17) Empirical analysis

(07:40) Empirical trend vs. other predictions

(10:06) Trends across precision for floating formats

(11:24) Trends of GPUs used in ML

(13:48) Trend of top-performing GPUs

(14:49) All trends (table & figure)

(15:30) Conclusion

The original text contained 10 footnotes which were omitted from this narration.

---

First published:
June 27th, 2022

Source:
https://epoch.ai/publications/trends-in-gpu-price-performance

---

Narrated by TYPE III AUDIO.

“Announcing Epoch AI: A research initiative investigating the road to transformative AI” by The Epoch AI Team

Thu, 23 Jun 2022 00:00:00 GMT

Subtitle: We are a new research initiative forecasting developments in AI. Come join us!

Summary

We are a new research initiative working on investigating trends in machine learning and forecasting the development of transformative Artificial Intelligence
This work is done in close collaboration with other organizations, like Rethink Priorities and Open Philanthropy
We will be hiring for 2-4 full-time roles this summer – more information here

What is Epoch AI?

Epoch AI is a new research organization that works to support AI strategy and improve forecasts around the development of transformative Artificial Intelligence – AI systems that have the potential to have an effect on society as large as that of the industrial revolution.

Our founding team consists of seven members – Jaime Sevilla, Tamay Besiroglu, Lennart Heim, Pablo Villalobos, Edu Roldán, Marius Hobbhahn, and Anson Ho. Collectively, we have backgrounds in Machine Learning, Statistics, Economics, Forecasting, Physics, Computer Engineering, and Software Engineering.

Our work involves close collaboration with other organizations, such as Open Philanthropy and Rethink Priorities’ AI Governance and Strategy team. We are advised by Tom Davidson from Open Philanthropy and Neil Thompson. Rethink Priorities is also our fiscal sponsor.

Our mission

[...]

---

Outline:

(00:21) Summary

(00:45) What is Epoch AI?

(01:55) Our mission

(02:29) Our research agenda

(03:45) Our work so far

(04:50) Hiring

The original text contained 1 footnote which was omitted from this narration.

---

First published:
June 23rd, 2022

Source:
https://epoch.ai/publications/announcing-epoch

---

Narrated by TYPE III AUDIO.

“Grokking “Semi-informative priors over AI timelines”” by Anson Ho

Mon, 13 Jun 2022 00:00:00 GMT

Subtitle: I give visual explanations for Tom Davidson's report, Semi-informative priors over AI timelines, and summarise the key assumptions and intuitions.

Notes:

I give visual explanations for Tom Davidson's report, Semi-informative priors over AI timelines, and summarise the key assumptions and intuitions
The diagrams can be found here – you can click on some of the boxes to get linked to the part of the report that you’re interested in

Thanks to the Epoch AI team for feedback and support! Thanks especially to Jaime Sevilla and Tom Davidson for providing detailed feedback.

Executive Summary

The framework in Semi-informative priors over AI timelines assumes a model of AGI development which consists of a sequence of Bernoulli trials, i.e. it treats each calendar year as a “trial” at building AGI with constant probability p of succeeding.

Image source: Davidson, 2021

However, we don’t know what this value of p is, so we use a generalisation of Laplace's rule of succession to estimate —there's a complex formula here. See the original text—. This is done by specifying a first-trial probability, the probability of successfully building AGI in the first year of AI research, together [...]

---

Outline:

(00:51) Executive Summary

(04:26) Motivation

(05:25) Laplace's Rule of Succession

(09:09) Making the priors less uninformative

(12:33) Semi-informative priors demystified

(13:05) First-trial probability

(14:50) Number of virtual successes

(15:43) Regime start time

(17:13) Trial definition

(19:43) Putting things together: Final distribution

(19:47) Model Extensions

(20:25) Final Distribution

(22:08) Conclusion

The original text contained 18 footnotes which were omitted from this narration.

---

First published:
June 13th, 2022

Source:
https://epoch.ai/publications/grokking-semi-informative-priors

---

Narrated by TYPE III AUDIO.

“Grokking “Forecasting TAI with biological anchors”” by Anson Ho

Mon, 06 Jun 2022 00:00:00 GMT

Subtitle: I give a visual explanation of Ajeya Cotra's draft report, Forecasting TAI with biological anchors, summarising the key assumptions, intuitions, and conclusions.

Notes:

I give a visual explanation of Ajeya Cotra's draft report, Forecasting TAI with biological anchors (Cotra, 2020), summarising the key assumptions, intuitions, and conclusions
The diagrams can be found here – you can click on some of the boxes to get linked to the part of the report that you’re interested in

Thanks to Michael Aird, Ashwin Acharya, and the Epoch AI team for suggestions and feedback! Special thanks to Jaime Sevilla and Ajeya Cotra for detailed feedback.

Executive Summary

Click here to skip the summary

Ajeya Cotra's biological anchors framework attempts to forecast the development of Transformative AI (TAI) by treating compute as a key bottleneck to AI progress. This lets us focus on a concrete measure (compute, measured in FLOP) as a proxy for the question “when will TAI be developed?” Given this, we can decompose the question into two main questions:

2020 training compute requirements: How much compute will we need to train TAI, using 2020 machine learning architectures and algorithms?
Affordability of [...]

---

Outline:

(00:56) Executive Summary

(04:18) Motivation

(05:23) Why focus on compute?

(08:19) Framework

(13:59) Zooming Into the Biological Anchors

(14:33) Evolution anchor

(15:37) Lifetime anchor

(17:35) Neural network anchors

(19:04) Genome anchor

(20:00) Affordability of compute

(22:12) Putting Things Together: Final distribution

(24:04) Conclusion

The original text contained 13 footnotes which were omitted from this narration.

---

First published:
June 6th, 2022

Source:
https://epoch.ai/publications/grokking-bioanchors

---

Narrated by TYPE III AUDIO.

“Projecting compute trends in machine learning” by Tamay Besiroglu, Lennart Heim, Jaime Sevilla

Mon, 07 Mar 2022 00:00:00 GMT

Subtitle: Projecting forward 70 years' worth of trends in the amount of compute used to train machine learning models.

Summary

Using our dataset of milestone machine learning models, and our recent analysis of compute trends in ML, we project forward 70 years’ worth of trends in the amount of compute used to train machine learning models. Our simulations account for (a) uncertainty in estimates of the growth rates in compute usage during the Deep Learning (DL)-era and Pre-DL era, and (b) uncertainty over the ‘reversion date’, i.e. the date when the current DL-era compute trend (with a ~6 month doubling time) will end and revert to the historically more common trend associated with Moore's law. Assuming a reversion date of between 8 to 18 years, and without accounting for algorithmic progress, our projections suggest that the median of Cotra 2020's biological anchors may be surpassed around August 2046 [95% CI: Jun 2039, Jul 2060]. This suggests that historical rates of compute scaling, if sustained briefly (relative to how long these trends have been around so far), could result in the emergence of transformative models.

Our work can be replicated using this Colab notebook.

Note: we present projections, not [...]

---

Outline:

(00:21) Summary

(01:57) Introduction

(04:19) When will the current scaling trend revert back to Moore's law?

(07:42) Projecting ML compute trends

(09:15) Conclusion

(10:02) Details of the simulations

The original text contained 1 footnote which was omitted from this narration.

---

First published:
March 7th, 2022

Source:
https://epoch.ai/publications/projecting-compute-trends

---

Narrated by TYPE III AUDIO.

“Compute trends across three eras of machine learning” by Jaime Sevilla, Lennart Heim, Anson Ho, Tamay Besiroglu, Marius Hobbhahn, Pablo Villalobos

Wed, 16 Feb 2022 00:00:00 GMT

Subtitle: We’ve compiled a dataset of the training compute for over 120 machine learning models, highlighting novel trends and insights into the development of AI since 1952, and what to expect going forward.

Summary: We have collected a dataset and analysed key trends in the training compute of machine learning models since 1950. We identify three major eras of training compute - the pre-Deep Learning Era, the Deep Learning Era, and the Large-Scale Era. Furthermore, we find that the training compute has grown by a factor of 10 billion since 2010, with a doubling time of around 5-6 months. See our recent paper, Compute Trends Across Three Eras of Machine Learning, for more details.

Introduction

It is well known that progress in machine learning (ML) is driven by three primary factors - algorithms, data, and compute. This makes intuitive sense - the development of algorithms like backpropagation transformed the way that machine learning models were trained, leading to significantly improved efficiency compared to previous optimisation techniques (Goodfellow et al., 2016; Rumelhart et al., 1986). Data has been becoming increasingly available, particularly with the advent of “big data” in recent years. At the same time, progress in [...]

---

Outline:

(01:00) Introduction

(03:30) Methodology

(05:42) Results

(06:42) Compute trends are slower than previously reported

(07:31) Three eras of machine learning

(09:44) Implications and further work

The original text contained 5 footnotes which were omitted from this narration.

---

First published:
February 16th, 2022

Source:
https://epoch.ai/publications/compute-trends

---

Narrated by TYPE III AUDIO.

“Estimating training compute of deep learning models” by Jaime Sevilla, Lennart Heim, Marius Hobbhahn, Tamay Besiroglu, Anson Ho, Pablo Villalobos

Thu, 20 Jan 2022 00:00:00 GMT

Subtitle: We describe two approaches for estimating the training compute of Deep Learning systems, by counting operations and looking at GPU time.

ML models trained on more compute have better performance and more advanced capabilities (see e.g. Kaplan et al., 2020 or Hoffmann et al., 2022). Due to this, estimating and reporting compute usage is crucial to enable accurate comparisons between ML models.

Compute usage is commonly measured as the number of floating point operations (FLOP) required to train the final version of the system. To estimate this we can resort to two strategies: a) using information about the architecture and amount of training data, or b) using information about the hardware used and training time.
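
The two strategies can be sketched roughly as follows. This is an illustrative approximation, not the article's calculators: the function names, the assumption of 2 FLOP per connection per forward pass, the 2:1 backward-forward ratio, and any utilization value you plug in are all assumptions you would need to check for a given model and hardware setup.

```python
def compute_from_architecture(
    connections, examples, epochs, flop_per_connection=2, backward_forward_ratio=2
):
    """Strategy (a): count operations from architecture and training data.

    Assumes each connection costs `flop_per_connection` FLOP per forward pass
    (one multiply, one add) and that the backward pass costs
    `backward_forward_ratio` times the forward pass.
    """
    forward_flop_per_example = flop_per_connection * connections
    return forward_flop_per_example * (1 + backward_forward_ratio) * examples * epochs


def compute_from_gpu_time(training_days, num_gpus, peak_flop_per_s, utilization):
    """Strategy (b): hardware time multiplied by achieved throughput.

    `utilization` is the assumed fraction of peak FLOP/s actually achieved.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    seconds = training_days * 24 * 3600
    return seconds * num_gpus * peak_flop_per_s * utilization


# Hypothetical inputs, chosen only to show the call shapes.
print(f"{compute_from_architecture(1_000_000, 1000, 10):.2e} FLOP")
print(f"{compute_from_gpu_time(1, 1, 1e12, 0.5):.2e} FLOP")
```

When both kinds of information are available, comparing the two estimates is a useful sanity check, since they rely on different assumptions.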

Below we provide two calculators that illustrate these methods.

Do you see a mistake or do you want to submit missing information about hardware specs? Fill this form and we will look into it.

Introduction

In this article we will explain (with examples) how to estimate the amount of compute used to train an AI system. We will explain two procedures, one based on the architecture of the network and number of training batches processed; and another based [...]

---

Outline:

(01:16) Introduction

(02:05) Method 1: Counting operations in the model

(04:07) The ratio of backward pass operations to forward pass operations

(05:03) Forward pass compute and parameter counts of common layers

(06:45) Example: CNN-LSTM-FCN model

(10:12) Example: Transformer

(13:26) Method 2: GPU time

(14:18) Estimating the number of FLOP from the GPU time

(15:07) Which number representation is used?

(19:00) Imputing GPU performance when the hardware model is not known

(20:36) About GPU utilization rates

(24:42) Example: Image GPT

(26:58) Conclusion

(28:12) Acknowledgements

(28:43) Bibliography

The original text contained 8 footnotes which were omitted from this narration.

---

First published:
January 20th, 2022

Source:
https://epoch.ai/publications/estimating-training-compute

---

Narrated by TYPE III AUDIO.

“What’s the backward-forward FLOP ratio for neural networks?” by Marius Hobbhahn, Jaime Sevilla

Mon, 13 Dec 2021 00:00:00 GMT

Subtitle: Determining the backward-forward FLOP ratio for neural networks, to help calculate their total training compute.

Summary

Classic settings, i.e. deep networks with convolutional layers and large batch sizes, almost always have backward-forward FLOP ratios close to 2:1.
Depending on the following criteria we can encounter ratios between 1:1 and 3:1
1. Type of layer: Passes through linear layers have as many FLOP as they use to do weight updates. Convolutional layers have many more FLOP for passes than for weight updates. Therefore, in CNNs, FLOP for weight updates basically play no role.
2. Batch size: Weights are updated after the gradients of the batch have been aggregated. Thus, FLOP for passes increase with batch size but stay constant for weight updates.
3. Depth: The first layer has a backward-forward ratio of 1:1 while all others have 2:1. Therefore, the overall ratio is influenced by the fraction of FLOP in first vs. FLOP in other layers.
We assume the network is being optimized by stochastic gradient descent (w += ɑ⋅dw) and count the weight update as part of the backward pass. Other optimizers would imply different FLOP counts and could create ratios [...]
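The interaction of batch size and layer position described above can be sketched for a single linear layer. This is a simplified accounting under stated assumptions: a multiply-add counts as 2 FLOP, biases and activations are ignored, the first layer skips the gradient with respect to its inputs, and the weight update costs 2 FLOP per weight. The function name is hypothetical.

```python
def linear_layer_ratio(batch_size: int, n_in: int, n_out: int,
                       is_first_layer: bool = False) -> float:
    """Backward-forward FLOP ratio of one linear layer (simplified count)."""
    # Forward pass: one (batch x n_in) @ (n_in x n_out) matrix product.
    forward = 2 * batch_size * n_in * n_out
    # Backward pass, part 1: gradient with respect to the weights.
    grad_weights = 2 * batch_size * n_in * n_out
    # Backward pass, part 2: gradient with respect to the inputs.
    # Assumed unnecessary for the first layer, whose inputs are the data.
    grad_inputs = 0 if is_first_layer else 2 * batch_size * n_in * n_out
    # SGD weight update: one multiply and one add per weight,
    # performed once per batch regardless of batch size.
    weight_update = 2 * n_in * n_out
    return (grad_weights + grad_inputs + weight_update) / forward


for batch_size in (1, 32, 1024):
    later = linear_layer_ratio(batch_size, 512, 512)
    first = linear_layer_ratio(batch_size, 512, 512, is_first_layer=True)
    print(f"batch={batch_size}: later layer {later:.3f}, first layer {first:.3f}")
```

Under these assumptions the ratio is 2 + 1/B for later layers and 1 + 1/B for the first layer, where B is the batch size, so it runs from 3:1 at batch size 1 toward 2:1 (or 1:1 for the first layer) as the batch grows.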

---

Outline:

(00:21) Summary

(01:47) Introduction

(02:22) Theory

(04:49) Empirical results

(06:00) Backward and forward FLOP in the first and the rest of the layers

(06:39) Type of layer

(07:44) Batch size

(08:15) Depth

(09:41) Combining all above

(11:05) Conclusion

(11:37) Acknowledgment

---

First published:
December 13th, 2021

Source:
https://epoch.ai/publications/backward-forward-FLOP-ratio

---

Narrated by TYPE III AUDIO.

“How to measure FLOP for neural networks empirically?” by Marius Hobbhahn

Mon, 29 Nov 2021 00:00:00 GMT

Subtitle: Computing the utilization rate for multiple neural network architectures.

Experiments and text by Marius Hobbhahn. I would like to thank Jaime Sevilla, Jean-Stanislas Denain, Tamay Besiroglu, Lennart Heim, and Anson Ho for their feedback and support.

Summary

We measure the utilization rate of a Tesla P100 GPU for training different ML models. Most architectures and methods result in a utilization rate between 0.3 and 0.75. However, two architectures result in implausibly low utilization rates below 0.04. The most probable explanation for these outliers is that FLOP for inverted bottleneck layers are not counted correctly by the profiler. In general, the profiler we use shows signs of under- and overcounting, and it is possible that we made errors.

Findings

Counting the FLOP for a forward pass is very simple and many different packages give correct answers.
Counting the FLOP for the backward pass is harder, and our estimator of choice makes weird overcounting and undercounting errors.
After correcting for these mistakes, the backward/forward ratio is very likely 2:1 (at least for our setup).
After correcting for the overcounting issues, we get empirical utilization rates between 0.3 and 0.75 for [...]
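To make the utilization rate concrete: it is the FLOP actually performed per second divided by the hardware's peak FLOP per second. The sketch below assumes that definition. The function names and example numbers are hypothetical, and the 0.04 default simply mirrors the outlier level mentioned in the summary.

```python
def utilization_rate(counted_flop: float,
                     elapsed_seconds: float,
                     peak_flop_per_second: float) -> float:
    """Achieved FLOP/s divided by the hardware's peak FLOP/s."""
    if elapsed_seconds <= 0 or peak_flop_per_second <= 0:
        raise ValueError("elapsed time and peak performance must be positive")
    return counted_flop / (elapsed_seconds * peak_flop_per_second)


def is_suspiciously_low(rate: float, threshold: float = 0.04) -> bool:
    """Flag rates below a threshold as likely FLOP-counting errors."""
    return rate < threshold


# Hypothetical measurement: 4.5e13 FLOP counted over 10 seconds on a
# device with an assumed peak of 1e13 FLOP/s.
rate = utilization_rate(counted_flop=4.5e13, elapsed_seconds=10.0,
                        peak_flop_per_second=1e13)
print(f"utilization rate: {rate:.2f}, suspicious: {is_suspiciously_low(rate)}")
```

Because the numerator comes from a profiler, any under- or overcounting there passes straight into the utilization rate, which is why very low values are worth checking before they are trusted.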

---

Outline:

(00:29) Summary

(01:07) Findings

(01:58) Introduction

(02:47) Methods for counting FLOP

(04:40) Our experimental setup

(07:19) Analysis

(07:21) Something is fishy with profiler_nvtx

(09:21) Investigating profiler_nvtx further

(12:33) Results

(12:51) Comparing batch sizes

(13:56) Backward-forward pass ratios

(15:06) Utilization rates

(16:46) Conclusion

---

First published:
November 29th, 2021

Source:
https://epoch.ai/publications/measure-FLOP-empirically

---

Narrated by TYPE III AUDIO.

“Parameter counts in machine learning” by Jaime Sevilla, Pablo Villalobos, Juan Felipe Cerón

Sat, 19 Jun 2021 00:00:00 GMT

Subtitle: Compiling a large dataset of machine learning models to determine changes in the parameter counts of systems since 1952.

In short: we have compiled information about the date of development and trainable parameter counts of n=139 machine learning systems between 1952 and 2021. This is, as far as we know, the biggest public dataset of its kind. You can access our dataset here, and the code to produce an interactive visualization is available here.

We chose to focus on parameter count for three reasons: previous work indicates that it is an important variable for model performance [1], it serves as a proxy for model complexity, and it is usually readily available or easily estimated from descriptions of model architecture.
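As an illustration of how a parameter count can be estimated from an architecture description, here is a minimal sketch for fully connected and 2D convolutional layers. The helper names and the example network are hypothetical and are not drawn from the dataset.

```python
def dense_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Trainable parameters of a fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def conv2d_params(kernel_h: int, kernel_w: int, c_in: int, c_out: int,
                  bias: bool = True) -> int:
    """Trainable parameters of a standard 2D convolutional layer."""
    return kernel_h * kernel_w * c_in * c_out + (c_out if bias else 0)


def mlp_params(layer_sizes: list[int]) -> int:
    """Total parameters of a multilayer perceptron given its layer widths."""
    return sum(dense_params(n_in, n_out)
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Hypothetical network: 784 inputs, one hidden layer of 128 units, 10 outputs.
print(mlp_params([784, 128, 10]))
# A 3x3 convolution from 3 to 64 channels.
print(conv2d_params(3, 3, 3, 64))
```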

We hope our work will help AI researchers and forecasters understand one way in which models have become more complex over time, and ground their predictions of how the field will progress in the future. In particular, we hope this will help us tease apart how much of the progress in machine learning has been due to algorithmic improvements versus increases in model complexity.

It is hard to draw firm conclusions from our [...]

---

Outline:

(03:02) Features of the dataset

(04:31) Caveats

(06:02) Insights

(08:35) Open questions

(09:47) Next steps

(10:37) Acknowledgements

(11:09) Bibliography

---

First published:
June 19th, 2021

Source:
https://epoch.ai/publications/parameter-counts

---

Narrated by TYPE III AUDIO.
