Andrej Karpathy's Open-Source Breakthrough

AutoResearch: Run 100
Experiments While You Sleep

A 630-line Python script that lets AI agents autonomously run, evaluate, and iterate on ML experiments overnight. No human needed.

45.7k
GitHub Stars
630
Lines of Code
~100
Experiments / Night
8.6M
Views in 48 Hours

AutoResearch flips the research paradigm: instead of a human manually tweaking parameters and running experiments one-by-one, an AI agent reads its own source code, forms hypotheses, rewrites the training logic, runs experiments, and evaluates outcomes.

You write instructions in plain English (program.md), point the agent at a training script, and go to sleep. By morning, you wake up to a full log of automated experiments and an optimized model. Every experiment is git-committed and logged.

Released March 6, 2026 under MIT license. One of the fastest-growing repositories in GitHub history, reaching 30k stars in its first week.

Creator
Andrej Karpathy. Former Tesla AI Director, OpenAI founding member, Stanford CS231n creator. One of the most respected voices in AI.
Stack
Python + PyTorch. No external dependencies beyond PyTorch. Single GPU. Uses the uv package manager. Works with Claude Code or any AI coding agent.
Core Principle
"Instead of directly improving the model, the human programs the experimental process using natural language."
"All LLM frontier labs will do this. It's the final boss battle."
Andrej Karpathy
Architecture

How AutoResearch Works

prepare.py
IMMUTABLE
Downloads the ClimbMix dataset from HuggingFace, trains a BPE tokenizer, and writes sharded binary data files. Run once, never touched again.
train.py
AGENT EDITS
Complete GPT model definition, Muon + AdamW optimizer, and training loop. ~630 lines. The only file the AI agent is allowed to modify.
program.md
HUMAN WRITES
Natural language instructions for the agent. Defines what to search for, constraints, and stopping criteria. Your "research org in English."
1
Read Instructions & Examine Code
Agent reads program.md, examines the current state of train.py, and reviews past experiment results from results.tsv.
2
Form Hypothesis & Modify Code
The agent proposes an improvement (architecture, hyperparams, optimizer), rewrites train.py, and git-commits the change with a description.
3
Run 5-Minute Training Cycle
Executes uv run train.py with a fixed 5-min wall-clock budget. Every experiment gets identical time, making results directly comparable.
4
Evaluate & Branch
Extracts val_bpb (validation bits-per-byte). Lower = better. Decides whether to keep or discard the change based on improvement.
Improved? Keep it.
Commit stays on branch. New baseline established. Agent builds on this improvement in the next cycle.
No gain? Revert.
Git reset to previous state. Experiment logged to results.tsv. Agent tries a different hypothesis next.
Editable Asset
One file the agent can modify. Keeps search space interpretable.
train.py
Scalar Metric
Single number to optimize. No human judgment needed.
val_bpb
Time Box
Fixed wall-clock budget. Every experiment is directly comparable.
5 minutes
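The val_bpb metric is just validation cross-entropy converted to bits per byte, which normalizes for the tokenizer's compression ratio so runs with different vocabularies stay comparable. A minimal sketch of that conversion (the helper name and the token-to-byte ratio here are illustrative assumptions, not the repo's actual code):

```python
import math

def bits_per_byte(ce_loss_nats: float, tokens: int, raw_bytes: int) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits/byte.

    Dividing by ln(2) converts nats to bits; multiplying by the
    tokens-per-byte ratio normalizes for tokenizer compression.
    (Illustrative helper; names are assumptions.)
    """
    bits_per_token = ce_loss_nats / math.log(2)
    return bits_per_token * tokens / raw_bytes

# Example: 2.77 nats/token at roughly 1 token per 4 bytes
print(round(bits_per_byte(2.77, tokens=1, raw_bytes=4), 3))  # -> 0.999
```

Lower is better: a model that predicts the validation bytes more confidently needs fewer bits to encode them.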
Non-Negotiable Rule from program.md:
"Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human may be asleep; continuous iteration is expected." The agent runs autonomously until interrupted.
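The keep-or-revert cycle described above can be sketched in a few lines of Python. This is a hedged reconstruction, not the repo's actual driver: it assumes train.py prints a line like `val_bpb: <float>` at the end of its output, which is an assumption made for illustration.

```python
import subprocess

BUDGET_S = 5 * 60  # identical wall-clock budget for every experiment

def parse_val_bpb(stdout: str) -> float:
    """Pull val_bpb off the last output line.

    Assumes train.py prints 'val_bpb: <float>' last -- an assumption
    for this sketch, not the repo's documented contract.
    """
    return float(stdout.strip().splitlines()[-1].split("val_bpb:")[-1])

def improved(new_bpb: float, best_bpb: float) -> bool:
    """Lower bits-per-byte is better; strict improvement keeps the commit."""
    return new_bpb < best_bpb

def run_cycle(best_bpb: float) -> float:
    """One experiment: train under the time box, then keep or revert."""
    out = subprocess.run(
        ["uv", "run", "train.py"],
        capture_output=True, text=True,
        timeout=BUDGET_S + 30,  # small grace period past the 5-min box
    )
    bpb = parse_val_bpb(out.stdout)
    if improved(bpb, best_bpb):
        return bpb  # commit stays on the branch; new baseline
    subprocess.run(["git", "reset", "--hard", "HEAD~1"])  # discard change
    return best_bpb
```

The fixed time box is what makes the loop honest: every hypothesis gets exactly the same compute, so a lower val_bpb is attributable to the code change, not to a longer run.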
Commit    val_bpb   VRAM (GB)   Status   Description
a3f7c21   0.9979    38.2        keep     Baseline run
b8e2d44   0.9891    39.1        keep     Increase depth 8→10
c1a9f67   0.0000    0.0         crash    OOM on batch_size 2x
d5b3e89   0.9812    37.8        keep     RoPE embeddings
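Because every experiment lands in results.tsv, the log is trivially machine-readable. A short sketch of mining it for the best kept run; the column names mirror the table above, though the real file's layout may differ:

```python
import csv
import io

def best_kept_run(tsv_text: str) -> dict:
    """Return the kept experiment with the lowest val_bpb.

    Column names mirror the example table; the actual results.tsv
    schema may differ.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: float(r["val_bpb"]))

log = (
    "commit\tval_bpb\tstatus\n"
    "a3f7c21\t0.9979\tkeep\n"
    "c1a9f67\t0.0000\tcrash\n"
    "d5b3e89\t0.9812\tkeep\n"
)
print(best_kept_run(log)["commit"])  # -> d5b3e89
```

Note the crashed run is filtered out before taking the minimum; its val_bpb of 0.0000 would otherwise win the comparison.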
Proven Results

What Happened Overnight

Karpathy's Own Runs
700
experiments in 2 days
Session 1: 89 experiments
Session 2: 126 experiments
Optimizations found: 20 total
Training speedup: 11%
Shopify CEO (Tobias Lütke)
19%
performance gain overnight
Experiments run: 37 autonomous
Liquid engine: 53% faster
Object allocations: 61% fewer
His reaction: "Totally insane"
Aspect               Manual Research                       AutoResearch
Experiments / Night  1-3 (if researcher stays late)        ~100 autonomous
Human Involvement    Constant: design, run, analyze each   Write program.md, then sleep
Consistency          Variable (fatigue, cognitive bias)    Identical 5-min budget each
Hypothesis Gen       Limited by time & human creativity    LLM generates continuously
Documentation        Often incomplete or missing           Every experiment git-committed
Cost                 Researcher salary + GPU time          Just GPU time (agent is free)
30k
Stars in 1 Week
One of the fastest-growing repos in GitHub history
6.3k
Forks
Community ports to macOS, Windows, AMD, and MLX within days
8.6M
Tweet Views
Karpathy's announcement hit 8.6M views in just 48 hours
"OK this thing is totally insane. Ran 37 experiments overnight on our internal data and got a 19% performance gain. This is the future."
Tobias Lütke, CEO of Shopify
Beyond Machine Learning

Use Cases Across Industries

Measurable Metric + Controllable Input + Repeatable Process = AutoResearch Loop
Any domain with these three properties can use the autonomous experiment pattern. The AI agent handles the loop; you define what "better" means.
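Stripped of its ML specifics, the pattern is a generic keep-or-revert search over any measurable metric. A toy sketch, with every name here illustrative rather than taken from the repo:

```python
import random

def autoresearch_loop(mutate, evaluate, baseline, cycles=100):
    """Generic keep-or-revert search over any measurable metric.

    mutate(config) proposes a variant; evaluate(config) returns the
    metric, lower is better. All names are illustrative assumptions.
    """
    best_cfg, best_score = baseline, evaluate(baseline)
    for _ in range(cycles):
        candidate = mutate(best_cfg)
        score = evaluate(candidate)
        if score < best_score:  # improved? keep it as the new baseline
            best_cfg, best_score = candidate, score
    return best_cfg, best_score

# Toy usage: minimize (x - 3)^2 by random perturbation of the input
cfg, score = autoresearch_loop(
    mutate=lambda x: x + random.uniform(-0.5, 0.5),
    evaluate=lambda x: (x - 3) ** 2,
    baseline=0.0,
)
```

Swap `evaluate` for a CTR lookup, a reply-rate query, or a defect-rate simulation and the same loop applies; the only hard requirements are the three properties above.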
Marketing
Ad Performance Optimization
Agent generates copy and creative variants, deploys through API, measures CTR and CPA. Iterates on winning combinations overnight.
Metric: CTR / CPA
Sales
Cold Email Outreach
Test subject lines, body copy, CTAs, and send timing. Teams report going from 2-4% baseline reply rate to 8-12% within weeks.
Metric: Reply Rate
Product
Activation & Onboarding
Vary in-app messaging sequences, tooltip copy, and onboarding step order. Measure time-to-activation and completion rate.
Metric: Activation Rate
Revenue
Landing Page Conversion
AI generates headline, layout, and CTA variants. Deploys via API, tracks conversions. Keeps winners, discards losers automatically.
Metric: Conversion %
Biotech
Drug Discovery Simulation
Autonomous experimentation on molecular configurations. Agent modifies simulation parameters, evaluates binding affinity scores across thousands of candidates overnight.
Metric: Binding Affinity
Manufacturing
Process Parameter Tuning
Optimize temperature, pressure, and timing parameters in simulation. Minimize defect rates and maximize yield without shutting down the production line.
Metric: Defect Rate
More Applications
Financial Modeling
Optimize trading strategies and risk models against historical data continuously.
Support Deflection
Test documentation variants to maximize self-serve resolution rate and reduce tickets.
Pricing Strategy
Test price points, tiers, and packaging to maximize revenue per transaction.
The Key Insight
AutoResearch is not a tool. It is a pattern. Karpathy himself said: "You don't 'use it' directly, it's just a recipe. Give it to your agent and apply to what you care about." Any domain with a measurable metric and a controllable input is a candidate. The 630 lines of code are a starting point, not a product.
Step-by-Step Setup

Getting Started in 15 Minutes

NVIDIA GPU
Tested on H100. Smaller GPUs work with adjustments (see tips).
Python 3.10+
Standard Python install. No exotic dependencies required.
uv Package Mgr
Modern Python package manager. One-line install via curl.
AI Coding Agent
Claude Code recommended by Karpathy. Any agent works.
1
Install the uv Package Manager
uv is a modern, fast Python package manager. One curl command installs it system-wide. It handles all AutoResearch dependencies automatically.
curl -LsSf https://astral.sh/uv/install.sh | sh
2
Clone the Repository & Install Dependencies
Pull the repo from GitHub and let uv handle the dependency resolution. No manual pip installs, no requirements.txt. Just clone and sync.
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync
3
Prepare Data & Tokenizer
Downloads the ClimbMix dataset from HuggingFace, trains a BPE tokenizer, and writes sharded binary data. Takes about 2 minutes. One-time setup only.
uv run prepare.py # ~2 min, one-time
4
Run Baseline Training
Execute the training script to establish reference metrics. This gives you the starting val_bpb that the agent will try to beat in each experiment cycle.
uv run train.py # establishes baseline
5
Launch the Autonomous Loop
Open the repo in your AI coding agent (Claude Code, Cursor, etc.). Tell it: "Read program.md and kick off a new experiment!" The agent creates a branch, starts experimenting, and you go to sleep. Wake up to results in results.tsv.
# In your AI agent, say:
# "Have a look at program.md and let's kick off a new experiment!"
# The agent handles everything from here.
Running on Smaller GPUs? Adjust These:
Use narrower datasets (e.g., TinyStories)
Reduce vocab_size from 8192 to 4096 or 2048
Lower MAX_SEQ_LEN in prepare.py
Decrease EVAL_TOKENS for faster validation
Reduce model DEPTH (default: 8)
Lower TOTAL_BATCH_SIZE in powers-of-2
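Taken together, the knobs above might look like the following constants near the top of train.py or prepare.py. The names come from the tips list; the values (and the repo's actual defaults) are assumptions for illustration:

```python
# Illustrative reduced-footprint settings for smaller GPUs.
# Names follow the tips above; exact defaults in the repo may differ.
VOCAB_SIZE       = 4096     # down from 8192
MAX_SEQ_LEN      = 512      # shorter context -> less activation memory
DEPTH            = 6        # fewer transformer blocks (default: 8)
TOTAL_BATCH_SIZE = 2 ** 15  # step down in powers of 2
EVAL_TOKENS      = 2 ** 18  # fewer tokens per validation pass

# Keeping batch size a power of 2 preserves clean gradient accumulation.
assert TOTAL_BATCH_SIZE & (TOTAL_BATCH_SIZE - 1) == 0, "keep powers of 2"
```

Because every experiment runs under the same 5-minute budget, shrinking these constants does not break comparability; it just shifts the whole search to a smaller model family.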
The Bigger Picture

Who Should Pay Attention

ML / AI Engineer
The most obvious fit. Run 100 architecture and hyperparameter experiments overnight instead of 3 per sprint. Let the agent find optimizations you would never think to try.
Growth / Marketing Lead
Apply the loop to ad copy, landing pages, email sequences, and pricing. Any metric you already track (CTR, conversion, reply rate) can be optimized autonomously.
Product Manager
program.md is essentially a PRD. Define "better," set constraints, let the agent execute. Run 100 activation experiments while you sleep. A PM who tests 100x faster wins.
Founder / CTO
Shopify's CEO got 19% gains overnight. If you have a measurable metric and a controllable input, this pattern can compound improvements across your entire stack.
Research Scientist
Drug discovery, materials science, climate modeling. Any simulation-based research with a scalar objective can be fed into the loop for 100x experiment velocity.
Data / Analytics Engineer
Optimize data pipelines, feature engineering, model selection. Use the loop to test transformation strategies against downstream accuracy or processing speed.
Karpathy's Vision: What Comes Next
"The next step is that it has to be asynchronously massively collaborative for agents. Think: SETI@home style. The goal is not to emulate a single PhD student, it's to emulate a research community of them."
Andrej Karpathy
He envisions spinning up swarms of agents, having them collaborate to tune smaller models, promoting the most promising ideas to larger scales, with humans contributing on the edges. Fortune magazine has already coined this "The Karpathy Loop." Welcome to the loopy era of AI.
autoresearch-macos
macOS (Apple Silicon)
autoresearch-mlx
Apple MLX Framework
autoresearch-win-rtx
Windows + RTX GPUs
autoresearch-amd
AMD GPU Support
generic-metric
Any Optimization Target
distributed-seti
Multi-Node Swarms
Start Experimenting Tonight
AutoResearch is MIT-licensed, 630 lines of Python, and requires zero API keys for the core loop. Clone it, write your program.md, and wake up to results.
github.com/karpathy/autoresearch
by Rizvi Haider