Andrej Karpathy's Open-Source Breakthrough

AutoResearch: Run 100
Experiments While You Sleep

A 630-line Python script that lets AI agents autonomously run, evaluate, and iterate on ML experiments overnight. No human needed.

45.7k
GitHub Stars
630
Lines of Code
~100
Experiments / Night
8.6M
Views in 48 Hours

AutoResearch flips the research paradigm: instead of a human manually tweaking parameters and running experiments one-by-one, an AI agent reads its own source code, forms hypotheses, rewrites the training logic, runs experiments, and evaluates outcomes.

You write instructions in plain English (program.md), point the agent at a training script, and go to sleep. By morning, you wake up to a full log of automated experiments and an optimized model. Every experiment is git-committed and logged.

Released March 6, 2026 under MIT license. One of the fastest-growing repositories in GitHub history, reaching 30k stars in its first week.

Creator
Andrej Karpathy. Former Tesla AI Director, OpenAI founding member, Stanford CS231n creator. One of the most respected voices in AI.
Stack
Python + PyTorch. No external dependencies beyond PyTorch. Single GPU. Uses the uv package manager. Works with Claude Code or any AI coding agent.
Core Principle
"Instead of directly improving the model, the human programs the experimental process using natural language."
"All LLM frontier labs will do this. It's the final boss battle."
Andrej Karpathy
Architecture

How AutoResearch Works

prepare.py
IMMUTABLE
Downloads the ClimbMix dataset from HuggingFace, trains a BPE tokenizer, and writes sharded binary data files. Run once, never touched again.
train.py
AGENT EDITS
Complete GPT model definition, Muon + AdamW optimizer, and training loop. ~630 lines. The only file the AI agent is allowed to modify.
program.md
HUMAN WRITES
Natural language instructions for the agent. Defines what to search for, constraints, and stopping criteria. Your "research org in English."
1
Read Instructions & Examine Code
Agent reads program.md, examines the current state of train.py, and reviews past experiment results from results.tsv.
2
Form Hypothesis & Modify Code
The agent proposes an improvement (architecture, hyperparams, optimizer), rewrites train.py, and git-commits the change with a description.
3
Run 5-Minute Training Cycle
Executes uv run train.py with a fixed 5-min wall-clock budget. Every experiment gets identical time, making results directly comparable.
4
Evaluate & Branch
Extracts val_bpb (validation bits-per-byte). Lower = better. Decides whether to keep or discard the change based on improvement.
Improved? Keep it.
Commit stays on branch. New baseline established. Agent builds on this improvement in the next cycle.
No gain? Revert.
Git reset to previous state. Experiment logged to results.tsv. Agent tries a different hypothesis next.
Editable Asset
One file the agent can modify. Keeps search space interpretable.
train.py
Scalar Metric
Single number to optimize. No human judgment needed.
val_bpb
Time Box
Fixed wall-clock budget. Every experiment is directly comparable.
5 minutes
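The val_bpb metric is just validation cross-entropy converted to bits per byte, which normalizes for the tokenizer's compression ratio so runs with different vocabularies stay comparable. A minimal sketch of that conversion (the helper name and the token-to-byte ratio here are illustrative assumptions, not the repo's actual code):

```python
import math

def bits_per_byte(ce_loss_nats: float, tokens: int, raw_bytes: int) -> float:
    """Convert mean cross-entropy loss (nats/token) to bits/byte.

    Dividing by ln(2) converts nats to bits; multiplying by the
    tokens-per-byte ratio normalizes for tokenizer compression.
    (Illustrative helper; names are assumptions.)
    """
    bits_per_token = ce_loss_nats / math.log(2)
    return bits_per_token * tokens / raw_bytes

# Example: 2.77 nats/token at roughly 1 token per 4 bytes
print(round(bits_per_byte(2.77, tokens=1, raw_bytes=4), 3))  # -> 0.999
```

Lower is better: a model that predicts the validation bytes more confidently needs fewer bits to encode them.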
Non-Negotiable Rule from program.md:
"Once the experiment loop has begun, do NOT pause to ask the human if you should continue. The human may be asleep; continuous iteration is expected." The agent runs autonomously until interrupted.
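The keep-or-revert cycle described above can be sketched in a few lines of Python. This is a hedged reconstruction, not the repo's actual driver: it assumes train.py prints a line like `val_bpb: <float>` at the end of its output, which is an assumption made for illustration.

```python
import subprocess

BUDGET_S = 5 * 60  # identical wall-clock budget for every experiment

def parse_val_bpb(stdout: str) -> float:
    """Pull val_bpb off the last output line.

    Assumes train.py prints 'val_bpb: <float>' last -- an assumption
    for this sketch, not the repo's documented contract.
    """
    return float(stdout.strip().splitlines()[-1].split("val_bpb:")[-1])

def improved(new_bpb: float, best_bpb: float) -> bool:
    """Lower bits-per-byte is better; strict improvement keeps the commit."""
    return new_bpb < best_bpb

def run_cycle(best_bpb: float) -> float:
    """One experiment: train under the time box, then keep or revert."""
    out = subprocess.run(
        ["uv", "run", "train.py"],
        capture_output=True, text=True,
        timeout=BUDGET_S + 30,  # small grace period past the 5-min box
    )
    bpb = parse_val_bpb(out.stdout)
    if improved(bpb, best_bpb):
        return bpb  # commit stays on the branch; new baseline
    subprocess.run(["git", "reset", "--hard", "HEAD~1"])  # discard change
    return best_bpb
```

The fixed time box is what makes the loop honest: every hypothesis gets exactly the same compute, so a lower val_bpb is attributable to the code change, not to a longer run.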
Commit    val_bpb   VRAM (GB)   Status   Description
a3f7c21   0.9979    38.2        keep     Baseline run
b8e2d44   0.9891    39.1        keep     Increase depth 8→10
c1a9f67   0.0000    0.0         crash    OOM on batch_size 2x
d5b3e89   0.9812    37.8        keep     RoPE embeddings
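Because every experiment lands in results.tsv, the log is trivially machine-readable. A short sketch of mining it for the best kept run; the column names mirror the table above, though the real file's layout may differ:

```python
import csv
import io

def best_kept_run(tsv_text: str) -> dict:
    """Return the kept experiment with the lowest val_bpb.

    Column names mirror the example table; the actual results.tsv
    schema may differ.
    """
    rows = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    kept = [r for r in rows if r["status"] == "keep"]
    return min(kept, key=lambda r: float(r["val_bpb"]))

log = (
    "commit\tval_bpb\tstatus\n"
    "a3f7c21\t0.9979\tkeep\n"
    "c1a9f67\t0.0000\tcrash\n"
    "d5b3e89\t0.9812\tkeep\n"
)
print(best_kept_run(log)["commit"])  # -> d5b3e89
```

Note the crashed run is filtered out before taking the minimum; its val_bpb of 0.0000 would otherwise win the comparison.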
Proven Results

What Happened Overnight

Karpathy's Own Runs
700
experiments in 2 days
Session 1: 89 experiments
Session 2: 126 experiments
Optimizations found: 20 total
Training speedup: 11%
Shopify CEO (Tobias Lütke)
19%
performance gain overnight
Experiments run: 37 autonomous
Liquid engine: 53% faster
Object allocations: 61% fewer
His reaction: "Totally insane"
Aspect               Manual Research                       AutoResearch
Experiments / Night  1-3 (if researcher stays late)        ~100 autonomous
Human Involvement    Constant: design, run, analyze each   Write program.md, then sleep
Consistency          Variable (fatigue, cognitive bias)    Identical 5-min budget each
Hypothesis Gen       Limited by time & human creativity    LLM generates continuously
Documentation        Often incomplete or missing           Every experiment git-committed
Cost                 Researcher salary + GPU time          Just GPU time (agent is free)
30k
Stars in 1 Week
One of the fastest-growing repos in GitHub history
6.3k
Forks
Community ports to macOS, Windows, AMD, and MLX within days
8.6M
Tweet Views
Karpathy's announcement hit 8.6M views in just 48 hours
"OK this thing is totally insane. Ran 37 experiments overnight on our internal data and got a 19% performance gain. This is the future."
Tobias Lütke, CEO of Shopify
Beyond Machine Learning

Use Cases Across Industries

Measurable Metric + Controllable Input + Repeatable Process = AutoResearch Loop
Any domain with these three properties can use the autonomous experiment pattern. The AI agent handles the loop; you define what "better" means.
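Stripped of its ML specifics, the pattern is a generic keep-or-revert search over any measurable metric. A toy sketch, with every name here illustrative rather than taken from the repo:

```python
import random

def autoresearch_loop(mutate, evaluate, baseline, cycles=100):
    """Generic keep-or-revert search over any measurable metric.

    mutate(config) proposes a variant; evaluate(config) returns the
    metric, lower is better. All names are illustrative assumptions.
    """
    best_cfg, best_score = baseline, evaluate(baseline)
    for _ in range(cycles):
        candidate = mutate(best_cfg)
        score = evaluate(candidate)
        if score < best_score:  # improved? keep it as the new baseline
            best_cfg, best_score = candidate, score
    return best_cfg, best_score

# Toy usage: minimize (x - 3)^2 by random perturbation of the input
cfg, score = autoresearch_loop(
    mutate=lambda x: x + random.uniform(-0.5, 0.5),
    evaluate=lambda x: (x - 3) ** 2,
    baseline=0.0,
)
```

Swap `evaluate` for a CTR lookup, a reply-rate query, or a defect-rate simulation and the same loop applies; the only hard requirements are the three properties above.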
Marketing
Ad Performance Optimization
Agent generates copy and creative variants, deploys through API, measures CTR and CPA. Iterates on winning combinations overnight.
Metric: CTR / CPA
Sales
Cold Email Outreach
Test subject lines, body copy, CTAs, and send timing. Teams report going from 2-4% baseline reply rate to 8-12% within weeks.
Metric: Reply Rate
Product
Activation & Onboarding
Vary in-app messaging sequences, tooltip copy, and onboarding step order. Measure time-to-activation and completion rate.
Metric: Activation Rate
Revenue
Landing Page Conversion
AI generates headline, layout, and CTA variants. Deploys via API, tracks conversions. Keeps winners, discards losers automatically.
Metric: Conversion %
Biotech
Drug Discovery Simulation
Autonomous experimentation on molecular configurations. Agent modifies simulation parameters, evaluates binding affinity scores across thousands of candidates overnight.
Metric: Binding Affinity
Manufacturing
Process Parameter Tuning
Optimize temperature, pressure, and timing parameters in simulation. Minimize defect rates and maximize yield without shutting down the production line.
Metric: Defect Rate
More Applications
Financial Modeling
Optimize trading strategies and risk models against historical data continuously.
Support Deflection
Test documentation variants to maximize self-serve resolution rate and reduce tickets.
Pricing Strategy
Test price points, tiers, and packaging to maximize revenue per transaction.
The Key Insight
AutoResearch is not a tool. It is a pattern. Karpathy himself said: "You don't 'use it' directly, it's just a recipe. Give it to your agent and apply to what you care about." Any domain with a measurable metric and a controllable input is a candidate. The 630 lines of code are a starting point, not a product.
Step-by-Step Setup

Getting Started in 15 Minutes

NVIDIA GPU
Tested on H100. Smaller GPUs work with adjustments (see tips).
Python 3.10+
Standard Python install. No exotic dependencies required.
uv Package Mgr
Modern Python package manager. One-line install via curl.
AI Coding Agent
Claude Code recommended by Karpathy. Any agent works.
1
Install the uv Package Manager
uv is a modern, fast Python package manager. One curl command installs it system-wide. It handles all AutoResearch dependencies automatically.
curl -LsSf https://astral.sh/uv/install.sh | sh
2
Clone the Repository & Install Dependencies
Pull the repo from GitHub and let uv handle the dependency resolution. No manual pip installs, no requirements.txt. Just clone and sync.
git clone https://github.com/karpathy/autoresearch
cd autoresearch && uv sync
3
Prepare Data & Tokenizer
Downloads the ClimbMix dataset from HuggingFace, trains a BPE tokenizer, and writes sharded binary data. Takes about 2 minutes. One-time setup only.
uv run prepare.py # ~2 min, one-time
4
Run Baseline Training
Execute the training script to establish reference metrics. This gives you the starting val_bpb that the agent will try to beat in each experiment cycle.
uv run train.py # establishes baseline
5
Launch the Autonomous Loop
Open the repo in your AI coding agent (Claude Code, Cursor, etc.). Tell it: "Read program.md and kick off a new experiment!" The agent creates a branch, starts experimenting, and you go to sleep. Wake up to results in results.tsv.
# In your AI agent, say:
# "Have a look at program.md and let's kick off a new experiment!"
# The agent handles everything from here.
Running on Smaller GPUs? Adjust These:
Use narrower datasets (e.g., TinyStories)
Reduce vocab_size from 8192 to 4096 or 2048
Lower MAX_SEQ_LEN in prepare.py
Decrease EVAL_TOKENS for faster validation
Reduce model DEPTH (default: 8)
Lower TOTAL_BATCH_SIZE in powers-of-2
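Taken together, the knobs above might look like the following constants near the top of train.py or prepare.py. The names come from the tips list; the values (and the repo's actual defaults) are assumptions for illustration:

```python
# Illustrative reduced-footprint settings for smaller GPUs.
# Names follow the tips above; exact defaults in the repo may differ.
VOCAB_SIZE       = 4096     # down from 8192
MAX_SEQ_LEN      = 512      # shorter context -> less activation memory
DEPTH            = 6        # fewer transformer blocks (default: 8)
TOTAL_BATCH_SIZE = 2 ** 15  # step down in powers of 2
EVAL_TOKENS      = 2 ** 18  # fewer tokens per validation pass

# Keeping batch size a power of 2 preserves clean gradient accumulation.
assert TOTAL_BATCH_SIZE & (TOTAL_BATCH_SIZE - 1) == 0, "keep powers of 2"
```

Because every experiment runs under the same 5-minute budget, shrinking these constants does not break comparability; it just shifts the whole search to a smaller model family.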
The Bigger Picture

Who Should Pay Attention

ML / AI Engineer
The most obvious fit. Run 100 architecture and hyperparameter experiments overnight instead of 3 per sprint. Let the agent find optimizations you would never think to try.
Growth / Marketing Lead
Apply the loop to ad copy, landing pages, email sequences, and pricing. Any metric you already track (CTR, conversion, reply rate) can be optimized autonomously.
Product Manager
program.md is essentially a PRD. Define "better," set constraints, let the agent execute. Run 100 activation experiments while you sleep. A PM who tests 100x faster wins.
Founder / CTO
Shopify's CEO got 19% gains overnight. If you have a measurable metric and a controllable input, this pattern can compound improvements across your entire stack.
Research Scientist
Drug discovery, materials science, climate modeling. Any simulation-based research with a scalar objective can be fed into the loop for 100x experiment velocity.
Data / Analytics Engineer
Optimize data pipelines, feature engineering, model selection. Use the loop to test transformation strategies against downstream accuracy or processing speed.
Karpathy's Vision: What Comes Next
"The next step is that it has to be asynchronously massively collaborative for agents. Think: SETI@home style. The goal is not to emulate a single PhD student, it's to emulate a research community of them."
Andrej Karpathy
He envisions spinning up swarms of agents, having them collaborate to tune smaller models, promoting the most promising ideas to larger scales, with humans contributing on the edges. Fortune magazine has already coined this "The Karpathy Loop." Welcome to the loopy era of AI.
autoresearch-macos
macOS (Apple Silicon)
autoresearch-mlx
Apple MLX Framework
autoresearch-win-rtx
Windows + RTX GPUs
autoresearch-amd
AMD GPU Support
generic-metric
Any Optimization Target
distributed-seti
Multi-Node Swarms
Start Experimenting Tonight
AutoResearch is MIT-licensed, 630 lines of Python, and requires zero API keys for the core loop. Clone it, write your program.md, and wake up to results.
github.com/karpathy/autoresearch
by Rizvi Haider