Learn and build your own LLM
Dive deep into the architecture of Large Language Models. From understanding transformers to training a custom miniature LLM from scratch.
Full access to curriculum, live sessions, systems architecture guidance, and private cohort network.
Dedicated A100 GPU Sandbox for training runs and 25+ hours of implementation-focused labs.
Secure checkout via Stripe / Global Cards
About this program
This program is not about prompt engineering. It's for engineers who want to understand the exact mathematical and architectural foundations of modern AI. You will build a transformer from the ground up in PyTorch, write your own tokenizer, curate a dataset, and execute a full training loop on a GPU.
Who is this for?
Backend Engineers, Data Scientists, AI Researchers
What you'll actively build & learn
Understanding Fundamentals
Grasp the core mechanics of AI systems, from transformers to retrieval algorithms, moving beyond superficial APIs.
Production-Ready Architecture
Learn how to architect scalable, resilient generative AI applications that handle edge cases and high throughput.
Hands-on Engineering
Write custom PyTorch models, build multi-agent swarms using LangGraph, and deploy to Kubernetes.
Verifiable Execution
Complete rigorous capstone projects that serve as a proof-of-work portfolio for your next AI engineering role.
Time Commitment & Schedule
Live Engineering
2-3 hrs / week
Deep-dive interactive technical sessions focusing on architecture, code walkthroughs, and edge cases. Fully recorded.
Independent Build
4-6 hrs / week
Asynchronous reading materials, implementing weekly milestones, and collaborating via Discord to get unblocked on code issues.
Weekly Syllabus
Each week is structured around three things: what you'll cover, what capability you'll walk away with, and the concrete deliverable that moves you toward the final capstone.
8 weeks with guided build milestones
A trained and presented miniature LLM system
Weekly implementation-focused labs plus capstone reviews
The Mathematics of Attention
- We begin by dissecting the core mathematical operations behind Scaled Dot-Product Attention and Multi-Head Attention.
- You will manually implement these mechanisms in PyTorch, avoiding high-level abstractions, to deeply understand tensor broadcasting, masking strategies, and how Rotary Positional Embeddings (RoPE) encode position to keep sequences context-aware.
Understand and implement attention mechanics from first principles.
A working attention notebook with tensor walkthroughs.
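The core equation of this week, Attention(Q, K, V) = softmax(QKᵀ/√d_k)·V, can be sketched in a few lines of framework-free Python. This is an illustrative toy on plain lists, not the cohort's PyTorch implementation:

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of vectors (lists of floats); d_k is the key dimension.
    """
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Each output row is a convex combination of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Tiny 2-token example with 2-dimensional heads.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(scaled_dot_product_attention(Q, K, V))
```

The in-course version adds batching, causal masking, and multi-head splitting on top of this same computation.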
Architecting the Transformer Block
- Moving beyond isolated layers, we assemble the full Decoder-only transformer architecture.
- We'll implement Feed-Forward Networks (FFNs) with SwiGLU activations, RMSNorm for training stability, and residual connections.
- By the end of this week, you will have a functional, untrained model capable of forward passing dummy tensors.
Assemble the core transformer stack and validate forward passes.
A modular decoder-only transformer implementation.
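One of the week's building blocks, RMSNorm, is simple enough to sketch framework-free. This is a plain-Python illustration of the math, not the PyTorch module you'll write in the cohort:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    """RMSNorm: divide x by its root-mean-square, then apply a learned scale.

    Unlike LayerNorm there is no mean subtraction and no bias, which is
    cheaper and, empirically, just as stable in deep transformer stacks.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for v, w in zip(x, weight)]

# With a unit weight vector, the output has mean-square ~1.
print(rms_norm([1.0, 2.0, 3.0, 4.0], [1.0] * 4))
```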
Tokenization & Data Engineering
- Models are only as good as their data.
- You will build a Byte-Pair Encoding (BPE) tokenizer from scratch, understanding vocabulary compression and out-of-vocabulary (OOV) handling.
- We then write custom PyTorch DataLoaders to efficiently stream and pre-process gigabytes of text data without blowing up system memory.
Prepare a training-ready text pipeline with tokenizer and loaders.
A tokenizer build and data pipeline for model training.
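The heart of BPE training is one repeated step: find the most frequent adjacent pair of symbols and merge it into a new vocabulary entry. A minimal sketch of that loop (character-level, no byte handling or special tokens):

```python
from collections import Counter

def most_frequent_pair(tokens):
    # Count adjacent symbol pairs in the token stream.
    pairs = Counter(zip(tokens, tokens[1:]))
    return max(pairs, key=pairs.get)

def merge_pair(tokens, pair):
    # Replace every occurrence of `pair` with a single merged symbol.
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

tokens = list("low lower lowest")
for _ in range(3):  # three BPE merge rounds
    pair = most_frequent_pair(tokens)
    tokens = merge_pair(tokens, pair)
print(tokens)
```

A real tokenizer records the sequence of learned merges so the same compression can be replayed on unseen text, which is also how OOV handling falls out for free at the byte level.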
The Training Loop & Optimization
- We dive headfirst into backpropagation and optimization.
- You'll construct the training loop, configure AdamW optimizers with weight decay, implement Cosine Annealing learning rate schedulers with warmup, and handle vanishing/exploding gradients.
- We also introduce Mixed Precision Training (FP16/BF16) to drastically accelerate computation.
Train the model with a stable optimization loop.
A reproducible training run with metrics and checkpoints.
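The learning-rate schedule from this week is easy to state as a standalone function: linear warmup to the peak rate, then cosine decay. A framework-free sketch (the cohort uses the equivalent PyTorch scheduler APIs):

```python
import math

def lr_at_step(step, max_steps, peak_lr, warmup_steps, min_lr=0.0):
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        # Ramp linearly so early, noisy gradients don't destabilize training.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    # Cosine anneal from peak_lr down to min_lr.
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# e.g. peak 3e-4 with 100 warmup steps out of 1000 total
print([round(lr_at_step(s, 1000, 3e-4, 100), 6) for s in (0, 99, 100, 550, 999)])
```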
Scaling Up: Distributed Training
- A single GPU isn't enough for modern foundation models.
- This week focuses entirely on parallelization strategies.
- We will cover Distributed Data Parallel (DDP) and Fully Sharded Data Parallel (FSDP), teaching you how to orchestrate multi-GPU sweeps and handle cross-node communication bottlenecks effectively.
Understand how training scales beyond a single-device setup.
A distributed training experiment plan and working setup.
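The core idea behind DDP can be stated in one function: after each backward pass, every replica's local gradients are all-reduced to their mean, so every worker takes an identical optimizer step. A single-process simulation of that invariant (`all_reduce_mean` is an illustrative name, not a real API; in practice this is `torch.distributed`'s NCCL all-reduce):

```python
def all_reduce_mean(grads_per_worker):
    """Simulate DDP's gradient all-reduce: every worker ends up holding the
    mean gradient across all replicas, keeping their parameters in lockstep."""
    n_workers = len(grads_per_worker)
    n_params = len(grads_per_worker[0])
    mean = [sum(g[i] for g in grads_per_worker) / n_workers for i in range(n_params)]
    return [list(mean) for _ in range(n_workers)]  # each worker gets a copy

# Two workers computed different local gradients on their data shards.
local_grads = [[1.0, 2.0], [3.0, 4.0]]
print(all_reduce_mean(local_grads))  # every worker now holds [2.0, 3.0]
```

FSDP goes further by sharding the parameters, gradients, and optimizer state themselves across workers, trading extra communication for a much smaller per-GPU memory footprint.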
Fine-Tuning & Local Adaptation
- Pre-training teaches the model language; fine-tuning teaches it behavior.
- We will explore Instruction Tuning pipelines and implement Parameter-Efficient Fine-Tuning (PEFT) methods—specifically LoRA and QLoRA.
- You'll learn how to inject low-rank adapters into your base model to teach it specific dialects or tasks cheaply.
Adapt a base model to a narrower behavior or instruction set.
A fine-tuned checkpoint using a PEFT workflow.
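The arithmetic behind LoRA fits in a few lines: the frozen weight W gets a trainable low-rank update (alpha / r) · B·A, where B is d_out x r and A is r x d_in. A framework-free sketch of the effective weight (the cohort uses PEFT-style adapter injection rather than materializing it like this):

```python
def matmul(A, B):
    # Plain list-of-lists matrix multiply.
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def lora_weight(W, A, B, alpha, r):
    """Effective weight W + (alpha / r) * (B @ A).

    W (d_out x d_in) stays frozen; only A (r x d_in) and B (d_out x r) train,
    i.e. (d_out + d_in) * r parameters instead of d_out * d_in.
    """
    scale = alpha / r
    delta = matmul(B, A)
    return [[w + scale * d for w, d in zip(wr, dr)] for wr, dr in zip(W, delta)]

# Rank-1 adapter on a 2x2 frozen identity weight.
W = [[1.0, 0.0], [0.0, 1.0]]
B = [[1.0], [2.0]]   # d_out x r
A = [[0.5, 0.5]]     # r x d_in
print(lora_weight(W, A, B, alpha=2, r=1))  # [[2.0, 1.0], [2.0, 3.0]]
```

QLoRA applies the same update on top of a 4-bit quantized base model, which is what makes fine-tuning cheap on a single GPU.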
Inference Optimization & Serving
- Generating text naively is incredibly slow.
- We address production inference bottlenecks by implementing KV Caching to prevent redundant computations and integrating Flash Attention for memory-efficient processing.
- Finally, we package the model for serving using vLLM to achieve high-throughput concurrency.
Prepare the model for practical serving and faster inference.
A deployable serving setup with optimized inference path.
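Why KV caching matters can be shown with a back-of-the-envelope cost model (a counting sketch, not production code): without a cache, every decode step re-runs attention over the entire sequence; with a cache, only the newest query attends to stored keys and values.

```python
def naive_decode_cost(prompt_len, n_new):
    # Without a cache, step t re-runs attention over the full sequence:
    # roughly (prompt_len + t)^2 pairwise interactions per step.
    return sum((prompt_len + t) ** 2 for t in range(1, n_new + 1))

def cached_decode_cost(prompt_len, n_new):
    # With a KV cache, only the newest token's query attends to the stored
    # keys/values: roughly (prompt_len + t) interactions per step.
    return sum(prompt_len + t for t in range(1, n_new + 1))

# Generating 128 tokens after a 512-token prompt:
print(naive_decode_cost(512, 128), cached_decode_cost(512, 128))
```

For this example the naive approach does several hundred times more attention work; Flash Attention and vLLM's paged KV cache then make the cached path memory-efficient at scale.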
Capstone: Model Presentation
- The ultimate test of your execution.
- You will demo your custom-trained, end-to-end LLM to the cohort and industry guests.
- You'll discuss your specific architectural choices, data curation hurdles, loss curves over time, and demonstrate your model running inference live on specific prompt tasks.
Present a complete end-to-end LLM system and defend the design choices.
A capstone demo, presentation, and model evaluation summary.
The syllabus builds toward a final proof of work.
The weekly syllabus is designed to stack toward a capstone that demonstrates what you can actually build. By the end of the cohort, you are not just finishing modules. You are presenting a concrete output that ties the learning arc together.
View Alumni Capstones
Industry-Grade Certification
Earn a credential that actually matters. Every certificate is tied to your Capstone Project repo, valid for life, and optimized for your professional technical profile.
View Certification Tiers
Engineering Trust
Our alumni don't just 'use' AI. They architect the core infrastructure at forward-thinking engineering labs. This is a high-trust collective of senior talent.
"We've created a zero-noise environment for senior talent. This is where staff and principal engineers from Silicon Valley and beyond come to cross-pollinate their knowledge of agentic systems and distributed training."
The most technically rigorous program I've attended. No fluff, just pure architectural deep-dives into transformer blocks and swarm logic. This isn't just about calling APIs; it's about understanding the stochastic internals of LLMs.
LangGraph and Multi-agent orchestration was the missing link for our production pipeline. Highly recommended for senior devs who need to move beyond single-prompt engineering into complex, stateful workflows.
Direct 1:1 access to instructors who are actually shipping AI products. The focus on evaluations and evals-driven-dev is unique. We've implemented their RAG evaluation pipeline for our entire stealth startup.
Lead Instructor
Deep pedagogical philosophy balanced with production engineering rigor.
Meet Anubhav
Anubhav is an AI solutions and engineering leader with two decades of global experience executing machine learning, generative AI, and physical intelligence initiatives.
With a proven track record of founding startups and building 0-to-1 engineering teams, he has architected and delivered production-grade systems across B2B SaaS, industrial robotics, sports tech, and massive-scale consumer streaming platforms serving over 600 million users.
At Skilling Academy, he personally mentors every student, bringing extensive experience in enterprise strategy, multi-agent workflows, computer vision, and scalable distributed architectures from the boardroom to the IDE.
Technical Expertise
- Transformers / Attention
- GNNs & Graph Search
- RLHF / DPO Alignment
- Distributed Training
- vLLM / NVIDIA Triton
- Kubernetes / Ray
- VectorDB Scaling
- Hybrid Retrieval
- Knowledge Graphs
- Autonomous Execution
- ReAct / Tool-use
- Planner Architectures
System FAQ
Addressing technical edge cases and curriculum logistics for the committed engineer.
Our cohorts are crafted for mid-to-senior level software engineers, data scientists, and technical product managers who are comfortable with Python and basic web architecture. If you've been 'prompt engineering' but want to understand the underlying mechanics—transformer blocks, vector algebra, and autonomous agent orchestration—this is for you.
Plan for 6-8 hours of focused effort per week. This breaks down into 2 hours of live, interactive deep-dives on Saturdays, 1 hour of midweek Q&A/Office Hours, and 3-5 hours of dedicated hands-on project implementation where you'll build production-ready AI modules.
Life happens. Every live session is recorded in 4K and uploaded to our private portal within 2 hours. You'll have lifetime access to these recordings, including all updated versions of the curriculum. Our Discord community and mentors are active 24/7 to help you get back on track.
Not necessarily. While we discuss hardware optimization, most of our practical work utilizes cloud-based environments (Google Colab, Modal, or Lambda Labs). We provide credits and setup guides so you can run large-scale inference and fine-tuning without burning through your own hardware.
We keep cohorts focused (max 60) to maintain a high mentor-to-student ratio. You’ll be split into smaller review pods, and you’ll get dedicated feedback via office hours and code review workflows. This keeps discussions high-bandwidth and practical.
We teach from first principles. While we use popular frameworks for speed, we spend significant time building core components (like custom RAG retrievers or ReAct loops) from scratch. This ensures that when the next big framework arrives, you'll understand exactly how it works under the hood.
Absolutely. Our final project is a portfolio-grade AI system that solves a real business problem. We also provide a dedicated session on the AI Engineering interview landscape, resume reviews for technical roles, and introductions to our network of hiring partners in the AI space.
We want you to be 100% satisfied. If after the first week you feel the cohort isn't the right fit, we offer a full, no-questions-asked refund. Our goal is to build a community of committed builders, and we stand by the quality of our curriculum.
Yes. All students get lifetime access to our internal repository of production-ready templates, deployment scripts, and evaluation benchmarks. These are the same tools our instructors use to build and scale AI solutions in their day-to-day professional work.
Upon successful submission and review of your final 3 project modules, you will receive a cryptographically signed digital certificate. This certificate is recognized by our network of partner companies and can be directly shared on LinkedIn or included in your professional portfolio.