This repository contains my personal implementation and experiments while working through Sebastian Raschka's book "Build a Large Language Model (From Scratch)".
This is a learning-focused repository where I've implemented the concepts from the book, including:
- Building a GPT-style large language model from scratch using PyTorch
- Understanding tokenization, embeddings, and attention mechanisms (a short tokenization example follows this list)
- Training and fine-tuning language models
- Instruction fine-tuning techniques
- Various experiments and exercises from the book
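As a small taste of the early chapters, here is a minimal tokenization sketch using tiktoken's GPT-2 encoding. This is my own toy example rather than code from the book, and the sample text is arbitrary:

```python
import tiktoken

# Load the byte-pair-encoding tokenizer used by GPT-2
tokenizer = tiktoken.get_encoding("gpt2")

text = "Build a Large Language Model (From Scratch)"

# Encode text into token IDs, allowing the special end-of-text token
token_ids = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(token_ids)

# Decode back to verify the round trip is lossless
print(tokenizer.decode(token_ids))
```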
The repository is organized as follows:

```
.
├── Chapters/                      # Jupyter notebooks for each chapter
│   ├── Chapter2.ipynb             # Tokenization and data preparation
│   ├── Chapter3.ipynb             # Attention mechanisms
│   ├── Chapter4.ipynb             # Implementing GPT from scratch
│   ├── Chapter5.ipynb             # Pretraining on unlabeled data
│   ├── Chapter6.ipynb             # Fine-tuning for classification
│   ├── Chapter7.ipynb             # Instruction fine-tuning
│   ├── Exercise_6_*.ipynb         # Chapter 6 exercises
│   ├── bells_and_whistles.ipynb   # Advanced training strategies (Appendix D)
│   ├── lora_classification.ipynb  # LoRA for classification (Appendix E)
│   ├── lora_instruction.ipynb     # LoRA for instruction fine-tuning
│   ├── alpaca_finetuning.ipynb    # Fine-tuning on Alpaca dataset (52K examples)
│   └── LLM-as-a-judge.ipynb       # Model evaluation using LLM-as-a-judge
├── data/                          # Training and test data
├── models/                        # Saved model checkpoints
└── *.py                           # Helper modules and utilities
```
The notebooks cover the following chapters and appendices:
- Chapter 2: Working with text data - tokenization and data sampling
- Chapter 3: Coding attention mechanisms
- Chapter 4: Implementing a GPT model from scratch
- Chapter 5: Pretraining on unlabeled data
- Chapter 6: Fine-tuning for classification tasks
- Chapter 7: Fine-tuning for instruction following
- Appendix D: Advanced training strategies (gradient clipping, cosine decay, learning rate warmup; sketched after this list)
- Appendix E: Parameter-efficient fine-tuning with LoRA
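As a rough illustration of the Appendix D material, the sketch below wires linear warmup, cosine decay, and gradient clipping into a toy PyTorch training loop. It is my own simplification using a placeholder model and random data rather than the book's implementation, and the hyperparameters are arbitrary.

```python
import math
import torch

# Placeholder model and optimizer; the real notebooks train a GPT-2-style model
model = torch.nn.Linear(16, 4)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-4, weight_decay=0.1)
loss_fn = torch.nn.MSELoss()

total_steps = 1_000
warmup_steps = 100

def lr_lambda(step: int) -> float:
    # Linear warmup to the peak learning rate, then cosine decay to zero
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(total_steps):
    inputs, targets = torch.randn(8, 16), torch.randn(8, 4)  # dummy batch

    optimizer.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    # Gradient clipping keeps exploding gradients in check
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    optimizer.step()
    scheduler.step()
```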
Beyond the main chapters, this repository includes additional experiments and implementations:
- LoRA Fine-tuning: Implementation of Low-Rank Adaptation (LoRA) for both classification and instruction fine-tuning tasks, demonstrating parameter-efficient training methods (a minimal LoRA layer sketch follows this list)
- Alpaca Dataset Fine-tuning: Fine-tuning experiments on the larger Stanford Alpaca dataset (52K examples) with comparisons between full fine-tuning and LoRA approaches
- LLM-as-a-Judge Evaluation: Automated model evaluation using Claude Haiku 4.5 to score model responses, comparing different fine-tuning approaches and prompt styles
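To give a feel for what the LoRA experiments involve, here is a minimal sketch of a LoRA-style linear layer in PyTorch. This is my own simplification rather than the book's Appendix E code; the class name, rank, and alpha values are illustrative.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer plus a trainable low-rank update: W x + (alpha / r) * B A x."""

    def __init__(self, linear: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.linear = linear
        self.linear.weight.requires_grad_(False)  # freeze the pretrained weight
        if self.linear.bias is not None:
            self.linear.bias.requires_grad_(False)

        in_features, out_features = linear.in_features, linear.out_features
        # Low-rank factors: A projects down to `rank`, B projects back up
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen base output plus the scaled low-rank correction
        return self.linear(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)


# Example: wrap an existing projection layer and count trainable parameters
base = nn.Linear(768, 768)
lora_layer = LoRALinear(base, rank=8, alpha=16.0)
trainable = sum(p.numel() for p in lora_layer.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # only the low-rank factors
```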
- Book: Build a Large Language Model (From Scratch)
- Author: Sebastian Raschka, Ph.D.
- Publisher: Manning Publications
- ISBN: 9781633437166
Official Resources:
- Official code repository: https://github.com/rasbt/LLMs-from-scratch
- Book page: https://www.manning.com/books/build-a-large-language-model-from-scratch
This project uses:
- Python 3.10+
- PyTorch
- Transformers (Hugging Face)
- tiktoken (OpenAI's tokenizer)
- Additional dependencies listed in `pyproject.toml`
To install dependencies:
```bash
uv sync
```

This repository contains code adapted from and inspired by Sebastian Raschka's book and official repository.
- Original code: Copyright (c) Sebastian Raschka under Apache License 2.0
- Adaptations and experiments: My personal implementations while learning from the book
If you find this repository useful, please cite the original book:
```bibtex
@book{build-llms-from-scratch-book,
  author    = {Sebastian Raschka},
  title     = {Build A Large Language Model (From Scratch)},
  publisher = {Manning},
  year      = {2024},
  isbn      = {978-1633437166},
  url       = {https://www.manning.com/books/build-a-large-language-model-from-scratch},
  github    = {https://github.com/rasbt/LLMs-from-scratch}
}
```

This repository is shared for educational purposes. The original book and code are licensed under the Apache License 2.0. See the LICENSE file for details.
I originally had Claude Code write this README, and it felt a bit fake to have Claude write the "Special thanks ..." part. So I decided to at least edit the acknowledgements section myself.
I would like to thank Sebastian Raschka for preparing such a good learning resource. Because I had mostly been working on the production side of ML recently, or on topics I understand better (e.g., uncertainty estimation), I felt a bit left behind by what has happened in NLP and, to be honest, also a bit intimidated. Working through the full GPT-2 implementation and the exercises helped me understand decoder-only models better, and I realized I was not as far behind as I had originally thought: most new developments amount to "take GPT-2, but do this differently in that layer", which makes them much easier to follow. Stripping a complicated topic like this down to its bare bones while keeping it understandable is a talent Raschka clearly possesses.