I made a GPT myself.
no cloud api calls, no fancy LLM's
everything literally coded & trained on my laptop
— Shrit (@Shrit1401) May 21, 2026

How did it start?

my aim was to try recreating a simple GPT model without any API calls or anything, of course the latest models are insanely good they can search web, reason with themselves, we are gonna try creating a simple transformer on which models like openai, claude are based

I didn't have that much time / compute to do something that crazy so whatever you'll see is like a small miniature version I tried replicating on my small macbook (M2 Air 8gb variant)

alright I started by watching Andrej Karpathy's this video https://youtu.be/kCc8FmEb1nY and to be frank this article won't be possible without this video, half of the things i'll talk about are things i learned from this video or the research papers he mentioned

I'm gonna tell main crux of the video and how i enhanced it, if you want a full video watch Karapathy's video, he has explained very nicely there

ShritGPT Github Repo: https://github.com/Shrit1401/ShritGPT

Data Collection

anyways before we begin i scraped all of my https://shrit.substack.com/ substack posts which I have been for two years and anything else I had on the internet, in the initial attempt when I fully trained the model the responses were something like

text
Me: who is shrit
ShritGPT: is study newsletter building so for properly not insane more study ships so intell and easy something cooking and hard problem. is hard problem. good problem dumb. one outreach is

which if u think is pretty cool that it can form words which are actually not in the training data, but it's really really wrong so I had go and add small section such as dm answers for tokens to more about me more easily

Screenshot 2026-05-20 at 12.50.09 PM

i personally don't like to use much filler words when talking to someone on text, if u want u can add filler words

I had roughly 1000029 characters in my data, i recommend something around 1-1.2million characters to have as a data

Interpreting the data

we can't really put so much text data files in the model and be like predict next token, we have to encode strings to make it into simple numbers so it's easy to read

Encoding / Decoding

we need method to encode and decode our data, openai using https://github.com/openai/tiktoken this library to encode & decode, for us simply doing this will work since we have very small data

Screenshot 2026-05-20 at 1.56.27 PM

Tokenizer Playgroundvocab: 99 · len: 18

hover a token to see its integer ID · spaces shown as · · real model uses 99-char vocab from sorted(set(corpus))

Data Loading

After splitting 90% of data for training we want our model to predict the next character, in the world of transformers we call them token, we wanted something to test and train our data so we try to train the model with the possible targets of the token

Screenshot 2026-05-20 at 2.01.23 PM

Training The Model (Bigram Language Model)

In order to train our data we want model to look at previous values as well, in most simplest form Bigram Language Model, it looks at the previous token and try to generate the next token

it takes input in something known as BTC - Batch, Time, Channel, and output in logits(raw scores from neural network) and loss(deviation with mode prediction and value, it updates model weights)

this model is actually logic for how character sequences are predicted

python
self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

We are using AdamW as training loop for the model, calculating train loss and val loss, more closer these both values more better

Note: with Bigram model even though we take all the previous tokens to in our memory, only the last token is used to predict the next, this is inefficient

this is our basic and simplest way to make a transformer, however if u run this you'll be nowhere close to the output, you'll still get some gibberish

we're on theory done with the model, but the model isn't that powerful to produce some nice output, so we're gonna try making it more optimized so as to see some results

Bigram Prediction Table

high problow probtop-3

click a row label to see the top-3 most likely next characters · tuned to the DM corpus — every line starts me: or them:

Self Attention

we want our model to utilize our memory of tokens, with our model we will want to use all our tokens to give as much information as possible, so self attention allows us to talk to diffrent tokens

so how do we do it?

we want tokens to prevent looking at the future so we usually use a triangular mask to hide any future data

Every token at every position produces two vectors, key and pair
we dot product of one query and key of all others
By Applying softmax function we calculate the weighted sum which is much better than just finding the average

then we divide it by root of head size (dk), this is called scaled attention it's a important normalization which has to be done

\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V

these values are known as weights, one of another crucial aspects of our model. we calculate attention in our Head class (forward function)

you'll see train loss an value loss to be more less, that's an imporvement there.

Self-Attention Heatmap

them

who

shrit

them

100

who

shrit

high attentionlow attention× masked (future)

hover a row to see its attention pattern · toggle to see the ÷√dk smoothing effect · scaled by 1/√64 (head_size = n_embd/n_head = 384/6)

Multi Head Attention

mutlihead attention is basically multiple self attention working in parallel to provide more compute and better results we it in MultiHeadAttention

after implementing this, you'll again see train value and loss value reducing

Multi-Head Attention

click a head tile to expand its full attention pattern · real model: 6 heads × 64-dim each = 384 (concatenated back to n_embd)

Feed Forward

\mathrm{FFN}(x) = \mathrm{ReLU}(xW_1 + b_1)W_2 + b_2

from experiment we usually find x to be in mutlitple of 4

in simple words we have given data to a token, we want to give time for every token to think, we're giving enough compute to each token so they have enough time to interact with each other

python
self.net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
    nn.Dropout(dropout),
)

Residual Dropout

It's a regularization technique where model shuts off a subset of neurons, the model is prevented from relying too heavily on any specific path, which helps to mitigate overfitting when the model is scaled up

Awesome our basic system to make a basic system optimized pipelin for the is done, it looks something like this

Architecture diagram

Scaling Up

up until now you'll be probably be working with very small set of data, now let's train the full data I had macbook air m2 8gb so my settings were

python
batch_size = 6
grad_accum_steps = 2
block_size = 384
max_iters = 25000
finetune_iters = 8000
eval_interval = 250
learning_rate = 3e-4
finetune_lr = 8e-5
eval_iters = 50
n_embd = 384
n_head = 6
n_layer = 6
dropout = 0.1
max_new_tokens = 200
temperature = 0.5
top_k = 25

it took roughly 30 minute and the output was pretty good there were lot of discrepancies but i could see the things being worked on.

Optimizing the model

it works well but we can optimize and show the text more nicely

Fine Tuning

I am basically trying to polish the mode, I tried to decay learning rate to slowly, instead of going straight zero like main model

Fine tuning learning rate decay

Temperature & K

After training, the model does not output one fixed next character. It outputs logits — raw scores for every character in the vocab. on a conversational model it often feels too random (weird jumps) or too repetitive (same phrases looping).

I added two knobs on top of that: temperature and top-k.

Temperature

Temperature scales logits before softmax:

python
logits = logits / max(temperature, 1e-8)
probs = F.softmax(logits, dim=-1)
idx_next = torch.multinomial(probs, num_samples=1)

Lower temperature (e.g. 0.5) → sharper distribution → model sticks to high-probability chars → more focused / conservative replies.
Higher temperature (e.g. 1.0+) → flatter distribution → more creative / chaotic text.

Top k

Even after temperature, the tail of the distribution can still contain thousands of low-probability characters (typos, random symbols). Top-k fixes that by only keeping the k highest logits and zeroing everything else before softmax:

python
v, _ = torch.topk(logits, k)
logits = logits.masked_fill(logits < v[-1], float("-inf"))

So you only sample from the top 25 (my default) most likely next characters. That cuts a lot of garbage without killing variety completely.

Sampling Playground

T=0.50k=25H=1.72 bitsreal defaults: T=0.5, k=25

temperature: 0.50

0.1 (focused)2.0 (creative)

top-k: 25

k=1 (greedy)k=40 (all)

drag sliders to reshape the distribution · blue bars = within top-k · press sample to draw a character · real sample_next: logits/max(T, 1e-8) → top-k mask → softmax → multinomial

Checkpoint & Point Loss

Screenshot 2026-05-21 at 11.28.24 AM

The Final model took ~270 minutes to train & finetune, it was really hard for me, plus if i wanted to keep talking to model with this model i have to wait for another 270min, instead of that I added checkpoints which will save the weights, and started logging the loss with time to generate graphs

Training Loss

hover to read off loss values · toggle fine-tune to see the lower-lr tail · val below train is expected — dropout (p=0.1) only active during training

Web UI

Training in the terminal is fine, but i wanted to make the user experience much better, so a simple api endpoint and a html file by claude made this

Screenshot 2026-05-21 at 11.46.24 AM

API Endpoints

Request Body

json
{
  "message": "who is shrit",
  "history": [{ "role": "user", "content": "hey" }, { "role": "assistant", "content": "yo" }],
  "temperature": 0.5,
  "top_k": 25,
  "max_tokens": 180
}

I added generate_stream() so each new character is yielded as it is sampled. Flask wraps that in Server-Sent Events (SSE):

text
data: {"text": "i'm"}
data: {"text": " shrit"}
data: {"done": true}

The frontend uses fetch + ReadableStream, parses data: {...} lines, and appends text to the bubble every few characters. If streaming 404s, it falls back to /api/chat.

After doing this changes, it took a whooping ~270 minutes for a full dataset compute in detail, but the results are still better.

Conclusion

We tried creating a small GPT, if u want the full code go to this repo: https://github.com/Shrit1401/ShritGPT, ofcourse even though I smashed everything about myself as data it's still not upto the mark of a proper llm like OpenAI's GPT 3, bcz we had restrain the compute and data, the model base will stay the same, to scale we might add more Notes to compute

Screenshot 2026-05-21 at 11.53.01 AM

The text is producing is not propr he doesn't know anything apart from me and all my details, not even proper english. It was really fun project honestly, a very nice way to know how llms respond

Sources

Attention is All u Need: https://arxiv.org/abs/1706.03762

Language Models are Few-Shot Learners: https://arxiv.org/abs/2005.14165

Deep Residual Learning for Image Recognition: https://arxiv.org/abs/1512.03385

YT Video: https://youtu.be/kCc8FmEb1nY

Making ur own GPT