I wanted to understand how LLMs work so I dug into some research.
This AI train is moving fast, and every business I talk to, my own included, is trying to adapt and adopt as quickly as it can, both for its own operations and to add value to the clients it serves.
I ended up going down a bit of a rabbit hole, so I apologize if any of this is overly granular, but I found it fascinating and wanted to share how LLMs actually work.
This blog unpacks the fundamental mechanisms behind LLMs, from text extraction to tokenization and prediction, offering a comprehensive look at how these models generate responses with remarkable accuracy.
I. Text Extraction: The Foundation of LLMs
At their core, LLMs are trained on vast amounts of text data extracted from the internet. This extraction process involves crawling publicly available web pages while excluding certain categories like adult content, gambling, and dating sites.
The primary goal is to collect raw textual data while ignoring any underlying web page structures, such as HTML markup, CSS, and JavaScript, that contribute to user experience but hold no value for language processing.
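As a rough illustration of that idea, here is a minimal Python sketch of the "keep the text, drop the markup" step, using only the standard library. The TextOnly class and the sample page are invented for illustration; real crawling pipelines are far more sophisticated, but the principle is the same.

```python
# Minimal sketch: pull visible text out of a page and ignore the markup.
# The sample page and the TextOnly class are made up for illustration only.
from html.parser import HTMLParser

class TextOnly(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skipping = False   # True while inside <script> or <style>
        self.chunks = []        # the visible text we want to keep

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skipping = True

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self.skipping = False

    def handle_data(self, data):
        if not self.skipping and data.strip():
            self.chunks.append(data.strip())

page = """<html><head><style>p { color: red; }</style></head>
<body><p>Your gut is home to trillions of bacteria.</p>
<script>trackVisitor();</script></body></html>"""

parser = TextOnly()
parser.feed(page)
print(" ".join(parser.chunks))   # -> Your gut is home to trillions of bacteria.
```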
If you right-click on any web page and select ‘Inspect’, you will see the actual content alongside all the additional code and markup that lives on the page but that the LLM does not care about:
Once extracted, this text data exists as an enormous, unstructured text repository—a random collection of sentences pulled from countless sources of web pages on the internet.
Here is an example of that data dump in notepad (text editor). You can see it is simply random text/copy/sentences, taken from websites on the internet.
There is no rhyme or reason to their ordering; it is simply a data dump.
Take, for example, this sentence of random copy (highlighted above) pulled from one website; it may have nothing to do with the sentences that come before or after it in our data set:
“Your gut, often called the “second brain,” is home to trillions of bacteria that not only help digest your food but also influence your mood, immune system, and even decision-making in ways scientists are only beginning to understand.”
To give a sense of the volume of data, I decreased the font size so you can see how much text is there while our sample sentence remains highlighted. Of course, even this represents only a minuscule fraction of the overall data collected.
What I found fascinating:
Although this data is vast in scale, researchers can store all the world’s online text (excluding certain sites) in roughly 44 terabytes (TB) of storage.
This means that in just 44 USB drives, each holding 1TB, one could theoretically store the entire textual internet!
II. Binary Representation and the Need for Tokenization
Computers use binary code, meaning they first convert all extracted text into sequences of 0s and 1s.
This is what our dataset from above looks like in binary code.
Looking at our sample of random copy:
“Your gut, often called the “second brain,” is home to trillions of bacteria that not only help digest your food but also influence your mood, immune system, and even decision-making in ways scientists are only beginning to understand.”
Take the first section of the sentence:
“Your gut, often called the “second brain,”
This sentence, translated to “bits” or a sequence of binary code, looks like the following:
01011001 01101111 01110101 01110010 00100000 01100111 01110101 01110100 00101100 00100000 01101111 01100110 01110100 01100101 01101110 00100000 01100011 01100001 01101100 01101100 01100101 01100100 00100000 01110100 01101000 01100101 00100000 11100010 10000000 10011110 01110011 01100101 01100011 01101111 01101110 01100100 00100000 01100010 01110010 01100001 01101001 01101110 11100010 10000000 10011010
You can see that this binary representation is significantly longer than the original text, which increases the amount of storage and computing power required to work with it. To be more efficient, we need a way to compress these lengthy sequences into something shorter.
If we look at the first letter “Y” in our sentence, its binary representation is 01011001.
01011001 = Y
If we swap the last two digits (the final 0 and 1), the byte no longer represents “Y” but the letter “Z”:
01011010 = Z
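If you want to reproduce this yourself, here is a quick Python sketch of the text-to-bits step, assuming UTF-8 encoding, where each plain English character takes one byte (eight bits):

```python
# Convert our sample text to its binary representation (UTF-8 assumed).
text = "Your gut, often called the"
bits = " ".join(f"{byte:08b}" for byte in text.encode("utf-8"))
print(bits[:35])            # 01011001 01101111 01110101 01110010  -> "Your"

# Flipping the last two digits turns "Y" into "Z", as described above.
print(f"{ord('Y'):08b}")    # 01011001
print(f"{ord('Z'):08b}")    # 01011010
```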
However, representing every character as eight bits results in an overwhelming amount of data, making direct processing computationally expensive.
To optimize efficiency, LLMs use tokenization—a technique that compresses binary data into manageable units called tokens.
Tokenization involves breaking text down into standardized symbols that a model can process. Instead of storing full words or phrases as raw binary, LLMs translate these elements into a compact set of tokens, which act as unique identifiers for language components.
If you recall learning permutations in high school, you know that a string of eight binary digits has exactly 256 (2⁸) possible values. So we can make our data set eight times shorter by grouping every eight bits into a single byte (a symbol, or in effect a primitive token) that stands for one of those values.
Our data set is now an eighth of its original length, and every element is one of just 256 possible values, ranging from 0 to 255.
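In Python, that grouping of bits into bytes is exactly what you get when you view the encoded text as byte values, each a number from 0 to 255 (a rough sketch of the idea, not how a production tokenizer is implemented):

```python
# The same sentence viewed as bytes: one value per character here,
# each between 0 and 255 -- one eighth the length of the bit string above.
text = "Your gut, often called the"
print(list(text.encode("utf-8")))
# [89, 111, 117, 114, 32, 103, 117, 116, 44, 32, 111, 102, 116, 101, 110,
#  32, 99, 97, 108, 108, 101, 100, 32, 116, 104, 101]
```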
Here is what our dataset looks like now:
As shown above, when converted to binary, our sample phrase becomes a lengthy sequence of bits (0s and 1s). Tokenization, however, breaks it down into a short series of numerical identifiers, significantly reducing the dataset’s size.
The partial sentence:
Your gut, often called the “second brain,”
produces the lengthy binary sequence we saw earlier:
01011001 01101111 01110101 01110010 00100000 01100111 01110101 01110100 00101100 00100000 01101111 01100110 01110100 01100101 01101110 00100000 01100011 01100001 01101100 01101100 01100101 01100100 00100000 01110100 01101000 01100101 00100000 11100010 10000000 10011110 01110011 01100101 01100011 01101111 01101110 01100100 00100000 01100010 01110010 01100001 01101001 01101110 11100010 10000000 10011010
Even the single word “often” on its own requires five bytes (40 bits) of binary code:
01101111 01100110 01110100 01100101 01101110
Thanks to Tiktokenizer (https://tiktokenizer.vercel.app/), we can visualize how this sentence translates into tokens.
The sequence of words in our partial sentence:
Your gut, often called the “second brain,”
returns just 10 tokens:
9719, 10998, 11, 4783, 4358, 290, 966, 13901, 12891, 3881
The word “often” in this sequence of words corresponds to the token 4783.
This allows the LLM to process this word as a compressed numerical representation rather than a full binary sequence.
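You can reproduce this in code with OpenAI's open-source tiktoken library (the same tokenizer family the Tiktokenizer site visualizes). Note that token IDs are specific to the encoding you choose, so the exact numbers depend on which model's tokenizer you select; the encoding name below is an assumption for illustration.

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("o200k_base")     # assumed encoding; IDs vary by tokenizer
tokens = enc.encode('Your gut, often called the "second brain,"')

print(len(tokens))              # a handful of tokens instead of dozens of bytes
print(tokens)                   # the list of integer IDs
print(enc.decode([tokens[3]]))  # the piece of text the fourth ID stands for
```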
Below, I will explain why this compression is so powerful and how it allows the LLM to respond to our queries or prompts with accurately formulated sentences.
Remember we previously said that all of the text on the internet would take up 44TB of data? Another way to look at this is that when we translate all the text into tokens, we create a sequence of 15 trillion tokens—these tokens serve as the fundamental building blocks for LLM training.
All Internet Text = 44TB of Storage = 15 Trillion Tokens
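Treating both of those figures as rough approximations, a quick back-of-the-envelope calculation shows why tokens are such an efficient unit: on average, a token works out to only about three bytes (roughly three characters) of text.

```python
# Rough math using the two approximate figures quoted above.
total_bytes  = 44e12     # ~44 TB of raw text
total_tokens = 15e12     # ~15 trillion tokens
print(total_bytes / total_tokens)   # ~2.9 bytes of text per token, on average
```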
III. Training and Predicting: How LLMs Generate Responses
The power of an LLM lies in its ability to predict the next token in a sequence based on statistical probabilities derived from its dataset. Neural networks drive this prediction process by analyzing vast amounts of text to determine likely patterns and word associations. Here’s how it works:
The neural network processes batches of our dataset to train the LLM. Given an input such as “Your gut, often called the”, the model first converts it into its corresponding token sequence:
- We take our sample sentence segment:
- Your gut, often called the “second brain,”
- And we know the following 10 tokens or symbols, represent this sentence
- 9719, 10998, 11, 4783, 4358, 290, 966, 13901, 12891, 3881
- To train, we feed only part of the sentence into the training model:
- Your gut, often called the ___?
- The LLM will take the sequence of tokens for this partial sentence and try to predict, based on the data set, which token will appear next in the sequence (a small code sketch of this setup follows the list):
- 9719, 10998, 11, 4783, 4358, 290, ___?
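Here is that setup in a few lines of Python, using the token IDs from our example. In a real training run, millions of these context-and-target pairs are generated automatically from the data set.

```python
# One training example built from our token sequence (IDs taken from above).
tokens = [9719, 10998, 11, 4783, 4358, 290, 966, 13901, 12891, 3881]

context = tokens[:6]   # 9719, 10998, 11, 4783, 4358, 290  ("Your gut, often called the")
target  = tokens[6]    # 966 -- the token the model is trained to predict next

print(context, "->", target)
```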
Next Token Prediction: The model analyzes this sequence and predicts the most probable next token based on past patterns. It generates a ranked list of potential outputs, assigning each a probability score.
Selection and Continuation: The model selects a token, typically the one with the highest probability, and adds it to the sequence. This process repeats iteratively until the model has produced a complete response to the user's prompt.
For example, if the neural network receives the input:
- “Your gut, often called the”
The model might generate multiple possible continuations, each with its own probability (a toy version of this step is sketched in code after the list):
- Option 1: “second brain” (Probability: 4%)
- Option 2: “body’s hidden command center” (Probability: 1%)
- Option 3: “silent decision-maker” (Probability: 3%)
- Option 4: “instinct engine” (Probability: 2%)
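Here is that selection step as a toy Python sketch, using the illustrative continuations and probabilities from the example above. A real model assigns a probability to every token in its vocabulary, not to four hand-picked phrases, but the mechanics are the same.

```python
import random

# Illustrative continuations and probabilities copied from the example above.
candidates = {
    "second brain": 0.04,
    "body's hidden command center": 0.01,
    "silent decision-maker": 0.03,
    "instinct engine": 0.02,
}

# Greedy choice: always take the highest-probability continuation.
print(max(candidates, key=candidates.get))        # -> second brain

# In practice, models usually sample from the distribution instead,
# which is why the same prompt can produce different wording each time.
phrases = list(candidates)
weights = list(candidates.values())
print(random.choices(phrases, weights=weights, k=1)[0])
```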
By reinforcing correct sequences over time, the LLM continuously improves its ability to generate accurate and contextually relevant responses.
IV. Parallel Processing and Training at Scale
A key advantage of LLMs is their ability to process millions of token predictions in parallel. Rather than training on individual sequences one at a time, these models handle massive batches of tokens simultaneously, refining their probabilities and strengthening their predictions in real-time.
This training process requires immense computational power, typically leveraging high-performance GPUs (Graphics Processing Units) and TPUs (Tensor Processing Units) to handle the sheer volume of calculations. Over time, the model becomes increasingly refined, developing an advanced understanding of language structure, context, and meaning.
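As a simplified sketch of what "batching" means in practice, here are a few token sequences stacked into one array so they can be processed in a single pass. The first row uses our example IDs; the other numbers are invented purely for illustration.

```python
import numpy as np

# Several (context, next-token) pairs stacked into one batch.
# Row 0 uses our example IDs; the other rows are hypothetical.
batch_contexts = np.array([
    [9719, 10998,   11, 4783, 4358,  290],   # "Your gut, often called the"
    [1212,  318,   257, 1672, 6827,   13],   # hypothetical sequence
    [ 464, 3797,  3332,  319,  262, 2603],   # hypothetical sequence
])
batch_targets = np.array([966, 50256, 13])   # next token for each row (row 0 from above)

print(batch_contexts.shape)   # (3, 6): three sequences handled together
```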
V. The Strength of LLMs: From Tokens to Text and Back Again
The true magic of LLMs lies in their bidirectional conversion between text and tokens.
Once trained, an LLM can take human input, break it down into tokens, analyze the probability of various outputs, and generate coherent, human-like responses—effectively translating numbers back into meaningful language.
Each time an LLM produces text, it’s not simply copying from memory; rather, it is statistically predicting the most probable next token in a sequence.
This makes every output dynamic, adaptable, and capable of generating novel insights beyond its original training data.
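Here is a tiny sketch of that round trip, building on the earlier tiktoken example. The "generated" continuation is hard-coded purely for illustration; in reality it would come from the model's prediction step.

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")    # assumed encoding, as before

prompt_ids    = enc.encode("Your gut, often called the")   # text -> tokens
generated_ids = enc.encode(" second brain")                # pretend the model predicted these

print(enc.decode(prompt_ids + generated_ids))
# -> Your gut, often called the second brain
```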
VI. Conclusion: Understanding LLMs for Smarter AI Adoption
LLMs are complex yet fascinating systems that transform raw text into intelligent, conversational AI. By leveraging tokenization, statistical prediction, and neural network training, these models can generate human-like responses with unprecedented accuracy and fluency.
Grasping these foundational concepts will allow you to harness the full potential of LLMs.
Why Does This Matter?
We are implementing and integrating AI-powered tools at Logical and for our clients to enhance operational efficiencies and marketing strategies.
If you’re interested in learning how to leverage LLMs and AI Agents for your business, please contact us.