Foundations of Modern AI

Lesson 1

200-Level Introduction

1. Tensors — the numbers underneath everything

At the lowest level, AI systems don't process text or images directly. They process numbers. These numbers are organised into structures called tensors, which are simply multi-dimensional arrays of numbers.

0D Tensor: A single number — a scalar (e.g. 5)
1D Tensor: A list of numbers — a vector (e.g. [1, 2, 3])
2D Tensor: A table of numbers — a matrix
3D+ Tensor: Higher dimensions — an RGB image, for example, is 3D: height × width × colour channels
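
To make the four cases above concrete, here is a small sketch using NumPy (an assumption, since the lesson does not name a library). The image size of 4 × 6 pixels is made up for illustration.

```python
import numpy as np

scalar = np.array(5)                        # 0D: a single number
vector = np.array([1, 2, 3])                # 1D: a list of numbers
matrix = np.array([[1, 2, 3], [4, 5, 6]])   # 2D: a table with 2 rows, 3 columns
image = np.zeros((4, 6, 3))                 # 3D: height 4, width 6, 3 colour channels

tensors = [("scalar", scalar), ("vector", vector),
           ("matrix", matrix), ("image", image)]
for name, tensor in tensors:
    # ndim is the number of dimensions, shape is the size along each one
    print(name, "dimensions:", tensor.ndim, "shape:", tensor.shape)
```

The number of dimensions and the shape are the two things worth checking first whenever a tensor operation misbehaves.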

2. Neural networks — graphs of operations

A neural network is a sequence of mathematical operations applied to tensors. Think of it as a flow chart where each node does a calculation and passes the result along. Nothing mystical — just arithmetic, applied many times in sequence.

3. The MLP — the basic building block

The Multi-Layer Perceptron (MLP) is a fundamental component. It takes an input, multiplies it by a set of weights, adds a bias, applies a non-linear function (the activation), then multiplies the result by a second set of weights and adds a second bias.

Output = MatrixMultiply( Activation( MatrixMultiply(Input, W1) + b1 ), W2 ) + b2

W1 and W2 are the weights and b1 and b2 are offsets called biases. Both are numbers the system has learned.
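
The formula above can be sketched in a few lines of NumPy. The choice of ReLU as the activation, the layer sizes, and the random weights are all assumptions for illustration; in a real model the weights come from training.

```python
import numpy as np

def relu(x):
    # A common non-linear activation: negative values become zero
    return np.maximum(x, 0.0)

def mlp(x, W1, b1, W2, b2):
    # Output = MatrixMultiply(Activation(MatrixMultiply(Input, W1) + b1), W2) + b2
    hidden = relu(x @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))      # 2 inputs, each with 4 features
W1 = rng.normal(size=(4, 8))     # first layer: 4 features in, 8 out
b1 = np.zeros(8)
W2 = rng.normal(size=(8, 3))     # second layer: 8 features in, 3 out
b2 = np.zeros(3)

out = mlp(x, W1, b1, W2, b2)
print(out.shape)  # (2, 3)
```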

4. Stacking blocks

Modern AI models are built by stacking these blocks dozens or hundreds of times. Each layer learns to recognise increasingly complex patterns — early layers might detect simple shapes, later layers recognise faces or sentence structure.
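
Stacking can be pictured as a loop that feeds each layer's output into the next. This is a minimal sketch with made-up sizes (four layers, width 8) and random, untrained weights, so it shows the plumbing only and not the pattern recognition.

```python
import numpy as np

rng = np.random.default_rng(0)
width, depth = 8, 4

# One (weights, bias) pair per layer; values are random placeholders
layers = [(rng.normal(scale=0.1, size=(width, width)), np.zeros(width))
          for _ in range(depth)]

def forward(x, layers):
    # The output of each layer becomes the input of the next
    for W, b in layers:
        x = np.maximum(x @ W + b, 0.0)
    return x

x = rng.normal(size=(2, width))   # 2 inputs, each with 8 features
out = forward(x, layers)
print(out.shape)  # (2, 8)
```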

5. Training — how the system learns

Training involves two passes:

Forward pass: Data goes in, flows through the network, a prediction comes out.
Backward pass: The error is calculated, and the weights are adjusted slightly to reduce it next time.

Gradients and Backpropagation

The backward pass calculates gradients, which tell the network how much the error would change if each weight changed slightly. Each weight is then nudged a small step in the opposite direction of its gradient. Repeat millions of times. That is training.
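
To see the two passes at their smallest, here is a sketch that trains a single weight so that w × x matches some targets. The data, the learning rate of 0.1, and the squared-error measure are assumptions chosen for illustration, and the gradient is worked out by hand rather than by a backpropagation library.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([3.0, 6.0, 9.0])    # targets follow y = 3 * x

w = 0.0                          # start from an uninformed guess
learning_rate = 0.1

for step in range(50):
    prediction = w * x                   # forward pass
    error = prediction - y
    loss = np.mean(error ** 2)           # how wrong the prediction is
    gradient = np.mean(2 * error * x)    # backward pass: d(loss)/d(w)
    w -= learning_rate * gradient        # nudge w against the gradient

print(round(w, 4))  # 3.0
```

A real network does the same thing for millions or billions of weights at once, with the gradients computed automatically.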

6. Inference — using the trained model

Once trained, the network is used for inference. Input goes in, flows through the now-frozen network, and output comes out. No more learning takes place. Only the forward pass runs, which needs less computation than a training step because no gradients are calculated.

7. Hardware

Because these operations, especially matrix multiplications, are highly parallelisable, they run best on GPUs or TPUs: specialised chips designed to perform the same simple calculation on many numbers simultaneously.

Lesson 2

How a Copilot-Style AI Answers a Question

200-Level Introduction

2.1 Tokenisation — turning words into numbers

Before text enters the network it must become numbers. Text is split into chunks called tokens — roughly words or parts of words — and each token is mapped to an integer ID.
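
As a rough illustration, here is a toy tokeniser that splits on spaces and looks each word up in a tiny, made-up vocabulary. Real tokenisers split text into sub-word pieces and have vocabularies of many thousands of entries, so treat the names and IDs here as hypothetical.

```python
# Hypothetical six-entry vocabulary; "<unk>" stands for any unknown word
vocab = {"<unk>": 0, "the": 1, "cat": 2, "sat": 3, "on": 4, "mat": 5}

def tokenise(text, vocab):
    # Lower-case, split on whitespace, then map each word to its integer ID
    return [vocab.get(word, vocab["<unk>"]) for word in text.lower().split()]

ids = tokenise("The cat sat on the mat", vocab)
print(ids)  # [1, 2, 3, 4, 1, 5]
```

Note that both occurrences of "the" map to the same ID. The model only ever sees these integers, never the letters.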

Embeddings and Positional Encoding

Embeddings: Each ID is used to look up a row in a learned table, retrieving a dense vector that represents the token's meaning.
Positional Encoding: Because transformers process all tokens at once, a positional signal is added to each embedding so the model knows the order of words.
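
Both steps can be sketched together. The embedding table below is random rather than learned, and the positional signal uses the sine and cosine scheme from the original Transformer design, which is one option among several (many current models use learned or rotary positions instead). The sizes are made up.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    # One row per position; even columns use sine, odd columns use cosine
    positions = np.arange(seq_len)[:, None]
    dims = np.arange(0, d_model, 2)[None, :]
    angles = positions / (10000 ** (dims / d_model))
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

rng = np.random.default_rng(0)
vocab_size, d_model = 6, 8
embedding_table = rng.normal(size=(vocab_size, d_model))   # placeholder values

token_ids = np.array([1, 2, 3, 4, 1, 5])
embedded = embedding_table[token_ids]     # look up one row per token ID
x = embedded + sinusoidal_positions(len(token_ids), d_model)

print(x.shape)  # (6, 8)
```

Token ID 1 appears at positions 0 and 4. Its two embeddings are identical, but after the positional signal is added the two rows of x differ, which is how the model can tell them apart.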

2.2 The Transformer layer

The core of modern large language models is the Transformer layer, which has two main components: Self-Attention and an MLP block.

2.3 Self-Attention — how tokens talk to each other

Self-attention allows each token to look at every other token in the sequence to gather context. It uses three matrices, Queries (Q), Keys (K), and Values (V), each computed from the token vectors using learned weights:

Attention(Q, K, V) = softmax( (Q × Kᵀ) / √d_k ) × V

Think of it as a retrieval system. The Query is what a token is looking for. The Key is what other tokens are offering. The Value is the actual information transferred when there is a match.
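
The attention formula above translates almost directly into NumPy. This sketch assumes a single attention head with no masking, and the projection matrices W_q, W_k and W_v are random placeholders standing in for learned weights.

```python
import numpy as np

def softmax(scores):
    # Subtracting the row maximum keeps the exponentials numerically stable
    scores = scores - scores.max(axis=-1, keepdims=True)
    exp = np.exp(scores)
    return exp / exp.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    # softmax((Q x K^T) / sqrt(d_k)) x V
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))     # one vector per token

# Hypothetical projection weights that turn token vectors into Q, K and V
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)      # (4, 8)
print(weights.shape)  # (4, 4)
```

Each row of weights sums to 1 and says how much that token draws from every token in the sequence. The output for a token is the correspondingly weighted mix of the Values.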

2.4 The MLP block

After attention gathers context from across the sequence, the MLP block processes that aggregated information for each token individually — transforming it into a higher-level representation.
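
"For each token individually" means the same weights are applied to every position with no mixing between positions. The sketch below checks this by comparing one pass over the whole sequence with a token-by-token loop. Sizes, ReLU, and the random weights are assumptions.

```python
import numpy as np

def mlp(x, W1, b1, W2, b2):
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2

rng = np.random.default_rng(0)
seq_len, d_model, d_hidden = 4, 8, 32
tokens = rng.normal(size=(seq_len, d_model))   # one row per token
W1, b1 = rng.normal(size=(d_model, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.normal(size=(d_hidden, d_model)), np.zeros(d_model)

all_at_once = mlp(tokens, W1, b1, W2, b2)
one_by_one = np.stack([mlp(token, W1, b1, W2, b2) for token in tokens])

# Same result either way: no token's output depends on another token here
print(np.allclose(all_at_once, one_by_one))  # True
```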

2.5 Normalisation and residuals

To keep the network stable at depth, two techniques are used:

Normalisation: Keeps the numbers within a manageable range at each layer.
Residual connections: Instead of passing only the transformed output to the next layer, the original input is added back: Output = Layer(Input) + Input. This helps gradients flow during training and prevents information being lost.
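
Here is a minimal sketch of both techniques. It assumes layer normalisation without the learned scale and shift that real models usually include, and it normalises the input before the layer, which is one common arrangement and not the only one.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Rescale each row to roughly zero mean and unit variance
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def residual_block(x, layer):
    # Output = Layer(Input) + Input, with the input normalised first
    return layer(layer_norm(x)) + x

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 20.0, 30.0, 40.0]])

normed = layer_norm(x)
print(normed.mean(axis=-1).round(6))   # each row is centred on zero

# A layer that contributes nothing leaves the input untouched
unchanged = residual_block(x, lambda h: np.zeros_like(h))
print(np.array_equal(unchanged, x))    # True
```

The second check is the point of the residual connection: even if a layer learns nothing useful, the original information still passes through.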

2.6 The complete forward pass — generating a response

When you ask an AI a question, this is what happens in outline:

1. Your prompt is tokenised and each token is embedded.
2. Positional encodings are added.
3. The tensors pass through multiple Transformer layers: Attention then MLP, repeated many times.
4. The final layer outputs a probability distribution over all possible next tokens.
5. A token is selected from that distribution.
6. The new token is appended to the input and embedded, and the whole process repeats from step 2.
7. This continues until a stop token is generated or a length limit is reached.

That is it. One token at a time, each one the result of the entire sequence flowing through the network again.
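
The loop described above can be sketched with a stand-in for the network. Here toy_model is hypothetical: it simply favours "last token ID plus one", whereas a real model would run the full Transformer stack. Token selection is greedy (always the most probable token), which is one strategy among several, and the stop ID of 0 is made up.

```python
import numpy as np

VOCAB_SIZE = 6
STOP_ID = 0   # hypothetical ID for the stop token

def toy_model(token_ids):
    # Stand-in for the network: returns a probability for every possible next token
    logits = np.zeros(VOCAB_SIZE)
    logits[(token_ids[-1] + 1) % VOCAB_SIZE] = 5.0
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def generate(prompt_ids, max_new_tokens=10):
    token_ids = list(prompt_ids)
    for _ in range(max_new_tokens):          # length limit
        probs = toy_model(token_ids)         # the whole sequence goes in again
        next_id = int(np.argmax(probs))      # greedy selection
        token_ids.append(next_id)            # append and repeat
        if next_id == STOP_ID:               # stop token ends generation
            break
    return token_ids

print(generate([2, 3]))  # [2, 3, 4, 5, 0]
```

Replacing the argmax with a random draw from probs would give the varied responses that chat systems typically produce.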
