How Large Language Models (LLMs) Work

Alexander Efremov, AI Expert

Aspirity Company
Email: ae@aspirity.com | Telegram: @sabbah13

🤖

LLM Architecture: Code and Weights

A model consists of two files:
Code file:
- Written in C or Python, for example; runs inference
- Can be as short as ~500 lines of code
Parameters file (weights):
- Stores the trained coefficients ("settings")
- Can take up tens or hundreds of gigabytes
- Example: 1.5 trillion parameters at 16 bits (2 bytes) each → ~3 TB of weights
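
The storage arithmetic above can be checked directly. A minimal sketch, assuming decimal units (1 TB = 10^12 bytes); the function name `weights_size_bytes` is made up for this illustration:

```python
def weights_size_bytes(n_params: int, bits_per_param: int) -> int:
    """Bytes needed to store n_params coefficients at the given precision."""
    return n_params * bits_per_param // 8


# 1.5 trillion parameters stored as 16-bit numbers
size = weights_size_bytes(1_500_000_000_000, 16)
print(f"{size / 1e12:.1f} TB")  # 3.0 TB

# A much smaller 8-billion-parameter model at the same precision
small = weights_size_bytes(8_000_000_000, 16)
print(f"{small / 1e9:.1f} GB")  # 16.0 GB
```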

💻

Code

⚖️

Weights

Llama 3 Code Example


# Copyright (c) Meta Platforms, Inc. and affiliates.
# This software may be used and distributed in accordance with the terms of the Llama 3 Community License Agreement.

import math
from dataclasses import dataclass
from typing import Optional, Tuple

import fairscale.nn.model_parallel.initialize as fs_init
import torch
import torch.nn.functional as F
from fairscale.nn.model_parallel.layers import (
    ColumnParallelLinear,
    RowParallelLinear,
    VocabParallelEmbedding,
)
from torch import nn


@dataclass
class ModelArgs:
    dim: int = 4096
    n_layers: int = 32
    n_heads: int = 32
    n_kv_heads: Optional[int] = None
    vocab_size: int = -1
    multiple_of: int = 256  # make SwiGLU hidden layer size multiple of large power of 2
    ffn_dim_multiplier: Optional[float] = None
    norm_eps: float = 1e-5
    rope_theta: float = 500000

    max_batch_size: int = 32
    max_seq_len: int = 2048


class RMSNorm(torch.nn.Module):
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def _norm(self, x):
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

    def forward(self, x):
        output = self._norm(x.float()).type_as(x)
        return output * self.weight


def precompute_freqs_cis(dim: int, end: int, theta: float = 10000.0):
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2)[: (dim // 2)].float() / dim))
    t = torch.arange(end, device=freqs.device, dtype=torch.float32)
    freqs = torch.outer(t, freqs)
    freqs_cis = torch.polar(torch.ones_like(freqs), freqs)  # complex64
    return freqs_cis


def reshape_for_broadcast(freqs_cis: torch.Tensor, x: torch.Tensor):
    ndim = x.ndim
    assert 0 <= 1 < ndim
    assert freqs_cis.shape == (x.shape[1], x.shape[-1])
    shape = [d if i == 1 or i == ndim - 1 else 1 for i, d in enumerate(x.shape)]
    return freqs_cis.view(*shape)


def apply_rotary_emb(
    xq: torch.Tensor,
    xk: torch.Tensor,
    freqs_cis: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
    xq_ = torch.view_as_complex(xq.float().reshape(*xq.shape[:-1], -1, 2))
    xk_ = torch.view_as_complex(xk.float().reshape(*xk.shape[:-1], -1, 2))
    freqs_cis = reshape_for_broadcast(freqs_cis, xq_)
    xq_out = torch.view_as_real(xq_ * freqs_cis).flatten(3)
    xk_out = torch.view_as_real(xk_ * freqs_cis).flatten(3)
    return xq_out.type_as(xq), xk_out.type_as(xk)


def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    """torch.repeat_interleave(x, dim=2, repeats=n_rep)"""
    bs, slen, n_kv_heads, head_dim = x.shape
    if n_rep == 1:
        return x
    return (
        x[:, :, :, None, :]
        .expand(bs, slen, n_kv_heads, n_rep, head_dim)
        .reshape(bs, slen, n_kv_heads * n_rep, head_dim)
    )


class Attention(nn.Module):
    def __init__(self, args: ModelArgs):
        super().__init__()
        self.n_kv_heads = args.n_heads if args.n_kv_heads is None else args.n_kv_heads
        model_parallel_size = fs_init.get_model_parallel_world_size()
        self.n_local_heads = args.n_heads // model_parallel_size
        self.n_local_kv_heads = self.n_kv_heads // model_parallel_size
        self.n_rep = self.n_local_heads // self.n_local_kv_heads
        self.head_dim = args.dim // args.n_heads

        self.wq = ColumnParallelLinear(
            args.dim,
            args.n_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wk = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wv = ColumnParallelLinear(
            args.dim,
            self.n_kv_heads * self.head_dim,
            bias=False,
            gather_output=False,
            init_method=lambda x: x,
        )
        self.wo = RowParallelLinear(
            args.n_heads * self.head_dim,
            args.dim,
            bias=False,
            input_is_parallel=True,
            init_method=lambda x: x,
        )

        self.cache_k = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()
        self.cache_v = torch.zeros(
            (
                args.max_batch_size,
                args.max_seq_len,
                self.n_local_kv_heads,
                self.head_dim,
            )
        ).cuda()

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        bsz, seqlen, _ = x.shape
        xq, xk, xv = self.wq(x), self.wk(x), self.wv(x)

        xq = xq.view(bsz, seqlen, self.n_local_heads, self.head_dim)
        xk = xk.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)
        xv = xv.view(bsz, seqlen, self.n_local_kv_heads, self.head_dim)

        xq, xk = apply_rotary_emb(xq, xk, freqs_cis=freqs_cis)

        self.cache_k = self.cache_k.to(xq)
        self.cache_v = self.cache_v.to(xq)

        self.cache_k[:bsz, start_pos : start_pos + seqlen] = xk
        self.cache_v[:bsz, start_pos : start_pos + seqlen] = xv

        keys = self.cache_k[:bsz, : start_pos + seqlen]
        values = self.cache_v[:bsz, : start_pos + seqlen]

        # repeat k/v heads if n_kv_heads < n_heads
        keys = repeat_kv(
            keys, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)
        values = repeat_kv(
            values, self.n_rep
        )  # (bs, cache_len + seqlen, n_local_heads, head_dim)

        xq = xq.transpose(1, 2)  # (bs, n_local_heads, seqlen, head_dim)
        keys = keys.transpose(1, 2)  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        values = values.transpose(
            1, 2
        )  # (bs, n_local_heads, cache_len + seqlen, head_dim)
        scores = torch.matmul(xq, keys.transpose(2, 3)) / math.sqrt(self.head_dim)
        if mask is not None:
            scores = scores + mask  # (bs, n_local_heads, seqlen, cache_len + seqlen)
        scores = F.softmax(scores.float(), dim=-1).type_as(xq)
        output = torch.matmul(scores, values)  # (bs, n_local_heads, seqlen, head_dim)
        output = output.transpose(1, 2).contiguous().view(bsz, seqlen, -1)
        return self.wo(output)


class FeedForward(nn.Module):
    def __init__(
        self,
        dim: int,
        hidden_dim: int,
        multiple_of: int,
        ffn_dim_multiplier: Optional[float],
    ):
        super().__init__()
        hidden_dim = int(2 * hidden_dim / 3)
        # custom dim factor multiplier
        if ffn_dim_multiplier is not None:
            hidden_dim = int(ffn_dim_multiplier * hidden_dim)
        hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)

        self.w1 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )
        self.w2 = RowParallelLinear(
            hidden_dim, dim, bias=False, input_is_parallel=True, init_method=lambda x: x
        )
        self.w3 = ColumnParallelLinear(
            dim, hidden_dim, bias=False, gather_output=False, init_method=lambda x: x
        )

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))


class TransformerBlock(nn.Module):
    def __init__(self, layer_id: int, args: ModelArgs):
        super().__init__()
        self.n_heads = args.n_heads
        self.dim = args.dim
        self.head_dim = args.dim // args.n_heads
        self.attention = Attention(args)
        self.feed_forward = FeedForward(
            dim=args.dim,
            hidden_dim=4 * args.dim,
            multiple_of=args.multiple_of,
            ffn_dim_multiplier=args.ffn_dim_multiplier,
        )
        self.layer_id = layer_id
        self.attention_norm = RMSNorm(args.dim, eps=args.norm_eps)
        self.ffn_norm = RMSNorm(args.dim, eps=args.norm_eps)

    def forward(
        self,
        x: torch.Tensor,
        start_pos: int,
        freqs_cis: torch.Tensor,
        mask: Optional[torch.Tensor],
    ):
        h = x + self.attention(self.attention_norm(x), start_pos, freqs_cis, mask)
        out = h + self.feed_forward(self.ffn_norm(h))
        return out


class Transformer(nn.Module):
    def __init__(self, params: ModelArgs):
        super().__init__()
        self.params = params
        self.vocab_size = params.vocab_size
        self.n_layers = params.n_layers

        self.tok_embeddings = VocabParallelEmbedding(
            params.vocab_size, params.dim, init_method=lambda x: x
        )

        self.layers = torch.nn.ModuleList()
        for layer_id in range(params.n_layers):
            self.layers.append(TransformerBlock(layer_id, params))

        self.norm = RMSNorm(params.dim, eps=params.norm_eps)
        self.output = ColumnParallelLinear(
            params.dim, params.vocab_size, bias=False, init_method=lambda x: x
        )

        self.freqs_cis = precompute_freqs_cis(
            params.dim // params.n_heads,
            params.max_seq_len * 2,
            params.rope_theta,
        )

    @torch.inference_mode()
    def forward(self, tokens: torch.Tensor, start_pos: int):
        _bsz, seqlen = tokens.shape
        h = self.tok_embeddings(tokens)
        self.freqs_cis = self.freqs_cis.to(h.device)
        freqs_cis = self.freqs_cis[start_pos : start_pos + seqlen]

        mask = None
        if seqlen > 1:
            mask = torch.full((seqlen, seqlen), float("-inf"), device=tokens.device)

            mask = torch.triu(mask, diagonal=1)

            # When performing key-value caching, we compute the attention scores
            # only for the new sequence. Thus, the matrix of scores is of size
            # (seqlen, cache_len + seqlen), and the only masked entries are (i, j) for
            # j > cache_len + i, since row i corresponds to token cache_len + i.
            mask = torch.hstack(
                [torch.zeros((seqlen, start_pos), device=tokens.device), mask]
            ).type_as(h)

        for layer in self.layers:
            h = layer(h, start_pos, freqs_cis, mask)
        h = self.norm(h)
        output = self.output(h).float()
        return output
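
The causal mask built in `Transformer.forward` is easier to see on a tiny case. A sketch in plain Python lists (no torch), with `build_mask` as a made-up helper name, that reproduces the same layout: `start_pos` cached columns that are always visible, followed by an upper-triangular block of `-inf`. The listing skips the mask when `seqlen` is 1; here that case simply yields a row of zeros, which has the same effect.

```python
NEG_INF = float("-inf")


def build_mask(seqlen: int, start_pos: int) -> list[list[float]]:
    """Row i is new token i; column j is position j among cached + new tokens.

    Entry (i, j) is -inf when j > start_pos + i (a future token), else 0.
    """
    return [
        [NEG_INF if j > start_pos + i else 0.0 for j in range(start_pos + seqlen)]
        for i in range(seqlen)
    ]


# 2 cached tokens, 3 new tokens -> a 3 x 5 mask
for row in build_mask(seqlen=3, start_pos=2):
    print(row)
```

Adding this mask to the attention scores before the softmax drives the probability of every future position to zero.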

Network Weights

Weights are the numerical coefficients that determine how the model behaves
They set the strength of connections inside the network, loosely like synapses in the brain
The quality of these settings determines how human-like the responses appear
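
The number of weights follows from the layer shapes in the Llama 3 listing. A rough sketch that counts them, assuming the published Llama 3 8B settings (`n_kv_heads=8`, `vocab_size=128256`, `ffn_dim_multiplier=1.3`, `multiple_of=1024`). These four values are not in the listing, whose `ModelArgs` shows only defaults, so treat them as assumptions; `ffn_hidden_dim` and `count_params` are made-up helper names.

```python
from typing import Optional


def ffn_hidden_dim(dim: int, multiple_of: int, ffn_dim_multiplier: Optional[float]) -> int:
    """Same rounding as FeedForward.__init__, starting from hidden_dim = 4 * dim."""
    hidden = int(2 * (4 * dim) / 3)
    if ffn_dim_multiplier is not None:
        hidden = int(ffn_dim_multiplier * hidden)
    return multiple_of * ((hidden + multiple_of - 1) // multiple_of)


def count_params(dim, n_layers, n_heads, n_kv_heads, vocab_size,
                 multiple_of, ffn_dim_multiplier) -> int:
    head_dim = dim // n_heads
    attn = 2 * dim * (n_heads * head_dim)        # wq and wo
    attn += 2 * dim * (n_kv_heads * head_dim)    # wk and wv
    ffn = 3 * dim * ffn_hidden_dim(dim, multiple_of, ffn_dim_multiplier)  # w1, w2, w3
    norms = 2 * dim                              # attention_norm and ffn_norm
    per_layer = attn + ffn + norms
    embeddings = vocab_size * dim                # tok_embeddings
    output = dim * vocab_size                    # output projection
    final_norm = dim
    return n_layers * per_layer + embeddings + output + final_norm


# Assumed Llama 3 8B configuration (not taken from the listing)
total = count_params(dim=4096, n_layers=32, n_heads=32, n_kv_heads=8,
                     vocab_size=128256, multiple_of=1024, ffn_dim_multiplier=1.3)
print(f"{total:,} parameters")                 # 8,030,261,248 parameters
print(f"{total * 2 / 1e9:.1f} GB at 16 bits")  # 16.1 GB at 16 bits
```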

Text Processing: Tokenization

Encoding text character by character is inefficient
Tokenization: breaking text into tokens (words, word parts, symbols)
Each token is assigned a unique ID
The model outputs a probability distribution over the next token
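
The steps above can be sketched with a toy tokenizer. This is an illustration only: the vocabulary, the greedy longest-match rule, and the logits are invented here, whereas real models use learned vocabularies (for example, byte-pair encoding) with tens of thousands of tokens or more.

```python
import math

# Hypothetical vocabulary: token string -> unique ID
VOCAB = {"un": 0, "believ": 1, "able": 2, "token": 3, "s": 4, " ": 5}


def tokenize(text: str) -> list[int]:
    """Greedy longest-match split of text into token IDs."""
    ids = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate substring first
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


def softmax(logits: list[float]) -> list[float]:
    """Turn raw scores into a probability distribution that sums to 1."""
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


print(tokenize("unbelievable tokens"))  # [0, 1, 2, 5, 3, 4]

# Invented scores, one per vocabulary entry, standing in for model output
logits = [0.2, 0.1, 0.3, 2.5, 1.0, 0.4]
probs = softmax(logits)
best = max(range(len(probs)), key=probs.__getitem__)
print(best)  # 3, the ID of "token"
```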

Base Model: Creating Knowledge Archive

Pre-training: training on tens of terabytes of text
Uses thousands of GPUs; training takes weeks or months
Analogy: T9 predictive text – the training data is compressed into a compact set of weights
The base model can complete text but does not reliably follow instructions or solve complex tasks
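
The T9 analogy can be made concrete with a toy next-word model. A minimal sketch, assuming a word-level bigram counter stands in for pre-training (real base models learn neural-network weights, not count tables); the corpus and function names are invented for the example.

```python
from collections import Counter, defaultdict

# Tiny stand-in for the training text
corpus = "the sky is blue . the sea is blue . the grass is green ."


def train_bigrams(text: str) -> dict:
    """Count, for every word, which words follow it and how often."""
    counts = defaultdict(Counter)
    words = text.split()
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts


def complete(counts: dict, prompt: str, n_words: int) -> str:
    """Extend the prompt by repeatedly picking the most frequent next word."""
    words = prompt.split()
    for _ in range(n_words):
        followers = counts.get(words[-1])
        if not followers:
            break  # the last word never appeared in the corpus
        words.append(followers.most_common(1)[0][0])
    return " ".join(words)


model = train_bigrams(corpus)
print(complete(model, "the sea", 2))  # the sea is blue
```

Like a base model, this toy only continues text; it has no notion of a question that should be answered.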

📚

Instruct Model: From Base Model to Assistant

Pre-training produces a base model that can only complete text
Fine-tuning on ~100k "question-answer" pairs
Training for the proper style and for declining unwanted topics
Result – a virtual assistant that gives adapted, appropriate responses
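
Fine-tuning data of this kind is typically a list of question-answer pairs, each rendered into one training string. A sketch of that formatting step, assuming an invented template with `<|user|>`, `<|assistant|>`, and `<|end|>` markers; every model family defines its own chat format, so these markers are placeholders, not a real model's tokens.

```python
from dataclasses import dataclass


@dataclass
class QAPair:
    question: str
    answer: str


def format_example(pair: QAPair) -> str:
    """Render one pair as a single training string (hypothetical template)."""
    return f"<|user|>\n{pair.question}\n<|assistant|>\n{pair.answer}<|end|>"


pairs = [
    QAPair("What's 2+2?", "2+2 = 4"),
    QAPair("Why is the sky blue?", "Due to Rayleigh scattering."),
]
dataset = [format_example(p) for p in pairs]
print(dataset[0])
```

The model is then trained to continue the text after the assistant marker, which is still next-token prediction, only on curated dialogues.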

🤖

Dialogue Example #1

Human: "What's 2+2?"
Assistant: "2+2 = 4"

Human: "What if I want to add 2 and 2 differently?"
Assistant: "The sum will still be 4, although there are several ways to write it."

Dialogue Example #2

Human: "Why is the sky blue?"
Assistant: "Due to Rayleigh scattering: shorter wavelengths (blue light) scatter more."

Human: "Wow, interesting!"
Assistant: "Yes, that's one of the reasons why we see a blue sky on a clear day."

Dialogue Example #3

Human: "How do I hack a computer?"
Assistant: "I apologize, I cannot help with that request."

Human: "How can I protect myself from hacking?"
Assistant: "Use strong passwords, antivirus software, and regularly update your system."

Quick Response vs. Deep Thinking

System 1: Quick Response

Fast, intuitive responses
Based on patterns
Effective for standard queries
May make mistakes on complex tasks

⚡

System 2: Deep Thinking

Extra compute spent on chain-of-thought reasoning
"Thinking aloud": analysis of intermediate steps
Reinforcement learning that leads to the "aha moment"
Example: DeepSeek-R1
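
Reasoning models such as DeepSeek-R1 emit their chain of thought before the final answer, wrapped in `<think>` and `</think>` tags. A small sketch that separates the two parts; the sample output string is made up, and `split_reasoning` is a hypothetical helper, not part of any library.

```python
import re


def split_reasoning(raw: str) -> tuple[str, str]:
    """Return (reasoning, answer); reasoning is empty if no <think> block exists."""
    match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
    if match is None:
        return "", raw.strip()
    reasoning = match.group(1).strip()
    answer = raw[match.end():].strip()
    return reasoning, answer


# Invented example of a "System 2" style response
raw_output = "<think>17 * 3 = 51, then 51 + 4 = 55.</think>The result is 55."
reasoning, answer = split_reasoning(raw_output)
print("Reasoning:", reasoning)
print("Answer:", answer)
```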

💡

Overview of Practical Tools

LLM Models Overview

Well-known ones: ChatGPT, Claude
Proprietary:
- Claude 3.7 Sonnet – best for development
- Grok-3 – best for response quality
- OpenAI o3-mini-high – general-purpose model
- Gemini 2.0 – context up to 2M tokens
Open-source:
- Llama 3.1 / 3.2 – variants: 405B, 70B, 8B (3.1); 90B, 11B, 3B, 1B (3.2)
- Qwen – from 0.5B to 72B, reasoning models
- Gemma 3 – compact (up to 27B)
- DeepSeek R1 – "thinking" model

Inference Services

Replicate – model deployment (text, graphics, video)
Hugging Face Spaces – deployment via Gradio/Streamlit
Hyperbolic – API integration for inference
Together AI – fast inference platform

Custom GPT: Creating Assistant

Customization of ChatGPT for individual tasks 🤖
Easy setup for corporate/personal use
Integration of your own data and style

More: Custom GPT from OpenAI

🤖

Development Tools

Replit – cloud IDE for prototyping 💻
Bolt.new – instant web project creation ⚡
v0.dev – fast UI prototyping 🚀
Lovable.dev – ready-made templates for web applications 🎨

"By the way, Satya Nadella, CEO of Microsoft, predicts the death of SaaS because everyone can now create their own service at minimal cost."

Environments for Advanced Developers

Cursor – VS Code-style editor with an AI assistant 👨‍💻
Windsurf – code optimization with AI for complex tasks ⚙️

Replit, v0, Bolt, and Lovable are mainly used for prototyping, while Cursor and Windsurf are for complex, production-ready projects.

👨‍💻

Educational Platforms for AI

Google Colab – interactive notebooks for experiments 📓
Gradio – demonstration web interfaces for learning
Streamlit – platform for quickly creating web applications

📚

Practical Application Scenarios for AI

AI for the Regular User

Generating responses to emails 📧
Creating documents, presentations, technical specifications 📄
Simple scripts and applications (in browser) 💻
Speech-to-text transcription 🎙️

Automation of Communications in Business

Speech transcription and speech analytics in call centers 📞
Identifying problems in operators' work and giving recommendations to managers 📊
Voice assistants for incoming calls (booking, consultations) 🤖
Outgoing calls for follow-up and collecting feedback 🔄

Voice and Video Assistants

Voice bots that answer incoming calls automatically 🤖
Video avatars for virtual assistants (at reception desks, on tablets, on websites) 🎥
Speech-to-text and speech synthesis (using ElevenLabs, Vapi, Deepgram) 🎙️

Documents and Structured Data

Converting unstructured data into structured formats 📑
Creating resumes, candidate cards, legal documents 📋
Document analysis for HR, legal and financial departments 🔍
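
A common pattern for turning unstructured text into structured data is to ask the model for JSON and validate it before use. A sketch under stated assumptions: `llm_response` is a hard-coded stand-in for what a model might return for a resume snippet (no real API is called), and the `Candidate` fields are invented for the example.

```python
import json
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    years_experience: int
    skills: list[str]


def parse_candidate(raw_json: str) -> Candidate:
    """Validate model output and convert it into a typed record."""
    data = json.loads(raw_json)
    missing = {"name", "years_experience", "skills"} - data.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return Candidate(
        name=str(data["name"]),
        years_experience=int(data["years_experience"]),
        skills=[str(s) for s in data["skills"]],
    )


# Hypothetical model output for the prompt "Extract the candidate card as JSON"
llm_response = '{"name": "Jane Doe", "years_experience": 7, "skills": ["Python", "SQL"]}'
candidate = parse_candidate(llm_response)
print(candidate)
```

Validating before use matters because a model can return malformed or incomplete JSON; rejecting such output early keeps bad records out of downstream systems.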

AI Content Marketing

Generating texts, images and videos for marketing 📝
Automating social media management (Instagram, Facebook) 📱
Trend analysis and collecting news data 📈

AI Operations Automation

Browser and computer bots for automating routine tasks (clicks, input, scrolling) 🤖
Customer support, sales, legal and financial analysis 📊
Generating reports and analyzing data 📑

AI Business Applications

Customer support and automation of internal processes 🏢
AI integration into departments (HR, finance, marketing) 🔗
Higher efficiency and lower costs 💡
Prospects for scaling AI applications 🚀

Questions and Answers

Ask questions and share comments

❓

Useful Links and Resources

Models: ChatGPT, Claude, Llama 3.2, Qwen, Gemma 3, Grok-3, Gemini 2.0, DeepSeek R1
Inference Services: Replicate, Hugging Face Spaces, Hyperbolic, Together AI
Development Tools: Replit, Bolt.new, v0.dev, Lovable.dev
Advanced Environments: Cursor, Windsurf
Educational Platforms: Google Colab, Gradio, Streamlit
Custom GPT: Custom GPT from OpenAI

Additional Tools

HeyGen – platform for creating AI video with animated avatars and speech synthesis.
D-ID – tool for animating portraits and creating live videos from photos using AI.
Vapi – API service for building voice applications and integrating voice with text.
n8n – open-source workflow automation platform for integrating various services and APIs.
Make.com – platform for automating business processes and building complex integrations between services without programming.
Airtable – online platform for organizing and managing data that combines the capabilities of databases and spreadsheets.
Reveal.js – the framework this presentation is built on :)

Thank you for your attention!

Alexander Efremov
AI Expert, Aspirity Company

✉️ ae@aspirity.com | Telegram: @sabbah13