AI Tools & Trends 2024–2025

What changed, what matters, what to use

Alex Efremov

Agenda

  • 2024-2025 AI Milestones
  • Reasoning LLMs & Test-Time Compute
  • Voice Stack
  • Agentic AI
  • Evals & Monitoring
  • AI Development

Key Milestones

Timeline

2024: The Year AI Went Mainstream

Major Releases

  • OpenAI o1 - Reasoning models
  • Sora - Text-to-video
  • GPT-4o - Multimodal
  • Claude 3.5 - Enhanced reasoning
  • Gemini 2.0 - 2x faster
  • Apple Intelligence - Platform integration
  • Llama 3.2 - Open multimodal

Key Innovations

  • Voice Mode - Real-time AI
  • ChatGPT Search - AI web search
  • Multimodal AI - All modalities
  • Agentic AI - Autonomous
  • EU AI Act - Global standards
  • GenCast - AI forecasting
  • NVIDIA - Chip dominance

2025: Era of Reasoning & Agents

  • Vibe Coding - Andrej Karpathy (Jan)
  • Sesame - Open source SOTA TTS (Feb)
  • Gemini 2.5 Pro - 1M-token context (Mar)
  • Qwen3 - Open hybrid-reasoning LLMs (Apr)
  • Claude Sonnet and Opus 4 - Top benchmarks (May)
  • Grok 4 Heavy - $300/mo tier
  • Kimi K2 - Trillion params open source
  • Qwen3-Coder - Agentic coding new SOTA

Reasoning at scale • Agentic workflows • Multimodal by default • Open vs closed race

Regular vs Reasoning Models

Reasoning models think in multiple steps and spend extra compute at inference (TTC) to get higher accuracy.

  • Single-pass → Multi-step deliberation
  • Tool calls during thinking
  • Test-Time Compute (TTC) boosts correctness
  • Benchmarks jump (GPQA, MMLU-Pro, AIME, HLE)

Traditional: Prompt → Answer

Reasoning: Prompt → Plan → Tool → Reflect → Answer
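The contrast can be sketched in code. Below is a toy illustration (not any real model's internals): the single-pass function answers immediately, while the multi-step function plans, calls a calculator tool, and verifies before answering.

```python
# Toy contrast between single-pass answering and a plan → tool → reflect loop.
# The "tool" is a sandboxed arithmetic eval; real reasoning models run an
# analogous loop over generated thinking tokens.

def single_pass(prompt: str) -> str:
    """One forward pass: no planning, no tools, answer immediately."""
    return "not sure"

def multi_step(prompt: str) -> str:
    """Plan → Tool → Reflect → Answer."""
    expr = prompt.rstrip("?").split("is")[-1].strip()   # Plan: extract the task
    result = eval(expr, {"__builtins__": {}})           # Tool: calculator call
    assert isinstance(result, (int, float))             # Reflect: sanity-check
    return str(result)                                  # Answer

print(single_pass("what is 12*7?"))  # not sure
print(multi_step("what is 12*7?"))   # 84
```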

o3/o4-mini system card 🔗 DeepSeek R1 GitHub 🔗 Anthropic Opus 4 PDF 🔗

Test-Time Compute (TTC) in Practice

More tokens spent reasoning = better results, with diminishing returns.

  • Allocate 'thinking tokens'
  • Higher TTC ≈ +X% on GPQA/MATH
  • APIs expose 'budget' or 'steps' knobs
[Chart: Accuracy (%) vs TTC budget in tokens, rising from Low through Medium and High to Max]
System cards with TTC curves 🔗
TTC Curves
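One concrete way to spend test-time compute is self-consistency: sample several reasoning chains and majority-vote the answer. A minimal simulation (the stochastic "model" here is invented) shows accuracy rising with budget, then flattening:

```python
import random
from collections import Counter

def sample_answer(truth: int, p_correct: float = 0.6) -> int:
    """Simulated reasoning chain: right answer with probability p_correct."""
    if random.random() < p_correct:
        return truth
    return truth + random.choice([-2, -1, 1, 2])  # a wrong nearby answer

def self_consistency(truth: int, budget: int) -> int:
    """More budget = more sampled chains = majority vote over more answers."""
    votes = Counter(sample_answer(truth) for _ in range(budget))
    return votes.most_common(1)[0][0]

def accuracy(budget: int, trials: int = 400) -> float:
    return sum(self_consistency(42, budget) == 42 for _ in range(trials)) / trials

random.seed(0)
for budget in (1, 3, 9, 27):
    print(f"budget={budget:2d}  accuracy={accuracy(budget):.2f}")
```

The diminishing-returns shape of the printed curve is the same one the system-card TTC charts show.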

DeepSeek R1

R1 proves open-weight reasoning can rival closed models.

Parameters: 685B (MoE)
Training Tokens: 14.8T
GPQA: 81.0
License: MIT
  • Strengths: reasoning, math, multi-step logic
  • Caveat: needs safety filters
HF repo 🔗 R1 paper/blog 🔗

Claude Opus 4 / Sonnet 4

Anthropic's hybrid reasoning + tool use models handle day-long tasks & coding reliably.

  • Opus 4: highest intelligence tier
  • Sonnet 4: cheaper, strong reasoning
  • Can work continuously for hours on complex tasks
GPQA: Sonnet 4 82.1 • Opus 4 85.3 • o3 87.7
Anthropic PDF/system card 🔗
Claude Benchmarks

Kimi K2

Moonshot Kimi K2 delivers open, massive MoE reasoning at scale (1T params, 384 experts).

  • 1T params / 15.5T tokens
  • Strong math & logic (AIME 77.5)
  • Apache-style open license
[Radar chart: K2 across MMLU, GPQA, AIME, Code, Math, Logic]
Moonshot GitHub 🔗 OpenRouter entry 🔗
Kimi K2 Benchmarks

Qwen3 Coder (480B MoE)

Alibaba's specialist coding model leads open-source coding benchmarks with 480B params, 35B active.

Parameters: 480B (35B active)
Context Length: 256K (1M extended)
SWE-Bench Verified: open-source SOTA
CodeForces ELO: leading
  • Multi-language: Python, JS, Java, C++, Go, Rust+
  • Debugging & refactoring: automated optimization
  • Security: vulnerability detection
[Benchmarks: SWE-Bench, CodeForces, LiveCodeBench, BFCL]
Qwen GitHub 🔗 HuggingFace 🔗 Technical Report 🔗
Qwen3 Coder Benchmark

Leaderboard (Jul 2025)

A snapshot of key reasoning contenders and a couple of headline metrics.

Model Params Tokens MMLU GPQA
Grok 4 5000B 80T - 88.9
Claude Opus 4 1200B 100T - 83.3
DeepSeek R1 685B 14.8T 93.4 81.0
Qwen3 Coder 480B - - -
Kimi K2 1000B 15.5T 89.5 75.1
Qwen3-235B 235B 36T 93.1 77.5

Voice Pipeline Overview

Reliable voice agents need a tuned pipeline: STT → Reasoner → TTS (or V2V), optimized for latency and quality.

  • STT: fast/accurate
  • LLM: reason/tool-use
  • TTS/V2V: expressive & low-latency
  • Barge-in, interruptions handling
🎤 STT (Audio → Text) → 🧠 LLM (Reason & Tools) → 🔊 TTS/V2V (Text → Audio)
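The STT → LLM → TTS chain can be sketched as composed stages with per-stage latency measurement (the three stage functions below are stand-in stubs; real deployments stream between stages rather than passing whole payloads):

```python
import time

def stt(audio: bytes) -> str:
    return "what time is it"          # stub: real STT streams partial text

def llm(text: str) -> str:
    return f"you asked: {text}"       # stub: reasoning + tool calls live here

def tts(text: str) -> bytes:
    return text.encode()              # stub: real TTS streams audio frames

def voice_turn(audio: bytes):
    """Run one STT → LLM → TTS turn, recording per-stage latency."""
    timings, x = {}, audio
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        x = stage(x)
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds
    return x, timings

reply, latency_ms = voice_turn(b"\x00\x01")
print(reply)
print({k: round(v, 3) for k, v in latency_ms.items()})
```

Tracking latency per stage, not just end to end, is what lets you decide which component to swap.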

Speech-to-Text: Accuracy vs Latency

Choose STT by WER, latency, language coverage & diarization.

Word Error Rate: Deepgram 2.1% • Whisper v3 2.3% • AssemblyAI 2.5% • GPT-4o STT 2.8%
  • WER ~2-3%, Latency ~50-200ms
  • Streaming + partial transcripts
  • Multi-language & diarization
Typical latency: ~150ms
Whisper API 🔗 Deepgram docs 🔗 AssemblyAI 🔗

TTS: Naturalness, Control & Styles

Modern TTS offers controllable emotion/style with near-human naturalness.

[Demo: voice waveform with pitch contour]
  • ElevenLabs v3: style, emotion, low latency
  • PlayHT, Hume Octave, Papla P1 (streaming APIs)
  • Sesame
ElevenLabs API 🔗 PlayHT docs 🔗 Hume Octave 🔗
ElevenLabs v3 Demo

Voice-to-Voice (V2V) Models

V2V is here: direct voice in → voice out, enabling fluid, natural dialogs.

  • OpenAI Realtime
  • Gemini Live
  • Sesame & others emerging
Voice In → Voice Encoder → Voice Embeddings → Voice Decoder → Voice Out (~200-500ms end-to-end)
OpenAI Realtime API 🔗 Gemini Live docs 🔗

Gluing It Together: Agents, Tools & Protocols

MCP and graph frameworks standardize tool access and agent flow.

  • MCP: tool servers for LLMs
  • Graph orchestration (LangGraph, Haystack 2, CrewAI)
  • Realtime frameworks for voice agents
[MCP connects to: Database, APIs, Files, Web, Email, Cloud]
Anthropic MCP docs 🔗

Model Context Protocol (MCP) Deep Dive

MCP is rapidly becoming the default way to expose SaaS tools to models.

  • Spec defines: tools, prompts, schemas
  • Growing vendor support (CRMs, DBs, SaaS)
  • Easy local MCP servers
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "database_query",
    "arguments": {
      "query": "SELECT * FROM users",
      "database": "production"
    }
  }
}
Official MCP docs 🔗
MCP Deep Dive
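The tools/call request above can be generated programmatically. A standard-library sketch (it adds the `id` field that JSON-RPC 2.0 requests carry; the transport to an actual MCP server is omitted):

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests need unique ids

def mcp_tools_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request, as used by MCP."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }, indent=2)

print(mcp_tools_call("database_query",
                     {"query": "SELECT * FROM users", "database": "production"}))
```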

Graph-Oriented Orchestration

LangGraph & similar frameworks let you design agent flows as state graphs.

  • Nodes = steps/tools
  • Edges = transitions/conditions
  • Great for long-running workflows
Voice Call → Intent Detect → (question? lookup?) → DB Lookup → Response
LangGraph docs 🔗 CrewAI GitHub 🔗
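The flow on this slide can be expressed as a tiny state graph in plain Python. This mirrors the LangGraph idea of nodes mutating shared state and conditional edges picking the next node, but it is not the LangGraph API itself:

```python
# Toy state-graph runner: nodes mutate state, edges choose the next node.

def intent_detect(state):
    state["intent"] = "question" if "?" in state["text"] else "other"
    return state

def db_lookup(state):
    state["answer"] = f"looked up: {state['text']}"
    return state

def respond(state):
    state["response"] = state.get("answer", "how can I help?")
    return state

NODES = {"intent": intent_detect, "lookup": db_lookup, "respond": respond}
EDGES = {  # conditional transitions between nodes
    "intent": lambda s: "lookup" if s["intent"] == "question" else "respond",
    "lookup": lambda s: "respond",
    "respond": lambda s: None,  # terminal node
}

def run_graph(state, node="intent"):
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

print(run_graph({"text": "when do you open?"})["response"])
```

Because every step is an explicit node, long-running workflows can be checkpointed and resumed at any edge.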

Realtime Agent Frameworks (Voice-Focused)

Frameworks like Pipecat, Vapi, and Retell simplify low-latency streaming and barge-in.

  • Compare: streaming, cost, integrations
  • Pick based on latency & features
Framework Low-Latency? STT/TTS Built-in? Pricing
Vapi ✅ Yes ✅ Built-in $0.05/min
Retell ✅ Yes ✅ Managed $0.08/min
Bland ⚠️ Medium ✅ Full Stack $0.12/min
Pipecat ✅ Yes ✅ Integrated Open Source
Pipecat 🔗 Vapi 🔗 Retell 🔗

Measure to Improve

Without evals and telemetry, you can't iterate intelligently.

  • Offline evals (benchmarks, unit tests)
  • Online metrics (latency, success, UX)
  • Feedback loops & guardrails
Success Rate: 98.2% • Avg Latency: 1.2s • User Rating: 4.3/5
[Performance over time: 📊 dashboard mock]
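Dashboard numbers like these fall out of simple telemetry aggregation. A sketch over invented call records:

```python
import statistics

# Invented telemetry: (success, end_to_end_latency_seconds) per call.
calls = [(True, 1.1), (True, 0.9), (False, 2.4), (True, 1.3), (True, 1.0)]

def dashboard(records):
    """Aggregate raw call records into headline dashboard metrics."""
    latencies = sorted(latency for _, latency in records)
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "success_rate": sum(ok for ok, _ in records) / len(records),
        "avg_latency_s": round(statistics.mean(latencies), 2),
        "p95_latency_s": latencies[p95_index],
    }

print(dashboard(calls))
# e.g. {'success_rate': 0.8, 'avg_latency_s': 1.34, 'p95_latency_s': 2.4}
```

Tail percentiles (p95/p99) matter more than averages for user-facing latency, which is why both appear here.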

Modern LLM Evaluation Landscape

Comprehensive evaluation tools for prompt testing, model comparison, and security assessment.

📊 Performance Testing

  • promptfoo - Red teaming & regression tests
  • OpenAI Evals - Official benchmark framework

🎯 Advanced Evaluation

  • DeepEval - RAG & conversational metrics
  • PromptWizard - Self-evolving optimization

Why Evaluation Matters

"Creating high quality evals is one of the most impactful things you can do" - Greg Brockman, OpenAI

promptfoo: AI Red Teaming & Testing

Local LLM evaluation with security focus - 7.9k GitHub stars

πŸ” Key Features

  • Red teaming & vulnerability scanning
  • Side-by-side model comparisons
  • 100% local execution
  • CI/CD integration ready

Quick Start

npx promptfoo@latest init
npx promptfoo eval

🎯 Use Cases

  • Prompt engineering optimization
  • Model performance comparison
  • AI application security testing
  • Automated vulnerability reporting

Supported Providers

OpenAI, Anthropic, Azure, Bedrock, Ollama

promptfoo GitHub 🔗 Documentation 🔗

OpenAI Evals: Official Benchmark Framework

Open-source evaluation registry - 16.7k GitHub stars

🏗️ Framework Features

  • Open registry of benchmarks
  • Custom evaluations for specific use cases
  • Private evals with proprietary data
  • Model-graded evaluations

Installation

pip install evals

📊 Evaluation Types

  • Basic model evaluations
  • Prompt chain assessments
  • Tool-using agent tests
  • Custom workflow patterns
OpenAI Evals GitHub 🔗 Documentation 🔗
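In model-graded evaluation, a grader model scores candidate outputs against a rubric. A plain-Python sketch of the idea (the keyword-matching grader stands in for a real grading model; this is not the Evals framework API, and the rubric is invented):

```python
import re

def grader(question: str, answer: str) -> bool:
    """Stand-in for a grading model: does the answer contain the key fact?"""
    rubric = {"paris"}  # invented rubric for this sample question
    words = set(re.findall(r"[a-z]+", answer.lower()))
    return rubric <= words

dataset = [  # (input, model output) pairs to score
    ("Capital of France?", "The capital is Paris."),
    ("Capital of France?", "I believe it is Lyon."),
]

def run_eval(samples) -> float:
    """Pass rate across the dataset."""
    return sum(grader(q, a) for q, a in samples) / len(samples)

print(f"pass rate: {run_eval(dataset):.0%}")  # pass rate: 50%
```

Swapping the keyword check for an LLM call turns this into the model-graded pattern the framework supports.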

DeepEval: Comprehensive LLM Evaluation

Local evaluation framework with extensive metrics - 9.8k GitHub stars

🎯 Core Metrics

  • RAG metrics: Answer Relevancy, Faithfulness
  • Agentic metrics: Task Completion
  • Safety metrics: Hallucination, Bias, Toxicity
  • Conversational: Knowledge Retention

Quick Setup

pip install -U deepeval

🔧 Advanced Features

  • Pytest integration
  • Custom metric creation
  • Synthetic dataset generation
  • Red team testing (40+ vulnerabilities)
  • MMLU & HellaSwag benchmarking

Integrations

LlamaIndex, Hugging Face, Cloud Platform

DeepEval GitHub 🔗 Documentation 🔗

PromptWizard: Self-Evolving Optimization

Microsoft's AI-driven prompt optimization framework

🧙‍♂️ Self-Evolution

  • AI generates its own prompts
  • Self-critique and refinement
  • Synthetic examples generation
  • Chain of Thought optimization

Installation

git clone https://github.com/microsoft/PromptWizard
pip install -e .

📈 Usage Scenarios

  • Optimize prompts without examples
  • Generate synthetic training data
  • Optimize with existing training data
  • Task-aware prompt refinement

Tested Datasets

GSM8K, SVAMP, AQUA-RAT, Instruction Induction

PromptWizard GitHub 🔗 Research Paper 🔗

LLM Evaluation Best Practices

Strategic approach to comprehensive AI model assessment

🏗️ Foundation Layer

  • Automated Testing: OpenAI Evals for standardized benchmarks
  • Security Scanning: promptfoo for vulnerability assessment
  • CI/CD Integration: Run evals on every deployment

🎯 Advanced Layer

  • RAG Evaluation: DeepEval for retrieval quality
  • Prompt Optimization: PromptWizard for self-improvement
  • Custom Metrics: Domain-specific evaluations

Evaluation Strategy Framework

📊 Baseline (standard benchmarks) → 🔍 Security (red team testing) → 🎯 Domain (custom metrics) → 🔄 Optimize (continuous improvement)

Voice Agent KPIs

Track voice-specific metrics: barge-in success, first-token latency, intent accuracy.

  • Latency (STT, LLM, TTS)
  • Intent success / fallback rate
  • CSAT proxies (sentiment, repeat calls)
STT: WER 2.1%, latency 150ms • LLM: intent 95%, latency 800ms • TTS: quality 4.2/5, latency 300ms • UX: CSAT 4.1/5, barge-in 87%

Guardrails & Safety Checks

Insert safety filters before and after the model to prevent PII leaks and jailbreaks.

  • Input filter β†’ LLM β†’ Output filter
  • PII scrub, toxicity detection
  • Red-team tests
User Input → Input Filter (PII, toxicity) → LLM → Output Filter (safety, quality) → Response
Safety toolkits 🔗 Red-team guidelines 🔗
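A minimal input/output filter stage might look like this (the regexes are illustrative only; production guardrails use dedicated PII and toxicity models):

```python
import re

PII_PATTERNS = {  # illustrative patterns, not production-grade detection
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d\b"),
}

def scrub(text: str) -> str:
    """Filter stage: replace detected PII with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def guarded_llm(prompt: str, model) -> str:
    """Input filter → LLM → output filter, as in the diagram above."""
    return scrub(model(scrub(prompt)))

print(scrub("Reach me at jane@example.com or +1 415 555 0100."))
```

Scrubbing both directions matters: the input filter protects the model and its logs, the output filter catches anything the model regurgitates.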

From Autocomplete to Swarm Coders

Coding tools now plan, write, test, and review code, not just autocomplete it.

  • IDE copilots → Repo-wide agents
  • Autonomy: run tests, open PRs
  • Speed + quality gains
Autocomplete (code completion) → Pair Programming (chat, explain, debug) → Swarm Coding (plan, test, deploy)

Top IDE/Agent Tools - Part 1

Primary AI-powered development environments.

Cursor

Best for: VS Code replacement with AI

Claude Code

Best for: Complex reasoning tasks

Replit Agent

Best for: Full-stack prototyping

Caveat: Cloud-only environment

GitHub Copilot

Best for: Code completion

Caveat: Limited reasoning

Top IDE/Agent Tools - Part 2

Additional specialized AI development tools.

Lovable

Best for: Frontend development

Caveat: Limited backend features

🤖 Cline

Best for: VS Code extension agent

Caveat: Early development stage

⚫ Void Editor

Best for: Minimal AI-first IDE

Caveat: Limited plugin ecosystem

Bolt.new

Best for: Instant web apps

Caveat: Limited customization

Prompt & Command Layers (TaskMaster, Superprompt)

Reusable prompt macros and task planners supercharge IDE agents.

// TaskMaster Template
@template feature-implementation
@context ${codebase}
@requirements ${specs}
@output structured-plan
1. Analyze requirements
2. Design architecture
3. Implement & test
4. Document changes
📝 Template Example
  • TaskMaster: task planning for agents
  • Reusable 'workbench' prompts
  • Command palettes inside IDE

Productivity Boost

Standardized workflows reduce context switching and improve consistency

TaskMaster site 🔗 Prompt libraries 🔗
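Reusable prompt macros of this kind can be sketched with the standard library. The `${codebase}` and `${specs}` placeholders mirror the slots in the template example above; TaskMaster's actual syntax differs, and this workbench prompt is invented:

```python
from string import Template

# Hypothetical reusable "workbench" prompt with ${...} slots.
FEATURE_TEMPLATE = Template("""\
Task: implement a feature.
Context: ${codebase}
Requirements: ${specs}
Output: a structured plan (analyze, design, implement & test, document).""")

def render(codebase: str, specs: str) -> str:
    """Fill the macro's slots; Template.substitute raises on missing keys."""
    return FEATURE_TEMPLATE.substitute(codebase=codebase, specs=specs)

print(render("payments-service repo", "add retry logic to webhook delivery"))
```

Keeping prompts as named, parameterized templates is what makes them reviewable and versionable like code.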

Repo-Wide Agents: Branching, Testing, PRs

Agents can now autonomously modify repos, run CI, open PRs, self-review.

  • Branch per task
  • Auto tests & lint
  • Agent reviews agent
Agent 1 (🌿 create branch → write code → open PR) → CI/CD (⚙️ run tests → lint & build) → Agent 2 (👁️ code review)
Swarm GitHub 🔗 AutoGen docs 🔗 CrewAI GitHub 🔗
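The branch → test → PR loop can be sketched as a dry-run planner that emits the shell commands such an agent would execute (nothing is actually run here; `gh pr create` assumes the GitHub CLI is the PR mechanism):

```python
def agent_plan(task_id: str, files: list[str]) -> list[list[str]]:
    """Dry-run plan: the command sequence a repo-wide agent would execute."""
    branch = f"agent/{task_id}"
    return [
        ["git", "checkout", "-b", branch],          # branch per task
        ["git", "add", *files],                     # agent-written changes
        ["git", "commit", "-m", f"agent: {task_id}"],
        ["pytest", "-q"],                           # auto tests before pushing
        ["git", "push", "-u", "origin", branch],
        ["gh", "pr", "create", "--fill"],           # open PR for (agent) review
    ]

for cmd in agent_plan("fix-timeouts", ["src/client.py"]):
    print(" ".join(cmd))
```

Emitting the plan as data before executing it is also how you insert approval gates between an agent and your repo.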

Interactive Demo: Create 3D Game

This game was created live during the presentation using Claude Code.

[Game demo screenshot] ▶ Play Game

Key Takeaways

Summarize the entire talk in five crisp points.

  • 🧠 Reasoning + TTC is the new baseline for hard problems.
  • 🎤 Voice-to-Voice & realtime stacks are production-ready.
  • 🔗 MCP is becoming the new standard for orchestration.
  • 📊 Evals & metrics guard your quality.
  • ⌨️ Coding agents can deliver 3–10× productivity if set up right.

Resources & Link Dump

All links in one place - key AI development tools and frameworks.

Thank You for Your Attention!

Alexander Efremov
AI Expert, Aspirity Company

✉️ ae@aspirity.com | Telegram: @sabbah13
