AI Tools & Trends 2024–2025

What changed, what matters, what to use

Alex Efremov

Agenda

  • 2024-2025 AI Milestones
  • Reasoning LLMs & Test-Time Compute
  • Voice Stack
  • Agentic AI
  • Evals & Monitoring
  • AI Development

Key Milestones

Timeline

2024: The Year AI Went Mainstream

Major Releases

  • OpenAI o1 - Reasoning models
  • Sora - Text-to-video
  • GPT-4o - Multimodal
  • Claude 3.5 - Enhanced reasoning
  • Gemini 2.0 - 2x faster
  • Apple Intelligence - Platform integration
  • Llama 3.2 - Open multimodal

Key Innovations

  • Voice Mode - Real-time AI
  • ChatGPT Search - AI web search
  • Multimodal AI - All modalities
  • Agentic AI - Autonomous
  • EU AI Act - Global standards
  • GenCast - AI forecasting
  • NVIDIA - Chip dominance

2025: Era of Reasoning & Agents

  • Vibe Coding - Andrej Karpathy (Jan)
  • Sesame - Open source SOTA TTS (Feb)
  • Gemini 2.5 Pro - 1M-token context (Mar)
  • Qwen3 - Open hybrid-reasoning LLMs (Apr)
  • Claude Sonnet and Opus 4 - Top benchmarks (May)
  • Grok 4 Heavy - $300/mo tier
  • Kimi K2 - Trillion params open source
  • Qwen3-Coder - Agentic coding new SOTA

Reasoning at scale • Agentic workflows • Multimodal by default • Open vs closed race

Regular vs Reasoning Models

Reasoning models think in multiple steps and spend extra compute at inference (TTC) to get higher accuracy.

  • Single-pass → Multi-step deliberation
  • Tool calls during thinking
  • Test-Time Compute (TTC) boosts correctness
  • Benchmarks jump (GPQA, MMLU-Pro, AIME, HLE)

Traditional: Prompt → Answer

Reasoning: Prompt → Plan → Tool → Reflect → Answer
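The contrast can be sketched in code. Below is a toy illustration (not any real model's internals): the single-pass function answers immediately, while the multi-step function plans, calls a calculator tool, and verifies before answering.

```python
# Toy contrast between single-pass answering and a plan → tool → reflect loop.
# The "tool" is a sandboxed arithmetic eval; real reasoning models run an
# analogous loop over generated thinking tokens.

def single_pass(prompt: str) -> str:
    """One forward pass: no planning, no tools, answer immediately."""
    return "not sure"

def multi_step(prompt: str) -> str:
    """Plan → Tool → Reflect → Answer."""
    expr = prompt.rstrip("?").split("is")[-1].strip()   # Plan: extract the task
    result = eval(expr, {"__builtins__": {}})           # Tool: calculator call
    assert isinstance(result, (int, float))             # Reflect: sanity-check
    return str(result)                                  # Answer

print(single_pass("what is 12*7?"))  # not sure
print(multi_step("what is 12*7?"))   # 84
```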

o3/o4-mini system card 🔗 DeepSeek R1 GitHub 🔗 Anthropic Opus 4 PDF 🔗

Test-Time Compute (TTC) in Practice

More tokens spent reasoning = better results, with diminishing returns.

  • Allocate 'thinking tokens'
  • Higher TTC ≈ +X% on GPQA/MATH
  • APIs expose 'budget' or 'steps' knobs
[Chart: Accuracy (%) vs TTC budget in tokens, rising from Low through Medium and High to Max]
System cards with TTC curves 🔗
TTC Curves
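One concrete way to spend test-time compute is self-consistency: sample several reasoning chains and majority-vote the answer. A minimal simulation (the stochastic "model" here is invented) shows accuracy rising with budget, then flattening:

```python
import random
from collections import Counter

def sample_answer(truth: int, p_correct: float = 0.6) -> int:
    """Simulated reasoning chain: right answer with probability p_correct."""
    if random.random() < p_correct:
        return truth
    return truth + random.choice([-2, -1, 1, 2])  # a wrong nearby answer

def self_consistency(truth: int, budget: int) -> int:
    """More budget = more sampled chains = majority vote over more answers."""
    votes = Counter(sample_answer(truth) for _ in range(budget))
    return votes.most_common(1)[0][0]

def accuracy(budget: int, trials: int = 400) -> float:
    return sum(self_consistency(42, budget) == 42 for _ in range(trials)) / trials

random.seed(0)
for budget in (1, 3, 9, 27):
    print(f"budget={budget:2d}  accuracy={accuracy(budget):.2f}")
```

The diminishing-returns shape of the printed curve is the same one the system-card TTC charts show.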

DeepSeek R1

R1 proves open-weight reasoning can rival closed models.

Parameters: 685B (MoE)
Training Tokens: 14.8T
GPQA: 81.0
License: MIT
  • Strengths: reasoning, math, multi-step logic
  • Caveat: needs safety filters
HF repo 🔗 R1 paper/blog 🔗

Claude Opus 4 / Sonnet 4

Anthropic's hybrid reasoning + tool use models handle day-long tasks & coding reliably.

  • Opus 4: highest intelligence tier
  • Sonnet 4: cheaper, strong reasoning
  • Can work continuously for hours on complex tasks
GPQA: Sonnet 4 82.1 • Opus 4 85.3 • o3 87.7
Anthropic PDF/system card 🔗
Claude Benchmarks

Kimi K2

Moonshot Kimi K2 delivers open, massive MoE reasoning at scale (1T params, 384 experts).

  • 1T params / 15.5T tokens
  • Strong math & logic (AIME 77.5)
  • Apache-style open license
[Radar chart: K2 across MMLU, GPQA, AIME, Code, Math, Logic]
Moonshot GitHub 🔗 OpenRouter entry 🔗
Kimi K2 Benchmarks

Qwen3 Coder (480B MoE)

Alibaba's specialist coding model leads open-source coding benchmarks with 480B params, 35B active.

Parameters: 480B (35B active)
Context Length: 256K (1M extended)
SWE-Bench Verified: open-source SOTA
CodeForces ELO: leading
  • Multi-language: Python, JS, Java, C++, Go, Rust+
  • Debugging & refactoring: automated optimization
  • Security: vulnerability detection
[Benchmarks: SWE-Bench, CodeForces, LiveCodeBench, BFCL]
Qwen GitHub 🔗 HuggingFace 🔗 Technical Report 🔗
Qwen3 Coder Benchmark

Leaderboard (Jul 2025)

A snapshot of key reasoning contenders and a couple of headline metrics.

Model Params Tokens MMLU GPQA
Grok 4 5000B 80T - 88.9
Claude Opus 4 1200B 100T - 83.3
DeepSeek R1 685B 14.8T 93.4 81.0
Qwen3 Coder 480B - - -
Kimi K2 1000B 15.5T 89.5 75.1
Qwen3-235B 235B 36T 93.1 77.5

Voice Pipeline Overview

Reliable voice agents need a tuned pipeline: STT → Reasoner → TTS (or V2V), optimized for latency and quality.

  • STT: fast/accurate
  • LLM: reason/tool-use
  • TTS/V2V: expressive & low-latency
  • Barge-in, interruptions handling
🎤 STT (Audio → Text) → 🧠 LLM (Reason & Tools) → 🔊 TTS/V2V (Text → Audio)
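The STT → LLM → TTS chain can be sketched as composed stages with per-stage latency measurement (the three stage functions below are stand-in stubs; real deployments stream between stages rather than passing whole payloads):

```python
import time

def stt(audio: bytes) -> str:
    return "what time is it"          # stub: real STT streams partial text

def llm(text: str) -> str:
    return f"you asked: {text}"       # stub: reasoning + tool calls live here

def tts(text: str) -> bytes:
    return text.encode()              # stub: real TTS streams audio frames

def voice_turn(audio: bytes):
    """Run one STT → LLM → TTS turn, recording per-stage latency."""
    timings, x = {}, audio
    for name, stage in (("stt", stt), ("llm", llm), ("tts", tts)):
        start = time.perf_counter()
        x = stage(x)
        timings[name] = (time.perf_counter() - start) * 1000  # milliseconds
    return x, timings

reply, latency_ms = voice_turn(b"\x00\x01")
print(reply)
print({k: round(v, 3) for k, v in latency_ms.items()})
```

Tracking latency per stage, not just end to end, is what lets you decide which component to swap.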

Speech-to-Text: Accuracy vs Latency

Choose STT by WER, latency, language coverage & diarization.

Word Error Rate: Deepgram 2.1% • Whisper v3 2.3% • AssemblyAI 2.5% • GPT-4o STT 2.8%
  • WER ~2-3%, Latency ~50-200ms
  • Streaming + partial transcripts
  • Multi-language & diarization
Typical latency: ~150ms
Whisper API 🔗 Deepgram docs 🔗 AssemblyAI 🔗

TTS: Naturalness, Control & Styles

Modern TTS offers controllable emotion/style with near-human naturalness.

[Demo: voice waveform with pitch contour]
  • ElevenLabs v3: style, emotion, low latency
  • PlayHT, Hume Octave, Papla P1 (streaming APIs)
  • Sesame
ElevenLabs API 🔗 PlayHT docs 🔗 Hume Octave 🔗
ElevenLabs v3 Demo

Voice-to-Voice (V2V) Models

V2V is here: direct voice in → voice out, enabling fluid, natural dialogs.

  • OpenAI Realtime
  • Gemini Live
  • Sesame & others emerging
Voice In → Voice Encoder → Voice Embeddings → Voice Decoder → Voice Out (~200-500ms end-to-end)
OpenAI Realtime API 🔗 Gemini Live docs 🔗

Gluing It Together: Agents, Tools & Protocols

MCP and graph frameworks standardize tool access and agent flow.

  • MCP: tool servers for LLMs
  • Graph orchestration (LangGraph, Haystack 2, CrewAI)
  • Realtime frameworks for voice agents
[MCP connects to: Database, APIs, Files, Web, Email, Cloud]
Anthropic MCP docs 🔗

Model Context Protocol (MCP) Deep Dive

MCP is rapidly becoming the default way to expose SaaS tools to models.

  • Spec defines: tools, prompts, schemas
  • Growing vendor support (CRMs, DBs, SaaS)
  • Easy local MCP servers
{
  "jsonrpc": "2.0",
  "method": "tools/call",
  "params": {
    "name": "database_query",
    "arguments": {
      "query": "SELECT * FROM users",
      "database": "production"
    }
  }
}
Official MCP docs 🔗
MCP Deep Dive
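The tools/call request above can be generated programmatically. A standard-library sketch (it adds the `id` field that JSON-RPC 2.0 requests carry; the transport to an actual MCP server is omitted):

```python
import itertools
import json

_request_ids = itertools.count(1)  # JSON-RPC requests need unique ids

def mcp_tools_call(name: str, arguments: dict) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request, as used by MCP."""
    return json.dumps({
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": name, "arguments": arguments},
    }, indent=2)

print(mcp_tools_call("database_query",
                     {"query": "SELECT * FROM users", "database": "production"}))
```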

Graph-Oriented Orchestration

LangGraph & similar frameworks let you design agent flows as state graphs.

  • Nodes = steps/tools
  • Edges = transitions/conditions
  • Great for long-running workflows
Voice Call → Intent Detect → (question? lookup?) → DB Lookup → Response
LangGraph docs 🔗 CrewAI GitHub 🔗
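The flow on this slide can be expressed as a tiny state graph in plain Python. This mirrors the LangGraph idea of nodes mutating shared state and conditional edges picking the next node, but it is not the LangGraph API itself:

```python
# Toy state-graph runner: nodes mutate state, edges choose the next node.

def intent_detect(state):
    state["intent"] = "question" if "?" in state["text"] else "other"
    return state

def db_lookup(state):
    state["answer"] = f"looked up: {state['text']}"
    return state

def respond(state):
    state["response"] = state.get("answer", "how can I help?")
    return state

NODES = {"intent": intent_detect, "lookup": db_lookup, "respond": respond}
EDGES = {  # conditional transitions between nodes
    "intent": lambda s: "lookup" if s["intent"] == "question" else "respond",
    "lookup": lambda s: "respond",
    "respond": lambda s: None,  # terminal node
}

def run_graph(state, node="intent"):
    while node is not None:
        state = NODES[node](state)
        node = EDGES[node](state)
    return state

print(run_graph({"text": "when do you open?"})["response"])
```

Because every step is an explicit node, long-running workflows can be checkpointed and resumed at any edge.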

Realtime Agent Frameworks (Voice-Focused)

Frameworks like Pipecat, Vapi, and Retell simplify low-latency streaming and barge-in.

  • Compare: streaming, cost, integrations
  • Pick based on latency & features
Framework Low-Latency? STT/TTS Built-in? Pricing
Vapi ✅ Yes ✅ Built-in $0.05/min
Retell ✅ Yes ✅ Managed $0.08/min
Bland ⚠️ Medium ✅ Full Stack $0.12/min
Pipecat ✅ Yes ✅ Integrated Open Source
Pipecat 🔗 Vapi 🔗 Retell 🔗

Measure to Improve

Without evals and telemetry, you can't iterate intelligently.

  • Offline evals (benchmarks, unit tests)
  • Online metrics (latency, success, UX)
  • Feedback loops & guardrails
Success Rate: 98.2% • Avg Latency: 1.2s • User Rating: 4.3/5
[Performance over time: 📊 dashboard mock]
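Dashboard numbers like these fall out of simple telemetry aggregation. A sketch over invented call records:

```python
import statistics

# Invented telemetry: (success, end_to_end_latency_seconds) per call.
calls = [(True, 1.1), (True, 0.9), (False, 2.4), (True, 1.3), (True, 1.0)]

def dashboard(records):
    """Aggregate raw call records into headline dashboard metrics."""
    latencies = sorted(latency for _, latency in records)
    p95_index = min(len(latencies) - 1, int(0.95 * len(latencies)))
    return {
        "success_rate": sum(ok for ok, _ in records) / len(records),
        "avg_latency_s": round(statistics.mean(latencies), 2),
        "p95_latency_s": latencies[p95_index],
    }

print(dashboard(calls))
# e.g. {'success_rate': 0.8, 'avg_latency_s': 1.34, 'p95_latency_s': 2.4}
```

Tail percentiles (p95/p99) matter more than averages for user-facing latency, which is why both appear here.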

Modern LLM Evaluation Landscape

Comprehensive evaluation tools for prompt testing, model comparison, and security assessment.

📊 Performance Testing

  • promptfoo - Red teaming & regression tests
  • OpenAI Evals - Official benchmark framework

🎯 Advanced Evaluation

  • DeepEval - RAG & conversational metrics
  • PromptWizard - Self-evolving optimization

Why Evaluation Matters

"Creating high quality evals is one of the most impactful things you can do" - Greg Brockman, OpenAI

promptfoo: AI Red Teaming & Testing

Local LLM evaluation with security focus - 7.9k GitHub stars

πŸ” Key Features

  • Red teaming & vulnerability scanning
  • Side-by-side model comparisons
  • 100% local execution
  • CI/CD integration ready

Quick Start

npx promptfoo@latest init
npx promptfoo eval

🎯 Use Cases

  • Prompt engineering optimization
  • Model performance comparison
  • AI application security testing
  • Automated vulnerability reporting

Supported Providers

OpenAI, Anthropic, Azure, Bedrock, Ollama

promptfoo GitHub 🔗 Documentation 🔗

OpenAI Evals: Official Benchmark Framework

Open-source evaluation registry - 16.7k GitHub stars

🏗️ Framework Features

  • Open registry of benchmarks
  • Custom evaluations for specific use cases
  • Private evals with proprietary data
  • Model-graded evaluations

Installation

pip install evals

📊 Evaluation Types

  • Basic model evaluations
  • Prompt chain assessments
  • Tool-using agent tests
  • Custom workflow patterns
OpenAI Evals GitHub 🔗 Documentation 🔗
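In model-graded evaluation, a grader model scores candidate outputs against a rubric. A plain-Python sketch of the idea (the keyword-matching grader stands in for a real grading model; this is not the Evals framework API, and the rubric is invented):

```python
import re

def grader(question: str, answer: str) -> bool:
    """Stand-in for a grading model: does the answer contain the key fact?"""
    rubric = {"paris"}  # invented rubric for this sample question
    words = set(re.findall(r"[a-z]+", answer.lower()))
    return rubric <= words

dataset = [  # (input, model output) pairs to score
    ("Capital of France?", "The capital is Paris."),
    ("Capital of France?", "I believe it is Lyon."),
]

def run_eval(samples) -> float:
    """Pass rate across the dataset."""
    return sum(grader(q, a) for q, a in samples) / len(samples)

print(f"pass rate: {run_eval(dataset):.0%}")  # pass rate: 50%
```

Swapping the keyword check for an LLM call turns this into the model-graded pattern the framework supports.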

DeepEval: Comprehensive LLM Evaluation

Local evaluation framework with extensive metrics - 9.8k GitHub stars

🎯 Core Metrics

  • RAG metrics: Answer Relevancy, Faithfulness
  • Agentic metrics: Task Completion
  • Safety metrics: Hallucination, Bias, Toxicity
  • Conversational: Knowledge Retention

Quick Setup

pip install -U deepeval

🔧 Advanced Features

  • Pytest integration
  • Custom metric creation
  • Synthetic dataset generation
  • Red team testing (40+ vulnerabilities)
  • MMLU & HellaSwag benchmarking

Integrations

LlamaIndex, Hugging Face, Cloud Platform

DeepEval GitHub 🔗 Documentation 🔗

PromptWizard: Self-Evolving Optimization

Microsoft's AI-driven prompt optimization framework

🧙‍♂️ Self-Evolution

  • AI generates its own prompts
  • Self-critique and refinement
  • Synthetic examples generation
  • Chain of Thought optimization

Installation

git clone https://github.com/microsoft/PromptWizard
pip install -e .

📈 Usage Scenarios

  • Optimize prompts without examples
  • Generate synthetic training data
  • Optimize with existing training data
  • Task-aware prompt refinement

Tested Datasets

GSM8K, SVAMP, AQUA-RAT, Instruction Induction

PromptWizard GitHub 🔗 Research Paper 🔗

LLM Evaluation Best Practices

Strategic approach to comprehensive AI model assessment

🏗️ Foundation Layer

  • Automated Testing: OpenAI Evals for standardized benchmarks
  • Security Scanning: promptfoo for vulnerability assessment
  • CI/CD Integration: Run evals on every deployment

🎯 Advanced Layer

  • RAG Evaluation: DeepEval for retrieval quality
  • Prompt Optimization: PromptWizard for self-improvement
  • Custom Metrics: Domain-specific evaluations

Evaluation Strategy Framework

📊 Baseline (standard benchmarks) → 🔍 Security (red team testing) → 🎯 Domain (custom metrics) → 🔄 Optimize (continuous improvement)

Voice Agent KPIs

Track voice-specific metrics: barge-in success, first-token latency, intent accuracy.

  • Latency (STT, LLM, TTS)
  • Intent success / fallback rate
  • CSAT proxies (sentiment, repeat calls)
STT: WER 2.1%, latency 150ms • LLM: intent 95%, latency 800ms • TTS: quality 4.2/5, latency 300ms • UX: CSAT 4.1/5, barge-in 87%

Guardrails & Safety Checks

Insert safety filters before and after the model to prevent PII leaks and jailbreaks.

  • Input filter β†’ LLM β†’ Output filter
  • PII scrub, toxicity detection
  • Red-team tests
User Input → Input Filter (PII, toxicity) → LLM → Output Filter (safety, quality) → Response
Safety toolkits 🔗 Red-team guidelines 🔗
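A minimal input/output filter stage might look like this (the regexes are illustrative only; production guardrails use dedicated PII and toxicity models):

```python
import re

PII_PATTERNS = {  # illustrative patterns, not production-grade detection
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d\b"),
}

def scrub(text: str) -> str:
    """Filter stage: replace detected PII with typed placeholders."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

def guarded_llm(prompt: str, model) -> str:
    """Input filter → LLM → output filter, as in the diagram above."""
    return scrub(model(scrub(prompt)))

print(scrub("Reach me at jane@example.com or +1 415 555 0100."))
```

Scrubbing both directions matters: the input filter protects the model and its logs, the output filter catches anything the model regurgitates.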

From Autocomplete to Swarm Coders

Coding tools now plan, write, test, and review code, not just autocomplete it.

  • IDE copilots → Repo-wide agents
  • Autonomy: run tests, open PRs
  • Speed + quality gains
Autocomplete (code completion) → Pair Programming (chat, explain, debug) → Swarm Coding (plan, test, deploy)

Top IDE/Agent Tools - Part 1

Primary AI-powered development environments.

Cursor

Best for: VS Code replacement with AI

Claude Code

Best for: Complex reasoning tasks

Replit Agent

Best for: Full-stack prototyping

Caveat: Cloud-only environment

GitHub Copilot

Best for: Code completion

Caveat: Limited reasoning

Top IDE/Agent Tools - Part 2

Additional specialized AI development tools.

Lovable

Best for: Frontend development

Caveat: Limited backend features

🤖 Cline

Best for: VS Code extension agent

Caveat: Early development stage

⚫ Void Editor

Best for: Minimal AI-first IDE

Caveat: Limited plugin ecosystem

Bolt.new

Best for: Instant web apps

Caveat: Limited customization

Prompt & Command Layers (TaskMaster, Superprompt)

Reusable prompt macros and task planners supercharge IDE agents.

// TaskMaster Template
@template feature-implementation
@context ${codebase}
@requirements ${specs}
@output structured-plan
1. Analyze requirements
2. Design architecture
3. Implement & test
4. Document changes
📝 Template Example
  • TaskMaster: task planning for agents
  • Reusable 'workbench' prompts
  • Command palettes inside IDE

Productivity Boost

Standardized workflows reduce context switching and improve consistency

TaskMaster site 🔗 Prompt libraries 🔗
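Reusable prompt macros of this kind can be sketched with the standard library. The `${codebase}` and `${specs}` placeholders mirror the slots in the template example above; TaskMaster's actual syntax differs, and this workbench prompt is invented:

```python
from string import Template

# Hypothetical reusable "workbench" prompt with ${...} slots.
FEATURE_TEMPLATE = Template("""\
Task: implement a feature.
Context: ${codebase}
Requirements: ${specs}
Output: a structured plan (analyze, design, implement & test, document).""")

def render(codebase: str, specs: str) -> str:
    """Fill the macro's slots; Template.substitute raises on missing keys."""
    return FEATURE_TEMPLATE.substitute(codebase=codebase, specs=specs)

print(render("payments-service repo", "add retry logic to webhook delivery"))
```

Keeping prompts as named, parameterized templates is what makes them reviewable and versionable like code.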

Repo-Wide Agents: Branching, Testing, PRs

Agents can now autonomously modify repos, run CI, open PRs, self-review.

  • Branch per task
  • Auto tests & lint
  • Agent reviews agent
Agent 1 (🌿 create branch → write code → open PR) → CI/CD (⚙️ run tests → lint & build) → Agent 2 (👁️ code review)
Swarm GitHub 🔗 AutoGen docs 🔗 CrewAI GitHub 🔗
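The branch → test → PR loop can be sketched as a dry-run planner that emits the shell commands such an agent would execute (nothing is actually run here; `gh pr create` assumes the GitHub CLI is the PR mechanism):

```python
def agent_plan(task_id: str, files: list[str]) -> list[list[str]]:
    """Dry-run plan: the command sequence a repo-wide agent would execute."""
    branch = f"agent/{task_id}"
    return [
        ["git", "checkout", "-b", branch],          # branch per task
        ["git", "add", *files],                     # agent-written changes
        ["git", "commit", "-m", f"agent: {task_id}"],
        ["pytest", "-q"],                           # auto tests before pushing
        ["git", "push", "-u", "origin", branch],
        ["gh", "pr", "create", "--fill"],           # open PR for (agent) review
    ]

for cmd in agent_plan("fix-timeouts", ["src/client.py"]):
    print(" ".join(cmd))
```

Emitting the plan as data before executing it is also how you insert approval gates between an agent and your repo.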

Interactive Demo: Create 3D Game

This game was created live during the presentation using Claude Code.

[Game demo screenshot] ▶ Play Game

Key Takeaways

Summarize the entire talk in five crisp points.

  • 🧠 Reasoning + TTC is the new baseline for hard problems.
  • 🎤 Voice-to-Voice & realtime stacks are production-ready.
  • 🔗 MCP is becoming the new standard for orchestration.
  • 📊 Evals & metrics guard your quality.
  • ⌨️ Coding agents can deliver 3–10× productivity if set up right.

Resources & Link Dump

All links in one place - key AI development tools and frameworks.

Thank You for Your Attention!

Alexander Efremov
AI Expert, Aspirity Company

✉️ ae@aspirity.com | Telegram: @sabbah13
