Reasoning at scale β’ Agentic workflows β’ Multimodal default β’ Open vs closed race
Reasoning models think in multiple steps and spend extra compute at inference (TTC) to get higher accuracy.
Prompt β Answer
Prompt β Plan β Tool β Reflect β Answer
More tokens spent reasoning = better results, with diminishing returns.
R1 proves open-weight reasoning can rival closed models.
| Metric | Value |
|---|---|
| Parameters | 685B (MoE) |
| Training Tokens | 14.8T |
| GPQA | 81.0 |
| License | MIT |
Anthropic's hybrid reasoning + tool use models handle day-long tasks & coding reliably.
Moonshot Kimi K2 delivers open, massive MoE reasoning at scale (1T params, 384 experts).
Alibaba's specialist coding model crushing benchmarks with 480B params, 35B active.
| Metric | Value |
|---|---|
| Parameters | 480B (35B active) |
| Context Length | 256K (1M ext.) |
| SWE-Bench Verified | SOTA open-source |
| CodeForces ELO | Leading |
Snapshot of key reasoning contenders & a couple headline metrics.
| Model | Params | Tokens | MMLU | GPQA |
|---|---|---|---|---|
| Grok 4 | 5000B | 80T | - | 88.9 |
| Claude Opus 4 | 1200B | 100T | - | 83.3 |
| DeepSeek R1 | 685B | 14.8T | 93.4 | 81.0 |
| Qwen3 Coder | 480B | - | - | - |
| Kimi K2 | 1000B | 15.5T | 89.5 | 75.1 |
| Qwen3-235B | 235B | 36T | 93.1 | 77.5 |
Reliable voice agents need a tuned pipeline: STT β Reasoner β TTS or V2V, optimized for latency & quality.
Audio β Text
Reason & Tools
Text β Audio
Choose STT by WER, latency, language coverage & diarization.
Modern TTS offers controllable emotion/style with near-human naturalness.
V2V is here: direct voice in β voice out, enabling fluid, natural dialogs.
MCP and graph frameworks standardize tool access and agent flow.
MCP is rapidly becoming the default way to expose SaaS tools to models.
LangGraph & similar frameworks let you design agent flows as state graphs.
Frameworks like Pipecat, Vapi, Retell simplify low-latency streaming & barge-in.
| Framework | Low-Latency? | STT/TTS Built-in? | Pricing |
|---|---|---|---|
| Vapi | β Yes | β Built-in | $0.05/min |
| Retell | β Yes | β Managed | $0.08/min |
| Bland | β οΈ Medium | β Full Stack | $0.12/min |
| Pipecat | β Yes | β Integrated | Open Source |
Without evals and telemetry, you can't iterate intelligently.
Comprehensive evaluation tools for prompt testing, model comparison, and security assessment.
"Creating high quality evals is one of the most impactful things you can do" - Greg Brockman, OpenAI
Local LLM evaluation with security focus - 7.9k GitHub stars
npx promptfoo@latest init
npx promptfoo eval
OpenAI, Anthropic, Azure, Bedrock, Ollama
Open-source evaluation registry - 16.7k GitHub stars
pip install evals
"Creating high quality evals is one of the most impactful things you can do" - Greg Brockman
Local evaluation framework with extensive metrics - 9.8k GitHub stars
pip install -U deepeval
LlamaIndex, Hugging Face, Cloud Platform
Microsoft's AI-driven prompt optimization framework
git clone microsoft/PromptWizard
pip install -e .
GSM8k, SVAMP, AQUARAT, Instruction Induction
Strategic approach to comprehensive AI model assessment
Baseline
Standard benchmarks
Security
Red team testing
Domain
Custom metrics
Optimize
Continuous improvement
Track voice-specific metrics: barge-in success, first-token latency, intent accuracy.
Insert safety filters pre & post model to prevent PII leaks and jailbreaks.
Coding tools now plan, write, test, and review codeβnot just autocomplete.
Primary AI-powered development environments.
Best for: VS Code replacement with AI
Best for: Complex reasoning tasks
Best for: Full-stack prototyping
Caveat: Cloud-only environment
Best for: Code completion
Caveat: Limited reasoning
Additional specialized AI development tools.
Best for: Frontend development
Caveat: Limited backend features
Best for: VS Code extension agent
Caveat: Early development stage
Best for: Minimal AI-first IDE
Caveat: Limited plugin ecosystem
Best for: Instant web apps
Caveat: Limited customization
Reusable prompt macros and task planners supercharge IDE agents.
Standardized workflows reduce context switching and improve consistency
Agents can now autonomously modify repos, run CI, open PRs, self-review.
This game was created live during the presentation in Claude Code.
Summarize the entire talk in five crisp points.
All links in one place - key AI development tools and frameworks.
Alexander Efremov
AI Expert, Aspirity
Company
βοΈ ae@aspirity.com | Telegram: @sabbah13