Upcoming Webinar: "Maximizing LLM Inference Throughput"
Strategies for Low Latency and High Efficiency
Webinar Overview
This session focuses on the critical shift from AI training to high-efficiency inference, featuring updated insights and live Q&A with Fixstars engineers.
As LLMs move into production, inference performance becomes the primary driver of both user experience and operational costs. In this webinar, Fixstars engineers will demonstrate how to transition from "standard" inference to "high-speed" execution by tackling the "Memory Wall" and compute bottlenecks. We will explore low-level optimizations including custom GPU kernels, efficient memory handling, and advanced decoding techniques.
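To see why decoding so often runs into the "Memory Wall," a back-of-envelope estimate like the sketch below can help; the parameter count, precision, and bandwidth figures here are illustrative assumptions, not numbers from the webinar or from AIBooster.

```python
# Back-of-envelope estimate of decode throughput for a memory-bandwidth-bound LLM.
# All figures below are illustrative assumptions, not benchmark results.

weight_bytes = 7e9 * 2          # 7B parameters in FP16 (2 bytes each)
hbm_bandwidth = 2.0e12          # 2 TB/s of GPU memory bandwidth (assumed)

# At batch size 1, every generated token must stream all weights from HBM once,
# so the bandwidth ceiling bounds tokens/second regardless of compute speed.
tokens_per_sec = hbm_bandwidth / weight_bytes
print(f"Upper bound: ~{tokens_per_sec:.0f} tokens/s per sequence")  # ~143 tokens/s

# Batching amortizes the weight traffic across sequences, which is why
# throughput-oriented serving pushes batch sizes up until compute saturates.
for batch in (1, 8, 32):
    print(f"batch={batch:3d}: ~{batch * tokens_per_sec:.0f} tokens/s total")
```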
The session will feature real-world optimization scenarios, demonstrating how to achieve significant reductions in TTFT (Time To First Token) and increases in total throughput. We will showcase the Fixstars AIBooster platform’s capabilities in visualizing inference bottlenecks and automating the path to optimal performance. A demo will guide you through the process of profiling an LLM workload, identifying latency spikes, and applying targeted optimizations to squeeze every bit of performance out of your hardware.
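To make the TTFT and throughput metrics concrete, a minimal measurement harness might look like the sketch below; `stream_tokens`/`fake_stream` is a hypothetical stand-in for whatever streaming iterator your serving stack's client library provides, not a Fixstars or AIBooster API.

```python
import time
from typing import Iterable

def measure_stream(tokens: Iterable[str]) -> None:
    """Measure TTFT and decode throughput over any token stream.

    `tokens` is any iterable that yields generated tokens as they arrive,
    e.g. the streaming iterator from your inference server's client library.
    """
    start = time.perf_counter()
    first_token_at = None
    count = 0
    for _ in tokens:
        now = time.perf_counter()
        if first_token_at is None:
            first_token_at = now  # timestamp of the first token (TTFT)
        count += 1
    end = time.perf_counter()

    if first_token_at is None:
        print("No tokens received")
        return
    print(f"TTFT: {(first_token_at - start) * 1000:.1f} ms")
    # Throughput counted over the decode phase (tokens after the first).
    decode_time = end - first_token_at
    if count > 1 and decode_time > 0:
        print(f"Decode throughput: {(count - 1) / decode_time:.1f} tokens/s")

# Example with a simulated stream; replace with a real streaming client.
def fake_stream():
    time.sleep(0.15)            # simulated prefill latency
    for _ in range(50):
        time.sleep(0.02)        # simulated per-token decode latency
        yield "tok"

measure_stream(fake_stream())
```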
Speaker
Miho Yoneda
Agenda
- Challenges in LLM Inference
- Optimization Techniques: Software & Hardware
- Case Study & Benchmark
- Live Q&A Session
Total: 45 minutes
* You can enter and leave the session at any time.
* Please note that the schedule and content may change without prior notice.
AIBooster is a performance engineering platform for continuously observing and improving the performance of AI workloads.
Through comprehensive dashboards, users can visualize the utilization efficiency of various hardware resources—including CPU, GPU, interconnect, and storage—as well as identify software bottlenecks to analyze AI workload performance characteristics. Furthermore, by applying optimization frameworks specifically designed for AI workloads, users can achieve efficient performance improvements.
Learn more about AIBooster here.
Date and time
Thursday, February 26, 2026
12:00 PM - 12:45 PM PST
Location
Zoom
Target Audience
- AI Service Architects: Engineers responsible for deploying LLMs into production environments.
- Inference Infrastructure Engineers: Professionals focused on scaling AI workloads and minimizing latency.
- Technical Decision Makers: Leaders looking to optimize GPU cloud costs and improve AI service profitability.
- MLOps Engineers: Those interested in integrating performance monitoring and optimization into the CI/CD pipeline.
Participation fee
Free