Back to home|AI|BreakingApril 6, 2026

Running Google Gemma 4 Locally: A Guide to LM Studio CLI and Claude Code

Discover how to run Google's Gemma 4 locally using the new LM Studio CLI, leveraging its Mixture-of-Experts architecture for high-performance inference.

Running Google Gemma 4 Locally: A Guide to LM Studio CLI and Claude Code

Key Points

LM Studio 0.4.0 introduces the llmster engine and a new headless CLI (lms).
Google's Gemma 4 26B-A4B model uses Mixture-of-Experts for high-efficiency inference.
Achieves up to 51 tokens per second on Apple Silicon hardware.
Supports 256K context, vision capabilities, and native tool calling.
Built-in estimation tools allow for precise memory planning before model deployment.

The release of LM Studio 0.4.0 marks a significant milestone in local AI development, introducing the llmster inference engine and a powerful new headless CLI. This update allows developers to run sophisticated models like Google’s Gemma 4 entirely from the command line, bypassing the need for a GUI and enabling seamless integration into professional workflows. For those concerned about API costs, data privacy, and network latency, this local-first approach provides a robust alternative to cloud-based AI services. At the heart of this setup is the Google Gemma 4 26B-A4B model, which utilizes a Mixture-of-Experts (MoE) architecture. Unlike dense models that activate all parameters for every request, this MoE model activates only a small fraction of its 26 billion parameters—roughly 4 billion—per forward pass. On hardware like a 14-inch MacBook Pro with an M4 Pro chip and 48GB of unified memory, this translates to impressive inference speeds of approximately 51 tokens per second. The efficiency gains are substantial, allowing the model to perform complex reasoning tasks while remaining responsive. Google has positioned the Gemma 4 family to cover a wide range of hardware targets. The lineup includes the 'E' variants (E2B and E4B), which are optimized for on-device deployment and feature unique support for audio inputs. The flagship 31B dense model remains the performance leader, scoring highly on benchmarks like MMLU Pro and AIME 2026. However, the 26B-A4B variant has emerged as the 'sweet spot' for local users, offering a high Elo score of 1441—competitive with models many times its size—while maintaining a manageable memory footprint. LM Studio 0.4.0 fundamentally rearchitected how local models are served. By extracting the core inference engine into the llmster daemon, the developers have enabled a 'headless' mode. Users can now manage models, load them, and interact with them via the 'lms' CLI. This is a game-changer for CI/CD pipelines, SSH sessions, and developers who prefer the terminal environment. The new version also introduces parallel request processing through continuous batching, allowing the model to handle multiple concurrent tasks efficiently, provided the hardware supports it. Installation is straightforward across platforms, requiring only a simple curl or irm command. Once the daemon is running, the CLI provides granular control over model management. Users can list downloaded models, check memory usage, and even estimate the memory requirements for specific context lengths before loading. This 'estimate-only' feature is particularly useful for capacity planning, ensuring that users do not overload their system memory when pushing to the model's maximum 256K context limit. Performance tuning is another area where the new CLI excels. Users can adjust GPU offloading, set TTL (Time-To-Live) values for automatic model unloading, and configure parallel request slots. For instance, on Apple Silicon, where CPU and GPU share the same memory pool, users can fine-tune the --gpu flag to balance speed and resource allocation. These configurations can also be saved as per-model defaults through the desktop app, ensuring a consistent experience whether loading from the CLI or the GUI. Ultimately, the ability to run high-performance models like Gemma 4 locally is transformative. It allows for the use of advanced features such as vision support, native function calling, and configurable thinking modes without the constraints of external servers. As local inference continues to mature, tools like LM Studio's headless CLI are providing the necessary infrastructure to bring powerful AI capabilities directly to the developer's laptop, ensuring that privacy, performance, and cost-efficiency remain at the forefront of the AI revolution.

The Mixture-of-Experts Advantage

The efficiency of the Gemma 4 26B-A4B model stems from its Mixture-of-Experts (MoE) architecture, which dynamically activates only the necessary parameters for each task. This approach ensures high-quality reasoning and performance while keeping the memory and computational requirements low enough for high-end consumer hardware. By effectively punching above its weight class, this model allows users to tap into capabilities previously reserved for massive, cloud-hosted behemoths, all while maintaining complete control over the local execution environment.

New Headless CLI Capabilities

LM Studio 0.4.0 has fundamentally changed the landscape for local AI by introducing a robust headless CLI. This allows developers to move beyond GUI-based interactions, enabling the integration of local LLMs into automated pipelines, server environments, and advanced development workflows. With features like parallel request processing, TTL-based model unloading, and precise memory estimation, users have the tools needed to optimize their local AI infrastructure for speed, reliability, and resource efficiency.

This article was drafted with AI assistance and editorially reviewed before publication. Sources are listed below.

عن الكاتب

عبدالله الجاسر

المؤسس

مهندس صناعي | مؤسس منصة نيوزلي | شغوف بالتقنية والذكاء الاصطناعي

كل مقالات الكاتب

Sources

George Liu's Tech Blog