picoLLM Compression produces the smallest models at any accuracy target, and picoLLM Inference runs them across mobile, desktop, embedded, and every web browser. No cloud dependency or heavy compute requirements.
picoLLM enables enterprises to run large language models entirely on-device across mobile phones, web browsers, embedded systems, desktop applications, and on-premise servers. Prompts and completions never leave the customer's infrastructure.
picoLLM is the only on-device LLM platform with a quantization algorithm and an inference engine designed together for non-uniform bit allocation. Every other LLM quantization algorithm assigns fixed bit depths to weights. picoLLM Compression learns the optimal bit distribution per weight, per model, per compression ratio, allocating 1-bit through 8-bit values continuously across the model based on each weight's contribution to output quality. The result: at 2-bit compression on Llama-3-8B, picoLLM retains roughly 95% of float16 MMLU performance (61.3 vs. 64.9), where GPTQ collapses to 25.1, essentially random guessing on a 4-choice benchmark. This algorithmic approach required a purpose-built inference engine, because existing LLM runtimes assume uniform bit depth and cannot execute variable-bit-rate models. picoLLM Inference was built from scratch to handle x-bit quantized weights.
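The idea can be pictured as a budgeted allocation problem: given a total bit budget, spend each extra bit on the weight group where it reduces quantization error the most. The greedy loop and the toy error model below are illustrative assumptions for intuition only, not Picovoice's published algorithm:

```python
# Illustrative sketch of non-uniform bit allocation (NOT picoLLM's actual
# algorithm): greedily grant each weight group the extra bit that buys the
# largest drop in a toy quantization-error model.
import heapq

def quant_error(sensitivity, bits):
    # Toy cost model: uniform quantization error shrinks ~4x per extra bit,
    # scaled by how much this group contributes to output quality.
    return sensitivity / (4 ** bits)

def allocate_bits(sensitivities, total_bits, min_bits=1, max_bits=8):
    """Distribute `total_bits` across groups, each getting 1..8 bits."""
    n = len(sensitivities)
    bits = [min_bits] * n
    budget = total_bits - min_bits * n
    # Max-heap keyed on the error reduction of granting one more bit.
    heap = [(-(quant_error(s, min_bits) - quant_error(s, min_bits + 1)), i)
            for i, s in enumerate(sensitivities)]
    heapq.heapify(heap)
    while budget > 0 and heap:
        _, i = heapq.heappop(heap)
        bits[i] += 1
        budget -= 1
        if bits[i] < max_bits:
            s = sensitivities[i]
            gain = quant_error(s, bits[i]) - quant_error(s, bits[i] + 1)
            heapq.heappush(heap, (-gain, i))
    return bits

# A highly sensitive group soaks up most of a 9-bit budget, while the
# least sensitive group stays at 1 bit.
print(allocate_bits([100.0, 1.0, 0.01], total_bits=9))  # -> [6, 2, 1]
```

The same budget averaged uniformly would give every group 3 bits; the learned (here, greedily chosen) distribution instead concentrates precision where it matters.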
This combination matters most where LLM inference is genuinely sensitive: regulated industries that cannot send prompts to third-party clouds, mobile applications where unbounded API costs make cloud LLMs economically infeasible at scale, embedded systems where connectivity is unreliable, and any product where the LLM's output is part of a confidential workflow.
picoLLM Inference takes a compressed model file and a prompt, and returns a completion with streaming token output. No cloud processing and no network connection to manage. Compressed models are downloadable from the Picovoice Console as single files that load directly into the SDK without conversion or build steps. Use picoLLM Inference with its native SDKs for Android, C, .NET, iOS, Node.js, Python, and Web.
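In code, the interaction reduces to three steps: load a model file, pass a prompt, and receive tokens through a callback as they are produced. The sketch below is a self-contained mock of that call shape; `load_model`, `generate`, and the `.pllm` file name are illustrative stand-ins, not the picoLLM SDK's actual API (see the Picovoice docs for real signatures):

```python
# Mock of the streaming call shape an on-device LLM SDK exposes.
# `load_model`, `generate`, and the file name are illustrative stand-ins,
# NOT the real picoLLM API; real signatures are in the Picovoice SDK docs.
from typing import Callable, List

class Model:
    def __init__(self, path: str):
        self.path = path  # a compressed single-file model would load here

    def generate(self, prompt: str,
                 stream_callback: Callable[[str], None]) -> str:
        # A real engine decodes token by token; we fake it by echoing words.
        tokens = [w + " " for w in f"echo: {prompt}".split()]
        for token in tokens:
            stream_callback(token)  # caller sees tokens as they are produced
        return "".join(tokens).rstrip()

def load_model(path: str) -> Model:
    return Model(path)

pieces: List[str] = []
model = load_model("model-compressed.pllm")  # hypothetical file name
completion = model.generate("hello world", stream_callback=pieces.append)
```

The callback is what makes on-device streaming feel like a cloud API: the UI can render each token the moment it is decoded, with no network round-trip in between.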
picoLLM combined with Porcupine Wake Word, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech delivers a complete on-device voice assistant pipeline across every platform your product ships.
picoLLM consists of an inference engine (runtime) and language models that work together to enable on-device LLM deployment. Models can be trained from scratch on picoGYM, or compressed with picoLLM Compression.
A novel LLM quantization algorithm that learns the optimal bit allocation strategy across and within model weights. Unlike GPTQ, GGUF, or AWQ, which assign fixed bit depths, picoLLM Compression treats compression ratio as a target and bit allocation as a learned variable. Given a target size and a task-specific cost function, it distributes bits where they matter most, retaining accuracy even at sub-4-bit, where other methods fail.
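A toy experiment makes the sub-4-bit cliff concrete. Assuming a simple symmetric uniform quantizer over [-1, 1] (an illustration, not picoLLM's method), reconstruction error grows roughly fourfold for each bit removed, which is why fixed-bit methods degrade sharply below 4 bits and why spending bits selectively pays off:

```python
# Toy demonstration (NOT picoLLM's cost function): error of a symmetric
# uniform quantizer over [-1, 1] as bit depth shrinks.

def quantize(xs, bits):
    # 2**bits evenly spaced levels spanning [-1, 1].
    levels = 2 ** bits
    step = 2.0 / (levels - 1) if levels > 1 else 2.0
    return [round((x + 1.0) / step) * step - 1.0 for x in xs]

def mse(xs, bits):
    # Mean squared reconstruction error at the given bit depth.
    qs = quantize(xs, bits)
    return sum((x - q) ** 2 for x, q in zip(xs, qs)) / len(xs)

weights = [-0.9, -0.3, 0.1, 0.4, 0.8]
for b in (8, 4, 2):
    print(b, mse(weights, b))  # error climbs steeply as bits drop
```

Because the error is so uneven across bit depths, a quantizer that can give sensitive weights 5-6 bits while pushing insensitive ones to 1-2 bits hits the same average size with far less damage.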
A cross-platform inference engine built from scratch to execute variable-bit-rate quantized models. Existing LLM runtimes assume uniform quantization and cannot run x-bit compressed models. picoLLM Inference handles every bit depth from 1 to 8, with SIMD implementations and parallel execution paths for CPU, GPU, and web. That's how the same engine can run x-bit compressed LLMs across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and every major browser (Chrome, Safari, Edge, Firefox).
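To see why uniform-bit runtimes cannot execute such models, consider the storage layout. With a fixed width, weight i sits at bit offset i × width and can be addressed directly; with variable widths, the decoder must know each group's width just to find the next value. The packing scheme below is a minimal illustration of a variable-bit-rate weight stream, not picoLLM's actual on-disk format:

```python
# Illustration of a variable-bit-rate weight stream (NOT picoLLM's format):
# values of differing bit widths packed into one contiguous byte buffer.

def pack(groups):
    """groups: list of (bit_width, [unsigned ints < 2**bit_width])."""
    acc = bits_in_acc = 0
    out = bytearray()
    for width, values in groups:
        for v in values:
            acc |= v << bits_in_acc      # append value above pending bits
            bits_in_acc += width
            while bits_in_acc >= 8:      # flush whole bytes
                out.append(acc & 0xFF)
                acc >>= 8
                bits_in_acc -= 8
    if bits_in_acc:                      # flush the partial tail byte
        out.append(acc & 0xFF)
    return bytes(out)

def unpack(data, groups_meta):
    """groups_meta: list of (bit_width, count); returns a flat value list.
    Without this per-group metadata the stream is undecodable."""
    acc = bits_in_acc = pos = 0
    values = []
    for width, count in groups_meta:
        for _ in range(count):
            while bits_in_acc < width:   # refill from the byte stream
                acc |= data[pos] << bits_in_acc
                pos += 1
                bits_in_acc += 8
            values.append(acc & ((1 << width) - 1))
            acc >>= width
            bits_in_acc -= width
    return values

# Three 2-bit weights followed by two 5-bit weights: 16 bits in 2 bytes.
data = pack([(2, [3, 0, 1]), (5, [17, 4])])
print(unpack(data, [(2, 3), (5, 2)]))  # -> [3, 0, 1, 17, 4]
```

A runtime built for fixed 4-bit weights has no way to interpret this buffer, which is why a variable-bit compression algorithm needs an inference engine designed alongside it.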
picoLLM Compression is benchmarked against GPTQ, the most widely deployed LLM quantization algorithm, on MMLU, ARC, and perplexity across six model families. picoLLM maintains near-float16 accuracy at 2-, 3-, and 4-bit, where GPTQ collapses. picoLLM's sub-4-bit advantage is what makes the smallest mobile and embedded deployments possible without sacrificing model capability or user experience.
picoLLM is the only production-grade on-device LLM platform that runs across every platform: mobile, web, embedded, desktop, and server. Supported models compress to sub-4-bit with near-float16 accuracy, delivering accuracy on par with cloud-hosted models even on hardware-constrained devices.
Look no further for the best CPU to run LLMs locally on-device: picoLLM runs any supported LLM on any CPU.
On-device LLMs that ship to production. Maximum accuracy at minimum size.