picoLLM On-Device LLM Platform

On-device LLM for real-time mobile, web, and IoT applications

picoLLM Compression produces the smallest models at any accuracy target, and picoLLM Inference runs them across mobile, desktop, embedded, and every web browser. No cloud dependency or heavy compute requirements.

99.9%
Accuracy retained at 3-bit vs. 83.1% for GPTQ (Llama-3-8b)
94.5%
Accuracy retained at 2-bit vs. 38.7% for GPTQ (Llama-3-8b)
Any
Any transformer architecture on any platform
What is picoLLM On-Device LLM Platform?

On-device LLM platform built for maximum accuracy at the minimum size

picoLLM enables enterprises to run large language models entirely on-device across mobile phones, web browsers, embedded systems, desktop applications, and on-premise servers. Prompts and completions never leave the customer's infrastructure.

picoLLM is the only on-device LLM platform with a quantization algorithm and an inference engine designed together for non-uniform bit allocation. Every other LLM quantization algorithm assigns fixed bit depths to weights. picoLLM Compression learns the optimal bit distribution per weight per model per compression ratio, allocating 1-bit through 8-bit values continuously across the model based on each weight's contribution to output quality. The result: at 2-bit compression on Llama-3-8b, picoLLM retains 94.5% of float16 MMLU performance (61.3 vs. 64.9), where GPTQ collapses to 25.1, which is essentially random for a 4-choice benchmark. This algorithmic approach required a purpose-built inference engine, because existing LLM runtimes assume uniform bit depth and cannot execute variable-bit-rate models. picoLLM Inference was built from scratch to handle x-bit quantized weights.
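As a toy illustration of the idea (this is not Picovoice's actual algorithm, and the error model is a simplification), a greedy allocator can hand out a fixed bit budget one bit at a time, always giving the next bit to the tensor whose quantization error it reduces most:

```python
def allocate_bits(sensitivity, budget_bits, min_b=1, max_b=8):
    """Greedy non-uniform bit allocation sketch.

    sensitivity: per-tensor scores (how much each tensor's quantization
    error hurts output quality); budget_bits: total bits to distribute.
    Uses the classic uniform-quantizer noise model err ~ s * 2**(-2b),
    purely for illustration.
    """
    bits = [min_b] * len(sensitivity)
    err = lambda s, b: s * 2.0 ** (-2 * b)
    while sum(bits) < budget_bits:
        best_i, best_gain = -1, 0.0
        for i, s in enumerate(sensitivity):
            if bits[i] < max_b:
                gain = err(s, bits[i]) - err(s, bits[i] + 1)
                if gain > best_gain:
                    best_i, best_gain = i, gain
        if best_i < 0:  # every tensor already at max_b
            break
        bits[best_i] += 1
    return bits

# Sensitive tensors end up with more bits, insensitive ones with fewer.
print(allocate_bits([100.0, 1.0, 0.01], budget_bits=9))
```

The key property, which fixed-depth schemes like GPTQ cannot express, is that the resulting allocation is different for every sensitivity profile and every budget.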

This combination matters most where LLM inference is genuinely sensitive: regulated industries that cannot send prompts to third-party clouds, mobile applications where unbounded API costs make cloud LLMs economically infeasible at scale, embedded systems where connectivity is unreliable, and any product where the LLM's output is part of a confidential workflow.

Developer Experience

Ready-to-use quantized LLMs on embedded, mobile, web and desktop in a few lines of code

picoLLM Inference takes a compressed model file and a prompt, and returns a completion with streaming token output. No cloud processing, no streaming connection to manage. Compressed models are downloadable from the Picovoice Console as single files that load directly into the SDK without conversion or build steps. Use picoLLM Inference with its native SDKs for Android, C, .NET, iOS, NodeJS, Python, and Web.
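A minimal usage sketch with the Python SDK. The `picollm.create`/`generate`/`release` names below follow the published Python API, but treat this as an outline and check the current SDK reference for exact signatures:

```python
def complete(access_key: str, model_path: str, prompt: str) -> str:
    """Load a picoLLM-compressed model file and return a completion.

    access_key comes from Picovoice Console; model_path points to the
    downloaded model file. Entirely local: the prompt never leaves the
    device. API names are based on the published Python SDK and may
    differ between versions.
    """
    import picollm  # pip install picollm

    pllm = picollm.create(access_key=access_key, model_path=model_path)
    try:
        res = pllm.generate(prompt)
        return res.completion
    finally:
        pllm.release()  # free model memory when done
```

The same create/generate/release pattern carries over to the other SDKs (Android, C, .NET, iOS, Node.js, Web).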

Platform Demos

On-device LLM-powered voice AI assistant on mobile, embedded, desktop and web

picoLLM combined with Porcupine Wake Word, Cheetah Streaming Speech-to-Text, and Orca Streaming Text-to-Speech delivers a complete on-device voice assistant pipeline across every platform your product ships.

How picoLLM Works

picoLLM consists of an inference engine (runtime) and language models that work together to enable on-device LLM deployment. Models can be trained from scratch with picoGYM or compressed with picoLLM Compression.

Quantization Algorithm
picoLLM Compression

A novel LLM quantization algorithm that learns the optimal bit allocation strategy across and within model weights. Unlike GPTQ, GGUF, or AWQ, which assign fixed bit depths, picoLLM Compression treats compression ratio as a target and bit allocation as a learned variable. Given a target size and a task-specific cost function, it distributes bits where they matter most, retaining accuracy even at sub-4-bit, where other methods fail.

Learn more about picoLLM Compression →
Cross-Platform Runtime
picoLLM Inference

A cross-platform inference engine built from scratch to execute variable-bit-rate quantized models. Existing LLM runtimes assume uniform quantization and cannot run x-bit compressed models. picoLLM Inference handles every bit depth from 1 to 8, with SIMD implementations and parallel execution paths for CPU, GPU, and web. That's how the same engine can run x-bit compressed LLMs across Linux, macOS, Windows, iOS, Android, Raspberry Pi, and every major browser (Chrome, Safari, Edge, Firefox).

Learn more about picoLLM Inference →
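To see why uniform-bit runtimes cannot simply read a variable-bit-rate model, consider a toy packer (illustrative only, not picoLLM's on-disk format) where each value carries its own bit width. Decoding requires the per-value widths; there is no single fixed stride a conventional engine could assume:

```python
def pack(values, bit_widths):
    """Pack unsigned integers with per-value bit widths into one bitstream."""
    stream, pos = 0, 0
    for v, w in zip(values, bit_widths):
        assert 0 <= v < (1 << w), "value must fit in its bit width"
        stream |= v << pos
        pos += w
    return stream, pos  # bitstream and total bit count

def unpack(stream, bit_widths):
    """Recover the values; impossible without knowing every bit width."""
    out, pos = [], 0
    for w in bit_widths:
        out.append((stream >> pos) & ((1 << w) - 1))
        pos += w
    return out

vals, widths = [3, 1, 12, 255], [2, 1, 4, 8]
stream, nbits = pack(vals, widths)
print(unpack(stream, widths))
```

A 4-bit-only runtime would read this stream at a fixed 4-bit stride and decode garbage; executing it correctly requires an engine built around the allocation metadata.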
Open-Source LLM Compression Benchmark

Near-float16 accuracy at sub-4-bit, where other LLM quantization algorithms collapse

picoLLM Compression is benchmarked against GPTQ, the most widely deployed LLM quantization algorithm, on MMLU, ARC, and perplexity across six model families. picoLLM maintains near-float16 accuracy at 2, 3, and 4-bit, where GPTQ collapses. picoLLM's sub-4-bit advantage is what makes the smallest mobile and embedded deployments possible without sacrificing the model's actual capability or user experience.

2-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 61.3
GPTQ: 25.1
3-bit Quantized Llama-3-8b MMLU (higher is better)
Float16 (Original Model): 64.9
picoLLM: 64.8
GPTQ: 53.9
2-bit Quantized Gemma-7b MMLU (higher is better)
Float16 (Original Model): 64.5
picoLLM: 64.3
GPTQ: 25.6
3-bit Quantized Gemma-7b MMLU (higher is better)
Float16 (Original Model): 64.5
picoLLM: 64.5
GPTQ: 53.3
Ready to integrate? Check our docs to start building or talk to the sales team about enterprise deployment.
Capabilities

Why enterprises choose picoLLM On-device LLM

picoLLM is the only production-grade on-device LLM platform that runs across every platform: mobile, web, embedded, desktop, and server. Supported models compress to sub-4-bit with near-float16 accuracy, offering cloud-level performance even on hardware-constrained devices.

01 Co-designed compression and inference
Every other LLM quantization algorithm assigns fixed bit depths to weights. GPTQ, GGUF, AWQ, and SpinQuant all impose this constraint at the algorithm level, then run on inference engines (llama.cpp, ExecuTorch, MLX) that assume uniform bit depth. picoLLM Compression learns the optimal bit distribution per weight, and picoLLM Inference was built from scratch to execute variable-bit-rate models. This co-design is the architectural reason picoLLM holds accuracy at 2-bit and 3-bit, where fixed-allocation methods collapse.
02 Sub-4-bit accuracy
Quantizing LLMs is easy; quantizing them while maintaining accuracy is not. That's why most quantized models are 4-bit or above. At 2-bit compression on Llama-3-8b, picoLLM retains 94.5% of float16 MMLU performance (61.3 vs. 64.9). GPTQ at 2-bit collapses to 25.1, essentially random for a 4-choice benchmark. For mobile and embedded deployments where memory constraints and compression ratios are a concern, 2-bit vs. 4-bit is the difference between a usable model and an unusable one.
03 X-bit allocation, learned per weight
picoLLM Compression learns the bit distribution rather than imposing it. For Llama-2-7b at 7× compression, attention.k/o/q weights get 1.65 bits while embeddings get 3.10 bits, and the distribution within each weight matrix is also non-uniform. No two models or compression ratios receive the same allocation pattern. The algorithm minimizes accuracy loss across all layers simultaneously, instead of layer-by-layer like GPTQ or block-by-block like GGUF.
04 Compression-aware training
Standard quantization is post-training, which limits how aggressively a model can be compressed. picoGYM trains small language models with compression as a first-class objective, producing models that compress further without quality loss.
05 Custom LLM compression
Enterprises with in-house LLMs can compress them with picoLLM Compression for local deployment. Available through enterprise engagement with NDA-protected model handling.
06 Cross-platform
picoLLM Inference runs locally on every platform your product ships: Android, Chrome, Edge, Firefox, iOS, Linux, macOS, Raspberry Pi, Safari, and Windows, across AMD, Intel, NVIDIA, and Qualcomm hardware.
07 Runs on CPU, GPU, and NPU
The same picoLLM Inference engine runs on both CPU and GPU using the same SDK. No separate code paths, no platform-specific runtime selection. Workstations and desktops with discrete GPUs use them automatically; mobile and embedded targets fall back to CPU. The architecture also extends to NPU acceleration for hardware that supports it.
08 Compliance by architecture
Cloud LLM APIs achieve HIPAA and GDPR compliance through Business Associate Agreements and contractual controls. picoLLM is compliant by architecture: text never leaves the device, so there is no data to regulate in transit or at rest outside the customer's infrastructure. For healthcare practices, legal teams, financial institutions, and defense applications, on-device processing is the only correct architecture.
09 Offline processing
No network connection is required for picoLLM Inference. picoLLM operates in air-gapped environments, remote deployments, aircraft, vessels, classified networks, and any infrastructure where cloud APIs cannot reach or where data handling requirements prohibit transmission to third-party servers.
10 Enterprise ready
picoLLM Inference is production-grade and enterprise-ready. Picovoice offers flexible licensing, dedicated engineering support, NDA-protected custom model training, and SLA-backed response times for teams shipping at scale.
11 Pairs with on-device voice AI
picoLLM combines with Porcupine Wake Word, Cheetah Streaming Speech-to-Text, Orca Streaming Text-to-Speech, and other voice AI products. Below is an example of a voice assistant demo powered by picoLLM running on CPU, GPU, Android, iOS, web, embedded, and serverless cloud environments.
LLM Inference on CPU

Look no further for the best CPU to run LLMs locally on-device: picoLLM runs any LLM on any CPU.

Ship it.
On device.

On-device LLMs that ship to production. Maximum accuracy at minimum size.

Frequently asked questions

Feature
Usage
Technical Questions
Custom Models & Support
Data Security & Privacy
Building with picoLLM
Feature
+
What are the key benefits of picoLLM On-device LLM Platform?
  • Works with any LLM – Custom, proprietary, or open-weight (e.g., Llama, Gemma, Phi)
  • Runs anywhere – Web, mobile, desktop, embedded, and serverless
  • Compressed & optimized – Reduces storage and memory needs without losing accuracy
  • Fully private – All inference happens on-device, no external servers
  • Production-ready – No tuning or ML expertise required
+
What does the picoLLM Platform offer?
With more capabilities coming soon, the initial release of picoLLM offers:
  • picoLLM Compression – a novel LLM quantization algorithm
  • picoLLM Inference – a cross-platform engine that runs x-bit quantized LLMs on CPU and GPU
  • Ready-to-use quantized open-weight models (e.g., Llama, Gemma, Phi) downloadable from Picovoice Console
+
What does picoLLM Compression do?
picoLLM Compression is a novel large language model (LLM) quantization algorithm developed by Picovoice. Existing techniques require a fixed bit allocation scheme, which is subpar. Given a task-specific cost function, picoLLM Compression automatically learns the optimal bit allocation strategy across and within an LLM's weights.
+
What does picoLLM Inference do?
picoLLM Inference runs X-bit quantized LLMs, simplifying the process of adding LLMs to any software. picoLLM Inference is the only local LLM inference engine that:
  • runs across Linux, macOS, Windows, Android, iOS, Raspberry Pi, Chrome, Safari, Edge, and Firefox
  • supports CPU and GPU out of the box, with an architecture that can tap into other forms of accelerated computing
  • runs any LLM architecture
+
Does picoLLM offer local Llama models?
Yes, picoLLM offers ready-to-use quantized Llama models. Quantized Llama models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.
+
Does picoLLM offer local Mistral models?
Yes, picoLLM offers ready-to-use quantized Mistral models. Quantized Mistral models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.
+
Does picoLLM offer local Mixtral models?
Yes, picoLLM offers ready-to-use quantized Mixtral models. Quantized Mixtral models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.
+
Does picoLLM offer local Microsoft Phi models?
Yes, picoLLM offers a ready-to-use quantized Microsoft Phi-2 model. It can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.
+
Does picoLLM offer local Gemma models?
Yes, picoLLM offers ready-to-use quantized Gemma models. Quantized Gemma models can be downloaded from Picovoice Console within your plan limits, and deployed locally across platforms.
+
How can I get access to picoLLM GYM to train small language models?
Currently, picoLLM GYM is only open to selected enterprise customers. Please engage with your account manager if you're already a Picovoice customer.
Usage
+
What are the platforms supported by picoLLM Inference?
  • Desktop & Server: Linux, Windows & macOS
  • Mobile: Android & iOS
  • Web Browsers: Chrome, Safari, Edge and Firefox
  • Single Board Computers: Raspberry Pi
  • Cloud Providers: AWS, Azure, Google, IBM, Oracle and others.
+
Does picoLLM Inference run LLMs in the public or private cloud, including VPC (Virtual Private Cloud)?
Yes, picoLLM is cloud-agnostic and interoperable. You can deploy LLMs in the cloud, work with the cloud provider of your choice, and easily move from one to another.
+
Does picoLLM Inference run LLMs in serverless environments?
Yes, you can deploy LLMs in serverless environments with the cloud provider of your choice, and easily move from one to another.
+
Does picoLLM Inference run LLMs on-prem?
Yes, you can run LLMs on-prem with picoLLM.
+
Does picoLLM Inference run LLMs on mobile devices?
Yes, you can run LLMs on mobile devices. picoLLM supports both Android and iOS.
+
Does picoLLM Inference run LLMs within web browsers?
Yes, you can run LLMs within web browsers. picoLLM supports all modern web browsers - Chrome, Safari, Firefox, and Edge.
+
Does picoLLM Inference run LLMs on embedded devices?
Yes, you can run LLMs on embedded devices.
+
Where's user data stored?
picoLLM doesn't track, access, or store user data.
+
Why does picoLLM Inference require an AccessKey (i.e., internet connectivity) if the engine processes data locally?
All Picovoice engines, including picoLLM Inference, use AccessKey to serve you within your plan limits.
Technical Questions
+
What are the advantages of using quantized models over non-quantized models?
There are several advantages to running quantized models:
  • Reduced model size: Quantization decreases the size of large language models, resulting in:
    • Smaller download size: Quantized LLMs require less time and bandwidth to download. For example, a mobile app bundling an overly large language model may not be approved for the Apple App Store.
    • Smaller storage size: Quantized LLMs occupy less storage space. An app using a small language model takes up less storage, improving the usability of your application and the experience of its users.
    • Lower memory usage: Quantized LLMs use less RAM, which speeds up LLM inference and frees up memory for other parts of your application, resulting in better performance and stability.
  • Reduced latency: Total latency consists of compute latency and network latency.
    • Reduced compute latency: Compute latency is the time between a machine receiving a request and returning a response. LLMs require powerful infrastructure to run with minimal compute latency; otherwise, a response may take minutes or even hours. Reduced computational requirements allow quantized LLMs to respond faster given the same resources, or to achieve the same latency using fewer resources.
    • Zero network latency: Network latency is the time data takes to travel across the network. Since quantized LLMs can run where the data is generated rather than requiring data to be sent to a third-party cloud, no data transfer is needed, hence zero network latency.
Quantization reduces model size and latency, potentially at the expense of some accuracy. Choosing the right quantized model is important to ensure little to no accuracy loss. Our deep learning researchers explain why picoLLM Compression is different from other quantization techniques.
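The model-size point is simple arithmetic. A back-of-the-envelope sketch, using an 8-billion-parameter model (e.g., a Llama-3-8b-class model) as an illustrative example and ignoring metadata and activation memory:

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9

n = 8e9  # ~8 billion parameters (illustrative)
print(model_size_gb(n, 16))  # float16 baseline: 16.0 GB
print(model_size_gb(n, 4))   # uniform 4-bit:     4.0 GB
print(model_size_gb(n, 2))   # 2-bit:             2.0 GB
```

At 2 bits per weight, the same model needs an eighth of the float16 download, storage, and RAM footprint, which is why sub-4-bit accuracy matters so much on phones and embedded boards.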
+
Is picoLLM open-source?
picoLLM SDKs are open-source and available via Picovoice's GitHub and SDK-specific package managers.

We're currently working on open-sourcing picoLLM Inference, making the picoLLM Compression algorithm available on the Picovoice Console, and adding new capabilities to the picoLLM platform to improve the developer experience.
+
How accurate is picoLLM Compression?
We compare the accuracy of the picoLLM Compression algorithm against popular quantization techniques. Ceteris paribus (at a given size and model), picoLLM offers better accuracy than popular quantization techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM. You can check the open-source compression benchmark to compare the performance of picoLLM Compression against GPTQ.

Please note that there is no single widely used framework for evaluating LLM accuracy, as LLMs are relatively new and capable of performing various tasks. One metric can be important for a certain task and irrelevant to another. Taking "accuracy" metrics at face value and comparing two figures calculated in different settings may lead to wrong conclusions. Also, picoLLM Compression's value-add is retaining the original quality while making LLMs available across platforms, i.e., offering the most efficient models without sacrificing accuracy, not offering the most accurate model. We highly encourage enterprises to compare accuracy against the original models, e.g., llama-2 70B vs. pico.llama-2 70B at different sizes.
+
How does picoLLM Compression differ from other compression techniques such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM?
Quantization techniques, such as AWQ, GPTQ, LLM.int8(), and SqueezeLLM are developed by researchers for research. picoLLM is developed by researchers for production to enable enterprise-grade applications.

At any given size, picoLLM retains more of the original quality. In other words, picoLLM compresses models more efficiently than the others, offering efficient models without sacrificing accuracy compared to these techniques. Read more from our deep learning research team about our approach to LLM quantization.
+
How fast is picoLLM?
The smaller the model and the more powerful the system, the faster a language model runs.
Speed tests (tokens/second) are generally done in a controlled environment and, unsurprisingly, in favor of the model/vendor. Several factors affect speed: hardware (GPU, CPU, RAM, motherboard), software (background processes and programs), and the language model itself, including its original size.
At Picovoice, our communication has always been fact-based and scientific. Since speed tests are easy to manipulate and it's impossible to create a reproducible framework, we do not publish speed metrics. We strongly suggest everyone run their own tests in their own environment.
+
How does picoLLM Inference differ from other inference engines?
picoLLM Inference is specifically developed for the picoLLM platform.

Existing inference engines can handle models with a known, uniform bit depth (e.g., 4-bit or 8-bit) across model weights. A picoLLM-compressed weight can contain 1-, 2-, 3-, 4-, 5-, 6-, 7-, and 8-bit quantized parameters, retaining intelligence while minimizing model size. Hence, existing inference engines built for pre-defined bit allocations cannot match the dynamic nature of picoLLM.
Read more from our engineering team, who explain why and how we developed the picoLLM Inference engine.
+
Can I use picoLLM offerings with another LLM Inference engine?
There are three major issues with the existing LLM inference engines.
  1. They are not versatile. They only support certain platforms or model types.
  2. They are not ready-to-use, requiring machine learning knowledge.
  3. They cannot handle X-bit quantization, as this innovative approach is unique to picoLLM Compression.
Hugging Face Transformers works with transformer models only. TensorFlow Serving works with TensorFlow models only and has a steep learning curve. TorchServe is designed for PyTorch and integrates well with AWS. NVIDIA Triton Inference Server is designed for NVIDIA GPUs. OpenVINO is optimized for Intel hardware. In reality, your software can and will run on different platforms. That's why we had to develop picoLLM Inference. It's the only ready-to-use and hardware-agnostic engine.
Custom Models & Support
+
Do you train custom LLM models? Can I fine-tune picoLLM models?
Yes, at the moment custom training is available through picoLLM GYM for selected enterprise customers. Please engage with your account manager if you’re already a Picovoice customer.
+
How do custom large language models compare with general open LLMs?
Custom LLMs are created for specific tasks and specific use cases. General-purpose large language models are jacks-of-all-trades and masters-of-none. In other words, they can help a student with their homework but not a knowledge worker with company-specific information.

General-purpose LLMs are offered by foundation model providers, such as OpenAI, Google, Meta, Microsoft, Cohere, Anthropic, Mistral, Databricks, and so on. They're good at developing products such as chatbots, translation services, and content creation apps. Developers building hobby projects, one-size-fits-all applications, or with no access to training datasets can choose general-purpose LLMs.

Custom LLMs can offer distinctive feature sets and increased domain expertise, resulting in unmatched precision and relevance. Hence, custom LLMs have become popular in enterprise applications in several industries, including healthcare, law, and finance. They're used in various applications, such as medical diagnosis, legal document analysis, and financial risk assessment. Unlike general-purpose LLMs, custom LLMs are not ready to use, they require special training that leverages domain-specific data to perform better in certain use cases.
+
Why shouldn't we just use big vendors' closed-source models, such as GPT-4 or Claude?
If you think they're a better fit, you should. Especially in the beginning, using an API can be a better approach to understand what LLMs can achieve, before control over data, models, infrastructure, or inference cost becomes a concern. Closed-source models' drawbacks surface when enterprises want control over their specific use case. If customizability, privacy, ownership, reliability, or inference cost at scale is a concern, you should be more cautious about choosing a closed-source model.
  1. Customizability: Each vendor has different criteria and processes for developing custom models. To send an inquiry to OpenAI, one has to acknowledge that it may take months to train custom models and that pricing starts at $2-3 million.
  2. Privacy: The default business model for closed-source models is to run inference in the cloud. Hence it requires enterprises to send their user data and confidential information to the cloud.
  3. Ownership: You never have ownership of a closed-source model. If your LLM is critical for the success of your product, or in other words, if you view your LLM as an asset rather than a simple tool, it should be owned and controlled by you.
  4. Reliability: You are at the mercy of closed-source model providers. When their API goes down or has an increase in traffic, the performance of your software, hence user experience and productivity, is negatively affected.
  5. Cost at scale: Cloud computing at scale is costly; that's why cloud repatriation has become popular among large enterprises. Large Language Model APIs are no different, if not costlier, given the size of the models. If your growth estimate involves high-volume inference, do your math carefully.
+
We have a custom LLM; how can we use picoLLM Compression?
Contact sales with your project requirements to get your custom or fine-tuned LLMs quantized using picoLLM Compression.
+
Do I own my custom models after getting them quantized?
Yes, models trained on your private data or developed by you will be 100% yours.
+
Can I use picoLLM models with RAG?
Yes. picoLLM models, like other LLMs, can be used in complex workflows, including retrieval-augmented generation (RAG). Large language models may struggle to retrieve knowledge and identify which information is most relevant to each query. Moreover, it may not be optimal to re-train a model every time your knowledge base is updated. RAG applications produce more nuanced and contextually relevant outputs, allowing enterprises to feed the model with information that's always permissions-aware, recent, and relevant.
+
My platform is not currently supported by picoLLM or we're planning to launch new hardware. How can I get picoLLM to support it?
picoLLM platform supports the most popular and widely-used hardware and software out-of-the-box - from web, mobile, desktop, and on-prem to private cloud. However, there may be certain chipsets we do not currently support. Contact sales to get the picoLLM inference engine ported to the platform of your choice.
+
It seems picoLLM doesn't offer the SDK we're using in production. How can I get a new SDK added to picoLLM?
picoLLM platform supports the most popular and widely used SDKs. If you need another SDK, you can check our open-source SDKs and build it yourself or contact sales. Picovoice experts can create a public or private library for the SDK of your choice and maintain it.
+
I am using official picoLLM demos, however, I get an error. How do I report bugs?
You can create a GitHub issue under the relevant repository/demo.
Data Security & Privacy
+
Where does picoLLM process data?
picoLLM processes data in your environment, whether it's public or private cloud, on-prem, web, mobile, desktop, or embedded.
+
For how long does picoLLM retain user data?
picoLLM is private by design and has no access to user data. Thus, picoLLM doesn't retain user data as it never tracks or stores them in the first place.
+
Is picoLLM HIPAA-compliant?
Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically HIPAA compliant.
+
Is picoLLM GDPR-compliant?
Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically GDPR compliant.
+
Is picoLLM CCPA-compliant?
Yes. Enterprises using picoLLM don't need to share their user data with Picovoice or any other 3rd party to run LLMs, making picoLLM intrinsically CCPA compliant.
Building with picoLLM
+
Can I use picoLLM to build a generative AI assistant that uses my company's content with enterprise-grade permissions, data governance, and referable resources?
Yes! McKinsey & Co. estimates that we spend 20% of our time looking for internal information or tracking down colleagues who can help with specific tasks. You can save your company a significant amount of time with a generative assistant without breaking the bank or jeopardizing trade secrets and confidential information. Contact sales if you need a jumpstart.
+
How can I deploy custom language models for production?
The answer is “it depends”. Deploying LLMs for production requires diligent work. It depends on your use case, other tools, and the tech stack used, along with hardware and software choice. Given the variables, it can be challenging. Contact sales to get help with finding the best approach to deploying language models for production.
+
What are the best practices to develop and deploy LLM applications?
Developers have a myriad of choices when building LLM applications. Choosing the AI model that best fits the use case is a big challenge, given that there are hundreds, if not thousands, of open-source LLMs to start with (although most are fine-tuned versions of a few base LLMs). Best practices depend on the use case, the other tools used, and the tech stack.

Contact sales for dedicated support for your use case.