To run language models locally in .NET, developers typically rely on open-source options like llama.cpp bindings or ONNX Runtime. These work for prototyping, but they hit production roadblocks: managing external processes, handling runtime dependencies, and manually optimizing models for deployment.
This tutorial shows a simpler path: add local LLM inference to your .NET application in minutes using the picoLLM .NET SDK. You'll build a working C# application with straightforward NuGet package installation, simple API calls, and reliable cross-platform deployment. No manual model quantization or optimization, no compatibility issues, no cloud processing.
Setting Up the .NET LLM Inference API
This walkthrough shows you how to run local LLM inference in .NET without cloud APIs. You'll install picoLLM, load a quantized model, and generate text entirely on-device.
Prerequisites for .NET Local LLM Integration
- Windows (x86_64): .NET Framework 4.6.1+, .NET Standard 2.0+, or .NET Core 3.0+
- macOS (x86_64): .NET Standard 2.0+ or .NET Core 3.0+
- macOS (arm64), Windows (arm64), Linux (x86_64), Raspberry Pi (4, 5): .NET 6.0+
Step 1: Install the .NET On-Device LLM Package
Add the official PicoLLM NuGet package to enable local LLM inference in your C# project:
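A minimal install command, assuming the package ID is PicoLLM as published by Picovoice on NuGet:

```
dotnet add package PicoLLM
```

Run this from your project directory; it adds a PackageReference to your .csproj.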
Step 2: Download a Quantized LLM Model File
Sign up for a Picovoice Console account (free tier available) and copy your AccessKey from the dashboard.
Download a .pllm model file from the picoLLM page of the Picovoice Console. For this tutorial, choose a model with chat support (e.g., Llama 3.2, Phi-2, Phi-3, Phi-3.5).
Step 3: Initialize the Local LLM Inference Engine in C#
Create an instance of picoLLM with your AccessKey and path to the downloaded model:
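A minimal sketch: the Pv namespace and the PicoLLM.Create factory method follow the naming of Picovoice's other .NET SDKs, and the placeholder AccessKey and model path are assumptions to replace with your own values.

```csharp
using Pv;

// Replace with the AccessKey from your Picovoice Console dashboard
// and the path to the .pllm file downloaded in Step 2.
const string accessKey = "${ACCESS_KEY}";
const string modelPath = "path/to/model.pllm";

PicoLLM pllm = PicoLLM.Create(accessKey, modelPath);
```

Creation loads the model into memory, so initialization time grows with model size.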
Step 4: Create a Dialog for Conversational Context
Use PicoLLMDialog to manage multi-turn conversations:
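A sketch of a single conversational turn. The dialog members shown here (AddHumanRequest, AddLLMResponse, Prompt) follow the picoLLM API naming, but treat them as assumptions and confirm against the API docs linked below.

```csharp
// The dialog accumulates the conversation and renders it with the model's chat template.
PicoLLMDialog dialog = pllm.GetDialog();

// Add the user's message, generate against the accumulated prompt, and record the reply
// so the next turn includes it as context.
dialog.AddHumanRequest("What is the capital of France?");
PicoLLMCompletion result = pllm.Generate(dialog.Prompt);
dialog.AddLLMResponse(result.Completion);

Console.WriteLine(result.Completion);
```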
The dialog object maintains conversation history, so the model can reference previous messages. For single-turn generation (no conversation context), you can pass prompts directly to Generate() instead:
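For example (a sketch under the same API assumptions as above):

```csharp
// Single-turn generation: no dialog, no conversation history.
PicoLLMCompletion result = pllm.Generate("Summarize the benefits of on-device inference in one sentence.");
Console.WriteLine(result.Completion);
```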
Step 5: Configure Text Generation Parameters
Control inference behavior with parameters that affect output randomness, length, and repetition:
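A hedged sketch: the named parameters below match the parameter guide that follows, but their exact names, types, and defaults should be confirmed in the API docs.

```csharp
PicoLLMCompletion result = pllm.Generate(
    dialog.Prompt,
    completionTokenLimit: 256,  // cap the length of the generated completion
    temperature: 0.3f,          // lower = more deterministic, higher = more varied
    presencePenalty: 0.1f,      // penalize tokens that have already appeared
    frequencyPenalty: 0.1f);    // penalize tokens in proportion to how often they appear
```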
Parameter guide:
- temperature: Lower values make output more deterministic
- completionTokenLimit: Maximum number of tokens to generate
- presencePenalty / frequencyPenalty: Reduce repetition
See the picoLLM API docs for the complete parameter list.
Step 6: Generate Text with Streaming Output
Stream tokens as they're generated instead of waiting for the complete response:
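A sketch assuming Generate() accepts a per-token callback (shown here as streamCallback; check the API docs for the exact parameter name):

```csharp
PicoLLMCompletion result = pllm.Generate(
    dialog.Prompt,
    streamCallback: token => Console.Write(token)); // print each token as soon as it is produced

Console.WriteLine();
dialog.AddLLMResponse(result.Completion); // the full completion is still returned at the end
```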
Progressive display keeps your application responsive: users see the first tokens immediately instead of waiting for the full completion.
Interrupt generation mid-stream if needed (e.g., user cancels or types a new message):
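A sketch assuming an Interrupt() method on the engine instance; call it from another thread (for example, a UI cancel handler) while Generate() is running:

```csharp
// Stops the in-flight generation; Generate() returns with the tokens produced so far.
pllm.Interrupt();
```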
Step 7: Clean Up .NET LLM Resources
Free memory immediately after use by wrapping in a using statement:
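For example (same API assumptions as the earlier sketches):

```csharp
using (PicoLLM pllm = PicoLLM.Create(accessKey, modelPath))
{
    PicoLLMCompletion result = pllm.Generate("Hello!");
    Console.WriteLine(result.Completion);
} // Dispose() runs here, releasing the model's native memory
```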
Or call Dispose() manually when done. Without explicit disposal, the garbage collector will eventually reclaim the resources, but the model's memory stays allocated longer than necessary.
Complete C# Example: .NET On-Device LLM Chat Application
Here's a working console app implementing local LLM inference with conversation history, streaming, and interruption handling:
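The sketch below pulls the earlier pieces together under the same API assumptions (PicoLLM.Create, GetDialog, AddHumanRequest/AddLLMResponse, streamCallback, Interrupt); adjust names against the API docs if they differ.

```csharp
using System;
using Pv;

class Program
{
    static void Main()
    {
        const string accessKey = "${ACCESS_KEY}";       // from the Picovoice Console dashboard
        const string modelPath = "path/to/model.pllm";  // downloaded .pllm file

        using (PicoLLM pllm = PicoLLM.Create(accessKey, modelPath))
        {
            PicoLLMDialog dialog = pllm.GetDialog();

            // Let Ctrl+C interrupt the current response instead of killing the app.
            Console.CancelKeyPress += (sender, e) =>
            {
                e.Cancel = true;
                pllm.Interrupt();
            };

            Console.WriteLine("Local LLM chat. Type 'exit' to quit; press Ctrl+C to stop a response.");

            while (true)
            {
                Console.Write("\n> ");
                string userInput = Console.ReadLine();
                if (string.IsNullOrWhiteSpace(userInput) || userInput.Trim() == "exit")
                {
                    break;
                }

                // Add the user's turn and generate against the conversation so far,
                // streaming tokens to the console as they are produced.
                dialog.AddHumanRequest(userInput);
                PicoLLMCompletion result = pllm.Generate(
                    dialog.Prompt,
                    completionTokenLimit: 256,
                    streamCallback: token => Console.Write(token));
                Console.WriteLine();

                // Record the reply so the next turn has full context.
                dialog.AddLLMResponse(result.Completion);
            }
        } // model resources released here
    }
}
```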
This example includes the core functionality you need: model initialization, conversation management, streaming output, and interrupt handling. For a more complete implementation with error handling and advanced features, see the picoLLM .NET demo on GitHub.
Tips & Best Practices: LLM Inference in .NET
- Prompt clarity: Provide clear and concise prompts to get the best results.
- Context management: Keep track of conversation state for multi-turn interactions. If you're using picoLLM, you can configure this with the history parameter of GetDialog (see the sketch after this list).
- Choosing the right chat template: Some models define multiple chat template modes. For example, phi-2 allows both qa and chat templates. Set the mode you wish to use in GetDialog for the best results.
- Resource efficiency: Dispose of LLM instances when not needed to minimize memory use.
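A hedged sketch of the GetDialog options mentioned above; the history and mode parameter names are taken from the tips, but verify the exact signature in the API docs.

```csharp
// Limit how many past exchanges are kept in the rendered prompt.
PicoLLMDialog shortDialog = pllm.GetDialog(history: 2);

// Pick a chat template mode for models that define more than one (e.g., phi-2's qa vs. chat).
PicoLLMDialog qaDialog = pllm.GetDialog(mode: "qa");
```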
Troubleshooting .NET Local LLM Issues
Model Load Failures
Verify that the .pllm file path is correct and that the file exists. Check that the file isn't corrupted (re-download it if needed).
Performance Bottlenecks
Common bottlenecks include large prompt histories, oversized models, or running multiple inference sessions simultaneously on limited-core devices. Monitoring CPU and memory usage during runtime can help identify which part of your workflow is slowing down. Consider limiting context length or using lighter models for faster responses.
Incorrect or Unexpected Responses
If the model produces irrelevant or confusing output, check your prompt clarity and context history. Overly long or ambiguous prompts can confuse the model. Trimming unnecessary conversation history or refining prompts can improve response quality.
Building Voice-Enabled .NET Applications with Local LLMs
Combine picoLLM with speech engines to build fully voice-controlled applications:
Complete voice assistant stack:
- Porcupine Wake Word: Activates the assistant using a custom phrase such as "Hey Assistant"
- Cheetah Streaming Speech-to-Text: Captures the user's spoken prompt and displays transcribed text in real time
- picoLLM: Interprets the prompt and generates a response
- Orca Text-to-Speech: Streams the generated response back to the user as speech
All of these run on-device with no cloud dependencies. See the complete implementation in the .NET LLM Voice Assistant demo on GitHub.
Use cases:
- Offline voice assistants for desktop apps
- Privacy-focused enterprise tools (no data leaving corporate network)
- Embedded devices with limited internet (kiosks, industrial equipment)
.NET On-Device LLM Documentation and Resources
Official packages and docs:
- PicoLLM NuGet package
- picoLLM .NET API docs
- picoLLM .NET demos on GitHub (chat and voice assistant)
Models and usage management:
- Picovoice Console: download optimized models and manage your AccessKey
Frequently Asked Questions: Local LLMs in .NET
What is on-device LLM inference in .NET?
On-device LLM inference allows a .NET application to run a local language model for text generation or chat without relying on cloud APIs. This ensures low latency, predictable performance, and full data privacy.
Which platforms does picoLLM support?
picoLLM works on Windows, macOS, Linux, and Raspberry Pi (x86_64 and arm64), supporting desktop and embedded .NET applications.
Does local inference with picoLLM require an internet connection?
Once the model is downloaded, all processing happens locally on your device. Internet is required only for licensing and usage tracking.
Can I build multi-turn chat applications with picoLLM in .NET?
Yes. By using the PicoLLMDialog object, you can manage conversation history and maintain context across multiple prompts and responses.







