Part 1 of 3 in the “Local AI with Ollama and .NET” series. Also in this series: Part 2 – Local RAG with Ollama, LiteLLM, and Qdrant | Part 3 – Building AI Agents | 🇫🇷 Version
The AI landscape has evolved rapidly, but there’s a growing concern about sending sensitive data to cloud services, managing API costs, and maintaining functionality without internet connectivity. Enter Ollama, a solution that brings powerful language models to your local machine, paired perfectly with .NET’s robust ecosystem.
Why Local AI Development Matters
Privacy and Security: Your data never leaves your machine. This is crucial for industries dealing with sensitive information, legal documents, healthcare records, or proprietary business data.
Cost Control: No per-token charges, no surprise bills. Once you’ve downloaded a model, inference is free beyond your electricity costs.
Offline Capability: Build applications that work without internet connectivity—essential for field work, air-gapped environments, or regions with unreliable connectivity.
Development Flexibility: Experiment freely without worrying about API rate limits or costs during development and testing.
What is Ollama?
Ollama is an open-source tool that makes it easy to run large language models locally. Think of it as Docker for AI models—it handles model downloads, manages resources, and provides a simple API interface.
Supported Models:
- Llama - Meta’s flagship models (Llama 3.1, 3.2, 3.3, and the multimodal Llama 4)
- Qwen - Alibaba’s high-performing multilingual models (Qwen 3, Qwen 2.5)
- Mistral - Efficient models with long context (Mistral Small, Large, Nemo)
- DeepSeek - Reasoning models (DeepSeek-R1, DeepSeek-V3)
- Phi - Microsoft’s compact models (Phi-3, Phi-4)
- Gemma - Google’s open models (Gemma 3, CodeGemma)
- CodeLlama / Devstral - Specialized for code generation
- Specialized Models - Vision models, multimodal, reasoning, embedding models
- And hundreds more
Browse the complete library at ollama.com/library to explore all available models, including specialized versions for coding, reasoning, multilingual support, and vision capabilities.
Setting Up Ollama
Installation
Windows: Download the installer from ollama.ai/download and run it, or use winget:
winget install Ollama.Ollama
Ollama runs as a background service.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.ai/install.sh | sh
Downloading Your First Model
ollama pull llama3.2
This downloads the default Llama 3.2 model (3B parameters, roughly 2 GB). Start an interactive session with:
ollama run llama3.2
Testing the API
Ollama exposes a REST API on http://localhost:11434. Test it:
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Why is the sky blue?",
  "stream": false
}'
Integrating Ollama with .NET
Using HttpClient
For simple scenarios, use .NET’s built-in HttpClient:
using System.Net.Http.Json;
using System.Text.Json;

public class OllamaClient
{
    private readonly HttpClient _httpClient;
    private const string BaseUrl = "http://localhost:11434";

    public OllamaClient()
    {
        _httpClient = new HttpClient { BaseAddress = new Uri(BaseUrl) };
    }

    public async Task<string> GenerateAsync(string model, string prompt)
    {
        // stream = false returns the whole completion in a single JSON response
        var request = new
        {
            model,
            prompt,
            stream = false
        };

        var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
        response.EnsureSuccessStatusCode();

        var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
        return result?.Response ?? string.Empty;
    }
}

public record OllamaResponse(string Response, string Model, bool Done);
Usage:
var client = new OllamaClient();
var answer = await client.GenerateAsync("llama3.2", "Explain dependency injection in C#");
Console.WriteLine(answer);
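The client above waits for the complete answer because it sets stream to false. Ollama can also stream the reply as newline-delimited JSON objects; the sketch below shows one way to expose that as an IAsyncEnumerable on the same OllamaClient class (the StreamAsync method name is my own addition):

public async IAsyncEnumerable<string> StreamAsync(string model, string prompt)
{
    using var request = new HttpRequestMessage(HttpMethod.Post, "/api/generate")
    {
        Content = JsonContent.Create(new { model, prompt, stream = true })
    };

    using var response = await _httpClient.SendAsync(
        request, HttpCompletionOption.ResponseHeadersRead);
    response.EnsureSuccessStatusCode();

    var jsonOptions = new JsonSerializerOptions(JsonSerializerDefaults.Web);
    using var stream = await response.Content.ReadAsStreamAsync();
    using var reader = new StreamReader(stream);

    // Ollama emits one JSON object per line; the last one has "done": true
    while (await reader.ReadLineAsync() is { } line)
    {
        if (string.IsNullOrWhiteSpace(line)) continue;

        var chunk = JsonSerializer.Deserialize<OllamaResponse>(line, jsonOptions);
        if (chunk is null) continue;

        yield return chunk.Response;
        if (chunk.Done) yield break;
    }
}

You can then print tokens as they arrive:

await foreach (var token in client.StreamAsync("llama3.2", "Explain async/await in C#"))
{
    Console.Write(token);
}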
Using OllamaSharp Library
For production use, consider OllamaSharp:
dotnet add package OllamaSharp
using OllamaSharp;
var ollama = new OllamaApiClient("http://localhost:11434", "llama3.2");
var chat = new Chat(ollama);
await foreach (var response in chat.SendAsync("What are SOLID principles?"))
{
    Console.Write(response);
}
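A nice property of OllamaSharp’s Chat is that it keeps the conversation history between calls, so follow-up questions can build on earlier answers without you re-sending the context. A short continuation of the example above:

await foreach (var response in chat.SendAsync("Give a C# example for the first principle."))
{
    Console.Write(response);
}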
Building a Document Q&A System
Here’s a practical example combining Ollama with document processing:
public class DocumentQAService
{
    private readonly OllamaClient _ollama;
    private readonly Dictionary<string, string> _documents = new();

    public DocumentQAService(OllamaClient ollama)
    {
        _ollama = ollama;
    }

    public void AddDocument(string id, string content)
    {
        _documents[id] = content;
    }

    public async Task<string> AskQuestionAsync(string question)
    {
        // Combine all documents as context (fine for a few small documents;
        // larger corpora call for chunking or retrieval)
        var context = string.Join("\n\n", _documents.Values);

        var prompt = $"""
            Context:
            {context}

            Question: {question}
            Answer based only on the context provided above.
            """;

        return await _ollama.GenerateAsync("llama3.2", prompt);
    }
}
Usage:
var qa = new DocumentQAService(new OllamaClient());
qa.AddDocument("policy", "Our refund policy allows returns within 30 days...");
qa.AddDocument("shipping", "We ship worldwide with DHL. Delivery takes 3-5 days...");
var answer = await qa.AskQuestionAsync("What is your refund policy?");
Console.WriteLine(answer);
Choosing the Right Model
Different models suit different needs:
| Model | Size | Best For | Speed |
|---|---|---|---|
| Phi-4 | 14B | Fast reasoning & coding | Fast |
| Llama 3.2 | 1B-3B | Quick tasks, chat | Very Fast |
| Qwen 2.5 | 7B-32B | General purpose, coding | Medium |
| Llama 3.1 | 8B-70B | Complex reasoning | Medium-Slow |
| DeepSeek-R1 | 7B-671B | Advanced reasoning | Slower |
For development, start with Qwen 2.5 7B or Llama 3.2 3B (excellent balance of speed and quality). For coding tasks, DeepSeek-Coder or Devstral are specialized choices.
Best Practices
1. Prompt Engineering
Be specific and provide context:
// ❌ Vague
var result = await ollama.GenerateAsync("qwen2.5", "Write code");

// ✅ Specific
var result = await ollama.GenerateAsync("qwen2.5",
    "Write a C# method that validates email addresses using regex. " +
    "Include error handling and XML documentation comments.");
2. Temperature Control
Control randomness with temperature (0.0 = deterministic, 1.0 = creative):
var request = new
{
    model = "qwen2.5",
    prompt = "Generate a creative story",
    options = new
    {
        temperature = 0.3 // Lower value = more deterministic
    }
};
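The minimal OllamaClient shown earlier does not forward options. If you are using it, one way to add support is an overload like the following sketch (the options shape mirrors the request body above; the overload itself is my own addition):

public async Task<string> GenerateAsync(string model, string prompt, double temperature)
{
    var request = new
    {
        model,
        prompt,
        stream = false,
        options = new { temperature } // passed through to the model's sampling settings
    };

    var response = await _httpClient.PostAsJsonAsync("/api/generate", request);
    response.EnsureSuccessStatusCode();

    var result = await response.Content.ReadFromJsonAsync<OllamaResponse>();
    return result?.Response ?? string.Empty;
}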
3. Context Windows
Models have token/context limits that vary by model and configuration (many modern models support from 8K up to 128K tokens or more). For longer documents, implement chunking:
public async Task<string> SummarizeLongDocument(string document)
{
    const int chunkSize = 2000; // characters per chunk, not tokens
    var chunks = SplitIntoChunks(document, chunkSize);
    var summaries = new List<string>();

    // Summarize each chunk independently
    foreach (var chunk in chunks)
    {
        var summary = await _ollama.GenerateAsync("llama3.2",
            $"Summarize this text:\n{chunk}");
        summaries.Add(summary);
    }

    // Then produce a final summary of the summaries
    return await _ollama.GenerateAsync("llama3.2",
        $"Create a final summary from these summaries:\n{string.Join("\n", summaries)}");
}
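SplitIntoChunks isn’t shown above; a naive character-based version could look like this (splitting on sentence or paragraph boundaries would usually give better summaries):

private static List<string> SplitIntoChunks(string text, int chunkSize)
{
    var chunks = new List<string>();
    for (var i = 0; i < text.Length; i += chunkSize)
    {
        // Take up to chunkSize characters, clamped at the end of the string
        chunks.Add(text.Substring(i, Math.Min(chunkSize, text.Length - i)));
    }
    return chunks;
}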
4. Model Management
Check available models:
ollama list
Remove unused models to save space:
ollama rm mistral
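You can also query installed models from .NET via the /api/tags endpoint, which lists the models pulled locally. A minimal sketch (the record types below only map the fields used here):

using System.Net.Http.Json;

var http = new HttpClient { BaseAddress = new Uri("http://localhost:11434") };

// /api/tags returns the locally installed models
var tags = await http.GetFromJsonAsync<TagsResponse>("/api/tags");

foreach (var model in tags?.Models ?? new List<ModelInfo>())
{
    // Size is reported in bytes
    Console.WriteLine($"{model.Name} ({model.Size / 1_000_000_000.0:F1} GB)");
}

public record TagsResponse(List<ModelInfo> Models);
public record ModelInfo(string Name, long Size);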
Performance Considerations
Hardware Requirements:
- Minimum: 8GB RAM, modern CPU
- Recommended: 16GB+ RAM, GPU with CUDA/ROCm support
- Optimal: 32GB RAM, NVIDIA RTX series GPU
GPU Acceleration:
Ollama automatically uses your GPU if available. Check with:
ollama ps
Memory Usage:
Monitor with:
# Windows
Get-Process ollama
# Linux/Mac
ps aux | grep ollama
Common Use Cases
1. Code Review Assistant
var review = await ollama.GenerateAsync("devstral",
    $"Review this C# code for issues:\n{code}");
2. Email Drafting
var email = await ollama.GenerateAsync("qwen2.5",
    "Draft a professional email declining a meeting request politely.");
3. Data Extraction
var extracted = await ollama.GenerateAsync("qwen2.5",
    $"Extract names, dates, and amounts from this invoice:\n{invoice}");
4. Translation
var translation = await ollama.GenerateAsync("qwen2.5",
    $"Translate to French: {englishText}");
Limitations and Considerations
Model Quality: Local models may not match GPT-4o’s quality for complex tasks. Choose based on your requirements.
Resource Intensive: Larger models need significant RAM and benefit greatly from GPUs.
No Internet Knowledge: Models only know information from their training data. They can’t access current events or web content.
Hallucinations: All LLMs can generate plausible-sounding but incorrect information. Always validate critical outputs.
Next Steps
- Start Small: Try Phi-4 or Llama 3.2 for initial experiments
- Build Prototypes: Create simple chat or document processing apps
- Optimize Prompts: Experiment with different prompt structures
- Add RAG: Combine with vector databases for better context
- Monitor Performance: Profile your application under load
The combination of Ollama and .NET opens up possibilities for privacy-focused, cost-effective AI applications. Whether you’re building internal tools, prototyping ideas, or creating offline-capable software, local AI gives you control and flexibility.
Explore, experiment, and innovate—the future of AI is in your hands, locally.
This post was created with the assistance of AI.