I wanted to run retrieval-augmented generation (RAG) entirely on my laptop: no API keys, no uploads to the cloud—just my machine, a few text snippets, and Ollama. RAG means the model answers using your text, not only what it memorized during training. Ollama's own embedding post describes the same pipeline I followed: generate embeddings, retrieve what matches the question, then generate an answer from that context. Below is exactly how I did it, using one small sample so you can mirror the same checks on your side.
What I set out to build (one concrete sample)
I picked three short "documents" about llamas—plain strings in memory, like tiny pages in a notebook. My question would be: What animals are llamas related to?
The answer should come from the sentence that mentions camelids, not from the model guessing. That is the whole point of RAG: retrieve the right snippet first, then ask the chat model to speak only from that snippet.
In plain terms, the pipeline has three parts: (1) turn each document into a vector (an embedding), (2) turn my question into a vector with the same embedding model and pick the closest document by cosine similarity, (3) call a local chat model with a system rule that says "use only this context." Ollama's REST API exposes embeddings at POST /api/embed with fields model and input; the response includes an embeddings array (one vector per input string). For the final answer I used ollama.chat with stream: false so I got one JSON object back—easier to read while learning.
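Before the full script, here is a minimal raw-HTTP sketch of step (1), assuming the default local host; `embedViaRest` and `firstVector` are hypothetical helper names, not part of any SDK:

```javascript
// Minimal sketch of the embedding step over raw HTTP. Assumes Ollama is
// listening on the default local address; the endpoint and field names
// follow the REST API described above.
const OLLAMA_URL = "http://127.0.0.1:11434";

// POST /api/embed with { model, input } and return the parsed JSON body.
async function embedViaRest(model, input) {
  const res = await fetch(`${OLLAMA_URL}/api/embed`, {
    method: "POST",
    body: JSON.stringify({ model, input }),
  });
  return res.json();
}

// The response carries an `embeddings` array, one vector per input string;
// for a single input string the vector we want is the first element.
function firstVector(response) {
  return response.embeddings[0];
}
```

Nothing here runs until you call `embedViaRest`, so the shape of the response is the only assumption being made.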
What I did first on my machine
I installed Ollama from the official site and left the app running so the HTTP server listened on 127.0.0.1:11434 (the default the ollama-js client uses). Then I pulled two models: one embedding model for vectors, and one chat model for answers. The Ollama blog lists several embedding options; I used nomic-embed-text and llama3.2 for chat—both small enough to run on a typical dev machine.
# In my terminal (embedding vectors)
ollama pull nomic-embed-text
# Chat / completion model for the final reply
ollama pull llama3.2
I made a folder for the experiment, initialized npm, and installed the official client—the same calls you can make with raw HTTP, but typed and convenient in Node.
mkdir rag-ollama-demo && cd rag-ollama-demo
npm init -y
npm install ollama
Because I used ES modules in the script below, I saved the file with a .mjs extension so Node would accept import without extra configuration. (Setting "type": "module" in package.json achieves the same thing for plain .js files.)
How I verified Ollama before writing JavaScript
I did not jump straight into code. I ran the same embedding request the Ollama docs show with curl, pointed at localhost:11434. When it worked, I saw JSON with model, embeddings (a nested array of floats), and timing fields like total_duration. If I had seen a connection error instead, that would have meant the Ollama daemon was not running or the port was wrong—not a bug in my script yet.
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Llamas are members of the camelid family"
}'
I treated that response as my ground truth: whatever ollama.embed() returns in Node should match this shape—same endpoint, same fields—because the client is a thin wrapper over the REST API documented in Ollama's repository.
The Node script I ran: what each part does
I saved the file as rag.mjs in my project folder and ran node rag.mjs. Here is the program, followed by a line-by-line walkthrough of what happened when I executed it.
import ollama from "ollama";

const EMBED_MODEL = "nomic-embed-text";
const CHAT_MODEL = "llama3.2";

const documents = [
  "Llamas are members of the camelid family; they are related to vicuñas and camels.",
  "Llamas were domesticated thousands of years ago in the Andes as pack animals.",
  "Adult llamas often weigh between 280 and 450 pounds.",
];

function cosineSimilarity(a, b) {
  let dot = 0;
  let na = 0;
  let nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

async function buildIndex() {
  const rows = [];
  for (let i = 0; i < documents.length; i++) {
    const res = await ollama.embed({
      model: EMBED_MODEL,
      input: documents[i],
    });
    rows.push({ id: String(i), text: documents[i], vector: res.embeddings[0] });
  }
  return rows;
}

async function retrieve(index, question, k = 1) {
  const res = await ollama.embed({ model: EMBED_MODEL, input: question });
  const q = res.embeddings[0];
  return index
    .map((row) => ({ ...row, score: cosineSimilarity(q, row.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

async function answerWithRag(index, question) {
  const [best] = await retrieve(index, question, 1);
  const context = best.text;
  const reply = await ollama.chat({
    model: CHAT_MODEL,
    messages: [
      {
        role: "system",
        content:
          "Answer clearly using only the provided context. If the context does not contain the answer, say you do not know.",
      },
      {
        role: "user",
        content: "Context:\n" + context + "\n\nQuestion: " + question,
      },
    ],
    stream: false,
  });
  return reply.message.content;
}

const index = await buildIndex();
const q = "What animals are llamas related to?";
console.log(await answerWithRag(index, q));

Constants and the tiny knowledge base
I set EMBED_MODEL and CHAT_MODEL to the same names I had pulled with ollama pull. The documents array is my whole dataset for this demo—three strings. In a real project each string might be a chunk from a PDF; here they are short so I could print scores and reason about mistakes easily.
Indexing: three HTTP calls under the hood
buildIndex() loops over each document and calls ollama.embed. For every call, Ollama returns an object whose embeddings field is an array of vectors; for a single string input I used res.embeddings[0] as that row's vector. I stored the original text next to the vector so retrieval could return human-readable context, not just numbers.
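Since the /api/embed endpoint returns one vector per input string, the loop can in principle be collapsed into a single batched request by passing the whole array as input. A sketch of that idea, assuming the same response shape; `buildIndexBatched` and `pairRows` are my own illustrative names:

```javascript
// Batching sketch: send all documents in one /api/embed request instead of
// one request per document. Assumes the endpoint accepts an array as
// `input` and returns one vector per string, in order.
async function buildIndexBatched(docs, model = "nomic-embed-text") {
  const res = await fetch("http://127.0.0.1:11434/api/embed", {
    method: "POST",
    body: JSON.stringify({ model, input: docs }),
  });
  const { embeddings } = await res.json();
  return pairRows(docs, embeddings);
}

// Pure helper: zip each document with its vector, keeping the original
// text alongside the embedding so retrieval can return readable context.
function pairRows(docs, vectors) {
  return docs.map((text, i) => ({ id: String(i), text, vector: vectors[i] }));
}
```

For three documents the difference is negligible, but for hundreds of chunks one round trip instead of N is a meaningful saving.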
Retrieval: one embedding for the question, then cosine similarity
retrieve() embeds my question with the same EMBED_MODEL. Then it scores every stored row with cosineSimilarity: same length vectors, dot product divided by the product of their L2 norms, producing a score between -1 and 1 for normalized embeddings. I sorted descending and took k = 1 so only the top chunk fed the model—enough for this sample; you might pass k > 1 and concatenate several chunks in a larger app.
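The k > 1 variant mentioned above can be sketched as a pure function over precomputed vectors (cosineSimilarity is repeated here so the snippet is self-contained); `topKContext` is an illustrative name, not part of the script:

```javascript
// Score every stored row against the query vector, sort descending,
// keep the top k, and join their texts into one context string.
function cosineSimilarity(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

function topKContext(index, queryVector, k = 2) {
  return index
    .map((row) => ({ ...row, score: cosineSimilarity(queryVector, row.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k)
    .map((r) => r.text)
    .join("\n\n");
}
```

With k = 2 the system prompt would then receive two chunks separated by a blank line, which is often enough for questions whose answer spans more than one snippet.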
Generation: system prompt + user message
answerWithRag() takes the best row's text as context. I used ollama.chat with a system message that instructs the model to stay inside the context, and a user message that concatenates context and question. With stream: false, the promise resolves to a single object; the string I cared about was reply.message.content, as documented for the chat API in the ollama-js README.
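If you later switch to stream: true, the REST endpoint emits one JSON object per line (NDJSON), each carrying a fragment of the reply in message.content. A sketch of the accumulation logic, written as a pure helper so it can be reasoned about without a live server; `accumulateChatStream` is my own name for it:

```javascript
// Accumulate the message.content fragments from a streamed chat response.
// Assumes each non-empty line is a standalone JSON object, as the
// streaming REST API produces; lines without a message (e.g. the final
// "done" object) are skipped.
function accumulateChatStream(ndjsonText) {
  let text = "";
  for (const line of ndjsonText.split("\n")) {
    if (!line.trim()) continue;
    const part = JSON.parse(line);
    if (part.message && part.message.content) text += part.message.content;
  }
  return text;
}
```

With the ollama-js client the same effect comes from passing stream: true and iterating the returned parts with for await, appending each part.message.content.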
What I ran and what I expected to see
At the bottom of the file, top-level await builds the index once, sets the question to "What animals are llamas related to?", and logs the final string. When I ran node rag.mjs, I expected the retrieved chunk to be the sentence about camelids, and the printed answer to mention vicuñas or camels because that is what the context contained—not because the model "remembered" zoology from pretraining alone.
node rag.mjs
The first run was slower while the weights loaded; later runs were quicker because the models could stay resident (Ollama's API also exposes keep_alive if you need to tune that). If Node printed a connection error to 127.0.0.1:11434, I would have started the Ollama app again before blaming the script.
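If the default residency window does not suit you, keep_alive can be set per request. A small sketch of the request body, under the assumption (from the REST API docs) that it accepts duration strings like "10m", with 0 unloading the model immediately and -1 keeping it loaded indefinitely:

```javascript
// Chat request body with an explicit keep_alive, keeping the model's
// weights resident for ten minutes after this call completes.
const warmBody = {
  model: "llama3.2",
  messages: [{ role: "user", content: "ping" }],
  stream: false,
  keep_alive: "10m",
};
// Sent like any other chat request, e.g.:
// await fetch("http://127.0.0.1:11434/api/chat", {
//   method: "POST",
//   body: JSON.stringify(warmBody),
// });
```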
Optional: I double-checked generation with curl (no SDK)
After I trusted embeddings via curl, I sometimes sanity-checked the chat model the same way. The REST API documents POST /api/generate; with stream: false I got one JSON object whose response field held the full text. I pasted a tiny prompt that included the same fact my RAG pipeline would have retrieved, so I could compare "raw" generation behavior to the Node client.
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Using only this fact: Llamas are camelids related to vicuñas and camels. What animals are llamas related to?",
"stream": false
}'
Accuracy and good habits
- Use one embedding model for indexing and querying; mixing models breaks comparability of vectors.
- Tell the model to rely on the provided context (as in the system message above) to reduce invented facts—hallucinations are still possible, so treat outputs as drafts for high-stakes use.
- Respect hardware limits: larger chat models need more RAM; embedding models vary in size and quality (see the model table on Ollama's embedding blog post).
Where to read more (official sources)
- Ollama — Embedding models (what embeddings are, example models, and the embed / retrieve / generate flow)
- Ollama REST API reference (/api/embed, /api/generate, chat endpoints, and parameters)
- ollama-js (JavaScript client: embed, chat, generate, default host)
- API documentation is also being expanded at docs.ollama.com, as noted in the upstream API readme.
That local loop—curl, then a twenty-line script—was enough for me to trust each layer before I added a database or a UI. Once this sample behaved the way I expected, I could swap in better chunking, more documents, or a vector store without changing the mental model: embed, retrieve, generate, still all on my machine.