Saturday, 18 January 2025

Hour 4 - Introduction to Embeddings

Lecture Notes: 

These lecture notes include code samples for generating and using embeddings with Ollama.

1. Concepts

What are Embeddings?

  • Definition: Embeddings are numerical representations of text, words, or concepts in a vector space. These vectors capture semantic meaning, allowing models to understand relationships between words or phrases.
  • Key Idea: Words or sentences with similar meanings are mapped to vectors that are close together in the vector space.

How Embeddings Work:

  • Transform textual data into fixed-size dense vectors.
  • Represent semantic similarity (e.g., "king" and "queen" will have similar embeddings).
  • Provide a foundation for tasks like search, clustering, and recommendation systems.

2. Key Aspects

Properties of Embeddings:

  1. Dimensionality: Number of values in the vector (e.g., 512, 768).
  2. Contextual vs. Static:
    • Static Embeddings: Fixed embeddings for words (e.g., Word2Vec, GloVe).
    • Contextual Embeddings: Represent words based on their context (e.g., BERT, GPT).
  3. Similarity Measures: Cosine similarity is commonly used to compare embeddings.
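
To make the similarity measure concrete, here is a minimal sketch of cosine similarity in plain Python. The three-dimensional vectors are made up for illustration and do not come from any real model; real embeddings typically have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimensionality")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, hand-picked for illustration only.
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.8, 0.9, 0.1],
    "apple": [0.1, 0.2, 0.9],
}

king_queen = cosine_similarity(toy_vectors["king"], toy_vectors["queen"])
king_apple = cosine_similarity(toy_vectors["king"], toy_vectors["apple"])
print(f"king vs queen: {king_queen:.3f}")
print(f"king vs apple: {king_apple:.3f}")
```

Because the measure depends only on direction, scaling a vector by a positive constant does not change its score against any other vector.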

Applications of Embeddings:

  • Search Engines: Find documents or information using semantic similarity.
  • Recommendation Systems: Recommend items whose embeddings are close to those of items a user already prefers.
  • Clustering and Classification: Group similar data points together.
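
As a sketch of the recommendation use case, one simple approach (an assumption here, not something these notes prescribe) is to average the embeddings of the items a user liked and suggest the unseen item closest to that average. The item names, vectors, and the `recommend` helper are all invented for illustration.

```python
import math

# Hypothetical item embeddings (made-up 3-dimensional vectors).
item_vectors = {
    "sci-fi novel": [0.90, 0.10, 0.00],
    "space documentary": [0.80, 0.20, 0.10],
    "cookbook": [0.00, 0.10, 0.90],
    "robot movie": [0.85, 0.15, 0.05],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def recommend(liked, vectors):
    """Return the unseen item whose vector is closest to the mean of the liked items."""
    dims = len(next(iter(vectors.values())))
    # The user profile is the element-wise mean of the liked item vectors.
    profile = [
        sum(vectors[name][i] for name in liked) / len(liked)
        for i in range(dims)
    ]
    candidates = [name for name in vectors if name not in liked]
    return max(candidates, key=lambda name: cosine(profile, vectors[name]))

suggestion = recommend(["sci-fi novel", "space documentary"], item_vectors)
print("Suggested item:", suggestion)
```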

3. Implementation

Step-by-Step: Using Embeddings in Ollama

  1. Generate Embeddings:
    • Use the Ollama CLI to create embeddings for text or documents.
  2. Store Embeddings:
    • Save the embeddings in a JSON file or a vector database.
  3. Perform Similarity Search:
    • Compare embeddings to find semantically similar items.
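
Steps 2 and 3 can be sketched in plain Python. This example assumes each record carries the `text` and `embedding` fields shown in the sample output later in these notes. The vectors are placeholders rather than real model output, and the file name `embeddings.json` and the helper names are hypothetical.

```python
import json
import math
import os
import tempfile

def save_records(records, path):
    """Write a list of {"text", "embedding"} records to a JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(records, f)

def load_records(path):
    """Read the records back from the JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def most_similar(query_vec, records):
    """Return the record whose embedding has the highest cosine similarity to the query."""
    return max(records, key=lambda r: cosine(query_vec, r["embedding"]))

# Placeholder vectors standing in for real embeddings.
records = [
    {"text": "Artificial Intelligence is fascinating.", "embedding": [0.9, 0.1, 0.2]},
    {"text": "Bread needs flour, water, and yeast.", "embedding": [0.1, 0.9, 0.3]},
]

path = os.path.join(tempfile.mkdtemp(), "embeddings.json")
save_records(records, path)
loaded = load_records(path)

best = most_similar([0.8, 0.2, 0.1], loaded)
print("Best match:", best["text"])
```

A flat JSON file is fine for a handful of documents. A vector database becomes worthwhile once a linear scan over every record is too slow.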

4. CLI Commands for Embeddings

| Command | Description | Example |
|---|---|---|
| ollama embed | Generates embeddings for a given text or document. | ollama embed "The quick brown fox" |
| ollama embed --format | Outputs embeddings in JSON format for easier integration with databases. | ollama embed "AI is amazing" --format json |
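
Ollama also serves embeddings through its local REST API (`POST /api/embed`), which is a useful route if the `ollama embed` command is not available in your installation. The sketch below assumes a server running at the default address `http://localhost:11434` and an embedding model that has already been pulled. The model name `nomic-embed-text` is only an example, and the helper names are hypothetical. Check the response shape against the API documentation for your Ollama version.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/embed"  # assumed default local address

def build_payload(texts, model="nomic-embed-text"):
    """Build the JSON request body; the model name is only an example."""
    return json.dumps({"model": model, "input": texts}).encode("utf-8")

def embed(texts, model="nomic-embed-text", url=OLLAMA_URL):
    """Send texts to a running Ollama server and return one vector per text."""
    request = urllib.request.Request(
        url,
        data=build_payload(texts, model),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    # Assumed response shape: {"embeddings": [[...], [...]], ...}
    return body["embeddings"]

# Example call (needs a running Ollama server, so it is left commented out):
# vectors = embed(["The quick brown fox", "AI is amazing"])
# print(len(vectors), len(vectors[0]))
```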

5. Real-Life Example

Scenario: Building a Semantic Search Engine

Suppose you want to search a set of documents based on meaning rather than exact keyword matches. Use embeddings to find documents most relevant to a user's query.
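
The scenario boils down to ranking documents by their similarity to the query vector. A minimal sketch, using invented three-dimensional vectors in place of real embeddings and a hypothetical helper named `top_k`:

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def top_k(query_vec, corpus, k=2):
    """Return the k (score, text) pairs with the highest cosine similarity."""
    q = normalize(query_vec)
    scored = []
    for text, vec in corpus:
        d = normalize(vec)
        score = sum(a * b for a, b in zip(q, d))  # dot of unit vectors = cosine
        scored.append((score, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Made-up document vectors; real ones would come from an embedding model.
corpus = [
    ("How neural networks learn", [0.9, 0.2, 0.1]),
    ("A guide to sourdough baking", [0.1, 0.9, 0.2]),
    ("Training models with gradient descent", [0.8, 0.3, 0.1]),
]
query_vec = [0.85, 0.25, 0.1]  # stands in for the embedding of a query about model training

results = top_k(query_vec, corpus, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

Note that neither machine-learning title needs to share a keyword with the query. The ranking depends only on how close the vectors are.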


6. Code Examples

Generating and Storing Embeddings with Ollama CLI

# Generate embeddings for a document
ollama embed "Artificial Intelligence is fascinating." --format json > ai_embedding.json

# Generate embeddings for another text
ollama embed "Machine learning is a subset of AI." --format json > ml_embedding.json

# Inspect the JSON output
cat ai_embedding.json

Sample output in ai_embedding.json:

{
  "text": "Artificial Intelligence is fascinating.",
  "embedding": [0.123, -0.456, 0.789, ...]
}

Implementing Similarity Search with Ollama and Python

import json
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Load embeddings generated by Ollama
with open("ai_embedding.json", "r") as file:
    ai_data = json.load(file)

with open("ml_embedding.json", "r") as file:
    ml_data = json.load(file)

# Extract embeddings
ai_embedding = np.array(ai_data["embedding"])
ml_embedding = np.array(ml_data["embedding"])

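# Placeholder query embedding: a random vector is used only so the script runs end to end.
# In practice, embed `query` with the same model that produced the document embeddings;
# scores computed against a random vector carry no meaning.
query = "Tell me about AI and its applications."
query_embedding = np.random.rand(len(ai_embedding))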

# Compute cosine similarity
similarities = cosine_similarity([query_embedding], [ai_embedding, ml_embedding])
ranked_indices = similarities.argsort()[0][::-1]

# Map indices to documents
documents = [
    ai_data["text"],
    ml_data["text"]
]

# Print results
print("Query:", query)
print("Top matches:")
for idx in ranked_indices:
    print(f"- {documents[idx]} (Score: {similarities[0][idx]:.4f})")

7. Summary

  • Concepts Covered: Definition and significance of embeddings, their properties, and applications.
  • Key Aspects: Dimensionality, contextual vs. static embeddings, and similarity measures.
  • CLI Commands: Generating and using embeddings with ollama embed.
  • Real-Life Example: Semantic search for finding relevant documents.
  • Code Examples: Generating embeddings using Ollama CLI and performing similarity search.

8. Homework/Practice

  1. Use ollama embed to generate embeddings for five text samples.
  2. Save the embeddings in JSON files.
  3. Write a Python script to load these embeddings and implement a semantic search engine.
  4. Experiment with additional similarity measures (e.g., Euclidean distance).
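
For exercise 4, a useful fact is that on unit-length vectors Euclidean distance and cosine similarity produce the same ranking, because ||a - b||^2 = 2 - 2 cos(a, b). The sketch below checks this identity numerically on made-up vectors. On unnormalized vectors the two measures can disagree, which is worth testing as part of the exercise.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def euclidean(a, b):
    """Straight-line distance between two points."""
    return math.dist(a, b)

# Made-up vectors, normalized to unit length.
a = normalize([0.9, 0.8, 0.1])
b = normalize([0.1, 0.2, 0.9])

lhs = euclidean(a, b) ** 2
rhs = 2 - 2 * cosine(a, b)
print(f"squared distance: {lhs:.6f}")
print(f"2 - 2 * cosine:   {rhs:.6f}")
```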

