Saturday, 18 January 2025

Hour 7 - Creating Custom Models for PDF and Web Scraping

Lecture Notes: 


1. Concepts

Custom Models in Ollama

  • Custom Models: Tailored versions of base models created to handle specific tasks like answering questions from PDFs or summarizing web pages.
  • Ollama allows users to create models by defining custom system prompts and incorporating specific templates.
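
For example, a minimal Modelfile along these lines might look like the sketch below (the base model name and prompt wording are illustrative):

FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """
You extract and summarize information from documents supplied in the prompt.
"""

A TEMPLATE directive can additionally control how the system and user prompts are assembled, though the base model's default template is usually sufficient.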

PDF and Web Scraping with AI

  • PDF Parsing: Extracting meaningful information (e.g., text, metadata) from PDF documents.
  • Web Scraping: Collecting data from websites for insights or analysis.
  • Both tasks require processing structured and unstructured text data, making them ideal for custom AI models.

2. Key Aspects

  1. Key Components of a Custom Model for PDF and Web Scraping:

    • Input Source: The source data (PDFs or web pages).
    • Preprocessing: Cleaning and structuring the data for AI consumption.
    • Model Behavior: Tailored system prompts to guide output generation.
  2. Why Custom Models for PDF and Web Scraping?

    • Automate repetitive tasks like extracting summaries or key points.
    • Handle domain-specific data with fine-tuned responses.
    • Increase efficiency in research, data collection, and reporting.
  3. Challenges:

    • Handling large or complex PDFs (see the chunking sketch after this list).
    • Avoiding CAPTCHAs and respecting legal and terms-of-service constraints during web scraping.
    • Processing noisy or unstructured data effectively.
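
A simple way to address the preprocessing and large-document challenges above is to normalize the extracted text and split it into chunks that fit the model's context window. A minimal sketch, assuming a fixed character budget (the 2,000-character chunk size is arbitrary; tune it to your model's context length):

import re

def clean_text(text):
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def chunk_text(text, max_chars=2000):
    # Split the cleaned text into fixed-size chunks for the model.
    text = clean_text(text)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]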

3. Implementation

CLI Commands for Custom Models:

Command         Description                                             Example
ollama run      Run a custom model to process extracted text.          ollama run pdf_reader "Summarize"
ollama create   Create a new model with a system prompt and template.  ollama create pdf_reader -f ./modelfile.txt
ollama pull     Pull a base model as a starting point.                 ollama pull llama3.2
ollama show     Display the details of the custom model.               ollama show pdf_reader

4. Real-Life Example

Scenario: Extracting Key Points from Research PDFs

  • Objective: Build a model to summarize PDFs containing scientific research papers.
  • Use Case: A researcher needs concise summaries to save time.

5. Code Examples

Step 1: Preprocess PDFs

Use Python to extract text from PDFs. Libraries like PyPDF2 or pdfplumber are commonly used.

import pdfplumber

def extract_text_from_pdf(pdf_path):
    # Collect the text of every page in the document.
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer.
            text += (page.extract_text() or "") + "\n"
    return text

# Example usage
pdf_text = extract_text_from_pdf("example_research.pdf")
print(pdf_text[:500])  # Print the first 500 characters
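
pdfplumber is used here rather than PyPDF2 because it tends to preserve reading order and layout better on multi-column papers; for simple text extraction, either library works.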

Step 2: Create a Custom Model

Define a Modelfile with behavior tailored for summarizing research.

Modelfile (modelfile.txt):

FROM llama3.2
SYSTEM """
You are a research assistant. Summarize the content of research papers in a concise and clear manner. Include key points and findings.
"""

Create the custom model with Ollama CLI:

# Create the custom model
ollama create pdf_reader -f ./modelfile.txt

Step 3: Run the Custom Model

Pass the extracted text from the PDF to the model.

# Run the custom model (the prompt is passed as a positional argument)
ollama run pdf_reader "Summarize the following: [Insert extracted text here]"
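
Pasting a long document into a shell command quickly becomes impractical. The extracted text can instead be sent to the model programmatically through Ollama's local REST API (served at http://localhost:11434 by default). A minimal sketch, assuming the pdf_reader model created above and a running Ollama server:

import requests

def summarize_with_ollama(text, model="pdf_reader"):
    # Ask the local Ollama server for a single, non-streamed completion.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize the following:\n\n{text}",
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

# Example usage
print(summarize_with_ollama(pdf_text))

The official ollama Python package provides an equivalent generate() call if you prefer not to work with raw HTTP.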

Step 4: Web Scraping for Data

Use Python with libraries like BeautifulSoup to scrape data from web pages.

from bs4 import BeautifulSoup
import requests

def scrape_web_page(url):
    # Fetch the page and fail loudly on HTTP errors.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collapse the page to plain text, one line per element.
    return soup.get_text(separator="\n", strip=True)

# Example usage
web_text = scrape_web_page("https://example.com/research-article")
print(web_text[:500])  # Print the first 500 characters
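
get_text() on the whole page returns navigation menus, footers, and other boilerplate along with the article itself, which feeds noise into the model. A common refinement, sketched below, is to target the main content element first; the tag names used (article, p) are assumptions about the page's structure and will vary from site to site:

def scrape_article_text(url):
    # Fetch the page and parse it as before.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Prefer the <article> element if present; fall back to the whole page.
    container = soup.find("article") or soup
    # Keep only paragraph text, dropping empty strings.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return "\n".join(p for p in paragraphs if p)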

Step 5: Integrate Web Data into the Model

Run the scraped content through the custom model.

# Run the custom model with web-scraped content
ollama run pdf_reader "Summarize the following: [Insert scraped text here]"
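
Putting the steps together, the sketch below reuses extract_text_from_pdf (Step 1), scrape_web_page (Step 4), and the summarize_with_ollama helper from Step 3, and assumes a running Ollama server with the pdf_reader model created above:

# End-to-end: summarize both a PDF and a web page
pdf_text = extract_text_from_pdf("example_research.pdf")
web_text = scrape_web_page("https://example.com/research-article")

for source, text in [("PDF", pdf_text), ("Web", web_text)]:
    summary = summarize_with_ollama(text)
    print(f"--- {source} summary ---")
    print(summary)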

6. Example Outputs

PDF Summary:

"This research explores the impact of climate change on agriculture. Key findings include a 20% decrease in crop yield due to rising temperatures and droughts. Adaptive measures, such as genetic modification, show potential to mitigate these effects."

Web Scraping Summary:

"The article discusses the latest advancements in AI, focusing on generative models and their applications in healthcare and education."


7. Summary

  • Concepts Covered: Custom models, PDF parsing, and web scraping.
  • Key Aspects: Preprocessing, model creation, and data integration.
  • Implementation: Preprocessing PDFs and web data, creating a model, and running it for summaries.
  • Real-Life Example: Summarizing research papers and web content.

8. Homework/Practice

  1. Extract text from a PDF of your choice and pass it through a custom Ollama model.
  2. Scrape a webpage and summarize its content using the model.
  3. Experiment with different system prompts to customize model behavior.
  4. Compare the summaries generated by a base model and your custom model.

These lecture notes provide a comprehensive understanding of creating custom models for PDF and web scraping tasks, with practical examples and code samples to enhance learning.
