Saturday, 18 January 2025

Hour 7 - Creating Custom Models for PDF and Web Scraping

Lecture Notes: 


1. Concepts

Custom Models in Ollama

  • Custom Models: Tailored versions of base models created to handle specific tasks like answering questions from PDFs or summarizing web pages.
  • Ollama allows users to create models by defining custom system prompts and incorporating specific templates.
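
For example, a minimal Modelfile along these lines might look like the sketch below (the base model name and prompt wording are illustrative):

FROM llama3.2
PARAMETER temperature 0.3
SYSTEM """
You extract and summarize information from documents supplied in the prompt.
"""

A TEMPLATE directive can additionally control how the system and user prompts are assembled, though the base model's default template is usually sufficient.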

PDF and Web Scraping with AI

  • PDF Parsing: Extracting meaningful information (e.g., text, metadata) from PDF documents.
  • Web Scraping: Collecting data from websites for insights or analysis.
  • Both tasks require processing structured and unstructured text data, making them ideal for custom AI models.

2. Key Aspects

  1. Key Components of a Custom Model for PDF and Web Scraping:

    • Input Source: The source data (PDFs or web pages).
    • Preprocessing: Cleaning and structuring the data for AI consumption.
    • Model Behavior: Tailored system prompts to guide output generation.
  2. Why Custom Models for PDF and Web Scraping?

    • Automate repetitive tasks like extracting summaries or key points.
    • Handle domain-specific data with fine-tuned responses.
    • Increase efficiency in research, data collection, and reporting.
  3. Challenges:

    • Handling large or complex PDFs (see the chunking sketch after this list).
    • Avoiding CAPTCHAs and respecting legal and terms-of-service constraints during web scraping.
    • Processing noisy or unstructured data effectively.
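
A simple way to address the preprocessing and large-document challenges above is to normalize the extracted text and split it into chunks that fit the model's context window. A minimal sketch, assuming a fixed character budget (the 2,000-character chunk size is arbitrary; tune it to your model's context length):

import re

def clean_text(text):
    # Collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", text).strip()

def chunk_text(text, max_chars=2000):
    # Split the cleaned text into fixed-size chunks for the model.
    text = clean_text(text)
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]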

3. Implementation

CLI Commands for Custom Models:

Command         Description                                             Example
ollama run      Run a custom model to process extracted text.          ollama run pdf_reader "Summarize"
ollama create   Create a new model with a system prompt and template.  ollama create pdf_reader -f ./modelfile.txt
ollama pull     Pull a base model as a starting point.                 ollama pull llama3.2
ollama show     Display the details of the custom model.               ollama show pdf_reader

4. Real-Life Example

Scenario: Extracting Key Points from Research PDFs

  • Objective: Build a model to summarize PDFs containing scientific research papers.
  • Use Case: A researcher needs concise summaries to save time.

5. Code Examples

Step 1: Preprocess PDFs

Use Python to extract text from PDFs. Libraries like PyPDF2 or pdfplumber are commonly used.

import pdfplumber

def extract_text_from_pdf(pdf_path):
    # Collect the text of every page in the document.
    with pdfplumber.open(pdf_path) as pdf:
        text = ""
        for page in pdf.pages:
            # extract_text() returns None for pages without a text layer.
            text += (page.extract_text() or "") + "\n"
    return text

# Example usage
pdf_text = extract_text_from_pdf("example_research.pdf")
print(pdf_text[:500])  # Print the first 500 characters
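
pdfplumber is used here rather than PyPDF2 because it tends to preserve reading order and layout better on multi-column papers; for simple text extraction, either library works.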

Step 2: Create a Custom Model

Define a Modelfile with behavior tailored for summarizing research.

Modelfile (modelfile.txt):

FROM llama3.2
SYSTEM """
You are a research assistant. Summarize the content of research papers in a concise and clear manner. Include key points and findings.
"""

Create the custom model with Ollama CLI:

# Create the custom model
ollama create pdf_reader -f ./modelfile.txt

Step 3: Run the Custom Model

Pass the extracted text from the PDF to the model.

# Run the custom model (the prompt is passed as a positional argument)
ollama run pdf_reader "Summarize the following: [Insert extracted text here]"
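
Pasting a long document into a shell command quickly becomes impractical. The extracted text can instead be sent to the model programmatically through Ollama's local REST API (served at http://localhost:11434 by default). A minimal sketch, assuming the pdf_reader model created above and a running Ollama server:

import requests

def summarize_with_ollama(text, model="pdf_reader"):
    # Ask the local Ollama server for a single, non-streamed completion.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": f"Summarize the following:\n\n{text}",
            "stream": False,  # return one JSON object instead of a token stream
        },
        timeout=300,
    )
    response.raise_for_status()
    return response.json()["response"]

# Example usage
print(summarize_with_ollama(pdf_text))

The official ollama Python package provides an equivalent generate() call if you prefer not to work with raw HTTP.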

Step 4: Web Scraping for Data

Use Python with libraries like BeautifulSoup to scrape data from web pages.

from bs4 import BeautifulSoup
import requests

def scrape_web_page(url):
    # Fetch the page and fail loudly on HTTP errors.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Collapse the page to plain text, one line per element.
    return soup.get_text(separator="\n", strip=True)

# Example usage
web_text = scrape_web_page("https://example.com/research-article")
print(web_text[:500])  # Print the first 500 characters
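
get_text() on the whole page returns navigation menus, footers, and other boilerplate along with the article itself, which feeds noise into the model. A common refinement, sketched below, is to target the main content element first; the tag names used (article, p) are assumptions about the page's structure and will vary from site to site:

def scrape_article_text(url):
    # Fetch the page and parse it as before.
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    # Prefer the <article> element if present; fall back to the whole page.
    container = soup.find("article") or soup
    # Keep only paragraph text, dropping empty strings.
    paragraphs = [p.get_text(strip=True) for p in container.find_all("p")]
    return "\n".join(p for p in paragraphs if p)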

Step 5: Integrate Web Data into the Model

Run the scraped content through the custom model.

# Run the custom model with web-scraped content
ollama run pdf_reader "Summarize the following: [Insert scraped text here]"
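
Putting the steps together, the sketch below reuses extract_text_from_pdf (Step 1), scrape_web_page (Step 4), and the summarize_with_ollama helper from Step 3, and assumes a running Ollama server with the pdf_reader model created above:

# End-to-end: summarize both a PDF and a web page
pdf_text = extract_text_from_pdf("example_research.pdf")
web_text = scrape_web_page("https://example.com/research-article")

for source, text in [("PDF", pdf_text), ("Web", web_text)]:
    summary = summarize_with_ollama(text)
    print(f"--- {source} summary ---")
    print(summary)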

6. Example Outputs

PDF Summary:

"This research explores the impact of climate change on agriculture. Key findings include a 20% decrease in crop yield due to rising temperatures and droughts. Adaptive measures, such as genetic modification, show potential to mitigate these effects."

Web Scraping Summary:

"The article discusses the latest advancements in AI, focusing on generative models and their applications in healthcare and education."


7. Summary

  • Concepts Covered: Custom models, PDF parsing, and web scraping.
  • Key Aspects: Preprocessing, model creation, and data integration.
  • Implementation: Preprocessing PDFs and web data, creating a model, and running it for summaries.
  • Real-Life Example: Summarizing research papers and web content.

8. Homework/Practice

  1. Extract text from a PDF of your choice and pass it through a custom Ollama model.
  2. Scrape a webpage and summarize its content using the model.
  3. Experiment with different system prompts to customize model behavior.
  4. Compare the summaries generated by a base model and your custom model.

These lecture notes provide a comprehensive understanding of creating custom models for PDF and web scraping tasks, with practical examples and code samples to enhance learning.
