Saturday, 18 January 2025

Hour 8 - Introduction to Fine-Tuning Custom PDF and Web Scraping Models

Lecture Notes: 


1. Concepts

What is Fine-Tuning?

  • Fine-tuning is the process of adjusting a pre-trained model to improve its performance for a specific task or dataset.
  • Fine-tuning allows a model to better understand and generate responses based on domain-specific data, improving its accuracy and usefulness in real-world applications.

Why Fine-Tune PDF and Web Scraping Models?

  • Models trained on general data may miss the nuances of specialized tasks, such as summarizing academic papers or extracting specific fields from web pages.
  • Fine-tuning allows the model to specialize in these tasks by exposing it to relevant, labeled data.

Key Idea

  • Fine-tuning involves updating the weights of a model after it has been pre-trained. This is achieved by training it on new data that aligns with the target task.

2. Key Aspects of Fine-Tuning

  1. Base Model Selection:
    • Choose a model that already has useful general knowledge. Models like Llama are good starting points for fine-tuning.
  2. Dataset Preparation:
    • Labeled Data: Fine-tuning requires a labeled dataset. For example, to fine-tune a model for summarizing research papers, you need a dataset of papers paired with their summaries (see the JSONL sketch after this list).
    • For PDFs: Label the data so the model learns to produce key points, summaries, or other target content.
    • For Web Scraping: Label data for specific types of information, such as titles, articles, or key facts extracted from scraped web pages.
  3. Training Process:
    • The training process involves using small batches of data to modify the model’s weights.
    • Learning Rate: A key parameter for fine-tuning that controls how much the weights change during training.
  4. Evaluation:
    • After fine-tuning, evaluate the model to check if it performs well on new, unseen data.
  5. Transfer Learning:
    • Fine-tuning is a form of transfer learning, where you apply knowledge from one domain (general model) to another (specific task).
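
To make dataset preparation concrete, here is a minimal sketch of labeled examples stored as JSONL (one JSON object per line). The file name and the input/output field names are illustrative assumptions, not a required schema:

# fine_tune_data.jsonl (hypothetical file) -- one labeled example per line:
# {"input": "Full text of research paper 1...", "output": "Summary of paper 1"}
# {"input": "Full text of research paper 2...", "output": "Summary of paper 2"}

import json

def load_labeled_data(path):
    """Read input/output pairs from a JSONL file into a list of dicts."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]

data = load_labeled_data("fine_tune_data.jsonl")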

3. Implementation

Prerequisites:

  • Python Libraries: torch, transformers
  • Data Preparation: A dataset with labeled examples of the target task (summaries, extracted content).

Example: Fine-Tuning for PDF Summarization

Step 1: Create a Dataset for Fine-Tuning

First, prepare a small dataset of PDF summaries (input-output pairs).

# Example dataset for fine-tuning (PDF summaries)
data = [
    {"input": "Text of research paper 1", "output": "Summary of paper 1"},
    {"input": "Text of research paper 2", "output": "Summary of paper 2"},
    # Add more labeled examples
]
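In practice, the input text usually comes from real PDFs. A minimal sketch of extracting that text with the pypdf library (an assumption here; it is not in the prerequisites, so install it separately) might look like this:

from pypdf import PdfReader

def pdf_to_text(path):
    """Concatenate the extracted text of every page of a PDF."""
    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)

# Hypothetical file name; pair each extracted text with a human-written summary
example_pair = {"input": pdf_to_text("paper1.pdf"), "output": "Summary of paper 1"}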
Step 2: Define the Model and Tokenizer

For fine-tuning, you’ll need to choose a pre-trained model. Let's assume we are working with Llama.

from transformers import LlamaForCausalLM, LlamaTokenizer

# Load the pre-trained model and tokenizer.
# "llama" is a placeholder -- replace it with a real checkpoint name or local
# path, e.g. "meta-llama/Llama-2-7b-hf" (requires access on Hugging Face).
model = LlamaForCausalLM.from_pretrained("llama")
tokenizer = LlamaTokenizer.from_pretrained("llama")

# Llama tokenizers ship without a padding token; set one so padding works
tokenizer.pad_token = tokenizer.eos_token
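
Full-precision Llama weights are large. If GPU memory is tight, one common option (an assumption here, not something this lecture requires) is to load the model in half precision:

import torch

# Optional: load in float16 to roughly halve memory use on GPU
model = LlamaForCausalLM.from_pretrained("llama", torch_dtype=torch.float16)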
Step 3: Tokenize the Dataset

Convert the text data into tokens that can be fed into the model.

# Pad both sides to the same fixed length so input_ids and labels have matching
# shapes (the causal-LM loss expects them to line up; in a production setup you
# would also set pad positions in the labels to -100 so the loss ignores them).
inputs = tokenizer([d['input'] for d in data], padding="max_length", max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer([d['output'] for d in data], padding="max_length", max_length=512, truncation=True, return_tensors="pt")

# Create dataset for PyTorch
import torch
class PDFSummaryDataset(torch.utils.data.Dataset):
    def __init__(self, inputs, labels):
        self.inputs = inputs
        self.labels = labels
        
    def __getitem__(self, idx):
        return {
            "input_ids": self.inputs["input_ids"][idx],
            "attention_mask": self.inputs["attention_mask"][idx],
            "labels": self.labels["input_ids"][idx],
        }

    def __len__(self):
        return len(self.inputs["input_ids"])

# Create DataLoader for batching
train_dataset = PDFSummaryDataset(inputs, labels)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=2, shuffle=True)
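
Before training, a quick sanity check (an addition to the lecture's code) confirms that batches have matching shapes, which the causal-LM loss requires:

# Peek at one batch: input_ids and labels should have identical shapes
batch = next(iter(train_loader))
print(batch["input_ids"].shape, batch["labels"].shape)  # e.g. torch.Size([2, 512]) for both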
Step 4: Fine-Tune the Model

Now, you can start the fine-tuning process using the dataset.

from transformers import Trainer, TrainingArguments

# Define training arguments
training_args = TrainingArguments(
    output_dir="./model_output",      # output directory
    evaluation_strategy="steps",      # evaluation strategy to adopt during training
    learning_rate=5e-5,               # learning rate
    per_device_train_batch_size=2,    # batch size
    num_train_epochs=3,               # number of epochs
    weight_decay=0.01                 # weight decay to avoid overfitting
)

# Define the Trainer
trainer = Trainer(
    model=model,                      # the pre-trained model
    args=training_args,               # training arguments
    train_dataset=train_dataset,      # training dataset
    eval_dataset=train_dataset        # evaluation dataset (reused training set for demo; use held-out data in practice)
)

# Fine-tune the model
trainer.train()
Step 5: Save the Fine-Tuned Model

After training, save your fine-tuned model.

model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")
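
To verify the fine-tuned model end to end, you can reload it and summarize an unseen document. This is a sketch; the prompt text and generation settings below are assumptions, not part of the lecture's pipeline:

from transformers import LlamaForCausalLM, LlamaTokenizer

# Reload the fine-tuned weights and tokenizer
model = LlamaForCausalLM.from_pretrained("./fine_tuned_model")
tokenizer = LlamaTokenizer.from_pretrained("./fine_tuned_model")

# Generate a summary for a new, unseen paper
text = "Text of an unseen research paper"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))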

4. Real-Life Example

Scenario: Fine-Tuning for Extracting Key Information from Web Scraped Articles

  • Objective: Fine-tune a model to extract specific information (e.g., author name, publication date, and article summary) from web pages scraped using BeautifulSoup.
Step 1: Scrape Data from the Web

Use the requests and BeautifulSoup libraries to scrape articles from a webpage.

from bs4 import BeautifulSoup
import requests

def scrape_web_article(url):
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail early on HTTP errors
    soup = BeautifulSoup(response.text, "html.parser")
    # The tag names and the "author" class are site-specific examples;
    # adjust the selectors to match the pages you are scraping.
    title_tag = soup.find("h1")
    author_tag = soup.find("span", class_="author")
    return {
        "title": title_tag.get_text(strip=True) if title_tag else None,
        "author": author_tag.get_text(strip=True) if author_tag else None,
    }

# Example: scrape an article
article = scrape_web_article("https://example.com/article")
print(article)
Step 2: Label the Data

Label the scraped content with the correct output (summary, author name, etc.).

web_data = [
    {"input": "Text from scraped article 1", "output": "Summary and key points"},
    # Add more data
]
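One way to assemble these pairs (a sketch; the URLs are placeholders, and the target outputs still need to be written by a human annotator) is to reuse the scrape_web_article function from Step 1:

urls = ["https://example.com/article-1", "https://example.com/article-2"]  # placeholder URLs

web_data = []
for url in urls:
    article = scrape_web_article(url)
    web_data.append({
        # Build the model input from the scraped fields
        "input": f"Title: {article['title']}\nAuthor: {article['author']}",
        # The expected output must still be labeled manually
        "output": "Summary and key points",
    })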
Step 3: Fine-Tune the Model

Follow the same fine-tuning steps as in the PDF case, using the web-scraped content.


5. Code Summary

from transformers import LlamaForCausalLM, LlamaTokenizer, Trainer, TrainingArguments
import torch

# Load and prepare the model and tokenizer ("llama" is a placeholder checkpoint name)
model = LlamaForCausalLM.from_pretrained("llama")
tokenizer = LlamaTokenizer.from_pretrained("llama")
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers have no pad token by default

# Prepare dataset (input-output pairs)
data = [
    {"input": "Text of research paper 1", "output": "Summary of paper 1"},
    # Add more labeled examples
]

# Pad to a fixed length so input_ids and labels have matching shapes
inputs = tokenizer([d['input'] for d in data], padding="max_length", max_length=512, truncation=True, return_tensors="pt")
labels = tokenizer([d['output'] for d in data], padding="max_length", max_length=512, truncation=True, return_tensors="pt")

# Wrap the tensors with the PDFSummaryDataset class defined in Step 3
train_dataset = PDFSummaryDataset(inputs, labels)

# Fine-tune the model
training_args = TrainingArguments(
    output_dir="./model_output", num_train_epochs=3, per_device_train_batch_size=2, learning_rate=5e-5
)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

# Save the fine-tuned model
model.save_pretrained("./fine_tuned_model")
tokenizer.save_pretrained("./fine_tuned_model")

6. Summary

  • Concepts Covered: Fine-tuning, transfer learning, dataset preparation, training, and evaluation.
  • Key Aspects: Fine-tuning requires a labeled dataset, careful model selection, and tuning of hyperparameters.
  • Real-Life Example: Fine-tuning a model for summarizing research papers (PDFs) and extracting key details from web-scraped content.
  • Implementation: Steps involved creating datasets, tokenizing them, fine-tuning the model, and evaluating it.

7. Homework/Practice

  1. Fine-tune the model you created in the previous lesson to summarize a new set of PDFs.
  2. Use web-scraped content and fine-tune the model for extracting key details (e.g., title, author, summary) from articles.
  3. Experiment with different learning rates and batch sizes to see how they affect model performance (a starter sketch follows).
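
As a starting point for exercise 3, a simple sweep might look like the sketch below (the values are arbitrary examples, and train_dataset is the dataset built in the implementation section):

for lr in (1e-5, 5e-5, 1e-4):          # example learning rates
    for batch_size in (1, 2):          # example batch sizes
        # Reload the base model each run so every experiment starts from the same weights
        model = LlamaForCausalLM.from_pretrained("llama")
        args = TrainingArguments(
            output_dir=f"./sweep_lr{lr}_bs{batch_size}",
            learning_rate=lr,
            per_device_train_batch_size=batch_size,
            num_train_epochs=1,        # keep exploratory runs short
        )
        trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
        result = trainer.train()
        print(f"lr={lr} batch_size={batch_size} loss={result.training_loss:.4f}")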

These lecture notes provide a step-by-step introduction to fine-tuning models for custom tasks like PDF summarization and web-scraping extraction, with practical Python examples built on PyTorch and Hugging Face Transformers.
