Coding Track Module 4 | TACTIC in Lib: IMLS AI Workshop

From Archives to AI Insights

This module introduces a practical coding workflow for digital archive materials. Learners move from raw PDF documents and image-based pages to machine-readable text, structured classification outputs, sentiment insights, and multimodal image search.

Learning Goals

PDF Text Extraction & OCR

Inspect whether archival PDFs contain a text layer, compare born-digital and scanned documents, and use OCR when text exists only as an image.

Archival Text Classification

Prepare cleaned archival text for NLP and classify pages or records into meaningful categories using lightweight machine learning workflows.

Sentiment Analysis Applications

Apply rule-based and pretrained language-model approaches to interpret positive, neutral, negative, or mixed sentiment in text collections.

Multimodal Image Search

Use CLIP-style image and text embeddings to connect visual materials with natural-language queries for retrieval and digital collection exploration.

🗂️ Why Digital Archive Processing Matters

Archive collections often mix born-digital PDFs, scanned pages, captions, images, and historical text. Choosing the right extraction method is the first step toward reliable AI analysis, because OCR noise, missing text layers, and formatting artifacts can affect every downstream result.
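One way to make "OCR noise" concrete is a rough per-page check of how many tokens look like real words. The sketch below is a heuristic for illustration, not something taken from the module notebooks: the name `noise_ratio` and the ASCII-only word pattern are assumptions, and any cutoff used to flag a page would need tuning on your own collection.

```python
import re

# A "clean" token is a run of letters (inner apostrophes or hyphens allowed)
# or a number. This pattern is ASCII-only, which is an assumption that will
# not hold for many non-English archives.
WORD = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,]\d+)*")


def noise_ratio(text: str) -> float:
    """Return the share of whitespace-separated tokens that do not look like words."""
    tokens = text.split()
    if not tokens:
        return 1.0  # an empty page is treated as fully unreliable
    noisy = 0
    for tok in tokens:
        # Ignore punctuation attached to either end of the token.
        core = tok.strip("\"'()[]{}.,;:!?")
        if not WORD.fullmatch(core):
            noisy += 1
    return noisy / len(tokens)


clean_page = "The library opened in 1923."
garbled_page = "Th3 l1br@ry 0pen3d"
print(noise_ratio(clean_page), noise_ratio(garbled_page))
```

A score near 0 suggests the page reads like ordinary prose, while a score near 1 suggests a failed OCR pass or a page with no usable text, which is worth checking before the text reaches classification or sentiment analysis.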

💡 Coding Projects

Google Colab

Module 4: Text and Image Processing for Digital Archives

Complete the three guided coding tasks below. Each task includes a Colab notebook and a walkthrough video.

1 Task 1: Extracting and Preparing Archival Text for NLP

Inspect digital archive PDFs, distinguish born-digital pages from scanned pages, extract text with PyMuPDF, apply OCR with pytesseract when pages are image-based, and clean the extracted text so it is ready for page-level NLP and classification workflows.

📘 Watch Walkthrough Video 🔶 Open Colab Notebook
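The Task 1 workflow can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the notebook's code: the function names `clean_text` and `extract_page_text`, the `min_chars` threshold of 20, and the 300 dpi render setting are all hypothetical choices. It assumes PyMuPDF, pytesseract, Pillow, and the Tesseract binary are installed.

```python
import io
import re


def clean_text(raw: str) -> str:
    """Normalize extracted or OCR'd text for page-level NLP."""
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split across line breaks. Caveat: this also merges
    # genuinely hyphenated words that happen to break at the hyphen.
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r" ?\n ?", "\n", text)     # trim spaces around newlines
    text = re.sub(r"\n{3,}", "\n\n", text)   # keep at most one blank line
    return text.strip()


def extract_page_text(pdf_path: str, page_number: int, min_chars: int = 20):
    """Return (method, cleaned_text) for one page.

    OCR is used only when the text layer is missing or nearly empty.
    """
    # Third-party imports stay local so clean_text() works without them.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    with fitz.open(pdf_path) as doc:
        page = doc[page_number]
        text = page.get_text()
        if len(text.strip()) >= min_chars:
            return "text_layer", clean_text(text)
        # No usable text layer: render the page and run Tesseract on the image.
        pix = page.get_pixmap(dpi=300)
        image = Image.open(io.BytesIO(pix.tobytes("png")))
        return "ocr", clean_text(pytesseract.image_to_string(image))


sample = "The archi-\nval   record\t was\r\n\r\n\r\n\r\nscanned. "
print(repr(clean_text(sample)))
```

Returning the method alongside the text keeps a record of which pages went through OCR, which helps when interpreting errors later in the pipeline.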

2 Task 2: Analyzing Digital Collections with Sentiment Analysis

Apply sentiment analysis methods such as VADER, zero-shot classification, and pretrained transformer pipelines to interpret emotional tone in text collections.

📘 Watch Walkthrough Video 🔶 Open Colab Notebook
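As a rough sketch of the rule-based option, VADER scores can be mapped to labels like this. The positive and negative cutoffs of 0.05 and -0.05 on the compound score follow the thresholds commonly recommended for VADER; the "mixed" rule and its `mixed_floor` value are assumptions made for this example, as are the function names. The output uses the same `text` and `label` fields as the mini exercise in this module.

```python
def label_from_vader(scores: dict, mixed_floor: float = 0.2) -> str:
    """Map a VADER score dict (neg, neu, pos, compound) to a sentiment label."""
    # Hypothetical rule: call a text "mixed" when it carries a substantial
    # share of both positive and negative wording.
    if scores["pos"] >= mixed_floor and scores["neg"] >= mixed_floor:
        return "mixed"
    if scores["compound"] >= 0.05:
        return "positive"
    if scores["compound"] <= -0.05:
        return "negative"
    return "neutral"


def score_texts(texts):
    """Label each text with VADER. Requires the vaderSentiment package."""
    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer

    analyzer = SentimentIntensityAnalyzer()
    return [
        {"text": t, "label": label_from_vader(analyzer.polarity_scores(t))}
        for t in texts
    ]


# The labeling rule can be checked without installing VADER:
example_scores = {"neg": 0.0, "neu": 0.4, "pos": 0.6, "compound": 0.6}
print(label_from_vader(example_scores))
```

VADER was built for short, modern, informal text, so labels on historical or OCR'd material are best treated as a starting point for review rather than a final reading.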

3 Task 3: Multimodal Application with CLIP

Build an image search demo by encoding images and natural-language queries into a shared embedding space, comparing similarity scores, and returning top image matches.

📘 Watch Walkthrough Video 🔶 Open Colab Notebook
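The retrieval step in Task 3 comes down to ranking images by cosine similarity to the query. The sketch below shows only that ranking step: the three-dimensional vectors are made-up placeholders, the file names other than `archive_image_03.jpg` are hypothetical, and `top_matches` is an illustrative name. In a real run, both the image vectors and the query vector would come from a CLIP-style encoder.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_matches(query_vec, image_vecs, k=3):
    """Return the k (image_name, score) pairs most similar to the query."""
    scored = [
        (name, cosine_similarity(query_vec, vec))
        for name, vec in image_vecs.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Placeholder embeddings standing in for CLIP image vectors.
image_vecs = {
    "archive_image_01.jpg": [0.9, 0.1, 0.0],
    "archive_image_02.jpg": [0.0, 1.0, 0.2],
    "archive_image_03.jpg": [0.1, 0.2, 0.95],
}
# Placeholder embedding standing in for the query "people reading in a library".
query_vec = [0.0, 0.1, 1.0]

results = top_matches(query_vec, image_vecs, k=2)
for name, score in results:
    print(f"{name}: {score:.3f}")
```

Similarity scores are relative: they rank images against each other for one query and are not probabilities that a match is correct.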

🚀 After processing the archive, compare how each AI output should be interpreted!

💡 Mini Exercise

Archive AI Preview

Quick Check: Which Method Fits the Archive Item?

Scenario: “Choose the right processing step.”
Learners decide whether a page needs direct text extraction, OCR, text classification, sentiment analysis, or multimodal image search.

# Example decision logic for Module 4

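# Task 1: PDF extraction, OCR, and text preparation
# Placeholder value; in practice, check whether the page yields extractable text
page_has_text_layer = True
if page_has_text_layer:
    method = "extract_text_with_pymupdf"
else:
    method = "run_ocr_with_tesseract"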

# Task 1 continued: Example text classification output
classification_output = {
    "page_id": "page_001",
    "predicted_category": "archives",
    "confidence": 0.86
}

# Task 2: Sentiment analysis
sentiment_output = {
    "text": "Visitors loved the exhibit.",
    "label": "positive"
}

# Task 3: Multimodal retrieval
clip_search = {
    "query": "people reading in a library",
    "top_result": "archive_image_03.jpg"
}
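A record shaped like `classification_output` above could be produced by a very small classifier. The sketch below uses a multinomial Naive Bayes written with the standard library as a stand-in for whatever lightweight model the notebook uses. The class name `TinyNaiveBayes`, the training sentences, and the "events" category are all invented for illustration; only the "archives" category and the output fields come from the example above.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and keep alphabetic runs only."""
    return re.findall(r"[a-z]+", text.lower())


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.doc_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        """Return (best_label, posterior_probability)."""
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for word in tokenize(text):
                if word in self.vocab:  # words never seen in training are skipped
                    score += math.log((counts[word] + 1) / denom)
            log_scores[label] = score
        # Turn log scores into normalized probabilities.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        best = max(exp, key=exp.get)
        return best, exp[best] / sum(exp.values())


# Invented training examples; a real workflow would use labeled archive pages.
train_texts = [
    "finding aid for the manuscript collection and records",
    "box and folder inventory of archival records",
    "join us for the exhibit opening and public lecture",
    "workshop schedule and registration for the lecture series",
]
train_labels = ["archives", "archives", "events", "events"]

model = TinyNaiveBayes().fit(train_texts, train_labels)
category, confidence = model.predict("inventory of manuscript records in the collection")
classification_output = {
    "page_id": "page_001",
    "predicted_category": category,
    "confidence": round(confidence, 2),
}
print(classification_output)
```

Naive Bayes probabilities tend to be overconfident, so the `confidence` field is more useful for ranking pages for review than as a calibrated likelihood.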
