Coding Track for Module 2 - Data Preparation Pathway

AI Project Planning & Data Preparation: preparing clean, model-ready text, QA, and multimodal datasets.

From Raw Data to AI-Ready Data

This module introduces practical data preparation workflows for AI projects. Learners will move from raw datasets to structured, model-ready formats such as cleaned CSV tables, SQuAD-style JSONL records, and paired image-text datasets.

Learning Goals

1. Sentiment Analysis Data Prep

Clean text, tokenize inputs, compare traditional NLP preprocessing with transformer-based tokenization, and prepare text-label data for sentiment models.

2. Question Answering Data Prep

Normalize question/context/answer fields and convert flat QA tables into a SQuAD-style schema for QA model development.

3. Multimodal Data Prep

Organize image folders, caption files, and metadata tables into clear image-caption pairs for multimodal learning pipelines.

🧹 Why Data Preparation Matters

AI models rely on consistent input formats. Small preprocessing decisions—such as lowercasing, stopword removal, answer-span alignment, or image-caption pairing—can directly affect model quality, reproducibility, and interpretability.
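To make this concrete, here is a minimal sketch of how one preprocessing choice can change what a sentiment model sees. The `STOPWORDS` set and the `remove_stopwords` helper are illustrative assumptions, not part of the course notebooks; some widely used stopword lists do include negations such as "not", which is the case simulated here.

```python
# Illustrative stopword list (an assumption for this sketch).
# It deliberately includes the negation "not".
STOPWORDS = {"the", "was", "not", "a", "is"}


def remove_stopwords(text):
    """Lowercase the text and drop every token found in STOPWORDS."""
    tokens = text.lower().split()
    return " ".join(tok for tok in tokens if tok not in STOPWORDS)


positive = remove_stopwords("The service was good")
negative = remove_stopwords("The service was not good")

# Both reviews collapse to the same string, so the label signal is lost.
print(positive)
print(negative)
print(positive == negative)
```

After stopword removal, the two reviews are indistinguishable even though their sentiment is opposite, which is why a stopword list is worth inspecting before it is applied to sentiment data.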

💡 Coding Projects

Google Colab
Module 2: AI Project Planning & Data Preparation

Complete the three guided coding tasks below. Each task includes a Colab notebook and a walkthrough video.

Task 1: Sentiment Analysis Data Preparation

Prepare a text + sentiment label dataset by exploring raw data, cleaning text, tokenizing, and comparing TF-IDF vectorization with transformer-based tokenization.
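The cleaning and TF-IDF steps in this task might look roughly like the sketch below. It uses only the Python standard library, and the function names (`clean_text`, `tokenize`, `tf_idf`) and the sample rows are assumptions made for illustration. The TF-IDF formula is the plain textbook form, term frequency times log(N / document frequency), which differs from the smoothed variants used by libraries such as scikit-learn. Transformer-based tokenization is not shown because it requires a downloaded tokenizer; such tokenizers typically work on subword units and usually expect only light cleaning.

```python
import math
import re
from collections import Counter


def clean_text(text):
    """Lowercase, drop URLs and non-letter characters, collapse whitespace."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)
    text = re.sub(r"[^a-z\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def tokenize(text):
    """Split cleaned text on whitespace (a simple traditional tokenizer)."""
    return clean_text(text).split()


def tf_idf(docs):
    """Textbook TF-IDF: (count / doc length) * log(N / document frequency)."""
    n_docs = len(docs)
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


# Hypothetical raw rows with noisy text and inconsistent labels.
raw_rows = [
    {"text": "The service was HELPFUL!!! https://example.com", "label": "Positive"},
    {"text": "The service was slow.", "label": "negative "},
]

prepared_rows = [
    {"text": clean_text(row["text"]), "label": row["label"].strip().lower()}
    for row in raw_rows
]
token_lists = [tokenize(row["text"]) for row in raw_rows]
weights = tf_idf(token_lists)

print(prepared_rows[0])
print(weights[0])
```

Words shared by every document, such as "the", receive a weight of zero under this formula, while words that distinguish one review from another, such as "helpful", receive a positive weight.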

Task 2: Question Answering Data Preparation

Clean QA text fields, preserve answer-span alignment, and convert question-context-answer records into a SQuAD-style format with id, question, context, and answers fields.
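One possible way to carry out this conversion is sketched below. The flat column names (`question`, `context`, `answer`), the `squad_` id pattern, and the helper names are assumptions for illustration. The key point is that the answer offset is computed after cleaning, so that `answer_start` still points at the answer inside the cleaned context.

```python
import json
import re


def normalize_ws(text):
    """Collapse runs of whitespace and trim the ends."""
    return re.sub(r"\s+", " ", text).strip()


def to_squad_record(row, index):
    """Convert one flat QA row into a SQuAD-style record.

    Returns None when the cleaned answer cannot be found in the cleaned
    context, so misaligned rows can be reviewed instead of silently kept.
    """
    question = normalize_ws(row["question"])
    context = normalize_ws(row["context"])
    answer = normalize_ws(row["answer"])
    start = context.find(answer)  # 0-based character offset, -1 if absent
    if start == -1:
        return None
    return {
        "id": f"squad_{index:06d}",
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
    }


# Hypothetical flat rows; the second has an answer missing from its context.
flat_rows = [
    {
        "question": " What elements did Marie Curie discover? ",
        "context": "Marie Curie  discovered radium and polonium.",
        "answer": "radium and polonium",
    },
    {
        "question": "What did Marie Curie discover?",
        "context": "Marie Curie discovered radium and polonium.",
        "answer": "uranium",
    },
]

records = []
skipped = []
for i, row in enumerate(flat_rows, start=1):
    record = to_squad_record(row, i)
    if record is None:
        skipped.append(i)
    else:
        records.append(record)

jsonl_lines = [json.dumps(record) for record in records]
print(jsonl_lines[0])
print("skipped rows:", skipped)
```

Note that `str.find` returns the first match only. If an answer string occurs more than once in a context, the first occurrence may not be the intended span, so such rows are worth checking by hand.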

Task 3: Multimodal Image + Text Data Preparation

Load image folders and caption files, check file paths, clean captions, create image-caption pairs, and export a model-ready multimodal table.
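The pairing and path-checking steps could be sketched as follows. The caption rows are assumed to have already been loaded into dictionaries with `image` and `caption` keys, which is a hypothetical layout; real caption files vary. The demo creates an empty placeholder file in a temporary folder instead of using a real image, so it checks file existence only and does not verify that the image can be opened.

```python
import tempfile
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed set of accepted types


def clean_caption(caption):
    """Collapse repeated whitespace and trim the caption."""
    return " ".join(caption.split())


def build_pairs(image_dir, caption_rows):
    """Pair each caption with its image file, collecting rows that fail checks."""
    image_dir = Path(image_dir)
    pairs, rejected = [], []
    for row in caption_rows:
        path = image_dir / row["image"]
        caption = clean_caption(row["caption"])
        is_image = path.suffix.lower() in IMAGE_EXTENSIONS
        if is_image and path.is_file() and caption:
            pairs.append({"image_path": str(path), "caption": caption})
        else:
            rejected.append(row["image"])
    return pairs, rejected


with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "Images"
    image_dir.mkdir()
    # Empty placeholder standing in for a real image file.
    (image_dir / "example.jpg").touch()

    caption_rows = [
        {"image": "example.jpg",
         "caption": "  A group of students   studying in a library. "},
        {"image": "missing.jpg",
         "caption": "No file on disk for this caption."},
    ]
    pairs, rejected = build_pairs(image_dir, caption_rows)

print(pairs)
print("rejected:", rejected)
```

Returning the rejected file names alongside the valid pairs makes it easy to report how many captions were dropped and why, instead of losing them silently.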

🚀 After preparing the datasets, review how each output becomes model-ready!

💡 Mini Exercise

Data Prep Preview
Quick Check: What Format Does the Model Need?

Scenario: “Choose the right output format.”
Learners identify whether a task should produce a sentiment CSV, a SQuAD-style QA JSONL file, or an image-caption table.

# Example model-ready outputs from Module 2

# Task 1: Sentiment Analysis
sentiment_row = {
    "text": "the service was helpful",
    "label": "positive"
}

# Task 2: Question Answering
qa_row = {
    "id": "squad_000001",
    "question": "What elements did Marie Curie discover?",
    "context": "Marie Curie discovered radium and polonium.",
    "answers": {"text": ["radium and polonium"], "answer_start": [23]}
}

# Task 3: Multimodal Learning
multimodal_row = {
    "image_path": "Images/example.jpg",
    "caption": "A group of students studying in a library."
}
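As a rough illustration of the format choice in this exercise, the sketch below serializes the same three example rows: flat rows go to CSV, while the nested QA record goes to JSONL, with one JSON object per line. The helper names `to_csv` and `to_jsonl` are assumptions, and the output is built in memory rather than written to disk.

```python
import csv
import io
import json

sentiment_rows = [{"text": "the service was helpful", "label": "positive"}]
qa_rows = [{
    "id": "squad_000001",
    "question": "What elements did Marie Curie discover?",
    "context": "Marie Curie discovered radium and polonium.",
    "answers": {"text": ["radium and polonium"], "answer_start": [23]},
}]
multimodal_rows = [{
    "image_path": "Images/example.jpg",
    "caption": "A group of students studying in a library.",
}]


def to_csv(rows, fieldnames):
    """Serialize flat dictionaries as CSV text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_jsonl(rows):
    """Serialize (possibly nested) dictionaries as one JSON object per line."""
    return "".join(json.dumps(row, ensure_ascii=False) + "\n" for row in rows)


sentiment_csv = to_csv(sentiment_rows, ["text", "label"])
qa_jsonl = to_jsonl(qa_rows)
multimodal_csv = to_csv(multimodal_rows, ["image_path", "caption"])

print(sentiment_csv)
print(qa_jsonl)
print(multimodal_csv)
```

CSV suits the sentiment and image-caption tables because every field is a single value, whereas the `answers` field of a QA record holds nested lists that JSONL preserves without extra encoding.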