AI Project Planning & Data Preparation: preparing clean, model-ready text, QA, and multimodal datasets.
This module introduces practical data preparation workflows for AI projects. Learners will move from raw datasets to structured, model-ready formats such as cleaned CSV tables, SQuAD-style JSONL records, and paired image-text datasets.
Clean text, tokenize inputs, compare traditional NLP preprocessing with transformer-based tokenization, and prepare text-label data for sentiment models.
Normalize question/context/answer fields and convert flat QA tables into SQuAD-style schema for QA model development.
Organize image folders, caption files, and metadata tables into clear image-caption pairs for multimodal learning pipelines.
AI models rely on consistent input formats. Small preprocessing decisions—such as lowercasing, stopword removal, answer-span alignment, or image-caption pairing—can directly affect model quality, reproducibility, and interpretability.
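To make this concrete, here is a minimal sketch of how one cleaning choice can change what a sentiment model sees. The `clean_text` helper and the tiny `STOPWORDS` set are hypothetical, written only for this illustration. The set deliberately includes "not" because some common stopword lists contain negation words.

```python
import re

# Hypothetical, deliberately tiny stopword list (assumption, for illustration only).
# "not" is included because some widely used stopword lists contain negations.
STOPWORDS = {"the", "was", "a", "an", "is", "and", "not"}


def clean_text(text, lowercase=True, remove_stopwords=False):
    """Collapse whitespace, then optionally lowercase and drop stopwords."""
    text = re.sub(r"\s+", " ", text).strip()
    if lowercase:
        text = text.lower()
    if remove_stopwords:
        tokens = [tok for tok in text.split() if tok.lower() not in STOPWORDS]
        text = " ".join(tokens)
    return text


raw = "  The service   was NOT helpful "

light = clean_text(raw)                              # whitespace + lowercasing only
aggressive = clean_text(raw, remove_stopwords=True)  # also drops stopwords

print(light)       # the service was not helpful
print(aggressive)  # service helpful
```

With the aggressive setting the negation disappears, so a negative review reads like a positive one. Recording which options were used for each dataset version is one way to keep results reproducible.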
Complete the three guided coding tasks below. Each task includes a Colab notebook and a walkthrough video placeholder that you can replace with your final links.
Prepare a text + sentiment label dataset by exploring raw data, cleaning text, tokenizing, and comparing TF-IDF vectorization with transformer-based tokenization.
Clean QA text fields, preserve answer-span alignment, and convert question-context-answer records into SQuAD-style format with id, question, context, and answers fields.
Load image folders and caption files, check file paths, clean captions, create image-caption pairs, and export a model-ready multimodal table.
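The image-caption pairing step in the third task can be sketched as follows. This is only an illustration under stated assumptions: the `build_pairs` helper, the `{filename: caption}` mapping, and the file names are hypothetical, and an empty placeholder file stands in for a real image. The notebook's actual folder layout and caption file format may differ.

```python
import csv
import tempfile
from pathlib import Path


def build_pairs(image_dir, captions):
    """Pair each caption with its image file; report entries that cannot be paired."""
    rows, missing = [], []
    for filename, caption in captions.items():
        path = Path(image_dir) / filename
        caption = " ".join(caption.split())  # collapse stray whitespace
        if path.is_file() and caption:
            rows.append({"image_path": path.as_posix(), "caption": caption})
        else:
            missing.append(filename)
    return rows, missing


with tempfile.TemporaryDirectory() as tmp:
    image_dir = Path(tmp) / "Images"
    image_dir.mkdir()
    # Empty placeholder file (assumption): stands in for a real image.
    (image_dir / "example.jpg").write_bytes(b"")

    captions = {
        "example.jpg": "  A group of students   studying in a library. ",
        "absent.jpg": "A caption whose image file does not exist.",
    }
    rows, missing = build_pairs(image_dir, captions)

    # Export the paired rows as a two-column table.
    out_path = Path(tmp) / "pairs.csv"
    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["image_path", "caption"])
        writer.writeheader()
        writer.writerows(rows)
    exported = out_path.read_text(encoding="utf-8")

print(len(rows), "paired;", "missing:", missing)
```

Returning the unpaired file names separately, rather than silently dropping them, makes it easier to spot broken paths before training.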
🚀 After preparing the datasets, review how each output becomes model-ready!
Scenario: “Choose the right output format.”
Learners identify whether a task should produce a sentiment CSV, a SQuAD-style QA JSONL file, or an image-caption table.
# Example model-ready outputs from Module 2

# Task 1: Sentiment Analysis
sentiment_row = {
    "text": "the service was helpful",
    "label": "positive"
}

# Task 2: Question Answering
qa_row = {
    "id": "squad_000001",
    "question": "What elements did Marie Curie discover?",
    "context": "Marie Curie discovered radium and polonium.",
    "answers": {"text": ["radium and polonium"], "answer_start": [23]}
}

# Task 3: Multimodal Learning
multimodal_row = {
    "image_path": "Images/example.jpg",
    "caption": "A group of students studying in a library."
}
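A record like the question-answering example above can be derived from a flat table row. The sketch below assumes hypothetical flat column names (`question`, `context`, `answer`) and a hypothetical `to_squad` helper. It recomputes `answer_start` after whitespace normalization, so the character offset still points at the answer inside the cleaned context. It takes the first occurrence of the answer, which is a simplification when the answer text appears more than once.

```python
import json


def normalize(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())


def to_squad(row, index):
    """Convert one flat QA row into a SQuAD-style record, or None if unalignable."""
    question = normalize(row["question"])
    context = normalize(row["context"])
    answer = normalize(row["answer"])

    # Recompute the offset on the cleaned context (first occurrence only).
    start = context.find(answer)
    if start == -1:
        return None  # answer text is not a span of the context

    return {
        "id": f"squad_{index:06d}",
        "question": question,
        "context": context,
        "answers": {"text": [answer], "answer_start": [start]},
    }


# Hypothetical flat rows with messy whitespace (assumed column names).
flat_rows = [
    {
        "question": "What elements did Marie Curie discover? ",
        "context": "Marie  Curie discovered radium and polonium.",
        "answer": "radium and polonium",
    },
]

records = [to_squad(row, i) for i, row in enumerate(flat_rows, start=1)]
records = [rec for rec in records if rec is not None]

# One JSON object per line is the JSONL layout.
jsonl = "\n".join(json.dumps(rec) for rec in records)
print(jsonl)
```

A quick check is that `context[start:start + len(answer)]` equals the answer text for every record. If cleaning is applied to the context but the offsets are not recomputed, that check fails.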