Module 2: No-Code Data Preparation

Using OpenRefine to discover, clean, structure, validate, and export AI-ready datasets without coding.

Why No-Code Data Preparation?

AI projects begin with data, well before any model is trained. This module introduces OpenRefine as a no-code environment for preparing messy text datasets for later AI workflows.

Module goal: learners practice the core preparation cycle of discovering the dataset, cleaning inconsistent values, reshaping columns, validating changes, and exporting the final data.

For AI tasks such as sentiment analysis and question answering, this preparation step improves consistency, reduces noise, and helps learners understand why task-specific data structure matters.

The OpenRefine Workflow

The module follows the same workflow used in the OpenRefine video series, adapted for AI data preparation.

1. Discover

Explore columns, patterns, missing values, and label distributions before changing the data.

2. Structure

Rename, reorder, split, or add columns so the dataset fits the task.

3. Clean

Normalize text, trim spaces, fix inconsistent categories, and standardize values.

4. Enrich

Add helper columns when useful, such as category labels or model-ready fields.

5. Validate

Use facets and filters again after cleaning to check whether changes were applied correctly.

6. Publish

Export cleaned data as CSV, or save the OpenRefine project as a reusable archive.

Task 1: Sentiment Analysis Data Preparation

A sentiment dataset usually contains a text column and a sentiment label column with values such as positive, negative, or neutral. OpenRefine helps learners inspect the labels and clean the text before model training.

Text cleanup demo

  1. Open the dropdown menu of the text column.
  2. Select Edit cells → Transform.
  3. Enter an expression that converts the text to lowercase and trims leading and trailing whitespace.

Label validation demo

  1. Open the dropdown menu of the sentiment column.
  2. Select Facet → Text facet.
  3. Check the counts and look for inconsistent label values.

GREL expression for step 3 of the text cleanup demo:

value.toLowercase().trim()

This task connects OpenRefine to traditional NLP preparation: making text consistent and making labels reliable.
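
For readers who later want to double-check an exported file outside OpenRefine, the two demos above can be approximated in Python. This is only a rough sketch: the sample rows and the helper names clean_text and label_counts are invented for illustration, and Python's lower() and strip() are assumed to behave closely enough to the GREL expression value.toLowercase().trim() for plain text.

```python
from collections import Counter

# Hypothetical rows standing in for a sentiment CSV (not real course data).
rows = [
    {"text": "  Great PRODUCT!  ", "sentiment": "Positive"},
    {"text": "Not worth the price", "sentiment": "negative "},
    {"text": "It is okay.", "sentiment": "neutral"},
    {"text": "Loved it", "sentiment": "positive"},
]


def clean_text(value: str) -> str:
    """Approximate value.toLowercase().trim(): lowercase, then trim both ends."""
    return value.lower().strip()


def label_counts(records, column="sentiment"):
    """Count each distinct value in a column, similar to what a text facet shows."""
    return Counter(record[column] for record in records)


# Facet-style view of the labels before cleaning.
before = label_counts(rows)

# Apply the same cleanup to the text and the label columns.
cleaned = [
    {"text": clean_text(r["text"]), "sentiment": clean_text(r["sentiment"])}
    for r in rows
]

# Facet-style view again after cleaning, as in the Validate step.
after = label_counts(cleaned)

print("distinct labels before:", len(before))
print("distinct labels after:", len(after))
```

In this sample the four raw label spellings collapse to three after cleanup, with positive counted twice. That is the same kind of change a learner would look for when reopening the text facet.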

Task 2: Question Answering Data Preparation

A question answering dataset usually needs question, context, and answer fields. The goal is not only to clean the text, but also to preserve the relationship between the answer and its context.

Important teaching point: QA data should not be cleaned too aggressively. If cleaning changes the answer phrase so that it no longer appears verbatim inside the context, span-based QA formats, which locate the answer by its position in the context, can break.

Recommended OpenRefine actions:

  • Use facets to find blank questions, contexts, or answers.
  • Use transformations to normalize extra spaces and quotation marks.
  • Add helper columns if you want to prepare a simplified SQuAD-style export.

GREL expression for straightening curly quotation marks and trimming outer spaces:

value.replace("“", "\"").replace("”", "\"").trim()
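
The risk described in the teaching point can be made concrete with a small Python sketch. It applies the same quotation-mark cleanup to both the context and the answer, then checks whether each answer can still be found inside its context. The sample rows, the helper names normalize_quotes and answer_start, and the use of -1 to flag a missing answer are all assumptions made for this illustration. They are not part of OpenRefine or of any fixed SQuAD tooling.

```python
def normalize_quotes(value: str) -> str:
    """Approximate the GREL expression: curly double quotes to straight, then trim."""
    return value.replace("\u201c", '"').replace("\u201d", '"').strip()


def answer_start(context: str, answer: str) -> int:
    """Return the character offset of the answer in the context, or -1 if absent."""
    return context.find(answer)


# Hypothetical QA rows. The second context was lowercased on its own,
# simulating cleanup that was applied to one column but not the other.
rows = [
    {
        "question": "When does the library open?",
        "context": " The library opens at \u201cnine\u201d on weekdays. ",
        "answer": "\u201cnine\u201d",
    },
    {
        "question": "Where is the library?",
        "context": "the library is on main street.",
        "answer": "Main Street",
    },
]

checked = []
for row in rows:
    # Clean context and answer with the same function so they stay aligned.
    context = normalize_quotes(row["context"])
    answer = normalize_quotes(row["answer"])
    checked.append(
        {
            "question": row["question"],
            "context": context,
            "answer": answer,
            "answer_start": answer_start(context, answer),
        }
    )

# Rows whose answer can no longer be located need manual review.
broken = [r["question"] for r in checked if r["answer_start"] == -1]
print(broken)
```

The first row survives because the context and the answer went through the same cleanup. The second is flagged because only its context was changed. An answer_start helper column like this is one possible starting point for a simplified SQuAD-style export.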

Core OpenRefine Skills in This Module

Facet = Find Problems

Use text facets, blank facets, and filters to understand what values exist and where missing or inconsistent records appear.
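
As a rough picture of what a blank facet reports, the Python sketch below counts empty cells per column. The rows and the helper name blank_counts are hypothetical. Treating whitespace-only cells as blank is a choice made in this sketch, not a statement about how OpenRefine itself defines a blank cell.

```python
def blank_counts(records, columns):
    """Count cells per column that are missing, empty, or whitespace-only."""
    return {
        column: sum(1 for r in records if not (r.get(column) or "").strip())
        for column in columns
    }


# Hypothetical QA rows with a few gaps.
rows = [
    {"question": "Q1", "context": "C1", "answer": "A1"},
    {"question": "Q2", "context": "", "answer": None},
    {"question": "  ", "context": "C3", "answer": ""},
]

counts = blank_counts(rows, ["question", "context", "answer"])
print(counts)
```

Running the same count before and after cleaning gives a quick way to confirm that a cleanup step did not introduce new blanks.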

Cluster = Fix Problems

Use clustering to merge similar values, such as capitalization differences, spelling variations, or duplicated category names.
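
To give a sense of how values can be grouped automatically, here is a simplified Python sketch of key-collision clustering: each value is reduced to a normalized key, and values that share a key are proposed as one cluster. It is loosely modeled on the idea behind OpenRefine's fingerprint method but is not a reproduction of it. The function names and the sample values are invented for illustration. This approach catches differences in case, spacing, and punctuation. It does not catch genuine misspellings, which need a distance-based method.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value: str) -> str:
    """Build a simplified normalization key for a cell value."""
    # Trim outer whitespace and ignore letter case.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII form.
    text = unicodedata.normalize("NFKD", text).encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split into tokens, remove duplicates, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group distinct values by key; keep only groups with more than one spelling."""
    groups = defaultdict(set)
    for value in values:
        groups[fingerprint(value)].add(value)
    return {key: sorted(members) for key, members in groups.items() if len(members) > 1}


# Hypothetical label values as they might appear in a text facet.
labels = ["Positive", "positive ", "POSITIVE!", "negative", "Neutral", "neutral"]

clusters = cluster(labels)
for key, members in clusters.items():
    print(key, "<-", members)
```

Here the three spellings of positive and the two spellings of neutral are each proposed as a cluster, while negative has only one spelling and is left alone. In OpenRefine the learner still reviews each proposed cluster and chooses the merged value by hand.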

Transform = Standardize

Use GREL expressions or common transforms to change letter case, trim whitespace, normalize strings, or append useful values.

Undo / Redo = Safe Experimentation

Use the history panel to track changes and reverse mistakes without damaging the original data file.

Video Tutorials


🎬 Video 1: OpenRefine & Data Preparation Overview

Introduces OpenRefine and the full preparation cycle: discovering, structuring, cleaning, enriching, validating, and publishing datasets.

🎬 Video 2: Importing Data, Sorting, Facets, and Filters

Shows how to create a project, check parsing options, use rows mode, sort values, inspect columns with facets, and focus on subsets with text filters.

🎬 Video 3: Basic Cleaning and Undo / Redo

Covers renaming columns, changing letter case, and using OpenRefine's history panel to safely experiment with data-cleaning operations.

🎬 Video 4: Reshaping, Clustering, and Exporting

Demonstrates removing blank records, reordering or splitting columns, converting dates, consolidating similar values with clustering, and exporting cleaned data.

Learning Outcomes

  • Import CSV data into OpenRefine and confirm parsing options before creating a project.
  • Use sorting, facets, and filters to discover patterns and data-quality problems.
  • Apply no-code transformations for lowercase conversion, whitespace cleanup, and quotation normalization.
  • Use clustering and direct edits to standardize inconsistent category values.
  • Understand why sentiment analysis and QA require different preprocessing decisions.
  • Export a cleaned dataset for downstream AI workflows.