# Data Cleanup Workbook

A Claude Skill from the Spreadsheets pack.
Cleans messy data — dedupes, normalizes dates, fixes inconsistent capitalization, fills missing values, flags anomalies.
## What it does
Takes a messy data file (mixed date formats, inconsistent capitalization, duplicates, missing values, typos in categories) and produces a cleaned output plus a clear log of every transformation applied. Designed so the user can sanity-check and roll back specific cleanups, not a black box.
## When to use

- ✓ You exported data from a CRM / form / scraper and it's a mess
- ✓ You're consolidating multiple sources with different conventions
- ✓ You need a clean dataset for a downstream model or upload
## When not to use

- ✗ The data has serious quality issues (PII leaks, schema mismatches) — those need a real ETL pipeline, not a one-pass cleanup
- ✗ You don't actually know what "clean" means for this dataset — define the schema first
## Install

Download the .zip, then unzip it into your Claude skills folder:

```
mkdir -p ~/.claude/skills
unzip ~/Downloads/data-cleanup-workbook.zip -d ~/.claude/skills/
# Restart your Claude Code session.
# The skill is now available — Claude will use it when relevant.
```
## SKILL.md
---
name: data-cleanup-workbook
description: Use when cleaning messy data into a structured workbook — deduping, normalizing dates, fixing capitalization, filling missing, flagging anomalies. Triggers on "clean this data", "normalize this CSV", "dedupe and fix", "data cleanup xlsx".
---
# Data Cleanup Workbook
Build a workbook that cleans data and shows its work. The reason to use a workbook (vs a one-shot Python script) is auditability — the user needs to see WHAT changed and approve before downstream use. Every transformation logged.
## Required inputs
1. **The dirty data** — CSV, Excel, paste
2. **Target schema** — columns expected, their types (date, number, string, enum), required vs optional
3. **Known bad patterns** — user often knows ("dates are mixed", "emails have whitespace", "~5% duplicates")
4. **Dedupe key** — what makes two rows the same? Email? Email + phone? Composite?
5. **Categorical canonical values** — for enum-style columns ("Country", "Status"), the allowed values
6. **What "missing" means** — null? empty string? "N/A"? "—"?
If the user pastes data without telling you the schema, infer one, show the inferred schema, and ask for approval before cleaning anything. Don't silently clean the wrong thing.
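A minimal sketch of that infer-then-confirm step, assuming the data already loads into a pandas DataFrame; the type heuristics and the 10-distinct-values enum cutoff are illustrative choices, not part of the skill:

```python
import pandas as pd

def infer_schema(df: pd.DataFrame) -> pd.DataFrame:
    """Guess a type per column and report it for user approval before cleaning."""
    rows = []
    for col in df.columns:
        values = df[col].dropna().astype(str)
        if not values.empty and pd.to_numeric(values, errors="coerce").notna().all():
            guessed = "number"
        elif not values.empty and pd.to_datetime(
            values, errors="coerce", format="mixed"  # format="mixed" needs pandas >= 2.0
        ).notna().all():
            guessed = "date"
        elif values.nunique() <= 10:  # few distinct values: probably an enum
            guessed = f"enum of {values.nunique()} values"
        else:
            guessed = "string"
        rows.append({"column": col, "guessed_type": guessed,
                     "missing": int(df[col].isna().sum())})
    return pd.DataFrame(rows)

# Print the inferred schema and wait for approval; don't clean yet.
print(infer_schema(pd.read_csv("contacts.csv")).to_string(index=False))
```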
## Approach
**pandas** for the heavy lifting (dedup, normalize, regex), then write to .xlsx via **openpyxl** so the user sees: cleaned data, the change log, and the original side-by-side. The change log is what makes this a workbook and not a script.
For Google Sheets, **gspread** + a separate "Change Log" sheet, with conditional formatting on cells that were modified.
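One way to make every operation feed the change log, as a sketch; the `record` helper and per-column trim function are illustrative names, and the `log` list later becomes the Change Log tab:

```python
import pandas as pd

log = []  # one dict per modified cell; written out later as the Change Log tab

def record(row_idx, column, old, new, reason, confidence="high"):
    """Append one cell-level transformation to the change log."""
    log.append({"row": row_idx, "column": column, "original": old,
                "new": new, "reason": reason, "confidence": confidence})

def trim_column(df: pd.DataFrame, col: str) -> None:
    """Trim and collapse whitespace in one string column, logging each change."""
    for idx, old in df[col].items():
        if isinstance(old, str):
            new = " ".join(old.split())  # trims ends and collapses internal runs
            if new != old:
                record(idx, col, old, new, "trim whitespace")
                df.at[idx, col] = new
```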
## Workbook structure
1. **Cover** — source file, row count in / out, dedup key, run date, summary of transformations applied
2. **Original** — the raw data, untouched. Read-only reference. Always include this — never trust your own cleanup blindly.
3. **Cleaned** — the cleaned dataset. This is the output the user will export.
4. **Change Log** — one row per transformation: row index, column, original value, new value, reason, confidence (high / medium / flag-for-review).
5. **Duplicates** — rows identified as duplicates, grouped by the dedup key. Shows which row was kept and why (most-recent timestamp / most-complete / first-seen).
6. **Anomalies** — rows flagged for human review: unusual values, out-of-range numbers, dates in the future when they shouldn't be, missing required fields.
7. **Schema** — the target schema with type, required, allowed values, regex pattern. The contract.
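A sketch of emitting those tabs with pandas' openpyxl engine; the seven DataFrames are assumed to exist from earlier steps, and the Cover fields shown are a subset:

```python
import pandas as pd

def write_workbook(path, original, cleaned, change_log,
                   duplicates, anomalies, schema, dedup_key="email"):
    """Write the seven tabs; Cover is a small key/value summary built here."""
    cover = pd.DataFrame({
        "item":  ["rows in", "rows out", "dedup key", "run date"],
        "value": [len(original), len(cleaned), dedup_key,
                  str(pd.Timestamp.now().date())],
    })
    with pd.ExcelWriter(path, engine="openpyxl") as xl:
        cover.to_excel(xl, sheet_name="Cover", index=False)
        original.to_excel(xl, sheet_name="Original", index=False)
        cleaned.to_excel(xl, sheet_name="Cleaned", index=False)
        change_log.to_excel(xl, sheet_name="Change Log", index=False)
        duplicates.to_excel(xl, sheet_name="Duplicates", index=False)
        anomalies.to_excel(xl, sheet_name="Anomalies", index=False)
        schema.to_excel(xl, sheet_name="Schema", index=False)
```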
## Cleanup operations (in order)
1. **Trim whitespace** on every string column. Leading, trailing, and collapse runs of internal whitespace. Log per-cell.
2. **Normalize case** based on column type: emails lowercase, names title case, codes uppercase, free text leave alone.
3. **Parse dates**. If formats are mixed, try a list of formats in order (ISO first, then US, then EU, then named-month). If a date can't be parsed unambiguously (e.g., "03/04/2026"), flag it for review — don't guess (see the sketch after this list).
4. **Coerce numbers**. Strip currency symbols, thousand separators. Flag any cell that has text in a numeric column.
5. **Map enums to canonical values**. "USA" / "U.S." / "United States" → "US". Anything not in the canonical list goes to Anomalies.
6. **Fill missing**. Strategy by column:
- Required + missing → Anomalies
- Optional + missing → leave null (don't fill with 0 unless the schema says to)
- Computed (e.g., full_name from first + last) → derive
7. **Dedupe**. Group by dedup key. Strategy: keep most-recent timestamp / most-complete row / first-seen — declare which.
8. **Flag anomalies**: outliers (>3 std dev for numerics, optional), future dates, regex failures (e.g., bad email format).
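A sketch of step 3's flag-don't-guess rule. Rather than taking the first format that matches, this version tries every format and flags any string where two formats disagree; the format list is the ISO → US → EU → named-month order from above:

```python
from datetime import datetime

FORMATS = ["%Y-%m-%d", "%m/%d/%Y", "%d/%m/%Y", "%B %d, %Y"]  # ISO, US, EU, named-month

def parse_date(raw: str):
    """Return (iso_date, None) on success, or (None, reason) to flag for review."""
    parsed = set()
    for fmt in FORMATS:
        try:
            parsed.add(datetime.strptime(raw.strip(), fmt).date())
        except ValueError:
            pass
    if len(parsed) == 1:
        return parsed.pop().isoformat(), None
    if len(parsed) > 1:  # "03/04/2026" is valid as both US and EU
        return None, "ambiguous day/month order"
    return None, "unparseable date"
```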
## Formula principles (if writing formulas rather than values to the Cleaned tab)
- **TRIM(CLEAN(...))** for whitespace.
- **PROPER, UPPER, LOWER** with care — PROPER breaks acronyms ("NASA" → "Nasa").
- **TEXT(date, "yyyy-mm-dd")** to normalize date display once parsed.
- **REGEXMATCH** in Sheets / regex in pandas — for email format, phone format, postal code patterns.
- **XLOOKUP against an enum mapping table** for category normalization.
Generally write VALUES not formulas to the Cleaned tab — formulas tied to the Original tab break when someone exports.
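The pandas equivalent of the XLOOKUP-against-a-mapping-table pattern, as a sketch; the country mapping here is a placeholder, and the real table should come from the user's canonical values:

```python
import pandas as pd

CANONICAL = {"us": "US", "usa": "US", "u.s.": "US", "united states": "US"}

def map_enum(df: pd.DataFrame, col: str, mapping: dict) -> pd.DataFrame:
    """Map values onto canonical ones; return the rows with no match (for Anomalies)."""
    key = df[col].str.strip().str.lower()
    mapped = key.map(mapping)
    unmatched = df[mapped.isna() & df[col].notna()]
    df[col] = mapped.where(mapped.notna(), df[col])  # keep originals where unmapped
    return unmatched
```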
## Common pitfalls
- **Silent date guessing.** "03/04/2026" — March 4th or April 3rd? FLAG, don't guess.
- **Title-casing all-caps acronyms.** "IBM Corp" becomes "Ibm Corp". Skip case normalization for known acronyms or columns marked as such.
- **Lowercasing emails when the local part is technically case-sensitive.** Lowercasing is the right call 99% of the time, but document it.
- **Dedup keeping the wrong row.** "Most recent" can mean updated_at or created_at — they're different. Declare which.
- **Filling missing numerics with 0.** A revenue of 0 is not the same as missing. Leave null.
- **Trimming a JSON or CSV-in-a-cell field destroys structure.** Skip TRIM on columns flagged as structured text.
- **No change log.** Black-box cleanups erode trust — the user can't tell what was wrong with their data.
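A sketch of acronym-safe title casing for the second pitfall above; the acronym set is an assumption the user should supply per dataset:

```python
ACRONYMS = {"IBM", "NASA", "LLC"}  # user-supplied; these values are illustrative

def safe_title_case(value: str) -> str:
    """Title-case each word, but restore canonical casing for known acronyms."""
    return " ".join(
        w.upper() if w.upper() in ACRONYMS else w.capitalize()
        for w in value.split()
    )

print(safe_title_case("ibm corp"))  # -> "IBM Corp", not "Ibm Corp"
```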
## Output
Generate the .xlsx. Print a summary:
```
Rows in: 4,213
Rows out: 3,987
Duplicates removed: 226
Cells modified: 1,847
Anomalies flagged for review: 41
```
Plus the top 3 transformation categories ("trim whitespace: 1,420 cells", "case normalize: 280 cells", "date reformat: 147 cells") and a reminder to review the Anomalies tab before using the cleaned data downstream.
## Example prompts

Once installed, try these prompts in Claude:
- Clean this CSV of contacts. Dates are mixed (some MM/DD, some DD/MM, some "April 3"), names are LASTNAME, FIRSTNAME and reverse, emails have trailing spaces, ~5% duplicates.
- Here's an order export from Shopify with inconsistent country codes (US/USA/United States). Normalize and flag anything I should check.
## Related prompts
Don't want to install a skill? These prompts in /prompts cover similar ground for one-shot use: