I'm about to analyze this dataset. Help me find the gotchas before I draw conclusions from it.
Dataset description:
{{what it is, where it came from, time range, columns}}
The questions I'm planning to answer:
{{your analysis questions}}
Output:
1. **Selection effects** — who or what isn't in this data and might bias results?
2. **Definition shifts** — has the way columns are populated changed over time? (renames, methodology changes, schema migrations)
3. **Survivorship bias** — are we seeing only the "winners," missing what dropped out?
4. **Outliers and skew** — what's the shape of the data? Mean vs median?
5. **Missing data patterns** — are NULLs random or systematic? (Systematic NULLs are the dangerous ones)
6. **Aggregation risks** — when I roll this up, what do I lose?
For each, name the specific risk and how to test for it before assuming the data is clean.
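
As a rough illustration of the kind of test I'm looking for on items 4 and 5, here is a minimal Python sketch using only the standard library. The function names, the column names (`year`, `revenue`), and the toy rows are all hypothetical and made up for illustration; they are not from my dataset.

```python
from collections import defaultdict
from statistics import mean, median


def null_rate_by_group(rows, group_key, value_key):
    """Return the share of rows where value_key is None, per group_key value.

    A NULL rate that differs sharply between groups suggests the missingness
    is systematic rather than random.
    """
    totals = defaultdict(int)
    nulls = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(value_key) is None:
            nulls[group] += 1
    return {group: nulls[group] / totals[group] for group in totals}


def mean_and_median(values):
    """Return (mean, median) of the non-missing values.

    A large gap between the two is a quick hint of skew or outliers.
    """
    present = [v for v in values if v is not None]
    return mean(present), median(present)


# Toy rows, invented for illustration only.
rows = [
    {"year": 2021, "revenue": 100},
    {"year": 2021, "revenue": 120},
    {"year": 2022, "revenue": None},
    {"year": 2022, "revenue": None},
    {"year": 2022, "revenue": 5000},
]

rates = null_rate_by_group(rows, "year", "revenue")
avg, mid = mean_and_median([r["revenue"] for r in rows])
print("NULL rate by year:", rates)
print("mean:", avg, "median:", mid)
```

In this toy data the NULLs cluster in one year and a single large value pulls the mean far above the median, which is the sort of pattern I'd want flagged before trusting any rollup.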