Data Cleaning Best Practices in SPSS Before Running Analysis
Running statistical tests on dirty data is like building a house on sand. No matter how sophisticated your analysis, the results are only as good as the data going in. Data cleaning is one of the most important, and most often overlooked, steps in any research project.
Why Data Cleaning Matters
Messy data leads to:
- Inflated or deflated statistics that misrepresent your findings
- Violated assumptions that make your test results invalid
- Misleading conclusions that cannot be replicated
Spending 30 minutes on data cleaning can save you hours of re-analysis later.
Step 1: Screen for Missing Data
Open your dataset in SPSS and check for missing values:
- Go to Analyze > Descriptive Statistics > Frequencies
- Select all your variables and click OK
- Look at the "Missing" row in each frequency table
How to Handle Missing Data
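- Less than 5% missing: Listwise deletion is generally safe, provided the values appear to be missing at random (listwise deletion is the default in many, but not all, SPSS procedures)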
- 5-20% missing: Consider multiple imputation or expectation-maximization (EM)
- More than 20% missing: The variable may not be usable; consult your supervisor
For multiple imputation in SPSS: Analyze > Multiple Imputation > Impute Missing Data Values
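The screening rule above can also be sketched outside SPSS. The following is a minimal Python illustration, not an SPSS feature: the variable names and values are made up, and missing responses are assumed to be stored as None.

```python
# Hypothetical survey columns; None marks a missing response.
data = {
    "age":    [23, 31, None, 45, 29, 38, 41, 27, 35, 30],
    "income": [None, 52, None, 61, None, 48, 55, None, 60, 58],
    "stress": [3, 4, 2, 5, 3, 4, 2, 3, 4, 5],
}

def percent_missing(values):
    """Return the percentage of entries that are None."""
    return 100 * sum(v is None for v in values) / len(values)

def suggest_handling(pct):
    """Map a missing-data percentage onto the rule-of-thumb bands."""
    if pct == 0:
        return "complete"
    if pct < 5:
        return "listwise deletion is usually acceptable"
    if pct <= 20:
        return "consider multiple imputation or EM"
    return "variable may not be usable"

for name, values in data.items():
    pct = percent_missing(values)
    print(f"{name}: {pct:.1f}% missing -> {suggest_handling(pct)}")
```

The bands are rules of thumb only; the pattern of missingness matters as much as the percentage.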
Step 2: Check for Outliers
Outliers can dramatically affect means, standard deviations, and regression coefficients.
Using Box Plots
- Go to Graphs > Chart Builder
- Select Box Plot and drag your variable in
- Look for circles (outliers, more than 1.5 box lengths from the edge of the box) and stars (extreme values, more than 3 box lengths from the edge)
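The logic behind the circles and stars can be sketched in a few lines of Python. This is an illustration under assumptions: the scores are invented, and the quartiles come from Python's "inclusive" method, which may differ slightly from the hinges SPSS uses, so borderline cases can be classified differently.

```python
import statistics

def boxplot_flags(values):
    """Classify each value as 'ok', 'outlier' or 'extreme' using IQR fences."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the "box length"
    flags = []
    for v in values:
        distance = max(q1 - v, v - q3, 0)  # how far the value lies outside the box
        if distance > 3 * iqr:
            flags.append("extreme")   # drawn as a star
        elif distance > 1.5 * iqr:
            flags.append("outlier")   # drawn as a circle
        else:
            flags.append("ok")
    return flags

scores = [12, 14, 15, 15, 16, 17, 18, 19, 27, 45]  # hypothetical test scores
for score, flag in zip(scores, boxplot_flags(scores)):
    if flag != "ok":
        print(score, flag)
```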
Using Z-Scores
- Go to Analyze > Descriptive Statistics > Descriptives
- Check "Save standardized values as variables"
- Any case with a Z-score beyond ±3.29 (p < .001, two-tailed) is a potential outlier
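As a rough sketch of what the saved Z-score variable contains, the Python below standardizes a made-up variable and lists the cases beyond the cutoff. It assumes the sample standard deviation (n - 1) is used; the data and function names are hypothetical.

```python
import statistics

def z_scores(values):
    """Standardize values using the mean and the sample SD (n - 1)."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]

def flag_outliers(values, cutoff=3.29):
    """Return the positions of cases whose |z| exceeds the cutoff."""
    return [i for i, z in enumerate(z_scores(values)) if abs(z) > cutoff]

# Twenty plausible scores plus one suspicious value (hypothetical data)
scores = [48, 49, 50, 51, 52] * 4 + [95]
print(flag_outliers(scores))  # positions of potential outliers
```

In very small samples no case can reach ±3.29, because the largest possible Z-score is limited by the sample size, so this check is most useful with larger datasets.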
Using Cook's Distance (for Regression)
Cook's Distance measures how much a single observation influences the overall regression model.
- Run your regression: Analyze > Regression > Linear
- Click Save and check "Cook's distance"
- Cases with Cook's D > 1.0 (or > 4/n as a stricter rule, where n is the number of cases) are potentially influential
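To show what the saved Cook's distance values represent, here is a minimal Python sketch for the simplest case, a regression with one predictor. It is not how SPSS computes the values internally; the data are invented, and the leverage formula used applies only to a single-predictor model with an intercept.

```python
def cooks_distance(x, y):
    """Cook's D for each case in a one-predictor least-squares regression."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated coefficients: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - mean_x) ** 2 / sxx
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances

# Hypothetical data: the last case has an unusual x value and sits off the trend
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 20]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1, 18.0, 15.0]
d = cooks_distance(x, y)
influential = [i for i, di in enumerate(d) if di > 1.0]
print(influential)
```

A case gets a large Cook's D when it combines a large residual with high leverage, which is why the last case stands out here.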
Do not automatically delete outliers. Investigate them first. Are they data entry errors? Legitimate extreme cases? The decision to remove should be justified in your methodology.
Step 3: Test for Normality
Most parametric tests assume that your dependent variable (more precisely, the model residuals) is approximately normally distributed.
Shapiro-Wilk Test
- Go to Analyze > Descriptive Statistics > Explore
- Move your variable to "Dependent List"
- Click Plots and check "Normality plots with tests"
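- If Shapiro-Wilk significance > .05, there is no significant evidence of a departure from normality (this does not prove the data are normal, and with large samples the test can flag trivial deviations)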
Visual Inspection
- Histogram: Should look roughly bell-shaped
- Q-Q Plot: Points should fall close to the diagonal line
- Skewness and Kurtosis (numeric checks reported in the Explore output): Values between -2 and +2 are generally acceptable
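For readers who want to see where the skewness and kurtosis figures come from, the sketch below uses the common bias-adjusted sample formulas. These are assumed to match what SPSS reports; the function name and data are hypothetical.

```python
import math

def skewness_kurtosis(values):
    """Bias-adjusted sample skewness and excess kurtosis."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    z3 = sum(((v - mean) / sd) ** 3 for v in values)
    z4 = sum(((v - mean) / sd) ** 4 for v in values)
    skew = n / ((n - 1) * (n - 2)) * z3
    kurt = (
        n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z4
        - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    )
    return skew, kurt

def within_rule_of_thumb(values, limit=2):
    """Check both statistics against the -2 to +2 guideline."""
    skew, kurt = skewness_kurtosis(values)
    return abs(skew) <= limit and abs(kurt) <= limit

print(skewness_kurtosis([1, 2, 3, 4, 5]))  # symmetric data: skewness of zero
```

The kurtosis formula needs at least four cases, and both statistics are unstable in small samples.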
What If Data Is Not Normal?
- Transform the data: Log, square root, or inverse transformations
- Use nonparametric alternatives: Mann-Whitney U, Kruskal-Wallis, Spearman correlation
- Proceed with caution: ANOVA is robust to moderate violations with large samples (n > 30 per group)
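The three transformations listed above can be sketched as follows. This is an illustrative Python helper, not an SPSS function; for simplicity it assumes all scores are strictly positive (in practice a constant is often added first when a variable contains zeros).

```python
import math

def transform(values, kind):
    """Apply a log, square root, or inverse transformation to positive scores."""
    if min(values) <= 0:
        raise ValueError("shift the variable so all values are positive first")
    if kind == "log":
        return [math.log10(v) for v in values]
    if kind == "sqrt":
        return [math.sqrt(v) for v in values]
    if kind == "inverse":
        return [1 / v for v in values]
    raise ValueError(f"unknown transformation: {kind}")

reaction_times = [1, 10, 100]  # hypothetical, strongly right-skewed
print(transform(reaction_times, "log"))
```

Note that the inverse transformation reverses the order of the scores, so the direction of any effect flips when you interpret the results.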
Step 4: Check Homogeneity of Variances
For t-tests and ANOVA, the spread of scores should be similar across groups.
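Run Levene's test. SPSS prints it automatically with the Independent-Samples T Test; for One-Way ANOVA, click Options and check "Homogeneity of variance test":
- If p > .05: No evidence that the variances differ; proceed normally
- If p < .05: Use Welch's correction or Games-Howell post-hoc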
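To make Levene's test less of a black box, the sketch below computes its test statistic: a one-way ANOVA F ratio on the absolute deviations of each score from its group mean. The groups are invented, and the p-value is left out because it needs the F distribution, which Python's standard library does not provide.

```python
def levene_statistic(*groups):
    """Levene's W based on absolute deviations from each group's mean."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    # Absolute deviation of every score from its own group mean
    dev = [[abs(v - sum(g) / len(g)) for v in g] for g in groups]
    group_means = [sum(d) / len(d) for d in dev]
    grand_mean = sum(sum(d) for d in dev) / n_total
    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(dev, group_means))
    within = sum((z - m) ** 2 for d, m in zip(dev, group_means) for z in d)
    return ((n_total - k) / (k - 1)) * between / within

control = [1, 2, 3, 4, 5]         # hypothetical scores, small spread
treatment = [10, 20, 30, 40, 50]  # hypothetical scores, large spread
print(levene_statistic(control, treatment))
```

The statistic is compared against an F distribution with k - 1 and N - k degrees of freedom; a large value means the groups differ in spread.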
Step 5: Check for Multicollinearity (Regression)
If you are running multiple regression, your predictors should not be too highly correlated with each other.
- In the Linear Regression dialog, click Statistics and check "Collinearity diagnostics"
- Check the VIF (Variance Inflation Factor) column in the Coefficients table
- VIF > 10 indicates serious multicollinearity; VIF > 5 is concerning
Solutions: Remove one of the correlated predictors, or combine them into a composite score.
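The VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below covers only the special case of exactly two predictors, where R² is simply their squared correlation; the variable names and values are hypothetical.

```python
import math

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def vif_two_predictors(x1, x2):
    """VIF when the model has exactly two predictors: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1 / (1 - r ** 2)

study_hours = [1, 2, 3, 4, 5, 6]        # hypothetical predictor
study_sessions = [2, 4, 6, 8, 10, 13]   # hypothetical, nearly redundant predictor
print(vif_two_predictors(study_hours, study_sessions))
```

With three or more predictors a pairwise correlation is not enough, which is why the VIF column in the SPSS output is the safer check.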
Step 6: Verify Data Types and Coding
Simple but often forgotten:
- Are categorical variables coded correctly? (e.g., 1 = Male, 2 = Female)
- Are continuous variables measured on the right scale?
- Are value labels assigned in SPSS Variable View?
- Are there any impossible values? (e.g., age = 300, or a 7-point Likert scale showing a value of 9)
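A range check for impossible values can be automated along these lines. This is a minimal Python sketch: the codebook ranges, variable names, and cases are all assumptions made up for the example.

```python
# Hypothetical codebook: the valid (minimum, maximum) for each variable
valid_ranges = {"age": (18, 99), "gender": (1, 2), "satisfaction": (1, 7)}

cases = [
    {"id": 1, "age": 34, "gender": 1, "satisfaction": 5},
    {"id": 2, "age": 300, "gender": 2, "satisfaction": 4},
    {"id": 3, "age": 27, "gender": 2, "satisfaction": 9},
]

def find_impossible_values(cases, valid_ranges):
    """Return (case id, variable, value) for every value outside its range."""
    problems = []
    for case in cases:
        for var, (low, high) in valid_ranges.items():
            value = case[var]
            # Missing values (None) are skipped; they are handled separately
            if value is not None and not low <= value <= high:
                problems.append((case["id"], var, value))
    return problems

for case_id, var, value in find_impossible_values(cases, valid_ranges):
    print(f"Case {case_id}: {var} = {value} is out of range")
```

Flagged values should be checked against the original questionnaires or records before they are corrected or set to missing.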
Data Cleaning Checklist
Before running any analysis, verify:
- No unexplained missing data patterns
- Outliers identified and decision documented
- Normality tested for dependent variables
- Homogeneity of variances checked (for group comparisons)
- Multicollinearity checked (for regression)
- All variables correctly coded and labeled
- Sample size adequate for planned tests
When to Ask for Help
Data cleaning requires judgment calls. If you are unsure whether to remove an outlier, which transformation to use, or how to handle missing data in your specific design, it is worth getting expert advice before proceeding. A wrong decision at this stage affects everything downstream.
We handle data cleaning as part of every analysis project, with full documentation of every decision made.