Data Cleaning Best Practices in SPSS Before Running Analysis

Running statistical tests on dirty data is like building a house on sand. No matter how sophisticated your analysis, the results are only as good as the data going in. Data cleaning is one of the most important, and most often overlooked, steps in any research project.

Why Data Cleaning Matters

Messy data leads to:

Inflated or deflated statistics that misrepresent your findings
Violated assumptions that make your test results invalid
Misleading conclusions that cannot be replicated

Spending 30 minutes on data cleaning can save you hours of re-analysis later.

Step 1: Screen for Missing Data

Open your dataset in SPSS and check for missing values:

Go to Analyze > Descriptive Statistics > Frequencies
Select all your variables and click OK
Look at the "Missing" row in each frequency table

How to Handle Missing Data

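Less than 5% missing: Listwise deletion is generally safe, provided the missingness looks random rather than systematic (listwise deletion is the default in many SPSS procedures, including Linear Regression)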
5-20% missing: Consider multiple imputation or expectation-maximization (EM)
More than 20% missing: The variable may not be usable; consult your supervisor

For multiple imputation in SPSS: Analyze > Multiple Imputation > Impute Missing Data Values
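
If you want to double-check the percentages outside SPSS, the bands above can be sketched in a few lines of Python. This is only an illustration: the variable names and values are made up, None stands in for a missing response, and the "no action needed" branch for 0% missing is an addition for completeness.

```python
def missing_percent(values):
    """Return the percentage of entries that are missing (None)."""
    if not values:
        raise ValueError("values must not be empty")
    n_missing = sum(1 for v in values if v is None)
    return 100.0 * n_missing / len(values)


def suggest_strategy(pct_missing):
    """Map a missing-data percentage onto the rule-of-thumb bands."""
    if pct_missing == 0:
        return "no action needed"
    if pct_missing < 5:
        return "listwise deletion"
    if pct_missing <= 20:
        return "multiple imputation or EM"
    return "variable may not be usable"


# Hypothetical example data: None marks a missing response.
dataset = {
    "age": [23, 31, None, 45, 52, 38, 29, 41, 36, 27],
    "income": [None, None, 42000, None, 51000, 38000, None, 47000, 39000, 44000],
    "score": [7, 6, 5, 7, 4, 6, 5, 7, 6, 5],
}

for name, values in dataset.items():
    pct = missing_percent(values)
    print(f"{name}: {pct:.1f}% missing -> {suggest_strategy(pct)}")
```

The bands are rules of thumb, not hard limits, so treat the suggested strategy as a starting point for your own judgment.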

Step 2: Check for Outliers

Outliers can dramatically affect means, standard deviations, and regression coefficients.

Using Box Plots

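Go to Graphs > Chart Builder
Choose Boxplot from the Gallery and drag your variable in
Look for circles (outliers, more than 1.5 box lengths from the edge of the box) and stars (extreme outliers, more than 3 box lengths from the edge)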
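
The logic behind the circles and stars can be sketched in Python as follows. Treat it as an approximation: SPSS builds its boxes from its own quartile definition, so the quartiles computed here with the standard library may differ slightly from the SPSS output on small samples. The scores are made-up example data.

```python
import statistics


def classify_boxplot_outliers(values):
    """Split values into mild (circle) and extreme (star) outliers.

    A value counts as mild if it lies more than 1.5 box lengths (IQRs)
    beyond the box, and extreme if it lies more than 3 box lengths beyond.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    mild, extreme = [], []
    for v in values:
        # Distance outside the box; zero or negative for values inside it.
        distance = max(q1 - v, v - q3)
        if distance > 3 * iqr:
            extreme.append(v)
        elif distance > 1.5 * iqr:
            mild.append(v)
    return mild, extreme


# Hypothetical scores with two suspicious values at the top end.
scores = [10, 12, 13, 14, 15, 16, 17, 18, 19, 30, 60]
mild, extreme = classify_boxplot_outliers(scores)
print("Mild outliers:", mild)
print("Extreme outliers:", extreme)
```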

Using Z-Scores

Go to Analyze > Descriptive Statistics > Descriptives
Check "Save standardized values as variables"
Any case with a Z-score beyond ±3.29 (p < .001, two-tailed) is a potential outlier
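
For readers who want to see what the saved standardized values represent, here is a minimal Python sketch of the same calculation, assuming the sample standard deviation (n - 1 in the denominator). The data are invented for illustration.

```python
import statistics


def flag_z_outliers(values, cutoff=3.29):
    """Return (index, value, z) for every case whose |z| exceeds the cutoff."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    if sd == 0:
        return []  # all values identical, so nothing can be an outlier
    flagged = []
    for index, value in enumerate(values):
        z = (value - mean) / sd
        if abs(z) > cutoff:
            flagged.append((index, value, z))
    return flagged


# Hypothetical test scores: twenty ordinary values and one extreme case.
scores = [48, 52, 50, 49, 51, 47, 53, 50, 49, 51,
          50, 48, 52, 50, 49, 51, 50, 47, 53, 50, 95]

for index, value, z in flag_z_outliers(scores):
    print(f"Case {index + 1}: value = {value}, z = {z:.2f}")
```

One caveat worth knowing: with a sample standard deviation, the largest possible Z-score in a sample of size n is (n - 1) / sqrt(n). With only 10 cases that ceiling is about 2.85, so no case can ever reach ±3.29 and box plots are the better screening tool.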

Using Cook's Distance (for Regression)

Cook's Distance measures how much a single observation influences the overall regression model.

Run your regression: Analyze > Regression > Linear
Click Save and check "Cook's distance"
Cases with Cook's D > 1.0 (or > 4/n, where n is the sample size, as a stricter rule) may be unduly influential and deserve a closer look
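
To make the statistic less of a black box, here is a rough Python sketch of Cook's Distance for the simplest case, a regression with one predictor. It is not how SPSS is implemented internally, just the textbook formula applied to invented data: each case's squared residual is scaled by the model's mean squared error and weighted by the case's leverage.

```python
def cooks_distances(x, y):
    """Cook's distance for each case in a one-predictor linear regression."""
    n = len(x)
    if n != len(y) or n < 3:
        raise ValueError("x and y must have the same length, with at least 3 cases")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("the predictor has no variance")
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    p = 2  # estimated parameters: intercept and slope
    mse = sum(e ** 2 for e in residuals) / (n - p)
    if mse == 0:
        raise ValueError("perfect fit: Cook's distance is undefined")
    distances = []
    for xi, e in zip(x, residuals):
        leverage = 1 / n + (xi - x_mean) ** 2 / sxx
        if 1 - leverage < 1e-12:
            distances.append(float("nan"))  # undefined at leverage 1
            continue
        distances.append((e ** 2 / (p * mse)) * leverage / (1 - leverage) ** 2)
    return distances


# Hypothetical data: y is roughly 2x, except for the last case.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.9, 16.2, 18.0, 35.0]

d = cooks_distances(x, y)
threshold = 4 / len(x)
influential = [i for i, di in enumerate(d) if di > threshold]
for i in influential:
    print(f"Case {i + 1}: Cook's D = {d[i]:.2f} (4/n = {threshold:.2f})")
```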

Do not automatically delete outliers. Investigate them first. Are they data entry errors? Legitimate extreme cases? The decision to remove should be justified in your methodology.

Step 3: Test for Normality

Many parametric tests assume that the dependent variable (or, in regression, the residuals) is approximately normally distributed.

Shapiro-Wilk Test

Go to Analyze > Descriptive Statistics > Explore
Move your variable to "Dependent List"
Click Plots and check "Normality plots with tests"
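If Shapiro-Wilk significance > .05, there is no significant evidence of a departure from normality, so you can treat the data as approximately normal. Keep in mind that the test rarely rejects normality in small samples and flags even trivial deviations in very large ones, so always pair it with visual inspection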

Visual Inspection

Histogram: Should look roughly bell-shaped
Q-Q Plot: Points should fall close to the diagonal line
Skewness and Kurtosis: Values between -2 and +2 are generally acceptable (SPSS reports excess kurtosis, so a perfectly normal distribution scores 0 on both)
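
If you are curious where the skewness and kurtosis figures come from, the sketch below uses the bias-adjusted sample formulas that most statistics packages report. It assumes kurtosis is expressed as excess kurtosis (0 for a normal distribution), and the data are invented.

```python
import statistics


def _z_scores(values):
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    if sd == 0:
        raise ValueError("all values are identical")
    return [(v - mean) / sd for v in values]


def sample_skewness(values):
    """Bias-adjusted sample skewness."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least 3 values")
    z = _z_scores(values)
    return n / ((n - 1) * (n - 2)) * sum(zi ** 3 for zi in z)


def sample_excess_kurtosis(values):
    """Bias-adjusted sample excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("need at least 4 values")
    z = _z_scores(values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(zi ** 4 for zi in z)
    second = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return first - second


# Hypothetical right-skewed data: most values are small, one is large.
values = [1, 1, 2, 2, 3, 3, 4, 20]
skew = sample_skewness(values)
kurt = sample_excess_kurtosis(values)
print(f"Skewness = {skew:.2f}, excess kurtosis = {kurt:.2f}")
print("Within +/-2:", abs(skew) <= 2 and abs(kurt) <= 2)
```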

What If Data Is Not Normal?

Transform the data: Log, square root, or inverse transformations
Use nonparametric alternatives: Mann-Whitney U, Kruskal-Wallis, Spearman correlation
Proceed with caution: ANOVA is robust to moderate violations with large samples (n > 30 per group)
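
The three transformations mentioned above each come with a restriction on the values they accept, which the following Python sketch makes explicit. The reaction times are hypothetical, and the natural log is an assumption here (a base-10 log works just as well for reducing skew).

```python
import math


def log_transform(values):
    """Natural log; defined only for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("log needs values > 0; add a constant to every score first")
    return [math.log(v) for v in values]


def sqrt_transform(values):
    """Square root; defined only for non-negative values."""
    if any(v < 0 for v in values):
        raise ValueError("square root needs values >= 0")
    return [math.sqrt(v) for v in values]


def inverse_transform(values):
    """Reciprocal; undefined for zero. Note that it reverses the order of scores."""
    if any(v == 0 for v in values):
        raise ValueError("inverse is undefined for 0")
    return [1.0 / v for v in values]


# Hypothetical right-skewed reaction times in milliseconds.
reaction_times = [210, 250, 265, 300, 340, 420, 980]
print([round(v, 2) for v in log_transform(reaction_times)])
print([round(v, 2) for v in sqrt_transform(reaction_times)])
```

Whichever transformation you choose, re-run the normality checks on the transformed variable and report the transformation in your methodology.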

Step 4: Check Homogeneity of Variances

For t-tests and ANOVA, the spread of scores should be similar across groups.

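Run Levene's test. SPSS reports it automatically for the independent-samples t-test; for One-Way ANOVA, click Options and check "Homogeneity of variance test":

If p > .05: No significant evidence that the variances differ, so equal variances can be assumed and you can proceed normally
If p < .05: The variances differ significantly; use Welch's correction (the "Equal variances not assumed" row for a t-test, or Welch's ANOVA) and the Games-Howell post hoc test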
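
Levene's test is less mysterious than it looks: it is essentially a one-way ANOVA run on each score's absolute distance from its group mean. Below is a minimal Python sketch of the mean-centred form of the statistic, using invented groups. It returns only the test statistic and its degrees of freedom; the p-value still has to come from SPSS or an F table.

```python
def levene_statistic(groups):
    """Mean-centred Levene statistic W with its degrees of freedom."""
    k = len(groups)
    if k < 2 or any(len(g) < 2 for g in groups):
        raise ValueError("need at least 2 groups with at least 2 cases each")
    n_total = sum(len(g) for g in groups)

    # Absolute deviation of every score from its own group mean.
    abs_devs = []
    for g in groups:
        group_mean = sum(g) / len(g)
        abs_devs.append([abs(v - group_mean) for v in g])

    dev_means = [sum(d) / len(d) for d in abs_devs]
    grand_mean = sum(sum(d) for d in abs_devs) / n_total

    between = sum(len(d) * (m - grand_mean) ** 2 for d, m in zip(abs_devs, dev_means))
    within = sum((z - m) ** 2 for d, m in zip(abs_devs, dev_means) for z in d)
    if within == 0:
        raise ValueError("no variation in the deviations; statistic is undefined")

    w = (n_total - k) / (k - 1) * between / within
    return w, (k - 1, n_total - k)


# Hypothetical groups: similar spread, then very different spread.
similar = [[1, 2, 3, 4, 5], [11, 12, 13, 14, 15]]
different = [[1, 2, 3, 4, 5], [-10, 0, 10, 20, 30]]

print("Similar spread:   W = %.2f, df = %s" % levene_statistic(similar))
print("Different spread: W = %.2f, df = %s" % levene_statistic(different))
```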

Step 5: Check for Multicollinearity (Regression)

If you are running multiple regression, your predictors should not be too highly correlated with each other.

In the Linear Regression dialog, click Statistics and check "Collinearity diagnostics"
Check the VIF (Variance Inflation Factor) column in the Coefficients table
VIF > 10 indicates serious multicollinearity; VIF > 5 is concerning

Solutions: Remove one of the correlated predictors, or combine them into a composite score.
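
To see why high correlations inflate VIF, recall that the VIF for a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. With exactly two predictors, R² is simply their squared Pearson correlation, which keeps the sketch below short. The data are hypothetical, and models with three or more predictors need the full regression that SPSS runs for you.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((a - x_mean) * (b - y_mean) for a, b in zip(x, y))
    sxx = sum((a - x_mean) ** 2 for a in x)
    syy = sum((b - y_mean) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("a predictor has no variance")
    return sxy / math.sqrt(sxx * syy)


def vif_two_predictors(x1, x2):
    """VIF shared by both predictors in a two-predictor model."""
    r_squared = pearson_r(x1, x2) ** 2
    if r_squared >= 1:
        return float("inf")  # perfectly collinear predictors
    return 1 / (1 - r_squared)


# Hypothetical predictors: two nearly identical measures of the same thing.
hours_studied = [1, 2, 3, 4]
hours_logged = [1.1, 1.9, 3.2, 3.9]

vif = vif_two_predictors(hours_studied, hours_logged)
print(f"VIF = {vif:.1f}")
print("Serious multicollinearity" if vif > 10 else "Acceptable")
```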

Step 6: Verify Data Types and Coding

Simple but often forgotten:

Are categorical variables coded correctly? (e.g., 1 = Male, 2 = Female)
Are continuous variables measured on the right scale?
Are value labels assigned in SPSS Variable View?
Are there any impossible values? (e.g., age = 300, or a 7-point Likert scale showing a value of 9)
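
Range checks like these are easy to automate once you have written down the valid range for each variable. Here is a small Python sketch; the variable names, ranges, and responses are all hypothetical and should be replaced with the ones from your own codebook.

```python
# Hypothetical codebook: lowest and highest valid value for each variable.
VALID_RANGES = {
    "age": (18, 100),
    "satisfaction": (1, 7),  # 7-point Likert item
    "gender": (1, 2),        # 1 = Male, 2 = Female
}


def find_impossible_values(rows, valid_ranges):
    """Return (row number, variable, value) for every out-of-range entry."""
    problems = []
    for row_number, row in enumerate(rows, start=1):
        for name, (low, high) in valid_ranges.items():
            value = row.get(name)
            if value is None:
                continue  # missing data is handled separately
            if not low <= value <= high:
                problems.append((row_number, name, value))
    return problems


# Hypothetical responses containing two data entry errors.
responses = [
    {"age": 34, "satisfaction": 5, "gender": 1},
    {"age": 300, "satisfaction": 6, "gender": 2},
    {"age": 27, "satisfaction": 9, "gender": 2},
    {"age": 45, "satisfaction": None, "gender": 1},
]

for row_number, name, value in find_impossible_values(responses, VALID_RANGES):
    print(f"Row {row_number}: {name} = {value} is outside the valid range")
```

Check every flagged value against the original questionnaire or data source before correcting it or recoding it as missing.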

Data Cleaning Checklist

Before running any analysis, verify:

No unexplained missing data patterns
Outliers identified and decision documented
Normality tested for dependent variables
Homogeneity of variances checked (for group comparisons)
Multicollinearity checked (for regression)
All variables correctly coded and labeled
Sample size adequate for planned tests

When to Ask for Help

Data cleaning requires judgment calls. If you are unsure whether to remove an outlier, which transformation to use, or how to handle missing data in your specific design, it is worth getting expert advice before proceeding. A wrong decision at this stage affects everything downstream.

We handle data cleaning as part of every analysis project, with full documentation of every decision made.