Python for Data Analysis: A Beginner's Guide to pandas and SciPy

Python has become one of the most popular tools for data analysis alongside SPSS and Excel. Its open-source ecosystem — especially pandas, SciPy, and matplotlib — gives you the power to clean, analyze, and visualize data with reproducible scripts. If you are a researcher, student, or analyst looking to go beyond spreadsheets, Python is the logical next step.

Why Python for Data Analysis?

Free and open source — No expensive software licenses
Reproducible — Scripts can be rerun on new data instantly
Scalable — Handles datasets too large for Excel
Versatile — From data cleaning to machine learning in one language
Growing community — Extensive libraries, tutorials, and Stack Overflow support

Setting Up Your Environment

The easiest way to get started is with Anaconda, which bundles the libraries below, or Google Colab, which runs in the browser and requires no installation.

The key libraries you need:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

Loading Data with pandas

pandas is the backbone of data analysis in Python. It organizes data into DataFrames — similar to Excel spreadsheets or SPSS data views.

# Load from CSV
df = pd.read_csv("survey_data.csv")

# Load from Excel (.xlsx files require the openpyxl package)
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Quick look at your data
df.head()        # First 5 rows
df.shape         # (rows, columns)
df.info()        # Data types and missing values
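To try these calls without a file on disk, one option is to read CSV text from an in-memory buffer. The sketch below assumes made-up column names and values (participant, age, group, score) purely for illustration; they are not taken from any real survey.

```python
import io

import pandas as pd

# Hypothetical survey data; in practice this would be the contents of survey_data.csv
csv_text = """participant,age,group,score
1,23,control,71.5
2,31,treatment,80.0
3,,control,65.0
4,27,treatment,
"""

# read_csv accepts any file-like object, not only a file path
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)         # (rows, columns)
print(df.dtypes)        # empty cells are read as NaN, so age is stored as float64
print(df.isna().sum())  # number of missing values per column
```

Empty cells in a numeric column are loaded as NaN, which is why a column of whole-number ages can end up with a floating-point type.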

Data Cleaning Basics

Before any analysis, you need clean data:

# Check for missing values
df.isnull().sum()

# Drop rows with missing values
df_clean = df.dropna()

# Or fill missing values with the mean
df["age"] = df["age"].fillna(df["age"].mean())

# Remove duplicates
df.drop_duplicates(inplace=True)

# Rename columns
df.rename(columns={"old_name": "new_name"}, inplace=True)
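The effect of each cleaning step is easier to see on a tiny table. The following is a minimal sketch on made-up data; the column names and values are assumptions for illustration, and each result is assigned back to df rather than modified in place.

```python
import numpy as np
import pandas as pd

# Made-up data with one duplicated row and one missing age
df = pd.DataFrame({
    "old_name": ["a", "b", "b", "c"],
    "age": [20.0, 30.0, 30.0, np.nan],
})

df = df.drop_duplicates()                         # drops the second "b" row
df["age"] = df["age"].fillna(df["age"].mean())    # mean of 20 and 30 is 25
df = df.rename(columns={"old_name": "new_name"})
df = df.reset_index(drop=True)                    # renumber the rows 0..n-1

print(df)
```

Note that the order matters here: removing duplicates before computing the mean keeps the repeated row from being counted twice.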

Descriptive Statistics

Getting summary statistics in pandas is straightforward:

# Full summary for all numeric columns
df.describe()

# Specific statistics
df["score"].mean()       # Mean
df["score"].median()     # Median
df["score"].std()        # Standard deviation (sample, ddof=1)
df["score"].var()        # Variance (sample, ddof=1)
df["score"].skew()       # Skewness
df["score"].kurtosis()   # Excess kurtosis (0 for a normal distribution)

# Frequency table for categorical variables
df["gender"].value_counts()
df["gender"].value_counts(normalize=True)  # Proportions (multiply by 100 for percentages)
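Counts and percentages are often reported side by side. One way to build such a frequency table is sketched below, using a made-up gender column; the column labels n and percent are arbitrary choices, not a pandas convention.

```python
import pandas as pd

# Made-up categorical responses
gender = pd.Series(["F", "M", "F", "F", "M"], name="gender")

counts = gender.value_counts()
percent = gender.value_counts(normalize=True) * 100  # proportions to percentages

# Combine both into one table, aligned on the category labels
freq_table = pd.DataFrame({"n": counts, "percent": percent.round(1)})
print(freq_table)
```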

Visualization

matplotlib and Seaborn make creating charts simple:

# Histogram
plt.figure(figsize=(8, 5))
sns.histplot(df["score"], kde=True)
plt.title("Distribution of Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()

# Box plot by group
sns.boxplot(x="group", y="score", data=df)
plt.title("Scores by Group")
plt.show()

# Scatter plot
sns.scatterplot(x="study_hours", y="exam_score", data=df)
plt.title("Study Hours vs Exam Score")
plt.show()
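For reports, a chart usually needs to be saved to a file rather than only shown on screen. The sketch below uses plain matplotlib with simulated scores; the Agg backend, the output file name, and dpi=300 are assumptions chosen so the script runs without a display and produces a print-friendly image.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no display is needed
import matplotlib.pyplot as plt
import numpy as np

# Simulated scores stand in for df["score"]
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=200)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(scores, bins=20, edgecolor="black")
ax.set_title("Distribution of Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Frequency")

# Hypothetical output location in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), "score_histogram.png")
fig.savefig(out_path, dpi=300, bbox_inches="tight")
plt.close(fig)  # free the figure's memory once it is saved
```

In a Jupyter or Colab notebook the backend line can be dropped, since figures render inline there.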

Statistical Tests with SciPy

SciPy's stats module (imported above as stats) covers the most common statistical tests:

Independent Samples T-Test

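# Select the score column for each group
group1 = df.loc[df["group"] == "control", "score"]
group2 = df.loc[df["group"] == "treatment", "score"]

# Student's t-test; assumes equal variances by default (equal_var=True)
t_stat, p_value = stats.ttest_ind(group1, group2)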
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
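When the two groups may have unequal variances, Welch's t-test is a common alternative, and an effect size is often reported next to the p-value. The sketch below uses simulated data and passes equal_var=False to ttest_ind; the Cohen's d calculation with a pooled standard deviation is one common formulation, written out by hand here rather than taken from SciPy.

```python
import numpy as np
from scipy import stats

# Simulated scores stand in for the control and treatment groups
rng = np.random.default_rng(42)
group1 = rng.normal(loc=70, scale=8, size=40)
group2 = rng.normal(loc=75, scale=12, size=35)

# equal_var=False requests Welch's t-test
t_stat, p_value = stats.ttest_ind(group1, group2, equal_var=False)

# Cohen's d with the pooled standard deviation (ddof=1 gives the sample variance)
n1, n2 = len(group1), len(group2)
pooled_var = ((n1 - 1) * group1.var(ddof=1) + (n2 - 1) * group2.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (group1.mean() - group2.mean()) / np.sqrt(pooled_var)

print(f"t = {t_stat:.3f}, p = {p_value:.3f}, d = {cohens_d:.3f}")
```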

Paired Samples T-Test

t_stat, p_value = stats.ttest_rel(df["pre_score"], df["post_score"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
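A paired test needs both measurements for every participant, so rows with a missing pre or post score should be dropped together. A minimal sketch on made-up scores, assuming the same pre_score and post_score column names:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up pre/post scores; the third participant has no post score
df = pd.DataFrame({
    "pre_score":  [60.0, 72.0, 55.0, 68.0, 80.0],
    "post_score": [66.0, 75.0, np.nan, 70.0, 87.0],
})

# Keep only participants with both measurements
paired = df[["pre_score", "post_score"]].dropna()

t_stat, p_value = stats.ttest_rel(paired["pre_score"], paired["post_score"])
mean_change = (paired["post_score"] - paired["pre_score"]).mean()

print(f"n = {len(paired)}, mean change = {mean_change:.2f}")
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
```

Because the test is computed on pre minus post, a negative t here corresponds to scores that went up.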

One-Way ANOVA

group_a = df.loc[df["condition"] == "A", "score"]
group_b = df.loc[df["condition"] == "B", "score"]
group_c = df.loc[df["condition"] == "C", "score"]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
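With more than a few conditions, writing one line per group becomes tedious. One possible approach, sketched here on made-up data, is to let groupby split the scores and unpack the resulting arrays into f_oneway.

```python
import pandas as pd
from scipy import stats

# Made-up scores for three conditions
df = pd.DataFrame({
    "condition": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "score": [70, 72, 68, 71, 75, 78, 74, 77, 80, 83, 79, 82],
})

# One array of scores per condition, whatever the labels happen to be
groups = [g["score"].to_numpy() for _, g in df.groupby("condition")]

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
```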

Pearson Correlation

r, p_value = stats.pearsonr(df["hours"], df["score"])
print(f"r = {r:.3f}, p = {p_value:.3f}")
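Missing values in either column will distort or break the correlation, so it helps to restrict the calculation to complete pairs first. A small sketch on made-up data, assuming the same hours and score column names:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up data with one missing value in each column
df = pd.DataFrame({
    "hours": [2.0, 4.0, np.nan, 6.0, 8.0, 10.0],
    "score": [55.0, 60.0, 58.0, 71.0, np.nan, 85.0],
})

# Drop rows where either variable is missing
complete = df[["hours", "score"]].dropna()

r, p_value = stats.pearsonr(complete["hours"], complete["score"])
print(f"n = {len(complete)}, r = {r:.3f}, r² = {r**2:.3f}, p = {p_value:.3f}")
```

Reporting n alongside r makes it clear how many pairs the estimate is based on.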

Chi-Square Test

contingency_table = pd.crosstab(df["gender"], df["preference"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.3f}, df = {dof}")
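The expected array returned by chi2_contingency is useful for checking the common rule of thumb that expected cell counts should be at least 5. The sketch below uses made-up responses; note that for a 2x2 table SciPy applies Yates' continuity correction by default (correction=True).

```python
import pandas as pd
from scipy import stats

# Made-up responses: 30 F and 30 M, each choosing A or B
df = pd.DataFrame({
    "gender": ["F"] * 30 + ["M"] * 30,
    "preference": ["A"] * 20 + ["B"] * 10 + ["A"] * 12 + ["B"] * 18,
})

contingency_table = pd.crosstab(df["gender"], df["preference"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)

# expected is a NumPy array with the same shape as the table
min_expected = expected.min()
print(f"χ² = {chi2:.3f}, p = {p_value:.3f}, df = {dof}")
print(f"Smallest expected count: {min_expected:.1f}")
if min_expected < 5:
    print("Some expected counts are below 5; consider Fisher's exact test for a 2x2 table.")
```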

Regression with statsmodels

For more detailed regression output (similar to SPSS), use statsmodels:

import statsmodels.api as sm

X = df[["study_hours", "attendance"]]
X = sm.add_constant(X)  # Add intercept
y = df["exam_score"]

model = sm.OLS(y, X).fit()
print(model.summary())

This gives you R², coefficients, p-values, confidence intervals, and diagnostics — similar to what you would see in SPSS regression output.

Python vs. SPSS: When to Use Which

Feature	Python	SPSS
Cost	Free	Licensed
Learning curve	Steeper	GUI-friendly
Reproducibility	Excellent (scripts)	Limited (syntax files)
Advanced analysis	Extensive libraries	Built-in tests
APA tables	Requires formatting	Better native support
Big data	Handles large datasets	Memory limited

For academic projects where your supervisor expects SPSS output, stick with SPSS. For business analytics, automation, or large datasets, Python is often the better choice. Many analysts use both.

Next Steps

Once you are comfortable with these basics, explore:

Seaborn for publication-quality statistical plots
statsmodels for detailed regression and time series analysis
scikit-learn for machine learning and predictive modeling
Jupyter Notebooks for combining code, output, and narrative

