Python for Data Analysis: A Beginner's Guide to pandas and SciPy
Python has become one of the most popular tools for data analysis alongside SPSS and Excel. Its open-source ecosystem — especially pandas, SciPy, and matplotlib — gives you the power to clean, analyze, and visualize data with reproducible scripts. If you are a researcher, student, or analyst looking to go beyond spreadsheets, Python is the logical next step.
Why Python for Data Analysis?
- Free and open source — No expensive software licenses
- Reproducible — Scripts can be rerun on new data instantly
- Scalable — Handles datasets too large for Excel
- Versatile — From data cleaning to machine learning in one language
- Growing community — Extensive libraries, tutorials, and Stack Overflow support
Setting Up Your Environment
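The easiest way to get started is with Anaconda, which installs Python together with most of the scientific libraries used here, or with Google Colab, which runs in your browser and requires no installation.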
The key libraries you need:
import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
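If you are not sure whether everything is installed, a quick check like the sketch below can help. It uses only the standard library, the package list is simply taken from the imports in this guide, and `check_packages` is a hypothetical helper name.

```python
import importlib.util

# Top-level package names for the imports used in this guide
# (statsmodels appears later, in the regression section).
PACKAGES = ["pandas", "numpy", "scipy", "matplotlib", "seaborn", "statsmodels"]


def check_packages(names):
    """Return a dict mapping each package name to True if it can be imported."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


status = check_packages(PACKAGES)
for name, installed in status.items():
    print(f"{name:12s} {'ok' if installed else 'MISSING'}")
```

Anything reported as MISSING can usually be installed with conda or pip before you continue.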
Loading Data with pandas
pandas is the backbone of data analysis in Python. It organizes data into DataFrames — similar to Excel spreadsheets or SPSS data views.
# Load from CSV
df = pd.read_csv("survey_data.csv")
# Load from Excel (reading .xlsx files requires the openpyxl package)
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")
# Quick look at your data
df.head() # First 5 rows
df.shape # (rows, columns)
df.info() # Data types and non-null counts per column
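If you do not have a data file at hand yet, you can try the same calls on inline text. The sketch below assumes a small, made-up survey with columns `respondent`, `age`, `group`, and `score`; none of these values come from a real dataset.

```python
import io

import pandas as pd

# Hypothetical survey data, inlined so the example needs no file on disk.
csv_text = """respondent,age,group,score
1,23,control,71
2,31,treatment,
3,27,control,64
4,,treatment,88
"""

# read_csv accepts a file path or any file-like object
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())        # first rows
print(df.shape)         # (rows, columns)
print(df.dtypes)        # columns containing blanks are read as float64
print(df.isna().sum())  # blank cells become NaN (missing)
```

Notice that the blank cells are loaded as missing values, which is why `age` and `score` end up as floating-point columns.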
Data Cleaning Basics
Before any analysis, you need clean data:
# Check for missing values
df.isnull().sum()
# Drop rows with missing values
df_clean = df.dropna()
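# Or fill missing values with the mean (assign the result back to the column;
# calling fillna(..., inplace=True) on a selected column is deprecated in recent pandas)
df["age"] = df["age"].fillna(df["age"].mean())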
# Remove duplicates
df.drop_duplicates(inplace=True)
# Rename columns
df.rename(columns={"old_name": "new_name"}, inplace=True)
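The cleaning steps above can also be chained into one pipeline that leaves the original DataFrame untouched. This is a minimal sketch on made-up data; the column names and values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one exact duplicate row
df = pd.DataFrame({
    "old_name": ["a", "b", "b", "c"],
    "age": [20.0, 30.0, 30.0, np.nan],
})

print(df.isnull().sum())  # missing values per column

# Each step returns a new DataFrame, so df itself is not modified
df_clean = (
    df.drop_duplicates()
      .assign(age=lambda d: d["age"].fillna(d["age"].mean()))
      .rename(columns={"old_name": "new_name"})
      .reset_index(drop=True)
)
print(df_clean)
```

The order matters here: duplicates are removed before the mean is computed, so the repeated row does not pull the fill value toward 30.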
Descriptive Statistics
Getting summary statistics in pandas is straightforward:
# Full summary for all numeric columns
df.describe()
# Specific statistics
df["score"].mean() # Mean
df["score"].median() # Median
df["score"].std() # Standard deviation
df["score"].var() # Variance
df["score"].skew() # Skewness
df["score"].kurtosis() # Excess kurtosis (0 for a normal distribution)
# Frequency table for categorical variables
df["gender"].value_counts()
df["gender"].value_counts(normalize=True) # Proportions (multiply by 100 for percentages)
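Two details are easy to miss when you compare results with other tools: pandas reports the sample standard deviation by default, while NumPy reports the population version, and `normalize=True` gives proportions rather than percentages. The sketch below uses made-up numbers chosen so the results are easy to verify by hand.

```python
import numpy as np
import pandas as pd

# Hypothetical scores and categories, chosen so the numbers are easy to check
df = pd.DataFrame({
    "score": [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0],
    "gender": ["f", "f", "f", "m", "m", "f", "m", "f"],
})

# pandas default: ddof=1 (sample standard deviation)
sample_sd = df["score"].std()
# NumPy default: ddof=0 (population standard deviation)
population_sd = np.std(df["score"].to_numpy())
print(f"sample SD = {sample_sd:.3f}, population SD = {population_sd:.3f}")

# value_counts(normalize=True) returns proportions; scale to get percentages
percentages = (df["gender"].value_counts(normalize=True) * 100).round(1)
print(percentages)
```

If your numbers differ slightly from another package, checking which of the two standard deviations it reports is a good first step.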
Visualization
matplotlib and Seaborn make creating charts simple:
# Histogram
plt.figure(figsize=(8, 5))
sns.histplot(df["score"], kde=True)
plt.title("Distribution of Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()
# Box plot by group
sns.boxplot(x="group", y="score", data=df)
plt.title("Scores by Group")
plt.show()
# Scatter plot
sns.scatterplot(x="study_hours", y="exam_score", data=df)
plt.title("Study Hours vs Exam Score")
plt.show()
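For a report you will often want an image file rather than a pop-up window. The sketch below shows one way to do that with matplotlib alone, using synthetic scores; the file name mentioned in the comment is hypothetical.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend: render without opening a window
import matplotlib.pyplot as plt
import numpy as np

# Synthetic scores for illustration only
rng = np.random.default_rng(seed=0)
scores = rng.normal(loc=70, scale=10, size=200)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(scores, bins=20, edgecolor="black")
ax.set_title("Distribution of Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Frequency")

# Written to an in-memory buffer here; pass a file name such as
# "scores.png" (hypothetical) to save to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150, bbox_inches="tight")
plt.close(fig)  # free the figure once it has been saved
print(f"PNG size: {buffer.getbuffer().nbytes} bytes")
```

The same `savefig` call works for figures drawn with Seaborn, because Seaborn draws onto matplotlib figures.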
Statistical Tests with SciPy
SciPy's stats module covers most of the common statistical tests:
Independent Samples T-Test
group1 = df[df["group"] == "control"]["score"]
group2 = df[df["group"] == "treatment"]["score"]
t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
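By default `ttest_ind` assumes both groups have equal variances. When that is doubtful, Welch's t-test (`equal_var=False`) is a common alternative, and an effect size helps you interpret the result. The sketch below uses synthetic data, and `cohens_d` is a helper written for this example, not a SciPy function.

```python
import numpy as np
from scipy import stats

# Synthetic data for illustration: two groups with unequal spread
rng = np.random.default_rng(seed=42)
control = rng.normal(loc=70, scale=8, size=40)
treatment = rng.normal(loc=75, scale=14, size=35)

# Student's t-test (the default) assumes equal variances
t_student, p_student = stats.ttest_ind(control, treatment)
# Welch's t-test drops that assumption
t_welch, p_welch = stats.ttest_ind(control, treatment, equal_var=False)


def cohens_d(a, b):
    """Cohen's d based on the pooled sample standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = (
        (n_a - 1) * np.var(a, ddof=1) + (n_b - 1) * np.var(b, ddof=1)
    ) / (n_a + n_b - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled_var)


print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
print(f"Cohen's d = {cohens_d(control, treatment):.3f}")
```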
Paired Samples T-Test
t_stat, p_value = stats.ttest_rel(df["pre_score"], df["post_score"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
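It can help to see that a paired t-test is the same as a one-sample t-test on the differences. The sketch below checks this on made-up pre/post scores for six participants.

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post scores for the same six participants
pre = np.array([62.0, 70.0, 55.0, 81.0, 66.0, 74.0])
post = np.array([68.0, 73.0, 59.0, 80.0, 72.0, 79.0])

t_rel, p_rel = stats.ttest_rel(pre, post)

# Equivalent formulation: test whether the mean difference is zero
diff = pre - post
t_one, p_one = stats.ttest_1samp(diff, popmean=0.0)

print(f"paired:     t = {t_rel:.3f}, p = {p_rel:.3f}")
print(f"one-sample: t = {t_one:.3f}, p = {p_one:.3f}")
print(f"mean change (post - pre) = {np.mean(post - pre):.2f}")
```

Because the test is computed on pre minus post, a negative t here means scores went up. Swapping the argument order flips the sign of t but not the p-value.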
One-Way ANOVA
group_a = df[df["condition"] == "A"]["score"]
group_b = df[df["condition"] == "B"]["score"]
group_c = df[df["condition"] == "C"]["score"]
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
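If you have more than a few conditions, selecting each group by hand gets tedious. One option is to let `groupby` collect the groups for you, as in this sketch on made-up data.

```python
import pandas as pd
from scipy import stats

# Hypothetical scores under three conditions
df = pd.DataFrame({
    "condition": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "score": [70, 72, 68, 74, 75, 79, 77, 81, 85, 83, 88, 84],
})

# Collect one array of scores per condition, however many conditions exist
groups = [g["score"].to_numpy() for _, g in df.groupby("condition")]

f_stat, p_value = stats.f_oneway(*groups)
print(f"F = {f_stat:.3f}, p = {p_value:.5f}")

# Group means help you read the result
print(df.groupby("condition")["score"].mean())
```

A significant F only tells you that at least one group mean differs. It does not tell you which one.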
Pearson Correlation
r, p_value = stats.pearsonr(df["hours"], df["score"])
print(f"r = {r:.3f}, p = {p_value:.3f}")
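Real data often has gaps, and `pearsonr` expects complete pairs of values. A simple approach, sketched below on made-up data, is to drop incomplete rows first. The pandas `corr` method is shown only as a cross-check.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data; one participant is missing a score
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "score": [52.0, 55.0, np.nan, 63.0, 70.0, 71.0],
})

# Keep only rows where both variables are present
complete = df[["hours", "score"]].dropna()

r, p_value = stats.pearsonr(complete["hours"], complete["score"])
print(f"r = {r:.3f}, p = {p_value:.3f}, n = {len(complete)}")

# pandas skips incomplete pairs on its own, so this should match r
r_pandas = df["hours"].corr(df["score"])
print(f"pandas r = {r_pandas:.3f}")
```

Reporting n alongside r makes it clear how many pairs the correlation is actually based on.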
Chi-Square Test
contingency_table = pd.crosstab(df["gender"], df["preference"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.3f}, df = {dof}")
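For 2x2 tables, `chi2_contingency` applies Yates' continuity correction unless you pass `correction=False`, so results can differ from a hand calculation. The sketch below compares both on a made-up table of counts and also inspects the expected counts.

```python
import numpy as np
from scipy import stats

# Hypothetical 2x2 table of counts (rows: gender, columns: preference)
observed = np.array([[30, 20],
                     [15, 35]])

# Default: Yates' continuity correction is applied when df = 1
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(observed)
# Uncorrected Pearson chi-square
chi2_plain, p_plain, _, _ = stats.chi2_contingency(observed, correction=False)

print(f"with correction:    chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}")
print(f"without correction: chi2 = {chi2_plain:.3f}, p = {p_plain:.4f}")

# A common rule of thumb is that expected counts should be at least 5
print("expected counts:\n", expected)
print("smallest expected count:", expected.min())
```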
Regression with statsmodels
For more detailed regression output (similar to SPSS), use statsmodels:
import statsmodels.api as sm
X = df[["study_hours", "attendance"]]
X = sm.add_constant(X) # Add intercept
y = df["exam_score"]
model = sm.OLS(y, X).fit()
print(model.summary())
This gives you R², coefficients, p-values, confidence intervals, and diagnostics — similar to what you would see in SPSS regression output.
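To make the role of `sm.add_constant` concrete, the sketch below fits the same kind of model with NumPy's least-squares solver instead of statsmodels. The data is synthetic and noise-free, so the true coefficients are known in advance. This illustrates the underlying calculation only and is not a replacement for the statsmodels output.

```python
import numpy as np

# Synthetic, noise-free data so the true coefficients are known:
# exam_score = 40 + 3 * study_hours + 0.2 * attendance
study_hours = np.array([2.0, 4.0, 6.0, 8.0, 10.0, 3.0])
attendance = np.array([60.0, 90.0, 70.0, 80.0, 95.0, 85.0])
exam_score = 40.0 + 3.0 * study_hours + 0.2 * attendance

# The column of ones plays the role of the intercept term
# that sm.add_constant adds to the predictors
X = np.column_stack([np.ones_like(study_hours), study_hours, attendance])

# Ordinary least squares: minimize the sum of squared residuals
coef, _, rank, _ = np.linalg.lstsq(X, exam_score, rcond=None)
intercept, b_hours, b_attendance = coef

# R squared = 1 - (residual sum of squares / total sum of squares)
predicted = X @ coef
ss_res = np.sum((exam_score - predicted) ** 2)
ss_tot = np.sum((exam_score - exam_score.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"intercept = {intercept:.3f}")
print(f"study_hours = {b_hours:.3f}, attendance = {b_attendance:.3f}")
print(f"R squared = {r_squared:.3f}")
```

Without the column of ones, the fitted line would be forced through the origin, which is rarely what you want.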
Python vs. SPSS: When to Use Which
| Feature | Python | SPSS |
|---|---|---|
| Cost | Free | Licensed |
| Learning curve | Steeper | GUI-friendly |
| Reproducibility | Excellent (scripts) | Limited (syntax files) |
| Advanced analysis | Extensive libraries | Built-in tests |
| APA tables | Requires formatting | Better native support |
| Big data | Handles large datasets (pandas is limited by available RAM) | Memory limited |
For academic projects where your supervisor expects SPSS output, stick with SPSS. For business analytics, automation, or large datasets, Python is often the better choice. Many analysts use both.
Next Steps
Once you are comfortable with these basics, explore:
- Seaborn for publication-quality statistical plots
- statsmodels for detailed regression and time series analysis
- scikit-learn for machine learning and predictive modeling
- Jupyter Notebooks for combining code, output, and narrative