
Python for Data Analysis: A Beginner's Guide to pandas and SciPy

2026-05-25 · 10 min read
Tags: Python data analysis, pandas tutorial, SciPy statistics, Python for researchers, data analysis with Python

Python has become one of the most popular tools for data analysis alongside SPSS and Excel. Its open-source ecosystem — especially pandas, SciPy, and matplotlib — gives you the power to clean, analyze, and visualize data with reproducible scripts. If you are a researcher, student, or analyst looking to go beyond spreadsheets, Python is the logical next step.

Why Python for Data Analysis?

  • Free and open source — No expensive software licenses
  • Reproducible — Scripts can be rerun on new data instantly
  • Scalable — Handles datasets too large for Excel
  • Versatile — From data cleaning to machine learning in one language
  • Growing community — Extensive libraries, tutorials, and Stack Overflow support

Setting Up Your Environment

The easiest way to get started is to install Anaconda, which bundles the libraries below, or to use Google Colab, which runs in the browser and requires no installation.

The key libraries you need:

import pandas as pd
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns

Loading Data with pandas

pandas is the backbone of data analysis in Python. It organizes data into DataFrames — similar to Excel spreadsheets or SPSS data views.

# Load from CSV
df = pd.read_csv("survey_data.csv")

# Load from Excel
df = pd.read_excel("data.xlsx", sheet_name="Sheet1")

# Quick look at your data
df.head()        # First 5 rows
df.shape         # (rows, columns)
df.info()        # Data types and missing values
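
To try these calls without a file on disk, the following minimal sketch reads CSV text from memory. The column names and values are made up for illustration, and `io.StringIO` stands in for a path such as "survey_data.csv".

```python
import io

import pandas as pd

# Hypothetical survey data; in practice this would come from a file on disk
csv_text = """id,age,score
1,23,78.5
2,31,64.0
3,,91.0
"""

# read_csv accepts any file-like object, not only a path
df = pd.read_csv(io.StringIO(csv_text))

print(df.head())   # first rows (all 3 here)
print(df.shape)    # (3, 3)
df.info()          # dtypes plus non-null counts; "age" has 2 non-null values
```

The empty field in the third row is read as NaN, which is why `df.info()` reports fewer non-null values for "age" than for the other columns.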

Data Cleaning Basics

Before any analysis, you need clean data:

# Check for missing values
df.isnull().sum()

# Drop rows with missing values
df_clean = df.dropna()

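# Or fill missing values with the mean
# (assign the result back; calling fillna(..., inplace=True) on a selected
# column is deprecated in recent pandas and may not modify df)
df["age"] = df["age"].fillna(df["age"].mean())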

# Remove duplicates
df.drop_duplicates(inplace=True)

# Rename columns
df.rename(columns={"old_name": "new_name"}, inplace=True)
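
The cleaning steps above can be sketched end to end on a small table. The data here is hypothetical, and the sketch assigns results back to `df` rather than using `inplace=True`, which is a style choice and not something the steps require.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing age and one duplicated row
df = pd.DataFrame({
    "old_name": ["a", "b", "b", "c"],
    "age": [20.0, 30.0, 30.0, np.nan],
})

missing_per_column = df.isnull().sum()          # NaN count for each column

df["age"] = df["age"].fillna(df["age"].mean())  # mean() skips NaN by default
df = df.drop_duplicates()                       # removes the repeated ("b", 30.0) row
df = df.rename(columns={"old_name": "new_name"})

print(missing_per_column)
print(df)
```

The order of steps matters: here the mean is computed while the duplicate row is still present (about 26.67). Dropping duplicates first would give a mean of 25.0 instead.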

Descriptive Statistics

Getting summary statistics in pandas is straightforward:

# Full summary for all numeric columns
df.describe()

# Specific statistics
df["score"].mean()       # Mean
df["score"].median()     # Median
df["score"].std()        # Standard deviation
df["score"].var()        # Variance
df["score"].skew()       # Skewness
df["score"].kurtosis()   # Excess kurtosis (0 for a normal distribution)

# Frequency table for categorical variables
df["gender"].value_counts()
df["gender"].value_counts(normalize=True)  # Proportions (multiply by 100 for percentages)
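
The same statistics are often needed per group rather than for the whole column. A minimal sketch, assuming a hypothetical "group" column alongside "score":

```python
import pandas as pd

# Hypothetical scores for two groups
df = pd.DataFrame({
    "group": ["control"] * 3 + ["treatment"] * 3,
    "score": [70, 75, 80, 82, 88, 94],
})

# One row per group, one column per statistic
summary = df.groupby("group")["score"].agg(["count", "mean", "median", "std"])
print(summary)

# pandas reports the sample standard deviation (ddof=1) by default;
# pass ddof=0 for the population version
print(df["score"].std(), df["score"].std(ddof=0))
```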

Visualization

matplotlib and Seaborn make creating charts simple:

# Histogram
plt.figure(figsize=(8, 5))
sns.histplot(df["score"], kde=True)
plt.title("Distribution of Scores")
plt.xlabel("Score")
plt.ylabel("Frequency")
plt.show()

# Box plot by group
sns.boxplot(x="group", y="score", data=df)
plt.title("Scores by Group")
plt.show()

# Scatter plot
sns.scatterplot(x="study_hours", y="exam_score", data=df)
plt.title("Study Hours vs Exam Score")
plt.show()
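
In a script or on a server there may be no window for `plt.show()` to open, so saving the figure is often more useful. The sketch below uses plain matplotlib rather than Seaborn, with simulated scores, and writes the PNG to memory. The "Agg" backend and the 150 dpi setting are assumptions chosen for illustration.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend; no window is opened
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical scores drawn from a normal distribution
rng = np.random.default_rng(0)
scores = rng.normal(loc=70, scale=10, size=200)

fig, ax = plt.subplots(figsize=(8, 5))
ax.hist(scores, bins=20)
ax.set_title("Distribution of Scores")
ax.set_xlabel("Score")
ax.set_ylabel("Frequency")

# Write the PNG to memory; pass a filename such as "scores.png" to save to disk
buffer = io.BytesIO()
fig.savefig(buffer, format="png", dpi=150)
plt.close(fig)  # release the figure once it has been saved
```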

Statistical Tests with SciPy

SciPy's stats module covers the most common statistical tests:

Independent Samples T-Test

group1 = df[df["group"] == "control"]["score"]
group2 = df[df["group"] == "treatment"]["score"]

t_stat, p_value = stats.ttest_ind(group1, group2)
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
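
By default `stats.ttest_ind` assumes the two groups have equal variances. When that is doubtful, Welch's t-test (`equal_var=False`) is a common alternative. The sketch below runs both on simulated data and adds Cohen's d as an effect size. The group sizes, means, and spreads are invented for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical scores; the treatment group is given a larger spread on purpose
rng = np.random.default_rng(42)
control = rng.normal(loc=70, scale=5, size=25)
treatment = rng.normal(loc=75, scale=12, size=40)

# Student's t-test (the default) assumes equal variances
t_student, p_student = stats.ttest_ind(control, treatment)

# Welch's t-test drops that assumption
t_welch, p_welch = stats.ttest_ind(control, treatment, equal_var=False)

# Cohen's d based on the pooled sample standard deviation
n1, n2 = len(control), len(treatment)
pooled_var = ((n1 - 1) * control.var(ddof=1)
              + (n2 - 1) * treatment.var(ddof=1)) / (n1 + n2 - 2)
cohens_d = (treatment.mean() - control.mean()) / np.sqrt(pooled_var)

print(f"Student: t = {t_student:.3f}, p = {p_student:.3f}")
print(f"Welch:   t = {t_welch:.3f}, p = {p_welch:.3f}")
print(f"Cohen's d = {cohens_d:.3f}")
```

The sign of t depends on argument order: with `control` first, a higher treatment mean gives a negative t.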

Paired Samples T-Test

t_stat, p_value = stats.ttest_rel(df["pre_score"], df["post_score"])
print(f"t = {t_stat:.3f}, p = {p_value:.3f}")
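
A paired t-test is equivalent to a one-sample t-test on the within-person differences, which can help when interpreting the result. A small sketch with made-up pre/post scores for eight participants:

```python
import numpy as np
from scipy import stats

# Hypothetical pre/post scores for the same 8 participants
pre = np.array([62, 70, 55, 80, 68, 74, 59, 66], dtype=float)
post = np.array([68, 73, 60, 82, 75, 74, 65, 70], dtype=float)

t_paired, p_paired = stats.ttest_rel(pre, post)

# The same test expressed on the differences
diff = pre - post
t_diff, p_diff = stats.ttest_1samp(diff, popmean=0.0)

print(f"paired:      t = {t_paired:.3f}, p = {p_paired:.3f}")
print(f"differences: t = {t_diff:.3f}, p = {p_diff:.3f}")
print(f"mean change (post - pre) = {(post - pre).mean():.3f}")
```

Both calls return the same t and p. Because the test is on `pre - post`, an improvement from pre to post shows up as a negative t.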

One-Way ANOVA

group_a = df[df["condition"] == "A"]["score"]
group_b = df[df["condition"] == "B"]["score"]
group_c = df[df["condition"] == "C"]["score"]

f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
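
Writing one line per group gets tedious as the number of conditions grows. One possible approach is to build the groups with `groupby` and unpack them into `stats.f_oneway`. The data below is hypothetical, and the eta squared calculation is an added illustration of effect size, not part of SciPy's output.

```python
import pandas as pd
from scipy import stats

# Hypothetical scores under three conditions
df = pd.DataFrame({
    "condition": ["A"] * 4 + ["B"] * 4 + ["C"] * 4,
    "score": [70, 72, 68, 74, 78, 80, 75, 79, 85, 88, 84, 90],
})

# One array of scores per condition, however many conditions there are
groups = [g["score"].to_numpy() for _, g in df.groupby("condition")]
f_stat, p_value = stats.f_oneway(*groups)

# Eta squared = between-group sum of squares / total sum of squares
grand_mean = df["score"].mean()
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_total = ((df["score"] - grand_mean) ** 2).sum()
eta_squared = ss_between / ss_total

print(f"F = {f_stat:.3f}, p = {p_value:.4f}, eta squared = {eta_squared:.3f}")
```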

Pearson Correlation

r, p_value = stats.pearsonr(df["hours"], df["score"])
print(f"r = {r:.3f}, p = {p_value:.3f}")
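
Missing values are a common stumbling block here, since `stats.pearsonr` expects two complete arrays of equal length. A minimal sketch, using invented data, that keeps only complete pairs before testing and cross-checks the coefficient against pandas:

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical data with one missing value in each column
df = pd.DataFrame({
    "hours": [1.0, 2.0, 3.0, np.nan, 5.0, 6.0, 7.0],
    "score": [52.0, 55.0, 61.0, 64.0, np.nan, 72.0, 78.0],
})

# Keep only rows where both variables are present
pairs = df[["hours", "score"]].dropna()
r, p_value = stats.pearsonr(pairs["hours"], pairs["score"])

# pandas computes the same coefficient and skips incomplete pairs itself
r_pandas = df["hours"].corr(df["score"])

print(f"n = {len(pairs)}, r = {r:.3f}, p = {p_value:.3f}")
print(f"pandas r = {r_pandas:.3f}")
```

Reporting n alongside r makes it clear how many rows survived the filtering.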

Chi-Square Test

contingency_table = pd.crosstab(df["gender"], df["preference"])
chi2, p_value, dof, expected = stats.chi2_contingency(contingency_table)
print(f"χ² = {chi2:.3f}, p = {p_value:.3f}, df = {dof}")
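
Two details of `stats.chi2_contingency` are easy to miss: for a 2 x 2 table (df = 1) it applies Yates' continuity correction by default, and the `expected` array it returns can be used to check the usual rule of thumb that expected counts should be at least 5. The sketch below uses made-up counts for 100 respondents.

```python
import pandas as pd
from scipy import stats

# Hypothetical responses: 50 per gender, two preference categories
df = pd.DataFrame({
    "gender": ["F"] * 50 + ["M"] * 50,
    "preference": ["tea"] * 30 + ["coffee"] * 20 + ["tea"] * 15 + ["coffee"] * 35,
})

table = pd.crosstab(df["gender"], df["preference"])

# Default: Yates' continuity correction is applied when dof == 1
chi2_yates, p_yates, dof, expected = stats.chi2_contingency(table)

# Uncorrected Pearson chi-square for comparison
chi2_raw, p_raw, _, _ = stats.chi2_contingency(table, correction=False)

# Rule-of-thumb check on the expected counts
expected_ok = bool((expected >= 5).all())

print(table)
print(f"with correction:    chi2 = {chi2_yates:.3f}, p = {p_yates:.4f}, df = {dof}")
print(f"without correction: chi2 = {chi2_raw:.3f}, p = {p_raw:.4f}")
print(f"all expected counts >= 5: {expected_ok}")
```

When comparing results with other software, it helps to state which of the two statistics is being reported.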

Regression with statsmodels

For more detailed regression output (similar to SPSS), use statsmodels:

import statsmodels.api as sm

X = df[["study_hours", "attendance"]]
X = sm.add_constant(X)  # Add intercept
y = df["exam_score"]

model = sm.OLS(y, X).fit()
print(model.summary())

This gives you R², coefficients, p-values, confidence intervals, and diagnostics — similar to what you would see in SPSS regression output.
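
To see where the coefficients and R² come from, the same least-squares fit can be sketched with NumPy alone. This uses `np.linalg.lstsq` in place of statsmodels, on simulated data with a known relationship, so it shows only the point estimates and none of the p-values or diagnostics. The variable names mirror the example above, and the true coefficients (20, 3.0, 0.4) are assumptions built into the simulation.

```python
import numpy as np

# Hypothetical data generated from a known linear relationship plus noise
rng = np.random.default_rng(1)
n = 200
study_hours = rng.uniform(0, 10, size=n)
attendance = rng.uniform(50, 100, size=n)
exam_score = 20 + 3.0 * study_hours + 0.4 * attendance + rng.normal(0, 2, size=n)

# Design matrix with a leading column of ones (the role add_constant plays)
X = np.column_stack([np.ones(n), study_hours, attendance])

# Least-squares solution: beta minimizes the sum of squared residuals
beta, *_ = np.linalg.lstsq(X, exam_score, rcond=None)

# R squared = 1 - residual sum of squares / total sum of squares
residuals = exam_score - X @ beta
ss_res = (residuals ** 2).sum()
ss_tot = ((exam_score - exam_score.mean()) ** 2).sum()
r_squared = 1 - ss_res / ss_tot

print("intercept, study_hours, attendance:", np.round(beta, 3))
print(f"R squared = {r_squared:.3f}")
```

The estimated slopes should land close to the simulated values of 3.0 and 0.4, which is a quick sanity check that the design matrix was set up correctly.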

Python vs. SPSS: When to Use Which

Feature           | Python                  | SPSS
------------------|-------------------------|------------------------
Cost              | Free                    | Licensed
Learning curve    | Steeper                 | GUI-friendly
Reproducibility   | Excellent (scripts)     | Limited (syntax files)
Advanced analysis | Extensive libraries     | Built-in tests
APA tables        | Requires formatting     | Better native support
Big data          | Handles large datasets  | Memory limited

For academic projects where your supervisor expects SPSS output, stick with SPSS. For business analytics, automation, or large datasets, Python is often the better choice. Many analysts use both.

Next Steps

Once you are comfortable with these basics, explore:

  • Seaborn for publication-quality statistical plots
  • statsmodels for detailed regression and time series analysis
  • scikit-learn for machine learning and predictive modeling
  • Jupyter Notebooks for combining code, output, and narrative

