Data Analysis with Python: Basics (2025 Guide)

Reading time: ~16 minutes

AI-assisted guide, curated by Norbert Sowinski


[Illustration: pandas tables, charts, and a notebook workflow]

Data analysis with Python is not about writing “clever code”. It is about making datasets understandable, trustworthy, and decision-ready. In practice, that means you load data, validate the schema, clean issues, transform fields, compute metrics, visualize patterns, and document your assumptions.

This guide focuses on the fundamentals you will use repeatedly: pandas for tabular data, NumPy for numeric work, and basic plotting for communicating results. The goal is a repeatable workflow you can apply to real datasets, not just toy examples.

Good analysis signal

If someone else can rerun your notebook from raw data to the same conclusions, your analysis is already above average.

1. What Data Analysis with Python Is (And What It Is Not)

Data analysis is the process of turning raw data into insights you can defend. With Python, that typically includes loading data with pandas, checking the schema and types, cleaning and transforming fields, summarizing with groupby and pivot tables, combining tables with merges, visualizing results, and documenting your assumptions.

What it is not: skipping validation, eyeballing numbers, and trusting the first chart you produce. Most analysis errors come from silent issues like wrong types, duplicated rows, missing values, or joins that multiply data unexpectedly.
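
Most of these silent issues are cheap to detect up front. Here is a minimal sketch, assuming an orders file like the one used later in this guide ("data/orders.csv"); the column names are illustrative:

import pandas as pd

df = pd.read_csv("data/orders.csv")  # hypothetical orders extract used throughout this guide

# Exact duplicate rows silently inflate counts and sums
print("duplicate rows:", df.duplicated().sum())

# Columns imported as 'object' often hide dates or numbers stored as strings
print(df.dtypes)

# Missing values per column, worst first
print(df.isna().sum().sort_values(ascending=False).head(10))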

2. Setup: Python, Jupyter, and the Minimum Toolchain

You can do analysis in scripts, but notebooks are great for exploration. A minimal setup needs only Python, Jupyter, and a handful of libraries:

Minimal install

pip install pandas numpy matplotlib seaborn jupyter
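
After installing, confirm the toolchain works before loading real data: launch a notebook with jupyter notebook (or JupyterLab) and run a quick import check in the first cell. A minimal sketch:

import pandas as pd
import numpy as np
import matplotlib

# If these imports succeed, the core toolchain is in place
print("pandas", pd.__version__)
print("numpy", np.__version__)
print("matplotlib", matplotlib.__version__)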

3. Step 1: Load Data (CSV, Excel, SQL)

Start by loading data into a DataFrame. Treat loading as part of data quality: specify encodings, separators, and types when needed.

import pandas as pd

# CSV
df = pd.read_csv("data/orders.csv")

# Excel
df_xlsx = pd.read_excel("data/orders.xlsx", sheet_name="Orders")

# SQL (example pattern)
# from sqlalchemy import create_engine
# engine = create_engine("postgresql+psycopg://user:pass@host:5432/db")
# df_sql = pd.read_sql("SELECT * FROM orders", engine)

Load responsibly

If a column should be a date, parse it as a date. If an ID should be a string (because leading zeros matter), force it to string early.
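
pandas lets you enforce this at load time. A minimal sketch, assuming the orders file has order_date and customer_id columns (adjust names, encoding, and separator to your data):

df = pd.read_csv(
    "data/orders.csv",
    dtype={"customer_id": "string"},   # keep leading zeros intact
    parse_dates=["order_date"],        # parse dates while loading
    encoding="utf-8",
    sep=",",
)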

4. Step 2: First Look (Schema, Types, Quick Profiling)

Before cleaning or plotting, confirm what you actually imported: columns, types, missingness, and row counts.

df.shape
df.head(5)
df.info()
df.describe(include="all")

# Missing values per column (as %)
(df.isna().mean() * 100).sort_values(ascending=False).head(10)

Common trap

Dates imported as strings and numeric columns imported as objects will produce misleading summaries and charts. Fix types early.
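
A quick way to catch this is to inspect dtypes and coerce explicitly; errors="coerce" turns bad values into NaN/NaT so they show up in your missingness checks instead of hiding inside strings. The column names below are the ones used elsewhere in this guide:

# Columns that should be numeric or datetime but show up as 'object' need fixing
df.dtypes

# Coerce explicitly; invalid entries become NaN / NaT rather than staying as strings
df["unit_price"] = pd.to_numeric(df["unit_price"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")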

5. Step 3: Clean Data (Missing Values, Duplicates, Types)

Cleaning is not about making the data pretty; it is about removing ambiguity and enforcing rules. Typical tasks include removing duplicates, fixing types, and handling missing values:

# Remove exact duplicate rows
df = df.drop_duplicates()

# Convert date strings to datetime
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Example: ensure IDs stay as strings
df["customer_id"] = df["customer_id"].astype("string")

# Example: missing values strategy (simple)
df["discount"] = df["discount"].fillna(0)

Document decisions

Write a one-line comment for each important cleaning rule. Six weeks later, you will not remember why you filled missing values with 0.

6. Step 4: Transform Data (Columns, Strings, Dates)

Transformations make the data analysis-ready: standardize column names, create useful fields, and encode domain logic.

# Standardize column names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]

# Create a numeric metric
df["revenue"] = df["quantity"] * df["unit_price"]

# Extract year/month from a datetime column
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.to_period("M").astype("string")

# Clean strings
df["country"] = df["country"].astype("string").str.strip().str.title()

7. Step 5: Summarize with groupby, pivot_table, and aggregations

Most business questions are aggregations. Use groupby for KPIs and segmentation.

# Revenue by month
rev_by_month = (
  df.groupby("month", as_index=False)["revenue"]
    .sum()
    .sort_values("month")
)

# Multi-metric summary by country
kpis_by_country = (
  df.groupby("country")
    .agg(
      orders=("order_id", "nunique"),
      customers=("customer_id", "nunique"),
      revenue=("revenue", "sum"),
      avg_order_value=("revenue", "mean"),
    )
    .sort_values("revenue", ascending=False)
)

# Pivot table example
pivot = pd.pivot_table(
  df,
  index="month",
  columns="country",
  values="revenue",
  aggfunc="sum",
  fill_value=0,
)

Aggregation sanity check

Always validate totals. If the sum changes unexpectedly after a join or reshape, you likely introduced duplicated keys or many-to-many joins.
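
One way to make this check concrete is to compare the grand total before and after the reshape. The sketch below reuses the pivot from the example above; the two numbers should agree up to rounding:

# Revenue total should survive the pivot; a gap points to duplicated keys
# or rows dropped because their month/country key was missing
total_before = df["revenue"].sum()
total_after = pivot.to_numpy().sum()
print(round(total_before, 2), round(total_after, 2))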

8. Step 6: Combine Datasets (merge, joins, keys)

Real analysis often needs multiple tables (orders, customers, products). Understand your keys: one-to-one, one-to-many, or many-to-many.

# Example: enrich orders with customer attributes
customers = pd.read_csv("data/customers.csv")
orders_enriched = df.merge(customers, on="customer_id", how="left", validate="m:1")

# validate="m:1" means: many orders to one customer
# Use validate to catch accidental many-to-many merges early.

Use validate

The validate parameter is a practical guardrail. It turns silent join mistakes into loud errors while you still have context.

9. Step 7: EDA Workflow (Questions, Distributions, Outliers)

EDA works best when you write down questions first, then verify with summaries and plots. A simple workflow: state a question, compute the summary or plot that answers it, note the answer (or the follow-up question), and move on. Start with a quick outlier scan, then look at the key distributions (sketched below):

# Quick outlier scan (example thresholds)
df.loc[df["revenue"] < 0].head()
df["revenue"].quantile([0.5, 0.9, 0.95, 0.99])

10. Step 8: Visualize Clearly (matplotlib + seaborn basics)

The goal of plotting is clarity, not decoration. Start simple: trend lines, histograms, bar charts for categories.

import matplotlib.pyplot as plt

# Line chart (revenue by month)
plt.figure(figsize=(8, 4))
plt.plot(rev_by_month["month"], rev_by_month["revenue"])
plt.xticks(rotation=45)
plt.title("Revenue by Month")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()

Plot hygiene

Add titles, label axes, and sort categories. If a chart cannot be understood without explanation, it is not finished.
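
As an example of those rules in practice, here is a sorted bar chart built from the kpis_by_country table computed in Step 5; it assumes seaborn is installed (it is part of the minimal install above):

import seaborn as sns

# kpis_by_country is already sorted by revenue, so the ranking reads top to bottom
top_countries = kpis_by_country["revenue"].head(10).reset_index()

plt.figure(figsize=(8, 4))
sns.barplot(data=top_countries, x="revenue", y="country")
plt.title("Revenue by Country (Top 10)")
plt.xlabel("Revenue")
plt.ylabel("Country")
plt.tight_layout()
plt.show()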

11. Step 9: Basic Statistics (correlation, variance, sanity checks)

You do not need advanced statistics to deliver value. Basic checks help avoid false conclusions:

# Correlation on numeric columns
numeric = df.select_dtypes(include=["number"])
corr = numeric.corr(numeric_only=True)

# Robust summary
numeric.describe(percentiles=[0.05, 0.5, 0.95]).T.head(10)

Correlation caution

Correlation can guide questions, not prove answers. Always check for confounders (time, seasonality, segment mix).
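
One cheap check for segment mix is to recompute a correlation inside each segment and see whether it still holds. The columns below are just the ones used in this guide's examples:

# If the overall correlation disappears (or flips sign) within segments,
# the headline number is probably driven by segment mix, not a real relationship
for country, grp in df.groupby("country"):
    print(country, round(grp["quantity"].corr(grp["unit_price"]), 2))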

12. Step 10: Make It Reproducible (notebooks, functions, exports)

Turn exploration into a repeatable pipeline: parameterize inputs, reuse cleaning functions, and export clean artifacts.

from pathlib import Path

OUTPUT = Path("out")
OUTPUT.mkdir(exist_ok=True)

def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
  df = df.drop_duplicates().copy()
  df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
  df["discount"] = df["discount"].fillna(0)
  df["revenue"] = df["quantity"] * df["unit_price"]
  return df

df_clean = clean_orders(df)
df_clean.to_csv(OUTPUT / "orders_clean.csv", index=False)

Deliverable mindset

A strong outcome is a notebook that (1) loads raw inputs, (2) applies documented cleaning, (3) produces summary tables and figures, and (4) exports a clean dataset and a short conclusion section.

13. Common Beginner Mistakes (And How to Avoid Them)

- Trusting the first chart without checking types, duplicates, and missing values first.
- Leaving dates as strings and numbers as objects, which skews summaries and plots.
- Merging on non-unique keys and silently inflating totals; use validate and compare sums before and after.
- Filling missing values without documenting the rule (you will not remember why in six weeks).
- Reading correlation as causation without checking time, seasonality, or segment mix.

14. Data Analysis Checklist

Use this as a quick pre-flight for most Python analyses:

- Types are correct: dates are datetimes, IDs with leading zeros are strings.
- Duplicates and missing values are measured and handled with a documented rule.
- Merge keys are checked (validate=) and totals are compared before and after joins and reshapes.
- Charts have titles, labeled axes, and sorted categories.
- The notebook reruns from raw inputs to the same outputs and exports clean artifacts.

Fast win

Add one cell early: “Assumptions + known limitations”. It prevents most misinterpretations when the notebook is shared.

15. FAQ: Python Data Analysis

What should I learn first: NumPy or pandas?

Start with pandas for DataFrames and common operations (filter, select, groupby, merge). Learn essential NumPy concepts (arrays, vectorized operations) as you need them.

How do I handle missing values correctly?

Measure missingness, then pick a strategy aligned with meaning: drop if safe, impute if justified, or keep missing as its own category when it carries signal. Document the rule.

How do I avoid wrong totals after merges?

Check uniqueness of merge keys, use validate in pandas merge, and compare totals before and after. Many-to-many merges are the most common cause of inflated metrics.

What plots should I start with in EDA?

Use histograms for distributions, bar charts for categorical summaries, and line charts for trends over time. Only move to more complex plots when they answer a specific question.

How do I share results with non-technical stakeholders?

Provide one page of key findings with 2–4 charts, a short methodology note, and a clear definition of metrics. Avoid dumping raw tables without context.

Key data terms (quick glossary)

DataFrame
A tabular data structure in pandas with labeled columns and rows, used for most analysis tasks.
EDA
Exploratory data analysis: profiling and exploration to understand quality, distributions, and relationships before decisions or modeling.
Missingness
The amount and pattern of missing values in a dataset. Important because it can bias results.
Groupby
A pandas operation that splits data into groups by key(s) and applies aggregations (sum, mean, count) to compute KPIs.
Join / Merge
Combining two tables using key columns. Incorrect keys can duplicate data and inflate metrics.
Outlier
A value far from the typical range. It can be an error or a meaningful rare event that should be investigated.
Reproducibility
The ability to rerun the analysis from raw inputs to the same outputs using the same code and dependencies.
