Data analysis with Python is not about writing “clever code”. It is about making datasets understandable, trustworthy, and decision-ready. In practice, that means you load data, validate the schema, clean issues, transform fields, compute metrics, visualize patterns, and document your assumptions.
This guide focuses on the fundamentals you will use repeatedly: pandas for tabular data, NumPy for numeric work, and basic plotting for communicating results. The goal is a repeatable workflow you can apply to real datasets, not just toy examples.
Good analysis signal
If someone else can rerun your notebook from raw data to the same conclusions, your analysis is already above average.
1. What Data Analysis with Python Is (And What It Is Not)
Data analysis is the process of turning raw data into insights you can defend. With Python, that typically includes:
- Loading data from files or databases,
- Validating schema and data quality,
- Cleaning and transforming columns,
- Summarizing and segmenting results,
- Visualizing key patterns,
- Documenting assumptions and limitations.
What it is not: skipping validation, eyeballing numbers, and trusting the first chart you produce. Most analysis errors come from silent issues like wrong types, duplicated rows, missing values, or joins that multiply data unexpectedly.
2. Setup: Python, Jupyter, and the Minimum Toolchain
You can do analysis in scripts, but notebooks are great for exploration. A minimal setup:
- Python 3.10 or newer (3.11+ recommended): a stable, well-supported ecosystem.
- Jupyter: Notebook or JupyterLab.
- pandas + NumPy: core data stack.
- matplotlib (+ seaborn optional): basic visualization.
Minimal install
pip install pandas numpy matplotlib seaborn jupyter
3. Step 1: Load Data (CSV, Excel, SQL)
Start by loading data into a DataFrame. Treat loading as part of data quality: specify encodings, separators, and types when needed.
import pandas as pd
# CSV
df = pd.read_csv("data/orders.csv")
# Excel
df_xlsx = pd.read_excel("data/orders.xlsx", sheet_name="Orders")
# SQL (example pattern)
# from sqlalchemy import create_engine
# engine = create_engine("postgresql+psycopg://user:pass@host:5432/db")
# df_sql = pd.read_sql("SELECT * FROM orders", engine)
Load responsibly
If a column should be a date, parse it as a date. If an ID should be a string (because leading zeros matter), force it to string early.
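A minimal sketch of loading with explicit types along those lines; the column names are assumptions, so adjust them to your schema:
# Hypothetical column names; adjust to your actual schema
df = pd.read_csv(
    "data/orders.csv",
    dtype={"customer_id": "string"},  # keep IDs as strings so leading zeros survive
    parse_dates=["order_date"],       # parse dates at load time
    encoding="utf-8",                 # be explicit instead of relying on defaults
)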
4. Step 2: First Look (Schema, Types, Quick Profiling)
Before cleaning or plotting, confirm what you actually imported: columns, types, missingness, and row counts.
df.shape
df.head(5)
df.info()
df.describe(include="all")
# Missing values per column (as %)
(df.isna().mean() * 100).sort_values(ascending=False).head(10)
Common trap
Dates imported as strings and numeric columns imported as objects will produce misleading summaries and charts. Fix types early.
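A minimal sketch of those early type fixes, assuming a numeric column (here called amount, a hypothetical name) arrived as text:
# "amount" is a hypothetical column that arrived as object/text
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")           # unparseable values become NaN
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")  # same idea for dates
df.dtypes  # confirm the conversions actually took effect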
5. Step 3: Clean Data (Missing Values, Duplicates, Types)
Cleaning is not about making the data “pretty”. It is about removing ambiguity and enforcing rules. Typical tasks:
- Handle missing values: drop, impute, or keep with an explicit meaning.
- Remove duplicates: especially after concatenation.
- Fix types: numbers, dates, categories.
- Normalize values: consistent casing, trimming, standard labels (see the sketch after the code below).
# Remove exact duplicate rows
df = df.drop_duplicates()
# Convert date strings to datetime
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
# Example: ensure IDs stay as strings
df["customer_id"] = df["customer_id"].astype("string")
# Example: missing values strategy (simple)
df["discount"] = df["discount"].fillna(0)
Document decisions
Write a one-line comment for each important cleaning rule. Six weeks later, you will not remember why you filled missing values with 0.
6. Step 4: Transform Data (Columns, Strings, Dates)
Transformations make the data analysis-ready: standardize column names, create useful fields, and encode domain logic.
# Standardize column names
df.columns = [c.strip().lower().replace(" ", "_") for c in df.columns]
# Create a numeric metric
df["revenue"] = df["quantity"] * df["unit_price"]
# Extract year/month from a datetime column
df["year"] = df["order_date"].dt.year
df["month"] = df["order_date"].dt.to_period("M").astype("string")
# Clean strings
df["country"] = df["country"].astype("string").str.strip().str.title()
7. Step 5: Summarize with groupby, pivot_table, and aggregations
Most business questions are aggregations. Use groupby for KPIs and segmentation.
# Revenue by month
rev_by_month = (
    df.groupby("month", as_index=False)["revenue"]
    .sum()
    .sort_values("month")
)
# Multi-metric summary by country
kpis_by_country = (
    df.groupby("country")
    .agg(
        orders=("order_id", "nunique"),
        customers=("customer_id", "nunique"),
        revenue=("revenue", "sum"),
        avg_order_value=("revenue", "mean"),
    )
    .sort_values("revenue", ascending=False)
)
# Pivot table example
pivot = pd.pivot_table(
    df,
    index="month",
    columns="country",
    values="revenue",
    aggfunc="sum",
    fill_value=0,
)
Aggregation sanity check
Always validate totals. If the sum changes unexpectedly after a join or reshape, you likely introduced duplicated keys or many-to-many joins.
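A minimal sketch of that check against the tables built above, assuming month and country have no missing values (groupby drops missing keys by default):
# Total revenue should survive any groupby or pivot unchanged
total = df["revenue"].sum()
assert abs(rev_by_month["revenue"].sum() - total) < 1e-6, "groupby changed the total"
assert abs(pivot.to_numpy().sum() - total) < 1e-6, "pivot changed the total"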
8. Step 6: Combine Datasets (merge, joins, keys)
Real analysis often needs multiple tables (orders, customers, products). Understand your keys: one-to-one, one-to-many, or many-to-many.
# Example: enrich orders with customer attributes
customers = pd.read_csv("data/customers.csv")
orders_enriched = df.merge(customers, on="customer_id", how="left", validate="m:1")
# validate="m:1" means: many orders to one customer
# Use validate to catch accidental many-to-many merges early.
Use validate
The validate parameter is a practical guardrail. It turns silent join mistakes into loud errors while you still have context.
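A minimal sketch of the key checks that go with it, reusing the customers table from above:
# For an m:1 merge, the right-hand key must be unique
assert customers["customer_id"].is_unique, "duplicate customer_id in customers"
# Share of orders that actually find a matching customer
match_rate = df["customer_id"].isin(customers["customer_id"]).mean()
print(f"Matched orders: {match_rate:.1%}")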
9. Step 7: EDA Workflow (Questions, Distributions, Outliers)
EDA works best when you write down questions first, then verify with summaries and plots. A simple workflow:
- Quality: missing values, duplicates, impossible values.
- Univariate: distributions, ranges, skew.
- Bivariate: relationships between key variables.
- Segmentation: by category, cohort, time.
- Outliers: decide whether they are errors or signal.
# Quick outlier scan (example thresholds)
df.loc[df["revenue"] < 0].head()
df["revenue"].quantile([0.5, 0.9, 0.95, 0.99])
10. Step 8: Visualize Clearly (matplotlib + seaborn basics)
The goal of plotting is clarity, not decoration. Start simple: trend lines, histograms, bar charts for categories.
import matplotlib.pyplot as plt
# Line chart (revenue by month)
plt.figure(figsize=(8, 4))
plt.plot(rev_by_month["month"], rev_by_month["revenue"])
plt.xticks(rotation=45)
plt.title("Revenue by Month")
plt.xlabel("Month")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()
Plot hygiene
Add titles, label axes, and sort categories. If a chart cannot be understood without explanation, it is not finished.
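A minimal sketch of a sorted, labeled bar chart that follows those rules, reusing kpis_by_country from Step 5 (top 10 is an arbitrary cutoff):
# Top 10 countries by revenue, already sorted descending in kpis_by_country
top = kpis_by_country.head(10)
plt.figure(figsize=(8, 4))
plt.bar(top.index.astype(str), top["revenue"])
plt.xticks(rotation=45, ha="right")
plt.title("Revenue by Country (Top 10)")
plt.xlabel("Country")
plt.ylabel("Revenue")
plt.tight_layout()
plt.show()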
11. Step 9: Basic Statistics (correlation, variance, sanity checks)
You do not need advanced statistics to deliver value. Basic checks help avoid false conclusions:
- Range checks: min/max for numeric fields.
- Counts: unique IDs, duplicates, missingness.
- Correlation: directional signal, not causation.
- Segment comparisons: medians often beat means (see the sketch after the code below).
# Correlation on numeric columns
numeric = df.select_dtypes(include=["number"])
corr = numeric.corr(numeric_only=True)
# Robust summary
numeric.describe(percentiles=[0.05, 0.5, 0.95]).T.head(10)
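For the segment-comparisons point above, a minimal sketch comparing means and medians per country; large gaps between them usually point at skew or outliers:
# Mean vs median revenue per country
segment_stats = (
    df.groupby("country")["revenue"]
    .agg(["mean", "median", "count"])
    .sort_values("median", ascending=False)
)
segment_stats.head(10)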
Correlation caution
Correlation can guide questions, not prove answers. Always check for confounders (time, seasonality, segment mix).
12. Step 10: Make It Reproducible (notebooks, functions, exports)
Turn exploration into a repeatable pipeline: parameterize inputs, reuse cleaning functions, and export clean artifacts.
from pathlib import Path
OUTPUT = Path("out")
OUTPUT.mkdir(exist_ok=True)
def clean_orders(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().copy()
    df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
    df["discount"] = df["discount"].fillna(0)
    df["revenue"] = df["quantity"] * df["unit_price"]
    return df
df_clean = clean_orders(df)
df_clean.to_csv(OUTPUT / "orders_clean.csv", index=False)
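To make “parameterize inputs” concrete, a minimal sketch of a hypothetical run() helper that ties the steps together; the function name and paths are assumptions:
def run(input_path: str, output_dir: str = "out") -> pd.DataFrame:
    """Hypothetical end-to-end runner: load, clean, export, return the clean frame."""
    out_dir = Path(output_dir)
    out_dir.mkdir(exist_ok=True)
    raw = pd.read_csv(input_path)
    clean = clean_orders(raw)
    clean.to_csv(out_dir / "orders_clean.csv", index=False)
    return clean
# df_clean = run("data/orders.csv")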
Deliverable mindset
A strong outcome is a notebook that (1) loads raw inputs, (2) applies documented cleaning, (3) produces summary tables and figures, and (4) exports a clean dataset and a short conclusion section.
13. Common Beginner Mistakes (And How to Avoid Them)
- Trusting types by default: objects everywhere. Fix: run df.info() and explicitly parse dates/IDs.
- Ignoring duplicates: inflated metrics. Fix: define what “duplicate” means (row vs key).
- Many-to-many merges: totals mysteriously grow. Fix: use validate and check key uniqueness.
- Plotting unsorted categories: unreadable charts. Fix: sort by metric, limit top-N, label axes.
- No reproducibility: results cannot be rerun. Fix: functions for cleaning + exported artifacts.
14. Data Analysis Checklist
Use this as a quick pre-flight for most Python analyses:
- Inputs: data sources listed (files, queries, extracts).
- Schema: columns and types validated (info()).
- Quality: missingness, duplicates, impossible values checked.
- Cleaning rules: documented and implemented in code.
- Joins: keys validated; totals sanity-checked.
- EDA: distributions, relationships, and segments explored.
- Visuals: labeled, readable, and tied to questions.
- Outputs: exported tables/figures + short conclusions section.
- Reproducible run: notebook runs top-to-bottom without manual edits.
Fast win
Add one cell early: “Assumptions + known limitations”. It prevents most misinterpretations when the notebook is shared.
15. FAQ: Python Data Analysis
What should I learn first: NumPy or pandas?
Start with pandas for DataFrames and common operations (filter, select, groupby, merge). Learn essential NumPy concepts (arrays, vectorized operations) as you need them.
How do I handle missing values correctly?
Measure missingness, then pick a strategy aligned with meaning: drop if safe, impute if justified, or keep missing as its own category when it carries signal. Document the rule.
How do I avoid wrong totals after merges?
Check uniqueness of merge keys, use validate in pandas merge, and compare totals before and after. Many-to-many merges are the most common cause of inflated metrics.
What plots should I start with in EDA?
Use histograms for distributions, bar charts for categorical summaries, and line charts for trends over time. Only move to more complex plots when they answer a specific question.
How do I share results with non-technical stakeholders?
Provide one page of key findings with 2–4 charts, a short methodology note, and a clear definition of metrics. Avoid dumping raw tables without context.
Key data terms (quick glossary)
- DataFrame: A tabular data structure in pandas with labeled columns and rows, used for most analysis tasks.
- EDA: Exploratory data analysis, i.e. profiling and exploration to understand quality, distributions, and relationships before decisions or modeling.
- Missingness: The amount and pattern of missing values in a dataset. Important because it can bias results.
- Groupby: A pandas operation that splits data into groups by key(s) and applies aggregations (sum, mean, count) to compute KPIs.
- Join / Merge: Combining two tables using key columns. Incorrect keys can duplicate data and inflate metrics.
- Outlier: A value far from the typical range. It can be an error or a meaningful rare event that should be investigated.
- Reproducibility: The ability to rerun the analysis from raw inputs to the same outputs using the same code and dependencies.