# Introduction
While groupby().sum() and groupby().mean() are fine for quick checks, production-level metrics require a more robust approach. Real-world tables often contain multiple keys, time series data, weights, and conditions such as promotions, returns, or churn.
This means you often need to compute totals and rates, rank items within each group, roll data up into calendar buckets, and then broadcast group statistics back onto the original rows for modeling. This article walks you through advanced grouping techniques in the Pandas library to handle these complex scenarios efficiently.
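The snippets below operate on a single transactional DataFrame called df. Here is a minimal synthetic sketch with the column names the examples assume (store, cat, region, order_id, customer_id, ts, price, qty, is_promo, status, rev); the exact schema and values are illustrative, not taken from a real dataset:
```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    'store': rng.choice(['S1', 'S2', 'S3'], n),
    'cat': rng.choice(['Grocery', 'Electronics', 'Home', 'Clothing', 'Sports'], n),
    'region': rng.choice(['North', 'South', None], n),           # None simulates missing keys
    'order_id': rng.integers(1, 400, n),
    'customer_id': rng.integers(1, 200, n),
    'ts': pd.Timestamp('2024-01-01') + pd.to_timedelta(rng.integers(0, 180, n), unit='D'),
    'price': rng.uniform(1, 100, n).round(2),
    'qty': rng.integers(0, 5, n),
    'is_promo': rng.random(n) < 0.3,
    'status': rng.choice(['ok', 'returned'], n, p=[0.9, 0.1]),
})
df['rev'] = df['price'] * df['qty']
```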
# Choosing the right mode
## Using agg to reduce groups to a row
Use agg when you want one row per group, such as totals, means, medians, min/max values, and custom vectorized reductions.
```python
out = (
    df.groupby(['store', 'cat'], as_index=False, sort=False)
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),
           avg_price=('price', 'mean'))
)
```
It's good for key performance indicator (KPI) tables, weekly rollups, and multimetric summaries.
## Using transform to broadcast statistics back to rows
The transform method returns a result with the same shape as the input. This is ideal for creating per-row features, such as z-scores, within-group shares, or group-wise fills.
```python
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
df['rev_share'] = df['rev'] / g.transform('sum')
```
It is good for modeling features, QA ratios, and group-wise imputations.
## Using apply for custom per-group logic
Use apply only when the required logic cannot be expressed with built-in functions. It is slow and hard to optimize, so try agg or transform first.
```python
def capped_mean(s):
    q1, q3 = s.quantile([.25, .75])
    return s.clip(q1, q3).mean()

df.groupby('store')['rev'].apply(capped_mean)
```
This is good for bespoke rules and small groups.
## Using filter to keep or drop entire groups
The filter method passes or drops entire groups based on a condition. It is useful for data quality rules and thresholds.
```python
big = df.groupby('store').filter(lambda g: g['order_id'].nunique() >= 100)
```
This is good for minimum-size peer groups and for removing sparse categories before aggregation.
# Multi-key grouping and named aggregations
## Grouping by multiple keys
You can control the shape and order of the output so that results can be dropped directly into a business intelligence tool.
```python
g = df.groupby(['store', 'cat'], as_index=False, sort=False, observed=True)
```
- as_index=False returns a flat DataFrame, which is easy to join and export
- sort=False avoids reordering groups, which saves work when the order is irrelevant
- observed=True (with categorical columns) drops unused category pairs
## Using named aggregations
Named aggregation produces readable, SQL-like column names.
```python
out = (
    df.groupby(['store', 'cat'])
      .agg(sales=('rev', 'sum'),
           orders=('order_id', 'nunique'),  # use your id column here
           avg_price=('price', 'mean'))
)
```
## Flattening the columns
If you stack multiple aggregations, you can end up with a MultiIndex. Flatten it once and standardize the column order.
```python
out = out.reset_index()
out.columns = [
    '_'.join(c) if isinstance(c, tuple) else c
    for c in out.columns
]

# optional: ensure business-friendly column order
cols = ['store', 'cat', 'orders', 'sales', 'avg_price']
out = out[cols]
```
# Conditional aggregations without apply
## Using Boolean-mask arithmetic within agg
When the mask depends on other columns, align the data by its index.
```python
# promo sales and promo rate by (store, cat)
cond = df['is_promo']
out = df.groupby(['store', 'cat']).agg(
    promo_sales=('rev', lambda s: s[cond.loc[s.index]].sum()),
    promo_rate=('is_promo', 'mean')  # proportion of promo rows
)
```
## Calculating rates and ratios
A rate is simply sum(mask) / size, which equals the mean of the Boolean column.
```python
df['is_return'] = df['status'].eq('returned')
rates = df.groupby('store').agg(return_rate=('is_return', 'mean'))
```
## Creating cohort-style windows
First, build the mask from date ranges, and then aggregate.
```python
# example: repeat purchase within 30 days of first purchase per customer cohort
first_ts = df.groupby('customer_id')['ts'].transform('min')
df['within_30'] = (df['ts'] <= first_ts + pd.Timedelta('30D')) & (df['ts'] > first_ts)

# customer cohort = month of first purchase
df['cohort'] = first_ts.dt.to_period('M').astype(str)
repeat_30_rate = (
    df.groupby('cohort')
      .agg(repeat_30_rate=('within_30', 'mean'))
      .rename_axis(None)
)
```
# Weighted aggregations per group
## Implementing weighted averages
Vectorize arithmetic and guard against zero-weight divisions.
```python
import numpy as np

tmp = df.assign(wx=df['price'] * df['qty'])
agg = tmp.groupby(['store', 'cat']).agg(wx=('wx', 'sum'), w=('qty', 'sum'))

# weighted average price per (store, cat)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)
```
## Handling NaN values safely
Decide what to return for empty groups or all-NaN values. Two common choices are:
```python
# 1) Return NaN (transparent, safest for downstream stats)
agg['wavg_price'] = np.where(agg['w'] > 0, agg['wx'] / agg['w'], np.nan)

# 2) Fall back to the unweighted mean if all weights are zero (explicit policy)
mean_price = df.groupby(['store', 'cat'])['price'].mean()
agg['wavg_price_safe'] = np.where(
    agg['w'] > 0, agg['wx'] / agg['w'], mean_price.reindex(agg.index).to_numpy()
)
```
# Time-aware grouping
## Using pd.Grouper with a frequency
Respect calendar boundaries for KPIs by grouping time series data into specific intervals.
```python
weekly = df.groupby(['store', pd.Grouper(key='ts', freq='W')], observed=True).agg(
    sales=('rev', 'sum'), orders=('order_id', 'nunique')
)
```
## Applying rolling/expanding windows within each group
Sort your data first and align on the timestamp column.
```python
df = df.sort_values(['customer_id', 'ts'])
df['rev_30d_mean'] = (
    df.groupby('customer_id')
      .rolling('30D', on='ts')['rev'].mean()
      .reset_index(level=0, drop=True)
)
```
## Avoiding data leakage
Keep the history in order and make sure each window only "sees" past data. Do not shuffle time series data, and do not compute group statistics on the full dataset before splitting it into training and test sets.
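As an illustration, here is a minimal sketch of a leakage-safe group feature. The cutoff date and the store_rev_mean feature name are assumptions for this example; the point is that the statistic is fit on training rows only and then mapped onto both splits.
```python
# Assumed train/test boundary; any statistic used as a feature must be
# computed from rows before this cutoff only.
cutoff = pd.Timestamp('2024-04-01')
train = df[df['ts'] < cutoff].copy()
test = df[df['ts'] >= cutoff].copy()

# Group statistic learned from the training window only
store_mean = train.groupby('store', observed=True)['rev'].mean()

# Broadcast onto both splits; stores unseen in training become NaN,
# which should be handled explicitly rather than silently filled.
train['store_rev_mean'] = train['store'].map(store_mean)
test['store_rev_mean'] = test['store'].map(store_mean)
```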
# Ranking and top-n within groups
## Finding the top-k rows per group
There are two practical options for selecting the top rows from each group.
```python
# Sort + head
top3 = (df.sort_values(['cat', 'rev'], ascending=[True, False])
          .groupby('cat')
          .head(3))

# Per-group nlargest on one metric
top3_alt = (df.groupby('cat', group_keys=False)
              .apply(lambda g: g.nlargest(3, 'rev')))
```
## Using ranking helpers
Pandas provides several helper functions for ranking and selection.
rank – controls how ties are handled (e.g., method='dense' or 'first') and can return percentile ranks with pct=True.
```python
df['rev_rank_in_cat'] = df.groupby('cat')['rev'].rank(method='dense', ascending=False)
```
cumcount – gives the 0-based position of each row within its group.
```python
df['pos_in_store'] = df.groupby('store').cumcount()
```
nth – selects the k-th row per group without sorting the entire DataFrame.
```python
second_row = df.groupby('store').nth(1)  # the second row present per store
```
# Broadcasting features with transform
## Computing group-wise z-scores
Standardize a metric within each group so that rows are comparable across groups.
```python
g = df.groupby('store')['rev']
df['rev_z'] = (df['rev'] - g.transform('mean')) / g.transform('std')
```
## Imputing missing values
Fill missing values from a group statistic. This often keeps the distribution closer to reality than using a global fill value.
```python
df['price'] = df['price'].fillna(df.groupby('cat')['price'].transform('median'))
```
## Computing group shares
Convert raw numbers to group proportions for cleaner comparisons.
```python
df['rev_share_in_store'] = df['rev'] / df.groupby('store')['rev'].transform('sum')
```
# Handling categories, empty groups, and missing data
## Improving speed with categorical types
If your keys come from a fixed set (such as stores, regions, or product categories), cast them to an explicit categorical type once. This makes groupby operations faster and more memory efficient.
```python
from pandas.api.types import CategoricalDtype

store_type = CategoricalDtype(categories=sorted(df['store'].dropna().unique()), ordered=False)
df['store'] = df['store'].astype(store_type)

cat_type = CategoricalDtype(categories=['Grocery', 'Electronics', 'Home', 'Clothing', 'Sports'])
df['cat'] = df['cat'].astype(cat_type)
```
## Dropping unused combinations
When grouping on categorical columns, setting observed=True excludes category pairs that do not actually appear in the data, resulting in cleaner output with less noise.
```python
out = df.groupby(['store', 'cat'], observed=True).size().reset_index(name="n")
```
## Grouping with NaN keys
Be deliberate about how you handle missing keys. By default, Pandas drops NaN groups; keep them only if it helps your quality assurance process.
```python
# Default: NaN keys are dropped
by_default = df.groupby('region').size()

# Keep NaN as its own group when you need to audit missing keys
kept = df.groupby('region', dropna=False).size()
```
# Quick cheat sheet
## Calculating the conditional rate in each group
```python
# mean of a boolean is a rate
df.groupby(keys).agg(rate=('flag', 'mean'))

# or explicitly: sum(mask) / size
df.groupby(keys).agg(rate=('flag', lambda s: s.sum() / s.size))
```
## Calculating weighted means
```python
(df.assign(wx=df[x] * df[w])
   .groupby(keys)
   .apply(lambda g: g['wx'].sum() / g[w].sum() if g[w].sum() else np.nan)
   .rename('wavg'))
```
## Finding the top-k per group
```python
(df.sort_values([key, metric], ascending=[True, False])
   .groupby(key)
   .head(k))

# or
df.groupby(key, group_keys=False).apply(lambda g: g.nlargest(k, metric))
```
## Calculating weekly metrics
```python
df.groupby([key, pd.Grouper(key='ts', freq='W')], observed=True).agg(...)
```
## Executing a group-wise fill
```python
df[col] = df[col].fillna(df.groupby(keys)[col].transform('median'))
```
## Calculating a share within a group
```python
df['share'] = df[val] / df.groupby(keys)[val].transform('sum')
```
# Wrap up
First, choose the right mode for your task: use agg to reduce, transform to broadcast, and reserve apply for when vectorization is not an option. Lean on pd.Grouper for time-based buckets and on ranking helpers for top-n selection. By favoring clear, vectorized patterns, you keep your outputs flat, named, and easy to test, while keeping your metrics accurate and your notebooks fast.
Josep Ferrer is an analytics engineer from Barcelona. He graduated in physics engineering and currently works in data science applied to human mobility. He is a part-time content creator focused on data science and technology. Josep writes on all things AI, covering the application of the ongoing explosion in the field.