About Pyjanitor’s method chaining functionality, and why it’s useful.

by SkillAiNest


# Introduction

Working deeply with data in Python teaches us all an important lesson: cleaning data doesn’t usually feel like doing data science, but more like acting as a digital janitor. In most cases the workflow goes like this: you load a dataset, discover that many column names are messed up, run into missing values, and end up with a pile of temporary variables, only the last of which contains your final, clean dataset.

Pyjanitor provides a cleaner way to perform these steps. The library embraces the concept of method chaining to turn otherwise cumbersome data cleaning processes into pipelines that are elegant, efficient, and readable.

This article demystifies method chaining in the context of Pyjanitor and data cleaning.

# Understanding method chaining

Method chaining is nothing new in the programming realm: in fact, it’s a well-established coding pattern. It consists of calling multiple methods on an object in sequential order: all in just one statement. That way, you don’t have to reassign the variable after each step, because each method returns an object that invokes the next associated method, and so on.
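To make the "each method returns an object" idea concrete, here is a minimal sketch (the `TextCleaner` class is hypothetical, written just for illustration) of how a class enables chaining by returning `self` from every method:

```python
class TextCleaner:
    """Toy example: each method mutates the text and returns self,
    which is what allows calls to be strung together."""

    def __init__(self, text):
        self.text = text

    def strip(self):
        self.text = self.text.strip()
        return self  # returning self enables the next call in the chain

    def lower(self):
        self.text = self.text.lower()
        return self

result = TextCleaner("  Hello  ").strip().lower().text
print(result)  # -> hello
```

Built-in Python strings work the same way, except each method returns a new string rather than mutating in place.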

The following example illustrates the basic concept. Here we apply several simple modifications to a small piece of text (a string) using “standard” Python:

text = "  Hello World!  "
text = text.strip()
text = text.lower()
text = text.replace("world", "python")

The resulting value in text will be: "hello python!".

Now, with the method chain, the same process would look like this:

text = "  Hello World!  "
cleaned_text = text.strip().lower().replace("world", "python")

Notice that the logical flow of applied operations goes from left to right: all in a single, unified train of thought!

If you followed that, you already understand method chaining. Let us now translate this idea into the data science context with Pandas. A standard cleanup of a DataFrame involving several steps typically looks like this without chaining:

# Traditional, step-by-step Pandas approach
import pandas as pd

df = pd.read_csv("data.csv")
df.columns = df.columns.str.lower().str.replace(' ', '_')
df = df.dropna(subset=['id'])  # subset expects a list of column names
df = df.drop_duplicates()

As we’ll see shortly, method chaining lets us construct a unified pipeline in which the DataFrame operations are wrapped in parentheses. On top of that, we no longer need intermediate variables holding non-final DataFrames, which makes for cleaner, less error-prone code. And (again) on top of that, Pyjanitor makes this process seamless.
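Even in plain Pandas, the step-by-step snippet above can already be rewritten as a single parenthesized chain. A small self-contained sketch (the inline CSV is a made-up stand-in for `data.csv`, with assumed columns `ID` and `Full Name`):

```python
import io
import pandas as pd

# Hypothetical inline CSV standing in for "data.csv"
raw = io.StringIO("ID,Full Name\n1,Alice\n1,Alice\n,Bob\n")

cleaned = (
    pd.read_csv(raw)
    .rename(columns=lambda c: c.lower().replace(' ', '_'))  # normalize names in-chain
    .dropna(subset=['id'])                                  # drop rows with missing id
    .drop_duplicates()                                      # remove exact duplicates
)
print(cleaned)
```

Note how `rename(columns=...)` replaces the separate `df.columns = ...` assignment, keeping every step inside the chain.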

# Entering Pyjanitor: Example Application

Pandas itself provides some degree of native support for method chaining. However, some of its essential features are not strictly designed with this style in mind. This is one of the main motivations behind Pyjanitor, which is based on an R package of the same name: janitor.

In short, Pyjanitor can be described as an extension to Pandas that brings a suite of custom data cleaning operations in a method chain-friendly manner. Its application programming interface (API) uses a set of intuitive method names, such as clean_names(), rename_column(), and remove_empty(), that take code expressiveness to a whole new level. In addition, Pyjanitor is entirely open source and free, and runs seamlessly in cloud and notebook environments such as Google Colab.
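To get a feel for what clean_names() does, here is a rough plain-Pandas approximation (a sketch, not Pyjanitor's actual implementation: it lowercases, turns spaces into underscores, and collapses repeated underscores; the real method offers further options such as stripping underscores):

```python
import re
import pandas as pd

df = pd.DataFrame({'First Name ': [1], '  Last_Name': [2]})

# Approximate clean_names(): lowercase, spaces -> underscores,
# collapse runs of underscores into one.
tidy = df.rename(
    columns=lambda c: re.sub(r'_+', '_', c.lower().replace(' ', '_'))
)
print(tidy.columns.tolist())  # -> ['first_name_', '_last_name']
```

The leading and trailing underscores come from the stray whitespace in the original names, which is exactly the kind of mess clean_names() is built to handle.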

Let’s see how method chaining works in Pyjanitor through an example. We first create a small, synthetic dataset that looks intentionally messy and feed it to a pandas DataFrame object.

Important: to avoid common, but somewhat scary, errors caused by incompatibilities between library versions, make sure you have the latest available version of both Pandas and Pyjanitor by running !pip install --upgrade pyjanitor pandas first.

import numpy as np
import pandas as pd
import janitor  # registers Pyjanitor's methods on DataFrames

messy_data = {
    'First Name ': ['Alice', 'Bob', 'Charlie', 'Alice', None],
    '  Last_Name': ['Smith', 'Jones', 'Brown', 'Smith', 'Doe'],
    'Age': [25, np.nan, 30, 25, 40],
    'Date_Of_Birth': ['1998-01-01', '1995-05-05', '1993-08-08', '1998-01-01', '1983-12-12'],
    'Salary ($)': [50000, 60000, 70000, 50000, 80000],
    'Empty_Col': [np.nan, np.nan, np.nan, np.nan, np.nan]
}

df = pd.DataFrame(messy_data)
print("--- Messy Original Data ---")
print(df.head(), "\n")

We now define a Pyjanitor method chain that applies a series of processing steps to both the column names and the data:

cleaned_df = (
    df
    .rename_column('Salary ($)', 'Salary')  # 1. Manually fix tricky names BEFORE they get mangled
    .clean_names()                          # 2. Standardize everything (makes it 'salary')
    .remove_empty()                         # 3. Drop empty columns/rows
    .drop_duplicates()                      # 4. Remove duplicate rows
    .fill_empty(                            # 5. Impute missing values
        column_names='age',                 # CAUTION: after the previous steps, the name is lowercase: 'age'
        value=df['Age'].median()            # Pull the median from the original raw df
    )
    .assign(                                # 6. Create a new column using assign
        salary_k=lambda d: d['salary'] / 1000
    )
)

print("--- Cleaned Pyjanitor Data ---")
print(cleaned_df)

The code above is largely self-explanatory, with inline comments describing each method called at each step in the chain.
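When Pyjanitor has no ready-made method for a step, Pandas' built-in .pipe() lets you slot a custom function into the same chain. A small sketch, where flag_seniors is a hypothetical helper written for this example:

```python
import pandas as pd

def flag_seniors(d, cutoff=35):
    """Hypothetical custom step: add a boolean column marking rows
    whose age meets the cutoff. Not part of Pyjanitor."""
    return d.assign(is_senior=d['age'] >= cutoff)

df = pd.DataFrame({'age': [25.0, 30.0, 40.0]})

result = (
    df
    .pipe(flag_seniors, cutoff=35)  # .pipe passes the frame as the first argument
)
print(result)
```

Because .pipe() returns whatever the function returns, any function that takes and returns a DataFrame can participate in the chain.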

This is the output of our example, which compares the original dirty data with the cleaned version:

--- Messy Original Data ---
  First Name    Last_Name   Age Date_Of_Birth  Salary ($)  Empty_Col
0       Alice       Smith  25.0    1998-01-01       50000        NaN
1         Bob       Jones   NaN    1995-05-05       60000        NaN
2     Charlie       Brown  30.0    1993-08-08       70000        NaN
3       Alice       Smith  25.0    1998-01-01       50000        NaN
4         NaN         Doe  40.0    1983-12-12       80000        NaN 

--- Cleaned Pyjanitor Data ---
  first_name_ _last_name   age date_of_birth  salary  salary_k
0       Alice      Smith  25.0    1998-01-01   50000      50.0
1         Bob      Jones  27.5    1995-05-05   60000      60.0
2     Charlie      Brown  30.0    1993-08-08   70000      70.0
4         NaN        Doe  40.0    1983-12-12   80000      80.0

# Wrapping Up

Throughout this article, we’ve learned how to use the Pyjanitor library to implement method chaining and simplify the otherwise tedious process of cleaning data. This makes the code clean, expressive, and, so to speak, self-documenting, so that other developers, or your future self, can read the pipeline and easily understand what happens on the journey from raw to finished dataset.

Great job!

Iván Palomares Carrascosa is a leader, author, speaker, and consultant in AI, machine learning, deep learning, and LLMs. He trains and guides others in using AI in the real world.
