From Dataset to DataFrame: Your First Project with Pandas and Scikit-learn


# Introduction

Eager to start your first, manageable machine learning project with the popular Python libraries Pandas and scikit-learn, but not sure where to begin? Look no further.

In this article, I’ll walk you through a gentle, beginner-friendly machine learning project in which we’ll build a regression model that predicts employee earnings based on socio-economic attributes. Along the way, we’ll learn some important machine learning concepts and essential tricks.

# From raw dataset to clean DataFrame

First, as with any Python-based project, it’s a good practice to start by importing the necessary libraries, modules, and components that we’ll use throughout the process:

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
import joblib

The following instructions load a publicly available dataset into a pandas DataFrame object: a clean data structure for loading, analyzing, and managing fully structured data, i.e. data in tabular format. Once loaded, we inspect its basic properties and the data types of its attributes:

url = "
df = pd.read_csv(url)
print(df.head())  # preview the first five rows
df.info()         # column names, non-null counts, and data types

You’ll notice that the dataset has 1000 entries or instances, that is, the data describes 1000 employees, but for most attributes, such as age, income, and so on, there are fewer than 1000 actual values. Why? Because this dataset has missing values: a common problem in real-world data that needs to be addressed.

In our project, the goal is to predict an employee's income based on the remaining attributes. Therefore, we will discard rows (employees) that have a missing value for this attribute. While for predictive features it is sometimes fine to keep rows with missing values and impute or estimate them, for the target variable we need fully known labels to train our machine learning model: the catch is that a supervised model learns by being exposed to examples whose expected output is known.

There is also a dedicated pandas instruction to check for missing values alone; a one-liner like the following counts them per column:
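# Count the missing (NaN) values in each column of the DataFrame
print(df.isna().sum())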

So, let's clean up the DataFrame by excluding entries with missing values for the target variable, income. This code eliminates rows with missing values for this attribute only:

target = "income"
train_df = df.dropna(subset=[target])

# Separate the predictive features (X) from the target variable (y)
X = train_df.drop(columns=[target])
y = train_df[target]

So, what about the missing values in the rest of the attributes? We'll take care of those shortly, but first we need to separate our dataset into two major subsets: a training set for training the model, and a test set to evaluate its performance once trained, consisting of examples different from those the model sees during training. Scikit-learn provides a single instruction to perform this split randomly:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
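As a quick sanity check, you can confirm the 80/20 split by inspecting the shapes of the two resulting subsets:

# With test_size=0.2, about 80% of the rows land in the training set
print(X_train.shape, X_test.shape)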

The next step transforms the data into a form better suited for training a machine learning model: building a preprocessing pipeline. In general, this preprocessing should distinguish between numerical and categorical features, so that each type of feature undergoes different preprocessing tasks along the pipeline. For example, numerical features are typically scaled or imputed, while categorical features are mapped or encoded into numerical ones so that a machine learning model can digest them. The code below shows the complete process of building a preprocessing pipeline, including automatic detection of numeric versus categorical features so that each type can be handled correctly.

numeric_features = X.select_dtypes(include=["int64", "float64"]).columns
categorical_features = X.select_dtypes(exclude=["int64", "float64"]).columns

# Numeric features: fill missing values with the column median
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median"))
])

# Categorical features: fill missing values with the most frequent
# category, then one-hot encode into numeric columns
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Route each feature subset through its corresponding transformer
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])

You can find more information about data preprocessing pipelines in this article.

This pipeline, once applied to the DataFrame, yields a clean, ready-to-use version of the data for machine learning. We'll apply it in the next step, where we combine both data preprocessing and machine learning model training into a single overarching pipeline.
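If you are curious, you can also apply the preprocessor on its own to see what the model will actually consume; a minimal illustrative sketch (the combined pipeline in the next section does this automatically, so this step is optional):

# Fit on the training features and transform them; the result is a
# numeric matrix (possibly sparse, due to the one-hot encoding)
X_train_prepared = preprocessor.fit_transform(X_train)
print(X_train_prepared.shape)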

# From clean DataFrame to ready-to-use model

Now we will define a main pipeline that:

  1. Applies the preprocessing steps defined earlier in the preprocessor variable to both numeric and categorical attributes.
  2. Trains a regression model, namely a random forest regressor, to predict income using the preprocessed training data.

# Chain preprocessing and model training into a single pipeline
model = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("regressor", RandomForestRegressor(random_state=42))
])

model.fit(X_train, y_train)

Importantly, the training phase only uses the training subset that we created earlier with the split, not the entire dataset.

Now we take the other subset of the data, the test set, and use it to evaluate the model's performance on these unseen example employees. We will use the mean absolute error (MAE) as our evaluation metric:

preds = model.predict(X_test)
mae = mean_absolute_error(y_test, preds)
print(f"\nModel MAE: {mae:.2f}")

You should get an MAE of around 13,000, which is acceptable but not spectacular, considering most incomes in the dataset fall in the 60-90K range. Anyway, not bad for a first machine learning model!
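To put that figure in context, it helps to compare against a naive baseline that always predicts the mean income of the training set; here is a small sketch using scikit-learn's DummyRegressor:

from sklearn.dummy import DummyRegressor

# Trivial baseline: always predict the mean training income
baseline = DummyRegressor(strategy="mean")
baseline.fit(X_train, y_train)
baseline_mae = mean_absolute_error(y_test, baseline.predict(X_test))
print(f"Baseline MAE: {baseline_mae:.2f}")

If the model's MAE is clearly below this baseline, it has learned something beyond predicting the average.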

On a final note, I'll show you how to save your trained model to a file for future deployment:

joblib.dump(model, "employee_income_model.joblib")
print("Model saved as employee_income_model.joblib")

Saving your trained model to a .joblib file is useful for future deployment, since it allows you to reload and reuse the model without having to retrain it from scratch. Think of it as "freezing" your entire preprocessing pipeline and trained model into a single portable object. Fast options for future use and deployment include plugging it into a simple Python script or notebook, or building a lightweight web app with tools such as Streamlit, Gradio, or Flask.
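For instance, here is a minimal sketch of reloading the saved pipeline and using it for inference; because the saved object bundles the preprocessor, it accepts raw feature rows directly (illustrated with a few test-set rows):

import joblib

# Reload the frozen pipeline: preprocessing steps + trained regressor
loaded_model = joblib.load("employee_income_model.joblib")

# Predict income for raw, unpreprocessed employee records
sample = X_test.head(3)
print(loaded_model.predict(sample))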

# Wrapping up

In this article, we developed an introductory machine learning model for regression, namely predicting employee earnings, outlining the steps necessary to go from a raw dataset to a clean, preprocessed DataFrame, and from the DataFrame to a ready-to-use trained model.

Ivan Palomares Carrascosa is a leader, author, speaker, and advisor in AI, machine learning, deep learning, and LLMs. He trains and guides others in real-world applications of AI.
