How to create boxplots and model data in R using ggplot2

by SkillAiNest

In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you will learn about statistical modeling using linear regression and logistic regression in R.

By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into real-world analytical workflows.


Prerequisites

Before you begin, you should be comfortable with the following:

  • Basic R syntax (variables, functions, dataframes).

  • Installing and loading R packages

  • Understanding what rows and columns represent in a dataset.

  • Very basic statistics (mean, median, distribution).

How to Configure Your R Environment

Start by installing and loading the packages you need.

install.packages(c("tidyverse", "ggplot2"))
library(tidyverse)
library(ggplot2)

The tidyverse provides tools for data manipulation and visualization, and ggplot2 is the plotting engine you will use for the boxplots. Loading the libraries makes their functions available for use.

How to load and inspect data

First, download the HR Analytics Dataset by Saad Haroon from Kaggle.

Assuming the downloaded dataset is saved as “C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv”, load the file into R from that path.

You can preview a sample of the dataset by running the head function. To see the structure of the dataset, run the str function.

hr <- read.csv("C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv")
head(hr)
str(hr)

The read.csv function imports the dataset into R. The head function displays the first six rows so you can preview the data. The str function displays each column’s data type, helping you quickly tell character (categorical) variables from numeric ones.

Remember that understanding the structure of your data early prevents mistakes when plotting or modeling. Once you run the head function, you should see the following in your console:

First six rows of the HR dataset shown in the R console.

Structure

Key Columns and Meaning

  • EmpID → Employee identifier

  • Age → Age in years

  • AgeGroup → Age category (e.g., 18-25)

  • Attrition → Whether the employee has left the company (Yes/No)

  • BusinessTravel → Travel frequency (Travel_Rarely, Travel_Frequently, Non-Travel)

  • Department → Employee’s department

  • DistanceFromHome → Distance from home to office (km)

  • Education / EducationField → Level and field of education

  • EmployeeCount → Usually 1 per employee (redundant)

  • Gender → Male / Female

  • JobRole / JobSatisfaction → Job title and satisfaction level

  • MonthlyIncome / SalarySlab → Salary amount and category

  • YearsAtCompany / YearsInCurrentRole → Experience metrics

  • OverTime → Whether the employee works overtime (Yes/No)

  • Other features: PerformanceRating, TrainingTimesLastYear, WorkLifeBalance, StockOptionLevel, and so on.

Data types

  • Numerical → Age, DistanceFromHome, MonthlyIncome, YearsAtCompany

  • Categorical / Factor → Attrition, Gender, Department, JobRole

Observations

  • The dataset is tabular, like a spreadsheet.

  • There are multiple categorical columns.

  • There are multiple numeric columns.

  • Some columns are redundant or constant and do not provide useful information because every row has the same value (for example, EmployeeCount).

From the str function, you can see the following:

R output showing the structure of the HR dataset.

The dataset contains 1,480 observations and 38 variables. Each row represents an employee, and each column represents an attribute about that employee.

Each column has a name, a data type, and example values. For example, Age and DistanceFromHome are numeric (int), with values like 28 or 12. EmpID and Department are character strings (chr), with examples such as “Research & Development” or “Sales”. Other features include JobRole (Analyst, Manager) and Attrition (Yes/No).

The dataset contains mixed data types. Some columns are numeric, e.g. MonthlyIncome or YearsAtCompany. Some are character or categorical, e.g. Gender (Male/Female) and BusinessTravel (Travel_Rarely, Travel_Frequently, Non-Travel). Some columns are redundant or constant: for example, EmployeeCount has the same value of 1 in every row and does not provide useful information.
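As a quick sanity check (a minimal sketch, not part of the original workflow), you can programmatically flag columns that contain only a single unique value, which are candidates for removal:

# Flag columns where every row has the same value (e.g., EmployeeCount).
constant_cols <- names(hr)[sapply(hr, function(col) length(unique(col)) == 1)]
constant_cols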

How to clean and prepare data

Before visualizing, you need to clean your data. Start by investigating the data to find out what needs cleaning.

Run the summary function to view the dataset’s statistics. You also need to run the is.na function to identify missing values.

summary(hr)
colSums(is.na(hr))

The summary function returns quick statistics and flags suspicious values. The is.na function checks for missing data. Boxplots are sensitive to extreme values, so knowing what you’re working with is important.

After running the summary function, the following will appear in your console:

R summary output for the HR dataset showing statistical distributions.

It shows basic statistics for each column. After running the is.na function, the following will also appear in your console:

R output showing the count of missing values per column in the HR dataset.

From this output, you can see that YearsWithCurrManager has 57 missing values, which means 57 employees have no value in this column.

You can drop this column along with the other redundant columns we saw earlier, using the code below.

hr <- hr %>% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))

To verify that the columns have been removed, use this code:

colnames(hr)

Now we need to convert the main categorical variables into factors. Doing this tells R that a column such as Attrition holds two categories (‘Yes’ and ‘No’), not free-form text.

hr$Attrition <- as.factor(hr$Attrition)
hr$JobRole <- as.factor(hr$JobRole)
hr$Department <- as.factor(hr$Department)

It also ensures that ggplot2 treats them correctly when grouping.
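If you want to confirm the conversion worked, a minimal check like this (not in the original tutorial) lists the resulting factor levels:

# Verify the columns are now factors and inspect their levels.
str(hr$Attrition)
levels(hr$Attrition)    # should show "No" "Yes"
levels(hr$Department)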

How to use boxplots

A boxplot shows key features of a distribution. The median is shown by the line in the middle of the box. The interquartile range is represented by the box itself, while the whiskers span the rest of the data. Outliers appear as individual points.

Boxplots are most useful when you want to compare distributions across groups, such as income by job role or tenure by department.

Let’s start with a simple box plot of monthly income.

ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Distribution of Monthly Income",
    y = "Monthly Income")

The aes function tells ggplot which variable to plot. geom_boxplot draws the boxplot. The labs function labels parts of the generated plot, that is, the x axis, y axis, and title.

A single boxplot showing the distribution of employee monthly income.

How to Create Boxplots with ggplot2

Now let’s compare income across job roles.

ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income")

The x aesthetic lists all job roles. The axis labels are rotated 45 degrees to improve readability. This plot quickly reveals income differences across roles.

Multiple boxplots comparing the monthly income distribution across different job roles.
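As a small extension (a sketch, not part of the original tutorial), you can order the roles by their median income so the comparison reads more easily. This uses base R’s reorder() inside the x aesthetic:

# Order job roles by median monthly income before plotting.
ggplot(hr, aes(x = reorder(JobRole, MonthlyIncome, FUN = median),
               y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role (ordered by median)",
    x = "Job Role",
    y = "Monthly Income")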

How to Do Exploratory Data Analysis (EDA)

Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of data.

As an example, let’s look at years at the company by department.

ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = "darkblue") +
  labs(
    title = "Years at Company by Department",
    y = "Years at Company")

Boxplots showing employee tenure across departments.

How to Build a Linear Regression Model

To understand how to build a linear regression model, predict MonthlyIncome from YearsAtCompany with the commands below.

The first command creates the model, while the second displays a summary of it.

hr_lm <- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)

Linear regression estimates how monthly income changes with years at the company. It works when both variables are numeric.

After running the code, your console should show you this output:

Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max 
 -9506  -2488  -1186   1403  15483 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     3734.47     159.41   23.43   <2e-16 ***
YearsAtCompany   395.25      17.14   23.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4032 on 1478 degrees of freedom
Multiple R-squared:  0.2647,    Adjusted R-squared:  0.2642 
F-statistic:   532 on 1 and 1478 DF,  p-value: < 2.2e-16

Let us interpret this model.

If an employee has 0 years at the company, their predicted monthly income is $3,734.47. This is the intercept.

For every year an employee spends at the company, their monthly income is predicted to increase by $395.25.

Both coefficients have p-values below 2e-16, which means they are highly significant. This strongly suggests that the number of years an employee spends at the company affects their income.

The R-squared of the model is 0.2647. This means that about 26% of the variation in monthly income is explained by the number of years an employee spends at the company. That is low, so other factors such as role, department, or education likely affect income as well.

The F-statistic of the model is 532, with a p-value < 2.2e-16. This means that the overall model is statistically significant.

In general, the longer an employee stays at the company, the more they earn: about $395 more per month for each additional year. But only about a quarter of the variation in income is explained by years at the company alone, so you need to consider other variables for better predictions.
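To make the interpretation concrete, here is a short sketch (not in the original tutorial) that uses predict() on the fitted model for a few hypothetical tenures:

# Predict monthly income for employees with 0, 5, and 10 years at the company.
new_years <- data.frame(YearsAtCompany = c(0, 5, 10))
predict(hr_lm, newdata = new_years)
# Roughly 3734, 3734 + 5 * 395, and 3734 + 10 * 395, matching the coefficients.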

How to Build a Logistic Regression Model

Now you can learn how to predict attrition. The first command creates the model, while the second displays a summary of it.

hr_glm <- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial)

summary(hr_glm)

Your console should display this output when you run both commands.

Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial, 
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept)    -8.094e-01  1.375e-01  -5.886 3.96e-09 ***
MonthlyIncome  -9.449e-05  2.302e-05  -4.104 4.05e-05 ***
YearsAtCompany -5.047e-02  1.792e-02  -2.817  0.00485 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1305.4  on 1479  degrees of freedom
Residual deviance: 1252.5  on 1477  degrees of freedom
AIC: 1258.5

Number of Fisher Scoring iterations: 5

Logistic regression is used for binary outcomes, i.e. yes or no. It estimates the probability of the outcome.

Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (attrition) based on their MonthlyIncome and YearsAtCompany.

The intercept is -0.809. This is the baseline log-odds of leaving when income and years at the company are both zero.

MonthlyIncome has a coefficient of -0.0000945. This means that as income increases, an employee becomes slightly less likely to leave.

YearsAtCompany has a coefficient of -0.0505. This shows that the longer employees stay, the less likely they are to leave; every additional year reduces their odds of attrition.

All coefficients are statistically significant: MonthlyIncome and YearsAtCompany both meaningfully affect the chance of leaving.

The residual deviance of the model is 1252.5, less than the null deviance of 1305.4. This means that the model explains some of the variation in attrition.

The key takeaway is that employees who earn more and stay with the company longer are less likely to leave. These factors matter, but other factors also affect attrition.
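Because the coefficients are on the log-odds scale, a small sketch (not in the original tutorial) can make them easier to read by converting them to odds ratios and predicting probabilities for two hypothetical employees:

# Convert coefficients to odds ratios; values below 1 reduce the odds of leaving.
exp(coef(hr_glm))

# Predicted probability of attrition for two hypothetical employees.
new_emps <- data.frame(
  MonthlyIncome  = c(3000, 15000),
  YearsAtCompany = c(1, 10))
predict(hr_glm, newdata = new_emps, type = "response")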

Why Visualization Comes Before Modeling

Boxplots help you:

  • Detect outliers: Boxplots highlight extreme values that can distort model results (a quick programmatic check is shown after this list).

  • Compare the groups: Boxplots allow quick comparison of distributions across categories.

  • Form hypotheses: Visual patterns help identify testable relationships to build into a model.

  • Validate modeling assumptions: Boxplots help check the shape and variance of the distribution before modeling.

Modeling without visualization often leads to misinterpretation or false confidence.
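If you want to back the visual check with numbers, a minimal sketch (not part of the original tutorial) uses boxplot.stats() to list the same points a boxplot would draw as outliers:

# Monthly-income values beyond 1.5 * IQR from the box, i.e. boxplot outliers.
income_outliers <- boxplot.stats(hr$MonthlyIncome)$out
length(income_outliers)                          # how many outliers
head(sort(income_outliers, decreasing = TRUE))   # the largest ones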

Conclusion

In this tutorial, you learned how to load and clean data, and how to read boxplots and why they matter. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.
