In this tutorial, you’ll walk through a complete data analysis project using the HR Analytics dataset by Saad Haroon on Kaggle. You’ll start by loading and cleaning the data, then explore it visually using boxplots with ggplot2. Finally, you’ll learn about statistical modeling using linear regression and logistic regression in R.
By the end of this article, you should understand how to create boxplots in R, why they matter, and how they fit into real-world analytical workflows.
Prerequisites
Before you begin, you should be comfortable with the following:
Basic R syntax (variables, functions, data frames).
Installing and loading R packages.
Understanding what rows and columns represent in a dataset.
Very basic statistics (mean, median, distribution).
How to Configure Your R Environment
Start by installing and loading the packages you need.
install.packages(c("tidyverse", "ggplot2"))
library(tidyverse)
library(ggplot2)
tidyverse provides tools for data manipulation and visualization. ggplot2 is the visualization engine you will use for the boxplots. Loading the libraries makes their functions available for use.
How to Load and Inspect the Data
First, download the HR Analytics dataset by Saad Haroon from Kaggle.
Assuming the downloaded dataset is saved as “C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv”, load the file into R using its path.
You can preview a sample of the dataset by running the head function. To see the structure of the dataset, you can run the str function.
hr <- read.csv("C:/Users/johndoe/Downloads/archive (2)/HR_Analytics.csv")
head(hr)
str(hr)
The read.csv function imports the dataset into R. The head function displays the first six rows so you can preview the data. The str function displays data types, helping you quickly distinguish categorical from numeric variables.
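If you want the same type information as a compact list rather than the full str output, a minimal sketch like the one below may help. The toy data frame and its values are made up for illustration; they only mimic a few column names from the HR data.

```r
# Hypothetical toy data frame standing in for the HR data (values are made up).
toy <- data.frame(
  Age = c(28L, 41L, 35L),
  MonthlyIncome = c(3200L, 8100L, 5400L),
  Department = c("Sales", "Research & Development", "Sales"),
  Attrition = c("Yes", "No", "No"),
  stringsAsFactors = FALSE
)

# Class of every column, returned as a named character vector.
col_classes <- sapply(toy, class)
print(col_classes)

# Split the column names into numeric and character groups.
numeric_cols <- names(toy)[sapply(toy, is.numeric)]
character_cols <- names(toy)[sapply(toy, is.character)]
print(numeric_cols)
print(character_cols)
```

On the real dataset you would replace toy with hr.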
Remember that understanding the structure of your data early prevents mistakes when plotting or modeling. Once you run the head function, you should see the following in your console:
From the head output, you can see:

Structure
Key Columns and Meaning
EmpID → Employee identifier
Age → Age in years
AgeGroup → Age category (e.g., 18-25)
Attrition → Whether the employee left the company (Yes/No)
BusinessTravel → Travel frequency (Travel_Rarely, Travel_Frequently, Non-Travel)
Department → Department of employment
DistanceFromHome → Distance from home to office (km)
Education / EducationField → Level and field of education
EmployeeCount → Usually 1 per employee (redundant)
Gender → Male / Female
JobRole / JobSatisfaction → Job title and satisfaction level
MonthlyIncome / SalarySlab → Salary amount and category
YearsAtCompany / YearsInCurrentRole → Experience metrics
OverTime → Works overtime (Yes/No)
Other features: PerformanceRating, TrainingTimesLastYear, WorkLifeBalance, StockOptionLevel, and so on.
Data types
Numerical → Age, DistanceFromHome, MonthlyIncome, YearsAtCompany
Categorical → Attrition, Gender, Department, JobRole
Observations
The dataset is tabular, like a spreadsheet.
There are multiple categorical columns.
There are multiple numeric columns.
Some columns seem redundant or constant. They do not provide useful information because every row has the same value (for example, EmployeeCount).
From the str output, you can gather the following:

The dataset contains 1,480 observations and 38 variables. Each row represents an employee, and each column represents an attribute about that employee.
Each column has a name, a data type, and example values. For example, Age and DistanceFromHome are numeric (int), with values like 28 or 12. EmpID and Department are character strings (chr), with examples such as Research & Development or Sales. Other features include JobRole (Analyst, Manager) and Attrition (Yes/No).
The dataset contains mixed data types. Some columns are numeric, e.g., MonthlyIncome or YearsAtCompany. Some are character or categorical, e.g., Gender (Male/Female) and BusinessTravel (Travel_Rarely, Travel_Frequently). Some columns are redundant or constant. For example, EmployeeCount has the same value of 1 in every row and does not provide useful information.
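Rather than spotting constant columns by eye, you could detect them programmatically. The sketch below is one possible approach, not something the original workflow requires; the toy data frame is hypothetical and only borrows column names from the HR data.

```r
# Hypothetical toy data frame (values are made up for illustration).
toy <- data.frame(
  EmpID = c("E1", "E2", "E3"),
  EmployeeCount = c(1L, 1L, 1L),
  Age = c(28L, 41L, 35L),
  stringsAsFactors = FALSE
)

# Count the distinct values in each column.
n_distinct <- vapply(toy, function(col) length(unique(col)), integer(1))

# A column is constant when it holds exactly one distinct value.
constant_cols <- names(n_distinct)[n_distinct == 1]
print(constant_cols)
```

Running the same two lines on hr would list every column that carries no information, which is a reasonable starting point for deciding what to drop.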
How to Clean and Prepare the Data
Before visualization, you need to clean your data. You can inspect the data to find out what needs cleaning.
Run the summary function to view summary statistics for the dataset. You also need to run the is.na function to identify missing values.
summary(hr)
colSums(is.na(hr))
The summary function returns quick statistics and helps you flag suspicious values. The is.na function checks for missing data. Boxplots are sensitive to extreme values, so knowing what you’re working with is important.
After running the summary function, the following will appear in your console:

It shows basic statistics for each column. After running the is.na function inside colSums, the following will also appear in your console:

From this output, you can see that the count for YearsWithCurrManager is 57, which means 57 employees have no value in this column.
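With many columns, the full colSums output can be hard to scan. A small sketch like the following keeps only the columns that actually contain missing values. The toy data frame is hypothetical; only the column names echo the HR data.

```r
# Hypothetical toy data frame with some missing values (made-up numbers).
toy <- data.frame(
  Age = c(28, 41, 35, 52),
  YearsWithCurrManager = c(2, NA, 7, NA)
)

# Missing-value count per column.
na_counts <- colSums(is.na(toy))

# Keep only the columns that have at least one missing value.
cols_with_na <- na_counts[na_counts > 0]
print(cols_with_na)

# Share of rows affected in each of those columns.
print(round(cols_with_na / nrow(toy), 2))
```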
You can drop this entire column along with the other redundant columns. You can do this with the code below.
hr <- hr %>% select(-c(EmployeeCount, Over18, StandardHours, YearsWithCurrManager))
To verify that the columns have been removed, use this code:
colnames(hr)
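Reading through the column names works, but an explicit check is harder to misread. The sketch below shows one way to do it in base R on a hypothetical toy data frame; the base R subsetting here stands in for the dplyr select call and is not the tutorial's own method.

```r
# Hypothetical toy data frame (values are made up for illustration).
toy <- data.frame(
  Age = c(28, 41),
  EmployeeCount = c(1, 1),
  Over18 = c("Y", "Y")
)

# Columns we intend to drop.
dropped <- c("EmployeeCount", "Over18")

# Base R equivalent of select(-c(...)): keep columns not in `dropped`.
toy <- toy[, !(names(toy) %in% dropped), drop = FALSE]

# Any dropped column still present would show up here.
still_present <- intersect(dropped, names(toy))
stopifnot(length(still_present) == 0)
print(names(toy))
```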
Now you need to convert the main categorical variables into factors. This tells R that a column such as Attrition holds a fixed set of categories (‘Yes’ and ‘No’), not free-form text.
hr$Attrition <- as.factor(hr$Attrition)
hr$JobRole <- as.factor(hr$JobRole)
hr$Department <- as.factor(hr$Department)
It also ensures that ggplot2 treats them correctly when grouping.
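If you have several categorical columns, you may prefer to convert them in one step and then confirm the resulting levels. This is a minimal sketch on a hypothetical toy data frame; the vector name cat_cols is an assumption introduced for illustration.

```r
# Hypothetical toy data frame (values are made up for illustration).
toy <- data.frame(
  Attrition = c("Yes", "No", "No", "Yes"),
  Department = c("Sales", "Sales", "Human Resources", "Sales"),
  stringsAsFactors = FALSE
)

# Names of the columns to convert (hypothetical helper vector).
cat_cols <- c("Attrition", "Department")

# Convert each listed column to a factor in one step.
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# Levels are sorted alphabetically by default, so "No" comes before "Yes".
print(levels(toy$Attrition))

# Frequency of each category.
print(table(toy$Department))
```

The level order matters later: in a binomial glm with a factor response, R treats the first level as the reference outcome.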
How to Use Boxplots
A boxplot shows key features of a distribution. The median is shown by the line in the middle of the box. The interquartile range (IQR) is represented by the box itself, while the whiskers extend to the most extreme values that lie within 1.5 times the IQR of the box. Outliers beyond the whiskers appear as individual points.
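To make these pieces concrete, the sketch below computes them by hand for a small made-up vector, using the common 1.5 × IQR rule for the whisker limits. The numbers are hypothetical and chosen so that one value is clearly an outlier.

```r
# Made-up values; 50 is deliberately extreme.
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 50)

# First quartile, median, and third quartile (R's default quantile method).
q <- quantile(x, probs = c(0.25, 0.5, 0.75), names = FALSE)

# The box spans Q1 to Q3; its height is the interquartile range.
iqr <- q[3] - q[1]

# Limits beyond which a point is drawn as an outlier.
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[3] + 1.5 * iqr

# Points outside the fences are outliers.
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers end at the most extreme points inside the fences.
whisker_range <- range(x[x >= lower_fence & x <= upper_fence])

print(q)
print(iqr)
print(outliers)
print(whisker_range)
```

Here the box runs from 3.25 to 7.75, the whiskers reach 1 and 9, and 50 is flagged as an outlier.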
Boxplots are most useful when you want to compare distributions across groups, such as income by job role or years at the company by department.
Let’s start with a simple boxplot of monthly income.
ggplot(hr, aes(y = MonthlyIncome)) +
  geom_boxplot(fill = "blue") +
  labs(
    title = "Distribution of Monthly Income",
    y = "Monthly Income"
  )
The aes function tells ggplot which variable to plot. The geom_boxplot function draws the boxplot. The labs function labels parts of the generated plot, that is, the x axis, the y axis, and the title.

How to Create Boxplots with ggplot2
Now let’s compare income across job roles.
ggplot(hr, aes(x = JobRole, y = MonthlyIncome)) +
  geom_boxplot(fill = "lightblue") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(
    title = "Monthly Income by Job Role",
    x = "Job Role",
    y = "Monthly Income"
  )
The x aesthetic lists all job roles. The axis labels are rotated to improve readability. This visualization quickly reveals income differences across roles.
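A grouped boxplot pairs well with a numeric summary of the same comparison. The sketch below computes the median income per role, which is the value the middle line of each box represents. The toy data frame and its incomes are made up; on the real data you would pass hr instead.

```r
# Hypothetical toy data (role names and incomes are made up).
toy <- data.frame(
  JobRole = c("Manager", "Manager", "Manager",
              "Sales Executive", "Sales Executive", "Sales Executive"),
  MonthlyIncome = c(15000, 17000, 16000, 5000, 7000, 6500)
)

# Median income for each job role.
median_income <- aggregate(MonthlyIncome ~ JobRole, data = toy, FUN = median)

# Sort from highest to lowest median.
median_income <- median_income[order(-median_income$MonthlyIncome), ]
print(median_income)
```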

How to Do Exploratory Data Analysis (EDA)
Exploratory data analysis involves using visual methods to ask questions and gain a deeper understanding of data.
As an example, you can look at years at the company by department.
ggplot(hr, aes(x = Department, y = YearsAtCompany)) +
  geom_boxplot(fill = "darkblue") +
  labs(
    title = "Years at Company by Department",
    y = "Years at Company"
  )

How to Build a Linear Regression Model
To see how to build a linear regression model, you will model MonthlyIncome using YearsAtCompany with the commands below.
The first command creates the model, while the second displays its summary.
hr_lm <- lm(MonthlyIncome ~ YearsAtCompany, data = hr)
summary(hr_lm)
Linear regression estimates how monthly income changes with the number of years at the company. It is appropriate when the outcome variable is numeric.
After running the code, your console should show you this output:
Call:
lm(formula = MonthlyIncome ~ YearsAtCompany, data = hr)

Residuals:
   Min     1Q Median     3Q    Max
 -9506  -2488  -1186   1403  15483

Coefficients:
               Estimate Std. Error t value Pr(>|t|)
(Intercept)     3734.47     159.41   23.43   <2e-16 ***
YearsAtCompany   395.25      17.14   23.07   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 4032 on 1478 degrees of freedom
Multiple R-squared:  0.2647,  Adjusted R-squared:  0.2642
F-statistic:   532 on 1 and 1478 DF,  p-value: < 2.2e-16
Let us interpret this model.
If an employee has 0 years with the company, their predicted baseline monthly income is $3,734.47. This is the intercept.
For every additional year an employee spends at the company, their monthly income is predicted to increase by $395.25.
Both coefficients have p-values below 2e-16. This means they are highly statistically significant. This strongly suggests that the number of years an employee spends at a company is associated with their earnings.
The R-squared of the model is 0.2647. This means that about 26% of the variation in monthly income is explained by the number of years an employee spends at the company. This is low, so other factors such as role, department, or education likely affect earnings as well.
The F-statistic of the model is 532, with a p-value < 2.2e-16. This means that the overall model is statistically significant.
In general, the longer an employee stays at a company, the more they earn: about $395 more in monthly income for each additional year. But only about a quarter of the variation in income is explained by years at the company alone. You need to consider other variables for better predictions.
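To see what the fitted line implies, you can plug tenure values into the equation by hand. The sketch below uses the rounded coefficients from the summary output above; the tenure values of 0, 5, and 10 years are arbitrary choices for illustration.

```r
# Rounded coefficients copied from the model summary.
intercept <- 3734.47
slope <- 395.25

# Arbitrary tenure values chosen for illustration.
years <- c(0, 5, 10)

# Predicted monthly income = intercept + slope * years at company.
predicted_income <- intercept + slope * years
print(data.frame(years, predicted_income))

# With the fitted model in memory, the unrounded equivalent would be:
# predict(hr_lm, newdata = data.frame(YearsAtCompany = years))
```

The predictions come out to 3734.47, 5710.72, and 7686.97, so each five-year step adds 5 × 395.25 = 1976.25 to the predicted monthly income.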
How to Build a Logistic Regression Model
Now you can learn how to predict attrition. The first command creates the model, while the second displays its summary.
hr_glm <- glm(
  Attrition ~ MonthlyIncome + YearsAtCompany,
  data = hr,
  family = binomial
)
summary(hr_glm)
Your console should display this as output when you run both commands.
Call:
glm(formula = Attrition ~ MonthlyIncome + YearsAtCompany, family = binomial,
    data = hr)

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)    -8.094e-01  1.375e-01  -5.886 3.96e-09 ***
MonthlyIncome  -9.449e-05  2.302e-05  -4.104 4.05e-05 ***
YearsAtCompany -5.047e-02  1.792e-02  -2.817  0.00485 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1305.4  on 1479  degrees of freedom
Residual deviance: 1252.5  on 1477  degrees of freedom
AIC: 1258.5

Number of Fisher Scoring iterations: 5
Logistic regression is used for binary outcomes, i.e., Yes or No. It estimates the probability of the outcome.
Let’s interpret this logistic regression model. The model predicts whether an employee is likely to leave the company (attrition) based on their MonthlyIncome and YearsAtCompany.
The intercept is -0.809. This is the baseline log-odds of leaving when income and years at the company are both zero.
MonthlyIncome has a coefficient of -0.0000945. This means that as income increases, employees are slightly less likely to leave: each additional dollar of monthly income lowers the log-odds of leaving by about 0.0000945.
YearsAtCompany has a coefficient of -0.0505. This shows that the longer employees stay, the less likely they are to leave. Every additional year reduces their odds of attrition.
All coefficients are statistically significant. Both MonthlyIncome and YearsAtCompany are significantly associated with the likelihood of leaving.
The residual deviance of the model is 1252.5, less than the null deviance of 1305.4. This means that the model explains some of the variation in attrition.
The key takeaway is that if an employee earns more and stays with the company longer, they are less likely to leave. These factors are important, but other factors also affect attrition.
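Log-odds are hard to read directly, so it may help to convert the coefficients into odds ratios and a predicted probability. The sketch below uses the rounded estimates from the summary output above. The example employee (monthly income of 5000, 5 years at the company) is hypothetical, and the sketch assumes the model is predicting the "Yes" level of Attrition, which is how R treats a two-level factor response whose levels are "No" and "Yes".

```r
# Rounded estimates copied from the model summary.
b0 <- -0.8094
b_income <- -9.449e-05
b_years <- -0.05047

# Odds ratios: exponentiate the coefficients.
or_per_1000 <- exp(b_income * 1000)  # effect of an extra 1000 in monthly income
or_per_year <- exp(b_years)          # effect of one extra year at the company
print(round(c(or_per_1000 = or_per_1000, or_per_year = or_per_year), 3))

# Hypothetical employee: monthly income 5000, 5 years at the company.
log_odds <- b0 + b_income * 5000 + b_years * 5

# plogis() is the inverse logit: it turns log-odds into a probability.
prob_leave <- plogis(log_odds)
print(round(prob_leave, 3))

# Drop in deviance compared against a chi-squared distribution
# (a likelihood-ratio check with 1479 - 1477 = 2 degrees of freedom).
dev_drop <- 1305.4 - 1252.5
lr_p <- pchisq(dev_drop, df = 1479 - 1477, lower.tail = FALSE)
print(lr_p)
```

Under these assumptions, each extra year at the company lowers the odds of leaving by roughly 5%, and each extra 1000 in monthly income lowers them by roughly 9%. The very small p-value from the deviance comparison agrees with the reading that the two predictors together improve on a model with no predictors.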
Why Visualization Comes Before Modeling
Boxplots help you:
Detect outliers: Boxplots highlight extreme values that can distort model results.
Compare groups: Boxplots allow quick comparison of distributions across categories.
Form hypotheses: Visual patterns help identify testable relationships for a model.
Validate modeling assumptions: Boxplots help check the shape and variance of the distribution before modeling.
Modeling without visualization often leads to misinterpretation or false confidence.
Conclusion
In this tutorial, you learned how to load and clean data and how to read boxplots and understand their significance. You also learned how to use ggplot2 to compare distributions, perform exploratory data analysis (EDA), build linear and logistic regression models, and link visualization insights to modeling results.