How to Create Scatterplots and Model Data in R Using ggplot2

by SkillAiNest

R is a powerful tool for data analysis, data visualization, and statistical modeling. In this guide, you’ll learn how to load real-world data into R, visualize patterns using ggplot2, build simple linear and logistic regression models, and interpret the results. By the end, you should know how to use R for your own projects.

Prerequisites

Before we begin, you should have the following:

  • R installed (version 4.0 or higher).

  • RStudio installed (recommended for beginners).

  • Basic familiarity with programming concepts such as variables and functions.

  • Basic understanding of statistics (mean, correlation, regression).

How to Configure Your R Environment

Before working with the data, load the required libraries:

library(tidyverse)   # Data manipulation + ggplot2
library(readxl)      # Importing Excel files

This loads the required libraries into R. tidyverse is a collection of packages for data manipulation and visualization, including ggplot2. readxl lets you import Excel files directly into R without first converting them to CSV format.
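If library() reports that a package is missing, the package has not been installed yet. The following is a minimal sketch, not part of the original setup, that checks which of the two packages are available on your machine. The names required, is_installed, and to_install are illustrative only.

```r
# Packages this guide relies on
required <- c("tidyverse", "readxl")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need to be installed
to_install <- required[!is_installed]

if (length(to_install) > 0) {
  message(
    "Install these first: install.packages(c(",
    paste0('"', to_install, '"', collapse = ", "),
    "))"
  )
} else {
  message("All required packages are installed.")
}
```

The sketch only prints a suggestion instead of installing anything, so you stay in control of what gets added to your R library.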

How to use data types in R

Knowing the types of data helps you avoid mistakes and choose the right analysis methods.

Common data types

Data type    Example           Use case
Numeric      x <- 5.7          Measurements, prices
Integer      y <- 10L          Counting
Character    "House prices"    Text labels
Logical      TRUE              Conditions
Complex      2 + 3i            Advanced mathematics
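To check which type a value has, you can ask R directly. This short sketch applies base R's class() function to the examples from the table. The variable names are illustrative.

```r
x <- 5.7
y <- 10L
label <- "House prices"
flag <- TRUE
z <- 2 + 3i

class(x)      # "numeric"
class(y)      # "integer"
class(label)  # "character"
class(flag)   # "logical"
class(z)      # "complex"

# is.*() functions test for a specific type and return TRUE or FALSE
is.numeric(y)    # TRUE: integers also count as numeric
is.character(x)  # FALSE
```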

Numeric data types in R

price <- 199.99
tax <- 16.5
total_cost <- price + tax
total_cost

Numeric data is used for continuous values such as measurements, prices, or averages. As the example shows, numeric values can be used in calculations: they support arithmetic operations such as addition, subtraction, multiplication, and division.

Integer data types in R

students <- 30L
classes <- 4L
total_students <- students * classes
total_students

Integers are whole numbers and are usually used for counting. The L suffix tells R to store a value as an integer. Integers are useful when working with counts, indexes, or other discrete values.
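A number typed without the L suffix is stored as a double, even if it looks like a whole number. Here is a small sketch of the difference, using typeof() to show the underlying storage type:

```r
typeof(30)    # "double"
typeof(30L)   # "integer"

students <- 30L
classes <- 4L

typeof(students * classes)  # "integer": integer times integer stays integer
typeof(students / classes)  # "double": division always returns a double

# Convert a whole-number double to an integer explicitly
as.integer(30)              # 30, now stored as an integer
```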

Character data types in R

course_name <- "Data Science"
university <- "Harvard University"
paste(course_name, "at", university)

Character data stores text such as names, labels, or categories. The example above shows how character values can be combined using the paste() function. This data type cannot be used in arithmetic operations.
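Numbers stored as text are a common source of errors after importing data. The sketch below, with made-up values, shows that arithmetic on a character value fails and how as.numeric() converts the text first. The names price_text and price_value are illustrative.

```r
price_text <- "199.99"   # looks like a number, but is stored as text

# Arithmetic on text raises an error; tryCatch() captures the message
result <- tryCatch(
  price_text + 1,
  error = function(e) conditionMessage(e)
)
result   # the error message, not a number

# Convert to numeric first, then calculate
price_value <- as.numeric(price_text)
price_value + 1            # 200.99

# paste() joins with a space by default; paste0() joins with no separator
paste("Data", "Science")   # "Data Science"
paste0("Data", "Science")  # "DataScience"
```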

Logical data types in R

score <- 75
passed <- score >= 50
passed

Logical data represents Boolean values: TRUE or FALSE. These are commonly used in conditions and filtering. Here, R evaluates the condition and returns TRUE because the score meets the requirement. Logical values are essential in decision making and control flow.
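The same comparison works on a whole vector at once, which is how filtering is usually done in R. A brief sketch with made-up scores:

```r
scores <- c(45, 75, 62, 38)

# Comparing a vector returns one logical value per element
passed <- scores >= 50
passed            # FALSE TRUE TRUE FALSE

# Use the logical vector to keep only the passing scores
scores[passed]    # 75 62

# TRUE counts as 1 and FALSE as 0, so sum() counts the passes
sum(passed)       # 2

# if/else uses a single logical value to choose a branch
if (scores[2] >= 50) "pass" else "fail"   # "pass"
```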

Complex data types in R

Complex numbers have both a real and an imaginary part and are used in advanced mathematical computations.

z <- 2 + 3i
Mod(z)

This example calculates the magnitude of a complex number. Complex data types are rarely used in basic data analysis but are available in R.
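To see where the magnitude comes from, you can pull the two parts out of the number. This sketch uses the base R functions Re(), Im(), and Mod():

```r
z <- 2 + 3i

Re(z)   # 2, the real part
Im(z)   # 3, the imaginary part

# Mod() is the distance from zero: sqrt(Re^2 + Im^2)
Mod(z)                    # about 3.606
sqrt(Re(z)^2 + Im(z)^2)   # same value, computed by hand
```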

How to use data structures in R

R stores data in different structures depending on your goals. Choosing the right structure matters because it makes operations easier and because functions behave differently depending on the structure they receive. The structure also tells you whether your data holds numbers, categories, or text.

Common data structures in R

Structure     Best for
Vector        A single column of data
Matrix        Numerical tables
Data frame    Spreadsheet-like data
List          Mixed items

vec <- c(1, 2, 3, 4)                                             # vector
mat <- matrix(1:9, nrow = 3)                                     # 3 x 3 matrix
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))   # data frame
lst <- list(numbers = vec, matrix = mat, info = df)              # list holding all three

str(lst) 

Let’s break down the code above:

  • vec is a vector, which stores values of the same type.

  • mat is a matrix, which organizes numeric values into rows and columns.

  • df is a data frame, which works like a spreadsheet and allows a different data type in each column.

  • lst is a list, which stores multiple objects of different types.

  • The str() function shows how these items are nested inside the list.
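Each structure has its own way of reaching the values inside it. The sketch below recreates the four objects and pulls out individual elements using standard base R indexing:

```r
vec <- c(1, 2, 3, 4)
mat <- matrix(1:9, nrow = 3)
df <- data.frame(Name = c("Car", "Bike"), Number = c(110, 95))
lst <- list(numbers = vec, matrix = mat, info = df)

vec[2]                 # 2: second element of the vector
mat[2, 3]              # 8: row 2, column 3 (matrix() fills column by column)
df$Number              # 110 95: one column of the data frame
df[df$Number > 100, ]  # rows where Number is above 100 (only "Car")

lst$numbers            # the vector stored in the list
lst[["info"]]$Name     # "Car" "Bike": a column of the data frame in the list
```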

How to import data into R

Now you can start working with your real data. You can import files into R by copying the path to the CSV or Excel file and pasting it into the command.

On Windows, replace each single backslash (\) in the path with either a double backslash (\\) or a single forward slash (/). For example:

# Windows
data <- read.csv("C:\\Users\\file\\Documents\\data.csv")
# or
data <- read.csv("C:/Users/file/Documents/data.csv")

On macOS/Linux, single forward slashes work as they are:

# macOS/Linux
data <- read.csv("/Users/file/Documents/data.csv")
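Two other ways to avoid escaping problems are raw strings, available since R 4.0, and file.path(). The sketch below only builds the path strings and checks them. The folder names are the same placeholders used in the examples above, and win_path and unix_path are illustrative names.

```r
# A raw string, written r"(...)", keeps backslashes exactly as typed
win_path <- r"(C:\Users\file\Documents\data.csv)"
identical(win_path, "C:\\Users\\file\\Documents\\data.csv")   # TRUE

# file.path() joins folder names with forward slashes
unix_path <- file.path("", "Users", "file", "Documents", "data.csv")
unix_path   # "/Users/file/Documents/data.csv"

# file.exists() tells you whether the path points to a real file
file.exists(unix_path)
```

Checking a path with file.exists() before calling read.csv() is a quick way to tell a wrong path apart from a problem with the file itself.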

How to read CSV and Excel files

data <- read.csv("C:/Users/file/Documents/data.csv")
# or, equivalently:
# data <- read.csv("C:\\Users\\file\\Documents\\data.csv")

head(data)

You can import a CSV file into R using its file path. On Windows, the path can use either single forward slashes (/) or double backslashes (\\). The imported data is saved as a data frame named data.

data_excel <- read_excel("C:/Users/file/Documents/HR Data Set.xlsx")
head(data_excel)

You can import an Excel file into R using the read_excel() function from the readxl package. The head() function is then used to preview the first few rows of the dataset.

Use the following commands to understand your data:

str(data)
summary(data)

str(data_excel)
summary(data_excel)

str() shows the structure of the dataset, including column names and data types. summary() provides descriptive statistics such as the minimum, maximum, mean, and quartiles for each variable. Together, these functions help you understand the dataset before analysis.
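If you do not have a file at hand, you can try the same workflow on a small file that R creates for you. This is a sketch using a temporary file and made-up values. The names sample_data and tmp_file are illustrative.

```r
# Made-up data standing in for a real file
sample_data <- data.frame(
  name = c("Ana", "Ben", "Cara"),
  age = c(34, 28, 45),
  salary = c(52000, 48000, 61000)
)

# Write it to a temporary CSV file, then read it back
tmp_file <- tempfile(fileext = ".csv")
write.csv(sample_data, tmp_file, row.names = FALSE)
data <- read.csv(tmp_file)

head(data)      # first rows of the data frame
str(data)       # column names and data types
summary(data)   # descriptive statistics per column

nrow(data)          # 3
mean(data$salary)   # 53666.67
```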

How to visualize data with ggplot2

Visualization helps you spot patterns before building a model.

Example of a scatter plot

We will use the mtcars dataset that is built into R. First, load the dataset and the ggplot2 library:

data(mtcars)
library(ggplot2)

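ggplot(mtcars, aes(x = wt, y = mpg)) +
  # Map cylinder count to point color inside aes(); size is a fixed setting
  geom_point(aes(color = factor(cyl)), size = 3) +
  # One linear regression line across all cars, without the confidence band
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(
    title = "Fuel Efficiency by Weight and Cylinders",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal()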

Let’s break down the code to fully understand it:

  • data(mtcars) loads the built-in mtcars dataset, which contains information about car specifications.

  • library(ggplot2) loads the package used to visualize the data.

  • aes() maps columns of your dataset to parts of the plot, such as the x and y values.

  • Arguments placed outside aes() style the plot with fixed values, for example size and color.

  • geom_smooth() adds a trend line. Here, method = "lm" fits a linear regression line. The se option controls the shading for the confidence interval: use TRUE if you want the shading and FALSE if you don’t.

  • labs() sets the title, the x-axis label, and the y-axis label.

  • Finally, theme_minimal() sets the plot theme.

Running this code will generate a scatterplot showing fuel efficiency by weight and by cylinder. The plot should look like this:

Scatterplot of MPG versus vehicle weight with regression line
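As a variation on the plot above, the sketch below turns on the confidence band with se = TRUE and stores the plot in an object before printing it. The name p_ci is illustrative, and formula = y ~ x is written out only to make the fitted model explicit.

```r
library(ggplot2)

p_ci <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point(size = 3) +
  # se = TRUE draws a shaded confidence band around the fitted line
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE, color = "red") +
  labs(
    title = "Fuel Efficiency by Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon"
  ) +
  theme_minimal()

# A stored plot is drawn when it is printed
print(p_ci)
```

Storing the plot in an object makes it easy to add further layers later or to reuse the same plot in a report.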

How to Build Data Models in R

Linear regression

You can use linear regression to predict a continuous numerical outcome. For example, to predict a car’s miles per gallon (mpg) from its weight (wt) and horsepower (hp), you can fit this model:

lm_model <- lm(mpg ~ wt + hp, data = mtcars)
summary(lm_model)

But what does that mean?

  • lm() stands for linear model.

  • The response variable is mpg. This is the outcome you want to predict.

  • The predictor variables are wt and hp. These explain changes in the response.

Once you run the model, the output in your console should look like this:

Call:
lm(formula = mpg ~ wt + hp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max 
-3.941 -1.600 -0.182  1.050  5.854 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 37.22727    1.59879  23.285  < 2e-16 ***
wt          -3.87783    0.63273  -6.129 1.12e-06 ***
hp          -0.03177    0.00903  -3.519  0.00145 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.593 on 29 degrees of freedom
Multiple R-squared:  0.8268,    Adjusted R-squared:  0.8148 
F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

Here is an interpretation of the linear regression model:

  • You built a model of miles per gallon (mpg) based on weight (wt) and horsepower (hp).

  • The intercept, 37.227, is the predicted mpg when wt = 0 and hp = 0. The intercept is always the baseline value of the outcome when all predictors in the model are zero.

  • With each additional unit of weight (1000 lbs), mpg decreases by about 3.878, holding horsepower constant. The p-value is below 0.001, so the effect of weight on mpg is strong and statistically significant.

  • With each additional unit of horsepower, mpg decreases by about 0.032, holding weight constant. The p-value of 0.00145 is less than 0.01, indicating that horsepower is a statistically significant predictor of mpg, although its per-unit effect is small compared to that of vehicle weight.

Does the model fit the data, and why?

The R-squared value shows that about 83% of the variance in mpg is explained by weight and horsepower.

Summary of interpretation: Cars that are heavier and have higher horsepower tend to have lower fuel efficiency. These two variables explain most of the variation in mpg in the dataset.
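Once a model is fitted, you can use it to predict new values. The following sketch refits the same model and predicts mpg for a hypothetical car weighing 3,000 lbs with 110 horsepower. The new_car values are made up for illustration.

```r
lm_model <- lm(mpg ~ wt + hp, data = mtcars)

# Extract the coefficients and R-squared from the fitted model
coef(lm_model)
summary(lm_model)$r.squared   # about 0.83

# A hypothetical car: wt is in units of 1000 lbs
new_car <- data.frame(wt = 3, hp = 110)
predicted_mpg <- predict(lm_model, newdata = new_car)
predicted_mpg                 # about 22.1

# The same prediction by hand: intercept + slope * value for each predictor
b <- coef(lm_model)
manual_mpg <- b["(Intercept)"] + b["wt"] * 3 + b["hp"] * 110
manual_mpg
```

Computing the prediction by hand is a useful check that you are reading the coefficient table correctly.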

Logistic Regression

You can use logistic regression for binary outcomes, such as yes/no questions. For example, you can predict whether a vehicle has an automatic or manual transmission based on its weight and horsepower.

glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)
summary(glm_model)

Let’s break down the code:

  • glm() stands for generalized linear model.

  • The family = binomial option tells R to run a logistic regression.

  • The response variable am indicates the type of transmission: 0 = automatic, 1 = manual.

  • The predictor variables remain wt and hp.

Once you run the model, the output in your console should look like this:

Call:
glm(formula = am ~ wt + hp, family = binomial, data = mtcars)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept) 18.86630    7.44356   2.535  0.01126 * 
wt          -8.08348    3.06868  -2.634  0.00843 **
hp           0.03626    0.01773   2.044  0.04091 * 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 43.230  on 31  degrees of freedom
Residual deviance: 10.059  on 29  degrees of freedom
AIC: 16.059

Number of Fisher Scoring iterations: 8

Here is an interpretation of the logistic regression model:

  • The intercept, 18.866, represents the log-odds of the car being manual when wt = 0 and hp = 0. The intercept is always the baseline value of the outcome, here on the log-odds scale, when all predictors in the model are zero.

  • With each additional unit of weight (1000 lbs), the log-odds of the car being manual decrease by 8.083. This variable strongly affects the likelihood of a car being manual, and its low p-value of 0.008 shows the effect is statistically significant.

  • With each additional unit of horsepower, the log-odds of the car being manual increase by 0.036. This variable also affects the probability of a manual transmission, and its p-value of 0.041 is statistically significant at the 0.05 level.

Summary of interpretation: Heavier cars are more likely to be automatic, while higher-horsepower cars are slightly more likely to be manual. Together, wt and hp explain a large portion of the variation in transmission type.
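Log-odds are hard to read directly, so it often helps to convert them. The sketch below refits the same model, turns the coefficients into odds ratios with exp(), and predicts the probability of a manual transmission for a hypothetical car. The new_car values and the 0.5 cutoff are assumptions chosen for illustration.

```r
glm_model <- glm(am ~ wt + hp, data = mtcars, family = binomial)

# exp() turns log-odds coefficients into odds ratios
odds_ratios <- exp(coef(glm_model))
odds_ratios   # below 1 lowers the odds of manual, above 1 raises them

# Predicted probability of a manual transmission for a hypothetical car
new_car <- data.frame(wt = 3, hp = 110)
prob_manual <- predict(glm_model, newdata = new_car, type = "response")
prob_manual   # a value between 0 and 1

# Classify every car in mtcars with a 0.5 cutoff and compare with the truth
predicted_class <- ifelse(fitted(glm_model) > 0.5, 1, 0)
table(Predicted = predicted_class, Actual = mtcars$am)
```

The table at the end compares predictions with the actual transmission types for the same cars the model was fitted on, so it gives an optimistic picture of accuracy.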

Conclusion

In this tutorial, you learned how to use R for data analysis, visualization, and statistical modeling, and how to set up your R environment and work with basic data types and data structures.

This article also shows you how to import real-world datasets and explore them using summary statistics. This should help you understand your data before analysis.

Using ggplot2, we visualized relationships and identified patterns. We built and interpreted a linear regression model to predict fuel efficiency and a logistic regression model to classify transmission type.

You also learned how to interpret coefficients, p-values, and goodness-of-fit measures.

With these skills, you can load datasets, visualize trends, and build simple predictive models in R. Keep practicing with new datasets and explore more advanced techniques to improve your data analysis skills.
