A Gentle Introduction to Principal Component Analysis (PCA) in Python

by SkillAiNest

Photo by Author | Ideogram

Principal component analysis (PCA) is one of the most popular techniques for reducing the dimensionality of high-dimensional data. It is an important data transformation process in various real-world scenarios and industries like image processing, finance, genetics, and machine learning applications, where data contains many features that need to be analyzed more efficiently.

The reasons why dimensionality reduction techniques like PCA are important are manifold, with three of them standing out:

  • Performance: Reducing the number of features in your data implies a reduction in the computational cost of data-related processes like training advanced machine learning models.
  • Visualization: Projecting your data onto a low-dimensional space that is easier to interpret and visualize in 2D or 3D, while maintaining its key patterns and properties, sometimes helps gain insights.
  • Noise reduction: Often, high-dimensional data may contain redundant or noisy features that, when detected by methods like PCA, can be eliminated while preserving (or even improving) the effectiveness of subsequent analyses.

Hopefully, by this point I have convinced you of the practical relevance of PCA when handling complex data. If that is the case, keep reading, as we will now start learning how to use PCA in Python.

How to Apply Principal Component Analysis in Python

Thanks to supporting libraries like scikit-learn, which contains an abstracted implementation of the PCA algorithm, using it on your data is relatively straightforward as long as the data are numerical, previously preprocessed, and free of missing values, with feature values standardized to avoid issues like variance dominance. This is particularly important, since PCA is a deeply statistical method that relies on feature variances to determine the principal components: new features derived from the original ones and orthogonal to each other.
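To make this idea more concrete, the following is a minimal NumPy sketch (not part of the tutorial's code, with purely illustrative variable names) showing how principal components arise as orthogonal directions of maximum variance computed from the data's covariance matrix:

import numpy as np

rng = np.random.default_rng(42)
X_demo = rng.normal(size=(100, 3))           # toy data: 100 samples, 3 features
X_centered = X_demo - X_demo.mean(axis=0)    # center each feature

cov = np.cov(X_centered, rowvar=False)       # 3x3 covariance matrix
eigenvalues, eigenvectors = np.linalg.eigh(cov)

# Columns of eigenvectors are orthogonal principal directions; eigenvalues
# give the variance captured along each one. Sort them by explained variance.
order = np.argsort(eigenvalues)[::-1]
print(eigenvalues[order] / eigenvalues.sum())  # proportion of variance per component

scikit-learn's PCA class, used below, wraps this kind of computation (internally via a singular value decomposition) behind a simple fit/transform interface.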

We will start using PCA from scratch, by importing the necessary libraries, loading the MNIST dataset of low-resolution images of handwritten digits, and putting it into a pandas DataFrame:

import pandas as pd
from torchvision import datasets

# Load the MNIST training set of handwritten digit images
mnist_data = datasets.MNIST(root="./data", train=True, download=True)

# Flatten each 28x28 image into a list of 784 pixel values, preceded by its label
data = []
for img, label in mnist_data:
    img_array = list(img.getdata())
    data.append([label] + img_array)

columns = ["label"] + [f"pixel_{i}" for i in range(28*28)]
mnist_data = pd.DataFrame(data, columns=columns)
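If you want to verify the result of the loading step, a quick optional check (not part of the original code) on the DataFrame's shape should report 60,000 rows and 785 columns:

print(mnist_data.shape)   # expected: (60000, 785) -> 1 label column + 784 pixel columns
print(mnist_data.head())  # first few rows: label followed by pixel_0 ... pixel_783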

In the MNIST dataset, each example is a 28×28 square image, with a total of 784 pixels, each containing a numerical value associated with its gray level, ranging from 0 for black (no intensity) to 255 for white (maximum intensity). These pixel values must first be rearranged from their original 28×28 grid into a one-dimensional array rather than a two-dimensional one. This process, called flattening, takes place in the code above, which yields the final dataset in DataFrame format with a total of 785 variables: one for each of the 784 pixels, plus the label, indicating the digit value between 0 and 9 actually written in the image.
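To see the relationship between the flattened representation and the original grid, here is a small optional snippet (assuming NumPy is available; variable names are illustrative) that reshapes the first image's 784 pixel values back into a 28×28 grid:

import numpy as np

first_image_flat = mnist_data.iloc[0, 1:].to_numpy()  # 784 pixel values (label column excluded)
first_image_grid = first_image_flat.reshape(28, 28)   # back to the original 2D layout

print(first_image_flat.shape)  # (784,)
print(first_image_grid.shape)  # (28, 28)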

MNIST dataset | Source: TensorFlow

In this example we will not need the labels (useful for other use cases like image classification), but we will assume we may need to keep them for future analysis, so we will separate them from the rest of the image pixels into a new variable:

X = mnist_data.drop('label', axis=1)

y = mnist_data.label

Even though we will not apply supervised learning techniques after PCA, we will assume we may need to do so in future analyses, so we will split the dataset into training (80%) and test (20%) subsets. There is another reason we are doing this, which I will explain a bit later.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.2, random_state=42)

Preprocessing the data and making it suitable for the PCA algorithm is as important as applying the algorithm itself. In our example, preprocessing entails scaling the original pixel intensities to a standardized range with a mean of 0 and a standard deviation of 1, so that all features contribute equally to the computation of variances and the dominance of some features is avoided. To do this, we will use the StandardScaler class from sklearn.preprocessing, which standardizes numerical features:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Note the use of fit_transform for the training data, while transform was used for the test data instead. This is the other reason why we split the data into training and test sets earlier, to have the chance to discuss this: data transformations such as scaling numerical attributes must be applied consistently across the training and test sets. The fit_transform method is used on the training data because it calculates the necessary statistics that will guide the data transformation process from the training set (fitting), and then applies the transformation. Meanwhile, the transform method is used on the test data, applying the same transformation learned from the training data to the test set. This ensures that the model sees the test data on the same scale used for the training data, preserving consistency and avoiding issues like data leakage or bias.
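As an optional check that the scaling behaves as described (a hedged addition, not part of the original tutorial), you can inspect the per-feature means of the transformed sets. Only the training set is guaranteed to be centered exactly, since the test set is scaled with the training set's statistics:

import numpy as np

# Per-feature means of the scaled training data should all be numerically close to 0
print(np.abs(X_train_scaled.mean(axis=0)).max())

# The test set is scaled with statistics learned from the training set,
# so its per-feature means will be close to, but generally not exactly, 0
print(np.abs(X_test_scaled.mean(axis=0)).max())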

Now we can apply the PCA algorithm. In scikit-learn's implementation, PCA takes one important argument: n_components. This hyperparameter determines the proportion of principal components to retain. Larger values closer to 1 mean retaining more components and capturing more of the variance in the original data, while lower values closer to 0 mean keeping fewer components and applying a more aggressive dimensionality reduction strategy. For example, setting n_components to 0.95 means retaining just enough components to capture 95% of the original data's variance, which may be appropriate for reducing the data's dimensionality while preserving most of its information. If the data's dimensionality is significantly reduced after applying this setting, it means many of the original features did not contain much statistically relevant information.

from sklearn.decomposition import PCA

pca = PCA(n_components = 0.95)
X_train_reduced = pca.fit_transform(X_train_scaled)

X_train_reduced.shape

Using the shape attribute of the resulting dataset after applying PCA, we can see that the dimensionality of the data has been reduced from 784 features to just 325, while still retaining 95% of the important information.
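To confirm the retained variance, and to project the test set with the same fitted transformation, a short continuation like the following can be used (relying only on standard scikit-learn attributes and methods):

print(pca.n_components_)                       # number of principal components kept (325 here)
print(pca.explained_variance_ratio_.sum())     # total variance retained, should be >= 0.95

X_test_reduced = pca.transform(X_test_scaled)  # apply the same projection to the test set
print(X_test_reduced.shape)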

Is that a good result? Answering this question largely depends on the kind of application or analysis you want to perform with your reduced data. For instance, if you want to build an image classifier of digit images, you may want to build two classification models: one trained with the original, high-dimensional dataset, and one trained with the reduced dataset. If there is no significant loss of classification accuracy in the second classifier, good news: you achieved a faster classifier (dimensionality reduction normally implies greater efficiency in training and inference) with similar classification performance to that obtained using the original data.
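As an illustration of that comparison, here is a minimal sketch (not part of the original tutorial) using scikit-learn's LogisticRegression as an example classifier; any classifier could be used instead, and the hyperparameters shown are arbitrary:

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Classifier trained on the original (scaled) high-dimensional data
clf_full = LogisticRegression(max_iter=200)
clf_full.fit(X_train_scaled, y_train)
acc_full = accuracy_score(y_test, clf_full.predict(X_test_scaled))

# Classifier trained on the PCA-reduced data; the test set must be projected
# with the same fitted PCA before prediction
clf_reduced = LogisticRegression(max_iter=200)
clf_reduced.fit(X_train_reduced, y_train)
acc_reduced = accuracy_score(y_test, clf_reduced.predict(pca.transform(X_test_scaled)))

print(f"Accuracy with original features: {acc_full:.4f}")
print(f"Accuracy with PCA-reduced features: {acc_reduced:.4f}")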

Wrapping Up

This article presented a step-by-step tutorial on how to use the PCA algorithm in Python from scratch to reduce the dimensionality of a high-dimensional dataset of handwritten digit images.

Iván Palomares Carrascosa is a leader, writer, speaker, and adviser in AI, machine learning, deep learning, and LLMs. He trains and guides others in harnessing AI in the real world.
