XAI - Dimensionality Reduction

Are they useful? Is there a difference using projections for pre-processing vs. post-hoc embeddings? PCA, TSNE, UMAP

Submission for XAI - University of Konstanz, Summer Semester 2023


Introduction to Dimensionality Reduction

In this section we give a short introduction to what dimensionality reduction is, its main methods, and where it is useful.
(Keep in mind that the words attributes, features, and dimensions refer to the same thing, so they will be used interchangeably throughout this paper. The term DR will also be used to refer to Dimensionality Reduction.)
Data is the main component of any machine learning task. We humans can perceive at most 3 dimensions, but in real-world data the number of dimensions can go up to millions; for example, a 1000x1000 pixel image already has a million pixel values, each of which can be treated as a dimension.
For us humans it is impossible to imagine this many features. While there are high-dimensional visualization techniques such as parallel coordinate plots (PCP) that can help in this case, they still have drawbacks. Another method that can help us is dimensionality reduction.
Dimensionality Reduction is the process of reducing the number of dimensions while retaining as much of the variation in the original dataset as possible.
In a machine learning pipeline, dimensionality reduction can be used in the pre-processing step or as a post-hoc step.
We will discuss further what the difference is in both of these cases.
A downside of dimensionality reduction is that we have to give up some of the variance in the original data. Later we will discuss how to choose the appropriate number of features. While this may seem like a big problem, dimensionality reduction brings us plenty of "goodies" in return.

Dimensionality Reduction methods can be divided into 2 groups: linear methods (such as PCA) and non-linear methods (such as t-SNE and UMAP).

Exploring the methods and doing experiments

In this section we will take a closer look at the DR methods, along with small code snippets and experiments to explain what is happening inside each method.
Before diving into the methods, I will first talk about the dataset used for the experiments.
The dataset I chose is called the "Palmer Penguins Dataset".

A sample of this dataset is as follows

For the pre-processing steps we turn the nominal data into numerical categories, normalize the values, and drop any data point with null values; a sketch of this is shown below.
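A minimal sketch of these pre-processing steps, assuming the dataset is loaded through seaborn's built-in "penguins" loader (column names may differ in other copies of the dataset):

import pandas as pd
import seaborn as sns
from sklearn.preprocessing import StandardScaler

# Load the Palmer Penguins dataset (assumption: seaborn's copy of it)
penguins = sns.load_dataset("penguins")

# Drop any data point with null values
penguins = penguins.dropna()

# Turn the nominal columns into numerical categories
for col in ["species", "island", "sex"]:
    penguins[col] = penguins[col].astype("category").cat.codes

# Normalize the values
data = pd.DataFrame(StandardScaler().fit_transform(penguins), columns=penguins.columns)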

PCA

Principal Component Analysis (PCA) is a fairly old and common technique; it dates back to 1901. It is a linear technique that finds a representation of the data retaining the maximum amount of non-redundant, uncorrelated information.
The steps to calculate PCA are the following (a from-scratch sketch follows the list):

  1. Subtract the mean
  2. Calculate the covariance matrix
  3. Calculate eigenvectors and eigenvalues
  4. Form a feature vector
  5. Derive the new data set
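These five steps can be sketched from scratch with NumPy; this is only an illustration, the sklearn implementation below is what we actually use for the experiments.

import numpy as np

def pca_from_scratch(X, k):
    # 1. Subtract the mean of every feature
    X_centered = X - X.mean(axis=0)
    # 2. Calculate the covariance matrix
    cov = np.cov(X_centered, rowvar=False)
    # 3. Calculate eigenvectors and eigenvalues
    eigenvalues, eigenvectors = np.linalg.eigh(cov)
    # 4. Form the feature vector: keep the k eigenvectors with the largest eigenvalues
    order = np.argsort(eigenvalues)[::-1][:k]
    feature_vector = eigenvectors[:, order]
    # 5. Derive the new data set by projecting onto the feature vector
    return X_centered @ feature_vector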

Using the sklearn library we can easily apply PCA with the following:

from sklearn.decomposition import PCA

pca = PCA(n_components=2)  # project onto the first two principal components
components = pca.fit_transform(data)

Which gives us the output

We can clearly see that DR helped us visualize high-dimensional data in a 2D plot. The plot suggests that there are 5 clusters of penguins, and within each cluster there is a distinction between males and females.
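For reference, a scatter plot like the one above can be produced with matplotlib; colouring the points by species is an assumption about how the original figure was generated.

import matplotlib.pyplot as plt

# Scatter plot of the first two principal components
plt.scatter(components[:, 0], components[:, 1], c=penguins["species"])
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()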

For my first experiment to help understand PCA, I would like to demonstrate how easily PCA can be skewed by noisy or outlier data points.
For this I removed the step that deletes outliers.

As we can see, outliers have a massive impact on the final output of PCA. As we explained, each principal component captures part of the variance of the dataset.
If we use the following attribute from sklearn, we can access the explained variance of each component:

# Cumulative explained variance ratio (note: computed from the fitted pca object, not the transformed components)
exp_var_cumul = np.cumsum(pca.explained_variance_ratio_)

We get this output

For a chosen number of components k, PCA will retain the k principal components with the highest explained variance (sorted from high to low).
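This cumulative curve is also one way to answer the earlier question of how many features to keep: a common rule of thumb (an assumption here, not something fixed by the method) is to pick the smallest k whose cumulative explained variance passes a chosen threshold, for example 95%.

# Sketch: fit PCA on all components and pick the smallest k covering 95% of the variance
full_pca = PCA().fit(data)
cumul = np.cumsum(full_pca.explained_variance_ratio_)
k = int(np.argmax(cumul >= 0.95)) + 1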

t-SNE

t-SNE is widely used in image processing and NLP. The scikit-learn documentation recommends using PCA or Truncated SVD before t-SNE if the number of features in the dataset is more than 50. The following is the general syntax to perform t-SNE after PCA. Also note that feature scaling is required before PCA.
Explanation of how it works:

First, we measure the distance between the point of interest and another point and plot that distance on a normal curve centered on the point of interest. We then draw a line from the plotted point up to the curve; the length of that line is the unscaled similarity between the two points. We repeat this for every pair: the distances from the point of interest to all other points are plotted on the normal curve, and the heights give the unscaled similarity scores with respect to that point. Using a normal distribution means that distant points get very low similarity values and close points get high similarity values.
The next step is to scale the unscaled similarities so that they add up to 1. Why do the similarity scores need to add up to 1? The width of the normal curve depends on the density of the data near the point of interest: less dense regions get wider curves. So if one cluster has half the density of another, its curve is roughly twice as wide, and scaling the similarity scores makes them comparable for both clusters.
t-SNE also has a perplexity parameter, which roughly corresponds to the expected density (the effective number of neighbors) around each point, and it comes into play here; still, sparse and dense clusters end up more similar than you might expect. Back in the original scatter plot, we calculate similarity scores for one point, then for the next, and so on for all the points. One last thing: because the width of the distribution is based on the density of the surrounding data points, the similarity of point A to point B is not necessarily the same as the similarity of point B to A.
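To make the description above more concrete, here is a minimal sketch of the scaled similarity scores for a single point of interest. It assumes a fixed Gaussian width sigma instead of the perplexity-based search that the real algorithm performs.

import numpy as np

def similarity_scores(X, i, sigma=1.0):
    # Distances from the point of interest X[i] to every other point
    distances = np.linalg.norm(X - X[i], axis=1)
    # Plot the distances on a normal curve centered on the point of interest:
    # distant points get very low unscaled similarity, close points high values
    unscaled = np.exp(-(distances ** 2) / (2 * sigma ** 2))
    unscaled[i] = 0.0  # a point is not compared with itself
    # Scale the similarities so that they add up to 1
    return unscaled / unscaled.sum()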

from sklearn.manifold import TSNE

tsne = TSNE(n_components=2, n_iter=1000, perplexity=80)  # a high perplexity favours global structure
projections = tsne.fit_transform(data)

The output of this code:

As we explained above, perplexity is a hyperparameter: a lower value will give you more noise, since the t-distribution then captures mostly the local structure, while a higher perplexity will try to preserve the global structure.
t-SNE with a high perplexity value: as we can see, it shows us the same global structure as PCA.

t-SNE with a low perplexity value: this demonstrates how a lower perplexity gives us a noisier embedding.
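Both plots can be reproduced with a small loop like the sketch below; the low perplexity value of 5 is an illustrative assumption, since the exact value used for the noisier plot is not stated above, while 80 is the value from the earlier snippet.

# Run t-SNE with a low and a high perplexity value and compare the embeddings
for perplexity in (5, 80):
    tsne = TSNE(n_components=2, n_iter=1000, perplexity=perplexity)
    projections = tsne.fit_transform(data)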