The Beginner's Guide to Dimensionality Reduction

Explore the methods that data scientists use to visualize high-dimensional data.

July 16, 2018

Art or science?

Your browser has just loaded information about roughly 800 artworks from the collection at the Metropolitan Museum of Art. The museum has publicly released a large dataset about their collection^[5], just a small fraction are displayed here. They are positioned randomly.

Hover over an artwork to see its details.

Each artwork includes basic metadata, such as its title, artist, date made, medium, and dimensions. Data scientists like to call metadata for each data point (artwork) features. Below are some of the features of 10 artworks in the dataset.

1680

Spindle-back armchair

1875

Armchair

Pottier and Stymus Manufacturing Company

1776

Basket

Myer Myers

1650

Basin

Master Potter A

1710

Two-handled Bowl

Cornelius Kierstede

1876

The Bryant Vase

James Horton Whitehouse|Tiffany & Co.|Eugene J. Soligny|Augustus Saint-Gaudens

1765

Bureau table

John Townsend

1880

Cabinet

Daniel Pabst

1866

Cabinet

Alexander Roux

1835

Celery vase

Boston & Sandwich Glass Company

Projecting onto a line

These features can be thought of as vectors existing in a high-dimensional space. Visualizing the vectors would reveal a lot about the distribution of the data, however humans can’t see so many dimensions all at once.

Instead the data can be projected onto a lower dimension, one that can be visualized directly. This kind of projection is called an embedding.

Computing a 1-dimensional embedding requires taking each artwork and computing a single number to describe it. A benefit of reducing to 1D is that the numbers, and the artworks, can be sorted on a line.

On the right you see the artwork positioned according to their average pixel brightness. Notice that the images are sorted, with the darkest images appearing at the top and the brightest images on the bottom!

For the mathematically inclined

Dimensionality reduction can be formulated mathematically in the context of a given dataset. Consider a dataset represented as a matrix $X$ , where $X$ is of size $m \times n$ , where $m$ represents the number of rows of $X$ , and $n$ representes the number of columns.

Typically, the rows of the matrix are data points and the columns are features. Dimensionality reduction will reduce the number of features of each data point, turning $X$ into a new matrix, $X'$ , of size $m \times d$ , where $d < n$ . For visualizations we typically set $d$ to be 1, 2 or 3.

Say $m=n$ , that is $X$ is a square matrix. Performing dimensionality reduction on $X$ and setting $d=2$ will change it from a square matrix to a tall, rectangular matrix.

X = \begin{bmatrix} x & x & x \\ x & x & x \\ x & x & x \end{bmatrix} \implies \begin{bmatrix} x' & x' \\ x' & x' \\ x' & x' \end{bmatrix} = X'

Each data point only has two features now, i.e., each data point has been reduced from a 3 dimensional vector to a 2 dimensional vector.

Embedding data in two dimensions

The same brightness feature can be used to position the artworks in 2D space instead of 1D. The pieces have more room to spread out.

On the right you see a simple 2-dimensional embedding based on image brightness, but this isn’t the only way to position the artworks. In fact, there are many, and some projections are more useful than others.

Use the slider to vary the influence that the brightness and artwork age have in determining the embedding positions. As you move the slider from brightness to artwork age, the embedding changes from highlighting bright and dark images, and starts to cluster recent modern-day images in the bottom left corner whereas older artworks are moved farther away (hover over images to see their date).

Artwork Age Brightness

Real-world algorithms

The previous section showed an example of a user-driven embedding, where the exact influence of each feature is known. However, you may have noticed that it’s hard to find meaningful combinations of feature weights.

State-of-the-art algorithms can find an optimal combination of features so that distances in the high dimensional space are preserved in the embedding. Use the tool below to project the artworks using three commonly used algorithms.

In this example the reduction is performed on the pixels of each image: each image is flattened into a single vector, where each pixel represents one feature. The vectors records are then reduced to two dimensions.

Principal component analysis

Pros:

Relatively computationally cheap.
Can save embedding model to then project new data points into the reduced space.

Cons:

Linear reduction limits information that can be captured; not as discriminably clustered as other algorithms.

There are many algorithms that compute a dimensionality reduction of a dataset. Simpler algorithms such as principal component analysis (PCA) maximize the variance in the data to produce the best possible embedding. More complicated algorithms, such as t-distributed stochastic neighbor embedding (t-SNE)^[2], iteratively produce highly clustered embeddings. Unfortunately, whereas before the influence of each feature was explicitly known, one must relinquish control to the algorithm to determine the best embedding— that means that it is not clear what features of the data are used to compute the embedding. This can be problematic for misinterpreting what an embedding is showing^[11].

Dimensionality reduction, and more broadly the field of unsupervised learning, is an active area of research where researchers are developing new techniques to create better embeddings. A new technique, uniform manifold approximation and projection (UMAP)^[4], is a non-linear reduction that aims to create visually striking embeddings fast, scaling to larger datasets.

Try it for yourself

Dimensionality reduction is a powerful tool to better understand high-dimensional data. If you have your own dataset and wish to visualize it using dimensionality reduction, there are a number of different algorithms ^[3] and implementations available. In Python, the scikit-learn package ^{[8, 9]} provides APIs for many unsupervised dimensionality reduction algorithms, as well as manifold learning: an approach to non-linear dimensionality reduction.

Regarding the three algorithms discussed above, you can find the open-source Python implementations that we used here: PCA, t-SNE ^[2], and UMAP^[4].

Acknowledgments

This article was created using Idyll.
The source code is available on Github.
Thanks to everyone who gave feedback on this article, including Jeffrey Heer, Polo Chau, Caleb Robinson, Nicky Case, and Hamish Todd, and support from a NASA Space Technology Research Fellowship.

References

Uber die stetige Abbildung einer Linie auf ein Flachenstuck.
David Hilbert.
Dritter Band: Analysis Grundlagen der Mathematik Physik Verschiedenes , 1890.
Visualizing Data Using t-SNE.
Laurens van der Maaten, Geoffrey Hinton.
Journal of machine learning research, 2008.
Dimensionality Reduction: A Comparative Review.
Laurens van der Maaten, Eric Postma, Jaap Van den Herik.
Tilburg University Technical Report, TiCC-TR, 2009.
UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
Leland McInnes, John Healy.
arXiv, 2018.
The Metropolitan Museum of Art Open Access.
The Metropolitan Museum of Art.
Github, 2017.
Analysis of the Clustering Properties of the Hilbert Space-filling Curve.
Bongki Moon, Hosagrahar V Jagadish, Christos Faloutsos, Joel H. Saltz.
IEEE Transactions on knowledge and data engineering, 2001.
Visualizing MNIST: An Exploration of Dimensionality Reduction.
Chris Olah.
Colah's Blog, 2014.
Scikit-learn: Machine Learning in Python.
F. Pedregosa, G. Varoquaux, A. Gramfort, and V. Michel, B. Thirion, O. Grisel, M. Blondel, and P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay.
Journal of Machine Learning Research, 2011.
API Design for Machine Learning Software: Experiences from the Scikit-learn Project.
Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, Gael Varoquaux.
ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013.
Embedding Projector: Interactive Visualization and Interpretation of Embeddings.
Daniel Smilkov, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B Viegas, Martin Wattenberg.
arXiv, 2016.
How to Use t-SNE Effectively.
Martin Wattenberg, Fernanda Viegas, Ian Johnson.
Distill, 2016.