The Beginner's Guide to Dimensionality Reduction

Explore the methods that data scientists use to visualize high-dimensional data.

July 16, 2018

Art or science?

Your browser has just loaded information about roughly 800 artworks from the collection at the Metropolitan Museum of Art. The museum has publicly released a large dataset about their collection[5], just a small fraction are displayed here. They are positioned randomly.

Hover over an artwork to see its details.

Each artwork includes basic metadata, such as its title, artist, date made, medium, and dimensions. Data scientists like to call metadata for each data point (artwork) features. Below are some of the features of 10 artworks in the dataset.


Year
Title
Artist
1680
Spindle-back armchair
1875
Armchair
Pottier and Stymus Manufacturing Company
1776
Basket
Myer Myers
1650
Basin
Master Potter A
1710
Two-handled Bowl
Cornelius Kierstede
1876
The Bryant Vase
James Horton Whitehouse|Tiffany & Co.|Eugene J. Soligny|Augustus Saint-Gaudens
1765
Bureau table
John Townsend
1880
Cabinet
Daniel Pabst
1866
Cabinet
Alexander Roux
1835
Celery vase
Boston & Sandwich Glass Company
Loading...

Projecting onto a line

These features can be thought of as vectors existing in a high-dimensional space. Visualizing the vectors would reveal a lot about the distribution of the data, however humans can’t see so many dimensions all at once.

Instead the data can be projected onto a lower dimension, one that can be visualized directly. This kind of projection is called an embedding.

Computing a 1-dimensional embedding requires taking each artwork and computing a single number to describe it. A benefit of reducing to 1D is that the numbers, and the artworks, can be sorted on a line.

On the right you see the artwork positioned according to their average pixel brightness. Notice that the images are sorted, with the darkest images appearing at the top and the brightest images on the bottom!

For the mathematically inclined

Dimensionality reduction can be formulated mathematically in the context of a given dataset. Consider a dataset represented as a matrix XX , where XX is of size m×nm \times n , where mm represents the number of rows of XX , and nn representes the number of columns.

Typically, the rows of the matrix are data points and the columns are features. Dimensionality reduction will reduce the number of features of each data point, turning XX into a new matrix, XX' , of size m×dm \times d , where d<nd < n . For visualizations we typically set dd to be 1, 2 or 3.

Say m=nm=n , that is XX is a square matrix. Performing dimensionality reduction on XX and setting d=2d=2 will change it from a square matrix to a tall, rectangular matrix.

X=[xxxxxxxxx][xxxxxx]=X X = \begin{bmatrix} x & x & x \\ x & x & x \\ x & x & x \end{bmatrix} \implies \begin{bmatrix} x' & x' \\ x' & x' \\ x' & x' \end{bmatrix} = X'

Each data point only has two features now, i.e., each data point has been reduced from a 3 dimensional vector to a 2 dimensional vector.

Embedding data in two dimensions

The same brightness feature can be used to position the artworks in 2D space instead of 1D. The pieces have more room to spread out.

On the right you see a simple 2-dimensional embedding based on image brightness, but this isn’t the only way to position the artworks. In fact, there are many, and some projections are more useful than others.

Use the slider to vary the influence that the brightness and artwork age have in determining the embedding positions. As you move the slider from brightness to artwork age, the embedding changes from highlighting bright and dark images, and starts to cluster recent modern-day images in the bottom left corner whereas older artworks are moved farther away (hover over images to see their date).

Artwork Age Brightness

Real-world algorithms

The previous section showed an example of a user-driven embedding, where the exact influence of each feature is known. However, you may have noticed that it’s hard to find meaningful combinations of feature weights.

State-of-the-art algorithms can find an optimal combination of features so that distances in the high dimensional space are preserved in the embedding. Use the tool below to project the artworks using three commonly used algorithms.

In this example the reduction is performed on the pixels of each image: each image is flattened into a single vector, where each pixel represents one feature. The records are then reduced to two dimensions.

Principal component analysis

Pros:

  • Relatively computationally cheap.
  • Can save embedding model to then project new data points into the reduced space.

Cons:

  • Linear reduction limits information that can be captured; not as discriminably clustered as other algorithms.

There are many algorithms that compute a dimensionality reduction of a dataset. Simpler algorithms such as principal component analysis (PCA) maximize the variance in the data to produce the best possible embedding. More complicated algorithms, such as t-distributed stochastic neighbor embedding (t-SNE)[2], iteratively produce highly clustered embeddings. Unfortunately, whereas before the influence of each feature was explicitly known, one must relinquish control to the algorithm to determine the best embedding— that means that it is not clear what features of the data are used to compute the embedding. This can be problematic for misinterpreting what an embedding is showing[11].

Dimensionality reduction, and more broadly the field of unsupervised learning, is an active area of research where researchers are developing new techniques to create better embeddings. A new technique, uniform manifold approximation and projection (UMAP)[4], is a non-linear reduction that aims to create visually striking embeddings fast, scaling to larger datasets.

Try it for yourself

Dimensionality reduction is a powerful tool to better understand high-dimensional data. If you have your own dataset and wish to visualize it using dimensionality reduction, there are a number of different algorithms [3] and implementations available. In Python, the scikit-learn package [8, 9] provides APIs for many unsupervised dimensionality reduction algorithms, as well as manifold learning: an approach to non-linear dimensionality reduction.

Regarding the three algorithms discussed above, you can find the open-source Python implementations that we used here: PCA, t-SNE [2], and UMAP[4].

Acknowledgments

  • This article was created using Idyll.
  • The source code is available on Github.
  • Thanks to everyone who gave feedback on this article, including Jeffrey Heer, Polo Chau, Caleb Robinson, Nicky Case, and Hamish Todd.

References

  1. Uber die stetige Abbildung einer Linie auf ein Flachenstuck.
    David Hilbert.
    Dritter Band: Analysis Grundlagen der Mathematik Physik Verschiedenes , 1890.
  2. Visualizing Data Using t-SNE.
    Laurens van der Maaten, Geoffrey Hinton.
    Journal of machine learning research, 2008.
  3. Dimensionality Reduction: A Comparative Review.
    Laurens van der Maaten, Eric Postma, Jaap Van den Herik.
    Tilburg University Technical Report, TiCC-TR, 2009.
  4. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction.
    Leland McInnes, John Healy.
    arXiv, 2018.
  5. The Metropolitan Museum of Art Open Access.
    The Metropolitan Museum of Art.
    Github, 2017.
  6. Analysis of the Clustering Properties of the Hilbert Space-filling Curve.
    Bongki Moon, Hosagrahar V Jagadish, Christos Faloutsos, Joel H. Saltz.
    IEEE Transactions on knowledge and data engineering, 2001.
  7. Visualizing MNIST: An Exploration of Dimensionality Reduction.
    Chris Olah.
    Colah's Blog, 2014.
  8. Scikit-learn: Machine Learning in Python.
    F. Pedregosa, G. Varoquaux, A. Gramfort, and V. Michel, B. Thirion, O. Grisel, M. Blondel, and P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay.
    Journal of Machine Learning Research, 2011.
  9. API Design for Machine Learning Software: Experiences from the Scikit-learn Project.
    Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, Vlad Niculae, Peter Prettenhofer, Alexandre Gramfort, Jaques Grobler, Robert Layton, Jake VanderPlas, Arnaud Joly, Brian Holt, Gael Varoquaux.
    ECML PKDD Workshop: Languages for Data Mining and Machine Learning, 2013.
  10. Embedding Projector: Interactive Visualization and Interpretation of Embeddings.
    Daniel Smilkov, Nikhil Thorat, Charles Nicholson, Emily Reif, Fernanda B Viegas, Martin Wattenberg.
    arXiv, 2016.
  11. How to Use t-SNE Effectively.
    Martin Wattenberg, Fernanda Viegas, Ian Johnson.
    Distill, 2016.