

UMAP

Uniform Manifold Approximation and Projection (UMAP) is a dimension reduction technique that can be used for visualisation similarly to t-SNE, but also for general non-linear dimension reduction. The algorithm is founded on three assumptions about the data:

The data is uniformly distributed on a Riemannian manifold;
The Riemannian metric is locally constant (or can be approximated as such);
The manifold is locally connected.

From these assumptions it is possible to model the manifold with a fuzzy topological structure. The embedding is found by searching for a low dimensional projection of the data that has the closest possible equivalent fuzzy topological structure.

The details for the underlying mathematics can be found in our paper on ArXiv:

McInnes, L, Healy, J, UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction, ArXiv e-prints 1802.03426, 2018

The important thing is that you don't need to worry about that: you can use UMAP right now for dimension reduction and visualisation as easily as a drop in replacement for scikit-learn's t-SNE.

Documentation is available via Read the Docs.

New: this package now also provides support for densMAP. The densMAP algorithm augments UMAP to preserve local density information in addition to the topological structure of the data. Details of this method are described in the following paper:

Narayan, A, Berger, B, Cho, H, Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability, bioRxiv, 2020

Installing

UMAP depends upon scikit-learn, and thus scikit-learn's dependencies such as numpy and scipy. UMAP adds a requirement for numba for performance reasons. The original version used Cython, but the improved code clarity, simplicity and performance of Numba made the transition necessary.

Requirements:

Python 3.6 or greater
numpy
scipy
scikit-learn
numba
tqdm

Recommended packages:

pynndescent
For plotting
- matplotlib
- datashader
- holoviews
For Parametric UMAP
- tensorflow > 2.0.0

Installing pynndescent can significantly increase performance, and in later versions it will become a hard dependency.
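Before installing anything a second time (for example with conda after pip), it can help to see which of the packages listed above are already importable. The following is a minimal sketch using only the Python standard library; the helper name missing_modules and the two name lists are illustrative assumptions, not part of the umap package.

```python
from importlib.util import find_spec

# Import names corresponding to the requirement lists above
# (scikit-learn is imported as "sklearn"). These lists are illustrative.
REQUIRED = ["numpy", "scipy", "sklearn", "numba", "tqdm"]
RECOMMENDED = ["pynndescent", "matplotlib", "datashader", "holoviews", "tensorflow"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    print("missing required:", missing_modules(REQUIRED))
    print("missing recommended:", missing_modules(RECOMMENDED))
```

A package that is already importable does not need to be installed again by a different installer; only the names reported as missing need attention.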

Install Options

Conda install, via the excellent work of the conda-forge team:

conda install -c conda-forge umap-learn

The conda-forge packages are available for Linux, OS X, and Windows 64 bit.

PyPI install, presuming you have numba and sklearn and all its requirements (numpy and scipy) installed:

pip install umap-learn

If you wish to use the plotting functionality you can use

pip install umap-learn[plot]

to install all the plotting dependencies.

If you wish to use Parametric UMAP, you need to install Tensorflow, which can be installed either using the instructions at https://www.tensorflow.org/install (recommended) or using

pip install umap-learn[parametric_umap]

for a CPU-only version of Tensorflow.

If pip is having difficulties pulling the dependencies then we'd suggest installing the dependencies manually using anaconda followed by pulling umap from pip:

conda install numpy scipy
conda install scikit-learn
conda install numba
pip install umap-learn

For a manual install get this package:

wget https://github.com/lmcinnes/umap/archive/master.zip
unzip master.zip
rm master.zip
cd umap-master

Install the requirements

sudo pip install -r requirements.txt

or

conda install scikit-learn numba

Install the package

How to use UMAP

The umap package inherits from sklearn classes, and thus drops in neatly next to other sklearn transformers with an identical calling API.

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP().fit_transform(digits.data)

There are a number of parameters that can be set for the UMAP class; the major ones are as follows:

n_neighbors: This determines the number of neighboring points used in local approximations of manifold structure. Larger values will result in more global structure being preserved at the loss of detailed local structure. In general this parameter should often be in the range 5 to 50, with a choice of 10 to 15 being a sensible default.

min_dist: This controls how tightly the embedding is allowed to compress points together. Larger values ensure embedded points are more evenly distributed, while smaller values allow the algorithm to optimise more accurately with regard to local structure. Sensible values are in the range 0.001 to 0.5, with 0.1 being a reasonable default.

metric: This determines the choice of metric used to measure distance in the input space. A wide variety of metrics are already coded, and a user defined function can be passed as long as it has been JITd by numba.

An example of making use of these options:

import umap
from sklearn.datasets import load_digits

digits = load_digits()

embedding = umap.UMAP(n_neighbors=5, min_dist=0.3, metric='correlation').fit_transform(digits.data)

UMAP also supports fitting to sparse matrix data. For more details please see the UMAP documentation.

Benefits of UMAP

UMAP has a few significant wins in its current incarnation.

First of all UMAP is fast. It can handle large datasets and high dimensional data without too much difficulty, scaling beyond what most t-SNE packages can manage. This includes very high dimensional sparse datasets. UMAP has successfully been used directly on data with over a million dimensions.

Second, UMAP scales well in embedding dimension; it isn't just for visualisation! You can use UMAP as a general purpose dimension reduction technique as a preliminary step to other machine learning tasks. With a little care it partners well with the hdbscan clustering library (for more details please see Using UMAP for Clustering).

Third, UMAP often performs better at preserving some aspects of global structure of the data than most implementations of t-SNE. This means that it can often provide a better "big picture" view of your data as well as preserving local neighbor relations.

Fourth, UMAP supports a wide variety of distance functions, including non-metric distance functions such as cosine distance and correlation distance. You can finally embed word vectors properly using cosine distance!

Fifth, UMAP supports adding new points to an existing embedding via the standard sklearn transform method. This means that UMAP can be used as a preprocessing transformer in sklearn pipelines.

Sixth, UMAP supports supervised and semi-supervised dimension reduction. This means that if you have label information that you wish to use as extra information for dimension reduction (even if it is just partial labelling) you can do that, as simply as providing it as the y parameter in the fit method.

Seventh, UMAP supports a variety of additional experimental features including: an "inverse transform" that can approximate a high dimensional sample that would map to a given position in the embedding space; the ability to embed into non-euclidean spaces including hyperbolic embeddings, and embeddings with uncertainty; very preliminary support for embedding dataframes also exists.

Finally, UMAP has solid theoretical foundations in manifold learning (see our paper on ArXiv). This both justifies the approach and allows for further extensions that will soon be added to the library.

Performance and Examples

UMAP is very efficient at embedding large high dimensional datasets. In particular it scales well with both input dimension and embedding dimension. For the best possible performance we recommend installing the nearest neighbour computation library pynndescent. UMAP will work without it, but if installed it will run faster, particularly on multicore machines.

For a problem such as the 784-dimensional MNIST digits dataset with 70000 data samples, UMAP can complete the embedding in under a minute (as compared with around 45 minutes for scikit-learn's t-SNE implementation). Despite this runtime efficiency, UMAP still produces high quality embeddings.

The obligatory MNIST digits dataset, embedded in 42 seconds (with pynndescent installed and after numba jit warmup) using a 3.1 GHz Intel Core i7 processor (n_neighbors=10, min_dist=0.001):

UMAP embedding of MNIST digits

The MNIST digits dataset is fairly straightforward, however. A better test is the more recent "Fashion MNIST" dataset of images of fashion items (again 70000 data samples in 784 dimensions). UMAP produced this embedding in 49 seconds (n_neighbors=5, min_dist=0.1):

UMAP embedding of "Fashion MNIST"

The UCI shuttle dataset (43500 samples in 8 dimensions) embeds well under correlation distance in 44 seconds (note the longer time required for correlation distance computations):

UMAP embedding the UCI Shuttle dataset

The following is a densMAP visualization of the MNIST digits dataset with 784 features based on the same parameters as above (n_neighbors=10, min_dist=0.001). densMAP reveals that the cluster corresponding to digit 1 is noticeably denser, suggesting that there are fewer degrees of freedom in the images of 1 compared to other digits.

densMAP embedding of the MNIST dataset

Plotting

UMAP includes a subpackage umap.plot for plotting the results of UMAP embeddings. This package needs to be imported separately since it has extra requirements (matplotlib, datashader and holoviews). It allows for fast and simple plotting and attempts to make sensible decisions to avoid overplotting and other pitfalls. An example of use:

import umap
import umap.plot
from sklearn.datasets import load_digits

digits = load_digits()

mapper = umap.UMAP().fit(digits.data)
umap.plot.points(mapper, labels=digits.target)

The plotting package offers basic plots, as well as interactive plots with hover tools and various diagnostic plotting options. See the documentation for more details.

Parametric UMAP

Parametric UMAP provides support for training a neural network to learn a UMAP based transformation of data. This can be used to support faster inference of new unseen data, more robust inverse transforms, autoencoder versions of UMAP and semi-supervised classification (particularly for data well separated by UMAP and very limited amounts of labelled data). See the documentation of Parametric UMAP or the example notebooks for more.

densMAP

The densMAP algorithm augments UMAP to additionally preserve local density information in addition to the topological structure captured by UMAP. One can easily run densMAP using the umap package by setting the densmap input flag:

embedding = umap.UMAP(densmap=True).fit_transform(data)

This functionality is built upon the densMAP implementation provided by the developers of densMAP, who also contributed to integrating densMAP into the umap package.

densMAP inherits all of the parameters of UMAP. The following is a list of additional parameters that can be set for densMAP:

dens_frac: This determines the fraction of epochs (a value between 0 and 1) that will include the density-preservation term in the optimization objective. This parameter is set to 0.3 by default. Note that densMAP switches density optimization on after an initial phase of optimizing the embedding using UMAP.

dens_lambda: This determines the weight of the density-preservation objective. Higher values prioritize density preservation, and lower values (closer to zero) prioritize the UMAP objective. Setting this parameter to zero reduces the algorithm to UMAP. Default value is 2.0.

dens_var_shift: Regularization term added to the variance of local densities in the embedding for numerical stability. We recommend setting this parameter to 0.1, which consistently works well in many settings.

output_dens: When this flag is True, the call to fit_transform returns, in addition to the embedding, the local radii (inverse measure of local density defined in the densMAP paper) for the original dataset and for the embedding. The output is a tuple (embedding, radii_original, radii_embedding). Note that the radii are log-transformed. If False, only the embedding is returned. This flag can also be used with UMAP to explore the local densities of UMAP embeddings. By default this flag is False.

For densMAP we recommend larger values of n_neighbors (e.g. 30) for reliable estimation of local density.

An example of making use of these options (based on a subsample of the mnist_784 dataset):

import umap
from sklearn.datasets import fetch_openml
from sklearn.utils import resample

digits = fetch_openml(name='mnist_784')
subsample, subsample_labels = resample(digits.data, digits.target, n_samples=7000, stratify=digits.target, random_state=1)

embedding, r_orig, r_emb = umap.UMAP(densmap=True, dens_lambda=2.0, n_neighbors=30, output_dens=True).fit_transform(subsample)

See the documentation for more details.

Help and Support

Documentation is at Read the Docs. The documentation includes a FAQ that may answer your questions. If you still have questions then please open an issue and I will try to provide any help and guidance that I can.

Citation

If you make use of this software for your work we would appreciate it if you would cite the paper from the Journal of Open Source Software:

@article{mcinnes2018umap-software,
  title={UMAP: Uniform Manifold Approximation and Projection},
  author={McInnes, Leland and Healy, John and Saul, Nathaniel and Grossberger, Lukas},
  journal={The Journal of Open Source Software},
  volume={3},
  number={29},
  pages={861},
  year={2018}
}

If you would like to cite this algorithm in your work the ArXiv paper is the current reference:

@article{2018arXivUMAP,
  author = {{McInnes}, L. and {Healy}, J. and {Melville}, J.},
  title = "{UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction}",
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {1802.03426},
  primaryClass = "stat.ML",
  keywords = {Statistics - Machine Learning, Computer Science - Computational Geometry, Computer Science - Learning},
  year = 2018,
  month = feb,
}

Additionally, if you use the densMAP algorithm in your work please cite the following reference:

@article {NBC2020,
  author = {Narayan, Ashwin and Berger, Bonnie and Cho, Hyunghoon},
  title = {Density-Preserving Data Visualization Unveils Dynamic Patterns of Single-Cell Transcriptomic Variability},
  journal = {bioRxiv},
  year = {2020},
  doi = {10.1101/2020.05.12.077776},
  publisher = {Cold Spring Harbor Laboratory},
  URL = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776},
  eprint = {https://www.biorxiv.org/content/early/2020/05/14/2020.05.12.077776.full.pdf},
}

If you use the Parametric UMAP algorithm in your work please cite the following reference:

@article {NBC2020,
  author = {Sainburg, Tim and McInnes, Leland and Gentner, Timothy Q.},
  title = {Parametric UMAP: learning embeddings with deep neural networks for representation and semi-supervised learning},
  journal = {ArXiv e-prints},
  archivePrefix = "arXiv",
  eprint = {2009.12981},
  primaryClass = "stat.ML",
  keywords = {Statistics - Machine Learning, Computer Science - Computational Geometry, Computer Science - Learning},
  year = 2020,
}

License

The umap package is 3-clause BSD licensed.

We would like to note that the umap package makes heavy use of NumFOCUS sponsored projects, and would not be possible without their support of those projects, so please consider contributing to NumFOCUS.

Contributing

Contributions are more than welcome! There are lots of opportunities for potential projects, so please get in touch if you would like to help out. Everything from code to notebooks to examples and documentation are all equally valuable so please don't feel you can't contribute. To contribute please fork the project, make your changes and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.
