Index t-SNE: Tracking Dynamics of High-Dimensional Datasets with Coherent Embeddings

t-SNE is an embedding method widely used by the data science community. It serves two main tasks: displaying results, by coloring items according to their class or a feature value; and forensics, by giving a first overview of the dataset distribution. Two interesting characteristics of t-SNE are its structure-preservation property and its answer to the crowding problem, in which all neighbors in high-dimensional space cannot be represented correctly in low-dimensional space. t-SNE preserves the local neighborhood, and similar items are nicely spaced by adjusting to the local density. These two characteristics produce a meaningful representation, in which the area of a cluster is proportional to its size in number of elements, and relationships between clusters are materialized by closeness in the embedding. The algorithm is non-parametric: the transformation from the high- to the low-dimensional space is computed but not learned, so two initializations of the algorithm lead to two different embeddings. In a forensic approach, analysts would like to compare two or more datasets through their embeddings. A naive approach is to embed all datasets together; however, this process is costly, as the complexity of t-SNE is quadratic, and it becomes infeasible for too many datasets. Another approach is to learn a parametric model over an embedding built from a subset of the data. While highly scalable, such a model could map points to the exact same position, making them indistinguishable, and it would be unable to adapt to new outliers or to concept drift. This paper presents a methodology for reusing an embedding to create a new one in which cluster positions are preserved. The optimization process minimizes two costs, one relative to the embedding shape and one relative to the match with the support embedding. The embedding-with-support process can be repeated more than once, taking the newly obtained embedding as the next support.
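The two-cost idea can be sketched with a toy gradient descent: the usual t-SNE Kullback-Leibler term shapes the embedding, while a quadratic penalty (weight `lam`, a hypothetical choice here) pulls the support points toward their positions in a previous embedding. All sizes, the Gaussian affinities without perplexity calibration, and the reference positions `Y_ref` are invented for illustration; this is not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def affinities(Y):
    """Student-t joint probabilities Q used by t-SNE in the embedding space."""
    D = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    W = 1.0 / (1.0 + D)
    np.fill_diagonal(W, 0.0)
    return W / W.sum(), W

def gradient(Y, P, anchor_idx, Y_ref, lam):
    """Gradient of KL(P||Q) plus the quadratic support-matching penalty."""
    Q, W = affinities(Y)
    PQ = (P - Q) * W
    g = 4.0 * (PQ.sum(axis=1)[:, None] * Y - PQ @ Y)   # standard t-SNE gradient
    g[anchor_idx] += 2.0 * lam * (Y[anchor_idx] - Y_ref)
    return g

# Toy dataset: two tight clusters of 5 points each in 10 dimensions.
X = np.vstack([rng.normal(0.0, 0.1, (5, 10)), rng.normal(3.0, 0.1, (5, 10))])
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
P = np.exp(-D)                      # crude Gaussian affinities (no perplexity search)
np.fill_diagonal(P, 0.0)
P /= P.sum()

# Points 0..4 act as the support: pinned near their positions in a
# previous (here invented) embedding centred at (-1, 0).
anchor_idx = np.arange(5)
Y_ref = np.tile([-1.0, 0.0], (5, 1))

Y = rng.normal(0.0, 1e-2, (10, 2))  # fresh random initialization
lam = 0.5                           # hypothetical weight of the matching cost
for _ in range(500):
    Y = Y - 0.5 * gradient(Y, P, anchor_idx, Y_ref, lam)
```

With `lam = 0` this reduces to plain (unaccelerated) t-SNE; increasing `lam` trades embedding quality for coherence with the support, which is the balance the two-cost optimization negotiates.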
The successive embeddings can be used to study the impact of one variable on the dataset distribution, or to monitor changes over time. The method has the same complexity as t-SNE per embedding, and its memory requirement is only doubled. For a dataset of n elements sorted and split into k subsets, the total embedding complexity is reduced from O(n²) to O(n²/k), and the memory requirement from n² to 2(n/k)², which enables computation on recent laptops. The method showed promising results on a real-world dataset, allowing the birth, evolution, and death of clusters to be observed. The proposed approach makes significant trends and changes easier to identify, enabling the monitoring of high-dimensional datasets' dynamics.
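As a rough check of the claimed savings, a small cost model (function name and example sizes invented for illustration) counts pairwise interactions for one joint embedding of n points versus k successive embeddings of n/k points each:

```python
def embedding_cost(n, k):
    """Pairwise-interaction counts: one joint t-SNE of n points versus
    k successive embeddings of n/k points, each reusing a support embedding."""
    m = n // k            # subset size; assume k divides n evenly
    joint_time = n ** 2   # quadratic cost of embedding everything at once
    joint_mem = n ** 2    # one n x n affinity matrix
    seq_time = k * m ** 2  # k runs, each quadratic in the subset size: n**2 / k
    seq_mem = 2 * m ** 2   # current subset plus the support embedding
    return joint_time, joint_mem, seq_time, seq_mem
```

For example, with n = 100 000 points split into k = 10 subsets, total time drops by a factor of k while peak memory drops by a factor of k²/2, matching the O(n²/k) and 2(n/k)² figures above.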




