Issue 3, 2024

A human-in-the-loop approach for visual clustering of overlapping materials science data

Abstract

Data-to-knowledge approaches have started to show significant promise in materials science. Still, some classes of materials, such as Metal–Organic Frameworks (MOFs), possess multi-dimensional, interrelated physicochemical properties that make data clustering methods challenging to apply. We considered an in-house generated database of MOFs consisting of geometrical (pore size and dimensions), chemical (atomic charge of the framework), and adsorption properties (CO2 uptake, heat of adsorption) to evaluate the challenges and limitations of various clustering techniques and to propose a solution based on visual clustering. As a starting step, we examined the data via principal component analysis (PCA) to understand the interrelationships among a set of dimensions without prior knowledge. This dimensionality reduction method did not reveal visually distinct clusters of MOFs. We then tested two combinations of data projection and clustering methods: t-distributed stochastic neighbour embedding (t-SNE) with Density-Based Spatial Clustering of Applications with Noise (DBSCAN) applied to the input data in its original dimensions (t-SNE//DBSCAN), and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) applied to the 2D embedding obtained from Uniform Manifold Approximation and Projection (UMAP) (UMAP → HDBSCAN). Both the t-SNE//DBSCAN and UMAP → HDBSCAN pipelines produced overlapping clusters, lacked reproducibility, and were parameter-sensitive. In contrast, we relied on a Gaussian mixture model (GMM) that uses the eigenvalue decomposition discriminant analysis (EDDA) method. This method is stable and not strongly dependent on the prior definition of the hyperparameters. We propose a novel interactive divide-and-conquer approach that combines GMM-EDDA with a form of linear discriminant analysis to enable visual split-or-merge decisions for each pair of Gaussian clusters. The end-user engages in the clustering process through a trustworthy visualization in which clusters appear separated only if they are also well separated in the data space. Further, the identified meta-clusters were characterized using correlation heatmaps and violin plots of their distribution along each data dimension. Our methodology paves the way to address the clustering and data visualization challenges of highly overlapping and correlated databases.
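
The pairwise split-or-merge idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the cluster labels are already available (for example from a fitted Gaussian mixture), it uses a plain two-class Fisher linear discriminant as the "form of linear discriminant analysis", and it replaces the human's visual judgement with a numeric separation score and a hypothetical threshold. All function names, the synthetic data, and the threshold value are assumptions made for illustration only.

```python
import random
from statistics import mean, pvariance


def solve(matrix, vector):
    """Solve matrix @ x = vector by Gaussian elimination with partial pivoting."""
    n = len(vector)
    aug = [row[:] + [vector[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][c] * x[c] for c in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fisher_direction(a, b):
    """Fisher discriminant direction w = S_w^{-1} (mu_a - mu_b) for two clusters."""
    d = len(a[0])
    mu_a = [mean(p[j] for p in a) for j in range(d)]
    mu_b = [mean(p[j] for p in b) for j in range(d)]
    # Pooled within-cluster scatter matrix S_w.
    s_w = [[0.0] * d for _ in range(d)]
    for points, mu in ((a, mu_a), (b, mu_b)):
        for p in points:
            diff = [p[j] - mu[j] for j in range(d)]
            for i in range(d):
                for j in range(d):
                    s_w[i][j] += diff[i] * diff[j]
    return solve(s_w, [mu_a[j] - mu_b[j] for j in range(d)])


def separation(a, b):
    """Distance between projected cluster means, in pooled standard deviations."""
    w = fisher_direction(a, b)
    proj_a = [sum(wi * xi for wi, xi in zip(w, p)) for p in a]
    proj_b = [sum(wi * xi for wi, xi in zip(w, p)) for p in b]
    pooled_sd = ((pvariance(proj_a) + pvariance(proj_b)) / 2) ** 0.5
    return abs(mean(proj_a) - mean(proj_b)) / pooled_sd


def decide(a, b, threshold=2.0):
    """Numeric stand-in for the visual decision; the threshold is hypothetical."""
    return "split" if separation(a, b) >= threshold else "merge"


def sample(rng, centre, n=200):
    """Draw n points from a unit-variance Gaussian around `centre` (synthetic data)."""
    return [[rng.gauss(c, 1.0) for c in centre] for _ in range(n)]


rng = random.Random(0)
far_a, far_b = sample(rng, (0, 0, 0)), sample(rng, (6, 6, 6))
near_a, near_b = sample(rng, (0, 0, 0)), sample(rng, (0.5, 0.5, 0.5))

print("far pair: ", decide(far_a, far_b))
print("near pair:", decide(near_a, near_b))
```

In this sketch the well-separated pair is kept as two clusters and the heavily overlapping pair is merged. In an interactive setting, the 1D projections `proj_a` and `proj_b` would be plotted for the user to inspect rather than thresholded automatically.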

Article information

Article type: Paper
Submitted: 08 Sep 2023
Accepted: 29 Jan 2024
First published: 02 Feb 2024
This article is Open Access
Creative Commons BY-NC license

Digital Discovery, 2024, 3, 502-513

S. Bonakala, M. Aupetit, H. Bensmail and F. El-Mellouhi, Digital Discovery, 2024, 3, 502 DOI: 10.1039/D3DD00179B

This article is licensed under a Creative Commons Attribution-NonCommercial 3.0 Unported Licence. You can use material from this article in other publications, without requesting further permission from the RSC, provided that the correct acknowledgement is given and it is not used for commercial purposes.

