# Fast molecular graph duplicate removal

Molecules are commonly treated as (undirected, colored and connected) graphs. If two graphs are identical, they are called isomorphisms of each other.

In recent work, I had to remove duplicate graphs/molecules from a long list of several million molecules. The straight-forward albei naive approach would be to do (n^2-n)/2 isomorphism checks, e.g. with igraph. The drawback being that these checks are costly and that the scaling of O(n^2) makes this highly undesirable for larger datasets.

Using necessary conditions for those checks allows to pre-screen the molecular graphs to identify those that cannot possibly be identical. Having necessary conditions for isomorphism checks are nice, as they allow for prescreening this way. Besides the obvious ones (number of vertices and edges, sorted degrees), there is one which is particularly helpful for lists of graphs where the other criteria are not sensitive enough (or the lists are long enough): the spectrum of the adjacency matrix. This property is a (not complete) graph invariant, so two graphs can be isomorphisms only if their eigenvalue spectrum is identical.

To use this quantity in the duplicate removal process, do the following:

• Take the adjacency matrix (which does depend on the vertex labels) in O(n).
• Calculate its eigenvalue spectrum (which does not depend on the vertex labels) in O(n).
• Sort all spectra in O(n log n).
• Perform actual isomorphism checks only with neighbors in the sorted list in O(n).

A sample implementation with igraph and numpy could look like this: ``````graphs = # get list of igraph.Graph objects here
colors = # put your element-based coloring here, e.g. *7 +  * 12 + *12 for O7C12H12

spectra = np.zeros((limit, len(colors)))
for i in range(limit):

ordering = np.lexsort(spectra.T)
duplicates = []
for pos in range(len(ordering)):
for other in range(pos+1, len(ordering)):
if np.linalg.norm(spectra[ordering[pos]] - spectra[ordering[other]]) > 1e-2:
break
ida, idb = ordering[pos], ordering[other]
if graphs[ida].isomorphic_vf2(graphs[idb], color1=colors, color2=colors):
duplicates.append(max(ida, idb))
duplicates = set(duplicates)

for i in range(len(graphs)):
if i not in duplicates:
# output graphs[i] here``````