Oracle AI & Data Science Blog
Learn AI, ML, and data science best practices

The Business Value of Combinatorial Subgraph Search

In my 12 years working in academia and business, I have seen how differently the two approach graph implementation. While academia embraces graphs, businesses often do not see the value. Almost always after executives see what graphs can do and the initial cool factor wears off, they pose the ubiquitous question, “Now what?” Even the best graphist may not have a good answer to this simple problem.

In this article, I will share a few niche examples that showcase the business value of graph implementation.

Combinatorial Subgraph Search

Combinatorial Subgraph Search (CSS) is an exhaustive search of a graph for all subgraphs of a particular size (where all nodes are connected directly or indirectly via other nodes, i.e. no dangling nodes). When it’s applied to traversing graphs, it’s able to ignore irrelevant node combinations and consider only node combinations that constitute a connected subgraph, Figure 1. This not only allows the CSS to scale down the search but makes it possible to search at all, a valuable capability for many businesses.

So how does the CSS scale down the search so dramatically? One, the CSS algorithm only chooses combinations that are relevant. Two, the algorithm never chooses the same combination. Three, the algorithm ranks only the best  million combinations. In other words, the CSS considers only node combinations that matter (without asking if they matter) because it keeps track of the combinations that have been visited (without keeping track of them) and ranks the combinations (without spending time on ranking).

Illustration of graph traversal 

Figure 1: Illustration of graph traversal by the Combinatorial Subgraph Search (CSS). Task: Find a subgraph of size 6 with the lowest sum of labels. Solution: Combination of nodes 456789. The CSS traverses only connected subgraphs (green, 8 combinations) and ignores unconnected combinations of nodes (red, 76 combinations) while checking the sum. This annihilation of a majority of combinations that are irrelevant is what makes the CSS valuable. For a larger graph, the reduction in combinatorial space is much more dramatic, e.g. from 1040 to 1014 in drug discovery.

CSS Use Case

CSS is used in drug discovery to identify a hot spot on the surface of the protein where a drug can bind. Since an average-sized drug displaces about 18 water molecules from the surface of the protein when it binds, the CSS algorithm first converts a layer of water molecules on the surface of the protein into a graph (proteins are not in a vacuum but in a water-like environment within cells) and then finds all subgraphs of size 18. In addition, the binding energy between a drug and a protein is mostly due to the displacement of water molecules from the surface of the protein, so identifying a cluster of 18 water molecules that is the easiest to displace is the easiest way to find the hot spot for binding. 

As shown in Figure 2, one can take all water molecules from the surface of a protein and replace them with nodes. Then we link nodes for all water molecules that are touching each other (on average, each molecule is connected to four other water molecules). Then to each node we assign a label, which represents the displacement energy for that water molecule. This leaves us with the problem of finding a subgraph of 18 nodes that has the smallest sum of 18 energy labels, i.e. the easiest to displace. The graph can be from 5,000 to 50,000 nodes depending on the size of a protein so this means that sometimes the search would have to go through 1040 subgraphs (which is impossible).

However, in drug discovery, this is done overnight on a single CPU. As mentioned, the CSS will not consider node combinations that are not in one piece since the molecule is in one piece, which will reduce the search to 1014 (still large but doable). Also, stochastic approaches—no matter how valuable—will not work with proteins. Proteins are very specific which drug and on what surface location they will bind to. This leaves exhaustive traversal to be the best approach.

Further adding to the value, one can repeat this search on another protein (or entire portfolio) and compare which proteins are good candidates for designing a drug and which are not, i.e. those with low displacement energies in the hot spot are druggable and those with high displacement energies are undruggable. Simply identifying that a protein is a bad target for drug development is valuable because the cost for such processes goes into tens of millions of dollars. Finally, since each water molecule has the coordinate, identifying a hot spot on the protein also gives the exact shape of the future drug.


First layer of water molecules

Figure 2: First layer of water molecules (light blue) on the surface of a protein (gray) can be converted into a graph (blue). The CSS can identify a hot spot for drug binding (green) by identifying a cluster of 18 water molecules that are the easiest to displace. Once the CSS is applied to many proteins, hot spots can be compared as well as the relative druggability of proteins. (Source: S. Vukovic, P. E. Brennan, D. J. Huggins. “Exploring the Role of Water in Molecular Recognition: Predicting Protein Ligandability Using a Combinatorial Search of Surface Hydration Sites.” J. Phys.: Condens. Matter 2016, 28, 344007-344019)


CSS is allowing the search for information that seems impossible to obtain. This is valuable in many business scenarios, from hedge funds deciding how to choose a group of stocks with the lowest risk to power companies deciding how to redistribute electricity within a smart two-way grid.


Be the first to comment

Comments ( 0 )
Please enter your name.Please provide a valid email address.Please enter a comment.CAPTCHA challenge response provided was incorrect. Please try again.