Naturally-occurring networks are naturally modeled as being made up of many overlapping communities —
for instance, our work and friends when it comes to social networks, or those describing different protein complexes in biology. Detecting such overlapping communities is an active area of current research in many disciplines, but most existing algorithms to do so do not fare well with the size of networks regularly seen in the
real world, which can contain millions of members. Now scientists have developed a technique for efficiently discovering overlapping communities in massive real-world networks, findings detailed this week in the Proceedings of the National Academy of Sciences.
There are two major problems to detecting communities in massive real-world networks. First, most community-detection algorithms presume each node or member of the network belongs to just one community, while in real life each node often has a number of memberships. Second, most existing algorithms are too slow, analyzing every pair of nodes regardless of whether they are actually connected in the network, or their assumptions are too simple, limiting their ability to reveal hidden communities.
Instead of analyzing the entirety of a network at once, computer scientists Prem Gopalan and David Blei at Princeton University came up with a technique to sample portions of a network repeatedly.
“Our algorithm allows nodes to be involved in multiple communities, which is an important property for analyzing large modern networks,” Blei says. “Based on the connections between nodes, our algorithm might be able to find communities that we are otherwise unaware of–ad hoc communities that aren’t formally defined in advance but that emerge from the process that created the network.”
The researchers discovered overlapping communities within several real-world networks, including 3.7 million US patents, 575,000 physics articles from the arXiv preprint server, and 875,000 connected Web pages from the Internet, identifying hundreds of overlapping communities among millions of nodes in a matter of hours. For instance, when it came to the arXiv papers, it found certain ones apparently related to multiple fields. Readers can try the method out themselves via https://github.com/premgopalan/svinet.
“Using a simple model of overlapping communities, we can accurately and consistently find communities on large benchmark networks,” Gopalan says.
“One application is to automatically identify the overlapping friends or ‘circles’ of a social network user,” Gopalan notes. “These communities may provide context to understand a user’s interactions with another user, and also help in selecting the social network content shown to a user.”
“Another application is a routine tool for scientific analysis of networks,” Gopalan says. “For example, proteins are known to belong to multiple protein complexes and have multiple functions. By analyzing a network of protein-protein interactions, we can predict the diverse roles that each protein takes on.”
When it comes to detecting, say, hidden communities of criminals or terrorists, the algorithm by itself “simply finds overlapping communities–our method can’t distinguish between a book group and a terrorist cell,” Blei says. “But one might build on this method to create a more advanced model–specifically one that accounts for what activities those communities are participating in–in applications for national security.”