11. Community Detection and Online Graph Algorithms

Community Detection

Goal: find densely linked clusters.

Strength of Weak Ties

If the number of hops between two nodes is large, the connection is likely to be a weak tie.

Edge betweenness: number of shortest paths that contain that edge. If there are multiple shortest paths, take the fraction of shortest paths that contain the edge.

Girvan-Newman

Find the link with the highest betweenness factor and remove it.

By repeating, the network can be decomposed in a hierarchical manner.

Computing Betweenness

Pick starting node AA and run a breadth first search, computing the number of shortest paths from AA to every other node.

This can be done by summing the values of the node’s parents. Set the root node, AA, to have a value of 11.

Store information about links between edges on the same level (hops from AA).

      a  ----
    / | \    \
   /  |  \    \
 b1 - c1  d1   e1   level 1 (1 hop from a)
  \  /    / \  /
   \/    /   \/
   f2   g1   h2     level 2
    \   /\   /
     \ /  \ /
      i3   j3       level 3
       \  /
        \/
        k6          level 4

Then, compute betweenness by working up the tree using BFS following these rules:

In the above example:

Hence, the resultant graph is:

      a  ----
   2/2| \4  2\
   /  |  \    \
2b - 2c  4d   2e
 1\ 1/   2/ \1 /1
   \/    /   \/
   2f  2g    2h
   1\ .5/\.5 /1
     \ /  \ /
   1.5i  1.5j
       \  /
     0.5\/0.5
        1k

This must be repeated, picking every node as the root node. The betweenness value for each edge can be summed and then divided by 2 (as each shortest path discovered twice from each end).

In order to improve performance, a random subset of nodes can be used as the root node to get an approximation.

A hierarchical network decomposition can be made by using the order in which the graph was divided.

Network Communities

Community detection requires not just computing betweenness, but also determining when to stop dividing clusters; otherwise, each cluster would be left with a single node.

Network communities are sets of tightly connected nodes.

Modularity

QQ is a measure of how well a network is partitioned.

If a network is partitioned into groups, sSs \in S:

QsSnumber of edges in sexpected number of edges in s Q \propto \sum_{s \in S}{ \text{number of edges in } s - \text{expected number of edges in } s}

But how is the number of expected edges found?

Null Model

Given a real GG on nn nodes and mm edges, we construct a rewired network GG', a multigraph.

If knk_n is the degree of node nn, the expected number of edges between nodes ii and jj is:

kikj2m \frac{k_i k_j}{2m}

That is, the sum of degrees of all nodes is twice the number of edges. Hence, in the entire graph, $$ \sum_{u \in N}{k_u} = 2m $$ The expected number of edges in the multigraph GG' is

=12iNjNkikj2m=14miNkijNkj=2m2m4m=m \begin{aligned} &= \frac{1}{2} \sum_{i \in N}{\sum_{j \in N}{\frac{k_i k_j}{2m}}} \\ &= \frac{1}{4m} \sum_{i \in N}{k_i \sum_{j \in N}{k_j}} \\ &= \frac{2m \cdot 2m}{4m} \\ &= m \end{aligned}

And the expected degree - number of edges per node, for the entire graph is 2m/n2m/n.

Given a partitioning SS of graph GG:

Q(G,S)=12msSisjsAijkikj2m Q(G, S) = \frac{1}{2m} \sum_{s\in S}{\sum_{i \in s}{\sum_{j\in s}{A_{ij} - \frac{k_i k_j}{2m}}}}

Where Aij=1A_{ij} = 1 if there is a edge iji \rightarrow j, and 00 otherwise.

QQ is normalized such that 1<Q<1-1 \lt Q \lt 1.

A positive value means the number of edges within the groups exceeds the expected number.

If 0.3Q0.70.3 \ge Q \ge 0.7 means significant community structure in most graphs.

Modularity is initially 0; with only one community, no edges have been removed and the expected number of edges is derived from this case. In nice cases, there will be an inflection point at which maximum modularity is reached.