Wednesday, 13 February 2013

Analysing an Online Community

I had a look at carrying out network analysis of the relations in an online community for the Coursera Social Network Analysis class.

Original Bulletin Board Graph
The community was a UK-based bulletin board of general chat, completely anonymised. I wanted to look at the relationship between topics and see what could be inferred from that, so in the graphs the nodes are threads and an edge joins two threads when the same person posted in both. This post is mainly about the network analysis; another will look at the nitty gritty of how I carried it out.


In short: Python and Gephi. More details later.

The Network

I ended up with a network of 79 nodes and 345 edges. The node labels are keywords extracted from the threads, and the edges are people who posted in both of the threads they connect.
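The construction described above (threads as nodes, people as the links between them) can be sketched in plain Python. This is only an illustration, not the code used for the post: the `build_thread_graph` function, the `(user, thread)` input format and the sample posts are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations


def build_thread_graph(posts):
    """Turn (user, thread) pairs into a weighted, undirected thread graph.

    Returns {thread: {neighbour: number_of_shared_posters}}.
    """
    threads_by_user = defaultdict(set)
    graph = {}
    for user, thread in posts:
        threads_by_user[user].add(thread)
        graph.setdefault(thread, {})  # keep threads with no shared posters

    for threads in threads_by_user.values():
        # Every pair of threads a person posted in gains one unit of weight.
        for a, b in combinations(sorted(threads), 2):
            graph[a][b] = graph[a].get(b, 0) + 1
            graph[b][a] = graph[b].get(a, 0) + 1
    return graph


# Hypothetical, made-up posts purely for illustration.
posts = [
    ("u1", "sport"), ("u1", "apple"),
    ("u2", "sport"), ("u2", "apple"), ("u2", "politics"),
    ("u3", "economy"),
]
graph = build_thread_graph(posts)
edge_count = sum(len(nbrs) for nbrs in graph.values()) // 2
print(len(graph), "nodes,", edge_count, "edges")
```

Keeping the number of shared posters as an edge weight is a design choice for the sketch; it could be dropped if only the presence of a link matters.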


The first question: is there anything to look at? Any structure? One way to tell is by constructing a random graph and comparing its properties with those of our graph. This can be done, a little tediously, with Gephi.

Our graph has an average degree of 8.73, an average shortest path length of 2.22 and an average clustering coefficient of 0.634. In comparison, a random graph with the same node and edge counts has a path length of 2.2 and a clustering coefficient of 0.144. Gephi reported an average degree of ~4.4 for the random graph, but that looks like the directed figure (345/79 ≈ 4.37); counted as undirected, as ours is (2 × 345/79 ≈ 8.73), any graph with the same node and edge counts has the same average degree, so degree alone tells us nothing. What does matter is that the path length is similar to the random graph's while the clustering is much higher, which indicates small world properties and gives us something to look at.
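The same comparison can be scripted rather than clicked through. The sketch below uses only the standard library and assumes an undirected graph stored as `{node: set_of_neighbours}`; the function names are made up for the example, and Gephi may average its statistics slightly differently (for instance in how it treats nodes with fewer than two neighbours).

```python
import random
from collections import deque
from itertools import combinations


def average_degree(adj):
    return sum(len(nbrs) for nbrs in adj.values()) / len(adj)


def average_clustering(adj):
    """Mean local clustering; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)


def average_path_length(adj):
    """Mean BFS distance over all pairs of nodes that can reach each other."""
    total, pairs = 0, 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def random_graph(n, m, seed=None):
    """Pick m distinct undirected edges uniformly at random among n nodes."""
    rng = random.Random(seed)
    edges = rng.sample(list(combinations(range(n), 2)), m)
    adj = {i: set() for i in range(n)}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    return adj


# Baseline with the node and edge counts quoted in the post.
baseline = random_graph(79, 345, seed=1)
print("degree    ", round(average_degree(baseline), 2))
print("clustering", round(average_clustering(baseline), 3))
print("path      ", round(average_path_length(baseline), 2))
```

Running this over several seeds and averaging would give a steadier baseline than a single random graph.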

There are three super nodes in the graph:

clarkson, people, one, bbc: 1015
people, like, money, would: 650
clarkson, one, public, sector: 343

People in the UK can probably guess the reason for the first and third nodes. Clarkson's comment was so outrageous that perhaps most people felt they had to say something about it, which could create abnormal relationships in the graph. After removal of the clarkson nodes the graph is as below:
Clarkson free graph
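Dropping the nodes is a small filtering step if the graph is held as an adjacency dictionary. The sketch below is an assumption about how it could be done in Python rather than a record of what was done in Gephi; `remove_nodes` and the toy edges are invented for the example, and only the label format follows the post.

```python
def remove_nodes(adj, should_remove):
    """Return a copy of the graph without the nodes matching the predicate."""
    kept = {node for node in adj if not should_remove(node)}
    return {node: {nbr for nbr in adj[node] if nbr in kept} for node in kept}


# Tiny illustrative graph; the labels mimic the keyword labels in the post,
# but the edges are invented.
adj = {
    "clarkson, people, one, bbc": {"people, like, money, would", "sport"},
    "people, like, money, would": {"clarkson, people, one, bbc", "sport"},
    "sport": {"clarkson, people, one, bbc", "people, like, money, would"},
}

# Drop every node whose keyword list contains "clarkson".
pruned = remove_nodes(adj, lambda label: "clarkson" in label.split(", "))
print(sorted(pruned))
```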
Running Gephi Modularity with a resolution of 1.0 gives us 6 groups. Groups 4, 5 and 0 seem to be too diverse to say anything much about. Group 1 is mainly about sport, although its highest degree node is about Apple. Group 3 is about UK domestic topics, while Group 2 is more global, centring on politics and economics.
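To make the modularity figure less of a black box, here is a minimal sketch of the score that a grouping is judged by. It only evaluates a given partition and does not search for the best one, as Gephi does. The `resolution` argument here is the multiplier on the expected-edges term in the commonly used generalised formula; Gephi's resolution setting may be defined differently, although at 1.0 both should reduce to standard modularity. The function name and the toy graph are assumptions for the example.

```python
def modularity(adj, communities, resolution=1.0):
    """Modularity Q of a partition of an unweighted, undirected graph.

    Q = sum over communities of  L_c / m  -  resolution * (d_c / 2m) ** 2
    where m is the total edge count, L_c the number of edges inside the
    community and d_c the sum of the degrees of its members.
    """
    m = sum(len(nbrs) for nbrs in adj.values()) / 2
    q = 0.0
    for community in communities:
        members = set(community)
        # Each internal edge is seen from both ends, hence the division by 2.
        intra = sum(1 for u in members for v in adj[u] if v in members) / 2
        degree_sum = sum(len(adj[u]) for u in members)
        q += intra / m - resolution * (degree_sum / (2 * m)) ** 2
    return q


# Two triangles joined by a single edge (c-d): an obvious two-group graph.
adj = {
    "a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"},
    "d": {"c", "e", "f"}, "e": {"d", "f"}, "f": {"d", "e"},
}
split = modularity(adj, [{"a", "b", "c"}, {"d", "e", "f"}])
lumped = modularity(adj, [set(adj)])
print(round(split, 3), round(lumped, 3))
```

Splitting the toy graph into its two triangles scores higher than lumping everything together, which is the sense in which a grouping is "better".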


We can see that there is some clustering of subjects when we relate them by people, although we have not discovered any surprises on this small data set: similar subjects cluster together. Next steps for this study would be to use a more sophisticated way of extracting the labels from the thread text, to run against a larger data set, and perhaps to introduce time as an attribute. One could then look at the 'contagion' of the network by a topic; the Clarkson posts might be interesting here. We could also invert the network and relate people by posts, but that seemed less interesting.


  1. Had totally missed this, but then I never listen to Clarkson:

    1. I like Clarkson -you just have to remember he's a professional fool.