A professional development goal of mine is to learn a lot more about social network analysis and visualization of social media data. This area has grown increasingly valuable and important in our field. And I believe we all need to have at least a base knowledge of social data and how to play with it.
With my wife traveling for work and rainy weather here in West Virginia, this weekend presented a great opportunity to finally get my feet wet (no pun intended).
As you may know, my beloved Virginia Tech Hokies haven’t been playing so well this college football season. So I decided to use Saturday’s game as an opportunity to play with Twitter data and Gephi, an open source data visualization program.
I’ll explain what I did below to make the above visualization in case you’d like to try this for yourself. This is a simple approach and I think you’ll find you can do it if I can learn it in a weekend! I started Saturday morning with zero knowledge of graph theory, social network analysis, how to use Gephi, and how to pull down Tweets.
I’m writing this up because, although I found several tutorials online, none of them showed how to do all the parts in one place. A major reason is that the Twitter API has changed since many of those tutorials were written, so the ways they offer for getting Twitter data no longer work. As such, getting Twitter data is a challenge if you don’t know a little programming with Python or the like (needless to say, I don’t).
Fortunately, the tools below, used together, made this first experiment in Twitter data visualization possible.
Here’s how I did it:
1) I used the TAGS v.6 Twitter Archiving Tool to gather Tweets with the hashtag #hokies. This is an amazing, free tool – thank you so much to Martin Hawksey for this! You can learn to use the TAGS archiver fairly easily via Google Docs. The only real slow down is that you have to get a Twitter API key via your Twitter account.
I ended up gathering 1583 Tweets. The earliest was sent at 3:19am, after midnight before the game, and the latest at 2:43, the majority of the way through the game, which is when I extracted the data. In other words, I got whatever Tweets the tool could pull going back from that moment. It’s not a great picture of the #Hokies conversation, but it worked for this exercise.
2) I used @dfreelon’s spreadsheet converter to convert the TAGS spreadsheet to a file I could put into Gephi to do the visualization. Thanks Deen!
His converter pulls only the first Twitter account that is mentioned in the Tweet or in a RT, so any additional persons mentioned in a Tweet were not counted. You can learn more about it here on Deen’s blog. It is easy to use. In short, I copied my Tweeter (sender) and Tweet text columns into his spreadsheet, and voila! This created my edge file in CSV for Gephi with 2 columns, each holding vertices (or nodes): the first column is the person who sent the Tweet and the second column is the person to whom the Tweet was directed.
3) I noticed that some mentions of Twitter account handles were all lowercase whereas others were not. This had created duplicate nodes. That is, in some instances, one Twitter account had been split into two: an all lowercase version and the original. So, I simply made all text lowercase to address this problem. I used Google Refine to clean my CSV file because I want to learn to use this program. But, you could change the case in Excel or any spreadsheet software.
4) I then loaded the cleaned CSV file into Gephi (download it here) so I could do the visualization.
5) I spent a lot of time on Saturday reading about visualization and getting a basic knowledge of graph theory and how to use Gephi. While I’ve still got a lot to learn, I decided to follow a tutorial for my first “go round.” It seemed like a great opportunity to put together, in a guided environment, the concepts and Gephi tools I’d learned. So, I followed the instructions in the latter half of this YouTube video for how to visualize the data and export it into the file you see with this post. The tutorial is by Michael Bauer via the International Journalism Festival. Of note, the first half shows you how to extract data using Twitter’s old API, and that process no longer works. Instead, you can take the CSV file you created through the process above, import it into Gephi, and pick up with the tutorial at 1:05:46.
So, that’s it!
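If you do know a little Python, steps 2 and 3 could be approximated with a short script. This is only a rough sketch and not how Deen’s converter actually works internally: the sample Tweets and handles are made up, the function names are hypothetical, and the `Source`/`Target` header row is an assumption about what Gephi’s spreadsheet import will recognize as an edge list. It keeps only the first account mentioned in each Tweet and lowercases every handle so that one account doesn’t split into two nodes.

```python
import csv
import io
import re

# A handle is "@" followed by letters, digits, or underscores (up to 15 chars).
MENTION = re.compile(r"@(\w{1,15})")


def first_mention(text):
    """Return the first @handle in a Tweet, lowercased, or None if there is none."""
    match = MENTION.search(text)
    return match.group(1).lower() if match else None


def build_edges(rows):
    """Turn (sender, tweet_text) pairs into lowercase (source, target) edges.

    Tweets that mention nobody produce no edge, and only the first
    mention in each Tweet is counted.
    """
    edges = []
    for sender, text in rows:
        target = first_mention(text)
        if target is not None:
            edges.append((sender.lstrip("@").lower(), target))
    return edges


def edges_to_csv(edges):
    """Render the edges as CSV text with a Source,Target header row."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(edges)
    return out.getvalue()


# Made-up sample rows standing in for the sender and Tweet text columns.
sample = [
    ("HokieFan1", "RT @BeatWriter: Hokies trail at the half #hokies"),
    ("hokiefan2", "@beatwriter any word on the QB? cc @HokieFan1 #hokies"),
    ("HokieFan3", "Rough day in Blacksburg #hokies"),
]

edges = build_edges(sample)
print(edges_to_csv(edges))
```

In the sample, `@BeatWriter` and `@beatwriter` collapse into one target node, the `cc @HokieFan1` mention is ignored because it is not the first mention, and the third Tweet is dropped because it mentions nobody.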
A few quick things about this visualization:
As indicated by the size of the Twitter account name, we can see that Virginia Tech sports beat writer Andy Bitter for the Roanoke Times had the largest number of Tweets directed at him regarding the game. That is, his node (his Twitter account) had the highest degree, where a node’s degree is the number of edges, or connections, it has to other nodes. This makes sense. I’ve followed the #Hokies conversation on Twitter for years and Andy has been a constant presence and leader in providing news and analysis of Tech.
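To make the idea of degree concrete, here is a tiny Python sketch that counts it by hand from an edge list. The edges and handles are made up for illustration, and Gephi computes this for you, so this only shows the arithmetic behind the node sizes. Since each edge points from the sender to the account mentioned, "Tweets directed at him" corresponds to what graph theory calls in-degree.

```python
from collections import Counter

# Made-up (source, target) edges: who sent the Tweet, and who it mentioned.
edges = [
    ("fan_a", "beatwriter"),
    ("fan_b", "beatwriter"),
    ("fan_c", "beatwriter"),
    ("fan_c", "fan_a"),
]

# In-degree: how many Tweets were directed at each account.
in_degree = Counter(target for _, target in edges)

# Out-degree: how many Tweets each account directed at someone else.
out_degree = Counter(source for source, _ in edges)

# Total degree: every edge touching the node, in either direction.
degree = in_degree + out_degree

top_account, top_count = in_degree.most_common(1)[0]
print(top_account, top_count)
```

Here `beatwriter` comes out on top with an in-degree of 3, which is the kind of node that would be drawn with the largest label.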
The communities are indicated in colors. I used the modularity script in Gephi to identify these, as is shown in the above-noted YouTube video. In short, you can use the color coding to make a basic clustering of who is talking to whom.
While I’ve got a ton to learn, I’m thrilled with the progress I’ve made in just over a weekend, going from not knowing the first thing about graph theory, basic spreadsheet formatting for nodes and edges, or how to visualize a social network, to building my first visualization. And, while my goal is not to become a data scientist, I am excited to continue to learn and grow a base knowledge in this area. I know I am just scratching the surface.
I’d love to hear your thoughts and tips on how I can improve my knowledge and skills! Also, please feel free to share your tips, tutorials, and experiences with social data.
Note: Thanks to Nathan Carpenter at the ISU SMACC for helping me get started with data gathering and visualization by generously sharing his experiences and tools!