Given a dataset of lyrics across bands, we can represent each band as a point in an n-dimensional vector space over the words in the lyrics. By weighting each word using "term frequency-inverse document frequency" (TF-IDF), we can take an initial stab at distilling the salient words in a band's lyrical corpus while placing less emphasis on words that occur frequently across many bands. This can be considered a proxy for lyrical "topic".
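As a rough illustration of the weighting described above, here is a minimal sketch in plain Python. The toy corpus, the whitespace tokenization, and the unsmoothed `log(N / df)` variant of IDF are all assumptions made for the example, not necessarily what the linked code does.

```python
import math
from collections import Counter

# Hypothetical toy corpus: band name -> all of that band's lyrics as one string.
lyrics = {
    "band_a": "true till death true friends",
    "band_b": "friends forever true hearts",
    "band_c": "darkness falls within my soul",
}

def tfidf_vectors(corpus):
    """Return {band: {word: weight}} with weight = relative TF * log(N / df)."""
    # Assumption: lowercase and split on whitespace; real lyrics need better tokenization.
    tokens = {band: text.lower().split() for band, text in corpus.items()}
    n_docs = len(tokens)
    # Document frequency: the number of bands whose lyrics contain each word.
    df = Counter(word for words in tokens.values() for word in set(words))
    vectors = {}
    for band, words in tokens.items():
        counts = Counter(words)
        total = len(words)
        vectors[band] = {
            word: (count / total) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return vectors

vectors = tfidf_vectors(lyrics)
```

In this toy corpus, "till" appears in only one band while "friends" appears in two, so "till" ends up with the larger weight for `band_a` even though both occur once in its lyrics.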

Because each band is a vector, we have a notion of distance between bands. In this example, we measure "lyric similarity" by the cosine distance between these TF-IDF vectors: the smaller the distance, the more similar the lyrics.
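Cosine distance is one minus the cosine of the angle between two vectors. The sketch below computes it for sparse vectors stored as `{word: weight}` dictionaries. The weights are made up for illustration, and treating an all-zero vector as maximally distant is an assumption of this example.

```python
import math

def cosine_distance(u, v):
    """Return 1 - cosine similarity for sparse vectors stored as {word: weight}."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # Assumption: an all-zero vector is treated as maximally distant.
    return 1.0 - dot / (norm_u * norm_v)

# Hypothetical TF-IDF weights for three bands.
band_a = {"true": 0.16, "till": 0.22, "friends": 0.08}
band_b = {"true": 0.10, "friends": 0.10, "hearts": 0.27}
band_c = {"darkness": 0.22, "soul": 0.22}

d_ab = cosine_distance(band_a, band_b)  # some shared words
d_ac = cosine_distance(band_a, band_c)  # no shared words
```

Since TF-IDF weights are non-negative, the distance falls between 0 (same direction) and 1 (no words in common), so `d_ab` comes out smaller than `d_ac` here.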

Using hierarchical agglomerative clustering, we can iteratively group together the most similar bands (and clusters of bands). This is a bottom-up approach: bands start out as individual nodes and merge with their closest neighbors. Clusters then merge with the nearest clusters, and we repeat until everything belongs to a single cluster. How the distance between two clusters is measured depends on the linkage type.
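The bottom-up procedure can be sketched in a few lines of plain Python. This is a naive version for illustration only, not the linked code: the function name `agglomerative`, the placeholder band labels, and the pairwise distances are all hypothetical, and a real implementation would normally rely on a library routine.

```python
from itertools import combinations

def agglomerative(names, dist, linkage="average"):
    """Repeatedly merge the two closest clusters; return the merge history.

    names: list of unique labels.
    dist: {frozenset({a, b}): distance} for every pair of labels.
    linkage: "single", "complete", or "average".
    """
    combine = {
        "single": min,                            # closest pair of members
        "complete": max,                          # farthest pair of members
        "average": lambda ds: sum(ds) / len(ds),  # mean over all member pairs
    }[linkage]

    def cluster_distance(c1, c2):
        return combine([dist[frozenset((a, b))] for a in c1 for b in c2])

    clusters = [(name,) for name in names]  # every band starts as its own cluster
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2),
                     key=lambda pair: cluster_distance(*pair))
        merges.append((c1, c2, cluster_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical pairwise distances between four placeholder bands.
bands = ["A", "B", "C", "D"]
distances = {
    frozenset(("A", "B")): 0.2,
    frozenset(("A", "C")): 0.9,
    frozenset(("A", "D")): 0.8,
    frozenset(("B", "C")): 0.85,
    frozenset(("B", "D")): 0.7,
    frozenset(("C", "D")): 0.3,
}
merges = agglomerative(bands, distances)
```

With these distances, A and B merge first, then C and D, and finally the two pairs join into one cluster. Passing `linkage="single"` or `linkage="complete"` changes how the distance between clusters is computed, which can change the order of later merges.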

Even though this essentially amounts to looking at weighted word counts, the system is able to find sensible communities of bands. A cluster with Damnation AD, Unbroken, and 108 seems reasonable when you consider that these bands tend to have introspective lyrics. It also correctly pulls out the Gorilla Biscuits + YOT + In My Eyes relationship (youth crew).

Click here for the code. Provided without warranty. Should be the last working version.

Selected linkage type: