Clustering in LSA space

This is something I tried out over the Christmas holidays and never quite found the time to get back to. But after I had the chance to meet Aurélie Herbelot at the Herrenhausen conference on big data this week and discuss her work on the distributional semantics of poetry, it all came back to me. In short - and please excuse any misrepresentation - Herbelot derives a measure of text coherence from semantic spaces and uses it to place a variety of texts on a continuum ranging from ‘random’ to ‘poetic’ to ‘factual’.

What I want to show here is something very similar that I ran into when I built an LSA model from my master’s thesis (using 500-word chunks as documents) and plotted it. The clustering algorithm that ran on top is almost unnecessary, as the data points themselves show the pattern very clearly: the chunks of highly self-referential and redundant academic text form dense assemblages, whereas the transcriptions of children’s discussions spread out broadly. Two threads also become visible, which correspond to the two focus groups I conducted.

Leaving out the appendix that holds the complete transcriptions reveals shorter passages cited in the text:

Made with Python, using gensim and sklearn. Code will follow at some point; all of this needs some more work.
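
In the meantime, here is a minimal sketch of the kind of pipeline described above: split a text into fixed-size word chunks, project the chunks into an LSA space, and cluster them there. It is not the code behind the plots. It assumes scikit-learn only, with `TruncatedSVD` on a TF-IDF matrix standing in for gensim's LSI model, and k-means as one possible clustering algorithm. The helper names (`chunk_words`, `lsa_clusters`), the parameter values, and the two toy texts are made up for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer


def chunk_words(text, size=500):
    """Split text into consecutive chunks of `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def lsa_clusters(chunks, n_topics=2, n_clusters=2, seed=0):
    """Project chunks into an LSA space and cluster them there (hypothetical helper)."""
    tfidf = TfidfVectorizer().fit_transform(chunks)
    # Truncated SVD of a TF-IDF matrix is a common way to do LSA in scikit-learn;
    # it is an assumption here, swapped in for gensim's LSI model.
    svd = TruncatedSVD(n_components=n_topics, random_state=seed)
    coords = svd.fit_transform(tfidf)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = kmeans.fit_predict(coords)
    return coords, labels


# Toy stand-ins for the thesis: a redundant "academic" part and a "transcript" part.
academic = "the model of the discourse frames the model of the discourse " * 20
transcript = "i think dogs can fly no they cannot maybe birds yes but why " * 20

# Small chunks because the toy texts are short; the post used 500 words per chunk.
academic_chunks = chunk_words(academic, size=50)
transcript_chunks = chunk_words(transcript, size=50)
chunks = academic_chunks + transcript_chunks

coords, labels = lsa_clusters(chunks)
print(coords.shape)
print(labels)
```

With two components, `coords` can be passed straight to a scatter plot, coloured by `labels`. For a real text, more components and a separate 2D projection for plotting would probably be needed.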
