Clustering in LSA space

Actually, this is something I tried out over the Christmas holidays and never quite found the time to get back to, but after I had the chance to meet Aurélie Herbelot at the Herrenhausen conference on big data this week and discuss her work on the distributional semantics of poetry, it all came back to me. In short - and please excuse any misrepresentation - Herbelot derives a measure of text coherence from the semantic space and uses it to display a variety of texts as a continuum ranging from ‘random’ to ‘poetic’ to ‘factual’.

What I wanted to show here is something very similar that I ran into when using my master’s thesis to build an LSA model (with 500-word chunks as documents) and plotting it. The clustering algorithm that ran on top is almost unnecessary, as the data points themselves show something very clearly: the chunks of highly self-referential and redundant academic text form dense assemblages, whereas the transcriptions of children’s discussions spread out broadly. Two threads also become visible, corresponding to the two focus groups I conducted.

Leaving out the appendix that holds the complete transcriptions reveals shorter passages cited in the text:

Made with Python, specifically gensim and scikit-learn. Code will follow at some point; all of this needs more work.

Musil's Man Without Qualities

At our DARIAH-DE cluster unconference last week we had time to develop a simple new workflow using DKPro, Jython and Gephi. We use the Apache UIMA-based framework to do the heavy lifting, in this case everything up to and including named entity recognition (NER). We invoke the pipeline and process its output via a Python script.

The basic idea here is to find a window within which co-occurring named entities count as connected. Any two entities within that window (and if there are more, all pairwise combinations count) are taken as nodes and written as edges into a .csv file. That .csv file can then be imported into Gephi to produce network graphs like the ones shown below [from the first book of Musil’s Man Without Qualities]:
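The co-occurrence step can be sketched like this: entities found in the same window are combined pairwise and written as a Gephi edge list. The entity lists below are hypothetical stand-ins (a few of Musil’s characters), not output of the actual DKPro pipeline.

```python
# Sketch: turn per-window entity lists into a Gephi edge list (.csv).
# The windows are hypothetical examples, not real NER output.
import csv
from itertools import combinations

# one list of recognized entities per window (e.g. per paragraph)
windows = [
    ["Ulrich", "Walter", "Clarisse"],
    ["Ulrich", "Diotima"],
    ["Diotima", "Arnheim", "Ulrich"],
]

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Source", "Target"])  # header Gephi understands
    for entities in windows:
        # all pairwise combinations within one window count as edges
        for source, target in combinations(sorted(set(entities)), 2):
            writer.writerow([source, target])
```

Gephi’s spreadsheet importer will merge repeated pairs into weighted edges, so the same combination may safely appear once per window.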

The paragraph segmentation above (note to self: use the DKPro built-in paragraph splitter) yields a fairly complete picture of mentions, whereas co-occurrence within sentences should indicate a more direct connection between the mentioned entities:
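The effect of the segmentation choice can be illustrated on a toy text: the same passage is split once into paragraphs and once into sentences, and entities (matched here against a hypothetical name list, not real NER) are paired per unit. Paragraph windows connect more entities than sentence windows.

```python
# Sketch: compare paragraph vs. sentence windows on a toy text.
# NAMES and the text are hypothetical; real NER replaces the lookup.
import re
from itertools import combinations

NAMES = {"Ulrich", "Walter", "Clarisse"}

text = ("Ulrich visited Walter. Clarisse stayed at home.\n\n"
        "Walter played the piano. Ulrich left early.")

def entity_pairs(units):
    """Collect all entity pairs that co-occur within one unit."""
    pairs = set()
    for unit in units:
        found = sorted(name for name in NAMES if name in unit)
        pairs.update(combinations(found, 2))
    return pairs

paragraphs = text.split("\n\n")
sentences = re.split(r"(?<=[.!?])\s+", text)

para_pairs = entity_pairs(paragraphs)  # coarser, more edges
sent_pairs = entity_pairs(sentences)   # finer, more direct ties
```

On this toy text the paragraph windows yield three distinct pairs while the sentence windows yield only one, which is exactly the completeness/directness trade-off described above.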

As the StanfordNamedEntityRecognizer must be instructed to look for specific kinds of entities, I tried them all; apart from ‘person’, only ‘location’ led to the extraction of a meaningful network from the novel:
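Downstream of the recognizer, restricting the network to one entity type is a simple filter over (entity, type) tuples. The tuples below are hypothetical stand-ins for the recognizer’s output.

```python
# Sketch: keep only one entity type before building the edge list.
# The (entity, type) tuples are hypothetical, not real NER output.
entities = [("Ulrich", "person"), ("Vienna", "location"),
            ("Diotima", "person"), ("Kakanien", "location")]

locations = [name for name, etype in entities if etype == "location"]
```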

As the close-ups of the network graph show, NER needs to be improved in at least two ways: it has to be tuned for German as well as for literary texts. Another thing to experiment with is which segmentation(s) to use. A more theoretical approach, for example based on concepts of scene structure, might prove fruitful here - but also a little harder to implement.

[This is still work in progress, but some of our other recipes are online already. You can find them here.]