Musil's Man Without Qualities

At our DARIAH-DE cluster unconference last week we had time to develop a simple new workflow using DKPro, Jython and Gephi. We use DKPro, the Apache UIMA-based framework, to do the heavy lifting, in this case everything up to and including named entity recognition (NER). We invoke the pipeline and process its output via a Python script.

The basic idea is to choose a window (a paragraph or a sentence, for example) within which named entities that occur together count as connected. Any two entities within that window are taken as nodes and written as an edge into a .csv file; if a window contains more than two entities, every pairwise combination counts. That .csv file can be imported into Gephi to produce network graphs like the ones shown below [from the first book of Musil’s The Man Without Qualities]:
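To make the counting step concrete, here is a minimal sketch of how such an edge list could be built. It is not the script we used at the unconference: it assumes the NER output has already been grouped into one list of entity strings per window, the function names and the sample data are made up for illustration, and it is written for plain Python 3, so it would need small changes to run under Jython.

```python
import csv
from collections import Counter
from itertools import combinations


def cooccurrence_edges(segments):
    """Count undirected edges between entities that share a window."""
    edges = Counter()
    for entities in segments:
        # Deduplicate and sort so that (A, B) and (B, A) become the same edge.
        unique = sorted(set(entities))
        for source, target in combinations(unique, 2):
            edges[(source, target)] += 1
    return edges


def write_edge_csv(edges, path):
    """Write the edges as a Source/Target/Weight table."""
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["Source", "Target", "Weight"])
        for (source, target), weight in sorted(edges.items()):
            writer.writerow([source, target, weight])


# Hypothetical NER output: one list of entity mentions per window.
segments = [
    ["Ulrich", "Diotima", "Arnheim"],
    ["Ulrich", "Diotima"],
    ["Clarisse"],
]
edges = cooccurrence_edges(segments)
write_edge_csv(edges, "edges.csv")
```

The Source, Target and Weight column headers are the ones Gephi's spreadsheet import recognizes for edge tables. A window with a single entity, like the last one here, contributes no edge at all.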

The paragraph segmentation above (note to self: use the DKPro built-in paragraph splitter) yields quite a complete picture of mentions, whereas co-occurrence within sentences should indicate a more direct connection between the mentioned entities:

As the StanfordNamedEntityRecognizer must be instructed to look for specific kinds of entities, I tried them all. Apart from the ‘person’ attribute, ‘location’ also led to the extraction of a meaningful network from that novel:
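On the Python side, keeping one entity type per network amounts to a simple filter. The sketch below assumes the pipeline output is available as (text, type) pairs; the pair format, the function name and the label strings "PERSON" and "LOCATION" are illustrative assumptions, since the actual labels depend on the NER model in use.

```python
def entities_of_type(mentions, wanted_type):
    """Keep only the surface forms of mentions tagged with wanted_type."""
    return [text for text, entity_type in mentions if entity_type == wanted_type]


# Hypothetical tagged mentions from a single window.
mentions = [
    ("Ulrich", "PERSON"),
    ("Wien", "LOCATION"),
    ("Diotima", "PERSON"),
    ("Kakanien", "LOCATION"),
]
people = entities_of_type(mentions, "PERSON")
places = entities_of_type(mentions, "LOCATION")
```

Each filtered list can then be treated as one window when counting co-occurrences, which yields one network per entity type.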

As close-ups of the network graph show, NER needs to be improved in at least two ways: it should be tuned for German as well as for literary texts. Another thing to experiment with is which segmentation(s) to use. A more theoretical approach, for example one based on concepts of scene structure, might prove fruitful here, but would also be a little harder to implement.

[This is still work in progress, but some of our other recipes are online already. You can find them here.]
