Medieval Swords and Graph Theory

A few days ago, I came across a challenge put up by the British Library (here, German and here, English). They have a sword (not too far from UCL where I work) with an inscription that seems to be strange at the first glance.

+NDXOXCHWDRGHDXORVI+

British Museum 1858,1116.5 (Picture thanks to the British Museum)

In the comment section of their blog, many people speculated about the contents of the inscription. Unfortunately, it is very hard to find a compelling explanation without any bias towards a specific interpretation. I present an approach which analyses the occurrences of pairs of words in a language to build a probability graph which finds the explanation with the maximum likeihood for a given language input:

Nomine Domini Sancta Trinitas CHW De Rex Gratus HD X. Ob Res Vos Impugnant
in the name of Our Lord and the Holy Trinity to NAME from the grateful sovereign NAME the tenth against things that attack you

The Assumptions

According to Marc van Hasselt (here, far down) there are a few assumptions which may be valid in this context:

XOX, or OXO may refer to the Holy Trinity
ND may refer to Nostrum Domine (Our Lord) or Nomine Domini (in the name of Our Lord)

As I am much more familiar with physics as I am with medieval swords, I kept these two assumptions for my analysis. Moreoever, it is safe to assume that the overall language of the inscription is latin, as it was the most prevalent language around 1300 (the time the sword has been dated to from the British Museum). There is one letter though, that is not compatible with this idea: the W which is not part of the alphabet used in Latin. Nevertheless, it was common to use single words (mostly names) embedded in Latin phrases. Therefore, I assume that this letter belongs to a name.

Last but not least: there are very few vowels in this inscription (you can barely pronounce it, but please go ahead and try it – it’s fun as long as you are alone in the office). So I think that the letters may be the initial letters of words in a sentence. SITTTLMBTILOWIAS would be the last sentence, and GEODIPTQUIBAATQILCNGA would be the famous beginning of Caesar’s De Bello Gallico.

The Idea

Latin consists of seperate words (citation needed) which (in the proper order, one of the major challenges for a student) form sentences. While checking for correct grammar is a highly non-trivial task, the structure of sentences can be approximated by pairs of consecutive words, the so called bigrams. As an example, let’s consider the famous last words of Hamlet (the figure in the play, not the play itself, as it continues after Hamlet’s death):

 The rest is silence.
|--------|
    |-------|
          |---------|

This sentence has three bigrams. Now each of these unique pairs has a relative frequency over the text written in a specific language, as long the words in question have not been added at some time in history. Of course for new words (like “Quantum Mechanics”) there are no occurrences before they became meaningful. When we analyse a large batch of texts in a given language, then we can build a list of bigrams with their relative frequency. This requires tremendous efforts and results in databases like this one from Google Books.

Now let’s assume we have such a list of bigrams for a language. The problem we have here, is that we know the initial letters but we do not know the remaining ones. So we are not looking for a specific bigram, we are interested in all bigrams for that language that share their initial letters. Hamlet would say we are looking for T-r, r-i, and i-s.

 The rest is silence.
|   T-r  |
    |  r-i  |
          |   i-s   |

Now for each tuple of characters, we can find all bigrams in the given language and weight them by their probability. This probability is defined by the relative frequency of occurrence in the underlying text batch. Once we have all these bigrams, we have separate lists of pairs and their relative weight. As a fictious example:

Word 1	Word 2	Occurrences
The	results	30
The	rest	19
Tim	rides	1

In this case, we have three bigrams with 50 occurrences in total. We can build such a list for all three bigrams. Now the only constraint to a valid solution for the abbreviation Tris is that the words are identical for the same letter. In other words: the last word of the “solution” to any bigram has to be the first word of the “solution” of the next bigram. Otherwise, we would not have a well-defined sentence.

This is where math comes into play. You can see this as a path problem. Each word in all bigrams is a city and each bigram (which links two words or cities) is a road in-between. Now each road has a probability that you take it. In our example for the first bigram, we have this map
path4505
where each arrow has a probability. I have used arrows here, because we do not want to go backwards but go straight through all the letters in the word. This structure is called a directed graph or digraphs. There are many algorithms available to find the shortest path (they mean the path with the highest probability in this case) on graphs, which is why it is convenient to express the problem in terms of digraphs. The arrows are called edges and the boxes are called nodes for graphs.

Do we actually need this? Can’t we just take the pair with the highest probability for each bigram list? No, as this would be another simplification which may omit important sentences and may even lead to incomplete interpretations. Just imagine the edge case (bad joke, I know) where you have the Hamlet statement above and a very high probability explanation for T-r but there is no bigram available that starts with the second word of this explanation. Then we would be stuck: no explanation whatsoever.

Now there is one problem: these algorithms for finding a suitable path through a digraph take the weights on the edges as they are defined. We have to make sure that it is impossible for one good explanation somewhere in the course of the abbreviation to empower bad explanations around it by giving a very low weight. My solution here is arbitrary: the occurrences are expressed in percent and then subtracted from 100 – so higher probability means smaller weight.

In the end, we will have a list of preferred paths in this graph. But this requires a pile of Latin texts.

The Data

I made use of the collection of public domain Latin texts of The Latin Library. This is a quite substantial source, but may not be large enough. Unfortunately, there is no corpus for Latin available online that I am aware of. For those who may have access to such a corpus, I publish the code on this page – just rerun with your data. I downloaded the website with wget -m and used html2text for conversion (bash):

for i in *html; 
do 
  echo $i; 
  html2text --ignore-emphasis --ignore-links --ignore-images \
     -b 0 $i >> raw.txt; 
done

The result is about 16 MB of raw text which has been tokenised (split into words) using the Natural Language Toolkit and python:

import nltk

raw = open('raw.txt').read().decode("ascii", "ignore").encode("ascii")
tok = nltk.word_tokenize(raw)
bgs = nltk.bigrams(tok)
fdist = nltk.FreqDist(bgs)
for k,v in fdist.items():
    print "%s %s\t%d" % (k[0], k[1], v)

This results in 1 529 223 distinct bigrams. Not too much but a good start.

Results

Using the code given below and this dataset, I got the following 15 suggestions for the part after the name letter W, namely DRGHDXORVI:

de re gratiam hic dies xvi o rex vero in
de re gratiam hic dies x. omnes res vel in
dedecore reliquum genus hominum diebus xv o rex vero in
domi relictus genus hominum diebus xv o rex vero in    
de re gratiam hic dies x. omnes rustici vel in         
de re gratiam hic dies x. omnes romanos vel in         
de re gratiam hic dies x. omnes risum vel in           
de re gratiam hic dies x. omnes romanorum vel in       
de re gratiam hic dies x. omnes roma vel in            
de re gratiam hic dies x. omnes rebus vel in           
debitum reliquum genus hominum diebus xv o rex vero in 
dati remedii genus hominum diebus xv o rex vero in       
de re gratiam hic dies x. omnes res vero in            
de re gratiam hoc die xvi o rex vero in        
de re gratiam hic dies x. omnes reges vero in

As mentioned above, this approach knows nothing about grammar. But to me, most of these suggestions sound like dedications – this is where the interpretation starts. If this is a dedication, it could start with “de rex gratus” (from the grateful king/sovereign). With the name W before that and the NDXOX interpretation above, we now have “in the name of Our Lord and the Holy Trinity to NAME from the grateful sovereign”.

For the D in the remaining part HDXORVI, the word dies (day) is very stable. Even if I just check the HDX part alone, all the most popular suggestions contain dies. So let’s assign the meaning of “day” to this letter. The next strange one is the X. While the x is not particularly rare in Latin, my dictionary lists no Latin words that begin with one. So I think that this is either a delimiter or a digit. If it were a digit, we have to check neighbouring letters for digits, as well. While the O is no digit, the D may be one. But together, DX would be 510 – which does not make sense to me here. From the structure of the sentence, one could also think that this is the name of the sovereign HD the tenth. I am not very familiar with the origin of the sword and the region where it has been manufactured – maybe somebody with more detailed knowledge on the history of this particular artefact can evaluate this idea. For now, I will stick to this interpretation, because it is the most conclusive one I see in the results from the likelihood analysis. Now we have “in the name of Our Lord and the Holy Trinity to from the grateful sovereign the tenth”.

There are four letters remaining: ORVI. Let’s see the ten most likely bigram combinations for that:

ob rem vos in                            
ob rem victus in                         
ob rem videlicet in                      
ob rem volo interim                      
ob rem video in                          
ob rem vos iam                           
ob rem vos ipsi                          
ob rem vos igitur                        
ob rem volo ire                          
ob rem veram intelligenciam

That’s interesting: they all start with “ob rem” which would mean “against things” (after you correct the grammar). Now there is a problem you may see in every dictionary: check for the meanings of the Latin word “res” – in my dictionary, this fills pages. So let’s think about the remaining letters. The most probable word “vos” would be “you” in the accusative 2nd person plural. The candidates for the last letter I are not too convincing. Given the rest of the sentence I would expect something a sword is helpful for. So far we have “in the name of Our Lord and the Holy Trinity to from the grateful sovereign the tenth against things that … you”. As there are three words (impugnare, invadere, invehere) for attacking that start with an I, the best interpretation I see is

in the name of Our Lord and the Holy Trinity to NAME from the grateful sovereign NAME the tenth against things that attack you

Maybe researchers with a strong background in history, Latin and linguistics can make more sense of the data here, but I think this interpretation is rather compelling – and based on statistical data. Perhaps this approach is helpful for researchers who have access to a bigger data source of texts in Latin – I would be interested to learn whether this has been tried already and what the results would be if the script would crunch on a larger amount of input text. If you have some ideas, please let me know.

The Code

Of course, you do not want to do things by yourself. Make the computer do it. Finally, here is the code I used. This code uses NetworkX for graph operations. Everything related to finding the n shortest paths is from Guilherme Maia – I borrowed his code from here.

import networkx as nx
import os
from heapq import heappush, heappop
from itertools import count
import sys

# https://gist.github.com/guilhermemm/d4623c574d4bccb6bf0c
def k_shortest_paths(G, source, target, k=1, weight='weight'):
    """Returns the k-shortest paths from source to target in a weighted graph G.
    Parameters
    ----------
    G : NetworkX graph
    source : node
       Starting node
    target : node
       Ending node
       
    k : integer, optional (default=1)
        The number of shortest paths to find
    weight: string, optional (default='weight')
       Edge data key corresponding to the edge weight
    Returns
    -------
    lengths, paths : lists
       Returns a tuple with two lists.
       The first list stores the length of each k-shortest path.
       The second list stores each k-shortest path.  
    Raises
    ------
    NetworkXNoPath
       If no path exists between source and target.
    Examples
    --------
    &gt;&gt;&gt; G=nx.complete_graph(5)    
    &gt;&gt;&gt; print(k_shortest_paths(G, 0, 4, 4))
    ([1, 2, 2, 2], [[0, 4], [0, 1, 4], [0, 2, 4], [0, 3, 4]])
    Notes
    ------
    Edge weight attributes must be numerical and non-negative.
    Distances are calculated as sums of weighted edges traversed.
    """
    if source == target:
        return ([0], [[source]]) 
       
    length, path = nx.single_source_dijkstra(G, source, target, weight=weight)
    if target not in length:
        raise nx.NetworkXNoPath("node %s not reachable from %s" % (source, target))
        
    lengths = [length[target]]
    paths = [path[target]]
    c = count()        
    B = []                        
    G_original = G.copy()    
    
    for i in range(1, k):
        for j in range(len(paths[-1]) - 1):            
            spur_node = paths[-1][j]
            root_path = paths[-1][:j + 1]
            
            edges_removed = []
            for c_path in paths:
                if len(c_path) &gt; j and root_path == c_path[:j + 1]:
                    u = c_path[j]
                    v = c_path[j + 1]
                    if G.has_edge(u, v):
                        edge_attr = G.edge[u][v]
                        G.remove_edge(u, v)
                        edges_removed.append((u, v, edge_attr))
            
            for n in range(len(root_path) - 1):
                node = root_path[n]
                # out-edges
                for u, v, edge_attr in G.edges_iter(node, data=True):
                    G.remove_edge(u, v)
                    edges_removed.append((u, v, edge_attr))
                
                if G.is_directed():
                    # in-edges
                    for u, v, edge_attr in G.in_edges_iter(node, data=True):
                        G.remove_edge(u, v)
                        edges_removed.append((u, v, edge_attr))
            
            spur_path_length, spur_path = nx.single_source_dijkstra(G, spur_node, target, weight=weight)            
            if target in spur_path and spur_path[target]:
                total_path = root_path[:-1] + spur_path[target]
                total_path_length = get_path_length(G_original, root_path, weight) + spur_path_length[target]                
                heappush(B, (total_path_length, next(c), total_path))
                
            for e in edges_removed:
                u, v, edge_attr = e
                G.add_edge(u, v, edge_attr)
                       
        if B:
            (l, _, p) = heappop(B)        
            lengths.append(l)
            paths.append(p)
        else:
            break
    
    return (lengths, paths)

def get_path_length(G, path, weight='weight'):
    length = 0
    if len(path) &gt; 1:
        for i in range(len(path) - 1):
            u = path[i]
            v = path[i + 1]
            
            length += G.edge[u][v].get(weight, 1)
    
    return length    

def fetch_line(fh):
	try:
		line = next(fh)
	except:
		return None
	parts = line.split('\t')
	words = parts[0].split()
	w1 = words[0].split('_')[0]
	w2 = words[1].split('_')[0]
	count = int(parts[1])
	return w1.lower(), w2.lower(), count, len(line)

def find_2_gram(letters, fh, filesize):
	l1, l2 = letters
	print 'Finding 2-grams for %s and %s' % (l1, l2)
	res = fetch_line(fh)

	out = dict()
	tcount = 0
	linec = 0
	chartotal = 0

	print 'Starting ...'
	while res != None:
		w1, w2, count, charcount = res
		if w1.startswith(l1):
			if w2.startswith(l2):
				if w1 in out:
					if w2 in out[w1]:
						out[w1][w2] += count
					else:
						out[w1][w2] = count
				else:
					out[w1] = dict()
					out[w1][w2] = count
				tcount += count
		res = fetch_line(fh)
		linec += 1
		chartotal += charcount
		if linec % 100000 == 0:
			print '\r%05.3f percent done' % (float(chartotal) / filesize * 100), 

	print '\rDone.'
	return out, tcount

def update_graph(G, twogramdict, tcount, idx, init=False, end=False):
	for w1 in out:
		c = out[w1]
		if init:
			G.add_edge('BEGIN', str(idx) + '.' + w1, weight=1)
		for w2 in c:
			if end:
				G.add_edge(str(idx+1) + '.' + w2, 'END', weight=1)
			G.add_edge(str(idx) + '.' + w1, str(idx+1) + '.' + w2, weight=1-float(c[w2])/tcount)

def find_filename(letters):
	return 'base-lt/complete'
	l1, l2 = letters
	if l1 == 'a':
		return 'base-en/googlebooks-eng-all-2gram-20120701-aa'
	if l1 == 't':
		return 'base-en/googlebooks-eng-all-2gram-20120701-te'
	return None

if len(sys.argv) != 2:
	print 'Usage: %s abbreviation' % sys.argv[0]
	exit(1)

G = nx.DiGraph()

word = sys.argv[1]
pairs = zip(word[:-1], word[1:])
#pairs = (('a', 't'),('t', 'b'))

for idx, pair in enumerate(pairs):
	filename = find_filename(pair)
	filesize = os.path.getsize(filename)
	out, tcount = find_2_gram(pair, open(filename), filesize)
	update_graph(G, out, tcount, idx, init=(idx==0), end=(idx==len(pairs)-1))

ws, ps = k_shortest_paths(G, 'BEGIN', 'END', k=100)

print 'Score (low=best) Explanation'
for w, p in zip(ws, ps):
	print w, ' '.join(p[1:-1])

To run it, save it as search.py and give the search string as argument after you have set the filename of the corpus in find_filename():

python -u search.py abc