
Tagging the Lexicon

Every writing technology has advantages and drawbacks. The papyrus codex, for example, was (for its time) user-friendly and easy to search, because its architecture was rather similar to that of the world wide web — composed of a large number of short pages, rather than a continuous scroll.

It did, on the other hand, have drawbacks for copying, which was very labour-intensive, and also for permanent archiving, due to its fragility, which may be seen from this illustration of Charles Hedrick working on the Nag Hammadi codices.


Similar factors still apply in the modern world. Information flow is optimised when the message doesn't depend entirely on the medium, but can be translated across a variety of vehicles. For this reason, we are composing the Cambridge Greek Lexicon using XML technology (the letters stand for 'extensible markup language'). This means that our pages are not just formatted for appearance, as with word-processing software, but are also typeset for publication, and configured for online display and searching in electronic editions.

Here is a comparison of the two systems. First, a page of the lexicon composed using a word-processing program looks something like this:


This electronic typing standardises and preserves formatting quite well. However, extensive proof-reading has to be undertaken, and, in order to be translated into other media, the structure underlying the formatting also needs to be recorded. For example, the plain-text passages sometimes express the definition of the headword, but when bracketed, they may express an introductory or following explanatory remark, or encyclopaedic information, and these need to be identified if the lexicon is to be searched.

Such information can be preserved using XML 'tagging' (XML is a close relative of the HTML which is used to format WWW pages; both derive from the older SGML standard). The basis of the system is an extended use of tags (labels inside angle brackets), which in HTML are often used to mark format: instead of making a section of text bold by changing the font style, the passage is simply enclosed within "<b>" tags.

In XML, the tags can define structure as well as format, and we can configure our own tags, so we can mark the headword or lemma by enclosing it in a specific tag. We can stipulate that this tag always marks the headword (a structural function), and that the text inside it is always in a bold Greek font (a formatting requirement). Similarly, we can tag the inflection, dialect forms, principal parts, definitions, and contextual information.
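The dual role of a tag (structure plus format) can be illustrated with a minimal Python sketch. It is not the lexicon's actual tool chain: the tag names 'HL' and 'PS' are the lemma and part-of-speech tags described later in this article, while the style table, the HTML output, and the label "vb." are assumptions made for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical style table: each structural tag is given one fixed format.
STYLE = {
    "HL": ("<b>", "</b>"),  # headword lemma: always bold
    "PS": ("<i>", "</i>"),  # part-of-speech label: italic (assumed)
}

def render(elem):
    """Turn a tagged element into HTML, taking the format from the tag name."""
    open_, close = STYLE.get(elem.tag, ("", ""))
    parts = [open_, escape(elem.text or "")]
    for child in elem:
        parts.append(render(child))
        parts.append(escape(child.tail or ""))
    parts.append(close)
    return "".join(parts)

# Illustrative fragment; "vb." is a placeholder label.
entry = ET.fromstring("<VE><HL>libazomai</HL> <PS>vb.</PS></VE>")
html_out = render(entry)
print(html_out)
```

The writer marks only what a piece of text is; the appearance is decided once, in the style table, and applied everywhere.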

We have found that 100 different tags suffice to cover every type of entry in the lexicon. Here is the start of the page shown above, now marked up in XML: 


At first glance, this may look rather forbidding, but it soon becomes as natural as setting styles in a 'Word' document. We select the tags as we write. For example, the first entry, for libazomai, is enclosed in tags marked 'VE', because it is a 'verb entry'. Within that 'wrapper', there is a 'verbal head group' (vHG), which contains the lemma (HL), the part of speech label (PS), and the etymology (Ety). These elements may contain others within them, in a hierarchical structure. And some of the elements are primarily there to facilitate searching: for example, inside the etymology tag, the related word libas is enclosed in 'Ref' tags, which indicate that it refers to another headword in the lexicon.
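The nesting just described can be approximated in a short Python sketch. The tag names (VE, vHG, HL, PS, Ety, Ref, S1) are taken from the description in this article; the exact arrangement, the label "vb." and the definition text are placeholders, not the lexicon's real markup.

```python
import xml.etree.ElementTree as ET

# Illustrative entry: tag names from the article, content invented.
SAMPLE = """
<VE>
  <vHG>
    <HL>libazomai</HL>
    <PS>vb.</PS>
    <Ety><Ref>libas</Ref></Ety>
  </vHG>
  <S1>placeholder definition</S1>
</VE>
"""

def outline(elem, depth=0):
    """Return the tag hierarchy as a list of indented lines."""
    lines = ["  " * depth + elem.tag]
    for child in elem:
        lines.extend(outline(child, depth + 1))
    return lines

entry = ET.fromstring(SAMPLE)
tree_lines = outline(entry)

# Cross-references to other headwords can be collected wherever they occur.
refs = [ref.text for ref in entry.iter("Ref")]

print("\n".join(tree_lines))
print("cross-references:", refs)
```

Printing the outline shows the 'wrapper' containing the head group and the sense section, with the cross-reference nested inside the etymology.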

The definitions and translations appear inside 'S1' tags, and there are also 'S2' tags (not shown here) for subsections illustrating nuances of meaning. Within these 'S' elements are many others marking authors and contextual information, such as the subjects and objects which a verb takes, examples of nouns qualified by an adjective, or verbs modified by an adverb.

This level of precision means that we can immediately translate the page into print-quality format, producing a PDF page.

This gives us an accurate picture of the finished product, enabling us to identify typing errors and unwanted variations of style and content while we are writing. There are other advantages too: because we have organised the tags within a specific structure, we are encouraged to be consistent in the way we write each entry, and so we can maintain a 'house style'.

The final step is to combine these individual pages into a single paginated PDF document to produce the final typeset copy.

We may sum up the advantages of XML authoring under five headings:

1: An integrated, flexible writing and publishing environment

We can cope with any technical problems which might arise as we proceed, and produce precise formatting for the typesetters, which should make the task of proof-reading much easier.

2: A consistent writing style

Inconsistencies are almost unavoidable in typed copy, and especially when articles are written by more than one person. For example, when citing Euripides Antiope, LSJ refers to "Antiop.iv B, [line number] A" and also "Antiop.iv B line ... Arn." and sometimes "Antiop.iv B line ... Arnim", or else "Antiop.p.21 A" or "Antiop.B 58 p.21 A". All these citations refer to the same fragment (fr.10 in Page's Select Papyri). Consistency could have been maintained if the authors of LSJ had been able to compare all their citations easily.

XML allows us to apply maximal constraints to entries, and so enables all the members of the editorial team to maintain consistent style and format.
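To give a concrete sense of what such constraints can look like, here is a small Python sketch that checks an entry against a list of required elements. In practice a constraint of this kind would normally be expressed in a schema or DTD; the hand-written check and the particular rules below are assumptions chosen only for illustration.

```python
import xml.etree.ElementTree as ET

# Assumed rules, for illustration only: child tags that each parent must contain.
REQUIRED = {
    "VE": ["vHG", "S1"],
    "vHG": ["HL", "PS"],
}

def check(elem, path=""):
    """Return a list of messages for every required element that is missing."""
    here = f"{path}/{elem.tag}"
    problems = []
    present = {child.tag for child in elem}
    for needed in REQUIRED.get(elem.tag, []):
        if needed not in present:
            problems.append(f"{here}: missing <{needed}>")
    for child in elem:
        problems.extend(check(child, here))
    return problems

good = ET.fromstring(
    "<VE><vHG><HL>libazomai</HL><PS>vb.</PS></vHG><S1>placeholder</S1></VE>"
)
bad = ET.fromstring("<VE><vHG><HL>libazomai</HL></vHG></VE>")

print(check(good))
print(check(bad))
```

An entry that omits a required element is reported at once, whoever wrote it, which is how a shared structure supports a shared style.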

3: A structure which reflects our methodology

Our aim has been to create structures which impose constraints on the writing, yet remain flexible enough to contain the range of information which we may wish to enter. We achieve this most importantly through the innovation of using dedicated structures for each part of speech. This enables us to maintain a balance between extended definitions, translation glosses, and contextual and encyclopaedic information, so that we are helped to write the last entries in the same style as we wrote the first.

4: A product which is translatable across publishing media

There will be an electronic edition on the Perseus site. The system means that it can be easily and accurately searched. Dictionaries which are tagged after they were written necessarily contain fewer tagged elements than ours (as they were not composed with such a precise structure), so fewer types of search are possible. A reader of our lexicon can, for example, see how vocabulary changes across the range of Ancient and Koiné Greek, because we mark usage in a corpus of 70 authors, from Homer to Plutarch, and so we can compare word frequency in different writers. And our system will also be linked to other Perseus databases, to images as well as to texts.
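A search of this kind might be sketched in Python as follows. The 'Au' author tag, the 'Lexicon' wrapper element, the headwords and the author abbreviations are all hypothetical; the article says only that authors are marked inside the 'S' elements, not how.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature lexicon; <Au> and <Lexicon> are assumed names.
LEXICON = """
<Lexicon>
  <VE><vHG><HL>word-a</HL></vHG><S1><Au>Hom.</Au> <Au>Plu.</Au></S1></VE>
  <VE><vHG><HL>word-b</HL></vHG><S1><Au>Hom.</Au></S1></VE>
  <VE><vHG><HL>word-c</HL></vHG><S1><Au>Hom.</Au></S1></VE>
</Lexicon>
"""

def headwords_by_author(root):
    """Map each author to the set of headwords whose entries cite that author."""
    index = {}
    for entry in root.iter("VE"):
        lemma = entry.findtext("vHG/HL")
        for author in entry.iter("Au"):
            index.setdefault(author.text, set()).add(lemma)
    return index

index = headwords_by_author(ET.fromstring(LEXICON))
counts = {author: len(words) for author, words in index.items()}

for author in sorted(counts):
    print(author, counts[author], sorted(index[author]))
```

The query works only because the author is a tagged element in its own right; in a dictionary tagged after the fact, the same information might be buried in untagged text.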

5: Better-organised material

XML releases us from the constraints on space of the printed book. Most usefully, our 'annotation' element allows us to incorporate editorial notes in each entry, for reference during the writing, and as a permanent archive of our research.

And cross-reference elements enable us to perform electronic searches during the writing and proof-reading stages. That has a number of advantages:

(a) We shall be able to group related words together, so we can easily compare all words sharing the same stem, and write the entry for a simple form before dealing with its derivatives. It is useful to compare the entries for all the compounds from the verb bainw (go), which can take the preverbs ana-, anti-, apo-, dia-, eis-, ek-, epi-, kata-, exana-, meta-, para-, peri-, poti-, pro-, pros-, sum-, huper-, and hupo-, rather than only treating them in alphabetical order.

(b) We can investigate the range of meanings of the prefixes themselves, across the different primary forms (as in the derivatives of bainw listed above), and compare this with their uses as independent prepositions and adverbs.

(c) We can incorporate cultural information. For example, colour terms constitute a group which is currently the subject of considerable semantic interest, and XML tagging enables us to study them not only in their primary forms, but also in compounds, where they may appear as stems, as in akro-kelainiown, with black surface; dia-melainw, become quite dark; hupo-glaukos, somewhat grey; and huperuthros, rather red. They also appear as prefixes combined with a noun like aspis ('shield'): we find leuk-aspis ('white-shielded'), phoinik-aspis ('red-shielded'), chalk-aspis ('bronze-shielded'), and chrus-aspis ('gold-shielded').
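The grouping of related words described in (a) to (c) could be sketched in Python along the following lines. The sketch assumes that each derivative's etymology contains a 'Ref' to its simple form, as in the libazomai example; the 'Lexicon' wrapper and the sample entries are invented for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical entries: two compounds pointing back to the simple verb.
LEXICON = """
<Lexicon>
  <VE><vHG><HL>anabainw</HL><Ety><Ref>bainw</Ref></Ety></vHG></VE>
  <VE><vHG><HL>bainw</HL></vHG></VE>
  <VE><vHG><HL>katabainw</HL><Ety><Ref>bainw</Ref></Ety></vHG></VE>
</Lexicon>
"""

def group_by_reference(root):
    """Group headwords under the headword their etymology refers to."""
    groups = defaultdict(list)
    for entry in root.iter("VE"):
        lemma = entry.findtext("vHG/HL")
        for ref in entry.iterfind("vHG/Ety/Ref"):
            groups[ref.text].append(lemma)
    return dict(groups)

groups = group_by_reference(ET.fromstring(LEXICON))
print(groups)
```

Because the grouping follows the tagged cross-references rather than the spelling, it should not be disturbed by the changes a prefix undergoes in composition.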

Attention to these details of word formation enables the writers to compose more precise definitions, which in turn can help the student gain a deeper understanding of Greek word meaning. Electronic searching during the writing process will help us produce a more consistent, coherent, and consequently more useful lexicon.

The XML environment is a little more challenging for the writers, because we have to become accustomed to manipulating the tags. However, it does save effort, too, as the text formatting is largely automated: we don't need to select bold or italic fonts, or to insert brackets, or even section numbers, because all that is done automatically.
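Automatic section numbering, for instance, can be pictured with a few lines of Python. This is a simplified sketch rather than the actual typesetting process: it assumes only that the senses of an entry are held in 'S1' elements, and the sense texts are placeholders.

```python
import xml.etree.ElementTree as ET

def render_sections(entry):
    """Number the S1 sections in document order; no numbers are typed by hand."""
    lines = []
    for number, section in enumerate(entry.findall("S1"), start=1):
        text = "".join(section.itertext()).strip()
        lines.append(f"{number} {text}")
    return lines

entry = ET.fromstring("<VE><S1>first sense</S1><S1>second sense</S1></VE>")
numbered = render_sections(entry)
print("\n".join(numbered))
```

If a section is inserted or removed, the numbers are simply regenerated the next time the entry is rendered.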

And the advantages are that mistakes and inconsistencies can be avoided, the writing and publication processes are integrated, and the usefulness of the lexicon can be maximised and extended in the future, as new ways of integrating verbal and visual information are discovered. We believe that this will help students to explore the richness of the Ancient Greek vocabulary in the most effective way possible.

