"Digital Surrogates for the Printed Book: Problems and Possibilities"

John Unsworth

Opening Plenary, "Emblems in the Twenty-First Century: Materials and Media," The Seventh International Conference of the Society for Emblem Studies, held at the University of Illinois, Urbana-Champaign, July 25, 2005.



Reading "About Emblems" on the UIUC German Emblem Books site, I learn that

Emblem books can . . . be looked upon as the multi-media publications of the 17th and 18th centuries. They are books that link together three constitutive elements—a motto, a woodcut or engraving and an explanatory poem. . . . An emblem is more than the sum of its parts, because the interplay between text and image produces a greater meaning than any of the individual components can provide.

This sounds quite a bit like something on which I spent about ten years at the Institute for Advanced Technology in the Humanities (IATH), namely the Dante Gabriel Rossetti Archive, and specifically Rossetti's "Double Works," which Jerome McGann, the prime mover of the Rossetti Archive, describes this way:

In the fall of 1848, while Rossetti was working on his first major painting, The Girlhood of Mary Virgin, he wrote a sonnet to accompany the picture. He finished the painting in time to exhibit it at the Hyde Park Corner Free Exhibition in March 1849 and at that time wrote a second sonnet for the painting. When the painting was exhibited, the pair of sonnets was attached to the picture frame on a piece of gold-leaf paper as an accompanying textual component. . . . This composite set of textual and pictorial materials on the subject of "Mary's Girlhood" (which was the title he gave to the first sonnet) defines what has come to be known as Rossetti's "double work of art". The typical Rossettian double work develops in the [same] manner. . . . That is to say, Rossetti executes a picture and then writes a poem—typically a sonnet or a pair of sonnets—that comments and elaborates upon the pictorial work.

The problems that the double-work raised in the Rossetti Archive may be of interest, and may actually sound familiar. Back in 1993, we held that the complex "structures of meaning" in such work could not be encoded in something as pedestrian and widely used as TEI, and so we set about rolling our own document type definitions, or DTDs. In the earliest "introduction" to the Rossetti Archive, McGann wrote:

The Archive has three DTDs . . . . These are the RAPs, the RADs, and the RAWS; or—respectively—the "Rossetti Archive Picture", "Rossetti Archive Document", the "Rossetti Archive Work". A RAP is prepared for every painting, drawing, or design. A RAD is prepared for every textual document; one type of RAD covers manuscript texts, another covers printed texts (whether the latter have manuscript material as well or not). A RAW is prepared in order to study Rossetti's works as they may be generically conceived. "The Blessed Damozel", for example, is a poem in many documentary states, but it is also a work with structures of meaning and expressiveness in its own right. The "work" so titled is also an instance of a typical Rossettian object: what scholars call his "double works of art", i.e., works that are doubly instantiated in textual and in visual forms. For such works, the RAW serves as a convenient conceptual locus through which all of "The Blessed Damozel" materials can be organized.

[From The Rossetti Hypermedia Archive: An Introduction (1993)]

By 1998, the Rossetti Archive had actually evolved a fourth DTD for commentary (called the RAC), and this one was nearly pure TEI, but by that time we had also come to the conclusion that we were wrong to have rejected TEI in the first place. In a paper delivered that year at the ALLC/ACH conference in Debrecen, Hungary, Daniel Pitti and I admitted this, but still held that there was something about these "double works" that could not be expressed within the constraints of hierarchical markup languages. We said:

Upon reflection, Jerome McGann, the editor of the Rossetti Archive, has come to the conclusion that his major objections to TEI were really misplaced. The problems were not essentially TEI problems, but SGML problems. Much of what he wants to represent simply defies representation under any implementation of SGML. For example, take the Rossetti Archive Work: a "work" in the context of the Rossetti Archive is the author's pure ideation, not instantiated in any material form--the "idea" of the Blessed Damozel, for instance, and not any text of that poem, nor any instance of that pictorial work. Pure ideation does not have divs, has no body, has no text to encode—pure header, perhaps? But it has no source to describe, no bibliographic features, no provenance. Nonetheless, it does have qualities, characteristics, attributes, and forms of reference that one might want to encode, and it may (and most often does) stand in some structured relationship to textual and pictorial instances—in fact, it generally structures the relation of those instances to one another—though more likely than not, those relationships are multiple, overlapping, and concurrent, rather than straightforwardly hierarchical.

One can, of course, address those relationships in other ways, beyond marking them up: write an essay, mount an argument, explore the territory in a less structured way. But this reflection on the Rossetti Archive and emblem books does highlight one of the "possibilities" that I have always thought most interesting and most important, when creating digital surrogates for the printed book, and that is the possibility of externalizing and examining the largely tacit knowledge of the expert. By instantiating an understanding of Rossetti's work in a structured, parsable, internally consistent naming of parts and their relationships, a scholar in the humanities (in this case, an experienced and an eminent one) was able to think more deeply about the nature of that work, and was able to spot the internal contradictions in his own understanding, and even sort out, to some extent, the accidental contradictions, which could be resolved within the parsable structure, from the meaningful ones, which ultimately escape that structure.

Since we've gone into this much detail about the Rossetti Archive, let me go just a little further, in order to offer a kind of object lesson for those of you in the business of building complex collections of digital representations of works of literature and art. The relationship of the different expressions of the abstract "work," in the Rossetti Archive, was for a long time carried in a single attribute value: as I wrote in 2001, "Each object in the archive, whether pictorial or textual, was given a unique ID. This ID appeared as an attribute on the Rossetti Archive Master (RAM) element, which is the root element of our SGML, and on DIV elements, which mark divisions in the document hierarchy (such as single poems within a collection)" (SDS Annual Report, 2001: The Rossetti Project: Editorial Architecture). It was long assumed that we would be able to do various things on the basis of this unique ID, or more accurately, on the basis of its attribute value—for example, collect the various expressions of a work from across the Archive, group and sort them in the interface, and so on. In fact, we were able to do that, to some extent at least, by hard-coding the logic of these relationships in Dynaweb, the SGML publishing engine that we used before XML came along. The Dynaweb "stylesheets," having been invented before the existence of XSL, used proprietary methods to accomplish this, and the problem with our underlying strategy emerged quite clearly when we tried to re-express all this in XML and XSL, abstracted from any delivery mechanism, in order to import the Rossetti Archive and its relationships into FEDORA (the Flexible Extensible Digital Object Repository Architecture), during a Mellon-funded project to investigate library collection of born-digital scholarly publications:

[T]he ID, it turned out, could not be designed to imply an object's relation to other objects. The Rossetti group had hoped that a search engine would be able to deduce these relations from the semantics of a given ID. For example, while a research assistant understood that the ID "s205a.mansell" meant "the Mansell reproduction of the 'a' version of item 205 in the Surtees catalog," the DynaWeb search engine could not infer all related documents. Ultimately, the semantics of the ID attribute were too complex and too compact to be unambiguous. The problem was handled in various ways in the original DynaWeb publication (pre-collocating related materials, for example) but it became an unavoidable impediment to moving the Archive into FEDORA, since critically important contexts and relationships would need to be unambiguously expressed in order to be preserved. The solution was to disassemble the content of the original ID attribute and assign different parts of that content to several separate and unambiguous attributes.

The RAM and DIV elements now include information about the object's type and its relations to other objects. For example, the RAM for object s205a.del, a photographic reproduction by Charles Mansell of the "a" version of the pictorial work s205, part of the double work 2-1867.s205 ("Lady Lilith"), might look like this:

<RAM ARCHIVETYPE="rap" TYPE="painting"
METATYPE="web.visual" ID="a.s205a.mansell"
WIDTH="747" HEIGHT="500"
WORKCODE="2-1867.s205" VERSION="a"
DBLWORK="2-1867.s205" PHOTDUP="mansell">

These attributes explicitly set out the object's type, how it should be rendered, and how it is related to other objects.

The lesson to be drawn from this is that it is easy to be seduced into believing that names can have meaning. In fact, it is best if the authoritative name of a digital object has as little meaning as possible, and instead conveys the information we are tempted to load into the semantics of the name by some other means--for example, by breaking it out explicitly in different attribute values, or different database fields, or in some other way making it explicit rather than implied. The only thing one really wants a name to do, in short, is to distinguish this thing from other things, and so the only really required quality of a name—in the world of digital objects, at least—is uniqueness. If a name actually means something, then something is probably in the process of going terribly wrong. Many projects have learned this lesson, and usually it is learned after the number of objects grows to the point where the constraints of naming cannot sustain the semantic weight that's accumulated in the structure of relationships: unfortunately, at that point, it is usually a lot of work to go back, as we had to do in the Rossetti Archive, and rename everything, disambiguate relational information, and so on.
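
The principle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the Rossetti Archive's actual code: the `ArchiveObject` class, its field names (which loosely echo the attributes in the RAM example above), the use of a random UUID as the opaque name, and the second work code are all invented for the example.

```python
import uuid
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ArchiveObject:
    """A digital object whose name is opaque and whose relations are explicit.

    Field names loosely echo the RAM attributes quoted above; they are
    illustrative assumptions, not the Archive's real schema.
    """
    archive_type: str   # e.g. "rap" (picture) or "rad" (document)
    work_code: str      # the abstract work this object expresses
    version: str        # which version of the work
    reproduction: str   # who made the reproduction, if any
    # The name carries no meaning: its only job is to be unique.
    object_id: str = field(default_factory=lambda: uuid.uuid4().hex)


def expressions_of(work_code, objects):
    """Collect every object that expresses the given work.

    The lookup reads an explicit field; nothing is inferred from the name.
    """
    return [obj for obj in objects if obj.work_code == work_code]


collection = [
    ArchiveObject("rap", "2-1867.s205", "a", "mansell"),
    ArchiveObject("rad", "2-1867.s205", "a", ""),
    ArchiveObject("rap", "1-1849.s40", "a", ""),  # hypothetical second work
]

related = expressions_of("2-1867.s205", collection)
print(len(related))  # 2
```

Because the relation to a work lives in its own field, renaming an object or changing how works are numbered never requires re-parsing identifiers; only the field value changes.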

Returning to the "About Emblems" section of the UIUC German Emblem Books site, I note this passage as well, since it resonates with other experience at IATH:

Emblems were often thought to be hieroglyphs, riddles or even mysterious messages containing secrets. They drew on such diverse sources as the Bible, Classical antiquity, fables, mythology, science and medicine, and they reflected movements and events such as the Reformation and the Thirty Years' War. Their interpretation and understanding relied on the wit, knowledge and ability of the reader to combine clues in the text and image to produce meaning. During the time of their original use, they were read and viewed widely by both the educated and uneducated classes of European society. Today, research in emblems is highly interdisciplinary, attracting scholars of Latin, history, art history, and the European vernacular languages. This unusually rich form of combined artistic and literary expression also appeals to religious scholars, philosophers, and historians of science and education. . . . Many emblem books are small in physical size (averaging 5" wide x 9" height), and it is very difficult without the aid of a magnifying glass to decipher the intricate detail of an emblem engraving, and the corresponding mottos, many of which are written in Frakturschrift.

Much of this resonates with my experience in working on the William Blake Archive: Blake's illuminated books are also thought to contain riddles and secrets, also draw on diverse mythological sources, discuss science and medicine, and reflect historical movements and events. It could also be said of Blake, as it is said here of emblem books, that research in this area is "highly interdisciplinary," and that his "rich form of combined artistic and literary expression...appeals to religious scholars, philosophers, and historians" as well as, of course, to literary scholars and scholars of art history. Finally, Blake's works are also "small in physical size" (almost the same size as the average emblem book, sometimes smaller) and in Blake's case, as well, it is "difficult without the aid of a magnifying glass to decipher the intricate detail."

Discussing the logical consequences of some of these things in that 1998 paper, Daniel Pitti and I noted that in the Blake Archive,

While [the editors] are interested in the text of these works, they are interested in it first and foremost as it is presented on each plate. If you look at [the plate or object view], you will see that the centrality of the physical object is not only an editorial principle, but a design principle as well: illustration information (which is header information) and transcription (which is what would be central if the textual content were privileged) are both present here, but subordinated as links to dependent pages. And if you look at [the enlargement view], you will see the practical consequence of choosing to subordinate the transcription to the artifact—it appears in an ancillary window, and has the same presentational status as the enlargement of the plate image.

In order to represent Blake's work according to the editorial and design principles of Eaves, Essick, and Viscomi, we have developed the Blake Archival Description (BAD) DTD. . . . In it, you will find many specific elements devoted to describing the physical object (<physdesc>), illustrations found on plates (<illusdesc>), and the text on them (<phystext>). If you look for a moment at this example, you will see what we mean when we say that some of our fellows are intensely interested in describing the physical artifact: the Blake Archive markup treats each plate as a collection of one or more illustrations, and each illustration as a collection of one or more components, and each component as a collection of one or more characteristics. . . .

. . . [D]esigning a DTD for this purpose has allowed us to do other, complementary things that depend indirectly on the SGML structure, such as using IATH's Inote software to present the person who searches for nude climbers and arching, arboreal vegetation with an image of the particular sector of that plate in which these elements occur. . . . [I]t needs to be said that making the choice to privilege the artifact has made other things more difficult—as, for example, when we come to a long book with subdivisions in the intellectual order—chapters or constituent poems—and we find that we must nonetheless render this work as a series of 100 plates, rather than as a series of a dozen poems.

Several things leap out at me when I consider the Blake Archive alongside emblem books: first, the strategy used in the Blake Archive to mark up image contents (using a grid system, and a controlled vocabulary), and the process of extracting those descriptions and annotations from an XML record and turning them into an overlay for Inote, and then searching against the XML and returning the annotated image, are all eminently transferable to the case of emblem books. Second, the problems that attend the Blake Archive when it comes to privileging the physical over the logical divisions of the text are probably also problems that you face in emblem books. Finally, you are probably among the very small group of people in the world who would actually take advantage of a controlled vocabulary in a search interface: the Blake Archive has an elaborate typology represented in its image search, and it does serve a purpose. Still, it should not be oversold: even Blake experts prefer the simple keyword search over everything else. And it is worth noting, on this point, that the Blake editors considered and rejected Iconclass, thinking it too much oriented to Renaissance painting for their purposes.
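
The first of these points, searching a structured image description and getting back the region of the plate where the terms occur, can be sketched roughly as follows. This is only an illustration: the element and attribute names (`plate`, `illustration`, `component`, `characteristic`, `location`) and the single-letter grid labels are assumptions made for the example, not the Blake Archive's actual DTD or Inote's data format.

```python
import xml.etree.ElementTree as ET

# Hypothetical plate description. Element and attribute names are invented
# for illustration and are not the Blake Archive's actual DTD.
PLATE_XML = """
<plate id="plate-1">
  <illustration>
    <component type="figure" location="A">
      <characteristic>nude</characteristic>
      <characteristic>climbing</characteristic>
    </component>
    <component type="vegetation" location="B">
      <characteristic>arching</characteristic>
      <characteristic>arboreal</characteristic>
    </component>
  </illustration>
</plate>
"""


def find_regions(plate, wanted):
    """Return (component type, grid location) for each component that
    carries every characteristic in `wanted`."""
    hits = []
    for component in plate.iter("component"):
        terms = {c.text for c in component.findall("characteristic")}
        if wanted <= terms:  # all wanted terms are present
            hits.append((component.get("type"), component.get("location")))
    return hits


plate = ET.fromstring(PLATE_XML)
print(find_regions(plate, {"nude", "climbing"}))     # [('figure', 'A')]
print(find_regions(plate, {"arching", "arboreal"}))  # [('vegetation', 'B')]
```

The grid location returned by the search is what an image viewer would need in order to show the relevant sector of the plate rather than the whole page; the same pattern would apply to the pictura of an emblem.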


In the second part of this talk, I would like to take a rather different perspective, and instead of talking about the challenges that attend the creation of digital scholarly resources, I'd like to talk about the new opportunities for traditional scholarship that have been created by the conversion of primary resources to digital form, by the creation of new, born-digital resources, and by the availability of tools designed to be used with these digital materials. It is worth asking whether digital resources make possible new answers to old research questions, or whether they enable us to ask entirely new kinds of research questions. Do they open the way for new paradigms of humanities research? And can they do all these things in print?

I will depart from emblem books here, and give a few examples of a range of digital primary resources relevant to humanities research—from rare materials, to material and popular culture, to electronic journals and scholarly editions, to tools for comparison and analysis—in order to consider what difference, if any, digital resources make for the practice of humanities scholarship, as it is still generally practiced—in print.

First, though, I want to unpack the emphasis on digital primary resources, distinguishing these from digital secondary resources, and noting that most digital primary resources are digitized from physical objects, but lumping them together, nonetheless, with the smaller number of born-digital resources. At present, any scholarship involves the use of some digital resources—for example, a library catalogue. Furthermore, certain resources you might find through Virgo, in the course of your research, might themselves be available (under that "internet" button) as full-text electronic resources—for example, New Literary History, through Project Muse, at Johns Hopkins University Press. But unless the subject of your research is scholarly publishing, New Literary History is probably a secondary resource. There are, of course, primary resources in digital form on the web, too—for example, the Renaissance Texts from Perseus, where you could find a text-and-image digital version of Antony and Cleopatra, from the Brandeis First Folio, along with secondary resources like C.T. Onions, A Shakespeare Glossary. You could go and see the Brandeis First Folio, of course, and you could find "A Shakespeare Glossary" on the shelf in the reference room in Alderman (or in the 30-day stacks in Clemons). That's not the case with born-digital resources—things that don't now exist in another form (for example, reconstructions like the computer model of The Crystal Palace or the recreated Prokudin-Gorskii color photos from pre-revolutionary Russia), or things that never existed in another form, like digital art, simulations, etc.

In fact, although individual objects of our attention might be categorized as digital or analog, scholarship itself is now a continuum, in which all activity falls somewhere between those two points, and almost nothing is completely non-digital, or non-analog.

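So, back to the original question: what new opportunities for scholarship are presented by the existence of digital primary resources? Our habits of research in the humanities, and particularly in literary study, can be affected—sometimes renovated, sometimes mooted—by several kinds of novelty: first, new materials for research; second, new perspectives on familiar materials; and third, born-digital information and the new tools one uses with it.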

Digital primary resources are already quite interesting in the first way—the digitization of cultural heritage materials in the US and elsewhere has made much more available many rare materials, and many underutilized materials as well—for example, rare historical maps, or diaries of daily life in earlier times (e.g., "California as I saw it"). There are many in humanities departments who have embarked on digital research projects in the past ten years and have faced the problem of having to create their own digital primary resources first, in order to enable scholarship, but that situation is really changing now—not everything (by a very long shot) is available in digital form, but there are now some substantial collections of primary materials that were, in their predigital form, difficult to find, difficult to get to, or difficult to use. Your materials fit this description quite well, actually. These collections offer valuable new materials for research—usually not because those materials were never available before, but rather because the expense and impracticality of consulting them made it extremely unlikely that research would be done on them. In this category—new materials for research—I mention, as something of broad interest and therefore capable of changing the research paradigm in the discipline, The Making of America ("a digital library of primary sources in American social history from the antebellum period through reconstruction. The collection currently contains approximately 8,500 books and 50,000 journal articles with 19th century imprints") and the Library of Congress's American Memory project ("a gateway to rich primary source materials relating to the history and culture of the United States. The site offers more than 7 million digital items from more than 100 historical collections"). All of this opens new possibilities for archival research projects, especially for graduate students, who may lack travel budgets.

New perspectives on familiar materials are also available, as a result of the creation of digital primary resources. As an example here, I return to The William Blake Archive, which presents full-color images, newly transcribed texts, and editorial description and commentary, on all of Blake's illuminated books, with non-illuminated materials (manuscript materials, individual plates and paintings, commercial engravings, etc.) now coming on line. The Blake Archive makes it practical to teach Blake as a visual artist, by the simple fact of the economics of image reproduction on the web, and this is a fundamental change from the way I was taught Blake, through Erdman's text-only synthetic edition (which is also, by the way, available on the site). The Blake Archive also offers some good examples of new tools that could provoke new scholarship in print—for example, the image search and plate comparison features.

Finally, some new possibilities for print scholarship are presented by born-digital information and the tools one uses with that information—for example, geographic information systems. See Past Time, Past Place: GIS for History, which includes contributions from history and religious studies, and which concerns the new modes of analysis, new arguments, and new conclusions available as a result of computer techniques that map all kinds of social information onto geographic space. Perhaps some of you are already experimenting with GIS and emblem books: if not, I hope you will. Locating composition and range of reference in time and space is an essential part of scholarship, and GIS can provide a very useful interface to this, even for projects that are centrally textual in their focus.

And text-analysis tools, though still clumsy and offputting for the layman, are turning a corner with the advent of XML, and in the nora project (nora stands for No One Remembers Acronyms) we are exploring the possibilities for the discovery of patterns of almost any sort in literary texts. The goal of the nora project is to produce software for discovering, visualizing, and exploring significant patterns across large collections of full-text humanities resources in existing digital libraries. In search-and-retrieval, we bring specific queries to collections of text and get back (more or less useful) answers to those queries; by contrast, the goal of data-mining (including text-mining) is to produce new knowledge by exposing unanticipated similarities or differences, clustering or dispersal, co-occurrence and trends. Over the last decade, many millions of dollars have been invested in creating digital library collections: at this point, terabytes of full-text humanities resources are publicly available on the web. Those collections, dispersed across many different institutions, are large enough and rich enough to provide an excellent opportunity for text-mining, and we believe that web-based text-mining tools will make those collections significantly more useful, more informative, and more rewarding for research and teaching.
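
To make the contrast with search-and-retrieval concrete, here is a toy sketch of one of the simplest mining operations, counting co-occurrence. It is not a description of the nora software: the three one-line "documents," the stopword list, and the function name are all assumptions made for the example, and real text-mining relies on far larger collections and more careful statistics.

```python
from collections import Counter
from itertools import combinations

# Three tiny stand-in "documents"; a real collection would be far larger.
DOCS = [
    "the motto and the engraving explain the poem",
    "the engraving illustrates the motto",
    "the poem comments on the picture",
]
STOPWORDS = {"the", "and", "on"}


def cooccurrences(docs):
    """Count how many documents each unordered pair of words shares."""
    counts = Counter()
    for doc in docs:
        # Sorting gives every pair a single, alphabetical spelling.
        words = sorted(set(doc.split()) - STOPWORDS)
        counts.update(combinations(words, 2))
    return counts


pairs = cooccurrences(DOCS)
print(pairs[("engraving", "motto")])  # 2
print(pairs[("motto", "poem")])       # 1
```

No query is posed here in advance: every pairing is counted, and the interesting ones are whichever turn out to be unexpectedly frequent or unexpectedly rare.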

All of these resources and methods can and should be part of the future of literary study—even if, in that future, we choose to publish the results of our studies in print.