Monday 29 July 2013

Why digital humanists should get out of textual scholarship. And if they don't, why we textual scholars should throw them out.

Three weeks ago, when I was writing my paper for the conference on Social, Digital, Scholarly Editing I organized (with lots of help -- thanks guys!) at Saskatoon, I found myself writing this sentence:

    Digital humanists should get out of textual scholarship: and if they will not, textual scholars
    should throw them out.

As I wrote it, I thought: is that saying more than I mean?  Perhaps -- but it seemed to me something worth saying, so I left it in, and said it, and there was lots of discussion, and some misunderstanding too.  Rather encouraged by this, I said it again the next week, this time at the ADHO conference in Nebraska. (The whole SDSE paper, as delivered, is now on academia.edu; the slides from Nebraska are at slideshare -- as the second half of a paper, following my timeless thoughts on what a scholarly digital edition should be).

So, here are a few more thoughts on the relationship between the digital humanities and textual scholarship, following the discussion which the two papers provoked, and numerous conversations with various folk along the way.

First, what I most definitely did not mean.  I do not propose that textual scholars should reject the digital world, go back to print, it's all been a horrible mistake, etc. Quite the reverse, in fact. Textual scholarship is about communication, and even more than other disciplines, must change as how we communicate changes. The core of my argument is that the digital turn is so crucial to textual scholars that we have to absorb it totally -- we have to be completely aware of all its implications for what we do, how we do it, and who we are.  We have to do this, ourselves. We cannot delegate this responsibility to anyone else.  To me, too much of what has gone on at the interface between textual scholarship and the digital humanities over the last few years has been exactly this delegation.  There were good reasons for this delegation over the first two decades (roughly) of the making of digital editions. The technology was raw; digital humanists and scholarly editors had to discover what could be done and how to do it.  The prevailing model for this engagement was: one scholar, one project, one digital humanist, hereafter 1S/1P/1DH.  Of course there are many variants on this basic pattern.  Often, the one digital humanist was a team of digital humanists, typically working out of a single centre, or was a fraction of a person, or the one scholar might be a group of scholars.  But, the basic pattern of close, long and intensive collaboration between the 'one scholar' and the 'one digital humanist' persists. This is how I worked with Prue Shaw on the Commedia and Monarchia; with Chris Given-Wilson and his team on the Parliament Rolls of Medieval England; and how many other digital editions were made at King's College London, MITH, IATH, and elsewhere.

This leads me to my second point, which has (perhaps) been even more misunderstood than the first point. I am not saying, at all, that because of the work done up to now, all the problems have been solved, we have all the tools we need, we can now cut the ties between textual scholarship and digital humanities and sail the textual scholarly ship off into the sunset, unburdened by all those pesky computer folk. I am saying that this mode of collaboration between textual scholars and digital humanists, as described in the last paragraph, has served its purpose.  It did produce wonderful things, it did lead to a thorough understanding of the medium and what could be done with it. However, there are such problems with this model that it is not just that it is no longer needed: we should abandon it for all but a very few cases. The first danger, as I have suggested, is that it leads to textual scholars relying over-much on their digital humanist partners. I enjoyed, immensely, the privilege of two decades of work with Prue Shaw on her editions of Dante.  Yet I feel, in looking back (and I know Prue will agree) that too many times, I said to her -- we should do this, this way; or we cannot do that.  I think these would have been better editions if Prue herself had been making more decisions, and me fewer (or even none).  As an instance of this: Martin Foys' edition of the Bayeux Tapestry seems to me actually far better as an edition, in terms of its presentation of Martin's arguments about the Tapestry, and his mediation of the Tapestry, than anything else I have published or worked on.  And this was because this really is Martin's edition: he conceived it, he was involved in every detail, he thought long and hard about exactly how and what it would communicate.  (Of course, Martin did not do all of this himself, and of course he relied heavily on computer engineers and designers -- but he worked directly with them, and not through the filter of a 'digital humanist', i.e., me). And the readers agree: ten years on, this is still one of the best-selling of all digital editions.

The second danger of this model, and one which has already done considerable damage, is that the digital humanist partners in this model come to think that they understand more about textual editing than they actually do -- and, what is worse, the textual editors come to think that the digital humanists know more than they do, too. A rather too-perfect example of this is what is now chapter 11 of the TEI guidelines (in the P5 version).  The chapter heading is "Representation of Primary Sources", and the opening paragraphs suggest that the encoding in this chapter is to be used for all primary source materials: manuscripts, printed materials, monumental inscriptions, anything at all.  Now, it happens that the encoding described in this chapter was originally devised to serve a very small editorial community, those engaged in the making of "genetic editions", typically of the draft manuscripts of modern authors. In these editions, the single document is all-important, and the editor's role is to present what he or she thinks is happening in the document, in terms of its writing process. In this context, it is quite reasonable to present an encoding optimized for that purpose. But what is not at all reasonable is to presume that this encoding should apply to every kind of primary source.  When we transcribe a manuscript of the Commedia, we are not just interested in exactly how the text is disposed on the page and how the scribe changed it: we are interested in the text as an instance of the work we know as the Commedia.  Accordingly, for our editions, we must encode not just the "genetic" text of each page: we need to encode the text as being of the Commedia, according to canticle, canto and line. And this is true for the great majority of transcriptions of primary sources: we are encoding not just the document, but the instance of the work also.  Indeed, it is perfectly possible to encode both document and work instance in the one transcription, and many TEI transcriptions do this.  For the TEI to suggest that one should use a model of transcription developed for a small (though important) fraction of editorial contexts for all primary sources, the great majority of which require a different model, is a mistake.
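
To make the point concrete, here is a minimal sketch, in Python, of how one transcription can carry both the document and the work instance and be read either way. The element names are simplified TEI (no namespace or header), and the sample readings are invented for illustration rather than taken from the Commedia transcriptions themselves:

    # A sketch only: a TEI-style transcription recording both the document
    # (page break, a scribal deletion and correction) and the work (canticle,
    # canto, verse number), queried in both directions.
    import xml.etree.ElementTree as ET

    SAMPLE = """
    <body>
      <div type="canticle" n="Inferno">
        <div type="canto" n="1">
          <pb n="1r"/>
          <l n="1">Nel mezzo del cammin di nostra vita</l>
          <l n="2">mi ritrovai per una <del>selua</del><add>selva</add> oscura</l>
        </div>
      </div>
    </body>
    """

    root = ET.fromstring(SAMPLE)

    def reading_text(line):
        """Text of a verse following the corrected reading (skip <del>)."""
        parts = []
        def walk(elem):
            if elem.tag != "del" and elem.text:
                parts.append(elem.text)
            for child in elem:
                walk(child)
                if child.tail:
                    parts.append(child.tail)
        walk(line)
        return "".join(parts).strip()

    # Read as an instance of the work: canticle, canto, line.
    for canticle in root.iter("div"):
        if canticle.get("type") != "canticle":
            continue
        for canto in canticle.iter("div"):
            if canto.get("type") != "canto":
                continue
            for line in canto.iter("l"):
                print(canticle.get("n"), canto.get("n"), line.get("n"),
                      reading_text(line))

    # Read as a document: which page carries each verse, what the scribe changed.
    page = None
    for elem in root.iter():
        if elem.tag == "pb":
            page = elem.get("n")
        elif elem.tag == "del":
            print("page", page, "- the scribe first wrote:", elem.text)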

Another instance of this hubris is the preoccupation with TEI encoding as the ground for scholarly editing. Scholarly editors in the digital age must know many, many things.  They must know how texts are constructed, both as document and work instance; they must know how they were transmitted, altered, transformed; they must know who the readers are, and how to communicate the texts and what they know of them using all the possibilities of the digital medium. What an editor does not need to know is exactly what TEI encoding should be used at any point, any more than editors in the print age needed to know what variety of Linotype machine was in use.  While the TEI hegemony has created a pleasant industry in teaching TEI workshops, the effect has been to mystify the editorial process, convincing rather too many prospective editors that this is just too difficult for them to do without -- guess what -- a digital humanist specialist.  This, in turn, has fed what I see as the single most damaging product of the continuation of the 1S/1P/1DH model: that it disenfranchises all those scholars who would make a digital edition, but do not have access to a digital humanist.  As this is almost every textual scholar there is, we are left with very few digital editions. This has to change. Indeed, multiple efforts are being made towards this change, as many groups are attempting to make tools which (at least in theory) might empower individual editors.  We are not there yet, but we can expect in the next few years a healthy competition as new tools appear.

A final reason why the 1S/1P/1DH model must die is the most brutal of all: it is just too expensive. A rather small part of the Codex Sinaiticus project, the transcription and alignment of the manuscripts, consumed grant funding of around £275,000; the whole project cost many times more.  Few editions can warrant this expenditure -- and as digital editions and editing lose their primary buzz, funding will decrease, not increase.  Throw in another factor: almost all editions made this way are data siloes, with the information in them locked up inside their own usually-unique interface, and entirely dependent on the digital humanities partner for continued existence.

In his post in response to the slides of my Nebraska talk, Joris van Zundert speaks of "comfort zones". The dominance of the 1S/1P/1DH model, and the fortunate streams of funding sustaining that model, have created a large comfort zone. The large digital humanities centres have grown in part because of this model and the money it has brought them -- and have turned the creation of expensively-made data, dependent on them for support, into a rationale for their own continued existence. What is bad for everyone else -- a culture where individual scholars can make digital editions only with extraordinary support -- is good for them, as the only people able to provide that support.  I've written elsewhere about the need to move away from the domination of digital humanities by a few large centres (in my contribution to the proceedings of last year's inaugural Australian Association of Digital Humanities conference).

This comfort zone is already crumbling, as comfort zones tend to do.  But besides the defects of the 1S/1P/1DH model, a better reason for its demise is that a better model exists, and we are moving towards it.  Under this model, editions in digital form will be made by many people, using a range of online and other tools which will permit them to make high-quality scholarly editions without having to email a digital humanist every two minutes (or ever, even). There will be many such editions.  But we will have gained nothing if we lock up these many editions in their own interfaces, as so many of us are now doing, and if we wall up the data by non-commercial or other restrictive licences.

This is why I am at such pains to emphasize the need for this new generation of editions to adopt the Creative Commons Attribution-ShareAlike licence, and to make all created materials available independent of any one interface, as the third and fourth desiderata I list for scholarly editions in the digital age. The availability of all this data, richly marked up according to TEI rules and supporting many more uses than the 'plain text' (or 'wiki markup') transcripts characteristic of the first phase of online editing tools, will fuel a burgeoning community of developers, hacker/scholars, interface creators, digital explorers of every kind. I expressed this in my Nebraska talk this way:


Under this model we can look to many more digital humanists working with textual scholarly materials, and many more textual scholars using digital tools. There will still be cases where the textual scholar and the digital humanist work closely together, as they have done under the 1S/1P/1DH model, in the few scholarly edition projects which are of such size and importance as to warrant and fund their own digital support.  (I hope that Troy Griffitts has a long and happy time ahead of him, supporting the great editions of the Greek New Testament coming from Münster and Birmingham). But these instances will not be the dominant mode of how digital humanists and textual scholars will work together. At heart, the 1S/1P/1DH model is inherently restrictive. Only a few licensed people can work with the data made from any one edition.  Instead, as Joris says, we should seek to unlock the "highly intellectually creative and prolific" potential of the digital environment, by allowing everyone to work with what we make.  In turn, this will fuel the making of more and better tools, which textual scholars can then use to make more and better editions, in a truly virtuous circle.

Perhaps I overdramatized matters, by using a formula suggesting that digital humanists should no longer have anything to do with textual scholarship, when I meant something different: that the model of how digital humanists work with textual scholars should change -- and is changing.  I think it is changing for the better. But to ensure that it does, we should recognize that the change is necessary, work with it rather than against it, and determine just what we would like to see happen.  It would help enormously if the large digital humanities centres, and the agencies which fund them, subscribed whole-heartedly to my third and fourth principles: of open data, available through open APIs.  The first is a matter of will; the second requires resources, but those resources are not unreasonable.  I think that it will be very much in the interests of the centres to adopt these rules.  Rather quickly, this will exponentially increase the amount of good data available to others to use, and hence incite others to create more, and in turn increase the real need for centres specializing in making the tools and other resources textual scholars need.  So, more textual scholars, more digital humanists, everyone wins.

Saturday 27 July 2013

The Woodstock of the Web: Geneva 25-27 May, 1994

Background: about this article (27 July 2013)

In 1994 I was based in Oxford University Computing Services, clinging post-doctorally to academic life through a variety of research posts at the intersection of computing and the humanities (creating the Collate program with Leverhulme support, starting off the Canterbury Tales Project with minimal formal support but lots of enthusiasm, advising Cambridge University Press about the new world of digital publication, etc.)  Because I had no regular post, people threw me the crumbs they could not eat themselves ("people" being mostly Susan Hockey, Lou Burnard and Marilyn Deegan).  At that time, Lou and Harold Short were writing a report outlining the need for a computing data service for the humanities: this report served as the template for what became the Arts and Humanities Data Service (now alas deceased, killed by British Rover -- that's another story). You might still be able to get a copy of this report online.  They had got some money for travel to research the report, more travel than Lou, for his part, could manage: so several times that year Lou (whose office was beside mine in 6 Banbury Road) said to me, "Peter, how would you like to go to..".  So I got to go to Dublin to see John Kidd commit public academic hara-kiri at the James Joyce conference.  And, I got to go to Geneva for the first 'WWW' conference, ever. Some time in that period -- 1993, it must have been -- Charles Curran called me into his office at OUCS to show me the latest whizzy thing: a system for exchanging documents on the internet, with pictures! and links! This was the infant web, and OUCS had (so legend goes) server no. 13 (Peter's note 2024: this may have been as early as September 1991, just a month after TBL made the very first webpage).

The total triumph of the web as invented at CERN in those years (that is, HTTP + HTML, etc.) makes it easy to forget that at that time, it was one among many competing systems, all trying to figure out how to use the internet for information exchange. Famously, a paper proposed by Tim Berners-Lee and others for a conference on hypertext in 1992 was rejected; Ian Ritchie, who ran OWL, which looked just like the web before the Web, has a nice TED video telling how he was unimpressed by Tim Berners-Lee's idea.  So in early 1994, it was not a slam-dunk that the Web would Win. I had already met TBL: he came to Oxford, I think in late 1993 or early 1994, with the proposal that Oxford should be a host of what became the W3 Consortium. I was present (with Lou, and others) at some of a series of meetings between TBL and various Oxford people.  The Oxford reception was not rapturous, and he ended up going to MIT (as this report notes).

The report is completely un-edited: I've highlighted headings, otherwise the text is exactly as I sent it to -- who? presumably Lou and Harold.  And, I still have, and occasionally wear, the conference t-shirt.

****My trip report****

Summary:

The first World Wide Web conference showed the Web at a moment of transition.
It has grown enormously, on a minimum of formal organization, through the
efforts of thousands of enthusiasts working with two cleverly devised interlocking
systems: the HTML markup language that allows resource to call resource across
networks, and the HTTP protocol that allows computer servers to pass the
resources about the world.  However, the continued growth and health of the Web
as a giant free information bazaar will depend on sounder formal organization, and
on its ability to provide efficient gateways to commercial services.  The challenge
for the Web is to harmonize this organization and increasing commercialism with
the idealistic energies which have driven it so far, so fast.  For the Arts and
Humanities Data Archive, the promise of the Web is that it could provide an ideal
means of network dissemination of its electronic holdings, blending free access with
paid access at many different levels.

From Wednesday 25th to Friday 27th I braved the clean streets and expensive
chocolates of Geneva, in search of an answer to a simple question: what is the World
Wide Web? and what will it be, in a month, a year, a decade?

Once more, I donned my increasingly-unconvincing impersonation of Lou
Burnard.  This time, I was the Lou Burnard of the Arts and Humanities Data
Archive feasibility study, in quest of masses of Arts and Humanities data which
might need a secure, air-conditioned, (preferably clean and chocolate-coated)
environment for the next few millennia.  In the past months, WWW has become
possibly the world's greatest data-generator, with seemingly the whole of academe
and parts far beyond scrambling to put up home pages with links to everything.  So
my mission: what effect is the enormous success of WWW going to have on the
domains of data archives in particular, and on library practice and electronic
publishing in general?

The conference was held at CERN, the European Centre for Nuclear Research.
Underneath us, a 27km high-vacuum tunnel in which a positron can travel for three
months at the speed of light without ever meeting another atomic particle.
Ineluctably, the comparison presents itself: one can travel for weeks on WWW
without encountering anything worth knowing.  The quality of the Web was a
major subtext of the conference.  One thing the Web is not (yet) is any sort of
organization: it is a mark-up language (HTML), a network protocol by which
computers can talk to each other (HTTP) and it is a lot of people rushing in
different directions.  Anyone with a computer and an internet account can put
anything on the Web, and they do.  In no time at all, the Web has grown, and
grown...

Here follow questions I took with me, and some answers...

How big is the Web?

From the conference emerged a few remarkable figures on the growth of the Web:
in June 1993 there were 130 Web servers.  This had doubled by November 1993 to
272, then doubled again within a month to 623, then doubled again by March this
year to 1265 (source: David Eichmann conference paper 113, 116).  Other speakers
at the conference indicated that between March and June this year the number of
servers has doubled and redoubled again, and is probably now somewhere over
5000.  Another speaker estimated that there were around 100,000 'artifacts'
(articles, pictures etc etc...) on the Web (McBryan; Touvet put the figure at
'billions').  The volume of information is matched by the level of use.  The NCSA
home page currently receives around one million requests a week.  Around 500 gb
a week are shifted around the Web, and WWW now ranks sixth in total internet
traffic and will accelerate to first if it continues its current growth.

Who is the Web?

All this, from an idea in the head of Tim Berners-Lee, with no full-time staff or
organization to speak of.  I met (at lunch on Friday) Ari Luotonen, one of the two
staff at CERN who have direct responsibility for running the CERN server, the
first home of the Web, and still its European hub.  Simultaneously, they are
running the server, writing software etc. for it, and refining the hypertext transfer
protocol HTTP which underpins the whole enterprise.  Before I went, I met Dave
Raggett of Hewlett-Packard, Bristol, who is busy simultaneously inventing the next
incarnation of HTML, HTML+, and writing a browser for it, and I spoke several more
times to him during the conference.  HTML, the Web's 'native' markup language, is
to the software of the Web what HTTP is to the hardware: without these two,
nothing.
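
For anyone who has not seen the two halves working together, here is a minimal sketch in present-day Python of how an HTML anchor resolves into a single line of HTTP spoken over a socket; the link points at example.com, a reserved test host, rather than at any server named above:

    # The markup half supplies a URL; the protocol half is one GET request.
    import re
    import socket

    html = '<p>See the <a href="http://example.com/">conference home page</a>.</p>'

    # Pull the URL out of the anchor, then split it into host and path.
    url = re.search(r'href="([^"]+)"', html).group(1)
    host = url.split("/")[2]                   # 'example.com'
    path = "/" + "/".join(url.split("/")[3:])  # '/'

    # Speak HTTP/1.0 to the server and read back whatever it sends.
    request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
    with socket.create_connection((host, 80), timeout=10) as conn:
        conn.sendall(request.encode("ascii"))
        reply = b""
        while chunk := conn.recv(4096):
            reply += chunk

    print(reply.split(b"\r\n", 1)[0].decode())  # e.g. 'HTTP/1.1 200 OK'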

To one used to the rigours of the Text Encoding Initiative, which has
ascetically divided design from practice in search of truly robust and elegant
solutions, the spectacle of the base standards of the Web being vigorously
redesigned by the same people who are writing its basic software is exhilarating
(there is a lot of: that's a nice idea! I'll have that) and also rather scary.  There is a
glorious mix of idealism and hard-headed practicality about the people at the heart
of the Web.  This is nowhere better epitomized than in Tim Berners-Lee, whose
address on the first morning had machines manipulating reality, with the Web being
'a representation of human knowledge'.  Along the way, Tim tossed out such
treasures as the Web allowing us to play dungeons and dragons in the Library of
Congress, or the Web finding the right second-hand car, or putting together a
skiing party including Amanda.  And, he said, you could buy a shirt on the Web.
If the Web is to continue to grow and prosper, and become more than a
computer hobbyist's super toy, it will need to be useful.  Or will it?  Can we only
have all human knowledge on the Web if we can buy shirts on it, too?  Useful, to
whom?

Who uses the Web?

One paper (Pitkow and Recker) reported the results of a survey on who uses the
Web, based on 4500 responses to a posting on the Web in January 1994: over 90
per cent of its users are male, 60% are under 30, and 69% are in North
American universities.  One could say that this survey was rather skewed, in that it
relied on the 'forms' feature in Mosaic, and as few people had access to this on
anything other than Unix, 93% of respondents (the same 94% who were male?)
were Unix/Mosaic users.  In any case, the conference was overwhelmingly male,
and strongly American in accent.  And the papers at the conference were
overwhelmingly technical: concerned with bandwidth, Z39.50 protocols, proxy
servers, caching, algorithms for 'walking the Web', and so on.  In part this
technical bias is necessary.  The very success of the Web threatens Internet
meltdown.  But extreme technical skill in one area can mean frightening naivety in
another, and vast quantities of naivety were on view over the three days.  For
many people, the Web is the world's greatest computer playground: hacker heaven
in 150 countries at once.  And so, many of the papers were exercises in ingenuity:
look what I have done! without reflection as to whether it was worth doing in the
first place.  But, ingenuity certainly there was.  Here are some examples...

How do we find out what is on the Web?

The sudden and huge size of the Web has created a monster: how do you find just
what you want amid so much, where there are no rules as to what goes where, no
catalogue or naming conventions?  The whole of Thursday morning, and many
other papers outside that time, were devoted to 'taming the Web': resource
discovery methods for locating what you want on the Web.  It is typical of the
extraordinary fertility of invention which the Web provokes (and this just might be
the Web's greatest gift to the world) that no two of these papers were even
remotely alike in the methods they proposed.  One system (SOLO, presented by
Touvet of INRIA) focussed on directory services: indexing all the file names
accessed through the web and constructing a 'white pages' for the Web which maps
all the file names and their servers to resource names, so that one need only point at
the SOLO directory to find an object, wherever it is.  Again, something like this is
badly needed as lack of 'name persistence' -- the distressing tendency for files to
disappear from where they are supposed to be -- is a major Web shortcoming.
Fielding (UC at Irvine) outlined a different solution to this particular problem: a
'MOMspider' which would walk the Web and find places where files have been
mislaid, and hence compute the Web's health and alert document owners of
problems.  Fielding argued that living infostructures must be actively maintained to
prevent structural collapse: so, who is to do the maintenance?
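
The principle is simple enough to sketch in a little present-day Python; this is the idea only, not Fielding's MOMspider itself, and the starting URL is merely illustrative:

    # Fetch one page, try every link it carries, and report the ones that
    # no longer answer -- the maintenance job described above, in miniature.
    import urllib.error
    import urllib.request
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def report_mislaid_links(page_url):
        html = urllib.request.urlopen(page_url, timeout=10).read().decode("utf-8", "replace")
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            target = urljoin(page_url, link)
            if not target.startswith("http"):
                continue  # skip mailto:, ftp:, and the like
            try:
                urllib.request.urlopen(target, timeout=10)
            except (urllib.error.URLError, ValueError):
                print("mislaid:", target, "(linked from " + page_url + ")")

    report_mislaid_links("https://example.com/")  # illustrative starting point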

A quite different approach, though still centred on whole documents, is
ALIWEB, presented by Martijn Koster of NEXOR Ltd.  This uses index files
prepared according to standard formats, containing information on who made the
file, what it is, and some key-words, etc.  These files must be 'hand-prepared' by
someone at each server site; ALIWEB then collects all these files together and loads
them into a database sent back to all the servers.  This is modelled on the working
of Archie, where servers similarly compile and exchange information about ftp
resources, etc.  There is no doubting the value of something like ALIWEB.  But
how many of the people running those 5000-plus servers (especially, the 4000 or so
which have joined in the last three months) even know of ALIWEB?

Another system using 'hand-prepared' data based on whole-document
description was outlined by McBryan (University of Colorado).  This is GENVL,
or 'the mother of all bulletin boards'.  This uses a very clever system of nested
'virtual libraries' deposited in GENVL by outside users.  I can make up my own
'virtual library' of Web resources in my own area of interest using the format
supplied by GENVL.  This library can then be nested within the existing GENVL,
and point at yet other virtual libraries supplied by other people. People can find my
library and the documents and libraries it points at by travelling down the
hierarchy, or by WAIS searching on the whole collection.  But, who guarantees the
accuracy and timeliness of the data in GENVL?  No-one, of course: but it seems on
its way towards being the biggest yellow pages in the world.  McBryan's
demonstration of this was by far the best demonstration seen at the conference: you can
get into it for yourself at
http://www.cs.colorado.edu/homes/mcbryan/public_html/bb/summary_html.  I must
say the competition for the best demonstration was less intense than might have
been supposed.  Indeed, there were more presentational disasters than I have ever
seen at any conference.  Many presenters seemed so struck by the wonder of what
they had done that they thought it was quite enough to present their Web session
live, projected onto a screen, in happy ignorance of the fact that the ergonomics of
computer graphics meant that hardly anyone in the auditorium could read what was
on the screen.  This enthusiastic amateurism is rather typical of a lot which goes on
in the Web.

Other papers in this strand looked into the documents themselves.   The
Colorado system described by McBryan supplements the hand-prepared 'virtual
libraries' by an automated 'Worm' (it appears the Web is breeding its own parasitic
menagerie; in Web-speak, these are all varieties of 'robots') which burrows its way
through every document on the Web, and builds a database containing index entries
for every resource on the Web (as at March 1994, 110,000 resources, 110,000
entries).  It does not try to index every word; rather it indexes the titles to every
resource, the caption to every hypertext link associated with that resource, and the
titles of every referenced hypertext node.   This is clever, and again was most
impressive in demonstration: you simply dial up the worm with a forms aware
client (Mosaic 2.0, or similar) and type in what you are looking for.  Each of
McBryan's systems (GENVL and the Worm) currently receives over 2000 requests
a day.  They were far and away the most impressive, and most complete, of the
search systems shown at the conference.
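
The indexing idea, stripped of the crawling, can be sketched in a few lines of Python; the two 'documents' below are invented stand-ins for fetched pages, not anything taken from the Worm's own database:

    # Index only each resource's title and the captions of its links,
    # rather than its full text -- the Worm's economy, in miniature.
    import re
    from collections import defaultdict

    pages = {
        "http://cs.example.edu/vr.html":
            "<title>Virtual reality behind the Web</title>"
            '<a href="http://cs.example.edu/labs.html">VR laboratories worldwide</a>',
        "http://lib.example.edu/dante.html":
            "<title>Manuscripts of the Commedia</title>"
            '<a href="http://lib.example.edu/hamilton.html">A famous codex</a>',
    }

    index = defaultdict(set)  # word -> set of URLs it points to

    for url, html in pages.items():
        title = re.search(r"<title>(.*?)</title>", html, re.S)
        captions = re.findall(r"<a [^>]*>(.*?)</a>", html, re.S)
        text = (title.group(1) if title else "") + " " + " ".join(captions)
        for word in re.findall(r"[a-z]+", text.lower()):
            index[word].add(url)

    print(sorted(index["commedia"]))  # -> ['http://lib.example.edu/dante.html']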

We met many more of these 'robots' during the conference: beasts bred by
computer science departments to graze their way through the web.  They are
creatures of controversy: they can add considerably to the network load and there
are some who would ban all robots from the web. Eichmann of the University of
Houston (Clear Lake) described his 'spider' which again walked the whole Web,
like McBryan's worm.  However, his system builds a full-text index of each
Web document, not just an index of titles and hypertext captions.  This is a
modified WAIS index which works by throwing away the most frequently
referenced items.  However, the Web is so large that the index of even a part of it
soon grew to over 100 megs.  The sheer size of the Web, and the many near-
identical documents in some parts of it, can have entertaining consequences.
Eichmann related the sad tale of the time his spider fell down a gravitational well.
It spent a day and a half browsing a huge collection of NASA documents (actually a
thesaurus of NASA terms) and got virtually nothing back you could use in a search:
all it found was NASA, ad infinitum, and so it threw the term away from the index
and .... eventually Eichmann had to rescue his spider.  As NASA were sponsoring
this research they were not amused to find that if you asked the spider to find all
documents containing 'NASA' you got precisely nothing.
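
A small Python sketch shows why the trick backfired: build a full-text inverted index, throw away the terms that occur in almost every document (as the modified WAIS index did), and a search for 'NASA' across an index of NASA documents finds nothing. The snippets below are invented:

    # Full-text inverted index, pruned of its most widespread terms.
    from collections import defaultdict

    docs = {
        "doc1": "NASA thesaurus entry: aerodynamic braking, NASA code 0101",
        "doc2": "NASA thesaurus entry: orbital mechanics, NASA code 1302",
        "doc3": "NASA thesaurus entry: solar arrays, NASA code 2007",
    }

    index = defaultdict(set)
    for name, text in docs.items():
        for word in text.lower().replace(",", " ").replace(":", " ").split():
            index[word].add(name)

    # Discard any term found in more than 80% of the collection.
    too_common = {w for w, where in index.items() if len(where) > 0.8 * len(docs)}
    for word in too_common:
        del index[word]

    print(sorted(too_common))        # 'nasa' is among the terms thrown away
    print(index.get("nasa", set()))  # -> set(): a search for NASA finds nothing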

Every conference on computing must contain a paper with the title 'lost in
hyperspace': this one was from Neuss (Fraunhofer Institute, Germany).  I was by
now rather fatigued by techies who thought that running GREP across the Web
actually was a decent search strategy.  Neuss did a little more than this (using 'fuzzy
matching') but what he had was essentially our old friend the inverted file index.
Which of course when faced with gigabytes finds it has rather too many files to
invert and so tends to invert itself, turtle-like, which is why both Neuss and
Eichmann found themselves concluding that their methods would really only be
useful on local domains of related resources, etc.  At last, after all this we had a
speaker who had noticed that there is a discipline called 'information retrieval',
which has spent many decades developing methods of relevance ranking, document
clustering and vectoring, relevance feedback, 'documents like...', etc.  He pointed
out the most glaring weaknesses in archie, veronica, WAIS and the rest (which had
all been rather well demonstrated by previous speakers).  However, it appeared he
had good ideas but no working software (the reverse of some of the earlier
speakers).

Finally, in this strand, the true joker of the pack: De Bra of Eindhoven
University with -- hold your hat -- real-time full-text searching of the web.  That
is: you type in a search request and his system actually fires it off simultaneously to
servers all over the Web, getting them to zoom through their documents looking
for matches.  It is not quite as crude as this, and uses some very clever 'fishing'
algorithms to control just how far the searches go, and also relies heavily on
'caching' to reduce network load.  But it is exactly this type of program which
rouses the ire of the 'no-robots' Web faction (notably, Martijn Koster).
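
The flavour of such a 'fishing' algorithm, though not De Bra's actual one, can be sketched in Python: follow links outward from a starting page, but let each chain of links die off a few steps after the documents stop matching the query. The tiny 'web' here is an invented, in-memory one, so no servers are troubled:

    # Each page passes 'energy' to the pages it links to: matching pages pass
    # on fresh energy, non-matching pages pass on less, and exhausted chains stop.
    WEB = {  # url -> (text, links)
        "A": ("dante commedia manuscripts", ["B", "C"]),
        "B": ("weather in geneva", ["D"]),
        "C": ("commedia inferno canto one", ["E"]),
        "D": ("cern cafeteria menu", []),
        "E": ("dante inferno commentary", []),
    }

    def fish_search(start, query, width=2):
        hits, seen = [], set()
        frontier = [(start, width)]  # (url, remaining energy)
        while frontier:
            url, energy = frontier.pop(0)
            if url in seen or energy <= 0:
                continue
            seen.add(url)
            text, links = WEB[url]
            matched = query in text
            if matched:
                hits.append(url)
            next_energy = width if matched else energy - 1
            frontier.extend((link, next_energy) for link in links)
        return hits

    print(fish_search("A", "commedia"))  # -> ['A', 'C']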

So: masses of ways of searching the Web.  But is there anything there worth
finding?

How serious is the Web? -- Electronic publishing and the Web

There were times in these three days when I felt I had strayed into some vast
electronic sandpit.  I had come, on the part of the Arts and Humanities archives, in
search of masses of data which might need archiving.  What I found instead was a
preoccupation with how the Web worked, and very little interest in the quality of
material on it.  On the Wednesday morning, Steffen Meschkat (ART + COM,
Berlin) described his vision of active articles in interactive journals.  These would
present a continuous, active model of publication.  The boundaries between
reviewed and published articles would disappear, and the articles themselves would
merge with the sea of information.  To Meschkat, the idea of publication as a
defining act has no meaning; there is no place in his world for refereed publication,
so important to University tenure committees.  There is no doubt that the Web is a
superb place for putting up pre-publication drafts of material for general comment,
and many scholars are making excellent use of it for just that.  But to argue that this
is a completely satisfactory model of publication is absurd.  A journal whose only
defining strategy is the whims of its contributors (how do you 'edit' such a
journal?) is not a journal at all; it is a ragbag.  I am pleased to report that the
centrepiece of Meschkat's talk, a demonstration of the active journal, failed to
work.

In general, I was surprised by the level of ignorance shown about what the
Web would need if it were to become a tool for serious electronic publishing of
serious materials.  It is the first conference I have been to where the dread word
copyright hardly appeared.  Further, there was hardly any discussion of what the
Web would need to become a satisfactory electronic publishing medium: that it
would have to tackle problems of access control, to provide multilayered and fine-
grained means of payment, to give more control over document presentation.  An
SGML group met on Wednesday afternoon to discuss some of these issues, but did
not get very far.  There were representatives of major publishers at this group
(Springer, Elsevier) but none is yet doing more than dipping a toe in the Web.

From the talk at the conference, from the demonstrations shown, and what I
have seen of the Web, it is not a publication system.  It is a pre-publication system,
or an information distribution system.  But it is as far from a publishing system as
the Thursday afternoon free paper is from OUP.

How serious is the Web? -- free information, local information

The free and uncontrolled nature of the Web sufficiently defines much of what is
on it.  Most of what is on the Web is there because people want to give their
information away.  Because it is free, it does not matter too much that the current
overloading of the internet means that people may not actually be able to get the
information from the remote server where it is posted.  Nor does it matter that the
right files have disappeared, or that the information is sloppily presented, eccentric
or inaccurate.  These things would be fatal to a real electronic publishing system,
but do not matter in the give-away information world which is much of the Web.

There is an important category of Web use which does not fit this
description.  That is its use for distribution of local information systems: for
carrying the local 'help' or 'info' system for a university, or a company.   The most
impressive instances of Web use offered at the conference fitted this model.  Here,
we are dealing with information which is valuable but free: valuable, because it is
useful to a definable group of people.  Typically, the information is owned by the
local organization and so there are no copyright problems to prevent its free
availability.  Further, because the information is most valuable to the local user,
and rather less valuable to anyone beyond, it is quite valid to optimize the local
server and network for local access; those beyond will just have to take their
chances.  The Web server currently being set up in Oxford is an excellent example
of the Web's power to draw together services previously offered separately
(gopher, info, etc.) into a single attractive parcel.  Several conference papers told
of similar successful applications of the Web to local domains: a 'virtual classroom'
to teach a group of 14 students dispersed across the US, but linked to a single Web
server (Dimitroyannis, Nikhef/Fom, Amsterdam); an on-line company-wide
information system by Digital Equipment (Jones, DEC).  Most impressive was the
PHOENIX project at the University of Chicago (Lavenant and Kruper).  This aims
to develop, on the back of the Web, a full teaching environment, with both students
and teachers using the Web to communicate courses, set exercises, write papers etc.
Three pilot courses have already been run, with such success that a further 100
courses are set to start later this year.  Because it is being used for formal teaching,
PHOENIX must include sophisticated user-authentication techniques.  This is done
through the local server; a similar method (though rather less powerful) is in use at
the City University, London (Whitcroft and Wilkinson).

Such local applications resolve the problem of access to remote servers
which threatens to strangle the web (and with it, the whole Internet).  They also
solve the problems of accountability and access: someone is responsible for the
information being accurate and available.  As for access: if restriction is required,
this can be done through the local server, so avoiding all the problems of trying to
enforce control through servers on the other side of the world.  The Web is so well
suited to these local information services that these alone will see the Web continue
to burgeon.  Much of the Web will be a network of islands of such local systems,
upon which the rest of the world may eavesdrop as it will.  Among these islands,
we can expect to see many more distance learning providers, on the model of
PHOENIX but distributed across world-wide 'virtual campuses'.  Butts et al.
described the 'Globewide Network Academy' (GNA), which claims to be the
world's first 'virtual organization', and appears to intend to become a global
university on the network.  This seems rather inchoate as yet.  But already 2500
people a day access GNA's 'Web Tree'; where they are going, others will certainly
follow.  We could also expect to see many more museums, archives, libraries, etc.
using the Web as an electronic foyer to their collections.  There was only one such
group presenting at the conference, from the Washington Holocaust museum
(Levine): that there should be only this one tells us a great deal about the lack of
computer expertise in museums.  The Web is an ideal showplace for these.  Expect
to see many more museums, etc., on the Web as the technology comes within their
reach.

How can I make sure I can get through to the Web?

For days on end, I have tried to get through to McBryan's server at Colorado, or
the NCSA home page, or indeed anywhere outside the UK.  Unless you are
prepared to time it for a moment when all the rest of the world is in bed (and you
should be too), you cannot get through.  This alone looks as if it could be doom
for the Web's aspirations.  Certainly, it could not be a commercial operation on this
basis: no-one is going to pay to be told that the server they want is unavailable.  Of
course, this does not matter if all you want is local information, as described in the
last section.  But this is far from the vision of the world talking to itself promoted
so enthusiastically by Berners-Lee and others.

Help is at hand.  A surprising number of papers were devoted to problems
of making the Web work.  The answer is simple: if lack of network bandwidth
means we cannot all dial into America at once, set up clones of America all over the
world.  There are many ways of doing this: 'mirror' servers, etc, as long known in
the FTP world, for example.  There are obvious problems with mirror servers in
the fast-moving world of the Web, with some documents both changing and being
accessed very frequently.  The method that seems most enthusiastically pursued is
'caching', otherwise 'proxy' servers.  Instead of dialling America direct,  you dial
the proxy server.  The server checks if it has a copy of the document you want; if it
has, then it gives you the document itself.  If it has not, it gets the document from
America, passes it to you and keeps a copy itself so the next person who asks for it
can get it from the cache.  Papers by Luotonen, Glassman, Katz and Smith
described various proxy schemes, several of them based on Rainer Post's Lagoon
algorithms.  It is the sort of topic which fascinates computer scientists, with lots of
juicy problems to be solved.  For example: once someone has dialled into a proxy
and then starts firing off http calls from the documents passed over by the proxy,
how do we make sure all those http calls are fed to the proxy and not to the 'home'
servers whose address is hard-wired into the documents?  or, how do you operate a
proxy within a high-security 'firewall' system, without compromising the security?
how do you make sure the copy the proxy gives you is up to date?  and so on.
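
The core of the idea fits in a few lines of Python; this is a sketch of the principle only, not of any of the systems named above, and the URL is merely illustrative:

    # Keep local copies of what has already been fetched; answer from the
    # cache when possible, go to the distant server only on a miss, and let
    # copies expire so they do not grow too stale.
    import time
    import urllib.request

    class CachingProxy:
        def __init__(self, max_age_seconds=3600):
            self.cache = {}  # url -> (fetched_at, body)
            self.max_age = max_age_seconds

        def get(self, url):
            entry = self.cache.get(url)
            if entry and time.time() - entry[0] < self.max_age:
                return entry[1]  # cache hit: no long-distance call needed
            body = urllib.request.urlopen(url, timeout=10).read()
            self.cache[url] = (time.time(), body)  # keep a copy for the next caller
            return body

    proxy = CachingProxy()
    page = proxy.get("https://example.com/")  # first request: fetched remotely
    page = proxy.get("https://example.com/")  # second request: served from the cache
    print(len(page), "bytes;", len(proxy.cache), "document cached")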

These systems work, and work well: since I have caught on to dialling the
Web via the HENSA proxy at Kent, life has suddenly become possible.  An
interesting statistic, drawn from Smith of Kent's talk: over an average week, it
seems that around two thirds of the calls to the Web by all those hundreds of
thousands of people are to some 24,000 documents (about a quarter of the total on
the Web, as of May), amounting to just 450 Meg of data.  The Web is both
enormous, and surprisingly small.  One point is clear: proxies and similar devices
can only work given sophisticated co-operation between servers.

Has the Web a future?

If the Web stays as it is, it will slowly strangle itself with too much data, too little
of which is of any value, and with even less of what you want available when you
want it.  I had more than a few moments of doubt at the conference: there were too
many of the hairy-chested 'I wrote a Web server for my Sinclair z99 in turbo-
Gaelic in three days flat' types to make me overwhelmingly confident.  The Web will not
stay as it is. The Web has grown on the enthusiastic energies of thousands of
individuals, working in isolation.  But the next stages of its growth, as the Web
seeks to become more reliable and to grow commercial arms and legs, will require
more than a protocol, a mark-up language and enthusiasm.  Servers must talk to
one another, so that sophisticated proxy systems can work properly.  Servers will
have to make legal contracts with one another, so that money can pass around the
Web.  Richer mark-up languages, as HTML evolves, will require more powerful
browsers, and create problems of standardization, documentation and maintenance.
All this will require co-operation, and organization.  The organization is on the
way.  After much hesitation, CERN has decided to continue to support the Web,
and the EC is going to fund a stable European structure for the Web, probably
based at CERN and possibly with units in other European centres (including,
perhaps, Oxford).  Berners-Lee himself is going to MIT, where he is going to work
with the team which put together XWindows.

What will the Web look like, then, in a few years?  The momentum for
much of the Web, at least, to remain free and uncontrolled (even, anarchic) will
continue: it will become the world's free Thursday afternoon paper, a rag-bag of
advertising, vanity publishing, plain nonsense and serious information.  There will
be many local islands of information -- help systems, archive galleries, educational
material.  Eccentrics, like the Norwegian family who have put their whole house on
the Web so you can dial up a video camera in their living room and watch what
they are doing, will continue to find a home on the Web (Ludvigsen).  But
increasingly the Web will provide a gateway to other, specialist and typically
commercial services: electronic publishing, databases, home shopping, etc.  The
presence of these satellite services will relieve the Web itself from the need to be all
things to all people, and allow it to continue to be the lingua franca of the
electronic world -- demotic, even debased, but easy, available and above all, free.
The Web will not be the only gateway to services which can do what the Web
cannot, but it may well be the biggest and the most widely used.  For organizations
like Universities, libraries, archives of all sorts (and, especially, the AHDA), which
give some information away and sell other information, the Web and its satellites
will be an ideal environment.  This aspect of the Web has been barely touched up to
now.  But it will happen, and it is our business to make sure it happens the right
way.

What have I not described?

Even this rather long account omits much.  Briefly, a few things I have not
covered:
1.  HTML tools: there was discussion about HTML and its future (Raggett; various
workshops); several HTML editors were described (Williams and Wilkinson;
Kruper and Lavenant; Rubinsky of SoftQuad announced HoTMetaL); various
schemes for converting documents to HTML were outlined (from FrameMaker:
Stephenson; Rousseau; from Latex: Drakos).
2.  Connectivity tools for linking the Web to other resources: to CD-ROMs
(Mascha); to full-featured hypertext link servers (Hall and the Microcosm team); to
computer program teaching systems (Ibrahim); to databases (Eichmann)
and much, much more.

Thirty years after McLuhan, the global village is here.