Citescapes: Supporting Knowledge Construction on the Web

Stuart Moulthrop and Nancy Kaplan
School of Communications Design
The University of Baltimore

1: Knowledge construction and the Web

It is well understood that knowledge in professional communities is not simply discovered, but rather evolves through systematic association of claims. In what Bolter calls "the late age of print" [BOLTER], this process necessarily involves documents. What biologists, sociologists, lawyers, engineers, physicians, and other knowledge workers know is largely embodied in their professional literatures. The advancement of disciplinary knowledge depends on social interaction among writers of mutually-referring documents [LATOUR]. The value of new claims is established by patterns of support and dissent manifested in papers, reports, reviews, opinions, and other public communications. A claim survives and acquires influence or "reach" only if it becomes attached to a network of references [KAUFER].

Proponents of hypertext have portrayed it as a valuable tool for this social evolution of knowledge. For Bush, automated retrieval and linking seemed a promising solution for the great diversification of scientific information [BUSH]. For Nelson, hypertext promised wider and more general access to written knowledge [NELSON]. Bolter associated hypertext with a paradigm shift in literate culture, a transition from fixed hierarchies to dynamic networks [BOLTER]. Forecasts concerning the World Wide Web have been similarly enthusiastic, and with some reason [DECRAN]. If success is measured by numbers of users, documents, and links, then the Web is an overwhelmingly successful implementation of hypertext. Some major bodies of knowledge are significantly better connected than they were at the start of this decade.

Success can also be measured qualitatively, however, and by this index the Web may prove less impressive. We might ask, what good is all this connection? As Bolter notes, hypertext requires a multi-dimensional approach to information, introducing a basic metaphor of "writing space." In recent years hypertext designers and knowledge theorists have argued for greater richness and sophistication in the conceptual, functional, and design space of hypertext [KAPL94, KOLB, MARSHA]. Yet the writing space of today's Web is notably limited, especially in encompassing those documentary relationships on which communal knowledge-building depends.

In pre-Web days, hypertext systems were often criticized for failing to afford the same functionality as print [CARLSO]. Now we might reverse that complaint: hypertexts on the Web generally resemble print documents far too closely. As Furuta and Marshall observe, Web hypertexts are "passive" objects, representing information in relatively fixed form, much as do periodicals and books [FURUTA]. Links among these documents are one-way structures pointing away from the current location. If the Web may be said to function as a language, that language has only one verb (to go) and one predicate (go there). Because of this limitation, links in Web hypertexts suffer the same temporal constraint that affects references in print documents: they can only refer backward in time to prior sources. At present, we have no way to provide for a "there" which isn't there yet.

These limits can be challenged, of course. The simplest expedient is to update Web documents frequently -- the author changes her "Under Construction" sign to read "Perpetually Under Construction" and resolves to write links to later work as it comes along. Aside from being tedious, this strategy reinforces another serious flaw that the Web has inherited from print: changes in the document's links require the author's active engagement. If an author chooses not to update, perhaps because she is distracted by other work, new connections go unmapped.

A more interesting solution has been proposed by the Digital Libraries group at Stanford University [ROSCHE]. They have modified client and server software to allow writers other than the original author to annotate Web documents. Annotations are treated as "meta-information" linked to but not necessarily stored with the original document. In effect the annotations create a new textual dimension tangential to the World Wide Web. This system enables several extensions to the Web's writing space, including "tours" and "trails" where annotations weave together various Web documents. These trails could be used to associate knowledge claims in the ways we describe here.

While the Stanford proposal augments Web hypertext in important ways, it does not fully meet the requirements of associative knowledge building. In the Stanford system, only authenticated users may see annotations. This approach has its advantages, which we discuss in the final part of this paper, but it also introduces crucial limitations. To begin with, restricting access to specific groups raises questions about group constitution and regulation. Unless membership can be quickly and simply adjusted, groups will tend to be small, homogeneous, and static. This is consistent with the earliest conception of hypertext (Bush's Memex, which was meant for individuals and small teams). It does not conform to more recent thinking, which stresses possibilities for "generalism" and cross-fertilization among disciplines [NELSON, BOLTER]. Indeed, the notion of private annotations and pathways seems badly out of step with the open and heterogeneous character of the Web.

As an alternative to restricted annotations, we propose a different device for associating Web documents. We call this mechanism a citescape, which may be defined generally as a dynamic, linked representation, embedded within a document, of all the pages that contain hypertextual references (HREFs) to that document. The term is derived by analogy with "landscape" and connotes a visual survey or mapping of the document space that surrounds a given piece of writing. Figure 1 shows a prototype citescape.

Figure 1: Page-Specific Citescape

The citescape has no exact counterpart in print tradition, though it bears a family resemblance to a citation index. As the illustration shows, however, there are crucial differences between citescapes and citation indices.

Citation indices are large, complex works compiled laboriously and expensively. They often lag behind current research by several years. By contrast, the citescape is generated on demand (note the REFRESH button and dateline in Figure 1). Information in the citescape is as current as its source database. This database is continually and automatically updated (see the technical discussion in section 3). Entries in the citescape are live hypertext links, not passive records as in a printed index. Finally and perhaps most important, the citescape is integrated into its subject document: anyone who can see the document can see the existing citescape and generate a fresh one if desired. No special privileges are required.

Figure 2: Temporal Relations Among Documents

By virtue of its availability, integration and content, the citescape would add significant value to a Web document. Since it contains live links, the citescape provides quick access to other texts constituting the writing space surrounding the subject document. These links may be valuable even if they are never followed. Readers of a technical or professional communication could compare the number of links with the date of the document's first appearance on the Web. This could give a rough sense of the document's importance or suggest that it has been neglected. Likewise, the names and origins of texts with links to the subject document might also give crucial insights. In professional literatures, an idea is known not so much by the company it keeps as by the company that keeps it in circulation.

These benefits accrue mainly to readers, but authors could derive value from the citescape as well. Research workers would obviously benefit from an automated literature survey and clipping service. These functions could also benefit creators of commercial Web documents, who could use lists of links pointing to their documents to quantify market impact. Using page-specific citescapes (described in section 3), authors could tell what parts of their documents were drawing the greatest interest or generating the most controversy. This information could help considerably with revision.

2: Proof of concept

It is prudent to ask whether the functions we propose can be carried out with resources currently available on the Web. The Lycos spider retrieves and stores most of the information citescapes would require, including the titles of documents and the Uniform Resource Locators (URLs) they include. Its powerful search engine enables users to query the database with considerable flexibility. Yet the results of various combinations of search techniques, displayed in the Table, show that for all its utility as a research tool, Lycos cannot currently perform key citescape functions.

2.1 Case studies: finding citations of three documents

Although the Web is relatively young, it is already a rich resource for research in a number of knowledge domains, especially those concerned with electronic technologies and communication. The documents we used for this test have been available for only 6-24 months, but each had already received a number of citations. These case studies examine the reach of Schank's book-length hypertext, Engines for Education [SCHANK], and two of Moulthrop's articles, "You Say You Want a Revolution? Hypertext and the Laws of Media," published in Postmodern Culture, a peer-reviewed electronic journal [MOUL91], and "It's Not What You Think," a hypertextual letter to the editors of Newsweek [MOUL95]. Although we cannot claim that this survey yields generalizable data, the three publications represent a range of traditional intellectual genres, the kinds of written records scholars and researchers use as the basis for furthering intellectual work.

For each work, the Lycos database was queried four times, each time with a different set of search terms. Searches were conducted using the author's name, the title of the target work, elements of the work's URL, and a combination either of author and title or of author and URL element. Each search employed as many constraints as possible to limit the number of irrelevant hits. Thus authors' names and keywords from titles were constrained so that only exact matches would be retrieved. Whenever a search employed more than one search term, only those hits containing all terms with a high adjacency factor were returned.

Table: Lycos-based Searches for Citations

All searches except one yielded some results. The search using unique elements of the URL for Schank's Engines for Education proved fruitless because the URL depends on punctuation marks which Lycos strips out in its search algorithms.

The variations in hits by search method suggest that authors of Web documents use heterogeneous styles for citing other works on the Web. Such heterogeneity also characterizes works in print and reflects rhetorical and stylistic differences between knowledge domains. To be fully functional, however, a mechanism for aggregating all links that point into a specific document should be as general as possible so that variations in style do not thwart its purposes.

2.2 Limitations and Issues

2.3 Implications

Although it is possible to use Lycos and similar search mechanisms to track the reach of a document, the process is difficult and the results uncertain. Search strategies need to be carefully crafted to exploit the power of the database and search engine while avoiding their limitations and prohibitions. The strategies must also take into account some semantic features of the information that can be used to identify the target document. The results obtained will certainly include a large number of irrelevant hits, many of which will need to be investigated before they can be eliminated from the list.

The most serious limitation of Lycos, however, remains its distance from the target document. As we see it, a key feature of citescapes is their incorporation within the target document. Such inclusion maps the intellectual terrain of which the document is a part and permits the document to grow in complexity as the various enterprises to which it is important also grow and change.

3: Technical description

The mechanism we propose for implementing citescapes consists of three components: a citescape database server, a Common Gateway Interface program, and a proposed extension to Hypertext Markup Language (HTML). Each of these parts is explained in detail below.

3.1 Citescape Server

The citescape mechanism requires a comprehensive, regularly updated database which records the destination URL of every hypertextual reference (HREF) in every page on every publicly accessible server on the Web, together with the URL of the page in which the reference appears. The database also contains aggregation information (arguments to PARTOF, see 3.3 below). The information is gathered automatically by a survey daemon of the type used by current search databases like Lycos and WebCrawler. The search mechanism seeks an exact match to the URL in the query request.

This is obviously the most elaborate part of the proposal. As the previous section shows, however, it is clearly feasible.
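The database described in this section can be pictured as a reverse-link index: for each destination URL, the set of pages that refer to it. The following is only a minimal sketch under stated assumptions, not a description of any existing spider. The class names (HrefCollector, CitescapeIndex), the example URLs, and the decisions to resolve relative references and to drop fragment identifiers before matching are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class HrefCollector(HTMLParser):
    """Collects the destination of every HREF attribute found in one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.targets = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases attribute names, so HREF and href both match.
        for name, value in attrs:
            if name == "href" and value:
                # Assumption: relative references are resolved against the
                # citing page, and "#fragment" suffixes are dropped.
                absolute = urljoin(self.base_url, value)
                self.targets.add(urldefrag(absolute)[0])


class CitescapeIndex:
    """Hypothetical in-memory stand-in for the citescape database."""

    def __init__(self):
        # destination URL -> set of URLs of pages that link to it
        self._citing = defaultdict(set)

    def add_page(self, page_url, html_text):
        """Record every link found in one surveyed page."""
        collector = HrefCollector(page_url)
        collector.feed(html_text)
        for target in collector.targets:
            self._citing[target].add(page_url)

    def lookup(self, url):
        """Exact-match query: pages known to link to `url`, sorted."""
        return sorted(self._citing.get(url, set()))


index = CitescapeIndex()
index.add_page("http://a.example/review.html",
               '<A HREF="http://b.example/paper.html">the paper</A>')
index.add_page("http://b.example/notes.html",
               '<a href="paper.html#s2">see section 2</a>')
print(index.lookup("http://b.example/paper.html"))
# prints ['http://a.example/review.html', 'http://b.example/notes.html']
```

A real survey daemon would also have to fetch pages, revisit them on a schedule, and store the index persistently; none of that is attempted here.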

3.2 Citescape CGI

This program identifies the type of citescape being requested, issues a query to the Citescape Server, creates a citescape page in the current document if necessary, and places the results of the query in that page.

Citescape queries may be of two types. A page-specific query (the default) returns all URLs containing links to the present Web page. Figure 1 shows results of a page-specific citescape query.

It may often be desirable to generate a citescape for an aggregation of Web pages. This is done with a document-specific query, which returns all URLs having links to any page within the current hypertextual document, a document defined here as a coordinated set of pages (see 3.3 below). Figure 3 shows a document-specific citescape.

Figure 3: Document-Specific Citescape

3.3 HTML Attribute PARTOF

Many documents on the Web consist of numerous pages connected by networks of hypertext links. Kaplan's hypertext "E-Literacies," for example, comprises approximately 35 pages and 180 links, most of which refer to other pages within the document [KAPL95]. Current implementations of HTML offer no way to identify a page as part of an aggregate structure.

The attribute PARTOF, appearing within the tag of an HTML page, fills this gap. The argument of PARTOF is the name of the document to which the present page belongs. The usage PARTOF="W3-95" indicates that the current page is a component of a hypertext called W3-95. Since hypertextual linking allows authors to use a single page within several documents, PARTOF accepts multiple arguments separated by commas. PARTOF="W3-95, spaceProgram, Perisites2" indicates that the present page is a component of two other documents, spaceProgram and Perisites2, as well as W3-95.
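To make the multiple-argument form concrete, here is a minimal sketch of how a survey daemon might read PARTOF values from a page. PARTOF is a proposed attribute, not part of any HTML standard, so the parser simply looks for it on any tag. The names PartOfParser and documents_for_page are hypothetical, and the BODY tag in the example is used purely for illustration.

```python
from html.parser import HTMLParser


class PartOfParser(HTMLParser):
    """Gathers document names from the proposed PARTOF attribute."""

    def __init__(self):
        super().__init__()
        self.documents = []

    def handle_starttag(self, tag, attrs):
        # Attribute names arrive lowercased; values keep their case.
        for name, value in attrs:
            if name == "partof" and value:
                # Multiple arguments are separated by commas.
                for doc in value.split(","):
                    doc = doc.strip()
                    if doc and doc not in self.documents:
                        self.documents.append(doc)


def documents_for_page(html_text):
    """Return the documents a page claims membership in, in source order."""
    parser = PartOfParser()
    parser.feed(html_text)
    return parser.documents


print(documents_for_page('<BODY PARTOF="W3-95, spaceProgram, Perisites2">'))
# prints ['W3-95', 'spaceProgram', 'Perisites2']
```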

We prefer to let authors decide whether a page is a legitimate part of a hypertextual document, as opposed to simply being reachable by a link from that document. A good rule of thumb might be whether or not the page in question contains links to other pages in the subject document. Authors may disregard this rule if they are interested in broader aggregations.

Pages with no PARTOF attribute are treated as separate entities. A document-specific citescape request on such a page defaults to a page-specific request.
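The two query types and the default rule can be sketched together. This is an illustrative sketch only: the function names, the dictionary layouts, and the example URLs are hypothetical. It also assumes that when a page belongs to several documents, a document-specific query covers all of them; the proposal leaves that choice open.

```python
def page_specific(url, citing_pages):
    """Return the pages that link to `url`, sorted."""
    return sorted(citing_pages.get(url, set()))


def document_specific(url, citing_pages, membership):
    """Return the pages that link to any page sharing a document with `url`.

    citing_pages: destination URL -> set of citing page URLs
    membership:   page URL -> list of PARTOF document names
    """
    documents = set(membership.get(url, []))
    if not documents:
        # No PARTOF information: default to a page-specific request.
        return page_specific(url, citing_pages)
    # Every page that names at least one of the same documents.
    members = {page for page, docs in membership.items()
               if documents & set(docs)}
    results = set()
    for page in members:
        results |= citing_pages.get(page, set())
    return sorted(results)


citing_pages = {
    "http://h.example/intro.html": {"http://x.example/a.html"},
    "http://h.example/ch2.html": {"http://x.example/a.html",
                                  "http://y.example/b.html"},
    "http://h.example/solo.html": {"http://z.example/c.html"},
}
membership = {
    "http://h.example/intro.html": ["W3-95"],
    "http://h.example/ch2.html": ["W3-95", "spaceProgram"],
}

print(document_specific("http://h.example/intro.html", citing_pages, membership))
# prints ['http://x.example/a.html', 'http://y.example/b.html']
print(document_specific("http://h.example/solo.html", citing_pages, membership))
# prints ['http://z.example/c.html']
```

This sketch does not filter out links that originate inside the document itself. Whether a document-specific citescape should list such internal links is a design decision left to the implementer.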

4: Citescapes and the noise problem

Critics of electronic writing often complain that it confers too much anonymity [STOLL, TUMAN]. Suppose we find on the Web a lengthy technical paper about magnetohydrodynamics (MHD), full of data and elaborate equations. As people unacquainted with the field, we might assume this is the work of an engineer or physicist -- only to be told that it was written by a very bright 12-year-old as an appendix to an amateurish science fiction novel. The data are invented, the equations flawed and fundamentally meaningless. The joke is on us. In print, the critics argue, this mistake would never happen. Readers are protected by editors, reviewers, publishers, and other gatekeepers absent from the Web.

A citescape might provide partial protection from this trouble. If we find no subsequent links on the citescape, we might view the paper in question more skeptically. If the links we find all seem to be from science fiction writers, or from 12-year-olds, the game would likely be up.

However, we can construct equally plausible scenarios in which citescapes make it harder to separate intellectual signal from noise. Suppose the paper on MHD is indeed the work of a rigorously trained researcher in engineering physics. The researcher has ventured outside his main specialty, though, and is writing in a highly speculative vein (thus the absence of co-authors). Let us suppose the citescape for this paper contains three links. The first two are from hypertexts written by mainstream academic researchers, containing comments sharply critical of the author's ideas on MHD. When we follow the third link, we find a rambling, semi-coherent tract about cattle mutilations. This author seems to think flying saucers are powered by MHD. Given these results, how should we characterize the original paper: as an interesting theoretical venture or as the sort of pseudo-science that appeals to cranks?

This scenario suggests that citescapes will not necessarily improve professional discourse on the Web. They could even do the opposite, exposing serious work to intrusions from the lunatic fringe. In the Web equivalent of "spamming," popular or important work might be peppered with links from authors interested mainly in self-promotion. In one sense a citescape functions as a window on a surrounding discursive space; but in another sense it is an open door, possibly inviting unwanted guests. If our goal is to maintain strict control over knowledge claims, then the private, restricted meta-links envisioned by the Stanford Digital Libraries Group could be more appropriate.

But if strict control of information is paramount, why trade print for hypertext in the first place? From its inception, hypertext has been described as a powerful tool for associating ideas. To illustrate uses of his Memex system, Bush speculated that it would allow researchers to connect chains of diverse ideas, moving from Turkish crossbows to the properties of various woods to the vagaries of strategic doctrine [BUSH]. Bolter notes that in hypertext there is "no reason not to include disparate materials in one electronic network" [BOLTER, p.7]. "An electronic book," he writes, "is a structure that reaches out to other structures, not only metaphorically, as does a printed book, but operationally" [p.87].

Much of the potential value of hypertext stems from this facility for connection. Interdisciplinary thinking represents a primary source of intellectual breakthrough and critique. Kaufer and Carley note that "authors associated with the most authority and change are not rooted within a single intellectual community. Instead, they are authors on the move, the maverick, the eccentric, the outsider, the intellectual migrant, trained in one community and rising to fame after finding their way to another" [KAUFER, p.394]. Kekulé von Stradonitz discovered the benzene ring largely because he trained in architecture and switched to chemistry [ULMER]. Mandelbrot's mathematical insights on fractal geometry yielded important crossovers in economics, population genetics, and biology [GLEICK]. If Penrose's recent speculations are correct, then artificial intelligence and neuropsychology have much to learn from quantum physics [PENROS]. As Kaufer and Carley show, print has facilitated these cross-fertilizations. Hypertext could conceivably do much more; but only if we understand its difference from print.

The noise problem can be dealt with simply enough. Once they have access to citescapes for their texts, Web authors can create edited versions, screening out links they consider inappropriate, malicious, or even embarrassing. These authorized citescapes would coexist hypertextually with the unedited versions. Readers might be encouraged to use the canonical citescape instead of the raw cut, but they would be free to compare the two and draw their own conclusions. This scheme lets authors filter out anything they find too noisy, but also preserves the de-selected information in case it comes from an emerging genius and not a hopeless crank.
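The coexistence of edited and unedited versions can be sketched in a few lines. This is a hypothetical illustration, not part of the proposed mechanism: the function name authorized_citescape, the dictionary keys, and the example URLs are all assumptions.

```python
def authorized_citescape(raw_links, excluded):
    """Build an author-edited view while keeping the raw list intact.

    raw_links: citing URLs as returned by a citescape query
    excluded:  URLs the author has chosen to screen out
    """
    excluded = set(excluded)
    authorized = [link for link in raw_links if link not in excluded]
    # Both versions are returned so that readers can compare them.
    return {"authorized": authorized, "raw": list(raw_links)}


views = authorized_citescape(
    ["http://u.example/critique.html",
     "http://v.example/response.html",
     "http://w.example/saucers.html"],
    excluded=["http://w.example/saucers.html"],
)
print(views["authorized"])
# prints ['http://u.example/critique.html', 'http://v.example/response.html']
print(len(views["raw"]))
# prints 3
```

Because the raw list is never modified, a de-selected link remains available to any reader who asks for the unedited version.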

5: Concluding thought experiment

We believe the citescape mechanism is a viable technical proposal. At the same time, questions about its implications also suggest an important thought experiment. Suppose that the citescape function were already available. Would it be perceived as an overall benefit or harm to the World Wide Web? What sorts of Web users would be likely to adopt this function, and which would reject it? What would rejection say about the Web and the uses for which we intend it?

Citescapes pose a clear alternative to more localized structures of association. As we have indicated, these restrictive structures promise more homogeneity, better noise suppression, and tighter authorial control over electronic writing. Such qualities may prove more desirable to future Web users than the flexibility, heterogeneity, and noisiness that citescapes support.

Noise suppression always has a cost, however. Restrictive mechanisms would likely inhibit intellectual "migration," reinscribe strict disciplinary boundaries, and thus deter innovation. The present Web offers a viable if unruly alternative to the regimes of print. Citescapes augment this alternative, aiming to support the wide circulation of knowledge implicit in both the Web and the hypertext concept itself.

Is hypertext really what we want? Or would we prefer "electronic books" and "digital libraries" -- mechanisms that restore a pre-Internet social order? These are not simply technical questions. Technologies have social implications, just as social agendas inevitably shape technologies. The conceptual problems posed by citescapes may well model issues about communication and control that are salient for cyberspace in general.

References

[BOLTER]
Bolter, J.D. (1991) Writing space: the computer, hypertext, and the history of writing. Erlbaum.

[BUSH]
Bush, V. (1945) As we may think. The Atlantic Monthly, July. URL: http://www.csi.uottawa.ca/~dduchier/misc/vbush/as-we-may-think.html.

[CARLSO]
Carlson, P. (1990) The rhetoric of hypertext. Hypermedia 2:109-31.

[DECRAN]
December, J. and N. Randall (1994) The World Wide Web unleashed. SAMS.

[FURUTA]
Furuta, R. and C. Marshall (1995) Genre as reflection of technology in the World Wide Web. IWHD '95 proceedings. International Workshop on Hypermedia Design, Montpellier.

[GLEICK]
Gleick, J. (1987) Chaos: making a new science. Viking.

[KAPL94]
Kaplan, N. and S. Moulthrop. (1994) Where no mind has gone before: ontological design for virtual spaces. ECHT '94 proceedings. European Conference on Hypermedia Technology.

[KAPL95]
Kaplan, N. (1995) E-literacies: politexts, hypertexts, and other cultural formations in the late age of print. Computer-mediated communication magazine 2(3). URL: http://sunsite.unc.edu/cmc/mag/1995/mar/kaplan.html.

[KAUFER]
Kaufer, D. and K. Carley. (1993) Communication at a distance: the influence of print on sociocultural organization and change. Erlbaum.

[KOLB]
Kolb, D. (1995) Socrates in the labyrinth: hypertext, argument, philosophy. Eastgate Systems.

[LATOUR]
Latour, B. (1987) Science in action: how to follow scientists and engineers through society. Harvard UP.

[MARSHA]
Marshall, C., F. Shipman, and J. Coombs. (1994) VIKI: spatial hypertext supporting emergent structure. ECHT '94 proceedings. European Conference on Hypermedia Technology.

[MOUL95]
Moulthrop, S. (1995) It's not what you think: Newsweek's tech-no mania. URL: http://www.charm.net/~sam/inwyt/inwyt.html.

[MOUL91]
Moulthrop, S. (1991) You say you want a revolution? hypertext and the laws of media. Postmodern culture 1:3. URL: http://jefferson.village.virginia.edu/pmc/issue.591/moulthro.591.

[NELSON]
Nelson, T. (1987) Literary machines. Mindful Press.

[PENROS]
Penrose, R. (1994) Shadows of the mind: the search for the missing science of consciousness. Oxford UP.

[ROSCHE]
Röscheisen, M., C. Mogensen, and T. Winograd. (1995) Beyond browsing: shared comments, SOAPs, trails, and on-line communities. 1995 World Wide Web Conference. URL: http://www-diglib.stanford.edu/diglib/pub/reports/brio_www95.html.

[SCHANK]
Schank, R. (1994) Engines for education. Erlbaum. URL: http://www.ils.nwu.edu/~e_for_e/.

[STOLL]
Stoll, C. (1995) Silicon snake oil. Doubleday.

[TUMAN]
Tuman, M. (1992) Word perfect: literacy in the computer age. U. Pittsburgh Press.

[ULMER]
Ulmer, G. (1990) Teletheory: grammatology in the age of video. Routledge.

Acknowledgement

Thanks to Christine Boese of Rensselaer Polytechnic Institute for pointing out the temporal aspects of the citescape mechanism.