By Angela Martello
Managing Editor, Web Content
Thomson Scientific
Nearly 50 years after Dr. Eugene Garfield began indexing scholarly research
bibliographic data, Thomson Scientific continues his emphasis on quality and
relevance, carefully selecting and indexing the core literature published in
peer-reviewed scholarly journals, books, and proceedings. In 1998, the company
decided to complement the extensive bibliographic information it already provides
its customers by developing a collection of scholarly Web sites, Current Web
Contents™ [1]. The Web Content Editors in the
Editorial Development Department identify Web sites and evaluate how well each
site adheres to a number of selection criteria (e.g., authority,
accuracy, currency, overall design, and quality of writing).
A hallmark of Thomson Scientific has, of course, been its citation indexing.
By capturing references and linking them back to source items in our database,
Thomson Scientific Web of Science® lets users organize literature
around cited references, trace the historic route of scientific discovery, map
out relationships among multiple research papers, and even track how well
their publications are received by their peers. Thomson Scientific is now
applying its strengths in citation tracking and linking to documents resident
on the Web in its new offering – the Web Citation Index.
Web Citation Index – What is it?
The Web Citation Index is a multidisciplinary citation index of Web-accessible,
scholarly research papers (including articles, preprints, theses, dissertations,
proceedings, technical reports, and other gray literature). The Web Citation
Index uses some of the Autonomous Citation Indexing [2] software
developed by NEC Laboratories America in Princeton, NJ. This technology extracts
source information and references from documents and builds an index with cited
and citing relationships. This technology currently supports the CiteSeer database
of computer and information science papers (http://citeseer.ist.psu.edu/).
Thomson Scientific has enhanced this technology in many ways. For example, additional
software harvests Open Archives Initiative (OAI)-compliant metadata (if available)
and combines it with the full-text index of each document [3, 4].
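As a rough illustration of what harvesting OAI-compliant metadata involves, the sketch below builds a ListRecords request as defined by the OAI Protocol for Metadata Harvesting. It is not Thomson Scientific's harvesting software; the function name and the repository base URL are hypothetical, and only the request parameters (verb, metadataPrefix, from, set) come from the protocol itself.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL for a repository.

    base_url is the repository's OAI-PMH endpoint (hypothetical here).
    from_date and set_spec are optional selective-harvesting arguments.
    """
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date  # harvest only records changed since this date
    if set_spec:
        params["set"] = set_spec    # restrict the harvest to one set, if the repository defines sets
    return f"{base_url}?{urlencode(params)}"


# Hypothetical endpoint, used only to show the shape of the request.
url = list_records_url("http://repository.example.org/oai", from_date="2006-01-01")
print(url)
```

A harvester would fetch this URL, read the returned XML records, and follow any resumption token the repository supplies to retrieve the remaining records.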
Through the Web Citation Index interface, customers can perform cited
reference searches and navigate through the Web-based literature by using the
cited and citing relationships that exist among the indexed documents. Customers
can also link back to the Web of Science if the Web-based
document cites or is cited by a Web of Science article, or if the
document is also indexed in the Web of Science.
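The cited and citing relationships described above can be pictured as a pair of lookup tables, one in each direction. The following is a minimal sketch of that data structure, not the Autonomous Citation Indexing software or the product's implementation; the class name, method names, and document identifiers are all invented for illustration, and it assumes references have already been extracted and matched to document identifiers.

```python
from collections import defaultdict


class CitationIndex:
    """Toy citation index: records what each document cites and what cites it."""

    def __init__(self):
        self.cited_by_source = defaultdict(set)   # document id -> ids it cites
        self.citing_by_target = defaultdict(set)  # document id -> ids that cite it

    def add_document(self, doc_id, references):
        """Record a source document and the references extracted from it."""
        for ref_id in references:
            self.cited_by_source[doc_id].add(ref_id)
            self.citing_by_target[ref_id].add(doc_id)

    def cited_references(self, doc_id):
        """Return the documents that doc_id cites."""
        return sorted(self.cited_by_source.get(doc_id, ()))

    def citing_documents(self, doc_id):
        """Return the documents that cite doc_id (a cited reference search)."""
        return sorted(self.citing_by_target.get(doc_id, ()))


# Hypothetical documents and references.
index = CitationIndex()
index.add_document("report-2005", ["preprint-2003", "thesis-2001"])
index.add_document("preprint-2003", ["thesis-2001"])

print(index.citing_documents("thesis-2001"))   # ['preprint-2003', 'report-2005']
print(index.cited_references("report-2005"))   # ['preprint-2003', 'thesis-2001']
```

Storing the relationship in both directions is what makes it possible to move forward from a paper to later work that cites it, as well as backward to the work it builds on.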
Institutional and Subject-Specific Repositories
The Web Citation Index differs from the current CiteSeer database
in several ways, but the most fundamental difference is the selection of content.
The CiteSeer system crawls the Web and harvests content that follows certain
rules. The Web Citation Index, on the other hand, contains carefully
selected content. The Web Content Editors of Editorial Development serve as
the content curators for the Web Citation Index, choosing content that
meets defined selection criteria.
For the initial release of the Web Citation Index, Thomson Scientific
has concentrated on the scholarly material archived in institutional and subject-specific
repositories available via the Internet. Many factors have contributed to the
rise of such repositories, most notably the Open Archives Initiative and the
development of several open source archiving software packages. A position paper
by the Scholarly Publishing & Academic Resources Coalition supported institutional
repositories, referring to them as a “compelling response to two strategic
issues facing academic institutions,” namely how to reform and complement
the current scholarly journal publishing system and how to present to the public
an indicator of the quality and relevance of an institution’s research
efforts [5].
Indeed, seven major institutions participated in the development of the pilot
version of the Web Citation Index. These institutions – Australian
National University, California Institute of Technology, Cornell University,
the Max Planck Society, Monash University, the University of Rochester, and
NASA Langley – allowed us to test our software on their repositories.
They also offered valuable feedback on the overall design and functionality
of the product itself.
Just how many institutional repositories and subject-specific archives there
are is hard to pinpoint. A recent article based on a survey conducted jointly
by the Coalition for Networked Information, the UK Joint Information Systems
Committee, and the SURF Foundation in the Netherlands attempted to summarize
the status of the deployment of institutional repositories in 13 countries (Australia,
Canada, the United States, Belgium, France, the United Kingdom, Denmark, Norway,
Sweden, Finland, Germany, Italy, and the Netherlands) [6]. The responses
to the survey were incomplete (especially with respect to the United States),
but to summarize:
- The respondents reported a total of 305 institutional repositories.
- Percentages of universities with an institutional repository ranged from
5% (Finland) to 100% (Germany, Norway, and the Netherlands).
A rather extensive list of repositories is the Institutional Archives Registry
(http://archives.eprints.org/index.php)
maintained by Tim Brody of the University of Southampton on the EPrints.org
Web site. As of February 2006, the Registry contained 610 archives.
A number of software packages [7] are also represented in the Registry.
The most widely used package is GNU EPrints, developed at the University of
Southampton. The second most common self-archiving system is DSpace, developed
jointly by MIT and Hewlett-Packard.
Selection Criteria for the Web Citation Index
While selecting institutional and subject-specific repositories for the Web
Citation Index, the Web Content Editors keep in mind a number of criteria.
These criteria include the following:
- Authority
- Overall design, maintenance, and ease of use of the archive
- Frequency of updates
- Review policy/procedure (if any)
Authority
With respect to authority, the Editors are primarily concerned with identifying
the body behind the archive. The Editors look for repositories and subject-specific
archives sponsored by universities or colleges with faculty and staff active
in research (e.g., MIT, Dartmouth, Australian National University); government
research laboratories or agencies that produce a great volume of documents (e.g.,
NASA, EPA); major non-governmental or intergovernmental bodies (e.g., the United
Nations); or prominent not-for-profit or non-profit research institutions, societies,
or organizations (e.g., Max Planck Society). The Editors also consider the document
collections of large, research-oriented commercial or corporate entities (e.g.,
Sun Microsystems, Hewlett-Packard), provided they contain free access to full-text
documents.
Overall Design and Maintenance
Overall design and maintenance of the repository or archive Web site are important
considerations as well. Sites with many broken links or excessive server downtime
are problematic and generally rejected, as are sites that have not been updated
in quite some time. Other design/maintenance features of a repository the Editors
look for are document type metadata tags (e.g., article, thesis, or memo), some
type of subject classification, and the ability to access the full-text document.
Frequency of Updates
The Editors also consider the frequency with which new materials are posted
to the repository or archive. A regular stream of new articles signals that
the collection is current and well-maintained. For institutional repositories,
a frequently updated collection also signals that the faculty and staff of the
institution have embraced the concept of archiving their works.
Frequent updates, however, do not require that a repository
hold any particular total number of documents. In fact, the total numbers
of documents in the repositories reviewed and accepted by the Editors range
from fewer than 100 to 350,000+. Fairly new or single-topic repositories, such
as those run by the University of Washington’s Structural Informatics
Group (http://sigpubs.biostr.washington.edu/)
and the Advanced Knowledge Technologies collaboration in the UK (http://eprints.aktors.org/),
have considerably fewer papers (e.g., 100-250) than do the more established
or multidisciplinary repositories managed by large universities, such as DSpace
at MIT (http://dspace.mit.edu/)
and the Virginia Tech Electronic Theses and Dissertations Collection (http://scholar.lib.vt.edu/theses/),
which each have several thousand documents. The arXiv at Cornell University
(http://www.arxiv.org/)
and the various technical reports servers maintained by NASA (e.g., http://library-dspace.larc.nasa.gov/)
are examples of well-established yet more narrowly focused collections that
have very large numbers of documents (350,000+ for the arXiv and 5,000+ for the NASA
Langley Technical Library Digital Repository). The Editors, therefore, must
consider the frequency of update, the age of the repository, and the range of
subject matter – not just the total number of documents – when evaluating
a repository for overall robustness.
Review Policy/Procedure
If the repository follows a self-archiving model, the Editors check to see
to what extent – if any – the submissions are moderated. The Editors
are not necessarily looking for a peer-review process; rather, they are checking
to see if submissions are verified (i.e., reviewed with respect to appropriateness
of content and adherence to archive requirements) before being added to the
repository.
There are also some technical issues that could affect the ease with which
the archive can be crawled and indexed. These include access to the full-text
document (via browsing or searching), document format (PDF, PostScript, zipped
PostScript), and compliance with OAI metadata standards, particularly with respect
to the pertinent source item elements (e.g., title, author, abstract).
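To make the metadata point concrete, the sketch below checks one unqualified Dublin Core (oai_dc) record for the source item elements mentioned above. It is an illustration only, not the Editors' actual procedure: the sample record is invented, and mapping title, author, and abstract to the Dublin Core elements title, creator, and description is an assumption about how a repository exposes them.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"

# Assumed mapping from source item elements to Dublin Core element names.
REQUIRED = {"title": "title", "author": "creator", "abstract": "description"}

# Invented sample record; a real one would come from an OAI-PMH response.
SAMPLE_RECORD = """
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <dc:title>An Example Technical Report</dc:title>
  <dc:creator>Doe, J.</dc:creator>
  <dc:format>application/pdf</dc:format>
</oai_dc:dc>
"""


def missing_source_elements(record_xml):
    """Return the source item elements that are absent or empty in the record."""
    root = ET.fromstring(record_xml)
    missing = []
    for label, dc_name in REQUIRED.items():
        element = root.find(f"{{{DC_NS}}}{dc_name}")
        if element is None or not (element.text or "").strip():
            missing.append(label)
    return missing


print(missing_source_elements(SAMPLE_RECORD))  # ['abstract']
```

A record like the sample, which lacks a description, could still be indexed from its full text, but the source item information would have to be extracted from the document itself.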
Examples of Repositories
Some examples of well-designed, well-maintained institutional and subject-specific
archives are arXiv (http://www.arxiv.org/),
the physics, mathematics, computer science, and quantitative biology open access
preprint archive maintained by Cornell University; Caltech Collection of Open
Digital Archives (CODA) (http://library.caltech.edu/digital/default.htm/);
the Australian National University Eprints Repository (http://eprints.anu.edu.au/),
which includes all the scholarly output of the ANU community; and the NASA Langley
Technical Library Digital Repository (http://library-dspace.larc.nasa.gov/).
These examples fit the editorial guidelines with respect to authority, overall
design, maintenance, frequency of updates, and ease of use.
Conclusion
With the development of the Web Citation Index, Thomson Scientific
brings its strengths in citation indexing to documents resident on the Web in
institutional and subject-specific repositories. This powerful new product initiative
allows ISI Web of Knowledge℠ users to search Thomson Scientific’s
core database of journal literature as well as the contents of selected repositories,
and track the cited and citing relationships among these documents.
References
1. Current Web Contents: Web Site Selection Criteria (essay) (http://scientific.thomson.com/free/essays/selectionofmaterial/cwc-criteria/)
2. Lawrence S, Giles CL, Bollacker K: Digital libraries and autonomous citation
indexing. IEEE Computer 32(6):67-71, 1999
3. Open Archives Initiative Protocol for Metadata Harvesting, v. 2.0 (http://www.openarchives.org/OAI/openarchivesprotocol.html)
4. OAI for Beginners: The Open Archives Forum Online Tutorial (http://www.oaforum.org/tutorial/)
5. Crow R: The case for institutional repositories: A SPARC position paper.
Scholarly Publishing & Academic Resources Coalition, 2002 (http://www.arl.org/sparc/IR/ir.html)
6. van Westrienen G, Lynch CA: Academic institutional repositories: Deployment
status in 13 nations as of mid 2005. D-Lib Magazine 11(9), September 2005 (http://www.dlib.org/dlib/september05/westrienen/09westrienen.html)
7. Open Society Institute: A guide to institutional repository software, 3rd
edition, 2004