Enhancing Digital Libraries Using Missing Content Analysis
David
Carmel, Elad Yom-Tov and Haggai Roitman |
Abstract: This work shows how the content
of a digital library can be enhanced to better satisfy its users'
needs. Missing content is identified by finding missing content
topics in the system's query log or in a pre-defined taxonomy
of required knowledge. The collection is then enhanced with new
relevant knowledge, which is extracted from external sources
that satisfy those missing content topics. Experiments we conducted
measure the precision of the system before and after content
enhancement. The results demonstrate a significant improvement
in the system's effectiveness as a result of content enhancement
and the superiority of the missing content enhancement policy
over several other possible policies. |
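
The abstract leaves the detection step at a high level; as a loose
illustration (not the authors' algorithm), missing-content candidates
can be flagged as query-log entries whose best retrieval score falls
below a threshold and then grouped into rough topics. The threshold,
scoring function and grouping heuristic below are all assumptions:

# Hypothetical sketch: flag query-log entries the collection cannot answer
# well, then bucket them into crude "missing content" topics.
from collections import defaultdict

def missing_content_topics(query_log, search_fn, score_threshold=0.2):
    """query_log: iterable of query strings.
    search_fn(query): list of (doc_id, score) pairs, best first."""
    candidates = [q for q in query_log
                  if not search_fn(q) or search_fn(q)[0][1] < score_threshold]
    topics = defaultdict(list)
    for query in candidates:
        label = max(query.lower().split(), key=len, default=query)  # crude topic label
        topics[label].append(query)
    return topics  # label -> queries the collection fails to satisfy
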
Building a Dynamic Lexicon from a Digital Library
David
Bamman and Gregory Crane |
Abstract: We describe here in detail our work
toward creating a dynamic lexicon from the texts in a large digital
library. By leveraging a small structured knowledge source (a
30,537 word treebank), we are able to extract selectional preferences
for words from a 3.5 million word Latin corpus. This is promising
news for low-resource languages and digital collections seeking
to leverage a small human investment into much larger gain. The
library architecture in which this work is developed allows us
to query customized subcorpora to report on lexical usage by
author, genre or era and allows us to continually update the
lexicon as new texts are added to the collection. |
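
A minimal sketch of the kind of selectional-preference extraction
described above, assuming the corpus has already been dependency
parsed into (head, relation, dependent) triples; the triple format and
relation label are assumptions, and the paper's lexicon pipeline is
considerably richer:

# Count verb -> object co-occurrences from automatically parsed sentences.
from collections import Counter, defaultdict

def selectional_preferences(dependency_triples):
    """dependency_triples: iterable of (head_lemma, relation, dependent_lemma)."""
    prefs = defaultdict(Counter)
    for head, relation, dependent in dependency_triples:
        if relation == "obj":               # direct objects only, for illustration
            prefs[head][dependent] += 1
    return prefs

# Example: prefs["capio"].most_common(5) would list the nouns most often
# found as objects of the Latin verb "capio" in the parsed corpus.
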
On Content-Driven Search—Keyword
Suggesters for Literature Digital Libraries
Sulieman Bani-Ahmad and Gultekin
Ozsoyoglu |
Abstract: We propose and evaluate a “content-driven
search keyword suggester” for keyword-based search in literature
digital libraries. Suggesting search keywords at an early stage,
i.e., while the user is entering search terms, is helpful for
constructing more accurate, less ambiguous, and focused search
keywords for queries. Our search keyword suggestion approach
is based on an a priori analysis of the publication collection
in the digital library at hand, and consists of the following
steps. We (i) parse the document collection using the Link Grammar
parser, a syntactic parser of English, (ii) group publications
based on their “most-specific” research topics,
(iii) use the parser output to build a hierarchical structure
of simple and compound tokens to be used to suggest search terms,
(iv) use TextRank, a text summarization tool, to assign topic-sensitive
scores to keywords, and (v) use the identified research-topics
to help users aggregate search keywords prior to the actual search
query execution. We experimentally show that the proposed framework,
which is optimized to work on literature digital libraries, promises
a more scalable, high quality, and user-friendly search-keyword
suggester when compared to its competitors. We validate our proposal
experimentally using a subset of the ACM SIGMOD Anthology digital
library as a testbed, and by employing the research-pyramid model
to identify the “most-specific” research topics. |
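
As a rough illustration of step (iv), the sketch below scores
candidate tokens with a plain TextRank-style PageRank over a
co-occurrence graph (using the networkx library); the paper's scores
are topic-sensitive, and the window size here is not taken from the
original:

# Plain TextRank-style keyword scoring over a token co-occurrence graph.
import networkx as nx

def textrank_scores(tokens, window=4):
    graph = nx.Graph()
    for i, tok in enumerate(tokens):
        for other in tokens[i + 1:i + window]:
            if tok != other:
                w = graph.get_edge_data(tok, other, {}).get("weight", 0)
                graph.add_edge(tok, other, weight=w + 1)
    return nx.pagerank(graph, weight="weight")   # token -> importance score

scores = textrank_scores("query processing in relational database systems".split())
suggestions = sorted(scores, key=scores.get, reverse=True)[:3]
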
Unsupervised Semantic Markup of Literature for Biodiversity
Digital Libraries
Hong Cui |
Abstract: This paper reports the further development
of machine learning techniques for semantic markup of biodiversity
literature, especially morphological descriptions of living organisms
such as those hosted at efloras.org and algaebase.org. Syntactic
parsing and supervised machine learning techniques have been
explored by earlier research. Limitations of these techniques
prompted our investigation of an unsupervised learning approach
that combines the strengths of earlier techniques and avoids their
limitations. Semantic markup at the organ and character levels
is discussed. Research on semantic markup of natural heritage
literature has direct impact on the development of semantic-based
access in biodiversity digital libraries. |
Seeking information in realistic books: A user study
Veronica
Liesaputra and Ian Witten |
Abstract: There are opposing views on whether
readers gain any advantage from using a computer model of a 3D
physical book. There is enough evidence, both anecdotal and from
formal user studies, to suggest that the usual HTML or PDF presentation
of documents is not always the most convenient, or the most comfortable,
for the reader. On the other hand it is quite clear that while
3D book models have been prototyped and demonstrated, none are
in routine use in today’s digital libraries. And how do
3D book models compare with actual books? This paper reports
on a user study designed to compare the performance of a practical
Realistic Book implementation with conventional formats (HTML
and PDF) and with physical books. It also evaluates the annotation
features that the implementation provides. |
Understanding Cultural
Heritage Experts Information
Seeking Needs
Alia Amin, Jacco van Ossenbruggen, Lynda Hardman
and Annelies van Nispen |
Abstract: We report on our user study on the
information seeking behavior of cultural heritage experts and
the sources they use to carry out search tasks. Seventeen experts
from nine cultural heritage institutes in the Netherlands were
interviewed and asked to answer questionnaires about their daily
search activities. The interviews helped us to better understand
their search motivations, types, sources and tools. A key finding
of our study is that the majority of search tasks involve relatively
complex information gathering. This is in contrast to the relatively
simple fact-finding oriented support provided by current tools.
We describe a number of strategies that experts have developed
to overcome the inadequacies of their tools. Finally, based on
the analysis, we derive general trends of cultural heritage experts’ information
seeking needs and discuss our preliminary experiences with potential
solutions. |
The Myth of Find: User Behaviour and Attitudes Towards the
Basic Search Feature
Fernando Loizides and George Buchanan |
Abstract: The ubiquitous within-document text
search feature (Ctrl-F) is considered by users to be a key advantage
in electronic information seeking [1]. However what people say
they do and what they actually do are not always consistent.
It is necessary to understand, acknowledge and identify the cause
of this inconsistency. We must identify the physical and cognitive
factors to develop better methods and tools, assisting with the
search process. This paper discusses the limitations and myths
of Ctrl-F in information seeking. A prototype system for within-document
search is introduced. Three user studies portray behaviour
and attitudes common among participants regarding within-document
searching. |
A Longitudinal Study of Exploratory and Keyword Search
Max
L. Wilson and m.c. schraefel |
Abstract: Digital libraries are concerned
with improving the access to collections to make their service
more effective and valuable to users. In this paper, we present
the results of a four-week longitudinal study investigating the
use of both exploratory and keyword forms of search within an
online video archive, where both forms of search were available
concurrently in a single user interface. While we expected early
use to be more exploratory and subsequent use to be directed,
over the whole period there was a balance of exploratory and
keyword searches and they were often used together. Further,
to support the notion that facets support exploration, there
were more than five times as many facet clicks as more complex
forms of keyword search (boolean and advanced). From these results,
we can conclude that there is real value in investing in exploratory
search support, which was shown to be both popular and useful
for extended use of the system. |
Exploring Educational Standard Alignment: In Search of 'Relevance'
Rene
Reitsma, Byron Marshall and Michael Dalton |
Abstract: The growing availability of online
K-12 curriculum is increasing the need for meaningful alignment
of this curriculum with state-specific standards. Promising automated
and semi-automated alignment tools have recently become available.
Unfortunately, recent alignment evaluation studies report low
inter-rater reliability, e.g., 32% with two raters and 35 documents.
While these results are in line with studies in other domains,
low reliability makes it difficult to accurately train automatic
systems and complicates comparison of different services. We
propose that inter-rater reliability of broadly defined, abstract
concepts such as ‘alignment’ or ‘relevance’ must
be expected to be low due to the real-world complexity of teaching
and the multidimensional nature of the curricular documents.
Hence, we suggest decomposing these concepts into less abstract,
more precise measures anchored in the daily practice of teaching.
This article reports on the integration of automatic alignment
results into the interface of the TeachEngineering collection
and on an evaluation methodology intended to produce more consistent
document relevance ratings. Our results (based on 14 raters
x 6 documents) show high inter-rater reliability (61 - 95%)
on less abstract relevance dimensions while scores on the overall ‘relevance’ concept
are (as expected) lower (64%). Despite a relatively small sample
size, regression analysis of our data resulted in an explanatory
(R2 = .75) and statistically stable (p-values < .05) model
for overall relevance as indicated by matching concepts, related
background material, adaptability to grade level, and anticipated
usefulness of exercises. Our results suggest that more detailed
relevance evaluation which includes several dimensions of relevance
would produce better data for comparing and training alignment
tools. |
From NSDL 1.0 to NSDL 2.0: Towards a Comprehensive Cyberinfrastructure
for Teaching and Learning
David McArthur and Lee Zia |
Abstract: NSDL is a premier provider of digital
educational collections and services, which has been supported
by NSF for eight years. As a mature program, NSDL has reached
a point where it could either change direction or wind down.
In this paper we argue there are reasons to continue the program
and we outline several possible new program directions. These
build on NSDL’s learning platform, and they also look towards
NSF’s emerging interest in supporting work at the intersection
of cyberinfrastructure and education. We consider NSDL’s
potential roles in several grand challenges that confront education,
including: tailoring educational resources to students’ needs,
providing educators with a cyber-teaching environment, developing
a cyber-workbench for researchers, and integrating education
research and practice. |
Cross-Disciplinary Molecular Science Education in Introductory
Science Courses: An NSDL MatDL Collection
David Yaron, Jodi
Davenport, Michael Karabinos, Gaea Leinhardt, Laura Bartolo,
John Portman, Cathy Lowe, Donald Sadoway, W. Craig Carter and
Colin Ashe |
Abstract: This paper discusses a digital library
designed to help undergraduate students draw connections across
disciplines, beginning with introductory discipline-specific
science courses (including chemistry, materials science, and
biophysics). The collection serves as the basis for a design
experiment for interdisciplinary educational libraries and is
discussed in terms of the three models proposed by Sumner and
Marlino. As a cognitive tool, the library is organized around
recurring patterns in molecular science, with one such pattern
being developed for this initial design experiment. As a component
repository, the library resources support learning of these patterns
and how they appear in different disciplines. As a knowledge
network, the library integrates design with use and assessment. |
Curriculum Overlay Model for Embedding Digital Resources
Huda
Khan, Keith Maull and Tamara Sumner |
Abstract: This paper describes the design
and implementation of a curriculum overlay model for the representation
of adaptable curriculum using educational digital library resources.
We focus on representing curriculum to enable the incorporation
of digital resources into curriculum and curriculum sharing and
customization by educators. We defined this model as a result
of longitudinal studies on educators' development and customization
of curriculum and user interface design studies of prototypes
representing curriculum. Like overlay journals or the information
network overlay model, our curriculum overlay model defines curriculum
as a compound object with internal semantic relationships and
relationships to digital library metadata describing resources.
We validated this model by instantiating the model using science
curriculum which uses digital library resources and using this
instantiation within an application which, built on FEDORA, supports
curriculum customization. Findings from this work can support
the design of digital library services for customizing curriculum
which embeds digital resources. |
Gazetiki: Automatic Creation of a Geographical Gazetteer
Adrian
Popescu, Gregory Grefenstette and Pierre Alain Moëllic |
Abstract: Geolocalized databases are becoming
necessary in a wide variety of application domains. Thus far,
the creation of such databases has been a costly, manual process.
This drawback has stimulated interest in automating their construction,
for example, by mining geographical information from the Web.
Here we present and evaluate a new automated technique for creating
and enriching a geographical gazetteer, called Gazetiki. Our
technique merges disparate information from Wikipedia, Panoramio,
and web search engines in order to identify geographical names,
categorize these names, find their geographical coordinates and
rank them. We show that our method provides a richer structure
and an improved coverage compared to the other known attempt
at automatically building a geographic database, TagMaps. Our
technique correctly identifies 93% of geographical location candidates,
with much greater coverage than TagMaps, finding 2 to 30 times
more items per location. The information produced
in Gazetiki enhances and complements the Geonames database, using
a similar domain model. |
Discovering GIS Sources on the Web using Summaries
Ramaswamy
Hariharan, Bijit Hore and Sharad Mehrotra |
Abstract: In this paper, we consider the
problem of discovering GIS data sources on the web. Source discovery
queries for GIS data are specified using keywords and a region
of interest. A source is considered relevant if it contains data
that matches the keywords in the specified region. Existing techniques
simply rely on textual metadata accompanying such datasets to
compute relevance to user-queries. Such approaches result in
poor search results, often missing the most relevant sources
on the web. We address this problem by developing more meaningful
summaries of GIS datasets that preserve the spatial distribution
of keywords. We conduct experiments showing the effectiveness
of proposed summarization techniques by significantly improving
the quality of query results over previous approaches, while
guaranteeing scalability and high performance. |
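
One way to picture the kind of spatial keyword summary described
above (the paper's summarization techniques are more sophisticated):
bucket keyword occurrences into coarse grid cells so that a source can
be matched against a (keyword, region) query without shipping the full
dataset. The cell size and matching logic below are assumptions:

# Grid-based summary of where each keyword occurs in one GIS source.
from collections import defaultdict

CELL = 1.0  # grid resolution in degrees

def build_summary(records):
    """records: iterable of (keyword, lat, lon) tuples from one GIS source."""
    summary = defaultdict(set)
    for keyword, lat, lon in records:
        summary[keyword.lower()].add((int(lat // CELL), int(lon // CELL)))
    return summary

def matches(summary, keyword, lat_range, lon_range):
    cells = summary.get(keyword.lower(), set())
    return any(lat_range[0] // CELL <= i <= lat_range[1] // CELL and
               lon_range[0] // CELL <= j <= lon_range[1] // CELL
               for i, j in cells)
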
SocialTrust: Tamper-Resilient Trust Establishment in Online
Communities
James Caverlee, Ling Liu and Steve Webb |
Abstract: Web 2.0 promises rich opportunities
for information sharing, electronic commerce, and new modes of
social interaction, all centered around the ``social Web'' of
user-contributed content, social annotations, and person-to-person
social connections. But the increasing reliance on this ``social
Web'' also places individuals and their computer systems at risk.
In this paper, we identify a number of vulnerabilities inherent
in online communities and study opportunities for malicious participants
to exploit the tight social fabric of these networks. With these
problems in mind, we propose the SocialTrust framework for tamper-resilient
trust establishment in online communities. Two of the salient
features of SocialTrust are its dynamic revision of trust by
(i) distinguishing relationship quality from trust; and (ii)
incorporating a personalized feedback mechanism for adapting
as the community evolves. We experimentally evaluate the SocialTrust
framework using real online social networking data consisting
of millions of MySpace profiles and relationships. We find that
SocialTrust supports robust trust establishment even in the presence
of large-scale collusion by malicious participants. |
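
A toy propagation loop in the spirit of the two salient features named
above, separating link (relationship) quality from user trust and
folding in community feedback; the update rule and constants are
illustrative assumptions rather than SocialTrust's actual formulas:

# Iterative trust propagation with per-link quality and per-user feedback.
def propagate_trust(neighbors, link_quality, feedback, iterations=20, damping=0.85):
    """neighbors[u]: users linking to u; link_quality[(v, u)] in [0, 1];
    feedback[u] in [0, 1] from community ratings."""
    users = list(neighbors)
    trust = {u: 1.0 / len(users) for u in users}
    for _ in range(iterations):
        new = {}
        for u in users:
            incoming = sum(trust[v] * link_quality.get((v, u), 0.0)
                           for v in neighbors[u])
            new[u] = (1 - damping) / len(users) + damping * feedback.get(u, 1.0) * incoming
        trust = new
    return trust
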
Personal & SME Archiving
Stephan Strodl,
Florian Motlik, Kevin Stadler and Andreas Rauber |
Abstract: Digital objects require appropriate
measures for digital preservation to ensure that they can be
accessed and used in the near and far future. While heritage
institutions have been addressing the challenges posed by digital
preservation needs for some time, private users and SMEs are
way less prepared to handle these challenges. Yet, both have
increasing amounts of data that represent considerable value,
be it office documents or family photographs. Backup, a common
practice among home users, avoids the physical loss of data, but
it does not prevent the loss of the ability to render and use
the data in the long term. Research and development in the area
of digital preservation is driven by memory institutions and
large businesses. The available tools, services and models are
developed to meet the demands of these professional settings.
This paper analyses the requirements and challenges of preservation
solutions for private users and SMEs. Based on the requirements
and supported by available tools and services, we are designing
and implementing a home archiving system to provide digital
preservation solutions specifically for digital holdings in
the small office and home environment. It hides the technical
complexity of digital preservation challenges and provides
simple and automated services based on established best practice
examples. The system combines bit preservation and logical
preservation strategies to prevent both the loss of the data and
the loss of the ability to access and use it. A first software
prototype, called Hoppla, is presented in this paper. |
Recovering a Website's Server Components from the Web Infrastructure
Frank
McCown and Michael Nelson |
Abstract: Our previous research has shown
that the collective behavior of search engine caches (e.g., Google,
Yahoo, Live Search) and web archives (e.g., Internet Archive)
results in the uncoordinated but large-scale refreshing and migrating
of web resources. Interacting with these caches and archives,
which we call the Web Infrastructure (WI), allows entire websites
to be reconstructed in an approach we call lazy preservation.
Unfortunately, the WI only captures the client-side view of a
web resource. While this may be useful for recovering much of
the content of a website, it is not helpful for restoring the
scripts, web server configuration, databases, and other server-side
components responsible for the construction of the web resource.
This paper proposes a novel technique for storing and recovering
the server-side components of a website from the WI. Using
erasure codes to embed the server-side components as HTML comments
throughout the website, we can effectively reconstruct all
the server components of a website when only a portion of the
client-side resources have been extracted from the WI. We present
the results of a preliminary study that baselines the lazy
preservation of ten EPrints repositories and then examines
the preservation of an EPrints repository that uses the erasure
code technique to store the server-side EPrints software throughout
the website. We found nearly 100% of the EPrints components
were recoverable from the WI just two weeks after the repository
came online, and it remained recoverable three months after
it was "lost". |
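
As a simplified stand-in for the erasure-code embedding (a single XOR
parity block instead of a real erasure code, with padding and length
bookkeeping omitted), the sketch below splits a server-side archive
into k chunks plus one parity piece, wraps each as an HTML comment to
scatter across pages, and rebuilds the archive when any one piece is
missing:

# Encode k data chunks + 1 XOR parity chunk as HTML comments; any k of the
# k + 1 pieces recovered from the WI suffice to rebuild the archive.
import base64

def encode_pieces(data: bytes, k: int = 4):
    size = -(-len(data) // k)                          # ceiling division
    chunks = [data[i * size:(i + 1) * size].ljust(size, b"\0") for i in range(k)]
    parity = chunks[0]
    for chunk in chunks[1:]:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return [f"<!-- WI-PIECE {i}/{k} {base64.b64encode(p).decode()} -->"
            for i, p in enumerate(chunks + [parity])]  # one comment per page

def recover(pieces_by_index, k):
    """pieces_by_index: {index: bytes} holding any k of the k + 1 pieces."""
    rebuilt = None
    for piece in pieces_by_index.values():
        rebuilt = piece if rebuilt is None else bytes(a ^ b for a, b in zip(rebuilt, piece))
    return b"".join(pieces_by_index.get(i, rebuilt) for i in range(k))
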
A Data Model and Architecture for Long-Term Preservation
Greg
Janee, Justin Mathena and James Frew |
Abstract: The National Geospatial Digital
Archive, one of eight initial projects funded under the Library
of Congress’s NDIIPP program, has been researching how
geospatial data can be preserved on a national scale and be made
available to future generations. In this paper we describe an
archive architecture that provides a minimal approach to the
long-term preservation of digital objects based on co-archiving
of object semantics, uniform representation of objects and semantics,
explicit storage of all objects and semantics as files, and abstraction
of the underlying storage system. This architecture ensures that
digital objects can be easily migrated from archive to archive
over time and that the objects can, in principle, be made usable
again at any point in the future; its primary benefit is that
it serves as a fallback strategy against, and as a foundation
for, more sophisticated (and costly) preservation strategies.
We describe an implementation of this architecture in a prototype
archive running at UCSB that also incorporates a suite of ingest
and access components. |
HarvANA - Harvesting Community Tags to Enrich Collection
Metadata
Jane Hunter, Imran Khan and Anna Gerber |
Abstract: Collaborative, social tagging and
annotation systems have exploded on the Internet as part of the
Web 2.0 phenomenon. Systems such as Flickr, Del.icio.us, Technorati,
Connotea and LibraryThing, provide a community-driven approach
to classifying information and resources on the Web, so that
they can be browsed, discovered and re-used. Although social
tagging sites provide simple, user-relevant tags, there are issues
associated with the quality of the metadata and the scalability
compared with conventional indexing systems. In this paper we
propose a hybrid approach that enables authoritative metadata
generated by traditional cataloguing methods to be merged with
community annotations and tags. The HarvANA (Harvesting and Aggregating
Networked Annotations) system uses a standardized but extensible
RDF model for representing the annotations/tags and OAI-PMH to
harvest the annotations/tags from distributed community servers.
The harvested annotations are aggregated with the authoritative
metadata in a centralized metadata store. This streamlined, interoperable,
scalable approach enables libraries, archives and repositories
to leverage community enthusiasm for tagging and annotation,
augment their metadata and enhance their discovery services.
This paper describes the HarvANA system and its evaluation through
a collaborative testbed with the National Library of Australia
using architectural images from PictureAustralia. |
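
The harvesting half of this pipeline rests on the standard OAI-PMH
protocol; a minimal sketch (ignoring resumption tokens, and with a
placeholder endpoint) that pulls Dublin Core records and collects
dc:subject values as community tags for later aggregation might look
like this:

# Harvest Dublin Core records over OAI-PMH and gather dc:subject tags.
import requests
import xml.etree.ElementTree as ET

OAI = "{http://www.openarchives.org/OAI/2.0/}"
DC = "{http://purl.org/dc/elements/1.1/}"

def harvest_tags(base_url):
    resp = requests.get(base_url, params={"verb": "ListRecords",
                                          "metadataPrefix": "oai_dc"}, timeout=30)
    root = ET.fromstring(resp.content)
    tags = {}
    for record in root.iter(f"{OAI}record"):
        ident = record.findtext(f"{OAI}header/{OAI}identifier")
        tags[ident] = [el.text for el in record.iter(f"{DC}subject") if el.text]
    return tags  # identifier -> community tags, ready to merge with catalogue metadata
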
Semi Automated Metadata Extraction for Preprints Archives
Emma
Tonkin and Henk Muller |
Abstract: In this paper we present a system
called paperBase that aids users in entering metadata for preprints.
PaperBase extracts metadata from the preprint. Using a Dublin
Core-based REST API, third-party repository software populates a web
form that the user can then proofread and complete. PaperBase
also predicts likely keywords for the preprints, based on a controlled
vocabulary of keywords that the archive uses and a Bayesian classifier.
We have tested the system on 12 individuals, and measured
the time that it took them to enter data, and the accuracy
of the entered metadata. We find that our system is not significantly
faster than manual entry, even though all but two participants
perceived it to be faster. However, some metadata, in particular
the title of preprints, contains significantly fewer mistakes
when entered automatically; even though the automatic system
is not perfect, people tend to correct mistakes that paperBase
makes, but would leave their own mistakes in place. |
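
A hedged sketch of the keyword-prediction component: a naive Bayes
classifier (here via scikit-learn, standing in for whatever paperBase
actually uses) trained on previously deposited preprints labelled with
controlled-vocabulary terms, returning the top-ranked keywords for a
new text:

# Train a naive Bayes keyword suggester and rank controlled-vocabulary terms.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

def train_suggester(texts, keywords):
    """texts[i] is labelled with controlled-vocabulary term keywords[i]."""
    vectorizer = CountVectorizer(stop_words="english")
    model = MultinomialNB().fit(vectorizer.fit_transform(texts), keywords)
    return vectorizer, model

def suggest(vectorizer, model, text, n=5):
    probs = model.predict_proba(vectorizer.transform([text]))[0]
    ranked = sorted(zip(model.classes_, probs), key=lambda kp: kp[1], reverse=True)
    return [keyword for keyword, _ in ranked[:n]]
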
A Metadata Generation System for Scanned Scientific Volumes
Xiaonan
Lu, James Z. Wang and C. Lee Giles |
Abstract: Large-scale digitization projects
have been conducted at digital libraries, driven by advances in
automatic document processing and the growing popularity of digital libraries.
Scientific literature originally printed on paper has been converted
into collections of digital resources for preservation and open
access purposes. In this work, we tackle the problem of extracting
structural and descriptive metadata for scanned journal volumes.
This metadata describes the internal structure
of a scanned volume, links objects in different sources, and describes
the published articles within a scanned volume. Such structural
and descriptive information is critical for digital libraries
to provide effective content access functionality to users.
We propose methods for generating volume level, issue level,
and article level metadata using format and text features extracted
from OCRed text. We have developed the system and integrated
it into an operational digital library for real world usage. |
Exploring a Digital Library through Key Ideas
Bill
N. Schilit and Okan Kolak |
Abstract: Key Ideas is a technique for exploring
digital libraries by navigating passages that repeat across multiple
books. From these popular passages emerge quotations that authors
have copied from book to book because they capture an idea particularly
well: Jefferson on liberty; Stanton on women's rights; and Gibson
on cyberpunk. We augment Popular Passages by extracting key terms
from the surrounding context and computing sets of related key
terms. We then create an interaction model where readers fluidly
explore the library by viewing popular quotations on a particular
key term, and follow links to quotations on related key terms.
In this paper we describe our vision and motivation for Key Ideas,
present an implementation running over a massive, real-world
digital library consisting of over a million scanned books, and
describe some of the technical and design challenges. The principal
contribution of this paper is the interaction model and prototype
system for browsing digital libraries of books using key terms
extracted from the aggregate context of popularly quoted passages. |
Math Information Retrieval: User Requirements and Prototype
Implementation
Jin Zhao, Min-Yen Kan and Yin Leng Theng |
Abstract: We report on the user requirements
study and preliminary implementation phases in creating a digital
library that indexes and retrieves educational materials on math.
We first review the current approaches and resources for math
retrieval, then report on interviews with a small group of potential
users to properly ascertain their needs. While preliminary, the
results suggest that Meta-Search and Resource Categorization
are two basic requirements for a math search engine. In addition,
we implement a prototype categorization system and show that
the generic features work well in identifying the math content
on a webpage but are weak in categorizing it. We believe
this is mainly due to the training data and the segmentation.
In the near future, we plan to improve the system further while integrating
it and Meta-Search into a search engine. As a long-term goal,
we will also look into how math expressions and text may be best
handled. |
A Competitive Environment for Exploratory Query Expansion
David
Milne, David Nichols and Ian Witten |
Abstract: Most information workers query digital
libraries many times a day. Yet people have little opportunity
to hone their skills in a controlled environment, or compare
their performance with others in an objective way. Conversely,
although search engine logs record how users evolve queries,
they lack crucial information about the user’s intent.
This paper describes an environment for exploratory query expansion
that pits users against each other and lets them compete, and
practice, in their own time and on their own workstation. The
system captures query evolution behavior on predetermined information-seeking
tasks. It is publicly available, and the code is open source
so that others can set up their own competitive environments. |
How people find videos
Sally Jo Cunningham
and David M. Nichols |
Abstract: At present very little is known
about how people locate and view videos. This study draws a rich
picture of everyday video seeking strategies and video information
needs, based on an ethnographic study of New Zealand university
students. These insights into the participants’ activities
and motivations suggest potentially useful facilities for a video
digital library. |
Selection and Context Scoping for Digital Video Collections:
An Investigation of YouTube and Blogs
Robert Capra, Christopher
Lee, Gary Marchionini, Terrell Russell, Chirag Shah and Fred
Stutzman |
Abstract: Digital curators are faced with
decisions about what part of the ever-growing, ever-evolving
space of digital information to collect and preserve. The recent
explosion of web video on sites such as YouTube presents curators
with an even greater challenge – how to sort through and
filter a large amount of information to find, assess and ultimately
preserve important, relevant, and interesting video. In this
paper, we describe research conducted to help inform digital
curation of on-line video. Since May 2007, we have been monitoring
the results of 57 queries on YouTube related to the 2008 U.S.
presidential election and report results comparing these data
to blogs that point to candidate videos on YouTube and discuss
the effects of query-based harvesting as a collection development
strategy. |
A Study of Awareness in Multimedia Search
Robert Villa, Nick Gildea and Joemon Jose |
Abstract: Awareness of another's activity
is an important aspect of facilitating collaboration between
users, enabling an "understanding of the activities of others".
Techniques such as collaborative filtering enable a form of asynchronous
awareness, providing recommendations generated from the past
activity of a community of users. In this paper we investigate
the role of awareness and its effect on search behavior in collaborative
multimedia retrieval. We focus on the scenario where two users
are searching at the same time on the same task, and via the
interface, can see the activity of the other user. The main research
question asks: does awareness of another searcher aid a user
when carrying out a multimedia search session?
To encourage awareness, an experimental study was designed
where two users were asked to find as many relevant video shots
as possible under different awareness conditions. These were
individual search (no awareness of each other), mutual awareness
(where both users could see each other's search screens), and
unbalanced awareness (where one user is able to see the other's
screen, but not vice-versa). Twelve pairs of users were recruited,
and the four worst performing TRECVID 2006 search topics were
used as search tasks, under four different awareness conditions.
We present the results of this study, followed by a discussion
of the implications for multimedia digital library systems. |
Towards usage-based impact metrics: first results from the
MESUR project
Johan Bollen, Herbert Van de Sompel and Marko
A. Rodriguez |
Abstract: Scholarly usage data holds the potential
to be used as a tool to study the dynamics of scholarship in
real time, and to form the basis for the definition of novel
metrics of scholarly impact. However, the formal groundwork to
reliably and validly exploit usage data is lacking, and the exact
nature, meaning and applicability of usage-based metrics is poorly
understood. The MESUR project funded by the Andrew W. Mellon
Foundation constitutes a systematic effort to define, validate
and cross-validate a range of usage-based metrics of scholarly
impact. MESUR has collected nearly 1 billion usage events as
well as all associated bibliographic and citation data from significant
publishers, aggregators and institutional consortia to construct
a large-scale usage data reference set. This paper describes
some major challenges related to aggregating and processing usage
data, and discusses preliminary results obtained from analyzing
the MESUR reference data set. The results confirm the intrinsic
value of scholarly usage data, and support the feasibility of
reliable and valid usage-based metrics of scholarly impact. |
Evaluating the Contributions of Video Representation for
a Life Oral History Collection
Michael Christel and Michael
Frisch |
Abstract: A digital video library of over 900 hours of video
and 18000 stories from The HistoryMakers is used to investigate
the role of motion video for users of recorded life oral histories.
Stories in the library are presented in one of two ways in two
within-subjects experiments: either as audio accompanied by a
single still photographic image per story, or as the same audio
within a motion video of the interviewee speaking. 24 participants
given a treasure-hunt fact-finding task, i.e., very directed
search, showed no significant preference for either the still
or video treatment, and no difference in task performance. 14
participants in a second study worked on an exploratory task
in the same within-subjects experimental framework, and showed
a significant preference for video. For exploratory work, video
has a positive effect on user satisfaction. Implications for
use of video in collecting and accessing recorded life oral histories,
in student assignments and more generally, are discussed, along
with reflections on long term use studies to complement the ones
presented here. |
From Writing
and Analysis to the Repository: Taking the Scholars' Perspective
on Scholarly Archiving
Catherine C. Marshall |
Abstract: This paper reports the results of
a qualitative field study of the writing, collaboration, and
archiving practices of researchers in a single organization;
the researchers span five subdisciplines and bring different
expertise to the papers they write together. The study focuses
on the kinds of artifacts the researchers create in the process
of writing a paper, how they exchange and store these artifacts
over the short term, how they handle references and bibliographic
materials, and the strategies they use to guarantee the long
term safety of their scholarly materials. By attending to and
supporting the upstream processes of writing and collaboration,
we hope to facilitate personal digital archiving and deposit
into institutional and disciplinary repositories as a side-effect
to everyday aspects of research. The findings reveal a great
range of scholarly materials, consequential differences in how
researchers handle them now and what they expect to keep, and
patterns of bibliographic practices and resource use. The findings
also identify long term vulnerabilities for personal archives. |
User-Assisted Ink-Bleed Correction for Handwritten Documents
Yi
Huang and Michael S. Brown |
Abstract: We describe a user-assisted framework
for correcting ink-bleed in old handwritten documents housed
at the National Archives of Singapore (NAS). Our approach departs
from traditional correction techniques that strive for full automation.
Fully-automated approaches make assumptions about ink-bleed characteristics
that are not valid for all inputs. Furthermore, fully-automated
approaches often have to set algorithmic parameters that have
no meaning for the end-user. In our system, the user needs only
to provide simple examples of ink-bleed, foreground ink, and
background. These training examples are used to classify the
remaining pixels in the document to produce a computer-generated
result that is equal to or better than existing fully-automated
approaches.
To offer a complete system we also provide tools that allow
any errors in the computer-generated results to be quickly
``cleaned up'' by the user. The initial training markup, computer-generated
results, and manual edits are all recorded with the final output,
allowing subsequent viewers to see how a corrected document
was created and to make changes or updates. While an on-going
project, our feedback from the NAS staff has been overwhelmingly
positive that this user-assisted framework is a practical way
to address the ink-bleed problem. |
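
A minimal sketch of the user-assisted classification idea, with a
k-nearest-neighbour classifier on raw RGB values standing in for the
paper's method (which, among other things, also exploits the reverse
side of the page); the label names and feature choice are assumptions:

# Classify every pixel from a handful of user-marked example pixels.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def classify_page(image, examples):
    """image: H x W x 3 uint8 array. examples: {'ink': [(y, x), ...],
    'bleed': [...], 'background': [...]} picked by the user."""
    X, y = [], []
    for label, coords in examples.items():
        for (row, col) in coords:
            X.append(image[row, col])
            y.append(label)
    clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
    labels = clf.predict(image.reshape(-1, 3))
    return labels.reshape(image.shape[:2])   # per-pixel 'ink' / 'bleed' / 'background'
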
CRF-Based Authors' Name Tagging for Scanned Documents
Manabu
Ohta and Atsuhiro Takasu |
Abstract: Authors' names are a critical bibliographic
element when searching or browsing academic articles stored in
digital libraries. Therefore, those creating metadata for digital
libraries would appreciate an automatic method to extract such
bibliographic data from printed documents. In this paper, we
describe an automatic author name tagger for academic articles
scanned with optical character recognition (OCR) mark-up. The
method uses conditional random fields (CRF) for labeling the
unsegmented character strings in authors’ blocks as those
of either an author or a delimiter. We applied the tagger to
Japanese academic articles. The results of the experiments showed
that it correctly labeled more than 99% of the author name strings,
which compares favorably with the under 96% correct rate of our
previous tagger based on a hidden Markov model (HMM). |
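
A hedged, token-level sketch of the same labelling task using the
third-party sklearn-crfsuite package (not the authors' implementation,
which labels unsegmented character strings from OCR output):

# Label tokens in an author block as AUTHOR or DELIM with a linear-chain CRF.
import sklearn_crfsuite

def features(tokens, i):
    tok = tokens[i]
    return {"lower": tok.lower(), "is_comma": tok == ",", "is_and": tok.lower() == "and",
            "capitalised": tok[:1].isupper(), "prev": tokens[i - 1].lower() if i else "BOS"}

def to_features(token_seqs):
    return [[features(seq, i) for i in range(len(seq))] for seq in token_seqs]

train_tokens = [["Manabu", "Ohta", ",", "Atsuhiro", "Takasu"]]
train_labels = [["AUTHOR", "AUTHOR", "DELIM", "AUTHOR", "AUTHOR"]]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(to_features(train_tokens), train_labels)
print(crf.predict(to_features([["Emma", "Tonkin", "and", "Henk", "Muller"]])))
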
Automatic Information Extraction from 2-Dimensional Plots
in Digital Documents
William Brouwer, Saurabh Kataria, Sujatha
Das, Prasenjit Mitra and C. Lee Giles |
Abstract: Most search engines index the textual
content of documents in digital libraries. However, scholarly
articles often report important findings in figures. The contents
of the figures are not indexed. Often scholars need to search
for data reported in figures and process them. Therefore, searching
for data reported in figures and extracting them is an important
problem. To the best of our knowledge, there exists no tool to
automatically extract data from figures in digital documents.
If we can perform extraction tasks from these images automatically,
there is the potential for an end-user to query the data from
multiple digital documents simultaneously and efficiently. We
propose a framework of algorithms based on image analysis and
machine learning that can extract all information from 2-D plot
images and store them in a database. We show how to identify
2-D plot figures, how to segment the plots to extract the axes,
the legend and the data sections, how to extract the labels of
the axes, separate the data symbols from the text in the legend,
identify data points and segregate overlapping data points. We
also show that our algorithms can extract information from 2-D
plots accurately and scalably using a testbed of images available
from multiple real-life sources. |
A simple method for citation metadata extraction using hidden
Markov models
Erik Hetzner |
Abstract: This paper describes a simple method
for extracting metadata fields from citations using hidden Markov
models. The method is easy to implement and can achieve levels
of precision and recall for heterogeneous citations comparable
to other HMM-based methods. The method consists largely of string
manipulation and otherwise depends only on an implementation
of the Viterbi algorithm, which is widely available, and so can
be implemented by diverse digital library systems. |
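
Since the method depends only on string manipulation plus the Viterbi
algorithm, a compact generic Viterbi is sketched below; the states
would be citation fields such as author, title, venue and year, the
probability tables would come from training data, and the smoothing
floor is an assumption:

# Generic log-space Viterbi decoding for an HMM over citation tokens.
import math

def viterbi(observations, states, start_p, trans_p, emit_p, floor=1e-9):
    """start_p[s], trans_p[prev][s], emit_p[s][obs] are probabilities learned
    from tagged citations; unseen events fall back to a small floor value."""
    V = [{s: (math.log(start_p.get(s, floor))
              + math.log(emit_p[s].get(observations[0], floor)), None)
          for s in states}]
    for obs in observations[1:]:
        V.append({s: max(((V[-1][prev][0]
                           + math.log(trans_p[prev].get(s, floor))
                           + math.log(emit_p[s].get(obs, floor))), prev)
                         for prev in states)
                  for s in states})
    best = max(states, key=lambda s: V[-1][s][0])       # trace back the best path
    path = [best]
    for level in reversed(V[1:]):
        path.append(level[path[-1]][1])
    return list(reversed(path))
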
Identification of Time-Varying Objects on the Web
Satoshi
Oyama, Kenichi Shirasuna and Katsumi Tanaka |
Abstract: We have developed a method for determining
whether data found on the Web are for the same or different objects
that takes into account the possibility of changes in their attribute
values over time. Specifically, we estimate the probability that
observed data were generated for the same object that has undergone
changes in its attribute values over time and the probability
that the data are for different objects, and we define similarities
between observed data using these probabilities. By giving a
specific form to the distributions of time-varying attributes,
we can calculate the similarity between given data and identify
objects by using agglomerative clustering on the basis of the
similarity. Experiments in which we compared identification accuracies
between our proposed method and a method that regards all attribute
values as constant showed that the proposed method improves the
precision and recall of object identification. |
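
The clustering step can be pictured with the sketch below, which turns
a pairwise "probability of being the same object" into a distance and
applies agglomerative clustering via scipy; the time-aware probability
model, which is the paper's contribution, is only stubbed out here as
p_same:

# Agglomerative clustering of observations under a probabilistic similarity.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def cluster_observations(observations, p_same, threshold=0.5):
    """p_same(a, b) estimates the probability that a and b describe the same
    object, allowing for attribute values that change over time."""
    n = len(observations)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = 1.0 - p_same(observations[i], observations[j])
    labels = fcluster(linkage(squareform(dist), method="average"),
                      t=threshold, criterion="distance")
    return labels   # observations sharing a label are identified as one object
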
Using the Web
for Creating Publication Venue Authority Files
Denilson Alves Pereira, Berthier Ribeiro-Neto, Nivio
Ziviani and Alberto H. F. Laender |
Abstract: Citations to publication venues
in the form of journals, conferences, and workshops contain spelling
variants, acronyms, abbreviated forms and misspellings, all of
which make it more difficult to retrieve the item of interest. The
task of discovering and reconciling these variant forms of bibliographic
references is known as authority work. The key goal is to create
the so-called authority files, which maintain, for any given
bibliographic item, a list of variant labels (i.e., variant strings)
used as a reference to it. In this paper we propose to use the
Web to create high quality publication venue authority files.
Our idea is to recognize (and extract) references to publication
venues in the text snippets of the answers returned by a search
engine. References to a same publication venue are then reconciled
in an authority file. Each entry in this file is composed of
a canonical name for the venue, an acronym, the venue type (i.e.,
journal, conference, workshop) and a mapping to various forms
of writing its name. Experimental results show that our Web-based
approach for creating authority files is superior to previous
work based on straight string matching techniques. Considering
the average precision in finding correct venue canonical names,
we observe gains up to 41.7%. |
Application of Kalman Filters to Identify Unexpected Change
in Blogs
Paul Bogen, Joshua Johnston, Unmil Karadkar, Richard
Furuta and Frank Shipman |
Abstract: Information on the Internet, especially
blog content, changes rapidly. Users of information collections,
such as the blogs hosted by technorati.com, have little, if any,
control over the content or frequency of these changes. However,
it is important for users to be able to monitor content for deviations
in the expected pattern of change. If a user is interested in
political blogs and a blog switches subjects to a literary review
blog the user would want to know of this change in behavior.
Since pages may change too frequently for manual inspection for “unwanted” changes,
an automated approach is needed. In this paper, we explore methods
for identifying unexpected change by using Kalman filters to
model blog behavior over time. Using this model, we examine the
history of 77 blogs and determine methods for flagging the significance
of a blog's change from one time step to the next. We are able
to predict large deviations in blog content, and allow user defined
sensitivity parameters to tune a statistical threshold of significance
for deviation from expectation. |
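
A toy version of the flagging idea: track one scalar feature of a blog
over time (for example, the similarity of each new post to the blog's
history) with a one-dimensional Kalman filter and flag steps whose
innovation exceeds a few standard deviations; the noise parameters and
the choice of feature are assumptions, not the paper's:

# 1-D Kalman filter that flags time steps with unexpectedly large innovations.
import math

def flag_unexpected(series, q=1e-3, r=0.1, n_sigma=3.0):
    estimate, variance = series[0], 1.0
    flags = []
    for value in series[1:]:
        variance += q                              # predict
        innovation = value - estimate
        s = variance + r                           # innovation variance
        flags.append(abs(innovation) > n_sigma * math.sqrt(s))
        gain = variance / s                        # update
        estimate += gain * innovation
        variance *= (1 - gain)
    return flags                                   # True where change is "unexpected"
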
NCore: Architecture and Implementation of a Flexible, Collaborative
Digital Library
Dean Krafft, Aaron Birkland and Ellen Cramer |
Abstract: NCore is an open source architecture
and software platform for creating flexible, collaborative digital
libraries. NCore was developed by the National Science Digital
Library (NSDL) project, and it serves as the central technical
infrastructure for NSDL. NCore consists of a central Fedora-based
digital repository, a specific data model, an API, and a set
of backend services and frontend tools that create a new model
for collaborative, contributory digital libraries. This paper
describes NCore, presents and analyzes its architecture, tools
and services; and reports on the experience of NSDL in building
and operating a major digital library on it over the past year
and the experience of the Digital Library for Earth Systems Education
in porting their existing digital library and tools to the NCore
platform. |
Acceptance and Use of Electronic Library Services in Ugandan
University
Prisca Tibenderana |
Abstract: Libraries, as old as civilization itself, were
created to acquire, store, organise and provide access to information
services for those in need, albeit using manual operations. However,
with the information explosion and the arrival of new technologies,
libraries have opted to automate their operations and provide
services using digital technology. For electronic library services
to be utilized effectively, with special reference to developing
countries, end-users need to accept them. This study is an effort
to modify "The Unified Theory of Acceptance and Use of Technology" Model
to cater for electronic library services, addressing a recommendation
of the Venkatesh et al. (2003) study (conducted in the USA) that the
model be tested in a different setting (such as Uganda) and in
a different context. The study developed, tested and validated
a "Service Oriented Unified Theory of Acceptance and Use of Technology" (SOUTAUT)
Model. |
Portable Digital Libraries on an iPod: Beyond the client-server
model
David Bainbridge, Steve Jones, Sam McIntosh, Matt
Jones and Ian Witten |
Abstract: We have created an experimental
prototype that enhances an ordinary iPod personal music player
by adding digital library capabilities. It does not enable access
to a remote DL from a user’s PDA; rather, it runs a complete,
standard digital library server environment (Greenstone) on the
iPod. Being optimized for multimedia information, this platform
has truly vast storage capacity. It raises the possibility of
not just personal collections but entire institutional-scale
digital libraries that are fully portable. Our implementation
even allows the iPod to be configured as a web server to provide
digital library content over a network, inverting the standard
mobile client-server configuration—and incidentally providing
full-screen access.
Our system is not (yet) a practical implementation. Rather,
it is a proof of concept intended to stimulate thinking on
potential applications of a radically new DL configuration.
This paper describes the facilities we built, focusing on interface
issues and touching on the technical problems that were encountered
and solved. It attempts to convey a feeling for the kind of
issues that must be faced when adapting standard DL software
for non-standard, leading-edge devices. |
Annotated Program Examples as First Class Objects in an Educational
Digital Library
Peter Brusilovsky, Michael Yudelson and
I-Han Hsiao |
Abstract: The paper analyzes three major problems
encountered by our team as we endeavored to turn program examples
into highly reusable educational activities, which could be included
as first class objects in various educational digital libraries.
It also suggests three specific approaches to resolving these
problems, and reports on the evaluation of the suggested approaches.
Our successful experience presented in the paper demonstrates
how to make program examples self-sufficient, to provide students
with personalized guidance to the most appropriate examples,
and to increase the volume of annotated examples. |
Annotating Historical Archives of Images
Xiaoyue
Wang, Lexiang Ye, Eamonn Keogh and Christian Shelton |
Abstract: Recent initiatives like the Million
Book Project and Google Print Library Project have already archived
several million books in digital format, and within a few years
a significant fraction of world’s books will be online.
While the majority of the data will naturally be text, there
will also be tens of millions of pages of images. Many of these
images will defy automated annotation for the foreseeable future,
but a considerable fraction of the images may be amenable to automatic
annotation by algorithms that can link the historical image with
a modern contemporary, with its attendant metatags. In order
to perform this linking we must have a suitable distance measure
which appropriately combines the relevant features of shape,
color, texture and text. However the best combination of these
features will vary from application to application and even from
one manuscript to another. In this work we propose a simple technique
to learn the distance measure by perturbing the training set
in a principled way. We show the utility of our ideas on archives
of manuscripts containing images from natural history and cultural
artifacts. |
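
The combined measure itself is a weighted sum of the shape, colour,
texture and text distances; the sketch below learns the weights by a
plain grid search on labelled same/different pairs, which stands in
for (and is much cruder than) the perturbation-based learning
described above:

# Choose weights for a combined distance by grid search on labelled pairs.
import itertools
import numpy as np

def combined(weights, dists):
    return sum(w * d for w, d in zip(weights, dists))   # shape, colour, texture, text

def learn_weights(pairs, labels, step=0.25):
    """pairs[i]: per-feature distance vector for an image pair; labels[i]: 1 if
    the two images depict the same object, else 0."""
    grid = [w for w in itertools.product(np.arange(0, 1 + step, step), repeat=4)
            if abs(sum(w) - 1.0) < 1e-9]
    def accuracy(w):
        scores = [combined(w, d) for d in pairs]
        cut = np.median(scores)                          # crude decision threshold
        return np.mean([(s < cut) == bool(l) for s, l in zip(scores, labels)])
    return max(grid, key=accuracy)
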
sLab: Smart Labeling of Family
Photos Through an Interactive Interface
Ehsan Fazl-Ersi, I. Scott MacKenzie and John K.
Tsotsos |
Abstract: A novel technique for semi-automatic
photo annotation is proposed and evaluated. The technique, sLab,
uses face processing algorithms and a simplified user interface
for labeling family photos. A user study compared our system
with two others. One was Adobe Photoshop Elements. The other was
an inhouse implementation of a face clustering interface recently
proposed in the research community. Nine participants performed
an annotation task with each system on faces extracted from a
set of 150 images from their own family photo albums. As the
faces were all well known to participants, accuracy was near
perfect with all three systems. On annotation time, sLab was
25% faster than Photoshop Elements and 16% faster than the face
clustering interface. |
Autotagging to Improve Text Search for 3D Models
Corey
Goldfeder and Peter Allen |
Abstract: The most natural user interface
for searching libraries of 3D models is to use standard text
queries. However, text search on 3D models has traditionally
worked poorly, as text annotations on 3D models are often unreliable
or incomplete. In this paper we attempt to improve the recall
of text search by automatically assigning appropriate tags to
models. Our algorithm finds relevant tags by appealing to a large
corpus of partially labeled example models, which does not have
to be preclassified or otherwise prepared. For this purpose we
use a copy of Google 3D Warehouse, a library of user-contributed
models which is publicly available on the Internet. Given a model
to tag, we find geometrically similar models in the corpus,
based on distances in a reduced dimensional space derived from
Zernike descriptors. The labels of these neighbors are used as
tag candidates for the model with probabilities proportional
to the degree of geometric similarity. We show experimentally
that text based search for 3D models using our computed tags
can reproduce the power of geometry based search. Finally, we
demonstrate our 3D model search engine that uses this algorithm
and discuss some implementation issues. |
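
A sketch of the tag-transfer step, assuming the Zernike-based
descriptors have already been computed and reduced: find a query
model's nearest neighbours in descriptor space and score candidate
tags by similarity-weighted voting (the weighting scheme here is an
assumption):

# Transfer tags from geometrically similar corpus models to a query model.
from collections import defaultdict
import numpy as np
from sklearn.neighbors import NearestNeighbors

def autotag(query_desc, corpus_descs, corpus_tags, k=10):
    """corpus_descs: N x D array; corpus_tags[i]: list of tags on corpus model i."""
    nn = NearestNeighbors(n_neighbors=k).fit(corpus_descs)
    dists, idx = nn.kneighbors(np.asarray(query_desc).reshape(1, -1))
    scores = defaultdict(float)
    for d, i in zip(dists[0], idx[0]):
        weight = 1.0 / (1.0 + d)          # closer models contribute more
        for tag in corpus_tags[i]:
            scores[tag] += weight
    total = sum(scores.values()) or 1.0
    return {tag: s / total for tag, s in scores.items()}   # tag -> normalized score
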
Slide Image Retrieval: A Preliminary Study
Guo
Min Liew and Min-Yen Kan |
Abstract: We consider the task of automatic
slide image retrieval, in which slide images are ranked for relevance
against a textual query. Our implemented system, SLIDIR, caters
specifically to this task, using features designed
for synthetic images embedded within slide presentations. We show
promising results in both the ranking and binary relevance task
and analyze the contribution of different features in the task
performance. |
Perception-based Online News Extraction
Jinlin
Chen and Keli Xiao |
Abstract: A novel online news extraction approach
based on human perception is presented in this paper. The approach
simulates how a human perceives and identifies online news content.
It first detects news areas based on content function, space
continuity, and formatting continuity of news information. It
further identifies detailed news content based on the position,
format, and semantics of detected news areas. Experimental results
show that our approach achieves much better performance (on average
more than 99% in terms of F1 value) compared to previous approaches
such as Tree Edit Distance and Visual Wrapper based approaches.
Furthermore, our approach does not assume the existence of Web
templates in the tested Web pages as required by Tree Edit Distance
based approach, nor does it need training sets as required in
Visual Wrapper based approach. The success of our approach demonstrates
the strength of the perception based Web information extraction
methodology and represents a promising approach for automatic
information extraction from sources whose presentation is designed
for humans. |
Plato: a service-oriented decision support system for preservation
planning
Christoph Becker, Hannes Kulovits, Andreas Rauber
and Hans Hofman |
Abstract: The fast changes of technologies
in today's information landscape have considerably shortened
the lifespan of digital objects. Digital preservation has become
a pressing challenge. Different strategies such as migration
and emulation have been proposed; however, the decision for a
specific tool, e.g. for format migration or emulation, is very
complex. The process of evaluating potential solutions against
specific requirements and building a plan for preserving a given
set of objects is called preservation planning. So far, it is
a mainly manual, sometimes ad-hoc process with little or no tool
support. This paper presents a service-oriented architecture
and decision support tool that implements a solid preservation
planning process and integrates services for content characterisation,
preservation action and automatic object comparison to provide
maximum support for preservation planning endeavours. |
Usage Analysis of a Public Website Reconstruction Tool
Frank
McCown and Michael Nelson |
Abstract: The Web is increasingly the medium
by which information is published today, but due to its ephemeral
nature, web pages and sometimes entire websites are often "lost" due
to server crashes, viruses, hackers, run-ins with the law, bankruptcy
and loss of interest. When a website is lost and backups are
not available, an individual or third party can use Warrick to
recover the website from several search engine caches and web
archives (the Web Infrastructure). In this short paper, we present
Warrick usage data obtained from Brass, a queueing system for
Warrick hosted at Old Dominion University and made available
to the public for free. Over the last six months, 520 individuals
have reconstructed more than 700 websites with 800K resources
from the Web Infrastructure. Sixty-two percent of the static
web pages were recovered, and 42% of all the website resources
were recovered. The Internet Archive was the largest contributor
of recovered resources (78%). |
Using Web Metrics to Analyze Digital Libraries
Michael
Khoo, Joe Pagano, Anne Washington, Mimi Recker, Bart Palmer
and Robert Donahue |
Abstract: Web metrics tools and digital libraries
vary widely in form and function. Bringing the two together is
often not a straightforward exercise. This paper discusses the
use of web metrics in the Instructional Architect, the Library
of Congress, the National Science Digital Library, and WGBH Teachers'
Domain. We explore similarities and differences in the use of
web metrics across these libraries, and introduce a discussion
of an emerging focus of web metrics research, the analysis of
session time and page popularity. We conclude by discussing some
of the current limitations and future possibilities of using
web metrics to analyze and evaluate digital library use and impact. |
A Lightweight Metadata Quality Tool
David
Nichols, Chu-Hsiang Chan, David Bainbridge, Dana McKay and
Michael Twidale |
Abstract: We describe a Web-based metadata
quality tool that provides statistical descriptions and visualisations
of Dublin Core metadata harvested via the OAI protocol. The lightweight
nature of development allows it to be used to gather contextualized
requirements and some initial user feedback is discussed. |
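
A tiny example of the kind of statistic such a tool can report once
records are harvested: per-field completeness over Dublin Core
records, here represented simply as dicts of field name to values (the
record representation is an assumption):

# Per-field completeness percentages across harvested Dublin Core records.
from collections import Counter

DC_FIELDS = ["title", "creator", "subject", "description", "date", "identifier",
             "language", "rights"]

def completeness(records):
    present = Counter()
    for rec in records:
        for field in DC_FIELDS:
            if rec.get(field):
                present[field] += 1
    n = len(records) or 1
    return {field: round(100.0 * present[field] / n, 1) for field in DC_FIELDS}
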
Improving Navigation Interaction in Digital Documents
George
Buchanan and Tom Owen |
Abstract: This paper investigates novel interactions
for supporting within-document navigation. We first study navigation
more broadly through interviews with intensive users of document
reader applications. We then focus on a specific interaction:
the following of figure references. This interaction is used
to illuminate factors also found in other forms of navigation.
Several alternative interactions for supporting figure navigation
are described and evaluated through a user study. Experimentation
proves the advantages of our interaction design that can be applied
to other navigation needs. |
Keeping Narratives of a Desktop to Enhance Continuity of
On-going Tasks
Youngjoo Park and Richard Furuta |
Abstract: We describe a novel interface by
which a user can browse, bookmark and retrieve previously used
working environments, i.e., desktop status, enabling the retention
of the history of use of various sets of information. Significant
tasks often require reuse of (sets of) information that was used
earlier. Particularly, if a task involves extended interaction,
then the task’s environment has been through a lot of changes
and can get complex. Under the current prevailing desktop-based
computing environment, after an interruption to the task, users
receive little assistance in getting back to the context in which they
previously worked. A user thus encounters increased discontinuity
in continuing extended tasks. |
Note-Taking, Selecting, and Choice: Designing Interfaces
that Encourage Smaller Selections
Aaron Bauer and Kenneth
Koedinger |
Abstract: Our research evaluates the use of
copy-paste functionality in note-taking applications. While pasting
can be more efficient than typing, our studies indicate that
it reduces attention. An initial interface we designed to encourage
attention by reducing selection-size, which is negatively associated
with learning, was resisted by students and produced poor learning.
In this paper we present a design study intended to learn more
about how students interact with note-taking interfaces and develop
more user-friendly restrictions. We also report an experimental
evaluation of interfaces derived from this design study. While
we were able to produce interfaces that reduced selection size
and improved satisfaction, the new interfaces did not improve
learning. We suggest design recommendations derived from these
studies, and describe a “selecting-to-read” behavior
we encountered, which has implications for the design of reading
and note-taking applications. |
A Fedora Librarian Interface
David
Bainbridge and Ian Witten |
Abstract: The Fedora content management system
embodies a powerful and flexible digital object model. This paper
describes a new open-source software front-end that enables end-user
librarians to transfer documents and metadata in a variety of
formats into a Fedora repository. The main graphical facility
that Fedora itself provides for this task operates on one document
at a time and is not librarian-friendly. A batch driven alternative
is possible, but requires documents to be converted beforehand
into the XML format used by the repository, which requires
programming skills. In contrast, our new scheme allows arbitrary
collections of documents residing on the user’s computer
(or the web at large) to be ingested into a Fedora repository
in one operation, without a need for programming expertise. Provision
is also made for editing existing documents and metadata, and
adding new ones. The documents can be in a wide variety of different
formats, and the user interface is suitable for practicing librarians.
The design capitalizes on our experience in building the Greenstone
librarian interface and participating in dozens of workshops
with librarians worldwide. |