Diverse digital resources are commonly used by different types of users, who rely on different types of applications to access, exchange and use them. It is common practice to develop those applications with the requirements of a specific target category of users in mind. We envisaged and designed the IPSA archive and system using a similar approach: we identified a set of requirements of researchers in illuminated manuscripts, a target group of domain professional users. Domain professionals have used the IPSA system for some years as one of their tools to carry out scientific research. The consideration that the content of the archive managed by the IPSA system could be of interest to many different types of users led us to reconsider this approach and to envisage a new system, designed around the same archive of illuminated manuscripts, that makes the manuscripts accessible to diverse categories of users. The paper reports on the work that has been conducted to re-design and re-engineer the system to match the requirements and expectations of non-domain users.
Archive-It, a subscription service from the Internet Archive, allows users to create, maintain and view digital collections of web resources. The current interface of Archive-It is largely text-based, supporting drill-down navigation using lists of URIs. To provide an overview of each collection and highlight the collection's underlying characteristics, we present four alternate visualizations (image plot with histogram, wordle, bubble chart and timeline). The sites in an Archive-It collection may be organized by the collection curator into groups for easier navigation. However, many collections do not have such groupings, making them difficult to explore. We introduce a heuristics-based categorization for such collections.
Data are proliferating far faster than they can be captured, managed, or stored. What types of data are most likely to be used and reused, by whom, and for what purposes? Answers to these questions will inform information policy and the design of digital libraries. We report findings from semi-structured interviews and field observations to investigate characteristics of data use and reuse and how those characteristics vary within and between scientific communities. The two communities studied are the researchers at the Center for Embedded Network Sensing (CENS) and users of the Sloan Digital Sky Survey (SDSS) data. We found that the interactions between inquiry, data, and use fall into three categories: foreground vs. background, use of the same data for different actions, and sources of data for reuse. The data practices of CENS and SDSS researchers have implications for data curation, system evaluation, and policy. Some data that are important to the conduct of research are not viewed as sufficiently valuable to keep. Other data of great value may not be mentioned or cited, because those data serve only as background to a given investigation. Metrics to assess the value of documents do not map well to data.
Digital Preservation and Knowledge Discovery Based on Documents from an International Health Science Program
Dharitri Misra, Robert Hall, Susan Payne and George Thoma
Important biomedical information is often recorded, published or archived in unstructured or semi-structured textual form. Artificial intelligence and knowledge discovery techniques may be applied to large volumes of such data to identify and extract useful metadata, not only for providing access to these documents, but also for conducting analyses and uncovering patterns and trends in a field. The System for Preservation of Electronic Resources (SPER), an information management tool developed at the U.S. National Library of Medicine, provides these capabilities by integrating machine learning, data mining and digital preservation techniques. In this paper, we present an overview of SPER and its ability to retrieve information from one such dataset. We show how SPER was applied to the semi-structured records of an international health science program, the 46-year continuous archive of conference publications and related documents from the Joint Cholera Panels of the U.S.-Japan Cooperative Medical Science Program (CMSP). Metadata, extracted automatically from the document contents and stored in a relational database, were used to preserve, access, and analyze the documents to quantitatively describe the activity of a research community toward specific health science program goals. We describe the technical approach in detail, show how pertinent information was discovered from these datasets, and provide examples of its use for a preliminary study of a subset of CMSP activities to meet some original program goals.