The number of digital television channels is increasing, particularly free-to-air channels. However, due to the nature of broadcasting, this morass of information is, for the most part, not organized: it is principally a succession of images and sound transmitted as multiplexed streams of data. Compare this deluge that terrestrially bombards our homes with the information available in the digital libraries we access over the Internet, which is stored using software purpose-built to help organize carefully curated sets of documents. This project brings together these two seemingly incompatible concepts to develop a software environment that concurrently captures all the available live television channels (so a user does not need to proactively choose what to record) and segments them into files, which are then imported into a digital video library with a user interface designed to work from a multimedia remote control. A shifting time-based “window” of all recordings is maintained; we settled on the last two weeks so that the system is practicable to operate on a regular desktop PC. The system leverages the information contained in the electronic program guide and the video recordings to generate metadata suitable for the digital library. Manually entered metadata can optionally be assigned to enrich the content. By combining these two concepts, the developed home-based media center seeks to go beyond what is currently provided by digital TV set-top boxes. A user evaluation of the developed prototype showed a high level of participant satisfaction across a range of attributes, notably date-based searching.
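The shifting two-week window described above can be illustrated with a minimal Python sketch. It is not the project's actual implementation: the `Recording` class, the `prune_window` function, and the choice to expire a recording by its end time are all assumptions made for illustration. Only the two-week retention period comes from the abstract.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Two-week retention period, as stated in the abstract.
RETENTION = timedelta(weeks=2)


@dataclass
class Recording:
    """Hypothetical record for one captured program segment."""
    channel: str
    title: str
    start: datetime
    end: datetime


def prune_window(recordings, now, retention=RETENTION):
    """Split recordings into (kept, expired) relative to a sliding window.

    Assumption: a recording expires once its end time falls before
    `now - retention`; the real system may use a different rule.
    """
    cutoff = now - retention
    kept = [r for r in recordings if r.end >= cutoff]
    expired = [r for r in recordings if r.end < cutoff]
    return kept, expired
```

Under these assumptions, a capture service could call `prune_window` periodically, delete the files behind the expired entries, and remove them from the library index, so that storage stays bounded on a desktop PC.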
Digitized physical books offer access to tremendous amounts of knowledge, even for people with print-related disabilities. Various projects and standardization activities are underway to make all of our past and present books accessible. However, digitizing books requires extensive human effort, such as correcting the results of OCR (optical character recognition) and adding structural information such as headings. Some Asian languages require extra effort to correct OCR errors because of their large and varied character sets: Japanese has used more than 10,000 characters, compared with a few hundred in English. This heavy workload is inhibiting the creation of accessible digital books. To facilitate digitization, we are developing a new system for processing physical books. We reduce and disperse the human effort and accelerate conversion by combining automatic inference with human capabilities. Our system preserves the original page images throughout the digitization process to support gradual refinement, and it distributes the work as micro-tasks. We conducted trials with the Japanese National Diet Library (NDL) to estimate the effort required to digitize all of the archived documents in the NDL. The results showed that old Japanese books pose specific problems for correcting OCR errors and adding structure. Drawing on our results, we discuss further workload reductions and future directions for international digitization systems.
In this paper, we present IPKB, an effort to digitize and share the Treatise on Invertebrate Paleontology, which is the most authoritative compilation of data on invertebrate fossils. Unfortunately, the PDF version of the Treatise is simply a clone of the paper publications, and its content is in no way organized to facilitate search and knowledge discovery. We extracted text and images from the Treatise, stored them in a database, and built a system for efficient browsing and searching. For image processing, fossil photos are segmented from figures, and embedded labels are identified, recognized, and linked with the corresponding genus records. The detailed information for each genus, including fossil images, is delivered to users through the web access module. External information (e.g., Google Earth) is acquired through web service APIs to help with presentation and improve the user experience.
In this paper, we report our initial accomplishments in utilizing the information from the Treatise. Given the rich information in the Treatise, analyzing, modeling, and understanding paleontological data are significant in many areas, such as understanding evolution, understanding climate change, and finding fossil fuels. IPKB builds a general framework that aims to facilitate knowledge discovery activities in invertebrate paleontology, and it provides a solid foundation for future explorations.