The PHD UNS digital library, which was developed using OpenDLT, is integrated with the CRIS UNS system, and the digital library can be searched from CRIS UNS.
Public access to theses and dissertations via the Internet is important for the development of a knowledge-based society. A knowledge-based society relies on the knowledge of its citizens to drive entrepreneurship, innovation, and vitality of that society’s economy. A knowledge-based society possesses a community of scholars, researchers, research networks, engineers, technicians, and businesses engaged in research and the production of high-technology goods and provision of services. It forms a national innovation and production system, which is integrated into international networks of knowledge production. Its communication and information technological tools make vast amounts of human knowledge easily accessible.
One approach to achieving a knowledge-based society is to deposit electronic dissertations and theses (ETDs) in a freely accessible digital repository. Assigning appropriate metadata to ETDs can improve discoverability by increasing their visibility. The importance of the visibility of scientific-research results for the further development of science is discussed in many scientific-research manuscripts. Furthermore, the visibility of ETDs can be increased by putting the digital object or its descriptive metadata (or both) into systems containing theses and dissertations, such as digital libraries, research management systems, institutional repositories (IRs), the Networked Digital Library of Theses and Dissertations (NDLTD), DART-Europe E-thesis portal, Digital Repository Infrastructure Vision for European Research (DRIVER), and others. On one hand, metadata about scientific-research results can be entered separately in all those Internet-based systems by researchers or by librarians. This is a hard and error-prone job. On the other hand, metadata about scientific-research results can be entered in one system and exported to other systems. This approach contributes to:
The goal of the OpenDLT digital library, developed at the University of Novi Sad in accordance with CERIF, DC, ETD-MS, and OAI-PMH, is to avoid or reduce duplicated inputs on the two platforms and to increase metadata quality, reliability, and reusability. The OpenDLT architecture enables easy integration with library information systems, which are based on the MARC 21 format and can also hold metadata about ETDs.
The first step in this project was an analysis of various systems that contain metadata about theses and dissertations.
The following are international initiatives:
• NDLTD is an international organization that aims to create a worldwide network of ETDs. Each digital repository that is a network member has to enable metadata exchange in the ETD-MS format (developed by NDLTD) in accordance with OAI-PMH.
• DART-Europe E-Thesis Portal aims to collect details of the open access research theses stored in Europe’s digital repositories (doctoral and master theses). It collects metadata in DC using OAI-PMH.
• DRIVER is an international organization co-funded by the European Commission with the goal of creating a network of freely accessible digital repositories with content across all academic disciplines. Each digital repository that is a network member has to enable metadata exchange in DC in accordance with the OAI-PMH protocol.
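The three initiatives above all rely on OAI-PMH harvesting. As a minimal sketch of what a harvester's request looks like, the following builds a ListRecords URL. The verb and the parameter names (metadataPrefix, set, from, until) come from the OAI-PMH specification; the base URL and the set name "phd" are hypothetical placeholders and do not describe an actual OpenDLT deployment.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None,
                     from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL for selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    # The optional arguments restrict the harvest to one set and/or a date range.
    if set_spec:
        params["set"] = set_spec
    if from_date:
        params["from"] = from_date
    if until_date:
        params["until"] = until_date
    return base_url + "?" + urlencode(params)


# Hypothetical repository address and set name, for illustration only.
url = list_records_url("https://repository.example.org/oai", "oai_dc",
                       set_spec="phd", from_date="2012-01-01")
print(url)
```

The prefix oai_dc is the Dublin Core format that every OAI-PMH repository must support; the prefix under which a repository exposes ETD-MS is chosen by that repository and would have to be looked up with a ListMetadataFormats request.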
In addition, many academic and research institutions and research communities may implement and manage the following approaches to collecting, preserving, accessing, and disseminating research:
• IRs are online systems that collect, preserve, and disseminate an institution's intellectual output in digital form. IRs may use open-source software, such as EPrints, DSpace and Fedora, or hosted, proprietary software, such as Digital Commons and SimpleDL. Many IRs support the exchange of data in DC via OAI-PMH.
• A current research information system (CRIS) is a database or other information system for storing data on current research (e.g., data about institutions, researchers, research projects, equipment, published results, etc.). The European Union encourages the development of national research management systems in accordance with the CERIF standard. CERIF-compatible research management systems are called CRIS. Due to specific local or national requirements, CRIS systems are built on different modifications (or extensions) of the CERIF data model.
• A Library Information System (LIS) is a software system for acquiring, cataloging, and circulating library holdings. LIS are built on various bibliographic standards; most are based on MARC 21 formats.
Across these systems, different standards and protocols (CERIF, OAI-PMH, DC, ETD-MS, and MARC) enable interoperability. After the analysis was completed, a comprehensive metadata set was defined to develop a repository that is compatible with all previously mentioned systems. An object-oriented method was used for the module modeling. Object-oriented modeling creates models using object-oriented diagrams (class diagram, sequence diagram, etc.), which are the starting point for implementing a system in an object-oriented programming language. The modeling was carried out using the Sybase PowerDesigner tool, which supports OMG’s Unified Modeling Language (UML) 2.0. The module model can be obtained by contacting the OpenDLT team members via email firstname.lastname@example.org. The implementation was realized using “best-of-breed” open-source components written in Java.
After analysis of various systems that contain metadata about theses and dissertations (NDLTD, DART-Europe E-thesis portal, DRIVER, IRs, CRISs, LIS), a comprehensive metadata set was defined to create a repository that is compatible with various ETD systems. Table 1 presents the list of metadata elements selected for OpenDLT and indicates their presence or absence in CERIF, DC, and ETD-MS. The set of metadata about ETDs adopted for the OpenDLT software system unites the metadata sets prescribed by the CERIF, DC, and ETD-MS formats, extended by metadata that are used in the MARC 21 format and metadata for ETDs prescribed by the University of Novi Sad.
Table 1. Metadata about theses and dissertations adopted for the OpenDLT software system
As already stated, the OpenDLT data model holds data about scientific research in MARC 21 format. MARC 21 records are stored using an attribute of the MARC 21 record entity that holds a string representing a MARC 21 record serialized according to the International Organization for Standardization (ISO) 2709 standard, which sets out the format for information exchange. Upon serializing the MARC 21 record into an ISO 2709 string, the record is stored in the database and its contents are indexed using the Apache Lucene information retrieval library. MARC 21 records can be classified using the entity MARC21Record_Class: master thesis, PhD dissertation, and so on. That entity can also be used to define the scientific field and scientific discipline of the research, such as mathematics, computer sciences, biology, information systems, and artificial intelligence. Using that entity, records can be divided into sets, and the OAI-PMH “ListRecords” requirement, which mandates the ability to download only records that belong to a defined set, can be met. The MARC21Record entity also contains the attributes creator, dateOfCreation, modifier, and dateOfLastModification. The date of creation and the date of the last modification are necessary to meet all requirements prescribed by the OAI-PMH protocol; the OAI-PMH ListRecords request must be able to download only records that were processed in a certain period. Furthermore, the data model contains the File_Storage entity, which is intended to hold data related to the digital form of theses or dissertations. Each instance of the File_Storage entity is connected to an instance of the MARC21Record entity that holds bibliographic metadata about the thesis or dissertation. The File_Storage entity also contains the following attributes: uploader, fileName, mime, and length. The uploader attribute holds the e-mail address of the user who uploaded the digital content.
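To illustrate why the record classes and the modification dates matter for OAI-PMH, the following minimal sketch filters records by set and by date range, as a ListRecords request may ask. The Marc21Record class here is a simplified, hypothetical stand-in for the entity of the same name (only the attributes needed for the filter are kept), and select_records is an assumed helper, not part of OpenDLT.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Marc21Record:
    """Simplified stand-in for the MARC21Record entity (hypothetical)."""
    record_id: int
    record_classes: List[str]        # e.g. "phd", "master", "mathematics"
    date_of_creation: date
    date_of_last_modification: date


def select_records(records: List[Marc21Record],
                   set_spec: Optional[str] = None,
                   from_date: Optional[date] = None,
                   until_date: Optional[date] = None) -> List[Marc21Record]:
    """Return records in the given set and modified within [from_date, until_date]."""
    selected = []
    for record in records:
        if set_spec is not None and set_spec not in record.record_classes:
            continue  # record does not belong to the requested set
        if from_date is not None and record.date_of_last_modification < from_date:
            continue  # last modified before the requested period
        if until_date is not None and record.date_of_last_modification > until_date:
            continue  # last modified after the requested period
        selected.append(record)
    return selected


# Illustrative data only.
records = [
    Marc21Record(1, ["phd"], date(2011, 5, 1), date(2012, 3, 10)),
    Marc21Record(2, ["master"], date(2012, 1, 15), date(2012, 1, 15)),
    Marc21Record(3, ["phd", "mathematics"], date(2010, 9, 1), date(2010, 9, 1)),
]
recent_phd = select_records(records, set_spec="phd", from_date=date(2012, 1, 1))
```

Both date bounds are treated as inclusive, which matches how OAI-PMH defines the from and until arguments.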
The attributes fileName, mime, and length store metadata describing the digital content that is stored in a folder of the file system of the OpenDLT server. The folder is not directly accessible through the Internet, but digital contents can be downloaded using a Java Servlet. In this way, access to digital content is controlled, i.e., the Java Servlet controls who can download digital content. Table 2 shows mappings of the adopted metadata about theses and dissertations shown in Table 1 to the extended OpenDLT data model. The first column holds the names of the metadata and the second column holds their location in the MARC 21 bibliographic record. The first three characters of a location denote the field code; the next two characters denote the first and the second indicator, respectively; and the last character denotes the subfield code. The character “#” indicates that the indicator is not defined. The last column shows notes about the metadata and how they are stored.
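The location notation described above (three characters for the field, two for the indicators, one for the subfield) can be decoded mechanically. The following is a minimal sketch under that reading; the names MarcLocation and parse_location are hypothetical, and the sample locations are illustrative and not copied from Table 2.

```python
from typing import NamedTuple, Optional


class MarcLocation(NamedTuple):
    field: str                  # three-character field code, e.g. "245"
    indicator1: Optional[str]   # None when the notation shows "#"
    indicator2: Optional[str]   # None when the notation shows "#"
    subfield: str               # one-character subfield code


def parse_location(location: str) -> MarcLocation:
    """Split a six-character location such as '245##a' into its parts."""
    if len(location) != 6:
        raise ValueError("expected 3 field + 2 indicator + 1 subfield characters")

    def indicator(ch: str) -> Optional[str]:
        # "#" marks an indicator that is not defined.
        return None if ch == "#" else ch

    return MarcLocation(
        field=location[:3],
        indicator1=indicator(location[3]),
        indicator2=indicator(location[4]),
        subfield=location[5],
    )


title_location = parse_location("245##a")
```

A converter could use the parsed tuple as a lookup key when reading a value out of a MARC 21 record or writing one into it.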
Table 2. Mappings of metadata to the data model
The OpenDLT members identified the basic information requirements of this digital library as the following:
• Uploading ETDs. The system supports pdf, doc, docx, and odt file formats. Furthermore, the system has to back up files and provide long-term preservation of those files.
• Migrating existing data from various sources, i.e., implementation of a scalable and open-architecture importer module. The module's software architecture should be extensible with plugins for importing theses and dissertations from various sources in various formats. The module should import data through an interactive process in which the user consolidates the data. Moreover, importing data through an interactive user interface could enable the creation of a database of unique authority records about authors, mentors, committee members, and institutions where theses and dissertations have been defended.
• Entering all metadata about ETDs that the CERIF standard prescribes and all metadata that are necessary for exchange in accordance with the OAI-PMH protocol within NDLTD. The user interface has to be as simple as possible so that it can be used by users without knowledge of standards and protocols.
• Exchanging metadata about ETDs with other CRIS systems. In this way, researchers from European countries using national CRIS systems can find ETDs from the system implemented using OpenDLT.
• Exchanging metadata about ETDs in accordance with the OAI-PMH protocol. In this way, theses and dissertations from the system implemented using OpenDLT can be visible through various IRs as well as through web applications for searching the NDLTD Union Catalogue or DART-Europe Theses portal.
• Searching ETDs using the web forms of the OpenDLT digital library, as well as remote searching from other systems via the SRU protocol.
• A multilingual user interface that can be easily translated into a new language.
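The plugin-based importer named in the requirements above can be pictured as a small registry of format-specific readers. The following is a minimal sketch of that idea only. All names (ThesisImporter, CsvImporter, register, import_records) and the CSV column names are hypothetical; the actual OpenDLT importer is written in Java and its interfaces are not described here.

```python
import csv
import io
from abc import ABC, abstractmethod
from typing import Dict, Iterable, List


class ThesisImporter(ABC):
    """Plugin contract: turn raw source data into metadata dictionaries."""
    format_name: str

    @abstractmethod
    def read(self, raw: str) -> Iterable[Dict[str, str]]:
        ...


class CsvImporter(ThesisImporter):
    """Example plugin that reads one thesis per CSV row."""
    format_name = "csv"

    def read(self, raw: str) -> Iterable[Dict[str, str]]:
        return [dict(row) for row in csv.DictReader(io.StringIO(raw))]


# Registry of available plugins, keyed by the format they handle.
IMPORTERS: Dict[str, ThesisImporter] = {}


def register(importer: ThesisImporter) -> None:
    IMPORTERS[importer.format_name] = importer


def import_records(format_name: str, raw: str) -> List[Dict[str, str]]:
    """Dispatch to the plugin registered for the given format."""
    if format_name not in IMPORTERS:
        raise ValueError(f"no importer registered for format: {format_name}")
    return list(IMPORTERS[format_name].read(raw))


register(CsvImporter())
imported = import_records("csv", "title,author\nSample thesis,Jane Doe\n")
```

Supporting a new source then means adding one class and one register call, without touching the dispatch code. The interactive consolidation step would operate on the returned dictionaries.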
OpenDLT has an open architecture, which enables easy extension with new features. It can be easily integrated into a complete scientific-research information system or with an existing MARC 21-based library information system. The application for cataloguing published results in the MARC 21 format was implemented in a multi-tiered client-server architecture on the Java platform. A UML deployment diagram for this application is shown in Figure 1.
Figure 1. The software architecture of OpenDLT
OAI-PMH service provider: Also, the client side of OpenDLT can be a system that implements the client side of the OAI-PMH protocol.
Apache Tomcat: The server side of the digital library can be executed within the Apache Tomcat application server or some other server supporting Java Servlet technology.
Interface module: The user interface implementation is based on the JSF development environment. Unlike other development environments based on the model-view-controller model, JSF is used for component-based, event-driven web application development. JSF is increasingly used in combination with AJAX technology. By adding AJAX, the user interface can be richer, and JSF takes care that the problems with AJAX within the web browser are minimized. For the implementation of the application described in this paper, we used the RichFaces library of JSF components based on AJAX.
Format converter: This component transforms records between various formats: DTO, MARC 21, Dublin Core, ETD-MS, and CERIF. A data transfer object (DTO) is used for data transport between application components. A DTO has a set of attributes and accessor/mutator methods for these attributes. Transformations to the MARC 21, Dublin Core, ETD-MS, and CERIF formats are implemented in accordance with Table 1 and Table 2. Those formats are used to import and export records.
Import data: This component imports data about theses and dissertations from various data sources. The component imports data through an interactive process in which the user consolidates the data.
OAI-PMH data provider: This component implements the server side of the OAI-PMH protocol. It enables export via the OAI-PMH protocol in Dublin Core, MARC 21, and ETD-MS formats.
IR server: For indexing and searching text contents the Apache Lucene information retrieval (IR) library is used. Apache Lucene is an open source text searching engine written in Java.
DB access: JDBC is used for database access.
File server: This component implements storing and downloading ETDs. ETDs are stored in the server file system. It enables storing any file format, but can be configured to accept only file formats belonging to a given set (for instance, to accept only pdf, doc, and docx files).
MySQL DBMS: MySQL, or any other relational DBMS that provides a JDBC connector, can be used as the database management system.
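As an illustration of the format converter component described above, the following minimal sketch turns a small DTO into an oai_dc Dublin Core record. The two namespace URIs are the standard ones for oai_dc and Dublin Core elements. The ThesisDTO fields and their mapping to DC elements are assumptions made for this example and do not reproduce Table 1 or Table 2.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import List

OAI_DC_NS = "http://www.openarchives.org/OAI/2.0/oai_dc/"
DC_NS = "http://purl.org/dc/elements/1.1/"


@dataclass
class ThesisDTO:
    """Hypothetical transfer object carrying a few thesis attributes."""
    title: str
    author: str
    defence_year: str
    keywords: List[str] = field(default_factory=list)


def to_dublin_core(dto: ThesisDTO) -> ET.Element:
    """Map the DTO onto an oai_dc:dc element (assumed mapping)."""
    ET.register_namespace("oai_dc", OAI_DC_NS)
    ET.register_namespace("dc", DC_NS)
    root = ET.Element(f"{{{OAI_DC_NS}}}dc")

    def add(name: str, value: str) -> None:
        ET.SubElement(root, f"{{{DC_NS}}}{name}").text = value

    add("title", dto.title)
    add("creator", dto.author)
    add("date", dto.defence_year)
    for keyword in dto.keywords:
        add("subject", keyword)  # one dc:subject element per keyword
    return root


dto = ThesisDTO("Sample thesis", "Jane Doe", "2012", ["digital libraries", "metadata"])
dc_record = to_dublin_core(dto)
dc_xml = ET.tostring(dc_record, encoding="unicode")
```

Converters for the other target formats would follow the same pattern: one function per format, each taking the same DTO as input.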
The digital library of theses and dissertations is a web application implemented using the Java platform and a set of open-source libraries written in Java. The form for input of metadata is shown in Figure 2. Translations of multilingual metadata can be entered using this form by clicking on the boxes to the right (e.g., Title translations, Subtitle translations, and so on). Because some metadata are multilingual, information retrieval measures (precision, recall, and F-measure) are improved, i.e., the visibility of ETDs is increased. Furthermore, the visibility of ETDs can be improved by using fuzzy search, which is enabled through the Apache Lucene library. Fuzzy search retrieves all theses and dissertations that meet a set of criteria that define similarity. For example, similarity criteria for two strings (a string from a query and a string from a thesis or dissertation title stored in the OpenDLT database) can be defined as follows:
• Each word in one string does not differ by more than two letters from a word in another string.
• If one string contains more than five words, the previous criterion is satisfied for at least 80 percent of the words.
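The two example criteria above can be sketched directly in code. This is an illustration of the criteria as stated, not of how Apache Lucene implements fuzzy matching. It assumes that "differ by more than two letters" means a Levenshtein edit distance greater than two, and that the word count is taken from the query string; both are interpretations, and the function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def is_similar(query: str, title: str, max_diff: int = 2,
               long_threshold: int = 5, min_ratio: float = 0.8) -> bool:
    """Apply the two similarity criteria to a query and a stored title."""
    query_words = query.lower().split()
    title_words = title.lower().split()
    if not query_words or not title_words:
        return False
    # Count query words that are within max_diff edits of some title word.
    matched = sum(
        1 for qw in query_words
        if any(edit_distance(qw, tw) <= max_diff for tw in title_words)
    )
    if len(query_words) > long_threshold:
        # Longer queries only need 80 percent of their words to match.
        return matched / len(query_words) >= min_ratio
    return matched == len(query_words)


typo_match = is_similar("digitl libary", "Digital library of theses")
```

Here the misspelled query still matches the title because each query word is one edit away from a title word.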
Figure 2. Input of metadata
Data migration from a data source is controlled by a user. Whenever it is unclear whether the imported object already exists in the OpenDLT database, the import module provides a list of similar objects using the dialog shown in Figure 3.
Figure 3. Similar records
The user who started the import has to decide what should be done with the imported object’s metadata. The module provides the following options:
If the user selects to merge the data, the form shown in Figure 4 is opened. On this form, the user can see the imported object’s metadata (within the input fields) as well as the metadata of the existing object in the OpenDLT system database (messages next to the images) that should be merged with the imported object.
Figure 4. Records merging
Searching has three distinct modes, which can be opened by selecting one of the following options: Dissertations, Authors and board members, and Search based on query language (Figure 5). Selecting the option Dissertations opens a form for making complex queries using the elements of the application interface.
Selecting the option Authors and board members opens a form for searching the database of researchers (dissertation authors, advisors, and board members) by first and last name. For each retrieved researcher, beside the column containing basic personal data about the researcher (first and last name, affiliation, position, title), there are also a link to metadata about her/his dissertation and links to metadata about dissertations where she/he was advisor, board president, or board member.
Selecting the option Search based on query language (Figure 5) opens a form for making a Lucene query. The syntax of the Lucene query language is available at http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html, and the list of fields available for searching is shown on the form for making a Lucene query.
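As a rough illustration of the query language mode, the following minimal sketch assembles a query string in Lucene's classic query parser syntax (field:term, quoted phrases, the ~ fuzzy operator, and AND/OR). The field names title and author are hypothetical; the real field names are the ones listed on the OpenDLT search form. The helper functions are assumptions made for this example.

```python
# Characters with special meaning in the Lucene query parser syntax.
LUCENE_SPECIAL = set('+-&|!(){}[]^"~*?:\\')


def escape_term(text: str) -> str:
    """Backslash-escape characters that the query parser treats as operators."""
    return "".join("\\" + ch if ch in LUCENE_SPECIAL else ch for ch in text)


def field_clause(field: str, value: str, fuzzy: bool = False) -> str:
    """Build one field clause; multi-word values become quoted phrases."""
    escaped = escape_term(value)
    if " " in value:
        return f'{field}:"{escaped}"'
    return f"{field}:{escaped}~" if fuzzy else f"{field}:{escaped}"


def build_query(clauses, operator: str = "AND") -> str:
    """Join clauses with a boolean operator (AND or OR)."""
    return f" {operator} ".join(clauses)


# Hypothetical field names, for illustration only.
query = build_query([
    field_clause("title", "digital library"),
    field_clause("author", "smith", fuzzy=True),
])
print(query)
```

Escaping matters when a user-supplied value contains characters such as a colon or a plus sign, which the parser would otherwise interpret as operators.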
Figure 5. Search