PHD UNS - Digital library of PhD dissertations

About

The PHD UNS digital library is integrated with the CRIS UNS system, and the digital library can be searched through CRIS UNS.

PHD UNS was developed using the Open Source Digital Library of Theses and Dissertations (openDLT).

Public access to theses and PhD dissertations via the Internet is important for the development of a knowledge-based society. A knowledge-based society relies on the knowledge of its citizens to drive entrepreneurship, innovation, and vitality of that society’s economy. A knowledge-based society possesses a community of scholars, researchers, research networks, engineers, technicians, and businesses engaged in research and the production of high-technology goods and provision of services. It forms a national innovation and production system, which is integrated into international networks of knowledge production. Its communication and information technological tools make vast amounts of human knowledge easily accessible.

One approach to achieving a knowledge-based society is to deposit electronic theses and PhD dissertations (ETDs) in a freely accessible digital repository. Assigning appropriate metadata to ETDs improves their discoverability and visibility. The importance of the visibility of scientific research results for the further development of science is discussed in many research papers. Furthermore, the visibility of ETDs can be increased by putting the digital object or its descriptive metadata (or both) into systems containing theses and PhD dissertations, such as digital libraries, research management systems, institutional repositories (IRs), the Networked Digital Library of Theses and Dissertations (NDLTD), the DART-Europe E-theses Portal, Digital Repository Infrastructure Vision for European Research (DRIVER), and others. On one hand, metadata about scientific research results can be entered separately in each of those Internet-based systems by researchers or librarians. This is a hard and error-prone job. On the other hand, metadata about scientific research results can be entered in one system and exported to the others. This approach contributes to:

  1. Avoiding duplicated inputs across platforms,
  2. Increasing metadata quality, reliability, and reusability.

The goal of the PHD UNS digital library, developed at the University of Novi Sad in accordance with the Common European Research Information Format (CERIF), Dublin Core (DC), ETD-MS, and the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH), is to avoid or reduce duplicated inputs across platforms and to increase metadata quality, reliability, and reusability. The PHD UNS architecture enables easy integration with library information systems, which are based on the MARC 21 format and can also hold metadata about ETDs.

Research and development method

The first step in this project was an analysis of various systems that contain metadata about theses and dissertations.

The following are international initiatives:

• NDLTD is an international organization that aims to create a worldwide network of ETDs. Each digital repository that is a network member has to enable metadata exchange in the ETD-MS format (developed by NDLTD) in accordance with OAI-PMH.

• The DART-Europe E-theses Portal aims to collect details of the open access research theses (doctoral and master theses) stored in Europe’s digital repositories. It collects metadata in DC using OAI-PMH.

• DRIVER is an international organization co-funded by the European Commission with the goal of creating a network of freely accessible digital repositories with content across all academic disciplines. Each digital repository that is a network member has to enable metadata exchange in DC in accordance with the OAI-PMH protocol.

In addition, many academic and research institutions and research communities may implement and manage the following approaches to collecting, preserving, accessing, and disseminating research:

• IRs are online systems that collect, preserve, and disseminate the intellectual output of an institution in digital form. IRs may use open-source software, such as EPrints, DSpace, and Fedora, or hosted, proprietary software, such as Digital Commons and SimpleDL. Many IRs support the exchange of data in DC via OAI-PMH.

• A CRIS (current research information system) is a database or other information system for storing data on current research (e.g., data about institutions, researchers, research projects, equipment, and published results). The European Union encourages the development of national research management systems in accordance with the CERIF standard. CERIF-compatible research management systems are called CRISs. Due to specific local or national requirements, CRISs are built on different modifications (or extensions) of the CERIF data model.

• A Library Information System (LIS) is a software system for acquiring, cataloging, and circulating library holdings. LIS are built on various bibliographic standards; most are based on MARC 21 formats.

Across these systems, different standards and protocols (CERIF, OAI-PMH, DC, ETD-MS, and MARC) enable interoperability. After the analysis was completed, a comprehensive metadata set was defined in order to develop a repository that is compatible with all of the previously mentioned systems. An object-oriented method was used to model the modules. Object-oriented modeling creates models using object-oriented diagrams (class diagrams, sequence diagrams, etc.), which are the starting point for implementing a system in an object-oriented programming language. The modeling was carried out using the Sybase PowerDesigner tool, which supports OMG’s Unified Modeling Language (UML) 2.0. The module model can be obtained by contacting the DOSIRD UNS team members via email at dosird@uns.ac.rs. The implementation was realized using “best-of-breed” open-source components written in Java.

Data model

After the analysis of various systems that contain metadata about theses and PhD dissertations (NDLTD, DART-Europe E-theses Portal, DRIVER, IRs, CRISs, LISs), a comprehensive metadata set was defined to create a repository that is compatible with various ETD systems. Table 1 lists the metadata elements selected for PHD UNS and indicates their presence (+) or absence (-) in CERIF, DC, and ETD-MS. The set of metadata about ETDs adopted for the PHD UNS digital library unites the metadata sets prescribed by the CERIF, DC, and ETD-MS formats, extended by metadata that are used in the MARC 21 format and metadata for ETDs prescribed by the University of Novi Sad.

Table 1. Metadata about theses and PhD dissertations adopted for the PHD UNS digital library

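| PHD UNS | CERIF | Dublin Core | ETD-MS |
|---|---|---|---|
| author | + | + | + |
| advisor | - | - | + |
| chair | - | - | + |
| committee member | - | - | + |
| title | + | + | + |
| alternative title | - | - | + |
| subtitle | + | - | - |
| keywords | + | + | + |
| abstract | + | + | + |
| extended abstract | - | - | - |
| note | + | - | + |
| language | - | + | + |
| ISBN | + | - | - |
| physical description | + | - | - |
| UDC | - | - | - |
| publisher | + | + | + |
| publication date | + | + | + |
| record type | - | + | + |
| content format | - | + | + |
| URI | + | + | + |
| access rights | - | + | + |
| thesis type | + | - | - |
| name of author degree after defense | - | - | + |
| level of education | - | - | + |
| scientific field | - | - | + |
| scientific discipline | - | - | - |
| accepted by competent scientific institution on | - | - | - |
| institution | + | + | + |
| defended on | - | - | - |
| holding data | - | - | - |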

As already stated, the PHD UNS data model holds data about scientific research in the MARC 21 format. Each record is kept in an attribute of the MARC21Record entity as a string serialized according to ISO 2709, the International Organization for Standardization (ISO) standard that sets out a format for information exchange. Once a MARC 21 record has been serialized to an ISO 2709 string, it is stored in the database and its contents are indexed using the Apache Lucene information retrieval library.

MARC 21 records can be classified using the MARC21Record_Class entity, for example as a master thesis or a PhD dissertation. The same entity can be used to define the scientific field and scientific discipline of the research, such as mathematics, computer science, biology, information systems, or artificial intelligence. Using that entity, records can be divided into sets, which meets the OAI-PMH “ListRecords” requirement that it be possible to download only the records that belong to a defined set. The MARC21Record entity also contains the attributes creator, dateOfCreation, modifier, and dateOfLastModification. The date of creation and the date of the last modification are necessary to meet the requirements of the OAI-PMH protocol: a ListRecords request must be able to download only the records that were processed in a certain period.

Furthermore, the data model contains the File_Storage entity, which holds data related to the digital form of a thesis or PhD dissertation. Each instance of the File_Storage entity is connected to the instance of the MARC21Record entity that holds the bibliographic metadata about that thesis or PhD dissertation. The File_Storage entity contains the attributes uploader, fileName, mime, and length. The uploader attribute holds the e-mail address of the user who uploaded the digital content. The attributes fileName, mime, and length describe the digital content itself, which is stored in a folder in the file system of the PHD UNS server. The folder is not directly accessible through the Internet, but digital content can be downloaded using a Java Servlet. In this way, access to digital content is controlled, i.e., the Java Servlet controls who can download it.

Table 2 shows how the adopted metadata about theses and PhD dissertations listed in Table 1 map to the extended PHD UNS data model. The first column holds the name of the metadata element and the second column holds its location in the MARC 21 bibliographic record. In each location, the first three characters are the field code, the next two characters are the first and the second indicator, respectively, and the last character is the subfield code. The character “#” indicates that an indicator is not defined. The last column gives notes about the metadata and how they are stored.

Table 2. Mappings of metadata to the data model

| Metadata | MARC 21 | Note |
|---|---|---|
| author | 100 1# a | All data about authors, advisors, chairs, and committee members are stored in MARC 21 authority records. The relation between a thesis or dissertation and an authority record is established using subfield 0 of data field 100/700 of the MARC 21 bibliographic record. Subfield e of data field 100/700 holds the relationship type: author, mentor, thesis/dissertation defense board chair, or thesis/dissertation defense board member. |
| advisor | 700 1# a | (see the note for author) |
| chair | 700 1# a | (see the note for author) |
| committee member | 700 1# a | (see the note for author) |
| title | 245 00 a | Translations of these metadata are stored in field 880, as described in the paper “CERIF compatible data model based on MARC 21 format”. |
| alternative title | 246 0# a | (see the note for title) |
| subtitle | 245 00 b | (see the note for title) |
| keywords | 653 ## a | (see the note for title) |
| abstract | 520 3# a | (see the note for title) |
| extended abstract | 520 ## a | (see the note for title) |
| note | 500 ## a | (see the note for title) |
| language | 008 | The language is stored as three letters in character positions 35 to 37 of control field 008. Character positions start from 0. |
| ISBN | 020 ## a | |
| physical description | 300 ## | The physical description is stored using the subfields of data field 300. |
| UDC | 080 ## a | |
| publisher | 260 ## b | Holds either the value “author’s reprint” or the name of the appropriate institution. |
| publication date | 260 ## c | The year of publication is additionally stored in character positions 7-10 of control field 008. |
| record type | LDR | The record type is stored in character position 6 of the leader of the MARC 21 record. Character positions start from 0. |
| content format | 856 ## q | Holds one of the following values: pdf, doc, docx, odt. |
| URI | 856 ## u | The subfield holds the URL of the thesis or dissertation in digital form, or the DOI of the ETD. |
| access rights | 540 ## a | |
| thesis type | 655 #4 a | Also stored using the MARC21Record_Class entity of the PHD UNS data model. |
| name of author degree after defense | 502 ## a | The name of the degree is prescribed by the institution where the author defends the thesis or dissertation. For example: master of electrical engineering, doctor of technical sciences ... |
| level of education | 502 ## b | Holds the level of education: bachelor, master, doctoral, post-doctoral ... |
| scientific field | 650 24 a | Also stored using the MARC21Record_Class entity of the PHD UNS data model. |
| scientific discipline | 650 14 a | (see the note for scientific field) |
| accepted by competent scientific institution on | 502 ## g | Stored in subfield g in the following format: 502 ## $gTheme of thesis or dissertation accepted on date. |
| institution | 502 ## c | The subfield holds the name and address of the institution. All data about institutions are stored in MARC 21 authority records; the relation between a thesis or dissertation and an authority record is realized using the entity MARC21Record_MARC21Record. |
| defended on | 502 ## g | Stored in subfield g in the following format: 502 ## $gThesis or dissertation defended on date. |
| holding data | 852 ## a | |
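
To make the location notation in Table 2 concrete, the following sketch (in Python, for brevity; it is not part of the PHD UNS code base, which is written in Java) splits a location such as "100 1# a" into field code, indicators, and subfield code, and reads the fixed positions of control field 008 and of the leader that the notes mention. All function and class names are hypothetical.

```python
from typing import NamedTuple, Optional


class MarcLocation(NamedTuple):
    """One entry of the MARC 21 column of Table 2 (hypothetical helper type)."""
    tag: str                 # three-character field code, e.g. "245" or "LDR"
    ind1: Optional[str]      # first indicator, None when undefined ("#")
    ind2: Optional[str]      # second indicator, None when undefined ("#")
    subfield: Optional[str]  # subfield code, None when the entry names no subfield


def parse_location(text: str) -> MarcLocation:
    """Parse a location written as 'TAG', 'TAG II' or 'TAG II s'."""
    parts = text.split()
    tag = parts[0]
    ind1 = ind2 = subfield = None
    if len(parts) >= 2:
        if len(parts[1]) != 2:
            raise ValueError(f"expected two indicator characters in {text!r}")
        # "#" marks an undefined indicator
        ind1, ind2 = (None if c == "#" else c for c in parts[1])
    if len(parts) >= 3:
        subfield = parts[2]
    return MarcLocation(tag, ind1, ind2, subfield)


def language_from_008(field_008: str) -> str:
    """Language code: character positions 35 to 37 (counting from 0)."""
    return field_008[35:38]


def year_from_008(field_008: str) -> str:
    """Year of publication: character positions 7 to 10 (counting from 0)."""
    return field_008[7:11]


def record_type_from_leader(leader: str) -> str:
    """Record type: character position 6 of the leader (counting from 0)."""
    return leader[6]
```

For example, `parse_location("100 1# a")` yields the field code "100", a first indicator of "1", an undefined second indicator, and the subfield code "a".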

 

 

 

 

Information requirements

The DOSIRD UNS members identified the following basic information requirements for this digital library:

  • Uploading ETDs. The system supports the pdf, doc, docx, and odt file formats. Furthermore, the system has to back up files and provide long-term preservation of those files.
  • Migrating existing data from various sources, i.e., implementing an importer module with a scalable and open architecture. The software architecture of the module should be extensible with plugins for importing theses and PhD dissertations from various sources in various formats. The module should import data through an interactive process in which the user consolidates the data. Moreover, importing data through an interactive user interface could enable the creation of a database of unique authority records about authors, mentors, committee members, and the institutions where theses and PhD dissertations have been defended.
  • Entering all metadata about ETDs that the CERIF standard prescribes and all metadata that are necessary for exchange within NDLTD in accordance with the OAI-PMH protocol. The user interface has to be as simple as possible so that it can be used by users without knowledge of standards and protocols.
  • Exchanging metadata about ETDs with other CRIS systems. In this way, researchers from European countries who use national CRIS systems can find ETDs from PHD UNS.
  • Exchanging metadata about ETDs in accordance with the OAI-PMH protocol. In this way, dissertations from PHD UNS become visible through various IRs as well as through web applications for searching the NDLTD Union Catalogue or the DART-Europe E-theses Portal.
  • Searching ETDs using the web forms of the PHD UNS digital library, as well as remote searching from other systems via the SRU protocol.
  • A multilingual user interface that can easily be translated into a new language.
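
The remote search requirement above refers to SRU (Search/Retrieve via URL), in which a client sends a query as URL parameters. The sketch below, written in Python for brevity, shows what building such a request could look like. The parameter names (operation, version, query, maximumRecords) come from the SRU specification; the base URL and the `dc.title` index in the usage note are assumptions, since this page does not state the SRU address of PHD UNS or the indexes it supports.

```python
from urllib.parse import urlencode


def sru_search_url(base_url: str, cql_query: str,
                   max_records: int = 10, version: str = "1.1") -> str:
    """Build an SRU searchRetrieve request URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",  # the SRU search operation
        "version": version,             # SRU protocol version requested
        "query": cql_query,             # query expressed in CQL
        "maximumRecords": str(max_records),
    }
    return f"{base_url}?{urlencode(params)}"
```

A call such as `sru_search_url("https://example.org/sru", 'dc.title = "digital library"')` would produce a URL that another system could fetch to search titles remotely; `example.org` is a placeholder, not the real endpoint.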

Architecture

PHD UNS has an open architecture, which enables easy extension with new features. It can easily be integrated into a complete scientific research information system or with an existing MARC 21-based library information system. The application for cataloguing published results in the MARC 21 format was implemented in a multi-tiered client-server architecture on the Java platform. A UML deployment diagram for this application is shown in Figure 1.

 

Figure 1. The software architecture of PHD UNS

Client

Web browser: The client side of PHD UNS is the web browser. The application can be accessed by all modern browsers supporting HTML 4 and JavaScript.

OAI-PMH service provider: The client side of PHD UNS can also be a system that implements the client side of the OAI-PMH protocol.

Application server

Apache Tomcat: The server side of the digital library runs within the Apache Tomcat application server or another server that supports Java Servlet technology.

Interface module: The user interface implementation is based on the JavaServer Faces (JSF) development environment. Unlike other development environments based on the model-view-controller pattern, JSF is intended for component-based, event-driven web application development. JSF is increasingly used in combination with AJAX technology. Adding AJAX makes the user interface richer, and JSF takes care of minimizing AJAX-related problems within the web browser. The application described in this paper uses RichFaces, a library of AJAX-based JSF components.

Format converter: This component transforms records between various formats: DTO, MARC 21, Dublin Core, ETD-MS, and CERIF. A data transfer object (DTO) is used to transport data between application components; it has a set of attributes and accessor/mutator methods for those attributes. Transformations to the MARC 21, Dublin Core, ETD-MS, and CERIF formats are implemented in accordance with Table 1 and Table 2. These formats are used for importing and exporting records.
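
As a rough illustration of what one such transformation involves, the sketch below maps a DTO to an unqualified Dublin Core record. It is written in Python for brevity and is not the converter used by PHD UNS. The `ThesisDto` class, its attribute names, and the wrapping `record` element are hypothetical, and the pairing of Table 1 elements with Dublin Core element names (for example author with creator) follows common Dublin Core usage rather than anything stated on this page.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # Dublin Core element set namespace


@dataclass
class ThesisDto:
    """Hypothetical DTO; attribute names are illustrative only."""
    author: str
    title: str
    keywords: list = field(default_factory=list)
    abstract: str = ""
    language: str = ""
    publisher: str = ""
    publication_date: str = ""
    uri: str = ""


# (DTO attribute, Dublin Core element); only elements that Table 1 marks
# as present in Dublin Core are mapped (assumed pairing).
DTO_TO_DC = [
    ("author", "creator"),
    ("title", "title"),
    ("keywords", "subject"),
    ("abstract", "description"),
    ("language", "language"),
    ("publisher", "publisher"),
    ("publication_date", "date"),
    ("uri", "identifier"),
]


def to_dublin_core(dto: ThesisDto) -> ET.Element:
    """Return an XML element holding one dc:* child per non-empty value."""
    ET.register_namespace("dc", DC_NS)
    root = ET.Element("record")
    for attr, element in DTO_TO_DC:
        value = getattr(dto, attr)
        values = value if isinstance(value, list) else [value]
        for v in values:
            if v:  # skip empty attributes
                ET.SubElement(root, f"{{{DC_NS}}}{element}").text = v
    return root
```

Keeping the mapping in a table-like list mirrors the role Table 1 and Table 2 play in the text: adding a target format means adding another mapping rather than changing the DTO.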

Import data: This component imports data about theses and PhD dissertations from various data sources. It imports data through an interactive process in which the user consolidates the data.

OAI-PMH data provider: This component implements the server side of the OAI-PMH protocol. It enables export via the OAI-PMH protocol in the Dublin Core, MARC 21, and ETD-MS formats.
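
A harvester talks to such a data provider with plain HTTP requests. The following Python sketch builds a ListRecords request that selects records by set and by date range, the two kinds of selective harvesting discussed in the data model section. The argument names (verb, metadataPrefix, set, from, until) are defined by OAI-PMH, and `oai_dc` is the metadata prefix the protocol requires every provider to support. The base URL and the set name in the usage note are placeholders, because this page does not give the actual endpoint, set names, or the prefixes PHD UNS uses for MARC 21 and ETD-MS.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix, set_spec=None,
                     from_date=None, until_date=None):
    """Build an OAI-PMH ListRecords request URL with optional selective harvesting."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec:
        params["set"] = set_spec      # only records that belong to this set
    if from_date:
        params["from"] = from_date    # only records changed on or after this date
    if until_date:
        params["until"] = until_date  # only records changed on or before this date
    return f"{base_url}?{urlencode(params)}"
```

For instance, `list_records_url("https://example.org/oai", "oai_dc", set_spec="phd", from_date="2013-01-01")` asks for Dublin Core records in a hypothetical set named `phd` that were created or modified since the start of 2013.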

IR server: The Apache Lucene information retrieval (IR) library is used for indexing and searching text content. Apache Lucene is an open-source text search engine library written in Java.

DB access: JDBC is used for database access.

File server: This component implements the storing and downloading of ETDs. ETDs are stored in the server file system. The component can store any file format, but it can be configured to accept only a defined set of file formats (for instance, only pdf, doc, and docx files).
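
The configurable restriction described above can be pictured as a simple check before a file is stored. The Python sketch below decides by file extension, which is an assumption: this page does not say whether the PHD UNS file server inspects the extension, the MIME type, or the file contents. The function name and the default set are illustrative.

```python
from pathlib import PurePath

# Example configuration taken from the text: accept only pdf, doc and docx.
ACCEPTED_FORMATS = frozenset({"pdf", "doc", "docx"})


def is_accepted(file_name, accepted=ACCEPTED_FORMATS):
    """Return True if the file may be stored; accepted=None means 'any format'."""
    if accepted is None:
        return True
    extension = PurePath(file_name).suffix.lower().lstrip(".")
    return extension in accepted
```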

DBMS

MySQL DBMS: MySQL can be used as the database management system, as can any other relational DBMS that provides a JDBC connector.

Implementation

The PHD UNS digital library of PhD dissertations is a web application implemented on the Java platform with a set of open-source libraries written in Java. Although it can run as a separate application, within the DOSIRD UNS project PHD UNS is integrated with the CRIS UNS application. The form for entering metadata is shown in Figure 2. Translations of multilingual metadata can be entered on this form by clicking the boxes on the right (e.g., Title translations, Subtitle translations, and so on). Because some metadata are multilingual, information retrieval measures (precision, recall, and F-measure) are improved, i.e., the visibility of ETDs is increased. Furthermore, the visibility of ETDs can be improved by using fuzzy search, which is enabled through the Apache Lucene library. Fuzzy search retrieves all theses and PhD dissertations that meet a set of criteria that define similarity. For example, similarity criteria for two strings (a string from a query and a string from a thesis or PhD dissertation title stored in the PHD UNS database) can be defined as follows:

  • Each word in one string differs by no more than two letters from a word in the other string.
  • If one string contains more than five words, the previous criterion is satisfied for at least 80 percent of the words.
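
The two example criteria above can be sketched directly in code. The Python version below is one possible reading, not Lucene's fuzzy search and not the PHD UNS implementation: it interprets "differs by no more than two letters" as an edit (Levenshtein) distance of at most two, compares case-insensitively, and applies the 80 percent rule to the query string. Function names and default thresholds are illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two words (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def is_similar(query: str, title: str, max_diff: int = 2,
               long_threshold: int = 5, min_ratio: float = 0.8) -> bool:
    """Apply the two similarity criteria to a query string and a title."""
    q_words = query.lower().split()
    t_words = title.lower().split()
    if not q_words or not t_words:
        return False
    # Count query words that have a close enough counterpart in the title.
    matched = sum(
        1 for q in q_words
        if any(edit_distance(q, t) <= max_diff for t in t_words)
    )
    if len(q_words) > long_threshold:
        return matched / len(q_words) >= min_ratio  # second criterion
    return matched == len(q_words)                  # first criterion
```

With these rules a misspelled query such as "digtal libary" still matches a title containing "Digital library", while an unrelated query does not.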

 

Figure 2. Input of metadata

Data migration from a data source is controlled by a user. Whenever it is unclear whether an imported object already exists in the PHD UNS database, the import module presents a list of similar objects in the dialog shown in Figure 3.

 

Figure 3. Similar records

The user who started the import has to decide what should be done with the imported object’s metadata. The module provides the following options:

  • Keep the metadata that already exist in the PHD UNS database: the button in the column “Select institution”,
  • Overwrite the existing metadata with the imported object’s metadata: the button in the column “Change”,
  • Merge the metadata under the user’s control: the button in the column “Merge”,
  • Store the object’s metadata as a new record: the button “Save” stores the imported object’s metadata without changes; the button “Change” opens a form for changing the metadata and then stores them.

If the user chooses to merge the data, the form shown in Figure 4 is opened. On this form the user can see the imported object’s metadata (within the input fields) as well as the metadata of the existing object in the PHD UNS database (messages next to the images) that should be merged with the imported object.

 

Figure 4. Records merging

Searching has three distinct modes, which are opened by selecting one of the following options: Dissertations, Authors and board members, and Search based on query language (Figure 5). Selecting the option Dissertations opens a form for building complex queries using the elements of the application interface.

Selecting the option Authors and board members opens a form for searching the database of researchers (dissertation authors, advisors, and board members) by first and last name. For each retrieved researcher, besides a column containing basic personal data (first and last name, affiliation, position, title), there are also a link to the metadata about her/his PhD dissertation and links to the metadata about dissertations for which she/he was advisor, board president, or board member.

Selecting the option Search based on query language (Figure 5) opens a form for writing a Lucene query. The syntax of the Lucene query language is available at http://lucene.apache.org/core/old_versioned_docs/versions/2_9_1/queryparsersyntax.html, and the list of fields available for searching is shown on the form for writing the Lucene query.
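
To give a feel for the query language, the Python sketch below assembles query strings that use three features of the Lucene query parser syntax documented at the address above: fielded phrases (`field:"..."`), fuzzy terms (`term~` with an optional similarity between 0 and 1), and the AND operator. The field names `title` and `author` in the usage note are assumptions; the actual field list is the one shown on the PHD UNS query form. The helper functions are illustrative and not part of Lucene.

```python
def phrase(field_name: str, text: str) -> str:
    """Fielded phrase query, e.g. title:"digital library"."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field_name}:"{escaped}"'


def fuzzy_term(field_name: str, word: str, similarity=None) -> str:
    """Fielded fuzzy term, e.g. author:ivanovic~ or author:ivanovic~0.7."""
    suffix = "~" if similarity is None else f"~{similarity}"
    return f"{field_name}:{word}{suffix}"


def all_of(*clauses: str) -> str:
    """Require every clause by joining them with AND."""
    return " AND ".join(f"({clause})" for clause in clauses)
```

Calling `all_of(phrase("title", "digital library"), fuzzy_term("author", "ivanovic", 0.7))` returns `(title:"digital library") AND (author:ivanovic~0.7)`, a string that could be pasted into the query form if fields with those names are offered.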

 

Figure 5. Search