The results of the CADIAL project are summarized in the book Technologies for the Processing and Retrieval of Semi-Structured Documents, which describes all the work carried out within the project.
The book is composed of four parts:
- Language Technologies for Information Retrieval
- Knowledge Technologies for Information Retrieval
The Preface and the Introduction are available online:
- Preface by Ralf Steinberger
The book is freely available. If you would like a copy, please contact us at firstname.lastname@example.org
CADIAL search engine
CADIAL search engine is an intelligent web-based search engine that enables users to search a collection of Croatian legislative documents in a national and multilingual European context. Searching in English is supported through the use of Eurovoc, an EU standard thesaurus for document indexing and retrieval. Users can also search and filter documents by the Eurovoc descriptors assigned to them.
Search engine features:
- Morphological normalization for the Croatian language
- Support for structured document retrieval
- Phrase searching
- Generation of snippets of document text containing the query keywords
- Combining the search of document text, title and descriptors
- Searching the descriptors in Croatian and English (can be extended to any language in the Eurovoc thesaurus)
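Two of the features listed above, phrase searching and snippet generation, can be illustrated with a minimal sketch. This is not the CADIAL implementation: the function names `normalize` and `make_snippet` and the `window` parameter are hypothetical, and the sketch matches exact word forms only, without the morphological normalization the real engine applies to Croatian.

```python
import string

def normalize(word):
    """Lowercase a word and strip surrounding punctuation (no lemmatization)."""
    return word.strip(string.punctuation).lower()

def make_snippet(text, phrase, window=3):
    """Return an excerpt around the first exact match of `phrase`, or None.

    `window` is the number of context words kept on each side of the match.
    """
    words = text.split()
    tokens = [normalize(w) for w in words]
    target = [normalize(w) for w in phrase.split()]
    n = len(target)
    if n == 0:
        return None
    # Phrase search: the query words must appear as one contiguous run.
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == target:
            start = max(0, i - window)
            end = min(len(words), i + n + window)
            prefix = "... " if start > 0 else ""
            suffix = " ..." if end < len(words) else ""
            return prefix + " ".join(words[start:end]) + suffix
    return None

if __name__ == "__main__":
    sample = ("The act regulates the protection of personal data "
              "in public registers and archives.")
    print(make_snippet(sample, "personal data", window=2))
```

The snippet is cut from the original words, so capitalization and punctuation of the source text are preserved, while matching is done on the normalized tokens.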
eCADIS (enriched computer aided document indexing system) is a workstation that speeds up human document indexing through:
- application of machine learning techniques
- automatic suggestion of relevant descriptors, i.e. automatic indexing
The eCADIS user interface is multilingual, so it is possible to use, e.g., an English interface while indexing a document written in Croatian (or vice versa). The document window provides useful information derived from the opened document, such as word forms, lemmas and n-grams. eCADIS highlights statistically relevant n-grams (collocations).
The Eurovoc browser window enables human indexers to quickly look up relevant information about descriptors (fig. 2). Because Eurovoc is a multilingual thesaurus, the descriptor language can be switched with a single keypress.
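The idea behind switching descriptor languages can be sketched as a lookup keyed by a language-independent descriptor identifier. The identifiers, labels and function names below are illustrative assumptions, not real Eurovoc entries or eCADIS code.

```python
# Toy thesaurus: hypothetical IDs and labels, NOT real Eurovoc data.
THESAURUS = {
    "d1": {"en": "fisheries", "hr": "ribarstvo"},
    "d2": {"en": "taxation", "hr": "oporezivanje"},
}

def label(descriptor_id, lang):
    """Return the label of a descriptor in the requested language."""
    return THESAURUS[descriptor_id][lang]

def find_descriptors(term, lang):
    """Return IDs of descriptors whose label in `lang` equals `term`."""
    term = term.lower()
    return sorted(
        d for d, labels in THESAURUS.items()
        if labels.get(lang, "").lower() == term
    )

def translate(term, source_lang, target_lang):
    """Map a descriptor label from one language to another via its ID."""
    return [label(d, target_lang) for d in find_descriptors(term, source_lang)]

if __name__ == "__main__":
    print(translate("Fisheries", "en", "hr"))
```

Because documents are indexed by descriptor identifier rather than by label, the same mechanism lets a query typed in one language retrieve documents written in another.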
When a (possibly previously unseen) document is opened, eCADIS uses machine learning techniques to automatically suggest descriptors for indexing it. Figures 6 and 7 compare manually attached descriptors with automatically generated suggestions.
Figure 6. Manually attached descriptors.
Figure 7. Automatically suggested descriptors.
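The text does not say which machine learning technique eCADIS uses, so the following is only a stand-in to show the general shape of descriptor suggestion: a word-count profile is built for each descriptor from manually indexed documents, and a new document is scored against each profile by cosine similarity. The function names, the training texts and the descriptor labels are all hypothetical.

```python
from collections import Counter, defaultdict
import math

def bag_of_words(text):
    """Count whitespace-separated lowercase tokens (no lemmatization)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def train(indexed_docs):
    """Sum the word counts of all documents carrying each descriptor."""
    profiles = defaultdict(Counter)
    for text, descriptors in indexed_docs:
        bow = bag_of_words(text)
        for d in descriptors:
            profiles[d].update(bow)
    return profiles

def suggest(profiles, text, top_k=2):
    """Return up to top_k descriptors most similar to the document."""
    bow = bag_of_words(text)
    scored = [(cosine(bow, p), d) for d, p in profiles.items()]
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [d for score, d in scored[:top_k] if score > 0]

if __name__ == "__main__":
    # Toy training set with made-up descriptor labels.
    training = [
        ("fishing vessels quota sea catch", ["fisheries"]),
        ("income tax rate tax return", ["taxation"]),
        ("customs tariff import duty tax", ["taxation", "customs"]),
    ]
    profiles = train(training)
    print(suggest(profiles, "new tax rate on import"))
```

In a workstation setting the ranked list would be shown to the human indexer as suggestions to accept or reject, which matches the comparison between manual and automatic descriptors in the figures.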
TermeX is a tool for automatic collocation extraction and terminology lexica construction, used by Hidra to enrich Eurovoc in the Croatian language.
Figure 8 shows the TermeX interface and the use of different measures for automatic collocation extraction.
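As an illustration of what such a measure looks like, the sketch below ranks adjacent word pairs by pointwise mutual information (PMI), a common association measure for collocation extraction. It is an assumption that PMI is representative of the measures offered; the text does not list them, and the function name `bigram_pmi` and the `min_count` threshold are hypothetical.

```python
from collections import Counter
import math

def bigram_pmi(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ). Pairs seen fewer than
    `min_count` times are skipped, since PMI overrates rare pairs.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / n_bi
        p_x = unigrams[w1] / n_uni
        p_y = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    # Highest PMI first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    text = ("the european union and the european parliament "
            "adopted the act of the european union")
    for pair, score in bigram_pmi(text.split()):
        print(pair, round(score, 3))
```

On this toy input the pair ("european", "union") ranks above ("the", "european"), because "the" is frequent on its own and so its co-occurrence with "european" is less informative.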