Computer Aided Document Indexing for Accessing Legislation


The results of the CADIAL project are summarized in the book Technologies for the Processing and Retrieval of Semi-Structured Documents, which describes all the work related to the project.

The book is composed of four parts:

The Preface and Introduction sections are available online here:

The book is freely available. If you would like a copy, please contact us at

Figure 1. CADIAL book frontpage.

CADIAL search engine

The CADIAL search engine is an intelligent web-based search engine that lets users search the collection of Croatian legislative documents in both the national and the multilingual European context. Searching in English is supported through the use of Eurovoc, an EU standard thesaurus for document indexing and retrieval. Users can also search and filter the documents by the Eurovoc descriptors assigned to them.

Search engine features:

Figure 2. CADIAL search engine snapshot.

Figure 3. Descriptors from the search results (used for filtering the results).
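The descriptor-based filtering and English-language lookup described above can be illustrated with a minimal sketch. It is not the CADIAL implementation: the `THESAURUS` mapping, the descriptor IDs `D1` and `D2`, the `Document` class, and the sample documents are all hypothetical toy data chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    title: str
    descriptors: set = field(default_factory=set)  # IDs of assigned descriptors


# Hypothetical mini-thesaurus: descriptor ID -> label per language.
THESAURUS = {
    "D1": {"en": "customs tariff", "hr": "carinska tarifa"},
    "D2": {"en": "fisheries policy", "hr": "ribarstvena politika"},
}

# Toy collection of Croatian documents with manually assigned descriptors.
DOCUMENTS = [
    Document("Zakon o carinskoj tarifi", {"D1"}),
    Document("Zakon o morskom ribarstvu", {"D2"}),
    Document("Uredba o carinskim kvotama za ribu", {"D1", "D2"}),
]


def descriptor_ids(query, lang):
    """Return IDs of descriptors whose label in `lang` contains the query."""
    q = query.lower()
    return {did for did, labels in THESAURUS.items() if q in labels[lang].lower()}


def filter_by_descriptors(documents, required_ids):
    """Keep documents that carry every required descriptor.

    An empty `required_ids` set means "no filter" and returns all documents.
    """
    return [doc for doc in documents if required_ids <= doc.descriptors]


# An English query is resolved to a language-independent descriptor ID,
# which then selects Croatian documents indexed with that descriptor.
ids = descriptor_ids("customs tariff", "en")
hits = filter_by_descriptors(DOCUMENTS, ids)
for doc in hits:
    print(doc.title)
```

Because the descriptors are language-independent identifiers, the same filter works regardless of the language in which the query label was typed.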

eCADIS workstation

eCADIS (enriched computer aided document indexing system) is a workstation that speeds up document indexing by human indexers.

The eCADIS user interface is multilingual, so it is possible, for example, to use an English interface while indexing a document written in Croatian (or vice versa). The document window provides useful information derived from the opened document, such as word forms, lemmas and n-grams. eCADIS also highlights statistically relevant n-grams (collocations).

Figure 4. Document window.
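To make the notion of word forms and n-grams concrete, here is a minimal sketch of how such counts could be derived from a document. It is an illustration only, not eCADIS code: the `tokenize` and `ngrams` helpers and the sample sentence are assumptions, and lemmatization (which requires a morphological lexicon) is left out.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into word tokens (Unicode-aware)."""
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    """Return all contiguous n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Toy Croatian text; sentence boundaries are ignored for simplicity.
text = "Zakon o radu. Izmjene zakona o radu stupaju na snagu."
tokens = tokenize(text)

word_form_counts = Counter(tokens)        # frequency of each word form
bigram_counts = Counter(ngrams(tokens, 2))  # frequency of each 2-gram

for bigram, count in bigram_counts.most_common(3):
    print(" ".join(bigram), count)
```

Note that "zakon" and "zakona" are counted as separate word forms here; grouping them under one lemma is exactly what the lemma view in the document window adds.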

The Eurovoc browser window enables human indexers to quickly look up relevant information about descriptors (Figure 5). Because Eurovoc is a multilingual thesaurus, the descriptor language can be switched with a single keypress.

Figure 5. Eurovoc browser window.

When a document (possibly a previously unseen one) is opened, eCADIS uses machine learning techniques to automatically suggest descriptors for indexing it. Figures 6 and 7 compare manually attached descriptors with automatically generated suggestions.

Figure 6. Manually attached descriptors.

Figure 7. Automatically suggested descriptors.
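The text does not say which machine learning techniques eCADIS uses, so the following is only a stand-in sketch of the general idea: learn from previously indexed documents, then rank descriptors for a new one. It uses a simple bag-of-words profile per descriptor and cosine similarity; the function names, the descriptor labels, and the training texts are all hypothetical toy data.

```python
from collections import Counter, defaultdict
import math
import re


def bag_of_words(text):
    """Count lowercase word tokens in the text."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def train(indexed_docs):
    """Build one word-count profile per descriptor from indexed documents."""
    profiles = defaultdict(Counter)
    for text, descriptors in indexed_docs:
        words = bag_of_words(text)
        for descriptor in descriptors:
            profiles[descriptor].update(words)
    return profiles


def suggest(profiles, text, top_k=3):
    """Return up to top_k descriptors ranked by similarity to the text."""
    words = bag_of_words(text)
    scored = [(cosine(words, profile), d) for d, profile in profiles.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [d for score, d in scored[:top_k] if score > 0]


# Toy training set: (document text, manually attached descriptors).
indexed_docs = [
    ("fishing quotas for the adriatic sea fleet", ["fisheries policy"]),
    ("customs tariff on imported agricultural goods",
     ["customs tariff", "agricultural product"]),
    ("subsidies for agricultural producers and farms", ["agricultural product"]),
]

profiles = train(indexed_docs)
suggestions = suggest(profiles, "new tariff rates for imported goods", top_k=2)
print(suggestions)
```

In a workstation setting the ranked list would only be a proposal: the human indexer still accepts or rejects each suggested descriptor.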


TermeX is a tool for automatic collocation extraction and terminology lexicon construction, used by Hidra to enrich Eurovoc in the Croatian language.

Figure 8 shows the TermeX interface and the use of different measures for automatic collocation extraction.

Figure 8. TermeX tool screenshot.
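The text does not list the measures TermeX offers. As an illustration of what an association measure for collocation extraction computes, the sketch below scores bigrams with two commonly used measures, pointwise mutual information (PMI) and the Dice coefficient. The choice of these two measures, the `association_scores` function, the `min_freq` threshold, and the sample sentence are assumptions made for this example.

```python
from collections import Counter
import math


def association_scores(tokens, min_freq=2):
    """Score each bigram seen at least min_freq times with PMI and Dice."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_tokens = len(tokens)
    n_bigrams = max(len(tokens) - 1, 1)
    scores = {}
    for (w1, w2), f12 in bigrams.items():
        if f12 < min_freq:
            continue  # rare pairs get unreliably high scores
        p12 = f12 / n_bigrams
        p1 = unigrams[w1] / n_tokens
        p2 = unigrams[w2] / n_tokens
        scores[(w1, w2)] = {
            "freq": f12,
            # PMI: how much more often the pair occurs than chance predicts.
            "pmi": math.log2(p12 / (p1 * p2)),
            # Dice: overlap between the two words' occurrences, in [0, 1].
            "dice": 2 * f12 / (unigrams[w1] + unigrams[w2]),
        }
    return scores


tokens = ("european union law and european union policy "
          "in the european union").split()
scores = association_scores(tokens)

for pair, s in scores.items():
    print(" ".join(pair), s["freq"], round(s["pmi"], 3), round(s["dice"], 3))
```

Different measures rank candidates differently, which is why a tool that lets the user switch between them is useful: the frequency threshold here exists because pairs seen only once (such as "law and") would otherwise reach the maximum Dice score.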