Looking for preselected multiword units in an untagged corpus of written Italian: maximizing the potential of the search program DBT

Cignoni, L; Coffey, Stephen James

Language research carried out with the aid of computer corpora avails itself, above all, of the fact that modern alphabet-based written language is usually represented in the form of a linear sequence of ‘words’ (or analogous text elements) interspersed with spaces and punctuation marks. These words, however, both in the language and in the corpus, do not necessarily coincide with units of meaning. One of the consequent problems for the researcher is how to look for phraseological units in a corpus. In the present paper, the authors describe the problems encountered while searching for specific Italian multiword units in an Italian corpus, and how a flexible search program such as DBT can be used to greatest advantage to overcome such problems.