Considerations emerging from a frequency study of multiword units in a corpus of contemporary written Italian

Cignoni, L; Coffey, Stephen James

In this article we discuss the notion of corpus frequency as applied to multiword units. Corpus frequency data most commonly concerns single word forms, mainly because such data is very easy to obtain. It would be much more useful, however, to have access to information on the relative frequency of all units of meaning or function in a given language. That is, frequency data should be available for both single-word and multiword units, and should be sense-differentiated where homonymy exists. Such information would contribute to the overall description of a language, support cross-corpus and cross-language comparisons, and provide the basis for other computational tasks. We discuss some of the major problems involved in drawing up frequency figures for multiword units, and then present a case study of how partial frequency data was obtained for a corpus of written Italian. We also make cross-corpus comparisons, notably with a typologically similar corpus of English.