The Bulgarian Sense-annotated Corpus (BulSemCor) contains lexical items whose senses have been disambiguated in their context of occurrence.
BulSemCor follows the methodology of the Princeton University SemCor. It consists of excerpts from the Brown Corpus of Bulgarian. Each lexical item (simple word, compound word or multiword expression) is manually assigned the single semantic or grammatical meaning from the Bulgarian wordnet (BulNet) that fits the particular context.
Unlike other sense-annotated corpora, BulSemCor covers both open-class and closed-class words, as well as all occurrences of multiword expressions and named entities.
The annotated lexical units inherit all the information from the synonym sets (synsets) in BulNet, including the explanatory definition, PoS, usage examples, notes on grammatical, stylistic, and pragmatic properties, and all relations (semantic, morpho-syntactic, and extra-linguistic) pertaining to the synset, as well as the semantic and derivational relations pertaining to the literal. BulSemCor contains 101 062 tokens and 99 480 annotated lexical units: 86 842 single words and 5 797 multiword expressions.
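The inheritance described above can be pictured with a small data model. This is only an illustrative sketch: the class names, field names, and the placeholder entry are assumptions made for the example and do not reflect the actual BulSemCor or BulNet file format.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Synset:
    """Hypothetical stand-in for a BulNet synonym set."""
    synset_id: str
    pos: str
    definition: str
    literals: tuple[str, ...]
    examples: tuple[str, ...] = ()
    # (relation name, target synset id) pairs; names are illustrative only
    relations: tuple[tuple[str, str], ...] = ()


@dataclass(frozen=True)
class AnnotatedUnit:
    """One annotated lexical unit: a token span linked to exactly one synset."""
    tokens: tuple[str, ...]  # one token for a single word, several for a multiword expression
    synset: Synset

    @property
    def is_multiword(self) -> bool:
        return len(self.tokens) > 1

    @property
    def definition(self) -> str:
        # Looked up on the synset, so the unit "inherits" it instead of storing a copy.
        return self.synset.definition

    @property
    def pos(self) -> str:
        return self.synset.pos


# Placeholder values invented for the example, not taken from the corpus.
synset = Synset(
    synset_id="example-0001-n",
    pos="n",
    definition="placeholder definition",
    literals=("placeholder literal",),
    relations=(("hypernym", "example-0000-n"),),
)
unit = AnnotatedUnit(tokens=("placeholder", "literal"), synset=synset)
```

Linking each unit to a synset object, rather than copying its fields, keeps definitions and relations in one place while still making them reachable from every annotated occurrence.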
BulSemCor is used as a training and test set in the development of probability-based automatic word-sense disambiguation, which is applicable in a variety of natural language processing tasks such as machine translation, text categorisation, and information extraction, among others.
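To illustrate how a sense-annotated corpus can serve as training and test data, here is a minimal sketch of the simplest probability-based approach: a most-frequent-sense baseline that estimates P(sense | lemma) by relative frequency. It is not the disambiguation system developed with BulSemCor; the function names and the toy (lemma, sense) pairs are assumptions made for the example.

```python
from collections import Counter, defaultdict


def train_sense_model(annotated):
    """Estimate P(sense | lemma) by relative frequency from (lemma, sense_id) pairs."""
    counts = defaultdict(Counter)
    for lemma, sense_id in annotated:
        counts[lemma][sense_id] += 1
    model = {}
    for lemma, sense_counts in counts.items():
        total = sum(sense_counts.values())
        model[lemma] = {s: n / total for s, n in sense_counts.items()}
    return model


def disambiguate(model, lemma):
    """Return the most probable sense for a lemma, or None if the lemma was never seen."""
    senses = model.get(lemma)
    if not senses:
        return None
    return max(senses, key=senses.get)


def accuracy(model, test):
    """Share of test items whose predicted sense matches the annotated one."""
    if not test:
        return 0.0
    correct = sum(1 for lemma, sense_id in test if disambiguate(model, lemma) == sense_id)
    return correct / len(test)


# Toy data invented for the example; sense identifiers are placeholders.
train = [
    ("bank", "bank-sense-1"),
    ("bank", "bank-sense-1"),
    ("bank", "bank-sense-2"),
    ("run", "run-sense-1"),
]
test = [
    ("bank", "bank-sense-1"),
    ("bank", "bank-sense-2"),
    ("run", "run-sense-1"),
    ("jump", "jump-sense-1"),  # unseen lemma, so no prediction is made
]

model = train_sense_model(train)
print(disambiguate(model, "bank"))
print(accuracy(model, test))
```

A baseline like this ignores context entirely; a practical system would condition the probabilities on surrounding words, but the split into an annotated training portion and a held-out test portion stays the same.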