The Bulgarian Part-of-Speech Corpus (BulPosCor) is derived from the Brown Corpus of Bulgarian. It was first annotated automatically with PoS tags and then disambiguated manually. The corpus for annotation was built by selecting a portion of 150 or more words from each sample in the Brown Corpus of Bulgarian. The automatic grammatical annotation used the Bulgarian Grammar Dictionary, which contains about 85 000 words and over 1.5 million word forms, each specified with its grammatical characteristics.
Disambiguation was performed by human experts, who selected the correct PoS tag for each ambiguous token from its two or more candidate tags. A set of annotation principles was defined to ensure a uniform approach to the annotation. The resulting PoS-disambiguated corpus consists of 217 210 tokens: 172 482 single words, 42 058 punctuation marks and 2 670 numbers.
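The two-stage process described above (dictionary lookup that proposes every possible tag, followed by a manual choice for ambiguous tokens) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy lexicon, the tag labels (N, A, V, PRON), and the function names are hypothetical and do not reproduce the Bulgarian Grammar Dictionary, the BulPosCor tagset, or the tools actually used.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy lexicon: word form -> set of candidate PoS tags.
# The real Bulgarian Grammar Dictionary and its tagset are not reproduced here.
TOY_LEXICON: Dict[str, Set[str]] = {
    "моят": {"PRON"},
    "син": {"N", "A"},   # "son" (noun) or "blue" (adjective)
    "е": {"V"},
}


def annotate(tokens: List[str], lexicon: Dict[str, Set[str]]) -> List[Tuple[str, List[str]]]:
    """Attach all candidate tags to each token (automatic annotation step)."""
    return [(tok, sorted(lexicon.get(tok.lower(), {"UNKNOWN"}))) for tok in tokens]


def ambiguous_positions(annotated: List[Tuple[str, List[str]]]) -> List[int]:
    """Return the positions of tokens that received two or more candidate tags."""
    return [i for i, (_, tags) in enumerate(annotated) if len(tags) > 1]


def disambiguate(
    annotated: List[Tuple[str, List[str]]],
    choices: Dict[int, str],
) -> List[Tuple[str, str]]:
    """Apply an annotator's choices (position -> tag) to the ambiguous tokens."""
    result = []
    for i, (tok, tags) in enumerate(annotated):
        if len(tags) == 1:
            result.append((tok, tags[0]))
            continue
        chosen = choices[i]  # KeyError if the annotator skipped this token
        if chosen not in tags:
            raise ValueError(f"{chosen!r} is not a candidate tag for {tok!r}")
        result.append((tok, chosen))
    return result


tokens = ["Моят", "син", "е", "син"]
annotated = annotate(tokens, TOY_LEXICON)
to_review = ambiguous_positions(annotated)           # positions 1 and 3
tagged = disambiguate(annotated, {1: "N", 3: "A"})   # simulated expert decision
```

Restricting the annotator's choice to the candidates proposed by the lexicon mirrors the description above, in which experts pick one tag out of two or more possible ones rather than assigning tags freely.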
The chief intended application of the Bulgarian Tagged Corpora is to serve as a test and/or training dataset for PoS disambiguation.
The Tagged Corpus also supports efficient online search for language patterns and forms.
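As an illustration of the test-dataset use mentioned above, the following Python sketch scores a tagger's output against manually disambiguated (gold) tags. The function names, the tag labels, and the sample sequences are hypothetical; the sketch assumes the gold and predicted tags are already aligned token by token and says nothing about the corpus's actual file format.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tagging_accuracy(gold: List[str], predicted: List[str]) -> float:
    """Fraction of tokens whose predicted tag matches the gold tag."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return correct / len(gold)


def confusions(gold: List[str], predicted: List[str]) -> Dict[Tuple[str, str], int]:
    """Count (gold, predicted) pairs for the tokens that were tagged wrongly."""
    return dict(Counter((g, p) for g, p in zip(gold, predicted) if g != p))


# Hypothetical aligned tag sequences for a four-token sentence.
gold_tags = ["PRON", "N", "V", "A"]
predicted_tags = ["PRON", "N", "V", "N"]

accuracy = tagging_accuracy(gold_tags, predicted_tags)  # 3 of 4 tokens correct
errors = confusions(gold_tags, predicted_tags)          # one A tagged as N
```

Reporting the confusion counts alongside the overall accuracy shows which tag pairs a disambiguation method tends to mix up, which a single accuracy figure hides.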