Bulgarian-X language Parallel Corpus

Bul-X-Cor

http://dcl.bas.bg/en/parallelcorpora/

http://search.dcl.bas.bg/en/

ID: 805

The Bulgarian-X language Parallel Corpus (Bul-X-Cor) is part of the Bulgarian National Corpus (BulNC). The Bulgarian National Corpus is designed as a uniform framework for texts of different modality (written or spoken), period (synchronic or diachronic), and number of languages (monolingual, or parallel where one of the counterparts is Bulgarian). Every X-language in the corpus is treated equally with respect to text-type diversity and balance, metadata description scheme, preprocessing and annotation, search engine queries, and data storage format.
The Bulgarian-X Language Parallel Corpus includes parallel corpora for 48 languages: English, German, French, Slavic and Balkan languages, as well as other European and non-European languages.
The parallel corpora contain only texts that have a Bulgarian counterpart: either the original is in Bulgarian, there is a Bulgarian translation, or both texts are translations from a third language.
As of January 2013, the Bulgarian-X Language Parallel Corpus contains 4.2 billion tokens, making it the largest parallel corpus for Bulgarian. The languages are not equally represented: the largest is the Bulgarian-English parallel corpus (280.8 and 283.1 million words for Bulgarian and English respectively); there are 18 other corpora of over 200 million tokens per language, 2 parallel corpora of between 100 and 200 million tokens per language, 11 parallel corpora in the range of 5-15 million tokens per language, and the remaining 15 are below 1 million, with the smallest, Japanese, at 50,000 tokens. Each parallel subcorpus within Bul-X-Cor mirrors the structure of BulNC.
The structure, data formatting and text description follow the model of BulNC. All Bulgarian texts in BulNC and English texts in Bul-X-Cor are supplied with extensive metadata descriptions compliant with well-established standards. The Bulgarian-English parallel corpus is also annotated at several levels, while annotation of the other languages has only just started.
The main applications of parallel corpora are in computational linguistics: machine translation, development of bilingual lexical resources (dictionaries), etc. Parallel corpora become more useful when they are annotated.
The Bulgarian-X Language Parallel Corpus Collocations service is a web service for collocation search and various statistics over the Bulgarian-X Language Parallel Corpus.
The service is built on the freely available NoSketchEngine, a corpus-processing system that combines Manatee and Bonito.
The Collocations service is a RESTful web service that supports complex queries over HTTP. Example: http://dcl.bas.bg/collocations/?cmd=collocations&word=нет
user: bulnc
pass: bulnc
The query returns the collocations of the given word in the NoSketchEngine format.
The service also accepts additional arguments, namely all those accepted by NoSketchEngine, each with a default value, plus an optional language identifier. The following example restricts the statistics to Bulgarian: http://dcl.bas.bg/collocations/?cmd=collocations&word=нет&lang=bg
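The two example queries above can be assembled programmatically. The following is a minimal Python sketch, not an official client: the helper names `build_collocations_url` and `fetch_collocations` are hypothetical, and the use of HTTP Basic authentication for the listed user and password is an assumption, since the record does not say how the credentials are submitted. The response is returned as raw bytes because the record does not describe the NoSketchEngine output format in detail.

```python
from urllib.parse import urlencode
from urllib.request import (
    HTTPBasicAuthHandler,
    HTTPPasswordMgrWithDefaultRealm,
    build_opener,
)

# Endpoint taken from the example URLs in the record.
BASE_URL = "http://dcl.bas.bg/collocations/"


def build_collocations_url(word, lang=None, **extra):
    """Build a collocations query URL (hypothetical helper).

    `word` and the optional `lang` correspond to the parameters shown in
    the record's examples; `extra` passes through any further
    NoSketchEngine arguments unchanged.
    """
    params = {"cmd": "collocations", "word": word}
    if lang is not None:
        params["lang"] = lang
    params.update(extra)
    # urlencode percent-encodes non-ASCII text such as Cyrillic as UTF-8.
    return BASE_URL + "?" + urlencode(params)


def fetch_collocations(word, lang=None, user="bulnc", password="bulnc", timeout=30):
    """Send the query and return the raw response body as bytes.

    Assumption: the service expects HTTP Basic authentication.
    """
    manager = HTTPPasswordMgrWithDefaultRealm()
    manager.add_password(None, BASE_URL, user, password)
    opener = build_opener(HTTPBasicAuthHandler(manager))
    with opener.open(build_collocations_url(word, lang), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Only builds and prints the URLs; no network request is made here.
    print(build_collocations_url("нет"))
    print(build_collocations_url("нет", lang="bg"))
```

Building the query string with `urlencode` rather than by string concatenation keeps Cyrillic search words correctly escaped in the request.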

Distribution

Availability

Available - Restricted Use

Start date: 09/01/2011

Licence

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Web Executable

Other

Restrictions: Academic - Non Commercial Use

Execution location: hidden

Distribution Access/Medium: Accessible Through Interface

IPR Holder

Institute for Bulgarian Language

Contact Person

Ivelina Stoyanova

text

Multilingual text corpus

Languages

Bulgarian

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel

Size

1,202,209,147 Tokens

Multilingual text corpus

Languages

Tajik (160,123 Tokens)
Turkmen (127,430 Tokens)
Serbian (1,832,323 Tokens)
Catalan; Valencian (640,522 Tokens)
Basque (461,080 Tokens)
Swedish (180,752,058 Tokens)
Turkish (13,297,328 Tokens)
Arabic (2,446,857 Tokens)
Azerbaijani (137,238 Tokens)
Maltese (163,515,445 Tokens)
Dutch (204,309,755 Tokens)
Portuguese (211,824,204 Tokens)
Albanian (9,781,443 Tokens)
Ukrainian (744,815 Tokens)
Mongolian (135,076 Tokens)
English (260,681,821 Tokens)
Russian (3,293,243 Tokens)
Kirghiz; Kyrgyz (135,031 Tokens)
Kazakh (486,766 Tokens)
Georgian (128,502 Tokens)
Norwegian (1,588,561 Tokens)
Slovak (189,752,630 Tokens)
Polish (197,762,449 Tokens)
Danish (190,843,358 Tokens)
Spanish (191,092,782 Tokens)
Romanian (235,859,637 Tokens)
Czech (196,769,297 Tokens)
Greek (229,749,068 Tokens)
Chinese (229,293 Tokens)
Hungarian (183,530,929 Tokens)
Finnish (156,288,741 Tokens)
Slovene (188,776,967 Tokens)
Estonian (160,175,247 Tokens)
Lithuanian (170,381,570 Tokens)
German (194,497,872 Tokens)
Bosnian (6,195,646 Tokens)
Italian (209,083,677 Tokens)
Croatian (11,950,183 Tokens)
Galician (629,272 Tokens)
Macedonian (9,542,940 Tokens)
Latvian (167,600,804 Tokens)
Japanese (50,194 Tokens)
Icelandic (762,894 Tokens)
Armenian (139,802 Tokens)
Hebrew (2,872,765 Tokens)
Irish (13,287,693 Tokens)
French (231,486,663 Tokens)
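The per-language entries above all follow the pattern `Language (N Tokens)`, so they can be read into a table for sorting or bucketing by size. The following Python sketch is illustrative only: the function name `parse_language_sizes` is hypothetical, and the sample string contains just four entries copied from the list.

```python
import re

# One entry looks like "Catalan; Valencian (640,522 Tokens)".
ENTRY = re.compile(r"([^()]+?)\s*\(([\d,]+) Tokens\)")


def parse_language_sizes(text):
    """Map each language name to its token count as an integer."""
    sizes = {}
    for name, count in ENTRY.findall(text):
        sizes[name.strip()] = int(count.replace(",", ""))
    return sizes


# Four entries copied from the list above; paste the full list to process all of them.
sample = (
    "Tajik (160,123 Tokens) Catalan; Valencian (640,522 Tokens) "
    "Japanese (50,194 Tokens) English (260,681,821 Tokens)"
)

sizes = parse_language_sizes(sample)

# Print the languages from largest to smallest.
for name, tokens in sorted(sizes.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: {tokens:,}")
```

The pattern allows semicolons in names, so entries such as "Catalan; Valencian" stay together as one language.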

Linguality

Linguality type: Multilingual

Multi-linguality type: Parallel

Size

4,195,791,994 Tokens

Character encoding

UTF-8

Annotation

Segmentation

Segmentation level: Word

Lemmatization

Segmentation level: Word

Segmentation

Segmentation level: Sentence

Morphosyntactic Annotation - Basic POS Tagging

Segmentation level: Word

Alignment

Segmentation level: Sentence

Resource Creation

Resource Creator

Institute for Bulgarian Language

Funding Project

Bulgarian National Corpus project (BulNC)

URL: http://ibl.bas.bg/en...

Funding Type: National Funds

Project duration: 12/17/2009 - 06/17/2013

Central and South-East European Resources (CESAR)

URL: http://cesar.nytud.hu/

Funding Type: EU Funds

Project duration: 02/01/2011 - 01/30/2013

Metadata

Created: 11/20/2011

Last Updated: 02/03/2020

Version

Version: 2.0

Last Updated: 07/20/2012

Validation

Validated

Usage

Foreseen Use: NLP Applications, Human Use

Actual Use: Human Use

Documentation

Tool Documentation: Help Functions, Manual, None

Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, Ekaterina Tarpomanova. The Bulgarian National Corpus: Theory and Practice in Corpus Design. – Journal of Language Modelling, 2012, 1 (1), pp. 65-110. ISSN: 2299-8470 http://nlp.ipipan.wa...

Koeva, Svetla, Ivelina Stoyanova, Rositsa Dekova, Borislav Rizov, Angel Genov. Bulgarian X-language Parallel Corpus. – In: Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC'12), Istanbul: European Language Resources Association (ELRA), 2012, pp. 51-62. ISBN: 978-2-9517408-7-7.
