Web-based infrastructure for Bulgarian data processing
The Bulgarian Language Processing Chain includes the following types of text processing and linguistic annotation: sentence segmentation; tokenisation; POS tagging and grammatical annotation; lemmatisation. The Bulgarian POS tagger assigns to each word the most probable part of speech and unambiguous morphosyntactic information, chosen from the set of tags associated with that word. The tagger is based on SVM (Support Vector Machine) learning and predicts the POS tag of a word from a set of features describing the word and its context. These features are words, word bigrams and trigrams within a window of words around the currently tagged word; POS tags, POS tag bigrams and trigrams in the current window; and, for unknown words, information about suffixes, prefixes, capitalization, hyphenation, etc. The tagger is trained and tested on a manually POS-disambiguated corpus. The training strategy chosen for the Bulgarian tagger is two passes in both directions; a window of five tokens, with the currently tagged word in the second position; bigrams and trigrams of words, tags or ambiguity classes; and lexical parameters such as prefixes, suffixes, sentence borders and capital letters. The trained model is then applied to disambiguate texts. The precision of the tagger is currently 96.58%.
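The context features described above can be illustrated with a minimal sketch. It assumes a five-token window with the current word in the second position (one token to the left, three to the right); the feature names, the padding symbol and the three-character prefix and suffix length are illustrative assumptions, not the tagger's actual feature set. POS-tag and ambiguity-class n-grams would be built the same way over tag sequences.

```python
PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def window(tokens, i, left=1, right=3):
    """Return the five-token window with the current token in second position."""
    return [tokens[j] if 0 <= j < len(tokens) else PAD
            for j in range(i - left, i + right + 1)]


def extract_features(tokens, i, known_words=frozenset()):
    """Build a feature dictionary for the token at index i."""
    win = window(tokens, i)
    feats = {}
    # Word unigrams, keyed by offset relative to the current token (-1 .. +3).
    for offset, word in zip(range(-1, 4), win):
        feats[f"w[{offset}]"] = word
    # Word bigrams and trigrams, keyed by the offset of their first token.
    for n, name in ((2, "bi"), (3, "tri")):
        for k in range(len(win) - n + 1):
            feats[f"{name}[{k - 1}]"] = " ".join(win[k:k + n])
    # Lexical features, added only for words missing from the known vocabulary.
    current = tokens[i]
    if current.lower() not in known_words:
        feats["prefix3"] = current[:3]
        feats["suffix3"] = current[-3:]
        feats["is_capitalized"] = current[:1].isupper()
        feats["has_hyphen"] = "-" in current
    # Sentence border indicators.
    feats["sentence_start"] = i == 0
    feats["sentence_end"] = i == len(tokens) - 1
    return feats


tokens = ["Това", "е", "тест", "."]
features = extract_features(tokens, 0)
```

A dictionary like `features` would typically be converted to a sparse vector before being passed to an SVM classifier.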
The Bulgarian lemmatizer determines the lemma and detailed morphosyntactic annotation of a given word form. Lemmatization is based on an unambiguous association between the tagger output and the information encoded in a large grammatical dictionary of Bulgarian. Tagging uses a reduced tagset (75 word classes, compared to 1029 unique grammatical tags in the dictionary), compiled so that it retains the minimum information necessary for unambiguous association with the respective lemma. A small number of rules and preferences are also implemented to limit ambiguity in lemmatization. Some additional tools for advanced processing and annotation are available, as well as tools for annotation and alignment of parallel texts at the sentential and subsentential level.
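The association between tagger output and dictionary entries can be sketched as a lookup. The dictionary entries, the tag strings and the `reduce_tag` mapping below are invented for illustration and do not reproduce the actual tagsets; the sketch only shows how a reduced tag can select a single lemma among homographs.

```python
# Toy grammatical dictionary: word form -> list of (lemma, detailed tag).
# Entries and tag strings are hypothetical.
DICTIONARY = {
    "коси": [("коса", "N.f.pl"), ("кося", "V.pres.3sg")],
    "книги": [("книга", "N.f.pl")],
}


def reduce_tag(detailed_tag):
    """Hypothetical reduction: keep only the word class before the first dot."""
    return detailed_tag.split(".")[0]


def lemmatize(word_form, tagger_tag):
    """Return (lemma, detailed tag) if the reduced tag selects exactly one entry."""
    candidates = [
        (lemma, tag)
        for lemma, tag in DICTIONARY.get(word_form.lower(), [])
        if reduce_tag(tag) == tagger_tag
    ]
    if len(candidates) == 1:
        return candidates[0]
    # Unknown word or remaining ambiguity: additional rules and
    # preferences would be applied at this point.
    return None


noun_reading = lemmatize("коси", "N")
verb_reading = lemmatize("коси", "V")
```

In this toy setup the same word form resolves to different lemmas depending on the tag supplied by the tagger.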
A highly scalable web-service-based infrastructure was developed to provide easy access to the tools for text processing and annotation of Bulgarian. Three types of access are provided: online access; access via a RESTful API; and asynchronous access.
Online access is suitable for users who occasionally need to process relatively small amounts of data. RESTful API access is suitable for software developers who want to integrate the processing tools into high-level applications. Asynchronous access is intended for processing large corpora: the user uploads the archived corpus, it is processed on the server, a notification email is sent upon completion of the task, and the annotated corpus can then be downloaded.
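For the RESTful access mode, a client call might look roughly like the following sketch. The endpoint URL, the `text` and `format` parameter names and the JSON request body are all assumptions made for illustration; the actual API of the service is not specified here.

```python
import json
import urllib.request

# Hypothetical endpoint; the real service URL is not given in the text.
API_URL = "https://example.org/api/annotate"


def build_request(text, fmt="json"):
    """Build a POST request carrying the text and the desired return format."""
    # Parameter names "text" and "format" are assumed, not documented.
    body = json.dumps({"text": text, "format": fmt}).encode("utf-8")
    return urllib.request.Request(
        API_URL,
        data=body,
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


def annotate(text, fmt="json", timeout=30):
    """Send the request and return the response body as a string."""
    with urllib.request.urlopen(build_request(text, fmt), timeout=timeout) as resp:
        return resp.read().decode("utf-8")


request = build_request("Това е тест.", fmt="xml")
```

Only `build_request` is exercised here; `annotate` would need a reachable service to run.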
The system is highly scalable and can be distributed across different machines. The service infrastructure consists of three main components: Frontend, Backend and TaskDispatcher, each of which can be deployed on a different machine.
The Frontend component is responsible for implementing the access policies of the service APIs, error handling, logging, support for different return formats (XML, JSON, plain text), and communication with the Backend. The Frontend also provides the Web UI through which the user controls asynchronous tasks: starting, stopping or monitoring a task and uploading/downloading data. The Backend performs the actual processing: it combines the Bulgarian tokenizer, sentence splitter, tagger and lemmatiser in the form of a server application which handles the requests of the Frontend over TCP/IP. Although the Frontend is implemented efficiently and can handle many requests simultaneously, several instances of the Frontend can be distributed on different machines whenever necessary.
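The Frontend's support for several return formats can be sketched as a single rendering function over the annotated output. The in-memory representation (a list of sentences, each a list of token dictionaries), the XML element names and the plain-text layout are assumptions for illustration, not the service's actual output schema.

```python
import json
import xml.etree.ElementTree as ET


def render(sentences, fmt):
    """Serialize annotated sentences as JSON, XML or plain text."""
    if fmt == "json":
        return json.dumps(sentences, ensure_ascii=False)
    if fmt == "xml":
        # Element and attribute names are hypothetical.
        root = ET.Element("text")
        for sentence in sentences:
            s_el = ET.SubElement(root, "s")
            for tok in sentence:
                t_el = ET.SubElement(s_el, "tok", pos=tok["pos"], lemma=tok["lemma"])
                t_el.text = tok["word"]
        return ET.tostring(root, encoding="unicode")
    if fmt == "plain":
        # One sentence per line, tokens written as word/pos/lemma.
        return "\n".join(
            " ".join(f'{t["word"]}/{t["pos"]}/{t["lemma"]}' for t in sentence)
            for sentence in sentences
        )
    raise ValueError(f"unsupported format: {fmt}")


sample = [[{"word": "коси", "pos": "V", "lemma": "кося"}]]
as_plain = render(sample, "plain")
```

Keeping serialization in one place means the Backend can return a single internal structure regardless of the format the client requested.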
The TaskDispatcher is responsible for managing the processes of the asynchronous tasks. It receives the start/stop commands from the Frontend and notifies the user by e-mail when the result is ready.
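The task life cycle handled by the TaskDispatcher can be sketched as a small state tracker. The class below is a simplified illustration under stated assumptions: the state names, the method names and the `notify` callback (standing in for the e-mail notification) are hypothetical, and real process management is omitted.

```python
class TaskDispatcher:
    """Track asynchronous tasks and notify the user when a task completes."""

    def __init__(self, notify):
        # notify(email, task_id) stands in for sending a notification e-mail.
        self._notify = notify
        self._tasks = {}  # task_id -> {"state": str, "email": str}

    def start(self, task_id, email):
        """Handle a start command received from the Frontend."""
        self._tasks[task_id] = {"state": "running", "email": email}

    def stop(self, task_id):
        """Handle a stop command; only running tasks can be stopped."""
        task = self._tasks[task_id]
        if task["state"] == "running":
            task["state"] = "stopped"

    def finish(self, task_id):
        """Mark a running task as done and notify its owner."""
        task = self._tasks[task_id]
        if task["state"] != "running":
            return False
        task["state"] = "done"
        self._notify(task["email"], task_id)
        return True

    def state(self, task_id):
        """Return the current state, e.g. for monitoring from the Web UI."""
        return self._tasks[task_id]["state"]


sent = []
dispatcher = TaskDispatcher(lambda email, task_id: sent.append((email, task_id)))
dispatcher.start("task-1", "user@example.org")
dispatcher.finish("task-1")
```

A stopped task is never reported as finished, so no notification is sent for it.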