Semadict design
Research
Topics:
- Ontology
- Ontology transformation
- Model-driven design & ontology
- Database design: http://www.databaseanswers.org/data_models/index.htm
- Rule-based MT is a good fit for dictionary creation; it covers the transfer-based, interlingual, and dictionary-based machine translation paradigms.
- The basic approach links the structure of the input sentence to the structure of the output sentence, using a parser/analyser for the source language, a generator for the target language, and a transfer lexicon for the actual translation (a toy sketch of this pipeline follows this topic list).
- Lexical database design
- open
- Methodology: More than just a taxonomy of linguistic terms, GOLD is founded on principles of ontological engineering: for example, it provides rich axiomatization of classes and relations. GOLD was initially constructed top-down from SIL International's online glossary of linguistic terms and standard linguistics sources such as David Crystal's Cambridge Encyclopedia of Language. To supplement the original development, a new methodology for concept acquisition is being developed by Will Lewis and Scott Farrar, whereby GOLD can be constructed on an empirical basis (see the Data-Driven Linguistic Ontology project). GOLD has been mapped to the Suggested Upper Merged Ontology (SUMO). The GOLD Community of Practice is also being implemented as a building block for a cyberinfrastructure for linguistics.
- WordNet ontology, SUMO
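As a deliberately tiny illustration of the transfer-based pipeline mentioned in the topic list above, the Python sketch below analyses a source sentence, maps lemmas through a transfer lexicon, and applies one structural rule during generation. The lexicon, tags, and rule are invented for the example; a real system would use full morphological analysis and many transfer rules.

```python
# Toy illustration of the transfer-based MT pipeline: analyse the source
# sentence, map lemmas through a transfer lexicon, generate the target
# sentence.  All data below is made up for the example.

TRANSFER_LEXICON = {             # (source lemma, POS) -> target lemma
    ("the", "DET"): "la",
    ("white", "ADJ"): "blanca",
    ("house", "N"): "casa",
}

def analyse(sentence):
    """Stand-in for a real source-language parser/analyser: tag each word."""
    tags = {"the": "DET", "white": "ADJ", "house": "N"}
    return [(w, tags.get(w, "N")) for w in sentence.lower().split()]

def transfer(analysis):
    """Replace each analysed source unit with its target lemma."""
    return [(TRANSFER_LEXICON.get((lemma, pos), lemma), pos)
            for lemma, pos in analysis]

def generate(units):
    """Stand-in for a target-language generator: apply one structural rule,
    moving an adjective after the noun it precedes (Spanish order)."""
    out = list(units)
    for i in range(len(out) - 1):
        if out[i][1] == "ADJ" and out[i + 1][1] == "N":
            out[i], out[i + 1] = out[i + 1], out[i]
    return " ".join(lemma for lemma, _ in out)

print(generate(transfer(analyse("The white house"))))   # -> "la casa blanca"
```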
Tools
- GOLD linguistic ontology: http://linguistics-ontology.org/version
- Manipulation tools: http://uakari.ling.washington.edu/e-linguistics/eltk.html
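For quick inspection of the GOLD ontology outside the ELTK, a generic RDF library also works. The sketch below is an assumption-laden example using rdflib on a locally downloaded copy of the ontology; the filename "gold.owl" is hypothetical, and rdflib is not one of the tools listed above.

```python
# Minimal sketch: list the classes of the GOLD ontology with their labels.
# Assumes rdflib is installed and the OWL file has been saved locally as
# "gold.owl" (hypothetical filename).
from rdflib import Graph
from rdflib.namespace import RDF, RDFS, OWL

g = Graph()
g.parse("gold.owl", format="xml")   # GOLD is distributed in OWL/RDF-XML

for cls in g.subjects(RDF.type, OWL.Class):
    for label in g.objects(cls, RDFS.label):
        print(cls, label)
```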
Lucene - stemmers for indexing
You could use Lucene (or any text-search engine) to store your documents, combined with a custom stemmer that indexes your document text based on meaning rather than word variations.

Normally, stemmers convert all variations of a word to the base word stem. For example, although the document is stored and retrieved with its text as-is, any of the words "sing, singing, sang, sung" would be indexed as "sing", so a search for "sing" hits every document containing sing, singing, sang or sung. Similarly, the search terms are also stemmed, so searching for any of "sing, singing, sang or sung" behaves as if "sing" were the search term.

Standard stemmers deal with the usual English variations of words, but you could create one that stems based on meaning. For example, you might create a stemmer that maps any of "problem, issue or complaint" to "problem", and so on for all words you want to link. The advantage of using a stemmer is that all the search-related heavy lifting is done for you by the text search engine (and text search engines are incredibly fast).

When it comes to implementation, you could make the linkages data-driven: either generate the stemmer code from data in a database, or make it dynamic and look up the database whenever an index/search operation is done, or somewhere in between, caching the values and refreshing them periodically.
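In Lucene itself this would be implemented as a custom analyzer/token filter in Java. The Python sketch below only illustrates the underlying idea: a data-driven "meaning stemmer" applied identically at index time and at query time. The MEANING_MAP table is a hypothetical stand-in for rows loaded from a database.

```python
# Illustrative sketch only (not Lucene code): map each token to a canonical
# "meaning stem" so that synonyms index and search as the same term.
import re

MEANING_MAP = {          # hypothetical, data-driven synonym table
    "issue": "problem",
    "complaint": "problem",
    "problem": "problem",
}

def meaning_stem(text):
    """Tokenize, lowercase, and replace each token by its canonical term."""
    tokens = re.findall(r"\w+", text.lower())
    return [MEANING_MAP.get(tok, tok) for tok in tokens]

# The same function must be applied to documents at index time and to queries
# at search time, so a query for "complaint" matches a document saying "issue".
print(meaning_stem("The customer filed a complaint"))  # ...'problem'
# Note: this does no morphological stemming ("issues" stays "issues");
# in practice you would chain it after a standard stemmer.
```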
Apertium
Rule-based translation: users edit the source/target language rules and dictionaries, and a bilingual dictionary is generated.
Has an Android app, is fully open source, and is supported by (and possibly used in) Google translation.
http://meta.wikimedia.org/wiki/Machine_translation
The corpora files can be found in the 'corpora' folder; the tools used to create them (along with some very basic instructions) are in the 'tools' folder. Different databases are offered for each language, among them:
- 'pars': paragraphs as extracted from Wikipedia.
- 'sent': sentences extracted from the corpus above, using the TrainPunkt and Punkt scripts (which wrap the NLTK Punkt module). This is probably what you want if you are not sure.
- 'punkt': the trained model from which the corpus above was generated. It can be used with the Punkt script in the tools folder, but remember that this algorithm was designed for unsupervised learning from the text it is expected to be applied to, not for generalized sentence segmentation; i.e., you might want to use the 'sent' corpus to train your own, supervised, tokenizer.
- 'tokens': the paragraph corpus tokenized, lowercased, with sentence markers ("<s>" and "</s>"), and, where applicable, filtered of exceptionally long words and sentences.
- 'lm3': an ARPA-format language model based on 3-grams.
- 'lm5': an ARPA-format language model based on 5-grams.
- 'dict': a sorted, lowercased vocabulary in the format "token count" (this type of file usually contains noise at the bottom, with low-count tokens).
The first two letters of each filename indicate the language, as used by Wikipedia (ISO 639-1). The date indicates the date of the Wikipedia dump (as available from http://dumps.wikimedia.org/backup-index.html), not the date the corpus was generated.
Language-specific information:
- For Georgian ('ka'), the corpus contains noise (such as HTML formatting and English text). The sentences were split with a standard Punkt training and probably contain a number of errors.
- For Galician ('ga'), the tokens were obtained with the FreeLing tokenizing module.
- For Italian ('it'), the tokens and the sentences were obtained with FreeLing tokenizing modules.
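The 'sent' and 'punkt' files were produced with NLTK's Punkt module. For reference, an unsupervised Punkt training run over a raw paragraph file looks roughly like the sketch below; the filename "xx.pars.txt" is hypothetical, so point it at the 'pars' file for your language.

```python
# Rough sketch: train an unsupervised Punkt sentence model on a raw paragraph
# corpus, then apply it.  Punkt learns abbreviations, collocations, etc.
# directly from the text, with no labelled data.
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

trainer = PunktTrainer()
with open("xx.pars.txt", encoding="utf-8") as f:
    trainer.train(f.read(), finalize=False)   # train() can be called per chunk
trainer.finalize_training()

tokenizer = PunktSentenceTokenizer(trainer.get_params())
print(tokenizer.tokenize("Dr. Smith arrived at 5 p.m. He left soon after."))
```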