A new online simplifying and summarizing service using Basic English


Communication is probably one of the greatest challenges of our time. At the technical level, telecommunications is one of the fastest growing industries. However, once a physical connection between individuals is established, the next step is understanding, which is normally achieved through language. English is today the predominant language for human interaction across the world. Two thirds of Internet traffic is in English, and most of the top universities, business transactions, entertainment, and scientific and technological publications use English.

Many translation tools are available to translate between languages, but they are in general of very low quality because of the complexity of correctly deriving the equivalence between words and phrases in two distinct natural languages. Translation is difficult for numerous reasons, including the lack of one-to-one word correspondences among languages, the existence of homonyms in every language, and the fact that natural grammars are idiosyncratic: they do not conform to an exact set of rules that would allow direct, word-for-word substitution. Many artificial-intelligence research efforts have been directed toward a computational "understanding" of these idiosyncrasies, and their limited success testifies to the complexity of the problem. Google Translate is certainly very effective for everyday material translated between major languages, but when more complex material or less common languages are involved, the result can be disappointing.

An alternative approach is to interact in a language that is widely understood and that many people wish to learn, even if only at a basic conversational level for interaction and entertainment, as is the case with English. The difficulty then is how to assimilate complex material when only a colloquial level of knowledge and a limited vocabulary are available.

Chinese writing, for instance, possesses more than 40,000 mainly ideographic signs, but knowledge of 2,000 to 4,000 is enough for most purposes. Chinese writing, insofar as it is phonetic, is also monosyllabic, because the words of the language consist of only one syllable and there is a large number of homophones. This made it important to have signs that distinguish between these homophones, and so the script avoided being purely phonetic. Even here, an early simplification such as the one performed by James Yen in 1923 resulted in a selection of 1,200 of the traditional characters, forming what can be called Basic Chinese and enabling illiterate people to read in this system after four months' work. A later refinement by Yuan Chao produced a system of about 2,500 of the traditional characters, which was claimed to cover basically all of the language. The Japanese resolved the basic linguistic problem by adding hiragana; children are taught 1,200 of the 40,000 symbols, which often contain a Chinese root and suffixes.

Another attempt at devising a simplified version of a language is Basic English, proposed by Charles K. Ogden in the 1920s. The fact that it is possible to say almost everything we normally wish to say with 850 words makes Basic English extremely attractive. With the addition of 100 words required for any general field such as science, and 50 internationally recognized words, a total of 1,000 words enables successful communication. Clearly, where complex or ambiguous material is turned from English into a reduced-vocabulary representation, there will be some loss of semantic content. However, material of a legal, business, scientific or technological nature is normally written to be both precise and clear, and is therefore amenable to a reduced-vocabulary representation. On the Internet, on the other hand, once scientific and technological words are included, the required vocabulary comes closer to 100,000 words and is therefore well beyond what the majority of non-native English speakers can understand.

A new site, www.simplish.org, is now commercially available. It implements an automatic translation tool that converts Standard English into Basic English, so that a user with even a basic conversational level of English can understand English content, however complex. More complex scientific words are explained in footnotes wherever they occur, using these 1,000 basic words. The service can be used for free 5 times daily to process documents of fewer than 5,000 words, whereas registered users can process files of up to 25,000 words, have some space for personal files on the server, and add words to a personal dictionary so that the system can adapt to each user’s level of knowledge. This is a novel and very timely development that will no doubt help the millions of Internet users who need to read English texts but do not know the language well enough to do so with ease.
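The footnote idea described above can be illustrated with a short sketch. This is not the site's implementation: the GLOSSARY entry and the add_footnotes function are hypothetical names made up for this example, and the explanation text is only meant to show what a definition written in simple words might look like.

```python
import re

# Hypothetical glossary: complex word -> explanation in simple words.
# The single entry is an illustration, not data taken from the site.
GLOSSARY = {
    "photosynthesis": "the way green plants make food from light, air and water",
}


def add_footnotes(text, glossary):
    """Mark every glossary word in `text` and collect numbered footnotes."""
    notes = []      # footnote lines, in order of first appearance
    numbers = {}    # word -> footnote number already assigned

    def mark(match):
        word = match.group(0)
        key = word.lower()
        if key not in glossary:
            return word  # ordinary word, leave untouched
        if key not in numbers:
            numbers[key] = len(notes) + 1
            notes.append(f"[{numbers[key]}] {key}: {glossary[key]}")
        return f"{word}[{numbers[key]}]"

    body = re.sub(r"[A-Za-z]+", mark, text)
    return body, notes


body, notes = add_footnotes("Photosynthesis needs light.", GLOSSARY)
print(body)
print("\n".join(notes))
```

A repeated complex word reuses its first footnote number, so a long document carries one explanation per term rather than one per occurrence.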


Currently, the site has three specialized dictionaries and extended vocabularies (science, legal and business), supported by a 50-word vocabulary of common international words. So the basic vocabulary of 850 words, based on those originally selected by Ogden, can be supplemented by an extended 100-word vocabulary and the international vocabulary to give a total vocabulary of 1,000 words.
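As a rough sketch of how such a layered vocabulary could be put together, the example below combines a base list, one field list and the international list, and then reports which words of a text fall outside the result. All names here (BASE_WORDS, FIELD_WORDS, INTERNATIONAL_WORDS, build_vocabulary, out_of_vocabulary) are assumptions for illustration, and the word sets are tiny samples standing in for the full 850, 100 and 50 words. Inflected forms are not handled.

```python
import re

# Sizes quoted in the text: 850 basic + 100 field + 50 international words.
LIST_SIZES = {"basic": 850, "field": 100, "international": 50}
assert sum(LIST_SIZES.values()) == 1000

# Tiny illustrative samples only, not the real word lists.
BASE_WORDS = {"a", "the", "is", "of", "in", "and", "water", "plant", "light", "food"}
FIELD_WORDS = {
    "science": {"cell", "experiment"},
    "legal": {"contract"},
    "business": {"invoice"},
}
INTERNATIONAL_WORDS = {"hotel", "radio", "telephone"}


def build_vocabulary(field):
    """Union of the base list, one field list and the international list."""
    return BASE_WORDS | FIELD_WORDS[field] | INTERNATIONAL_WORDS


def out_of_vocabulary(text, vocabulary):
    """Words of `text` not in `vocabulary`, in order of first appearance."""
    missing = []
    for word in re.findall(r"[a-z]+", text.lower()):
        if word not in vocabulary and word not in missing:
            missing.append(word)
    return missing


science = build_vocabulary("science")
print(out_of_vocabulary("Photosynthesis in the plant cell", science))
```

The words reported as missing are the candidates that would need either a paraphrase or a footnote.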

This site also offers a multi-lingual and multi-document summarizing service, based on Google Translate and the ideas of C.S. Peirce about abduction, rather than conventional summaries based on word frequency. The first document serves as a guide for the system about what to extract from the rest of the documents. Typically, this is either an abstract or a short Wikipedia description. The system then uses a general cognition engine and Simplish to generate a multi-document summary, based on a representation of knowledge in the form of a sequence of multi-dimensional ideograms, very similar to the ideas behind Chinese symbols. This makes the system capable of “understanding” language and producing more coherent, better-quality summaries than those produced using conventional methods. All words in Basic English are related to each other in a multidimensional kernel, which enables the cognition engine to understand the meaning of each phrase. Thus, the criterion for including a given phrase in the summary is its relevance to the first reference document. This is important because highly relevant information might well not be mentioned many times and will therefore be missed by conventional methods. Indeed, it is often the case that crucial information appears in only one or two documents while all the rest repeat basically the same points.
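The difference between selecting by relevance to a reference document and selecting by frequency can be shown with a much simpler stand-in. The sketch below does not reproduce the cognition engine or the ideogram representation; it swaps in plain bag-of-words cosine similarity, and the function names (bag_of_words, cosine, select_relevant) are assumptions for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase alphabetic words in `text`."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_relevant(reference, sentences, k=2):
    """Keep the k sentences most similar to `reference`, in original order."""
    ref = bag_of_words(reference)
    ranked = sorted(sentences, key=lambda s: cosine(ref, bag_of_words(s)),
                    reverse=True)
    chosen = set(ranked[:k])
    return [s for s in sentences if s in chosen]


reference = "solar panels turn light into electricity"
sentences = [
    "The weather was good.",
    "The weather was good again.",
    "Panels turn light into electricity at low cost.",
]
print(select_relevant(reference, sentences, k=1))
```

In this toy input the repeated point about the weather would win on frequency, while the sentence that appears only once is the one selected, because it is the only one related to the reference.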

Finally, it is worth mentioning that this approach of simplifying by reducing the number of words used is an excellent tool for data mining, since it is often the vocabulary used by different authors that makes extracting, clustering and analyzing information hard.
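A small sketch can make the data-mining point concrete: two authors describing the same thing with different words look dissimilar until both texts are mapped onto a shared reduced vocabulary. The TO_BASIC mapping below is invented for this example and is not taken from Ogden's lists or from the site; Jaccard overlap is used only as a simple similarity measure.

```python
import re

# Hypothetical mapping from varied vocabulary to one shared simple word.
TO_BASIC = {"purchase": "get", "acquire": "get", "obtain": "get"}


def tokens(text):
    """Set of lowercase alphabetic words in `text`."""
    return set(re.findall(r"[a-z]+", text.lower()))


def normalize(words):
    """Replace each word by its simple equivalent when one is known."""
    return {TO_BASIC.get(w, w) for w in words}


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


first = tokens("They purchase goods")
second = tokens("They acquire goods")
print(jaccard(first, second))                        # overlap on raw words
print(jaccard(normalize(first), normalize(second)))  # overlap after mapping
```

With the raw words the two sentences share two of four distinct words; after mapping they become identical, which is the effect that makes clustering and comparison easier.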

