A new online simplifying and summarizing service using Basic English
Communication is probably one of the greatest challenges of our time.
At the technical level, telecommunications is one of the fastest
growing industries. However, once physical communication is achieved
between individuals, the next step is understanding, which is normally
achieved using language. English is the predominant language for human
interaction across the world today. Two thirds of Internet traffic is
in English. Most of the top universities, business transactions and
entertainment, as well as scientific and technological publications,
use English.
Many translating tools are available to translate
between languages but they are in general of very low quality due to
the complexities of correctly deriving the equivalence between words
and phrases in two distinct natural languages. Translation
is difficult for numerous reasons, including the lack of one-to-one
word correspondences among languages, the existence in every language
of homonyms, and the fact that natural grammars are idiosyncratic;
they do not conform to an exact set of rules that would facilitate
direct, word-to-word substitution. It is toward a computational
"understanding" of these idiosyncrasies that many
artificial-intelligence research efforts have been directed, and
their limited success testifies to the complexity of the problem. Google Translate is certainly very effective for everyday material translated between major languages, but if more complex material or less common languages are used, the result can be disappointing.
An alternative approach is to interact in a language which is widely understood and which many people wish to learn, even if at a basic conversational level in order to interact and be entertained, as is the case with the English language. The difficulty then arises of how to assimilate complex material even if only a colloquial level of knowledge with a limited vocabulary is available.
Chinese writing, for instance, possesses more than
40,000 mainly ideographic signs, but knowledge of 2,000 to 4,000 is
enough for most purposes. Chinese writing, insofar as it is
phonetic, is also monosyllabic, because the words of the language
consist of only one syllable, with a large number of homophones. This
made it important to have signs that distinguish between these
homophones, and so the script avoided being purely phonetic. Even in
this case, an early simplification performed by James Yen in 1923
resulted in a selection of 1,200 of the traditional characters, forming
what can be called Basic Chinese, which enabled illiterate people to
read in this system after four months' work. A later refinement by Yuan
Chao produced a system of about 2,500 of the traditional characters,
which it was claimed can cover basically all of the language. The
Japanese resolved the basic linguistic problem by adding Hiragana;
children are taught 1,200 of the 40,000 symbols, which often contain a
Chinese root and suffixes.
Another attempt at devising a simplified version
of a language is that of Basic English, as proposed by Charles K.
Ogden in the 1920s. The fact that it is possible to say almost
everything we normally wish to say with 850 words makes Basic
English extremely attractive. With the addition of 100 words
required for any general field such as science, and 50
internationally recognized words, a total of 1,000 words enables
successful communication. Clearly, where complex or ambiguous
material is being turned from English into a reduced-vocabulary
representation, there will be some loss of semantic content.
However, material of a legal, business, scientific and technological
nature is normally specifically produced in a way that seeks to be
both precise and clear, and is therefore amenable to a
reduced-vocabulary representation.
On the Internet, on the other hand, once scientific and technological words are considered,
the required vocabulary comes closer to 100,000 words, which is well beyond what the majority of non-native English speakers can understand.
A new site is now commercially available, www.simplish.org, which has implemented an automatic translation tool that converts Standard English into Basic English, so that a user with even a basic conversational level of English can understand English content, however complex. More complex scientific words are explained in footnotes wherever they occur, using these 1,000 basic words. The service can be used free of charge 5 times daily to process documents of less than 5,000 words, whereas registered users can process files of up to 25,000 words, have some space for personal files on the server, and add words to a personal dictionary so that the system can adapt to each user's level of knowledge. This is a novel and very timely development that will no doubt help the millions of Internet users who need to read English texts but do not know the language well enough to do so with ease.
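The footnote mechanism described above can be pictured with a small sketch. This is not the site's implementation, which is not published here; it is a minimal illustration that assumes a tiny hypothetical sample of allowed words (`ALLOWED`) and a hypothetical glossary (`GLOSSARY`) whose definitions are written in simple words. The function name `annotate` is also an assumption made for the example.

```python
import re

# Hypothetical stand-ins for illustration only: a tiny sample of allowed
# words and a glossary. The service itself is described as working with
# about 1,000 Basic English words.
ALLOWED = {"the", "a", "is", "of", "in", "plants", "light", "food",
           "make", "from", "process", "by", "which"}
GLOSSARY = {
    "photosynthesis": "the process by which plants make food from light",
}

def annotate(text, allowed=ALLOWED, glossary=GLOSSARY):
    """Mark words that are outside the allowed set and have a glossary
    entry with a footnote number; return (marked_text, footnotes)."""
    footnotes = []
    numbers = {}  # term -> footnote number, so repeats reuse one note

    def mark(match):
        word = match.group(0)
        key = word.lower()
        # Leave allowed words, and words we cannot explain, untouched.
        if key in allowed or key not in glossary:
            return word
        if key not in numbers:
            numbers[key] = len(footnotes) + 1
            footnotes.append(f"[{numbers[key]}] {key}: {glossary[key]}")
        return f"{word}[{numbers[key]}]"

    marked = re.sub(r"[A-Za-z]+", mark, text)
    return marked, footnotes

if __name__ == "__main__":
    body, notes = annotate("Photosynthesis is the food process of plants.")
    print(body)
    print("\n".join(notes))
```

A real system would also have to rewrite ordinary words that fall outside the basic vocabulary, which this sketch deliberately leaves alone.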
Currently, the site has three specialized dictionaries and extended vocabularies (science, legal and business), supported by a 50-word vocabulary of common international words. The basic vocabulary of 850 words, based on those originally selected by Ogden, can therefore be supplemented by an extended 100-word vocabulary and the international vocabulary to add up to a 1,000-word vocabulary.
This site also offers a multi-lingual and multi-document summarizing service, based on Google Translate and the ideas of C. S. Peirce about abduction, rather than conventional summaries based on word frequency. The first document serves as a guide for the system about what to extract from the rest of the documents. Typically, this is either an abstract or a short Wikipedia description. The system then uses a general cognition engine and Simplish to generate a multi-document summary, based on a representation of knowledge in the form of a sequence of multi-dimensional ideograms, very similar to the ideas behind Chinese symbols. This makes the system capable of "understanding" language and producing more coherent, better quality summaries than those produced using conventional methods. All words in Basic English are related to each other in a multidimensional kernel, which enables the cognition engine to understand the meaning of each phrase. Thus, the criterion for including a given phrase in the summary is its relevance to the first reference document. This is important because highly relevant information might well not be mentioned many times and will therefore be missed by conventional methods. Indeed, it is often the case that crucial information appears in only one or two documents while all the rest mention basically the same points.
Finally, it is worth mentioning that this approach of simplifying by reducing the number of words used is an excellent tool for data mining, since it is often the vocabulary used by different authors that makes extracting, clustering and analyzing information hard.
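The difference between selecting by relevance to a reference document and selecting by frequency can be illustrated with a short sketch. It does not reproduce the site's cognition engine or its ideogram representation; as a loudly labeled substitute it uses plain bag-of-words cosine similarity, and the names `bag`, `cosine` and `summarize` are assumptions made for this example.

```python
import math
import re
from collections import Counter

def bag(text):
    """Lower-cased word counts; a crude stand-in for a real meaning model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two word-count bags."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def summarize(reference, documents, k=2):
    """Pick up to k distinct sentences most similar to the reference text.

    How often a sentence recurs across documents plays no role, so a point
    made only once can still be selected if it is relevant.
    """
    ref = bag(reference)
    seen = set()
    scored = []
    for doc in documents:
        for sentence in re.split(r"(?<=[.!?])\s+", doc.strip()):
            key = sentence.lower()
            if sentence and key not in seen:
                seen.add(key)
                scored.append((cosine(ref, bag(sentence)), sentence))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    # Drop sentences that share nothing with the reference.
    return [s for score, s in scored[:k] if score > 0]

if __name__ == "__main__":
    reference = "Basic English uses a small number of words."
    documents = [
        "The weather was fine. Ogden selected the words for Basic English.",
        "The weather was fine. Tickets cost ten dollars.",
    ]
    print(summarize(reference, documents, k=1))
```

In this toy input the sentence repeated in both documents is ignored because it shares nothing with the reference, while the sentence that appears only once is kept.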