Chinese text segmentation, also known as Chinese word segmentation, is a foundational technology for natural language processing and natural language understanding, and especially for search engines, speech synthesis (text-to-speech), automatic speech recognition, and machine translation. The process is also called tokenisation (text-to-token conversion), because raw text usually contains numbers, names, abbreviations, symbols, punctuation and so on, which must be converted into meaningful words. Many segmentation strategies and approaches have been proposed to improve the tokenised result, but unfortunately the results are still poor today. We identified almost all of the important factors that directly affect Chinese word segmentation as applied to machine translation. We invented a new approach that specifically resolves ambiguity in Chinese sentences and can separate mixed Chinese text into words with very high precision, and we established a computable model of natural language processing that has been integrated as the infrastructure of our embedded machine translation system, which is still in development. The sentential semantic model and engine can also be easily applied to other natural languages.
The structure of Chinese sentences is very flexible. The accuracy of Chinese word segmentation depends heavily on context, and a single sentence can often be segmented in several different ways, each with a different meaning. Any good word segmentation algorithm must therefore be able to handle this ambiguity. For example:
我|方可|答應|你的|要求。 (Only then can I agree to your request.)
我方|可|答應|你的|要求。 (Our side can agree to your request.)
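The ambiguity in the two readings above can be reproduced with a small sketch. This is not the Bamboo engine's algorithm, which the text does not describe; it is a plain dictionary-based enumeration, and the function name `all_segmentations` and the toy lexicon are assumptions made purely for illustration.

```python
def all_segmentations(text, lexicon, max_len=4):
    """Return every way to split `text` into words found in `lexicon`.

    Illustrative brute-force enumeration only (hypothetical helper,
    not the engine's method). `max_len` bounds the candidate word length.
    """
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, min(max_len, len(text)) + 1):
        head = text[:end]
        if head in lexicon:
            # Keep this word and segment the remainder recursively.
            for tail in all_segmentations(text[end:], lexicon, max_len):
                results.append([head] + tail)
    return results


# Toy lexicon (an assumption): just the words needed for the example.
LEXICON = {"我", "我方", "方可", "可", "答應", "你的", "要求"}

SENTENCE = "我方可答應你的要求"
segmentations = all_segmentations(SENTENCE, LEXICON)

for words in segmentations:
    print("|".join(words))
# 我|方可|答應|你的|要求
# 我方|可|答應|你的|要求
```

Both outputs are valid with respect to the lexicon, so a dictionary lookup alone cannot choose between them; picking the intended reading requires the kind of contextual or semantic analysis the text refers to.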
We developed a prototype engine designed to have an extremely small footprint: it requires very little memory and storage, so it can easily be used in embedded systems. The precompiled binary is about 190 KB, and up to about 370 KB with its data. The engine can also easily be used in server environments, including concurrent task execution, machine clusters, and cloud services.
The engine uses a very powerful morphological and semantic analyser to improve its intelligence, so that it can accurately separate unknown words, such as unknown names of things, places, people, and organisations. This feature makes the engine independent of any external dictionary or corpus while keeping its kernel very small. The engine also provides an interface that allows end users to add external words manually.
The engine is portable and cross-platform, written from scratch in ANSI C without external dependencies, so it can easily be ported to any platform, bound to different programming languages, and run anywhere. The machine translation system based on it is under continuous development; it is designed for deeply embedded environments and is completely different from Google's translation service, which is based on a server-side architecture. The project needs your help, donations, or investment to keep improving. A new programming language and computer system based on the engine are also in the initial stage of development and likewise need your enthusiasm and help; please contact us and help us turn them into final products.
$ ./bamboo -s "我們可以建議貴方在報價上定下限額嗎？"
我們|可以|建議|貴方|在|報價|上|定|下限額|嗎|？

$ ./bamboo
/_)_ _ _ /_ _ _ /_)/_|/ / //_//_//_/
Bamboo v0.2
Copyright (c) 2017 sevenuc.com.
Usage: bamboo [options] [sentence]
Options:
  -k convert mandarin to kanji.
  -s separate sentence into words.
sentence: should be quoted.
This demo program is a command-line tool that can segment Chinese text into words and convert Mandarin to Kanji or vice versa. The demo app is only a simple prototype with very limited features.
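For readers who want to call the demo tool from a script, the following is a minimal sketch. It relies only on the `-s` option and the `|`-separated output shown in the session above; the helper names `parse_segmented` and `segment`, the `./bamboo` path default, and the assumption that the tool writes UTF-8 to standard output are all ours, not something the demo documents.

```python
import subprocess


def parse_segmented(line, delimiter="|"):
    """Split one line of `bamboo -s` style output into a list of words."""
    return [word for word in line.strip().split(delimiter) if word]


def segment(sentence, binary="./bamboo"):
    """Run the demo binary on `sentence` and return the words.

    Assumptions: `binary` points at the bamboo executable, and the tool
    prints a single UTF-8 line of `|`-separated words for the `-s` option.
    """
    completed = subprocess.run(
        [binary, "-s", sentence],
        capture_output=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_segmented(completed.stdout)


# Parsing the sample output from the session above (no binary needed):
words = parse_segmented("我們|可以|建議|貴方|在|報價|上|定|下限額|嗎|？\n")
print(len(words), words[0], words[-1])
```

Passing the sentence as a separate list element avoids shell quoting issues, which is the scripted equivalent of the usage note that the sentence should be quoted.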
1: Raven's Standard IQ Test: you might want to take an accurate, standard IQ test for fun or for serious purposes. This test suite is suitable for anyone from 5-year-old children to 95-year-old elders.
2: On artificial intelligence: you might be interested in Chinese Chess, an ancient strategy game that dates back more than 2000 years. It embodies profound ancient oriental philosophy and wisdom, which makes it intellectually challenging and gives it a unique style unlike any game you have played. You can keep challenging yourself with it for a lifetime, and it is suitable for both children and parents. Chinese Chess for Beginners explains the basic rules of the game clearly and in detail so that you can start playing right away. You can download the app from the App Store and enjoy it.
- Tiny Cantonese TTS engine.
- Translation between Cantonese and Mandarin.
- Text segmentation on Wikipedia.
- Natural language processing on Wikipedia.
- Natural language understanding on Wikipedia.