Chinese text segmentation, also known as Chinese word segmentation, is a foundational technology for natural language processing and natural language understanding, especially in search engines, speech synthesis (text-to-speech systems), automatic speech recognition, and machine translation. The process is also called tokenisation (text-to-token conversion). It matters because raw text usually also contains numbers, names, abbreviations, symbols, punctuation and so on, which a text-to-speech system, for example, must convert into spoken words. Many segmentation strategies and approaches have been proposed to improve tokenisation, but unfortunately all of them still produce poor results today. We have identified almost all of the important factors that directly affect word segmentation of Chinese text intended for machine translation. We invented a brand-new approach that resolves ambiguity in Chinese sentences and separates mixed Chinese text into words with very high precision, and we established a computable model of natural language processing that has been integrated as the infrastructure of our machine translation system. The sentential semantic model and engine can also be easily applied to other natural languages.
The structure of Chinese sentences is very flexible. The accuracy of Chinese word segmentation depends heavily on context, and a sentence often has several ambiguous readings. Any good word segmentation algorithm must therefore be able to handle this complexity. For example:
我|方可|答應|你的|要求。 (Only then can I agree to your request.)
我方|可|答應|你的|要求。 (Our side can agree to your request.)
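The ambiguity in the example above can be reproduced with a very small dictionary-based enumerator. The following Python sketch is only an illustration and is not the engine's method: the toy dictionary `TOY_DICT` and the function name `all_segmentations` are invented here, and a real segmenter must also choose among the candidates, which this sketch does not attempt.

```python
from functools import lru_cache

# Toy dictionary, invented for this illustration only (an assumption,
# not the engine's word list).
TOY_DICT = frozenset({"我", "我方", "方可", "可", "答應", "你的", "要求"})


def all_segmentations(text, words=TOY_DICT):
    """Return every way to split `text` into words found in `words`."""
    max_len = max((len(w) for w in words), default=0)

    @lru_cache(maxsize=None)
    def split_from(start):
        # Reaching the end of the text yields one complete (empty) tail.
        if start == len(text):
            return [[]]
        results = []
        # Try every candidate word that begins at `start`, shortest first.
        for end in range(start + 1, min(len(text), start + max_len) + 1):
            word = text[start:end]
            if word in words:
                for rest in split_from(end):
                    results.append([word] + rest)
        return results

    return split_from(0)


if __name__ == "__main__":
    for seg in all_segmentations("我方可答應你的要求"):
        print("|".join(seg))
```

With this toy dictionary the sentence has exactly two valid splits, the two readings shown above. Dictionary lookup alone cannot decide between them, which is why context has to be taken into account.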
A very powerful morphological and semantic analyser has been built into the engine to make it more intelligent. It can accurately segment unknown words, including unknown names of things, places, people and organisations. This makes the engine independent of any external dictionary or corpus while keeping the engine kernel very small. The engine also provides an interface that lets end users append external words manually.
The segmentation engine is designed to be extremely small. It requires very little memory and storage, so it can easily be used in embedded systems. The precompiled binary is about 190 KB, and about 370 KB together with its data. The engine can equally be used in server environments, including concurrent task execution, machine clusters, and cloud services.
The engine is portable and cross-platform. It is written from scratch in ANSI C with no external dependencies, so it can easily be ported to any platform, bound to different programming languages, and run anywhere. The machine translation system based on it is updated continuously. It was designed for deeply embedded environments, which makes it completely different from Google's translation service, which is based on a server-side architecture. The project needs your help, donations or investment to keep improving. A brand-new programming language and computer system based on the engine are also at an early stage of development, and they too need your enthusiasm and help.
Do not doubt the engine's ability because of its extremely small size. This page provides a demo app that lets you enter Chinese sentences and get back the words separated by the engine. Try it and leave a comment.
$ ./bamboo -s "我們可以建議貴方在報價上定下限額嗎？"
我們|可以|建議|貴方|在|報價|上|定|下限額|嗎|？
$ ./bamboo
/_)_ _ _ /_ _ _ /_)/_|/ / //_//_//_/
Bamboo v0.2
Copyright (c) 2017 sevenuc.com.
Usage: bamboo [options] [sentence]
Options:
  -k convert mandarin to kanji.
  -m convert kanji to mandarin.
  -s separate sentence into words.
sentence: should be quoted.
The program is a standalone executable: a command-line tool that segments Chinese text into words and converts Mandarin to Kanji and vice versa. The demo app is only a simple prototype with limited features; the release version for production environments has more functions, higher accuracy, and higher performance.
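For scripting, the pipe-delimited output of the -s option can be split into a word list. The following is a minimal Python sketch, not part of the bamboo distribution. Only the -s option and the `|` delimiter come from the demo output; the binary path `./bamboo`, the helper names `parse_segments` and `segment`, and the assumption that the tool prints a single UTF-8 line to standard output are illustrative guesses.

```python
import subprocess


def parse_segments(line):
    """Split one line of pipe-delimited output into a list of words."""
    return [word for word in line.strip().split("|") if word]


def segment(sentence, binary="./bamboo"):
    """Run the bamboo binary with -s and parse what it prints.

    Assumptions (not confirmed by the source): the binary lives at
    `binary` and writes exactly one UTF-8 result line to stdout.
    """
    result = subprocess.run(
        [binary, "-s", sentence],
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return parse_segments(result.stdout)


if __name__ == "__main__":
    # Parse the sample output shown in the demo instead of calling the binary.
    sample = "我們|可以|建議|貴方|在|報價|上|定|下限額|嗎|？"
    print(parse_segments(sample))
```

Passing the sentence as a separate list element to `subprocess.run` avoids shell quoting problems, so the caller does not need to quote the sentence as is required on the command line.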
1: Raven's Standard IQ Test: if you want to take an accurate, standard IQ test, for fun or for serious purposes, this test suite is suitable for ages 5 to 95.
2: About artificial intelligence: you might be interested in Chinese Chess, an ancient strategy game that dates back more than 2000 years. It embodies profound ancient oriental philosophy and wisdom, which makes it intellectually challenging and gives it a style unlike any game you have played. You can keep challenging yourself with it for a lifetime, and it is suitable for both children and parents. Chinese Chess for Beginners explains the basic rules of the game clearly and in detail so that you can start playing right away. You can download the app from the App Store and enjoy it.
- Tiny Cantonese TTS engine.
- Translation between Cantonese and Mandarin.
- Text segmentation on Wikipedia.
- Natural language processing on Wikipedia.
- Natural language understanding on Wikipedia.