view README @ 2:754a4550c64e
- added arabic, greek, hebrew and turkish DFAs
- new UCS-2LE/BE DFAs
- now arabic_impl.c uses arabic DFAs
- dfa common macros have been moved to dfa.h
- minor cleanups
author | Yoshiki Yazawa <yaz@cc.rim.or.jp> |
---|---|
date | Wed, 11 Jun 2008 00:11:30 +0900 |
parents | d9b6ff839eab |
children |
libguess is derived from Gauche-0.8.3, a Scheme interpreter by Shiro Kawai.

int dfa_validate_utf8(const char *buf, int buflen)
  Checks whether the given string is valid UTF-8.
  buf: string to be validated.
  buflen: length of the string to be validated.
  return: 1 if buf is UTF-8, 0 if it is not.

const char *guess_jp(const char *buf, int buflen)
  Detects the character encoding of a given string in Japanese.
  buf: string to be checked.
  buflen: length of the string to be checked.
  return: an encoding name that can be fed to g_convert() or iconv().
  The encoding name is one of the following:
  UTF-16, ISO-2022-JP, EUC-JP, SJIS, UTF-8.
  The returned string is constant, so you MUST NOT free it.

  If the given string is not long enough to distinguish the encoding,
  guess_jp takes the order list into account to determine it.
  For instance, the order for Japanese is defined as

    #define ORDER_JP &utf8, &sjis, &eucj

  The leftmost encoding has the highest priority. The order is applied
  even if only two encodings are still alive: if utf8 and sjis remain,
  guess_jp will return utf8; if sjis and eucj remain, sjis will be
  returned. This means that if the scores of the remaining encodings
  are the same, the leftmost one is returned.

const char *guess_tw(const char *buf, int buflen)
const char *guess_cn(const char *buf, int buflen)
const char *guess_kr(const char *buf, int buflen)

Although guess_xx() can distinguish UCS-2BE and UCS-2LE, g_convert() cannot
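
The order-list tie-breaking described for guess_jp can be illustrated with a
small self-contained sketch. This is not libguess's actual code: the
`candidate` struct, its `alive` and `score` fields, and the `pick_by_order()`
function are hypothetical names introduced only for this example. It assumes
each candidate encoding carries a "still alive" flag and a score, and that
candidates are listed in priority order, leftmost first.

```c
#include <stddef.h>

/* Hypothetical candidate record; not libguess's real DFA type. */
typedef struct {
    const char *name;  /* encoding name to hand back to the caller */
    int alive;         /* nonzero while the input is still valid in this encoding */
    double score;      /* illustrative likelihood score */
} candidate;

/*
 * Return the name of the best surviving candidate, or NULL if none
 * survives. Candidates must be given in priority order. Only a strictly
 * greater score replaces the current best, so on a tie the leftmost
 * (highest-priority) candidate is kept.
 */
const char *pick_by_order(const candidate *order, size_t n)
{
    const candidate *best = NULL;

    for (size_t i = 0; i < n; i++) {
        if (!order[i].alive)
            continue;                      /* ruled out by the input */
        if (best == NULL || order[i].score > best->score)
            best = &order[i];              /* first survivor, or strictly better */
    }
    return best ? best->name : NULL;
}
```

With candidates listed as utf8, sjis, eucj, this sketch returns the utf8 name
when utf8 and sjis survive with equal scores, and the sjis name when only sjis
and eucj survive with equal scores, which matches the behaviour described
above.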