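Copyright notice: This article may be freely reproduced, provided that the copy states the original source, the author information, and this notice in the form of a hyperlink.
http://www.chedong.com/tech/hello_unicode.html
Keywords: linux java multibyte encoding locale i18n l10n chinese ISO-8859-1 GB2312 BIG5 GBK UNICODE
Abstract: Have you ever wondered why PHP rarely suffers from garbled text while building Web applications in Java is so troublesome? Why a Simplified Chinese query on Google can return Traditional Chinese, and even Japanese, results? And why Google can bring up its Chinese interface automatically, based on the language setting of the browser?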
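Working on many internationalized applications taught me one thing: Unicode was designed to make internationalization easier, and Java's core character type is based on Unicode, which gives an application control over Chinese text at the level of characters rather than bytes. But without a careful understanding of the rules involved, this freedom becomes a burden and leads to even more garbled text: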
To understand how a Java application handles encodings, first look at how the operating system influences the JVM's default encoding. For this I wrote Env.java, which prints the JVM properties and the locales supported on different systems. The program is simple:

    /*
     * Copyright (c) 2002 Email: chedongATbigfoot.com/chedongATchedong.com
     * $Id: hello_unicode.html,v 1.6 2003/11/09 07:57:11 chedong Exp $
     */
    import java.util.*;
    import java.text.*;

    /**
     * Purpose:
     *   display the environment and the JVM's default properties
     * Input: none
     * Output:
     *   1 supported locales
     *   2 JVM default properties
     */
    public class Env {
        /**
         * main entrance
         */
        public static void main(String[] args) {
            System.out.println("Hello, it's: " + new Date());

            // print available locales
            Locale list[] = DateFormat.getAvailableLocales();
            System.out.println("======System available locales:======== ");
            for (int i = 0; i < list.length; i++) {
                System.out.println(list[i].toString() + "\t" + list[i].getDisplayName());
            }

            // print JVM default properties
            System.out.println("======System property======== ");
            System.getProperties().list(System.out);
        }
    }
The most important thing to watch is the JVM's file.encoding property. It determines the JVM's default encoding/decoding, and therefore affects every byte stream ==> character stream decoding and every character stream ==> byte stream encoding in the application.
On Linux the locale can be set with LANG=zh_CN; LC_ALL=zh_CN.GBK; export LANG LC_ALL. The locale command displays the current environment settings.
On Windows the locale is set through Control Panel ==> Regional Settings.
GNU/Linux 2.4.x (J2SE1.3.1) LANG=en_US LC_ALL=en_US | GNU/Linux 2.4.x (J2SE1.3.1) LANG=zh_CN LC_ALL=zh_CN.GBK | Windows 2000(J2SE1.3.0) Regional settings: China, Chinese | Windows 2000(J2SE1.3.0) Regional settings: United Kingdom, English |
Hello, it's: Tue Jul 30 11:05:44 CST 2002 ======System available locales:======== en English en_US English (United States) ar Arabic ar_AE Arabic (United Arab Emirates) ar_BH Arabic (Bahrain) ar_DZ Arabic (Algeria) ar_EG Arabic (Egypt) ar_IQ Arabic (Iraq) ar_JO Arabic (Jordan) ar_KW Arabic (Kuwait) ar_LB Arabic (Lebanon) ar_LY Arabic (Libya) ar_MA Arabic (Morocco) ar_OM Arabic (Oman) ar_QA Arabic (Qatar) ar_SA Arabic (Saudi Arabia) ar_SD Arabic (Sudan) ar_SY Arabic (Syria) ar_TN Arabic (Tunisia) ar_YE Arabic (Yemen) be Byelorussian be_BY Byelorussian (Belarus) bg Bulgarian bg_BG Bulgarian (Bulgaria) ca Catalan ca_ES Catalan (Spain) ca_ES_EURO Catalan (Spain,Euro) cs Czech cs_CZ Czech (Czech Republic) da Danish da_DK Danish (Denmark) de German de_AT German (Austria) de_AT_EURO German (Austria,Euro) de_CH German (Switzerland) de_DE German (Germany) de_DE_EURO German (Germany,Euro) de_LU German (Luxembourg) de_LU_EURO German (Luxembourg,Euro) el Greek el_GR Greek (Greece) en_AU English (Australia) en_CA English (Canada) en_GB English (United Kingdom) en_IE English (Ireland) en_IE_EURO English (Ireland,Euro) en_NZ English (New Zealand) en_ZA English (South Africa) es Spanish es_BO Spanish (Bolivia) es_AR Spanish (Argentina) es_CL Spanish (Chile) es_CO Spanish (Colombia) es_CR Spanish (Costa Rica) es_DO Spanish (Dominican Republic) es_EC Spanish (Ecuador) es_ES Spanish (Spain) es_ES_EURO Spanish (Spain,Euro) es_GT Spanish (Guatemala) es_HN Spanish (Honduras) es_MX Spanish (Mexico) es_NI Spanish (Nicaragua) et Estonian es_PA Spanish (Panama) es_PE Spanish (Peru) es_PR
Spanish (Puerto Rico) es_PY Spanish (Paraguay) es_SV Spanish (El Salvador) es_UY Spanish (Uruguay) es_VE Spanish (Venezuela) et_EE Estonian (Estonia) fi Finnish fi_FI Finnish (Finland) fi_FI_EURO Finnish (Finland,Euro) fr French fr_BE French (Belgium) fr_BE_EURO French (Belgium,Euro) fr_CA French (Canada) fr_CH French (Switzerland) fr_FR French (France) fr_FR_EURO French (France,Euro) fr_LU French (Luxembourg) fr_LU_EURO French (Luxembourg,Euro) hr Croatian hr_HR Croatian (Croatia) hu Hungarian hu_HU Hungarian (Hungary) is Icelandic is_IS Icelandic (Iceland) it Italian it_CH Italian (Switzerland) it_IT Italian (Italy) it_IT_EURO Italian (Italy,Euro) iw Hebrew iw_IL Hebrew (Israel) ja Japanese ja_JP Japanese (Japan) ko Korean ko_KR Korean (South Korea) lt Lithuanian lt_LT Lithuanian (Lithuania) lv Latvian (Lettish) lv_LV Latvian (Lettish) (Latvia) mk Macedonian mk_MK Macedonian (Macedonia) nl Dutch nl_BE Dutch (Belgium) nl_BE_EURO Dutch (Belgium,Euro) nl_NL Dutch (Netherlands) nl_NL_EURO Dutch (Netherlands,Euro) no Norwegian no_NO Norwegian (Norway) no_NO_NY Norwegian (Norway,Nynorsk) pl Polish pl_PL Polish (Poland) pt Portuguese pt_BR Portuguese (Brazil) pt_PT Portuguese (Portugal) pt_PT_EURO Portuguese (Portugal,Euro) ro Romanian ro_RO Romanian (Romania) ru Russian ru_RU Russian (Russia) sh Serbo-Croatian sh_YU Serbo-Croatian (Yugoslavia) sk Slovak sk_SK Slovak (Slovakia) sl Slovenian sl_SI Slovenian (Slovenia) sq Albanian sq_AL Albanian (Albania) sr Serbian sr_YU Serbian (Yugoslavia) sv Swedish sv_SE Swedish (Sweden) th Thai th_TH Thai (Thailand) tr Turkish tr_TR Turkish (Turkey) uk Ukrainian uk_UA Ukrainian (Ukraine) zh Chinese zh_CN Chinese (China) zh_HK Chinese (Hong Kong) zh_TW Chinese (Taiwan) ======System property======== -- listing properties -- java.runtime.name=Java(TM) 2 Runtime Environment, Stand... sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386 java.vm.version=1.3.1_04-b02 java.vm.vendor=Sun Microsystems Inc. 
java.vendor.url=http://java.sun.com/ path.separator=: java.vm.name=Java HotSpot(TM) Client VM file.encoding.pkg=sun.io java.vm.specification.name=Java Virtual Machine Specification user.dir=/home/chedong/src/char_test java.runtime.version=1.3.1_04-b02 java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment os.arch=i386 java.io.tmpdir=/tmp line.separator=
java.vm.specification.vendor=Sun Microsystems Inc. java.awt.fonts= os.name=Linux java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u... java.specification.name=Java Platform API Specification java.class.version=47.0 os.version=2.4.7-10 user.home=/home/chedong user.timezone=Asia/Shanghai java.awt.printerjob=sun.awt.motif.PSPrinterJob file.encoding=ISO-8859-1
java.specification.version=1.3
user.name=chedong
java.class.path=/home/chedong/classes
java.vm.specification.version=1.0
java.home=/usr/java/jdk1.3.1_04/jre
user.language=en
java.specification.vendor=Sun Microsystems Inc.
java.vm.info=mixed mode
java.version=1.3.1_04
java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext
sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...
java.vendor=Sun Microsystems Inc.
file.separator=/
java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...
sun.cpu.endian=little
sun.io.unicode.encoding=UnicodeLittle
user.region=US
sun.cpu.isalist=
| Hello, it's: Tue Jul 30 11:07:34 CST 2002 ======System available locales:======== en 英文 en_US 英文 (美国) ar 阿拉伯文 ar_AE 阿拉伯文 (阿拉伯联合酋长国) ar_BH 阿拉伯文 (巴林) ar_DZ 阿拉伯文 (阿尔及利亚) ar_EG 阿拉伯文 (埃及) ar_IQ 阿拉伯文 (伊拉克) ar_JO 阿拉伯文 (约旦) ar_KW 阿拉伯文 (科威特) ar_LB 阿拉伯文 (黎巴嫩) ar_LY 阿拉伯文 (利比亚) ar_MA 阿拉伯文 (摩洛哥) ar_OM 阿拉伯文 (阿曼) ar_QA 阿拉伯文 (卡塔尔) ar_SA 阿拉伯文 (沙特阿拉伯) ar_SD 阿拉伯文 (苏丹) ar_SY 阿拉伯文 (叙利亚) ar_TN 阿拉伯文 (突尼斯) ar_YE 阿拉伯文 (也门) be 白俄罗斯文 be_BY 白俄罗斯文 (白俄罗斯) bg 保加利亚文 bg_BG 保加利亚文 (保加利亚) ca 加泰罗尼亚文 ca_ES 加泰罗尼亚文 (西班牙) ca_ES_EURO 加泰罗尼亚文 (西班牙,Euro) cs 捷克文 cs_CZ 捷克文 (捷克共和国) da 丹麦文 da_DK 丹麦文 (丹麦) de 德文 de_AT 德文 (奥地利) de_AT_EURO 德文 (奥地利,Euro) de_CH 德文 (瑞士) de_DE 德文 (德国) de_DE_EURO 德文 (德国,Euro) de_LU 德文 (卢森堡) de_LU_EURO 德文 (卢森堡,Euro) el 希腊文 el_GR 希腊文 (希腊) en_AU 英文 (澳大利亚) en_CA 英文 (加拿大) en_GB 英文 (英国) en_IE 英文 (爱尔兰) en_IE_EURO 英文 (爱尔兰,Euro) en_NZ 英文 (新西兰) en_ZA 英文 (南非) es 西班牙文 es_BO 西班牙文 (玻利维亚) es_AR 西班牙文 (阿根廷) es_CL 西班牙文 (智利) es_CO 西班牙文 (哥伦比亚) es_CR 西班牙文 (哥斯达黎加) es_DO 西班牙文 (多米尼加共和国) es_EC 西班牙文 (厄瓜多尔) es_ES 西班牙文 (西班牙) es_ES_EURO 西班牙文 (西班牙,Euro) es_GT 西班牙文 (危地马拉) es_HN 西班牙文 (洪都拉斯) es_MX 西班牙文 (墨西哥) es_NI 西班牙文 (尼加拉瓜) et 爱沙尼亚文 es_PA 西班牙文 (巴拿马) es_PE 西班牙文 (秘鲁) es_PR 西班牙文 (波多黎哥) es_PY 西班牙文 (巴拉圭) es_SV 西班牙文 (萨尔瓦多) es_UY 西班牙文 (乌拉圭) es_VE 西班牙文 (委内瑞拉) et_EE 爱沙尼亚文 (爱沙尼亚) fi 芬兰文 fi_FI 芬兰文 (芬兰) fi_FI_EURO 芬兰文 (芬兰,Euro) fr 法文 fr_BE 法文 (比利时) fr_BE_EURO 法文 (比利时,Euro) fr_CA 法文 (加拿大) fr_CH 法文 (瑞士) fr_FR 法文 (法国) fr_FR_EURO 法文 (法国,Euro) fr_LU 法文 (卢森堡) fr_LU_EURO 法文 (卢森堡,Euro) hr 克罗地亚文 hr_HR 克罗地亚文 (克罗地亚) hu 匈牙利文 hu_HU 匈牙利文 (匈牙利) is 冰岛文 is_IS 冰岛文 (冰岛) it 意大利文 it_CH 意大利文 (瑞士) it_IT 意大利文 (意大利) it_IT_EURO 意大利文 (意大利,Euro) iw 希伯来文 iw_IL 希伯来文 (以色列) ja 日文 ja_JP 日文 (日本) ko 朝鲜文 ko_KR 朝鲜文 (南朝鲜) lt 立陶宛文 lt_LT 立陶宛文 (立陶宛) lv 拉托维亚文(列托) lv_LV 拉托维亚文(列托) (拉脱维亚) mk 马其顿文 mk_MK 马其顿文 (马其顿王国) nl 荷兰文 nl_BE 荷兰文 (比利时) nl_BE_EURO 荷兰文 (比利时,Euro) nl_NL 荷兰文 (荷兰) nl_NL_EURO 荷兰文 (荷兰,Euro) no 挪威文 no_NO 挪威文 (挪威) no_NO_NY 挪威文 (挪威,Nynorsk) pl 波兰文 pl_PL 波兰文 (波兰) pt 葡萄牙文 pt_BR 葡萄牙文 (巴西) pt_PT 葡萄牙文 (葡萄牙) pt_PT_EURO 葡萄牙文 (葡萄牙,Euro) ro 罗马尼亚文 ro_RO 罗马尼亚文 (罗马尼亚) ru 俄文 
ru_RU 俄文 (俄罗斯) sh 塞波尼斯-克罗地亚文 sh_YU 塞波尼斯-克罗地亚文 (南斯拉夫) sk 斯洛伐克文 sk_SK 斯洛伐克文 (斯洛伐克) sl 斯洛文尼亚文 sl_SI 斯洛文尼亚文 (斯洛文尼亚) sq 阿尔巴尼亚文 sq_AL 阿尔巴尼亚文 (阿尔巴尼亚) sr 塞尔维亚文 sr_YU 塞尔维亚文 (南斯拉夫) sv 瑞典文 sv_SE 瑞典文 (瑞典) th 泰文 th_TH 泰文 (泰国) tr 土耳其文 tr_TR 土耳其文 (土耳其) uk 乌克兰文 uk_UA 乌克兰文 (乌克兰) zh 中文 zh_CN 中文 (中国) zh_HK 中文 (香港) zh_TW 中文 (台湾) ======System property======== -- listing properties -- java.runtime.name=Java(TM) 2 Runtime Environment, Stand... sun.boot.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386 java.vm.version=1.3.1_04-b02 java.vm.vendor=Sun Microsystems Inc. java.vendor.url=http://java.sun.com/ path.separator=: java.vm.name=Java HotSpot(TM) Client VM file.encoding.pkg=sun.io java.vm.specification.name=Java Virtual Machine Specification user.dir=/home/chedong/src/char_test java.runtime.version=1.3.1_04-b02 java.awt.graphicsenv=sun.awt.X11GraphicsEnvironment os.arch=i386 java.io.tmpdir=/tmp line.separator=
java.vm.specification.vendor=Sun Microsystems Inc. java.awt.fonts= os.name=Linux java.library.path=/usr/java/jdk1.3.1_04/jre/lib/i386:/u... java.specification.name=Java Platform API Specification java.class.version=47.0 os.version=2.4.7-10 user.home=/home/chedong user.timezone=Asia/Shanghai java.awt.printerjob=sun.awt.motif.PSPrinterJob file.encoding=GBK
java.specification.version=1.3
user.name=chedong
java.class.path=/home/chedong/classes
java.vm.specification.version=1.0
java.home=/usr/java/jdk1.3.1_04/jre
user.language=zh
java.specification.vendor=Sun Microsystems Inc.
java.vm.info=mixed mode
java.version=1.3.1_04
java.ext.dirs=/usr/java/jdk1.3.1_04/jre/lib/ext
sun.boot.class.path=/usr/java/jdk1.3.1_04/jre/lib/rt.jar:...
java.vendor=Sun Microsystems Inc.
file.separator=/
java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...
sun.cpu.endian=little
sun.io.unicode.encoding=UnicodeLittle
user.region=CN
sun.cpu.isalist=
| Hello, it's: Tue Jul 30 11:49:36 CST 2002 ======System available locales:======== en English en_US English (United States) ar Arabic ar_AE Arabic (United Arab Emirates) ar_BH Arabic (Bahrain) ar_DZ Arabic (Algeria) ar_EG Arabic (Egypt) ar_IQ Arabic (Iraq) ar_JO Arabic (Jordan) ar_KW Arabic (Kuwait) ar_LB Arabic (Lebanon) ar_LY Arabic (Libya) ar_MA Arabic (Morocco) ar_OM Arabic (Oman) ar_QA Arabic (Qatar) ar_SA Arabic (Saudi Arabia) ar_SD Arabic (Sudan) ar_SY Arabic (Syria) ar_TN Arabic (Tunisia) ar_YE Arabic (Yemen) be Byelorussian be_BY Byelorussian (Belarus) bg Bulgarian bg_BG Bulgarian (Bulgaria) ca Catalan ca_ES Catalan (Spain) ca_ES_EURO Catalan (Spain,Euro) cs Czech cs_CZ Czech (Czech Republic) da Danish da_DK Danish (Denmark) de German de_AT German (Austria) de_AT_EURO German (Austria,Euro) de_CH German (Switzerland) de_DE German (Germany) de_DE_EURO German (Germany,Euro) de_LU German (Luxembourg) de_LU_EURO German (Luxembourg,Euro) el Greek el_GR Greek (Greece) en_AU English (Australia) en_CA English (Canada) en_GB English (United Kingdom) en_IE English (Ireland) en_IE_EURO English (Ireland,Euro) en_NZ English (New Zealand) en_ZA English (South Africa) es Spanish es_AR Spanish (Argentina) es_BO Spanish (Bolivia) es_CL Spanish (Chile) es_CO Spanish (Colombia) es_CR Spanish (Costa Rica) es_DO Spanish (Dominican Republic) es_EC Spanish (Ecuador) es_ES Spanish (Spain) es_ES_EURO Spanish (Spain,Euro) es_GT Spanish (Guatemala) es_HN Spanish (Honduras) es_MX Spanish (Mexico) es_NI Spanish (Nicaragua) es_PA Spanish (Panama) es_PE Spanish (Peru) es_PR Spanish (Puerto Rico) es_PY Spanish (Paraguay) es_SV Spanish (El Salvador) es_UY Spanish (Uruguay) es_VE Spanish (Venezuela) et Estonian et_EE Estonian (Estonia) fi Finnish fi_FI Finnish (Finland) fi_FI_EURO Finnish (Finland,Euro) fr French fr_BE French (Belgium) fr_BE_EURO French (Belgium,Euro) fr_CA French (Canada) fr_CH French (Switzerland) fr_FR French (France) fr_FR_EURO French (France,Euro) fr_LU French 
(Luxembourg) fr_LU_EURO French (Luxembourg,Euro) hr Croatian hr_HR Croatian (Croatia) hu Hungarian hu_HU Hungarian (Hungary) is Icelandic is_IS Icelandic (Iceland) it Italian it_CH Italian (Switzerland) it_IT Italian (Italy) it_IT_EURO Italian (Italy,Euro) iw Hebrew iw_IL Hebrew (Israel) ja Japanese ja_JP Japanese (Japan) ko 韩文 ko_KR 韩文 (大韩民国) lt Lithuanian lt_LT Lithuanian (Lithuania) lv Latvian (Lettish) lv_LV Latvian (Lettish) (Latvia) mk Macedonian mk_MK Macedonian (Macedonia) nl Dutch nl_BE Dutch (Belgium) nl_BE_EURO Dutch (Belgium,Euro) nl_NL Dutch (Netherlands) nl_NL_EURO Dutch (Netherlands,Euro) no Norwegian no_NO Norwegian (Norway) no_NO_NY Norwegian (Norway,Nynorsk) pl Polish pl_PL Polish (Poland) pt Portuguese pt_BR Portuguese (Brazil) pt_PT Portuguese (Portugal) pt_PT_EURO Portuguese (Portugal,Euro) ro Romanian ro_RO Romanian (Romania) ru Russian ru_RU Russian (Russia) sh Serbo-Croatian sh_YU Serbo-Croatian (Yugoslavia) sk Slovak sk_SK Slovak (Slovakia) sl Slovenian sl_SI Slovenian (Slovenia) sq Albanian sq_AL Albanian (Albania) sr Serbian sr_YU Serbian (Yugoslavia) sv Swedish sv_SE Swedish (Sweden) th Thai th_TH Thai (Thailand) tr Turkish tr_TR Turkish (Turkey) uk Ukrainian uk_UA Ukrainian (Ukraine) zh 中文 zh_CN 中文 (中华人民共和国) zh_HK 中文 (香港) zh_TW 中文 (台湾) ======System property======== -- listing properties -- java.runtime.name=Java(TM) 2 Runtime Environment, Stand... sun.boot.library.path=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0... java.vm.version=1.3.0_02 java.vm.vendor=Sun Microsystems Inc. java.vendor.url=http://java.sun.com/ path.separator=; java.vm.name=Java HotSpot(TM) Client VM file.encoding.pkg=sun.io java.vm.specification.name=Java Virtual Machine Specification user.dir=D:\java\src\char_test java.runtime.version=1.3.0_02 java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment os.arch=x86 java.io.tmpdir=D:\TEMP\ line.separator=
java.vm.specification.vendor=Sun Microsystems Inc. java.awt.fonts= os.name=Windows 98 java.library.path=C:\WINDOWS;.;C:\WINDOWS\SYSTEM;C:\WIN... java.specification.name=Java Platform API Specification java.class.version=47.0 os.version=4.90 user.home=C:\WINDOWS user.timezone=Asia/Shanghai java.awt.printerjob=sun.awt.windows.WPrinterJob file.encoding=GBK
java.specification.version=1.3
user.name=Sicci
java.class.path=d:\java\classes
java.vm.specification.version=1.0
java.home=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_02
user.language=zh
java.specification.vendor=Sun Microsystems Inc.
awt.toolkit=sun.awt.windows.WToolkit
java.vm.info=mixed mode
java.version=1.3.0_02
java.ext.dirs=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0...
sun.boot.class.path=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0...
java.vendor=Sun Microsystems Inc.
file.separator=\
java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...
sun.cpu.endian=little
sun.io.unicode.encoding=UnicodeLittle
user.region=CN
sun.cpu.isalist=pentium i486 i386 | Hello, it's: Tue Jul 30 11:53:27 CST 2002 ======System available locales:======== en English en_US English (United States) ar Arabic ar_AE Arabic (United Arab Emirates) ar_BH Arabic (Bahrain) ar_DZ Arabic (Algeria) ar_EG Arabic (Egypt) ar_IQ Arabic (Iraq) ar_JO Arabic (Jordan) ar_KW Arabic (Kuwait) ar_LB Arabic (Lebanon) ar_LY Arabic (Libya) ar_MA Arabic (Morocco) ar_OM Arabic (Oman) ar_QA Arabic (Qatar) ar_SA Arabic (Saudi Arabia) ar_SD Arabic (Sudan) ar_SY Arabic (Syria) ar_TN Arabic (Tunisia) ar_YE Arabic (Yemen) be Byelorussian be_BY Byelorussian (Belarus) bg Bulgarian bg_BG Bulgarian (Bulgaria) ca Catalan ca_ES Catalan (Spain) ca_ES_EURO Catalan (Spain,Euro) cs Czech cs_CZ Czech (Czech Republic) da Danish da_DK Danish (Denmark) de German de_AT German (Austria) de_AT_EURO German (Austria,Euro) de_CH German (Switzerland) de_DE German (Germany) de_DE_EURO German (Germany,Euro) de_LU German (Luxembourg) de_LU_EURO German (Luxembourg,Euro) el Greek el_GR Greek (Greece) en_AU English (Australia) en_CA English (Canada) en_GB English (United Kingdom) en_IE English (Ireland) en_IE_EURO English (Ireland,Euro) en_NZ English (New Zealand) en_ZA English (South Africa) es Spanish es_AR Spanish (Argentina) es_BO Spanish (Bolivia) es_CL Spanish (Chile) es_CO Spanish (Colombia) es_CR Spanish (Costa Rica) es_DO Spanish (Dominican Republic) es_EC Spanish (Ecuador) es_ES Spanish (Spain) es_ES_EURO Spanish (Spain,Euro) es_GT Spanish (Guatemala) es_HN Spanish (Honduras) es_MX Spanish (Mexico) es_NI Spanish (Nicaragua) es_PA Spanish (Panama) es_PE Spanish (Peru) es_PR Spanish (Puerto Rico) es_PY Spanish (Paraguay) es_SV Spanish (El Salvador) es_UY Spanish (Uruguay) es_VE Spanish (Venezuela) et Estonian et_EE Estonian (Estonia) fi Finnish fi_FI Finnish (Finland) fi_FI_EURO Finnish (Finland,Euro) fr French fr_BE French (Belgium) fr_BE_EURO French (Belgium,Euro) fr_CA French (Canada) fr_CH French (Switzerland) fr_FR French (France) fr_FR_EURO French 
(France,Euro) fr_LU French (Luxembourg) fr_LU_EURO French (Luxembourg,Euro) hr Croatian hr_HR Croatian (Croatia) hu Hungarian hu_HU Hungarian (Hungary) is Icelandic is_IS Icelandic (Iceland) it Italian it_CH Italian (Switzerland) it_IT Italian (Italy) it_IT_EURO Italian (Italy,Euro) iw Hebrew iw_IL Hebrew (Israel) ja Japanese ja_JP Japanese (Japan) ko Korean ko_KR Korean (South Korea) lt Lithuanian lt_LT Lithuanian (Lithuania) lv Latvian (Lettish) lv_LV Latvian (Lettish) (Latvia) mk Macedonian mk_MK Macedonian (Macedonia) nl Dutch nl_BE Dutch (Belgium) nl_BE_EURO Dutch (Belgium,Euro) nl_NL Dutch (Netherlands) nl_NL_EURO Dutch (Netherlands,Euro) no Norwegian no_NO Norwegian (Norway) no_NO_NY Norwegian (Norway,Nynorsk) pl Polish pl_PL Polish (Poland) pt Portuguese pt_BR Portuguese (Brazil) pt_PT Portuguese (Portugal) pt_PT_EURO Portuguese (Portugal,Euro) ro Romanian ro_RO Romanian (Romania) ru Russian ru_RU Russian (Russia) sh Serbo-Croatian sh_YU Serbo-Croatian (Yugoslavia) sk Slovak sk_SK Slovak (Slovakia) sl Slovenian sl_SI Slovenian (Slovenia) sq Albanian sq_AL Albanian (Albania) sr Serbian sr_YU Serbian (Yugoslavia) sv Swedish sv_SE Swedish (Sweden) th Thai th_TH Thai (Thailand) tr Turkish tr_TR Turkish (Turkey) uk Ukrainian uk_UA Ukrainian (Ukraine) zh Chinese zh_CN Chinese (China) zh_HK Chinese (Hong Kong) zh_TW Chinese (Taiwan) ======System property======== -- listing properties -- java.runtime.name=Java(TM) 2 Runtime Environment, Stand... sun.boot.library.path=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0... java.vm.version=1.3.0_02 java.vm.vendor=Sun Microsystems Inc. java.vendor.url=http://java.sun.com/ path.separator=; java.vm.name=Java HotSpot(TM) Client VM file.encoding.pkg=sun.io java.vm.specification.name=Java Virtual Machine Specification user.dir=D:\java\src\char_test java.runtime.version=1.3.0_02 java.awt.graphicsenv=sun.awt.Win32GraphicsEnvironment os.arch=x86 java.io.tmpdir=D:\TEMP\ line.separator=
java.vm.specification.vendor=Sun Microsystems Inc. java.awt.fonts= os.name=Windows 98 java.library.path=C:\WINDOWS;.;C:\WINDOWS\SYSTEM;C:\WIN... java.specification.name=Java Platform API Specification java.class.version=47.0 os.version=4.90 user.home=C:\WINDOWS user.timezone=Asia/Shanghai java.awt.printerjob=sun.awt.windows.WPrinterJob file.encoding=Cp1252
java.specification.version=1.3
user.name=Sicci
java.class.path=d:\java\classes
java.vm.specification.version=1.0
java.home=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_02
user.language=en
java.specification.vendor=Sun Microsystems Inc.
awt.toolkit=sun.awt.windows.WToolkit
java.vm.info=mixed mode
java.version=1.3.0_02
java.ext.dirs=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0...
sun.boot.class.path=C:\PROGRAM FILES\JavaSOFT\JRE.3.0_0...
java.vendor=Sun Microsystems Inc.
file.separator=\
java.vendor.url.bug=http://java.sun.com/cgi-bin/bugreport...
sun.cpu.endian=little
sun.io.unicode.encoding=UnicodeLittle
user.region=GB
sun.cpu.isalist=pentium i486 i386
|
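The effect of file.encoding on byte stream ==> character stream conversion can be sidestepped by naming the charset at every conversion point. The following is only a minimal sketch, assuming Java 8 or later (the tests in this article ran on J2SE 1.3); the class name ExplicitCharsetDemo and its helper methods are made up for illustration. Note that on recent JDKs the reported default charset may be UTF-8 regardless of the locale.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the original article's code.
public class ExplicitCharsetDemo {
    // "Hello world" plus four Chinese characters, written as Unicode escapes
    // so that the encoding of this source file does not matter.
    public static final String TEXT = "Hello world \u4e16\u754c\u4f60\u597d";

    /** Character stream ==> byte stream, with the charset named explicitly. */
    public static byte[] write(String s, Charset cs) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(out, cs)) {
            w.write(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Byte stream ==> character stream, with the charset named explicitly. */
    public static String read(byte[] bytes, Charset cs) {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), cs)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // What the JVM falls back to when no charset is given.
        System.out.println("file.encoding   = " + System.getProperty("file.encoding"));
        System.out.println("default charset = " + Charset.defaultCharset());

        byte[] bytes = write(TEXT, StandardCharsets.UTF_8);
        String back = read(bytes, StandardCharsets.UTF_8);
        System.out.println("round trip intact: " + TEXT.equals(back));
    }
}
```

Because both directions name the same charset, the result does not depend on the locale the JVM was started in.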
Conclusion 1: The JVM's default encoding is determined by the system's locale setting and has nothing to do with the type of operating system. With the same locale, the default encodings on Linux and Windows behave the same (for this purpose Cp1252 and ISO-8859-1 can be treated as the same single-byte Western encoding, covering only Latin characters with values up to 255; strictly speaking the two differ in the 0x80-0x9F range). For the second test below I therefore list only the GNU/Linux output, with the locale set to zh_CN and to en_US. Running the same tests on Windows with the corresponding region and character set settings produces the same output.
The HelloUnicode.java program demonstrates how the string "Hello world 世界你好" (16 characters) is handled under different system default encodings. After each encoding/decoding step it prints, for every character of the resulting string, the byte value, the short value, and the Unicode block the character belongs to.
| LANG=en_US LC_ALL=en_US | LANG=zh_CN LC_ALL=zh_CN.GBK |
========testing1: write hello world to files========
[test 1-1]: with system default encoding=ISO-8859-1
string=Hello world 世界你好 length=20
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='? byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]='? byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='? byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]='? byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]='? byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]='? byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT
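Step 1: In the English locale the Chinese text appears to display correctly on screen, but what is actually printed is "half" Chinese characters: each byte of a two-byte character is treated as a separate char, which is why the length is 20 instead of 16. The result is written to the first file, hello.orig.html.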
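The "half character" effect can be reproduced without changing the locale. This is a rough sketch, assuming Java 7 or later and a JDK that ships the GBK charset; HalfCharDemo and misread are hypothetical names, and the ISO-8859-1 decoding is done explicitly here rather than through the platform default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: imitates an ISO-8859-1 JVM reading GBK bytes.
public class HalfCharDemo {
    // "Hello world" plus four Chinese characters, written as Unicode escapes.
    public static final String TEXT = "Hello world \u4e16\u754c\u4f60\u597d";

    /** Decode GBK-encoded bytes one byte per char, as an ISO-8859-1 JVM would. */
    public static String misread(String s) {
        byte[] gbk = s.getBytes(Charset.forName("GBK"));
        return new String(gbk, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String halves = misread(TEXT);
        System.out.println("real length    = " + TEXT.length());
        System.out.println("misread length = " + halves.length());
        // Each Chinese character has become two chars in the Latin-1 range.
        for (int i = 12; i < halves.length(); i++) {
            System.out.println("char[" + i + "] short=" + (int) halves.charAt(i));
        }
    }
}
```

The short values printed for positions 12 to 19 should match the ones in the test 1-1 listing (202, 192, 189, ...), since they are simply the GBK bytes of the four characters.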
[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
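The string is turned back into a byte stream using the system default encoding and then decoded as GB2312. The output shows question marks (in the current English locale the system does not know how to render characters above 255, so it displays all of them as ?), but the Unicode mapping and the short values show that these are the correct Chinese characters.
The next step, however, writes this string to the second file, hello.gb2312.html, without specifying an encoding (so the system default, ISO-8859-1, is used). As a result, what test 2-2 later reads back really is '?'.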
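Why the question marks become permanent can be shown in a few lines. The sketch below assumes Java 7 or later; LossyEncodeDemo and its methods are illustrative names only. It encodes the correctly decoded string as ISO-8859-1, which has no representation for the Chinese characters, and also shows one way to detect the problem before writing.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: encoding into a charset that lacks the characters.
public class LossyEncodeDemo {
    // "Hello world" plus four Chinese characters, written as Unicode escapes.
    public static final String TEXT = "Hello world \u4e16\u754c\u4f60\u597d";

    /** Encode as ISO-8859-1; characters it cannot represent come out as '?'. */
    public static byte[] toLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    /** True only if every character of s can be represented in ISO-8859-1. */
    public static boolean fitsLatin1(String s) {
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(s);
    }

    public static void main(String[] args) {
        System.out.println("fits ISO-8859-1: " + fitsLatin1(TEXT));
        byte[] bytes = toLatin1(TEXT);
        // The last four bytes are all 63, the code of '?'.
        for (int i = 12; i < bytes.length; i++) {
            System.out.println("byte[" + i + "] = " + bytes[i]);
        }
    }
}
```

Once the bytes are 63, no later decoding step can tell them apart from question marks that were typed on purpose, so a check like fitsLatin1 has to happen before the write.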
[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=24
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='? byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[13]='? byte=-72 \uFFFFFFB8 short=184 \uB8 LATIN_1_SUPPLEMENT
char[14]='? byte=-106 \uFFFFFF96 short=150 \u96 LATIN_1_SUPPLEMENT
char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='? byte=-107 \uFFFFFF95 short=149 \u95 LATIN_1_SUPPLEMENT
char[17]='? byte=-116 \uFFFFFF8C short=140 \u8C LATIN_1_SUPPLEMENT
char[18]='? byte=-28 \uFFFFFFE4 short=228 \uE4 LATIN_1_SUPPLEMENT
char[19]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[20]='? byte=-96 \uFFFFFFA0 short=160 \uA0 LATIN_1_SUPPLEMENT
char[21]='? byte=-27 \uFFFFFFE5 short=229 \uE5 LATIN_1_SUPPLEMENT
char[22]='? byte=-91 \uFFFFFFA5 short=165 \uA5 LATIN_1_SUPPLEMENT
char[23]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
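In the third test the character stream is encoded as UTF-8 and written to the third test file, hello.utf8.html. UTF-8 leaves the English text unchanged but uses 3 bytes for each of these Chinese characters, so the Chinese part takes 50% more storage than in GB2312, which uses 2 bytes per character.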
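The 50% figure follows directly from the byte counts. A small sketch, assuming Java 7 or later and an available GBK charset (used here in place of GB2312; for these four characters the two produce the same bytes); Utf8SizeDemo and byteCount are hypothetical names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: compares encoded sizes of the test string's parts.
public class Utf8SizeDemo {
    public static final String ENGLISH = "Hello world ";
    // The four Chinese characters of the test string, as Unicode escapes.
    public static final String CHINESE = "\u4e16\u754c\u4f60\u597d";

    /** Number of bytes s occupies when encoded with the given charset. */
    public static int byteCount(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        System.out.println("English: UTF-8=" + byteCount(ENGLISH, StandardCharsets.UTF_8)
                + " GBK=" + byteCount(ENGLISH, gbk));
        System.out.println("Chinese: UTF-8=" + byteCount(CHINESE, StandardCharsets.UTF_8)
                + " GBK=" + byteCount(CHINESE, gbk));
    }
}
```

The English part is 12 bytes either way; the Chinese part is 12 bytes in UTF-8 against 8 in GBK, which accounts for the lengths 24 and 20 seen in the English-locale listings.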
========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好 length=20
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='? byte=-54 \uFFFFFFCA short=202 \uCA LATIN_1_SUPPLEMENT
char[13]='? byte=-64 \uFFFFFFC0 short=192 \uC0 LATIN_1_SUPPLEMENT
char[14]='? byte=-67 \uFFFFFFBD short=189 \uBD LATIN_1_SUPPLEMENT
char[15]='? byte=-25 \uFFFFFFE7 short=231 \uE7 LATIN_1_SUPPLEMENT
char[16]='? byte=-60 \uFFFFFFC4 short=196 \uC4 LATIN_1_SUPPLEMENT
char[17]='? byte=-29 \uFFFFFFE3 short=227 \uE3 LATIN_1_SUPPLEMENT
char[18]='? byte=-70 \uFFFFFFBA short=186 \uBA LATIN_1_SUPPLEMENT
char[19]='? byte=-61 \uFFFFFFC3 short=195 \uC3 LATIN_1_SUPPLEMENT
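The content is read back from the intermediate file hello.orig.html using the system default encoding. Although it is read byte by byte (half a "character" at a time), the bytes are restored intact, so the output displays without error. This is in fact why applications such as PHP rarely run into character set problems: they treat everything as a byte stream from end to end and reproduce the input faithfully. The price of working this way is losing control over individual characters.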
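The reason the bytes survive is that ISO-8859-1 maps each of the 256 byte values to exactly one char and back. A minimal sketch of that pass-through, assuming Java 7 or later and an available GBK charset; ByteRoundTripDemo and its method names are invented for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class: bytes carried through a String unchanged.
public class ByteRoundTripDemo {
    // "Hello world" plus four Chinese characters, written as Unicode escapes.
    public static final String TEXT = "Hello world \u4e16\u754c\u4f60\u597d";

    /** Treat each byte as one char, the way an ISO-8859-1 JVM reads a file. */
    public static String bytesAsChars(byte[] raw) {
        return new String(raw, StandardCharsets.ISO_8859_1);
    }

    /** Turn the one-char-per-byte string back into the original bytes. */
    public static byte[] charsAsBytes(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        byte[] original = TEXT.getBytes(gbk);
        String passedThrough = bytesAsChars(original);
        byte[] restored = charsAsBytes(passedThrough);

        System.out.println("bytes intact: " + Arrays.equals(original, restored));
        // The program sees 20 "characters", so per-character operations
        // (length, substring, search) work on halves of Chinese characters.
        System.out.println("length seen by the program: " + passedThrough.length());
        System.out.println("recovered as GBK: " + TEXT.equals(new String(restored, gbk)));
    }
}
```

The last line corresponds to what test 1-2 does: as long as the bytes were never encoded lossily, decoding them with the right charset recovers the real characters.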
[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[13]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[14]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
char[15]='?' byte=63 \u3F short=63 \u3F BASIC_LATIN
The worst case: on output these '?' really are question marks, char(63). Data in this state is beyond recovery.
[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world ???? length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='?' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='?' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='?' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='?' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
Great! The characters are displayed as '?', but they were in fact decoded correctly, as the Unicode mapping shows.
|
========Testing1: write hello world to files========
[test 1-1]: with system default encoding=GBK
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
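Note: to run the tests above in a new locale, the source program must be recompiled. The very first byte stream to character stream decoding happens when javac compiles the source file. The biggest difference between this run and the previous one is whether the four characters "世界你好" in the source file were compiled into the program as Chinese characters, rather than byte by byte as 8 characters (each one actually corresponding to a single byte).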
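One way to take the compile-time locale out of the picture is to write the literal with Unicode escapes, or to pass the source encoding to javac with its -encoding option. The sketch below assumes Java 7 or later and an available GBK charset; SourceEncodingDemo and charsSeen are hypothetical, and charsSeen is only a rough model of what the compiler does with the literal's bytes, not javac itself.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: how the source encoding changes a string literal.
public class SourceEncodingDemo {
    // The four characters as Unicode escapes: the compiler produces the same
    // four chars whichever encoding it assumes for the source file.
    public static final String ESCAPED = "\u4e16\u754c\u4f60\u597d";

    /**
     * Rough model: take the bytes a GBK editor would have saved for the
     * literal and count the chars obtained when they are decoded with the
     * encoding the compiler assumes.
     */
    public static int charsSeen(Charset assumedSourceEncoding) {
        byte[] savedBytes = ESCAPED.getBytes(Charset.forName("GBK"));
        return new String(savedBytes, assumedSourceEncoding).length();
    }

    public static void main(String[] args) {
        System.out.println("escaped literal length = " + ESCAPED.length());
        System.out.println("read as GBK            = " + charsSeen(Charset.forName("GBK")));
        System.out.println("read as ISO-8859-1     = " + charsSeen(StandardCharsets.ISO_8859_1));
    }
}
```

Read as GBK the literal is 4 characters; read as ISO-8859-1 it is 8, which is the difference between the length=16 and length=20 results in the two columns.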
[test 1-2]: getBytes with platform default encoding and decoding as gb2312:
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
In the Chinese locale the decoding matches the default encoding used above, so the output is identical.
[test 1-3]: convert string to UTF8
string=Hello world 涓栫晫浣犲ソ length=18
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='涓' byte=-109 \uFFFFFF93 short=28051 \u6D93 CJK_UNIFIED_IDEOGRAPHS
char[13]='栫' byte=43 \u2B short=26667 \u682B CJK_UNIFIED_IDEOGRAPHS
char[14]='晫' byte=107 \u6B short=26219 \u666B CJK_UNIFIED_IDEOGRAPHS
char[15]='浣' byte=99 \u63 short=28003 \u6D63 CJK_UNIFIED_IDEOGRAPHS
char[16]='犲' byte=-78 \uFFFFFFB2 short=29362 \u72B2 CJK_UNIFIED_IDEOGRAPHS
char[17]='ソ' byte=-67 \uFFFFFFBD short=12477 \u30BD KATAKANA
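The terminal window used for this test is itself a GBK application, and this output is simply what the UTF-8 encoded bytes look like when they are decoded as GBK.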
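The garbled six-character string in the listing can be reproduced directly. This is a sketch under the assumption that the test program decoded the UTF-8 bytes with the platform default (GBK in this column); it needs Java 7 or later and an available GBK charset, and MojibakeDemo and utf8ReadAsGbk are made-up names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: UTF-8 bytes misinterpreted as GBK.
public class MojibakeDemo {
    // The four Chinese characters of the test string, as Unicode escapes.
    public static final String CHINESE = "\u4e16\u754c\u4f60\u597d";

    /** Encode as UTF-8, then decode the same bytes as if they were GBK. */
    public static String utf8ReadAsGbk(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("GBK"));
    }

    public static void main(String[] args) {
        String garbled = utf8ReadAsGbk(CHINESE);
        System.out.println("length = " + garbled.length());
        // Print code points in hex for comparison with the listing.
        for (int i = 0; i < garbled.length(); i++) {
            System.out.println("char[" + i + "] = \\u"
                    + Integer.toHexString(garbled.charAt(i)).toUpperCase());
        }
    }
}
```

The 12 UTF-8 bytes pair up into 6 two-byte GBK characters, which together with the 12 ASCII characters gives the length=18 reported by test 1-3.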
========Testing2: reading and decoding from files========
[test 2-1]: read hello.orig.html: decoding with system default encoding
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
[test 2-2]: read hello.gb2312.html: decoding as GB2312
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
[test 2-3]: read hello.utf8.html: decoding as UTF8
string=Hello world 世界你好 length=16
char[0]='H' byte=72 \u48 short=72 \u48 BASIC_LATIN
char[1]='e' byte=101 \u65 short=101 \u65 BASIC_LATIN
char[2]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[3]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[4]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[5]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[6]='w' byte=119 \u77 short=119 \u77 BASIC_LATIN
char[7]='o' byte=111 \u6F short=111 \u6F BASIC_LATIN
char[8]='r' byte=114 \u72 short=114 \u72 BASIC_LATIN
char[9]='l' byte=108 \u6C short=108 \u6C BASIC_LATIN
char[10]='d' byte=100 \u64 short=100 \u64 BASIC_LATIN
char[11]=' ' byte=32 \u20 short=32 \u20 BASIC_LATIN
char[12]='世' byte=22 \u16 short=19990 \u4E16 CJK_UNIFIED_IDEOGRAPHS
char[13]='界' byte=76 \u4C short=30028 \u754C CJK_UNIFIED_IDEOGRAPHS
char[14]='你' byte=96 \u60 short=20320 \u4F60 CJK_UNIFIED_IDEOGRAPHS
char[15]='好' byte=125 \u7D short=22909 \u597D CJK_UNIFIED_IDEOGRAPHS
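Conclusion: if back-end data is stored as UNICODE and the character set used for encoding and decoding is specified explicitly wherever it is needed, the application is almost unaffected by the character-set settings of the environment the front end runs in.

Some conclusions from test 2:
- Every application processes text as byte stream => character stream => byte stream:
  byte_stream ==[input decoding]==> unicode_char_stream ==[output encoding]==> byte_stream;
- In Java, every conversion from a byte stream to a character stream (or back) involves an implicit decoding (or encoding) step, which by default uses the system default encoding;
- The earliest byte-stream decoding happens as soon as javac compiles the source code;
- A Java character (char) is stored as a double-byte (16-bit) UNICODE unit;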
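The byte stream => character stream => byte stream pipeline can be sketched with the charset named at both ends, so the result no longer depends on file.encoding. This is an illustrative sketch under stated assumptions rather than the article's own code: the class and method names are hypothetical, it uses the java.nio.file API from later Java versions, and the ISO-8859-1 case is added only to show what a lossy encoding does.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
class ExplicitCharsetIo {
    // chars ==[encode with cs]==> bytes on disk ==[decode with cs]==> chars
    static String roundTripThroughFile(String text, Charset cs) {
        try {
            Path tmp = Files.createTempFile("hello", ".html");
            try {
                Files.write(tmp, text.getBytes(cs));            // output encoding
                return new String(Files.readAllBytes(tmp), cs); // input decoding
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String hello = "Hello world \u4E16\u754C\u4F60\u597D";

        // 16 chars in memory, 24 bytes as UTF-8 (12 ASCII + 4 * 3)
        System.out.println(hello.length());                                // 16
        System.out.println(hello.getBytes(StandardCharsets.UTF_8).length); // 24

        // Same charset on both sides: the text survives unchanged
        System.out.println(
            roundTripThroughFile(hello, StandardCharsets.UTF_8).equals(hello)); // true

        // A charset that cannot represent Chinese replaces each character with '?'
        System.out.println(
            roundTripThroughFile(hello, StandardCharsets.ISO_8859_1)); // Hello world ????
    }
}
```

Naming the charset explicitly at each boundary is what keeps the application independent of the locale it happens to run under.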
HelloUnicode.java source code

/*
 * Copyright (c) 2002-2003 Che, Dong Email: chedongATbigfoot.com/chedongATchedong.com
 * $Id: HelloUnicode.java,v 1.3 2003/03/09 08:41:46 chedong Exp $
 */
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;

/**
 * Purpose:
 *   Test how different encoding/decoding choices affect multibyte (Chinese) text
 * Input:
 *   An optional test string given on the command line
 * Output:
 *   Test 1: decode the string in different ways and write it to files in different encodings
 *   Test 2: read the string back from those files, decoding it in different ways
 * @author Che, Dong
 */
class HelloUnicode {
    /**
     * main entrance
     * @param args command line arguments
     */
    public static void main(String[] args) {
        String hello = "Hello world 世界你好";

        //read from command line input
        if (args.length > 0) {
            hello = args[0];
        }

        try {
            /*
             * Test 1: take the test string as decoded with the system default
             * encoding and write it to files
             */
            System.out.println(">>>>testing1: write hello world to files<<<<");
            System.out.println("[test 1-1]: with system default encoding="
                + System.getProperty("file.encoding") + "\nstring=" + hello
                + "\tlength=" + hello.length());
            printCharArray(hello);
            writeFile("hello.orig.html", hello);

            //re-decode the string's default-encoded bytes as GB2312
            hello = new String(hello.getBytes(), "GB2312");
            System.out.println(
                "[test 1-2]: getBytes with platform default encoding and decoding as gb2312:\nstring="
                + hello + "\tlength=" + hello.length());
            writeFile("hello.gb2312.html", hello);
            printCharArray(hello);

            //encode the string as a UTF8 byte stream (re-read with the platform
            //default encoding) and print the resulting values
            hello = new String(hello.getBytes("UTF8"));
            System.out.println("[test 1-3]: convert string to UTF8\nstring="
                + hello + "\tlength=" + hello.length());
            writeFile("hello.utf8.html", hello);
            printCharArray(hello);

            /*
             * Test 2: read the files written by test 1 and decode them in
             * different ways
             */
            System.out.println(
                ">>>>testing2: reading and decoding from files<<<<");

            //first file: encoding with system default
            hello = readFile("hello.orig.html");
            System.out.println(
                "[test 2-1]: read hello.orig.html: decoding with system default encoding\nstring="
                + hello + "\tlength=" + hello.length());
            printCharArray(hello);

            //second file: decoding from GB2312
            hello = readFile("hello.gb2312.html");
            hello = new String(hello.getBytes(), "GB2312");
            System.out.println(
                "[test 2-2]: read hello.gb2312.html: decoding as GB2312\nstring="
                + hello + "\tlength=" + hello.length());
            printCharArray(hello);

            //third file: decoding from UTF8
            hello = readFile("hello.utf8.html");
            hello = new String(hello.getBytes(), "UTF8");
            System.out.println(
                "[test 2-3]: read hello.utf8.html: decoding as UTF8\nstring="
                + hello + "\tlength=" + hello.length());
            printCharArray(hello);
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

    /**
     * print char array
     * @param inStr input string
     */
    public static void printCharArray(String inStr) {
        char[] myBuffer = inStr.toCharArray();

        //list each Character in byte value, short value, and UnicodeBlock Mapping
        for (int i = 0; i < inStr.length(); i++) {
            byte b = (byte) myBuffer[i];
            short s = (short) myBuffer[i];
            String hexB = Integer.toHexString(b).toUpperCase();
            String hexS = Integer.toHexString(s).toUpperCase();
            StringBuffer sb = new StringBuffer();

            //print char
            sb.append("char[");
            sb.append(i);
            sb.append("]='");
            sb.append(myBuffer[i]);
            sb.append("'\t");

            //byte value (the backslash is doubled so javac does not read it as a Unicode escape)
            sb.append("byte=");
            sb.append(b);
            sb.append(" \\u");
            sb.append(hexB);
            sb.append('\t');

            //short value
            sb.append("short=");
            sb.append(s);
            sb.append(" \\u");
            sb.append(hexS);
            sb.append('\t');

            //Unicode Block
            sb.append(Character.UnicodeBlock.of(myBuffer[i]));

            System.out.println(sb.toString());
        }
        System.out.println();
    }

    /**
     * write content to output file
     * @param fileName output file name
     * @param content file content to write
     */
    private static void writeFile(String fileName, String content) {
        try {
            File tmpFile = new File(fileName);
            if (tmpFile.exists()) {
                tmpFile.delete();
            }
            FileWriter fw = new FileWriter(fileName, true);
            fw.write(content);
            fw.close();
        } catch (Exception e) {
            System.out.println(e.toString());
        }
    }

    /**
     * read content from input file
     * @param fileName input file name
     * @return String file content
     */
    private static String readFile(String fileName) {
        try {
            BufferedReader fr = new BufferedReader(new FileReader(fileName));
            StringBuffer out = new StringBuffer();
            String thisLine = new String();
            while (thisLine != null) {
                thisLine = fr.readLine();
                if (thisLine != null) {
                    out.append(thisLine);
                }
            }
            fr.close();
            return out.toString();
        } catch (Exception e) {
            System.out.print(e.toString());
            return null;
        }
    }
}
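The eight-digit values such as FFFFFF93 in the byte column of the output come from sign extension: the low byte of the char is negative as a Java byte, and Integer.toHexString widens it to a 32-bit int before formatting. The sketch below shows one possible way to print two-digit and four-digit hex instead. It is an assumption about how the listing could be adjusted, not a change the author made, and the class and method names are hypothetical.

```java
// Hypothetical helper class, for illustration only.
class HexColumns {
    // Hex of the low 8 bits of a char, masked so no sign extension appears.
    static String lowByteHex(char c) {
        byte b = (byte) c; // may be negative, e.g. -109
        return Integer.toHexString(b & 0xFF).toUpperCase();
    }

    // Hex of the whole 16-bit code unit, zero-padded to four digits.
    static String codeUnitHex(char c) {
        return String.format("%04X", (int) c);
    }

    public static void main(String[] args) {
        char c = '\u6D93'; // first garbled character in the [test 1-3] output

        // Unmasked, as in printCharArray: the byte is widened to a negative int
        System.out.println(Integer.toHexString((byte) c).toUpperCase()); // FFFFFF93

        // Masked and padded variants
        System.out.println(lowByteHex(c));  // 93
        System.out.println(codeUnitHex(c)); // 6D93
    }
}
```

The same sign extension would affect the short column for any char at or above 0x8000; none appear in this test string.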