Going Global
Let's start with the most fundamental part of G11N: locales.
Locales
Among the first things to consider when making a ColdFusion application G11N is what language your application's users want to use and, possibly, where the users are located. Knowing users' locale helps you better tailor your application's language response to them. In globalization, locales relate to users' languages and cultural norms, such as sorting conventions; formatting of currency, time and dates, and numbers; and even the spelling of common words (colour versus color, for instance). Put more simply, a locale is a language as used in a specific country or a region within a country.
Locales are probably the most important piece of G11N (you absolutely need to get them right), and luckily for us, ColdFusion MX 7 really shines in this area in comparison to ColdFusion MX 6.1. Table 23.2 lists the locales that are natively supported by ColdFusion versions 6.1 and 7. Compare the ColdFusion 6.1 and ColdFusion MX 7 columns. Yes, ColdFusion MX 7 now natively supports all the 130-odd locales that core Java does! In all the ColdFusion MX 7 beta hoopla over <cfdocument>, reporting, event gateways, and the like, locale support was one improvement that seems to have been lost in the shuffle.
This section looks at the following questions:
- How can we determine a user's locale?
- Why do we need to maintain a user's locale choice?
- Are there any locale resources beyond what ColdFusion MX 7 offers?
- What's the best Java library to support G11N in ColdFusion MX 7?
- What can we do about locale-based collation (sorting)?
Determining a User's Locale
It's critically important to match a user's locale to the locales that your application supports. Matching what the user wants and what your application can actually deliver is often called language negotiation. So how do we do that? The quick-and-dirty answer is to simply ask them to choose from among the supported locales, maybe using a simple
It's considered good practice to display a user's locale choices in the language of that locale (the choice for French in French, Thai in Thai, and so forth).
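The negotiation step itself can be sketched outside ColdFusion. The following is a minimal Java illustration, not the geoLocator CFC's logic. It assumes a Java 8 or later runtime (newer than the JVM that shipped with ColdFusion MX 7), and the class name LanguageNegotiation, the negotiate method, and the three supported locales are all hypothetical. It matches the browser's Accept-Language header (the value ColdFusion exposes as cgi.HTTP_ACCEPT_LANGUAGE) against the application's supported locales using the standard Locale.LanguageRange.parse and Locale.lookup calls, and falls back to a default locale when nothing matches.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical helper: picks the best supported locale for an Accept-Language header.
class LanguageNegotiation {

    // Returns the first supported locale matching the header, or the fallback.
    static Locale negotiate(String acceptLanguage, List<Locale> supported, Locale fallback) {
        if (acceptLanguage == null || acceptLanguage.trim().isEmpty()) {
            return fallback; // the browser sent no preference
        }
        List<Locale.LanguageRange> ranges;
        try {
            // Turns "th-TH,th;q=0.9,en;q=0.5" into ranges ordered by weight.
            ranges = Locale.LanguageRange.parse(acceptLanguage);
        } catch (IllegalArgumentException e) {
            return fallback; // malformed header
        }
        // RFC 4647 lookup: each range is tried in turn against the supported tags.
        Locale match = Locale.lookup(ranges, supported);
        return match != null ? match : fallback;
    }

    public static void main(String[] args) {
        // Assumed application locales: Thai, Russian, and U.S. English (the default).
        List<Locale> supported = Arrays.asList(
                Locale.forLanguageTag("th-TH"),
                Locale.forLanguageTag("ru-RU"),
                Locale.forLanguageTag("en-US"));
        Locale fallback = Locale.forLanguageTag("en-US");

        System.out.println(negotiate("th-TH,th;q=0.9,en;q=0.5", supported, fallback));
        System.out.println(negotiate("fr-CA,fr;q=0.8", supported, fallback));
    }
}
```

Note that lookup matching is strict: a browser asking only for en-GB does not match en-US, so the fallback locale does the work in that case. Whether that is the right behavior is an application design choice.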
Listing 23.1. geoLocatorTB.cfm: A geoLocator Example
<cfsilent>
<!---
this example assumes you've downloaded the geoLocator CFC and
copied the InetAddressLocator jar file to coldfusion_install_location\wwwroot\WEB-INF\lib
--->
<!--- hint early, hint often --->
<cfprocessingdirective pageencoding="utf-8">
<cfscript>
// setup to try to init CFC & InetAddressLocator java class.
isOk=true;
try { // create the geoLocator object
geoLocator=createobject("component","cfc.geoLocator");
}
// something went wrong
catch (Any e) {
isOk=false;
}
// ok to proceed ?
if (isOK) {
// capture user's IP address
if (cgi.REMOTE_ADDR EQ "127.0.0.1")
// if you test locally, we fallback on an IP from australia
ipAddress="147.66.10.158"; //somewhere in oz
else
ipAddress=cgi.REMOTE_ADDR;
// grab their language choices, if any
browserLanguage=cgi.HTTP_ACCEPT_LANGUAGE;
// what locale for this user?
thisLocale=geoLocator.findLocale(ipAddress,browserLanguage);
// we can also find their country
thisCountry=geoLocator.findCountry(ipAddress,browserLanguage);
// and language
thisLanguage=geoLocator.findLanguage(ipAddress,browserLanguage);
// we can even get localized names for language & country
thisC=geoLocator.showCountry(ipAddress);
thisL=geoLocator.showLanguage(ipAddress);
// test if valid locale (according to our logic)
bLocaleValid=geoLocator.isValidLocale("fr_RU");
}
</cfscript>
</cfsilent>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>geoLocator Testbed</title>
</head>
<body>
<cfoutput>
<cfif isOK>
<h2>Not your grandmother's geoLocator</h2>
<hr align="left" width="30%">
<b>geoLocator</b> := Initialized plenty fine.
<br>
<b>ip address</b> := #ipAddress#
<br>
<b>browser http_accept_language</b> := #browserLanguage#
<br>
<b>This locale from geoLocator</b> := #thisLocale#
<br>
<b>This country (2 letter code) from geoLocator</b> := #thisCountry#
<br>
<b>This language (2 letter code) from geoLocator</b> := #thisLanguage#
<br>
<b>This country from geoLocator</b> := #thisC#
<br>
<b>This language from geoLocator</b> := #thisL#
<br>
<b>Test fr_RU as valid locale</b> := #yesnoFormat(bLocaleValid)#
<br>
<cfelse>
Oops, poop hit the fan.
</cfif>
</cfoutput>
</body>
</html>
Locale Stickiness
Now that we know a user's locale, what do we do with it? Well, the first thing is not to forget it. Say you have a Web application supporting three locales, Thai, Russian, and U.S. English (as the default locale). The geoLocator CFC determines that a user in Bangkok has a th_TH (Thai language in Thailand) locale. This user gets the home page of the Web site in Thai, with correctly formatted Thai dates and numbers, and so on. The user then navigates to a subsection of the Web site and only sees U.S. English. The application has promptly forgotten their locale and reverted to the default.
This might seem to be a rather trivial issue, but it's an important part of developing a G11N ColdFusion MX 7 application. There are several approaches to fixing this: the monolingual Web site (more on this in the later section "Better I18N Practices"), which is more of a high-level design choice than a ColdFusion coding technique; saving the locale to shared scope variables (usually SESSION scope); or passing locale as part of the URL string (for example, index.cfm?locale=fr_CA). Pick one technique; just please don't forget your user's locale.
We'll examine more uses for a user's locale later on, but next let's look at what happens when we need a locale that ColdFusion MX 7 doesn't support.
CLDR: The Common Locale Data Repository
As stated earlier, ColdFusion MX 7 derives its locale information from core Java. While this will provide enough locale coverage to satisfy most ColdFusion MX 7 G11N applications, there will be occasions where it's not sufficient: say, when you need to support Farsi or Vietnamese. For those situations, you'll either need to do your own locale research (and from my own personal experience, I can quite easily say bah, humbug to that idea), or you can look elsewhere for some sort of standardized locale resources. These days, "elsewhere" is the Common Locale Data Repository (CLDR). Originally a project sponsored by the Free Standards Group's OpenI18N team (http://www.openi18n.org/), the CLDR project was handed off to the Unicode Consortium (http://www.unicode.org/cldr/) in early 2004.
CLDR's locale resources, as of version 1.2, cover 232 locales, including 72 languages and 108 territories. There are a further 63 draft locales (covering an additional 27 languages and 28 territories) in the process of being developed. Compare that to the 130 or so locales provided by core Java, and you can understand the real significance of the CLDR. Specifically, the CLDR provides information concerning number/date/time formatting, currency values, as well as support for measurement units and text sorting order (collation). Table 23.3 lists the locales covered by the CLDR. If you find yourself working with a client whose locale or language is not listed in that table, get in touch with SETI (http://www.seti.org/); you might very well be dealing with an alien.
IBM's ICU4J
One of the truly "big deals" of ColdFusion MX's move to Java was the ease of integrating Java libraries into ColdFusion applications. For G11N applications, the mother of all Java libraries has to be IBM's open-source International Components for Unicode for Java, a.k.a. ICU4J (http://www-306.ibm.com/software/globalization/icu/index.jsp). The ICU4J library fills in many of the gaps in core Java's I18N functionality, such as providing non-Gregorian calendars, beefier number formatting including scientific notation and spell-out, speedier locale-based collation, international holidays, and of course all 230 CLDR locales. (We'll discuss a couple of these items in later sections.) Plain and simple, if you do serious G11N work, you need to use this library.
TIP: Much of the ICU4J goodness has already been encapsulated in ColdFusion CFCs. You can find many of these in the Macromedia ColdFusion Exchange (http://www.macromedia.com/cfusion/exchange/index.cfm) by searching for ICU4J. They're also available on my shop's Web site (http://www.sustainableGIS.com/things.cfm) or on the CFCZone Web site (http://www.cfcZone.org/).
Listing 23.2 shows a simple comparison between core Java and ICU4J using the Farsi locale (fa_IR, the Persian or Farsi language as used in Iran). The first thing to note is that core Java methods were used instead of ColdFusion LS functions. Why? Simply because Farsi is not one of the supported ColdFusion locales. ColdFusion MX 7 behaves differently than core Java, in that CFMX throws an error (coldfusion.runtime.locale.CFLocaleMgrException) rather than using a fallback locale as core Java does. Notice that the getDisplayName method with a Locale or ULocale (for ICU4J) as argument simply displays the localized name for that locale. Another major difference is the use of ICU4J's ULocale class rather than core Java's Locale. This gives us access to all the locales as shown in Table 23.3.
We'll need to see some output from this example (shown in Figure 23.1) in order to understand another important distinction between ColdFusion MX 7/core Java and ICU4J. Since it doesn't have any locale resource data for the fa_IR locale, core Java falls back on the default locale for the server (in this case, en_US) and produces "Persian (Iran)" for the localized name. Although the dates are exactly the same (produced using the default Gregorian calendar), the output formats are quite different. ICU4J formats the date display using the Farsi locale resource data; that is, besides localized Farsi date part names, it also uses Arabic-Indic digits rather than European digits.
Listing 23.2. compareFarsiLocales.cfm: Comparison of ICU4J/Core Java for Farsi Locale
<cfprocessingDirective pageencoding="utf-8">
<!---
this example assumes that you have downloaded the ICU4J jar from http://www-306.ibm.com/software/globalization/icu/downloads.jsp and copied it to coldfusion_install_location
\wwwroot\WEB-INF\lib.
--->
<cfsilent>
<!---
compares Farsi locale date formatting and name display using core java and icu4j
NOTE: made verbose for clarity
--->
<cfscript>
// full date format, common to both core java and icu4j
fullFormat=javacast("int",0);
// core java
farsiLocale=createObject("java","java.util.Locale");
farsiLocale.init("fa","IR");
coreJavaDateFormat=createObject("java","java.text.DateFormat");
coreJavaDF=coreJavaDateFormat.getDateInstance(fullFormat,farsiLocale);
//////////////////////////////////////////////////////////////////////
// icu4j magic
farsiUlocale=createObject("java","com.ibm.icu.util.ULocale");
farsiUlocale.init("fa_IR"); // note the nifty init locale syntax
icu4jDateFormat=createObject("java","com.ibm.icu.text.DateFormat");
icu4jDF=icu4jDateFormat.getDateInstance(fullFormat,farsiULocale);
</cfscript>
</cfsilent>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>locale comparison</title>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
</head>
<body>
<!--- output what we've done --->
<cfoutput>
<b>core Java</b>: #farsiLocale.getDisplayName(farsiLocale)# #coreJavaDF.format(now())#
<br><br>
<b>ICU4J</b>: #farsiULocale.getDisplayName(farsiULocale)# #icu4jDF.format(now())#
</cfoutput>
</body>
</html>
Figure 23.1. Comparison of ICU4J/core Java output for Farsi locale.

Listing 23.3 makes a similar comparison between ColdFusion MX 7 and ICU4J, this time for an Arabic locale (ar_AE). A few things to note:
- The simplicity that ColdFusion MX 7 brings to G11N
- A single function, setLocale, sets ColdFusion MX 7's locale for that page
- The getLocaleDisplayName function returns a localized name for this locale, similar to ICU4J's getDisplayName method
- The lsDateFormat function returns a formatted date for this locale, similar to ICU4J's format method, and this is where we find another fly in the locale ointment.
Figure 23.2 shows the output from this example. Although ColdFusion MX 7 certainly gets the localized date parts (month and day of week) correct, it doesn't fully support Arabic-Indic digits for the numeric parts (year and day of month) of the date format; ICU4J, however, does. In general, all Arabic locales supported by ColdFusion MX 7/core Java will format dates, times, and numbers incorrectly, using European instead of Arabic-Indic digits. Note that this is an issue with the underlying core Java, and not with ColdFusion MX 7 per se.
Listing 23.3. compareCFLocales.cfm: Comparison of ICU4J/ColdFusion MX 7 for Arabic Locale
<cfprocessingDirective pageencoding="utf-8">
<!---
this example assumes that you have downloaded the ICU4J jar from http://www-306.ibm.com/software/globalization/icu/downloads.jsp and copied it to coldfusion_install_location
\wwwroot\WEB-INF\lib.
--->
<cfsilent>
<!---
compares arabic locale date formatting and name display using ColdFusion MX 7 and icu4j
made verbose for clarity
--->
<cfscript>
// ColdFusion MX 7, yup that's all there is to it
oldLocale=setLocale("ar_AE");
//////////////////////////////////////////////////////////////////////
// icu4j magic
// full date format
fullFormat=javacast("int",0);
arabicUlocale=createObject("java","com.ibm.icu.util.ULocale");
arabicUlocale.init("ar_AE"); // nifty init syntax
icu4jDateFormat=createObject("java","com.ibm.icu.text.DateFormat");
icu4jDF=icu4jDateFormat.getDateInstance(fullFormat,arabicULocale);
</cfscript>
</cfsilent>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<title>locale comparison</title>
<meta content="text/html; charset=UTF-8" http-equiv="content-type">
</head>
<body>
<!--- output what we've done --->
<cfoutput>
<b>ColdFusion MX 7</b>: #getLocaleDisplayName("ar_AE","ar_AE")# #lsDateFormat(now(),"full")#
<br><br>
<b>ICU4J</b>: #arabicULocale.getDisplayName(arabicULocale)# #icu4jDF.format(now())#
</cfoutput>
</body>
</html>
Figure 23.2. Comparison of ICU4J/ColdFusion MX 7 output for Arabic (United Arab Emirates) locale.

Collation
Collation is a peculiar thing. It's more or less a universal user requirement, and getting it wrong will certainly make users think less of your application. But getting it right across many locales will also certainly go unnoticed; most users think sorting is quite trivial and do it routinely, almost unconsciously. Furthermore, collation is not consistent for the same characters; for instance, people of German, French, and Swedish nationality sort the same characters differently. Collation is not even consistent within the same language (as in so-called phone-book collation as opposed to sorting in dictionaries and book indices). And that's just the alphabet-based scripts; Asian ideograph collation can be either phonetic or based on the appearance (strokes) of the characters. Then there are the special cases based on user preferences: ignore/consider punctuation, case (A before/after a), and so on. You're looking at thousands of years of human collation baggage, so yes, it's going to be complex, even if users do think it's pretty minor. If you want, you can read more about the Unicode Consortium's take on collation at http://www.unicode.org/reports/tr10/.
As a rule of thumb, your application should first take advantage of your database's collation functionality. Quite a bit of research time and effort was put into this. Most of today's "big iron" databases can handle substantial collation complexity and even "cast" result sets to a collation other than that table/database's default. See Listing 23.4 for an example using Microsoft SQL Server's COLLATE clause. The subsequent discussion deals with cases where we have to sort within a ColdFusion page, as in Query-of-Query or when sorting a list or an array.
NOTE: Fine-tuning collation/sorting to a given locale is more important than many developers think. Most users would think an application plain stupid if it couldn't even sort their alphabet correctly.
Listing 23.4. castCollation.cfm: Casting Collation with Microsoft SQL Server
<!---
snippet showing MS SQL Server syntax to cast from default collation, say SQL_Latin1_General_Cp1250_CS_AS (case & accent sensitive) to
SQL_Latin1_General_Cp1250_CI_AS (case insensitive, accent sensitive)
this should produce a resultset ordering that ignores case
note: the ORDER BY column (Lastname) is assumed here for illustration
--->
<cfquery name="getTaxRoll" datasource="municipalINFO">
SELECT title+' '+firstName+' '+Lastname as taxPayer
FROM taxRoll
ORDER BY Lastname COLLATE SQL_Latin1_General_Cp1250_CI_AS
</cfquery>
Suppose we have this scenario:
- Application serving German locale (de_DE)
- Requirement to sort an array of names
- Users bitterly complaining that results aren't being sorted correctly
Let's examine what's happening here to see what we can do about shutting up those darned users. The application is quite logically using the arraySort function. The problem is that the sorted results aren't at all what the user expects. Names with umlauts (Ä, Ö, Ü) are sorting together as a group after the unadorned characters (A, O, U), rather than as most German users would expect, which would be more along the lines of AÄOÖUÜ (the commonly used German phone-book or DIN-2 collation).
Why is this happening? Because all of ColdFusion MX 7's collation functionality is based on sorting sequential Unicode codepoints (see for details). Listing 23.5 provides a look at using ICU4J to solve this problem, but before we can make sense of this example, we'll have to examine how core Java and ICU4J actually handle collation.
In Java (both plain Java and ICU4J), collation complexity is handled using three parameters: locale, strength, and decomposition.
The locale parameter is obvious; a specific locale's collation data is used to order sorts (and searches).
The strength parameter is used across locales (although exact strength assignments vary from locale to locale) and determines the level of difference considered significant in comparisons. There are four basic strengths:
- PRIMARY. Significant for base letter differences; a versus b.
- SECONDARY. Significant for different accented forms of the same base letter (o versus ô).
- TERTIARY. Significant for case differences such as a versus A (but, again, differs from locale to locale).
- IDENTICAL. All differences are considered significant during comparison (control characters, precomposed and combining accents, etc.).
ICU4J adds a fifth strength, QUATERNARY, which distinguishes words with/without punctuation.
Let's take an example from the Java docs (http://java.sun.com/j2se/1.4.2/docs/api/212). In Czech, e and f are considered primary differences; e and ě are secondary differences; e and E are tertiary differences; and e and e are identical. Got that?
The decomposition parameter is just what it sounds like: Characters are decomposed for comparison. There are three basic decompositions (only two for ICU4J):
- NO_DECOMPOSITION. Characters are not decomposed; accented and plain characters are the same. This is the fastest collation but will only work for languages without accented (and so on) characters.
- CANONICAL_DECOMPOSITION. Characters that are canonical variants are decomposed for collation; that is, accents are handled.
- FULL_DECOMPOSITION. Not only accented characters, but also characters that have special formats are decomposed (this decomposition doesn't exist in ICU4J; CANONICAL_DECOMPOSITION is used instead). Basically, un-normalized text is properly handled.
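To make the three parameters concrete, here is a minimal sketch using core Java's java.text.Collator only (not ICU4J, and not the i18nSort.cfc code). The class and method names (GermanCollationDemo, sortForLocale, sortByCodePoint, equalAtStrength) and the sample names are hypothetical. It contrasts a plain code point sort with a sort driven by a German Collator, and shows how strength changes what counts as "equal."

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical demo of locale, strength, and decomposition with core Java's Collator.
class GermanCollationDemo {

    // Returns a sorted copy, ordered by a Collator built from the three parameters.
    static String[] sortForLocale(String[] names, Locale locale, int strength, int decomposition) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(decomposition);
        String[] copy = names.clone();
        Arrays.sort(copy, collator); // Collator is a Comparator
        return copy;
    }

    // Returns a sorted copy in String's natural (code unit) order, with no locale involved.
    static String[] sortByCodePoint(String[] names) {
        String[] copy = names.clone();
        Arrays.sort(copy);
        return copy;
    }

    // True when the two strings compare as equal at the given strength.
    static boolean equalAtStrength(String a, String b, Locale locale, int strength) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(strength);
        collator.setDecomposition(Collator.CANONICAL_DECOMPOSITION);
        return collator.compare(a, b) == 0;
    }

    public static void main(String[] args) {
        // U+00C4 is A-umlaut and U+00D6 is O-umlaut, written as escapes to stay encoding-safe.
        String[] names = {"Zimmer", "\u00C4rger", "Ofen", "Apfel", "\u00D6l"};

        // Code point order pushes the umlauted names to the end, after Zimmer.
        System.out.println(Arrays.toString(sortByCodePoint(names)));

        // The German collator sorts each umlauted name next to its base letter.
        System.out.println(Arrays.toString(sortForLocale(
                names, Locale.GERMANY, Collator.TERTIARY, Collator.CANONICAL_DECOMPOSITION)));

        // Case is a tertiary difference: ignored at PRIMARY, significant at TERTIARY.
        System.out.println(equalAtStrength("a", "A", Locale.GERMANY, Collator.PRIMARY));
        System.out.println(equalAtStrength("a", "A", Locale.GERMANY, Collator.TERTIARY));
    }
}
```

The same calls can be made from ColdFusion through createObject("java", "java.text.Collator"), in the style of the listings in this chapter.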
TIP: The i18nSort.cfc wraps up both the core Java and ICU4J versions of locale collation, including functions to sort queries. You can find it in the usual places (mentioned previously).
Now that we understand how collation works in core Java and ICU4J, let's consider the example code in Listing 23.5.
Listing 23.5. icu4jSort.cfm: ICU4J-Based Locale Array Sorting Function
<!---
authors: hiroshi okugawa <hokugawa@macromedia.com>
paul hastings <paul@sustainableGIS.com>
date: 8-feb-2004
notes: this method handles sorting string arrays using locale based collation. originally part of i18nSort.cfc. note that this code has been made verbose for clarity.
--->
<cffunction name="icu4jSort" output="No" returntype="array" hint="returns array sorted using ICU4J collator">
<cfargument name="toSort" type="array" required="yes">
<cfargument name="sortDir" type="string" required="no" default="Asc">
<cfargument name="thisLocale" type="string" required="no" default="en_US">
<cfargument name="thisStrength" type="string" required="no" default="TERTIARY">
<cfargument name="thisDecomposition" type="string" required="no" default="FULL_DECOMPOSITION">
<cfscript>
var icu4jCollator=createObject("Java","com.ibm.icu.text.Collator");
var uLocale=createObject("Java","com.ibm.icu.util.ULocale");
var tmp="";
var i=0;
var strength="";
var decomposition="";
var thisCollator="";
var locale=uLocale.init(arguments.thisLocale);
// Arrays object to handle sort
var Arrays = createObject("java", "java.util.Arrays");
//set up the collation options
//strength of comparison
switch (arguments.thisStrength){
//handles base letters 'a' vs 'b'
case "PRIMARY" :
strength=icu4jCollator.PRIMARY;
break;
//handles accented chars
case "SECONDARY" :
strength=icu4jCollator.SECONDARY;
break;
//distinguishes words with/without punctuation
case "QUATERNARY" :
strength=icu4jCollator.QUATERNARY;
break;
//all differences, including control chars are considered
case "IDENTICAL" :
strength=icu4jCollator.IDENTICAL;
break;
//includes case differences, 'A' vs 'a'
default:
strength=icu4jCollator.TERTIARY;
}
//decompositions, only 2 for icu4j
//fastest sort but won't handle accented chars, etc.
if (arguments.thisDecomposition EQ "NO_DECOMPOSITION")
decomposition=icu4jCollator.NO_DECOMPOSITION;
else //compromise, handles accented chars but not special forms
decomposition=icu4jCollator.CANONICAL_DECOMPOSITION;
//set collator to required locale
thisCollator=icu4jCollator.getInstance(locale);
thisCollator.setStrength(strength);// set strength
thisCollator.setDecomposition(decomposition);//set decomposition
tmp=arguments.toSort.toArray();
//do the array sort based on this collator
Arrays.sort(tmp,thisCollator);
if (arguments.sortDir EQ "Desc") { //need to swap array?
arguments.toSort=arrayNew(1);
for (i=arrayLen(tmp);i GTE 1; i=i-1) {
arrayAppend(arguments.toSort,tmp[i]);
}
} else arguments.toSort=tmp;
return arguments.toSort;
</cfscript>
</cffunction>
The first thing to note (again) is the use of ICU4J's ULocale class rather than core Java's Locale class. The next point is the use of core Java's Arrays class; we're using it because it can accept a Collator object that we begin to build by sorting out (pun intended) what strength and decomposition to use for this Collator. We then build the Collator for this locale:
thisCollator=icu4jCollator.getInstance(locale)
We next have to turn the ColdFusion Array into a Java Array (in order to use the Arrays object's nifty sorting methods), using:
tmp=arguments.toSort.toArray()
Now we're ready to actually do the sort using the Arrays object, quite simply:
Arrays.sort(tmp,thisCollator)
The last thing we have to handle is the direction of the sort (ascending or descending), swapping the array around if the calling page required descending sort direction.
What happens if the locale we're interested in isn't one of the locales for which ICU4J has actual collation data? ICU4J will silently fall back on the Unicode Collation Algorithm (UCA), which should suffice for many of these locales. You can read more about how the UCA works at http://www.unicode.org/reports/tr10/. You can also construct your own collation using ICU4J's com.ibm.icu.text.RuleBasedCollator class. Besides creating new collations, this class also allows you to combine existing collations or customize individual collations to suit specific needs.
NOTE: By now you might be starting to suspect that G11N ColdFusion code isn't exactly rocket science, and you're right. You can pretty much use any style or framework that you're comfortable with. As long as you follow the principles/information laid out in this chapter, you should be good to go.
The preceding discussion has given you a good handle on the ins and outs of locales, so let's examine the next G11N issue, the always-fun task of character encoding.
Character Encoding
In my experience, many (perhaps too many) ColdFusion developers get into some kind of trouble over character encoding. This section is going to provide you with the one single answer to all your character encoding problems; it goes like this: "Just use Unicode." For it to be effective, you'll need to keep repeating that phrase over and over until you automatically blurt out "Just use Unicode" when somebody asks you the time of day. Then you'll know you're ready to handle any and all character encoding issues. In the meantime, let's review some of the more important aspects of character encoding as they apply to ColdFusion.
Not Unicode? Not So Smart
I suppose it would be useful to see what ColdFusion MX 7 has to say about character encoding. Quoting from the Developing ColdFusion Applications documentation: "Character encoding maps each character in a character set to a numeric value that can be represented by a computer. These numbers can be represented by a single byte or multiple bytes." Great, but what that doesn't mention is that it's not unusual for a language to have more than one encoding. For example, English has both 8-bit ISO-8859-1 or Latin-1, and 7-bit ASCII; Japanese has Shift-JIS, EUC-JP, and ISO-2022-JP encodings; and, well, we won't get into the Chinese encodings. Furthermore, not all characters for a given language are represented in every encoding used for that language. For instance, the Euro symbol (€) isn't found within the ISO-8859-1 encoding. (The ISO encoding came before the Euro was established as the default currency in the EU.)
If this weren't enough variety, some character sets appear to be equivalent (at least to some folks) but are in fact not. Many developers think ISO-8859-1 and Windows-1252 are the same character set, when in fact Windows-1252 (also called Windows Western or Windows Latin-1) is more like a superset of ISO-8859-1. The mistake of copying and pasting characters from Word documents into HTML forms using ISO-8859-1 encoding highlights this issue pretty nicely. This is particularly troublesome if no encoding metadata is available for a chunk of text. G11N projects are prone to this misstep owing to the need for translations, often done by non-IT professionals who quite often wouldn't know a character encoding if it fell on their heads.
Let's summarize some things about character sets:
- Undeniably, there are a lot of character sets floating around (see the IANA's page on character sets, http://www.iana.org/assignments/character-sets). I stopped counting at 75.
- The same character encoding can be used in different languages.
- Many languages are covered by several character sets.
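The ISO-8859-1 versus Windows-1252 mix-up is easy to reproduce. The sketch below is a hypothetical Java illustration (the class name EncodingMismatchDemo and its two helper methods are made up for this example); it relies only on the standard java.nio.charset.Charset class. It checks which of the two encodings can represent the Euro symbol, then shows what happens when text written in one encoding is read back as another.

```java
import java.nio.charset.Charset;

// Hypothetical demo: the same bytes mean different things under different encodings.
class EncodingMismatchDemo {

    // True when the named charset has a byte representation for the character.
    static boolean canEncode(String charsetName, char c) {
        return Charset.forName(charsetName).newEncoder().canEncode(c);
    }

    // Encodes text with one charset, then (wrongly) decodes the bytes with another.
    static String misread(String text, String writtenAs, String readAs) {
        byte[] bytes = text.getBytes(Charset.forName(writtenAs));
        return new String(bytes, Charset.forName(readAs));
    }

    public static void main(String[] args) {
        char euro = '\u20AC';

        // The Euro symbol exists in Windows-1252 but not in ISO-8859-1.
        System.out.println("ISO-8859-1 has euro:   " + canEncode("ISO-8859-1", euro));
        System.out.println("windows-1252 has euro: " + canEncode("windows-1252", euro));

        // A Word-style curly quote (U+201C) saved as Windows-1252 and read as
        // ISO-8859-1 turns into an invisible control character (U+0093).
        String garbled = misread("\u201C", "windows-1252", "ISO-8859-1");
        System.out.println("curly quote became U+00" + Integer.toHexString(garbled.charAt(0)).toUpperCase());

        // UTF-8 text read as ISO-8859-1: one accented letter becomes two characters.
        System.out.println(misread("\u00E9", "UTF-8", "ISO-8859-1").length());
    }
}
```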
That kind of wild variety is one of the things I loathe as a ColdFusion G11N developer. Matching the correct encoding to a language is quite difficult when there are multiple possible encodings for a language; you're bound to get it wrong once in a while. In fact, getting it wrong happens so often that a Japanese term, mojibake (


Unicode
A lot of variety means a lot of choices, and that's not always a good thing. So what can we do to simplify things? You already know the answer to that: "Just use Unicode." So what's so hot about Unicode?
- It's a standard (synchronized with the ISO 10646 standard).
- It's Internet ready (XML, Perl, Java, JavaScript, and so on all support Unicode).
- It's multilingual (see http://www.i18nguy.com/unicode/char-count.html).
- It travels well (text in any language can be easily exchanged globally).
- It offers monolithic text processing (and that, of course, saves you money in development and support costs, time to market, and so forth).
- It has wide industry support (Macromedia, IBM, Microsoft, HP, Sun, Oracle, and more), making it vendor neutral where pretty much nothing else is.
- It continually evolves (it's now version 4.0.1, with 4.1.0 in beta testing).
- It's possible to convert from legacy code pages (see http://www.unicode.org/Public/MAPPINGS/).
- It's more or less apolitical (see the member list at http://www.unicode.org/unicode/consortium/memblist.html).
- The W3C is recommending it for I18N HTML content.
NOTE: For the real skinny on Unicode, visit www.unicode.org or www.macchiato.com.
Internally, ColdFusion uses Unicode (UCS-2), which is efficient to process because it's fixed width (2 bytes per character), but economical bandwidth usage requires single-byte encoding. To me, Unicode smells inefficient. However, the twin goals of development simplification and long-term code management are much more important than any superficial bandwidth inefficiency.
Now before you start complaining, "Hey, that smells inefficient to me, too!" stop and consider the nature of UTF-8, a multibyte encoding in which a character can be represented by one to four bytes. That might sound uneconomical, but bear these facts in mind:
- The vast majority of text transmitted on the Internet can be represented by ASCII, which UTF-8 encodes as 1 byte (7-bit).
- UTF-8 encodes non-ASCII characters such as those used in Western Europe and Arabic countries as 2 bytes.
- Most Asian characters are encoded as 3 bytes.
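The byte counts in the list above are easy to check. This is a small hypothetical Java sketch (the class name Utf8LengthDemo and the utf8Length helper are made up for this example) that encodes a few sample characters as UTF-8 using the standard String.getBytes call and reports how many bytes each one takes.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo: how many bytes UTF-8 spends on different kinds of characters.
class Utf8LengthDemo {

    // Number of bytes the string occupies once encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // Characters are written as escapes so the source file's own encoding doesn't matter.
        System.out.println("ASCII letter A:           " + utf8Length("A"));
        System.out.println("Latin e-acute (U+00E9):   " + utf8Length("\u00E9"));
        System.out.println("Arabic ain (U+0639):      " + utf8Length("\u0639"));
        System.out.println("Thai sara ai mai malai (U+0E44): " + utf8Length("\u0E44"));
        System.out.println("CJK ideograph (U+65E5):   " + utf8Length("\u65E5"));
        // Characters beyond the Basic Multilingual Plane are the four-byte case.
        System.out.println("Supplementary (U+1F600):  " + utf8Length("\uD83D\uDE00"));
    }
}
```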
UTF-8 encoding is therefore as efficient as it needs to be (despite urban myths to the contrary).
So "Just use Unicode", introduce some simplicity to the G11N process, and make UTF-8 your application's sole encoding. Using Unicode simplifies things tremendously. You only have to deal with one encoding on the front end and back end. You will always know the data's encoding, no matter what happens to it. And, of course, you'll be on the same page with ColdFusion MX 7.
No need to take my word for it; the latest W3C working draft on authoring I18N XHTML and HTML documents actually recommends using UTF-8 or another Unicode encoding: "Choose UTF-8 or another Unicode encoding for all content." (See http://www.w3.org/TR/2003/WD-i18n-html-tech-20031009/.)
Next, let's take a look at putting Unicode to some actual use in resource bundles.
Resource Bundles
What's a resource bundle? When Java folks begin making an application I18N, they always talk about "isolating locale-specific data" and for the most part are referring to text data. The accepted technique for this is to create ResourceBundle objects backed by properties files consisting of key/value pairs (see Listing 23.6 for an example).
The concept is rather straightforward; a "key" (from our example, go) has a "value" (Go) assigned to it. Dissecting the properties filename, test_en_US.properties (shown in the example's comments), we can see its locale (en_US) as well as the resource bundle name (test). Java properties files use escaped ASCII for languages with characters beyond ISO-8859-1 encoding (see the later section "Resource Bundle Tools" for more on this); Listing 23.7 shows an example for Thai (th_TH) locale. The value for the key go is replaced by escaped ASCII encoding for the Thai word for Go (\u0E44\u0E1B).
You've probably caught on to the fact that both properties files contain the same keys with different values per locale. Instead of hard-coding text in applications, we can now use resource bundle keys that will have their values substituted on a per-locale basis when the page is processed.
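The per-locale key lookup described above can be sketched in a few lines of Java. This is an illustration under stated assumptions, not the mechanism used later in the chapter: the class name ResourceBundleSketch and its methods are hypothetical, and instead of java.util.ResourceBundle reading files from disk, it feeds the same key/value text found in Listings 23.6 and 23.7 to java.util.Properties from in-memory strings so that the example is self-contained. Properties.load decodes the escaped ASCII sequences back into Thai characters.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

// Hypothetical sketch: one set of keys, one Properties object per locale.
class ResourceBundleSketch {

    private final Map<String, Properties> bundles = new HashMap<>();
    private final String defaultLocale;

    ResourceBundleSketch(String defaultLocale) {
        this.defaultLocale = defaultLocale;
    }

    // Parses properties-file text (key=value lines) and stores it under the locale id.
    void load(String locale, String propertiesText) {
        Properties p = new Properties();
        try {
            p.load(new StringReader(propertiesText)); // also decodes backslash-u escapes
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        bundles.put(locale, p);
    }

    // Looks the key up for the locale, then in the default locale, else returns the key.
    String text(String locale, String key) {
        Properties p = bundles.get(locale);
        String value = (p == null) ? null : p.getProperty(key);
        if (value == null && bundles.containsKey(defaultLocale)) {
            value = bundles.get(defaultLocale).getProperty(key);
        }
        return value != null ? value : key;
    }

    // Builds a sample holding the en_US and th_TH values from the two listings.
    static ResourceBundleSketch sample() {
        ResourceBundleSketch rb = new ResourceBundleSketch("en_US");
        rb.load("en_US", "go=Go\ncancel=Cancel\n");
        rb.load("th_TH", "go=\\u0E44\\u0E1B\ncancel=\\u0E22\\u0E01\\u0E40\\u0E25\\u0E34\\u0E01\n");
        return rb;
    }

    public static void main(String[] args) {
        ResourceBundleSketch rb = sample();
        System.out.println(rb.text("en_US", "go"));
        System.out.println(rb.text("th_TH", "go").length()); // two Thai characters
        System.out.println(rb.text("fr_CA", "cancel"));      // unsupported locale, default used
    }
}
```

Falling back to the default locale (and finally to the key itself) is a design choice made for this sketch; it keeps a missing translation visible on the page instead of raising an error.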
Listing 23.6. test_en_US.properties: en_US Locale Resource Bundle Example
#Resource Bundle: test_en_US.properties - File automatically generated by RBManager at Mon Dec 08 18:08:52 GMT+07:00 2003
#Mon Dec 08 18:08:52 GMT+07:00 2003
go=Go
cancel=Cancel
Listing 23.7. test_th_TH.properties: th_TH Locale Resource Bundle Example
#Resource Bundle: test_th_TH.properties - File automatically generated by RBManager at Mon Dec 08 19:06:07 GMT+07:00 2003
#Mon Dec 08 19:06:07 GMT+07:00 2003
go=\u0E44\u0E1B
cancel=\u0E22\u0E01\u0E40\u0E25\u0E34\u0E01
NOTE: Java I18N is certainly a good role model for ColdFusion MX 7 G11N work. I'm not ashamed to admit that many of the ideas in this chapter are derived from Java I18N work; the Java world has been at this G11N game a lot longer than many of us ColdFusion developers.
Now let's go through a simple example converting some ColdFusion code with hard-coded text to make use of resource bundles.
Using a Resource Bundle
Suppose we have a simple login form (Listing 23.8) that we want to use across all the locales supported by our application. For this exercise, the first thing we need to do is to pick through the code and isolate the text that needs replacing with resource bundle keys (highlighted in Listing 23.8). So far, so good.
Listing 23.8. noni18nLogin.cfm: Non-I18N Login Form
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>Please login</title>
<style type="text/css" media="screen">
TABLE {
font-size : 85%;
font-family : Arial, Helvetica, sans-serif;
}
</style>
</head>
<body text="#330000">
<form action="authenticate.cfm" method="post" name="loginForm" id="loginForm">
<table cellpadding="5" cellspacing="5" border="0">
<caption>
<font size="+1" color="#FF0000"><b>Please login</b></font>
</caption>
<tr>
<td align="right">user name:</td>
<td><input type="text" name="userName" size="10" maxlength="20"></td>
</tr>
<tr>
<td align="right">password:</td>
<td><input type="password" name="password" size="10" maxlength="20"></td>
</tr>
<tr valign="top" bgcolor="Silver">
<td colspan="2" align="center">
<input type="submit" value="login">
<input type="reset" value="clear">
</td>
</tr>
</table>
</form>
</body>
</html>
Let's also suppose our application design dictates a couple of things. The application's resource bundles will all be initialized at the same time and loaded into a ColdFusion structure in the APPLICATION scope; for this example, let's call it APPLICATION.loginRB. Also, each user's locale is detected using the geoLocator CFC discussed previously and stored in a SESSION scope variable, SESSION.locale.
TIP: It's a very good idea to logically separate your resource bundles into smaller files based on your application's modules.
Next, Listing 23.9 shows what our original non-I18N login form would look like after we replace its static text with ColdFusion-flavored resource bundle keys (and in light of the application design outlined just above). To illustrate what's happening, let's dissect one key:
APPLICATION.loginRB[SESSION.locale].loginFormTitle
The APPLICATION.loginRB indicates which resource bundle we want to use. SESSION.locale indicates which locale this user is in and acts as a key into the APPLICATION.loginRB structure. And loginFormTitle is the exact resource bundle key for which we want to substitute localized text.
The original static English text became part of the en_US locale resource bundle, and the resulting form is shown in Figure 23.3. The same login form, using the same code from Listing 23.9, is shown for users in the Thai locale (th_TH) in Figure 23.4. This is quite a bit easier than trying to maintain separate form files, one per locale.
Listing 23.9. i18nlogin.cfm: I18N Login Form
<!---
NOTE: these bits WOULD NOT normally be used in this page but rather
in an initialization routine. The example assumes you have downloaded & installed the rbJava.cfc.
--->
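<cfscript>
rB=createObject("component","rbJava");
geoL=createObject("component","geoLocator");
i18nUtil=createObject("component","i18nUtil");
// load each locale's resource bundle into the APPLICATION scope structure the form reads from
APPLICATION.loginRB=structNew();
APPLICATION.loginRB["en_US"]=rB.getResourceBundle("loginRB","en_US");
APPLICATION.loginRB["th_TH"]=rB.getResourceBundle("loginRB","th_TH");
// figure out the user's locale
SESSION.locale=geoL.findLocale(CGI.remote_addr,CGI.http_accept_language,"en_US");
// SESSION.language is used below but never set in this listing;
// assumed here to be the language part of the locale (en, th)
SESSION.language=listFirst(SESSION.locale,"_");
// is this a BIDI locale?
if (i18nUtil.isBIDI(SESSION.locale))
SESSION.writingDir="rtl";
else
SESSION.writingDir="ltr";
</cfscript>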
<cfprocessingdirective pageencoding="utf-8">
<cfcontent type="text/html; charset=utf-8">
<cfoutput>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html dir="#SESSION.writingDir#" lang="#session.language#">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Content-Language" content="#session.language#">
<title>#APPLICATION.loginRB[SESSION.locale].loginFormTitle#</title>
<style type="text/css">
TABLE {
font-size : 85%;
font-family : "Arial Unicode MS", Arial, Helvetica, sans-serif;
}
</style>
</head>
<body text="##330000">
<form action="authenticate.cfm" method="post" name="loginForm" id="loginForm">
<table cellpadding="5" cellspacing="5" border="0">
<caption>
<font size="+1" color="##FF0000">
<b>#APPLICATION.loginRB[SESSION.locale].loginFormTitle#:</b>
</font>
</caption>
<tr>
<td align="right">#APPLICATION.loginRB[SESSION.locale].userNameLabel#:</td>
<td><input type="text" name="userName" size="10" maxlength="20"></td>
</tr>
<tr>
<td align="right">#APPLICATION.loginRB[SESSION.locale].passwordLabel#:</td>
<td><input type="password" name="password" size="10" maxlength="20"></td>
</tr>
<tr valign="top" bgcolor="Silver">
<td colspan="2" align="center">
<input type="submit" value="#APPLICATION.loginRB[SESSION.locale].loginButton#">
<input type="reset" value="#APPLICATION.loginRB[SESSION.locale].clearButton#">
</td>
</tr>
</table>
</form>
</body>
</html>
</cfoutput>
Figure 23.3. The en_US locale login form.

Figure 23.4. The th_TH locale login form.
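To make the APPLICATION.loginRB[SESSION.locale].key lookup concrete outside of ColdFusion, here is a minimal Java sketch of the same idea: one map per locale, keyed by resource bundle key, plus a writing-direction check. The class and method names are hypothetical, the texts are the go/cancel values from Listings 23.6 and 23.7, and the list of right-to-left languages is my own assumption rather than what the i18nUtil CFC actually checks.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

class LocaleKeyedText {
    // Hypothetical stand-in for APPLICATION.loginRB: locale -> (key -> text).
    static final Map<String, Map<String, String>> BUNDLES = new HashMap<>();

    // Assumed set of right-to-left language codes (not taken from the i18nUtil CFC).
    static final Set<String> RTL_LANGUAGES =
            new HashSet<>(Arrays.asList("ar", "he", "iw", "fa", "ur"));

    static {
        Map<String, String> english = new HashMap<>();
        english.put("go", "Go");
        english.put("cancel", "Cancel");
        BUNDLES.put("en_US", english);

        Map<String, String> thai = new HashMap<>();
        thai.put("go", "\u0E44\u0E1B");
        thai.put("cancel", "\u0E22\u0E01\u0E40\u0E25\u0E34\u0E01");
        BUNDLES.put("th_TH", thai);
    }

    // Same shape as APPLICATION.loginRB[SESSION.locale].someKey.
    static String text(String locale, String key) {
        return BUNDLES.get(locale).get(key);
    }

    // "rtl" for locales whose language is in the assumed list, otherwise "ltr".
    static String writingDir(String locale) {
        String language = locale.split("_")[0];
        return RTL_LANGUAGES.contains(language) ? "rtl" : "ltr";
    }

    public static void main(String[] args) {
        System.out.println(text("en_US", "go") + " / " + writingDir("en_US"));
        System.out.println(text("th_TH", "go") + " / " + writingDir("th_TH"));
    }
}
```

The point of the shape is that the page code never changes: only the locale key does.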

What Isn't a Resource Bundle?
Listing 23.10 is an example of what a resource bundle is not. Let me put to rest the notion of using ColdFusion code in lieu of "proper" resource bundles. There are several reasons not to do this; chief among these are:
- It mixes code and text like the bad old spaghetti code days.
- It requires some knowledge of ColdFusion to manage these files, and you do not want ColdFusion developers handling the translation of, say, information about brain surgery.
- It doesn't lend itself to using any of the nifty resource bundle management tools (see "Resource Bundle Tools" coming up) that are commonplace in the G11N world.
So using ColdFusion code instead of resource bundles is a bad habit: it might work with small files for a few languages but will eventually break down as your G11N applications become more complex and cover more locales. If you're just beginning G11N work, don't start out with this method no matter how tempting it looks. And if you're already using this approach, think about quitting while you're ahead. Mingling code and text in this way is not a good idea.
Listing 23.10. notRB.cfm: Not a Resource Bundle
<cfset loginRB=structNew()>
<cfset loginRB.en_US.loginFormTitle="Please login">
<cfset loginRB.en_US.userNameLabel="user name">
<cfset loginRB.en_US.passwordLabel="password">
<cfset loginRB.en_US.loginButton="login">
<cfset loginRB.en_US.clearButton="clear">
Resource Bundle Flavors
There are actually two kinds of resource bundles that can be used with ColdFusion. The first is what might be termed "CFMX UTF-8," where the resource bundle is constructed similar to a traditional INI file. A variable's text value is written out using UTF-8-encoded human-readable text. It's simple to implement, relying solely on ColdFusion code to parse the files. Reading it requires nothing more complex than Notepad (which to my mind makes it unsuitable for larger, more complex applications).
TIP: There are ready-made CFCs for handling resource bundles available in the usual places (mentioned previously).
The second flavor of resource bundle is the "proper" Java-style resource bundle as outlined earlier. These resource bundles require the use of core Java classes, which entails some overhead but has the benefits of being "standard" and having a wealth of ready-made (mostly open-source) tools to manage them. You can further subdivide this resource bundle flavor into two subflavors, depending on how you're able (or want) to access these files.
"Pure" resource bundles are accessed using the Java ResourceBundle class. This class provides automatic determination of the resource bundle from the locale, and automatic fallback locales (if it can't find a resource bundle for a given locale, it truncates that locale back to the language identifier and searches again; if it can't find that resource bundle, it falls back to the base one, usually en_US). The class does, however, require that all resource bundles be located somewhere on a Java classpath, which makes for some complexity in shared-host environments.
The other subflavor uses the Java PropertyResourceBundle class to access resource bundles. It provides none of the automatic features of the ResourceBundle class but does have the advantage of letting you locate your resource bundles anywhere, although you must explicitly load each resource bundle. Table 23.6 summarizes the pros and cons of the resource bundle types.
RESOURCE BUNDLE FLAVORS | PRO | CON |
---|---|---|
ColdFusion UTF-8 | Human readable; easy to manage (Notepad, etc.); simple to implement in ColdFusion; quite fast | Complex resource bundles quickly become hard to manage; can't easily use standard resource bundle tools |
Java ResourceBundle class | Pure standard Java resource bundle solution; handles resource bundles from standard tools; self-determines resource bundle for locale; handles complex resource bundles quite easily | Not human readable; requires that resource bundle be somewhere in classpath; requires createObject permission; some overhead in using Java object |
Java PropertyResourceBundle class [a] | Resource bundle can be anywhere; pure standard Java resource bundle solution; handles resource bundles from standard tools; handles complex resource bundles quite easily | Not human readable; requires caller to determine resource bundle from locale; requires createObject permission; some overhead in using Java object |
[a] See http://www.sustainablegis.com/unicode/resourceBundle/javaRB.cfm for an example.
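As a rough illustration of the two Java subflavors, the sketch below loads the contents of Listings 23.6 and 23.7 explicitly through PropertyResourceBundle, and separately asks the JDK which locales ResourceBundle would try, in order, for th_TH. It is plain Java rather than the rbJava CFC; the class name and the inlined strings are mine, and the bundles are read from strings only so the example needs no files or classpath setup.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.List;
import java.util.Locale;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

class BundleFlavors {
    // Key/value lines of Listings 23.6 and 23.7, inlined for the sketch.
    static final String EN_US = "go=Go\ncancel=Cancel\n";
    static final String TH_TH =
            "go=\\u0E44\\u0E1B\ncancel=\\u0E22\\u0E01\\u0E40\\u0E25\\u0E34\\u0E01\n";

    // PropertyResourceBundle flavor: the caller decides which bundle to load and from where.
    static ResourceBundle load(String propertiesText) {
        try {
            return new PropertyResourceBundle(new StringReader(propertiesText));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // ResourceBundle flavor: the candidate locales tried in turn, most specific first.
    static List<Locale> fallbackChain(String baseName, Locale locale) {
        return ResourceBundle.Control
                .getControl(ResourceBundle.Control.FORMAT_DEFAULT)
                .getCandidateLocales(baseName, locale);
    }

    public static void main(String[] args) {
        System.out.println(load(EN_US).getString("go"));
        System.out.println(load(TH_TH).getString("go"));
        System.out.println(fallbackChain("test", Locale.forLanguageTag("th-TH")));
    }
}
```

Note that the escaped ASCII in the Thai bundle is decoded for you when the properties are read, so the caller only ever sees real Thai characters.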
Now let's have a look at some tools to manage resource bundles.
Resource Bundle Tools
It's a fact of life that large, complex G11N applications usually generate large, complex resource bundles. Trying to manage these with Notepad and Post-its isn't very realistic. You have to manage the creation/editing of the resource bundle keys, manage the creation/editing of resource bundles per locale, manage keys that have been translated into certain locales, and so on. Luckily, the Java I18N world has developed several resource bundle management tools that we can use for this task. Foremost among these (and also my favorite) is ICU4J's pure-Java Resource Bundle Manager (RB Manager). RB Manager helps solve day-to-day L10N problems in the following ways:
- Handles editing multiple language files
- Provides sophisticated resource bundle search functionality
- Checks resource bundle keys for duplicates and for proper format
- Provides a grouping of resources; individual translations are easier to find
- Lets each language file display only the resources that are still untranslated (wonderful for tracking what still needs to be translated)
- Keeps track of statistics such as number of resources, untranslated items, and so on
- Handles importing and exporting of translation data into multiple formats such as XLIFF, TMX, ICU, and more
- Cuts down on development, translation, and debugging time in any internationalized setting
You can find a complete tutorial for RB Manager in the download file.
Figure 23.5 shows a typical resource bundle for English and Arabic languages being managed using RB Manager. In this example, the view provides a list of all resource bundle keys and their English and Arabic translations. You can download a free copy of RB Manager from ICU4J's site, http://www-306.ibm.com/software/globalization/icu/rbmanager.jsp.
Figure 23.5. ICU4J's pure-Java Resource Bundle Manager.

Other resource bundle tools worth a look include the following:
- Attesoro (http://ostermiller.org/attesoro/) is another pure-Java solution that can produce proper Java resource bundles.
- BabelFish (http://www.solyp.com/2975l), also a Java program, has an interesting feature: it has links to machine translation sites.
- Zaval Java Resource Editor (http://www.zaval.org/products/jrc-editor/). Yes, it's another Java program.
- I18nEdit (http://www.cantamen.de/i18nedit.php?lang=en) is another Java-based resource editor; most noteworthy is the nifty built-in Unicode character picker for those days when you're too lazy to load another locale.
- native2ascii is a command-line tool that will convert a file with native-encoded characters (the caveat being that the "native" encoding must be one of the Java-supported ones) to one with Unicode-encoded characters. It's found in the bin directory of your Java JRE/JDK installation.
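To show what "escaped ASCII" means in practice, here is a minimal Java sketch of the kind of conversion native2ascii performs: every character outside 7-bit ASCII becomes a \uXXXX escape. This is my own illustration, not the tool's source; the class name is hypothetical, and I use uppercase hex digits only to match the RB Manager output in Listing 23.7.

```java
class AsciiEscaper {
    // Replace each non-ASCII UTF-16 code unit with a backslash-u escape of four hex digits.
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 128) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04X", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Thai word for Go from Listing 23.7, written here as Java escapes.
        System.out.println(escape("go=\u0E44\u0E1B"));
    }
}
```

Plain ASCII keys and values pass through untouched, which is why the en_US bundle in Listing 23.6 looks like ordinary text.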
Our next stop on the ColdFusion MX 7 G11N tour deals with mailing addresses.
Addresses
Living outside the United States, one of my pet peeves is the assumption by many sites that users' addressing schemes are similar to their own. A prime example of this is the State field. Most countries do not have State as part of their addressing scheme, and ColdFusion developers who add it to their applications or, even worse, require it will only confuse and possibly annoy these users. Developers need either to intimately understand a locale's addressing scheme (very possible through localization research) or to build flexibility into their address-capture routines and storage.
Developers should also not assume that postal codes (ZIP codes) confine themselves to a particular format or length. For example, Japanese postal codes can have a format such as 460-0002 (Aichi), whereas Canadian ones come in the form V2B 5S8 (Kamloops, British Columbia). Even the placement of the postal code in a mailing address can vary widely. In Laos, the postal code is to the left of the locality (01160 XAYSETHA), and in Japan it's to the left of the country (460-0002 JAPAN).
Let's look at a brief example of these ideas. Listing 23.11 shows a table design (Microsoft SQL Server data types) to hold worldwide customer information for a spatial data set product. This simple table design comes from my years of dealing with a global customer base. Its flexibility is its most important point.
Listing 23.11. galacticCustomer.txt: Galactic Customer Table Design
[CustomerID] [int] IDENTITY (1, 1) NOT NULL
[Salutation] [nvarchar] (100) NULL -- not fixed, as customer prefers
[FirstName] [nvarchar] (100) NOT NULL
[LastName] [nvarchar] (200) NOT NULL
[eMail] [varchar] (50) NULL -- may not have email
[PurchaseDate] [datetime] NOT NULL
[Organization] [nvarchar] (200) NULL -- company, government office, etc.
[Address] [ntext] NULL -- nTEXT will hold anything customer provides
[City] [nvarchar] (150) NULL -- may not have a city
[Locality] [nvarchar] (200) NULL -- state/province/etc. may or may not have
[Country] [varchar] (35) NOT NULL -- minimally have this, pulled from our SELECT
[PostalCode] [varchar] (40) NULL -- may or may not have
[Phone] [varchar] (50) NULL -- plenty of room
[Fax] [varchar] (50) NULL -- plenty of room
[FreeCustomer] [bit] NOT NULL -- local schools, etc. on charity list
[timestamp] [timestamp] NULL -- edit/full text indexing flag
NOTE: In Microsoft SQL Server's T-SQL DDL, NOT NULL means required data, whereas NULL means not required.
At first glance, there's nothing particularly remarkable about this design; however, take note of a few items. Many columns that you might normally compel a user to supply are not required, and many columns might seem overly large to someone dealing with just one locale. For example, City isn't required because in some cases there isn't an identifiable city in an address. Address, on the other hand, is an NTEXT data type capable of holding a huge amount of freeform text that might include streets, lanes, subdistricts, districts, and even directions. Notice also that SQL Server's Unicode data types (NVARCHAR and NTEXT) are used to allow customers to supply their own language version of name, address, and so on. For more information on address formats, see http://www.upu.int/post_code/en/addressing.html.
Date/Time
Addressing nuances might frustrate users, but date formatting certainly frustrates ColdFusion developers. If the ColdFusion Support Forums are any indication, even within one single locale, dates often make developers punch drunk. Even though dates are basically a simple combination of day, month, and year, there's an extensive and often confusing variety of date formats across locales. For example, 12/10/56 could be interpreted in a number of ways. In Thailand (which has a short date format of day/month/year), 12/10/56 would be taken to mean October 12, 1956. In the United States (which has a short date format of month/day/year) that date would be December 10, 1956.
The same string in Japan (where the short date format is year/month/day) would be hopelessly broken: October 56, 1912. Keeping date formats straight among locales is critical to developing G11N applications.
Our next date/time formatting issue is all the various calendars in use throughout the world.
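Before moving on, the 12/10/56 ambiguity can be made concrete with a small Java sketch that parses the same string under the three field orders just described. This uses the JDK's java.time classes, not ColdFusion's locale functions; the class and constant names are hypothetical, and reading two-digit years as 1900 to 1999 is an assumption made only so that 56 means 1956 as in the text.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.time.temporal.ChronoField;

class ShortDates {
    // day/month/year, as in the Thai short date format
    static final DateTimeFormatter DAY_MONTH_YEAR = new DateTimeFormatterBuilder()
            .appendPattern("dd/MM/")
            .appendValueReduced(ChronoField.YEAR, 2, 2, 1900)
            .toFormatter();

    // month/day/year, as in the U.S. short date format
    static final DateTimeFormatter MONTH_DAY_YEAR = new DateTimeFormatterBuilder()
            .appendPattern("MM/dd/")
            .appendValueReduced(ChronoField.YEAR, 2, 2, 1900)
            .toFormatter();

    // year/month/day, as in the Japanese short date format
    static final DateTimeFormatter YEAR_MONTH_DAY = new DateTimeFormatterBuilder()
            .appendValueReduced(ChronoField.YEAR, 2, 2, 1900)
            .appendPattern("/MM/dd")
            .toFormatter();

    // Returns the parsed date, or null when the text is not a valid date in that format.
    static LocalDate parse(String text, DateTimeFormatter format) {
        try {
            return LocalDate.parse(text, format);
        } catch (DateTimeParseException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        String text = "12/10/56";
        System.out.println("day/month/year: " + parse(text, DAY_MONTH_YEAR));
        System.out.println("month/day/year: " + parse(text, MONTH_DAY_YEAR));
        System.out.println("year/month/day: " + parse(text, YEAR_MONTH_DAY));
    }
}
```

The year/month/day reading is rejected outright because there is no 56th of October, which is the "hopelessly broken" case.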
Calendars
Besides date formatting, developers should not forget the types of calendars in use within a given locale. This can be critical; a month in one calendar might not cover the exact same time span in another. Weeks and weekends don't always start on the same day across locales using different calendars, or even within the same calendar as in the case of Europe versus the U.S.
Out of more than 40 calendars in use around the world today, we'll examine the six most common (the "big six"), and throw in one rare calendar just for added flavoring. The "big six" discussed here are, of course, supported by the ICU4J library. The reason we're discussing these at all is to give ColdFusion MX 7 developers some background information so that you're not operating in a vacuum with these calendars behaving like some sort of mysterious "black box."
Gregorian Calendar
Pope Gregory XIII introduced the Gregorian calendar in 1582 as an adaptation of the Julian calendar (named after Julius Caesar), when the 10-day difference between the actual time of year and traditional time of year on which calendar events occurred became intolerable. This calendar was constructed to give a closer approximation to the tropical year, which is the actual length of time it takes for the Earth to complete one orbit around the Sun.
The actual changeover from Julian to Gregorian calendar resulted in quite an interesting "month." When England and her colonies made the change to the Gregorian calendar in 1752 (not all countries adopted this calendar at the same time), it created a month of September something like what is shown in Table 23.7, with September 2 followed directly by September 14.
Buddhist Calendar
The Buddhist calendar is identical to the Gregorian in all respects except for the year and era (B.C., A.D., etc.). Years are numbered from the death (parinirvana) of the Buddha in 543 B.C. (Gregorian), so that 1 A.D. (Gregorian) is equivalent to 544 B.E. (Buddhist Era) and 2005 A.D. is 2548 B.E. The quick-and-dirty conversion is to simply add 543 years to the Gregorian year to arrive at the Buddhist year, and subtract 543 years to go the other way. In predominantly Buddhist countries such as Thailand (where I live these days) the Buddhist calendar is the civil calendar (the official one in general use by most folks and, of course, the government). This calendar is often used elsewhere for religious purposes.
You can see an example of the calendar here: http://www.sustainablegis.com/projects/calendars/buddhistCalendarTB.cfm, with output shown in Figure 23.6.
Figure 23.6. Buddhist calendar output.
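The add-543 rule is easy to check in code. The sketch below does the quick-and-dirty arithmetic and compares it with the JDK's own Thai Buddhist calendar (java.time.chrono.ThaiBuddhistDate), which I am using here as a stand-in for ICU4J's Buddhist calendar; the class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoField;

class BuddhistYears {
    // Quick and dirty: Buddhist Era year = Gregorian year + 543.
    static int toBuddhistYear(int gregorianYear) {
        return gregorianYear + 543;
    }

    static int toGregorianYear(int buddhistYear) {
        return buddhistYear - 543;
    }

    // The same answer via the JDK's Thai Buddhist calendar.
    static int buddhistYearOf(LocalDate date) {
        return ThaiBuddhistDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        System.out.println(toBuddhistYear(2005));
        System.out.println(buddhistYearOf(LocalDate.of(2005, 2, 3)));
    }
}
```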

Chinese Calendar
The traditional Chinese calendar is a lunisolar calendar (interestingly, the same type as the Hebrew calendar). Months start with a new moon, with each month numbered according to solar events. Why? It guarantees that month 11 will always contain the winter solstice. How? Leap months are inserted in certain years. These leap months are numbered the same as the month they follow (how's that for complication?). Which month is a leap month? It depends entirely on the movements of the sun and moon.
Unlike in the Gregorian calendar, the Era field holds a 60-year cycle number rather than the usual B.C./A.D. Right now, in 2005, we're in the 78th cycle, which began in 1984 A.D. Years are counted sequentially, numbering from the 61st year of the reign of Huang Di (more or less 2637 B.C.), which is designated year 1 on the Chinese calendar; yes, that's right, this calendar system is over 4,000 years old. Figure 23.7 shows an example of the calendar's output.

Figure 23.7. Chinese calendar output.
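The cycle arithmetic can be sketched in a few lines of Java. This is only the year bookkeeping, under two stated assumptions: year 1 of cycle 1 is 2637 B.C. as given above, and the Gregorian year is mapped to the Chinese year that starts within it (dates before Chinese New Year actually belong to the previous Chinese year, which a real implementation such as ICU4J's handles). The class and method names are hypothetical.

```java
class ChineseCycles {
    // 2637 B.C. written as an astronomical year number (1 B.C. is year 0).
    static final int EPOCH_YEAR = -2636;

    // 60-year cycle number for the Chinese year that begins in the given Gregorian year.
    static int cycle(int gregorianYear) {
        return Math.floorDiv(gregorianYear - EPOCH_YEAR, 60) + 1;
    }

    // Position of that year within its cycle, from 1 to 60.
    static int yearOfCycle(int gregorianYear) {
        return Math.floorMod(gregorianYear - EPOCH_YEAR, 60) + 1;
    }

    public static void main(String[] args) {
        System.out.println("2005: cycle " + cycle(2005) + ", year " + yearOfCycle(2005));
    }
}
```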

Hebrew Calendar
The Hebrew calendar is also lunisolar, which gives it what some folks would call "a number of interesting properties." As in the Chinese calendar, months start on the day of each new moon (the ICU4J library actually makes an approximation of this). The solar year (which, as everyone knows, is 365.24 days) is not an even multiple of the lunar month (approximately 29.53 days), so an extra leap month is inserted in 7 out of every 19 years (this is beginning to sound interesting). And just to make sure everybody's paying attention, the start of a year can be delayed by up to 3 days in order to prevent certain holidays from falling on the Sabbath (as well as to prevent illegal year lengths). As the cherry on the ice cream, the lengths of certain months can vary depending on the number of days in the year. And finally, years are counted since the creation of the world (A.M. or anno Mundi), believed to have taken place in 3761 B.C. It hurts my head, too, and is a compelling reason to make use of the ICU4J library and let the smart guys at IBM worry about this sort of thing.
An example can be found here: http://www.sustainablegis.com/projects/calendars/hebrewCalendarTB.cfm. See Figure 23.8 for an example of the calendar's output.
Figure 23.8. Hebrew calendar output.
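The "7 out of every 19 years" rule is the one piece of the Hebrew calendar that fits in a few lines. The Java sketch below uses the standard arithmetic test for a leap year in the 19-year cycle; it deliberately ignores the year-start delays and variable month lengths described above, and the class and method names are hypothetical.

```java
class HebrewLeapYears {
    // Years 3, 6, 8, 11, 14, 17 and 19 of each 19-year cycle carry a leap month.
    static boolean isLeapYear(int hebrewYear) {
        return Math.floorMod(7 * hebrewYear + 1, 19) < 7;
    }

    static int monthsInYear(int hebrewYear) {
        return isLeapYear(hebrewYear) ? 13 : 12;
    }

    // Total lunar months in the 19 consecutive years starting at firstYear.
    static int monthsInCycle(int firstYear) {
        int months = 0;
        for (int year = firstYear; year < firstYear + 19; year++) {
            months += monthsInYear(year);
        }
        return months;
    }

    public static void main(String[] args) {
        System.out.println("5765 leap? " + isLeapYear(5765));
        System.out.println("months in 19 years: " + monthsInCycle(5758));
    }
}
```

Nineteen years come to 235 lunar months, and 235 x 29.53 days is within a fraction of a day of 19 x 365.24 days, which is why the cycle keeps the months in step with the seasons.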

Islamic Calendar
The Islamic calendar is also known as Hijri because it starts at the time of Mohammed's journey or hijra to Medinah on Thursday, July 15, 622 A.D. It is the civil calendar used by most of the Arab world and is the religious calendar of the Islamic faith. This calendar is a strict lunar calendar; an Islamic year of 12 lunar months therefore does not exactly correspond to the solar year used by the Gregorian calendar system. An Islamic year averages about 354 days, so viewed from the Gregorian, each subsequent Islamic year starts about 11 days earlier.
The civil Islamic calendar uses a fixed cycle of alternating 29- and 30-day months, with a leap day added to the last month of 11 out of every 30 years. That makes the calendar predictable, so it is used as the civil calendar in a number of Arab countries.
The Islamic religious calendar, however, is based on the actual observation of the crescent moon. This sounds predictable and simple enough, but that observation varies based on where you are when you look (your geography), when you look (sunset varies by season), moon orbit "eccentricities," and even the weather (too cloudy and you obviously can't see the moon). All this makes it impossible to calculate in advance, so the start of a month in the religious calendar might differ from the civil calendar by up to 3 days.
You can see an example here: http://www.sustainablegis.com/projects/calendars/islamicCalendarTB.cfm. Figure 23.9 displays the output.
Figure 23.9. Islamic calendar output.
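The civil calendar's "11 leap years out of every 30" can be sketched as follows. The text does not say which 11 years get the leap day, so the rule below is an assumption: it is one commonly used tabular arrangement, and other arrangements differ in a year or two per cycle. The class and method names are hypothetical.

```java
class CivilIslamicYears {
    // Assumed tabular rule: 11 leap years in each 30-year cycle.
    static boolean isLeapYear(int islamicYear) {
        return Math.floorMod(11 * islamicYear + 14, 30) < 11;
    }

    // Six 30-day months plus six 29-day months, plus one leap day in leap years.
    static int daysInYear(int islamicYear) {
        return isLeapYear(islamicYear) ? 355 : 354;
    }

    // Total days in the 30 consecutive years starting at firstYear.
    static int daysInCycle(int firstYear) {
        int days = 0;
        for (int year = firstYear; year < firstYear + 30; year++) {
            days += daysInYear(year);
        }
        return days;
    }

    public static void main(String[] args) {
        int days = daysInCycle(1);
        System.out.println("days in 30 years: " + days);
        System.out.println("average year: " + (days / 30.0));
    }
}
```

The average works out to a little over 354 days, which is where the "starts about 11 days earlier each year" figure comes from.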

Japanese Calendar
The Japanese calendar, sometimes called the Japanese Emperor Era calendar, is identical to the Gregorian calendar except for the year and era. Each Emperor's ascension to the throne begins a new era. Each new era's years are numbered starting with 1 (the year of ascension). What could be simpler? The "modern" eras began as follows:
- Meiji. January 8, 1868 A.D.
- Taisho. July 30, 1912 A.D.
- Showa. December 25, 1926 A.D.
- Heisei. January 7, 1989 A.D. (current era)
You can see this calendar in action here: http://www.sustainablegis.com/projects/calendars/japaneseCalendarTB.cfm, with output shown in Figure 23.10.
Figure 23.10. Japanese calendar output.
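For a feel of how era-based years behave, here is a small Java sketch using the JDK's own Japanese calendar (java.time.chrono.JapaneseDate) rather than ICU4J or the calendar CFC. The class and method names are hypothetical, and the JDK's era table is its own: its start dates may differ by a day from the list above, so the examples stay away from the boundaries where that matters.

```java
import java.time.LocalDate;
import java.time.chrono.JapaneseDate;
import java.time.chrono.JapaneseEra;
import java.time.temporal.ChronoField;

class EmperorEras {
    // The era a Gregorian date falls in, according to the JDK's era table.
    static JapaneseEra eraOf(LocalDate date) {
        return JapaneseDate.from(date).getEra();
    }

    // The year within that era; the first year of an era is year 1.
    static int yearOfEra(LocalDate date) {
        return JapaneseDate.from(date).get(ChronoField.YEAR_OF_ERA);
    }

    public static void main(String[] args) {
        LocalDate date = LocalDate.of(2005, 2, 3);
        System.out.println(eraOf(date) + " " + yearOfEra(date));
    }
}
```

Note how a single Gregorian year can straddle two eras: late December 1926 is both the last year of Taisho and year 1 of Showa.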

Persian Calendar
A Persian (or perhaps Iranian) calendar is the formal calendar in general use in Iran. It's also known as the solar Hijri calendar and sometimes as the Jalali calendar. I've also seen it described as the Shamsi calendar; quite frankly, I have no idea which is correct, so I'll stick with Persian.
The Persian calendar has a starting point that matches the Islamic calendar but is otherwise unrelated. The origin of this calendar can be traced back to the eleventh century when a group of astronomers (including the famous poet Omar Khayyam) created what was then called the Jalali calendar, with the "modern" version being adopted in 1925 A.D. Since it's one of the few calendars designed in the era of accurate positional astronomy, it's probably the most accurate solar calendar around today (we'll see why in a bit).
Like the Gregorian calendar, this calendar consists of 12 months; the first 6 are 31 days in length, the next 5 are 30 days, and the final month is 29 days in a normal year and 30 days in a leap year. To put it mildly, the Persian calendar uses a very complex leap-year structure; years are grouped into cycles that begin with 4 normal years, after which every 4th subsequent year in the cycle is a leap year. These cycles are in turn grouped into "grand" cycles of either 128 years (composed of cycles of 29, 33, 33, and 33 years) or 132 years (containing cycles of 29, 33, 33, and 37 years). A "great grand" cycle is composed of 21 consecutive 128-year grand cycles and a final 132-year grand cycle, for a total of 2,820 years. The pattern of normal and leap years, which began in 1925, will not repeat until the year 4745.
Each 2,820-year great grand cycle contains 2,137 normal years of 365 days, and 683 leap years of 366 days. The average year length over the great grand cycle is 365.24219852 days, which is so close to the actual solar tropical year of 365.24219878 days that the Persian calendar accumulates an error of only 1 day in every 3.8 million years.
If this isn't enough information for you, you might have a look at this site: http://www.tondering.dk/claus/cal/node6l.
TIP: At the time of this writing, HyperOffice (http://www.hyperoffice.com/) is in the process of developing an ICU4J-level Persian calendar. If you are interested in this component, contact Drew Morris (drew@hyperoffice.com) for more information.
If you have managed to plow through the preceding description, you will have a good idea of just how complex implementing a Persian calendar would actually be. For this reason, there are very few full-blown ICU4J-level implementations of this calendar. You can, however, find a Persian calendar CFC with rather limited functionality (no calendar math, no localized date/time string parsing, no metadata functions, and so on) at this site: http://www.sustainablegis.com/projects/persianCalendar/. Output is shown in Figure 23.11.
Figure 23.11. Output of Persian calendar (limited functionality).
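The cycle structure described above is easier to trust once you count it. The Java sketch below tallies the years and leap years in one 2,820-year great grand cycle, reading "4 normal years, then every 4th subsequent year is a leap year" as leap years falling on years 5, 9, 13, and so on of each cycle; that reading is my assumption, as are the class and method names. It only counts years and is in no way a Persian calendar implementation.

```java
class PersianGreatCycle {
    static final int[] GRAND_128 = {29, 33, 33, 33};
    static final int[] GRAND_132 = {29, 33, 33, 37};

    // Leap years in one cycle: years 5, 9, 13, ... up to the cycle length.
    static int leapYearsInCycle(int cycleLength) {
        int leaps = 0;
        for (int year = 5; year <= cycleLength; year += 4) {
            leaps++;
        }
        return leaps;
    }

    // Returns {total years, leap years} for 21 grand cycles of 128 years plus one of 132.
    static int[] totals() {
        int years = 0;
        int leaps = 0;
        for (int grand = 0; grand < 22; grand++) {
            int[] cycles = (grand < 21) ? GRAND_128 : GRAND_132;
            for (int length : cycles) {
                years += length;
                leaps += leapYearsInCycle(length);
            }
        }
        return new int[] {years, leaps};
    }

    public static void main(String[] args) {
        int[] t = totals();
        System.out.println(t[0] + " years, " + t[1] + " leap, " + (t[0] - t[1]) + " normal");
    }
}
```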

Calendar CFC Usage
Space doesn't permit me to post any of the code for the preceding calendar CFCs (each runs to over 1100 lines of code). What I will do instead is introduce some of the functions from these CFCs in order to help you to start thinking about using calendars in your G11N applications. (Note that many of these functions have had i18n added to their function name in order not to conflict with existing ColdFusion functions.)
The following are functions related to calendar math:
- i18nDateAdd returns a datetime object with units of time added. This should be used instead of ColdFusion's dateAdd function. Why? If you examine the output from the various calendars shown above, you will see that the same unit of time isn't equivalent across calendars. Adding 2 years to a date of 3-Feb-2005 for an Islamic calendar results in a date 709 days in the future; for the Hebrew calendar, it results in a date 739 days in the future; and for the Buddhist calendar it's 730 days.
- i18nDateDiff returns the difference in date parts between two dates. For the same reasons outlined for i18nDateAdd, this method should be used instead of ColdFusion's dateDiff function.
- i18nDateParse parses a date string formatted as FULL, LONG, MEDIUM, SHORT style into a valid date object.
- i18nIsWeekend returns a boolean indicating whether input date falls on a weekend according to a given calendar. Weekends do not begin on the same day of the week across all calendars.
- weekStarts returns the first day of week for a given calendar. Weeks do not start on the same day across calendars, or even across locales within the same calendar.
- i18nDaysInMonth returns the number of days in a given month.
- i18nDayOfWeek returns the day of week for a given date.
- is24HourFormat returns 0 if not 24-hour time format, 1 if 24-hour time format in 0-23 style, or 2 if 24-hour time format in 0-24 style.
- i18nIsLeapYear returns true if a given year is a leap year, false otherwise.
- getEras returns a locale-based era (A.H., A.D., B.C., etc.).
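To see why calendar-aware date math matters, the Java sketch below adds two calendar years to 3-Feb-2005 in different calendars and measures the result in days. It uses the JDK's java.time.chrono classes as a stand-in for the CFCs' i18nDateAdd, so two caveats apply: the JDK's Islamic calendar (HijrahDate) is the Umm al-Qura variant rather than ICU4J's, so its day count may differ by a day from the figure quoted above, and the JDK has no Hebrew calendar at all. The class and method names are hypothetical.

```java
import java.time.LocalDate;
import java.time.chrono.ChronoLocalDate;
import java.time.chrono.HijrahDate;
import java.time.chrono.ThaiBuddhistDate;
import java.time.temporal.ChronoUnit;

class CalendarMath {
    // Days between a date and the date two years later, where "year" is the calendar's own year.
    static long daysInTwoYears(ChronoLocalDate start) {
        ChronoLocalDate later = start.plus(2, ChronoUnit.YEARS);
        return later.toEpochDay() - start.toEpochDay();
    }

    public static void main(String[] args) {
        LocalDate start = LocalDate.of(2005, 2, 3);
        System.out.println("Gregorian: " + daysInTwoYears(start));
        System.out.println("Buddhist:  " + daysInTwoYears(ThaiBuddhistDate.from(start)));
        System.out.println("Islamic:   " + daysInTwoYears(HijrahDate.from(start)));
    }
}
```

The same "add 2 years" request lands roughly three weeks apart depending on the calendar, which is exactly the trap in using a Gregorian-only dateAdd.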
TIP: It's not usually a good idea to use your own custom date/time formats in G11N applications. You're usually better off leaving that up to the standard locale-formatting functions.
These functions were designed mainly for use in page layout logic:
- isDayFirstFormat determines whether a given locale uses day-month or month-day format; mainly used in page layouts.
- getdateTimePattern returns locale datetime pattern string (for example, mm-dd-yy) for a given locale.
- getdatePartOrder is a metadata method; it returns the date part order (day-month-year, month-day-year, etc.) for a given calendar/locale combination.
- getTimeDelimiter returns time delimiter (:/.) for a given calendar/locale combination.
The following functions are specific to individual calendars:
- getCycle returns the cycle for a passed date (Chinese calendar).
- getCycleYear returns the year in a given cycle for a passed date (Chinese calendar).
- getExtendedYear returns the extended year for this calendar; that is, years since start of the Chinese calendar.
- getCycleMonth returns the month in a cycle year for a passed date (Chinese calendar).
- getCycleDay returns the day in a cycle month for a passed date (Chinese calendar).
- isLeapMonth returns true/false if a given month is a leap month (ADAR 1) in the Hebrew calendar.
- getEmperorEra returns a string indicating the Japanese emperor era in which a given date falls (Japanese calendar).
Hopefully, the preceding sections have given you a firm grounding in G11N calendar use. Now, let's look at one final time-related G11N issue: time zones.
Time Zones
If your application involves a global base of users, you're likely to run into issues concerning time zones. It's often the case that the application server is in one time zone while the users are in others (even non-G11N applications are affected by this). Toss daylight saving time (DST) into the mix, and things can become complicated rather quickly. Why are time zones so complicated? In theory, a time zone is an area on the Earth's surface between two meridians spaced by 15 degrees of longitude (the x-axis, if you will) where the same time is adopted. Realistically, for administrative and sometimes political reasons, state or country borders often define the time zone instead of exact geographic position. For example, Table 23.8 shows the various time zone equivalents for Asia/Bangkok (GMT+0700). These are all the same physical time zone, simply named differently.
Antarctica/Davis |
Asia/Bangkok |
Asia/Hovd |
Asia/Jakarta |
Asia/Krasnoyarsk |
Asia/Phnom_Penh |
Asia/Pontianak |
Asia/Saigon |
Asia/Vientiane |
Etc/GMT-7 |
Indian/Christmas |
VST |
Figure 23.12. Time zone CFC example.
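As a minimal sketch of the server-versus-user problem, the Java below takes one instant in GMT/UTC and views it in different named zones, and checks that two differently named zones from the Asia/Bangkok group share the same offset. It uses the JDK's java.time classes rather than the time zone CFC; the class and method names are hypothetical.

```java
import java.time.Instant;
import java.time.ZoneId;
import java.time.ZonedDateTime;

class ZoneViews {
    // The same instant, expressed on the wall clock of the named time zone.
    static ZonedDateTime view(Instant instant, String zoneName) {
        return instant.atZone(ZoneId.of(zoneName));
    }

    // True when two differently named zones have the same offset at that instant.
    static boolean sameOffset(Instant instant, String zoneA, String zoneB) {
        return view(instant, zoneA).getOffset().equals(view(instant, zoneB).getOffset());
    }

    public static void main(String[] args) {
        Instant stored = Instant.parse("2005-02-03T12:00:00Z");
        System.out.println(view(stored, "Asia/Bangkok"));
        System.out.println(view(stored, "America/New_York"));
        System.out.println(sameOffset(stored, "Asia/Bangkok", "Asia/Jakarta"));
    }
}
```

Keeping one stored value and converting only for display means neither the server's zone nor DST rules ever leak into the data.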

If your application must support multiple time zones, it's probably a good idea to maintain your datetime data in GMT time zone rather than the server's or client's time zone.
Our next stop on the ColdFusion MX 7 G11N tour is the topic of databases.
Databases
As far as G11N applications go, the most important factor is whether or not the database is Unicode capable. In this day and age it is rather difficult to find many popular or "big iron" databases that do not support Unicode. The last holdout among these was MySQL, which finally supported Unicode with the release of version 4.1. The following is a brief review of Unicode-capable databases that you can use with ColdFusion MX 7. Consult the database's documentation for details.
Microsoft Access
Microsoft Access, within its limitations, is a suitable database for G11N applications; it supports Unicode, provided you use the Access for Unicode driver supplied with ColdFusion MX 7.
Microsoft SQL Server
Microsoft SQL Server has been Unicode capable since version 7. It provides three data types to handle Unicode text: NVARCHAR, NCHAR, and NTEXT. (The N comes from the SQL-92 specification and stands for "national" data types). Be aware that the limits for the VARCHAR and CHAR data types (8000 bytes) apply to both the standard and the Unicode variants, which effectively halves the Unicode size limits (4000 Unicode characters). If you use Unicode data (which, of course, you should be doing at all times), also be mindful that Microsoft SQL Server requires that all Unicode text passed to it be assigned an N prefix (see http://support.microsoft.com/kb/239530/EN-US/ for more information):
SELECT someColumn
FROM someTable
WHERE Greeting = N'Hello!'
If you use the <cfqueryparam> tag (which is a very good idea), you will need to turn on Unicode support via the ColdFusion Administrator's Advanced option for that DSN, as shown in Figure 23.13. As noted earlier in Listing 23.4, SQL Server can "cast" collations using the COLLATE clause, which should be your first line of attack when it comes to sorting data.
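The 4000-character limit quoted for the Unicode data types is simply the 8000-byte limit divided by two bytes per character. The Java sketch below works through that arithmetic, assuming two-byte UTF-16 storage for each character (true for characters in the Basic Multilingual Plane). The class NVarcharBudget and its methods are hypothetical names used only for illustration.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not from the chapter): estimates the byte cost of
// text stored in a two-bytes-per-character Unicode column.
class NVarcharBudget {

    // The 8000-byte limit quoted in the text for CHAR/VARCHAR and their N variants.
    static final int MAX_BYTES = 8000;

    // Byte length of the string when encoded as UTF-16 (little-endian, no BOM).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16LE).length;
    }

    // True if the string fits within the byte limit.
    static boolean fits(String s) {
        return utf16Bytes(s) <= MAX_BYTES;
    }

    public static void main(String[] args) {
        System.out.println(utf16Bytes("Hello!")); // 12: six characters, two bytes each
        System.out.println(MAX_BYTES / 2);        // 4000 characters at most
    }
}
```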
Figure 23.13. DSN Unicode support option in ColdFusion Administrator.

MySQL
The release of MySQL version 4.1 brings Unicode support as UTF-8 or UCS-2. You can assign a character set and/or collation to the server, database, table, and column. For example:
CREATE DATABASE dayLateDollarShort DEFAULT CHARACTER SET utf8
would assign the UTF-8 encoding to all CHAR and VARCHAR columns in that database. Similar to Microsoft SQL Server, you can "cast" collations using the COLLATE clause.
In terms of database connections, you can set the client connection character set (where ColdFusion MX 7 is MySQL's "client") either within MySQL itself or via the MySQL DSN's connection string option (in the Advanced option section of that DSN in the ColdFusion Administrator) using:
useUnicode=true&characterEncoding=utf8
PostgreSQL
PostgreSQL has had full Unicode support since version 7.1. Its current version is 8.0, which is also its first native Windows version. Unlike MySQL, you can only set character encoding at the database level:
CREATE DATABASE postGISUnicode WITH ENCODING 'UNICODE'
Collation is also fixed at the database level, or actually at the "cluster level"; one instance of PostgreSQL can only have one locale.
Oracle
Oracle has supported Unicode since version 7. Oracle handles I18N issues via National Language Support (NLS), which provides database utilities, error messages, sort orders, date/time and numeric/currency formatting, and so on, adapted to relevant native languages. Oracle covers about 67 territories (locales) with 46 languages.
Oracle provides Unicode support through UTF-8 (AL32UTF8 in Oracle-talk), although the character sets differ from version 7 (AL24UTFFSS) to version 8 (AL32UTF8). AL32UTF8 handles ASCII as single-byte encoding. Similar to Microsoft SQL Server, Oracle's Unicode data types are nchar, nvarchar2, and nclob. Provided that its NLS parameters (NLS_Language, NLS_Territory) are initialized properly (through server-side initialization parameters, client-side environment variables, or the ALTER SESSION statement), there are no serious I18N issues involving Oracle.
Display
Most ColdFusion developers tend to turn up their noses at so-called "design" issues like page display and layout. Display is, however, an important G11N topic, especially in locales with right-to-left (RTL) writing systems such as Arabic or Hebrew, which some folks refer to as the BIDI (bi-directional) locales. You need to understand that not just the text is RTL; the whole concept of a "page" in these locales is RTL. Let's look at an example.
NOTE
In case you're wondering why these languages' writing systems are considered BIDI, it's because things like numbers are written left-to-right. That is, the most significant digit is leftmost, so the number 100 (one hundred) is written in Arabic or Hebrew as 100 rather than 001. Also, note that "languages" do not have a direction; their writing systems do.
Figure 23.14 is the desktop for a fully internationalized virtual office application (HyperOffice, http://www.hyperoffice.com/) for a user in the en_US, English (United States) locale. This page is laid out left-to-right (LTR), with the most important objects (menu, user name, and so on) on the left side of the page. If you look closely at the arrow icons, even these graphics are LTR (they point from the left to the right); the devil is indeed in the details.
Figure 23.14. LTR page layout.

Figure 23.15. RTL page layout.
For more on right-to-left markup, see Tex Texin's Web site (http://www.i18nguy.com/markup/right-to-left.html).
So how do we go about developing a page layout to handle the directionality of a writing system? Leaving graphics out of it, it's actually rather easy. Recall the following line in the code of Listing 23.9:
<html dir="#SESSION.writingDir#" lang="#SESSION.language#">
That's pretty much it. It's most often recommended to set the page's writing direction in the <html> tag using its dir attribute. That's because it will also set the directionality of all of the page's HTML objects, while leaving you the option of changing the directionality of individual HTML objects as needed. For the page's text, this setting will have the most effect on directionally neutral text (numbers, punctuation, and so on), since most of your Unicode text will have inherent directionality (certainly another reason to "Just use Unicode").
If your page layout design tends to HTML frames, you will have to use special logic to arrange the frames in their proper sequence (see Listing 23.12). CSS 2.1 also provides a direction property that parallels the dir attribute (see http://www.w3.org/TR/CSS21/visuren.html#direction for more information). CSS 3 goes a step farther, adding the block-progression property to specify vertical flow (top-to-bottom) or horizontal flow (LTR or RTL), as well as a writing-mode property to act as shorthand for specifying both direction and block-progression (see http://www.w3.org/TR/css3-text/#Progression for specifics).
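The dir attribute needs a value of "ltr" or "rtl" from somewhere. One hypothetical way to derive it, sketched here in Java (which ColdFusion can call through createObject), is to look up the locale's language code in a short list of languages written right-to-left. The class WritingDirection, its dirFor method, and the language list are assumptions made for this example, not the chapter's own code, and the list is deliberately incomplete.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper (not from the chapter): maps a locale to the value
// expected by the HTML dir attribute.
class WritingDirection {

    // Language codes for some RTL writing systems. Older Java versions report
    // Hebrew as "iw" and Yiddish as "ji", so both spellings are listed.
    private static final Set<String> RTL_LANGUAGES = new HashSet<>(
            Arrays.asList("ar", "fa", "he", "iw", "ur", "yi", "ji"));

    // Returns "rtl" for the listed languages and "ltr" for everything else.
    static String dirFor(Locale locale) {
        return RTL_LANGUAGES.contains(locale.getLanguage()) ? "rtl" : "ltr";
    }

    public static void main(String[] args) {
        System.out.println(dirFor(Locale.forLanguageTag("ar-SA"))); // rtl
        System.out.println(dirFor(Locale.forLanguageTag("en-US"))); // ltr
    }
}
```

The result could be stored once per session (for instance in SESSION.writingDir) when the user's locale is chosen, so each page only has to output it.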
Listing 23.12. frameLayout.cfm: RTL Frame Layout Logic
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<cfoutput><html dir="#SESSION.writingDir#"></cfoutput>
<head>
<title>Bubba's Triassic Desktop</title>
</head>
<!-- frames -->
<cfif SESSION.writingDir EQ "ltr">
<!--- menu left --->
<frameset cols="20%,*">
<frame name="menu" src="menu.cfm">
<frame name="desktop" src="desktop.cfm">
</frameset>
<cfelse>
<!--- menu right --->
<frameset cols="80%,*">
<frame name="desktop" src="desktop.cfm">
<frame name="menu" src="menu.cfm">
</frameset>
</cfif>
</html>
Next we look at ColdFusion MX 7's text searching as it applies to G11N applications.
Text Searching
Using the built-in Verity text search engine was problematic in ColdFusion versions prior to CFMX 7. For starters, it didn't support Unicode. It also only supported a few languages (mainly in line with the locales that ColdFusion previously supported). That made Verity's application across locales "uneven" (works in some locales, not in others) and therefore complex. It forced many G11N developers to turn to other solutions, such as Microsoft's Index Server or the open-source Lucene project. ColdFusion MX 7 has changed all of this with the introduction of the Unicode character set for Verity collections, as well as new Verity languages (see Table 23.9).
LANGUAGE PACK | LANGUAGES |
---|---|
Asian | Japanese, Korean, Chinese, Traditional Chinese |
Multilanguage | Unicode |
Western European | Bokmal, Finnish, Italian, Spanish, Danish, French, Nynorsk, Swedish, Dutch, German, Portuguese |
Eastern European/Middle Eastern | Arabic, Hebrew, Greek, Polish, Turkish, Bulgarian, Russian, Czech, Hungarian, Russian2 |
Creating a Unicode collection only requires setting the language attribute to uni:
<cfcollection
action="create"
collection="unicodeTest"
path="#collectionLocation#"
language="uni">
The same applies to searching a collection:
<cfsearch
collection="unicodeTest"
name="test"
criteria="#searchPhrase#"
language="uni">
This is incredibly economical; one simple code change opens up your Verity text searching to the G11N world and simplifies your application by doing away with the need for third-party search engines. You'll need to download the "Verity Search Packs" from http://www.macromedia.com/support/coldfusion/verity_reg/register/index.cgi.
Our final stop on this tour is a brief overview of the G11N-relevant tags and functions.
Relevant ColdFusion MX 7 Tags/Functions
The following tables (Table 23.10 and Table 23.11) list the G11N-relevant ColdFusion MX 7 tags and functions. The majority of these should be familiar to developers from ColdFusion MX 6.1.