Non dev Exchange related question

Oct 8, 2014 at 2:52 PM
Hello,

I'm hoping to get something answered here that I haven't really been able to find anywhere else.

I do a lot of work with restoring Exchange databases. I have noticed that when I move an exchange database to a new server that is a different OS that it usually has to be defragged before it can be accessed and used.

I think I understand that this has something to do with the Unicode sort order of the indexes but I don't really understand how this works.

Can anyone give me a sort of high level explanation of why the database has to be defragged when moved to a new OS? How does it determine the need?

Thanks.

Hope I don't sound like to much of an idiot.
Developer
Oct 14, 2014 at 1:09 AM
It's a simple question with a not-so-simple-answer!

Start with this to get an idea of how indices work: http://blogs.msdn.com/b/martinc/archive/2012/09/10/introduction-to-indices.aspx

Unicode columns aren't normalized to upper-case, as ANSI columns are. We rely on an API called LCMapString(), from the NLS team (another team in Microsoft). It turns a Unicode string in to a binary blob (and can do other things as well). The binary blob can then be memcmp'ed, and sorted appropriately.

This is actually very hard. You must be able to sort the German double-S character correctly, and insert u-umlaut in the right place.
The sort order changes on the locale, too! Some locales will sort accented-characters in different orders.
And it can do things like sort "8" after "07" and before "81".
And it can get even more complicated, because it's legal to have both a 'U-umlaut' code point, AND a 'U' codepoint followed by 'Add-Umlaut-to-previous-character' codepoint. These two codepoints will look the same, but they are definitely different binary representations.
And of course, they want to waste as little space as possible!
Then when the Unicode consortium adds new characters, the NLS team needs to figure out where to insert these. In an already-dense number space, this can be impossible.

So what this means is that sometimes they MUST change the normalization scheme sometimes. These times usually coincide with new versions of Windows, but sometimes they happen in Service Packs.

Defragging the database throws away all of the indices stored in the file, and builds a brand new database, using the new indexing scheme.

Usually we try to minimize the index rebuilding when upgrading the OS, but when downgrading, it's much easier to call everything out of date, and require a defrag. We mistakenly called this 'corrupt' before, when it's more accurate to call it 'out of date':
#define JET_wrnPrimaryIndexOutOfDate         1417 /* The Primary index is created with an incompatible OS sort version. The table can not be safely modified. */
#define JET_wrnSecondaryIndexOutOfDate       1418 /* One or more Secondary index is created with an incompatible OS sort version. Any index over Unicode text should be deleted. */
Here are some examples of string normalization, using the default normalization options that ESE uses. (lcmap.exe is a simple program I wrote that just calls LCMapString).
Note how a short string can be MUCH longer in bytes. I think the worst case is about 5:1 bloat.
c:\ [17:56:56.23] ++$ lcmap "hello"
LCID                    = 0
dwMapFlags              = 0x30401 [NORM_IGNORECASE NORM_IGNOREKANATYPE NORM_IGNOREWIDTH LCMAP_SORTKEY ]

Input string =
[hello]

Mapped to 0xf bytes:
00000000    0e2c0e21 0e480e48 0e7c0101 010100   .,.!.H.H.|.....

c:\ [17:57:01.01] ++$ lcmap "hello world"
LCID                    = 0
dwMapFlags              = 0x30401 [NORM_IGNORECASE NORM_IGNOREKANATYPE NORM_IGNOREWIDTH LCMAP_SORTKEY ]

Input string =
[hello world]

Mapped to 0x1b bytes:
00000000    0e2c0e21 0e480e48 0e7c0702 0ea40e7c .,.!.H.H.|.....|
00000010    0e8a0e48 0e1a0101 010100            ...H.......

c:\ [17:57:05.85] ++$ lcmap "hello WORLD. Sorting is hard."
LCID                    = 0
dwMapFlags              = 0x30401 [NORM_IGNORECASE NORM_IGNOREKANATYPE NORM_IGNOREWIDTH LCMAP_SORTKEY ]

Input string =
[hello WORLD. Sorting is hard.]

Mapped to 0x3f bytes:
00000000    0e2c0e21 0e480e48 0e7c0702 0ea40e7c .,.!.H.H.|.....|
00000010    0e8a0e48 0e1a0733 07020e91 0e7c0e8a ...H...3.....|..
00000020    0e990e32 0e700e25 07020e32 0e910702 ...2.p.%...2....
00000030    0e2c0e02 0e8a0e1a 07330101 010100   .,.......3.....
You can use the JET_UNICODEINDEX2 structure to customize the flags. JET_UNICODEINDEX is documented, but the v2 structure is public, and uses a locale name instead of an LCID (e.g. "en-us" instead of 1033).

Hope that helps,


-martin