
Slow seek / iteration performance on non-unique secondary index.

Oct 4, 2013 at 10:02 PM
Further qualification: the table had roughly 40 million records and then went down to 28, but that seemed too long to fit in the subject line.

The data as it is right now looks quite sequential, with no major gaps, but a TryMoveFirst()/TryMoveNext() loop over the secondary index takes 250 seconds. 41713 pages were read. Iterating by the primary index finishes in less time than GetTickCount() can detect. In real code, iteration happens very infrequently: it is mainly Seeks, with forward iterations of up to perhaps dozens of records after a successful seek. So iteration isn't my main concern, but the problem was initially found with seeking.

Obviously, for such a drastic reduction in size requirements, an offline compact (unless an online compact also made it into 8.1 :)) is the best solution.

At this point, calling JetDefragment produces no benefit (I foolishly ran the defragment before doing the timing test, so I don't know whether it improved the performance of the main index). The deletion of all 40 million records was done as a background maintenance task, but the first time I ever wrote JetDefragment in a program was yesterday, to evaluate what would happen in this terrible state. Would periodically issuing it have kept the database from getting into this situation? Also, I remember martinc saying that there was a new defrag-related option in Windows 7 / Server 2008 R2, possibly better than running JetDefragment, but I can't find that post now.

Thanks.
Developer
Oct 8, 2013 at 3:35 AM
If you know that the database is always going to be large, you can use SystemParameters.CacheSizeMin to force a minimum cache size.
ESE isn't overly aggressive about growing the cache, even if a larger cache would eventually end up being useful. This was done so that ESE isn't outright a resource hog, and other existing processes can continue to operate. It's called Dynamic Buffer Allocation, if you're curious:
http://www.bing.com/search?q=dynamic+buffer+allocation+ese
http://technet.microsoft.com/en-us/library/aa998781(v=EXCHG.65).aspx

But sometimes DBA doesn't grow as quickly as you'd like, which is why you'd want to force the minimum.

I wonder if one of the following will help for OpenTableGrbit:
        /// <summary>
        /// Provides a hint that the table is probably not in the buffer cache, and
        /// that pre-reading may be beneficial to performance.
        /// </summary>
        Preread = 0x40,

        /// <summary>
        /// Assume a sequential access pattern and prefetch database pages.
        /// </summary>
        Sequential = 0x8000,
There's also the following, which is public in esent.h, but doesn't seem to be in ManagedEsent:
#define JET_bitTableOpportuneRead       0x00000080  /*  attempt to opportunely read physically adjacent leaf pages using larger physical IOs */
You can cast it explicitly to give it a try.
JetDefragment() is horribly misunderstood. It was made in the times when consolidating free space was much more important than data contiguity. By running it, you may have ended up with bad data contiguity, thereby reducing your move-next performance!

I did actually write a way to sort of shrink the database file for 8.1, but it uses sparse files (NTFS-only), and ends up fragmenting the file. (I keep on meaning to write an announcement...) But I don't think it will help your performance.

-martin
Developer
Oct 8, 2013 at 3:51 AM
For Online Defrag v2 (OLDv2), you can pass JET_bitDefragmentBTree to JetDefragment2:
http://msdn.microsoft.com/en-us/library/gg294095(v=exchg.10).aspx
Again, it's not ported to ManagedEsent, so you'll need to do a cast.
#define JET_bitDefragmentBTree                  0x00000100  /* defrag one B-Tree */
It may help with sequential access.

-martin
Oct 8, 2013 at 3:48 PM
Edited Oct 8, 2013 at 3:48 PM
martinc wrote:
For Online Defrag v2 (OLDv2), you can pass JET_bitDefragmentBTree to JetDefragment2:
http://msdn.microsoft.com/en-us/library/gg294095(v=exchg.10).aspx
Again, it's not ported to ManagedEsent, so you'll need to do a cast.
#define JET_bitDefragmentBTree                  0x00000100  /* defrag one B-Tree */
It may help with sequential access.

-martin
Hi Martin,

I have some comments regarding some of the other suggestions you've made, but I want to hold off until I can answer everything. So, first, how do you correctly use this? Initially I was stuck because passing in DefragmentBatchStart together with DefragmentBTree returned InvalidGrBit. Then, trying just DefragmentBTree, it started crashing in the CTableDefragment constructor, which prompted me to try passing in the table name (which makes sense given the parameter's name, but since the docs say it's unused, that wasn't my first thought; the docs just seem to be outdated for JetDefragment2, still referring to concepts from JetDefragment).

So now I have two issues:
  1. With just passing in DefragmentBTree, my callback isn't called. This may or may not be a hindrance in the future, so that's just a sidenote.
  2. Based on logging, this potentially seems promising (minding your comments earlier about JetDefragment maybe causing worse performance):
    Logging from only passing in DefragmentBTree + name of the table:
    "EseDefrag (6084) esedefrag: Online defragmentation has defragmented index 'main' of table 'main' in database '[filepath]'. This pass started on 10-8-2013.
    1 pages visited
    0 pages freed
    0 partial merges
    1 pages moved" (For this table, named "main", the primary index is called "data" and the secondary index is called "key". I'm assuming that "index 'main'" is a side effect of the primary index being requestable via NULL, but that assumption could be wrong.)
    Logging only passing in BatchStart:
    "EseDefrag (6040) esedefrag: Online defragmentation is beginning a full pass on database '[filepath]'."
    "EseDefrag (6040) esedefrag: Online defragmentation has completed a full pass on database '[filepath]', freeing 0 pages. This pass started on 10-8-2013 and ran for a total of 1 seconds, requiring 1 invocations over 1 days. Since the database was created it has been fully defragmented 4 times over 4 days."
Since the index I'm having seek-performance issues with is a secondary index, is there a way to smuggle the index name into the table name parameter? I've tried some of the other mechanisms used elsewhere ('nul delimited, double nul terminated', and '+ delimited' [which doesn't make sense in this context given where '+ delimited' is normally used, but it was early]).

Thanks,
Michael
Developer
Oct 8, 2013 at 9:41 PM
Looks like it's fine to use with JetDefragment (not JetDefragment2). Here's an example from one of our tests:
        JetDefragment(self.sesid, self.dbid, tablename, 0, 0, JET_GRBIT.DefragmentBTree)
I admit to not knowing the online defrag code very well. I just found this little tidbit in the code:
// secondary indexes not currently supported
So it sounds like you're out of luck for the defragmentation. Sorry.

Maybe you should try using JET_SPACEHINTS with JET_INDEXCREATE2 so that the index never gets too fragmented? Note that at this point we're still speculating that it's a fragmentation issue; it could potentially be solved by caching and/or prereading.

-martin
Oct 11, 2013 at 12:47 AM
Secondary indexes not being supported seems unfortunate, but hopefully I'm just misinterpreting my results:
Here's the test app I used to time iteration (and also to do the defragmentation; it started out as just a defrag tool...). It sets the Cache Size Min parameter to one gigabyte. (It requires at least VC11/VS2012, because I was lazy.)
https://gist.github.com/anonymous/d35ebc2e80146b08404a
It iterates all tables over each explicitly defined index twice, to test the effects of the caching. It iterates over the system tables, but that's just because of laziness on my part, not because I'm actually doing that in real code :)
"with prefetch" means "Open the table with JET_bitTablePreread | JET_bitTableOpportuneRead | JET_bitTableReadOnly | JET_bitTableSequential"
"no prefetch" means "Open the table with JET_bitTableReadOnly | JET_bitTableSequential"

Here's my interpretation:
  • After the defrag, iterating over the "key" secondary index (I'll post the initialization of the JET_TABLECREATE struct at the end) did indeed get slower by a few seconds, as you suggested it might. Most notably, during the second iteration it still reads over 300 pages, whereas pre-defrag zero pages were read.
  • For the primary index, "data", the defrag operation brought the "referenced page" count down drastically from 23,380 pages to 31. Read count goes from 23,306 to 5.
  • Explicitly specifying the additional prefetch flags has negligible effects (in the grand scheme of things) on timing for this test, with this database.
  • The defrag hurt the system tables, so if I were to include defrag functionality, DefragmentBTree plus some extra logic deciding which tables to defragment would be a good idea.
For me, as someone with only basic "read a Wikipedia article on B+-trees" knowledge, the oddest part is this:
In table 'main' by index 'key', there are 25 records. Elapsed: 252.182 seconds
Read 41712 pages. Referenced 41738 pages. Preread 1 pages.

It is very confusing to me that iterating over (not even reading) 25 records, each less than 200 bytes long (estimated; I haven't tried JetGetRecordSize yet), would require touching over 40,000 pages. Again, this table probably had about 40 million records initially, before a large deletion pass deleted all but the first record. Then a restart occurred and a few extra records got added before I was brought in to explain what was going on, since it was taking a long time to shut down and start up. I understand that JetDelete wouldn't want to rebalance too aggressively, as that would incur a much larger cost.
Also, looking at the data again, the fact that it looks sequential, without gaps, may actually be an illusion, because I wasn't taking into account how the data that looks sequential was generated.

For JET_SPACEHINTS: Can this be retroactively modified for a table? (I understand that this won't affect existing data.)


JET_TABLECREATE initialization:
#define INDEX_KEY_LENGTH 100
#define DB_KEY_LENGTH 35

static JET_COLUMNCREATE_A inxColumnDefs [] =
{
    { sizeof(JET_COLUMNCREATE_A), "data", JET_coltypBinary, INDEX_KEY_LENGTH, JET_bitColumnFixed | JET_bitColumnNotNULL, NULL, 0, 0, 0, 0 },
    { sizeof(JET_COLUMNCREATE_A), "key", JET_coltypBinary, DB_KEY_LENGTH, JET_bitColumnFixed | JET_bitColumnNotNULL, NULL, 0, 0, 0, 0 },
    { sizeof(JET_COLUMNCREATE_A), "record_type", JET_coltypShort, sizeof(unsigned short), JET_bitColumnFixed | JET_bitColumnNotNULL, NULL, 0, 0, 0, 0 },
};

static JET_INDEXCREATE_A inxIndexDefs [] =
{
    /* cchKey counts the double-nul terminator, hence 7 and 6 */
    { sizeof(JET_INDEXCREATE_A), "data", "+data\0", 7, JET_bitIndexPrimary | JET_bitIndexUnique, 0, 0, 0, 0, 0, 0 },
    { sizeof(JET_INDEXCREATE_A), "key", "+key\0", 6, 0, 0, 0, 0, 0, 0, 0 },
};


static JET_TABLECREATE_A s_inxTableCreate =
{
    sizeof(JET_TABLECREATE_A),
    "main",
    NULL,
    20,
    0,
    inxColumnDefs,
    3,
    inxIndexDefs,
    2,
    0,
    0,
    0,
};
The database uses an 8KB page size, mostly because there are occasional cases where databases from multiple OSes are hosted in the same process. Ironically(?), this is the only table with a secondary index. Or at least, the only one that can grow to any notable size.
Developer
Oct 12, 2013 at 12:21 AM
It is very confusing to me that iterating over (not even reading) 25 records, each less than 200 bytes long (estimated; I haven't tried JetGetRecordSize yet), would require touching over 40,000 pages. Again, this table probably had about 40 million records initially, before a large deletion pass deleted all but the first record.
It's likely that after you deleted the records, the 'Cleanup' task didn't run. Are you seeing the following in your event log?
MessageId=628
SymbolicName=BT_MOVE_SKIPPED_MANY_NODES_ID
Language=English
%1 (%2) %3Database '%4': While attempting to move to the next or
previous node in a B-Tree, the database engine skipped over %7
non-visible nodes in %8 pages. It is likely that these non-visible
nodes are nodes which have been marked for deletion but which are
yet to be purged. The database may benefit from widening the online
maintenance window during off-peak hours in order to purge such nodes
and reclaim their space. If this message persists, offline
defragmentation may be run to remove all nodes which have been marked
for deletion but are yet to be purged from the database.
Also look for an event indicating that the version store may be falling behind because it's pretty full (80% full?).

One option is to call JetIdle() while you're doing the massive number of Deletes. If it's too late, then an offline defrag (esentutl -d, or JetCompact()) will help.

Sorry for not answering all of the questions you had; it's Friday afternoon and I've got to run. :)

-martin
Oct 15, 2013 at 5:42 PM
Edited Oct 15, 2013 at 5:46 PM
martinc wrote:
It's likely that after you deleted the records, the 'Cleanup' task didn't run. Are you seeing the following in your event log?
MessageId=628
SymbolicName=BT_MOVE_SKIPPED_MANY_NODES_ID
Language=English
%1 (%2) %3Database '%4': While attempting to move to the next or
previous node in a B-Tree, the database engine skipped over %7
No :( I also tried it on a Win8 machine to see if it would show up there (just in case the message was added later on, or more cases were encountered).
Also look for an event that version store may be falling behind because it's pretty full (80% full?).
Is this logged at the Information level? There are many Information events from around the time the deletions occurred, so I had filtered them out and didn't see anything, but I can go looking through the Information events if that's where it would appear. Actually, after extracting the message table resource from esent.dll, I didn't see anything quite like that; the closest were "version store full" / "version store couldn't grow".
One option is to call JetIdle() while you're doing the massive number of Deletes. If it's too late, then an offline defrag (esentutl -d, or JetCompact()) will help.
The JetIdle function is defunct, and should only be used for testing purposes.
That doesn't give me the most confidence for calling JetIdle :-/, but I guess you'd know better. I'll try to set up a test that reproduces the original scenario, but that may take a while due to other commitments. I know that an offline defrag will help now, but I'm hoping for something that can be done without reducing access to the database.
Sorry for not answering all of the questions you had; it's Friday afternoon and I've got to run. :)
No problem. Thanks for helping out with this.

Another question: Since this is only affecting a secondary index (well, JetDefragment can fix it for the primary index), what's the version store impact of creating and deleting indexes on the fly? Or, well, re-creating the index under a new name and deleting the old one? Assuming that there are significantly more than 25 records :) And, related -- is it safe to call JetTerm while there are still active sessions?