
Write/Flush freezes/hangs

Aug 10, 2015 at 8:51 AM
Hi Martin,
We have been using ESENT / ManagedEsent as a local cache for about two years now, and until now we have never had any problems with it in development, test, or production.

Although I'm not aware of any unusual usage, here's a very short summary of how we're using it:
-We're synchronizing write operations for a single key.
-Write operations to multiple keys are allowed in parallel.
-We're flushing dictionaries to disk every two seconds (a relatively new feature that we added about half a year ago).
-There are multiple dictionaries in use concurrently.
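
The usage pattern listed above could look roughly like the following sketch. It is written in Python rather than C#, and `FakeDictionary` is a hypothetical in-memory stand-in for PersistentDictionary. The lock-per-key scheme, the class names, and the `flush_interval` parameter are assumptions for illustration, not the actual application code.

```python
import threading
from collections import defaultdict


class FakeDictionary:
    """Hypothetical in-memory stand-in for a persistent dictionary."""

    def __init__(self):
        self._data = {}
        self._guard = threading.Lock()
        self.flush_count = 0

    def put(self, key, value):
        with self._guard:
            self._data[key] = value

    def get(self, key):
        with self._guard:
            return self._data.get(key)

    def flush(self):
        # A real store would force pending writes to disk here.
        with self._guard:
            self.flush_count += 1


class SynchronizedCache:
    """Serializes writes per key and flushes on a fixed interval."""

    def __init__(self, store, flush_interval=2.0):
        self._store = store
        self._key_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()
        self._stop = threading.Event()
        self._interval = flush_interval
        self._flusher = threading.Thread(target=self._flush_loop, daemon=True)
        self._flusher.start()

    def _lock_for(self, key):
        # Guard the lock table itself so two threads never create two locks.
        with self._locks_guard:
            return self._key_locks[key]

    def update(self, key, fn):
        # Writers to the same key wait for each other; other keys proceed.
        with self._lock_for(key):
            self._store.put(key, fn(self._store.get(key)))

    def _flush_loop(self):
        # Event.wait returns False on timeout and True once stop is set.
        while not self._stop.wait(self._interval):
            self._store.flush()

    def close(self):
        self._stop.set()
        self._flusher.join()
        self._store.flush()  # final flush on shutdown
```

With several dictionaries in use, each one would get its own `SynchronizedCache` instance in this sketch, so a flush that blocks on one store does not hold the key locks of another.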

In internal test environments we have long running tests (weeks and months) that test the following scenarios with Server 2008R2 and 2012R2:
-About 1000 insert/update operations per second on 1000 keys (maximum of 1000 concurrent)
-About 1000 insert/update operations per second on 30 keys (maximum of 30 concurrent)
A background task reads the records in batches of 5000 and finally deletes them one by one.
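
The background reader described here can be sketched as follows. This is a simplified Python illustration in which a plain `dict` stands in for the dictionary. `drain_in_batches` and the `handle` callback are hypothetical names, and the concurrent writers from the test scenario are not shown.

```python
from itertools import islice


def drain_in_batches(store, handle, batch_size=5000):
    """Read up to batch_size records at a time, then delete them one by one."""
    processed = 0
    while True:
        # Snapshot the batch first so deletions do not disturb iteration.
        batch = list(islice(store.items(), batch_size))
        if not batch:
            return processed
        for key, value in batch:
            handle(key, value)
        for key, _ in batch:
            del store[key]  # one delete per record, as in the described test
            processed += 1
```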

All of this works fine in our test environments.

However, we have one client environment with about 50 insert/update operations per second on a single key (multiple other dictionaries are in use as well). Within hours of startup, our application freezes while trying to access the dictionary, and calls to dictionary.Flush do not return either.
It's not just that the process needs to be restarted; usually it's required to restart the virtual machine.

Do you have any idea what could go wrong, what could lead to this problem and/or how we can prevent it?

BR, Klaus
Developer
Aug 11, 2015 at 5:05 PM
Hi Linky/Klaus,

Is this with the PersistentDictionary that we provide? Or one that you wrote?

Initially I was thinking that you might be causing some built-up contention in that particular scenario, but then I saw your phrase "usually it's required to restart the virtual machine". That got me to thinking that it's more dependent on that particular machine, rather than the load.

It could be something in the storage stack (hardware/firmware/drivers). Can you try some sort of I/O load program? Normally I'd recommend JetStress (http://www.microsoft.com/en-us/download/details.aspx?id=36849) but it needs binaries from the Exchange installation. (Which is rather annoying).
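
Where JetStress is not an option, a very crude probe of the storage stack might look like the sketch below. It only measures how long a write followed by a flush to disk takes; it does not reproduce ESENT's I/O pattern and is no replacement for JetStress. The function names and the `writes` and `block_size` parameters are assumptions chosen for illustration.

```python
import os
import tempfile
import time


def fsync_latencies(directory, writes=200, block_size=64 * 1024):
    """Write blocks to a temp file in `directory`, fsync each one,
    and return the per-write latencies in seconds."""
    payload = os.urandom(block_size)
    latencies = []
    fd, path = tempfile.mkstemp(dir=directory)
    try:
        for _ in range(writes):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # force the write through to the storage stack
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
        os.remove(path)
    return latencies


def summarize(latencies):
    """Return count, median, and maximum of the measured latencies."""
    ordered = sorted(latencies)
    return {
        "count": len(ordered),
        "median": ordered[len(ordered) // 2],
        "max": ordered[-1],
    }
```

Running it against the volume that holds the database files and watching for occasional very large maximum latencies would be one way to spot stalls in the storage stack.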

-martin
Aug 11, 2015 at 6:37 PM
Hello Martin,
Thanks for your reply.
> Is this with the PersistentDictionary that we provide? Or one that you wrote?
It's pretty much the one you provide. There are small changes that let us configure strict or lazy transaction handling, but that's it. In this particular environment we're running with lazy transactions, so it's effectively the original implementation behind an interface.
> It could be something in the storage stack (hardware/firmware/drivers). Can you try some sort of I/O load program? Normally I'd recommend JetStress (http://www.microsoft.com/en-us/download/details.aspx?id=36849) but it needs binaries from the Exchange installation. (Which is rather annoying).
From what I've seen, the machine is very fast. That's not based on a formal benchmark, just on me taking a look with our operations guy. I'll try to get benchmark results, but I doubt that it's possible.
> Initially I was thinking that you might be causing some built-up contention in that particular scenario, but then I saw your phrase "usually it's required to restart the virtual machine".
Just to clarify: when we run into the hanging/locking/freezing situation, our services can't handle any more messages from devices. Usually when a service dies (for whatever reason), it's enough to kill the process. But as our on-site guys reported, killing and restarting the process just locks up in the exact same place. That led me to believe that some locks are held not only per process/application but also by the operating system (as far as I remember from the tests that are part of the code, some esent components are shared between processes).
> That got me to thinking that it's more dependent on that particular machine, rather than the load.
That might be true. One main difference between the test machine and production is that the production environment is part of the client's Active Directory, whereas the test environment is in our default environment (workgroup, no AD).
Do you think that it could be security related?

Thanks a lot for your help!

Klaus