
Bulk inserts

Jun 25, 2012 at 8:41 AM

Is there a way to optimize things if I want to insert a large number of rows?

Say, 100,000 rows, as fast as possible?

Jun 25, 2012 at 10:51 PM

Here is a similar post. Try JET temp tables if you don't care about persisting data to disk. I haven't tried it myself, but that's what's mentioned in the post below:

http://managedesent.codeplex.com/discussions/356581

There are many settings you can tweak that can affect read/write speed.

Jun 25, 2012 at 11:19 PM

In addition to those suggestions from SSISGuy, it also depends on whether or not you insert sequentially.

It is fastest to insert in sequential order, according to the primary index. If you don't define a primary index, then one is automatically defined for you with an autoincrement column.

If you can't guarantee sequential inserts, then you should look at your index definitions. Change the ulDensity member of the structure (http://msdn.microsoft.com/en-us/library/windows/desktop/gg294082(v=exchg.10).aspx) to a number less than 100. A density of 100% makes sense for sequential inserts, but if you later have to insert something in the middle, you are guaranteed to split a page on that insert (because it's already full). For example:
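
A minimal sketch with ManagedEsent (the index name and key are placeholders for your own schema):

// Create a unique secondary index at 80% density, leaving ~20% of each
// page free so that later out-of-order inserts split fewer pages.
var indexcreate = new JET_INDEXCREATE
{
    szIndexName = "by_key",         // placeholder name
    szKey = "+key\0\0",             // double-null-terminated key definition
    cbKey = "+key\0\0".Length,
    grbit = CreateIndexGrbit.IndexUnique,
    ulDensity = 80,
};
Api.JetCreateIndex2(session, tableid, new[] { indexcreate }, 1);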

If you work on Win7, then also look at the JET_SPACEHINTS structure: http://msdn.microsoft.com/en-us/library/windows/desktop/gg269205(v=exchg.10).aspx It can help maintain contiguity by reserving space in advance, which is pretty important when the number of tables in a database grows large (Exchange can easily have thousands of tables in a database).

That's all for now off the top of my head...

 

-martin

Jun 26, 2012 at 12:54 PM

SSISGuy,

I actually care quite a bit about persisting the data to disk, so temp tables aren't relevant.

 

Martinc,

I have an auto-increment field that is set as the primary key index. I have additional indexes on that table that aren't sequential.

All my indexes are using 90 for density.

I can't assume Win7 or up (I still run on XP) and I have just 10 or so tables.

 

I am thinking about something like SQL Server's BULK INSERT, which allows you to insert tens to hundreds of thousands of documents per second.
In my tests, I maxed out at 1-2 thousand per second (SSD drive, safe transactions, recovery enabled).

 

Jun 30, 2012 at 4:33 AM

I was just thinking about this: you should be able to populate the JET_TABLECREATE object with space hints and call JetCreateTableColumnIndex3() only on Win7. On XP you should still call the older version of the API (JetCreateTableColumnIndex2). The C# JET_TABLECREATE is converted just before p/invoking into esent.dll, so the extra space hints structure will simply be dropped.

It means your program will run a bit slower on XP (because of the lack of space hints), but it should run better on Win7.
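
Something like this sketch, typed from memory and not compiled, so double-check the exact ManagedEsent member names (especially whether your build exposes JetCreateTableColumnIndex2):

// Reserve contiguous space for the table up front (only honored on Win7+).
var tablecreate = new JET_TABLECREATE
{
    szTableName = "documents",
    ulPages = 16,
    ulDensity = 90,
    rgcolumncreate = columncreates,  // your JET_COLUMNCREATE[] definitions
    cColumns = columncreates.Length,
    rgindexcreate = indexcreates,    // your JET_INDEXCREATE[] definitions
    cIndexes = indexcreates.Length,
    pSeqSpacehints = new JET_SPACEHINTS
    {
        ulInitialDensity = 90,
        cbInitial = 16 * 1024 * 1024, // pre-reserve 16 MB for contiguity
    },
};

if (EsentVersion.SupportsWindows7Features)
{
    // Win7+: the space hints are passed through to the engine.
    Api.JetCreateTableColumnIndex3(session, dbid, tablecreate);
}
else
{
    // XP: the older export simply ignores the space hints.
    Api.JetCreateTableColumnIndex2(session, dbid, tablecreate);
}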

Obviously you should test it; this is a Friday night afterthought. :)

Do you need to have the primary index as an autoinc? Can you use one of your other indices as the primary instead? Reducing the number of indices is always good for update performance.

I noticed you said 'safe transactions'. Does that mean you are doing many non-lazy transactions? For optimal speed, commit all of those transactions lazily, and then issue a final JetCommitTransaction() with WaitAllLevel0Commit, which will flush them:

 

/// <summary> 
/// All transactions previously committed by any session that have not
/// yet been flushed to the transaction log file will be flushed immediately.
/// This API will wait until the transactions have been flushed before
/// returning to the caller. This option may be used even if the session
/// is not currently in a transaction. This option cannot be used in
/// combination with any other option.
/// </summary>
public const CommitTransactionGrbit WaitAllLevel0Commit = (CommitTransactionGrbit)0x8;

Oh right, that's Server 2003 again... Drat. :)
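
In terms of the ManagedEsent wrappers the pattern looks roughly like this (WaitAllLevel0Commit is the 0x8 grbit quoted above; if I remember right, newer ManagedEsent builds also expose it as Server2003Grbits.WaitAllLevel0Commit):

// Commit each batch lazily: no synchronous log flush per transaction.
using (var tx = new Transaction(session))
{
    // ... the inserts for this batch ...
    tx.Commit(CommitTransactionGrbit.LazyFlush);
}

// After the last batch, one commit with WaitAllLevel0Commit flushes all
// previously lazily-committed transactions to the log before returning.
using (var tx = new Transaction(session))
{
    tx.Commit((CommitTransactionGrbit)0x8); // WaitAllLevel0Commit
}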

Also check your database cache size; I wonder if you're hitting the max. You can look under the 'Database' perf counters in perfmon, and consider increasing the max cache size.
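
For reference, the cache ceiling is a global ManagedEsent setting measured in database pages, typically set before JetInit; a sketch:

// Allow the buffer cache to grow to 1 GB (the value is in database pages,
// so divide by the page size).
SystemParameters.CacheSizeMax = (1024 * 1024 * 1024) / SystemParameters.DatabasePageSize;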

Jun 30, 2012 at 10:14 AM

The code below produces the following outputs:

When always using CommitTransactionGrbit.None 

Total: 583,961 ms, Per batch: 583.96ms Per doc: .57 ms

When using CommitTransactionGrbit.LazyFlush and one CommitTransactionGrbit.WaitLastLevel0Commit

Total: 509,134 ms, Per batch: 509.13ms Per doc: .5 ms

This is running on an SSD drive, by the way.

 

Can you think of other ways to improve this?  

 

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Globalization;
using System.IO;
using System.Text;
using Microsoft.Isam.Esent.Interop;
using Raven.Client.Document;

namespace Raven.Tryouts
{
	class Program
	{
		static void Main(string[] args)
		{
			JET_INSTANCE instance;
			Api.JetCreateInstance(out instance, "test");
			const string logsPath = ".";
			var x = new InstanceParameters(instance);
			x.CircularLog = true;
			x.Recovery = true;
			x.NoInformationEvent = false;
			x.CreatePathIfNotExist = true;
			x.EnableIndexChecking = true;
			x.TempDirectory = Path.Combine(logsPath, "temp");
			x.SystemDirectory = Path.Combine(logsPath, "system");
			x.LogFileDirectory = Path.Combine(logsPath, "logs");
			x.MaxVerPages = TranslateToSizeInDatabasePages(128, 1024 * 1024);
			x.BaseName = "RVN";
			x.EventSource = "Raven";
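			// NB: the units differ per parameter here: LogBuffers is in
			// 512-byte sectors and LogFileSize is in KB (not bytes or pages).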
			x.LogBuffers = TranslateToSizeInDatabasePages(8192, 1024 * 1024);
			x.LogFileSize = ((64 / 4) * 1024);
			x.MaxSessions = 2048;
			x.MaxCursors = 2048;
			x.DbExtensionSize = TranslateToSizeInDatabasePages(8, 1024 * 1024);
			x.AlternateDatabaseRecoveryDirectory = logsPath;

			Api.JetInit(ref instance);

			using (var session = new Session(instance))
			{
				JET_DBID dbid;
				Api.JetCreateDatabase(session, "database", null, out dbid, CreateDatabaseGrbit.OverwriteExisting);
				Api.JetAttachDatabase(session, "database", AttachDatabaseGrbit.None);

				using (var tx = new Transaction(session))
				{
					CreateDocumentsTable(dbid, session);

					tx.Commit(CommitTransactionGrbit.None);
				}
				Api.JetCloseDatabase(session, dbid, CloseDatabaseGrbit.None);
				Api.JetDetachDatabase(session, "database");
			}

			Console.WriteLine("Starting to write");
			using (var session = new Session(instance))
			{
				var metadata = new byte[1024];
				var data = new byte[1024 * 14];

				var random = new Random();
				random.NextBytes(metadata);
				random.NextBytes(data);


				JET_DBID dbid;
				Api.JetAttachDatabase(session, "database", AttachDatabaseGrbit.None);
				Api.JetOpenDatabase(session, "database", null, out dbid, OpenDatabaseGrbit.None);

				var sp = Stopwatch.StartNew();
				IDictionary<string, JET_COLUMNID> docsColumns = null;
				for (int i = 0; i < 1 * 1000; i++)
				{
					using (var tx = new Transaction(session))
					{
						using (var table = new Table(session, dbid, "documents", OpenTableGrbit.None))
						{
							if (docsColumns == null)
								docsColumns = Api.GetColumnDictionary(session, table);
							for (int j = 0; j < 1024; j++)
							{
								using (var update = new Update(session, table, JET_prep.Insert))
								{
									Api.SetColumn(session, table, docsColumns["key"], "docs/" + i + "/" + j, Encoding.Unicode);
									Api.SetColumn(session, table, docsColumns["etag"], Guid.NewGuid().ToByteArray());
									Api.SetColumn(session, table, docsColumns["last_modified"], DateTime.Now);
									Api.SetColumn(session, table, docsColumns["locked_by_transaction"], false);
									Api.SetColumn(session, table, docsColumns["metadata"], metadata);
									using (var stream = new ColumnStream(session, table, docsColumns["data"]))
									using (var buffered = new BufferedStream(stream))
									{
										buffered.Write(data, 0, data.Length);
										buffered.Flush();
									}
									update.Save();
								}
							}
						}

						tx.Commit(CommitTransactionGrbit.LazyFlush);
					}
					Console.WriteLine(i);
				}

				Api.JetCloseDatabase(session, dbid, CloseDatabaseGrbit.None);
				Api.JetDetachDatabase(session, "database");

				using (var tx = new Transaction(session))
				{
					tx.Commit(CommitTransactionGrbit.WaitLastLevel0Commit);
				}
				sp.Stop();

				Console.WriteLine("Total: {0:#,#} ms, Per batch: {1:#,#.##}ms Per doc: {2:#,#.##} ms",
					sp.ElapsedMilliseconds,
					(double)sp.ElapsedMilliseconds / 1000,
					((double)sp.ElapsedMilliseconds / 1000) / 1024);

			}
		}


		private static int TranslateToSizeInDatabasePages(int sizeInMegabytes, int multiply)
		{
			//To avoid overflow, do the division first (to make the number smaller), then multiply afterwards
			double tempAmt = (double)sizeInMegabytes / SystemParameters.DatabasePageSize;
			int finalSize = (int)(tempAmt * multiply);
			return finalSize;
		}
		private static void CreateDocumentsTable(JET_DBID dbid, Session session)
		{
			JET_TABLEID tableid;
			Api.JetCreateTable(session, dbid, "documents", 1, 80, out tableid);
			JET_COLUMNID columnid;

			Api.JetAddColumn(session, tableid, "id", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.Long,
				grbit = ColumndefGrbit.ColumnFixed | ColumndefGrbit.ColumnAutoincrement | ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "key", new JET_COLUMNDEF
			{
				cbMax = 2048,
				coltyp = JET_coltyp.LongText,
				cp = JET_CP.Unicode,
				grbit = ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "etag", new JET_COLUMNDEF
			{
				cbMax = 16,
				coltyp = JET_coltyp.Binary,
				grbit = ColumndefGrbit.ColumnFixed | ColumndefGrbit.ColumnNotNULL,
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "last_modified", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.DateTime,
				grbit = ColumndefGrbit.ColumnFixed | ColumndefGrbit.ColumnNotNULL,
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "locked_by_transaction", new JET_COLUMNDEF
			{
				cbMax = 16,
				coltyp = JET_coltyp.Binary,
				grbit = ColumndefGrbit.ColumnTagged,
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "data", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.LongBinary,
				grbit = ColumndefGrbit.ColumnTagged
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "metadata", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.LongBinary,
				grbit = ColumndefGrbit.ColumnTagged
			}, null, 0, out columnid);

			CreateIndexes(tableid, session,
						  new JET_INDEXCREATE
						  {
							  szIndexName = "by_id",
							  szKey = "+id\0\0",
							  grbit = CreateIndexGrbit.IndexPrimary
						  },
						  new JET_INDEXCREATE
						  {
							  szIndexName = "by_etag",
							  szKey = "+etag\0\0",
							  grbit = CreateIndexGrbit.IndexDisallowNull
						  },
						  new JET_INDEXCREATE
						  {
							  szIndexName = "by_key",
							  szKey = "+key\0\0",
							  grbit = CreateIndexGrbit.IndexDisallowNull | CreateIndexGrbit.IndexUnique,
						  });
		}


		private static void CreateIndexes(JET_TABLEID tableid, Session session, params JET_INDEXCREATE[] indexes)
		{
			foreach (var index in indexes)
			{
				try
				{
					Api.JetCreateIndex(session, tableid, index.szIndexName, index.grbit, index.szKey, index.szKey.Length, 90);
				}
				catch (Exception e)
				{
					throw new InvalidOperationException("Could not create index: " + index.szIndexName, e);
				}
			}
		}
	}
}

Jul 5, 2012 at 7:48 PM

Ayende, thanks for the code. Unfortunately I've been pretty busy this week. Hopefully I'll get a chance to try it out next week.

It's interesting that LazyFlush doesn't make much difference for you. I wonder if it's because you're using an SSD.

-martin

Jul 17, 2012 at 9:15 AM

I took a quick look. These are the two things I found that made a difference for me (I don't have an SSD, but it seemed to go from 2.5 ms/rec to 1.9 ms/rec):

     static void Main(string[] args)
     {
->             SystemParameters.DatabasePageSize = 8192;
-> 
      JET_INSTANCE instance;

And:

            Api.SetColumn(session, table, docsColumns["locked_by_transaction"], false);
<-          Api.SetColumn(session, table, docsColumns["metadata"], metadata);
->          Api.SetColumn(session, table, docsColumns["metadata"], metadata, SetColumnGrbit.IntrinsicLV);
            using (var stream = new ColumnStream(session, table, docsColumns["data"]))
            using (var buffered = new BufferedStream(stream))

On Windows 7 you can go to a 32k page size, but XP only goes up to 8k. The larger page size helps with contiguity, and I think it helps with the LV separation (I'm too sleepy to confirm it, though :)

If you add the advanced counters (aka 'Squeaky Lobster') you can see the 'Separated Long-Values' counters shrink.
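
If I remember right, enabling them is just a registry DWORD; something like this sketch (key and value names are from memory, so verify them before relying on it):

using Microsoft.Win32;

// Enable the advanced/hidden ESENT counters ('Squeaky Lobster').
// Requires admin rights; restart the counter consumer afterwards.
using (var key = Registry.LocalMachine.CreateSubKey(
    @"SYSTEM\CurrentControlSet\Services\ESENT\Performance"))
{
    key.SetValue("Squeaky Lobster", 1, RegistryValueKind.DWord);
}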

You could probably get another perf boost by using the intrinsic flag for the 'data' column, but that's harder to do with the ColumnStream/BufferedStream approach. You could also use the cbSeparateLV parameter of JET_TABLECREATE, but that's Win7+ only.

-martin

Jan 31, 2013 at 12:33 PM
Edited Jan 31, 2013 at 12:36 PM
Hi. My question is related to the post where I already asked about seeking on random text (http://managedesent.codeplex.com/discussions/430567). This time we also ran into trouble doing bulk inserts where the key is random text (we don't seek). The problem is that insert performance declines as we insert more and more items. For example, inserting 10,000 records into a database that already holds 3,000,000 records sometimes takes as long as 30 s, while it usually takes about 1 s. From the 'Database' performance counters it looks like the whole Esent engine suddenly slows down, but I cannot figure out what is causing it; in particular, the 'B+ Tree Inserts/sec' counter drops drastically. Sometimes after a while it returns to the previous performance, but then peaks start to appear:
[Image: perfmon 'Database' counters showing the sudden drop in 'B+ Tree Inserts/sec']

It doesn't seem to be a buffering problem ('Log Record Stalls/sec' is near 0). To investigate this special case of inserting hashes, I narrowed the problem down to a single table consisting of only two columns and one index. The best results I'm able to get are when I set:
SystemParameters.DatabasePageSize = 32768;
Table density to 70%
(I have already tried many different settings of LogBuffers, LogFileSize, etc.)

The plot of insert time per batch looks as follows:
[Image: plot of insert time per batch, showing periodic peaks]

I would like to optimize to avoid the peaks if possible. In the presented example the minimum insert time was 107 ms, while the maximum was 3.8 s. Which performance counters would be most useful for determining what is going on? Also, is there any reference that describes the meaning of the particular perf counters in more detail? I would be very grateful for any help or ideas. One important note: the problem does not occur on an SSD (the presented results come from an HDD).
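
One idea I plan to try, following the earlier advice that sequential inserts are fastest: sorting each batch by key before inserting, so that within a batch the inserts walk the 'by_key' index in order (ordinal order isn't exactly the index's Unicode collation, but it should be close enough to improve page locality). A sketch against the sample code below:

// Before the insert loop for batch i, sort the batch's keys so that
// index pages are touched sequentially rather than randomly.
Array.Sort(ids[i], StringComparer.Ordinal);

for (int j = 0; j < 10000; j++)
{
    using (var update = new Update(session, table, JET_prep.Insert))
    {
        Api.SetColumn(session, table, docsColumns["key"], ids[i][j], Encoding.Unicode);
        update.Save();
    }
}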

The sample code:
namespace Raven.Tryouts
{
    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.IO;
    using System.Security.Cryptography;
    using System.Text;
    using Microsoft.Isam.Esent.Interop;

    class Program
    {
        static void Main(string[] args)
        {
            JET_INSTANCE instance;

            SystemParameters.DatabasePageSize = 32768;

            Api.JetCreateInstance(out instance, "test");
            const string logsPath = ".";
            var x = new InstanceParameters(instance);
            x.CircularLog = true;
            x.Recovery = true;
            x.NoInformationEvent = false;
            x.CreatePathIfNotExist = true;
            x.EnableIndexChecking = true;
            x.TempDirectory = Path.Combine(logsPath, "temp");
            x.SystemDirectory = Path.Combine(logsPath, "system");
            x.LogFileDirectory = Path.Combine(logsPath, "logs");
            x.MaxVerPages = TranslateToSizeInDatabasePages(512, 1024 * 1024);
            x.BaseName = "RVN";
            x.EventSource = "Raven";
            x.LogBuffers = (16 * 1024 * 1024) / 512; // 16 MB; the unit here is 512 bytes - half of LogFileSize, as recommended
            x.LogFileSize = (32 * 1024); // 32 MB (the maximum); the unit is 1 KB
            x.MaxSessions = 256;
            x.MaxCursors = 2048;
            x.DbExtensionSize = TranslateToSizeInDatabasePages(8, 1024 * 1024);
            x.AlternateDatabaseRecoveryDirectory = logsPath;

            SystemParameters.CacheSizeMax = TranslateToSizeInDatabasePages(2048, 1024 * 1024); // 2GB 

            Api.JetInit(ref instance);

            using (var session = new Session(instance))
            {
                JET_DBID dbid;
                Api.JetCreateDatabase(session, "database", null, out dbid, CreateDatabaseGrbit.OverwriteExisting);
                Api.JetAttachDatabase(session, "database", AttachDatabaseGrbit.None);

                using (var tx = new Transaction(session))
                {
                    CreateDocumentsTable(dbid, session);

                    tx.Commit(CommitTransactionGrbit.None);
                }
                Api.JetCloseDatabase(session, dbid, CloseDatabaseGrbit.None);
                Api.JetDetachDatabase(session, "database");
            }

            
            using (var session = new Session(instance))
            {
                JET_DBID dbid;
                Api.JetAttachDatabase(session, "database", AttachDatabaseGrbit.None);
                Api.JetOpenDatabase(session, "database", null, out dbid, OpenDatabaseGrbit.None);

                string[][] ids = new string[500][];

                Console.WriteLine("Starting to create ids");

                for (int i = 0; i < 500; i++)
                {
                    ids[i] = new string[10000];
                    for (int j = 0; j < 10000; j++)
                    {
                        ids[i][j] = MD5Hash(RandomSentence(5, "-") + rnd.Next(100000000).ToString()).Substring(0, 12);
                    }
                }

                Console.WriteLine("Starting to write");

                IDictionary<string, JET_COLUMNID> docsColumns = null;
                for (int i = 0; i < 500; i++)
                {
                    using (var tx = new Transaction(session))
                    {
                        using (var table = new Table(session, dbid, "documents", OpenTableGrbit.None))
                        {
                            if (docsColumns == null)
                                docsColumns = Api.GetColumnDictionary(session, table);

                            var sp = Stopwatch.StartNew();
                            for (int j = 0; j < 10000; j++)
                            {
                                string key = ids[i][j];

                                using (var update = new Update(session, table, JET_prep.Insert))
                                {
                                    Api.SetColumn(session, table, docsColumns["key"], key, Encoding.Unicode);
                                    update.Save();
                                }
                            }

                            sp.Stop();

                            Console.WriteLine("Batch #{0}: {1:#,#.##}ms Per doc: {2:#,#.##} ms",
                                i,
                                sp.ElapsedMilliseconds,
                                (double)sp.ElapsedMilliseconds / 10000);

                            using (var streamWriter = File.AppendText("output.txt"))
                            {
                                streamWriter.Write(sp.ElapsedMilliseconds + "\n");
                            }   
                        }

                        tx.Commit(CommitTransactionGrbit.None);
                    }
                }

                Api.JetCloseDatabase(session, dbid, CloseDatabaseGrbit.None);
                Api.JetDetachDatabase(session, "database");
            }
        }

        private static int TranslateToSizeInDatabasePages(int sizeInMegabytes, int multiply)
        {
            //To avoid overflow, do the division first (to make the number smaller), then multiply afterwards
            double tempAmt = (double)sizeInMegabytes / SystemParameters.DatabasePageSize;
            int finalSize = (int)(tempAmt * multiply);
            return finalSize;
        }

        private static void CreateDocumentsTable(JET_DBID dbid, Session session)
        {
            JET_TABLEID tableid;
            Api.JetCreateTable(session, dbid, "documents", 1, 70, out tableid);
            JET_COLUMNID columnid;

            Api.JetAddColumn(session, tableid, "id", new JET_COLUMNDEF
            {
                coltyp = JET_coltyp.Long,
                grbit = ColumndefGrbit.ColumnFixed | ColumndefGrbit.ColumnAutoincrement | ColumndefGrbit.ColumnNotNULL
            }, null, 0, out columnid);

            Api.JetAddColumn(session, tableid, "key", new JET_COLUMNDEF
            {
                cbMax = 2048,
                coltyp = JET_coltyp.LongText,
                cp = JET_CP.Unicode,
                grbit = ColumnNotNullIfOnHigherThanWindowsXp()
            }, null, 0, out columnid);

            var szIndexName = "by_key";
            var szKey = "+key\0\0";
            var grbit = CreateIndexGrbit.IndexDisallowNull | CreateIndexGrbit.IndexUnique;

            Api.JetCreateIndex(session, tableid, szIndexName, grbit, szKey, szKey.Length, 90);
        }

        public static ColumndefGrbit ColumnNotNullIfOnHigherThanWindowsXp()
        {
            var isWindowsXpOrServer2003 = (Environment.OSVersion.Version.Major == 5);
            return isWindowsXpOrServer2003 ? ColumndefGrbit.None : ColumndefGrbit.ColumnNotNULL;
        }
        public static string MD5Hash(string text)
        {
            MD5 md5 = new MD5CryptoServiceProvider();

            //compute hash from the bytes of text
            md5.ComputeHash(Encoding.ASCII.GetBytes(text));

            //get hash result after compute it
            byte[] result = md5.Hash;

            var strBuilder = new StringBuilder();
            for (int i = 0; i < result.Length; i++)
            {
                //change it into 2 hexadecimal digits
                //for each byte
                strBuilder.Append(result[i].ToString("x2"));
            }

            return strBuilder.ToString();
        }

        private static readonly string[] words = "her long make thing see him two has look more day could go come did number sound no most people my over know water than call first who may down side been now find any new work part take get place made live where after back little only round man year came show every good me give our under name very through just form".Split(' ');
        private static readonly Random rnd = new Random(DateTime.Now.Millisecond);
        private static string RandomSentence(int length, string spacer = " ")
        {
            var retStrings = new List<string>();
            for (int i = 0; i < length; i++)
                retStrings.Add(words[rnd.Next(words.Length)]);

            return string.Join(spacer, retStrings);
        }
    }
}