Optimizing large number of writes

May 14, 2012 at 1:48 PM

I have the need to write potentially very large number of rows at about the same time.

(In a single transaction, but it is well within the version store limit)

What are the ways to optimize this? I already have an auto inc id on the table, and I am interested in ways to throw as much data at the table as I can in the shortest amount of time.

Right now the code that writes to it looks like:

public void PutMappedResult(string view, string docId, string reduceKey, RavenJObject data, byte[] viewAndReduceKeyHashed)
{
    Guid etag = uuidGenerator.CreateSequentialUuid();

	using (var update = new Update(session, MappedResults, JET_prep.Insert))
	{
		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["view"], view, Encoding.Unicode);
		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["document_key"], docId, Encoding.Unicode);
		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["reduce_key"], reduceKey, Encoding.Unicode);
		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["reduce_key_and_view_hashed"], viewAndReduceKeyHashed);

		using (var stream = new BufferedStream(new ColumnStream(session, MappedResults, tableColumnsCache.MappedResultsColumns["data"])))
		{
			data.WriteTo(stream);
			stream.Flush();
		}

		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["etag"], etag.TransformToValueForEsentSorting());
		Api.SetColumn(session, MappedResults, tableColumnsCache.MappedResultsColumns["timestamp"], SystemTime.Now);

		update.Save();
	}
}

This method is called several tens of thousands of times in a loop.

Is there any better way to do so?

Developer
May 14, 2012 at 5:14 PM

One obvious thing to do is to convert from multiple calls of Api.SetColumn() to a single call of Api.SetColumns(). There is some amount of overhead simply calling in to ESENT, and it's exacerbated by the managed/native transition.

You said it's done several tens of thousands of times? Then it might also be worthwhile to avoid all of the dictionary lookups of MappedResultsColumns.

The overhead of the Streams is also a potential slowdown point. It's possible that converting that to a byte array has some perf potential, but without knowing the implementation of RavenJObject, it could also be a lot of work.

 

-martin

May 14, 2012 at 6:31 PM

To elaborate on Martin's points:

  • You will get a huge win from Api.SetColumns.
  • As you are inserting a lot of records you should create the ColumnValue objects once, set the ColumnId members and reuse them for every call to PutMappedResult, just changing Value. This will avoid the dictionary lookups that Martin mentioned, along with cost of allocating and initializing the ColumnValue objects.
  • If you can allocate a byte[] array and serialize the RavenJObject into it, that will be faster than using streams.

From the testing I did comparing Api.JetSetColumn and Api.SetColumn I expect that making those changes should result in a large (2X) speedup in the insert rate.

I would be interested in hearing how this goes.

--Laurion

May 14, 2012 at 9:37 PM

I'll try your suggestions, but profiling suggests that most of the time is actually spent in 

		update.Save();

In particular, in calling this:

Microsoft.Isam.Esent.Interop.Implementation.NativeMethods.JetUpdate(IntPtr, IntPtr, Byte[], UInt32, UInt32&)

 

Not in all the rest. Would calling SetColumn vs SetColumns make any difference in the JetUpdate call?

Developer
May 15, 2012 at 9:35 PM

How many indices do you have on the table?

-martin

Developer
May 16, 2012 at 5:36 PM

To actually answer your question: No, SetColumn() vs. SetColumns() won't make a difference in JetUpdate.

JetUpdate modifies all of the indices for the insertion. It's at this point that you'd get a JET_errKeyDuplicate, not at SetColumn time. If you have a lot of indices, JetUpdate has a lot more work to do.

If you are willing to make Unicode data be binary-sorted (no case insensitivity, and no accent insensitivity, and no awareness of that 'u-umlaut' should be next to 'u'), you could convert to binary indices over the text data. It has significant usability impact, but has a potential savings. It's probably not worthwhile most of the time.

Assuming 60% of the time is in JetUpdate, and 40% is in SetColumn() [These are wild-guess numbers], then doing the conversion could still potentially save 20%.

-martin

May 17, 2012 at 12:11 AM

Martin,

Here is the index defintions:

private void CreateMapResultsTable(JET_DBID dbid)
		{
			JET_TABLEID tableid;
			Api.JetCreateTable(session, dbid, "mapped_results", 1, 80, out tableid);
			JET_COLUMNID columnid;

			Api.JetAddColumn(session, tableid, "id", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.Long,
				grbit = ColumndefGrbit.ColumnFixed | ColumndefGrbit.ColumnAutoincrement | ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "view", new JET_COLUMNDEF
			{
				cbMax = 2048,
				coltyp = JET_coltyp.LongText,
				cp = JET_CP.Unicode,
				grbit = ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "document_key", new JET_COLUMNDEF
			{
				cbMax = 2048,
				coltyp = JET_coltyp.LongText,
				cp = JET_CP.Unicode,
				grbit = ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "reduce_key", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.LongText,
				cp = JET_CP.Unicode,
				grbit = ColumndefGrbit.None
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "reduce_key_and_view_hashed", new JET_COLUMNDEF
			{
				cbMax = 32,
				coltyp = JET_coltyp.Binary,
				grbit = ColumndefGrbit.ColumnNotNULL
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "data", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.LongBinary,
				grbit = ColumndefGrbit.ColumnTagged
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "etag", new JET_COLUMNDEF
			{
				cbMax = 16,
				coltyp = JET_coltyp.Binary,
				grbit = ColumndefGrbit.ColumnNotNULL | ColumndefGrbit.ColumnFixed
			}, null, 0, out columnid);

			Api.JetAddColumn(session, tableid, "timestamp", new JET_COLUMNDEF
			{
				coltyp = JET_coltyp.DateTime,
				grbit = ColumndefGrbit.ColumnNotNULL | ColumndefGrbit.ColumnFixed
			}, null, 0, out columnid);

			CreateIndexes(tableid,
				new JET_INDEXCREATE
				{
					szIndexName = "by_id",
					szKey = "+id\0\0",
					grbit = CreateIndexGrbit.IndexPrimary
				},
				new JET_INDEXCREATE
				{
					szIndexName = "by_view_and_doc_key",
					szKey = "+view\0+document_key\0\0",
					grbit = CreateIndexGrbit.IndexDisallowNull
				},
				new JET_INDEXCREATE
				{
					szIndexName = "by_view",
					szKey = "+view\0\0",
					grbit = CreateIndexGrbit.IndexDisallowNull
				},
				new JET_INDEXCREATE
				{
					szIndexName = "by_view_and_etag",
					szKey = "+view\0-etag\0\0",
				},
				new JET_INDEXCREATE
				{
					szIndexName = "by_reduce_key_and_view_hashed",
					szKey = "+reduce_key_and_view_hashed\0\0",
				});
		}
I am guessing that I can drop the by_view index and rely on a partial search on by_view_and_etag, but I don't know how to reduce it further.

May 17, 2012 at 12:12 AM

In this case, making the data case sensitive is actually possible, I'll test that.