
Complex Primary Key and Performance Questions

May 27, 2011 at 1:47 AM

Thanks again for this project.  The performance is better than anything else I have found and the interface you have created is great for those of us who just don't have the time to get our heads around the C++ interfaces.  :-)

Now my questions. 

I have a very large data set that I need to store and pull some very specific queries from. It is US options end-of-day data with about 400k entries per day, and I have 5+ years of it at ~252 trading days per year, for a grand total of ~504 million rows and growing daily. Obviously getting the performance right is going to be critical. The column definitions are as follows:

 
    Api.JetAddColumn(sesid, tableid, "DataDate", columndefDate, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "UnderlyingSymbol", columndefString, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Expiration", columndefDate, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "ContractType", columndefBit, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Strike", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Warning", columndefString, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "UnderlyingPrice", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Last", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Bid", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Ask", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Volume", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "OpenInterest", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "IV", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Delta", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Gamma", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Theta", columndefLong, null, 0, out columnid);
    Api.JetAddColumn(sesid, tableid, "Vega", columndefLong, null, 0, out columnid);

     I currently have the following indexes:

     indexDef = "+UnderlyingSymbol\0\0"; Api.JetCreateIndex(sesid, tableid, "UnderlyingSymbol", CreateIndexGrbit.None, indexDef, indexDef.Length, 100);
     indexDef = "-DataDate\0\0"; Api.JetCreateIndex(sesid, tableid, "DataDate", CreateIndexGrbit.None, indexDef, indexDef.Length, 100);
     indexDef = "-Expiration\0\0"; Api.JetCreateIndex(sesid, tableid, "Expiration", CreateIndexGrbit.None, indexDef, indexDef.Length, 100);
     indexDef = "-Strike\0\0"; Api.JetCreateIndex(sesid, tableid, "Strike", CreateIndexGrbit.None, indexDef, indexDef.Length, 100);

     The only primary key/unique index I can directly find in the data would be a combination of UnderlyingSymbol\Expiration\ContractType\Strike\DataDate, and that is extremely unwieldy.

    I need to get back a handful of specific queries:

    1. A list of all expiration dates for a specific UnderlyingSymbol on a specific DataDate
    2. A list of all Strike Prices for a specific UnderlyingSymbol on a specific DataDate
    3. A list of data for an UnderlyingSymbol with a certain ExpirationDate and Strike between a start DataDate and an end DataDate

    Currently I am separating the data into 26 different data stores based on the first letter of the UnderlyingSymbol, in the hope that this will improve performance.

    My main questions: will ManagedEsent be able to handle this large amount of data OK? What should I do to improve performance? What recommendations do you have for the primary key (or lack of one)? Any other thoughts you might have would help. I am diving into a very challenging project and any guidance would be welcome.

    BTW, most of my code was directly following your StockDB examples.  Thank you for that.

    Doug

     

     

     

May 27, 2011 at 3:23 AM

    With the indexes you have I assume that you are using JetIntersectIndexes to
    calculate the results? With your data I think that might be inefficient -- you
    are intersecting two large index ranges to find a small number of records.

    I suggest using multi-column indexes targeted at the queries you want to run
    quickly. You can answer your queries with two indexes. The first index is used to answer
    the first two queries:

    "+UnderlyingSymbol\0+DataDate\0+Expiration\0+Strike\0\0"

    By adding the Expiration and Strike columns to the index it becomes possible to
    retrieve the data from the index without having to retrieve the full record. The
    columns are at the end so they don't affect the sort order. There are two things
    you have to do to make sure this works:
     1. Declare the UnderlyingSymbol column as ASCII. Unicode columns break the
     retrieve from index optimization.
     2. Use RetrieveColumnGrbit.RetrieveFromIndex when getting the columns.

     For example, you can retrieve all the expiration dates for a given symbol/date
     combination like this:

    // Seek to the first entry for this symbol/date on the
    // "+UnderlyingSymbol\0+DataDate\0+Expiration\0+Strike\0\0" index.
    Api.MakeKey(sesid, tableid, symbol, Encoding.ASCII, MakeKeyGrbit.NewKey);
    Api.MakeKey(sesid, tableid, date, MakeKeyGrbit.None);
    if (Api.TrySeek(sesid, tableid, SeekGrbit.SeekGE))
    {
        // Upper limit of the range: same symbol/date, with the remaining
        // key columns wildcarded to their maximum values.
        Api.MakeKey(sesid, tableid, symbol, Encoding.ASCII, MakeKeyGrbit.NewKey);
        Api.MakeKey(sesid, tableid, date, MakeKeyGrbit.FullColumnEndLimit);
        if (Api.TrySetIndexRange(sesid, tableid, SetIndexRangeGrbit.RangeUpperLimit | SetIndexRangeGrbit.RangeInclusive))
        {
            do
            {
                // RetrieveFromIndex avoids fetching the full record.
                // (The same expiration will appear once per strike.)
                DateTime? expiration = Api.RetrieveColumnAsDateTime(
                        sesid, tableid, expirationColumn, RetrieveColumnGrbit.RetrieveFromIndex);
            }
            while (Api.TryMoveNext(sesid, tableid));
        }
    }
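
    For completeness, a rough sketch of how the ASCII column and that index might be
    created (the "SymbolDateIndex" name and the columndefAsciiText variable are just
    illustrative, not from your code):

    // UnderlyingSymbol as ASCII text so RetrieveFromIndex works against it.
    JET_COLUMNDEF columndefAsciiText = new JET_COLUMNDEF
    {
        coltyp = JET_coltyp.Text,
        cp = JET_CP.ASCII,
    };
    Api.JetAddColumn(sesid, tableid, "UnderlyingSymbol", columndefAsciiText, null, 0, out columnid);

    // Covering index used to answer queries 1 and 2.
    string indexDef = "+UnderlyingSymbol\0+DataDate\0+Expiration\0+Strike\0\0";
    Api.JetCreateIndex(sesid, tableid, "SymbolDateIndex", CreateIndexGrbit.None, indexDef, indexDef.Length, 100);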
    

     

    The third query can be done with an index like this:

    "+UnderlyingSymbol\0+Expiration\0+Strike\0+DataDate\0\0"

    Retrieving all the columns will require a read of the full record so this query
    will be slower than the first two.
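
    As an untested sketch of that query (symbol, expiration, strike, startDate, endDate
    and the column handle are placeholder names), the start/end DataDate range can be
    walked like this:

    // On the "+UnderlyingSymbol\0+Expiration\0+Strike\0+DataDate\0\0" index:
    // seek to the first record at or after the start date...
    Api.MakeKey(sesid, tableid, symbol, Encoding.ASCII, MakeKeyGrbit.NewKey);
    Api.MakeKey(sesid, tableid, expiration, MakeKeyGrbit.None);
    Api.MakeKey(sesid, tableid, strike, MakeKeyGrbit.None);
    Api.MakeKey(sesid, tableid, startDate, MakeKeyGrbit.None);
    if (Api.TrySeek(sesid, tableid, SeekGrbit.SeekGE))
    {
        // ...and set an inclusive upper limit at the end date.
        Api.MakeKey(sesid, tableid, symbol, Encoding.ASCII, MakeKeyGrbit.NewKey);
        Api.MakeKey(sesid, tableid, expiration, MakeKeyGrbit.None);
        Api.MakeKey(sesid, tableid, strike, MakeKeyGrbit.None);
        Api.MakeKey(sesid, tableid, endDate, MakeKeyGrbit.None);
        if (Api.TrySetIndexRange(sesid, tableid, SetIndexRangeGrbit.RangeUpperLimit | SetIndexRangeGrbit.RangeInclusive))
        {
            do
            {
                // The full record is needed here, so retrieve from the record
                // rather than from the index.
                int? bid = Api.RetrieveColumnAsInt32(sesid, tableid, bidColumn);
                // ...retrieve the other price/greek columns the same way...
            }
            while (Api.TryMoveNext(sesid, tableid));
        }
    }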

    You shouldn't need to separate the data store into separate files. Although you
    have a lot of records they are small so your database should be less than 100GB,
    which isn't considered very large.

    The primary key is tricky -- all secondary index entries have to contain the
    primary key so there is a lot of benefit in having a small primary key. I suggest
    creating an explicit autoincrement column to be the primary key. With the number
    of records you have, running out of space in a 32-bit number is a concern, so you
    should create the autoincrement column as JET_coltyp.Currency.
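
    Something like this, roughly (the "Id" column and "Primary" index names are just
    examples; note the primary index has to be created before any records are inserted):

    // 64-bit autoincrement column to act as a small primary key.
    JET_COLUMNDEF columndefId = new JET_COLUMNDEF
    {
        coltyp = JET_coltyp.Currency,
        grbit = ColumndefGrbit.ColumnAutoincrement | ColumndefGrbit.ColumnNotNULL,
    };
    Api.JetAddColumn(sesid, tableid, "Id", columndefId, null, 0, out columnid);

    // Make it the primary (clustered) index; every secondary index entry
    // then carries only this 8-byte key.
    string primaryDef = "+Id\0\0";
    Api.JetCreateIndex(sesid, tableid, "Primary", CreateIndexGrbit.IndexPrimary, primaryDef, primaryDef.Length, 100);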

    May 27, 2011 at 3:34 AM

    Thanks for the advice.  I will start working on this and see what I can get for a performance increase.

    I am also seeing a radical difference in insert speed depending on the drive I am using. My main C: drive will take an insert of 400k records in roughly 2 minutes while my significantly larger F: drive takes nearly 9 minutes for the same. The F: is a 1TB Caviar Black and should theoretically be the higher-performance drive. All the drives are also regularly defragmented.

    I am considering buying a 480GB SSD drive specifically for this. If you have any thoughts on how that would affect performance I would appreciate hearing them.

    Thank you for helping. I really appreciate the work you are doing on this.

    Doug

     

    May 27, 2011 at 3:52 AM

    A few points:

    • If you can get to 3K inserts/second then I assume you are already using lazy commits and batching a lot of inserts into one transaction?
    • Inserts are painful because the records will be inserted into the secondary indexes in a random fashion, which causes random I/O. To reduce the amount of I/O try increasing the checkpoint depth with JET_param.CheckpointDepthMax (try a value of 200MB). The downside of a large checkpoint value is that database recovery after a crash takes longer.
    • With two drives you should try putting the database on one drive and the logfiles on the other (use JET_param.LogFilePath to specify the log directory). This is more complex because the files are in different directories, but it lets one drive do nothing but append to the log while the other deals with database I/O.
    • By default ESENT manages its cache size automatically. The cache size doesn't grow very quickly so initial insert performance won't be great. You can use JET_param.CacheSizeMin to set a minimum cache size, but make sure that much memory is actually available on the machine.
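
    To make those parameter settings concrete, here is a rough example of what to set on
    the instance (a JET_INSTANCE handle) before calling JetInit. The F:\esentlogs path
    and the ~1GB cache minimum are just example values:

    // Checkpoint depth is in bytes; ~200MB reduces random I/O on the secondary indexes.
    Api.JetSetSystemParameter(instance, JET_SESID.Nil, JET_param.CheckpointDepthMax, 200 * 1024 * 1024, null);

    // Keep the logfiles on a different drive from the database file.
    Api.JetSetSystemParameter(instance, JET_SESID.Nil, JET_param.LogFilePath, 0, @"F:\esentlogs\");

    // Minimum cache size is in database pages (4KB by default), so this is roughly 1GB.
    Api.JetSetSystemParameter(instance, JET_SESID.Nil, JET_param.CacheSizeMin, 256 * 1024, null);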

    An SSD will definitely speed things up, especially when starting the application because there are no cached pages. On the other hand they are expensive :-)

    May 27, 2011 at 4:07 AM

    Yep, doing lazy commits. I will try breaking the data and logs onto separate drives.

    Right now I am wrapping 1000 inserts into a transaction over the whole 400k-line file, roughly the pattern sketched below. It seems to do OK but it would be great to do better. Aside from the initial load of 5 years of data I am not too worried about either the insert speed or the overall size of the database. After the initial load it will only be one file per night. It is really the query speed I need to focus on. A question for you: after the inserts and indexing are done, will the size of the database significantly affect the performance of the queries?
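
    Roughly, the loop looks like this (the column handles and field indexes are placeholders for my real parsing code):

    Api.JetBeginTransaction(sesid);
    int rowsInBatch = 0;
    foreach (string line in File.ReadLines(path))
    {
        string[] fields = line.Split(',');
        using (Update update = new Update(sesid, tableid, JET_prep.Insert))
        {
            // Only a couple of the 17 columns shown here.
            Api.SetColumn(sesid, tableid, underlyingSymbolColumn, fields[0], Encoding.ASCII);
            Api.SetColumn(sesid, tableid, strikeColumn, int.Parse(fields[4]));
            update.Save();
        }

        if (++rowsInBatch == 1000)
        {
            // Lazy commit keeps the batch boundary cheap; no synchronous log flush.
            Api.JetCommitTransaction(sesid, CommitTransactionGrbit.LazyFlush);
            Api.JetBeginTransaction(sesid);
            rowsInBatch = 0;
        }
    }
    Api.JetCommitTransaction(sesid, CommitTransactionGrbit.LazyFlush);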

    The computer is a quad-core Windows 7 x64 machine with 8GB RAM and 3 separate hard drives, so it should be able to handle just about any reasonably sized cache. Recovery shouldn't be a problem I need to deal with often; this is just historical reference data, so if it completely crashes I can just reload it with no significant loss of info.

    Thanks again,

    Doug
