Better software through software architecture and devops

@jamessnape

Posts

  • This is the second part in a series of data warehouse presentations I’ve been giving. This one concentrates on the physical design of the underlying SQL Server database, some information on the SQL Server Fast Track Data Warehouse and finally a one slide guide to tuning the storage engine performance. The tuning slide is only really a first step and I plan a more in depth session some time in the future.

    Also I apologize, this deck is a little word heavy. I prefer more graphics in a presentation but there is a lot of specific guidance to list here.

    This entry was posted in data-warehousing  and tagged #physical-design #presentation  on .
    Discuss this on Twitter or LinkedIn
  • There are no hard and fast rules but the goal is to cut the time taken to extract data from a source system and cut the amount of work you have to do with the extracted data. The numbers quoted here are the ones I use as a starting point but you need to measure to find the best values.

    Don’t do an incremental extract if:

    • There isn’t much data in the source table (less than 100k rows)
    • There is enough change in the source table to require that you read most of it each time (such as more than half the rows changing between extracts)
    • The data in the source table is used for periodic snapshots (such as a balance sheet) and you need to track how a table changes at particular points in time

    Do an incremental extract if:

    • There is a lot of data in the source table
    • Rows are only ever added to the source table (i.e. rows are not updated)
    • You need to track each and every change to a source row
    • The source data is updated several times before being closed and once closed is never updated again (also known as an accumulating snapshot)

    In general dimension tables match the first set of rules and are not extracted incrementally where as fact tables normally match the second set of rules.

    This entry was posted in data-warehousing  and tagged #incremental-extracts #source-system  on .
    Discuss this on Twitter or LinkedIn
  • My programmer personality type is: DHTC

    You’re a D****oer. You are very quick at getting tasks done. You believe the outcome is the most important part of a task and the faster you can reach that outcome the better. After all, time is money.

    You like coding at a H****igh level. The world is made up of objects and components, you should create your programs in the same way.

    You work best in a T****eam. A good group is better than the sum of its parts. The only thing better than a genius programmer is a cohesive group of genius programmers.

    You are a C****onservative programmer. The less code you write, the less chance of it containing a bug. You write short and to the point code that gets the job done efficiently.

    This entry was posted in agile  and tagged #personal-development  on .
    Discuss this on Twitter or LinkedIn
  • I’ve been doing some Analysis Services 2005 performance work for a customer. Eventually I want to talk about some sort of a performance process for AS2005 but right now Chris Webb has a pretty good reference on designing effective aggregations. This is just a quick description of what the 11100101010,1010001000000 means in SQL Server Profiler.

    Since SQL Profiler now supports Analysis Services you can monitor queries as they execute to find the poorly performing ones. Chances are you will see a Progress Report event similar to Started reading data from the ‘XXX’ partition closely followed by a Query Subcube event with TextData looking like a string of ones and zeros such as: ‘00001000000000000,10111100000011111100,10’ The progress report tells you that data was read directly from the partition and no aggregation was used.

    If you turn on the Query Subcube Verbose event then a more complete description is printed but quite often you will receive traces with just the subcube strings as the verbose event is hidden from the default set of options.

    The subcube string tells you which dimensions and attributes are involved. Each group of digits, separated by a comma, denotes a dimension and each digit denotes an attribute within that dimension. The digit is ‘1’ if that attribute is part of the subcube and ‘0’ otherwise. Some things to note:

    • These are cube dimensions so the same dimension, e.g. Date, can be represented many times as a role-playing dimension, e.g. Ship Date
    • The order is defined by the actual order of dimensions and attributes in the measure group, not alphabetical or any other sort order.

    Since the order of dimensions and attributes is not immediately obvious, it’s better to write some code to print them out in the correct order. The following prints all the dimensions and attributes of the supplied measure group object in order:

    private static void PrintMeasureGroupDimensions(MeasureGroup mg) {
     for (int j = 0; j < mg.Dimensions.Count; ++j) { 
       CubeDimension dim = mg.Dimensions[j].CubeDimension; 
       Console.WriteLine("DIM:\\t{0} ({1})", dim.Name, dim.Attributes.Count); 
    
       for (int k = 0; k < dim.Attributes.Count; ++k) { 
         CubeAttribute attr = dim.Attributes[k]; 
         Console.WriteLine("ATT:\t\t{0}", attr.Attribute.Name); 
       }
     }
    }

    I’ve attached a piece of sample code that compiles into a console application to either print the entire set of dimensions for a measure group or, if you pass a subcube string, will just print those involved in the query. Execute with no args or ‘/h’ to get some help.

    SubcubeQueries.cs

    This entry was posted in data-warehousing  and tagged #analysis-services #performance  on .
    Discuss this on Twitter or LinkedIn