Better software through software architecture and devops

@jamessnape

Posts

  • I used to write a blog regularly over at http://jamessnape.me.uk based on my experiences with a number of topics at Microsoft. Since moving to BNP Paribas I haven’t had much to say publicly beyond posting a few photos, and I’d like to change that.

    Over the past couple of years I’ve been practising agile business intelligence in my role as a solution architect. My team work on user stories from a prioritised backlog in two-week iterations, practise test-driven development and continuous integration, and are beginning to look at acceptance test-driven development.

    Even with all these techniques it is still very difficult to be truly agile with business intelligence solutions: interrelated requirements, complex database schemas, poor agile tool support and the skill set of most business intelligence developers are all forces trying to turn your project back towards the waterfall dark side.

    All the while there is an ever-louder call from the business for instant answers to their questions.

    So I’ve begun to wonder: how agile can you get? Can you create traditional and modern business intelligence solutions using the same agile processes that normal software teams use? Can solution architecture be lean and risk-driven?

    I’m influenced by quite a varied set of development jobs: broadcast and media at Sony, telecoms and call centres at Exony, all sorts at Microsoft and now banking. I’ve learned that clean code, good domain modelling and testing are the only way to be successful.

    Obviously I can’t talk about my day job as I’m bound by employee confidentiality, but I do want to try out some ideas in a public arena where I can get feedback. I’m going to make up a set of requirements for an industry completely different from banking, healthcare, which I’ll set up in a future article.

    So comments are always welcome, as are suggestions for future posts, and I look forward to publishing some useful content.

  • Running Sum Over Degenerate Dimension

    Asking for a running sum in a report is a common request, but this week I was asked to create a running sum for a particular customer against the number of facts. What I mean here is a graph of count vs amount (sort of like a Pareto chart, except in transaction order), so something that looks like the graph above.

    This is a well-rehearsed subject in MDX. You can either use recursion:

    With Member [Measures].[Running Sum]
    As
        [Internet Sales Order Details].[Sales Order Number].PrevMember
        + 
        [Measures].[Internet Gross Profit]
    Member [Measures].[Running Count]
    As
        [Internet Sales Order Details].[Sales Order Number].PrevMember
        + 
        1
    
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Sales Order Number].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    Or iteration (thanks to Chris Webb for some help on relative performance), which should perform better, especially on AS2008:

    With Member [Measures].[Running Sum]
    As 
        Sum(
            {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember},
            [Measures].[Internet Gross Profit]
        )
    Member [Measures].[Running Count]
    As
        Count(
            {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember}
        )
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Sales Order Number].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    [However, on my x64 laptop the second version takes longer to execute; YMMV.]

    This is fine for Adventure Works but my real degenerate dimension has many millions of members and this approach just doesn’t scale. I contemplated using the Reporting Services RunningValue() function but, as far as I can tell, you can’t use it to generate a category axis.

    I needed a way of generating the running count for the x-axis in a way that uses Analysis Services’ excellent aggregation ability.

    [Image: Bucket hierarchy]

    The solution I ended up with is to create an artificial hierarchy and bucket the transactions. That way I can create an attribute relationship for aggregation and, importantly, control the number of cells in the iteration.

    The next problem was how to assign values to this bucket – some customers had only a few transactions yet others had millions. They all needed to be spread over a fixed set of buckets.

    The answer lies in the SQL Server RANK() function:

    update dw.Sales
    set TradeBucket = x.TradeBucket
    from (
        -- Rank each customer's trades, then divide by a customer-sized divisor
        -- to compress the range into a fixed set of buckets.
        select t.TradeKey,
            rank() over(partition by t.CustomerKey order by t.TradeKey asc) /
            case
                when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000 then 1
                when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 10000 then 10
                when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 100000 then 100
                when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000000 then 1000
                else 10000
            end as TradeBucket
        from dw.Sales t
    ) x
    where dw.Sales.TradeKey = x.TradeKey

    Effectively, we are generating an incrementing number on a per-customer basis and then dividing that number to compress the range: a customer with 5,000 transactions gets a divisor of 10, so ranks 1 to 5,000 map to roughly 500 buckets. This is surprisingly fast to execute.
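
    As an aside, the same per-customer row count could be computed with a windowed COUNT(*) instead of the correlated subqueries; this is only a sketch of that variant, assuming the same dw.Sales schema as above:

    -- Illustrative only: equivalent bucketing using a windowed COUNT(*).
    update dw.Sales
    set TradeBucket = x.TradeBucket
    from (
        select TradeKey,
            rank() over(partition by CustomerKey order by TradeKey asc) /
            case
                when count(*) over(partition by CustomerKey) < 1000 then 1
                when count(*) over(partition by CustomerKey) < 10000 then 10
                when count(*) over(partition by CustomerKey) < 100000 then 100
                when count(*) over(partition by CustomerKey) < 1000000 then 1000
                else 10000
            end as TradeBucket
        from dw.Sales
    ) x
    where dw.Sales.TradeKey = x.TradeKey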

    Once everything is processed, my new MDX looks like:

    With Member [Measures].[Running Sum]
    As     
        Sum(
            {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
            [Measures].[Internet Gross Profit]    
        )
    Member [Measures].[Running Count]
    As
        Sum(
            {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
            [Measures].[Sales Count]
        )
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Trade Bucket].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    It works on aggregated data; there are still around 1,000 points, which is just fine on the graph, and it executes in around 3 seconds. So all good?

    Well, for now, yes, but I can see a problem looming: every time I do an import I update every fact row and fully reprocess the cube. That isn’t going to scale long term. I will probably have to implement some sort of bucket partition strategy.

    This entry was posted in business-intelligence and tagged #mdx.
  • Whilst doing some design work today for a customer project I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete but this is what I have so far.

    Minimise IO

    This is a general data processing principle. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

    Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.

    Prefer Sequential IO to Random IO

    Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) manifests when procedural-style programming occurs; signs to look out for are SQL statements that modify or return only a few rows but are executed repeatedly.

    Watch out for hidden random IO: for example, if you are reading from one table and writing to another in a sequential manner, disk access will still be random if both tables are stored on the same spindles.
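
    To make the repeated-statement symptom concrete, here is a purely illustrative sketch (the dw.Sales table and TradeBucket column are just the hypothetical examples from the previous post):

    -- Procedural pattern: one tiny statement per row, which tends towards random IO.
    declare @TradeKey int;
    declare TradeCursor cursor for
        select TradeKey from dw.Sales where TradeBucket is null;
    open TradeCursor;
    fetch next from TradeCursor into @TradeKey;
    while @@FETCH_STATUS = 0
    begin
        update dw.Sales set TradeBucket = 0 where TradeKey = @TradeKey;
        fetch next from TradeCursor into @TradeKey;
    end
    close TradeCursor;
    deallocate TradeCursor;

    -- Set-based equivalent: one statement the engine can satisfy with large sequential scans.
    update dw.Sales set TradeBucket = 0 where TradeBucket is null;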

    Avoid data flow components that pool data

    Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed downstream. Some components, such as “Sort”, cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low-memory situations). This increased memory pressure will affect performance.

    Sometimes SQL is the better solution

    Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.
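
    For example, an aggregation that might otherwise be done with an Aggregate component in the data flow can often be pushed into a single SQL statement (StagingSales and DailySales are hypothetical tables used purely for illustration):

    -- Group and load in one set-based statement inside SQL Server.
    insert into dbo.DailySales (SaleDate, CustomerKey, TotalAmount)
    select SaleDate, CustomerKey, sum(Amount) as TotalAmount
    from dbo.StagingSales
    group by SaleDate, CustomerKey;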

    SSIS does not handle hierarchical data well

    Integration Services is a tabular data processing system. Buffers are tabular and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component but its output is a collection of tabular data streams that need to be joined to make sense.

    Execute SSIS close to where you wish to write your data

    Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

    Don’t mess with the data flow metadata at runtime

    It’s very difficult to do this anyway but it is worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime, safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

    The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.
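
    A minimal sketch of what that looks like through the object model (the class, file path and task names here are my own examples, and the component-level calls are only hinted at because class IDs and interface versions vary between SSIS releases):

    // References: Microsoft.SqlServer.ManagedDTS and Microsoft.SqlServer.DTSPipelineWrap.
    using Microsoft.SqlServer.Dts.Runtime;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    class PackageBuilder
    {
        static void Main()
        {
            // Build the package object graph rather than hand-crafting .dtsx XML.
            Package package = new Package { Name = "GeneratedPackage" };

            // Add a data flow (pipeline) task using its stock moniker.
            TaskHost host = (TaskHost)package.Executables.Add("STOCK:PipelineTask");
            host.Name = "Generated Data Flow";

            // The pipeline is populated through its metadata collections.
            // IDTSComponentMetaData100 is the 2008 interface name; 2005 uses the ...90 names.
            // Setting ComponentClassID and wiring paths is omitted here for brevity.
            MainPipe pipeline = (MainPipe)host.InnerObject;
            IDTSComponentMetaData100 source = pipeline.ComponentMetaDataCollection.New();

            // Let the API produce the package XML.
            new Application().SaveToXml(@"C:\Temp\GeneratedPackage.dtsx", package, null);
        }
    }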

    This entry was posted in data-warehousing and tagged #integration-services #sql-server #sql-server-integration-services #ssis.
  • Over the past few weeks I’ve been working on a web site for my cousin. This was a ground-up redesign of his company web site to replace the old, SEO-unfriendly Flash site.

    Normally, as the developer, I only get someone else’s completed design and just do the implementation, but I wanted to see whether any of my recent photography and image-editing skills were transferable.

    The site was first designed in Illustrator/Photoshop with a fair number of round trips to agree the final design. Implementation was done in Expression Web using HTML, CSS and some jQuery for interactivity. There are no tables anywhere on the site; it is a pure CSS layout. I also used Expression Web “Dynamic Web Templates” for the master page layout as the server where it is now hosted does not support .NET.

    Some things I’ve learned from doing this:

    • jQuery makes life easy and Glimmer makes it even easier
    • IE8 standards support is fantastic but far too many people still use IE6/7 to rely on it
    • You need to have an idea of what is possible in HTML before letting loose in Photoshop
    • background-image is your friend

    As it is all pretty static at the moment, there are plans to move to a host that supports .NET so I can add some more features, in particular a blog and news feed. I would like to try this in ASP.NET MVC to get some experience there.

    So go to RJR Consulting and have a look around (especially if you need telemarketing, telesales or similar services).

    This entry was tagged #web-design.
  • Every data warehouse needs a date dimension and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross joining year, month and day temporary tables to produce a set-based solution, but don’t forget to filter out the illegal days.

    I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

    • It executes very quickly
    • The data can be bulk loaded
    • CultureInfo supplies the correct translations of day and month names
    • It is easy to add custom columns such as fiscal years and quarters

    I haven’t wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven’t got around to generalizing the fiscal date offsets for different companies so they usually have to be custom coded.

    [Image: Script Component Type dialog]

    First drop a “Script Component” onto your Data Flow.

    Select “Source” as the Script Component Type and click OK.

    Then double-click the newly added component to edit the properties.

    Note that you need to add the correct output columns before adding the script or else it won’t compile.

    Output Columns

    I’ve renamed the output here to “Dates” to help further down the Data Flow.

    Click the “Add Column” button to add new columns as shown here. Note that I’ve also changed the data type of each column to match my source table. It requires casts in the script but it’s easier than conversions in the data pipeline.

    Finally go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio for Applications.

    In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function. The general form is:

    var output = this.DatesBuffer;  // Get the output buffer

    while (/* loop through your dates */)
    {
        output.AddRow();

        // Set the various column values e.g.
        output.CalendarYear = date.Year;

        // Increment the date
        date = date.AddDays(1);
    }
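
    As a slightly fuller sketch, assuming the start and end dates shown below and output columns named DateKey, CalendarYear, MonthName and DayName (adjust to whatever columns you added above), CreateNewOutputRows() might look like this:

    public override void CreateNewOutputRows()
    {
        // Assumption: en-GB names are wanted; CultureInfo supplies the translations.
        var culture = new System.Globalization.CultureInfo("en-GB");
        var output = this.DatesBuffer;

        var date = new DateTime(2000, 1, 1);       // first date in the dimension
        var lastDate = new DateTime(2020, 12, 31); // last date in the dimension

        while (date <= lastDate)
        {
            output.AddRow();

            output.DateKey = int.Parse(date.ToString("yyyyMMdd"));
            output.CalendarYear = date.Year;
            output.MonthName = date.ToString("MMMM", culture); // e.g. "January"
            output.DayName = date.ToString("dddd", culture);   // e.g. "Monday"

            date = date.AddDays(1);
        }
    }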

    The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

    [Image: Generated output]

    From here you can manipulate the data and pipe it to your dimension table from within the pipeline.

    DateSourceSample.dtsx

    This entry was posted in data-warehousing and tagged #ssis.