Better software through software architecture and devops

@jamessnape

Category Archives: business intelligence

  • * I mean this in the most British sense of the phrase

    I remember how painful the original Data Transformation Services tool was to use (loops anyone?), and when Integration Services shipped with SQL Server 2005 it was a breath of fresh air for anyone trying to build data warehouses. For SQL Server developers the choice was simple: use the free tool that was easy to use and fast to execute, or try to code your own. In 2005 the code-your-own option was hard. You had to write your own threading model, workflow, streaming primitives, event logging and so on, since the managed APIs of the time were not that mature.

    In contrast, Integration Services was easy. Easy to create, even if you don’t know anything about code; easy to run, even if you are not a SQL Server administrator; and blisteringly fast, provided you follow some best practice guidelines. This may be one of the reasons I dislike it so much – it sets unrealistic expectations for project managers about how long it should take to develop real software solutions. Bear with me on this one, as the point of the post is not what’s wrong with SSIS but how it limits you in the wider solution.

    I do remember an e-mail conversation with Kirk Haselden before I joined Microsoft about how COM, which SSIS is built on, leaks abstractions all over the .NET API. He maintained it was the right thing to do; I wasn’t so sure but it was his product so I didn’t put up much of a fight.

    I believe that SSIS is designed as a tool for DBAs, or those database developers who only know SQL, MDX and Reporting Services. It is the best it can possibly be without introducing code concepts to those people.

    A few months back I read a post by Teo Lachev called When Developers and BI Collide. I agree with large parts of it – primarily that you must have BI specialists if you want to develop BI solutions – and disagree with some of it – that maintaining SSIS is easier than maintaining custom code and that coders are not BI pros. I consider myself a coder AND a BI pro, and there are a number of similar people working in the team I architect for at the moment. In fact, when hiring I have found it is often easier and more productive to find and teach a coder about BI than the reverse.

    So anyway, I digress. I joined Microsoft in 2006 and did a lot of consulting around Integration Services. It was a popular tool for customers to have problems with. We used it as part of a solution for a large UK bank’s faster payments implementation. It was a hard project for me – the rest of the team were designing C# code and publishing object-based APIs. I had to use a shared encryption routine so that when BizTalk unwrapped a transaction at the other end it would be able to decrypt it. This API meant I had to spend a good proportion of my time writing boring and error-prone code to convert data sets (SSIS data flow) to objects and back to data sets again. Even so, that data-mapping code was the interesting part – I hate ‘programming’ by mouse: click, click, type, drag, click… and that is what the SSIS experience is. That was the first time I regretted using SSIS on a project.

    There are plenty of posts about what is wrong with SSIS and some equally passionate responses. My main issues with it are all related to real world usage. I have never been involved in a project where SSIS was the whole solution. It is always just a component of something bigger, an architecture, a project, a development team and a process. I work exclusively in agile teams now and every step of the way SSIS slows things down:

    • Unit testing is not possible (SSISUnit is really a component or integration test)
    • Agile team development requires code branching and merging, which is not possible with SSIS packages
    • Distributed source control (such as Git) can’t be used at all since there is no way to lock a file whilst you are working on it
    • Code reviews are difficult – you have to open up every package and click every box to check
    • It is hard to enforce project standards – StyleCop and FxCop do not work on VSA code
    • There is no way to share code – copy/paste coding is prolific
    • Everyone uses a template package to try and ensure some standards – unfortunately you can’t change that template later because every package was created from a copy of it
    • COM leaks abstractions everywhere from the C# APIs to the type system
    • The type system is too tightly bound to metadata – need to change a column length? Shame, now you have to open all the relevant packages and fix the resulting issues; ANSI <-> Unicode conversions must be explicit
    • There is no way to stub out data sources or destinations i.e. temporarily replace a SQL table with a local file for testing
    • Mouse based programming

    The net result of all this is yes, it is very quick to get something running but you must forever pay interest on the technical debt you just created. SSIS is not friction free in a team development environment.

    There are two situations where I believe you should use SSIS:

    1. If you are a DBA or database developer, you just need to get something done quickly and don’t care about the maintenance issues I’ve described above
    2. You need to take advantage of components that are complex to code yourself, such as CDC or fuzzy matching (remember that it is very easy to call packages from code anyway, so there is no need to base the entire solution on SSIS)

    What are the alternatives? The simplest one is to make use of SSIS in an abstracted way – code up something that can load your configuration, pass it to packages, execute them and capture the events for logging. We use something like this on my current project and it has helped to an extent. We still have to manage lots of packages though.
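
    As a rough sketch of what I mean (assuming the file deployment model; the class, path and variable names are mine, not from our project), the Microsoft.SqlServer.Dts.Runtime API makes this reasonably painless. A fuller version would also pass an IDTSEvents implementation into LoadPackage and Execute to capture events as they fire:

    using System;
    using Microsoft.SqlServer.Dts.Runtime;

    internal static class PackageRunner
    {
        // Loads a package from disk, pushes configuration in via a variable,
        // executes it and surfaces any errors for logging.
        public static DTSExecResult Run(string packagePath, DateTime loadDate)
        {
            var application = new Application();
            Package package = application.LoadPackage(packagePath, null);

            // Hypothetical variable name; real packages would expose their own.
            package.Variables["User::LoadDate"].Value = loadDate;

            DTSExecResult result = package.Execute();

            foreach (DtsError error in package.Errors)
            {
                Console.Error.WriteLine(error.Description);
            }

            return result;
        }
    }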

    Next up the ladder is to create abstractions of the tasks that packages perform and then generate the packages from those abstractions. Biml is a perfect example of this approach.

    Finally, I mentioned that APIs back in 2005 were not that mature. Here in 2013, though, we have some fantastic APIs to work with – the Task Parallel Library, asynchronous workflows, LINQ, functional programming concepts and the rich ecosystem of NuGet packages. Coding isn’t as hard as it used to be.
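
    As an illustration rather than a recommendation, here is a minimal sketch of a parallel, streaming load using TPL Dataflow in place of an SSIS data flow – the file format, Trade type and console ‘destination’ are invented for the example:

    using System;
    using System.IO;
    using System.Threading.Tasks;
    using System.Threading.Tasks.Dataflow;

    internal sealed class Trade
    {
        public Trade(string account, decimal amount)
        {
            Account = account;
            Amount = amount;
        }

        public string Account { get; private set; }
        public decimal Amount { get; private set; }
    }

    internal static class CsvLoader
    {
        public static async Task RunAsync(string path)
        {
            // Parse lines in parallel; dataflow blocks preserve output order by default.
            var parse = new TransformBlock<string, Trade>(
                line =>
                {
                    var fields = line.Split(',');
                    return new Trade(fields[0], decimal.Parse(fields[1]));
                },
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

            // Stand-in destination; a real load would batch rows into SqlBulkCopy.
            var load = new ActionBlock<Trade>(
                trade => Console.WriteLine("{0} {1}", trade.Account, trade.Amount));

            parse.LinkTo(load, new DataflowLinkOptions { PropagateCompletion = true });

            foreach (var line in File.ReadLines(path))
            {
                await parse.SendAsync(line);
            }

            parse.Complete();
            await load.Completion;
        }
    }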

    I started out this summer to produce an end-to-end BI solution in an agile way but quickly found I needed to update some of my skills with respect to the C# tools and techniques available. So whilst I haven’t been blogging I have coded, learned and tried out ideas. Some of these are ready to show – look for Deeply on GitHub or on NuGet. It is early days but try it out and let me know what you think.

    This entry was posted in business-intelligence and tagged #c #data-transformation-services #sql-server-developers #ssis.
  • tattoo work by Keith Killingsworth source http://commons.wikipedia.org/wiki/File:Tattoos.jpg

    So in the comments on a recent post on Risk Driven Architecture, Jamie Thomson asked whether the problems associated with change can be mitigated by using views. I firmly believe that views can help, but unfortunately not enough to save you from clients that connect directly to Analysis Services cubes.

    So it got me thinking about a similar mitigation for cubes. Unfortunately nothing came to mind apart from an analogy:

    Dimensional models are like tattoos – you have to live with them for a long time

    Why, you might ask? Well, you can add to them, maybe fill in some extra colour, but basically once you’ve committed you are stuck with them, because every spreadsheet and report using your model will need fixing if you try to remove something. Like tattoos, you can remove them but it’s going to be painful and cost a lot of money.

    I don’t have any tattoos (not because I don’t like them, I just can’t decide on one that I’d have to live with for so long). However I’ve heard plenty of guidance about taking your time before committing – one of the best techniques is to simply draw your new tattoo with a Sharpie and try it on for size for a while.

    How does this help with dimensional models? Well, the same techniques apply. Try a new model on for size, especially if you can arrange for the new model to fade like the Sharpie as time passes, which automatically limits client usage. Maybe process the cube manually for a while – your users will soon tell you if the data is useful. This fits with an agile approach too – only put measures and attributes in the cube if you need them and don’t add stuff in the hope that it will be used productively.

    This entry was posted in business-intelligence and tagged #dimensional-model.
  • Whilst researching the previous article I came across this link on Acronyms and Ubiquitous Language. It is well worth reading as everything discussed also applies to dimensional models. There is one quote that I want to reprint from the .NET Framework General Naming Conventions:

    Do not use any acronyms that are not widely accepted, and then only when necessary.

    Your business users should be able to point Excel (or whatever tool you are using) at a dimensional model and intuitively know what the measures, dimensions and attributes are because they describe the business your users work in. Since acronyms obfuscate meaning they don’t belong in dimensional models.

    The only time I generally relax this rule is when both of the following are true:

    • All business users know the meaning of the acronym
    • The expanded version is so long that it becomes unwieldy
    This entry was posted in business-intelligence and tagged #dimensional-model #domain-driven-design #naming-conventions #ubiquitous-language.
  • Managing complexity graphic showing DDD patterns to apply from Domain Driven Design source https://commons.wikimedia.org/wiki/File:Maintaining_Model_Integrity.png

    My commute is around two and a half hours each way so I read a lot on the train. One of the subjects I’ve recently become interested in is Domain-Driven Design, or DDD. I’ve found it isn’t really a new topic for me but more a case of someone having documented many of the techniques I’ve always used.

    DDD discusses data warehouses primarily as an output or reporting function within a larger application. The book Implementing Domain-Driven Design by Vaughn Vernon mentions reporting repeatedly as a by-product of DDD (particularly when used with Event Sourcing) but not directly as a possible use case.

    I would agree that not all concepts can be reused in data warehouse solutions, since the only interface available is often one that transfers a set of mutations (property value changes) without the accompanying reasons (it is a key idea in DDD that you design your model not around changes in attributes but around the operations performed on entities). For example, an order count has decreased but the reason is missing – was the order returned, cancelled, an error? So where can DDD be applied? Are any of the concepts useful when designing data warehouses?

    Ubiquitous Language

    Let’s start with one of the core concepts – Ubiquitous Language, a rigorous shared language used between developers and users. It is used to make sure that conversations are accurate and productive, and it should evolve as the team’s understanding of a domain changes. The ubiquitous language is what forms the domain model at the heart of a software solution.

    I find this description very similar to an equivalent concept in data warehousing – the Dimensional Model. This model, and its associated dimension bus matrix, is based on real business processes and terminology. The dimensional model is the public face of the data warehouse. It needs to be precise, reflect the terms used by business users and form a common vocabulary between users and the development team. For example when browsing an Analysis Services cube in Excel, the dimensions and facts defined in the dimensional model are directly visible to end users – if they don’t automatically understand what is on-screen then the model doesn’t describe the business.

    Entities and Value Objects

    There are two types of object in domain-driven design – entities and value objects. Value objects are immutable and identified by their attributes. For example, $100 USD in one object is interchangeable with $100 USD in another. Entities, on the other hand, cannot be identified purely by their attributes – there must be some sort of unique identifier (in data warehouse terms, a business key) to differentiate similar entities. For example, one John Smith may not be the same person as another and needs a Customer Id to differentiate the two.
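
    A minimal sketch in C# (the class names are mine, purely for illustration):

    using System;

    // A value object: identity is defined entirely by its attributes, so any
    // $100 USD instance is interchangeable with any other.
    internal sealed class Money : IEquatable<Money>
    {
        public Money(decimal amount, string currency)
        {
            Amount = amount;
            Currency = currency;
        }

        public decimal Amount { get; private set; }
        public string Currency { get; private set; }

        public bool Equals(Money other)
        {
            return other != null && Amount == other.Amount && Currency == other.Currency;
        }

        public override bool Equals(object obj)
        {
            return Equals(obj as Money);
        }

        public override int GetHashCode()
        {
            return Amount.GetHashCode() ^ Currency.GetHashCode();
        }
    }

    // An entity: two customers named John Smith are still different people, so
    // identity comes from a business key rather than from the attributes.
    internal sealed class Customer
    {
        public Customer(string customerId, string name)
        {
            CustomerId = customerId;
            Name = name;
        }

        public string CustomerId { get; private set; } // business key
        public string Name { get; set; }               // attributes may change over time
    }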

    With respect to dimensional models, value objects should not be implemented as top-level dimensions but instead added as sets of attributes to the entities which own them. For example, ‘Product Colour’ is a value type (colour) and belongs in the ‘Product’ dimension. This seems obvious when written down, but the mistake happens a lot.

    Entities and Aggregate Roots

    DDD groups sets of closely related entities under the control of a single ‘Aggregate root’. Some entities make no sense unless the parent entity is also within context; order lines and parent orders are the typical example.

    So value objects shouldn’t be dimensions and I don’t think ordinary entities should be either. True dimensions are the aggregate roots and the one thing that seals it for me is that an aggregate root (according to DDD) defines a transactional boundary – you should not update multiple aggregate roots within a single transaction; instead sagas keep your data warehouse in sync (eventually).
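
    Again as a hypothetical sketch (invented names), the order is the aggregate root and the only route to its order lines:

    using System.Collections.Generic;
    using System.Collections.ObjectModel;

    // Order is the aggregate root: its order lines are only reachable and only
    // modified through it, and saving it forms a single transactional boundary.
    internal sealed class Order
    {
        private readonly List<OrderLine> lines = new List<OrderLine>();

        public Order(string orderNumber)
        {
            OrderNumber = orderNumber;
        }

        public string OrderNumber { get; private set; } // business key

        public ReadOnlyCollection<OrderLine> Lines
        {
            get { return lines.AsReadOnly(); }
        }

        public void AddLine(string product, int quantity)
        {
            lines.Add(new OrderLine(product, quantity));
        }
    }

    // An order line makes no sense outside the context of its parent order.
    internal sealed class OrderLine
    {
        public OrderLine(string product, int quantity)
        {
            Product = product;
            Quantity = quantity;
        }

        public string Product { get; private set; }
        public int Quantity { get; private set; }
    }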

    Finally

    Domain driven design and business intelligence share a number of common concepts yet the two philosophies are rarely seen as related. I think there is a lot to be gained by applying software concepts from different viewpoints which may not ordinarily be considered.

    This entry was posted in business-intelligence and tagged #data-warehouse #domain-driven-design.
  • Running Sum Over Degenerate Dimension

    Asking for a running sum in a report is a common request, but this week I was asked to create a running sum for a particular customer against the number of facts. What I mean here is a graph of count vs amount (sort of like a Pareto chart, except in transaction order) – something that looks like the graph above.

    This is a well rehearsed subject in MDX. You can either use recursion:

    With Member [Measures].[Running Sum]
    As
        [Internet Sales Order Details].[Sales Order Number].PrevMember
        + 
        [Measures].[Internet Gross Profit]
    Member [Measures].[Running Count]
    As
        [Internet Sales Order Details].[Sales Order Number].PrevMember
        + 
        1
    
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Sales Order Number].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    Or you can use iteration (thanks to Chris Webb for some help on relative performance), which should perform better, especially on AS2008:

    With Member [Measures].[Running Sum]
    As 
        Sum(
            {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember},
            [Measures].[Internet Gross Profit]
        )
    Member [Measures].[Running Count]
    As
        Count(
            {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember}
        )
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Sales Order Number].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    [However, on my x64 laptop the second version takes longer to execute – YMMV.]

    This is OK for AdventureWorks, but my real degenerate dimension has many millions of members and this just doesn’t scale. I contemplated using the Reporting Services RunningValue() function but, as far as I can tell, you can’t use it to generate a category axis.

    I needed a way of generating the running count for the x-axis in a way that uses Analysis Services’ excellent aggregation ability.

    Bucket hierarchy

    The solution I ended up with is to create an artificial hierarchy and bucket the transactions. That way I can create an attribute relationship for aggregation and, importantly, control the number of cells in the iteration.

    The next problem was how to assign values to this bucket – some customers had only a few transactions yet others had millions. They all needed to be spread over a fixed set of buckets.

    The answer lies in the SQL Server RANK() function:

    update dw.Sales
    set TradeBucket = x.TradeBucket
    from (
        select TradeKey,
        rank() over(partition by CustomerKey order by t.TradeKey asc) / 
        case 
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000 then 1
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 10000 then 10
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 100000 then 100
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000000 then 1000
            else 10000
        end as TradeBucket
        from dw.Sales t
    ) x
    where dw.Sales.TradeKey = x.TradeKey

    Effectively, we are generating an incrementing number on a per-customer basis and then dividing that number to compress the range. For example, a customer with 500,000 trades gets ranks 1 to 500,000 which, divided by 1,000, collapse into roughly 500 buckets. This is surprisingly fast to execute.

    Once everything is processed, my new MDX looks like:

    With Member [Measures].[Running Sum]
    As
        Sum(
            {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
            [Measures].[Internet Gross Profit]
        )
    Member [Measures].[Running Count]
    As
        Sum(
            {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
            [Measures].[Sales Count]
        )
    Select {
        [Measures].[Running Count],
        [Measures].[Internet Gross Profit],
        [Measures].[Running Sum]
    } On 0, Non Empty {
        [Internet Sales Order Details].[Trade Bucket].Children
    } On 1
    From [Adventure Works]
    Where (
        [Customer].[Customer].[Brian Watson]
    )

    It works on aggregated data; there are still around 1000 points which is just fine on the graph and it executes in around 3 seconds. So all good?

    Well, for now yes but I can see a problem looming – every time I do an import I update every fact row and fully reprocess the cube. That isn’t going to scale long-term. I will probably have to implement some sort of bucket partition strategy.

    This entry was posted in business-intelligence and tagged #mdx.