Better software through software architecture and devops

@jamessnape

Annual Archives: 2013

  • I’ve just pushed a new version of Deeply to nuget.org. This version provides just enough functionality to write some basic ETL jobs:

    • Parallel and Sequential Tasks
    • Execute SQL Task
    • Execute Process Task
    • Simple Dataflow Task

    The tasks are pretty self-explanatory. The key point is that nearly all the setup is done in the constructors; once the structure is created it is executed asynchronously.

    Data flows are a little harder to configure. You need a source, a target and a mapping function. A source is anything conforming to IEnumerable<T>, a target is a class that accepts an IEnumerable<T> by implementing IBulkRepository<T>, and finally a mapping function converts items from the source type to the target type.

    The code for using a simple data flow looks a little like the pseudo-csharp below:

    var source = new CsvReader("C:\\sourcefile.csv");
    
    var connectionFactory = new SqlConnectionFactory("Data Source=(localdb)\\v11.0;");
    
    var columnMappings = new Dictionary<string, string>()
    {
        { "Id", "Id" },
        { "Name", "Name" },
        { "Created", "Created" }
    };
    
    var target = new SqlBulkRepository(
        "dbo.FactTable", connectionFactory, columnMappings);
    
    var dataflow = new SimpleDataflowTask<TSource, TTarget>(
        source, MappingFunctions.Identity, target);
    
    var context = new TaskContext();
    await dataflow.ExecuteAsync(context);

    If anyone would like to help write some examples and documentation I’d be immensely grateful; otherwise, please let me know about your experiences using this package.

    This entry was posted in data-warehousing and tagged #dataflow #deeply #etl #nuget.
  • * I mean this in the most British sense of the phrase

    I remember how painful the original Data Transformation Services tool was to use (loops anyone?) and when Integration Services shipped with SQL Server 2005 it was a breath of fresh air for anyone trying to build data warehouses. For SQL Server developers the choice was simple: use the free tool that was easy to use and fast to execute, or try to code your own. In 2005 the code-your-own option was hard. You had to write your own threading model, workflow, streaming primitives, event logging and so on, since the managed APIs of the time were not that mature.

    In contrast Integration Services was easy. Easy to create, even if you don’t know anything about code; easy to run, even if you are not a SQL Server administrator; and blisteringly fast, provided you follow some best practice guidelines. This may be one of the reasons I dislike it so much – it sets unrealistic expectations for project managers about how long it should take to develop real software solutions. Bear with me on this one as the point of the post is not what’s wrong with SSIS but how it limits you in the wider solution.

    I do remember an e-mail conversation with Kirk Haselden before I joined Microsoft about how COM, which SSIS is built on, leaks abstractions all over the .NET API. He maintained it was the right thing to do; I wasn’t so sure but it was his product so I didn’t put up much of a fight.

    I believe that SSIS is designed as a tool for DBAs or those database developers that only know SQL, MDX and Reporting Services. It is the best it can possibly be without introducing code concepts to those users.

    A few months back I read a post by Teo Lachev called When Developers and BI Collide. I agree with large parts of it – primarily that you must have BI specialists if you want to develop BI solutions – and disagree with some of it – that maintaining SSIS is easier than maintaining custom code and that coders are not BI pros. I consider myself a coder AND a BI pro and there are a number of similar people in the team I architect for at the moment. Actually, when hiring I have found it is often easier and more productive to find a coder and teach them BI than the reverse.

    So anyway, I digress. I joined Microsoft in 2006 and did a lot of consulting around Integration Services. It was a popular tool for customers to have problems with. We used it as part of a solution for a large UK bank’s faster payments implementation. It was a hard project for me – the rest of the team were designing C# code and publishing object-based APIs. I had to use a shared encryption routine so that when BizTalk unwrapped a transaction at the other end it would be able to decrypt it. This API meant I had to spend a good proportion of my time writing boring and error-prone code to convert data sets (SSIS data flow) to objects and back to data sets again. That data mapping code was still the interesting part though – I hate ‘programming’ by mouse; click, click, type, drag, click… and this is what the SSIS experience is. That was the first time I regretted using SSIS on a project.

    There are plenty of posts about what is wrong with SSIS and some equally passionate responses. My main issues with it are all related to real world usage. I have never been involved in a project where SSIS was the whole solution. It is always just a component of something bigger, an architecture, a project, a development team and a process. I work exclusively in agile teams now and every step of the way SSIS slows things down:

    • Unit testing is not possible (SSISUnit is really a component or integration test)
    • Agile team development requires code branching and merging which is not possible
    • Distributed source control (such as Git) can’t be used at all since there is no way to lock a file whilst you are working on it
    • Code reviews are difficult – you have to open up every package and click every box to check
    • It is hard to enforce project standards – StyleCop and FxCop do not work on VSA code
    • There is no way to share code – copy/paste coding is prolific
    • Everyone uses a template package to try and ensure some standards – unfortunately, once it has been copied you can’t push changes to the template out to the packages created from it
    • COM leaks abstractions everywhere from the C# APIs to the type system
    • The type system is too tightly bound to metadata – need to change a column length? Shame, now you have to open all the relevant packages and fix the issues; ANSI <-> Unicode conversions must be explicit
    • There is no way to stub out data sources or destinations i.e. temporarily replace a SQL table with a local file for testing
    • Mouse based programming

    The net result of all this is yes, it is very quick to get something running but you must forever pay interest on the technical debt you just created. SSIS is not friction free in a team development environment.

    There are two situations where I believe you should use SSIS:

    1. If you are a DBA or database developer, you just need to get something done quickly and don’t care about the maintenance issues I’ve described above
    2. You need to take advantage of some of the components that are more complex to code, such as CDC or fuzzy matching (remember that it is very easy to call packages from code, so there is no need to base the entire solution on SSIS)

    What are the alternatives? The simplest one is to make use of SSIS in an abstract way – code up something that can load your configuration, pass it to packages, execute them and capture the events for logging. We use something like this on my current project and to an extent it has helped a little. We still have to manage lots of packages though.
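    The sketch below shows the general shape of that idea using the managed Microsoft.SqlServer.Dts.Runtime API. It is illustrative rather than our actual implementation – the class names, variable-based configuration and logging are my own assumptions:

    using System;
    using System.Collections.Generic;
    using Microsoft.SqlServer.Dts.Runtime;
    
    // Captures SSIS events so they can be routed to your own logging.
    public class LoggingEvents : DefaultEvents
    {
        public override bool OnError(DtsObject source, int errorCode, string subComponent,
            string description, string helpFile, int helpContext, string idofInterfaceWithError)
        {
            Console.Error.WriteLine("SSIS error {0}: {1}", errorCode, description);
            return false;
        }
    }
    
    public static class PackageRunner
    {
        public static bool Run(string packagePath, IDictionary<string, object> configuration)
        {
            var application = new Application();
            var events = new LoggingEvents();
            Package package = application.LoadPackage(packagePath, events);
    
            // Pass configuration into the package via its variables.
            foreach (var item in configuration)
            {
                package.Variables[item.Key].Value = item.Value;
            }
    
            return package.Execute(null, null, events, null, null) == DTSExecResult.Success;
        }
    }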

    Next up the ladder is to create abstractions of the tasks that packages are performing and generate them. Biml is a perfect example of this.

    Finally, I mentioned that APIs back in 2005 were not that mature. Here in 2013 though we have some fantastic APIs to work with – Task Parallel Library, asynchronous workflows, Linq, functional programming concepts and the rich ecosystem of Nuget packages. Coding isn’t as hard as it used to be.
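    As a flavour of what I mean, running a couple of independent dimension loads in parallel and then a fact load is now just a few lines of async code rather than a hand-rolled threading model. The loader methods here are hypothetical stand-ins:

    using System.Threading.Tasks;
    
    public static class WarehouseLoader
    {
        public static async Task LoadAsync()
        {
            // Independent dimension loads run concurrently on the thread pool...
            await Task.WhenAll(
                LoadDimensionAsync("dbo.DimCustomer"),
                LoadDimensionAsync("dbo.DimProduct"));
    
            // ...and the fact load starts only once both have completed.
            await LoadFactAsync("dbo.FactSales");
        }
    
        // Hypothetical loaders standing in for real dimension and fact ETL steps.
        private static Task LoadDimensionAsync(string table) { return Task.Delay(0); }
        private static Task LoadFactAsync(string table) { return Task.Delay(0); }
    }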

    I started out this summer to produce an end to end BI solution in an agile way but quickly found out I needed to update some of my skills with respect to the C# tools and techniques available. So whilst I haven’t been blogging I have coded, learned and tried out ideas. Some of these are ready to show and you can try them out: look for Deeply on Github or on Nuget. It is early days but try it out and let me know what you think.

    This entry was posted in business-intelligence and tagged #c #data-transformation-services #sql-server-developers #ssis.
  • I’m currently trying to learn F# because I’m keen to learn new programming styles as well as languages. It turns out that many of the concepts we C# programmers know and love such as Linq (monads), generics and async workflows originated in either F# or other functional languages. Thinking ‘functionally’ is a great skill to have too. How does this apply to surrogate key mapping? Well to borrow a notation from F# we are looking for a function like this:

    string -> int

    That is, a function that takes a string (the business key) and returns an integer (the surrogate key). Surrogate key lookup is a perfect fit for the functional view where “functions have no side effects”. Pass the same string to our lookup function any number of times and it should return the same integer value. A poorly performing version of this function might run off to the database on every call to retrieve the value, but there is a familiar functional technique called memoization that can help. C# programmers might describe this technique as “store the values in a hashtable and only call the database if the value is missing”. A few other optimisations are necessary. Firstly, memoization will only cache the result of a single call, so if we have a few hundred thousand dimension members in the database it will still take a lot of calls to populate the cache. Secondly, my lookup function doesn’t really care about the mechanics of the real database call so it would be nice if we could abstract that away. Finally, because I intend this class to be used as part of a multithreaded pipeline it needs to make sure that its internal data structures are protected. Piecing these requirements together we can start to flesh out the code. The main map function, as mentioned, takes a string and returns an int:

    public int Map(string businessKey) { }

    Since we want to prime the cache with a set of values and abstract the real lookup functionality the best place to configure this is in the constructor:

    public DimensionMapper(IDictionary<string, int> initialCache, Func<string, int> lookup) { }
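
    Behind that signature there is nothing more than a couple of fields; a minimal sketch (argument checking omitted) might look like this:

    private readonly Dictionary<string, int> map;
    private readonly Func<string, int> lookup;
    
    public DimensionMapper(IDictionary<string, int> initialCache, Func<string, int> lookup)
    {
        // Copy the primed values so the mapper owns its own cache.
        this.map = new Dictionary<string, int>(initialCache);
        this.lookup = lookup;
    }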

    Assuming the constructor just saves these parameters for later we can create a first cut version of the Map function:

    public int Map(string businessKey)
    {
        int surrogateKey;
    
        if (this.map.TryGetValue(businessKey, out surrogateKey))
        {
            return surrogateKey;
        }
    
        surrogateKey = this.lookup(businessKey);
        this.map.Add(businessKey, surrogateKey);
    
        return surrogateKey;
    }

    This works but it isn’t thread safe. For that we need a ReaderWriterLockSlim since only writes need to be synchronised. If you look at the code above there are two parts to it – the first few lines check the cache and return a value if it exists (the majority path); the last three lines are concerned with calling the real lookup function and populating the cache with the result when it doesn’t exist. Splitting on this boundary allows us to wrap the first part in a read lock and the second in a write lock - turning the write part into a separate function is a little cleaner:

    public int Map(string businessKey)
    {
        this.updateLock.EnterUpgradeableReadLock();
    
        try
        {
            int surrogateKey;
    
            if (this.map.TryGetValue(businessKey, out surrogateKey))
            {
                return surrogateKey;
            }
    
            return this.Lookup(businessKey);
        }
        finally
        {
            this.updateLock.ExitUpgradeableReadLock();
        }
    }
    
    private int Lookup(string businessKey)
    {
        this.updateLock.EnterWriteLock();
    
        try
        {
            int surrogateKey = this.lookup(businessKey);
            this.map.Add(businessKey, surrogateKey);
            return surrogateKey;
        }
        finally
        {
            this.updateLock.ExitWriteLock();
        }
    }

    So we have most of the class written now and I haven’t discussed anything to do with databases or how we get a real surrogate key because… well, it’s not relevant here since a function is passed to the constructor. I like this ability to concentrate on just a single algorithm and not worry about the wider solution. From what I’ve learned so far F# is better at this than C#.
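
    As an illustration, wiring in a real lookup is just a case of passing a delegate to the constructor. The table, column and connection details below are placeholders, and a real implementation would also insert new dimension members rather than assume the key already exists:

    Func<string, int> databaseLookup = businessKey =>
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "SELECT CustomerKey FROM dbo.DimCustomer WHERE BusinessKey = @key", connection))
        {
            command.Parameters.AddWithValue("@key", businessKey);
            connection.Open();
            return (int)command.ExecuteScalar();
        }
    };
    
    var mapper = new DimensionMapper(initialCache, databaseLookup);
    int surrogateKey = mapper.Map("CUST-0042");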

    For the full class definition see the full source file in context and associated unit tests.

    This entry was posted in sample-solution and tagged #f #functional #surrogate-key.
  • I’ve been working on this for a few weeks now, half an hour at a time in the evenings and I can safely say it’s pretty hard to maintain a train of thought in thirty minute intervals. However a bare minimum implementation is complete and ready to discuss.

    We start with an acceptance test:

    https://gist.github.com/jsnape/5887988

    The first part of the feature describes the user story and the second part tells us that when we load three patient referrals then the total count should be 3 with 1 on the 1st January.

    I’m using SpecFlow for acceptance tests since it is very easy to define tables and there are some useful binding utilities as we will see. After entering the test we can immediately compile the application and run the tests without writing anything else. The test will obviously fail since we haven’t written any code. In fact the acceptance test will stay broken for some time as we write code and unit tests. When it passes we know the feature is done.
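
    The binding utilities are what make the tables so cheap to use. A step definition along these lines turns a Gherkin table straight into typed objects – the step text, the SourceReferral properties and the NUnit assert are illustrative rather than the actual project code:

    using System;
    using System.Collections.Generic;
    using NUnit.Framework;
    using TechTalk.SpecFlow;
    using TechTalk.SpecFlow.Assist;
    
    // Hypothetical source record shape matching the table columns in the feature.
    public class SourceReferral
    {
        public int Id { get; set; }
        public string Name { get; set; }
        public DateTime Referred { get; set; }
    }
    
    [Binding]
    public class ReferralSteps
    {
        private readonly List<SourceReferral> referrals = new List<SourceReferral>();
    
        [Given(@"the following patient referrals")]
        public void GivenTheFollowingPatientReferrals(Table table)
        {
            // CreateSet<T> maps the Gherkin table columns onto properties of T.
            this.referrals.AddRange(table.CreateSet<SourceReferral>());
        }
    
        [Then(@"the total referral count should be (\d+)")]
        public void ThenTheTotalReferralCountShouldBe(int expected)
        {
            Assert.AreEqual(expected, this.referrals.Count);
        }
    }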

    So thinking about this functionally, we want to write a function that transforms an enumerable of source referral records into an enumerable of referral facts and then pipe this iterator into a SqlBulkCopy instance. Effectively this code needs to work:

    referralrepository.BulkCopy(referrals.Select(x => mapper.Map(x)));

    This is a Linq transform with a mapping function applied to each item in the source list. In the next few posts I’m going to break it into bite size chunks to implement.
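
    To make that one-liner concrete, the two collaborators only need shapes along these lines (hypothetical names – the real interfaces will emerge as the implementation gets broken down):

    public interface IReferralMapper
    {
        // Transforms one source record into a fact row ready for loading.
        ReferralFact Map(SourceReferral source);
    }
    
    public interface IReferralRepository
    {
        // Streams the mapped rows into the warehouse, e.g. via SqlBulkCopy.
        void BulkCopy(IEnumerable<ReferralFact> facts);
    }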

    This entry was posted in sample-solution and tagged #linq #mental-health-project #specflow.
  • Tableau Visualization

    This week has been dominated by the Tableau Customer Conference. I was fortunate to get a ticket since it was sold out but one of our architects couldn’t go so I filled in. I’m glad I did.

    It’s been a while since I got to learn about a completely new technology so it is a refreshing change to be a bit of a novice. After a number of Microsoft conferences this one felt quite different too – less geeky with a more mixed crowd. It was interesting to be able to talk with non-technical types such as data analysts, business managers and statisticians.

    I mainly went to the technical sessions but a couple of the keynote sessions were really interesting. Firstly, ‘Creating a culture of data at Facebook’ gave some useful ideas about creating communities and getting more staff comfortable with visualizations. It was also nice to listen to a blogger I’ve read for a while (but had only just discovered worked for Facebook). The second was Prof. Hans Rosling. I’ve seen his TED talk but in person he was completely different – probably because he was talking to a room full of data visualisation professionals. He had plenty of anecdotes about how his famous visualizations came about. Ellie Fields gives a good description of his talk.

    So back to the day job now but with some new ideas about business intelligence and data visualization.

    This entry was posted in data-visualization and tagged #conference #data-visualization #hans-rosling #tableau #tableau-customer-conference.