Better software through software architecture and devops

@jamessnape

Tag Archives: #ssis

  • Photo of the old IMAX cinema on Bournemouth seafront.

    Latest links for easy consumption over the May long weekends – I missed out on March so have dropped some of the less interesting ones to keep the list short.

    Organisational Behaviour

    Programmers, Teach Non-Geeks The True Cost of Interruptions – a simple way to show your boss how drive-by management kills programmer productivity. Also worth reading is Maker’s Schedule, Manager’s Schedule, which highlights the differences. If this is still a problem then this notice might be your only solution.

    The Death Of Expertise – The Dunning-Kruger effect is often strong in semi-technical managers especially in industries where confidence plays a large part in success such as finance. This article discusses some of the problems related to treating all opinions as equal and ignoring experts.

    Save Your Software from the Start: Overcoming Skewed Thinking in the Project Planning Stage – Very simply, why we always underestimate the true complexity and cost of a project plus some tools to help overcome these psychological effects.

    Why Good Managers Are So Rare – Gallup finds that companies fail to choose the candidate with the right talent for the job 82% of the time, and that managers account for at least 70% of the variance in employee engagement scores across business units.

    I Give Up: Extroverted Barbarians at the Gates – Anyone remember the “perpendicular transparent red lines” video doing the rounds? This is an on-the-nail deconstruction of what is happening and why it happens. If you are an introvert then this other post might feel very familiar to you.

    Agile

    Coconut Headphones: Why Agile Has Failed – A rant about how modern agile methodologies seem to consist only of management practices. Take note of the points at the end on how to be successful.

    The death of agile? – Additional comment on the above. 

    Writing User Stories for Back-end Systems – The real functionality a user sees in a business intelligence project is quite small and can easily be described in a few words. This makes breaking up user stories into sprint sized chunks hard. This article gives some great advice that can be translated to BI projects. 

    Design Your Agile Project, Part 1 – So how do you pick the right kind of agile project? When should you use Kanban and when should you use Scrum? How is the business side of the equation handled? Also Part 2, Part 3, and Part 4.

    Large Agile Framework Appropriate for Big, Lumbering Enterprises – A perfect solution to doing agile in finance organisations (wink). Love the concept of ‘Pair Managing’.

    Metrics that matter with evidence-based management – It’s long, but Martin does a great job looking at many of the metrics in use today, why their use is limited, and a far better approach to designing metrics that really help.

    Databases

    Is ETL Development doomed? – “Long term, the demand for ETL skills will decline”. The demand will mutate into one for more abstract ETL capabilities.

    Testing

    Intro to Unit Testing 9: Tips and Tricks – A handy list of tips that can make maintaining unit test code a little easier.

    FsCheck + XUnit = The Bomb – Even if you write code in C# it may be wise to think about writing unit tests in F#, since the code is more concise and easier to read, and with FsCheck you can find things you might not.

    Data Visualization

    5 Tips to Good Vizzin’ – Should be required reading for anyone who is thinking about creating dashboards in Tableau.

    A Natural Approach to Analytics – This explains why using tools such as Tableau for largely static dashboards is a waste of time. Users need to interact with the data in a way they cannot do when relegated to dashboard consumers.

    Big Data/Hadoop

    Modern Financial Services Architectures Built with Hadoop – Hortonworks looks at big data in financial services.

    Beyond hadoop: fast queries from big data – I think Hadoop might be catching up here but it is still a bit of an elephant compared to SQL Server, Oracle, etc. when it comes to raw query performance.

    Don’t understand Big Data? Blame your genes! – 5 common errors for dealing with big data.

    The Parable of Google Flu: Traps in Big Data Analysis – Big data answers are not always correct. This paper looks at some of the pitfalls.

    No, Hadoop Isn’t Going To Replace Your Data Warehouse – More thoughts on modern data architectures and hybrid transactional/analytical processing.

    This entry was posted in reference and tagged #apache-hadoop #business-intelligence #etl #sql-server-integration-services #ssis.
  • * I mean this in the most British sense of the phrase

    I remember how painful the original Data Transformation Services tool was to use (loops, anyone?), and when Integration Services shipped with SQL Server 2005 it was a breath of fresh air for anyone trying to build data warehouses. For SQL Server developers the choice was simple: use the free tool that was easy to use and fast to execute, or try to code your own. In 2005 the code-your-own option was hard. You had to write your own threading model, workflow, streaming primitives, event logging and so on, since the managed APIs of the time were not that mature.

    In contrast Integration Services was easy. Easy to create, even if you don’t know anything about code; easy to run, even if you are not a SQL Server administrator; and blisteringly fast, provided you follow some best practice guidelines. This may be one of the reasons I dislike it so much – it sets unrealistic expectations for project managers about how long it should take to develop real software solutions. Bear with me on this one, as the point of the post is not what’s wrong with SSIS but how it limits you in the wider solution.

    I do remember an e-mail conversation with Kirk Haselden before I joined Microsoft about how COM, which SSIS is built on, leaks abstractions all over the .NET API. He maintained it was the right thing to do; I wasn’t so sure but it was his product so I didn’t put up much of a fight.

    I believe that SSIS is designed as a tool for DBAs, or those database developers who only know SQL, MDX and Reporting Services. It is the best it can possibly be without introducing code concepts to these people.

    A few months back I read a post by Teo Lachev called When Developers and BI Collide. I agree with large parts of it – primarily that you must have BI specialists if you want to develop BI solutions – and disagree with some of it – that maintaining SSIS is easier than maintaining custom code, and that coders are not BI pros. I consider myself a coder AND a BI pro, and there are a number of similar people working in the team I architect for at the moment. In fact, when hiring I have found it is often easier and more productive to find a coder and teach them BI than the reverse.

    So anyway, I digress. I joined Microsoft in 2006 and did a lot of consulting around Integration Services. It was a popular tool for customers to have problems with. We used it as part of a solution for a large UK bank’s faster payments implementation. It was a hard project for me – the rest of the team were designing C# code and publishing object-based APIs. I had to use a shared encryption routine so that when BizTalk unwrapped a transaction at the other end it would be able to decrypt it. This API meant I had to spend a good proportion of my time writing boring and error-prone code to convert data sets (SSIS data flow) to objects and back to data sets again. That data-mapping code was still the interesting part though – I hate ‘programming’ by mouse; click, click, type, drag, click… and that is what the SSIS experience is. That was the first time I regretted using SSIS on a project.

    There are plenty of posts about what is wrong with SSIS and some equally passionate responses. My main issues with it are all related to real world usage. I have never been involved in a project where SSIS was the whole solution. It is always just a component of something bigger, an architecture, a project, a development team and a process. I work exclusively in agile teams now and every step of the way SSIS slows things down:

    • Unit testing is not possible (SSISUnit is really a component or integration test)
    • Agile team development requires code branching and merging, which are not possible
    • Distributed source control (such as Git) can’t be used at all since there is no way to lock a file whilst you are working on it
    • Code reviews are difficult – you have to open up every package and click every box to check
    • It is hard to enforce project standards – StyleCop and FxCop do not work on VSA code
    • There is no way to share code – copy/paste coding is prolific
    • Everyone uses a template package to try and ensure some standards – unfortunately you can’t change the template later and have those changes propagate, since every package is a copy of it
    • COM leaks abstractions everywhere from the C# APIs to the type system
    • The type system is too tightly bound to metadata – need to change a column length? Shame – now you have to open all the relevant packages and fix the issues; ANSI <-> Unicode conversions must be explicit
    • There is no way to stub out data sources or destinations i.e. temporarily replace a SQL table with a local file for testing
    • Mouse-based programming

    The net result of all this is yes, it is very quick to get something running but you must forever pay interest on the technical debt you just created. SSIS is not friction free in a team development environment.

    There are two situations where I believe you should use SSIS:

    1. If you are a DBA or database developer who just needs to get something done quickly and doesn’t care about the maintenance issues described above
    2. You need to take advantage of components that are complex to code by hand, such as CDC or fuzzy matching (remember that it is very easy to call packages from code anyway, so there is no need to base the entire solution on SSIS)

    What are the alternatives? The simplest one is to make use of SSIS in an abstract way – code up something that can load your configuration, pass it to packages, execute them and capture the events for logging. We use something like this on my current project and to an extent it has helped a little. We still have to manage lots of packages though.
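    As an illustration, here is a minimal sketch of that kind of wrapper using the Microsoft.SqlServer.Dts.Runtime API (the package path, variable name and console logging are placeholders rather than our actual implementation):

    using System;
    using Microsoft.SqlServer.Dts.Runtime;

    // Captures SSIS events so they can be routed to your own logging.
    class ConsoleEvents : DefaultEvents
    {
        public override bool OnError(DtsObject source, int errorCode, string subComponent,
            string description, string helpFile, int helpContext, string idofInterfaceWithError)
        {
            Console.Error.WriteLine("Error in {0}: {1}", subComponent, description);
            return false; // false = do not cancel execution
        }
    }

    class PackageRunner
    {
        static void Main()
        {
            var app = new Application();

            // Load the package and push configuration in through variables.
            Package package = app.LoadPackage(@"C:\Etl\LoadCustomers.dtsx", null);
            package.Variables["User::BatchDate"].Value = DateTime.Today;

            // Execute, capturing events for logging.
            DTSExecResult result = package.Execute(null, null, new ConsoleEvents(), null, null);
            Console.WriteLine("Result: {0}", result);
        }
    }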

    Next up the ladder is to create abstractions of the tasks the packages perform and generate the packages from them. Biml is a perfect example of this.

    Finally, I mentioned that the APIs back in 2005 were not that mature. Here in 2013, though, we have some fantastic APIs to work with – the Task Parallel Library, asynchronous workflows, LINQ, functional programming concepts and the rich ecosystem of NuGet packages. Coding isn’t as hard as it used to be.
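    To give a flavour of what a hand-coded data flow looks like now, here is a tiny sketch using the TPL Dataflow library (the System.Threading.Tasks.Dataflow NuGet package); the stages and sample data are invented for illustration:

    using System;
    using System.Threading.Tasks.Dataflow;

    class PipelineSketch
    {
        static void Main()
        {
            // Transform stage: parse each incoming line, up to four in parallel.
            var parse = new TransformBlock<string, int>(
                line => int.Parse(line),
                new ExecutionDataflowBlockOptions { MaxDegreeOfParallelism = 4 });

            // Load stage: write each parsed value out.
            var load = new ActionBlock<int>(value => Console.WriteLine(value));

            parse.LinkTo(load, new DataflowLinkOptions { PropagateCompletion = true });

            foreach (var line in new[] { "1", "2", "3" })
                parse.Post(line);

            parse.Complete();
            load.Completion.Wait(); // block until the pipeline drains
        }
    }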

    I started out this summer to produce an end-to-end BI solution in an agile way but quickly found I needed to update some of my skills with respect to the C# tools and techniques available. So whilst I haven’t been blogging I have coded, learned and tried out ideas. Some of these are ready to show – look for Deeply on GitHub or on NuGet. It is early days but try it out and let me know what you think.

    This entry was posted in business-intelligence and tagged #c #data-transformation-services #sql-server-developers #ssis.
  • Whilst doing some design work today for a customer project I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete but this is what I have so far.

    Minimise IO

    This is a general data processing principle. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

    Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.

    Prefer Sequential IO to Random IO

    Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) manifests when procedural-style programming occurs – signs to look out for are SQL statements that modify or return only a few rows but are executed repeatedly.

    Watch out for hidden random IO – for example, if you are reading from one table and writing to another in a sequential manner, disk access will still be random if both tables are stored on the same spindles.

    Avoid data flow components that pool data

    Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed downstream. Some components, such as “Sort”, cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low-memory situations). This increased memory pressure will affect performance.

    Sometimes SQL is the better solution

    Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

    SSIS does not handle hierarchical data well

    Integration Services is a tabular data processing system. Buffers are tabular, and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component, but its output is a collection of tabular data streams that need to be joined to make sense of the data.

    Execute SSIS close to where you wish to write your data

    Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

    Don’t mess with the data flow metadata at runtime

    It’s very difficult to do this anyway, but worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

    The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.
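    For reference, a bare-bones sketch of that programmatic route using the Microsoft.SqlServer.Dts.Runtime API (the names and save path are illustrative):

    using Microsoft.SqlServer.Dts.Runtime;

    class PackageBuilder
    {
        static void Main()
        {
            var app = new Application();
            var package = new Package { Name = "GeneratedPackage" };

            // Add a data flow task by its stock moniker; its pipeline would
            // then be configured through the task host's InnerObject.
            Executable exec = package.Executables.Add("STOCK:PipelineTask");
            var taskHost = (TaskHost)exec;
            taskHost.Name = "Generated Data Flow";

            // Persist through the supported API instead of writing raw XML.
            app.SaveToXml(@"C:\Temp\GeneratedPackage.dtsx", package, null);
        }
    }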

    This entry was posted in data-warehousing and tagged #integration-services #sql-server #sql-server-integration-services #ssis.
  • Every data warehouse needs a date dimension and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross-joining year, month and day temporary tables to produce a set-based solution, but don’t forget to filter out the illegal days.

    I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

    • It executes very quickly
    • The data can be bulk loaded
    • CultureInfo supplies the correct translations of day and month names (see the snippet after this list)
    • It is easy to add custom columns such as fiscal years and quarters
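
    For illustration, these are the kind of CultureInfo calls involved (en-GB is an assumption; use whichever culture your users report in):

    using System;
    using System.Globalization;

    class DateNames
    {
        static void Main()
        {
            var culture = new CultureInfo("en-GB"); // assumed culture
            var date = new DateTime(2014, 5, 26);

            // Correctly translated names, with no lookup tables to maintain.
            Console.WriteLine(culture.DateTimeFormat.GetMonthName(date.Month));   // May
            Console.WriteLine(culture.DateTimeFormat.GetDayName(date.DayOfWeek)); // Monday
        }
    }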

    I haven’t wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven’t got around to generalizing the fiscal date offsets for different companies, so they usually have to be custom coded.

    Script Component Type

    First drop a “Script Component” onto your Data Flow.

    Select “Source” as the Script Component Type and click OK.

    Then double-click the newly added component to edit the properties.

    Note that you need to add the correct output columns before adding the script or else it won’t compile.

    Output Columns

    I’ve renamed the output here to “Dates” to help further down the Data Flow.

    Click the “Add Column” button to add new columns as shown here. Note that I’ve also changed the data type of each column to match my destination table. It requires casts in the script, but that’s easier than conversions in the data pipeline.

    Finally go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio for Applications.

    In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function.
    The general form is:

    public override void CreateNewOutputRows()
    {
        var output = this.DatesBuffer;  // Get the output buffer

        // Example range; adjust the start and end dates to suit
        var date = new DateTime(2010, 1, 1);
        var endDate = new DateTime(2020, 12, 31);

        while (date <= endDate)
        {
            output.AddRow();

            // Set the various column values e.g.
            output.CalendarYear = date.Year;

            // Increment the date
            date = date.AddDays(1);
        }
    }
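
    As a sketch of the custom fiscal columns mentioned earlier, something like the following would sit inside the loop (the 1 April year start and the FiscalYear/FiscalQuarter output columns are assumptions to adjust per company):

    // Assumes a fiscal year starting 1 April and matching output columns.
    output.FiscalYear = date.Month >= 4 ? date.Year + 1 : date.Year;
    output.FiscalQuarter = ((date.Month + 8) % 12) / 3 + 1; // Apr-Jun = Q1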

    The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

    Generated output

    From here you can manipulate the data and pipe it to your dimension table from within the pipeline.

    DateSourceSample.dtsx

    This entry was posted in data-warehousing and tagged #ssis.