* I mean this in a the most British sense of the phrase
I remember how painful the original Data Transformation Services tool was to use (loops anyone?) and when Integration Services was shipped with SQL Server 2005 is was a breath of fresh air for anyone trying to build data warehouses. For SQL Server developers the choice was simple use the free tool that was easy to use and fast to execute or try to code your own. In 2005 the code your own option was hard. You had to write your own threading model, workflow, streaming primitives, event logging etc. since the managed APIs of the time were not that mature.
In contrast Integration Services was easy. Easy to create, even if you don’t know anything able code; easy to run, even if you are not a SQL Server administrator and blisteringly fast, provided you follow some best practice guidelines. This may be one of the reasons I dislike it so much – it sets unrealistic expectations to project managers on how long it should take to develop real software solutions. Bear with me on this one as the point of the post is not about what’s wrong with SSIS but how it limits you in the wider solution.
I do remember an e-mail conversation with Kirk Haselden before I joined Microsoft about how COM, which SSIS is built on, leaks abstractions all over the .NET API. He maintained it was the right thing to do; I wasn’t so sure but it was his product so I didn’t put up much of a fight.
I believe that SSIS is designed as a tool for DBAs or those database developers that only know SQL, MDX and Reporting Services. It is the best is can possibly be without introducing code concepts to these people.
A few months back I read a post by Teo Lachev called When Developers and BI Collide which I agree with large parts of – primarily that you must have BI specialists if you want to develop BI solutions and some that I disagree with – maintaining SSIS is easier than maintaining custom code, coders are not BI pros. I consider myself a coder AND a BI pro and there are a number of similar people working in the team I architect for at the moment. Actually when hiring I have found is it often easier and more productive to find and teach a coder about BI than the reverse.
So anyway I digress, I joined Microsoft in 2006 and did a lot of consulting around integration services. It was a popular tool for customers to have problems with. We used it as part of a solution for a large UK bank’s faster payments implementation. It was a hard project for me – the rest of the team were designing C# code and publishing object based APIs. I had to use a shared encryption routine so when BizTalk unwrapped a transaction at the other end it would be able to decrypt it. This API meant I has to spend a good proportion of my time writing boring and error prone code to convert data sets (SSIS data flow) to objects and back to data sets again. This data mapping code was the interesting part though – I hate ‘programming’ by mouse; click, click, type, drag, click…and this is what the SSIS experience is. That was the first time I regretted using SSIS on a project.
There are plenty of posts about what is wrong with SSIS and some equally passionate responses. My main issues with it are all related to real world usage. I have never been involved in a project where SSIS was the whole solution. It is always just a component of something bigger, an architecture, a project, a development team and a process. I work exclusively in agile teams now and every step of the way SSIS slows things down:
- Unit testing is not possible (SSISUnit is really a component or integration test)
- Agile team development requires code branching and merging which is not possible
- Distributed source control (such as Git) can’t be used at all since there is no way to lock a file whilst you are working on it
- Code reviews are difficult – you have to open up every package and click every box to check
- It is hard to enforce project standards – StyleCop and FxCop do not work on VSA code
- There is no way to share code – copy/paste coding is prolific
- Everyone uses a template package to try and ensure some standards – unfortunately you can’t make changes to that template though since it was copied
- COM leaks abstractions everywhere from the C# APIs to the type system
- The type system to too tightly bound to metadata – need to change a column length, shame now you have to open all the relevant packages and fix the issues; ANSI <-> Unicode conversions must be explicit
- There is no way to stub out data sources or destinations i.e. temporarily replace a SQL table with a local file for testing
- Mouse based programming
The net result of all this is yes, it is very quick to get something running but you must forever pay interest on the technical debt you just created. SSIS is not friction free in a team development environment.
There are two situations where I believe you should use SSIS:
- If you are a DBA or database developer, you just need to get something done quickly and don’t care about the maintenance issues I’ve described above
- You need to take advantage of some of the more complex to code components such as CDC or fuzzy matching (remember that it is very easy to call packages from code anyway so no need to base the entire solution on SSIS)
What are the alternatives? The simplest one is to make use of SSIS in an abstract way – code up something that can load your configuration, pass it to packages, execute them and capture the events for logging. We use something like this on my current project and to an extent it has helped a little. We still have to manage lots of packages though.
Next up the ladder is to create abstractions of the tasks that packages are performing and generate them. Biml is a perfect example of this.
Finally, I mentioned that APIs back in 2005 were not that mature. Here in 2013 though we have some fantastic APIs to work with – Task Parallel Library, asynchronous workflows, Linq, functional programming concepts and the rich ecosystem of Nuget packages. Coding isn’t as hard as it used to be.
I started out this summer to produce an end to end BI solution in an agile way but quickly found out I needed to update some of my skills with respect to the C# tools and techniques available. So whilst I haven’t been blogging I have coded, learned and tried out ideas. Some of these are ready to show and you can try them out look for Deeply on Github or on Nuget. It is early days but try it out and let me know what you think.
What I’m fed up with is the CDC Components inside SSIS. I disagree using them. I’d rather use the T-SQL functions that give you more granular control. For example, the CDC Control Task executes a “SELECT * FROM CDC_STATES” table query when updating a single record in the CDC_STATES table. If you have processes running in parallel, good luck dealing with locks. First we experienced deadlocks, then we moved to an approach where we started using a Sequence Container for Distributed Transactions and calling sp_getapplock to work around the issue. Now we’re getting SPIDS blocked by -2, or “something like orphaned” distributed transactions because the Control Task also calls sp_reset_connection. And you mention BIML, we’re using Mist/Biml to generate all our CDC/ETL processes. The CDC Source component has been a nightmare to work with. It converts (and doesn’t support) all kinds of data types. We’ve burned all kinds of hours working with these components. As far as I’m concerned, I’d stray away from them in all but the most basic of use cases.
LikeLiked by 1 person
Thanks for the write up. SSIS is one of the best tools I have worked with – Script Component is my best friend.. Not only can SSIS connect and integrate data from a myriad of data sources, it is fast and entirely customized with C# Script Components. This is one of the best features IMO – the ability to interact with the OS, SFTP, and any web service in any way you want. In cases when a large volume of data is transferred, it is much faster than T-SQL. I’ve worked with large DW/Integration projects that are entirely SSIS. I agree with some of the challenges you have listed, but I think it is much more capable than just those two situations of use. I agree it is a PITA to manage large projects- hundreds of packages can be come overwhelming but the project deployment model has helped considerably with organization – that was a game changer. Code reviews, standards, agile dev, all issues as well.
IMO to truly use SSIS properly, you need to have a coding background in TSQL and C# or VB. It’s simply a must. Developing in house components is also a huge plus – e.g. write a custom SFTP script component which can be further deployed as a custom component so other team members can use it. Regarding CDC, I have not used that. The warehouse itself has always tracked active state at the row level using start/end dates. That has worked fine with up to a terabyte of data, with moderate processing times for a one time daily load. I imagine CDC is faster, but will bring complexity as Ryan mentioned.