Better software through software architecture and devops

@jamessnape

Tag Archives: #sql-server-integration-services

  • Latest links for easy consumption over the May long weekends – I missed out on March so have dropped some of the less interesting ones to keep the list short.

    Organisational Behaviour

    Programmers, Teach Non-Geeks The True Cost of Interruptions – a simple way to show to your boss how drive-by-management kills programmer productivity. Also work reading Maker’s Schedule, Manager’s Schedule which highlights the differences. If this is still a problem then this notice might be your only solution

    The Death Of Expertise – The Dunning-Kruger effect is often strong in semi-technical managers especially in industries where confidence plays a large part in success such as finance. This article discusses some of the problems related to treating all opinions as equal and ignoring experts.

    Save Your Software from the Start: Overcoming Skewed Thinking in the Project Planning Stage – Very simply, why we always underestimate the true complexity and cost of a project plus some tools to help overcome these psychological effects.

    Why Good Managers Are So Rare - Gallup finds that companies fail to choose the candidate with the right talent for the job 82% of the time. Managers account for at least 70% of variance in employee engagement scores across business units.

    I Give Up: Extroverted Barbarians at the Gates – Anyone remember the “perpendicular transparent red lines” video doing the rounds? This is an on-the-nail deconstruction of what is happening and why it happens. If you are an introvert then this other post might feel very familiar to you.

    Agile

    Coconut Headphones: Why Agile Has Failed – A rant about how modern agile methodologies seem to only consist of management practises. Take note of the end points to being successful.

    The death of agile? – Additional comment on the above. 

    Writing User Stories for Back-end Systems – The real functionality a user sees in a business intelligence project is quite small and can easily be described in a few words. This makes breaking up user stories into sprint sized chunks hard. This article gives some great advice that can be translated to BI projects. 

    Design Your Agile Project, Part 1 – So how do you pick the right kind of agile project? When should you use Kanban and when should you use Scrum? How is the business side of equation handled? Also Part 2, Part 3, and Part 4.

    Large Agile Framework Appropriate for Big, Lumbering Enterprises – A perfect solution to doing agile in finance organisations (wink). Love the concept of ‘Pair Managing’.

    Metrics that matter with evidence-based management – Its long but Martin does a great job looking at lots of the metrics in use today, why their use is limited and a far better approach to designing metrics that really help.

    Databases

    Is ETL Development doomed? – “Long term, the demand for ETL skills will decline”. The demand will mutate into one for more abstract ETL capabilities.

    Testing

    Intro to Unit Testing 9: Tips and Tricks – A handy list of tips that can make maintaining unit test code a little easier.

    FsCheck + XUnit = The Bomb – Even if you write code in C# it may be wise to think about writing unit tests in F# since the code is more concise, easier to read and with FxCheck can find things you might not.

    Data Visualization

    5 Tips to Good Vizzin’ – Should be required reading for anyone who is thinking about creating dashboards in Tableau.

    A Natural Approach to Analytics – This explains why using tools such as Tableau for largely static dashboards is a waste of time. Users need to interact with the data in a way they cannot do when relegated to dashboard consumers.

    Big Data/Hadoop

    Modern Financial Services Architectures Built with Hadoop – Hortonworks looks at big data in financial services.

    Beyond hadoop: fast queries from big data – I think Hadoop might be catching up here but it is still a bit of an elephant compared to SQL Server/Oracle etc when it comes to raw query performance.

    Don’t understand Big Data? Blame your genes! – 5 common errors for dealing with big data.

    The Parable of Google Flu: Traps in Big Data Analysis – Big data answers are not always correct. This paper looks at some of the pitfalls.

    No, Hadoop Isn’t Going To Replace Your Data Warehouse – More thoughts on modern data architectures and hybrid transactional/analytical processing.

    This entry was posted in reference  and tagged #apache-hadoop #business-intelligence #etl #sql-server-integration-services #ssis #tableau  on .
    Discuss this on Twitter or LinkedIn
  • Context is everything with architecture. I’ve often had conversations which started with the phrase “how come you didn’t…” – once the context is explained the decision is usually obvious.

    The context for this architecture is purely my own since there are no real customers. I  want to satisfy the concerns and requirements as simply as possible but leave room to swap out parts of the architecture to investigate new approaches and technologies.

    The diagram below is a pretty standard set of data warehouse components. If you are a traditional Microsoft guy then, from left to right, the components would be Integration Services, SQL Server, Integration Services again, SQL Server, Analysis Services and Excel or Reporting Services respectively. Alternatively you might be using Hadoop as the source mirror and Tableau for the data mart and consume components or some other combination.

    architecture

    I always try to set firewalls within an architecture so that any problems can be isolated and replaced without too much disruption. In this instance I’m going to use those firewalls so that I can try out new ideas and technologies.

    The main synchronisation points are the three data stores – Source Mirror, Data Warehouse and Data Mart. (In this instance I am using the term data mart to mean a prepared, subject area specific store optimised for analytical/aggregate queries.)

    The responsibilities for each stage are as follows:

    • Acquire > Source Mirror: receive data from source, ensure minimal load on source via high watermarks or another strategy, archive data for historical accuracy.  A key point here is that the source mirror has the same metadata as the source itself. No joins or transforms on the way. Sometimes simple data conversions are useful but less is more.
    • Load > Data Warehouse: apply business logic and transform the source data into dimensional model, update data warehouse.
    • Summarise > Data Mart: aggregate generation or cube processing.
    • Consume & Act: covers any output tool such as Reporting Services, Excel, Tableau Dashboard, R, F# etc.

    I consider the SQL/relational store to be “the data warehouse” and not Analysis Services which is better suited to a data mart role.

    It’s quite hard to be succinct when talking about architecture and this post is quite lengthy so I’ll split it and talk about risk driven architecture in the next post.

  • Whilst doing some design work today for a customer project I realised there are a set of principals I try and adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete but this is what I have so far.

    Minimise IO

    This is a general data processing principal. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

    Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.

    Prefer Sequential IO to Random IO

    Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) manifests when procedural style programming occurs - signs to look out for are SQL statements modifying/returning only few rows but being executed repeatedly.

    Watch out for hidden random IO - for example, if you are reading from one table and writing to another in a sequential manor then disk access will still be random if both tables are stored on the same spindles.

    Avoid data flow components that pool data

    Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed down stream. Some components, such as “Sort” cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low memory situations). This increased memory pressure will affect performance.

    Sometimes SQL is the better solution

    Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

    SSIS does not handle hierarchical data well

    Integration Services is a tabular data processing system. Buffers are tabular and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component but it’s output is a collection of tabular data streams that need to joined to make sense.

    Execute SSIS close to where you wish to write your data

    Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise on a network protocol. In particular when writing data to a local SQL Server instance, SSIS automatically used the Shared Memory transport for direct inter-process transfer.

    Don’t mess with the data flow metadata at runtime

    It’s very difficult to do this anyway but worth mentioning that SSIS gets it’s stellar performance from being able to setup a data flow at runtime safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

    The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS API’s and not attempt to write the package XML directly.

    This entry was posted in data-warehousing  and tagged #integration-services #sql-server #sql-server-integration-services #ssis  on .
    Discuss this on Twitter or LinkedIn