Better software through software architecture and devops

@jamessnape

Category Archives: data warehousing

  • Generally I try hard to avoid adding dependencies to a library project designed for reuse. Since Deeply is a NuGet package, I have no idea how it might end up being used, and for that reason I'm unwilling to add dependencies that might not fit with a user's design. As a user of Deeply, however, I'm finding that I have to add the same patterns repeatedly and would rather just use a pre-existing package.

    How to reconcile these opposing arguments? I've decided to add a new package to NuGet: Deeply.Extras. This assembly is free to take on whatever dependencies make sense. Initially this is going to be Autofac for its CommonServiceLocator implementation and CsvHelper to provide a CsvBulkRepository.
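
    To give a flavour of the kind of wiring this enables, here is a minimal sketch of registering Autofac as the CommonServiceLocator provider through the standard Autofac.Extras.CommonServiceLocator adapter. The CsvBulkRepository<MyRow> registration and the MyRow type are hypothetical placeholders rather than the final Deeply.Extras API.

    using Autofac;
    using Autofac.Extras.CommonServiceLocator;
    using Microsoft.Practices.ServiceLocation;  // CommonServiceLocator

    var builder = new ContainerBuilder();

    // Hypothetical registration: expose a CSV-backed repository through the
    // IBulkRepository<T> abstraction that Deeply data flows consume.
    builder.RegisterType<CsvBulkRepository<MyRow>>()
           .As<IBulkRepository<MyRow>>();

    var container = builder.Build();

    // Make the container available through the common service locator.
    ServiceLocator.SetLocatorProvider(() => new AutofacServiceLocator(container));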

    This entry was posted in data-warehousing and tagged #autofac #csvhelper #deeply #nuget.
  • I’ve just pushed a new version of Deeply to nuget.org. This version provides just enough functionality to write some basic ETL jobs:

    • Parallel and Sequential Tasks
    • Execute SQL Task
    • Execute Process Task
    • Simple Dataflow Task

    The tasks are pretty self-explanatory. The key point is that nearly all the setup is done in the constructors; once the task structure is created, it is executed asynchronously.
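
    As a rough sketch of what composing a job might look like: the task names below come from the list above, but the constructor arguments are illustrative guesses rather than the documented Deeply signatures.

    var connectionFactory = new SqlConnectionFactory("Data Source=(localdb)\\v11.0;");

    // Build the whole job structure up front via constructors...
    var job = new SequentialTask(
        new ExecuteSqlTask(connectionFactory, "TRUNCATE TABLE dbo.Staging"),  // assumed signature
        new ExecuteProcessTask("bcp.exe", "dbo.Staging in staging.dat"));     // assumed signature

    // ...then execute it asynchronously.
    var context = new TaskContext();
    await job.ExecuteAsync(context);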

    Data flows are a little harder to configure. You need a source, a target and a mapping function. A source is anything conforming to IEnumerable<T>, a target is a class implementing IBulkRepository<T> that accepts an IEnumerable<T>, and finally a mapping function maps the source type to the target type.

    The code for using a simple data flow looks a little like the pseudo-csharp below:

    // Source: any IEnumerable<T>; here, rows read from a CSV file.
    var source = new CsvReader("C:\\sourcefile.csv");
    
    // Factory for connections to the target database.
    var connectionFactory = new SqlConnectionFactory("Data Source=(localdb)\\v11.0;");
    
    // Map source property names to destination column names.
    var columnMappings = new Dictionary<string, string>()
    {
        { "Id", "Id" },
        { "Name", "Name" },
        { "Created", "Created" }
    };
    
    // Target: a bulk repository that writes to the destination table.
    var target = new SqlBulkRepository(
        "dbo.FactTable", connectionFactory, columnMappings);
    
    // Wire the source, mapping function and target into a data flow task.
    var dataflow = new SimpleDataflowTask<TSource, TTarget>(
        source, MappingFunctions.Identity, target);
    
    var context = new TaskContext();
    await dataflow.ExecuteAsync(context);
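
    The identity mapping above passes rows straight through. When the source and target shapes differ, the mapping function is where you reshape them. Assuming the mapping is essentially a projection from source item to target item (my assumption, not the documented signature), a custom one might look like this, with SourceRow and FactRow as hypothetical types:

    // Hypothetical source and target row types.
    public class SourceRow { public string Id { get; set; } public string Name { get; set; } }
    public class FactRow { public int Id { get; set; } public string Name { get; set; } public DateTime Created { get; set; } }

    // Reshape and convert each source row into a target row.
    Func<SourceRow, FactRow> mapping = row => new FactRow
    {
        Id = int.Parse(row.Id),
        Name = row.Name.Trim(),
        Created = DateTime.UtcNow
    };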
    

    If anyone would like to help write some examples and documentation I'd be immensely grateful; otherwise, please let me know about your experiences using this package.

    This entry was posted in data-warehousing and tagged #dataflow #deeply #etl #nuget.
  • Whilst doing some design work today for a customer project, I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete, but this is what I have so far.

    Minimise IO

    This is a general data processing principle. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

    Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.
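
    The same idea expressed in C# rather than package design terms: read once, chain the transformations, and write once, instead of materialising intermediate files or tables between each step. This is only an analogy for the principle, not SSIS code; the clean/enrich/format transforms are placeholders.

    using System;
    using System.IO;
    using System.Linq;

    // Placeholder transformations; each represents a processing step.
    Func<string, string> clean = row => row.Trim();
    Func<string, string> enrich = row => row + ",GB";
    Func<string, string> format = row => row.ToUpperInvariant();

    // A single read-process-process-process-write step: one sequential read,
    // chained in-memory transformations, one sequential write.
    File.WriteAllLines(
        "target.csv",
        File.ReadLines("source.csv").Select(clean).Select(enrich).Select(format));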

    Prefer Sequential IO to Random IO

    Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) manifests when procedural-style programming occurs; signs to look out for are SQL statements that modify or return only a few rows but are executed repeatedly.

    Watch out for hidden random IO; for example, if you are reading from one table and writing to another in a sequential manner, disk access will still be random if both tables are stored on the same spindles.
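
    The repeated-small-statement pattern is easy to spot in code. Below is a sketch of the anti-pattern and its set-based replacement using plain ADO.NET; the dbo.FactTable and dbo.Staging tables and the connection string are illustrative only.

    using System.Data.SqlClient;

    using var connection = new SqlConnection("Data Source=(localdb)\\v11.0;");
    connection.Open();

    // Anti-pattern: a tiny UPDATE executed once per row produces random IO.
    // foreach (var row in rows)
    // {
    //     using var update = new SqlCommand(
    //         "UPDATE dbo.FactTable SET Amount = @amount WHERE Id = @id", connection);
    //     update.Parameters.AddWithValue("@id", row.Id);
    //     update.Parameters.AddWithValue("@amount", row.Amount);
    //     update.ExecuteNonQuery();
    // }

    // Set-based equivalent: one statement the engine can satisfy with large
    // sequential reads and writes.
    using var command = new SqlCommand(
        @"UPDATE f SET f.Amount = s.Amount
            FROM dbo.FactTable AS f
            JOIN dbo.Staging AS s ON s.Id = f.Id;", connection);
    command.ExecuteNonQuery();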

    Avoid data flow components that pool data

    Data flow components work on batches of data called buffers. In most instances, buffers are modified in place and passed downstream. Some components, such as “Sort”, cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low-memory situations). This increased memory pressure will affect performance.

    Sometimes SQL is the better solution

    Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

    SSIS does not handle hierarchical data well

    Integration Services is a tabular data processing system. Buffers are tabular, and the components and associated APIs are tabular. Consequently, it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component, but its output is a collection of tabular data streams that need to be joined to make sense of the data.

    Execute SSIS close to where you wish to write your data

    Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

    Don’t mess with the data flow metadata at runtime

    It's very difficult to do this anyway, but it is worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

    The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.
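
    For reference, a minimal sketch of building a package through the runtime API rather than hand-writing XML; it assumes references to the ManagedDTS and PipelineWrap assemblies, and the component class ID shown varies between SSIS versions.

    using Microsoft.SqlServer.Dts.Runtime;
    using Microsoft.SqlServer.Dts.Pipeline.Wrapper;

    var package = new Package();

    // Add a data flow task using its stock moniker.
    var taskHost = (TaskHost)package.Executables.Add("STOCK:PipelineTask");
    taskHost.Name = "Generated Data Flow";

    // The pipeline itself is configured through the wrapper interfaces.
    var pipeline = (MainPipe)taskHost.InnerObject;
    var source = pipeline.ComponentMetaDataCollection.New();
    source.ComponentClassID = "DTSAdapter.OleDbSource";  // version-specific class ID

    // Persist the generated package.
    new Application().SaveToXml(@"C:\temp\generated.dtsx", package, null);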

    This entry was posted in data-warehousing and tagged #integration-services #sql-server #sql-server-integration-services #ssis.
  • Every data warehouse needs a date dimension and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross joining year, month and day temporary tables to produce a set-based solution, but don't forget to filter out the illegal days.

    I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

    • It executes very quickly
    • The data can be bulk loaded
    • CultureInfo supplies the correct translations of day and month names
    • It is easy to add custom columns such as fiscal years and quarters

    I haven't wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven't got around to generalizing the fiscal date offsets for different companies, so they usually have to be custom coded.

    First, drop a “Script Component” onto your Data Flow.

    Select “Source” as the Script Component Type and click OK.

    Then double-click the newly added component to edit the properties.

    Note that you need to add the correct output columns before adding the script or else it won’t compile.

    Output Columns

    I’ve renamed the output here to “Dates” to help further down the Data Flow.

    Click the “Add Column” button to add new columns as shown here. Note that I've also changed the data type of each column to match my source table. It requires casts in the script, but that's easier than conversions in the data pipeline.

    Finally go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio for Applications.

    In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function.
    The general form is:

    var output = this.DatesBuffer;  // Get the output buffer
    
    // Loop through your dates, e.g. from a fixed start date to a fixed end date
    var date = new DateTime(2000, 1, 1);
    var endDate = new DateTime(2030, 12, 31);
    
    while (date <= endDate)
    {
    
        output.AddRow();
    
        // Set the various column values e.g.
        output.CalendarYear = date.Year;
    
        // Increment the date
        date = date.AddDays(1);
    }
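
    As an aside on the CultureInfo point above, the translated day and month names come straight from the framework, so additional columns (DayName and MonthName here are hypothetical column names) can be filled with standard format strings inside the same loop:

    using System.Globalization;

    var culture = new CultureInfo("en-GB");  // or whichever culture the warehouse reports in

    // "dddd" and "MMMM" return the full, correctly translated names.
    output.DayName = date.ToString("dddd", culture);    // e.g. "Monday", or "lundi" for fr-FR
    output.MonthName = date.ToString("MMMM", culture);  // e.g. "January", or "janvier" for fr-FR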
    

    The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

    From here you can manipulate the data and pipe it to your dimension table from within the pipeline.

    DateSourceSample.dtsx

    This entry was posted in data-warehousing and tagged #ssis.
  • This is the second part in a series of data warehouse presentations I've been giving. This one concentrates on the physical design of the underlying SQL Server database, some information on the SQL Server Fast Track Data Warehouse, and finally a one-slide guide to tuning storage engine performance. The tuning slide is only really a first step and I plan a more in-depth session some time in the future.

    Also, I apologize that this deck is a little word-heavy. I prefer more graphics in a presentation, but there is a lot of specific guidance to list here.

    This entry was posted in data-warehousing and tagged #physical-design #presentation.