Category Archives: Data Warehousing

Taking on dependencies with Deeply

Generally I try hard to avoid adding dependencies to a library project designed for reuse. Since Deeply is a NuGet package I have no idea how it might end up being used, and for that reason I’m unwilling to add dependencies that might not fit with a user’s design. As a user of Deeply, however, I find myself implementing the same patterns repeatedly and would rather just use a pre-existing package.

How to reconcile these opposing arguments? I’ve decided to add a new package to NuGet: Deeply.Extras. This assembly is free to take on whatever dependencies make sense. Initially this means Autofac for its CommonServiceLocator implementation and CsvHelper to provide a CsvBulkRepository.
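
To give a flavour of what Deeply.Extras might contain, here is a rough sketch of a CsvHelper-backed repository. The IBulkRepository<T> shape shown is only an assumption for illustration and may not match Deeply’s real interface, and the CsvWriter constructor varies between CsvHelper versions.

using System.Collections.Generic;
using System.Globalization;
using System.IO;
using CsvHelper;

// Assumed shape of the bulk repository contract; the real Deeply
// interface may well differ.
public interface IBulkRepository<T>
{
    void BulkCopy(IEnumerable<T> rows);
}

// Illustrative only: a CSV-backed target built on CsvHelper.
public class CsvBulkRepository<T> : IBulkRepository<T>
{
    private readonly string path;

    public CsvBulkRepository(string path)
    {
        this.path = path;
    }

    public void BulkCopy(IEnumerable<T> rows)
    {
        using (var writer = new StreamWriter(this.path))
        using (var csv = new CsvWriter(writer, CultureInfo.InvariantCulture))
        {
            // CsvHelper writes a header row followed by one record per item.
            csv.WriteRecords(rows);
        }
    }
}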

Deeply 0.2.0 Alpha

I’ve just pushed a new version of Deeply to nuget.org. This version provides just enough functionality to write some basic ETL jobs:

  • Parallel and Sequential Tasks
  • Execute SQL Task
  • Execute Process Task
  • Simple Dataflow Task

The tasks are pretty self-explanatory. The key point is that nearly all the setup is done in the constructors; once the task structure has been created, it is executed asynchronously.

Data flows are a little harder to configure. You need a source, a target and a mapping function. A source is anything that implements IEnumerable<T>; a target is a class that accepts an IEnumerable<T>, implemented as an IBulkRepository<T>; and the mapping function converts each source item into a target item.

The code for using a simple data flow looks a little like the pseudo-C# below:

var source = new CsvReader(@"C:\sourcefile.csv");

var connectionFactory = new SqlConnectionFactory(@"Data Source=(localdb)\v11.0;");

var columnMappings = new Dictionary<string, string>()
    {
        { "Id", "Id" },
        { "Name", "Name" },
        { "Created", "Created" }
    };

var target = new SqlBulkRepository("dbo.FactTable", connectionFactory, columnMappings);

var dataflow = new SimpleDataflowTask<SourceType, TargetType>(
    source, MappingFunctions.Identity, target);

var context = new TaskContext();
await dataflow.ExecuteAsync(context);

If anyone would like to help write some examples and documentation I’d be immensely grateful; otherwise, please let me know about your experiences using this package.

Integration Services Design Principles

Whilst doing some design work today for a customer project I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete but this is what I have so far.

Minimise IO

This is a general data processing principle. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.

Prefer Sequential IO to Random IO

Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) shows up when procedural-style programming occurs; signs to look out for are SQL statements that modify or return only a few rows but are executed repeatedly.

Watch out for hidden random IO. For example, if you are reading from one table and writing to another in a sequential manner, disk access will still be random if both tables are stored on the same spindles.

Avoid data flow components that pool data

Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed downstream. Some components, such as “Sort”, cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low-memory situations). This increased memory pressure will hurt performance.

Sometimes SQL is the better solution

Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

SSIS does not handle hierarchical data well

Integration Services is a tabular data processing system. Buffers are tabular and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component, but its output is a collection of tabular data streams that need to be joined back together to make sense of the data.

Execute SSIS close to where you wish to write your data

Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

Don’t mess with the data flow metadata at runtime

It’s very difficult to do this anyway, but it is worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime safe in the knowledge that buffers have a fixed format and component dependencies will not change.

The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.

Using Integration Services to populate a Date Dimension

Every data warehouse needs a date dimension and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross joining year, month and day temporary tables to produce a set-based solution, but don’t forget to filter out the illegal days.

I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

  • It executes very quickly
  • The data can be bulk loaded
  • CultureInfo supplies the correct translations of day and month names
  • It is easy to add custom columns such as fiscal years and quarters

I haven’t wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven’t got around to generalizing the fiscal date offsets for different companies so they usually have to be custom coded.

First, drop a “Script Component” onto your Data Flow.

Select “Source” as the Script Component Type and click OK.

Then double-click the newly added component to edit the properties.

Note that you need to add the correct output columns before adding the script or else it won’t compile.

Output Columns

I’ve renamed the output here to “Dates” to help further down the Data Flow.

Click the “Add Column” button to add new columns as shown here. Note that I’ve also changed the data type of each column to match my source table. This requires casts in the script, but it’s easier than doing conversions in the data pipeline.

Finally go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio for Applications.

In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function.
The general form is:

var output = this.DatesBuffer;  // Get the output buffer

var date = new DateTime(2000, 1, 1);        // First date in the dimension
var endDate = new DateTime(2020, 12, 31);   // Last date in the dimension

while (date <= endDate)
{
    output.AddRow();

    // Set the various column values e.g.
    output.CalendarYear = date.Year;

    // Increment the date
    date = date.AddDays(1);
}
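
As an aside, the day and month name translations mentioned earlier come straight from CultureInfo, so there is no need for lookup tables in the script. A minimal standalone illustration (the culture and the date are just examples):

using System;
using System.Globalization;

class DateNameExample
{
    static void Main()
    {
        var date = new DateTime(2014, 1, 6);
        var formats = CultureInfo.GetCultureInfo("en-GB").DateTimeFormat;

        // Prints "Monday January" for en-GB; other cultures return their
        // own translations.
        Console.WriteLine("{0} {1}",
            formats.GetDayName(date.DayOfWeek),
            formats.GetMonthName(date.Month));
    }
}

In the script component the same two calls would be used inside the loop to populate day and month name columns.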

The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

From here you can manipulate the data and pipe it to your dimension table from within the pipeline.

DateSourceSample.zip (27.08 KB)

Physical Data Warehouse Design

This is the second part in a series of data warehouse presentations I’ve been giving. This one concentrates on the physical design of the underlying SQL Server database, some information on the SQL Server Fast Track Data Warehouse and finally a one-slide guide to tuning storage engine performance. The tuning slide is only really a first step and I plan a more in-depth session some time in the future.

Also, I apologize that this deck is a little word-heavy. I prefer more graphics in a presentation but there is a lot of specific guidance to list here.

When should you do an incremental extract?

There are no hard and fast rules but the goal is to cut the time taken to extract data from a source system and cut the amount of work you have to do with the extracted data. The numbers quoted here are the ones I use as a starting point but you need to measure to find the best values.

Don’t do an incremental extract if:

  • There isn’t much data in the source table (less than 100k rows)
  • There is enough change in the source table to require that you read most of it each time (such as more than half the rows changing between extracts)
  • The data in the source table is used for periodic snapshots (such as a balance sheet) and you need to track how a table changes at particular points in time

Do an incremental extract if:

  • There is a lot of data in the source table
  • Rows are only ever added to the source table (i.e. rows are not updated)
  • You need to track each and every change to a source row
  • The source data is updated several times before being closed and once closed is never updated again (also known as an accumulating snapshot)

In general, dimension tables match the first set of rules and are not extracted incrementally, whereas fact tables normally match the second set of rules.

Subcube Queries

I’ve been doing some Analysis Services 2005 performance work for a customer. Eventually I want to talk about some sort of performance process for AS2005, but right now Chris Webb has a pretty good reference on designing effective aggregations. This is just a quick description of what a string like ‘11100101010,1010001000000’ means in SQL Server Profiler.

Since SQL Profiler now supports Analysis Services you can monitor queries as they execute to find the poorly performing ones. Chances are you will see a Progress Report event similar to “Started reading data from the ‘XXX’ partition” closely followed by a Query Subcube event with TextData that looks like a string of ones and zeros such as:
‘00001000000000000,10111100000011111100,10’
The progress report tells you that data was read directly from the partition and no aggregation was used.

If you turn on the Query Subcube Verbose event then a more complete description is printed but quite often you will receive traces with just the subcube strings as the verbose event is hidden from the default set of options.

The subcube string tells you which dimensions and attributes are involved. Each group of digits, separated by a comma, denotes a dimension and each digit denotes an attribute within that dimension. The digit is ‘1’ if that attribute is part of the subcube and ‘0’ otherwise. Some things to note:

  • These are cube dimensions so the same dimension, e.g. Date, can be represented many times as a role-playing dimension, e.g. Ship Date
  • The order is defined by the actual order of dimensions and attributes in the measure group, not alphabetical or any other sort order.

Since the order of dimensions and attributes is not immediately obvious, it’s better to write some code to print them out in the correct order. The following prints all the dimensions and attributes of the supplied measure group object in order:

private static void PrintMeasureGroupDimensions(MeasureGroup mg)
{
    for (int j = 0; j < mg.Dimensions.Count; ++j)
    {
        CubeDimension dim = mg.Dimensions[j].CubeDimension;
        Console.WriteLine("DIM:\t{0} ({1})", dim.Name, dim.Attributes.Count);

        for (int k = 0; k < dim.Attributes.Count; ++k)
        {
            CubeAttribute attr = dim.Attributes[k];
            Console.WriteLine("ATT:\t\t{0}", attr.Attribute.Name);
        }
    }
}
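
And, as a rough sketch of the idea behind matching a subcube string against that ordering (this is an illustration rather than the attached sample, and error handling is omitted):

// Illustration only: print just the attributes flagged by a subcube
// string such as "00001000000000000,10111100000011111100,10".
private static void PrintSubcube(MeasureGroup mg, string subcube)
{
    string[] groups = subcube.Split(',');

    for (int j = 0; j < groups.Length; ++j)
    {
        CubeDimension dim = mg.Dimensions[j].CubeDimension;

        for (int k = 0; k < groups[j].Length; ++k)
        {
            // A '1' means this attribute takes part in the subcube query.
            if (groups[j][k] == '1')
            {
                Console.WriteLine("{0}.{1}",
                    dim.Name, dim.Attributes[k].Attribute.Name);
            }
        }
    }
}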

I’ve attached a piece of sample code that compiles into a console application to either print the entire set of dimensions for a measure group or, if you pass a subcube string, will just print those involved in the query. Execute with no args or ‘/h’ to get some help.

SubcubeQueries.cs

 

The futility of checking database constraints before modifying data

Just a quick reminder that there is no point manually checking database constraints before modifying your data. Take this code:

create table Users (
    UserID uniqueidentifier
        primary key not null default newid(),
    LoginName nvarchar(50) not null unique
)

if not exists (
    select * from Users where LoginName = 'JohnSmith'
)
begin
    insert into Users (LoginName)
    values ('JohnSmith')
end

There is a race condition between the if not exists check and the insertion of data. Another connection could insert the same data before this insert gets a chance to complete. We could wrap the check and insert in a transaction to ensure consistency as follows (note the serializable isolation level, which is needed to place a range lock covering the row that might not exist yet):

begin tran
    set transaction isolation level serializable
    if not exists (
        select * from Users where LoginName = 'JohnSmith'
    )
    begin
        insert into Users (LoginName)
        values ('JohnSmith')
    end
commit

This works but we’ve got the added cost of a transaction. There is no point in doing this though as there is already a unique constraint on the column and databases are very good at checking constraints. With this in mind, the new and improved version looks like:

insert into Users (LoginName)
values (‘JohnSmith’)

Much simpler. There is only one minor problem, for which I don’t have a good answer: if you have multiple constraints on a table, it’s very difficult to tell which one failed. @@ERROR will be set to the relevant constraint violation number (2627 or 2601 for a unique key or index, 547 for foreign key and check constraints) and a suitable error message is raised, which the caller could parse, but not easily.
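
If the caller is .NET code, one option is to catch the SqlException and inspect its Number and Message properties. A rough sketch (the connection string is a placeholder and the handling is illustrative only):

using System;
using System.Data.SqlClient;

class InsertUserExample
{
    static void Main()
    {
        // Placeholder connection string.
        var connectionString = @"Data Source=(localdb)\v11.0;Integrated Security=true";

        using (var conn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(
            "insert into Users (LoginName) values (@LoginName)", conn))
        {
            cmd.Parameters.AddWithValue("@LoginName", "JohnSmith");
            conn.Open();

            try
            {
                cmd.ExecuteNonQuery();
            }
            catch (SqlException ex)
            {
                if (ex.Number == 2627 || ex.Number == 2601)
                {
                    // Unique key or index violation; the constraint name is
                    // only available by parsing ex.Message.
                    Console.WriteLine("Login name already taken: {0}", ex.Message);
                }
                else if (ex.Number == 547)
                {
                    // Foreign key or check constraint violation.
                    Console.WriteLine("Constraint violation: {0}", ex.Message);
                }
                else
                {
                    throw;
                }
            }
        }
    }
}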