Author Archives: jsnape

About jsnape

Solution architect at Hitachi Consulting. I live in Bournemouth and have a wife, two lovely daughters and a chocolate lab called Thula.

RJR Consulting Web Site Design

Over the past few weeks I’ve been working on a web site for my cousin. This was a ground-up redesign of his company web site to replace the old, SEO-unfriendly Flash site.

Normally, as the developer, I only get someone else’s completed design and just do the implementation, but I wanted to see whether any of my recent photography and image-editing skills were transferable.

The site was first designed in Illustrator/Photoshop with a fair number of round trips to agree the final design. Implementation was done in Expression Web using HTML, CSS and some jQuery for interactivity. There are no tables anywhere on the site; it is a pure CSS layout. I also used Expression Web “Dynamic Web Templates” for the master page layout, as the server where it is now hosted does not support .NET.

Some things I’ve learned from doing this:

  • jQuery makes life easy and Glimmer makes it even easier
  • IE8 standards support is fantastic but way too many people still use IE6/7 to take advantage of it
  • You need to have an idea of what is possible in HTML before letting loose in Photoshop
  • background-image is your friend

As it is all pretty static at the moment, there are plans to move to hosting that supports .NET so I can add some more features, in particular a blog and a news feed. I would like to try this in ASP.NET MVC to get some experience there.

So go to RJR Consulting and have a look around (especially if you need telemarketing, telesales or similar services).

Using Integration Services to populate a Date Dimension

Every data warehouse needs a date dimension and at some point it needs to be populated. Most people use some sort of SQL script that loops through the dates and adds rows to the destination table, but this is pretty slow to execute. You might even try cross-joining year, month and day temporary tables to produce a set-based solution, but don’t forget to filter out the illegal days.

I prefer to fill my date tables by generating the correct stream of values from a SQL Server Integration Services script source component. This has a number of benefits:

  • It executes very quickly
  • The data can be bulk loaded
  • CultureInfo supplies the correct translations of day and month names
  • It is easy to add custom columns such as fiscal years and quarters

I haven’t wrapped this in a pre-compiled component as it is so easy to do in script form. Also, I haven’t got around to generalizing the fiscal date offsets for different companies, so they usually have to be custom coded.

First, drop a “Script Component” onto your Data Flow.

Select “Source” as the Script Component Type and click OK.

Then double-click the newly added component to edit the properties.

Note that you need to add the correct output columns before adding the script or else it won’t compile.

Output Columns

I’ve renamed the output here to “Dates” to help further down the Data Flow.

Click the “Add Column” button to add new columns as shown here. Note that I’ve also changed the data type of each column to match my destination table. This requires casts in the script, but it’s easier than doing conversions in the data pipeline.

Finally go back to the script part of the dialog and click the “Edit Script” button to launch Visual Studio for Applications.

In the resulting window, add your code to generate the date stream to the CreateNewOutputRows() function.
The general form is:

var output = this.DatesBuffer;  // Get the output buffer

while (/* loop through your dates */)
{

    output.AddRow();

    // Set the various column values e.g.
    output.CalendarYear = date.Year;

    // Increment the date
    date = date.AddDays(1);
}
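
For completeness, here is roughly what a full CreateNewOutputRows might look like. This is only a sketch: the date range, culture and most of the column names (DateKey, FullDate, CalendarQuarter, MonthName, DayName and so on) are assumptions, so rename them to match the output columns you defined above.

public override void CreateNewOutputRows()
{
    // Assumed calendar range and culture; change to suit your warehouse.
    var culture = new System.Globalization.CultureInfo("en-GB");
    var date = new DateTime(2000, 1, 1);
    var lastDate = new DateTime(2020, 12, 31);

    var output = this.DatesBuffer;  // Get the output buffer

    while (date <= lastDate)
    {
        output.AddRow();

        // Surrogate key in yyyyMMdd form, e.g. 20091231 (assumed column name).
        output.DateKey = date.Year * 10000 + date.Month * 100 + date.Day;
        output.FullDate = date;
        output.CalendarYear = date.Year;
        output.CalendarQuarter = (date.Month - 1) / 3 + 1;

        // CultureInfo supplies the translated day and month names.
        output.MonthName = culture.DateTimeFormat.GetMonthName(date.Month);
        output.DayName = culture.DateTimeFormat.GetDayName(date.DayOfWeek);

        // Increment the date.
        date = date.AddDays(1);
    }
}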

The full script is in the attached sample package where I’ve also added a script destination that does nothing with the data. Attach a data viewer to see what output is generated.

From here you can manipulate the data and pipe it to your dimension table from within the pipeline.

DateSourceSample.zip (27.08 KB)

Physical Data Warehouse Design

This is the second part in a series of data warehouse presentations I’ve been giving. This one concentrates on the physical design of the underlying SQL Server database, some information on the SQL Server Fast Track Data Warehouse and, finally, a one-slide guide to tuning storage engine performance. The tuning slide is only really a first step and I plan a more in-depth session some time in the future.

I should also apologize that this deck is a little word-heavy. I prefer more graphics in a presentation but there is a lot of specific guidance to list here.

When should you do an incremental extract?

There are no hard and fast rules but the goal is to cut the time taken to extract data from a source system and cut the amount of work you have to do with the extracted data. The numbers quoted here are the ones I use as a starting point but you need to measure to find the best values.

Don’t do an incremental extract if:

  • There isn’t much data in the source table (less than 100k rows)
  • There is enough change in the source table to require that you read most of it each time (such as more than half the rows changing between extracts)
  • The data in the source table is used for periodic snapshots (such as a balance sheet) and you need to track how a table changes at particular points in time

Do an incremental extract if:

  • There is a lot of data in the source table
  • Rows are only ever added to the source table (i.e. rows are not updated)
  • You need to track each and every change to a source row
  • The source data is updated several times before being closed and once closed is never updated again (also known as an accumulating snapshot)

In general, dimension tables match the first set of rules and are not extracted incrementally, whereas fact tables normally match the second set of rules; a sketch of what that looks like follows.
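
To make the second case concrete, an incremental extract usually means filtering the source query on a watermark such as a modified date or an ascending key. The following is a minimal, hypothetical C# sketch (in practice the query would normally sit in your Integration Services source component); the table, column and connection string names are placeholders only.

using System;
using System.Data;
using System.Data.SqlClient;

class IncrementalExtract
{
    // Reads only the rows changed since the last extract.
    // Table, column and connection string are illustrative placeholders.
    public static DataTable ExtractChangedOrders(string connectionString, DateTime lastWatermark)
    {
        const string query =
            "select OrderID, CustomerID, OrderDate, Amount, ModifiedDate " +
            "from dbo.Orders " +
            "where ModifiedDate > @Watermark";

        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(query, connection))
        {
            command.Parameters.Add("@Watermark", SqlDbType.DateTime).Value = lastWatermark;

            var result = new DataTable("Orders");
            new SqlDataAdapter(command).Fill(result);

            // Store max(ModifiedDate) from the result as the next watermark.
            return result;
        }
    }
}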

Programmer Personality Type

My programmer personality type is: DHTC

You’re a Doer.
You are very quick at getting tasks done. You believe the outcome is the most important part of a task and the faster you can reach that outcome the better. After all, time is money.

You like coding at a High level.
The world is made up of objects and components, you should create your programs in the same way.

You work best in a Team.
A good group is better than the sum of its parts. The only thing better than a genius programmer is a cohesive group of genius programmers.

You are a Conservative programmer.
The less code you write, the less chance of it containing a bug. You write short and to the point code that gets the job done efficiently.

 

Subcube Queries

I’ve been doing some Analysis Services 2005 performance work for a customer. Eventually I want to talk about a performance process for AS2005, but right now Chris Webb has a pretty good reference on designing effective aggregations. This is just a quick description of what strings like 11100101010,1010001000000 mean in SQL Server Profiler.

Since SQL Server Profiler now supports Analysis Services, you can monitor queries as they execute to find the poorly performing ones. Chances are you will see a Progress Report event similar to “Started reading data from the ‘XXX’ partition”, closely followed by a Query Subcube event whose TextData looks like a string of ones and zeros such as:
‘00001000000000000,10111100000011111100,10’
The progress report tells you that data was read directly from the partition and no aggregation was used.

If you turn on the Query Subcube Verbose event then a more complete description is printed but quite often you will receive traces with just the subcube strings as the verbose event is hidden from the default set of options.

The subcube string tells you which dimensions and attributes are involved. Each group of digits, separated by a comma, denotes a dimension and each digit denotes an attribute within that dimension. The digit is ‘1’ if that attribute is part of the subcube and ‘0’ otherwise. Some things to note:

  • These are cube dimensions so the same dimension, e.g. Date, can be represented many times as a role-playing dimension, e.g. Ship Date
  • The order is defined by the actual order of dimensions and attributes in the measure group, not alphabetical or any other sort order.

Since the order of dimensions and attributes is not immediately obvious, it’s better to write some code to print them out in the correct order. The following prints all the dimensions and attributes of the supplied measure group object in order:

private static void PrintMeasureGroupDimensions(MeasureGroup mg) {
    // Requires a reference to Microsoft.AnalysisServices (AMO).
    for (int j = 0; j < mg.Dimensions.Count; ++j) {
        CubeDimension dim = mg.Dimensions[j].CubeDimension;
        Console.WriteLine("DIM:\t{0} ({1})", dim.Name, dim.Attributes.Count);

        for (int k = 0; k < dim.Attributes.Count; ++k) {
            CubeAttribute attr = dim.Attributes[k];
            Console.WriteLine("ATT:\t\t{0}", attr.Attribute.Name);
        }
    }
}
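
To map a subcube string back to names, the same ordering can be used as a filter. A minimal sketch is below; the method name and filtering logic are mine, but it reuses the same AMO objects and ordering as PrintMeasureGroupDimensions above.

// Prints only the dimensions and attributes flagged with '1' in a subcube
// string such as "00001000000000000,10111100000011111100,10".
private static void PrintSubcubeAttributes(MeasureGroup mg, string subcube) {
    string[] groups = subcube.Split(',');

    for (int j = 0; j < mg.Dimensions.Count && j < groups.Length; ++j) {
        CubeDimension dim = mg.Dimensions[j].CubeDimension;
        string digits = groups[j];

        for (int k = 0; k < dim.Attributes.Count && k < digits.Length; ++k) {
            if (digits[k] == '1') {
                Console.WriteLine("{0}.{1}", dim.Name, dim.Attributes[k].Attribute.Name);
            }
        }
    }
}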

I’ve attached a piece of sample code that compiles into a console application and either prints the entire set of dimensions for a measure group or, if you pass a subcube string, prints just those involved in the query. Execute with no args or ‘/h’ to get some help.

SubcubeQueries.cs

 

Third time perfect

For some reason, when I’m developing something new I always end up changing my design twice. Note that I’m not talking about architecture here; architecture is a higher-level concept and, as such, isn’t killed by implementation detail the way a design is. I think there is a reason for this multi-version development though.

The first design is basically a prototype exploring possibilities. At this point I’m not sure what the best solution is or even what techniques might be best to use. It only just does the job and is probably not very elegant or efficient. Not something to be proud of.

The second is usually a great design but, more often than not, flawed with a superhero-style weakness. It will be elegant, efficient and cover every requirement imagined. The only problem – it will never be finished. This second iteration is usually complex, over-designed and would take an eternity to implement.

After a good dose of reality, the third version takes shape. It’s much more practical, easier to maintain and solves only the requirements that are actually needed. These three designs are usually quite different. Not what the XP crowd would call iterative implementations but the end result is similar.

I think it’s important not to get too attached to any one way of doing things. I see some who start coding and stick with their first design to the bitter end no matter what. People often complain that a v1 product is OK for early adopters but not for general consumption. Iterative design like this gives you that sought-after v3 product in your first release. Just don’t take three times as long to implement it…

Coder Block

OK, I’m a little embarrassed to admit it, but I’ve suffered from coder’s block this week. It’s a state similar to writer’s block, which is “a usually temporary psychological inability to begin or continue work on a piece of writing”.

I think probably everyone suffers from it from time to time and, although I haven’t had a case for several years, this has come at a most inopportune moment because we are on a very tight schedule with our current release.

I guess it’s a little like depression in that you can’t just mentally “fix it”; you need to work through the cause. In this case I don’t think it’s an inability to write code; it’s more a case of too many possible solutions to the problem in question. They are all good but none are perfect. In particular, one of the requirements is to allow additional code generation (note – not just an add-in, this is adding to the system) post-release by our professional services engineers. This is hard for me because my brain prefers to work spatially – I need to see the design animated in my head before I code it up; that’s not possible if an unknown chunk of code is going to be added by someone at a later date. Talking with other developers didn’t help because all that did was give me more good solutions to consider.

The other problem is that a number of very intricate details have to be thought about or the implementation won’t meet the requirements. It’s completely overwhelming if you look at them all together, so on Friday I decided to take a leaf out of the extreme programming book: baby steps. I started by ignoring the mass of requirements and implementing a really small section. By lunchtime I was back in the flow; I managed to check in something yesterday evening and things are looking to be back on track.

So if you are suffering something similar, I can offer the following advice:

  • Take a break and do something completely different, like going for a walk; doing this too often, however, is just procrastination
  • Try and work out what is causing the block and fix those issues one at a time
  • Take small steps to get something (anything) working
  • If all else fails, see if you can swap your feature with another developer

The futility of checking database constraints before modifying data

Just a quick reminder that there is no point manually checking database constraints before modifying your data. Take this code:

create table Users (
    UserID uniqueidentifier primary key not null default newid(),
    LoginName nvarchar(50) not null unique
)

if not exists (
    select * from Users where LoginName = 'JohnSmith'
)
begin
    insert into Users (LoginName)
    values ('JohnSmith')
end

There is a race condition between the if not exists check and the insertion of data: another connection could insert the same data before this insert gets a chance to complete. We could wrap the check and the insert in a transaction to ensure consistency, as follows (note the serializable isolation level: we need a key-range lock covering a row that may not exist yet):

begin tran
    set transaction isolation level serializable
    if not exists (
        select * from Users where LoginName = 'JohnSmith'
    )
    begin
        insert into Users (LoginName)
        values ('JohnSmith')
    end
commit

This works, but we’ve now got the added cost of a transaction. There is no point in doing this, though, as there is already a unique constraint on the column and databases are very good at checking constraints. With this in mind, the new and improved version looks like this:

insert into Users (LoginName)
values ('JohnSmith')

Much simpler. There is only one minor problem, for which I don’t have a complete answer: if you have multiple constraints on a table, it’s very difficult to find out which one failed. @@error will be set to an error number such as 2627 (unique constraint violation) or 547 (foreign key or check constraint violation) and a suitable error is raised, which the caller could parse, but not easily.
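
One workable, if inelegant, approach is to let the database enforce the constraint and have the caller interpret the failure. Here is a minimal ADO.NET sketch, assuming a .NET caller; the class, method and connection string are placeholders of my own.

using System;
using System.Data.SqlClient;

class UserRepository
{
    public static void AddUser(string connectionString, string loginName)
    {
        using (var connection = new SqlConnection(connectionString))
        using (var command = new SqlCommand(
            "insert into Users (LoginName) values (@LoginName)", connection))
        {
            command.Parameters.AddWithValue("@LoginName", loginName);
            connection.Open();

            try
            {
                command.ExecuteNonQuery();
            }
            catch (SqlException ex)
            {
                // 2627 = unique constraint violation, 2601 = unique index violation.
                // The constraint name only appears inside ex.Message, which is why
                // telling multiple constraints apart still means parsing text.
                if (ex.Number == 2627 || ex.Number == 2601)
                {
                    Console.WriteLine("Login name '{0}' already exists.", loginName);
                }
                else
                {
                    throw;
                }
            }
        }
    }
}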