Author Archives: jsnape

About jsnape

Solution architect at Hitachi Vantara. I live in Bournemouth and have a wife, two lovely daughters and labradors called Thula and Nala.

Mental Health Stakeholders

I want to briefly wrap up the section on stakeholders specific to mental health services because this is where we will get our requirements from.

At the top of the organisation is the chief executive and the trust board. Below her are the IT director, medical director, nursing director and three operational directors who are responsible for Child and Adolescent Mental Health Services (CAMHS), Adult Mental Health Services and Geriatric Mental Health Services respectively.

The IT director is responsible for IT staff such as support staff and system administrators; IT systems and hardware; and IT projects.

The medical director manages all the doctors in the organisation, and the nursing director similarly manages the nursing staff.

The operational directors focus on their individual services with a mix of staff including psychologists, counsellors, therapists, social workers and administration staff.

Collecting all the information we have so far, with an estimate of each individual’s needs:

Who (Interest/Power) – Class: Concerns

  • Chief Exec (Low/High) – Acquirer: low costs.
  • IT Director (High/High) – Acquirer/Assessor: low costs; ease of deployment.
  • Medical/Nursing Directors (Low/High) – User: minimal training, or time taken away from duties.
  • Services Directors (Low/High) – User/Communicator: functionality.
  • IT administrators (High/Low) – Administrator: automated maintenance; simple troubleshooting; secure implementation; zero-friction installs and upgrades.
  • User support staff (Low/Low) – Support staff: ease of use; training material.
  • Team members (Depends/Low) – Users: functionality; ease of use.

The last three are generalisations – if this were reality I would be looking for specific people since the Interest/Power level is unique to a person and not the role.

A behaviour driven testing framework for batch processing systems

Recently I’ve been working on a testing framework to support testing of batch systems such as data warehouses.

The framework is called ‘posh-gwen’ after the three behaviour-driven methods Given, When, and thEN. The first version is on GitHub; comments, suggestions and pull requests are welcome.

So why should you care about using this framework?

It is difficult to test batch systems using modern test frameworks such as Specflow or FitNesse because of the simple rule that good tests should be isolated from one another. All these frameworks run tests in sequence:

  • do something
  • check something
  • clean up
  • move on to the next test

For this to be successful each test has to run very fast. Most batch processing systems, however, are optimised for bulk processing of data. They may take tens of seconds to run end to end even with a single row of data, so running hundreds of tests independently can take hours.

This framework is designed to break the rule of sequential test execution. All tests are run in parallel by phase.

The best way to test batch processing is for a known input data set to contain many test cases. The batch is run, loading all the data at once. Finally, a number of queries are executed against the resulting system. For example, a data warehouse might load a number of source files using an ETL framework such as SQL Server Integration Services. Once loaded, the data warehouse can be queried to check that expected values exist in the final system.
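
posh-gwen itself is PowerShell-based, but the phase idea is language-agnostic. Here is a minimal Python sketch of it; the test-case dictionary shape and the `run_batch` callback are hypothetical names, not the framework’s actual API.

```python
# Illustrative sketch: run every Given step first, then the batch exactly
# once, then every Then assertion, instead of one test at a time.

def run_phased(tests, run_batch):
    """Execute tests in three phases: all setups, one batch run, all checks."""
    for test in tests:          # Phase 1: every test stages its input data
        test["given"]()
    run_batch()                 # Phase 2: the batch system runs exactly once
    results = {}
    for test in tests:          # Phase 3: every test verifies its own output
        results[test["name"]] = test["then"]()
    return results
```

Because the expensive batch run happens only once, total execution time no longer grows linearly with the number of tests.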

It is still important to make sure that each test is isolated from others or else changes in one might cause a number of others to fail or become invalid.

We can do this for batch processing by data isolation – that is, by carving up the data domains so that only a single test uses data from each domain. Verification of the results then forces each query to execute against that test-specific sub-domain.

There are a number of suitable domains to use but any with high cardinality are best:

  • Dates – each day is a single test (or blocks of days, weeks, years etc. for those tests that need to span days).
  • Transaction identifiers – use a map of IDs to test cases or in the case of strings prefix the transaction id with the test case number.
  • Business keys – for entities such as customer or product there is usually an ID field used as the business key; use the same methods as transaction identifiers.
  • Custom attributes – if none of the above will work then you might consider adding an extra attribute to the source data which is passed through the batch system. Obviously this is not a preferred solution since you will have to change your system.
  • Combinations of the above – sometimes depending on where you need to validate you might need multiple solutions.
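
The transaction-identifier option above can be sketched in a few lines of Python; the `T{nnn}-` prefix scheme and the row shape are illustrative assumptions, not part of posh-gwen.

```python
# Sketch of transaction-id isolation: each test case owns an id prefix, so
# its verification query only ever sees its own rows.

def tag_transaction(test_case, txn_id):
    """Prefix a transaction id with its owning test case number."""
    return f"T{test_case:03d}-{txn_id}"

def rows_for_test(rows, test_case, id_field="txn_id"):
    """Filter loaded rows down to the sub-domain owned by one test case."""
    prefix = f"T{test_case:03d}-"
    return [r for r in rows if r[id_field].startswith(prefix)]
```

In a real warehouse the equivalent filter would be a `WHERE` clause on the prefixed business key in each verification query.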

Go try it out and let me know how it goes. I plan on adding more features over the coming months.

Classes of stakeholder

So this is the second attempt at writing this post. The first was very complete but read like a textbook and it makes more sense to read a real book – I recommend the stakeholder section in this book: Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives.

All stakeholders have requirements, needs and interests which collectively I call “concerns”. I even track them in the same work item tracking system used for requirements since it means they never get lost and you can ensure traceability from concerns to requirements and architectural decisions.

The classes can be broadly split into three groups. The first being “receivers”:

  • Acquirers
  • Assessors
  • Users
  • Administrators
  • Support staff

Acquirers may want your software for a number of reasons. They might not actually use it, so their concerns are likely to relate to cost, need and satisfying other concerns indirectly – for example, purchasing reporting software because a regulator has demanded accurate reporting of patient outcomes, with possible penalties for failure.

In the mental health arena there are many additional “assessor” agencies and regulators such as the Care Quality Commission, Monitor, professional bodies such as the General Medical Council and watchdogs like Healthwatch. Fortunately they are mainly concerned with the quality of healthcare, patient outcomes and overall costs; less so about software.

Administrators and support staff want to spend as little time, money and effort as possible on your solution, so anything that eases deployment, automatically heals or diagnoses issues and keeps the software running will help. It’s worth looking further though – what if an administrator has a review objective of reducing disk costs over the year and you turn up asking for terabytes of clustered storage?

The second group of stakeholders are the “producers”:

  • Developers
  • Testers
  • Maintainers
  • Suppliers

Their concerns should be easy to list and meet since they want the same things as you. In this sample project the only producer is me and anyone who supplies me with coffee.

The remaining category is “communicators”. The book defines them as those who “…explain the system to other stakeholders…” but I think it should also include anyone who will discuss, promote, detract, educate, deny, network, rally and gossip about your project and is not in either of the other two groups. They may only be interested in your project for their own political reasons.

Do you have stakeholders from all these categories? If so look at things from each stakeholder’s angle and try to imagine what their needs are. Then go talk to them and confirm it.

Mental health stakeholder analysis – power and interest

So it’s great that we know a little about how mental health services work, but as architects we need to cover some more ground before the project can be considered up and running. This post is about the slightly dirty subject of stakeholders and politics.

Why do this? Well, getting the politics right can mean the difference between success and failure. All projects have stakeholders with specific concerns. Some can be ignored but in general it is the architect’s job to satisfy those concerns. Some stakeholders have more power than others; they will also have varying levels of interest. The diagram below shows the best approach for each of the categories.

  • Low Interest – Low Power: the easiest category to deal with since they are not too concerned with the project and don’t have the ability to derail you. Just monitor them for feedback and other information.
  • High Interest – Low Power: this class of stakeholders want to know what is happening and still don’t have the ability to disrupt the project. The best tactic is to ensure they are well informed.
  • Low Interest – High Power: these stakeholders can be the most difficult to deal with since they have the power to influence not only your project but many of the other stakeholders. The main issue being that they are not that interested so it is difficult to have meaningful discussions and negotiations. The primary tactic is to ensure their concerns are met with as little fuss as possible.
  • High Interest – High Power: not only do they have the power to effect change but are interested in the outcomes. Personally I find this group a lot of fun since they can contribute to some very interactive sessions. The best tactic is to engage them which shouldn’t be too difficult given their level of interest.
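
The grid above reduces to a tiny decision function. This sketch simply encodes the four tactics from the list; the two-level "high"/"low" inputs and the tactic strings are a simplification of my own.

```python
# Map a stakeholder's interest and power levels onto an engagement tactic,
# following the four quadrants described in the post.

def engagement_tactic(interest, power):
    """Return the suggested tactic for an interest/power combination."""
    high_interest = interest == "high"
    high_power = power == "high"
    if high_interest and high_power:
        return "engage closely"
    if high_power:
        return "satisfy their concerns with minimal fuss"
    if high_interest:
        return "keep well informed"
    return "monitor for feedback"
```

In practice I would record the chosen tactic against each stakeholder in the same tracking system as their concerns.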

In the next post I’ll cover classes of stakeholder and why some of the non-obvious ones might be the most important.

Mental Healthcare Workflow

As I mentioned in my last post, I wanted to set up the scenario I’m going to use for my examples. My lovely wife is a child psychiatrist, so I have learned a fair amount about the way psychiatric services work in the UK. Let’s dive right in.

There are four main workflow steps to all patient care:

  1. Referral – prospective patients are referred to the service with behaviours that indicate a mental health problem.
  2. Assessment – a file is opened for the patient and they are assessed. If required, tests are performed, then where appropriate a diagnosis is given and the patient is referred for treatment or discharged.
  3. Treatment – depending on the diagnosis, one or more treatment approaches may be required. Each approach has an outcome: either further assessment and treatment, or discharge.
  4. Discharge – the patient is released from mental health services.

There are two queues in the workflow:

  1. Assessment queue – those that have been referred but are awaiting assessment. There is a government target of 4 weeks for this queue.
  2. Treatment queue – those that have a diagnosis requiring treatment. There is a desirable target of 6 weeks for this queue for the patient to begin their first treatment.
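
The workflow and queue targets above can be captured as a toy model. This is only an illustration of the domain for later examples; the data structures are mine, and the week targets are the ones stated in the post.

```python
# A minimal model of the four-step care workflow and its two queue targets
# (4 weeks referral-to-assessment, 6 weeks diagnosis-to-first-treatment).

WORKFLOW = ["referral", "assessment", "treatment", "discharge"]
QUEUE_TARGET_WEEKS = {"assessment": 4, "treatment": 6}

def breaches_target(queue, weeks_waiting):
    """True if a patient has waited longer than the target for a queue."""
    return weeks_waiting > QUEUE_TARGET_WEEKS[queue]
```

A reporting solution for this domain would largely be about measuring these queues against their targets.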

Some additional notes:

  • Whilst a patient may be seen by many professionals during treatment there is usually a single mental health professional that oversees the overall care.
  • There are multiple levels of discharge – each treatment approach and then final discharge from the service by the lead professional.
  • Diagnoses in the UK are made according to the International Classification of Diseases (WHO) which covers all diseases and includes mental health disorders. Other countries such as the USA diagnose using the Diagnostic and Statistical Manual of Mental Disorders.
  • Services are divided into tiers from Tier 1 (GP treatment) to Tier 4 (intensive, possibly in-patient, treatment).
  • There are other business processes that don’t contribute directly to the value stream but are essential to the overall efficiency of the service and quality of patient care such as treatment reviews.
  • Treatments are conducted over a number of sessions. Even if the treatment is medication there are still sessions required to monitor and evaluate the drug effects and side-effects.

I’m sure this isn’t a complete picture but it’s enough to make a start. First we need to look at some requirements, then review the stakeholders and begin to outline the architecture.

Agile Business Intelligence Architecture

I used to write a regular blog based on my experiences with a number of topics at Microsoft. Since moving to BNP Paribas I haven’t had much to say publicly beyond posting a few photos, and I’d like to change that.

Over the past couple of years I’ve been practising agile business intelligence in my role as a solution architect. My team works on user stories from a prioritised backlog in two-week iterations, practises test-driven development and continuous integration, and is beginning to look at acceptance-test-driven development.

Even with all these techniques it is still very difficult to be truly agile with business intelligence solutions – interrelated requirements, complex database schemas, poor agile tool support and the skill set of most business intelligence developers are all forces trying to turn your project back towards the waterfall dark side.

All the time there is an ever louder call from the business for instant answers to their business questions.

So I’ve begun to wonder, how agile can you get? Can you create traditional and modern business intelligence solutions using the same agile processes that normal software teams use? Can solution architecture be lean and risk driven?

I’m influenced by quite a varied set of development jobs – broadcast and media at Sony, telecoms and call centres at Exony, all sorts at Microsoft and now banking. I’ve learned that clean code, good domain modelling and testing is the only way to be successful.

Obviously I can’t talk about my day job as I’m bound by employee confidentiality, but I do want to try out some ideas in a public arena where I can get feedback. I’m going to make up a set of requirements for a completely different industry from banking – healthcare – which I’ll set up in a future article.

So comments are always welcome, as are suggestions for future posts, and I look forward to publishing some useful content.

Running Sums over Degenerate Dimensions

[Graph: running sum over a degenerate dimension]

Asking for a running sum in a report is a common request, but this week I was asked to create a running sum for a particular customer against the number of facts. What I mean here is a graph of count vs amount (sort of like a Pareto chart, except in transaction order) – something that looks like the graph above.

This is a well rehearsed subject in MDX. You can either use recursion:

With Member [Measures].[Running Sum] As
    [Measures].[Internet Gross Profit] +
    ([Internet Sales Order Details].[Sales Order Number].PrevMember,
        [Measures].[Running Sum])
Member [Measures].[Running Count] As
    1 + ([Internet Sales Order Details].[Sales Order Number].PrevMember,
        [Measures].[Running Count])
Select {
    [Measures].[Running Count],
    [Measures].[Internet Gross Profit],
    [Measures].[Running Sum]
} On 0, Non Empty {
    [Internet Sales Order Details].[Sales Order Number].Children
} On 1
From [Adventure Works]
Where (
    [Customer].[Customer].[Brian Watson]
)

Or, iteration (thanks to Chris Webb for some help on relative performance), which should perform better, especially on AS2008:

With Member [Measures].[Running Sum] As
    Sum(
        {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember},
        [Measures].[Internet Gross Profit]
    )
Member [Measures].[Running Count] As
    Count(
        {Null : [Internet Sales Order Details].[Sales Order Number].CurrentMember}
    )
Select {
    [Measures].[Running Count],
    [Measures].[Internet Gross Profit],
    [Measures].[Running Sum]
} On 0, Non Empty {
    [Internet Sales Order Details].[Sales Order Number].Children
} On 1
From [Adventure Works]
Where (
    [Customer].[Customer].[Brian Watson]
)

[However, on my x64 laptop the second version takes longer to execute YMMV.]

This is OK for AdventureWorks but my real degenerate dimension has many millions of members and this just doesn’t scale. I contemplated using Reporting Services RunningValue() function but as far as I can tell you can’t use it to generate a category axis.

I needed a way of generating the running count for the x-axis in a way that uses Analysis Services’ excellent aggregation ability.

The solution I ended up with is to create an artificial hierarchy and bucket transactions. That way I can create an attribute relation for aggregation and, importantly, control the number of cells in the iteration.

The next problem was how to assign values to this bucket – some customers had only a few transactions yet others had millions. They all needed to be spread over a fixed set of buckets.

The answer lies in a SQL Server RANK() function:

update dw.Sales
set TradeBucket = x.TradeBucket
from (
    select t.TradeKey,
        rank() over (partition by t.CustomerKey order by t.TradeKey asc) /
        case
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000 then 1
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 10000 then 10
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 100000 then 100
            when (select COUNT(*) from dw.Sales where CustomerKey = t.CustomerKey) < 1000000 then 1000
            else 10000
        end as TradeBucket
    from dw.Sales t
) x
where dw.Sales.TradeKey = x.TradeKey

Effectively, we are generating an incrementing number on a per-customer basis and then dividing that number to compress the range. This is surprisingly fast to execute.
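
The same rank-then-divide idea is easy to see outside SQL. This Python sketch mirrors the logic: the divisor thresholds copy the CASE expression, while the function names are my own illustration.

```python
# Rank each customer's trades in key order, then integer-divide the rank so
# every customer spreads over a bounded number of buckets.

def bucket_divisor(n_trades):
    """Pick a divisor, mirroring the SQL CASE expression's thresholds."""
    for limit, divisor in [(1000, 1), (10000, 10), (100000, 100), (1000000, 1000)]:
        if n_trades < limit:
            return divisor
    return 10000

def assign_buckets(trade_keys):
    """Map one customer's trade keys to bucket numbers."""
    divisor = bucket_divisor(len(trade_keys))
    return {key: rank // divisor
            for rank, key in enumerate(sorted(trade_keys), start=1)}
```

A customer with a million trades ends up with about a thousand buckets, which is exactly the granularity the chart needs.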

Once everything is processed, my new MDX looks like:

With Member [Measures].[Running Sum] As
    Sum(
        {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
        [Measures].[Internet Gross Profit]
    )
Member [Measures].[Running Count] As
    Sum(
        {Null : [Internet Sales Order Details].[Trade Bucket].CurrentMember},
        [Measures].[Sales Count]
    )
Select {
    [Measures].[Running Count],
    [Measures].[Internet Gross Profit],
    [Measures].[Running Sum]
} On 0, Non Empty {
    [Internet Sales Order Details].[Trade Bucket].Children
} On 1
From [Adventure Works]
Where (
    [Customer].[Customer].[Brian Watson]
)

It works on aggregated data; there are still around 1000 points which is just fine on the graph and it executes in around 3 seconds. So all good?

Well, for now yes but I can see a problem looming – every time I do an import I update every fact row and fully reprocess the cube. That isn’t going to scale long-term. I will probably have to implement some sort of bucket partition strategy.

Integration Services Design Principles

Whilst doing some design work today for a customer project I realised there is a set of principles I try to adhere to when creating SQL Server Integration Services packages. The list is no doubt incomplete but this is what I have so far.

Minimise IO

This is a general data processing principle. Usually disk and, to a lesser extent, network performance determine the overall processing speed. Reducing the amount of IO in a solution will therefore increase performance.

Solutions that consist of multiple read-process-write steps should be redesigned into a single read-process-process-process-write step.
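
The consolidation can be pictured as a streaming pipeline. This is a language-neutral sketch, not SSIS code: the transform functions stand in for data flow components, and the point is that rows are read and written once rather than materialised between each step.

```python
# Chain all transforms inside a single read...write pass instead of a
# separate read-process-write round per step.

def pipeline(rows, *transforms):
    """Apply every transform while streaming each row once from source to sink."""
    for row in rows:
        for transform in transforms:
            row = transform(row)
        yield row
```

Three chained transforms here still cost one read and one write of the data set, which is the whole point of the principle.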

Prefer Sequential IO to Random IO

Disks perform at their best when sequentially reading or writing large chunks of data. Random IO (and poor performance) manifests when procedural-style programming occurs – signs to look out for are SQL statements modifying or returning only a few rows but being executed repeatedly.

Watch out for hidden random IO – for example, if you are reading from one table and writing to another in a sequential manner then disk access will still be random if both tables are stored on the same spindles.

Avoid data flow components that pool data

Data flow components work on batches of data called buffers. In most instances buffers are modified in place and passed down stream. Some components, such as “Sort” cannot process data like this and effectively hang on to buffers until the entire data stream is in memory (or spooled to disk in low memory situations). This increased memory pressure will affect performance.

Sometimes SQL is the better solution

Whilst the SSIS data flow has lots of useful and flexible components, it is sometimes more efficient to perform the equivalent processing in a SQL batch. SQL Server is extremely good at sorting, grouping and data manipulation (insert, update, delete) so it is unlikely you will match it for raw performance on a single read-process-write step.

SSIS does not handle hierarchical data well

Integration Services is a tabular data processing system. Buffers are tabular and the components and associated APIs are tabular. Consequently it is difficult to process hierarchical data such as the contents of an XML document. There is an XML source component but its output is a collection of tabular data streams that need to be joined to make sense of them.

Execute SSIS close to where you wish to write your data

Reading data is relatively easy and possible from a wide variety of locations. Writing data, on the other hand, can involve complex locking and other issues which are difficult to optimise over a network protocol. In particular, when writing data to a local SQL Server instance, SSIS automatically uses the Shared Memory transport for direct inter-process transfer.

Don’t mess with the data flow metadata at runtime

It’s very difficult to do this anyway, but it is worth mentioning that SSIS gets its stellar performance from being able to set up a data flow at runtime safe in the knowledge that buffers are of a fixed format and component dependencies will not change.

The only time this is acceptable is when you need to build a custom data flow programmatically. You should use the SSIS APIs and not attempt to write the package XML directly.