Architecture shares something with testing in that resources are limited so effort is best directed toward maximising risk reduction.
The amount of ‘architecture’ in a solution should also reflect the risk associated with a project. For example the sample solution I’m creating carries almost no risk apart from my pride so a light touch is warranted.
However in a real solution what are the major risks? Where should we concentrate our efforts? Below are some of the common risks associated with business intelligence projects:
- Unclean data e.g. Key pathologies – see later post.
- Unreliable sources – how are failed connections, retries and duplicates handled?
- Data volumes – what are the expected peak volumes? What will happen if these peaks are exceeded?
- Latency requirements – can data be supplied to users fast enough? What is the business cost of delays?
- Testability – how testable is the solution? How long can you keep going before technical debt catches up with you?
- History and archive – in my experience most source systems don’t keep a full fidelity history so it ends up being the data warehouse’s responsibility.
- Staying agile – unfortunately many problems with business intelligence solutions are due to an inability to change things; once users, reports, spread-sheets, and ETL code depend on data warehouse schemas or cube designs the whole thing becomes very difficult to change.
- Disaster recovery – what happens when your server dies? Network fails? Data centre fails?
- Scalability – what are your expected user loads? what happens if they are exceeded? are there any events that could cause your user load to be drastically exceeded?
- Usability – how will your users interact with the system? how much training will they need? what if they need help? how can you make the solution easier to use?
“Agile architecture is the art of constraining a solution in order to optimise competing stakeholder concerns whilst maximising the number of options for future design decisions.” – James Snape (just now)
So to be agile lets just concentrate on the risks and try to not be too prescriptive over the final solution. Everything else can generally be quickly changed if it doesn’t work out.
Context is everything with architecture. I’ve often had conversations which started with the phrase “how come you didn’t…” – once the context is explained the decision is usually obvious.
The context for this architecture is purely my own since there are no real customers. I want to satisfy the concerns and requirements as simply as possible but leave room to swap out parts of the architecture to investigate new approaches and technologies.
The diagram below is a pretty standard set of data warehouse components. If you are a traditional Microsoft guy then, from left to right, the components would be Integration Services, SQL Server, Integration Services again, SQL Server, Analysis Services and Excel or Reporting Services respectively. Alternatively you might be using Hadoop as the source mirror and Tableau for the data mart and consume components or some other combination.
I always try to set firewalls within an architecture so that any problems can be isolated and replaced without too much disruption. In this instance I’m going to use those firewalls so that I can try out new ideas and technologies.
The main synchronisation points are the three data stores – Source Mirror, Data Warehouse and Data Mart. (In this instance I am using the term data mart to mean a prepared, subject area specific store optimised for analytical/aggregate queries.)
The responsibilities for each stage are as follows:
- Acquire > Source Mirror: receive data from source, ensure minimal load on source via high watermarks or another strategy, archive data for historical accuracy. A key point here is that the source mirror has the same metadata as the source itself. No joins or transforms on the way. Sometimes simple data conversions are useful but less is more.
- Load > Data Warehouse: apply business logic and transform the source data into dimensional model, update data warehouse.
- Summarise > Data Mart: aggregate generation or cube processing.
- Consume & Act: covers any output tool such as Reporting Services, Excel, Tableau Dashboard, R, F# etc.
I consider the SQL/relational store to be “the data warehouse” and not Analysis Services which is better suited to a data mart role.
It’s quite hard to be succinct when talking about architecture and this post is quite lengthy so I’ll split it and talk about risk driven architecture in the next post.