Better software through software architecture and devops

@jamessnape

Tag Archives: #data-warehouse

  • These seem to get longer and longer. A whole pile of links for you.

    Management and Organisational Behaviour

    How Serving Is Your Leadership? - Who is working for who here?

    Be a Manager - “The only reason there’s so many awful managers is that good people like you refuse to do the job.”

    I’m the Boss! Why Should I Care If You Like Me? - Because your team will be more productive… Here are some pointers.

    Software Development

    Technical debt 101 - Do you think you know what technical debt is and how to tackle it? Even so I’m sure this article has more you can discover and learn. A must read.

    Heisenberg Developers - So true. In fact this hits a little close to home since we use JIRA, the bug tracking tool mentioned in the article.

    What is Defensive Coding? - Many think that defensive coding is just making sure you handle errors correctly but that is a small part of the process.

    Need to Learn More about the Work You’re Doing? Spike It! - So you are an agile shop, your boss is demanding some story estimates and you have no idea how complex the piece of work is because it’s completely new. What do you do?

    Software Development with Feature Toggles - Don’t branch, toggle instead.

    Agile practices roundup - here are a number of articles I’ve found useful recently:

    How to review a merge commit- Phil dives into the misunderstood world of merge commits and reviews. Also see this list of things to look out for during code reviews.

    Functional Programming

    Don’t Be Scared Of Functional Programming - A good introduction to functional programming concepts using JavaScript as the demonstration language.

    Seamlessly integrating T-SQL and F# in the same code - The latest version of FSharp.Data allows you to write syntax checked SQL directly in your F# source and it executes as fast as Dapper.

    Railway Oriented Programming - This is a functional technique but I’ve recently been using it in C# when I needed to process many items in a sequence, any of which could fail and I want to collect all the errors up for reporting back to ops. It is harder to do in C# since there are no discriminated unions but a custom wrapper class is enough.

    Erlang and code style - A different language this time, Erlang. How easy is programming when you don’t have to code defensively and crashing is the preferred way of handling errors.

    Twenty six low-risk ways to use F# at work - Some great ways to get into F# programming without risking your current project.

    A proposal for a new C# syntax - A lovely way to look at writing C# using a familiar but lighter weight syntax. C#6 have some of these features planned but this goes further. Do check out the link at the end of the final proposal.

    Excel-DNA: Three Stories - Integrating F# into Excel - a data analysts dream…

    Data Warehousing

    Signs your Data Warehouse is Heading for the Boneyard - Some interesting things to look out for if you hold the purse strings to a data warehouse project. How many have you seen before?

    The 3 Big Lies of Data - I’ve heard these three lies over and over from business users and technology vendors alike. Who is kidding who?

    Six things I wish we had known about scaling - Not specifically about data warehouses but these are all issues we see on a regular basis.

    Why Hadoop Only Solves a Third of the Growing Pains for Big Data - You can’t just go and install a Hadoop cluster. There is more to it than that.

    Microsoft Azure Machine Learning - Finally it looks like we can have a simple way of doing cloud scale data mining.

    Data Visualization

    5 Tips to Good Vizzin’ - So many visualizations break these rules.

    Five indicators you aren’t using Tableau to its full potential - I’ve seen a few of these recently - tables anyone?

    Create a default Tableau Template - Should save some time when you have a pile of dashboards to create.

    Building a Tableau Center of Excellence - It is so easy to misunderstand Tableau which is not helped by a very effective sales team. This article has some great advice for introducing Tableau into your organisation.

    Beginner’s guide to R: Painless data visualization - Some simple R data visualization tips.

    Visualizing Data with D3 - If you need complete control over your visualization then D3 is just what you need. It can be pretty low-level but its easy to produce some amazing stuff with a bit of JavaScript programming.

    Testing

    I Don’t Have Time for Unit Testing - I’ve recently been guilt of this myself so I like to keep a reminder around - you will go faster if you write tests.

    Property Based Testing with FsCheck - FsCheck is a fantastic tool primarily used in testing F# code but there is no reason it can’t be used with C# too. It generates automated test cases to explore test boundaries. I love the concise nature of F# test code too especially with proper sentences for test names.

    Analysis Services

    I’ve collected a lot of useful links for Analysis Services, both tabular and multidimensional:

    DAX Patterns website - This website is my go-to resource for writing DAX calculations. These two are particularly useful:

    Using Tabular Models in a Large-scale Commercial Solution - Experiences of SSAS tabular in a large solution. Some tips, tricks and things to avoid.

    Also:

    This entry was posted in reference  and tagged #business-intelligence #data-visualization #data-warehouse #excel #functional-programming #tableau #testing  on .
    Discuss this on Twitter or LinkedIn
  • Managing complexity graphic showing DDD patterns to apply from Domain Driven Design source https://commons.wikimedia.org/wiki/File:Maintaining_Model_Integrity.png

    My commute is around two and a half hours each way so I read a lot on the train. One of the subjects I’ve recently become interested in Domain Driven Design or DDD. I’ve found it isn’t really a new topic for me but more like someone has documented many of the techniques I’ve always used.

    DDD discusses data warehouses primarily as an output or reporting function within a larger application. The book Implementing Domain-Driven Design by Vaughn Vernon mentions reporting repeatedly as a by-product of DDD (particularly when used with Event Sourcing) but not directly as a possible use case.

    I would agree that not all concepts can be reused in data warehouse solutions since the only interface available is often one that transfers a set of mutations (property value changes) without the accompanying reasons (it is a key idea in DDD that you need to design your model not by changes in attributes but by operations performed on entities). For example an order count has decreased and the reason is missing – was the order returned, cancelled, an error etc. So where can it be applied? Are any of the concepts useful when designing data warehouses?

    Ubiquitous Language

    Lets start with one of the core concepts – Ubiquitous Language is a rigorous shared language used between developers and users. It is used to make sure that conversations are accurate and productive. It should evolve as the team’s understanding of a domain changes. The ubiquitous language is what forms the domain model at the heart of a software solution.

    I find this description very similar to an equivalent concept in data warehousing – the Dimensional Model. This model, and its associated dimension bus matrix, is based on real business processes and terminology. The dimensional model is the public face of the data warehouse. It needs to be precise, reflect the terms used by business users and form a common vocabulary between users and the development team. For example when browsing an Analysis Services cube in Excel, the dimensions and facts defined in the dimensional model are directly visible to end users – if they don’t automatically understand what is on-screen then the model doesn’t describe the business.

    Entities and Value Objects

    There are two types of object in domain driven design – entities and value objects. Value objects are immutable and identified via their attributes. For example $100 USD in one object is the interchangeable with $100 USD in another. Entities on the other had cannot be identified purely by their attributes – there must be some sort of unique identifier (in data warehouse terms this is a business key) to differentiate similar entities. For example, one John Smith may not be the same as another and need a Customer-Id to differentiate the two.

    With respect to dimensional models, value objects should not be implemented as top-level dimensions but instead be added at sets of attributes to the entities which own them. For example ‘Product Colour’ is a value type (colour) and should belong in the ‘Product’ dimension. This seems obvious when written this way but happens a lot.

    Entities and Aggregate Roots

    DDD groups sets of closely related entities under the control of a single ‘Aggregate root’. Some entities make no sense unless the parent entity is also within context; order lines and parent orders are the typical example.

    So value objects shouldn’t be dimensions and I don’t think ordinary entities should be either. True dimensions are the aggregate roots and the one thing that seals it for me is that an aggregate root (according to DDD) defines a transactional boundary – you should not update multiple aggregate roots within a single transaction; instead sagas keep your data warehouse in sync (eventually).

    Finally

    Domain driven design and business intelligence share a number of common concepts yet the two philosophies are rarely seen as related. I think there is a lot to be gained by applying software concepts from different viewpoints which may not ordinarily be considered.

    This entry was posted in business-intelligence  and tagged #data-warehouse #domain-driven-design  on .
    Discuss this on Twitter or LinkedIn
  • Architecture shares something with testing in that resources are limited so effort is best directed toward maximising risk reduction.

    The amount of ‘architecture’ in a solution should also reflect the risk associated with a project. For example the sample solution I’m creating carries almost no risk apart from my pride so a light touch is warranted.

    However in a real solution what are the major risks? Where should we concentrate our efforts? Below are some of the common risks associated with business intelligence projects:

    • Unclean data e.g. Key pathologies – see later post.
    • Unreliable sources – how are failed connections, retries and duplicates handled?
    • Data volumes – what are the expected peak volumes? What will happen if these peaks are exceeded?
    • Latency requirements – can data be supplied to users fast enough? What is the business cost of delays?
    • Testability – how testable is the solution? How long can you keep going before technical debt catches up with you?
    • History and archive – in my experience most source systems don’t keep a full fidelity history so it ends up being the data warehouse’s responsibility.
    • Staying agile – unfortunately many problems with business intelligence solutions are due to an inability to change things; once users, reports, spread-sheets, and ETL code depend on data warehouse schemas or cube designs the whole thing becomes very difficult to change.
    • Disaster recovery – what happens when your server dies? Network fails? Data centre fails?
    • Scalability – what are your expected user loads? what happens if they are exceeded? are there any events that could cause your user load to be drastically exceeded?
    • Usability – how will your users interact with the system? how much training will they need? what if they need help? how can you make the solution easier to use?

    “Agile architecture is the art of constraining a solution in order to optimise competing stakeholder concerns whilst maximising the number of options for future design decisions.” – James Snape (just now)

    So to be agile lets just concentrate on the risks and try to not be too prescriptive over the final solution. Everything else can generally be quickly changed if it doesn’t work out.

    This entry was posted in sample-solution  and tagged #architecture #business-intelligence #data-warehouse #risk #stakeholder-concerns  on .
    Discuss this on Twitter or LinkedIn
  • Context is everything with architecture. I’ve often had conversations which started with the phrase “how come you didn’t…” – once the context is explained the decision is usually obvious.

    The context for this architecture is purely my own since there are no real customers. I  want to satisfy the concerns and requirements as simply as possible but leave room to swap out parts of the architecture to investigate new approaches and technologies.

    The diagram below is a pretty standard set of data warehouse components. If you are a traditional Microsoft guy then, from left to right, the components would be Integration Services, SQL Server, Integration Services again, SQL Server, Analysis Services and Excel or Reporting Services respectively. Alternatively you might be using Hadoop as the source mirror and Tableau for the data mart and consume components or some other combination.

    architecture

    I always try to set firewalls within an architecture so that any problems can be isolated and replaced without too much disruption. In this instance I’m going to use those firewalls so that I can try out new ideas and technologies.

    The main synchronisation points are the three data stores – Source Mirror, Data Warehouse and Data Mart. (In this instance I am using the term data mart to mean a prepared, subject area specific store optimised for analytical/aggregate queries.)

    The responsibilities for each stage are as follows:

    • Acquire > Source Mirror: receive data from source, ensure minimal load on source via high watermarks or another strategy, archive data for historical accuracy.  A key point here is that the source mirror has the same metadata as the source itself. No joins or transforms on the way. Sometimes simple data conversions are useful but less is more.
    • Load > Data Warehouse: apply business logic and transform the source data into dimensional model, update data warehouse.
    • Summarise > Data Mart: aggregate generation or cube processing.
    • Consume & Act: covers any output tool such as Reporting Services, Excel, Tableau Dashboard, R, F# etc.

    I consider the SQL/relational store to be “the data warehouse” and not Analysis Services which is better suited to a data mart role.

    It’s quite hard to be succinct when talking about architecture and this post is quite lengthy so I’ll split it and talk about risk driven architecture in the next post.