The count distinct adventure - Part I

I've been working with several reporting and data analysis tools for a while now. I started with SSAS cubes, then Business Objects came along; later, MicroStrategy (MSTR) joined the team, and then Tableau. They all have their own ways of managing count distinct (CD), yet they have a lot in common. "Managing" simply means yielding correct results every time a user interacts with that CD measure. As BI tech people, our goal is to facilitate easy data analysis for the users, especially since most of them work in non-IT domains and have varying skill levels.

Over time, CDs have proved to be a difficult thing to manage: if it wasn't the ease of implementation, it was the performance impact we soon discovered. The difficulty lies in the nature of this non-additive measure. Unlike a standard revenue measure, we cannot simply sum it up; it always has to be re-calculated when the level of aggregation changes. I have to admit that I like a good challenge, and MicroStrategy gave me one when I had to implement such CD measures. I decided to write this article because, compared to other tools, MicroStrategy gave me more ways to reach a solution, and even though these are not perfect solutions, you can still make things work.
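
To see why a CD is non-additive, here is a minimal SQL sketch; the table and column names (fact_training, customer_id, month_id) are made up for illustration:

/* Unique customers per month: a customer active in two months is counted once in each month */
SELECT month_id, COUNT(DISTINCT customer_id) AS unique_customers
FROM fact_training
GROUP BY month_id;

/* Unique customers for the whole year: summing the monthly numbers above would double-count those customers, so the distinct count must be recomputed from the detail rows */
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM fact_training;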

Now, before diving in, let me show you a basic example of how things should work with count distinct, at least for our cases.


Let's assume that I'm an analyst at MicroStrategy and I want to see some metrics around training. I have customers participating in trainings, and most customers attended several during the year. I would probably create two metrics: one following the number of unique trainings that people participated in last year (as a CD), and another showing the number of unique customers that consumed trainings last year. Let's take a closer look at the customer John and see what he has been up to last year.



He participated in 5 different trainings. So far so good. Now let's create the two metrics and place them in the grid.

/* Unique Customers: distinct count of Customer, aggregated at report level ({~+}) */
Count<Distinct=True, UseLookupForAttributes=False>(Customer){~+}
/* Unique Trainings: distinct count of Trainings, aggregated at report level ({~+}) */
Count<Distinct=True, UseLookupForAttributes=False>(Trainings){~+}
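
With the Customer attribute on the grid, the SQL that MSTR generates should look roughly like the sketch below; I'm reusing the illustrative fact_training schema from earlier, with a hypothetical "last year" filter:

SELECT customer_id,
       COUNT(DISTINCT customer_id) AS unique_customers,
       COUNT(DISTINCT training_id) AS unique_trainings
FROM fact_training
WHERE year_id = 2023 /* hypothetical "last year" filter */
GROUP BY customer_id;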



This looks good: 5 trainings, a single customer. What if I remove the attributes so that I see the numbers as global KPIs?


Everything looks good again, and this is the expected behavior: the CD has to be re-computed at the new level of aggregation from the lowest-level information available.
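
In SQL terms, removing the attributes simply drops the grouping, so the distinct counts are recomputed over all the detail rows; continuing the same illustrative schema:

SELECT COUNT(DISTINCT customer_id) AS unique_customers,
       COUNT(DISTINCT training_id) AS unique_trainings
FROM fact_training
WHERE year_id = 2023; /* same hypothetical filter as above */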


And now the journey begins…

I used Business Objects (BO) for 7 years, and that's enough time to develop a certain way of thinking about providing solutions. Because it is built on the same idea (generate SQL, bring the data to the platform, and play with it in memory), I brought this mentality to MicroStrategy. The reason I talk about BO here is that, at the core, the project metrics share the same philosophy, and yet there's a slight difference in approach that made me resist change at the beginning. In BO, as in MSTR, you define the metric formula that will be executed against the warehouse, and then you set the projection function (the dynamic aggregation) to "Database Delegated". Here is the difference: this option tells BO that the metric must be re-calculated against the warehouse every time the aggregation level changes between the definition of the report/dataset and the contents of the grid. Whenever this happens, the user sees the message #TOREFRESH in the metric cells of the grid, and by clicking refresh, BO computes a new SQL statement behind the scenes to reflect the numbers at the new aggregation level.

I tried the same approach in MSTR, but I discovered that no dynamic aggregation function (DAF) similar to BO's existed, so I set myself the goal of discovering all the possible solutions around CDs, to help our projects.


Solution 1 - Re-execute against the warehouse

How can we make MSTR refresh the report against the warehouse and yield the right numbers every time we change the aggregation level in the grid compared to the report level?

MSTR re-executes the report against the warehouse whenever the report structure changes. So I created the CD metric and picked "None" as the DAF. This way, aggregation is locked and values are displayed only when the report level and the grid level are the same. This prevents wrong results (due to inappropriate DAFs) and reminds the user to change the structure of the report according to the aggregation needs.

Needless to say, this works best in reports; even though reports can also be changed through the Vis, that is not really a sustainable solution, so I would sell this approach only for reports.

Pros:
  • Values are correct
  • Good for people crunching numbers in reports
  • Gives access to the underlying pool of data without having it available in MSTR
  • Latest version of data

Cons:
  • Relies only on warehouse computations
  • Can be slower than MSTR in-memory processing depending on the database system
  • Limited scope to reports only
  • Limited functionality
  • Becomes annoying to the users


Solution 2 - Create level metrics

The second solution was to embrace the MSTR paradigm and create level metrics for all expected user combinations. This solution actually never saw daylight, as we can have thousands of combinations.
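
A level metric pins the CD to a fixed level of aggregation, so you need one metric per attribute combination. As a rough SQL illustration (same made-up schema as before), a "unique trainings per year" level metric shown on a monthly grid would compute the distinct count at year level and join it back:

SELECT g.month_id,
       y.unique_trainings
FROM (SELECT DISTINCT month_id, year_id
      FROM fact_training) g
JOIN (SELECT year_id,
             COUNT(DISTINCT training_id) AS unique_trainings
      FROM fact_training
      GROUP BY year_id) y
  ON g.year_id = y.year_id;

Every other combination (per customer, per region, per quarter, and so on) would need its own metric, which is where the development and maintenance cost explodes.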

Pros:
  • Good when you have a small number of use-cases and a reduced number of objects
  • Respects MSTR paradigm

Cons:
  • Maintenance and development time skyrockets
  • Not user friendly



Stay tuned! There's more to come in Part II.

