The count distinct adventure - Part I
I've been working with several reporting and data analysis tools for a while now. I started with SSAS cubes, then Business Objects came along, later MicroStrategy (MSTR) joined the team, and then Tableau. They all have their own ways of managing count distinct (CD), yet they have more in common than not. "Managing" here simply means yielding correct results every time a user interacts with a CD measure. As BI tech people, our goal is to make data analysis easy for the users, especially since most of them work in non-IT domains and have varying skill levels.
Over time, CDs have proved to be difficult to manage: when the problem wasn't the ease of implementation, it was the performance impact we soon discovered. The difficulty lies in the nature of this non-additive measure. Unlike a standard revenue measure, we cannot simply sum it up; it always has to be re-calculated whenever the level of aggregation changes. I have to admit that I like a good challenge, and MicroStrategy gave me one when I had to implement such CD measures. I decided to write this article because, compared to other tools, MicroStrategy gave me more ways to reach a solution, and even though these are not perfect solutions, you can still make things work.
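To make the non-additive part concrete, here is a minimal SQL sketch against a hypothetical fact_training table (the table and column names are mine, not from any real model). Summing per-region distinct counts over-counts customers who attended trainings in more than one region, so the total always has to be recomputed from the detail rows:
-- Distinct customers per region
SELECT region,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM   fact_training
GROUP  BY region;
-- Wrong total: summing the per-region results double-counts customers
-- who show up in more than one region
SELECT SUM(unique_customers) AS unique_customers_wrong
FROM   (SELECT region, COUNT(DISTINCT customer_id) AS unique_customers
        FROM   fact_training
        GROUP  BY region) per_region;
-- Correct total: recomputed from the lowest level of detail
SELECT COUNT(DISTINCT customer_id) AS unique_customers
FROM   fact_training;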
Now, before diving in, let me show you a basic example of how things should work with count distinct, at least for our cases.
Let's assume that I'm an analyst at MicroStrategy and I want to see some metrics around training. I have customers participating in trainings, and most customers attended several during the year. I would probably create two metrics: one tracking the number of unique trainings that people participated in last year (as a CD), and another showing the number of unique customers who consumed trainings last year. Let's take a closer look at the customer John and see what he has been up to last year. He participated in 5 different trainings. So far so good. Now let's create the two metrics and place them in the grid.
/* Unique Customers */
Count<Distinct=True, UseLookupForAttributes=False>(Customer){~+}
/* Unique Trainings */
Count<Distinct=True, UseLookupForAttributes=False>(Trainings){~+}
This looks good: 5 trainings, a single customer. What if I remove the attributes, so that I see the numbers as global KPIs?
Everything looks good again, and this is the expected behavior. Basically, the CD has to be re-computed at the new level of aggregation from the lowest level of information available.
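To illustrate, below is roughly the kind of SQL that needs to run for the two states of the grid. The fact table and column names are hypothetical and the SQL MSTR actually generates will look different, but the idea is the same: the distinct counts are recomputed from the detail rows, not rolled up from the previous result.
-- Grid with the Customer attribute: one row per customer
SELECT customer_id,
       COUNT(DISTINCT training_id) AS unique_trainings
FROM   fact_training
GROUP  BY customer_id;
-- Attributes removed (global KPIs): recomputed from the same detail rows
SELECT COUNT(DISTINCT customer_id) AS unique_customers,
       COUNT(DISTINCT training_id) AS unique_trainings
FROM   fact_training;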
And now the journey begins…
I used Business Objects (BO) for 7 years, and that's enough time to develop a certain way of thinking about solutions. Because it is built around the same idea (generate SQL, bring the data to the platform, and play with it in memory), I brought this mentality to MicroStrategy. The reason I talk about BO here is that, at the core, the project metrics follow the same philosophy, and yet there's a slight difference in approach that made me resist the change at the beginning. In BO, same as in MSTR, you define the metric formula that will be executed against the warehouse; then you set the projection function (the dynamic aggregation) to "Database Delegated", and here is the difference. This option basically tells BO that the metric has to be re-calculated against the warehouse every time the aggregation level changes between the definition of the report/dataset and the contents of the grid. Whenever this happens, the user sees the message #TOREFRESH in the metric cells of the grid, and by clicking refresh, BO generates new SQL behind the scenes to reflect the numbers at the new aggregation level.
I took the same approach in MSTR, but I discovered that a dynamic aggregation function (DAF) similar to BO's does not exist, so I set out to find all the possible solutions around CDs to help our projects.
Solution 1 - Re-execute against the warehouse
How can we make MSTR re-execute the report against the warehouse and yield the right numbers every time the aggregation level in the grid changes compared to the report level?
MSTR re-executes the report against the warehouse when the report structure changes. So I created the CD metric and picked "None" as the DAF. This way, dynamic aggregation is locked and values are displayed only when the report level and the grid level are the same. It prevents wrong results (caused by an inappropriate DAF) and forces the user to remember to change the structure of the report according to the aggregation they need.
Needless to say, this works best in reports; even though reports can also be manipulated through visualizations, that is not really a sustainable path, so I would sell this approach for reports only.
Pros:
- Values are correct
- Good for people crunching numbers in reports
- Gives access to the underlying pool of data without having to bring it all into MSTR
- Always the latest version of the data
Cons:
- Relies only on warehouse computations
- Can be slower than MSTR's in-memory processing, depending on the database system
- Scope limited to reports only
- Limited functionality
- Becomes annoying for the users
Solution 2 - Create level metrics
The second solution was to follow the MSTR paradigm and create level metrics for all the expected user combinations. This solution never actually saw daylight, as we can have thousands of combinations.
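To see why the combinations explode, consider what each level metric pins down. A level metric computes the distinct count at a fixed level, no matter what ends up on the grid; in warehouse terms, each one boils down to a query with its own fixed GROUP BY, along the lines of the sketches below (hypothetical fact and column names, purely illustrative). Every attribute combination a user might analyze needs its own metric.
-- "Unique Customers by Year" level metric: always computed at Year level
SELECT training_year,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM   fact_training
GROUP  BY training_year;
-- "Unique Customers by Year and Region": another fixed level,
-- another metric to build and maintain
SELECT training_year,
       region,
       COUNT(DISTINCT customer_id) AS unique_customers
FROM   fact_training
GROUP  BY training_year, region;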
Pros:
- Good when you have a small number of use cases and few objects
- Respects the MSTR paradigm
Cons:
- Maintenance and development time skyrockets
- Not user-friendly
Stay tuned! There's more to come in Part II.