The count distinct adventure - Part II
In the first part of this topic we introduced the challenge of count distinct (CD) and two solutions to address it, along with their advantages and disadvantages. The list of solutions continues in this final part, and the ones presented below are also the solutions we use across our projects wherever CDs are required.
Solution 3 - Count Distinct derived metrics
This functionality is probably the best known and most widely used, as it offers flexibility and the necessary control without the intervention of the IT teams. Creating derived metrics in reports, Vis and documents is a walk in the park, using the context menus. With the release of V10, things got even more flexible, adding support for conditional derived metrics (IF and CASE) that use attributes directly. As you probably remember, or still experience today, up until V10 building conditional derived metrics was a hassle, as only metrics could be used inside IF and CASE statements. You would first create a MAX(<Attribute>), so that you would have the attribute members as output wrapped in a metric, and then use it inside a second metric's condition. However, things have improved a lot in V10 and this hassle is no longer there.
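To make this concrete, here is a rough sketch of both approaches, using hypothetical Category, Revenue and Customer objects (the exact function syntax may differ slightly between versions):
Pre-V10 workaround (wrap the attribute in a metric, then condition a second metric on it):
[CategoryValue]: Max(Category@DESC)
Conditional metric: Sum(IF([CategoryValue] = "Books", Revenue, 0))
V10 and later (the attribute form is used directly in the condition):
Sum(IF(Category@DESC = "Books", Revenue, 0))
And a plain CD derived metric is simply:
Count<Distinct=True>(Customer@ID)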
Therefore, this method is easier than ever before, and that's why a lot of users are adopting it more openly. That was good news for us too, because now we have a viable solution that we can propose to all our power users. Again, we are user oriented, so it's important to always offer the best possible solutions with the functionalities at hand, and to know in advance whether our users would adopt them. I have to admit that before V10 this solution was mainly used by the more experienced users, and not without complaints.
To be fair, I also have to underline the pitfalls of this solution. One immediate problem is reusability: these derived metrics have to be re-created all over the place whenever they are needed. Second, you need to have the right attributes in the report so that you can create these derived metrics. For CD metrics you most of the time need a more detailed query, which in the end translates into longer data loading times and more MSTR storage. Due to project specificities, you might end up having to create technical attributes to be used in the derived metrics, things like IDs, keys or even field concatenations that have no business meaning. This can start to be confusing for the users. One last thing here is a possible drop in performance, which we noticed with MSTR even for cubes, caused by the data type behind the attribute used in the derived metric. Count distinct is a resource-demanding task, and the data type can add overhead. We once had a varchar concatenated field inside a 3.5M-row cube (not that much, right?), and displaying the data of the CD derived metric took 6 minutes. Luckily, we could change that concatenation to a big integer data type, and the time went down to an impressive 2 seconds (mind blowing, right? :) ). As we also have solid database experience, this led us to the conclusion that MSTR's engine is pretty similar to, and suffers from the same performance challenges as, a SQL Server engine.
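As a hedged illustration of the data-type point (OrderLine and its forms are hypothetical; only the attribute form changes between the two definitions):
Slow, varchar concatenation form: Count<Distinct=True>(OrderLine@ConcatKey)
Fast, big integer key form: Count<Distinct=True>(OrderLine@ID)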
Pros:
- Easy to implement and use
- Yields good results
- High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)
Cons:
- Increased granularity of the data
- Metrics are not reusable (unless people reuse the reports that contain them)
- Longer warehouse processing time and data fetching to MSTR
With these drawbacks in mind we kept investigating other solutions, mainly trying to improve reusability and to reduce the number of technical attributes needed for the calculations.
Solution 4 - Reusable metrics with custom sub-totals
With the first goal in mind, we went towards project-level reusable metrics and custom sub-totals based on the attributes (like in solution 3):
Count<Distinct=True>(<Attribute>) {@ }
We defined the metric as a normal CD and used the above sub-total as the dynamic aggregation function and the total subtotal function. Needless to say, the attribute involved in the subtotal needs to be present in the report (although not necessarily in the grid). Therefore, at this stage, we have fixed the reusability problem, but not yet the need to create the technical attributes, those based on which the dynamic aggregation performs the further aggregations.
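Put together, the setup looks roughly like this, with a hypothetical Customer attribute (the custom subtotal is then assigned as both the metric's dynamic aggregation function and its total subtotal function):
Metric formula: Count<Distinct=True>(Customer)
Custom subtotal: Count<Distinct=True>(Customer) {@ }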
Pros:
- No extra work for users (metrics are implemented by IT)
- Yields good results
- Metrics are reusable
- High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)
Cons:
- Increased granularity of the data
- Longer warehouse processing time and data fetching to MSTR
Solution 5 - Reusable metrics with custom sub-totals on metric values
This might sound weird at first, but it's pretty simple. Our last goal was to get rid of the need for technical attributes, or attributes altogether, to compute the aggregations. So we took solution 4 and enhanced it (see the sketch after this list):
- Pushed the attribute definition into a fact, which would still retrieve the attribute members as needed
- Changed the formula of the metric to MAX(<Fact>). This ensures that we get the attribute values based on which we will compute the aggregations. We kind of integrated the attribute into the metric
- Changed the custom subtotal to Count<Distinct=True>(x) {@ }, where "x" stands for the current object's values, meaning our metric's values
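As a sketch, assuming a hypothetical fact CustomerKey that carries the former technical attribute's member values:
Fact: CustomerKey (returns the attribute members, e.g. the big integer keys)
Metric formula: Max(CustomerKey)
Custom subtotal (set as dynamic aggregation and total subtotal function): Count<Distinct=True>(x) {@ }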
Indeed, to get the right results we still need to keep the granularity of the report at the necessary level to compute the aggregations, but we no longer need the technical attributes, as the metric now carries those values. Then, the subtotal used as dynamic aggregation will take care of the further calculations. You will also notice that if the report and the grid have the same aggregation level, the metric will display the underlying values (defined in the fact), because the dynamic aggregation is not yet triggered. If this is an issue, make sure that the attributes providing the necessary detail level (the level at which the metric's MAX is calculated) are not in the grid, so the report triggers the dynamic aggregation on open (the report being opened in OLAP mode). If the reports are used only as support for Vis and documents (not exposed to users), then I recommend keeping the report and the grid at the same granularity level (Standard mode), as MSTR would otherwise need the extra dynamic aggregation calculation, which might impact performance when opening Vis and documents.
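For example (a purely hypothetical setup): if the metric is Max over a customer-key fact and the report contains Region plus the detail attributes at which each Max resolves to a single key, then a grid showing only Region triggers the Count<Distinct=True>(x) dynamic aggregation on open and displays one distinct count per region, while a grid containing all the detail attributes would simply display the raw key values.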
The only remaining point is that users need to be educated to include in the report the specific granularity attributes required for the metric values retrieved from the warehouse. Otherwise, you won't have the right values retrieved for the further aggregations.
For more details on troubleshooting performance, take a look at these MSTR community articles:
9.x - https://goo.gl/6sKbC5
10.x - https://goo.gl/UYw4FH
Pros:
- No extra work for users (metrics are implemented by IT)
- Yields good results
- Metrics are reusable
- High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)
- Decreased warehouse processing time and data fetching to MSTR
- Removed additional technical attributes
Cons:
- Increased granularity of the data