The count distinct adventure - Part II

During the first part of this topic we introduced the challenge of count distinct (CD) and two solutions to address it, along with their advantages and disadvantages. The list of solutions continues in this final part, and the ones presented below are also the solutions we use across our projects wherever CDs are required.

Solution 3 - Count Distinct derived metrics

This functionality is probably the best known and most widely used, as it offers flexibility and the necessary control without the intervention of the IT teams. Creating derived metrics in reports, VIs and documents is a walk in the park using the context menus. With the deployment of V10, things got even more flexible, adding support for conditional derived metrics (IF and CASE) that use attributes directly. As you probably remember, or maybe still experience today, up until V10 building conditional derived metrics was a hassle, as only metrics could be used inside IF and CASE statements. You would first create a MAX(<Attribute>) metric, so that the attribute members were exposed as metric output, and then use it inside a second metric's condition. Things have improved a lot in V10 and this hassle is no longer there.
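
To make this concrete, here is a sketch of both approaches, using hypothetical object names (Country, Revenue):

  Pre-V10 workaround, two metrics:
    [MaxCountry]    : Max(Country)
    [FranceRevenue] : Case(([MaxCountry] = "France"), Revenue, 0)

  V10 and later, one derived metric, the attribute used directly:
    [FranceRevenue] : Case((Country@DESC = "France"), Revenue, 0)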

Therefore, this method is easier than ever before, and that's why so many users have openly embraced it. That was good news for us too, because now we have a viable solution that we can propose to all our power users. Again, we are user oriented, and it's important to always offer the best possible solutions with the functionalities at hand, while knowing in advance whether they will be adopted by our users. I have to admit that before V10 this solution was mainly used by the more experienced users, and not without complaints.

To be fair, I also have to underline the pitfalls of this solution. One immediate problem is reusability: these derived metrics have to be re-created in every report, VI or document that needs them. Second, you need the right attributes in the report in order to create the derived metrics in the first place. CD metrics usually require a more detailed query, which in the end translates into higher data loading time and more MSTR storage. Depending on project specifics, you might even end up creating technical attributes just to feed the derived metrics, things like IDs, keys or field concatenations that have no business meaning. This can become confusing for the users.

One last pitfall is a drop in performance that we noticed in MSTR, even for cubes: the data type behind the attribute used in the derived metric. Count distinct is a resource-demanding task, and the data type can add serious overhead. We once had a varchar concatenation field inside a 3.5M-row cube (not that much, right?) and displaying the CD derived metric took 6 minutes. Luckily, we could change that concatenation to a big integer data type and the time went down to an impressive 2 seconds (mind blowing, right? :) ). As we also have solid database experience, this led us to the conclusion that MSTR's engine is pretty similar to, and suffers from the same performance challenges as, a SQL Server engine.
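
To illustrate the data type point, compare two technical fields behind the same CD derived metric (hypothetical names; the logic is identical, only the data type differs):

  Count<Distinct=True>(TicketConcat)  <- varchar concatenation like 'store|date|ticket': minutes
  Count<Distinct=True>(TicketKey)     <- big integer surrogate of the same combination: seconds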

Pros:
  • Easy to implement and use
  • Yields good results
  • High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)

Cons:
  • Increased granularity of the data
  • Metrics are not reusable (unless people reuse the reports that contain them)
  • Longer warehouse processing time and data fetching to MSTR

With these drawbacks in mind we kept investigating other solutions, mainly trying to improve reusability and to reduce the number of technical attributes needed for calculations.


Solution 4 - Reusable metrics with custom sub-totals

With the first goal in mind, we went towards project-level reusable metrics and custom sub-totals based on the attributes (as in Solution 3):
Count<Distinct=True>(<Attribute>) {@ }

We defined the metric as a normal CD and set the subtotal above as both the dynamic aggregation function and the total subtotal function. Needless to say, the attribute involved in the subtotal needs to be present in the report (though not necessarily in the grid). Therefore, at this stage, we have fixed the reusability problem, but not yet the need to create technical attributes, those on which the dynamic aggregation relies for further aggregations.
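
As a sketch, the full Solution 4 setup for a hypothetical Order attribute would look like this:

  Metric [Distinct Orders]
    Formula                      : Count<Distinct=True>(Order)
    Dynamic aggregation function : Count<Distinct=True>(Order) {@ }
    Total subtotal function      : Count<Distinct=True>(Order) {@ }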

Pros
  • No extra work for users (metrics are implemented by IT)
  • Yields good results
  • Metrics are reusable
  • High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)

Cons
  • Increased granularity of the data
  • Longer warehouse processing time and data fetching to MSTR


Solution 5 - Reusable metrics with custom sub-totals on metric values

This might sound weird at first, but it's pretty simple. Our last goal was to get rid of the need for technical attributes, or attributes altogether, when computing aggregations. So we took Solution 4 and enhanced it:
  1. Pushed the attribute definition into a fact, which would still retrieve the attribute members as needed
  2. Changed the formula of the metric to MAX(<Fact>). This ensures that we get the attribute values based on which we will compute the aggregations; we essentially integrated the attribute into the metric
  3. Changed the custom subtotal to Count<Distinct=True>(x) {@ }, where "x" stands for the current object's values, meaning our metric's values (see the sketch below)
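
Putting the three steps together, a sketch with hypothetical object names:

  Fact [Order Key]               : mapped to the former technical attribute's key column
  Metric [Distinct Orders]
    Formula                      : Max([Order Key])
    Dynamic aggregation function : Count<Distinct=True>(x) {@ }
    Total subtotal function      : Count<Distinct=True>(x) {@ }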

Indeed, to get the right results we still need to keep the report at the granularity required to compute the aggregations, but we no longer need the technical attributes, as the metric itself carries their values. The subtotal used as dynamic aggregation then takes care of the further calculations.

You will also notice that if the report and the grid have the same aggregation level, the metric displays the underlying values (defined in the fact), because dynamic aggregation has not yet been triggered. If this is an issue, make sure that the attributes providing the necessary detail level (the level at which the MAX is calculated) are not in the grid, so that dynamic aggregation is triggered when the report opens (the report being opened in OLAP mode). If the reports are used only as support for VIs and documents (not exposed to users), then I recommend keeping the report and grid at the same granularity level (Standard mode), as MSTR would otherwise perform the extra dynamic aggregation step, which might impact performance when opening VIs and documents.
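
To make the triggering behavior concrete, assume a report built on the hypothetical [Distinct Orders] metric above, with Store providing the detail level:

  Report objects : Region, Store, [Distinct Orders]
  Grid : Region, Store, [Distinct Orders]  ->  raw MAX values shown, no dynamic aggregation
  Grid : Region, [Distinct Orders]         ->  dynamic aggregation fires, a true CD per Region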

The only remaining point is that users need to be educated to include in the report the specific granularity attributes required for the metric values retrieved from the warehouse. Otherwise, the retrieved values will not be right for further aggregations.

For more details on troubleshooting performance, take a look at the MSTR Community articles on the subject.


Pros:
  • No extra work for users (metrics are implemented by IT)
  • Yields good results
  • Metrics are reusable
  • High performance, as the data is already in MSTR and the calculations are done by the analytical engine (stay away as much as possible from highly selective varchar formulas)
  • Decreased warehouse processing time and data fetching to MSTR
  • Removed additional technical attributes


Cons

  • Increased granularity of the data
