Friday, May 23, 2014

Build and use aggregate tables with Mondrian and Saiku Analytics

Using multi-dimensional cubes to build crosstabs with tools such as Saiku Analytics on a significant amount of data leads to a recurring problem: performance. In many cases the data warehouse contains millions of lines in its fact tables, so Saiku's response times can become unacceptable; the database spends substantial time aggregating all those lines.

Mondrian provides several ways to improve the response times of crosstab requests. The first one presented here is the creation and use of aggregate tables. The idea behind aggregate tables is to have additional fact tables with less granularity than the initial fact table. For instance, my fact table stores sales per client. If users frequently display a crosstab of sales per country (country belongs to the client dimension), and the requests for these crosstabs take too long, we can create a fact table that stores just sales per country. This new fact table does not contain sales details per client, but sales measures already calculated/aggregated per country.
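To make the idea concrete, here is a minimal sketch in SQL. The table and column names (fact_sales, dim_customer, amount, quantity) are illustrative assumptions, not the actual schema of the 'sales' cube used below; note the fact_count column, which Mondrian expects in an aggregate table to record how many fact rows each aggregated row collapses.

```sql
-- Illustrative sketch only: table/column names are assumptions.
-- An aggregate table of sales per country, plus the 'fact_count'
-- column that tells Mondrian how many fact rows were collapsed.
CREATE TABLE agg_country_fact_sales (
    country    VARCHAR(100) NOT NULL,
    amount     DECIMAL(15,2),  -- SUM of sales amounts for this country
    quantity   INT,            -- SUM of sold quantities for this country
    fact_count INT             -- number of original fact rows collapsed
);

INSERT INTO agg_country_fact_sales (country, amount, quantity, fact_count)
SELECT c.country, SUM(f.amount), SUM(f.quantity), COUNT(*)
FROM fact_sales f
JOIN dim_customer c ON c.customer_id = f.customer_id
GROUP BY c.country;
```

A crosstab per country can then be answered by scanning a few hundred pre-aggregated lines instead of millions of fact lines.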

There are a bunch of articles all over the net that explain this concept, the first being the Mondrian documentation on the Pentaho site: http://mondrian.pentaho.com/documentation/aggregate_tables.php

But the actual set-up is far less documented, so let's build together some aggregate tables that will be used by Saiku.

Bad performance observation

As in this previous article, we keep using the same 'sales' cube. Now download the same cube's data schema, but with a fact table that contains 27 million lines (dump available here: MySQL_sales.sql.zip).

Open Saiku, build a few crosstabs and look at the execution times. On my Intel i5 2.5GHz quad core I get:

  • Sales amount and quantity per continent: 71 seconds
  • Sales amount and quantity per continent and product type: 283 seconds



We really face performance problems: these response times are too long, and I know that my big-boss-wearing-a-3000-bucks-suit tirelessly monitors sales per continent. It's now time to build an aggregate table in order to display the 'per continent' measures faster.

Define aggregate tables

Pentaho provides a great tool for that, though in my opinion it is poorly advertised: Pentaho Aggregation Designer. This tool analyzes your datamart: it checks your database schema, suggests some aggregate tables and, icing on the cake, writes for you the SQL create/insert queries that build these aggregate tables.

Download

First, download the latest stable version of this tool (5.0.1 at the time of writing) at: http://sourceforge.net/projects/mondrian/files/aggregation%20designer/

Install

Unzip it, and first make sure that the JAVA_HOME and PATH environment variables are correctly set (as seen in this previous article).
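As a quick reminder, on Linux/Mac that typically means something like the following (the JDK path below is of course machine-specific):

```shell
# Hypothetical JDK location: adapt the path to your own installation.
export JAVA_HOME=/usr/lib/jvm/java-7-oracle
export PATH="$JAVA_HOME/bin:$PATH"
```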

Copy the MySQL JDBC driver (or the corresponding driver for your database), as downloaded previously, into the
pentaho-aggdesigner-ui/drivers directory.

Launch and configure 

Launch the app with the startaggregationdesigner.[sh|bat] executable file, and click on the 'Configure' button in the welcome pop-in: you have to specify the database connection of your datamart.
- choose connection type 'MySQL'
- fill in the hostname (localhost)
- fill in the database name
- fill in the user name and password
- choose 'JDBC' as the access type
Check the connection with the 'Test' button and close with 'OK'.

Still in the pop-in, in the OLAP Model part, for 'Mondrian schema file', open the Mondrian schema previously used.
Then select the 'sales' cube in the cubes list, and finally click on 'Connect'.

Aggregation Designer then performs an analysis of the database and checks that the table indexes are correctly set.
You then reach the main view.

Define aggregate table

Now click on the 'Add' button on the right; the dimension lists appear in the 'aggregation level' panel.
We need a correct response time for 'sales per continent and per product', so choose:
- 'continent' level for the customer dimension
- 'product type' for the product dimension
- 'all' for the time dimension

Then click on 'Apply'. Aggregation Designer informs you that this potential aggregate table would contain 28 lines.
Of course aggregating only 28 lines will be fast, but it's a bit too coarse: you know that the time dimension is often used too, in order to compare sales variations per month. So change the level of the time dimension to 'month' and click on 'Apply' again. The target aggregate table would now have 14784 lines; that's a good trade-off.
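What the designer generates for this grain is essentially a GROUP BY at the continent / product type / month level. As an illustrative sketch, with assumed table and column names (the names actually generated will differ):

```sql
-- Illustrative sketch: real names are generated by Aggregation Designer.
-- A month is only unique within a year, so the year column is kept too.
CREATE TABLE agg_continent_type_month_fact_sales AS
SELECT cu.continent,
       p.product_type,
       t.the_year,
       t.month_of_year,
       SUM(f.amount)   AS amount,
       SUM(f.quantity) AS quantity,
       COUNT(*)        AS fact_count
FROM fact_sales f
JOIN dim_customer cu ON cu.customer_id = f.customer_id
JOIN dim_product  p  ON p.product_id   = f.product_id
JOIN dim_time     t  ON t.time_id      = f.time_id
GROUP BY cu.continent, p.product_type, t.the_year, t.month_of_year;
```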

Create aggregate table in database and update Mondrian schema

There are two actions needed to set up the aggregate tables:
- update the Mondrian schema XML file to declare the usage of these new fact tables
- create the new fact tables in the database

Database actions 

Click on the 'Export' button in the top toolbar.
In the 'Create aggregates' section, click on 'Execute' to create the new table in the database.
In the 'Populate aggregates' section, click on 'Execute' to populate these new tables.
And finally export the new Mondrian schema into the proper Saiku directory: override the previous schema, or create a new one. If you create a new one, don't forget to specify the name of this new file in the Saiku configuration file (sales.txt).
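The exported schema now declares the aggregate table under the cube's fact table. Such a declaration looks roughly like the fragment below; the table and column names here are illustrative, since the designer writes the real ones for you:

```xml
<Table name="fact_sales">
  <!-- Illustrative names: Aggregation Designer generates the real ones. -->
  <AggName name="agg_continent_type_month_fact_sales">
    <AggFactCount column="fact_count"/>
    <AggMeasure name="[Measures].[Amount]"   column="amount"/>
    <AggMeasure name="[Measures].[Quantity]" column="quantity"/>
    <AggLevel name="[Customer].[Continent]"   column="continent"/>
    <AggLevel name="[Product].[Product Type]" column="product_type"/>
    <AggLevel name="[Time].[Year]"            column="the_year"/>
    <AggLevel name="[Time].[Month]"           column="month_of_year"/>
  </AggName>
</Table>
```

Mondrian matches a query against such a declaration only when the queried levels can be rolled up from the levels listed in it.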

Activate usage of aggregate tables in Saiku

Open apache-tomcat-8.0.5/webapps/saiku/WEB-INF/classes/mondrian.properties, then uncomment and modify line 609:
mondrian.rolap.aggregates.Read=true

and modify line 851 to have:
mondrian.rolap.aggregates.Use=true

Use and monitor queries on aggregate tables

Restart Saiku. To be sure that the previous queries are not served from the database cache, clear the DB cache (command 'reset query cache' on MySQL) and launch the same queries. The gains are substantial: 'sales per continent and product type' now displays in less than 1 second, versus 283 seconds before.

Check usage of aggregate tables on logs

To be sure that nothing is cheating, you can monitor the SQL in the logs and check that the new fact tables are actually used.
For that we have to activate the SQL logs.

Open apache-tomcat-8.0.5/webapps/saiku/WEB-INF/classes/log4j.xml and uncomment lines 129 to 142 in order to have:
   <appender name="SQLLOG" class="org.apache.log4j.RollingFileAppender">
     <param name="File" value="${catalina.base}/logs/mondrian_sql.log"/>
     <param name="Append" value="false"/>
     <param name="MaxFileSize" value="500KB"/>
     <param name="MaxBackupIndex" value="1"/>
     <layout class="org.apache.log4j.PatternLayout">
       <param name="ConversionPattern" value="%d %-5p [%c] %m%n"/>
     </layout>
   </appender>


   <category name="mondrian.sql">
      <priority value="DEBUG"/>
      <appender-ref ref="SQLLOG"/>
   </category>
Restart Saiku, query again, and you will see the executed SQL queries in apache-tomcat-8.0.5/logs/mondrian_sql.log. You can check that the 'from' clauses of the requests point to the aggregate table instead of the fact_sales table.

Drawbacks

Aggregate table usage is strongly facilitated by the designer, but the maintenance of these tables is more complicated.

First, you have to integrate the refresh of these new tables into your data store update process. Your ETL, for instance, has to be updated too, in order to keep the data consistent between the initial fact table and the aggregate tables.

You will quickly face frequent user requests to accelerate other queries, and thus to create new aggregates. And you have to maintain all this stuff. Because Mondrian needs a precise match between the displayed levels and the aggregated levels present in an aggregate table, many aggregations can be built: the combination of all levels of all dimensions gives 60 possibilities in our shabby cube example.

All that work can become cumbersome, so we will see in the next article how to use the Mondrian cache only, and how to populate this cache with the results of users' queries at Saiku startup.
