In this post, I would like to share my impressions and experience from prototyping the SparklineData/spark-druid-olap open source framework.
The main idea of the framework is to enable SQL access to a Druid index from Tableau Desktop and, along the way, to provide a single access point API for querying both indexed and raw data.
Since Druid 0.9 does not have SQL support out of the box, the big advantage of the SparklineData framework is that it provides the ability to run SQL queries over data in Druid, which is very useful for the end-to-end Tableau integration.
The other aspect, using the same API to query raw data, is not useful in practice, at least from the Tableau perspective.
Running environment information
- Hadoop cluster of Cloudera distribution 5.10.0
- Spark 1.6.1
- SparklineData release 0.4.1
- Druid 0.10.0-rc
After upgrading Spark, downloading the SparklineData artifacts, doing some data formatting and uploading the data to HDFS, I was ready to start the SparklineData Thrift server using this command.
Then I connected to the Thrift server with Beeline, defined the source Spark table, defined and built the Druid index, and defined the Sparkline table according to the description on the Wiki page.
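To give an idea of these definitions, here is a rough sketch rather than my exact statements: all table, column and host names are placeholders, the option names should be checked against the project's Wiki, and the Druid index itself is built with Druid's own batch indexing, which is not shown here.

```sql
-- Source data registered as a Spark table (assuming Parquet files already on HDFS)
CREATE TABLE raw_events
USING parquet
OPTIONS (path "hdfs:///data/raw_events");

-- Sparkline table that maps the Spark table onto the Druid datasource
CREATE TABLE events_druid
USING org.sparklinedata.druid
OPTIONS (
  sourceDataframe "raw_events",
  timeDimensionColumn "event_time",
  druidDatasource "events",
  druidHost "druid-broker-host"
);
```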
The last step was connecting Tableau Desktop to the Spark Thrift server over an ODBC connection, as described in this post.
After all the preparation steps, I had a system ready for building reports with Tableau.
Now for the most interesting part ...
Pitfall #1
Querying columns that are not mapped to the Druid index gives about 100 times slower response times (8 seconds vs. 12 minutes). The reason for such a difference is very clear; that said, from the Tableau user experience perspective it is a total disaster.
Pitfall #2
Trying to fix Pitfall #1, I started thinking about two options:
- Caching raw data in memory using SparkSQL
- Caching raw data using a Spark RDD as part of the Thrift server
The first approach was pretty straightforward, here is the command. I could query the in-memory table from another Beeline session thanks to the spark.sql.hive.thriftServer.singleSession=true configuration, but querying it from Tableau failed because the table is defined only as a temporary table and is not registered in the Hive metastore.
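For reference, a Spark SQL caching command of this kind looks roughly as follows; the cached table name is hypothetical, and existing_spark_table is the raw-data table referenced in the second approach below.

```sql
-- Materialize the raw data as an in-memory (temporary) table
CACHE TABLE cached_raw_events AS
SELECT * FROM existing_spark_table;
```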
To implement the second approach, I added custom code to the SparklineData Thrift server that caches the raw data on startup with the sqlContext.cacheTable("existing_spark_table") command.
Unfortunately, this kind of caching is lazy: the data is only materialized in memory when it is first read, which means the first query is slow and only subsequent identical queries are fast.
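One possible workaround for the lazy behavior is to issue a warm-up query right after the cacheTable call, so the cache is populated before the first user query arrives, for example:

```sql
-- Touch every row once right after startup so the cached table is fully
-- materialized before Tableau sends its first query
SELECT COUNT(*) FROM existing_spark_table;
```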
Pitfall #3
Fine-tuning SparklineData queries is a pretty complex and time-consuming activity. I used the explain druid rewrite command to see the query plan, and also tried disabling the query cost model. With that, I improved the runtime of some queries from 2 minutes to 15 seconds.
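For illustration, plan inspection looks roughly like this (the table and column names are placeholders, not my actual schema):

```sql
-- Show how SparklineData rewrites the SQL into a Druid query
EXPLAIN DRUID REWRITE
SELECT country, SUM(revenue) AS total_revenue
FROM events_druid
WHERE event_time >= CAST('2017-01-01 00:00:00' AS TIMESTAMP)
GROUP BY country;
```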
I noticed that Tableau manipulates the timestamp field of the Druid index heavily, and SparklineData translates these manipulations into JavaScript, which probably explains the tens-of-seconds query response times. I posted one of these Tableau queries here.
Pitfall #4
Tableau queries that fetch data for a specified time range failed, for example queries for the last 7 days, the last month, etc.
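To illustrate, these are queries with a relative date filter, roughly of this shape (table and column names are placeholders):

```sql
-- A "last 7 days" relative date filter of the kind Tableau generates
SELECT country, SUM(revenue) AS total_revenue
FROM events_druid
WHERE event_time >= CAST(DATE_SUB(CURRENT_DATE(), 7) AS TIMESTAMP)
  AND event_time <  CAST(CURRENT_DATE() AS TIMESTAMP)
GROUP BY country;
```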
Conclusions
For me, it was a nice journey to deploy, configure and integrate all the parts of the system.
Using the same API to query raw and indexed data without caching the raw data in memory is not really a practical approach, due to the huge difference in query response times.
There is a high ramp-up to learn how to improve query performance and to understand the SparklineData query optimizer.
Due to the results described above, we found that SparklineData/spark-druid-olap does not suit our system requirements.