The first post in this series attempted to contextualize a business intelligence system by identifying the data that it will store and process. This post will focus on the practical, technological aspects of building a business intelligence system using those data requirements; I’ll outline what I feel is the best combination of technologies to store, process, visualize, and extract value from data.
I like to conceptualize the business intelligence stack as a vertical “stack” composed of three layers: 1) unaggregated data at the base, 2) a data warehouse infrastructure on top of the raw data, and 3) a platform for visualizing and exploring the warehoused data. Since I’m a fan of the Hadoop stack, the first two components of the business intelligence system that I advocate blend into each other. “Hadoop” used to mean the combination of HDFS and MapReduce but has grown into an entire ecosystem of products; what I’ll describe in this post are the products that I personally recommend — sometimes because I feel they perform the best, and sometimes because competing products are so new that I simply don’t have any experience with them. The industrious business intelligence manager should always keep a vigilant eye open for superior market entrants.
The base layer: HDFS + MapReduce
Given the large number of user events you’ll want to track, an RDBMS implementation isn’t realistic for analytics. Although I’ve previously advocated against over-engineering business intelligence systems, I do believe that any system built should be capable of handling at least 250k DAU. Assuming an average session length of two minutes, five events per minute, and one session per user per day (conservative estimates), 250k DAU produces 2.5 million rows per day (250,000 users × 10 events each), or roughly 75 million rows per month; meaningful ad-hoc analysis on a dataset that grows that fast would be almost impossible to do after a month or so.
If you store your events data in an RDBMS, you can use a tool like Sqoop to bulk-import your relational data into HDFS (or directly into Hive). If you store your events data in log files, you can run your MapReduce jobs on them directly. I think HDFS + MapReduce is the best fit for a mid-size mobile gaming company for two reasons: 1) using a tool like Whirr, a MapReduce cluster can be instantiated on a hosted network (like Amazon EC2) very quickly, and 2) MapReduce removes the need for deep technical knowledge of distributed systems, since the framework handles load balancing and fault tolerance.
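To make the log-file route concrete, here is a minimal sketch of a MapReduce job that counts events per day. It is illustrative rather than definitive: the class names are hypothetical, and it assumes tab-delimited log lines whose first field is a date, which almost certainly won’t match your event format exactly.

```java
// Minimal sketch: count events per day from raw log files in HDFS.
// Assumes tab-delimited lines whose first field is a yyyy-MM-dd date (an assumption
// for illustration only; adjust the parsing to your actual event format).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyEventCount {

    // Map: emit (date, 1) for every event line
    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text date = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                date.set(fields[0]);          // e.g. "2012-06-01"
                context.write(date, ONE);
            }
        }
    }

    // Reduce: sum the per-date counts
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "daily-event-count");
        job.setJarByClass(DailyEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);   // safe because the sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/events/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /reports/daily-counts/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this is submitted to the cluster with the `hadoop jar` command, and its output lands back in HDFS where the warehouse layer described next can pick it up.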
The warehouse layer: Hive
Hive is a data warehouse framework built for the Hadoop stack that allows SQL-like commands (written in its own query language, HiveQL) to be run against data stored in HDFS. HiveQL commands are converted into MapReduce jobs; Hive can therefore be used as an intermediary between HDFS and other agents (for example, Tableau, MicroStrategy, and R) for small queries and data browsing.
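Because Hive ships with a JDBC driver, any JDBC-capable client can act as one of those agents. The snippet below is a minimal sketch, assuming a HiveServer instance on the default port 10000 and a hypothetical events table with an event_date column; newer HiveServer2 deployments use the org.apache.hive.jdbc.HiveDriver class and a jdbc:hive2:// URL instead.

```java
// Minimal sketch: running a HiveQL query over Hive's JDBC driver.
// Host, port, table, and column names are assumptions for illustration only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDailyEvents {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Looks like SQL, but Hive compiles it into one or more MapReduce jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT event_date, COUNT(*) FROM events GROUP BY event_date");

        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}
```

The same drivers (JDBC for code, ODBC for desktop tools) are what let visualization platforms treat Hive much like an ordinary database.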
An alternative to Hive is Pig. Pig serves the same purpose as Hive, with three fundamental differences: 1) Pig is driven by a procedural scripting language called Pig Latin, 2) Pig builds “data pipelines,” which allow for step-by-step queries that can be more intuitive than the corresponding convoluted SQL statements, and 3) Hive must adhere to a pre-defined data schema, whereas Pig has no such restriction (the structure of the data being retrieved is defined in the query itself). Twitter published a very interesting blog post about their implementation of Pig; likewise, this blog post from a developer at Yahoo! presents a set of use cases and the tool (Pig or Hive) that best fits each.
That said, I prefer Hive to Pig because SQL is ubiquitous; it’s far easier to explain a SQL command to someone than a Pig Latin command if that person doesn’t know Pig Latin. Also, because Hive is architected around a SQL-like model, it can interface directly with visualization platforms through standard JDBC and ODBC drivers.
The visualization layer: Tableau
Tableau is by far my favorite visualization and data browsing tool. I believe it is the best fit for a mobile gaming company’s analytics needs for a few reasons:
- Geography-based visualizations. Geography plays a prominent role in mobile marketing, and being able to quickly visualize metrics like LCV, conversion rate, retention, etc. across geographies is an essential function of a reporting tool in that environment
- Hosted, centralized reporting. Many mobile gaming companies are distributed across the globe; Tableau Server allows reporting to be centralized in one place, providing for real-time access
- Calculated fields (in-data calculations). Tableau’s calculated fields are easy to create and provide a high degree of flexibility in manipulating data
- Custom SQL. The ability to execute custom SQL statements against a dataset reduces the analytics team’s dependence on architectural changes to the data warehouse
- Table calculations. Dynamic table calculations (dynamic meaning the scope of the calculation changes with the range of the graph) give the end user an incredible amount of flexibility when exploring data. They also allow for really cool, insightful reports to be automated
- Scalability. Tableau can be implemented as a desktop application on a small number of licenses and scale up to a reporting server accessible to thousands of people through a web interface
- Speaking qualitatively, employees at gaming companies tend to respond best to visual information, and Tableau’s focal theme is the visual communication of data
Finally, below are some resources for digging deeper into the technologies discussed above.
Hadoop
http://university.cloudera.com/ — Cloudera University, the starting point for learning about Hadoop
http://www.slideshare.net/davidengfer/intro-to-the-hadoop-stack-javamug — an introduction to the Hadoop stack
http://www.youtube.com/watch?v=nm9R_rIEfKg — an interview with the CTO of Cloudera, Amr Awadallah, in which he gives a brief description of Hive, Pig, HBase, Avro, Mahout, Sqoop, Flume, Oozie, and Whirr
http://shout.setfive.com/2011/09/14/getting-started-with-hadoop-hive-and-sqoop/ — a step-by-step tutorial for setting up a Hadoop stack consisting of HDFS, Hive, and Sqoop
http://sujitpal.blogspot.fi/ — a blog featuring a varied selection of MapReduce code examples
http://gbif.blogspot.fi/2011/01/setting-up-hadoop-cluster-part-1-manual.html — a step-by-step tutorial for setting up a Hadoop cluster
Hive
https://cwiki.apache.org/Hive/tutorial.html#Tutorial-WhatisHive — a conceptual introduction to Hive
http://facility9.com/2010/12/getting-started-with-hive/ — a Hive tutorial
http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial — another Hive tutorial
Sqoop
http://www.slideshare.net/cloudera/apache-sqoop-a-data-transfer-tool-for-hadoop — an introduction to Sqoop
Other
http://pivotallabs.com/talks/150-hadoop-for-rubyists — an excellent presentation from Pivotal Labs about the transition of a client’s software system to the Hadoop stack. It provides some practical insight into the decision-making process surrounding such a migration.