The first post in this series attempted to contextualize a business intelligence system by identifying the data that it will store and process. This post will focus on the practical, technological aspects of building a business intelligence system using those data requirements; I’ll outline what I feel is the best combination of technologies to store, process, visualize, and extract value from data.
I like to conceptualize the business intelligence stack as a vertical “stack” composed of three layers: 1) unaggregated data at the base, 2) a data warehouse infrastructure on top of the raw data, and 3) a platform for visualizing and exploring the warehoused data. Since I’m a fan of the Hadoop stack, the first two components of the business intelligence system that I advocate blend into each other. “Hadoop” used to mean the combination of HDFS and MapReduce but has grown into an entire ecosystem of products; what I’ll describe in this post are the products that I personally recommend — sometimes because I feel they perform the best, and sometimes because competing products are so new that I simply don’t have any experience with them. The industrious business intelligence manager should always keep a vigilant eye open for superior market entrants.
The base layer: HDFS + MapReduce
Given the large number of user events you’ll want to track, an RDBMS implementation isn’t realistic for analytics. Although I’ve previously advocated against over-engineering business intelligence systems, I do believe that any system built should be capable of handling at least 250k DAU. Assuming an average session length of two minutes, five events per minute, and one session per user per day (conservative estimates), 250k DAU produces 2.5 million rows per day (250,000 users × 10 events each), or roughly 75 million rows per month; meaningful ad-hoc analysis on a dataset that grows that fast would be almost impossible to do after a month or so.
If you store your events data in an RDBMS, you can use a tool like Sqoop to bulk-import your relational data into HDFS (or directly into Hive). If you store your events data in log files, you can run your MapReduce jobs on them directly. I think HDFS + MapReduce is the best fit for a mid-size mobile gaming company for two reasons: 1) using a tool like Whirr, a MapReduce cluster can be instantiated on a hosted network (like Amazon EC2) very quickly, and 2) MapReduce removes the need for deep technical knowledge of distributed systems, since the framework handles load balancing and fault tolerance.
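To make the log-file route concrete, here is a minimal sketch of a MapReduce job that counts events per day. It is illustrative rather than definitive: the class names are hypothetical, and it assumes tab-delimited log lines whose first field is a date, which almost certainly won’t match your event format exactly.

```java
// Minimal sketch: count events per day from raw log files in HDFS.
// Assumes tab-delimited lines whose first field is a yyyy-MM-dd date (an assumption
// for illustration only; adjust the parsing to your actual event format).
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DailyEventCount {

    // Map: emit (date, 1) for every event line
    public static class EventMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text date = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] fields = value.toString().split("\t");
            if (fields.length > 0 && !fields[0].isEmpty()) {
                date.set(fields[0]);          // e.g. "2012-06-01"
                context.write(date, ONE);
            }
        }
    }

    // Reduce: sum the per-date counts
    public static class SumReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context)
                throws IOException, InterruptedException {
            long sum = 0;
            for (LongWritable v : values) {
                sum += v.get();
            }
            context.write(key, new LongWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = new Job(new Configuration(), "daily-event-count");
        job.setJarByClass(DailyEventCount.class);
        job.setMapperClass(EventMapper.class);
        job.setCombinerClass(SumReducer.class);   // safe because the sum is associative
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. /logs/events/
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. /reports/daily-counts/
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Packaged into a jar, a job like this is submitted to the cluster with the `hadoop jar` command, and its output lands back in HDFS where the warehouse layer described next can pick it up.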
The warehouse layer: Hive
Hive is a data warehouse framework built for the Hadoop stack that allows SQL-like commands (written in its own query language, HiveQL) to be run against data stored in HDFS. HiveQL commands are converted into MapReduce jobs; Hive can therefore be used as an intermediary between HDFS and other agents (for example, Tableau, MicroStrategy, and R) for small queries and data browsing.
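Because Hive ships with a JDBC driver, any JDBC-capable client can act as one of those agents. The snippet below is a minimal sketch, assuming a HiveServer instance on the default port 10000 and a hypothetical events table with an event_date column; newer HiveServer2 deployments use the org.apache.hive.jdbc.HiveDriver class and a jdbc:hive2:// URL instead.

```java
// Minimal sketch: running a HiveQL query over Hive's JDBC driver.
// Host, port, table, and column names are assumptions for illustration only.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveDailyEvents {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
        Connection conn = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
        Statement stmt = conn.createStatement();

        // Looks like SQL, but Hive compiles it into one or more MapReduce jobs
        ResultSet rs = stmt.executeQuery(
                "SELECT event_date, COUNT(*) FROM events GROUP BY event_date");

        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }

        rs.close();
        stmt.close();
        conn.close();
    }
}
```

The same drivers (JDBC for code, ODBC for desktop tools) are what let visualization platforms treat Hive much like an ordinary database.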
An alternative to Hive is Pig. Pig serves the same purpose as Hive, with three fundamental differences: 1) Pig is driven by a procedural scripting language called Pig Latin, 2) Pig builds “data pipelines,” which allow for step-by-step queries that can be more intuitive than the corresponding convoluted SQL statements, and 3) Hive must adhere to a pre-defined data schema, whereas Pig has no such restriction (the structure of the data being retrieved is defined in the query itself). Twitter published a very interesting blog post about their implementation of Pig; likewise, this blog post from a developer at Yahoo! presents a set of use cases and the tool (Pig or Hive) that best fits each.
That said, I prefer Hive to Pig because SQL is ubiquitous; it’s far easier to explain a SQL command to someone than a Pig Latin command if that person doesn’t know Pig Latin. Also, because Hive is architected around a SQL-like model, it can interface directly with visualization platforms through standard JDBC and ODBC drivers.
The visualization layer: Tableau
Tableau is by far my favorite visualization and data browsing tool. I believe it is the best fit for a mobile gaming company’s analytics needs for a few reasons:
- Geography-based visualizations. Geography plays a prominent role in mobile marketing, and being able to quickly visualize metrics like LCV, conversion rate, retention, etc. across geographies is an essential function of a reporting tool in that environment
- Hosted, centralized reporting. Many mobile gaming companies are distributed across the globe; Tableau Server allows reporting to be centralized in one place, providing for real-time access
- Calculated fields (in-data calculations). Tableau’s calculated fields are easy to create and provide a high degree of flexibility in manipulating data
- Custom SQL. The ability to execute custom SQL statements against a dataset reduces the analytics team’s dependence on architectural changes to the data warehouse
- Table calculations. Dynamic table calculations (dynamic meaning the scope of the calculation changes with the range of the graph) give the end user an incredible amount of flexibility when exploring data. They also allow for really cool, insightful reports to be automated
- Scalability. Tableau can be implemented as a desktop application on a small number of licenses and scale up to a reporting server accessible to thousands of people through a web interface
- Speaking qualitatively, employees at gaming companies tend to respond best to visual information, and Tableau’s focal theme is the visual communication of data
Finally, below are some resources for digging deeper into the technologies discussed above.
Hadoop
http://university.cloudera.com/ — Cloudera University, the starting point for learning about Hadoop
http://www.slideshare.net/davidengfer/intro-to-the-hadoop-stack-javamug — an introduction to the Hadoop stack
http://www.youtube.com/watch?v=nm9R_rIEfKg — an interview with the CTO of Cloudera, Amr Awadallah, in which he gives a brief description of Hive, Pig, HBase, Avro, Mahout, Sqoop, Flume, Oozie, and Whirr
http://shout.setfive.com/2011/09/14/getting-started-with-hadoop-hive-and-sqoop/ — a step-by-step tutorial for setting up a Hadoop stack consisting of HDFS, Hive, and Sqoop
http://sujitpal.blogspot.fi/ — a blog featuring a varied selection of MapReduce code examples
http://gbif.blogspot.fi/2011/01/setting-up-hadoop-cluster-part-1-manual.html — a step-by-step tutorial for setting up a Hadoop cluster
Hive
https://cwiki.apache.org/Hive/tutorial.html#Tutorial-WhatisHive — a conceptual introduction to Hive
http://facility9.com/2010/12/getting-started-with-hive/ — a Hive tutorial
http://www.slideshare.net/cwsteinbach/hive-quick-start-tutorial — another Hive tutorial
Sqoop
http://www.slideshare.net/cloudera/apache-sqoop-a-data-transfer-tool-for-hadoop — an introduction to Sqoop
Other
http://pivotallabs.com/talks/150-hadoop-for-rubyists — an excellent presentation from Pivotal Labs about the transition of a client’s software system to the Hadoop stack. It provides some practical insight into the decision-making process surrounding such a migration.