Game Analytics 101 — Setting up the data analytics infrastructure.

Prod Man
7 min read · Jan 12, 2022

Overview

A data analytics module lets us collect events from user devices into a centralized data repository, where the data can be analyzed and visualized across different metrics and charts to achieve the following objectives:

  • Improve the gameplay experience
  • Optimize monetization
  • Understand player behavior

Objective:

The following services are necessary for setting up a data analytics module that would enable us to fulfill the objectives listed above:

  • Client Telemetry
  • Data Pipeline
  • Data Warehouse
  • BI tools

These services are both time- and resource-heavy to build from scratch, so acquiring third-party services is usually the better option, especially when working on a time crunch. A careful evaluation of the vendors against the functionality required of each service is therefore of paramount importance.

The above diagram depicts the various layers inside the analytics module along with the required services that need to be established.

Client Telemetry

Client telemetry refers to capturing the user data and events we want to track on the client and sending them to the data warehouse.

The above diagram depicts the overall functionality of client telemetry along with the vendor evaluation parameters.

Vendor Landscape

Most vendors offer client telemetry bundled with other backend services; vendors providing client telemetry specifically for the gaming domain are rare.

Major Vendors

Google Firebase, Azure Playfab, Twilio Segment.

Evaluation Criteria

Real-time or near real-time client telemetry is a standard feature offered by all the vendors. While choosing a vendor for client telemetry, the other significant characteristics to consider are:

  • Ingestion setup of data
  • Flexible Integration
  • Ease of use

Ingestion setup of data

The back-end setup required to send the client payload to the vendor should be seamless, with minimal or no effort.
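As a rough illustration of what that client-side setup can look like, here is a minimal Python sketch that POSTs a gameplay event to a hypothetical vendor collection endpoint; the URL, API key, and field names are placeholders, not any specific vendor's API.

```python
import json
import time
import urllib.request

# Hypothetical collection endpoint and key; replace with your vendor's values.
INGEST_URL = "https://collect.example-vendor.com/v1/events"
API_KEY = "YOUR_WRITE_KEY"

def send_event(player_id, event_name, properties):
    """Serialize one gameplay event and POST it to the vendor endpoint."""
    payload = {
        "player_id": player_id,
        "event": event_name,
        "properties": properties,
        "client_ts": int(time.time() * 1000),  # client-side timestamp in ms
    }
    request = urllib.request.Request(
        INGEST_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {API_KEY}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return response.status  # 2xx means the event was accepted

# Example: track a level completion.
# send_event("player_123", "level_complete", {"level": 7, "duration_s": 182})
```

In practice the vendor's client SDK typically wraps this step for you, which is exactly the kind of minimal-effort setup to look for.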

Flexible Integration

The vendor should support different types of exports (structured, unstructured, semi-structured), and the exported data should be compatible with the rest of your standardized data, since most vendors support a direct connection to your CDW/DWH. Example: PlayFab does not support Google BigQuery as a CDW; it is chiefly limited to Azure/Snowflake.

Ease of use

The vendor should have an intuitive, easy-to-use dashboard that supports real-time monitoring and lets you define custom rules for events (a custom-rule engine). Example: Twilio’s dashboard seems to have a steep learning curve.

Data Pipeline

The data pipeline is a data integration layer that blends data from many sources and lets us efficiently load data into the DWH.

The above diagram depicts the functionality of a data pipeline along with the criteria to evaluate different vendors providing the service.

Vendor Landscape

The vendor landscape for setting up a data pipeline is limited to a few vendors, and most of them specialize in either ETL or ELT, but not both.

Major Vendors

Matillion, Alteryx, Stitch, Fivetran, and y42.

Evaluation Criteria

Standard features provided by all the vendors include automatic schema creation and incremental updates. The primary criteria to evaluate a data pipeline vendor are:

  • Ingestion Compatibility
  • Pipeline Support
  • Transformation Efficacy
  • Operational Control

Ingestion Compatibility

The vendor should be able to ingest data from data warehouses, data lakes, and other SaaS sources such as marketing applications, in multiple forms (structured and semi-structured); streaming sources should also be considered.

Pipeline Support

The vendor should support the chosen data pipelining method, i.e., ETL, ELT, or both. Example: while Alteryx supports only the ETL pipeline, Stitch specializes in providing services for the ELT pipeline.
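To make the ETL/ELT distinction concrete, here is a minimal Python sketch of the two orderings; the extract, transform, and load functions are placeholders rather than any vendor's API.

```python
# Minimal sketch of the two pipeline orderings; extract/transform/load are
# placeholder functions, not any vendor's API.

def extract(source):
    return [{"player_id": "p1", "revenue": "4.99"}]  # raw records (stubbed)

def transform(rows):
    # Standardize types/names before or after loading, depending on the pattern.
    return [{"player_id": r["player_id"], "revenue_usd": float(r["revenue"])} for r in rows]

def load(rows, target):
    print(f"loading {len(rows)} rows into {target}")

# ETL: transform on the pipeline's own compute, then load the clean data.
load(transform(extract("game_events")), target="dwh.fact_purchases")

# ELT: load the raw data first, then transform later inside the warehouse,
# typically with SQL, trading pipeline simplicity for warehouse compute.
load(extract("game_events"), target="dwh.raw_events")
```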

Transformation Efficacy

Data transformation, i.e., the cleaning, deduplication, and standardization of data values, should be done accurately and mapped into the schema of choice. Example: Stitch and Fivetran are limited in their transformation capabilities.
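As a sketch of what that transformation step covers, assuming an illustrative raw export (the column names are made up), a pandas version of cleaning, deduplication, and standardization might look like this:

```python
import pandas as pd

# Illustrative raw export; column names are assumptions, not a real schema.
raw = pd.DataFrame({
    "player_id": ["p1", "p1", "p2"],
    "country":   ["us", "US ", "de"],
    "revenue":   ["4.99", "4.99", None],
})

clean = (
    raw.assign(
        country=raw["country"].str.strip().str.upper(),                 # standardize values
        revenue=pd.to_numeric(raw["revenue"], errors="coerce").fillna(0.0),
    )
    .drop_duplicates(subset=["player_id", "country", "revenue"])        # deduplicate
    .rename(columns={"revenue": "revenue_usd"})                         # map to the target schema
)

print(clean)
```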

Operational Control

The user should have the flexibility to set up concurrent connections to the DWH while loading, and to add new data sources and batch scheduling. For example, Matillion cannot add a new data source after the data layer has been established, yet it is the only vendor that supports concurrent connections to the target repository while loading.

Data Warehouse

A data warehouse is a data management system that enables the company to organize, analyze and report information.

The above diagram depicts the data warehousing flow along with the criteria to be observed while evaluating different vendors providing Data Warehousing as a service.

Vendor Landscape

Very few vendors dominate this space, with entire ecosystems built around their data warehouse services.

Major Vendors

AWS — Redshift, Google Cloud Platform — BigQuery, Snowflake

Evaluation Criteria

Standard features across all the vendors include materialized views, UDFs, caching, hot/cold storage, scheduling, and integrated BI within their ecosystems. Criteria to evaluate a vendor for a DWH service would be:

  • Ingestion of streaming data
  • Flexibility in the choice of data
  • Maintenance
  • Ease of use
  • Scalability

Ingestion of streaming data

The DWH should ingest incoming streaming data (a continuous flow, unlike bulk upload). For example, AWS Redshift has no built-in capability to ingest data streams, and even Snowflake only supports micro-batches via Snowpipe.
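For a sense of what built-in streaming ingestion looks like from the application side, here is a minimal sketch using the google-cloud-bigquery client's streaming insert API; the project, dataset, table, and field names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()  # relies on application-default credentials

# Placeholder table; replace with your own project.dataset.table.
table_id = "my-project.game_analytics.raw_events"

rows = [
    {"player_id": "p1", "event": "level_complete", "revenue_usd": 0.0},
    {"player_id": "p2", "event": "purchase", "revenue_usd": 4.99},
]

# Streaming insert: rows become queryable within seconds, with no bulk load job.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Row-level insert errors:", errors)
```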

Flexibility in the choice of data

The vendor’s DWH should be flexible enough to seamlessly ingest different data types and structural states: structured, unstructured, and semi-structured. Example: Snowflake supports most data types, unlike AWS Redshift.

Maintenance

The data should be arranged automatically for efficient use of computing resources and storage, without external intervention. Example: Redshift requires vacuuming, in which tables are cleaned of empty, deleted rows to end up more compact and sorted, which improves the performance of the Redshift cluster; it also requires maintaining the sort order (a behavior inherited from PostgreSQL).
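As an illustration of the housekeeping Redshift expects, here is a minimal sketch that runs the vacuum and statistics refresh from Python via psycopg2; the connection details and table name are placeholders, and VACUUM must run outside a transaction block, hence the autocommit.

```python
import psycopg2

# Placeholder connection details for a Redshift cluster.
conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",
    port=5439,
    dbname="analytics",
    user="admin",
    password="...",
)
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    # Reclaim space from deleted rows, re-sort the table, then refresh planner
    # statistics -- maintenance that Snowflake and BigQuery largely handle for you.
    cur.execute("VACUUM FULL game_analytics.fact_events;")
    cur.execute("ANALYZE game_analytics.fact_events;")

conn.close()
```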

Ease of use

The DWH should be easy to query and should require no special knowledge to leverage the analytical capabilities the vendor provides. Example: Google BigQuery queries need to be optimized to avoid racking up high costs while pulling data, which requires SQL knowledge. AWS Redshift does not support parallel upload or parallel (concurrent user) querying and needs a systems manager.
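As an example of the query discipline BigQuery rewards, here is a minimal sketch that filters on the partition column and dry-runs the query to check how many bytes it would scan (and therefore roughly what it would cost) before actually running it; the table and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder table and partition column.
sql = """
    SELECT player_id, COUNT(*) AS sessions
    FROM `my-project.game_analytics.raw_events`
    WHERE event_date BETWEEN '2022-01-01' AND '2022-01-07'  -- prune partitions
    GROUP BY player_id
"""

# Dry run: BigQuery validates the query and reports the bytes it would scan,
# which is what on-demand pricing is based on, without executing it.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(sql, job_config=job_config)
print(f"Estimated scan: {job.total_bytes_processed / 1e9:.2f} GB")
```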

Scalability

The DWH should be able to scale computing and storage resources up or down without downtime. Example: in AWS Redshift, resizing a cluster requires reconfiguring the instance, which disrupts the service because storage and computing resources come from a single source.

BI tools

BI tools allow users to organize, transform, analyze, and visualize data to produce actionable insights.

The above diagram depicts the data visualisation flow along with the criteria to be observed while evaluating different vendors providing BI tools.

Vendor Landscape

The vendor landscape has numerous vendors, most of them running in the cloud. Since this carries a significant security risk, evaluate carefully and choose only vendors who benchmark the segment.

Major Vendors

Tableau, Looker, Mode, Periscope, Domo.

Evaluation Criteria

The standard features across all the major vendors center around core analytical and visualization capabilities and integrations with the significant DWH service providers. Primary Criteria for evaluating different vendors who provide BI tools should include:

  • Ease of Use
  • Flexibility
  • Price

Ease of Use

Whether any external dependency is required to leverage the tool’s full analytical capacity. Example: Mode and Periscope require SQL to perform data analytics, while Chartio can be used even by non-technical users.

Flexibility

The BI tool should be flexible with the most commonly occurring data types and with the data repository of choice. Example: Looker does not support .csv files or Excel spreadsheets directly; they must be converted externally through third-party CSV-to-SQL tools.
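As a sketch of that external CSV-to-SQL step (the file name, table name, and connection string are placeholders), pandas plus SQLAlchemy can push a spreadsheet export into a table the BI tool can then model:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder connection string; point this at the database your BI tool reads from.
engine = create_engine("postgresql://analytics_user:password@warehouse-host:5432/analytics")

# Load the spreadsheet export and write it into a SQL table the BI tool can see.
df = pd.read_csv("marketing_spend.csv")  # hypothetical file
df.to_sql("marketing_spend", engine, if_exists="replace", index=False)
```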

Price

We should consider the general pricing as well as multi-user capabilities. For example, Tableau’s tiers: $70 provides full analytics capability, $45 limits the analytics capability, and $15 provides only the view-only option.

My Recommendations

Client telemetry

Playfab — considering the robustness of supporting telemetry through both API calls and CloudScript. (It requires an extra step to send data back to our target storage; is that extra step worth it?)

Since most of the services provide a free trial, it would be best to try out each of them.

Data pipeline service

Matillion — this vendor is well known for its batch extraction and loading. If we choose an ELT data pipeline, the primary vendor to consider would be Fivetran.

Data warehouse service

Google BQ would be the best choice when working at a large scale with a game/data analytics team that can optimize queries; effort should be put into clustering and partitioning the data. The other vendors considered provide almost the same services at higher costs than Google BQ.

BI tools

Tableau would be the best choice for BI, primarily because of the price and ease of use. Domo can also be considered for its ease of use, despite the numerous negative after-sales reviews.

Best Practices:

  • Prefer ETL over ELT while setting up the DWH, since loading raw data and then transforming it inside the warehouse is resource-heavy and leads to higher costs and security concerns. (This mostly depends on the use case.)
  • Consider an agile approach instead of a Big Bang approach while transforming your data: keep the pipeline that turns data into actionable insights running as a regular task, rather than paying for long-term storage and transforming everything at once at the hour of need.
  • Define a CDC policy beforehand.
  • Materialize query results in partitions (since we don’t want to scan the entire database every time), and use incremental updates; see the sketch after this list.
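
To make the last two practices concrete, here is a minimal BigQuery sketch that appends only yesterday's aggregates into a date-partitioned summary table instead of recomputing everything; the project, dataset, table, and column names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination: a date-partitioned summary table.
destination = "my-project.game_analytics.daily_revenue"

job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,      # incremental update
    time_partitioning=bigquery.TimePartitioning(field="event_date"),
)

# Materialize only yesterday's slice; readers then query this small table
# (and its partitions) instead of scanning the raw events every time.
sql = """
    SELECT event_date, player_id, SUM(revenue_usd) AS revenue_usd
    FROM `my-project.game_analytics.raw_events`
    WHERE event_date = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
    GROUP BY event_date, player_id
"""
client.query(sql, job_config=job_config).result()  # wait for the job to finish
```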
