Game Analytics 101 — Setting up the data analytics infrastructure.
Overview
A data analytics module enables us to collect events from user devices into a centralized data repository, where the data can be analyzed, visualized, and reported on as metrics and charts. This serves the following objectives:
- Improve the gameplay experience
- Optimize monetization
- Understand player behavior
Objective:
The following services are necessary for setting up a data analytics module that would enable us to fulfill the objectives listed above:
- Client Telemetry
- Data Pipeline
- Data Warehouse
- BI tools
Building these services from scratch is both time- and resource-intensive, so acquiring third-party services is usually the better option, especially on a tight timeline. A careful evaluation of vendors against the functionality each service requires is therefore essential.
Client Telemetry
Client telemetry refers to capturing the user data and gameplay events we want to track on the device and sending them to the data warehouse.
Vendor Landscape
Most vendors offer client telemetry bundled with other backend services; vendors specializing in client telemetry for the gaming domain are few.
Major Vendors
Google Firebase, Azure PlayFab, Twilio Segment.
Evaluation Criteria
All the vendors offer real-time or near-real-time client telemetry as a standard feature. When choosing a vendor for client telemetry, the other significant characteristics to consider are:
- Data ingestion setup
- Flexible Integration
- Ease of use
Data ingestion setup
Setting up the back end to send the client payload to the vendor should be seamless, requiring minimal or no effort.
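As a rough illustration, the sketch below sends a single gameplay event to a hypothetical collector endpoint over HTTP. In practice a vendor SDK (Firebase, PlayFab, Segment) handles this step; the endpoint, field names, and event schema here are assumptions for illustration only.

```python
import time
import uuid

import requests  # third-party HTTP client, assumed to be installed

# Hypothetical collector endpoint; a real vendor SDK would replace this call.
COLLECTOR_URL = "https://collector.example.com/v1/events"

def send_event(player_id: str, event_name: str, properties: dict) -> None:
    """Package a single gameplay event and forward it to the collector."""
    payload = {
        "event_id": str(uuid.uuid4()),         # unique id, useful for dedup later
        "player_id": player_id,
        "event_name": event_name,              # e.g. "level_complete"
        "properties": properties,              # free-form event attributes
        "client_ts": int(time.time() * 1000),  # client-side timestamp (ms)
    }
    response = requests.post(COLLECTOR_URL, json=payload, timeout=5)
    response.raise_for_status()

send_event("player-123", "level_complete", {"level": 4, "duration_s": 312})
```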
Flexible Integration
The vendor should support exporting data in different forms, i.e., structured, semi-structured, and unstructured, and the exports should be compatible with the rest of your standardized data, since most vendors support a direct connection to your cloud data warehouse (CDW/DWH). Example: PlayFab does not support Google BigQuery as a CDW; it is chiefly limited to Azure and Snowflake.
Ease of use
The vendor should have an intuitive, easy-to-use dashboard that supports real-time monitoring and lets you define custom rules for events (a custom-rule engine). Example: Twilio Segment's dashboard seems to have a steep learning curve.
Data Pipeline
The data pipeline is a data integration layer that blends data from many sources and lets us load it efficiently into the DWH.
Vendor Landscape
The vendor landscape for setting up the data pipeline is limited to a few vendors, and most of them specialize in either ETL or ELT, but not both.
Major Vendors
Matillion, Alteryx, Stitch, Fivetran, and y42.
Evaluation Criteria
Standard features provided by all the vendors include automatic schema creation and incremental updates. The primary criteria for evaluating a data-pipeline vendor are:
- Ingestion Compatibility
- Pipeline Support
- Transformation Efficacy
- Operational Control
Ingestion Compatibility
The vendor should ingest data from data warehouses, data lakes, and other SaaS sources such as marketing applications. Data in multiple forms (structured, semi-structured) and from streaming sources should be supported.
Pipeline Support
The vendor should support the chosen pipelining method, i.e., ETL, ELT, or both. Example: Alteryx supports only ETL pipelines, while Stitch specializes in ELT pipelines.
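To make the distinction concrete, here is a minimal sketch of the two approaches, assuming a PostgreSQL-compatible warehouse reachable through SQLAlchemy; the connection string, table names, and column names are placeholders, and in practice the pipeline vendor handles this plumbing.

```python
import pandas as pd
from sqlalchemy import create_engine  # assumes SQLAlchemy plus a warehouse driver

# Placeholder connection string and table names.
engine = create_engine("postgresql://user:password@warehouse.example.com/analytics")

def run_etl(csv_path: str) -> None:
    """ETL: transform in the pipeline first, then load the clean result."""
    df = pd.read_csv(csv_path)
    df = df.drop_duplicates(subset="event_id")  # transform before loading
    df.to_sql("events_clean", engine, if_exists="append", index=False)

def run_elt(csv_path: str) -> None:
    """ELT: load raw rows as-is, then transform inside the warehouse."""
    pd.read_csv(csv_path).to_sql("events_raw", engine,
                                 if_exists="append", index=False)
    with engine.begin() as conn:
        conn.exec_driver_sql(
            "INSERT INTO events_clean "
            "SELECT DISTINCT ON (event_id) * FROM events_raw ORDER BY event_id"
        )
```

The only real difference is where the transformation happens: in the pipeline before loading (ETL) or inside the warehouse after loading (ELT).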
Transformation Efficacy
Data transformation, i.e., the cleaning, deduplication, and standardization of data values, should be done accurately and mapped into the chosen schema. Example: Stitch and Fivetran have limited transformation capabilities.
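A minimal sketch of what such a transformation step might look like in pandas, assuming illustrative column names (event_id, player_id, country, client_ts):

```python
import pandas as pd

def standardize_events(df: pd.DataFrame) -> pd.DataFrame:
    """Clean, deduplicate, and standardize raw events before mapping them
    into the target schema. Column names are illustrative."""
    df = df.drop_duplicates(subset="event_id")              # dedup on event id
    df = df.dropna(subset=["player_id"])                    # drop unusable rows
    df["country"] = df["country"].str.strip().str.upper()   # normalize country codes
    df["client_ts"] = pd.to_datetime(df["client_ts"], unit="ms", utc=True)
    return df
```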
Operational Control
The user should have the flexibility to set up concurrent connections to the DWH while loading, add new data sources, and schedule batches. For example, Matillion cannot add a new data source after the data layer is established, yet it is the only vendor that supports concurrent connections to the target repository while loading.
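For illustration, a sketch of fanning batch loads out over several concurrent warehouse connections; the source names are made up, and load_table() is a hypothetical stand-in for whatever the pipeline tool actually does per source.

```python
from concurrent.futures import ThreadPoolExecutor

# Illustrative source names only.
SOURCES = ["events_raw", "purchases_raw", "sessions_raw"]

def load_table(source: str) -> str:
    # ... open a warehouse connection and run the load for this source ...
    return f"{source} loaded"

# Fan the batch loads out over several concurrent warehouse connections.
with ThreadPoolExecutor(max_workers=3) as pool:
    for result in pool.map(load_table, SOURCES):
        print(result)
```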
Data Warehouse
A data warehouse is a data management system that enables the company to organize, analyze, and report on its information.
Vendor Landscape
A few vendors dominate the entire ecosystem built around the data warehouse service.
Major Vendors
AWS — Redshift, Google Cloud Platform — BigQuery, Snowflake
Evaluation Criteria
Standard features across all the vendors include materialized views, UDFs, caching, hot/cold storage, scheduling, and integrated BI within their ecosystems. Criteria for evaluating a DWH vendor are:
- Ingestion of streaming data
- Flexibility in the choice of data
- Maintenance
- Ease of use
- Scalability
Ingestion of streaming data
The DWH should ingest incoming streaming data (a continuous flow, unlike bulk uploads). For example, AWS Redshift has no built-in capability to ingest data streams, and even Snowflake only supports micro-batches via Snowpipe.
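As a sketch of what streaming ingestion looks like on the warehouse side, the snippet below uses the streaming-insert API of the official BigQuery Python client; the project, dataset, table, and row fields are assumptions.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery and credentials

client = bigquery.Client()
table_id = "my-project.game_analytics.events"  # placeholder table reference

rows = [
    {"player_id": "player-123", "event_name": "level_complete",
     "client_ts": "2024-01-15T10:21:04Z", "level": 4},
]

# Streaming insert: rows become queryable within seconds, with no batch load job.
errors = client.insert_rows_json(table_id, rows)
if errors:
    print("Streaming insert failed:", errors)
```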
Flexibility in the choice of data
The vendor's DWH should be flexible enough to seamlessly ingest different data types and structural states (structured, semi-structured, unstructured). Example: Snowflake supports most data types, unlike AWS Redshift.
Maintenance
The data should be arranged automatically for efficient use of compute and storage, without manual intervention. Example: Redshift requires vacuuming, where the table is cleared of empty and deleted rows to produce a more compact, sorted table that improves cluster performance; the sort order also has to be maintained manually (a behavior inherited from PostgreSQL).
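For reference, a sketch of the manual maintenance Redshift expects, run through psycopg2; the connection details and table name are placeholders.

```python
import psycopg2  # assumes psycopg2 and a reachable Redshift cluster

# Placeholder connection details.
conn = psycopg2.connect(host="my-cluster.redshift.amazonaws.com", port=5439,
                        dbname="analytics", user="admin", password="secret")
conn.autocommit = True  # VACUUM cannot run inside a transaction block

with conn.cursor() as cur:
    cur.execute("VACUUM FULL events;")  # reclaim deleted rows, restore sort order
    cur.execute("ANALYZE events;")      # refresh planner statistics afterwards
conn.close()
```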
Ease of use
The DWH should be easy to query and should require no specialized knowledge to leverage the analytical abilities the vendor provides. Example: Google BigQuery queries need to be optimized to avoid racking up high costs when pulling data, which requires SQL knowledge. AWS Redshift does not support parallel upload or parallel (concurrent-user) querying and needs a systems administrator.
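As an illustration of that optimization burden, the sketch below estimates a query's scan cost with a BigQuery dry run before executing it; the table and column names are assumptions.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery and credentials

client = bigquery.Client()

# Select only the needed columns and filter on the partition column so BigQuery
# prunes partitions instead of scanning the whole table (names are illustrative).
query = """
    SELECT player_id, event_name
    FROM `my-project.game_analytics.events`
    WHERE DATE(client_ts) = '2024-01-15'
"""

# A dry run reports the bytes that would be scanned (what BigQuery bills for)
# without actually executing the query.
job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
job = client.query(query, job_config=job_config)
print(f"Estimated scan: {job.total_bytes_processed / 1e9:.2f} GB")
```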
Scalability
The DWH should be able to scale compute and storage resources up or down without downtime. Example: In AWS Redshift, resizing a cluster requires reconfiguring the instances, which disrupts service because storage and compute are coupled in a single resource.
BI tools
BI tools allow users to organize, transform, analyze, and visualize data to produce actionable insights.
Vendor Landscape
The landscape has numerous vendors, most of them cloud-based. Because the security risk is significant, evaluate carefully and choose only vendors that set the benchmark for the segment.
Major Vendors
Tableau, Looker, Mode, Periscope, Domo.
Evaluation Criteria
The standard features across all the major vendors center on core analytical and visualization capabilities and integrations with the major DWH service providers. The primary criteria for evaluating BI-tool vendors are:
- Ease of Use
- Flexibility
- Price
Ease of Use
Whether any external dependency (such as SQL expertise) is required to leverage the full analytics capability. Example: Mode and Periscope require SQL to perform data analytics, while Chartio can be used even by non-technical users.
Flexibility
The BI tool should be flexible with the most common data formats and the data repository of choice. Example: Looker does not support .csv files or Excel spreadsheets directly; they have to be converted externally through third-party CSV-to-SQL tools.
Price
We should consider the general pricing as well as per-user tiers and capabilities. For example, with Tableau, the $70 tier provides full analytics capability, the $45 tier limits the analytics capability, and the $15 tier is view-only.
My Recommendations
Client telemetry
PlayFab — considering the robustness of supporting telemetry through both API calls and CloudScript. (It does require an extra step to send the data back to our target storage; is that worth it?)
Since most of these services provide a free trial, it would be worthwhile to try out each of them.
Data pipeline service
Matillion — well known for its batch extraction and loading. If we choose an ELT data pipeline instead, the primary vendor to consider would be Fivetran.
Data warehouse service
Google BigQuery would be the best choice when working at large scale with a game/data analytics team that can optimize queries; effort should go into clustering and partitioning the data. The other vendors provide almost the same services at a higher cost than Google BigQuery.
BI tools
Tableau would be the best choice for BI, primarily because of its price and ease of use. Domo can also be considered for its ease of use, despite the numerous negative after-sales reviews.
Best Practices:
- Prefer ETL over ELT when setting up the DWH, since loading and then transforming inside the warehouse is resource-heavy and leads to higher costs and security concerns. (This mostly depends on the use case.)
- Consider an agile approach instead of a big-bang approach when transforming your data: keep a pipeline that turns data into actionable insights running as a regular task, rather than paying for long-term storage and transforming everything at once at the hour of need.
- Define a CDC (change data capture) policy beforehand.
- Materialize query results in partitions (since we do not want to scan the entire database every time), and use incremental updates; a minimal sketch follows below.
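A minimal sketch of the last point, assuming BigQuery and illustrative project, table, and column names: materialize an aggregate into a date-partitioned table once, then refresh only the most recent partition incrementally.

```python
from google.cloud import bigquery  # assumes google-cloud-bigquery and credentials

client = bigquery.Client()  # project, dataset, and column names are illustrative

# Materialize a daily aggregate into a date-partitioned table once...
client.query("""
    CREATE TABLE IF NOT EXISTS `my-project.game_analytics.daily_revenue`
    (day DATE, revenue FLOAT64)
    PARTITION BY day
""").result()

# ...then refresh only yesterday's partition instead of recomputing everything.
client.query("""
    MERGE `my-project.game_analytics.daily_revenue` t
    USING (
        SELECT DATE(client_ts) AS day, SUM(amount) AS revenue
        FROM `my-project.game_analytics.purchases`
        WHERE DATE(client_ts) = DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
        GROUP BY day
    ) s
    ON t.day = s.day
    WHEN MATCHED THEN UPDATE SET revenue = s.revenue
    WHEN NOT MATCHED THEN INSERT (day, revenue) VALUES (s.day, s.revenue)
""").result()
```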