The engineering team initially left pipelines such as the legacy Scalding data aggregation pipelines unchanged; these were designed to run in Twitter's own data centres. However, the batch layer's output was redirected to separate storage on Google Cloud. The company used BigQuery, Google's serverless, highly scalable data warehouse, to store the transcoded output aggregations from the Scalding pipelines.
A Dataflow pipeline read the data from BigQuery and applied simple, light transformations. Cloud Bigtable, Google's low-latency, fully managed NoSQL database, served as the backend for online dashboards and consumer APIs.
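The shape of such a "light transformation" can be sketched in plain Python (no Beam dependency): reshaping aggregation rows, as they might arrive from BigQuery, into keyed records that a Bigtable-backed dashboard could look up cheaply. All field names here (`user_id`, `day`, `metric`, `value`) are illustrative assumptions, not Twitter's actual schema.

```python
def to_bigtable_rows(rows):
    """Key each aggregate by entity and day so point lookups are cheap.

    Hypothetical sketch: in the real pipeline this logic would run as a
    Dataflow transform between a BigQuery read and a Bigtable write.
    """
    out = {}
    for row in rows:
        # Bigtable-style row key combining the entity id and the date,
        # so a dashboard can fetch one user's daily metrics in one read.
        key = f"{row['user_id']}#{row['day']}"
        out.setdefault(key, {})[row["metric"]] = row["value"]
    return out

rows = [
    {"user_id": "u1", "day": "2019-06-01", "metric": "impressions", "value": 120},
    {"user_id": "u1", "day": "2019-06-01", "metric": "engagements", "value": 7},
    {"user_id": "u2", "day": "2019-06-01", "metric": "impressions", "value": 45},
]
print(to_bigtable_rows(rows))
# → {'u1#2019-06-01': {'impressions': 120, 'engagements': 7}, 'u2#2019-06-01': {'impressions': 45}}
```

The point of the row-key design is that the serving store only ever answers point lookups or small scans, which is what keeps the dashboards and consumer APIs low-latency.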
After the successful rollout of the first iteration, the team started redesigning the rest of the data analytics pipeline on Google Cloud. Twitter chose Apache Beam for its deep integration with other Google Cloud products.
A BigQuery slot is a unit of computational capacity used to execute SQL queries. The platform team re-implemented the batch layer: data was first staged from on-prem HDFS to Cloud Storage. A batch Dataflow job then regularly loaded the data from Cloud Storage and performed the aggregations, writing the results to BigQuery.
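The aggregation step of that batch job can be sketched in plain Python: group raw events by key and count them, producing rows in the shape a BigQuery load could accept. The event schema (`day`, `type`) is a hypothetical stand-in for the staged data, and in production this grouping would run as a Beam GroupByKey/Combine inside Dataflow rather than in-memory.

```python
from collections import defaultdict

def aggregate_events(events):
    """Count events per (day, event type) — a minimal stand-in for the
    batch aggregation a Dataflow job might perform over staged files.
    """
    counts = defaultdict(int)
    for e in events:
        counts[(e["day"], e["type"])] += 1
    # Emit rows in a flat shape a BigQuery load job could accept.
    return [
        {"day": day, "type": typ, "count": n}
        for (day, typ), n in sorted(counts.items())
    ]

events = [
    {"day": "2019-06-01", "type": "like"},
    {"day": "2019-06-01", "type": "like"},
    {"day": "2019-06-01", "type": "retweet"},
]
print(aggregate_events(events))
# → [{'day': '2019-06-01', 'type': 'like', 'count': 2}, {'day': '2019-06-01', 'type': 'retweet', 'count': 1}]
```

Staging to Cloud Storage first, rather than reading HDFS directly, decouples the on-prem export from the cloud-side aggregation, so either side can be rerun independently.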
The primary motivation behind the migration to GCP was the democratisation of data analysis. Visualisation and secure machine learning were top priorities for Twitter at the time, and Google's BigQuery and Data Studio helped facilitate them.