Data Science

Catching up with Google BigQuery

With the ink just drying on Google's recently closed acquisition of Looker, all eyes are turning to BigQuery and to Google's plans for expanding the platform's footprint. Feeding the anticipation is the fact that GCP's cloud data warehousing rivals, including Microsoft, Oracle, and SAP, have recently expanded the scope of their offerings to include either back-end data integration or front-end self-service BI visualization. While Google affirms that Looker will retain its multi-cloud platform support, within GCP, BigQuery appears to be the logical target for enhanced integration.

We recently sat in on an update call that reviewed recently introduced features, ranging from the general availability of Redshift and S3 migration tooling and the in-memory BI Engine to beta releases of Flex Slots and column-level security. BigQuery has been one of GCP's fastest-growing services, with the customer base having grown significantly over the past 18 months and, more importantly, with the number of large flat-rate (as opposed to a la carte per-query) customers doubling over the past year. In a just-published blog post, Google pointed to large wins with customers such as KeyBank, Wayfair, Lowe's, Sabre, and Lufthansa.

BigQuery stands out in that, unlike most cloud data warehousing services, it is serverless. Traditionally, you used it on an ad hoc basis and didn't worry about provisioning nodes, although slot pricing was later introduced to make BigQuery costs more predictable for large-scale users. Serverless is also useful for handling high-concurrency scenarios, with Google claiming that some BigQuery users have run up to 10,000 queries at once.

A typical scenario for BigQuery adoption is leveraging the platform's scale, both in terms of data volumes (petabyte-scale queries are not unusual) and high concurrency. While there's no equivalent of the CAP Theorem when it comes to scale vs. concurrency in analytic databases, for most data warehousing platforms, it's usually a choice between one or the other.

Originally the outgrowth of Google's log processing system, BigQuery is built on the Dremel query engine, which is also the engine that inspired Apache Drill. BigQuery can store a variety of data going beyond typical relational structured data to formats such as Parquet, JSON, or CSV, and it can use cloud object storage as a source. While such extensibility is not unusual today among cloud data warehousing platforms, BigQuery was one of the first to offer it.

BigQuery originally did not resemble a typical data warehouse, as it worked best when data was organized in nested structures that, at first blush, look more like JSON documents than typical SQL relational or star schemas. Since then, Google claims that BigQuery has evolved so it can now work efficiently with more traditional data warehouse schemas.
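To make the contrast between a nested layout and a flat relational one concrete, here is a minimal Python sketch. The order record, its field names, and the `flatten_order` helper are all hypothetical and purely illustrative; they are not taken from Google's materials. The nested form keeps line items inside the order, while the flattened form repeats the order and customer attributes on every row, roughly what unnesting a repeated field produces.

```python
# Hypothetical nested record: one order with an embedded customer
# and a repeated list of line items (document-style layout).
order = {
    "order_id": 1001,
    "customer": {"id": 7, "region": "EMEA"},
    "items": [
        {"sku": "A-1", "qty": 2},
        {"sku": "B-4", "qty": 1},
    ],
}


def flatten_order(order):
    """Return one flat row per line item, repeating the parent fields."""
    rows = []
    for item in order["items"]:
        rows.append(
            {
                "order_id": order["order_id"],
                "customer_id": order["customer"]["id"],
                "region": order["customer"]["region"],
                "sku": item["sku"],
                "qty": item["qty"],
            }
        )
    return rows


rows = flatten_order(order)
for row in rows:
    print(row)
```

The nested form avoids a join between orders and line items at query time; the flat form is what a conventional relational schema would typically store in a separate table.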

So, customers are likely to need some help when moving data to BigQuery given its unique layout. Partners such as Datometry and CompilerWorks have developed migration tools for moving workloads without having to rewrite queries. Informatica has developed a no-code/low-code BigQuery integration tool that includes a six-step wizard to guide less technical business users through the process. In turn, global SIs such as Accenture, Infosys, and Wipro have developed migration tooling as part of their own BigQuery practices. Google recently expanded its partnership with SADA Systems, a global consulting and managed services provider specializing in cloud that was also one of GCP's original partners. The two have re-upped with a $500 million agreement that will include support for migrations from Netezza, Teradata, and Hadoop to BigQuery.

When it comes to tooling, Google subscribes to a coopetition model; over the past year, it has made several acquisitions. At last year's NEXT, Google announced Cloud Data Fusion, which resulted from its acquisition of the company behind the open source CDAP project and which runs data transformation pipelines inside Google Cloud Dataproc, GCP's Hadoop service. Subsequently, Google acquired Alooma, which instead uses a staging server approach that is akin to the AWS and Azure database migration services. While these offerings help round out GCP's portfolio, we don't expect Google, as the upstart in the cloud platform ecosystem, to sell these services aggressively in competition with its partners.

One of the key selling points for cloud data platform providers is tapping the synergies across their portfolios. BigQuery's federated query story is expected to go GA soon. Today it can reach into Cloud SQL (GCP's MySQL and PostgreSQL services) and Bigtable (the NoSQL database that was the inspiration for Hadoop's HBase). We believe that down the road, GCP will add Spanner to that list.
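As a rough illustration of what a federated query against Cloud SQL looks like, the sketch below builds a statement around BigQuery's `EXTERNAL_QUERY` function, which takes a connection ID and the query to run on the external database. The helper name, the connection ID, and the `orders` table are hypothetical; the code only assembles the SQL text and does not contact any service.

```python
def federated_query_sql(connection_id, inner_query):
    """Build a BigQuery statement that pushes inner_query to an external database.

    This sketch does not escape quotes, so it rejects inner queries
    that contain single quotes instead of guessing at escaping rules.
    """
    if "'" in inner_query:
        raise ValueError("inner_query must not contain single quotes in this sketch")
    return (
        "SELECT *\n"
        f"FROM EXTERNAL_QUERY('{connection_id}', '{inner_query}')"
    )


# Hypothetical connection ID and source table, for illustration only.
sql = federated_query_sql(
    "my-project.us.orders-mysql",
    "SELECT id, total FROM orders",
)
print(sql)
```

The inner query runs in the source database's own SQL dialect, and its result comes back to BigQuery as a temporary table that can be joined with native tables.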

BigQuery has also gotten its feet wet with machine learning (ML) by making it more accessible to SQL developers. Typically, these capabilities enable developers to train and run ML models without having to write Python or R code. For BigQuery, they now support a variety of model types: linear regression (for predicting numerical values); K-means clustering (for customer segmentation); matrix factorization (in Alpha, for recommender systems); XGBoost (for regression, classification, and ranking); Deep Neural Networks (using TensorFlow); and others.
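For a sense of what training a model from SQL involves, the following sketch assembles a BigQuery ML `CREATE MODEL` statement for a linear regression. The `create_model_sql` helper, the dataset, the table, and the column names are hypothetical and chosen only for illustration; the code builds the SQL string and does not submit it to BigQuery.

```python
def create_model_sql(model, model_type, label, source_table, features):
    """Build a BigQuery ML CREATE MODEL statement as a string."""
    # The training query selects the feature columns plus the label column.
    columns = ", ".join(list(features) + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = '{model_type}', input_label_cols = ['{label}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


# Hypothetical dataset, table, and columns, for illustration only.
sql = create_model_sql(
    model="demo_dataset.trip_fare_model",
    model_type="linear_reg",
    label="fare",
    source_table="demo_dataset.trips",
    features=["distance_km", "duration_min"],
)
print(sql)
```

The point of the design is that the training data never leaves the warehouse: the `SELECT` that feeds the model is an ordinary query, and the resulting model is stored alongside the tables in the same dataset.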

Google is hardly alone here: the ability to trigger ML models from SQL code so they can run inside the database without having to move data is starting to become a checkbox item. But most of the others (e.g., Amazon Redshift, Oracle, and SQL Server on-premises) typically treat R or Python programs used for ML as user-defined functions, whereas BigQuery stores the models within the data sets themselves. Also, BigQuery's serverless architecture has made the platform better suited for training models compared to most cloud data warehousing services.

So, what's next? We expect that the obvious answer is how Google will blend Looker's data integration and visualization capabilities into its broader data platform offering. With Microsoft recently unveiling Synapse, which places Azure Data Factory under a common service, Oracle extending its autonomous data warehouse to incorporate self-service data integration tools, and SAP expanding its HANA Data Warehouse to use the analytics of SAP Analytics Cloud, Looker and BigQuery look increasingly like they're made for each other. But we're also interested in seeing whether Google will designate BigQuery as one of the services that could get supported under its Anthos hybrid platform. We expect we'll get some of those answers in April at Google NEXT.