Apache Spark on Google Cloud

Unlock Spark's full potential on Google Cloud. Choose serverless ease or cluster control, boosted by high-speed processing, AI assistance, and seamless open lakehouse connectivity.

Benefits

Increase developer productivity and get faster data insights

Seamless Spark for all data users

Run Spark alongside BigQuery, Vertex AI, and your IDE, on either serverless Spark or managed clusters. Eliminate custom integrations, streamline ETL-to-ML workflows, and boost productivity with Gemini assistance for code and operations.


Operational simplicity with serverless Spark

Google Cloud Serverless for Apache Spark offers instant autoscaling and near-zero configuration. Get a 3.6x query performance boost* with Lightning Engine (Preview). Dataplex Universal Catalog unifies metadata, simplifying operations.

Run Spark your preferred way

One size does not fit all. Google Cloud gives you the flexibility to choose between serverless, managed clusters, and compute clusters for your Spark workloads.

Key features

Powerful ways to run Spark on Google Cloud

Google Cloud Serverless for Apache Spark

Use Google Cloud Serverless for Apache Spark to boost productivity and performance with Lightning Engine* and Gemini. It provides a deeply integrated environment for running Apache Spark and SQL workloads directly from BigQuery, with unified security, runtime metadata through BigLake metastore, and governance through Dataplex Universal Catalog. Integrated CI/CD and Gemini in notebooks raise productivity, and there are no Apache Spark clusters to manage.

* The queries are derived from the TPC-DS standard and TPC-H standard and as such are not comparable to published TPC-DS standard and TPC-H standard results, as these runs do not comply with all requirements of the TPC-DS standard and TPC-H standard specification.

Managed Spark, Hadoop, and OSS clusters with Dataproc

Dataproc is a fully managed, highly scalable service for deploying and operating dedicated Spark and Hadoop clusters along with an ecosystem of 30+ open source tools. Its integration with other Google Cloud products and services, including Lightning Engine for Dataproc on Google Compute Engine (premium tier), makes it well suited to data lake modernization, efficient ETL pipelines, and secure, large-scale data science initiatives where cluster control is paramount.

Data Science with Apache Spark on Google Cloud

Whether you prefer the zero-ops simplicity of Google Cloud Serverless for Apache Spark or the control of managed Dataproc clusters, you can accelerate your entire machine learning life cycle. Benefit from:

  • Seamless Integration: Connect effortlessly with BigQuery for data access and Vertex AI for MLOps, building end-to-end data science pipelines.
  • Developer Productivity: Leverage Gemini for coding insights and assistance in notebook environments like BigQuery Studio and Vertex AI Workbench.
  • AI/ML Readiness: Utilize pre-packaged ML libraries and GPU acceleration available with both serverless Spark and Dataproc clusters for demanding training and inference tasks.
  • Faster Iteration: Focus on development and experimentation rather than infrastructure, whichever option you choose.
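
The BigQuery integration in the list above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes a PySpark runtime where the spark-bigquery connector is available, and the helper names (`bigquery_read_options`, `load_bigquery_table`, `top_words`) are hypothetical. The aggregation assumes the schema of the public `bigquery-public-data.samples.shakespeare` table (`word`, `word_count`), used here only as a stand-in for your own data.

```python
def bigquery_read_options(table):
    """Reader options for the spark-bigquery connector.

    `table` is a BigQuery table reference such as 'project.dataset.table'.
    """
    return {"table": table}


def load_bigquery_table(spark, table):
    """Return a DataFrame backed by a BigQuery table.

    `spark` is an existing SparkSession (assumption: the connector is
    already on the classpath of the serverless session or cluster).
    """
    reader = spark.read.format("bigquery")
    return reader.options(**bigquery_read_options(table)).load()


def top_words(df, limit=10):
    """Total occurrences per word, highest first.

    Assumes columns `word` and `word_count`, as in the public
    shakespeare sample table.
    """
    return (
        df.groupBy("word")
        .agg({"word_count": "sum"})                     # column is named sum(word_count)
        .withColumnRenamed("sum(word_count)", "total")  # give it a friendlier name
        .orderBy("total", ascending=False)
        .limit(limit)
    )
```

In a notebook or job that already has a `spark` session, something like `top_words(load_bigquery_table(spark, "bigquery-public-data.samples.shakespeare")).show()` would print the most frequent words.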

Spark through Vertex AI

Develop and operationalize Spark for data science seamlessly with Vertex AI. Use Spark from Vertex AI Workbench for interactive development with built-in security and Gemini assistance. Integrate Spark processing into Vertex AI Pipelines for robust MLOps.

Open source table format support for your lakehouse

Google Cloud's Spark offerings provide robust compatibility with open source formats like Apache Iceberg, Delta Lake, and Hudi. Leverage BigLake metastore or Dataproc Metastore for unified metadata management across formats, enabling an open lakehouse architecture where you can process data with your choice of Spark engine.
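
As one possible illustration of the open lakehouse setup described above, the sketch below builds the Spark properties that register an Apache Iceberg catalog backed by a Hive Metastore compatible endpoint, which is the kind of interface Dataproc Metastore exposes. The helper names, the catalog name, the metastore URI, and the warehouse path are all hypothetical placeholders, and the Iceberg Spark runtime is assumed to be available to the job. BigLake metastore may need different catalog settings, which are not shown here.

```python
ICEBERG_EXTENSIONS = (
    "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions"
)


def iceberg_catalog_conf(catalog, metastore_uri, warehouse):
    """Spark properties for an Iceberg catalog that uses a Hive Metastore.

    catalog       -- name used in SQL, e.g. SELECT * FROM <catalog>.db.table
    metastore_uri -- thrift endpoint of the metastore (placeholder value)
    warehouse     -- root path for table data, e.g. a gs:// location
    """
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        "spark.sql.extensions": ICEBERG_EXTENSIONS,
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hive",
        f"{prefix}.uri": metastore_uri,
        f"{prefix}.warehouse": warehouse,
    }


def apply_conf(builder, conf):
    """Apply each property to a SparkSession builder and return the builder."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder
```

With PySpark installed, `apply_conf(SparkSession.builder, iceberg_catalog_conf("lake", "thrift://<metastore-host>:9083", "gs://<bucket>/warehouse")).getOrCreate()` would yield a session in which tables are addressed as `lake.<database>.<table>`. The same properties can be passed to either a serverless batch or a cluster job, which is what lets the engine choice stay independent of the table format.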


Apache Spark is a trademark of The Apache Software Foundation.
