Build modern data lakehouses on Google Cloud with Apache Iceberg and Apache Spark


Sponsored material


The big data analytics landscape is constantly evolving, with organizations seeking more flexible, scalable, and cost-effective ways to store and analyze massive amounts of data. This demand has driven the rise of the data lakehouse paradigm, which combines the low-cost storage and flexibility of data lakes with the data management capabilities and transactional consistency of data warehouses. At the center of this shift are open table formats like Apache Iceberg and powerful processing engines like Apache Spark, all backed by Google Cloud's robust infrastructure.

The rise of Apache Iceberg: a game changer for data lakes

For years, data lakes built on cloud object storage such as Google Cloud Storage (GCS) have offered unmatched scale and cost efficiency. However, they often lacked key features found in traditional data warehouses, such as transactional consistency, schema evolution, and performance optimizations for analytical queries. This is where Apache Iceberg shines.

Apache Iceberg is an open table format designed to address these limitations. It sits on top of your data files (such as Parquet, ORC, or Avro) in cloud storage, providing a metadata layer that turns a collection of files into a high-performance, SQL-like table. Here is what makes Iceberg so powerful:

  • ACID compliance: Iceberg brings atomicity, consistency, isolation, and durability (ACID) properties to your data lake. This means data writes are transactional, ensuring data integrity even under concurrent operations. No more partial or inconsistent writes.
  • Schema evolution: A major pain point of traditional data lakes is managing schema changes. Iceberg handles schema evolution seamlessly, allowing you to add, drop, rename, or reorder columns without rewriting the underlying data. This is essential as your data evolves.
  • Hidden partitioning: Iceberg manages partitioning intelligently, abstracting away the physical layout of your data. Users no longer need to know the partitioning scheme to write efficient queries, and you can evolve your partitioning strategy over time without migrating data.
  • Time travel and rollback: Iceberg tables maintain a complete history of snapshots. This enables "time travel" queries, letting you query data as it existed at any point in the past. It also provides rollback capabilities, so you can revert a table to a previous good state, which is invaluable for debugging and data recovery (see the sketch after this list).
  • Performance optimization: Iceberg's rich metadata allows query engines to prune irrelevant data files and partitions efficiently, significantly speeding up query processing. It avoids expensive file listing operations by jumping directly to the relevant data based on its metadata.
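To make the last two bullets concrete, here is a minimal, hypothetical sketch using Iceberg's Spark SQL syntax; the catalog name lakehouse, the table db.events, the new column, and the snapshot ID are illustrative assumptions rather than values from this article.


SQL

-- Query the table as it existed at a past point in time (time travel).
SELECT * FROM lakehouse.db.events TIMESTAMP AS OF '2025-01-01 00:00:00';

-- Evolve the schema without rewriting the underlying data files.
ALTER TABLE lakehouse.db.events ADD COLUMNS (country STRING);

-- Roll the table back to an earlier snapshot (requires the Iceberg
-- Spark SQL extensions); the snapshot ID is a placeholder.
CALL lakehouse.system.rollback_to_snapshot('db.events', 1234567890123456789);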

By bringing these data warehouse features to the data lake, Apache Iceberg enables the creation of a true "data lakehouse," offering the best of both worlds: the flexibility and cost-effectiveness of cloud storage combined with the reliability and performance of structured tables.

BigLake tables for Apache Iceberg in BigQuery

Google Cloud's BigLake tables for Apache Iceberg in BigQuery offer a fully managed table experience similar to standard BigQuery tables, but all of the data is stored in customer-owned storage buckets. Supported features include:

  • Table mutations via GoogleSQL data manipulation language (DML)
  • Unified batch and high-throughput streaming writes using the Storage Write API through BigLake connectors such as Spark
  • Iceberg V2 snapshot export and automatic refresh on each table mutation
  • Schema evolution to update column metadata
  • Automatic storage optimization
  • Time travel for historical data access
  • Column-level security and data masking

Here is an example of how to create an empty BigLake Iceberg table using GoogleSQL:


SQL

CREATE TABLE PROJECT_ID.DATASET_ID.my_iceberg_table (
  name STRING,
  id INT64
)
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  file_format = 'PARQUET',
  table_format = 'ICEBERG',
  storage_uri = 'gs://BUCKET/PATH');

You can then import data into the table using LOAD DATA INTO to load from a file, or INSERT INTO to copy from another table:


SQL

# Load from file
LOAD DATA INTO PROJECT_ID.DATASET_ID.my_iceberg_table
FROM FILES (
  uris = ['gs://bucket/path/to/data'],
  format = 'PARQUET');

# Load from table
INSERT INTO PROJECT_ID.DATASET_ID.my_iceberg_table
SELECT name, id
FROM PROJECT_ID.DATASET_ID.source_table
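Since the managed tables also support the DML mutations and time travel listed above, here is a hedged sketch of what those operations could look like in GoogleSQL against the table created earlier; the predicates and the one-hour offset are illustrative values, not part of the original example.


SQL

# Mutate rows in place with GoogleSQL DML.
UPDATE PROJECT_ID.DATASET_ID.my_iceberg_table
SET name = 'updated_name'
WHERE id = 1;

DELETE FROM PROJECT_ID.DATASET_ID.my_iceberg_table
WHERE id = 2;

# Read the table as it looked one hour ago (time travel).
SELECT name, id
FROM PROJECT_ID.DATASET_ID.my_iceberg_table
FOR SYSTEM_TIME AS OF TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 1 HOUR);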

In addition to the fully managed offering, Apache Iceberg is also supported as a read-only external table in BigQuery. Use this to point to an existing path that already contains data files.


SQL

CREATE OR REPLACE EXTERNAL TABLE PROJECT_ID.DATASET_ID.my_external_iceberg_table
WITH CONNECTION PROJECT_ID.REGION.CONNECTION_ID
OPTIONS (
  format = 'ICEBERG',
  uris = ['gs://BUCKET/PATH/TO/DATA'],
  require_partition_filter = FALSE);
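Because this external table is read-only from BigQuery's side, you interact with it purely through queries; for example, using the same placeholders as above:


SQL

SELECT *
FROM PROJECT_ID.DATASET_ID.my_external_iceberg_table
LIMIT 10;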

Apache Spark: The engine for data lakehouse analytics

While Apache Iceberg provides the structure and management for your data lakehouse, Apache Spark is the processing engine that brings it to life. Spark is a powerful, open source, distributed processing system known for its speed, versatility, and ability to handle diverse data workloads. Spark's in-memory processing, robust ecosystem of tools including ML and SQL-based processing, and deep Iceberg support make it an excellent choice.

Apache Spark is deeply integrated into the Google Cloud ecosystem. The benefits of using Apache Spark on Google Cloud include:

  • Access a true serverless Spark experience with no cluster management using Google Cloud Serverless for Apache Spark.
  • Get a fully managed Spark experience with flexible cluster configuration and management through Dataproc.
  • Accelerate Spark jobs with the new Lightning Engine for Apache Spark preview feature.
  • Configure your runtime with GPUs and drivers.
  • Run AI/ML jobs with a robust set of libraries available by default in Spark runtimes, including XGBoost, PyTorch, and Transformers.
  • Write PySpark code directly in BigQuery Studio via Colab Enterprise notebooks, with Gemini-powered PySpark code generation.
  • Easily connect to your data in BigQuery native tables, BigLake Iceberg tables, external tables, and GCS.
  • Integrate with Vertex AI for end-to-end MLOps.

Iceberg + Spark: Better together

Together, Iceberg and Spark form a powerful combination for building performant, reliable data lakehouses. Spark can leverage Iceberg's metadata to optimize query plans, perform efficient data pruning, and ensure transactional consistency across your data lake.

Your Iceberg tables and BigQuery native tables are accessible through BigLake metastore. This exposes your tables to open source engines that stay in sync with BigQuery, including Spark.


Python

from pyspark.sql import SparkSession

# Create a Spark session configured with the BigLake metastore catalog
spark = SparkSession.builder \
    .appName("BigLake Metastore Iceberg") \
    .config("spark.sql.catalog.CATALOG_NAME", "org.apache.iceberg.spark.SparkCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.catalog-impl", "org.apache.iceberg.gcp.bigquery.BigQueryMetastoreCatalog") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_project", "PROJECT_ID") \
    .config("spark.sql.catalog.CATALOG_NAME.gcp_location", "LOCATION") \
    .config("spark.sql.catalog.CATALOG_NAME.warehouse", "WAREHOUSE_DIRECTORY") \
    .getOrCreate()
spark.conf.set("viewsEnabled", "true")

# Use the BigLake metastore catalog and dataset
spark.sql("USE `CATALOG_NAME`;")
spark.sql("USE NAMESPACE DATASET_NAME;")

# Configure a dataset for the connector's temporary query results
spark.sql("CREATE NAMESPACE IF NOT EXISTS MATERIALIZATION_NAMESPACE")
spark.conf.set("materializationDataset", "MATERIALIZATION_NAMESPACE")

# List the tables in the dataset
df = spark.sql("SHOW TABLES;")
df.show()

# Query the tables
sql = """SELECT * FROM DATASET_NAME.TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
sql = """SELECT * FROM DATASET_NAME.ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()

sql = """SELECT * FROM DATASET_NAME.READONLY_ICEBERG_TABLE_NAME"""
df = spark.read.format("bigquery").load(sql)
df.show()
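Beyond reading through the BigQuery connector, the same catalog can be used to create and populate Iceberg tables directly from Spark SQL. The following is a minimal sketch, run statement by statement via spark.sql() in the session above; the table name orders and its columns are hypothetical, while CATALOG_NAME and DATASET_NAME are the same placeholders used earlier.


SQL

-- Run each statement via spark.sql() in the session configured above.
CREATE TABLE IF NOT EXISTS `CATALOG_NAME`.DATASET_NAME.orders (
  order_id BIGINT,
  customer STRING,
  amount DOUBLE
) USING iceberg;

INSERT INTO `CATALOG_NAME`.DATASET_NAME.orders VALUES (1, 'alice', 20.5);

SELECT * FROM `CATALOG_NAME`.DATASET_NAME.orders;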

BigLake metastore extends this functionality with the Iceberg REST catalog (in preview), giving any data processing engine access to your Iceberg data. Here is how to connect to it using Spark:


Python

import google.auth
from google.auth.transport.requests import Request
from google.oauth2 import service_account
import pyspark
from pyspark.context import SparkContext
from pyspark.sql import SparkSession

# Fill in the catalog name, REST catalog endpoint, OAuth settings, token,
# project, and warehouse bucket for your environment.
catalog = ""
spark = SparkSession.builder.appName("") \
    .config("spark.sql.defaultCatalog", catalog) \
    .config(f"spark.sql.catalog.{catalog}", "org.apache.iceberg.spark.SparkCatalog") \
    .config(f"spark.sql.catalog.{catalog}.type", "rest") \
    .config(f"spark.sql.catalog.{catalog}.uri", "") \
    .config(f"spark.sql.catalog.{catalog}.warehouse", "gs://") \
    .config(f"spark.sql.catalog.{catalog}.token", "") \
    .config(f"spark.sql.catalog.{catalog}.oauth2-server-uri", "") \
    .config(f"spark.sql.catalog.{catalog}.header.x-goog-user-project", "") \
    .config("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions") \
    .config(f"spark.sql.catalog.{catalog}.io-impl", "org.apache.iceberg.hadoop.HadoopFileIO") \
    .config(f"spark.sql.catalog.{catalog}.rest-metrics-reporting-enabled", "false") \
    .getOrCreate()
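Once the session is created, the REST catalog behaves like any other Iceberg catalog from Spark's point of view. As a brief, hypothetical usage sketch (NAMESPACE_NAME and TABLE_NAME stand in for objects that already exist in your catalog), you could run the following through spark.sql():


SQL

-- List what the REST catalog exposes, then query a table.
SHOW NAMESPACES;
SHOW TABLES IN NAMESPACE_NAME;
SELECT * FROM NAMESPACE_NAME.TABLE_NAME LIMIT 10;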

Completing the lakehouse

Google Cloud provides a comprehensive suite of services that complement Apache Iceberg and Apache Spark, enabling you to easily build, manage, and scale your data lakehouse while continuing to use the open source technologies you already rely on:

  • Dataplex Universal Catalog: Dataplex Universal Catalog provides a unified data fabric to manage, monitor, and govern your data across data lakes, data warehouses, and data marts. It integrates with BigLake metastore, ensuring governance policies are consistently enforced across your Iceberg tables and enabling capabilities such as semantic search, data lineage, and data quality checks.
  • Google Cloud Managed Service for Apache Kafka: Run fully managed Kafka clusters on Google Cloud, including Kafka Connect. Data streams can be read directly into BigQuery, including managed Iceberg tables, with low-latency reads.
  • Cloud Composer: A fully managed workflow orchestration service built on Apache Airflow.
  • Vertex AI: Use Vertex AI to manage the complete, end-to-end MLOps experience. You can also use Vertex AI Workbench for a managed JupyterLab experience that connects to your serverless Spark and Dataproc instances.

Conclusion

The combination of Apache Iceberg and Apache Spark on Google Cloud offers a compelling solution for building modern, high-performance data lakehouses. Iceberg provides the transactional consistency, schema evolution, and performance optimizations that were historically missing from data lakes, while Spark delivers a versatile and scalable engine for processing these large datasets.

For more information, please check out our free webinar on July 8 at 11am PST, where we will take a deeper dive into using Apache Spark and supporting tools on Google Cloud.

Author: Brad Miro, Senior Developer Advocate – Google
