Hydra External Tables - Part II

Owen Ou

2022-10-19

Welcome to the second installment in our series on Hydra External Tables! Follow us on TwitterLinkedIn, or Discord to be the first to hear about the next installment in our series and other Hydra announcements. Part 1 of the series is available here.

Hydra is an open source Snowflake alternative that transforms Postgres into an elastic data warehouse with columnar storage and external tables. In part 1, we discussed using external tables to access and join data that live outside Hydra. In this post, we are going to discuss the reverse direction - populate data to the Hydra data warehouse from other databases. In this tutorial, we use RDS as the external database to Hydra.

Why populate data to Hydra from foreign databases?

In part 1 of our External Tables series, we showed how External Tables can be used to quickly run ad-hoc analytics queries on data outside of a Hydra data warehouse. When an External Table’s data size is large, the query becomes more expensive. No need to fear - you can also use External Tables to continuously duplicate data to your Hydra data warehouse automatically! For large data volumes, create a synchronous duplicate of the data inside the warehouse for faster query execution. Check out this tutorial to see how!

Data Modeling 101: Star Schema

In the tutorial below, we will adopt a data modeling approach known as the star schema. Let’s run through some basic concepts.

Star schema is a logical arrangement of tables in a multidimensional database such that the entity relationship diagram resembles a star shape. It is a fundamental approach that is widely used to develop or build a data warehouse. It requires classifying model tables as either dimension or fact.

Fact tables record measurements or metrics for a specific event. A fact table contains dimension key columns that relate to dimension tables and numeric measure columns. Dimension tables describe business entities - the things you model. Entities can include products, people, places, and concepts including time itself. A dimension table contains a key column (or columns) that acts as a unique identifier and other descriptive columns. Fact tables are best as columnar tables, while dimension tables may be best as row tables due to their size and rate of updates.

The following diagram shows the star schema that we are going to model for the tutorial:

Star Schema

Tutorial

In this tutorial, we are going to have a sales fact columnar table, a customers dimension row table, and a products dimension row table in Hydra. The customers row table is auto-synced from RDS Postgres using a trigger where RDS is the canonical source of this table.

RDS to Hydra

The following is a companion video for this tutorial:

Setup

In Hydra, we create the sales, products, and customers table. We then add some sample data to the sales and products table. The customers table is empty and will be auto-populated from RDS at a later step.

CREATE TABLE sales (order_number int, product_key TEXT, email TEXT, order_date TIMESTAMPTZ) USING columnar;
CREATE TABLE products (product_key TEXT, name TEXT, category TEXT, color TEXT);
CREATE TABLE customers ( email TEXT, country TEXT, state TEXT );

INSERT INTO products (product_key, name, category, color)
VALUES
('pixy_7_pro_white', 'Pixy 7 Pro', 'Phone', 'Black'),
('pixy_7_pro_black', 'Pixy 7 Pro', 'Phone', 'White'),
('airphonepods_pro_white', 'AirPhonePods Pro', 'Headphones', 'White'),
('airphone_14_red', 'airPhone 14', 'Phone', 'Red');

INSERT INTO sales (order_number, product_key, email, order_date)
VALUES
(100, 'pixy_7_pro_white', 'owen@example.com', CURRENT_TIMESTAMP),
(101, 'pixy_7_pro_black', 'jd@example.com', CURRENT_TIMESTAMP),
(102, 'airphone_14_red', 'joe@example.com', CURRENT_TIMESTAMP);

Link to the Hydra columnar table

In RDS Postgres, we create an external table linking to the Hydra customers table:

/* replace ... with your Hydra DB info */

CREATE EXTENSION postgres_fdw;

CREATE SERVER hydra_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '...', port '...', dbname '...');

CREATE USER MAPPING FOR CURRENT_USER SERVER hydra_server
    OPTIONS (user '...', password '...');

CREATE SCHEMA hydra;

CREATE FOREIGN TABLE hydra.customers ( email TEXT, country TEXT, state TEXT )
    SERVER hydra_server
    OPTIONS (schema_name 'public', table_name 'customers');

Trigger to sync with the Hydra columnar table

In RDS, we create customers as a row table and add a trigger to sync with the Hydra customers table:

CREATE TABLE customers ( id INT, email TEXT, country TEXT, state TEXT, created_at TIMESTAMPTZ);

CREATE OR REPLACE FUNCTION insert_hydra_customers_table ()
RETURNS TRIGGER AS $BODY$
BEGIN
    INSERT INTO hydra.customers ("email", "country", "state")
        VALUES (NEW.email, NEW.country, NEW.state);
    RETURN NEW;
END;
$BODY$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_insert_hydra_columnar_table
AFTER INSERT ON customers
FOR EACH ROW
EXECUTE PROCEDURE insert_hydra_customers_table();

CREATE OR REPLACE FUNCTION delete_hydra_customers_table ()
RETURNS TRIGGER AS $BODY$
BEGIN
    DELETE FROM hydra.customers WHERE email = OLD.email;
    RETURN OLD;
END;
$BODY$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_delete_hydra_columnar_table
AFTER DELETE ON customers
FOR EACH ROW
EXECUTE PROCEDURE delete_hydra_customers_table();

INSERT INTO customers (id, email, country, state, created_at)
VALUES
(1, 'owen@example.com', 'Canada', 'BC', CURRENT_TIMESTAMP),
(2, 'joe@example.com', 'US', 'CA', CURRENT_TIMESTAMP),
(3, 'jd@example.com', 'US', 'TX', CURRENT_TIMESTAMP);

Query from Hydra columnar table

In Hydra, check that the data is populated from RDS to the Hydra customers table:

SELECT * FROM customers;

Perform analysis in Hydra

In Hydra, we can perform interesting data analysis like product sales by country:

SELECT COUNT(products.product_key), customers.country 
FROM sales
INNER JOIN products ON products.product_key = sales.product_key
INNER JOIN customers on customers.email = sales.email
GROUP BY customers.country;

Wrap Up

This post explored populating dimension data from RDS Postgres to Hydra using a star schema model. We have shown the ability to auto-sync data outside your data warehouse using a trigger.

We will explore more Hydra integration patterns in the next few posts. Make sure to follow us on TwitterLinkedIn, or Discord to get notified. In the meantime, you can learn more about External Tables in our documentation. The Hydra database is open sourced on GitHub; please star it if you like what you see in this post 🙂.

Get a fully managed data warehouse at cloud scale with no vendor lock-in.