Hydra External Tables - Part I

Owen Ou

2022-09-21

Welcome to the first installment in our series on Hydra External Tables! Follow us on Twitter, LinkedIn, or Discord to be the first to hear about the next installment in our series and other Hydra announcements.

What is Hydra?

Hydra is an open source Snowflake alternative. Hydra transforms Postgres into an elastic data warehouse with columnar storage and external tables. It's our mission to help start-ups make data-driven decisions.

What is an External Table?

External tables are data that live outside Hydra. You can use External Tables to query and combine data from multiple data sources. In the image below, Hydra tables are represented as "local" and RDS tables as "external". Data in RDS is being fetched and returned through Hydra.

fdw_diagram.jpeg

This post will explore how to use Postgres External Tables to select, join, and insert data to and from your Hydra data warehouse.

Why use External Tables with Hydra?

Data warehouses are referred to as an organization’s source of truth because they aggregate data from different sources into a single, central data store to support data analysis. Traditionally, organizations configure data pipelines to copy data from multiple sources into the warehouse. While this approach works well for immutable data like analytics, rapidly changing data quickly becomes out of date. Constantly refreshing this data is expensive, slow, and restrictive.

External tables address the following problems:

  • Data copying grows storage and pipeline costs.
  • The warehouse’s data recency is limited to the data copy rate and interval.
  • Impossible to run ad-hoc queries without copying relevant tables to the warehouse.

With External Tables, queries execute on source databases directly, which greatly reduces the friction and the cost outlined above. However, External Tables fall short when the volume of external data to fetch is large and queries need to be run repeatedly. The tradeoff is whether you want to save the data copying time for ad-hoc queries.

Tutorial

This tutorial performs queries in Hydra and joins an RDS external table. The following is a companion video:

Setup

In RDS Postgres, we create demo_signups as a standard row table:

CREATE TABLE demo_signups ( email TEXT );
INSERT INTO demo_signups (email)
VALUES
('owen@example.com'),
('joe@example.com');

In Hydra, we create events as a columnar table:

CREATE TABLE events (email TEXT, name TEXT, timestamp TIMESTAMPTZ) USING columnar;
INSERT INTO events (email, name, timestamp)
VALUES
('owen@example.com', 'webinar_signup', CURRENT_TIMESTAMP),
('joe@example.com', 'email_signup', CURRENT_TIMESTAMP),
('jd@example.com', 'webinar_signup', CURRENT_TIMESTAMP);

Link to the RDS external table

In Hydra, we create an external table linking to the RDS demo_signups table:

/* replace ... with your RDS DB info */

CREATE EXTENSION postgres_fdw;

CREATE SERVER rds_server FOREIGN DATA WRAPPER postgres_fdw
    OPTIONS (host '...', port '...', dbname '...');

CREATE USER MAPPING FOR CURRENT_USER SERVER rds_server
    OPTIONS (user '...', password '...');

CREATE FOREIGN TABLE rds_demo_signups ( email TEXT )
    SERVER rds_server
    OPTIONS (schema_name 'public', table_name 'demo_signups');

Query from RDS external table

In Hydra, check that you can query the external RDS demo_signups table:

SELECT * FROM rds_demo_signups;

Perform join from RDS external table

In Hydra, we perform a join with the external RDS demo_signups table with the columnar events table. As an example, the following calculates the rate of user conversion from demo signups to webinar signups:

SELECT COUNT(events.email) / COUNT(rds_demo_signups.email) * 100.0
FROM events
LEFT JOIN rds_demo_signups ON events.email = rds_demo_signups.email
WHERE events.name = 'webinar_signup';

Wrap Up

This post explored querying and combining data from an external RDS row table in Hydra. We have shown the ability to query data outside your data warehouse without ETL pipelines.

We will explore Hydra integration patterns in the next few posts. Make sure to follow us on Twitter, LinkedIn, or Discord to get notified. In the meantime, you can learn more about External Tables in our documentation. The Hydra database is open sourced on GitHub; please star it if you like what you see in this post 🙂.

Get a fully managed data warehouse at cloud scale with no vendor lock-in.