spark-postgres

PostgreSQL and Greenplum Data Source for Apache Spark


A library for reading data from and transferring data to Greenplum databases with Apache Spark, for Spark SQL and DataFrames.

This library is 100x faster than Apache Spark’s JDBC DataSource when transferring data from Spark to Greenplum databases.

Also, this library is fully transactional.

Try it now!

CTAS

CREATE TABLE tbl
USING greenplum
options ( 
  url "jdbc:postgresql://greenplum:5432/",
  delimiter "\t",
  dbschema "gptest",
  dbtable "store_sales",
  user 'gptest',
  password 'test')
AS
SELECT * FROM tpcds_100g.store_sales WHERE ss_sold_date_sk <= 2451537 AND ss_sold_date_sk > 2451520;

View & Insert

CREATE TEMPORARY TABLE tbl
USING greenplum
options ( 
  url "jdbc:postgresql://greenplum:5432/",
  delimiter "\t",
  dbschema "gptest",
  dbtable "store_sales",
  user 'gptest',
  password 'test');

INSERT INTO TABLE tbl SELECT * FROM tpcds_100g.store_sales WHERE ss_sold_date_sk <= 2451537 AND ss_sold_date_sk > 2451520;
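
Because the library also supports reading from Greenplum, a table registered this way can be queried like any other Spark SQL table. The following is a minimal sketch rather than a documented recipe. It assumes the temporary table `tbl` defined above still exists in the current session and that the insert has completed. The alias `row_count` is only illustrative.

```sql
-- Read back through the greenplum data source to check the inserted date range.
-- Assumes `tbl` is the temporary table registered above (session-scoped).
SELECT ss_sold_date_sk, count(*) AS row_count
FROM tbl
WHERE ss_sold_date_sk > 2451520 AND ss_sold_date_sk <= 2451537
GROUP BY ss_sold_date_sk
ORDER BY ss_sold_date_sk;
```

The filter repeats the range used in the insert, so the result should cover only the dates that were just written, provided the target table held no earlier rows in that range.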

Please refer to "JDBC To Other Databases" in the Spark SQL Guide to learn more about similar usage.