Skip to content

Parquet in Data Flow Research Spike #152

@TylerMatteo

Description

@TylerMatteo

Background

Currently, data-flow handles shapefiles and CSVs that are uploaded to Digital Ocean by Data Engineering. DE is planning to move to parquet files in general so data-flow should be equipped to handle this file format to load into a database. The first practical application for this issue would be downloading the source data for police precincts and school districts for FacDB.

Existing work

The two options we have currently to get parquet files into data flow are 1) pg_parquet and 2) duckdb. Initial findings regarding the two have been documented here in the original ticket. Also below:

Okay, I've looked at two ways we could handle parquet files (both suggestions courtesy of DE):

  1. pg_parquet, a postgres extension
    The basic installation for this involved building the extension from source, which was 1) taking a long time and 2) confusing as to where and why it's hanging/taking a long time. I have not even been able to build the extension locally with my postgres version.
  2. duckdb, a nodejs api to use duckdb
    For this path, I installed the package, created a duckdb instance to connect to my local postgres database, and used read_parquet to create source tables and insert values from the parquet file. A small limitation is that the geometry type in duckdb is slightly limited (e.g. can't store srid) so it's something to keep in mind when creating the source tables for parquet files.

I am now working on a branch to test out automatically organizing the files depending on their type and which method to generate the source tables.

  1. pg_parquet. Tim has successfully installed this extension on his branch, but I'm attempting to actually use it with the workflow we have on my own branch.
  2. duckdb. There has been more successful using duckdb and converting the parquet files into data flow source tables in the original ticket branch.

Goals

  • Use pg_parquet to convert parquet files into tables
  • Summarize findings
  • Choose one approach to use moving forward

Metadata

Metadata

Assignees

Labels

No labels
No labels

Type

No type

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions