Parquet in Data Flow Research Spike #152
Description
Background
Currently, data-flow handles shapefiles and CSVs that Data Engineering (DE) uploads to Digital Ocean. DE is planning to move to parquet files in general, so data-flow should be able to load this file format into a database. The first practical application for this issue would be downloading the source data for police precincts and school districts for FacDB.
Existing work
We currently have two options for getting parquet files into data-flow: 1) pg_parquet and 2) duckdb. Initial findings on both have been documented here in the original ticket and are reproduced below:
Okay, I've looked at two ways we could handle parquet files (both suggestions courtesy of DE):
- pg_parquet, a postgres extension
The basic installation involved building the extension from source, which 1) took a long time and 2) gave no clear indication of where or why it was hanging. I have not yet been able to build the extension locally with my postgres version.
- duckdb, a nodejs api to use duckdb
For this path, I installed the package, created a duckdb instance to connect to my local postgres database, and used read_parquet to create source tables and insert values from the parquet file. A small limitation is that the geometry type in duckdb is slightly limited (e.g. can't store srid), so it's something to keep in mind when creating the source tables for parquet files.

I am now working on a branch to test out automatically organizing the files by type and choosing which method generates the source tables.
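The file-type organizing step mentioned above could look roughly like the sketch below. It is only an illustration: the loader names, the extension list, and the file names in the usage note are assumptions, not data-flow's actual step names or sources.

```typescript
// Hypothetical loader names; data-flow's real step names may differ.
type Loader = "shapefile" | "csv" | "parquet";

// Assumed mapping from file extension to the loader that handles it.
const LOADER_BY_EXTENSION: Record<string, Loader> = {
  ".shp": "shapefile",
  ".csv": "csv",
  ".parquet": "parquet",
};

// Return the loader for a source file, or undefined if the extension is unknown.
function loaderFor(fileName: string): Loader | undefined {
  const dot = fileName.lastIndexOf(".");
  if (dot < 0) return undefined;
  return LOADER_BY_EXTENSION[fileName.slice(dot).toLowerCase()];
}

// Group file names by loader so each group can be handed to one load step.
function groupByLoader(fileNames: string[]): Map<Loader, string[]> {
  const groups = new Map<Loader, string[]>();
  for (const name of fileNames) {
    const loader = loaderFor(name);
    if (loader === undefined) continue; // skip files no loader handles
    const group = groups.get(loader) ?? [];
    group.push(name);
    groups.set(loader, group);
  }
  return groups;
}
```

For example, `groupByLoader(["police_precincts.parquet", "school_districts.csv"])` would put each file under its own loader key, and anything with an unrecognized extension is left out rather than raising an error.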
- pg_parquet. Tim has successfully installed this extension on his branch, and I'm now attempting to actually use it with our existing workflow on my own branch.
- duckdb. We have had more success using duckdb to convert the parquet files into data-flow source tables in the original ticket branch.
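As a rough sketch of the duckdb path, the helper below builds the SQL statements that would attach a postgres database through DuckDB's postgres extension and create a source table from a parquet file with read_parquet. The function name, the `pg` alias, and the connection string, path, and table names in the usage note are all hypothetical; the branch may structure this differently.

```typescript
// Quote a SQL identifier by doubling embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Quote a SQL string literal by doubling embedded single quotes.
function quoteLiteral(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Hypothetical description of one parquet load.
interface ParquetLoad {
  postgresConnection: string; // libpq-style connection string
  parquetPath: string; // path or URL DuckDB can read
  schema: string; // target postgres schema
  table: string; // target source table
}

// Build the DuckDB statements, in the order they would need to run.
function buildDuckDbStatements(load: ParquetLoad): string[] {
  const target = `pg.${quoteIdent(load.schema)}.${quoteIdent(load.table)}`;
  return [
    "INSTALL postgres",
    "LOAD postgres",
    // "pg" is an assumed alias for the attached postgres database.
    `ATTACH ${quoteLiteral(load.postgresConnection)} AS pg (TYPE postgres)`,
    // Create the source table directly from the parquet file's contents.
    `CREATE TABLE ${target} AS SELECT * FROM read_parquet(${quoteLiteral(load.parquetPath)})`,
  ];
}
```

Each statement would then be executed in order on a DuckDB connection. Because of the geometry limitation noted above, any SRID would presumably still need to be set on the postgres side after the table is created.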
Goals
- Use pg_parquet to convert parquet files into tables
- Summarize findings
- Choose one approach to use moving forward
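For the pg_parquet goal, a minimal sketch of the statements involved might look like the following, assuming the extension is already built and installed on the postgres server and the target table already exists with columns matching the parquet file. The function name and the example names in the usage note are hypothetical.

```typescript
// Quote a SQL identifier by doubling embedded double quotes.
function quoteIdent(name: string): string {
  return '"' + name.replace(/"/g, '""') + '"';
}

// Quote a SQL string literal by doubling embedded single quotes.
function quoteLiteral(value: string): string {
  return "'" + value.replace(/'/g, "''") + "'";
}

// Build the postgres statements that load a parquet file with pg_parquet.
// Assumes the target table was created beforehand with matching columns.
function buildPgParquetStatements(
  parquetPath: string,
  schema: string,
  table: string,
): string[] {
  const target = `${quoteIdent(schema)}.${quoteIdent(table)}`;
  return [
    "CREATE EXTENSION IF NOT EXISTS pg_parquet",
    // The path is read by the postgres server process, not the client.
    `COPY ${target} FROM ${quoteLiteral(parquetPath)} WITH (format 'parquet')`,
  ];
}
```

Calling `buildPgParquetStatements("/tmp/school_districts.parquet", "public", "source_school_districts")` would yield two statements to run against postgres. Unlike the duckdb path, this one needs the table definition up front, which is worth weighing when summarizing the findings.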