File formats
File Type Compatibility
SDF can read a variety of file types.
File type | Description |
---|---|
csv | Comma-separated values, with or without a header row |
parquet | A self-describing, highly efficient columnar data format |
json | Newline-delimited JSON (ndjson) |
gzip | GZIP-compressed json or csv file |
bzip2 | BZIP2-compressed json or csv file |
Compression File Modifiers
SDF works with two common compression formats: GZIP and BZIP2.
The decompressed file must be in a supported format (csv or json).
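As a rough illustration of what a compression modifier implies, the sketch below picks a decompressor from the outer file extension and then checks that the inner format is csv or json. This is plain Python, not SDF syntax; the `open_text` helper and the extension mapping are hypothetical, and SDF's own detection rules may differ.

```python
import bz2
import gzip
from pathlib import Path

# Hypothetical mapping for illustration; SDF's own detection may differ.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open}
RAW_FORMATS = {".csv", ".json"}


def open_text(path):
    """Open a csv/json file as text, decompressing GZIP or BZIP2 if needed."""
    p = Path(path)
    suffixes = p.suffixes  # e.g. [".csv", ".gz"]
    opener = open
    if suffixes and suffixes[-1] in OPENERS:
        opener = OPENERS[suffixes[-1]]
        suffixes = suffixes[:-1]  # look at the format underneath
    if not suffixes or suffixes[-1] not in RAW_FORMATS:
        raise ValueError(f"unsupported inner format: {p.name}")
    return opener(p, "rt", encoding="utf-8")
```

With this helper, `rows.csv.gz` and `rows.csv` are read the same way, while a name such as `rows.txt.gz` is rejected before any file is opened.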
GZIP & BZIP2
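GZIP compresses a single file, whereas ZIP can bundle multiple files into a single archive. Both typically rely on the same DEFLATE algorithm, so single-file compression ratios are similar; GZIP tends to come out ahead when many files are first combined into one stream (for example, a tarball), because ZIP compresses each member separately. GZIP is also very memory efficient, making it useful for systems with constrained memory or very large amounts of data.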
BZIP2 is an alternative to GZIP. At the cost of a higher memory footprint and slightly longer compression time, the resulting file is typically more than 10% smaller than with GZIP.
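To get a feel for the size trade-off on your own data, a minimal Python sketch using only the standard library is shown below. The CSV payload is synthetic and made up for this example, and actual ratios depend heavily on the data, so the script prints the sizes rather than assuming which format wins.

```python
import bz2
import gzip

# Synthetic, repetitive CSV payload (made up for this example).
rows = ["id,country_code,state"] + [f"{i},US,CA" for i in range(5000)]
raw = "\n".join(rows).encode("utf-8")

gz = gzip.compress(raw)
bz = bz2.compress(raw)

# Both formats are lossless: decompressing restores the original bytes.
assert gzip.decompress(gz) == raw
assert bz2.decompress(bz) == raw

print(f"raw:   {len(raw)} bytes")
print(f"gzip:  {len(gz)} bytes")
print(f"bzip2: {len(bz)} bytes")
```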
To import a compressed file with SDF, try the below code snippet.
Parquet
Parquet is the file format SDF recommends for working with data. Internally, SDF and the SDF cache use Parquet with GZIP-compressed data blocks.
Parquet is a self-describing, column-oriented file format designed for efficient data storage and retrieval. Each file contains column header information as well as user-defined metadata. Parquet also supports native compression of the data blocks within a file, with several codecs that offer different trade-offs between compression ratio and processing cost.
To import a parquet file with SDF, try the below code snippet.
Learn more about Parquet
Raw File Types
JSON
SDF natively supports newline-delimited JSON (ndjson). Nested structures can be accessed with simple dot notation.
To import a json file with SDF, try the below code snippet.
Working with Nested json Structures
In the twitter_json object below, we want to extract the user_id, country_code, and state from the user_location object.
SDF supports dot notation for accessing nested fields, which makes it easy to work with even complex structures.
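The idea behind dot notation can be sketched in plain Python (this is not SDF syntax). The sample records are hypothetical: only the field names come from the text above, and the layout, with user_id at the top level and the other two fields nested under user_location, is an assumption. The `get_path` helper is likewise made up for this illustration.

```python
import json

# Hypothetical ndjson input: one JSON object per line. The layout is an
# assumption; only the field names come from the surrounding text.
NDJSON = """\
{"user_id": 1, "user_location": {"country_code": "US", "state": "CA"}}
{"user_id": 2, "user_location": {"country_code": "CA", "state": "ON"}}
"""


def get_path(record, dotted):
    """Follow a dot-separated path through nested dicts; None if missing."""
    current = record
    for key in dotted.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


records = [json.loads(line) for line in NDJSON.splitlines() if line.strip()]
columns = ["user_id", "user_location.country_code", "user_location.state"]
rows = [tuple(get_path(r, c) for c in columns) for r in records]
print(rows)  # [(1, 'US', 'CA'), (2, 'CA', 'ON')]
```

Each dotted name is split on `.` and resolved one level at a time, so a missing intermediate object yields `None` rather than an error.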
Learn more about ndjson
CSV
SDF works with CSVs! They may have a header row, or you can specify the schema information yourself.
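The difference between the two modes can be illustrated with Python's standard `csv` module (again, not SDF syntax). The sample data, the `SCHEMA` mapping, and both helper functions are hypothetical: with a header row the column names are read from the file, and without one the caller has to supply names and types.

```python
import csv
import io

WITH_HEADER = "id,name\n1,ada\n2,grace\n"
NO_HEADER = "1,ada\n2,grace\n"

# Hypothetical schema: column name -> converter applied to each cell.
SCHEMA = {"id": int, "name": str}


def read_with_header(text):
    """Column names come from the first row; values stay strings."""
    return list(csv.DictReader(io.StringIO(text)))


def read_with_schema(text, schema):
    """Column names and types come from the caller-supplied schema."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=list(schema))
    return [{col: schema[col](row[col]) for col in schema} for row in reader]


print(read_with_header(WITH_HEADER))
print(read_with_schema(NO_HEADER, SCHEMA))
```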
To import a csv with header row:
To specify the schema information yourself:
Local & Remote
Paths
Paths should be relative to the workspace.
Remote Paths
SDF supports some cloud storage URIs natively.
Support for other cloud providers is on the way! If your preferred cloud provider is not listed, drop us a line and let us know, or contribute on GitHub.
- AWS S3
- Google Cloud Storage
- Azure Blob Storage
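To make the local versus remote distinction concrete, here is a small Python sketch that classifies a location string. It is not part of SDF: the `classify_location` helper is hypothetical, and the scheme set is an assumption limited to the widely used `s3://` and `gs://` prefixes (Azure is left out because its URI scheme varies between tools). Check the SDF documentation for the schemes it actually accepts.

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit

# Assumed URI schemes, for illustration only.
REMOTE_SCHEMES = {"s3", "gs"}


def classify_location(location):
    """Return ("remote", scheme) for cloud URIs, ("local", path) otherwise."""
    parts = urlsplit(location)
    if parts.scheme in REMOTE_SCHEMES:
        return ("remote", parts.scheme)
    path = PurePosixPath(location)
    if path.is_absolute():
        # Local paths are expected to be relative to the workspace.
        raise ValueError(f"expected a workspace-relative path: {location}")
    return ("local", str(path))
```

For example, `classify_location("data/file.csv")` is treated as a workspace-relative local path, while `classify_location("s3://bucket/file.parquet")` is treated as remote.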
Registering Tables in YML
Tables can also be registered in YML, with support for the same modifiers.
Note: As of Aug 2023, there is a bug with registering locations as S3 paths.