I want to discuss the work that goes into engineering a data pipeline. At work, I frequently do ELT - extract data from a source, load it into BigQuery, and transform it with SQL queries that I compose with dbt.
Over the last three years, I have worked with dozens of data providers. I have noticed that the work required to extract and load data depends crucially on choices the data provider makes. The source file's format alone makes the difference between an afternoon task and a project spanning several months. I believe the following points are important to bear in mind when providing data to third parties.
Keep The Schema Close To The Data
Keep the schema close to the data. A schema contains rich, condensed information about the data. Anyone working with data that arrives without a schema must at some point invest compute and time to recover that information. The more such data there is, the more time and compute need to be invested.
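As a minimal sketch of what that recovery looks like (the file orders.csv and its columns are made up for illustration), compare inferring types from the raw data with simply declaring the types a provider could have shipped:

```python
import pandas as pd

# Without a schema, types must be guessed from the data itself.
# A zip_code column, for example, is inferred as an integer and
# silently loses its leading zeros ("02139" becomes 2139).
inferred = pd.read_csv("orders.csv")
print(inferred.dtypes)

# With a schema provided by the supplier, the types are simply declared.
declared = pd.read_csv(
    "orders.csv",
    dtype={"order_id": "string", "zip_code": "string", "amount": "float64"},
    parse_dates=["ordered_at"],
)
print(declared.dtypes)
```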
If a schema is available, several tools exist to port it across RDBMSs. AWS built its own conversion toolkit, Microsoft Azure did the same, and many open-source projects exist as well.
Assume that when loading data into a data warehouse, most engineers do not care about the contents of the data. Any company larger than a startup employs both analysts and engineers. Analysts work with the data once it is loaded; engineers do the plumbing and load it. Engineers do not want to figure out column names, data types, column descriptions, or table descriptions. They want this information provided in a machine-readable format.
Don't make the engineer call support and ask for schema information. Don't make them manually assemble a schema. Provide it. The schema's format does not matter much in that regard: JSON, YAML, even a plain text file will do, as long as the format is consistent and machine-readable. Make the schema impossible to miss.
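To sketch why machine-readable matters (the file name schema.json, its column names, and its exact layout are assumptions for illustration): a schema shipped next to the data can be turned into warehouse column definitions programmatically instead of being transcribed by hand from a PDF or an email.

```python
import json

from google.cloud import bigquery

# Hypothetical schema.json shipped by the provider next to the data:
# [{"name": "order_id", "type": "STRING", "description": "Unique order id"},
#  {"name": "amount",   "type": "FLOAT",  "description": "Order total in EUR"}]
with open("schema.json") as f:
    columns = json.load(f)

# Because the file is machine-readable, the BigQuery schema can be built
# directly from it, descriptions included.
schema = [
    bigquery.SchemaField(col["name"], col["type"], description=col.get("description"))
    for col in columns
]
print(schema)
```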
Use A Reasonable File Format
When I took measurements in the lab during my studies, we used computers from the early 2000s to record them. Those machines were programmed by students working at the lab and controlled the Arduinos that performed the measurements. Results were stored as plain .txt files containing the column names and the data. The files held a few KB of data.
Today, in 2023, my code runs in the cloud, where I transfer TBs of data. A .txt file is not a reasonable way to transfer large amounts of data. The Avro and Parquet file formats are; I find them great because they keep the schema close to the data and compress it.
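A minimal sketch with pyarrow (the column names, values, and file name are made up) shows both properties at once: the schema is embedded in the Parquet file, and the columns are compressed on write.

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical table; in practice this comes from the provider's source system.
table = pa.table(
    {
        "order_id": pa.array(["a-1", "a-2"], type=pa.string()),
        "amount": pa.array([19.99, 5.00], type=pa.float64()),
    }
)

# The schema travels inside the file, and the columns are compressed.
pq.write_table(table, "orders.parquet", compression="snappy")

# A consumer can read the schema back without scanning the data.
print(pq.read_schema("orders.parquet"))
```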
Compress The Data
Bandwidth and storage cost money, so it is reasonable to compress data. The easy way to do that is to use a reasonable file format, as described above. Sometimes, however, one needs to transfer plain text files. Most RDBMSs support gzip as the standard compression method for them. While not exactly cutting-edge technology, gzip offers a reasonable trade-off between file size and compression/decompression speed.
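If plain text files are unavoidable, compressing them before transfer is only a few lines (the file names below are made up for illustration). BigQuery, for instance, can load gzip-compressed CSV files directly, so the consumer never has to decompress them by hand.

```python
import gzip
import shutil

# Compress a plain text export before transferring it.
with open("orders.csv", "rb") as src, gzip.open("orders.csv.gz", "wb") as dst:
    shutil.copyfileobj(src, dst)
```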
Transfer Large Amounts Of Data In Batches
I understand that it is tempting for data suppliers to leverage REST APIs to provide data to their customers. The API is already there for some other application, and all one has to do is strap authentication and monetization on top.
This approach is fine for transferring partial data and small datasets, but it falls short for large ones. The overhead of transferring an entire dataset through many small HTTP requests can be avoided by transferring the data in batches, as sketched below.
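As a sketch of the batch alternative (the bucket path and table name are assumptions for illustration): the provider exports the dataset as files in object storage, and the consumer loads them with a single job instead of paginating through thousands of HTTP requests.

```python
from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(source_format=bigquery.SourceFormat.PARQUET)

# One load job over a batch export, instead of one request per page of results.
load_job = client.load_table_from_uri(
    "gs://provider-bucket/orders/2023-06-01/*.parquet",
    "my_project.raw.orders",
    job_config=job_config,
)
load_job.result()  # Wait for the load to finish.
```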