
Data Pipeline#

A data pipeline is a critical component in preparing data for machine learning workflows. It encompasses the sequence of processes and tools required to ingest, clean, transform, and prepare raw data for analysis or training. This involves collecting data from diverse sources, such as databases or files, and applying preprocessing steps like normalization, feature extraction, and data augmentation. By streamlining and automating these tasks, a well-structured data pipeline ensures that data is consistent, reliable, and formatted optimally for machine learning, thereby enhancing efficiency in model development and enabling seamless integration of new data.

Adapter#

Data access is facilitated through an interface called the adapter. The adapter is responsible for providing a hyper-optimized local representation (see Cache Files) of a given data type.
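The adapter contract can be sketched as a small Python protocol. The method names below (`ingest`, `get_by_timestamp`, `get_by_index`) are illustrative assumptions for this sketch, not the actual API:

```python
# Illustrative sketch of an adapter interface; method names are assumptions.
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class Adapter(Protocol):
    """Provides an optimized local representation of one data type."""

    def ingest(self, source_path: str) -> None:
        """Transform raw input into the optimized local cache format."""
        ...

    def get_by_timestamp(self, ts: int) -> Any:
        """Return the record at (or nearest to) a given timestamp."""
        ...

    def get_by_index(self, i: int) -> Any:
        """Return the record at an absolute index."""
        ...
```

Any concrete adapter (for CSV files, images, a database, etc.) would then implement these methods against its own storage format.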

Speed#

Adapters ensure that data access is both fast and efficient: they transform data into forms optimized for fast random access and precompute common statistics such as roll-ups, standard deviation, and min/max/mean, so those values are available instantly.
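As a rough sketch of why precomputation makes these lookups instant: the class below (hypothetical, not the actual implementation) computes summary statistics once at load time, after which every query is a simple attribute read:

```python
import statistics


class PrecomputedStats:
    """Compute summary statistics once so later lookups are O(1)."""

    def __init__(self, values):
        # One pass over the data at load time; queries never rescan it.
        self.min = min(values)
        self.max = max(values)
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values)


stats = PrecomputedStats([3.0, 1.0, 4.0, 1.0, 5.0])
print(stats.min, stats.max, stats.mean)  # 1.0 5.0 2.8
```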

For instance, an adapter can load 13 compressed CSV files of Refinitiv trade data, each containing a year of trades, in under 20 milliseconds.

The adapter supports fast data access both by timestamp and by absolute index.
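A minimal sketch of both access patterns, using a sorted in-memory index (the sample data and function names are hypothetical):

```python
import bisect

# Hypothetical cached trade data: parallel arrays sorted by timestamp.
timestamps = [1000, 1005, 1010, 1020]
prices = [101.5, 101.7, 101.6, 102.0]


def get_by_index(i):
    """Absolute-index access: a direct O(1) array lookup."""
    return timestamps[i], prices[i]


def get_by_timestamp(ts):
    """Timestamp access: binary search for the last record at or before ts."""
    i = bisect.bisect_right(timestamps, ts) - 1
    return timestamps[i], prices[i]


print(get_by_index(2))         # (1010, 101.6)
print(get_by_timestamp(1012))  # (1010, 101.6)
```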

Optimized Data Transformation#

When files are uploaded, the adapter processes the data sequentially in a single pass. This ensures the data is transformed into a format optimized for fast random access, checked for integrity, and precomputed for key metrics. This transformation step can be triggered automatically during uploads or manually through a web app, CLI, or Python library.
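The single-pass idea can be illustrated as follows; the column names and output layout are assumptions for the sketch, not the actual cache format. Each row is read exactly once, and validation, metric collection, and reshaping for random access all happen during that one pass:

```python
import csv
import io


def transform_single_pass(csv_text):
    """Read each row exactly once: validate it, collect metrics,
    and emit a row list suitable for fast random access by index."""
    rows, count, total = [], 0, 0.0
    lo, hi = float("inf"), float("-inf")
    for record in csv.DictReader(io.StringIO(csv_text)):
        price = float(record["price"])      # integrity check: must parse
        count += 1
        total += price
        lo, hi = min(lo, price), max(hi, price)
        rows.append((record["ts"], price))  # fixed-order tuples for indexing
    metrics = {"count": count, "mean": total / count, "min": lo, "max": hi}
    return rows, metrics


rows, metrics = transform_single_pass("ts,price\n1,10.0\n2,30.0\n")
print(metrics)  # {'count': 2, 'mean': 20.0, 'min': 10.0, 'max': 30.0}
```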

Broad Data Compatibility#

Adapters are designed to work with virtually any data source, from structured databases and warehouses to unstructured formats like text, images, video, audio, and geospatial data. They provide seamless access to both on-premises and cloud-based storage, ensuring versatility in supporting diverse data needs.

By integrating adapters into your workflow, you can achieve unparalleled speed and flexibility in accessing and transforming data, optimizing every stage of your machine learning processes.