Datasets#

lit.sdk.data.datasets #

This module provides methods for creating datasets that will be utilized during the build process.

Dataset #

__getitem__(key) #

Enables subscripting (slicing and indexing) of a dataset directly by delegating to the adapter's getitem method.

Examples:

>>> ds = Dataset.from_team_and_name("contoso", "nvda")
>>> data = ds[-10:]  # Last 10 records
>>> data = ds[100:200]  # Records 100-199
>>> data = ds[50]  # Single record at index 50
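
The delegation described above can be illustrated with a small, self-contained sketch. `ListAdapter` and `SketchDataset` are hypothetical stand-ins, not classes from `lit.sdk`; the real adapter interface is not documented on this page, so treat this only as one way such forwarding might look.

```python
from typing import Any, List, Union


class ListAdapter:
    """Hypothetical adapter that keeps records in an in-memory list."""

    def __init__(self, records: List[Any]) -> None:
        self._records = records

    def getitem(self, key: Union[int, slice]) -> Any:
        # Python lists already accept both integer indices and slices.
        return self._records[key]


class SketchDataset:
    """Hypothetical dataset that forwards subscription to its adapter."""

    def __init__(self, adapter: ListAdapter) -> None:
        self._adapter = adapter

    def __getitem__(self, key: Union[int, slice]) -> Any:
        # Pass the index or slice through to the adapter unchanged.
        return self._adapter.getitem(key)


ds = SketchDataset(ListAdapter(list(range(300))))
last_ten = ds[-10:]    # last 10 records
window = ds[100:200]   # records 100-199 (the stop index is exclusive)
single = ds[50]        # single record at index 50
```

Because the key is passed through untouched, the dataset supports whatever index forms the adapter supports.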

DatasetEvent #

Bases: TypedDict

This class defines the structure for an event within a dataset.

detail instance-attribute #

Additional details or context about the event.

timestamp instance-attribute #

The Unix timestamp (in floating-point format) indicating when the event occurred.

type instance-attribute #

A string representing the type of the event.

username instance-attribute #

The Unix username of the individual responsible for triggering the event.
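
As a rough illustration of the structure above, the sketch below declares a comparable `TypedDict`. `DatasetEventSketch` is a hypothetical name, and the `str` type for `detail` is an assumption; only the float timestamp and the string `type` are stated on this page.

```python
from typing import TypedDict


class DatasetEventSketch(TypedDict):
    """Hypothetical mirror of DatasetEvent; some field types are assumed."""

    type: str         # kind of event, e.g. "init"
    detail: str       # assumed to be free-form text
    timestamp: float  # Unix timestamp in seconds
    username: str     # Unix username that triggered the event


event: DatasetEventSketch = {
    "type": "init",
    "detail": "began work on my_ds",
    "timestamp": 1729181556.0,
    "username": "lit_user",
}
```

A `TypedDict` is a plain `dict` at runtime, so events built this way serialize to JSON without extra conversion.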

add_path_to_dataset(team, name, path) #

Adds a path to an existing dataset.

Parameters:

Name Type Description Default
team str

The team the dataset belongs to.

required
name str

Name of the dataset to which the path is added.

required
path str

Path of the file to add to the dataset.

required

Returns:

Type Description
dict

The dataset.

Examples:

>>> add_path_to_dataset("contoso", "my_ds", "/data/contoso/raw/sample.csv.gz")
{
    "name": "my_ds",
    "raw": ["/data/contoso/raw/sample.csv.gz"],
    "events": [{
        "type": "Added raw",
        "detail": "/data/contoso/raw/sample.csv.gz",
        "timestamp": 1729182007,
        "username": "lit_user"
    },
    {
        "type": "init",
        "detail": "began work on my_ds",
        "timestamp": 1729181556,
        "username": "lit_user"
    }],
}
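
The returned dictionary carries its history in the `events` list. The sketch below shows one way to pick out the most recent event and convert its timestamp to a readable UTC time. `latest_event` is a hypothetical helper, and the literal dictionary is illustrative sample data shaped like the output above, not the result of a real call.

```python
from datetime import datetime, timezone

# Illustrative sample shaped like the dataset dictionary shown above.
dataset = {
    "name": "my_ds",
    "raw": ["/data/contoso/raw/sample.csv.gz"],
    "events": [
        {"type": "Added raw", "detail": "/data/contoso/raw/sample.csv.gz",
         "timestamp": 1729182007, "username": "lit_user"},
        {"type": "init", "detail": "began work on my_ds",
         "timestamp": 1729181556, "username": "lit_user"},
    ],
}


def latest_event(ds: dict) -> dict:
    """Return the event with the largest timestamp (hypothetical helper)."""
    return max(ds["events"], key=lambda event: event["timestamp"])


latest = latest_event(dataset)
# Interpret the timestamp as seconds since the Unix epoch, in UTC.
when = datetime.fromtimestamp(latest["timestamp"], tz=timezone.utc)
print(latest["type"], when.isoformat())
```

Selecting by timestamp rather than by list position avoids relying on the order in which events are stored.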

demo(team_name, name, feature_path, index, params) #

Runs a feature demonstration on the specified dataset and returns the result.

Parameters:

Name Type Description Default
team_name str

The name of the team.

required
name str

The name of the dataset.

required
feature_path str

The path to the feature to test.

required
index int

The data index within the dataset to use for the demo.

required
params dict

A set of parameters to pass to the feature script.

required

Returns:

Type Description
dict

The result of the feature demonstration: the timestamp, the data returned by the feature, and any UI hints.

Examples:

>>> demo(
...     "contoso",
...     "my_ds",
...     "/data/contoso/features/ohlcv.py",
...     19562810,
...     {"count": 5, "size": 1, "unit": "hour"},
... )
{'timestamp': 1493994825045691315,
 'data': array([[2.38490005e+02, 2.38559998e+02, 2.33226593e+02, 2.38500000e+02,
         3.71410000e+04, 8.73224400e+06, 2.35110626e+02],
        [2.38500000e+02, 2.38660004e+02, 2.38300003e+02, 2.38520004e+02,
         2.33910000e+04, 4.86324900e+06, 2.07911118e+02],
        [2.38528900e+02, 2.38770004e+02, 2.38210007e+02, 2.38500000e+02,
         2.87640000e+04, 6.10002200e+06, 2.12071411e+02],
        [2.38500000e+02, 2.38798996e+02, 2.38399994e+02, 2.38740005e+02,
         3.50160000e+04, 7.89611100e+06, 2.25500092e+02],
        [2.39190002e+02, 2.39309998e+02, 2.38839996e+02, 2.38860001e+02,
         2.14480000e+04, 4.43271800e+06, 2.06672791e+02]]),
 'hints': {}}
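
The `timestamp` in the sample output has 19 digits, which suggests nanoseconds since the Unix epoch. That unit is an assumption, not something this page states. Under that assumption, a minimal sketch for converting it to a `datetime` might look like this; `ns_to_datetime` is a hypothetical helper.

```python
from datetime import datetime, timezone

NANOS_PER_SECOND = 1_000_000_000


def ns_to_datetime(ts_ns: int) -> datetime:
    """Convert an assumed nanosecond Unix timestamp to a UTC datetime."""
    seconds, nanos = divmod(ts_ns, NANOS_PER_SECOND)
    base = datetime.fromtimestamp(seconds, tz=timezone.utc)
    # datetime stores microseconds only, so sub-microsecond digits are dropped.
    return base.replace(microsecond=nanos // 1000)


result_timestamp = 1493994825045691315  # value from the sample output
moment = ns_to_datetime(result_timestamp)
print(moment.isoformat())
```

Splitting with integer `divmod` before converting avoids the rounding that a float division of a 19-digit integer would introduce.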

estimate(team_name, name, feature_path, count, params) #

Estimates feature data for a specified dataset.

Parameters:

Name Type Description Default
team_name str

The name of the team.

required
name str

The name of the dataset.

required
feature_path str

The path to the feature script.

required
count int

The number of samples to estimate.

required
params dict

A set of parameters to pass to the feature script.

required

Returns:

Type Description
NDArray

The estimated feature data as a NumPy array.

Examples:

>>> estimate(
...     "contoso",
...     "spy",
...     "/data/contoso/features/ohlcv.py",
...     5,
...     {"count": 5, "size": 1, "unit": "hour"},
... )
array([2.43907004e+02, 2.44300000e+02, 2.43257996e+02, 2.43632999e+02,
    7.06504000e+04, 1.72431610e+07, 2.39403308e+02])

get_data(team_name, name, start, stop) #

Retrieves data for a specified dataset within a team over a given range.

Fetches the data between the start and stop indices of the given dataset. The underlying response is either a JSON string or a dictionary; a JSON string is parsed into a dictionary before being returned.

Parameters:

Name Type Description Default
team_name str

The name of the team.

required
name str

The name of the dataset.

required
start int

The starting index for the data retrieval.

required
stop int

The stopping index for the data retrieval.

required

Returns:

Type Description
dict

The data for the specified dataset and range, parsed as a dictionary.

Raises:

Type Description
TypeError

If the returned data is not of type 'str' or 'dict'.

Examples:

>>> get_data("contoso", "my_ds", 0, 100)
{...}
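
The string-or-dictionary handling described above can be sketched as a small normalization step. `normalize_payload` is a hypothetical name, and this is only an illustration of the documented behavior, not the SDK's actual implementation.

```python
import json
from typing import Union


def normalize_payload(payload: Union[str, dict]) -> dict:
    """Return a dict from either a dict or a JSON string (hypothetical helper)."""
    if isinstance(payload, dict):
        # Already parsed; return as-is.
        return payload
    if isinstance(payload, str):
        # Assumes the JSON document encodes an object at the top level.
        return json.loads(payload)
    raise TypeError(
        f"expected 'str' or 'dict', got {type(payload).__name__!r}"
    )


from_text = normalize_payload('{"rows": [1, 2, 3]}')
from_dict = normalize_payload({"rows": [1, 2, 3]})
```

Checking the type explicitly means an unexpected response fails immediately with a `TypeError` instead of surfacing later as a confusing lookup error.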

get_data_by_date(team_name, name, timestamp, aperture) #

Retrieves data for a specified dataset around a given timestamp.

Parameters:

Name Type Description Default
team_name str

The name of the team.

required
name str

The name of the dataset.

required
timestamp float

The timestamp around which data is to be retrieved.

required
aperture int

The number of samples to retrieve on either side of the timestamp.

required

Returns:

Type Description
dict

The data around the specified timestamp with the given aperture.

Examples:

>>> get_data_by_date("contoso", "my_ds", 1494858825, 10000)
{...}
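
To make the `aperture` parameter concrete, the sketch below computes which sample indices would fall within a window around a timestamp. It assumes samples are sorted by time and that the window is centered on the sample nearest the timestamp; both are assumptions, and `window_around` is a hypothetical helper rather than part of the SDK.

```python
from bisect import bisect_left
from typing import Sequence


def window_around(timestamps: Sequence[float], target: float, aperture: int) -> range:
    """Indices of up to `aperture` samples on either side of the nearest sample."""
    if not timestamps:
        return range(0)
    pos = bisect_left(timestamps, target)
    if pos == len(timestamps):
        # Target lies after the last sample.
        center = pos - 1
    elif pos > 0 and target - timestamps[pos - 1] <= timestamps[pos] - target:
        # The earlier neighbor is at least as close as the later one.
        center = pos - 1
    else:
        center = pos
    # Clamp the window to the bounds of the data.
    start = max(0, center - aperture)
    stop = min(len(timestamps), center + aperture + 1)
    return range(start, stop)


sample_times = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
middle = list(window_around(sample_times, 31.0, 1))
```

With an aperture of 1 around `31.0`, the nearest sample is at index 2, so the window covers indices 1 through 3.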

get_dataset(team, name) #

Returns a dataset by name.

Parameters:

Name Type Description Default
team str

The team the dataset belongs to.

required
name str

Name of the dataset to return.

required

Returns:

Type Description
dict

The dataset.

Examples:

>>> get_dataset("contoso", "my_ds")
{
    "name": "my_ds",
    "raw": ["/data/contoso/raw/sample.csv.gz"],
    "events": [{
        "type": "Added raw",
        "detail": "/data/contoso/raw/sample.csv.gz",
        "timestamp": 1729182007,
        "username": "lit_user"
    },
    {
        "type": "init",
        "detail": "began work on my_ds",
        "timestamp": 1729181556,
        "username": "lit_user"
    }],
}

get_sample_count(team_name, name) #

Retrieves the sample count for a specified dataset within a team.

Parameters:

Name Type Description Default
team_name str

The name of the team.

required
name str

The name of the dataset.

required

Returns:

Type Description
int

The sample count for the specified dataset.

Examples:

>>> get_sample_count("contoso", "my_ds")
52042581
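
A sample count this large is usually read in pieces rather than in one request. The sketch below generates `(start, stop)` index pairs that cover a dataset of a given size. `chunk_ranges` is a hypothetical helper, and treating `stop` as exclusive is an assumption about how range arguments are interpreted.

```python
from typing import Iterator, Tuple


def chunk_ranges(total: int, chunk_size: int) -> Iterator[Tuple[int, int]]:
    """Yield (start, stop) pairs covering 0..total, with stop exclusive."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, total, chunk_size):
        # The final chunk is shortened so it never runs past the total.
        yield start, min(start + chunk_size, total)


ranges = list(chunk_ranges(10, 4))
print(ranges)
```

Each pair could then be passed as the `start` and `stop` arguments of a range-based retrieval call, provided the exclusive-stop assumption holds.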

init_dataset(team, name) #

Creates a new dataset.

Parameters:

Name Type Description Default
team str

The team the dataset belongs to.

required
name str

Name of the new dataset.

required

Returns:

Type Description
dict

The dataset.

Examples:

>>> init_dataset("contoso", "my_ds")
{
    "name": "my_ds",
    "raw": [],
    "events": [{
        "type": "init",
        "detail": "began work on my_ds",
        "timestamp": 1729181556,
        "username": "lit_user"
    }],
}

list_datasets(team) #

Returns a list of the dataset names for a team.

Parameters:

Name Type Description Default
team str

The team the datasets belong to.

required

Returns:

Type Description
list[str]

The collection of dataset names.

Examples:

>>> list_datasets("contoso")
["MSFT", "AAPL", "SPY"]

remove_path_to_dataset(team, name, path) #

Removes a path from an existing dataset.

Parameters:

Name Type Description Default
team str

The team the dataset belongs to.

required
name str

Name of the dataset from which the path is removed.

required
path str

Path of the file to remove from the dataset.

required

Returns:

Type Description
dict

The dataset.

Examples:

>>> remove_path_to_dataset("contoso", "my_ds", "/data/contoso/raw/sample.csv.gz")
{
    "name": "my_ds",
    "raw": ["/data/contoso/raw/sample.csv.gz"],
    "events": [{
        "type": "Removed raw",
        "detail": "/data/contoso/raw/sample.csv.gz",
        "timestamp": 1729184479,
        "username": "lit_user"
    },
    {
        "type": "Added raw",
        "detail": "/data/contoso/raw/sample.csv.gz",
        "timestamp": 1729182007,
        "username": "lit_user"
    },
    {
        "type": "init",
        "detail": "began work on my_ds",
        "timestamp": 1729181556,
        "username": "lit_user"
    }],
}