Creating a model from scratch#
Volatility Forecast Walkthrough#
Developing robust and accurate machine learning models requires high-quality data and an efficient platform for data processing and model training. In this walkthrough, we'll explore the full journey of using the LIT platform to create a machine learning model from historical stock transaction data provided by LSEG (London Stock Exchange Group), a provider of comprehensive financial data. This article will guide you through the entire process: importing and preprocessing raw data, exploring and developing features, designing and training a neural network, and finally, making predictions, all within the versatile LIT platform. Whether you're an experienced data scientist or just starting out, this walkthrough will demonstrate how LIT simplifies the complexities of machine learning with powerful tools and intuitive interfaces. Let's dive in and see how you can harness LIT's capabilities to turn raw financial data into actionable insights.
Create a new project#
I'll start by creating a new project. I'd like to try predicting whether volatility will go up or down in the next hour, so let's name our new project volatility.
Upload raw data#
In this walkthrough I've manually copied data files retrieved from LSEG DataScope to the LIT server using rsync. It's simpler to upload using the web upload interface, but I prefer the additional control and efficiency of working directly with the operating system.
The copied files show up in the Uploads app. Open it by choosing Upload
from the Data
menu.
Adapter#
Because we have the Traders plugin installed in this LIT instance, the platform already knows how to read LSEG data files natively. If you're interested in fine-tuning any of the filtering or data cleansing choices we made in our Refinitiv adapter, you can find it here: /data/<team>/adapters/refinitiv.py.
Caching#
Raw data typically comes to us in a form that's optimized for file transfer, not for random access. For training models, we'll need the data reformatted so that it's optimized for fast random data access. To facilitate this, the platform will cache our raw data files.
If we'd uploaded the files via the uploader, that caching would have been kicked off automatically. Since we copied the files to the disk manually, we'll kick off the caching process by hand. We do this by selecting a batch of files and choosing Cache Files from the menu.
We could have chosen to initiate caching via the SDK:
import glob
from lit.sdk.data.adapter import load_adapter

team = "contoso"
paths = sorted(glob.glob("/data/contoso/raw/aapl/*.csv.gz"))
for path in paths:
    # Load the Refinitiv adapter for a single raw file and cache it.
    adapter = load_adapter(team, "refinitiv", {"paths": [path]})
    adapter.create_cache_file(path)
We can monitor the caching either by clicking on the shell icon in the app or by attaching to one of the running screen sessions.
When caching is complete, the Uploads app will show checkmarks in the cached column. Add the files to our volatility project by checking the boxes in the left-hand column.
Discovery#
Let's open up features and start exploring our data.
Via App#
To start, let's maximize the discovery chart.
Then drill down into this packet by holding down SHIFT and clicking on it to see whether we need to update which packets we're filtering out.
{
"index": 24277872,
"Price": 93.69,
"Volume": 44,
"Seq. No.": null,
"session": 0,
"Qualifiers": "ODD[IRGCOND];Q[GV3_FLAG];X[GV3_TEXT]; [PRC_QL2];@ I[GV4_TEXT]",
"timestamp": 1463597875626078700,
"symbol": "AAPL.O",
"Date-Time": "2016-05-18T18:57:55.626Z"
}
Odd lot, irregular condition. Copy and paste ODD[IRGCOND]
into the search bar. Let's see how prevalent that qualifier is.
Woah. That's a lot of yellow. We'll need an SME to weigh in on whether this is signal or noise.
While we're here, let's look for any other anomalies. Hit the shuffle button to get another random index into the dataset.
Note how fast this is. In this example we're grabbing tens of thousands of datapoints from random locations within the AAPL
tick-level transactions and displaying them on the screen in 1 to 2 seconds. The files we uploaded were 12 compressed CSV files that are now being treated as a single dataset.
Zoom out with your mouse scroll wheel. The platform is capable of displaying hundreds of thousands of data points at a time.
There are 179,931
datapoints in this screenshot. Seeing the data like this can provide valuable insights that will impact your AI model performance.
Via Code#
You can do all of this in code as well.
The adapter gives you random access to the data with the same ruthless performance implemented in the UI.
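For instance, here's a small sketch of random access through the SDK, using the same load_adapter_by_project and get_bars calls we lean on later in this walkthrough. It's an illustration only; your timings will depend on hardware and data.

```python
import random
import time

from lit.sdk.data.adapter import load_adapter_by_project

# Open the whole AAPL dataset attached to the volatility project.
adapter = load_adapter_by_project("contoso", "volatility")

start = time.perf_counter()
for _ in range(10_000):
    # Jump to a random transaction anywhere in the dataset...
    index = random.randint(0, len(adapter) - 1)
    # ...and pull a handful of one-minute OHLCV bars ending at that point.
    bars = adapter.get_bars(index, 5, "minute", 1)
print(f"10,000 random reads in {time.perf_counter() - start:.2f}s")
```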
Features#
Since this is a new installation of LIT we haven't yet created any features. We do, however, have the Traders plugin installed so let's go see what scripts come with that plugin.
Select the Scripts
tab.
Let's select sma
(Simple Moving Average).
Let's fill out the form with some arbitrary values and hit the Test button to see how this works.
Just under our discovery chart you'll see a table of values: sma and sma_normalized.
Let's peel back a layer and see how this works. Click the yellow Edit Source
button on the feature.
Note that all of the inputs we filled out in the form as an application user are declared in the docstring. The author of the feature script decides what inputs the feature needs from the user in order to compute its values.
Note that the function that computes the feature takes as input just 2 required parameters: adapter
, index
.
Note the type of the adapter parameter: TradersBaseAdapter. The TradersBaseAdapter base class is a contract between the author of
the adapter and the author of the feature: the feature can rely on certain functions being available for data access.
In this case the feature author calls adapter.get_bars
which retrieves an arbitrary number of OHLCV (Open, High, Low, Close,
Volume) bars for any index into the dataset.
The feature then uses numpy's sliding_window_view
and mean
functions to compute the sma
.
The feature then creates a sma_normalized
by subtracting the last Close price from the SMA values.
The sma
and sma_normalized
are then returned.
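Putting those pieces together, the body of the sma feature script looks roughly like the sketch below. This is a paraphrase for illustration, assuming the feature(adapter, index, ...) signature and docstring conventions we just looked at; the count parameter is an assumption about the form fields, and the shipped sma.py is the authoritative version.

```python
"""
Parameters
----------
rate : number
    The number of bars in the moving-average window; e.g. [10]-bar SMA
size : number
    The number of units in each bar; e.g. 10 [1]-minute bars
unit : [hour,minute,second]
    Either hour, minute, or second

Returns
-------
[ sma, sma_normalized ]
"""
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

from lit.plugins.traders.adapter import TradersBaseAdapter


def feature(adapter: TradersBaseAdapter, index: int, params: dict = {}, features: dict = {}):
    rate = params.get("rate") or 10
    size = params.get("size") or 1
    unit = params.get("unit") or "minute"
    count = params.get("count") or 100  # assumed form field
    # Enough bars to produce `count` SMA values over a `rate`-bar window.
    bars = adapter.get_bars(index, count + rate - 1, unit, size)
    if len(bars) < count + rate - 1:
        return []
    close = bars[:, 3]  # assumes OHLCV column order: o, h, l, c, v, ...
    # Rolling mean of the close prices over a `rate`-bar window.
    sma = sliding_window_view(close, rate).mean(axis=1)
    # Normalize by subtracting the most recent close price.
    sma_normalized = sma - close[-1]
    return [sma, sma_normalized]
```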
Back to the task at hand: building a model to predict volatility. For the sake of simplicity let's use the Average True Range (ATR) as our measure of volatility. We're not overly concerned with how it's calculated, other than that it can be calculated over any timeframe; e.g. second-by-second, minute-by-minute, etc.
Inputs#
Let's hypothesize, you and I, that a good trader can look at a stock chart and make a prediction "The volatility is likely to go up". We must ask ourselves, what data, exactly, does this trader consider when making this prediction? What we're doing now is something we fondly call "small data analysis".
Let's hypothesize that the trader is looking at:
- a minute-bar OHLCV chart for the last 100 minutes
- a minute-by-minute ATR line chart for the last 100 minutes
The OHLCV chart is created by the feature script unit_bar, so named because the common name for this feature is a concatenation of the unit of time and the word bars; e.g. minute bars, hour bars. Click unit_bar and fill out the form with count 100, size 1, and unit minute to create 100 1-minute bars. Click Test to see the newly created unit bars under the discovery chart.
Let's save this feature. Give it a name then click Save
.
We're taken back to the Features tab and now we have a saved feature. This feature is now a part of your library and usable in all future AI model development projects.
Let's go back to Scripts
and create our ATR feature.
Labels#
Okay. That's great. Now that we have the input features all we need is a label. If you think of AI model training as a
child learning from a corpus of learning material (in data science parlance, training data
), then the input features
we've just created are the questions. The labels are the answers. If we want the model to predict whether volatility increases in the future, then the label we need is the ATR of the next minute. The Traders plugin comes with several built-in feature scripts, but it does not come with a future ATR feature. We're going to have to extend the features to create our own future ATR feature to use as a label.
Go back to scripts and take a peek at the feature script named price_next_bars
. As you might guess, this is a feature
script for creating a label to predict the price of the next minute-bar. Notice that the call to adapter.get_bars in this feature is slightly different from what we saw in the unit_bar and atr feature scripts: future=True. Aha!
Let's drop into an interactive python shell and play around with this to see if we can come up with a label feature.
We'll start by getting an adapter to play with and grabbing some random index within that dataset.
>>> import random
>>> from lit.sdk.data.adapter import load_adapter_by_project
>>> team, project = "contoso", "volatility"
>>> adapter = load_adapter_by_project(team, project)
>>> index = random.randint(0, len(adapter)-1)
Now let's play with the get_bars
function:
>>> adapter.get_bars(index, 3, "minute", 1)
array([[ 141.18, 141.21, 140.87, 140.99, 3770. , 541076. ,
143.52, 8700. ],
[ 141. , 141.03, 140.91, 140.96, 1952. , 231526. ,
118.61, 5000. ],
[ 140.96, 141.03, 140.86, 140.99, 1847. , 237115. ,
128.38, 5000. ]])
>>> adapter.get_bars(index, 3, "minute", 1, future=True)
array([[ 140.92, 141.21, 140.9 , 141.11, 2041. , 322491. ,
158.01, 8000. ],
[ 141.12, 141.33, 141.12, 141.26, 1609. , 199271. ,
123.85, 4500. ],
[ 141.27, 141.4 , 141.24, 141.39, 1670. , 204523. ,
122.47, 4755. ]])
This is another part of the TradersBaseAdapter contract. One of the most common causes of failure in AI projects is features that, through oversight, contain information about the future state. This failure is known as lookahead bias. The get_bars function will, by default, help you avoid lookahead bias by never giving you the computed unit bar for the transaction at this index, because that computed unit bar, by definition, contains data from future transactions. We can override this behavior by specifying prevent_lookahead_bias=False in our call to get_bars.
>>> adapter.get_bars(index, 3, "minute", 1, prevent_lookahead_bias=False)
array([[ 141. , 141.03, 140.91, 140.96, 1952. , 231526. ,
118.61, 5000. ],
[ 140.96, 141.03, 140.86, 140.99, 1847. , 237115. ,
128.38, 5000. ],
[ 140.99, 141. , 140.88, 140.91, 1736. , 181989. ,
104.83, 4000. ]])
>>> adapter.get_bars(index, 3, "minute", 1, future=True, prevent_lookahead_bias=False)
array([[ 140.99, 141. , 140.88, 140.91, 1736. , 181989. ,
104.83, 4000. ],
[ 140.92, 141.21, 140.9 , 141.11, 2041. , 322491. ,
158.01, 8000. ],
[ 141.12, 141.33, 141.12, 141.26, 1609. , 199271. ,
123.85, 4500. ]])
Now we see a minute-bar in common between the 3 past minute-bars and 3 future minute-bars.
Armed with this knowledge, and some judicious copy/pasting from the atr.py
feature script, let's write a feature
that computes either a 1 or a 0 depending on whether the next unit-bar has an ATR value greater than the current unit-bar's.
"""
Parameters
----------
rate : number
When calculating across a window, the number of bars to include in
that window; e.g. [10]-day moving average for 30 days
size : number
The number of units in each bar; e.g. 10 [1]-second bars
unit : [hour,minute,second]
Either hour, minute, or second; e.g. 10 1-[second] bars
Returns
-------
[ label ]
"""
import logging
from typing import Callable
import numpy as np
import pandas as pd
from numpy.lib.stride_tricks import sliding_window_view
from lit.plugins.traders.adapter import TradersBaseAdapter
# https://stackoverflow.com/a/74282809/61396
def rma(s: pd.Series, period: int) -> pd.Series:
    # Wilder's smoothing (running moving average) used by the classic ATR.
    return s.ewm(alpha=1 / period).mean()

def atr(df: pd.DataFrame, length: int = 14) -> pd.Series:
    # Ref: https://stackoverflow.com/a/74282809/
    high, low, prev_close = df['h'], df['l'], df['c'].shift()
    tr_all = [high - low, high - prev_close, low - prev_close]
    tr_all = [tr.abs() for tr in tr_all]
    tr = pd.concat(tr_all, axis=1).max(axis=1)
    atr_ = rma(tr, length)
    return atr_

def feature(adapter: TradersBaseAdapter, index: int, params: dict = {}, features: dict = {}):
    size = params.get('size') or 1
    unit = params.get('unit') or 'sec'
    rate = params.get('rate') or 14
    # future=True so that the window includes the upcoming bar we want to label.
    data = adapter.get_bars(index - rate + 1, rate, unit, size, future=True)
    if len(data) < rate:
        return []
    df = pd.DataFrame(data=data[:, :4], columns=['o', 'h', 'l', 'c'])
    values = atr(df, rate).values
    values = np.expand_dims(values, axis=1)[-2:]
    # Label: 1.0 when the next bar's ATR is higher than the current bar's.
    result = [1.0] if values[1] > values[0] else [0.0]
    return result
Now let's fill out the form and save this feature.
We have just one more feature to add.
As it's configured now, when the training data is built, the LIT platform build tools will iterate over each and every one of the 907,849,322 transactions in the data. It would be ideal if we only created one training data sample for each minute-bar, and ideally we'd like that sample to be made from the very last transaction in each minute bar. Lucky for us, the Traders plugin contains just such a feature. Take a peek at the feature named per_unit.
This feature returns a [1.0] if the next index falls in a different unit of time (second, minute, hour, etc.) than the current index. Otherwise, the feature returns an empty array.
Note, also, the sort_key
function in this feature script:
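To make the idea concrete, a per_unit-style script with a sort_key might look something like the sketch below. This is an illustration only: the get_timestamp accessor is hypothetical, and the actual per_unit.py that ships with the Traders plugin (including its sort_key signature) may differ.

```python
"""
Emit one training sample per unit of time by accepting only the last
transaction in each second/minute/hour.

Parameters
----------
unit : [hour,minute,second]
    Either hour, minute, or second; e.g. one sample per [minute]

Returns
-------
[ per_unit ]
"""
import pandas as pd

from lit.plugins.traders.adapter import TradersBaseAdapter

# Map the unit name to a pandas floor() frequency string.
_FREQ = {"hour": "h", "minute": "min", "second": "s"}


def sort_key():
    # Hypothetical signature: a low sort key asks the build workers to compute
    # this feature before all others, so rejected samples are discarded before
    # any expensive features are computed.
    return 0


def feature(adapter: TradersBaseAdapter, index: int, params: dict = {}, features: dict = {}):
    unit = params.get("unit") or "minute"
    if index + 1 >= len(adapter):
        return []
    # get_timestamp() is a hypothetical accessor used here for illustration;
    # substitute whatever the adapter actually exposes for reading a
    # transaction's timestamp.
    this_ts = pd.Timestamp(adapter.get_timestamp(index)).floor(_FREQ[unit])
    next_ts = pd.Timestamp(adapter.get_timestamp(index + 1)).floor(_FREQ[unit])
    # Accept the sample only when the next transaction falls in a different
    # unit of time, i.e. this is the last transaction of the current unit.
    # An empty array rejects the sample.
    return [1.0] if this_ts != next_ts else []
```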
Let's save the feature.
Schema#
The LIT platform enforces the creation of a schema
before building a training dataset. The purpose of the schema is both to provide a detailed historical record of how the data build was initiated and to ensure that subsequent builds, both clean and incremental, are consistent and reproducible.
Start by opening Schemas via the menu.
Then click the 'plus' button in the top-right to add a new schema.
If you followed along with this tutorial and had the AAPL
data files selected in the Uploads
screen and had the 4
features and inputs selected in the Features
screen, then you'll find that these values are already filled in for
you in the Paths
and Features
fields. If you did not follow along, no worries, simply select them in these two
input fields now.
Since we're already limiting our samples with the per_minute
feature, there's no need to have a Resolution
so we'll
set that to 1
. Let's also change our Split
to Date
so we can do our Test/Train
split by date. We could also have chosen Random, which would have randomly selected a given percentage of the data samples, or End, which would have split the whole dataset at a given point and used the second part of the split for the test data. For this example, I'm already anticipating that we may want to add more data in the future, and to do that we may build multiple datasets from multiple tickers and train on all of them at the same time. If we do that, the only way to ensure that the neural networks can't cheat is by telling all of the datasets to split by a date certain. I'll choose 1/1/2024
as that
split date.
Looks good. Save
it.
What are we waiting for? Let's launch a build!
Build#
Let's open the builds to monitor our new build.
Status#
If we open our running build we're presented first with the Status
tab.
Processes#
If we click on the Processes
tab we can see both running and completed processes.
Click on any running process to get connected directly to the screen session in read-only mode.
Log#
Click on the Log tab at the top to see a detailed shared log where all running processes report their messages.
First, let me draw your attention to the order of the Computed fields
in that very last visible log message. Note
that the first field to be computed is per_minute
and the other computed fields appear in the order in which the
features appear in our schema. That's because of the sort_key function we noted earlier in the source code of the per_unit.py feature script. Feature authors can override the order in which features are
computed. Why is this important? In this case, the per_minute
feature will REJECT any data sample where the very
next index is not in the next minute. As soon as a sample is rejected by a feature, the data worker discards that sample
without bothering to compute any more features. By moving the per_minute
feature to the front, we skip computing
features for all of the many data samples that will get rejected by this feature. For this project, this one optimization
will save us days of compute time.
The logs are stored under /data/{team_name}/logs/{build_number}/
as well as a few other artifacts including a snapshot
of the schema used to execute the build.
litadmin@lit:/data/contoso/logs/26$ ls -lht
total 20K
-rw-rw-r-- 1 ben ben 5.0K Aug 12 12:31 workflow.log
-rw-rw-r-- 1 ben ben 2.1K Aug 12 12:31 pids.json
-rw-rw-r-- 1 ben ben 13 Aug 12 12:27 status.txt
-rw-rw-r-- 1 ben ben 1.2K Aug 9 13:32 schema_1723228265324.json
litadmin@lit:/data/contoso/logs/26$ tail workflow.log
2024-08-12 12:29:31.437 3953838 INFO data_worker - process_chunk: Computed fields: ['per_minute', '100_1_minute_atr', '100_1_minute_bars', 'atr_up_next_minute'] Dependencies: []
2024-08-12 12:30:02.174 3954516 INFO data_worker - process_chunk: 10172363 samples at resolution of 1
2024-08-12 12:30:02.174 3954516 INFO data_worker - process_chunk: Writing to /data/contoso/build/schema_1723228265324/tmp/data/schema_1723228265324_50861815_61034177.h5 mode w ...
2024-08-12 12:30:02.317 3954516 INFO data_worker - process_chunk: Computed fields: ['per_minute', '100_1_minute_atr', '100_1_minute_bars', 'atr_up_next_minute'] Dependencies: []
2024-08-12 12:30:32.316 3955693 INFO data_worker - process_chunk: 10172363 samples at resolution of 1
2024-08-12 12:30:32.316 3955693 INFO data_worker - process_chunk: Writing to /data/contoso/build/schema_1723228265324/tmp/data/schema_1723228265324_61034178_71206540.h5 mode w ...
2024-08-12 12:30:32.418 3955693 INFO data_worker - process_chunk: Computed fields: ['per_minute', '100_1_minute_atr', '100_1_minute_bars', 'atr_up_next_minute'] Dependencies: []
2024-08-12 12:31:03.181 3957082 INFO data_worker - process_chunk: 10172363 samples at resolution of 1
2024-08-12 12:31:03.182 3957082 INFO data_worker - process_chunk: Writing to /data/contoso/build/schema_1723228265324/tmp/data/schema_1723228265324_71206541_81378903.h5 mode w ...
2024-08-12 12:31:03.341 3957082 INFO data_worker - process_chunk: Computed fields: ['per_minute', '100_1_minute_atr', '100_1_minute_bars', 'atr_up_next_minute'] Dependencies: []
Feature authors who use Python's standard logging library will find their log messages saved here. A warning: the feature code is executed once per sample, so if you're doing a build with a billion data samples and you have several calls to logging.info, you may find yourself with a large and unwieldy workflow.log.
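For example, a feature author might keep per-sample messages at debug level (which a typical log configuration filters out) and reserve logging.info for exceptional conditions. This is an illustrative pattern, not taken from the shipped scripts:

```python
import logging

logger = logging.getLogger(__name__)


def feature(adapter, index, params: dict = {}, features: dict = {}):
    # Debug messages are cheap to leave in per-sample code paths because they
    # are usually filtered out by the default log level.
    logger.debug("computing feature for index %s", index)

    bars = adapter.get_bars(index, 100, "minute", 1)
    if len(bars) < 100:
        # Reserve info-level messages for unexpected conditions rather than
        # logging once per sample.
        logger.info("not enough history at index %s; rejecting sample", index)
        return []
    return [float(bars[:, 3].mean())]
```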
Just under 30 hours later, build complete.
Assets#
The test and training data we've just built can be found under Assets
in the app.
Audit#
Select the newly created asset.
By default, the build writes the training data to /data/{team}/assets/{schema_name}/queue/
. The idea behind building into a
queue folder is two-fold:
- While it is not absolutely necessary to audit your training data before you train AI models on it, we STRONGLY encourage you to do so. By building into a queue folder, forcing the user to explicitly Promote those assets into Production before training, and recording both when that step happened and who pressed the button, we are encouraging you to make auditing a part of your model training lifecycle.
- If we were already training models on one version of an asset and then decided to add some feature and rebuild the same asset, the build would not interfere with the already running training sessions.
Let's crack open the train
file and see what's inside.
Because this is a walkthrough we won't do an exhaustive audit here, but I'll draw your attention to a couple of things.
- Again, as with the feature explorer, we have a sample index input field in the top-right with a Shuffle button to make it easy to audit random samples quickly.
- Reading the min, mean, and max for each column of data is often a quick shortcut to assessing data quality (a scripted version of this check is sketched below).
- If you're using a categorical label (as opposed to a numerical one), as we are in this example, take note of the mean value. The further away that number is from 0.5, the more difficult it will be for your model to learn.
- Take note of the total number of samples in each of your train and test files, especially if you chose to split your test/train by date. Too few samples in either the train or test set will cause difficulties when training.
Promote#
Click Promote to Production
.
Alternatively, you can move the files on disk from the queue
folder to its parent folder. This will have the same effect except the application won't have an opportunity to log the 'who' and 'when' of the promotion.
Model Design#
It's time to do a little neural network architecture. Let's start by opening the model designer and clicking the 'Add' button in the top-left.
We're taken to a blank canvas. Start by opening Settings
.
Give the canvas a name, select our project, and select the asset we just built.
Let's add our first component, Input
.
Select one of the features we built then click Add
.
Components#
After clicking Add
it's added to the canvas. Click on it to bring up the component properties. As with the features there's an Edit Source
button. Click it.
At compile-time, the platform will call make_component
on each component on the canvas, turning your canvas into a Tensorflow Keras functional model. The source scripts for all of the components on the canvas can be found in /data/{team}/components
. You are encouraged to extend the ones we ship out-of-the-box with your own custom components.
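As a rough illustration, a custom component script could look something like the sketch below. The make_component signature shown here is an assumption made for this example; mirror whatever the out-of-the-box components under /data/{team}/components actually use.

```python
# A hypothetical custom component: a small stack of Dense layers with dropout.
# The make_component(inputs, params) signature is assumed for illustration.
import tensorflow as tf


def make_component(inputs, params: dict = {}):
    units = int(params.get("units", 64))
    layers = int(params.get("layers", 2))
    x = inputs
    # A single canvas component can expand into any number of Keras layers.
    for _ in range(layers):
        x = tf.keras.layers.Dense(units, activation="relu")(x)
        x = tf.keras.layers.Dropout(0.2)(x)
    return x
```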
The same way that you added the Input
component, add a Dense
node to the canvas and connect the two by clicking on the dot to the right of the input node and dragging it to the left dot on the dense layer.
Select the Dense
layer and, following the input hints, fill out the form.
Note the layers property. Each component on the canvas can represent any number of layers and any internal structure.
Now add an Output
node.
Note the feature
drop-down selector. We can select any of the features we built to use as a label. We have the flexibility to build many different labels and easily switch between them.
We have just one component left to add, Globals. Again, follow the guidance in the help text.
Launch Experiment#
We're ready to start training our model. Go back to settings, choose a device to train on, then click 'Launch Experiment'!
Compatibility#
Just a brief note about compatibility. You are not restricted to doing model design on our canvas, nor are you prevented from training or predicting with models outside of the platform.
Recall that the components on the canvas are Python/TensorFlow. The canvas itself, as a whole, compiles into a TensorFlow/Keras model.
And code-based training is always an option. Here's the familiar mnist
running inside of LIT:
import lit
import tensorflow as tf
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
y_train, y_test = tf.keras.utils.to_categorical(y_train, 10), tf.keras.utils.to_categorical(y_test, 10) # one-hot encoding
model = tf.keras.Sequential([
tf.keras.Input(shape=(28, 28)),
tf.keras.layers.Lambda(lambda x: x/255.0), # normalize 0-255 values to 0-1
tf.keras.layers.Reshape(target_shape=(28, 28, 1)), # Conv2D expects color-channel to have a dimension
tf.keras.layers.Conv2D(32, kernel_size=(3, 3), activation="relu"),
tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Conv2D(64, kernel_size=(3, 3), activation="relu"),
tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
tf.keras.layers.Flatten(),
tf.keras.layers.Dropout(0.5),
tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(loss="categorical_crossentropy", optimizer="adam", metrics=["accuracy"])
with lit.run(team="contoso", description="initial test") as run:
model.fit(
x_train, y_train,
batch_size=512, epochs=500, validation_split=0.1,
callbacks=[lit.callbacks.csv(), tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)])
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])
In our experience, though, collaborating with SMEs via the canvas results in less time spent iterating and more time spent innovating.
Training#
Let's monitor our running experiment. Start by opening the Experiments
app.
Select the experiment. First, note that we can connect directly to the running training session in the CLI by clicking on the shell
button.
You can also connect to these sessions directly via the shell using screen
.
Rather than monitoring STDOUT
, let's monitor the live performance statistics of one or more running training sessions in tabular format via the Grid
button.
We can also monitor one or more running training sessions in a visual form via the Chart
button.
Use the column selector to choose which metrics to monitor.
After 100 epochs the precision & recall numbers start to look interesting enough to check out in a live environment.
Let's take epoch 104 and add it to the vault. To do so, just click the epoch you want saved then choose 'Add To Vault'. This can be done from either the Grid or Chart view.
In a few seconds you'll see the grid refresh with a link in the vault
column for the epoch you saved.
After 147 epochs, I stopped this run by clicking on Stop Selected
from the menu.
All of the data related to training can be found on disk here: /data/{team}/train/{project}/{runid}
lit-admin@lit:/data/contoso/train/volatility/6$ ls -R
.:
definition.json description.txt features.json log.txt perf.csv saved_models session.txt
./saved_models:
model.100.h5 model.110.h5 model.120.h5 model.130.h5 model.140.h5 model.17.h5 model.27.h5 model.37.h5 model.47.h5 model.57.h5 model.67.h5 model.77.h5 model.87.h5 model.97.h5
model.101.h5 model.111.h5 model.121.h5 model.131.h5 model.141.h5 model.18.h5 model.28.h5 model.38.h5 model.48.h5 model.58.h5 model.68.h5 model.78.h5 model.88.h5 model.98.h5
model.102.h5 model.112.h5 model.122.h5 model.132.h5 model.142.h5 model.19.h5 model.29.h5 model.39.h5 model.49.h5 model.59.h5 model.69.h5 model.79.h5 model.89.h5 model.99.h5
model.103.h5 model.113.h5 model.123.h5 model.133.h5 model.143.h5 model.1.h5 model.2.h5 model.3.h5 model.4.h5 model.5.h5 model.6.h5 model.7.h5 model.8.h5 model.9.h5
model.104.h5 model.114.h5 model.124.h5 model.134.h5 model.144.h5 model.20.h5 model.30.h5 model.40.h5 model.50.h5 model.60.h5 model.70.h5 model.80.h5 model.90.h5
model.105.h5 model.115.h5 model.125.h5 model.135.h5 model.145.h5 model.21.h5 model.31.h5 model.41.h5 model.51.h5 model.61.h5 model.71.h5 model.81.h5 model.91.h5
model.106.h5 model.116.h5 model.126.h5 model.136.h5 model.146.h5 model.22.h5 model.32.h5 model.42.h5 model.52.h5 model.62.h5 model.72.h5 model.82.h5 model.92.h5
model.107.h5 model.117.h5 model.127.h5 model.137.h5 model.147.h5 model.23.h5 model.33.h5 model.43.h5 model.53.h5 model.63.h5 model.73.h5 model.83.h5 model.93.h5
model.108.h5 model.118.h5 model.128.h5 model.138.h5 model.14.h5 model.24.h5 model.34.h5 model.44.h5 model.54.h5 model.64.h5 model.74.h5 model.84.h5 model.94.h5
model.109.h5 model.119.h5 model.129.h5 model.139.h5 model.15.h5 model.25.h5 model.35.h5 model.45.h5 model.55.h5 model.65.h5 model.75.h5 model.85.h5 model.95.h5
model.10.h5 model.11.h5 model.12.h5 model.13.h5 model.16.h5 model.26.h5 model.36.h5 model.46.h5 model.56.h5 model.66.h5 model.76.h5 model.86.h5 model.96.h5
You can click the link to go directly to the vaulted item from this screen.
Vault#
You can also get to the vaulted item from the Vault
app. Open the Vault
app from the dash menu.
There are a few ways we can deploy this model from the vault into production. One simple way is to open the vault detail, choose a deployment, then save the changes.
But before we deploy, we do need somewhere to deploy to.
Streams#
To do this, we start by opening Streams
and clicking the Add
button.
Choose an adapter from the drop-down list.
Then fill out the form as prescribed by the data provider.
The stream configuration data is stored on disk here: /data/{team}/streams/
lit-admin@lit:/data/contoso/streams$ cat testapi.json
{
"adapter": {
"name": "restapi"
},
"parameters": {
"port": 12345,
"api_keys": []
}
}
The base LIT platform comes with a REST API output stream. The Traders plugin we have installed has output streams for NASDAQ and CME.
User interface#
Streaming adapter authors may provide their own user interface, which will be shown when a user clicks on a Stream.
Streams using the REST API streaming data adapter show a Swagger interface when clicked.
Streams using the NASDAQ streaming data adapter, provided by the Traders plugin, show a candlestick chart.
In the same manner as the Discovery tool, the Traders plugin allows you to drill down on the minute bars all the way down to the packet level.
Regardless of which stream you choose, the next step is to create a deployment service for your models.
Deployments#
Start by opening Deployments.
The lit-udf
service shown here comes with the Traders plugin. It's a standalone service responsible for providing data to the candlestick stream user interface at scale.
Let's create two deployments here. One for the RESTAPI stream and one for the NASDAQ stream.
Click Add new Deploy
from the menu.
Give it a name, select a streaming data adapter, hit Save
.
Once you've done that, go back into the deployment detail to see if there are additional configuration options that need to be set.
Deploy through Vault#
Now let's deploy one model via deployments and the other through the vault.
We'll start with vault. Go back to the vault detail, then select the restapi_1
deployment that we just set up from the Deployment
drop-down. Then hit Save
.
When we go back to the Swagger user interface surfaced by the REST API stream and execute the models endpoint, we see our model there, ready to make predictions.
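If you'd rather script it than click through Swagger, a call along these lines should work. The /models path and response shape are assumptions based on the endpoint name above, and the host and port come from the stream configuration we saved earlier.

```python
import requests

# Hypothetical: hit the REST API stream directly. Adjust host, port, path and
# any API key headers to match your stream configuration.
resp = requests.get("http://localhost:12345/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```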
Deploy through Deployments#
We open the NASDAQ deployment, click the models drop-down, and select one or more models to be served up by this deployment. Then we hit Save and Restart.
You should see the deployment service output in the shell at the bottom of the deployment detail. This is a live view of the STDOUT
of the deployment service.
We can see from the output that our model is making predictions each minute.
Let's view those predictions using the Traders plugin model viewer. From the Stream
user interface, select our deployed model from the models
drop-down menu.
The deployment configuration data is stored here: /data/{team}/preview
lit-admin@lit:/data/contoso/preview$ cat NASDAQ_AAPL.json
{
"name": "NASDAQ_AAPL",
"output": {
"device": "cpu",
"models": [ "Unnamed_0" ]
},
"stream": "nasdaq",
"parameters": {
"ticker": "AAPL",
"interval": "1minute"
}
}
Insights#
Now let's open up Insights on one of these predictions to learn more about why it made the prediction it did. SHIFT-CLICK on any individual prediction to open Insights.
Let's start on the Impact
tab.
You can see here that Volume
had the highest saliency and independence score.
Switching back to the Inputs tab and viewing the data in chart form, it looks pretty obvious in hindsight that the recent volume numbers are maxed out.
One way to interpret this insight is that the model has learned that when volumes are elevated, volatility is likely to increase.
We can test that theory a little further. Switch back to the tabular data view and change some of those trailing 9 values into 0 and 1 values. As you make those changes, the platform triggers the model to make a new prediction and shows you what its output would have been under those alternate circumstances. In this case, after we lowered the volumes, the prediction value went down from 0.51 to 0.47.