Building an ML Trade Bot from Scratch

In a 100 lines of code

The world of financial trading is complex; it requires thorough knowledge of a number of practical aspects of the trade cycle to successfully run a strategy.

In this project, I tried to make trading as simple as running a cell in a Jupyter notebook.

Of course there was systematic thought involved in building the bot, but the idea wasn’t to develop a sophisticated strategy, rather, to develop A strategy FAST.

1. The Data

But first…

1.1 Alpaca

Alpaca is an API first broker that allows you to trade and maintain your portfolio using their python REST API. I used Alpaca as my broker and obtained API-keys for my paper trading account on Alpaca, as with any other SaaS.

1.2 Universe

I chose my universe of assets to be approximately 200 US ETFs chosen from a larger universe of 4000+ listed US ETFs. The criterion of selection was a threshold on the volume weighted average price having a mean of larger than a 100 in the past month. The volume weighted average price is a proxy measure for the liquidity of the asset. I preferred to deal with highly liquid ETFs.

The list of 4000+ tickers was pulled from the data vendor 12Data. They maintain an API endpoint for all 13000+ ETFs listed on all major exchanges in the world!

This is however, a matter of discretion, and not science, and one is free to choose the universe in any other way they desire, without changing the approach. It could be stocks, bonds, options, currency, crypto, or other alternatives.

1.3 OLHCV Data

With our universe determined, I now pull data for the last 8 years in the following manner:

  1. Install and import libraries
import numpy as np
import pandas as pd

from alpaca_trade_api.rest import REST, TimeFrame
from ta import add_all_ta_features # to add technical features
from datetime import datetime, timedelta
from tqdm.auto import tqdm # to create a progress bar for loops

from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings("ignore")
  1. Declare your Alpaca API Keys
api = REST(key_id="YOUR-API-KEY", secret_key="YOUR-SECRTE-KEY", base_url="https://paper-api.alpaca.markets", api_version="v2")
  1. Import file of tickers in universe (here, ETF universe)
us_etfs = pd.read_csv("YOUR-UNIVERSE.csv")
  1. Set up data strcutures to store:
    • symbols to buy
    • symbols to sell
    • last close prices of symbols to buy
    • returns of symbols to buy
symbols_to_buy = []
symbols_to_sell = []
price = pd.DataFrame()
rets = pd.DataFrame()
  1. Determine training period using the datetime module
current_time = datetime.now()
tomorrow = current_time + timedelta(days=1)
yesterday = current_time - timedelta(days=1)
end = yesterday.strftime('%Y-%m-%d')
years = 8
start = current_time.replace(year=current_time.year - years).strftime('%Y-%m-%d')
  1. Pull OLHCV bar data from Alpaca API We do this in a loop but the general format is:
bars = api.get_bars("SYMBOL", TimeFrame.Day, start, end)
if len(bars) == 0:
    continue
else:
    df = bars.df

The last bit of code (if-else statement) just makes sure our dataframe isn’t empty due to errors like the security not actively trading for any period in our 8 year window.

2. Features and Split

2.1 Adding Features

With OLHCV data now pulled, we add all possible technical indicators from the library TA-lib using their python wrapper. Technical Indicators are derived from market data and serve as features to predict price movements.

We do the following next:

  1. Store price data of all tickers in the universe in our price dataframe. This will be of use to us later.
price[symbol] = df['close']
  1. Add all technical indicators to the dataframe with the TA-Lib module. Here, the functions bfill() and ffill() impute NaNs and missing values in the indicators. Of course we could impute these values with the mean or median, or create a probability distribution and sample from it, or even create an ML model to impute missing values. However, for simplicity, we stick to backward and forward filling.
df = add_all_ta_features(df, open="open", high="high", low="low", close="close", volume="volume").bfill().ffill()
  1. The log returns for each security is calculated and stored for later use. We also define our signal. This is the value that we will try to predict by building an ML model. The hi_lo_signal simply flags the days when the return was positive with a 1, and days where returns were negative remain 0.
df['ret'] = np.log(df.close) - np.log(df.close.shift(1))
rets[symbol] = df['ret']
df['hi_lo_signal'] = np.where(df.ret>0,1,0)

2.2 Splitting Dataset in Train and Test Sets

We can now split the data into train and test sets, as is the convention in machine learning, to build a model. We use the train set to build a model, and the test set to evaluate it. We set our train period to all years prior to (and including) 2022, while our test period as all years after 2022 (starting 2023).

We want to predict the hi_lo_signal defined earlier, hence, we define this as our target variable y. We drop this column from the rest of the data and define these features for the prediciton of the hi_lo_signal to be X. Note that we shift the data to avoid a forward looking bias.

train = df[df.index.year <= 2022]
test = df[df.index.year > 2022]
X_train = train.drop(columns=['hi_lo_signal']).shift(1).dropna().reset_index(drop=True)
y_train = train['hi_lo_signal'][2:]

X_test = test.drop(columns=['hi_lo_signal']).shift(1).dropna().reset_index(drop=True)
y_test = test['hi_lo_signal'][1:]

assert X_train.shape[0] == y_train.shape[0]

3. The Alpha Model

The Alpha model is the most important aspect of a trading strategy. It gives us incentive to actively manage a portfolio instead of just holding a portfolio of index funds for life i.e., it gives us the excess return beyond what the market returns.

Here, our alpha model simply tries to predict the hi_lo_signal, which in other words, is trying to predict which securities will rise in value in the next day. I choose a decision tree classifier as my model after backtesting a whole host of linear, regularized, non-parametric, and ensemble methods. One is free to choose any model they prefer. I also avoid parameter tuning as this makes the program extensively long with marginal gain in accuracy.

try:
    clf = DecisionTreeClassifier(max_depth=10, random_state=42)
    clf = clf.fit(X_train, y_train)
    y_preds = clf.predict(X_test)

    if y_preds[-1] == 1:
        symbols_to_buy.append(symbol)
    else:
        symbols_to_sell.append(symbol)
except:
    continue

4. Conclusion

Our alpha model has now created signals for which securities it expects to rise in price, thus providing a positive return, and in turn, a long signal. The securities that the model expects to drop in price over the next day form the short signal. We can now easily create a portfolio that is long the securities to buy, and short the securities to sell. However, how do we decide the quantities to buy for each security? We discuss this and other portfolio and risk management techniques in the next post. Stay tuned!

The full code for the data and alpha modules is as follows:

import numpy as np
import pandas as pd

from alpaca_trade_api.rest import REST, TimeFrame
from ta import add_all_ta_features
from datetime import datetime, timedelta
from tqdm.auto import tqdm

from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings("ignore")

api = REST(key_id="YOUR-API-KEY", secret_key="YOUR-SECRET-KEY", 
                    base_url="https://paper-api.alpaca.markets", api_version='v2')

us_etfs = pd.read_csv("YOUR-UNIVERSE")

symbols_to_buy = []
symbols_to_sell = []
price = pd.DataFrame()
rets = pd.DataFrame()

current_time = datetime.now()
tomorrow = current_time + timedelta(days=1)
yesterday = current_time - timedelta(days=1)

# end = current_time.strftime('%Y-%m-%d')
end = yesterday.strftime('%Y-%m-%d')
years = 3
start = current_time.replace(year=current_time.year - years).strftime('%Y-%m-%d')

for symbol in tqdm(us_etfs['symbol']):
    bars = api.get_bars(symbol, TimeFrame.Day, start, end)
    if len(bars) == 0:
        continue
    else:
        df = bars.df

    price[symbol] = df['close']

    df = add_all_ta_features(df, open="open", high="high", low="low", close="close", volume="volume").bfill().ffill()
    df['ret'] = np.log(df.close) - np.log(df.close.shift(1))
    rets[symbol] = df['ret']
    df['hi_lo_signal'] = np.where(df.ret>0,1,0)

    train = df[df.index.year <= 2022]
    test = df[df.index.year > 2022]

    X_train = train.drop(columns=['hi_lo_signal']).shift(1).dropna().reset_index(drop=True)
    y_train = train['hi_lo_signal'][2:]

    X_test = test.drop(columns=['hi_lo_signal']).shift(1).dropna().reset_index(drop=True)
    y_test = test['hi_lo_signal'][1:]

    assert X_train.shape[0] == y_train.shape[0]

    try:
        clf = DecisionTreeClassifier(max_depth=10, random_state=42)
        clf = clf.fit(X_train, y_train)
        y_preds = clf.predict(X_test)

        if y_preds[-1] == 1:
            symbols_to_buy.append(symbol)
        else:
            symbols_to_sell.append(symbol)
    except:
        continue