Trading Metrics that Actually Matter

The Jupyter Notebook for this post can be found here.

Traders love their performance metrics. Anyone who’s used their platform’s backtesting features has probably come across a few dozen of them, and everyone’s got their favorite. Anybody who’s anybody in the finance world has one named after them: Sharpe, Sortino, Calmar, Treynor, Gartman, etc. (OK, maybe not the last one). But which ones are the most important? There should be some kind of objective answer to this, right? If you look for the answer to this question on Google, you’ll be quickly overwhelmed. Most people want to give you 5+ different metrics you should be focusing on, all at once. But they can’t all be equally important; what we really want is a fitness function, or one statistic that can be optimized to compare all strategies against one another.

My last two articles covered position sizing from a practitioner’s perspective. In these articles, the idea of “Ideal f” (a truly uncatchy name) was introduced; this is the fraction of capital that should be allocated to an asset/trading strategy to maximize risk-adjusted compounded returns. If you haven’t read them, you should really take the time to do so now. You can read Part 1 here and Part 2 here.

Didn’t read them, did ya? That’s OK; the summary is this: Ideal f determines the fraction of staked capital that will return the highest compounded growth rate whilst constraining drawdown within the trader’s comfort level. Calculating this fraction also returns the projected geometric holding period return (GHPR) of the system at this level of leverage.

This median GHPR, in my opinion, is the only metric that traders should really care about. The end goal is to achieve the highest possible compounded return, while ensuring that the path is smooth enough that you continue to follow the system. If you can’t stomach the system’s drawdowns, you’ll stop trading, and won’t achieve the returns. But at the end of the day, you just want to grow capital as much as possible.

However, there is one problem with using this metric: the time it takes to calculate it. One has to generate thousands of equity curves and calculate metrics based on each of them. While this is feasible to do once or a handful of times, when you want to use the metric as a criterion for an optimization that could include thousands of iterations, it quickly becomes unwieldy. What we seek instead are time-independent metrics that predict what the drawdown-constrained GHPR would be.
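To make the cost concrete, here is a rough sketch of the kind of simulation involved. It is not the Ideal f routine from the earlier articles: `median_ghpr`, its parameters, and the made-up return data are all assumptions for illustration, and the drawdown constraint is left out entirely. It only shows that every evaluation means resampling a large number of equity curves.

```python
import numpy as np

def median_ghpr(returns, f, n_curves=1000, horizon=250, seed=0):
    """Bootstrap equity curves at fraction f and return the median GHPR.

    Hypothetical helper for illustration only; it ignores the
    drawdown constraint that Ideal f applies.
    """
    rng = np.random.default_rng(seed)
    returns = np.asarray(returns, dtype=float)
    # Each row is one resampled path of `horizon` periods
    paths = rng.choice(returns, size=(n_curves, horizon), replace=True)
    # Per-period growth factors when staking fraction f of capital
    growth = 1.0 + f * paths
    # GHPR is the geometric mean of the growth factors, minus 1
    ghpr = np.exp(np.log(growth).mean(axis=1)) - 1.0
    return np.median(ghpr)

# Made-up daily returns, just to have something to resample
rng = np.random.default_rng(42)
daily = rng.normal(0.001, 0.02, size=1000)

for f in (0.5, 1.0, 2.0):
    print(f, median_ghpr(daily, f))
```

Even this stripped-down version builds 250,000 resampled returns per call, and a search over f multiplies that again. Inside an optimizer that evaluates thousands of parameter sets, the cost adds up quickly.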

Introducing: The Metrics

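So let’s look at some popular metrics that can be used to evaluate performance. Each of these can be easily applied to a distribution of trades (I prefer daily marked-to-market returns) agnostic to the order in which they occurred. In the code that follows the list, losses are measured in absolute value and a return of exactly zero counts as a loss.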

Sharpe Ratio: Mean of Returns / St. Dev of Returns
Sortino Ratio: Mean of Returns / St. Dev of Negative Returns
Win %: # Positive Returns / # Total Returns
Win Loss Ratio: Average Positive Return / Average Negative Return
Profit Factor: Sum Positive Returns / Sum Negative Returns
CPC Index: Profit Factor * Win % * Win Loss Ratio
Tail Ratio: 95th Percentile of Returns / 5th Percentile of Returns
Common Sense Ratio: Profit Factor * Tail Ratio
Outlier Win Ratio: 99th Percentile of Returns / Mean Positive Return
Outlier Loss Ratio: 1st Percentile of Returns / Mean Negative Return

If you have a favorite metric you like to look at in your testing, please leave a comment below and we’ll see if it works any better than the ones I came up with!

import numpy as np

def sharpe_ratio(returns):
    return np.mean(returns) / np.std(returns)

def sortino_ratio(returns):
    losses = returns[returns <= 0]
    downside_deviation = np.std(losses)
    return np.mean(returns) / downside_deviation

def mean_return(returns):
    return np.mean(returns)

def win_percent(returns):
    wins = returns[returns > 0]
    return len(wins) / len(returns)

def win_loss_ratio(returns):
    wins = returns[returns > 0]
    losses = returns[returns <= 0]
    avg_win = np.mean(wins)
    avg_loss = np.mean(np.abs(losses))
    return avg_win / avg_loss

def profit_factor(returns):
    wins = returns[returns > 0]
    losses = returns[returns <= 0]
    sum_wins = np.sum(wins)
    sum_losses = np.sum(np.abs(losses))
    return sum_wins / sum_losses

def cpc_index(returns):
    _profit_factor = profit_factor(returns)
    _win_percent = win_percent(returns)
    _win_loss_ratio = win_loss_ratio(returns)
    return _profit_factor * _win_percent * _win_loss_ratio

def tail_ratio(returns):
    percentile_5 = np.abs(np.percentile(returns, 5))
    percentile_95 = np.abs(np.percentile(returns, 95))
    return percentile_95 / percentile_5

def common_sense_ratio(returns):
    _common_sense_ratio = profit_factor(returns) * tail_ratio(returns)
    return _common_sense_ratio

def outlier_win_ratio(returns):
    wins = returns[returns > 0]
    mean_win = np.mean(wins)
    outlier_win = np.percentile(returns, 99)
    return outlier_win / mean_win

def outlier_loss_ratio(returns):
    losses = returns[returns <= 0]
    mean_loss = np.mean(np.abs(losses))
    outlier_loss = np.abs(np.percentile(returns, 1))
    return outlier_loss / mean_loss
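As a quick sanity check, a few of these definitions can be worked by hand on a tiny sample. The five returns below are made up purely for illustration, and the formulas are written inline rather than calling the functions above so that the snippet runs on its own.

```python
import numpy as np

# Tiny made-up sample of five daily returns
returns = np.array([0.02, -0.01, 0.03, -0.02, 0.01])

wins = returns[returns > 0]              # 0.02, 0.03, 0.01
losses = np.abs(returns[returns <= 0])   # 0.01, 0.02

win_pct = len(wins) / len(returns)       # 3 / 5 = 0.6
win_loss = wins.mean() / losses.mean()   # 0.02 / 0.015, about 1.33
pf = wins.sum() / losses.sum()           # 0.06 / 0.03 = 2
cpc = pf * win_pct * win_loss            # 2 * 0.6 * 1.33, about 1.6

print(round(win_pct, 2), round(win_loss, 2), round(pf, 2), round(cpc, 2))
```

Note that the CPC Index is just the product of the other three, so any one of them moving will move it too.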


Metric Correlation to GHPR

Now that we have our metrics defined, we can see how they correlate with our median GHPR at Ideal f. To do so, we’ll simulate a bunch of trading system returns, record all the metrics for each, and see what patterns develop between them. For the tradeable asset universe, I’ve gathered data for five large-cap cryptocurrencies (BCH, BTC, ETH, LTC, and XRP) as well as 67 different ETFs since their inception. The returns used are the mark-to-market daily closing returns of the assets. Each “system” is a random sampling of returns assuming that the system was in the market 25% of the time. We can repeat this process about a thousand times and see what happens. If you’re following along in the notebook, now would be a great time to grab a coffee, walk the dog, call your mom (she’ll appreciate it), etc.

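# Imports for data loading, simulation, and plotting
import os
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from tqdm import tqdm

# Create a blank list of returns
returns_list = []

# Load some bitcoin data
xbt_1d = pd.read_csv('data/XBTUSD_1d.csv', parse_dates=True, index_col=0)

# Add the returns to the list (dropping the leading NaN from pct_change)
xbt_returns = xbt_1d['Close'].pct_change().dropna()
returns_list.append(xbt_returns)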

# Load in some altcoin data
for altcoin in ['BCH', 'ETH', 'LTC', 'XRP']:
    filepath = 'data/' + altcoin + 'XBT_1d.csv'
    df = pd.read_csv(filepath, parse_dates=True, index_col=0)
    returns = df['Close'].pct_change().dropna()
    returns_list.append(returns)
    
# Load in some ETF data
path = 'path/to/etf/data'  # placeholder: put your path to some ETF OHLC data here
data_list = os.listdir(path)
for file in data_list:
    file_path = os.path.join(path, file)
    df = pd.read_csv(file_path, parse_dates=True, index_col=0)
    returns = df['Adj Close'].pct_change().dropna()
    returns_list.append(returns)

# Function to record metrics for each sample
def get_performance_metrics(returns):
    metric_funcs = [
        sharpe_ratio, sortino_ratio, win_percent, 
        win_loss_ratio, profit_factor, cpc_index,
        tail_ratio, common_sense_ratio, outlier_win_ratio,
        outlier_loss_ratio
    ]
    metric_dict = {
        metric_func.__name__:metric_func(returns) 
        for metric_func in metric_funcs
    }
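    # ideal_f is assumed to be defined already (the Ideal f routine
    # from the earlier position sizing articles)
    ideal_f_results = ideal_f(returns)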
    metric_dict['optimal_f'] = ideal_f_results['optimal_f']
    metric_dict['ghpr'] = ideal_f_results['max_ghpr']
    return pd.Series(metric_dict)

# Gather all the metrics we want to track on each run
metric_funcs = [
    sharpe_ratio, sortino_ratio, win_percent, 
    win_loss_ratio, profit_factor, cpc_index,
    tail_ratio, common_sense_ratio, outlier_win_ratio,
    outlier_loss_ratio
]
metric_columns = [
    metric_func.__name__ for metric_func in metric_funcs
]
metric_columns += ['optimal_f', 'ghpr']
metric_df = pd.DataFrame(columns=metric_columns)

# We'll say that our strategies are in the market 25% of the time
exposure = 0.25

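# Run it!
metric_rows = []
for i in tqdm(range(1000)):
    # Pick one asset's return series at random by index; this avoids
    # converting a ragged list of Series into a NumPy array
    returns = returns_list[np.random.randint(len(returns_list))]
    sample_size = max(250, int(len(returns) * exposure))
    sample = np.random.choice(returns, sample_size)
    metric_rows.append(get_performance_metrics(sample))

# Build the frame in one go (DataFrame.append was removed in pandas 2.0)
metric_df = pd.DataFrame(metric_rows, columns=metric_columns)

# Keep only the systems with a positive GHPR
metric_df = metric_df[metric_df['ghpr'] > 0]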

# Plot the results
fig, axes = plt.subplots(4, 3, figsize=(9,12))
metric_df_col_num = 0
for row in range(4):
    axes[row, 0].set_ylabel('ghpr')
    for col in range(3):
        metric_column = metric_df.columns[metric_df_col_num]
        axes[row, col].scatter(metric_df[metric_column], metric_df['ghpr'])
        axes[row, col].set_xlabel(metric_column)
        metric_df_col_num += 1

fig.suptitle('Performance Metric Correlation', size=16)
plt.tight_layout()
fig.subplots_adjust(top=0.95)
plt.savefig('Performance Metric Correlation')

Results

Well, the results speak for themselves. Sharpe Ratio, the finance industry standard, is almost perfectly correlated with median GHPR. I have to admit, I was pretty surprised that such a simple metric correlated so strongly with our target. Going forward in my testing, I will probably be using Sharpe Ratio of returns as my fitness function of choice for model evaluation. The formula is easy to compute, easy to understand, and well-known by most everyone in the finance/trading industry.
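The scatter plots show this visually; one way to put a number on it is a rank correlation between each metric and GHPR. Because the relationship need not be a straight line, a rank (Spearman) correlation, which only asks whether the ordering agrees, is arguably a better fit here than Pearson. The sketch below runs on a synthetic stand-in for metric_df (`toy_df` and its numbers are made up), so it demonstrates the calculation only and says nothing about the real results.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for metric_df (made-up numbers, illustration only)
rng = np.random.default_rng(0)
sharpe = rng.uniform(0.0, 0.2, size=200)
toy_df = pd.DataFrame({
    'sharpe_ratio': sharpe,
    'win_percent': rng.uniform(0.4, 0.6, size=200),       # unrelated noise
    'ghpr': sharpe ** 2 + rng.normal(0, 1e-4, size=200),  # curved in sharpe
})

# Spearman correlation is the Pearson correlation of the ranks
rank_corr = toy_df.rank().corr()['ghpr'].drop('ghpr')
print(rank_corr.sort_values(ascending=False))
```

On the real data, the same two lines applied to metric_df would rank all ten metrics by how well they order systems by GHPR.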

Extension: Linear Regression

Most of you can probably stop reading now. The following is just an experiment to see if applying regression to all of the metrics returns a meaningful improvement over just using Sharpe Ratio by itself.

Predictions Using Sharpe Ratio

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.preprocessing import PolynomialFeatures

# Single feature is sharpe ratio of returns
X = metric_df['sharpe_ratio'].values.reshape(-1,1)

# Target is median GHPR
y = metric_df['ghpr']

# Split out our training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

# Use polynomial features to account for non-linear curve
poly = PolynomialFeatures(2)
X_train = poly.fit_transform(X_train)
# Reuse the transformer fitted on the training set
X_test = poly.transform(X_test)

# Train our model
model = LinearRegression()
model.fit(X_train, y_train)

# Make some predictions
y_pred = model.predict(X_test)

# How'd we do?
r2 = r2_score(y_test, y_pred)
print('RSQ Achieved: {}%'.format(np.round(r2 * 100, 2)))
RSQ Achieved: 97.16%

I don’t know about you, but that’s good enough for me! That’s gonna be tough to beat.

Predictions Using All Metrics (1st order)

X = metric_df.drop(columns=['optimal_f', 'ghpr'])
y = metric_df['ghpr']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('RSQ Achieved: {}%'.format(np.round(r2 * 100, 2)))
RSQ Achieved: 96.56%

So using a linear combination of all the metrics actually degraded model accuracy. Certainly not a great argument for adding complexity. Next, we’ll see if using polynomial features improves the model at all.

Predictions Using All Metrics (2nd Order)

X = metric_df.drop(columns=['optimal_f', 'ghpr'])
y = metric_df['ghpr']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
poly = PolynomialFeatures(2)
X_train = poly.fit_transform(X_train)
# Reuse the transformer fitted on the training set
X_test = poly.transform(X_test)
model = LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
r2 = r2_score(y_test, y_pred)
print('RSQ Achieved: {}%'.format(np.round(r2 * 100, 2)))
RSQ Achieved: 95.15%

Even more complexity, even worse results. I think I’ll just stop now. If anyone has a way to squeeze that last 3% out with more sophisticated models, I’d love to hear about it in the comments section!
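One caveat on comparing the three scores: each call to train_test_split above draws a fresh random split, so differences of a percent or two could partly be noise. A possible way to tighten the comparison is to score every feature set on the same cross-validation folds. The sketch below does that on synthetic stand-in data (the arrays and the `mean_r2` helper are made up for illustration), so the printed numbers mean nothing about the real metrics.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (made up for illustration)
rng = np.random.default_rng(1)
n = 300
sharpe = rng.uniform(0.0, 0.2, size=n)
noise_metrics = rng.normal(size=(n, 3))
y = sharpe ** 2 + rng.normal(0, 5e-4, size=n)

X_single = sharpe.reshape(-1, 1)
X_all = np.column_stack([sharpe, noise_metrics])

# Fixed folds so both feature sets are scored on identical splits
cv = KFold(n_splits=5, shuffle=True, random_state=0)

def mean_r2(X, degree):
    # The pipeline fits the polynomial expansion on each training fold only
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    return cross_val_score(model, X, y, cv=cv, scoring='r2').mean()

print('Sharpe only, 2nd order:', round(mean_r2(X_single, 2), 4))
print('All metrics, 2nd order:', round(mean_r2(X_all, 2), 4))
```

Swapping in metric_df's columns for the synthetic arrays would give averaged R-squared values that are directly comparable across the three models.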

Conclusion

It appears that in order to optimize for drawdown-constrained GHPR without generating thousands of curves, all we need to do is optimize for Sharpe Ratio. I’ve heard various arguments for why one shouldn’t use this metric, and you can find plenty of them out there on the web. However, I think this simple test has shown that there is merit to using it. Personally, I like it for its analogy to a z-score in statistics: for a given number of observations, the higher the Sharpe, the stronger the evidence that the true mean of returns is not 0.
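The z-score analogy can be made explicit: the per-period Sharpe Ratio multiplied by the square root of the number of observations is the one-sample t-statistic for the hypothesis that the true mean return is zero. The snippet below checks this on made-up returns. It uses the sample standard deviation (ddof=1), whereas the sharpe_ratio function above uses NumPy's default (ddof=0), so the two differ very slightly.

```python
import numpy as np

rng = np.random.default_rng(7)
returns = rng.normal(0.001, 0.02, size=500)  # made-up daily returns

n = len(returns)
# Per-period Sharpe using the sample standard deviation
sharpe = returns.mean() / returns.std(ddof=1)

# One-sample t-statistic for H0: the true mean return is 0
t_stat = returns.mean() / (returns.std(ddof=1) / np.sqrt(n))

print(sharpe * np.sqrt(n), t_stat)  # the two values agree
```

This also shows why sample size matters: the same Sharpe measured over more observations is stronger evidence of a non-zero mean.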

I hope you found this informative and can use it to guide your testing/modeling decisions.

Feel free to comment below, and connect on Twitter or LinkedIn. You can also join the Quant Talk Telegram channel here: https://t.me/joinchat/GrWzrxH7Z0X_65JD3NLGMw