
Mining for Profitable Low-Risk Delta-Neutral Long Straddle Option Strategies

Senthil K. Murugan

A Thesis
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science in Data Mining
Department of Mathematical Sciences

Central Connecticut State University


New Britain, Connecticut

December 2012
Thesis Advisor:
Dr. Daniel Larose
Department of Mathematical Sciences

Mining for Profitable Low-Risk Delta-Neutral Long Straddle Option Strategies

Senthil K. Murugan

An Abstract of a Thesis
Submitted in Partial Fulfillment of the
Requirements for the Degree of
Master of Science
in
Data Mining
Department of Mathematical Sciences
Central Connecticut State University
New Britain, Connecticut

December 2012
Thesis Advisor
Dr. Daniel Larose
Department of Mathematical Sciences
Key Words: Delta-Neutral Long Straddle Option Strategies, Predictive Models, Grid Search

ABSTRACT
This study provides a framework for identifying potential low-risk, high-profit
option strategies. Delta-neutral long straddle strategies are explored. These strategies are
profitable when the underlying stock price either increases or decreases considerably.
Option transactions between years 2002 and 2006 for seven underlying stocks with high
transaction volume were used to form the strategies. Some of the drivers of the values of
an option such as implied volatilities, Greeks, and their elasticities were computed using
the Black-Scholes option pricing model. The strategy risk (in %) is defined here as the
fraction of tradable days with a loss. Strategies were classified as low-risk if their
corresponding strategy risk was less than or equal to 50%. Only about 6.9% of the
identified delta-neutral long straddle strategies were low-risk strategies. The entry
analysis identifies the low-risk delta-neutral long straddle strategies that could be entered.
Predictive models were built and validated using SAS JMP software. A set of neural
network models, one for each underlying stock, predicting the low-risk category from all
standardized variables as independent variables had the highest test accuracy, about 58%
true positives, and was selected as the best predictive model. The top 2% of the
predicted low-risk strategies from this model, with an overall lift of around
8, were identified as the best low-risk strategies to enter. The exit analysis identifies
optimal conditions to exit the entered strategies so as to maximize the expected profit
under various constraints. An entered strategy is exited when one of the following
optimal conditions is met: the loss limit, the profit limit, the limit on remaining days to
expiry, or the last tradable day. A grid search on the training data is performed to find the
optimal exit rules, which are then validated using the test dataset. The grid of exit
parameter limits and the corresponding strategy returns is computed by a heuristic
algorithm that minimizes data processing time. For a trader with limited funds, the
optimal exit rules corresponding to the constraint that 75% of entered strategies are
expected to exit profitably were chosen as the best scenario for maximizing strategy
returns. Under this trading scenario, mean profits from 5.1% (QCOM) to 44.4% (MO)
were observed.

TABLE OF CONTENTS
Introduction .............................................................................................................8
Call and Put Options ...................................................................................8
Delta-Neutral Long Straddle Option Strategy .............................................9
Option Pricing Model ...............................................................................11
Studies on the Black-Scholes Model ........................................................13
Research on Option Strategies ..................................................................14
Goals of the Present Study ........................................................................15
Data Preparation for Entry Analysis .....................................................................16
Option Transaction Data ...........................................................................16
Transaction Filters and Data Cleaning ......................................................17
Eligible Transactions for Creating a Long Straddle Strategy ...................18
Interest Rates .............................................................................................18
Implied Volatility ......................................................................................19
Call and Put Option Greeks and Elasticities .............................................19
Creating Delta-Neutral Long Straddle Strategies .....................................21
Computing Strategy Risk ..........................................................................22
Variable Transformation and Standardization ..........................................23
Distribution of Variables ...........................................................................24
Correlation Analysis .................................................................................28
Factor Analysis .........................................................................................30
Training and Test Datasets ........................................................................32
Data Balancing in the Training Dataset ....................................................34

Entry Analysis: Models to Identify Low-Risk Strategies .....................................35


Modeling Methods ....................................................................................35
Model Types based on Analysis Variables ...............................................44
Models Studied .........................................................................................47
Results from Select Models ......................................................................48
Selection of the Best Predictive Model .....................................................62
Diagnostics of the Best Predictive Model .................................................65
Characteristics of the Entered low-risk Strategies ....................................70
Summary ...................................................................................................73
Exit Analysis: Rules to Exit the Strategies with Optimal Profit ...........................75
Data Preparation ........................................................................................75
Exit Rule Logic .........................................................................................78
Methodology .............................................................................................79
Results .......................................................................................................87
Summary ...................................................................................................95
Conclusion ............................................................................................................97
Limitations ..............................................................................................102
Future Work ............................................................................................102
References ...........................................................................................................104
Appendices ..........................................................................................................107
Appendix A: Variables in Entry Analysis Dataset .................................107
Appendix B: Analysis Variable Statistics ...............................................111
Appendix C: Correlation Analysis ..........................................................112

Appendix D: Factor Loadings .................................................................114


Appendix E: Characteristics of Entered Low-Risk Strategies ................115
Appendix F: Optimal Exit Rules from Grid Search ...............................117
Appendix G: Exit Rules from Response Surface Optimization ..............120

INTRODUCTION

Call and Put Options


This thesis investigates an option strategy in financial markets. Options are
contracts to purchase or sell an underlying good at a specific price within a specific
time. The underlying goods considered in this study are stocks. Call options refer to
contracts to purchase the stock; put options refer to contracts to sell the stock.
Each contract specifies a purchase or selling price for the stock, referred to as the
exercise price. Contracts are valid only up to a specified time, called the expiration
time. Option contracts are created by a seller and bought by a buyer. By buying a
contract for a premium, a trader takes a long position (long call or long put). By selling
a contract, a trader takes a short position (short call or short put) and receives the premium.
A long call is generally profitable if the underlying stock price rises above the
exercise price before expiry. Otherwise, it expires worthless at expiration and the
premium paid is lost. A long put is profitable if the underlying stock price falls below
the exercise price before expiry. The premium for a long put is lost if the stock price is
higher than the exercise price at expiry. The premium is also referred to as the option
price. Short positions have profitability characteristics opposite to those of the
corresponding long positions.
Figure 1a shows the profit and loss diagram for a long call option (solid blue line)
and a short call option (dotted brown line) at expiration. Here, the call option has an
exercise price of $100 with a call price of $5. Above a stock price of $105, the long call
is profitable. For a stock price below $100, the long call expires worthless with a loss
of $5. Figure 1b shows the profit and loss for a long put option (solid blue line) and a
short put option (dotted brown line) at expiration. The put option has an exercise price
of $100 with a put price of $5. The long put is profitable when the stock price is below
$95. It expires worthless when the stock price is at or above $100, with a loss of $5.
The reverse trends hold for the corresponding short positions.
[Figure 1 charts appear here: panels "Profit / Loss Diagram: CALL Option" and
"Profit / Loss Diagram: PUT Option"; x-axis: Stock Price; y-axis: Profit / Loss (in $).]

Figure 1. Profit and loss diagram for: (a) call option and (b) put option. Here, the exercise
price of each option is $100 and the option prices or the premiums are $5 in each case.

Delta-Neutral Long Straddle Option Strategy


There are numerous option strategies that involve various combinations of calls
and puts. One such strategy is the long straddle (Kolb, 1991), in which a long call and a
long put with the same exercise price and expiration are entered simultaneously.
Figure 2 shows the profit and loss diagram at expiration for a long straddle
strategy in which a long call with a price of $2 and a long put with a price of $1.50 are
entered simultaneously; both have a $100 exercise price and the same expiration date.
Thus, a total of $3.50 is required to enter this position. The strategy is profitable
when the stock price is below $96.50 (= $100 exercise price - $3.50 investment) or above
$103.50 (= $100 exercise price + $3.50 investment). The maximum loss is realized when
the stock price equals the exercise price.

[Figure 2 chart appears here: "Profit / Loss Diagram: LONG STRADDLE Strategy";
x-axis: Stock Price; y-axis: Profit / Loss (in $).]

Figure 2. Profit and loss diagram for a long straddle option strategy at expiration.
Exercise prices of both long call and long put options are $100 and they expire at the
same time. Total premium is $3.50.
The term delta refers to the change in the option value for a unit change in the
stock price. Option pricing models are used to derive the delta. A long straddle whose
exercise price equals the current stock price is delta-neutral. That is, a small movement
in the stock price in either direction does not change the total value of the long straddle
strategy at that instant. However, a large movement of the stock price in either direction
results in a profit (Kolb, 1991). In the example illustrated in Figure 2, if the long
straddle strategy was entered when the underlying stock price equaled the exercise price
of $100, then this would be a special case of a delta-neutral long straddle option strategy.
In practice, it is rare to have an opportunity where the exercise prices of the
options and the underlying stock price match exactly, as the stock price changes
continuously. In addition, the remaining time to expiry and general market expectations
about future movements in the stock price make it difficult to achieve a neutral delta if
only one call option and one put option are combined to form a delta-neutral long
straddle option strategy. A more practical approach is to choose the exercise price that is
currently closest to the stock price and to select different numbers of call options and put
options, with the same exercise price and expiration time, in such a way that their total delta is
as close to zero as possible. Since the call option delta and a put option delta could be
computed using appropriate option pricing models, straightforward algebra could be used
to determine the number of calls and number of puts needed to achieve a zero total delta.
In fact, this is the approach used throughout the study to form a delta-neutral long
straddle strategy. In summary, for this study, a delta-neutral long straddle strategy is
executed by buying a certain number of call options and a certain number of put options
with the same expiration, the same exercise price, and a total delta of zero. The exercise
prices of these options are as close as possible to the stock price at the time the strategy
is executed. A trader entering the delta-neutral long straddle option strategy can benefit
if the stock price either increases or decreases considerably.

Option Pricing Model


The Black-Scholes option pricing model (Black & Scholes, 1973; Greeks
(Finance), 2003) is a widely used model for determining call and put option prices. The
parameters that affect a call or put price (V) in this model are the stock price (S), the
exercise price (E), the time to expiry (t), the risk-free interest rate (r), the dividend yield
(q), and the volatility of the underlying stock price (σ). In general, the dividend yield has
a relatively small impact and is assumed to be zero for simplicity. The call option (C) and
put option (P) pricing formulae given by the Black-Scholes model are listed below. In
these formulae, N(.) is the cumulative normal distribution function.

C = S e^(-qt) N(d₁) - E e^(-rt) N(d₂)

P = C - S e^(-qt) + E e^(-rt) = E e^(-rt) N(-d₂) - S e^(-qt) N(-d₁)

d₁ = [ln(S/E) + (r - q + 0.5σ²) t] / (σ√t)

d₂ = d₁ - σ√t        (Eqn. 1)
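To make Eqn. 1 concrete, the following is a minimal pricing sketch in Python. The
thesis itself used SAS, so the function name and example inputs here are illustrative
assumptions, not the original code.

from math import exp, log, sqrt
from scipy.stats import norm

def black_scholes(S, E, t, r, sigma, q=0.0):
    """Return (call, put) prices per Eqn. 1 for stock price S, exercise
    price E, time to expiry t (in years), risk-free rate r, volatility
    sigma, and dividend yield q (assumed zero in this study)."""
    d1 = (log(S / E) + (r - q + 0.5 * sigma**2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    call = S * exp(-q * t) * norm.cdf(d1) - E * exp(-r * t) * norm.cdf(d2)
    put = E * exp(-r * t) * norm.cdf(-d2) - S * exp(-q * t) * norm.cdf(-d1)
    return call, put

# Hypothetical example: an at-the-money option, 90 days to expiry
print(black_scholes(S=100.0, E=100.0, t=90 / 365, r=0.02, sigma=0.25))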

Equation 1 suggests that the call price increases when the stock price goes up or the
exercise price goes down. The reverse is true for the put price. As the time to expiry
decreases, both the call and put prices go down. For a unit increase in the stock price, the
call price increases by N(d₁) and the put price decreases by N(-d₁). As the volatility
increases, the call and put prices increase. Various first- and second-order derivatives
with respect to the above parameters, called Greeks, are helpful for understanding the
sensitivities of the call and put prices (Black & Scholes, 1973; Greeks (Finance), 2003).
Some of the important Greeks considered in this study are listed below. Here, for call
option Greeks, V refers to the call price (C), and for put option Greeks, V refers to the
put price (P).
o Delta [Δ = ∂V/∂S]: change in option value per unit change in stock price,
o Vega [ν = ∂V/∂σ]: change in option value per unit change in volatility,
o Theta [Θ = ∂V/∂t]: change in option value per unit change in time,
o Gamma [Γ = ∂Δ/∂S]: change in delta per unit change in stock price.

In addition to the Greeks, their corresponding elasticities, which are expressed in
percentage changes, are also helpful in understanding the sensitivities of the call and put
prices. Important elasticities are listed below.
o Elasticity of Delta, also called Lambda [λ = Δ(S/V)]: percentage change in
option value per percentage change in stock price,
o Elasticity of Vega [ν(σ/V)]: percentage change in option value per percentage
change in the volatility of the stock,
o Elasticity of Theta [Θ(t/V)]: percentage change in option value per percentage
change in time to expiry,
o Elasticity of Gamma [Γ(S/Δ)]: percentage change in delta per percentage
change in stock price.

Studies on the Black-Scholes Model


Various tests of the Black-Scholes model have been carried out to analyze the
adequacy of the option pricing model for describing the market value of an option. The
Black-Scholes study (Black & Scholes, 1972) analyzed over-the-counter option prices
and showed that actual market prices differed significantly from the prices predicted by
the model; however, when transaction costs were considered, the potential gains from
these differences eroded. The Galai studies (Galai, 1977; Galai, 1978) examined daily
price quotations from the Chicago Board Options Exchange and showed that, when
transaction costs are considered, the Black-Scholes model prices closely matched the
market prices for options. Bhattacharya (Bhattacharya, 1983) studied the adherence of
market prices to the theoretical boundaries implied by no-arbitrage conditions (i.e., the
impossibility of trading to make a risk-free profit with no investment) and showed that
traders could not profitably exploit deviations from the boundary conditions.
James MacBeth and Larry Merville (MacBeth & Merville, 1979) used the
Black-Scholes model to compute implied standard deviations for the underlying stocks
and found the greatest discrepancy between actual and theoretical prices for options with
a long time until expiration and options that are deep in- or deep out-of-the-money (i.e.,
options whose exercise price is far below or far above the current stock price). Mark
Rubinstein (Rubinstein, 1985) compared several option pricing models against the
Black-Scholes model and, in general, could not conclude that any one model was
superior to the others, as none outperformed the Black-Scholes model consistently.
The studies mentioned above examined the pricing of a single call or put option.
An option strategy may involve more than one option and is more complicated to
analyze.

Research on Option Strategies


There are several papers, books, and online articles describing various option
strategies, their performance, and when to deploy them. Coval and Shumway (2001)
assess the returns of delta-neutral long straddle strategies and compare them with results
from stochastic models. However, the author has not come across a study that does the
following:

o Examines a delta-neutral long straddle strategy, and
o Numerically quantifies the best entry and exit points for profitable, low-risk
trading.
Such studies are likely to have been carried out by financial institutions, given their
practical value; however, these institutions are unlikely to publish them, in order to stay
ahead of the competition. For an individual trader or researcher, it would be immensely
helpful to develop an analysis framework and a pathway for identifying the best entry
and exit points for any strategy, and it would be particularly valuable if some profitable
patterns could be uncovered and quantified for a specific strategy such as the
delta-neutral long straddle. This thesis attempts to address both of these areas.

Goals of the Present Study


The objective of the thesis is to identify profitable low-risk delta-neutral long
straddle option strategies. This is achieved in two steps. First, optimal entry points to
execute such strategies are identified by analyzing the variables affecting the option
prices. These entry points are selected based on models that predict that a given strategy
has a low-risk for not making a profit. For example, for a particular strategy, let us say
that there are 100 tradable days until expiration and 50 of them either break even or make
a profit and the other 50 days result in a loss. In this case, there is a 50% risk that the
strategy will be losing money. This study attempts to predict an entry point that would
minimize this risk. Second, after identifying various low-risk entry points, this study
identifies conditions for the most profitable exit points. For example, considering the
same scenario illustrated above, of the 50 profitable days, this study attempts to find the
right conditions for selling all the options in the strategy (i.e., exiting the strategy) that
would maximize the profit. The combination of these two steps results in predicting
profitable low-risk delta-neutral long straddle option strategies.
Strategies for seven stocks with a high volume of option transactions are studied
using five years' worth of data (2002 to 2006). The number of stocks chosen for the
analysis was limited in order to reduce the size of the analysis dataset. Results from the
analysis suggest that generalizations to other stocks not included in the analysis may not
be appropriate.

DATA PREPARATION FOR ENTRY ANALYSIS

Option Transaction Data


Most call and put option transactions in the US from 2002 to 2006 were
obtained from an option data vendor (Historical Options Data, 2002) for a fee. The data
represents the end-of-day consolidated option prices across all exchanges. The data was
provided as comma-separated flat files, one for each tradable day, containing option
transactions for most of the stocks traded in the US. The total size of the transaction data
for the five years exceeded 5.1 GB. The number of stocks chosen for the analysis was
limited in order to reduce the size of the analysis dataset.
For each year, the Chicago Board Options Exchange (CBOE) reports the top equity
options ranked by transaction volume (Annual Market Statistics, 2010). This CBOE data
was used to choose ten top-ranked equity options that were on the top-20 list in at least
four of the five years between 2002 and 2006. From this consolidated list, seven stocks
with high option transaction volumes were selected for the analysis. These underlying
stocks are: Citigroup (C), Cisco Systems (CSCO), General Electric (GE), Altria Group
(MO), Microsoft (MSFT), Qualcomm (QCOM), and Walmart (WMT). Though these
stocks represent different industries, the list is somewhat biased toward the technology
industry, because option transaction volumes were generally higher for technology
companies between 2002 and 2006. This may reduce the generalizability of the results
to other stocks. For the seven stocks selected above, there were about 1.44 million
option transactions in the dataset. This data has the daily end-of-day closing prices for
all active call and put options, along with details about the options (exercise price,
option type, time to expiry, transaction volume, etc.) and the closing stock price.

Transaction Filters and Data Cleaning


The overall objective is to have sufficient transaction volume to create an option
strategy and sufficient time to follow it through until its expiration date, so that the risk
and return of the strategy can be computed appropriately. Transactions with a low
volume (< 20 units) or a low open interest (< 20 units) on a given day were eliminated.
Open interest here refers to the number of option contracts that are active on a given day.
Any option transaction expiring after 2006 was also eliminated, so that any strategy we
form could be followed up to its expiry to compute the risk and return. To compute risk,
a sufficiently large number of days between the entry point and the expiration of a
strategy was preferred; any transaction with fewer than 30 days between entry and
expiration was removed.
The quality check process indicated that option transactions quoted on or
before 07/05/2002 had their option prices exaggerated by a factor of ten. These prices
were corrected accordingly. Option prices were then checked for price discrepancies. If
the option price indicated a high arbitrage opportunity (i.e., the ability to make a profit
with no investment), then the option price was considered suspect. Such suspect
transactions were also eliminated.

Eligible Transactions for Creating a Long Straddle Strategy.
In general, option volumes are higher when the exercise price is closer to the
current underlying stock price. In theory, for such options, the absolute values of the call
delta and the corresponding put delta have a higher chance of being close. These
conditions offer the best recipe for forming a balanced delta-neutral long straddle
strategy with a high transaction volume. Hence, in the present study, only options with
an exercise price either just above or just below the underlying stock price on a given
day were retained. For example, suppose the underlying stock price is $101, and the
available option exercise prices closest to $101 in either direction are $100 and $105.
For this particular day, two combinations of call and put transactions are then possible:
one strategy would combine a call and a put with an exercise price of $100 and the same
expiry; the second would combine a call and a put with an exercise price of $105 and the
same expiry. Any call and put transactions with an exercise price other than $100 or
$105 are eliminated.

Interest Rates
The Black-Scholes model suggests that one of the parameters affecting the price
of an option is the prevailing interest rate. In this study, the Federal Reserve's target
federal funds rate was used as the interest rate. The historical rates between 2002 and
2006 were obtained from the New York Federal Reserve web site (Target Federal
Funds and Discount Rates, 2010).

Implied Volatility
From the Black-Scholes model, we see that the volatility, or the standard deviation
of the underlying stock prices, is another variable that affects the option prices. One
common way to compute this volatility is to impute it from the Black-Scholes option
pricing formula (Eqn. 1). Such imputed volatility is referred to as the implied volatility
(σ). For a given transaction, we know the current option price (C or P), stock price (S),
exercise price (E), time to expiry (t), and interest rate (r). Given these variables, the
implied volatility is obtained by solving the Black-Scholes model (Eqn. 1) for σ using
numerical procedures. This turns out to be a constrained non-linear optimization problem
for each option transaction, the constraint being that σ must be positive. The SAS
procedure for non-linear regression (PROC NLIN) was used along with the
Newton-Raphson method to solve for the implied volatilities. This is a large-scale
non-linear optimization problem (around 120,000 iterations), since σ must be obtained
for each eligible transaction. PROC NLIN was very helpful in solving such a problem
in a reasonably short time. A few transactions had convergence issues and were deleted
from the dataset. Quality checks were done to ensure that the computed implied
volatilities were in the expected range.
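As an illustration of this imputation step, the sketch below solves Eqn. 1 for σ with a
bracketing root finder. The thesis used SAS PROC NLIN with Newton-Raphson, so this
is only an analogous sketch; black_scholes refers to the earlier pricing sketch, and the
example inputs are hypothetical.

from scipy.optimize import brentq

def implied_volatility(market_price, S, E, t, r, is_call=True):
    """Solve Eqn. 1 for sigma given an observed option price."""
    def price_gap(sigma):
        call, put = black_scholes(S, E, t, r, sigma)  # earlier pricing sketch
        return (call if is_call else put) - market_price
    # Constrain sigma to a positive bracket; transactions that failed to
    # converge were deleted in the thesis (here brentq would simply raise).
    return brentq(price_gap, 1e-4, 5.0)

print(implied_volatility(market_price=5.0, S=100.0, E=100.0, t=90 / 365, r=0.02))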

Call and Put Option Greeks and Elasticities


In the previous chapter, we discussed that the option prices are also sensitive to
several rate-of-change metrics, referred to as Greeks and elasticities, that were derived
from the Black-Scholes option pricing formulae (Eqn. 1). At this point of the data
preparation stage, we have all the inputs needed for computing these metrics. Based on

the Black-Scholes model, Table 1 gives the closed-form solutions for the specified
Greeks and elasticities (Greeks (Finance), 2003). In this table, N(.) and φ(.) represent
the normal cumulative distribution function and the normal probability density function,
respectively. Other variables are defined in the previous chapter.
Table 1. Closed-form solutions for the call and put option Greeks and elasticities, based
on the Black-Scholes option pricing model.

Greeks:
o Delta (Δ). Call: Δ_C = ∂C/∂S = e^(-qt) N(d₁). Put: Δ_P = ∂P/∂S = -e^(-qt) N(-d₁).
o Vega (ν). Call: ν_C = ∂C/∂σ. Put: ν_P = ∂P/∂σ. For both, ν = S e^(-qt) φ(d₁) √t,
or equivalently E e^(-rt) φ(d₂) √t.
o Theta (Θ). Call: Θ_C = ∂C/∂t = -S e^(-qt) φ(d₁) σ / (2√t) - r E e^(-rt) N(d₂)
+ q S e^(-qt) N(d₁). Put: Θ_P = ∂P/∂t = -S e^(-qt) φ(d₁) σ / (2√t)
+ r E e^(-rt) N(-d₂) - q S e^(-qt) N(-d₁).
o Gamma (Γ). Call: Γ_C = ∂Δ_C/∂S. Put: Γ_P = ∂Δ_P/∂S. For both,
Γ = e^(-qt) φ(d₁) / (S σ √t).

Elasticities:
o Lambda (λ). Call: λ_C = Δ_C (S / C). Put: λ_P = Δ_P (S / P).
o Elasticity of Vega. Call: ν_C (σ / C). Put: ν_P (σ / P).
o Elasticity of Theta. Call: Θ_C (t / C). Put: Θ_P (t / P).
o Elasticity of Gamma. Call: Γ_C (S / Δ_C). Put: Γ_P (S / Δ_P).
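The closed-form expressions in Table 1 can be restated compactly in code. Below is a
hedged sketch for the call side only (the put side follows Table 1 analogously); it is
illustrative Python, not the SAS code used in the thesis.

from math import exp, log, sqrt
from scipy.stats import norm

def call_greeks(S, E, t, r, sigma, q=0.0):
    """Table 1 Greeks and elasticities for a call option."""
    d1 = (log(S / E) + (r - q + 0.5 * sigma**2) * t) / (sigma * sqrt(t))
    d2 = d1 - sigma * sqrt(t)
    C = S * exp(-q * t) * norm.cdf(d1) - E * exp(-r * t) * norm.cdf(d2)
    delta = exp(-q * t) * norm.cdf(d1)
    vega = S * exp(-q * t) * norm.pdf(d1) * sqrt(t)
    theta = (-S * exp(-q * t) * norm.pdf(d1) * sigma / (2 * sqrt(t))
             - r * E * exp(-r * t) * norm.cdf(d2)
             + q * S * exp(-q * t) * norm.cdf(d1))
    gamma = exp(-q * t) * norm.pdf(d1) / (S * sigma * sqrt(t))
    return {"Delta": delta, "Vega": vega, "Theta": theta, "Gamma": gamma,
            "Lambda": delta * S / C,        # elasticity of Delta
            "EL_Vega": vega * sigma / C,    # elasticity of Vega
            "EL_Theta": theta * t / C,      # elasticity of Theta
            "EL_Gamma": gamma * S / delta}  # elasticity of Gamma

print(call_greeks(S=100.0, E=100.0, t=90 / 365, r=0.02, sigma=0.25))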

At the end of this stage of the data processing, we have the eligible call and put
option daily transactions and the important variables that influence the option price.

Creating Delta-Neutral Long Straddle Strategies
A long straddle strategy is formed by combining a call option and a put option on a
specific stock in such a way that both have the same exercise price, the same expiration
time, and the same quote date. The exercise price in this study is designed to be close to
the current stock price on the quote day. In order for this strategy to be delta-neutral, the
number of call options and the number of put options are chosen in such a way that their
total delta is zero. Note that the call delta and the put delta for a given strategy have
opposite signs. For simplicity, the count of the option type (i.e., call or put) with the
lower absolute delta was set to ten units. The count of the remaining option type was
determined using the logic specified below.
If |Δ_C| < |Δ_P|, then the number of calls (N_C) and the number of puts (N_P) in the
strategy are determined as N_C = 10 and N_P = (|Δ_C| / |Δ_P|) N_C. Otherwise,
N_P = 10 and N_C = (|Δ_P| / |Δ_C|) N_P. The computed counts are rounded to the
nearest integer. The total delta of the strategy, computed as N_C Δ_C + N_P Δ_P, is
close to zero; hence this strategy is a delta-neutral long straddle strategy.
delta-neutral long straddle strategy.
There are additional transaction costs (brokerage fees) involved in entering and
exiting the strategy. With Ameritrade, each option transaction has a fixed cost of $10
plus $0.75 per contract. Total transaction costs are doubled to account for both the entry
and exit legs of the transactions. Each option contract covers 100 shares. The total price
(P_dn_ls) of the delta-neutral long straddle strategy is thus given by

P_dn_ls = 100 C N_C + 100 P N_P + 2 [(10 + 0.75 N_C) + (10 + 0.75 N_P)]   (Eqn. 2)
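A minimal sketch of this construction, assuming the call and put deltas are already
computed: it fixes the leg with the smaller absolute delta at ten contracts, solves for the
other leg, and evaluates Eqn. 2 with the fee schedule above. The example deltas and
prices are hypothetical.

def delta_neutral_counts(delta_call, delta_put):
    """Fix the leg with the smaller |delta| at 10 contracts and round the
    other leg so that N_C * delta_call + N_P * delta_put is near zero."""
    if abs(delta_call) < abs(delta_put):
        n_calls = 10
        n_puts = round(abs(delta_call) / abs(delta_put) * n_calls)
    else:
        n_puts = 10
        n_calls = round(abs(delta_put) / abs(delta_call) * n_puts)
    return n_calls, n_puts

def strategy_price(call_price, put_price, n_calls, n_puts):
    """Eqn. 2: 100 shares per contract, plus doubled entry/exit fees
    ($10 fixed + $0.75 per contract for each option transaction)."""
    fees = 2 * ((10 + 0.75 * n_calls) + (10 + 0.75 * n_puts))
    return 100 * call_price * n_calls + 100 * put_price * n_puts + fees

n_c, n_p = delta_neutral_counts(delta_call=0.55, delta_put=-0.45)  # hypothetical deltas
print(n_c, n_p, strategy_price(2.00, 1.50, n_c, n_p))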

At the end of this stage, we created about 44,725 eligible delta-neutral long
straddle strategies for the analysis.

Computing Strategy Risk


For each of the strategies created, we go through each tradable day from when the
strategy was entered until its expiry. For each such day, the current total price of the
strategy is computed using the current call and put option prices. Transaction costs are
not included, as they were already accounted for at the entry stage. The difference
between the current strategy price and the initial price when the strategy was entered
gives the return from the strategy. Days with a zero or positive return are flagged. Days
on which the call or put transaction volume was low (< 10 units) were not considered
tradable days and were excluded from this computation. The risk (in %) and the risk
category for each strategy are computed using the following formulae.
Risk (in %) = 1 - (Total number of tradable days with a zero or positive return) /
(Total number of tradable days until expiry)

Risk Flag = 1 if Risk ≤ 50%, and 0 otherwise.   (Eqn. 3)
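A short sketch of Eqn. 3, assuming the strategy's daily prices, daily volumes, and entry
price are already in hand (the input layout and example values are assumptions):

def strategy_risk(daily_prices, daily_volumes, entry_price, min_volume=10):
    """Eqn. 3: fraction of tradable days with a loss. A day counts as
    tradable only if the option volume reaches min_volume; here
    daily_volumes holds the smaller of the call/put volumes per day."""
    tradable = [p for p, v in zip(daily_prices, daily_volumes) if v >= min_volume]
    n_flagged = sum(1 for p in tradable if p - entry_price >= 0)  # zero or positive return
    risk = 1 - n_flagged / len(tradable)
    risk_flag = 1 if risk <= 0.50 else 0
    return risk, risk_flag

# Hypothetical five-day strategy entered at a total price of 3.50
print(strategy_risk([3.2, 3.6, 3.8, 3.4, 3.9], [50, 40, 5, 60, 30], 3.50))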

The predictive models in the entry analysis use one of these risk fields as the
dependent variable. Table 2 shows the proportion of low-risk categories (i.e., risk flag =
1) for each underlying stock, as well as the overall proportion of low-risk strategies.
Overall, only about 6.9% of the identified strategies have the potential to break even or
make a profit on at least 50% of the tradable days until expiry. This observation is by
itself surprising, as only a small portion of the strategies are low-risk. The
objective of the entry analysis is to predict these low-risk strategies using some of the
variables influencing the option prices.
Table 2. Proportion of delta-neutral long straddle strategies that either break even or
make a profit during at least 50% of the eligible tradable days until expiry.

C: 5.9%   CSCO: 8.9%   GE: 5.0%   MO: 6.5%   MSFT: 5.5%   QCOM: 9.1%   WMT: 7.9%   Overall: 6.9%

Variable Transformation and Standardization


Most of the derived influential variables had skewed distributions. A scaled log
transformation was applied to each of these variables to reduce the skew and obtain a
distribution closer to normal; the log-transformed variables were created as a separate set
of variables. The distributions of the original variables differed for each underlying
stock. To bring the original variables to a common scale across underlying stocks, a Z
transformation of each variable within each underlying stock was carried out. The
resulting standardized variables were designed to have a mean of 100 and a standard
deviation of 20. As a separate set, the log-transformed variables were also standardized
similarly (i.e., with mean 100 and standard deviation 20). Table A1 in Appendix A lists
the variables created for the analysis of the delta-neutral long straddle strategies. Table
B1 in Appendix B gives the means and standard deviations of the original variables by
underlying stock. An outlier analysis was done by computing the Mahalanobis distance
from a given observation to the center of all observations; a very high distance indicates
that the observation may be an outlier. Further data analysis showed that a sizable
portion of the outliers were low-risk strategies, so none of the outliers were removed.
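The sketch below illustrates this transformation pipeline for one variable. The thesis
does not specify the exact scaled-log form, so the use of log1p and the column names
are assumptions.

import numpy as np
import pandas as pd

def add_transformed(df, var, by="C_underlying"):
    """Add a log-transformed copy of `var`, then standardize both the
    original and the log copy within each underlying stock to mean 100
    and standard deviation 20."""
    df["log_" + var] = np.log1p(df[var])   # scaled log transform (assumed form)
    for col in (var, "log_" + var):
        g = df.groupby(by)[col]
        z = (df[col] - g.transform("mean")) / g.transform("std")
        df["z_" + col] = 100 + 20 * z      # mean 100, standard deviation 20
    return df

demo = pd.DataFrame({"C_underlying": ["C", "C", "C", "GE", "GE", "GE"],
                     "C_Vega": [4.0, 9.0, 6.0, 2.0, 5.0, 3.0]})
print(add_transformed(demo, "C_Vega"))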

Distribution of Variables
Distributions of the various variables influencing the strategy risk are shown in
Figures 3 through 10. Figure 3 indicates that, except for the 3rd quarter of 2002 and the
4th quarter of 2006, all quarters have a reasonably high number of strategies. Figure 4
shows that the strategy counts are distributed relatively evenly across the underlying
stocks. Figure 5 shows that most strategies have a high risk; only 6.86% have a
relatively low risk (i.e., risk ≤ 50%), and the mean risk is itself high at 84%. Figure 6
shows that low-risk strategies generally have less time to expiry. This is good, since
more such strategies could be executed within a given year. Low-risk strategies tend to
have lower call and put implied volatilities; however, this varies by underlying stock
(for example, the trend is reversed for GE). Figure 7 indicates that standardized call
Theta and call Gamma differentiate the low-risk strategies from the high-risk strategies
better than standardized call Delta and call Vega do. Figure 8 shows that the
standardized call elasticity of Delta (Lambda) and the call elasticity of Gamma are better
differentiators of the low-risk strategies from the high-risk strategies. Figures 9 and 10
show that standardized put Gamma, put Theta, put Vega, and the put elasticities of Delta
(Lambda) and Gamma are better differentiators of the low-risk strategies from the
high-risk strategies. However, these patterns do not hold for all the underlying stocks;
for example, standardized put Theta does not seem to be a good differentiator of
low-risk strategies for WMT and QCOM. More such examples can be observed in these
charts.


Figure 3. Overall distribution of strategies across quarters of different years. Q3 of 2002
and Q4 of 2006 have relatively lower numbers of strategies.

Figure 4. Distribution of strategies by the underlying stock. The variation in the counts is
relatively low.

Figure 5. (a) Overall distribution of strategy risk. The mean risk is high, at about 84%.
(b) Distribution of the risk category. The risk flag is set to one when the risk is at most
50%; thus, a risk flag value of one represents a relatively lower risk. About 6.86% of the
strategies are low-risk.


Figure 6. Box plots of standardized time to expire (z_C_time_to_expire), call implied
volatility (z_C_imp_volatility), and put implied volatility (z_P_imp_volatility) by
underlying stock. Blue box plots are for high-risk strategies (ls_dn_rflag_0 = 0). Red box
plots are for low-risk strategies (ls_dn_rflag_0 = 1).

Figure 7. Box plots of standardized call option Greeks (Delta, Vega, Theta and Gamma)
by underlying stock. Blue box plots are for high-risk strategies (ls_dn_rflag_0 = 0). Red
box plots are for low-risk strategies (ls_dn_rflag_0 = 1).


Figure 8. Box plots of standardized call option Elasticities (Lambda, Vega, Theta and
Gamma) by underlying stock. Blue box plots are for high-risk strategies (ls_dn_rflag_0 =
0). Red box plots are for low-risk strategies (ls_dn_rflag_0 = 1).

Figure 9. Box plots of standardized put option Greeks (Delta, Vega, Theta and Gamma)
by underlying stock. Blue box plots are for high-risk strategies (ls_dn_rflag_0 = 0). Red
box plots are for low-risk strategies (ls_dn_rflag_0 = 1).


Figure 10. Box plots of standardized put option Elasticities (Lambda, Vega, Theta and
Gamma) by underlying stock. Blue box plots are for high-risk strategies (ls_dn_rflag_0 =
0). Red box plots are for low-risk strategies (ls_dn_rflag_0 = 1).

Correlation Analysis
Correlation analysis helps in understanding the multicollinearity between
variables; it helps identify highly correlated variables to eliminate, or informs other
actions, such as principal component analysis and factor rotation, that reduce the
multicollinearity and the variable dimensions. From Figure 11, we observe that the risk
flag (ls_dn_rflag_0), the dependent variable, does not have a high correlation with the
other variables. We also observe that many variables are highly correlated with each
other.


Figure 11. Visualization of correlations between different variables. In this picture,
created using SAS JMP software, variables are clustered together according to their
correlations and then presented as clusters of correlations. We observe that the risk
flag (the first variable) is not strongly correlated with the other variables. However,
there are multiple groups of highly correlated variables, as indicated by the dark red
areas (high positive correlations) along the diagonal and some dark blue areas (high
negative correlations) away from the diagonal.
Table C1 in Appendix C lists the pairwise correlations for the highly correlated
variables. Variable pairs with very high correlations or importance are highlighted. For a
given variable (variable 1) with very high correlations with other variables, this table also
identifies a possible replacement variable. In other words, some of the highly correlated
variables could potentially be eliminated from the analysis, thereby reducing the variable
dimensions and the multicollinearity. From Table C1 in Appendix C, we observe that the
following variables alone could be included in the analysis, as these
included variables could act as proxies for other eliminated variables. Identified variables
for inclusion are:
o C_in_money_flag, z_SE_Diff_percent, c_int_rate, z_C_time_to_expire (or
C_time_to_expire), z_C_imp_volatility, z_C_Delta, z_C_Vega, z_C_Theta,
z_C_Gamma, z_C_EL_Lambda, z_P_EL_Lambda,
z_ls_dn_price_per_stk_pr, z_C_Pr_by_Stk_Pr_percent,
z_P_Pr_by_Stk_Pr_percent, z_P_imp_volatility, z_P_EL_Theta, ls_dn_calls,
ls_dn_puts.
Some of the other observations are:
o When the expiration time of a strategy decreases, the strategy price decreases

since the corresponding call and put prices are relatively low.
o As volatility increases, strategy price increases.
o As expiration time decreases, the call Vega decreases.

Factor Analysis
The correlation analysis suggests that there is a multicollinearity issue in the data.
Factor analysis helps reduce this issue by creating factors, linear combinations of the
variables, in such a way that the correlations between the factors are either very low or
zero. In addition, a few factors may be enough to capture the majority of the variation
present in the data. Factor analysis was carried out using the Multivariate Analysis
module of SAS JMP software. Varimax factor rotations were carried out using the
following configurations:
o Principal Components as the factoring method.

o Principal Components (diagonals = 1) as the prior communality.

The top seven factors accounted for about 95% of the variation in the data and were
retained as rotated components for further analysis. Table 3 shows the variance explained
by each of the seven factors.
Table 3. Variance explained by each factor. Factors were obtained using principal
components and Varimax rotations.

Factor     Variance   Percent (%)   Cumulative Percent (%)
Factor 1     7.93        33.1          33.1
Factor 2     4.42        18.4          51.5
Factor 3     3.39        14.1          65.6
Factor 4     3.12        13.0          78.6
Factor 5     2.35         9.8          88.4
Factor 6     1.13         4.7          93.1
Factor 7     0.61         2.6          95.7

Table D1 in Appendix D has the factor loadings for each of the top seven factors.
Following are the characteristics of these factors:
o Factor 1: Represents general Greeks and their elasticity-sensitive strategies
with in-the-money calls. They are represented by a high Delta. They are also
sensitive to differences between the stock and exercise prices at entry.
o Factor 2: Strategies that represent time to expiry and change in volatility.
o Factor 3: Represents the price momentum (or acceleration) of the strategies
(Gamma).
o Factor 4: Represents the implied volatility of the strategies.
o Factor 5: Represents the time sensitivity of the strategies (Theta).
o Factor 6: Represents the interest rates of the strategies.
o Factor 7: Another factor, in addition to Factor 1, that represents stock price
to exercise price differences.

From Figure 12, which shows the distributions of the factors with respect to the
risk flag, we see that only Factor 3 has some predictive power for risk. This factor
explains about 14% of the variance in the data.

Figure 12. Box plots of top seven factors. Blue box plots are for high-risk strategies
(ls_dn_rflag_0 = 0). Red box plots are for low-risk strategies (ls_dn_rflag_0 = 1).

Training and Test Datasets


The analysis dataset was randomly split into training and test datasets. The training
dataset had 70% of the observations and was later used to build predictive models that
predict either the risk percentage (a continuous variable) or the risk flag (a discrete
variable). The test dataset had the remaining 30% of the observations and was primarily
used to test the accuracy of the models built on the training dataset. To split the analysis
dataset, a random value uniformly distributed between 0 and 1 was first generated for
each record. Observations with a random value ≤ 0.7 were flagged for training
(Train_Flag = "Y"), and the remaining 30% of the observations were set aside for testing
(Train_Flag = "N").
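A minimal sketch of this splitting rule (the stand-in dataset is a placeholder; the thesis
applied the rule to the roughly 44,725 strategies):

import numpy as np
import pandas as pd

df = pd.DataFrame({"strategy_id": range(44725)})   # stand-in for the analysis dataset
rng = np.random.default_rng(0)
df["Train_Flag"] = np.where(rng.uniform(size=len(df)) <= 0.7, "Y", "N")
print(df["Train_Flag"].value_counts(normalize=True))   # roughly 0.7 / 0.3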
Figure 13 shows the distribution of low-risk and high-risk strategies within
training and test datasets. Overall, both datasets have about 6.8% of low-risk strategies.
However, across the various stocks, the share of low-risk strategies shows small but
noticeable differences. For example, QCOM has 9.8% and 7.6% low-risk strategies in
the training and test datasets, respectively. Similar minor differences in the proportion
of low-risk strategies between the training and test datasets were observed across the
various quarters.

Figure 13. (a) Distribution of risk flag in training and test datasets. (b) Distribution of
risk flag across underlying stocks. Both frequency and shares are reported. (c) Pictorial
view of low-risk (blue) and high risk (red) strategies by underlying stock between
training and test datasets.

Data Balancing in the Training Dataset
In the training dataset, about 6.86% of the strategies are low-risk and 93.14% are
high-risk. Since predicting the low-risk strategies is the primary objective of this study,
an approximately 50:50 ratio between the numbers of low-risk and high-risk strategies
would help boost the predictive accuracy for low-risk strategies. Data balancing for the
training dataset was achieved by assigning a higher weight to the low-risk strategies.
This weight is captured by the variable obs_freq. For high-risk strategies, obs_freq is set
to one. For low-risk strategies, obs_freq was computed as (high-risk %) / (low-risk %) =
0.9314 / 0.0686 = 13.6, which was rounded to 14. In other words, this is equivalent to
repeating each low-risk strategy observation 14 times to achieve a training dataset
balanced in terms of risk categorization.
After the execution of all of the above processes we have the complete analysis
datasets for training and testing. Table A1 in Appendix A summarizes all the variables in
the analysis dataset along with their descriptions.

ENTRY ANALYSIS: MODELS TO IDENTIFY LOW-RISK STRATEGIES

The goal of the entry analysis is to identify the low-risk delta-neutral long straddle
strategies that could be executed or entered. Predictive models were built and validated
using the training and test datasets created during the data preparation step. SAS JMP
software was primarily used for building predictive models. Modeling methods used were
linear regression, logistic regression, neural networks, and decision trees. IBM Modeler
was used occasionally to compare some of the equivalent models from SAS JMP (e.g.,
neural networks and decision trees). The analysis dataset has several variables, and
models were built using different sets of them (e.g., all standardized variables, factor
variables only, etc.). In some cases, separate models for each underlying stock were built
to study the effects on predictive accuracy. Classification errors on the test dataset were
used to select the best predictive model. The top 2% of the strategies predicted to have
the lowest risks were identified as the best strategies to enter.

Modeling Methods
The risk flag (ls_dn_rflag_0) is the categorical dependent variable, with two
categories representing high risk and low risk, respectively. Supervised modeling
methods that support a categorical outcome variable, such as logistic regression, neural
networks for classification, and partition methods such as decision trees, were used to
predict the risk category. For the strategies we generated, we also have the percent risk
(ls_dn_risk_0) as a continuous dependent variable. Predictive models such as ordinary
least squares regression, neural networks for continuous outcomes, and decision trees for
continuous variables were used to model the risk expressed in percentage. Various texts
(Berry & Linoff, 1997; Hand, Mannila & Smyth, 2001; Larose, 2004; Larose, 2006;
Witten & Frank, 2000) give more details about these methods. Some of the more
complex and flexible methods are not studied here, primarily because of SAS JMP
limitations. However, the models considered here cover the objectives of many of these
other methods, such as generalized additive models and projection pursuit methods. The
SAS JMP handbooks (SAS Institute Inc, 2010a, 2010b, 2010c) give the JMP
implementations and some of the statistical details underlying these methods. Lora D.
Delwiche and Susan J. Slaughter (Delwiche & Slaughter, 2003) explain simple SAS code.

Modeling Methods: Logistic Regression


Logistic regression is used to predict the categorical variable ls_dn_rflag_0, a
binary response indicating whether a strategy is high-risk (= 0) or low-risk (= 1).
Logistic regression fits probabilities for the categorical response levels (say, y = 0 or
y = 1) using a logistic function of the independent variables (say, X). For a binary
response, the function is P(y = 0) = (1 + e^(-Xβ))^(-1), or equivalently
log[P(y = 0) / P(y = 1)] = Xβ. Here, P(.) represents the probability of an event.
Maximum likelihood fitting principles are used to choose the β's that maximize the joint
probability attributed by the model to the responses that did occur.
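For illustration, a comparable weighted maximum-likelihood logistic fit can be sketched
in Python with statsmodels; note that it models P(y = 1) rather than JMP's P(y = 0), and
the placeholder data and weights are assumptions.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.standard_normal((500, 3)))   # placeholder predictors
y = rng.integers(0, 2, size=500)                      # placeholder risk flag
w = np.where(y == 1, 14, 1)                           # obs_freq balancing weights
fit = sm.GLM(y, X, family=sm.families.Binomial(), freq_weights=w).fit()
print(fit.summary())   # estimates, standard errors, and significance tests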
The 'whole model test' output helps answer whether the specified model is better than
the intercept-only model. The 'Prob > Chi.Sq.' output measure represents the probability
of obtaining a greater model Chi-square value by chance alone; a lower value indicates
that the specified model is better than the intercept model. A high R² (close to 1), which
in this case is the ratio of the difference between the negative log-likelihoods of the
specified model and the intercept model to the negative log-likelihood of the intercept
model, represents a better fit. The generalized R² is a generalization of the R² measure
that simplifies to the regular R² for continuous normal responses; it can be used to
compare the fits of different models. The misclassification rate is the rate at which the
response category with the highest predicted probability is not the observed category; a
lower misclassification rate (closer to 0) indicates a better fit. The lack-of-fit (or
goodness-of-fit) test compares the saturated model with the specified model and
indicates whether there is enough information in the variables included in the current
model or whether additional terms are needed. A high 'Prob > Chi.Sq.' measure
indicates that the fitted model is sufficient.
Effect tests are used to check whether the specified model is better than a model without
a given effect. The parameter estimates output gives the effect estimates (β's), their
standard errors, Chi-square tests, and significance probabilities. A lower significance
probability (say, less than 0.05 for a 95% confidence level) indicates that the given effect
is statistically significant. The Chi-square values from likelihood ratio tests can be used
to sort the effects according to their importance. The unit odds ratio measures the change
in the odds of the first response (say, y = 0) when the given effect changes by one unit.
The range odds ratio gives the corresponding change when a given effect varies over its
whole range seen in the data. Odds ratios far from one indicate that the outcome is
highly sensitive to the given effect. For binary responses, the odds ratio for the flipped
response level (say, y = 1) is the reciprocal of the odds ratio for the response y = 0. In
our case, such reciprocal odds ratios are helpful in determining the importance of effects
for low-risk strategies, as ls_dn_rflag_0 equals one for the low-risk strategies.

The confusion matrix (Figure 14a) gives a table of predicted vs. actual responses.
In our case, false positives are much more costly than true negatives, since they lead to
entering strategies that are actually high-risk. Receiver Operating Characteristic (ROC)
curves (Figure 14c) measure the efficiency with which the model's fitted probabilities
sort the specified response level. They graph the probability of false positives (x-axis)
vs. the probability of true positives (y-axis) after sorting the observations by predicted
probability; the higher the curve is above the diagonal, the better the fit. The lift curve
(Figure 14d) gives information similar to the ROC curve but dramatizes the richness of
the ordering at the beginning: the y-axis shows the ratio of the richness of a given
response level in a given portion of the population to the rate of that response level as a
whole.
[Figure 14a layout. Confusion matrix (rows: actual; columns: predicted):

Actual \ Predicted      = 0 (high-risk)    = 1 (low-risk)
= 0 (high-risk)         True Negative      False Positive
= 1 (low-risk)          False Negative     True Positive]
Figure 14. (a) Confusion matrix. Cells represent the count of observations in each
category. (b) An example of a Contingency table that represents top 2% of the low-risk
strategies. Rows are actual values and the columns are predicted values. (c) Receiver
Operating Characteristics (ROC) curve. (d) Lift curve.

One way to minimize false positives and maximize true positives is to classify as
low-risk only a certain portion of the population with the highest predicted probabilities
of low risk; the remaining portion of the population is classified as high-risk. In this
study, overall, about 6.8% of the strategies are low-risk. For entry purposes, only the top
2% of strategies with the highest predicted probability of low risk (i.e., ls_dn_rflag_0 =
1) are selected. This limit was chosen so that the identification of entry strategies is not
too rare (on average, occurring two to three times a week). The contingency table in
Figure 14b is the equivalent of a confusion matrix for the top 2% of the low-risk
strategies. This contingency table is also calculated for the test dataset. The true positive
% and the false positive % from the test dataset are used to compare the predictive
accuracy of the models.

Modeling Methods: Linear Regression (Ordinary Least Squares)


The ordinary least squares (OLS) linear regression model is used to predict the
continuous outcome variable ls_dn_risk_0, which is the strategy risk expressed in
percentage. A model with continuous dependent variable y, independent variables X,
and a normally distributed error ε is specified as y = Xβ + ε. The parameter estimates
(β's) are computed by minimizing the error sum of squares. The errors are assumed to
be normally distributed and uncorrelated. In practice, data deviate from these
assumptions, and OLS models are reasonably robust to minor violations.
The actual vs. predicted plot shows a scatter plot of the predicted against the actual
values of the dependent variable. The analysis of variance table and the R² measure are
used to infer the level of fit of the given model. A low value of 'Prob > F' (say, below
0.05) indicates that the current model is significantly better than the intercept model (at
the 95% confidence level). The lack-of-fit test, parameter estimates, and effect tests are
conceptually similar to those described in the logistic regression section; however, an F
ratio is used for the lack-of-fit test and t-tests are used for the effect tests. Effects sorted
in descending order of their |t| values indicate the importance of the model variables for
predicting the strategy risk. A low value of the 'Prob > |t|' measure (say, less than 0.05)
indicates that the given effect is statistically significant at the specified confidence level
(here, 95%).
The top 2% of strategies with the lowest predicted risk are classified as low-risk
strategies. A contingency table similar to the one shown in Figure 14b is created to
evaluate the predictive accuracy of the model. Both the true positive and false positive
percentages are used for the evaluation.

Modeling Methods: Neural Networks


Neural networks are explained in detail in the text by Bishop (Bishop, 2000). The
neural network implementation in SAS JMP can take either a categorical outcome, like
the risk flag, or a continuous outcome, like the strategy risk in percentage. A neural
network can be considered a function of a set of derived inputs. It has three types of
layers: an input layer, hidden layers, and an output layer. The input layer has one node
for each input (for continuous variables) or for each level of the input variable (for
categorical variables). Similarly, the output layer has one node for each output variable
in the case of continuous outputs, or one node for each level of the output variable in the
categorical case. There can be multiple hidden
layers with multiple hidden nodes in each layer. Each of the nodes in a given hidden layer
is connected to all nodes in the adjacent layers. SAS JMP base version supports only one
hidden layer. Figure 15 shows a neural network with two inputs (x1 and x2), one hidden
layer with three hidden nodes and one output (y).

Figure 15. A neural network with two input nodes (x1, x2) in the input layer (blue), three
hidden nodes in the hidden layer (green) and one output node corresponding to the
continuous output y in the output layer (sky blue). Hidden nodes are connected to each
node in the adjacent layers.
Each hidden node is a function of all input nodes. The function applied to the
nodes of the hidden layer is called the activation function. The SAS JMP base version uses the
hyperbolic tangent function as the activation function. The function applied to the output
response is either a linear combination of the activation functions of the hidden nodes (for a
continuous output variable) or a logistic transformation of the activation functions of the
hidden nodes (for categorical responses). The main advantage of a neural network model
is its ability to model complex response surfaces. However, neural networks are not easily
interpretable and can overfit the data unless an efficient validation method is
incorporated into the model building process. The SAS JMP base version uses a specified
proportion of the original data as a validation dataset (holdback validation). In this study,
for neural network models, the training dataset is further divided randomly into two parts:
67% of observations for training purposes and 33% of observations for validation
purposes.
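The following is a minimal sketch, in Python, of the forward pass such a network computes for a binary risk flag, assuming one hidden layer of ten tanh nodes (as in the best model described later) and a logistic output. The weights here are random placeholders rather than fitted values, and this is not the SAS JMP code:

```python
import numpy as np

def forward(x, W1, b1, W2, b2):
    """One-hidden-layer network: tanh hidden activations, followed by a
    logistic transformation of their linear combination."""
    h = np.tanh(W1 @ x + b1)          # hidden-layer activations
    z = W2 @ h + b2                   # linear combination of hidden nodes
    return 1.0 / (1.0 + np.exp(-z))   # predicted P(low-risk)

# Illustrative dimensions: 25 inputs (as in model type M1), 10 hidden nodes.
rng = np.random.default_rng(1)
W1, b1 = rng.normal(size=(10, 25)), np.zeros(10)
W2, b2 = rng.normal(size=(1, 10)), np.zeros(1)
x = rng.normal(size=25)               # one standardized observation
print(forward(x, W1, b1, W2, b2))     # a probability between 0 and 1
```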
Outputs from the neural network model include the model R², generalized R²,
misclassification rates, confusion matrices, ROC and lift curves (for categorical
outcomes) and actual vs. predicted value charts (for continuous outcomes). All of these
outputs are produced for both the training set and the validation set. SAS JMP also
generates scoring code for the models, which is helpful in predicting the probability of the
low-risk categorization (when ls_dn_rflag_0 is used as an outcome variable) or the predicted
strategy risk (in the case of ls_dn_risk_0). Contingency tables for the top 2% of the
predictions are then constructed using the predicted values.

Modeling Methods: Recursive Partitions


The partition module of SAS JMP is used to build a decision tree that splits the
values of the factors to maximize the difference in responses between the two branches of
the split. The logic used here is different from that used in C5.0 decision trees. The factors
(X's) can be continuous or categorical. Continuous factors are split into two at each tree
node using a cutting value. Categorical factors are split into two groups of levels at each
tree node. The response (y) can also be either continuous or categorical. If y is
categorical, then the method fits the probabilities estimated for the response levels by
minimizing the residual log-likelihood chi-square (G² = 2 × Entropy). If y is continuous,
the method fits the means by minimizing the sum of squared errors. The node splitting is
based on the LogWorth statistic. LogWorth is calculated as −log10(p-value). The SAS
JMP manual (SAS Institute Inc., 2010c) has more details on computing this statistic.
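As a small illustration (a sketch only, not the JMP internals), the LogWorth of a candidate split of a categorical response can be computed from the p-value of a chi-square test on the two-branch contingency table the split produces; the counts below are hypothetical:

```python
import math
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table for one candidate split of a categorical response:
# rows = the two branches of the split, columns = risk flag (0/1) counts.
table = [[480, 20],
         [350, 150]]
chi2, p_value, dof, _ = chi2_contingency(table)

log_worth = -math.log10(p_value)   # LogWorth = -log10(p-value)
print(round(log_worth, 2))         # larger LogWorth => stronger split
```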
Advanced techniques like bootstrap forests and boosted trees are supported only by the SAS
JMP Professional version. In this study, the observations in the training dataset are split
randomly into 70% for training and 30% for validation (holdback validation). These
datasets have several thousand observations; hence K-fold validation methods, which
are typically used for small datasets, were not used.
The output contains the overall Akaike Information Criterion (AIC), G², LogWorth,
number of splits, Entropy R², misclassification rates, confusion matrices, and means and
standard deviations (for the continuous response variable). The leaf report has the set of rules
that produces the corresponding leaf node, and the probability of the response levels (for
a categorical response) or the means of the responses (for a continuous response). For
categorical responses, ROC and lift curves are also generated. SAS JMP scores each
observation based on the rules applied to the particular leaf node. This helps to get the
predicted probability for low-risk (for ls_dn_rflag_0 = 1) or the predicted strategy risk
(for ls_dn_risk_0). These predicted values are used to create contingency tables for the top
2% of the predicted low-risk strategies.

Modeling Methods: Hybrid Models


Combining different types of models (i.e., model averaging) is a technique
sometimes used to improve model accuracy and to reduce model over-fitting. In this study,
a neural network model and a decision tree model with higher predictive accuracy for the
low-risk category (ls_dn_rflag_0 = 1) were combined to generate a hybrid score. The
hybrid score for the low-risk category is the average of the predicted low-risk
probabilities from the two models. The hybrid score is used to select the top 2% of the low-risk
strategies, and the contingency table is developed for comparing the model accuracies.

Modeling Methods: Other Models


IBM Modeler software was used to generate neural network and C5.0 decision
tree models for some select cases. Details about the model implementation using this
software can be found in the Clementine manual (SPSS Inc., 2001). The results were
compared to the corresponding SAS JMP model types.

Model Types based on Analysis Variables


Depending on the set of variables used for predicting the risk category and the
strategy risk, the models were further classified. In all of the models, the field obs_freq, which
assigns a higher weight to low-risk strategies, was used to balance the outcome
categories or the strategy risk.

Model Type: Full Model (M1)


This model has all the standardized variables included as independent variables.
Some of these variables were highly correlated with each other. Many of them also have
skewed distributions. Variables included in this model type are:
o C_in_money_flg, z_SE_Diff_percent, z_C_Pr_by_Stk_Pr_percent,
z_C_int_rate, z_C_time_to_expire, z_C_imp_volatility, z_C_Delta,
z_C_Vega, z_C_Theta, z_C_Gamma, z_C_EL_Lambda, z_C_EL_Vega,
z_C_EL_Theta, z_C_EL_Gamma, z_P_Pr_by_Stk_Pr_percent,
z_P_imp_volatility, z_P_Delta, z_P_Vega, z_P_Theta, z_P_Gamma,
z_P_EL_Lambda, z_P_EL_Vega, z_P_EL_Theta, z_P_EL_Gamma, and
z_ls_dn_pr_by_stk_pr.

Model Type: Full Model with Transformed Variables (M1T)


This model has all the standardized and transformed variables included as
independent variables. The log transformations reduce the skewness in the distribution of
these variables. Some of these variables are still highly correlated with each other. The
objective of this model type is to check if the predictive accuracy improves by reducing
the skewness in the variable distributions. Variables included in this model type are:
o z_SE_Diff_percent, z_log10_C_Price_Percent, z_log10_days_to_expire,
z_log10_C_Imp_Volatility, z_log10_C_Vega, z_log10_C_Theta,
z_log10_C_Gamma, z_log10_C_EL_Lambda, z_log10_C_EL_Vega,
z_log10_C_EL_Theta, z_log10_C_EL_Gamma, z_log10_P_Price_Percent,
z_log10_P_Imp_Volatility, z_log10_P_Vega, z_log10_P_Theta,
z_log10_P_Gamma, z_log10_P_EL_Lambda, z_log10_P_EL_Vega,
z_log10_P_EL_Theta, z_log10_P_EL_Gamma, and
z_log10_ls_dn_Price_Percent.

Model Type: Select Variables Model (M2)


This model has only the selected standardized variables that reduce the
correlations among themselves and also act as proxies for the removed variables.
This set of variables was identified after the correlation analysis described in the data
preparation chapter. The goal of this model type is to check if reducing multicollinearity
improves the predictive accuracy. Variables included in this model type are:
o z_SE_Diff_percent, z_C_Pr_by_Stk_Pr_percent, z_C_int_rate,
z_C_time_to_expire, z_C_imp_volatility, z_C_Delta, z_C_Vega, z_C_Theta,
z_C_Gamma, z_C_EL_Lambda, z_P_Pr_by_Stk_Pr_percent,
z_P_imp_volatility, z_P_EL_Lambda, z_P_EL_Theta, and
z_ls_dn_pr_by_stk_pr.

Model Type: Factor Variables Model (M3)


This model has only the seven factors obtained from the factor analysis as
independent variables. These factors represent about 95% of the variations in the data and
are uncorrelated, thus meeting some of the assumptions of linear regression models. The
goal of this model type is to see if uncorrelated factors are enough to improve the
predictive accuracy. Variables included in this model type are: Factor 1 through Factor 7.

Model Type: Full Model with Underlying Stock as a variable (M4)


This model has the underlying stock as a categorical variable, along with all the
standardized variables of the full model M1, included as independent variables. Single
variable analysis shows that the distributions of the standardized variables differ
across underlying stocks. The strategy risk also varies across stocks. The goal of this model
type is to check if including the underlying stock as a model variable would improve model
accuracy. The disadvantage is that this model type cannot be generalized outside the
set of underlying stocks studied here.


Model Type: Full Model for each Underlying Stock (M5)


This model is similar to the full model M1, but is built separately for each
underlying stock; thus there are seven sub-models. Similar to model type M4, the
goal of this model type is to check if a separate model for each underlying stock would
improve model accuracy. As in M4, the disadvantage is that this model type cannot
be generalized outside the set of underlying stocks studied here. In addition, it is more
cumbersome to build the models and validate their accuracies.

Models Studied
We have several modeling methods and several model types based on analysis
variables, as described in the previous sections. Figure 16 gives a list of the models studied for
the entry analysis. About 24 models were built using various modeling methods and
variable sets. This count does not include the sub-models for each underlying stock built
as part of model type M5, or the models studied using IBM Modeler. Models were not
built for some of the modeling method and model type combinations, because the overall
observations from the models built earlier indicated that these combinations were unlikely
to produce better results. Figure 17 summarizes the
random split % and the count of observations used for training and validating each
model. This split was made within the training dataset created before. All observations
are used for the logistic and linear regression models. Neural networks use 67% of
observations for training purposes and 33% for validation purposes. Decision trees use
70% of observations for training purposes and the remaining 30% for validation.

[Figure 16 grid omitted: modeling methods (logistic regression, neural networks
categorical/continuous, decision tree categorical/continuous, linear regression, hybrid
models) crossed with model types based on analysis variables (M1, M1T, M2, M3, M4, M5).]

Figure 16. Models studied for identifying the entry strategies. Models are classified by
modeling methods and model types based on analysis variables.
Split of Training Dataset observations for Model Training and Model Validation

Response                      Modeling Method                 Split %                 Count of Observations
                                                          Training  Validation       Training   Validation
Categorical (ls_dn_rflag_0)   Logistic Regression           100%        0%            31,231          --
                              Neural Networks Categorical    67%       33%            20,821      10,410
                              Decision Tree Categorical      70%       30%            21,861       9,370
                              Hybrid Models                 depends on base models
Continuous (ls_dn_risk_0)     Linear Regression             100%        0%            31,231          --
                              Neural Networks Continuous     67%       33%            20,821      10,410
                              Decision Tree Continuous       70%       30%            21,861       9,370

Figure 17. Split of training dataset observations into training set and validation dataset
for each modeling method. Both the split % and the approximate count of observations used
for training and validation are shown.

Results from Select Models


A large number of diagnostic outputs were produced by SAS JMP for all the
studied models. Some of the important outputs for select models are presented in this
section. The presented models were chosen to represent the variety of modeling methods
and model types for the various variable sets. Results from the training dataset are presented.
The sections that follow discuss the selection criteria for the
best model and the diagnostic outputs from the best model. In this section, results are
presented for the following models: logistic regression using transformed variables
(M1T), decision tree with categorical outcome using the full model with underlying stock as
an additional variable (M4), linear regression using factor variables (M3), neural network
predicting risk % using all standardized variables (M1), decision tree with continuous
outcome using select model variables (M2), the hybrid model, and comparable models using
IBM Modeler software.

Results: Logistic Regression using Transformed Variables (M1T)


This logistic regression model has the risk category (ls_dn_rflag_0) as the
response variable. It uses the standardized and transformed variables as independent
variables (model type M1T). Figures 18 to 21 show the SAS JMP outputs for the
model. From Figure 18, we see that the model R² is 0.18 and the generalized R² is also
relatively low at 0.29. About 30% of the observations are misclassified. The goodness of
fit test shows a significant 'Prob > ChiSq' measure, suggesting that more complex forms
of the variables (like interactions, non-linear terms, etc.) may be needed for a better fit.
From Figure 19, we see that the call implied volatility (z_log10_C_imp_volatility) and the call
elasticity of Lambda (z_log10_C_EL_Lambda) have high chi-squares and are among the
important predictors.


Figure 18. Whole model test and lack of fit test results for logistic regression model with
transformed variables as independent variables (M1T).
Parameter Estimates (for log odds of 0/1)

Term                             Estimate      Std Error    ChiSquare   Prob>ChiSq
Intercept                        -45.055409    3.9226102     131.93     <.0001*
z_SE_Diff_percent                 0.01822376   0.0015649     135.61     <.0001*
z_log10_C_Price_Percent           0.06514264   0.0094048      47.98     <.0001*
z_log10_days_to_expire            0.06305586   0.0094828      44.22     <.0001*
z_log10_C_Imp_Volatility          0.09529714   0.002934     1055.0      <.0001*
z_log10_C_Vega                    0.00249674   0.0017377       2.06     0.1508
z_log10_C_Theta                  -0.0009891    0.003394        0.08     0.7707
z_log10_C_Gamma                   0.00076192   0.0048456       0.02     0.8751
z_log10_C_EL_Lambda               0.24142807   0.0094443     653.48     <.0001*
z_log10_C_EL_Vega                 0.01612163   0.002694       35.81     <.0001*
z_log10_C_EL_Theta               -0.020294     0.0041342      24.10     <.0001*
z_log10_C_EL_Gamma               -0.0458241    0.0068585      44.64     <.0001*
z_log10_P_Price_Percent           0.09562522   0.0136008      49.43     <.0001*
z_log10_P_Imp_Volatility          0.03117831   0.0044885      48.25     <.0001*
z_log10_P_Vega                   -0.0124637    0.0044774       7.75     0.0054*
z_log10_P_Theta                  -0.0897304    0.0037868     561.48     <.0001*
z_log10_P_Gamma                  -0.118099     0.0066174     318.50     <.0001*
z_log10_P_EL_Lambda              -0.0990679    0.0139328      50.56     <.0001*
z_log10_P_EL_Vega                -0.0931831    0.0039365     560.33     <.0001*
z_log10_P_EL_Theta                0.05754443   0.004415      169.88     <.0001*
z_log10_P_EL_Gamma                0.24868466   0.0113244     482.25     <.0001*
z_log10_ls_dn_Price_Percent       0.00085028   0.0035939       0.06     0.8130

Figure 19. Parameter estimates from the logistic regression model with transformed
variables as independent variables (M1T).

Figure 20. Unit odds ratios and range odds ratios from the logistic regression model with
transformed variables as independent variables (M1T). The reciprocal column has the
odds ratios corresponding to the low-risk strategies (i.e., ls_dn_rflag_0 = 1).


Figure 21. ROC and lift curves for the logistic regression model with transformed
variables as independent variables (M1T). The contingency table is for the top 2% of the
predicted low-risk strategies. Rows (ls_dn_rflag_0) in the contingency table refers to the
actual values and the columns (M1T_LR_rflag_fnl) refers to the predicted values.
The reciprocal range odds ratios in Figure 20, which give the odds ratios for low-risk,
show that the call elasticity of Lambda (z_log10_C_EL_Lambda), the put elasticity of Gamma
(z_log10_P_EL_Gamma), the call implied volatility (z_log10_C_imp_volatility), the call
elasticity of Gamma (z_log10_C_EL_Gamma) and the put Theta (z_log10_P_Theta) have
major influences in predicting the low-risk strategies, because their reciprocal
odds ratios are far from one. The lift curve in Figure 21 shows a maximum lift of about
1.6 for the low-risk strategies (for the balanced data).

The top 2% of the strategies with the highest predicted probabilities for low-risk
(ls_dn_rflag_0 = 1) are classified as low-risk strategies to possibly enter. The
contingency table in Figure 21 shows that there are 628 such strategies in the training
dataset. 22.1% of these are true positives (i.e., the actual ls_dn_rflag_0 equals 1) and the
other 77.8% are false positives. Relatively speaking, this indicates that the model predicts
poorly.

Results: Decision Trees using Underlying Stock and Full Model Variables (M4)
This decision tree model uses all the standardized variables, as well as the underlying
stock as a categorical variable (model type M4), for predicting the risk category
(ls_dn_rflag_0). Figures 22 through 24 show the SAS JMP outputs for this model.
From Figure 22, we see that the generalized R² values for the training and validation data are
0.47 and 0.38 respectively, an improvement in predictive accuracy. Based on the
G² measure, the time to expire (z_C_time_to_expire) and the underlying stock
(C_underlying) are among the important predictors. Figure 23 shows a small part of the
tree output along with the G² and LogWorth measures for each node. Figure 24 shows the
contingency table for the top 2% of the predicted low-risk strategies. About 31.9% of
these are true positives and 68.1% are false positives. There is a small improvement in
predictive accuracy in this model; however, the false positive % is still high.

Column Contributions

Term                         Number of Splits        G²
C_underlying                        6           2299.8773
z_SE_Diff_percent                   2            359.1480
z_C_Pr_by_Stk_Pr_percent            2            262.0303
z_C_int_rate                        7           1469.6280
z_C_time_to_expire                  6           8118.6824
z_C_imp_volatility                  4            795.1788
z_C_Delta                           1            146.9140
z_C_Vega                            6           1266.0620
z_C_Theta                           5            669.1480
z_C_Gamma                           4            664.2533
z_C_EL_Lambda                       1           1025.2642
z_C_EL_Vega                         1            255.6065
z_C_EL_Theta                        1            165.3602
z_C_EL_Gamma                        4           1266.1476
z_P_Pr_by_Stk_Pr_percent            0              0.0000
z_P_imp_volatility                  4            811.7609
z_P_Delta                           1            117.4206
z_P_Vega                            2            383.1742
z_P_Theta                           3            485.1792
z_P_Gamma                           1            151.9697
z_P_EL_Lambda                       2            330.4666
z_P_EL_Vega                         3            493.0176
z_P_EL_Theta                        1            168.3326
z_P_EL_Gamma                        1            138.8791
z_ls_dn_pr_by_stk_pr                1            102.4036

Figure 22. Model fit statistics and the independent variable contributions in terms of the G²
measure from a decision tree model. This model has the risk flag as the response variable and
uses the underlying stock (a categorical variable) and all standardized variables as
independent variables (M4).

Figure 23. A small part of the tree is shown to illustrate the output from the decision tree
model. This model has the risk flag as the response variable and uses the underlying stock
(a categorical variable) and all standardized variables as independent variables (M4). The
blue bar corresponds to the low-risk strategies (ls_dn_rflag_0 = 1).

Figure 24. The contingency table is for the top 2% of the predicted low-risk strategies.
Rows (ls_dn_rflag_0) in the contingency table refer to the actual values and the columns
(M4_TRD_rflag_fnl) refer to the predicted values. The decision tree model here uses
the underlying stock (a categorical variable) and all standardized variables as independent
variables (M4).

Results: Linear Regression using Factor Variables (M3)


Outputs from the linear regression model predicting the continuous risk outcome
(ls_dn_risk_0) using the factors (M3) are shown in Figures 25 through 27. From Figure
25, we see that R² is 0.22 (a low predictive accuracy) and a highly significant 'Prob > F'
measure in the analysis of variance table, indicating that the model is significantly
better than the intercept model. The lack-of-fit test suggests that we need more complex
combinations of these variables to get a better predictive model. Factors 3, 2, and 5 are
the important predictors, as determined by their absolute t-values. From the
actual by predicted chart in Figure 26, we see that the error variance is not constant, a
sign of deviation from the linear model assumptions and of a suboptimal model. The top 2% of
the strategies with the lowest predicted risks are then classified as low-risk strategies. The
corresponding contingency table in Figure 27 shows a low predictive accuracy, with only
13.9% of them being true positives.


Figure 25. Model fit statistics, analysis of variance, lack of fit statistics and parameter
estimates sorted by their t-values are shown. This linear regression model has strategy
risk as response variable and uses the factors as independent variables (M3).

Figure 26. Actual by predicted plot from linear regression model with strategy risk as
response variable and the factors as independent variables (M3).


Figure 27. Actual by predicted plot and contingency table for top 2% of the strategies
with lowest predicted risk classified as low-risk strategies are shown. This linear
regression model has strategy risk as response variable and uses the factors as
independent variables (M3).

Results: Neural Network for Continuous Outcome using All Variables (M1)
This model predicts the strategy risk ls_dn_risk_0 (a continuous outcome) using all
standardized variables as independent variables (M1). Figures 28 through 30 show
the SAS JMP outputs. From Figure 28, we see that the R² values for the training and validation
data are 0.5 and 0.39 respectively. This fit is slightly better than that of the other models
described in the previous sections. Figure 29 shows the actual vs. predicted value charts for
the training and the validation data. The contingency table for the top 2% of strategies with
the lowest predicted risk (Figure 30) shows 37.7% true positives and 62.3% false positives.
This is a slight improvement over the models described in earlier sections.
SAS JMP offers an important interactive graphical tool called the profiler to
understand the influence of each variable in predicting the outcome. This tool also helps
in understanding the interactions between independent variables and the presence of
non-linear relationships between the independent variables and the outcome variable. Since the
profiler consists of several interactive charts, which are large in this case, its outputs are not
presented here.

Figure 28. Neural network model fit statistics for the observations used for training and
validation are shown. This neural network model has strategy risk as response variable
and uses all the standardized variables as independent variables (M1).

Figure 29. Actual vs. predicted risks for the training and the validation portions of the
training dataset. This neural network model has strategy risk as response variable and
uses all the standardized variables as independent variables (M1).

Figure 30. The contingency table from neural network model (type M1) for top 2% of the
strategies with lowest predicted risk classified as low-risk strategies are shown. The data
here includes both training and validation observations.

Results: Decision Trees using Select Variables (M2)
This decision tree (or recursive partitioning) model has the strategy risk ls_dn_risk_0
(a continuous variable) as the response variable. It uses the select standardized variables
(M2) as independent variables. Figures 31 through 33 show the results from the SAS
JMP software. Figure 31 shows a small part of the tree. The nodes show the mean and
standard deviation of the strategy risk within the observations of each node. From Figure
32, the time to expire (z_C_time_to_expire) and the interest rate (z_C_int_rate) are
observed to be among the important predictors, as determined by the sum of
squares contribution of each variable in predicting the strategy risk. The contingency table
(Figure 33) for the top 2% of strategies with the lowest predicted risk shows 51.8% true
positives and 48.1% false positives. This is a relatively good predictive accuracy.

Figure 31. A small part of the tree is shown to illustrate the output from the decision tree
model with continuous response (ls_dn_risk_0). This model has the strategy risk as the
response variable and uses the select standardized variables as independent variables (M2).
Each node has the mean and standard deviation of the strategy risk.


Figure 32. The sum of squares explained by each independent variable in the decision
tree model with continuous response is used to determine its overall contribution in
predicting the strategy risk. This decision tree model has strategy risk as response
variable and uses select standardized variables as independent variables (M2).

Contingency Table (rows: ls_dn_rflag_0; columns: M2_TRC_rflag)

                              M2_TRC_rflag = 0    M2_TRC_rflag = 1     Total
ls_dn_rflag_0 = 0   Count            28778                 310         29088
                    Total %          92.15                0.99         93.14
                    Col %            94.09               48.14
                    Row %            98.93                1.07
ls_dn_rflag_0 = 1   Count             1809                 334          2143
                    Total %           5.79                1.07          6.86
                    Col %             5.91               51.86
                    Row %            84.41               15.59
Total               Count            30587                 644         31231
                    Total %          97.94                2.06

Figure 33. The contingency table for top 2% of the strategies with lowest predicted risk
classified as low-risk strategies are shown. The data here includes both training and
validation observations. This decision tree model has strategy risk as response variable
and uses select standardized variables as independent variables (M2).

Results: Hybrid Models


Two models with relatively high predictive accuracies for the low-risk strategies
(ls_dn_rflag_0 = 1) were combined to form this hybrid model. The component models
are a neural network model for the risk category and a decision tree model for the risk
category, both using all standardized variables (M1) as independent variables. The average
of the predicted probabilities for low-risk (ls_dn_rflag_0 = 1) from the component models
was used for scoring. Figure 34 shows the contingency table for the top 2% of the predicted
low-risk strategies. We see 48.6% true positives and 51.4% false positives.

Figure 34. The contingency table for top 2% of the strategies with highest low-risk
predicted probability. The hybrid model predicted probability is the average of the
predicted probabilities from the component base models. The base models here are a
neural network model and a decision tree model, both using all standardized variables as
independent variables (M1).

Results: Comparing Models from IBM Modeler


Using the IBM Modeler software, various neural network and C5.0 decision tree
models were built using all standardized variables (M1) as independent variables. The
objective here was to compare the equivalent models between the SAS JMP and IBM
Modeler software. The general observations are listed below.
o IBM Modeler, with its intuitive interface, requires less time to build a model.
o SAS JMP is better in terms of modeling larger datasets with faster execution
times.
o The neural network implementation in IBM Modeler yields a very high training
accuracy but far lower test accuracies. The differences in accuracies between the
training and test datasets are much lower in SAS JMP (i.e., its neural
network models are more stable).
o The C5.0 decision tree implementation in IBM Modeler results in better test
data predictive accuracy than the decision trees built using SAS JMP. IBM
Modeler C5.0 also provides the opportunity to specify a cost matrix. However,
identifying a cut-off probability to designate the top 2% of low-risk strategies was
more of a trial and error process.
o IBM Modeler has more data mining methods available for data analysis.

Selection of the Best Predictive Model


The model predictive accuracies (i.e., true positives) using the training dataset for
the top 2% of the predicted low-risk strategies are summarized in Figure 35. In order to
select the best model, the models are validated against the test dataset that was held out
before the modeling process began. For a given model, the predicted probability cut-off
limit (or the risk cut-off limit in the case of a continuous outcome) for identifying the top 2%
of the low-risk strategies in the training dataset is first computed. This cut-off limit is
then applied to the test dataset to identify the low-risk strategies. In most cases, this
corresponded to approximately 2% of the observations in the test dataset. A
contingency table was then developed to obtain the false positives and the true positives,
whose total equals 100%. The best predictive model is the one with the highest true positive
percentage on the test dataset.
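A sketch of this validation step is shown below (assuming pandas data frames and a hypothetical score column name; the actual work was done in SAS JMP):

```python
import pandas as pd

def top2_contingency(train, test, score_col, actual_col="ls_dn_rflag_0"):
    """Find the training-set probability cut-off that selects the top 2%
    of predicted low-risk strategies, apply it to the test set, and
    report true/false positive percentages among the selected strategies."""
    cutoff = train[score_col].quantile(0.98)       # top-2% limit from training
    selected = test[test[score_col] >= cutoff]     # apply the same cut-off to test
    tp = (selected[actual_col] == 1).mean() * 100  # true positive %
    return cutoff, tp, 100 - tp                    # TP% + FP% = 100%

# Usage (score_col "p_low_risk" is a hypothetical model-score column):
# cutoff, tp_pct, fp_pct = top2_contingency(train_df, test_df, "p_low_risk")
```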

Figure 36 has the true positive percentages obtained from the test dataset for the various
models. For each modeling method, the best model is highlighted in blue. The overall
best predictive model, with the highest predictive accuracy of 58.4%, is the neural network
model that predicts the risk category for each underlying stock using all standardized
variables as predictors (M5). This cell is highlighted and boxed.

Figure 35. Training data true positive percentages for the various models. The top 2% of the
predicted low-risk strategies were first selected as the best low-risk strategies and then were
compared against the actual risk category. The neural networks M5 model has the highest
accuracy, where 54.4% of the selected strategies were actually low-risk strategies
(ls_dn_rflag_0 = 1). This cell is highlighted and boxed.


Figure 36. Test dataset true positive percentages for the various models. The top 2% probability
limit obtained for each model from the training dataset was used for the selection of low-risk
strategies in the test dataset. The neural networks M5 model has the highest accuracy.
This cell is highlighted and boxed. Here, 58.4% of the selected strategies were actually
low-risk strategies (ls_dn_rflag_0 = 1).
Some of the observations from the test dataset model accuracies (Figure 36) are listed
below:
o In general, models predicting the categorical response variable (i.e., the risk
category) perform better than the corresponding modeling methods using the
continuous response outcome (i.e., the strategy risk in percentage).
o Models built separately for each underlying stock using all standardized variables
(M5) outperform all other model types. However, these model results cannot
be generalized outside the underlying stocks studied.
o In general, neural network models perform better than the other modeling
methods considered in this study.

Diagnostics of the Best Predictive Model
The best predictive model is the set of neural network models predicting the risk
category for each underlying stock. These models use all standardized variables as
independent variables for each of the underlying stocks (model type M5). These models
are specific to the underlying stock and cannot be generalized to other stocks outside the
study. Each of these models used ten hidden nodes in its single hidden layer.
Model fit statistics are reported in Table 4 for the training and validation data. The
validation generalized R² varies from 0.367 for QCOM to 0.598 for MO. The identified
strategies for GE, CSCO and WMT have high predictive accuracies; their true positive
percentages for the test dataset are 78.8%, 61.7% and 66.7% respectively. The strategies for
MO and C have lower predictive accuracies, with 42.9% and 46.7% true positives in the
test dataset. Overall, 1.8% of the strategies from the test dataset are identified as low-risk
strategies to enter.
Table 4. Best predictive model fit statistics and the overall training, validation and test
accuracies for the top 2% of the strategies identified as low-risk strategies. Results are
listed for each underlying stock. Overall predictive accuracies are also reported. The true
positive percentages for the test dataset are bolded.
                           Model Fit                                  Overall accuracy for top 2% of the identified low-risk strategies
             Training                    Validation                Training and Validation    Test Dataset
Underlying   Gen.    RMSE   Misclass.    Gen.    RMSE   Misclass.  True      False           True      False        % of selected
Stock        R-Sq.          Rate         R-Sq.          Rate       Positives Positives       Positives Positives    strategies
C            0.636   0.343  0.147        0.537   0.383  0.213      31.0%     69.0%           46.7%     53.3%        1.6%
CSCO         0.755   0.279  0.093        0.489   0.368  0.172      71.7%     28.3%           61.7%     38.3%        2.2%
GE           0.759   0.287  0.111        0.578   0.354  0.171      65.4%     34.6%           78.8%     21.2%        1.5%
MO           0.689   0.318  0.135        0.598   0.350  0.172      49.4%     50.7%           42.9%     57.1%        1.7%
MSFT         0.686   0.326  0.173        0.531   0.376  0.213      59.6%     40.4%           55.8%     44.2%        2.4%
QCOM         0.567   0.357  0.174        0.367   0.398  0.207      41.9%     58.1%           52.2%     47.8%        1.5%
WMT          0.722   0.303  0.122        0.510   0.369  0.179      53.3%     46.8%           66.7%     33.3%        1.9%
Overall                                                            54.4%     45.6%           58.4%     41.6%        1.8%

More model fit statistics, along with the training and validation lift curves for each of
the underlying stocks, are shown in Figures 37 through 43. The SAS JMP scoring
formula to identify the low-risk strategies is very large, since it involves a ten-hidden-node
neural network for each underlying stock, and is therefore not presented here.
From Table 4, we see that the overall true positives from the best model are about
58.4% in the test dataset and 54.4% in the training and validation datasets. However, only
a very limited number (i.e., close to 2%) of the strategies are identified as low-risk
strategies to enter. If we were to choose these strategies randomly, only 6.86% of them
would be actual low-risk strategies. After this modeling process, and using the best
predictive model, we have improved the low-risk strategy identification with an overall
lift of around eight (= 58.4% / 6.86%). This is a noticeable improvement in the predictive
power. The identified entry strategies are characterized further in the next section.

Neural - C (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.6364709    0.5373653
Entropy RSquare               0.4694399    0.3733189
RMSE                          0.3434016    0.3830599
Mean Abs Dev                  0.2437446    0.2707885
Misclassification Rate        0.1467761    0.2131737
-LogLikelihood                1820.9394    1079.3269
Sum Freq                      4994         2505

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 37. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "C".


Neural - CSCO (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.7550715    0.4892724
Entropy RSquare               0.6060422    0.332614
RMSE                          0.2793462    0.3685402
Mean Abs Dev                  0.172358     0.2350233
Misclassification Rate        0.0932981    0.1724138
-LogLikelihood                1923.7672    1629.4834
Sum Freq                      7192         3596

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 38. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "CSCO".

Neural - GE (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.7591601    0.5781438
Entropy RSquare               0.610316     0.4123716
RMSE                          0.2868547    0.3536841
Mean Abs Dev                  0.1768135    0.2179738
Misclassification Rate        0.111        0.1706019
-LogLikelihood                1595.6414    1206.5767
Sum Freq                      6000         3007

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 39. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "GE".

Neural - MO (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.6887001    0.5980078
Entropy RSquare               0.5242435    0.4292946
RMSE                          0.3181898    0.3504891
Mean Abs Dev                  0.2260389    0.2429464
Misclassification Rate        0.1354757    0.1724426
-LogLikelihood                1570.0314    947.4007
Sum Freq                      4761         2395

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 40. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "MO".

Neural - MSFT (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.6858309    0.5308728
Entropy RSquare               0.5227111    0.3677232
RMSE                          0.3258063    0.376409
Mean Abs Dev                  0.2114832    0.2555221
Misclassification Rate        0.1732202    0.2126775
-LogLikelihood                1890.8997    1252.6458
Sum Freq                      5773         2887

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 41. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "MSFT".

Neural - QCOM (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.5673571    0.3668874
Entropy RSquare               0.4034432    0.2346909
RMSE                          0.3574809    0.3982184
Mean Abs Dev                  0.2616588    0.2961472
Misclassification Rate        0.1740558    0.2074236
-LogLikelihood                2213.0977    1422.7299
Sum Freq                      5481         2748

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 42. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "QCOM".

Neural - WMT (Model NTanH(10); Validation: Random Holdback; Freq: obs_freq)

ls_dn_rflag_0 Measures        Training     Validation
Generalized RSquare           0.72224      0.5106462
Entropy RSquare               0.5635234    0.3489685
RMSE                          0.3029666    0.3698298
Mean Abs Dev                  0.1977189    0.2450545
Misclassification Rate        0.1216009    0.1791444
-LogLikelihood                1572.9812    1175.9112
Sum Freq                      5222         2618

[Training and validation lift curves for ls_dn_rflag_0 omitted.]

Figure 43. Model fit statistics and lift curves for the training and validation portions for the
underlying stock "WMT".

Characteristics of the Entered Low-Risk Strategies
Based on the best predictive model, close to the top 2% of the predicted low-risk
strategies were chosen as the low-risk strategies to enter. This accounts for about 868 entry
strategies in the complete dataset. Within the study period of five years, on average,
this translates to about three enterable low-risk strategies per week. Thus, even though we
chose only the top 2%, we still have ample opportunities for identifying enterable
strategies. Figure 44 shows the actual count of low-risk strategies as well as the risk
profiles for these strategies. Among the entered strategies, about 56% are actual
low-risk strategies (true positives) and the other 44% are false positives. The median risk
(in %) is about 47.5%. The risk ranges from 22% to 93% within the 10th percentile to 90th
percentile range. Thus, many of these entered strategies offer at least a few profitable
trading days. During the exit analysis, the focus is to find the best possible day on which
these entered strategies could be exited to maximize the profit potential.
Figures 45 and 46 show the distribution of the risk category and the risk in
percentage for the training and test datasets. There are 607 entry strategies in the training
dataset and 261 entry strategies in the test dataset. The training and test datasets have about
55% and 58% actual low-risk strategies respectively. During the exit analysis, the 607
entry strategies in the training dataset were used to find the optimal exit rules. These exit
rules were then tested for their accuracy using the 261 entry strategies in the test dataset.


Figure 44. Overall distribution of the actual risk category (ls_dn_rflag_0) and the actual
strategy risks in % (ls_dn_risk_0) for the entered low-risk strategies. The actual low-risk
strategies have ls_dn_rflag_0 = 1.

Figure 45. Distribution of the actual risk category (ls_dn_rflag_0) and the actual strategy
risks in % (ls_dn_risk_0) for the entered low-risk strategies in the Training dataset. The
actual low-risk strategies have ls_dn_rflag_0 = 1.


Figure 46. Distribution of the actual risk category (ls_dn_rflag_0) and the actual strategy
risks in % (ls_dn_risk_0) for the entered low-risk strategies in the Test dataset. The actual
low-risk strategies have ls_dn_rflag_0 = 1.
For the entered low-risk strategies, Figure E1 in Appendix E shows the
distribution of the risk categories and the strategy risk in percentage for each underlying
stock. Table 5 below summarizes the proportion of actual low-risk strategies within each
underlying stock.
Table 5. Proportion of actual low-risk strategies (ls_dn_rflag_0 = 1) among the entered
strategies for each underlying stock.
C      CSCO   GE     MO     MSFT   QCOM   WMT    Overall
35%    68%    69%    47%    58%    44%    57%    56%
General Electric (GE: 69%) and Cisco Systems (CSCO: 68%) have relatively
better predictive models and hence a higher proportion of actual low-risk strategies. The
mean strategy risks in percentage are 43% for GE and 44% for CSCO. Citigroup (C:
35%) and Qualcomm (QCOM: 44%) have a less-than-average proportion of actual low-risk
strategies. Table 6 shows the distribution of the remaining days to expire after entry for the
identified strategies. On average, the entered strategies have about 68 days to expire.
The remaining time to expire ranges from 30 days to 253 days.
Table 6. Distribution of days to expire within the 868 entered strategies.

Statistic                  Days to Expire
Mean                             68
Std Deviation                    38
100% Max                        253
 99%                            191
 95%                            143
 90%                            120
 75% Q3                          91
 50% Median                      52
 25% Q1                          38
 10%                             32
  5%                             31
  1%                             30
  0% Min                         30

Summary
During the period between 2002 and 2006, about 44K delta-neutral long straddle
strategies were identified for the select underlying stocks. These were split into 70% for
training and 30% for testing. The goal of the entry analysis is to identify the low-risk
delta-neutral long straddle strategies that could be executed or entered. Predictive models
were built and validated using the SAS JMP software. Both the risk category (a categorical
outcome) and the strategy risk in percentage (a continuous outcome) were used as
dependent variables. The modeling methods studied were linear regression, logistic
regression, neural networks and decision trees. The analysis dataset has several variables.
Models were built using different sets of variables (all standardized variables,
standardized and transformed variables, select standardized variables, factor variables
only, all variables plus the underlying stocks, etc.). In some cases, separate models for
each underlying stock were built to study the effects on the predictive accuracy. The top 2%
of the strategies that were predicted to have the lowest risks were identified as low-risk
strategies. The classification accuracy (true positive percentage) based on the test dataset was
used to select the best predictive model. A set of neural network models, one for each
underlying stock, that predicts the low-risk category using all standardized variables as
independent variables had the highest test accuracy of about 58% true positives. This
model was selected as the best predictive model.
The top 2% of the predicted low-risk strategies from the best predictive model
were identified as the best low-risk strategies to enter. This accounts for about 868 low-risk
entry strategies in the entire dataset. Within the study period of five years, on
average, this translates to about three enterable low-risk strategies per week, thus giving
ample opportunities for identifying the enterable low-risk strategies. If we were to choose
these strategies randomly, only 6.86% of them would be actual low-risk strategies. Thus,
we have improved the low-risk entry strategy identification with an overall lift of around
eight (= 58.4% / 6.86%). This is a noticeable improvement in the predictive power. In the
next chapter, the entry strategies are explored further to identify rules for exiting the
entered strategies with optimal profit.

EXIT ANALYSIS: RULES TO EXIT THE STRATEGIES WITH OPTIMAL PROFIT

The goal of the exit analysis is to identify rules that describe the optimal
conditions to exit the entered strategies so as to maximize the expected profit. We exit
from a strategy when one of the following conditions is met: the loss limit, the profit
limit, the limit on remaining days to expire, or the last tradable day. For example, after we
enter a particular strategy, we may exit the strategy if at any time we reach a loss of 50%
or more, or a profit of 10% or more, or if the remaining time to expire falls to a 15% limit,
whichever occurs first. This study strives to find these limits for each underlying stock so
as to maximize the expected profit under some constraints on the distribution of the expected
return. For example, we may want to find an optimal exit rule such that we incur a loss from
less than 25% of the entered strategies; in other words, the remaining 75% of the entered
strategies make a profit. First, the data is collected and prepared in a format suitable for
the exit analysis. Then, a grid of strategy returns for various combinations of the
rule parameters is generated. A grid search is done on the training data to find the rules that
maximize the expected profit. These rules are then tested and validated for their
accuracy using the entered strategies in the test dataset.

Data Preparation
Each of the entered strategies is followed from the date of entry to the expiration
day, and the corresponding daily call and put transactions are merged together. That is,
each observation represents the corresponding call and put end-of-day transactions of a given
entered strategy from the entry date to the expiration day. The strategy return (i.e., profit
or loss) for each tradable day is computed. An eligible tradable day is defined as a
transaction day where both the call and put options had a minimum volume of ten units. The
call and put option prices were checked for no-arbitrage conditions (i.e., prices such
that they do not give an opportunity to make a profit without any investment).
Transactions with strong arbitrage violations were removed, since the price may be suspect.
Price corrections were made for the option transactions that occurred on or
before July 5th, 2002, since the prices in the raw data were magnified tenfold during
those days. The following formula was used to compute the strategy return on a given day:
Exit Return (in $) = (N_C × Current Call Price × 100 + N_P × Current Put Price × 100)
                     − Strategy Entry Price                                   (Eqn. 4)

Percentage Return (in %) = Exit Return (in $) / Strategy Entry Price × 100

Here, N_C and N_P are the number of calls and the number of puts, respectively, in the
entered strategy. Transaction costs are not accounted for directly in the above formula, since
the strategy entry price includes both the entry and exit transaction costs. Other parameters,
like implied volatilities, Greeks and elasticities, were also computed for the call and put
transactions using the elaborate procedure described in the chapter on the data
preparation for the entry analysis. However, these were not used for the final rule predictions
and hence may not be necessary.
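As a sketch of Eqn. 4 in code (the position sizes and prices below are hypothetical):

```python
def exit_return(n_calls, n_puts, call_price, put_price, entry_price):
    """Eqn. 4: dollar and percentage return on exiting a straddle of
    n_calls call contracts and n_puts put contracts (100 shares per
    contract); entry_price already includes entry and exit transaction costs."""
    dollars = (n_calls * call_price * 100 + n_puts * put_price * 100) - entry_price
    percent = dollars / entry_price * 100
    return dollars, percent

# Hypothetical example: 2 calls and 3 puts entered for $1,000 total.
print(exit_return(2, 3, 2.40, 1.10, 1000.0))   # approximately (-190.0, -19.0)
```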
After including the eligible end-of-day transactions for each of the 868 entered
strategies, we have 32,642 records for the seven underlying stocks. The entered strategies
in the training and test datasets have 22,364 (69%) and 10,278 (31%) records
respectively. Table 7 shows the distribution of the percentage returns in the data. In the
complete dataset (i.e., all training and test strategies), we see that about 50% of the
observations (the median) have a profit of 6.5% or higher. The mean return is profitable
at about 18% profit. However, the returns can range from -87% (a high loss) to 585%
(a high profit). In the test dataset, the observed median profit is 4.4%. In short, we see that
the strategy returns from the entered strategies can vary widely over the tradable days,
but if several entry strategies are entered simultaneously and exited randomly, a mean
profit of about 16% to 18% could be observed.
Table 7. Distribution of strategy exit returns for the complete dataset and the test dataset.

                              Exit Return (in %)
Statistic               Complete Dataset    Test Dataset
                          (N=32,642)         (N=10,278)
Mean                          18.2               16.0
Std Deviation                 50.0               52.2
100% Max                     585.2              585.2
 99%                         201.6              219.8
 95%                         112.4              102.3
 90%                          74.8               69.5
 75% Q3                       36.2               34.7
 50% Median                    6.5                4.4
 25% Q1                      -10.0              -12.6
 10%                         -28.4              -31.1
  5%                         -40.7              -43.5
  1%                         -63.9              -67.0
  0% Min                     -86.9              -86.9

Table 8 shows the distribution of the exit return (in %) for each underlying stock.
CSCO and GE have the best median exit returns, at 11.6% and 10.4% respectively. QCOM
and C have the lowest median exit returns, of -3.1% and -1.6% respectively.

Table 8. Distribution of strategy exit returns for each underlying stock within the
complete dataset.

                                        Exit Return (in %)
Statistic        C          CSCO       GE         MO         MSFT       QCOM       WMT
               (N=4,459)  (N=6,652)  (N=5,398)  (N=2,607)  (N=6,582)  (N=2,667)  (N=4,237)
Mean               6.1       16.4       17.4       25.7       36.6        5.6        9.6
Std Deviation     30.9       31.2       42.4       59.2       76.0       44.0       34.5
100% Max         135.8      224.2      194.2      402.3      585.2      288.1      149.7
 99%              83.5      105.2      132.3      278.0      273.8      157.5       99.8
 95%              61.4       72.3       98.1      139.6      187.6       95.8       70.4
 90%              49.6       57.4       77.0       87.7      155.5       64.7       55.6
 75% Q3           28.9       33.8       41.0       42.2       68.2       17.1       31.3
 50% Median       -1.6       11.6       10.4        8.5        8.2       -3.1        4.8
 25% Q1          -14.8       -2.8       -8.5       -6.8      -12.1      -20.4       -9.9
 10%             -27.9      -16.8      -33.4      -19.7      -29.1      -35.4      -32.8
  5%             -35.3      -29.0      -46.2      -33.5      -43.8      -48.7      -44.0
  1%             -59.4      -52.9      -68.2      -66.6      -58.7      -73.8      -71.1
  0% Min         -86.4      -78.0      -84.2      -86.9      -79.3      -83.6      -82.9

Exit Rule Logic


The exit rules are primarily determined by limits applied to the strategy loss, the
strategy profit, or the remaining time to expire. After a strategy is entered, we loop
through each tradable day, and if any of the defined conditions occurs, we close the
open transactions by selling the corresponding calls and puts in the strategy (i.e., exit
from the strategy). The defined conditions for the exit are:
a) the strategy's return (in %) reaches the specified loss limit, OR
b) the strategy's return (in %) reaches the specified profit limit, OR
c) the remaining time to expire is at or below the specified time limit.
In some cases, where the trading volume and the time limit are low, we may reach
the last tradable day without hitting any of these limits. For such cases, we consider the
last tradable day as the day of exit. The logic behind the exit rule is shown in Figure 47.

The objective of the analysis is to find the optimum loss limit, profit limit and time
limit for a given underlying stock, so as to maximize the expected profit.

[Flowchart: starting on day 1 of the strategy, compute the strategy return each day; exit if
the return <= the loss limit (i.e., a negative return), if the return >= the profit limit, or if the
remaining time to expire <= the time limit; otherwise move to the next day, exiting on the
last tradable day if no limit is hit.]

Figure 47. Exit rule logic is shown. The goal is to find the optimal limits that maximize profit.
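A minimal sketch of this exit loop in Python is shown below (the per-day returns and time ratios are hypothetical; the thesis computations were done on the actual transaction data):

```python
def find_exit(days, loss_limit, profit_limit, time_limit):
    """Walk the tradable days of one entered strategy and return the
    (day index, return %) at which the exit rule fires.

    days: list of (return_pct, remaining_time_ratio) per tradable day;
    loss_limit is negative (e.g. -50), profit_limit positive (e.g. 10),
    time_limit is the remaining-time ratio in % (e.g. 15).
    """
    for i, (ret, time_remaining) in enumerate(days):
        if ret <= loss_limit:             # loss limit reached
            return i, ret
        if ret >= profit_limit:           # profit limit reached
            return i, ret
        if time_remaining <= time_limit:  # too close to expiration
            return i, ret
    return len(days) - 1, days[-1][0]     # exit on the last tradable day

# Hypothetical 5-day strategy: (return %, % of time remaining).
days = [(-5, 80), (6, 60), (-25, 40), (12, 20), (6, 0)]
print(find_exit(days, loss_limit=-20, profit_limit=10, time_limit=15))  # (2, -25)
```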

Methodology
The method used here to find the optimal rule limits that maximize the profit is
an exhaustive grid search. The prepared data for the exit analysis is still at the transaction
level for each entered strategy. A grid of strategy returns for various combinations of loss
limits, profit limits and times to expire is first formed for each of the entered strategies.
Then, based on the rule logic defined above, the date of exit and the actual strategy return
on the exit day are computed for each combination of the rule parameters (i.e., limits).
After computing the exit return at the strategy level, the distribution of the returns is
summarized at the underlying stock level for each of the exit parameter value
combinations in the grid. Once the return grid is computed, we use the training data to go
through the return values in the grid and find the optimal exit limit combination that yields
the maximum profit under certain constraints. These optimal limits are validated using the
test dataset. Another approach, based on the optimization of predicted response surfaces, was
also considered. More details about these steps are described in the following sections.
Methodology: Exit Grid Formation
First, a range of values for each of the exit parameters is chosen. These values are
chosen in such a way that the intervals between the values approximately represent the
distribution of the exit returns in the data. Table 9 shows the exit parameters, the variable
name and its description, the range of grid values considered for each exit
parameter, and the total number of values in the range. For the loss limits, profit limits and
time limits, we have 22, 30, and eight grid values respectively. Each entered strategy
would have 5,280 (= 22 × 30 × 8) combinations. Considering all 868 strategies, we
would have about 4.6M grid value combinations in the initial grid. Thus, we are
forming a huge grid and computing exit returns for each of the exit parameter combinations
using the transaction-level strategy data.
Table 9. Grid parameter value ranges.

Loss Limit (ls_dn_exit_pl_per): exit return in %. The grid values are loss percentages;
that is, the actual returns are negative. 22 values:
-2%, -4%, -6%, -8%, -10%, -12%, -14%, -16%, -18%, -20%, -25%, -30%, -35%, -40%,
-45%, -50%, -55%, -60%, -65%, -70%, -80%, -90% and above.

Profit Limit (ls_dn_exit_pl_per): exit return in %, expressed as a profit %; that is, the
actual returns are positive. 30 values:
1%, 2%, 3%, 4%, 5%, 6%, 7%, 8%, 9%, 10%, 12%, 14%, 16%, 18%, 20%, 22%, 24%,
26%, 28%, 30%, 35%, 40%, 45%, 50%, 55%, 60%, 70%, 80%, 90%, and above 100%.

Time Limit (time_ratio): % of time remaining before expiry, computed based on the total
life of the strategy. Eight values:
50%, 40%, 30%, 20%, 15%, 10%, 5%, and 0%.
Such a grid computation may take a long time if we compute returns by passing
through the transactions data for each row of the grid. Heuristic algorithms were
developed to compute the whole grid by passing through the transaction data only a few
times, thereby saving considerable computation time. The basic idea behind the
heuristic processing is listed below:
Step 1: Create new date and return columns for each of the grid values
specified above. Then, for each transaction record, store the transaction date if
a particular exit limit value is triggered. Thus, in one pass we note all the
trigger dates.
Step 2: The minimum dates for each of the exit limit value columns are computed
by summarizing the data at the strategy level. Simply put, this date
corresponds to the earliest trigger date for a given exit limit value. The exit
returns on each of these trigger dates are then computed. This step requires
multiple iterations of SQL joins (60 iterations, one for each limit column).
Step 3: Then, all the grid value combinations are recreated for each strategy.
The rule that triggers the exit, along with the corresponding return, is captured.
We will have about 4.6M total grid value combinations (5,280
combinations for each of the 868 strategies).
Step 4: Finally, the grid value combinations are summarized at the underlying
stock and data type (i.e., training set and test set) level. The distribution
characteristics of the exit returns, like the mean, standard deviation and various
quantiles, are captured for each underlying stock. The final exit grid data has about
37K observations (5,280 combinations for each of the seven underlying
stocks) in the training dataset and a similar count in the test dataset.

This logic is illustrated by a hypothetical example in Figure 48. Two strategies (A and B) for the underlying stock GE are considered. Each strategy has five days from start to expiration. The returns on each of these days are shown in Figure 48a. This data is at the strategy transaction level (i.e., one record per day for the strategy). For simplicity, we consider two loss limits (-5%, -20%), two profit limits (5%, 10%) and just one time limit (80%). As part of step 1 (Figure 48a), new columns that store trigger dates are created for each of these limits. For each observation, if a particular rule triggers, the transaction date is stored as the trigger date for the corresponding limiting value. For example, in the first record (Figure 48a), the return is -5% and hence the -5% loss limit is triggered. The Mar-01 transaction date is stored as the trigger date for the loss limit of -5%. More than one rule can trigger for a given row. In step 2 (Figure 48b), the table in Figure 48a is summarized by strategy and the minimum of the trigger dates within each strategy is captured. This trigger date is simply the earliest trigger date for a given limiting value.
In step 3 (Figure 48c), all possible combinations of the limiting values are generated for each strategy. Using the data in Figure 48b, the date of trigger is computed as the minimum of the trigger dates from the corresponding loss limit, profit limit and time limit values. For example, row 1 of Figure 48c corresponds to strategy A with loss, profit and time limits of -5%, 5%, and 80% respectively. From Figure 48b, the corresponding trigger dates are Mar-01, Mar-02 and Mar-04 respectively. The exit occurs when one of the limits is triggered first. Here, the earliest trigger (Mar-01) is for the loss limit of -5%. The return on Mar-01 is -5%. These values are captured in row 1 of Figure 48c.

Step 1: Mark the trigger dates for each limiting value.

Record  Stock  Strategy  Transaction  Return   Trigger Dates
Number         ID        Date                  Loss -20%  Loss -5%  Profit 5%  Profit 10%  Time 80% (day 4 of 5)
1       GE     A         Mar-01        -5%                Mar-01
2       GE     A         Mar-02         6%                          Mar-02
3       GE     A         Mar-03       -25%     Mar-03     Mar-03
4       GE     A         Mar-04        12%                          Mar-04     Mar-04      Mar-04
5       GE     A         Mar-05         6%                          Mar-05
6       GE     B         Jun-01         2%
7       GE     B         Jun-02         5%                          Jun-02
8       GE     B         Jun-03       -10%                Jun-03
9       GE     B         Jun-04         8%                          Jun-04                 Jun-04
10      GE     B         Jun-05        15%                          Jun-05     Jun-05

Step 2: Summarize by strategy and get the minimum trigger dates and corresponding returns.

Record  Stock  Strategy  Minimum Trigger Dates                                       Return on the Minimum Trigger Date
Number         ID        Loss -20%  Loss -5%  Profit 5%  Profit 10%  Time 80%       Loss -20%  Loss -5%  Profit 5%  Profit 10%  Time 80%
1       GE     A         Mar-03     Mar-01    Mar-02     Mar-04      Mar-04         -25%       -5%       6%         12%         12%
2       GE     B         (none)     Jun-03    Jun-02     Jun-05      Jun-04         (none)     -10%      5%         15%         8%

Step 3: Create all limiting value combinations for each strategy. Determine which rule triggers and get the return on that trigger date.

Record  Stock  Strategy  Loss   Profit  Time   Date of Trigger (min of   Rule that  Return on the
Number         ID        Limit  Limit   Limit  three trigger dates)      Triggers   Trigger Date
1       GE     A          -5%    5%     80%    Mar-01                    Loss        -5%
2       GE     A          -5%   10%     80%    Mar-01                    Loss        -5%
3       GE     A         -20%    5%     80%    Mar-02                    Profit       6%
4       GE     A         -20%   10%     80%    Mar-03                    Loss       -25%
5       GE     B          -5%    5%     80%    Jun-02                    Profit       5%
6       GE     B          -5%   10%     80%    Jun-03                    Loss       -10%
7       GE     B         -20%    5%     80%    Jun-02                    Profit       5%
8       GE     B         -20%   10%     80%    Jun-04                    Time         8%

Figure 48 (a), (b), (c). A hypothetical example that illustrates the heuristics used to create the exit grid data, shown in multiple steps: (a) the first step of the process, (b) the second step and (c) the third step. Rows are unique by the identifying columns (record number, stock, strategy ID and, in step 3, the parameter limits).

Step 4: Summarize all limiting value combinations by underlying stock. Mean returns and other statistics are computed for each grid value combination.

Record  Stock  Loss   Profit  Time   Count of    Count of Triggered Rules   Mean
Number         Limit  Limit   Limit  Strategies  Loss   Profit   Time       Return
1       GE      -5%    5%     80%    2           1      1        0           0.0%
2       GE      -5%   10%     80%    2           2      0        0          -7.5%
3       GE     -20%    5%     80%    2           0      2        0           5.5%
4       GE     -20%   10%     80%    2           1      0        1          -8.5%

Figure 48 (d). The final step of the hypothetical example that illustrates the heuristics used to create the exit grid data. Rows are unique by the identifying columns.
Finally, in step 4 (Figure 48d), the data in Figure 48c is summarized by underlying stock and the unique combinations of the limiting values. The mean return, along with the counts of triggered limits, is computed. For example, row 1 of Figure 48d represents the limiting value combination of (-5% loss limit, 5% profit limit, and 80% time limit) for GE. We have two corresponding records in Figure 48c (i.e., strategies A and B); one of them triggered due to the loss rule and the other due to the profit rule (i.e., the counts of loss and profit rule triggers are 1 each). The mean return is 0% because strategy A returned -5% and strategy B returned 5% (Figure 48c) for this limiting value combination. Other statistics such as standard deviations and quantiles were also captured but are not shown in this example.
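The following is a minimal Python sketch of steps 1 through 3 applied to the hypothetical data of Figure 48; it is illustrative only, as the thesis implemented these steps with SAS data passes and SQL joins:

    from itertools import product

    # Hypothetical daily strategy returns (%) from Figure 48a, in trading-day order.
    returns = {
        "A": [("Mar-01", -5), ("Mar-02", 6), ("Mar-03", -25), ("Mar-04", 12), ("Mar-05", 6)],
        "B": [("Jun-01", 2), ("Jun-02", 5), ("Jun-03", -10), ("Jun-04", 8), ("Jun-05", 15)],
    }
    loss_limits, profit_limits = [-5, -20], [5, 10]
    time_limit_day = 4  # 80% of a five-day life

    # Steps 1-2: a single pass per strategy records the earliest day on which each
    # individual limit triggers, together with the return on that day.
    first_trigger = {}
    for sid, days in returns.items():
        trig = {}
        for day, (date, ret) in enumerate(days, start=1):
            hits = [("loss", ll) for ll in loss_limits if ret <= ll]
            hits += [("profit", pl) for pl in profit_limits if ret >= pl]
            if day == time_limit_day:
                hits.append(("time", time_limit_day))
            for key in hits:
                trig.setdefault(key, (day, ret))      # keep only the earliest hit
        trig["last_day"] = (len(days), days[-1][1])   # fallback: last tradable day
        first_trigger[sid] = trig

    # Step 3: for each limit combination, the exit is whichever trigger came first.
    for sid, ll, pl in product(returns, loss_limits, profit_limits):
        trig = first_trigger[sid]
        keys = [("loss", ll), ("profit", pl), ("time", time_limit_day), "last_day"]
        exit_day, exit_ret = min(trig[k] for k in keys if k in trig)
        print(sid, ll, pl, "-> exits day", exit_day, "at", exit_ret, "%")

Running this reproduces the eight rows of Figure 48c (e.g., strategy A with limits -5%/5% exits on day 1 at -5%, and strategy B with limits -20%/10% exits on day 4 at 8% via the time rule).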
Table 10 shows sample final-stage outputs from the actual data. To make the table readable, data columns are represented in the table rows and observations are in the table columns. Four observations from the test dataset (Train_Flag='N') for the underlying stock 'C', with a loss limit of -90%, profit limits of 1% to 4% and no time limit, are shown.

Table 10. A sample of the exit grid with four observations is shown. Observations are shown as columns; rows represent the names of the variables in the exit grid.

Variable Name      Variable Description                          Obs. 1   Obs. 2   Obs. 3   Obs. 4
C_underlying       Underlying stock symbol                       C        C        C        C
Train_Flag         Train or test data indicator                  N        N        N        N
x_time_limit       Grid parameter limit in %                     0        0        0        0
x_loss_limit       Grid parameter limit in %                     -90      -90      -90      -90
x_profit_limit     Grid parameter limit in %                     1        2        3        4
_FREQ_             Count of entered strategies                   45       45       45       45
profit_trig        Times the profit rule triggered the exit      41       39       39       39
loss_trig          Times the loss rule triggered the exit        0        0        0        0
time_trig          Times the time rule triggered the exit        0        0        0        0
last_day_trig      Times the last-day rule triggered the exit    5        6        6        6
mean_exit_return   Mean exit return (in %)                       4.8      4.8      5.2      6.2
sd_exit_return     Std. dev. of return (%)                       18.4     18.5     18.5     18.7
min_exit_return    Minimum return (%)                            -81.5    -81.5    -81.5    -81.5
p1_exit_return     1st percentile return (%)                     -81.5    -81.5    -81.5    -81.5
p5_exit_return     5th percentile return (%)                     -8.7     -8.7     -8.7     -8.7
p10_exit_return    10th percentile return (%)                    1.1      -5.1     -5.1     -5.1
p25_exit_return    25th percentile return (%)                    2.5      3.0      3.8      4.6
p50_exit_return    50th percentile return (%)                    5.5      5.8      6.5      7.4
p75_exit_return    75th percentile return (%)                    12.6     12.6     12.6     13.9
p90_exit_return    90th percentile return (%)                    19.1     19.1     19.1     19.5
p95_exit_return    95th percentile return (%)                    26.8     26.8     26.8     26.8
p99_exit_return    99th percentile return (%)                    30.2     30.2     30.2     30.2
max_exit_return    Maximum return (in %)                         30.2     30.2     30.2     30.2

Methodology: Optimal Exit Rules through Grid Search

The grid search process searches for the maximum return value for each underlying stock in the exit grid formed as above. This is first applied to the training data by simply sorting the exit returns in descending order for each underlying stock and picking the top return (i.e., the top row) for each underlying. The corresponding loss limit, profit limit and time limit are the optimal exit rule parameters that maximize the return for the given underlying stock. These optimal limits are then applied to the test data to test the performance of the optimal exit rule for each underlying stock.
The search process described above is an unconstrained optimization, as we are searching the whole grid. A trader may wish to apply some constraints to the search process. For example, one constraint may be that the given exit rule must either break even or return a profit at least 75% of the time (i.e., the 25th percentile of the exit return is >= 0). By applying such a constraint, the trader is taking a lower risk and is willing to accept a potentially lower return. Optimal rules under such constraints are obtained by first retaining only the observations in the exit grid that satisfy the given constraint. In the above example, the filtering rule would be p25_exit_return >= 0. Then, this reduced set is searched for the optimal exit rule parameter limits using the same logic used for the unconstrained grid search; a sketch of this constrained search follows the list below. Grid search scenarios studied in this research are:
o Unconstrained
o 50th percentile of the exit return >= 0. That is, at least 50% of the entered strategies make a profit when the optimal exit rules are applied.
o 25th percentile of the exit return >= 0. That is, at least 75% of the entered strategies make a profit when the optimal exit rules are applied.
o 10th percentile of the exit return >= 0. That is, at least 90% of the entered strategies make a profit when the optimal exit rules are applied.
o 5th percentile of the exit return >= 0. That is, at least 95% of the entered strategies make a profit when the optimal exit rules are applied.
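A minimal sketch of this filter-then-sort search, using the exit grid field names from Table 10 (the surrounding code and function names are illustrative, not the thesis's SAS implementation):

    # exit_grid: list of dict rows carrying the fields shown in Table 10.
    def optimal_exit_rule(exit_grid, stock, constraint_field=None):
        rows = [r for r in exit_grid
                if r["C_underlying"] == stock and r["Train_Flag"] == "Y"]
        if constraint_field is not None:   # e.g., "p25_exit_return" for the
            rows = [r for r in rows if r[constraint_field] >= 0]  # 75% scenario
        if not rows:
            return None                    # no trading opportunity under constraint
        # "Sort descending and take the top row" == pick the maximum mean return.
        best = max(rows, key=lambda r: r["mean_exit_return"])
        return best["x_loss_limit"], best["x_profit_limit"], best["x_time_limit"]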

Methodology: Optimal Exit Rules from Response Surfaces

One disadvantage of the grid search process described above is the influence of outliers. If there is an outlier with a high return, the direct grid search would select the corresponding exit parameter limits as the optimal exit rules, which, when applied to the test data or in a real trading environment, may not perform as well as expected. To mitigate such effects, a response surface approach was carried out.
Initially, for each underlying stock, using the exit grid, we fit a regression model to predict the return as a function of the loss limit, profit limit and time limit. The strategy returns are not linear; hence, higher-order polynomials and interaction terms were used to fit a response surface for the exit return. The outliers are more likely to lie away from the response surface than on it. Then a non-linear constrained optimization is done using the model formula to get the optimal exit rule limits for each underlying stock. In this study, this constrained non-linear optimization was done using a method similar to the exhaustive grid search described above, but using the predicted returns to form the grid. Two scenarios of optimization of response surfaces were considered. They are:
o Unconstrained
o 25th percentile of the exit return >= 0. That is, at least 75% of the entered strategies make a profit when the optimal exit rules are applied.
Results from this technique are presented in the results section. The model structures and parameter estimates are not reported, as we observed that the optimal rules from this approach did not perform better than the simple grid search described in the previous section.
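Since the fitted model structures are not reported, the following is only a hedged sketch of the general idea: a quadratic surface with interaction terms fit by ordinary least squares, followed by a search over the predicted (smoothed) returns. All names are illustrative assumptions:

    import numpy as np

    def design(loss, profit, time):
        # One plausible higher-order design: quadratic terms plus interactions.
        return np.column_stack([np.ones_like(loss), loss, profit, time,
                                loss**2, profit**2, time**2,
                                loss*profit, loss*time, profit*time])

    def fit_and_search(loss, profit, time, ret, keep=None):
        beta, *_ = np.linalg.lstsq(design(loss, profit, time), ret, rcond=None)
        pred = design(loss, profit, time) @ beta       # predicted returns
        ok = keep if keep is not None else np.ones(len(ret), dtype=bool)
        best = np.argmax(np.where(ok, pred, -np.inf))  # grid search on predictions
        return loss[best], profit[best], time[best]

Here `keep` would be a boolean mask such as p25_exit_return >= 0 for the constrained scenario; searching on predictions rather than raw returns is what damps the outlier effect.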

Results
The results from the grid search are shown in Table 12. Abbreviations for the various scenarios studied here are in Table 11. The first row in Table 12 represents an unconstrained optimization through grid search for the underlying stock 'C'. The optimal exit rule has a profit limit of 55%, a loss limit of -65% and a remaining-time limit of 20%. Under this exit rule, the entered strategies for 'C' return mean profits of 23.7% and 23.6% for the training and test data respectively.
Figure 49 shows the distribution of exit returns for the various grid search scenarios for each underlying. For example, in the case of the underlying stock 'C', based on the test data, the distribution of strategy returns from the unconstrained optimization (TA[uc]) shows that roughly 30% of entered strategies result in a loss; of the remaining entered strategies, a considerable portion makes a higher profit, and hence the mean return is about 23.7%. In this chart for 'C', the curves TA[uc] and TB[p50] overlap as they both have the same rule limits.
To mitigate the risk, suppose a trader wishes to lose in only about 25% of the trades (i.e., to profit 75% of the time). In this case, as seen in curve TC[p25], the 75% of profitable strategies have only a modest profit, as the optimal exit rule profit limit is only 5% (Table 12, row 3) and the remaining exit rule time limit has increased to 30%. The average return under this lower-risk constraint is 6.3% (from the test data). If a trader wishes to fail only 10% of the time for 'C', then no such opportunity exists, since the grid search under this constraint has 0 rows to search (hence no curves are shown).
Table 11. Exit rule scenarios (or constraints) and their abbreviated labels are shown.

Exit Rule Scenarios: Constraints for Optimization   Constraint Label (Train)   Constraint Label (Test)
Unconstrained                                       A[uc]                      TA[uc]
50th Percentile Exit Return >= 0                    B[p50]                     TB[p50]
25th Percentile Exit Return >= 0                    C[p25]                     TC[p25]
10th Percentile Exit Return >= 0                    D[p10]                     TD[p10]
5th Percentile Exit Return >= 0                     E[p5]                      TE[p5]

Table 12. Results from optimization through grid search are shown. The optimal exit rule parameter limits and the strategy returns for the training and test datasets are shown for the various optimization scenarios of each underlying stock. Table 11 has the abbreviated labels for the constraints of the various scenarios.

Underlying  Constraint Label     Optimal Exit Rule Limits    Strategy Return (in %)
Stock       Train    Test        Profit   Loss    Time       Train Mean  Train Std. Dev.  Test Mean  Test Std. Dev.
C           A[uc]    TA[uc]       55%     -65%    20%        23.7        47.8             23.6       40.0
C           B[p50]   TB[p50]      55%     -65%    20%        23.7        47.8             23.6       40.0
C           C[p25]   TC[p25]       5%     -90%    30%         7.5        27.1              6.3       16.1
CSCO        A[uc]    TA[uc]       45%     -90%    15%        41.6        30.0             38.8       31.7
CSCO        B[p50]   TB[p50]      45%     -90%    15%        41.6        30.0             38.8       31.7
CSCO        C[p25]   TC[p25]      45%     -90%    15%        41.6        30.0             38.8       31.7
CSCO        D[p10]   TD[p10]      40%     -90%    15%        40.6        23.2             35.4       28.9
CSCO        E[p5]    TE[p5]       35%     -90%    10%        39.4        17.8             31.5       25.8
GE          A[uc]    TA[uc]      100%     -55%    15%        42.1        65.5             42.1       62.8
GE          B[p50]   TB[p50]     100%     -55%    15%        42.1        65.5             42.1       62.8
GE          C[p25]   TC[p25]      40%     -55%    15%        32.6        45.5             30.2       42.9
GE          D[p10]   TD[p10]       6%     -60%    15%        13.8        21.6             11.0       19.6
MO          A[uc]    TA[uc]       80%     -40%    20%        64.0        58.4             68.6       66.0
MO          B[p50]   TB[p50]      80%     -40%    20%        64.0        58.4             68.6       66.0
MO          C[p25]   TC[p25]      80%     -40%    20%        64.0        58.4             68.6       66.0
MO          D[p10]   TD[p10]      28%     -40%    30%        48.4        43.1             44.4       45.1
MO          E[p5]    TE[p5]       10%     -65%     0%        29.5        34.1             23.8       23.7
MSFT        A[uc]    TA[uc]      100%     -60%    15%        37.2        62.4             39.1       72.7
MSFT        B[p50]   TB[p50]     100%     -60%    15%        37.2        62.4             39.1       72.7
MSFT        C[p25]   TC[p25]      16%     -60%    15%        19.7        42.4             15.6       47.5
QCOM        A[uc]    TA[uc]      100%     -90%    40%        23.6        57.9             18.6       47.9
QCOM        B[p50]   TB[p50]     100%     -90%    30%        22.2        58.9             15.4       54.3
QCOM        C[p25]   TC[p25]       9%     -90%    30%         8.3        19.4              5.1       23.7
QCOM        D[p10]   TD[p10]       4%     -40%    20%         6.3        17.8             -0.2       21.1
WMT         A[uc]    TA[uc]       55%     -90%     0%        34.8        47.3             41.5       41.7
WMT         B[p50]   TB[p50]      55%     -90%     0%        34.8        47.3             41.5       41.7
WMT         C[p25]   TC[p25]      55%     -90%     0%        34.8        47.3             41.5       41.7
WMT         D[p10]   TD[p10]      20%     -90%     0%        21.6        25.5             24.0       23.7


[Figure 49 charts: panels titled "Training and Test Returns for" C, CSCO, GE, MO, MSFT, QCOM and WMT; each panel plots Exit Return (%) against Percentile for the scenario curves labeled per Table 11.]

Figure 49. Grid search exit return distributions for various exit rule scenarios are shown
for each underlying stock. Solid lines represent training data and dotted lines represent
test data. Table 11 has the abbreviations for various optimization scenarios.

If a trader has a lot of money to invest, he or she can enter a large number of strategies at the same time and can afford to take the risk of the unconstrained optimization scenario, since the mean return is fairly high. If a trader has only limited money to invest, only a few strategies can be entered at a given time. An early loss in this case would be devastating, as funds would not be available to make more trades and reach the expected mean return. A better approach may be to avoid loss as much as possible by applying, say, the 25th percentile constraint. By minimizing the risk of loss, sufficient funds are likely to be available to repeat the strategy multiple times sequentially to achieve a mean profit, although that profit is likely to be less than the profit from the unconstrained case. Trying to reduce the risk even more, say with the 10th percentile constraint, may not provide any trading opportunity at all. Hence, by analyzing the graphs in Figure 49, an appropriate level of constraint should be chosen. Kelly's criterion (Kelly's Criterion, 2007; Kelly Criterion in detail) may help in choosing the optimal fraction of the amount at hand to invest in sequential trades (or bets) so that a trader could maximize the returns from the sequential investments. This area is considered for future research.
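For reference, in its simplest form, for a repeated bet that wins with probability p and pays net odds b (a profit of b per unit staked on a win, with the stake lost otherwise), the standard Kelly fraction of capital to stake is

    f* = (bp - (1 - p)) / b = p - (1 - p) / b.

Mapping the strategy-return distributions studied here onto p and b would be part of the future research noted above.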
Looking at the various underlying stocks in Figure 49, we see that the curves are very different and the optimal mean returns for the various constraints are highly specific to a given underlying stock. For example, assuming the 25th percentile constraint, entered strategies for the stocks 'MO', 'WMT' and 'CSCO' make considerably high mean profits at 44.4%, 41.5% and 38.8% respectively (from the test dataset mean returns). Also, for many stocks, the 25th-percentile-constrained mean returns do not differ much from the unconstrained returns. However, being even more conservative either provides no trading opportunity or yields mean returns much lower than the unconstrained case. If the author were a trader, the author would have only limited funds and would prefer to enter sequential trades (i.e., a few trades at a time) rather than a large number of parallel trades. Hence the author chooses to use the optimal exit rule limits corresponding to the 25th percentile constraint. Table 13 summarizes the mean test data returns for the various stocks, arranged from left to right by the magnitude of the profit for the optimal exit rules under the 25th percentile constraint. A profit range from 5.1% (QCOM) to 44.4% (MO) could be realized under such a constrained trading scenario.
Table 13. Mean returns from the test dataset for the '25th percentile exit return >= 0' scenario are shown. The underlying stocks are arranged from left to right by the magnitude of the mean expected profit.

Scenario: 25th Percentile Exit Return >= 0
QCOM    C       MSFT     GE       CSCO     WMT      MO
5.1%    6.3%    15.6%    30.2%    38.8%    41.5%    44.4%

Table F1 in Appendix F shows more details on the counts of rule triggers (i.e., which component rule led to the exit from the strategy) and other statistics for each of the scenarios for the various underlying stocks. From this table it can be seen that, for the '25th percentile constraint' scenario, we exit the strategy most of the time through the profit limit trigger.
Table 14 and Figure 50 show the returns and distributions from the scenarios of the response surface optimizations. Considering the 25th percentile constraint cases, we see that many of the stocks have slightly lower mean returns than those obtained from the grid search. However, there are exceptions, such as MO (68% mean test return) and C (10.1% mean test return). Figure 50 also shows that the test dataset deviates more from the '25th percentile' constraint; for example, close to 30% of CSCO strategies lose (Figure 50, chart for CSCO, constraint TC[p25]). Figure 50 further shows that QCOM has large convergence issues. It can also be observed from Table 14 that the predicted returns from the regressions differed considerably from the actual returns in some cases. Due to such instabilities in the predicted values and the results from the response surface optimizations, the simple grid search optimization was preferred. Table G1 in Appendix G shows more detailed results from the response surface optimizations for the various scenarios of each underlying stock.
Table 14. Results from the response surface optimizations are shown. The optimal exit rule parameter limits and the strategy returns for the training and test datasets are shown for the various optimization scenarios of each underlying stock.

Underlying  Constraint Label  Optimal Exit Rule Limits  Train                                                              Test
Stock       Train   Test      Profit  Loss   Time       Pred. Mean  Pred. 25th Pctl.  Actual Mean  Std. Dev.  Actual 25th Pctl.  Mean   Std. Dev.  25th Pctl.
C           A[uc]   TA[uc]     55%    -70%   20%        20.9        -29.9             23.7         47.8       -29.0              23.5   40.2        -5.1
C           C[p25]  TC[p25]    16%    -90%   10%         9.9          0.1              9.8         38.1        -8.5              10.1   27.8         1.1
CSCO        A[uc]   TA[uc]     70%    -70%    0%        45.9         18.0             38.7         49.8         4.9              33.4   51.9        -7.5
CSCO        C[p25]  TC[p25]    70%    -70%    0%        45.9         18.0             38.7         49.8         4.9              33.4   51.9        -7.5
GE          A[uc]   TA[uc]     90%    -60%   10%        41.8        -37.7             38.3         65.1       -27.5              39.4   62.7         0.7
GE          C[p25]  TC[p25]    60%    -65%   50%        36.0          0.5             29.7         39.1        -4.8              32.5   43.2        -0.7
MO          A[uc]   TA[uc]     80%    -60%   20%        66.4          6.5             63.6         59.1         2.1              68.0   67.1        -5.1
MO          C[p25]  TC[p25]    80%    -60%   20%        66.4          6.5             63.6         59.1         2.1              68.0   67.1        -5.1
MSFT        A[uc]   TA[uc]    100%    -65%   20%        38.4        -20.2             33.9         64.1       -22.3              37.2   72.5       -29.2
MSFT        C[p25]  TC[p25]    12%    -70%   20%        21.4          0.5             17.0         42.6         3.4              15.2   46.0       -12.0
QCOM        A[uc]   TA[uc]    100%    -60%   40%        25.6        -16.3             23.6         57.9       -24.0              18.6   47.9       -18.6
QCOM        C[p25]  TC[p25]   100%     -4%    0%         8.2          0.4              2.8         34.0        -7.9              -5.1   11.1        -9.4
WMT         A[uc]   TA[uc]     60%    -80%    0%        35.5         20.4             32.0         50.0         9.3              39.3   47.7        13.2
WMT         C[p25]  TC[p25]    60%    -80%    0%        35.5         20.4             32.0         50.0         9.3              39.3   47.7        13.2


[Figure 50 charts: panels titled "OLS Models: Training and Test Returns for" C, CSCO, GE, MO, MSFT, QCOM and WMT; each panel plots Exit Return (%) against Percentile for the A[uc], C[p25], TA[uc] and TC[p25] curves.]

Figure 50. Exit return distributions from response surface optimization for various exit
rule scenarios are shown for each underlying stock. Solid and dotted lines represent
training and test data respectively. Table 11 has the abbreviations for various scenarios.

Summary
The goal of the exit analysis is to identify rules that describe the optimal conditions to exit the entered strategies so as to maximize the expected profit. We exit from an entered strategy when one of the following conditions is met: the loss limit, the profit limit, the limit on remaining days to expire or the last tradable day. The objective here is to find the optimal loss limit, profit limit and time limit for a given underlying stock, so as to maximize the expected profit under some constraints on the distribution of the expected return.
First, the option transactions data is prepared in a format that supports the exit analysis. This initial data essentially represents daily transactions at the strategy level for the entered strategies. A grid of strategy returns for various combinations of loss limits, profit limits and times to expire is then formed for each of the entered strategies. These values are chosen in such a way that the intervals between the values approximately represent the distribution of the exit returns in the data. Heuristic algorithms were developed to compute the whole grid by passing through the transaction data only a few times, thereby saving considerable computation time.
Once the return grid is computed, using the training data, we go through the return values in the grid to find the optimal exit limit combination that yields the maximum profit under certain constraints. This is done by sorting the exit returns in descending order for each underlying stock and picking the top return (i.e., the top row) for each underlying. These optimal limits are then applied to the test data to test the performance of the optimal exit rule for each underlying stock.

Grid search scenarios (or constraints) studied in this research are:
o Unconstrained
o At least 50% of the strategies are profitable (i.e., 50th percentile return >= 0)
o At least 75% of the strategies are profitable (i.e., 25th percentile return >= 0)
o At least 90% of the strategies are profitable (i.e., 10th percentile return >= 0)
o At least 95% of the strategies are profitable (i.e., 5th percentile return >= 0)

These optimal limits are validated using the test dataset. Another approach, based on the optimization of predicted response surfaces, was also carried out. In this approach, a regression model is first fit to predict the exit return as a function of the loss limit, profit limit and time limit. Then the optimal exit rules are obtained by an exhaustive grid search on the predicted returns computed from the model. Two scenarios for the optimization of response surfaces were considered. They are:
o Unconstrained
o At least 75% of the strategies are profitable (i.e., 25th percentile return >= 0)

The results from the response surface optimizations were not always superior to the direct grid search approach and also had some convergence issues. Hence the direct grid search is considered an acceptable approach in this study. For a trader with limited funds, the optimal exit rules that correspond to the 25th percentile constraint (i.e., 75% of entered strategies are expected to exit profitably) were chosen as the best scenario to maximize strategy returns. Profits ranging from 5.1% (QCOM) to 44.4% (MO) were observed under such a constrained trading scenario.

CONCLUSION

This study provides a framework for identifying option strategies with low risk and high profit. A delta-neutral long straddle options strategy, which can be profitable if the stock price either increases or decreases by a considerable margin, is explored in this study. Delta-neutral long straddle options strategies are formed by combining a certain number of long call options and a certain number of long put options for a given underlying stock. All call and put options have the same exercise price and the same expiration date. The counts of call and put options are chosen in such a way that the combined delta is zero. The term delta here refers to the change in the option price for a unit change in the stock price. For this study, only the call and put options with exercise prices closest to the underlying stock price on a given day in either direction (i.e., exercise price above the stock price and exercise price below the stock price) are considered to form the delta-neutral long straddle strategies. This selection criterion was chosen because option trading volumes are usually higher under such conditions, hence providing a reasonable trading opportunity to form a strategy.
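As an aside, a minimal sketch of how such integer counts could be derived from the two deltas (illustrative only; the thesis's exact procedure is not reproduced here):

    from fractions import Fraction

    def straddle_counts(call_delta, put_delta, max_contracts=100):
        # Smallest integer counts with calls*call_delta + puts*put_delta ~= 0;
        # assumes call_delta > 0 > put_delta, as for vanilla options.
        ratio = Fraction(-put_delta / call_delta).limit_denominator(max_contracts)
        return ratio.numerator, ratio.denominator   # (calls, puts)

    # e.g., a call delta of 0.55 and a put delta of -0.45 give 9 calls and
    # 11 puts, since 9 * 0.55 + 11 * (-0.45) = 0.
    print(straddle_counts(0.55, -0.45))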
The daily end-of-day options transactions in the US from the years 2002 to 2006 were initially obtained from the marketplace. To minimize the size of the analysis data, only seven underlying stocks with high transaction volumes during these five years were selected for the analysis. These underlying stocks are: Citigroup (C), Cisco Systems (CSCO), General Electric (GE), Altria Group (MO), Microsoft (MSFT), Qualcomm (QCOM), and Walmart (WMT). Though these stocks represent different industries, there is some bias towards the technology industry in this list because option transactions were in general higher for the technology industry between 2002 and 2006.
The Black-Scholes option pricing formula suggests that the primary drivers of the value of a given option are the stock price, exercise price, time to expire, stock volatility and the prevailing risk-free interest rate (say, the Federal Reserve funds rate). Various first and second derivatives (called Greeks) and the corresponding elasticity measures are also important factors in understanding the sensitivity of option prices. These measures were computed for each option using the closed-form solutions offered by the Black-Scholes formula. The interest rates were taken to be the Federal Reserve's target fund rate, which is readily available on the web. The implied volatility of a stock on a given day for a call or a put option was determined by back-solving the Black-Scholes formula, as all other variables in the formula are known on a given day. This process of determining implied volatilities is essentially a large-scale constrained non-linear optimization problem, which was solved using the SAS procedure PROC NLIN along with the Newton-Raphson method. This SAS procedure was used because of its speed and its ability to handle large-scale data processing.
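Purely as an illustration of the same Newton-Raphson idea for a single call option (the thesis used SAS PROC NLIN; the names and tolerances below are assumptions):

    from math import log, sqrt, exp, pi
    from statistics import NormalDist

    N = NormalDist().cdf  # standard normal CDF

    def bs_call(S, K, T, r, sigma):
        # Black-Scholes price of a European call.
        d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
        d2 = d1 - sigma * sqrt(T)
        return S * N(d1) - K * exp(-r * T) * N(d2)

    def bs_vega(S, K, T, r, sigma):
        # Sensitivity of the call price to volatility.
        d1 = (log(S / K) + (r + 0.5 * sigma**2) * T) / (sigma * sqrt(T))
        return S * exp(-0.5 * d1**2) / sqrt(2 * pi) * sqrt(T)

    def implied_vol(price, S, K, T, r, sigma0=0.3, tol=1e-8, max_iter=50):
        sigma = sigma0
        for _ in range(max_iter):
            diff = bs_call(S, K, T, r, sigma) - price
            if abs(diff) < tol:
                break
            sigma -= diff / bs_vega(S, K, T, r, sigma)  # Newton-Raphson step
        return sigma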
After collecting and deriving the necessary variables that influence option prices, call and put options were combined to form delta-neutral long straddle strategies. These strategies were then followed from the start of the strategy to its expiration date to compute their risks. Transaction costs were included in the strategy prices. The strategy risk (in %) here is defined as the fraction of tradable days with a negative return (i.e., a loss). Strategies were classified as low-risk if their corresponding strategy risk is less than or equal to 50%. It was observed that only about 6.9% of the identified delta-neutral long straddle strategies were low-risk strategies. This observation is by itself surprising, as only a small portion of the strategies turn out to be low-risk. Initial analysis of the likely drivers of strategy risk showed that each underlying stock had unique characteristics; hence, variable standardization and transformations were carried out at the underlying stock level and then combined to form the analytical dataset.
During the period between 2002 and 2006, about 44K delta-neutral long straddle strategies were identified for the selected underlying stocks. These were split into 70% for training and 30% for testing. The goal of the entry analysis is to identify the low-risk delta-neutral long straddle strategies that could be executed or entered. Predictive models were built and validated using SAS JMP software. Both the risk category (a categorical outcome) and the strategy risk in percentage (a continuous outcome) were used as dependent variables. The modeling methods studied were linear regression, logistic regression, neural networks and decision trees. The studied predictive models provide reasonable coverage, from simple methods to highly flexible fitting models. Models were built using different sets of variables (e.g., all standardized variables, standardized and transformed variables, select standardized variables, factor variables only, and all variables plus underlying stocks). In some cases, separate models for each underlying stock were built to study the effects on predictive accuracy. The top 2% of the strategies predicted to have the lowest risk by each predictive model were identified as low-risk strategies. Classification accuracy (the true-positive percentage) based on the test dataset was used to select the best predictive model. A set of neural network models, one for each underlying stock, that predicts the low-risk category using all standardized variables as independent variables had the highest test accuracy of about 58% true positives. This model was selected as the best predictive model.
The top 2% of the predicted low-risk strategies from the best predictive model were identified as the best low-risk strategies to enter. This accounts for about 868 low-risk entry strategies in the entire dataset. Within the study period of five years, on average, this translates to about three enterable low-risk strategies per week, thus giving ample opportunities for identifying enterable low-risk strategies. If we were to choose these strategies randomly, only 6.9% of them would be actual low-risk strategies. Thus, we have improved the identification of the low-risk entry strategies with an overall lift of around eight (= 58% / 6.9%). This is a noticeable improvement in predictive power.
The goal of the exit analysis is to identify rules that describe the optimal conditions to exit the entered strategies so as to maximize the expected profit under various constraints on the distribution of the expected return.
We exit from an entered strategy when one of the following conditions is met: the loss limit, the profit limit, the limit on remaining days to expire or the last tradable day. A grid of strategy returns for various combinations of loss limits, profit limits and times to expire is formed for each of the entered strategies. These values are chosen in such a way that the intervals between the values approximately represent the distribution of the exit returns in the data. Heuristic algorithms were developed to compute the whole grid by passing through the transaction data only a few times, thereby saving considerable computation time.
After computing the return grid, using the training data, a grid search was carried out to find the optimal exit rule limits that yield the maximum profit under certain constraints. These optimal limits are then applied to the test data to test the performance of the optimal exit rules for each underlying stock. Grid search scenarios (or constraints) studied in this research are:
o Unconstrained grid search
o At least 50% of the strategies are profitable (i.e., 50th percentile return >= 0)
o At least 75% of the strategies are profitable (i.e., 25th percentile return >= 0)
o At least 90% of the strategies are profitable (i.e., 10th percentile return >= 0)
o At least 95% of the strategies are profitable (i.e., 5th percentile return >= 0)

Another approach, based on the optimization of predicted response surfaces, was also carried out. In this approach, a regression model is first fit to predict the exit return as a function of the loss limit, profit limit and time limit. Then the optimal exit rules are obtained by an exhaustive grid search on the predicted returns computed from the model. Two scenarios for the optimization of response surfaces were considered. They are:
o Unconstrained grid search, and
o At least 75% of the strategies are profitable (i.e., 25th percentile return >= 0)

The results from the response surface optimizations were not always superior to the direct grid search approach and also had some convergence issues. Hence the direct grid search is considered an acceptable approach in this study. For a trader with limited funds, the optimal exit rules that correspond to the constraint that 75% of entered strategies are expected to exit profitably were chosen as the best scenario to maximize strategy returns. Under such a trading scenario, mean profits from 5.1% (QCOM) to 44.4% (MO) were observed.

Limitations
The testing of the model accuracies for predicting the entry strategies and the exit rules was based on data within the study period. A more elaborate test using future data (say, years 2007 and beyond) is required to further validate the model lift in identifying the entry strategies as well as the resulting profit from the identified optimal exit rules. Only a limited set of underlying stocks was considered in the analysis, and hence the results may not generalize to other stocks. This is particularly true since, even within the seven stocks considered in the study, we found that the predictability of risk, the optimal exit rules and the data patterns differ.

Future Work
This work can be extended in several different ways. Some of the important next steps are listed below.
o Testing of model performance on unseen or future data; frequent model calibrations may be needed.
o Combinatorial optimization techniques other than grid search may be considered for finding the optimal exit rules.
o Extension of the analysis to more stocks, particularly to the highly traded index funds, may be helpful.
o Only a particular form of delta-neutral long straddle strategy was considered in this study. There are numerous other option strategies, starting from a solo long call or long put option transaction. More types of strategies should be studied to further validate the methodologies used here.
o Studies that begin to assemble an optimal option trading strategy, along with the identification of optimal entry and exit points under the given market conditions, would be extremely valuable. This may be a hard problem to solve.
o Study the application of Kelly's criterion (Kelly's Criterion, 2007; Kelly Criterion in detail), which may help in choosing the optimal fraction of the amount at hand to invest in sequential trades (or bets) so that a trader could maximize the returns from sequential investments.

REFERENCES

Annual Market Statistics. (2010). Retrieved October 1, 2011, from


http://www.cboe.com/Data/AnnualMarketStatistics.aspx
Berry, Michael J.A., & Linoff, Gordon (1997). Data Mining Techniques for Marketing,
Sales and Customer Support. John Wiley & Sons, Inc.
Bhattacharya, Mihir (1983). Transaction Data Tests on the Efficiency of the Chicago
Board Options Exchange. Journal of Financial Economics. 12:2, 161-185.
Bishop, Christopher M. (2000), Neural Networks for Pattern Recognition. New York,
NY: Oxford University Press.
Black, Fischer & Scholes, Myron (1972). The Valuation of Option Contracts and a Test
of Market Efficiency. Journal of Finance, 27:2, 399-417.
Black, Fischer & Scholes, Myron (1973). The Pricing of Options and Corporate
Liabilities. Journal of Political Economy, May 1973, 637-659.
Coval, Joshua D. & Shumway, Tyler (2001). Expected Option Returns. The Journal of
Finance, LVI:3, pp. 983-1009.
Delwiche, Lora D. & Slaughter, Susan J. (2003). The Little SAS Book. Cary, NC: SAS
Institute Inc.
Galai, Dan (1977). Tests of Market Efficiency of the Chicago Board Options Exchange.
Journal of Business, 50:2, 167-197.
Galai, Dan (1978). Empirical Tests of Boundary Conditions for CBOE Options. Journal
of Financial Economics, 6:2/3, 182-211.

Greeks (Finance). (2003). Retrieved September 15, 2012, from
http://en.wikipedia.org/wiki/Greeks_(finance)
Hand, D., Mannila, H., & Smyth P. (2001). Principles of Data Mining. Cambridge, MA:
The MIT Press.
Historical Options Data. (2002). Retrieved September 15, 2012, from
http://www.historicaloptiondata.com
Kolb, Robert W. (1991). Options. The Investor's Complete Toolkit. Simon and Schuster,
Inc.
Kelly's Criterion. (2007). Retrieved December 4, 2012, from
http://en.wikipedia.org/wiki/Kelly_criterion
Kelly Criterion in detail. Retrieved December 4, 2012, from
http://www.elem.com/~btilly/kelly-criterion/
Larose, D. (2004). Discovering Knowledge in Data: An Introduction to Data Mining.
Hoboken, NJ: John Wiley & Sons, Inc.
Larose, D. (2006). Data Mining Methods and Models. Hoboken, NJ: John Wiley & Sons,
Inc.
MacBeth, James D. & Merville, Larry J. (1979). An Empirical Examination of the Black-Scholes Call Option Pricing Model. Journal of Finance, 34:5, 1173-1186.
Rubinstein, Mark (1985). Nonparametric Tests of Alternative Option Pricing Models
Using All Reported Trades and Quotes on the 30 Most Active CBOE Option
Classes from August 23, 1976 Through August 31, 1978. Journal of Finance,
40:2, 455-480.

SAS Institute Inc. (2010a). JMP 9 Basic Analysis and Graphing. Cary, NC: SAS
Institute Inc.
SAS Institute Inc. (2010b). JMP 9 Discovering JMP. Cary, NC: SAS Institute Inc.
SAS Institute Inc. (2010c). JMP 9 Modeling and Multivariate Methods. Cary, NC: SAS
Institute Inc.
SPSS Inc. (2001). Clementine 6.0 User's Guide. Chicago, IL: SPSS Inc.
Target Federal Funds and Discount Rates. (2010). Retrieved September 1, 2012, from
http://www.newyorkfed.org/markets/statistics/dlyrates/fedrate.html
Witten, Ian H., & Frank, Eibe (2000). Data Mining Practical Machine Learning Tools
and Techniques with JAVA Implementations. San Francisco, CA: Morgan
Kaufmann Publishers.

APPENDICES
APPENDIX A: VARIABLES IN ENTRY ANALYSIS DATASET

Table A1 contains the variables in the entry analysis dataset corresponding to the delta-neutral long straddle strategies. The key Std: in the variable names column refers to the standardized form of the variable. Similarly, ST: refers to the transformed and standardized form of the variable. All transformations are done using the logarithmic function with a base of 10 and an appropriate scale. These scales are listed in the Description column along with the transformation formula. The standardization is done using the formula z = ((x - mean) / std.dev) * 20 + 100 within each of the underlying stocks. Here, z is the standardized form of the variable x, and mean and std.dev refer to the mean and standard deviation of the variable x for a given underlying stock. All standardized variables have a mean of 100 and a standard deviation of 20.
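As an illustration, the standardization is a one-line computation (a sketch, not the thesis's SAS code):

    def standardize(x, mean, std_dev):
        # z has mean 100 and standard deviation 20 within each underlying stock
        return (x - mean) / std_dev * 20 + 100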
Table A1. Variables in the entry analysis dataset are listed.

Variable Name                                                        Description
ls_strategy_id                                                       Unique ID for the delta-neutral long straddle strategy. Formed by concatenating the underlying stock symbol, expiration date, exercise price and entry date. Example: C_15APR2005_45.00_16MAR2005
C_underlying                                                         Underlying stock symbol.
year_qtr                                                             Year and quarter of the entry date. Example: 2005_Q1
year_mon                                                             Year and month of the entry date. Example: 2005_03
ls_dn_calls                                                          Number of call options in the strategy.
ls_dn_puts                                                           Number of put options in the strategy.
ls_dn_tot_price                                                      Total price of the strategy. This includes transaction costs for both entry and exit of the strategy.
ls_dn_risk_0                                                         Strategy risk in %. Represents the % of tradable days with a loss (i.e., days that do not break even or turn a profit).
ls_dn_rflag_0                                                        Strategy risk flag. Equals 1 when the risk is less than or equal to 50%; otherwise 0. Low-risk strategies have a value of 1.
SE_diff_percent (Std: z_SE_diff_percent)                             Difference between the stock price and exercise price on the entry day. Expressed as % of the stock price.
C_Pr_by_Stk_Pr_Percent (Std: z_C_Pr_by_Stk_Pr_Percent;               Call price as a percentage of the stock price on the entry day.
  ST: z_log10_C_Pr_by_Stk_Pr_Percent)
C_in_money_flg                                                       Equals 1 if the stock price is higher than or equal to the exercise price on the entry day; otherwise 0.
C_int_rate (Std: z_C_int_rate; ST: z_log10_C_int_rate)               Interest rate on the entry day. Equal to the Fed target fund rate.
C_time_to_expire (Std: z_C_time_to_expire;                           Time to expire represented in years; 365 days were used as the number of days in the year.
  ST: z_log10_C_days_to_expire)
C_imp_volatility (Std: z_C_imp_volatility;                           Call implied volatility. Transformation formula: log10(C_imp_volatility*100)
  ST: z_log10_C_imp_volatility)
C_Delta (Std: z_C_Delta)                                             Call Delta.
C_Vega (Std: z_C_Vega; ST: z_log10_C_Vega)                           Call Vega. Transformation formula: log10(C_Vega*100)
C_Theta (Std: z_C_Theta; ST: z_log10_C_Theta)                        Call Theta. Transformation formula: log10(C_Theta*-100)
C_Gamma (Std: z_C_Gamma; ST: z_log10_C_Gamma)                        Call Gamma. Transformation formula: log10(C_Gamma*100)
C_EL_Lambda (Std: z_C_EL_Lambda; ST: z_log10_C_EL_Lambda)            Call elasticity of Delta (Call Lambda). Transformation formula: log10(C_EL_Lambda*100)
C_EL_Vega (Std: z_C_EL_Vega; ST: z_log10_C_EL_Vega)                  Call elasticity of Vega. Transformation formula: log10(C_EL_Vega*100)
C_EL_Theta (Std: z_C_EL_Theta; ST: z_log10_C_EL_Theta)               Call elasticity of Theta. Transformation formula: log10(C_EL_Theta*-100)
C_EL_Gamma (Std: z_C_EL_Gamma; ST: z_log10_C_EL_Gamma)               Call elasticity of Gamma. Transformation formula: log10(C_EL_Gamma*100)
P_Pr_by_Stk_Pr_Percent (Std: z_P_Pr_by_Stk_Pr_Percent;               Put price as a percentage of the stock price on the entry day.
  ST: z_log10_P_Pr_by_Stk_Pr_Percent)
P_in_money_flg                                                       Equals 1 if the stock price is lower than or equal to the exercise price on the entry day; otherwise 0.
P_imp_volatility (Std: z_P_imp_volatility;                           Put implied volatility. Transformation formula: log10(P_imp_volatility*100)
  ST: z_log10_P_imp_volatility)
P_Delta (Std: z_P_Delta)                                             Put Delta.
P_Vega (Std: z_P_Vega; ST: z_log10_P_Vega)                           Put Vega. Transformation formula: log10(P_Vega*100)
P_Theta (Std: z_P_Theta; ST: z_log10_P_Theta)                        Put Theta. Transformation formula: log10(P_Theta*-100)
P_Gamma (Std: z_P_Gamma; ST: z_log10_P_Gamma)                        Put Gamma. Transformation formula: log10(P_Gamma*100)
P_EL_Lambda (Std: z_P_EL_Lambda; ST: z_log10_P_EL_Lambda)            Put elasticity of Delta (Put Lambda). Transformation formula: log10(P_EL_Lambda*-100)
P_EL_Vega (Std: z_P_EL_Vega; ST: z_log10_P_EL_Vega)                  Put elasticity of Vega. Transformation formula: log10(P_EL_Vega*100)
P_EL_Theta (Std: z_P_EL_Theta; ST: z_log10_P_EL_Theta)               Put elasticity of Theta. Transformation formula: log10(P_EL_Theta*-100)
P_EL_Gamma (Std: z_P_EL_Gamma; ST: z_log10_P_EL_Gamma)               Put elasticity of Gamma. Transformation formula: log10(P_EL_Gamma*-100)
ls_dn_pr_by_stk_pr_percent (Std: z_ls_dn_pr_by_stk_pr_percent;       Strategy price as a percentage of the stock price on the entry day.
  ST: z_log10_ls_dn_pr_by_stk_pr_percent)
obs_freq                                                             Weight used to boost the low-risk strategies. The weight is 14 when ls_dn_rflag_0 is 1 (i.e., low-risk); otherwise the weight is 1.
Factor1 - Factor7                                                    Factors 1 through 7 from principal component analysis with varimax rotation.
tr_te_rand                                                           A random number between 0 and 1 used to determine whether the strategy belongs to the training or the test dataset.
Train_Flag                                                           Equals "Y" for the training dataset and "N" for the test dataset. 70% of the data is set aside as the training dataset and the remaining 30% of the strategies form the test dataset. The field tr_te_rand is used to determine the dataset type.

APPENDIX B: ANALYSIS VARIABLE STATISTICS
Table B1. Means and standard deviations of the initial analysis variables are shown.

Means:
Variable                     Overall    C        CSCO     GE       MO       MSFT     QCOM     WMT
Count                        44,627     6,259    7,073    7,705    5,512    7,311    5,248    5,519
ls_dn_risk_0 (%)             84.1       84.7     82.6     86.4     83.8     86.3     80.7     82.6
ls_dn_rflag_0                6.86%      5.93%    8.89%    5.05%    6.51%    5.46%    9.15%    7.88%
SE_Diff_percent              6.10%      21.92%   -15.53%  7.83%    43.22%   1.94%    3.07%    -15.21%
C_Pr_by_Stk_Pr_percent (%)   7.44       5.72     10.62    6.68     6.48     7.98     8.76     5.35
C_int_rate                   2.21%      2.25%    2.09%    2.13%    2.37%    2.08%    2.27%    2.40%
C_time_to_expire             0.45       0.42     0.51     0.51     0.43     0.53     0.35     0.36
C_imp_volatility             0.27       0.21     0.37     0.22     0.24     0.26     0.37     0.21
C_Delta                      0.55       0.56     0.56     0.55     0.56     0.55     0.56     0.54
C_Vega                       8.42       9.70     4.30     7.34     11.93    7.68     9.17     10.53
C_Theta                      -3.99      -3.87    -2.34    -2.57    -5.79    -3.25    -6.82    -4.69
C_Gamma                      0.09       0.09     0.11     0.10     0.06     0.10     0.06     0.07
C_EL_Lambda                  11.27      14.26    7.88     12.87    11.12    10.89    8.20     13.61
C_EL_Vega                    1.03       0.98     1.08     1.04     0.96     1.05     0.98     1.05
C_EL_Theta                   -0.58      -0.58    -0.59    -0.60    -0.55    -0.59    -0.53    -0.60
C_EL_Gamma                   6.53       8.29     4.47     7.56     6.32     6.32     4.55     8.08
P_Pr_by_Stk_Pr_percent       7.16       5.69     9.75     6.67     7.21     7.30     8.16     5.00
P_imp_volatility             0.29       0.25     0.37     0.26     0.31     0.27     0.38     0.23
P_Delta                      -0.45      -0.44    -0.44    -0.45    -0.44    -0.45    -0.44    -0.46
P_Vega                       8.55       9.96     4.30     7.52     12.24    7.75     9.17     10.63
P_Theta                      -3.52      -3.54    -1.96    -2.25    -5.63    -2.75    -5.99    -3.82
P_Gamma                      0.08       0.07     0.11     0.09     0.05     0.10     0.05     0.07
P_EL_Lambda                  -9.49      -11.30   -6.93    -10.52   -8.52    -9.67    -7.30    -12.08
P_EL_Vega                    1.15       1.15     1.15     1.17     1.14     1.18     1.09     1.17
P_EL_Theta                   -0.51      -0.49    -0.52    -0.50    -0.51    -0.51    -0.50    -0.50
P_EL_Gamma                   -6.86      -8.07    -5.15    -7.55    -6.23    -7.02    -5.37    -8.52
ls_dn_pr_by_stk_pr (%)       111        90       148      99       108      114      137      75

Standard Deviations:
Variable                     Overall    C        CSCO     GE       MO       MSFT     QCOM     WMT
ls_dn_risk_0 (%)             18.1       17.7     19.6     16.7     17.2     17.4     19.1     18.6
ls_dn_rflag_0                25.28%     23.62%   28.47%   21.90%   24.68%   22.72%   28.83%   26.95%
SE_Diff_percent              571%       420%     824%     554%     508%     555%     515%     461%
C_Pr_by_Stk_Pr_percent (%)   5.61       4.17     7.63     5.00     3.82     5.96     5.32     3.45
C_int_rate                   1.36%      1.38%    1.28%    1.33%    1.44%    1.30%    1.38%    1.43%
C_time_to_expire             0.46       0.40     0.51     0.50     0.44     0.54     0.35     0.33
C_imp_volatility             0.12       0.10     0.13     0.09     0.07     0.10     0.12     0.06
C_Delta                      0.16       0.15     0.17     0.17     0.14     0.17     0.11     0.17
C_Vega                       5.15       4.73     2.44     3.68     5.57     5.31     4.98     5.20
C_Theta                      2.65       1.81     1.03     1.25     3.13     2.76     2.81     1.93
C_Gamma                      0.05       0.04     0.05     0.05     0.02     0.06     0.03     0.03
C_EL_Lambda                  7.19       8.45     4.87     8.61     5.54     7.17     3.80     6.94
C_EL_Vega                    0.57       0.51     0.64     0.63     0.49     0.63     0.34     0.60
C_EL_Theta                   0.30       0.28     0.34     0.34     0.25     0.34     0.18     0.31
C_EL_Gamma                   5.49       6.38     3.84     6.75     4.27     5.63     2.70     5.57
P_Pr_by_Stk_Pr_percent       5.34       3.99     6.91     4.92     5.04     5.26     5.37     3.31
P_imp_volatility             0.11       0.09     0.13     0.09     0.08     0.09     0.12     0.06
P_Delta                      0.15       0.12     0.17     0.15     0.12     0.16     0.11     0.16
P_Vega                       5.15       4.68     2.44     3.63     5.55     5.27     4.98     5.16
P_Theta                      2.58       1.93     1.09     1.34     2.61     2.69     2.93     2.14
P_Gamma                      0.05       0.03     0.05     0.05     0.02     0.06     0.03     0.03
P_EL_Lambda                  6.13       6.72     4.51     7.20     4.83     6.63     3.68     5.85
P_EL_Vega                    0.55       0.44     0.65     0.56     0.43     0.66     0.35     0.58
P_EL_Theta                   0.26       0.21     0.31     0.27     0.21     0.31     0.17     0.29
P_EL_Gamma                   4.66       4.96     3.56     5.52     3.55     5.24     2.63     4.66
ls_dn_pr_by_stk_pr (%)       80         61       105      72       63       87       81       46

APPENDIX C: CORRELATION ANALYSIS

Table C1 shows highly correlated pairs of variables. The last column indicates the possible proxy variable that could replace the variable in the first column; it is populated only for variable pairs with very high correlation.
Table C1. Correlation between pairs of highly correlated variables (|correlation| >= 0.7) in the entry analysis dataset is shown.

Variable 1                   Variable 2                   Correlation   Possible Replacement Variable
ls_dn_risk_0                 ls_dn_rflag_0                -0.70
P_in_money_flg               C_in_money_flg               -0.99         C_in_money_flg
z_SE_Diff_percent            C_in_money_flg                0.79
z_SE_Diff_percent            P_in_money_flg               -0.79
z_C_time_to_expire           z_C_Pr_by_Stk_Pr_percent      0.70
z_C_Delta                    C_in_money_flg                0.79
z_C_Delta                    P_in_money_flg               -0.79
z_C_Delta                    z_SE_Diff_percent             0.84
z_C_Vega                     ls_dn_tot_price               0.78
z_C_Vega                     z_C_time_to_expire            0.84
z_C_EL_Lambda                z_C_Pr_by_Stk_Pr_percent     -0.75
z_C_EL_Lambda                z_C_Gamma                     0.79
z_C_EL_Vega                  C_in_money_flg               -0.74
z_C_EL_Vega                  P_in_money_flg                0.74
z_C_EL_Vega                  z_SE_Diff_percent            -0.80
z_C_EL_Vega                  z_C_Delta                    -0.96         z_C_Delta
z_C_EL_Theta                 C_in_money_flg                0.74
z_C_EL_Theta                 P_in_money_flg               -0.74
z_C_EL_Theta                 z_SE_Diff_percent             0.80
z_C_EL_Theta                 z_C_Delta                     0.95         z_C_Delta
z_C_EL_Theta                 z_C_EL_Vega                  -0.98
z_C_EL_Gamma                 z_C_Pr_by_Stk_Pr_percent     -0.72
z_C_EL_Gamma                 z_C_EL_Lambda                 0.97         z_C_EL_Lambda
z_C_EL_Gamma                 z_C_EL_Vega                   0.73
z_C_EL_Gamma                 z_C_EL_Theta                 -0.74
z_P_Pr_by_Stk_Pr_percent     ls_dn_tot_price               0.70
z_P_imp_volatility           z_C_imp_volatility            0.91         z_C_imp_volatility
z_P_Delta                    C_in_money_flg                0.79
z_P_Delta                    P_in_money_flg               -0.79
z_P_Delta                    z_SE_Diff_percent             0.84
z_P_Delta                    z_C_Delta                     0.99         z_C_Delta
z_P_Delta                    z_C_EL_Vega                  -0.95
z_P_Delta                    z_C_EL_Theta                  0.94
z_P_Delta                    z_C_EL_Gamma                 -0.70
z_P_Vega                     ls_dn_tot_price               0.78
z_P_Vega                     z_C_time_to_expire            0.85
z_P_Vega                     z_C_Vega                      1.00         z_C_Vega
z_P_Theta                    z_C_Theta                     0.91         z_C_Theta
z_P_Gamma                    z_C_Gamma                     0.95         z_C_Gamma
z_P_Gamma                    z_C_EL_Lambda                 0.77
z_P_EL_Lambda                z_P_Pr_by_Stk_Pr_percent      0.74
z_P_EL_Lambda                z_P_Gamma                    -0.76
z_P_EL_Vega                  C_in_money_flg                0.75
z_P_EL_Vega                  P_in_money_flg               -0.75
z_P_EL_Vega                  z_SE_Diff_percent             0.80
z_P_EL_Vega                  z_C_Delta                     0.93         z_C_Delta
z_P_EL_Vega                  z_C_EL_Vega                  -0.86
z_P_EL_Vega                  z_C_EL_Theta                  0.84
z_P_EL_Vega                  z_P_Delta                     0.93
z_P_EL_Theta                 C_in_money_flg               -0.75
z_P_EL_Theta                 P_in_money_flg                0.75
z_P_EL_Theta                 z_SE_Diff_percent            -0.81
z_P_EL_Theta                 z_C_Delta                    -0.90         z_C_Delta
z_P_EL_Theta                 z_C_EL_Vega                   0.83
z_P_EL_Theta                 z_C_EL_Theta                 -0.85
z_P_EL_Theta                 z_P_Delta                    -0.91
z_P_EL_Theta                 z_P_EL_Vega                  -0.98
z_P_EL_Gamma                 z_P_Pr_by_Stk_Pr_percent      0.73
z_P_EL_Gamma                 z_P_EL_Lambda                 0.97         z_P_EL_Lambda
z_ls_dn_pr_by_stk_pr         ls_dn_tot_price               0.81
z_ls_dn_pr_by_stk_pr         z_C_time_to_expire            0.80
z_ls_dn_pr_by_stk_pr         z_C_Vega                      0.71
z_ls_dn_pr_by_stk_pr         z_P_Pr_by_Stk_Pr_percent      0.87
z_ls_dn_pr_by_stk_pr         z_P_Vega                      0.70
z_ls_dn_pr_by_stk_pr         z_P_EL_Lambda                 0.74

APPENDIX D: FACTOR LOADINGS

Table D1 shows the rotated factor loadings for the top seven factors, which represent about 95% of the variation in the data. Varimax factor rotation with Principal Components as the factoring method and Principal Components (diagonals = 1) as the prior communality were specified as the configuration parameters in the SAS JMP Multivariate analysis module to obtain these rotated factor loadings.
Table D1. Factor loadings for the top seven factors in the analysis dataset are listed.

Variable                      Factor 1  Factor 2  Factor 3  Factor 4  Factor 5  Factor 6  Factor 7
z_SE_Diff_percent              0.859    -0.078    -0.062    -0.008    -0.015    -0.027     0.470
z_C_int_rate                  -0.007    -0.134     0.218    -0.232     0.093     0.931     0.007
z_C_time_to_expire             0.066     0.822    -0.189     0.260     0.383    -0.085     0.070
z_C_imp_volatility             0.066    -0.005    -0.276     0.807    -0.428    -0.140    -0.007
z_C_Delta                      0.986     0.057    -0.054     0.033     0.037     0.044     0.009
z_C_Vega                      -0.019     0.919    -0.334    -0.036     0.142    -0.041    -0.035
z_C_Theta                     -0.023     0.252    -0.113    -0.123     0.935    -0.064     0.022
z_C_Gamma                     -0.203    -0.407     0.780    -0.253    -0.042     0.097     0.009
z_C_EL_Lambda                 -0.502    -0.371     0.590    -0.351    -0.049     0.151     0.277
z_C_EL_Vega                   -0.962    -0.100     0.030     0.015     0.043    -0.022     0.103
z_C_EL_Theta                   0.958     0.068    -0.030     0.055    -0.106    -0.120    -0.099
z_C_EL_Gamma                  -0.664    -0.312     0.483    -0.278    -0.036     0.115     0.330
z_P_imp_volatility             0.060     0.019    -0.323     0.842    -0.336    -0.149     0.036
z_P_Delta                      0.984     0.076    -0.053     0.055     0.036     0.048     0.013
z_P_Vega                      -0.006     0.914    -0.344    -0.045     0.161    -0.032    -0.018
z_P_Theta                     -0.007     0.274    -0.064    -0.187     0.903     0.165    -0.033
z_P_Gamma                     -0.154    -0.442     0.799    -0.246    -0.077     0.090    -0.024
z_P_EL_Lambda                 -0.328     0.403    -0.681     0.394     0.122    -0.179     0.046
z_P_EL_Vega                    0.952    -0.143     0.028    -0.087     0.035     0.093     0.029
z_P_EL_Theta                  -0.941     0.188    -0.033     0.014     0.037     0.070    -0.042
z_P_EL_Gamma                  -0.524     0.354    -0.596     0.350     0.110    -0.174     0.107
z_ls_dn_pr_by_stk_pr          -0.050     0.663    -0.271     0.613     0.167    -0.115    -0.187
z_C_Pr_by_Stk_Pr_percent       0.532     0.457    -0.304     0.524     0.189    -0.065     0.279
z_P_Pr_by_Stk_Pr_percent      -0.425     0.534    -0.237     0.566     0.207    -0.113    -0.221

APPENDIX E: CHARACTERISTICS OF ENTERED LOW-RISK STRATEGIES

Distributions of the entered low-risk strategies for each underlying stock are
shown below. The charts on the left show the actual risk categories (ls_dn_rflag_0),
with a value of one indicating low risk. The charts on the right show the distribution of
the actual strategy risk in % (ls_dn_risk_0), a continuous response variable. The
quantiles, the moments, and in particular the interquartile range of the strategy risk
variable help to gauge the magnitude of the expected risk; a numerical summary is
sketched after Figure E1. The underlying stocks represented here are C, CSCO, GE, MO,
MSFT, QCOM, and WMT.


Figure E1. Distribution of the actual risk category (ls_dn_rflag_0) and the actual strategy
risks in % (ls_dn_risk_0) for each underlying stock within the entered low-risk strategies.
The risk category value of 1 indicates actual low-risk strategies.
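
The quantile summaries shown in Figure E1 can also be tabulated directly. The sketch
below assumes the entered strategies are held in a pandas DataFrame named entered with
an underlying-ticker column named underlying (both names are illustrative);
ls_dn_rflag_0 and ls_dn_risk_0 are the variables plotted in the figure.

```python
import pandas as pd

# entered: DataFrame of entered low-risk strategies (assumed name).
summary = entered.groupby("underlying").agg(
    n=("ls_dn_risk_0", "size"),
    actual_low_risk=("ls_dn_rflag_0", "sum"),           # count of strategies with flag = 1
    q25=("ls_dn_risk_0", lambda s: s.quantile(0.25)),
    median=("ls_dn_risk_0", "median"),
    q75=("ls_dn_risk_0", lambda s: s.quantile(0.75)),
)
summary["iqr"] = summary["q75"] - summary["q25"]        # interquartile range of strategy risk (%)
print(summary)
```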

APPENDIX F: OPTIMAL EXIT RULES FROM GRID SEARCH
Table F1. Results from the grid search are shown. For each underlying stock and
optimization scenario, the table gives the optimal exit rule parameter limits, the total
entered strategies, the counts of rule triggers (i.e., which component rule led to the exit
from the strategy), and the strategy returns for the training and test datasets.
Constraint labels: A[uc] = unconstrained; B[p50], C[p25], D[p10], and E[p5] = the 50th,
25th, 10th, and 5th percentiles of exit returns, respectively, constrained to be >= 0. A
leading T (TA[uc], TB[p50], and so on) denotes the same optimal rule evaluated on the
TEST dataset; columns without the prefix are TRAIN results. Where a constrained column
repeats the unconstrained column, the percentile constraint was already satisfied at the
unconstrained optimum.

Citigroup (C)
ConstraintLabel        A[uc]   B[p50]  C[p25]  TA[uc]  TB[p50]  TC[p25]
TimeLimit              20%     20%     30%     20%     20%      30%
LossLimit              65      65      90      65      65       90
ProfitLimit            55      55      5       55      55       5
EnteredStrategies      72      72      72      45      45       45
ProfitTriggers         35      35      55      20      20       36
LossTriggers           1       1       0       2       2        0
TimeTriggers           36      36      17      23      23       9
LastdayTriggers        2       2       0       5       5        0
MeanExitReturn(%)      23.7    23.7    7.5     23.6    23.6     6.3
Std.Dev.ofReturns(%)   47.8    47.8    27.1    40.0    40.0     16.1

Cisco Systems (CSCO)
ConstraintLabel        A[uc]   B[p50]  C[p25]  D[p10]  E[p5]   TA[uc]  TB[p50]  TC[p25]  TD[p10]  TE[p5]
TimeLimit              15%     15%     15%     15%     10%     15%     15%      15%      15%      10%
LossLimit              90      90      90      90      90      90      90       90       90       90
ProfitLimit            45      45      45      40      35      45      45       45       40       35
EnteredStrategies      99      99      99      99      99      47      47       47       47       47
ProfitTriggers         71      71      71      78      84      34      34       34       36       38
LossTriggers           0       0       0       0       0       0       0        0        0        0
TimeTriggers           10      10      10      5       1       8       8        8        7        6
LastdayTriggers        23      23      23      21      17      9       9        9        8        7
MeanExitReturn(%)      41.6    41.6    41.6    40.6    39.4    38.8    38.8     38.8     35.4     31.5
Std.Dev.ofReturns(%)   30.0    30.0    30.0    23.2    17.8    31.7    31.7     31.7     28.9     25.8

General Electric (GE)
ConstraintLabel        A[uc]   B[p50]  C[p25]  D[p10]  TA[uc]  TB[p50]  TC[p25]  TD[p10]
TimeLimit              15%     15%     15%     15%     15%     15%      15%      15%
LossLimit              55      55      55      60      55      55       55       60
ProfitLimit            100     100     40      6       100     100      40       6
EnteredStrategies      100     100     100     100     40      40       40       40
ProfitTriggers         32      32      73      91      11      11       27       37
LossTriggers           10      10      10      1       6       6        6        2
TimeTriggers           27      27      9       6       8       8        2        1
LastdayTriggers        42      42      10      2       18      18       5        0
MeanExitReturn(%)      42.1    42.1    32.6    13.8    42.1    42.1     30.2     11.0
Std.Dev.ofReturns(%)   65.5    65.5    45.5    21.6    62.8    62.8     42.9     19.6

Altria Group (MO)
ConstraintLabel        A[uc]   B[p50]  C[p25]  D[p10]  E[p5]   TA[uc]  TB[p50]  TC[p25]  TD[p10]  TE[p5]
TimeLimit              20%     20%     20%     30%     0%      20%     20%      20%      30%      0%
LossLimit              40      40      40      40      65      40      40       40       40       65
ProfitLimit            80      80      80      28      10      80      80       80       28       10
EnteredStrategies      74      74      74      74      74      31      31       31       31       31
ProfitTriggers         42      42      42      57      70      19      19       19       26       28
LossTriggers           5       5       5       4       1       2       2        2        2        0
TimeTriggers           21      21      21      13      0       8       8        8        3        0
LastdayTriggers        14      14      14      7       5       6       6        6        5        4
MeanExitReturn(%)      64.0    64.0    64.0    48.4    29.5    68.6    68.6     68.6     44.4     23.8
Std.Dev.ofReturns(%)   58.4    58.4    58.4    43.1    34.1    66.0    66.0     66.0     45.1     23.7

Microsoft (MSFT)
ConstraintLabel        A[uc]   B[p50]  C[p25]  TA[uc]  TB[p50]  TC[p25]
TimeLimit              15%     15%     15%     15%     15%      15%
LossLimit              60      60      60      60      60       60
ProfitLimit            100     100     16      100     100      16
EnteredStrategies      115     115     115     41      41       41
ProfitTriggers         32      32      86      13      13       29
LossTriggers           4       4       4       5       5        4
TimeTriggers           56      56      21      17      17       6
LastdayTriggers        40      40      5       11      11       3
MeanExitReturn(%)      37.2    37.2    19.7    39.1    39.1     15.6
Std.Dev.ofReturns(%)   62.4    62.4    42.4    72.7    72.7     47.5

Qualcomm (QCOM)
ConstraintLabel        A[uc]   B[p50]  C[p25]  D[p10]  TA[uc]  TB[p50]  TC[p25]  TD[p10]
TimeLimit              40%     30%     30%     20%     40%     30%      30%      20%
LossLimit              90      90      90      40      90      90       90       40
ProfitLimit            100     100     9       4       100     100      9        4
EnteredStrategies      65      65      65      65      32      32       32       32
ProfitTriggers         12      14      49      58      4       6        25       26
LossTriggers           0       0       0       5       0       0        0        4
TimeTriggers           48      45      15      2       28      24       7        1
LastdayTriggers        5       10      1       2       0       2        0        1
MeanExitReturn(%)      23.6    22.2    8.3     6.3     18.6    15.4     5.1      0.2
Std.Dev.ofReturns(%)   57.9    58.9    19.4    17.8    47.9    54.3     23.7     21.1

Walmart (WMT)
ConstraintLabel        A[uc]   B[p50]  C[p25]  D[p10]  TA[uc]  TB[p50]  TC[p25]  TD[p10]
TimeLimit              0%      0%      0%      0%      0%      0%       0%       0%
LossLimit              90      90      90      90      90      90       90       90
ProfitLimit            55      55      55      20      55      55       55       20
EnteredStrategies      73      73      73      73      34      34       34       34
ProfitTriggers         40      40      40      64      21      21       21       32
LossTriggers           0       0       0       0       0       0        0        0
TimeTriggers           0       0       0       0       0       0        0        0
LastdayTriggers        34      34      34      11      14      14       14       4
MeanExitReturn(%)      34.8    34.8    34.8    21.6    41.5    41.5     41.5     24.0
Std.Dev.ofReturns(%)   47.3    47.3    47.3    25.5    41.7    41.7     41.7     23.7
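
The trigger counts in Table F1 follow from the four-part exit rule. A minimal sketch of
that evaluation and of the surrounding grid search is given below; it assumes each entered
strategy is represented by its daily cumulative returns (in %) and the matching remaining
days to expiration, and it interprets TimeLimit as a fraction of the initial days to
expiration, which is one plausible reading of the percentages in the table. All names are
illustrative.

```python
import itertools
import statistics
from dataclasses import dataclass

@dataclass
class ExitRule:
    time_limit: float    # exit when remaining days <= this fraction of initial days to expiry
    loss_limit: float    # exit when cumulative return (%) <= -loss_limit
    profit_limit: float  # exit when cumulative return (%) >= +profit_limit

def simulate_exit(cum_returns, days_left, rule):
    """Walk one entered strategy day by day; return (trigger, exit return in %)."""
    initial_days = days_left[0]
    for ret, days in zip(cum_returns, days_left):
        if ret >= rule.profit_limit:
            return "Profit", ret
        if ret <= -rule.loss_limit:
            return "Loss", ret
        if days <= rule.time_limit * initial_days:
            return "Time", ret
    return "Lastday", cum_returns[-1]    # forced exit on the last tradable day

def grid_search(strategies, time_grid, loss_grid, profit_grid, pctl=None):
    """Pick the rule maximizing mean exit return, optionally constraining a percentile >= 0."""
    best = None
    for t, l, p in itertools.product(time_grid, loss_grid, profit_grid):
        rule = ExitRule(t, l, p)
        returns = sorted(simulate_exit(r, d, rule)[1] for r, d in strategies)
        if pctl is not None and returns[int(pctl * (len(returns) - 1))] < 0:
            continue                     # e.g., pctl=0.25 enforces the C[p25] scenario
        mean_ret = statistics.mean(returns)
        if best is None or mean_ret > best[0]:
            best = (mean_ret, rule)
    return best
```

A percentile setting of 0.50, 0.25, 0.10, or 0.05 reproduces the B through E scenarios,
respectively; pctl=None reproduces the unconstrained scenario A.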

APPENDIX G: EXIT RULES FROM RESPONSE SURFACE OPTIMIZATION


Table G1. Results from response surface optimization are shown. For each underlying
stock and optimization scenario, the table gives the optimal exit rule parameter limits, the
total entered strategies, the counts of rule triggers (i.e., which component rule led to the
exit from the strategy), and the strategy returns for the training and test datasets. The
mean returns and 25th percentile returns predicted by the regression models are also
shown.
Constraint labels: A[uc] = unconstrained OLS optimum; B[p25] = OLS optimum with the
predicted 25th percentile exit return constrained to be >= 0. A leading T (TA[uc],
TB[p25]) denotes the same rule evaluated on the TEST dataset. The predicted returns come
from regression models fit on the training data, so they are identical for corresponding
TRAIN and TEST columns.

Citigroup (C)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                20%     10%     20%     10%
LossLimit                70      90      70      90
ProfitLimit              55      16      55      16
EnteredStrategies        72      72      45      45
ProfitTriggers           35      52      20      33
LossTriggers             0       0       0       0
TimeTriggers             37      18      25      8
LastdayTriggers          2       5       5       4
MeanExitReturn(%)        23.7    9.8     23.5    10.1
Std.Dev.ofReturns(%)     47.8    38.1    40.2    27.8
p25ExitReturn(%)         29      8       5       1
PredictedMeanReturn(%)   20.9    9.9     20.9    9.9
Predictedp25Return(%)    29.9    0.1     29.9    0.1

Cisco Systems (CSCO)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                0%      0%      0%      0%
LossLimit                70      70      70      70
ProfitLimit              70      70      70      70
EnteredStrategies        99      99      47      47
ProfitTriggers           44      44      19      19
LossTriggers             2       2       3       3
TimeTriggers             0       0       0       0
LastdayTriggers          62      62      27      27
MeanExitReturn(%)        38.7    38.7    33.4    33.4
Std.Dev.ofReturns(%)     49.8    49.8    51.9    51.9
p25ExitReturn(%)         5       5       8       8
PredictedMeanReturn(%)   45.9    45.9    45.9    45.9
Predictedp25Return(%)    18.0    18.0    18.0    18.0

General Electric (GE)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                10%     50%     10%     50%
LossLimit                60      65      60      65
ProfitLimit              90      60      90      60
EnteredStrategies        100     100     40      40
ProfitTriggers           38      42      12      19
LossTriggers             10      0       5       0
TimeTriggers             21      58      5       21
LastdayTriggers          36      0       18      0
MeanExitReturn(%)        38.3    29.7    39.4    32.5
Std.Dev.ofReturns(%)     65.1    39.1    62.7    43.2
p25ExitReturn(%)         28      5       1       1
PredictedMeanReturn(%)   41.8    36.0    41.8    36.0
Predictedp25Return(%)    37.7    0.5     37.7    0.5

Altria Group (MO)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                20%     20%     20%     20%
LossLimit                60      60      60      60
ProfitLimit              80      80      80      80
EnteredStrategies        74      74      31      31
ProfitTriggers           42      42      19      19
LossTriggers             1       1       0       0
TimeTriggers             25      25      10      10
LastdayTriggers          15      15      6       6
MeanExitReturn(%)        63.6    63.6    68.0    68.0
Std.Dev.ofReturns(%)     59.1    59.1    67.1    67.1
p25ExitReturn(%)         2       2       5       5
PredictedMeanReturn(%)   66.4    66.4    66.4    66.4
Predictedp25Return(%)    6.5     6.5     6.5     6.5

Microsoft (MSFT)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                20%     20%     20%     20%
LossLimit                65      70      65      70
ProfitLimit              100     12      100     12
EnteredStrategies        115     115     41      41
ProfitTriggers           32      86      13      30
LossTriggers             3       1       1       0
TimeTriggers             65      25      22      10
LastdayTriggers          21      5       5       2
MeanExitReturn(%)        33.9    17.0    37.2    15.2
Std.Dev.ofReturns(%)     64.1    42.6    72.5    46.0
p25ExitReturn(%)         22      3       29      12
PredictedMeanReturn(%)   38.4    21.4    38.4    21.4
Predictedp25Return(%)    20.2    0.5     20.2    0.5

Qualcomm (QCOM)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                40%     0%      40%     0%
LossLimit                60      4       60      4
ProfitLimit              100     100     100     100
EnteredStrategies        65      65      32      32
ProfitTriggers           12      5       4       0
LossTriggers             0       59      0       30
TimeTriggers             48      0       28      0
LastdayTriggers          5       1       0       2
MeanExitReturn(%)        23.6    2.8     18.6    5.1
Std.Dev.ofReturns(%)     57.9    34.0    47.9    11.1
p25ExitReturn(%)         24      8       19      9
PredictedMeanReturn(%)   25.6    8.2     25.6    8.2
Predictedp25Return(%)    16.3    0.4     16.3    0.4

Walmart (WMT)
ConstraintLabel          A[uc]   B[p25]  TA[uc]  TB[p25]
TimeLimit                0%      0%      0%      0%
LossLimit                80      80      80      80
ProfitLimit              60      60      60      60
EnteredStrategies        73      73      34      34
ProfitTriggers           33      33      20      20
LossTriggers             2       2       2       2
TimeTriggers             0       0       0       0
LastdayTriggers          43      43      14      14
MeanExitReturn(%)        32.0    32.0    39.3    39.3
Std.Dev.ofReturns(%)     50.0    50.0    47.7    47.7
p25ExitReturn(%)         9       9       13      13
PredictedMeanReturn(%)   35.5    35.5    35.5    35.5
Predictedp25Return(%)    20.4    20.4    20.4    20.4
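
The regression step behind Table G1 can be sketched as a second-order (response surface)
fit of the grid search outputs, maximized over a fine candidate grid. The sketch below
uses scikit-learn; X_grid (the evaluated TimeLimit, LossLimit, ProfitLimit settings) and
y_mean and y_p25 (the observed mean and 25th percentile exit returns at each setting) are
assumed inputs, and the candidate ranges are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Second-order response surface: linear, quadratic, and interaction terms.
poly = PolynomialFeatures(degree=2, include_bias=False)
Z = poly.fit_transform(X_grid)
mean_model = LinearRegression().fit(Z, y_mean)   # surface for the mean exit return
p25_model = LinearRegression().fit(Z, y_p25)     # surface for the 25th percentile return

# Evaluate both fitted surfaces on a fine candidate grid.
times = np.linspace(0.0, 0.5, 11)    # remaining-time limit, 0%..50%
losses = np.linspace(0, 100, 21)     # loss limit in %
profits = np.linspace(0, 100, 21)    # profit limit in %
cand = np.array([(t, l, p) for t in times for l in losses for p in profits])
Zc = poly.transform(cand)
pred_mean = mean_model.predict(Zc)
pred_p25 = p25_model.predict(Zc)

best_uc = cand[np.argmax(pred_mean)]      # A[uc]: unconstrained predicted optimum
feasible = pred_p25 >= 0                  # B[p25]: assumes at least one feasible candidate
best_p25 = cand[feasible][np.argmax(pred_mean[feasible])]
```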


Biographical Statement
Senthil Murugan is currently a Senior Specialist / Client Manager with Commercial
Analytics and Decision Sciences group at Merck & Co. He has over 15 years of
experience in pharmaceutical analytics. He has presented his articles in various
pharmaceutical industry conferences. Prior to completing his M.S. in Data Mining at
Central Connecticut State University, he earned a B.S. in Mechanical Engineering from
Madurai Kamaraj University, India and a M.S. in Applied Mechanics from Indian
Institute of Technology Madras, India.
