Customer Analytics

This module will discuss *customer analytics*. Customer
analytics analyzes a company's customer data and behaviours to
identify, attract, and retain the most profitable types of
customers. Since customers have access to significantly more
information about products and companies in the digital age,
organizations must use updated strategies to attract and retain
these customers. This is the goal of customer analytics: to create a
unified, accurate view of the company's customer base, and to decide
which strategies can best retain and grow this base. More detailed
questions fall under this basic approach, for example, who
are a company's highest value customers, and should those customers
be prioritized over other needs?

Customer analytics is normally an interdisciplinary problem that involves marketing, sales, IT, customer service, and business analytics. The skills and knowledge of each group are shared to identify the business metrics to capture, the analysis to perform, and the overall goals to address. Customer analytics begins with the capture of raw data and ends with business decisions. One definition of the stages of customer analytics includes the following three steps.

**Collection.** Obtain raw data from marketing tools, CRM systems, or external sources. This could include demographics, purchase history, social media presence, engagement with advertisements, and so on.

**Organization.** Convert the data into a format that will facilitate the types of analysis needed to provide insights to support the desired business decisions.

**Analysis.** Perform appropriate analysis on the organized data to produce the insights needed to direct the business in its decisions. Examples include modelling different types of customers, sales prediction, effects of price changes, and so on.

The types of business decisions or goals associated with customer analytics are numerous and varied, but they normally have a relationship to the overall purpose of identifying, attracting, and retaining profitable customers. Below are some examples of goals for customer analytics.

- Analyzing how to distribute a product across customer channels,
- determining customer satisfaction,
- identifying when and where to engage with a customer,
- predicting churn and actions to reduce it,
- highlighting trends in the raw data that could be leveraged to increase sales, and
- optimizing a customer's "journey" through a web site to better encourage sales.

Not surprisingly, numerous commercial tools exist to help with customer analytics. Many focus on customer relationship management (CRM), which encompasses many of the ideas discussed above. CRM tools can collect data, aggregate data from different sources, support raw data organization and analysis, and visualize results to better highlight any relevant insights that are found. The tools can also integrate with sales and marketing applications, web content management systems, email, social media sites, customer loyalty programs, and other tools designed to help attract and retain customers. Some common CRM tools from well-known vendors include Salesforce, Oracle NetSuite, Zoho, HubSpot, Pipedrive, Insightly, and Google Analytics 360, among others.

CRM tools are sometimes divided
into *collaborative*, *operational*,
and *analytical*. A collaborative CRM is designed to
remove *silos* between different teams to ensure they all
share a common set of data. An operational CRM streamlines a
company's handling of the customer's journey through its web site,
even in situations where the journey is represented with many highly
detailed touchpoints. This is usually done by automating repetitive
tasks to free employees to focus on more subjective or creative
issues. An analytical CRM is optimized to analyze massive amounts of
customer data and return actionable insights.

For our instruction, we will complete a basic introduction to Google Analytics. Google Analytics is free, is easy to set up within a web site, provides various ways to filter data, offers high-quality dashboards, and has some basic analytics built into its system. For more sophisticated organization and analytics, data from Google Analytics can be exported as CSV for use by external programs.

Google Analytics is a digital analytics platform provided by Google. One of its main advantages is that Google provides the service for free: because Google Analytics tends to drive business to Google Ads, Google benefits indirectly from users of the Analytics platform. A basic description of Google Analytics is that it is a tool within Google's Marketing Platform that collects data on users visiting your web site, then allows you to compile that data into reports to develop business strategies and improve your web site's performance. By installing web site-specific tracking code, you can see who visits your site and what they do there, as well as collect a wide variety of demographic data on your visitors. A small sample of the data you can collect using Google Analytics includes the following.

- How visitors enter your web site.
- The path or "journey" visitors take through your web site.
- When visitors purchase a product on your web site.
- Which pages a visitor loads from your web site.

Google Analytics discusses the *digital analytics funnel*, the
idea that individuals explore a web site or purchase items in stages.
Marketing uses the concept of a funnel to enumerate these stages.

**Acquisition.** Building awareness and acquiring user interest.

**Behaviour.** User engagement with your web site or business.

**Conversion.** A user becoming a customer through a transaction with your business.

These stages may be different for web sites that are not designed to sell products; however, the basic ideas are often analogous. For example, conversion on a non-product web site may mean that the visitor returns to the site repeatedly because they find it useful.

Although we will discuss Google Analytics in the context of web sites and web traffic, it can also be used to collect data from mobile applications, point-of-sales systems, video game consoles, CRMs, and other Internet-connected platforms.

Google Analytics was originally built using UA (Universal Analytics). This allowed you to create an account, then obtain a snippet of Javascript code that you added to each web site you wanted to track. The Javascript contained a unique ID that sent information back to your Analytics account. The Google Analytics web site allowed that information to be viewed and filtered in a variety of ways.

More recently, Google decided to move to a new platform called GA4
(Google Analytics 4). This is designed to do what UA did, but in a
very different way. Originally, UA used a hierarchy of
account → property → view, where an account
represented one or more web sites, a property was a section of a web
site (*e.g.*, a part of the site or a subset of its
visitors), and a view was a combination of a filter and a
visualization of some or all of the data associated with a
property. Properties were themselves given unique sub-IDs to allow
you to treat them as independent data sources.

In GA4 the hierarchy is now account → property →
data stream. The basic ideas of account and property are similar to
UA, but view has been removed. The new *data stream* has been
added, which represents a source of raw data to be fed into a
property. As before, that data can be filtered and visualized in a
number of different ways. However, it is non-trivial
to *save* a particular "view" into a property's data. This
has caused a number of issues for individuals and businesses who
were using the old UA system, since views are clearly valuable and
difficult to replace. On the other hand, GA4 makes it much easier to
separate events on a web site. Previously, UA bundled everything
under a single tag ID. GA4 is also capable of capturing a much wider
set of events than UA was.

All of our discussion will revolve around GA4, since UA is being deprecated in July 2024. If you search online for information on using Google Analytics, be careful to make sure you're looking at instructions for GA4 and not UA. Google has not removed any information about UA, and since this is the system with the longest history, searches that do not explicitly specify GA4 (and even some that do) will point you to information relevant to UA and not GA4.

Google Tag Manager is now the recommended method for collecting data
*to be analyzed* in Google Analytics. Although this is often
not clear, Google Analytics and Google Tag Manager, while closely
integrated, are *two entirely separate* online systems with two
different purposes.

Google Analytics is an online software suite used for analytics, conversion tracking, and reporting. It provides a wide range of visualization, filtering, and reporting tools to build dashboards that provide insights into visitors to a web site, what they see on the site, and how they interact on the site.

Google Tag Manager, on the other hand, is a tag management system that
can detect and store *events* within a web site, for example,
page views, clicks, scrolling, entering or exiting the site, and so on.
These events can be sent to a separate analytics package for further
analysis and presentation. An obvious candidate for this task is
Google Analytics.

To use Google Analytics, you first create a Google Analytics account
tied to your web site. This includes creating an initial property
attached to your account and a data stream to provide data to that
property. At that point, you have the option of adding Javascript
code to each page you want to track, or using the *Google Tag
Manager* to automatically track different types of events
(*e.g.*, page views, clicks, etc.). Currently, Google is
recommending using the Google Tag Manager, since any changes you
make in the Tag Manager web site should automatically apply to all
pages you are managing, without the need for additional edits to
the Google Javascript that is involved in manual tagging.

Complete the following steps to create your initial Google Analytics account.

- Navigate to `https://analytics.google.com`.
- Choose the Google account you want to use to sign up with.
- Click "Start measuring."
- Choose an "Account name" for your Analytics account. You can accept the default Account Data Sharing Settings.
- Choose a name for your initial, default property in "Property name (Required)." Select the proper timezone for your web site.
- Choose the "Industry Category (Required)" from the drop-down menu and "Business size (Required)" of the organization your web site belongs to.
- Choose the purpose of your Analytics site. The default is "Get baseline reports" if none of the other options apply.
- Once you click "Create" and accept the Terms of Service you will be asked to "Choose a Platform." For our purposes, we are only looking at web-based analytics, so choose "Web."
- Enter the domain of the website you want to analyze, for example, `healey.wordpress.ncsu.edu`. Choose a Stream name for the data stream attached to this property, and ensure "Enhanced measurement" is turned on.
- At this point a set of Installation instructions will appear. If you are using a web hosting service that Google Analytics recognizes (you can see the list of services by clicking "Select your platform" under "Install with a website builder or CMS"), choose the proper service for instructions on how to register your Google Analytics site ID. Otherwise, choose "Install manually" and the Javascript needed for every web page you want to track will be shown. You can copy and paste this code immediately after the `<head>` tag on the web pages you plan to analyze.
- Close the Installation instructions panel and you will see a Stream details panel summarizing your Analytics options. You can return to this panel later if you need to update any information or (specifically for Google Tag Manager) if you need your Measurement ID.
- Press Esc to exit the Stream details panel, then click "Next." You will be told "Data collection is pending." Click "Continue to Home" to jump to the Analytics homepage. Choose any email communications you want, click "Save", and your analytics homepage will be shown. At this point the page is empty since no data has been collected yet.

Google provides a web page with additional instructions on how to
set up an account and create an initial *property*
and *stream*. After some time has passed, you can log in to
Google Analytics and use the site's interface to explore data about
users who have visited your web site.

Google Analytics groups activity into a *session* that
begins when a user loads a page with tracking code, and ends after
30 minutes of inactivity. Data is uploaded to Google and made
available through your analytics account. By default, Google
Analytics will aggregate and present information based on a
predefined set of criteria like geographic location, origin type,
and page, but these can be modified as desired using filters and
other interactive controls.
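
To make the session rule concrete, the following minimal Python sketch groups one user's page-view timestamps into sessions using the 30-minute inactivity rule. The timestamps are hypothetical, and the sketch only illustrates the grouping logic, not Google's actual implementation.

from datetime import datetime, timedelta

# Hypothetical page-view timestamps for a single user
views = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 1, 9, 10),
    datetime(2024, 1, 1, 9, 25),
    datetime(2024, 1, 1, 11, 0),   # > 30 minutes of inactivity: new session
    datetime(2024, 1, 1, 11, 5),
]

TIMEOUT = timedelta(minutes=30)

# A session begins with a page view and ends after 30 minutes of inactivity
sessions = [[views[0]]]
for prev, curr in zip(views, views[1:]):
    if curr - prev > TIMEOUT:
        sessions.append([curr])    # gap exceeded: start a new session
    else:
        sessions[-1].append(curr)  # continue the current session

print(len(sessions), "sessions:", [len(s) for s in sessions])
# Prints: 2 sessions: [3, 2]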

A Google Analytics account is made up of one or more
*properties*. Within each property one or more *data
streams* can be created. Properties are meant to collect data
independent of one another by using a unique tracking ID in the
tracking code. For example, a business may want to use different
properties to collect data from different web sites, from a web site
versus a mobile application, or from different geographic
regions. Data streams are sources of data from a web site being sent
to a property. Data streams come in three types: web, iOS app, or
Android app. The intent is to allow you to aggregate users across
different data sources within a single property.

Under the old UA system, when a property was created, Google Analytics automatically created a default view called "All Web Site Data," containing all raw data collected for the property.

Complete the following steps to create your Google Tag Manager account.

- Navigate to `https://tagmanager.google.com`.
- Choose the Google account you want to use to sign up with.
- Click the "Create Account" button.
- Enter an Account Name and choose a Country. Enter a Container name. The container controls the Javascript code you will use to invoke Google Tag Manager. Typically there is one container per web site. Choose Web as the Target platform and click the "Create" button.
- Click the "I also accept the Data Processing Terms as required by GDPR" checkbox, then click the "Yes" button to agree to the Terms of Service.
- Two Javascript code snippets are shown. Note the instructions: the first snippet should be placed on your web page right after the `<head>` tag. The second should be placed right after the opening `<body>` tag.
- You can test your website by typing in its URL and clicking "Test." This should place a checkmark beside the URL if Google Tag Manager sees valid Tag Manager code on the given web page.

Next, you will want to add one or more tags and associated triggers to collect data from your web site and send it to Google Analytics.

- Click "Add a new tag" to create a new tag within the Google Tag Manager you just created (this should be shown in the dropdown at the top of the page.)
- Replace `Untitled Tag` at the top of the panel with a more descriptive name for the type of data you plan to capture, for example, `Page Views`.
- Click on the "Tag Configuration" region, then on "Google Analytics", then on "Google Tag." You will now need to enter your Tag ID from the Google Analytics account you previously created. To find this, sign in to your Google Analytics account, click the Admin gear at the bottom-left of the page, click "Data collection and modification" under the "Property settings", click "Data streams", then choose the data stream you previously created. Your Google Tag should be the "MEASUREMENT ID" at the top of the page. Click the icon to the right to copy it, then paste it into the "Google Tag" field in Google Tag Manager.
- Click on the "Triggering" region and choose "All Pages" to trigger this tag on every page view. Click the "Save" button to save your new tag.
- Click the "Preview" button to ensure your tag is working. Enter your web site URL in the "Your website's URL" field and click the "Connect" button. Your website should appear in a new tab or window with a Tag Assistant pop-up. If you go back to the Google Tag Manager window, it should say "Connected!"
- Click "Continue" and you should see a "Page Views" tag under the "Tags Fired" field. If you refresh the window or tab containing your web page, the "Tags Fired" field should increase by 1.
- Close the debug version of your website and the Tag Assistant tab. Click "Submit" and choose a Version Name and Description (if desired) and click the "Publish" button to finalize registration of your new tag.
- If you go back to Google Analytics and choose "Realtime" in the Reports menu, you should be able to track visits to your site as they happen.

**Note.** I could only get the Preview to work when I used the
Chrome browser, installed the Tag Assistant Companion extension, and
disabled all other extensions since there seemed to be a conflict in
my extension set. I suspect this only affects previewing whether the
tag is working or not. Rather than running the Preview step, you
could jump directly to Submit then Publish, then check with Realtime
as you load your page to see if page views are appearing or not.

In its most basic form, A–B testing is a way to determine whether changing a property of an environment makes it better or worse, based on a specific evaluation metric (often called a key performance indicator or KPI). In other words, given two versions of an environment, which performs better? Although the term "A–B testing" was coined in the 1990s, it is a form of basic randomized controlled experimentation, first formalized by Ronald Fisher (of the famous Fisher's Iris dataset) in the 1920s. A–B testing in its current form is often characterized by being run online, in real time, and on a much larger scale in terms of participants and experiments than traditional randomized controlled trials (RCTs). A high-level overview of designing, conducting, and evaluating an A–B test might go something like this.

- Decide what you want to test, and construct two versions of the test environment: Version A and Version B.
- Determine how you will evaluate performance.
- *Randomly* assign two sets of users to Version A and Version B of the environment.
- Run the experiment, asking the users to operate in their version of the environment.
- Statistically evaluate the performance of the two sets of users to determine if there was a significant difference between the two environments.

For example, you might wonder whether one version of a button on a web site will encourage users to click more often than another version. Two versions of the web site with the two button candidates are constructed, and users are randomly assigned to view one of the two versions. The performance metric is the number of button clicks. Once all users have explored the web site, the number of button clicks is statistically compared to see whether one is significantly higher than the other. If it is, the button with more clicks is chosen for the final web design.

As with all experiments, randomization is critical. This ensures that users are not grouped based on some criteria that might influence their preferences, for example, for one colour of button over another.

During design of the experiment, deciding how many users are needed
to ensure statistical significance is important. Since A–B
testing is a form of a randomized controlled experiment, we can use
literature from either area to study this problem. For example, the
medical community often conducts A–B-type tests and has many
good sources and examples of how to calculate sample sizes for a
desired level of improvement (*i.e.*, how much "better" does
the outcome need to be to be considered relevant?). Two types of
experiments are considered: *dichotomous*, where the outcome
of interest is one of two possibilities (yes/no, success/failure,
and so on), or *continuous*, where the outcome of interest is
the mean difference of an outcome variable between the two groups,
for example, the difference in the average number of clicks between
group A and group B.

Given the overall goal of determining whether changing the test environment leads to a significant change in KPI, experiments are often described in terms of the null hypothesis (\(H_0\)), that no significant change was found, or the alternative hypothesis (\(H_a\)), a significant change did occur. This is often modelled using false positive (Type I error), false negative (Type II error), true positive, and true negative proportions, which are based on the null hypothesis \(H_0\) (no difference) and the alternative hypothesis \(H_a\) (significant difference).

|  | \(H_0\) | \(H_a\) |
| --- | --- | --- |
| Predict \(H_0\) | True Negative, probability \(1 - \alpha\) | False Negative, probability \(\beta\) |
| Predict \(H_a\) | False Positive, probability \(\alpha\) | True Positive, probability \(1 - \beta\) |

**Dichotomous.** For a proportional metric, we need to define a
significance level \(\alpha\), a power level \(P\), and the two
proportions \(\mu_1\) and \(\mu_2\) from groups A and B that
constitute the desired level of improvement \(\mu_2 -
\mu_1\). Recall that

- \(\alpha\) is the probability of rejecting the null hypothesis \(H_0\) when it is actually true (Type I error),
- \(\beta\) is the probability of rejecting the alternative hypothesis \(H_a\) when it is actually true (Type II error), and
- power \(P = 1 - \beta\) is the probability of accepting the alternative hypothesis when it is actually true; a minimum \(P\) is normally \(0.8\) or higher.

Notice that reducing the probability of committing a Type II error increases the probability of committing a Type I error and vice versa. Because of this, careful balance must be maintained between \(\alpha\) and \(\beta\).

Given this, the size of each group \(n_A = n_B\) is \[ n_A = n_B = c \cdot \frac{\mu_1 (1-\mu_1) + \mu_2 (1-\mu_2)}{(\mu_1 - \mu_2)^{2}} \] where \(c=7.9\) or \(c=10.5\) for the standard power levels of \(P=80\)% or \(P=90\)% at \(\alpha = 0.05\). \(c\) is derived from the inverse of the cumulative distribution function (CDF) of the standard normal distribution, \(c = \left( \Phi^{-1}(1 - \frac{\alpha}{2}) + \Phi^{-1}(P) \right)^{2}\), where \[ \begin{gather} \Phi(x) = p(Z \leq x) = \frac{1}{\sqrt{2 \pi}} \int_{-\infty}^{x} \exp \left( -\frac{u^{2}}{2} \right) du \\ Z \sim N(\mu = 0, \sigma^{2} = 1) \end{gather} \] For example, if we want to go from 40% of participants answering Yes in Group A (control) to 70% answering Yes in Group B (test), \(n_A = n_B = 7.9 \cdot \frac{(0.4 \cdot 0.6) + (0.7 \cdot 0.3)}{0.3^{2}} \approx 40\) for an 80% power level, or \(n_A = n_B \approx 53\) for a 90% power level at \(\alpha = 0.05\).
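
As a quick sanity check of these numbers, the constant \(c\) and the group size can be computed directly in Python with `scipy.stats.norm`. This is a minimal sketch that transcribes the formula above; it is not a substitute for a proper power-analysis library.

import math
from scipy.stats import norm

def dichotomous_n(mu1, mu2, alpha=0.05, power=0.80):
    """Per-group sample size for a two-proportion A-B test."""
    # c = (z_{1 - alpha/2} + z_{power})^2, using the inverse normal CDF
    c = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
    n = c * (mu1 * (1 - mu1) + mu2 * (1 - mu2)) / (mu1 - mu2) ** 2
    return math.ceil(n)

print(dichotomous_n(0.4, 0.7, power=0.80))  # 40
print(dichotomous_n(0.4, 0.7, power=0.90))  # 53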

**Continuous.** We need to define a significance level
\(\alpha\), a power level \(P\), a desired response difference
\(\mu_2 - \mu_1\), and a common (combined group) standard deviation
\(\sigma\). Given this, the size of each group \(n_A = n_B\) is
\[
n_A = n_B = \frac{2c}{\delta^{2}} + 1
\]
where
\[
\delta = \frac{|\mu_2 - \mu_1|}{\sigma}
\]
and, as before, \(c=7.9\) for \(P=80\)% and \(c=10.5\) for
\(P=90\)%.
For example, if we wanted to go from 20% clicks in group A to
30% clicks in group B with a standard deviation \(\sigma=0.5\), then
\(\delta = \frac{0.1}{0.5} = 0.2\) and \(n_A = n_B =
\frac{15.8}{0.04} + 1 = 396\) for \(P=80\)% or \(n_A = n_B =
\frac{21}{0.04} + 1 = 526\) for \(P=90\)%.
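
The same check applies to the continuous case. Again, this is a direct transcription of the formula, but with \(c\) computed exactly, so it returns 394 and 527 rather than the 396 and 526 above, which use the rounded constants \(c=7.9\) and \(c=10.5\).

import math
from scipy.stats import norm

def continuous_n(mu1, mu2, sigma, alpha=0.05, power=0.80):
    """Per-group sample size for a two-mean A-B test."""
    c = (norm.ppf(1 - alpha / 2) + norm.ppf(power)) ** 2
    delta = abs(mu2 - mu1) / sigma  # standardized effect size
    return math.ceil(2 * c / delta ** 2 + 1)

print(continuous_n(0.2, 0.3, 0.5, power=0.80))  # 394 (396 with c = 7.9)
print(continuous_n(0.2, 0.3, 0.5, power=0.90))  # 527 (526 with c = 10.5)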

Alternatively, you can use either Python or R to calculate minimum
sample sizes, for example with Python's `statsmodels` library or R's
built-in `power.prop.test()` and `power.t.test()` functions (in the
`stats` package). Both model the problem using false positive (Type
I error), false negative (Type II error), true positive, and true
negative proportions, based on the null hypothesis \(H_0\) (no
difference) and the alternative hypothesis \(H_a\) (significant
difference). Both Python's and R's tests use the \(\alpha\)
significance level (normally 1%, 5%, or 10%), the false negative
rate \(\beta\) (the probability of incorrectly retaining \(H_0\)),
the power level (\(1 - \beta\), or the true positive rate, the
probability of correctly rejecting \(H_0\)), and the effect size,
derived from the minimum detectable lift (MDL, the minimum change
needed to reject \(H_0\)).

For a dichotomous A–B test, R's `power.prop.test()` is used to
determine the minimum \(n\) needed for significance.

# Historical data
p0 <- 0.12            # Group A probability
# Model parameters
alpha <- 0.05         # False positive probability
beta <- 0.20          # False negative probability
power <- 1 - beta     # True positive probability
mdl <- 0.02           # Minimum detectable lift
dir <- 'two.sided'    # Two-sided vs. one-sided test
min_n <- power.prop.test(
  n=NULL,             # Solve for n
  p1=p0,
  p2=(p0*(1+mdl)),    # Group B probability after lift
  sig.level=alpha,
  power=power,
  alternative=c(dir)
)
min_n$n

For a continuous A–B test, R's `power.t.test()` is used to
determine the minimum \(n\) needed for significance.

# Historical data
mu <- 30              # Average lift
theta <- mu / 5       # Standard deviation of lift
# Model parameters
alpha <- 0.05         # False positive probability
beta <- 0.20          # False negative probability
power <- 1 - beta     # True positive probability
mdl <- 0.02           # Minimum detectable lift
dir <- 'two.sided'    # Type of t-test
min_n <- power.t.test(
  n=NULL,             # Solve for n
  delta=(mu*mdl),     # Minimum detectable difference in means
  sd=theta,
  sig.level=alpha,
  power=power,
  type=c('two.sample'),
  alternative=c(dir)
)
min_n$n

Once results are obtained from an A–B test, they are analyzed to
search for significant differences. The null hypothesis \(H_0\) that
there is no significant difference in the performance metric between
the two groups A and B is \(p_B - p_A = 0\) for dichotomous
(proportional) metrics and \(\bar{p}_A = \bar{p}_B\) for continuous
metrics. Proportional significance can be measured in Python
with `statsmodels.stats.proportion.proportions_ztest()` and in R
with `prop.test()`. For continuous metrics,
use `scipy.stats.ttest_ind()` in Python or `t.test()` in R.
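
For example, the following minimal Python sketch runs both tests on made-up results; the counts, group sizes, and simulated time-on-page values are hypothetical.

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.proportion import proportions_ztest

# Dichotomous metric: 120/1000 clicks in group A vs. 150/1000 in group B
stat, p_val = proportions_ztest(count=[120, 150], nobs=[1000, 1000])
print("proportions z-test p-value:", p_val)

# Continuous metric: simulated average time on page for each group
rng = np.random.default_rng(17)
group_a = rng.normal(loc=30.0, scale=6.0, size=400)
group_b = rng.normal(loc=31.5, scale=6.0, size=400)
stat, p_val = ttest_ind(group_a, group_b)
print("two-sample t-test p-value:", p_val)

If a p-value falls below the chosen significance level \(\alpha\), we reject \(H_0\) and conclude the two groups differ significantly.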

One final value you may want to calculate is effect size (ES).
Intuitively, effect size states how strongly the independent
variables affect the dependent variable. For t-test studies,
Cohen's *d* is often used to measure effect
size. Cohen's *d* calculates the ratio of the mean difference
between groups to the pooled standard deviation, where \(d=0.2\) is
considered small, \(d=0.5\) is considered medium, and \(d=0.8\) is
considered large.
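
Here is a minimal sketch of Cohen's *d* using the pooled standard deviation of two samples; the sample values are made up for illustration.

import numpy as np

def cohens_d(a, b):
    """Cohen's d: mean difference divided by pooled standard deviation."""
    na, nb = len(a), len(b)
    var_a, var_b = np.var(a, ddof=1), np.var(b, ddof=1)
    pooled_sd = np.sqrt(((na - 1) * var_a + (nb - 1) * var_b) / (na + nb - 2))
    return (np.mean(b) - np.mean(a)) / pooled_sd

a = [12, 14, 15, 11, 13, 16]
b = [15, 17, 16, 18, 14, 19]
print(round(cohens_d(a, b), 2))  # 1.6, a large effect for these toy samples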

It is often useful to complete the analysis by including both significance and effect size. For example, changing this property results in a significant change between groups, with a small/medium/large effect on the measured result or KPI.

Multivariate testing (MVT) is performed using many variations of a design, usually called factors, tested simultaneously. For example, you might design two possible headlines and two possible images for a website, then test them simultaneously as \(2 \times 2 = 4\) possibilities using a headline factor of size 2 and an image factor of size 2. MVT is more complicated than A–B testing, but it can be more efficient, since it allows multiple factors to be tested in parallel rather than sequentially. It also provides information about how combinations of factors perform: it may be that Image 1 works well with Headline 1 but not with Headline 2. Testing each factor independently would not reveal this insight.

If we define a factor being present as \(+1\) and a factor being
absent as \(-1\), we can present MVT designs as a table of
experiments or *treatments* and the associated factors and
factor interactions being tested.

|  | A (Factor 1) | B (Factor 2) | AB (Interaction) |
| --- | --- | --- | --- |
| Treatment 1 | +1 | +1 | +1 |
| Treatment 2 | +1 | -1 | -1 |
| Treatment 3 | -1 | +1 | -1 |
| Treatment 4 | -1 | -1 | +1 |

Recall that two vectors are orthogonal when their dot product is 0. A design is balanced when its factor columns are pairwise orthogonal. In this case, Factor A \(\cdot\) Factor B = \((1,1,-1,-1) \cdot (1,-1,1,-1) = 1 - 1 - 1 + 1 = 0\), producing a full factorial or balanced design (all possible combinations of factors are tested).

The effect of any factor, for example A, is calculated as the
difference in mean response between the rows where A is +1 and the
rows where A is -1, \(\bar{x}_A = \bar{x}_{A+1} -
\bar{x}_{A-1}\). The effect of the interaction between factors is
calculated similarly, \(\bar{x}_{AB} = \bar{x}_{AB+1} -
\bar{x}_{AB-1}\). The key advantage of a balanced design is that you
can add more (two-level) factors *without* increasing the
required sample size. An \(n\)-factor design has \(2^{n}\)
treatments (rows), whose effects decompose into \(1\) overall mean,
\(n\) main effects, and \(2^{n} - n - 1\) interactions. For example,
an \(n=3\)-factor design has \(2^3=8\) rows, \(1\) overall mean,
\(3\) main effects, \(3\) two-way interactions, \(1\) three-way
interaction, and \(2^{3}=8\) total treatments.
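
To make these calculations concrete, the following numpy sketch builds the 2-factor design from the table above, verifies that its columns are orthogonal, and computes the main and interaction effects. The response values are made up for illustration.

import numpy as np

# 2^2 full factorial design: factor columns A and B, interaction AB = A * B
A = np.array([+1, +1, -1, -1])
B = np.array([+1, -1, +1, -1])
AB = A * B

# Balanced design: all pairwise dot products between columns are 0
print(A @ B, A @ AB, B @ AB)  # 0 0 0

# Hypothetical mean response for each of the four treatments
x = np.array([20.0, 14.0, 11.0, 9.0])

# Effect of a factor: mean response at +1 minus mean response at -1
effect_A = x[A == +1].mean() - x[A == -1].mean()     # 7.0
effect_B = x[B == +1].mean() - x[B == -1].mean()     # 4.0
effect_AB = x[AB == +1].mean() - x[AB == -1].mean()  # 2.0
print(effect_A, effect_B, effect_AB)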

As you can see, as we increase the number of factors, the number of treatments and the number of interactions both increase exponentially. If we choose not to test all interaction terms, we can instead focus on designs that include only a subset of the treatments. The question then becomes: which subset should we include? Consider a 3-factor design where we want to run four treatments.

|  | A (Factor 1) | B (Factor 2) | C (Factor 3) | AB (Interaction) | AC (Interaction) | BC (Interaction) | ABC (Interaction) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Treatment 1 | +1 | +1 | -1 | +1 | -1 | -1 | -1 |
| Treatment 2 | +1 | -1 | +1 | -1 | +1 | -1 | -1 |
| Treatment 3 | +1 | +1 | +1 | +1 | +1 | +1 | +1 |
| Treatment 4 | +1 | -1 | -1 | -1 | -1 | +1 | +1 |
| Treatment 5 | -1 | +1 | -1 | -1 | +1 | -1 | +1 |
| Treatment 6 | -1 | -1 | +1 | +1 | -1 | -1 | +1 |
| Treatment 7 | -1 | +1 | +1 | -1 | -1 | +1 | -1 |
| Treatment 8 | -1 | -1 | -1 | +1 | +1 | +1 | -1 |

If we chose Treatments 1-4 then we could not investigate the main effect of A, since A is \(+1\) in all cases, so there is no variance available. This would be a poor subset to choose. This is where the idea of fractional factorial design comes into play. Fractional factorial design focuses on a reduced set of treatments that optimize independently estimating the main effects and the lower-order interactions. Some effects are confounded with one another, so they cannot be estimated independently.

|  | A (Factor 1) | B (Factor 2) | C (Factor 3) | AB (Interaction) | AC (Interaction) | BC (Interaction) | ABC (Interaction) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Treatment 3 | +1 | +1 | +1 | +1 | +1 | +1 | +1 |
| Treatment 4 | +1 | -1 | -1 | -1 | -1 | +1 | +1 |
| Treatment 5 | -1 | +1 | -1 | -1 | +1 | -1 | +1 |
| Treatment 6 | -1 | -1 | +1 | +1 | -1 | -1 | +1 |

For example, in Treatments 3-6, C=AB, so C cannot be estimated independent of A and B. Similarly, A=BC and B=AC. This is known as a Resolution III design: the main effects are confounded with the 2-factor interactions but not with each other. A quick check of this aliasing appears below.
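
Here is a short numpy sketch that confirms the aliasing, using the factor columns of Treatments 3-6 from the table above.

import numpy as np

# Factor columns for the half-fraction (Treatments 3-6)
A = np.array([+1, +1, -1, -1])
B = np.array([+1, -1, +1, -1])
C = np.array([+1, -1, -1, +1])

# Each main effect equals (is aliased with) a two-way interaction
print(np.array_equal(C, A * B))  # True: C = AB
print(np.array_equal(A, B * C))  # True: A = BC
print(np.array_equal(B, A * C))  # True: B = AC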

- Resolution II: The main effects can be confounded with each other.
- Resolution III: The main effects can be confounded with 2-factor interactions.
- Resolution IV: The main effects can be estimated independent of each other and the 2-way interactions, but the main effects can be confounded with 3-way interactions and 2-way interactions can be confounded with each other. Three-way interactions are assumed to be negligible.
- Resolution V: The main effects can be estimated independent of each other and the 2-way and 3-way interactions, but the main effects can be confounded with higher order interactions. Two-way interactions are not confounded with each other. Higher order interactions are assumed to be negligible.

Obviously the higher the design resolution the better, but higher resolution also requires more treatments. Typically, Resolution IV or Resolution V designs are most practical. There are some general "rules of thumb" when trying to choose a good factorial design.

**Balance.** Each level occurs an equal number of times.

**Orthogonality.** There is no correlation between pairs of factors.

Note that depending on the actual factors, certain conditions may not be possible. For example, if you are testing the presence or absence of a banner on a web page, and whether the banner should be red or blue, the colour is not testable in the condition where the banner is not present. In this case, balance is not possible. These types of situations should be taken into account when designing an MVT experiment. Since finding optimal designs for complex treatment arrangements is difficult, most statistical software provides functionality to do this for you.

A final question is: should you use A–B testing or MVT testing? A–B testing focuses on the effect of independent components in a new environment. MVT experiments focus on the holistic effect of the overall experience in a new environment. The experience that is most relevant or important to you or your users should dictate which type of experiments you choose to run.

Although RCTs are the "gold standard" for assessing changes in a
test environment, in certain situations they are not
possible. Ethical concerns, cost, a lack of known participant
properties, the number of specialized participants needed for the
experiment, or other factors may preclude conducting controlled
experiments. In these situations, we can use *propensity score
matching* (PSM) to "match" pairs of participants that are similar,
placing one in group A and one in group B.

The most common use of propensity scoring is when participants are defined by multiple attributes. During the definition of groups A and B, rather than choosing by random selection, we would like to choose two participants that are "similar" to one another and place one in group A and one in group B. This addresses the issue of balancing potentially confounding effects across the two groups. We fall back to random selection in situations where (the correct) attributes are not available. For example, if we are testing button clicks for two types of buttons on a website, we are unlikely to know anything (useful) to define our split between users. Here we use randomization to assign users to group A or group B to best address bias.

Propensity scoring is calculated from
*observational data* about participants. It attempts to match
participants based on common observable characteristics. Rather than
using the participant attributes directly, which can be difficult or
expensive, we compute a *propensity score* for each
participant: a single value determined as a function of the
covariates (observable attributes) of a participant.

Propensity scoring simplifies the task of identifying (or matching) similar participants in group A and group B. Matching by covariate values, especially when there are numerous covariates, is complicated. Reducing the covariates to a single score makes it much easier to identify similar participants. The standard method to do this is to fit a logistic regression model to the covariates of interest, then use the model to convert a participant's covariates to a single propensity value on the range \(0 \ldots 1\). Recall that a (binary) logistic regression model defines the log-odds of an event as a linear combination of one or more independent variables. Here, those variables are the covariates selected to compute the propensity score.

PSM does not eliminate A–B testing. Instead, it adjusts an initial A–B randomized split to improve it by removing possible bias in the two groups. More specifically, the following steps are used to determine whether there is a significant difference between group A (control) and group B (test).

- Randomly divide participants into group A and group B, exactly like A–B testing.
- Choose which covariate attributes you will use to calculate a participant's propensity score.
- Fit a logistic regression model \(l\) using the selected covariates.
- Use \(l(p_i)\) to compute a propensity score for each participant \(p_i\).
- Order participants in both groups by their propensity scores.
- For each participant \(p_i\) find their nearest neighbour \(n_i\) in the opposite group. If \(n_i\) is farther than a threshold value \(\tau\), do not include \(p_i\) in the follow-on analysis.
- Store the pair \((p_i, n_i)\) as a *matched pair*.
- Once all participants are paired or removed, search for significance over the pairs' proportional differences \(|\mu_{p_i} - \mu_{n_i}|\).

The following Python code implements these steps on randomly generated data.

from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np
import pandas as pd

# Generate noisy data
X, y = make_classification(
    n_samples=1000,
    n_features=4,
    n_redundant=0,
    n_classes=2,
    n_clusters_per_class=1,
    class_sep=2,
    flip_y=0.2,
    weights=[0.5, 0.5],
)

# Normalize the data to usage ratios on the range 0..1
scaler = MinMaxScaler()
normalized_data = scaler.fit_transform(X)
data = pd.DataFrame(normalized_data, columns=["A", "B", "C", "D"])
data["RenewalStatus"] = y
data["Treatment"] = (data["A"] >= 0.40) * 1

# Select covariates for PSM
covariates = ["B", "C", "D"]
X = data[covariates]
y = data["Treatment"]

# Fit logit to the covariates; propensity score = predicted probability
logit = LogisticRegression()
logit.fit(X, y)
PS = logit.predict_proba(X)[:, 1]

# Match each treated individual to the control individual with the
# nearest propensity score
treated_indices = data[data["Treatment"] == 1].index
control_indices = data[data["Treatment"] == 0].index
nbrs = NearestNeighbors(n_neighbors=1, algorithm="ball_tree").fit(
    np.reshape(PS[control_indices], (-1, 1))
)
distances, indices = nbrs.kneighbors(
    np.reshape(PS[treated_indices], (-1, 1))
)
matched_control_indices = control_indices[indices.flatten()]

# Duplicate entries can lead to biased estimates
new_control_indices = list(set(matched_control_indices))
control_mean = data.iloc[new_control_indices].RenewalStatus.mean()
treatment_mean = data.iloc[treated_indices].RenewalStatus.mean()
effect = treatment_mean - control_mean
print("Direct PSM:")
print("Effect on renewal rates w/Product A > 40%: ", round(effect * 100), "%")
print()

This code snippet implements propensity score matching to test
whether a four-product dataset will generate higher renewal rates
when the usage of the first product `A` is greater than 40%.

- Create random data with four products `A`, `B`, `C`, and `D`; a binary column `RenewalStatus` indicating whether a customer renewed their subscription; and a binary column `Treatment` identifying customers with product `A` usage over 40%.
- Select covariates `B`, `C`, and `D` and fit a logit to predict whether a customer's usage of product `A` is above or below 40%.
- Given the logit probabilities, pair customers in the control and treatment groups with similar probabilities using *k*-nearest neighbours.
- Compare the mean renewal rates for the control and treatment groups to determine the effect on renewal when product `A` usage is above 40%.

Python provides the package `psmpy` to perform propensity score
matching directly. The following code uses `psmpy` on the same
random data as the original example.

# psmpy
from psmpy import PsmPy

psm_data = data.copy()
psm_data["idx"] = psm_data.index

# Create propensity score matching (psm) data structure
psm = PsmPy(
    psm_data, treatment="Treatment", indx="idx", exclude=["A", "RenewalStatus"]
)

# Apply logit for propensity probabilities
psm.logistic_ps(balance=True)

# Match control and treatment pairs based on propensity scores
psm.knn_matched(
    matcher="propensity_logit",
    replacement=False,
    caliper=None,
    drop_unmatched=False,
)

# Compare mean renewal rates between the treatment and control groups
effect_tbl = (
    psm_data[["RenewalStatus", "Treatment"]]
    .groupby(by="Treatment")
    .aggregate(["mean", "var", "std"])
)
effect_tbl.columns = ["Mean", "Var", "Std"]
effect = effect_tbl.iloc[1]["Mean"] - effect_tbl.iloc[0]["Mean"]
print("psmpy:")
print("Effect on renewal rates w/Product A > 40%: ", round(effect * 100), "%")

Although the effect scores are not identical, they are close,
suggesting `psmpy` performs certain steps slightly differently than
the direct code.

Direct PSM:
Effect on renewal rates w/Product A > 40%: -66 %
psmpy:
Effect on renewal rates w/Product A > 40%: -67 %