Machine Learning Intro: A Simple Model For A Growing Ecommerce Startup


This is an introductory article for people who are curious about machine learning.

Artificial Intelligence may shape our future more than anything else this century.
When we think of the Hollywood version of artificial intelligence — a super intelligent machine (that is much smarter than the best human minds in every subject) which can make jokes, trade stocks, manipulate people, and reprogram itself — we are not there…yet.
This version of artificial intelligence is called Artificial Superintelligence or ASI. When we’re at this point, we may or may not survive.
Machine Learning is a subfield of artificial intelligence. Machine Learning is a method of data analysis where the goal is to enable computers to learn on their own.
How is machine learning being used today?

Here are a few examples:
Let’s say you’re an innovative eCommerce company that sells AI-inspired sunglasses.
Your sunglasses are unique because the technology in them can recognize popular brands and products people are wearing. Customers love wearing them out at concerts because the sunglasses allow a person to pick up on trends and what’s in style. For instance, a customer might notice someone wearing a pair of fresh sneakers. With your AI-inspired sunglasses, a customer could find out they’re limited edition Nike Air Max.
In the past year, your sunglasses have become the go-to sunglasses brand for outdoor events and concerts.
You also have a store in the Short North (Columbus, Ohio) where you provide demos and styling consultations. Customers are able to go back home and order their choice of sunglasses from your website or mobile app.
As an owner of a growing business, you’re trying to decide whether to focus more on the mobile app experience or website. Most of your sales come from your website but mobile sales are growing faster.

Develop Business Case

Your proposition (aka hypothesis) is that your mobile app sales will overtake your website sales, and therefore that’s where you think you should invest. However, you’re not certain, and you wish there were a way to make a more informed, data-driven decision.
You realize you have some useful data on your customers that could help with this. You’ve also read of how companies are using machine learning for problems related to pricing models, sentiment analysis, and customer segmentation.
You decide to reach out to your friends at Northpeak, a digital innovation studio that helps founders with solutions in data analytics, design, and marketing.
You come to learn that machine learning is a method of data analysis. This method uses algorithms that iteratively learn from data.

You also learn that most of a data scientist’s time is actually spent on acquiring, cleaning, and exploring data. Pradeep Menon of Alibaba Cloud estimates 80% of your time is spent there, while only 20% is on modeling, deploying, and evaluating.

At the highest level, there are 3 main types of Machine Learning algorithms:
1. Supervised Learning.
2. Unsupervised Learning.
3. Reinforcement Learning.
Supervised learning algorithms are trained using labeled examples, where the desired output is known. For example, if you manufacture equipment, you might label each data point either “F” (failed) or “R” (runs).
Unsupervised learning is used on data that has no historical labels. In this case, you don’t have a known answer; instead, you’re looking for structure within the data. Think about a mobile company that needs to segment its customers for marketing and sales campaigns.
With reinforcement learning, the algorithm learns through trial and error which action produces the greatest rewards. It’s often used in robotics, gaming, and navigation.
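To make the difference between the first two types concrete, here is a minimal sketch using scikit-learn. Everything in it is an assumption for illustration: the eight sensor readings are made up, and LogisticRegression and KMeans are just two convenient stand-ins for "a supervised algorithm" and "an unsupervised algorithm".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Hypothetical sensor readings for eight machines (one feature each).
readings = np.array([[1.0], [1.2], [0.9], [1.1], [5.0], [5.2], [4.9], [5.1]])

# Supervised: every reading comes with a known label, "R" (runs) or "F" (failed).
labels = np.array(["R", "R", "R", "R", "F", "F", "F", "F"])
classifier = LogisticRegression().fit(readings, labels)
predicted = classifier.predict(np.array([[1.05], [5.05]]))
print(predicted)  # labels predicted for two new, unlabeled readings

# Unsupervised: no labels are given; KMeans looks for two groups on its own.
clusterer = KMeans(n_clusters=2, n_init=10, random_state=0).fit(readings)
cluster_ids = clusterer.labels_
print(cluster_ids)  # the cluster numbers are arbitrary; only the grouping matters
```

The supervised model needs the "F"/"R" answers up front, while the clustering step only ever sees the raw readings.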
In our case, we can use a supervised learning algorithm, specifically a linear regression model, to predict whether investing in your mobile app or website will lead to additional revenue in the future.

Acquire Data

The second step for you is to gather all your data into a place where you can work with it. As a growing but relatively early stage startup, your best bet is to use third-party tools like Redshift to set up your data infrastructure.
With the improvements in technology and a growing number of tools, this is what most early-stage startups do. The idea is to create pipelines from disparate data sources and funnel the data into a single data warehouse.
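On a very small scale, funneling disparate sources into one place looks something like the sketch below. The two tables, the email addresses, and the numbers are all hypothetical, and a pandas merge stands in for what a real warehouse pipeline would do.

```python
import pandas as pd

# Hypothetical extracts from two separate systems, keyed by customer email.
web_sessions = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "c@example.com"],
    "Time on Website": [37.2, 39.5, 36.1],
})
app_sessions = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com", "d@example.com"],
    "Time on App": [12.6, 11.1, 13.0],
})

# An outer join keeps every customer, even those seen in only one source.
customers = web_sessions.merge(app_sessions, on="Email", how="outer")
print(customers)
```

Customers who appear in only one source end up with a missing value in the other column, which is one reason the cleaning step matters.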
For this analysis, we’ll be using Python in Jupyter, along with pandas, NumPy, and other data science libraries. Again, this is meant to be a simple example for those looking to better understand how you can use machine learning for practical business applications. So if you don’t understand the code or some of the tools mentioned, that is totally fine; just skip it!
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
The data we’ll use includes customer info, such as Email, Address, and the Avatar color. It also has numerical value columns:
Note: This data is artificial.
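If you want something to experiment with, one option is to generate a stand-in dataset yourself. In the sketch below, the column names are the ones used in the plotting code in this article, but the sample size, the averages, and the formula for yearly spending are invented assumptions, not the values behind the results reported here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
n_customers = 500  # assumed sample size

customers = pd.DataFrame({
    "Time on App": rng.normal(12.0, 1.0, n_customers),
    "Time on Website": rng.normal(37.0, 1.0, n_customers),
    "Length of Membership": rng.normal(3.5, 1.0, n_customers),
})

# Assumed relationship, invented for illustration only.
customers["Yearly Amount Spent"] = (
    40.0 * customers["Time on App"]
    + 0.5 * customers["Time on Website"]
    + 60.0 * customers["Length of Membership"]
    - 215.0
    + rng.normal(0.0, 10.0, n_customers)  # random noise
)
print(customers.head())
```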

Clean and Explore Data

Because data is usually messy, most of the time you will need to clean your data. We’re going to skip this step for the purposes of this hypothetical situation (and because it’s really not a fun process:)).
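For the curious, a cleaning pass often boils down to a handful of steps like the ones sketched here. The messy table is hypothetical, and the choices made (drop exact duplicates, drop rows with no spending figure, fill missing app time with the median) are assumptions that would depend on your own data.

```python
import numpy as np
import pandas as pd

# Hypothetical raw export: a duplicated row, numbers stored as text, missing values.
raw = pd.DataFrame({
    "Email": ["a@example.com", "a@example.com", "b@example.com", "c@example.com"],
    "Time on App": ["12.6", "12.6", "11.1", None],
    "Yearly Amount Spent": [587.9, 587.9, np.nan, 392.2],
})

cleaned = raw.drop_duplicates().copy()                           # remove exact duplicate rows
cleaned["Time on App"] = pd.to_numeric(cleaned["Time on App"])   # text -> numbers
cleaned = cleaned.dropna(subset=["Yearly Amount Spent"]).copy()  # target must be known
median_app_time = cleaned["Time on App"].median()
cleaned["Time on App"] = cleaned["Time on App"].fillna(median_app_time)
print(cleaned)
```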
Explore your data
Once your data has been cleaned and standardized, you can quickly start to get a bird’s eye view from summary statistics. We can see from the output below that customers spend more time on your website and that the mean amount spent is roughly $494/year.
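In pandas, summary statistics like these usually come from a single call to describe(). Here is a minimal sketch on a tiny, made-up table (the four rows are hypothetical and are not the article’s data):

```python
import pandas as pd

# Hypothetical four-customer sample, just to show the call.
customers = pd.DataFrame({
    "Time on App": [12.6, 11.1, 13.0, 12.0],
    "Time on Website": [37.2, 39.5, 36.1, 37.0],
    "Yearly Amount Spent": [587.9, 392.2, 610.4, 499.3],
})

summary = customers.describe()  # count, mean, std, min, quartiles, max per column
print(summary.loc[["mean", "min", "max"]])
```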
As mentioned earlier, before we train and test the model, we should do some data exploration to get a better feel of what’s going on.

You might start by comparing how time on the app and time on the website each relate to yearly amount spent.

sns.jointplot(x='Time on App', y='Yearly Amount Spent', data=customers)
sns.jointplot(x='Time on Website', y='Yearly Amount Spent', data=customers)
From above, it seems like there’s a more linear relationship with time on app and yearly amount spent versus time on website and yearly amount spent. Interesting.
Another question you might ask is, what’s the relationship between time on app and length of membership?

Hexbin joint plot using seaborn:

sns.jointplot(x='Time on App', y='Length of Membership', kind='hex', data=customers)
There doesn’t seem to be a strong linear relationship here.
You can continue exploring your data with tools such as pairplots to get a better understanding of the correlation between different variables.
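If you prefer numbers to charts, a correlation table gives a similar overview. The sketch below uses synthetic data in which, by assumption, spending depends on app time and not on website time, so it only demonstrates the technique and says nothing about the article’s real figures.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)
n = 200
time_on_app = rng.normal(12.0, 1.0, n)
time_on_website = rng.normal(37.0, 1.0, n)
# Assumed for illustration: spending is driven by app time plus noise.
yearly_spent = 40.0 * time_on_app + rng.normal(0.0, 5.0, n)

customers = pd.DataFrame({
    "Time on App": time_on_app,
    "Time on Website": time_on_website,
    "Yearly Amount Spent": yearly_spent,
})

correlations = customers.corr()  # pairwise Pearson correlations
print(correlations["Yearly Amount Spent"].sort_values(ascending=False))
```

Values near 1 or -1 point to a strong linear relationship, and values near 0 to a weak one.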

Train Model

The next step is to split your data into a training set and a testing set. The idea here is to take a portion of our data and use it to train our model. We will then use the test dataset to see how accurately our model predicts by comparing its predictions to the actual output.
Again, that’s why it’s called supervised learning: we already know the answer or desired output. We continue this process of iteration (see first chart) until we feel comfortable with the accuracy and evaluation of our model.
Below, we hold out 30% of the data (test_size=0.3) as the test set and use the rest for training our model.
Note: The first step would be to start by setting your X and Y variables to the appropriate columns. Next use model_selection.train_test_split from sklearn to split the data into training and testing sets.
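from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
# X (feature columns) and y (target column) are assumed to be set already, as described in the note above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
lm = LinearRegression()
lm.fit(X_train, y_train)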
Now we print out the coefficients of the model. We interpret these coefficients further down below.

print('Coefficients: \n', lm.coef_)

[0.17825135 43.86015875 0.34813890 63.04039211]
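A bare list of numbers is hard to read, so it can help to pair each coefficient with its column name. The sketch below runs the whole split-and-fit step on synthetic data. The three feature columns, the formula behind the spending figure, and random_state=101 are all assumptions made for illustration, so the coefficients it prints will not match the ones above.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=1)
n = 500
customers = pd.DataFrame({
    "Time on App": rng.normal(12.0, 1.0, n),
    "Time on Website": rng.normal(37.0, 1.0, n),
    "Length of Membership": rng.normal(3.5, 1.0, n),
})
# Assumed relationship, invented for illustration only.
customers["Yearly Amount Spent"] = (
    40.0 * customers["Time on App"]
    + 0.5 * customers["Time on Website"]
    + 60.0 * customers["Length of Membership"]
    + rng.normal(0.0, 10.0, n)
)

feature_columns = ["Time on App", "Time on Website", "Length of Membership"]
X = customers[feature_columns]
y = customers["Yearly Amount Spent"]

# Hold out 30% for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101
)

lm = LinearRegression()
lm.fit(X_train, y_train)

# One coefficient per feature, in the same order as the columns of X.
coefficients = pd.Series(lm.coef_, index=feature_columns, name="Coefficient")
print(coefficients)
```

Each coefficient is the change in predicted yearly spend for a one-unit increase in that feature, holding the other features fixed.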


Predict Test Data

Now that we have fit our model, let’s evaluate its performance by predicting off the test values.
We can use lm.predict() to predict off the X_test set of the data.
Looking at the scatterplot of the real test values versus the predicted values, we can see that our model is actually doing pretty well!
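Here is a minimal sketch of that prediction step on synthetic data. The single feature, the spending rule, and the random seeds are assumptions for illustration. Instead of drawing a chart, it prints a few actual and predicted values side by side and measures how closely they track each other.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=2)
X = rng.normal(12.0, 1.0, size=(300, 1))         # hypothetical "Time on App"
y = 40.0 * X[:, 0] + rng.normal(0.0, 5.0, 300)   # assumed spending rule plus noise

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)
lm = LinearRegression().fit(X_train, y_train)
predictions = lm.predict(X_test)  # one prediction per held-out customer

# Side-by-side view of the first few actual and predicted values.
for actual, predicted in zip(y_test[:5], predictions[:5]):
    print(f"actual={actual:8.2f}  predicted={predicted:8.2f}")

# Correlation between actual and predicted values; closer to 1 is better.
agreement = np.corrcoef(y_test, predictions)[0, 1]
print(f"correlation: {agreement:.3f}")
```

In a notebook, plt.scatter(y_test, predictions) would show the same comparison as a chart: the closer the points sit to a straight diagonal line, the better the predictions.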

Evaluate Model

MAE: 6.918249653330853
MSE: 78.97115165397457
RMSE: 8.71812066078243
Our R-squared value actually came out to 99%. For those not familiar with R-squared, it’s a measurement of how much of the variance in the outcome your model explains. So our model fits the test data very well.
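If you are wondering where these four numbers come from, each one is a short formula applied to the prediction errors. The sketch below works them out by hand with NumPy on four made-up customers (the actual and predicted values are hypothetical).

```python
import numpy as np

y_true = np.array([500.0, 420.0, 610.0, 480.0])  # hypothetical actual yearly spend
y_pred = np.array([510.0, 415.0, 600.0, 490.0])  # hypothetical model predictions

errors = y_true - y_pred

mae = np.mean(np.abs(errors))  # Mean Absolute Error: average size of a miss
mse = np.mean(errors ** 2)     # Mean Squared Error: penalizes large misses more
rmse = np.sqrt(mse)            # Root MSE: back in the same units as the target

# R-squared: 1 minus (unexplained variation / total variation around the mean).
total_variation = np.sum((y_true - y_true.mean()) ** 2)
r_squared = 1.0 - np.sum(errors ** 2) / total_variation

print(f"MAE: {mae}")
print(f"MSE: {mse}")
print(f"RMSE: {rmse:.4f}")
print(f"R-squared: {r_squared:.4f}")
```

scikit-learn ships the same calculations as mean_absolute_error, mean_squared_error, and r2_score in sklearn.metrics, so in practice you would rarely write them out yourself.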
Looking at the histogram below, we can confirm that the discrepancy between our data and estimated model was fairly minimal and normally distributed.
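The discrepancies in question are the residuals: actual value minus predicted value for each test customer. As a rough sketch of how you might inspect them, the code below fits a one-feature line with np.polyfit on synthetic data (the data, the 280/120 split, and the spending rule are assumptions for illustration) and prints a text histogram of the test residuals.

```python
import numpy as np

rng = np.random.default_rng(seed=3)
x = rng.normal(12.0, 1.0, 400)               # hypothetical "Time on App"
y = 40.0 * x + rng.normal(0.0, 5.0, 400)     # assumed spending rule plus noise

# Fit a straight line on the first 280 customers, test on the remaining 120.
slope, intercept = np.polyfit(x[:280], y[:280], deg=1)
residuals = y[280:] - (slope * x[280:] + intercept)

# Bucket the residuals into 10 bins and draw one '#' per customer in each bin.
counts, bin_edges = np.histogram(residuals, bins=10)
for count, left_edge in zip(counts, bin_edges[:-1]):
    print(f"{left_edge:7.2f} | {'#' * int(count)}")
```

A roughly bell-shaped pile centered near zero suggests the model is not systematically over- or under-predicting.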
So, do we focus our efforts on mobile app or website development?
How can you interpret these coefficients?
Well, it seems like we should explore the relationship between length of membership and the app and the length of membership and website before coming to a conclusion.
It actually seems like maybe length of membership might be more important than whether to invest in the mobile experience or website!
In fact, the answer is a bit more complex.
You could invest more in the website so it catches up to the performance of the app, or develop the app further since it’s already working better. Here is where someone who understands both the data science side and the business side is extremely useful.
You really would want to understand all the factors of the company and costs before making a decision. However, what’s important here is the approach and using all available information to make a strategic business decision.
Hopefully, as a business owner, you now have a better idea of how you can utilize machine learning to make more data-driven decisions.
Thank you!