On this episode, CodeEmporium talks about how to build time series forecasts and how to do it with machine learning.
hello everyone today i wanted to talk
about
a concept that data scientists tend to
tackle in their day jobs
time series analysis now i don’t talk
too much about my personal
or even my professional life outside of
youtube
but now that i’ve worked as a full-time
associate data scientist for about a
year
i think i can spice up my content with
some experience
this could be useful to you as a lot of
you as i’ve seen are not just
ml curious you also want to get a
full-time job in the field and some of
you
from at least the comments that i read
are already established professionals
but regardless always excited to talk to
new people
and everyone is welcome equally looking
at
resources across the internet that talk
about time series problems
they almost always take a traditional
approach
eyes closed throw all the data into
some arima model and some magic happens
unfortunately these models can be very
difficult to tune if you aren’t an
expert
but luckily you really don’t need to be
an expert with dissecting time series
concepts specifically
to get usable results from a time series
model
in fact you can take some machine
learning approach instead
and also get results that are just as
good
in this video we are going to walk
through the typical flow to solve time
series problems
we’re going to see what different
approaches we can take to solve such
time series problems
and also highlight the differences
between traditional approaches and
machine learning approaches
so you know what techniques to use and
also
when to use them but before we continue
this video is sponsored partially by
kite they provide a code completion
service for
machine learning code it integrates
super well with your editors and even
jupyter notebooks
so click the link in the description to
try kite for free
now back to the video let’s first define
a concrete problem where time series
is useful so that’s step one
think about your grandma she started
this laptop repair line
two years ago and it’s a hit the way it
works is a customer places an
order request online the customer then
ships the broken laptop to grandma
then she and her workers fix them and
the laptops are sent back
the problem here is her workers are paid
by the hour
if there are more laptops to repair
grandma calls in more workers
but as you can imagine it’s hard to know
how many workers we need without knowing
the number of laptops that we
get per day now grandma hired you
as a data scientist and you think what
could be useful to know
for grandma is how many laptops are we
going to receive tomorrow
this way we can call the required number
of workers to come in tomorrow
so now that we have this defined problem
i think it’s easier to move forward
so step two what data do we have
for now let’s say that we store every
work order
in an orders table when a customer makes
an order request online
a row in the orders table is added this
table
has information like order id the price
and the timestamp when the order was
made
for simplicity let’s say that there are
no log errors no missing values and no
sparse data
so step three what is the data telling
us
well from here we do some exploratory
data analysis or eda
and come up with a few approaches the
natural structure of this problem is a
time series
problem from the orders table we can
aggregate the data at the daily level to
get a time
series of number of order requests per
day
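
The aggregation step just described can be sketched with the Python standard library. This is a minimal illustration, not the setup used in the video: the `daily_order_counts` helper and the sample timestamps are made up, standing in for the timestamp column of the orders table.

```python
from collections import Counter
from datetime import date, datetime, timedelta

def daily_order_counts(timestamps):
    """Count orders per calendar day, filling days without orders with 0."""
    counts = Counter(ts.date() for ts in timestamps)
    if not counts:
        return {}
    day, last = min(counts), max(counts)
    series = {}
    while day <= last:
        series[day] = counts.get(day, 0)
        day += timedelta(days=1)
    return series

# Hypothetical timestamps standing in for the orders table's timestamp column.
orders = [
    datetime(2021, 3, 1, 9, 15),
    datetime(2021, 3, 1, 17, 40),
    datetime(2021, 3, 2, 11, 5),
    datetime(2021, 3, 4, 8, 30),
]
series = daily_order_counts(orders)
for day, n in series.items():
    print(day, n)
```

Filling empty days with zero keeps the series evenly spaced, which most forecasting methods expect.
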
when doing our exploratory analysis i’d
like to understand the bread and butter
of the time series that is trend and
seasonality so first of all does this
data exhibit seasonality
for this case let’s say that we see some
weekly seasonality in the data
basically we are hit relatively hard on
mondays low on thursdays and so on
so weekly seasonality exists
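
One simple way to check for weekly seasonality is to average the daily counts by weekday. The sketch below assumes a made-up four-week series with busy Mondays and quiet Thursdays; `mean_by_weekday` is an illustrative helper, not something from the video.

```python
from collections import defaultdict
from datetime import date, timedelta

def mean_by_weekday(series):
    """Average the daily values for each weekday (0 = Monday ... 6 = Sunday)."""
    buckets = defaultdict(list)
    for day, value in series.items():
        buckets[day.weekday()].append(value)
    return {wd: sum(vals) / len(vals) for wd, vals in sorted(buckets.items())}

# Made-up data: four weeks that repeat the same Monday-to-Sunday pattern.
weekly_pattern = [30, 22, 20, 12, 18, 15, 14]
start = date(2021, 3, 1)  # a Monday
series = {start + timedelta(days=i): weekly_pattern[i % 7] for i in range(28)}

profile = mean_by_weekday(series)
names = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
for wd, avg in profile.items():
    print(names[wd], avg)
```

If the weekday averages differ clearly and consistently, as they do in this toy data, that points to weekly seasonality.
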
now question two do we see trend in this
data
over time we’d probably want to know if
the orders have been increasing or
decreasing in volume
we also want to explain the trend
changes that we see like in the middle
of july we see a huge
increase in the trend of sales and this
probably happens because we launched a
promotional event that month
and we see another trend change in
october and mostly because we changed
our payment strategy
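
To look at trend separately from the weekly pattern, one common trick is a moving average whose window matches the seasonal period. The sketch below uses made-up numbers with a level shift on day 14, standing in for something like a promotion. It is an illustration only, not the analysis from the video.

```python
def moving_average(values, window=7):
    """Trailing moving average; returns one value per full window."""
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]

# Made-up series: a weekly pattern plus a jump of 10 orders per day from day 14 on.
weekly_pattern = [30, 22, 20, 12, 18, 15, 14]
values = [weekly_pattern[i % 7] + (10 if i >= 14 else 0) for i in range(28)]

trend = moving_average(values, window=7)
print(round(trend[0], 2), round(trend[-1], 2))
```

Because every 7-day window contains each weekday exactly once, the weekly ups and downs cancel and what remains is the change in level.
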
now that we have like a basic
understanding of our data
how do we make predictions
if we think of this as a traditional
time series forecasting problem
there are several approaches like arima
prophet
neuralprophet and vector autoregression
let’s briefly talk about these
so to make a prediction in arima we need
to identify the trend and seasonality
components
and transform our data accordingly this
kaggle notebook here does a good job in
following the step-by-step procedure to
use
arima but all of this processing and
testing is only good for a basic toy
data set here
it might be a little tough to get these
predictions right and tuned for more
complex problems
especially if you aren’t an expert with
time series in general
as an alternative we could use
facebook’s prophet model
prophet handles missing data better and
it can take
data with seasonality and trends and it
produces stellar results that even
rival a tuned arima and its other
flavors like seasonal arima
we can get even better results using
neuralprophet
this increases the forecasting accuracy
of the prophet model by using a neural
network
scrolling down here i can see some main
features that are added to this model
but a big disadvantage of prophet models
at least from what i’ve seen
is that we cannot add regressors for
which we don’t have future values
okay so um let me explain this with
grandma’s laptop company
we want to know how many laptops will we
get tomorrow
to make this prediction we use the
number of laptops we got yesterday
two days ago three days ago and so on
aside from that we can pass in a day of
week predictor
for you know some seasonality and we can
pass this in because
for tomorrow we already know what the
day will be
even though the day hasn’t happened yet
but you know what else could be useful
in predicting the number of laptops
tomorrow
it would be something like the number of
orders that were placed
online over the past few days laptops
come in about three to four days after
the orders are placed online
so using order information could come in
quite handy to determine the number of
laptops you’ll receive
however we cannot add this
order information directly to the prophet
model
since we cannot feed it a predictor for
which we don’t have the future value
i can’t tell you the number of orders
that happen tomorrow since it hasn’t
happened yet
in fact the number of orders per day
forms another time series on its own too
prophet and arima models they fall into
the category of
univariate models we forecast and we
only deal with
a single time series but with
multivariate time series models
we can deal with and even forecast the
output of multiple time
series an example of this is vector
auto-regression models
the input could be the past inbound
volume time series
the past order time series and some day
of week predictors
and the output could be the forecasted
inbound volume and the forecasted order
volume
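
The idea of forecasting several series jointly can be sketched with a first-order vector autoregression, where each series' next value depends on the previous values of all series. The coefficients below are hypothetical and not fitted to any data, and a real model would usually use more lags and could also take the day-of-week predictors mentioned above.

```python
def var1_step(prev, A, c):
    """One VAR(1) step: next[i] = c[i] + sum_j A[i][j] * prev[j]."""
    return [
        c[i] + sum(A[i][j] * prev[j] for j in range(len(prev)))
        for i in range(len(prev))
    ]

def var1_forecast(last, A, c, steps):
    """Forecast all series jointly by feeding each prediction back in."""
    path, state = [], list(last)
    for _ in range(steps):
        state = var1_step(state, A, c)
        path.append(state)
    return path

# Hypothetical coefficients, not fitted to any data.
# Series order: [inbound laptops, online orders].
A = [[0.5, 0.4],   # inbound depends on past inbound and past orders
     [0.0, 0.9]]   # orders depend only on their own past
c = [2.0, 3.0]

forecast = var1_forecast([20.0, 30.0], A, c, steps=2)
print(forecast)
```

Each step produces a forecast for both the inbound volume and the order volume, which is what separates this from the univariate models.
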
the main downside here for vector auto
regression
is that we might need a lot more quality
data to come up with reasonable
predictions
as opposed to the univariate approaches
since we’re forecasting multiple series
here
but depending on your situation and your
data one or the other
may be useful now all these methods that
we’ve discussed so far
are approaches where we treat the
problem as a time series problem
but we could very well convert this to a
traditional machine
learning regression problem so in
grandma’s laptop example
if we’re making a prediction today the
features could be something like
how many laptops did we get this day
last week
what was the standard deviation of the
inbound volume
over the last week what was the number
of orders that
have been placed but haven’t been
fulfilled yet can we also add in day of
week predictors to account for the
weekly seasonality
and almost any other predictor and
anything else that you can think of
the label is what we want to predict
and in this case it is the inbound
volume for
tomorrow and we’d want to frame our
training data
in this way a set of x’s and y’s
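
Framing the problem as x's and y's could look roughly like the sketch below. The feature names, the `make_training_rows` helper, and all of the numbers are assumptions made for illustration. Each row describes "today" and the label is tomorrow's inbound volume.

```python
from datetime import date, timedelta
from statistics import pstdev

def make_training_rows(days, inbound, open_orders):
    """Build (features, label) pairs; the label is the next day's inbound volume.

    days, inbound and open_orders are parallel lists with one entry per day.
    """
    X, y = [], []
    # Each row needs 7 days of history behind it and 1 day ahead for the label.
    for t in range(7, len(days) - 1):
        tomorrow = days[t] + timedelta(days=1)
        X.append({
            "inbound_this_day_last_week": inbound[t - 7],
            "inbound_std_last_7_days": pstdev(inbound[t - 6 : t + 1]),
            "open_orders": open_orders[t],
            "target_weekday": tomorrow.weekday(),  # known ahead of time
        })
        y.append(inbound[t + 1])
    return X, y

# Made-up data: three weeks of inbound volume and unfulfilled order counts.
weekly_pattern = [30, 22, 20, 12, 18, 15, 14]
days = [date(2021, 3, 1) + timedelta(days=i) for i in range(21)]
inbound = [weekly_pattern[i % 7] for i in range(21)]
open_orders = [40 + i for i in range(21)]

X, y = make_training_rows(days, inbound, open_orders)
print(len(X), X[0], y[0])
```

Any regression model that accepts tabular features can then be trained on `X` and `y`.
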
so what kind of models are we talking
about here
let’s paint this picture in a hierarchy
in the form of a chart
all the models for which we can do time
series modeling can be classified as
traditional time series models and
machine learning
models time series models can be further
classified as univariate and
multivariate depending on the number of
series we
are predicting univariate time series
models which
predict the output of one time series
they can be
arima models they could be the sarima or
that’s
seasonal arima models which is basically
an arima model that takes into account
seasonality
but they can also be like prophet models
which is facebook’s solution to time
series forecasting that you can use
without being an expert in time series
analysis then we also have neuralprophet
which is the neural network version of
prophet
and then we have also multivariate time
series models
so we can forecast multiple time series
an example of this would be the vector
auto regression
models that i talked about and this
along with the arima family of
models actually has good
implementations in
python’s statsmodels library
so if you’re looking to code this out
check out statsmodels
and now on the machine learning front we
can basically
use any type of regression model to do
the time series job
so it could be a neural net regressor
where these are neural networks
with one output neuron that determines
the label of your regression
we can even use something like catboost
regressors this is pretty cool because
it allows for better feature engineering
like random forest in scikit-learn it
also gives really cool diagrammatic
representations of feature importance
it’s really easy to check if there’s
overfitting
but unlike you know random forest
decision trees it’s actually a better
model altogether
based off of gradient boosted
decision trees actually so it’s totally
worth checking out catboost
and honestly any other regression model
could work here too
now that we’ve outlined some examples
let’s look at the core differences
between traditional time series models
and machine learning models for time
series forecasting
traditional time series forecasting is
recursive
in grandma’s company we thought making
predictions for tomorrow would be enough
but it looks like the workers need more
of a heads up than just a single day
so she now wants to know the number of
laptops that we will receive
three days from now now to determine the
inbound volume three days from now
the traditional time series way would be
that we determine the inbound volume one
day from now
use that to determine the two day out
prediction
and then use this to determine the three
day out prediction
the machine learning approach though we
can forecast this directly
so we would directly know the three day
out forecast
if our model is trained to do so
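
The difference between the recursive and the direct way of getting a three-day-out forecast can be sketched with a deliberately tiny model: a straight-line fit from one day's value to a later day's value. This is an illustration only. The helper names and data are made up, and real models on either side would be much richer.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single predictor."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return my - b * mx, b

def fit_lag_model(series, horizon):
    """Fit series[t + horizon] ~ a + b * series[t]."""
    return fit_line(series[:-horizon], series[horizon:])

def recursive_forecast(series, steps):
    """Fit a one-step model, then feed each prediction back in as the input."""
    a, b = fit_lag_model(series, horizon=1)
    value = series[-1]
    for _ in range(steps):
        value = a + b * value
    return value

def direct_forecast(series, horizon):
    """Fit a separate model that maps today straight to `horizon` days out."""
    a, b = fit_lag_model(series, horizon)
    return a + b * series[-1]

# Made-up inbound volumes growing by 2 laptops per day.
series = [10 + 2 * t for t in range(20)]
print(recursive_forecast(series, steps=3), direct_forecast(series, horizon=3))
```

On this perfectly linear toy series both routes agree. On real data they generally differ, because the recursive route compounds its own one-step errors while the direct route needs a model trained for that specific horizon.
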
so here’s the second difference time
series models are easily
extendable so now in grandma’s warehouse
we also need to determine
the long-term space arrangements in the
warehouse
but to do this we need to know the
inbound volume 10 days
in advance well that’s okay with our
traditional time series model because we
just need to keep recursively making
predictions until we get the 10 day out
volume
no change in training data no change in
the model now if we want to do that for
a machine learning model though we need
to modify our training data
we need to train the model to predict 10
days out
too in addition to the three days out
model
and this could scale the training data
linearly as
we have more horizons to predict
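
To make the point about training data concrete, here is a rough sketch of building one supervised dataset per forecast horizon from the same series. `make_direct_dataset`, the lag count, and the data are all assumptions for illustration.

```python
def make_direct_dataset(series, n_lags, horizon):
    """Pair the last n_lags values at time t with the value `horizon` steps later."""
    X, y = [], []
    for t in range(n_lags - 1, len(series) - horizon):
        X.append(series[t - n_lags + 1 : t + 1])
        y.append(series[t + horizon])
    return X, y

series = list(range(100, 130))  # 30 made-up daily volumes

# One dataset (and so one trained model) per horizon we care about.
datasets = {h: make_direct_dataset(series, n_lags=7, horizon=h) for h in (1, 3, 10)}
for h, (X, y) in datasets.items():
    print(h, len(X), X[0], y[0])
```

Every extra horizon adds another set of rows to build, store, and train on, which is where the linear growth comes from.
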
and the third difference well with
traditional time series approaches
they can be pretty tough to get right
unless you’re an expert with time series
models
while the machine learning approach is a
lot more tractable for people who don’t
know much about time series forecasting
although i will say the profit models
are an exception here
now a fourth difference so time series
models at least the univariate ones
we can’t add regressors for which we
don’t know the future values
while the machine learning models we can
add these regressors
that allows us to better fine-tune the
models
so clearly each has their advantages and
disadvantages
and depending on the problem you’re
solving the data that you have
and the hardware capacity one of these
solutions may be preferable to
the other
hope this gives you a better intuition
on different approaches to time series
forecasting
you don’t need to be an expert i have a
year in the field and
i’m just getting my feet wet everything
in this video is my experience dealing
with time series-esque problems
so let me know in the comments below
what your experiences are with time
series forecasting
would love to listen to them i’ll see
you in the next one take care till then
bye bye