In this episode, CodeEmporium talks about how to build time series forecasts and how to do it with machine learning.

hello everyone today i wanted to talk

about

a concept that data scientists tend to

tackle in their day jobs

time series analysis now i don’t talk

too much about my personal

or even my professional life outside of

youtube

but now that i’ve worked as a full-time

associate data scientist for about a

year

i think i can spice up my content with

some experience

this could be useful to you as a lot of

you as i’ve seen are not just

ml curious you also want to get a

full-time job in the field and some of

you

from at least the comments that i read

are already established professionals

but regardless always excited to talk to

new people

and everyone is welcome equally looking

at

resources across the internet that talk

about time series problems

they almost always take a traditional

approach

eyes closed throw all the data into

some arima model and some magic happens

unfortunately these models can be very

difficult to tune if you aren’t an

expert

but luckily you really don’t need to be

an expert with dissecting time series

concepts specifically

to get usable results from a time series

model

in fact you can take some machine

learning approach instead

and also get results that are just as

good

in this video we are going to walk

through the typical flow to solve time

series problems

we’re going to see what different

approaches we can take to solve such

time series problems

and also highlight the differences

between traditional approaches and

machine learning approaches

so you know what techniques to use and

also

when to use them but before we continue

this video is sponsored partially by

kite they provide a code completion

service for

machine learning code it integrates

super well with your editors and even

jupyter notebooks

so click the link in the description to

try kite for free

now back to the video let’s first define

a concrete problem where time series

is useful so that’s step one

think about your grandma she started

this laptop repair line

two years ago and it’s a hit the way it

works is a customer places an

order request online the customer then

ships the broken laptop to grandma

then she and her workers fix them and

the laptops are sent back

the problem here is her workers are paid

by the hour

if there are more laptops to repair

grandma calls in more workers

but as you can imagine it’s hard to know

how many workers we need without knowing

the number of laptops that we

get per day now grandma hired you

as a data scientist and you think what

could be useful to know

for grandma is how many laptops are we

going to receive tomorrow

this way we can call the required number

of workers to come in tomorrow

so now that we have this defined problem

i think it’s easier to move forward

so step two what data do we have

for now let’s say that we store every

work order

in an orders table when a customer makes

an order request online

a row in the orders table is added this

table

has information like order id the price

and the timestamp when the order was

made

for simplicity let’s say that there are

no log errors no missing values and no

sparse data

so step three what is the data telling

us

well from here we do some exploratory

data analysis or eda

and come up with a few approaches the

natural structure of this problem is a

time series

problem from the orders table we can

aggregate the data at the daily level to

get a time

series of number of order requests per

day
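
A rough sketch of that daily aggregation, using only the Python standard library. The rows, column order, and the `daily_order_counts` helper are made up for illustration; days with no orders are filled with 0 so the series has no gaps.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical rows from the orders table: (order_id, price, timestamp).
orders = [
    (1, 80.0, datetime(2021, 3, 1, 9, 15)),
    (2, 120.0, datetime(2021, 3, 1, 17, 40)),
    (3, 95.0, datetime(2021, 3, 2, 11, 5)),
    (4, 60.0, datetime(2021, 3, 4, 8, 30)),
]


def daily_order_counts(rows):
    """Count order requests per calendar day, filling quiet days with 0."""
    counts = Counter(ts.date() for _, _, ts in rows)
    first, last = min(counts), max(counts)
    n_days = (last - first).days + 1
    days = [first + timedelta(days=i) for i in range(n_days)]
    return [(day, counts.get(day, 0)) for day in days]


series = daily_order_counts(orders)
for day, n in series:
    print(day, n)
```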

when doing our exploratory analysis i’d

like to understand the bread and butter

of the time series that is trend and

seasonality so first of all does this

data exhibit seasonality

for this case let’s say that we see some

weekly seasonality in the data

basically we are hit relatively hard on

mondays low on thursdays and so on

so weekly seasonality exists
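
One simple way to check for weekly seasonality is to average the daily counts by day of the week. This is only a sketch: the numbers are invented so that Mondays come out high and Thursdays low, and `mean_by_weekday` is a hypothetical helper.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import mean

NAMES = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]

# Invented daily counts for four weeks starting on Monday 2021-03-01.
week_pattern = [30, 22, 20, 12, 18, 15, 16]  # Mon..Sun
start = date(2021, 3, 1)
daily = [(start + timedelta(days=i), week_pattern[i % 7] + (i // 7))
         for i in range(28)]


def mean_by_weekday(series):
    """Average the daily counts separately for each day of the week."""
    buckets = defaultdict(list)
    for day, count in series:
        buckets[NAMES[day.weekday()]].append(count)
    return {name: mean(values) for name, values in buckets.items()}


profile = mean_by_weekday(daily)
for name, avg in profile.items():
    print(name, avg)
```

If the averages differ a lot between weekdays, as they do in this toy data, that points to a weekly pattern worth modeling.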

now question two do we see trend in this

data

over time we’d probably want to know if

the orders have been increasing or

decreasing in volume

we also want to explain the trend

changes that we see like in the middle

of july we see a huge

increase in the trend of sales and this

probably happens because we launched a

promotional event that month

and we see another trend change in

october and mostly because we changed

our payment strategy
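
To look at trend without the weekly ups and downs getting in the way, one option is a 7-day moving average. The sketch below uses an invented series with a level shift halfway through, standing in for something like the promotional event; nothing here comes from real data.

```python
def moving_average(values, window=7):
    """Trailing moving average; a 7-day window averages out a weekly cycle."""
    return [sum(values[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(values))]


# Invented series: a repeating weekly pattern plus a jump of 10 from day 14 on.
week_pattern = [30, 22, 20, 12, 18, 15, 16]
daily = [week_pattern[i % 7] + (10 if i >= 14 else 0) for i in range(28)]

trend = moving_average(daily)
print(trend[0], trend[-1])
```

The smoothed series is flat before the jump and flat again after it, so the change in level stands out even though the raw values swing every week.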

now that we have like a basic

understanding of our data

how do we make predictions

if we think of this as a traditional

time series forecasting problem

there are several approaches like arima

prophet

neuralprophet and vector autoregression

let’s briefly talk about these

so to make a prediction in arima we need

to identify the trend and seasonality

components

and transform our data accordingly this

kaggle notebook here does a good job in

following the step-by-step procedure to

use

arima but all of this processing and

testing is only good for a basic toy

data set here

it might be a little tough to get these

predictions right and tuned for more

complex problems

especially if you aren’t an expert with

time series in general
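
One common example of "transforming the data accordingly" is differencing, which is the "i" in arima. The sketch below is not taken from the notebook mentioned above; it uses an invented series with a linear trend and a weekly pattern, and the `difference` helper is hypothetical.

```python
def difference(values, lag=1):
    """Subtract the value `lag` steps earlier from each observation."""
    return [values[i] - values[i - lag] for i in range(lag, len(values))]


# Invented series: linear trend (+1 per day) plus a fixed weekly pattern.
week_pattern = [30, 22, 20, 12, 18, 15, 16]
daily = [week_pattern[i % 7] + i for i in range(28)]

# Lag-7 differencing cancels the weekly pattern and leaves the weekly growth.
seasonal_diff = difference(daily, lag=7)
# Lag-1 differencing of the result removes the remaining constant growth.
stationary = difference(seasonal_diff, lag=1)
print(seasonal_diff[:3], stationary[:3])
```

On real data the result would not be exactly constant, and deciding how many times to difference is part of the tuning that makes arima hard for non-experts.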

as an alternative we could use

facebook’s prophet model

prophet handles missing data better and

it can take

data with seasonality and trends and it

produces stellar results that even

rival a tuned arima and its other

flavors like seasonal arima

we can get even better results using

neuralprophet

this increases the forecasting accuracy

of the prophet model by using a neural

network

scrolling down here i can see some main

features that are added to this model

but a big disadvantage of prophet models

at least from what i’ve seen

is that we cannot add regressors for

which we don’t have future values

okay so um let me explain this with

grandma’s laptop company

we want to know how many laptops we will

get tomorrow

to make this prediction we use the

number of laptops we got yesterday

two days ago three days ago and so on

aside from that we can pass in a day of

week predictor

for you know some seasonality and we can

pass this in because

for tomorrow we already know what the

day will be

even though the day hasn’t happened yet

but you know what else could be useful

in predicting the number of laptops

tomorrow

it would be something like the number of

orders that were placed

online over the past few days laptops

come in about three to four days after

the orders are placed online

so using order information could come in

quite handy to determine the number of

laptops you’ll receive

however we cannot add this

order information directly to the prophet

model

since we cannot feed it a predictor for

which we don’t have the future value

i can’t tell you the number of orders

that happen tomorrow since it hasn’t

happened yet

in fact the number of orders per day

forms another time series on its own too

prophet and arima models they fall into

the category of

univariate models we forecast and we

only deal with

a single time series but with

multivariate time series models

we can deal with and even forecast the

output of multiple time

series an example of this is vector

auto-regression models

the input could be the past inbound

volume time series

the past order time series and some day

of week predictors

and the output could be the forecasted

inbound volume and the forecasted order

volume
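
To make the "multiple series in, multiple series out" idea concrete, here is a hand-rolled first-order vector autoregression forecast. It is a sketch under loud assumptions: the coefficients are invented rather than fitted, there are no day-of-week predictors, and `var1_forecast` is a hypothetical helper rather than any library's api.

```python
def var1_forecast(last, intercept, coef, steps):
    """Iterate y_t = intercept + coef @ y_{t-1} for `steps` steps.

    `last` holds the most recent value of every series; each forecast
    is a vector with one entry per series.
    """
    forecasts = []
    current = list(last)
    for _ in range(steps):
        current = [
            intercept[i] + sum(coef[i][j] * current[j] for j in range(len(current)))
            for i in range(len(current))
        ]
        forecasts.append(current)
    return forecasts


# Invented coefficients, not fitted to any data. Series order: [inbound, orders].
intercept = [2.0, 5.0]
coef = [
    [0.5, 0.4],  # inbound depends on past inbound and past orders
    [0.0, 0.8],  # orders depend only on past orders
]
last_observed = [20.0, 25.0]

forecasts = var1_forecast(last_observed, intercept, coef, steps=2)
for step, (inbound, orders) in enumerate(forecasts, start=1):
    print(step, round(inbound, 2), round(orders, 2))
```

Every step produces a forecast for both series at once, which is why the order volume does not need to be known in advance.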

the main downside here for vector auto

regression

is that we might need a lot more quality

data to come up with reasonable

predictions

as opposed to the univariate approaches

since we’re forecasting multiple series

here

but depending on your situation and your

data one or the other

may be useful now all these methods that

we’ve discussed so far

are approaches where we treat the

problem as a time series problem

but we could very well convert this to a

traditional machine

learning regression problem so in

grandma’s laptop example

if we’re making a prediction today the

features could be something like

how many laptops did we get this day

last week

what was the standard deviation of the

inbound volume

over the last week what was the number

of orders that

have been placed but haven’t been

fulfilled yet we can also add in day of

week predictors to account for the

weekly seasonality

and almost any other predictor and

anything else that you can think of

the label is what we want to predict

and in this case it is the inbound

volume for

tomorrow and we’d want to frame our

training data

in this way a set of x’s and y’s
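
A minimal sketch of that framing, with an invented inbound series and a hypothetical `make_training_rows` helper. It builds three of the features mentioned above (same weekday last week, spread over the last week, day of week) and leaves out the open-orders feature, since that would need a second table.

```python
from statistics import pstdev

# Invented daily inbound volume; index 0 is assumed to be a Monday.
week_pattern = [30, 22, 20, 12, 18, 15, 16]
inbound = [week_pattern[i % 7] + (i // 7) for i in range(28)]


def make_training_rows(series):
    """Turn a daily series into (features, label) pairs for tomorrow's volume.

    Features only use values observed up to and including "today".
    """
    X, y = [], []
    for today in range(6, len(series) - 1):
        tomorrow = today + 1
        X.append([
            series[tomorrow - 7],                 # same weekday as tomorrow, last week
            pstdev(series[today - 6:today + 1]),  # spread of the last 7 observed days
            tomorrow % 7,                         # day of week of the label (0 = Monday)
        ])
        y.append(series[tomorrow])
    return X, y


X, y = make_training_rows(inbound)
print(len(X), X[0], y[0])
```

The day of week is kept as a plain integer here for brevity; many models would want it one-hot encoded instead.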

so what kind of models are we talking

about here

let’s paint this picture in a hierarchy

in the form of a chart

all the models for which we can do time

series modeling can be classified as

traditional time series models and

machine learning

models time series models can be further

classified as univariate and

multivariate depending on the number of

series we

are predicting univariate time series

models which

predict the output of one time series

they can be

arima models they could be sarima that is

seasonal arima models which are basically

an arima model that takes into account

seasonality

but they can also be prophet models

which is facebook’s solution to time

series forecasting that you can use

without being an expert in time series

analysis then we also have neuralprophet

which is the neural network version of

prophet

and then we have also multivariate time

series models

so we can forecast multiple time series

an example of this would be the vector

auto regression

models that i talked about and this

along with the arima and sarima

models actually have good

implementations in

python’s statsmodels library

so if you’re looking to code this out

check out statsmodels

and now on the machine learning front we

can basically

use any type of regression model to do

the time series job

so it could be a neural net regressor

where these are neural networks

with one output neuron that determines

the label of your regression

we can even use something like catboost

regressors this is pretty cool because

it allows for better feature engineering

like random forest in scikit-learn it

also gives really cool diagrammatic

representations of feature importance

it’s really easy to check if there’s

overfitting

but unlike you know random forest

decision trees it’s actually a better

model altogether

based off of gradient boosted

decision trees actually so it’s totally

worth checking out catboost

and honestly any other regression model

could work here too

now that we’ve outlined some examples

let’s look at the core differences

between traditional time series models

and machine learning models for time

series forecasting

traditional time series forecasting is

recursive

in grandma’s company we thought making

predictions for tomorrow would be enough

but it looks like the workers need more

of a heads up than just a single day

so she now wants to know the number of

laptops that we will receive

three days from now now to determine the

inbound volume three days from now

the traditional time series way would be

that we determine the inbound volume one

day from now

use that to determine the two day out

prediction

and then use this to determine the three

day out prediction
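
The recursive idea can be sketched in a few lines. The one-step model below is a stand-in with invented weights, not a fitted arima, and `recursive_forecast` is a hypothetical helper; the point is only that each prediction is fed back in as if it were an observation.

```python
def recursive_forecast(history, predict_next, steps):
    """Forecast `steps` days ahead by feeding each prediction back in."""
    window = list(history)
    forecasts = []
    for _ in range(steps):
        nxt = predict_next(window)
        forecasts.append(nxt)
        window.append(nxt)  # the prediction becomes an input for the next step
    return forecasts


def one_step_model(window):
    """Stand-in one-day-ahead model with invented weights (not fitted)."""
    return 0.6 * window[-1] + 0.4 * window[-2]


inbound = [20.0, 30.0]
one_day, two_day, three_day = recursive_forecast(inbound, one_step_model, steps=3)
print(round(one_day, 2), round(two_day, 2), round(three_day, 2))
```

Because later steps are built on earlier predictions, any error in the one-day forecast carries into the two-day and three-day forecasts.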

the machine learning approach though we

can forecast this directly

so we would directly know the three day

out forecast

if our model is trained to do so
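
A minimal sketch of the direct approach: the label in the training data is simply the value three days later, so the fitted model jumps straight to that horizon. The series is invented, and a one-feature least squares line (the hypothetical `fit_line`) stands in for whatever regressor you would actually use.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x with a single feature."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


# Invented series with steady growth of 2 laptops per day.
inbound = [10 + 2 * t for t in range(20)]
horizon = 3

# Each training pair maps today's volume to the volume `horizon` days later.
xs = inbound[:-horizon]
ys = inbound[horizon:]
intercept, slope = fit_line(xs, ys)

three_day_forecast = intercept + slope * inbound[-1]
print(round(three_day_forecast, 2))
```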

so here’s the second difference time

series models are easily

extendable so now in grandma’s warehouse

we also need to determine

the long-term space arrangements in the

warehouse

but to do this we need to know the

inbound volume 10 days

in advance well that’s okay with our

traditional time series model because we

just need to keep recursively making

predictions until we get the 10 day out

volume

no change in training data no change in

the model now if we want to do that for

a machine learning model though we need

to modify our training data

we need to train the model to predict 10

days out

too in addition to the three days out

model

and this could scale the training data

linearly as

we have more horizons to predict
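
To see why the training data grows with the number of horizons, here is a small sketch that builds one direct training set per horizon. The series is a placeholder and `direct_training_pairs` is a hypothetical helper.

```python
def direct_training_pairs(series, horizon):
    """Pair each day's value with the value `horizon` days later."""
    return list(zip(series[:-horizon], series[horizon:]))


inbound = list(range(100))  # placeholder series of 100 daily values
horizons = [1, 3, 10]

# One training set (and, in practice, one trained model) per horizon.
datasets = {h: direct_training_pairs(inbound, h) for h in horizons}
total_rows = sum(len(pairs) for pairs in datasets.values())
for h, pairs in datasets.items():
    print(h, len(pairs))
print(total_rows)
```

Each extra horizon adds roughly another full copy of the rows, which is the linear scaling mentioned above.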

and the third difference well with

traditional time series approaches

they can be pretty tough to get right

unless you’re an expert with time series

models

while the machine learning approach is a

lot more tractable for people who don’t

know much about time series forecasting

although i will say the prophet models

are an exception here

now a fourth difference so time series

models at least the univariate ones

we can’t add regressors for which we

don’t know the future values

while the machine learning models we can

add these regressors

that allows us to better fine-tune the

models

so clearly each has their advantages and

disadvantages

and depending on the problem you’re

solving the data that you have

and the hardware capacity one of these

solutions may be preferable to

the other

hope this gives you a better intuition

on different approaches to time series

forecasting

you don’t need to be an expert i have a

year in the field and

i’m just getting my feet wet everything

in this video is my experience dealing

with time series-esque problems

so let me know in the comments below

what your experiences are with time

series forecasting

would love to listen to them i’ll see

you in the next one take care till then

bye bye