Photo by Aaron Burden on Unsplash

While data comes in many forms, mathematical models are limited to real numbers. As a result, we often have to *engineer* our inputs prior to model development and inference. Data representation is best illustrated with an example.

This is what the top 3 rows of our dataset looks like -- we can assume that we have at least a few thousand observations. The goal is to train a probabilistic model to determine each person's likelihood of buying a Magazine subscription after being given a free trial.

customer_id | purchased_subscription | income_annualized | age | review | linked_payment_method |
---|---|---|---|---|---|

12 | TRUE | 78,000 | 49 | 4 | CREDIT_CARD |

328 | FALSE | 45,000 | 21 | 2 | NONE |

89 | TRUE | 120,000 | 50 | 5 | PAYPAL |

We want to predict *purchased_subscription* using the other columns as predictors.

*Income* is the only continuous variable out of our predictors. We can keep income as it is or transform it to another set of real numbers (for instance, we can divide income by 1000 or take its logarithm value). In either case, income remains a real number and needs no further preprocessing.

We can also group income into buckets if we think people in the same income bucket share similar purchasing behavior. For instance, any income under 30,000 would be Group 1, any income between 30,000 and 50,000 would be Group 2, etc. This effectively turns income into a **categorical variable**.

*linked_payment_method* is a categorical variable represented as a string. If we think that all linked payment types are equal, we can represent payment as a binary indicator (payment linked versus no payment linked). If we think that linking Paypal affects Magazine purchase differently than linking Credit Card (maybe there's a hefty fee for Paypal?), then we should divide this into three classes: no payment linked, paypal and credit card.

Let's go with the second assumption. The next step is to expand our single *linked_payment_method* predictor into 3 different predictors: *no_payment_linked, credit_card_linked, paypal_linked*. Each of these predictors will evaluate to 1 if the customer is linked to that method; otherwise, it will evaluate to 0. The sum of these three predictors should equal one for every observation.

This method of enumerating categorical variables is known as **one-hot encoding**. If we one-hot encoded our categorical variable *linked_payment_method*, our data would look like:

customer_id | purchased_subscription | income_annualized | age | review | no_payment_linked | credit_card_linked | paypal_linked |
---|---|---|---|---|---|---|---|

12 | TRUE | 78,000 | 49 | 4 | 0 | 1 | 0 |

328 | FALSE | 45,000 | 21 | 2 | 1 | 0 | 0 |

89 | TRUE | 120,000 | 50 | 5 | 0 | 0 | 1 |

*review* is an ordinal variable. Ordinal variables are categorical variables where the possible values are ordered. We have the flexibility here to leave them as is or to one-hot encode them. If we one-hot encode them, we will lose information about the order.

The effect of certain variables might differ based on the values of another variable. For instance, the purchase behavior of a 50+ year old making 150,000 a year might be different from the purchase behavior of 20 year old making the same amount of money.

While most models and algorithms will understand these relationships implicitly, it is *sometimes* better to create new model features to explicitly capture this behavior.

Interaction between Continuous Variables

We can treat both *age, income* as continuous variables and designate new variable *age_income_interaction* as their product. We can use *age_income_interaction* as a predictor in our model.

customer_id | income_annualized | age | age_income_interaction |
---|---|---|---|

12 | 78,000 | 49 | 49 x 78,000 = 3,822,000 |

328 | 45,000 | 21 | 21 x 45,000 = 945,000 |

89 | 120,000 | 50 | 50 x 120,000 = 6,000,000 |

Interaction between Categorical Variables

Alteratively, we can use *age_income_interaction* to bucket and capture the interaction between age and income. We will one-hot encode this predictor similarly to what we did with the *linked_payment_method* predictor.

Age < 25 | Age 25-50 | Age > 50 | |
---|---|---|---|

Income < 50,000 | Age-Income Bucket 1 | Age-Income Bucket 2 | Age-Income Bucket 3 |

Income > 50,000 | Age-Income Bucket 4 | Age-Income Bucket 5 | Age-Income Bucket 6 |

Interaction between Categorical and Continuous Variable

Assume we want to stick with our buckets for *income*, but keep *age* as a continuous variable. This creates new predictors *income_bucket_1_age* and *income_bucket_2_age*. *income_bucket_1_age* will reflect the customer's age if they fall under bucket 1 and 0 otherwise. Similarly, *income_bucket_2_age* will reflect the customer's age if they fall under bucket 2 and 0 otherwise.

**While these examples deal with interactions between two variables, we can capture interactions for any number of variables. For instance, we can use the interaction between age, income, review as a predictor.**