Photo by Aaron Burden on Unsplash
While data comes in many forms, mathematical models are limited to real numbers. As a result, we often have to engineer our inputs prior to model development and inference. Data representation is best illustrated with an example.
This is what the top 3 rows of our dataset looks like -- we can assume that we have at least a few thousand observations. The goal is to train a probabilistic model to determine each person's likelihood of buying a Magazine subscription after being given a free trial.
customer_id | purchased_subscription | income_annualized | age | review | linked_payment_method |
---|---|---|---|---|---|
12 | TRUE | 78,000 | 49 | 4 | CREDIT_CARD |
328 | FALSE | 45,000 | 21 | 2 | NONE |
89 | TRUE | 120,000 | 50 | 5 | PAYPAL |
We want to predict purchased_subscription using the other columns as predictors.
Income is the only continuous variable out of our predictors. We can keep income as it is or transform it to another set of real numbers (for instance, we can divide income by 1000 or take its logarithm value). In either case, income remains a real number and needs no further preprocessing.
We can also group income into buckets if we think people in the same income bucket share similar purchasing behavior. For instance, any income under 30,000 would be Group 1, any income between 30,000 and 50,000 would be Group 2, etc. This effectively turns income into a categorical variable.
linked_payment_method is a categorical variable represented as a string. If we think that all linked payment types are equal, we can represent payment as a binary indicator (payment linked versus no payment linked). If we think that linking Paypal affects Magazine purchase differently than linking Credit Card (maybe there's a hefty fee for Paypal?), then we should divide this into three classes: no payment linked, paypal and credit card.
Let's go with the second assumption. The next step is to expand our single linked_payment_method predictor into 3 different predictors: no_payment_linked, credit_card_linked, paypal_linked. Each of these predictors will evaluate to 1 if the customer is linked to that method; otherwise, it will evaluate to 0. The sum of these three predictors should equal one for every observation.
This method of enumerating categorical variables is known as one-hot encoding. If we one-hot encoded our categorical variable linked_payment_method, our data would look like:
customer_id | purchased_subscription | income_annualized | age | review | no_payment_linked | credit_card_linked | paypal_linked |
---|---|---|---|---|---|---|---|
12 | TRUE | 78,000 | 49 | 4 | 0 | 1 | 0 |
328 | FALSE | 45,000 | 21 | 2 | 1 | 0 | 0 |
89 | TRUE | 120,000 | 50 | 5 | 0 | 0 | 1 |
review is an ordinal variable. Ordinal variables are categorical variables where the possible values are ordered. We have the flexibility here to leave them as is or to one-hot encode them. If we one-hot encode them, we will lose information about the order.
The effect of certain variables might differ based on the values of another variable. For instance, the purchase behavior of a 50+ year old making 150,000 a year might be different from the purchase behavior of 20 year old making the same amount of money.
While most models and algorithms will understand these relationships implicitly, it is sometimes better to create new model features to explicitly capture this behavior.
Interaction between Continuous Variables
We can treat both age, income as continuous variables and designate new variable age_income_interaction as their product. We can use age_income_interaction as a predictor in our model.
customer_id | income_annualized | age | age_income_interaction |
---|---|---|---|
12 | 78,000 | 49 | 49 x 78,000 = 3,822,000 |
328 | 45,000 | 21 | 21 x 45,000 = 945,000 |
89 | 120,000 | 50 | 50 x 120,000 = 6,000,000 |
Interaction between Categorical Variables
Alteratively, we can use age_income_interaction to bucket and capture the interaction between age and income. We will one-hot encode this predictor similarly to what we did with the linked_payment_method predictor.
Age < 25 | Age 25-50 | Age > 50 | |
---|---|---|---|
Income < 50,000 | Age-Income Bucket 1 | Age-Income Bucket 2 | Age-Income Bucket 3 |
Income > 50,000 | Age-Income Bucket 4 | Age-Income Bucket 5 | Age-Income Bucket 6 |
Interaction between Categorical and Continuous Variable
Assume we want to stick with our buckets for income, but keep age as a continuous variable. This creates new predictors income_bucket_1_age and income_bucket_2_age. income_bucket_1_age will reflect the customer's age if they fall under bucket 1 and 0 otherwise. Similarly, income_bucket_2_age will reflect the customer's age if they fall under bucket 2 and 0 otherwise.
While these examples deal with interactions between two variables, we can capture interactions for any number of variables. For instance, we can use the interaction between age, income, review as a predictor.