Data Representation

Photo by Aaron Burden on Unsplash

While data comes in many forms, mathematical models are limited to real numbers. As a result, we often have to engineer our inputs prior to model development and inference. Data representation is best illustrated with an example.

This is what the top 3 rows of our dataset looks like -- we can assume that we have at least a few thousand observations. The goal is to train a probabilistic model to determine each person's likelihood of buying a Magazine subscription after being given a free trial.

12TRUE78,000494CREDIT_CARD
328FALSE45,000212NONE
89TRUE120,000505PAYPAL

We want to predict purchased_subscription using the other columns as predictors.

• income_annualized is a continuous variable between 0 and 10 million
• age is a positive integer under 120
• review is on the scale of 1 to 5 (1 being the worst, 5 being the best)
• linked_payment_method is one of PAYPAL, CREDIT_CARD or NONE if no payment method is linked
• customer_id is only for identification purposes (excluded from model)

• Continuous Variables

Income is the only continuous variable out of our predictors. We can keep income as it is or transform it to another set of real numbers (for instance, we can divide income by 1000 or take its logarithm value). In either case, income remains a real number and needs no further preprocessing.

We can also group income into buckets if we think people in the same income bucket share similar purchasing behavior. For instance, any income under 30,000 would be Group 1, any income between 30,000 and 50,000 would be Group 2, etc. This effectively turns income into a categorical variable.

Categorical Variables

linked_payment_method is a categorical variable represented as a string. If we think that all linked payment types are equal, we can represent payment as a binary indicator (payment linked versus no payment linked). If we think that linking Paypal affects Magazine purchase differently than linking Credit Card (maybe there's a hefty fee for Paypal?), then we should divide this into three classes: no payment linked, paypal and credit card.

Let's go with the second assumption. The next step is to expand our single linked_payment_method predictor into 3 different predictors: no_payment_linked, credit_card_linked, paypal_linked. Each of these predictors will evaluate to 1 if the customer is linked to that method; otherwise, it will evaluate to 0. The sum of these three predictors should equal one for every observation.

This method of enumerating categorical variables is known as one-hot encoding. If we one-hot encoded our categorical variable linked_payment_method, our data would look like:

12TRUE78,000494010
328FALSE45,000212100
89TRUE120,000505001

Ordinal Variables

review is an ordinal variable. Ordinal variables are categorical variables where the possible values are ordered. We have the flexibility here to leave them as is or to one-hot encode them. If we one-hot encode them, we will lose information about the order.

Interaction Effects between Variables

The effect of certain variables might differ based on the values of another variable. For instance, the purchase behavior of a 50+ year old making 150,000 a year might be different from the purchase behavior of 20 year old making the same amount of money.

While most models and algorithms will understand these relationships implicitly, it is sometimes better to create new model features to explicitly capture this behavior.

Interaction between Continuous Variables

We can treat both age, income as continuous variables and designate new variable age_income_interaction as their product. We can use age_income_interaction as a predictor in our model.

customer_idincome_annualized ageage_income_interaction
1278,0004949 x 78,000 = 3,822,000
32845,0002121 x 45,000 = 945,000
89120,0005050 x 120,000 = 6,000,000

Interaction between Categorical Variables

Alteratively, we can use age_income_interaction to bucket and capture the interaction between age and income. We will one-hot encode this predictor similarly to what we did with the linked_payment_method predictor.

Age < 25Age 25-50Age > 50
Income < 50,000Age-Income Bucket 1Age-Income Bucket 2Age-Income Bucket 3
Income > 50,000Age-Income Bucket 4Age-Income Bucket 5Age-Income Bucket 6

Interaction between Categorical and Continuous Variable

Assume we want to stick with our buckets for income, but keep age as a continuous variable. This creates new predictors income_bucket_1_age and income_bucket_2_age. income_bucket_1_age will reflect the customer's age if they fall under bucket 1 and 0 otherwise. Similarly, income_bucket_2_age will reflect the customer's age if they fall under bucket 2 and 0 otherwise.

While these examples deal with interactions between two variables, we can capture interactions for any number of variables. For instance, we can use the interaction between age, income, review as a predictor.