Data Representation

While data comes in many forms, mathematical models are limited to real numbers. As a result, we often have to engineer our inputs prior to model development and inference. Data representation is best illustrated with an example.

This is what the top 3 rows of our dataset looks like -- we can assume that we have at least a few thousand observations. The goal is to train a probabilistic model to determine each person's likelihood of buying a Magazine subscription after being given a free trial.

customer_id	purchased_subscription	income_annualized	age	review	linked_payment_method
12	TRUE	78,000	49	4	CREDIT_CARD
328	FALSE	45,000	21	2	NONE
89	TRUE	120,000	50	5	PAYPAL

We want to predict purchased_subscription using the other columns as predictors.

income_annualized is a continuous variable between 0 and 10 million

age is a positive integer under 120

review is on the scale of 1 to 5 (1 being the worst, 5 being the best)

linked_payment_method is one of PAYPAL, CREDIT_CARD or NONE if no payment method is linked

customer_id is only for identification purposes (excluded from model)

Continuous Variables

Income is the only continuous variable out of our predictors. We can keep income as it is or transform it to another set of real numbers (for instance, we can divide income by 1000 or take its logarithm value). In either case, income remains a real number and needs no further preprocessing.

We can also group income into buckets if we think people in the same income bucket share similar purchasing behavior. For instance, any income under 30,000 would be Group 1, any income between 30,000 and 50,000 would be Group 2, etc. This effectively turns income into a categorical variable.

Categorical Variables

linked_payment_method is a categorical variable represented as a string. If we think that all linked payment types are equal, we can represent payment as a binary indicator (payment linked versus no payment linked). If we think that linking Paypal affects Magazine purchase differently than linking Credit Card (maybe there's a hefty fee for Paypal?), then we should divide this into three classes: no payment linked, paypal and credit card.

Let's go with the second assumption. The next step is to expand our single linked_payment_method predictor into 3 different predictors: no_payment_linked, credit_card_linked, paypal_linked. Each of these predictors will evaluate to 1 if the customer is linked to that method; otherwise, it will evaluate to 0. The sum of these three predictors should equal one for every observation.

This method of enumerating categorical variables is known as one-hot encoding. If we one-hot encoded our categorical variable linked_payment_method, our data would look like:

customer_id	purchased_subscription	income_annualized	age	review	no_payment_linked	credit_card_linked	paypal_linked
12	TRUE	78,000	49	4	0	1	0
328	FALSE	45,000	21	2	1	0	0
89	TRUE	120,000	50	5	0	0	1

Ordinal Variables

review is an ordinal variable. Ordinal variables are categorical variables where the possible values are ordered. We have the flexibility here to leave them as is or to one-hot encode them. If we one-hot encode them, we will lose information about the order.

Interaction Effects between Variables

The effect of certain variables might differ based on the values of another variable. For instance, the purchase behavior of a 50+ year old making 150,000 a year might be different from the purchase behavior of 20 year old making the same amount of money.

While most models and algorithms will understand these relationships implicitly, it is sometimes better to create new model features to explicitly capture this behavior.

Interaction between Continuous Variables

We can treat both age, income as continuous variables and designate new variable age_income_interaction as their product. We can use age_income_interaction as a predictor in our model.

customer_id	income_annualized	age	age_income_interaction
12	78,000	49	49 x 78,000 = 3,822,000
328	45,000	21	21 x 45,000 = 945,000
89	120,000	50	50 x 120,000 = 6,000,000

Interaction between Categorical Variables

Alteratively, we can use age_income_interaction to bucket and capture the interaction between age and income. We will one-hot encode this predictor similarly to what we did with the linked_payment_method predictor.

	Age < 25	Age 25-50	Age > 50
Income < 50,000	Age-Income Bucket 1	Age-Income Bucket 2	Age-Income Bucket 3
Income > 50,000	Age-Income Bucket 4	Age-Income Bucket 5	Age-Income Bucket 6

Interaction between Categorical and Continuous Variable

Assume we want to stick with our buckets for income, but keep age as a continuous variable. This creates new predictors income_bucket_1_age and income_bucket_2_age. income_bucket_1_age will reflect the customer's age if they fall under bucket 1 and 0 otherwise. Similarly, income_bucket_2_age will reflect the customer's age if they fall under bucket 2 and 0 otherwise.

While these examples deal with interactions between two variables, we can capture interactions for any number of variables. For instance, we can use the interaction between age, income, review as a predictor.

« Previous: Learning to love your Code Editor Tutorials Next: Data Cleaning »