How to organize data for multilevel modeling - Decision Tree, Classification, or Regression
I have three tables - Sales Manager, Customer, and Order. Each sales manager has multiple customers, and each customer can have multiple orders.
I am interested in determining whether certain attributes of the sales manager and of the customer lead to sales of a particular product (let's say Product A, yes/no).
Suppose I have 3 sales managers, 10 customers, and 20 orders.
Should I structure the data set to have 3 rows, 10 rows, or 20 rows? Please advise.
Also, will a decision tree or other classification algorithm automatically understand the hierarchical relationships among manager, customer, and order?
Thanks.
I think you should make one big feature matrix out of it. Suppose you have tables
Sales Manager (id attr_1 ... attr_m)
Customer (id attr_1 ... attr_n sales_manager_id)
Order (id product_id_1 ... product_id_l customer_id)
Then it is probably reasonable to build the matrix in the following form:
Matrix:
product_id order_attr_1 ... order_attr_l customer_attr_1 ... customer_attr_n ... manager_attr_1 ... manager_attr_m
Now you have a matrix with 20*l rows (one row per product in each order), where each row carries all the attributes known for that order, its customer, and its sales manager.
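The flattening described above can be sketched in plain Python. This is only a minimal illustration: the attribute names (manager_region, customer_segment and so on), the sample values, and the helper build_feature_matrix are all made up for the example and are not part of the original tables.

```python
# Hypothetical toy tables; attribute names and values are invented for illustration.
managers = {
    1: {"manager_region": "north", "manager_years": 5},
    2: {"manager_region": "south", "manager_years": 2},
}
customers = {
    10: {"customer_segment": "retail", "sales_manager_id": 1},
    11: {"customer_segment": "wholesale", "sales_manager_id": 1},
    12: {"customer_segment": "retail", "sales_manager_id": 2},
}
orders = [
    {"id": 100, "product_ids": ["A", "B"], "customer_id": 10},
    {"id": 101, "product_ids": ["C"], "customer_id": 11},
    {"id": 102, "product_ids": ["A"], "customer_id": 12},
]


def build_feature_matrix(orders, customers, managers):
    """Return one row per (order, product) with customer and manager attributes attached."""
    rows = []
    for order in orders:
        # Follow the foreign keys: order -> customer -> sales manager.
        customer = customers[order["customer_id"]]
        manager = managers[customer["sales_manager_id"]]
        for product_id in order["product_ids"]:
            row = {"order_id": order["id"], "product_id": product_id}
            # Copy customer attributes, skipping the foreign key column.
            row.update({k: v for k, v in customer.items() if k != "sales_manager_id"})
            # Copy manager attributes; they repeat on every row of that manager.
            row.update(manager)
            # Target label for the question "was Product A sold?" (1 = yes, 0 = no).
            row["is_product_a"] = int(product_id == "A")
            rows.append(row)
    return rows


matrix = build_feature_matrix(orders, customers, managers)
for row in matrix:
    print(row)
```

Each printed row is one training example. The is_product_a column would be the class label, and the customer and manager columns the features.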
In the simplest form you can use this matrix directly for classification. If there are too many attributes, it may be reasonable to reduce them with PCA (principal component analysis) first. You could try Weka and see what turns out.
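As for your question about the hierarchical relations: the classification algorithms will not understand them explicitly. They only see the flattened rows, where the hierarchy shows up as customer and manager attribute values repeated across rows.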
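Because the algorithm does not know which rows belong to the same manager, one possible precaution (my suggestion, not something a classifier does for you) is to keep the ids as columns and split the data by manager, so that rows from one manager never end up in both the training and the test set. The sketch below assumes hypothetical flattened rows and a made-up helper split_by_group.

```python
import random

# Hypothetical flattened rows; each keeps its manager_id so the grouping is not lost.
rows = [
    {"order_id": 100, "customer_id": 10, "manager_id": 1},
    {"order_id": 101, "customer_id": 10, "manager_id": 1},
    {"order_id": 102, "customer_id": 11, "manager_id": 2},
    {"order_id": 103, "customer_id": 12, "manager_id": 3},
    {"order_id": 104, "customer_id": 12, "manager_id": 3},
]


def split_by_group(rows, group_key, test_fraction=0.34, seed=0):
    """Split rows so that every value of group_key lands entirely in train or in test."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    # Hold out at least one whole group for testing.
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [r for r in rows if r[group_key] not in test_groups]
    test = [r for r in rows if r[group_key] in test_groups]
    return train, test


train_rows, test_rows = split_by_group(rows, "manager_id")
print("train managers:", sorted({r["manager_id"] for r in train_rows}))
print("test managers:", sorted({r["manager_id"] for r in test_rows}))
```

The same idea works with customer_id as the group key if you care more about generalizing to new customers than to new managers.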
I would recommend the book Introduction to Data Mining, as it answers most of your questions.