Writing fast and reliable tools is a constant challenge for a computer scientist. In the data mining context, speed translates into a better capacity to handle large datasets. When we build the final model that we want to deploy, speed is not really important. But in the exploratory phase, where we search for the best model, it is decisive: it improves our chances of obtaining the best model simply because we can try more configurations.
I have tried many solutions to improve the computation time of logistic regression. In fact, I think the performance rests heavily on the optimization algorithm used. The source code of Tanagra shows that I hesitated a great deal over this choice. Some studies helped me make the right one.
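To make the role of the optimizer concrete, here is a minimal sketch of one classical choice, Newton-Raphson, for a logistic regression with an intercept and a single predictor. This is an illustration only: the text does not say which algorithm Tanagra uses, and the names `sigmoid` and `fit_newton` are hypothetical. The 2x2 linear system is solved in closed form so that the example needs nothing beyond the standard library.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_newton(xs, ys, max_iter=25, tol=1e-8):
    """Fit P(y=1|x) = sigmoid(a + b*x) by Newton-Raphson.

    Hypothetical helper for illustration; returns (a, b, iterations).
    """
    a, b = 0.0, 0.0
    for it in range(1, max_iter + 1):
        # Gradient of the log-likelihood (g0, g1) and entries of the
        # negated Hessian (h00, h01, h11), accumulated over the sample.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(a + b * x)
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = g in closed form.
        det = h00 * h11 - h01 * h01
        da = (h11 * g0 - h01 * g1) / det
        db = (h00 * g1 - h01 * g0) / det
        a += da
        b += db
        if abs(da) + abs(db) < tol:
            return a, b, it
    return a, b, max_iter


# Small, non-separable toy sample (made up for this sketch).
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 1, 0, 1, 0, 1, 1]
a, b, n_iter = fit_newton(xs, ys)
print(a, b, n_iter)
```

Each Newton iteration is costly because it builds and solves a system whose size grows with the number of predictors, but few iterations are usually needed. Cheaper-per-step methods make the opposite trade-off, which is why the choice of optimizer weighs so much on the total computation time.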
Several tools offer logistic regression. It is interesting to compare their computation times and memory usage. I have already carried out this kind of comparison in the past (Tanagra, 2008). The novelty here is that I use a new operating system (the 64-bit version of Windows 7), and some tools are specifically built for this system. Their computing capabilities are greatly improved. For this reason, I have increased the dataset size. Moreover, to make the variable selection process more difficult, I added predictive attributes that are correlated with the original descriptors but not with the class attribute. They should not be selected in the final model.
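The text does not say how the additional attributes were generated. As a sketch under that caveat, one simple way to obtain such "decoy" attributes is to add independent Gaussian noise to randomly chosen original descriptors: each decoy is then correlated with its source descriptor but carries no information about the class beyond what that descriptor already provides. The function names `add_decoys` and `pearson`, and the parameter `noise_sd`, are hypothetical.

```python
import random


def add_decoys(rows, n_decoys, noise_sd=1.0, seed=0):
    """Append n_decoys noisy copies of randomly chosen original columns.

    Hypothetical helper: returns (augmented_rows, source_column_indices).
    """
    rng = random.Random(seed)
    n_original = len(rows[0])
    # Choose which original descriptor each decoy is derived from.
    sources = [rng.randrange(n_original) for _ in range(n_decoys)]
    augmented = []
    for row in rows:
        # Decoy = source descriptor + independent Gaussian noise.
        decoys = [row[j] + rng.gauss(0.0, noise_sd) for j in sources]
        augmented.append(list(row) + decoys)
    return augmented, sources


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Usage: two synthetic descriptors, three decoys appended after them.
rng = random.Random(1)
rows = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(2000)]
augmented, sources = add_decoys(rows, n_decoys=3, noise_sd=0.5, seed=42)
for k, j in enumerate(sources):
    r = pearson([row[j] for row in augmented], [row[2 + k] for row in augmented])
    print(f"decoy {k} vs descriptor {j}: r = {r:.2f}")
```

A smaller `noise_sd` makes the decoys harder to tell apart from the true descriptors, and therefore makes the job of a selection procedure such as forward selection harder.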
In this paper, in addition to Tanagra 1.4.14 (32-bit), we use R 2.13.2 (64-bit), Knime 2.4.2 (64-bit), Orange 2.0b (build of 15 Oct 2011, 32-bit) and Weka 3.7.5 (64-bit).
Keywords: logistic regression, software comparison, glm, stepAIC, R software, knime, orange, weka
Components: BINARY LOGISTIC REGRESSION, FORWARD LOGIT
Tutorial: en_Tanagra_Perfs_Bis_Logistic_Reg.pdf
Dataset: perfs_bis_logistic_reg.zip
References:
Tanagra, "Logistic regression - Software comparison", December 2008.
T.P. Minka, "A comparison of numerical optimizers for logistic regression", 2007.
Friday, February 10, 2012
Logistic regression on large dataset
About The Author
stella