Because I have recently updated my operating system (OS), I am wondering how the 64-bit versions of Knime 2.4.2 and RapidMiner 5.1.011 could handle a very large dataset, which cannot be loaded into main memory on a 32-bit OS. This article completes a previous study where we deal with a moderate sized dataset with 500,000 instances and 22 variables. Here, we handle a dataset with 9,634,198 instances and 41 variables. We have already used this dataset in another tutorial. We showed that we cannot perform a decision tree induction on this kind of database without a swapping system, which is implemented into the SIPINA, on a 32-bit OS. We note that Tanagra can handle the dataset, but this is because it encodes the values of the categorical attributes with a byte. The memory occupation remains moderate.
In this tutorial, I analyze the behavior of the 64-bit Knime and RapidMiner on this database. I use 64-bit OS and tools, but I have "only" 4 GB of available memory on my personal computer.
Keywords: very large dataset, decision tree, sampling, sipina, knime, rapidminer
Components: ID3
Tutorial: en_Tanagra_Tree_Very_Large_Dataset.pdf
Dataset: twice-kdd-cup-discretized-descriptors.zip
References:
Tanagra, "Dealing with very large dataset in Sipina".
Tanagra, "Decision tree and large dataset (continuation)".
Tanagra, "Decision tree and large dataset".
Tanagra, "Local sampling for decision tree learning".
Home >
Software Comparison
> Dealing with very large dataset (continuation)
Sunday, December 11, 2011
Dealing with very large dataset (continuation)
About The Author
stella
Nulla sagittis convallis arcu. Sed sed nunc. Curabitur consequat. Quisque metus enim, venenatis fermentum, mollis in, porta et, nibh. Duis vulputate elit in elit. Mauris dictum libero id justo.
Labels:
Decision tree,
Sipina,
Software Comparison
Subscribe to:
Post Comments (Atom)
Find us on Facebook
Find us on Google Plus
Labels
- Association rules (8)
- Clustering (14)
- Data file handling (17)
- Decision tree (21)
- Exploratory Data Analysis (17)
- Feature Construction (6)
- Feature Selection (8)
- PLS Regression (5)
- Python (11)
- Regression analysis (13)
- Sipina (23)
- Software Comparison (49)
- Statistical methods (3)
- Supervised Learning (67)
- Tanagra (13)
- Text Mining (2)



No comments:
Post a Comment