This tutorial shows how to program the well-known "word count" algorithm on a set of files stored in the HDFS file system.
The "word count" is the classic introductory example of programming under Hadoop, and it is described all over the web. Unfortunately, the tutorials that describe the task are often not reproducible: the datasets are not available; the whole process, including the installation of the Hadoop framework, is not described; and they do not explain how to access the files stored in the HDFS file system. In short, we cannot run the programs and understand in detail how they work.
In this tutorial, we describe the whole process. We first detail the installation of a virtual machine that contains a single-node Hadoop cluster. Then we show how to install R and RStudio Server, which allow us to write and run programs. Finally, we write some programs based on the MapReduce scheme.
The steps, and therefore the sources of error, are numerous. We use many screenshots to show exactly what each operation involves. This is the reason for the unusual presentation format of this tutorial.
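To make the word count task concrete, here is a minimal sketch in plain R that imitates the three stages of the MapReduce scheme (map, shuffle, reduce) on a tiny in-memory input. It is not the code from the tutorial: the input lines are made up, the names `map_line`, `pairs`, `grouped` and `word_counts` are hypothetical, and nothing here touches Hadoop or HDFS. With the rmr2 package, the map and reduce logic would instead be passed as functions to `mapreduce()` and run on the cluster.

```r
# Toy input standing in for lines read from files in HDFS (made-up data).
lines <- c("hadoop runs mapreduce", "r runs on hadoop", "mapreduce with r")

# Map step: split one line into words and emit a (word, 1) pair per word.
map_line <- function(line) {
  words <- strsplit(tolower(line), "[^a-z]+")[[1]]
  words <- words[nzchar(words)]  # drop empty strings left by the split
  data.frame(key = words, value = rep(1L, length(words)),
             stringsAsFactors = FALSE)
}

# Apply the mapper to every line and stack the emitted pairs.
pairs <- do.call(rbind, lapply(lines, map_line))

# Shuffle step: group the emitted values by key (the word).
grouped <- split(pairs$value, pairs$key)

# Reduce step: sum the values of each key to get the word frequencies.
word_counts <- vapply(grouped, sum, integer(1))

print(word_counts)
```

The result is a named integer vector with one entry per distinct word. Keeping the mapper and the reducer as separate steps mirrors how the work is divided under Hadoop, where each stage may run on a different node.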
Keywords: big data, big data analytics, mapreduce, package rmr2, package rhdfs, hadoop, rhadoop, R software, rstudio, rstudio server, cloudera, R language
Tutorial: en_Tanagra_Hadoop_with_R.pdf
Files: hadoop_with_r.zip
References:
Tanagra Tutorial, "MapReduce with R", Feb. 2015.
Hugh Devlin, "Mapreduce in R", Jan. 2014.
Friday, April 10, 2015
R programming under Hadoop
About The Author
stella
Labels:
Software Comparison