The aim of text categorization is to assign documents to predefined categories as accurately as possible. We are within the supervised learning framework, with a categorical target attribute, often binary. What sets the problem apart is the nature of the input attribute, which is a textual document. Predictive methods cannot be applied directly to raw text; a data preparation phase is needed first to turn the documents into a numeric representation.
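To make the data preparation phase concrete, here is a minimal sketch of a bag-of-words representation built with scikit-learn's `CountVectorizer`. The three messages are invented for illustration and do not come from the tutorial's corpus; the default tokenization and the absence of any stop-word filtering are assumptions, not necessarily the settings used in the tutorial.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents invented for illustration (not taken from the SMS corpus).
docs = [
    "free prize call now",
    "call me when you are free",
    "see you at lunch",
]

# Each document becomes a row of term counts, each distinct term a column.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse document-term matrix

# Columns are ordered alphabetically, so sorting the vocabulary gives the column order.
terms = sorted(vectorizer.vocabulary_)
print(terms)
print(X.toarray())
```

Each row of `X` can then be fed to a standard supervised learning method, which is what makes this step the bridge between raw text and the predictive model.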
In this tutorial, we describe a text categorization process in Python, relying mainly on the text mining capabilities of the scikit-learn package, which also provides the data mining method used here (logistic regression). We want to classify SMS messages as "spam" (unwanted, malicious) or "ham" (legitimate). We use the “SMS Spam Collection v.1” dataset.
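The overall process can be sketched as a scikit-learn pipeline that chains a bag-of-words step with a logistic regression, then evaluates precision, recall and F1-score on the "spam" class. This is only a sketch under stated assumptions: the messages below are invented stand-ins for the SMS Spam Collection, the train/test split is hand-made, and the hyperparameters (`max_iter=1000`, default regularization) are illustrative choices rather than the tutorial's actual settings.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.pipeline import make_pipeline

# Invented toy messages; the tutorial itself uses the SMS Spam Collection v.1.
train_texts = [
    "win a free prize now",
    "free entry claim your cash prize",
    "urgent call now to claim your reward",
    "are you coming to lunch today",
    "see you at the meeting tomorrow",
    "can you call me when you get home",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

test_texts = ["claim your free prize", "see you at lunch"]
test_labels = ["spam", "ham"]

# Bag-of-words features followed by a logistic regression classifier.
model = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
pred = model.predict(test_texts)

# Metrics computed with "spam" as the positive class.
precision = precision_score(test_labels, pred, pos_label="spam", zero_division=0)
recall = recall_score(test_labels, pred, pos_label="spam", zero_division=0)
f1 = f1_score(test_labels, pred, pos_label="spam", zero_division=0)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Wrapping both steps in a single pipeline ensures that the vocabulary is learned on the training messages only and then reused unchanged on the test messages.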
Keywords: text mining, document categorization, corpus, bag of words, F1-score, recall, precision, dimensionality reduction, variable selection, logistic regression, scikit-learn, Python
Tutorial: Spam identification
Dataset: Corpus and Python program
References:
Almeida, T.A., Gómez Hidalgo, J.M., Yamakami, A., "Contributions to the Study of SMS Spam Filtering: New Collection and Results", in Proceedings of the 2011 ACM Symposium on Document Engineering (DOCENG'11), Mountain View, CA, USA, 2011.
Thursday, October 5, 2017
Document classification in Python
About The Author
stella
Labels:
Python,
Supervised Learning,
Text Mining