
Modern text processing pipeline for machine learning applications
textpipe is an end-to-end text processing pipeline designed for modern NLP workflows. It provides:
- Configurable Processing: YAML-based configuration for all processing steps
- Modular Architecture: Clean separation of data loading, cleaning, vectorization, and modeling
- Production Ready: Built-in logging, error handling, and type validation
- ML Integration: Seamless integration with scikit-learn models
- Customizable Components (illustrated in the sketch after this list):
  - Multiple text cleaning strategies
  - Configurable tokenization (stemming, stopwords)
  - TF-IDF vectorization with automatic feature management
  - Extensible model architecture
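A minimal sketch of what these options control, using NLTK and scikit-learn directly (both acknowledged below). This is an illustration of the underlying cleaning, tokenization, and TF-IDF steps, not textpipe's internal implementation:
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# NLTK resources may need a one-time download:
# nltk.download("punkt"); nltk.download("stopwords")

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def clean(text):
    # Lowercase, keep alphabetic tokens, drop stopwords, then stem.
    tokens = [t.lower() for t in word_tokenize(text) if t.isalpha()]
    return " ".join(stemmer.stem(t) for t in tokens if t not in stop_words)

docs = ["I love this product!", "Terrible service..."]
vectorizer = TfidfVectorizer(max_features=5000)  # mirrors max_features in config.yml
X = vectorizer.fit_transform([clean(d) for d in docs])
print(X.shape)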
Install the package with pip:
pip install textpipe
Update existing installation:
pip install textpipe --upgrade
Basic text processing pipeline example:
from textpipe import Config, load_csv, SentimentPipeline
# Initialize configuration
config = Config.get()
# Load training data
texts, labels = load_csv("data/train.csv")
# Initialize and train pipeline
pipeline = SentimentPipeline(config)
pipeline.train(texts, labels)
# Make predictions
new_texts = ["I love this product!", "Terrible service..."]
predictions = pipeline.predict(new_texts)
print(predictions)
Advanced configuration example (config.yml):
processing:
  language: english
  remove_stopwords: true
  use_stemming: false
  max_features: 5000
  min_text_length: 3

logging:
  level: INFO
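A minimal sketch of using this configuration, assuming Config.get() picks up config.yml from the working directory (how the file is located is not shown above):
from textpipe import Config, load_csv, SentimentPipeline

# Assumption: Config.get() reads the config.yml shown above from the current
# working directory; adjust if your setup resolves the path differently.
config = Config.get()

# Train exactly as in the basic example; the pipeline now applies English
# stopword removal, no stemming, a 5000-feature TF-IDF vocabulary, and a
# minimum text length of 3.
texts, labels = load_csv("data/train.csv")
pipeline = SentimentPipeline(config)
pipeline.train(texts, labels)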
Contributions are what make the open source community an amazing place to learn, inspire, and create. Any contributions are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
Distributed under the MIT License. See LICENSE for more information.
Textpipe Team - [email protected]
Project Link: https://github.com/CodexEsto/textpipe
- Scikit-learn community for foundational ML components
- NLTK team for language processing resources
- Pandas for data handling capabilities
- All contributors and open-source maintainers who inspired this work