Hybrid Intelligent Dataset Cleaner.
A local-first data sanitization pipeline that combines deterministic rules, statistical models, and deep autoencoders to protect data integrity.
Data Integrity First
CleanIQ operates under a strict philosophy: Format > Fix > Impute > Flag > Report.
PyOD Neural Nets
Deep Autoencoders evaluate the Reconstruction Error to flag multivariate anomalies.
Local-First Privacy
Structural fingerprints are learned without sending raw sensitive data to the cloud.
The Data Sanitization Pipeline
CleanIQ runs its stages in sequence, moving from cheap deterministic rules to deep-learning inference, to minimize computational overhead while maximizing anomaly detection precision.
Rule Engine
Deterministic cleaning such as whitespace stripping and boolean normalization.
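A rule-engine step of this kind might look like the sketch below. The token sets, the `clean_row` name, and the `bool_columns` parameter are assumptions made for illustration; the source does not say which boolean spellings CleanIQ recognises or how columns are typed.

```python
from typing import Dict, Iterable, Optional, Union

# Hypothetical token sets; not taken from CleanIQ.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}

Cell = Union[str, bool, None]


def normalize_bool(text: str) -> Union[str, bool]:
    """Map a recognised boolean spelling to bool, otherwise return the text."""
    lowered = text.lower()
    if lowered in TRUE_TOKENS:
        return True
    if lowered in FALSE_TOKENS:
        return False
    return text  # unrecognised values are left for later stages to flag


def clean_row(row: Dict[str, Optional[str]],
              bool_columns: Iterable[str] = ()) -> Dict[str, Cell]:
    """Strip whitespace from keys and values; normalise declared boolean columns."""
    bool_set = set(bool_columns)
    cleaned: Dict[str, Cell] = {}
    for name, value in row.items():
        key = name.strip()
        if value is None:
            cleaned[key] = None
            continue
        text = value.strip()
        cleaned[key] = normalize_bool(text) if key in bool_set else text
    return cleaned
```

Boolean normalization is limited to explicitly declared columns here so that a numeric column containing `0` and `1` is not silently converted.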
Classical ML
SmartImputer selects an imputation strategy (mean, median, or mode) based on skewness.
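One plausible reading of skewness-based selection is sketched below: roughly symmetric numeric columns are filled with the mean, skewed numeric columns with the median, and non-numeric columns with the mode. The threshold of 1.0 and the function names are assumptions, not CleanIQ's actual SmartImputer.

```python
import statistics
from typing import List, Sequence, Tuple

SKEW_THRESHOLD = 1.0  # assumed cut-off; the source does not state one


def skewness(values: Sequence[float]) -> float:
    """Population skewness: third central moment over variance^(3/2)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def choose_strategy(present: Sequence) -> str:
    """Pick 'mean', 'median', or 'mode' for a non-empty list of observed values."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in present
    )
    if not numeric:
        return "mode"
    return "mean" if abs(skewness(present)) < SKEW_THRESHOLD else "median"


def impute(column: Sequence) -> Tuple[List, str]:
    """Fill None entries and report which strategy was used."""
    present = [v for v in column if v is not None]
    if not present:
        return list(column), "none"  # nothing to learn a fill value from
    strategy = choose_strategy(present)
    fill_functions = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](present)
    return [fill if v is None else v for v in column], strategy
```

The median is preferred for skewed data because a single extreme value can drag the mean far from the typical value.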
Statistical Outliers
Isolation Forest and IQR checks for mathematical anomalies.
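The IQR half of this stage can be illustrated with the standard library alone. The sketch below uses the conventional 1.5 × IQR fences; the multiplier and the function name are assumptions, and the Isolation Forest half is not shown.

```python
import statistics
from typing import List, Sequence


def iqr_outliers(values: Sequence[float], k: float = 1.5) -> List[int]:
    """Return the indices of values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 95]
print(iqr_outliers(readings))  # indices of the flagged readings
```

Returning indices rather than removing values fits the Flag step of the philosophy stated earlier: the row is marked, not silently dropped.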
Neural Net
PyOD Autoencoder for high-dimensional structural flags.
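The scoring side of reconstruction-error flagging can be sketched without a neural network. In the example below, `mean_reconstructor` is a deliberately trivial stand-in that "reconstructs" every row as the training column means; it is not an autoencoder and not PyOD's API. Only the error computation and thresholding are meant to carry over, and the threshold value is an assumption.

```python
from typing import Callable, List, Sequence, Tuple

Row = Sequence[float]
Reconstructor = Callable[[Row], Row]


def reconstruction_error(row: Row, rebuilt: Row) -> float:
    """Mean squared difference between a row and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(row, rebuilt)) / len(row)


def flag_by_error(rows: Sequence[Row], reconstruct: Reconstructor,
                  threshold: float) -> List[Tuple[float, bool]]:
    """Score each row and flag it when the error exceeds the threshold."""
    scored = []
    for row in rows:
        error = reconstruction_error(row, reconstruct(row))
        scored.append((error, error > threshold))
    return scored


def mean_reconstructor(training_rows: Sequence[Row]) -> Reconstructor:
    """Stand-in model (hypothetical): always returns the training column means."""
    count = len(training_rows)
    means = [sum(column) / count for column in zip(*training_rows)]
    return lambda row: means


model = mean_reconstructor([[1.0, 2.0], [3.0, 4.0]])
results = flag_by_error([[2.0, 3.0], [10.0, 3.0]], model, threshold=5.0)
```

A trained autoencoder would replace `mean_reconstructor`: rows that resemble the training data are rebuilt closely, while structurally unusual rows are rebuilt poorly and score high.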
The Core CleanIQ Workflows
Each guide maps to a real application surface so users can move from documentation into the product without hunting.
Bring CSV or TSV files into CleanIQ with plan-aware validation before a dataset record is created.
1. Select a file
2. Validate size and format
3. Create the dataset record
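The validation step could be approximated as follows. The per-plan file limits are taken from the plan table in this document; the function name, the error wording, and the choice of binary megabytes (1 MB = 1024 × 1024 bytes) are assumptions.

```python
from pathlib import PurePath
from typing import List

MB = 1024 * 1024  # assumption: the source does not say binary or decimal MB

# Max file size per plan, copied from the plan table.
PLAN_MAX_FILE_BYTES = {
    "Starter": 5 * MB,
    "Plus": 25 * MB,
    "Pro": 100 * MB,
    "Enterprise": 250 * MB,
}
ALLOWED_EXTENSIONS = {".csv", ".tsv"}


def validate_upload(filename: str, size_bytes: int, plan: str) -> List[str]:
    """Return a list of problems; an empty list means the upload may proceed."""
    errors: List[str] = []
    extension = PurePath(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        errors.append(f"unsupported format: {extension or '(none)'}")
    limit = PLAN_MAX_FILE_BYTES.get(plan)
    if limit is None:
        errors.append(f"unknown plan: {plan}")
    elif size_bytes > limit:
        errors.append(f"file exceeds the {plan} limit of {limit // MB} MB")
    return errors
```

Collecting every problem before returning lets the interface report a wrong format and an oversized file in one pass, before any dataset record is created.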
Inspect row counts, column counts, file size, status, and quality score before any transformation is applied.
1. Open dataset
2. Review metadata
3. Find quality signals
Build auditable cleaning flows with operations that are previewed before they become exports.
1. Pick operations
2. Preview results
3. Apply with audit context
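The pick, preview, and apply steps suggest a structure along these lines. Everything in the sketch (the `Pipeline` class, its methods, the audit message format) is hypothetical and only illustrates the idea that a preview leaves no trace while an apply is recorded.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, str]
Operation = Callable[[Row], Row]


def strip_values(row: Row) -> Row:
    """Example operation: trim whitespace from every value."""
    return {key: value.strip() for key, value in row.items()}


@dataclass
class Pipeline:
    operations: List[Operation] = field(default_factory=list)
    audit_log: List[str] = field(default_factory=list)

    def _run(self, rows: List[Row]) -> List[Row]:
        result = [dict(row) for row in rows]  # copy so the input is never mutated
        for operation in self.operations:
            result = [operation(row) for row in result]
        return result

    def preview(self, rows: List[Row], limit: int = 5) -> List[Row]:
        """Run the operations on the first few rows without logging anything."""
        return self._run(rows[:limit])

    def apply(self, rows: List[Row], actor: str) -> List[Row]:
        """Run the operations on all rows and record who applied what."""
        cleaned = self._run(rows)
        names = ", ".join(op.__name__ for op in self.operations)
        self.audit_log.append(f"{actor} applied [{names}] to {len(rows)} rows")
        return cleaned
```

Both methods share `_run`, so what the user sees in the preview is produced by the same code path as the applied result.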
Download cleaned data and keep teams aligned with consistent dataset history and export surfaces.
1. Choose dataset
2. Select format
3. Download output
SaaS-Grade Training
The training pipeline handles structural fingerprinting and synthetic corruption at scale.
Recursive Target Scanning
Recursively builds MD5-hashed schemas from tabular data with no row limit, quietly parsing thousands of subdirectories.
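A minimal sketch of schema fingerprinting follows, assuming that a "schema" means the normalised header row of each CSV or TSV file. How CleanIQ actually canonicalises a schema is not stated in the source, and the function names are illustrative. MD5 serves here only as a compact fingerprint, not as a security measure.

```python
import csv
import hashlib
from collections import Counter
from pathlib import Path

TABULAR_EXTENSIONS = {".csv", ".tsv"}


def schema_fingerprint(path: Path) -> str:
    """Hash the lower-cased, stripped column names of one file."""
    delimiter = "\t" if path.suffix.lower() == ".tsv" else ","
    with path.open(newline="", encoding="utf-8") as handle:
        header = next(csv.reader(handle, delimiter=delimiter), [])
    canonical = "|".join(name.strip().lower() for name in header)
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def scan_schemas(root: Path) -> Counter:
    """Walk root recursively and count files per schema fingerprint."""
    counts: Counter = Counter()
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.suffix.lower() in TABULAR_EXTENSIONS:
            counts[schema_fingerprint(path)] += 1
    return counts
```

Only the header row is read, so memory use does not grow with the number of rows in a file. The number of keys in the returned counter corresponds to a count of unique layouts.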
Synthetic Dirty Generator
Deliberately injects Mojibake failures, stray delimiters, and swaps to stress-test the autoencoders.
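The three corruption types can be sketched as below. The Mojibake here is the classic case of UTF-8 bytes misread as Latin-1. Treating "swaps" as an exchange of two neighbouring fields is an assumption, as are the function names; the source does not describe the generator's internals.

```python
import random
from typing import List, Sequence, Tuple


def inject_mojibake(text: str) -> str:
    """Simulate UTF-8 bytes being decoded as Latin-1 (e.g. 'é' becomes 'Ã©')."""
    return text.encode("utf-8").decode("latin-1")


def corrupt_row(fields: Sequence[str], rng: random.Random,
                delimiter: str = ",") -> Tuple[List[str], str]:
    """Apply one randomly chosen corruption and report which kind was used."""
    row = list(fields)
    if not row:
        return row, "none"
    kind = rng.choice(["mojibake", "stray_delimiter", "swap"])
    index = rng.randrange(len(row))
    if kind == "mojibake":
        row[index] = inject_mojibake(row[index])
    elif kind == "stray_delimiter":
        position = rng.randrange(len(row[index]) + 1)
        row[index] = row[index][:position] + delimiter + row[index][position:]
    elif len(row) > 1:
        # Assumed meaning of "swap": exchange a field with its neighbour.
        other = (index + 1) % len(row)
        row[index], row[other] = row[other], row[index]
    return row, kind


rng = random.Random(42)  # seeded so a corrupted training set is reproducible
dirty, kind = corrupt_row(["café", "Zürich", "42"], rng)
```

Returning the corruption kind alongside the row gives each synthetic example a label, which is useful when checking whether a detector catches every failure type.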
$ python train_models.py --no-synthetic
Found 650 training files (35.63 GB total)
Max rows per file: All (memory-efficient)
Schema diversity: 23 unique layouts
Training DeepAutoEncoder... Done.
Training IsolationForest... Done.
Governance & Scale
Upload, run, storage, and retention limits are calculated based on your active subscription tier.
| Plan | Price | Max file | Files / month | Runs / month | Storage | Downloads | Seats | Share links | Retention |
|---|---|---|---|---|---|---|---|---|---|
| Starter | $0 forever | 5 MB | 10 | 10 | 100 MB | 500 MB | 1 | 0 | 7 days |
| Plus | $29/month | 25 MB | 100 | 200 | 2 GB | 10 GB | 3 | 25 | 30 days |
| Pro | $79/month | 100 MB | 500 | 1,000 | 20 GB | 100 GB | 10 | 250 | 180 days |
| Enterprise | Custom | 250 MB | Unlimited | Unlimited | Custom | Custom | Unlimited | Unlimited | 3,650 days |