v1.2 SaaS-Grade Engine

Hybrid Intelligent Dataset Cleaner.

A local-first data sanitization pipeline that combines deterministic rules, statistical models, and deep autoencoders to protect data integrity.

System Health: Active
Rule Engine: Deterministic
Neural Net: PyOD / Autoencoder
SmartImputer: Variance Aware
Privacy: Local-First (MD5)

Data Integrity First

CleanIQ operates under a strict philosophy: Format > Fix > Impute > Flag > Report.

PyOD Neural Nets

Deep autoencoders score each row by its reconstruction error to flag multivariate anomalies.

Local-First Privacy

Structural fingerprints are learned without sending raw sensitive data to the cloud.

Architecture

The Data Sanitization Pipeline

CleanIQ runs its stages in sequence, moving from cheap deterministic rules to deep-learning inference, to minimize computational overhead while maximizing anomaly detection precision.

01

Rule Engine

Deterministic cleaning such as whitespace stripping and boolean normalization.
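
A minimal sketch of what this deterministic stage might look like for a column already identified as boolean. The token sets and the names `clean_cell` and `clean_boolean_column` are hypothetical; the actual rule engine's vocabulary is not documented here.

```python
from typing import List, Optional, Union

# Hypothetical token sets (assumption); a real rule engine may accept more spellings.
TRUE_TOKENS = {"true", "t", "yes", "y", "1"}
FALSE_TOKENS = {"false", "f", "no", "n", "0"}


def clean_cell(value: Optional[str]) -> Union[str, bool, None]:
    """Strip surrounding whitespace, then map boolean-like tokens to bool."""
    if value is None:
        return None
    stripped = value.strip()
    if stripped == "":
        return None  # a blank cell is treated as missing
    lowered = stripped.lower()
    if lowered in TRUE_TOKENS:
        return True
    if lowered in FALSE_TOKENS:
        return False
    return stripped  # not boolean-like: keep the stripped text


def clean_boolean_column(values: List[Optional[str]]) -> List[Union[str, bool, None]]:
    """Apply the same rule to every cell; identical input gives identical output."""
    return [clean_cell(v) for v in values]
```

Because the rules are pure functions of the cell text, running them twice gives the same result, which is what makes this stage cheap to audit.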

02

Classical ML

SmartImputer selects an imputation strategy (mean, median, or mode) based on each column's skewness.
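
One way skewness-based selection could work is sketched below using only the standard library. The cutoff `SKEW_THRESHOLD` and the function names are assumptions for illustration, not the SmartImputer's actual parameters: roughly symmetric numeric columns get the mean, heavily skewed ones get the median, and non-numeric columns get the mode.

```python
import statistics
from typing import Any, List

SKEW_THRESHOLD = 1.0  # assumed cutoff; the real engine's value is not documented


def sample_skewness(values: List[float]) -> float:
    """Fisher-Pearson skewness g1 = m3 / m2**1.5 (population moments)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return 0.0 if m2 == 0 else m3 / m2 ** 1.5


def choose_strategy(column: List[Any]) -> str:
    """Return 'mean', 'median', 'mode', or 'none' for an all-missing column."""
    observed = [v for v in column if v is not None]
    if not observed:
        return "none"
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in observed
    )
    if not numeric:
        return "mode"
    if abs(sample_skewness(observed)) > SKEW_THRESHOLD:
        return "median"  # the median resists the long tail of a skewed column
    return "mean"


def impute(column: List[Any]) -> List[Any]:
    """Fill missing cells (None) with the value chosen by the strategy."""
    strategy = choose_strategy(column)
    if strategy == "none":
        return list(column)
    observed = [v for v in column if v is not None]
    fill_functions = {
        "mean": statistics.fmean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = fill_functions[strategy](observed)
    return [fill if v is None else v for v in column]
```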

03

Statistical Outliers

Isolation Forest and interquartile range (IQR) checks for statistical anomalies.
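
The IQR half of this stage can be sketched with the standard library alone. The function name `iqr_outlier_flags` and the conventional multiplier `k = 1.5` are assumptions; the Isolation Forest half is not shown here because it needs a model library such as scikit-learn.

```python
import statistics
from typing import List


def iqr_outlier_flags(values: List[float], k: float = 1.5) -> List[bool]:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR].

    Needs at least two values, because quartiles are undefined otherwise.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return [v < lower or v > upper for v in values]
```

An IQR fence looks at one column at a time, so it catches extreme single values but not rows whose individual values are ordinary and only their combination is unusual.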

04

Neural Net

PyOD autoencoder that flags high-dimensional structural anomalies.
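
To illustrate the reconstruction-error idea without depending on a deep-learning stack, the sketch below swaps in a linear stand-in: PCA computed with NumPy, which behaves like a one-layer linear autoencoder. It is not the PyOD model, and the function names and the 0.95 quantile cutoff are assumptions. Rows are compressed to a few components, rebuilt, and scored by how badly the rebuild misses.

```python
import numpy as np


def reconstruction_errors(X: np.ndarray, n_components: int) -> np.ndarray:
    """Squared reconstruction error per row through a linear bottleneck."""
    mean = X.mean(axis=0)
    std = X.std(axis=0)
    std[std == 0] = 1.0  # avoid dividing by zero on constant columns
    Z = (X - mean) / std
    # Rows of Vt are principal directions; the top ones act as the encoder.
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    W = Vt[:n_components]
    Z_hat = (Z @ W.T) @ W  # encode, then decode
    return ((Z - Z_hat) ** 2).sum(axis=1)


def flag_anomalies(
    X: np.ndarray, n_components: int = 1, quantile: float = 0.95
) -> np.ndarray:
    """Flag rows whose error exceeds the chosen quantile of all errors."""
    errors = reconstruction_errors(X, n_components)
    return errors > np.quantile(errors, quantile)
```

A row can sit inside every per-column range and still reconstruct poorly if it breaks the relationship between columns, which is the multivariate case a per-column check misses.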

User guides

The Core CleanIQ Workflows

Each guide maps to a real application surface so users can move from documentation into the product without hunting.

Upload

Bring CSV or TSV files into CleanIQ with plan-aware validation before a dataset record is created.

  1. Select a file
  2. Validate size and format
  3. Create the dataset record

Profile

Inspect row counts, column counts, file size, status, and quality score before any transformation is applied.

  1. Open dataset
  2. Review metadata
  3. Find quality signals

Transform

Build auditable cleaning flows with operations that are previewed before they become exports.

  1. Pick operations
  2. Preview results
  3. Apply with audit context

Export

Download cleaned data and keep teams aligned with consistent dataset history and export surfaces.

  1. Choose dataset
  2. Select format
  3. Download output

Engine Insight

SaaS-Grade Training

The training pipeline handles structural fingerprinting and synthetic corruption at scale.

Recursive Target Scanning

Recursively scans the target directory, quietly parsing thousands of sub-directories, and builds MD5-hashed schemas from tabular data with no row limit.
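
A rough sketch of how a recursive schema scan could be structured, assuming a fingerprint is the MD5 digest of the normalized header row. The functions `schema_fingerprint` and `scan_schemas` are hypothetical names, and the normalization rule is an assumption. Only the header line of each file is read, so file length does not affect memory use and no cell values are hashed.

```python
import csv
import hashlib
from pathlib import Path
from typing import Dict, List


def schema_fingerprint(header: List[str]) -> str:
    """MD5 hex digest of the lower-cased, stripped column names."""
    normalized = "|".join(name.strip().lower() for name in header)
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def scan_schemas(root: Path) -> Dict[str, List[Path]]:
    """Group every CSV/TSV file under root (recursively) by fingerprint."""
    layouts: Dict[str, List[Path]] = {}
    for path in sorted(root.rglob("*")):
        suffix = path.suffix.lower()
        if suffix not in {".csv", ".tsv"} or not path.is_file():
            continue
        delimiter = "\t" if suffix == ".tsv" else ","
        with path.open(newline="", encoding="utf-8", errors="replace") as handle:
            # Read only the first row; the data rows are never loaded.
            header = next(csv.reader(handle, delimiter=delimiter), None)
        if header:
            layouts.setdefault(schema_fingerprint(header), []).append(path)
    return layouts
```

The number of keys in the returned dictionary corresponds to the count of unique layouts found in the tree.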

Synthetic Dirty Generator

Deliberately injects advanced mojibake failures, stray delimiters, and swaps to stress-test the autoencoders.
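
The three corruption types named above could be sketched as follows. All function names are hypothetical, and reading "swaps" as exchanging two neighbouring cells is an assumption. The mojibake helper uses the classic failure of decoding UTF-8 bytes as Latin-1.

```python
import random
from typing import List


def mojibake(text: str) -> str:
    """Encode as UTF-8, then wrongly decode as Latin-1.

    Pure-ASCII text comes back unchanged; only non-ASCII characters garble.
    """
    return text.encode("utf-8").decode("latin-1")


def add_stray_delimiter(text: str, rng: random.Random, delimiter: str = ",") -> str:
    """Insert one extra delimiter at a random position inside the cell."""
    position = rng.randint(0, len(text))
    return text[:position] + delimiter + text[position:]


def swap_adjacent(row: List[str], rng: random.Random) -> List[str]:
    """Return a copy of the row with two neighbouring cells exchanged."""
    if len(row) < 2:
        return list(row)
    i = rng.randrange(len(row) - 1)
    swapped = list(row)
    swapped[i], swapped[i + 1] = swapped[i + 1], swapped[i]
    return swapped


def corrupt_row(row: List[str], rng: random.Random) -> List[str]:
    """Apply one randomly chosen corruption to a copy of a clean row."""
    if not row:
        return []
    kind = rng.choice(["mojibake", "delimiter", "swap"])
    if kind == "swap":
        return swap_adjacent(row, rng)
    dirty = list(row)
    i = rng.randrange(len(dirty))
    if kind == "mojibake":
        dirty[i] = mojibake(dirty[i])
    else:
        dirty[i] = add_stray_delimiter(dirty[i], rng)
    return dirty
```

Passing a seeded `random.Random` keeps the generated corruption reproducible, so a clean row and its dirty counterpart can be paired again on a later training run.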

train_models.py
$ python train_models.py --no-synthetic

Found 650 training files (35.63 GB total)
Max rows per file: All (memory-efficient)
Schema diversity: 23 unique layouts
Training DeepAutoEncoder... Done.
Training IsolationForest... Done.

Plan limits

Governance & Scale

Upload, run, storage, and retention limits are determined by your active subscription tier.

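| Plan | Price | Max file | Files / month | Runs / month | Storage | Downloads | Seats | Share links | Retention |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Starter | $0 forever | 5 MB | 10 | 10 | 100 MB | 500 MB | 1 | 0 | 7 days |
| Plus | $29/month | 25 MB | 100 | 200 | 2 GB | 10 GB | 3 | 25 | 30 days |
| Pro | $79/month | 100 MB | 500 | 1,000 | 20 GB | 100 GB | 10 | 250 | 180 days |
| Enterprise | Custom | 250 MB | Unlimited | Unlimited | Custom | Custom | Unlimited | Unlimited | 3,650 days |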
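
As an illustration of plan-aware upload validation, the sketch below checks a file's extension and size against the per-plan maximum file sizes listed above. The function `validate_upload`, the error wording, and the choice of 1 MB = 1,048,576 bytes are assumptions; the product may count megabytes differently.

```python
from pathlib import PurePath
from typing import List

MB = 1024 * 1024  # assumption: binary megabytes

# Maximum file size per plan, taken from the plan limits above.
MAX_FILE_BYTES = {
    "Starter": 5 * MB,
    "Plus": 25 * MB,
    "Pro": 100 * MB,
    "Enterprise": 250 * MB,
}
ALLOWED_SUFFIXES = {".csv", ".tsv"}  # uploads accept CSV or TSV files


def validate_upload(filename: str, size_bytes: int, plan: str) -> List[str]:
    """Return a list of validation errors; an empty list means the file passes."""
    errors: List[str] = []
    if plan not in MAX_FILE_BYTES:
        return [f"unknown plan: {plan}"]
    suffix = PurePath(filename).suffix.lower()
    if suffix not in ALLOWED_SUFFIXES:
        errors.append(f"unsupported format: {suffix or '(none)'}")
    if size_bytes > MAX_FILE_BYTES[plan]:
        limit_mb = MAX_FILE_BYTES[plan] // MB
        errors.append(f"file exceeds the {limit_mb} MB limit for {plan}")
    return errors
```

Returning every failed check at once, rather than stopping at the first, lets an upload form show the user everything that needs fixing before a dataset record is created.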