Site icon Filestack Blog

How to Validate and Clean Extracted Receipt Data Before Database Insertion

Validate and Clean Extracted Receipt Data

When you add receipt scanning to your app, reading text from the receipt image is only the first step. The harder part comes next.

The text you get is usually not perfect:

So before you save this data in your database, you must clean it and check if it’s correct.

Filestack’s Intelligence API helps you easily get text from receipts using advanced OCR technology. But to make your app work properly in real situations, you also need a robust system to clean and verify that data.

💡Tip: If you want to understand how receipt extraction works in detail, you can also explore the OCR and document data extraction overview.

In this guide, you’ll learn how to build a production-ready validation pipeline that ensures your app saves correct and reliable information.

Key Takeaways

Before we talk about how to validate receipt data, let’s first understand why validation is so important.

Why Checking Receipt Data Is Important

Even though Filestack’s OCR is powerful, it is not 100% perfect. Real receipts can be messy, and the raw text you get is not always ready to save directly in your database.

Faded print and poor quality can cause mistakes.

For example, a date like 01/15/2024 might be read as 01/35/2024. If the ink is faded or damaged, the system may confuse numbers like 1, 3, or 7.

Bad lighting or tilted photos can also cause errors.

A shadow over 8.99 might make it look like 8.9 or even 8 99. This can break your number calculations.

Different receipt formats add more problems.

Every store prints receipts in its own way. Some put the total at the bottom. Others put it in the middle after ads or extra text. This makes it harder to reliably find important details.

Multiple languages can also confuse the system.

Some receipts mix English with another language or special symbols. This can make text parsing more difficult.

If you don’t check and clean this data, wrong information will go into your database. That can lead to:

That’s why validating and cleaning receipt data is not optional. It’s necessary if you want your app to be accurate and reliable.

So how do we solve all these problems in a structured way?

The answer is to validate receipt data step by step instead of all at once.

The Complete Validation Pipeline Architecture

For a real production app, receipt validation should not happen in one big function. It should happen step by step. Each step checks for different types of problems.

Here’s a simple example:

async function processReceiptData(extractedData, metadata) {
  // Stage 1: Schema validation: Check required fields
  const schemaValidated = validateSchema(extractedData);

  // Stage 2: Fix data types and formats
  const normalized = normalizeDataTypes(schemaValidated);

  // Stage 3: Business rule validation
  const businessValidated = validateBusinessRules(normalized);

  // Stage 4: Cross-field checks
  const consistencyChecked = validateCrossFieldConsistency(businessValidated);

  // Stage 5: Confidence scoring
  const scored = scoreConfidence(consistencyChecked, metadata);

  // Stage 6: Database preparation
  const dbReady = prepareDatabaseInsertion(scored);

  return dbReady;
}

This step-by-step approach is much better than putting everything in one big validation function. It makes debugging easier, testing simpler, and your system more reliable.

The diagram below shows how receipt data moves through each validation stage before reaching the database.

Let’s break down each stage and see what it actually does.

Stage 1: Schema Validation (Check Required Fields First)

The first step is simple: make sure the receipt data contains the fields your app really needs.

Even though Filestack returns structured data, some fields may be missing depending on the receipt quality. So you should always check before moving forward.

The complete stage-by-stage reference implementation is available in this GitHub repository.

function validateSchema(data) {
  ...
}

What this stage does:

1. Checks important fields

These are required:

If any of these are missing, we mark them as an error.

2. Checks optional fields

These are helpful but not required:

If they are missing, we add a warning instead of stopping everything.

3. Look for unexpected fields

Sometimes, OCR extracts extra random fields.

We don’t fail the process, but we log a warning so we can inspect it later.

Why this is important:

If you skip this step:

By separating errors and warnings, you stay flexible.

For example:

This stage makes sure your data structure is safe before you move to deeper validation.

Once we confirm the required fields exist, the next step is to fix their format and data types.

Stage 2: Type Normalisation and Format Standardisation

At this stage, we fix data types.

Receipts usually give everything as text (strings).

But your database needs proper types like:

So we clean and convert the data carefully.

Date and Time Normalisation

Receipts can show dates in many formats:

Your function tries different patterns until one works.

function normalizeDateField(dateString, context = {}) {
  ...
}

What this function does:

1. Checks if date exists:

If empty: return error.

2. Fixes common OCR mistakes:

3. Tries multiple formats:

Since different stores use different date styles.

4. Validates the date:

5. Attempts correction:

If OCR reads 35 instead of 15, it tries to fix it instead of rejecting immediately.

6. Returns a confidence score:

So instead of just accepting or rejecting, the system:

That’s called a smart validation.

Currency and Numeric Field Normalisation

Money values are tricky because:

Your function handles all this.

function normalizeMoneyField(amountString, currencyContext = 'USD') {
  ...
}

What it does:

1. Removes currency symbols:

2. Checks currency mismatch:

If the receipt says € but your system expects USD, reduce confidence.

3. Handles thousand separators:

It counts commas and periods to decide which is a decimal.

4. Fixes OCR mistakes:

5. Converts to number: Uses parseFloat.

6. Sanity checks:

7. Rounds to 2 decimal places: Standard currency format.

8. Returns confidence score + issues.

Why this stage is important:

Without normalisation:

This stage ensures:

It transforms messy OCR text into clean, structured, database-ready data, safely and intelligently.

Now that everything is in the correct format, we need to check if the numbers actually make sense.

Stage 3: Business Logic Validation

Now that the data types are clean and standardised, the next step is checking if the numbers and values actually make sense.

This stage is about logic, not format.

A value can look perfectly valid (correct type, correct format) but still be wrong.

function validateBusinessRules(receipt) {
  ...
}

What it does:

1. Line items should match the subtotal:

If a receipt has multiple items:

If they don’t match (allowing small rounding differences), something is wrong.

Example:

That’s a problem.

Even if both are valid numbers, the math doesn’t work.

2. Final total should be correct:

The formula should usually be:

Subtotal + Tax + Tip - Discount = Total

If this calculation does not match the reported total (within a small tolerance like 2 cents), it’s flagged as an error.

This catches common OCR mistakes like:

3. Tax rate should be reasonable:

The system calculates:

tax_rate = tax_amount / subtotal

Most sales tax rates fall between 0% and 15%.

If the tax rate is 40% or negative, that’s suspicious.

It may not always be wrong, but it should be flagged.

4. Duplicate line items:

If the same item appears multiple times, it may be:

So duplicates are flagged as warnings.

5. Transaction date should not be in the future:

If the receipt date is tomorrow or next year, that’s almost always an OCR error. This is marked as an error.

6. Merchant name should look real:

Checks include:

That might mean OCR misread the header.

Why this stage is important:

Type validation checks the structure.

Business validation checks logic.

A receipt might pass all format checks, but still:

This stage catches those deeper problems before saving to the database.

It’s the difference between “valid format” and “actually correct data.”

But checking each field individually isn’t enough. Some errors only appear when we compare fields with each other.

Stage 4: Cross-Field Consistency Validation

This stage checks if different fields match each other logically.

Sometimes, each field looks fine on its own. But when you compare them together, the mistake becomes clear.

function validateCrossFieldConsistency(receipt) {
  ...
}

What this does:

1. Cash payment vs tip amount:

If the payment method is CASH, and the tip is very high (like more than 25% of the total), that’s unusual.

Example:

That’s 40%, possible, but unlikely.

This might mean:

So it gets flagged as a warning.

2. Quantity × unit price should equal line total:

If a receipt shows:

That’s clearly wrong because: 2 × 50 = 100 (not 300)

Each line item is checked like this.

If the math doesn’t match (within a small tolerance like 2 cents), it gets flagged.

This catches:

3. Date + Time should not be in the future:

If the receipt has both:

The system combines them into one full datetime.

If that combined value is in the future, it’s almost certainly an OCR mistake.

Example:

That’s an error.

4. Discount should not be bigger than the subtotal:

If:

That’s impossible.

The discount cannot exceed the subtotal.

This is marked as an error.

Why this stage is important:

Single-field validation checks each field separately.

Cross-field validation checks relationships.

Many OCR errors look valid alone, but fail when compared:

This stage catches those deeper logical problems before saving to the database.

It makes your validation smarter and much more reliable.

Even after all these checks, we still need to decide how much we trust this receipt.

Stage 5: Confidence Scoring and Flagging

Even after all validation steps, we still need to answer one important question: How confident are we that this receipt is correct?

Not every receipt can be 100% trusted. So instead of just saying “valid” or “invalid,” we calculate a confidence score between 0 and 1.

The closer to 1, the more reliable the receipt.

function scoreConfidence(receipt, extractionMetadata) {
  ...
}

What affects the confidence score?

1. OCR confidence:

If Filestack provides an OCR confidence score in metadata, we use it.

If OCR says 85% confidence, we multiply that into our score.

2. Image quality:

If the image is blurry, dark, or tilted, the quality score reduces confidence.

Good input = better output.

3. Field-level confidence:

During normalisation (like date and money cleaning), we already calculated confidence for individual fields.

For example:

We average these and apply them to the final score.

If important fields look suspicious, overall trust drops.

4. Validation errors and warnings:

Errors reduce confidence more than warnings.

More problems = lower confidence.

5. Missing optional fields:

Missing optional fields (like merchant address or receipt number) slightly reduce confidence.

This is not critical, but less complete data = slightly less reliable.

What should happen next?

After calculating the final score, we decide:

High confidence (≥ 0.8):

Medium confidence:

Low confidence (< 0.6):

Also:

This is called smart automation.

Why this stage is important:

Instead of manually checking every receipt:

This saves time, reduces manual work, and still keeps data quality high.

It’s not just validation anymore; it’s intelligent decision-making.

Finally, before saving anything, we must ensure the data follows our database rules.

Stage 6: Database Check Before Saving

Before saving the receipt to the database, we must make sure it follows all database rules.

Even if the data passed earlier validation steps, it can still fail at the database level.

This step checks everything in advance so the database doesn’t throw errors.

function prepareDatabaseInsertion(receipt) {
...
}

What this does:

1. Required fields:

If the database says a field is required (like merchant_name or total_amount), it must not be:

If it is missing, we flag it before insertion.

2. Maximum length for text fields:

Databases limit how long text can be.

Example:

If the text is too long, we catch it here instead of letting the database fail.

3. Decimal precision rules:

For money fields like:

The database might allow:

So:

We check both:

4. Allowed values (enums):

Some fields only allow specific values.

Example:

payment_method = CASH, CREDIT, DEBIT, OTHER

If OCR extracts something like:

payment_method = CARD

It will be rejected because it’s not in the allowed list.

5. Duplicate receipt check:

To prevent saving the same receipt twice, you can check:

If a similar record already exists in the database, you can flag it as a duplicate. (This usually requires a database query.)

Why this stage is important:

If you skip this step:

By validating against database rules before insertion:

This is the final safety layer before the receipt is officially stored.

Now that we’ve built the full validation pipeline, let’s see how it connects with Filestack in a real implementation.

How Everything Works Together with Filestack

Here’s how these stages integrate with Filestack’s intelligence API using the Filestack JavaScript SDK:

const filestack = require('filestack-js');

async function processReceiptWithFilestack(imageUrl) {
  const client = filestack.init('YOUR_FILESTACK_API_KEY');

  // Step 1: Extract receipt data using Filestack intelligence
  const intelligenceResponse = await client.intelligence({
    url: imageUrl,
    type: 'receipt'
  });

  const rawData = intelligenceResponse.receipt_data;
  const extractionMetadata = {
    ocr_confidence: intelligenceResponse.confidence,
    image_quality: intelligenceResponse.image_quality,
    extraction_timestamp: new Date().toISOString()
  };

  // Step 2: Run through validation pipeline
  try {
    const result = await processReceiptData(rawData, extractionMetadata);

    if (result.database_ready && result.auto_approved) {
      // High confidence, no errors - insert directly
      await insertReceiptToDatabase(result);

      return {
        status: 'SUCCESS',
        receipt_id: result.id,
        confidence: result.confidence_score
      };

    } else if (result.requires_manual_review) {
      // Flag for manual review
      await queueForManualReview(result);

      return {
        status: 'PENDING_REVIEW',
        receipt_id: result.id,
        confidence: result.confidence_score,
        review_reason: result.review_reason
      };

    } else {
      // Has errors - reject or queue for correction
      await logReceiptError(result);

      return {
        status: 'FAILED',
        errors: result.database_errors || result.validation_errors,
        warnings: result.validation_warnings
      };
    }

  } catch (error) {
    console.error('Receipt processing failed:', error);

    return {
      status: 'ERROR',
      message: error.message
    };
  }
}

What this code does:

Best Practices

To make this system reliable in production, keep these best practices in mind:

With these principles in place, your receipt processing system becomes both automated and reliable.

Conclusion

Receipt data validation helps turn Filestack’s OCR results into clean, reliable database records.

Filestack is very good at extracting text from receipts. But the raw data still needs to be checked before saving. That’s why we use a six-step pipeline:

These steps catch mistakes at different levels while still keeping the system fast.

It’s a good idea to start with strict rules. As you learn more about the types of receipts your users upload, you can adjust and relax some rules if needed. Keep tracking validation results and update your logic when you discover new edge cases.

When validation is done properly:

Let Filestack handle the hard OCR work. Then use strong validation to make sure only clean, correct data is saved in your database.

Exit mobile version