Cleaning Up Double Spaces in SQL: Why and How
SeeDream

Cleaning Up Double Spaces in SQL: Why and How

Why Do We Need to Clean Double Spaces?

If you work with data, you’ve probably seen it: “Messy text with too many spaces.”

Double (or more) spaces sneak into data for all sorts of reasons—manual entry, copy-paste errors, or system quirks. These extra spaces can cause problems:

  • They break searches and filters (e.g., 'John Smith' ≠ 'John Smith')
  • They make reports look unprofessional
  • They can even cause issues with data exports or integrations

SQL doesn’t have a built-in, one-step way to collapse all double (or more) spaces into single spaces. That’s where a reusable function comes in handy!


The Clever Approach

Replace all spaces with two special characters! Simple really 🤭

Here’s how it works, step by step:

  1. Replace every space with a unique marker (by default, two rare characters).
  2. Remove overlaps that happen when there are multiple spaces in a row.
  3. Turn the markers back into single spaces.

The function has defaults of CHAR(17) and CHAR(18) - WHAT ARE THEY?

CHAR(17) and CHAR(18) are simply ASCII characters that are rarely used in typical text. Think of them like secret codes or placeholders.

  • CHAR(17) represents the Device Control 1 character.
  • CHAR(18) represents the Device Control 2 character.

These are non-printable control characters from the ASCII (American Standard Code for Information Interchange) table. In the early days of computing, these characters were used to control hardware devices (like printers or terminals), but they don't have a visual representation when you type them or display them in most modern contexts.

Why they are used in this context:

The cleverness of using CHAR(17) and CHAR(18) in the double-space cleaning function comes from their rarity in actual text data.

The goal of the "marker" method is to temporarily replace spaces with something unique that won't accidentally be part of your original string. If you used something common like ## as a marker, and your original string was "Price ## 100", the function would mess up your data.

By choosing CHAR(17) and CHAR(18), the chances of them naturally appearing in a user's name, address, or product description are extremely low. This makes them ideal "safe" temporary placeholders for the string manipulation trick.


Why Is This Function Good Practice?

  • Efficient: It handles any number of spaces in a row, all at once.
  • Safe: By using rare characters (CHAR(17) and CHAR(18)), it avoids messing up your real data.
  • Simple: No need for complicated loops or repeated replacements.


The Solution: A Reusable, Safe SQL Function

CREATE OR ALTER FUNCTION Toolbox.CleanDoubleSpaces
(
    @InputString NVARCHAR(MAX),
    @TempChar_1 NCHAR(1),
    @TempChar_2 NCHAR(1)
)
RETURNS NVARCHAR(MAX)
AS
BEGIN
    DECLARE @Result NVARCHAR(MAX);
    DECLARE @TempMarker NVARCHAR(2);
    DECLARE @OverlapMarker NVARCHAR(2);
    DECLARE @TempChar1 NCHAR(1);
    DECLARE @TempChar2 NCHAR(1);

    SELECT @TempChar1 = ISNULL(@TempChar_1,CHAR(17))
    SELECT @TempChar2 = ISNULL(@TempChar_2,CHAR(18))

    -- Handle NULL input
    IF @InputString IS NULL
        RETURN NULL;

    -- Handle empty string
    IF LEN(@InputString) = 0
        RETURN @InputString;

    -- Handle no spaces
    IF CHARINDEX(' ', @InputString) is NULL 
        RETURN @InputString;

    -- Create our marker strings
    SET @TempMarker = @TempChar1 + @TempChar2;
    SET @OverlapMarker = @TempChar2 + @TempChar1;

    -- Apply the three-step process
    SET @Result = REPLACE(
        REPLACE(
            REPLACE(@InputString, ' ', @TempMarker),
            @OverlapMarker, ''
        ),
        @TempMarker, ' '
    );

    RETURN @Result;
END;        

Ensuring the Toolbox Schema Exists

Here’s how you can ensure the toolbox schema exists before creating your function. You only need to run this code once per database, but it’s a good habit to include it in your deployment scripts.

IF NOT EXISTS (SELECT 1 FROM sys.schemas WHERE name = 'Toolbox')
    EXEC('CREATE SCHEMA Toolbox');
GO        

Why Is This Better Than Looping?

A common way to remove double spaces is to use a loop: You keep replacing ' ' (two spaces) with ' ' (one space) over and over until there are no more double spaces left.

But looping has drawbacks:

  • Performance: Each loop scans the entire string again, which can be slow for long text or large tables.
  • Resource Usage: Loops use more CPU and memory, especially inside SQL functions or queries.
  • Complexity: Looping logic is harder to read, maintain, and debug.

The marker method is better because:

  • It does all the work in a single pass using built-in string functions.
  • It’s much faster, especially for big data.
  • It’s easier to understand and maintain.
  • It avoids the risk of infinite loops if something goes wrong.

In short: The marker method is more efficient, more reliable, and more readable than looping!


Usage Examples

-- Basic usage (safe for all data)
SELECT Toolbox.CleanDoubleSpaces('Hello    world!  This   is   SQL.', null, null);
-- Returns: 'Hello world! This is SQL.'

-- With custom temporary characters (if you need to avoid CHAR(17) and CHAR(18))
SELECT Toolbox.CleanDoubleSpaces('Messy   data', CHAR(1), CHAR(2));        

Extra Tips and Best Practices

1. Be Mindful of Data Types

  • The function uses NVARCHAR(MAX) for input and output for maximum compatibility.
  • If your data is VARCHAR, cast it to NVARCHAR before using the function, or create a VARCHAR version.

2. Handling NULLs and Empty Strings

  • The function returns NULL if the input is NULL.
  • It returns an empty string if the input is empty.

3. Performance

  • For very large tables, consider running this function in batches or during off-peak hours, as string manipulation can be resource-intensive.


Conclusion

Cleaning up double spaces in SQL is a common need, and this function makes it easy, efficient, and safe for all your data. Add it to your toolkit to keep your data—and your reports—looking sharp!


Happy cleaning!

Ben Doremus

Healthcare Data Leadership and Execution

1mo

This seems like a clean cut scenario to use regex. I'm curious what would lead one to create a function like this?

Like
Reply

Many SQL DBMSes allow you to use SELECT REGEXP_REPLACE('Hello world! This is SQL.',' +',' ','g')

Dennes Torres

Data Platform MVP (2020-2023) | Certified Expert in Fabric Analytics Engineering, Azure Data Engineering, Solutions Architecture, DevOps, and Development

1mo

Data cleaning techniques are always important!

Robert Harmon

Taking a machete to analytics inefficiencies

1mo

That's actually pretty slick. Of course, if you're a bit of a chaos monkey, once you get everything down to a single space, replace all the single spaces with chr(255) :D

Toby Beevers

Head of Data | Digital Transformation Leader | Author of “Beyond the Numbers” Newsletter | Open to New Opportunities

1mo

Nice, like that James Reeves thank you 🙏

To view or add a comment, sign in

Others also viewed

Explore topics