Redundancy vs Normalization

Published on: Sun Oct 05 2025

Normalization Vs Denormalization

Databases are like cities. The more data (people) you add, the messier it gets. Normalization is your city planner: it builds the structure, assigns the addresses, and enforces order. It ensures data integrity.

But when that city grows into a mega-metropolis pushing millions of queries per second, structure becomes friction. You have to break a few rules.

That’s when you meet denormalization: the calculated decision to bend the design, made by someone who understands that at massive scale, strict rules must yield to the pragmatic demands of speed. You don't get purity; you get performance.

We’re not talking about the mess here. Normalization is integrity. Denormalization is controlled redundancy … a strategic attack on latency.

Let's walk through the trade-offs and find the line between good engineering and bad habits.

Normalization

Normalization is the process of breaking a table into logically separate sub-sections. This is where you decide which column belongs to which table and whether you need some extra tables or not.

When you are starting from scratch and just brainstorming about what the database should look like, a good practice is to dump all the columns required by the application into a single monolithic table and then start butchering it down into separate sections/tables, all the while asking yourself the following questions:

  1. Should this column go into this table or does it need a table of its own?
  2. Is this data repeated across rows?
  3. Can this group of fields be reused elsewhere?
  4. Will I need to filter, sort, or join on this?
  5. Is this field fully dependent on the primary key, or only partially (e.g. on part of a composite key)?

A redundant table butchered into smaller tables

Normalizing our database comes with some benefits, namely reduced redundancy and improved data consistency. But what do they actually mean?

Redundancy is when you have duplicate columns across different tables. Reducing redundancy means you remove the duplicates from the other tables and keep that data in just one place.

Consistency is a by-product of reduced redundancy. It means there is a single column for a single piece of information, e.g. age should be found in the user_profile table and nowhere else. A single source of truth.

Let's understand this with an example:

A redundant schema where "price" column is in all three tables, in one form or another

Would you say that this schema contains redundancy? You would probably point out the price column, which is present in both the meals table and the meal_orders table. Moreover, there is a derived column in the orders table named total_price, which, as you can probably guess, is the total sum of all the meals ordered in a single order.

So, in conclusion, the above schema contains redundancy. And as it happens, normalizing it is very simple: you just eliminate the price column from the meal_orders table and total_price from the orders table.

So the next time you want a total sum of all the meals ordered in a single order, you would do a query like this:

query.sql
-- Total cost of order 123, now that price lives only in the meals table
SELECT SUM(m.price)
FROM meal_orders AS mo
INNER JOIN meals AS m ON mo.meal_id = m.id
WHERE mo.order_id = 123;

Denormalization

Just to keep us both on the same page: denormalization in databases is not always seen as a defect or something wrong. Yes, it has its downsides, but schools rarely talk about when we should denormalize our databases on purpose.

To demonstrate, we will use the example I shared above which includes meals, meal_orders and orders tables. The redundancy introduced in it can be necessary for a number of reasons including:

  • An admin may change the price of a meal on the fly. In that case, how will we know how much money a user paid for a certain meal at a certain time in the past?
  • Users may apply a discount to the total price of the order. Here, having the total_price column inside the orders table tells us exactly how much a customer paid after the discount was applied.
  • Having these redundant columns in our database also lets us keep a report of how much money came into the system and how much was discounted. The app owner might need this report for their accountant.

Turns out the redundancy we were considering bad was actually helpful for a lot of use-cases. But in this scenario it is more like “adhering to business requirements” than “actual performance benefits”, right? This is not actually a break from normalization but a necessary design pattern, often called an Event Store or a Fact Table.

📌 An Event Store or a Fact Table is where you capture the state of data at the time of a transaction. Like a snapshot or a docker image.

So instead, let’s also look at an example where redundancy is not related to business requirements but purely improves performance.

Since the accounting guy would likely ask you for a monthly or yearly report, you can store that pre-calculated information (the total number of meals sold and the average price of a meal per month) in a separate reporting table: calculated metrics that could be derived from the raw data but are stored for speed. It is like doing the assignment early and waiting for the deadline when the teacher asks for it. Quite a nice feeling, isn’t it?

Here is what the reporting table would look like:

Reporting table which contains pre-calculated information for faster query

📌 I used a separate table in this example, but you are free to use Views or Materialized Views as well.
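For instance, here is a minimal sketch of that reporting table built as a materialized view. The view name, the orders.created_at timestamp, and the exact output columns are assumptions for illustration, not a prescription:

monthly_report.sql
-- Pre-aggregate monthly numbers once, instead of recomputing them per request
-- (assumes orders has a created_at timestamp column)
CREATE MATERIALIZED VIEW monthly_meal_report AS
SELECT
    date_trunc('month', o.created_at) AS month,
    count(mo.meal_id)                 AS meals_sold,
    avg(m.price)                      AS avg_meal_price
FROM orders o
JOIN meal_orders mo ON mo.order_id = o.id
JOIN meals m        ON m.id = mo.meal_id
GROUP BY 1;

-- Refresh on a schedule (e.g. nightly) so reads never pay the aggregation cost
REFRESH MATERIALIZED VIEW monthly_meal_report;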

So you see, introducing redundancy can be just as necessary as normalization in certain scenarios. And before you ask: no, there is no hard-and-fast rule for when to use which. Just be mindful of your use-case and the client's requirements, and learn to have a little foresight.

Let's look at some more scenarios where denormalization can be necessary.

Movie Genre

A classic relational challenge is the many-to-many relationship. A movie can have multiple genres, and a single genre can belong to multiple movies. So it only makes sense for the schema to look like this:

A schema of movies and their genres
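In SQL terms, that normalized schema might look something like this (a minimal sketch; the exact column names are my own assumptions):

movie_schema.sql
CREATE TABLE movies (
    id    SERIAL PRIMARY KEY,
    title TEXT NOT NULL
);

CREATE TABLE genres (
    id   SERIAL PRIMARY KEY,
    name TEXT UNIQUE NOT NULL
);

-- The linking table that resolves the many-to-many relationship
CREATE TABLE movie_genre (
    movie_id INT NOT NULL REFERENCES movies(id),
    genre_id INT NOT NULL REFERENCES genres(id),
    PRIMARY KEY (movie_id, genre_id)
);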

But think about Netflix. No! It is not the time to be distracted. Think about what they would do if they had to fetch millions of movies along with their genres to perform their machine learning and sciency stuff.

A common theory-friendly answer would be to use joins. Joins are good, but they can prove very computationally expensive in such scenarios. Now, I know exactly what you are thinking. Hate to break it to you, but we cannot reuse our previous "meals and orders" trick here, since there are no repeated columns in this schema. So what do we do?

📌 Some of you might object that we could use pagination to fetch batches of movies along with their genres. That might work, or it might not, depending on how your machine learning model works. But for the sake of argument, we are trying to see whether thinking outside the theory book actually helps us or not.

Throughout my career, I have seen people come up with a “solution”: instead of creating two separate tables just for the sake of storing genre information, you make do with only one extra table, or none at all. Let's elaborate.

📌 At this stage I am not saying whether the following approach is good or not … we will decide that at the end of the discussion. It is just that I have seen people do this more often than necessary, and I would like to address that.

There are two ways to approach this. First one is given below:

  1. Create a column named genre in the movies table.
  2. Store JSON array inside it like this ["horror", "sci-fi"].

This way you would not need to create two separate tables. But with this approach, you have no consistent naming scheme for the genres: "Sci-fi", "sci-fi" and "scifi" would be counted as three different genres.
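Sketched out, the first approach looks roughly like this (PostgreSQL syntax; treat it as an illustration, not a recommendation):

approach_one.sql
-- A single JSON column on movies replaces both genre tables
ALTER TABLE movies ADD COLUMN genre JSONB;

UPDATE movies SET genre = '["horror", "sci-fi"]' WHERE id = 1;

-- Nothing stops inconsistent spellings from creeping in:
UPDATE movies SET genre = '["Sci-fi"]' WHERE id = 2;
UPDATE movies SET genre = '["scifi"]'  WHERE id = 3;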

As for the other way to do this:

  1. To work around the problem of an inconsistent naming scheme, we can still keep the genres table, but remove the movie_genre table.
  2. Instead, we again introduce a JSON column (if supported) in the movies table and fill it with genre IDs this time, like this: [1,4,7].
  3. You can use your ORM to handle fetching the relevant genres using this array in an optimized manner. Laravel has eager loading, Django has the __in field lookup, etc.

This way we are still using JOINs to fetch movie genres, but the number of joined rows is significantly smaller, which reduces the computational cost a little.
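A rough sketch of the second approach (the column name genre_ids is my own assumption):

approach_two.sql
-- Keep the genres table for consistent names, drop movie_genre,
-- and store the related genre IDs directly on the movie row
ALTER TABLE movies ADD COLUMN genre_ids JSONB;

UPDATE movies SET genre_ids = '[1, 4, 7]' WHERE id = 1;

-- What the ORM's eager load roughly boils down to:
--   1) SELECT id, title, genre_ids FROM movies WHERE ...;
--   2) SELECT * FROM genres WHERE id IN (1, 4, 7);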

📌 If your database does not support JSON-based columns, just use the text datatype and store the JSON as a string. When you load it through an Object-Relational Mapper (ORM), you convert it into a hashmap, also known as a dictionary (in Python) or an object (in JavaScript).
📌 Mind you, ORM is not magic. You can't go hilly-billy on database design and expect the ORM to handle the mess. ORMs work best on optimized databases. If your schema is bloated or inconsistent, your ORM is just building pretty wrappers around a slow mess. It’s like putting a turbo on a car with square wheels.

Really now?

This redundancy, which we introduced using arrays in a JSON column, has its own drawbacks as well:

  • No enforced FK constraints - PostgreSQL does not enforce FK constraints on array elements.
  • Harder to join using raw SQL - You need UNNEST(), ANY() or JOIN LATERAL to match genres (see the sketch after this list). But who cares when you are using an ORM which can do it in an optimized way, right?
  • Hard to index efficiently - A GIN index helps, but filtering is more complex than flat joins. Again, using an ORM helps a lot.

📌 I will stop right here regarding ORMs, since we will be discussing those in more detail in the upcoming posts.
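To make the "harder to join using raw SQL" point concrete, here is roughly what matching genres against that array looks like in PostgreSQL, assuming the hypothetical genre_ids JSONB column from the earlier sketch:

raw_join.sql
-- Expand the JSONB array into rows, then join each element back to genres
SELECT m.title, g.name
FROM movies m
JOIN LATERAL jsonb_array_elements_text(m.genre_ids) AS e(genre_id) ON TRUE
JOIN genres g ON g.id = e.genre_id::int;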

Now that the deed has been done, let’s analyze what just happened and if it was worth it or not.

| Aspect | Normalized (Linking Table) | Denormalized (JSON Column) |
| --- | --- | --- |
| Data Integrity | Excellent. Enforces Foreign Keys (FKs), ensuring you cannot link to a non-existent genre ID. | Poor. No database-level FK enforcement. Logic must be managed in the application. |
| Query Flexibility | Excellent. Easy to run complex queries like `GROUP BY genre_id` to count movies per genre. | Difficult. Requires complex, non-standard functions (`JSON_UNNEST`, `JSONB_ARRAY_ELEMENTS`) and special indexing (GIN). |
| Schema Evolution | Rigid. Adding a new attribute to a genre requires an `ALTER TABLE` on the `genres` table. | Flexible. You can add new attributes to the JSON structure on a row-by-row basis without changing the schema. |

Judging by this analysis, we can safely conclude that the drawbacks outweigh the benefits of this approach.

So do we always refrain from employing this approach? The answer is “No, not always”, as this approach definitely comes with its own benefits and you might find yourself using it while in the trenches. But as I said, I have seen people using it more than necessary, so you need to use your brain and understand the requirements first.

But wait, is there a better approach to this? The answer is “Absolutely yes”, but before I tell you about that secret sauce, I want to take my time explaining when and how the JSON columns can be useful.

Scenario 1: Fast, Simple Lookups (Read Optimization)

This is the core denormalization benefit. A JSON column is used when you need to fetch the primary entity (e.g., a movie) and all its simple, related attributes (e.g., genres) in a single, fast database query, eliminating the JOIN.

Think of loading an item's details page on a high-traffic website: the primary query is always to fetch the item by ID.

You are optimizing for the 99% read path at the expense of the 1% maintenance/write path. Since most web applications read data far more often than they update it, the speed gain on the most frequent operation can be worth the integrity sacrifice.
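For example, with a hypothetical denormalized genre column on movies, the hot read path collapses from a three-table JOIN into a single-row lookup:

read_path.sql
-- Denormalized: one query, one row, no JOIN
SELECT id, title, genre
FROM movies
WHERE id = 42;

-- The normalized equivalent it replaces:
SELECT m.id, m.title, g.name
FROM movies m
JOIN movie_genre mg ON mg.movie_id = m.id
JOIN genres g       ON g.id = mg.genre_id
WHERE m.id = 42;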

Much quicker than using JOINS

Scenario 2: Low-Cardinality Tagging and Attributes

This applies when the set of related items is small, limited, and non-critical to other business logic. In this case, a movie can have at most 2-5 genres as far as I have seen. So the genre set each movie has is very limited, but on the other hand we need to query movies by genre, and that is where it fails.

Therefore I would like to use another example just this once to explain how storing raw JSON helps.

The example is about storing a user's settings or preferences: a list of application preferences ({"theme": "dark", "notifications": true}) or a simple set of permissions (["read", "write"]). These flags are tied only to the user and don't require complex relationships with other tables.

And later on, you can easily support a new preference for a new feature simply by adding a new key-value pair to the JSON object, as sketched below.
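A minimal sketch of such a table in PostgreSQL; all names here are assumptions for illustration:

user_preferences.sql
CREATE TABLE user_preferences (
    user_id     INT PRIMARY KEY REFERENCES users(id),
    preferences JSONB NOT NULL DEFAULT '{}'
);

-- Read a single flag
SELECT preferences ->> 'theme' AS theme
FROM user_preferences
WHERE user_id = 7;

-- Ship a new feature later by merging in a new key, no migration needed
UPDATE user_preferences
SET preferences = preferences || '{"beta_features": true}'
WHERE user_id = 7;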

User preferences table leveraging the power of NoSQL

Scenario 3: Microservices and Flexible Schema (NoSQL Hybrid)

JSON columns are critical for providing schema flexibility in a rigid relational environment.

The front-end or a downstream microservice might need data from several normalized tables, but only wants a single, self-contained object. Storing a JSON blob allows the database to act as an effective API payload cache, assembling the document structure right inside the RDBMS.

📌 We are going to assume GraphQL does not exist yet. That is another topic for another day.

This is used for storing metadata or optional fields that are not present on every row (e.g. movie metadata such as resolution, length, filetype, etc.). This avoids constant schema migrations (ALTER TABLE) and prevents the main table from having hundreds of sparse, empty columns.
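As a hedged sketch, that optional metadata could live in a single JSONB column instead of dozens of mostly-NULL columns (the metadata column and its keys are assumptions):

movie_metadata.sql
ALTER TABLE movies ADD COLUMN metadata JSONB;

-- Only rows that actually have these attributes carry them
UPDATE movies
SET metadata = '{"resolution": "4K", "length_min": 142, "filetype": "mkv"}'
WHERE id = 42;

-- The ? operator checks whether a JSONB key exists
SELECT title, metadata ->> 'resolution' AS resolution
FROM movies
WHERE metadata ? 'resolution';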

See those random NULL cells?

Anyways, now you know when and how these JSON columns in your tables can be useful.

Coming back to our movie database example, this approach is not suitable there, because the recommendation engine needs to throw movies at you based on the genres you like, or you might want to filter movies by genre.

So what was the secret sauce I was talking about? When platforms like Netflix need to fetch millions of records for high-speed delivery or backend "sciency stuff" (like machine learning models), the goal is to eliminate runtime computation. The solution isn't to destroy the normalized source tables, but to pre-calculate and cache the joined data. Same thing we did with the “meals and orders” example.

Instead of the risk-prone approach of using internal JSON arrays, which essentially sacrifices Foreign Key constraints and makes raw SQL querying complex, a cleaner denormalization technique is Column Flattening for a dedicated read path.

This approach creates a separate read-optimized table, say column_flattened_movie_table, that stores the most commonly accessed genre data as discrete, redundant columns.
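Here is a sketch of what such a table and its refresh query might look like. The fixed genre_1 to genre_3 columns are an assumption for illustration; a real schema would size this to its own data:

column_flattened.sql
CREATE TABLE column_flattened_movie_table (
    movie_id INT PRIMARY KEY REFERENCES movies(id),
    title    TEXT NOT NULL,
    genre_1  TEXT,
    genre_2  TEXT,
    genre_3  TEXT
);

-- Rebuilt periodically from the normalized source tables
INSERT INTO column_flattened_movie_table (movie_id, title, genre_1, genre_2, genre_3)
SELECT m.id, m.title, agg.names[1], agg.names[2], agg.names[3]
FROM movies m
JOIN LATERAL (
    SELECT array_agg(g.name ORDER BY g.name) AS names
    FROM movie_genre mg
    JOIN genres g ON g.id = mg.genre_id
    WHERE mg.movie_id = m.id
) agg ON TRUE
ON CONFLICT (movie_id) DO UPDATE
SET title   = EXCLUDED.title,
    genre_1 = EXCLUDED.genre_1,
    genre_2 = EXCLUDED.genre_2,
    genre_3 = EXCLUDED.genre_3;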

A column flattened table dedicated for the machine learning model

This denormalized table serves a specific “read” purpose: showing a quick overview of a movie's genres without needing a JOIN. Yes, the genre names are redundant (they are also in the genres table), and you must update them if a core genre name changes. But this is a calculated trade-off where we sacrifice some write speed and storage to dramatically improve read performance for the most common use case.

📌 Oh wait! Doesn’t that table look identical to the table we butchered at the beginning of this blog?

Now you know that normalization and denormalization are just a game of trade-offs, where you can only make the right call if you are well aware of the requirements and, to some extent, the future changes. This is why we must question everything until the requirement becomes clear enough to start working.

The Modern Reality: RDBMS is Evolving

In case you didn’t know this already, modern relational databases like PostgreSQL (JSONB) and MySQL now treat JSON columns as a first-class data type.

This means the "JSON is just a slow string" argument is outdated. JSONB, for instance, can be indexed (using a GIN index) and queried, allowing it to perform fast lookups that compete with traditional JOINs in specific scenarios, especially when the join table is extremely large and read-heavy.

Conclusion

Normalization and denormalization are simply two specialized tools.

Normalization is your foundation; it's the bedrock that ensures data integrity and honesty. You don't skip it.

Denormalization is the necessary evil. It's the calculated choice to trade clean data for raw speed when JOINs become too expensive at scale. It's a strategic attack on latency.

The difference between a novice and a veteran isn't knowing the rules; it's knowing exactly which rule to break, why you're breaking it, and what the maintenance cost will be.

So, stop asking if your schema is clean. Start asking:

"Will this structure survive the load, and is the speed worth the manual risk?"

That deliberate mindset, the willingness to write your own rules and own the resulting chaos, that is the mark of a true database engineer.


If this post clicked with you, you'll love the newsletter. It's where I share raw, no-fluff lessons from real projects. The kind of stuff you don't find in documentation.

If you feel like it, my community is always open for like minded engineers who seek to level up their game. You are more than welcome to hang out with us. We talk code, chaos, and the craft of building things that don't break. Join the Discord channel

Lastly, everything I make is powered by you.

If you'd like to keep this campfire burning, you can drop a tip in the Tip Jar … or simply spread the word. Every bit helps.


Categories:
Data Strategies
Tags:
normalization, relational databases, sql, optimization