Data Modeling Guide
Data modeling is the process of designing how data is structured, stored, and related within a database. A good data model makes queries fast, data reliable, and systems maintainable. A poor data model leads to slow queries, data anomalies, and endless workarounds.
Why Data Modeling Matters
Your data model determines what questions your database can answer efficiently. A model designed for transactional operations (processing orders) differs fundamentally from one designed for analytics (reporting on revenue trends). Changing a data model after the system is in production is expensive: it requires data migration, application changes, and re-testing. Invest in modeling upfront. Use the JSON Formatter to validate JSON schema definitions that document your data models.
Normalization
Normalization organizes data to reduce redundancy and prevent anomalies. The key normal forms are:
1NF: Every column contains atomic (indivisible) values. No repeating groups.
2NF: 1NF plus every non-key column depends on the entire primary key (no partial dependencies).
3NF: 2NF plus no non-key column depends on another non-key column (no transitive dependencies).
In practice, most OLTP databases are normalized to 3NF. Storing each fact in exactly one place minimizes data duplication, prevents update anomalies, and helps preserve data integrity.
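A minimal sketch of a 3NF layout, using Python's built-in sqlite3 module with an in-memory database. The customers and orders tables and their columns are hypothetical names chosen for illustration. Because the customer's city is stored once rather than repeated on every order, a single UPDATE is enough to change it everywhere.

```python
import sqlite3

# Hypothetical schema: each customer attribute is stored once, in customers.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    city        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
""")
conn.execute("INSERT INTO customers VALUES (1, 'Ada', 'London')")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(10, 1, 25.0), (11, 1, 40.0)],
)

# The city lives in one row, so one UPDATE changes it for every order.
conn.execute("UPDATE customers SET city = 'Paris' WHERE customer_id = 1")

order_cities = conn.execute("""
    SELECT o.order_id, c.city
    FROM orders AS o
    JOIN customers AS c ON c.customer_id = o.customer_id
    ORDER BY o.order_id
""").fetchall()
print(order_cities)
```

Had the city been copied onto each order row instead, the same change would need to touch every order, and missing one would leave the data inconsistent.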
Denormalization
Denormalization intentionally introduces redundancy to improve query performance. Instead of joining five tables to answer a query, you store pre-joined data in a single table. This makes reads faster but writes more complex (you must update redundant copies). Data warehouses and analytics databases are heavily denormalized for this reason.
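The trade-off can be sketched with Python's sqlite3 module. The table names (customers, orders, orders_wide) and columns are assumptions made for this example, not part of any particular system. The pre-joined orders_wide table answers the revenue query without a join, but a change of city now has to be written to the source row and to every redundant copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, city TEXT NOT NULL);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
INSERT INTO customers VALUES (1, 'London'), (2, 'Paris');
INSERT INTO orders VALUES (10, 1, 25.0), (11, 1, 40.0), (12, 2, 15.0);

-- Denormalized copy: the customer's city is stored on every order row.
CREATE TABLE orders_wide AS
SELECT o.order_id, o.customer_id, o.amount, c.city AS customer_city
FROM orders AS o
JOIN customers AS c ON c.customer_id = o.customer_id;
""")

# Reads need no join.
revenue_by_city = conn.execute("""
    SELECT customer_city, SUM(amount)
    FROM orders_wide
    GROUP BY customer_city
    ORDER BY customer_city
""").fetchall()

# Writes must update the source row and every redundant copy together.
with conn:
    conn.execute("UPDATE customers SET city = 'Leeds' WHERE customer_id = 1")
    copies_updated = conn.execute(
        "UPDATE orders_wide SET customer_city = 'Leeds' WHERE customer_id = 1"
    ).rowcount

print(revenue_by_city, copies_updated)
```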
Dimensional Modeling
Dimensional modeling, popularized by Ralph Kimball, is a widely used approach for data warehouses. It organizes data into fact tables and dimension tables:
Facts contain measurable events: sales amount, click count, order quantity. Fact tables are typically very large (millions to billions of rows) and grow continuously.
Dimensions contain descriptive attributes: product name, customer city, date components. Dimension tables are smaller and change slowly.
The combination allows analysts to "slice and dice" facts by dimensions: revenue by product category, by region, by quarter.
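As a rough illustration of slicing facts by dimensions, the sketch below builds one fact table and two dimension tables in an in-memory SQLite database. The names fact_sales, dim_product, and dim_date, along with the sample rows, are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (
    product_key INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    category    TEXT NOT NULL
);
CREATE TABLE dim_date (
    date_key INTEGER PRIMARY KEY,   -- e.g. 20240115
    year     INTEGER NOT NULL,
    quarter  INTEGER NOT NULL
);
-- Each fact row is one measurable event, keyed by its dimensions.
CREATE TABLE fact_sales (
    product_key  INTEGER NOT NULL REFERENCES dim_product(product_key),
    date_key     INTEGER NOT NULL REFERENCES dim_date(date_key),
    quantity     INTEGER NOT NULL,
    sales_amount REAL NOT NULL
);
INSERT INTO dim_product VALUES (1, 'Laptop', 'Electronics'), (2, 'Desk', 'Furniture');
INSERT INTO dim_date VALUES (20240115, 2024, 1), (20240420, 2024, 2);
INSERT INTO fact_sales VALUES
    (1, 20240115, 2, 2000.0),
    (2, 20240115, 1, 300.0),
    (1, 20240420, 1, 1000.0);
""")


def revenue_by_category_and_quarter(conn):
    """Aggregate the fact table, grouped by attributes of two dimensions."""
    return conn.execute("""
        SELECT p.category, d.quarter, SUM(f.sales_amount)
        FROM fact_sales AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        JOIN dim_date AS d ON d.date_key = f.date_key
        GROUP BY p.category, d.quarter
        ORDER BY p.category, d.quarter
    """).fetchall()


for row in revenue_by_category_and_quarter(conn):
    print(row)
```

Swapping the GROUP BY columns for other dimension attributes (year, product name) gives a different slice of the same facts without changing the schema.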
Star Schema vs Snowflake Schema
In a star schema, dimension tables are denormalized, with all of a dimension's attributes in a single table. In a snowflake schema, dimensions are normalized into sub-tables. The star schema is simpler to query and typically performs better because it needs fewer joins. The snowflake schema saves storage and reduces the risk of update anomalies within dimensions. For most data warehouses, the star schema is preferred. Use the Code Diff tool to compare data model versions during schema evolution.
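The difference can be sketched by modeling the same product dimension both ways. All table and column names below are hypothetical, and the star_ and snow_ prefixes exist only so both variants can sit side by side in one in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Star: the category name is stored directly on the product dimension.
CREATE TABLE star_dim_product (
    product_key   INTEGER PRIMARY KEY,
    name          TEXT NOT NULL,
    category_name TEXT NOT NULL
);
-- Snowflake: the category is normalized into its own sub-table.
CREATE TABLE snow_dim_category (
    category_key  INTEGER PRIMARY KEY,
    category_name TEXT NOT NULL
);
CREATE TABLE snow_dim_product (
    product_key  INTEGER PRIMARY KEY,
    name         TEXT NOT NULL,
    category_key INTEGER NOT NULL REFERENCES snow_dim_category(category_key)
);
INSERT INTO star_dim_product VALUES
    (1, 'Laptop', 'Electronics'), (2, 'Phone', 'Electronics');
INSERT INTO snow_dim_category VALUES (1, 'Electronics');
INSERT INTO snow_dim_product VALUES (1, 'Laptop', 1), (2, 'Phone', 1);
""")

# Star: one table, no join, but 'Electronics' is repeated per product.
star_rows = conn.execute(
    "SELECT name, category_name FROM star_dim_product ORDER BY product_key"
).fetchall()

# Snowflake: the same answer costs an extra join, but the name is stored once.
snowflake_rows = conn.execute("""
    SELECT p.name, c.category_name
    FROM snow_dim_product AS p
    JOIN snow_dim_category AS c ON c.category_key = p.category_key
    ORDER BY p.product_key
""").fetchall()

print(star_rows == snowflake_rows)
```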
Slowly Changing Dimensions (SCD)
Dimensions change over time: a customer moves to a new city, or a product is recategorized. How you handle these changes affects historical accuracy:
Type 1: Overwrite the old value. Simple but loses history.
Type 2: Add a new row with versioning (start_date, end_date, is_current). Preserves full history but increases table size.
Type 3: Add a column for the previous value. Preserves limited history.
Type 2 is the most common in data warehouses because historical accuracy matters for analytics.
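One possible way to implement a Type 2 change is sketched below with Python's sqlite3 module. The start_date, end_date, and is_current columns come from the description above. The dim_customer table, its customer_key and customer_id columns, and the apply_type2_change helper are hypothetical names introduced for this example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE dim_customer (
    customer_key INTEGER PRIMARY KEY AUTOINCREMENT,  -- surrogate key
    customer_id  INTEGER NOT NULL,                   -- business key
    city         TEXT NOT NULL,
    start_date   TEXT NOT NULL,
    end_date     TEXT,                               -- NULL while current
    is_current   INTEGER NOT NULL DEFAULT 1
)
""")


def apply_type2_change(conn, customer_id, new_city, change_date):
    """Close the current row (if any) and insert a new current version."""
    with conn:  # both statements commit together or not at all
        conn.execute(
            "UPDATE dim_customer SET end_date = ?, is_current = 0 "
            "WHERE customer_id = ? AND is_current = 1",
            (change_date, customer_id),
        )
        conn.execute(
            "INSERT INTO dim_customer (customer_id, city, start_date) "
            "VALUES (?, ?, ?)",
            (customer_id, new_city, change_date),
        )


apply_type2_change(conn, 42, "London", "2023-01-01")
apply_type2_change(conn, 42, "Paris", "2024-06-01")

history = conn.execute(
    "SELECT city, start_date, end_date, is_current FROM dim_customer "
    "WHERE customer_id = 42 ORDER BY customer_key"
).fetchall()
for row in history:
    print(row)
```

Fact rows that reference customer_key rather than customer_id keep pointing at the version of the customer that was current when the event happened, which is what preserves historical accuracy.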
Entity-Relationship Modeling
ER modeling is used for OLTP database design. Entities (Customer, Order, Product) become tables, attributes become columns, and relationships become foreign keys. Cardinality (one-to-one, one-to-many, many-to-many) determines the relationship structure. Many-to-many relationships require junction tables. ER diagrams remain one of the most effective tools for communicating database designs across teams. Convert model documentation between formats using the JSON to YAML Converter.
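A small sketch of how the Customer, Order, and Product entities might map to tables, again using an in-memory SQLite database. The order_items junction table, the column names, and the products_in_order helper are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE products  (product_id  INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- One-to-many: each order belongs to exactly one customer.
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id)
);
-- Many-to-many between orders and products, resolved by a junction table.
CREATE TABLE order_items (
    order_id   INTEGER NOT NULL REFERENCES orders(order_id),
    product_id INTEGER NOT NULL REFERENCES products(product_id),
    quantity   INTEGER NOT NULL,
    PRIMARY KEY (order_id, product_id)
);
INSERT INTO customers VALUES (1, 'Ada');
INSERT INTO products VALUES (1, 'Laptop'), (2, 'Mouse');
INSERT INTO orders VALUES (10, 1), (11, 1);
INSERT INTO order_items VALUES (10, 1, 1), (10, 2, 2), (11, 2, 1);
""")


def products_in_order(conn, order_id):
    """Follow the junction table from one order to its products."""
    return conn.execute("""
        SELECT p.name, oi.quantity
        FROM order_items AS oi
        JOIN products AS p ON p.product_id = oi.product_id
        WHERE oi.order_id = ?
        ORDER BY p.product_id
    """, (order_id,)).fetchall()


print(products_in_order(conn, 10))
```

The composite primary key on order_items prevents the same product from appearing twice in one order, so quantity changes become updates rather than duplicate rows.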
Best Practices
Name tables and columns clearly and consistently (use snake_case or another chosen convention).
Document your data model; future developers and analysts will thank you.
Use surrogate keys (for example, auto-incrementing integers) for dimension tables.
Add created_at and updated_at timestamps to all tables.
Version your schema changes with migration tools (Flyway, Alembic, ActiveRecord Migrations).
Test your model with realistic data volumes before going to production.
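To make the idea of versioned schema changes concrete, here is a deliberately simplified sketch. It does not reflect how Flyway, Alembic, or ActiveRecord Migrations work internally. It only illustrates the general principle of applying ordered changes once each, using SQLite's user_version pragma as the version store. The MIGRATIONS list and the migrate function are hypothetical.

```python
import sqlite3

# Hypothetical ordered migrations; position in the list (1-based) is the version.
MIGRATIONS = [
    "CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT NOT NULL)",
    "ALTER TABLE customers ADD COLUMN created_at TEXT",
]


def migrate(conn):
    """Apply migrations newer than the stored version; return the new version."""
    current = conn.execute("PRAGMA user_version").fetchone()[0]
    for version, statement in enumerate(MIGRATIONS, start=1):
        if version > current:
            conn.execute(statement)
            # PRAGMA values cannot be bound as parameters; version is an int.
            conn.execute(f"PRAGMA user_version = {version:d}")
    conn.commit()
    return conn.execute("PRAGMA user_version").fetchone()[0]


conn = sqlite3.connect(":memory:")
first_run = migrate(conn)   # applies both migrations
second_run = migrate(conn)  # nothing left to apply

columns = [row[1] for row in conn.execute("PRAGMA table_info(customers)")]
print(first_run, second_run, columns)
```

A real migration tool adds what this sketch leaves out, such as checksums, rollback scripts, and locking against concurrent runs.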