What is Indexing in Data Engineering?

Power of Indexing in Data Engineering

Published in

Code Like A Girl

7 min read · May 31, 2023

Indexing is a database optimization technique that involves creating a data structure to improve the speed of data retrieval operations on a database table.

The index is a separate data structure that provides a faster way to look up data rows based on the values of one or more columns. Instead of scanning the entire table to find the desired records, the database engine can use the index to quickly locate the relevant data.
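To see the difference between a full scan and an index lookup, here is a minimal sketch using Python's built-in sqlite3 module. The table name, column names, and index name are made up for illustration, and the exact wording of the query plan varies between SQLite versions.

```python
import sqlite3

# In-memory database with a hypothetical "users" table (names are illustrative).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, email TEXT, city TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?, ?)",
    [(i, f"user{i}@example.com", f"city{i % 50}") for i in range(10_000)],
)


def query_plan(sql):
    """Return the human-readable 'detail' column of EXPLAIN QUERY PLAN."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


lookup = "SELECT * FROM users WHERE email = 'user42@example.com'"

plan_before = query_plan(lookup)  # no index yet: SQLite has to scan the table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = query_plan(lookup)   # with the index: SQLite searches the index

print("before:", plan_before)
print("after: ", plan_after)
```

The first plan reports a scan of the whole table, while the second reports a search that uses idx_users_email.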

Simply put, an index is a pointer to data in a table. An index in a database is very similar to an index in the back of a book.

A database typically does not index everything by default. Application developers or database administrators choose indexes manually, based on the application’s common query patterns, because every index speeds up some reads but adds work to every write.

Primary index: A primary key serves as a unique identifier for a row in a relational table, a document in a document database, or a vertex in a graph database. Other records within the database can reference this specific row/document/vertex by its primary key or ID, and an index is utilized to resolve these references.

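Hash indexes, B-tree indexes, and LSM (Log-Structured Merge) indexes can all serve as such indexes. In the textbook sense used by Garcia-Molina et al., a primary index additionally requires the rows in the data blocks to be ordered on the index key, which does not hold for every index structure (a hash index, for example, imposes no sort order).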

Primary index diagram borrowed from “**Database System Implementation by Garcia Molina et al**”

Secondary index: Unlike a primary index, the keys of a secondary index need not be unique, so multiple rows (documents, vertices) can share the same key.

In relational databases, you can create several secondary indexes on the same table with the CREATE INDEX command, and they are often crucial for performing joins efficiently. Both B-trees and log-structured indexes can be used as secondary indexes. A secondary index does _not_ affect how the rows are actually organized in data blocks.
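As a small illustration of a secondary index created with CREATE INDEX, the sketch below uses Python's sqlite3 module. The customers and orders tables, their columns, and the index name are assumptions made for this example only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each order references a customer by its primary key.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)"
)
conn.executemany("INSERT INTO customers VALUES (?, ?)", [(1, "Ada"), (2, "Grace")])
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 1, 20.0), (2, 2, 35.5), (3, 1, 12.25)],
)

# Secondary index: customer_id is not unique, several orders share one key.
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# Two rows come back for the same key value.
orders_for_ada = conn.execute(
    "SELECT id, amount FROM orders WHERE customer_id = 1 ORDER BY id"
).fetchall()

# The query plan names the secondary index (wording varies by SQLite version).
plan = " | ".join(
    row[3]
    for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT id, amount FROM orders WHERE customer_id = 1"
    )
)

# A join on the indexed column.
joined = conn.execute(
    "SELECT c.name, o.amount FROM customers c "
    "JOIN orders o ON o.customer_id = c.id "
    "WHERE c.name = 'Ada' ORDER BY o.id"
).fetchall()

print(orders_for_ada)
print(plan)
print(joined)
```

The index leaves the orders table itself untouched. It is a separate structure that maps each customer_id to the matching rows.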

Secondary Index diagram borrowed from “**Database System Implementation by Garcia Molina et al**”

Saving values within the index

In an index, the key is what queries search for, while the value is either the row itself or a reference to a row stored elsewhere. In the latter case, the place where the rows are stored is known as a heap file, which keeps data in no particular order.

The heap file approach is commonly used to prevent data duplication when multiple secondary indexes are present. In this approach, each index refers to a specific location in the heap file, keeping the actual data centralized in one place.

Updating a value without modifying the key can be efficiently handled using the heap file approach. If the new value is not larger than the old value, the record can be overwritten in place, resulting in efficient updates.
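The heap file idea can be sketched in plain Python. This is a toy model, not how a real storage engine lays out bytes on disk: the list stands in for the heap file, list positions stand in for row locations, and all names are invented for the example.

```python
from collections import defaultdict

heap_file = []                     # rows stored in insertion order, no sorting
index_by_email = {}                # unique key -> position in the heap file
index_by_city = defaultdict(list)  # non-unique key -> list of positions


def insert(row):
    """Append the row to the heap file and register it in both indexes."""
    position = len(heap_file)
    heap_file.append(row)
    index_by_email[row["email"]] = position
    index_by_city[row["city"]].append(position)
    return position


def find_by_email(email):
    """Follow the index reference into the heap file."""
    return heap_file[index_by_email[email]]


def find_by_city(city):
    """Several positions can be stored under the same secondary key."""
    return [heap_file[pos] for pos in index_by_city.get(city, [])]


insert({"email": "ada@example.com", "city": "London", "plan": "free"})
insert({"email": "alan@example.com", "city": "London", "plan": "pro"})
insert({"email": "grace@example.com", "city": "New York", "plan": "free"})

# Update a non-key value in place: the row stays where it is,
# so neither index needs to change.
heap_file[index_by_email["ada@example.com"]]["plan"] = "pro"

print(find_by_email("ada@example.com"))
print([row["email"] for row in find_by_city("London")])
```

Each row exists only once, in heap_file, and both indexes hold nothing but positions. A real engine also has to handle the case the sketch ignores, where the new value is larger than the old one and no longer fits in place.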

Heap file diagram borrowed from BEGINNERSBOOK.COM

There are cases where the additional step of accessing the heap file from the index negatively impacts read performance. In such situations, it is preferable to store the indexed row directly within the index itself. This storage approach is referred to as a clustered index.

An intermediate solution between a clustered index (which stores all row data in the index) and a nonclustered index (which stores only references to the data) is called a covering index or index with included columns. This type of index stores a subset of a table’s columns within the index itself.
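A covering index can be observed with Python's sqlite3 module. The sketch below assumes a hypothetical products table. SQLite has no INCLUDE clause, so the extra column is simply added to the index key, whereas other databases offer dedicated syntax for included columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, brand TEXT, price REAL, description TEXT)"
)
conn.executemany(
    "INSERT INTO products (brand, price, description) VALUES (?, ?, ?)",
    [(f"brand{i % 20}", float(i), f"description {i}") for i in range(1_000)],
)

# The index stores brand and price, but not description.
conn.execute("CREATE INDEX idx_products_brand_price ON products (brand, price)")


def query_plan(sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN as one string."""
    return " | ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))


# Every requested column is in the index: no extra trip to the table.
covered = query_plan("SELECT brand, price FROM products WHERE brand = 'brand7'")

# description is not in the index: each match needs a lookup in the table.
not_covered = query_plan(
    "SELECT brand, description FROM products WHERE brand = 'brand7'"
)

print(covered)
print(not_covered)
```

The first plan mentions a covering index, and the second uses the same index but still has to fetch description from the table.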

Clustered Index diagram borrowed from geeksforgeeks

Non clustered index diagram borrowed from TutorialsTeacher.com

Multi-column indexes: The prevalent form of a multi-column index is referred to as a concatenated index. It involves merging multiple fields into a single key by appending one column to another. The index definition specifies the order in which the fields are concatenated.
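Why the column order matters can be shown with a sorted list of tuples, a rough stand-in for a concatenated (last_name, first_name) index. The data and function names are illustrative assumptions.

```python
import bisect

# Each entry is (last_name, first_name, row_id). Sorting the tuples gives the
# same order as a concatenated key: last_name first, then first_name.
index = sorted(
    [
        ("Lovelace", "Ada", 1),
        ("Hopper", "Grace", 2),
        ("Turing", "Alan", 3),
        ("Hopper", "Edward", 4),
    ]
)


def find_by_last_name(last_name):
    """Binary search on the leading column, then read the adjacent entries."""
    start = bisect.bisect_left(index, (last_name,))
    matches = []
    for entry in index[start:]:
        if entry[0] != last_name:
            break
        matches.append(entry)
    return matches


def find_by_full_name(last_name, first_name):
    """Binary search on both columns of the concatenated key."""
    start = bisect.bisect_left(index, (last_name, first_name))
    matches = []
    for entry in index[start:]:
        if entry[:2] != (last_name, first_name):
            break
        matches.append(entry)
    return matches


def find_by_first_name(first_name):
    """The sort order does not help here: every entry must be checked."""
    return [entry for entry in index if entry[1] == first_name]


print(find_by_last_name("Hopper"))
print(find_by_full_name("Hopper", "Grace"))
print(find_by_first_name("Alan"))
```

Lookups by last name, or by last name plus first name, can jump straight to the right spot. A lookup by first name alone cannot, which is why the order of columns in the index definition should follow the queries you actually run.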

Multi-column index diagram borrowed from dataschool.com

Full-text search and fuzzy indexes: A full-text index makes it possible to write queries that match words or patterns inside the indexed string properties, rather than comparing the whole value.

On the other hand, a fuzzy search is designed to locate text that closely matches a given term, rather than requiring an exact match. Fuzzy searches are particularly useful in situations where search terms might contain spelling errors, as they aid in finding relevant results.
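Both ideas can be sketched in a few lines of Python. The inverted index below is a toy version of a full-text index, and the fuzzy lookup relies on difflib.get_close_matches from the standard library, which ranks words by a similarity ratio rather than by the edit distance that real search engines often use. The documents and the cutoff value are assumptions chosen for the example.

```python
import difflib
from collections import defaultdict

documents = {
    1: "fast index lookup in a database",
    2: "database replication and sharding",
    3: "partitioning large tables",
}

# Inverted index: each word maps to the set of documents that contain it.
inverted_index = defaultdict(set)
for doc_id, text in documents.items():
    for word in text.lower().split():
        inverted_index[word].add(doc_id)


def search(term):
    """Exact full-text lookup: the term must match an indexed word."""
    return sorted(inverted_index.get(term.lower(), set()))


def fuzzy_search(term, cutoff=0.8):
    """Look up indexed words that are similar to the (possibly misspelled) term."""
    close_words = difflib.get_close_matches(
        term.lower(), list(inverted_index), n=3, cutoff=cutoff
    )
    return sorted({doc_id for word in close_words for doc_id in inverted_index[word]})


print(search("database"))
print(search("databse"))        # misspelled: the exact lookup finds nothing
print(fuzzy_search("databse"))  # the fuzzy lookup still finds the documents
```

The exact lookup fails on the misspelled term, while the fuzzy lookup maps it to the closest indexed word first and then returns the same documents.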

Choosing Index Columns

Selecting the right columns for indexing is a critical decision that directly impacts performance. It involves understanding the query patterns and workload of the system.

Key considerations include:

Frequency of Use: Identify columns frequently used in WHERE clauses, JOIN operations, and sorting/grouping.

Cardinality: High-cardinality columns, with many distinct values, are often good candidates for indexing.

Primary Key and Unique Constraints: Most relational databases index these columns automatically to enforce uniqueness, so they rarely need a separate index.
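One rough way to compare candidate columns is to compute their cardinality and selectivity (distinct values divided by total rows). The sketch below does this on made-up data. In practice you would run the equivalent COUNT(DISTINCT ...) query against the real table or rely on the database's own statistics.

```python
# Hypothetical sample rows; column names are illustrative.
rows = [
    {"email": f"user{i}@example.com", "country": f"C{i % 5}", "is_active": i % 2 == 0}
    for i in range(1_000)
]


def cardinality(rows, column):
    """Number of distinct values in the column."""
    return len({row[column] for row in rows})


def selectivity(rows, column):
    """Distinct values divided by row count: 1.0 means every value is unique."""
    return cardinality(rows, column) / len(rows)


for column in ("email", "country", "is_active"):
    print(column, cardinality(rows, column), selectivity(rows, column))
```

Here email has a selectivity of 1.0, so an index on it narrows a lookup to a single row. An index on is_active, with only two distinct values, would still leave about half the table to read.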

Best Practices and Considerations

Successful indexing involves a set of best practices and considerations to ensure optimal performance:

Avoid Over-Indexing: Every index must be updated whenever rows are inserted, updated, or deleted, and it takes up storage. Creating too many indexes, especially on columns with low selectivity, slows down writes while doing little for reads.

Regular Maintenance: Periodic maintenance, such as rebuilding or reorganizing indexes, is crucial to keep indexes in optimal condition.

Monitoring and Tuning: Use monitoring tools and query performance analysis, such as inspecting query execution plans, to find slow queries and indexes that are never used.

Adaptability: Regularly revisit and adapt indexing strategies to maintain optimal performance as data volumes and query patterns evolve.

In conclusion, indexing is vital in data engineering because it enables efficient data retrieval. An effective indexing strategy matches indexes to the actual query workload and weighs faster reads against slower writes and extra storage.

FAQ

Q: How does indexing improve query performance?

Answer: Indexing improves query performance by reducing the time it takes to locate and retrieve specific rows of data. Instead of scanning the entire table, the database engine uses the index to quickly find the relevant rows, resulting in faster query execution times.

Q: What is the difference between a clustered and a non-clustered index?

Answer: In a clustered index, the actual data rows are stored in the order of the index. In a non-clustered index, a separate structure is created, and the order of the index does not affect the physical order of the data.

Q: How should I choose which columns to index?

Answer: The choice of columns to index depends on the query patterns and workload of the database. Focus on indexing columns frequently used in WHERE clauses, JOIN operations, and sorting/grouping. Avoid over-indexing and consider the size and distribution of data.

Q: What is the importance of index maintenance?

Answer: Index maintenance involves periodic tasks such as rebuilding or reorganizing indexes to ensure optimal performance. It helps keep indexes in good shape and prevents performance degradation over time.

Q: Can indexes be used for all types of databases?

Answer: Most relational databases support indexes, as do many document and graph databases, but the available index types and their implementations vary. It is essential to understand the indexing capabilities and options of the specific database management system you use.

Q: How do indexes contribute to concurrency control in databases?

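Answer: Indexes affect concurrency control mainly through locking. When a query can use an index, it examines fewer rows, so it typically needs to lock fewer of them and holds those locks for less time, which reduces contention between concurrent transactions. The details depend on the database engine; some engines, such as MySQL’s InnoDB, place row locks on index records.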

Case scenarios for indexing in data engineering

E-commerce Product Catalog:

Scenario: An e-commerce platform has an extensive product catalog with millions of products. Users frequently search for products based on categories, brands, and price ranges.

Indexing Solution: Indexing can be applied to the columns representing product categories, brands, and prices. This would significantly speed up search queries and improve the overall user experience.

Customer Database for a CRM System:

Scenario: A Customer Relationship Management (CRM) system stores customer information, and users often query the database to find customers based on their names, locations, or transaction histories.

Indexing Solution: Creating indexes on the columns containing customer names, locations, and transaction details can enhance the speed of customer search and reporting queries.

Financial Transactions in Banking:

Scenario: A banking system records millions of financial transactions daily. Analysts must perform queries to identify transactions within a specific date range or involving particular account numbers.

Indexing Solution: Indexing on the timestamp and account number columns can optimize queries related to transaction history, enabling faster retrieval of relevant financial data.
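For the banking scenario, a composite index with the account number first and the timestamp second is one possible design, sketched here with Python's sqlite3 module. The table layout, the ISO-8601 text timestamps, and all names are assumptions for illustration, and a production system would choose its own schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE transactions ("
    "id INTEGER PRIMARY KEY, account_no TEXT, ts TEXT, amount REAL)"
)

# Synthetic data: 100 accounts, timestamps spread over 28 days of May 2023.
rows = [
    (f"ACC{i % 100:03d}", f"2023-05-{(i % 28) + 1:02d}T12:00:00", float(i))
    for i in range(5_000)
]
conn.executemany(
    "INSERT INTO transactions (account_no, ts, amount) VALUES (?, ?, ?)", rows
)

# Equality column first, range column second.
conn.execute("CREATE INDEX idx_tx_account_ts ON transactions (account_no, ts)")

sql = (
    "SELECT ts, amount FROM transactions "
    "WHERE account_no = ? AND ts >= ? AND ts < ? ORDER BY ts"
)
params = ("ACC007", "2023-05-01", "2023-05-15")

plan = " | ".join(
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql, params)
)
results = conn.execute(sql, params).fetchall()

print(plan)
print(len(results), "transactions found")
```

With the account number as the leading column, all transactions of one account sit next to each other in the index, sorted by timestamp, so the date-range filter becomes a contiguous slice of the index.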

You might be interested in this series, in which I introduce several essential concepts that new data engineers should know. The other topics:

Slowly Changing Dimensions

Distinctions Between CTEs, Subqueries, and Temporary Tables.

Replication Lag

Replication

Sharding and Partitioning

Partitioning Data

Optimizing data

Enhanced Query Performance

Scalability
