Files and Indices
Last updated
Last updated
[TOC]
File is a collection of pages, each containing a collection of records.
Many alternative file organizations exists, each is good for some situations, and not so good in others.
Heap Files have no particular order among the records.
They are suitable when the typical access is a file scan retrieving all records.
They are also feasible for inserting records.
Sorted Files: pages and records within pages are ordered by some condition
Best for retrieval (of a range of records) in some order
They are also feasible for finding and manipulating some records.
Index File Organizations
Special data structure that has the fastest retrieval in some order
Simplest file structure, contains records in no particular order. As file grows and shrinks, disk pages are allocated and de-allocated. Fastest for inserts compared to other alternatives
Similar structure like heap files (pages and records), but pages and records are ordered.
Fast for range queries, but hard for maintenance( each insert potentially reshuffles records )
DBMS model the cost of all operations
The cost is typically expressed in the number of page accesses (or disk I/O operations - to bring data from disk to memory)
1 page access (on disk) == 1 I/O (used interchangeably)
Sometimes, we want to retrieve records by specifying the values in one or more fields.
An index is a data structure built on top of data pages used for efficient search. The index is built over specific fields called search key fields.
Any subset of the fields of a relation can be the search key for an index on the relation.
An index contains a collection of data entries, and supports efficient retrieval pf data records matching a given search condition.
Classification based on various factors:
Clustered vs. Unclustered
Primary vs. Secondary
Single Key vs. Composite
Indexing technique: Tree-based, hash-based, other
If order of data records is the same as the order of index data entries, then the index is called clustered index. Otherwise is unclustered.
A data file can have a clustered index on at most one search key combination. (i.e. we cannot have multiple clustered indexes over a single table)
Cost of retrieving data records through index varies greatly based on whether index is clustered. (cheaper for clustered)
Clustering indexes are more expensive to maintain (require file reorganization), but are really efficient for range search.
Approximated cost of retrieving records found in range scan:
Clustered: cost $#$ of pages in data file with matching records
Unclustered: cost $#$ of matching index data entries (data records)
Primary index includes the table’s primary key
Secondary is any other index
Properties:
Primary index never contains duplicates
Secondary index may contain duplicates
An index can be built over a combination of search keys.
Data entries in index sorted by search keys.
Examples:
Index on
Index on
Hash-based index
Represents index as a collection of buckets. Hash function maps the search key to the corresponding bucket.
bucket in which record belongs
Good for equality-based selections
Order: The order of a B+ tree is represented by d
. Each node, excluding the root, is required to have between d
and 2d
entries under normal conditions without deletions. However, leaf nodes may have fewer than d
entries following data deletions.
Fanout: The fanout of a B+ tree refers to the maximum number of children a node can have, which is 2d+1
.
Leaf Nodes: All leaf nodes in a B+ tree are positioned at the same depth and maintain between d
and 2d
entries, ensuring they are at least half full.
Height: The height of a B+ tree is defined by the number of edges needed to traverse from the root node to any leaf node.
Find the leaf node in which you will insert your value. Add the key and the record to the leaf node in order.
If overflows
Split into and . Keep entries in (this means entries will go in ).
If was a leaf node, COPY ’s first entry into the parent. If was not a leaf node, MOVE ’s first entry into the parent.
Adjust pointers.
If the parent overflows, then recurse on it by doing step 2 on the parent.
Just find the appropriate leaf and delete the unwanted value from that leaf. In practice we get way more insertions than deletions so something will quickly replace whatever we delete.
Alternative 1: By Value
The leaf pages contain the records themselves.
This alternative is clustered by default.
Limitation: Cannot support multiple indexes built on the same file (Since the records are ordered by points in the figure below).
Alternative 2: By Reference
The leafs contain (Key, RecordID) pairs where the RecordID is [PageNum, RecordNum].
Alternative 3: By List of References
The leaf pages hold lists of pointers to the corresponding records.
Read the appropriate root-to-leaf path.
Read the appropriate data page(s). If we need to read multiple pages, we will allot a read IO for each page. In addition, we account for clustering for Alt. 2 or 3 (see below.)
Write data page, if you want to modify it. Again, if we want to do a write that spans multiple data pages, we will need to allot a write IO for each page.
Update index page(s).
Construct a B+ tree from scratch without having to traverse the tree each time we want to insert something new.
In the following figure, the order is 2 and the fill factor is
After some adjustments: