Distributed Databases
Definitions
Distributed Database: a single logical database physically spread across multiple computers in multiple locations that are connected by a data communications link
appears to users as though it is one database
Decentralized Database: a collection of independent databases which are not networked together as one logical database
appears to users as many separate databases
Advantages of distributed DBMS
Good fit for geographically distributed organizations/users
Data located near site with greatest demand
Faster data access (to local data)
Faster data processing
Workload is split among physical servers
Allows modular growth
Thanks to horizontal scalability
Increased reliability and availability
Less danger of a single point of failure (SPOF), if data is replicated
Supports database recovery
When data is replicated across multiple sites
Objectives of distributed DBMS
Location transparency: a user does not need to know where particular data are stored
Local autonomy: a node can continue to function for local users if connectivity to the network is lost
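Location transparency can be illustrated with a small sketch. The catalog contents, node names, and the `fetch` helper below are all hypothetical and exist only for illustration; a real DBMS resolves locations internally.

```python
# Hypothetical catalog: which node stores which table.
CATALOG = {"customers": "sydney", "orders": "new_york"}

# Hypothetical per-node storage, held in memory for illustration.
NODE_DATA = {
    "sydney": {"customers": [{"id": 1, "name": "Ann"}]},
    "new_york": {"orders": [{"id": 10, "customer_id": 1}]},
}


def fetch(table):
    """Return a table's rows; the caller never names a node."""
    node = CATALOG[table]  # the catalog resolves the location
    return NODE_DATA[node][table]
```

The caller writes `fetch("orders")` and gets the rows back without knowing they live on the `new_york` node.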
Functions of a distributed DBMS
Locate data with a distributed catalog (metadata)
Determine location from which to retrieve data and process query components
DBMS translation between nodes with different local DBMSs (using middleware)
Data consistency (via multiphase commit protocols)
Global primary key control
Scalability
Security, concurrency, query optimization, failure recovery
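As one example of a multiphase commit protocol, here is a minimal sketch of two-phase commit. The `Participant` class and its `can_commit` flag are assumptions made for illustration; real implementations also handle timeouts, logging, and coordinator failure, which are omitted here.

```python
class Participant:
    """A hypothetical node holding part of a distributed transaction."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "idle"

    def prepare(self):
        # Phase 1: vote yes only if the local changes can be committed.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def rollback(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Return True if every node committed, False if all rolled back."""
    # Phase 1 (voting): ask every node whether it can commit.
    votes = [p.prepare() for p in participants]
    # Phase 2 (decision): commit only on a unanimous yes.
    if all(votes):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.rollback()
    return False
```

A single "no" vote rolls the transaction back on every node, so all copies stay consistent.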
Distribution options
When distributing data around the world, the data can be partitioned or replicated.
Data replication: the process of duplicating data to different nodes.
Data partitioning: the process of partitioning data into subsets that are shipped to different nodes.
Many real-life systems use a combination of the two (e.g. partition the data and keep several replicas of each partition, usually 3)
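A minimal sketch of combining the two: each key is hashed to a primary node (partitioning), and copies go to the next nodes in ring order (replication). The node names and the ring-style placement are assumptions for illustration, not a description of any particular product.

```python
import hashlib

NODES = ["node-0", "node-1", "node-2", "node-3", "node-4"]  # hypothetical cluster
REPLICATION_FACTOR = 3  # "usually 3", as noted above


def primary_index(key, node_count):
    """Map a key to a partition with a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % node_count


def replica_nodes(key, nodes=NODES, rf=REPLICATION_FACTOR):
    """Return the primary node followed by the next rf - 1 nodes in ring order.

    Assumes rf is no larger than the number of nodes.
    """
    start = primary_index(key, len(nodes))
    return [nodes[(start + i) % len(nodes)] for i in range(rf)]
```

Calling `replica_nodes("customer:42")` returns three distinct nodes, and the same key always maps to the same three.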
Advantages of replication
High reliability due to redundant copies of data
Fast access to data at the location where it is most accessed
May avoid complicated distributed integrity routines
Replicated data is refreshed at scheduled intervals
Decoupled nodes don't affect data availability
Transactions proceed even if some nodes are down
Reduced network traffic at prime time
If updates can be delayed
This is currently popular as a way of achieving high availability for global systems
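Disadvantages of replication
Need more storage space
Data integrity: replicas can hold incorrect data if they are not all updated together
Update operations take time, since each change must reach every copy
Network communication: propagating updates places extra load on the network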
Horizontal partitioning: Table rows distributed across nodes (sites)
Vertical partitioning: Table columns distributed across nodes (sites)
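The difference between the two can be sketched on a small in-memory table. The table, its column names, and both helper functions are hypothetical and for illustration only.

```python
# Hypothetical customer table; column names are illustrative only.
CUSTOMERS = [
    {"id": 1, "name": "Ann", "country": "US", "balance": 120.0},
    {"id": 2, "name": "Bob", "country": "AUS", "balance": 75.5},
    {"id": 3, "name": "Cy", "country": "US", "balance": 10.0},
]


def horizontal_partition(rows, column):
    """Split rows into groups by the value of one column (one group per node)."""
    parts = {}
    for row in rows:
        parts.setdefault(row[column], []).append(row)
    return parts


def vertical_partition(rows, key, column_groups):
    """Split columns into groups; every group keeps the key so rows can be rejoined."""
    return [
        [{c: row[c] for c in [key] + cols} for row in rows]
        for cols in column_groups
    ]


by_country = horizontal_partition(CUSTOMERS, "country")
profile, finance = vertical_partition(
    CUSTOMERS, "id", [["name", "country"], ["balance"]]
)

# A union of the horizontal partitions recovers the full table.
reunited = sorted(
    (row for part in by_country.values() for row in part),
    key=lambda row: row["id"],
)
```

Horizontal partitions hold whole rows and can be recombined with a union; vertical partitions hold column subsets and need the shared key (`id` here) to be joined back together.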
Advantages of partitioning
Data stored close to where it is used (NBA in US, ARL in AUS)
Local access optimization
Only relevant data is stored locally
Ease of querying: horizontal partitions can be recombined with unions
Disadvantages of partitioning
Accessing data across partitions is slower than accessing local data
No data replication: without backup copies, each partition is vulnerable as a single point of failure (SPOF)
Trade-offs
Availability vs Consistency
The CAP theorem says that when the network is partitioned, we need to decide whether to keep data always available OR always consistent
Synchronous vs Asynchronous updates
Are changes immediately visible everywhere (great BUT expensive), or propagated later (less expensive and faster, but users may see stale data)?
Sync and Async Updates
Synchronous updates: Data is continuously kept up to date
Ensures data integrity and minimizes the complexity of knowing where the most recent copy of data is located.
Slow response time and high network usage
Asynchronous updates: Some delay in propagating data updates to remote databases
Acceptable response time
More complex to plan and design
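The contrast can be sketched with one primary copy and one remote replica. The `ReplicatedValue` class is a hypothetical in-memory stand-in; in a real system the replica update is a network call, which is where the synchronous cost comes from.

```python
from collections import deque


class ReplicatedValue:
    """A primary copy plus one remote replica, both held in memory for illustration."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = None
        self.replica = None
        self.pending = deque()  # updates not yet shipped to the replica

    def write(self, value):
        self.primary = value
        if self.synchronous:
            # Synchronous: the write returns only after the replica is updated.
            self.replica = value
        else:
            # Asynchronous: acknowledge now, ship the update later.
            self.pending.append(value)

    def propagate(self):
        """Apply queued updates to the replica, in order (e.g. on a schedule)."""
        while self.pending:
            self.replica = self.pending.popleft()
```

In asynchronous mode, a read from the replica between `write` and `propagate` returns the old value, which is the stale-data window the notes describe.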
The CAP Theorem
A distributed system can guarantee at most two of the three: Consistency, Availability and Partition Tolerance.
Consistency: everyone always sees the same data
Availability: System stays up when nodes fail
Partition Tolerance: System stays up when the network between nodes fails
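The choice a node faces during a network partition can be sketched as follows. The `Node` class and the "CP"/"AP" mode labels are assumptions used for illustration; real systems make this trade-off through configuration and protocol design rather than a single flag.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class Node:
    def __init__(self, mode):
        self.mode = mode  # "CP" or "AP" (hypothetical labels)
        self.value = "v1"
        self.partitioned = False  # True when the peer cannot be reached

    def read(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse rather than risk returning stale data.
            raise Unavailable("cannot confirm latest value during a partition")
        # Availability first (or no partition): answer from the local copy,
        # which may be stale if the peer accepted a newer write.
        return self.value
```

A "CP" node gives up availability while partitioned; an "AP" node keeps answering but may return data that another node has since changed.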