Spanner: Google’s globally distributed database

Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions.

Key features
* Partitions data across many instances of Paxos state machines
* Automatically repartitions data across machines as the data volume increases or new servers are added. Say goodbye to manual sharding!
* Scales up to trillions of database rows
* Supports general purpose transactions
* Provides a SQL based query language
* Configurable replication
* Externally consistent reads and writes
* Globally consistent reads across the database at a timestamp

Architecture
A single deployment of Spanner is referred to as a universe. In practice there is usually one universe per environment, for instance a development universe and a production universe. A universe is further broken up into zones. Each zone is a unit of administration and represents a location that can house data replicas. Zones can be added to and removed from the universe while the system is running. Each zone has a single zonemaster and anywhere from hundreds to several thousand spanservers. The zonemaster assigns data to spanservers, and the spanservers serve that data to clients. Each zone also has location proxies that help clients locate the spanservers housing their data. At the universe level there is an administrative console called the universe master, which displays status information about all zones, and a placement driver, which, as the name suggests, is responsible for moving data across zones. The placement driver periodically communicates with the spanservers to find data that needs to be moved, either to satisfy updated replication constraints or to balance load.
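To make the hierarchy concrete, here is a minimal Python sketch of the deployment structure described above. All class and field names are invented for exposition; they are not Spanner’s internal APIs.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Illustrative model of the deployment hierarchy: a universe contains zones,
# each zone has a zonemaster, location proxies, and many spanservers.

@dataclass
class Spanserver:
    name: str
    tablets: List[str] = field(default_factory=list)  # tablets served here

@dataclass
class Zone:
    name: str                    # unit of administration and physical placement
    zonemaster: str              # assigns data to spanservers
    location_proxies: List[str]  # help clients find the spanserver holding their data
    spanservers: List[Spanserver] = field(default_factory=list)

@dataclass
class Universe:
    name: str                              # e.g. "development" or "production"
    zones: Dict[str, Zone] = field(default_factory=dict)
    universe_master: str = "console"       # status dashboard over all zones
    placement_driver: str = "pd"           # moves data between zones

    def add_zone(self, zone: Zone) -> None:
        # Zones can be added (or removed) while the system is running.
        self.zones[zone.name] = zone
```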

Each spanserver manages between 100 and 1000 instances of a data structure called a tablet. A tablet is a bag of mappings of the form

(key:string, timestamp:int64) -> string

Because Spanner assigns timestamps to data, the underlying data model is more like a multi-version database than a simple key-value store. A tablet’s state is persisted in a set of B-tree-like files and a write-ahead log, all on a distributed file system called Colossus (the successor to GFS).
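As a rough illustration of that mapping, here is a toy in-memory tablet in Python. It only models the multi-version lookup; real tablets persist their state on Colossus, and the class and method names here are made up.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

class Tablet:
    """Toy model of the (key, timestamp) -> value mapping kept by a tablet."""

    def __init__(self) -> None:
        # For each key, keep (timestamp, value) pairs sorted by timestamp.
        self._versions: Dict[str, List[Tuple[int, str]]] = defaultdict(list)

    def write(self, key: str, timestamp: int, value: str) -> None:
        bisect.insort(self._versions[key], (timestamp, value))

    def read(self, key: str, timestamp: int) -> Optional[str]:
        # Return the newest value written at or before the read timestamp.
        result = None
        for ts, value in self._versions.get(key, []):
            if ts <= timestamp:
                result = value
            else:
                break
        return result

t = Tablet()
t.write("user/42", timestamp=10, value="alice")
t.write("user/42", timestamp=20, value="alice-v2")
assert t.read("user/42", timestamp=15) == "alice"  # reads see the older version
```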

Replication is provided by implementing a Paxos state machine on top of each tablet, so that the tablet’s bag of mappings is consistently replicated. Writes must initiate the Paxos protocol at the leader, while reads can be served directly from any replica that is sufficiently up to date. The set of replicas collectively forms a Paxos group.
As is typical with Paxos, one replica is elected leader. At the leader replica the spanserver maintains a lock table to implement concurrency control; any operation that requires synchronization acquires locks from this table. Leaders also run a transaction manager to support distributed transactions. Together, the lock table and the transaction manager provide transactionality. When a transaction spans more than one Paxos group, the group leaders coordinate to perform a two-phase commit.
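The sketch below shows the shape of that coordination, with the lock table and Paxos itself reduced to stubs. It is only an illustration of two-phase commit across group leaders under those simplifications, not Spanner’s actual protocol.

```python
class GroupLeader:
    """Stand-in for a Paxos group leader with its per-leader lock table."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.locks: set[str] = set()

    def prepare(self, keys: list[str]) -> bool:
        # Acquire locks; in the real system a prepare record is also
        # replicated through the group's Paxos log.
        if any(k in self.locks for k in keys):
            return False
        self.locks.update(keys)
        return True

    def commit(self, keys: list[str]) -> None:
        # In the real system the commit record is likewise replicated via Paxos.
        self.locks.difference_update(keys)

    def abort(self, keys: list[str]) -> None:
        self.locks.difference_update(keys)

def two_phase_commit(coordinator: GroupLeader, participants: list[GroupLeader],
                     keys_by_group: dict[str, list[str]]) -> bool:
    leaders = [coordinator] + participants
    prepared: list[GroupLeader] = []
    # Phase 1: every group leader must acquire its locks and prepare.
    for leader in leaders:
        if leader.prepare(keys_by_group[leader.name]):
            prepared.append(leader)
        else:
            for p in prepared:  # roll back the groups that had prepared
                p.abort(keys_by_group[p.name])
            return False
    # Phase 2: the coordinator drives the commit on all groups.
    for leader in leaders:
        leader.commit(keys_by_group[leader.name])
    return True

g1, g2 = GroupLeader("g1"), GroupLeader("g2")
ok = two_phase_commit(g1, [g2], {"g1": ["user/1"], "g2": ["user/2/albums/9"]})
```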

Spanner layers a bucketing abstraction called a directory on top of this bag of mappings. A directory is a set of keys that share a common prefix, and it is also the unit of data placement: all data within a directory has the same replication configuration. Spanner moves data between Paxos groups directory by directory, for reasons such as improving data locality, balancing load, or spreading resource usage. This movement happens in the background while the system remains online; a directory of around 50 MB can typically be moved in a few seconds. The move is deliberately not implemented as a single transaction, so it does not block other ongoing reads and writes.
A directory is also the smallest unit for which an application can specify geo-replication properties: you can control the number and types of replicas and their geographic placement.
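Here is a small sketch, in the same spirit, of directories as the unit of placement and movement. The configuration fields, region names, and the move_directory helper are all hypothetical; the real mechanism is the background movedir task described in the paper.

```python
from dataclasses import dataclass

@dataclass
class ReplicationConfig:
    num_replicas: int        # how many replicas to keep
    regions: list[str]       # where they live
    leader_region: str       # preferred leader placement

@dataclass
class Directory:
    key_prefix: str            # all keys in the directory share this prefix
    config: ReplicationConfig  # every key in the directory is replicated alike
    paxos_group: str           # the Paxos group currently holding the directory

def move_directory(d: Directory, target_group: str) -> None:
    # In the real system this happens in the background, directory by
    # directory, while the data stays online.
    d.paxos_group = target_group

albums = Directory(
    key_prefix="user/42/albums/",
    config=ReplicationConfig(num_replicas=3,
                             regions=["us-east", "us-central", "europe-west"],
                             leader_region="us-east"),
    paxos_group="group-7",
)
move_directory(albums, "group-12")  # e.g. to improve locality or balance load
```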

For an Application
To applications, Spanner exposes semi-relational, schematized tables (with synchronous replication), a SQL-based query language, and general-purpose transactions.

Trade offs
The world of distributed systems has hitherto shunned the two-phase commit protocol because of its availability and performance problems. The designers of Spanner make an interesting departure from this long-held position, and for good reasons. Here is what they have to say:

We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.

Viewed naively, it may appear that the world of distributed systems has come full circle in its struggle for better scalability and availability!

Data Model
An application creates a database within a universe. Each database can hold a number of tables; tables have rows and columns, and the values in their cells are versioned. Every table must have an ordered set of one or more primary-key columns, which uniquely identify each row. In effect, a table is a mapping from the primary-key columns to the non-primary-key columns.

In a distributed database the partitioning scheme is key to performance. While partitioning, you want to keep data from related tables within the same unit of placement as far as possible. In Spanner’s case this unit of placement is the directory, so you want clients to be able to declare which tables’ related rows belong together in a directory. This is done with the INTERLEAVE declaration at table creation time; see the paper for the details, and the sketch below for the effect on key layout.
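To see what interleaving buys you, here is a toy Python illustration of the resulting key layout: child rows carry the parent’s primary key as a prefix, so a user and that user’s albums sort next to each other and naturally fall into the same directory. The key encoding and table names below are invented; the real declaration is the INTERLEAVE clause in the schema.

```python
# Hypothetical key encodings for a parent table (Users) and an interleaved
# child table (Albums). The parent key is a prefix of every child key.
def user_key(uid: int) -> str:
    return f"Users({uid})"

def album_key(uid: int, aid: int) -> str:
    return f"Users({uid})/Albums({aid})"

rows = {
    user_key(1): {"email": "ann@example.com"},
    album_key(1, 10): {"name": "holiday"},
    album_key(1, 11): {"name": "pets"},
    user_key(2): {"email": "bob@example.com"},
    album_key(2, 10): {"name": "work"},
}

# Scanning in key order keeps each user's data contiguous, which is why
# interleaved tables form good units of placement (directories).
for key in sorted(rows):
    print(key, rows[key])
```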

Transactions & Timestamp management
Spanner is the first system to assign globally meaningful commit timestamps to distributed transactions. It guarantees that if transaction T1 commits before transaction T2 starts, then T1’s commit timestamp is smaller than T2’s, and it offers this guarantee at global scale. This property is enabled by the TrueTime API.
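A minimal sketch of the commit-wait rule that TrueTime makes possible is shown below. TT.now() returns an interval [earliest, latest] guaranteed to contain the absolute current time; the uncertainty bound and helper names here are assumptions for illustration, not Spanner’s implementation.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

EPSILON = 0.004  # assumed clock uncertainty in seconds (illustrative only)

def tt_now() -> TTInterval:
    t = time.time()
    return TTInterval(earliest=t - EPSILON, latest=t + EPSILON)

def commit(apply_writes) -> float:
    # Pick a commit timestamp s no smaller than TT.now().latest ...
    s = tt_now().latest
    # ... then wait until s is definitely in the past before making the
    # commit visible ("commit wait"). This is what guarantees that a
    # transaction starting later is assigned a strictly larger timestamp.
    while tt_now().earliest <= s:
        time.sleep(EPSILON / 2)
    apply_writes(s)
    return s

commit(lambda s: print("writes applied at timestamp", s))
```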

The paper goes on to describe the details of timestamp assignment in different transactional scenarios such as read-write transactions, read-only transactions, and snapshot reads.

Real world experience
The buzz about Spanner has been around for a little while now. Experimental production trials began in early 2011 as part of the rewrite of Google’s advertising backend, F1. The F1 team chose Spanner for several reasons:
1) Removes the need to manually partition data
2) Synchronous replication and automatic failover
3) Strong transactional semantics

Link to the original paper: “Spanner: Google’s Globally-Distributed Database”, Corbett et al., OSDI 2012.