Presentation about Google’s internal systems by independent researcher Toby DiPasquale given at Philadelphia LUG on August 2nd, 2006 (slides)

Given by Jeff Dean (Google) at the University of Washington on Oct 18, 2005 (video, slides)
BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size.
Interesting quotes from the presentation:
- The scale is too big for commercial databases, and they also cannot run on clusters of cheap servers.
- Features:
- Distributed multi-level map
- Fault tolerant, persistent
- Scalable (thousands of servers, terabytes of in-memory data, petabytes of disk data, millions of reads/writes per second, efficient scans)
- Self-managing (servers can be added/removed dynamically, servers adjust to load imbalance)
- Largest BigTable cells (data collections) hold ~200TB spread over thousands of servers
- Built upon:
- multidimensional: a row (e.g. URL) and a column (attribute) identify a cell, and each cell holds time-stamped versions of its value.
- related rows (tablets) are located on the same machines for better performance
- load balancing moves tablets around
- tablets are replicated across multiple machines
- requests like “get recent X values” are possible
- columns can be configured to retain only X most recent entries
- locality groups to partition tablets
- has huge logging problems
- a lot of opportunities for compression: time-shifted data is similar, and many values are the same. Using BMDiff (dictionary-based compression): encode ~100MB/s, decode ~1000MB/s; Zippy (LZW-like): encode 179MB/s, decode 409MB/s
- Compression experiment results: web pages compress to 9.2% of their original size, links to 13.2%, anchors to 12.7%
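
To make the data model from the notes above more concrete, here is a minimal in-memory sketch of a (row, column, timestamp) map with per-column version limits and a "get recent X values" lookup. It is only an illustration of the idea, not how BigTable is implemented: the class name `TinyTable`, its methods, and the `max_versions` setting are all made up for this example.

```python
import bisect
from collections import defaultdict


class TinyTable:
    """Toy model of a (row, column, timestamp) -> value map (hypothetical API)."""

    def __init__(self, max_versions=None):
        # Assumed setting: column name -> how many recent versions to keep.
        # Columns without an entry keep every version.
        self.max_versions = dict(max_versions or {})
        # cells[row][column] is a list of (timestamp, value), oldest first.
        self.cells = defaultdict(dict)

    def put(self, row, column, timestamp, value):
        versions = self.cells[row].setdefault(column, [])
        bisect.insort(versions, (timestamp, value))  # keep sorted by timestamp
        limit = self.max_versions.get(column)
        if limit is not None:
            # Drop the oldest entries beyond the retention limit.
            del versions[:max(len(versions) - limit, 0)]

    def get_recent(self, row, column, n=1):
        """Return up to n (timestamp, value) pairs, newest first."""
        versions = self.cells.get(row, {}).get(column, [])
        return versions[::-1][:n]

    def scan(self, start_row, end_row):
        """Yield (row, columns) for start_row <= row < end_row, in key order."""
        for row in sorted(self.cells):
            if start_row <= row < end_row:
                yield row, self.cells[row]


table = TinyTable(max_versions={"contents": 2})
table.put("com.example/index", "contents", 1, "<html>v1")
table.put("com.example/index", "contents", 2, "<html>v2")
table.put("com.example/index", "contents", 3, "<html>v3")
# Only the two newest versions survive, newest first.
print(table.get_recent("com.example/index", "contents", n=2))
```

Keeping rows in sorted key order is what makes range scans cheap in this sketch, and it is the same property that lets related rows be grouped together. A real system would partition the sorted rows across machines rather than hold them in one dictionary.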
Update: Luke Baker made screenshots of all the slides from the video (not quite in the right order).