Archive for November 20th, 2006

Given by Jeff Dean (Google) at the given University of Washington on Oct 18, 2005 (video, slides)

BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size.

Interesting quotes from presentation:

  • Scale is too big for commercial databases, they can’t also run on a cheap clustered servers.
  • Features:
    • Distributed multy-level map
    • Fault tolerant, persistant
    • Scalabale (thousands of servers, megabytes of in-memory data, petabyte of disk data, millions/sec of r/w, efficient scans)
    • Self-managing (servers can be added/removed dynamically, servers adjust to load imbalance)
  • Largest bigtable cells (data collections) ~200TB on over thousands of servers
  • Built upon:
  • miltidimentional - row (e.g. url), col (attribute) = cell, inside cell time-based values for the cell.
  • related rows (tablets) are located on the same machines for better performance
  • load balancing moves tablets around
  • tablets are replicated across multiple machines
  • requests like “get recent X values” are possible
  • columns can be configured to retain only X most recent entries
  • locality groups to partition tablets
  • has huge logging problems
  • a lot of opportunities for compression - time-shifted data is similar, many values are the same. Using BMDiff (dictionary-based compression) - encode ~100MB/s, decode ~1000MB/s; Zippy (LZW-like) - 179MB/s, 409MB/s
  • Compression experiment results: web pages compress at 9.2%, links at 13.2%, anchors at 12.7%

Update: Luke Baker made screen shots from video with all slides (not really in the right order).

Comments No Comments »