Archive for the “Distributed systems” Category


Presentation about Google’s internal systems by independent researcher Toby DiPasquale, given at the Philadelphia LUG on August 2nd, 2006 (slides).

Google Internals


Presentation about Nutch, an open-source implementation of the MapReduce algorithm originally created and implemented by Google. Given by Doug Cutting at OSCON 2005 on August 3rd, 2005 (slides).

Nutch: Scalable Computing with MapReduce.


Presentation by Randy Shoup and Dan Pritchett at SD Forum 2006 on November 29th, 2006 (PDF).

eBay architecture

Some comments from Greg Linden (ex-Amazon), who also has an interesting “Early Amazon” series of posts on his blog.


Google TechTalk by Ari Zilka of Terracotta on November 21, 2006

Video, slides (the slides are from a related Terracotta presentation at JavaOne).

Quite a fascinating clustering technology that lets Java applications share part of their data across a cluster and synchronize changes to it. All of this happens at the JVM level, without modifying the applications themselves; the clustering behavior is set up through configuration files.

I’m not proficient in this field and can’t really pick out the most important parts of the presentation, so you’ll have to watch it yourself, but the demos showing two Swing applications running in sync on two different machines are pretty cool.

Unfortunately, I was unable to find the slides for this exact presentation, and the quality of the demos in the video is not very good, but it was enough for me to follow the logic.

Enjoy.


Given by Jeff Dean (Google) at the University of Washington on Oct 18, 2005 (video, slides).

BigTable is a distributed storage system for managing structured data that is designed to scale to a very large size.

Interesting points from the presentation:

  • The scale is too big for commercial databases, which also can’t run on cheap clustered servers.
  • Features:
    • Distributed multi-level map
    • Fault tolerant, persistent
    • Scalable (thousands of servers, terabytes of in-memory data, a petabyte of disk data, millions of reads/writes per second, efficient scans)
    • Self-managing (servers can be added/removed dynamically, servers adjust to load imbalance)
  • The largest BigTable cells (data collections) hold ~200TB spread over thousands of servers
  • Built upon:
  • Multidimensional: a row (e.g. a URL) and a column (attribute) together identify a cell, and each cell holds time-based (timestamped) versions of its value.
  • related rows (tablets) are located on the same machines for better performance
  • load balancing moves tablets around
  • tablets are replicated across multiple machines
  • requests like “get recent X values” are possible
  • columns can be configured to retain only X most recent entries
  • locality groups to partition tablets
  • has huge logging problems
  • a lot of opportunities for compression: time-shifted data is similar, and many values are the same. BMDiff (dictionary-based compression) encodes at ~100MB/s and decodes at ~1000MB/s; Zippy (LZW-like) encodes at 179MB/s and decodes at 409MB/s
  • Compression experiment results: web pages compress to 9.2% of their original size, links to 13.2%, anchors to 12.7%
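
To make the data model from the notes above more concrete, here is a minimal Python sketch of a map from (row, column) to timestamped values, with "get recent X values" reads and a per-column limit on retained versions. It is only an illustration of the idea as I understood it from the talk, not BigTable's actual API: the class name `TinyTable`, its method names and the `max_versions` parameter are all made up for this example, and everything lives in one process with no tablets, replication or persistence.

```python
from collections import defaultdict


class TinyTable:
    """Toy in-memory model: (row, column) -> timestamped versions.

    Hypothetical illustration only; names and behavior are assumptions,
    not the real BigTable interface.
    """

    def __init__(self, max_versions=3):
        if max_versions < 1:
            raise ValueError("max_versions must be at least 1")
        # Mimics "columns can be configured to retain only X most recent entries".
        self.max_versions = max_versions
        # row key -> column name -> list of (timestamp, value), newest first
        self._rows = defaultdict(lambda: defaultdict(list))

    def put(self, row, column, timestamp, value):
        """Store a new version of a cell and drop versions beyond the limit."""
        versions = self._rows[row][column]
        versions.append((timestamp, value))
        versions.sort(key=lambda tv: tv[0], reverse=True)
        del versions[self.max_versions:]

    def get_recent(self, row, column, n=1):
        """Return up to n most recent (timestamp, value) pairs for a cell."""
        return self._rows.get(row, {}).get(column, [])[:n]

    def scan(self, start_row, end_row):
        """Yield (row, {column: latest value}) for start_row <= row < end_row,
        in sorted row-key order."""
        for row in sorted(self._rows):
            if start_row <= row < end_row:
                latest = {col: vers[0][1] for col, vers in self._rows[row].items()}
                yield row, latest


table = TinyTable(max_versions=2)
table.put("example.com/index.html", "contents", 1, "<html>v1</html>")
table.put("example.com/index.html", "contents", 2, "<html>v2</html>")
table.put("example.com/index.html", "contents", 3, "<html>v3</html>")
# Only the two newest versions survive because max_versions=2.
print(table.get_recent("example.com/index.html", "contents", n=5))
```

Keeping rows in sorted key order is what makes range scans cheap in this sketch, and it is also why related rows end up next to each other.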

Update: Luke Baker made screenshots of all the slides from the video (not quite in the right order).
