<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Systems We Make &#124; Systems We Make</title>
	<atom:link href="http://www.systemswemake.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.systemswemake.com</link>
	<description>Curating Complex Distributed Systems</description>
	<lastBuildDate>Fri, 03 May 2013 19:15:59 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Ceph: A Scalable, High-Performance Distributed File System</title>
		<link>http://www.systemswemake.com/papers/ceph</link>
		<comments>http://www.systemswemake.com/papers/ceph#comments</comments>
		<pubDate>Fri, 03 May 2013 19:15:59 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed File Systems]]></category>
		<category><![CDATA[distributed file systems]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=803</guid>
		<description><![CDATA[Abstract : We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We &#8230;]]></description>
				<content:encoded><![CDATA[<p><strong>Abstract :</strong><br />
We have developed Ceph, a distributed file system that provides excellent performance, reliability, and scalability. Ceph maximizes the separation between data and metadata management by replacing allocation tables with a pseudo-random data distribution function (CRUSH) designed for heterogeneous and dynamic clusters of unreliable object storage devices (OSDs). We leverage device intelligence by distributing data replication, failure detection and recovery to semi-autonomous OSDs running a specialized local object file system. A dynamic distributed metadata cluster provides extremely efficient metadata management and seamlessly adapts to a wide range of general purpose and scientific computing file system workloads. Performance measurements under a variety of workloads show that Ceph has excellent I/O performance and scalable metadata management, supporting more than 250,000 metadata operations per second.</p>
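<p>To make the idea of computing placement instead of looking it up in an allocation table a little more concrete, here is a minimal sketch. It uses rendezvous (highest-random-weight) hashing as a simple stand-in; it is not the CRUSH algorithm, and the OSD names, object id and replica count are made up for illustration.</p>

```python
import hashlib


def place(object_id: str, osds: list[str], replicas: int = 3) -> list[str]:
    """Pick `replicas` OSDs for an object by hashing, with no allocation table.

    Stand-in technique: rendezvous hashing, NOT the real CRUSH function.
    """
    def score(osd: str) -> int:
        digest = hashlib.sha256(f"{object_id}:{osd}".encode()).digest()
        return int.from_bytes(digest[:8], "big")

    # Every client that knows the OSD list computes the same ranking.
    return sorted(osds, key=score, reverse=True)[:replicas]


osds = [f"osd{i}" for i in range(6)]  # hypothetical cluster
before = place("inode42.chunk0", osds)
# Simulate losing the first-choice OSD: the surviving choices keep their order.
after = place("inode42.chunk0", [o for o in osds if o != before[0]])
```

<p>Because each OSD's score depends only on the object and that OSD, removing one device only moves the data that lived on it, which is the property that makes table-free placement attractive for dynamic clusters.</p>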
<p><strong><a href="http://www.ssrc.ucsc.edu/Papers/weil-osdi06.pdf" target="_blank">Link to the original paper </a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/ceph/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets</title>
		<link>http://www.systemswemake.com/papers/scope-parallel-processing</link>
		<comments>http://www.systemswemake.com/papers/scope-parallel-processing#comments</comments>
		<pubDate>Tue, 23 Apr 2013 14:28:16 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Programming]]></category>
		<category><![CDATA[data-parallel programming]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=737</guid>
		<description><![CDATA[Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity &#8230;]]></description>
				<content:encoded><![CDATA[<p>Companies providing cloud-scale services have an increasing need to store and analyze massive data sets such as search logs and click streams. For cost and performance reasons, processing is typically done on large clusters of shared-nothing commodity machines. It is imperative to develop a programming model that hides the complexity of the underlying system but provides flexibility by allowing users to extend functionality to meet a variety of requirements.<br />
<span id="more-737"></span><br />
In this paper, we present a new declarative and extensible scripting language, SCOPE (Structured Computations Optimized for Parallel Execution), targeted for this type of massive data analysis. The language is designed for ease of use with no explicit parallelism, while being amenable to efficient parallel execution on large clusters. SCOPE borrows several features from SQL. Data is modeled as sets of rows composed of typed columns. The select statement is retained with inner joins, outer joins, and aggregation allowed. Users can easily define their own functions and implement their own versions of operators: extractors (parsing and constructing rows from a file), processors (row-wise processing), reducers (group-wise processing), and combiners (combining rows from two inputs). SCOPE supports nesting of expressions but also allows a computation to be specified as a series of steps, in a manner often preferred by programmers. We also describe how scripts are compiled into efficient, parallel execution plans and executed on large clusters.</p>
<p><strong>Previewing from <a href="http://www.goland.org/Scope-VLDB-final.pdf" target="_blank">http://www.goland.org/Scope-VLDB-final.pdf</a></strong></p>
<p><iframe src="http://docs.google.com/viewer?url=http%3A%2F%2Fwww.goland.org%2FScope-VLDB-final.pdf&#038;embedded=true" width="600" height="780" style="border: none;"></iframe></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/scope-parallel-processing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>An Efficient Multi-Tier Tablet Server Storage Architecture</title>
		<link>http://www.systemswemake.com/papers/tablet-server-storage-layer</link>
		<comments>http://www.systemswemake.com/papers/tablet-server-storage-layer#comments</comments>
		<pubDate>Sat, 20 Apr 2013 12:57:45 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Storage]]></category>
		<category><![CDATA[distributed database]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=947</guid>
		<description><![CDATA[This work presents a new, highly scalable, and efficient TSSL architecture called the General Tablet Server Storage Layer or GTSSL. Specific contributions include &#8211; 1. Improved data compaction algorithms significantly, and adapted them to multi-tier storage architectures. 2. Aggressive use of advanced algorithms, data structures, and Bloom filters to achieve &#8230;]]></description>
				<content:encoded><![CDATA[<p>This work presents a new, highly scalable, and efficient TSSL architecture called the General Tablet Server Storage Layer or GTSSL.<br />
Specific contributions include &#8211;<br />
1. Improved data compaction algorithms significantly, and adapted them to multi-tier storage architectures.<br />
2. Aggressive use of advanced algorithms, data structures, and Bloom filters to achieve 3–10× faster lookups (reads), and 5× faster insertions (writes) over Cassandra and HBase.<br />
3. Integrated versatile and efficient transactions without compromising performance.<br />
4. Empirical and theoretical evaluation of GTSSL, the Cassandra TSSL, and the HBase TSSL for a wide range of configurations from read-optimized to write-optimized.<br />
5. Write-optimized TSSL architecture can remain efficient for transactional workloads in comparison to Berkeley DB and MySQL’s InnoDB.</p>
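<p>Contribution 2 leans on Bloom filters to speed up lookups. As a rough illustration of why they help, here is a minimal sketch of a Bloom filter that lets a read skip an on-disk file that cannot contain a key. The bit-array size, number of hashes and hashing scheme are arbitrary assumptions, not what GTSSL uses.</p>

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 1024, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive `num_hashes` positions by salting one hash function.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True means "possibly present".
        return all(self.bits[pos] for pos in self._positions(key))


bf = BloomFilter()
for key in ("row1", "row2", "row3"):
    bf.add(key)
```

<p>A lookup only has to touch a file when <code>might_contain</code> returns true, so most reads for absent keys never hit the disk.</p>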
<p><strong><a href="http://www.systemswemake.com/wp-content/uploads/2013/04/An-Efficient-Multi-Tier-Tablet-Server-Storage-Architecture.pdf" target="_blank">Link to paper</a></strong></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/tablet-server-storage-layer/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Iterative Map Reduce &#8211; Prior Art</title>
		<link>http://www.systemswemake.com/papers/iterative-mr</link>
		<comments>http://www.systemswemake.com/papers/iterative-mr#comments</comments>
		<pubDate>Sat, 13 Apr 2013 04:23:05 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Programming]]></category>
		<category><![CDATA[data-parallel programming]]></category>
		<category><![CDATA[distributed programming]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1165</guid>
		<description><![CDATA[There have been several recent attempts at extending Hadoop to support efficient iterative data processing on clusters. To help understand the problem better, here is a collection of prior art in this space. HaLoop: Efficient Iterative Data Processing on Large Clusters &#8211; It extends &#8230;]]></description>
				<content:encoded><![CDATA[<p>There have been several recent attempts at extending Hadoop to support efficient iterative data processing on clusters. To help understand the problem better, here is a collection of prior art in this space.</p>
<p><strong><a href="http://www.ics.uci.edu/~yingyib/papers/HaLoop_camera_ready.pdf" target="_blank">HaLoop: Efficient Iterative Data Processing on Large Clusters</a></strong> &#8211; It extends MapReduce by adding programming support for iterative applications, and also improves their efficiency by making the task scheduler loop-aware and by adding various caching mechanisms.</p>
<p><strong><a href="http://rio.ecs.umass.edu/mnilpub/papers/DataCloud2011_iMapReduce.pdf" target="_blank">iMapReduce</a></strong> &#8211; It allows users to specify the iterative operations with map and reduce functions, while supporting the iterative processing automatically without the need of users’ involvement. More importantly, iMapReduce significantly improves the performance of iterative algorithms by (1) reducing the overhead of creating a new task in every iteration, (2) eliminating the shuffling of the static data in the shuffle stage of MapReduce, and (3) allowing asynchronous execution of each iteration, i.e., an iteration can start before all tasks of a previous iteration have finished.</p>
<p><strong><a href="http://www.iterativemapreduce.org/hpdc-camera-ready-submission.pdf" target="_blank">Twister: A Runtime for Iterative MapReduce</a></strong> &#8211; It uses a publish/subscribe messaging infrastructure for communication and data transfers, and supports long running map/reduce tasks, which can be used in a “configure once and use many times” approach. In addition it provides programming extensions to MapReduce with “broadcast” and “scatter” type data transfers. These improvements allow Twister to support iterative MapReduce computations highly efficiently compared to other MapReduce runtimes.</p>
<p><strong><a href="http://burtonator.wordpress.com/2011/12/26/a-new-map-reduce-framework-for-iterative-and-pipelined-jobs/" target="_blank">Peregrine</a></strong> &#8211; From the Spinn3r folks. </p>
<p><strong><a href="http://www.mpi-sws.org/~elnikety/Eslam_Elnikety_Web_Page_files/cloudcom.pdf" target="_blank">iHadoop: Asynchronous Iterations for MapReduce</a></strong> &#8211; The iHadoop model schedules iterations asynchronously. It connects the output of one iteration to the next, allowing both to process their data concurrently. iHadoop’s task scheduler exploits inter-iteration data locality by scheduling tasks that exhibit a producer/consumer relation on the same physical machine, allowing a fast local data transfer. For those iterative applications that require satisfying certain criteria before termination, iHadoop runs the check concurrently during the execution of the subsequent iteration to further reduce the application’s latency.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/iterative-mr/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>In Search of an Understandable Consensus Algorithm</title>
		<link>http://www.systemswemake.com/papers/raft</link>
		<comments>http://www.systemswemake.com/papers/raft#comments</comments>
		<pubDate>Wed, 10 Apr 2013 08:32:53 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Programming]]></category>
		<category><![CDATA[consensus]]></category>
		<category><![CDATA[paxos]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1240</guid>
		<description><![CDATA[Abstract &#8211; Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. &#8230;]]></description>
				<content:encoded><![CDATA[<p>Abstract &#8211; Raft is a consensus algorithm for managing a replicated log. It produces a result equivalent to Paxos, and it is as efficient as Paxos, but its structure is different from Paxos; this makes Raft more understandable than Paxos and also provides a better foundation for building practical systems. In order to enhance understandability, Raft separates the key elements of consensus, such as leader election and log replication, and it enforces a stronger degree of coherency to reduce the number of states that must be considered. Raft also includes a new mechanism for changing the cluster membership, which uses overlapping majorities to guarantee safety. Results from a user study demonstrate that Raft is easier for students to learn than Paxos.</p>
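<p>The phrase "overlapping majorities" carries a lot of weight in the abstract. The sketch below is not Raft itself; it only checks, by brute force, the basic property being relied on: any two majorities of the same cluster share at least one server. The server names and the cluster size are assumptions for illustration.</p>

```python
from itertools import combinations


def majority(cluster_size: int) -> int:
    """Smallest number of servers that forms a majority."""
    return cluster_size // 2 + 1


def majorities_always_overlap(servers: list[str]) -> bool:
    """Check every pair of minimal majorities for a common member."""
    quorum = majority(len(servers))
    groups = [set(g) for g in combinations(servers, quorum)]
    return all(a.intersection(b) for a in groups for b in groups)


servers = ["s1", "s2", "s3", "s4", "s5"]  # hypothetical 5-server cluster
```

<p>Since two majorities can never be disjoint, a decision acknowledged by one majority is always visible to at least one member of any later majority.</p>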
<p><a href="https://ramcloud.stanford.edu/wiki/download/attachments/11370504/raft.pdf" target="_blank">Link to paper</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/raft/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>To BLOB or Not To BLOB</title>
		<link>http://www.systemswemake.com/papers/to-blob-or-not</link>
		<comments>http://www.systemswemake.com/papers/to-blob-or-not#comments</comments>
		<pubDate>Sat, 22 Dec 2012 19:57:28 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Storage]]></category>
		<category><![CDATA[blob storage]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1217</guid>
		<description><![CDATA[To decide on a mechanism for storing a large number of files and querying them based on metadata we have two options - a) storing the file + metadata combination as BLOB in the database along with the metadata fields or b) just storing the metadata in the database and &#8230;]]></description>
				<content:encoded><![CDATA[<p>To decide on a mechanism for storing a large number of files and querying them based on metadata we have two options -<br />
a) storing the file + metadata combination as BLOB in the database along with the metadata fields or<br />
b) just storing the metadata in the database and the actual file itself on the filesystem.</p>
<p>An excellent source of information that explains the trade-offs between these two options is the paper &#8211; <a href="http://research.microsoft.com/pubs/64525/tr-2006-45.pdf" target="_blank">To BLOB or Not To BLOB: Large Object Storage in a Database or a Filesystem?</a>.<br />
Going by the discussion in the paper, a trade-off can be made based on the size of the objects stored. For small objects (on the order of a few hundred KB) a database offers higher read throughput. But as the object size approaches megabytes, the throughput of the filesystem increases faster than that of the database.</p>
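<p>The size-based trade-off can be written down as a tiny decision helper. This is only a sketch: the 256 KB and 1 MB cut-offs are assumed defaults standing in for "a few hundred KB" and "MBs", and should be replaced by numbers measured on your own hardware and workload.</p>

```python
KB = 1024
MB = 1024 * KB


def choose_store(object_size: int,
                 small_limit: int = 256 * KB,
                 large_limit: int = 1 * MB) -> str:
    """Return where to keep an object's bytes, based only on its size."""
    if object_size >= large_limit:
        return "filesystem"
    if object_size > small_limit:
        return "measure"  # in-between sizes: benchmark your own workload
    return "database"
```

<p>For example, <code>choose_store(100 * KB)</code> picks the database and <code>choose_store(5 * MB)</code> picks the filesystem, while sizes in between are flagged for measurement rather than guessed at.</p>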
<p>A key aspect that the paper brings out is the impact that <strong>disk fragmentation</strong> has on the performance of such a storage solution. One of the main reasons why access from a filesystem is faster is its ability to deal with disk fragmentation caused by repeated updates. As the &#8220;storage age&#8221; (the average number of times a file has been replaced) increases, both the read and write performance of a database degrade, mainly because databases don&#8217;t have an automated way of dealing with the fragmentation caused by repeated updates. Defragmenting a database requires explicit application logic to copy BLOBs to a new table (on SQL Server).</p>
<p>So the requirements that a good BLOB database solution needs to address include the ability to defragment automatically. At a minimum it should tell you how fragmented a BLOB object is. On top of this it can offer other optimizations such as in-place defragmentation.</p>
<p>It feels like the idea of scalable &#8220;BLOB databases&#8221; in general (for lack of a better term) is perhaps still nascent. Most BLOB management solutions (for audio, video or text) rely on distributed object stores like S3 or Ceph. Most of these don&#8217;t even offer metadata storage along with the data objects, let alone specialized indexing and search capabilities. It&#8217;s left to the applications that push data objects into these object stores to deal with the question of how to keep the database object metadata and the filesystem object data synchronized.<br />
Attempts such as <a href="https://www.usenix.org/conference/lisa12/simple-file-storage-system-web-applications">HSS from AOL</a> and <a href="http://www.systemswemake.com/papers/haystack">Haystack</a> are in some ways a start, although they are still very far from being a specialized distributed database. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/to-blob-or-not/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HAcid: A lightweight transaction system for HBase</title>
		<link>http://www.systemswemake.com/papers/hacid</link>
		<comments>http://www.systemswemake.com/papers/hacid#comments</comments>
		<pubDate>Tue, 02 Oct 2012 17:39:01 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Storage]]></category>
		<category><![CDATA[distributed database]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1212</guid>
		<description><![CDATA[HAcid is a client library that applications can use for operating multi-row transactions in HBase. Seems to be motivated by Google&#8217;s Percolator. Link to the original paper]]></description>
				<content:encoded><![CDATA[<p>HAcid is a client library that applications can use for operating multi-row transactions in HBase. Seems to be motivated by <a href="http://www.systemswemake.com/large-scale-incremental-processing-using-distributed-transactions-and-notifications/">Google&#8217;s Percolator</a>.</p>
<p><a href="http://users.ics.aalto.fi/desoua1/HAcid_Thesis.pdf" target="_blank">Link to the original paper</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/hacid/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Spanner : Google&#8217;s globally distributed database</title>
		<link>http://www.systemswemake.com/papers/spanner</link>
		<comments>http://www.systemswemake.com/papers/spanner#comments</comments>
		<pubDate>Sun, 16 Sep 2012 08:16:39 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Storage]]></category>
		<category><![CDATA[distributed database]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1194</guid>
		<description><![CDATA[Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions. Key features * Partitions data across many instances of Paxos state machines * Automatically repartitions data across machines as the data volume increases or new &#8230;]]></description>
				<content:encoded><![CDATA[<p>Spanner is Google’s scalable, multi-version, globally-distributed, and synchronously-replicated database. It is the first system to distribute data at global scale and support externally-consistent distributed transactions.</p>
<p><strong>Key features</strong><br />
   * Partitions data across many instances of Paxos state machines<br />
   * Automatically repartitions data across machines as the data volume increases or new servers are added. This feature is just awesome! Say goodbye to manual sharding!!<br />
   * Scales up to trillions of database rows<br />
   * Supports general purpose transactions<br />
   * Provides a SQL based query language<br />
   * Configurable replication<br />
   * Externally consistent reads and writes<br />
   * Globally consistent reads across the database at a timestamp</p>
<p><strong>Architecture</strong><br />
A single deployment of Spanner is referred to as a universe. In practice there is usually one universe per environment, for instance a development universe or a production universe. A universe is further broken up into zones. Each zone is a unit of administration and represents a location which can house data replicas. Zones can be added to and removed from the universe while the system is running. A zone has a single zonemaster and many (100s-1000s) spanservers. Zonemasters assign data to spanservers. Spanservers serve data to the clients. Each zone also has a location proxy which helps clients locate the spanservers that house their data. The universe also has an administrative console called the universe master that displays status information about all zones. The placement driver, as the name suggests, is responsible for moving data across zones. It communicates with the spanservers periodically to find data that needs to be moved.</p>
<p>A single spanserver controls about 100-1000 instances of a data structure called a tablet. Each tablet is a bag of mappings of the form</p>
<p><em>(key:string, timestamp:int64) -> string</em></p>
<p>Because Spanner assigns timestamps to each piece of data, the underlying data model resembles a multi-version database more than a simple k-v store. The tablet&#8217;s state is persisted in a set of B-tree-like files and a write-ahead log on a distributed file system called Colossus (the next generation of GFS).</p>
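<p>To see what a timestamped mapping buys you, here is a toy in-memory version of a tablet. It is only a sketch of the multi-version idea: the class and method names are made up, timestamps are plain integers, and nothing is persisted or replicated.</p>

```python
class Tablet:
    """Toy in-memory bag of (key, timestamp) -> value mappings."""

    def __init__(self):
        # key -> {timestamp: value}; every write adds a new version.
        self._versions: dict[str, dict[int, str]] = {}

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._versions.setdefault(key, {})[timestamp] = value

    def read(self, key: str, timestamp: int):
        """Return the newest value written at or before `timestamp`."""
        versions = self._versions.get(key, {})
        visible = [ts for ts in versions if timestamp >= ts]
        return versions[max(visible)] if visible else None


tablet = Tablet()
tablet.write("user/1", 10, "alice")
tablet.write("user/1", 20, "alice-renamed")
old_value = tablet.read("user/1", 15)  # sees the version written at 10
```

<p>Old versions are never overwritten, which is what makes reads "at a timestamp" possible without blocking writers.</p>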
<p>Replication is provided by the spanservers, which implement a Paxos state machine on top of every tablet so that the bag of mappings is consistently replicated. All writes to the tablet initiate the Paxos protocol, while reads can be served directly by the tablet at a nearby replica that is sufficiently up to date. The set of replicas constitutes a Paxos group.<br />
As is typical with Paxos, one replica is elected the leader. On the leader replica the spanserver maintains a lock table to implement concurrency control. Any operation that requires synchronization acquires a lock from the lock table. Leaders also run a transaction manager to support distributed transactions. The lock table and the transaction manager together provide transactionality. When a transaction spans two or more Paxos groups, the group leaders coordinate to carry out a two-phase commit.</p>
<p>Spanner layers a bucketing abstraction called the <strong>directory</strong> on top of this bag of mappings. A directory is a collection of keys that have a common prefix. It&#8217;s also the unit of data placement. All the data within a directory has the same replication configuration. Spanner moves data between Paxos groups directory by directory. Directories can be moved for reasons such as improving data locality or balancing load and resource usage. This movement happens dynamically while the system is still online. A 50 MB directory can normally be moved in a few seconds. The move is not carried out as a single transaction, since that would block other ongoing transactions.<br />
A directory is also the smallest unit for which an application can specify geo-replication properties. You can control the number and types of replicas and their geographic placement. </p>
<p><strong>For an Application</strong><br />
To applications, Spanner exposes the abstraction of schematized semi-relational tables (with synchronous replication), a SQL-based query language and general-purpose transactions. </p>
<p><strong>Trade-offs</strong><br />
The world of distributed systems has hitherto shunned the two-phase commit protocol because of its availability and performance issues. The designers of Spanner make an interesting departure from this long-held idea, and for some good reasons. Here is what they have to say</p>
<blockquote><p><em>We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions. Running two-phase commit over Paxos mitigates the availability problems.</em></p></blockquote>
<p>Understood naively, it may appear that the world of distributed systems has come full circle in its struggle for better scalability and availability!</p>
<p><strong>Data Model</strong><br />
An application creates a database within a universe. Each database can hold a number of tables. Tables have rows and columns, and they also store versioned values for the data in each cell. Every table must have an ordered set of one or more primary key columns. The primary key uniquely identifies each row. The whole table is a mapping from the primary key columns to the non-primary-key columns. </p>
<p>In a distributed database the partitioning scheme is the key to improved performance. When partitioning, you want to keep data from related tables within the same unit of placement to the extent possible. In Spanner&#8217;s case this unit of placement is the directory. So you want the client to specify the group of tables that should be held within a single directory. You can do this using the INTERLEAVE IN declaration in the table creation step. See the paper for further details. </p>
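<p>Here is a small sketch of why interleaving keeps related rows together. It assumes, purely for illustration, a parent table of users and a child table of albums whose keys begin with the owning user's key; the tuple encoding and the helper name are mine, not Spanner's. Sorting by key places each user's albums right after the user row, and all rows sharing that top-level prefix form one directory.</p>

```python
from itertools import groupby

# Keys are tuples; a child row's key starts with its parent row's key.
rows = [
    (("user", 2), "Users row"),
    (("user", 1, "album", 7), "Albums row"),
    (("user", 1), "Users row"),
    (("user", 2, "album", 3), "Albums row"),
    (("user", 1, "album", 4), "Albums row"),
]


def directories(rows):
    """Group rows by their top-level (parent) key prefix, in key order."""
    ordered = sorted(rows, key=lambda row: row[0])
    return {prefix: [key for key, _ in group]
            for prefix, group in groupby(ordered, key=lambda row: row[0][:2])}


dirs = directories(rows)
```

<p>Each entry of <code>dirs</code> holds a user row followed by that user's album rows, so moving or replicating one directory moves everything a typical per-user query needs.</p>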
<p><strong>Transactions &#038; Timestamp management</strong><br />
Spanner is the first system to assign globally meaningful commit timestamps to distributed transactions. It guarantees that if transaction T-1 commits before transaction T-2 starts, then T-1&#8217;s commit timestamp will be smaller than T-2&#8217;s, and it offers this guarantee at global scale. This feature is enabled by the TrueTime API. </p>
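<p>The ordering guarantee is easier to believe once you see the "commit wait" idea in miniature. The sketch below is a toy under stated assumptions: a simulated clock that reports an interval (earliest, latest) around the true time, a fixed error bound, and manual time advancement instead of sleeping. The class and function names are made up; this is not Spanner's implementation.</p>

```python
class SimulatedTrueTime:
    """Toy clock exposing an uncertainty interval around 'real' time."""

    def __init__(self, epsilon: int = 5):
        self.real_time = 0      # advanced manually instead of sleeping
        self.epsilon = epsilon  # assumed bound on clock error

    def now(self) -> tuple[int, int]:
        # (earliest, latest): the true time lies somewhere in between.
        return (self.real_time - self.epsilon, self.real_time + self.epsilon)

    def advance(self, ticks: int = 1) -> None:
        self.real_time += ticks


def commit(tt: SimulatedTrueTime) -> int:
    """Pick a commit timestamp, then wait until it is surely in the past."""
    _, latest = tt.now()
    commit_ts = latest
    while commit_ts >= tt.now()[0]:  # commit wait
        tt.advance()
    return commit_ts


tt = SimulatedTrueTime()
t1 = commit(tt)  # T-1 commits ...
t2 = commit(tt)  # ... before T-2 starts
```

<p>Because T-1 does not finish committing until its timestamp is definitely in the past, any transaction that starts afterwards is bound to pick a larger timestamp. The price is a wait proportional to the clock uncertainty.</p>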
<p>The paper goes on to describe the details of timestamp assignment in different transactional scenarios such as Read-Write transactions, Read Only transactions and Snapshot reads.</p>
<p><strong>Real world experience</strong><br />
The buzz about Spanner has been around for a little while now. Experimental production trials with Spanner began in early 2011 as part of the rewrite of <a href="http://research.google.com/pubs/pub38125.html" title="F1" target="_blank">Google&#8217;s ad backend called F1</a>. The F1 team chose Spanner for several reasons &#8211;<br />
1) Removes the need to manually partition data<br />
2) Synchronous replication and automatic failover<br />
3) Strong transactional semantics</p>
<p><a href="http://research.google.com/archive/spanner.html" title="Spanner" target="_blank">Link to the original paper</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/spanner/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>HAIL &#8211; Only Aggressive Elephants are Fast Elephants</title>
		<link>http://www.systemswemake.com/papers/hail</link>
		<comments>http://www.systemswemake.com/papers/hail#comments</comments>
		<pubDate>Sun, 26 Aug 2012 19:30:15 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Programming]]></category>
		<category><![CDATA[distributed programming]]></category>
		<category><![CDATA[hadoop]]></category>
		<category><![CDATA[map reduce]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1188</guid>
		<description><![CDATA[Typically we store data based on any one of the different physical layouts (such as row, column, vertical, PAX etc). And this choice determines its suitability for a certain kind of workload while making it less optimal for other kinds of workloads. Can we store data under different layouts at &#8230;]]></description>
				<content:encoded><![CDATA[<p>Typically we store data in one of several physical layouts (such as row, column, vertical, PAX etc). This choice makes the data suitable for a certain kind of workload while making it less optimal for others. Can we store data under different layouts at the same time, especially within an HDFS environment where each block is already replicated a few times? This is the big idea that HAIL (Hadoop Aggressive Indexing Library) pursues. </p>
<p>At a very high level, understanding how HAIL works means looking at the three distinct workflows the system is organized around, namely &#8211;<br />
1) The data/file upload pipeline<br />
2) The indexing pipeline<br />
3) The query pipeline</p>
<p>Every unit of information makes its journey through these three pipelines.</p>
<p><strong>The Upload Pipeline</strong></p>
<p>Here is where we begin. Typically one uploads the file to be analyzed into HDFS first and then executes a set of MR jobs on the Hadoop cluster. With HAIL you use a HAIL client to upload the file. The client parses the contents of the file based on newlines and splits it up into blocks such that no row spans across blocks. Additionally the user can specify a schema while uploading the file (much like in Pig). HAIL then converts all data blocks to a binary PAX representation. The good thing about this pipeline is that the whole transformation happens as the file is being written to HDFS. The blocks are not re-read from HDFS, which would cause a lot of extra I/O. This significantly improves the write performance.</p>
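<p>The two client-side steps, splitting on row boundaries and regrouping each block by column, can be sketched in a few lines. This is an illustration only: it works on comma-separated text and Python lists rather than HAIL's binary PAX format, and the function names and the byte limit are assumptions.</p>

```python
def split_into_blocks(text: str, max_block_bytes: int) -> list[list[str]]:
    """Split on newlines so that no row spans two blocks."""
    blocks, current, size = [], [], 0
    for row in text.splitlines():
        row_bytes = len(row.encode()) + 1  # +1 for the newline
        if current and size + row_bytes > max_block_bytes:
            blocks.append(current)
            current, size = [], 0
        current.append(row)
        size += row_bytes
    if current:
        blocks.append(current)
    return blocks


def to_pax(block: list[str], sep: str = ",") -> list[list[str]]:
    """Regroup one block's rows column by column (PAX-style layout)."""
    return [list(column) for column in zip(*(row.split(sep) for row in block))]


blocks = split_into_blocks("1,a\n2,b\n3,c\n", max_block_bytes=8)
columns = to_pax(blocks[0])
```

<p>Keeping whole rows inside a block means each block can be sorted and indexed on its own, and the column-wise grouping lets a later query read only the attributes it projects.</p>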
<p>The client then contacts the Name node to get a list of data nodes. It then sends chunked PAX blocks to the first data node. When the data node receives the packet it immediately forwards the same packet to the next data node. It does this without flushing the contents of the packets and the checksum to the disk. This is the same with every data node that receives the packet. Contents are not flushed to the disk immediately.</p>
<p>On receiving a whole block worth of contents, each data node sorts the contents and creates indexes based on the sort order and indexes specified by the user. Each data node sorts the data in a different order. The index metadata, the sorted data and the checksums together form what is known as a HAIL block. </p>
<p>Within HAIL it&#8217;s vital that the MR jobs run such that they are able to leverage the indexing that has happened on the data nodes. So the tasks have to be scheduled on the data node that has the most suitable index.  In order to enable this, the sort/index metadata has to be stored at the name node level. An instance of HAILBlockReplicaInfo contains detailed information about the types of available indexes for a replica, i.e. indexing key, index type, size, start offsets etc.</p>
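<p>Put together, the per-replica sort orders and the name node metadata amount to something like the sketch below. The dictionary that maps a data node to its indexed column is a made-up stand-in for the per-replica metadata; the node names and both function names are assumptions, not HAIL's API.</p>

```python
def sort_replicas(block: list[tuple], sort_columns: list[int]) -> dict[int, list[tuple]]:
    """Give each replica of a block its own sort order (one column each)."""
    return {col: sorted(block, key=lambda row: row[col]) for col in sort_columns}


def pick_replica(replica_index: dict[str, int], filter_column: int) -> str:
    """Return a data node whose replica is indexed on the filter column.

    `replica_index` maps data node name -> indexed column, a stand-in for
    the per-replica metadata kept at the name node. Falls back to the
    first node when no replica has a matching index.
    """
    for node, column in replica_index.items():
        if column == filter_column:
            return node
    return next(iter(replica_index))


block = [(3, "b"), (1, "c"), (2, "a")]
replicas = sort_replicas(block, sort_columns=[0, 1])
node = pick_replica({"dn1": 0, "dn2": 1}, filter_column=1)
```

<p>The replicas still hold the same rows, so failover works as before; only the order and the index differ from node to node.</p>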
<p><strong>The Indexing Pipeline</strong></p>
<p>The basic purpose of the indexes is to get to the relevant blocks by scanning the index first. After experimenting with a few different types of indexes, the authors seem to have settled on a sparse clustered B+ tree based index. The column that needs to be indexed is first sorted in memory and then the index tree is written to the disk on to a single directory. Note that the index is not a multi-level index. The paper gives some back-of-the-envelope calculations for this choice.</p>
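<p>A sparse, single-level index over a sorted column is simple enough to sketch. The version below keeps the first value of every fixed-size partition and uses binary search to decide which partitions to read. The partition size, the function names and the use of plain Python lists are assumptions; the paper's on-disk format is not reproduced here.</p>

```python
from bisect import bisect_left, bisect_right


def build_sparse_index(sorted_values: list[int], partition_size: int) -> list[int]:
    """Keep only the first value of every partition of the sorted column."""
    return sorted_values[::partition_size]


def lookup_range(sorted_values, index, partition_size, low, high):
    """Return values in [low, high], reading only partitions that may match."""
    # Start one partition before the first index entry that reaches `low`.
    first = max(bisect_left(index, low) - 1, 0)
    # Stop after the last partition whose first value does not exceed `high`.
    last = bisect_right(index, high)
    candidates = sorted_values[first * partition_size:last * partition_size]
    return [v for v in candidates if high >= v >= low]


values = [1, 3, 5, 7, 9, 11, 13, 15]
index = build_sparse_index(values, partition_size=2)
hits = lookup_range(values, index, 2, low=6, high=10)
```

<p>The index holds one entry per partition rather than one per row, so it stays small enough to scan in one go, which fits the choice of a single level.</p>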
<p><strong>The Query Pipeline </strong></p>
<p>Much of the MR job continues to be written just as before, but with some interfaces changed. First, the InputFormat implementation used is HailInputFormat. The other nicety of this framework is that the typical task of filtering records, usually carried out within the map function, can be delegated to HAIL. You annotate the map function with the @HailQuery annotation, in which you declaratively specify the projected attributes and the filtering condition.</p>
<p><code>@HailQuery(filter="@3 between(1999-01-01,2000-01-01)", projection={@1})<br />
void map(Text key, HailRecord v) { ... }</code></p>
<p>The HailRecordReader which collaborates with HailInputFormat is the component that applies the predicate to filter out the qualifying records. Lastly the value passed to the map function is a HailRecord object. </p>
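<p>To make the semantics of the filter and projection concrete, here is a rough Python equivalent of what the record reader does for the query above. It is a sketch under assumptions: hail_style_read is a made-up name, attribute positions are taken to be 1-based as the @1 and @3 notation suggests, and the bounds of "between" are treated as inclusive.</p>

```python
from datetime import date


def hail_style_read(rows, filter_attr, low, high, projection):
    """Yield only the projected attributes of rows whose filter attribute
    lies in [low, high]. Attribute positions are 1-based, mirroring @1, @3.
    (Assumption: both bounds are inclusive.)"""
    for row in rows:
        value = row[filter_attr - 1]
        if value >= low and high >= value:
            yield tuple(row[p - 1] for p in projection)


# Made-up records: (name, age, visit date).
rows = [
    ("alice", 34, date(1999, 3, 1)),
    ("bob", 27, date(2001, 7, 15)),
    ("carol", 41, date(1999, 11, 30)),
]
result = list(
    hail_style_read(rows, 3, date(1999, 1, 1), date(2000, 1, 1), projection=[1])
)
```

<p>The map function then only ever sees the qualifying, already projected records, which is why the filtering code can disappear from the map body.</p>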
<p><strong>Summing it up</strong><br />
HAIL tries to support per-replica indexes in an efficient way and without significant changes to the standard execution pipeline. It achieves much of this by providing alternate implementations of the InputFormat and RecordReader interfaces along with a custom splitting policy. HAIL improves upload and query times without impacting the failover properties of Hadoop and with minimal change to the MapReduce programming interface.</p>
<p><a href="http://infosys.uni-saarland.de/publications/HAIL.pdf" title="http://infosys.uni-saarland.de/publications/HAIL.pdf" target="_blank">Link to the original paper</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/hail/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MinuteSort with Flat Datacenter Storage</title>
		<link>http://www.systemswemake.com/papers/flat-datacenter-storage</link>
		<comments>http://www.systemswemake.com/papers/flat-datacenter-storage#comments</comments>
		<pubDate>Thu, 24 May 2012 19:26:40 +0000</pubDate>
		<dc:creator>Hari</dc:creator>
				<category><![CDATA[Distributed Storage]]></category>
		<category><![CDATA[distributed database]]></category>

		<guid isPermaLink="false">http://www.systemswemake.com/?p=1176</guid>
		<description><![CDATA[It's been a couple of days since Microsoft has been in the news as the one to beat the previous data sorting record held by Hadoop by sorting 1,401 GB of data in a minute. All news articles came along with a mention of two new terms namely, Flat Datacenter Storage &#8230;]]></description>
				<content:encoded><![CDATA[<p>It's been a couple of days since Microsoft has been in the news as the one to beat the previous data sorting record held by Hadoop by sorting 1,401 GB of data in a minute. All the news articles mentioned two new terms, namely Flat Datacenter Storage and Full Bisection Bandwidth Networks. Trying to understand what these terms meant led me to this paper, which describes them. Flat Datacenter Storage (FDS) is a high-performance distributed blob storage system. Just the kind of system I would love to cover for this blog!</p>
<p>Data is stored on dedicated storage nodes called tractservers (comparable to HDFS data nodes). A tractserver is a network front-end to a single disk; machines with multiple disks run one tractserver per disk.<br />
User code does not run on tractservers; applications can only retrieve data from or write data to tractservers over the network. In FDS, there is no such thing as a local file.</p>
<p>Contrary to the idea of moving compute to the storage nodes, this system works by always sending data over the network. It manages the cost of data transport by<br />
a) Giving each storage node network bandwidth that matches its storage bandwidth<br />
b) Interconnecting storage nodes and compute nodes using a full bisection bandwidth network</p>
<p>This combination produces an uncongested path from remote disks to CPUs, giving the system an aggregate I/O bandwidth essentially equivalent to a system such as MapReduce that uses local storage. FDS also supports data replication for failure recovery.</p>
<p>In FDS, data is logically stored in blobs. A blob is a byte sequence named with a 128-bit GUID. Blobs can be any length, limited in size only by the system's storage capacity. Reads from and writes to a blob are done in units called tracts. Each tract within a blob is numbered sequentially starting from 0. Tracts in FDS are about 8MB. The FDS API defines simple CRUD operations to interact with a blob. All calls in the API are non-blocking, so each call also takes a callback function that is invoked after the operation completes.</p>
<p>By spreading a blob’s tracts over many tractservers and issuing many requests in parallel, many tractservers can begin reading data off disk and transferring it back to a processing node in parallel. Deep read-aheads enable a tract to be read off disk into the tractserver’s cache while the previous one is being transferred over the network. </p>
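<p>The callback style and the parallel tract reads can be illustrated with a toy in-memory store. This is a Python sketch and not the FDS API: ToyBlobStore, read_tract and the 8-byte tract size are invented for the example, and a thread pool stands in for the network and the tractservers.</p>

```python
from concurrent.futures import ThreadPoolExecutor, wait

TRACT_SIZE = 8  # bytes here for illustration; FDS tracts are about 8MB


class ToyBlobStore:
    """In-memory stand-in for a set of tractservers (hypothetical API)."""

    def __init__(self, workers=4):
        self._blobs = {}
        self._pool = ThreadPoolExecutor(max_workers=workers)

    def write_blob(self, guid, data):
        self._blobs[guid] = data

    def read_tract(self, guid, tract_no, callback):
        """Return immediately; callback(tract_no, data) runs on completion."""
        def task():
            start = tract_no * TRACT_SIZE
            callback(tract_no, self._blobs[guid][start:start + TRACT_SIZE])
        return self._pool.submit(task)

    def close(self):
        self._pool.shutdown()


received = {}


def on_tract_read(tract_no, data):
    # Callbacks may arrive in any order, so key the result by tract number.
    received[tract_no] = data


store = ToyBlobStore()
store.write_blob(guid=42, data=b"0123456789abcdefghijklmn")  # three tracts
pending = [store.read_tract(42, n, on_tract_read) for n in range(3)]
wait(pending)  # the caller chooses when, if ever, to block
store.close()
```

<p>All three reads are issued before any of them completes. With many tractservers behind the calls, this is what lets a single client keep many disks busy at once.</p>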
<p><strong>Does it have a SPOF?</strong></p>
<p>A single central metadata server that must be consulted to learn where the data is placed is a common design pattern in distributed storage systems. Writers contact the metadata server to find out where to write a new block; the metadata server picks a data server, durably stores that decision and returns it to the writer. Readers contact the metadata server to find out which servers store the blocks to be read. This approach turns the metadata server into a SPOF as it is always on the critical path for all reads and writes.</p>
<p>This system has a metadata server too, except that its role during normal operations is simple and limited: collect a list of the system's active tractservers and distribute information about them to clients. This list, known as the tract locator table (TLT), is first retrieved from the metadata server when a client starts. The metadata server stores only metadata about the hardware configuration, not about files.</p>
<p>When a client wants to read or write a tract, it first computes a tract locator. The simplest tract locator is the sum of the 128-bit blob GUID and the 64-bit tract number to be read, modulo the number of entries in the TLT. Indexing the tract locator into the TLT yields the tractserver to which that tract read or write should be issued. The TLT changes only in response to cluster reconfiguration and not individual CRUD operations. It can thus be cached by clients for a long time.</p>
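<p>The simplest locator described above is a one-line computation. The following Python sketch shows it with a made-up four-entry table and invented tractserver names; real tables are much larger and the entries are not plain strings.</p>

```python
def tract_locator(blob_guid, tract_number, tlt_length):
    """Simplest locator from the text: (GUID + tract number) mod TLT length."""
    return (blob_guid + tract_number) % tlt_length


def lookup_tractserver(tlt, blob_guid, tract_number):
    """Index the locator into the cached TLT to find the server to contact."""
    return tlt[tract_locator(blob_guid, tract_number, len(tlt))]


# Hypothetical four-entry table; the names are placeholders.
tlt = ["ts-a", "ts-b", "ts-c", "ts-d"]
guid = 0x0123456789ABCDEF0123456789ABCDEF  # a 128-bit blob GUID
servers = [lookup_tractserver(tlt, guid, n) for n in range(6)]
```

<p>Consecutive tract numbers walk through consecutive table rows, so a sequential read of one blob is spread across the tractservers without any call to the metadata server.</p>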
<p>Since the tractservers remember their position in the table, the metadata server stores no durable state; if the metadata server fails, the TLT is reconstructed by contacting each tractserver. The TLT is never modified by reads and writes.<br />
The TLT also contains random permutations of the list of tractservers. This increases the chances that sequential reads and writes by independent clients utilize all tractservers uniformly. The TLT's independent permutations prevent clients from organizing into synchronized convoys.</p>
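<p>Rebuilding the table from what the tractservers remember could look roughly like this. It is a simplified Python sketch, not the FDS recovery protocol: rebuild_tlt and the report format are assumptions, and each table row holds a single tractserver here.</p>

```python
def rebuild_tlt(reports, tlt_length):
    """Rebuild the table from what each tractserver remembers.

    reports maps a tractserver name to the TLT row numbers it was assigned.
    (Simplification: one tractserver per row.)
    """
    tlt = [None] * tlt_length
    for server, rows in reports.items():
        for row in rows:
            tlt[row] = server
    return tlt


# Made-up reports collected after a metadata server restart.
reports = {"ts-a": [0, 3], "ts-b": [1], "ts-c": [2]}
tlt = rebuild_tlt(reports, tlt_length=4)
```

<p>Because the rows can be recovered this way, losing the metadata server loses nothing that the rest of the cluster does not already know.</p>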
<p>Section 2.2.1 of the paper splendidly describes the optimizations and trade-offs in the design of the metadata server. A must read!</p>
<p>Per-blob metadata, such as blob length and permissions, is stored in a special tract ("tract -1") of each blob. Clients find a blob's metadata using the same method as for finding data, using the TLT. Thus per-blob metadata management is as distributed as blob data storage.</p>
<p>The <a href="http://sortbenchmark.org/FlatDatacenterStorage2012.pdf" target="_blank">rest of the paper</a> describes the execution of the sort algorithm in great detail.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.systemswemake.com/papers/flat-datacenter-storage/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
