The Limitation of MapReduce: A Probing Case and a Lightweight Solution

While we see plenty of papers on applications of the MapReduce programming model, this one, for a change, addresses the limitations of the MR model. It argues that MR only allows a program to ...

Keyword Searching and Browsing in Databases using BANKS

BANKS is a system that enables keyword-based searches on a relational database. Published 10 years ago at ICDE 2002, the paper won this year's ICDE award for the most influential paper of the past decade. ...

HadoopDB: Efficient Processing of Data Warehousing Queries in a Split Execution Environment

The buzz about Hadapt and HadoopDB has been around for a while now, as HadoopDB is one of the first systems to combine ideas from two different approaches, namely parallel databases based on a shared-nothing architecture and MapReduce, to address ...

Spark: Cluster Computing with Working Sets

One of the aspects you can’t miss even as you just begin reading this paper is the strong scent of functional programming that the design of Spark bears. The use of FP idioms is quite widespread across the architecture of ...

Kafka: a Distributed Messaging System for Log Processing

Kafka, a system developed at LinkedIn, is essentially a messaging system designed to support aggregation of high-throughput log messages arriving from different applications. Why would a traditional messaging system not be a good fit for log processing? Typical enterprise ...

Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency

Windows Azure Storage is a key component of the Windows Azure Cloud platform that offers an infinite disk in the cloud. It’s been in production since November 2008 and is used heavily within Microsoft in addition to being available as ...

Thialfi: A Client Notification Service for Internet-Scale Applications

Scandinavian mythology regards Thialfi, a swift runner, as the attendant of Thor, the god of thunder. That swiftness is perhaps why the folks at Google named their message delivery system after him (it delivers notifications at ...

Spotify: Large Scale, Low Latency, P2P Music-on-Demand Streaming

This paper from folks at Spotify primarily focuses on how they use P2P techniques in their platform. The service is not web-based, but instead uses a proprietary client and protocol. At the heart of the system is this custom music streaming ...

Tenzing: A SQL Implementation On The MapReduce Framework

This paper, which appeared in this year’s VLDB, describes the internals of the SQL query engine built atop Google’s MapReduce framework. It’s currently used by over 1000 people at Google, serving over 10,000 queries each day that span across ...

HipG: Parallel Processing of Large-Scale Graphs

Abstract: Distributed processing of real-world graphs is challenging due to their size and the inherent irregular structure of graph computations. We present HipG, a distributed framework that facilitates programming parallel graph algorithms by composing the parallel application automatically from the user-defined ...

Data Management for Internet-Scale Single-Sign-On

Google offers a variety of Internet services that require user authentication. These services rely on a single-sign-on service, called Google Accounts, that has been in active deployment since 2002. As of 2006, Google has tens of applications with millions of user accounts worldwide. We describe ...

The Google File System

We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients. While sharing many of ...

The Chubby lock service for loosely-coupled distributed systems

We describe our experiences with the Chubby lock service, which is intended to provide coarse-grained locking as well as reliable (though low-volume) storage for a loosely-coupled distributed system. Chubby provides an interface much like a distributed file system with advisory locks, but the design ...

Dremel: Interactive Analysis of Web-Scale Datasets

Dremel is a scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of ...

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure

Modern Internet services are often implemented as complex, large-scale distributed systems. These applications are constructed from collections of software modules that may be developed by different teams, perhaps in different programming languages, and could span many thousands of machines across multiple physical facilities. Tools ...

Mercator: A Scalable, Extensible Web Crawler

This paper describes Mercator, a scalable, extensible web crawler written entirely in Java. Scalable web crawlers are an important component of many web services, but their design is not well-documented in the literature. We enumerate the major components of any scalable web crawler, comment on ...
