Winning the Big Data SPAM Challenge__HadoopSummit2010

View more presentations from Yahoo Developer Network.

App Engine SDK 1.3.5 Released With New Task Queue, Python Precompilation, and Blob Features

Today we are happy to announce the 1.3.5 release of the App Engine SDK for both Python and Java developers.

Due to popular demand, we have increased the throughput of the Task Queue API, from 50 reqs/sec per app to 50 reqs/sec per queue. You can also now specify the amount of storage available to the taskqueue in your app, for those with very large queues with many millions of tasks. Stay tuned for even more Task Queue scalability improvements in the future.

Additionally, in this release we’ve also added support for precompilation of Python source files to match the same feature we launched for Java last year. For Python, you can now use precompilation to speed up application loading time and to reduce CPU usage for new app instances. You can enable precompilation by including the following lines in your app.yaml file:

- python_precompiled

This will start offline precompilation of Python modules used by your app when you deploy your application. Currently precompliation is off by default for Python applications, but it will be enabled by default in some future release. (Java precompilation has been enabled by default since the release of 1.3.1.)

To give you a taste of what this feature is like, we tested this on a modified version of Rietveld (which included a copy of Django 1.0.4 in the app directory, and which did not use the datastore in its base url). The latency and CPU usage results for the initial load of the application, after uploading a new version of the app and requesting the homepage, were:

Before precompilation enabled:
Test 1: 1450ms 1757cpu_ms
Test 2: 1298ms 1523cpu_ms
Test 3: 1539ms 1841cpu_ms
After precompilation enabled:
Test 1: 805ms 669cpu_ms
Test 2: 861ms 702cpu_ms
Test 3: 921ms 803cpu_ms

Of course, any individual app’s performance will vary, so we recommend that you experiment with the setting for your application. Please submit your feedback and results to the support group!

In addition to Task Queues and Python precompilation, we have made a few changes to the Blobstore in 1.3.5 First, we have added file-like interfaces for reading Blobs. In Python, this is supported through the BlobReader class. In Java, we have implemented the BlobstoreInputStream class, which gives an InputStream view of the blobs stored in Blobstore.



Weak Consistency and CAP Implications

Weak Consistency and CAP Implications: "

Migrating your web application from a single node to a distributed setup is always a deceivingly large architectural change. You may need to do it due to a resource constraint of a single machine, for better availability, to decouple components, or for a variety of other reasons. Under this new architecture, each node is on its own, and a network link is present to piece it all back together. So far so good, in fact, ideally we would also like for our new architecture to provide a few key properties: Consistency (no data conflicts), Availability (no single point of failure), and Partition tolerance (maintain availability and consistency in light of network problems).

Problem is, the CAP theorem proposed by Eric Brewer and later proved by Seth Gilbert and Nancy Lynch, shows that together, these three requirements are impossible to achieve at the same time. In other words, in a distributed system with an unreliable communications channel, it is impossible to achieve consistency and availability at the same time in the case of a network partition. Alas, such is the tradeoff.

'Pick Two' is too simple

The original CAP conjecture presented by Eric Brewer states that as architects, we can only pick two properties (CA, CP, or PA) at the same time, and many attempts have since been made to classify different distributed architectures into these three categories. Problem is, as Daniel Abadi recently pointed out (and Eric Brewer agrees), the relationships between CA, CP and AP are not nearly as clear-cut as they appear on paper. In fact, any attempt to create a hard partitioning into these buckets seems to only increase the confusion since many of the systems can arbitrarily shift their properties with just a few operational tweaks - in the real world, it is rarely an all or nothing deal.

Focus on Consistency

Following some great conversations about CAP at a recent NoSQL Summer meetup and hours of trying to reconcile all the edge cases, it is clear that the CA vs. CP vs. PA model is, in fact, a poor representation of the implications of the CAP theorem - the simplicity of the model is nice, but in reality the actual design space requires more nuance. Specifically, instead of focusing on all three properties at once, it is more productive to first focus along the continuum of “data consistency” options: none, weak, and full.

On one extreme, a system can demand no consistency. For example, a clickstream application which is used for best effort personalization can easily tolerate a few missed clicks. In fact, the data may even be partitioned by data centre, geography, or server, such that depending on where you are, a different “context” is applied - from home, your search returns one set of results, from work, another! The advantage of such a system is that it is inherently highly available (HA) as it is a share nothing, best effort architecture.

On the other extreme, a system can demand full consistency across all participating nodes, which implies some communications protocol to reach a consensus. A canonical example is a “debit / credit” scenario where full agreement across all nodes is required prior to any data write or read. In this scenario, all nodes maintain the exact same version of the data, but compromise HA in the process - if one node is down, or is in disagreement, the system is down.

CAP Implies Weak Consistency

Strong consistency and high availability are both desirable properties, however the CAP theorem shows that we can’t achieve both of these over an unreliable channel at once. Hence, CAP pushes us into a “weak consistency” model where dealing with failures is a fact of life. However, the good news is that we do have a gamut of possible strategies at our disposal.

In case of a failure, your first choice could be to choose consistency over availability. In this scenario, if a quorum can be reached, then one of the network partitions can remain available, while the second goes offline. Once the link between the two networks is restored, a simple data repair can take place - the minority partition is strictly behind, hence there are no possible data conflicts. Hence we sacrifice HA, but do continue to serve some of the clients.

On the other hand, we could lean towards availability over consistency. In this case, both sides can continue to accept reads and/or writes. Both sides of the partition remain available, and mechanisms such as vector clocks can be used to assist with conflict resolution (although, some conflicts will always require application level resolution). Repeatable reads, read-your-own-writes, and quorum updates are just a few of the examples of possible consistency vs. availability strategies in this scenario.

Hence, a simple corollary to the CAP theorem: when choosing availability under the weak consistency model, multiple versions of a data object will be present, will require conflict resolution, and it is up to your application to determine what is an acceptable consistency tradeoff and a resolution strategy for each type of object.

Speed of Light: Too Slow for PNUTS!

Interestingly enough, dealing with network partitions is not the only case for adopting “weak consistency”. The PNUTS system deployed at Yahoo must deal with WAN replication of data between different continents, and unfortunately, the speed of light imposes some strict latency limits on the performance of such a system. In Yahoo’s case, the communications latency is enough of a performance barrier such that their system is configured, by default, to operate under the “choose availability, under weak consistency” model - think of latency as a pseudo-permanent network partition.

Architecting for Weak Consistency

Instead of arguing over CA vs. CP vs. PA, first determine the consistency model for your application: strong, weak, or shared nothing / best effort. Notice that this choice has nothing to do with the underlying technology, and everything with the demands and the types of data processed by your application. From there, if you land in the weak-consistency model (and you most likely will, if you have a distributed architecture), start thinking how you can deal with the inevitable data conflicts: will you lean towards consistency and some partial downtime, or will you optimize for availability and conflict resolution?

Finally, if you are working under weak consistency, it is also worth noting that it is not a matter of picking just a single strategy. Depending on the context, the application layer can choose a different set of requirements for each data object! Systems such as Voldemort, Cassandra, and Dynamo all provide mechanisms to specify a desired level of consistency for each individual read and write. So, an order processing function can fail if it fails to establish a quorum (consistency over availability), while at the same time, a new user comment can be accepted by the same data store (availability over consistency).



Rails Performance Needs an Overhaul

Rails Performance Needs an Overhaul: "

Browsers are getting faster; JavaScript frameworks are getting faster; MVC frameworks are getting faster; databases are getting faster. And yet, even with all of this innovation around us, it feels like there is massive gap when it comes to the end product of delivering an effective and scalable service as a developer: the performance of most of our web stacks, when measured end to end is poor at best of times, and plain terrible in most.

The fact that a vanilla Rails application requires a dedicated worker with a 50MB stack to render a login page is nothing short of absurd. There is nothing new about this, nor is this exclusive to Rails or a function of Ruby as a language - whatever language or web framework you are using, chances are, you are stuck with a similar problem. But GIL or no GIL, we ought to do better than that. Node.js is a recent innovator in the space, and as a community, we can either learn from it, or ignore it at our own peril.

Measuring End-to-End Performance

A modern web-service is composed of many moving components, all of which come together to create the final experience. First, you have to model your data layer, pick the database and then ensure that it can get your data in and out in the required amount of time - lots of innovation in this space thanks to the NoSQL movement. Then, we layer our MVC frameworks on top, and fight religious wars as developers on whose DSL is more beautiful - to me, Rails 3 deserves all the hype. On the user side, we are building faster browsers with blazing-fast JavaScript interpreters and CSS engines. However, the driveshaft (the app server) which connects the two pieces (the engine: data & MVC), and the front-end (the browser + DOM & JavaScript), is often just a checkbox in the deployment diagram. The problem is, this checkbox is also the reason why the ‘scalability’ story of our web frameworks is nothing short of terrible.

It doesn't take much to construct a pathological example where a popular framework (Rails), combined with a popular database (MySQL), and a popular app server (Mongrel) produce less than stellar results. Now the finger pointing begins. MySQL is more than capable of serving thousands of concurrent requests, the app server also claims to be threaded, and the framework even allows us to configure a database pool!

Except that, the database driver locks our VM, and both the framework and the app server still have a few mutexes deep in their guts, which impose hard limits on the concurrency (read, serial processing). The problem is, this is the default behaviour! No wonder people complain about 'scalability'. The other popular choices (Passenger / Unicorn) “work around” this problem by requiring dedicated VMs per request - that's not a feature, that's a bug!

The Rails Ecosystem

To be fair, we have come a long way since the days of WEBrick. In many ways, Mongrel made Rails viable, Rack gave us the much needed interface to become app-server independent, and the guys at Phusion gave us Passenger which both simplified the deployment, and made the resource allocation story moderately better. To complete the picture, Unicorn recently rediscovered the *nix IPC worker model, and is currently in use at Twitter. Problem is, none of this is new (at best, we are iterating on the Apache 1.x to 2.x model), nor does it solve our underlying problem.

Turns out, while all the components are separate, and its great to treat them as such, we do need to look at the entire stack as one picture when it comes to performance: the database driver needs to be smarter, the framework should take advantage of the app servers capabilities, and the app server itself can't pretend to work in isolation.

If you are looking for a great working example of this concept in action, look no further than node.js. There is nothing about node that can't be reproduced in Ruby or Python (EventMachine and Twisted), but the fact that the framework forces you to think and use the right components in place (fully async & non-blocking) is exactly why it is currently grabbing the mindshare of the early adopters. Rubyists, Pythonistas, and others can ignore this trend at their own peril. Moving forward, end-to-end performance and scalability of any framework will only become more important.

Fixing the 'Scalability' story in Ruby

The good news is, for every outlined problem, there is already a working solution. With a little extra work, the driver story is easily addressed (MySQL driver is just an example, the same story applies to virtually every other SQL/NoSQL driver), and the frameworks are steadily removing the bottlenecks one at a time.

After a few iterations at PostRank, we rewrote some key drivers, grabbed Thin (evented app server), and made heavy use of continuations in Ruby 1.9 to create our own API framework (Goliath) which is perfectly capable of serving hundreds of concurrent requests at a time from within a single Ruby VM. In fact, we even managed to avoid all the callback spaghetti that plagues node.js applications, which also means that the same continuation approach works just as well with a vanilla Rails application. It just baffles me that this is not a solved problem already.

The state of art in the end-to-end Rails stack performance is not good enough. We need to fix that.