Caching vs. Database Replication Strategies in Chained MR Jobs

Team Members


Andrew Wong

BS, Computer Science '19


Ryan Demo

MSE, Computer Science '19
BS, Electrical Eng. & CS '18


Benjamin Hoertnagl-Pereira

MSE, Computer Science '19
BS, Computer Eng. '18


Daniel Sohn

BS, Computer Science & AMS '20

Problem Statement

We aim to investigate strategies for database caching and replication as a function of the data workload. Specifically, for a datacenter tenant that must compute on large amounts of uniformly sampled data queried from a database, where the first read of each document adds latency, which combination of local caching and database replication minimizes job compute times?

Experimental Background

In chained MapReduce jobs in Hadoop that read randomly, uniformly sampled documents (identified by hash ids, for example) from a database after each reduce, each lookup of a hash pointer depends on the ids produced by the previous stage, so an appropriate database sharding and replication scheme is necessary to reduce job times under this workload.

Some real-world tasks with this pattern are traversing a graph with hash references to nodes, looking up digital signature metadata across multiple rounds of signing, and finding the lowest-degree vertex in a graph.
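As a rough illustration of this access pattern, the Python sketch below models one link in the chain, assuming a pymongo client pointed at the app router; the host, database/collection names, and document fields are placeholders, not our actual schema.

```python
import random
from pymongo import MongoClient

# Hypothetical connection to the mongos app router.
docs = MongoClient("mongodb://mongos-router:27017").cloud.data

def next_stage_ids(prev_ids, sample_size=100):
    """One link in the MR chain: uniformly sample hash ids emitted by the
    previous reduce, fetch those documents, and emit the ids they point to."""
    sampled = random.sample(prev_ids, min(sample_size, len(prev_ids)))
    fetched = docs.find({"_id": {"$in": sampled}})
    # Each lookup depends on ids produced by the previous stage, so the
    # first-read latency of these queries adds directly to job time.
    return [doc["next_id"] for doc in fetched if "next_id" in doc]
```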

Experiment

Our baseline will measure job time using one database node and 32 compute nodes, all sending it queries, with no local caches on the compute nodes. The database node will have an in-memory integrated cache.
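The cached configurations instead add a local Redis instance on each compute node in front of the database. A minimal read-through sketch of that pattern, assuming redis-py and pymongo with placeholder host names:

```python
import json
import redis
from pymongo import MongoClient

cache = redis.Redis(host="localhost", port=6379)                 # per-node local cache
docs = MongoClient("mongodb://mongos-router:27017").cloud.data   # remote database

def get_document(doc_id):
    """Read-through cache: serve repeat reads locally, pay the database
    round trip (and first-read latency) only on a miss."""
    hit = cache.get(str(doc_id))
    if hit is not None:
        return json.loads(hit)
    doc = docs.find_one({"_id": doc_id})
    if doc is not None:
        cache.set(str(doc_id), json.dumps(doc, default=str))
    return doc
```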

The independent variables are the number of shards, the number of replicas per shard, the local Redis cache size on each compute node, the database size, and the length of the MR job chain.

The measured dependent variable is end-to-end job completion time for the chained MR jobs.

From all permutations, we will analyze the performance results of each strategy and determine best practices for caching and/or replication depending primarily on the data compute workload.

If the previous results prove conclusive, we will develop a simple query load balancer that dynamically adjusts the number of database nodes containing shards/replicas based on heuristics from the static results, and evaluate its performance against a static number of nodes.
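A minimal sketch of the kind of heuristic such a load balancer could apply, with the thresholds and capacity figures below standing in for values that would come from the static results:

```python
def target_db_nodes(read_qps, avg_read_latency_ms, current_nodes,
                    max_nodes=4, latency_target_ms=20.0, qps_per_node=5000):
    """Heuristic: choose how many shard/replica nodes to run based on
    observed query rate and read latency (threshold values are placeholders)."""
    needed = max(1, -(-int(read_qps) // qps_per_node))   # ceiling division
    if avg_read_latency_ms > latency_target_ms:
        needed = max(needed, current_nodes + 1)          # scale out when reads are slow
    return min(max_nodes, needed)
```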

Timeline

Results (Work in Progress)

We are currently running and refining experiments, and expect to have all the data by 4/24. Delays in collection have been due to modeling the data for our workload, testing the code, and getting the database cluster to behave the way we want it to.

Initial results with 100 database keys, 1-2 shards, and a 256 MB Redis cache show that sharding without Redis performs increasingly worse as the MR chain grows in length, compared to a single shard with Redis. We also observe that, with constant sharding and caching strategies, job completion time grows linearly with the length of the MR chain and approximately doubles with each order-of-magnitude increase in DB size. We will let these initial results guide which experiment parameter permutations we focus on in the short term.

Existing Work

ShardFS vs. IndexFS: Replication vs. Caching Strategies for Distributed Metadata Management in Cloud Storage Systems

The rapid growth of cloud storage systems calls for fast and scalable namespace processing. While few commercial file systems offer anything better than federating individually non-scalable namespace servers, a recent academic file system, IndexFS, demonstrates scalable namespace processing based on client caching of directory entries and permissions (directory lookup state) with no per-client state in servers. In this paper we explore explicit replication of directory lookup state in all servers as an alternative to caching this information in all clients. Both eliminate most repeated RPCs to different servers in order to resolve hierarchical permission tests. Our realization for server replicated directory lookup state, ShardFS, employs a novel file system specific hybrid optimistic and pessimistic concurrency control favoring single object transactions over distributed transactions. Our experimentation suggests that if directory lookup state mutation is a fixed fraction of operations (strong scaling for metadata), server replication does not scale as well as client caching, but if directory lookup state mutation is proportional to the number of jobs, not the number of processes per job, (weak scaling for metadata), then server replication can scale more linearly than client caching and provide lower 70 percentile response times as well.

Analysis of Caching and Replication Strategies for Web Applications

Replication and caching mechanisms are often employed to enhance the performance of Web applications. In this article, we present a qualitative and quantitative analysis of state-of-the-art replication and caching techniques used to host Web applications. Our analysis shows that the selection of the best mechanism is heavily dependent on the data workload and requires careful analysis of the application characteristics. To this end, we propose a technique that may enable future Web practitioners to compare the performance of different caching/replication mechanisms.

Replication Strategy for Spatiotemporal Data Based on Distributed Caching System

The replica strategy in distributed cache can effectively reduce user access delay and improve system performance. However, developing a replica strategy suitable for varied application scenarios is still quite challenging, owing to differences in user access behavior and preferences. In this paper, a replication strategy for spatiotemporal data (RSSD) based on a distributed caching system is proposed. By taking advantage of the spatiotemporal locality and correlation of user access, RSSD mines high popularity and associated files from historical user access information, and then generates replicas and selects appropriate cache node for placement. Experimental results show that the RSSD algorithm is simple and efficient, and succeeds in significantly reducing user access delay.

Extra Notes

We are also interested in database sharding and replication best practices for when data is poorly cacheable or requires fast first reads, and in example applications that would use this paradigm.

Cloud providers that offer multi-tenant services must handle compute hotspots reliably in order to optimize resource usage. Usually the problem is that the server lacks the CPU or memory to execute all the queries, or that bandwidth is the bottleneck. Even though these are hardware limitations and not necessarily database dependent, they can be mitigated by sharding/replication, either by dividing the queries among different shards or by moving a copy of the database closer to the compute. One approach to handling dynamic compute-heavy localities that rely on the same data is distributed database sharding. For example, compute-heavy jobs requiring access to the same data shards may be bottlenecked by database reads, where network congestion (bandwidth) or CPU limits how fast requests are processed.

To have enough copies of relevant data to balance the compute load across multiple nodes, the database can be replicated dynamically. It is the conditions under which replication should occur and the extent to which it should occur that we are interested in.

In the context of the CAP theorem, there may be multiple optimal replication strategies depending on the priorities of the distributed database. For example, across different loads such as read-heavy, mixed read/write, and write-heavy operation, caching might be better in read-heavy situations, while sharding might be better when the data is updated frequently.

Checkpoint 1 Updates

Checkpoint 2 Updates

For this iteration we refined our Google Cloud MongoDB cluster to the following architecture: 1 MapReduce node, 1 mongos app router, 1 config server, and 2 shard servers. Each regular node uses the default Debian 9 image in Google Cloud Compute Engine with 3.75 GB of memory and 1 vCPU; shard nodes have 1.7 GB of memory and 1 vCPU. The mongos app router is an interface to two databases, single and cloud, where single holds a single collection of data on the app router node and cloud holds the sharded collection of data on the shard servers. The config server holds metadata and shard lookup information for the app router during queries. In addition to the architecture, we set up the /etc/hosts file on each node so that every node within the cluster has aliased access to the others. Finally, we wrote a data ingress pipeline so we can easily add and remove data when tweaking cluster configuration parameters in future iterations.
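For reference, the sharded cloud database can be configured and populated through the mongos router. A pymongo sketch of the setup and ingress path, with the collection name, shard key, and host name assumed for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-router:27017")  # the mongos app router

# Shard the collection in the 'cloud' database on a hashed key so documents
# are spread evenly across the two shard servers.
client.admin.command("enableSharding", "cloud")
client.admin.command("shardCollection", "cloud.data", key={"_id": "hashed"})

def ingest(documents):
    """Data ingress: bulk-insert documents into the sharded collection."""
    client.cloud.data.insert_many(documents)

def clear():
    """Remove all documents when reconfiguring cluster parameters."""
    client.cloud.data.delete_many({})
```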

We ran experiments on a 4 GB database with a normally distributed interarrival profile, 1 replica per shard, 1-2 shards, and cache sizes varying from 0 to 2 GB. The workload was an MR chain of 5 jobs. The results can be found in the results spreadsheet.
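The normally distributed interarrival profile can be driven as in the sketch below; the mean and standard deviation shown are placeholders rather than the values used in these runs.

```python
import random
import time

def drive_queries(run_query, n_queries, mean_s=0.05, std_s=0.02):
    """Issue queries with normally distributed interarrival times,
    clamping negative samples to zero so waits are never negative."""
    for i in range(n_queries):
        time.sleep(max(0.0, random.gauss(mean_s, std_s)))
        run_query(i)
```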

We also set up the results spreadsheet so that we know exactly what data we're capturing, and it generates charts as we input data.

Next Steps