Influence of the Document Validation/Replication Methods on Cooperative Web Proxy Caching Architectures
ABSTRACT
Cooperative web caching has been shown to improve the performance of Web document access, which explains the growing interest in work on web caching architecture design. This paper discusses and compares the performance of several cooperative web caching designs (hierarchy, mesh, hybrid) combined with different document validation/replication methods (TTL, invalidation, pushing, etc.). It shows how the performance of a cooperative web proxy caching architecture is affected by the document validation method it implements. Comparing typical caching scenarios, we found that some combinations of distributed caching and validation methods yield speedups between 1.3 and 2.5. If response time were the only criterion for constructing a cooperative caching system, it would be easy to decide that the combination with a speedup of 2.5 is the best one. However, we found that the bandwidth consumption and the number of stale documents in that combination could be prohibitive. We therefore cannot decide to construct a specific cooperative caching system based on a limited number of decision criteria. This paper shows some trade-offs and possible alternatives for constructing a cooperative caching system using different combinations of document validation methods and distributed caching architectures.

INTRODUCTION

The idea behind Web caching is to bring documents close to clients at low cost. Caching has been applied successfully in computer memory hierarchies. Its advantage rests on the principle of reference locality, which states that recently used data are likely to be used again in the near future. Cooperating proxy caches are a group of caches that share cached objects and collaborate with each other to do the same work as a single Web cache,*

* The author is also supported by COSNET-CENIDET México with the project 2440.01-PR and CIC-IPN México.
but aiming to serve more users in a scalable and efficient way. A topic of particular debate in cooperative web caching is the design of cooperative caching architectures (cooperation mechanisms) aimed at obtaining better efficiency (hit rates and response times). At present there are many cache communication protocols [29], which show different levels of efficiency in a cooperative web caching system. Web caching reduces network load, server load, and response latency. However, it has the disadvantage that the pages caches return to clients may be stale, i.e. inconsistent with the version currently on the server. That is why the second important topic when discussing caches is that of document validation/replication mechanisms. At present there are many studies debating the design of cooperative web caching from these two perspectives: on the one hand, works comparing and designing cooperative web caching architectures that are efficient (better hit rates and response times) and scalable [7][6][15][9][3][2][1], and on the other hand, works comparing and designing document validation/replication mechanisms that keep some degree of consistency in the cached documents [12][4][14]. There are no works addressing the influence, or impact on caching efficiency, of combining a given cooperative web caching architecture with different document validation/replication mechanisms. We cannot say whether one cooperative web caching architecture is better than another without specifying which document validation mechanism is used. An analysis from this perspective gives us useful information for making decisions when constructing a cooperative web caching system.

Cooperative web caching

Three common approaches to implementing a large-scale cooperation scheme are hierarchical, distributed (mesh), and hybrid schemes. In hierarchical caching architectures, caches are located at different network levels.
In most cases it is assumed that lower levels in the hierarchy have better quality of service. Thus, at the lowest level of the hierarchy we find the client caches (L1), which are directly connected to the clients (i.e. the caches built into Netscape or MS Explorer). At the next level are the institutional caches (L2), which could be located at some nodes of a campus network. At the next level up we find the regional or national caches (L3), which could be connected to a national backbone linking several universities or institutions inside or outside the country. At present the number of levels commonly used in a hierarchy is on the order of 3 [8][1][11][13], including client, institutional, and regional caches. Some popular hierarchies that follow this structure are the National Laboratory for Applied Network Research (NLANR) in the USA [8], the Korean National Cache [11], and the Spanish Academic Network, Red Iris [10]. In hierarchical schemes, when a request cannot be satisfied by the client caches (L1), it is redirected to the institutional caches (L2), which in turn forward unsatisfied requests to the regional or national caches (L3); caches at this level contact the origin server directly. When the document is found, it travels down the hierarchy, leaving a copy at each intermediate cache. Further requests for the same document travel up the caching hierarchy until they find the document. The software most used in hierarchies is Squid [6], a descendant of the Harvest project [15]. In distributed caching architectures (especially mesh) there are no intermediate caches defined by levels; rather, there is a single level of caches which cooperate to serve the requests generated by clients.
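The hierarchical lookup described above (miss at one level forwards the request upward, and the reply leaves a copy at each intermediate cache) can be sketched as follows. This is a minimal illustrative model, not Squid's implementation; all class and variable names are hypothetical.

```python
class Cache:
    """One cache level in the hierarchy (e.g. L1 client, L2 institutional, L3 regional)."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # next level up, or None for the top-level cache
        self.store = {}        # url -> document

    def get(self, url, origin):
        if url in self.store:          # hit at this level
            return self.store[url]
        if self.parent is not None:    # miss: forward the request up the hierarchy
            doc = self.parent.get(url, origin)
        else:                          # top level contacts the origin server directly
            doc = origin[url]
        self.store[url] = doc          # leave a copy at each intermediate cache
        return doc

# A toy three-level hierarchy and an origin server modeled as a dict.
origin = {"/index.html": "<html>...</html>"}
l3 = Cache("regional")
l2 = Cache("institutional", parent=l3)
l1 = Cache("client", parent=l2)

doc = l1.get("/index.html", origin)  # travels up to the origin; copies are left on the way down
```

A second request for the same URL would now be a hit at L1 and never leave the client cache.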
Because there are no intermediate caches that store and centralize all the documents requested by lower-level caches, the institutional caches need another mechanism to share documents with each other. A popular mechanism is based on broadcast probes: caches query their siblings whenever they do not find the requested document in their own repositories. This is commonly done using ICP (Inter Cache Protocol) [23]. This mechanism can significantly increase bandwidth consumption and client-perceived latency because of the large number of queries it may generate. However, this would not be a limitation if the cache system is connected by a high-speed and reliable network. Other mechanisms used to discover shared documents in a distributed caching architecture are based on directories or summaries [2][5]. These have the advantage that caches need not send a great number of unnecessary queries to their siblings, which translates into bandwidth savings. A drawback is that directories/summaries only give a probability that an object is found at a sibling; they do not guarantee that the object is really stored in that cache [5]. As further alternatives for distributing and discovering documents in a distributed caching system, the proposals in [16][17] use hash functions to associate a request with a specific cache. With this approach no document is duplicated across caches in the distributed caching system, and caches do not have to exchange their contents. A limitation is that documents are less available, which means a very reliable network is needed. We define hybrid caching architectures as a combination of the two previous architectures: a hierarchy whose branches are formed by meshes. Performance analyses of several caching architectures can be found in [21][1][9][7].
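The hash-based routing idea from [16][17] can be sketched as below. This is a simplified illustration (a plain modular hash, not the exact function those proposals use); the function name is hypothetical.

```python
import hashlib

def cache_for(url, caches):
    """Map a URL to exactly one cache in the mesh, so no document is
    duplicated across caches and siblings never need to exchange content
    or broadcast ICP-style queries to locate it."""
    digest = hashlib.sha1(url.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(caches)
    return caches[index]

caches = ["cache-a", "cache-b", "cache-c"]
owner = cache_for("http://example.org/x", caches)
```

Every cache computes the same mapping independently, which is the source of both the bandwidth savings and the availability limitation: if the designated cache is unreachable, the document is unreachable through the mesh.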
Mechanisms of document validation/replication

The second subject of debate when implementing a document distribution architecture over cooperative caches is the mechanisms that distribute documents and maintain their consistency. There are two broad strategies to replicate and validate documents in a caching system: pulling and pushing. In pulling mechanisms, caches periodically poll the origin server (or an upper-level cache, according to the defined architecture). In pushing mechanisms, caches do not need to poll the servers periodically for document freshness; the servers distribute updated documents to interested or subscribed caches. Within these broad strategies we can find several mechanisms to replicate/validate documents, which we have grouped into 3 blocks: time to live (TTL), validate verification, and callback or invalidation. In the time-to-live (TTL) approach, servers add a time stamp to each document requested by caches; this time stamp defines the period of time during which the document can be considered valid. A typical example of the TTL approach is the Expires tag in HTTP/1.0 (TTLE). The validate verification mechanisms are commonly related to time-to-live mechanisms. A cache receives a document with a time stamp indicating the last time the document was modified. When a client asks for that document, the cache sends a validate verification message to the server, including the time stamp recorded in the document. The server compares the document's time stamp with its last modification time and sends back either a not-modified message (the document has not been updated) or the new document with a new time stamp. This approach is implemented in HTTP/1.0 using the If-Modified-Since (IMS) tag, and it can be combined with expiration or leasing approaches.
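The validate verification exchange described above can be sketched as follows. This is a simplified model of the server's side of an If-Modified-Since check, not real HTTP; the function name and tuple-based reply format are hypothetical.

```python
def validate(cached_time_stamp, server_last_modified):
    """Server side of an If-Modified-Since check (simplified): reply
    'not modified' if the cached copy is still current, otherwise send
    the new last-modified time stamp along with the new document."""
    if server_last_modified <= cached_time_stamp:
        return ("not_modified", None)       # cache keeps serving its copy
    return ("modified", server_last_modified)  # cache refreshes its copy

# Document unchanged since the cache stored it: a short not-modified reply.
reply_fresh = validate(100, 100)
# Document updated on the server after the cache stored it: full transfer.
reply_stale = validate(100, 150)
```

The not-modified reply is what makes IMS cheaper than refetching: only a small message crosses the network when the document has not changed.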
The IMS messages can be issued at several instants in time, depending on the degree of consistency we want. For example, if we want strong consistency, caches have to generate an IMS message every time a page is requested by a client; this defines a validation threshold of zero (TTL0). If we opt for a validation threshold greater than zero (TTLA), then every time a cache receives a client request the threshold has to be checked: if the current time minus the document's time stamp is greater than the threshold, an IMS message is sent to the server; otherwise the cache speculates that the document is still fresh and sends it to the client. In callback or invalidation mechanisms, each server keeps track of all caches that have requested a particular page and notifies them whenever that page changes. This approach does not scale well in the limit of many readers per page: both the state required to store the list of readers and the OS and network burden of contacting every reader of a page when it changes grow linearly with the number of readers. This scaling problem can be overcome by using multicast to transmit the invalidations (IMC), assigning a multicast group to each page and having clients join the groups associated with the pages they have accessed. A somewhat related idea (IMCP), pushing content (rather than sending invalidations) via multicast, is described in [19][20] (pushing approaches). Multicast solves the scaling problem at the server, but it creates another one at the routers, which are required to keep state for hundreds or thousands of addresses. Moreover, the rate at which clients join and leave multicast groups as they read and discard documents is likely to create an unscalable overhead on the routing infrastructure.
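The callback bookkeeping described above can be sketched as follows. This is a minimal illustration of the per-page reader state that grows with the number of caches; all class names are hypothetical, and the unicast notification loop is exactly what IMC would replace with a single multicast send per page group.

```python
from collections import defaultdict

class ProxyCache:
    """A cache that drops its copy of a page when the server invalidates it."""
    def __init__(self):
        self.store = {}

    def invalidate(self, page):
        self.store.pop(page, None)

class OriginServer:
    """Callback/invalidation: track which caches hold each page and notify
    all of them whenever the page changes."""
    def __init__(self):
        self.readers = defaultdict(set)  # page -> caches holding a copy

    def register(self, page, cache):
        self.readers[page].add(cache)    # state grows linearly with readers

    def update(self, page):
        # One unicast notification per reader; with IMC this loop would be
        # a single send to the multicast group assigned to the page.
        for cache in self.readers.pop(page, set()):
            cache.invalidate(page)

srv = OriginServer()
c1, c2 = ProxyCache(), ProxyCache()
c1.store["/a"] = c2.store["/a"] = "v1"
srv.register("/a", c1)
srv.register("/a", c2)
srv.update("/a")  # page changed: both caches discard their stale copies
```

After the update, neither cache will serve the stale version; the next client request forces a fresh fetch.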
This problem can be solved when the volume of the contract is all the documents in a cache instead of a single document group, as described in [4].

SITUATION

As we have seen in the previous sections, there is much research related to cooperative caching systems. An overlooked point we have found in those works is that most of them concentrate on a single caching design topic: attention is focused, on the one hand, on the construction of efficient cooperative caching architectures (especially intercache communication protocols), and on the other hand, on the creation of efficient document validation/replication mechanisms. Table 1 shows several caching research efforts; some present advantages and drawbacks of cooperative caching architectures, and others show advantages and drawbacks of certain document validation/replication mechanisms. This work focuses on verifying whether there is a considerable influence (on response time, bandwidth consumption, and document consistency) when we combine cooperative caching architectures with document validation/replication mechanisms.

DESCRIPTION OF THE COMPARISON PROCESS

Creating a system that compares different cooperative caching architectures combined with several document validation/replication methods is not trivial, and generating a mathematical model that takes into account all possible variables in such a process becomes intractable. For those reasons we have created a set of simulations covering different typical cooperative caching architectures combined with some well-known document validation/replication methods, producing several common scenarios.

Table 1. Several research efforts in cooperative caching

Archit. | Validation/replication mechanisms | Main research reference | Some results obtained
Hier. | PullingTTLE | [15] | Distribution of workload is better using hierarchies. Hierarchies with 3 levels perform better than hierarchies with more or fewer levels.
Hier. | PullingTTLA | [14] | With adaptive TTL we can get optimal bandwidth consumption and response time. They suggest the TTLA protocol (a protocol derived from TTL-Alex). It is better to use TTLE when the document expiration time is known.
Hier. | IMC | [12][18] | A good mechanism to guarantee consistency for frequently modified documents is invalidation.
Hier. | IMC, IMCP, pullingTTLN, pullingTTL0 | [4] | The IMC and IMCP algorithms show the best performance for documents with high frequencies of reads and writes.
Hier. & Dist. | Not specified | [1][7][9] | A distributed architecture performs better than a hierarchy.
Hier., Dist. & Hybrid | Not specified | [26] | The best performance is obtained using a hybrid architecture. Distributed caches have lower connection times than hierarchies. Distributed architectures use more bandwidth than hierarchical ones.
Dist. | Not specified | [25][17] | Better performance when clients access caches directly, without using hierarchies.
Dist. | Asynchronous and synchronous pushing | [20][3] | An additional proposal on client-initiated caching; better distribution of popular documents. Creating interest channels gives better document distribution.
Hier., Dist. & Hybrid | PullingTTL0, PullingTTLN, IMC, IMCP | This paper | See the results and conclusions sections.

The objective of the simulations is to evaluate the performance of each cooperative caching configuration when it works with a particular document validation method. We observe details such as client-perceived response time, bandwidth consumption, document staleness (how long a document has been inconsistent, and how many documents), and a speedup which shows the performance improvement compared to a configuration with
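The speedup metric mentioned above can be read as a ratio of response times. A minimal sketch, assuming (since the baseline definition is cut off in this excerpt) that the baseline is some reference configuration's mean response time; the function name is hypothetical.

```python
def speedup(baseline_response_time, config_response_time):
    """Speedup of a caching configuration relative to a baseline:
    how many times faster the configuration answers requests.
    Values like the 1.3-2.5 range in the abstract come from this ratio."""
    return baseline_response_time / config_response_time

# A configuration answering in 2.0 s where the baseline takes 5.0 s:
s = speedup(5.0, 2.0)
```

As the abstract stresses, a high speedup alone is not enough to pick a configuration: bandwidth consumption and document staleness must be weighed alongside it.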
SIMILAR ARTICLES
Building a Flexible Web Caching System
Web caching is a technology that has been demonstrated to improve traffic on the Internet. Finding out how to implement a Web caching architecture that assures improvements is not an easy task. The problem is more difficult when we are interested in deploying a distributed and cooperative Web caching system. We have found that some cooperative Web caching architectures could be unviable when changes...
A Study on Web Caching Architectures and Performance
As World Wide Web usage has grown dramatically in recent years, so has grown the recognition that Web caches, especially proxy caches, will have an important role in reducing server loads, client request latencies, and network traffic. In this survey we present the most common architectures for web caching and outline their most important characteristics. These architectures include proxy caching ...
Cooperative proxy caching for wireless base stations
This paper proposes a mobile cache model to facilitate the cooperative proxy caching in wireless base stations. This mobile cache model uses a network cache line to record the caching state information about a web document for effective data search and cache space management. Based on the proposed mobile cache model, a P2P cooperative proxy caching scheme is proposed to use a self-configured an...
Usage Patterns in Cooperative Caching
The amount of information requested over the World Wide Web has increased enormously during the past decade. Web caching helps to reduce service times, balance the load to origin servers, and brings content closer to the user. Since duplicating and distributing files amongst proxy caches has proved to be insufficient, cooperative caching aims to ameliorate the shortcomings of basic replication ...
Quantifying the Overall Impact of Caching and Replication in the Web
This paper discusses the benefits and drawbacks of caching and replication strategies in the WWW with respect to the Internet infrastructure. Bandwidth consumption, latency, and overall error rates are considered to be most important from a network point of view. The dependencies of these values with input parameters like degree of replication, document popularity, actual cache hit rates, and e...