
Applied Soft Computing

Volume 7, Issue 3, June 2007, Pages 890-898

Evolutionary design of en-route caching strategies

https://doi.org/10.1016/j.asoc.2006.04.003

Abstract

Nowadays, large distributed databases are commonplace. Client applications increasingly rely on accessing objects from multiple remote hosts. The Internet itself is a huge network of computers, sending documents point-to-point by routing packetized data over multiple intermediate relays. As hubs in the network become overutilized, slowdowns and timeouts can disrupt the process. It is thus worthwhile to think about ways to minimize these effects. Caching, i.e. storing replicas of previously seen objects for later reuse, has the potential to generate large bandwidth savings and, in turn, a significant decrease in response time. En-route caching is the concept that all nodes in a network are equipped with a cache and may opt to keep copies of some documents for future reuse [X. Tang, S.T. Chanson, Coordinated en-route web caching, IEEE Trans. Comput. 51 (6) (2002) 595–607]. The rules used for such decisions are called “caching strategies”. Designing such strategies is a challenging task, because the different nodes interact, resulting in a complex, dynamic system. In this paper, we use genetic programming to evolve good caching strategies, both for specific networks and for network classes. An important result is a new, innovative caching strategy that outperforms current state-of-the-art methods.

Introduction

The Internet is a distributed, heterogeneous network of computers. From a user's point of view, it can be regarded as a large database of data objects—“documents” that are available for retrieval via their uniform resource locator (URL). Access to files on the net is based on a client-server architecture: a client computer generates a request, opens up a connection to a server host, and retrieves the document from the server.

Applications based on Internet protocols, such as web servers and browsers, are usually unaware of the underlying transport layer. They see the net as an all-to-all, fully connected network where each host can talk directly to any other.

Underneath, there is a real network of computers connected to each other via physical links of various kinds (coaxial and fiber optics cables, satellite links and so on; see Fig. 1) that relay each message, passing it along until it reaches its destination. The average number of intermediate steps (hops) can be quite substantial depending on the topology of the network, the origin and destination, and the routing algorithm.

The coexistence of multiple superimposed paths creates the potential for bottlenecks and congestion. Internet users experience latency when there is an extended wait between the moment a document is requested and the moment it is received, and low perceived bandwidth when the transmission of the document is slow. To prevent unbounded queues from forming, relaying hosts usually implement a timeout feature, dropping documents whose transmission is delayed beyond a certain threshold, so some documents never arrive at their destination.

These problems can be addressed either by expanding the capacity and bandwidth of overloaded links to match peak-time demand, or by making more efficient use of the existing capacity. One such proposal is en-route caching, an approach to minimize network traffic by exploiting regularities in document request patterns [1], [2].

Popular documents on the net (portals, for example) are requested all the time, while others are almost never requested. Therefore, it makes sense to store copies (replicas) of popular documents at several places in the network. This phenomenon has prompted hosting companies (e.g. Akamai.com) to position hosts all over the world, creating forms of mirroring to save bandwidth by servicing requests from hosts that are closer, in Internet topology terms, to the clients making the requests. However, this solution works only for long-term data access patterns in which a commercial interest can be matched with monetary investments in distributed regions of the globe.

From the en-route caching perspective, observe that when two neighbors on the same street request the same web page, each of them creates a channel to the remote server hosting the document, even though both requesting computers are connected to the same trunk line. The same data is sent from the host twice and relayed by the same intermediate routers. It would make sense for any of the intermediate hosts to keep a copy of the document, allowing it to service the second request directly, without having to contact the remote host at all.

Proxy servers sometimes have caching capability and can optimize Internet access for a group of users in a closed environment, such as a corporate office or a campus network. However, much larger savings and better scalability are possible by applying this strategy at all levels: if the campus proxy fails to retrieve the page from its cache, or if the two requests come from neighboring university campuses in the same city, a node further down the chain still has the opportunity to use its cache memory, and the same opportunity exists in every single router on the Internet. The proposal of en-route web caching is to give every intermediate host the option to respond to a request by sending back a cached copy of a document that was previously requested.
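
To make the mechanism concrete, the sketch below shows how an en-route caching node could handle a request: the request walks hop by hop toward the origin server, the first node holding a copy answers, and each relay on the way back may decide to keep a replica. The function and parameter names are illustrative assumptions, not an actual protocol specification or the simulator used in this paper.

```python
# Illustrative sketch of en-route request handling (names and structures are
# assumptions for illustration, not an actual protocol or simulator API).

def handle_request(doc_id, path_to_origin, caches, caching_strategy):
    """Walk the route toward the origin server; the first node holding a copy
    answers. On the way back, every relay may decide, via its local caching
    strategy, whether to keep a replica (eviction is omitted in this sketch)."""
    # Forward phase: find the closest copy along the route.
    for hop, node in enumerate(path_to_origin):
        if doc_id in caches[node]:
            hit_node, hops_used = node, hop
            break
    else:
        # The origin server (last node on the path) always holds the document.
        hit_node, hops_used = path_to_origin[-1], len(path_to_origin) - 1

    # Return phase: each relay between requester and hit node decides locally.
    for node in reversed(path_to_origin[:hops_used]):
        if caching_strategy(node, doc_id, caches[node]):
            caches[node].add(doc_id)

    return hit_node, hops_used
```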

The difficult question is how to decide which documents to store, and where to store them. With finite memory, it is impossible for individual hosts to cache all the documents they see.

A global control policy, in which a centralized decision-making entity distributes replicas optimally among servers, is impractical for several reasons: the tremendous complexity of the problem, the fact that the Internet is changing dynamically all the time, and the absence of any global authority. Thus, each server has to decide independently which documents it wants to keep a replica of. The rules used for such decisions are also known as caching strategies.

Today, many caching routers, such as a campus network proxy, use the well-known least recently used (LRU) strategy: objects are prioritized by the last time they were requested. The document that has not been used for the longest time is the first to be deleted. Although this makes sense for an isolated router, it is easy to see why LRU is not an optimal policy for a network of caching hosts. In our example above, all the intermediate hosts between the two neighbors that requested the same document and the server at the end of the chain will each store a copy of the document, because a newly requested document has the highest priority in LRU. However, it would be more efficient if only one, or a few, but not all intermediate nodes kept a copy, leaving room for caching other highly requested documents. In isolation, a caching host tries to store all the documents with the highest priority. In a network, a caching host should try to cache only those documents that are not cached by its neighbors.
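
For reference, a minimal sketch of a single-router LRU cache is given below; this is a generic illustration of the strategy, not code from any particular router or from our simulator.

```python
from collections import OrderedDict

class LRUCache:
    """Least recently used cache: on overflow, evict the document whose last
    request lies furthest in the past."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()          # doc_id -> document, oldest first

    def get(self, doc_id):
        if doc_id not in self.store:
            return None                     # cache miss
        self.store.move_to_end(doc_id)      # mark as most recently used
        return self.store[doc_id]

    def put(self, doc_id, document):
        if doc_id in self.store:
            self.store.move_to_end(doc_id)
        self.store[doc_id] = document
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)  # evict the least recently used
```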

The possible economic benefits of en-route web caching are obvious: not only does it have the potential to remove congestion and thus reduce latency and save time for the end user, but it could also reduce bandwidth requirements and even the load on highly popular servers by absorbing some traffic and serving the requests locally. Finally, the distributed structure of en-route web caching could reduce the impact of link or server failures. As the Internet grows in size and importance and file sizes grow due to an increasing amount of multimedia content, these issues will become even more important.

However, designing good en-route caching strategies for a network is a non-trivial task, because it involves trying to create global efficiency by means of local rules. Furthermore, caching decisions at one node influence the optimal caching decisions of the other nodes in the network. The cache-similarity problem of the above example is one of symmetry breaking: when neighbors apply identical strategies based only on local information, they are likely to store the same documents in their caches. In this scenario, the network becomes saturated with replicas of the same few documents, with a consequent degradation of performance.

In this paper, we attempt to design good caching strategies by means of genetic programming (GP). As we will show, GP is able to evolve new innovative caching strategies, outperforming other state-of-the-art caching strategies on a variety of networks.

The paper is structured as follows: first, we cover related work in Section 2. Then, we describe our GP framework and the integrated network simulator in Section 3. Section 4 presents the results for a number of different scenarios. The paper concludes with a summary and some ideas for future work.

Section snippets

Related work

There is a huge amount of literature on caching at the CPU level (see e.g. [3]). Caching Internet documents is a special field of caching. For surveys, see Wang [4] and Davison [5], and, with a particular focus on cache replacement strategies, Podlipnig and Boszormenyi [6]. But even these are primarily concerned with caching on the receiving end, such as browser caches or proxy caching. As we have argued in the introduction, much higher benefits can be achieved by allowing documents to be cached

A genetic programming approach to the evolution of caching strategies

In this section, we will describe the different parts of our approach to evolve caching strategies suitable for networks. We will start with a more precise description of the assumed environment and the network simulator used, followed by the GP implementation.
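
As a rough illustration of the kind of representation a GP system manipulates, the sketch below encodes a caching strategy as an expression tree over locally observable document features and evaluates it to a priority value. The feature names ('age', 'size', 'distance') and the primitive set are assumptions chosen for illustration, not our actual terminal and function sets.

```python
import operator, random

# Hypothetical sketch of a GP-style strategy representation: a caching strategy
# is an expression tree over local document features, evaluated to a priority.
# Feature names and primitives below are assumptions for illustration.

PRIMITIVES = {'+': operator.add, '-': operator.sub, '*': operator.mul}
TERMINALS = ['age', 'size', 'distance']   # e.g. time since last request,
                                          # document size, hops to the origin

def random_tree(depth=3):
    """Grow a random expression tree, encoded as nested tuples."""
    if depth == 0 or random.random() < 0.3:
        return random.choice(TERMINALS)
    op = random.choice(list(PRIMITIVES))
    return (op, random_tree(depth - 1), random_tree(depth - 1))

def evaluate(tree, features):
    """Compute the priority this strategy assigns to one document."""
    if isinstance(tree, str):
        return features[tree]
    op, left, right = tree
    return PRIMITIVES[op](evaluate(left, features), evaluate(right, features))

# A node keeps the documents with the highest priority under the evolved
# expression; in a full GP run, fitness would come from a network simulation.
strategy = random_tree()
print(strategy, evaluate(strategy, {'age': 12.0, 'size': 3.5, 'distance': 4}))
```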

Linear networks

First, we tested our approach by evolving strategies for linear networks, for which we were able to determine the optimal placement of replicas by means of complete enumeration.

The first test involved a purely linear network with 10 nodes connected in a line (i.e. the first and last node have exactly one neighbor, all other nodes have exactly two neighbors). Every node has one original document, each document has the same size, and each node has excess memory to store exactly one additional
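
For small instances, the optimal placement can indeed be found by brute force. The sketch below enumerates all replica placements on a small linear network, assuming uniform request rates and measuring cost as the total hop distance to the nearest copy; a 5-node line replaces the 10-node setup only to keep the enumeration fast, and the cost model is an assumption for illustration.

```python
from itertools import product

# Brute-force enumeration baseline for a small linear network (illustration).
# Assumptions: uniform request rates; cost = total hop distance to the nearest
# copy; 5 nodes instead of 10 so that only 5**5 placements must be checked.

N = 5  # nodes 0..N-1 on a line; node d holds the original of document d

def total_cost(placement):
    """placement[n] = the one extra document cached at node n."""
    cost = 0
    for requester in range(N):
        for doc in range(N):
            holders = [doc] + [n for n in range(N) if placement[n] == doc]
            cost += min(abs(requester - h) for h in holders)
    return cost

best = min(product(range(N), repeat=N), key=total_cost)
print("optimal placement:", best, "cost:", total_cost(best))
```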

Conclusions

A key inefficiency of the Internet is its tendency to retransmit a single blob of data millions of times over identical trunk routes. Web caches are an attempt to reduce this waste by storing replicas of recently accessed documents at suitable locations. Caching reduces network traffic as well as experienced latency.

The challenge is to design caching strategies which, when applied locally in every network router, exhibit a good performance from a global point of view. One of the appeals of GP

References (28)

  • S. Jin et al., GreedyDual* web caching algorithm, Computer Communications (2001)
  • D.E. Goldberg et al., A comparative analysis of selection schemes used in genetic algorithms
  • T. Kamada et al., An algorithm for drawing general undirected graphs, Information Processing Letters (1989)
  • X. Tang et al., Coordinated en-route web caching, IEEE Transactions on Computers (2002)
  • K. Li, H. Shen, Optimal methods for object placement in en-route web caching for tree networks and autonomous systems, ...
  • A.J. Smith, Cache memories, ACM Computing Surveys (1982)
  • J. Wang, A survey of web caching schemes for the internet, ACM SIGCOMM Computer Communication Review (2001)
  • B.D. Davison, A web caching primer, IEEE Internet Computing (2001)
  • S. Podlipnig et al., A survey of web caching replacement strategies, ACM Computing Surveys (2003)
  • M. Karlsson, C. Karamanolis, M. Mahalingam, A framework for evaluating replica placement algorithms, Tech. Rep. ...
  • T. Loukopoulos, I. Ahmad, Static and adaptive data replication algorithms for fast information access in large ...
  • S. Sen, File placement over a network using simulated annealing, in: Symposium on Applied Computing, ACM, 1994, pp. ...
  • G. Pierre et al., Dynamically selecting optimal distribution strategies for web documents, IEEE Transactions on Computers (2002)
  • S. Williams, M. Abrams, C.R. Standridge, G. Abdulla, E.A. Fox, Removal policies in network caches for World-Wide Web ...