Demystifying Docker overlay networking

By Nigel Poulton | October 12, 2016

Docker overlay networking is insanely simple to configure. I mean insanely simple! But lurking beneath the simplicity of the setup are a bunch of moving parts that you really wanna understand if you’re gonna deploy this stuff in your prime-time production estate.

Anyway… last week I attended the Docker Open Systems Summit in Berlin and got the chance to hang out with some of the networking gods at Docker. My eyes were opened while at the same time my mind was blown! It was intense, but I learned a shed load!

So… I thought I’d write up what I learned and add it as a networking chapter in my book Docker for Sysadmins. What follows here is a major excerpt from that chapter.

 

Enjoy!


. . . . .

The rest of this chapter will be broken into two parts:

– Part 1: We’ll build and test a Docker overlay network in swarm mode.
– Part 2: We’ll explain the theory behind how it works.

Part 1: Build and test a Docker overlay network in swarm mode

For the following examples we’ll use two Docker hosts on two separate Layer 2 networks connected by a router, as shown below.

[Figure 8-1]

Each host is running Docker 1.12 or higher and a 4.4 Linux kernel (newer is always better).

Build a swarm

The first thing we’ll do is configure the two hosts into a two-node Swarm. We’ll run the `docker swarm init` command on node1 to make it a manager, and then we’ll run the `docker swarm join` command on node2 to make it a worker.

> Warning: If you are following along in your own lab you’ll need to swap the IP addresses, container IDs, tokens etc. with the correct values for your environment.

Run the following command on node1.

$ docker swarm init
Swarm initialized: current node (1ex3...o3px) is now a manager.

To add a worker to this swarm, run the following command:

docker swarm join \
 --token SWMTKN-1-0hz2ec...2vye \
 172.31.1.5:2377

Run the next command on node2.

$ docker swarm join \
> --token SWMTKN-1-0hz2ec...2vye \
> 172.31.1.5:2377
This node joined a swarm as a worker.

We now have a two-node Swarm where node1 is a manager and node2 is a worker.
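
If you want to double-check the swarm membership, `docker node ls` on the manager lists both nodes. The output below is illustrative; node IDs and hostnames will be different in your environment.

$ docker node ls
ID               HOSTNAME   STATUS   AVAILABILITY   MANAGER STATUS
1ex3...o3px *    node1      Ready    Active         Leader
9f1x...2kdm      node2      Ready    Active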

Create a new overlay network

Now let’s create a new overlay network called uber-net.

Run the following command from node1.

$ docker network create -d overlay uber-net
c740ydi1lm89khn5kd52skrd9

That’s it! You’ve just created a brand new overlay network that is available to all hosts in the swarm and has its control plane encrypted with TLS!

You can list all networks on each node with the `docker network ls` command.

$ docker network ls
NETWORK ID    NAME              DRIVER   SCOPE
ddac4ff813b7  bridge            bridge   local
389a7e7e8607  docker_gwbridge   bridge   local
a09f7e6b2ac6  host              host     local
ehw16ycy980s  ingress           overlay  swarm
2b26c11d3469  none              null     local
c740ydi1lm89  uber-net          overlay  swarm

The network we created, uber-net, is at the bottom of the list.
The other networks were created automatically when Docker was installed and when we initialized the swarm. We’re only interested in the uber-net overlay network.

If you run the `docker network ls` command on node2 you’ll notice that it can’t see the uber-net network. This is because new overlay networks are only made available to worker nodes that have containers using the overlay. This reduces the scope of the network gossip protocol and helps with scalability.
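
For example, if you run it on node2 at this point, the only overlay in the list is the built-in ingress network and uber-net is nowhere to be seen (IDs will differ in your environment).

$ docker network ls
NETWORK ID    NAME              DRIVER   SCOPE
<Snip>
ehw16ycy980s  ingress           overlay  swarm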

Attach a service to the overlay network

Let’s create a new Docker service and attach it to the uber-net overlay network. We’ll create the service with two replicas (containers) so that one runs on node1 and the other runs on node2. This will automatically extend the uber-net overlay to node2.

Run the following commands from node1.

$ docker service create --name test \
--network uber-net \
--replicas 2 \
ubuntu sleep infinity

The command creates a new service called test, attaches it to the uber-net overlay network, and creates two containers (replicas) running the `sleep infinity` command, which makes sure the containers don’t immediately exit.

Because we’re running two containers (replicas) and the Swarm has two nodes, one container will run on each node.

Verify the operation with a `docker service ps` command.

 $ docker service ps test
 ID          NAME    IMAGE   NODE   DESIRED STATE   CURRENT STATE
 77q...rkx   test.1  ubuntu  node1  Running         Running
 97v...pa5   test.2  ubuntu  node2  Running         Running

When Swarm starts a container on an overlay network it automatically extends that network to the node the container is running on. This means that the uber-net network is now visible on node2.
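
Run `docker network ls` on node2 again and uber-net now shows up (the swarm-scope network ID is the same on every node).

$ docker network ls
NETWORK ID    NAME              DRIVER   SCOPE
<Snip>
ehw16ycy980s  ingress           overlay  swarm
c740ydi1lm89  uber-net          overlay  swarm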

Congratulations! You’ve created a new overlay network spanning two nodes on separate physical underlay networks, and you’ve scheduled two containers to use the network. How simple was that!

[Figure 8-2]

Test the overlay network

Now let’s test the overlay network with the ping command.

In order to do this, we need to do a bit of digging around to get each container’s IP address.

From node1 run a `docker network inspect` to see the Subnet assigned to the overlay.

 $ docker network inspect uber-net
 [
   {
     "Name": "uber-net",
     "Id": "c740ydi1lm89khn5kd52skrd9",
     "Scope": "swarm",
     "Driver": "overlay",
     "EnableIPv6": false,
     "IPAM": {
       "Driver": "default",
       "Options": null,
       "Config": [
         {
           "Subnet": "10.0.0.0/24",
           "Gateway": "10.0.0.1"
         }
 <Snip>

The output above shows that uber-net’s subnet is `10.0.0.0/24`. Note that this does not match either of the physical underlay networks (`172.31.1.0/24` and `192.168.1.0/24`).

Run the following two commands on node1 and node2 to get the container IDs and their IP addresses.

 $ docker ps
 CONTAINER ID  IMAGE           COMMAND            CREATED       STATUS
 396c8b142a85  ubuntu:latest   "sleep infinity"   2 hours ago   Up 2 hrs
 $
 $ docker inspect --format='{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}' 396c8b142a85
 10.0.0.3

Make sure you run these commands on both nodes to get the IP addresses of both containers.
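
If a container is attached to more than one network, the format string above prints all of its addresses concatenated together. The variation below (using the same container ID as before) pulls out just the uber-net address.

$ docker inspect --format='{{with index .NetworkSettings.Networks "uber-net"}}{{.IPAddress}}{{end}}' 396c8b142a85
10.0.0.3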

The diagram below shows the configuration so far.

[Figure 8-3]

As we can see, there is a Layer 2 overlay network spanning both hosts, and each container has an IP address on this overlay network. This means that the container on node1 will be able to ping the container on node2 using its `10.0.0.4` address from the overlay network. This works despite the fact that the two nodes are on separate Layer 2 underlay networks. Let’s prove it.

Log on to the container on node1 and install the `ping` utility and then ping the container on node2 using its `10.0.0.4` IP address.

If you’re following along, the container ID used below will be different in your environment.

 $ docker exec -it 396c8b142a85 bash
 root@396c8b142a85:/#
 root@396c8b142a85:/#
 root@396c8b142a85:/# apt-get update
 <Snip>
 root@396c8b142a85:/#
 root@396c8b142a85:/#
 root@396c8b142a85:/# apt-get install iputils-ping
 Reading package lists... Done
 Building dependency tree
 Reading state information... Done
 <Snip>
 Setting up iputils-ping (3:20121221-5ubuntu2) ...
 Processing triggers for libc-bin (2.23-0ubuntu3) ...
 root@396c8b142a85:/#
 root@396c8b142a85:/#
 root@396c8b142a85:/# ping 10.0.0.4
 PING 10.0.0.4 (10.0.0.4) 56(84) bytes of data.
 64 bytes from 10.0.0.4: icmp_seq=1 ttl=64 time=1.06 ms
 64 bytes from 10.0.0.4: icmp_seq=2 ttl=64 time=1.07 ms
 64 bytes from 10.0.0.4: icmp_seq=3 ttl=64 time=1.03 ms
 64 bytes from 10.0.0.4: icmp_seq=4 ttl=64 time=1.26 ms
 ^C
 root@396c8b142a85:/#

As shown above, the container on node1 can ping the container on node2 using the overlay network.

If you install `traceroute` on the container, and trace the route to the remote container, you’ll see only a single hop (see below). This proves that the containers are talking directly over the overlay network and are blissfully unaware of any underlay networks being traversed.
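
The base ubuntu image doesn’t include `traceroute`, so if you’re following along you’ll need to install it inside the container first (the package name shown is the Ubuntu one).

 root@396c8b142a85:/# apt-get install traceroute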

 root@396c8b142a85:/# traceroute 10.0.0.4
 traceroute to 10.0.0.4 (10.0.0.4), 30 hops max, 60 byte packets
 1 test-svc.2.97v...a5.uber-net (10.0.0.4) 1.110ms 1.034ms 1.073ms

So far we’ve created an overlay network with a single command. We then added containers to the overlay network on two hosts on two different Layer 2 networks. Once we worked out the containers’ IP addresses, we proved that they could talk directly over the overlay network.

Part 2: The theory of how it all works

Now that we’ve seen how to build and use a container overlay network, let’s find out how it’s all put together behind the scenes.

VXLAN primer

First and foremost, Docker overlay networking uses VXLAN tunnels as the underlying technology for creating virtual Layer 2 overlay networks. So before we go any further, let’s do a quick primer on VXLAN technology.

At the highest level, VXLANs let you create a virtual Layer 2 network on top of an existing Layer 3 infrastructure. The example we used earlier created a new 10.0.0.0/24 network on top of a Layer 3 IP network comprising two Layer 2 networks – 172.31.1.0/24 and 192.168.1.0/24. This is shown below.

[Figure 8-4]

The beauty of VXLAN is that existing routers and network infrastructure just see the VXLAN traffic as regular IP/UDP packets and handle them without issue.

To create the virtual Layer 2 overlay network a VXLAN tunnel is created through the underlying Layer 3 IP infrastructure. You might hear the term underlay network used to refer to the underlying Layer 3 infrastructure.

Each end of the VXLAN tunnel is terminated by a VXLAN Tunnel Endpoint (VTEP). It’s this VTEP that performs the encapsulation/de-encapsulation and other magic required to make all of this work. See below.

[Figure 8-5]

Walk through our two-container example

In the example we built earlier, we had two hosts connected via an IP network. Each host ran a single container, and we created a single VXLAN overlay network for the containers to use.

To accomplish this, a new network namespace was created on each host. A network namespace is like a container, but instead of running an application it runs an isolated network stack – one that’s sandboxed from the network stack on the host itself.

A virtual switch (a.k.a. virtual bridge) called Br0 is created inside the network namespace. A VTEP is also created, with one end plumbed into the Br0 virtual switch and the other end plumbed into the host network stack. The end in the host network stack gets an IP address on the underlay network the host is connected to and is bound to a UDP socket on port 4789. The VTEPs on the two hosts create the overlay via a VXLAN tunnel as shown below.

[Figure 8-6]

This is essentially the VXLAN overlay network created and ready for use.
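
If you’re curious, you can poke around inside this network namespace from the host. The commands below are a rough sketch rather than exact output – the namespace name (typically `1-` followed by the start of the network ID), the interface names, and the VNI will differ on your system, and you’ll need the iproute2 tools installed on the host.

$ sudo ls /var/run/docker/netns
1-c740ydi1lm  <Snip>
$ sudo nsenter --net=/var/run/docker/netns/1-c740ydi1lm ip -d link show
<Snip>
2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
3: vxlan1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 ...
    vxlan id <VNI> ... dstport 4789 ...

Note that Docker keeps these namespaces under /var/run/docker/netns rather than /var/run/netns, which is why a plain `ip netns list` doesn’t show them.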

Each container then gets its own virtual Ethernet (veth) adapter that is also plumbed into the local Br0 virtual switch. The topology now looks like the image below, and it should be getting easier to see how the two containers can communicate over the VXLAN overlay network despite their hosts being on two separate networks.

[Figure 8-7]

Communication example

Now that we’ve seen the main plumbing elements let’s see how the two containers communicate.

For this example, we’ll call the container on node1 “C1” and the container on node2 “C2”. And let’s assume C1 wants to ping C2, like we did in the practical example earlier in the chapter.

[Figure 8-8]

Container C1 creates the ping request and sets the destination IP address to the `10.0.0.4` address of C2. It sends the traffic over its veth interface, which is connected to the Br0 virtual switch. The virtual switch doesn’t know where to send the packet, as it doesn’t have an entry in its MAC address table (ARP table) that corresponds to the destination IP address, so it floods the packet to all ports. The VTEP interface connected to Br0 knows how to forward the frame, so it responds with its own MAC address. This is a proxy ARP reply, and it results in the Br0 switch learning how to forward the packet: it updates its ARP table, mapping 10.0.0.4 to the MAC address of the VTEP.

Now that the Br0 switch has learned how to forward traffic to C2, all future packets for C2 will be transmitted directly to the VTEP interface. The VTEP interface knows about C2 because all newly started containers have their network details propagated to the other nodes in the swarm using the network’s built-in gossip protocol.
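
One place you can see the result of this gossip is the forwarding database of the Br0 switch inside the overlay namespace. Again, this is a hedged sketch – the namespace and interface names, the MAC address, and node2’s underlay IP are placeholders for whatever exists in your environment.

$ sudo nsenter --net=/var/run/docker/netns/1-c740ydi1lm bridge fdb show br br0
<Snip>
02:42:0a:00:00:04 dev vxlan1 dst <node2-underlay-IP> self permanent

Entries like this are programmed by the Docker daemon from the gossip data, which is how the VTEP knows which underlay address to tunnel a given container MAC to.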

The switch then sends the packet to the VTEP interface, which encapsulates the frames so they can be sent over the underlay transport infrastructure. At a fairly high level, this encapsulation includes adding a VXLAN header to the Ethernet frame. The VXLAN header contains the VXLAN network ID (VNID), which is used to map frames from VLANs to VXLANs and vice versa. Each VLAN gets mapped to a VNID so that on the receiving end the packet can be de-encapsulated and forwarded to the correct VLAN. This maintains network isolation. The encapsulation also wraps the frame in an IP/UDP packet with the IP address of the VTEP on node2 in the destination IP field, plus the UDP port 4789 socket information. This encapsulation allows the data to be sent across the underlying networks without those networks having to know anything about VXLAN.

When the packet arrives at node2, the kernel sees that it’s addressed to UDP port 4789. The kernel also knows that it has a VTEP interface bound to that socket. As a result, it sends the packet to the VTEP, which reads the VNID, de-encapsulates the packet, and sends it on to its own local Br0 switch on the VLAN that corresponds to the VNID. From there it is delivered to container C2.
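
If you want to see this encapsulation on the wire, run a packet capture on the underlay interface of either node while the ping is running. The sketch below assumes the underlay NIC is eth0; the source port, VNI and addresses will obviously be different in your environment.

$ sudo tcpdump -nni eth0 udp port 4789
<Snip>
IP 172.31.1.5.51704 > <node2-underlay-IP>.4789: VXLAN, flags [I] (0x08), vni <VNI>
IP 10.0.0.3 > 10.0.0.4: ICMP echo request, id 1, seq 1, length 64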

That’s the basics of how VXLAN technology is leveraged by native Docker overlay networks.

We’re only scratching the surface here, but it should be enough for you to start the ball rolling with any potential production Docker deployments. It should also give you the knowledge required to talk to your networking team about the networking aspects of your Docker infrastructure.

One final thing to mention about Docker overlay networks is that Docker also supports Layer 3 routing within the same overlay network. For example, you can create an overlay network with two subnets, and Docker will take care of routing between them. The command to create a network like this could be `docker network create --subnet=10.1.1.0/24 --subnet=11.1.1.0/24 -d overlay prod-net`. This would result in two virtual switches, Br0 and Br1, being created inside the network namespace, and routing between them happens by default.
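
A minimal sketch of what that looks like, using the command from above (the inspect output is illustrative and the gateway addresses are whatever IPAM assigns):

$ docker network create -d overlay --subnet=10.1.1.0/24 --subnet=11.1.1.0/24 prod-net
$ docker network inspect prod-net
<Snip>
      "Config": [
        { "Subnet": "10.1.1.0/24", "Gateway": "10.1.1.1" },
        { "Subnet": "11.1.1.0/24", "Gateway": "11.1.1.1" }
      ]
<Snip>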


This has been an excerpt from chapter 9 of my book “Docker for Sysadmins”, available from the Kindle store or Leanpub. NOTE: Existing Kindle owners may have to wait a day or two to get the updates. New purchasers will get the new version immediately. Thanks for reading my book! Feel free to check out my extensive range of Docker training videos at Pluralsight.

Docker for Sysadmins: Linux Windows VMware


28 thoughts on “Demystifying Docker overlay networking”

  1. suneel mallela

    Nice article and good explanation, thank you very much!!

  2. Gabriel Czernikier

    I understood it quite well, nevertheless this chapter deserves a second read to make the concepts stick. Worth mentioning: I’m a blissful unknower in Docker’s land, just as I am with networking concepts. Hence, double merit goes to you for teaching a casual reader, old-school programmer.

    BTW, noticed a typo of placing 2 articles up one next the other, where only 1 can fit. These are ‘the’ and ‘a’ articles, found in this sentence: ‘To accomplish this the a new network namespace was created on each host.’.

  3. M Roberts

    Hi, liked the article. You mention VLANs being used without explaining them first.

  4. Peter

    I guess you are pointing out new features of Docker 1.12.2? Previous versions of Docker with Swarm usage indicate setting up a variety of services like Discovery, Manager(s) and Workers all running the appropriate containers providing those service features. This looks pretty slick, but can you explain what is going on beneath the covers? Is there a Discovery Service running in the background like consul? If so, what is it? And without running containers to provide these services (at least there are none showing), how do you see logs, and how do you stop the swarm? What is the opposite of “docker swarm init”? How do I stop a node from attempting to join a swarm? What is the opposing command to “docker swarm join” to stop attempting to join a swarm?

  5. Nigel Poulton Post author

    @Peter.

    As of 1.12 in swarm mode there has been no requirement for an external K/V store for either networking or swarm config. This is now handled by a distributed K/V store that is automatically created when you run `docker swarm init`. If I remember correctly it’s an etcd store behind the scenes. So yes, behind the scenes there is an etcd store automatically distributed across the swarm.

    The opposite of `docker swarm init` is `docker swarm leave`. But a node will not attempt to automatically join a swarm; you have to explicitly join it with the `docker swarm join` command and include the join token for a manager or a worker.

    HTH

  6. Javier Ramírez Urea

    Overlay networks don’t work on Windows yet.
    Thanks for sharing,
    Javier R.

  7. Øyvind Bakksjø

    Nice article. One note though. As layer 2 is not IP, is it meaningful to say “a new 10.0.0.0/24 Layer 2 network”?

  8. Vijay

    I am glad to have read a detailed post on docker overlays, however I wanted to raise these questions:
    1) Can we create a docker network with three underlay networks(subnets) ?
    2) How can we decide to dynamically add containers to existing overlay networks?
    3) Do we have the choice to alter the range of subnets once it is created already ?

  9. Nigel Poulton Post author

    Hi Vijay.

    The number of underlying subnets shouldn’t matter so long as there are routes between them. As far as VXLAN is concerned the underlying network(s) is just a transport infrastructure.

    Adding a container to an existing Docker overlay is as simple as using the `--network` flag with `docker run`. The same goes for adding a service: just pass the `--network` flag to the `docker service` command. Obviously all containers in a service need to be on the same overlay.

    I’m not sure I understand your last question.

  10. Velimir

    Hi Mr. Poulton,

    I must say that you’re a walking encyclopedia of Docker knowledge!
    While trying to learn Docker, I always end up on your pages …

    Now, a quick intro and 2 question!

    Swarm configured, service running. I have 2 SWARM-scoped networks connected to service.

    Why some swarm-scoped networks don’t appear on workers?
    Is there a way to remove a network from a service ?

    Thank you!

  11. Nigel Poulton Post author

    Hi Velimir.

    My guess about the swarm-scoped networks not appearing on all workers would be that those workers don’t have tasks running that use that network. Overlay networks only become visible on workers if the worker is running a task (container) that uses that network. This helps keep network related gossiping to a minimum and helps with scalability.

    I’m not 100% certain I understand your second question. Can you provide more detail?

    Cheers.

  12. Velimir

    Hi Mr. Poulton,

    Thank you for quick response.

    An update to first question:
    Yes, that makes a lot of sense. I noticed it because I had a swarm containing 3 manager and 3 worker nodes. I created a service with 2 replicas and published ports, but not all nodes were responding to the requests I made (a simple web page). Sometimes, even when a worker node was responding, I didn’t see the network that I created. Later I figured out that one of the manager nodes had a firewall up and running.
    After configuring the firewall, everything worked. So thank you for answering the first question and helping me understand the story behind it.

    Question 2:
    I ended writing so much stuff, that I gave up on it, don’t want to waste your time and blog space … I will try to better understand networking, containers and services, maybe I’ll get a clearer picture then .

    Thank you a lot on your time, you were more than helpful, as always!

    Cheers 🙂 !

  13. Gene

    Hi Nigel,

    Nice article on overlay network for docker swarm mode.
    I’ve tried using overlay network on our docker cluster. We’re using 40Gb network connecting each host.

    Tried using networkstatic/iperf3 to test network performance between containers in different host, we only get at most 3Gb/s speed between containers.

    Do you have any idea on this problem?

  14. 6hopsaway

    Gene,
    I work on a lot of HPC environments and there are specific issues regarding bandwidth when consuming 10gb/s interfaces and up. As the Linux kernel is a schedule-based OS, the only way to get around this performance problem is by buying network interfaces for your physical hosts that bypass the kernel for processing. I believe Intel is the only company on the market that currently has one to address this issue. Follow this link for details.

    http://www.intel.com/content/dam/www/public/us/en/documents/case-studies/ethernet-converged-mcorelab-high-throughput-study.pdf

  15. Nigel Poulton Post author

    Hi Gene.

    I’ve asked around and got the following which may be of help…

    This kind of limit can come from the overhead of the Linux bridge (a bridge is used by overlay inside of the network namespace). It involves MAC table lookups etc and can limit you to the performance of a single core somewhere around 3-4Gb/s.

    I *think* you might get better performance between containers on the same host but am not sure.

    If you’re able to test this, I’d love to know the results.

    Thanks again for getting involved.

  16. Phyllis

    Great explanation of how overlay networks work, thank you. I’m a real noob and have a 30,000 ft question: when are overlay networks needed? Is it for communication between DTR, UCP controllers and service containers? Or something else?

    Thanks,
    Phyllis

  17. Harold Naparst

    This was incredibly useful. Docker suffers from a lack of documentation, which is made worse by the rapid changes. Rapid changes and very sparse docs do not make Docker fun to use.

    Networking is the hardest part to figure out, and the only part of your book that would be interesting to me. I would suggest that you write an entire book on Docker networking, including how to use them with compose files. This was a good start, but there is so much more to say. You could also include a lengthy discussion of compose 1, 2, and 2.1.

  18. Nigel Poulton Post author

    Sorry for the slow response Phyllis. Overlays in the Docker world are all about containers on different Docker hosts being able to communicate easily. They’re integral to Docker services which are naturally spread over many Docker hosts. HTH.


  20. Sachin

    Hi Nigel,

    Excellent article and awesome Pluralsight classes.

    If I have two different services running on two different overlay networks, would the containers in those two services be able to talk to each other, even though they would be on different subnets, without having to change any routes inside the containers?

  21. Nigel Poulton Post author

    Hi Sachin. Sorry for the sloooooooow response. Apparently I’ve stopped getting emails about new comments.

    For two containers/services on separate overlay networks to talk to each other there will need to be routes configured between them. This is the same as with traditional networks where two nodes on separate networks will not be able to communicate without routes being configured between the two networks.

    HTH

  22. Michael

    Hi Nigel

    I’ve a few questions on overlay networks in regards to security.

    You say that the overlay network has its control plane protected by TLS. What exactly does this mean? I’ve enabled an overlay with the secure option, which as far as I can tell encrypts all traffic flowing on the overlay, though any detail on this is glaringly sparse on the Docker site, so I’m not totally confident in what it’s doing.

    Would I still need to use TLS separately for all services in my swarm?

    Do you know if it’s possible to use client certificates as well as TLS for service to service interaction. I am using Traefik as a reverse proxy within my swarm.

    Many thanks.

  23. Nigel Poulton Post author

    Hi Michael.

    Encryption of the control plane using TLS means that all control plane related networking traffic (so the gossiping about network state, config, etc.) is all encrypted by default.

    If you then choose to encrypt the network then all data plane traffic (service/app related traffic) will also be encrypted.

    Re TLS and certs for swarm services etc. When you create a new swarm with `docker swarm init` you get a CA created for the Swarm by default. If you wanna use an existing one then you use the `--external-ca` flag.

    Each node in the Swarm (manager and worker) gets a client certificate. You can see it with `openssl x509 -in /var/lib/docker/swarm/certificates/swarm-node.crt -text`. O=Swarm ID, OU=role, CN=node ID. These certs are used to secure Swarm-related communications, e.g. authenticating nodes etc.

    HTH.


  25. b

    Please remove the space on the second line of the following code (between “–” and “network”)

    docker service create –name test \
    — network uber-net \
    –replicas 2 \
    ubuntu sleep infinity
