
Friday, October 12, 2018

Anatomy Of Containers


We all know how containers add value for developers: they isolate applications from one another and run virtually anywhere, from public clouds to on-premise data centers. Most of us start creating Docker containers right away and play with them, but it is very important to understand what actually makes a container.

With many organizations migrating their applications to containers, it is becoming more important for programmers and administrators to know the concepts behind them. These core concepts are common to every container technology available today.

This article, “Anatomy of Containers”, walks through container internals so that everyone understands what makes a container, which also gives a good sense of where to look when issues arise.

First things first: containers are not new. Anyone with an understanding of a few Linux facilities can build one by hand; tools like docker and rkt are essentially wrappers around those facilities. The core concepts below give a clear picture of how containers are created and how they work. Let's start digging in.

File System
We all know that an image is the starting point for creating a container: we take an image and start a container from it. Basically, an image is nothing but a plain file system with different types of files and executables.

Let's download an image from a registry (using docker, or we can build our own):
[root@rkt-machine ~]# docker images
REPOSITORY           TAG                 IMAGE ID                 CREATED      SIZE
docker.io/python      latest              7a35f2e8feff           34 hours ago  922.4 MB

Save the image to a tar file:
[root@rkt-machine ~]# docker save docker.io/python > centos.tar

Check the tar file -
[root@rkt-machine ~]# ls
anaconda-ks.cfg  centos.tar  original-ks.cfg

Extract the tar file -
[root@rkt-machine ~]# tar -xvf centos.tar
5182e96772bf11f4b912658e265dfe0db8bd314475443b6434ea708784192892.json
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/VERSION
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/json
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/layer.tar
manifest.json
repositories

Now there is a layer.tar file inside the d1ed0… directory. Create a rootfs directory and extract the layer into it:
[root@rkt-machine ~]# mkdir rootfs
[root@rkt-machine ~]# tar -xf d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/layer.tar -C ./rootfs

If you check the rootfs directory, it looks an awful lot like a Linux system. There are bin, etc, and many of the other locations that we normally see on a Linux machine.

[root@rkt-machine rootfs]# ll
-rw-r--r--.    1 root root 12005 Aug  4 22:05 anaconda-post.log
lrwxrwxrwx.  1 root root     7 Aug  4 22:04 bin -> usr/bin
drwxr-xr-x.  2 root root     6 Aug  4 22:04 dev
drwxr-xr-x. 47 root root  4096 Aug  4 22:05 etc
drwxr-xr-x.  2 root root     6 Apr 11 04:59 home
lrwxrwxrwx.  1 root root     7 Aug  4 22:04 lib -> usr/lib
lrwxrwxrwx.  1 root root     9 Aug  4 22:04 lib64 -> usr/lib64
drwxr-xr-x.  2 root root     6 Apr 11 04:59 media
drwxr-xr-x.  2 root root     6 Apr 11 04:59 mnt
drwxr-xr-x.  2 root root     6 Apr 11 04:59 opt
drwxr-xr-x.  2 root root     6 Aug  4 22:04 proc
dr-xr-x---.  2 root root   114 Aug  4 22:05 root
drwxr-xr-x. 10 root root   130 Aug  4 22:05 run
lrwxrwxrwx.  1 root root     8 Aug  4 22:04 sbin -> usr/sbin
drwxr-xr-x.  2 root root     6 Apr 11 04:59 srv
drwxr-xr-x.  2 root root     6 Aug  4 22:04 sys
drwxrwxrwt.  7 root root   132 Aug  4 22:05 tmp
drwxr-xr-x. 13 root root   155 Aug  4 22:04 usr
drwxr-xr-x. 18 root root   238 Aug  4 22:04 var

We can also see that the sh and bash shells are available inside this rootfs:
[root@rkt-machine ~]# ls rootfs/bin/sh
rootfs/bin/sh

[root@rkt-machine ~]# ls rootfs/bin/bash
rootfs/bin/bash

The file system of a container image is the same as the one we see in a normal Linux OS. It includes all the libraries necessary to run, minus the kernel. The kernel is the lowest level of software that interfaces with the hardware of the computer.

One important thing to remember is that no container we start has a kernel of its own, and a kernel is necessary to communicate with the hardware. For this, containers rely on the host kernel: a container image carries all the files necessary to run, and whenever a system call is needed, the container uses the host kernel.
Sharing the host kernel is also why container image sizes have come down, and it is why different flavors of Linux containers run on a single OS: we can run Ubuntu, busybox and Debian containers on a CentOS machine, and since the kernel underneath all these distributions is the same, they work seamlessly. It is also why Linux containers cannot run natively on Windows (the tools that appear to do this run a Linux VM underneath), because Windows does not share a Linux kernel.
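One quick way to convince yourself of this (a small check of my own, not part of the walkthrough): print the kernel release on the host and then inside any container or chroot running on it; both report the same value, because there is only one kernel.

import platform

# The kernel release reported here is identical on the host and inside
# any container running on that host: there is only one (shared) kernel.
print(platform.release())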

Now that we have seen that a container image's file system is just a Linux file system, how can I restrict the processes running inside the container? Since the container uses the host kernel, will a process running inside the container have access to the rest of the host's file system? This is where chroot comes into the picture.

Chroot
In a *nix-based OS, the root directory (/) is the top of the directory tree. The root file system sits on the disk partition where the root directory is located, and all other file systems mount onto this root file system.

We all know that the root process with PID 1 is the first thing that starts when a Linux machine boots. All other processes and jobs are children of this process, and by default every process in Linux can access any file or directory in the file system.

What if we want to restrict the file system view of a running process? What if I want to limit the file system access of a process that I am running or debugging? Linux provides an option for this, called “chroot”.

By default, PID 1 treats the root file system as its working root directory, and every process descended from it inherits that view: those processes can see all the files in every directory of the root file system.

Chroot is an operation that changes the apparent root directory for the current process and all of its children. Whatever process we start with chroot treats the given directory as its root directory (/). This essentially restricts the process's view of the file system: the process can only access files that live under the chroot location.
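For intuition, here is roughly what the chroot command does, sketched in Python (assuming the rootfs directory we extracted earlier; this needs to run as root):

import os

os.chroot("rootfs")   # make ./rootfs the apparent root for this process
os.chdir("/")         # step into the new root so relative lookups stay inside it
# Everything exec'd from here on sees rootfs as "/" and cannot
# resolve paths outside of it.
os.execv("/bin/bash", ["/bin/bash"])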

To try chroot out, we will use the same rootfs extracted from the container image above. Go to the directory containing rootfs and run the chroot command as below:
[root@manja17-I18060 testing]# chroot rootfs /bin/bash
root@testing-machine-name:/# which python
/usr/local/bin/python
root@testing-machine-name:/# /usr/local/bin/python -V
Python 3.7.0 (default, Sep  5 2018, 03:25:31)
[GCC 6.3.0 20170516]
root@testing-machine-name:/# exit
exit

[root@manja17-I18060 testing]# which python
/usr/bin/python
[root@manja17-I18060 testing]# /usr/bin/python -V
Python 2.7.5

In the output above, the chroot command is invoked on the rootfs file system that we got from the docker image. The python inside it is version 3.7.0, which ships in the rootfs file system and depends on the rootfs libraries. When we exit the chroot and run the python available on the host, we see a different version.

Similarly, from inside the chroot, if we try to list the files in /root we see nothing, whereas from the host machine we see a couple of files:
root@testing-machine-name:~# cd /root/

root@testing-machine-name:~# /bin/ls /root/
root@testing-machine-name:~# exit
exit


[root@manja17-I18060 testing]# cd /root/

[root@manja17-I18060 ~]# ls
anaconda-ks.cfg     apache-maven-3.5.4-bin.tar.gz  
one-context_4.14.4.rpm                    terraform


By using chroot on a directory, we restrict the processes started inside it to that directory. Such processes have access to the files available inside the chroot and cannot access anything outside of it.

How does chroot help containers?
Chroot lets a container start with its own file system, provided by the image. This is what makes a process running inside a container see only the file system from the container image: chroot prevents it from accessing any file outside the container's file system, i.e. on the host file system.

We have seen how chroot restricts file system access, but it cannot hide everything on the host from the chrooted system. For instance, let's start a process on the host machine:

[root@manja17-I18060 testing] # sleep 100 &
[1] 8634

Now let's chroot in and see whether that process is visible:
[root@manja17-I18060 testing] # chroot rootfs /bin/bash
root@testing-machine-name:/# whereis grep
grep: /bin/grep /usr/share/man/man1/grep.1.gz /usr/share/info/grep.info.gz

root@testing-machine-name:/# /bin/ps aux | /bin/grep sleep
root      8627  0.0  0.0 107904   608 ?        S    08:02   0:00 sleep 60
root      8634  0.0  0.0 107904   608 ?        S    08:02   0:00 sleep 100
root      8638  0.0  0.0  11104   708 ?        S+   08:02   0:00 /bin/grep sleep

I can see the sleep 100 command with PID 8634. And if I want to kill that pid, I can:
[root@manja17-I18060 testing] # /bin/kill -9 8634
[1]+  Killed          sleep 100

So chroot only restricts the file system view; other things, like the process tree, are still accessible from inside the chroot. This is where namespaces come into the picture.

Namespaces
Namespaces allow us to create restricted views of system resources like the process tree, network interfaces, mounts, and so on. Where chroot restricts the file system, namespaces restrict the other important system resources: network, process tree, etc. A kernel namespace wraps a global system resource in an abstraction, so that the processes within the namespace think they have their own isolated instance of that resource. Modifications made to the resource inside the namespace are not visible to the host machine or to other namespaces.
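Under the hood this is the unshare(2) system call. Here is a minimal Python sketch of creating a new PID namespace (it must run as root; the flag value comes from <linux/sched.h>):

import ctypes, os

CLONE_NEWPID = 0x20000000        # flag value from <linux/sched.h>
libc = ctypes.CDLL("libc.so.6", use_errno=True)

# unshare(2) moves the calling process into new namespaces. A new PID
# namespace only takes effect for children, so we fork after the call.
if libc.unshare(CLONE_NEWPID) != 0:
    err = ctypes.get_errno()
    raise OSError(err, os.strerror(err))

pid = os.fork()
if pid == 0:
    print(os.getpid())           # prints 1: first process in the new namespace
    os._exit(0)
os.waitpid(pid, 0)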

Linux also provides a nice command-line tool called unshare that does the same thing. Let's see how it works. Create a sleep process like below:

[root@manja17-I18060 testing]# sleep 100 &
[1] 8783

Now run chroot again, this time wrapped in unshare:
[root@manja17-I18060 testing]# unshare -p -f --mount-proc=$PWD/rootfs/proc \
>     chroot rootfs /bin/bash

This command creates a new process-tree (PID) namespace and attaches our chroot to it. Now if we run the “ps ux” command, we see:

root@testing-machine-name:/# /bin/ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.0  0.0  18220  2148 ?        S    08:13   0:00 /bin/bash
root         3  0.0  0.0  36628  1540 ?        R+   08:13   0:00 /bin/ps ux

We no longer see any of the host machine's processes. If we observe closely, our bash itself has been given PID 1, making it the root process of this namespace. And compared with the earlier run, all we added was the unshare wrapper around the same chroot command.

One advantage of namespaces is their composability. In the case above we isolated only the process tree of our shell; the other namespaces, like network and mount, are still shared. We can choose which namespaces to unshare and which to keep sharing. For example, in a Kubernetes pod, multiple containers can each have their own process namespace while sharing the network and mount namespaces.

There are six namespaces available:
The pid namespace: Process isolation (PID: Process ID).
The net namespace: Managing network interfaces (NET: Networking).
The ipc namespace: Managing access to IPC resources (IPC: Inter-Process Communication).
The mnt namespace: Managing file system mount points (MNT: Mount).
The uts namespace: Isolating kernel and version identifiers (UTS: Unix Timesharing System).
The user namespace: Isolating user IDs between namespaces (USER: User ID).

How do containers use namespaces?
Containers use namespaces to get an isolated view of otherwise global resources. Every container we start is attached to a set of namespaces, and the container uses those resources in isolation: any change made inside the container (that is, inside its namespaces) does not affect the global system resource; it stays restricted to the namespace.
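You can see which namespaces a process belongs to by reading the symlinks under /proc/<pid>/ns; two processes are in the same namespace exactly when the links point to the same inode. A quick sketch:

import os

# Each symlink names a namespace type plus the inode identifying the
# instance this process is in; a containerized process shows different
# inodes for every namespace it does not share with the host.
for ns in sorted(os.listdir("/proc/self/ns")):
    print(ns, "->", os.readlink("/proc/self/ns/" + ns))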

We have now seen how to isolate the file system and other global system resources, but we have not isolated memory and CPU. This means that even with a chroot and namespaces attached, a memory-hungry program can still eat up the whole system. This is where cgroups come into the picture.

CGroups
Cgroups is short for control groups, which allow restrictions on memory and CPU, imposed by the kernel itself. The kernel exposes cgroups at the location below:

[root@manja17-I18060 cgroup]# ll /sys/fs/cgroup/
total 0
drwxr-xr-x. 3 root root  0 Oct  9 04:00 blkio
lrwxrwxrwx. 1 root root 11 Sep 19 01:52 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Sep 19 01:52 cpuacct -> cpu,cpuacct
drwxr-xr-x. 3 root root  0 Oct  9 04:00 cpu,cpuacct
drwxr-xr-x. 3 root root  0 Sep 19 01:52 cpuset
drwxr-xr-x. 3 root root  0 Oct  9 03:10 devices
drwxr-xr-x. 3 root root  0 Sep 19 01:52 freezer
drwxr-xr-x. 3 root root  0 Sep 19 01:52 hugetlb
drwxr-xr-x. 3 root root  0 Oct  9 04:00 memory
lrwxrwxrwx. 1 root root 16 Sep 19 01:52 net_cls -> net_cls,net_prio
drwxr-xr-x. 3 root root  0 Sep 19 01:52 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Sep 19 01:52 net_prio -> net_cls,net_prio
drwxr-xr-x. 3 root root  0 Sep 19 01:52 perf_event
drwxr-xr-x. 3 root root  0 Sep 19 01:52 pids
drwxr-xr-x. 5 root root  0 Sep 19 01:52 systemd

Let's see cgroups in action.
Creating a cgroup is as easy as creating a directory, just in a specific location:
mkdir /sys/fs/cgroup/memory/test

Now if we look inside the directory, we see that the kernel has filled it with control files:
[root@manja17-I18060 testing]# ll /sys/fs/cgroup/memory/test/  | awk '{print $9}'
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks

All we have to do is edit some of these files to set our restrictions:
[root@manja17-I18060 testing]# echo "100000000" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

[root@manja17-I18060 testing]# echo "0" > /sys/fs/cgroup/memory/test/memory.swappiness

All I am doing with the above commands is disabling swap and setting a 100 MB memory limit. The last thing we need to do is add a process id (pid) to this cgroup, which is done by writing the pid into the tasks file in the same directory:

[root@manja17-I18060 testing]#  echo $$ > /sys/fs/cgroup/memory/test/tasks
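The same three steps can also be scripted. Here is a minimal Python sketch, assuming the cgroup v1 memory controller is mounted at /sys/fs/cgroup/memory as shown above (run as root):

import os

CG = "/sys/fs/cgroup/memory/test"
if not os.path.isdir(CG):
    os.mkdir(CG)                  # the kernel populates the control files

def cg_write(name, value):
    with open(os.path.join(CG, name), "w") as f:
        f.write(str(value))

cg_write("memory.limit_in_bytes", 100 * 1000 * 1000)  # ~100 MB hard limit
cg_write("memory.swappiness", 0)                      # don't dodge the limit via swap
cg_write("tasks", os.getpid())                        # place this process in the group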

We can add any number of pids to this tasks file, and the 100 MB restriction applies to all of them. After this, I started a memory-eating program in the same shell. The output looks like this:

[root@manja17-I18060 testing]# python hai.py
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed

We can see that although the machine has around 6 GB of RAM, the process is killed as it approaches 100 MB. That is because the memory-eating process belongs to the test cgroup, whose limit is set to 100 MB: when the processes in the group are about to exceed 100 MB, the kernel kills the offender, making sure the memory-eating program does not consume all the system memory.
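The hai.py used above is not shown in this post; a minimal, hypothetical stand-in that reproduces the same behavior just keeps allocating 10 MB chunks and printing the running total until the cgroup limit triggers the kernel's OOM killer:

# hypothetical hai.py: allocate (and touch) memory in 10 MB chunks
# until the kernel kills us for exceeding the cgroup limit
chunks = []
used_mb = 0
while True:
    chunks.append(bytearray(10 * 1024 * 1024))  # zero-filled, so pages are really used
    used_mb += 10
    print("%dmb" % used_mb)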

This is exactly what happens in a container. Using cgroups on containers, we can impose memory and CPU limits so that the processes inside cannot break out of their budget and eat up the host's memory and CPU.

Containers are allowed to run any kind of code downloaded from the internet. One good thing is that almost everything in a container is namespaced, and as we have already learned, namespaces present global resources through an isolated view, restricting the changes made inside a namespace to that namespace. With namespaces we therefore achieve some level of security.

There are, however, other systems that are not namespaced, mostly kernel subsystems:
SELinux
Cgroups
The file system under /sys
/proc/sys, /proc/irq, /proc/bus, etc.

Devices are not namespaced either: /dev/mem, /dev/sd* disk devices, and kernel modules are shared with the host. Through vulnerabilities in these, an exploit can break out of the container to the host. This is where Linux capabilities come into the picture.

Linux Capabilities

Security is a common topic of discussion nowadays, and it is becoming more important as the world becomes more interconnected. Linux security keeps evolving to address the growing list of issues.

One aspect of security is user privileges. *nix-based systems come with two kinds of user privilege: regular users and root. Regular users are comparatively powerless: they cannot modify any processes or files except their own. Root, on the other hand, can do anything, from modifying all processes and files to unrestricted network and hardware access.

There are cases where we need a middle ground: a non-root user with the ability to perform specific actions that normally require root. This middle ground is called capabilities. Capabilities provide fine-grained control over superuser permissions, allowing use of the root user to be avoided.

Capabilities divide system access into logical groups that can be individually granted to, or removed from, different processes. They allow a system administrator to fine-tune what a process is allowed to do, reducing the security risk to the system. Capabilities are supported by all modern *nix systems.

In simple terms, capabilities split root's privileges into a set of distinct privileges that can be independently enabled or disabled. They can also restrict what a process running as root may do: for example, file system mount operations or kernel module loading can each be denied as a separate privilege.

Let's say we want to start Python's SimpleHTTPServer module on port 80 as a non-privileged user:

[vagrant@testing-machine ~]$ python -m SimpleHTTPServer 80
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 220, in
    test()
  File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 216, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 595, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied
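The call failing at the bottom of the traceback is the bind on a privileged port. Stripped of the HTTP server machinery, this small sketch reproduces the same error:

import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
# Binding to a port below 1024 requires CAP_NET_BIND_SERVICE
# (traditionally, being root); a plain user gets "Permission denied".
s.bind(("0.0.0.0", 80))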

Now let's add the capability:
[vagrant@testing-machine ~]$ sudo setcap 'CAP_NET_BIND_SERVICE+ep' /usr/bin/python2.7

The above command adds the CAP_NET_BIND_SERVICE capability to the /usr/bin/python2.7 binary; “+ep” puts the capability in the file's effective and permitted sets. Now let's start the Python server again:

[vagrant@testing-machine ~]$ python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 …

We are now able to serve traffic on the privileged port 80 as a non-privileged user. Similarly, if we enter the chroot and try to run the python server inside:

[root@testing-machine vagrant]#  chroot rootfs /bin/bash
root@testing-machine:/# /usr/bin/python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...

We can see the server runs fine, because the user inside the chroot is root. Now let's drop the capability and see how it goes. We will use the capsh command, which can drop or add capabilities:

[root@testing-machine vagrant]# sudo capsh --drop=cap_chown,cap_setpcap,cap_setfcap,cap_sys_admin,cap_net_bind_service --chroot=$PWD/rootfs --

root@testing-machine:/# which python
/usr/bin/python
root@testing-machine:/# /usr/bin/python -m SimpleHTTPServer 80
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 235, in
    test()
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 231, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 606, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied

The server cannot be started, because we have dropped the “cap_net_bind_service” capability.

Conclusion
Containers are not magic. Anyone with an understanding of a Linux machine can play with the tools that provide this isolation. In the real world, containers combine chroot, namespaces, cgroups and Linux capabilities: chroot gives each container an isolated directory whose processes are confined to it and cannot reach the file system outside; namespaces give the processes inside their own view of global resources such as the process tree, mounts and network; cgroups cap the memory and CPU available to everything running inside; and Linux capabilities define what the users inside the container can and cannot do.
