Containers add value for developers by isolating applications from one another and letting them run virtually anywhere, from public clouds to on-premise data centers. Most of us start creating Docker containers right away and play with them, but it is very important to understand what actually makes a container.
With many organizations migrating their applications to
containers, it is becoming more important for programmers and administrators to
know the concepts behind containers. These core concepts are common to any
container technology available now.
This article, “Anatomy of Containers”, explains container internals, shows what makes a container, and gives you a good sense of where to look when debugging issues.
First things first: containers are not new. Anyone with an understanding of Linux can play with the underlying mechanisms; tools like docker and rkt are just wrappers around them. These core concepts give a clear picture of how containers are created and how they work. Let's start digging into them.
File System
We all know that an
image is the starting point for creating a container. We need an image and we
start a container from the image. Basically, an image is nothing but a plain
file system with different types of files and executables.
Let’s download an image from a repository (using Docker; we could also build our own):
[root@rkt-machine ~]# docker images
REPOSITORY        TAG     IMAGE ID      CREATED       SIZE
docker.io/python  latest  7a35f2e8feff  34 hours ago  922.4 MB
Save the image to a tar file:
[root@rkt-machine ~]# docker save docker.io/python > centos.tar
Check the tar file:
[root@rkt-machine ~]# ls
anaconda-ks.cfg centos.tar original-ks.cfg
Extract the tar file:
[root@rkt-machine ~]# tar -xvf centos.tar
5182e96772bf11f4b912658e265dfe0db8bd314475443b6434ea708784192892.json
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/VERSION
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/json
d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/layer.tar
manifest.json
repositories
The d1ed0… directory contains a layer.tar file, which holds the actual layer content. Create a rootfs directory and extract the layer into it:
[root@rkt-machine ~]# mkdir rootfs
[root@rkt-machine ~]# tar -xf d1ed0d8ec4ec460641430566e9a8cece698e60d4ad4afcf48759ad157d340064/layer.tar -C ./rootfs
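The manifest.json produced by docker save lists each image's config file and its layer tarballs in order. A minimal sketch of reading one (the manifest text below is synthetic, with a made-up shortened digest, since the real digests above are specific to that machine):

```python
import json

# A docker-save manifest is a JSON list with one entry per image.
# This sample mirrors the structure of the real file; the digest is
# a placeholder, not the one from the image above.
manifest_text = """
[{"Config": "5182e9.json",
  "RepoTags": ["docker.io/python:latest"],
  "Layers": ["d1ed0d8ec4ec/layer.tar"]}]
"""

manifest = json.loads(manifest_text)
for image in manifest:
    print("tags: ", image["RepoTags"])
    # Layers are listed bottom-most first; later layers overlay earlier ones.
    for layer in image["Layers"]:
        print("layer:", layer)
```

For a multi-layer image you would extract each layer tarball in this listed order into the same rootfs, letting later layers overwrite earlier files.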
If you check the rootfs directory, it looks an awful lot like a Linux system. There is a bin, an etc, and many of the other locations that we normally see on a Linux machine.
[root@rkt-machine rootfs]# ll
-rw-r--r--.  1 root root 12005 Aug  4 22:05 anaconda-post.log
lrwxrwxrwx.  1 root root     7 Aug  4 22:04 bin -> usr/bin
drwxr-xr-x.  2 root root     6 Aug  4 22:04 dev
drwxr-xr-x. 47 root root  4096 Aug  4 22:05 etc
drwxr-xr-x.  2 root root     6 Apr 11 04:59 home
lrwxrwxrwx.  1 root root     7 Aug  4 22:04 lib -> usr/lib
lrwxrwxrwx.  1 root root     9 Aug  4 22:04 lib64 -> usr/lib64
drwxr-xr-x.  2 root root     6 Apr 11 04:59 media
drwxr-xr-x.  2 root root     6 Apr 11 04:59 mnt
drwxr-xr-x.  2 root root     6 Apr 11 04:59 opt
drwxr-xr-x.  2 root root     6 Aug  4 22:04 proc
dr-xr-x---.  2 root root   114 Aug  4 22:05 root
drwxr-xr-x. 10 root root   130 Aug  4 22:05 run
lrwxrwxrwx.  1 root root     8 Aug  4 22:04 sbin -> usr/sbin
drwxr-xr-x.  2 root root     6 Apr 11 04:59 srv
drwxr-xr-x.  2 root root     6 Aug  4 22:04 sys
drwxrwxrwt.  7 root root   132 Aug  4 22:05 tmp
drwxr-xr-x. 13 root root   155 Aug  4 22:04 usr
drwxr-xr-x. 18 root root   238 Aug  4 22:04 var
We can also see that the sh and bash shells are available in this rootfs:
[root@rkt-machine ~]#
ls rootfs/bin/sh
rootfs/bin/sh
[root@rkt-machine ~]#
ls rootfs/bin/bash
rootfs/bin/bash
The file system of a container image is the same kind of file system that we see in a normal Linux OS. It includes all the libraries necessary to run, minus the kernel. The kernel is the lowest level of software that interfaces with the hardware in the computer.
One important thing to remember is that a container does not have a kernel of its own, and a kernel is necessary to communicate with the hardware. For this, containers rely on the host kernel. A container image has all the files necessary to run, and whenever a system call is needed, the container uses the host kernel.
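This split is easy to observe from Python: the kernel version comes from the shared kernel via uname, while the distribution identity comes from a file inside the root filesystem. A small sketch (nothing container-specific assumed; /etc/os-release exists on most modern Linux distros):

```python
import platform

# The kernel release comes from the running kernel via uname(2).
# Inside a container, this still reports the *host* kernel version.
kernel = platform.release()
print("kernel:", kernel)

# The userland identity comes from a file inside the root filesystem,
# so it changes per image: an Ubuntu image reports Ubuntu here even
# when running on a CentOS host.
try:
    with open("/etc/os-release") as f:
        for line in f:
            if line.startswith("PRETTY_NAME"):
                print("distro:", line.split("=", 1)[1].strip().strip('"'))
except FileNotFoundError:
    print("distro: /etc/os-release not present on this system")
```

Run the same script on the host and inside a container: the distro line changes with the image, the kernel line does not.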
Because containers share the host kernel, container images are much smaller than full machine images. This sharing is also why different flavors of Linux containers run on different distributions: you can run Ubuntu, busybox, and Debian containers on a CentOS machine, and since the kernel underneath all these distributions is the same, they work seamlessly. It is also why we cannot run Linux containers natively on Windows (though some tools do this by running a Linux VM under the hood), because Windows does not provide a Linux kernel to share.
Now that we have seen that a container image's file system looks like a normal Linux file system, how can we keep processes running inside the container away from the host machine's files? Since the container uses the host kernel, will a process running inside the container have access to the host's file systems? This is where chroot comes into the picture.
Chroot
In a *nix-based OS, the root directory (/) is the top of the directory tree. The root file system sits on the disk partition where the root directory is located, and all other file systems are mounted onto this root file system.
The root process with PID 1 is the first thing that gets started when a Linux machine boots. All other processes and jobs are descendants of this process, and by default every process in Linux can access any file or directory in the filesystem.
What if we want to restrict the file system view of a running process, for example to block file system access for a process we are running or debugging? Linux provides an option for this called chroot. By default, every process inherits its root directory from its parent, all the way up to PID 1, whose root is the root file system itself. So naturally every process descending from PID 1 has access to the root file system and can see all files in every directory on it.
Chroot is an operation that changes the apparent root directory for the current process and all of its children. Whatever directory we chroot into becomes / for that process. This essentially restricts the process's view of the file system: it can only access files available under that chroot location.
To try chroot, we will use the same rootfs we extracted from the container image above. Go to the directory containing rootfs and run the chroot command as below:
[root@manja17-I18060 testing]# chroot rootfs /bin/bash
root@testing-machine-name:/# which python
/usr/local/bin/python
root@testing-machine-name:/# /usr/local/bin/python -V
Python 3.7.0 (default, Sep 5 2018, 03:25:31)
[GCC 6.3.0 20170516]
root@testing-machine-name:/# exit
exit
[root@manja17-I18060 testing]# which python
/usr/bin/python
[root@manja17-I18060 testing]# /usr/bin/python -V
Python 2.7.5
In the output above, chroot is invoked on the rootfs we got from the Docker image. Inside the chroot, the Python version is 3.7.0, which is the one available in the rootfs and linked against the rootfs libraries. If we exit the chroot and run the Python available on the host, we see a different version.
Similarly, from inside the chroot, listing /root shows nothing, whereas listing it from the host machine shows a couple of files:
root@testing-machine-name:~# cd /root/
root@testing-machine-name:~# /bin/ls /root/
root@testing-machine-name:~# exit
exit
[root@manja17-I18060 testing]# cd /root/
[root@manja17-I18060 ~]# ls
anaconda-ks.cfg  apache-maven-3.5.4-bin.tar.gz  one-context_4.14.4.rpm  terraform
By using chroot on a directory, we restrict processes started inside it to that directory: they have access to the files inside the chroot and cannot access anything outside of it.
How does chroot help containers?
Chroot lets a container start with its own file system, provided by the image, and this is what makes a process running inside a container see only the file system from the container image. Chroot prevents the process from accessing any file outside of the container file system, i.e. on the host file system.
We have seen how chroot restricts file system access, but it cannot hide certain other parts of the host from the chrooted process. For instance, let's start a process on the host machine:
[root@manja17-I18060 testing] # sleep 100 &
[1] 8634
Now let’s do chroot and see if the process is visible,
[root@manja17-I18060 testing]# chroot rootfs /bin/bash
root@testing-machine-name:/# whereis grep
grep: /bin/grep /usr/share/man/man1/grep.1.gz /usr/share/info/grep.info.gz
root@testing-machine-name:/# /bin/ps aux | /bin/grep sleep
root      8627  0.0  0.0 107904   608 ?   S    08:02   0:00 sleep 60
root      8634  0.0  0.0 107904   608 ?   S    08:02   0:00 sleep 100
root      8638  0.0  0.0  11104   708 ?   S+   08:02   0:00 /bin/grep sleep
The sleep 100 command is visible with PID 8634, and from inside the chroot I can even kill it:
[root@manja17-I18060 testing] # /bin/kill -9 8634
[1]+ Killed sleep 100
So chroot only restricts the file system view; other parts of the system are still accessible inside the chroot. This is where namespaces come into the picture.
Namespaces
Namespaces allow us to create restricted views of parts of the system, such as the process tree, network interfaces, and mounts. Where chroot restricts the file system, namespaces restrict other important system resources like the network and the process tree. A kernel namespace wraps a global system resource in an abstraction so that the processes within that namespace appear to have their own isolated instance of the resource. Modifications made to the resource inside the namespace are not visible to the host machine or to other namespaces.
Linux provides a handy command line tool called unshare for creating namespaces. Let's see how it works. Create a sleep process as below:
[root@manja17-I18060 testing]# sleep 100 &
[1] 8783
Run chroot wrapped in unshare, as below:
[root@manja17-I18060 testing]# unshare -p -f --mount-proc=$PWD/rootfs/proc \
> chroot rootfs /bin/bash
This command creates a new PID namespace and attaches our chroot to it. Now if we run the ps ux command we see:
root@testing-machine-name:/# /bin/ps ux
USER       PID %CPU %MEM    VSZ   RSS TTY  STAT START   TIME COMMAND
root         1  0.0  0.0  18220  2148 ?    S    08:13   0:00 /bin/bash
root         3  0.0  0.0  36628  1540 ?    R+   08:13   0:00 /bin/ps ux
None of the host machine's processes are visible. If we observe closely, our bash itself has been given PID 1, making it the root process of this namespace. Compared to the earlier chroot command, all we added was the unshare wrapper.
One advantage of namespaces is that they are composable. In the case above we isolated only the process tree of our shell process; the other namespaces, like network and mount, are still shared. We can choose which namespaces to unshare and which to keep. For example, in a Kubernetes pod, multiple containers can each have a separate process namespace while sharing the network and mount namespaces.
There are 6 namespaces available:
The pid namespace: process isolation (PID: Process ID).
The net namespace: managing network interfaces (NET: Networking).
The ipc namespace: managing access to IPC resources (IPC: Inter-Process Communication).
The mnt namespace: managing filesystem mount points (MNT: Mount).
The uts namespace: isolating kernel and version identifiers (UTS: Unix Timesharing System).
The user namespace: isolating user IDs between namespaces.
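On Linux, the namespaces a process belongs to are visible as symlinks under /proc/&lt;pid&gt;/ns. A quick way to list them for the current process (this sketch assumes a Linux machine; on other systems it just reports that /proc is absent):

```python
import os

NS_DIR = "/proc/self/ns"

if os.path.isdir(NS_DIR):
    # Each entry is a symlink whose target looks like "pid:[4026531836]".
    # The number is the namespace's inode: two processes in the same
    # namespace see the same inode here.
    for name in sorted(os.listdir(NS_DIR)):
        target = os.readlink(os.path.join(NS_DIR, name))
        print(f"{name:8s} -> {target}")
else:
    print("no /proc/self/ns here - not a Linux system?")
```

Comparing these inodes between a shell on the host and a shell inside a container shows exactly which namespaces the container runtime unshared.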
How do containers use namespaces?
Containers use namespaces to get isolated views of global resources. Every container we start is attached to a set of namespaces, and the container uses those resources in isolation. Any change made inside the container (namespace) does not affect the system resource globally; it stays restricted to that namespace.
We have now seen how to isolate the filesystem and global system resources, but we have not isolated memory and CPU. Even with a chroot location and namespaces attached, a memory-hungry program can still eat the whole system's memory. This is where cgroups come into the picture.
CGroups
Cgroups is short for control groups, which allow restrictions on memory and CPU. The restrictions are imposed by the kernel itself, which exposes cgroups at the location below:
[root@manja17-I18060 cgroup]# ll /sys/fs/cgroup/
total 0
drwxr-xr-x. 3 root root  0 Oct  9 04:00 blkio
lrwxrwxrwx. 1 root root 11 Sep 19 01:52 cpu -> cpu,cpuacct
lrwxrwxrwx. 1 root root 11 Sep 19 01:52 cpuacct -> cpu,cpuacct
drwxr-xr-x. 3 root root  0 Oct  9 04:00 cpu,cpuacct
drwxr-xr-x. 3 root root  0 Sep 19 01:52 cpuset
drwxr-xr-x. 3 root root  0 Oct  9 03:10 devices
drwxr-xr-x. 3 root root  0 Sep 19 01:52 freezer
drwxr-xr-x. 3 root root  0 Sep 19 01:52 hugetlb
drwxr-xr-x. 3 root root  0 Oct  9 04:00 memory
lrwxrwxrwx. 1 root root 16 Sep 19 01:52 net_cls -> net_cls,net_prio
drwxr-xr-x. 3 root root  0 Sep 19 01:52 net_cls,net_prio
lrwxrwxrwx. 1 root root 16 Sep 19 01:52 net_prio -> net_cls,net_prio
drwxr-xr-x. 3 root root  0 Sep 19 01:52 perf_event
drwxr-xr-x. 3 root root  0 Sep 19 01:52 pids
drwxr-xr-x. 5 root root  0 Sep 19 01:52 systemd
Let's see cgroups in action. Creating a cgroup is as easy as creating a directory, just in a specific location:
mkdir /sys/fs/cgroup/memory/test
If we now look at the directory, we see that the kernel has filled it with files:
[root@manja17-I18060 testing]# ll /sys/fs/cgroup/memory/test/ | awk '{print $9}'
cgroup.clone_children
cgroup.event_control
cgroup.procs
memory.failcnt
memory.force_empty
memory.kmem.failcnt
memory.kmem.limit_in_bytes
memory.kmem.max_usage_in_bytes
memory.kmem.slabinfo
memory.kmem.tcp.failcnt
memory.kmem.tcp.limit_in_bytes
memory.kmem.tcp.max_usage_in_bytes
memory.kmem.tcp.usage_in_bytes
memory.kmem.usage_in_bytes
memory.limit_in_bytes
memory.max_usage_in_bytes
memory.memsw.failcnt
memory.memsw.limit_in_bytes
memory.memsw.max_usage_in_bytes
memory.memsw.usage_in_bytes
memory.move_charge_at_immigrate
memory.numa_stat
memory.oom_control
memory.pressure_level
memory.soft_limit_in_bytes
memory.stat
memory.swappiness
memory.usage_in_bytes
memory.use_hierarchy
notify_on_release
tasks
All we have to do is edit some of these files to set our restrictions:
[root@manja17-I18060 testing]# echo "100000000" > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
[root@manja17-I18060 testing]# echo "0" > /sys/fs/cgroup/memory/test/memory.swappiness
These two commands disable swap for the group and set a memory limit of roughly 100 MB. The last thing we need to do is add a process ID (PID) to this cgroup, which is done by writing the PID into the tasks file in the same directory:
[root@manja17-I18060 testing]# echo $$ > /sys/fs/cgroup/memory/test/tasks
We can add any number of PIDs to this tasks file, and the 100 MB restriction applies to them collectively. With the shell added to the cgroup, I started a memory-eating application. The output looks like this:
[root@manja17-I18060 testing]# python hai.py
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed
Even though the machine has around 6 GB of RAM, the process is killed as it approaches 100 MB. That is because the memory-eating process belongs to the test cgroup, whose limit is set to 100 MB: when a process in the group is about to exceed the limit, the kernel kills it, making sure the program cannot eat all the system memory.
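The memory-eating script itself is not shown above; here is a plausible reconstruction of what hai.py might look like (the structure and function name are assumptions, only the 10 MB step and output format come from the original). It simply allocates memory in 10 MB chunks and reports progress:

```python
# Hypothetical reconstruction of hai.py: grab memory in 10 MB chunks and
# report progress. Run from a shell inside the "test" cgroup, the kernel
# kills it near the 100 MB limit instead of letting it exhaust the host.
CHUNK_MB = 10

def eat_memory(max_mb=None):
    """Allocate memory in CHUNK_MB steps; stop at max_mb if given."""
    chunks = []
    eaten = 0
    while max_mb is None or eaten < max_mb:
        # bytes objects are real allocations the kernel must back with pages
        chunks.append(b"x" * (CHUNK_MB * 1024 * 1024))
        eaten += CHUNK_MB
        print(f"{eaten}mb")
    return eaten

if __name__ == "__main__":
    eat_memory()  # unbounded on purpose; the cgroup supplies the limit
```

Without the cgroup, this loop would keep growing until the system-wide OOM killer stepped in; inside the cgroup, only this group's processes are at risk.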
This is exactly what happens in a container. Using cgroups on containers, we can set memory and CPU limits so that processes cannot break out of the container's budget and eat up the host's memory and CPU.
Containers can run any type of code downloaded from the internet. The good news is that everything in a container is namespaced, and namespaces give us some level of security: as we learned, they hand out global resources as isolated views, so changes made inside a namespace stay within it.
However, there are still parts of the system that are not namespaced, mostly kernel subsystems such as:
SELinux
Cgroups
The file system under /sys
/proc/sys, /proc/irq, /proc/bus, etc.
Devices are also not namespaced, such as /dev/mem, kernel modules, and the /dev/sd* file system devices. Through kernel vulnerabilities in these areas, an exploit can lead to a container breakout to the host. This is where Linux capabilities come into the picture.
Linux Capabilities
Security is a common topic of discussion nowadays, and it is becoming more important as the world becomes more interconnected. Linux security keeps evolving to address new issues.
One aspect of security is user privileges. *nix-based systems traditionally come with two levels of privilege: regular users and root. Regular users are relatively powerless; they cannot modify any processes or files except their own. The root user, on the other hand, can do anything, from modifying all processes and files to unrestricted network and hardware access.
There are cases where we need a middle ground, where a non-root user can perform specific actions that are normally reserved for root. This middle ground is called capabilities. Capabilities provide fine-grained control over superuser permissions, allowing use of the root user to be avoided.
Capabilities divide system access into logical groups that can be individually granted to, or removed from, different processes. They allow a system administrator to fine-tune what a process is allowed to do, reducing the security risk to the system. Capabilities are supported by modern *nix systems.
In simple terms, capabilities are designed to split up
the root privileges into a set of distinct privileges which can be
independently enabled or disabled. These are used to restrict what a process
running as root can do on the system which means it would be possible to deny
filesystem mount operations, deny kernel module loading etc as separate
privileges.
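On Linux, a process's capability sets can be read from /proc/&lt;pid&gt;/status; the CapEff field is a bitmask with one bit per capability, and bit 10 is CAP_NET_BIND_SERVICE (the bit index comes from &lt;linux/capability.h&gt;). A small sketch, assuming a Linux machine:

```python
CAP_NET_BIND_SERVICE = 10  # bit index from <linux/capability.h>

def effective_caps():
    """Return the CapEff bitmask of the current process, or None off-Linux."""
    try:
        with open("/proc/self/status") as f:
            for line in f:
                if line.startswith("CapEff:"):
                    # The value is printed as a hexadecimal bitmask.
                    return int(line.split()[1], 16)
    except FileNotFoundError:
        return None

caps = effective_caps()
if caps is None:
    print("no /proc/self/status - not a Linux system")
elif caps & (1 << CAP_NET_BIND_SERVICE):
    print("this process can bind privileged ports")
else:
    print("cannot bind ports below 1024 without root or setcap")
```

For an unprivileged shell CapEff is typically 0; for root (or a setcap'd binary) the relevant bits are set, which is exactly what the port-80 experiment below exercises.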
Let's say we want to start Python's SimpleHTTPServer module on port 80 as a non-privileged user:
[vagrant@testing-machine ~]$ python -m SimpleHTTPServer 80
Traceback (most recent call last):
  File "/usr/lib64/python2.7/runpy.py", line 162, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib64/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 220, in <module>
    test()
  File "/usr/lib64/python2.7/SimpleHTTPServer.py", line 216, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 595, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib64/python2.7/SocketServer.py", line 419, in __init__
    self.server_bind()
  File "/usr/lib64/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib64/python2.7/SocketServer.py", line 430, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib64/python2.7/socket.py", line 224, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied
Now let's add the capability:
[vagrant@testing-machine ~]$ sudo setcap 'CAP_NET_BIND_SERVICE+ep' /usr/bin/python2.7
This command adds the CAP_NET_BIND_SERVICE capability to our /usr/bin/python2.7 binary; “+ep” adds the capability to the file's effective and permitted sets. Now let's start the Python server again:
[vagrant@testing-machine ~]$ python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 …
We are now able to serve traffic over the privileged port 80 as a non-privileged user. Similarly, if we chroot and try to run the Python server inside:
[root@testing-machine vagrant]# chroot rootfs /bin/bash
root@testing-machine:/# /usr/bin/python -m SimpleHTTPServer 80
Serving HTTP on 0.0.0.0 port 80 ...
The server runs fine, because inside this chroot we are root. Now let's drop the capability and see what happens. We use the capsh command for dropping (or adding) capabilities as below:
[root@testing-machine vagrant]# sudo capsh --drop=cap_chown,cap_setpcap,cap_setfcap,cap_sys_admin,cap_net_bind_service --chroot=$PWD/rootfs --
root@testing-machine:/# which python
/usr/bin/python
root@testing-machine:/# /usr/bin/python -m SimpleHTTPServer 80
Traceback (most recent call last):
  File "/usr/lib/python2.7/runpy.py", line 174, in _run_module_as_main
    "__main__", fname, loader, pkg_name)
  File "/usr/lib/python2.7/runpy.py", line 72, in _run_code
    exec code in run_globals
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 235, in <module>
    test()
  File "/usr/lib/python2.7/SimpleHTTPServer.py", line 231, in test
    BaseHTTPServer.test(HandlerClass, ServerClass)
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 606, in test
    httpd = ServerClass(server_address, HandlerClass)
  File "/usr/lib/python2.7/SocketServer.py", line 417, in __init__
    self.server_bind()
  File "/usr/lib/python2.7/BaseHTTPServer.py", line 108, in server_bind
    SocketServer.TCPServer.server_bind(self)
  File "/usr/lib/python2.7/SocketServer.py", line 431, in server_bind
    self.socket.bind(self.server_address)
  File "/usr/lib/python2.7/socket.py", line 228, in meth
    return getattr(self._sock,name)(*args)
socket.error: [Errno 13] Permission denied
The server cannot be started, because we dropped the cap_net_bind_service capability.
Conclusion
Containers are not magic. Anyone with an understanding of a Linux machine can play with the tools that provide this isolation. In the real world, containers combine chroot, namespaces, cgroups, and Linux capabilities: chroot gives the container an isolated file system and keeps processes from reaching files outside it; namespaces hand out isolated views of global resources like the process tree, mounts, and the network; cgroups restrict the memory and CPU available to the processes inside; and Linux capabilities define what users in the container can and cannot do.