The Petrov Experience, or why not to automate everything

I’m a big fan of automation. Probably all good devops are, and maybe a bunch of old-and-good sysadmins too. But everything in the world has limits. A couple of days ago I discussed with a colleague the convenience, or not, of totally automating system actions.

While I agree that automation is a good thing for repetitive, non-critical work, I’m more conservative about relevant or critical actions. Our discussion was about automatic failover of the database (which is the core of the company and, ultimately, critical) versus keeping the failover manual, even if that means being woken up in the middle of the night.

Today I want to share a not-very-famous story, which happened in 1983, but which explains quite well why some actions must remain manual. Let’s start with the story of Petrov.

It is September 26, 1983. The world is quite nervous these days. The Soviet Union has recently shot down a Korean civilian airliner that strayed into Soviet airspace, killing the 269 people on board, and NATO has started military maneuvers near the Soviet border.

NORAD as seen in the WarGames film

In this climate, Lieutenant Colonel Stanislav Yevgrafovich Petrov (Станисла́в Евгра́фович Петро́в) started his shift that night. He was responsible for the Soviet detection system OKB, the automatic system in charge of detecting nuclear launches from American bases and, in case a launch was detected, triggering the counter-attack (which, in that age, meant starting a nuclear war).

At one moment during the night, the main panels on the wall started blinking, the alarm began to sound and the system confirmed a launch from an American base. Now Petrov had to make a decision: either wait to be sure that the launch was real, or start the counter-attack. But he did not have much time. If an attack was on the way, every second counted to organize the defense and the response. If not, he would probably be starting the last war of humanity.

A couple of minutes later, the system started to buzz again. Another launch was detected, and within a few minutes three more launches were reported. In total, five nuclear missiles appeared to have been launched from the US against the Soviet Union. The OKB system was very reliable: it used spy cameras on a constellation of satellites that had been orbiting without problems for years, and the data was analyzed in three separate data centers with no common element.

From a technical point of view, the threat was real. The satellites (a very sophisticated piece of technology for the age) detected five missile launches (not just one), and three isolated data centers came to the same conclusion. Nothing appeared to be wrong with the system… so… what is the logical reaction?

At that point, human common sense entered the equation. Petrov thought (and keep in mind that this thought probably avoided a nuclear war): “no one starts a war with only five missiles”. So he decided not to launch the counter-attack, not to inform the top military commanders, and simply to wait until a radar on the Soviet border could confirm or disprove what the system said.

Fortunately for all of us, Petrov was right. No one starts a war with five missiles. The border radars confirmed that there were no missiles at all, and the whole thing was a terrible mistake of the system. Specifically, a rare alignment of the Sun, the satellite and the Earth made a reflection on some low clouds look to the system exactly like a launch does, and this reflection happened to appear right over a military base which did, in fact, host nuclear missile silos.

Stanislav Yevgráfovich Petrov in uniform. Credits to unknown author.

The point here is: what would have happened if this decision had been made by an automatic process? Probably neither you nor I would be here now.

By the way, in 2004 Petrov received an award for his help in keeping the peace, along with a prize of (sic) 1000 US dollars. His family died without knowing what really happened that day, because the incident was classified as top secret for years.

NTP stratum 1 with Raspberry Pi

One of the projects I would love to implement this year is an NTP stratum 1 server using a Raspberry Pi and a GPS antenna. Well, the main goal is a bit more ambitious: I want to join the NTP pool as stratum 1 (we run a stratum 2 time server at connectical right now), and also do the same with a GLONASS-based chip to compare the accuracy of both models.

But right now I need to start by building the first one, a GPS-based NTP stratum 1. For that I use a Ublox MAX-7Q chip from HAB suppliers and, as the antenna, a single SMA model from the same supplier.

 

Raspberry Pi with the GPS module connected to the GPIO header. You can see the GPS antenna cable too.

The initial installation was easy: just plug the GPS board into the GPIO connector and move forward. For the OS I use the image created for the NTPi project by openchaos, which works fine with this chip.
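
The openchaos image already frees the serial port and starts gpsd for you. If you start from a plain Raspbian instead, the setup is roughly the sketch below (the file names and the ttyAMA0 device are what 2014-era Raspbian uses; double-check them on your image):

# Free the UART: remove "console=ttyAMA0,115200" from /boot/cmdline.txt and
# comment out the ttyAMA0 getty line in /etc/inittab.
# Then tell gpsd which device to read in /etc/default/gpsd:
START_DAEMON="true"
GPSD_OPTIONS="-n"
DEVICES="/dev/ttyAMA0"

After a reboot (or a service gpsd restart) the daemon picks up the NMEA stream; the PPS line (/dev/pps0) will be read directly by ntpd later on.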

Once connected to the rasp, and after waiting a couple of seconds for the GPS synchronization (I must say that this model is incredibly fast), I use the cgps -s command to inspect which satellites are visible to my antenna:

A screenshot of cgps showing satellites in my area.
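
If you prefer a quick check without the curses interface, gpsd’s gpspipe can dump a few raw NMEA sentences; this is just a sanity check, not required for the rest of the setup:

$ gpspipe -r -n 5

If sentences scroll by, gpsd is reading the receiver correctly.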

So, the next step is to configure the PPS source and NTP to use it as the main source for time synchronization.

Once connected to the rasp, I tested the PPS source using ppstest (from the pps-tools package).

# ppstest /dev/pps0
trying PPS source "/dev/pps0"
found PPS source "/dev/pps0"
ok, found 1 source(s), now start fetching data...
source 0 - assert 1411064435.000594220, sequence: 3906 - clear  0.000000000, sequence: 0
source 0 - assert 1411064436.000598888, sequence: 3907 - clear  0.000000000, sequence: 0
source 0 - assert 1411064437.000602658, sequence: 3908 - clear  0.000000000, sequence: 0
^C

Everything appears to work fine. Time to configure the NTP daemon. I use the following NTP config:

# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

# Drift file to remember clock rate across restarts
driftfile /var/lib/ntp/ntp.drift

# coarse time ref-clock, not really needed here as we have LAN & WAN servers
server 127.127.28.0  minpoll 4 maxpoll 4
fudge 127.127.28.0 time1 +0.350 refid GPS  stratum 15

# Kernel-mode PPS ref-clock for the precise seconds
server 127.127.22.0 minpoll 4 maxpoll 4
fudge 127.127.22.0  flag3 1  refid PPS

# WAN servers, "pool" will expand the number of servers to suit
pool eu.pool.ntp.org  minpoll 10  iburst

Note the GPS and PPS lines. The first one uses the GPS reference as a clock. There is not much to explain (the ntp.conf(5) man page is quite enlightening), but essentially I configure two local servers using the gpsd interface (the gpsd daemon starts automatically if you use the openchaos image): one uses the standard GPS data via the shared-memory driver (type 28) and the other uses the PPS line via the kernel PPS driver (type 22).

After restarting ntpd and waiting a couple of minutes for the sources to be reached, I can see both new servers in ntpq:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
oPPS(0)          .PPS.            0 l    4   16    7    0.000   -2.015   8.087
*SHM(0)          .GPS.            0 l    3   16   17    0.000  -11.105  11.233
 ptbtime1.ptb.de .PTB.            1 u   49   64    1   88.195  -15.562   0.002
+vps01.roethof.n 83.98.201.134    3 u   41   64    1   80.312  -11.912   2.794
+i2t15.i2t.ehu.e .GPS.            1 u   38   64    1   70.612  -12.175   2.321
+ntp01.srv.cat   158.227.98.15    2 u   42   64    1   59.587   -7.685   2.895
+213.194.159.3 ( 46.16.60.129     3 u   39   64    1   70.919  -12.449   2.945

And that’s all. The next steps are to optimize the configuration to get a good-quality time source, measure delays and repeat the experiment with the GLONASS chip. Stay tuned!

Poor man’s containerization


In the last few months, the containerization of processes has become the new virtualization for modern devops.

Of course we are old devops, you know, and there is nothing special in containerization that we weren’t already using years ago. There are some poor man’s alternatives to the new tools, like docker or vagrant, but in the old-school way.

The forgotten chroot

Years ago chroot was forgotten for no particular reason. The truth is that we can use chroot as a decent form of containerization if we don’t need copy-on-write or network capabilities. It is a very portable approach which requires only root privileges, and no special capability enabled in the kernel config (very useful on a restricted VPS).
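
A minimal sketch on a Debian-like host, assuming debootstrap is installed and using /srv/jail as an arbitrary example path:

# debootstrap stable /srv/jail http://ftp.debian.org/debian
# mount -t proc proc /srv/jail/proc
# chroot /srv/jail /bin/bash

Everything the shell does from that point on is confined to the /srv/jail tree (bind-mount any extra directories you need before entering).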

There are also a number of non-root alternatives based on ptrace, like proot. The use of ptrace is deep enough to deserve another article by itself. Stay tuned!

LD_PRELOAD

You can do very interesting things with the LD_PRELOAD variable. If it is set, the GNU dynamic linker loads the library named in the variable into the process context before any other, so its symbols take precedence. That lets you override functions like open(2) or write(2). This way you can implement an easy-to-use copy-on-write system which does not require anything special: no root privileges, no special kernel configuration.

Of course there are a number of implementations of this idea. My favorite one is fl-cow, which ships as a Debian package (officially maintained in Debian and Ubuntu).
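
A rough sketch of how the mechanism is used. The library path and the FLCOW variable (a list of paths where writes trigger a copy) are assumptions from memory, so check the fl-cow documentation for the exact names:

$ export LD_PRELOAD=/usr/lib/flcow.so       # assumed path of the interposing library
$ export FLCOW=/srv/trees                   # assumed variable with copy-on-write paths
$ cp -al /srv/trees/base /srv/trees/work    # cheap hard-linked "snapshot"
$ editor /srv/trees/work/etc/config         # writing breaks the link and copies the file

The interposed open(2) performs the copy transparently, so any unmodified program benefits.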

unshare

The “new” member of the family, available since Linux 2.6.16, is the unshare(2) system call, which comes with the user-space tool unshare(1). The unshare function allows a process to disassociate parts of its execution context. That means you can run a process with a different filesystem namespace, for example. It’s very useful when you need to handle mount points for your “containers”.
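
A minimal sketch with the util-linux unshare(1) tool, run as root (depending on your distribution’s mount propagation defaults you may also need a mount --make-rprivate / inside the new namespace):

# unshare --mount /bin/bash
# mount -t tmpfs tmpfs /tmp    # visible only inside this namespace
# exit                         # the mount vanishes with the namespace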

My favorite tool to handle unshare, clone and friends is dive, a tool created by Vitaly Shukela which allows you to run a process with different mount points and other capabilities, like cgroups or network namespaces, which we will see in the next section.

Network namespaces

Since kernel 2.6.24, Linux has had the ability to create network namespaces. A namespace is a way to create separate network interfaces and routing tables tied to the process context, so your process can handle a “virtual” interface in a simple way.

Scott Lowe wrote, some years ago (nothing new here), a really good introduction to network namespaces in GNU/Linux using iproute2.

With namespaces you can easily define a number of “hosts” with connectivity between them (for example over a veth pair, all inside the same box), so your pseudo-containers can use the network. It’s very useful when you need to test master-slave configurations.
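
A minimal sketch with iproute2, using made-up names and addresses (ns1, a veth pair and 10.0.0.0/24):

# ip netns add ns1
# ip link add veth0 type veth peer name veth1
# ip link set veth1 netns ns1
# ip addr add 10.0.0.1/24 dev veth0
# ip link set veth0 up
# ip netns exec ns1 ip addr add 10.0.0.2/24 dev veth1
# ip netns exec ns1 ip link set veth1 up
# ip netns exec ns1 ping -c 1 10.0.0.1

Any process started with ip netns exec ns1 … sees only that namespace’s interfaces and routes.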

Conclusions

Of course containerization is one of the most active areas in devops today. A lot of good developments like docker are emerging on the horizon, but if you don’t need a more complex system, these solutions can help you. Furthermore, most of these principles are at the base of how modern containerization systems actually work.

A secure way to sign HTML blocks


A couple of years ago I was talking with my colleagues at the time about security on some websites. We were not talking about SSL (which is, by the way, more popular now), because SSL only works at the connection level. With SSL you can guarantee that the communication is reliable (in terms of authenticity) and that the endpoint server is actually who it claims to be.

But SSL hides a shameful secret, a gap in the design which can eventually cause a big security problem. This neglected detail is so evident that no one thinks very much about it: “SSL doesn’t guarantee you anything about the content that you are viewing”.

We can build a thought experiment. Let’s suppose that a big e-commerce website, with payments enabled for its customers, wants to fire an employee. That employee is a well-qualified programmer with access to the site’s source code. Before being fired, he modifies the source code to add a very small piece of code (buried in millions of lines of e-commerce code) which changes just one little thing: the action of the payment HTML form now sends the credit card data to an anonymous web service running in some weird country.

Now, let’s do another exercise in imagination. Suppose you are an unsuspecting user who loves the products of our company. You buy a couple of goods, and you probably pay with your credit card… Oops! Wait a moment… Now your credit card data is stored in a probably-not-very-safe database on a server located in our Weird Country, ready to be sold to anyone who can pay for that kind of information (and I can assure you that they aren’t good people).

In this case SSL is green: it is the real server, over a trusted connection. But SSL doesn’t help us to avoid the crime. That’s why, eventually, we need content signing.

Thinking about this problem, I came up with a way to facilitate such an implementation. The core of the idea is the data-signature attribute. This attribute can be used on any HTML5 block, and it holds a signature of the HTML representation of all children of the block that carries the attribute. So, for example, in the following code:

<div id="content" class="myclass_for_stylish" data-signature="eWVzIG1hcnRoYSwgdGhpcyBpcyBub3QgYSByZWFsIHNpZ25hdHVyZQo=">
  <!-- This is a normal comment -->
  <p>Some paragraph here</p>
</div>

The signature covers the HTML <p>Some paragraph here</p>. We don’t need to sign the comment (nothing important could be stored there). The signature algorithm is, right now, irrelevant; we will work on that point a few paragraphs below.

Of course, nested blocks can be signed as well.

With this approach, we can be sure that the content of the div block is genuine, because we assume that the developer has no access to the master keys used to sign critical data. In our store example, the critical data is just the form block, which needs to be hard-coded; but, anyway, this is usually a fixed string in a template.

Finally, we need to talk a little bit about the signing algorithm. We can use any public-key-based algorithm; the only problem is how to check that the signature is right. Well, there are a lot of solutions to that problem.

One solution could be that the browser (or a browser extension ;)) validates the signature, looking up the public key associated with the domain in a public CA (or in a web-of-trust model).
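
Just to make the idea concrete, here is a minimal sketch with openssl. The canonicalization step, the key file names and the RSA/SHA-256 choice are illustrative assumptions, not part of any specification:

$ printf '<p>Some paragraph here</p>' > block.html   # canonical form of the children
$ openssl dgst -sha256 -sign server_key.pem -out sig.bin block.html
$ base64 -w0 sig.bin                                  # value to put in data-signature
$ openssl dgst -sha256 -verify server_pub.pem -signature sig.bin block.html

The verifier (browser or extension) rebuilds the same canonical form of the children and runs the last command with the site’s public key.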

So, this is a simple way to validate HTML blocks and add more security to websites. Do you think this kind of system is necessary? Or convenient? Do you know any other way to sign content on websites?

Let’s think about this the next time we click the “Payment” button 😉

A simple way to manage lots of system users in distributed environments

A few years ago I was working on the design of a large cluster of systems to perform certain tasks (solving mathematical models, database sharding…). From the systems point of view, I had to deal with a number of pesky problems. One of them was user management.

Since I had more than one hundred hosts, and that number could grow in the short term, and I had a number of users who needed access to all of them, I had to think of a way to ease user management. Actually, user management is, in my opinion, a pain in the butt. If you have a central user directory, you have to deal with a big fat single point of failure, so you need to build some kind of HA service for that directory. And if you have systems around the world, then you need to replicate the user data across different directories and keep them synchronized. If that is not hell, it must be something very similar.

Dealing with user management is a royal hassle for system administrators everywhere, but in the cloud (i.e. a number of hosts distributed around the world) it becomes a punishment. So I needed to solve this problem (at least in part) before moving forward with my deployment. I did not really need full user management, just basic UID mapping and a way to authenticate users (for which I could use the old and friendly authorized_keys).

So, how can I manage a big number of users in a uniform way that stays effective in a distributed environment? That’s not a simple question, and of course each deployment has its own solution, from authentication services to suites of scripts. Anyway, I was looking for something simple to manage, because I was the one responsible for the entire environment and I’m too lazy too 😉

Thinking about the problem, I imagined a system without any users; or rather, with just one user, where any other user is just an alias for that first one. It would be easy to manage, because we only need one UID, but we still have to solve the alias mapping.

This is where libnss_map joins the game. libnss_map is a library designed to be used with the GNU NSS service. NSS allows the system to get user credentials from many sources, which can be configured easily in the /etc/nsswitch.conf file.

For example, we can configure our system in the following way:

passwd:      files map
shadow:      files map
group:       files map

So, to get credentials for each user, NSS will look in the standard files first and then use the map module (libnss_map). The map module works as the flow diagram shows.

Flow diagram of how credential lookup works with libnss_map

As you can see in the diagram, there are two major steps in the lookup. The first one is responsible for mapping a user to a virtual one. The virtual user is static, and it is defined in /etc/nssmap.conf. This file has the same syntax as passwd. For example:

virtual:x:10000:10000:,,,:/home/:/secure_shell

This means that any user who does not exist in /etc/passwd will be mapped onto this one, with UID 10000.

Okay, sounds good, but there are still a lot of questions. What about the password? What about the home directory?

Well, I did not find a good solution for the password, so nssmap returns a masked password (the account is enabled, but the password is unpredictable), and I authenticate the user by other means via PAM, or with public keys via SSH.

The home directory is easier. The home directory field in the user definition (inside the /etc/nssmap.conf file) is used as a prefix and is completed with the user name (the name of the user trying to log in, not the virtual one). So, for example, for the hypothetical user “sample” the effective home directory will be “/home/sample”, because “/home/” is the prefix. Please note that the trailing slash is mandatory in the current implementation.

Finally, I need to solve another big problem: if two users have the same UID, then either of them can change the same files, or delete the files of another “virtual” user. How can we solve that? There is no single answer, and no easy one after all. In my case I use a special shell which ensures that the user cannot remove, touch or even read files under any path in /home except their own home directory, but it is not a full solution yet. The sketch below gives the flavour of the idea.
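
A toy illustration only; this is not the real shell, just a hypothetical wrapper installed as /secure_shell (the shell named in nssmap.conf above) which jails the session into the mapped home before handing control to a restricted bash:

#!/bin/bash
# /secure_shell - hypothetical login shell for mapped users.
# $USER is the login name, not the "virtual" account name.
cd "/home/$USER" || exit 1
export PATH=/usr/bin:/bin
exec /bin/bash --restricted "$@"

A restricted bash cannot cd away or redirect output to arbitrary paths, which covers part of the problem, but as said above it is not a complete solution.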

Here is an example using nss map:

host ~ # sudo su - test
No directory, logging in with HOME=/
test@host / $ id
uid=10000(virtual) gid=10000(virtual) groups=10000(virtual)

In the meantime, basic code is available on my GitHub, and I am still researching this kind of authorization. Keep in touch and enjoy! And of course, feedback is welcome 😀

Hands on dreamplug

A DreamPlug device

Update: Óscar García kindly published a similar Spanish version of this article.

A couple of weeks ago I received my DreamPlug from NewIT. Although I knew about some software troubles with the device, I remained hopeful. However, when I unpacked and powered on the plug, I ran into a number of problems. This is the story of those problems and their solutions. I hope my experience can be useful to anyone who gets the same model and hits the same problems. After all, the NewIT forum has a number of posts about these and other issues, so reading it is recommended.

The first problem I found was a bad partition table, which is a well-known bug in the NewIT forums; the second one was the Ubuntu installation, which is not exactly tiny by default 🙂 Fortunately, both problems are easy to solve.

These are the steps I followed to “upgrade” my DreamPlug software to a cooler setup. Please note that, AFAIK, the Globalscale guys do not support other system images and installations, so perform these changes at your own risk. You have been warned!

New dtools and a bit more

Last week was crazy. I published a new release of dtools, 4.2, a new website for the dtools project, and a couple of patches for version 5.0 of collectd.

In the last few months, dtools has become a useful tool for me. I use dtools every day for system administration on large distributed networks. So I decided to improve some functionality and also test and retest the current features, and in a couple of months I expect to launch a new release of dtools.

In the meantime, I am still working on whistler, an XMPP bot for MUC rooms. I hope that the SleekXMPP library, which is the XMPP engine used by whistler, gets a release soon. At that point we must remove any dependency on the old xmpppy from the code, and we will be ready to release version 2.0 of whistler.

Integer conversions in bash

Since version 2, bash supports simple arithmetic operations. Although bash is not a mathematical shell (use bc instead), you can perform certain conversions using bash arithmetic logic.

For example, you can remove the leading zeroes from a decimal number without requiring any external utility or print formats. Let’s suppose that you want to strip the zeroes from the number 007, which is stored in the bond variable.

$ echo $bond
007
$ let nozeros=10#$bond
$ echo $nozeros
7

In many forums and mailing lists people resort to ugly sed expressions or awk invocations, but (with bash) it’s just this simple 🙂

Using the same trick, you can perform base conversions, for example:

$ let i=0x10
$ echo $i
16
$ let i=2#10000
$ echo $i
16
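
The same conversions work with arithmetic expansion, which avoids let entirely; a couple of quick examples:

$ echo $((16#ff))
255
$ echo $((2#1010))
10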

Or create an easy number check:

$ is_decimal () { let i=10#$1 2>/dev/null; }
$ is_decimal 'a' || echo Nop
Nop
$ is_decimal 56 && echo 'Yep'
Yep

Enjoy!

Whistler: a new Jabber MUC bot.

A few days ago I started a new project called whistler. Whistler is a bot written in Python using the great xmppy library, designed to work on XMPP networks (like Jabber or GTalk). At first I tried to use the quinoa framework, and it is very useful, but it had some issues for me; for example, you cannot set a different server configuration, which is a problem for GTalk accounts. So, after trying a number of frameworks, I decided to create my own. Probably not the best, but mine 🙂

Whistler is intended to manage the connectical MUC room, and only basic functionality is provided. Obviously, it is still under heavy development.

The code is publicly available in the whistler repository on GitHub, and you can clone it as usual:

$ git clone git://github.com/ajdiaz/whistler

You need xmppy and Python >= 2.5 to work with whistler. In a few days I will publish the project on PyPI too.

Enjoy and remember, any feedback is welcome 😉