The Petrov Experience, or why not to automate everything

I’m a big fan of automation. Probably all good devops people are, and a good number of old-school sysadmins too. But everything in this world has limits. A couple of days ago I discussed with a colleague whether or not it is convenient to automate every system action.

While I agree that automation is a good thing for repetitive and non-critical work, I’m more conservative about relevant or critical actions. Our discussion was about automatic failover for the database (which is the core of the company and therefore critical) versus keeping failover manual, even if that means waking up in the middle of the night.

Today I want to share a not very famous story, which happened in 1983, but which explains quite well why some actions must stay manual. Let’s start with the Petrov story.

It is September 26, 1983. The world is quite nervous these days. The Soviet Union has recently shot down a Korean civilian airliner for straying into Soviet airspace, killing 269 people, and NATO has started military maneuvers near the Soviet border.

NORAD, as seen in the film WarGames.

In this climate, Lieutenant Colonel Stanislav Yevgrafovich Petrov (Станисла́в Евгра́фович Петро́в) started his shift that night. He was responsible for monitoring the Soviet early-warning system Oko, the automatic system in charge of detecting nuclear launches from American bases and, if such a launch was detected, triggering the counter-attack (which, in that age, meant starting a nuclear war).

At one point during the night, the main panels on the wall started blinking, the alarm began to sound, and the system confirmed a launch from an American base. Now Petrov had to make a decision: either wait to be sure the launch was real, or start the counter-attack. But he didn’t have much time. If an attack was on the way, every second counted to organize the defense and the response. If not, he would probably be starting the last war of humanity.

A couple of minutes later, the system started buzzing again. Another launch was detected, and within a few minutes three more launches appeared. In total, five nuclear missiles appeared to have been launched from the US against the Soviet Union. The Oko system was considered very reliable: it used spy cameras on a constellation of satellites that had been orbiting without problems for years, and the data was analyzed in three separate data centers with no common element between them.

From a technical point of view, the threat was real. The satellites (very sophisticated pieces of technology for that age) detected five missile launches (not just one), and three isolated data centers came to the same conclusion. Nothing appeared to be wrong with the system… so… what is the logical reaction?

At that point, human common sense entered the equation. Petrov thought (and keep in mind that this thought probably avoided a nuclear war): “no one starts a war with only five missiles”. So he decided not to start the counter-attack, not to inform the top military command, and simply to wait until the radars on the Soviet border could confirm or disprove what the system was saying.

Fortunately for all of us, Petrov was right. No one starts a war with five missiles. The border radars confirmed that there were no missiles at all, and everything was a terrible mistake of the system. Specifically, an unusual alignment of the Sun, the satellites and the Earth made a reflection on high-altitude clouds look to the system exactly like a launch, and this reflection happened to appear right over a military base which did, in fact, host nuclear missile silos.

Stanislav Yevgrafovich Petrov in uniform. Credit: unknown author.

The point here is: what would have happened if this decision had been made by an automatic process? Probably neither you nor I would be here now.

By the way, in 2004 Petrov received an award for his contribution to keeping the peace, along with a prize of (sic) 1,000 US dollars. His family died without ever knowing what really happened that day, because the incident was classified as top secret for years.

NTP stratum 1 with a Raspberry Pi

One of the projects that I would love to implement this year is an NTP stratum 1 server using a Raspberry Pi and a GPS antenna. Well, the main goal is a little more ambitious: I want to join the NTP pool as a stratum 1 server (at Connectical we run a stratum 2 time server right now), and also do the same with a GLONASS-based chip to compare the accuracy of both models.

But right now I need to start by building the first one, a GPS-based NTP stratum 1 server. For that I use a u-blox MAX-7Q based board from HAB Supplies, and as the antenna a simple SMA model from the same supplier.

 

Raspberry Pi with the GPS module connected to the GPIO header. You can see the GPS antenna cable too.

The initial installation was easy: just plug the GPS board into the GPIO connector and move forward. For the OS I used the image created for the NTPi project by openchaos, which works fine with this chip.

Once connected to the Pi, and after waiting a couple of seconds for the GPS to synchronize (I must say this model is incredibly fast), I used the cgps -s command to inspect which satellites are visible to my antenna:

A screenshot of cgps showing the satellites in my area.

So, the next step is to configure the PPS source and NTP to use it as the main reference for time synchronization.
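The NTPi image already exposes the PPS signal from the GPS board as /dev/pps0, so there was nothing extra to do on my side. For reference, on a stock Raspbian image you would typically have to enable the kernel pps-gpio driver yourself with something like the line below (the GPIO pin number is an assumption and depends on how your board routes the PPS signal):

# /boot/config.txt -- enable the kernel pps-gpio driver on device-tree based
# images; gpiopin=18 is only an example, check your board's wiring
dtoverlay=pps-gpio,gpiopin=18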

First, I tested the PPS source using ppstest:

# ppstest /dev/pps0
trying PPS source "/dev/pps0"
found PPS source "/dev/pps0"
ok, found 1 source(s), now start fetching data...
source 0 - assert 1411064435.000594220, sequence: 3906 - clear  0.000000000, sequence: 0
source 0 - assert 1411064436.000598888, sequence: 3907 - clear  0.000000000, sequence: 0
source 0 - assert 1411064437.000602658, sequence: 3908 - clear  0.000000000, sequence: 0
^C

Everything appears to work fine. Time to configure the NTP daemon. I use the following NTP config:

# /etc/ntp.conf, configuration for ntpd; see ntp.conf(5) for help

# Drift file to remember clock rate across restarts
driftfile /var/lib/ntp/ntp.drift

# coarse time ref-clock, not really needed here as we have LAN & WAN servers
server 127.127.28.0  minpoll 4 maxpoll 4
fudge 127.127.28.0 time1 +0.350 refid GPS  stratum 15

# Kernel-mode PPS ref-clock for the precise seconds
server 127.127.22.0 minpoll 4 maxpoll 4
fudge 127.127.22.0  flag3 1  refid PPS

# WAN servers, "pool" will expand the number of servers to suit
pool eu.pool.ntp.org  minpoll 10  iburst

Note the GPS and PPS lines. There is not much to explain (the ntp.conf(5) man page is really clarifying), but essentially I configure two local reference clocks: one takes the coarse time that gpsd exports over shared memory (driver 127.127.28.0; the gpsd daemon starts automatically if you use the openchaos image), and the other reads the kernel PPS device directly (driver 127.127.22.0) to get the precise second boundaries.
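As a quick sanity check (not strictly necessary, but handy when debugging), you can confirm that gpsd is really exporting its time samples into the shared-memory segments that the 127.127.28.0 driver reads; gpsd registers them under the well-known SysV IPC keys "NTP0" and "NTP1":

# gpsd publishes its time samples in SysV shared-memory segments with keys
# 0x4e545030 ("NTP0") and 0x4e545031 ("NTP1"); if they are missing, the SHM
# ref-clock has nothing to read.
ipcs -m | grep 0x4e5450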

After restarting ntpd and waiting a couple of minutes for the sources to be reached, I can see both new reference clocks in ntpq:

# ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
oPPS(0)          .PPS.            0 l    4   16    7    0.000   -2.015   8.087
*SHM(0)          .GPS.            0 l    3   16   17    0.000  -11.105  11.233
 ptbtime1.ptb.de .PTB.            1 u   49   64    1   88.195  -15.562   0.002
+vps01.roethof.n 83.98.201.134    3 u   41   64    1   80.312  -11.912   2.794
+i2t15.i2t.ehu.e .GPS.            1 u   38   64    1   70.612  -12.175   2.321
+ntp01.srv.cat   158.227.98.15    2 u   42   64    1   59.587   -7.685   2.895
+213.194.159.3 ( 46.16.60.129     3 u   39   64    1   70.919  -12.449   2.945

And that’s all. The next steps are to optimize the configuration to get a good-quality time source, measure delays, and repeat the experiment with the GLONASS chip. Stay tuned!

A secure way to sign HTML blocks


A couple of years ago I was talking with my colleagues at the time about security on some websites. We were not talking about SSL (which is, by the way, more popular now), because SSL only works at the connection level. With SSL you can guarantee that the communication is reliable (in terms of authenticity) and that the endpoint server is actually who it pretends to be.

But SSL hides a shameful secret, a gap in the design which can eventually lead to a big security problem. This neglected detail is so evident that nobody thinks much about it: SSL doesn’t guarantee anything about the content you are viewing.

Let’s build a thought experiment. Suppose a big e-commerce website with payments enabled for its customers wants to fire an employee. That employee is a well-qualified programmer with access to the site’s source code. Before being fired, he modifies the source code to add a very small piece of code (buried among millions of lines of e-commerce code) which changes just one little thing: the action of the payment HTML form now sends the credit card data to an anonymous web service running in some weird country.

Now, let’s do another exercise in imagination. Suppose you are an unsuspecting user who loves this company’s products. You buy a couple of items, and you probably pay with your credit card… Oops! Hold on a moment… Now your credit card data is stored in a probably-not-very-safe database on a server located in that weird country, ready to be sold to anyone who can pay for that kind of information (and I can assure you they aren’t good people).

In this case SSL shows green: it is the real server, with a trusted connection. But SSL doesn’t help us prevent the crime. That’s the reason why, eventually, we need content signing.

Thinking about this problem, I came up with a way to make such an implementation easier. The core of the idea is the data-signature attribute. This attribute can be used on any HTML5 block element, and it holds a signature of the HTML representation of all children of the block that carries the attribute. So, for example, in the following code:

<div id="content" class="myclass_for_stylish" data-signature="eWVzIG1hcnRoYSwgdGhpcyBpcyBub3QgYSByZWFsIHNpZ25hdHVyZQo=">
  <!-- This is a normal comment -->
  <p>Some paragraph here</p>
</div>

The signature covers the HTML <p>Some paragraph here</p>. We don’t need to sign the comment (nothing important should live there). The signature algorithm is, for now, irrelevant; we will get to that point a few paragraphs below.

Of course, nested blocks can also be signed.

With this approach, we can be sure that the content of the div block is genuine, because we assume the developer has no access to the master keys used to sign critical data. In our store example, the critical data is just the form block, which needs to be hard-coded, but anyway this is usually a fixed string in a template.

Finally, we need to talk a little bit about the signing algorithm. We can use any public-key-based algorithm; the only problem is how to check that the signature is right. Well, there are plenty of solutions for that.

One solution could be that the browser (or a browser extension ;)) validates the signature by looking up the public key associated with the domain in a public CA (or in a web-of-trust model).
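To make the idea a bit more concrete, here is a minimal sketch of what a verifier could look like. It is only an illustration under my own assumptions (Ed25519 signatures checked with the Python cryptography package, and a naive canonicalisation that strips comments and collapses whitespace); nothing here is part of any standard:

# Hypothetical verifier for the proposed data-signature attribute.
# Assumptions: the signature is Ed25519 over the serialised children of the
# block (comments stripped, whitespace collapsed) and base64-encoded.
import base64
import re

from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey


def canonicalise(inner_html):
    """Naive canonical form: drop comments, collapse whitespace."""
    without_comments = re.sub(r"<!--.*?-->", "", inner_html, flags=re.S)
    return re.sub(r"\s+", " ", without_comments).strip().encode("utf-8")


def verify_block(inner_html, signature_b64, pubkey):
    """Return True if data-signature matches the block's child HTML."""
    try:
        pubkey.verify(base64.b64decode(signature_b64), canonicalise(inner_html))
        return True
    except InvalidSignature:
        return False

The signing side would be the mirror image: the publisher canonicalises the template fragment and signs it with the site’s private key, and the verifier obtains the matching public key through whatever CA or web-of-trust mechanism is chosen.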

So, this is a simple way to validate HTML blocks and add more security to websites. Do you think this kind of system is necessary, or convenient? Do you know any other way to sign content on websites?

Let’s think about this the next time we click the “Payment” button 😉

A simple way to manage lots of system users in distributed environments

A few years ago I was working on the design of a large cluster of systems to perform certain tasks (solving mathematical models, database sharding…). From the systems point of view, I had to deal with a number of pesky problems. One of them was user management.

Since I had more than one hundred hosts, a number that could grow quickly, and a number of users who needed access to all of them, I had to think of a way to make user management easy. Actually, user management is, in my opinion, a pain in the butt. If you have a central user directory, you have to deal with a big, fat single point of failure, so you need to build some kind of HA service around that directory. And if you have systems around the world, then you need to replicate the user data across different directories and keep them synchronized. If that is not hell, it must be something very similar.

Dealing with user management is a royal hassle for system administrators everywhere, but in the cloud (i.e. a number of hosts distributed around the world) it’s a real punishment. So I needed to solve this problem (at least in part) before moving forward with my deployment. I didn’t really need full user management, just basic UID mapping and a way to authenticate users (for which I could use the old and friendly authorized_keys).

So, how can I manage a big number of users in a simple way that is also effective in a distributed environment? That’s not a simple question, and of course every setup has its own solution, from authentication services to suites of scripts. Anyway, I was looking for something simple to manage, because I was the one responsible for the entire environment and I’m also quite lazy 😉

Thinking about the problem, I imagined a system with almost no users: let’s imagine there is just one user, and any other user is just an alias for that first one. It would be easy to manage, because we only need one UID, but we need to solve the alias mapping.
This is where libnss_map comes into the game. libnss_map is a library designed to be used with the GNU NSS (Name Service Switch). NSS allows the system to get user credentials from many sources, which can be configured easily in the /etc/nsswitch.conf file.

For example, we can configure our system in the following way:

passwd:      files map
shadow:      files map
group:       files map

So, to get credentials for a user, NSS will look in the standard files first, and then fall back to the map module (libnss_map). The map module works as the flow diagram shows.

Flow diagram of how credential lookup works with libnss_map.

As you can see in the diagram, there are two major steps in the lookup. The first one is responsible for mapping a user to a virtual one. The virtual user is static and is defined in /etc/nssmap.conf. This file has the same syntax as passwd. For example:

virtual:x:10000:10000:,,,:/home/:/secure_shell

This means that any user who does not exist in /etc/passwd will be mapped onto this one, with UID 10000.

Okay, sounds good, but there are still a lot of questions. What about the password? What about the home directory?

Well, I haven’t found a good solution for the password, so nssmap returns a masked password (the account is enabled, but the password is unpredictable), and I authenticate users by other means: via PAM, or with public keys over SSH.
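Since all the mapped users end up sharing a single system account, keeping their SSH keys manageable is its own little problem. One possible approach (just an idea, not what I describe above) is to let sshd ask an external command for the keys of whoever is logging in; fetch-keys here is a hypothetical helper that prints the public keys for the requested login name:

# /etc/ssh/sshd_config (sketch)
AuthorizedKeysCommand /usr/local/bin/fetch-keys %u
AuthorizedKeysCommandUser nobody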

The home directory is easier. The home directory field in the user definition (inside the /etc/nssmap.conf file) is used as a prefix and is completed with the user name (the name of the user who is trying to log in, not the virtual one). So, for example, for the hypothetical user “sample”, the effective home directory will be “/home/sample”, because “/home/” is the prefix. Please note that the trailing slash is mandatory in the current implementation.
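Conceptually (this is just a Python sketch of the lookup logic, not the actual C code of the NSS module), the mapping boils down to something like this:

# Conceptual sketch of the libnss_map lookup: any name that is not a regular
# local user resolves to the single virtual entry, with the home directory
# built from the configured prefix plus the requested login name.
VIRTUAL = {"uid": 10000, "gid": 10000, "shell": "/secure_shell", "home_prefix": "/home/"}

def map_user(name, local_users):
    """Return the mapped account for an unknown user, or None if it is local."""
    if name in local_users:
        # A real local user: the "files" NSS module already answered.
        return None
    return {
        "uid": VIRTUAL["uid"],                  # every mapped user shares one UID
        "gid": VIRTUAL["gid"],
        "home": VIRTUAL["home_prefix"] + name,  # prefix + login name, e.g. /home/sample
        "shell": VIRTUAL["shell"],
    }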

Finally, I had to solve another big problem: if two users have the same UID, then each can change, or even delete, the files of the other “virtual” user. How can we solve that? There is no single answer, and no easy one after all. In my case I use a special shell which ensures that the user cannot remove, touch or even read files anywhere under /home except in his own home directory, but it’s not a complete solution yet.

Here is an example using nss_map:

host ~ # sudo su - test
No directory, logging in with HOME=/
test@host / $ id
uid=10000(virtual) gid=10000(virtual) groups=10000(virtual)

In the meantime, a basic implementation is available on my GitHub, and I’m still researching this kind of authorization. Keep in touch and enjoy! And of course, feedback is welcome 😀