Debian Jessie gateway fail-over switch daemon.

Hi again!
Recently I resurrected an old Alix1d board that had been thrown into the office's old-crap storage, since I needed a cheap Internet gateway appliance that I could program to switch between uplinks.

Although at first glance there are plenty of appliances that do this, the problem is that the location where I had to use it has no DSL, cable, fiber or any other "wired" Internet connection... Instead, an LTE/4G card is the main way to reach the Internet, and wireless tethering to my Android phone (3G/HSDPA) is the backup.

Cheap LTE/4G routers are as crappy as they are cheap, unless you get something professional, which is obviously expensive for home office use. So here's where the old Alix running Debian Jessie comes to the rescue... this is the plan:

  • Connect the USB LTE/4G dongle to the Alix USB port and ensure it automatically connects to the Internet as soon as connectivity is established.
  • Install and configure one of the old mini-PCI WiFi cards I have around, so that it automatically detects and connects to my Android phone sharing its Internet link (tethering) as soon as it is available.
  • Configure a daemon process that monitors gateway availability and Internet reachability (it also checks VPN link status, due to my needs), switches the default gateway automatically, and continuously logs events to a log file.

Jessie, LTE/4G dongles, and usb-modeswitch mess...

One of the things that made migrating my classic Debian Squeeze images for Alix boards to Jessie a struggle was the contrast with how easy it is to plug and play a USB LTE dongle on my daily-use Debian Stretch computers.
I have to admit it was very disappointing to plug in my LTE USB dongle (Huawei E3131) just to see that it did not work.
Although no problems were reported on Wheezy, and it works perfectly for me on Stretch, Huawei LTE USB dongles do not work out of the box on Jessie.
The usb-modeswitch-data package contains a new set of rules, there were extensive changes from Wheezy to Jessie, and the patches arrived too late to enter Jessie... see Bug #805512.

Long story short: the dongle is recognized and handled correctly by the kernel, but the system fails to automatically switch the device from its initial storage mode to USB modem mode. (Storage mode is useful on Windows systems, which get the necessary drivers from the device itself, but useless on Linux, since the kernel does not need them.)

The 'brute-force' solution:

We can, however, manually use the 'usb_modeswitch' command to convince a Huawei E3131 LTE dongle to go into USB modem mode.
If we take a look at the USB device list with the 'lsusb' command, we see this:

root@alix:~# lsusb
Bus 001 Device 003: ID 12d1:1f01 Huawei Technologies Co., Ltd. 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

 
Note the device ID of the first device (Huawei Technologies Co., Ltd.): it has the form vendor:product, so 12d1 means Huawei and 1f01 is the product ID the dongle reports while in storage mode. Now, if we execute this:

usb_modeswitch -v 12d1 -p 1f01 -V 12d1 -P 14db -M "55534243123456780000000000000a11062000000000000100000000000000"

 
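The dongle switches to modem mode ('-v'/'-p' give the current vendor and product IDs, '-V'/'-P' the IDs expected after the switch, and '-M' the raw message that triggers it), so now 'lsusb' shows the following: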

root@alix:~# lsusb
Bus 001 Device 003: ID 12d1:14db Huawei Technologies Co., Ltd. 
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0001 Linux Foundation 1.1 root hub

 
Note that the product portion of the ID has switched to 14db!
Also, if we examine the network interfaces with the 'ifconfig' command:

   ....
   eth2      Link encap:Ethernet  HWaddr 58:2c:80:13:92:63  
      inet addr:192.168.1.100  Bcast:192.168.1.255  Mask:255.255.255.0
      inet6 addr: fe80::5a2c:80ff:fe13:9263/64 Scope:Link
      UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
      RX packets:217735 errors:0 dropped:0 overruns:0 frame:0
      TX packets:169164 errors:0 dropped:0 overruns:0 carrier:0
      collisions:0 txqueuelen:1000 
      RX bytes:127797981 (121.8 MiB)  TX bytes:27456393 (26.1 MiB)
   ....

 
We got a new Ethernet device, eth2 (eth0 is the Ethernet port on my Alix board, and eth1 is my WiFi card).
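The two 'lsusb' listings above are enough for a script to tell which state the dongle is in. Below is a minimal sketch, assuming the E3131 IDs seen above (12d1:1f01 in storage mode, 12d1:14db in modem mode); the function name hilink_mode is hypothetical. It reads 'lsusb' output on standard input, so it can be tried without the hardware:

```shell
#!/bin/bash
# hilink_mode (hypothetical helper): read 'lsusb' output on stdin and print
# the state of the Huawei E3131, using the IDs observed above:
#   12d1:14db     -> "modem"   (network mode, eth2 appears)
#   12d1:1f01     -> "storage" (usb_modeswitch still needed)
#   anything else -> "absent"
hilink_mode() {
    local listing
    listing=$(cat)
    if printf '%s\n' "$listing" | grep -q 'ID 12d1:14db'; then
        echo modem
    elif printf '%s\n' "$listing" | grep -q 'ID 12d1:1f01'; then
        echo storage
    else
        echo absent
    fi
}

# Example: lsusb | hilink_mode
```

Calling usb_modeswitch only when this prints "storage" avoids poking a dongle that is already in modem mode.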

Putting the usb_modeswitch command in a script makes it very easy to call at any time...

pico ~/enablehilink.sh

 
Paste the following:

#!/bin/bash
# Switch the Huawei E3131 from storage mode (12d1:1f01) to modem mode (12d1:14db)
usb_modeswitch -v 12d1 -p 1f01 -V 12d1 -P 14db -M "55534243123456780000000000000a11062000000000000100000000000000"

 
And after saving, make it executable:

chmod +x ~/enablehilink.sh

 
Now we could build anything from something very crude, such as an endless loop executing the script periodically, to something less aggressive and more elegant that detects the dongle's presence, its IP address, pings the Internet... It is up to you...
Then add your script to your session init script, your login script, the runlevels... there are plenty of possibilities!

What I did was declare the eth2 interface (in my case) in the /etc/network/interfaces file; that way I ensure the interface is declared and enabled on the system (although in down status when the dongle is not plugged in or is in storage mode):

....
# The Huawei HiLink WAN interface
auto eth2
allow-hotplug eth2
iface eth2 inet dhcp
....

 
Now you can periodically monitor the interface state (up or down), or ping the modem's hardcoded default gateway address (192.168.1.1).
If, and only if, you get no ping reply or a down status, you run /root/enablehilink.sh.
If you keep reading, you'll see my combined watchdog script later on...
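The "if, and only if" rule above can be sketched like this. It is an illustration under assumptions: IFACE, HILINK_GW, SYSFS_NET and ENABLE_CMD are hypothetical variables (the last two exist mainly so the logic can be tested away from the Alix), the interface state is read from /sys/class/net/eth2/operstate, and only a "down" (or missing) interface counts as down, since operstate can also read "unknown" for drivers that do not report link state.

```shell
#!/bin/bash
# Sketch of the eth2 check: run enablehilink.sh if, and only if, eth2 is
# down (or missing) or the HiLink gateway does not answer pings.
# IFACE, HILINK_GW, SYSFS_NET and ENABLE_CMD are hypothetical settings;
# the defaults match the names and paths used in this article.
IFACE="${IFACE:-eth2}"
HILINK_GW="${HILINK_GW:-192.168.1.1}"
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"
ENABLE_CMD="${ENABLE_CMD:-/root/enablehilink.sh}"

# Succeeds when the interface exists, is not down, and the gateway answers.
hilink_ok() {
    local state
    # A missing operstate file means eth2 does not exist (unplugged or storage mode)
    state=$(cat "$SYSFS_NET/$IFACE/operstate" 2>/dev/null) || return 1
    # operstate may read "up" or "unknown" depending on the driver
    [ "$state" != "down" ] || return 1
    ping -W 1 -c 3 "$HILINK_GW" > /dev/null 2>&1
}

# Calls the usb_modeswitch script only when the check fails.
check_hilink() {
    if ! hilink_ok; then
        "$ENABLE_CMD" > /dev/null 2>&1
    fi
}

# Example: call check_hilink periodically from a loop or a cron job.
```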

Jessie, WiFi cards, and wpa_supplicant daemon...

Setting up WiFi cards has become easier for us Debian users. As new Debian versions roll out with newer kernels, and more and more drivers become available as packages, chances are that the WiFi experience is more 'plug and play' on Debian than on Windows, especially with old cards from well-known manufacturers.

There is, however, a chance that all the modern Debian literature about setting up a WiFi card fails miserably... and I certainly was not lucky with Jessie.

Although my cards are well known to the kernel, their modules load correctly, the firmware packages are installed, and related packages such as 'wireless-tools' and 'wpasupplicant' are installed too, setting up my WPA configuration in the /etc/network/interfaces file (as the official documentation suggests) was useless... yet another nasty surprise...

... Again, the 'brute-force' solution to the rescue:

From the old days of WiFi card setup, I remembered that we used to run the wpa_supplicant command manually, 'as is', to bind a wireless device to a WPA/WPA2 AP, keeping it running as a daemon so it would constantly try to find the AP, negotiate and establish the connection as soon as the AP was available.
Typically, all of that went into a script launched automatically at boot...
So, even in the systemd era, an old init.d script is going to solve all this mess.

A look at PCI devices with 'lspci':

alixadmin@alix:~$ lspci
....
00:0e.0 Network controller: Intel Corporation PRO/Wireless 2200BG [Calexico2] Network Connection (rev 05)
....

 
Jessie handles the card quite well at the kernel level (the trouble is further up the stack this time...), and the modules are nicely loaded:

alixadmin@alix:~$ lsmod | grep 2200
ipw2200               134358  0 
libipw                 29891  1 ipw2200
cfg80211              346002  2 libipw,ipw2200
lib80211               12829  2 libipw,ipw2200

 
Wireless tools are also working: we can easily identify our WiFi network interface with 'iwconfig':

root@alix:/home/alixadmin# iwconfig
....

 eth1      IEEE 802.11bg  ESSID:off/any  
      Mode:Managed  Channel:0  Access Point: Not-Associated   
      Bit Rate:0 kb/s   Tx-Power=20 dBm   Sensitivity=8/0  
      Retry limit:7   RTS thr:off   Fragment thr:off
      Encryption key:off
      Power Management:off
      Link Quality:0  Signal level:0  Noise level:0
      Rx invalid nwid:0  Rx invalid crypt:0  Rx invalid frag:0
      Tx excessive retries:0  Invalid misc:0   Missed beacon:0

....
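If you prefer not to parse 'iwconfig' output in a script, sysfs also tells which interfaces are wireless. A small sketch, assuming the usual sysfs layout in which wireless interfaces carry a 'wireless' directory and/or a 'phy80211' link under /sys/class/net/<iface>; wifi_ifaces and SYSFS_NET are hypothetical names:

```shell
#!/bin/bash
# wifi_ifaces (hypothetical helper): print the names of wireless network
# interfaces by looking at sysfs instead of parsing 'iwconfig' output.
# Wireless interfaces normally have a "wireless" directory (Wireless
# Extensions) and/or a "phy80211" link (cfg80211) under /sys/class/net/<iface>.
# SYSFS_NET is a hypothetical override, mainly useful for testing.
wifi_ifaces() {
    local base="${SYSFS_NET:-/sys/class/net}" dev
    for dev in "$base"/*; do
        if [ -e "$dev/wireless" ] || [ -e "$dev/phy80211" ]; then
            basename "$dev"
        fi
    done
}

# Example: wifi_ifaces
```

On a setup like this one it would be expected to print eth1.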

 
In theory, adding 'wpa-ssid' and 'wpa-psk' values to /etc/network/interfaces for the eth1 interface should suffice... but it doesn't for me... So... let's set up 'wpa_supplicant' manually!!!

First, create a file to hold your AP configuration. I keep it in a directory under /etc, so I ended up with something nice such as /etc/wpa_supplicant/wpa_supplicant.conf:

mkdir -p /etc/wpa_supplicant
pico /etc/wpa_supplicant/wpa_supplicant.conf 

 
And paste the following with your values:

network={
    ssid="MyAndroidTheteringDevice"
    proto=WPA2
    key_mgmt=WPA-PSK
    psk="My password"
}

 
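Now we can test this config directly from the command line ('-D wext' selects the Wireless Extensions driver, '-i' the interface, '-c' the configuration file, and '-d' turns on debug output):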

wpa_supplicant -D wext -i eth1 -c /etc/wpa_supplicant/wpa_supplicant.conf -d

 
You'll see the device connecting, but not getting an IP address... this is normal: we have to add eth1 to /etc/network/interfaces and set it up like this:

# The wireless WAN interface
auto eth1
allow-hotplug eth1
iface eth1 inet dhcp

 
Now, if you run the wpa_supplicant command again, it should work.
The truth is that with some APs it runs fine, but with my Android device as AP, DHCP sometimes gets stuck...
So, if you're having DHCP trouble, the easiest solution is to set a static IP address on eth1: connectivity is instantaneous and rock stable.
The Android tethering AP network is 192.168.43.0/24, leasing IPs around the middle of the range... so a configuration like this:

# The wireless WAN interface
auto eth1
allow-hotplug eth1
iface eth1 inet static
    address 192.168.43.254
    netmask 255.255.255.0
    network 192.168.43.0
    broadcast 192.168.43.255

 
works like a charm for me!

Now we can create a script to easily call wpa_supplicant with all its parameters, so I 'picoed' a /root/eth1_supplicant.sh script with the following contents:

#!/bin/bash

while :
do
    wpa_supplicant -D wext -i eth1 -c /etc/wpa_supplicant/wpa_supplicant.conf -d
    sleep 5
done

 
As you can see, the script runs wpa_supplicant in the foreground (there is no -B flag) and, if the process dies for whatever reason, runs it again after a 5-second pause, in an endless loop.
Ensure the script is executable:

pico ~/eth1_supplicant.sh
chmod +x ~/eth1_supplicant.sh
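One caveat with this loop: when an init.d script stops it, start-stop-daemon signals only the bash loop (the PID in the pid file), and since bash is not trapping SIGTERM it dies while its child, wpa_supplicant, keeps running. Below is a sketch of a loop that forwards the signal, assuming you are happy to replace the body of eth1_supplicant.sh with it; respawn is a hypothetical helper name:

```shell
#!/bin/bash
# respawn (hypothetical helper): run a command again and again, pausing
# between runs, like the loop in eth1_supplicant.sh, but forward SIGTERM/SIGINT
# to the running child so stopping the loop also stops the child.
respawn() {
    local pause="$1"
    shift
    local child=""
    # On a stop signal: kill the current child, reap it, then leave.
    trap 'if [ -n "$child" ]; then kill "$child" 2>/dev/null; wait "$child" 2>/dev/null; fi; exit 0' TERM INT
    while :; do
        "$@" &
        child=$!
        # 'wait' is interrupted by trapped signals, unlike a foreground command
        wait "$child"
        child=""
        sleep "$pause"
    done
}

# Example body for /root/eth1_supplicant.sh:
# respawn 5 wpa_supplicant -D wext -i eth1 -c /etc/wpa_supplicant/wpa_supplicant.conf -d
```

The command is started in the background and then waited for because bash only runs a trap once the current foreground command has finished, whereas the 'wait' builtin returns as soon as a trapped signal arrives.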

 
Now we are going to create an init.d script to handle our script as a system daemon and enable it at boot, so create an /etc/init.d/eth1_supplicant script and make it executable:

pico /etc/init.d/eth1_supplicant
chmod +x /etc/init.d/eth1_supplicant

 
The content of the init.d script may look like this:

#!/bin/sh

### BEGIN INIT INFO
# Provides:          eth1_supplicant
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: wpa_supplicant watchdog for eth1
# Description:       wpa_supplicant daemon watchdog for eth1
### END INIT INFO

# Using the lsb functions to perform the operations.
. /lib/lsb/init-functions
# Process name ( For display )
NAME="eth1_supplicant"
# Daemon name, where is the actual executable
DAEMON="/root/eth1_supplicant.sh"
# pid file for the daemon
PIDFILE="/var/run/eth1_supplicant.pid"

# If the daemon is not there, then exit.
test -x ${DAEMON} || exit 5


case "$1" in
	start)
  	# Checked the PID file exists and check the actual status of process
  	if [ -e ${PIDFILE} ]; then
   		status_of_proc -p ${PIDFILE} ${DAEMON} "$NAME process" && status="0" || status="$?"
   		# If the status is SUCCESS then don't need to start again.
   			if [ ${status} = "0" ]; then
    			exit # Exit
   			fi
  	fi
	# Start the daemon.
  	log_daemon_msg "Starting the process" "$NAME"
  	# Start the daemon with the help of start-stop-daemon
  	# Log the message appropriately
  	if start-stop-daemon --start --quiet --oknodo --pidfile ${PIDFILE} --make-pidfile --background --exec ${DAEMON} -- \
		; then
   		log_end_msg 0
  	else
   		log_end_msg 1
  	fi
    	;;
	stop)
	if [ -e ${PIDFILE} ]; then
		status_of_proc -p ${PIDFILE} ${DAEMON} "Stopping the $NAME process" && status="0" || status="$?"
   		if [ "$status" = 0 ]; then
    		start-stop-daemon --stop --quiet --oknodo --pidfile ${PIDFILE}
    		/bin/rm -rf ${PIDFILE}
   		fi
  	else
   	log_daemon_msg "$NAME process is not running"
   	log_end_msg 0
  	fi
    	;;
	restart)
  	# Restart the daemon.
  	$0 stop && sleep 2 && $0 start
  	;;
	status)
  	# Check the status of the process.
  	if [ -e ${PIDFILE} ]; then
   		status_of_proc -p ${PIDFILE} ${DAEMON} "$NAME process" && exit 0 || exit $?
  	else
   		log_daemon_msg "$NAME Process is not running"
   		log_end_msg 0
  	fi
  	;;
	reload)
  	# The daemon script does not trap SIGUSR1 (its default action would just
  	# kill the bash loop), so reload falls back to a full restart.
  	$0 restart
  	;;
	*)
	echo "Usage: $0 {start|stop|restart|reload|status}"
  	exit 2
    	;;

esac

 
We are almost done...
Since Jessie moved to systemd, getting our init.d script to work is different now: no insserv; instead, we do the following:

systemctl daemon-reload
systemctl enable eth1_supplicant
service eth1_supplicant start

 
Done! Now, as soon as the AP is in range, our system will associate with it and get connectivity.

The only thing missing is handling the default gateway as needed!!!

Gluing all together: a kind of 'nettest watchdog' daemon...

Once I'm sure that my WAN links are set up, and that they will connect (or reconnect) and gain Internet access as soon as a connection is available, the only missing part is taking control of the default gateway.

In Linux, it is very easy to define the default gateway for our system... just a single command:

route add default gw 192.168.43.1

 
And to change it again, it is as simple as this:

route del default gw 192.168.43.1
route add default gw 192.168.1.1
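A watchdog that switches gateways also needs to know which one is currently in use. Plain 'route' tries to resolve addresses into names, so its Gateway column may show a host name instead of 192.168.1.1; 'route -n' keeps everything numeric and lists the default route with destination 0.0.0.0. A minimal sketch (default_gw is a hypothetical name) that reads 'route -n' output on standard input:

```shell
#!/bin/bash
# default_gw (hypothetical helper): print the current default gateway.
# Expects 'route -n' output on stdin: with -n the default route is listed
# with destination and genmask 0.0.0.0, and the gateway stays numeric.
default_gw() {
    awk '$1 == "0.0.0.0" && $3 == "0.0.0.0" { print $2; exit }'
}

# Example: GWIP=$(route -n | default_gw)
```

When there is no default route at all it prints nothing, so callers should compare quoted strings rather than assume a value is present.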

 
OK, so now we could write a looping script that gathers connectivity and gateway availability information by sending pings to Internet servers and to the gateway addresses.

In my case, the script is designed to let my Alix board operate normally over its USB dongle's LTE/4G connection, but switch immediately to the WiFi connection (my Android device's tethering) as soon as I switch it on.
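That preference can be written as a tiny function kept apart from the route commands, so it can be checked without touching the routing table. This is only a sketch: pick_gw is a hypothetical name, the two arguments are the 1/0 reachability flags of the WiFi and LTE/4G gateways, and the addresses are the ones used in this article:

```shell
#!/bin/bash
# pick_gw (hypothetical helper): the gateway preference as a pure function.
# $1 = WiFi gateway reachable (1/0), $2 = LTE/4G gateway reachable (1/0).
# Prints the gateway to use, or nothing when both are down.
WIFIGW="192.168.43.1"
LTEGW="192.168.1.1"

pick_gw() {
    if [ "$1" -eq 1 ]; then
        echo "$WIFIGW"      # tethering wins whenever the phone AP is up
    elif [ "$2" -eq 1 ]; then
        echo "$LTEGW"       # otherwise fall back to the LTE/4G dongle
    fi
}

# Example: TARGET=$(pick_gw "$WIFIGWOK" "$LTEGWOK")
```

The caller would then compare the result with the current default gateway and only run 'route del'/'route add' when they differ.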

Basically, the script reacts to the gathered data by altering the default gateway, launching usb_modeswitch, or restarting the VPN process... but you could easily adapt and expand it to do whatever you need.
I wanted the script to log changes and actions to a log file (NOTE: I use the 'ts' command to generate timestamps, so if you like it, you'll have to install the 'moreutils' package)... here is the idea:

#!/bin/bash

# Log a message before exiting on a kill signal (SIGTERM is what
# start-stop-daemon sends when the init.d script stops us)
trap 'before_exit; exit' SIGINT SIGQUIT SIGTERM

before_exit()
{
echo "Received the kill signal...exiting script, bye!" | ts  | tee /dev/fd/3
}

# This shell script requires 'ts' command from
# 'moreutils' package to print timestamps to
# log file.
LOGFILE="/var/log/nettest.log"

# retry initial counters
NETRTRY=0
VPNRTRY=0

# Retry max values before action
MAXNETRTRY=10
MAXVPNRTRY=10

# loop control sleep value
SLEEP=0.4

#Gateways to monitor
LTEGW="192.168.1.1"
WIFIGW="192.168.43.1"

#IPs to check connectivity
WANIP="8.8.8.8"
VPNIP="10.0.112.1"

#Status variables
WANOK=0
VPNOK=0
LTEGWOK=0
WIFIGWOK=0

#Initial Gateway IP
GWIP="127.0.0.1"

# Setting logfile.
# If you want to debug and send output ALSO to console
# replace  | ts
# with  | ts | tee /dev/fd/3
# at the end of echo lines
if [ ! -f ${LOGFILE} ]; then
    echo "Log File not found... trying to create"
    touch ${LOGFILE}
    if [ ! -f ${LOGFILE} ]; then
        echo "Unable to create log file...exiting"
        exit 1
    fi
fi

exec 3>&1 1>>${LOGFILE} 2>&1

sleep $SLEEP

echo ""
echo ""
echo "Starting nettest Script..." | ts | tee /dev/fd/3

while :
do
    # -n keeps addresses numeric; match the 0.0.0.0 (default) destination
    GWIP=$(route -n | awk '$1 == "0.0.0.0" {print $2; exit}')

    ping -W 1 -c 3 ${LTEGW} &> /dev/null
    if [ $? -ne 0 ]; then
        if [ ${LTEGWOK} -ne 0 ]; then
            echo "Connection to LTE/4G GW is lost!" | ts
            LTEGWOK=0
        fi
    else
        if [ ${LTEGWOK} -ne 1 ]; then
            echo "Connection to LTE/4G GW is up!" | ts
            LTEGWOK=1
        fi
    fi

    ping -W 1 -c 3 ${WIFIGW} &> /dev/null
    if [ $? -ne 0 ]; then
        if [ ${WIFIGWOK} -ne 0 ]; then
            echo "Connection to WIFI GW is lost!" | ts
            WIFIGWOK=0
        fi
    else
        if [ ${WIFIGWOK} -ne 1 ]; then
            echo "Connection to WIFI GW is up!" | ts
            WIFIGWOK=1
        fi
    fi

    ping -W 1 -c 3 ${WANIP} &> /dev/null
    if [ $? -ne 0 ]; then
        if [ ${WANOK} -ne 0 ]; then
            echo "Connectivity to Internet is lost!" | ts
            WANOK=0
        fi
    else
        if [ ${WANOK} -ne 1 ]; then
            echo "Connectivity to Internet is up!" | ts
            WANOK=1
        fi
    fi

    ping -W 1 -c 3 ${VPNIP} &> /dev/null
    if [ $? -ne 0 ]; then
        if [ ${VPNOK} -ne 0 ]; then
            echo "Connectivity to VPN is lost!" | ts
            VPNOK=0
        fi
    else
        if [ ${VPNOK} -ne 1 ]; then
            echo "Connectivity to VPN is up!" | ts
            VPNOK=1
        fi
    fi  


    if [ ${WIFIGWOK} -eq 1 ]; then
        # Quoted so an empty GWIP (no default route yet) still triggers the switch
        if [ "${GWIP}" != "${WIFIGW}" ]; then
            echo "Switching GW to wifi" | ts
            route del default gw $LTEGW
            route add default gw $WIFIGW
        fi
    elif [ $LTEGWOK -eq 1 ]; then
        if [ "${GWIP}" != "${LTEGW}" ]; then
            echo "Switching GW to LTE/4G" | ts
            route del default gw $WIFIGW
            route add default gw $LTEGW
        fi
    else
        echo "WARNING!!!! All Gateways are down!!!!" | ts
    fi


    if [ $VPNOK -eq 0 ]; then
        if [ $WANOK -eq 0 ]; then
            NETRTRY=$((NETRTRY + 1))
            sleep 0.5
            if [ $NETRTRY -eq $MAXNETRTRY ]; then
                echo "WARNING!!! WAN test retrying Give up! Force usb_modeswitch" | ts
                /root/enablehilink.sh > /dev/null 2>&1
                sleep $SLEEP
                NETRTRY=0
            fi
        else
            VPNRTRY=$((VPNRTRY + 1))
            sleep 2
            if [ $VPNRTRY -eq $MAXVPNRTRY ]; then
                echo "WARNING!!! VPN test retrying Give up! restart VPN" | ts
                /etc/init.d/openvpn restart > /dev/null 2>&1
                sleep $SLEEP
                VPNRTRY=0
            fi
        fi
    else
        VPNRTRY=0
        NETRTRY=0
        sleep $SLEEP
    fi
done
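The script depends on 'ts' from moreutils. If you'd rather not install the package, a rough stand-in is possible; this is a sketch, not moreutils' implementation, and ts_fallback is a hypothetical name. It prefixes each line with the local time in the '%b %d %H:%M:%S' layout that ts uses by default:

```shell
#!/bin/bash
# ts_fallback (hypothetical helper): rough stand-in for moreutils' 'ts'.
# Prefixes every input line with the local time, using the same
# "%b %d %H:%M:%S" layout that ts uses by default.
ts_fallback() {
    local line
    # "|| [ -n ... ]" also handles a last line without a trailing newline
    while IFS= read -r line || [ -n "$line" ]; do
        printf '%s %s\n' "$(date '+%b %d %H:%M:%S')" "$line"
    done
}

# Example: echo "Starting nettest Script..." | ts_fallback
```

It forks 'date' once per line, which is fine for the handful of events this script logs.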

 
I'm aware the script could, of course, be enhanced and polished... but for me it performs its task perfectly.
Again, what I do is paste this script into /root/nettest.sh and ensure it is executable:

pico ~/nettest.sh
chmod +x ~/nettest.sh

 
Then, let's wrap this script in a classic init.d script again, so I can manage it as a service, start it and have it started at boot.
So create an /etc/init.d/nettest script (or whatever name you like) and make it executable:

pico /etc/init.d/nettest
chmod +x /etc/init.d/nettest

 
The content of the init.d script may look like this:

#!/bin/sh

### BEGIN INIT INFO
# Provides:          nettest
# Required-Start:    $remote_fs $syslog
# Required-Stop:     $remote_fs $syslog
# Default-Start:     2 3 4 5
# Default-Stop:      0 1 6
# Short-Description: hilink net test
# Description:       network connection watchdog for hilink
### END INIT INFO

# Using the lsb functions to perform the operations.
. /lib/lsb/init-functions
# Process name ( For display )
NAME="nettest"
# Daemon name, where is the actual executable
DAEMON="/root/nettest.sh"
# pid file for the daemon
PIDFILE="/var/run/nettest.pid"

# If the daemon is not there, then exit.
test -x ${DAEMON} || exit 5


case "$1" in
    start)
    # Checked the PID file exists and check the actual status of process
    if [ -e ${PIDFILE} ]; then
        status_of_proc -p ${PIDFILE} ${DAEMON} "$NAME process" && status="0" || status="$?"
        # If the status is SUCCESS then don't need to start again.
            if [ ${status} = "0" ]; then
                exit # Exit
            fi
    fi
    # Start the daemon.
    log_daemon_msg "Starting the process" "$NAME"
    # Start the daemon with the help of start-stop-daemon
    # Log the message appropriately
    if start-stop-daemon --start --quiet --oknodo --pidfile ${PIDFILE} --make-pidfile --background --exec ${DAEMON} -- \
        ; then
        log_end_msg 0
    else
        log_end_msg 1
    fi
        ;;
    stop)
    if [ -e ${PIDFILE} ]; then
        status_of_proc -p ${PIDFILE} ${DAEMON} "Stopping the $NAME process" && status="0" || status="$?"
        if [ "$status" = 0 ]; then
            start-stop-daemon --stop --quiet --oknodo --pidfile ${PIDFILE}
            /bin/rm -rf ${PIDFILE}
        fi
    else
    log_daemon_msg "$NAME process is not running"
    log_end_msg 0
    fi
        ;;
    restart)
    # Restart the daemon.
    $0 stop && sleep 2 && $0 start
    ;;
    status)
    # Check the status of the process.
    if [ -e ${PIDFILE} ]; then
        status_of_proc -p ${PIDFILE} ${DAEMON} "$NAME process" && exit 0 || exit $?
    else
        log_daemon_msg "$NAME Process is not running"
        log_end_msg 0
    fi
    ;;
    reload)
    # The daemon script does not trap SIGUSR1 (its default action would just
    # kill the bash loop), so reload falls back to a full restart.
    $0 restart
    ;;
    *)
    echo "Usage: $0 {start|stop|restart|reload|status}"
    exit 2
        ;;

esac

 
So, all that is left now is to make use of the new script: add it to the available system services, enable it at boot and start it:

systemctl daemon-reload
systemctl enable nettest
service nettest start

 
And here we go...
Try unplugging the dongle, turning off the AP, etc.: the script reacts accordingly and logs to the file:

root@alix:~# tail /var/log/nettest.log
Jun 29 19:50:11 Connectivity to VPN is lost!
Jun 29 19:50:31 Connectivity to VPN is up!
Jun 29 20:20:14 Connectivity to VPN is lost!
Jun 29 20:20:33 Connectivity to Internet is lost!
Jun 29 20:20:36 Connectivity to VPN is up!
Jun 29 20:20:45 Connectivity to Internet is up!
Jun 29 20:30:17 Connectivity to Internet is lost!
Jun 29 20:30:33 Connectivity to Internet is up!
Jun 30 14:48:28 Connectivity to Internet is lost!
Jun 30 14:48:39 Connectivity to Internet is up!

 
The only detail left is handling the log file!

Rotating our logs.

This one is very straightforward: just add a logrotate.d script that handles the generated log file, /var/log/nettest.log. So:

pico /etc/logrotate.d/nettest

 
And paste the following:

/var/log/nettest.log
{
rotate 7
daily
size 1M
dateext
missingok
create 640 root adm
notifempty
compress
delaycompress
copytruncate
}

 
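Since nettest.sh keeps the log file open (the exec redirection at its start appends to it), 'copytruncate' is the key option here: the file is copied and then truncated in place instead of being moved away. As a consequence, the 'create' line has no effect, which is consistent with the root:root, 644 files below. Now logs will be rotated, so an 'ls -l' of /var/log will include, among the rest of the log files, the following: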

....
-rw-r--r-- 1 root        root   21746 Jun 30 14:48 nettest.log
-rw-r--r-- 1 root        root    1195 Jun 26 15:33 nettest.log-20160626
-rw-r--r-- 1 root        root   35489 Jun 27 16:24 nettest.log-20160627

 
That's it! Really fun!