Client roaming and scripts

Client roaming is a relatively basic feature of wireless networks. When multiple APs are available, it is expected that a client station (STA) will always be connected to the best possible AP, but this is not always the case. Poor client roaming decisions are a common occurrence and often lead to poor performance and user frustration.

Today we discuss client roaming, and also provide an automated way to verify roaming on a client device.

Looking at the image above, what should be happening is as follows:

  • While in the coverage area of AP1, the STA is associated to AP1
  • The STA moves away from AP1, and towards AP2
  • The STA monitors various connection metrics (e.g. signal strength, noise) and, if they degrade sufficiently, starts scanning for a stronger signal
  • The STA hears a response from AP2 and associates to AP2. Ideally, this happens right as the STA passes into the coverage area of AP2
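
The decision made in the last two steps can be sketched in a few lines of Python. This is only an illustration of the idea: the function name, the scan threshold and the roam margin are assumptions made for this example, and real client drivers use their own (often undocumented) values and additional metrics.

```python
# Hypothetical thresholds, chosen only for illustration.
SCAN_THRESHOLD_DBM = -70  # start scanning once the current AP is weaker than this
ROAM_MARGIN_DB = 8        # a candidate must be this much stronger to justify a roam


def choose_ap(current_bssid, current_rssi, candidates):
    """Return the BSSID the station should be associated to.

    candidates maps BSSID -> RSSI in dBm for the other APs heard in a scan.
    """
    if current_rssi >= SCAN_THRESHOLD_DBM:
        return current_bssid  # signal still good enough: no scan, no roam
    if not candidates:
        return current_bssid  # nothing else was heard
    best_bssid = max(candidates, key=candidates.get)
    if candidates[best_bssid] >= current_rssi + ROAM_MARGIN_DB:
        return best_bssid     # clearly better AP found: roam
    return current_bssid


print(choose_ap("AP1", -60, {"AP2": -40}))  # AP1
print(choose_ap("AP1", -75, {"AP2": -55}))  # AP2
```

Note that in the first call the station stays on AP1 although AP2 is considerably stronger, simply because the signal from AP1 never falls below the scan threshold.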

However, what sometimes happens instead is this:

  • While in the coverage area of AP1, the STA is associated to AP1
  • The STA moves away from AP1, and towards AP2
  • The STA monitors various connection metrics, but this does not result in the STA attempting to find a better AP
  • The STA is now in the coverage area of AP2, but still associated to AP1, with a sub-optimal connection (image below)

This is referred to as the sticky-client problem, and it happens more often than we would like.

A number of things can cause this type of behaviour, starting with poor design. At a design level, it could be a case of the AP being configured with too much transmit power. The result is that even though the client is within range of a better AP, the signal from the original AP is still sufficiently strong that the client does not see any need to roam. In such a case, the design of the RF environment itself is the culprit. Wireless designers should aim to create RF environments which ensure the smallest possible chance of undesirable client behaviour.

Another possible cause is the roaming algorithm of the device itself. Some devices are just better at roaming than others. Here the network administrator has the option of deploying roaming enhancements on the infrastructure side. Two standards-based options are 802.11k and 802.11v. Both of these enhancements provide the client device with a list of possible candidate APs to roam to. However, the device needs to support the protocols, and also act on them.

The key part here is that it is the responsibility of the client device to make the roaming decision. Short of actively disconnecting the client from the network, all other attempts to influence client roaming behaviour are exactly that: attempts to influence. In the end, if the device refuses to trigger an association to a better AP, there is not a great deal that can be done about it. Of course, this is sometimes a difficult conversation to have with a wireless network user who is experiencing a “Wi-Fi problem”.

So how do we tell which it is? Is the client device simply not roaming, or is it not roaming because of something that was caused by poor design? To answer this we need to look at what is happening on the client.

Most operating systems will have a command to display metrics for the current wireless connection. In the case of Windows we have netsh wlan show interfaces, which gives us the output below:

C:\>netsh wlan show interfaces

There is 1 interface on the system:

    Name : Onboard 802.11ac
    Description : Broadcom 802.11ac Network Adapter
    GUID : 41fe5a4e-57a4-4745-8c2c-ee2b6c8ff13b
    Physical address : 28:c2:dd:e2:3f:a3
    State : connected
    SSID : TestNet
    BSSID : 70:d3:79:e0:8e:ee
    Network type : Infrastructure
    Radio type : 802.11ac
    Authentication : WPA2-Personal
    Cipher : CCMP
    Connection mode : Profile
    Channel : 64
    Receive rate (Mbps) : 1300
    Transmit rate (Mbps) : 600
    Signal : 85%
    Profile : TestNet

In the case of Linux we have the iwconfig command:

$ iwconfig wlan1
wlan1 IEEE 802.11AC ESSID:"TestNet" Nickname:""
Mode:Managed Frequency:5.32 GHz Access Point: 70:D3:79:E0:8E:EE
Bit Rate:400 Mb/s Sensitivity:0/0
Retry:off RTS thr:off Fragment thr:off
Power Management:off
Link Quality=97/100 Signal level=63/100 Noise level=0/100
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0

You will notice that while both of the commands provide basic wireless information, the signal level, and in the case of Linux also the signal quality, are provided as a % rather than in dBm.

For Linux this depends on the wireless adapter and driver used. The example below is from a Raspberry Pi using the onboard Wi-Fi interface. Using the same iwconfig command, the signal is shown in dBm.

pi@raspberrypi:~ $ iwconfig wlan0
wlan0 IEEE 802.11 ESSID:"TestNet"
Mode:Managed Frequency:5.32 GHz Access Point: 70:D3:79:E0:8E:EE
Bit Rate=200 Mb/s Tx-Power=31 dBm
Retry short limit:7 RTS thr:off Fragment thr:off
Power Management:on
Link Quality=60/70 Signal level=-50 dBm
Rx invalid nwid:0 Rx invalid crypt:0 Rx invalid frag:0
Tx excessive retries:0 Invalid misc:0 Missed beacon:0

Attempts have been made to convert the Signal level % back into a dBm value. The publicly available formulas can return inconsistent results, so we will avoid using them here.

While having this as a percentage is certainly not ideal (wireless engineers speak fluent dBm…), having any understanding of how the client is perceiving the signal level still gives us some insight into the client behaviour.

The next step is to display the output of these commands over time. This can be done by repeating the commands and optionally logging the output for later analysis. Numerous scripts exist within the broader wireless community that take this approach to analysing client roaming. Below is a sample created for Windows using Python 3.

import re
import subprocess
import sys
import time

# try/except block for a clean exit on CTRL+C
try:
    # loop indefinitely
    while True:
        # capture the output of the netsh wlan show interfaces command
        output = subprocess.check_output('netsh wlan show interfaces').decode('ascii', errors='replace')
        # search the output for SSID, BSSID, channel, TX rate and signal
        ssid = 'SSID:' + re.search(r'(SSID.+?:\s)(.+)', output).group(2).rstrip()
        bssid = 'BSSID:' + re.search(r'(BSSID.+?:\s)(.+)', output).group(2).rstrip()
        channel = 'Channel:' + re.search(r'(Channel.+?:\s)(.+)', output).group(2).rstrip()
        txrate = 'TX:' + re.search(r'(Transmit rate.+?:\s)(.+)', output).group(2).rstrip()
        signal = 'Signal:' + re.search(r'(Signal.+?:\s)(.+)', output).group(2).rstrip()
        # print the results to screen
        print(ssid, bssid, channel, signal, txrate)
        # pause between samples so netsh is not spawned in a tight loop
        time.sleep(1)
except KeyboardInterrupt:
    sys.exit()

The same approach can be taken for the Linux output.

import re
import subprocess
import sys
import time

# try/except block for a clean exit on CTRL+C
try:
    # loop indefinitely
    while True:
        # capture the output of the iwconfig wlan0 command
        output = subprocess.check_output('iwconfig wlan0', shell=True).decode('ascii', errors='replace')
        # search the output for SSID, BSSID, frequency, signal and TX rate
        # (non-greedy ESSID match so a trailing Nickname:"" field is not captured;
        # [=:] because drivers differ in the separator they print)
        ssid = 'SSID:' + re.search(r'(ESSID:")(.+?)(")', output).group(2)
        bssid = 'BSSID:' + re.search(r'(Access Point: )(\S+)', output).group(2)
        channel = 'Channel:' + re.search(r'(Frequency[=:])(.+?GHz)', output).group(2)
        signal = 'Signal:' + re.search(r'(Signal level=)(-?\d+(?:/\d+| dBm))', output).group(2)
        txrate = 'TX:' + re.search(r'(Bit Rate[=:])([\d.]+ \S+)', output).group(2)
        # print the results to screen
        print(ssid, bssid, channel, signal, txrate)
        # pause between samples so iwconfig is not spawned in a tight loop
        time.sleep(1)
except KeyboardInterrupt:
    sys.exit()

In both cases we return a repeating output similar to this:

C:\>roam.windows.simple.py
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260
SSID:LabNet BSSID:34:ed:1b:a5:6f:4f Channel:116 Signal:84% TX:260

Running the script while moving around with the device will show the signal changing along with the data rate. When the client has moved sufficiently far away from the AP (and hopefully within range of a better AP), the client should roam and the BSSID will change.
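
When the output has been captured to a file, spotting the roams by eye becomes tedious. The sketch below scans lines in the format printed by the scripts above and reports every BSSID change together with the last sample before and the first sample after it. The function name find_roams is made up for this example and is not part of the posted scripts.

```python
import re

# Matches one line printed by the roaming scripts (Windows or Linux variant).
LINE_RE = re.compile(
    r"SSID:(?P<ssid>.*?) BSSID:(?P<bssid>\S+) Channel:(?P<channel>.*?) "
    r"Signal:(?P<signal>.*?) TX:(?P<tx>.*)"
)


def find_roams(lines):
    """Yield (last_sample_before, first_sample_after) for every BSSID change."""
    previous = None
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip blank or malformed lines
        sample = match.groupdict()
        if previous is not None and sample["bssid"] != previous["bssid"]:
            yield previous, sample
        previous = sample


log = [
    "SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:92% TX:400",
    "SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:74% TX:400",
    "SSID:TestNet BSSID:70:d3:79:e0:8e:ef Channel:64 Signal:92% TX:400",
]
for before, after in find_roams(log):
    print(f"roam {before['bssid']} ({before['signal']}) -> "
          f"{after['bssid']} ({after['signal']})")
# roam 70:d3:79:d6:8c:ef (74%) -> 70:d3:79:e0:8e:ef (92%)
```

To analyse a saved capture, pass an open file object instead of the list, since the function only needs an iterable of lines.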

These simple versions of the scripts are available on GitHub.

roam.windows.simple.py

roam.linux.simple.py

Additionally, more robust versions of the scripts are also posted.

roam.windows.py

roam.linux.py

The Windows script includes simple data verification to prevent crashing when the wireless adapter is disassociated, and also saves the data to a CSV. For the Linux script there is improved regex matching, as well as a mapping function which displays the channel number as opposed to the current frequency. Feel free to use/edit/merge these as required.
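
To give an idea of what such hardening can look like on the Linux side, here is a minimal sketch. It is not the code from the posted scripts; the names ghz_to_channel, first_group and parse_iwconfig are made up for this example. It returns None for fields that are missing (for instance when the adapter is not associated) instead of raising an exception, accepts both the ":" and "=" separators seen in the iwconfig outputs earlier, and maps 2.4 GHz and 5 GHz centre frequencies to channel numbers.

```python
import re


def ghz_to_channel(freq_ghz):
    """Map a 2.4 GHz or 5 GHz centre frequency (in GHz) to its channel number."""
    mhz = round(freq_ghz * 1000)
    if mhz == 2484:
        return 14
    if 2412 <= mhz <= 2472:
        return (mhz - 2407) // 5
    if 5160 <= mhz <= 5885:
        return (mhz - 5000) // 5
    return None  # outside the bands handled in this sketch (e.g. 6 GHz)


def first_group(pattern, text):
    """Return group 1 of the first match, or None instead of raising."""
    match = re.search(pattern, text)
    return match.group(1).strip() if match else None


def parse_iwconfig(output):
    """Extract the fields of interest from iwconfig output as a dict."""
    freq = first_group(r"Frequency[:=]\s*([\d.]+)\s*GHz", output)
    return {
        "ssid": first_group(r'ESSID:"([^"]*)"', output),
        "bssid": first_group(r"Access Point:\s*([0-9A-Fa-f:]{17})", output),
        "channel": ghz_to_channel(float(freq)) if freq else None,
        "signal": first_group(r"Signal level[:=]\s*(-?\d+(?:/\d+)?(?:\s*dBm)?)", output),
        "txrate": first_group(r"Bit Rate[:=]\s*([\d.]+\s*\S+)", output),
    }


sample = '''wlan0 IEEE 802.11 ESSID:"TestNet"
Mode:Managed Frequency:5.32 GHz Access Point: 70:D3:79:E0:8E:EE
Bit Rate=200 Mb/s Tx-Power=31 dBm
Link Quality=60/70 Signal level=-50 dBm'''
print(parse_iwconfig(sample))
```

For the Raspberry Pi output shown earlier, the 5.32 GHz frequency maps to channel 64, which matches the channel reported by Windows for the same BSSID.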

Now that we have some basic tools, what do we do with them?

Some wireless drivers allow for the configuration of roaming sensitivity/tendency/aggressiveness (the exact name of the feature and its options are driver dependent). The options can be found in Advanced driver settings. See images below from two different Windows clients.

If available, this can be used to fine-tune the roaming behaviour of the client. This is an often-overlooked setting. In fact, the management of wireless client configuration as a whole, including the updating of wireless drivers, unfortunately tends to be overlooked. The client is, after all, responsible for the other side of the wireless transmission.

We can compare the results from the highest and lowest roaming aggressiveness settings to demonstrate the difference in behaviour.

Highest roaming aggressiveness (Intel 8265):

SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:92% TX:400
SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:92% TX:400
SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:92% TX:400
SSID:TestNet BSSID:70:d3:79:d6:8c:ef Channel:36 Signal:74% TX:400
SSID:TestNet BSSID:70:d3:79:e0:8e:ef Channel:64 Signal:92% TX:400

The last metrics before the roam (the Signal:74% line) are in red. The client did not rate shift down, but the signal dropped from 92% to 74%, and this triggered a roam. The first metrics from the new AP are in green, where we can see the channel and BSSID change. This is very aggressive roaming behaviour.

Lowest roaming aggressiveness (Intel 8265):

SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:99% TX:400
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:99% TX:400
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:99% TX:400
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:83% TX:270
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:83% TX:270
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:83% TX:270
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:80% TX:300
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:80% TX:300
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:80% TX:300
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:81% TX:240
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:81% TX:240
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:81% TX:240
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:75% TX:240
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:75% TX:240
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:75% TX:240
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:180
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:180
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:180
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:120
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:120
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:120
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:135
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:135
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:67% TX:135
-lines omitted for brevity-
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:57% TX:162
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:57% TX:162
SSID:TestNet BSSID:00:38:df:1c:6c:ee Channel:100 Signal:57% TX:162
-lines omitted for brevity-
SSID:TestNet BSSID:70:d3:79:e0:8e:ee Channel:64 Signal:91% TX:400
SSID:TestNet BSSID:70:d3:79:e0:8e:ee Channel:64 Signal:91% TX:400
SSID:TestNet BSSID:70:d3:79:e0:8e:ee Channel:64 Signal:91% TX:400

In the lowest setting the roam takes significantly longer. We see the client rate shifting a number of times as the signal drops. The last measurements before and after the roam are in red and green respectively. In this case, the client was well within the coverage area of a better AP (for quite some time) before the roam was triggered. This type of roaming behaviour should ideally be avoided because at a lower data rate the client will be using more air-time than is necessary, which ultimately has an impact on all other clients in the radio cell.
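
A rough calculation shows why the lower rates matter. The snippet below computes only the time needed to put the payload bits of a 1500-byte frame on the air at three of the TX rates seen in the trace above. It deliberately ignores preambles, acknowledgements, contention, aggregation and retries, so the absolute numbers are not realistic; only the ratio between them is the point. The function name is made up for this example.

```python
def payload_airtime_us(payload_bytes, rate_mbps):
    """Microseconds needed to send the payload bits at the given PHY rate.

    1 Mbps is one bit per microsecond, so bits / Mbps gives microseconds.
    """
    return payload_bytes * 8 / rate_mbps


for rate in (400, 240, 120):
    print(f"{rate} Mbps: {payload_airtime_us(1500, rate):.0f} us")
# 400 Mbps: 30 us
# 240 Mbps: 50 us
# 120 Mbps: 100 us
```

Under these simplifications, the same frame occupies the channel more than three times as long at 120 Mbps as it does at 400 Mbps.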

Note that no network infrastructure settings were changed between these tests.

Taking this one step further, we can compare the results of the roaming script against the client’s scan results. Clients perform periodic scans to collect information about available wireless networks. For Windows, the command to list the results is netsh wlan show networks mode=Bssid.

SSID 3 : TestNet
    Network type : Infrastructure
    Authentication : WPA2-Personal
    Encryption : CCMP
    BSSID 1 : 70:d3:79:d6:8c:ee
        Signal : 30%
        Radio type : 802.11ac
        Channel : 36
        Basic rates (Mbps) : 36
        Other rates (Mbps) : 48 54
    BSSID 2 : 70:d3:79:e0:8e:ee
        Signal : 85%
        Radio type : 802.11ac
        Channel : 64
        Basic rates (Mbps) : 36
        Other rates (Mbps) : 48 54
    BSSID 3 : 00:38:df:1c:6c:ee
        Signal : 14%
        Radio type : 802.11ac
        Channel : 100
        Basic rates (Mbps) : 36
        Other rates (Mbps) : 48 54

We can see that for the TestNet SSID the client can actually “hear” three APs, along with basic information for each. Taking a similar approach (repeating the command) will give us a view of how the available BSSIDs change over time. The script below, also available on GitHub, does this and returns a summary of the available BSSIDs for the currently connected SSID.

scan.windows.simple.py

Running the two scripts side by side will provide a view of the currently connected BSSID, as well as all possible BSSIDs for the given SSID. The results can be compared to verify whether the client is connected to the strongest possible AP at any point in time.

In the image above, we can see that the client is in fact not connected to the strongest AP. The current association is on channel 36, while a better option exists on channel 64.
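
The same comparison can be automated. The sketch below parses scan output in the format shown earlier and reports whether a clearly stronger BSSID than the current one is available. It is not the posted scan script: the function names and the 10-point margin are assumptions for this example, and it assumes the text passed in contains only the SSID of interest.

```python
import re

BSSID_RE = re.compile(r"\s*BSSID \d+\s*:\s*(\S+)")
SIGNAL_RE = re.compile(r"\s*Signal\s*:\s*(\d+)%")
CHANNEL_RE = re.compile(r"\s*Channel\s*:\s*(\d+)")

SAMPLE = """\
SSID 3 : TestNet
    BSSID 1 : 70:d3:79:d6:8c:ee
        Signal : 30%
        Channel : 36
    BSSID 2 : 70:d3:79:e0:8e:ee
        Signal : 85%
        Channel : 64
    BSSID 3 : 00:38:df:1c:6c:ee
        Signal : 14%
        Channel : 100
"""


def parse_bssids(scan_output):
    """Return one dict (bssid, signal, channel) per BSSID block in the scan text."""
    entries = []
    for line in scan_output.splitlines():
        bssid = BSSID_RE.match(line)
        if bssid:
            entries.append({"bssid": bssid.group(1).lower()})
            continue
        if not entries:
            continue  # lines before the first BSSID block
        signal = SIGNAL_RE.match(line)
        channel = CHANNEL_RE.match(line)
        if signal:
            entries[-1]["signal"] = int(signal.group(1))
        elif channel:
            entries[-1]["channel"] = int(channel.group(1))
    return entries


def better_ap(entries, current_bssid, margin=10):
    """Return the strongest entry if it beats the current BSSID by at least margin points."""
    by_bssid = {e["bssid"]: e for e in entries if "signal" in e}
    current = by_bssid.get(current_bssid.lower())
    if current is None:
        return None  # current BSSID not present in the scan results
    best = max(by_bssid.values(), key=lambda e: e["signal"])
    if best["bssid"] != current["bssid"] and best["signal"] - current["signal"] >= margin:
        return best
    return None


better = better_ap(parse_bssids(SAMPLE), "70:d3:79:d6:8c:ee")
if better:
    print(f"stronger AP available: {better['bssid']} on channel "
          f"{better.get('channel')} ({better['signal']}%)")
# stronger AP available: 70:d3:79:e0:8e:ee on channel 64 (85%)
```

The margin avoids flagging an AP that is only marginally stronger, since small differences between scan results are normal.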

Further work is possible to combine all the scripts presented here into one. However, the main purpose was to provide an overview of how information can be collected on client devices. For reference, the Linux command to display scan results is iw <interface> scan dump. The Linux iw commands provide a much more comprehensive view compared to what is available on Windows.