9 Mysterious Wi-Fi and Network problems solved – Part 2


A common misconception about 7SIGNAL’s Wi-Fi monitoring platform is that it only provides insight into what is going on in the air and how this affects the Wi-Fi experience for users. While this is its primary goal, because it tests the network from the user’s perspective, it also uncovers a surprising number of network infrastructure problems that would otherwise go undetected by other products and access point vendors.

In this continuation of last month’s blog, we look at five cases where customers were able to expose lurking infrastructure and device issues by using 7SIGNAL.

NBA arena detects misconfigured Captive Portal causing loss of client access

At sports stadiums and arenas, it is typical for fans to get Internet access by connecting to the Guest Wi-Fi via a captive portal splash page. At multifunction venues, where one day it’s a sports event and another day a concert, the captive portal and even the guest SSID are frequently changed. Indeed, there may even be multiple guest networks, each sponsored by a different stakeholder.

Although the physical configuration of APs rarely changes, the logical configuration involving guest VLANs and captive portals changes as frequently as events come and go. So, for example, guests at a Katy Perry concert will join the Wi-Fi through an appropriately branded splash page, and there will be no mention of the sports team whose home is the venue. The problem is that, with many different groups responsible for different aspects of the network, mistakes are inevitably made.

In this instance, a misconfigured captive portal pointing to the wrong VLAN prevented fans from reaching the Internet. Client devices could connect to the APs and join the guest network via the captive portal splash screen, but after that they got nothing but dead web pages! This is because the Guest SSID was assigned to the wrong VLAN.

The configuration error was made the day before Game Day and was detected almost immediately by 7SIGNAL, because test end points were not able to pass data to, or ping, the 7SIGNAL sensors on the guest network, and this triggered alerts. This was a telltale sign of a network misconfiguration.
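The alerting behavior described above can be sketched roughly as follows. This is only an illustration, not 7SIGNAL’s implementation: the `ProbeResult` structure, the `should_alert` function, and the threshold of three consecutive failures are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    """One synthetic test run by a test end point (hypothetical structure)."""
    associated: bool     # the client joined the guest SSID
    portal_passed: bool  # the captive portal accepted the client
    ping_ok: bool        # the ping to the sensor succeeded


def should_alert(results: Iterable[ProbeResult],
                 max_consecutive_failures: int = 3) -> bool:
    """Return True once enough probes in a row fail to pass traffic.

    The threshold is an assumed value; requiring a short streak avoids
    alerting on a single lost ping.
    """
    streak = 0
    for result in results:
        if result.ping_ok:
            streak = 0
        else:
            streak += 1
            if streak >= max_consecutive_failures:
                return True
    return False


# The symptom from this case: clients associate and pass the portal,
# but no traffic gets through because of the wrong VLAN.
wrong_vlan = [ProbeResult(associated=True, portal_passed=True, ping_ok=False)] * 3
print(should_alert(wrong_vlan))
```

Note that in this sketch a probe that associates and passes the portal but cannot ping still counts as a failure, which is the combination that distinguishes a VLAN mistake from an RF problem.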

Hospital finds massively degraded Wi-Fi service after WLAN controller failover

In mission-critical environments like healthcare, it is standard practice to implement automated N:1 WLAN controller failover, preferably in a hot-standby configuration. So long as AP images and configurations are kept up to date and in sync, when the backup controller takes over control of APs from the one that has failed, you expect the performance for users to be about the same as it was before. Right?

After all, it’s not like the APs have been moved or the channel plan has been changed. So unless the controller is congested, no user should be any the wiser, apart from a momentary glitch. That’s the whole idea of automated disaster recovery!

So when network engineers at a large children’s hospital started hearing alarm bells from the 7SIGNAL system, promptly followed by complaints from medical staff that the Wi-Fi was incredibly slow, they were in for quite a shock. They were aware there had been a primary controller failover just a few minutes earlier, but they had no idea that the failed-over environment wasn’t performing as expected. Average latency was up by 50% and download throughput across the board dropped by almost 90% during periods of high and peak load, while idle behavior looked the same as normal.
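To make the arithmetic behind figures like these concrete, here is a minimal sketch of comparing current measurements against a pre-failover baseline. The function names, the alarm thresholds, and the sample numbers (20 ms and 100 Mbps as the baseline) are hypothetical; only the 50% and 90% changes come from the case above.

```python
def percent_change(baseline: float, current: float) -> float:
    """Signed change of `current` relative to `baseline`, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) * 100.0 / baseline


def is_degraded(baseline_latency_ms: float, latency_ms: float,
                baseline_mbps: float, mbps: float,
                latency_rise_pct: float = 25.0,
                throughput_drop_pct: float = 50.0) -> bool:
    """Flag a sample whose latency rose or throughput fell past the
    (assumed) thresholds relative to the baseline."""
    latency_delta = percent_change(baseline_latency_ms, latency_ms)
    throughput_delta = percent_change(baseline_mbps, mbps)
    return (latency_delta >= latency_rise_pct
            or throughput_delta <= -throughput_drop_pct)


# Invented sample values that reproduce the reported changes.
print(percent_change(20.0, 30.0))    # latency: +50%
print(percent_change(100.0, 10.0))   # throughput: -90%
print(is_degraded(20.0, 30.0, 100.0, 10.0))
```

Because idle behavior looked normal in this case, a comparison like this only reveals the problem if the baseline and the current samples are taken under similar load, for example at the same time of day.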

In the chart below you can see intermittent but dramatic drops in performance over the following days, while the backup controller was in service. These drops correlate to peak day-time hours.

This didn’t make sense. The configurations of the two controllers were supposed to be identical. But 7SIGNAL proved the fault lay with the backup controller, because the moment the engineers switched back to the original controller after it was returned to service, normal service levels were restored. This would have been extremely difficult to investigate without 24/7 monitoring and historical data.

The primary controller failure was diagnosed as a hardware issue: a faulty GBIC card was causing “interface flapping”. Once the card was replaced, the unit was returned to service. However, the reason the backup performed poorly remains a mystery.

University uncovers DHCP issue after port channel maintenance mistake

Port channels are used to bundle physical links into a channel “group” to create a single logical link. The purpose is to aggregate bandwidth, provide load balancing across ports, and provide redundancy against physical link failure. Configuration can be tricky because there are a lot of options, prerequisites and compatibility requirements to satisfy, so it is surprisingly easy to make mistakes.

This was the case at a university during a port channel maintenance exercise, which resulted in clients being unable to reach a DHCP server to obtain an IP address. Of course, that means zero Wi-Fi service for the affected students, soon to be followed by a barrage of support calls! Fortunately, thanks to 7SIGNAL, the problem was identified within a few minutes of being caused, because it triggered a series of alarms when test end points could not obtain an IP address from a DHCP server, as you can see from the 7SIGNAL dashboard screenshot below.

Airline finds lingering firewall congestion causing performance bottleneck

One of the services 7SIGNAL provides is WLAN optimization, where our engineers come on site and deploy some Wi-Fi sensors in your network. Then we monitor the results for a few days, and come back to you with observations and recommendations to dramatically improve WLAN capacity and performance. After that, we monitor the impact of those changes, and do the same again, up to three times.

As you would expect, the majority of our recommendations are Wi-Fi specific, but from time to time we also uncover previously undetected infrastructure-related problems that are impacting the Wi-Fi experience, as was the case at the US headquarters of one of the largest international airlines in the world.

During the first optimization phase, we noticed that our recommended changes improved uplink performance to the extent we expected. To our surprise, however, there was little or no effect on downlink performance.

To diagnose the problem, we decided to install three additional test end points in the DMZ, data center and internal LAN. This quickly revealed that the performance bottleneck was at the Checkpoint firewall. It was most likely caused by non-optimal Checkpoint IPS settings and/or the firewall being underpowered for current traffic loads. It’s impossible to tell how long this had been going on, undetected by all other management tools.
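The reasoning behind placing test end points at several points in the network can be sketched as a simple comparison of downlink throughput along the path. Everything below is hypothetical: the `locate_bottleneck` function, the ordering of the end points, the 50% drop threshold, and the throughput numbers are invented for illustration, since the actual topology and measurements are not given here.

```python
from typing import Optional, Sequence, Tuple


def locate_bottleneck(path: Sequence[Tuple[str, float]],
                      drop_ratio: float = 0.5) -> Optional[Tuple[str, str]]:
    """Find the first hop where downlink throughput falls sharply.

    `path` lists (end point name, downlink Mbps) ordered from the end
    point nearest the Wi-Fi client to the farthest. Returns the pair of
    adjacent end points between which throughput drops below
    `drop_ratio` of the nearer measurement, or None if no such drop exists.
    """
    for (near_name, near_mbps), (far_name, far_mbps) in zip(path, path[1:]):
        if near_mbps > 0 and far_mbps < near_mbps * drop_ratio:
            return near_name, far_name
    return None


# Invented measurements, assuming the firewall sits between the data
# center and the DMZ.
measurements = [("internal LAN", 90.0), ("data center", 85.0), ("DMZ", 20.0)]
print(locate_bottleneck(measurements))
```

A device sitting between the two end points returned by this comparison is the natural first suspect, which is how a firewall can be singled out even though it is not a Wi-Fi component.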

Hospital discovers Wi-Fi driver version issue, degrading WoW performance

One of the most common device related problems 7SIGNAL helps to uncover is different performance levels between different versions of Wi-Fi device drivers on the same client device, or differences between different models of the same device.

However, before we introduced the Mobile Eye module for Windows, macOS, iOS and Android devices, these types of issues were rather hard to detect and isolate. Being able to drill down and compare device type and OS-specific trends was one of the big drivers for creating Mobile Eye in the first place.

Scenarios like the one described below are all too common, but when you are running blind, without the means to take regular snapshots of Wi-Fi performance at the device level, such problems go undetected.

Case in point: a hospital was in the midst of a large-scale roll-out of workstations on wheels (WoWs) to support Electronic Medical Records at the bedside when they ran into what seemed like a show stopper. They found that about 30% of the stations performed very poorly while others performed well, even in the same location. In addition to slow retrieval of patient records, especially images, they also found that the connection to the EPIC system kept timing out, requiring the nurse or doctor to log in again and again as they did their rounds.

They decided to install the Mobile Eye module on all these stations and soon learned what was going on. The slow stations were failing to roam to the best available access point, until they completely lost their connection to the first access point they were connected to, at which point they had to connect afresh to the nearest AP, which explains why EPIC was timing out.
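The “sticky client” behavior described above can be expressed as a simple rule over signal-strength samples. This is a hedged sketch only: it does not reflect how Mobile Eye works internally, and the `is_sticky` function, the 10 dB margin, and the three-sample window are assumptions chosen for illustration.

```python
from typing import Iterable, Tuple


def is_sticky(samples: Iterable[Tuple[int, int]],
              margin_db: int = 10,
              min_samples: int = 3) -> bool:
    """Flag a client that stays on a weak AP while a stronger one is in range.

    Each sample is (RSSI of the connected AP, RSSI of the strongest
    neighboring AP), both in dBm. The client is considered sticky when the
    neighbor is stronger by at least `margin_db` for `min_samples`
    consecutive samples without a roam.
    """
    streak = 0
    for connected_dbm, neighbor_dbm in samples:
        if neighbor_dbm - connected_dbm >= margin_db:
            streak += 1
            if streak >= min_samples:
                return True
        else:
            streak = 0
    return False


# Invented samples: a cart moving down the ward without roaming.
cart = [(-60, -70), (-75, -60), (-80, -58), (-85, -55)]
print(is_sticky(cart))
```

A healthy client would roam soon after the neighbor becomes clearly stronger, which resets the streak in this sketch before the threshold is reached.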

On the management side, this was characterized by a significant increase in retransmissions and airtime utilization as carts were moved from one end of the ward to the other, which could be seen in the 7SIGNAL dashboard. In addition, the performance charts for each WoW clearly showed download throughput falling. In contrast, when these stations were not moved and the other, higher-performing WoWs were moved instead, no such problems were seen.

So what was different about these two sets of WoWs? It turned out that the Wi-Fi driver on the slower stations was a more recent version than the one used on the faster stations. As soon as the drivers were replaced with the older version, the problem went away.

Summary

Many enterprises are content with their Wi-Fi being “good enough”, and make do with the standard network management and troubleshooting tools that only give them a narrow field of view. And that’s fine, until things go wrong. As these nine examples demonstrate, there are many types of Wi-Fi and infrastructure related problems that manifest in a poor Wi-Fi user experience, which simply cannot be detected using standard network management and troubleshooting tools.

Just like card magic, if you don’t have the right viewpoint, how the trick is performed is a mystery. The danger is that when the S$%#$% hits the fan, what you don’t pay for in tools, you pay for in reduced end-user productivity and in opportunity cost, as network engineers waste hours and hours troubleshooting in vain because they are looking in the wrong place to begin with.

Worse, a lot of enterprises end up retiring APs prematurely at great expense, thinking they have run out of steam, when in truth they were simply under-performing for a vast array of reasons, all of which could have been corrected if only they had the tools to provide a 360-degree view and interpret what’s really going on.

Whether you’re in an enterprise, a school or a mission-critical environment, fast, reliable Wi-Fi is now essential to operations, since it is the default way everyone connects to everything they do online. In this day and age, it is therefore wise to arm yourself with the very best Wi-Fi monitoring tools from the start. Then, when something goes wrong, you can see all the angles, record and slow down the action, quickly deconstruct Wi-Fi tricks as they unfold, and fix issues before they impact your users.

Learn more about wireless network monitoring from 7SIGNAL.

 
