MQTT BluemixDemo, intermittent "Operation not Supported" results from modem, when running non MBED

Solved
leroy's picture
leroy
Junior(1)

 

Hello,

I am working with the BluemixDemo sample project taken from the Avnet MBED repo @ https://developer.mbed.org/teams/Avnet/.  You will find the development platform, firmware revisions etc below.

 

I am seeing what looks to be intermittent issues with the modem returning an "Operation not Supported" error when running outside of the online mbed environment, that is when running under Kinetis Design Studio as well as when using the Mbed CLI tools.

 

I have modified the sample by commenting out the publishing to the MQTT broker. Thus the only activity would be the initial connect, subscribe and then the keepalive ping. The connect to the IP layer and the MQTT broker seem to be working ok.

 

During the failure the application is in a read loop sending  AT@SOCKREAD=1,1 to the modem. At first I get back good status from the modem. However after a bit, perhaps 1-5 minutes, the following error is returned:  "+CME ERROR: operation not supported"

 

Some background:

 

I've imported the BluemixDemo project into my Embed workspace, added some debug code, disabled any MQTT publishing from the demo so as to test the keepalive pings but otherwise the code is unchanged from the original sample.

 

Then I have taken the code and compiled it 3 different ways and see 2 different results.

 

  ******   Seeing Good Results  *****************
1. First Case: Online via the mbed compiler.

 

I've imported the project into my MBED environment, I did not update any dependency libraries. In this case everything looks fine. I have let the demo app run for up to 1-1/2 hours without any hiccups what so ever. The IP and MQTT connections remained connected, the keepalive pings were responded to as expected.

 

  ******  Seeing errors  *****************
2. Second Case: Off line via exporting my version of the demo from Mbed, then importing it into Kinetis Design Studio.
   - in this case to fix some compile time issues I need to add "mbed_config.h" to the compile options such that it is included in every compile statement. "mbed_config.h" appears to be am artifact built as a result of the export from mbed. It contains only the following:
        #define MBED_CONF_RTOS_PRESENT 1 // set by library:rtos

 

  - in this case when running the demo, the connection is made to IP and MQTT. The subscribe is sent and acknowledgement seen at the demo app side. Sometimes we are able to see 1 or more keepalive ping requests/responses before the modem returns the "operation not supported" status. This then results in the socket being closed and the connection torn down.

 

 - I have debug output enabled for the MQTT broker and do not see any broker initiated socket disconnects.

 

 - with KDS I tried the GCC Tool chain versions 4.8.2, 5.4.1, and 6.3.1. No appreciable difference was noted when trying different compiler versions.

 

3. Third Case: Off line via the MBED-cli tools, using the CLI tools to import the project and its dependency libraries onto my laptop.

 - in this case to get past a compile time error due to an undefined __wrap_exit() function, I had to update (via CLI tools) the mbed library code. None of the other libraries needed to be updated.

 

- the mbed library update also required an update to WNCInterface/WncControllerK64F/MODSERIAL/Device/MODSERIAL_KSDK.cpp due to a slight change in the definition of serial_t in the latest MBED library.

 

 - NOTE:  these same changes made to the MBED online code version did not change it's behavior, it still works fine

 

- the mbed_config.h change seen with KDS is not required here as there is no mbed_config.h in the resulting code tree

 

 - I see similar results to those seen with Kinetis Design Studio. The connections are made to the MQTT broker, we may get 1-4 successful ping requests/responses before we get the "Operation not supported".

 - It did not seem to matter the GCC toolchain version for this method either.

 

 Questions:

 - is there some command I can send to the modem when it returns "Operation not supported" in order to get meaningful status that might lead to a solution?

 - is there some command I might send prior to each read query that may show an impending problem (buffer full, signal degradation, etc...).

 

 

***********************  putty terminal LOG FILE: ****************************
Note:   >>> Leroy, my comments, not part of the log file
        >>> I have disabled the TX:/RX: dbgPuts() in WncController::at_send_wnc_cmd() to cut down on the amount of output
....
>>> Leroy:  this particular run was after 3 successful ping requests had been sent/received to the broker

MQTTClient sending PING Request to Broker. Count #: 4
[*] 285253814 <-------- Begin Cell Status ------------
[*] 285259410 mdmGetLine returns OK:   [*] 285259526 AT+CSQ+CSQ: 19,99OK
[*] 285265643 mdmGetLine returns OK:   [*] 285265759 AT+CPIN?+CPIN: READYOK
[*] 285273293 mdmGetLine returns OK:   [*] 285273410 AT+CREG?+CREG: 2,1,6A43,07F23311,7OK
[*] 285301648 ------------ WNC Ready ---------------->
[*] 285370498 mdmGetLine returns OK:   [*] 285400608 AT@SOCKWRITE=1,2,"c000"@SOCKWRITE:2OK
[*] 286276805 mdmGetLine returns OK:   [*] 286276924 AT@SOCKREAD=1,1@SOCKREAD:1,"D0"OK
[*] 286989828 mdmGetLine returns OK:   [*] 286989947 AT@SOCKREAD=1,1@SOCKREAD:1,"00"OK
MQTTClient received PING Response

>>> Leroy:  the command sent, some of this is debug statements that I added.
[*] 303809906 >>> mdmGetLine +CME ERROR, cmd:   [*] 303810036 AT@SOCKREAD=1,1

>>> Leroy:  the response from the command
[*] 303810152 >>>cmd returned:   [*] 303810260 ATSOCKREAD=1,1+CME ERROR: operation not supported

[*] 303810420 "at_send_wnc_cmd: returns:"    [*] 303810544 WNC_AT_CMD_ERRCME
[*] 303875609 readBuf: [*] 303899564
[*] 303916231 Closing open socket
[*] 303954773 Try to close and re-open socket

[*] 304003731 at_sockclose_wnc

>>> Leroy: reinitializing things.. , I'm not sure this is successful but have not been looking at this aspect
[*] 304045770 mdmGetLine returns OK:   [*] 304079773 AT@SOCKCLOSE=1OK
[*] 304113106 <-------- Begin Cell Status ------------
[*] 304172817 mdmGetLine returns OK:   [*] 304212065 AT+CSQ+CSQ: 20,0OK
[*] 304249394 mdmGetLine returns OK:   [*] 304288106 AT+CPIN?+CPIN: READYOK
[*] 304330792 mdmGetLine returns OK:   [*] 304368314 AT+CREG?+CREG: 2,1,6A43,07F23311,7OK
[*] 304422481 ------------ WNC Ready ---------------->

 

******************  Hardware config info   *****************************

development machine Linux 14.04 LTS 64 bit
Lenovo T61 laptop
3.8 GB memory
Intel core 2 Duo CPU T8100 @ 2,10GHZ

ATT IOT Starter Kit
FRDM-K64F Rev E
Firmware: DAPLink rev 0243:   0243_k20dx_frdmk64f_0x5000.bin

Modem:
  WNC M14A2A
  P/N 60M14A21.DG1
  S/N G1XDG3300171J01
  modem firmware: AT+GMRMPSS report : M14A2A_v11.50.164451 APSS: M14A2A_v11.53.164451OK

 The Avnet board is powered via the external 5 v adapter provided in the starter kit.

 The target mosquitto MQTT broker is running on my development laptop machine.

LightCollector's picture
LightCollector
Moderator(18)

Hi Leroy,

 

There are no additional AT commands to enable more details about the CME error.  This can periodically happen, the manufacturer stated it is a networking error and that's all they would tell us, the remedy is to close and re-open the socket.  So if that error occurs, the higher level protocol is also going to have to declare that transaction as incomplete.  I'm not familiar with how the MQTT handles this error.

 

Why does it happen when you re-compile with other tools (and have modified the code)?  The only thing I see in the evidence is that it occurs when there is a long delay between pings.  The log tells me the shorter interval pings are working but when the 15 second interval happens, that is when the error occurred.  If you are running your own MQTT server (which is another difference between your setup and the demo code) you should consider taking a look at what it is seeing, whether it got the ping at all and responded or it never got the ping at all etc.

 

Best regards

leroy's picture
leroy
Junior(1)

LightCollector,

It's too bad we cannot get more failure details.

 

- I do have my "real" app coded to handle the error and reconnect, though I will need to check I think there are a couple of changes needed to either the MQTT or the WNController libs in order to tryly tear down the connection. Something about closing the underlying socket, I will need to re-check it maybe ok in the latest lib versions.

 

- It could very well be that 15 sec pings would work, I have not tried that. Though that does seem like keepalive overkill, to have to ping so often. With the regular demo temp/humidity publish disabled such that the app does not publish at all we fall back on the MQTT default keepalive timeout of 60 seconds. (the MQTT layer handles this timeout on its own). This 60 second keepalive seems to work ok when compiled with the online compiler toolchain, but not with compiled with the offline tools.

 

- The MQTT broker is my local broker in all cases detailed. The pings do get to the broker in the good scenario. The broker sees a client disconnect prior to the pings being sent in the failure scenario. So, I think the broker side is fine.

 

- Something is different when compiling online vs offline mode. I've tried to keep everthing else the same. I am running under Linux, perhaps I will see if I can duplicate this under Windows 7, Windows 10 would be a touch challenge on my end.

 

Thanks,

          Leroy

jflynn129's picture
jflynn129
Moderator(2)

There is a difference between the online and CLI tools.  The CLI tools automatically pull in the latest mbed libraries while the online tools will only pull newer libraries when you specifically update them.

LightCollector's picture
LightCollector
Moderator(18)

Hi Leroy,  The point about the ping interval is the opposite of what you say in your response.  The shorter ping interval worked but when the interval was longer (according to the log about 15 seconds) that is when the CME error occured.

 

Best regards

leroy's picture
leroy
Junior(1)

I guess I'm confused about the "15 second" ping.

 

The ping I'm looking at is the app MQTT keepalive ping from client to the MQTT broker. I have this set to 60 seconds (which I believe is also the default). This is managed by the MQTTClient code compiled within the app. The MQTT data struct MQTTPacket_connectData defines the keepAliveInterval.

 

We may be talking different pings???

LightCollector's picture
LightCollector
Moderator(18)

Could be talking about different pings, I'm only looking at the truncated log you posted and the interval of the data being sent from the 14A2A.  Follow the time stamps in the log, was good for a few transactions that were closer together and then there was a longer pause and then CME error occurred,

 

Best regards

leroy's picture
leroy
Junior(1)

Those 6 mdmGetLine() debug outputs seen in the truncated log prior to the CME Error are normal modem responses to the MQTT keepalive ping request sent from the app to the MQTT broker. I have not decoded what they are telling us.

 

I see these same 6 messages (content may vary just a bit), 3 after "Begin call status" and 3 after "WNC ready", for each MQTT level keepalive ping that is sent to the broker. The keepalive timeout is set to 60 seconds. The last debug message I see for each successful MQTT ping is "MQTTClient received Ping Response". Then there is the appox 60 seconds of nothing being dumped to the log until the next ping is sent.

 

Thus I believe the longer pause you mention is just the normal quiet period prior to the sending of the next MQTT ping.  What is not normal is getting the CME Error during this "quiet" period.

 

I have a couple of things I can look at at this end, but it will be a few days before I can get back on this project.

- see if there is another code path for the WncController to be sending commands to the modem, other than what I have currently traced.

- re-enable the very verbose debug messages, perhaps there is something there that I missed.

- carefully decode the message being sent to/from the modem. Perhaps the last before failure is truely a bad operation.

-the linker file is a bit different when comparing online version with off line, but if the above do not show anything I'm not sure the app is the problem.

 

Is there perhaps any interrupt that the modem board may be capable of sending the K64F that is not currently being used that I might enable that may give a clue if the modem is having trouble?

 

thanks,

        Leroy

LightCollector's picture
LightCollector
Moderator(18)

Hi Leroy,

 

Some of what you mentioned looking into I wouldn't do at first.  I would take a look at the differences between the debugging output (for all levels not just the AT commands).  I would compare the logs between the mbed online compiled version and the offline versions that don't work.

 

Since it working is consistent for the online tools and the CME error is consistent for the other environments, a comparison of the log may help get to the bottom of what is going on.

 

The log of the WncControllerLibrary to the debug serial port, the 1st field is ticks in microseconds, which will give you a rough idea of when things are happening.  Extra debug mode for WncController just prints out the entire AT command string and response.  Not-extra prints out the command/response but truncates them up to so many characters, the truncation is done by omitting the middle of the string if it is over a certain length.

 

Best regards

leroy's picture
leroy
Junior(1)

Hi,

I have re-enabled the dumping of the TX and RX messages as sent/received from the modem. I've added some code to eliminate some of the RX/TX messages that are identical inorder to reduce the amount of logging output. I really do not see anything different between the online vs offline as far as the modem messages go, though I have not compared the timings.

 

I say "timings" because I did see the alternate sendCmd() api in WNCInterface/WncControllerK64F/WncController/WncController.cpp.

Currently used API:

sendCmd(const char * cmd, bool crLf)

Alternate API:

sendCmd(const char * cmd, unsigned n, unsigned wait_uS, bool crLf)

 

The alternate API method has this note in the version of library that I am using:  "// WNC used to have troubles handling full speed, seems to not need this now."

The newer API version sends commands to the modem via puts(), the alternate version uses putc(), with a delay between the putc() calls.

 

If I use the alternate (presumable older? API) then my connection does remain stable with the offliine compilers, I "do not" see the "Operation not supported" message, the socket remains connected and I get successful MQTT keepalive pings, for about an hour at least, with 60 second intervals.

 

This alternate API call that is commented out in the calling method was coded for a delay of 1000 msec, a bit long. I was able to cut that back to 75 msec and still get consistent results (no disconnect). If I lowered that to 50msec however, then the "Operation not supported" error returns.

 

I will see if I can get some timing results when runnig with the puts() version of the sendCmd API to see if perhaps  the online environment is calling puts() at a slower rate than is the offline environment.

 

Some progress..

  Leroy

jflynn129's picture
jflynn129
Moderator(2)

I believe the problems you are seeing using the off-line tools vs. the on-line tools have to do with which mbed libraries are being used.  The older WNC libraries are designed to work with the mbed v2 libraries.  When you compile on-line, you are compiling using the mbed v2 libraries unless you specifically update (which we say not to do in the release notes).  With the off-line tools, the CLI automatically updates the libraries to the latest versions; so you end up compiling WNC (mbed v2) with MBED O/S (mbed v5.x).  I have seen a lot of strangeness when doing this myself.  The only sure-fired way I know of to move to the off-line tools (using CLI, I have not used the Kiel tools) is to start with the latest release we have (https://developer.mbed.org/teams/Avnet/code/http-example-wnc/rev/0859fa4...) and add the MQTT functionallity to it.

LightCollector's picture
LightCollector
Moderator(18)

Hi Leroy,

 

I think you are possibly on the right track with the speed of how often the AT command are being sent to the modem, that is why I've pointed out the time logging information, so you can rule that out.  I have seen other scenarios where the 14A2A just isn't keeping up if some types of AT commands are sent back to back.

 

Regarding older code/way, it was having issues because of a bug in the 14A2A firmware that was causing many problems.  Once the 14A2A firmware was updated, the delays no longer seemed to be needed and as you have experienced, it does work if you use the online tools with a tagged version (and the proper mbed libraries).

 

So as a means to your end of being able to use this with some of the offline tools, you can certainly add the delay in with the sendCmd and see if that works for you.  It is possible the other tools and libraries are running faster and causing an underlying problem to be obvious.  If the delay between each AT command fixes it for you that is going to be the simplest fastest thing to do.

 

Best regards

leroy's picture
leroy
Junior(1)

LightCollecter/Jim

I'm not seeing any unusual behavior with the CLI tools pulling in v5 vs v2 libs, or at least not that I know of.

I did update to the latest mbed 2 libs both on mbed and with the imported, CLI version. The CLI version of the gcc compilers (v4, 5, and 6) reported an unresolved wrap_atexit() that just required a dummy function be created in the main,cpp. Plus there was that funky _serial.serial.index() thing in MODSERIAL, but I was able to get the demo running ok.

 

Bottom line of what I'm seeing is what you had suspected LightCollecter, that mbed compiled/linked is a bit slower than local Ubuntu compiled/linked.

With a call to the dbgPuts() placed in the sendCmd() (puts() version) I'm seeing:  (timer tick times are approx).

 

mbed:    33500  ticks

CLI with mbed v2 lib version the Demo was checked in with:  12800 ticks

CLI with latest mbed v2 lib: 13500  ticks.

 

So, there is a bit of a difference. If I introduce a delay that does help, though I'm not positive it fully solves it. I've not run it enough to say for sure that just delaying the sendCmd() would eliminate the Operation not Supported" error.

But this is as far as I'm going with it. I don't really need the keepalive to work as I will probably shutdown the connection between MQTT data transfers anyway, It was just baffling as to why there was such a difference.

If one does need a more solid connection, using the putc() version of sendCmd() with a shorter timeout that the default would seem to do the trick as well.

 

Perhaps if I get some time, I'll port the demo to mbed 5 libs (and latest WNController) and see what mbed 5 does with the MQTT keepalive. If I get that done I'll report back here.

 

Thanks again for your insight and help with this.

Regards,

            Leroy 

LightCollector's picture
LightCollector
Moderator(18)

Hi Leroy,

 

You never gave up!  Sorry you had to deal with what seems to be a quirk of the combination of the software and the hardware.  If you ever get exactly to the bottom of it with your tool set, please let me know.  I wish I could try every combination of stuff out there but I just can't I need to sleep once in a while.  Thankfully we got you past this for now!

 

One of the problems with these types of devices is that every time a new firmware comes out, ATT has to re-certify for thier network, it isn't free and takes a lot of time.  This tends to slow down improvments that aren't perceived as having at least some kind of reasonable work around.  If anything related to this comes up, I will pass it along.

 

Best regards

leroy's picture
leroy
Junior(1)

Understood. I'll llet you know if find anything else out, but will probably port to the new libs before getting too concerned. Regardless, I have the workaround. Thanks. 

leroy's picture
leroy
Junior(1)

Hi,

FYI, just following up.  I've had a chance to implement a simple mqtt project using the newer versions of libs,  MBED ver 5, latest easy-connect and MQTT-JMF.

The project I cloned was based off of the mbed Avnet http-example-wnc. The app implementation is basically the same as Bluemix Demo, with a few api tweaks for the easy-connect apis. (wanted to keep it as close as possible).

I am using a local downloaded version of the mbed project (via mbed cli import, did not try a KDS project). Recall this is one of the scenarios where I was seeing issues with the older libs. Ran the app for about an hour.

I'm happy to report all looks good out of the box. I see no "Operation not supported" messages nor disconnects. As before, I'm running the sample app with no user published mqtt traffic, just the MQTT-JMF internal keep alive sending requests to the mqtt broker at 60 second intervals.

It appears the new libs results in a slightly faster call to the modem api (sendCmd() than I had seen with the older libs, but I see no issues thus far. This is the only scenario tested at this point.

     Leroy