Land, Ho!

No, I am not going to repeat the lyrics of the famous Doors’ song here. It’s just that I have decided to start a new post for reporting my progress and open issues in porting my driver to dahdi. To begin by a short summary, dahdi porting is going very well, modulo some problems that I think I will be able to resolve sooner or later. All this makes me feel like when approaching land after a long sea journey, hence the title of this post.

Here are the details (nasty as usual).

My new module, called oufxs.ko, loads successfully and registers with dahdi-base without any problems.  I have now fixed the “location” of the card, so dahdi_cfg when the card is first plugged into a USB port reports e.g.:

[2]
active=yes
alarms=UNCONFIGURED
description=Open USB FXS board 1: setup not started
name=OUFXS/1
manufacturer=Angelos Varvitsiotis
devicetype=Open USB FXS
location=USB 2-1 device #6
basechan=1
totchans=1
irq=0
type=analog
port=1,FXS

OK, you noticed the “USB 2-1 device #6” above; did you also notice the “description” message with the “setup not started” note? This is thanks to the board initialization worker thread, that changes the description string as it goes along with initialization, so successive calls to dahdi_scan report the following, in the order listed here, with changes from one step to the next taking approx. half to one second, depending on the initialization step:

description=Open USB FXS board 1: setup not started
description=Open USB FXS board 1: SLIC sanity check
description=Open USB FXS board 1: initial powerdown
description=Open USB FXS board 1: set indirect regs
description=Open USB FXS board 1: set dc-dc convrtr
description=Open USB FXS board 1: dc-dc cnv powerup
description=Open USB FXS board 1: VBAT pwrleak test
description=Open USB FXS board 1: ADC calibration
description=Open USB FXS board 1: Q5/6 calibration
description=Open USB FXS board 1: LBAL calibration
description=Open USB FXS board 1: final step setup
description=Open USB FXS board 1: up-and-running

BTW, dahdi_tool also reports correctly the board and displays the description, refreshing it whenever the current board status changes. The “alarms=UNCONFIGURED” line is the next thing to worry about. The dahdi_cfg utility needs to be executed to configure signaling before the board can be used by dahdi. After some guesswork, I think that the only configuration really required in /etc/dahdi/system.conf is a line containing “fxols=N”, where N is the the dahdi channel for the board (1, if no other dahdi hardware exists on the system). After that, dahdi_scan reports:

active=yes
alarms=OK
description=Open USB FXS board 1: up-and-running
name=OUFXS/1
manufacturer=Angelos Varvitsiotis
devicetype=Open USB FXS
location=USB 2-1 device #7
basechan=1
totchans=1
irq=0
type=analog
port=1,FXS

At this stage, “fxstest” can be used to play with the board. For example, “fxstest /dev/dahdi/1 stats” proudly reports:

TIP: -3.3840 Volts
RING: -27.8240 Volts
VBAT: -36.8480 Volts

(yes, this is the low-power mode that Edwin suggested in his comment to my previous post), while “fxstest /dev/dahdi/1 regdump” outputs the following, very enlightening dump:

Direct registers:
  0. 05    1. 28    2. 01    3. 00    4. 01    5. 00    6. 00    7. 00 
  8. 00    9. 00   10. 08   11. 33   12. 00   13. 00   14. 00   15. 00 
 16. 00   17. 00   18. 00   19. 00   20. 00   21. 00   22. ff   23. 01 
 24. 00   25. 00   26. 00   27. 00   28. 00   29. 00   30. 00   31. 00 
 32. 00   33. 00   34. 00   35. 00   36. 00   37. 00   38. 00   39. 00 
 40. 00   41. 00   42. 00   43. 00   44. 00   45. 00   46. 00   47. 00 
 48. 00   49. 00   50. 00   51. 00   52. 00   53. 00   54. 00   55. 00 
 56. 00   57. 00   58. 00   59. 00   60. 00   61. 00   62. 00   63. 54 
 64. 11   65. 61   66. 01   67. 1f   68. 04   69. 0a   70. 0a   71. 00 
 72. 10   73. 02   74. 2c   75. 28   76. 00   77. 00   78. 10   79. 42 
 80. 09   81. 49   82. 62   83. 62   84. 01   85. 16   86. 00   87. 00 
 88. 00   89. 00   90. 00   91. 00   92. ca   93. 0c   94. 65   95. 00 
 96. 00   97. 00   98. 0b   99. 12  100. 0f  101. 0f  102. 05  103. 46 
104. 05  105. 27  106. 02  107. 08  108. eb 

Indirect registers:
  0. 55c2    1. 51e6    2. 4b85    3. 4937    4. 3333    5. 0202 
  6. 0202    7. 0198    8. 0198    9. 0611   10. 0202   11. 00e5 
 12. 0a1c   13. 7b30   14. 0063   15. 0000   16. 7870   17. 007d 
 18. 0000   19. 0000   20. 7ef0   21. 0108   22. 0000   23. 2000 
 24. 2000   25. 0000   26. 2000   27. 4000   28. 1000   29. 3600 
 30. 1000   31. 0200   32. 0ff0   33. 7f80   34. 0ff0   35. 8000 
 36. 0320   37. 0010   38. 0010   39. 0010   40. 0000   41. 0c00 
 42. b4f9   43. 1000   44. 0000   45. 0000   46. 0000   47. 0000 
 48. 0000   49. 0000   50. 0000   51. 0000   52. 0000   53. 0000 
 54. 0000   55. 0000   56. 0000   57. 0000   58. 0000   59. 0000 
 60. 0000   61. 0000   62. 0000   63. 0000   64. 0000   65. 0000 
 66. 0000   67. 0000   68. 0000   69. 0000   70. 0000   71. 0000 
 72. 0000   73. 0000   74. 0000   75. 0000   76. 0000   77. 0000 
 78. 0000   79. 0000   80. 0000   81. 0000   82. 0000   83. 0000 
 84. 0000   85. 0000   86. 0000   87. 0000   88. 0000   89. 0000 
 90. 0000   91. 0000   92. 0000   93. 0000   94. 0000   95. 0000 
 96. 0000   97. 0000   98. 0000   99. 4bfe  100. b6c4  101. 9ea5 
102. 99ef  103. bffe  104. c699

Notice in the above dunp that unimplemented or reserved 3210 registers are not probed for their contents, and their contents are printed as zeros.

How about actual audio? So far I have implemented only the OUT part (if you ask why not also the IN part, I was too lazy — or precautious, if you prefer — to start writing the IN leg of the code without having yet tested thouroughly the OUT one). The OUT part appears to be working OK: using “fxstest /dev/dahdi/1 tones” as a test, I hear various beeping tones, including “dial tone”, “ring tone”, “busy tone”, etc., in succession. A few audible clicks from time to time can, according to my former experience, be attributed to VMWare’s sploppiness, so no worries here, although I owe to test more thoroughly under native (non-virtualized) linux. Provided that everything goes OK with the OUT part, I trust that the IN part will not be that hard to write as well. Summary: dahdi does fine with sending audio to my board by means of the new driver, and receiving audio does not look like an issue at all.

So far, so good. Now come the bad news: “fxstest /dev/dahdi/1 ring” (or “fxstest /dev/dahdi/1 polarity, which also sets the phone set ringing for a very short while) cause a complete system hang :-(. So far, I have not been able to find a good explanation for that. The nature of the hang is similar to what I have seen when trying to lock a spinlock twice, and by means of various debugging attempts (e.g., by not loading a tonezone in /etc/dahdi/system.conf and expecting the code circa line 5323 of dahdi-base.c to return -ENODATA since no zone is loaded, but contrarily to what is expected, getting another system hang), it could be related to the locking of chan->lock in line 5324 (my version is from debian source package dahdi-linux-2.2.0.2~dfsg). Why would this lock hang the system is a little mystery to me, but I trust I ‘ll be able to debug the situation further, if not otherwise, by means of extensive printk statements in the dahdi code.

There is also another caveat that I need to resolve. Wctdm.c, whose logic I am mimicking in my new driver, reads or sets various board and ProSLIC (3210/3215 in Digium boards) registers in interrupt context. As an example, while servicing a PCI interrupt from the board, if the data received indicates a 3210 hook change alarm, the hook-state ProSLIC register 68 is checked in-band. OK, this is fine if you work with a PCI board, where setting a 3210 register only involves a few inb/outb instructions, however it’s a disaster with USB, where one needs at least one bulk OUT packet (and its confirmation, hence a 2-ms separate I/O sequence) in order to communicate with the board. So, I guess that I need to somehow embed such commands within the isochronous OUT stream. This is not as hard as it sounds. The above example (checking the hook state) is already serviced in this manner (status is embedded in the IN packet stream) in my former driver, so copying it here is no problem. But embedding commands into the OUT stream will definitely need some additional work and will require a firmware upgrade.

Meanwhile, Alok Prasad (see his comments in previous posts) has reported an issue with porting the original openusbfxs driver to kernels prior to 2.6.20. The problem lies in usb_anchor, which was introduced in late kernel versions. In my reply to Alok, I am holding that

…the anchor is a place where pending URBs can be stored in order to cancel them easily if need be (e.g., if the driver unloads). It is relatively easy to craft a workaround, because in my code the active URBs are known at any time: I use usb_bulk_msg() for bulk IN/OUT, so the anchor is not used there. For isochronous IN/OUT, if one does not have an anchor, one can cancel active urbs by walking through dev->outbufs[] and dev->in_bufs[] and checking for the state of each of those: if dev->XXXbufs[i].state==st_sbmtd then dev->XXXbufs[i].urb is submitted to the core and must be canceled in case of a disconnect or driver unload. With this workaround, the anchor can be avoided.

While I keep myself busy with hunting nasty spinlock bugs and with other ugly details of dahdi porting, if any reader of this blog can assist by coming up with the above usb_anchor workaround, this would be of great help.

This is the situation right now. I am going to report further progress later in this post. Apart from the worrysome system hang, everything else looks really good! Maybe the board’s “hello, Asterisk!” day is not anymore that far away. Even though this day is not today or tomorrow, or not even next week, I have to admit that it certainly looks much, much closer now than when I first started the project. After all, this post is great news: while not yet in the harbor, I can see land ahead — hence the title.

Update, April 17: The kernel lockup/hang mystery has just been solved! It seems I fell a victim of my own thoughts when I wrote

OK, this is fine if you work with a PCI board, where setting a 3210 register only involves a few inb/outb instructions, however it’s a disaster with USB, where one needs at least one bulk OUT packet (and its confirmation, hence a 2-ms separate I/O sequence) in order to communicate with the board.

just a few paragraphs above. It turns out that the kernel lockup nightmare was hidden exactly in this paragraph. Why? Because, what is by far worse than the 2ms delay that I was babbling about, is the fact that bulk USB communication between the driver and the board is based on usb_bulk_msg(), which presumably is using blocking I/O. This means that the invoking process issues a bulk OUT URB and is then put in a sleep state while waiting for a URB completion  or a timeout. Now, here is what was happening: when the DAHDI_TXSIG_START ioctl was handled by the driver, I had chosen to do the same thing that wctdm.c does: write 3210’s DR 64 of course, to set the phone ringing! This operation is done with the channel spinlock locked (chan->lock  — this lock is acquired by dahdi-base right before invoking the ioctl handler in the hardware device driver), and at the same time, the isochronous engine was running, meaning that dahdi_transmit() would be called eight times each millisecond. What is one of the first things that dahdi_transmit() does? Acquire the channel spinlock, of course! So, the channel lock was acquired by dahdi-base on behalf of fxstest, then the calling process for fxstest went to sleep for two milliseconds (one to transmit the bulk OUT URB for setting DR 64, and one to receive a confirmation). To begin with, letting a process hold a spinlock while going to sleep is a very serious kernel programming error that can result in lockups. Meanwhile, just to make sure that the lockup would occur, the background isochronous engine thread was still running and invoking dahdi_transmit, which tried first-thing to acquire the same spinlock that was being held by the sleeping fxstest process. This was causing a kernel lockup. Bingo! I verified my guess by removing the dr_write() macros that set various DRs in all places but the initialization worker thread. And — yes! — the kernel hang disappeared. Of course the phone set did not ring, since DR64 is not yet being set, however my driver did no more lockup the kernel. Now I have resolved this issue, I am going to fix these to use isochronous OUT piggybacking. This means that, if no other unpleasant surprises wait for me down the road, the dahdi driver will be ready within a few days from now!

Update, April 18: I have implemented the DR set piggybacking logic in both the driver and the firmware. It did not work right away, so some testing and debugging is still needed, but the basis is there and there should be no fundamental error in the logic. What’s nice is that the kernel does not hang anymore on the DAHDI_RING ioctl. I do get a couple of serious errors though. When disconnecting the board, fxstest receives a SIGSEGV and the module count remains 1, even though the program that has opened the device gets killed. As a result, the module cannot be unloaded anymore and becomes a zombie. Some fixes here and there cured another kernel hang when the board was unplugged and then plugged back in. All these I guess are race conditions, as the usb core calls the module’s disconnect() method, which in turn starts dismantling the device instance, while the device is still hooked within dahdi. Anyway, I ‘ll soon resolve these errors, I ‘m sure.

Update, April 20: I think I fixed the module reference errors. The most significant error was that I was calling dahdi_unregister() too early, and thus the close() that occurred when the calling process exited was not received by my driver. Hence, the module’s reference count was left equal to one and the module could not be unloaded. Moving dahdi_unregister() from the oufxs_disconnect() callback to the oufxs_delete() one fixed it (you may check the source of the older driver to see what these do). So the driver looks stable now, does not hang or freeze the system and does not put the module into a zombie state anymore. Next thing to do is to debug the piggyback register setting capability. Once this is fixed as well, all that’s left is the IN isochronous engine, and then the driver will be ready!

Update, April 20 (later): Piggybacking of register set commands on OUT packet headers now works perfect and “fxstest /dev/dahdi/1 ring” rings the phone!

Update, April 22: One more nasty bug resolved: when lowpower mode was set to zero (normal power mode), the phone refused to ring! I chased this down to a wrong 3210 register setting during longitudinal balance mode calibration, due to a stupid copy-paste error of mine (I think that I ‘ve already written quite a few times in this blog that code copy-paste-due bugs were by far the worst ones I ‘ve dealt with). The fix to this bug resolved also an intolerable noise issue in audio, which I had swept under the rug for the time. Perfect! Another bug still remains, but this is kernel-related, so — I guess — easier to fix: when I plug the board with the module already loaded and then right away, before the board initializes, I reload the driver, I get a kernel freeze. By now, I ‘m experienced enough to suspect that this is a double spinlock locking, due to a race condition that occurs when the board plugging with the old module instance causes dahd_register() to be invoked shortly before or at the same time that disconnect() is called, which starts dismantling the device. Ugly, but definitely fixable. So, my short-term roadmap is to resolve this last lockup bug, then to write the IN isochronous engine (and hook/dtmf monitoring) and then — provided that no new bugs come up, of course — to publish the new driver code, along with the modified firmware. I have forgotten to mention that, in the meantime, I have assembled one new version-2 dongle board, which is now patiently waiting for me to finish with kernel driver bug-hunting in order to program and test it! So now — unless of course the new dongle blows in smoke on my face — things feel really like approaching the harbor, aren’t they?

Update, April 22 (later): The IN part works now (of course, after fixing another nasty copy-paste bug — I was using the OUT USB endpoint instead of the IN one). The audio quality is perfect in both playout and recording. It’s not ready, of course: I need to add the hook state checks, dtmf state, statistics, and a whole lot more (especially the echo canceller, which is one more thing that I have been sweeping under the rug all this time). However, by now the “ship” is in a “safe harbor”: Open USB FXS works with Dahdi, full stop. So, this is going to be my last update for this post. Prepare for the next one — titled “Hello, Asterisk!” or something similar (I ‘m sure you are guessing correctly it is going to be about, aren’t you?).

5 Responses to “Land, Ho!”

  1. Gabor Denes Says:

    Hi!

    For good audio quality, please put 4.7 kΩ pull-down resistor between the DTX signal and ground.

  2. Angelos Varvitsiotis Says:

    Hi Gabor,

    Thanks for the suggestion. I am not quite sure that I understand why this 4k7 pull-down that you are proposing is needed. During the PCM I/O timeslot, the DTX pin of 3210 is not tri-stated, so it acts like a normal GND-level sink if the current output data bit is zero. The only “open drain” pin of the chip requiring a external pull-up resistor is the \INT output. So, even though your suggestion may be perfectly correct, I would very much appreciate hearing some more details on why this is the case. Anyway, I might try to add the pull-down on an older board to see if this makes any difference.

    Thanks again for suggesting this improvement.

    Cheers,

    Angelos

  3. Gabor Denes Says:

    Hi!

    We were developing with the same Si3210 chipset for a security product. During tests PCM bus was very noisy. Then Silabs suggested to put this resistor on the bus and voilá, noise disappeared, sound quality is now perfect.
    Cheers,
    Gabor

  4. Angelos Varvitsiotis Says:

    Hi Gabor,

    This is great, thanks! I ‘ll make sure to try it. The only explanation I can think about is that the transition of the DTX signal from active to tri-state occurs somewhat too early, thus turning the line between the 3210 and the microprocessor into some sort of an antenna. Anyway, I ‘ll make sure to try the fix using a 10k resistor (I have them handy and a 10k resistor should be just as good as a 4k7 one for this purpose). I ‘ll make sure to let you know about the results.

    Cheers,

    Angelos

  5. Angelos Varvitsiotis Says:

    Hi Gabor,

    I realize that I never actually reported my results on the test regarding your suggestion. To begin with, it turns out that Silabs actualy do mention this 4k7 pull-down in their AN39 application note. There it may make sense because of the long wires between the evaluation board and the PCM test instrument. In your environment it might make sense as well. However, when I added this pull-down (10k in my case) in Open USB FXS, it did not make any audible difference. Of course, it did not hurt the audio quality either. With these data, I think that the best thing to do is to add the pull-down to the schematic and leave it unpopulated in the actual board. If people observe audio noise, or they intend to use the board on a noisy environment, they can add the pull-down (or let me know beforehand, so I can add it myself).

    I wish to thank you once more for sharing this important piece of information with me.

    Cheers,

    Angelos

Leave a comment