Sep 28 2014

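Buried deep within the network stacks of all major operating systems there are two TCP extensions called Nagle’s Algorithm and Delayed ACK. Both aim to relieve the pressure on networks (read: the Internet) by reducing the number of small packets: Nagle’s Algorithm holds back small outgoing segments so that they can be combined, and Delayed ACK holds back acknowledgements so that they can be combined or sent along with outgoing data. This article focuses on the quirks that occur on OS X while doing real-time video editing.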

Before we start, note that these extensions are very important and should NOT be disabled:

Please note that while, in certain cases, the current Nagle algorithm can
have a negative performance impact for certain applications, turning OFF the
Nagle algorithm can have a very serious negative impact on the internet. ~Greg Minshall on the ietf-discuss w3.org mailing list

Furthermore, the source of most problems related to Nagle’s Algorithm was already fixed several years ago. Please check out Rolande’s blog [1,2] and the article “TCP Performance problems caused by interaction between Nagle’s Algorithm and Delayed ACK” by Stuart Cheshire for useful background information.

This post is based on problems reported by some of our flow:rage customers using OS X and 10Gbit Ethernet. They reported things like dropped frames in Final Cut Pro 7 or increased render times within Adobe Media Encoder. Sometimes these issues were easily reproducible (for example by encoding a file twice), and sometimes they appeared and disappeared seemingly at random. They were caused by the read performance over the network dropping to only a few MB/s; writes were not affected and still performed as expected. The graph below illustrates the observed performance drop:

Performance

To fix the performance issue it was necessary to disable Nagle’s Algorithm and to switch Delayed ACK to its compatibility mode. To do so I used the following Terminal command, based on the documentation found in this post. As this is only a temporary change that is lost on reboot, you still have to edit /etc/sysctl.conf for a permanent solution, as explained in SmallTree’s KB.

sudo sysctl -w net.inet.tcp.delayed_ack=2
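
The sysctl above changes Delayed ACK for the whole machine. For software you write yourself there is a narrower option: Nagle’s Algorithm can be switched off per socket with the standard TCP_NODELAY option. The following Python sketch shows that option only. It is not what fixed the flow:rage clients (their file sharing traffic is handled by the operating system, not by an application socket), and the helper names make_nodelay_socket and nagle_disabled are made up for this illustration.

```python
import socket


def make_nodelay_socket() -> socket.socket:
    """Create a TCP socket with Nagle's Algorithm disabled for this socket only."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # TCP_NODELAY=1 asks the stack to send small segments immediately
    # instead of holding them back; other sockets are not affected.
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    return sock


def nagle_disabled(sock: socket.socket) -> bool:
    """Return True if TCP_NODELAY is set on the given socket."""
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) != 0
```

Keeping the change on a single socket leaves the system-wide defaults, and therefore everyone else’s traffic, untouched.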

I invested quite a lot of my time in researching and writing down all of this information. I hope this post helps people to understand what Nagle’s Algorithm and Delayed ACKs are used for and that they are generally very important and useful extensions. However there are always exceptions and in this case it looks like 10Gbit Ethernet on OS X is one of those …

Sep 08 2014

Last week a local customer reported strange problems with his EMC Isilon storage. For example, when they copied a file from Mac A to their central storage, it was sometimes not visible on Mac B. Only Macs were affected by this strange behaviour; all their PCs worked fine. I was happy when they booked an on-site appointment to investigate the problems further.

I started the investigation by talking to all the people there and writing down all the issues.
After summarising, I found out that most of the issues were caused by the fact that they mixed SMB and NFS.
I discussed this with the customer and he happily agreed to switch all machines to SMB.


However, we still had one issue to solve: some files (those with umlauts in their filename) were only visible over NFS. The following blog post is a summary of my on-site procedure and my findings:

1.) File Creation with Umlauts over NFS3

If you create a file with umlauts in its name over NFS3 (tested with MS Word), it can’t be opened over SMB (“Das Programm kann nicht gefunden werden”, in English “The program cannot be found”). It still works over NFS. You are also unable to delete the EAs (the extended attributes, stored in ._ files) over SMB (“No such file or directory”).

After removing the EA files over NFS, MS Word launched but complained about an illegal filename. Furthermore, I was unable to read the file using cat (“No such file or directory”); however, it was still working over NFS.

2.) File Creation with Umlauts over SMB

Same problem as above, just the other way round: the file could not be accessed over NFS, while everything worked as expected over SMB. Furthermore, SMB supports alternate data streams, so EAs get lost between protocols. In this case it is somewhat good that they are enabled, as it would break QT7 otherwise.

3.) File Creation without Umlauts

Everything works fine if there is no umlaut in the filename.

4.) Files are not shown within Finder

In Terminal you can see them using ls. This is related to the EA ._ metadata files! If a file has EAs and Finder is unable to access them, the file disappears from Finder. This is the reason why some movies with umlauts in their name are hidden. If you delete the ._ files they reappear, but they are still inaccessible.

5.) Word sometimes unable to save files with Umlauts?

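“Word kann dieses Dokument aufgrund eines Benennungs- oder Berechtigungsfehlers nicht auf dem Zielvolume schreiben” (in English: “Word cannot write this document to the target volume because of a naming or permissions error”)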

6.) Verification

Based on that knowledge I used the following procedure to locate the problem: I created a file (test.mov) with QT X on the storage. Then I duplicated it and renamed the copies to “NFS aaaÜ.mov” and “SMB aaaÜ.mov”, each over the corresponding protocol. This gave me two test files.

Then I ran the following hexdump commands:

NFS Test File

MBP:Test3 user$ ls /Volumes/Broadcast/Test/Test3/NFS*|hexdump -C #over SMB
00000000  2f 56 6f 6c 75 6d 65 73  2f 42 72 6f 61 64 63 61  |/Volumes/Broadca|
00000010  73 74 2f 54 65 73 74 2f  54 65 73 74 33 2f 4e 46  |st/Test/Test3/NF|
00000020  53 20 61 61 61 55 cc 88  2e 6d 6f 76 0a           |S aaaU...mov.|
0000002d
MBP:Test3 user$ ls /Volumes/broadcast-1/Test/Test3/NFS*|hexdump -C #over NFS
00000000  2f 56 6f 6c 75 6d 65 73  2f 62 72 6f 61 64 63 61  |/Volumes/broadca|
00000010  73 74 2d 31 2f 54 65 73  74 2f 54 65 73 74 33 2f  |st-1/Test/Test3/|
00000020  4e 46 53 20 61 61 61 55  cc 88 2e 6d 6f 76 0a     |NFS aaaU...mov.|
0000002f

SMB Test File

MBP:Test3 user$ ls /Volumes/Broadcast/Test/Test3/SMB*|hexdump -C #over SMB
00000000  2f 56 6f 6c 75 6d 65 73  2f 42 72 6f 61 64 63 61  |/Volumes/Broadca|
00000010  73 74 2f 54 65 73 74 2f  54 65 73 74 33 2f 53 4d  |st/Test/Test3/SM|
00000020  42 20 61 61 61*55 cc 88* 2e 6d 6f 76 0a           |B aaaU...mov.|
0000002d
MBP:Test3 user$ ls /Volumes/broadcast-1/Test/Test3/SMB*|hexdump -C #over NFS
00000000  2f 56 6f 6c 75 6d 65 73  2f 62 72 6f 61 64 63 61  |/Volumes/broadca|
00000010  73 74 2d 31 2f 54 65 73  74 2f 54 65 73 74 33 2f  |st-1/Test/Test3/|
00000020  53 4d 42 20 61 61 61*c3  9c*2e 6d 6f 76 0a        |SMB aaa...mov.|
0000002e

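This shows that the file created over SMB is reported with a different name depending on the protocol; I marked the differing bytes with an *. Over SMB the Ü appears as 55 cc 88 (a plain U followed by a combining diaeresis), while over NFS it appears as c3 9c (a single precomposed Ü). In other words, this is a character encoding issue: the same filename shows up in two different Unicode normalization forms.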
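
The marked bytes in the hexdumps are the two Unicode normalization forms of the same character: NFC (precomposed) and NFD (decomposed). The following short Python sketch reproduces both byte sequences for the test filename; the helper name hex_bytes is made up for this illustration.

```python
import unicodedata

# Test filename from above; \u00dc is the precomposed character Ü.
name = "SMB aaa\u00dc.mov"

nfc = unicodedata.normalize("NFC", name)  # Ü as one code point (U+00DC)
nfd = unicodedata.normalize("NFD", name)  # U followed by combining diaeresis (U+0308)


def hex_bytes(text: str) -> str:
    """Return the UTF-8 bytes of text as space separated hex, like hexdump does."""
    return " ".join(f"{b:02x}" for b in text.encode("utf-8"))


print(hex_bytes(nfc))  # 53 4d 42 20 61 61 61 c3 9c 2e 6d 6f 76
print(hex_bytes(nfd))  # 53 4d 42 20 61 61 61 55 cc 88 2e 6d 6f 76
print(nfc == nfd)      # False: a byte-wise comparison treats them as different names
```

Both strings look identical on screen, but any component that compares filenames byte by byte sees two different files.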

7.) The Issue: NFS

To test whether the issue was related to their Isilon, I repeated the test against a Debian VM. It showed the same strange issues. I therefore conclude that the issue is caused by OS X’s NFS client and Finder. A possible way to reproduce this is to rename a file using Terminal:

mv "/Volumes/NFSServer/testfile.mov" "/Volumes/NFSServer/testfileäöü.mov"

The expected behaviour is that testfile.mov is renamed to testfileäöü.mov. Exactly that happens, but the file becomes inaccessible: you cannot open it anymore.

8.) Next Steps

To fix these issues we recommend the following next steps:

  • Switch all machines to SMB – Thereby pretty much all problems should be fixed automatically.
  • To finalise the migration we have to fix the remaining issues:
    • We need a script that deletes all ._ EA files
    • Then we have to check whether we can access all files containing umlauts. If not, we have to rename them so that they work again (rename to some temporary name over NFS and rename back over SMB). This is the hard part, as we have to preserve the umlauts. That way we may be able to avoid the need to relink all assets.
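
The clean-up can be prepared with a small script. The following Python sketch finds the ._ files and reports every file with non-ASCII characters in its name together with its Unicode normalization form. All function names and the dry-run default are assumptions for this illustration, not the script used on site, and the rename over two protocols is deliberately left out.

```python
import os
import unicodedata
from typing import List, Tuple


def find_appledouble_files(root: str) -> List[str]:
    """Return the paths of all "._" metadata files below root."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        hits.extend(os.path.join(dirpath, n) for n in filenames if n.startswith("._"))
    return sorted(hits)


def normalization_form(name: str) -> str:
    """Classify a filename as "ASCII", "NFC", "NFD" or "mixed"."""
    if name.isascii():
        return "ASCII"
    if name == unicodedata.normalize("NFC", name):
        return "NFC"
    if name == unicodedata.normalize("NFD", name):
        return "NFD"
    return "mixed"


def find_non_ascii_names(root: str) -> List[Tuple[str, str]]:
    """Return (path, form) for every regular file whose name is not pure ASCII."""
    results = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            form = normalization_form(name)
            if form != "ASCII" and not name.startswith("._"):
                results.append((os.path.join(dirpath, name), form))
    return sorted(results)


def delete_appledouble(paths: List[str], dry_run: bool = True) -> None:
    """Delete the given files; with dry_run=True only print what would happen."""
    for path in paths:
        if dry_run:
            print("would delete", path)
        else:
            os.remove(path)
```

Running it once per mount (SMB and NFS) and comparing the reported forms should show which files need the rename treatment before anything is deleted.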

If you have the same problem and need help see the About Me page for contact details.

Aug 17 2014

Over the last weeks we migrated one of our post production customers from Mac OS X Snow Leopard to OS X Mavericks and from Final Cut Pro 7 to Adobe Premiere Pro CC 2014. Furthermore, we added a flow:rage as their central video storage to simplify their workflows, as they used to share their projects from Mac to Mac. However, as they still had to access their old projects, we also installed Final Cut Pro 7 on Mavericks. In theory Final Cut Pro 7 is still somewhat supported; however, there are some glitches here and there. This is the story of one such glitch that makes Final Cut Pro 7 almost unusable for our customer…


After we finished the migration, the editors reported dropped frames in Final Cut Pro 7. Over time we were able to narrow the problem down to projects that were opened over the network from a different Mac. If the projects were located on the flow:rage, everything worked great. Based on that we tested the throughput of the hard disks and the network, checked the CPU and memory usage, tried different network protocols and examined all logs on both the client and the server. However, we couldn’t find the source of the problem!

We then tried to reproduce the problem at several other customers that still use Final Cut Pro 7, and to our surprise we sometimes could. This means that there can be problems with Final Cut Pro 7 on Mavericks if you try to edit over the network from Mac to Mac. This is especially true if there is a lot of traffic on the corresponding network interface. We never had any problems with projects stored on a flow:rage storage system. In the end we suggested that the customer copy all projects to either the flow:rage or the local disk. No further dropped frames were reported.

I think the problem is a combination of high kernel task utilisation caused by the network traffic, the fact that Final Cut Pro 7 was not extensively tested by Apple, and some change in the VFS layer. For me it is not worth investing more time to diagnose the problem further. If you have any further hints, please leave a comment.

Aug 07 2014

Currently I’m confronted with a lot of ignorance around LTFS. This is interesting as there are some very good resources [1,2] on what LTFS is good at and what should be solved using a dedicated backup or archiving application (like Archiware P5).

If you want to use LTFS consider the following best practice rules:

  • LTFS is good at transporting data – Archiving is hard as there is no real index database
  • LTFS should be used like a WORM (Write Once Read Many) tape
  • The bigger the files the better, as small files perform horribly
  • If you only want to read files, mount the tape read-only, as this increases the performance
  • Don’t force nonsequential tape operations with things like browsing a folder in thumbnail view
  • Try to only access top level folders (copy those folders to or from tape)

If you still think LTFS is the right solution for you go ahead and use it! On OS X most vendors [for example: Tandberg, HP] ship the same FUSE based filesystem and a small manager application. The following video gives a not so short introduction on how to use it:

Aug 05 2014

Recently a customer reported that he was unable to add new users to his OS X 10.8 Server. To be precise, he was even unable to log in as diradmin to his local OpenDirectory master.


Each login attempt produced the following error message:

servermgrd: servermgr_accounts: got error 2100 trying to auth to local LDAP node

After ruling out all the common issues, as discussed in “Why Is My OD LDAP Server Stopped & How To Fix It”, it was time to move over to the dark side. In this case, one had to know that the auth database of the OD server itself is stored as a Berkeley DB in /var/db/openldap/authdata and that it is most likely damaged. Based on that (and after creating a backup) we can now use db_recover to repair it with the following commands:

sudo serveradmin stop dirserv
sudo db_recover -h /var/db/openldap/authdata
sudo serveradmin start dirserv

After a few seconds you should be able to log in as diradmin again.
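
The backup mentioned before the repair can be as simple as a timestamped copy of the database folder. The following Python sketch is one way to do that, assuming the directory service is already stopped so that the files no longer change; the function name backup_directory and the backup location in the usage note are made up for this illustration.

```python
import shutil
import time
from pathlib import Path


def backup_directory(source: str, backup_root: str) -> Path:
    """Copy source into a new timestamped folder below backup_root and return it."""
    src = Path(source)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    target = Path(backup_root) / f"{src.name}-{stamp}"
    # copytree refuses to write into an existing target,
    # so an older backup is never overwritten.
    shutil.copytree(src, target)
    return target
```

Run as root, a call such as backup_directory("/var/db/openldap/authdata", "/var/backups") would create a copy named authdata-<timestamp>; /var/backups is only a placeholder path.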

Aug 03 2014

Last week I observed a strange quirk of OS X Mavericks and AVFoundation: when writing a video using AVFoundation, data caching is always enabled. Caching by itself provides a huge performance boost at the cost of reliability. Generally this wouldn’t be a problem, because you can easily disable caching by using fcntl and F_NOCACHE. However, as AVFoundation does not expose the corresponding file descriptor, this is not possible. Now think about the result of the following scenarios:

  1. You write a video file to an external storage device and the volume is disconnected
  2. You write a video to a network volume and someone reboots a switch
  3. You write to a local disk and a power outage occurs

Yes, all of these scenarios result in data loss, because everything that is still sitting in the cache is lost! This is especially problematic as the Unified Buffer Cache can grow to hundreds of MB. This can result in the loss of several seconds or even minutes of video data.

I had to use all my Google-fu to find the blog post “Hacking the Mac OSX Unified Buffer Cache” that provides a possible solution. The (undocumented?) fcntl flag F_GLOBAL_NOCACHE allows you to disable the Unified Buffer Cache globally for a specific file. This even works for all already opened file handles. That makes it possible to mitigate all the problems outlined above. Stefan Bechtold wrote the command line wrapper UBCUtil that allows you to test the flag without modifying your code.
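
For files you open yourself, the documented F_NOCACHE flag is set directly on the file descriptor. The following Python sketch shows that call, which is exactly the step AVFoundation does not let you perform; it does not use F_GLOBAL_NOCACHE. The function name open_uncached is made up for this illustration, and on systems where Python’s fcntl module has no F_NOCACHE constant (anything other than OS X) the sketch only reports that the flag was not applied.

```python
import fcntl
import os
from typing import Tuple


def open_uncached(path: str) -> Tuple[int, bool]:
    """Open path for writing and ask the kernel not to cache its data.

    Returns (fd, applied); applied is False where F_NOCACHE is unavailable.
    """
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    flag = getattr(fcntl, "F_NOCACHE", None)  # constant only exists on OS X
    if flag is None:
        return fd, False
    fcntl.fcntl(fd, flag, 1)  # 1 turns data caching off for this descriptor
    return fd, True
```

A caller would write its data with os.write(fd, ...) and close the descriptor with os.close(fd) as usual; the only difference is that the written data is not left sitting in the buffer cache.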

What a day…

Apr 22 2014

Have you ever wondered why Apple is moving from AFP to SMB2?

Well, here’s one example:
If you are connected to an AFP server (either OS X or netatalk) and you duplicate a really large file, the complete AFP connection on the client stalls. In the background the client instructs the server to duplicate the file. However, it blocks until the copy process is finished. This is a good idea implemented poorly. It causes all applications doing I/O operations on the share point to either freeze until the operation is finished or even crash.

The best part: If you are using SMB you can duplicate files and don’t hang your applications. I could verify this behavior down to OS X Lion. Maybe it’s even true for Snow Leopard…

Mar 11 2014

Last year there was a bug in OS X that allowed a local attacker to gain root privileges by abusing sudo’s cache. A few months earlier (January 2013) I informed Apple about a similar problem caused by OS X’s default tty_tickets = off setting. This insecure configuration allows all applications to misuse a cached sudo authentication within the cache timeout. The same issue was discussed in this thread on the Debian mailing list in 2010!

The sudo man page reads:

[…] Once a user has been authenticated, a timestamp is updated and the user may then use sudo without a password for a short period of time […]

As Mac OS X uses the default value, a 5 minute timeout is used. That means that if an admin user runs sudo, a malicious script can run privileged commands without any further user interaction for 5 minutes. My PoC installs a launchd configuration file (sh.bogner.sudo_escalation.plist) that loads sudo_escalation.sh at startup. This script tries to launch Terminal.app as root as soon as the user has used sudo.

Download

How to Reproduce

  1. The PoC has to be installed (with the installer command)
  2. The currently logged in user has to be member of the admin group
  3. The user or any application (like an installer) has to use the sudo command
    Like: sudo echo "Show me what the PoC does"
  4. Terminal.app should be launched as root

My bug report was acknowledged and ignored. This problem exists on all fully patched versions of OS X including Snow Leopard (10.6), Lion (10.7), Mountain Lion (10.8) and Mavericks (10.9). That’s a really bad time-to-fix!

The Fix

The reaction time is especially disappointing as the fix is very easy and without any side effects (at least for me):

  1. Open a terminal window and type (or better copy and paste):
    export EDITOR=nano #you can skip this step if you know how to use vi
    sudo visudo #enter your password afterwards
  2. Add the following line after Defaults env_keep += "HOME MAIL" (it should look like this):

    Defaults        tty_tickets
  3. Press [CTRL]+[o] followed by [ENTER] to save your changes
  4. and [CTRL]+[x] to exit the editor
  5. Now you can verify your steps by retesting the PoC
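
Besides retesting the PoC, the edited configuration can be checked mechanically. The following Python sketch looks for the tty_tickets option in the text of a sudoers file. It is a deliberate simplification: it only understands plain global Defaults lines (no Defaults:user or similar scoped forms, no included files), it treats a missing option as off because that is the OS X behaviour described above, and the function name tty_tickets_enabled is made up for this illustration.

```python
def tty_tickets_enabled(sudoers_text: str) -> bool:
    """Return True if the last global Defaults entry for tty_tickets enables it."""
    enabled = False  # assumption: option is off unless the file turns it on
    for raw_line in sudoers_text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        if not line.startswith("Defaults"):
            continue
        rest = line[len("Defaults"):]
        if rest and not rest[0].isspace():
            continue  # scoped form such as Defaults:user, ignored in this sketch
        # Several options on one Defaults line are separated by commas.
        for option in (part.strip() for part in rest.split(",")):
            if option == "tty_tickets":
                enabled = True
            elif option == "!tty_tickets":
                enabled = False
    return enabled
```

Since /etc/sudoers is only readable by root, the text has to be obtained with root privileges, for example from the output of sudo cat /etc/sudoers.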

Feb 25 2014

I take care of several OS X mail servers for my customers and I always use the widely deployed OpenDirectory LDAP server for user management. However, from time to time one of these OD servers stops working. Based on my experience, there are two (and a half) main reasons for this malfunction:

  1. Power outage: After a power outage the database can be corrupted. (This is a valid reason for a service outage 😉)
  2. OD Backup: After creating an OpenDirectory backup, the launchd org.openldap.slapd.plist configuration file is disabled. This means that the LDAP server will not be started and all services (Mail, iCal, Address Book) stop working.
  3. OD Backup^2: There is an even more serious OD Backup Bug. Thankfully I have seen it only once. If this bug is triggered, not only is the default org.openldap.slapd.plist configuration disabled but there is a second hidden dot-file temporary configuration file with the same launchd key. Thereby we trigger undefined behaviour (two configuration files with the same key) and no usable error message is logged! (It was quite hard to find this problem).
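
Both backup-related causes can be looked for with a small script. The following Python sketch reads every file in a launchd configuration folder, including hidden dot-files, and reports configurations marked as disabled as well as labels that are declared by more than one file. It assumes that the “launchd key” mentioned above is the Label entry of the plist and that the disabled state is visible as a Disabled entry in the file itself; the function name scan_launchd_plists is made up for this illustration.

```python
import plistlib
from collections import defaultdict
from pathlib import Path


def scan_launchd_plists(directory: str):
    """Return (disabled, duplicates) for the property lists found in directory.

    disabled:   names of files whose plist sets Disabled to true
    duplicates: {label: [file names]} for labels declared by several files
    """
    by_label = defaultdict(list)
    disabled = []
    for path in sorted(Path(directory).iterdir()):  # iterdir includes dot-files
        if not path.is_file():
            continue
        try:
            with path.open("rb") as handle:
                data = plistlib.load(handle)
        except Exception:
            continue  # not a readable property list, skipped in this sketch
        if not isinstance(data, dict):
            continue
        if data.get("Disabled") is True:
            disabled.append(path.name)
        label = data.get("Label")
        if label:
            by_label[label].append(path.name)
    duplicates = {label: names for label, names in by_label.items() if len(names) > 1}
    return disabled, duplicates
```

Pointed at /System/Library/LaunchDaemons, a non-empty duplicates result for org.openldap.slapd would correspond to the hidden second configuration described in the third point.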

After finding and fixing the cause you still have to repair your OpenDirectory database. Use the following three easy steps to do so:

  1. First you have to stop a possibly running OD instance by unloading the launchd configuration:
    sudo launchctl unload /System/Library/LaunchDaemons/org.openldap.slapd.plist
  2. Then run the db_recover utility with the following parameters to recover your OpenDirectory database:
    sudo db_recover -v -h /var/db/openldap/openldap-data/
  3. And restart your OpenDirectory server:
    sudo launchctl load /System/Library/LaunchDaemons/org.openldap.slapd.plist

Voilà your OpenDirectory is working again and you have earned yourself another coffee 😉