Old School Binary File Manipulation, Or Code Page Encoding Gotchas

Filed under Troubleshooting, VB Feng Shui

So, I was working on parsing a WordPerfect Merge DAT file today…

Yep, you read that right, WordPerfect! Seriously old school stuff.

Anyway, this file format is a peculiar binary format with a bunch of binary headers surrounding all the nice juicy ASCII merge text fields. Dealing with such files in VB6 was easy, but it seems that the few times I’ve had to work with this type of thing in .NET, I always end up stubbing my toe on encoding.

You see, all strings in .NET are UNICODE (UTF-16 under the hood), and reading binary data like this into a string involves a decoding step, with a matching encoding step on the way back out. In order for things to work the way you’d (or at least, I’d) expect them to, you have to be SURE that the bytes will round trip properly. And boy, they weren’t for me.

I wrestled with it all day, finally knocking off to go home and unwind.

But it was still bugging me.

So I whipped up this little sample (why didn’t I think of this at noon today? <sigh>).

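    Private Sub TestEncode()
        ' Fill a buffer with every possible byte value, 0 through 255.
        Dim b(255) As Byte
        For x As Integer = 0 To 255 : b(x) = CByte(x) : Next

        ' Decode the bytes into a string...
        Dim buf As String = System.Text.Encoding.Unicode.GetString(b)

        ' ...then encode the string straight back into bytes.
        Dim c() As Byte = System.Text.Encoding.Unicode.GetBytes(buf)

        ' Any mismatch means the encoding altered the data along the way.
        For x As Integer = 0 To 255
            If c(x) <> x Then Stop
        Next
        Stop
    End Sub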

If the code stops at that stop in the last FOR loop, something didn’t round trip properly and you’re pretty much guaranteed a headache.

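And sure enough, the UNICODE encoding object failed to round trip. But so did ASCIIENCODING, UTF8, etc etc. In hindsight that makes sense: ASCII only covers 0-127, so everything above that comes back as a question mark, and the UNICODE and UTF8 decoders swap any byte sequence they consider invalid for a replacement character.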

On a whim, I tried the “default” object, SYSTEM.TEXT.ENCODING.DEFAULT.

And it worked!

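A quick check revealed that on my system, DEFAULT is actually the encoding object for codepage 1252, which is the Windows “ANSI” codepage for Western European languages (a superset of ASCII, not ASCII itself). 1252 is the codepage to use if you want EVERY SINGLE binary value from 0-255 to survive the trip into a string and back out again. Two caveats, though. First, not every byte maps to the unicode character with the same number: most of the bytes from 0x80 to 0x9F land on characters like the euro sign (0x80 becomes U+20AC). If you need each byte to map to the identical code point, ISO-8859-1 (codepage 28591) is the one that does that. Second, DEFAULT follows the machine’s regional settings, so it won’t be 1252 everywhere; ask for the codepage by number instead of trusting DEFAULT.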
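The test above generalizes into a small helper that tries several encodings in one pass. This is only a sketch: `RoundTrips` and the module name are made up for illustration, and it assumes .NET Framework, where `GetEncoding(1252)` works out of the box.

```vb
Imports System
Imports System.Linq
Imports System.Text

Module RoundTripCheck
    ' Hypothetical helper: True if every byte value 0-255 survives
    ' bytes -> string -> bytes under the given encoding.
    Function RoundTrips(enc As Encoding) As Boolean
        Dim original(255) As Byte
        For i As Integer = 0 To 255
            original(i) = CByte(i)
        Next

        Dim decoded As String = enc.GetString(original)
        Dim result As Byte() = enc.GetBytes(decoded)
        Return original.SequenceEqual(result)
    End Function

    Sub Main()
        ' Assumption: .NET Framework. On .NET Core / .NET 5+ the code-pages
        ' encoding provider has to be registered before GetEncoding(1252).
        Dim candidates As Encoding() = New Encoding() {
            Encoding.ASCII,
            Encoding.UTF8,
            Encoding.Unicode,
            Encoding.GetEncoding(28591),
            Encoding.GetEncoding(1252)
        }
        For Each enc As Encoding In candidates
            Console.WriteLine(enc.WebName & ": " & RoundTrips(enc).ToString())
        Next
    End Sub
End Module
```

Comparing the whole array with `SequenceEqual` also catches the case where the re-encoded data comes back with a different length, which a fixed 0 To 255 loop would not notice.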

Long story short, if you’re used to looking at binary files via a hex editor, and you want to manipulate those files in VB, you have two choices.

  1. Read the file into a BYTE() array as raw data, then operate on the bytes directly.
  2. Read the data into a string, but be SURE to use the proper encoder, like so:

        Dim Buf As String = My.Computer.FileSystem.ReadAllText(MyFileName, System.Text.Encoding.GetEncoding(1252))

Option 1 is great if you need to work on the data as, more or less, strictly byte type info. But it’s a real pain if much of the data is string type stuff.
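To give a feel for what Option 1 involves, here is a rough sketch of hunting for a marker inside a byte array, which is the sort of thing INSTR would do for free on a string. `IndexOfBytes` is a hypothetical helper and `merge.dat` is a placeholder path; neither comes from the actual project.

```vb
Imports System
Imports System.IO

Module ByteSearch
    ' Hypothetical helper: index of the first occurrence of needle in
    ' haystack at or after start, or -1 if it is not found.
    Function IndexOfBytes(haystack As Byte(), needle As Byte(),
                          Optional start As Integer = 0) As Integer
        If needle.Length = 0 Then Return -1
        For i As Integer = start To haystack.Length - needle.Length
            Dim matched As Boolean = True
            For j As Integer = 0 To needle.Length - 1
                If haystack(i + j) <> needle(j) Then
                    matched = False
                    Exit For
                End If
            Next
            If matched Then Return i
        Next
        Return -1
    End Function

    Sub Main()
        ' "merge.dat" is a placeholder path, not a file from the post.
        Dim data As Byte() = File.ReadAllBytes("merge.dat")
        ' ASCII bytes for "Name"
        Dim marker As Byte() = New Byte() {&H4E, &H61, &H6D, &H65}
        Console.WriteLine(IndexOfBytes(data, marker))
    End Sub
End Module
```

Nothing gets decoded here, so there is no encoding to get wrong, but every string-style operation has to be written by hand like this.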

Option 2 is MUCH easier to work with for mostly string data (you can use INSTR, MID, LEFT, RIGHT, cutting and chopping much more easily than with byte arrays), BUT you have to have read the data in via the right encoder or it will be “altered” during the loading process and won’t contain the same bytes that were actually in the source file.
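A minimal sketch of the Option 2 workflow, end to end: decode with 1252, edit with the classic string functions, then encode with the same codepage on the way out. `ReplaceFirst` is a hypothetical name, and the code assumes .NET Framework so that `GetEncoding(1252)` is available. It decodes the raw bytes itself instead of calling ReadAllText, since the ReadAllText routines also look for a byte order mark at the start of the file, which is one more thing that could surprise you with binary data.

```vb
Imports System
Imports System.Text
Imports Microsoft.VisualBasic

Module StringEdit
    ' Hypothetical helper: replaces the first occurrence of oldText in a
    ' binary buffer by treating the bytes as codepage 1252 text.
    Function ReplaceFirst(data As Byte(), oldText As String,
                          newText As String) As Byte()
        ' Assumption: .NET Framework, where 1252 is available by default.
        Dim cp1252 As Encoding = Encoding.GetEncoding(1252)

        ' One character per byte, so string positions match file offsets.
        Dim buf As String = cp1252.GetString(data)

        Dim pos As Integer = InStr(buf, oldText)   ' 1-based, 0 if not found
        If pos > 0 Then
            buf = Left(buf, pos - 1) & newText & Mid(buf, pos + Len(oldText))
        End If

        ' Encode with the SAME codepage to get the bytes back.
        Return cp1252.GetBytes(buf)
    End Function
End Module
```

The result can go back to disk with `System.IO.File.WriteAllBytes`. Keep in mind that the new text has to consist of characters that exist in 1252, and that a replacement of a different length shifts everything after it, which matters if the binary headers store lengths or offsets.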

Doing this won’t work:

    Dim Buf As String = My.Computer.FileSystem.ReadAllText(MyFileName)

because the ReadAllText routine uses, as its default, the UTF8 encoder, and any bytes that don’t form valid UTF8 get replaced during the load.

Hopefully, putting all this down in writing now will keep me from forgetting about it the next time I’m mucking with funky file formats!
