AdvancedIO
A buffer is a block of bytes of fixed size. Buffers are useful when you don't wish to manually calculate padding or if you know the size of the data but not the exact contents.
Here we have a collection of various fields that are contained within a 100 byte block. The fields don't total 100 bytes so we need to add padding.
class MyRecord < BinData::Record
  endian :little
  struct :one_hundred_bytes do
    uint8  :a
    string :b, length: 10
    uint16 :c
    ...
    string :padding, length: -> { 100 - padding.rel_offset }
  end
end
Using a buffer removes the need to calculate this padding.
class MyRecord < BinData::Record
  endian :little
  buffer :one_hundred_bytes, length: 100 do
    uint8  :a
    string :b, length: 10
    uint16 :c
    ...
  end
end
The amount of padding in a buffer can be calculated if needed.
buffer.num_bytes - buffer.raw_num_bytes
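To make the padding arithmetic concrete, here is a minimal plain-Ruby sketch that does not use BinData. It packs three illustrative fields by hand and pads the result with zero bytes to a fixed block size. The helper name `pack_block`, the constant `BLOCK_SIZE`, the field values, and the choice of zero bytes as padding are assumptions made for illustration.

```ruby
# Plain-Ruby sketch (not BinData): pack three fields, then zero-pad the
# result to a fixed-size block. pack_block and BLOCK_SIZE are hypothetical.
BLOCK_SIZE = 100

def pack_block(a, b, c)
  # C   = uint8
  # a10 = 10-byte string, null padded
  # v   = uint16, little-endian
  raw = [a, b, c].pack("Ca10v")
  padding = BLOCK_SIZE - raw.bytesize
  raise ArgumentError, "fields exceed #{BLOCK_SIZE} bytes" if padding.negative?

  [raw + ("\x00".b * padding), padding]
end

block, padding = pack_block(1, "abc", 513)
puts block.bytesize  # 100
puts padding         # 87
```

The three fields occupy 1 + 10 + 2 = 13 bytes, so 87 bytes of padding are needed. This is the difference that the buffer calculation above reports.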
A buffer is also useful when there is an array defined by number of bytes rather than number of elements.
class StringTable < BinData::Record
  uint16le :table_size_in_bytes
  buffer :strings, length: :table_size_in_bytes do
    array type: :stringz, read_until: :eof
  end
end
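The byte layout this record describes can be sketched in plain Ruby without BinData: a little-endian uint16 byte count, followed by that many bytes of zero-terminated strings. The helper name `read_string_table` and the sample data are hypothetical.

```ruby
require "stringio"

# Plain-Ruby sketch (not BinData): read a table whose size is given in
# bytes rather than in number of elements.
def read_string_table(io)
  size = io.read(2).unpack1("v")  # table size in bytes, little-endian uint16
  table = io.read(size)           # only this block belongs to the table
  table.split("\x00".b)           # each string ends at a zero byte
end

strings = "one\0two\0three\0"
data = [strings.bytesize].pack("v") + strings + "trailing"

io = StringIO.new(data)
p read_string_table(io)  # ["one", "two", "three"]
p io.read                # "trailing"
```

Bounding the read by the byte count is what lets the array read "until end of file" without consuming the data that follows the table.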
A Section is a layer on top of a stream that transforms the underlying data. This allows BinData to process a stream that has multiple encodings, e.g. some data in the stream is compressed or encrypted.
There are several common transforms provided in bindata/transform.
Here is an example of using the builtin zlib transform.
require 'bindata/transform/zlib'
class ZlibRecord < BinData::Record
  int32le :section_len, value: -> { s.num_bytes }
  section :s, transform: -> { BinData::Transform::Zlib.new(section_len) } do
    int32le :len, value: -> { str.length }
    string :str, read_length: :len
  end
end
obj = ZlibRecord.new
obj.s.str = "highly compressible" * 100
obj.num_bytes #=> 51
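To see why the record is so much smaller than its contents, the following plain-Ruby sketch compresses an equivalent payload with Ruby's standard Zlib library. It assumes the transform deflates the section body in roughly this way; it does not use BinData, and the exact compressed size is printed without an expected value because it depends on the zlib build and compression settings.

```ruby
require "zlib"

# Plain-Ruby sketch (not BinData): the section body is an int32le length
# followed by the string itself.
str = "highly compressible" * 100
payload = [str.bytesize].pack("l<") + str

compressed = Zlib::Deflate.deflate(payload)
restored   = Zlib::Inflate.inflate(compressed)

puts payload.bytesize                        # 1904
puts compressed.bytesize                     # far smaller than 1904
puts compressed.bytesize < payload.bytesize  # true
puts restored == payload                     # true
```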
If the required transform is not provided, then you will need to provide the transformation code yourself, but fortunately it is not difficult.
Here is an example of an XOR-encrypted stream.
class XorTransform < BinData::IO::Transform
  def initialize(xor)
    super()
    @xor = xor
  end

  def read(n)
    chain_read(n).bytes.map { |byte| (byte ^ @xor).chr }.join
  end

  def write(data)
    chain_write(data.bytes.map { |byte| (byte ^ @xor).chr }.join)
  end
end
obj = BinData::Section.new(transform: -> { XorTransform.new(0xff) },
                           type: [:string, read_length: 5])
obj.read("\x97\x9A\x93\x93\x90") #=> "hello"
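The byte values in the example above can be checked with a few lines of plain Ruby that do not use BinData. XOR with a fixed key is its own inverse, which is why the transform applies the same operation when reading and when writing. The helper name `xor_bytes` is hypothetical.

```ruby
# Plain-Ruby sketch (not BinData): XOR every byte with a single-byte key.
def xor_bytes(data, key)
  data.bytes.map { |byte| byte ^ key }.pack("C*")
end

encrypted = xor_bytes("hello", 0xff)
puts encrypted.unpack1("H*")     # 979a939390
puts xor_bytes(encrypted, 0xff)  # hello
```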
Some structures contain binary data that is irrelevant to your purposes.
Say you are interested in 50 bytes of data located 10 megabytes into the stream. One way of accessing this useful data is:
class MyData < BinData::Record
  string length: 10 * 1024 * 1024
  string :data, length: 50
end
The advantage of this method is that the irrelevant data is preserved when writing the record. The disadvantage is that even if you don't care about preserving this irrelevant data, it still occupies memory.
If you don't need to preserve this data, an alternative is to use skip instead of string. When reading, it will seek over the irrelevant data and won't consume space in memory. When writing, it will write :length zero bytes.
class MyData < BinData::Record
  skip length: 10 * 1024 * 1024
  string :data, length: 50
end
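The difference between reading and seeking can be sketched in plain Ruby without BinData. Seeking only moves the stream position, so no string the size of the gap is ever allocated. The helper name `read_after_gap` is hypothetical, and the gap is scaled down from 10 megabytes to keep the example small.

```ruby
require "stringio"

# Plain-Ruby sketch (not BinData): seek past a gap instead of reading it.
def read_after_gap(io, gap, length)
  io.seek(gap, IO::SEEK_CUR)  # moves the position; the gap is never read
  io.read(length)
end

GAP = 1024  # scaled-down stand-in for the irrelevant data
io = StringIO.new(("\xAA".b * GAP) + "payload")
puts read_after_gap(io, GAP, 7)  # payload
```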
Skip also has a :to_abs_offset convenience option as an alternative to :length. It will advance the stream to the given absolute offset. Skipping backwards is not supported. If you need to skip backwards, refer to Multi-pass I/O below.
Sometimes you don't know the offset that the data is located at. You can skip directly to the data by using assertions to specify a pattern to search for. It is better to be as specific as possible to avoid false matches.
class MyData < BinData::Record
  endian :little
  # skip to 'ST' followed by a uint16 between 10 and 1000
  skip do
    string read_length: 2, asserted_value: "ST"
    uint16 assert: -> { (10..1000).include? value }
  end
  # we are now positioned correctly
  string :sig, read_length: 2
  uint16 :count
  array :my_data, length: :count
  ...
end
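The search that the skip block performs can be sketched in plain Ruby without BinData: scan forward for the signature and accept the first position where the following little-endian uint16 falls in range. The helper name `find_record_start` and the sample bytes are hypothetical. The sample shows why the second assertion matters, since the first "ST" is a false match.

```ruby
# Plain-Ruby sketch (not BinData): find the first offset where "ST" is
# followed by a little-endian uint16 between 10 and 1000.
def find_record_start(data)
  data = data.b
  offset = 0
  while (offset = data.index("ST".b, offset))
    count = data.byteslice(offset + 2, 2)&.unpack1("v")
    return offset if count && (10..1000).cover?(count)

    offset += 1  # false match; keep searching after this position
  end
  nil
end

sample = "xxST" + [5].pack("v") + "yST" + [300].pack("v") + "rest"
p find_record_start(sample)      # 7
p find_record_start("no match")  # nil
```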
Some file formats don't use length fields but rather read until the end of the file. The stream length is needed when reading these formats. The count_bytes_remaining keyword will give the number of bytes remaining in the stream.
Consider a string followed by a 2 byte checksum. The length of the string is not specified but is implied by the file length.
class StringWithChecksum < BinData::Record
  count_bytes_remaining :bytes_remaining
  string :the_string, read_length: -> { bytes_remaining - 2 }
  int16le :checksum
end
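The same layout can be read in plain Ruby without BinData, which shows where the dependence on a seekable stream comes from: the remaining byte count is derived from the stream's total size and current position. The helper name `read_string_with_checksum` and the sample values are hypothetical.

```ruby
require "stringio"

# Plain-Ruby sketch (not BinData): the string length is implied by the
# number of bytes left in the stream, minus the 2-byte checksum.
def read_string_with_checksum(io)
  bytes_remaining = io.size - io.pos   # needs a stream that knows its size
  the_string = io.read(bytes_remaining - 2)
  checksum = io.read(2).unpack1("s<")  # int16, little-endian
  [the_string, checksum]
end

io = StringIO.new("hello world" + [1234].pack("s<"))
the_string, checksum = read_string_with_checksum(io)
puts the_string  # hello world
puts checksum    # 1234
```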
These file formats only work with seekable streams (e.g. files). These formats do not stream well as they must be buffered by the client before being processed. Consider using an explicit length when creating a new file format as it is easier to work with.
BinData optimises for single pass file formats. A single pass format is one that is processed as a sequential stream of bytes.
Some file formats require multi pass I/O, i.e. they require seeking backwards in the stream. BinData provides for this with delayed_io fields. delayed_io fields are similar to virtual fields in that they aren't read or written along with the other fields.
Let's consider a file format that contains a directory of people, represented by name and age. The person records are of variable length, so an offset array is provided for easy lookup.
class Person < BinData::Record
  uint8 :name_len, value: -> { name.length }
  string :name, read_length: :name_len
  uint8 :age
end

class Directory < BinData::Record
  endian :little
  uint32 :num_entries
  array :offsets, type: :uint32, initial_length: :num_entries
  array :people, initial_length: :num_entries do
    delayed_io type: :person, read_abs_offset: -> { offsets[index] }
  end
end
This file format is multi pass because the offsets do not necessarily refer to the person data sequentially. For example, the people could be stored in alphabetical order while the offsets order them by age.
Reading the directory will read the offsets field, but will not read any person data.
d = Directory.read(io)
d.num_entries #=> 200
d.offsets.length #=> 200
d.offsets.num_bytes #=> 800
d.people.length #=> 200
d.people.num_bytes #=> 0
d.people[50] #=> { name_len: 0, name: "", age: 0 }
The person data won't be read until explicitly requested.
d = Directory.read(io)
d.people[50].read_now!
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }
To request all entries we could read like this:
d = Directory.read(io) { |obj| obj.people.each { |per| per.read_now! } }
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }
However, if we want BinData to perform the multiple passes for us automatically, we can use the auto_call_delayed_io keyword.
class Directory < BinData::Record
  auto_call_delayed_io
  endian :little
  uint32 :num_entries
  array :offsets, type: :uint32, initial_length: :num_entries
  array :people, initial_length: :num_entries do
    delayed_io type: :person, read_abs_offset: -> { offsets[index] }
  end
end
d = Directory.read(io)
d.people[50] #=> { name_len: 13, name: "Charlie Brown", age: 8 }
Multi-pass writing is similar in that writing is not performed until #write_now! is called.
Note that #num_bytes may behave unexpectedly when using delayed_io. It will behave as normal when using the auto_call_delayed_io keyword. When not using this keyword, #num_bytes will only return the number of bytes for a single pass, as the number of passes hasn't been specified yet.
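To show what the multiple passes over the directory format amount to at the byte level, here is a plain-Ruby sketch that does not use BinData. The first pass reads the header and offsets sequentially; each person is then read by seeking to an absolute offset, which may move backwards in the stream. All helper names are hypothetical, and the sample people other than Charlie Brown are made up for illustration.

```ruby
require "stringio"

# Plain-Ruby sketch (not BinData). Person: uint8 name_len, name, uint8 age.
def pack_person(name, age)
  [name.bytesize, name, age].pack("Ca*C")
end

# Builds: uint32le count, uint32le absolute offsets, then the records.
# `order` lists record indexes in the order the offsets should refer to them.
def build_directory(people, order)
  records = people.map { |name, age| pack_person(name, age) }
  pos = 4 + 4 * records.size  # first record starts after the header
  starts = records.map { |rec| start = pos; pos += rec.bytesize; start }
  [records.size].pack("V") + order.map { |i| starts[i] }.pack("V*") + records.join
end

def read_person(io, offset)
  io.seek(offset)  # absolute seek; may go backwards
  name_len = io.read(1).unpack1("C")
  { name: io.read(name_len), age: io.read(1).unpack1("C") }
end

def read_directory(io)
  num_entries = io.read(4).unpack1("V")
  offsets = io.read(4 * num_entries).unpack("V*")   # first, sequential pass
  offsets.map { |offset| read_person(io, offset) }  # one seek per entry
end

# Records stored alphabetically; offsets ordered by age.
people = [["Charlie Brown", 8], ["Lucy", 9], ["Sally", 5]]
io = StringIO.new(build_directory(people, [2, 0, 1]))
read_directory(io).each { |per| puts "#{per[:name]} (#{per[:age]})" }
# Sally (5)
# Charlie Brown (8)
# Lucy (9)
```

Reading Charlie Brown after Sally requires seeking backwards, which is why this format cannot be processed as a single sequential pass.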