
Serialize/Deserialize With Apache Arrow

Apache Arrow

Apache Arrow is an in-memory columnar data format used across various systems such as Apache Spark, Impala, and Apache Drill.

Arrow represents columnar data as value vectors. There are various types of value vectors depending on the data type. In this post, I serialize a NullableIntVector to a file and deserialize it back.
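To build intuition for what a value vector stores, here is a plain-Java illustration (not Arrow code) of the idea behind a nullable int vector: a buffer of values plus a validity bitmap marking which slots hold a value. The class and method names are my own, for illustration only.

```java
import java.util.BitSet;

// Illustration only: a nullable int "vector" as a values array plus a
// validity bitmap, mirroring the idea behind Arrow's NullableIntVector.
class ToyIntVector {
    private final int[] values;
    private final BitSet validity = new BitSet(); // set bit => value present

    ToyIntVector(int capacity) {
        values = new int[capacity];
    }

    void set(int index, int value) {
        values[index] = value;
        validity.set(index);
    }

    boolean isNull(int index) {
        return !validity.get(index); // slots never set read back as null
    }

    int get(int index) {
        return values[index];
    }

    public static void main(String[] args) {
        ToyIntVector v = new ToyIntVector(4);
        v.set(0, 3);
        v.set(2, 1);
        System.out.println(v.get(0));    // 3
        System.out.println(v.isNull(1)); // true: slot 1 was never set
    }
}
```

The real Arrow vectors work on off-heap buffers managed by an allocator, but the values-plus-validity split is the same.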

Sample Code

Getting Started

The arrow-vector module is already available in the Maven repositories.

pom.xml

<dependencies>
    <!-- https://mvnrepository.com/artifact/org.apache.arrow/arrow-vector -->
    <dependency>
        <groupId>org.apache.arrow</groupId>
        <artifactId>arrow-vector</artifactId>
        <version>0.4.0</version>
    </dependency>
</dependencies>

Writing a NullableIntVector to a file

public static void write(String path, BufferAllocator allocator) throws IOException {

    try (FileOutputStream out = new FileOutputStream(path)) {
        // Build an int vector named "test" and fill in four values.
        NullableIntVector vector = new NullableIntVector("test", allocator);
        vector.allocateNew();
        NullableIntVector.Mutator mutator = vector.getMutator();
        mutator.set(0, 3);
        mutator.set(1, 2);
        mutator.set(2, 1);
        mutator.set(3, 4);
        mutator.setValueCount(4);

        // Wrap the vector in a schema root and write it out as one record batch.
        VectorSchemaRoot root = new VectorSchemaRoot(asList(vector.getField()), asList((FieldVector) vector), 4);
        try (ArrowWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out))) {
            writer.writeBatch();
        }
    }
}

Reading a NullableIntVector from a file

public static void read(String path, BufferAllocator allocator) throws IOException {
    byte[] byteArray = Files.readAllBytes(FileSystems.getDefault().getPath(path));
    SeekableReadChannel channel = new SeekableReadChannel(new ByteArrayReadableSeekableByteChannel(byteArray));
    try (ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {

        // Load each record batch and read the "test" vector back out.
        for (ArrowBlock block : reader.getRecordBlocks()) {
            reader.loadRecordBatch(block);
            FieldReader fieldReader = reader.getVectorSchemaRoot().getVector("test").getReader();
            for (int i = 0; i < 4; i++) {
                fieldReader.setPosition(i);
                System.out.println("buf[" + i + "]: " + fieldReader.readInteger());
            }
        }
    }
}

Caller

public static void main(String[] args) throws IOException {
    write("test", new RootAllocator(Long.MAX_VALUE));
    read("test", new RootAllocator(Long.MAX_VALUE));
}

Running this prints:

buf[0]: 3
buf[1]: 2
buf[2]: 1
buf[3]: 4
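As a sanity check on the file we just wrote: the Arrow file format brackets its contents with the magic bytes `ARROW1` at both the start and the end of the file. A small stdlib-only helper (my own, not part of the Arrow API) can verify this:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;

public class ArrowMagic {
    // The Arrow file format starts and ends with the magic bytes "ARROW1".
    private static final byte[] MAGIC = "ARROW1".getBytes();

    static boolean looksLikeArrowFile(byte[] bytes) {
        if (bytes.length < 2 * MAGIC.length) {
            return false;
        }
        byte[] head = Arrays.copyOfRange(bytes, 0, MAGIC.length);
        byte[] tail = Arrays.copyOfRange(bytes, bytes.length - MAGIC.length, bytes.length);
        return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
    }

    public static void main(String[] args) throws IOException {
        // "test" is the path written by the write() method above.
        byte[] bytes = Files.readAllBytes(Paths.get("test"));
        System.out.println("valid Arrow file: " + looksLikeArrowFile(bytes));
    }
}
```

This only checks the framing, not the contents, but it is a quick way to confirm the writer produced an Arrow file and not something else.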

Conclusion

In this post, I tried serialization/deserialization with Apache Arrow. Writing data to a file is probably not the most typical use of Apache Arrow, which is designed primarily as an in-memory format, but I wanted runnable Apache Arrow code, so this is what I did. Enjoy it!