Apache Arrow
Apache Arrow is an in-memory columnar data format used across various systems such as Apache Spark, Impala, and Apache Drill.
Arrow represents columnar data with Value Vectors. There are various types of value vectors, depending on the data type. In this post, I serialize a NullableIntVector to a file and deserialize it again.
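To make the idea of a nullable value vector more concrete, here is a minimal plain-Java sketch of a nullable int column: a contiguous values array plus a validity bitmap that marks which slots are set. This is not Arrow's implementation. The class NullableIntColumn and its methods are hypothetical names I made up for illustration (loosely mirroring the set/setValueCount calls used later in this post), and it uses java.util.BitSet instead of Arrow's off-heap buffers.

```java
import java.util.BitSet;

// Hypothetical, simplified model of a nullable int column. NOT the Arrow API.
final class NullableIntColumn {
    private final int[] values;                   // contiguous value slots
    private final BitSet validity = new BitSet(); // bit i set => slot i holds a value
    private int valueCount;                       // number of logical rows

    NullableIntColumn(int capacity) {
        this.values = new int[capacity];
    }

    // Store a value and mark the slot as non-null.
    void set(int index, int value) {
        values[index] = value;
        validity.set(index);
    }

    void setValueCount(int count) {
        this.valueCount = count;
    }

    int getValueCount() {
        return valueCount;
    }

    boolean isNull(int index) {
        return !validity.get(index);
    }

    // Returns null for slots that were never set.
    Integer get(int index) {
        if (index < 0 || index >= valueCount) {
            throw new IndexOutOfBoundsException("index: " + index);
        }
        if (isNull(index)) {
            return null;
        }
        return values[index];
    }

    public static void main(String[] args) {
        NullableIntColumn column = new NullableIntColumn(4);
        column.set(0, 3);
        column.set(1, 2);
        // Slot 2 is deliberately left unset, so it reads back as null.
        column.set(3, 4);
        column.setValueCount(4);
        for (int i = 0; i < column.getValueCount(); i++) {
            System.out.println("buf[" + i + "]: " + column.get(i));
        }
    }
}
```

Keeping the values and the null flags in separate buffers is what lets a columnar format scan the values without checking an object per row.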
Sample Code
Getting Started
The arrow-vector module is already available in the Maven repositories.
pom.xml:
<dependencies>
  <!-- https://mvnrepository.com/artifact/org.apache.arrow/arrow-vector -->
  <dependency>
    <groupId>org.apache.arrow</groupId>
    <artifactId>arrow-vector</artifactId>
    <version>0.4.0</version>
  </dependency>
</dependencies>
Write to file
The following sample code writes a NullableIntVector to a file:
// asList is assumed to be a static import of java.util.Arrays.asList.
public static void write(String path, BufferAllocator allocator) throws IOException {

    try (FileOutputStream out = new FileOutputStream(path)) {
        // Create a nullable int vector named "test" and fill it with four values.
        NullableIntVector vector = new NullableIntVector("test", allocator);
        vector.allocateNew();
        NullableIntVector.Mutator mutator = vector.getMutator();
        mutator.set(0, 3);
        mutator.set(1, 2);
        mutator.set(2, 1);
        mutator.set(3, 4);
        mutator.setValueCount(4);

        // Wrap the vector in a VectorSchemaRoot (one field, four rows) and write it as a single batch.
        VectorSchemaRoot root = new VectorSchemaRoot(asList(vector.getField()), asList((FieldVector) vector), 4);
        try (ArrowWriter writer = new ArrowFileWriter(root, null, Channels.newChannel(out))) {
            writer.writeBatch();
        }
    }
}
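A quick way to sanity-check the file produced by write, without going through Arrow's reader, is to look at its magic bytes. As far as I know, the Arrow file format starts and ends with the ASCII string ARROW1, so the sketch below only checks those two places. It uses nothing but the JDK; the class ArrowMagicCheck and the method hasArrowMagic are hypothetical helpers of mine, not part of the Arrow library, and a positive result does not prove that the rest of the file is valid.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper, not part of Arrow: checks only the leading and trailing magic bytes.
final class ArrowMagicCheck {
    // Assumption: Arrow files begin and end with the ASCII bytes "ARROW1".
    static final byte[] MAGIC = "ARROW1".getBytes(StandardCharsets.US_ASCII);

    static boolean hasArrowMagic(Path path) throws IOException {
        if (!Files.isRegularFile(path)) {
            return false;
        }
        try (RandomAccessFile file = new RandomAccessFile(path.toFile(), "r")) {
            // The file must be long enough to hold both magic strings.
            if (file.length() < 2L * MAGIC.length) {
                return false;
            }
            byte[] head = new byte[MAGIC.length];
            byte[] tail = new byte[MAGIC.length];
            file.readFully(head);
            file.seek(file.length() - MAGIC.length);
            file.readFully(tail);
            return Arrays.equals(head, MAGIC) && Arrays.equals(tail, MAGIC);
        }
    }

    public static void main(String[] args) throws IOException {
        // Defaults to "test", the file name used by the caller in this post.
        Path path = Paths.get(args.length > 0 ? args[0] : "test");
        System.out.println(path + " has Arrow magic bytes: " + hasArrowMagic(path));
    }
}
```

I read only the first and last six bytes through RandomAccessFile so the check stays cheap even for large files.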
Read from file
The following sample code reads the NullableIntVector back from the file:
public static void read(String path, BufferAllocator allocator) throws IOException {
    // Load the whole file into memory and expose it as a seekable channel.
    byte[] byteArray = Files.readAllBytes(FileSystems.getDefault().getPath(path));
    SeekableReadChannel channel = new SeekableReadChannel(new ByteArrayReadableSeekableByteChannel(byteArray));
    try (ArrowFileReader reader = new ArrowFileReader(channel, allocator)) {

        // Load each record batch and read the "test" vector through a FieldReader.
        for (ArrowBlock block : reader.getRecordBlocks()) {
            reader.loadRecordBatch(block);
            FieldReader fieldReader = reader.getVectorSchemaRoot().getVector("test").getReader();
            // The first read happens before any setPosition call; later reads move the position explicitly.
            System.out.println("buf[0]: " + fieldReader.readInteger());
            fieldReader.setPosition(1);
            System.out.println("buf[1]: " + fieldReader.readInteger());
            fieldReader.setPosition(2);
            System.out.println("buf[2]: " + fieldReader.readInteger());
            fieldReader.setPosition(3);
            System.out.println("buf[3]: " + fieldReader.readInteger());
        }
    }
}
Caller
The following sample code calls these write/read methods:
public static void main(String[] args) throws IOException {
    write("test", new RootAllocator(Long.MAX_VALUE));
    read("test", new RootAllocator(Long.MAX_VALUE));
}
Running this prints:
buf[0]: 3
buf[1]: 2
buf[2]: 1
buf[3]: 4
Conclusion
In this post, I tried serialization and deserialization with Apache Arrow. I don't think writing data to a file is the main use case for Apache Arrow, but I did it because I wanted some executable Apache Arrow code. Enjoy!