Processing zip file in memory – how to and not to

2021.05.14


Introduction

In an earlier post I wrote about downloading zip files from S3 and unzipping them in memory. Today we're gonna see how to unzip and process zip files from the local disk.

First we are going to implement a solution that works but has a huge flaw and later we will do it properly.

How NOT to unzip in memory

As in the S3 example, we will take advantage of the NewReader function provided by the archive/zip package. To use it we need to give it an object that implements the io.ReaderAt interface, plus the size of our zip archive in bytes.

func NewReader(r io.ReaderAt, size int64) (*Reader, error)
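To make the io.ReaderAt requirement concrete, here is a small sketch that checks a few standard library types against the interface. The helpers isReaderAt and readerAtSupport are hypothetical names introduced only for this illustration; they are not part of archive/zip.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"os"
)

// isReaderAt reports whether v satisfies io.ReaderAt, the interface
// zip.NewReader expects as its first argument.
// (Hypothetical helper, for illustration only.)
func isReaderAt(v interface{}) bool {
	_, ok := v.(io.ReaderAt)
	return ok
}

// readerAtSupport checks a few standard library types, keyed by type name.
func readerAtSupport() map[string]bool {
	return map[string]bool{
		"*bytes.Reader": isReaderAt(bytes.NewReader(nil)),
		"*os.File":      isReaderAt((*os.File)(nil)),
		"*bytes.Buffer": isReaderAt(bytes.NewBuffer(nil)),
	}
}

func main() {
	for name, ok := range readerAtSupport() {
		fmt.Printf("%-14s implements io.ReaderAt: %v\n", name, ok)
	}
}
```

Note that *bytes.Buffer has no ReadAt method, so it cannot be handed to zip.NewReader directly; it only supports sequential reads, while the zip format needs random access to find the directory stored at the end of the archive.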

This is how NOT to do it:

fi, err := os.Stat(filePath) // only used to get the archive size
if err != nil {
	// process error
}
size := fi.Size()

b, err := ioutil.ReadFile(filePath) // loads the entire archive into memory
if err != nil {
	// process error
}
readerAt := bytes.NewReader(b) // *bytes.Reader implements io.ReaderAt
reader, err := zip.NewReader(readerAt, size)
if err != nil {
	// process error
}
for _, f := range reader.File {
	...
}

To get the size of the archive we simply call os.Stat with the path to the file and use the Size() method of the returned object. So far so good, we have the size.

Next we read the whole file with the ioutil.ReadFile function and pass the received bytes to bytes.NewReader. This gives us an object that implements the io.ReaderAt interface.

Finally we pass the object and the size of the archive to zip.NewReader and we're home. Except there is a problem. On line 7, the ioutil.ReadFile call reads the whole file into memory. That is not a problem for small files, and this solution works perfectly fine in that case. However, if we give the program a big archive, the entire archive has to fit in RAM at once, and this code can get us in trouble.
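The loop body is elided in the snippet above. As a rough sketch of what processing each entry might look like, the example below builds a tiny archive in memory (so it does not depend on any file on disk) and reads every entry back. The helpers buildSampleZip and readEntries, and the entry names and contents, are made up for illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// buildSampleZip creates a small archive in memory.
// The entry names and contents are invented for this example.
func buildSampleZip() ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, body := range map[string]string{"a.txt": "hello", "b.txt": "world!"} {
		fw, err := w.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := fw.Write([]byte(body)); err != nil {
			return nil, err
		}
	}
	// Close writes the central directory; the archive is invalid without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readEntries opens every entry of an in-memory archive and returns the
// contents keyed by entry name.
func readEntries(b []byte) (map[string]string, error) {
	// With the bytes already in hand, len(b) is the archive size.
	zr, err := zip.NewReader(bytes.NewReader(b), int64(len(b)))
	if err != nil {
		return nil, err
	}
	out := make(map[string]string, len(zr.File))
	for _, zf := range zr.File {
		rc, err := zf.Open() // the entry is decompressed as rc is read
		if err != nil {
			return nil, err
		}
		data, err := io.ReadAll(rc)
		rc.Close()
		if err != nil {
			return nil, err
		}
		out[zf.Name] = string(data)
	}
	return out, nil
}

func main() {
	b, err := buildSampleZip()
	if err != nil {
		panic(err)
	}
	entries, err := readEntries(b)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(entries), entries["a.txt"], entries["b.txt"])
}
```

Reading each entry fully with io.ReadAll is fine for small entries; for large ones it would probably be better to stream from rc with io.Copy, otherwise the same memory problem comes back one level down.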

How to unzip in memory

Let's do it properly.

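fi, err := os.Stat(evaledFilePath)
if err != nil {
	// process error
}
size := fi.Size()
f, err := os.Open(evaledFilePath) // *os.File implements io.ReaderAt
if err != nil {
	// process error
}
defer f.Close() // keep the file open while entries are being read
reader, err := zip.NewReader(f, size)
if err != nil {
	// process error
}
for _, zf := range reader.File { // zf avoids shadowing the open file f
	...
}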

We get the size of the archive in exactly the same manner as before.

The difference is in how we open the archive: we use the os.Open function. It returns an *os.File, which implements the required io.ReaderAt interface, so we can pass it directly to zip.NewReader. Now zip.NewReader reads only the archive's directory up front, and the data of each entry is read from disk as we open and read that entry, instead of the whole file being loaded into memory.
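To check that claim, here is a minimal sketch that wraps the file in a byte-counting io.ReaderAt and compares the archive size with the number of bytes zip.NewReader actually reads while opening it. The names countingReaderAt, writeSampleZip, measure and demo are hypothetical helpers, and the sample archive (one uncompressed 1 MiB entry called big.bin) is made up for the test.

```go
package main

import (
	"archive/zip"
	"fmt"
	"io"
	"os"
	"path/filepath"
)

// countingReaderAt wraps an io.ReaderAt and tallies the bytes read through it.
// Hypothetical helper; not safe for concurrent use.
type countingReaderAt struct {
	r io.ReaderAt
	n int64
}

func (c *countingReaderAt) ReadAt(p []byte, off int64) (int, error) {
	n, err := c.r.ReadAt(p, off)
	c.n += int64(n)
	return n, err
}

// writeSampleZip writes an archive with a single stored (uncompressed)
// 1 MiB entry, so the archive itself is a little over 1 MiB.
func writeSampleZip(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	w := zip.NewWriter(f)
	fw, err := w.CreateHeader(&zip.FileHeader{Name: "big.bin", Method: zip.Store})
	if err != nil {
		return err
	}
	if _, err := fw.Write(make([]byte, 1<<20)); err != nil {
		return err
	}
	return w.Close()
}

// measure returns the archive size and the bytes read by zip.NewReader alone.
func measure(path string) (int64, int64, error) {
	f, err := os.Open(path)
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	fi, err := f.Stat() // size taken from the already opened file
	if err != nil {
		return 0, 0, err
	}
	cr := &countingReaderAt{r: f}
	if _, err := zip.NewReader(cr, fi.Size()); err != nil {
		return 0, 0, err
	}
	return fi.Size(), cr.n, nil
}

// demo creates a temporary archive, measures it, and cleans up.
func demo() (int64, int64, error) {
	dir, err := os.MkdirTemp("", "zipdemo")
	if err != nil {
		return 0, 0, err
	}
	defer os.RemoveAll(dir)
	path := filepath.Join(dir, "sample.zip")
	if err := writeSampleZip(path); err != nil {
		return 0, 0, err
	}
	return measure(path)
}

func main() {
	size, read, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Printf("archive: %d bytes, read while opening: %d bytes\n", size, read)
}
```

The exact byte counts depend on the Go version, so none are quoted here; the point is only that opening the archive touches a small fraction of the file. For the plain "open a zip on disk" case, the standard library's zip.OpenReader(name) wraps the same open, stat and NewReader steps and returns a *zip.ReadCloser that should be closed when done.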

Fin.