Download Zip file from S3 and unzip it in memory with Go

2021.03.29

Zip Archive

Go's archive/zip package lets us open zip archives and process the files inside them. The most common entry point is zip.OpenReader, which requires the archive to exist as a file on disk.

https://golang.org/pkg/archive/zip/#OpenReader

func OpenReader(name string) (*ReadCloser, error)

The package also provides NewReader, which takes any value that satisfies the io.ReaderAt interface together with the size of the whole archive. That value is what we will implement today using the AWS SDK for Go. https://golang.org/pkg/archive/zip/#NewReader

func NewReader(r io.ReaderAt, size int64) (*Reader, error)
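Before involving S3, it may help to see NewReader work on an archive that never touches the disk. The sketch below builds a small archive in a byte slice and reads it back through bytes.Reader, which already satisfies io.ReaderAt. The helper names buildArchive and readEntry are made up for this illustration.

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
	"io"
	"log"
)

// buildArchive creates a zip archive in memory from name -> contents pairs.
// Hypothetical helper, used only to produce test data.
func buildArchive(files map[string]string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	for name, body := range files {
		f, err := w.Create(name)
		if err != nil {
			return nil, err
		}
		if _, err := io.WriteString(f, body); err != nil {
			return nil, err
		}
	}
	// Close writes the central directory; the archive is unreadable without it.
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readEntry opens the archive with zip.NewReader and returns one entry's contents.
func readEntry(archive []byte, name string) (string, error) {
	// bytes.Reader implements io.ReaderAt, so no file on disk is needed.
	r, err := zip.NewReader(bytes.NewReader(archive), int64(len(archive)))
	if err != nil {
		return "", err
	}
	for _, f := range r.File {
		if f.Name != name {
			continue
		}
		rc, err := f.Open()
		if err != nil {
			return "", err
		}
		defer rc.Close()
		data, err := io.ReadAll(rc)
		return string(data), err
	}
	return "", fmt.Errorf("entry %q not found", name)
}

func main() {
	archive, err := buildArchive(map[string]string{"hello.txt": "hello from memory"})
	if err != nil {
		log.Fatal(err)
	}
	text, err := readEntry(archive, "hello.txt")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(text)
}
```

Anything that can answer "give me these bytes at this offset" can take the place of bytes.Reader here, which is the property the rest of this post relies on.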

Implementation

Let's start with the ReaderAt. We need our object to implement the ReadAt method. https://golang.org/pkg/io/#ReaderAt

ReadAt(p []byte, off int64) (n int, err error)
ReadAt reads len(p) bytes into p starting at offset off in the underlying input source. It returns the number of bytes read (0 <= n <= len(p)) and any error encountered.
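To make the contract concrete, here is a minimal in-memory sketch of a ReaderAt. The type memSource is a hypothetical stand-in for a remote object. The detail worth noticing is that ReadAt must return a non-nil error whenever it fills fewer than len(p) bytes.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// memSource is a hypothetical in-memory stand-in for a remote object.
type memSource struct{ data []byte }

// Compile-time check that *memSource satisfies io.ReaderAt.
var _ io.ReaderAt = (*memSource)(nil)

// ReadAt copies bytes starting at off into p. When fewer than len(p) bytes
// are available it returns the short count together with io.EOF.
func (m *memSource) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	if off >= int64(len(m.data)) {
		return 0, io.EOF
	}
	n := copy(p, m.data[off:])
	if n < len(p) {
		return n, io.EOF
	}
	return n, nil
}

func main() {
	src := &memSource{data: []byte("0123456789")}
	buf := make([]byte, 4)

	n, err := src.ReadAt(buf, 2)
	fmt.Println(n, string(buf[:n]), err) // 4 2345 <nil>

	n, err = src.ReadAt(buf, 8)
	fmt.Println(n, string(buf[:n]), err) // 2 89 EOF
}
```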

From the documentation for the AWS SDK for Go, we take advantage of the Range field of the GetObjectInput struct. https://docs.aws.amazon.com/sdk-for-go/api/service/s3/#GetObjectInput

// Downloads the specified range bytes of an object. For more information about
// the HTTP Range header, see https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
// (https://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35).
//
// Amazon S3 doesn't support retrieving multiple ranges of data per GET request.
Range *string `location:"header" locationName:"Range" type:"string"`
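The Range header is easy to try out without an AWS account. The sketch below uses Go's own net/http range support (http.ServeContent with an httptest recorder) as a stand-in for S3, so it only illustrates the HTTP semantics and is not an S3 call. The function name serveRange is an assumption made for this example.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// serveRange answers a single GET carrying the given Range header from an
// in-memory payload and reports the status code, body and Content-Range.
// No network socket is opened; the handler is called directly.
func serveRange(data []byte, rangeHeader string) (status int, body, contentRange string) {
	req := httptest.NewRequest(http.MethodGet, "/object.bin", nil)
	req.Header.Set("Range", rangeHeader)
	rec := httptest.NewRecorder()
	// http.ServeContent implements single byte-range requests for us.
	http.ServeContent(rec, req, "object.bin", time.Time{}, bytes.NewReader(data))
	return rec.Code, rec.Body.String(), rec.Header().Get("Content-Range")
}

func main() {
	status, body, cr := serveRange([]byte("0123456789"), "bytes=2-5")
	fmt.Println(status, body, cr) // 206 2345 bytes 2-5/10
}
```

Note that "bytes=2-5" returns four bytes, not three: both ends of the range are inclusive.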

This is what our object and its ReadAt method look like.

import (
	"fmt"
	"io"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

type S3Download struct {
	BucketName   string
	ObjectKey    string
	ObjectLength int64 // size of the archive in bytes
}

// ReadAt implements io.ReaderAt by requesting only the needed byte range from S3.
func (sd *S3Download) ReadAt(p []byte, offset int64) (int, error) {
	// #1
	if offset < 0 || offset >= sd.ObjectLength {
		return 0, fmt.Errorf("invalid offset %d", offset)
	}
	if len(p) == 0 {
		return 0, nil
	}

	svc := s3.New(session.Must(session.NewSession()))
	input := &s3.GetObjectInput{
		Bucket: aws.String(sd.BucketName),
		Key:    aws.String(sd.ObjectKey),
		// #2: the end of an HTTP byte range is inclusive, hence the -1
		Range: aws.String(fmt.Sprintf("bytes=%d-%d", offset, offset+int64(len(p))-1)),
	}

	result, err := svc.GetObject(input)
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok {
			// aerr.Code() is e.g. s3.ErrCodeNoSuchKey or s3.ErrCodeInvalidObjectState
			return 0, fmt.Errorf("get object range (%s): %w", aerr.Code(), aerr)
		}
		return 0, err
	}
	defer result.Body.Close()

	// #3: a single Read may return fewer bytes than requested, so read until p is full
	n, err := io.ReadFull(result.Body, p)
	if err == io.ErrUnexpectedEOF {
		// the range ran past the end of the object; ReadAt reports a short read with io.EOF
		return n, io.EOF
	}
	return n, err
}

This is a basic S3 download with three important points.

#1 - the offset parameter validation. The offset must be non-negative and smaller than the archive's size.

#2 - Range is a string in the format "bytes=[start]-[end]". Both positions are inclusive, so a buffer of len(p) bytes starting at offset ends at offset + len(p) - 1.

#3 - reading the response body into the p []byte parameter.
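The arithmetic behind points #1 and #2 can be checked in isolation. The following is a small sketch with a hypothetical helper, byteRange, which is not part of the AWS SDK. It validates the offset, computes the inclusive end position, and clamps it to the last byte of the object.

```go
package main

import (
	"errors"
	"fmt"
)

// byteRange builds the Range header value for reading up to n bytes starting
// at off from an object of the given size. Hypothetical helper for this post.
func byteRange(off int64, n int, size int64) (string, error) {
	if off < 0 || off >= size {
		return "", errors.New("offset out of range")
	}
	if n <= 0 {
		return "", errors.New("nothing to read")
	}
	end := off + int64(n) - 1 // inclusive end position
	if end > size-1 {
		end = size - 1 // clamp to the last byte of the object
	}
	return fmt.Sprintf("bytes=%d-%d", off, end), nil
}

func main() {
	cases := []struct {
		off int64
		n   int
	}{{0, 10}, {95, 10}, {100, 10}}
	for _, c := range cases {
		r, err := byteRange(c.off, c.n, 100)
		fmt.Printf("off=%d n=%d -> %q err=%v\n", c.off, c.n, r, err)
	}
}
```

For a 100-byte object, a 10-byte buffer at offset 0 maps to "bytes=0-9", the same buffer at offset 95 maps to "bytes=95-99", and offset 100 is rejected.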

There is one more point: sd.ObjectLength. It is the size of the archive, and we need to implement a method that retrieves it.

// FetchObjectLength retrieves the archive size from S3 and stores it in sd.ObjectLength.
// The method cannot share the field's name, because Go does not allow that.
func (sd *S3Download) FetchObjectLength() (int64, error) {
	svc := s3.New(session.Must(session.NewSession()))
	input := &s3.GetObjectInput{
		Bucket: aws.String(sd.BucketName),
		Key:    aws.String(sd.ObjectKey),
	}

	result, err := svc.GetObject(input)
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok {
			// aerr.Code() is e.g. s3.ErrCodeNoSuchKey or s3.ErrCodeInvalidObjectState
			return 0, fmt.Errorf("get object (%s): %w", aerr.Code(), aerr)
		}
		return 0, err
	}
	// We only need the metadata, so close the body without reading it.
	defer result.Body.Close()

	// #1
	sd.ObjectLength = aws.Int64Value(result.ContentLength)
	return sd.ObjectLength, nil
}

#1 - here we store the size on our object and return it for further use.

At this point our implementation is complete.

Usage

// main additionally needs the "archive/zip" and "log" imports.
func main() {
	// ReadAt has a pointer receiver, so we pass a pointer to zip.NewReader.
	s := &S3Download{
		BucketName: "exampleBucket",
		ObjectKey:  "object.zip",
	}

	// #1
	size, err := s.FetchObjectLength()
	if err != nil {
		log.Fatal(err)
	}
	reader, err := zip.NewReader(s, size)
	if err != nil {
		log.Fatal(err)
	}
	// #2
	for _, f := range reader.File {
		...
	}
}

#1 - first we retrieve the size of the archive and pass the result to the zip.NewReader call.

#2 - with the reader in hand, we can iterate over reader.File and process the files inside, all in memory, without unpacking the archive to disk.
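The whole flow can be exercised without S3 by swapping the remote call for a function. The sketch below is an approximation under that assumption: rangeFetcher, rangeReaderAt and the other names are hypothetical, and the fetcher simply slices an in-memory archive where the real implementation would issue a ranged GetObject. It also counts how many range requests the zip reader triggers.

```go
package main

import (
	"archive/zip"
	"bytes"
	"errors"
	"fmt"
	"io"
	"log"
)

// rangeFetcher returns the bytes in the inclusive range [start, end].
// Hypothetical abstraction over a ranged download.
type rangeFetcher func(start, end int64) ([]byte, error)

// rangeReaderAt adapts a rangeFetcher to io.ReaderAt.
type rangeReaderAt struct {
	fetch rangeFetcher
	size  int64
	calls int // number of range requests issued so far
}

func (r *rangeReaderAt) ReadAt(p []byte, off int64) (int, error) {
	if off < 0 {
		return 0, errors.New("negative offset")
	}
	if off >= r.size {
		return 0, io.EOF
	}
	if len(p) == 0 {
		return 0, nil
	}
	end := off + int64(len(p)) - 1
	if end > r.size-1 {
		end = r.size - 1
	}
	r.calls++
	chunk, err := r.fetch(off, end)
	if err != nil {
		return 0, err
	}
	n := copy(p, chunk)
	if n < len(p) {
		return n, io.EOF // short read must carry an error
	}
	return n, nil
}

// buildArchive creates a one-entry zip archive in memory (test data only).
func buildArchive(name, body string) ([]byte, error) {
	var buf bytes.Buffer
	w := zip.NewWriter(&buf)
	f, err := w.Create(name)
	if err != nil {
		return nil, err
	}
	if _, err := io.WriteString(f, body); err != nil {
		return nil, err
	}
	if err := w.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes(), nil
}

// readThroughRanges reads the first entry of the archive using only ranged fetches.
func readThroughRanges(archive []byte) (name, body string, calls int, err error) {
	r := &rangeReaderAt{
		size: int64(len(archive)),
		fetch: func(start, end int64) ([]byte, error) {
			return archive[start : end+1], nil // stand-in for a ranged download
		},
	}
	zr, err := zip.NewReader(r, r.size)
	if err != nil {
		return "", "", r.calls, err
	}
	if len(zr.File) == 0 {
		return "", "", r.calls, errors.New("empty archive")
	}
	rc, err := zr.File[0].Open()
	if err != nil {
		return "", "", r.calls, err
	}
	defer rc.Close()
	data, err := io.ReadAll(rc)
	return zr.File[0].Name, string(data), r.calls, err
}

func main() {
	archive, err := buildArchive("report.txt", "read via byte ranges")
	if err != nil {
		log.Fatal(err)
	}
	name, body, calls, err := readThroughRanges(archive)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("%s: %q (%d range requests)\n", name, body, calls)
}
```

The call counter is worth watching with a real archive: every ReadAt becomes one request, so caching or reading larger chunks may be worth considering if the request count grows.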

Conclusion

We implemented an object that lets us download and unpack a ZIP archive from S3 entirely in memory. We took advantage of the AWS SDK for Go and the Range header exposed through GetObjectInput.