How to Fetch Latest Objects from AWS S3 and Store Locally using Python

Hi, this is Charu from Classmethod. In this tutorial, we'll explore how to find the most recently modified object in an AWS S3 bucket and store it locally using Python. We'll be using the boto3 library, which is the official AWS SDK for Python.

Prerequisites:

  • Basic understanding of Python programming language and AWS.
  • AWS account with access to S3 buckets.
  • Python installed on your local machine.

Let's Get Started!

    To get started, you need to install boto3 using the following command:

    pip install boto3

    Step 1: AWS Configuration

    Before we start coding, make sure your AWS credentials are configured so that boto3 can find them. You can either assume an IAM role or use a tool such as AWSume to do this.
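
Before running the script, it can help to confirm which identity your credentials actually resolve to. The following is a minimal sketch and not part of the original script: the helper name `describe_identity` is hypothetical, and it assumes you pass it an STS client created with `boto3.client("sts")`.

```python
def describe_identity(sts_client):
    """Return the ARN of the identity behind the current credentials."""
    # The response also contains the "Account" and "UserId" fields
    response = sts_client.get_caller_identity()
    return response["Arn"]


# Usage (requires boto3 and configured credentials):
#   import boto3
#   print(describe_identity(boto3.client("sts")))
```

If the printed ARN is not the role or user you expected, fix the credentials first; otherwise the S3 calls later in the script may fail with an access error.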

    Step 2: Writing the Python Script

    Let's build the script in small snippets and walk through each one:

    import boto3
    import os
    
    print("Initializing S3 client...")
    s3_client = boto3.client('s3')
    
    bucket_name = 'YOUR-BUCKET-NAME'
    local_directory = 'PATH/TO/YOUR/DIRECTORY'

    We begin by importing the required libraries and initializing the S3 client using boto3. Replace YOUR-BUCKET-NAME with your actual S3 bucket name and PATH/TO/YOUR/DIRECTORY with the local directory where you want to store the fetched files.

    print(f"Fetching the list of objects from the S3 bucket: {bucket_name}")
    paginator = s3_client.get_paginator('list_objects_v2')
    pages = paginator.paginate(Bucket=bucket_name)
    
    latest_time = None
    latest_key = None
    
    for page in pages:
        for obj in page.get('Contents', []):
            if latest_time is None or obj['LastModified'] > latest_time:
                latest_time = obj['LastModified']
                latest_key = obj['Key']

    Now, we fetch the list of objects from the specified S3 bucket. We iterate through the objects and find the one with the latest modification time (LastModified).
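
The same search can be written as a small function, which makes it easy to try out without touching S3. This is only a sketch under a few assumptions: the name `find_latest_key` is hypothetical, the sample pages are hand-built to mimic the shape of `list_objects_v2` responses, and keys ending in `/` are treated as folder placeholders and skipped, which the original loop does not do.

```python
from datetime import datetime, timezone


def find_latest_key(pages):
    """Return the key of the most recently modified object, or None if there are none."""
    candidates = (
        obj
        for page in pages
        for obj in page.get("Contents", [])
        # Assumption: keys ending in "/" are folder placeholders, not real files
        if not obj["Key"].endswith("/")
    )
    latest = max(candidates, key=lambda obj: obj["LastModified"], default=None)
    return latest["Key"] if latest else None


# Hand-built pages mimicking the shape of list_objects_v2 responses
sample_pages = [
    {"Contents": [
        {"Key": "logs/", "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
        {"Key": "logs/a.txt", "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
    ]},
    {},  # a page with no objects has no "Contents" entry
    {"Contents": [
        {"Key": "logs/b.txt", "LastModified": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    ]},
]
print(find_latest_key(sample_pages))  # logs/b.txt
```

In the real script you would pass the `pages` object returned by `paginator.paginate()` instead of `sample_pages`.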

    Here, a paginator is a mechanism provided by the boto3 library for handling paginated responses from AWS services like S3. A single list_objects_v2 request returns at most 1,000 objects, so for larger buckets S3 splits the listing across several pages.

    In this code, we use paginator.paginate() to fetch pages of objects from the S3 bucket, and then iterate over these pages to find the latest object based on its modification time. This ensures that every object in the bucket is considered, not just those on the first page, and only one page of results is held in memory at a time.
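
To make the pagination mechanics concrete, here is a rough sketch of what the paginator does on your behalf. The helper name `iter_objects` is hypothetical; `IsTruncated`, `NextContinuationToken` and the `ContinuationToken` parameter are the fields `list_objects_v2` uses to link one page to the next.

```python
def iter_objects(s3_client, bucket_name):
    """Yield every object in the bucket by following continuation tokens by hand."""
    kwargs = {"Bucket": bucket_name}
    while True:
        response = s3_client.list_objects_v2(**kwargs)
        # An empty page has no "Contents" entry at all
        yield from response.get("Contents", [])
        # IsTruncated is True while more pages remain
        if not response.get("IsTruncated"):
            break
        # Ask for the next page, starting where this one stopped
        kwargs["ContinuationToken"] = response["NextContinuationToken"]
```

The paginator saves you from writing this loop yourself, which is why the script uses it.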

    if latest_key:
        local_file_path = os.path.join(local_directory, latest_key)
        # Create the target directory (and any missing parents); no error if it already exists
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        
        print(f"Downloading the latest file {latest_key} to {local_file_path}...")
        s3_client.download_file(bucket_name, latest_key, local_file_path)
        print("Download of the latest file complete.")
    else:
        print("No files found in the bucket.")

    If we find the latest object, we construct the local file path and download the file from S3 to the local directory using s3_client.download_file(). If the directory doesn't exist, we create it.
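
The script above downloads only the single most recent object. If you need the few most recent ones instead, one possible variation is sketched below. The function name `latest_keys` and the sample data are hypothetical, and the pages are hand-built to mimic the shape of a `list_objects_v2` response.

```python
import heapq
from datetime import datetime, timezone


def latest_keys(pages, n):
    """Return the keys of the n most recently modified objects, newest first."""
    objects = (obj for page in pages for obj in page.get("Contents", []))
    # nlargest scans every page but keeps only n candidates in memory
    newest = heapq.nlargest(n, objects, key=lambda obj: obj["LastModified"])
    return [obj["Key"] for obj in newest]


# Hand-built page mimicking the shape of a list_objects_v2 response
demo_pages = [
    {"Contents": [
        {"Key": "a.csv", "LastModified": datetime(2024, 1, 1, tzinfo=timezone.utc)},
        {"Key": "b.csv", "LastModified": datetime(2024, 3, 1, tzinfo=timezone.utc)},
        {"Key": "c.csv", "LastModified": datetime(2024, 2, 1, tzinfo=timezone.utc)},
    ]},
]
print(latest_keys(demo_pages, 2))  # ['b.csv', 'c.csv']
```

Each returned key can then be passed to s3_client.download_file() in a loop, in the same way the script does for a single key.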

    Step 3: Running the Script

    Save the above code to a Python file, for example, fetch_latest_s3.py, and run it using Python:

    python fetch_latest_s3.py

    Conclusion:

    That's it! You've now learned how to find the most recently modified object in an AWS S3 bucket and store it locally using Python. Feel free to customize the code according to your requirements and explore more features of the boto3 library for working with AWS services.

    Thank you for reading!

    Happy Learning:)