This project provides a Python-based ETL (Extract, Transform, Load) workflow using AWS Glue. It processes transaction data from multiple JSON datasets, joining user, order, and location data into a single structured JSON output. The transformed data is then stored in an Amazon S3 bucket for downstream use.
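The heart of the job is a join across the three datasets. As a rough plain-Python sketch of the idea (the field names `user_id`, `order_id`, and `location_id` are illustrative assumptions, not the pipeline's actual schema):

```python
import json

# Illustrative, plain-Python sketch of the join the Glue job performs.
# Field names (user_id, order_id, location_id) are assumptions for
# demonstration, not the pipeline's real schema.
users = {u["user_id"]: u for u in [{"user_id": 1, "name": "Ada"}]}
locations = {l["location_id"]: l for l in [{"location_id": 10, "city": "Berlin"}]}
orders = [{"order_id": 100, "user_id": 1, "location_id": 10, "total": 42.0}]

# Attach the matching user and location record to each order.
records = [
    {
        **order,
        "user": users[order["user_id"]],
        "location": locations[order["location_id"]],
    }
    for order in orders
]

print(json.dumps(records, indent=2))
```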
Make sure the following tools are installed before you begin:
- Terraform: Version 1.12.2 or later
- AWS CLI: Version 2.28.14 or later
- Python: Version 3.12.11 or later
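If you want to sanity-check the toolchain before continuing, a minimal Python sketch (it assumes `terraform` and `aws` are on your `PATH`):

```python
import subprocess
import sys

# Python itself: the prerequisites call for 3.12.11 or later.
assert sys.version_info >= (3, 12), "Python 3.12 or later is required"

# Terraform and the AWS CLI both report their version on the command line;
# print each so you can compare against the minimums listed above.
for cmd in (["terraform", "-version"], ["aws", "--version"]):
    result = subprocess.run(cmd, capture_output=True, text=True)
    print(result.stdout.strip() or result.stderr.strip())
```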
- Navigate to the `script` folder and start a Pipenv shell:

  ```bash
  pipenv shell
  ```

- Run the data generation script:

  ```bash
  python main.py
  ```

  You should see output similar to:

  ```text
  Writing data to ../data/raw_records.json
  Successfully created ../data/raw_records.json.
  ```
- Open `platform/locals.tf` and update `raw_bucket_name` and `transformed_bucket_name` to the S3 bucket names you want to use.
- Open `platform/versions.tf` and update `s3 > bucket` and `s3 > key` to the S3 bucket and key name you want to use.
- Copy the sample environment file:

  ```bash
  cp .env.sample .env
  ```
- Open `platform/modules/glue_job/transform_script.py` and set the required value `s3_bucket` to the name of your S3 bucket for transformed data (see the sketch after this list).
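For orientation, the setting might sit in the script roughly as follows. This is a hedged sketch: apart from the `s3_bucket` variable itself, the helper function, key name, and write call are assumptions about the script's shape, not its actual contents:

```python
import json

import boto3

# Set this to the bucket you created for transformed data.
s3_bucket = "my-transformed-data-bucket"


def write_transformed(records: list[dict]) -> None:
    # Persist the joined records as a single JSON document in S3.
    # The key name "transformed/records.json" is illustrative only.
    s3 = boto3.client("s3")
    s3.put_object(
        Bucket=s3_bucket,
        Key="transformed/records.json",
        Body=json.dumps(records).encode("utf-8"),
    )
```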
Use the Makefile to manage your project lifecycle:
| Command | Description |
|---|---|
| `make init` | Initializes the Terraform project |
| `make plan` | Creates a deployment plan without applying changes |
| `make deploy` | Deploys the project to your AWS account |
| `make update` | Updates the deployed project |
| `make delete` | Deletes all created AWS resources |
| `make output` | Displays outputs such as API endpoints and resource IDs |
| `make console` | Opens an interactive CLI for evaluating and experimenting with Terraform expressions |
Check the AWS Glue job logs in Amazon CloudWatch to troubleshoot issues, using either the AWS Console or the CLI:

```bash
aws logs get-log-events \
  --log-group-name /aws-glue/jobs \
  --log-stream-name <YourJobLogStream>
```
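If you prefer Python to the CLI, the same lookup can be scripted with boto3; a minimal sketch (the log stream name is a placeholder you must fill in):

```python
import boto3

# Read a Glue job's log events programmatically instead of via the CLI.
logs = boto3.client("logs")
response = logs.get_log_events(
    logGroupName="/aws-glue/jobs",
    logStreamName="<YourJobLogStream>",  # replace with your job's log stream
)
for event in response["events"]:
    print(event["message"])
```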

