🚀 DGW Scrapper

An automated web scraper that extracts LPJ data from the DGW Spartan platform. It downloads activity reports and processes them into organized Excel spreadsheets.

✨ Features

🤖 Automated Login: Seamlessly authenticate with the DGW Spartan platform
📅 Date Range Filtering: Extract data for specific time periods
📊 Excel Export: Generate clean, organized Excel reports
🎨 Rich CLI Output: Beautiful terminal interface with progress tracking
⚡ Fast Processing: Efficient Playwright-based browser automation
🔄 Batch Processing: Handle multiple LPJ documents in one run
📁 Organized Output: Automatically structured file naming and storage

🔧 Prerequisites

Before you begin, ensure you have the following installed:

Python 3.8 or higher
pip (Python package installer)
Valid DGW Spartan account credentials

📦 Installation

Clone the repository

git clone https://github.com/alifnuryana/dgw-scrapper.git
cd dgw-scrapper

Install dependencies

pip install -r requirements.txt

Install Playwright browsers

playwright install chromium

🚀 Usage

Basic Command

python main.py --email YOUR_EMAIL --password YOUR_PASSWORD --from_date DD/MM/YYYY --to_date DD/MM/YYYY

Example

python main.py --email user@example.com --password mypassword123 --from_date 01/01/2024 --to_date 31/01/2024

Parameters

| Parameter | Required | Description | Format |
| --- | --- | --- | --- |
| `--email` | ✅ Yes | Your DGW Spartan email | string |
| `--password` | ✅ Yes | Your DGW Spartan password | string |
| `--from_date` | ✅ Yes | Start date for data extraction | DD/MM/YYYY |
| `--to_date` | ✅ Yes | End date for data extraction | DD/MM/YYYY |
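
The parameters above could be parsed and validated roughly as follows. This is a minimal sketch using the standard library's argparse, not the actual code in main.py (which is not shown here). The helper names `parse_date`, `build_parser`, and `parse_args` are hypothetical, and the check that the start date is not later than the end date is an added assumption.

```python
import argparse
from datetime import date, datetime


def parse_date(value: str) -> date:
    """Parse a DD/MM/YYYY string; report bad input as an argparse error."""
    try:
        return datetime.strptime(value, "%d/%m/%Y").date()
    except ValueError:
        raise argparse.ArgumentTypeError(
            f"invalid date {value!r}, expected DD/MM/YYYY"
        )


def build_parser() -> argparse.ArgumentParser:
    """Declare the four required flags listed in the Parameters table."""
    parser = argparse.ArgumentParser(description="DGW Spartan LPJ scraper (sketch)")
    parser.add_argument("--email", required=True, help="DGW Spartan email")
    parser.add_argument("--password", required=True, help="DGW Spartan password")
    parser.add_argument("--from_date", required=True, type=parse_date)
    parser.add_argument("--to_date", required=True, type=parse_date)
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and reject an inverted date range (assumed rule)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.from_date > args.to_date:
        parser.error("--from_date must not be later than --to_date")
    return args
```

Converting the dates at parse time means a malformed value is rejected before any browser is launched.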

⚙️ Configuration

The scraper is configured to:

Navigate to the "Sudah Diproses" (Processed) tab
Filter by document type: LPJ
Extract the following data:
- Activity Name
- PO Name
- Total Amount
- Activity Count

Output Directory

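All generated files are saved in the output/ directory, which is created automatically if it doesn't exist. The directory is emptied at the start of each run so that it only contains fresh data; move any files you want to keep elsewhere before running the scraper again.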
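
The clean-then-create behaviour described above can be sketched with the standard library. The function name `reset_output_dir` is hypothetical; the real implementation in main.py may differ.

```python
import shutil
from pathlib import Path


def reset_output_dir(path: str = "output") -> Path:
    """Delete the output directory if present, then recreate it empty."""
    out = Path(path)
    if out.exists():
        # Remove the directory and everything in it from the previous run.
        shutil.rmtree(out)
    out.mkdir(parents=True)
    return out
```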

📄 Output Format

File Naming Convention

Files are named using the following pattern:

{YYYY - Month} - {Submitted By} - {Activity Type 1} - {Activity Type 2} - {Proposal Name}.xlsx

Example:

2024 - January - John Doe - Workshop - Training - Employee Development.xlsx
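
As an illustration of the naming pattern, the sketch below assembles a file name from its parts. `build_filename` and its parameters are hypothetical names, and replacing characters that are invalid in Windows file names is an added assumption, not something this README states.

```python
import re
from datetime import date

# English month names, spelled out to avoid any locale dependence.
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def build_filename(period: date, submitted_by: str, activity_type_1: str,
                   activity_type_2: str, proposal_name: str) -> str:
    """Join the fields with ' - ' following the documented pattern."""
    parts = [
        f"{period.year} - {MONTHS[period.month - 1]}",
        submitted_by,
        activity_type_1,
        activity_type_2,
        proposal_name,
    ]
    # Assumption: swap characters that Windows forbids in file names for '_'.
    cleaned = [re.sub(r'[\\/:*?"<>|]', "_", part.strip()) for part in parts]
    return " - ".join(cleaned) + ".xlsx"
```

For example, `build_filename(date(2024, 1, 15), "John Doe", "Workshop", "Training", "Employee Development")` produces the example file name shown above.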

Excel Structure

The generated Excel files contain the following columns:

| Column | Description |
| --- | --- |
| Activity Name | Name of the activity |
| PO Name | Purchase Order name |
| Total | Total amount (in Rupiah) |
| Count | Number of activities |

Data is automatically:

✅ Cleaned and formatted
✅ Grouped by Activity Name and PO Name
✅ Aggregated with sum and count calculations
✅ Converted to proper numeric formats
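
The cleaning, grouping, and aggregation steps listed above might look like this in pandas. This is a sketch under assumptions: it supposes amounts are scraped as strings such as `Rp 1.500.000` (dots as thousands separators, an optional `,00` decimal part), which this README does not confirm, and `parse_rupiah` and `summarize` are hypothetical names.

```python
import re

import pandas as pd


def parse_rupiah(text: str) -> int:
    """Convert a string such as 'Rp 1.500.000' to the integer 1500000."""
    whole = text.split(",")[0]            # drop a decimal part such as ',00'
    digits = re.sub(r"[^0-9]", "", whole)  # strip 'Rp', spaces, and dots
    return int(digits) if digits else 0


def summarize(rows: pd.DataFrame) -> pd.DataFrame:
    """Group by activity and PO, then sum the amounts and count the rows."""
    df = rows.copy()
    df["Total"] = df["Total"].map(parse_rupiah)
    return (
        df.groupby(["Activity Name", "PO Name"], as_index=False)
          .agg(Total=("Total", "sum"), Count=("Total", "count"))
    )
```

The resulting frame has the four columns from the table above and could then be written out with `DataFrame.to_excel`, which needs an Excel engine such as openpyxl installed.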

🛠️ Troubleshooting

Common Issues

Browser fails to launch

Ensure Playwright browsers are installed:

playwright install chromium

Login fails

Verify your email and password are correct
Check if your account has access to the Spartan platform
If your password contains characters that your shell treats specially (for example $, !, & or spaces), wrap it in quotes so it reaches the script unchanged

TimeoutError during scraping

This can occur if:

Network connection is slow
The page takes longer to load
An item has no data table

The scraper will skip problematic items and continue processing others.
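
The skip-and-continue behaviour can be sketched as a loop that records failures instead of aborting. `scrape_all` and `scrape_one` are hypothetical names. The default here catches Python's built-in `TimeoutError`; with Playwright you would likely pass `playwright.sync_api.TimeoutError` in `skip_on` instead.

```python
from typing import Callable, Iterable, List, Tuple, Type


def scrape_all(
    items: Iterable[str],
    scrape_one: Callable[[str], dict],
    skip_on: Tuple[Type[BaseException], ...] = (TimeoutError,),
) -> Tuple[List[dict], List[Tuple[str, str]]]:
    """Scrape every item, skipping those that raise one of `skip_on`."""
    results: List[dict] = []
    skipped: List[Tuple[str, str]] = []
    for item in items:
        try:
            results.append(scrape_one(item))
        except skip_on as exc:
            # Remember what failed and why, then move on to the next item.
            skipped.append((item, str(exc)))
    return results, skipped
```

Returning the skipped items alongside the results makes it easy to print a summary at the end of the run or to retry only the failures.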

Empty output folder

Check if the date range contains any LPJ documents
Verify the filter settings match available documents
Review the console output for any error messages

🤝 Contributing

Contributions are welcome! Here's how you can help:

Fork the repository
Create a new branch (`git checkout -b feature/amazing-feature`)
Make your changes
Commit your changes (`git commit -m 'Add some amazing feature'`)
Push to the branch (`git push origin feature/amazing-feature`)
Open a Pull Request

Development Guidelines

Follow PEP 8 style guidelines
Add comments for complex logic
Test your changes thoroughly
Update documentation as needed

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

Built with Playwright for reliable browser automation
Uses pandas for efficient data processing
Enhanced with Rich for beautiful terminal output

📧 Contact

Alif Nuryana - @alifnuryana

Project Link: https://github.com/alifnuryana/dgw-scrapper

Made with ❤️ by Alif Nuryana
