A powerful and automated web scraper for extracting LPJ data from the DGW Spartan platform. This tool streamlines the process of downloading and processing activity reports into organized Excel spreadsheets.
- Automated Login: Seamlessly authenticate with the DGW Spartan platform
- Date Range Filtering: Extract data for specific time periods
- Excel Export: Generate clean, organized Excel reports
- Rich CLI Output: Beautiful terminal interface with progress tracking
- Fast Processing: Efficient Playwright-based browser automation
- Batch Processing: Handle multiple LPJ documents in one run
- Organized Output: Automatically structured file naming and storage
Before you begin, ensure you have the following installed:
- Python 3.8 or higher
- pip (Python package installer)
- Valid DGW Spartan account credentials
- Clone the repository
git clone https://github.com/alifnuryana/dgw-scrapper.git
cd dgw-scrapper
- Install dependencies
pip install -r requirements.txt
- Install Playwright browsers
playwright install chromium
Basic usage:
python main.py --email YOUR_EMAIL --password YOUR_PASSWORD --from_date DD/MM/YYYY --to_date DD/MM/YYYY
Example:
python main.py --email [email protected] --password mypassword123 --from_date 01/01/2024 --to_date 31/01/2024
| Parameter | Required | Description | Format |
|---|---|---|---|
| --email | Yes | Your DGW Spartan email | string |
| --password | Yes | Your DGW Spartan password | string |
| --from_date | Yes | Start date for data extraction | DD/MM/YYYY |
| --to_date | Yes | End date for data extraction | DD/MM/YYYY |
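The parameters above map naturally onto `argparse`. The following is a minimal sketch of how such a command line could be parsed and validated, not the actual contents of `main.py`. The function names `parse_date`, `build_parser`, and `parse_args`, and the check that the start date does not come after the end date, are assumptions added for illustration.

```python
import argparse
from datetime import datetime


def parse_date(value: str) -> datetime:
    """Parse a DD/MM/YYYY string; report bad input as an argparse error."""
    try:
        return datetime.strptime(value, "%d/%m/%Y")
    except ValueError:
        raise argparse.ArgumentTypeError(f"expected DD/MM/YYYY, got {value!r}")


def build_parser() -> argparse.ArgumentParser:
    """Declare the four required flags listed in the parameter table."""
    parser = argparse.ArgumentParser(description="Extract LPJ data from DGW Spartan")
    parser.add_argument("--email", required=True, help="Your DGW Spartan email")
    parser.add_argument("--password", required=True, help="Your DGW Spartan password")
    parser.add_argument("--from_date", required=True, type=parse_date,
                        help="Start date (DD/MM/YYYY)")
    parser.add_argument("--to_date", required=True, type=parse_date,
                        help="End date (DD/MM/YYYY)")
    return parser


def parse_args(argv=None) -> argparse.Namespace:
    """Parse argv and reject a range whose start is after its end (assumed rule)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.from_date > args.to_date:
        parser.error("--from_date must not be later than --to_date")
    return args
```

Converting the dates at parse time means a malformed value such as `2024-01-01` is rejected before any browser is launched.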
The scraper is configured to:
- Navigate to the "Sudah Diproses" (Processed) tab
- Filter by document type: LPJ
- Extract the following data:
  - Activity Name
  - PO Name
  - Total Amount
  - Activity Count
All generated files are saved in the output/ directory, which is automatically created if it doesn't exist. The directory is cleaned before each run to ensure fresh data.
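The create-then-clean behaviour described above can be sketched with the standard library. This is an illustration of the idea rather than the project's own code, and the helper name `reset_output_dir` is hypothetical.

```python
import shutil
from pathlib import Path


def reset_output_dir(path: str = "output") -> Path:
    """Remove the directory if it exists, then recreate it empty."""
    out = Path(path)
    if out.exists():
        # Deletes every file from a previous run
        shutil.rmtree(out)
    out.mkdir(parents=True)
    return out
```

Because the directory is emptied on every run, copy any reports you want to keep to another location before running the scraper again.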
Files are named using the following pattern:
{YYYY - Month} - {Submitted By} - {Activity Type 1} - {Activity Type 2} - {Proposal Name}.xlsx
Example:
2024 - January - John Doe - Workshop - Training - Employee Development.xlsx
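A small sketch of how a name following this pattern could be assembled. The function `build_filename` and its parameter names are hypothetical, and the replacement of characters that are invalid in file names is an added precaution that the README does not describe.

```python
import re
from datetime import date

# English month names, independent of the system locale
MONTHS = [
    "January", "February", "March", "April", "May", "June",
    "July", "August", "September", "October", "November", "December",
]


def build_filename(period: date, submitted_by: str, activity_type_1: str,
                   activity_type_2: str, proposal_name: str) -> str:
    """Join the pattern's fields with ' - ' and append the .xlsx extension."""
    parts = [
        f"{period.year} - {MONTHS[period.month - 1]}",
        submitted_by,
        activity_type_1,
        activity_type_2,
        proposal_name,
    ]
    # Assumption: replace characters that common file systems reject
    cleaned = [re.sub(r'[\\/:*?"<>|]', "_", part).strip() for part in parts]
    return " - ".join(cleaned) + ".xlsx"
```

Calling `build_filename(date(2024, 1, 15), "John Doe", "Workshop", "Training", "Employee Development")` reproduces the example name shown above.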
The generated Excel files contain the following columns:
| Column | Description |
|---|---|
| Activity Name | Name of the activity |
| PO Name | Purchase Order name |
| Total | Total amount (in Rupiah) |
| Count | Number of activities |
Data is automatically:
- Cleaned and formatted
- Grouped by Activity Name and PO Name
- Aggregated with sum and count calculations
- Converted to proper numeric formats
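The processing steps listed above can be sketched with pandas as follows. This is an approximation under stated assumptions, not the project's actual pipeline: the function name `summarize` is hypothetical, and the cleaning step assumes whole-Rupiah amounts written like `Rp 1.500.000` with no decimal part.

```python
import pandas as pd


def summarize(rows):
    """Clean raw rows, then group by Activity Name and PO Name."""
    df = pd.DataFrame(rows)
    # Trim stray whitespace so identical names land in the same group
    df["Activity Name"] = df["Activity Name"].str.strip()
    df["PO Name"] = df["PO Name"].str.strip()
    # Assumption: keep digits only, dropping "Rp" and thousands separators
    digits = df["Total"].astype(str).str.replace(r"[^\d]", "", regex=True)
    df["Total"] = pd.to_numeric(digits, errors="coerce").fillna(0).astype("int64")
    # Sum the amounts and count the rows in each group
    return df.groupby(["Activity Name", "PO Name"], as_index=False).agg(
        Total=("Total", "sum"),
        Count=("Total", "count"),
    )
```

The result has the four columns described in the table above and could be written to a spreadsheet with `DataFrame.to_excel`, which requires an Excel writer engine such as openpyxl to be installed.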
Browser fails to launch
Ensure Playwright browsers are installed:
playwright install chromium
Login fails
- Verify your email and password are correct
- Check if your account has access to the Spartan platform
- Ensure you're not using special characters that need escaping
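On POSIX shells, characters such as `$`, `&`, `!`, and spaces in a password are interpreted by the shell before `main.py` ever sees them. One way to check how a value must be quoted is the standard library's `shlex.quote`. The helper `build_command` below is hypothetical, and the quoting rules shown do not apply to Windows `cmd.exe`.

```python
import shlex


def build_command(email: str, password: str, from_date: str, to_date: str) -> str:
    """Return a command line with every argument quoted for a POSIX shell."""
    args = [
        "python", "main.py",
        "--email", email,
        "--password", password,
        "--from_date", from_date,
        "--to_date", to_date,
    ]
    # shlex.quote leaves safe strings untouched and single-quotes the rest
    return " ".join(shlex.quote(arg) for arg in args)
```

Printing the result of `build_command` gives a line that can be pasted into a terminal without the shell altering the password.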
TimeoutError during scraping
This can occur if:
- Network connection is slow
- The page takes longer to load
- An item has no data table
The scraper will skip problematic items and continue processing others.
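The skip-and-continue behaviour can be sketched as a loop that catches the timeout for one item and moves on to the next. This is an illustration only: `scrape_all` and the `scrape_item` callback are hypothetical names, and the built-in `TimeoutError` stands in for the `TimeoutError` class that Playwright exposes in `playwright.sync_api`.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def scrape_all(items: Iterable[str],
               scrape_item: Callable[[str], Dict]) -> Tuple[List[Dict], List[str]]:
    """Scrape each item, collecting results and the items that timed out."""
    results: List[Dict] = []
    skipped: List[str] = []
    for item in items:
        try:
            results.append(scrape_item(item))
        except TimeoutError as exc:
            # One slow or empty item should not abort the whole run
            skipped.append(item)
            print(f"Skipped {item}: {exc}")
    return results, skipped
```

Returning the skipped items alongside the results makes it easy to report them at the end of a run or retry them later.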
Empty output folder
- Check if the date range contains any LPJ documents
- Verify the filter settings match available documents
- Review the console output for any error messages
Contributions are welcome! Here's how you can help:
- Fork the repository
- Create a new branch (git checkout -b feature/amazing-feature)
- Make your changes
- Commit your changes (git commit -m 'Add some amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
- Follow PEP 8 style guidelines
- Add comments for complex logic
- Test your changes thoroughly
- Update documentation as needed
This project is licensed under the MIT License - see the LICENSE file for details.
- Built with Playwright for reliable browser automation
- Uses pandas for efficient data processing
- Enhanced with Rich for beautiful terminal output
Alif Nuryana - @alifnuryana
Project Link: https://github.com/alifnuryana/dgw-scrapper