Implementation: Doaj fetch script for open access journals #247
- Migrate from DOAJ API v3 to v4 for enhanced metadata access
- Add comprehensive CC license analysis for academic journals
- Implement publisher and geographic distribution analysis
- Add programmatic ISO 3166-1 alpha-2 country code generation
- Include automatic dependency resolution and error handling
- Apply date filtering (default ≥2002) to prevent false positives
- Generate 5 CSV files plus provenance for comprehensive analysis
- Ensure static analysis compliance and comprehensive testing

This integration enables quantification of institutional commitment to Creative Commons licensing in the scholarly publishing ecosystem.
@TimidRobot, hello. I have attempted to implement the fetch script that collects CC license information from the DOAJ data source using its API. To eliminate false positives, the script reads the license from the journal-level license field, which holds the actual journal licenses. I have also set a
The data returned appears to focus primarily on articles. Given the lack of licensing information on the articles, I think the focus should be on the journals with article information providing context. Even though a lot of the data currently returned is really interesting, I think it is out of scope for this project.
@TimidRobot, the script actually focuses on journals, as these are the only records with license fields. Articles in the DOAJ database do not have license fields, and doing a full
Please update the description to match the current code.
FILE_DOAJ_YEAR = shared.path_join(
    PATHS["data_1-fetch"], "doaj_4_count_by_year.csv"
)
(here and below) If I understand correctly, this is the year the journal started accepting a tool type. If so, please update YEAR to START_YEAR
# Constants
BASE_URL = "https://doaj.org/api/v4/search"
DEFAULT_DATE_BACK = 2002  # Creative Commons licenses first released in 2002
DEFAULT_FETCH_LIMIT = 1000
Please reduce DEFAULT_FETCH_LIMIT to 100.
Defaults should be safe and fast.
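As a rough sketch of what a small default could look like when exposed on the command line: the `--limit` flag name and the `parse_arguments` helper below are assumptions for illustration only (only `--date-back` appears in this PR's description), and the real script may wire its arguments differently.

```python
import argparse

# Values mirror the constants discussed in this review; the flag name
# --limit and this helper are hypothetical, not taken from the PR.
DEFAULT_DATE_BACK = 2002
DEFAULT_FETCH_LIMIT = 100  # small default so an unconfigured run stays fast


def parse_arguments(argv=None):
    """Parse command-line options, falling back to the safe defaults."""
    parser = argparse.ArgumentParser(description="Fetch DOAJ journal data")
    parser.add_argument(
        "--limit",
        type=int,
        default=DEFAULT_FETCH_LIMIT,
        help="maximum number of journal records to fetch",
    )
    parser.add_argument(
        "--date-back",
        type=int,
        default=DEFAULT_DATE_BACK,
        help="earliest start year to include",
    )
    return parser.parse_args(argv)
```

With this shape, a larger run is an explicit opt-in, for example `parse_arguments(["--limit", "1000"])`, while a bare invocation stays at 100 records.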
seaborn = "*"
urllib3 = ">=2.5.0"
wordcloud = "*"
pycountry = "*"
Please maintain the natural order of packages (move pycountry after pyarrow).
BASE_URL = "https://doaj.org/api/v4/search"
DEFAULT_DATE_BACK = 2002  # Creative Commons licenses first released in 2002
DEFAULT_FETCH_LIMIT = 1000
RATE_LIMIT_DELAY = 0.5
If this value is required by the API, please add a comment here so that a future person doesn't change it to "optimize" it.
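For illustration, a minimal sketch of how the delay might be documented and applied between paginated requests. The names `fetch_all_pages`, `fetch_page`, and `page_size` are hypothetical and not taken from the PR, and whether DOAJ actually requires a 0.5 second pause is an assumption that should be confirmed against the API documentation.

```python
import time

# Seconds to wait between requests. ASSUMPTION: treated here as a politeness
# delay; if the API mandates it, say so in this comment so that nobody
# lowers the value later as an "optimization".
RATE_LIMIT_DELAY = 0.5


def fetch_all_pages(
    fetch_page, limit, page_size=100, delay=RATE_LIMIT_DELAY, sleep=time.sleep
):
    """Collect up to `limit` records, pausing between page requests.

    `fetch_page(page, page_size)` is a hypothetical callable that returns a
    list of records for one page, or an empty list when no pages remain.
    """
    records = []
    page = 1
    while len(records) < limit:
        batch = fetch_page(page, page_size)
        if not batch:
            break  # no more results
        records.extend(batch)
        page += 1
        if len(records) < limit:
            sleep(delay)  # pause only when another request will follow
    return records[:limit]
```

Passing `sleep` as a parameter keeps the loop testable without real waiting, which may or may not suit the existing script structure.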
Fixes
Description
This PR adds comprehensive DOAJ API v4 integration to the quantifying commons project, enabling collection and analysis of Creative Commons licensed academic journals. The implementation includes two main components:
- scripts/1-fetch/doaj_fetch.py: Main data collection script for DOAJ journals
- dev/generate_country_codes.py: Utility for programmatic ISO country code generation

Key Features
Useful Links
Articles:
Journals:
Technical details
API Integration
Data Quality Measures
--date-back=2002 to avoid retroactive CC license false positives

Output Files Generated
Query Strategy
License Extraction
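A minimal sketch of journal-level license counting, under the assumption that each journal record carries its licenses as a list of objects with a `type` key under `bibjson.license`. That layout, and the helper names `extract_license_types` and `count_licenses`, are assumptions for illustration rather than the script's actual code.

```python
from collections import Counter


def extract_license_types(journal):
    """Return the license type strings found on one journal record.

    ASSUMPTION: licenses live at journal["bibjson"]["license"] as a list of
    dicts with a "type" key. Records without that field yield an empty list.
    """
    bibjson = journal.get("bibjson", {})
    licenses = bibjson.get("license") or []
    return [entry["type"] for entry in licenses if entry.get("type")]


def count_licenses(journals):
    """Tally license types across an iterable of journal records."""
    counts = Counter()
    for journal in journals:
        counts.update(extract_license_types(journal))
    return counts
```

A journal that lists several licenses is counted once per license type here; whether that matches the intended CSV output is a design decision for the script.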
Date Filtering Implementation
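A minimal sketch of how a `--date-back` cutoff might be applied, assuming each journal record exposes its start year as an integer at `bibjson.oa_start`. The field name and the `filter_by_start_year` helper are assumptions for illustration; the script may read the year from a different field.

```python
DEFAULT_DATE_BACK = 2002  # Creative Commons licenses first released in 2002


def filter_by_start_year(journals, date_back=DEFAULT_DATE_BACK):
    """Keep journals whose start year is at or after `date_back`.

    ASSUMPTION: the start year is an int at journal["bibjson"]["oa_start"].
    Records with a missing or non-integer year are dropped, since their
    license cannot be dated.
    """
    kept = []
    for journal in journals:
        start_year = journal.get("bibjson", {}).get("oa_start")
        if isinstance(start_year, int) and start_year >= date_back:
            kept.append(journal)
    return kept
```

Dropping undated records is the conservative choice in this sketch; keeping them in a separate bucket would be an alternative if their counts matter.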
Publisher Analysis
Auto-Dependency Resolution
Tests
Basic Code Execution
Data Quality Note
Please Note: DOAJ data represents journal-level licensing policies, not individual article licenses. This data should be interpreted as indicators of institutional commitment to CC licensing rather than precise counts of CC-licensed articles. The --date-back=2002 default prevents false positives from journals that retroactively adopted CC licenses.

Checklist
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."