Final PR #7
base: main
Conversation
…erything. need to fix events ingestion- event duplicates.
… reducing the number of hybrids
Rethink api
Check out this pull request on ReviewNB: see visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB.
Walkthrough
This PR establishes a comprehensive data ingestion and unified AI chatbot platform for the RethinkAI Community Sentiment Analysis Platform. It introduces a Flask v2 REST API, orchestrates data sync from Google Drive, email, and Boston's open data portal into MySQL and vector databases, implements RAG and SQL query routing with Gemini integration, and provides extensive configuration, documentation, and deployment guidance.
Changes
Sequence Diagram(s)
sequenceDiagram
actor User
participant API as API v2
participant Chatbot as Unified Chatbot
participant Router as Question Router
participant SQLChat as SQL Chat
participant RAG as RAG Pipeline
participant VectorDB as Vector DB
participant MySQL as MySQL DB
participant Gemini as Gemini LLM
User->>API: POST /chat (message, history)
API->>Chatbot: chat(question, history)
Chatbot->>Router: _route_question(question)
Router->>Gemini: Classify intent (SQL/RAG/Hybrid)
Gemini-->>Router: mode decision
alt SQL Mode
Router->>SQLChat: _run_sql(question)
SQLChat->>Gemini: Generate SELECT (schema+history)
Gemini-->>SQLChat: SQL statement
SQLChat->>MySQL: Execute SQL
MySQL-->>SQLChat: Results
SQLChat->>Gemini: Generate natural answer
Gemini-->>SQLChat: Answer text
SQLChat-->>Router: {answer, mode}
else RAG Mode
Router->>RAG: _run_rag(question, plan)
RAG->>VectorDB: retrieve(query, filters)
VectorDB-->>RAG: chunks + metadata
RAG->>Gemini: Compose answer(chunks, history)
Gemini-->>RAG: Answer with sources
RAG-->>Router: {answer, sources, mode}
else Hybrid Mode
Router->>SQLChat: _run_sql(question)
Router->>RAG: _run_rag(question, plan)
SQLChat-->>Router: SQL results
RAG-->>Router: RAG results
Router->>Gemini: Merge results
Gemini-->>Router: Unified answer
Router-->>Router: {answer, sources, mode}
end
Router-->>Chatbot: Result
Chatbot->>API: Response with sources & mode
API->>MySQL: log_interaction
API-->>User: {response, sources, mode, session_id}
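To make the routing flow above concrete, here is a minimal sketch of how a chatbot could dispatch between the three modes. It is an illustration only: the method names (classify_intent, merge, run) are assumptions, not the PR's actual implementation, and the router, SQL chat, and RAG components are injected as opaque collaborators.

# Minimal routing sketch (assumed interfaces, for illustration only)
from typing import Any, Dict, List

class UnifiedChatbotSketch:
    def __init__(self, router, sql_chat, rag_pipeline):
        self.router = router        # wraps the Gemini intent classifier
        self.sql_chat = sql_chat    # text-to-SQL chain over MySQL
        self.rag = rag_pipeline     # retriever + answer composer over the vector DB

    def chat(self, question: str, history: List[Dict[str, str]]) -> Dict[str, Any]:
        mode = self.router.classify_intent(question, history)  # "sql", "rag", or "hybrid"
        if mode == "sql":
            result = self.sql_chat.run(question, history)
        elif mode == "rag":
            result = self.rag.run(question, history)
        else:  # hybrid: run both pipelines and let the LLM merge the answers
            sql_part = self.sql_chat.run(question, history)
            rag_part = self.rag.run(question, history)
            result = self.router.merge(question, sql_part, rag_part)
        result["mode"] = mode
        return result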
sequenceDiagram
participant Scheduler as Daily Scheduler
participant Orchestrator as main_daily_ingestion
participant Dotnews as Dotnews Download
participant Drive as Google Drive Sync
participant Email as Email Newsletter Sync
participant Boston as Boston Data Sync
participant VectorDB as Vector DB Builder
participant MySQL as MySQL
Scheduler->>Orchestrator: Execute daily run
rect rgb(200, 220, 255)
Note over Orchestrator: Phase 1: Dotnews
Orchestrator->>Dotnews: sync_dotnews_newsletters()
Dotnews->>Dotnews: Download latest PDF
Dotnews->>VectorDB: Store in processed registry
Dotnews-->>Orchestrator: {downloaded, errors}
end
rect rgb(200, 235, 220)
Note over Orchestrator: Phase 2: Google Drive
Orchestrator->>Drive: sync_google_drive_to_vectordb()
Drive->>Drive: List new files
Drive->>VectorDB: Process & add documents
Drive->>MySQL: Extract & store events
Drive-->>Orchestrator: {files_processed, errors}
end
rect rgb(235, 220, 200)
Note over Orchestrator: Phase 3: Email
Orchestrator->>Email: sync_email_newsletters_to_sql()
Email->>Email: Fetch recent newsletters
Email->>Email: Extract events via Gemini
Email->>MySQL: Insert weekly_events
Email-->>Orchestrator: {processed, errors}
end
rect rgb(235, 200, 200)
Note over Orchestrator: Phase 4: Boston Data
Orchestrator->>Boston: BostonDataSyncer.sync_all()
Boston->>Boston: Fetch from CKAN API
Boston->>MySQL: Upsert records
Boston->>MySQL: Create derivative tables
Boston-->>Orchestrator: {synced, errors}
end
rect rgb(220, 200, 235)
Note over Orchestrator: Phase 5: Vector DB
Orchestrator->>VectorDB: build_vectordb()
VectorDB->>VectorDB: Load policies, transcripts, newsletters
VectorDB->>VectorDB: Embed via Gemini
VectorDB->>VectorDB: Persist to Chroma
VectorDB-->>Orchestrator: {built, documents}
end
Orchestrator->>Orchestrator: Log summary (JSONL)
Orchestrator-->>Scheduler: Complete with status
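As a concrete illustration of the phased orchestration above, here is a small sketch that runs each phase, collects its stats dict, and aggregates error counts at the end. The phase callables are passed in rather than imported, and the JSONL log path is a placeholder; this is not the PR's main_daily_ingestion code.

# Sketch: run ingestion phases in order and aggregate their error counts (illustration only)
import json
from datetime import datetime, timezone
from typing import Callable, Dict

def run_daily_ingestion(phases: Dict[str, Callable[[], dict]], log_path: str = "ingestion_log.jsonl") -> dict:
    summary = {"started": datetime.now(timezone.utc).isoformat(), "phases": {}}
    for name, run_phase in phases.items():
        try:
            summary["phases"][name] = run_phase() or {}
        except Exception as exc:  # one failing phase should not abort the whole run
            summary["phases"][name] = {"errors": [str(exc)]}
    # Each phase contributes 0 when it reports no errors, so the total is a simple sum.
    summary["total_errors"] = sum(len(stats.get("errors", [])) for stats in summary["phases"].values())
    with open(log_path, "a", encoding="utf-8") as f:
        f.write(json.dumps(summary) + "\n")
    return summary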
sequenceDiagram
participant App as Data Ingestion
participant GDrive as Google Drive API
participant PDF as PDF/Newsletter
participant Gemini as Gemini LLM
participant VectorDB as Vector DB
participant SQL as MySQL
App->>App: Load sync state
rect rgb(220, 240, 255)
Note over App: File Discovery & Deletion
App->>GDrive: Get all current file IDs
GDrive-->>App: {file_ids}
App->>App: Compare with processed
App->>VectorDB: Delete chunks for removed files
end
App->>GDrive: List new files (limit: MAX_FILES)
GDrive-->>App: {files}
loop For each new file
App->>App: Download to temp dir
alt File is Newsletter PDF
App->>PDF: Extract pages
loop For each page
PDF->>Gemini: Extract events (JSON)
Gemini-->>PDF: {events}
PDF->>App: {event_list}
end
App->>SQL: insert_events_to_db(events)
App->>App: Update processed_files
else Other file type
App->>App: process_file_to_documents()
App->>VectorDB: Stage documents
end
end
App->>VectorDB: Batch add documents
App->>App: Save sync state
App->>App: Cleanup temp files
App-->>App: Return {files_processed, errors}
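To illustrate the discovery-and-deletion step in the diagram above, here is a minimal sketch of diffing current Drive file IDs against a local processed-files registry. The registry shape and the vector store's delete call are assumed interfaces, not the PR's exact code.

# Sketch: detect files removed from Drive by diffing against the local sync state (illustration only)
from typing import Dict, List, Set

def find_deleted_files(current_ids: Set[str], processed_files: Dict[str, dict]) -> List[str]:
    """Return IDs that were processed in an earlier run but no longer exist in Drive."""
    return [fid for fid in processed_files if fid not in current_ids]

def prune_deleted(vectordb, processed_files: Dict[str, dict], current_ids: Set[str]) -> None:
    for fid in find_deleted_files(current_ids, processed_files):
        # `vectordb.delete` stands in for whatever chunk-deletion call the store exposes,
        # assuming chunks were stored with a `file_id` metadata field when added.
        vectordb.delete(where={"file_id": fid})
        processed_files.pop(fid, None)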
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Complexity Factors:
Areas Requiring Extra Attention:
Possibly related PRs
Pre-merge checks and finishing touches
❌ Failed checks (1 inconclusive)
✅ Passed checks (1 passed)
✨ Finishing touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
We've added a demo folder for testing our main agent framework. For the ingestion pipelines to work, you'd need several Google credentials set up under a separate Google account. Instructions for setting this up and running it are in the README file.
Actionable comments posted: 6
Note
Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
♻️ Duplicate comments (1)
api/env.example (1)
1-46: Duplication with api/.env.example noted. See the comment on api/.env.example regarding consolidation of environment example files.
🟠 Major comments (21)
on_the_porch/rag stuff/other scripts/file_handling.py-34-40 (1)
34-40: Ensure output directory exists and add error handling. The code writes to `Data/boston_budget_projects.txt` without checking if the `Data/` directory exists, which will raise a `FileNotFoundError`. Additionally, there's no error handling for write failures. Apply this diff to create the directory and handle errors:
+from pathlib import Path + -def budget_to_text(): +def budget_to_text(input_path='../../api/datastore/budget_filtered.txt', output_dir='Data'): """Convert budget CSV data into narrative text for vector DB.""" # ... (earlier code) # Write to text file - output_file = 'Data/boston_budget_projects.txt' + output_path = Path(output_dir) + output_path.mkdir(parents=True, exist_ok=True) + output_file = output_path / 'boston_budget_projects.txt' + - with open(output_file, 'w', encoding='utf-8') as f: - f.write('\n'.join(text_entries)) + try: + with open(output_file, 'w', encoding='utf-8') as f: + f.write('\n'.join(text_entries)) + except Exception as e: + print(f"Error writing output: {e}") + sys.exit(1) print(f"Converted {len(text_entries)} projects to text format") print(f"Output saved to: {output_file}")Committable suggestion skipped: line range outside the PR's diff.
on_the_porch/rag stuff/other scripts/file_handling.py-7-9 (1)
7-9: Add error handling and make the input path configurable. The hard-coded relative path `../../api/datastore/budget_filtered.txt` is fragile and will break if the script is executed from a different working directory. Additionally, there's no error handling for common I/O failures (file not found, permission denied, etc.). Apply this diff to add error handling and path flexibility:
+import sys +from pathlib import Path + -def budget_to_text(): +def budget_to_text(input_path='../../api/datastore/budget_filtered.txt', output_dir='Data'): """Convert budget CSV data into narrative text for vector DB.""" + input_file = Path(input_path) + if not input_file.exists(): + print(f"Error: Input file not found: {input_file}") + sys.exit(1) + # Read the CSV data - with open('../../api/datastore/budget_filtered.txt', 'r', encoding='utf-8') as f: + try: + with open(input_file, 'r', encoding='utf-8') as f: - reader = csv.DictReader(f) - rows = list(reader) + reader = csv.DictReader(f) + rows = list(reader) + except Exception as e: + print(f"Error reading CSV: {e}") + sys.exit(1)Committable suggestion skipped: line range outside the PR's diff.
on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-102-109 (1)
102-109: Add error handling for JSON parsing. If the LLM returns malformed JSON or unexpected format,
json.loads(result)will raise aJSONDecodeErrorand crash the pipeline. Consider wrapping this in a try/except with a descriptive error message.- # Parse JSON - queries = json.loads(result) + # Parse JSON + try: + queries = json.loads(result) + except json.JSONDecodeError as e: + raise ValueError(f"Failed to parse LLM response as JSON: {result[:200]}") from e + + # Validate expected keys + if "answer_query" not in queries: + raise ValueError("LLM response missing 'answer_query' key")on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-44-48 (1)
44-48: SQL injection vulnerability in table name. The `table_name` parameter is interpolated directly into the SQL command without sanitization. If `table_name` originates from user input, an attacker could inject malicious SQL (e.g., `); DROP TABLE users; --`). Consider validating the table name against an allowlist or using a regex to ensure it contains only valid identifier characters:
+import re + def get_schema(table_name): """Get table schema""" + # Validate table name to prevent SQL injection + if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', table_name): + raise ValueError(f"Invalid table name: {table_name}") engine = create_engine(DB_URL) with engine.connect() as conn:Committable suggestion skipped: line range outside the PR's diff.
.gitignore-108-108 (1)
108-108: Service account key was previously exposed in git history—verify credential rotation immediately. Git history confirms that `ontheporch-ebc5a60fc214.json` was committed in bcfebda061edba49b8cfec0559bdc8d3d6977ef4. While the .gitignore addition prevents future exposure, the key has already been exposed in the repository history. Verify that this Google service account key has been rotated or revoked in the GCP console. If not already done, rotate the key and clean the git history using tools like `git-filter-branch` or `bfg` to remove all traces.
85-115: Make_query_analysisrobust to non‑JSON LLM outputRight now, any deviation from perfect JSON (very plausible in practice) will raise
json.JSONDecodeErrorand bring down the request.Consider a defensive parse with a safe fallback, e.g.:
- text = completion.choices[0].message.content or "{}" - parsed = json.loads(text) + raw_text = completion.choices[0].message.content or "{}" + try: + parsed = json.loads(raw_text) + except json.JSONDecodeError: + # Fallback: treat as empty analysis so downstream logic still runs. + parsed = {}The rest of the normalization (intent whitelist, defaulting entities/date_ranges/summary_flag/limit) can stay as is, but this avoids hard failures when the model violates the contract.
on_the_porch/unused/pocketflow_stuff/agent.py-127-160 (1)
127-160: Preserve and surfaceschema_mismatchinstead of reclassifying asnot_single_selectThe prompt tells the model to return
{"error":"schema_mismatch"}when the schema can’t answer the query, but this branch is never distinguished from other failures. That JSON blob is treated as SQL text, fails_is_single_select, and gets mapped toerror: "not_single_select".You can detect and handle the structured error before running the single‑SELECT validation, for example:
- sql_text = (completion.choices[0].message.content or "").strip().strip("`") - - is_valid = self._is_single_select(sql_text) - payload: Dict[str, Any] = { - "request_id": request.request_id, - "schema_hash": metadata.get("schema_hash"), - "sql_text": sql_text, - "is_valid_single_select": is_valid, - } - if not is_valid: - payload["error"] = "not_single_select" - return payload + sql_text = (completion.choices[0].message.content or "").strip().strip("`") + + # Allow the model to signal an explicit schema mismatch via JSON. + if sql_text.lstrip().startswith("{"): + try: + error_obj = json.loads(sql_text) + except json.JSONDecodeError: + error_obj = {} + if error_obj.get("error") == "schema_mismatch": + return { + "request_id": request.request_id, + "schema_hash": metadata.get("schema_hash"), + "sql_text": None, + "is_valid_single_select": False, + "error": "schema_mismatch", + } + + is_valid = self._is_single_select(sql_text) + payload: Dict[str, Any] = { + "request_id": request.request_id, + "schema_hash": metadata.get("schema_hash"), + "sql_text": sql_text, + "is_valid_single_select": is_valid, + } + if not is_valid: + payload["error"] = "not_single_select" + return payloadThis keeps your single‑SELECT guard while allowing downstream consumers to distinguish “bad SQL” from “schema cannot answer”.
on_the_porch/sql_chat/app3.py-116-141 (1)
116-141: Potential SQL injection in `_get_unique_values`. The `table_name` and `column_name` parameters are interpolated directly into the SQL query using f-strings. While these values come from internal SQL parsing rather than direct user input, they originate from LLM-generated SQL which could potentially include malicious identifiers. Consider using
psycopg2.sqlfor safer identifier quoting:+from psycopg2 import sql def _get_unique_values(table_name: str, column_name: str, schema: str = "public", limit: int = 50) -> List[Any]: """Get unique values from a column to help users see available options.""" conn = _get_db_connection() try: # Set a statement timeout to prevent hanging (5 seconds) with conn.cursor() as cur: cur.execute("SET statement_timeout = '5s'") - cur.execute( - f''' - SELECT DISTINCT "{column_name}" - FROM "{schema}"."{table_name}" - WHERE "{column_name}" IS NOT NULL - ORDER BY "{column_name}" - LIMIT %s - ''', - (limit,), - ) + query = sql.SQL(''' + SELECT DISTINCT {col} + FROM {schema}.{table} + WHERE {col} IS NOT NULL + ORDER BY {col} + LIMIT %s + ''').format( + col=sql.Identifier(column_name), + schema=sql.Identifier(schema), + table=sql.Identifier(table_name), + ) + cur.execute(query, (limit,)) rows = cur.fetchall()on_the_porch/sql_chat/app4.py-188-210 (1)
188-210: SQL injection vulnerability via string interpolation. The `_get_unique_values` function constructs SQL using f-string interpolation for `column_name` and `table_name`. While `limit` is parameterized, the column and table names are directly interpolated, which could allow SQL injection if these values ever come from untrusted sources. Apply this diff to use parameterized identifiers or validate inputs:
def _get_unique_values(table_name: str, column_name: str, schema: str = "public", limit: int = 50) -> List[Any]: """Get unique values from a column to help users see available options.""" + # Validate identifiers to prevent SQL injection + if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', table_name): + return [] + if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', column_name): + return [] conn = _get_db_connection() try: with conn.cursor() as cur: cur.execute( f""" SELECT DISTINCT `{column_name}` FROM `{table_name}` WHERE `{column_name}` IS NOT NULL ORDER BY `{column_name}` LIMIT %s """, (limit,), )Committable suggestion skipped: line range outside the PR's diff.
example_env.txt-47-60 (1)
47-60: Align MySQL env var names with `api.Config.DB_CONFIG` and demo defaults. Right now there are two mismatches:
Name mismatch with
api/api.py:
api.Config.DB_CONFIGexpectsDB_HOST,DB_USER,DB_PASSWORD,DB_NAME.- This template only defines
MYSQL_HOST,MYSQL_USER,MYSQL_PASSWORD,MYSQL_DB.
If a user follows the demo README and only populates this root.env,api.pywill seeuser=None,password=None,database=Noneand MySQL connections will fail.Value mismatch with the demo README / docker-compose demo:
- The README and
demo/docker-compose.demo.ymldescribe a demo DB withdemo_user/demo_pass/sentiment_demo, but the sample here still usesroot/rethink_ai_boston.To keep things consistent and make the quickstart work out of the box, I’d recommend:
- Updating the MySQL section to match the demo DB values, and
- Adding
DB_*aliases soapi.pyworks with this unified template.For example:
-# ============================================================================ -# MySQL Database Configuration (311, 911, events) -# ============================================================================ -MYSQL_HOST=127.0.0.1 -MYSQL_PORT=3306 -MYSQL_USER=root -MYSQL_PASSWORD=your-mysql-password -MYSQL_DB=rethink_ai_boston +# ============================================================================ +# MySQL Database Configuration (311, 911, events) +# +# These defaults match the demo Docker MySQL instance in demo/docker-compose.demo.yml. +# + # Primary variables used by ingestion and newer components +MYSQL_HOST=localhost +MYSQL_PORT=3306 +MYSQL_USER=demo_user +MYSQL_PASSWORD=demo_pass +MYSQL_DB=sentiment_demo + +# Backwards‑compat: api/api.py expects DB_* names. Keep these in sync with MYSQL_*. +DB_HOST=${MYSQL_HOST} +DB_USER=${MYSQL_USER} +DB_PASSWORD=${MYSQL_PASSWORD} +DB_NAME=${MYSQL_DB}(If you prefer not to rely on
${VAR}expansion, you can duplicate the literal values instead.)This keeps the env story coherent for both the legacy API and the newer ingestion / demo setup.
on_the_porch/calendar/ingest_events_llm.py-168-201 (1)
168-201: Missing explicit commit after database inserts. The connection is closed in the `finally` block, but there's no explicit `conn.commit()`. Depending on `app4._get_db_connection()`'s autocommit setting, inserts may be lost.

     for ev in events:
         cur.execute(
             ...
         )
+    conn.commit()
     print(f"Inserted {len(events)} events into weekly_events.")
 finally:
     conn.close()

on_the_porch/unused/data_download/download_911_data.py-96-118 (1)
96-118: Redundant data download - entire dataset fetched twice.
get_911_shots_fired_data()callsget_boston_911_data()internally to fetch all records. When the main block calls both functions, the full dataset is downloaded twice. Refactor to accept the DataFrame as a parameter, similar tofilter_shots_fired_dataindownload_crime_data.py.-def get_911_shots_fired_data(): +def get_911_shots_fired_data(df=None, filepath=None): """ Get specifically shots fired incidents """ print("🔍 Fetching shots fired data...") - # Get all crime data first - df, filepath = get_boston_911_data() + # Get all crime data first if not provided + if df is None: + df, filepath = get_boston_911_data() if df is not None:And update the main block:
if df_all is not None: print("\n" + "=" * 40) print("🔫 Downloading shots fired data specifically...") - df_shots, filepath_shots = get_911_shots_fired_data() + df_shots, filepath_shots = get_911_shots_fired_data(df_all, filepath_all)Also applies to: 125-130
api/api_v2.py-113-114 (1)
113-114: Hardcoded secret key is insecure. The default `SECRET_KEY` value `"agent-api-secret-2025"` should not be used in production. The session cookie security depends on this being truly secret. Consider failing startup if no secret key is configured in production:
- SECRET_KEY = os.getenv("FLASK_SECRET_KEY", "agent-api-secret-2025") + SECRET_KEY = os.getenv("FLASK_SECRET_KEY") + if not SECRET_KEY: + raise ValueError("FLASK_SECRET_KEY environment variable must be set")Alternatively, generate a random key at startup (though this breaks sessions across restarts):
SECRET_KEY = os.getenv("FLASK_SECRET_KEY") or os.urandom(32).hex()api/api_v2.py-587-589 (1)
587-589: Same issue: internal error details leaked in /events endpoint. Apply the same fix as suggested for `/chat` to avoid exposing internal errors.

 except Exception as e:
     print(f"Error in /events: {e}")
-    return jsonify({"error": f"Failed to fetch events: {str(e)}"}), 500
+    return jsonify({"error": "Failed to fetch events"}), 500

api/api_v2.py-152-158 (1)
152-158: Overly permissive CORS configuration allows all origins. Setting `origins: "*"` allows any website to make authenticated requests to this API. For a production API with API key authentication, restrict origins to known frontend domains.

 CORS(
     app,
     supports_credentials=True,
     expose_headers=["RethinkAI-API-Key"],
-    resources={r"/*": {"origins": "*"}},
+    resources={r"/*": {"origins": os.getenv("ALLOWED_ORIGINS", "http://localhost:3000").split(",")}},
     allow_headers=["Content-Type", "RethinkAI-API-Key"],
 )

api/api_v2.py-456-458 (1)
456-458: Avoid leaking internal error details to clients. Returning `str(e)` exposes internal implementation details (stack traces, database errors) to API consumers. Log the full error server-side but return a generic message to clients.

 except Exception as e:
     print(f"Error in /chat: {e}")
-    return jsonify({"error": f"Internal server error: {str(e)}"}), 500
+    import traceback
+    traceback.print_exc()
+    return jsonify({"error": "Internal server error"}), 500

api/api_v2.py-125-137 (1)
125-137: Connection pool creation at module import time lacks error handling. If the database is unavailable when
api/api_v2.pyis imported, theMySQLConnectionPoolconstructor will raise an exception, preventing the application from starting. Wrap the pool creation (lines 129-137) in a try-except block or implement lazy initialization to prevent import failures.on_the_porch/data_ingestion/main_daily_ingestion.py-155-160 (1)
155-160: Operator precedence bug in error calculation. The ternary operator has lower precedence than addition, causing incorrect calculation. When `dotnews_stats` is `None`, the entire sum becomes 0.

 # Calculate overall success
 errors = (
     len(drive_stats.get('errors', [])) +
     len(email_stats.get('errors', [])) +
-    (sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0) +
-    len(dotnews_stats.get('errors', [])) if dotnews_stats else 0
+    (sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0) +
+    (len(dotnews_stats.get('errors', [])) if dotnews_stats else 0)
 )

on_the_porch/data_ingestion/main_daily_ingestion.py-73-76 (1)
73-76: Bare `except` should specify exception type. The bare `except` clause can mask unexpected errors during JSON parsing. Specify the expected exception.

 if processed_file.exists():
     try:
         processed_files = json.loads(processed_file.read_text())
-    except:
+    except (json.JSONDecodeError, IOError):
         processed_files = {}

on_the_porch/data_ingestion/build_vectordb.py-344-357 (1)
344-357: Mixing `add_documents` and `from_documents` may cause inconsistent behavior. When the vector DB exists, `add_documents` is called, which appends to existing data. But there's no deduplication, so running `build_vectordb()` multiple times will create duplicate entries. Consider either:
- Clearing the existing DB before rebuilding
- Implementing deduplication based on document IDs/hashes
- Adding a
--rebuildflag to control behaviorif VECTORDB_DIR.exists(): - print(f"Adding documents to existing vector database at {VECTORDB_DIR}") + print(f"⚠️ Vector database exists at {VECTORDB_DIR}") + print(" Use --rebuild flag to recreate, or documents will be appended (may create duplicates)") vectordb = Chroma( persist_directory=str(VECTORDB_DIR), embedding_function=embeddings, ) + # Consider: vectordb.delete(where={}) # Clear existing before adding vectordb.add_documents(all_documents)on_the_porch/data_ingestion/boston_data_sync/boston_data_sync.py-386-392 (1)
386-392: SQL injection risk via dynamic table/column names. Table names and column names are interpolated directly into SQL queries from configuration. While these values come from config files (not direct user input), if the config is ever exposed to untrusted sources, this could be exploited.
Consider validating table/column names against an allowlist or using identifier quoting utilities:
+import re + +def _validate_identifier(name: str) -> str: + """Validate and sanitize SQL identifiers.""" + if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', name): + raise ValueError(f"Invalid SQL identifier: {name}") + return name + # Then use: -cursor.execute(f"SHOW TABLES LIKE '{table_name}'") +cursor.execute("SHOW TABLES LIKE %s", (table_name,))Committable suggestion skipped: line range outside the PR's diff.
🟡 Minor comments (31)
on_the_porch/unused/actual_eda/instrutctions.md-1-1 (1)
1-1: Fix the filename typo. The filename contains a typo: `instrutctions.md` should be `instructions.md`.

on_the_porch/unused/actual_eda/instrutctions.md-42-46 (1)
42-46: Reduce repetitive sentence structure. Three successive sentences begin with "Distribution," which impacts readability. Vary the phrasing for better clarity.
Apply this diff to improve sentence variety:
-Distribution of incidents by district (bar chart). +Distribution of incidents by district (bar chart). -Distribution of incidents by day of week (bar chart, ordered from Monday to Sunday). +Incidents by day of week (bar chart, ordered from Monday to Sunday). -Distribution of incidents by hour of day (bar chart). +Incidents by hour of day (bar chart).on_the_porch/unused/pf/create_db.py-31-31 (1)
31-31: Remove unnecessary f-string prefix. The f-string has no placeholders.
-print(f"\nDatabase created successfully!") +print("\nDatabase created successfully!")on_the_porch/unused/pf/create_db.py-38-38 (1)
38-38: Remove unnecessary f-string prefix. The f-string has no placeholders.
-print(f"\nTable schema:") +print("\nTable schema:")on_the_porch/data_ingestion/boston_data_sync/test_record_ordering.py-112-120 (1)
112-120: Fix logic error in ordering detection. The conditional logic has overlapping conditions. When `first_record_date == last_record_date`, line 112 evaluates to true (descending branch), but line 115's condition `<=` also includes equality. Since line 115 is an `elif`, it won't execute, but the logic is confusing and technically incorrect for the equality case. When all records have the same date, it shouldn't be classified as "descending" but rather as "no ordering" or "indeterminate".
Apply this diff to fix the logic:
# Determine ordering print(f"\n[ORDERING] Analysis:") - if first_record_date >= last_record_date: + if first_record_date == last_record_date: + print(" [WARNING] All records have the same date") + print(" [WARNING] Cannot determine ordering - need to fetch all records") + elif first_record_date > last_record_date: print(" [OK] Records appear to be in DESCENDING order (newest first)") print(" [TIP] We can optimize by fetching only the first N records for incremental syncs!") - elif first_record_date <= last_record_date: + else: # first_record_date < last_record_date print(" [OK] Records appear to be in ASCENDING order (oldest first)") print(" [WARNING] We need to fetch all records or use offset to get latest records") - else: - print(" [WARNING] Records appear to be in RANDOM order") - print(" [WARNING] We need to fetch all records and filter")on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-394-394 (1)
394-394: Typo in sample question. "hwo many" should be "how many".
- question = "hwo many parking enforcement requests were made in May 2025?" + question = "how many parking enforcement requests were made in May 2025?"on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-275-280 (1)
275-280: Latitude column detection logic may incorrectly match. The condition
if 'lat' in col_lower and not lon_colwill skip assigninglat_coliflon_colwas already found. This seems like a bug—the logic should beand not lat_colto avoid overwriting an already-found latitude column.for col in df.columns: col_lower = col.lower() - if 'lat' in col_lower and not lon_col: + if 'lat' in col_lower and not lat_col: lat_col = col if 'lon' in col_lower or 'lng' in col_lower: lon_col = colon_the_porch/unused/pf/simple_sql_qa_pocketflow.py-403-403 (1)
403-403: Add version compatibility check before calling `_run`.
flow._run(shared)at line 403 assumes_runexists, but different versions of pocketflow may have different APIs. The production code inapp4.pyandapp3.pyhandles this with:if hasattr(flow, "_run"): flow._run(shared) else: flow.run(shared)Apply the same pattern here with a comment noting version compatibility.
.gitignore-15-16 (1)
15-16: Fix incorrect directory name on line 88: use "rag stuff" instead of "rag_stuff". The actual filesystem contains `on_the_porch/rag stuff/` (with a space). Line 15 correctly references this directory, but line 88 uses `rag_stuff` (with underscore), which does not exist. Update line 88 from `on_the_porch/rag_stuff/data/*` to `on_the_porch/rag stuff/data/*` to match the actual directory structure.
1-18: Fix RUF002 by replacing the en dash in the module docstringRuff is correctly flagging the
–here; you can keep the wording and swap in a plain hyphen to satisfy lint.-""" -Pocket Flow – Minimal NL → SQL Orchestrator +""" +Pocket Flow - Minimal NL → SQL Orchestratordataset-documentation/README.md-113-118 (1)
113-118: Clarify geospatial asset format and 311normalized_typecolumn in docsTwo small consistency issues to consider tightening up:
- Lines 115‑118: Geospatial Community Assets list the Format as “GeoJSON files” but the Location path is
api/datastore/geocoding-community-assets.csv. If the source is actually CSV, adjust the format; if it’s really GeoJSON, update the path/extension so they align.- Lines 216‑221: The SQL example uses
normalized_typefromboston_311, but that column isn’t described in the 311 schema section above. Either addnormalized_typeto the schema description or change the example query to use whichever column is actually present.These are minor but worth fixing so the dataset documentation stays a reliable reference.
Also applies to: 216-221
demo/setup_windows.bat-34-37 (1)
34-37: Fix venv activation check to useactivate.baton WindowsOn Windows,
python -m venvcreates three activation scripts:activate.bat(cmd.exe),Activate.ps1(PowerShell), and a bareactivatescript (POSIX shells). Theif existcheck does not apply path extension resolution (PATHEXT), so:if exist ".venv_demo\Scripts\activate" ( call ".venv_demo\Scripts\activate" )will not find the file and skip activation. Update to explicitly check for
activate.bat:rem Activate the demo virtual environment if exist ".venv_demo\Scripts\activate.bat" ( call ".venv_demo\Scripts\activate.bat" )This ensures the venv activates reliably in cmd.exe at the end of the script.
README.md-82-84 (1)
82-84: Inconsistent environment file naming. The documentation references different environment file names:
- Line 83: "Copy
.env.examplefiles to.env"- Line 113-115: "Copy
example_env.txtto.env"Please standardize on one naming convention to avoid confusion.
Also applies to: 111-116
on_the_porch/api_readme.md-421-529 (1)
421-529: Clarify that at least one API key is mandatory for the API to function.The documentation correctly states authentication is "mandatory," but it should explicitly clarify that at least one API key must always be configured in
RETHINKAI_API_KEYS. The current wording—"Authentication is controlled by theRETHINKAI_API_KEYSenvironment variable"—could imply authentication can be disabled by omitting the variable. In practice, the API rejects all requests if no keys are configured. Update the documentation to state clearly: "At least one API key must be configured inRETHINKAI_API_KEYSfor the API to accept any requests."on_the_porch/data_ingestion/boston_data_sync/test_311_resources.py-20-55 (1)
20-55: Narrow the broadexcept Exceptionto more specific error typesThe outer loop currently catches a blind
Exception(flagged by Ruff BLE001), which can hide genuine programming errors:except Exception as e: print(f"{year}: {resource_id}") print(f" ERROR: {e}")Given this is all network + JSON handling, be more precise:
except requests.RequestException as e: print(f"{year}: {resource_id}") print(f" HTTP ERROR: {e}") except ValueError as e: print(f"{year}: {resource_id}") print(f" JSON ERROR: {e}")Also consider adding
response.raise_for_status()afterrequests.get()to catch HTTP errors explicitly rather than relying on the try-except.on_the_porch/new_metadata/bos311_data.json-1327-1334 (1)
1327-1334: Malformed decimal data type definitions.The
latitudeandlongitudecolumns have incomplete data types ("decimal(18"instead of"decimal(18,8)"or similar). This will cause issues if this metadata is used for schema generation or validation."latitude": { - "data_type": "decimal(18", + "data_type": "decimal(18,8)", "is_numeric": true }, "longitude": { - "data_type": "decimal(18", + "data_type": "decimal(18,8)", "is_numeric": true },on_the_porch/data_ingestion/boston_data_sync/reset_database.py-91-94 (1)
91-94: Validate table names to prevent SQL injection.Table names are read from a config file and interpolated directly into SQL queries. While backticks provide some protection, a malicious config file could still inject SQL. Consider validating that table names match an allowed pattern.
+import re + +TABLE_NAME_RE = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$') + +def validate_table_name(name: str) -> bool: + """Validate table name contains only safe characters.""" + return bool(TABLE_NAME_RE.match(name)) and len(name) <= 64Then use before querying:
for table in all_tables: if not validate_table_name(table): print(f" ⚠️ Skipping invalid table name: {table}") continue cursor.execute(f"SHOW TABLES LIKE '{table}'")on_the_porch/unused/data_download/download_911_data.py-88-89 (1)
88-89: Missing column existence checks can cause KeyError.Unlike
download_crime_data.py, these columns are accessed without checking if they exist first. If the API response is missing these columns, this will raise aKeyError.# Print data info print(f"📈 Data shape: {df.shape}") - print(f"📅 Date range: {df['OCCURRED_ON_DATE'].min()} to {df['OCCURRED_ON_DATE'].max()}") - print(f"🏘️ Districts: {df['DISTRICT'].unique()}") + if 'OCCURRED_ON_DATE' in df.columns: + print(f"📅 Date range: {df['OCCURRED_ON_DATE'].min()} to {df['OCCURRED_ON_DATE'].max()}") + if 'DISTRICT' in df.columns: + print(f"🏘️ Districts: {df['DISTRICT'].unique()}")on_the_porch/data_ingestion/README.md-421-428 (1)
421-428: Inconsistent security guidance: app-specific passwords vs OAuth.Line 426 advises "Use app-specific passwords for email" but line 104 states "Gmail now requires OAuth 2.0 authentication (app-specific passwords were discontinued in November 2025)." Remove or update the outdated guidance.
⚠️ **Important:** - Never commit `.env` file to git (it's in `.gitignore`) - Keep service account JSON file secure -- Use app-specific passwords for email (not your main password) +- Keep OAuth credentials (`gmail_credentials.json`, `gmail_token.json`) secure - Regularly rotate credentials - Use read-only permissions where possibleon_the_porch/data_ingestion/boston_data_sync/find_boston_resource_id.py-57-68 (1)
57-68: Bareexceptclause loses exception context.The bare
except:on line 67 silently swallows all exceptions includingKeyboardInterruptandSystemExit. Useexcept Exception:at minimum to preserve interrupt behavior.try: response = requests.get(url, params=params, timeout=10) response.raise_for_status() data = response.json() return data.get('success', False) - except: + except Exception: return Falseon_the_porch/new_metadata/generate_mysql_metadata_live.py-122-127 (1)
122-127: SQL injection in SHOW TABLES query.Line 124 uses f-string interpolation for
table_name. WhileTARGET_TABLESis hardcoded, use parameterized query for consistency and safety.# Check if table exists cursor = conn.cursor() - cursor.execute(f"SHOW TABLES LIKE '{table_name}'") + cursor.execute("SHOW TABLES LIKE %s", (table_name,)) if not cursor.fetchone(): print(f" WARNING: Table '{table_name}' does not exist, skipping") return Noneon_the_porch/unused/data_download/import_911_to_mysql.py-244-253 (1)
244-253: Potential issue:getctimebehavior varies by OS.On Unix systems,
os.path.getctimereturns inode change time, not creation time. Useos.path.getmtimefor consistent behavior across platforms.- latest_file = max(csv_files, key=os.path.getctime) + latest_file = max(csv_files, key=os.path.getmtime)on_the_porch/data_ingestion/main_daily_ingestion.py-181-244 (1)
181-244: Dotnews stats not included in total_errors calculation.The
print_final_summaryfunction calculatestotal_errorswithout includingdotnews_stats, but then references dotnews errors in the error details output.drive_errors = len(drive_stats.get('errors', [])) email_errors = len(email_stats.get('errors', [])) boston_errors = sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0 - total_errors = drive_errors + email_errors + boston_errors + dotnews_errors = len(dotnews_stats.get('errors', [])) if dotnews_stats else 0 + total_errors = drive_errors + email_errors + boston_errors + dotnews_errorson_the_porch/data_ingestion/boston_data_sync/boston_data_sync.py-528-553 (1)
528-553: Statistics tracking logic may cause double-counting.The try block at lines 530-550 attempts to track new vs updated records, but if it succeeds, it updates
total_insertedandtotal_updated. Thencursor.executemanyruns and commits. However, ifexecutemanyfails after the stats tracking succeeds, the stats won't be rolled back. Also, if the stats query itself fails silently (caught at line 548), it falls back to assuming all are inserts, which could misrepresent actual database state.Consider moving the stats calculation to be a post-operation verification or simplifying to just report affected rows from
cursor.rowcount.on_the_porch/data_ingestion/utils/document_processor.py-84-86 (1)
84-86: Unreliable.docfile handling.The comment acknowledges this won't work for all
.docfiles sincepython-docxonly supports the newer.docxformat. Old binary.docfiles will likely raise an exception or produce garbage output.Consider either:
- Removing
.docsupport and raising an explicit error- Using a library like
antiwordortextractthat properly handles legacy.docformat- Adding a clearer warning in the exception when
.docparsing failselif ext == '.doc': - # For old .doc files, try DOCX library (may not work for all) - return extract_text_from_docx(file_path) + raise ValueError( + f"Legacy .doc format not fully supported. " + f"Please convert {file_path.name} to .docx format." + )on_the_porch/data_ingestion/build_vectordb.py-13-13 (1)
13-13: Import path may fail depending on execution context.
from retrieval import GeminiEmbeddingsassumesretrieval.pyis in the Python path. This will fail if the script is run from a different directory or as a module.Consider using relative imports or adding the directory to path:
+import sys +sys.path.insert(0, str(Path(__file__).parent.parent / "rag stuff")) from retrieval import GeminiEmbeddingsOr restructure as a proper package with
__init__.pyfiles.Committable suggestion skipped: line range outside the PR's diff.
on_the_porch/rag stuff/retrieval.py-6-7 (1)
6-7: Relative path may break depending on current working directory.
VECTORDB_DIR = Path("../vectordb_new")is relative to the current working directory, not the script location. This will fail if the script is executed from a different directory.+_THIS_FILE = Path(__file__).resolve() +_MODULE_DIR = _THIS_FILE.parent -VECTORDB_DIR = Path("../vectordb_new") +VECTORDB_DIR = _MODULE_DIR.parent / "vectordb_new"on_the_porch/data_ingestion/build_vectordb.py-228-232 (1)
228-232: Fragile JSON parsing fallback may still fail.The fallback at line 231 (
safe_text = text.replace("\\", "\\\\")) is a naive attempt to fix escape issues, but this could corrupt valid JSON or still fail on other malformed responses. If bothjson.loadsattempts fail, the exception propagates uncaught.Add comprehensive error handling:
try: data = json.loads(text) except json.JSONDecodeError: - safe_text = text.replace("\\", "\\\\") - data = json.loads(safe_text) + try: + safe_text = text.replace("\\", "\\\\") + data = json.loads(safe_text) + except json.JSONDecodeError as e: + print(f" ⚠️ Failed to parse Gemini response as JSON: {e}") + return []on_the_porch/data_ingestion/google_drive_to_vectordb.py-559-559 (1)
559-559: Remove extraneous `f` prefix from static strings. Several f-strings have no placeholders and should be regular strings. Example at line 559:
- print(f" ⚠ Skipping event with no name") + print(" ⚠ Skipping event with no name")Similar fixes needed at lines 645, 655, and 828.
on_the_porch/data_ingestion/google_drive_to_vectordb.py-211-218 (1)
211-218: Duplicate dictionary key: `'may'` appears twice. The
month_mapdictionary has'may': '05'on both line 213 (lowercase from full months) and line 216 (from short months). The second assignment overwrites the first with the same value, but this is a code smell that could cause issues if values diverge.month_map = { 'january': '01', 'february': '02', 'march': '03', 'april': '04', 'may': '05', 'june': '06', 'july': '07', 'august': '08', 'september': '09', 'october': '10', 'november': '11', 'december': '12', 'jan': '01', 'feb': '02', 'mar': '03', 'apr': '04', - 'may': '05', 'jun': '06', 'jul': '07', 'aug': '08', + 'jun': '06', 'jul': '07', 'aug': '08', 'sep': '09', 'oct': '10', 'nov': '11', 'dec': '12' }on_the_porch/data_ingestion/google_drive_to_vectordb.py-819-819 (1)
819-819: Unused variable `expected_chunks`. This variable is assigned but never used. Remove it to clean up the code.
for file_id in deleted_file_ids: file_name = processed_files[file_id].get('name', 'unknown') - expected_chunks = processed_files[file_id].get('chunks', 0) try:
app.run(host=Config.HOST, port=Config.PORT, debug=True)
Do not deploy with debug=True in production.
The Flask debug mode enables the interactive debugger which can lead to remote code execution if exposed. Use an environment variable to control this.
- app.run(host=Config.HOST, port=Config.PORT, debug=True)
+ app.run(host=Config.HOST, port=Config.PORT, debug=os.getenv("FLASK_DEBUG", "false").lower() == "true")

Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 ast-grep (0.40.0)
[warning] 629-629: Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables.
Context: app.run(host=Config.HOST, port=Config.PORT, debug=True)
Note: [CWE-489] Active Debug Code. [REFERENCES]
- https://labs.detectify.com/2015/10/02/how-patreon-got-hacked-publicly-exposed-werkzeug-debugger/
(debug-enabled-python)
🪛 Ruff (0.14.8)
630-630: Use of debug=True in Flask app detected
(S201)
🤖 Prompt for AI Agents
In api/api_v2.py around lines 629-630, the app.run call uses debug=True which
must not be enabled in production; change it to read a configuration/env var
(e.g., Config.DEBUG or os.getenv("FLASK_DEBUG", "false") converted to a boolean)
and pass that boolean to debug, and ensure Config.DEBUG is set from the
environment with a safe default of False; additionally, ensure production
deployments use a WSGI server (gunicorn/uvicorn) rather than app.run.
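For reference, a minimal sketch of the environment-driven flag suggested above; it is an illustration only, and the Config class shown here is a placeholder rather than the PR's actual configuration object:

# Sketch: drive Flask debug mode from the environment with a safe default of False (illustration only)
import os

class Config:
    DEBUG = os.getenv("FLASK_DEBUG", "false").lower() in ("1", "true", "yes")

# app.run(host="0.0.0.0", port=5000, debug=Config.DEBUG)  # dev only; use gunicorn/uvicorn in production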
*.json
!.env.example
Incorrect negation pattern: !.env.example does not match preceding *.json rule.
Line 25 attempts to whitelist .env.example, but the preceding pattern *.json (line 24) would never match .env.example. This negation pattern will have no effect and appears to be a copy-paste error. If you intend to exclude certain JSON files, use a more specific pattern like !.env.example.json or !config.json.
If the intent is to preserve .env.example as a template, it should be handled separately (e.g., .env.example should not match the *.json rule). If you intend to whitelist specific JSON files, clarify the pattern:
# Google credentials
*.json
-!.env.example
+!config.example.json  # or whichever JSON files should be tracked

Alternatively, if .env.example should be tracked and is not a JSON file, remove this line entirely since it's redundant.
📝 Committable suggestion
# Google credentials
*.json
!config.example.json
🤖 Prompt for AI Agents
In on_the_porch/data_ingestion/.gitignore around lines 24-25, the negation
pattern `!.env.example` is ineffective because the preceding `*.json` rule does
not match `.env.example`; either remove the negation if you simply want
`.env.example` tracked (it’s redundant), or replace it with a correct whitelist
for JSON files (e.g., `!.env.example.json` or `!config.json`) if you intended to
allow specific JSON files—update the file to use the appropriate pattern or
delete the unnecessary line.
def connect_to_mysql():
    """Connect to MySQL database"""
    try:
        connection = mysql.connector.connect(
            host='localhost',
            user='rethinkai_user',
            password='MySecureUserPass123!',
            database='rethinkai_db'
        )
        print("✅ Connected to MySQL database")
        return connection
    except Error as e:
        print(f"❌ Error connecting to MySQL: {e}")
        return None
Critical: Hardcoded database credentials must be removed.
Hardcoded credentials in source code are a serious security risk. These can be exposed via version control history even if removed later. Use environment variables instead.
+import os
+
def connect_to_mysql():
"""Connect to MySQL database"""
try:
connection = mysql.connector.connect(
- host='localhost',
- user='rethinkai_user',
- password='MySecureUserPass123!',
- database='rethinkai_db'
+ host=os.getenv('MYSQL_HOST', 'localhost'),
+ user=os.getenv('MYSQL_USER'),
+ password=os.getenv('MYSQL_PASSWORD'),
+ database=os.getenv('MYSQL_DB')
)
+ if not os.getenv('MYSQL_USER') or not os.getenv('MYSQL_DB'):
+ raise ValueError("MYSQL_USER and MYSQL_DB environment variables required")
print("✅ Connected to MySQL database")
return connection
except Error as e:
print(f"❌ Error connecting to MySQL: {e}")
return None

Immediate action required: Rotate the exposed credentials (rethinkai_user / MySecureUserPass123!) as they are now in version control history.
📝 Committable suggestion
import os

def connect_to_mysql():
    """Connect to MySQL database"""
    try:
        connection = mysql.connector.connect(
            host=os.getenv('MYSQL_HOST', 'localhost'),
            user=os.getenv('MYSQL_USER'),
            password=os.getenv('MYSQL_PASSWORD'),
            database=os.getenv('MYSQL_DB')
        )
        if not os.getenv('MYSQL_USER') or not os.getenv('MYSQL_DB'):
            raise ValueError("MYSQL_USER and MYSQL_DB environment variables required")
        print("✅ Connected to MySQL database")
        return connection
    except Error as e:
        print(f"❌ Error connecting to MySQL: {e}")
        return None
🧰 Tools
🪛 ast-grep (0.40.0)
[warning] 16-21: The application creates a database connection with an empty password. This can lead to unauthorized access by either an internal or external malicious actor. To prevent this vulnerability, enforce authentication when connecting to a database by using environment variables to securely provide credentials or retrieving them from a secure vault or HSM (Hardware Security Module).
Context: mysql.connector.connect(
host='localhost',
user='rethinkai_user',
password='MySecureUserPass123!',
database='rethinkai_db'
)
Note: [CWE-287] Improper Authentication. [REFERENCES]
- https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html
(python-mysql-hardcoded-secret-python)
🪛 Ruff (0.14.8)
20-20: Possible hardcoded password assigned to argument: "password"
(S106)
24-24: Consider moving this statement to an else block
(TRY300)
🤖 Prompt for AI Agents
In on_the_porch/unused/data_download/import_911_to_mysql.py around lines 14 to
27, the database username and password are hardcoded; replace them with values
read from environment variables (e.g., MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD,
MYSQL_DATABASE), optionally loading a .env in development, and fail fast with a
clear error if required env vars are missing; update the connect_to_mysql
function to use os.environ.get (or a config helper) instead of literals, and
ensure you rotate the exposed credentials immediately and remove them from the
repo/history.
# from postgres_stuff.test import DB_URL

DB_URL = "postgresql://user1:BbWTihWnsBHglVpeKK8XfQgEPDOcokZZ@dpg-d3g661u3jp1c73eg9v1g-a.ohio-postgres.render.com/crime_rate_h3u5"
Critical: Hardcoded database credentials exposed in source code.
The PostgreSQL connection string contains plaintext username and password. This is a severe security vulnerability—credentials committed to version control are permanently exposed in git history, even if removed later.
Move credentials to environment variables:
-DB_URL = "postgresql://user1:BbWTihWnsBHglVpeKK8XfQgEPDOcokZZ@dpg-d3g061u3jp1c73eg9v1g-a.ohio-postgres.render.com/crime_rate_h3u5"
+DB_URL = os.getenv("DATABASE_URL")
+if not DB_URL:
+ raise RuntimeError("DATABASE_URL not found in environment")

Immediate action required: Rotate this database password since it's now exposed.
📝 Committable suggestion
DB_URL = os.getenv("DATABASE_URL")
if not DB_URL:
    raise RuntimeError("DATABASE_URL not found in environment")
🤖 Prompt for AI Agents
In on_the_porch/unused/langgraph_stuff/agent.py around line 15 the PostgreSQL
connection string is hardcoded including plaintext username and password; remove
the literal DB_URL, read the connection details from environment variables
(e.g., os.environ or a config loader) and construct the URL at runtime, ensure
any local .env usage is documented but not committed, add the config key name to
repo docs/secrets guide, add the file change to git and rotate the exposed
database password immediately to invalidate the leaked credentials.
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Replace with your actual LangSmith API key and endpoint
os.environ["LANGSMITH_API_KEY"] = "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"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
# Optional but recommended for organization
os.environ["LANGCHAIN_PROJECT"] = "spark_eda-sql-agent"
os.environ["LANGCHAIN_SESSION"] = "local-dev"
Critical: Hardcoded LangSmith API key exposed in source code.
The API key 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 is committed to the repository. This key should be revoked and rotated immediately.
Load these values from environment variables or .env file instead:
os.environ["LANGCHAIN_TRACING_V2"] = "true"
-os.environ["LANGSMITH_API_KEY"] = "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"
-os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
+# These should already be set via .env or environment
+# os.environ["LANGSMITH_API_KEY"] = os.getenv("LANGSMITH_API_KEY")
os.environ["LANGCHAIN_PROJECT"] = "spark_eda-sql-agent"
os.environ["LANGCHAIN_SESSION"] = "local-dev"Committable suggestion skipped: line range outside the PR's diff.
🤖 Prompt for AI Agents
In on_the_porch/unused/langgraph_stuff/agent.py around lines 21 to 27, remove
the hardcoded LangSmith API key and endpoint values and instead read these
settings from environment variables or a .env file; revoke and rotate the
exposed key immediately, replace the literal assignments with
os.environ.get("LANGSMITH_API_KEY") and os.environ.get("LANGSMITH_ENDPOINT")
(with sensible defaults or error checks), and if desired load a local .env via
python-dotenv at process startup while ensuring .env is added to .gitignore and
no secrets are committed.
typing_extensions==4.15.0
tzdata==2025.2
uritemplate==4.2.0
urllib3==2.5.0
🧩 Analysis chain
🏁 Script executed:
# First, verify the file exists and check line 110
head -n 114 on_the_porch/unused/pf/requirements.txt | tail -n 10

Repository: BU-Spark/ml-misi-community-sentiment
Length of output: 216
🏁 Script executed:
# Get the exact content at line 110
sed -n '110p' on_the_porch/unused/pf/requirements.txt

Repository: BU-Spark/ml-misi-community-sentiment
Length of output: 94
🌐 Web query:
urllib3 2.5.0 GHSA-2xpw-w6gg-jr37 vulnerability CVE
💡 Result:
Short answer: GHSA-2xpw-w6gg-jr37 is a high-severity streaming/decompression vulnerability in urllib3 (published Dec 5, 2025). It can cause excessive resource consumption when handling highly‑compressed HTTP responses; affected versions are >=1.0 and <2.6.0. The issue is tracked as CVE-2025-66471 (and related CVE-2025-66418 for a similar decompression/resource issue). It was fixed in urllib3 2.6.0 — upgrade to >=2.6.0 (or disable decoding/streaming until patched). [1][2][3]
Sources:
- GitHub security advisory GHSA-2xpw-w6gg-jr37 (urllib3 advisory). [1]
- GitHub advisory details showing CVE mapping and fix (2.6.0). [2]
- OSS‑security / CVE writeups summarizing CVE-2025-66418 / CVE-2025-66471 and remediation. [3]
🌐 Web query:
urllib3 2.5.0 GHSA-gm62-xv2j-4w53 vulnerability CVE
💡 Result:
Yes — GHSA-gm62-xv2j-4w53 (CVE-2025-66418) affects urllib3 2.5.0. Summary:
- Vulnerability: urllib3 allowed an unbounded number of links in the HTTP response decompression chain, enabling a malicious server to cause high CPU and large memory allocations.
- Affected versions: >= 1.24 and < 2.6.0 (includes 2.5.0).
- Fix: upgrade to urllib3 >= 2.6.0 (limits chain length to 5).
- Temporary mitigations: disable automatic decoding (e.g., use preload_content=False) and validate resp.headers["content-encoding"] before reading.
Sources: GitHub security advisory [1], OSV / vulnerability database [2], NVD/Tenable summaries [3].
References
[1] GitHub security advisory GHSA-gm62-xv2j-4w53 (urllib3).
[2] OSV / osv.dev entry for GHSA-gm62-xv2j-4w53 (CVE-2025-66418).
[3] Tenable / NVD advisory summary for CVE-2025-66418.
Upgrade urllib3 from 2.5.0 to patch HIGH severity vulnerabilities.
urllib3 2.5.0 contains two confirmed HIGH severity vulnerabilities affecting streaming decompression:
- GHSA-2xpw-w6gg-jr37 (CVE-2025-66471): Streaming API improperly handles highly compressed data, causing excessive resource consumption.
- GHSA-gm62-xv2j-4w53 (CVE-2025-66418): Unbounded number of links in decompression chain allows malicious servers to trigger high CPU and memory allocation.
Both are fixed in urllib3 2.6.0. Upgrade immediately:
-urllib3==2.5.0
+urllib3==2.6.0📝 Committable suggestion
urllib3==2.6.0
🤖 Prompt for AI Agents
In on_the_porch/unused/pf/requirements.txt around line 110, urllib3 is pinned to
2.5.0 which contains HIGH severity vulnerabilities; update the pin to
urllib3==2.6.0 to pull in the fixes, then regenerate any dependency
lock/constraints file (pip-compile, poetry.lock, or requirements.txt.build) and
run the test suite and a dependency audit (e.g., pip-audit or safety) to ensure
no regressions or incompatible transitive dependencies; if any package conflicts
arise, resolve them by updating the conflicting package or adding an explicit
compatible urllib3 constraint and document the change in the PR.
@Alinahvi @danoh07 this is our final PR.
@NamanGupta1102 @aryanjain6492 @kisss04
Summary by CodeRabbit
Release Notes
New Features
Documentation
Configuration