
Conversation

@bsurya27
Collaborator

@bsurya27 bsurya27 commented Dec 10, 2025

@Alinahvi @danoh07 this is our final PR.

@NamanGupta1102 @aryanjain6492 @kisss04

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced API v2.0 with chat, event retrieval, and interaction logging endpoints.
    • Automated data ingestion from Boston Open Data Portal, Google Drive, and email newsletters with vector database and SQL storage.
    • Demo environment setup with Docker Compose and platform-specific scripts for quick evaluation.
    • Calendar event extraction from PDFs using AI with database persistence.
    • Unified SQL and semantic search chatbot for intelligent data queries.
  • Documentation

    • Added comprehensive API documentation, dataset schema definitions, and data ingestion guides.
    • Created demo setup instructions for Mac, Linux, and Windows platforms.
    • Expanded project README with quick-start, architecture overview, and troubleshooting.
  • Configuration

    • New environment templates and configuration files for API, demo, and data ingestion components.

✏️ Tip: You can customize this high-level summary in your review settings.

@review-notebook-app

Check out this pull request on ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@coderabbitai

coderabbitai bot commented Dec 10, 2025

Walkthrough

This PR establishes a comprehensive data ingestion and unified AI chatbot platform for the RethinkAI Community Sentiment Analysis Platform. It introduces a Flask v2 REST API, orchestrates data sync from Google Drive, email, and Boston's open data portal into MySQL and vector databases, implements RAG and SQL query routing with Gemini integration, and provides extensive configuration, documentation, and deployment guidance.

Changes

Cohort / File(s) / Summary
Configuration & Environment
.gitignore, api/.env.example, api/env.example, demo/.env.example, example_env.txt
Expanded gitignore patterns for RAG, vectordb, and data artifacts; added environment configuration templates for API, demo, and root-level settings with Gemini, MySQL, and legacy API keys.
API Layer
api/api.py, api/api_v2.py, api/test_api_v2.py
Modified env loading in api.py; introduced api_v2.py with POST /chat, /log, GET /events, /health endpoints using session caching and unified_chatbot routing; added test suite for v2 endpoints.
Root & Project Documentation
README.md, manifest.md, handover.txt, dataset-documentation/README.md
Comprehensive README rewrite with quick-start, architecture, and API overview; added project manifest detailing components and data flow; included dataset documentation guide and handoff procedures.
Demo Setup
demo/setup.sh, demo/setup_windows.bat, demo/docker-compose.demo.yml, demo/README.md
Added automated setup scripts for Linux/macOS/Windows with Python venv and Docker MySQL; included demo-specific README with troubleshooting and environment configuration.
Data Ingestion: Boston Open Data
on_the_porch/data_ingestion/boston_data_sync/*
Comprehensive Boston CKAN datastore sync with incremental updates, upsert logic, and filtered derivative tables (shots_fired, homicide); includes config, CLI, scheduling, and testing utilities.
Data Ingestion: Google Drive & Email
on_the_porch/data_ingestion/google_drive_to_vectordb.py, on_the_porch/data_ingestion/email_to_calendar_sql.py, on_the_porch/data_ingestion/dotnews_downloader/*
Google Drive sync to vector DB with file tracking and deletion handling; Gmail newsletter ingestion with LLM-based event extraction; Dorchester Reporter PDF downloader.
Data Ingestion: Infrastructure
on_the_porch/data_ingestion/config.py, on_the_porch/data_ingestion/utils/*, on_the_porch/data_ingestion/main_daily_ingestion.py, on_the_porch/data_ingestion/test_connections.py, on_the_porch/data_ingestion/build_vectordb.py
Centralized configuration loader; document processing (PDF, DOCX, TXT) and email parsing utilities; orchestration script for daily multi-source sync; connection testers.
Vector DB & RAG
on_the_porch/rag stuff/retrieval.py, on_the_porch/rag stuff/build_vectordb.py, on_the_porch/rag stuff/other scripts/demo_retrieval.py, on_the_porch/rag stuff/README.md
GeminiEmbeddings wrapper and Chroma retrieval with metadata/tag filtering; vector DB builder for policies, transcripts, and newsletters; two-step RAG demo with retrieval planning and answer generation.
SQL Chat & Unified Chatbot
on_the_porch/sql_chat/app3.py, on_the_porch/sql_chat/app4.py, on_the_porch/unified_chatbot.py
PostgreSQL query assistant (app3) and MySQL assistant (app4) with LLM-based SQL generation/refinement and Pocketflow orchestration; unified chatbot with question routing (SQL/RAG/hybrid), retrieval cache, and answer composition.
Calendar Events
on_the_porch/calendar/create_events_table.py, on_the_porch/calendar/ingest_events_llm.py, on_the_porch/calendar/trial.py
MySQL table creation; Gemini-based event extraction from PDFs with field normalization; trial script for page-level event parsing.
Metadata & Schema
on_the_porch/new_metadata/*, on_the_porch/api_readme.md
JSON metadata files for crime, 311, shootings, and weekly_events tables; metadata generators (from dump and live DB); SQL dump table reader; API v2.0 architecture documentation.
Legacy & Unused Modules
on_the_porch/unused/actual_eda/*, on_the_porch/unused/data_download/*, on_the_porch/unused/langgraph_stuff/*, on_the_porch/unused/pf/*, on_the_porch/unused/pocketflow_stuff/*, on_the_porch/rag stuff/other scripts/file_handling.py
Jupyter EDA notebook; legacy 911/crime data downloaders; LangGraph agent experiments; Pocketflow framework and SQL QA pipeline; PocketFlow orchestrator for NL→SQL.
Removed & Deprecated
experiment-PIT/public/data/map_2.geojson
Deleted GeoJSON file containing location features (no longer used).

Sequence Diagram(s)

sequenceDiagram
    actor User
    participant API as API v2
    participant Chatbot as Unified Chatbot
    participant Router as Question Router
    participant SQLChat as SQL Chat
    participant RAG as RAG Pipeline
    participant VectorDB as Vector DB
    participant MySQL as MySQL DB
    participant Gemini as Gemini LLM

    User->>API: POST /chat (message, history)
    API->>Chatbot: chat(question, history)
    Chatbot->>Router: _route_question(question)
    Router->>Gemini: Classify intent (SQL/RAG/Hybrid)
    Gemini-->>Router: mode decision
    
    alt SQL Mode
        Router->>SQLChat: _run_sql(question)
        SQLChat->>Gemini: Generate SELECT (schema+history)
        Gemini-->>SQLChat: SQL statement
        SQLChat->>MySQL: Execute SQL
        MySQL-->>SQLChat: Results
        SQLChat->>Gemini: Generate natural answer
        Gemini-->>SQLChat: Answer text
        SQLChat-->>Router: {answer, mode}
    else RAG Mode
        Router->>RAG: _run_rag(question, plan)
        RAG->>VectorDB: retrieve(query, filters)
        VectorDB-->>RAG: chunks + metadata
        RAG->>Gemini: Compose answer(chunks, history)
        Gemini-->>RAG: Answer with sources
        RAG-->>Router: {answer, sources, mode}
    else Hybrid Mode
        Router->>SQLChat: _run_sql(question)
        Router->>RAG: _run_rag(question, plan)
        SQLChat-->>Router: SQL results
        RAG-->>Router: RAG results
        Router->>Gemini: Merge results
        Gemini-->>Router: Unified answer
        Router-->>Router: {answer, sources, mode}
    end
    
    Router-->>Chatbot: Result
    Chatbot->>API: Response with sources & mode
    API->>MySQL: log_interaction
    API-->>User: {response, sources, mode, session_id}
sequenceDiagram
    participant Scheduler as Daily Scheduler
    participant Orchestrator as main_daily_ingestion
    participant Dotnews as Dotnews Download
    participant Drive as Google Drive Sync
    participant Email as Email Newsletter Sync
    participant Boston as Boston Data Sync
    participant VectorDB as Vector DB Builder
    participant MySQL as MySQL

    Scheduler->>Orchestrator: Execute daily run
    
    rect rgb(200, 220, 255)
    Note over Orchestrator: Phase 1: Dotnews
    Orchestrator->>Dotnews: sync_dotnews_newsletters()
    Dotnews->>Dotnews: Download latest PDF
    Dotnews->>VectorDB: Store in processed registry
    Dotnews-->>Orchestrator: {downloaded, errors}
    end
    
    rect rgb(200, 235, 220)
    Note over Orchestrator: Phase 2: Google Drive
    Orchestrator->>Drive: sync_google_drive_to_vectordb()
    Drive->>Drive: List new files
    Drive->>VectorDB: Process & add documents
    Drive->>MySQL: Extract & store events
    Drive-->>Orchestrator: {files_processed, errors}
    end
    
    rect rgb(235, 220, 200)
    Note over Orchestrator: Phase 3: Email
    Orchestrator->>Email: sync_email_newsletters_to_sql()
    Email->>Email: Fetch recent newsletters
    Email->>Email: Extract events via Gemini
    Email->>MySQL: Insert weekly_events
    Email-->>Orchestrator: {processed, errors}
    end
    
    rect rgb(235, 200, 200)
    Note over Orchestrator: Phase 4: Boston Data
    Orchestrator->>Boston: BostonDataSyncer.sync_all()
    Boston->>Boston: Fetch from CKAN API
    Boston->>MySQL: Upsert records
    Boston->>MySQL: Create derivative tables
    Boston-->>Orchestrator: {synced, errors}
    end
    
    rect rgb(220, 200, 235)
    Note over Orchestrator: Phase 5: Vector DB
    Orchestrator->>VectorDB: build_vectordb()
    VectorDB->>VectorDB: Load policies, transcripts, newsletters
    VectorDB->>VectorDB: Embed via Gemini
    VectorDB->>VectorDB: Persist to Chroma
    VectorDB-->>Orchestrator: {built, documents}
    end
    
    Orchestrator->>Orchestrator: Log summary (JSONL)
    Orchestrator-->>Scheduler: Complete with status
sequenceDiagram
    participant App as Data Ingestion
    participant GDrive as Google Drive API
    participant PDF as PDF/Newsletter
    participant Gemini as Gemini LLM
    participant VectorDB as Vector DB
    participant SQL as MySQL

    App->>App: Load sync state

    rect rgb(220, 240, 255)
    Note over App: File Discovery & Deletion
    App->>GDrive: Get all current file IDs
    GDrive-->>App: {file_ids}
    App->>App: Compare with processed
    App->>VectorDB: Delete chunks for removed files
    end

    App->>GDrive: List new files (limit: MAX_FILES)
    GDrive-->>App: {files}

    loop For each new file
        App->>App: Download to temp dir
        alt File is Newsletter PDF
            App->>PDF: Extract pages
            loop For each page
                PDF->>Gemini: Extract events (JSON)
                Gemini-->>PDF: {events}
                PDF->>App: {event_list}
            end
            App->>SQL: insert_events_to_db(events)
            App->>App: Update processed_files
        else Other file type
            App->>App: process_file_to_documents()
            App->>VectorDB: Stage documents
        end
    end

    App->>VectorDB: Batch add documents
    App->>App: Save sync state
    App->>App: Cleanup temp files
    App-->>App: Return {files_processed, errors}

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Complexity Factors:

  • Scope: 150+ files added/modified across multiple subsystems (API, ingestion, RAG, SQL, scheduling)
  • Logic Density: Complex workflows in data ingestion (Boston sync, Google Drive/email sync, event extraction), multi-mode question routing in unified chatbot, and SQL generation/refinement with retry logic
  • Heterogeneity: Highly diverse changes—configuration files, Flask endpoints, data pipeline orchestration, LLM-based SQL generation, vector DB retrieval, email parsing, and scheduling scripts require separate reasoning for each module
  • Critical Paths: API endpoints handle user requests; data ingestion touches multiple external APIs (CKAN, Google Drive, Gmail, Gemini); unified chatbot is the central query interface
  • External Dependencies: Heavy reliance on Gemini API, multiple cloud services (Google Drive/Gmail), third-party data portals (Boston CKAN)

Areas Requiring Extra Attention:

  • API v2 (api/api_v2.py, api/test_api_v2.py): Session management, per-session retrieval caching, middleware authentication, error handling in endpoints
  • Data Ingestion Orchestration (on_the_porch/data_ingestion/main_daily_ingestion.py): Coordination of 4+ async data sources, error aggregation, state persistence
  • SQL Generation & Refinement (on_the_porch/sql_chat/app4.py): LLM prompt engineering, retry logic with incremental refinement, Dorchester-specific filtering constraints, column/table existence validation
  • Question Routing & Unified Chatbot (on_the_porch/unified_chatbot.py): Mode classification logic, cache context building, hybrid result merging, conversation history handling
  • Boston Data Sync (on_the_porch/data_ingestion/boston_data_sync/boston_data_sync.py): Incremental sync logic, upsert semantics, derivative table creation and filtering, CKAN API pagination and error recovery
  • Google Drive & Email Sync (on_the_porch/data_ingestion/google_drive_to_vectordb.py, email_to_calendar_sql.py): OAuth credential refresh, rate limiting, event extraction via LLM, sync state consistency
  • Vector DB & RAG (on_the_porch/rag stuff/retrieval.py, build_vectordb.py): GeminiEmbeddings wrapping, metadata filtering logic (tags, doc_type, source), document chunking strategy
  • Environment & Configuration Loading: Multiple entry points for env loading (.env at root, per-module .env.example); verify consistency and initialization order, especially in api/api.py and on_the_porch/unified_chatbot.py

Possibly related PRs

  • PR #6: Directly overlapping changes to the same files (api_v2.py, api/.env.example, api/test_api_v2.py, data_ingestion modules, unified_chatbot.py, README, .gitignore)—highly related implementation of the same features.
  • PR #5: Overlapping changes to vector DB, data ingestion, RAG modules, and unified chatbot—related data integration and retrieval infrastructure.
  • PR #3: Related calendar event ingestion changes (on_the_porch/calendar/ files) and .gitignore entries for pocEnv—overlapping event processing functionality.

Pre-merge checks and finishing touches

❌ Failed checks (1 inconclusive)
  • Title check (❓ Inconclusive): The title 'Final PR' is vague and generic, using non-descriptive phrasing that does not convey meaningful information about the changeset's primary purpose. Resolution: Replace 'Final PR' with a specific, concise title that highlights the main change, such as 'Add comprehensive API v2, data ingestion, and documentation' or a similar descriptive phrase.
✅ Passed checks (1 passed)
  • Description Check (✅ Passed): Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch dev

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.


Comment @coderabbitai help to get the list of available commands and usage tips.

@bsurya27
Collaborator Author

bsurya27 commented Dec 10, 2025

We've added a demo folder for testing our main agent framework. Running the ingestion pipelines requires additional Google credentials set up under a separate Google account.

Instructions for setting up and running the demo are in the README file.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

♻️ Duplicate comments (1)
api/env.example (1)

1-46: Duplication with api/.env.example noted.

See comment on api/.env.example regarding consolidation of environment example files.

🟠 Major comments (21)
on_the_porch/rag stuff/other scripts/file_handling.py-34-40 (1)

34-40: Ensure output directory exists and add error handling.

The code writes to Data/boston_budget_projects.txt without checking if the Data/ directory exists, which will raise a FileNotFoundError. Additionally, there's no error handling for write failures.

Apply this diff to create the directory and handle errors:

+from pathlib import Path
+
-def budget_to_text():
+def budget_to_text(input_path='../../api/datastore/budget_filtered.txt', output_dir='Data'):
     """Convert budget CSV data into narrative text for vector DB."""
     
     # ... (earlier code)
     
     # Write to text file
-    output_file = 'Data/boston_budget_projects.txt'
+    output_path = Path(output_dir)
+    output_path.mkdir(parents=True, exist_ok=True)
+    output_file = output_path / 'boston_budget_projects.txt'
+    
-    with open(output_file, 'w', encoding='utf-8') as f:
-        f.write('\n'.join(text_entries))
+    try:
+        with open(output_file, 'w', encoding='utf-8') as f:
+            f.write('\n'.join(text_entries))
+    except Exception as e:
+        print(f"Error writing output: {e}")
+        sys.exit(1)
     
     print(f"Converted {len(text_entries)} projects to text format")
     print(f"Output saved to: {output_file}")

Committable suggestion skipped: line range outside the PR's diff.

on_the_porch/rag stuff/other scripts/file_handling.py-7-9 (1)

7-9: Add error handling and make the input path configurable.

The hard-coded relative path ../../api/datastore/budget_filtered.txt is fragile and will break if the script is executed from a different working directory. Additionally, there's no error handling for common I/O failures (file not found, permission denied, etc.).

Apply this diff to add error handling and path flexibility:

+import sys
+from pathlib import Path
+
-def budget_to_text():
+def budget_to_text(input_path='../../api/datastore/budget_filtered.txt', output_dir='Data'):
     """Convert budget CSV data into narrative text for vector DB."""
     
+    input_file = Path(input_path)
+    if not input_file.exists():
+        print(f"Error: Input file not found: {input_file}")
+        sys.exit(1)
+    
     # Read the CSV data
-    with open('../../api/datastore/budget_filtered.txt', 'r', encoding='utf-8') as f:
+    try:
+        with open(input_file, 'r', encoding='utf-8') as f:
-        reader = csv.DictReader(f)
-        rows = list(reader)
+            reader = csv.DictReader(f)
+            rows = list(reader)
+    except Exception as e:
+        print(f"Error reading CSV: {e}")
+        sys.exit(1)

Committable suggestion skipped: line range outside the PR's diff.

on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-102-109 (1)

102-109: Add error handling for JSON parsing.

If the LLM returns malformed JSON or unexpected format, json.loads(result) will raise a JSONDecodeError and crash the pipeline. Consider wrapping this in a try/except with a descriptive error message.

-        # Parse JSON
-        queries = json.loads(result)
+        # Parse JSON
+        try:
+            queries = json.loads(result)
+        except json.JSONDecodeError as e:
+            raise ValueError(f"Failed to parse LLM response as JSON: {result[:200]}") from e
+        
+        # Validate expected keys
+        if "answer_query" not in queries:
+            raise ValueError("LLM response missing 'answer_query' key")
on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-44-48 (1)

44-48: SQL injection vulnerability in table name.

The table_name parameter is interpolated directly into the SQL command without sanitization. If table_name originates from user input, an attacker could inject malicious SQL (e.g., ); DROP TABLE users; --).

Consider validating the table name against an allowlist or using a regex to ensure it contains only valid identifier characters:

+import re
+
 def get_schema(table_name):
     """Get table schema"""
+    # Validate table name to prevent SQL injection
+    if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', table_name):
+        raise ValueError(f"Invalid table name: {table_name}")
     engine = create_engine(DB_URL)
     with engine.connect() as conn:

Committable suggestion skipped: line range outside the PR's diff.

.gitignore-108-108 (1)

108-108: Service account key was previously exposed in git history—verify credential rotation immediately.

Git history confirms that ontheporch-ebc5a60fc214.json was committed in bcfebda061edba49b8cfec0559bdc8d3d6977ef4. While the .gitignore addition prevents future exposure, the key has already been exposed in the repository history. Verify that this Google service account key has been rotated or revoked in the GCP console. If not already done, rotate the key and clean the git history using tools like git-filter-branch or bfg to remove all traces.

on_the_porch/unused/pocketflow_stuff/agent.py-85-115 (1)

85-115: Make _query_analysis robust to non‑JSON LLM output

Right now, any deviation from perfect JSON (very plausible in practice) will raise json.JSONDecodeError and bring down the request.

Consider a defensive parse with a safe fallback, e.g.:

-        text = completion.choices[0].message.content or "{}"
-        parsed = json.loads(text)
+        raw_text = completion.choices[0].message.content or "{}"
+        try:
+            parsed = json.loads(raw_text)
+        except json.JSONDecodeError:
+            # Fallback: treat as empty analysis so downstream logic still runs.
+            parsed = {}

The rest of the normalization (intent whitelist, defaulting entities/date_ranges/summary_flag/limit) can stay as is, but this avoids hard failures when the model violates the contract.

on_the_porch/unused/pocketflow_stuff/agent.py-127-160 (1)

127-160: Preserve and surface schema_mismatch instead of reclassifying as not_single_select

The prompt tells the model to return {"error":"schema_mismatch"} when the schema can’t answer the query, but this branch is never distinguished from other failures. That JSON blob is treated as SQL text, fails _is_single_select, and gets mapped to error: "not_single_select".

You can detect and handle the structured error before running the single‑SELECT validation, for example:

-        sql_text = (completion.choices[0].message.content or "").strip().strip("`")
-
-        is_valid = self._is_single_select(sql_text)
-        payload: Dict[str, Any] = {
-            "request_id": request.request_id,
-            "schema_hash": metadata.get("schema_hash"),
-            "sql_text": sql_text,
-            "is_valid_single_select": is_valid,
-        }
-        if not is_valid:
-            payload["error"] = "not_single_select"
-        return payload
+        sql_text = (completion.choices[0].message.content or "").strip().strip("`")
+
+        # Allow the model to signal an explicit schema mismatch via JSON.
+        if sql_text.lstrip().startswith("{"):
+            try:
+                error_obj = json.loads(sql_text)
+            except json.JSONDecodeError:
+                error_obj = {}
+            if error_obj.get("error") == "schema_mismatch":
+                return {
+                    "request_id": request.request_id,
+                    "schema_hash": metadata.get("schema_hash"),
+                    "sql_text": None,
+                    "is_valid_single_select": False,
+                    "error": "schema_mismatch",
+                }
+
+        is_valid = self._is_single_select(sql_text)
+        payload: Dict[str, Any] = {
+            "request_id": request.request_id,
+            "schema_hash": metadata.get("schema_hash"),
+            "sql_text": sql_text,
+            "is_valid_single_select": is_valid,
+        }
+        if not is_valid:
+            payload["error"] = "not_single_select"
+        return payload

This keeps your single‑SELECT guard while allowing downstream consumers to distinguish “bad SQL” from “schema cannot answer”.

on_the_porch/sql_chat/app3.py-116-141 (1)

116-141: Potential SQL injection in _get_unique_values.

The table_name and column_name parameters are interpolated directly into the SQL query using f-strings. While these values come from internal SQL parsing rather than direct user input, they originate from LLM-generated SQL which could potentially include malicious identifiers.

Consider using psycopg2.sql for safer identifier quoting:

+from psycopg2 import sql

 def _get_unique_values(table_name: str, column_name: str, schema: str = "public", limit: int = 50) -> List[Any]:
     """Get unique values from a column to help users see available options."""
     conn = _get_db_connection()
     try:
         # Set a statement timeout to prevent hanging (5 seconds)
         with conn.cursor() as cur:
             cur.execute("SET statement_timeout = '5s'")
-            cur.execute(
-                f'''
-                SELECT DISTINCT "{column_name}"
-                FROM "{schema}"."{table_name}"
-                WHERE "{column_name}" IS NOT NULL
-                ORDER BY "{column_name}"
-                LIMIT %s
-                ''',
-                (limit,),
-            )
+            query = sql.SQL('''
+                SELECT DISTINCT {col}
+                FROM {schema}.{table}
+                WHERE {col} IS NOT NULL
+                ORDER BY {col}
+                LIMIT %s
+            ''').format(
+                col=sql.Identifier(column_name),
+                schema=sql.Identifier(schema),
+                table=sql.Identifier(table_name),
+            )
+            cur.execute(query, (limit,))
             rows = cur.fetchall()
on_the_porch/sql_chat/app4.py-188-210 (1)

188-210: SQL injection vulnerability via string interpolation.

The _get_unique_values function constructs SQL using f-string interpolation for column_name and table_name. While limit is parameterized, the column and table names are directly interpolated, which could allow SQL injection if these values ever come from untrusted sources.

Apply this diff to use parameterized identifiers or validate inputs:

 def _get_unique_values(table_name: str, column_name: str, schema: str = "public", limit: int = 50) -> List[Any]:
     """Get unique values from a column to help users see available options."""
+    # Validate identifiers to prevent SQL injection
+    if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', table_name):
+        return []
+    if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', column_name):
+        return []
     conn = _get_db_connection()
     try:
         with conn.cursor() as cur:
             cur.execute(
                 f"""
                 SELECT DISTINCT `{column_name}`
                 FROM `{table_name}`
                 WHERE `{column_name}` IS NOT NULL
                 ORDER BY `{column_name}`
                 LIMIT %s
                 """,
                 (limit,),
             )

Committable suggestion skipped: line range outside the PR's diff.

example_env.txt-47-60 (1)

47-60: Align MySQL env var names with api.Config.DB_CONFIG and demo defaults

Right now there are two mismatches:

  1. Name mismatch with api/api.py:

    • api.Config.DB_CONFIG expects DB_HOST, DB_USER, DB_PASSWORD, DB_NAME.
    • This template only defines MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD, MYSQL_DB.
      If a user follows the demo README and only populates this root .env, api.py will see user=None, password=None, database=None and MySQL connections will fail.
  2. Value mismatch with the demo README / docker-compose demo:

    • The README and demo/docker-compose.demo.yml describe a demo DB with demo_user / demo_pass / sentiment_demo, but the sample here still uses root / rethink_ai_boston.

To keep things consistent and make the quickstart work out of the box, I’d recommend:

  • Updating the MySQL section to match the demo DB values, and
  • Adding DB_* aliases so api.py works with this unified template.

For example:

-# ============================================================================
-# MySQL Database Configuration (311, 911, events)
-# ============================================================================
-MYSQL_HOST=127.0.0.1
-MYSQL_PORT=3306
-MYSQL_USER=root
-MYSQL_PASSWORD=your-mysql-password
-MYSQL_DB=rethink_ai_boston
+# ============================================================================
+# MySQL Database Configuration (311, 911, events)
+#
+# These defaults match the demo Docker MySQL instance in demo/docker-compose.demo.yml.
+#
+ # Primary variables used by ingestion and newer components
+MYSQL_HOST=localhost
+MYSQL_PORT=3306
+MYSQL_USER=demo_user
+MYSQL_PASSWORD=demo_pass
+MYSQL_DB=sentiment_demo
+
+# Backwards‑compat: api/api.py expects DB_* names. Keep these in sync with MYSQL_*.
+DB_HOST=${MYSQL_HOST}
+DB_USER=${MYSQL_USER}
+DB_PASSWORD=${MYSQL_PASSWORD}
+DB_NAME=${MYSQL_DB}

(If you prefer not to rely on ${VAR} expansion, you can duplicate the literal values instead.)

This keeps the env story coherent for both the legacy API and the newer ingestion / demo setup.

on_the_porch/calendar/ingest_events_llm.py-168-201 (1)

168-201: Missing explicit commit after database inserts.

The connection is closed in the finally block, but there's no explicit conn.commit(). Depending on app4._get_db_connection()'s autocommit setting, inserts may be lost.

             for ev in events:
                 cur.execute(
                     ...
                 )
+        conn.commit()
         print(f"Inserted {len(events)} events into weekly_events.")
     finally:
         conn.close()
on_the_porch/unused/data_download/download_911_data.py-96-118 (1)

96-118: Redundant data download - entire dataset fetched twice.

get_911_shots_fired_data() calls get_boston_911_data() internally to fetch all records. When the main block calls both functions, the full dataset is downloaded twice. Refactor to accept the DataFrame as a parameter, similar to filter_shots_fired_data in download_crime_data.py.

-def get_911_shots_fired_data():
+def get_911_shots_fired_data(df=None, filepath=None):
     """
     Get specifically shots fired incidents
     """
     print("🔍 Fetching shots fired data...")
     
-    # Get all crime data first
-    df, filepath = get_boston_911_data()
+    # Get all crime data first if not provided
+    if df is None:
+        df, filepath = get_boston_911_data()
     
     if df is not None:

And update the main block:

     if df_all is not None:
         print("\n" + "=" * 40)
         print("🔫 Downloading shots fired data specifically...")
-        df_shots, filepath_shots = get_911_shots_fired_data()
+        df_shots, filepath_shots = get_911_shots_fired_data(df_all, filepath_all)

Also applies to: 125-130

api/api_v2.py-113-114 (1)

113-114: Hardcoded secret key is insecure.

The default SECRET_KEY value "agent-api-secret-2025" should not be used in production. The session cookie security depends on this being truly secret.

Consider failing startup if no secret key is configured in production:

-    SECRET_KEY = os.getenv("FLASK_SECRET_KEY", "agent-api-secret-2025")
+    SECRET_KEY = os.getenv("FLASK_SECRET_KEY")
+    if not SECRET_KEY:
+        raise ValueError("FLASK_SECRET_KEY environment variable must be set")

Alternatively, generate a random key at startup (though this breaks sessions across restarts):

SECRET_KEY = os.getenv("FLASK_SECRET_KEY") or os.urandom(32).hex()
api/api_v2.py-587-589 (1)

587-589: Same issue: internal error details leaked in /events endpoint.

Apply the same fix as suggested for /chat to avoid exposing internal errors.

     except Exception as e:
         print(f"Error in /events: {e}")
-        return jsonify({"error": f"Failed to fetch events: {str(e)}"}), 500
+        return jsonify({"error": "Failed to fetch events"}), 500
api/api_v2.py-152-158 (1)

152-158: Overly permissive CORS configuration allows all origins.

Setting origins: "*" allows any website to make authenticated requests to this API. For a production API with API key authentication, restrict origins to known frontend domains.

 CORS(
     app,
     supports_credentials=True,
     expose_headers=["RethinkAI-API-Key"],
-    resources={r"/*": {"origins": "*"}},
+    resources={r"/*": {"origins": os.getenv("ALLOWED_ORIGINS", "http://localhost:3000").split(",")}},
     allow_headers=["Content-Type", "RethinkAI-API-Key"],
 )
api/api_v2.py-456-458 (1)

456-458: Avoid leaking internal error details to clients.

Returning str(e) exposes internal implementation details (stack traces, database errors) to API consumers. Log the full error server-side but return a generic message to clients.

     except Exception as e:
         print(f"Error in /chat: {e}")
-        return jsonify({"error": f"Internal server error: {str(e)}"}), 500
+        import traceback
+        traceback.print_exc()
+        return jsonify({"error": "Internal server error"}), 500
api/api_v2.py-125-137 (1)

125-137: Connection pool creation at module import time lacks error handling.

If the database is unavailable when api/api_v2.py is imported, the MySQLConnectionPool constructor will raise an exception, preventing the application from starting. Wrap the pool creation (lines 129-137) in a try-except block or implement lazy initialization to prevent import failures.
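A minimal lazy-initialization sketch, assuming the pool is built from MYSQL_* environment variables; the helper name and settings below are illustrative, not the actual api_v2.py code:

import os
from mysql.connector import pooling, Error

_pool = None  # created on first use instead of at import time

def get_pool():
    """Lazily create the MySQL connection pool (hypothetical helper)."""
    global _pool
    if _pool is None:
        try:
            _pool = pooling.MySQLConnectionPool(
                pool_name="api_v2_pool",
                pool_size=5,
                host=os.getenv("MYSQL_HOST", "localhost"),
                user=os.getenv("MYSQL_USER"),
                password=os.getenv("MYSQL_PASSWORD"),
                database=os.getenv("MYSQL_DB"),
            )
        except Error as e:
            # Fail at request time with a clear message instead of at import time.
            raise RuntimeError(f"Could not initialize MySQL pool: {e}") from e
    return _pool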

on_the_porch/data_ingestion/main_daily_ingestion.py-155-160 (1)

155-160: Operator precedence bug in error calculation.

The ternary operator has lower precedence than addition, causing incorrect calculation. When dotnews_stats is None, the entire sum becomes 0.

     # Calculate overall success
     errors = (
         len(drive_stats.get('errors', [])) + 
         len(email_stats.get('errors', [])) +
-        (sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0) +
-        len(dotnews_stats.get('errors', [])) if dotnews_stats else 0
+        (sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0) +
+        (len(dotnews_stats.get('errors', [])) if dotnews_stats else 0)
     )
on_the_porch/data_ingestion/main_daily_ingestion.py-73-76 (1)

73-76: Bare except should specify exception type.

The bare except clause can mask unexpected errors during JSON parsing. Specify the expected exception.

         if processed_file.exists():
             try:
                 processed_files = json.loads(processed_file.read_text())
-            except:
+            except (json.JSONDecodeError, IOError):
                 processed_files = {}
on_the_porch/data_ingestion/build_vectordb.py-344-357 (1)

344-357: Mixing add_documents and from_documents may cause inconsistent behavior.

When the vector DB exists, add_documents is called which appends to existing data. But there's no deduplication, so running build_vectordb() multiple times will create duplicate entries.

Consider either:

  1. Clearing the existing DB before rebuilding
  2. Implementing deduplication based on document IDs/hashes (see the sketch after this comment's diff)
  3. Adding a --rebuild flag to control behavior
     if VECTORDB_DIR.exists():
-        print(f"Adding documents to existing vector database at {VECTORDB_DIR}")
+        print(f"⚠️ Vector database exists at {VECTORDB_DIR}")
+        print("   Use --rebuild flag to recreate, or documents will be appended (may create duplicates)")
         vectordb = Chroma(
             persist_directory=str(VECTORDB_DIR),
             embedding_function=embeddings,
         )
+        # Consider: vectordb.delete(where={})  # Clear existing before adding
         vectordb.add_documents(all_documents)
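A rough sketch of option 2 above (content-hash document IDs), assuming the langchain Chroma wrapper in use exposes get() and accepts ids= in add_documents(); the helper below is illustrative, not existing project code:

import hashlib
from langchain_core.documents import Document

def filter_already_indexed(vectordb, documents: list[Document]):
    """Drop documents whose content hash already exists in the Chroma store."""
    ids = [hashlib.md5(doc.page_content.encode("utf-8")).hexdigest() for doc in documents]
    existing = set(vectordb.get(ids=ids).get("ids", []))
    new_docs, new_ids = [], []
    for doc, doc_id in zip(documents, ids):
        if doc_id not in existing:
            new_docs.append(doc)
            new_ids.append(doc_id)
            existing.add(doc_id)  # also dedupes repeats within the same batch
    return new_docs, new_ids

# Then, instead of calling vectordb.add_documents(all_documents) directly:
#   new_docs, new_ids = filter_already_indexed(vectordb, all_documents)
#   if new_docs:
#       vectordb.add_documents(new_docs, ids=new_ids)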
on_the_porch/data_ingestion/boston_data_sync/boston_data_sync.py-386-392 (1)

386-392: SQL injection risk via dynamic table/column names.

Table names and column names are interpolated directly into SQL queries from configuration. While these values come from config files (not direct user input), if the config is ever exposed to untrusted sources, this could be exploited.

Consider validating table/column names against an allowlist or using identifier quoting utilities:

+import re
+
+def _validate_identifier(name: str) -> str:
+    """Validate and sanitize SQL identifiers."""
+    if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', name):
+        raise ValueError(f"Invalid SQL identifier: {name}")
+    return name
+
 # Then use:
-cursor.execute(f"SHOW TABLES LIKE '{table_name}'")
+cursor.execute("SHOW TABLES LIKE %s", (table_name,))

Committable suggestion skipped: line range outside the PR's diff.

🟡 Minor comments (31)
on_the_porch/unused/actual_eda/instrutctions.md-1-1 (1)

1-1: Fix the filename typo.

The filename contains a typo: instrutctions.md should be instructions.md.

on_the_porch/unused/actual_eda/instrutctions.md-42-46 (1)

42-46: Reduce repetitive sentence structure.

Three successive sentences begin with "Distribution," which impacts readability. Vary the phrasing for better clarity.

Apply this diff to improve sentence variety:

-Distribution of incidents by district (bar chart).
+Distribution of incidents by district (bar chart).
-Distribution of incidents by day of week (bar chart, ordered from Monday to Sunday).
+Incidents by day of week (bar chart, ordered from Monday to Sunday).
-Distribution of incidents by hour of day (bar chart).
+Incidents by hour of day (bar chart).
on_the_porch/unused/pf/create_db.py-31-31 (1)

31-31: Remove unnecessary f-string prefix.

The f-string has no placeholders.

-print(f"\nDatabase created successfully!")
+print("\nDatabase created successfully!")
on_the_porch/unused/pf/create_db.py-38-38 (1)

38-38: Remove unnecessary f-string prefix.

The f-string has no placeholders.

-print(f"\nTable schema:")
+print("\nTable schema:")
on_the_porch/data_ingestion/boston_data_sync/test_record_ordering.py-112-120 (1)

112-120: Fix logic error in ordering detection.

The conditional logic has overlapping conditions. When first_record_date == last_record_date, line 112 evaluates to true (descending branch), but line 115's condition <= also includes equality. Since line 115 is an elif, it won't execute, but the logic is confusing and technically incorrect for the equality case.

When all records have the same date, it shouldn't be classified as "descending" but rather as "no ordering" or "indeterminate".

Apply this diff to fix the logic:

         # Determine ordering
         print(f"\n[ORDERING] Analysis:")
-        if first_record_date >= last_record_date:
+        if first_record_date == last_record_date:
+            print("   [WARNING] All records have the same date")
+            print("   [WARNING] Cannot determine ordering - need to fetch all records")
+        elif first_record_date > last_record_date:
             print("   [OK] Records appear to be in DESCENDING order (newest first)")
             print("   [TIP] We can optimize by fetching only the first N records for incremental syncs!")
-        elif first_record_date <= last_record_date:
+        else:  # first_record_date < last_record_date
             print("   [OK] Records appear to be in ASCENDING order (oldest first)")
             print("   [WARNING] We need to fetch all records or use offset to get latest records")
-        else:
-            print("   [WARNING] Records appear to be in RANDOM order")
-            print("   [WARNING] We need to fetch all records and filter")
on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-394-394 (1)

394-394: Typo in sample question.

"hwo many" should be "how many".

-    question = "hwo many parking enforcement requests were made in May 2025?"
+    question = "how many parking enforcement requests were made in May 2025?"
on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-275-280 (1)

275-280: Latitude column detection logic may incorrectly match.

The condition if 'lat' in col_lower and not lon_col will skip assigning lat_col if lon_col was already found. This seems like a bug—the logic should be and not lat_col to avoid overwriting an already-found latitude column.

         for col in df.columns:
             col_lower = col.lower()
-            if 'lat' in col_lower and not lon_col:
+            if 'lat' in col_lower and not lat_col:
                 lat_col = col
             if 'lon' in col_lower or 'lng' in col_lower:
                 lon_col = col
on_the_porch/unused/pf/simple_sql_qa_pocketflow.py-403-403 (1)

403-403: Add version compatibility check before calling _run.

flow._run(shared) at line 403 assumes _run exists, but different versions of pocketflow may have different APIs. The production code in app4.py and app3.py handles this with:

if hasattr(flow, "_run"):
    flow._run(shared)
else:
    flow.run(shared)

Apply the same pattern here with a comment noting version compatibility.

.gitignore-15-16 (1)

15-16: Fix incorrect directory name on line 88: use "rag stuff" instead of "rag_stuff".

The actual filesystem contains on_the_porch/rag stuff/ (with a space). Line 15 correctly references this directory, but line 88 uses rag_stuff (with underscore), which does not exist. Update line 88 from on_the_porch/rag_stuff/data/* to on_the_porch/rag stuff/data/* to match the actual directory structure.

on_the_porch/unused/pocketflow_stuff/agent.py-1-18 (1)

1-18: Fix RUF002 by replacing the en dash in the module docstring

Ruff is correctly flagging the en dash here; you can keep the wording and swap in a plain hyphen to satisfy lint.

-"""
-Pocket Flow – Minimal NL → SQL Orchestrator
+"""
+Pocket Flow - Minimal NL → SQL Orchestrator
dataset-documentation/README.md-113-118 (1)

113-118: Clarify geospatial asset format and 311 normalized_type column in docs

Two small consistency issues to consider tightening up:

  • Lines 115‑118: Geospatial Community Assets list the Format as “GeoJSON files” but the Location path is api/datastore/geocoding-community-assets.csv. If the source is actually CSV, adjust the format; if it’s really GeoJSON, update the path/extension so they align.
  • Lines 216‑221: The SQL example uses normalized_type from boston_311, but that column isn’t described in the 311 schema section above. Either add normalized_type to the schema description or change the example query to use whichever column is actually present.

These are minor but worth fixing so the dataset documentation stays a reliable reference.

Also applies to: 216-221

demo/setup_windows.bat-34-37 (1)

34-37: Fix venv activation check to use activate.bat on Windows

On Windows, python -m venv creates three activation scripts: activate.bat (cmd.exe), Activate.ps1 (PowerShell), and a bare activate script (POSIX shells). The if exist check does not apply path extension resolution (PATHEXT), so:

if exist ".venv_demo\Scripts\activate" (
  call ".venv_demo\Scripts\activate"
)

will not find the file and skip activation. Update to explicitly check for activate.bat:

rem Activate the demo virtual environment
if exist ".venv_demo\Scripts\activate.bat" (
  call ".venv_demo\Scripts\activate.bat"
)

This ensures the venv activates reliably in cmd.exe at the end of the script.

README.md-82-84 (1)

82-84: Inconsistent environment file naming.

The documentation references different environment file names:

  • Line 83: "Copy .env.example files to .env"
  • Line 113-115: "Copy example_env.txt to .env"

Please standardize on one naming convention to avoid confusion.

Also applies to: 111-116

on_the_porch/api_readme.md-421-529 (1)

421-529: Clarify that at least one API key is mandatory for the API to function.

The documentation correctly states authentication is "mandatory," but it should explicitly clarify that at least one API key must always be configured in RETHINKAI_API_KEYS. The current wording—"Authentication is controlled by the RETHINKAI_API_KEYS environment variable"—could imply authentication can be disabled by omitting the variable. In practice, the API rejects all requests if no keys are configured. Update the documentation to state clearly: "At least one API key must be configured in RETHINKAI_API_KEYS for the API to accept any requests."
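For illustration only, a sketch of the documented behavior (not the actual api_v2.py middleware), assuming a comma-separated key list: with RETHINKAI_API_KEYS unset, the key set is empty and every request fails the check.

import os
from flask import Flask, request, jsonify

app = Flask(__name__)

# Empty set when RETHINKAI_API_KEYS is unset, so no request can authenticate.
API_KEYS = {k.strip() for k in os.getenv("RETHINKAI_API_KEYS", "").split(",") if k.strip()}

@app.before_request
def require_api_key():
    if request.headers.get("RethinkAI-API-Key") not in API_KEYS:
        return jsonify({"error": "Unauthorized"}), 401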

on_the_porch/data_ingestion/boston_data_sync/test_311_resources.py-20-55 (1)

20-55: Narrow the broad except Exception to more specific error types

The outer loop currently catches a blind Exception (flagged by Ruff BLE001), which can hide genuine programming errors:

except Exception as e:
    print(f"{year}: {resource_id}")
    print(f"  ERROR: {e}")

Given this is all network + JSON handling, be more precise:

except requests.RequestException as e:
    print(f"{year}: {resource_id}")
    print(f"  HTTP ERROR: {e}")
except ValueError as e:
    print(f"{year}: {resource_id}")
    print(f"  JSON ERROR: {e}")

Also consider adding response.raise_for_status() after requests.get() to catch HTTP errors explicitly rather than relying on the try-except.

on_the_porch/new_metadata/bos311_data.json-1327-1334 (1)

1327-1334: Malformed decimal data type definitions.

The latitude and longitude columns have incomplete data types ("decimal(18" instead of "decimal(18,8)" or similar). This will cause issues if this metadata is used for schema generation or validation.

     "latitude": {
-      "data_type": "decimal(18",
+      "data_type": "decimal(18,8)",
       "is_numeric": true
     },
     "longitude": {
-      "data_type": "decimal(18",
+      "data_type": "decimal(18,8)",
       "is_numeric": true
     },
on_the_porch/data_ingestion/boston_data_sync/reset_database.py-91-94 (1)

91-94: Validate table names to prevent SQL injection.

Table names are read from a config file and interpolated directly into SQL queries. While backticks provide some protection, a malicious config file could still inject SQL. Consider validating that table names match an allowed pattern.

+import re
+
+TABLE_NAME_RE = re.compile(r'^[a-zA-Z_][a-zA-Z0-9_]*$')
+
+def validate_table_name(name: str) -> bool:
+    """Validate table name contains only safe characters."""
+    return bool(TABLE_NAME_RE.match(name)) and len(name) <= 64

Then use before querying:

for table in all_tables:
    if not validate_table_name(table):
        print(f"   ⚠️ Skipping invalid table name: {table}")
        continue
    cursor.execute(f"SHOW TABLES LIKE '{table}'")
on_the_porch/unused/data_download/download_911_data.py-88-89 (1)

88-89: Missing column existence checks can cause KeyError.

Unlike download_crime_data.py, these columns are accessed without checking if they exist first. If the API response is missing these columns, this will raise a KeyError.

         # Print data info
         print(f"📈 Data shape: {df.shape}")
-        print(f"📅 Date range: {df['OCCURRED_ON_DATE'].min()} to {df['OCCURRED_ON_DATE'].max()}")
-        print(f"🏘️ Districts: {df['DISTRICT'].unique()}")
+        if 'OCCURRED_ON_DATE' in df.columns:
+            print(f"📅 Date range: {df['OCCURRED_ON_DATE'].min()} to {df['OCCURRED_ON_DATE'].max()}")
+        if 'DISTRICT' in df.columns:
+            print(f"🏘️ Districts: {df['DISTRICT'].unique()}")
on_the_porch/data_ingestion/README.md-421-428 (1)

421-428: Inconsistent security guidance: app-specific passwords vs OAuth.

Line 426 advises "Use app-specific passwords for email" but line 104 states "Gmail now requires OAuth 2.0 authentication (app-specific passwords were discontinued in November 2025)." Remove or update the outdated guidance.

 ⚠️ **Important:**
 - Never commit `.env` file to git (it's in `.gitignore`)
 - Keep service account JSON file secure
-- Use app-specific passwords for email (not your main password)
+- Keep OAuth credentials (`gmail_credentials.json`, `gmail_token.json`) secure
 - Regularly rotate credentials
 - Use read-only permissions where possible
on_the_porch/data_ingestion/boston_data_sync/find_boston_resource_id.py-57-68 (1)

57-68: Bare except clause loses exception context.

The bare except: on line 67 silently swallows all exceptions including KeyboardInterrupt and SystemExit. Use except Exception: at minimum to preserve interrupt behavior.

     try:
         response = requests.get(url, params=params, timeout=10)
         response.raise_for_status()
         data = response.json()
         return data.get('success', False)
-    except:
+    except Exception:
         return False
on_the_porch/new_metadata/generate_mysql_metadata_live.py-122-127 (1)

122-127: SQL injection in SHOW TABLES query.

Line 124 uses f-string interpolation for table_name. While TARGET_TABLES is hardcoded, use parameterized query for consistency and safety.

     # Check if table exists
     cursor = conn.cursor()
-    cursor.execute(f"SHOW TABLES LIKE '{table_name}'")
+    cursor.execute("SHOW TABLES LIKE %s", (table_name,))
     if not cursor.fetchone():
         print(f"   WARNING: Table '{table_name}' does not exist, skipping")
         return None
on_the_porch/unused/data_download/import_911_to_mysql.py-244-253 (1)

244-253: Potential issue: getctime behavior varies by OS.

On Unix systems, os.path.getctime returns inode change time, not creation time. Use os.path.getmtime for consistent behavior across platforms.

-    latest_file = max(csv_files, key=os.path.getctime)
+    latest_file = max(csv_files, key=os.path.getmtime)
on_the_porch/data_ingestion/main_daily_ingestion.py-181-244 (1)

181-244: Dotnews stats not included in total_errors calculation.

The print_final_summary function calculates total_errors without including dotnews_stats, but then references dotnews errors in the error details output.

     drive_errors = len(drive_stats.get('errors', []))
     email_errors = len(email_stats.get('errors', []))
     boston_errors = sum(len(d.get('errors', [])) for d in boston_stats.get('datasets', [])) if boston_stats else 0
-    total_errors = drive_errors + email_errors + boston_errors
+    dotnews_errors = len(dotnews_stats.get('errors', [])) if dotnews_stats else 0
+    total_errors = drive_errors + email_errors + boston_errors + dotnews_errors
on_the_porch/data_ingestion/boston_data_sync/boston_data_sync.py-528-553 (1)

528-553: Statistics tracking logic may cause double-counting.

The try block at lines 530-550 attempts to track new vs updated records, but if it succeeds, it updates total_inserted and total_updated. Then cursor.executemany runs and commits. However, if executemany fails after the stats tracking succeeds, the stats won't be rolled back. Also, if the stats query itself fails silently (caught at line 548), it falls back to assuming all are inserts, which could misrepresent actual database state.

Consider moving the stats calculation to be a post-operation verification or simplifying to just report affected rows from cursor.rowcount.
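A minimal sketch of the rowcount-based alternative, assuming the batch statement is an INSERT ... ON DUPLICATE KEY UPDATE (MySQL reports 1 affected row per inserted record, 2 per updated record, and 0 for unchanged rows, so the split is an estimate); the helper below is illustrative, not existing project code:

def upsert_batch(conn, upsert_sql, batch):
    """Run one upsert batch and estimate inserted/updated counts from rowcount."""
    cursor = conn.cursor()
    cursor.executemany(upsert_sql, batch)
    conn.commit()
    affected = cursor.rowcount  # 1 per insert, 2 per update, 0 if unchanged
    cursor.close()
    updated = max(affected - len(batch), 0)
    inserted = len(batch) - updated
    return {"inserted": inserted, "updated": updated}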

on_the_porch/data_ingestion/utils/document_processor.py-84-86 (1)

84-86: Unreliable .doc file handling.

The comment acknowledges this won't work for all .doc files since python-docx only supports the newer .docx format. Old binary .doc files will likely raise an exception or produce garbage output.

Consider either:

  1. Removing .doc support and raising an explicit error
  2. Using a library like antiword or textract that properly handles legacy .doc format
  3. Adding a clearer warning in the exception when .doc parsing fails
 elif ext == '.doc':
-    # For old .doc files, try DOCX library (may not work for all)
-    return extract_text_from_docx(file_path)
+    raise ValueError(
+        f"Legacy .doc format not fully supported. "
+        f"Please convert {file_path.name} to .docx format."
+    )
on_the_porch/data_ingestion/build_vectordb.py-13-13 (1)

13-13: Import path may fail depending on execution context.

from retrieval import GeminiEmbeddings assumes retrieval.py is in the Python path. This will fail if the script is run from a different directory or as a module.

Consider using relative imports or adding the directory to path:

+import sys
+sys.path.insert(0, str(Path(__file__).parent.parent / "rag stuff"))
 from retrieval import GeminiEmbeddings

Or restructure as a proper package with __init__.py files.

Committable suggestion skipped: line range outside the PR's diff.

on_the_porch/rag stuff/retrieval.py-6-7 (1)

6-7: Relative path may break depending on current working directory.

VECTORDB_DIR = Path("../vectordb_new") is relative to the current working directory, not the script location. This will fail if the script is executed from a different directory.

+_THIS_FILE = Path(__file__).resolve()
+_MODULE_DIR = _THIS_FILE.parent
-VECTORDB_DIR = Path("../vectordb_new")
+VECTORDB_DIR = _MODULE_DIR.parent / "vectordb_new"
on_the_porch/data_ingestion/build_vectordb.py-228-232 (1)

228-232: Fragile JSON parsing fallback may still fail.

The fallback at line 231 (safe_text = text.replace("\\", "\\\\")) is a naive attempt to fix escape issues, but this could corrupt valid JSON or still fail on other malformed responses. If both json.loads attempts fail, the exception propagates uncaught.

Add comprehensive error handling:

     try:
         data = json.loads(text)
     except json.JSONDecodeError:
-        safe_text = text.replace("\\", "\\\\")
-        data = json.loads(safe_text)
+        try:
+            safe_text = text.replace("\\", "\\\\")
+            data = json.loads(safe_text)
+        except json.JSONDecodeError as e:
+            print(f"   ⚠️ Failed to parse Gemini response as JSON: {e}")
+            return []
on_the_porch/data_ingestion/google_drive_to_vectordb.py-559-559 (1)

559-559: Remove extraneous f prefix from static strings.

Several f-strings have no placeholders and should be regular strings. Example at line 559:

-                        print(f"    ⚠ Skipping event with no name")
+                        print("    ⚠ Skipping event with no name")

Similar fixes needed at lines 645, 655, and 828.

on_the_porch/data_ingestion/google_drive_to_vectordb.py-211-218 (1)

211-218: Duplicate dictionary key: 'may' appears twice.

The month_map dictionary has 'may': '05' on both line 213 (lowercase from full months) and line 216 (from short months). The second assignment overwrites the first with the same value, but this is a code smell that could cause issues if values diverge.

         month_map = {
             'january': '01', 'february': '02', 'march': '03', 'april': '04',
             'may': '05', 'june': '06', 'july': '07', 'august': '08',
             'september': '09', 'october': '10', 'november': '11', 'december': '12',
             'jan': '01', 'feb': '02', 'mar': '03', 'apr': '04',
-            'may': '05', 'jun': '06', 'jul': '07', 'aug': '08',
+            'jun': '06', 'jul': '07', 'aug': '08',
             'sep': '09', 'oct': '10', 'nov': '11', 'dec': '12'
         }
on_the_porch/data_ingestion/google_drive_to_vectordb.py-819-819 (1)

819-819: Unused variable expected_chunks.

This variable is assigned but never used. Remove it to clean up the code.

         for file_id in deleted_file_ids:
             file_name = processed_files[file_id].get('name', 'unknown')
-            expected_chunks = processed_files[file_id].get('chunks', 0)
             
             try:

Comment on lines +629 to +630

app.run(host=Config.HOST, port=Config.PORT, debug=True)

⚠️ Potential issue | 🔴 Critical

Do not deploy with debug=True in production.

The Flask debug mode enables the interactive debugger which can lead to remote code execution if exposed. Use an environment variable to control this.

-    app.run(host=Config.HOST, port=Config.PORT, debug=True)
+    app.run(host=Config.HOST, port=Config.PORT, debug=os.getenv("FLASK_DEBUG", "false").lower() == "true")

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 ast-grep (0.40.0)

[warning] 629-629: Detected Flask app with debug=True. Do not deploy to production with this flag enabled as it will leak sensitive information. Instead, consider using Flask configuration variables or setting 'debug' using system environment variables.
Context: app.run(host=Config.HOST, port=Config.PORT, debug=True)
Note: [CWE-489] Active Debug Code. [REFERENCES]
- https://labs.detectify.com/2015/10/02/how-patreon-got-hacked-publicly-exposed-werkzeug-debugger/

(debug-enabled-python)

🪛 Ruff (0.14.8)

630-630: Use of debug=True in Flask app detected

(S201)

🤖 Prompt for AI Agents
In api/api_v2.py around lines 629-630, the app.run call uses debug=True which
must not be enabled in production; change it to read a configuration/env var
(e.g., Config.DEBUG or os.getenv("FLASK_DEBUG", "false") converted to a boolean)
and pass that boolean to debug, and ensure Config.DEBUG is set from the
environment with a safe default of False; additionally, ensure production
deployments use a WSGI server (gunicorn/uvicorn) rather than app.run.
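
For reference, a minimal sketch of the env-driven pattern described above; FLASK_DEBUG, HOST, and PORT are illustrative variable names rather than the module's actual Config attributes:

import os

from flask import Flask

app = Flask(__name__)


def env_flag(name: str, default: str = "false") -> bool:
    """Treat '1', 'true', 'yes', or 'on' (case-insensitive) as enabled."""
    return os.getenv(name, default).strip().lower() in {"1", "true", "yes", "on"}


if __name__ == "__main__":
    # Debug stays off unless FLASK_DEBUG is explicitly set; production traffic
    # should be served by a WSGI server (gunicorn/uvicorn), not app.run().
    app.run(
        host=os.getenv("HOST", "127.0.0.1"),
        port=int(os.getenv("PORT", "8000")),
        debug=env_flag("FLASK_DEBUG"),
    )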

Comment on lines +24 to +25
*.json
!.env.example

⚠️ Potential issue | 🔴 Critical

Ineffective negation pattern: !.env.example is never matched by the preceding *.json rule.

Line 25 attempts to whitelist .env.example, but .env.example does not match the *.json pattern on line 24, so the negation has no effect and looks like a copy-paste error. If the intent is to keep specific JSON files tracked despite the *.json rule, use a negation that actually matches them, such as !.env.example.json or !config.json.

If the intent is to preserve .env.example as a template, it should be handled separately (e.g., .env.example should not match the *.json rule). If you intend to whitelist specific JSON files, clarify the pattern:

 # Google credentials
 *.json
-!.env.example
+!config.example.json  # or whichever JSON files should be tracked

Alternatively, if .env.example should be tracked and is not a JSON file, remove this line entirely since it's redundant.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-*.json
-!.env.example
+# Google credentials
+*.json
+!config.example.json
🤖 Prompt for AI Agents
In on_the_porch/data_ingestion/.gitignore around lines 24-25, the negation
pattern `!.env.example` is ineffective because the preceding `*.json` rule does
not match `.env.example`; either remove the negation if you simply want
`.env.example` tracked (it’s redundant), or replace it with a correct whitelist
for JSON files (e.g., `!.env.example.json` or `!config.json`) if you intended to
allow specific JSON files—update the file to use the appropriate pattern or
delete the unnecessary line.
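
One quick way to confirm which rule actually wins is the standard git check-ignore -v command; the sketch below shells out to it from the repository root. The credentials.json filename is purely illustrative:

import subprocess
import sys


def explain_ignore(path: str) -> None:
    """Print the gitignore rule (if any) that causes `path` to be ignored."""
    result = subprocess.run(
        ["git", "check-ignore", "-v", path],
        capture_output=True,
        text=True,
    )
    if result.returncode == 0:
        # Output format: <source>:<line>:<pattern>\t<path>
        print(f"ignored: {result.stdout.strip()}")
    elif result.returncode == 1:
        print(f"not ignored: {path}")
    else:
        print(result.stderr.strip(), file=sys.stderr)


explain_ignore("on_the_porch/data_ingestion/.env.example")
explain_ignore("on_the_porch/data_ingestion/credentials.json")  # hypothetical path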

Comment on lines +14 to +27
def connect_to_mysql():
    """Connect to MySQL database"""
    try:
        connection = mysql.connector.connect(
            host='localhost',
            user='rethinkai_user',
            password='MySecureUserPass123!',
            database='rethinkai_db'
        )
        print("✅ Connected to MySQL database")
        return connection
    except Error as e:
        print(f"❌ Error connecting to MySQL: {e}")
        return None

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded database credentials must be removed.

Hardcoded credentials in source code are a serious security risk. These can be exposed via version control history even if removed later. Use environment variables instead.

+import os
+
 def connect_to_mysql():
     """Connect to MySQL database"""
     try:
         connection = mysql.connector.connect(
-            host='localhost',
-            user='rethinkai_user',
-            password='MySecureUserPass123!',
-            database='rethinkai_db'
+            host=os.getenv('MYSQL_HOST', 'localhost'),
+            user=os.getenv('MYSQL_USER'),
+            password=os.getenv('MYSQL_PASSWORD'),
+            database=os.getenv('MYSQL_DB')
         )
+        if not os.getenv('MYSQL_USER') or not os.getenv('MYSQL_DB'):
+            raise ValueError("MYSQL_USER and MYSQL_DB environment variables required")
         print("✅ Connected to MySQL database")
         return connection
     except Error as e:
         print(f"❌ Error connecting to MySQL: {e}")
         return None

Immediate action required: Rotate the exposed credentials (rethinkai_user / MySecureUserPass123!) as they are now in version control history.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-def connect_to_mysql():
-    """Connect to MySQL database"""
-    try:
-        connection = mysql.connector.connect(
-            host='localhost',
-            user='rethinkai_user',
-            password='MySecureUserPass123!',
-            database='rethinkai_db'
-        )
-        print("✅ Connected to MySQL database")
-        return connection
-    except Error as e:
-        print(f"❌ Error connecting to MySQL: {e}")
-        return None
+import os
+
+def connect_to_mysql():
+    """Connect to MySQL database"""
+    try:
+        connection = mysql.connector.connect(
+            host=os.getenv('MYSQL_HOST', 'localhost'),
+            user=os.getenv('MYSQL_USER'),
+            password=os.getenv('MYSQL_PASSWORD'),
+            database=os.getenv('MYSQL_DB')
+        )
+        if not os.getenv('MYSQL_USER') or not os.getenv('MYSQL_DB'):
+            raise ValueError("MYSQL_USER and MYSQL_DB environment variables required")
+        print("✅ Connected to MySQL database")
+        return connection
+    except Error as e:
+        print(f"❌ Error connecting to MySQL: {e}")
+        return None
🧰 Tools
🪛 ast-grep (0.40.0)

[warning] 16-21: The application creates a database connection with an empty password. This can lead to unauthorized access by either an internal or external malicious actor. To prevent this vulnerability, enforce authentication when connecting to a database by using environment variables to securely provide credentials or retrieving them from a secure vault or HSM (Hardware Security Module).
Context: mysql.connector.connect(
    host='localhost',
    user='rethinkai_user',
    password='MySecureUserPass123!',
    database='rethinkai_db'
)
Note: [CWE-287] Improper Authentication. [REFERENCES]
- https://cheatsheetseries.owasp.org/cheatsheets/Secrets_Management_Cheat_Sheet.html

(python-mysql-hardcoded-secret-python)

🪛 Ruff (0.14.8)

20-20: Possible hardcoded password assigned to argument: "password"

(S106)


24-24: Consider moving this statement to an else block

(TRY300)

🤖 Prompt for AI Agents
In on_the_porch/unused/data_download/import_911_to_mysql.py around lines 14 to
27, the database username and password are hardcoded; replace them with values
read from environment variables (e.g., MYSQL_HOST, MYSQL_USER, MYSQL_PASSWORD,
MYSQL_DATABASE), optionally loading a .env in development, and fail fast with a
clear error if required env vars are missing; update the connect_to_mysql
function to use os.environ.get (or a config helper) instead of literals, and
ensure you rotate the exposed credentials immediately and remove them from the
repo/history.
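
As a complement, a minimal sketch of the env-based approach, assuming the MYSQL_HOST / MYSQL_USER / MYSQL_PASSWORD / MYSQL_DB names from the suggested diff and an optional python-dotenv helper; unlike the suggested diff, it validates the variables before attempting to connect:

import os

import mysql.connector
from mysql.connector import Error
from dotenv import load_dotenv  # optional helper: pip install python-dotenv

load_dotenv()  # development convenience; .env must stay out of version control


def connect_to_mysql():
    """Connect to MySQL using credentials supplied via environment variables."""
    missing = [name for name in ("MYSQL_USER", "MYSQL_PASSWORD", "MYSQL_DB") if not os.getenv(name)]
    if missing:
        # Fail fast with a clear message instead of attempting a doomed connection.
        raise RuntimeError(f"Missing required environment variables: {', '.join(missing)}")
    try:
        return mysql.connector.connect(
            host=os.getenv("MYSQL_HOST", "localhost"),
            user=os.getenv("MYSQL_USER"),
            password=os.getenv("MYSQL_PASSWORD"),
            database=os.getenv("MYSQL_DB"),
        )
    except Error as exc:
        print(f"❌ Error connecting to MySQL: {exc}")
        return None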


# from postgres_stuff.test import DB_URL

DB_URL = "postgresql://user1:BbWTihWnsBHglVpeKK8XfQgEPDOcokZZ@dpg-d3g661u3jp1c73eg9v1g-a.ohio-postgres.render.com/crime_rate_h3u5"

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded database credentials exposed in source code.

The PostgreSQL connection string contains plaintext username and password. This is a severe security vulnerability—credentials committed to version control are permanently exposed in git history, even if removed later.

Move credentials to environment variables:

-DB_URL = "postgresql://user1:BbWTihWnsBHglVpeKK8XfQgEPDOcokZZ@dpg-d3g061u3jp1c73eg9v1g-a.ohio-postgres.render.com/crime_rate_h3u5"
+DB_URL = os.getenv("DATABASE_URL")
+if not DB_URL:
+    raise RuntimeError("DATABASE_URL not found in environment")

Immediate action required: Rotate this database password since it's now exposed.

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-DB_URL = "postgresql://user1:BbWTihWnsBHglVpeKK8XfQgEPDOcokZZ@dpg-d3g661u3jp1c73eg9v1g-a.ohio-postgres.render.com/crime_rate_h3u5"
+DB_URL = os.getenv("DATABASE_URL")
+if not DB_URL:
+    raise RuntimeError("DATABASE_URL not found in environment")
🤖 Prompt for AI Agents
In on_the_porch/unused/langgraph_stuff/agent.py around line 15 the PostgreSQL
connection string is hardcoded including plaintext username and password; remove
the literal DB_URL, read the connection details from environment variables
(e.g., os.environ or a config loader) and construct the URL at runtime, ensure
any local .env usage is documented but not committed, add the config key name to
repo docs/secrets guide, add the file change to git and rotate the exposed
database password immediately to invalidate the leaked credentials.
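
A small sketch of that pattern, assuming a single DATABASE_URL variable per the suggestion above; the PG* fallback names are hypothetical and only illustrate building the URL at runtime:

import os
from urllib.parse import quote_plus


def get_db_url() -> str:
    """Read the PostgreSQL URL from DATABASE_URL, or assemble it from parts."""
    url = os.getenv("DATABASE_URL")
    if url:
        return url
    # Fallback assembly; quote_plus keeps special characters in passwords URL-safe.
    user = os.getenv("PGUSER")
    password = os.getenv("PGPASSWORD")
    host = os.getenv("PGHOST")
    dbname = os.getenv("PGDATABASE")
    if not all((user, password, host, dbname)):
        raise RuntimeError("Set DATABASE_URL or PGUSER/PGPASSWORD/PGHOST/PGDATABASE")
    return f"postgresql://{quote_plus(user)}:{quote_plus(password)}@{host}/{dbname}"


DB_URL = get_db_url()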

Comment on lines +21 to +27
os.environ["LANGCHAIN_TRACING_V2"] = "true"
# Replace with your actual LangSmith API key and endpoint
os.environ["LANGSMITH_API_KEY"] = "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"
os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
# Optional but recommended for organization
os.environ["LANGCHAIN_PROJECT"] = "spark_eda-sql-agent"
os.environ["LANGCHAIN_SESSION"] = "local-dev"

⚠️ Potential issue | 🔴 Critical

Critical: Hardcoded LangSmith API key exposed in source code.

The API key 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 is committed to the repository. This key should be revoked and rotated immediately.

Load these values from environment variables or .env file instead:

 os.environ["LANGCHAIN_TRACING_V2"] = "true"
-os.environ["LANGSMITH_API_KEY"] = "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"
-os.environ["LANGSMITH_ENDPOINT"] = "https://api.smith.langchain.com"
+# These should already be set via .env or environment
+# os.environ["LANGSMITH_API_KEY"] = os.getenv("LANGSMITH_API_KEY")
 os.environ["LANGCHAIN_PROJECT"] = "spark_eda-sql-agent"
 os.environ["LANGCHAIN_SESSION"] = "local-dev"

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In on_the_porch/unused/langgraph_stuff/agent.py around lines 21 to 27, remove
the hardcoded LangSmith API key and endpoint values and instead read these
settings from environment variables or a .env file; revoke and rotate the
exposed key immediately, replace the literal assignments with
os.environ.get("LANGSMITH_API_KEY") and os.environ.get("LANGSMITH_ENDPOINT")
(with sensible defaults or error checks), and if desired load a local .env via
python-dotenv at process startup while ensuring .env is added to .gitignore and
no secrets are committed.
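
A hedged sketch of that setup, reusing the non-secret project and session names from the quoted block and leaving tracing disabled when no key is present in the environment:

import os

from dotenv import load_dotenv  # optional helper: pip install python-dotenv

load_dotenv()

api_key = os.environ.get("LANGSMITH_API_KEY")
if api_key:
    # The key itself comes from the environment; only non-secret defaults live in code.
    os.environ["LANGCHAIN_TRACING_V2"] = "true"
    os.environ["LANGSMITH_ENDPOINT"] = os.environ.get("LANGSMITH_ENDPOINT", "https://api.smith.langchain.com")
    os.environ["LANGCHAIN_PROJECT"] = os.environ.get("LANGCHAIN_PROJECT", "spark_eda-sql-agent")
    os.environ["LANGCHAIN_SESSION"] = os.environ.get("LANGCHAIN_SESSION", "local-dev")
else:
    # Without a key, disable tracing rather than failing at import time.
    os.environ["LANGCHAIN_TRACING_V2"] = "false"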

typing_extensions==4.15.0
tzdata==2025.2
uritemplate==4.2.0
urllib3==2.5.0

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# First, verify the file exists and check line 110
head -n 114 on_the_porch/unused/pf/requirements.txt | tail -n 10

Repository: BU-Spark/ml-misi-community-sentiment

Length of output: 216


🏁 Script executed:

# Get the exact content at line 110
sed -n '110p' on_the_porch/unused/pf/requirements.txt

Repository: BU-Spark/ml-misi-community-sentiment

Length of output: 94


🌐 Web query:

urllib3 2.5.0 GHSA-2xpw-w6gg-jr37 vulnerability CVE

💡 Result:

Short answer: GHSA-2xpw-w6gg-jr37 is a high-severity streaming/decompression vulnerability in urllib3 (published Dec 5, 2025). It can cause excessive resource consumption when handling highly‑compressed HTTP responses; affected versions are >=1.0 and <2.6.0. The issue is tracked as CVE-2025-66471 (and related CVE-2025-66418 for a similar decompression/resource issue). It was fixed in urllib3 2.6.0 — upgrade to >=2.6.0 (or disable decoding/streaming until patched). [1][2][3]

Sources:


🌐 Web query:

urllib3 2.5.0 GHSA-gm62-xv2j-4w53 vulnerability CVE

💡 Result:

Yes — GHSA-gm62-xv2j-4w53 (CVE-2025-66418) affects urllib3 2.5.0. Summary:

  • Vulnerability: urllib3 allowed an unbounded number of links in the HTTP response decompression chain, enabling a malicious server to cause high CPU and large memory allocations.
  • Affected versions: >= 1.24 and < 2.6.0 (includes 2.5.0).
  • Fix: upgrade to urllib3 >= 2.6.0 (limits chain length to 5).
  • Temporary mitigations: disable automatic decoding (e.g., use preload_content=False) and validate resp.headers["content-encoding"] before reading.
    Sources: GitHub security advisory [1], OSV / vulnerability database [2], NVD/Tenable summaries [3].

References
[1] GitHub security advisory GHSA-gm62-xv2j-4w53 (urllib3).
[2] OSV / osv.dev entry for GHSA-gm62-xv2j-4w53 (CVE-2025-66418).
[3] Tenable / NVD advisory summary for CVE-2025-66418.


Upgrade urllib3 from 2.5.0 to patch HIGH severity vulnerabilities.

urllib3 2.5.0 contains two confirmed HIGH severity vulnerabilities affecting streaming decompression: GHSA-2xpw-w6gg-jr37 (CVE-2025-66471) and GHSA-gm62-xv2j-4w53 (CVE-2025-66418).

Both are fixed in urllib3 2.6.0. Upgrade immediately:

-urllib3==2.5.0
+urllib3==2.6.0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-urllib3==2.5.0
+urllib3==2.6.0
🤖 Prompt for AI Agents
In on_the_porch/unused/pf/requirements.txt around line 110, urllib3 is pinned to
2.5.0 which contains HIGH severity vulnerabilities; update the pin to
urllib3==2.6.0 to pull in the fixes, then regenerate any dependency
lock/constraints file (pip-compile, poetry.lock, or requirements.txt.build) and
run the test suite and a dependency audit (e.g., pip-audit or safety) to ensure
no regressions or incompatible transitive dependencies; if any package conflicts
arise, resolve them by updating the conflicting package or adding an explicit
compatible urllib3 constraint and document the change in the PR.
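
Until the pin is bumped, here is a minimal sketch of the temporary mitigation noted in the advisory summary above (skip automatic decoding and check Content-Encoding before reading). The target URL is a placeholder, and this is no substitute for upgrading to 2.6.0:

import urllib3

# Warn if the installed build predates the fix.
if tuple(int(p) for p in urllib3.__version__.split(".")[:2]) < (2, 6):
    print(f"urllib3 {urllib3.__version__} is older than 2.6.0; upgrade when possible.")

http = urllib3.PoolManager()
# preload_content=False / decode_content=False skip automatic decompression,
# so a maliciously compressed response is not expanded unbounded here.
resp = http.request("GET", "https://example.com/", preload_content=False, decode_content=False)
encoding = resp.headers.get("content-encoding", "").lower()
if encoding not in ("", "gzip", "deflate"):
    resp.release_conn()
    raise ValueError(f"Unexpected content-encoding: {encoding}")
raw_body = resp.read(decode_content=False)  # still compressed; decompress deliberately if needed
resp.release_conn()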
