Backend (Codebase Overview)
Here's a comprehensive overview of the DiscoveryPlus-Backend repository:
What Is It?
Discoveries+ is a patent discovery and analysis platform. Its backend is a REST API that lets researchers search and filter patent data, manage research projects, generate AI-powered search queries from natural language, and embed analytics dashboards.
Technology Stack
| Layer | Technology |
|---|---|
| Framework | FastAPI (Python 3.9+) |
| ASGI Server | Uvicorn / Gunicorn |
| ORM / DB | SQLModel + SQLAlchemy (async via aiomysql) |
| App DB | MySQL (users, projects, roles, sessions) |
| Search/OLAP DB | StarRocks (patent data — PATSTAT/EPO) |
| Auth | JWT (PyJWT) + CAS/OAuth2 SSO (UAB university SSO) |
| LLM Integration | LangChain + Azure OpenAI (gpt-4o), Ollama (local fallback) |
| Dashboard | Apache Superset (embed integration) |
| Migrations | Alembic |
| Packaging | Poetry / uv |
| Containerization | Docker |
| Testing | pytest + httpx |
Directory Structure
DiscoveryPlus-Backend/
├── main.py # App entry point — creates and runs the FastAPI app
├── pyproject.toml # Dependencies and project config
├── Dockerfile # Container image (Python 3.12 slim + Poetry)
├── alembic.ini # Alembic migrations config
│
├── app/
│ ├── setup_app.py # FastAPI app factory (CORS, routers, OpenAPI tags)
│ ├── router.py # Central API router — mounts all sub-routers
│ │
│ ├── core/
│ │ ├── config.py # Settings (Pydantic BaseSettings, reads from .env)
│ │ ├── auth.py # JWT creation & get_current_user dependency
│ │ ├── api_response.py # Uniform ApiResponse envelope for all endpoints
│ │ └── logger.py # Logging setup
│ │
│ ├── db/
│ │ ├── session.py # SQLAlchemy async sessions (MySQL) + StarRocks engine
│ │ ├── connection_manager.py # StarRocks query executor with in-memory caching
│ │ ├── base.py # SQLModel declarative base
│ │ └── init_db.py # DB initialization helpers
│ │
│ ├── models/ # SQLModel table definitions + Pydantic schemas
│ │ ├── user_model.py # User table
│ │ ├── role_model.py # Role table
│ │ ├── project_model.py # Research project table
│ │ ├── patent_models.py # Patent result Pydantic model
│ │ ├── sql_session_model.py # Saved SQL session table
│ │ ├── requests/ # Request body schemas (auth, users, patents, sessions)
│ │ └── responses/ # Response schemas
│ │
│ ├── routes/ # FastAPI routers (one file per domain)
│ │ ├── auth_route.py # CAS SSO login, JWT refresh, dev-login
│ │ ├── users_route.py # User CRUD, CSV bulk import
│ │ ├── roles_route.py # Role management
│ │ ├── project_route.py # Project CRUD, sharing, chart data, XLSX import
│ │ ├── patent_route.py # Patent retrieval, add/remove, export
│ │ ├── db_route.py # Direct StarRocks search queries + header stats
│ │ ├── filters_route.py # Dynamic faceted filters (facets for queries/projects)
│ │ ├── llm_route.py # LLM prompt → structured patent query
│ │ ├── dashboard_route.py # Superset embed URLs, tokens, chart metadata
│ │ └── sql_session_route.py # Saved query session CRUD
│ │
│ ├── services/ # Business logic layer (one per domain)
│ │ ├── auth_service.py # CAS OAuth2 flow, token generation
│ │ ├── query_service.py # SQL query builder/executor for StarRocks
│ │ ├── header_service.py # Aggregated search statistics
│ │ ├── patent_service.py # Patent-to-project associations
│ │ ├── project_service.py # Project management logic
│ │ ├── filters_service.py # Faceted filter computation
│ │ ├── llm_service.py # Azure OpenAI / Ollama integration
│ │ ├── dashboard_service.py # Superset API calls & JWT embed tokens
│ │ ├── user_service.py # User DB operations
│ │ ├── role_service.py # Role DB operations
│ │ ├── sql_session_service.py # SQL session persistence
│ │ └── cache.py # In-memory query cache
│ │
│ ├── utils/
│ │ └── rison.py # Rison (URL-safe JSON) encoder for Superset
│ │
│ └── llm/ # LLM context files (loaded at startup)
│ ├── system_message.md # System prompt template
│ ├── tables_schema.sql # StarRocks schema injected into prompt
│ ├── elastic_queries_format.sql # Query format examples
│ ├── elastic_query_examples.sql # Query examples for few-shot prompting
│ └── sql_results_examples.sql # Example SQL results for context
│
├── migrations/ # Alembic migration scripts (MySQL)
├── tests/ # pytest tests (auth, users, roles, services)
├── docker/ # Docker helper scripts (wait-for-db.sh)
├── queries/ # Standalone SQL reference queries
├── views/ # SQL views / materialized view definitions
└── documentation/ # Additional docs
Key Architectural Patterns
-
Two databases: MySQL (via async SQLAlchemy) stores app data (users, projects, roles, sessions). StarRocks (via a synchronous SQLAlchemy engine) stores the patent OLAP data and is queried directly with raw SQL.
-
Layer separation: Every domain follows the same route → service → db stack. Routes handle HTTP, services handle business logic, and the DB layer handles persistence/queries.
-
Uniform API envelope: Every response is wrapped in
ApiResponsewithmessage,data,status_code, andstatus_messagefields. -
Auth: UAB university CAS SSO OAuth2 is the production auth flow. JWT tokens (HS256) are issued after SSO login. A dev-login endpoint allows token generation without CAS for development.
-
LLM-powered search: Users can submit a natural language prompt to
/api/v1/llm/prompt. The backend sends it to Azure OpenAI (gpt-4o) with a detailed system prompt (including the DB schema and query format examples), and the LLM returns a structured SQL CTE that is then executed against StarRocks. Ollama is available as a local fallback. -
Query caching:
ConnectionManagerwraps StarRocks queries with an in-memory cache (keyed by query hash + parameters) to avoid redundant OLAP queries. -
Dashboard embedding: The backend acts as a proxy to Apache Superset, generating scoped embed URLs and JWT guest tokens so the frontend can embed dashboards in iframes.