DatabaseCreation
Introduction
The database schema is implemented in PostgreSQL using the DDL from DDL_manual.sql. This represents the final, refined version of the relational model after addressing tool limitations and incorporating feedback from professor consultations.
Key Design Decisions
Primary Key Strategy
Two distinct primary key strategies were employed:
- Surrogate keys for excessively long composite natural keys - Entities that would inherit composite keys from multiple parent entities (potentially 5‑20 attributes) receive a single id column with
GENERATED ALWAYS AS IDENTITY. The natural key columns are retained withUNIQUE NOT NULLconstraints to preserve semantic integrity without propagating keys further down the chain.
- Surrogate keys as genuine identifiers – For many‑to‑many relationships where the foreign key pair alone does not uniquely identify a record (e.g.,
Exam_Attemptwhere a student may attempt the same exam multiple times), a surrogateidis added because no natural key exists.
GENERATED ALWAYS AS IDENTITY vs SERIAL
All surrogate keys use INTEGER GENERATED ALWAYS AS IDENTITY instead of SERIAL for the following reasons:
| Feature | SERIAL | GENERATED ALWAYS AS IDENTITY |
| SQL Standard | Non‑standard (PostgreSQL specific) | SQL Standard compliant |
| Under the Hood | Creates a separate SEQUENCE object implicitly | Links a SEQUENCE directly to the column |
| Accidental Overwrites | Easy to accidentally bypass or break | Strictly protected (unless forced) |
| Permissions | Requires separate permissions on the sequence | Permissions are tied directly to the table |
SERIAL is syntactic sugar that creates a sequence and sets the column’s DEFAULT to fetch from it. Because it is just a default, PostgreSQL does not prevent manual value insertion. GENERATED ALWAYS AS IDENTITY enforces strict safety rules – the database is solely responsible for generating IDs. If overriding is absolutely necessary (e.g., during data migration), OVERRIDING SYSTEM VALUE can be explicitly specified.
This choice ensures portability across enterprise database engines (Oracle, DB2, SQL Server) that support the SQL:2003 standard.
Constraint Naming Convention
Constraints follow a consistent naming pattern:
- Primary Keys:
pk_{table_name} - Foreign Keys:
fk_{table_name}_{referenced_table} - Not Null:
nn_{table_name}_{attribute} - Unique:
uq_{table_name}_{attribute(s)} - Check:
ck_{table_name}_{constraint_description}
Example:
CONSTRAINT pk_exercise PRIMARY KEY(id) CONSTRAINT fk_exercise_course_edition_id FOREIGN KEY(course_edition_id) REFERENCES Course_Edition(id) CONSTRAINT nn_exercise_title CHECK(title IS NOT NULL) CONSTRAINT uq_semester_enrollment_peidays UNIQUE(program_enrollment_id, academic_year, semester) CONSTRAINT ck_exercise_max_grade CHECK(max_grade > 0)
ON DELETE SET NULL for Historical Data Preservation
All foreign keys use ON DELETE SET NULL ON UPDATE CASCADE to preserve historical data. When a referenced record is deleted, dependent records remain in the database with a NULL foreign key, maintaining the audit trail. This is critical for academic systems where grade records, exam attempts, and submissions must be preserved even if the associated course edition or student record is removed.
Enum for Exam Types
Exam types are defined as a PostgreSQL ENUM type:
CREATE TYPE EXAM_TYPE AS ENUM (
'partial_1',
'partial_2',
'session_jan',
'session_jun',
'session_sep'
);
PostgreSQL stores only a 4‑byte numeric OID per value rather than the literal string. In contrast, a TEXT column with a CHECK constraint would store variable‑length strings (10‑12 bytes per value).
Data Population
Data Generation Approach
The database was populated using a combination of:
- Local Large Language Models with Tool Calling – CSV data was generated programmatically, producing realistic academic data including member names, course codes, enrollment patterns, and grade distributions.
- Python’s faker Library – Supplemented data generation for realistic names, dates, and other attributes where a more controlled generator was needed.
Database Size
The final database contains approximately 13 GB of data across 30 tables. The largest tables are:
| Table | Row Count | Total Size | Data Size | Index Size |
| student_answer | 26,574,600 | 4,230 MB | 3,194 MB | 1,036 MB |
| member_message | 19,999,100 | 3,113 MB | 2,512 MB | 602 MB |
| exam_attempt | 8,858,161 | 853 MB | 509 MB | 345 MB |
| survey_response | 8,447,940 | 1,079 MB | 551 MB | 528 MB |
| exercise_submission | 7,353,124 | 1,617 MB | 1,327 MB | 290 MB |
| semester_enrollment | 4,387,348 | 373 MB | 185 MB | 188 MB |
| course_enrollment | 2,214,813 | 217 MB | 94 MB | 123 MB |
| student_grade | 1,087,533 | 77 MB | 46 MB | 31 MB |
| member | 1,050,000 | 173 MB | 105 MB | 69 MB |
| program_enrollment | 1,049,789 | 118 MB | 55 MB | 62 MB |
| student | 1,000,000 | 110 MB | 42 MB | 67 MB |
...and 20 additional tables
Total Database Size: 13 GB
The large volume of data ensures that performance testing and query optimization exercises reflect realistic production conditions.
Trimming the Database
Initially, the database grew to approximately 63 GB. During a laboratory session, it was reduced to its current 13 GB size to ensure manageable operations while maintaining sufficient data volume for meaningful performance analysis. The database remained local throughout the project after disk space constraints were addressed.
Views Overview
Non‑Analytical Views
The system includes several non‑analytical views (no big COUNT, SUM, MIN, MAX aggregates) for operational queries:
v_student_current_semester– Student’s current semester course enrollmentsv_student_upcoming_deadlines– Upcoming exercises and examsv_student_transcript– Complete academic transcriptv_course_announcements– Course announcements with threadingv_exercise_grading_queue– Pending and graded submissionsv_course_prerequisite_chain– Recursive prerequisite treev_student_registration_state– Course registration eligibility (later replaced by a function)v_teaching_assistant_eligibility– TA eligibility for course editions (later replaced by a function)
Analytical Views
Additional analytical views (with aggregates) are provided as extras:
v_course_edition_overview– Enrollment and teacher statisticsv_survey_status– Survey response counts and capacityv_teacher_dashboard– Teacher activity metricsv_exam_results_summary– Exam statistics and pass ratesv_member_inbox– Message thread summariesv_program_catalog– Program statisticsv_course_catalog– Course offering historyv_exam_attempt_detail– Detailed exam attempt analysisv_program_completion_status– Graduation eligibility tracking
Attachments (5)
-
DDL_manual.sql
(22.7 KB
) - added by 4 days ago.
Manually written DDL.
-
DDL_OSDM.sql
(30.5 KB
) - added by 4 days ago.
DDL generated automatically from the diagrams in Oracle SQL Developer Data Modeler.
-
DDL_VP.sql
(14.1 KB
) - added by 4 days ago.
DDL generated automatically from the diagrams in Visual Paradigm.
-
pure_views.sql
(16.9 KB
) - added by 4 days ago.
Non-analytical VIEWs.
-
analytical_views.sql
(19.4 KB
) - added by 4 days ago.
Analytical VIEWs.
Download all attachments as: .zip
