== Database Creation

The database is designed as a relational PostgreSQL system that models a full-featured news platform, supporting users, journalists, articles, interactions, and moderation workflows. It follows a normalized structure with clearly defined relationships between entities such as users, roles, articles, categories, and comments. Core features include a role-based access control (RBAC) system, hierarchical content organization (categories and threaded comments), and support for advanced journalism workflows such as fact-checking, source attribution, and article versioning. The schema enforces data integrity through primary and foreign keys, ensuring that all relationships, such as articles belonging to authors or comments referencing parent comments, remain consistent.

To simulate a realistic production environment, the database is populated using a modular data generation system implemented in Python. Each table has a dedicated generator responsible for producing synthetic data, coordinated by a central script that ensures correct insertion order and referential integrity. The generators use configurable parameters (such as the number of users, articles, and interactions) defined in a shared configuration file, allowing the dataset to scale to millions of records. Realism is achieved through randomized but controlled data generation (e.g., weighted probabilities for statuses, timestamps within valid ranges, and structured category hierarchies), often leveraging libraries such as Faker to generate human-like content.

The seeding process is optimized for performance and scalability through batch inserts and bulk operations, minimizing database overhead when inserting large volumes of data. Data is inserted in logical stages, starting from base entities such as users and roles and then moving to dependent entities such as articles, comments, and interactions, to preserve relational consistency.
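A per-table generator driven by a shared configuration could look like the following minimal sketch. The names (`CONFIG`, `generate_articles`, `random_timestamp`) and the specific parameter values are illustrative assumptions, not the project's actual identifiers; the real system also uses Faker for human-like titles and bodies, which is stubbed out here with plain strings to keep the example dependency-free.

```python
import random
from datetime import datetime, timedelta

# Hypothetical configuration; the real system loads these parameters
# from a shared configuration file so the dataset can scale.
CONFIG = {
    "num_users": 1_000,
    "num_articles": 5_000,
    "article_status_weights": {"published": 0.7, "draft": 0.2, "archived": 0.1},
    "start_date": datetime(2020, 1, 1),
    "end_date": datetime(2024, 12, 31),
}

def random_timestamp(start, end):
    """Uniform timestamp within a valid range."""
    span = int((end - start).total_seconds())
    return start + timedelta(seconds=random.randint(0, span))

def generate_articles(config, author_ids):
    """Yield synthetic article rows; statuses follow weighted probabilities,
    and every author_id references an existing user (referential integrity)."""
    statuses = list(config["article_status_weights"])
    weights = list(config["article_status_weights"].values())
    for article_id in range(1, config["num_articles"] + 1):
        yield (
            article_id,
            random.choice(author_ids),                    # FK into users
            f"Article {article_id}",                      # placeholder for Faker text
            random.choices(statuses, weights=weights)[0], # weighted status
            random_timestamp(config["start_date"], config["end_date"]),
        )
```

Because each generator only yields tuples, the central coordinating script can consume them lazily and stream rows straight into batch inserts.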
Additional constraints are enforced during generation, such as preventing invalid relationships (e.g., self-following users) and maintaining valid references across tables. Overall, this approach enables efficient generation of a large, realistic dataset suitable for testing, analytics, and system validation.
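One such generation-time constraint, preventing self-following users, can be enforced directly in the generator. The function below is a hedged sketch (the name `generate_follows` is hypothetical): `random.sample(..., 2)` draws two distinct users, so a follower can never follow themselves, and the set deduplicates repeated pairs.

```python
import random

def generate_follows(user_ids, num_follows, rng=random):
    """Generate unique (follower, followee) pairs with no self-follows.

    Note: num_follows must not exceed n*(n-1) for n users,
    or the loop cannot terminate.
    """
    pairs = set()
    while len(pairs) < num_follows:
        # sample(..., 2) returns two *distinct* ids: no self-follow possible.
        follower, followee = rng.sample(user_ids, 2)
        pairs.add((follower, followee))
    return list(pairs)
```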