You will develop a program that generates SQL statements to check normalization in one table and iteratively normalize it if needed?
The input will be only one table in a database. Each table will have one potential primary key (PK), and non-key columns. Other CKs and FKs will not be considered. The given primary key may simple or composite. Your program will process one table at a time, whose name and columns will be specified when your program is called. The actual table with data will be already stored in the
database. Even though your program will generate a short output text file, most processing will be done with SQL queries.
The output is a set of normalized tables, as many as needed, created in your own space. The goal is to reach BCNF on all tables.
Your program will generate SQL code based on the input table and columns.
1. Input: a plain table, with a potential PK. This table may have an artificial primary key (e.g. i, row id, tuple id) in the DDL that should be be ignored. That is, assume you imported a CSV or text file without assurance the table has a valid PK. In this specific HW edition the table will not have nulls.
2. PK: you must verify if the table PK is indeed valid. The PK can be composit and can have up 2 to attributes. You are not expected to find a PK if the given PK is invalid. In this specific HW edition the PK will not have nulls.
3. FDs: your program must compute all potential functional dependencies considering the max number of columns in the PK and nonkey columns. For FDs not involving the PK there must exist at least two repetitions. In other words, you need to eliminate potential candidate keys from consideration since they do not cause update anomalies.
4. Normalization: Your program must check 1NF, 2NF, 3NF, BCNF. If the table is not in BCNF then you have to iteratively normalize it up to BCNF. For 3NF/BCNF assume the key on the LHS has only one attribute.
5. Goal: validate PK (remember it can be composite with 2 columns), certify if the input table, with potential PK (input parameter) and nonkey columns obeys each normal form: 1NF, 2NF, 3NF, BCNF. A table may be in 1NF with an invalid PK given as parametrs, but in such case it will have a valid artificial PK (e.g. i).
6. Optional: the program must iteratively decompose the input table produce a final set of normalized tables, satisfying 3NF or BCNF if possible. Each table must satisfy the properties explained in the textbook. Extra credit= 20 points, which can be applied towards HW1.
7. Excluded: referential integrity. In this HW you do not have to worry about inconsistent or missing information across tables.
1. The input will be specified in the command line, with parameters separated by semicolon ”;” (parameter=value).
2. Parameters: input table, PK (potential) and columns (nonkey).
If a table has missing parameters: no PK or nonkey columns specified your program should halt and display an error message ”invalid input”.
3. System Software
OS: Linux (any distro is fine if you install the DBMS locally). However, it is feasible to develop programs in Windows and MacOs but later you must test them in our Linux server.
Programming language: JavaScript (Python is also acceptable). DBMS: PostgreSQL.
4. Processing: find FDs, decompose and leave all temporary tables in each higher normal form. Each normalized table must have a PK in the DDL; in case there is a choice (e.g. symmetrical FDs) use alphabetical order to choose its PK. Example: B ! C, C ! B: choose B. At the end of processing, you must write a list of normalized tables in a text file, showing normalized table names with a sux indicating NF and table number as needed. Example for table T: T 2NF 1,T 2NF 2,T 3NF 1,T 3NF 2 ,T BCNF 1,T BCNF 2.
5. Forbidden: You cannot export or read tables row by row, doing processing outside the DBMS. The bulk of processing must be done with SQL queries and temporary tables.
6. Allowed: you can store all table PK and column information inside the host language with any data structure including lists and arrays. You can use any libraries to store, search, parse table/column names. You can retrieve row counts, sums or boolean values computed by SQL queries.
7. SQL code generation: There is only one (give) primary key without other (secondary) candidate keys. You can use any combination of SELECT statements, temporary tables and views. You should preserve column names as you normalize tables. You are not allowed to modify the input table in any way.
8. JavaScript: you can use any JS library or statements to parse the input parameters. You can store table and column names on any data structure including lists, arrays, hash tables or trees. You do not have to worry about JS speed or memory usage since the ”heavy lifting” is done by the DBMS.
9. SQL: Your program must generate a plain SQL script file for all the queries used in your program (correctly formatted as shown in the textbook and seen in class, following SQL format/indentation standards). This file will be checked by the TAs for originality and format. The instructor may also check it in case of grade complaints.
Write to text file ”nf.sql” all the SQL statements generated during the checking of normalization. This SQL should be well formatted to be understood by a person, not as a long line. SQL is not case sensitive. But it is preferred SQL keywords are in uppercase and table/column names in lower case.
10. 10% extra credit: create a web page calling JS where you can connect to the database, list all tables (retrieving them from the internal schema information) and let the user customize the normalization process. The user can choose a table and your program can ”propose” PKs with up to k attributes. The user can tune a ”redundancy” parameter that acts as a threshold to discard weak FDs (e.g. one column that is almost unique). Then the program can proceed with nornalization. If you decide to do it you need to email the instructor and TAs; this extra credit cannot be obtained if you submit your HW without alerting us.
11. Your program should not crash and stop if there is some error. The program should drop any derived or temp table in advance to avoid exceptions. Test your program well with invalid input.
12. Omissions or mistakes in test cases, specification or TAs instructions: they must be reported as soon as possible. Important requirements will fixed in this PDF. It is impossible to foresee every complication when you develop your program. At the same time, this specification would have to be 10-page long to be ultra-specific about every requirement. It is understood that a mistake
or omission pointed out to the TAs or instructor in 3 days or less before the deadline will be very dicult to fix.
Here is a specification of input/output. There is only one main input parameter: the input table. There are two additional parameters: a potential PK and its non-key columns.
syntax example for main program call from the Linux prompt (simplified):
node normalize.js "table=T;pk=i,j;columns=a,b,c,d".
If the connection to the DBMS fails, the program must display an error message. If the connection to the DBMS is successful, you should show a ”success” message and generate the SQL script file regardless the selected option, and dump clean SQL statements generated by your program.
Input checking: Your program will be tested with several tables having di↵erent names and columns. But you must make elementary error checking for non-existent tables and invalid column names. That is, you should not waste time catching every potential error like a missing parenthesis, pkk instead of pk, repeated column names, inconsistent column names, and so on.
Output: a text file ’nf.txt’ specifying for each table if the PK is valid and if each NF (1NF, 2NF, 3NF, BCNF) is satisfied Y/N (one normal form per line and Y/N). Append this file after every run.
Optional: full decomposition as a set of normalized tables, iteratively written for each NF. Normalized tables will have a name with the corresponding normal form as a sux. The list of normalized tables will be listed in file ”nf.txt”, one per line (do not include any other info).
The most important ones are those in BCNF, but intermediate normal forms will be checked for correctness.
Potential PK: up to 2 columns. Since SQL only supports simople data types it is assumed values are atomic. Therefore, you only need to validate the PK.
1NF: Remember that 1NF requires the existence of a PK (i.e. unique rows) and atomic values.
Nonkey columns: up to 10 columns.
Finding FDs: only one column of the left side. The only exception for FDs is the PK which may be composite, but still you will find FDs with one column on the left side.
Data types: The input table will have only integer numbers and strings (no dates, no floating point, no boolean).
The SQL of your generated script file should be properly formatted and indented (one select in one line, join/where/group each in a di↵erent line). Indent nested queries.