Introducing ADCME Database and SQL Integration: an Efficient Approach to Simulation Data Management

Introduction

If you have a massive number of simulations and results from different simulation parameters, to facilitate the data analysis and improve reproducibility, database and Structured Query Language (SQL) are convenient and powerful ways for data management.

Database allows simulation parameters and results to be stored in a permanent storage, and records can be inserted, queried, updated, and deleted as we proceed in our research. Specifically, databases are usually designed in a way that we can concurrently read and write in a transactional manner, which ensures that the reads are writes are done correctly even in the case of data conflicts. This characteristic is very useful for parallel simulations. Another important feature of databases is "indexing". By indexing tables, we can manipute tables in a more efficient way.

SQL is a standad language for accessing and manipulating databases. The four main operations in SQLs are: create, insert, update, and delete. More advanced commands include where, groupby, join, etc. In ADCME, we implemented an interface to SQLite, a relational database management system contained in a C library. SQLite provides basic SQL engines, which is compliant to the SQL standard. One particular feature of SQLite is that the database is a single file or in-memory. This simplifies the client and server SQL logic, but bears the limitation of scalability. Nevertheless, SQLite is more than sufficient to store and manipulate our simulation parameters and results (typically a link to the data folder).

The introduction primarily focuses on some commonly used features of database management in ADCME.

Database Structure

In ADCME, a database is created using Database. There are two types of database:

db1 = Database() # in-memory 
db2 = Database("simulation.db") # file 

If you created a file-based database, whenever you finished operation, you need to commit to the database or close the database to make the changes effective.

commit(db2)
close(db2)

To execute a SQL command in the database, we can use execute command. In the next section, we list some commonly used operations. By default, commit is called after execute. Users can disable this by using

db1 = Database(commit_after_execute=false) # in-memory 
db2 = Database("simulation.db", commit_after_execute=false) # file 

Commonly Used Operations

Create a Database

execute(db2, """
CREATE TABLE simulation_parameters (
    name real primary key,
    dt real,
    h real, 
    result text 
)
""")

Insert an Record

execute(db2, """
INSERT INTO simulation_parameters VALUES
("sim1", 0.1, 0.01, "file1.png")
""")

Insert Many Records

params = [
    ("sim2", 0.3, "file2.png"),
    ("sim3", 0.5, "file3.png"),
    ("sim4", 0.9, "file4.png")
]
execute(db2, """
INSERT INTO simulation_parameters VALUES
(?, ?, 0.01, ?)
""", params)

Look Up Records

c = execute(db2, """
SELECT * from simulation_parameters
""")
collect(c)

Expected output:

("sim1", 0.1, 0.01, "file1.png")
("sim2", 0.3, 0.01, "file2.png")
("sim3", 0.5, 0.01, "file3.png")
("sim4", 0.9, 0.01, "file4.png")

Delete a Record

execute(db2, """
DELETE from simulation_parameters WHERE name LIKE "%3"
""")

Now the records are

("sim1", 0.1, 0.01, "file1.png")
("sim2", 0.3, 0.01, "file2.png")
("sim4", 0.9, 0.01, "file4.png")

Update a Record

execute(db2, """
UPDATE simulation_parameters
SET h = 0.2
WHERE name = "sim4"
""")

Now the records are

("sim1", 0.1, 0.01, "file1.png")
("sim2", 0.3, 0.01, "file2.png")
("sim4", 0.9, 0.2, "file4.png")

Insert a Conflict Record

Becase we set name as primary key, we cannot insert a record with the same name

execute(db2, """
INSERT INTO simulation_parameters VALUES
("sim1", 0.1, 0.1, "file1_2.png")
""")

We have an error

IntegrityError('UNIQUE constraint failed: simulation_parameters.name')

Alternatively, we can do

execute(db2, """
INSERT OR IGNORE INTO simulation_parameters VALUES
("sim1", 0.1, 0.1, "file1_2.png")
""")

Querying Meta Data

  • Get all tables in the database
keys(db2)

Output:

"simulation_parameters"
"sqlite_autoindex_simulation_parameters_1"

We can see a table that SQLites adds indexes to some fields: sqlite_autoindex_simulation_parameters_1.

  • Get column names in the database
keys(db2, "simulation_parameters")

Output:

"name"
"dt"
"h"
"result"

Drop a Table

execute(db2, """
DROP TABLE simulation_parameters
""")

Next Steps

We introduced some basic usage of ADCME database and SQL integration for simulation data management. We used one common and lightweight database management system, SQLite, for managing data of moderate sizes. Due to SQL's wide adoption, it is possible to scale the data management system by adopting a full-fledged database, such as MySQL. In this case, developers can overload ADCME functions such as execute, commit, close, keys, etc., so that the top level codes requires little changes.