A database for atoms
ASE has its own database that can be used for storing and retrieving atoms and associated data in a compact and convenient way.
There are currently five back-ends:
- JSON:
Simple human-readable text file with a
.json
extension.- SQLite3:
Self-contained, server-less, zero-configuration database. Lives in a file with a
.db
extension.- PostgreSQL:
Server based database.
- MySQL:
Server based database.
- MariaDB:
Server based database.
The JSON and SQLite3 back-ends work “out of the box”, whereas PostgreSQL, MySQL and MariaDB requires a server (See Running a PostgreSQL server or Running a MySQL server).
There is a command-line tool called ase db that can be used to query and manipulate databases and also a Python interface.
What’s in the database?
Every row in the database contains:
all the information stored in the
Atoms
object (positions, atomic numbers, …)calculator name and parameters (if a calculator is present)
already calculated properties such as energy and forces (if a calculator is present)
key-value pairs (for finding the calculation again)
an integer ID (unique for each database) starting with 1 and always increasing for each new row
a unique ID which is a 128 bit random number which should be globally unique (at least in the lifetime of our universe)
constraints (if present)
user-name
creation and modification time
ase db
The ase db command-line tool can be used to query databases and for manipulating key-value pairs. Try:
$ ase db --help
Example: Show all rows of SQLite database abc.db:
$ ase db abc.db
id|age|user |formula|calculator|energy|natoms| fmax|pbc|charge| mass
1| 0s|jensj|H2 |emt | 1.419| 2|9.803|FFF| 0.000|2.016
2| 0s|jensj|H2 |emt | 1.071| 2|0.000|FFF| 0.000|2.016
3| 0s|jensj|H |emt | 3.210| 1|0.000|FFF| 0.000|1.008
Rows: 3
Keys: relaxed
Show all details for a single row:
$ ase db abc.db relaxed=1 -l
H2:
Unit cell in Ang:
axis|periodic| x| y| z| length| angle
1| no| 0.000| 0.000| 0.000| 0.000| 90.000
2| no| 0.000| 0.000| 0.000| 0.000| 90.000
3| no| 0.000| 0.000| 0.000| 0.000| 90.000
Data: abc
Key | Description | Value
energy | | 1.071
mass | | 2.016
id | | 2
fmax | | 0.000
relaxed | | True
ctime | Age | 0.838 seconds
See also
Querying
Here are some example query strings:
Cu |
contains copper |
H<3 |
less than 3 hydrogen atoms |
Cu,H<3 |
contains copper and has less than 3 hydrogen atoms |
H2O |
At least two hydrogens and at least one oxygen |
formula=H2O |
Exactly two hydrogens and one oxygen |
v3 |
has ‘v3’ key |
abc=bla-bla |
has key ‘abc’ with value ‘bla-bla’ |
v3,abc=bla-bla |
both of the above |
calculator=nwchem |
calculations done with NWChem |
2.2<bandgap<3.0 |
‘bandgap’ key has value between 2.2 and 3.0 |
natoms>=10 |
10 or more atoms |
id=2345 |
specific id |
age<1h |
not older than 1 hour |
age>1y |
older than 1 year |
pbc=TTT |
Periodic boundary conditions along all three axes |
pbc=TTF |
Periodic boundary conditions along the first two axes (F=False, T=True) |
These names are special:
id |
integer identifier |
natoms |
number of atoms |
pbc |
Periodic boundary conditions |
formula |
formula |
energy |
potential energy |
charge |
total charge |
magmom |
total magnetic moment |
calculator |
name of calculator |
user |
who did it |
age |
age of calculation (use s, m, h, d, w, M and y for second, minute, hour, day, week, month and year respectively) |
Integration with other parts of ASE
ASE’s ase.io.read()
function can also read directly from databases:
>>> from ase.io import read
>>> a = read('abc.db@42')
>>> a = read('abc.db@id=42') # same thing
>>> b = read('abc.db@v3,abc=H')
Also the ASE’s GUI program can read from databases using the same syntax.
Browse database with your web-browser
You can use your web-browser to look at and query databases like this:
$ ase db abc.db -w
$ firefox http://0.0.0.0:5000/
Click individual rows to see details. See the CMR web-page for an example of how this works.
Python Interface
First, we connect()
to the database:
>>> from ase.db import connect
>>> db = connect('abc.db')
or
>>> import ase.db
>>> db = ase.db.connect('abc.db')
Let’s do a calculation for a hydrogen molecule and write some results to a database:
>>> from ase import Atoms
>>> from ase.calculators.emt import EMT
>>> h2 = Atoms('H2', [(0, 0, 0), (0, 0, 0.7)])
>>> h2.calc = EMT()
>>> h2.get_forces()
array([[ 0. , 0. , -9.803],
[ 0. , 0. , 9.803]])
Write a row to the database with a key-value pair ('relaxed'
, False
):
>>> db.write(h2, relaxed=False)
1
The write()
method returns an integer id.
Do one more calculation and write results:
>>> from ase.optimize import BFGS
>>> BFGS(h2).run(fmax=0.01)
BFGS: 0 12:49:25 1.419427 9.8029
BFGS: 1 12:49:25 1.070582 0.0853
BFGS: 2 12:49:25 1.070544 0.0236
BFGS: 3 12:49:25 1.070541 0.0001
>>> db.write(h2, relaxed=True)
2
Loop over selected rows using the select()
method:
>>> for row in db.select(relaxed=True):
... print(row.forces[0, 2], row.relaxed)
-9.8029057329 False
-9.2526347333e-05 True
The select()
method will generate Row objects
that one can loop over.
Write the energy of an isolated hydrogen atom to the database:
>>> h = Atoms('H')
>>> h.calc = EMT()
>>> h.get_potential_energy()
3.21
>>> db.write(h)
3
Select a single row with the get()
method:
>>> row = db.get(relaxed=1, calculator='emt')
>>> for key in row:
... print('{0:22}: {1}'.format(key, row[key]))
pbc : [False False False]
relaxed : True
calculator_parameters : {}
user : jensj
mtime : 15.3439399027
calculator : emt
ctime : 15.3439399027
positions : [[ ... ]]
id : 2
cell : [[ 0. 0. 0.] [ 0. 0. 0.] [ 0. 0. 0.]]
forces : [[ ... ]]
energy : 1.07054126233
unique_id : bce90ff3ea7661690b54f9794c1d7ef6
numbers : [1 1]
Calculate the atomization energy and update()
a row in
the database:
>>> e2 = row.energy
>>> e1 = db.get(H=1).energy
>>> ae = 2 * e1 - e2
>>> print(ae)
5.34945873767
>>> id = db.get(relaxed=1).id
>>> db.update(id, atomization_energy=ae)
1
Delete a single row:
>>> del db[db.get(relaxed=0).id]
or use the delete()
method to delete several rows.
Description of a row
The first 9 keys (from “id” to “positions”) are always present — the rest may be there:
key |
description |
datatype |
shape |
---|---|---|---|
id |
Local database id |
int |
|
unique_id |
Globally unique hexadecimal id |
str |
|
ctime |
Creation time |
float |
|
mtime |
Modification time |
float |
|
user |
User name |
str |
|
numbers |
Atomic numbers |
int |
(N,) |
pbc |
Periodic boundary condition flags |
bool |
(3,) |
cell |
Unit cell |
float |
(3, 3) |
positions |
Atomic positions |
float |
(N, 3) |
initial_magmoms |
Initial atomic magnetic moments |
float |
(N,) |
initial_charges |
Initial atomic charges |
float |
(N,) |
masses |
Atomic masses |
float |
(N,) |
tags |
Tags |
int |
(N,) |
momenta |
Atomic momenta |
float |
(N, 3) |
constraints |
Constraints |
list of dict |
|
energy |
Total energy |
float |
|
forces |
Atomic forces |
float |
(N, 3) |
stress |
Stress tensor |
float |
(6,) |
dipole |
Electrical dipole |
float |
(3,) |
charges |
Atomic charges |
float |
(N,) |
magmom |
Magnetic moment |
float |
|
magmoms |
Atomic magnetic moments |
float |
(N,) |
calculator |
Calculator name |
str |
|
calculator_parameters |
Calculator parameters |
dict |
Extracting Atoms objects from the database
If you want an Atoms
object insted of an
AtomsRow
object, you should use the
get_atoms()
method:
>>> h2 = db.get_atoms(H=2)
Add additional data
When you write a row to a database using the write()
method,
you can add key-value pairs where the values can be
strings, floating point numbers, integers and booleans:
>>> db.write(atoms, functional='LDA', distance=7.2)
More complicated data can be written like this:
>>> db.write(atoms, ..., data={'parents': [7, 34, 14], 'stuff': ...})
and accessed like this:
>>> row = db.get(...)
>>> row.data.parents
[7, 34, 14]
Row objects
There are three ways to get at the columns of a row:
as attributes (
row.key
)indexing (
row['key']
)the
get()
method (row.get('key')
)
The first two will fail if there is no key
column whereas the last will
just return None
in that case. Use row.get('key', ...)
to use
another default value.
- class ase.db.row.AtomsRow(dct)[source]
-
- property key_value_pairs
Return dict of key-value pairs.
- property constraints
List of constraints.
- property data
Data dict.
- property natoms
Number of atoms.
- property formula
Chemical formula string.
- property symbols
List of chemical symbols.
- property fmax
Maximum atomic force.
- property constrained_forces
Forces after applying constraints.
- property smax
Maximum stress tensor component.
- property mass
Total mass.
- property volume
Volume of unit cell.
- property charge
Total charge.
Writing and updating many rows efficiently
If you do this:
db = connect('mols.db')
for mol in molecules:
db.write(mol, ...)
the database will make sure that each molecule is written to permanent storage (typically a harddisk) before it moves on to the next molecule. This can be quite slow. To speed this up, you can write all the molecules in a single transaction like this:
with connect('mols.db') as db:
for mol in molecules:
db.write(mol, ...)
When the for-loop is done, the database will commit (or roll back if there was an error) the transaction.
Similarly, if you want to update()
many rows, you should
do it in one transaction:
with db:
for id in ...:
db.update(id, foo='bar')
Writing rows in parallel
Say you want to run a series of jobs and store the calculations in one database:
for name in many_molecules:
mol = read(name)
calculate_something(mol)
db.write(mol, name=name)
With four extra lines (see the reserve()
method):
for name in many_molecules:
id = db.reserve(name=name)
if id is None:
continue
mol = read(name)
calculate_something(mol)
db.write(mol, id=id, name=name)
you will be able to run several jobs in parallel without worrying about two
jobs trying to do the same calculation. The reserve()
method
will write an empty row with the name
key and return the ID of that row.
Other jobs trying to make the same reservation will fail. While the jobs are
running, you can keep an eye on the ongoing (reserved) calculations by
identifying empty rows:
$ ase db many_results.db natoms=0
More details
Use this function for getting a connection to a database:
- ase.db.core.connect(name, type='extract_from_name', create_indices=True, use_lock_file=True, append=True, serial=False)[source]
Create connection to database.
- name: str
Filename or address of database.
- type: str
One of ‘json’, ‘db’, ‘postgresql’, (JSON, SQLite, PostgreSQL). Default is ‘extract_from_name’, which will guess the type from the name.
- use_lock_file: bool
You can turn this off if you know what you are doing …
- append: bool
Use append=False to start a new database.
Here is a description of the database object:
- class ase.db.core.Database(filename=None, create_indices=True, use_lock_file=False, serial=False)[source]
Base class for all databases.
Database object.
- serial: bool
Let someone else handle parallelization. Default behavior is to interact with the database on the master only and then distribute results to all slaves.
- write(atoms, id=None, key_value_pairs={}, data={}, **kwargs)[source]
Write atoms to database with key-value pairs.
- atoms: Atoms object
Write atomic numbers, positions, unit cell and boundary conditions. If a calculator is attached, write also already calculated properties such as the energy and forces.
- key_value_pairs: dict
Dictionary of key-value pairs. Values must be strings or numbers.
- data: dict
Extra stuff (not for searching).
- id: int
Overwrite existing row.
Key-value pairs can also be set using keyword arguments:
connection.write(atoms, name='ABC', frequency=42.0)
Returns integer id of the new row.
- reserve(**key_value_pairs)[source]
Write empty row if not already present.
Usage:
id = conn.reserve(key1=value1, key2=value2, ...)
Write an empty row with the given key-value pairs and return the integer id. If such a row already exists, don’t write anything and return None.
- update(id, atoms=None, delete_keys=[], data=None, **add_key_value_pairs)[source]
Update and/or delete key-value pairs of row(s).
- id: int
ID of row to update.
- atoms: Atoms object
Optionally update the Atoms data (positions, cell, …).
- data: dict
Data dict to be added to the existing data.
- delete_keys: list of str
Keys to remove.
Use keyword arguments to add new key-value pairs.
Returns number of key-value pairs added and removed.
- metadata
Dictionary
- get_atoms(selection=None, add_additional_information=False, **kwargs)[source]
Get Atoms object.
- selection: int, str or list
See the select() method.
- add_additional_information: bool
Put key-value pairs and data into Atoms.info dictionary.
In addition, one can use keyword arguments to select specific key-value pairs.
- get(selection=None, **kwargs)[source]
Select a single row and return it as a dictionary.
- selection: int, str or list
See the select() method.
- select(selection=None, filter=None, explain=False, verbosity=1, limit=None, offset=0, sort=None, include_data=True, columns='all', **kwargs)[source]
Select rows.
Return AtomsRow iterator with results. Selection is done using key-value pairs and the special keys:
formula, age, user, calculator, natoms, energy, magmom and/or charge.
- selection: int, str or list
Can be:
an integer id
a string like ‘key=value’, where ‘=’ can also be one of ‘<=’, ‘<’, ‘>’, ‘>=’ or ‘!=’.
a string like ‘key’
comma separated strings like ‘key1<value1,key2=value2,key’
list of strings or tuples: [(‘charge’, ‘=’, 1)].
- filter: function
A function that takes as input a row and returns True or False.
- explain: bool
Explain query plan.
- verbosity: int
Possible values: 0, 1 or 2.
- limit: int or None
Limit selection.
- offset: int
Offset into selected rows.
- sort: str
Sort rows after key. Prepend with minus sign for a decending sort.
- include_data: bool
Use include_data=False to skip reading data from rows.
- columns: ‘all’ or list of str
Specify which columns from the SQL table to include. For example, if only the row id and the energy is needed, queries can be speeded up by setting columns=[‘id’, ‘energy’].
Metadata
It’s can be useful to add a discription of the data that’s in a database.
This description will be used when browsing the database with a
web-browser.
You can add the desciption using the metadata
attribute:
>>> db.metadata = {
... 'title': 'Project 1',
... 'key_descriptions':
... {'v0': ('Voltage', 'Longer description ...', 'V')},
... 'default_columns': ['id', 'formula', 'v0']}
ASE already knows all about the following keys:
key |
short description |
long description |
unit |
---|---|---|---|
id |
ID |
Uniqe row ID |
|
age |
Age |
Time since creation |
|
formula |
Formula |
Chemical formula |
|
pbc |
PBC |
Periodic boundary conditions |
|
user |
Username |
Username |
|
calculator |
Calculator |
ASE-calculator name |
|
energy |
Energy |
Total energy |
eV |
natoms |
Number of atoms |
Number of atoms |
|
fmax |
Maximum force |
Maximum force |
eV/Å |
smax |
Maximum stress |
Maximum stress on unit cell |
eV/ų |
charge |
Charge |
Net charge in unit cell |
|e| |
mass |
Mass |
Sum of atomic masses in unit cell |
au |
magmom |
Magnetic moment |
Magnetic moment |
μ_B |
unique_id |
Unique ID |
Random (unique) ID |
|
volume |
Volume |
Volume of unit cell |
ų |
You can also write/read to/from JSON using:
$ ase db proj1.db --set-metadata metadata.json
$ ase db proj1.db --show-metadata > metadata.json
External Tables
If the number of key_value_pairs becomes large, for example when saving a large number of features for your machine learning model, ASE DB offers an alternative way of storing them. Internally ASE can create a dedicated table to store groups of key_value_pairs. You can store a group of key_value_pairs in a separate table named features by:
>>> atoms = Atoms()
>>> no_features = 5000
>>> feature_dict = dict(('feature' + str(i), i) for i in range(no_features))
>>> id = db.write(atoms, external_tables={'features': feature_dict})
Values stored in external tables can be accessed using:
>>> row = db.get(id=id)
>>> f1 = row['features']['feature1']
>>> f4999 = row['features']['feature4999']
Running a PostgreSQL server
With your PostgreSQL server up and running, you should run the following
command as the postgres
user:
$ sudo -u postgres psql
Then you create an ‘ase’ user and one database for each project you have:
postgres=# create user ase login password 'pw';
postgres=# create database project1;
postgres=# create database project2;
Show databases and quit:
postgres=# \l
postgres=# \q
You should now be able to
query the database using a URI like
postgresql://user:pw@host:port/dbname
:
$ PROJ1=postgresql://ase:pw@localhost:5432/project1
$ ase db $PROJ1
If you have some data in a data.db
SQLite3 file, then you can insert that
into the PostgreSQL database like this:
$ ase db data.db --insert-into $PROJ1
Now you can start the Flask-app ase.db.app.app
with any WSGI
compatible server.
Note
Please review the code carefully before exposing the ase.db.app
to
the internet or bad things could happen.
Running a MySQL server
ASE DB can also be run with a MySQL server. First, we need to get the MySQL server up and running. There are many online resources describing how to to that, but on a Ubuntu system the following should work:
$ sudo apt-get install mysql-server
$ sudo mysql_secure_installation
Then we need to check if the server is running:
$ systemctl status mysql.service
if it is not running you can start the service by:
$ systemctl start mysql.service
Note that on some Linux distributions mysql.service should be replaced by mysqld.service.
Once the service is running, we can enter the MySQL shell:
$ mysql -u root -p
where we assume that there is a user named root, that will be prompted for a password. Now, we can create a user:
mysql> CREATE USER 'ase'@'localhost' IDENTIFIED BY 'strongPassword';
and then a database for our project:
mysql> CREATE DATABASE my_awesome_project;
We need to give the ase user privileges to edit this database:
mysql> GRANT ALL PRIVILEGES ON my_awesome_project.* TO 'ase'@'localhost' IDENTIFIED BY 'strongPassword';
From a Python script we can now connect to the database via
>>> mysql_url = 'mysql://ase:strongPassword@localhost:3306/my_awesome_project'
>>> connect(mysql_url)