{ "cells": [ { "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Introduction\n", "\n", "In this series of excercises, you will learn how to work with ase databases,\n", "and do some simple machine learning for electronic structure properties. The\n", "driving idea is to predict complex properties of compounds from simpler\n", "properties, under the slogan that the fastest calculation is the one you\n", "don't have to run. We start by importing some relevant packages for\n", "scientific python and `ase` in particular.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "%matplotlib inline\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "from ase.db import connect" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In the current directory, there is an `ase` database file called 'organometal.db'.\n", "It contains information about organometallic perovskites, and the goal is to\n", "predict properties for these. Along with the perovskite compounds, there are\n", "also reference calculations of the elements in their standard states. We\n", "start by connecting to the database (more info on the `ase db` module can be\n", "found [here](https://wiki.fysik.dtu.dk/ase/ase/db/db.html#module-ase.db)),\n", "and inspecting a single row:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "db = connect('organometal.db')\n", "row = next(db.select(project='organometal'))\n", "vars(row)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Each row of the database has some key-value pairs, which are stored explicitly, as well as some basic information which is always stored, recording how the data was calculated. Can you identify what each of the keys below refers to?\n", "\n", "calculator:\n", "calculator_parameters:\n", "cell:\n", "ctime:\n", "energy:\n", "gllbsc_dir_gap:\n", "gllbsc_disc:\n", "gllbsc_ind_gap:\n", "id:\n", "initial_magmoms:\n", "mtime:\n", "name:\n", "numbers:\n", "pbc:\n", "positions:\n", "project:\n", "space_group:\n", "symmetry:\n", "unique_id:\n", "user:\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Each row also has a `toatoms()` method, which lets us extract an `ase.Atoms` object from the row.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from ase.visualize import view\n", "view(row.toatoms())" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "When doing any kind of data analysis, the first step is always to become familiar with the data in question, on a basic level. The `select()` method of the database applies a query to the database and returns an iterator of all the rows matching that query. To select all rows with a user of `einstein`, we would type `db.select(user='einstein')`. To select all rows with a gllbsc direct gap greater than 0.5 eV, we would type `db.select('gllbsc_dir_gap>0.5')`.\n", "Counting the number of hits can be be done using `db.count(key=value)` for some key-value pair.\n", "\n", "How many rows are there in the database?\n", "How many belong to the `organometal` _project_? 
, { "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The structures in the database were generated from the general formula ABX$_3$,\n", "and then varying A, B and X. X represents a combination of 3 halogen atoms,\n", "chosen from [\"Cl\", \"Br\", \"I\"]. A, B and X are encoded in the value of the key\n", "`name`, i.e. `row.name -> 'CsPbI3'`. We have also distorted some of the\n", "structures, giving four different symmetry types for each atomic composition.\n", "\n", "1. Try to identify the possible values of A and B.\n", "   (Hint: A and B are each labeled by two characters in `row.name`,\n", "   i.e. `A='Cs'` and `B='Pb'` in `'CsPbI3'`)\n", "\n", "2. Can you identify the four different symmetry classes?\n", "\n", "3. By making all possible combinations of A, B, X, and symmetries, how\n", "   many structures could be generated in total? And how many unique compositions are there,\n", "   i.e. without considering the different symmetries?\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As you can see from the exercise above, two organic molecules (methylammonium MA, formula CH$_6$N, and formamidinium FA, formula CH$_5$N$_2$) can be used instead of Cs as cations in the inorganic perovskite template. View the structure of MA and FA from the reference subproject in the database.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Two good ways of becoming familiar with the data are to plot distributions of interesting quantities, and to calculate some summary statistics of numeric quantities, such as the mean and the variance.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "organometal_rows = [x for x in db.select(project='organometal')]\n", "discs = [x.gllbsc_disc for x in organometal_rows]\n", "print(f\"Mean of `gllbsc_disc` = {np.mean(discs):.4f}\")\n", "print(f\"Variance of `gllbsc_disc` = {np.var(discs):.4f}\")\n", "plt.hist(discs)\n", "plt.xlabel('Energy (eV)')\n", "plt.show()" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Make a histogram for each of the numeric quantities in the database, and calculate the mean and variance of these quantities. You can also make scatter plots of one quantity against another by using the `scatter` method of pyplot, `plt.scatter()`. How do these distributions vary with the `symmetry` keyword of the database?\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "numeric_quantities = [\"energy\", \"gllbsc_disc\", \"gllbsc_dir_gap\", \"gllbsc_ind_gap\"]\n" ], "outputs": [], "execution_count": null }
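, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As a starting point, a scatter plot of one quantity against another could look like the following minimal sketch, which reuses the `organometal_rows` list from above:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Sketch: direct vs. indirect gap, one point per row of the organometal project\n", "dir_gaps = [x.gllbsc_dir_gap for x in organometal_rows]\n", "ind_gaps = [x.gllbsc_ind_gap for x in organometal_rows]\n", "plt.scatter(ind_gaps, dir_gaps)\n", "plt.xlabel('Indirect gap (eV)')\n", "plt.ylabel('Direct gap (eV)')\n", "plt.show()\n" ], "outputs": [], "execution_count": null }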
, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The energy contained in each row is an energy with respect to some arbitrary reference, which was set in the original calculation. A more sensible reference is provided by the materials with `subproject == 'references'`. We can calculate the heat of formation per unit cell of the 'MAPbI3' compound as follows:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "row = db.get(name='MAPbI3', symmetry='cubic')\n", "en_cubic = row.energy\n", "n_atoms = row.natoms\n", "\n", "# Per-atom reference energies of the standard states\n", "en_refs = {}\n", "for ref_row in db.select(subproject='references'):\n", "    en_refs[ref_row.element] = ref_row.energy / ref_row.natoms\n", "\n", "# One cubic MAPbI3 unit cell contains one MA molecule (8 atoms), one Pb and three I\n", "E_standard = en_cubic - (8 * en_refs['MA'] + en_refs['Pb'] + 3 * en_refs['I'])\n", "print(f'hof = {E_standard / n_atoms:.3f} eV/atom')" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Based on this, can you calculate the difference in heat of formation per formula unit of MAPbI$_3$ in the tetragonal phase versus the cubic phase? What about the heat of formation of FASnBr$_2$Cl in the orthorhombic_1 phase versus the cubic phases of FASnBr$_3$ and FASnCl$_3$? (Hint: the tetragonal and orthorhombic phases have a unit cell larger than the cubic structure.)\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Machine Learning\n", "\n", "Machine learning is the science of getting computers to learn from data and to improve their performance over time in an autonomous fashion, by feeding them data and information in the form of observations and real-world interactions. The crucial idea is that the computer should be able to improve its performance at a given task as we give it more information. A tutorial on machine learning in general can be found [here](http://scikit-learn.org/stable/tutorial/basic/tutorial.html).\n", "\n", "In this workbook we will be carrying out a supervised learning task, where we attempt to predict a particular (known) attribute of a given structure, based on other attributes of the structure. This can be useful if it allows us to use easy-to-calculate properties to say something about quantities which are difficult to calculate. In supervised learning we attempt to find a map $f$ from the attributes, $X$, of our data to some target property, $y$. See [here](https://en.wikipedia.org/wiki/Supervised_learning) for more general information on the topic.\n", "\n", "The two most important ingredients in the supervised learning approach are which attributes of the data we feed into the machine learning process, and which model we then apply to the data.\n", "\n", "## Input Vectors\n", "\n", "To start, we use a one-hot encoding of each of the different categories of data as our input vector.\n",
"Later, you will be asked to see if you can find a better set of features to describe the data.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "def calculate_input_vector(row):\n", "    \"\"\"One-hot encode the symmetry, A cation, B cation and halogen content of a row.\"\"\"\n", "    symm_vec = [0, 0, 0, 0]\n", "    A_vec = [0, 0, 0]\n", "    B_vec = [0, 0]\n", "    X_vec = [0, 0, 0]  # halogen counts (Cl, I, Br), i.e. I3 -> [0, 3, 0], I2Cl -> [1, 2, 0], Br3 -> [0, 0, 3]\n", "    constant = [1,]\n", "    symm_vec[['cubic',\n", "              'tetragonal',\n", "              'orthorhombic_1',\n", "              'orthorhombic_2'].index(row.symmetry)] = 1\n", "    A_vec[['Cs', 'FA', 'MA'].index(row.name[:2])] = 1\n", "    B_vec[0] = 1 if 'Pb' in row.name else 0\n", "    B_vec[1] = 1 if 'Sn' in row.name else 0\n", "\n", "    Xs = ['Cl', 'I', 'Br']\n", "    nhalo = sum([s in Xs for s in row.symbols])\n", "    for i, X in enumerate(Xs):\n", "        X_vec[i] = 3 * sum([s == X for s in row.symbols]) // nhalo\n", "\n", "    vec = symm_vec + A_vec + B_vec + X_vec + constant\n", "    return vec" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "In a one-hot encoding, we assign the data to different categorical classes, and then have one feature for each class. For example, there are four different symmetries in the data, so the first four features of the input vector describe which symmetry the material lies in. A '1' in a given position indicates that the material falls into that class, while a '0' indicates that it does not.\n", "\n", "As an example, we apply the encoding to the first few rows of the database with cubic symmetry, and show the formula and the name:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "for row in db.select(symmetry='cubic', limit=5):\n", "    print(f'name={row.name} formula={row.formula} symmetry={row.symmetry}')\n", "    print(f'vec={calculate_input_vector(row)}')\n", "    print('-'*79)" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We see that, as expected, each of these rows has a '1' in the first position, indicating that it has cubic symmetry.\n", "\n", "We are now ready to generate the input matrix $X$ that we will use in the machine learning process. At the same time we extract the indirect band gap as the target vector, $y$.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "X = []\n", "y = []\n", "for row in db.select('project'):\n", "    X.append(calculate_input_vector(row))\n", "    y.append(row.gllbsc_ind_gap)\n", "\n", "X = np.array(X)\n", "y = np.array(y).reshape(-1, 1)\n", "print('X.shape =', np.shape(X))\n", "print('y.shape =', np.shape(y))" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Modelling\n", "\n", "With the input and output in place, we are ready to do the initial machine\n", "learning. All supervised machine learning processes do the following, in a\n", "generalized sense:\n", "\n", "- Select a general functional form for the model, parametrized in\n", "  a suitable way.\n", "\n", "- Find a loss function to evaluate the performance of a given set\n", "  of parameters.\n", "\n", "- Optimize the parameters of the model to minimize the loss.\n", "\n", "All the hard work is usually in the last step, since for complex models the\n", "relationship between the parameters and the loss function can be very\n", "complicated.\n",
"However, for simpler models, we can sometimes find a closed form\n", "for the loss function.\n", "\n", "The very simplest machine learning models are generalized\n", "linear models, where the target, $y$, is assumed to be a linear function of\n", "the input variables. You can read more about them\n", "[here](http://scikit-learn.org/stable/modules/linear_model.html#ordinary-least-squares). For a\n", "linear function, we guess a functional form $f(\mathbf{x}) = \sum_n w_n x_n = \mathbf w \cdot \mathbf x$, and seek to optimize the weight vector, $\mathbf w$.\n", "\n", "If we choose the loss function\n", "$$ L = \sum_i \left(f(\mathbf{x}_i) - y_i\right)^2 = \sum_{i} \left(\sum_{n} w_n x_{in} - y_i\right)^2, $$\n", "we recover ordinary least squares regression. In matrix terms, the loss corresponds to the squared norm $ L = \left\| \mathbf{y} - \mathbf{X} \mathbf{w} \right\|^2$.\n", "The loss is minimal when the derivative with respect to the weight vector is zero. A bit of\n", "rearranging gives that this is true when\n", "$$ \mathbf w = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y} $$\n", "\n", "Here $\mathbf w$ is an (n_features, 1) weight vector that we are trying to\n", "find, $\mathbf{X}$ is an (n_samples, n_features) matrix of our observed\n", "inputs and $\mathbf y$ is the (n_samples, 1) output vector that we are trying\n", "to predict.\n", "\n", "Write a function `fit`, which takes as input a matrix $\mathbf X$ and a\n", "target vector $\mathbf y$, and performs this linear regression, returning the\n", "list of weights and an estimate of the loss for this list of weights.\n", "\n", "Hint: useful functions are `np.dot()` and `np.linalg.inv()`, which calculate\n", "the dot product and inverse, respectively, of their arguments.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "def fit(X, y):\n", "    \"\"\"\n", "    code goes here\n", "    \"\"\"\n" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "You can test your code on the following toy input. Hopefully the weight vector you find is close to [1, 2, 3, 4, 5].\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "nsamples = 50\n", "nfeatures = 5\n", "X_toy = np.random.rand(250).reshape(nsamples, nfeatures)\n", "coefficients = np.arange(1, nfeatures + 1).reshape(-1, 1)\n", "noise = np.random.normal(scale=0.2, size=nsamples).reshape(-1, 1)\n", "y_toy = np.dot(X_toy, coefficients) + noise\n", "w, loss = fit(X_toy, y_toy)\n", "plt.scatter(np.dot(X_toy, w), y_toy)\n", "plt.plot(plt.xlim(), plt.ylim(), ls='--', color='k')\n", "\n", "plt.show()\n", "print(w)" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Once that is working, try it on the original data! (Hint: you have already calculated the matrix $X$ and target vector $y$ above.) Does everything work as it should?\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "One of the assumptions made in the derivation of the linear model is that the matrix $(\mathbf{X}^T\mathbf{X})$ is invertible. Unfortunately, that is not true in our case, because of the encoding we have chosen: for any row, the first four columns must sum to one, so the fourth column can always be written as one minus the sum of the first three.\n" ] }
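, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "For reference, a minimal sketch of such a closed-form `fit` is shown below (your own implementation may look different, and on our one-hot encoded input this plain version will misbehave for exactly the reason just described):\n", "\n", "~~~\n", "def fit(X, y):\n", "    # normal-equation solution w = (X^T X)^-1 X^T y and its squared-error loss\n", "    w = np.dot(np.linalg.inv(np.dot(X.T, X)), np.dot(X.T, y))\n", "    loss = np.sum((np.dot(X, w) - y)**2)\n", "    return w, loss\n", "~~~\n" ] }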
, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can alleviate this by adding a regularization term to our loss function, which penalises large weights. The new loss can then be written as $L = \left\| \mathbf{y} - \mathbf{X} \mathbf{w} \right\|^2 + \alpha \left\| \mathbf{w} \right\|^2$. Luckily, there is still a closed-form solution for this, namely $\mathbf w = (\mathbf{X}^T\mathbf{X} + \alpha \mathbf{I})^{-1} \mathbf{X}^T \mathbf{y}$. Modify your `fit()` function to take an extra argument $\alpha$, and apply regularization to the original problem. Does everything work now? Do the weights of your model make sense? Try a few different values of $\alpha$. How does the fit change with $\alpha$?\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can do all the things above in scikit-learn, which is a Python software package for doing many different kinds of modelling.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "from sklearn import linear_model\n", "linear = linear_model.LinearRegression()\n", "\n", "linear.fit(X, y)\n", "ybar = linear.predict(X)\n", "ymax = np.array((y, ybar)).max() + 0.1\n", "plt.scatter(ybar, y)\n", "lims = [\n", "    np.min([plt.xlim(), plt.ylim()]),\n", "    np.max([plt.xlim(), plt.ylim()])\n", "]\n", "plt.plot(lims, lims, ls='--', color='k')\n", "plt.xlabel('Predicted Band Gap [eV]')\n", "plt.ylabel('Actual Band Gap [eV]')\n", "\n", "# We can wrap the above in a function, to avoid typing the same code again later\n", "def limits():\n", "    lims = [\n", "        np.min([plt.xlim(), plt.ylim()]),\n", "        np.max([plt.xlim(), plt.ylim()])\n", "    ]\n", "    plt.plot(lims, lims, ls='--', color='k')\n", "\n", "def make_comparison_plot(X, y, model, label):\n", "    model.fit(X, y)\n", "    ybar = model.predict(X)\n", "    plt.scatter(ybar, y)\n", "    limits()\n", "    plt.xlabel(f'Predicted {label}')\n", "    plt.ylabel(f'Actual {label}')" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "All the models in scikit-learn have a `fit` method, which expects an $X$ matrix and a $y$ vector as inputs, and then trains the model. They also have a `predict` method, which takes an $X$ matrix as input and returns the $y$ values predicted by the model. We use this to plot the true vs the predicted band gap.\n" ] },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can also inspect the parameters of the model to see which elements of the input vector are important to the model:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "print(linear.coef_)\n", "print(linear.intercept_)" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We see that despite the singular matrix, sklearn is able to do a linear fit and returns sensible coefficients; internally it solves the least-squares problem with an SVD-based pseudoinverse rather than forming $(\mathbf{X}^T\mathbf{X})^{-1}$ explicitly. 
Relying on this behaviour is rather brittle, and a better idea is to do as before and add a regularization parameter to the original loss function, which is done in the `linear_model.Ridge` model:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from sklearn import linear_model\n", "\n", "linear_regularized = linear_model.Ridge(alpha = .5)\n", "make_comparison_plot(X, y, linear_regularized, \"Band Gap [eV]\")" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "source": [ "print(linear_regularized.coef_)\n", "print(linear_regularized.intercept_)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "From visual inspection, it seems that the regularized model performs about as well as the original linear model.\n", "\n", "To proceed with the machine learning, we need some way of evaluating the performance of a model, which is better than visual inspection of predicted versus actual values, and an assessment of the reasonableness of model parameters. Scikit-learn provides a `score()` method for each model, which evaluates how good the fit is.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "linear_regularized.score(X, y)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "To truly compare between models, we should ideally train on some data, and evaluate the model on a different set of data. Otherwise, we could create a perfect model just by storing all the input data, and looking up the correct answer. The way to do this is by cross-validation, where the data is randomly split into a number of buckets, and for each bucket, the model is trained on all the other data, and then tested on the data in the bucket. Since the data might have a meaningful order in the database, it is important that the assignment of the data to each bucket is done at random. This is accomplished by the `shuffle` argument to `KFold`.\n", "\n", "This approach of evaluating the performance of a model is very general and can also be used to optimize the so-called hyperparameters of a model, such as the regularization parameter $\\alpha$. Here, we will not optimize $\\alpha$, but only compare the performance of the `alpha=0` and `alpha=0.5` model. The score has been chosen so that the closer it is to 1, the better.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from sklearn import model_selection\n", "folds = model_selection.KFold(n_splits=2, shuffle=True, random_state=1)\n", "print(model_selection.cross_val_score(linear_regularized, X, y, cv=folds, scoring='explained_variance'))\n", "print(model_selection.cross_val_score(linear, X, y, cv=folds, scoring='explained_variance'))" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Modelling the heat of formation\n", "\n", "Having looked at the band gap, we turn now to the heat of formation, which was defined further up. Try using the heat of formation as a target vector $\\mathbf y$ instead of the band gap. See if it is possible to predict the heat of formation using (regularized) linear regression and the simple input vector defined above. You can use the following code to calculate the heat of formation of all the compounds in the database. 
Can you explain what it does?\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "en_refs = {}\n", "for row in db.select(subproject='references'):\n", " en_refs[row.element] = row.energy/len(row.symbols)\n", "HoF = []\n", "for row in db.select(project='organometal'):\n", " energy = row.energy\n", " # how many units are in there!?\n", " n_units = len([True for symbol in row.symbols if symbol == 'Pb' or symbol == 'Sn'])\n", " energy_standard = 0\n", " for symbol in row.symbols:\n", " if symbol in ['Cs','Pb','Sn','I','Br','Cl']:\n", " energy_standard += en_refs[symbol]\n", " if 'FA' in row.name:\n", " energy_standard += n_units * en_refs['FA'] * 8\n", " if 'MA' in row.name:\n", " energy_standard += n_units * en_refs['MA'] * 8\n", " HoF.append((energy-energy_standard) / n_units)" ], "outputs": [], "execution_count": null }, { "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "When searching for new materials, we would like to find only stable materials. These should have a negative heat of formation, and be in the most stable symmetry class of the four we are considering. The model we have just made will tell us whether a given compound has a negative heat of formation, but it won't tell us whether a different symmetry class would have a lower energy.\n", "\n", "Predicting which symmetry class of the four is most stable for a given composition is a typical example of a classification problem: we would like to map a given composition, to which of the four symmetry classes is most stable. We start by creating the new output vector corresponding to our data. The output vector should have dimensions (n_compositions, 1), where n_compositions is the total number of different possible compositions without considering symmetries. For each composition, the output should indicate which of the symmetries in the database gives the lowest energy. An idea could be to reuse the `symmetry map` from further up.\n", "\n", "For the input vector, we can use the same one as before, only with the first four entries (corresponding to the symmetry class) removed.\n", "\n", "You should start by creating the target vector, which describes which symmetry is most stable for each composition:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Define `X_hof` and `y_hof` in here\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Once you have that, we can start modelling this data. To try something different, we will be using a decision tree to classify the data. This can be imported as follows:\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from sklearn import tree\n", "clf = tree.DecisionTreeClassifier()\n", "clf = clf.fit(?, ?)\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "We can visualize the tree using `graphviz`. 
Can you explain it?\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "import graphviz\n", "feature_names = ['first_cs', 'first_FA', 'first_MA', 'lead', 'tin', 'chlorine', 'iodine', 'bromine', 'reg']\n", "target_names = ['cubic', 'tetragonal', 'orthorhombic_1', 'orthorhombic_2']\n", "dot_data = tree.export_graphviz(clf, out_file=None,\n", "                                feature_names=feature_names,\n", "                                class_names=target_names,\n", "                                filled=True, rounded=True,\n", "                                special_characters=True)\n", "graph = graphviz.Source(dot_data)\n", "graph" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "One issue with decision trees is that it is often easy to overfit the data by making the depth of the tree too large. To see if this is occurring, we can compute the cross-validation score of the model for different values of the maximum depth:\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "folds = model_selection.KFold(n_splits=4, shuffle=True, random_state=0)\n", "scores = []\n", "for max_depth in range(1, 10):\n", "    clf = tree.DecisionTreeClassifier(max_depth=max_depth)\n", "    score = model_selection.cross_val_score(clf, X_hof, y_hof, cv=folds)\n", "    print(max_depth, score)\n", "    scores.append(score.mean())\n", "plt.plot(range(1, 10), scores)\n", "plt.xlabel('Maximum depth')\n", "plt.ylabel('Cross-validation score')\n", "plt.show()" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Improving the input vector\n", "\n", "There is much more information we could include in the input vector, which might increase the predictive power of our model. An example could be the covalent radius of tin vs lead, or of the halogens, which will tend to increase or decrease the lattice constant of the whole crystal, and hence the energy. The covalent radii are available in `ase.data`, and can be imported as `from ase.data import covalent_radii`. Another example could be the electronegativities of the halogens, which will play a role in the character of the bonds. The Mulliken electronegativities of chlorine, bromine and iodine are 8.3, 7.59 and 6.76, respectively.\n", "\n", "Start by redoing a basic examination of the data: plot the band gap and heats of formation against some of the quantities you might add, and see if there is any correlation. For the quantities that do have a correlation, try adding them to the input vector, and see which (if any) result in an improved model for either the heat of formation, the relative energetic ordering of the symmetry classes, or the band gap.\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "## Improving the models\n", "\n", "Most phenomena we encounter don't have nice linear relationships between inputs and outputs. We model nonlinear relationships in two different ways: we either introduce nonlinearity in our features, or in our model. The first\n", "is known as feature engineering, and requires a good understanding of the data and the problem. The second makes optimizing the parameters of our model slightly more difficult, as we can lose many of the nice properties of the linear model, such as the closed-form solution.\n", "\n", "Here we will focus on Gaussian process regression as a case study. 
You can read more about it [here](http://scikit-learn.org/stable/modules/gaussian_process.html#gaussian-process). This lets us fit nonlinear functions and also provides a confidence interval indicating how well the machine is doing.\n", "\n", "Similar to the `linear_regression`, we can import it from `sklearn` and create an example of the model with `model = GaussianProcessRegressor(kernel=YYY)`, where `YYY` is the kernel used. As a start, we should use a radial basis function, which is also available from `sklearn`. As usual, the model has a `fit()` and a `predict()` method, which sets the parameters of the model, and tries to predict outputs based on inputs.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "from sklearn.gaussian_process import GaussianProcessRegressor\n", "from sklearn.gaussian_process.kernels import RBF\n", "\n", "kernel = RBF(length_scale=0.1)\n", "model = GaussianProcessRegressor(kernel=kernel)\n", "model.fit(X, y)\n", "fit = model.predict(X)\n", "model.score(X, y)" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "The model we have chosen is very dependent on the length scale we set for the kernel, and it is not given that the value chosen above (0.1) is a good choice. The fact that we have a score of 1.0 is an indication that we may be overfitting. The trick to selecting the best value of this parameter is again cross-validation. We can loop over different possible values of the hyperparameter to make different classes of models, and then evaluate the cross-validation score of each to see which performs best. Try this!\n" ] }, { "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Ideally you should find that length scales of ~1.0 perform best according to this scoring metric. In general, `scikit` has many tools for finding optimized hyperparameters for a given model - an example is [GridSearchCV](http://scikit-learn.org/stable/auto_examples/model_selection/plot_grid_search_digits.html), which automatically goes through all possible combinations of hyperparameters, finding the best one. You should note that there is a problem with using the Cross-validation scores we compute here to evaluate the performance of a given set of hyperparameters, which is very similar to the original problem of overfitting. Can you see what it is?\n" ] }, { "cell_type": "code", "metadata": {}, "source": [], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "# Testing and evaluating the model\n", "\n", "Now we've reached the stage where we start actually doing electronic structure calculations!\n", "\n", "We would like to test the models we have made by comparing the predicted quantities to calculated quantities. 
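" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "As a reminder of how to get a prediction out of a trained model, here is a minimal sketch (the composition and the hand-built one-hot vector are purely illustrative, and assume the `linear_regularized` model and the feature layout of `calculate_input_vector` from above):\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "# Hand-built input vector for a hypothetical cubic MASnI2Br composition, using the\n", "# same layout as calculate_input_vector: [symmetry(4), A(3), B(2), X(3), constant(1)]\n", "x_new = [1, 0, 0, 0,   # symmetry: cubic\n", "         0, 0, 1,      # A cation: MA\n", "         0, 1,         # B cation: Sn\n", "         0, 2, 1,      # halogens (Cl, I, Br) per formula unit\n", "         1]            # constant term\n", "print(linear_regularized.predict([x_new]))  # predicted indirect band gap in eV\n" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "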
We are looking for a material that is not present in the database, and fulfills the following criteria:\n", "* It has a negative heat of formation\n", "* The cubic symmetry is the most stable for this composition class (due to computational cost of the other symmetry types)\n", "* It has a band gap of approximately 1.5 eV\n", "\n", "It might not be possible to find a material that fulfills all, so try to find one, where the band gap of the cubic symmetry structure is close to 1.5 eV.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "# Determine all the materials in the database, that do not have the cubic symmetry calculated\n", "names = set(r.name for r in db.select(\"symmetry!=cubic\", project=\"organometal\")) - \\\n", " set(r.name for r in db.select(project=\"organometal\", symmetry=\"cubic\"))\n", "\n" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Ideally there is a structure similar to it in the database. To generate the new structure, we can therefore start with the atoms object of the similar structure. Suppose the identical structure is already present, only with lead instead of tin in the structure. We can then generate an initial guess for the starting structure.\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "row = next(db.select(name='similar name', symmetry='similar symmetry'))\n", "atoms = row.toatoms()\n", "symbols = atoms.get_chemical_symbols()\n", "new_symbols = [? if symbol == ? else symbol\n", " for symbol in symbols]\n", "atoms.set_chemical_symbols(new_symbols)\n", "view(atoms)\n", "atoms.write(\"chosen_material.xyz\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Hopefully, you should see that the structure looks as expected! Unfortunately, the new structure is not in a relaxed configuration - by changing some of the atoms, we've also changed the forces on each atom. Before we calculate the energies and band gaps, we need to relax the structure. Note: because of the computational cost, select a cubic structure.\n", "\n", "**Tip**: You can save the following cell to a file, say `myrelax.py`, by uncommenting the first line and using the next cell to submit it to the cluster by writing the following in a cell:\n", "~~~\n", "!mq submit myrelax.py -R 8:1:4h\n", "~~~\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "#%%writefile myrelax.py\n", "from gpaw import GPAW, FermiDirac, PW\n", "from ase.io import read\n", "from ase.optimize.bfgs import BFGS\n", "from ase.filters import UnitCellFilter\n", "\n", "atoms = read('chosen_material.xyz')\n", "name = atoms.get_chemical_formula()\n", "calc = GPAW(mode=PW(500),\n", " kpts={'size': (4, 4, 4), 'gamma': True},\n", " xc='PBE',\n", " txt=name + '_relax.out',\n", " occupations=FermiDirac(width=0.05))\n", "\n", "atoms.calc = calc\n", "uf = UnitCellFilter(atoms, mask=[1, 1, 1, 0, 0, 0])\n", "relax = BFGS(uf, logfile=name + '_relax.log', trajectory='chosen_relax.traj')\n", "relax.run(fmax=0.05) # force is really a stress here" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Once we have the relaxed structure, we are ready to roll! We need to calculate the heat of formation and band gap of this structure, and compare with our predicted values. Time permitting, we should also calculate the heat of formation of the three competing crystal symmetries, to really confirm that we are looking at the correct state. 
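" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "A minimal sketch of how the heat of formation of the relaxed structure could be checked against the reference energies from the database is shown below (it assumes that `en_refs` from earlier is still defined and that the A cation is MA; adapt it to your composition, and keep in mind that the comparison is only meaningful if the reference energies were computed with compatible settings):\n" ] },
{ "cell_type": "code", "metadata": {}, "source": [ "from ase.io import read\n", "\n", "relaxed = read('chosen_relax.traj')\n", "e_relaxed = relaxed.get_potential_energy()\n", "symbols = relaxed.get_chemical_symbols()\n", "n_units = symbols.count('Pb') + symbols.count('Sn')  # number of formula units in the cell\n", "e_standard = sum(en_refs[s] for s in symbols if s in ['Cs', 'Pb', 'Sn', 'I', 'Br', 'Cl'])\n", "e_standard += n_units * 8 * en_refs['MA']  # 8 atoms per MA molecule; use en_refs['FA'] for FA\n", "print(f'HoF = {(e_relaxed - e_standard) / n_units:.3f} eV per formula unit')\n" ], "outputs": [], "execution_count": null },
{ "cell_type": "markdown", "metadata": {}, "source": [ "\n", "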
Standard DFT seriously underestimates the band gap. We thus use a more accurate method which includes the calculation of the derivative discontinuity, called GLLBSC. You can find more information about it [here](https://journals.aps.org/prb/abstract/10.1103/PhysRevB.82.115106) and a benchmark of different methodologies [here](https://doi.org/10.1002/aenm.201400915).\n" ] }, { "cell_type": "code", "metadata": {}, "source": [ "#%%writefile dft-gllb.py\n", "from ase.io import read\n", "from gpaw import GPAW, FermiDirac, PW\n", "\n", "atoms = read('chosen_relax.traj')\n", "calc = GPAW(mode=PW(500),\n", " kpts={'size': (8, 8, 8), 'gamma': True},\n", " xc='GLLBSC',\n", " occupations=FermiDirac(width=0.05))\n", "\n", "atoms.calc = calc\n", "energy = atoms.get_potential_energy()\n", "\n", "# Note! An accurate discontinuity calculation requires a k-point grid that\n", "# gives accurate HOMO/VBM and LUMO/CBM levels (in other words, the k-points of\n", "# the valence band maximum and the conduction band minimum should be\n", "# included in the used k-point grid).\n", "homo, lumo = calc.get_homo_lumo()\n", "response = calc.hamiltonian.xc.response\n", "dxc_pot = response.calculate_discontinuity_potential(homo, lumo)\n", "KS_gap, dxc = response.calculate_discontinuity(dxc_pot)\n", "gap = KS_gap + dxc\n", "print(f\"The gap is {gap:.3f} with components: Kohn-Sham gap {KS_gap:.3f} and \"\n", " f\"discontinuity gap of {dxc:.3f}.\")" ], "outputs": [], "execution_count": null }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "Do the resulting energies match your predictions?\n" ] } ], "metadata": { "kernelspec": { "display_name": "CAMD2022", "language": "python", "name": "camd2022" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.6.1" } }, "nbformat": 4, "nbformat_minor": 1 }