1 Introduction

Large planning problems are often tackled using heuristic search that exploits domain knowledge, rather than via direct numerical optimisation, which is often intractable. Consider, for example, nationwide railway scheduling. This involves generating a feasible and safe schedule for the entire railway operations involving trains and railway stations in the network. Optimal railway scheduling seeks to find an assignment of trains to track segments (or ‘blocks’) so as to minimise or eliminate delays or, more generally, minimise costs, while ensuring that every prescribed train has a contiguous, conflict-free path to its destination. The railway scheduling problem can be modelled either as a specific form of the general job-shop scheduling problem (JSP) with blocking and no-wait constraints (D’Ariano et al. 2007), or as a binary integer optimisation problem (Harrod 2010). In the latter, the finite occupancy of a segment of track by a single train for a discrete time duration is encoded using a binary variable \(x_{i,t}^r\), where r indicates one of R trains, t is one of T time slots, and i refers to one of \(B^r\) track segments that compose one or more feasible paths for train r. The scheduling objective is then to minimise the total cost, i.e., \( \min \sum c_{i,t}^r x_{i,t}^r \), where \(c_{i,t}^r\) is a cost associated with the assignment \(x_{i,t}^r\), incorporating elements such as delay, subject to a number of flow constraints such as \( \sum _i x_{i,t}^r = 1, ~ \forall r \in R \). The complete set of constraints, which includes other single-commodity flow constraints such as the one above, as well as multi-commodity constraints, together ensures the resolution of spatio-temporal contentions for track segments by ordering the trains on a line section or on a loop at a station, i.e., by determining the time instants at which each train enters and exits (arrival and departure) each line section and loop on its route.
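As a toy illustration of this encoding (with hypothetical data and helper names of our own, not the formulation of Harrod 2010), a schedule can be viewed as the set of (train, segment, time) triples for which \(x_{i,t}^r = 1\), its objective being the sum of the associated costs:

```python
# Minimal sketch (hypothetical data): a schedule is a set of triples
# (r, i, t) with x[r,i,t] = 1; the objective sums the costs c[r,i,t].

def schedule_cost(assignments, cost):
    """Total cost of a schedule given as a set of (r, i, t) triples."""
    return sum(cost[r, i, t] for (r, i, t) in assignments)

def conflict_free(assignments):
    """Occupancy check: at most one train per segment per time slot."""
    seen = set()
    for (_, i, t) in assignments:
        if (i, t) in seen:
            return False
        seen.add((i, t))
    return True

# Two trains, two segments, two time slots; cost grows with delay t.
cost = {(r, i, t): t for r in range(2) for i in range(2) for t in range(2)}
sched = {(0, 0, 0), (0, 1, 1), (1, 1, 0), (1, 0, 1)}
assert conflict_free(sched)
assert schedule_cost(sched, cost) == 2
```

The real formulation adds the single- and multi-commodity flow constraints discussed above; this sketch only checks pairwise occupancy conflicts.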

Whichever formulation is used, the problem of finding an optimal schedule is NP-hard, and even for a small instance it is usually a challenge to obtain a reasonable solution that can then be tweaked towards optimality. For large-scale problems, it is practically impossible to guarantee optimality: India, for example, has about 8000 stations and runs 12,000 trains a day. By conservative estimates, this would result in about 2 million binary variables and 3.5 million real-valued variables for the optimisation formulation just described. So what is usually done? In practice, schedules are generated using knowledge that is qualitative, and not easily encoded in a form suitable for numerical optimisation. Here, for instance, are some guiding principles followed by the Australian Rail Track Corporation when scheduling trains: (1) if a “healthy” train is running late, it should be given equal preference to other healthy trains (there is a definition of healthy in the guidelines, which is not important here); (2) a higher-priority train should be given preference over a lower-priority train, provided the delay to the lower-priority train is kept to a minimum; and so on. It is evident from this that train-scheduling may benefit from knowing whether a train is healthy, what a train’s priority is, and so on. But are priorities and train-health fixed, irrespective of the context? What values constitute acceptable delays to a low-priority train? Generating good train-schedules will require a combination of quantitative knowledge of a train’s running times and qualitative knowledge about the train in isolation, and in relation to other trains.

In this paper, we propose a heuristic search method that comes under the broad category of estimation of distribution algorithms (EDA: Pelikan et al. 2000). EDAs iteratively generate better solutions to an optimisation problem using machine-constructed models. Usually, these models have been generative probabilistic models, like Bayesian networks. Domain knowledge then has to be translated into a specification of the prior distribution for such networks (often, the choice of prior distribution is restricted to allow efficient parameter estimation), and perhaps some constraints on the topology of the Bayesian network. Here we are concerned with problems for which a translation into prior distribution functions over parameter values or network structure is not evident [although see Angelopoulos and Cussens (2008) for flexible Bayesian priors]. Our interest in Inductive Logic Programming (ILP) derives from the fact that the approach presents an extremely flexible way to use domain knowledge when constructing models. In the experiments reported here, we do not take on a problem as ambitious as nationwide train-scheduling. Instead, we focus on two controlled (but non-trivial) synthetic problems and two uncontrolled real-world problems from the area of drug design.

The principal contributions of this paper are as follows:

  1. To the specific area of ILP, we provide an application of constructing models not for discriminating amongst instances (as is done usually), but for generating new instances; and

  2. To the broader area of machine learning, we provide evidence of the utility of a technique that is capable of including domain knowledge for model-based identification of good solutions to optimisation problems.

The rest of the paper is organised as follows. Section 2 provides a brief description of the heuristic search we propose to employ for optimisation problems. Section 2.1 describes how ILP is used within the iterative loop of the search. The use of ILP in this manner requires: (a) the models it constructs to be generative; and (b) a sampling procedure. These requirements are addressed in Sect. 2.2. Section 3 describes an empirical evaluation, followed by conclusions in Sect. 4.

2 Evolutionary search for near-optimal solutions

The basic search method we use is inspired by the MIMIC algorithm (De Bonet et al. 1997), which uses an evolutionary procedure for model-assisted sampling. Assuming that we are looking to minimise an objective function \(F(\mathbf{x })\), where \(\mathbf{x }\) is an instance from some instance-space \({{\mathcal {X}}}\), the approach first constructs an appropriate discriminatory model to distinguish between samples of values below and above some threshold \(\theta \), i.e., \(F({\mathbf {x}}) \le \theta \) and \(F({\mathbf {x}}) > \theta \). The model is then used to generate the population for the next iteration, while also lowering \(\theta \). A generic procedure is shown in Fig. 1.

Fig. 1

Evolutionary optimisation using machine learning models to guide sampling. Here the term “model” is used in a data-analytic sense, of being a description of data, rather than in the logical sense of being an interpretation that makes a formula true
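The loop of Fig. 1 can be sketched in Python as follows (a minimal sketch with our own names — `fit` and `draw` stand in for the model-construction and model-guided sampling steps; the toy usage at the end is hypothetical):

```python
import random

def eoms(F, population, thresholds, fit, draw, n):
    """Generic model-assisted loop: F is the objective to minimise;
    fit(pos, neg) builds a discriminatory model from instances below/above
    the current threshold; draw(model, n) proposes the next population."""
    best = min(population, key=F)
    for theta in thresholds:
        pos = [x for x in population if F(x) <= theta]
        neg = [x for x in population if F(x) > theta]
        model = fit(pos, neg)
        population = draw(model, n)
        best = min([best] + population, key=F)
    return best

# Toy use: minimise x^2 over integers. The "model" is simply the set of
# below-threshold instances, and sampling perturbs its members.
F = lambda x: x * x
fit = lambda pos, neg: pos or [10]
draw = lambda m, n: [random.choice(m) + random.choice([-1, 0, 1]) for _ in range(n)]
random.seed(0)
best = eoms(F, list(range(-10, 11)), [50, 10, 2], fit, draw, 30)
assert F(best) <= 2
```

The EOMS procedure of the paper instantiates `fit` with a learned discriminator and `draw` with sampling from the corresponding generative model.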

Clearly, the procedure in Fig. 1 requires some kind of generative model which is used to generate new instances (Step 2c). Usually, the preferred choice is a probabilistic model, like a Bayesian network. However, sampling from such networks can be notoriously inefficient, especially if the number of variables is large. In addition, domain knowledge may not be of the kind that translates easily into statements about conditional dependencies, or priors over network structures. For these reasons, interest is increasing in the use of deep generative network models, which have efficient sampling procedures and are able to construct automatically relevant domain-specific features from low-level representations. We look at some options before we consider the use of generative ILP models. Widely used examples of deep generative networks are Deep Belief Nets (DBNs), composed of multiple latent variable models called Restricted Boltzmann Machines (RBMs). RBMs belong to the class of Energy Based Models (EBMs) and can be interpreted as parameterised probabilistic graphical models (Fischer and Igel 2012).

Another kind of deep generative model that has gained significant prominence is the generative adversarial network (or GAN: Goodfellow et al. 2014). GANs are unique amongst deep generative models in that they do not employ a user-defined cost function for the generator. GANs are trained using two competing networks, a discriminator and a generator. In this paper we use a variant of GANs called conditional adversarial networks (cGANs), which include a set of conditional variables Y as an additional input layer for the discriminator and generator. For our problem there is a single conditional variable (denoting the “1” or “0” class values in the EOMS procedure).

From now on, we will use the term “EODN” (Evolutionary Optimisation using Deep Networks) to refer to the EOMS procedure in which the model M used is a deep generative model (specifically, either a DBN or a GAN). We contrast this with an EOMS procedure with a model obtained using Inductive Logic Programming (ILP).

2.1 ILP-assisted evolutionary optimisation

We now look at the use of ILP as the basis for the model M in the EOMS procedure. The principal motivation for the use of ILP has already been described: ILP provides an extremely flexible way to construct models using domain knowledge. While prominent applications of ILP [for example, the line of work on structure-activity relations by King et al. (1995) and Srinivasan et al. (1997)] and early specifications of ILP [for example, in Muggleton and De Raedt (1994)] have largely been concerned with discriminatory models, this does not necessarily have to be the case. Research in probabilistic inductive logic programming (De Raedt et al. 2008), and earlier work on stochastic logic programming (Muggleton 1996), do allow the use of theories in a generative manner. The use of ILP in this section can be seen as a simple form of probabilistic ILP, in which we deal separately with model-construction and inference for sampling (with the probabilistic aspect being concerned only with the latter: more on this in a later section). The procedure EOIS in Fig. 2 is a refinement of the EOMS procedure above, using ILP as the vehicle for model-construction.

Fig. 2

Evolutionary optimisation using ILP models to guide sampling. For simplicity, we do not distinguish between an instance \({{\mathbf {x}}} \in P\), and its logical encoding in a form suitable for an ILP engine

Assume we are provided with domain knowledge encoded as a set of definite clauses B, and examples E consisting of instances \({{\mathbf {x}}}\) s.t. \(F({{\mathbf {x}}}) \le \theta _k\) for some threshold \(\theta _k\) (positive instances, or \(E^+\)) and instances for which \(F({{\mathbf {x}}}) > \theta _k\) (negative instances, or \(E^-\)). Then, in EOIS, \(\textit{ilp}(B,E^+,E^-)\) is an ILP algorithm that returns a set of definite clauses M s.t. \(B \cup M \models E^+\); and \(B \cup M \cup E^- \not \models \Box \). Given instances \({{\mathbf {x}}}\) drawn from a population P, assume the ILP algorithm has found some model M (\(M \ne \emptyset \)). Then the function \(\textit{sample}(P, n, M, B, E)\) returns a set S of at most n instances from \(P {\setminus } E\) s.t. for each \({{\mathbf {x}}} \in S\), \(B \cup M \models label({{\mathbf {x}}},pos)\). Practically speaking, we derive (infer) \(F({{\mathbf {x}}}) \le \theta _k\) from \(B \cup M\); note that the actual F value of \({{\mathbf {x}}}\) may be above or below the corresponding value \(\theta _k\). If \(M = \emptyset \), sample returns a random selection of n instances from \(P {\setminus } E\). We note that on any iteration \(k > 0\), \(E_{k-1} \subseteq E_k\); also, since \(\theta _{k} < \theta _{k-1}\), \(E^{-}_{k-1} \subseteq E^{-}_k\). The procedure terminates simply because the sequence of \(\theta _k\)’s is finite.
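The contract of sample just described can be sketched as follows (a simplification in Python, with the entailment test \(B \cup M \models label({{\mathbf {x}}},pos)\) abstracted into a callable; all names here are ours):

```python
import random

def sample(P, n, M, entails_pos, E):
    """Return up to n instances from P \\ E labelled pos by the model M;
    if no model has been found yet (M empty), fall back to random choice."""
    candidates = [x for x in P if x not in E]
    if not M:
        return random.sample(candidates, min(n, len(candidates)))
    chosen = [x for x in candidates if entails_pos(M, x)]
    return chosen[:n]

# Toy use: a stand-in "model" that labels instances >= 5 as pos.
P = list(range(10))
E = {9}
S = sample(P, 2, ["some_clause"], lambda M, x: x >= 5, E)
assert S == [5, 6]
```

Note that, as in the text, instances returned this way are only *expected* to be below \(\theta _k\); their actual F values are computed afterwards.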

In general therefore, we require sample to draw instances from the success-set of the ILP-constructed theory M, since these represent the good solutions. We describe this next.

2.2 Sampling using an ILP model

Let us assume that on any iteration k of EOIS the ILP model \(M_k\) is a logic program consisting of definite clauses with the target predicate label/2 as the positive literal (that is, the head of the clause). On iteration k, we need to be able to generate a sample from the success-set of \(M_k\). We will look first at the inference procedure.

It is the intent of sampling to generate low-cost instances (or at least instances expected to have low cost). It is well-known that the SLD-resolution procedure used by Prolog requires the specification of a computation-rule (“which literal?”) and a search-rule (“which clause?”). Standard Prolog uses a leftmost-literal computation rule, and a search rule that amounts to selecting clauses in order of appearance in a program. Of these, for non-recursive clauses of the kind we will be considering here, the choice of the search rule is of special importance in determining elements of the success-set (Lloyd 1987). It is of interest here to consider a search rule that depends on the average costs of the instances entailed by the clauses in \(M_k\).

Recall that clauses in \(M_k\) are constructed by an ILP engine, given \({{\mathbf {x}}}\)’s with F-values above and below some threshold value \(\theta _k\). On each iteration k, with training set \(E_k = E_k^{+} \cup E_k^{-}\) and any clause \(C \in M_k\), we define the instances covered by C, given background knowledge B as follows:

$$\begin{aligned}&{Covers}_B(C) = \{{{\mathbf {x}}}: label({{\mathbf {x}}},pos) \in E_k^{+}~\text{ and }~B \wedge C \models label({{\mathbf {x}}},pos)\} \\&\quad \cup \{{{\mathbf {x}}}: \lnot label({{\mathbf {x}}},pos) \in E_k^{-}~\text{ and }~B \wedge C \models label({{\mathbf {x}}},pos)\} \end{aligned}$$

We will usually call this Covers(C). The costs of the instances covered by C are:

$$\begin{aligned} Costs(C) = \{f: {{\mathbf {x}}} \in Covers(C)~\text{ and }~f=F({{\mathbf {x}}})\} \end{aligned}$$

and we define AvCost(C) as the mean of the values in Costs(C):

$$\begin{aligned} AvCost(C) ~=~ {\mathrm {Mean}}(Costs(C)) \end{aligned}$$

Then, we propose using a cost-sensitive search rule in which clauses with lower mean cost have a higher probability of being selected by the inference procedure. Using SLD inference with this kind of probabilistic selection is equivalent to backtrackable sampling from a stochastic logic program (Cussens 2000).
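A minimal sketch of such a cost-sensitive search rule (our own code; for concreteness we use the negative exponential weighting \(Ae^{-\alpha c}\) with \(A = \alpha = 1\) that is adopted in the experiments later in the paper):

```python
import math
import random

def clause_distribution(av_costs):
    """Normalised selection probabilities e^{-c} over clauses, given their
    AvCost values: lower mean cost => higher selection probability."""
    weights = [math.exp(-c) for c in av_costs]
    z = sum(weights)
    return [w / z for w in weights]

def select_clause(clauses, av_costs, rng):
    """Probabilistically select one clause for the SLD search rule."""
    probs = clause_distribution(av_costs)
    return rng.choices(clauses, weights=probs, k=1)[0]

probs = clause_distribution([0.0, 1.0, 2.0])
assert abs(sum(probs) - 1.0) < 1e-9
assert probs[0] > probs[1] > probs[2]   # cheaper clauses are more likely
```

Repeated calls to `select_clause` during refutation correspond to the backtrackable sampling from a stochastic logic program mentioned above.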

We turn now to the question of how the model \(M_k\) can be made generative. We know clauses in \(M_k\) will be of the form (shown here in a kind of pseudo-Prolog format):

    label(X,pos) :- body_1(X,...), body_2(X,...), ..., body_k(X,...).

where the body literals use predicates defined in the background knowledge. In logic-programming, such a clause being “generative” simply means that we are able to obtain answer-substitutions for X. But there is a small twist: we would like the answer-substitutions to be grounding substitutions from the instance-space P. Assuming that instances of P belong to some type T and that elements of T can be enumerated, then either of the rewrites below of the clause will ensure that substitutions are of the kind we seek:

    label(X,pos) :- type_T(X), body_1(X,...), ..., body_k(X,...).     % (left)

    label(X,pos) :- body_1(X,...), ..., body_k(X,...), type_T(X).     % (right)

where \({type}_T\) is a meta-predicate that is true for any instance x of type T. It is used here as an enumerator of elements of T (left), or to ensure that answer substitutions x for X are elements of T (right). If the background predicates are ground atoms, then the rewrite on the right is more efficient. In either case, we will call the rewritten clause the “generative version of the clause”.
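In procedural terms, the generative version of a clause behaves like an enumerate-and-test loop over the type T (a rough Python analogy of our own, not Prolog semantics; in particular it ignores backtracking order):

```python
def generative_clause(enumerate_T, body):
    """Yield grounding substitutions X drawn from type T for which the
    clause body holds, mimicking label(X,pos) :- type_T(X), body(X)."""
    for x in enumerate_T():
        if body(x):
            yield x

# Toy use: T is the integers 0..9; the body succeeds for multiples of 3.
gen = generative_clause(lambda: iter(range(10)), lambda x: x % 3 == 0)
assert list(gen) == [0, 3, 6, 9]
```

The right-hand rewrite in the text corresponds to testing membership in T after the body literals have bound X, which is cheaper when the background predicates are ground atoms.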

The main steps of the procedure sample used in EOIS procedure are in Fig. 3.

Fig. 3

Generating samples using the ILP model. Here, \(\textit{Uniform}\) returns a uniform random sample of size at most n from P, and \(\textit{reorder}\) is non-deterministic, returning a total ordering (i.e. a sequence) of the elements \((f_i,c_i)\) in \(\textit{Costs}\), generated by probabilistic sampling based on the \(f_i\) values of the tuples in \(\textit{Costs}\). \(sld(\cdot )\) returns the set of answer substitutions from refutations of \(\textit{Goal}\) using \(\textit{Program}\). If \(\textit{reorder}\) is deterministic (for example, a sequence sorted on f values), then \(M^*\) needs to be computed only once, and can be obtained before Step 6. The loop in Steps 6–7 terminates when \(|S| = n\) or when no answer-substitutions are possible from \(sld(\cdot )\)

2.3 Note on dominance information

A special form of domain knowledge in optimisation problems is that of dominance. We follow Jouglet and Carlier (2011), in which a dominance rule specifies a subset of the solution space that contains some optimal (or in our case, near-optimal) solutions (see Fig. 4).

Fig. 4

Dominant solutions in optimisation (adapted from Jouglet and Carlier 2011). Here S denotes a set of possible solutions, F(S) denotes feasible solutions, O(S) the subset of (near-)optimal solutions. Dominance constraints are used to specify the shaded subset (\(\delta (S)\)). Solutions in \(\delta (S)\) are said to dominate solutions in \(S {\setminus } \delta (S)\). Ideally we would like \(O(S) \subseteq \delta (S)\)

Operationally, dominance constraints are used to reduce the search-space. ILP systems like Aleph (Srinivasan 1999) have two natural ways of incorporating dominance information. First, clauses examined by the ILP system for inclusion in a model can be pruned, if we are sure that they will generate solutions outside the dominant subset:

    prune((label(X,pos) :- Body)) :-           % schematic: discard the clause if
        solutions_outside_dominant_subset(Body).   % its solutions fall outside δ(S)

Alternatively, we can define a refinement operator that enumerates elements of the search space, by actively incorporating dominance criteria:

    refine(Clause_i, Clause_j) :-               % schematic refinement step
        extends(Clause_i, Clause_j),
        satisfies_dominance_constraints(Clause_j).

Here the ILP system will start the search from some known clause \(C_0\) and repeatedly call the \(\textit{refine}\) predicate to obtain an enumeration of clauses \(C_1, C_2, \ldots \), all of which will satisfy some set of necessary dominance constraints. In either case, the clauses in the resulting ILP model will satisfy the dominance constraints encoded as domain knowledge, and we can expect the subsequent generative model to be more constrained than one obtained without dominance criteria. In experiments that follow, we do not employ such constraints: suffice to note that such constraints can be employed by an ILP system, if available.

We are now in a position to clarify further what we mean by the claim that the use of domain knowledge by an ILP system can lead to more effective models for identifying near-optimal instances. In this paper, this will mean that the EOIS procedure, equipped with an ILP engine and relevant background knowledge, will result in generating more near-optimal instances. Below, we investigate this empirically.

3 Empirical evaluation

3.1 Aims

The principal aim of the empirical evaluation is to investigate the performance of the EOIS procedure, given relevant domain knowledge, against: (a) simple random sampling from the instance space which uses no domain knowledge; and (b) evolutionary optimisation that uses methods that construct generative models by automatically constructing relevant domain-level features from low-level data.

We use evolutionary optimisation using deep generative networks (EODNs) as representative of (b). In the first instance, we will assess performance of EOIS as follows: at the end of each iteration of the evolutionary optimisation procedure, we compute the number of near-optimal instances identified and compare against alternatives. Recall that by “near-optimal instances” we mean those instances for which the cost is below some pre-specified value \(\theta ^*\).

A question that follows naturally is this: what if the domain knowledge available for EOIS is irrelevant, or of limited relevance? One of the two synthetic datasets we have chosen is of this kind, and we will investigate what can be expected in such circumstances.

3.2 Materials

3.2.1 Synthetic data

We use two synthetic datasets, one arising from the KRK chess endgame (an endgame with just White King, White Rook and Black King on the board), and the other a restricted, but nevertheless hard, \(5 \times 5\) job-shop scheduling problem (scheduling 5 jobs taking varying lengths of time onto 5 machines, each capable of processing just one task at a time).

The optimisation problem we examine for the KRK endgame is to predict the depth-of-win with optimal play (Bain and Muggleton 1994). Although this problem has not been as popular in ILP as the task of predicting “White-to-move position is illegal” (Bain 1991; Muggleton et al. 1992), it offers a number of advantages as a synthetic test-bed for problems of interest to us. First, as with other chess endgames, KRK-win is a complex, enumerable domain for which there is complete, noise-free data. Second, optimal “costs” are known for all data instances. Third, the problem has been studied by chess-experts at least since Torres y Quevedo built a machine, in 1910, capable of playing the KRK endgame. This has resulted in a substantial amount of domain-specific knowledge. We direct the reader to Breda (2006) for the history of automated methods for the KRK-endgame. For us, it suffices to treat the problem as a form of optimisation, with the cost being the depth-of-win with Black-to-move, assuming minimax-optimal play. The availability of significant amounts of domain knowledge for near-optimal play makes this what we term a “knowledge-rich” problem. In principle, there are \({64}^3 ~\approx 260{,}000\) possible positions for the KRK endgame, not all legal. Removing illegal positions, and redundancies arising from symmetries of the board, reduces the size of the instance space to about 28,000, with the distribution of costs shown in Fig. 5a. The sampling task here is to generate instances with depth-of-win equal to 0. Simple random sampling has a probability of about 1/1000 of generating such an instance once redundancies are removed.
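The size of the raw position space is easy to verify; a minimal sketch of filtering it is shown below (our own code: it drops only coincident pieces and adjacent kings, whereas the figure of about 28,000 in the text additionally removes board symmetries and other illegal positions):

```python
from itertools import product

# Squares are (file, rank) pairs in 0..7.
def kings_adjacent(wk, bk):
    return max(abs(wk[0] - bk[0]), abs(wk[1] - bk[1])) <= 1

def basic_legal(wk, wr, bk):
    """Pieces on distinct squares, and kings not adjacent."""
    return len({wk, wr, bk}) == 3 and not kings_adjacent(wk, bk)

squares = list(product(range(8), range(8)))
total = len(squares) ** 3
assert total == 262144   # the 64^3 upper bound mentioned in the text

legal = sum(basic_legal(wk, wr, bk)
            for wk in squares for wr in squares for bk in squares)
assert 0 < legal < total
```

This coarse filter already removes a sizeable fraction of the \(64^3\) tuples; symmetry reduction accounts for most of the remaining shrinkage down to roughly 28,000 instances.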

Fig. 5

Distribution of cost values. Numbers in parentheses in the tabulation are cumulative proportions

The job-shop scheduling problem is less controlled than the chess endgame, but is nevertheless representative of many real-life applications (like scheduling trains), which are, in general, known to be computationally hard. We use a job-shop problem with five jobs, each consisting of five tasks that need to be executed in order. These 25 tasks are to be performed using 5 machines, each capable of performing a particular task for any of the jobs. For simplicity, we assume each machine is specialised for one task; a \(5 \times 5\) matrix then defines how long task j of job i takes to execute on machine j. The domain knowledge for this problem is not as well-developed as for the chess problem, and it is not clear how much of it is relevant to low-cost solutions. For us, the job-shop problem here will constitute an example of a “knowledge-deficient” problem.
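Evaluating a candidate schedule for this problem can be sketched as follows (our own code, under the stated assumption that task j of every job runs on machine j; a solution gives the job order on each machine, and the cost is the makespan):

```python
def makespan(times, machine_orders):
    """times[i][j]: duration of task j of job i (run on machine j);
    machine_orders[j]: order in which jobs are processed on machine j.
    Returns the completion time of the last task (the makespan)."""
    n_jobs = len(times)
    n_machines = len(times[0])
    job_ready = [0] * n_jobs        # when each job's next task can start
    mach_ready = [0] * n_machines   # when each machine is next free
    # Task j of a job needs task j-1 finished, so sweeping machines in
    # task order keeps job_ready correct.
    for j in range(n_machines):
        for i in machine_orders[j]:
            start = max(job_ready[i], mach_ready[j])
            finish = start + times[i][j]
            job_ready[i] = finish
            mach_ready[j] = finish
    return max(job_ready)

# Tiny 2x2 check: two jobs, two machines, unit times, same order on both.
t = [[1, 1], [1, 1]]
assert makespan(t, [[0, 1], [0, 1]]) == 3
```

The \(5 \times 5\) instances in the experiments are evaluated in exactly this spirit: the total time duration implied by a schedule is its cost.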

Data instances for Chess are in the form of 6-tuples, representing the rank and file (X and Y values) of the 3 pieces involved. Data instances for Job-Shop are in the form of schedules defining the sequence in which tasks of different jobs are performed on each machine, along with the total cost (i.e., time duration) implied by the schedule. A tabulation of numbers of instances and their costs is in Fig. 5.

3.2.2 Real-world data

Besides performance on synthetic data, it is clearly helpful to know what can be expected on real-world problems. We use the following datasets:

Dataset   | Task            | Total instances       | Near-optimal instances
ADME      | Drug solubility | \(\approx 1300\)      | \(\approx 85\)
Malaria   | Drug efficacy   | \(\approx 13{,}000\)  | \(\approx 170\)

“ADME” refers to the dataset in Hou et al. (2004), to predict solubility of drug-like molecules. Specifically, the predictions are for \({\mathrm {log}}S\) where S is the solubility in mol/l of the molecule in water at a pH of 7.4. Data are available in the form of the 2-d structure of the molecules (the atom and bond structure). Some additional bulk properties can be obtained or estimated from this structure: we obtain the molecular weight, which is a straightforward computation. We consider the optimisation problem to be one of identifying soluble molecules.

“Malaria” refers to The Tres Cantos Antimalarial (TCAMS) dataset. It is available at the ChEMBL Neglected Tropical Disease archive (www.ebi.ac.uk/chemblntd). The TCAMS database is a result of screening GlaxoSmithKline’s library of approximately 2 million compounds. The database consists of 13,000 of the chemicals that were found, on screening, to inhibit significantly the growth of the 3D7 strain of P. falciparum in human erythrocytes (Gamo et al. 2010). Data made available include some bulk-properties of the compounds (like molecular weight and hydrophobicity). We will consider the task of identifying molecules with high inhibition activity (inhibition activity ranges from approximately 5.6–8.6: we are attempting to identify molecules with activity 7.0 and above).

Without access to specialised expertise, we are unable to characterise the real-world problems as knowledge-rich or knowledge-deficient.

3.2.3 Domain knowledge

For Chess, background predicates encode the following (WK denotes the White King, WR the White Rook, and BK the Black King): (a) Distance between pieces: WK-BK, WR-BK, WK-WR; (b) File and distance patterns: WR-BK, WK-WR, WK-BK; (c) “Alignment distance”: WR-BK; (d) Adjacency patterns: WK-WR, WK-BK, WR-BK; (e) “Between” patterns: WR between WK and BK, WK between WR and BK, BK between WK and WR; (f) Distance to closest edge: BK; (g) Distance to closest corner: BK; (h) Distance to centre: WK; and (i) Inter-piece patterns: Kings in opposition, Kings almost-in-opposition, L-shaped pattern. We direct the reader to Breda (2006) for the history of using these concepts, and their definitions.

For Job-Shop, background predicates encode: (a) schedule job J “early” on machine M (early means first or second); (b) schedule job J “late” on machine M (late means last or second-last); (c) job J has the fastest task for machine M; (d) job J has the slowest task for machine M; (e) job J has a fast task for machine M (fast means the fastest or second-fastest); (f) job J has a slow task for machine M (slow means slowest or second-slowest); (g) waiting time for machine M; (h) total waiting time; (i) time taken before executing a task on a machine. More precisely, the predicates for (g)–(i) encode upper and lower bounds on times, using the standard inequality predicates \(\le \) and \(\ge \).

For both the real-world tasks, domain knowledge is in the form of general chemical knowledge of ring-structures and some functional groups. The background knowledge contains definitions used for concepts such as: alcohols, aldehydes, halides, amides, amines, acids, esters, ethers, imines, ketones, nitro groups, hydrogen donors and acceptors, hydrophobic groups, positively- and negatively-charged groups, aromatic rings and non-aromatic rings, hetero-rings, 5- and 6-carbon rings, and so on. These have been used in structure-activity applications of ILP before (King et al. 1995; Srinivasan et al. 1997). We note that none of these definitions are specifically designed for the tasks of predicting solubility or inhibiting malarial targets. In addition, two bulk properties of the molecules are available: the molecular weight, and partition coefficient values (\({\mathrm {log}}P\)).

3.2.4 Algorithms and machines

The ILP-engine we use is Aleph (1999; we use Version 6, available from A.S. on request). All ILP theories were constructed on an Intel Core i7 laptop computer, using a VMware virtual machine running Fedora 13, with an allocation of 2 GB for the virtual machine. The Prolog compiler used was Yap, version 6.1.3.

3.3 Method

Our method is straightforward:

  • For each optimisation problem, and domain knowledge B:

    • Use a sequence of threshold values \(\langle \theta _1, \theta _2, \ldots , \theta _n \rangle \), with \(\theta ^* = \theta _n\). On iteration k (\(1 \le k \le n\)):

      1. Obtain an estimate of the number of instances with \(F(\mathbf{x }) \le \theta _n\) from the instance space using a simple random sample (SRS), EOIS, and EODN (the EOIS and EODN models are obtained by discriminating between \(F(\mathbf{x }) \le \theta _k\) and \(F(\mathbf{x }) > \theta _k\));

      2. Compare the numbers of near-optimal instances obtained by the methods.

We clarify some additional details concerning the method above.

3.3.1 Thresholding

  1. The sequence of thresholds for Chess is \(\langle 8, 4, 0 \rangle \). For Job-Shop, this sequence is \(\langle 1000, 750, 600 \rangle \). Thus, \(\theta ^* = 0\) for Chess and 600 for Job-Shop, which means we require exactly optimal solutions for Chess. For the real-world datasets, the thresholds are \(\langle -4.0, -2.0, -1.0 \rangle \) for ADME, and \(\langle 6.5, 7.0, 7.5 \rangle \) for Malaria.

  2. There is no specific theory underlying the choice of the threshold sequence (although a greedy search for identifying the sequence is discussed later), or that of \(\theta ^*\). For the latter, the choice of 0 for Chess has a problem-specific meaning (checkmate). Using Chess as the baseline, we have chosen \(\theta ^*\) values roughly by increasing the proportion of good solutions by an order of magnitude for Job-Shop, and at 1% for the real-world problems (that is, good drugs are in the top 1-percentile of activity). The evolutionary procedures EOIS and EODN only require that the sequence of \(\theta _i\) values decrease to a value below \(\theta ^*\), without specifying how this is to be done. It is possible that an adaptive procedure could be devised (more on this later).

3.3.2 Deep network parameter tuning

For training both the DBN and the GAN in every iteration, we used a grid search over the number of hidden layers (limited to two, due to the relatively small input dimensions), the number of units in each layer (30, 50, 70, 90), and the learning rate (0.0001, 0.001, 0.01). In each iteration, the number of novel low-cost samples generated was taken to be the tuning criterion. The networks were implemented on a 64 GB server with a 16 GB Volta GPU.

  1. The DBN hyper-parameters are tuned by observing the pseudo log-likelihood during training. During training, an additional feature bit was concatenated to the data to indicate whether the sample was below (1) or above (0) the current threshold. Sampling was performed in the usual fashion via Gibbs sampling in the topmost layer of the DBN, with the conditional feature bit clamped to 1 during sampling to bias the sampling towards samples lying below the current threshold.

  2. The conditional GAN was trained similarly, with a feature bit set to 1 during training for samples below the current threshold, and 0 otherwise. The discriminator loss was used to tune the hyper-parameters of the GAN model, with models yielding higher loss being favoured. During generation, the conditional feature bit was set to 1 to bias the sampling towards lower-cost samples.

  3. For both the DBN and the GAN, samples below the current threshold value were over-sampled during training to ensure they constituted 50% of the training data. This ensured that good samples were adequately represented during training. 1000 samples were generated at each iteration and their costs evaluated.

  4. Both the discriminator and generator in the GAN were trained as feedforward networks, with the output layer of the generator consisting of sigmoid units and the hidden layers comprising rectified linear units. The input to the generator is a 10-dimensional Gaussian noise vector and a conditional bit to bias it towards low-cost solutions.

  5. Each successive layer in the DBN was trained via contrastive divergence, with the bottom-most data layer initialised to a random data vector, followed by repeated sampling of visible and hidden units for a fixed number of iterations (we chose 3, as it yielded the best sampling results).
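The over-sampling of below-threshold samples described above can be sketched as follows (our own code: when positives are the minority, they are duplicated until they make up half of the training data):

```python
import random

def oversample_positives(pos, neg, rng):
    """Duplicate below-threshold (positive) samples so that, when they are
    the minority class, they form 50% of the combined training data."""
    if not pos:
        return pos, neg
    extra = rng.choices(pos, k=max(0, len(neg) - len(pos)))
    return pos + extra, neg

rng = random.Random(0)
pos, neg = oversample_positives([1, 2], [3, 4, 5, 6, 7, 8], rng)
assert len(pos) == len(neg) == 6
```

When positives already outnumber negatives, nothing is duplicated; the paper does not say whether negatives are ever down-sampled, so this sketch leaves them untouched.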

3.3.3 ILP parameter settings

  1. Experience with the ILP engine used here (Aleph) suggests that the most sensitive parameter is the one defining a lower bound on the precision of acceptable clauses (the minacc setting in Aleph). We report experimental results obtained with \(minacc=0.7\) for all datasets, a value that has been used in previous experiments with ILP. The domain knowledge for Job-Shop does not appear to be sufficiently powerful to allow the identification of good theories with short clauses. That is, the usual Aleph setting of up to 4 literals per clause leaves most of the training data ungeneralised. We therefore allow an upper bound of up to 10 literals for Job-Shop, with a corresponding increase in the number of search nodes to 10,000 (Chess uses the default settings of 4 and 5000 for these parameters).

  2. In the EOIS procedure, the bound on sample size n is 1000 for all problems except ADME, which is a small dataset; for ADME, n is 250. The initial sample is obtained using a uniform distribution over all instances; call this \(P_0\). On the first iteration of EOIS (\(k=1\)), the datasets \({E_1}^+\) and \({E_1}^-\) are obtained by computing the (actual) costs for instances in \(P_0\), and an ILP model \(M_{1,B}\) is constructed.

  3. The sampling procedure requires a cost-sensitive probability distribution. We use a simple negative-exponential function \(Ae^{-\alpha c}\), where c is the cost. For all problems, we take \(A = \alpha = 1\). The resulting values are normalised to sum to 1, and form the probability distribution from which clauses are selected before generating an instance.

  4. There are two sources of sampling variation in EOIS: the initial sample, drawn uniformly from the instance space; and the variation arising from the probabilistic selection rule used by the inference engine. Of these, we account for the latter by obtaining estimates of the numbers of near-optimal instances from 10 repetitions of sampling. Variation due to the initial sample affects all methods equally.

  5. In all cases, a generative version of the ILP model is obtained by including a type enumerator in all clauses. It is expected that such an enumerator is defined as part of the domain knowledge provided to the ILP engine. This allows us to use standard Prolog inference to generate answer-substitutions.

3.4 Results

Results on the synthetic data are in Table 1. The principal findings in the tabulations are these:

  (1) For both problems, the number of near-optimal instances found with ILP-assisted models is higher than with the other models;

  (2) Despite identifying a higher number of near-optimal instances on both problems, there is a difference in the performance of EOIS: the “performance gap” in Chess is much greater than on Job-Shop, and the standard deviations suggest the differences on Job-Shop may not be statistically significant; and

  (3) All results have been obtained by sampling a small portion of the instance space (about 10% for Chess, and about 3% for Job-Shop).

Table 1 Estimates of number of near-optimal instances generated after each iteration of evolutionary optimisation

We now examine the results in more detail, focusing first on specific questions concerning EOIS and later on the procedure’s strengths and weaknesses. We restrict these additional experiments to Chess, since it allows a greater amount of control (where appropriate, we have verified similar patterns on Job-Shop).

3.4.1 Recall versus precision

Table 1 tabulates only recall numbers, which show that EOIS can identify more true positives than the other methods. But how many other instances does it generate, besides the true positives? (That is, what about its precision?) Table 2(a) tabulates the change in precision of EOIS (shown only for Chess: the behaviour for Job-Shop is similar), which shows precision improving with each iteration. This is consistent with the earlier observation that the numbers of negative examples increase with each iteration. The tabulations also suggest a relative gain in precision of about 3–9 over random sampling. This may not seem much, but it can result in an order of magnitude less testing than random sampling to obtain the same number of near-optimal instances. Precision can be improved, as usual, at the expense of recall. Table 2(b) shows this, achieved by generating fewer instances (100, as opposed to 1000) on each iteration.

Table 2 Estimated precision and recall of models (shown for Chess only)

3.4.2 Threshold selection

The evolutionary optimisation procedure uses pre-defined thresholds, and two questions naturally arise: (1) How sensitive are the results to the choices used? and (2) Can threshold selection be automated? Table 3(a) suggests that a different choice of starting or intermediate threshold does not make a significant difference to the precision or recall of the final model. However, the tabulation in Table 3(b) shows that the number of thresholds and the actual thresholds chosen can make a difference.

Table 3 Sensitivity to thresholds

The results in Table 3 suggest a search procedure somewhat akin to that used by decision trees for identifying thresholds for continuous-valued attributes. Given a set of possible thresholds, the procedure simply searches for the next best threshold to apply, until no further improvement is possible. By “best threshold” we mean here simply the threshold that results in identifying the most near-optimal instances. Given an upper bound \(\theta ^*\) on the acceptable cost of solutions and ordered values \(\theta _1> \theta _2> \cdots> \theta _n> \theta ^*\), the procedure can be implemented using EOIS with thresholds obtained from a greedy search through subsets of \(\{\theta _1,\theta _2,\ldots ,\theta _n,\theta ^*\}\). One form of this greedy search starts with the set \(\{\theta ^*\}\); examines the subsets \(\{\theta _1,\theta ^*\}, \{\theta _2,\theta ^*\},\ldots ,\{\theta _n,\theta ^*\}\); finds the best subset, say \(\{\theta _k,\theta ^*\}\); and then proceeds to examine the subsets \(\{\theta _{k+1},\theta _k,\theta ^*\}\), \(\{\theta _{k+2},\theta _k,\theta ^*\},\ldots ,\{\theta _{n},\theta _k,\theta ^*\}\), and so on. Each sequence examined requires a model-construction step and a sample-generation step. While the greedy approach attempts to minimise the number of models constructed, the number of samples generated can be fixed to a small value, since for most of the search we are only interested in ordering sequences, not in their exact values.
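This greedy search can be sketched as follows, assuming a `score` callback (an assumption for illustration) that runs EOIS with a given descending threshold sequence on a small sample and returns the number of near-optimal instances identified:

```python
def greedy_thresholds(candidates, theta_star, score):
    """Greedy search over threshold subsets, starting from {theta_star}.

    `score(seq)` is a hypothetical callback: it runs EOIS with the given
    (descending) threshold sequence and returns the number of
    near-optimal instances found.
    """
    chosen = [theta_star]
    best = score(chosen)
    # Candidate thresholds theta_1 > theta_2 > ... > theta_n, all above theta_star.
    remaining = sorted((t for t in candidates if t > theta_star), reverse=True)
    while remaining:
        # Try adding each remaining threshold to the current sequence.
        trials = [(score(sorted(chosen + [t], reverse=True)), t) for t in remaining]
        top, t_k = max(trials)
        if top <= best:              # no further improvement possible
            break
        best, chosen = top, sorted(chosen + [t_k], reverse=True)
        # Only thresholds below the one just chosen remain to be examined.
        remaining = [t for t in remaining if t < t_k]
    return chosen, best

# Toy example: a score that rewards sequences containing thresholds 8 and 4.
score = lambda seq: sum(1 for t in seq if t in (8, 4))
result = greedy_thresholds([2, 4, 6, 8, 10, 12, 14], 0, score)
```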

Table 4 shows the sequence identified by this greedy procedure, which performs as well as the manual sequence for the same sample size (\(n=100\)) but employs fewer thresholds. There is an additional cost arising from the need to examine multiple threshold sequences.

Table 4 Trace of a greedy procedure for selecting a sequence of thresholds from the set \(\{0,2,4,6,8,10,12,14\}\) for Chess

3.4.3 Role of domain knowledge

The performance of the ILP-assisted models in the Job-Shop domain is not as good as on Chess: at the end of 3 iterations, recall is only about 0.2, compared to 1.0 in Chess. The natural question that arises is: why is this so? We conjecture that this is a consequence of the domain knowledge. Recall that we had characterised the Job-Shop problem previously as being knowledge-deficient; that is, the domain knowledge used is not especially relevant to low-cost solutions. In the knowledge encoded for Chess, in contrast, some of the concepts refer specifically to “cornering” the Black King, with a view to ending the game as soon as possible. We would expect these predicates to be especially useful for positions at depths-of-win near 0. Evidence of how the performance of the EOIS procedure in Chess degrades can be obtained by removing such focused domain knowledge: see Table 5.

Table 5 Performance of the EOIS procedure with domain knowledge of low relevance to near-optimal solutions

The results in Table 5 suggest a refinement to the use of EOIS. Using the term “knowledge-rich domains” to signify domains with knowledge that is relevant to low-cost solutions:

We expect the EOIS procedure to be effective only in knowledge-rich domains.

The performance of EOIS also gives us a partial answer to the practical question of what is to be done when the relevance of the domain knowledge is unclear. Our results here suggest that if EOIS performance is comparable to simple random sampling, then this points to the domain knowledge being of low relevance.

3.4.4 Deep network performance

Ostensibly, the deep network models on both Chess and Job-Shop fare worse. However, this comparison is not appropriate. Recall that the network models are constructed without the benefit of the domain knowledge available to EOIS. We have seen above that EOIS performance can degrade significantly if domain knowledge is deficient. In light of the \(B_{low}\) results in Table 5, the performance of the EODNs is in fact much better than that of EOIS when domain knowledge is deficient (the deep networks only have access to the positions of the pieces, which is even less than \(B_{low}\)). Using the term “knowledge-deficient domains” to signify problems where the domain knowledge is deficient or irrelevant for low-cost solutions, the results here suggest a further caveat on the use of EOIS:

We expect the EOIS procedure to be less effective than deep models in knowledge-deficient domains.

Incorporating domain knowledge into deep networks remains a topic of great interest. One simple way of doing so here is to provide the rules constructed by EOIS as input features to a deep network. Although simple, this approach nevertheless improves the performance of the deep networks, although they still do not reach the levels of EOIS with \(B_{high}\) (see Table 6).
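The hybrid approach can be sketched as follows; the rules shown are hypothetical stand-ins for clauses constructed by EOIS, each treated as a Boolean test whose outcome is appended to the raw input features:

```python
import numpy as np

# Stand-ins for ILP-constructed rules: each is a Boolean test on an instance.
rules = [
    lambda x: x[0] > 0.5,           # hypothetical clause 1
    lambda x: x[1] + x[2] < 1.0,    # hypothetical clause 2
]

def augment(raw_features):
    """Concatenate raw inputs with Boolean rule-firing indicators,
    so the deep network sees the raw data plus the ILP rule outputs."""
    bits = np.array([float(r(raw_features)) for r in rules])
    return np.concatenate([raw_features, bits])

x = np.array([0.9, 0.2, 0.3])
x_aug = augment(x)    # the network input grows by one unit per rule
```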

Table 6 Performance of deep networks with ILP-constructed rules as inputs

3.4.5 Comprehensibility

The logical representation used by EOIS allows us to inspect the models constructed. This can be of some advantage for problems where it is important to understand how the near-optimal instances are generated. Figure 6 shows an example.

Fig. 6

Understanding models. a A rule constructed for \(\hbox {Depth}=0\) (check-mate); b examples of good instances generated by the rule; and c a bad instance generated by the rule

Finally, to compare correctly the value of model-assisted sampling against simple random sampling, we would need first to estimate the number of instances that must be drawn randomly to obtain the same numbers of near-optimal instances as with model-assisted EDA. We then need two other costs: (a) the time to obtain values of the objective function for randomly-sampled instances; and (b) the time taken to obtain the corresponding values for the training data used to construct models, plus the total time taken for model construction. For model construction to be beneficial, clearly the time in (b) has to be less than that in (a). For the problems here, random sampling requires approximately 6 (Job-Shop) to 10 (Chess) times as many samples to obtain the same numbers of near-optimal instances as EOIS. The times for theory-construction are small enough to expect that (b) is less than (a) for these ratios.
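One way to make this break-even reasoning concrete is the simplified sketch below. The cost model and all timing values are hypothetical; it lumps all per-instance evaluation time into a single `t_eval` parameter:

```python
def break_even(n_needed, ratio, t_eval, t_train):
    """Is model-assisted sampling cheaper than random sampling?

    n_needed : near-optimal instances required
    ratio    : random-to-EOIS sample ratio (about 6 for Job-Shop, 10 for Chess)
    t_eval   : time to evaluate the objective function on one instance
    t_train  : total time for model construction (incl. evaluating training data)
    """
    cost_random = ratio * n_needed * t_eval       # (a) random sampling
    cost_model = n_needed * t_eval + t_train      # (b) model-assisted sampling
    return cost_model < cost_random

# With the Chess ratio of ~10 and hypothetical timings, model
# construction pays off whenever t_train < (ratio - 1) * n_needed * t_eval.
ok = break_even(n_needed=1000, ratio=10, t_eval=0.01, t_train=30.0)
```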

What can be expected from EOIS on real datasets? Table 7 tabulates the performance on the two datasets described in Sect. 3.2. EOIS does not appear to achieve anything useful on ADME: why is this so? Based on the experimental results obtained with synthetic data, one reason could be that the domain knowledge provided for ADME is of little relevance to predicting very high solubility molecules such as the ones sought here (\({\mathrm {log}}S\) values of \(\ge 0\)). Further supportive evidence follows from the fact that the ILP engine was unable to find many patterns, even with low precision, for these solubility values. Unfortunately, we do not have the pharmacological expertise to provide additional domain knowledge to test this conjecture. For Malaria, it appears that the domain knowledge is relevant, but can possibly be enriched further.

Table 7 Performance of EOIS on real data

4 Concluding remarks

This paper is concerned with model-driven generation of solutions for optimisation problems. Our interest is especially in hard problems for which there are no easy solutions, but for which there is significant, relevant human expertise. But how is such knowledge to be incorporated in the search for solutions? If the usual route of conversion into linear constraints is inappropriate, then it may be difficult to use numerical optimisation techniques efficiently. Dominance rules can help, of course, but these may not be easy to come by, especially for complex real problems. Equally difficult is translating rules-of-thumb and problem requirements into a prior distribution suitable for use by probabilistic modelling methods. So, what options remain? One possibility is to use deep networks that attempt to re-construct the domain knowledge needed (in the form of intermediate features) from low-level data. At the other end of the spectrum is ILP, which provides one of the most flexible ways of incorporating domain knowledge into model construction. Our results suggest that if domain knowledge relevant to low-cost solutions is available, then ILP can make effective use of it.

ILP models to date have largely been discriminative in nature. Generating answers is not new for logic programming: it has long been understood that a logic program can be used to compute more than “yes” or “no” answers. It is surprising, therefore, that logic programs constructed by ILP systems have rarely been used to enumerate instances. Instead, ILP models have mainly been used to decide if one or more instances are logically entailed, given some domain knowledge and the model constructed. This has not always been the case: early ILP systems like MARVIN (Sammut 1981), for example, did use the theories constructed in a generative manner. Sampling from logic programs is also central to the modern strand of research in ILP on probabilistic Inductive Logic Programming (PILP: see De Raedt et al. 2008), and the use of ILP in this paper is akin to this form of generative modelling. Specifically, we see our use as rewriting the model constructed by a standard ILP engine as a stochastic logic program (SLP: Muggleton 1996), over which we allow backtrackable sampling in the manner described in Cussens (2000). This also points to the problem of generating near-optimal solutions as a useful application area for methods combining probability and ILP (Riguzzi and Zese 2017; De Raedt et al. 2008). These methods are a result of research into probabilistic extensions to logic programming (see: Riguzzi 2018; De Raedt and Kimmig 2015; Muggleton 1996; Sato and Kameya 1997), and on the interface of these methods to ILP (the PITA system, for example: Riguzzi and Swift 2011). Here, domain knowledge and examples remain statements in definite-clause logic, as required by standard ILP. But we know beforehand that instances can have labels attached to them, based on their (true) costs. The results in this paper suggest that the relevance of domain knowledge can be important to constructing good generative models.

The performance of deep generative networks suggests that in knowledge-poor domains they can still perform reasonably well. However, unless there is a large amount of data, we cannot expect deep networks to match ILP-based methods in knowledge-rich domains. Practitioners using deep networks have found interesting ways to overcome the small-data problem, provided there is a related problem with very large amounts of data (a form of transfer learning then becomes possible). But incorporating prior knowledge remains a challenge. The hybrid approach of using ILP to learn rules for low-cost solutions, and using these as Boolean features, has shown promise here. This use of ILP for feature-construction has a long and productive history (Joshi et al. 2008; Saha et al. 2012; Srinivasan and King 1996; Ramakrishnan et al. 2007; Specia et al. 2009, 2006; Srinivasan et al. 2012), but its use in conjunction with deep networks has only just begun to be explored (Saikia et al. 2016; Dash et al. 2018), and may provide one way of drawing on the strengths of deep models and ILP for generating low-cost solutions for optimisation problems.