INTRODUCTION
Integrating data from different sources consists of two main steps: the first, in which the various relations are merged together, and the second, in which some tuples are removed from (or inserted into) the resulting database in order to satisfy the integrity constraints. There are several ways to integrate databases or possibly distributed information sources, but whatever integration architecture we choose, the heterogeneity of the sources to be integrated causes subtle problems. In particular, the database obtained from the integration process may be inconsistent with respect to the integrity constraints, that is, one or more integrity constraints are not satisfied. Integrity constraints represent an important source of information about the real world. They are usually used to define constraints on data (functional dependencies, inclusion dependencies, etc.) and nowadays have a wide applicability in several contexts, such as semantic query optimization, cooperative query answering, database integration, and view update.
Since the satisfaction of integrity constraints cannot generally be guaranteed when the database is obtained from the integration of different information sources, in the evaluation of queries we must compute answers that are consistent with the integrity constraints. The following example shows a case of inconsistency.
Example 1: Consider a database schema consisting of the single binary relation Teaches(Course, Professor), where the attribute Course is a key for the relation. Assume there are two different instances of the relation Teaches, D1 = {(c1, p1), (c2, p2)} and D2 = {(c1, p1), (c2, p3)}. Each of the two instances satisfies the constraint that Course is a key, but their union yields a relation that does not satisfy the constraint, since it contains two distinct tuples with the same value for the attribute Course.
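This check can be mechanized. The following sketch (in Python; the relation and values follow Example 1, while the helper key_violations is ours) verifies that each instance is consistent and that their union is not:

def key_violations(instance, key_index=0):
    # group the tuples by their key value and report keys that are
    # associated with more than one distinct tuple
    groups = {}
    for t in instance:
        groups.setdefault(t[key_index], set()).add(t)
    return {k: ts for k, ts in groups.items() if len(ts) > 1}

D1 = {("c1", "p1"), ("c2", "p2")}
D2 = {("c1", "p1"), ("c2", "p3")}

assert not key_violations(D1) and not key_violations(D2)  # both sources are consistent
print(key_violations(D1 | D2))  # {'c2': {('c2', 'p2'), ('c2', 'p3')}}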
In the integration of two conflicting databases, simple solutions could be based on the definition of preference criteria, such as a partial order on the source information or a majority criterion (Lin & Mendelzon, 1996). However, these solutions are not generally satisfactory, and more useful solutions are those based on (1) the computation of "repairs" for the database, and (2) the computation of consistent answers (Arenas, Bertossi, & Chomicki, 1999).
The computation of repairs is based on the definition of minimal sets of insertion and deletion operations such that the resulting database satisfies all constraints. The computation of consistent answers is based on the identification of the tuples satisfying the integrity constraints and on the selection of those tuples matching the goal. For instance, for the integrated database of Example 1, we have two alternative repairs, consisting in the deletion of one of the tuples (c2, p2) and (c2, p3). The consistent answer to a query over the relation Teaches contains only the tuple (c1, p1), so that we do not know which professor teaches course c2. Therefore, in the presence of inconsistent data, it is very important not only to compute the set of consistent answers, but also to know which facts are unknown and whether there are possible repairs for the database.
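A naive sketch of both computations (ours; it enumerates deletion sets exhaustively and is therefore exponential, serving illustration only) derives the two repairs of Example 1 and the consistent answer as the set of tuples common to all repairs:

from itertools import combinations

def consistent(instance):
    courses = [course for course, _ in instance]
    return len(courses) == len(set(courses))  # Course is a key

def repairs(instance):
    # enumerate minimum-cardinality deletion sets restoring consistency;
    # for this example this coincides with minimality under set inclusion
    for size in range(len(instance) + 1):
        found = [instance - set(dropped)
                 for dropped in combinations(instance, size)
                 if consistent(instance - set(dropped))]
        if found:
            return found

D = {("c1", "p1"), ("c2", "p2"), ("c2", "p3")}
R = repairs(D)
print(R)                     # two repairs, each dropping one conflicting tuple
print(set.intersection(*R))  # the consistent answer: {('c1', 'p1')}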
BACKGROUND
Several proposals considering the integration of databases as well as the computation of queries over inconsistent databases have been provided in the literature (Agarwal, Keller, Wiederhold, & Saraswat, 1995; Arenas et al., 1999; Arenas, Bertossi, & Chomicki, 2000; Bry, 1997; Dung, 1996; Greco & Zumpano, 2000; Lin, 1996; Lin & Mendelzon, 1996; Lembo, Lenzerini, & Rosati, 2002; Lenzerini, 2002; Wijsen, 2003). Most of the techniques for computing queries over inconsistent databases work for restricted cases, and only recently have there been proposals to consider more general constraints. This section provides an informal description of the main techniques proposed in the literature.
• Lin and Mendelzon (1996) proposed an approach taking into account the majority view of the knowledge bases in order to obtain a new relation that is consistent with the integrity constraints. The technique proposes a formal semantics for merging first-order theories under a set of constraints.
However, the "merging by majority" technique does not resolve conflicts in all cases, since information is not always present in the majority of the databases and, therefore, it is not always possible to choose between alternative values. Moreover, the use of the majority criterion involves discarding inconsistent data and hence the loss of potentially useful information, as the sketch below also illustrates.
• Arenas et al. (1999) introduced a logical characterization of the notion of consistent answer in a possibly inconsistent database. The technique is based on the computation of an equivalent query T(Q) derived from the source query Q. The definition of T(Q) is based on the notion of residue developed in the context of semantic query optimization.
More specifically, for each literal B appearing in some integrity constraint, a residue Res(B) is computed. Intuitively, Res(B) is a universally quantified first-order formula that must be true, because of the constraints, if B is true.
The technique, more general than the previous ones, has been shown to be complete for universal binary integrity constraints and universally quantified queries. However, the rewriting of queries is complex, since the termination conditions are not easy to detect and the computation of answers is not generally guaranteed to be polynomial.
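For instance, under the schema of Example 1 (the rendering below is ours, following the construction of Arenas et al.), the key constraint can be written as

∀(x, y, z) [ ¬Teaches(x, y) ∨ ¬Teaches(x, z) ∨ y = z ]

so that the residue of the literal Teaches(x, y) is ∀z [ ¬Teaches(x, z) ∨ z = y ]. The query Q(x, y) = Teaches(x, y) is then rewritten into

T(Q)(x, y) = Teaches(x, y) ∧ ∀z [ ¬Teaches(x, z) ∨ z = y ]

which, evaluated over the integrated (inconsistent) database of Example 1, returns only the tuple (c1, p1).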
• Arenas et al. (2000) proposed an approach consisting in the use of a logic program with exceptions (LPe) for obtaining consistent query answers. An LPe is a program with the syntax of an extended logic program (ELP), that is, one in which we may find both logical (or strong) negation (¬) and procedural negation (not). In this program, rules with a positive literal in the head represent a sort of general default, whereas rules with a logically negated head represent exceptions. The semantics of an LPe is obtained from the semantics for ELPs by adding extra conditions that assign higher priority to exceptions. The method, given a set of integrity constraints ICs and an inconsistent database instance, consists in the direct specification of database repairs in a logic programming formalism. The resulting program will have both negative and positive exceptions, strong and procedural negations, and disjunctions of literals in the head of some of the clauses; that is, it will be a disjunctive extended logic program with exceptions. As shown by Arenas et al. (1999), the method considers a set of integrity constraints, IC, written in the standard format P1(x1) ∨ ⋯ ∨ Pn(xn) ∨ ¬Q1(y1) ∨ ⋯ ∨ ¬Qm(ym) ∨ φ, where φ is a formula containing only built-in predicates, and there is an implicit universal quantification in front. This method specifies the repairs of the database D that violate IC by means of a logic program with exceptions, ΠD. In ΠD, for each predicate P a new predicate P′ is introduced, and each occurrence of P is replaced by P′.
The method can be applied to a set of domain-independent binary integrity constraints IC, that is, constraints whose satisfaction can be checked by looking only at the active domain and in which at most two literals appear.
• Calì, Calvanese, De Giacomo, and Lenzerini (2002), Lembo et al. (2002), and Lenzerini (2002) proposed a framework for data integration that allows a general form of integrity constraints to be specified over the global schema, and that defines a semantics for data integration in the presence of incomplete and inconsistent information sources. Moreover, a method is defined for query processing under this semantics when key constraints and foreign key constraints are defined upon the global schema.
Formally, a data integration system I is a triple <G, S, M_GS>, where G is the global schema, S is the source schema, and M_GS is the mapping between G and S. More specifically, the global schema is expressed in the relational model with both key and foreign key constraints; the source schema is expressed in the relational model without integrity constraints; and the mapping is defined between the global and the source schema, that is, each relation in G is associated with a view (a query over the sources). The semantics of a data integration system is given by considering a source database D for I, that is, a database for the source schema S containing a relation r^D for each source relation r in S.
Any database B for G is a global database for I, and it is said to be legal w.r.t. D if:
• It satisfies the integrity constraints defined on G.
• It satisfies the mapping w.r.t. D; that is, for each relation r in G, the set of tuples ρ(r)^D computed by the associated query ρ(r) over D is a subset of the set of tuples r^B that B assigns to r: ρ(r)^D ⊆ r^B.
In this framework, the semantics of I w.r.t. a source database D, denoted sem(I, D), is given in terms of a set of databases. In particular, sem(I, D) = {B | B is a legal global database for I w.r.t. D}. If sem(I, D) ≠ ∅, then I is said to be consistent w.r.t. D.
In this setting, a query q posed to a data integration system I is a conjunctive query over the global schema, whose atoms have symbols in G as predicates. A tuple (c1, …, cn) is considered an answer to the query only if it is a certain answer, that is, if it satisfies the query in every database that belongs to the semantics of the data integration system.
The retrieved global database, denoted ret(I, D), is obtained by evaluating, for each relation r of the global schema, the associated query ρ(r) over the source database D. Note that the retrieved global database satisfies all the key constraints in G, as it is assumed that ρ(r) does not violate the key constraints; thus, if ret(I, D) also satisfies the foreign key constraints, then a query q can be answered by simply evaluating it over ret(I, D). If instead ret(I, D) violates the foreign key constraints, then tuples have to be added to the relations of the global schema in order to satisfy them. In general there are infinitely many legal databases that are coherent with the retrieved global database; however, it can be shown that there exists one, the canonical database, denoted can(I, D), that represents all the legal databases coherent with the retrieved global database. Thus, formally, the answer to a query q can be given by evaluating it over can(I, D). The computation of the canonical database is impractical, however, as it can in general be infinite; thus, Calì et al. (2002) defined an algorithm that computes the certain answers of a conjunctive query q without actually building can(I, D).
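Operationally, the notion of certain answer amounts to an intersection over the legal databases, as in the following sketch (ours; it materializes the legal databases explicitly, which is exactly what the algorithm of Calì et al. (2002) avoids doing):

def certain_answers(query, legal_dbs):
    # a tuple is a certain answer iff every legal database returns it
    answer_sets = [query(db) for db in legal_dbs]
    return set.intersection(*answer_sets) if answer_sets else set()

# the two repairs of Example 1 play the role of the legal databases
B1 = {("c1", "p1"), ("c2", "p2")}
B2 = {("c1", "p1"), ("c2", "p3")}
query = lambda db: set(db)               # the identity query over Teaches
print(certain_answers(query, [B1, B2]))  # {('c1', 'p1')}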
• Wijsen (2003) proposed a general framework for repairing databases. In particular, the author stressed that an inconsistent database can be repaired not only by deleting or inserting whole tuples (tuple-based approach), but also by using a finer repair primitive that corrects faulty values within the tuples, without actually deleting them (value-based approach).
Example 2: Suppose we have the following set of tuples reporting the dioxin levels in food samples:
Sample | Sample Date  | Food    | Analysis Date | Lab | Dioxin Level
110    | Jan 17, 2002 | poultry | Jan 18, 2002  | ICI | normal
220    | Jan 17, 2002 | poultry | Jan 16, 2002  | ICB | alarming
330    | Jan 18, 2002 | beef    | Jan 18, 2002  | ICB | normal
This relation (the Dioxin database) is subject to the constraint

∀(s, sd, f, ad, l, d) [ Dioxin(s, sd, f, ad, l, d) ⊃ sd ≤ ad ]

which imposes that the date of analyzing a given sample cannot precede the date the sample was taken.
The first tuple in the Dioxin database says that sample 110 was taken on January 17, 2002, and analyzed the day after at the ICI lab, and that the dioxin level of this sample was normal. While sample 110 respects the constraint, sample 220 violates it. An inconsistency is present in the database, and the author proposes to "clean" it in a way that avoids deleting the entire tuple, that is, acting at the attribute level and not at the tuple level.
Given an inconsistent database, a consistent answer can be obtained by leaving the database in its inconsistent state and by propagating into the answer the consistent portion of the database, that is, the set of tuples matching the query and satisfying the constraints. As the repair work is deferred until query time, this approach is called late-repairing. In this framework an alternative technique is also proposed, consisting in a database transformation. Given a finite, satisfiable set of constraints Σ and a relation I, one applies a database transformation h_Σ : I → I such that, for every query Q, Q(h_Σ(I)) yields exactly the consistent answer to Q on input I and Σ. Observe that h_Σ(I) is not necessarily a repair for I and Σ; it can be thought of as a "condensed representation" of all possible repairs for I and Σ that is sufficient for consistent query answering. The practical intuition is that an inconsistent database I is first transformed through h_Σ in such a way that subsequent queries on the transformed database retrieve exactly the consistent answers; since the database is modified prior to query execution, this approach is called early-repairing. Clearly, for a given satisfiable set of constraints Σ, early and late repairing should yield the same set of consistent answers; hence f_Σ(Q)(I) = Q(h_Σ(I)) for every query and every relation, where f_Σ(Q) denotes the query transformation performed by late-repairing. A sketch of the value-based primitive follows.
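The value-based primitive can be illustrated on sample 220 of Example 2 (a toy sketch of ours; Wijsen's framework defines value-based repairs declaratively, rather than through the ad hoc corrections used here): instead of deleting the whole tuple, either of its two dates can be corrected, preserving the alarming dioxin reading.

from datetime import date

row220 = {"sample": 220, "sample_date": date(2002, 1, 17),
          "food": "poultry", "analysis_date": date(2002, 1, 16),
          "lab": "ICB", "dioxin": "alarming"}

def violates(row):
    # the analysis date cannot precede the sample date
    return row["analysis_date"] < row["sample_date"]

assert violates(row220)
# two candidate value-based repairs: correct one of the two dates while
# keeping the rest of the tuple (notably the alarming dioxin level) intact
fix_analysis = {**row220, "analysis_date": row220["sample_date"]}
fix_sample = {**row220, "sample_date": row220["analysis_date"]}
assert not violates(fix_analysis) and not violates(fix_sample)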
A NEW TECHNIQUE FOR QUERYING AND REPAIRING INCONSISTENT DATABASES
Greco, Greco, and Zumpano (2001, 2003) and Greco and Zumpano (2000) proposed a general framework for computing repairs and consistent answers over inconsistent databases in the presence of universally quantified constraints. The technique is based on the rewriting of constraints into extended disjunctive rules with two different forms of negation (negation as failure and classical negation). The disjunctive program can be used for two different purposes: to compute "repairs" for the database, and to produce consistent answers, that is, a maximal set of atoms that do not violate the constraints. The technique is sound and complete (each stable model defines a repair, and each repair is derived from a stable model) and more general than techniques previously proposed.
Specifically, the technique is based on the generation of an extended disjunctive program LP derived from the set of integrity constraints. The repairs for the database can be generated from the stable models of LP, whereas the consistent answers to a query (g, P) can be derived by considering the stable models of the program P ∪ LP over the database D.
Let c be a universally quantified constraint of the form

∀X [ B1 ∧ ⋯ ∧ Bk ∧ not Bk+1 ∧ ⋯ ∧ not Bn ∧ φ ⊃ B0 ]

then dj(c) denotes the extended disjunctive rule

¬B′1 ∨ ⋯ ∨ ¬B′k ∨ B′k+1 ∨ ⋯ ∨ B′n ∨ B′0 ← (B1 ∨ B′1), …, (Bk ∨ B′k), (not Bk+1 ∨ ¬B′k+1), …, (not Bn ∨ ¬B′n), (not B0 ∨ ¬B′0), φ

where B′i denotes the atom derived from Bi by replacing the predicate symbol p with the new symbol pd if Bi is a base atom, and is equal to false otherwise. Let IC be a set of universally quantified integrity constraints; then DP(IC) = { dj(c) | c ∈ IC }, whereas LP(IC) is the set of standard disjunctive rules derived from DP(IC) by rewriting the body disjunctions.
Clearly, given a database D and a set of constraints IC, LP(IC)_D denotes the program derived from the union of the rules in LP(IC) with the facts in D, whereas SM(LP(IC)_D) denotes the set of stable models of LP(IC)_D; every stable model is consistent, since it cannot contain two atoms of the form A and ¬A. The following example shows how constraints are rewritten.
Example 3: Consider the following integrity constraints:
and the database D containing the facts p(a), p(b), s(a), and q(a).
The derived generalized extended disjunctive program is defined as follows:
The previous rules can now be rewritten in standard form.
Let P be the corresponding extended disjunctive Datalog program. The computation of the program P_D gives the following stable models:
A (generalized) extended disjunctive Datalog program can be simplified by eliminating from the rule bodies all literals whose predicate symbols are derived and do not appear in the head of any rule (such literals can never be true). As mentioned before, the rewriting of constraints into disjunctive rules is useful both for (1) making the database consistent through the insertion and deletion of tuples, and for (2) computing consistent answers while leaving the database inconsistent.
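To make the approach concrete, the following sketch (ours, and deliberately simplified with respect to the rewriting above: the pd-style derived predicates are compressed into a single predicate del marking deleted tuples, and only the deletion repairs required by a key constraint are modeled) encodes the key constraint of Example 1 for the clingo answer set solver (assumed installed, e.g., via pip install clingo), so that each stable model corresponds to one repair:

import clingo

program = """
teaches("c1","p1"). teaches("c2","p2"). teaches("c2","p3").
% key violation: one of two tuples sharing a Course must be deleted
del(C,P1) ; del(C,P2) :- teaches(C,P1), teaches(C,P2), P1 != P2.
repaired(C,P) :- teaches(C,P), not del(C,P).
#show del/2. #show repaired/2.
"""

ctl = clingo.Control(["0"])    # "0" asks for all stable models
ctl.add("base", [], program)
ctl.ground([("base", [])])
with ctl.solve(yield_=True) as handle:
    for model in handle:       # one stable model per repair
        print(model.symbols(shown=True))

Running this prints two stable models, one deleting (c2, p2) and one deleting (c2, p3), and in both of them the repaired relation contains (c1, p1), matching the repairs and the consistent answer of Example 1.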
FUTURE TRENDS
As a future trend, an interesting topic consists in specifying preference criteria so as to select, among the set of feasible repairs, the preferable ones, that is, those better conforming to the specified criteria. Preference criteria introduce desiderata on how to update the inconsistent database in order to make it consistent; thus, they can be considered as a set of desiderata that a generic repair satisfies if possible. Informally, therefore, a preferred repair is one that better satisfies the preferences. Preliminary results have been published by Greco, Sirangelo, and Trubitsyna (2003).
CONCLUSION
In the integration of knowledge from multiple sources, two main steps are performed: the first, in which the various relations are merged together, and the second, in which some tuples are removed from (or inserted into) the resulting database in order to satisfy the integrity constraints.
The database obtained from the merging of different sources could contain inconsistent data. In this article we investigated the problem of querying and repairing inconsistent databases. In particular, we presented the different techniques proposed in the literature for querying and repairing inconsistent databases (Agarwal et al., 1995; Arenas et al., 1999; Greco & Zumpano, 2000; Lin & Mendelzon, 1996).
KEY TERMS
Consistent Answer: A set of tuples, derived from the database, satisfying all integrity constraints.
Consistent Database: A database satisfying a set of integrity constraints.
Data Integration: A process providing a uniform integrated access to multiple heterogeneous information sources.
Database Repair: A minimal set of insert and delete operations that makes the database consistent.
Disjunctive Datalog Program: A set of rules of the form

A1 ∨ ⋯ ∨ Ak ← B1, …, Bm, not Bm+1, …, not Bn

where A1, …, Ak, B1, …, Bn are atoms of the form p(t1, …, th), with p a predicate symbol and t1, …, th terms, that is, constants or variables.
Inconsistent Database: A database violating some integrity constraints.
Integrity Constraints: A set of constraints that must be satisfied by database instances.