Crankshaft: Creating Software System Context Glossaries (information science)

INTRODUCTION

Requirements engineering (RE) is the area of software engineering responsible for eliciting and defining the requirements of a software system. This task involves combining knowledge of the services that a software system can and cannot provide with knowledge of clients' and users' needs (Jackson, 1995; Katasonov & Sakkinen, 2005; Kotonya & Sommerville, 1998; Sommerville & Sawyer, 1997; Sutcliffe, Fickas, & Sohlberg, 2006; Uchitel, Chatley, Kramer, & Magee, 2006). Frequently, this activity is carried out by people with a software engineering bias. The underlying assumption is that users' needs are easier to understand than the software's possible behaviors. This is not always true; however, it is the metacontext in which most RE heuristics and methodologies have been developed. Understanding clients' and users' needs is far more complex than merely interviewing selected client and user representatives and compiling all the gathered information in one document. Defining how to put a complex software system into service within an organization requires envisioning what the organization's business process will look like in the future from both points of view: software organization and business organization. This is the key to the RE commitment: imagining what the future business process will be. This commitment requires good knowledge of what the business process currently is. Understanding the software system's preexisting context basically means understanding the clients' and users' culture. In other words, this part of RE is a learning process.

BACKGROUND

The importance of language in any culture should be noted. Language is an organized system of speech by which people communicate with each other with mutual comprehension. It is also important to note that, through the words it contains and the concepts it can formulate, language is said to determine the attitudes, understandings, and responses in any society. Language, therefore, may be both a cause and a symbol of cultural differentiation (Fishman, 1999; Hall, Hawkey, Kenny, & Storer, 1986). Language reflects environment and technology: Arabic has 80 words for camels, while Japanese has more than 20 words for rice and Inuit has more than 20 words for snow and ice (Nettle & Romaine, 2000). Clients and users have several special words that they use when discussing their activities. The requirements engineer must pick out and understand as many of these words as possible as a first step in understanding clients' and users' culture.

Glossaries have been used in software engineering for different purposes, such as the data dictionaries of early database books (Codd, 1982), which document the entities, attributes, relations, types, and services of databases. They thus provide a common understanding of all the system names to the development team and later to the maintenance team. However, data dictionaries are also an important component of structured analysis, data recording, data storage, and the details of processes (Gane & Sarson, 1982; Senn, 1989), though authors like Gane and Sarson suggest that the real name for them should be project guide instead of data dictionary. These data dictionaries are created during analysis and are also used during system design. They satisfy five objectives: to manage details, communicate common meanings, document system characteristics, help the analysis of details and changes, and locate errors and omissions in the system.

This article addresses an RE process whose first activity is the construction of a language-extended lexicon (LEL) (Leite, Hadad, Doorn, & Kaplan, 2000), and describes the structure and creation of this LEL.

GLOSSARY CREATION

The word dictionary was coined by Henry Cockeram in 1623, but the first known dictionaries date from the seventh century B.C., and they contain the most important data of the Mesopotamian culture (MSN Microsoft Corporation, 2006). The first dictionaries were catalogues of unusual, difficult, or confusing words and phrases, since the common vocabulary was considered to need no explanation or definition. The oldest glossary comes from the second century A.D. and contains technical Greek words used by Hippocrates. It was in later centuries that a catalogue of all the words of a language was built: for Arabic. So, the original purpose of glossaries and dictionaries was to give definitions of the words and phrases of a particular domain; they were later extended to an entire language in lexical dictionaries. Nowadays, there are different types of dictionaries covering different needs, like dictionaries of synonyms and antonyms, dictionaries of idiomatic usage, etymology dictionaries, encyclopedic dictionaries, bilingual dictionaries, glossaries in textbooks, dictionaries of ideologies, slang dictionaries, and dictionaries of neologisms, among many others.

The most relevant or peculiar words or phrases (named LEL symbols) of the universe of discourse (UofD) are included in the LEL. Every symbol is identified by its name (including synonyms) and two descriptions: notion and behavioral response. The notion contains sentences defining the symbol, and the behavioral response reflects how the symbol influences the UofD. Figure 1 depicts the model used to represent LEL symbols.

The LEL is created by filling in the blanks in the LEL model (see Figure 1) using information obtained from the application domain. Intuition, supported by a good understanding of the LEL model, may be used to create the lexicon. Depending on the experience and the skill of the authors, this may or may not lead to a well-conceived document. If it does, there is apparently no need for heuristics. On the contrary, heuristics are needed, first to allow everyone to complete the process successfully and second to avoid weaknesses usually present in apparently good-quality LELs. Those weaknesses range from missing relevant symbols to the unnecessary insertion of others, and from the inclusion of excessive detail in the symbol descriptions to the lack of it.

The lexicon creation process, depicted in Figure 2 using an SADT model (Ross & Schoman, 1977), consists of five independent activities: (a) plan, (b) collect, (c) describe, (d) verify, and (e) validate.

As seen in Figure 2, the process shows a main stream composed of three tasks: plan, collect, and describe. There is well-established feedback when the verification and validation activities take place. After verifying the LEL, the process returns to the describe activity, where corrections are made based on a list of discrepancies, errors, and omissions (DEO list). After the validate activity, the process returns to the collect activity and/or the describe activity, depending on the validation DEO list, in order to make any necessary corrections. For ease of reading, the SADT model does not show all the backtracking steps that may occur during the construction process. For instance, while describing a symbol, a wrongly assigned type may be discovered, so a back step occurs in order to reclassify it (within the collect activity). Another example of going backward in the process could appear when a new term is identified while describing another. That is, the strategy is not at all a linear one. It is an iterative process where feedback is a constant mechanism. In addition to this continuous feedback, the main stream does not fully follow a cascade model since in practice its three main tasks may partially overlap. For instance, a symbol may be fully described while new sources of information still need to be identified to classify or describe others.
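The feedback paths described above can be summarised as a small transition table. The following Python sketch is only an illustration: the names NEXT and is_valid_run are hypothetical, and the step from verify to validate (taken once no defects remain) is an assumption about how the process moves forward, not something the SADT model states in this form.

```python
# Each activity maps to the activities that may follow it.
# The "done" marker and the verify -> validate step are assumptions.
NEXT = {
    "plan": ["collect"],
    "collect": ["describe"],
    # Back step to collect: reclassify a symbol or add a newly found term.
    "describe": ["verify", "collect"],
    # Back to describe: corrections driven by the verification DEO list.
    "verify": ["describe", "validate"],
    # Back to collect and/or describe, depending on the validation DEO list.
    "validate": ["collect", "describe", "done"],
}


def is_valid_run(activities):
    """Check that a sequence of activities only uses the transitions above."""
    if not activities or activities[0] != "plan":
        return False
    return all(b in NEXT.get(a, []) for a, b in zip(activities, activities[1:]))


run = ["plan", "collect", "describe", "verify", "describe",
       "verify", "validate", "done"]
print(is_valid_run(run))
```

The table deliberately ignores the partial overlap of plan, collect, and describe mentioned in the text; it only captures which activity hands control to which.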

Figure 1. Language-extended lexicon model

LEL: It is the representation of the symbols in the application domain language. Syntax: {Symbol}_1^N

Symbol: It is an entry of the lexicon that has a special meaning in the application domain. Syntax: {Name}_1^N + {Notion}_1^N + {Behavioral Response}_1^N

Name: This is the identification of the symbol. Having more than one name represents synonyms. Syntax: Word | Phrase

Notion: It is the denotation of the symbol. Syntax: Sentence

Behavioral Response: It is the connotation of the symbol. Syntax: Sentence
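As an illustration, the model in Figure 1 could be held in a small data structure such as the Python sketch below. The class names Symbol and Lexicon, the find method, and the purchase order entry are all hypothetical; the LEL model itself does not prescribe any implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Symbol:
    """One LEL entry: one or more names, notions, and behavioral responses."""
    names: list[str]                 # more than one name means synonyms
    notions: list[str]               # denotation: sentences defining the symbol
    behavioral_responses: list[str]  # connotation: how it influences the UofD

    def __post_init__(self) -> None:
        # The model requires at least one of each component.
        for label, items in (("name", self.names),
                             ("notion", self.notions),
                             ("behavioral response", self.behavioral_responses)):
            if not items:
                raise ValueError(f"a symbol needs at least one {label}")


@dataclass
class Lexicon:
    """The LEL itself: a collection of symbols, searchable by any name."""
    symbols: list[Symbol] = field(default_factory=list)

    def add(self, symbol: Symbol) -> None:
        self.symbols.append(symbol)

    def find(self, name: str) -> Symbol | None:
        # Synonyms share one entry, so every name of a symbol is checked.
        wanted = name.casefold()
        for symbol in self.symbols:
            if any(n.casefold() == wanted for n in symbol.names):
                return symbol
        return None


# Hypothetical entry, invented for illustration only.
lel = Lexicon()
lel.add(Symbol(
    names=["purchase order", "PO"],
    notions=["Document sent by the buyer to a supplier."],
    behavioral_responses=["The supplier ships the goods listed in the purchase order."],
))
```

Looking up either "purchase order" or "PO" returns the same entry, which mirrors the rule that synonyms are alternative names of a single symbol.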

Figure 2. SADT of the LEL creation process

Plan. The plan activity basically consists of identifying the sources of information, evaluating them, and finally selecting the strategies to elicit symbols. To identify the sources of information, it is necessary to define the context where the RE process will take place. A mandatory source of information is the document of system objectives and scope (if it was written), or the requesters of the software system. The most reliable sources of information are documents and people, but some other relevant sources could be books about related themes, other clients' systems, and other systems available in the market. A source of information can be seen from several perspectives. One of the most salient perspectives is that of effectiveness, which classifies the information as either actual or formal. Formal information is about what should occur, but it is not necessarily put into practice or kept up to date; actual information involves current practices or states, that is, what is actually in use. Accessing sources of information biased toward the formal point of view creates a significant risk of developing a software system unable to deal with what actually happens in the business. On the other hand, ignoring formal sources of information does not allow the use of the software system as a tool for business process improvement. At this point, balancing the actual and formal points of view is almost impossible, but the requirements engineer must at least understand both.

Collect. The collect activity starts by creating a candidate symbol list, obtained by accessing the sources of information through unstructured interviews, document reading, or other elicitation techniques appropriate for each source of information. The most important rules for choosing symbols are as follows.

• Pick exclusively words or phrases belonging to the application domain.

• Select words or phrases frequently used or highly repeated in documents.

• Select words or phrases meaningful in the application domain.

• Exclude words or phrases that are too obvious.

• Identify the full name no matter how long it is.

• An abbreviation or a partial name may be a synonym of a symbol with a long name.

Once the candidate list of symbols is available, every entry should be assigned to a class. In most cases, the basic classes of subject, verb, object, and state are useful. Subjects are active entities, such as persons, organizations, or software systems. Objects are passive entities to which actions are applied. Verbs are entries representing actions that happen in the application domain, and states are conditions of a group of subjects, objects, or verbs.
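A rough Python sketch of two of the selection rules (frequent repetition and exclusion of obvious words), followed by a manual class assignment, is shown below. The function candidate_words, the SymbolClass enumeration, the sample sentences, and the min_count threshold are hypothetical. The sketch counts single words only, so full names, phrases, and the judgment of what is meaningful in the domain still rest with the requirements engineer.

```python
import re
from collections import Counter
from enum import Enum


class SymbolClass(Enum):
    SUBJECT = "subject"  # active entity: person, organization, software system
    OBJECT = "object"    # passive entity to which actions are applied
    VERB = "verb"        # action that happens in the application domain
    STATE = "state"      # condition of subjects, objects, or verbs


def candidate_words(documents, too_obvious, min_count=2):
    """Return words repeated at least min_count times, minus the obvious ones."""
    counts = Counter()
    for text in documents:
        counts.update(re.findall(r"[a-z]+", text.lower()))
    return sorted(
        word for word, n in counts.items()
        if n >= min_count and word not in too_obvious
    )


# Invented sample text standing in for elicited documents.
docs = [
    "The clerk registers the invoice. The invoice is sent to the supplier.",
    "The supplier confirms the invoice after the clerk approves it.",
]
obvious = {"the", "is", "to", "it", "after"}
candidates = candidate_words(docs, obvious)
print(candidates)  # ['clerk', 'invoice', 'supplier']

# Class assignment is a human decision; the mapping only records it.
classes = {
    "clerk": SymbolClass.SUBJECT,
    "supplier": SymbolClass.SUBJECT,
    "invoice": SymbolClass.OBJECT,
}
```

Words such as "registers" fall below the threshold here even though they may be relevant verbs, which is one reason a frequency count can only seed the candidate list.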

Describe. Describing symbols means defining their notions and behavioral responses based on the LEL model and the class to which they belong. In order to describe the symbols, the requirements engineers may use previously elicited knowledge, though often they should go back to the UofD to collect more information. In this case, it is recommended that they conduct structured interviews in order to ask clients and users about the meaning of the symbols. Nevertheless, other sources of information may well be used.

Some rules for describing symbols in the lexicon (description heuristics) are itemised below.

• A symbol must have at least one name, one notion, and one behavioral response.

• Every name of the symbol must be the one used in the application domain.

• Symbols used as synonyms in the application domain must share one entry in the LEL.

• Symbols having a regular meaning must contain only the application domain sense.

• Notion and behavioral response must be described using simple and direct sentences.

• Each sentence should express only one idea.

• Each sentence should contain only one verb.

• Each sentence should make it easy to identify the perspective (formal or actual).

• If two symbols share a characteristic, it should be repeated in both entries.

• Every notion and behavioral response must have at least one reference to other symbols.

• References to other symbols should be enhanced (underlining, bolding, or any other way).

• Every symbol must be referenced at least by another symbol.

• A symbol's full name should be used when referenced by other symbols.

Connections among LEL entries should be stressed as much as possible using references to other symbols and reducing the use of vocabulary from outside the LEL.
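Some of the description heuristics above lend themselves to mechanical checking. The Python sketch below covers three of them: every symbol has at least one notion and one behavioral response, every notion and behavioral response refers to at least one other symbol, and every symbol is referenced by at least one other symbol. The dictionary layout, the function name check_lexicon, and the sample entries are assumptions made for illustration, and references are detected by naive substring matching on names and synonyms.

```python
def check_lexicon(lexicon):
    """Return (symbol name, problem) pairs for a dict-based lexicon.

    lexicon maps a symbol's main name to a dict with the keys
    "synonyms", "notions", and "responses" (lists of strings).
    """
    problems = []
    referenced = set()

    def mentions(sentence, own_name):
        # Other symbols whose name or synonym appears literally in the sentence.
        text = sentence.lower()
        found = set()
        for other, entry in lexicon.items():
            if other == own_name:
                continue
            if any(n.lower() in text for n in [other, *entry.get("synonyms", [])]):
                found.add(other)
        return found

    for name, entry in lexicon.items():
        for key in ("notions", "responses"):
            sentences = entry.get(key, [])
            if not sentences:
                problems.append((name, f"no {key}"))
            for sentence in sentences:
                hits = mentions(sentence, name)
                referenced |= hits
                if not hits:
                    problems.append((name, f"no reference in: {sentence}"))

    for name in lexicon:
        if name not in referenced:
            problems.append((name, "not referenced by any other symbol"))
    return problems


# Invented entries; the second behavioral response breaks a heuristic on purpose.
lexicon = {
    "clerk": {
        "synonyms": [],
        "notions": ["Person who registers each invoice."],
        "responses": ["The clerk sends the invoice to the supplier."],
    },
    "invoice": {
        "synonyms": ["bill"],
        "notions": ["Document registered by the clerk."],
        "responses": ["It is filed once paid."],
    },
}
problems = check_lexicon(lexicon)
print(problems)  # [('invoice', 'no reference in: It is filed once paid.')]
```

Heuristics about sentence style, such as one idea or one verb per sentence, need human reading or linguistic tooling and are left out of the sketch.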

Verify. Inspections have been applied to requirements documents with great success (Leite, Hadad, Doorn, & Kaplan, 2005). Although the lexicon can be verified using several techniques, an inspection variant based on Fagan's (1976) original proposal has been used. This technique provides specific heuristics to detect defects, which are called discrepancies, errors, or omissions (DEOs) (Kaplan, Hadad, Doorn, & Leite, 2000). Each step in the heuristics is based on a defect-oriented form designed for a given type of defect and is accompanied by guidelines on how to fill in the form and how to analyse what is written in the LEL in order to maximise the number of defects found.

Validate. While identifying and describing symbols, some degree of validation takes place. Later, a more structured validation activity is carried out, allowing the requirements engineer to correct, ratify, or increase the knowledge about the application domain vocabulary. This usually consists of structured interviews or meetings with clients and users at their workplaces. The description of each symbol may be read to the interviewees, who confirm, correct, make observations, or add missing information. Sometimes, instead of reading symbol descriptions during the interview, the engineer could give the interviewee a copy of the LEL in advance. In summary, the validation activity basically aims to do the following.

• Check notions and behavioral responses of symbols already described

• Ratify the definition of symbols

• Identify new symbols and synonyms

The validation activity generates a DEO list similar to the one produced at verification. It is then sent back to the collect step and/or to the describe step so that the necessary corrections can be made. Sometimes the feedback from validation may require identifying new sources of information.

FUTURE TRENDS

Even though the LEL has been used in many studies and practical applications by researchers and practitioners, several aspects remain unknown and some questions remain unanswered. For example, there are no stopping rules defined. Different requirements engineers collect different symbols and different numbers of symbols; however, there is no known criterion for evaluating such differences.

Almost any LEL contains terms related by hypernym and hyponym relationships (Urena Lopez, Garcia Vega, & Martinez Santiago, 2001). This symbol hierarchy can be observed among subjects, objects, and verbs. No benefits from this knowledge have been outlined yet.

CONCLUSION

Glossaries may be used in several stages during the software development process; however, their use during the RE phase is very convenient. Their use eases the understanding of the clients' and users' culture and allows creating all RE documents using their vocabulary.

Other documents produced in later phases of the software development process may also take advantage of the use of LEL.

Creating well-conceived software system context glossaries requires a good understanding of the model and following the heuristics and guidelines developed for this purpose.

Verification and validation of the glossary are key practices in the whole process since they uncover most of the existing defects, improving its quality.

KEY TERMS

Elicitation: Elicitation is the activity of acquiring knowledge of a given kind during the requirements engineering process.

Heuristic: It is a set of guidelines to help people to use others' experience to improve performance in a given task.

Language-Extended Lexicon: An LEL is a semiformal model holding the most relevant words or phrases of the language of the application domain carrying special meaning.

Requirements Engineering: It is an area of software engineering that is responsible for acquiring and defining the needs of the software system.

Software Engineering: It is the computer science discipline concerned with creating and maintaining software applications by applying technologies and practices from computer science, project management, engineering, application domains, and other fields.

Sources of Information: These include documents, key people, books, and so forth that can provide useful information about the matter being studied.

Validation: Validation is the activity of contrasting a model with the actual world. It should answer the question "Are we creating the right model?"

Verification: It is the activity of checking the consistency of different parts of a model, or of different models with one another. It should answer the question "Are we creating the model right?"