A Content-Sensitive Approach to Search in Shared File Storages


Introduction

This article presents a novel approach to searching shared audio file storages such as P2P-based systems. The proposed method recognizes specific patterns in the audio content itself, extending the search capability from the description-based model to the content-based model. The targeted shared file storages tend to change their contents unpredictably; this volatile nature led our development to use real-time-capable methods for the search process.

The importance of real-time pattern recognition algorithms applied to audio data for content-sensitive searching in streaming media has been growing for over a decade (Liu, Wang, & Chen, 1998). The main problem of many such algorithms is the optimal selection of the reference patterns (soundprints, in our approach) used in the recognition procedure. The proposed method is based on distance maximization and can quickly choose the pattern that the recognition algorithm will later use as a reference (Richly, Kozma, Kovacs, & Hosszu, 2001).

The presented method, called EMESE (Experimental MEdia-Stream rEcognizer), is an important part of a lightweight content-searching scheme suitable for investigating network-wide shared file storages. The method was initially applied to real-time monitoring of the occurrence of known sound materials in broadcast audio. The experimental measurement data presented in this article demonstrate the efficiency of the procedure, which is the reason it was adopted in a shared audio database environment.

Background

Since the emergence of Napster (Parker, 2004), Internet-based communication has been developing toward application-level networks (ALNs). Various collaborative applications run on increasingly powerful hosts and create virtual (logical) connections with each other (Hosszu, 2005). They establish a virtual overlay and, in contrast to the older client/server model, use peer-to-peer (P2P) communication. The majority of such systems deal with file sharing (Adar & Huberman, 2000), so an important task for them is searching large, distributed shared file storages.

A large amount of effort is dedicated to improving their searching (Yang & Garcia-Molina, 2002) and downloading capabilities (Cohen, 2003; Qiu & Srikant, 2004). However, the searching itself remains quite traditional: it is based on descriptive metadata of the media content, such as the file name, content description, and so forth. Such a method has an inherent limitation, since the real content remains hidden, and the search can fail whenever the file description is mistaken. In contrast to the widely used description-based search, content-based searching has so far remained a research topic only. The main reason is that content and its coded representations vary so widely that no comprehensive method has been developed yet.

The novel system, EMESE, is dedicated to solving a special problem: finding a small but significant pattern in a large voice stream or bulk voice data file in order to identify known sections of audio. The developed method is lightweight, meaning that its design goals were fast operation and relatively small computing-power requirements. To reach these goals, the length of the pattern to be recognized must be strictly limited, and a total matching score is not required.

This article deals mainly with the heart of the EMESE system, the pattern recognition algorithm, and especially with the creation of the reference pattern, called reference selection.

The Problem of Pattern Recognition

In the field of sound recognition, there are many different methods and applications for specific tasks (Coen, 1995; Kondoz, 1994).

The demand for working efficiently with streaming media on the Internet is increasing rapidly. Audio streams may contain artificial sound effects in addition to a mix of music and human speech, and these effects may include signal fragments that are not audible to the ear. As a consequence, processing this kind of audio signal differs considerably from established methods; for example, the short-term predictability of the signal cannot be exploited.

The representation of a digital audio signal as individual sample values lacks any semantic structure that would help automatic identification. For this reason, the audio signal is transformed into several different orthogonal or quasi-orthogonal bases that make certain properties detectable.

Solutions already exist for classifying the type of broadcast on radio or television using the audio signal. The solution of Akihito, Hamada, and Tonomura (1998) basically makes a speech/music decision by examining the spectrum for harmonic content and the temporal behavior of the spectral-peak distribution. Although it was applied successfully to that decision problem, it cannot be used for generic recognition purposes. Liu et al. (1998) also describe a classification scheme in which the extracted features are based on the short-time spectral distribution, represented by a bandwidth and a central frequency value. Several other features, for example the volume distribution and the pitch contour along the sound clip, are also calculated. The main difficulty with this method is its high computation-time demand, so real-time monitoring is hardly possible when the great number of references to be monitored is taken into account.

A similar monitoring problem was introduced in Lourens (1990), where the used feature, a section of the energy envelope of the record signal (the reference), was correlated with the input (test) signal. The demand for real-time execution drove the development of the recognition scheme introduced in Richly, Varga, Hosszu, and Kovacs (2000) and Richly, Kozma, Kovacs, and Hosszu (2001), which is capable of recognizing a pattern of transformed audio signal in an input stream, even in the presence of level-limited noise. This algorithm first selects a short segment of the signal from each record in the set of records to be monitored.

Sound Identification in EMESE

The reference selection algorithm requires a good understanding of the recognition method. The audio signal, sampled at fs = 16 kHz, is transformed into a spectral description, because the time-domain representation of sound is more sensitive even to light, non-audible distortions. A sound interval is transformed into a block of data whose columns are vectors computed from short, atomic sections of the sound, over which the spectral content can be assumed static. Such a section of time-domain data is called a frame (N = 256 samples, Tf = 16 ms). First, the absolute value of the complex Fourier spectrum, the amplitude spectrum, is computed from the frame. Then amplitude values of neighboring frequencies are averaged to project the spectrum onto the Bark-scale, a non-linear frequency scale. The purpose is to speed up the later comparison stage by reducing the amount of relevant data, and to include a well-established emphasizing tool used in audio processing, the perceptual modeling of the human auditory system. As a result, we get a vector with NB = 20 values computed from the samples of a frame. In the next step, the vector is normalized and quantized: two levels are determined in each transformed frame, at 10% and 70% of the peak value of the amplitude spectrum. We call the transformed frame a slice, to indicate the column-wise recognition method applied to a reference. Each reference consists of Ns = 50 slices computed from non-overlapping consecutive frames, and the audio section from which the reference was made is called the soundprint of that specific record (Tsp = Tf × Ns = 800 ms).

Figure 1. The sound representation in the recognition system

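To make the transformation concrete, the following Python sketch derives a slice from one frame. It is a minimal illustration, not the EMESE implementation: the Bark band edges, the averaging rule, and the interpretation of the 10%/70% thresholds as a three-level quantization are assumptions filled in for the example.

```python
import numpy as np

FS = 16_000   # sampling frequency [Hz]
N = 256       # frame length: 16 ms at 16 kHz
NB = 20       # number of Bark-scale bands

# Assumed upper edges of the first 20 Bark critical bands [Hz]; the exact
# band limits used in EMESE are not given in the article.
BARK_EDGES = [100, 200, 300, 400, 510, 630, 770, 920, 1080, 1270,
              1480, 1720, 2000, 2320, 2700, 3150, 3700, 4400, 5300, 6400]

def frame_to_slice(frame: np.ndarray) -> np.ndarray:
    """Transform one N-sample frame into a quantized Bark-scale slice."""
    # Amplitude spectrum: absolute value of the complex Fourier spectrum.
    spectrum = np.abs(np.fft.rfft(frame))
    freqs = np.fft.rfftfreq(N, d=1.0 / FS)

    # Average neighboring amplitude values into NB Bark bands.
    bands = np.empty(NB)
    lo = 0.0
    for i, hi in enumerate(BARK_EDGES):
        mask = (freqs >= lo) & (freqs < hi)
        bands[i] = spectrum[mask].mean() if mask.any() else 0.0
        lo = hi

    # Normalize to the peak and quantize to three levels with thresholds
    # at 10% and 70% of the peak value of the amplitude spectrum.
    peak = bands.max()
    if peak == 0.0:
        return np.zeros(NB, dtype=np.int8)
    return np.digitize(bands, [0.1 * peak, 0.7 * peak]).astype(np.int8)
```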

The scheme of the recognition algorithm is to grow the already-identified part of each reference pattern continuously, according to the input. The algorithm takes a frame from the input signal, executes the previously described transformation series, and compares the resulting slice to the actual slice of every reference. Initially, the actual slice is the first one in every reference; if it is found to be similar to the slice computed from the input stream (a slice-hit occurs), the next, non-overlapping input slice is compared to the next slice of that reference. If an input slice is found to be non-similar, the actual slice of that reference is reset to the first one. Similarity is evaluated by calculating the weighted Manhattan-distance of the two slices. The Manhattan-distance is the sum of the absolute element-wise differences of the slice vectors. This distance metric was chosen at the initial stage of the project to avoid multiplication, since the system was developed for low-resource microcontroller and ASIC realizations. Weighting is introduced to emphasize the relevant differences in the strongly quantized representation.
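A minimal sketch of this matching step follows, assuming slices are the quantized NumPy vectors produced above. The weight vector, the threshold, and the per-reference state bookkeeping are placeholders (the article does not give their values), and the distance-buffer evaluation of the initial slice-hit, described next, is omitted here.

```python
import numpy as np

def weighted_manhattan(a: np.ndarray, b: np.ndarray, weights: np.ndarray) -> float:
    """Weighted Manhattan-distance: the weighted sum of absolute
    element-wise differences of two slice vectors."""
    return float(np.sum(weights * np.abs(a - b)))

def step_references(input_slice, references, actual, weights, threshold):
    """One monitoring step: compare the input slice to the actual slice of
    every reference; advance on a slice-hit, reset otherwise.
    `references[r]` is an (Ns, NB) array of slices, `actual[r]` the index
    of its actual slice. Returns the indices of fully identified references."""
    identified = []
    for r, ref in enumerate(references):
        if weighted_manhattan(input_slice, ref[actual[r]], weights) <= threshold:
            actual[r] += 1                  # slice-hit: grow the match
            if actual[r] == len(ref):       # last slice also identified
                identified.append(r)
                actual[r] = 0
        else:
            actual[r] = 0                   # reset to the first slice
    return identified
```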

To achieve a more accurate alignment between the test and reference signals, the initial slice-hit in a reference is evaluated using a distance buffer, as demonstrated in Figure 2. In this circular memory, the distances of the first reference slice to overlapping test slices are stored, and the middle of the buffer is examined to see whether it contains the lowest value in the buffer. If it does, and it also satisfies the threshold criterion, the identification of the reference proceeds to the next reference slice. This method aligns the identification process to the "distance-pit" described in the next section.

Figure 2. The synchronization mechanism

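The buffer logic of Figure 2 might be sketched as follows; the buffer length is a hypothetical choice, since the article does not specify it.

```python
from collections import deque

def sync_hit(buffer: deque, new_distance: float, threshold: float) -> bool:
    """Evaluate the initial slice-hit in a circular distance buffer: accept
    only when the middle element is the buffer minimum and also satisfies
    the threshold, i.e., it is the bottom of the 'distance-pit'."""
    buffer.append(new_distance)             # deque(maxlen=L) drops the oldest
    if len(buffer) < buffer.maxlen:
        return False                        # buffer not yet filled
    middle = list(buffer)[buffer.maxlen // 2]
    return middle == min(buffer) and middle <= threshold

# Usage: one buffer per reference; the length 9 is an arbitrary choice here.
# buf = deque(maxlen=9)
# if sync_hit(buf, d, threshold): ...proceed to the second reference slice...
```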

After successfully identifying the last slice of a reference, the corresponding record has been identified in the monitored input.

The Method of Selecting the Reference Patterns

The selection algorithm uses the previously described weighted Manhattan-distance to measure the similarity of audio segments. In the vicinity of the reference's beginning, there have to be frames that differ greatly in the sense of the applied distance metric. This must be fulfilled because otherwise the pattern recognition algorithm cannot synchronize to the given reference, since the record may appear anywhere in the monitored signal. This way, a robust synchronization can be realized that succeeds even in the presence of noise.

Figure 3. The "distance-pit" around the reference position

The "distance-pit" around the reference position

Figure 4. The width of the pits around the sound-print candidate


If we take a long sound-print (reference candidate) from a record to be monitored and calculate the distance of this section all along the record, we can observe that the distance function has a local minimum, a pit, around the candidate's position (Richly, Kozma, Hosszu, & Kovacs, 2001). This is demonstrated in Figure 3, where the x-axis shows which frame of the record is compared with the selected candidate, while the y-axis shows the Manhattan-distance values.

To achieve robust synchronization during recognition, we must guarantee a large Manhattan-distance between the candidate and its vicinity. This is assured if the slope of the pit, as shown in Figure 3, is as steep as possible. To select the best distance-pit and its corresponding candidate section, we should determine the steepness of the pit side. However, because the steepness is generally not constant, we instead calculate the width of the pit at a given distance value. Figure 4 shows the pit width of 100 candidate sections, where the sections are extracted from the same record such that their first samples are consecutive in the record. In Figure 4, the horizontal axis is the sample position of the candidate in the record, while the vertical axis is the width of the pit at a record-adaptive level.
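Under the same assumptions as the earlier sketches, the distance curve d(i) and the pit width at a given level might be computed like this; the record-adaptive choice of the level itself is left as a parameter.

```python
import numpy as np

def distance_curve(candidate: np.ndarray, slices, weights: np.ndarray) -> np.ndarray:
    """d(i): the weighted Manhattan-distance between the candidate and the
    slice starting at every position i of the examined section."""
    return np.array([np.sum(weights * np.abs(candidate - s)) for s in slices])

def pit_width(d: np.ndarray, level: float) -> int:
    """Width of the distance-pit: the number of consecutive positions
    around the minimum of d(i) whose distance stays below `level`."""
    center = int(np.argmin(d))
    left = center
    while left > 0 and d[left - 1] < level:
        left -= 1
    right = center
    while right < len(d) - 1 and d[right + 1] < level:
        right += 1
    return right - left + 1
```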

Our reference selection algorithm is based on the same principle, but since our pattern recognition method uses the first frame as a kernel and grows the match from that first frame, we observe the pit width for one-frame-long candidates. The minimum of this function has to be found without calculating every point of it.

We must also ensure that the selected reference does not occur anywhere else in the record or in any other record. Using database terminology, we would say that the reference must be a key. To guarantee unambiguous identification, we must try to identify the selected reference in all the other records, and if it is not a unique key, a new reference must be selected.

The exact solution would require us to compare every reference candidate to every other one. Even for a few records, this would mean so many comparisons that it could not be completed in a reasonable time.

In the presented algorithm, we tried to keep the number of comparisons as low as possible. To do so, we examine only the vicinity of the reference candidate, in a region of width w, where w is expressed in number of samples. We also do not examine all possible reference candidates, only every 100th. The algorithm is listed in Table 1, and a condensed code sketch follows the table.

Results

Using this algorithm, we selected references from 69 advertisements that had previously been recorded from a live Internet audio stream. During the tests, we monitored these 69 advertisements as they were broadcast in a test Internet media stream. We added white noise to the input signal to test the robustness of the system. The duration of the test was 48 hours. The recognition results are shown in Figure 5.

Table 1. The reference selection algorithm of the EMESE

1. In the first turn, the reference candidate is selected from the y-th sample of the first region of the given record (the region is w = 5000 samples wide). The first region begins at the first sample of the record and defines the first sample of the frame.

2. The frame is compared to all possible frames in the region according to the distance metric described above. As a result we get d(i), where i = 0 ... w−N, as shown in Figure 3.

3. The next region is selected k·N samples forward in the record, and step 2 is repeated. Further regions are selected the same way, and the corresponding d(i) is calculated, until the end of the record is reached.

4. The steepest pit and the corresponding frame position iopt in the record are selected by examining all the d(i) functions for the narrowest pit.

5. In the k·N vicinity of position iopt, the frame with the narrowest distance-pit is determined using a gradient search algorithm. This is the lowest point of the function in Figure 4.

6. The reference, consisting of NR slices (transformed frames), is extracted from the record, beginning with the frame selected in the previous step.

7. This reference is tested for uniqueness using the recognition algorithm. If the reference appears in the record more than once, not only at the correct position, then the next-best reference must be selected in the previously described way.

8. The reference is then tested against all the other records, to filter out in-set false alarms. If the reference is found in any other record, step 7 is used for reference reselection.

9. The above steps are applied to all the other records.
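Reusing frame_to_slice, distance_curve, and pit_width from the earlier sketches, steps 1 through 5 might be condensed as below. The region-step factor K, the weights, and the adaptive level are placeholders, and the uniqueness tests of steps 7 through 9 are omitted.

```python
import numpy as np

W = 5000   # region width in samples (step 1)
K = 4      # hypothetical region-step factor; the article's k is not specified
N = 256    # frame length in samples

def overlapping_slices(samples: np.ndarray):
    """Quantized slices starting at every sample position of a region."""
    return [frame_to_slice(samples[i:i + N])
            for i in range(len(samples) - N + 1)]

def select_reference_position(record: np.ndarray, weights, level) -> int:
    """Condensed sketch of steps 1-5: scan the record region by region and
    return the frame position whose distance-pit is the narrowest.
    The uniqueness tests of steps 7-9 are omitted."""
    best_width, best_pos = np.inf, 0
    for start in range(0, len(record) - W + 1, K * N):
        slices = overlapping_slices(record[start:start + W])
        candidate = slices[0]               # candidate: first frame of region
        d = distance_curve(candidate, slices, weights)
        width = pit_width(d, level)
        if width < best_width:
            best_width, best_pos = width, start + int(np.argmin(d))
    return best_pos
```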

Figure 5. Percentage of patterns successfully identified by the recognition algorithm


We also measured the performance of the system, namely how many references can be handled in real time. The computer used was equipped with a Pentium II 350 MHz processor and 256 MB of RAM, and the maximum possible number of references was 258. When a record is added to the monitored set, reference selection must be performed for the new record, and the new references have to be checked for false alarms. If a possible false alarm is detected due to representative signal similarity, the selection must be repeated for the whole set. This takes 525 minutes in the case of 69 records; however, this is a worst-case scenario and should be very rare. The average selection time for each new record is 10 minutes.

The second test was a synchronization test. We selected 50 frames from the stream and observed how the width of the pit depends on the noise level, as well as the noise level at which the monitoring algorithm can no longer synchronize to the frame. The result of this test is shown in Figure 6.

Figure 6. Result of the synchronization test


Future Trends

Tests of various pattern recognition methods carried out on live audio broadcasts showed that the success of the identification process depends on the proper selection of the representative short segment. The position from which this representative segment can be extracted is determined by the recognition algorithm. The selected references must be uncorrelated with each other to avoid false alarms.

The proposed method, EMESE, is an example of Internet-oriented pattern recognition algorithms, which can be used for content-based search in media streams and files. The performed experiments examined how a monitoring system can synchronize to the stream under various conditions. The measured results proved the efficiency of the method; however, more research effort is necessary to reach a comprehensive content-based search method.

The Bark-spectrum as the basis of the audio feature offers the possibility of easily processing sampled audio with different sampling properties, since it hides the frequency resolution of the underlying Fast Fourier Transform. However, this generalization calls for further investigation to define the usable sound-print size and comparison thresholds, because the sizes of the feature's frequency bands, which are defined by the sampling frequency, are unequal.

The system is designed so that parameters and thresholds can be changed quickly, in order to investigate the behavior of longer sound-prints, up to a whole record. The idea is to apply the method to the automatic identification of even corrupted audio files.

Conclusion

The main directions of searching in shared file storages have been described, especially a novel method based on reference selection. The proposed method was realized for an existing real-time recognition algorithm that is used on live audio streams to identify specific sound signals. The selection algorithm takes the properties of the recognition algorithm into account. The algorithm was tested on Internet media streams with a prerecorded signal set, and good results were reached. Further tests should be carried out to determine the exact effect of the input noise level on the width of the distance-pit.

The lightweight pattern recognition methods presented earlier have proved that, in an Internet-wide environment, it is possible to realize a fast recognition method that does not require extraordinary CPU time or other computing resources. The increasing demand for searching various media contents drives the popularity of content-recognition tools. Although results exist already, further research must be carried out in this field.

Key Terms

Application Level Network (ALN): The applications running on hosts can create a virtual network from their logical connections. This virtual network is also called an overlay (see later in this section). The operation of such software entities cannot be understood without knowing their logical relations. In most cases, ALN software entities use the P2P model (see later in this section) rather than the client/server model (see later in this section) for communication.

Audio Signal Processing: The coding, decoding, playing, and content handling of audio data files and streams.

Bark-Scale: A non-linear frequency scale modeling the resolution of the human hearing system. One Bark on the Bark-scale equals the so-called critical bandwidth, which is linearly proportional to the frequency below 500 Hz and logarithmically proportional above that. The critical bandwidth can be measured by the simultaneous frequency-masking effect of the ear.

Client/Server Model: A communication model in which one host has more functionality than the other. It differs from the P2P model (see later in this section).

Manhattan-Distance: The L1 metric for points of Euclidean space, defined by summing the absolute coordinate differences of the two points (|x2−x1| + |y2−y1| + ...). Also known as the city-block or taxicab distance: it is the distance a car drives in a lattice-like street pattern.

Overlay: The applications that create an ALN (see earlier in this section) work together, usually following the P2P communication model (see later in this section).

Pattern Recognition: The procedure of finding a certain series of signals in a longer data file or signal stream.

Peer-to-Peer (P2P) Model: A communication model in which each node has the same authority and communication capability. The nodes create a virtual network overlaid on the Internet, and its members organize themselves into a topology for data transmission.

Synchronization: The procedure of finding the appropriate points in two or more streams for correct parallel playback.