Lecture Video Indexing and Retrieval Using Topic Keywords
In this paper, we propose a framework to help users to search and retrieve the portions in the lecture video of their interest. This is achieved by temporally segmenting and indexing the lecture video using the topic keywords. We use transcribed text from the video and documents relevant to the video topic extracted from the web for this purpose. The keywords for indexing are found by applying the non-negative matrix factorization (NMF) topic modeling techniques on the web documents. Our proposed technique first creates indices on the transcribed documents using the topic keywords, and these are mapped to the video to find the start and end time of the portions of the video for a particular topic. This time information is stored in the index table along with the topic keyword which is used to retrieve the specific portions of the video for the query provided by the users.
Compressed Suffix Arrays to Self-Indexes Based on Partitioned Elias-Fano
A practical and simple self-indexing data structure, Partitioned Elias-Fano (PEF) - Compressed Suffix Arrays (CSA), is built in linear time for the CSA based on PEF indexes. Moreover, the PEF-CSA is compared with two classical compressed indexing methods, Ferragina and Manzini implementation (FMI) and Sad-CSA on different type and size files in Pizza & Chili. The PEF-CSA performs better on the existing data in terms of the compression ratio, count, and locates time except for the evenly distributed data such as proteins data. The observations of the experiments are that the distribution of the φ is more important than the alphabet size on the compression ratio. Unevenly distributed data φ makes better compression effect, and the larger the size of the hit counts, the longer the count and locate time.
Shot Boundary Detection Using Octagon Square Search Pattern
In this paper, a shot boundary detection method is presented using octagon square search pattern. The color, edge, motion and texture features of each frame are extracted and used in shot boundary detection. The motion feature is extracted using octagon square search pattern. Then, the transition detection method is capable of detecting the shot or non-shot boundaries in the video using the feature weight values. Experimental results are evaluated in TRECVID video test set containing various types of shot transition with lighting effects, object and camera movement within the shots. Further, this paper compares the experimental results of the proposed method with existing methods. It shows that the proposed method outperforms the state-of-art methods for shot boundary detection.
Semantic Indexing Approach of a Corpora Based On Ontology
The growth in the volume of text data such as books
and articles in libraries for centuries has imposed to establish
effective mechanisms to locate them. Early techniques such as
abstraction, indexing and the use of classification categories have
marked the birth of a new field of research called "Information
Retrieval". Information Retrieval (IR) can be defined as the task of
defining models and systems whose purpose is to facilitate access to
a set of documents in electronic form (corpus) to allow a user to find
the relevant ones for him, that is to say, the contents which matches
with the information needs of the user. This paper presents a new
semantic indexing approach of a documentary corpus. The indexing
process starts first by a term weighting phase to determine the
importance of these terms in the documents. Then the use of a
thesaurus like Wordnet allows moving to the conceptual level.
Each candidate concept is evaluated by determining its level of
representation of the document, that is to say, the importance of the
concept in relation to other concepts of the document. Finally, the
semantic index is constructed by attaching to each concept of the
ontology, the documents of the corpus in which these concepts are
Content-Based Color Image Retrieval Based On 2-D Histogram and Statistical Moments
In this paper, we are interested in the problem of
finding similar images in a large database. For this purpose we
propose a new algorithm based on a combination of the 2-D
histogram intersection in the HSV space and statistical moments. The
proposed histogram is based on a 3x3 window and not only on the
intensity of the pixel. This approach overcome the drawback of the
conventional 1-D histogram which is ignoring the spatial distribution
of pixels in the image, while the statistical moments are used to
escape the effects of the discretisation of the color space which is
intrinsic to the use of histograms. We compare the performance of
our new algorithm to various methods of the state of the art and we
show that it has several advantages. It is fast, consumes little memory
and requires no learning. To validate our results, we apply this
algorithm to search for similar images in different image databases.
3D Objects Indexing with a Direct and Analytical Method for Calculating the Spherical Harmonics Coefficients
In this paper, we propose a new method for threedimensional
object indexing based on D.A.M.C-S.H.C descriptor
(Direct and Analytical Method for Calculating the Spherical
Harmonics Coefficients). For this end, we propose a direct
calculation of the coefficients of spherical harmonics with perfect
precision. The aims of the method are to minimize, the processing
time on the 3D objects database and the searching time of similar
objects to a request object.
Firstly we start by defining the new descriptor using a new
division of 3-D object in a sphere. Then we define a new distance
which will be tested and prove his efficiency in the search for similar
objects in the database in which we have objects with very various
and important size.
SC-LSH: An Efficient Indexing Method for Approximate Similarity Search in High Dimensional Space
Locality Sensitive Hashing (LSH) is one of the most
promising techniques for solving nearest neighbour search problem in
high dimensional space. Euclidean LSH is the most popular variation
of LSH that has been successfully applied in many multimedia
applications. However, the Euclidean LSH presents limitations that
affect structure and query performances. The main limitation of the
Euclidean LSH is the large memory consumption. In order to achieve
a good accuracy, a large number of hash tables is required. In this
paper, we propose a new hashing algorithm to overcome the storage
space problem and improve query time, while keeping a good
accuracy as similar to that achieved by the original Euclidean LSH.
The Experimental results on a real large-scale dataset show that the
proposed approach achieves good performances and consumes less
memory than the Euclidean LSH.
Fuzzy C-Means Clustering for Biomedical Documents Using Ontology Based Indexing and Semantic Annotation
Search is the most obvious application of information
retrieval. The variety of widely obtainable biomedical data is
enormous and is expanding fast. This expansion makes the existing
techniques are not enough to extract the most interesting patterns
from the collection as per the user requirement. Recent researches are
concentrating more on semantic based searching than the traditional
term based searches. Algorithms for semantic searches are
implemented based on the relations exist between the words of the
documents. Ontologies are used as domain knowledge for identifying
the semantic relations as well as to structure the data for effective
information retrieval. Annotation of data with concepts of ontology is
one of the wide-ranging practices for clustering the documents. In
this paper, indexing based on concept and annotation are proposed
for clustering the biomedical documents. Fuzzy c-means (FCM)
clustering algorithm is used to cluster the documents. The
performances of the proposed methods are analyzed with traditional
term based clustering for PubMed articles in five different diseases
communities. The experimental results show that the proposed
methods outperform the term based fuzzy clustering.
Systematic Functional Analysis Methods for Design Retrieval and Documentation
Apart from geometry, functionality is one of the most
significant hallmarks of a product. The functionality of a product can
be considered as the fundamental justification for a product
existence. Therefore a functional analysis including a complete and
reliable descriptor has a high potential to improve product
development process in various fields especially in knowledge-based
design. One of the important applications of the functional analysis
and indexing is in retrieval and design reuse concept. More than 75%
of design activity for a new product development contains reusing
earlier and existing design know-how. Thus, analysis and
categorization of product functions concluded by functional
indexing, influences directly in design optimization. This paper
elucidates and evaluates major classes for functional analysis by
discussing their major methods. Moreover it is finalized by
presenting a noble hybrid approach for functional analysis.
Visualization and Indexing of Spectral Databases
On-line (near infrared) spectroscopy is widely used to support the operation of complex process systems. Information extracted from spectral database can be used to estimate unmeasured product properties and monitor the operation of the process. These techniques are based on looking for similar spectra by nearest neighborhood algorithms and distance based searching methods. Search for nearest neighbors in the spectral space is an NP-hard problem, the computational complexity increases by the number of points in the discrete spectrum and the number of samples in the database. To reduce the calculation time some kind of indexing could be used. The main idea presented in this paper is to combine indexing and visualization techniques to reduce the computational requirement of estimation algorithms by providing a two dimensional indexing that can also be used to visualize the structure of the spectral database. This 2D visualization of spectral database does not only support application of distance and similarity based techniques but enables the utilization of advanced clustering and prediction algorithms based on the Delaunay tessellation of the mapped spectral space. This means the prediction has not to use the high dimension space but can be based on the mapped space too. The results illustrate that the proposed method is able to segment (cluster) spectral databases and detect outliers that are not suitable for instance based learning algorithms.
Content-based Retrieval of Medical Images
With the advance of multimedia and diagnostic
images technologies, the number of radiographic images is increasing
constantly. The medical field demands sophisticated systems for
search and retrieval of the produced multimedia document. This
paper presents an ongoing research that focuses on the semantic
content of radiographic image documents to facilitate semantic-based
radiographic image indexing and a retrieval system. The proposed
model would divide a radiographic image document, based on its
semantic content, and would be converted into a logical structure or
a semantic structure. The logical structure represents the overall
organization of information. The semantic structure, which is bound
to logical structure, is composed of semantic objects with
interrelationships in the various spaces in the radiographic image.
NOHIS-Tree: High-Dimensional Index Structure for Similarity Search
In Content-Based Image Retrieval systems it is
important to use an efficient indexing technique in order to perform
and accelerate the search in huge databases. The used indexing
technique should also support the high dimensions of image features.
In this paper we present the hierarchical index NOHIS-tree (Non
Overlapping Hierarchical Index Structure) when we scale up to very
large databases. We also present a study of the influence of clustering
on search time. The performance test results show that NOHIS-tree
performs better than SR-tree. Tests also show that NOHIS-tree keeps
its performances in high dimensional spaces. We include the
performance test that try to determine the number of clusters in
NOHIS-tree to have the best search time.
Image Indexing Using a Color Similarity Metric based on the Human Visual System
The novelty proposed in this study is twofold and consists in the developing of a new color similarity metric based on the human visual system and a new color indexing based on a textual approach. The new color similarity metric proposed is based on the color perception of the human visual system. Consequently the results returned by the indexing system can fulfill as much as possibile the user expectations. We developed a web application to collect the users judgments about the similarities between colors, whose results are used to estimate the metric proposed in this study. In order to index the image's colors, we used a text indexing engine to facilitate the integration of visual features in a database of text documents. The textual signature is build by weighting the image's colors in according to their occurrence in the image. The use of a textual indexing engine, provide us a simple, fast and robust solution to index images. A typical usage of the system proposed in this study, is the development of applications whose data type is both visual and textual. In order to evaluate the proposed method we chose a price comparison engine as a case of study, collecting a series of commercial offers containing the textual description and the image representing a specific commercial offer.
Fast Database Indexing for Large Protein Sequence Collections Using Parallel N-Gram Transformation Algorithm
With the rapid development in the field of life
sciences and the flooding of genomic information, the need for faster
and scalable searching methods has become urgent. One of the
approaches that were investigated is indexing. The indexing methods
have been categorized into three categories which are the lengthbased
index algorithms, transformation-based algorithms and mixed
techniques-based algorithms. In this research, we focused on the
transformation based methods. We embedded the N-gram method
into the transformation-based method to build an inverted index
table. We then applied the parallel methods to speed up the index
building time and to reduce the overall retrieval time when querying
the genomic database. Our experiments show that the use of N-Gram
transformation algorithm is an economical solution; it saves time and
space too. The result shows that the size of the index is smaller than
the size of the dataset when the size of N-Gram is 5 and 6. The
parallel N-Gram transformation algorithm-s results indicate that the
uses of parallel programming with large dataset are promising which
can be improved further.
Choosing R-tree or Quadtree Spatial DataIndexing in One Oracle Spatial Database System to Make Faster Showing Geographical Map in Mobile Geographical Information System Technology
The latest Geographic Information System (GIS)
technology makes it possible to administer the spatial components of
daily “business object," in the corporate database, and apply suitable
geographic analysis efficiently in a desktop-focused application. We
can use wireless internet technology for transfer process in spatial
data from server to client or vice versa. However, the problem in
wireless Internet is system bottlenecks that can make the process of
transferring data not efficient. The reason is large amount of spatial
data. Optimization in the process of transferring and retrieving data,
however, is an essential issue that must be considered. Appropriate
decision to choose between R-tree and Quadtree spatial data indexing
method can optimize the process. With the rapid proliferation of
these databases in the past decade, extensive research has been
conducted on the design of efficient data structures to enable fast
spatial searching. Commercial database vendors like Oracle have also
started implementing these spatial indexing to cater to the large and
diverse GIS. This paper focuses on the decisions to choose R-tree
and quadtree spatial indexing using Oracle spatial database in mobile
GIS application. From our research condition, the result of using
Quadtree and R-tree spatial data indexing method in one single
spatial database can save the time until 42.5%.
Query Optimization Techniques for XML Databases
Over the past few years, XML (eXtensible Mark-up
Language) has emerged as the standard for information
representation and data exchange over the Internet. This paper
provides a kick-start for new researches venturing in XML databases
field. We survey the storage representation for XML document,
review the XML query processing and optimization techniques with
respect to the particular storage instance. Various optimization
technologies have been developed to solve the query retrieval and
updating problems. Towards the later year, most researchers
proposed hybrid optimization techniques. Hybrid system opens the
possibility of covering each technology-s weakness by its strengths.
This paper reviews the advantages and limitations of optimization
EEIA: Energy Efficient Indexed Aggregation in Smart Wireless Sensor Networks
The main idea behind in network aggregation is that,
rather than sending individual data items from sensors to sinks,
multiple data items are aggregated as they are forwarded by the
sensor network. Existing sensor network data aggregation techniques
assume that the nodes are preprogrammed and send data to a central
sink for offline querying and analysis. This approach faces two major
drawbacks. First, the system behavior is preprogrammed and cannot
be modified on the fly. Second, the increased energy wastage due to
the communication overhead will result in decreasing the overall
system lifetime. Thus, energy conservation is of prime consideration
in sensor network protocols in order to maximize the network-s
operational lifetime. In this paper, we give an energy efficient
approach to query processing by implementing new optimization
techniques applied to in-network aggregation. We first discuss earlier
approaches in sensors data management and highlight their
disadvantages. We then present our approach “Energy Efficient
Indexed Aggregation" (EEIA) and evaluate it through several
simulations to prove its efficiency, competence and effectiveness.
Region-Based Segmentation of Generic Video Scenes Indexing
In this work we develop an object extraction method
and propose efficient algorithms for object motion characterization.
The set of proposed tools serves as a basis for development of objectbased
functionalities for manipulation of video content. The
estimators by different algorithms are compared in terms of quality
and performance and tested on real video sequences. The proposed
method will be useful for the latest standards of encoding and
description of multimedia content – MPEG4 and MPEG7.
Distributional Semantics Approach to Thai Word Sense Disambiguation
Word sense disambiguation is one of the most important open problems in natural language processing applications such as information retrieval and machine translation. Many approach strategies can be employed to resolve word ambiguity with a reasonable degree of accuracy. These strategies are: knowledgebased, corpus-based, and hybrid-based. This paper pays attention to the corpus-based strategy that employs an unsupervised learning method for disambiguation. We report our investigation of Latent Semantic Indexing (LSI), an information retrieval technique and unsupervised learning, to the task of Thai noun and verbal word sense disambiguation. The Latent Semantic Indexing has been shown to be efficient and effective for Information Retrieval. For the purposes of this research, we report experiments on two Thai polysemous words, namely /hua4/ and /kep1/ that are used as a representative of Thai nouns and verbs respectively. The results of these experiments demonstrate the effectiveness and indicate the potential of applying vector-based distributional information measures to semantic disambiguation.
Non-Overlapping Hierarchical Index Structure for Similarity Search
In order to accelerate the similarity search in highdimensional database, we propose a new hierarchical indexing method. It is composed of offline and online phases. Our contribution concerns both phases. In the offline phase, after gathering the whole of the data in clusters and constructing a hierarchical index, the main originality of our contribution consists to develop a method to construct bounding forms of clusters to avoid overlapping. For the online phase, our idea improves considerably performances of similarity search. However, for this second phase, we have also developed an adapted search algorithm. Our method baptized NOHIS (Non-Overlapping Hierarchical Index Structure) use the Principal Direction Divisive Partitioning (PDDP) as algorithm of clustering. The principle of the PDDP is to divide data recursively into two sub-clusters; division is done by using the hyper-plane orthogonal to the principal direction derived from the covariance matrix and passing through the centroid of the cluster to divide. Data of each two sub-clusters obtained are including by a minimum bounding rectangle (MBR). The two MBRs are directed according to the principal direction. Consequently, the nonoverlapping between the two forms is assured. Experiments use databases containing image descriptors. Results show that the proposed method outperforms sequential scan and SRtree in processing k-nearest neighbors.
A Content Vector Model for Text Classification
As a popular rank-reduced vector space approach,
Latent Semantic Indexing (LSI) has been used in information
retrieval and other applications. In this paper, an LSI-based content
vector model for text classification is presented, which constructs
multiple augmented category LSI spaces and classifies text by their
content. The model integrates the class discriminative information
from the training data and is equipped with several pertinent feature
selection and text classification algorithms. The proposed classifier
has been applied to email classification and its experiments on a
benchmark spam testing corpus (PU1) have shown that the approach
represents a competitive alternative to other email classifiers based
on the well-known SVM and naïve Bayes algorithms.
Enhancing Retrieval Effectiveness of Malay Documents by Exploiting Implicit Semantic Relationship between Words
Phrases has a long history in information retrieval, particularly in commercial systems. Implicit semantic relationship between words in a form of BaseNP have shown significant improvement in term of precision in many IR studies. Our research focuses on linguistic phrases which is language dependent. Our results show that using BaseNP can improve performance although above 62% of words formation in Malay Language based on derivational affixes and suffixes.
Spatio-Temporal Video Slice Edges Analysis for Shot Transition Detection and Classification
In this work we will present a new approach for shot transition auto-detection. Our approach is based on the analysis of Spatio-Temporal Video Slice (STVS) edges extracted from videos. The proposed approach is capable to efficiently detect both abrupt shot transitions 'cuts' and gradual ones such as fade-in, fade-out and dissolve. Compared to other techniques, our method is distinguished by its high level of precision and speed. Those performances are obtained due to minimizing the problem of the boundary shot detection to a simple 2D image partitioning problem.
Indexing and Searching of Image Data in Multimedia Databases Using Axial Projection
This paper introduces and studies new indexing techniques for content-based queries in images databases. Indexing is the key to providing sophisticated, accurate and fast searches for queries in image data. This research describes a new indexing approach, which depends on linear modeling of signals, using bases for modeling. A basis is a set of chosen images, and modeling an image is a least-squares approximation of the image as a linear combination of the basis images. The coefficients of the basis images are taken together to serve as index for that image. The paper describes the implementation of the indexing scheme, and presents the findings of our extensive evaluation that was conducted to optimize (1) the choice of the basis matrix (B), and (2) the size of the index A (N). Furthermore, we compare the performance of our indexing scheme with other schemes. Our results show that our scheme has significantly higher performance.
Evaluating Content Based Image Retrieval Techniques with the One Million Images CLIC Test Bed
Pattern recognition and image recognition methods are commonly developed and tested using testbeds, which contain known responses to a query set. Until now, testbeds available for image analysis and content-based image retrieval (CBIR) have been scarce and small-scale. Here we present the one million images CEA-List Image Collection (CLIC) testbed that we have produced, and report on our use of this testbed to evaluate image analysis merging techniques. This testbed will soon be made publicly available through the EU MUSCLE Network of Excellence.
Grouping and Indexing Color Features for Efficient Image Retrieval
Content-based Image Retrieval (CBIR) aims at searching image databases for specific images that are similar to a given query image based on matching of features derived from the image content. This paper focuses on a low-dimensional color based indexing technique for achieving efficient and effective retrieval performance. In our approach, the color features are extracted using the mean shift algorithm, a robust clustering technique. Then the cluster (region) mode is used as representative of the image in 3-D color space. The feature descriptor consists of the representative color of a region and is indexed using a spatial indexing method that uses *R -tree thus avoiding the high-dimensional indexing problems associated with the traditional color histogram. Alternatively, the images in the database are clustered based on region feature similarity using Euclidian distance. Only representative (centroids) features of these clusters are indexed using *R -tree thus improving the efficiency. For similarity retrieval, each representative color in the query image or region is used independently to find regions containing that color. The results of these methods are compared. A JAVA based query engine supporting query-by- example is built to retrieve images by color.
Object-Based Image Indexing and Retrieval in DCT Domain using Clustering Techniques
In this paper, we present a new and effective image indexing technique that extracts features directly from DCT domain. Our proposed approach is an object-based image indexing. For each block of size 8*8 in DCT domain a feature vector is extracted. Then, feature vectors of all blocks of image using a k-means algorithm is clustered into groups. Each cluster represents a special object of the image. Then we select some clusters that have largest members after clustering. The centroids of the selected clusters are taken as image feature vectors and indexed into the database. Also, we propose an approach for using of proposed image indexing method in automatic image classification. Experimental results on a database of 800 images from 8 semantic groups in automatic image classification are reported.
Concept Indexing using Ontology and Supervised Machine Learning
Nowadays, ontologies are the only widely accepted paradigm for the management of sharable and reusable knowledge in a way that allows its automatic interpretation. They are collaboratively created across the Web and used to index, search and annotate documents. The vast majority of the ontology based approaches, however, focus on indexing texts at document level. Recently, with the advances in ontological engineering, it became clear that information indexing can largely benefit from the use of general purpose ontologies which aid the indexing of documents at word level. This paper presents a concept indexing algorithm, which adds ontology information to words and phrases and allows full text to be searched, browsed and analyzed at different levels of abstraction. This algorithm uses a general purpose ontology, OntoRo, and an ontologically tagged corpus, OntoCorp, both developed for the purpose of this research. OntoRo and OntoCorp are used in a two-stage supervised machine learning process aimed at generating ontology tagging rules. The first experimental tests show a tagging accuracy of 78.91% which is encouraging in terms of the further improvement of the algorithm.