|Commenced in January 1999||Frequency: Monthly||Edition: International||Paper Count: 23|
In industrial areas, the measurement outputs usually contain small amounts of outliers that can lead to imprecise parameter estimation of system models. In order to solve the problem, this paper proposes a correlation analysis-based swarm intelligent identification algorithm. Firstly, the correlation analysis is employed to get equivalent finite impulse response (FIR) models. Since the growth of number of effective samples lowers the percentage of outliers, the bias caused by outliers can be minimized. However, the unknown-but-bounded errors due to the previous correlation analysis may lead to biased parameter estimation over the finite data windows. Taking advantage of global search capability of particle swarm optimization (PSO) and exactly local optimization of new Luus-Jaakola (NLJ), a PSO-NLJ intelligent algorithm is developed to identify parameters of a multi-input multi-output (MIMO) system. The simulation example shows that the proposed method works well.
EEG is a very complex signal with noises and other bio-potential interferences. EOG is the most distinct interfering signal when EEG signals are measured and analyzed. It is very important how to process raw EEG signals in order to obtain useful information. In this study, the EEG signal processing techniques such as EOG filtering and outlier removal were examined to minimize unwanted EOG signals and other noises. The two different mental states of resting and focusing were examined through EEG analysis. A focused state was induced by letting subjects to watch a red dot on the white screen. EEG data for 32 healthy subjects were measured. EEG data after 60-Hz notch filtering were processed by a commercially available EOG filtering and our presented algorithm based on the removal of outliers. The ratio of beta wave to theta wave was used as a parameter for determining the degree of focusing. The results show that our algorithm was more appropriate than the existing EOG filtering.
Rough set theory is used to handle uncertainty and incomplete information by applying two accurate sets, Lower approximation and Upper approximation. In this paper, the rough clustering algorithms are improved by adopting the Similarity, Dissimilarity–Similarity and Entropy based initial centroids selection method on three different clustering algorithms namely Entropy based Rough K-Means (ERKM), Similarity based Rough K-Means (SRKM) and Dissimilarity-Similarity based Rough K-Means (DSRKM) were developed and executed by yeast dataset. The rough clustering algorithms are validated by cluster validity indexes namely Rand and Adjusted Rand indexes. An experimental result shows that the ERKM clustering algorithm perform effectively and delivers better results than other clustering methods. Outlier detection is an important task in data mining and very much different from the rest of the objects in the clusters. Entropy based Rough Outlier Factor (EROF) method is seemly to detect outlier effectively for yeast dataset. In rough K-Means method, by tuning the epsilon (ᶓ) value from 0.8 to 1.08 can detect outliers on boundary region and the RKM algorithm delivers better results, when choosing the value of epsilon (ᶓ) in the specified range. An experimental result shows that the EROF method on clustering algorithm performed very well and suitable for detecting outlier effectively for all datasets. Further, experimental readings show that the ERKM clustering method outperformed the other methods.
An attempt has been made in the present communication to elucidate the efficacy of robust ANOVA methods to analyse horticultural field experimental data in the presence of outliers. Results obtained fortify the use of robust ANOVA methods as there was substantiate reduction in error mean square, and hence the probability of committing Type I error, as compared to the regular approach.
At-site flood frequency analysis is used to estimate flood quantiles when at-site record length is reasonably long. In Australia, FLIKE software has been introduced for at-site flood frequency analysis. The advantage of FLIKE is that, for a given application, the user can compare a number of most commonly adopted probability distributions and parameter estimation methods relatively quickly using a windows interface. The new version of FLIKE has been incorporated with the multiple Grubbs and Beck test which can identify multiple numbers of potentially influential low flows. This paper presents a case study considering six catchments in eastern Australia which compares two outlier identification tests (original Grubbs and Beck test and multiple Grubbs and Beck test) and two commonly applied probability distributions (Generalized Extreme Value (GEV) and Log Pearson type 3 (LP3)) using FLIKE software. It has been found that the multiple Grubbs and Beck test when used with LP3 distribution provides more accurate flood quantile estimates than when LP3 distribution is used with the original Grubbs and Beck test. Between these two methods, the differences in flood quantile estimates have been found to be up to 61% for the six study catchments. It has also been found that GEV distribution (with L moments) and LP3 distribution with the multiple Grubbs and Beck test provide quite similar results in most of the cases; however, a difference up to 38% has been noted for flood quantiles for annual exceedance probability (AEP) of 1 in 100 for one catchment. This finding needs to be confirmed with a greater number of stations across other Australian states.
Frequent pattern mining is the process of finding a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set. It was proposed in the context of frequent itemsets and association rule mining. Frequent pattern mining is used to find inherent regularities in data. What products were often purchased together? Its applications include basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis. However, one of the bottlenecks of frequent itemset mining is that as the data increase the amount of time and resources required to mining the data increases at an exponential rate. In this investigation a new algorithm is proposed which can be uses as a pre-processor for frequent itemset mining. FASTER (FeAture SelecTion using Entropy and Rough sets) is a hybrid pre-processor algorithm which utilizes entropy and roughsets to carry out record reduction and feature (attribute) selection respectively. FASTER for frequent itemset mining can produce a speed up of 3.1 times when compared to original algorithm while maintaining an accuracy of 71%.
In this article the problem of distributional moments estimation is considered. The new approach of moments estimation based on usage of the characteristic function is proposed. By statistical simulation technique author shows that new approach has some robust properties. For calculation of the derivatives of characteristic function there is used numerical differentiation. Obtained results confirmed that author’s idea has a certain working efficiency and it can be recommended for any statistical applications.
The shortest path (SP) problem concerns with finding the shortest path from a specific origin to a specified destination in a given network while minimizing the total cost associated with the path. This problem has widespread applications. Important applications of the SP problem include vehicle routing in transportation systems particularly in the field of in-vehicle Route Guidance System (RGS) and traffic assignment problem (in transportation planning). Well known applications of evolutionary methods like Genetic Algorithms (GA), Ant Colony Optimization, Particle Swarm Optimization (PSO) have come up to solve complex optimization problems to overcome the shortcomings of existing shortest path analysis methods. It has been reported by various researchers that PSO performs better than other evolutionary optimization algorithms in terms of success rate and solution quality. Further Geographic Information Systems (GIS) have emerged as key information systems for geospatial data analysis and visualization. This research paper is focused towards the application of PSO for solving the shortest path problem between multiple points of interest (POI) based on spatial data of Allahabad City and traffic speed data collected using GPS. Geovisualization of results of analysis is carried out in GIS.
This paper examines economic and Information and Communication Technology (ICT) development influence on recently increasing Internet purchases by individuals for European Union member states. After a growing trend for Internet purchases in EU27 was noticed, all possible regression analysis was applied using nine independent variables in 2011. Finally, two linear regression models were studied in detail. Conducted simple linear regression analysis confirmed the research hypothesis that the Internet purchases in analyzed EU countries is positively correlated with statistically significant variable Gross Domestic Product per capita (GDPpc). Also, analyzed multiple linear regression model with four regressors, showing ICT development level, indicates that ICT development is crucial for explaining the Internet purchases by individuals, confirming the research hypothesis.
Public health surveillance system focuses on outbreak detection and data sources used. Variation or aberration in the frequency distribution of health data, compared to historical data is often used to detect outbreaks. It is important that new techniques be developed to improve the detection rate, thereby reducing wastage of resources in public health. Thus, the objective is to developed technique by applying frequent mining and outlier mining techniques in outbreak detection. 14 datasets from the UCI were tested on the proposed technique. The performance of the effectiveness for each technique was measured by t-test. The overall performance shows that DTK can be used to detect outlier within frequent dataset. In conclusion the outbreak detection technique using anomaly-based on frequent-outlier technique can be used to identify the outlier within frequent dataset.
Outlier detection in streaming data is very challenging because streaming data cannot be scanned multiple times and also new concepts may keep evolving. Irrelevant attributes can be termed as noisy attributes and such attributes further magnify the challenge of working with data streams. In this paper, we propose an unsupervised outlier detection scheme for streaming data. This scheme is based on clustering as clustering is an unsupervised data mining task and it does not require labeled data, both density based and partitioning clustering are combined for outlier detection. In this scheme partitioning clustering is also used to assign weights to attributes depending upon their respective relevance and weights are adaptive. Weighted attributes are helpful to reduce or remove the effect of noisy attributes. Keeping in view the challenges of streaming data, the proposed scheme is incremental and adaptive to concept evolution. Experimental results on synthetic and real world data sets show that our proposed approach outperforms other existing approach (CORM) in terms of outlier detection rate, false alarm rate, and increasing percentages of outliers.
This paper discusses the classification process for medical data. In this paper, we use the data from ACM KDDCup 2008 to demonstrate our classification process based on latent topic discovery. In this data set, the target set and outliers are quite different in their nature: target set is only 0.6% size in total, while the outliers consist of 99.4% of the data set. We use this data set as an example to show how we dealt with this extremely biased data set with latent topic discovery and noise reduction techniques. Our experiment faces two major challenge: (1) extremely distributed outliers, and (2) positive samples are far smaller than negative ones. We try to propose a suitable process flow to deal with these issues and get a best AUC result of 0.98.
This research is aimed to describe the application of robust regression and its advantages over the least square regression method in analyzing financial data. To do this, relationship between earning per share, book value of equity per share and share price as price model and earning per share, annual change of earning per share and return of stock as return model is discussed using both robust and least square regressions, and finally the outcomes are compared. Comparing the results from the robust regression and the least square regression shows that the former can provide the possibility of a better and more realistic analysis owing to eliminating or reducing the contribution of outliers and influential data. Therefore, robust regression is recommended for getting more precise results in financial data analysis.
Modern spatial database management systems require a unique Spatial Access Method (SAM) in order solve complex spatial quires efficiently. In this case the spatial data structure takes a prominent place in the SAM. Inadequate data structure leads forming poor algorithmic choices and forging deficient understandings of algorithm behavior on the spatial database. A key step in developing a better semantic spatial object data structure is to quantify the performance effects of semantic and outlier detections that are not reflected in the previous tree structures (R-Tree and its variants). This paper explores a novel SSRO-Tree on SAM to the Topo-Semantic approach. The paper shows how to identify and handle the semantic spatial objects with outlier objects during page overflow/underflow, using gain/loss metrics. We introduce a new SSRO-Tree algorithm which facilitates the achievement of better performance in practice over algorithms that are superior in the R*-Tree and RO-Tree by considering selection queries.