
Big Data: Fast, Parallel MapReduce-Based Outlier Detection for Fraud and Intrusion Detection

Pooja Vijay Pawar, Ann Marie Joy

Abstract

One of the most challenging aspects of Big Data analytics is real-time monitoring of data. Larger and larger amounts of data are being accumulated and stored on a regular basis, increasing the need for quick, effective and efficient ways of analysing the data to identify potentially malicious records. Network security threats are also increasing at an alarming rate and have become increasingly complex and difficult to detect. Web traffic produced by non-human activity, such as botnets or worms, consumes network resources, misleads people and compromises network security. Most of the existing work in fraud detection/intrusion detection treats being an outlier as a binary property. In this paper we use a relative density based approach, implemented as a MapReduce job, which gives a sense of the degree to which a data point is an outlier; this is more informative and also more immune to false positives.

Keywords

Big Data, Fraud Detection, Intrusion Detection, Hadoop, Outlier, Cluster, Security, Relative Density, LOF

1. Introduction

We are currently living in a world surrounded and ruled by data. Exponentially large amounts of data are collected, stored, processed and made available in a variety of forms every day. Largely because of this, network intrusion and fraud detection has recently received increased attention with regard to network security.

Big Data is greatly changing the way we detect fraud and intrusion in real time, using advanced analytic solutions that are powerful, complex and fast. In this paper, we propose a methodology to detect fraudulent web traffic and intrusions in a network using MapReduce-based outlier detection. The method helps filter out clients that generate unusual traffic and, specifically, indicates different levels of potentially anomalous traffic for each suspicious client. The detected abnormal web traffic can be visualized easily, and the method can be applied to large networks and scaled accordingly.

An outlier can be defined as a data point which is very different from the other data points in the same dataset, based on some distance measure. Outlier detection, being a significant data mining problem, has engendered a great deal of research interest in the recent past. As a result, various methods for outlier detection have been developed, mostly for numerical data. However, outlier detection for categorical data remains largely unexplored. Addressing this need, we propose a two-phase algorithm for detecting outliers in categorical data based on a novel definition of outliers. The algorithm first computes a clustering of the given data, followed by a ranking phase that decides the set of most likely outliers. The proposed methodology is expected to show better results, as it can identify different kinds of outliers using an unbiased ranking scheme based on the inherent clustering structure of the given data, as sketched below.
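The paper does not spell out the clustering and ranking procedures, so the following is only a minimal sketch of the two-phase idea, assuming the simplest possible clustering (grouping records with identical attribute patterns) and ranking records by the size of their cluster; all function and variable names are our own illustration.

from collections import Counter

def rank_categorical_outliers(records):
    """Two-phase sketch: (1) 'cluster' records by their exact attribute
    pattern, (2) rank records so those in the rarest clusters come first."""
    # Phase 1: trivial clustering -- count identical categorical patterns.
    clusters = Counter(records)
    # Phase 2: rank by cluster size; a smaller cluster suggests an outlier.
    return sorted(records, key=lambda r: clusters[r])

if __name__ == "__main__":
    data = [("tcp", "http", "SF")] * 50 + [("icmp", "ecr_i", "REJ")] * 2
    print(rank_categorical_outliers(data)[:3])  # rarest patterns first

A real implementation would use a proper categorical clustering algorithm and a more refined ranking score; the point here is only the two-phase structure.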

Hadoop is an extremely popular open source Apache project, used for storing and processing huge volumes of data on commodity hardware. The Hadoop package mainly consists of the MapReduce engine and the Hadoop Distributed File System. Many frameworks have been built on top of Hadoop. In this paper we discuss how the distributed architecture of Hadoop can be exploited for detecting outliers in captured network data.

2. Related Work

There has been a lot of research, and a number of techniques have been developed over the past two decades, for fraud detection and intrusion detection. Machine learning techniques such as Neural Networks, Markov models and k-Nearest Neighbour have drawn special attention. In this paper we use unsupervised machine learning techniques to identify fraudulent transactions using Hadoop.

Most current approaches to detecting intrusions exploit some form of rule-based analysis. Rule-based analysis relies on sets of predefined rules that are configured by an administrator or created automatically by the system. The use of automated techniques in intrusion detection mechanisms was a significant milestone in the development of effective and practical detection-based information security systems. Rule-based systems, however, suffer from an inability to detect attack scenarios that occur over an extended period of time.

The most significant advantage of neural networks in intrusion detection, however, is their ability to "learn" the characteristics of abnormal attacks and recognize instances unlike any that the network has detected before.

The majority of studies that propose Hidden Markov Models for IDS are related to host-based systems, i.e., IDS that analyse the actions performed on a single host to identify intrusion attempts.

3. Outlier Detection Algorithms

Outliers can be found using a variety of techniques. Some of these techniques are described below.

3.1 Distance-Based and Clustering Approaches

Distance-based methods make no assumptions about the data, since they simply compute the distance between each pair of points. For example, Knorr et al. suggested a k-NN technique where, if m of the k nearest neighbours of a point are within a specific distance d, then the point can be labeled as normal. Knorr et al. mark a point as an outlier if at least p% of the points in the dataset lie more than distance d from it. These methods have high computational complexity (e.g., nearest-neighbour-based methods have quadratic complexity with respect to the number of data points), which makes them impractical for really large datasets. Several techniques may be employed to make the k-NN queries faster (to achieve linear or logarithmic time), such as indexing structures (e.g., the KD-tree or X-tree); however, these structures have been shown to break down as the dimensionality grows.
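As a minimal illustration of the outlier test described above, the sketch below assumes Euclidean distance and treats the radius d and the fraction p as user-supplied parameters; the function and variable names are ours.

import math

def is_db_outlier(point, data, d, p_frac):
    """Distance-based outlier test: flag `point` if at least a fraction
    `p_frac` of the other points lie farther than distance `d` from it.
    O(n) per point, hence O(n^2) over the whole dataset."""
    farther = sum(1 for q in data if q != point and math.dist(point, q) > d)
    return farther / (len(data) - 1) >= p_frac

data = [(0, 0), (0, 1), (1, 0), (1, 1), (25, 25)]
print([p for p in data if is_db_outlier(p, data, d=5.0, p_frac=0.9)])
# Only the isolated point (25, 25) is flagged.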

Clustering can be defined as the task of grouping a set of objects such that objects in the same group share greater feature similarity with each other than with those in other groups. Clustering is one of the most popular techniques used in outlier detection.

Any object that has weak membership in the cluster it belongs to is a potential outlier. If there is a small cluster lying far from the other clusters, then that smaller cluster may potentially consist of outliers. For example, there may be many different kinds of fraudulent transactions which have high similarity among themselves and form a cluster.

There are additional problems with clustering. Clustering algorithms are optimized to find clusters rather than outliers. Hence, it may sometimes be hard to tell whether a cluster corresponds to fraudulent transactions or to some newly emerging buying behaviour of a legitimate user. Hence, prior to making a final call, we should perform additional analysis.

3.2 Statistical Distribution

Statistical outlier detection is one of the most basic approaches, dating back to the 19th century. Multivariate statistical methods have been proposed, together with the use of robust estimates of the multidimensional distribution parameters, e.g., the minimum covariance determinant (MCD) and the minimum volume ellipsoid (MVE). One critical issue of statistical-based methods is finding a suitable model for each dataset and application. Also, as data increases in dimensionality, it becomes increasingly difficult to estimate the multidimensional distribution. As the dimensionality increases, data spreads through a larger volume and becomes sparse. In addition to the effect this has on performance, it also spreads the convex hull, thus distorting the data distribution. This can be mitigated by preselecting the most significant features, projecting to a lower-dimensional subspace, or applying Principal Component Analysis (PCA). Another technique for coping with high dimensionality is to organize data points in convex hull layers according to their peeling depth, based on the idea that outliers are data points with a shallow depth value.

Fig. 1. Gaussian distribution of normal data points, with possible outliers in the low-probability tails.

In this approach, one main assumption is that data objects follow a certain distribution (e.g., Gaussian) and that normal data points occur in the high-probability region of the model.

Fig. 1 shows an example where a large number of data points lie in the normal region, corresponding to normal data; the small distributions on both tails of the main distribution are possible outliers. As shown in Fig. 1, outliers deviate strongly from this distribution.

There are a lot of problems with this technique too, the main one being the curse of dimensionality, and another being a lack of robustness, since the mean and standard deviation are very sensitive to outliers.
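A minimal sketch of the distribution-based idea, assuming univariate Gaussian data and a user-chosen sigma threshold (names are ours). The example also makes the robustness problem concrete: the extreme value inflates both the mean and the standard deviation enough that the common three-sigma rule misses it.

import statistics

def gaussian_outliers(values, n_sigma=2.0):
    """Flag values lying more than n_sigma standard deviations from the
    mean, assuming normal data follows a Gaussian distribution."""
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    return [v for v in values if abs(v - mu) > n_sigma * sigma]

values = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0, 58.0]
print(gaussian_outliers(values))             # flags 58.0
print(gaussian_outliers(values, n_sigma=3))  # [] -- masked by the outlier itself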

3.3 Density-Based Approach

In this technique we compare the density around a data point with the density around its local neighbours. The computed density is used as an outlier score. The primary assumption here is that the density around a normal data point is similar to the density around its local neighbours. The density/outlier score reflects the fact that some clusters are densely packed and others are not. Mathematically, it is defined as the inverse of the average distance to the k nearest neighbours.
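A minimal sketch of this score, assuming Euclidean distance and no duplicate points (so the average distance is non-zero); the function name is ours.

import math

def knn_density(point, data, k):
    """Outlier score: inverse of the average distance from `point` to
    its k nearest neighbours (the point itself is excluded)."""
    dists = sorted(math.dist(point, q) for q in data if q != point)
    return k / sum(dists[:k])  # 1 / (average distance to the k-NN)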

A low density for a data point signifies a high probability that it is an outlier. Many variants of density-based approaches have been proposed over the past few decades, the majority of which deal with reducing the computational time complexity.

In Fig. 2, the points that are densely packed (shown in yellow) indicate normal data points; the ones that lie away from the cluster are outliers, possible candidates for malicious data. In this paper, we use this technique to determine possible candidates for outliers.

4. Experiment

For our experiment we used the KDD Cup 1999 computer network intrusion detection dataset to test and evaluate our approach. We used the relative density based approach [3] for our system, which involved 4 MapReduce jobs. The algorithm works as follows.

Computing K-NN [3]

We begin with the notion of the k-nearest neighbours of an object p.

Definition: k-nearest neighbours of an object p

For any positive integer k, the k-distance of object p, denoted k-distance(p), is defined as the distance d(p, o) between p and an object o ∈ D such that:

(i) for at least k objects o' ∈ D \ {p} it holds that d(p, o') ≤ d(p, o), and (ii) for at most k−1 objects o' ∈ D \ {p} it holds that d(p, o') < d(p, o).

Given the k-distance of p, the k-distance neighbourhood of p contains every object whose distance from p is not greater than the k-distance, i.e.

N_k-distance(p)(p) = { q ∈ D \ {p} | d(p, q) ≤ k-distance(p) }

These objects q are called the k-nearest neighbours of p.

We use two MapReduce jobs: one to compute the pairwise distances between data points as explained above, and another to compute the density of each data point. The density of a data point is simply the inverse of the average distance to its k nearest neighbours.
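Since the paper gives no source code, the following is only a plain-Python simulation of the two jobs' map and reduce phases (a real implementation would subclass Hadoop's Mapper and Reducer in Java); all names are our own.

import math
from collections import defaultdict

def distance_mapper(points):
    """Job 1, map phase: emit (point_id, (neighbour_id, distance)) for
    every ordered pair of distinct points."""
    for i, p in enumerate(points):
        for j, q in enumerate(points):
            if i != j:
                yield i, (j, math.dist(p, q))

def knn_reducer(pairs, k):
    """Job 1, reduce phase: keep the k nearest neighbours per point."""
    grouped = defaultdict(list)
    for i, (j, d) in pairs:
        grouped[i].append((d, j))
    return {i: sorted(ds)[:k] for i, ds in grouped.items()}

def density_reducer(knn):
    """Job 2: density = inverse of the average k-NN distance
    (assumes no duplicate points, so the sum is non-zero)."""
    return {i: len(ds) / sum(d for d, _ in ds) for i, ds in knn.items()}

points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
densities = density_reducer(knn_reducer(distance_mapper(points), k=2))
print(min(densities, key=densities.get))  # 4: the isolated point (9, 9)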

Finding all the neighbourhood groups the data points are associated with, and assigning them unique IDs

We define one more term, the reachability distance of an object p w.r.t. a data point o, to determine the neighbourhood.

Definition: Reachability distance of an object p w.r.t. object o

Let k be a natural number. The reachability distance of object p with respect to object o is defined as

reach-dist_k(p, o) = max{ k-distance(o), d(p, o) }.

The higher the value of k, the more similar the reachability distances for objects within the same neighbourhood.
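A minimal sketch of these two definitions, again assuming Euclidean distance (function names are ours):

import math

def k_distance(o, data, k):
    """Distance from object o to its k-th nearest neighbour."""
    return sorted(math.dist(o, q) for q in data if q != o)[k - 1]

def reach_dist(p, o, data, k):
    """reach-dist_k(p, o) = max{k-distance(o), d(p, o)}: all points deep
    inside o's k-neighbourhood get the same effective distance, which
    smooths statistical fluctuations in d(p, o)."""
    return max(k_distance(o, data, k), math.dist(p, o))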

We use the same MapReduce class as before, with a slightly different configuration, to identify the neighbourhoods. Once the neighbourhoods are determined, each receives a unique ID.

Using the previous results, generate a mapping between each data point and its density.

In a typical density-based clustering algorithm, there are two parameters that define the notion of density:

(i) a parameter MinPts specifying a minimum number of objects;

(ii) a parameter specifying a volume.

These two parameters determine a density threshold for the clustering algorithms to operate with. That is, objects or regions are connected if their neighbourhood densities exceed the given density threshold. To detect density-based outliers, however, it is necessary to compare the densities of different sets of objects, which means that the density of sets of objects must be determined dynamically. Therefore, we keep MinPts as the only parameter and use the values reach-dist_MinPts(p, o), for o ∈ N_MinPts(p), as a measure of the volume to determine the density in the neighbourhood of object p.

Definition: Local reachability density of an object p

lrd_MinPts(p) = 1 / ( Σ_{o ∈ N_MinPts(p)} reach-dist_MinPts(p, o) / |N_MinPts(p)| )

Intuitively, the local reachability density of an object p is the inverse of the average reachability distance based on the MinPts-nearest neighbours of p. Note that the local reachability density can be ∞ if all the reachability distances in the summation are 0. This may occur for an object p if there are at least MinPts objects, different from p, that share the same spatial coordinates, i.e. if there are at least MinPts duplicates of p in the dataset. For convenience, we will not handle this case explicitly but simply assume that there are no duplicates.
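Continuing the sketch above, the local reachability density can be written as follows (again Euclidean distance, no duplicates; names are ours):

import math

def k_distance(o, data, k):
    # As in the previous sketch: distance to the k-th nearest neighbour.
    return sorted(math.dist(o, q) for q in data if q != o)[k - 1]

def neighbourhood(p, data, k):
    """N_k(p): all objects within k-distance(p) of p."""
    kd = k_distance(p, data, k)
    return [q for q in data if q != p and math.dist(p, q) <= kd]

def lrd(p, data, min_pts):
    """lrd_MinPts(p): inverse of the average reachability distance over
    p's MinPts-neighbourhood."""
    nbrs = neighbourhood(p, data, min_pts)
    total = sum(max(k_distance(o, data, min_pts), math.dist(p, o))
                for o in nbrs)
    return len(nbrs) / total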

Hence, in our MapReduce implementation, we first sort the data points based on the density data and the neighbourhood, such that in the input to the reducer, the first value is the density and the subsequent values are the neighbourhood IDs.

Determining the Relative Density or LOF (Local Outlier Factor)

The results from the previous step are then used in another MapReduce job to compute the relative density, also called the Local Outlier Factor (LOF).

Definition: Local outlier factor of an object p

LOF_MinPts(p) = ( Σ_{o ∈ N_MinPts(p)} lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|

The outlier factor of object p captures the degree to which we call p an outlier. It is the average of the ratios of the local reachability densities of p's MinPts-nearest neighbours to that of p itself. It is easy to see that the lower p's local reachability density is, and the higher the local reachability densities of p's MinPts-nearest neighbours are, the higher the LOF value of p. In the following section, the formal properties of LOF are made precise. To simplify notation, we drop the subscript MinPts from reach-dist, lrd and LOF when no confusion arises.
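Reusing neighbourhood and lrd from the sketch above, the per-point computation performed by this final MapReduce job can be sketched as:

def lof(p, data, min_pts):
    """LOF_MinPts(p): average ratio of the neighbours' lrd to p's own
    lrd. Values near 1 indicate an inlier; values well above 1 flag p
    as a likely outlier."""
    nbrs = neighbourhood(p, data, min_pts)
    lrd_p = lrd(p, data, min_pts)
    return sum(lrd(o, data, min_pts) for o in nbrs) / (len(nbrs) * lrd_p)

data = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
print(max(data, key=lambda p: lof(p, data, 2)))  # (9, 9), the outlier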

Finally, data points that have a low relative density, i.e. a high LOF, are selected as possible candidates for outliers.

5. Conclusion

Existing intrusion detection systems are at a nascent stage in handling extremely large traffic volumes and the data exchanged in large networks. The MapReduce framework can handle massive amounts of data quickly and effectively. Thus our proposed methodology for outlier detection using the relative density based approach not only handles massive amounts of data but also scales easily. In the near future, a full-fledged MapReduce-based IDS needs to be developed and evaluated. We also intend to explore multiple-classifier systems, as compared to a single classifier, to obtain improved results.

6. Acknowledgement

This work is supported by the CSE Department, PES Institute of Technology.

7. References

[1] Barnett, V., Lewis, T. (1995). Outliers in Statistical Data. Wiley, 3rd Edition.
[2] Davide Ariu, Giorgio Giacinto, and Roberto Perdisci. Sensing Attacks in Computers Networks with Hidden Markov Models.
[3] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, Jorg Sander. LOF: Identifying Density-Based Local Outliers.
[4] Manh Cong Tran, Lee Hee Jeong, Yasuhiro Nakamura. Abnormal Web Traffic Detection Using Connection Graph.
[5] Suri, N. N. R. R., Murty, M. N., Athithan, G. An Algorithm for Mining Outliers in Categorical Data through Ranking.
[6] Kuo Zhao, Liang Hu. Intrusion Detection and Prevention in High Speed Network.
[7] Qing He, Yunlong Ma, Qun Wang, Fuzhen Zhuang, Zhongzhi Shi. Parallel Outlier Detection Using KD-Tree Based on MapReduce.
[8] Koufakou, A., Secretan, J., Reeder, J., Cardona, K., Georgiopoulos, M. Fast Parallel Outlier Detection for Categorical Datasets Using MapReduce.
[9] Ganesh Ananthanarayanan, Srikanth Kandula, Albert Greenberg, Ion Stoica, Yi Lu, Bikas Saha, Edward Harris. Reining in the Outliers in Map-Reduce Clusters Using Mantri.
[10] H. Günes Kayacık, A. Nur Zincir-Heywood, Malcolm I. Heywood. Selecting Features for Intrusion Detection: A Feature Relevance Analysis on KDD 99 Intrusion Detection Datasets.
[12] E. Eskin, A. Arnold, M. Prerau, L. Portnoy, S. Stolfo. "A Geometric Framework for Unsupervised Anomaly Detection: Detecting Intrusions in Unlabeled Data," in Applications of Data Mining in Computer Security, Chapter 4, D. Barbara and S. Jajodia (editors), Kluwer.
[13] Q. He, F. Z. Zhuang, J. C. Li, Z. Z. Shi. Parallel Implementation of Classification Algorithms Based on MapReduce. International Conference on Rough Sets and Knowledge Technology.
[15] Koufakou, A., Ortiz, E., Georgiopoulos, M., Anagnostopoulos, G., Reynolds, K. "A Scalable and Efficient Outlier Detection Strategy for Categorical Data", Int'l Conference on Tools with Artificial Intelligence (ICTAI), October 2007.
[16] Big Data Analytics for Security Intelligence, Cloud Security Alliance, September 2013.
[17] DuMouchel, W., Schonlau, M. "A Fast Computer Intrusion Detection Algorithm Based on Hypothesis Testing of Command Transition Probabilities", Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining, New York, NY, AAAI Press, 1998, pp. 189-193.
[18] Ramaswamy, S., Rastogi, R., Kyuseok, S. "Efficient Algorithms for Mining Outliers from Large Data Sets", Proc. ACM SIGMOD Int. Conf. on Management of Data, 2000.
[19] Fawcett, T., Provost, F. "Adaptive Fraud Detection", Data Mining and Knowledge Discovery Journal, Kluwer Academic Publishers, Vol. 1, No. 3, 1997, pp. 291-316.
[20] Holtz, Marcelo D., Bernardo M. David, and Rafael Timóteo de Sousa Júnior. "Building Scalable Distributed Intrusion Detection Systems Based on the MapReduce Framework." Telecomunicacoes (Santa Rita do Sapucai) 13 (2011): 22-31.

