The main focus of this thesis is using machine learning and data mining techniques to solve challenging problems. Three problems from different subject areas are discussed: nuclear material detection, drug ranking and target tracking in video sequences. The techniques of the three problems described are all based on an efficiently solvable variant of normalized cut, Normalized Cut Prime (or NC).
The first problem concerns detecting concealed illicit nuclear material, an important part of strategies preventing and deterring nuclear terrorism. What makes this an extremely difficult task are physical limitations of nuclear radiation detectors (arising from energy resolutions and efficiency) and shielding materials terrorists would presumably use to surround the radioactive nuclear material and absorb some of the radiation, thereby reducing the strength of the detected signal.
This means the central data analysis problem is identifying a potentially very weak signal, and distinguishing it from both background noise arising from the detector characteristics and naturally occurring environmental radiation. We aim at enhancing the capabilities of detection with algorithmic methods specifically tailored to nuclear data. A novel graph-theory-based methodology based on NC is used, called Supervised Normalized Cut (SNC). This data mining method classifies measurements obtained from very low resolution plastic scintillation detectors.
The accompanying computational study, comparing SNC method with several alternative classification methods shows that in terms of accuracy, the SNC method is on par with alternative approaches, yet SNC is computationally more efficient. The second subject area is in the field of drug ranking. This problem refers to placing in rank order, according to their effectiveness, several drugs treating the same disease, using data derived from cell images. Current technologies use the recently developed high-throughput drug profiling (high content screening or HCS).
Despite the potential of HCS for accurate descriptions of drug profiles, it produces a deluge of data of quantitative and multidimensional nature, posing analytical challenges in the data mining process. Our new framework is designed to alleviate these difficulties, in the way of producing graph theoretic descriptors and automatically ordering the performance of drugs, called fractional adjusted bi-partitional score (FABS), a way of converting classification to scores.
We experimented with the FABS framework by implementing different algorithms and assessing the accuracy of results by a comparative study, which includes other four baseline methods. The conclusion is encouraging: FABS implemented with NC consistently outperforms other implementations of FABS and alternative methods currently used for ranking that are unrelated to FABS.
The third problem is target tracking in video sequences–it can be framed as an un-supervised learning problem: the goal is to delineate a target of interest in a video from background. The tracking task is cast as a graph-cut, incorporating intensity and motion data into the formulation. Tests on real-life benchmark videos show that the developed technique, NC-track, based on NC, is more efficient than many existing techniques, and that it delivers good quality results.
The algorithm solving the problem is then a simple parametric cut algorithm in the graph Gst. The graph is presented in Figure 2.1. For more details on the graph construction and its validity. NC is the basis of the algorithms described in this thesis. One example is Supervised Normalized Cut (SNC) in Chapter 3. As noted before, the NC solution procedure requires to assign, in advance, a single node which will be included in the source S (or sink S) set. This node is referred to as a seed node.
NUCLEAR MATERIAL IDENTIFICATION WITH SUPERVISED NORMALIZED CUT (SNC)
The nodes in I which end up in S are classified as A, i.e., acquired from material M1 and nodes in I which end in S are classified as B, thus acquired from M2. This process is illustrated in Figure 3.1.The adjustment of NC to a supervised context, as described above, is a new supervised classification methodology, which takes advantage of the solvability of NC, and broadens the application of NC to a wider class of problems.
PCA-LDA has the running time more than 40 times than that of SNC–this is primarily due to the using of PCA, which is the reduced version; the full version of PCA takes even longer. Figure 3.5 clearly shows that SNC is significantly more efficient (factor of 2−80) than SVMs and factor of 40 than PCA-LDA under the same hardware setup (1.3GHz Intel SU7300 Core 2 Duo ULV Processor with 1GB 1, 066MHz RAM).
DRUG RANKING WITH FRACTIONAL ADJUSTED BI-PARTITIONAL SCORE (FABS)
One can further use the FABS scores to test statistical significance of the difference between the effects of two drugs. The idea is to apply bootstrapping to obtain FABS scores from a large number of resampling trials and then perform hypothesis test on the difference of the distributions of FABS. Algorithm 5 gives the test procedure, which takes resulting FABS’s from repeated experiment and calculate p-values from a t-test for each drug.
Figure 4.2 shows some example cell images of mitochondria at different fragmentation stages. Intact mitochondria usually appear like threads, as shown in the images at the top row, while fragmented mitochondria appear like small globules as shown at the bottom row. Even though the totally intact and totally fragmented mitochondria (extreme set cases) can be easily distinguished by visual inspection, it is very hard (if not impossible) to look at a set of mitochondria images that are neither totally intact nor totally fragmented.
TARGET TRACKING IN VIDEO SEQUENCES
Notice the graph G is not a complete graph–different from that in Notation section of Chapter 1. The construction is illustrated in Figure 5.1. The similarity is computed for each pair of neighboring pixels. All edges between non-neighboring pixels are assigned zero weights. The edges in the graph carry similarity weights. This similarity may take into account multiple pixels features such as the pixel’s neighborhood texture, its intensity, corresponding motion and its color or brightness.
It is important to note that while additional user input throughout the process may improve the tracking results, the user’s input is required only at the beginning of the process for identifying the target of interest. Our experiments show that a window size of N = 10 was a good tradeoff between computation time and accuracy. Following the discussion in Section 5.2, pixel’s neighborhood is defined over 7 frames. If N < 7, then the pixel’s neighborhood is truncated symmetrically around the central pixel.
This thesis explores machine learning techniques in several important application areas: nuclear material detection, drug ranking and target tracking. Supervised Normalized Cut (SNC), Fractional Adjusted Bipartitional Score (FABS) and NC-Track are used to solve these problem respectively. They are all based on a new algorithm, Normalized Cut Prime (NC). We divide the techniques into three different categories: clustering, classification and ranking. We explore clustering in the context of target tracking.
The target tracking problem is to cluster pixels into two groups, the background and the foreground in a sequence of video frames. We find that a graph-cut formulation incorporating intensity and motion data has the highest performance. Tests on real-life benchmark videos show that this graph-cut technique is more efficient than many existing techniques, and that it delivers good quality results. We explore classification in the context of detecting concealed illicit nuclear material.
Supervised Normalized Cut (SNC) is used to classify measurements obtained from very low resolution plastic scintillation detectors, among others. The classification problem is to use training data to accurately determine the identity of a given material. SNC method is proved to be appropriate for this task. In terms of accuracy, the SNC method is on par with alternative approaches, yet SNC is computationally more efficient.
We explore drug ranking of several drugs treating the same disease according to their effectiveness, using data directly from experimental images. The framework used in drug ranking producing graph theoretic descriptors, automatically ordering the performance of drugs, is called fractional adjusted bi-partitional score (FABS). Computational experiments show that FABS framework implemented with normalized cut prime (FABS-NC) outper-forms other implementations of FABS and alternative methods currently used for ranking that are unrelated to FABS.
Source: University of California
Author: Yan Yang