## The Challenge of Identifying Patients for Clinical Trials


Identifying patients who satisfy the criteria that qualify them for clinical trials is a fundamental aspect of medical research. Finding such patients is challenging because the criteria of medical research are often sophisticated: they are not easily translatable into a database query and instead require examination of the clinical narratives found in patient records Liu & Motoda (2012). This tends to be very time-consuming for medical researchers who intend to recruit patients, so researchers are usually limited to patients who are directed towards a certain trial or who seek out a trial on their own. Recruitment from particular places or by particular people can result in selection bias towards certain populations, which in turn can bias the results of the study Robert (2014). Developing NLP systems that can automatically assess whether a patient is eligible for a study can both reduce the time it takes to recruit patients and help remove bias from clinical trials.

However, matching patients to selection criteria is not a trivial task for machines, due to the complexity the criteria often exhibit. This shared task aims to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. The eligibility criteria come from real clinical trials and focus on patients’ medications, past medical histories, and whether certain events have occurred in a specified timeframe in the patients’ records Alpaydin (2014). The task uses data from the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data, which included tracks on de-identification and heart disease risk factors. The data comprise 202 sets of longitudinal patient records, annotated by medical professionals to determine whether each patient matches a list of 13 selection criteria. These criteria include determining whether the patient has taken a dietary supplement (excluding Vitamin D) in the past 2 months, whether the patient has a major diabetes-related complication, and whether the patient has advanced cardiovascular disease.

All the files have been annotated at the document level to indicate whether the patient meets or does not meet each criterion. The gold standard annotations provide the category of each patient for each criterion Alpaydin (2014). Participants will be evaluated on the predicted category of each patient in the held-out test data. The data for this task is provided by Partners HealthCare. All records have been fully de-identified and manually annotated for whether they meet, possibly meet, or do not meet clinical trial eligibility criteria. The evaluation for both NLP tasks will be conducted using withheld test data; participating teams are asked to stop development as soon as they download the test data. Each team is allowed to upload (through this website) up to three system runs for each of these tracks. System output is to be submitted in the exact format of the ground truth annotations, which will be provided by the organizers Paik (2013, July).

## The Role of NLP in Patient Identification for Clinical Trials

Participants are asked to submit a 500-word abstract describing their methodologies. Abstracts may also include a graphical summary of the proposed architecture. The authors of top-performing systems or particularly novel approaches will be invited to present or demonstrate their systems at the workshop Liu & Motoda (2012). A special issue of a journal will be organized following the workshop.

MetaMap is a highly configurable program developed to map biomedical text to the UMLS Metathesaurus or, in equal measure, to discover the Metathesaurus concepts referred to in a text. MetaMap uses a knowledge-intensive approach based on symbolic natural language processing as well as computational-linguistic techniques Alpaydin (2014). Besides its applications in IR and data mining, MetaMap is acknowledged as one of the foundations of the Medical Text Indexer (MTI) of the National Library of Medicine. The Medical Text Indexer is applied in both semi-automatic and entirely automatic indexing of the biomedical literature at the National Library of Medicine.

An improved version of MetaMap, called MetaMap2016 V2, is available; it comes with numerous new special-purpose features aimed at improving performance on specific input types. JSON output is also provided besides the XML output. The benefits that come with MetaMap2016 V2 include:

- Suppression of numerical concepts: Some numerical concepts of certain Semantic Types have been found to add limited value to a biomedical named entity recognition application. MetaMap2016 V2 automatically suppresses such unnecessary and irrelevant concepts Goeuriot et al (2014, September).
- JSON output generation: MetaMap2016 V2 is able to produce JSON output.
- Processing of data in tables: With MetaMap2016 V2 it is possible to identify UMLS concepts found in tabular data in a better and more efficient way.
- Improved conjunction handling: MetaMap2016 V2 provides improved handling of conjunctions.

**Bigram**

Also called a digram, a bigram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. It is an n-gram for n=2. The frequency distribution of each bigram in a string is applied in simple statistical analysis of text in various applications, among them speech recognition, computational linguistics, and cryptography. Gappy bigrams, or simply skipping bigrams, are pairs of words that allow gaps (perhaps by avoiding connecting words, or by permitting simulated dependencies, as in a dependency grammar) Rocktäschel, Weidlich & Leser (2012).

Bigrams are mainly used to provide the conditional probability of a token given the preceding token, by applying the relation of conditional probability: P(W_n | W_{n-1}) = P(W_{n-1}, W_n) / P(W_{n-1}).
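
As a concrete illustration, this conditional probability can be estimated from raw counts; here is a minimal Python sketch (the token list and function name are illustrative, not from the original text):

```python
from collections import Counter

def bigram_probability(tokens, prev, curr):
    """Estimate P(curr | prev) as count(prev, curr) / count(prev).

    Note: this simple maximum-likelihood estimate slightly overcounts the
    denominator when `prev` is the final token, which has no successor.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, curr)] / unigrams[prev]

tokens = "the patient denies chest pain and the doctor notes the pain".split()
print(bigram_probability(tokens, "the", "patient"))  # 1/3: "the" occurs 3 times, once before "patient"
```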

**Applications**

- They are applied in one of the most successful language models for speech recognition, in which they act as a special case of the n-gram.
- Bigram frequency attacks can be used in cryptography to solve cryptograms Del, López, Benítez, & Herrera (2014)
- They are one of the approaches to statistical language identification
- Bigrams are involved in some activities of recreational linguistics, or logology

When used in information retrieval, tf-idf or TFIDF, an abbreviation for term frequency-inverse document frequency, is a numerical statistic meant to reflect the importance of a word in a document that belongs to a collection or corpus. It is normally adopted as a weighting factor in information retrieval searches, user modeling, and text mining Alpaydin (2014). The tf-idf value increases proportionally with the frequency of appearance of a word in a document and is offset by how frequent the word is in the corpus. This helps adjust for the fact that some words generally appear more frequently than others.

## TF-IDF Weighting

The aim of using tf-idf, as opposed to the raw frequency of occurrence of a token in a given document, is to scale down the effect of tokens that occur very frequently in a specific corpus and are thus empirically less informative than features that occur in a small fraction of the training corpus Alpaydin (2014).

Computation of tf-idf uses the formula tf-idf(t, d) = tf(t, d) * idf(t), where the idf is computed as idf(t) = log(n / df(t)) + 1 when smooth_idf=False. Here n is the total number of documents and df(t) is the document frequency, that is, the number of documents that contain the term t. The 1 added in the idf equation ensures that terms with zero idf, i.e. terms that occur in all the documents of a training set, are not entirely ignored Witten et al (2016). It should be noted that this idf formula differs from the standard textbook notation, which in most cases defines idf as idf(t) = log(n / (df(t) + 1)). When smooth_idf=True, which is the default, 1 is added to both the numerator and the denominator of the idf, as though an extra document containing every term in the collection had been seen exactly once, thereby preventing zero divisions: idf(t) = log((1 + n) / (1 + df(t))) + 1 Paik (2013, July).
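
The two idf variants above can be sketched directly in Python (a toy corpus; the function is a hand-rolled illustration of the formulas, not the scikit-learn implementation itself):

```python
import math
from collections import Counter

def tfidf(docs, smooth_idf=True):
    """Weight each term by tf(t, d) * idf(t), mirroring the two idf
    variants described above (raw counts are used for tf)."""
    n = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))            # document frequency of each term
    weights = []
    for doc in docs:
        tf = Counter(doc)
        w = {}
        for t in tf:
            if smooth_idf:
                idf = math.log((1 + n) / (1 + df[t])) + 1
            else:
                idf = math.log(n / df[t]) + 1
            w[t] = tf[t] * idf
        weights.append(w)
    return weights

docs = [["diabetes", "insulin"], ["diabetes", "aspirin"], ["aspirin"]]
w = tfidf(docs)
# "insulin" (in 1 of 3 docs) outweighs "diabetes" (in 2 of 3 docs)
```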

The formulas used in the computation of tf and idf furthermore depend on parameter settings that correspond to the SMART notation deployed in IR, as follows:

Tf is by default n (natural), or l (logarithmic) when sublinear_tf=True. Idf is t when use_idf is given, n (none) otherwise. Normalization is c (cosine) when norm=‘l2’, and n (none) when norm=None Paik (2013, July).

There is a significant difference between sentiment analysis and tf-idf: even though both are treated as text classification techniques, they have distinct goals. While sentiment analysis aims at classifying documents based on opinions, such as negative and positive, tf-idf classifies documents into categories based on the terms within the documents themselves.

## The Vector Space Model

This algorithm tends to be most useful where there is a large set of documents to be characterized. It is simple in that one does not need to train a model ahead of time, and it automatically accounts for variations in document length Rocktäschel et al (2012).

Also called the term vector model, the vector space model is an algebraic model used to represent text documents, or objects in general, as vectors of identifiers, for example index terms. The vector space model is applied in information filtering, indexing, retrieval, and relevancy rankings Alpaydin (2014). The model was first applied in the SMART Information Retrieval System.

In these vectors, each dimension corresponds to a separate term, and if a term appears in a document, it has a non-zero value in the vector. There are several ways of computing these values, also referred to as weights Paik (2013, July). Tf-idf weighting is one of the best-known such schemes. The definition of a term is influenced by the application: terms are in most cases single words, longer phrases, or keywords. If words are chosen as the terms, then the dimensionality of the vector is the size of the vocabulary, which is the number of distinct words found in the corpus.

Using the document-similarity assumption, it is possible to compute the relevance rankings of documents in a keyword search. This is achieved by comparing the angles of deviation between each document vector and the vector of the original query, in which the query is represented as the same kind of vector as the documents Alpaydin (2014).

In practice, it is simpler to compute the cosine of the angle between the vectors than to calculate the angle itself.

The relevance score is the cosine cos θ = (d_{2} · q) / (||d_{2}|| ||q||), where d_{2} · q is the dot product of the document vector d_{2} and the query vector q, ||d_{2}|| is the norm of vector d_{2}, and ||q|| is the norm of vector q. In general, the norm of a vector is calculated as ||v|| = sqrt(v_1^2 + v_2^2 + … + v_n^2) Trstenjak, Mikac & Donko (2014).
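
A minimal sketch of this calculation in Python (the example vectors are hypothetical term weights, not taken from any real corpus):

```python
import math

def norm(v):
    """Euclidean norm: ||v|| = sqrt(v_1^2 + v_2^2 + ... + v_n^2)."""
    return math.sqrt(sum(x * x for x in v))

def cosine_similarity(d, q):
    """cos(theta) = (d . q) / (||d|| * ||q||)."""
    dot = sum(a * b for a, b in zip(d, q))
    return dot / (norm(d) * norm(q))

d2 = [3.0, 1.0, 0.0]  # hypothetical document term weights
q = [1.0, 1.0, 0.0]   # hypothetical query term weights
print(cosine_similarity(d2, q))
```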

## Applying the Vector Space Model

This is because all the vectors considered by this model have nonnegative elements; thus a cosine value of zero indicates that the query and document vectors are orthogonal and have no match, i.e. none of the query terms occur in the document under consideration.

The vector space model can be divided into three stages. The first stage is document indexing, in which content-bearing terms are extracted from the document text. The second stage involves weighting the indexed terms to enable retrieval of only those documents that are relevant to the user. The third and final stage involves ranking the documents with regard to the query according to a similarity measure Alpaydin (2014).

**Advantages**

- Does not have binary term weights
- It is a simple model based purely on linear algebra
- Permits partial matching
- Permits computing a continuous degree of similarity between queries and documents
- Enables ranking of documents according to their possible relevance

**Limitations**

- Its intuitive weighting scheme is not very formal
- It represents long documents poorly because of their poor similarity values
- The order in which terms appear in the document is lost in the vector space representation Chandrashekar & Sahin (2014)
- It makes the theoretical assumption that terms are statistically independent
- It is semantically sensitive, i.e. documents with similar context but distinct term vocabularies cannot be associated, resulting in false negative matches.

**Feature Selection**

Feature selection forms an integral step in data processing that is done just before the application of a learning algorithm. Computational complexity is the main issue taken into consideration when a feature selection method is proposed. In most cases, a fast feature selection process is unable to search through the whole feature-subset space, and classification accuracy is thus reduced Alpaydin (2014). Also known as variable selection, variable subset selection, and attribute selection, feature selection is the process by which a subset of appropriate and relevant features is selected for use in model construction.

The main role of feature selection is the elimination of irrelevant and redundant features. Irrelevant features are those that provide no useful information with regard to the data, while redundant features are those that provide no information beyond the currently chosen features. In other words, redundant features offer information that is of importance to the data set, but the same information is already provided by the currently chosen features Chandrashekar & Sahin (2014).

An example is year of birth and age, which provide the same information about a person. Redundant and irrelevant features have the potential to lower the learning accuracy and the model quality achieved by the learning algorithm. Numerous proposals have been made in attempts to apply learning algorithms more accurately and efficiently Alpaydin (2014). Such proposals reduce dimensionality, for example Relief, CFS, and FOCUS. By removing irrelevant information and minimizing noise levels, the accuracy and efficiency of learning algorithms can be significantly improved. Feature selection has attracted special interest in areas of research that involve high-dimensional datasets, such as text processing, combinatorial chemistry, and gene expression.

## Feature Selection Algorithms

An evaluation measure and a search technique are the two requirements for a feature selection algorithm. The search technique proposes new feature subsets; search approaches include genetic algorithms, best-first search, greedy forward selection, simulated annealing, greedy backward elimination, and exhaustive search Liu & Motoda (2012). The evaluation measure, on the other hand, is used to score the various feature subsets; some of the most common evaluation measures include error probability, entropy, correlation, inter-class distance, and mutual information. The feature selection process is summarized in the diagram below.

**Feature selection flow**

A feature selection algorithm searches over all possible feature subsets of a concept so as to come up with the optimal subset. This process may be computationally intensive and hence calls for a stopping criterion Panahiazar et al (2014, October). The stopping criterion normally depends on conditions that include the number of iterations and an evaluation threshold. An example is forcing the feature selection process to halt upon reaching a certain number of iterations.

There are four main reasons for applying feature selection techniques:

- To prevent the curse of dimensionality
- To reduce training times Liu & Motoda (2012)
- To simplify the models, making their interpretation by users or researchers easier
- To improve generalization by minimizing overfitting, which is basically the reduction of variance.

Correlation-based feature selection measures feature subsets based on the hypothesis that “good feature subsets contain features highly correlated with the classification, yet uncorrelated to each other.”

The equation below gives the merit of a feature subset S that contains a total of k features:

Merit_S = (k * r_cf) / sqrt(k + k(k - 1) * r_ff)

where r_cf is the mean value of the feature–classification correlations and r_ff is the mean of all the feature–feature correlations. The criterion of correlation feature selection is defined by:

CFS = max_S [ (r_cf1 + r_cf2 + … + r_cfk) / sqrt(k + 2(r_f1f2 + … + r_fifj + … + r_fkf1)) ]

Here the r_cfi and r_fifj variables are referred to as correlations, even though they are not necessarily Pearson correlation coefficients Bache & Lichman (2013). Mark Hall’s dissertation adopts neither of these and instead uses several other measures of relationship: relief, minimum description length, and symmetrical uncertainty. Letting x_i be the set-membership indicator function for feature f_i, the above can be rewritten as an optimization problem:

CFS = max over x in {0,1}^n of (sum_i a_i x_i)^2 / (sum_i x_i + sum_{i≠j} 2 b_ij x_i x_j)

in which a_i stands for the feature–classification correlation r_cfi and b_ij for the feature–feature correlation r_fifj.

The above combinatorial problems are mixed 0–1 linear programming problems, which can be solved with the branch-and-bound algorithm.
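
To make the merit formula concrete, here is a small Python sketch (the correlation values are made up for illustration):

```python
import math

def cfs_merit(k, r_cf, r_ff):
    """Merit of a k-feature subset S: (k * mean feature-class correlation)
    over sqrt(k + k * (k - 1) * mean feature-feature correlation)."""
    return (k * r_cf) / math.sqrt(k + k * (k - 1) * r_ff)

# Features that correlate with the class but not with each other score
# higher than an equally predictive yet mutually redundant subset.
print(cfs_merit(3, r_cf=0.6, r_ff=0.1))
print(cfs_merit(3, r_cf=0.6, r_ff=0.9))
```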

For obvious reasons, machine learning has become one of the most widely discussed topics; among the reasons given are its ability to automatically extract deep insights, build high-performing predictive data models, and recognize unknown patterns without the need for explicit programming instructions Panahiazar, Taslimitehrani, Jadhav & Pathak (2014, October). A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E. In less formal language, machine learning can be described as a subtopic of computer science often known as predictive analytics or predictive learning. The goal of machine learning is to construct or leverage algorithms that learn from data in order to build generalizable models which offer accurate predictions, or to find patterns, particularly in new and unseen data of a similar nature Panahiazar et al (2014, October).

## Machine Learning Fundamentals

As hinted at in the definition of machine learning, it leverages algorithms to automatically model and find patterns in data, in most cases with the aim of predicting some target output, also called the response. The algorithms are mainly based on mathematical optimization and statistics Bache & Lichman (2013). Optimization is the process of finding the least or greatest value (minimum or maximum) of a function, in most cases referred to as a cost function, or a loss function in the case of minimization. Gradient descent is one of the most commonly used optimization algorithms. The normal equation is another optimization method that has gained popularity over recent years. In summary, machine learning revolves around automatically learning a highly accurate predictive model or classifier, and around finding unknown patterns in data, by leveraging learning algorithms and optimization techniques.
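
As an illustration of the optimization idea, here is a minimal gradient descent sketch in Python (the cost function, learning rate, and step count are arbitrary choices for demonstration):

```python
def gradient_descent(grad, x0, lr=0.1, steps=200):
    """Repeatedly step against the gradient to minimize a cost function."""
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Minimize the cost function f(x) = (x - 3)^2; its gradient is 2 * (x - 3).
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(round(x_min, 6))  # converges to 3.0
```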

Machine learning is primarily categorized into supervised, semi-supervised, and unsupervised learning. In supervised learning, the response variable being modeled is contained in the data, and the goal is to predict the class or value of data that has not been seen. Unsupervised learning, on the other hand, entails learning from a set of data that has no response variable or label, and is thus more about finding patterns than making predictions Panahiazar et al (2014, October).

Machine learning algorithms are mainly used to produce the following types of output:

- Recommender systems
- Clustering
- Regression: univariate, multivariate
- Two-class and multi-class classification
- Anomaly detection

Each output type uses specific algorithms. Clustering is an unsupervised technique used to discover the structure and composition of a given set of data. It is the process of grouping data into clusters to find out which groupings, if any, may be derived from them Panahiazar et al (2014, October). Each cluster is characterized by a cluster centroid and a set of data points, where the cluster centroid is the average of all the data points contained in the cluster across all features. Classification problems entail the placement of a data point, also called an observation, into a pre-defined class or category; sometimes classification problems simply assign a class to a data point, while in other cases the goal is to estimate the probabilities that the data point is a member of each of the given classes.
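
The centroid-based clustering just described can be sketched as a plain k-means loop (the 2-D points are toy data; the naive initialization and fixed iteration count are simplistic choices for illustration):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign every point to its nearest centroid, then
    recompute each centroid as the average of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)   # naive initialization
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda c: (p[0] - centroids[c][0]) ** 2
                                        + (p[1] - centroids[c][1]) ** 2)
            clusters[nearest].append(p)
        for c, members in enumerate(clusters):
            if members:                 # keep old centroid if cluster empties
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return centroids

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(sorted(kmeans(points, 2)))  # two centroids, one per visible group
```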

## Machine Learning Tasks and Algorithms

Regression, on the other hand, means that a model assigns a continuous response to a data observation, as opposed to a discrete class Panahiazar et al (2014, October). At other times, the term regression is used for an algorithm employed in classification problems, or in predicting a discrete categorical response, for example ham or spam. Logistic regression offers an excellent example: it is used to predict the probability of a specific discrete value.

At times, anomalies indicate a real problem that is not easily explained, for example a manufacturing defect. In such cases, detecting anomalies provides a measure of quality control, as well as insight into the effectiveness of the steps taken to reduce defects. In both cases, there are times when finding the anomalous values is of benefit, hence the use of certain machine learning algorithms Chandrashekar & Sahin (2014).

Recommendation systems, also called recommendation engines, are a type of information filtering system aimed at providing recommendations in numerous applications, among them books, articles, movies, restaurants, and products. The two most common approaches adopted are content-based and collaborative filtering Panahiazar et al (2014, October).

Some machine learning algorithms are listed below:

Supervised regression

- Poisson regression
- Simple and multiple linear regression
- Ordinal regression
- Nearest neighbor methods
- Decision tree and forest regression
- Artificial neural networks
- Anomaly detection
- Principal component analysis
- Support vector machines

Supervised two-class and multi-class classification

- Perceptron methods
- Bayesian classifiers
- One-vs-all multiclass
- Artificial neural networks Bache & Lichman (2013)
- Multinomial and logistic regression
- Support vector machines
- Nearest neighbor methods
- Decision trees, forests, and jungles

Unsupervised learning

- Hierarchical clustering
- K-means clustering

In machine learning, naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong independence assumptions between the features. Naive Bayes is a technique for constructing classifiers: models that assign class labels to problem instances, represented as vectors of feature values, where the class labels are drawn from some finite set Bache & Lichman (2013). It is a family of algorithms working on a common principle: all naive Bayes classifiers assume that the value of a particular feature is independent of the value of any other feature, given the class.

For example, a fruit may be considered to be an apple if it is red, round, and approximately 10 cm in diameter. A naive Bayes classifier considers each of these features to contribute independently to the probability that the fruit is an apple, regardless of any possible correlations between the color, diameter, and roundness features Chandrashekar & Sahin (2014). For some types of probability models, naive Bayes classifiers can be trained quite efficiently in a supervised learning setting. In many practical applications, parameter estimation for naive Bayes models uses the method of maximum likelihood Bache & Lichman (2013). This means it is possible to work with the naive Bayes model without accepting Bayesian probability or using any Bayesian methods.
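
A toy version of this fruit example (hand-rolled, with made-up training data and a crude add-one smoothing that assumes two possible values per feature):

```python
from collections import Counter, defaultdict

def naive_bayes(train, features):
    """Score each label as P(label) * product of P(feature_i | label),
    treating every feature as independent given the label (the naive
    assumption). Uses add-one smoothing, crudely assuming two possible
    values per feature."""
    labels = Counter(label for _, label in train)
    counts = defaultdict(Counter)       # counts[(position, label)][value]
    for feats, label in train:
        for i, v in enumerate(feats):
            counts[(i, label)][v] += 1
    scores = {}
    for label, n in labels.items():
        p = n / len(train)              # prior P(label)
        for i, v in enumerate(features):
            p *= (counts[(i, label)][v] + 1) / (n + 2)
        scores[label] = p
    return max(scores, key=scores.get)

train = [
    (("red", "round"), "apple"),
    (("red", "round"), "apple"),
    (("yellow", "round"), "apple"),
    (("yellow", "long"), "banana"),
]
print(naive_bayes(train, ("red", "round")))  # apple
```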

## Bayes’ Theorem and Random Forests

Selecting the best hypothesis (h) given the data (d) is normally of top interest in machine learning. In a classification problem, the hypothesis (h) could be the class to be assigned to a new data instance (d). Using prior knowledge offers one of the simplest ways of picking the most probable hypothesis from the given data Chandrashekar & Sahin (2014). Bayes’ theorem provides a way of calculating the probability of a hypothesis given prior knowledge. Bayes’ theorem states that:

P(h|d) = (P(d|h) * P(h)) / P(d)

where P(h|d) is the probability of the hypothesis h given the data d, also called the posterior probability, and P(d|h) is the probability of the data d given that the hypothesis h was true.

P(h) is the probability of hypothesis h being true regardless of the data. This is called the prior probability of h Alpaydin (2014).

P(d) is the probability of the data regardless of the hypothesis.

As can be observed, the interest here is in determining the posterior probability P(h|d) from P(d|h), P(h), and P(d). According to Bache & Lichman (2013), after calculating the posterior probability of a number of different hypotheses, the hypothesis with the highest probability can be selected; this is the maximum probable hypothesis and may be referred to as the maximum a posteriori (MAP) hypothesis. This is mathematically expressed as:

MAP(h) = max_h P(h|d)

or

MAP(h) = max_h (P(d|h) * P(h)) / P(d) Rocktäschel, Weidlich & Leser (2012)

or

MAP(h) = max_h P(d|h) * P(h)

The P(d) in the calculation is used as a normalization term that permits estimation of the probability; it can be dropped when the focus is on the most probable hypothesis, since it is constant and only needed for normalization Tuarob, Bhatia, Mitra, & Giles (2013, August).
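
The drop-P(d) simplification can be shown in a few lines (the hypothesis names and probability values are invented for illustration):

```python
def map_hypothesis(priors, likelihoods):
    """Return argmax over h of P(d | h) * P(h); the evidence P(d) is the
    same for every hypothesis, so it can be dropped."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

# Hypothetical numbers: P(h) priors and P(d | h) likelihoods for one datum.
priors = {"flu": 0.10, "cold": 0.90}
likelihoods = {"flu": 0.80, "cold": 0.05}
print(map_hypothesis(priors, likelihoods))  # flu (0.08 beats 0.045)
```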

Random Forests are supervised ensemble learning models used for classification and regression. Ensemble learning models combine numerous machine learning models to achieve better performance Alpaydin (2014). The idea behind this is that each model used in the ensemble is weak, and hence not effective when employed on its own, but strength is gained by aggregating multiple learning models together. In the case of the Random Forest, a great number of Decision Trees, which serve as the weak learners, are incorporated and their outputs aggregated, the result being the strong ensemble achieved.

The random in Random Forests derives from the fact that the algorithm trains each individual decision tree using a different subset of the training data Alpaydin (2014). Furthermore, each node of every decision tree is split using an attribute randomly selected from the data. By introducing this randomness, the algorithm manages to generate models that have no correlation with each other. The result is that errors tend to be spread evenly throughout the model, meaning the errors will finally be cancelled out through the majority-voting strategy of Random Forest models. Just as a forest becomes more robust the more trees it has, so for a Random Forest: the higher the number of trees, the higher the accuracy Iqbal, Ahmed & Abu-Rub (2012).

The decision tree concept is more of a rule-based system. Given a training data set with features and targets, the decision tree algorithm derives a set of rules. The same set of rules is then applied to perform prediction on the test dataset Alpaydin (2014).

The random forest algorithm pseudocode can be divided into two distinct stages:

- Random forest creation pseudocode
- Pseudocode to perform prediction from the generated random forest classifier

The random forest creation pseudocode is as follows:

- Randomly select “k” features from the total of “m” features, where k << m
- Among the k features, calculate the node “d” using the best split point
- Split the node into daughter nodes using the best split Panahiazar et al (2014, October)
- Repeat steps 1 to 3 until “l” number of nodes has been reached
- Repeat steps 1 to 4 n times to create n trees; this builds the forest.

Every random forest algorithm starts with the selection of k features from the total m features available. As can be observed in the procedure, the features and observations are taken randomly Quinlan (2014). The next phase involves using the randomly selected k features to obtain the root node using the best split approach. This step is followed closely by calculation of the daughter nodes, which adopts the same best split approach. The first three stages are repeated until a tree with a root node is obtained, in which the target is the leaf node. The final stage is to repeat the first four stages n times, creating n randomly built trees. These randomly created trees form the random forest.

The pseudocode below is used to perform prediction with a trained random forest algorithm:

- Picks on the forest features and adopt the rules of each of the randomly created decision tree in the prediction of the outcome and keep the predicted outcomes Chandrashekar & Sahin (2014)
- Calculation of the votes of each of the predicted targets
- Taking into consideration the predicted target the received the highest number of votes and considering it as the fin prediction from the algorithm of the random forest.

Using the trained random forest for prediction means passing the test features through the rules established for each of the randomly created trees. Suppose, for example, that 100 random decision trees were built Chandrashekar & Sahin (2014). The first thing to understand is that the individual trees may predict different targets for the same test features. Each predicted target is then tallied: if the 100 trees predict three unique targets named x, y and z, the count for x is simply the number of trees, out of the 100, whose prediction is x, and likewise for y and z Quinlan (2014). If x received the highest number of votes, say 60 of the 100 trees predicted it, then x becomes the outcome the random forest returns as the predicted target. This concept is referred to as majority voting.
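The majority-voting step can be shown in a few lines of standard-library Python; the 60/25/15 split over the targets x, y and z is the illustrative count used in the example above, not real tree output.

```python
# Tally hypothetical predictions from 100 decision trees and pick the winner.
from collections import Counter

tree_predictions = ["x"] * 60 + ["y"] * 25 + ["z"] * 15  # 100 trees

votes = Counter(tree_predictions)
final_prediction, n_votes = votes.most_common(1)[0]
print(final_prediction, n_votes)  # x 60
```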

The random forest algorithm has numerous applications, among them banking, e-commerce, medicine and the stock market. In the banking sector it has two main uses: detecting fraudulent customers and identifying loyal ones. A loyal customer here is one who borrows large amounts, pays well and delivers the loan interest to the bank effectively Ferrucci et al. (2013). The growth and development of any bank is directly influenced by the availability of loyal customers, so the bank analyzes each customer's details closely to establish their pattern. In the same way, it is equally important to identify customers who bring little or no profit to the bank.

Such a customer is one who does not take loans, or who, if they did, would not pay the interest effectively. By identifying such customers before loans are advanced, the bank can reject approvals for their loans Panahiazar et al. (2014, October), and the random forest algorithm is again applicable here in identifying non-profitable customers. In the stock market, the random forest algorithm is applied to model the behaviour of a stock and the anticipated profit or loss from purchasing it.

In the medical field, the random forest algorithm aids in identifying the right combination of components for validating a drug, and it is equally important in identifying diseases through the analysis of patients' medical records Chandrashekar & Sahin (2014). In e-commerce, the random forest algorithm is used in small segments of recommendation engines to estimate the chance that a customer will prefer a recommended product, based on similar types of customers. Running the random forest algorithm on huge datasets calls for high-end GPU systems, but where these are unavailable the machine learning models can be run on a desktop hosted in the cloud Ferrucci et al. (2013).

Among the advantages of the random forest algorithm:

- The same random forest algorithm can be used for both regression and classification tasks
- The random forest algorithm is resistant to overfitting in classification problems Mikolov, Chen, Corrado & Dean (2013)
- The random forest algorithm is applicable in feature engineering
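The feature-engineering use can be illustrated with scikit-learn's impurity-based feature importances; the dataset and parameters below are hypothetical, and scikit-learn stands in for the Weka tooling the study proposes.

```python
# Rank features by a fitted random forest's impurity-based importances.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data with 10 features, only 3 of which carry signal (hypothetical).
X, y = make_classification(n_samples=300, n_features=10, n_informative=3,
                           random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

importances = forest.feature_importances_
# Feature indices sorted from most to least important.
ranking = sorted(range(X.shape[1]), key=lambda i: importances[i], reverse=True)
print(ranking[:3])  # indices of the three strongest features
```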

Information gain is a synonym for the Kullback-Leibler divergence, which is a non-symmetric measure of the level of divergence between two probability distribution functions P and Q.

The Kullback-Leibler divergence is the expected value of the logarithmic difference between the probability distributions P and Q; when P and Q are equal, the Kullback-Leibler divergence is zero Quinlan (2014). Mutual information may also be used to define information gain, as the reduction in entropy achievable by learning the value of a feature A:

IG(S, A) = H(S) − Σ_i (|S_i| / |S|) · H(S_i)

where H(S) is the entropy of the given dataset and H(S_i) is the entropy of the i-th subset produced by partitioning S on the feature A. Information gain helps rank the various features in machine learning: the feature with the highest information gain ranks above the others, since it has the strongest power for classifying the data Ferrucci et al. (2013). Information gain can likewise be defined for a joined set of features F, as the entropy reduction achieved by observing the whole joint feature set; the definition parallels the one above:

IG(S, F) = H(S) − Σ_i (|S_i| / |S|) · H(S_i)

where H(S_i) here denotes the entropy of the i-th subset produced by partitioning on all the features in the joint feature set F.
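A worked example of the entropy-reduction definition of information gain, using only the standard library (the labels and split are illustrative, not from the task data):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, subsets):
    """H(S) minus the size-weighted entropies of the subsets a feature induces."""
    n = len(labels)
    return entropy(labels) - sum(len(s) / n * entropy(s) for s in subsets)

# Splitting a perfectly mixed set into two pure halves recovers one full bit.
S = ["pos"] * 4 + ["neg"] * 4
gain = information_gain(S, [["pos"] * 4, ["neg"] * 4])
print(gain)  # 1.0
```

A feature whose split leaves the subsets as mixed as the original set would score a gain of zero, which is why features are ranked by this quantity.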

The study proposes to use bigrams over the concatenation of each patient's record text. Filters will then be used to extract and retain the features that occur more than five times across the entire set of patient records. Tf-idf will then be calculated for the noted key features, a Vector Space Model will be used to represent the tf-idf results, and Weka tools will be used to select features and train the model Quinlan (2014).

I will also make use of UMLS MetaMap, deducing concepts from each patient's individual record text. Filters will then be used to extract and retain the features that occur more than five times across the entire set of patient records, after which the tf-idf of every feature will be calculated and a Vector Space Model used to represent the tf-idf results. Finally, Weka tools will be used to select features and train the model Ferrucci et al. (2013).

## Conclusion

This task aimed to identify whether a patient meets, does not meet, or possibly meets a selected set of eligibility criteria based on their longitudinal records. The eligibility criteria come from real clinical trials and focus on patients' medications, past medical histories, and whether certain events have occurred in a specified timeframe in the patients' records. The task uses data from the 2014 i2b2/UTHealth Shared-Tasks and Workshop on Challenges in Natural Language Processing for Clinical Data. A literature review of the various machine learning methods and techniques is also provided to offer insight into these techniques and their applicability. As the discussion above shows, different methods and techniques have different advantages and disadvantages, which should be weighed before a machine learning technique or algorithm is chosen for identifying the eligibility of patients for clinical trials.

## References

Robert, C. (2014). Machine learning: A probabilistic perspective.

Michalski, R. S., Carbonell, J. G., & Mitchell, T. M. (Eds.). (2013). Machine learning: An artificial intelligence approach. Springer Science & Business Media.

Witten, I. H., Frank, E., Hall, M. A., & Pal, C. J. (2016). Data mining: Practical machine learning tools and techniques. Morgan Kaufmann.

Quinlan, J. R. (2014). C4.5: Programs for machine learning. Elsevier.

Alpaydin, E. (2014). Introduction to machine learning. MIT Press.

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., & Kudlur, M. (2016, November). TensorFlow: A system for large-scale machine learning. In OSDI (Vol. 16, pp. 265-283).

Bache, K., & Lichman, M. (2013). UCI Machine Learning Repository.

Liu, H., & Motoda, H. (2012). Feature selection for knowledge discovery and data mining (Vol. 454). Springer Science & Business Media.

Chandrashekar, G., & Sahin, F. (2014). A survey on feature selection methods. Computers & Electrical Engineering, 40(1), 16-28.

Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. Journal of Machine Learning Research, 13(Jan), 27-66.

Panahiazar, M., Taslimitehrani, V., Jadhav, A., & Pathak, J. (2014, October). Empowering personalized medicine with big data and semantic web technology: Promises, challenges, and use cases. In Big Data (Big Data), 2014 IEEE International Conference on (pp. 790-795). IEEE.

Goeuriot, L., Kelly, L., Li, W., Palotti, J., Pecina, P., Zuccon, G., & Mueller, H. (2014, September). ShARe/CLEF eHealth Evaluation Lab 2014, Task 3: User-centred health information retrieval. In Proceedings of CLEF 2014.

Rocktäschel, T., Weidlich, M., & Leser, U. (2012). ChemSpot: A hybrid system for chemical named entity recognition. Bioinformatics, 28(12), 1633-1640.

Ferrucci, D., Levas, A., Bagchi, S., Gondek, D., & Mueller, E. T. (2013). Watson: Beyond Jeopardy! Artificial Intelligence, 199, 93-105.

Paik, J. H. (2013, July). A novel TF-IDF weighting scheme for effective ranking. In Proceedings of the 36th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 343-352). ACM.

Hong, T. P., Lin, C. W., Yang, K. T., & Wang, S. L. (2013). Using TF-IDF to hide sensitive itemsets. Applied Intelligence, 38(4), 502-510.

Trstenjak, B., Mikac, S., & Donko, D. (2014). KNN with TF-IDF based framework for text categorization. Procedia Engineering, 69, 1356-1364.

Del Río, S., López, V., Benítez, J. M., & Herrera, F. (2014). On the use of MapReduce for imbalanced big data using Random Forest. Information Sciences, 285, 112-137.

Tuarob, S., Bhatia, S., Mitra, P., & Giles, C. L. (2013, August). Automatic detection of pseudocodes in scholarly documents using machine learning. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on (pp. 738-742). IEEE.

Iqbal, A., Ahmed, S. M., & Abu-Rub, H. (2012). Space vector PWM technique for a three-to-five-phase matrix converter. IEEE Transactions on Industry Applications, 48(2), 697-707.

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.
