CITS4009 Computational Data Analysis Final Exam Cheat Sheet
Printed Version: 4009_final_notes.pdf
Life cycle:
- Define the goal
- Collect and manage data
- Build the model
- Evaluate and critique model
- Present results and document
- Deploy model
EDA:
- Data Cleaning/Transformation
- Visualisation (e.g. Income is skewed)
- Log Transformation
- Scaling
- Categorisation
- Training etc.
!! Each step can loop back to a previous step. Exploratory data analysis (EDA) is a task that uses visualization and transformation to explore the data in a systematic way. The goal is to develop an understanding of the data by using questions as tools to guide the investigation.
Iterative process:
- Generating questions about your data
- Searching for answers by visualizing, transforming, and modelling the data
- Using knowledge learned to refine questions and/or generate new questions
EDA sensible questions:
- How many peaks are there for a numerical variable (use density plot or histogram)
- What's the central tendency of a numerical variable (use boxplots)
- What's the covariance between two numerical variables (use scatter plots, or hexbin)
- What's the proportion of two categorical variables (use barcharts)
- What variation occurs within vars?/covariation occurs between vars?
**Hypothesis Generation (Data Exploration):**
Look at the data and, armed with subject knowledge, generate interesting hypotheses in multiple ways to help explain why the data behaves the way it does.
**Hypothesis Confirmation:**
- Using a precise mathematical model to generate falsifiable predictions
- An observation can only be used once in order to confirm a hypothesis
- Write an analysis plan in advance and stick to it
Operators:
- %/% integer division
- %*% dot product or matrix multiplication
- %in% membership test (is each value contained in a vector?)
- & and | or ! not
- %>% pipe
- <- vs =: <- is the assignment operator; = supplies a named argument in a function call.
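A minimal sketch illustrating these operators; the values and vectors below are invented purely for illustration:

```r
# Hypothetical values purely for illustration
7 %/% 2                      # integer division: 3
c(1, 2) %*% c(3, 4)          # dot product / matrix multiplication: 11 (a 1x1 matrix)
3 %in% c(1, 2, 3)            # membership test: TRUE
(TRUE & FALSE) | !FALSE      # logical and / or / not: TRUE

library(dplyr)               # provides the %>% pipe
c(4, 1, 9) %>% sort() %>% head(2)    # pipe: pass the result to the next call

x <- 10                              # <- assigns to x
mean(c(1, NA, 3), na.rm = TRUE)      # '=' names the na.rm argument
```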
**summary()**
- Can look for common issues: missing values; invalid values and outliers; data ranges that are too wide or too narrow; the units of the data.
**ggplot:**
ggplot(data = <a dataset>) + <a geom function (the shape of the data points)>(mapping = aes(<a collection of mappings: which column supplies each coordinate/aesthetic>))
**Plots for a single variable:**
- Histogram: Examines data range. geom_histogram(aes(x=a), binwidth=5, fill="gray")
- Density Plot: Checks number of modes Checks if distribution is normal/lognormal/etc. geom_density(aes(x=a))
- Boxplot: Checks for anomalies and outliers. geom_boxplot(aes(y=a), outlier.colour="red")
- Bar Chart: Compares relative or absolute frequencies of the values of a categorical variable. geom_bar(aes(x=a), fill="gray")
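A self-contained sketch of these single-variable plots on a made-up data frame (the columns a and g are invented, not from a real dataset):

```r
library(ggplot2)

# Made-up data for illustration only
df <- data.frame(a = rnorm(1000, mean = 50, sd = 10),
                 g = sample(c("low", "high"), 1000, replace = TRUE))

ggplot(df) + geom_histogram(aes(x = a), binwidth = 5, fill = "gray")   # data range
ggplot(df) + geom_density(aes(x = a))                                  # number of modes
ggplot(df) + geom_boxplot(aes(y = a), outlier.colour = "red")          # anomalies/outliers
ggplot(df) + geom_bar(aes(x = g), fill = "gray")                       # categorical frequencies
```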
Plots for two variable comparisons:
**Two categorical:** geom_count(aes(x=a, y=b, size=c, color=d)), geom_tile(aes(x=a, y=b, fill=c))
[different types of position adjustments: position="stack", position="dodge", position="fill", position="identity"]
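A sketch of the two-categorical-variable plots and the position adjustments, using invented columns cls and seg:

```r
library(ggplot2)

# Invented example data
df <- data.frame(cls = sample(c("A", "B", "C"), 500, replace = TRUE),
                 seg = sample(c("X", "Y"), 500, replace = TRUE))

ggplot(df) + geom_count(aes(x = cls, y = seg))                 # dot size ~ frequency
ggplot(as.data.frame(table(df)), aes(x = cls, y = seg)) +
  geom_tile(aes(fill = Freq))                                  # heat map of counts

# Position adjustments on a bar chart
ggplot(df, aes(x = cls, fill = seg)) + geom_bar(position = "stack")
ggplot(df, aes(x = cls, fill = seg)) + geom_bar(position = "dodge")
ggplot(df, aes(x = cls, fill = seg)) + geom_bar(position = "fill")   # proportions
```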
Continuous Data Visualization:
Boxplot: The IQR (inter-quartile range) represents the middle 50% of the data. Q0-Q1 and Q3-Q4 denote the lower and upper whiskers respectively, covering points outside the middle 50%. Q1/Q3 represent the lower/upper quartile (25%/75%), while Q2 is the median (50%). IQR = Q3 - Q1, the upper whisker end Q4 = Q3 + 1.5*IQR, and the lower whisker end Q0 = Q1 - 1.5*IQR.
Interpreting Boxplots:
- Large IQR indicates significant variability.
- A box (IQR) positioned higher on the axis indicates higher values of the variable.
- Median skewed to one side of the box indicates similar values in one quartile group but significant variation in the other.
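A small sketch (with invented data) confirming the quartile and whisker arithmetic above, and how boxplot.stats() reports points beyond the whiskers:

```r
set.seed(1)
x <- c(rnorm(100, mean = 50, sd = 5), 90)   # invented data with one extreme value

q <- quantile(x, c(0.25, 0.5, 0.75))        # Q1, Q2 (median), Q3
iqr <- IQR(x)                               # Q3 - Q1
upper_whisker <- q[3] + 1.5 * iqr           # "Q4"
lower_whisker <- q[1] - 1.5 * iqr           # "Q0"

boxplot.stats(x)$out                        # points beyond the whiskers (outliers)
```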
!! Data Cleaning
0. Z-normalisation: Z-normalisation involves transforming a variable by subtracting its mean and dividing by its standard deviation, resulting in a mean of 0 and standard deviation of 1. This is useful when attributes have different scales, making comparisons invalid. However, it may not be suitable for detecting outliers as it makes extreme values less noticeable.
1. Invalid values: Use na_if() to replace zero values in the age column with NA, and ifelse() to replace negative income values with NA.
2. Sentinel values: Create indicator variables for the original sentinel values, then set the sentinel values in the original column to NA. For example, if gas_usage has sentinel values 1, 2 and 3, create an indicator variable for each and then set those gas_usage values to NA (see the sketch after this list).
3. Outliers: Identify outliers using boxplot.stats(). Replace invalid values with NA and re-run the statistics.
4. Missing values: Use is.na() to check for missing values.
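A sketch of steps 1 and 2 above (invalid and sentinel values) on an invented data frame; the columns, sentinel codes and indicator names follow the gas_usage example but are assumptions:

```r
library(dplyr)

# Invented data: age 0 and negative income are invalid;
# gas_usage codes 1/2/3 are sentinel values, not real usage amounts
df <- data.frame(age = c(0, 35, 52),
                 income = c(40000, -100, 75000),
                 gas_usage = c(3, 1, 210))

df <- df %>%
  mutate(age = na_if(age, 0),                              # invalid value -> NA
         income = ifelse(income < 0, NA, income),          # invalid value -> NA
         gas_ind_1 = as.numeric(gas_usage == 1),           # indicator variables
         gas_ind_2 = as.numeric(gas_usage == 2),
         gas_ind_3 = as.numeric(gas_usage == 3),
         gas_usage = ifelse(gas_usage < 4, NA, gas_usage)) # sentinel -> NA
```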
!! Handling Missing Values
- Drop the rows with missing values: Use listwise deletion if only a small proportion of values are missing, and they tend to be for the same data points.
- Count NA values: use count_missing() to count missing values in each column.
- Omit all NA: use na.omit() to remove rows with any NA values.
- Apply subsetting to remove NA: use subsetting to remove rows with NA values in specific columns.
Data Cleaning
- Convert the missing values to a meaningful value if there are too many missing data:
- Categorical: create a new category for the variable.
- Numerical:
- Missing randomly: replace them with mean (vtreat with indicator variable) or appropriate estimates (regression, clustering …).
- Missing systematically: convert to categorical and add a new category, or replace with zero and add a masking variable.
Vtreat: a package for automatically treating missing values. It creates a treatment plan that records all the information needed so that the data treatment process can be repeated.
It can be used to "prepare" (treat) your training data before you fit a model, and then again to treat new data before feeding it into the model.
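A minimal sketch of that vtreat workflow; the data frame, variable list and outcome column below are invented:

```r
library(vtreat)
set.seed(5)

# Invented training data with missing values
train <- data.frame(income = c(rnorm(95, 50000, 8000), rep(NA, 5)),
                    age    = c(rep(NA, 3), rnorm(97, 40, 10)),
                    y      = rnorm(100, 120, 15))

# Build a treatment plan for a numeric outcome
plan <- designTreatmentsN(train, varlist = c("income", "age"), outcomename = "y")

train_treated <- prepare(plan, train)   # treat training data before fitting a model
head(train_treated)                     # imputed values plus missingness indicator columns
# later: prepare(plan, newdata) applies the SAME treatment to new data
```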
5. Recoding variables:
- Change a continuous variable into a set of categories
- Create a pass/fail variable based on a set of cutoff scores
- Replace miscoded values with correct values
6. Discretizing continuous variables: convert continuous variables into ranges (discrete categories) when their exact values matter less than whether they fall into a certain range; useful when the relationship between input and output isn't linear (see the cut() sketch below). For visualising such data, geom_jitter(alpha=1/5, height=0.1) helps avoid overplotting.
7. Renaming variables: rename(df, new1 = old1, new2 = old2, ...)
!! When renaming variables in a dataframe, use the rename function with the new and old variable names.
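As referenced in item 6, a sketch of discretizing a continuous variable with cut(); the income values and cutoffs are invented:

```r
# Invented income values and cutoffs
income <- c(12000, 35000, 61000, 250000)

income_group <- cut(income,
                    breaks = c(0, 25000, 75000, Inf),
                    labels = c("low", "mid", "high"))

# Pass/fail style recoding based on a cutoff score
passed <- ifelse(income >= 25000, "pass", "fail")
```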
3. Data Transformation/Integration
- 1. Combining datasets:
- Row: rbind(dfA, dfB)
- Column: cbind(dfA, dfB)
**2. Mutating joins:** joins which add new variables to one data frame from the matching observations in another.
- Inner join — only matching observations will be kept.
- merge(df1, df2, by.x = "col1", by.y = "col2") or inner_join(df1, df2)
_Normalising a variable by the median of its group (col1):_
1. Calculate the group medians using aggregate(): median.var <- aggregate(df1[, 'var'], list(df1$col1), median).
2. Merge the two data frames: use merge() to join df1 and median.var by matching col1 with Group.1.
3. Create a new column of the normalised variable: add var.normalised by dividing var by x (the median column produced by aggregate()).
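A worked sketch of these three steps, using an invented data frame df1 with grouping column col1 and numeric column var:

```r
# Invented data
df1 <- data.frame(col1 = c("a", "a", "b", "b"),
                  var  = c(10, 14, 100, 140))

# 1. Group medians: the result has columns Group.1 and x
median.var <- aggregate(df1[, 'var'], list(df1$col1), median)

# 2. Merge on col1 = Group.1
df1 <- merge(df1, median.var, by.x = "col1", by.y = "Group.1")

# 3. Normalised variable: divide var by the group median (column x)
df1$var.normalised <- df1$var / df1$x
```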
!! Left/right outer join — matching observations as well as unmatched ones from the left/right table are kept.
- merge(df1, df2, all.x = TRUE)
- left_join(df1, df2)
!! Full outer join — matching observations as well as all unmatched ones from both tables are kept.
- merge(df1, df2, all = TRUE)
- full_join(df1, df2)
3. Filtering joins: Use semi_join(x, y) to keep all observations in x that have a match in y. Use anti_join(x, y) to drop all observations in x that have a match in y.
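A compact sketch of the join types with two invented tables:

```r
library(dplyr)

customers <- data.frame(id = c(1, 2, 3), name = c("Ann", "Bob", "Cat"))
orders    <- data.frame(id = c(2, 3, 4), amount = c(30, 45, 12))

inner_join(customers, orders, by = "id")  # ids 2 and 3 only
left_join(customers, orders,  by = "id")  # all customers, NA amount for id 1
full_join(customers, orders,  by = "id")  # ids 1-4
semi_join(customers, orders,  by = "id")  # customers with at least one order
anti_join(customers, orders,  by = "id")  # customers without any order
```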
4. Set operations:
intersect(x, y) returns only observations in both x and y.
union(x, y) returns unique observations in x and y.
setdiff(x, y) returns observations that are in x but not in y.
Sub-setting a data frame:
- Logical comparison
- Row and column index
- Row and column names
- subset() function
Piping: Connect a series of operations
Sample: Take a random sample with/without replacement
Classification and model evaluation
Classification: Learn from data to decide how to assign labels
Needed processes:
- Collect features (useful column)
- Build a training set
Methods:
1. Naïve Bayes
2. Decision trees: decision rules are more interpretable for non-technical users
3. Logistic regression:
**Estimate the relative impact of different input variables on the output.** Useful for estimating the class probabilities in addition to class assignments
4. Support vector machines:
Make fewer assumptions about variable distribution.
!! SVMs are useful when:
- there are very many input variables.
- input variables interact with the outcome or with each other in complicated (nonlinear) ways.
- training data isn’t completely representative of the data distribution in production.
4.2 Model evaluation:
Confusion matrix: table(truth = test$<outcome column>, prediction = test$pred > 0.5)
Accuracy: (TP+TN)/(TP+FP+TN+FN). Defined as the number of items categorized correctly divided by the total number of items. Inappropriate for unbalanced classes.
!! Precision: TP/(TP+FP). What fraction of the items the classifier flags as being in the class actually are in the class.
!! Recall/Sensitivity/TPR: TP/(TP+FN). What fraction of the things that are in the class are detected by the classifier. F1 score: harmonic mean of precision and recall: 2*precision*recall/(precision+recall).
Specificity/TNR: True Negatives divided by the sum of True Negatives and False Positives. This measures the fraction of non-class members that are identified as negative.
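A small sketch computing these metrics from a confusion matrix; the truth labels and predicted probabilities are invented:

```r
# Invented ground truth and predicted probabilities
truth <- c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE, TRUE, FALSE)
pred  <- c(0.9,  0.4,  0.2,   0.7,   0.8,  0.1,   0.6,  0.3)

cm <- table(truth = truth, prediction = pred > 0.5)
TP <- cm["TRUE",  "TRUE"];  FP <- cm["FALSE", "TRUE"]
TN <- cm["FALSE", "FALSE"]; FN <- cm["TRUE",  "FALSE"]

accuracy    <- (TP + TN) / (TP + FP + TN + FN)
precision   <- TP / (TP + FP)
recall      <- TP / (TP + FN)                 # sensitivity / TPR
specificity <- TN / (TN + FP)                 # TNR
f1          <- 2 * precision * recall / (precision + recall)
```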
Log Likelihood
Log likelihood: a measure (non-positive number) of how well a model’s predictions match the true class labels. 0 – perfect match, smaller value – worse match.
Deviance
**Deviance:** deviance = -2*(loglikelihood - S), where S is the log likelihood of the saturated model. A perfect model returns probability 1 for items in the class and probability 0 for items not in the class (S = 0).
Akaike Information Criterion (AIC)
**Akaike Information Criterion (AIC):** AIC = deviance + 2 * numberOfParameters. More parameters mean a more complex model, which is more likely to overfit; AIC is useful for comparing models with different measures of complexity and modelling variables with different numbers of levels. It scores a model with a bonus proportional to its scaled log likelihood on the calibration data, minus a penalty proportional to the complexity of the model. Prefer: 1. largest log likelihood, 2. smallest deviance, 3. lowest AIC score.
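A sketch showing where log likelihood, deviance and AIC come from for a logistic model; the data frame and coefficients are invented:

```r
# Invented binary-outcome data
set.seed(42)
d <- data.frame(x1 = rnorm(200), x2 = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.8 * d$x1 - 0.5 * d$x2))

model <- glm(y ~ x1 + x2, data = d, family = binomial(link = "logit"))

logLik(model)     # log likelihood (non-positive; closer to 0 is better)
model$deviance    # -2*(loglikelihood - S); S = 0 for a saturated model on 0/1 outcomes
model$aic         # deviance + 2 * number of parameters (3 here, incl. intercept)
```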
Density Plots
A density plot is a representation of the distribution of values in a variable; it is the smoothed version of a histogram. A **double density plot** can be used to graph the distribution of the predicted response probabilities against the true response density. The better the separation between the peaks and valleys of the two response types (which corresponds to a larger area under the ROC curve), the better the model performs at identifying them correctly. Most importantly, the shape of the double density plot enables the selection of an appropriate threshold value for a binary classification problem.
!! ROC Plot and Double Density Plot
ROC Plot
The ROC curve is used to examine the trade-off between detecting True Positives and avoiding False Positives. It's a plot of True Positive Rate (Sensitivity) against False Positive Rate (1 - Specificity). To compare the performance of two models, the model producing the curve with a high True Positive Rate and low False Positive Rate is ideal, i.e., the curve that bulges more towards or closer to the top left.
Double Density Plot
!! Representation of the distribution of predicted probabilities for each true class.
- Plot of predicted response probabilities against true response density
- Compare performance of two models by plotting double density separately
- Ideal model has more separation/distinction between the two curves
- Enables classification of binary outcome using distinct probability threshold
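A sketch of a double density plot, assuming a data frame of predicted probabilities (pred) and true class labels (y); both columns and the score distributions are invented:

```r
library(ggplot2)

# Invented predictions: higher probabilities for the positive class
scores <- data.frame(pred = c(rbeta(300, 2, 5), rbeta(300, 5, 2)),
                     y    = rep(c("negative", "positive"), each = 300))

ggplot(scores, aes(x = pred, colour = y, linetype = y)) +
  geom_density() +
  geom_vline(xintercept = 0.5, linetype = "dashed")   # candidate decision threshold
```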
4.1 Single Variable Models
1. Null Models:
- Return a single constant answer for all situations
- Work independently of the data
- For categorical: return most popular category
- For numerical: average of all outcomes
- Define the lower-bound of performance
2. Single-Variable Model:
- Describes what a simple model can achieve
Single Categorical Variable Model
Finding the proportion of positive outcomes (assuming Income=High is our positive outcome and Income=Low is our negative outcome) that co-occur with each unique value (i.e. a factor level) of a categorical variable; that is, the probability of a positive outcome given a factor level. We can then use that probability for prediction. To do so, we build the contingency table and work out the probabilities.
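A sketch of a single categorical variable model via a contingency table; the Education and Income columns are invented, with Income == "High" as the positive outcome:

```r
# Invented training data
train <- data.frame(
  Education = c("HS", "HS", "Uni", "Uni", "Uni", "PhD"),
  Income    = c("Low", "Low", "High", "Low", "High", "High"))

tab <- table(train$Education, train$Income)    # contingency table
p_high <- tab[, "High"] / rowSums(tab)         # P(Income = High | factor level)

# Predict for new data by looking up each level's probability
newdata <- data.frame(Education = c("Uni", "HS"))
pred <- p_high[as.character(newdata$Education)]
```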
Single numerical variable model
Predicting and Evaluating Numeric Variables
Assessing AUC
Contingency Table Example
**Overfitting:** the model looks great on the training data and then performs poorly on new data. Training error (TE): a model's prediction error on the data it was trained on. Generalisation error (GE): a model's prediction error on new data. If GE is large (test performance is poor) while TE is small, the model is overfit. Simpler models tend to generalize better and avoid overfitting.
4.3 Classification models
Multi-variable model: Decision Trees split the training data into pieces and use a simple constant on each piece, so the prediction is piecewise constant. Disadvantages:
- A tendency to overfit, especially without pruning.
- Have high training variance: samples drawn from the same population can produce trees with different structures and different prediction accuracy.
- Simple decision trees are not as reliable as other tree-based ensemble methods: e.g., random forests.
**Data preparation for KNN:** KNN predicts a property of a datum based on the data points that are most similar to it. It can be used for regression and multi-class classification. (For categorical variables in the dataset, the Hamming distance should be used.)
steps:
1. Split the dataset into a training set and a calibration set (assuming the test set is completely unknown).
2. Determine a set of candidate k values; use the calibration set to help find the optimal k from the set.
3. Drop numerical columns having too many NAs from the set of input features; kNN is also sensitive to differing scales in the numerical columns (handled by normalisation in step 6).
4. If the number of NAs and/or missing values is small in a numerical column, then they can be imputed by the median or mean value of the column;
5. NAs in a categorical column may be treated as a separate level. Categorical columns whose levels can be ordered (e.g., small/medium/large) should be converted into numerical ones.
6. Normalization is usually required to ensure that there are no dominating numerical columns in the training set. Feature columns in the calibration and test set will need to be normalized the same way.
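A sketch of these preparation steps (split, normalise the calibration set the same way as the training set, then pick k) using class::knn(); the data frame and column names are invented:

```r
library(class)   # for knn()

# Invented all-numeric data with a binary label
set.seed(7)
d <- data.frame(age = rnorm(100, 40, 12), income = rnorm(100, 60, 20))
d$label <- factor(ifelse(d$age + rnorm(100, sd = 10) > 40, "yes", "no"))

idx   <- sample(nrow(d), 0.8 * nrow(d))     # 80/20 training/calibration split
train <- d[idx, ];  calib <- d[-idx, ]

# Normalise with the TRAINING means/sds and reuse them on the calibration set
mu   <- colMeans(train[, 1:2])
sdev <- apply(train[, 1:2], 2, sd)
train_s <- scale(train[, 1:2], center = mu, scale = sdev)
calib_s <- scale(calib[, 1:2], center = mu, scale = sdev)

for (k in c(3, 5, 7)) {                      # use the calibration set to pick k
  pred <- knn(train_s, calib_s, cl = train$label, k = k)
  cat("k =", k, "accuracy =", mean(pred == calib$label), "\n")
}
```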
Data Splitting
!! Split the dataset into a training set and a calibration set using an 80/20 ratio.
Clustering Technique
5.1 Clustering
- How dist() works: there are 6 points in the table, so 15 distance values will be computed. By default, dist() uses the Euclidean distance function to compute a 6 x 6 lower-triangular distance matrix. The object returned by dist() contains the distance values and other information such as which distance function was used, the number of data points, etc. For example, if we use the Hamming distance for the categorical variable carType and add it to the Euclidean distance for the numerical Age column, then the distance between points 1 and 2 is: (23 - 17) + 1 = 7.
!! CH Index The CH-index measures the ratio of the variance B (between clusters) to the variance W (within clusters). We want the variance between clusters to be large, and we want data points within each cluster to be close to each other so we want W to be small. So the optimal number of clusters, kbest, should maximise the CH-index. Here B = BSS/(k-1) and W = WSS/(n-k), where n = number of data points; BSS = between sum of squares (measuring how far apart clusters are from each other); WSS(i) = the average squared distance of data points in the ith cluster from the cluster centroid; WSS is the total of all the WSS(i), for i=1,…,k.
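A sketch computing the CH index over candidate k values using kmeans(), whose output provides betweenss (BSS) and tot.withinss (total WSS); the 2-D data is invented:

```r
set.seed(3)
# Invented 2-D data with three loose groups
pts <- rbind(matrix(rnorm(60, mean = 0,  sd = 1), ncol = 2),
             matrix(rnorm(60, mean = 5,  sd = 1), ncol = 2),
             matrix(rnorm(60, mean = 10, sd = 1), ncol = 2))
n <- nrow(pts)

ch_index <- function(k) {
  km <- kmeans(pts, centers = k, nstart = 20)
  B <- km$betweenss / (k - 1)       # B = BSS / (k - 1)
  W <- km$tot.withinss / (n - k)    # W = total WSS / (n - k)
  B / W
}
sapply(2:6, ch_index)               # kbest maximises the CH index
```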
- Hierarchical clustering: uses agglomerative or divisive methods to group observations with common features.
- Divisive clustering: assuming all your n data points are in one big cluster initially. From this, the data points are divided based on dissimilarity into separate groups
- Methods for grouping can include the nearest neighbour method, the centroid method, and Ward's method. In the context of the dataset provided, we can explore potential relationships between different variables, e.g. explore the similarity of different job types based on an individual's marital status, education and income. If divisive clustering is used, then we would start with all job types in one group, then make smaller groups based on dissimilarity in education background, as an example.
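A sketch of agglomerative hierarchical clustering with dist() and hclust(); the data and the choice of Ward's method are illustrative:

```r
set.seed(11)
# Invented scaled numeric features (e.g. education and income per job type)
m <- matrix(rnorm(40), ncol = 2,
            dimnames = list(paste0("job", 1:20), c("education", "income")))

d  <- dist(m, method = "euclidean")     # pairwise distances
hc <- hclust(d, method = "ward.D2")     # agglomerative clustering, Ward's method
plot(hc)                                # dendrogram
groups <- cutree(hc, k = 3)             # cut the tree into 3 clusters
```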
K-means clustering: an iterative algorithm used when the data is all numeric, with a distance metric of squared Euclidean (though it can be run with other metrics). It requires the number of clusters k to be defined beforehand.
1. Select k cluster centers at random.
2. Assign every data point to the nearest cluster center. These are the clusters.
3. For each cluster, compute its actual center.
4. Reassign all data points to the nearest (new) cluster center.
5. Repeat steps 3 and 4 until the points stop moving or you have reached a maximum number of iterations.
Clustering of these variables can reveal previously unknown information about the relationships between variables, for use in supervised learning methods. For example, there may be a previously unknown strong correlation between income and education that may affect the utility of a model that uses both variables to predict the cardiovascular disease risk factor.
5.2 Regression
Logistic regression
When to Use Regression Methods for Data Modeling?
- Analyzing relationships between variables.
- Predicting a numeric outcome based on features.
- Identifying trends.
- Determining influential variables in predictions.
Difference between Linear and Logistic Regression:
**Linear Regression:** Purpose: predicts a continuous dependent variable based on predictors. Example: predicting house prices based on factors like size and location.
Logistic Regression: Purpose: Predicts the probability of a categorical (often binary) outcome. Example: Predicting if a person will buy a product based on age, income, and gender.
Summary: While both linear and logistic regressions predict the value of a dependent variable, they differ in the type of outcome they predict. Linear regression predicts continuous outcomes, while logistic regression predicts the probability of a categorical outcome.
Describe How One Can Build a Linear Regression Model Using the Credit Risk Dataset?
Pick numerical fields, such as income, loan amount, annual repayment, and the number of children as variables, and the score as a response variable. Find the best coefficients to predict risk.
Describe How One Can Build a Logistic Regression Model of the Same Dataset and Explain How It Is Different from the Linear Regression Model?
Logistic regression treats the prediction as a probability that the applicant belongs to a given risk class (e.g. high risk), using a categorical (often binary) risk label as the response rather than the continuous score itself; the predicted probabilities are bounded between 0 and 1.
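A sketch contrasting the two models on an invented credit-risk-style data frame; the column names and coefficients are assumptions for illustration:

```r
# Invented credit-risk-style data
set.seed(9)
credit <- data.frame(income = rnorm(200, 60, 15),
                     loan_amount = rnorm(200, 20, 5),
                     children = rpois(200, 1))
credit$score <- 300 + 5 * credit$income - 8 * credit$loan_amount + rnorm(200, sd = 20)
credit$high_risk <- as.numeric(credit$score < quantile(credit$score, 0.3))

# Linear regression: predicts the continuous score
lin_mod <- lm(score ~ income + loan_amount + children, data = credit)

# Logistic regression: predicts the probability of the high-risk class
log_mod <- glm(high_risk ~ income + loan_amount + children,
               data = credit, family = binomial(link = "logit"))
predict(log_mod, type = "response")[1:5]   # probabilities between 0 and 1
```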