12 Practice

In this section, we’ll work on annotating and understanding unfamiliar code.

The code examples may look intimidating! But remember, we’re less concerned here about understanding every detail about what the code is doing, and more about using what we’ve learned about programming logic as a lens through which to begin interpreting code.

12.1 New types of conditions (in SQL)

Consider the following SQL command:

SELECT * FROM OceanBuoys
WHERE Ocean = 'Atlantic' AND (BuoyName LIKE 'S%' OR BuoyName LIKE 'K%');

One option for navigating unfamiliar code is to break down the code into its component parts. Using that approach, you can differentiate functions separate from variables and names to get a better understanding of what type of data you’re working with, and what the code is trying to do.

Exercise: Using what we’ve learned in prior sections, try to answer the following questions.

  • What does the OceanBuoys term most likely refer to - variable or function?
    • Tip: Variables will contain a data object. Functions perform an action or task.
  • Assuming that SELECT is a function, what might it do?
  • The * character is new. What might * indicate, in combination with SELECT?
  • Parsing the second line that starts with WHERE, can you make an educated guess about what Ocean and BuoyName refer to, in terms of variables?
  • Can you guess (or feel free to use your favorite search engine) what LIKE 'S%' and LIKE 'K%' would indicate?
  • Note the ; at the end of the line starting with WHERE. What might this mean?
Do your answers match these?
  • The OceanBuoys term here refers to a table or dataframe or matrix (really, any kind of tabular format).
  • The SELECT function selects columns from the OceanBuoys table.
  • The * is a wildcard matching symbol. It indicates that we want to return all columns from the table, no matter what their column names are.
  • The OceanBuoys table must have columns named Ocean and BuoyName.
    • If we wanted to return only these columns, the syntax could be: SELECT Ocean, BuoyName FROM OceanBuoys
  • LIKE is used for pattern matching. The symbol % is a wildcard character in SQL. This code is matching values that start with “S” or with “K”.
  • The conditions after WHERE are the row-based conditions for what will be returned, analogous to SELECT for the column-specific condition.
  • The ; is used to indicate the end of a statement in SQL. It tells the computer that the ‘thought’ is finished, and the action of doing the thought can commence.


Exercise, continued: Now, can you write out a narrative of what you see the code is trying to do?

Example written narrative

This code is returning a subset of data from the OceanBuoys table. Starting from the full OceanBuoys table, return all the columns from this table, but keep only rows where Ocean is equal to “Atlantic” and where the BuoyName value starts with “S” or “K”. The returned data should be tabular, with the same number of columns as the full OceanBuoys table, but likely with less rows (only those that met the condition).



12.2 New syntax and terms (in R)

While the logic underlying programming languages stays consistent, a central challenge is that different languages often have their own special syntax, which can take a while to get used to. Don’t panic! Familiarity comes with experience and in the meantime, Google is your friend (well, the search engine of your choice. Some of us prefer DuckDuckGo).

Here is an example of R code:

1.  df_new <- df %>% 
2.    select(-respondent_name) %>% 
3.    mutate(identifier = paste(respondent_id, survey_wave, sep = "_")) %>% 
4.    mutate(survey_type = 
5.           ifelse(survey_wave %in% c("first", "second"), "phone", "in person"))
  

Exercise: Go line by line and annotate, in your own words, what that line of code is doing. Then, combine these into a written narrative of what this code is doing.

Each line has a new take on the same logic we’ve covered in prior sections; a breakdown of new terms is below to guide your annotation.

  • Line 1
    • What is df_new <- df doing here?
      • Hint: refer to 8.1.3, Common Variable Names
    • What does the %>% symbol indicate?
  • Line 2
    • What is the select() function likely doing?
    • What might it mean that the argument in this function is preceded by -?
  • Line 3
    • Using context clues (and your favorite search engine) what do you think the mutate() function does?
    • The new paste() function has three arguments (paste(arg1, arg2, arg3)). What do you think the arguments are for?
  • Line 4 & 5 (note: new line not strictly necessary, only to fit in page width without needing scroll)
    • This ifelse() statement has a different format than we’ve covered so far, but the concept is the same. Assuming this ifelse() statement has three arguments (ifelse(condition, mystery1, mystery2)) what might the two mystery arguments be specifying?
    • Looking at the section survey_wave %in% c("first", "second"), what do you think this would translate to, as a written explanation of the task here?
  • Finally, it is always important to understand the data type and structure of the data being acted upon. Keep in mind, based on the questions above, what is the structure of the data being used here?
Example narrative

Starting from the table/dataframe called df, we want to keep (select) all columns except the column named respondent_name.

Then, make a new column called identifier. This new column is created by pasting together the value in the respondent_id column and the survey_wave column, separated by an underscore, _.

Then, make another new column called survey_type. The values in this column are determined by an ifelse statement: if the value in the survey_wave column is any of the values specified in the list ("first", "second") (so if the value is first or second), then the value in the survey_type column will be phone. Otherwise, the value will be in person.



12.3 New functions in Stata

This challenge uses Stata, a proprietary statistical analysis platform. Stata has a number of unique features and syntax that can make it challenging to interpret. (At least in this instructor’s experience, Stata is not user-friendly, but your mileage may vary!)

Relying on what we’ve learned so far and context clues, what is the code below doing?

replace variable = variable[_n-1] if missing(variable)
What is this code doing?

This is a bit of code to replace any missing values of variable with with previous (n-1) value of variable.

There are many ways to replace missing values in Stata, this is one.

12.4 C++

This is a C++ file named testScratchDoc.cpp

Review the file then answer the following questions, annotating as you see fit.

  • How is the code similar to some of the code that we’ve seen so far?
  • What do you think the code does (generally)?
  • What questions do you have about it?
  • What are some suggestions to make it easier to read?
#include <iostream>
#include <stdio.h>
#include <cstring>


int main() {
    printf("Hello World!\n");

//    std::cout << "Hello, World!" << std::endl;

    // mbr (a,b,c,d)
    int a=6;
    int b=3;
    int c=8; //c=b+1; inner/outer test c=4
    int d=5;

    //mbr (i,j,k,l)
    int i=1;    // i=n;  inner/outer test i=1
    int j=2;
    int k=3;  // k=j+1; inner/outer test k=3
    int l=4;
    bool intersect_x = false;
    bool intersect_y = false;

    printf("\nMBR1[%d,%d,%d,%d]\n",a,b,c,d);
    printf("MBR1[%d,%d,%d,%d]\n",i,j,k,l);

    if(!((c<i) ||(k<a) ))
    {
        printf("x intersects!\n");
        intersect_x = true;
        printf("intersect_x is %d\n", intersect_x);
    } else{
        printf("x does not intersect!\n");
    }

    if(!((d<j) ||(l<b) ))
    {
        printf("y intersects!\n");
        intersect_y = true;
        printf("intersect_y is %d\n", intersect_y);
    } else{
        printf("y does not intercept!\n");
    }

    if(intersect_x&&intersect_y)
        printf("MBR1 intersects with MBR2\n");
    else
        printf("MBR1 does not intersect MBR2\n");



/*
    // testing writing to memory using pointers
    int page = 1;

    int *pData; // pointer to data
    pData = &page;
    printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);

    int* newP; // new pointer to data
    int a = 2;
    int b = 3;

    int m = sizeof(b);

    newP = pData+m;

    memcpy(pData,&a,sizeof(int));
    printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);

    memcpy(pData+m,&b,sizeof(int));
//    newP = pData+sizeof(int);
    printf("\nnewP is %s; \n&newP is %p; \n*newP is %d",newP, &newP, *newP);
*/ // end testing writing to memory using pointers



    // testing whether you can initialize a structure with an outside variable
    // not really, you can with an outside constant
/*    const int b = 100;
    const int c = 7;
    const int a = (int)(b/c);
    int i;

    struct test {
        int test[a];
    };

    test mytest;

    for(i=0; i<a; i++)
    {
        mytest.test[i]=i;
    }

    for(i=0; i<a; i++)
    {
        printf("\nmytest[%d]=%d",i,mytest.test[i]);
    }*/


    return 0;
}
/*
#include <stdio.h>

main() {
    printf("Hello World!\n");
    char sentence []="test your 12 7 42";
    char str[20];
    int a, b, c;
    sscanf(sentence,"%s %s %d %d %d", str, str, &a, &b, &c);
    printf("%d %d %d\n", a,b,c);
    printf("%s\n", str);
    sscanf(sentence,"%s your %d 7 %d", str, &a, &b);
    printf("%d %d %d\n", a,b,c);
    printf("%s\n", str);
    printf("%s\n", sentence);

}*/

12.5 MATLAB

This is a MATLAB file named assignment01_leaf.m

Review the file then answer the following questions, annotating as you see fit.

  • How is the code similar to some of the code that we’ve seen so far?
  • What do you think starts a comment line?
  • What do you think the code does (generally)?
  • What questions do you have about it?
  • What are some suggestions to make it easier to read?
function assignment01_leaf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% (C) Student Name Here %%%%%%%%%%%%%%%%%%%%%%%%%
TRAIN = load('FaultsNNA_csv'); % Only these two lines need to be changed to test a different dataset. % 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

TRAIN_class_labels = TRAIN(:,1);    % Pull out the class labels.
TRAIN(:,1) = [];    % Remove class labels from training set.

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display Default Rate and other Basic Information
%
[class,count] = mode(TRAIN_class_labels(:,1));
default_error_rate=1-(count/size(TRAIN_class_labels(:,1),1));
disp(['The dataset you tested has ', int2str(length(unique(TRAIN_class_labels))), ' classes.']);
disp(['The training set is of size ', int2str(size(TRAIN,1)),'.']);
disp(['The time series are of length ', int2str(size(TRAIN,2)),'.']);
disp(['The dataset''s most common class is ',num2str(class),' with a total of ',num2str(count),' occurances.']);
disp(['The default error rate is ',num2str(default_error_rate),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display the k-fold error rate
%
k=length(TRAIN_class_labels);
k_fold_error_rate = k_fold_cross_validation(TRAIN,TRAIN_class_labels,k);
disp(['The normal k-fold error rate is ',num2str(k_fold_error_rate),' where k=',num2str(k),' and length of train set is ',num2str(length(TRAIN_class_labels)),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Here is a sample classification algorithm, it is the simple (yet very competitive) one-nearest
% neighbor using the Euclidean distance. 
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function predicted_class = Classification_Algorithm(TRAIN,TRAIN_class_labels,unknown_object)
best_so_far = inf;
for i = 1 : length(TRAIN_class_labels)
    compare_to_this_object = TRAIN(i,:);
    distance = sqrt(sum((compare_to_this_object - unknown_object).^2)); % Euclidean distance
    if distance < best_so_far
        predicted_class = TRAIN_class_labels(i);
        best_so_far = distance;
    end
 end;