12 Practice
In this section, we’ll work on annotating and understanding unfamiliar code.
The code examples may look intimidating! But remember, we’re less concerned here about understanding every detail about what the code is doing, and more about using what we’ve learned about programming logic as a lens through which to begin interpreting code.
12.1 New types of conditions (in SQL)
Consider the following SQL command:
One option for navigating unfamiliar code is to break it down into its component parts. Using that approach, you can distinguish functions from variables and names to get a better understanding of what type of data you’re working with and what the code is trying to do.
Exercise: Using what we’ve learned in prior sections, try to answer the following questions.
- What does the OceanBuoys term most likely refer to: a variable or a function?
  - Tip: Variables will contain a data object. Functions perform an action or task.
- Assuming that SELECT is a function, what might it do?
- The * character is new. What might * indicate, in combination with SELECT?
- Parsing the second line that starts with WHERE, can you make an educated guess about what Ocean and BuoyName refer to, in terms of variables?
- Can you guess (or feel free to use your favorite search engine) what LIKE 'S%' and LIKE 'K%' would indicate?
- Note the ; at the end of the line starting with WHERE. What might this mean?
Do your answers match these?
- The OceanBuoys term here refers to a table or dataframe or matrix (really, any kind of tabular format).
- The SELECT function selects columns from the OceanBuoys table.
- The * is a wildcard matching symbol. It indicates that we want to return all columns from the table, no matter what their column names are.
- The OceanBuoys table must have columns named Ocean and BuoyName.
  - If we wanted to return only these columns, the syntax could be:
    SELECT Ocean, BuoyName FROM OceanBuoys
- LIKE is used for pattern matching. The symbol % is a wildcard character in SQL that matches any sequence of characters. This code is matching values that start with “S” or with “K”.
- The conditions after WHERE are the row-based conditions for what will be returned, analogous to SELECT for the column-specific condition.
- The ; is used to indicate the end of a statement in SQL. It tells the computer that the ‘thought’ is finished, and the action of doing the thought can commence.
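To see the same pattern-matching logic outside SQL, here is a minimal Python sketch of what the two LIKE patterns test. The buoy names are made up for illustration, and `str.startswith` is only a rough stand-in for LIKE 'S%' and LIKE 'K%'; actual SQL engines can differ on details such as case sensitivity.

```python
# Hypothetical buoy names, invented for this illustration.
buoy_names = ["Sable", "Kestrel", "Marlin", "Skua", "Osprey"]


def starts_with_s_or_k(name):
    """Rough analogue of: BuoyName LIKE 'S%' OR BuoyName LIKE 'K%'."""
    # '%' in SQL matches any run of characters (including none),
    # so 'S%' means "an S followed by anything".
    return name.startswith("S") or name.startswith("K")


matches = [name for name in buoy_names if starts_with_s_or_k(name)]
print(matches)
```

Notice that the condition is applied to each value in turn, which is the same row-by-row logic the WHERE clause expresses.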
Exercise, continued: Now, can you write out a narrative of what you see the code is trying to do?
Example written narrative
This code is returning a subset of data from the OceanBuoys table. Starting from the full OceanBuoys table, return all the columns from this table, but keep only rows where Ocean is equal to “Atlantic” and where the BuoyName value starts with “S” or “K”. The returned data should be tabular, with the same number of columns as the full OceanBuoys table, but likely with fewer rows (only those that met the condition).
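If you would like to check a narrative like this against real output, one option is to run the query on a tiny table. The sketch below uses Python’s built-in `sqlite3` module. The table contents and the extra `Depth` column are hypothetical, and the query text is an assumption reconstructed from the narrative above, not necessarily the exact command from the exercise.

```python
import sqlite3

# In-memory database with a hypothetical OceanBuoys table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE OceanBuoys (BuoyName TEXT, Ocean TEXT, Depth INTEGER)")
conn.executemany(
    "INSERT INTO OceanBuoys VALUES (?, ?, ?)",
    [
        ("Sable", "Atlantic", 120),
        ("Kestrel", "Atlantic", 85),
        ("Marlin", "Atlantic", 60),
        ("Skua", "Pacific", 200),
    ],
)

# Reconstructed from the narrative; an assumption, not the original command.
query = """
SELECT * FROM OceanBuoys
WHERE Ocean = 'Atlantic' AND (BuoyName LIKE 'S%' OR BuoyName LIKE 'K%');
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)  # each row keeps all three columns
conn.close()
```

Only the Atlantic buoys whose names start with “S” or “K” come back, and every returned row still has all of the table’s columns.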
12.2 New syntax and terms (in R)
While the logic underlying programming languages stays consistent, a central challenge is that different languages often have their own special syntax, which can take a while to get used to. Don’t panic! Familiarity comes with experience and in the meantime, Google is your friend (well, the search engine of your choice. Some of us prefer DuckDuckGo).
Here is an example of R code:
1. df_new <- df %>%
2. select(-respondent_name) %>%
3. mutate(identifier = paste(respondent_id, survey_wave, sep = "_")) %>%
4. mutate(survey_type =
5. ifelse(survey_wave %in% c("first", "second"), "phone", "in person"))
Exercise: Go line by line and annotate, in your own words, what each line of code is doing. Then, combine these annotations into a written narrative of what the code as a whole is doing.
Each line has a new take on the same logic we’ve covered in prior sections; a breakdown of new terms is below to guide your annotation.
- Line 1
  - What is df_new <- df doing here?
    - Hint: refer to 8.1.3, Common Variable Names
  - What does the %>% symbol indicate?
- Line 2
  - What is the select() function likely doing?
  - What might it mean that the argument in this function is preceded by -?
- Line 3
  - Using context clues (and your favorite search engine), what do you think the mutate() function does?
  - The new paste() function has three arguments (paste(arg1, arg2, arg3)). What do you think the arguments are for?
- Lines 4 & 5 (note: the line break is not strictly necessary; it only keeps the code within the page width)
  - This ifelse() statement has a different format than we’ve covered so far, but the concept is the same. Assuming this ifelse() statement has three arguments (ifelse(condition, mystery1, mystery2)), what might the two mystery arguments be specifying?
  - Looking at the section survey_wave %in% c("first", "second"), what do you think this would translate to, as a written explanation of the task here?
- Finally, it is always important to understand the data type and structure of the data being acted upon. Keep in mind, based on the questions above, what is the structure of the data being used here?
Example narrative
Starting from the table/dataframe called df, we want to keep (select) all columns except the column named respondent_name.
Then, make a new column called identifier. This new column is created by pasting together the value in the respondent_id column and the survey_wave column, separated by an underscore, _.
Then, make another new column called survey_type. The values in this column are determined by an ifelse statement: if the value in the survey_wave column is any of the values specified in the list ("first", "second") (so if the value is first or second), then the value in the survey_type column will be phone. Otherwise, the value will be in person.
Finally, the result is stored in a new object called df_new.
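The same three steps can be written in other languages too, which is a handy way to check that you have understood the logic rather than just the syntax. Here is a rough Python sketch of the narrative above. The two survey rows are hypothetical, and a plain list of dictionaries stands in for the dataframe.

```python
# Hypothetical survey rows standing in for the dataframe `df`.
df = [
    {"respondent_id": 101, "respondent_name": "Ana", "survey_wave": "first"},
    {"respondent_id": 102, "respondent_name": "Ben", "survey_wave": "third"},
]

df_new = []
for row in df:
    # select(-respondent_name): keep every column except respondent_name.
    new_row = {key: value for key, value in row.items() if key != "respondent_name"}
    # mutate(identifier = paste(..., sep = "_")): join two values with "_".
    new_row["identifier"] = f"{row['respondent_id']}_{row['survey_wave']}"
    # mutate(survey_type = ifelse(...)): choose a value based on a condition.
    if row["survey_wave"] in ("first", "second"):
        new_row["survey_type"] = "phone"
    else:
        new_row["survey_type"] = "in person"
    df_new.append(new_row)

for row in df_new:
    print(row)
```

As in the R version, the original `df` is left unchanged and the transformed rows end up in `df_new`.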
12.3 New functions in Stata
This challenge uses Stata, a proprietary statistical analysis platform. Stata has a number of unique features and syntax that can make it challenging to interpret. (At least in this instructor’s experience, Stata is not user-friendly, but your mileage may vary!)
Relying on what we’ve learned so far and context clues, what is the code below doing?
What is this code doing?
This is a bit of code to replace any missing values of variable with the previous (n-1) value of variable.
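To make the “use the previous value” idea concrete, here is a minimal Python sketch of the same logic. It is not a translation of the Stata command. `None` stands in for a missing value, the readings are hypothetical, and the sketch assumes values are filled in order, so a run of missing values all take the last observed value.

```python
def fill_with_previous(values):
    """Replace each missing value (None) with the value just before it."""
    filled = []
    for index, value in enumerate(values):
        if value is None and index > 0:
            # Use the already-filled previous entry, so runs of missing
            # values all inherit the last observed value.
            value = filled[index - 1]
        filled.append(value)
    return filled


readings = [4.2, None, None, 5.0, None]  # hypothetical data
print(fill_with_previous(readings))
```

A missing value in the very first position has no previous value to borrow from, so this sketch leaves it missing.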
12.4 C++
This is a C++ file named testScratchDoc.cpp.
Review the file, then answer the following questions, annotating as you see fit.
- How is the code similar to some of the code that we’ve seen so far?
- What do you think the code does (generally)?
- What questions do you have about it?
- What are some suggestions to make it easier to read?
#include <iostream>
#include <stdio.h>
#include <cstring>
int main() {
printf("Hello World!\n");
// std::cout << "Hello, World!" << std::endl;
// mbr (a,b,c,d)
int a=6;
int b=3;
int c=8; //c=b+1; inner/outer test c=4
int d=5;
//mbr (i,j,k,l)
int i=1; // i=n; inner/outer test i=1
int j=2;
int k=3; // k=j+1; inner/outer test k=3
int l=4;
bool intersect_x = false;
bool intersect_y = false;
printf("\nMBR1[%d,%d,%d,%d]\n",a,b,c,d);
printf("MBR2[%d,%d,%d,%d]\n",i,j,k,l);
if(!((c<i) ||(k<a) ))
{
printf("x intersects!\n");
intersect_x = true;
printf("intersect_x is %d\n", intersect_x);
} else{
printf("x does not intersect!\n");
}
if(!((d<j) ||(l<b) ))
{
printf("y intersects!\n");
intersect_y = true;
printf("intersect_y is %d\n", intersect_y);
} else{
printf("y does not intersect!\n");
}
if(intersect_x&&intersect_y)
printf("MBR1 intersects with MBR2\n");
else
printf("MBR1 does not intersect MBR2\n");
/*
// testing writing to memory using pointers
int page = 1;
int *pData; // pointer to data
pData = &page;
printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);
int* newP; // new pointer to data
int a = 2;
int b = 3;
int m = sizeof(b);
newP = pData+m;
memcpy(pData,&a,sizeof(int));
printf("\npdata is %s; \n&pData is %p; \n*pdata is %d",pData, &pData, *pData);
memcpy(pData+m,&b,sizeof(int));
// newP = pData+sizeof(int);
printf("\nnewP is %s; \n&newP is %p; \n*newP is %d",newP, &newP, *newP);
*/ // end testing writing to memory using pointers
// testing whether you can initialize a structure with an outside variable
// not really, you can with an outside constant
/* const int b = 100;
const int c = 7;
const int a = (int)(b/c);
int i;
struct test {
int test[a];
};
test mytest;
for(i=0; i<a; i++)
{
mytest.test[i]=i;
}
for(i=0; i<a; i++)
{
printf("\nmytest[%d]=%d",i,mytest.test[i]);
}*/
return 0;
}
/*
#include <stdio.h>
main() {
printf("Hello World!\n");
char sentence []="test your 12 7 42";
char str[20];
int a, b, c;
sscanf(sentence,"%s %s %d %d %d", str, str, &a, &b, &c);
printf("%d %d %d\n", a,b,c);
printf("%s\n", str);
sscanf(sentence,"%s your %d 7 %d", str, &a, &b);
printf("%d %d %d\n", a,b,c);
printf("%s\n", str);
printf("%s\n", sentence);
}*/
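Once you have written down your own answers, you may want to compare them with one possible reading of the active (uncommented) part of the file. The Python sketch below assumes that each “MBR” is a rectangle described by a low corner and a high corner, so that (a, b, c, d) means (x_low, y_low, x_high, y_high). That layout is a guess from the comparisons in the code, not something the file states.

```python
def intervals_overlap(low1, high1, low2, high2):
    """True unless one interval ends before the other begins."""
    return not (high1 < low2 or high2 < low1)


def rectangles_intersect(rect1, rect2):
    """Each rectangle is (x_low, y_low, x_high, y_high); an assumed layout."""
    x1_low, y1_low, x1_high, y1_high = rect1
    x2_low, y2_low, x2_high, y2_high = rect2
    overlap_x = intervals_overlap(x1_low, x1_high, x2_low, x2_high)
    overlap_y = intervals_overlap(y1_low, y1_high, y2_low, y2_high)
    # Rectangles intersect only if they overlap on both axes.
    return overlap_x and overlap_y


mbr1 = (6, 3, 8, 5)  # a, b, c, d from the C++ file
mbr2 = (1, 2, 3, 4)  # i, j, k, l from the C++ file
print(rectangles_intersect(mbr1, mbr2))
```

Descriptive names such as `x_low` in place of `a` are one example of the kind of readability suggestion the last question is asking for.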
12.5 MATLAB
This is a MATLAB file named assignment01_leaf.m.
Review the file, then answer the following questions, annotating as you see fit.
- How is the code similar to some of the code that we’ve seen so far?
- What do you think starts a comment line?
- What do you think the code does (generally)?
- What questions do you have about it?
- What are some suggestions to make it easier to read?
function assignment01_leaf
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% (C) Student Name Here %%%%%%%%%%%%%%%%%%%%%%%%%
TRAIN = load('FaultsNNA_csv'); % Only these two lines need to be changed to test a different dataset. %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
TRAIN_class_labels = TRAIN(:,1); % Pull out the class labels.
TRAIN(:,1) = []; % Remove class labels from training set.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display Default Rate and other Basic Information
%
[class,count] = mode(TRAIN_class_labels(:,1));
default_error_rate=1-(count/size(TRAIN_class_labels(:,1),1));
disp(['The dataset you tested has ', int2str(length(unique(TRAIN_class_labels))), ' classes.']);
disp(['The training set is of size ', int2str(size(TRAIN,1)),'.']);
disp(['The time series are of length ', int2str(size(TRAIN,2)),'.']);
disp(['The dataset''s most common class is ',num2str(class),' with a total of ',num2str(count),' occurrences.']);
disp(['The default error rate is ',num2str(default_error_rate),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Compute and display the k-fold error rate
%
k=length(TRAIN_class_labels);
k_fold_error_rate = k_fold_cross_validation(TRAIN,TRAIN_class_labels,k);
disp(['The normal k-fold error rate is ',num2str(k_fold_error_rate),' where k=',num2str(k),' and length of train set is ',num2str(length(TRAIN_class_labels)),'.']);
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Here is a sample classification algorithm, it is the simple (yet very competitive) one-nearest
% neighbor using the Euclidean distance.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
function predicted_class = Classification_Algorithm(TRAIN,TRAIN_class_labels,unknown_object)
best_so_far = inf;
for i = 1 : length(TRAIN_class_labels)
compare_to_this_object = TRAIN(i,:);
distance = sqrt(sum((compare_to_this_object - unknown_object).^2)); % Euclidean distance
if distance < best_so_far
predicted_class = TRAIN_class_labels(i);
best_so_far = distance;
end
end;
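After you have worked through the questions, it can help to restate two pieces of this file in a language you know better. The Python sketch below mirrors the default error rate calculation and the one-nearest-neighbor function. The toy data is hypothetical, and the k-fold step is left out because the k_fold_cross_validation function is not shown in the file.

```python
import math
from collections import Counter


def default_error_rate(labels):
    """Error rate from always guessing the most common class."""
    _, count = Counter(labels).most_common(1)[0]
    return 1 - count / len(labels)


def nearest_neighbor_class(train, train_labels, unknown):
    """Label of the training row closest to `unknown` (Euclidean distance)."""
    best_so_far = math.inf
    predicted = None
    for row, label in zip(train, train_labels):
        distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(row, unknown)))
        if distance < best_so_far:
            best_so_far = distance
            predicted = label
    return predicted


# Hypothetical toy data: two measurements per row.
train = [[1.0, 1.0], [1.2, 0.8], [5.0, 5.0]]
labels = [1, 1, 2]
print(default_error_rate(labels))
print(nearest_neighbor_class(train, labels, [4.5, 5.2]))
```

The loop has the same shape as the MATLAB one: keep track of the smallest distance seen so far, and remember the label that went with it.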