4 Data types
4.1 Variables
Now that we understand the structure and logic of algorithms, we can look at how the natural language of algorithms is translated into programming scripts. To do this, we need to introduce the closely related concepts of variables, data types, and data structures.
A word that people use a lot when referring to computer code is variable, but what is a variable? A variable is a placeholder for some type of data that will be used in computer code.
ingredients <- cheddar cheese, potatoes, milk, salt, butter
We use variables to represent the data used within a program because that makes the data easy to change,
ingredients <- smoked gouda cheese, potatoes, milk, salt, butter
and also easy to build a mental model of how the different variables interrelate.
ingredients <- smoked gouda cheese, potatoes, milk, salt, butter
ingredient_quantities <- all, 3 lb, 0.5 c, to taste, 8 tbsp
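As a rough sketch of what this looks like in a real language, here is the same idea in Python, which uses = rather than <- for assignment. Storing the ingredients as lists of quoted text is an assumption made for illustration; lists and text are covered later in this chapter.

```python
# Assign the ingredient names to a descriptively named variable.
ingredients = ["cheddar cheese", "potatoes", "milk", "salt", "butter"]

# Changing the data only means assigning a new value to the same name.
ingredients = ["smoked gouda cheese", "potatoes", "milk", "salt", "butter"]

# A second variable holds the matching quantities, in the same order.
ingredient_quantities = ["all", "3 lb", "0.5 c", "to taste", "8 tbsp"]

print(ingredients)
print(ingredient_quantities)
```

Because both variables keep their items in the same order, the first quantity belongs to the first ingredient, and so on.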
4.1.1 Pro-tip
A common rule for choosing a variable name is that the name should describe what the data represent. If we used the name x, which is commonly used as a variable in math equations, we wouldn’t know what it represents. The name ingredients is descriptive, so you have an idea of what data the variable holds.
4.2 Data Types
As we already stated, variables hold some value. (I won’t go into the details of how that happens on a computer, though if you’re interested, I could tell you after the workshop, or at a separate time.) It’s enough for us to know that whenever you see a variable, it corresponds to some sort of data.
You know how before you can do math, you have to define numbers? Or, more commonly, before you can write and spell a word, you have to define the alphabet the word uses? Or, to use our recipe analogy, before you can cook, you have to define (or identify) food? Computer programming is the same way. Before you can use data, you have to define data types.
There are three data types which are ubiquitous across most programming languages.
The first two are related to numbers:
- Integers (int), which are whole numbers: positive, negative, and zero.
- Floats (float), or floating point numbers, which are real numbers, or numbers with decimals.
And the third is related to words and text.
- Characters (char), which by themselves are a single character, but can be connected together to create a string (str), which is basically text.
Most data types specific to any particular language are special cases of one or more of these three data types (such as how a string is a sequence of characters).
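To make the three types concrete, here is a minimal sketch in Python. The variable names are made up for illustration. One thing to note: Python has no separate char type, so a single character is just a string of length one.

```python
# An integer: a whole number.
servings = 6

# A float: a number with a decimal part.
milk_cups = 0.5

# A string: text made of characters.
cheese = "gouda"

# A single character in Python is a string of length 1.
# (Python counts positions from 0, so [0] is the first character.)
first_letter = cheese[0]

print(type(servings).__name__)      # int
print(type(milk_cups).__name__)     # float
print(type(cheese).__name__)        # str
print(type(first_letter).__name__)  # str
```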
4.3 Data Structures
A data structure is a special framework which holds your data. Again, there are a lot of different types of data structures, but we’ll describe some of the common ones.
A single value isn’t strictly a data structure, but it’s worth mentioning that single values exist.
A list is just that, a list of data elements. The nice thing about lists is that they can contain a multitude of different data types, and also different data structures.
cheeses <- (8, smoked gouda, 0, milk, 16, cheddar, .25, limburger)
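Here is how the same list might look in Python, as a sketch. Python writes lists with square brackets and needs quotes around text; the extra nested list appended at the end is my own addition to show a list holding another data structure.

```python
# A Python list can mix integers, floats, and strings.
cheeses = [8, "smoked gouda", 0, "milk", 16, "cheddar", 0.25, "limburger"]

# A list can also hold other data structures, such as another list.
cheeses.append(["parmesan", "muenster"])

# Collect the type name of every element to see the mix.
element_types = [type(item).__name__ for item in cheeses]
print(element_types)
```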
An array is a common data structure, which at its simplest is a one-dimensional structure. You’ll commonly see an array represented with square brackets.
You may encounter text elements in an array
ingredients = [cheese, potatoes, milk, salt, butter]
or an array of integers
quantities <- [16, 32, 8, 0, 4]
or floats
quantities = [16, 32, 8, 0.125, 4]
The data in all cells (or elements) of an array need to be the same type, which is part of what makes an array different from a list. If an array is created with different data types, many tools, such as vectors in R or arrays in Python’s NumPy library, will automatically convert the data to a single data type. This is formally called coercion. Numeric values mixed with text will all be coerced to text, and integers mixed with real numbers will all be coerced to floats, but not the other way around.
You may also see this data structure called a vector (which is different from the vectors used in math or physics).
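The coercion rule can be sketched in a few lines of plain Python. To be clear, coerce is a hypothetical helper written for this illustration; it mimics the rule described above and is not how R or NumPy implement coercion internally.

```python
def coerce(values):
    """Hypothetical helper: convert mixed values to one common type."""
    # Any text forces everything to text.
    if any(isinstance(v, str) for v in values):
        return [str(v) for v in values]
    # Otherwise, any float forces everything to float.
    if any(isinstance(v, float) for v in values):
        return [float(v) for v in values]
    # All integers: nothing to convert.
    return list(values)

print(coerce([16, 32, 8, 0.125, 4]))  # [16.0, 32.0, 8.0, 0.125, 4.0]
print(coerce([16, "to taste", 4]))    # ['16', 'to taste', '4']
```

The order of the checks is the design choice here: text wins over numbers, and floats win over integers, which is why the conversion never goes the other way.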
A matrix is a multi-dimensional array and has the same restriction as an array: the same data type must be used throughout. Honestly, how people refer to arrays and matrices depends on who you’re talking to, and it’s likely highly dependent on the type of math they’re using in their research.
A matrix will have both rows and columns. Every row needs to have the same number of elements, and so does every column.
Below is a matrix with size \(i\) by \(j\), which means that there are \(i\) rows and \(j\) columns. The matrix below also demonstrates what is called an index, which is basically the address of any particular element in the matrix.
\[\begin{equation} \begin{bmatrix} index_{1,1} & index_{1,2} & \cdots & index_{1,j} \\ index_{2,1} & index_{2,2} & \cdots & index_{2,j} \\ \vdots & \vdots & \ddots & \vdots \\ index_{i,1} & index_{i,2} & \cdots & index_{i,j} \end{bmatrix} \end{equation}\]
Using an index for a one-dimensional array is pretty easy.
For
ingredients <- [cheese, potatoes, milk, salt, butter]
the value in ingredients[1] is cheese.
If we expand it into a matrix,
\[\begin{equation} ingredients = \begin{bmatrix} potatoes & milk & heavy\text{ }cream & butter \\ Cheddar & Gouda & Parmesan & Muenster \\ salt & rosemary & thyme & sage \end{bmatrix} \end{equation}\]
the value in ingredients[2,4] is \(Muenster\).
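As a sketch of the same lookup in code, the matrix can be written in Python as a list of rows. Using nested lists is an assumption for illustration (numeric work would normally use a dedicated library), and element_at is a hypothetical helper that accepts the workshop’s row-then-column, count-from-one convention.

```python
# Each inner list is one row of the matrix.
ingredients = [
    ["potatoes", "milk", "heavy cream", "butter"],
    ["Cheddar", "Gouda", "Parmesan", "Muenster"],
    ["salt", "rosemary", "thyme", "sage"],
]

def element_at(matrix, row, column):
    """Hypothetical helper: look up an element using 1-based row and column."""
    return matrix[row - 1][column - 1]

print(element_at(ingredients, 2, 4))  # Muenster
print(ingredients[1][3])              # Muenster (Python's own 0-based form)
```

Note the order in both forms: the row comes first, then the column.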
I want to share that I’ve been programming for a long time, and have done all of the math, and I still have to double-check that the index I’m referring to in multidimensional arrays or matrices is listed in the correct order.
A dataframe is a tabular form with rows and columns, much like a matrix. However, dataframes can contain mixed data types, for instance a column of integers, a column of strings, and another column of floats. Dataframes can also have column names and row names.
ingredients | quantities | importance |
---|---|---|
smoked gouda cheese | all | 5 |
potatoes | 3 lb | 1 |
milk | 0.5 c | 3 |
salt | to taste | 4 |
butter | 8 tbsp | 2 |
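To give a feel for how a dataframe is organized, here is a minimal stand-in built from plain Python: a dictionary with one list per column. This is only a sketch under that assumption; real projects would typically use a dataframe library, and the name df is just a common shorthand.

```python
# Each key is a column name; each list is that column's values, top to bottom.
df = {
    "ingredients": ["smoked gouda cheese", "potatoes", "milk", "salt", "butter"],
    "quantities": ["all", "3 lb", "0.5 c", "to taste", "8 tbsp"],
    "importance": [5, 1, 3, 4, 2],
}

# Different columns can hold different data types.
print(type(df["ingredients"][0]).__name__)  # str
print(type(df["importance"][0]).__name__)   # int

# Read one row by taking the same position from every column.
# (Position 2 is the third row, because Python counts from 0.)
third_row = {column: values[2] for column, values in df.items()}
print(third_row)
```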
4.3.1 To make things more confusing: A note on indices
A fun “feature” of programming languages is that different languages start their index with different numbers, either 1 or 0. Python is a zero-indexed language, meaning the index of the first element of an array or list is 0, while R is a normal language, I mean, not a zero-indexed language, meaning the index of the first element of an array in R is 1.
I like to pretend that this is an extension of mathematicians having arguments over which is the first number, zero or one (yes, I’ve seen these debates). But, I think it has to do with ease of memory management in computers, which is something most researchers who come to us for consultations will not have questions on. So right now, it’s just a quirk.
For our workshop, we will use 1 as the first element of an array, because that’s easier to learn logically.
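If you do end up working in a zero-indexed language, the translation is just a matter of subtracting one. Here is a minimal sketch in Python; the helper name one_based is made up for illustration and is not part of Python itself.

```python
ingredients = ["cheese", "potatoes", "milk", "salt", "butter"]

# Python counts from 0, so the first element is at index 0.
print(ingredients[0])  # cheese

def one_based(items, position):
    """Hypothetical helper: fetch an element by its 1-based position."""
    if position < 1 or position > len(items):
        raise IndexError("position must be between 1 and len(items)")
    return items[position - 1]

print(one_based(ingredients, 1))  # cheese
print(one_based(ingredients, 5))  # butter
```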
4.4 Review
For the next items in review, we want you to type the answer, but don’t press enter until we tell you. We’ll say “3, 2, 1, enter,” and you press enter when we say enter.
Let’s practice this first: we all type Hello World in chat.
In chat, assign a variable a value, then type the data type in parentheses.
e.g. quantity = 1.5 (float)
For the following array
cheese <- [Cheddar, Gouda, Parmesan, Muenster]
what is the value of cheese[4]?
For the following matrix,
\[\begin{equation} potatoes = \begin{bmatrix} Yukon\text{ }Gold & Red\text{ }Gold & Gala \\ Yellow\text{ }Finn & Russet & Fingerling \\ Sweet & Kennebec & Red\text{ }Pontiac \\ Yam & German\text{ }Butterball & Purple\text{ }Viking \\ Carola & Nicola & Canela\text{ }Russet \end{bmatrix} \end{equation}\]
What is the value of potatoes[3,2]?
Challenge: For the following dataframe, named df,
ingredients | quantities | importance |
---|---|---|
smoked gouda cheese | all | 5 |
potatoes | 3 lb | 1 |
milk | 0.5 c | 3 |
salt | to taste | 4 |
butter | 8 tbsp | 2 |
What do you think df$ingredients[3] would return?
Answers
- Your answer will vary! Some examples:
temperature = 75 (integer)
precipitation = 1.2 (float)
location = California (string)
- The value of cheese[4] is Muenster, the fourth item in the cheese array.
- The value of potatoes[3,2] is Kennebec, which is the value in row 3, column 2.
- df$ingredients[3] would return milk. We can guess that df$ingredients means the ingredients column in the dataframe df (even though the $ symbol is new). From there, we can use the same indexing approach as the other questions, such that df$ingredients[3] indicates the third row down in the ingredients column, which is milk.