Prelude: Data Structures & Algorithms
In earlier sections, we defined a program as an ordered sequence of instructions for solving a problem. We now further specify that definition: A program is a set of instructions for performing computations with a given collection of data. Using this definition, we can now see how closely linked algorithms and data structures are. Without data, we cannot write programs, and without programs, we cannot meaningfully use data.
Data, broadly speaking, are representations of real-world information. Specific to computer science, they are representations of information that a computer can understand. For example, a computer cannot determine whether two people are "attracted" to one another unless we feed the computer data: age difference, shared properties like mutual interests, friends, and alma mater, favorite songs, as well as preferences. A data structure is an arrangement of all that collected data.
Whenever we program, we are actually creating two files: a program file and a data file. To understand the roles of these two files, let's consider what happens when a user opens a Word document. On double-clicking the Word document's icon, the Word program (the program file) is loaded into main memory. Once loaded into main memory, the CPU begins executing the Word program's machine code. This is the point where we start seeing things on the screen; if the program is coded well, it should be almost instantaneous.1 Now, if we clicked on a file icon (some file with the extension .docx), we would see all of that file's contents as well as its presets. This .docx file is a data file, and it too is loaded into main memory. All the text, formatting, and presets we see in the Word file are the result of the Word program operating on the data structure contained in the data file.
What does this example tell us? It tells us that every program handles data in some form or another. Even the simplest programs handle data—the variables initialized, the functions called, the arrays declared, it's all data.
How we arrange the data (i.e., our choice of data structure) dramatically impacts our program, whether in terms of efficiency, security, user-friendliness, or maintainability. Data structures impact how well a computer handles the program, since data is loaded into main memory at runtime. If the data structure is too memory intensive, then our program's user base is necessarily limited. Apple's Xcode IDE, the Unreal game engine, Blender, and high-definition games like GTA V or Fallout 4 are all heavy, resource-intensive, but relatively efficient programs. They all work well on computers with large amounts of RAM (i.e., at least 8 GB), but anything lower and the programs start running into trouble. In contrast, programs like Sublime Text (with default settings) or TextEdit consume very few resources but come with fewer features.
The decisions above are fine, but only if they are intentional. Maybe the programmers decided the features were more important, so the user has to decide if they really want them. Or maybe the programmer decided efficiency was more important, so it's up to the user to modify and tweak the program with extensions. Nevertheless, some decisions are worse than others. No one wants a program that offers a useless feature and hogs large amounts of memory. Similarly, a user would be fairly reluctant to use an extremely efficient program if the program has enormous security risks.2 Along the margins, if two programs offer the same features, the user is usually going to decide based on some other criteria—efficiency, reliability, user-friendliness, cheapness, etc.—all tradeoffs determined, at the lowest level, by data structure choices.
A data structure should not be confused with a database.3 A database is a collection of related data stored in some form of permanent storage (perhaps an HDD, an SD card, a USB drive, or a floppy disk) for ease of access by a program. Most databases are organized as tables. The database, however, is really just a crude way of storing data. We can think of it like a giant container filled (but organized) with numerous different pieces of wood, metal, nails, screws, glass, and so on. To build anything useful, however, we take the pieces we need and arrange them in our own way: the glass is separated, the wood is stacked, the nails go in one pouch, the screws in another. The same goes for data structures. The program must retrieve the data from somewhere (i.e., a database), but the data must still be arranged into a data structure. The database is really just there for the program to easily retrieve data for structuring.
Following this distinction between database and data structure, there is a further distinction within databases. Data stored in databases generally fall into two categories—operational data and legacy data.4 Generally, operational data is new and frequently-used data. For example, a new user registering for a site, or a user frequently using the site. Both users' data constitutes operational data. In contrast, legacy data is data that is old, or not frequently used. For example, that Gmail account we forgot the password to fifteen years ago. It'd probably be fairly difficult to reset the password for the account because the data has been stored in a data warehouse—a database specifically made for legacy data.5 As one might suspect, databases and data warehouses are more commonly used by large, commercial programs, where there are enough users to justify the costs of storage and maintenance.
In sum and crudely: A data structure is an arrangement of data in main memory. A database is an arrangement of data in permanent memory. A data warehouse is an arrangement of data in a collection of permanent memory.
Static v. Dynamic Memory Allocation
As we saw in the previous section, data structures concern the arrangement of data in main memory. We now want to delve deeper into how the data is arranged. To do so, we need a clearer perspective on main memory, and the way memory is allocated.
Conceptually, we can think of main memory as a large grid of small squares.
Each of the small squares is a unit with an address. We call the unit a byte, and the address a memory address. The memory addresses are linear and arranged in order.
The sum of all the individual units, or bytes, is the total amount of memory we have. For example, if a program consumes 64 kilobytes of memory at runtime, then it takes up roughly 64,000 bytes (64,000 of the grid's squares) in memory. If the main memory is large (e.g., 8 GB), then the bytes are sectioned into segments. Usually, the size of a segment is 64 KB.
At the highest level, there are four large segments, sometimes called regions, of memory: (1) the code segment; (2) the data segment; (3) the stack segment; and (4) the heap segment:
There's a lot going on in the diagram above, but for now, we are just focusing on the stack, heap, and code regions. Suppose we wrote the following code:
void func1(int i) {
    int a;
}

void func2() {
    int x;
    func1(x);
    // more code
}

int main() {
    int a;
    float b;
    func2();
    // more code
}
Suppose further that, according to the compiler, an int takes 2 bytes and a float takes 4 bytes. We compile the code and get back an executable. When we run the executable, the machine code is stored in the code region. Once there, the CPU executes the machine code, and the program begins using the stack and the heap. The variables a and b in main() are allocated inside the stack, 6 bytes total. How does the CPU know to allocate 6 bytes total? Because of the compiler: the compiler determines that 6 bytes are needed, and that amount is fixed. Because the amount of memory needed is fixed, we call this static memory allocation.
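Incidentally, we can ask the compiler how many bytes it reserves for a given type with the sizeof operator. A minimal sketch follows; the numbers it prints depend on the compiler and platform (most modern compilers give an int 4 bytes rather than the 2 bytes assumed in this running example).

#include <iostream>
using namespace std;

// Print the compiler-determined size of a few types. The output varies by
// compiler and platform; the 2-byte int above is only an assumption.
int main() {
    cout << "int:    " << sizeof(int) << " bytes" << endl;
    cout << "float:  " << sizeof(float) << " bytes" << endl;
    cout << "double: " << sizeof(double) << " bytes" << endl;
    return 0;
}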
What about those functions? Again, the machine code is stored in the code region—func1()'s machine code, func2()'s machine code, and main()'s machine code. To start, the main() function's machine code is executed first. As stated earlier, the variables a and b are allocated in the stack. While executing main(), func2() is encountered, and the CPU begins executing func2()'s machine code. In doing so, it encounters another variable, x. Again, this is stored in the stack, but this time in a different stack frame—a sort of layer in the stack on top of the layer where a and b are stored. As the CPU continues executing func2()'s machine code, it encounters func1(x). The CPU begins executing func1()'s machine code, allocating i and a in yet another stack frame.
Once func1() is done executing, the CPU pops off its stack frame (i.e., "deletes" that frame) and continues in func2(). Once func2() is done executing, the CPU again pops off its stack frame and continues in main(). Finally, once the CPU finishes executing main(), it pops off its stack frame—the program has finished.
Notice what this discussion entails. How much memory a function takes up depends on its variables. And how much memory variables take up depends on their sizes, which are in turn determined by the compiler.6 This means that we, the programmers, have little to no control over (a) the automatic destruction of stack frames and (b) how much memory is allocated for an initialized variable. The rules are different, however, for variables that end up in the heap.
The Oxford Dictionary defines a "heap" as a disorderly collection of objects placed haphazardly on top of each other. This captures the essence of the heap segment in memory. The heap is an open field of memory where values can be stored. The difference: everything allocated into the heap is free to move around. We can think of the heap as this chaotic, open field of memory—everything in it is crawling, walking, and running in every direction. This is in contrast to the stack, where values are neatly organized into small neighborhoods—stack frames. As an open field, the heap is a resource. We let values hang out in the heap momentarily, and once a value is done being used, we remove it from the heap. All values allocated to the heap must be removed eventually. We can't have them hanging out in there forever; we will see why shortly. This discussion entails that memory in the heap is dynamic.
Because heap memory is dynamic, the program cannot directly access the values in the heap. This is because of the point we made earlier—values in the heap are all over the place; they don't just stay in one place. So how does the program access values in the heap? With pointers. Pointers are how we take memory from the heap:
int main() {
    int *p;
}
In C and C++, we indicate a pointer with the asterisk symbol. For the sake of discussion, we will say that the pointer p above takes up 2 bytes of memory. Now, p is a variable inside the function main(). Accordingly, p is allocated in the stack, taking up 2 bytes.7 To take up memory in the heap:
int main() {
    int *p;
    p = new int[5];
}
In writing p = new int[5], we do two things: (1) create an int array of size 5 in the heap; and (2) make the pointer p point to that array. This is a crucial point to understand. The variable p resides in the stack; the int[5] value resides in the heap. To access the array, the program must use the variable p. It cannot directly access things inside the heap.
As we said earlier, int[5] must be de-allocated from the heap. This is because there is no automatic destruction for things we place in the heap.
int main() {
    int *p;
    p = new int[5];
    delete[] p;
    p = nullptr;
}
The statement delete[] p is the way we de-allocate heap memory. In this case, we include the brackets [] because p points to an array. We then write p = nullptr; to ensure the pointer no longer points anywhere.
Overflows
Why must we de-allocate heap memory? Because of overflows. Remember, memory is finite, and memory segments have strict boundaries.
Going back to the diagram above, there are two arrows stemming from the heap and the stack. These arrows symbolize how the heap and the stack are free to grow in size. With the stack, we at least have automatic stack-frame destruction, which keeps its size in check to some extent. We say "to some extent" because there are situations where automatic destruction doesn't work. Automatic destruction only kicks in if the CPU finishes executing the function's machine code, and there are many situations where a CPU cannot finish executing a given function's machine code. Some examples:
- The function calls itself without any way to terminate (i.e., a recursive function without a base case, or with a bad base case). This causes the function to call itself over and over again, allocating more and more memory in the stack (see the sketch after this list).
- The function takes up too much memory in the stack. For example, a function with an infinite loop, or a function performing too large or complex a computation.
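As an illustration of the first case, here is a minimal sketch; the function name countDown is made up for this example. Its base case can never be reached when it is called with a negative argument, so each call pushes a new stack frame until the stack segment is exhausted.

// A minimal sketch of runaway recursion. The base case n == 0 is never
// reached for a negative argument, so each call pushes a new stack frame.
void countDown(int n) {
    if (n == 0) return;   // base case, skipped entirely when n < 0
    countDown(n - 1);     // for n = -1, -2, ..., this never terminates
}

int main() {
    countDown(-1);        // without compiler optimizations, this eventually
                          // crashes with a stack overflow
}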
In the situations above, the function continues taking up memory in the stack, to the point where the stack segment collides with the heap segment. This causes a stack overflow.
On the other end, we have the heap, which is arguably even more insidious than the stack. With the heap, we do not have any automatic destruction. This means that when we allocate heap memory through pointers without manually de-allocating it, the heap keeps growing. With enough allocations, the heap can grow so large that it collides with the stack segment. What makes the heap so dangerous is that this growth is much harder to detect: each leak tends to be tiny, but a tiny amount of growth, repeated enough times, is all it takes to cause a collision. That collision results in a heap overflow. The sketch below shows the typical leak pattern.
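Here is a deliberately bad sketch of that pattern: memory is allocated in a loop and the pointer is overwritten without ever calling delete, so every iteration's allocation becomes unreachable yet still occupies the heap.

// A deliberately leaky sketch: every iteration allocates a new array and
// then loses the only pointer to it, so nothing is ever freed.
int main() {
    for (int i = 0; i < 1000000; i++) {
        int *p = new int[1000];  // allocate on the heap
        p[0] = i;                // use it briefly
        // missing: delete[] p;  <- the allocation is leaked here
    }
    // by this point the heap holds roughly 4 GB of unreachable memory
    // (assuming 4-byte ints), none of which the program can ever reclaim
}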
Data Structures: Physical v. Logical Interpretations
To better understand data structures, we will momentarily draw a distinction between physical data structures and logical data structures. With this momentary distinction in place, we can analyze the differences between the two. Let's first compare an array and a linked list.
An array is the simplest data structure. Physically (i.e., inside the computer) it is a contiguous collection of memory. It has a fixed size, and it can reside either in the stack or in the heap (e.g., a pointer in the stack pointing to the array in the heap). The array is what we normally use when (a) we have data that should be ordered, and (b) we know for certain what the size of the array should be. We say that the array is a static data structure because it is of fixed size.
A linked list is like an array, but instead of a contiguous collection of memory, it consists of nodes composed of two parts: (1) the value the node stores, and (2) a pointer pointing to the next node. Structurally, the linked list consists of a head—the first node in the list—and a tail—the last node in the list. The tail always points to null. Like the array, the linked list allows us to order data. Unlike the array, however, the linked list is a dynamic data structure: we can make it bigger or smaller by having the tail node point to a new node rather than to null. Furthermore, in contrast to the array, a linked list is always created in the heap. The head pointer may exist in the stack, but the nodes themselves always exist in the heap. A minimal sketch of such a node appears below.
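As a concrete, if minimal, sketch, here is one way the node just described might be declared in C++; the names Node, value, and next are illustrative choices, not a standard API.

#include <iostream>
using namespace std;

// One possible node layout: a stored value plus a pointer to the next node.
struct Node {
    int value;
    Node *next;
};

int main() {
    // Build a two-node list on the heap: 1 -> 2 -> null.
    Node *head = new Node{1, nullptr};   // the head pointer lives on the stack
    head->next = new Node{2, nullptr};   // the nodes themselves live on the heap

    // Walk the list, printing each value.
    for (Node *cur = head; cur != nullptr; cur = cur->next) {
        cout << cur->value << endl;
    }

    // De-allocate the nodes; there is no automatic destruction for heap memory.
    delete head->next;
    delete head;
    return 0;
}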
Both linked lists and arrays are examples of physical data structures. They directly state how data is actually stored in memory. Physical data structures are different from logical data structures. Most of the data structures we're interested in are logical data structures: stacks, queues, trees, graphs, hash tables, and many more. All logical data structures are implemented using some combination of physical data structures, whether linked lists or arrays.
Having said this, we now dissolve this artificial distinction. The difference between physical data structures and logical data structures boils down to a distinction between a data structure's underlying logic and its implementation. The computer does not know how to organize data into a tree; we have to give it instructions as to how a tree is structured. And to do so, we must state those instructions in ways the computer understands—with arrays or linked lists. Accordingly, every data structure has two interpretations: (1) its physical interpretation, and (2) its logical interpretation. The physical interpretation, what we momentarily referred to as a physical data structure, refers to the way we implement the data structure. The logical interpretation, what we momentarily referred to as a logical data structure, refers to the way we design the data structure.
Both the logical interpretation and the physical interpretation are necessary to constructing data structures, but call for different skill sets. Implementing data structures requires knowledge and familiarity with the language we are using and programming principles—i.e., coding. Designing data structures, however, requires knowledge of discrete mathematics. The materials that follow employ both skill sets, but err more so on the side of design.
Abstract Data Types
There is a distinction between an abstract data type and a data structure.8 In the broadest terms, an abstract data type represents "what we want to do," while the data structure represents "how we do it." Thus, an abstract data type is a specification; it tells us what data we can store. The data structure, in contrast, is the implementation; examining it tells us how the data is stored.
In sum, the abstract data type is a rule, or instruction, that communicates to the computer how a particular piece of data should be interpreted. More generally, a data type encapsulates, or contains, two kinds of information: (1) what data can be stored; and (2) what operations can be performed with that data.
The data structure is what implements the two pieces of information above. It dictates how the data is stored, and how the operations are performed (via algorithms).
So, for example, the type int is a data type. The type int contains the two kinds of information for abstract data types. What data can be stored with int? A datum of type int stores integers. What operations can be performed on a datum of type int? Addition, subtraction, multiplication, division, modulus, and a plethora of other arithmetic operations.
In most languages, the data type int is what we would call a base type or a primitive type. Base types are data types provided natively by the language. In other words, the language's implementation provides the data type by default. An abstract data type, however, is most often a data type that we, the programmers, create. More broadly, it's a data type that hides away the implementation. In many languages, particularly object-oriented languages, we can create our own data types with special constructs. For example, in Java and C++, we can create our own data types with classes.
For example, suppose we have the following data:
(John, Luke, Michael, Kento, Idris)
We want the order to be kept as is. Based on this constraint, we can think of an abstract data type—a list. That idea—the abstraction list—is distinct from its implementation. How do we implement a list? We have two options: an array or a linked list.
The abstract type list contains the two kinds of information we discussed earlier. (1) How is the data represented? The list type takes up space for storing the elements; it has a capacity (i.e., how big the list can be); and it has a size (how big the list actually is). (2) What operations can be performed on data of type list? We can append a new element, remove an element, sort the elements, search the elements, and so on.
The two kinds of information in an abstract data type correspond to the two things we focus on in these materials. The question of how data is represented is answered by a data structure. The question of what operations can be performed on the data is answered by algorithms. It should now be apparent why data structures and algorithms are closely linked. The abstract data type is the product of gathering a data structure and its operations (created via algorithms) into a single box (e.g., a class or a struct) so as to hide away the implementation. A minimal sketch of such a box appears below.
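To make this concrete, here is a minimal sketch of a list abstract data type backed by an array. The class name List and its members (append, get, size) are illustrative choices for this example, not a standard library API.

#include <iostream>
using namespace std;

// A minimal, array-backed list: the class hides the array (the data
// structure) and exposes only the operations (the abstract data type).
class List {
    int data[10];   // fixed capacity, for simplicity
    int count;      // how many elements are actually stored
public:
    List() : count(0) {}
    bool append(int x) {                      // add to the end, if room remains
        if (count == 10) return false;
        data[count++] = x;
        return true;
    }
    int get(int i) const { return data[i]; }  // access by index
    int size() const { return count; }
};

int main() {
    List xs;              // the user sees "a list," not an array
    xs.append(3);
    xs.append(7);
    cout << xs.get(0) << ", " << xs.get(1) << endl;  // prints 3, 7
    return 0;
}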
Data Structures v. Abstract Data Types
After programming for some time, we might ask, what is the difference between a data structure and an abstract data type?
A data structure is a collection. It contains data, relationships between data, and operations to be applied to the data. Data structures are very explicit—they state exactly what tasks are performed and how.
In contrast, an abstract data type is something that defines a particular datum's behavior from the perspective of its user. In creating an abstract data type, we only describe what task must be done. We do not explicitly state how that task should be done.
It's critical that we keep this difference in mind as we explore algorithms. For example, the array is a data structure. However, when we confront particular problems, we might want to think of the array as the implementation of certain abstractions: a sequence, a tuple, a list, or a vector. Similarly, the multidimensional array is a data structure, but depending on the problem, we might want to think of it as a matrix, a grid, or a tensor. Why might we think in terms of these abstractions? Because doing so allows us to model the problem in terms of established facts in other areas, whether that's logic, mathematics, physics, biology, economics, or linguistics. The more facts we can rely on, the more likely we are to come up with a solution. Perhaps even an elegant one.
Complexity
Having seen the close connection between data structures and algorithms, we are now ready to address the guiding concern of data structures and algorithms: complexity. We can think of complexity as a quantification of how "costly" a given data structure or algorithm is. Usually, that cost is measured in the context of space and time. This results in two forms of complexity, or two kinds of cost—(1) time complexity and (2) space complexity. We address the former first and the latter second.
A short story: A factory recently hired Luke, a 20 year-old male in good shape. Part of Luke's job is transferring barrels from one rack to another rack. As strong as Luke is, the barrels are large and weigh at least 100 lbs. The first few days, he carries them, one by one. After a while, however, Luke notices how unwieldy this is. It takes far too long, and his mornings are now marked by lingering pain in his lower back. Luke begins asking, how do I completely move these barrels in a shorter amount of time? Aha! Why not just roll them? Luke takes this route, and it works almost miraculously. Suddenly, what took him at least an hour is suddenly cut to 20 minutes. He's delighted, and begins rolling two, three, four barrels at a time. Luke continues his job, happily moving barrels, until management finds an even faster solution—a giant robotic system that simply swaps the racks.
The question "How long does it take to do X?" is what time complexity answers, where X is some task. In programming, every task reduces to computations on data. Accordingly, the time complexity of a given task is determined by how those computations are performed—the algorithm.
Let's take an example. Suppose we want to sum all the integers from 1 up to some integer n. One way to do this would be to write:
#include <iostream>
using namespace std;

int addUpTo(int n) {
    int x = 0;
    for (int i = 1; i <= n; i++) {
        x += i;
    }
    return x;
}

int main() {
    int sum3 = addUpTo(5);
    cout << sum3 << endl;
    return 0;
}
15
With the code above, we use a loop to compute the sum. Here is another possible approach, using Gauss's formula:
#include <iostream>
using namespace std;

int addUpTo(int n) {
    return n * (n + 1) / 2;
}

int main() {
    int sum3 = addUpTo(5);
    cout << sum3 << endl;
    return 0;
}
15
Complexity analysis is what allows us to compare these two approaches.
Consider the two implementations of addUpTo() side by side:
int addUpToLoop(int n) {
    int x = 0;
    for (int i = 1; i <= n; i++) {
        x += i;
    }
    return x;
}

int addUpToFormula(int n) {
    return n * (n + 1) / 2;
}
Clearly, addUpToFormula() is much shorter than addUpToLoop(). But brevity does not imply better, at least not always. Instead, we count the number of operations each implementation requires. That count provides one premise for an argument of the form "this algorithm is better than that one."
Let's consider addUpToLoop(). How many operations occur here? We can start by counting all of the individual steps the code performs.
int addUpToLoop(int n) {
    int x = 0;
    for (int i = 1; i <= n; i++) {
        x += i;
    }
    return x;
}
/*
Operations:
    int x = 0    happens once.    (1)
    int i = 1    happens once.    (1)
    i <= n       happens n times. (n)
    i++          happens n times. (n)
    x + i        happens n times. (n)
    x = x + i    happens n times. (n)
    return x     happens once.    (1)
*/
Putting the commented analysis above together, we have 4n + 3 steps (another count might yield more steps, but we'll see that this doesn't really matter). Accordingly, we have a complexity of O(4n + 3) for this approach. However, with complexity analysis, we're really just concerned with the leading term. In this case, the leading term is 4n, which we simplify to n. Thus, the addUpToLoop() function executes in O(n), or linear, time. Now compare that with addUpToFormula(). Again, we count the operations:
int addUpToFormula(int n) {
    return n * (n + 1) / 2;
}
/*
Operations:
    n + 1            happens once. (1)
    n * (n + 1)      happens once. (1)
    n * (n + 1) / 2  happens once. (1)
*/
Here, we have 3 steps. Thus, addUpToFormula() has a complexity of O(3). More generally, it has a complexity of O(1): constant time. Comparing the two, O(1), constant time, is much faster than O(n), linear time. We now have a premise for arguing that addUpToFormula() is better than addUpToLoop() in terms of time complexity. When handling big-O expressions, there are several helpful rules to keep in mind.
- Constants are dropped. For example, O(2n) is simply O(n); i.e., an algorithm that takes 2n operations every time is simply O(n). Similarly, O(500) is simply O(1).
- Smaller terms are dropped. O(n² + n) is reduced to O(n²); O(n³ + 5n + 100) is reduced to O(n³).
- Arithmetic operations run in constant time. Thus, the operations +, -, *, /, and % all take constant time.
- Variable declarations and assignments run in constant time. Thus, writing int n is constant time, and int n = 2 is constant time.
- Accessing an element in an array by index runs in constant time. E.g., given the array int arr[] = {1, 2, 3}, writing arr[0] to access 1 takes constant time.
- In a loop, the complexity is the length of the loop multiplied by the complexity of the loop's body. For example, a loop that runs from 1 to n has a complexity of O(n). If the loop's body (the code to be executed at each iteration) has a complexity of O(1), then we have O(n · 1), or simply O(n). If the loop contains a loop in its body, then the outer loop has a complexity of O(n²). The sketch after this list illustrates the loop rule.
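As a small illustration of the loop rule, the sketch below contrasts a single loop with a nested one; the function names are made up for this example. The single loop performs on the order of n constant-time steps, while the nested loop performs on the order of n · n = n² steps.

// A single loop: n iterations of constant-time work, so O(n).
int sumLinear(int n) {
    int total = 0;
    for (int i = 1; i <= n; i++) {
        total += i;                  // constant-time body
    }
    return total;
}

// A nested loop: n iterations of an O(n) body, so O(n * n) = O(n^2).
int countPairs(int n) {
    int pairs = 0;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            pairs++;                 // constant-time body, executed n * n times
        }
    }
    return pairs;
}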
Another example: let's say we have a list type datum:
["Sam", "Julie", "Eric", "Dan", "Helen"]
Again, there are two ways to implement this abstract type: an array or a linked list. Say we use an array. Now suppose we want to check if "Mike" is in the list. Checking if "Mike" is in the list is a searching problem. We have to search for a string in the list that matches "Mike".
How long it takes to find a match depends on how we search for the match.
Whenever we think about algorithms, we must always keep this rule in mind:
A computer can only "look" at one thing at a time.
In other words, a computer does not have the benefit of looking at many things and pinpointing the match. In the example above, we have that benefit. We can clearly see that "Mike" is not in the list.9
Accordingly, when we write our algorithms, we must always follow the fundamental constraint that a computer can only "look" at one thing at a time. That said, what are some possible algorithms to search for a match?
One possible algorithm is a linear search. In a linear search, the computer goes through each element in the list, one by one, checking if there's a match. In the case above, the computer would go through all 5 strings just to return false (i.e., "False, there is no "Mike"."). If the list contained n elements and there was no "Mike", then the algorithm would take a total of n steps.
This is where we make a distinction in measuring time complexity. Time complexity is not necessarily measured in units of time. Instead, we generally measure time complexity in terms of the number of steps a computer must take to complete the task. We will explore why we measure time in this manner in later sections.
Because a linear search algorithm, in the worst-case scenario, might take n steps to complete its task, we say that the linear search algorithm has a complexity on the order of O(n). However, note that when we conduct a complexity analysis, we always want to also look at the code. For example, the linear search algorithm might be implemented like this (in code):
for (int i = 0; i < arr.length; i++) {
    if (arr[i] == "Mike") {
        return true;   // found a match
    }
}
return false;          // checked every element without finding a match
In the code above, we have a loop that executes n times. However, we also have instructions inside the loop: (1) a variable initialization, which occurs exactly once, i = 0; (2) a comparison check, i < arr.length; (3) an arr.length computation (which, depending on the language, may be called at each iteration); (4) an equality check, ==; and (5) an increment, i++. These are all distinct instructions, which may result in the number of steps being something along the lines of an + b, where a and b are constants. This does not, however, change our previous result, O(n). In complexity analysis, we focus on the leading terms; in other words, the "biggest" terms in the expression. Accordingly, this requires dropping constants.
What's an Algorithm?
An algorithm is a solution to a computational problem. As a solution, it is a sequence of computational steps that consumes a set of inputs and produces a set of outputs.
Because of this definition, the way a problem is stated determines the relationship between inputs and outputs. We will use certain notations to state problems. For example, if the problem asks, "Sort the numbers a1, a2, ..., an into non-decreasing order," we would write:
s := <a1, a2, ..., an>
f(s) := <a1', a2', ..., an'> :
    a1' <= a2' <= ... <= an'
In the notation above, we use s to denote the input. Of course, we are free to use whatever variable we would like. We use f(s) to denote the output, where f is the actual algorithm itself. We can use any other variable; for example, x and g(x), or t and h(t). The symbol := reads "defined as," and the colon is read "such that." Finally, the angle brackets indicate a sequence.
The problem above is an example of a sorting problem. This is a particularly important problem in computer science, because numerous algorithms rely on its solution (for example, many search algorithms assume values are sorted).
When we write algorithms, we want our algorithms to be correct. An algorithm is correct if, and only if, for every input, it halts with the correct output (in other words, it stops, and what it returns when it stops is right). An incorrect algorithm is one that halts with an incorrect output, or one that never halts.
Performance is currency.
At the end of the day, we study algorithms because of performance. With efficient and correct algorithms, we can obtain other things—security and features. Some programmers prefer Python over C, even if C is much faster. Why? For a variety of reasons—Python is more readable, more user-friendly, more widely used, more abstracted, and so on. The cost, however, is performance. Python's memory management algorithms may not be the best approach for our problem. Python's garbage collection algorithms may not be how we would like our data handled. However, we may be willing to take on those costs because we've written well-performing algorithms.
Efficiency
One of the core concerns for algorithms is efficiency. For now, let's consider efficiency in terms of time. Accordingly, the term efficiency refers to how long it takes an algorithm to solve a given problem. How do we measure this length of time?
The most obvious way is to write some sort of stopwatch program, where the timer begins once the algorithm commences and stops when the algorithm finishes. We then compare that time against other algorithms' times.
But what's the problem with this approach? The metrics would be entirely machine dependent! The time it takes to run an algorithm on a MacBook Pro would be different from the time it takes to run it on a Raspberry Pi, even if we're measuring the same algorithm and using the same data set. Because of this limitation, we must measure "speed" with a different metric.
Instead of measuring speed with time, we assume that every step a computer takes consumes a constant amount of time, and then we count how many operations, or steps, the algorithm takes. In doing so, an algorithm's efficiency is not so much a measure of how long the algorithm takes to solve a problem, but of how many steps the algorithm takes.
To measure the number of steps an algorithm takes to solve a given problem, we perform asymptotic analysis. The subsequent comments provide a big-picture view of asymptotic analysis.
Asymptotic Complexity
Asymptotic complexity is the tool used by computer scientists and mathematicians for measuring algorithmic efficiency. The formal definition:
Asymptotic Complexity.
A function f(n) runs on the order of g(n), written f(n) = O(g(n)), if there exists some value n₀ and some constant c for which f(n) ≤ c · g(n) for all n > n₀.
At its core, asymptotic complexity is a measure of how fast a program's runtime grows asymptotically. This is equivalent to asking: as the size of f's domain (its inputs) tends towards infinity, how does the runtime of the program grow? Consider the example of counting the number of characters in the string "hegel". If we used the algorithm of counting each character one at a time, we would count 1, 2, 3, 4, and 5. We say that this algorithm runs in linear time with respect to the number of characters. Why? Because it takes a constant amount of time to count one character, and as such, the number of characters is directly proportional to the amount of time it takes to count the characters in the string. As the number of characters increases, the runtime increases linearly with the number of characters (the input). Denoting the number of characters with the variable n, we say that the character-counting algorithm has a complexity of order O(n).
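Here is one way such a counting routine might look in C++; the function name countChars is made up for this sketch. It touches each character exactly once, so the number of steps grows linearly with the string's length.

#include <iostream>
#include <string>
using namespace std;

// Count the characters one at a time: one constant-time step per character,
// so the running time grows linearly with the length of the input string.
int countChars(const string &s) {
    int count = 0;
    for (size_t i = 0; i < s.length(); i++) {  // one step per character
        count++;
    }
    return count;
}

int main() {
    cout << countChars("hegel") << endl;  // prints 5
    return 0;
}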
Functions of Runtime
There are several classes of runtime that we're most concerned with in the analysis of algorithms. We cover each of them below.
Constant Time: O(1)
If an algorithm always takes the same amount of time regardless of the size of its input, we say that the algorithm's running time is constant.
Logarithmic Time: O(log n)
If an algorithm's running time grows logarithmically with increasing inputs, we say that the algorithm's running time is logarithmic. Logarithmic time is most often found with algorithms that solve a large problem by dividing it into a series of smaller and smaller problems. A classic example is binary search; a sketch follows this list.
Linear Time: O(n)
If an algorithm's running time increases in direct proportion to its inputs, we say that the algorithm's running time is linear. Linear time is most often found with algorithms that must perform some processing on every input, one by one. For example, if the number of inputs is n, then the algorithm takes on the order of n operations to complete. Such an algorithm runs in linear time.
Linearithmic Time: O(n log n)
Some algorithms break problems into smaller problems, solve them individually, then combine the solutions. These algorithms' running time is linearithmic.
Quadratic Time: O(n²)
Algorithms that require a nested loop will almost always have quadratic running times. Needless to say, quadratic time is not the most efficient, and it is generally restricted to small problems.
Cubic Time: O(n³)
Like quadratic time, algorithms requiring loops nested within nested loops almost assuredly have cubic running times. Like algorithms of order O(n²), algorithms of order O(n³) are generally relegated to small problems.
Exponential Time: O(2ⁿ)
Algorithms with exponential running times are those where every additional input results in an exponential increase in running time. For example, O(2ⁿ), where the running time doubles with each additional input, and O(3ⁿ), where it triples. Exponential running times are most often found with brute-force algorithms.
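As an example of logarithmic growth, here is a minimal binary search sketch over a sorted array. Each comparison halves the remaining search range, so roughly log₂ n comparisons suffice.

// Binary search on a sorted array: each iteration halves the search range,
// so the number of iterations grows logarithmically with n.
int binarySearch(const int arr[], int n, int target) {
    int lo = 0, hi = n - 1;
    while (lo <= hi) {
        int mid = lo + (hi - lo) / 2;          // midpoint, written to avoid overflow
        if (arr[mid] == target) return mid;    // found it
        if (arr[mid] < target) lo = mid + 1;   // discard the left half
        else hi = mid - 1;                     // discard the right half
    }
    return -1;  // not found
}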
Types of Analyses
There are three different types of algorithmic analysis: big-O (an upper bound on growth), big-Omega (a lower bound), and big-Theta (a tight bound). Each type conveys a particular piece of information.
There are a few things to note about these notations. First, it's a very common mistake to use big-O notation as an approximate model. This amounts to an abuse of notation, because all that big-O provides is an upper bound on the running time's growth. For example, if we determined that some algorithm has a running time of n², we could say that the algorithm has a big-O of O(n²). However, we could also say that the algorithm has a big-O of O(n³), as well as O(n⁴), as well as O(2ⁿ), and so on. All of those statements would be true, because all that big-O communicates is an upper bound.
To avoid this abuse, we use tilde notation in these materials to communicate an approximate model. All that this notation communicates is the leading term of the growth function; for example, if an algorithm's running time is 4n² + 20n + 3, we would write ~4n². Because tilde notation only communicates the leading term, it is up to the reader to decide how to interpret this information. If we wanted to provide more context for that interpretation, we would use the other notations available.
Space Complexity
The same analysis for time complexity applies to space complexity. The difference: with space complexity, we're concerned with how much space a given data structure, or algorithm, uses up. For example, with a linked list of n nodes, we have a space complexity of O(2n). Why 2n? Because every node contains two parts—the value stored and a pointer. Of course, the actual amount might be somewhat more than 2n, since an operation on the list may require more memory. However, that extra amount is more than likely a constant, or a variable that has no effect on the leading term, n. As we said earlier, we are only focused on the leading term. Accordingly, the data structure has a space complexity of O(n). The sketch below puts the two-part node in terms of bytes.
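To see the two-parts-per-node claim in bytes, here is a small sketch that prints the size of a node like the one described earlier. The exact numbers depend on the compiler and platform, so treat the output as illustrative.

#include <iostream>
using namespace std;

// A node holds a value plus a pointer to the next node, so its size is
// roughly sizeof(int) + sizeof(Node*), possibly rounded up for alignment.
struct Node {
    int value;
    Node *next;
};

int main() {
    cout << "value:   " << sizeof(int) << " bytes" << endl;
    cout << "pointer: " << sizeof(Node*) << " bytes" << endl;
    cout << "node:    " << sizeof(Node) << " bytes" << endl;  // may include padding
    return 0;
}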
To understand space complexity analyses, we should be familiar with the basic units of memory. First, the smallest unit of memory is a bit, or more specifically, a 1-bit register. We can think of a bit as a small box that can contain either a 0 or a 1. Units of memory, then, are just a more convenient way of counting these boxes.
Unit | Abbreviation | Definition |
---|---|---|
Byte | B | 8 bits |
Kilobyte | KB | A thousand (10³) bytes |
Megabyte | MB | A million (10⁶) bytes |
Gigabyte | GB | A billion (10⁹) bytes |
The units above count the total number of 1-bit registers in powers of 10. However, if we examined a computer system's underlying hardware, the total number of registers doesn't directly translate to these powers of 10. Because computers can only think in binary, the number of registers in a given computer is some power of 2. Thus:
Unit | Abbreviation | Meaning |
---|---|---|
bit | b | A single bit. |
byte | B | 8 bits. |
kilobyte | KB | 1,024 bytes (1 KiB) |
megabyte | MB | 1,024 KiB (1 MiB) |
gigabyte | GB | 1,024 MiB (1 GiB) |
terabyte | TB | 1,024 GiB (1 TiB) |
petabyte | PB | 1,024 TiB (1 PiB) |
On older computers, the largest possible register consisted of 32 bits. This meant that in a single clock cycle, the computer system's CPU could process a piece of data consisting of 32 bits, and pointers were 4 bytes. Fairly recently, however, computer systems have upgraded to 64-bit registers, with 8-byte pointers. This means that computers today can address more memory, but it also means that pointers use more space.
Type | Bytes |
---|---|
boolean | 1 |
byte | 1 |
char | 2 |
int | 4 |
float | 4 |
long | 8 |
double | 8 |
char[] | 2n + 24 |
int[] | 4n + 24 |
double[] | 8n + 24 |
char[][] | ~2mn |
int[][] | ~4mn |
double[][] | ~8mn |
Object overhead | 16 |
Reference | 8 |
Padding | Because each object uses a multiple of 8 bytes, padding of up to 7 bytes may be added. |
string | Roughly 2n plus overhead, where n is the string's length; the overhead includes the amount of space consumed by the reference and the amount of padding. |
Common Running Times
Something to be cognizant of is the cost of basic operations in a given language. A common mistake novices make is assuming that a language's built-in or provided operations always run in constant time. This is certainly not the case. The table below provides an overview of running times for basic operations in popular languages.
Operation | Time | Example | Comment |
---|---|---|---|
binary search on array | O(log n) | See implementation. | |
peak-finder on array | O(log n) | See implementation. | |
merge sort | O(n log n) | See implementation. | |
2-sum | O(n log n) | See implementation. | |
3-sum | O(n²) | See implementation. | |
check all subsets | O(2ⁿ) | See implementation. | |
variable declaration | O(1) | int x; | |
assignment statement | O(1) | x = 1; | |
numeric comparison | O(1) | x < 1; | |
array element access | O(1) | arr[1] | |
array length | O(1) | arr.length | |
1D array allocation | O(n) | new int[n] | |
2D array allocation | O(n²) | new int[n][n] | |
string length | O(1) | str.length() | |
string extraction | O(1) | str.substring(n/2, n) | |
string concatenation | O(n) | str1 + str2 | very commonly abused |
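The last row deserves a sketch. Repeatedly concatenating with + builds a brand-new string each time, so the loop below does quadratic work overall, while appending with += grows the buffer in place. This is an illustrative C++ sketch, not the exact implementation the table refers to.

#include <string>
using namespace std;

// Quadratic: s + 'a' copies all of s on every iteration (1 + 2 + ... + n characters).
string buildSlow(int n) {
    string s;
    for (int i = 0; i < n; i++) {
        s = s + 'a';    // allocates and copies a brand-new string each time
    }
    return s;
}

// Roughly linear: += appends in place, growing the buffer geometrically.
string buildFast(int n) {
    string s;
    for (int i = 0; i < n; i++) {
        s += 'a';       // amortized constant-time append
    }
    return s;
}

int main() {
    buildSlow(20000);   // noticeably slower than buildFast as n grows
    buildFast(20000);
    return 0;
}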
Containers
Many languages come with a set of pre-built
data structures. That
is, data structures provided by some library native to the language. For
example C++ has data structures provided by its standard library, as does
Java. The table below provides a brief overview of some common containers.
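As a minimal sketch of a few such containers in C++ (the standard library provides many more than shown here):

#include <iostream>
#include <list>
#include <map>
#include <stack>
#include <string>
#include <vector>
using namespace std;

int main() {
    vector<int> v = {1, 2, 3};       // dynamic array
    v.push_back(4);

    list<int> l = {1, 2, 3};         // doubly linked list
    l.push_front(0);

    stack<int> s;                    // last-in, first-out adaptor
    s.push(10);
    s.push(20);

    map<string, int> ages;           // ordered key-value pairs
    ages["Luke"] = 20;

    cout << v.back() << " " << l.front() << " "
         << s.top() << " " << ages["Luke"] << endl;   // prints 4 0 20 20
    return 0;
}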
Footnotes
- For poorly coded programs, we might notice the twirling beach ball on a Mac, slow loading times, or, worst of all, a crash. ↩
- Given how valuable a commodity security has become in the modern era, the operating system would likely prevent even the most efficient and feature-filled programs from running. ↩
- The distinguishing trait between a data structure and a database: A data structure answers the question, How do I efficiently and correctly store this data in main memory (i.e., RAM), so that a program can use it easily? A database answers the question, How do I efficiently and correctly store this data in permanent memory (i.e., an HDD or SD card), so that a program can retrieve it easily? ↩
- The term big data refers to the vast amount of data now available via the internet. Analyzing that data, with clever data structures and algorithms, can lead to breakthrough insights, some of which spawn new programs, companies, laws, and even entire markets. ↩
- Developers tend not to want users touching or requesting legacy data. Most data warehouses are simply arrays of disks, making retrieval costly. Moreover, data warehouses are generally intended for the developer. They are helpful in identifying factors like user behavior and trends; facts crucial for making business decisions. The algorithms used for analyzing such legacy data are called data mining algorithms. ↩
- This demonstrates why optimization is critical in compiler design. ↩
- The amount of memory a pointer takes up varies by platform rather than by the type it points to: typically 4 bytes on a 32-bit system and 8 bytes on a 64-bit system. We assume 2 bytes here only to keep the arithmetic simple. ↩
- Abstract data types are also called interfaces or Application Programming Interfaces (APIs). ↩
- Of course, our benefit of seeing many things at once effectively disappears once we have to deal with a list of a thousand, or a million, things. ↩