You can not enjoy wealth, if you don’t have good health.
A data structure is nothing but a way of storing data. When you write your own class, you create a data structure for your own needs where each member variable of the class is your data, and the way you organize them makes your class a data structure. There are some common data structures used frequently for various applications and hence it becomes important to study them well. This is also a reason why coding interviews focus so much on them. Data structures are the core of your application because they control how efficient your application is. Making a bad choice there is very costly and hence we will see appropriate usage of each data structure with one common theme across the post. Each data structure has its own merits and demerits. Some data structure does something really fast, but the same data structure performs poorly for some other operation. Wasting no time, let’s take a look.
COVID-19 disturbed the entire world in 2020. Hospital beds are filled with COVID-19 patients. Let’s say each bed is in a separate room and there are 100 rooms numbered from 1000 to 1099. Now say, you want to know what room the 4th (assuming zero-based indexing) patient is in. You know that rooms start at 1000, so the 4th person has to be in room no. 1000+4 = 1004. Similarly, the last i.e. 99th person will be in room no. 1000+99=1099. This is the beauty of the array data structure. In a computer science world, the hospital is analogous to memory, room numbers analogous to memory addresses, and patients analogous to the data you store in memory. If you want to access any element based on an index (0th, 1st, 2nd, and so on), it’s very efficient because it’s very quick to find the exact memory address of the element. So this is the first property of an array – constant time access. But say, you want to know where a patient named XYZ is and you don’t know the number, you will have to check each room. Similarly, searching an element in an array is not efficient and it takes linear time i.e. O(n) where n is the total number of elements in the array. Because search based on an index is quick, updating a value at a certain index is also quick. Now we saw that the hospital had rooms from 1000 to 1099. However, what if there are more than 100 patients? The hospital can not accommodate them. In the same way, arrays are fixed length and can’t be extended, so this is one disadvantage with arrays. Other problems with arrays are when inserting and deleting an element. You have to shift all existing elements if you want to insert a new element or remove an existing one.
So summary about the arrays is –
- Great for an index-based lookup/search. Time complexity: O(1).
- Great for an index-based update. O(1).
- Inserting or deleting an element is inefficient. O(n).
COVID-19 spreads very rapidly without locking down the city. So the local government allowed each person to meet only one other person. Naturally, this created a chain of people – you can meet A, A can meet B, B can meet C, and so on with you being the start of the chain. Now, if someone new needs to be added in the chain, it’s easy. Say we want to add the person P between person B & C. What we can simply do is to allow B to meet P and P to meet C. Without breaking the law, you could easily accommodate a new person. Similarly, if someone wants to move out of the city, again it’s easy. Say, person P now wants to leave. We can simply let P go and because P was allowed to meet C, we can now allow B to meet C. This system resembles the linked-list data structure in computer science. The people in our example are analogous to nodes and allowing a person to meet another person is analogous to having a link between two nodes. As person A was the start of the chain, each linked-list has a ‘head’ or a ‘start’ node. The insertion and deletion operations are efficient in linked-list because you only have to change a couple of links and nodes can be added, removed easily. Hence, the time complexity is O(1). This is assuming you know the nodes between whom you want to insert or a node previous to the node that you want to delete i.e. you have the pointers available to those nodes. If you have observed, there is no concept of ‘index’ in the linked-list. As nodes can be added, removed; indices don’t make any sense and this is one disadvantage of the linked-list. Let’s say you want to pass on a mask to person X in the chain, but you can only start with person A. So you will pass on the mask to A, A will pass it to B, B to C, and so on. This was quite inefficient. It would have been best if I could know where the X is and pass it directly. In the same way, looking for some node in the linked-list is inefficient and takes O(n) time where ‘n’ is the number of nodes in the linked-list. Because the search is inefficient, the update is also inefficient as it involves searching as a prerequisite. We talked about insertion and deletion being O(1), but remember it is only if you have the pointers to the necessary nodes. If you don’t have those, insertion and deletion are also O(n) because you will have to search for the right position. However, insertion and deletion at the front and from the front of a linked-list is always O(1). Inserting and deleting at the end of the linked-list will naturally be O(n) if you don’t have the pointer to appropriate nodes.
So the summary about the linked-list is –
- No index-based access.
- Inefficient search – O(n).
- Inefficient update – O(n). If you have the pointer to the node that you want to update, then it is O(1).
- Insertion and deletion is O(1) if you have the direct access to necessary nodes. Otherwise they are O(n). Insertion and deletion at the front are O(1).
- The benefit is that the linked-list is of dynamic size, so any amount of data can be accommodated as long as there is memory available.
Doctors are limited in number and there are a lot of patients in this unprecedented time. Hence the doctors need to prioritize between the patients. Say a doctor is treating patient A with some minor illness. At the same time, another patient B who has a relatively more serious illness walks in. Now the doctor needs to put the patient A on hold and take care of B. While treating B, a COVID-19 patient C walks in and now the doctor needs to immediately look to treat the patient C. So in summary, A walked in first but put on hold when B walked in. B was put on hold when C walked in. Now once the doctor treats C, B will get treated, and then A. This is the first-in-last-out or last-in-first-out approach. You can also visualize this as a stack of books where the book at the top is the last one that was added to the stack while the first book is at the bottom of the stack. This is sometimes the most appropriate way of organizing the data and hence computer science has a data structure called a stack. There are very limited operations that can be done on the stack but they are efficient. Adding a book at the top of the stack is easy and so is adding an element into the stack. It is efficient with time complexity being O(1). Similarly, removing an element from the top of the stack is O(1). Stack does not allow inserting an element anywhere else in the stack, it has to be at the top. Similarly, the deletion has to be from the top. The good part is that both these operations are O(1), the bad part is that there are no other operations supported. However, it helps to have operations like checking if the stack is empty, or checking if the stack is full. Both of these operations are also O(1).
So the summary about the stack is –
- Can’t access elements randomly based on the index.
- Only supported operations are insertion at the top and deletion from the top, both of which are efficient – O(1).
- No way to search for an element unless the elements from the top are removed.
- Similarly, you can’t update a value without removing the top elements.
- You can check the element at the top efficiently – O(1).
- Checking if the stack is full, or empty is also efficient – O(1).
With the increase in COVID-19 patients, there are dedicated doctors to take care of. Doctors are treating patients as they arrive in the first-in-first-out approach. This is very intuitive – think of a checkout queue at Walmart or a queue for getting served at the bank. This is a very natural representation of data and hence there is a data structure called a queue. You can imagine, insertion in the queue is only allowed at the end while removal is only allowed from the beginning. Both of these operations are O(1). Similar to stack, the queue does not allow random insertion/deletion or search in the queue without removing elements. However, checking if a queue is empty or full is efficient and O(1). You can also check/access the element at the front in O(1) time.
Summary of a queue –
- Can’t access elements randomly based on the index.
- Only supported operations are insertion at the end and deletion from the front, both of which are efficient – O(1).
- No way to search for an element unless the elements from the front are removed.
- Similarly, you can’t update a value without removing the front elements.
- You can check the element at the front efficiently – O(1).
- Checking if the queue is full, or empty is also efficient – O(1).
There are millions of cases all over the world and now want to maintain a table of the number of COVID-19 cases by country. You started with a list of pairs of countries and the count. Whenever a new case is discovered, you find the country on the list and then increase the count associated with that country. Soon you realize you are wasting a lot of time searching for the country. So you sort the list by name of the country and now you can find the country relatively quickly because from the name of the country that you want to increase the count of, you get the idea of where that country would be, on the list. Another approach you can take is to create 26 buckets for 26 alphabets and store each country into a bucket of the country name’s initial alphabet. Now you suddenly find yourself wasting close to no time in searching for the country. Unsurprisingly, this is also a very common operation in the computer science world. There are many instances where you need to store the key and value pairs like country and corresponding COVID-19 cases, roll number of a student and student name, or a person and list of their favorite food items. For this reason, there is a data structure called a hashmap. Hashmap stores keys and their corresponding values. The best part about the hashmap is that looking for a value associated with the key is very efficient, on average O(1). Adding new key-value pairs, removing existing key-value pairs is on average O(1) too. Although looking for a value when you know the key is O(1), checking whether a certain value exists in the hashmap is not efficient, it is O(n). Internally, hashmap is implemented using a hashing technique. The keys are assigned to buckets using a hash function. Ideally, the hash function uniformly divides the keys into buckets. Whenever you want to find a value associated with the key, the key is again hashed and the key is only searched into that bucket which avoids searching for the key in the rest of the data. This is why hash maps are fast at all operations that we discussed.
Summary of the hashmap data structure –
- Efficient insertion, deletion of key-value pairs – O(1) on average.
- Efficient lookup based on the key, efficient update – O(1).
- Searching for value is inefficient – O(n).
- Although it is safe to assume that all operations are O(1) from a practical point of view, it is not the case in theory. Hence, the worst-case time complexities are O(n). Again, for all practical purposes, you can assume that they are O(1).
The healthcare systems of all countries are at tremendous strain at this time. Doctors are overworked and they need to prioritize treating various patients with most critical patients at high priority and least critical at low priority. After treating the current patient, doctors need to find who the next highest priority patient is. When the patients are small in number, it is easy to find the highest priority patient by reading the entire list of patients. However, as the number of patients grows, the list gets longer, and scanning the entire list every time becomes inefficient. Similar problems in the computer science world are common. Whether it be finding the next highest priority job to run on a CPU, or finding the top 5 queries searched in the Google search bar. The data structure that helps in such situations is a heap or priority queue. Heap provides an efficient way to get max (or min depending on the implementation) element at any time. Finding the max is very efficient, O(1) but other operations are less efficient or some operations are not allowed at all. Inserting a value in the heap takes O(logn) time which is not bad but not as efficient as O(1) either. Secondly, deletion from the heap is very limited. You can only remove the max element from the heap and it takes O(logn) time. You can not remove any other elements without removing the current max element. The search operation is also not supported, you can only look at the max element, rest all are, in a way, hidden. If it was not clear, the heap data structure always provides access to the max element among all elements that it has. So even if you remove the max element, the heap will still provide access to the max element among the remainder of the elements. Hence it becomes possible to keep getting the next highest/max elements until the heap is empty.
Summary of the heap data structure –
- Best for finding the max, next max, next to next max, and so on. Getting the max element is very efficient – O(1).
- Inserting a value is a O(logn) operation.
- Deletion also takes O(logn) time and only the current max element in the heap can be removed from the heap.
- Search is not supported and hence update is also not supported.
- Because it is possible to get the max element at any given point, it can be a way of sorting the elements or getting the top K elements without sorting all elements.
Even after lockdowns are enforced, essential workers are still working. People have to go out to get groceries. Getting the natural air, exercising is important too and people go out in the evening. However, this is leading to the spread of COVID-19. One person who is COVID-19 positive transmits the virus to everyone that person meets. All those people, in turn, transmit it to many others whom they meet, and so on. One way we can keep track of this is by building a list of pairs of people who mate each other. If person A meets B, we can add (A, B) to the list., if C meets D, we can add (C, D), and so on. If any person is tested positive, then we can search the entire list and find all people who mate the infected person to quarantine them. When the list was small, it wasn’t a big deal to check this. But as the list grew longer, it became very time-consuming. Then came a better way of storing the meeting/connection information. We could instead have a table where the first column is a person’s name and the second column is a list of all people who came in contact with the corresponding person. This way, we could quickly find all people who came in contact with any person. This is just one example where we need to store network information. Another example can be storing who is a friend of whom. With so many people and so many friends per person, it’s infeasible to just keep the simple list and this is where a graph data structure comes to rescue. The graph is defined by nodes and edges where nodes denote the entities and edges denote the relationship. Like in our example, people are nodes and an edge between two people exists if they mate each other. These nodes and edges are for humans to visualize. Computers can’t store information like this. So graphs are stored in a couple of ways. One way is to store the list of edges i.e. list of pairs of nodes that are connected. This is called an Edge-list representation. If the graph is sparse, depending on the application, this can be a better way of storing. Another way is adjacency matrix representation which is nothing but a table where rows and columns are nodes. If an edge exists between nodes A and B, the intersection of the row for node A and column for node B is marked to indicate that. The third way is adjacency list representation where for each node, a list is maintained and the list has nodes whom the corresponding node is connected in the graph.
Adjacency list representation is the most common one because it requires a lot less memory than the adjacency matrix. Traversing the graph is a very common operation and the common algorithms for traversal include depth-first and breadth-first traversal. If the graph is represented as adjacency list, then both the traversals take O(V+E) time where V is the number of vertices or nodes and E is the number of edges. If the graph is in the adjacency matrix form, then it would take O(V^2) time. If you want to check if there is a direct edge between two nodes, the adjacency matrix can be the most efficient way of representing the graph because it can be done in O(1) time. Adjacency list representation might need O(V) time for the same. There are other algorithms on graph-like finding the shortest path between two nodes in a graph or finding the minimum spanning tree. Both of these are rarely needed in a coding interview. Checking if the two nodes are connected directly or indirectly can be done in O(V+E) time using a depth-first search or a breadth-first search. Topological sort is one very important algorithm on directed acyclic graphs and takes the same time as a depth first search or traversal.
Summary of graph data structure –
- There are 3 ways to represent graphs – Edge list representation, Adjacency list representation, Adjacency matrix representation. Depending on what operations are going to get performed, one should choose the right approach.
- Checking if two nodes are directly connected can be done in O(1) time in adjacency matrix representation.
- Depth-first search and breadth-first search are the most common algorithms on graphs and there are a lot of problems that get asked in interviews around these two. Both need O(V+E) time to execute.
- The topological sort is also very helpful in solving various problems which take O(V+E) time.
- Adjacency list representation can be a lot memory-efficient than an adjacency matrix representation.
Covid-19 patients list is growing very fast and hence the amount of memory required to store the names of the patients. At the same time, you observe that a lot of time is getting wasted in typing the full names of patients when searching for their names. Instead, it can save a lot of time if the computer can suggest names based on the letters that are already typed and if there’s only one suggestion starting with the entered prefix, the computer can just auto complete it. Both of these are the ideal applications for a trie data structure. When there are a lot of strings to be stored, trie can be a go-to data structure because of the way it stores strings. The other thing that a trie is good at is finding all strings that start with some prefix, so it can help in auto completion of strings. The insert, update, delete, search operations on a trie are all O(L) where L is the length of the string to be searched, inserted, deleted or updated. Although there are limited use cases when a trie is helpful, it is one of the best data structures for the particular applications we discussed. Without a trie, auto completion or spell checking can be relatively very inefficient.
Summary of trie data structure –
- It supports efficient i.e. O(L) search, insert, delete operations where L is the length of the string.
- Mostly used with strings and supports efficient operations such as auto completion, spell checking.
- It can store a lot of strings in a relatively small space and can be used as a hashmap when the keys are strings.
Although the examples we used were a bit (?) stretched, I hope that it helped you in getting a better understanding of how different data structures can be used in different real world scenarios. Choosing a right data structure is a key to the success of an application, make the right choices depending on the application.
Let me know in the comments below if I missed anything or what you think about the post. Do you struggle with applying any particular data structure, or aren’t sure where some data structure can be used? Put that in the comments and we can discuss.
Don’t forget to like & share the post, and subscribe to the blog.
Stay safe and Happy coding!