What is indexing? What does it do?

Stack Overflow doesn’t have very good answers on this topic since it’s too broad.

This article is helpful: http://www.programmerinterview.com/index.php/database-sql/what-is-an-index/

  • Basically, without an index, finding a certain row (or rows) requires a full table scan, which is very inefficient, like checking every row by eye
  • An index is a data structure (most commonly a B-tree) that stores the values of a specific column of the table, i.e. an index consists of column values from the table
  • Why B-tree?
    • This is due to their efficiency: insertion/deletion/lookup can all be done in O(log n) time.
    • Another major reason is that the values stored in a B-tree are kept sorted, which makes range queries efficient.
  • How do hash table indexes work?
    • Imagine that we store one column value as the key in the hash table and the row data as the value; lookup is then very fast, because a hash table is basically an “associative array”.
  • The disadvantages of a hash table working as an index:
    • Hash table keys are not sorted; the table only maintains a mapping between key and value. So it’s good for fast exact-match lookup, but not for range queries, such as finding how many employees in a table are younger than 25 years old (see the sketch after this list).
  • Good analogy of database index:
    • It’s like the index of a book: if you’d like to find the chapter describing Python decorators, you could either flip through the pages or just go to the index page, where that chapter is listed along with its page number. Clearly, using the index page is a lot faster.
  • What’s the cost of having a database index?
    • It takes up space: the larger your table is, the larger your index is.
    • Another performance hit is that whenever you insert, update, or delete rows in your table, the same operations have to be performed on the index.
  • As a general rule, an index should only be created if the data in the indexed column will be queried frequently.
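
To make the point about range queries concrete, here is a minimal Java sketch (my own example, not from the article above): a TreeMap keeps its keys sorted, the way a B-tree index does, so it can answer both exact-match lookups and the “younger than 25” style range query, while a plain HashMap could only serve the exact match.

import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class IndexDemo {
    public static void main(String[] args) {
        // index on the "age" column: age -> names of the employees with that age;
        // TreeMap keeps its keys sorted (a red-black tree here, standing in for a B-tree)
        TreeMap<Integer, List<String>> ageIndex = new TreeMap<>();
        addRow(ageIndex, 23, "alice");
        addRow(ageIndex, 31, "bob");
        addRow(ageIndex, 24, "carol");
        addRow(ageIndex, 40, "dave");

        // exact-match lookup: a hash index could serve this equally well
        System.out.println("age 31: " + ageIndex.get(31));

        // range query, possible only because the keys are kept sorted;
        // a hash index would force a full table scan here
        System.out.println("younger than 25: " + ageIndex.headMap(25).values());
    }

    private static void addRow(TreeMap<Integer, List<String>> index, int age, String name) {
        // without the index, finding these rows would mean scanning the whole "table"
        index.computeIfAbsent(age, k -> new ArrayList<>()).add(name);
    }
}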

Heap Sort

Wikipedia has a very good explanation of heap sort and why its time complexity is O(n log n):

“The heapsort algorithm involves preparing the list by first turning it into a max heap. The algorithm then repeatedly swaps the first value of the list with the last value, decreasing the range of values considered in the heap operation by one, and sifting the new first value into its position in the heap. This repeats until the range of considered values is one value in length.

The steps are:

  • Call the buildMaxHeap() function on the list. Also referred to as heapify(), this builds a heap from a list in O(n) operations.
  • Swap the first element of the list with the final element. Decrease the considered range of the list by one.
  • Call the siftDown() function on the list to sift the new first element to its appropriate index in the heap.
  • Go to step (2) unless the considered range of the list is one element.

The buildMaxHeap() operation is run once, and is O(n) in performance. The siftDown() function is O(log(n)), and is called n times. Therefore, the performance of this algorithm is O(n + n * log(n)) which evaluates to O(n log n).”


Some key points to understand this algorithm and its implementation:

  • the heap data structure here is purely conceptual: it exists only in our heads, and in reality we’re just repositioning elements to different indices of the same array. This is also why heap sort’s space complexity is O(1): it requires no extra memory (aside from the stack of the recursive siftDown below, which an iterative version would avoid).
  • how do we construct this virtual heap data structure? some keys to understand it:
    • regard the array as a level order traversal of a complete binary tree (draw out the picture, and then you’ll have a better understanding)
    • how to construct a max heap? we always want the parent’s value to be greater than both of its children’s values, so whenever we find a pair that violates this rule, we swap the two and recursively repeat.
    • how do we find a node’s left and right children?
      • if you draw out the tree, you’ll figure out that a node with index i has its left child at 2*i and its right child at 2*i+1, if the two children exist. (Strictly, this holds when the root is at index 1; for a 0-based array the children are at 2*i+1 and 2*i+2. The code below uses 2*i and 2*i+1 on a 0-based array, which still works: index 0 simply ends up with a single child at index 1, and the maximum still bubbles up to the root.)


  • Heap sort is NOT stable: the relative order of equal elements might change, since heapsort repeatedly picks the largest element and puts it at the end of the list.

I used the Wikipedia example input 6, 5, 3, 1, 8, 7, 2, 4 to test my code (heap_sort):

public class _20160710_HeapSortAgain {

    private static int N;

    public static void sort(int[] nums) {
        heapify(nums);//build the max heap first; this also initializes N
        for (int i = N; i > 0; i--) {//i doesn't need to reach zero, because we don't need to swap the zero-indexed number with itself
            swap(nums, i, 0);//we always swap the first element in the array, i.e. the root of the heap, with the number at index i, the largest index in the UN-sorted range
            N -= 1;//don't forget to decrement N by 1, because we need to worry about one number fewer each time
            maxheap(nums, 0);//then we always sift the number at index zero back down into the heap
        }
    }

    private static void heapify(int[] nums) {
        N = nums.length - 1;
        for (int i = N / 2; i >= 0; i--) {//here i must go all the way down to zero because we need to run maxheap() on the first element as well
            maxheap(nums, i);
        }
    }

    private static void maxheap(int[] nums, int i) {
        int leftChildIndex = 2 * i;
        int rightChildIndex = leftChildIndex + 1;
        int max = i;
        if (leftChildIndex <= N && nums[leftChildIndex] > nums[i]) {
            max = leftChildIndex;
        }
        if (rightChildIndex <= N && nums[rightChildIndex] > nums[max]) {
            max = rightChildIndex;
        }
        if (i != max) {
            swap(nums, i, max);
            maxheap(nums, max);//keep sifting down until the node reaches its proper place
        }
    }

    private static void swap(int[] nums, int i, int j) {
        int temp = nums[i];
        nums[i] = nums[j];
        nums[j] = temp;
    }

    public static void main(String... strings) {
        int[] nums = new int[]{6, 5, 3, 1, 8, 7, 2, 4};
//        int[] nums = new int[]{1, 2, 3, 4, 5, 6};
//        int[] nums = new int[]{6, 5, 4, 3, 2, 1};
        print("BEFORE sorting, nums are: ", nums);
        sort(nums);//sort in place, then print again
        print("AFTER sorting, nums are: ", nums);
    }

    private static void print(String msg, int[] nums) {
        System.out.print(msg);
        for (int i : nums) {
            System.out.print(i + ", ");
        }
        System.out.println();
    }
}


HashMap

  • The idea behind a Map is to be able to find an object faster than with a linear search.
    • Using hashed keys to locate objects is a two-step process. Internally, the Map stores objects as an array of buckets.
    • The index into that array is derived from the hash code of the key (the hash code modulo the table capacity). This locates the bucket, which is then searched linearly, using equals(), to determine whether the object is found.
  • Open addressing, or closed hashing, is a method of collision resolution in hash tables. With this method a hash collision is resolved by probing, or searching through alternate locations in the array (the probe sequence) until either the target record is found, or an unused array slot is found, which indicates that there is no such key in the table.
  • Well known probe sequences include:
  • Linear probing (https://courses.cs.washington.edu/courses/cse326/00wi/handouts/lecture16/sld015.htm)
    • in which the interval between probes is fixed — often at 1.
  • Quadratic probing 
    • in which the interval between probes increases linearly (hence, the indices are described by a quadratic function).
  • Double hashing (https://courses.cs.washington.edu/courses/cse326/00wi/handouts/lecture16/sld025.htm)
    • in which the interval between probes is fixed for each record but is computed by another hash function.
  • The main tradeoffs between these methods are that linear probing has the best cache performance but is most sensitive to clustering, while double hashing has poor cache performance but exhibits virtually no clustering; quadratic probing falls in-between in both areas. Double hashing can also require more computation than other forms of probing.
  • A critical influence on performance of an open addressing hash table is the load factor; that is, the proportion of the slots in the array that are used. As the load factor increases towards 100%, the number of probes that may be required to find or insert a given key rises dramatically. Once the table becomes full, probing algorithms may even fail to terminate. Even with good hash functions, load factors are normally limited to 80%. A poor hash function can exhibit poor performance even at very low load factors by generating significant clustering. What causes hash functions to cluster is not well understood, and it is easy to unintentionally write a hash function which causes severe clustering. A minimal sketch of linear probing follows below.
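
To make probing concrete, here is a minimal, hypothetical Java sketch of open addressing with linear probing (my own toy class, not any library’s implementation): fixed capacity, int keys, no deletion, and it assumes the table never fills up, per the load-factor point above.

public class LinearProbingMap {
    private static final int CAPACITY = 16; // fixed size for simplicity; a real table resizes as the load factor grows
    private final Integer[] keys = new Integer[CAPACITY];
    private final String[] values = new String[CAPACITY];

    public void put(int key, String value) {
        int i = Math.floorMod(Integer.hashCode(key), CAPACITY);
        while (keys[i] != null && keys[i] != key) {
            i = (i + 1) % CAPACITY; // linear probing: on collision, step to the next slot (interval fixed at 1)
        }
        keys[i] = key;
        values[i] = value;
    }

    public String get(int key) {
        int i = Math.floorMod(Integer.hashCode(key), CAPACITY);
        while (keys[i] != null) { // an unused slot means the key is not in the table
            if (keys[i] == key) {
                return values[i];
            }
            i = (i + 1) % CAPACITY;
        }
        return null;
    }
}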
  • Java’s HashMap uses a LinkedList per bucket to store colliding entries, but note: HashMap stores both the key and the value in each LinkedList node, in the form of a Map.Entry object. Otherwise, given a key, you wouldn’t know which value in this linked list should be returned. (A minimal chaining sketch follows below.)
    • after finding the bucket location, we call the key’s equals() method to identify the correct node in the LinkedList and return the associated value object for that key in a Java HashMap.
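
Here is a minimal sketch of that chaining idea (a made-up ChainedMap class for illustration, not the real java.util.HashMap source; fixed bucket count, String keys and values only):

import java.util.AbstractMap.SimpleEntry;
import java.util.LinkedList;
import java.util.Map;

public class ChainedMap {
    private static final int BUCKETS = 16;
    @SuppressWarnings("unchecked")
    private final LinkedList<Map.Entry<String, String>>[] table = new LinkedList[BUCKETS];

    public void put(String key, String value) {
        int i = Math.floorMod(key.hashCode(), BUCKETS); // step 1: hash the key to pick the bucket
        if (table[i] == null) {
            table[i] = new LinkedList<>();
        }
        for (Map.Entry<String, String> e : table[i]) {
            if (e.getKey().equals(key)) { // key already present: overwrite its value
                e.setValue(value);
                return;
            }
        }
        table[i].add(new SimpleEntry<>(key, value)); // store key AND value together as a Map.Entry
    }

    public String get(String key) {
        int i = Math.floorMod(key.hashCode(), BUCKETS);
        if (table[i] == null) {
            return null;
        }
        for (Map.Entry<String, String> e : table[i]) {
            if (e.getKey().equals(key)) { // step 2: linear search within the bucket using equals()
                return e.getValue();
            }
        }
        return null;
    }
}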
  • Why doesn’t Hashtable allow null keys/values while HashMap does?
    • HashMap was introduced later than Hashtable, to fix some of its limitations;
    • Hashtable calls .hashCode() on each key to compute the hash, so if the key were null, the call would throw a NullPointerException (HashMap special-cases the null key instead);
    • But in some cases we do want to store null keys/values; for example, a null value is useful to distinguish a key that you know exists but has no associated value from a key that doesn’t exist at all. (See the demo below.)
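
A quick demo of the difference (my own example; the behavior matches the JDK documentation):

import java.util.HashMap;
import java.util.Hashtable;
import java.util.Map;

public class NullKeyDemo {
    public static void main(String[] args) {
        Map<String, String> hashMap = new HashMap<>();
        hashMap.put(null, "null key is fine"); // HashMap special-cases the null key instead of calling hashCode() on it
        hashMap.put("k", null); // null values are fine too
        System.out.println(hashMap.get(null));

        Map<String, String> hashtable = new Hashtable<>();
        try {
            hashtable.put(null, "boom"); // Hashtable calls key.hashCode(), so a null key throws
        } catch (NullPointerException e) {
            System.out.println("Hashtable rejects null keys");
        }
        try {
            hashtable.put("k", null); // Hashtable also explicitly rejects null values
        } catch (NullPointerException e) {
            System.out.println("Hashtable rejects null values");
        }
    }
}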
  • Why are String, Integer, and the other wrapper classes considered good candidates for keys in a HashMap?
    • because they’re immutable.
    • HashMap uses the key’s hash code to locate the bucket; if the key is mutated after insertion, its hash code can change, and it becomes impossible to get the object back out of the HashMap (see the demo below).
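
A small demo of why mutable keys break lookups (MutableKey is a made-up class for illustration):

import java.util.HashMap;
import java.util.Map;

public class MutableKeyDemo {
    static class MutableKey {
        int id;
        MutableKey(int id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof MutableKey && ((MutableKey) o).id == id;
        }
        @Override public int hashCode() { return id; }
    }

    public static void main(String[] args) {
        Map<MutableKey, String> map = new HashMap<>();
        MutableKey key = new MutableKey(1);
        map.put(key, "value");
        key.id = 2; // mutating the key changes its hash code, but the entry stays in the bucket chosen at put() time
        System.out.println(map.get(key)); // null: the lookup probes the wrong bucket
        System.out.println(map.get(new MutableKey(1))); // also null: right bucket, but equals() no longer matches the mutated stored key
    }
}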
  • The contract between equals() and hashCode() is:
    • 1) If two objects are equal, then they must have the same hash code.
    • 2) If two objects have the same hash code, they may or may not be equal.
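
A minimal class honoring that contract (my own example): equals() compares exactly the fields that hashCode() uses, so equal objects are guaranteed to have equal hash codes (rule 1), while two unequal points can still happen to collide (rule 2).

import java.util.Objects;

public final class Point {
    private final int x;
    private final int y;

    public Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof Point)) return false;
        Point p = (Point) o;
        return x == p.x && y == p.y; // compares exactly the fields hashCode() uses
    }

    @Override
    public int hashCode() {
        return Objects.hash(x, y); // equal (x, y) pairs always hash the same; unequal pairs may still collide
    }
}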