What is indexing? What does it do?

Stackoverflow doesn’t have very good answers about this topic since it’s too broad.

This article is helpful: http://www.programmerinterview.com/index.php/database-sql/what-is-an-index/

  • Basically, without an index, if you’d like to find one certain row(s), you’ll have to do a full table scan, this is literally really inefficient, like human eye check
  • An index is a data structure (most commonly a B-tree) that stores the values of a specific column in the table, i.e. in index consists of column values from the table
  • Why B-tree?
    • This is due to their efficiency: insertion/deletion/lookup can all be done in O(logn) time.
    • Another major reason is that the values stored in B-tree can be sorted
  • How does Hash table indexes work?
    • Imagine that we store one column value as the key in the hashtable, and the row data as the value of the hashtable, the lookup time is very fast. Because basically, Hashtable is like an “associative array”.
  •  The disadvantages of Hashtable working as index:
    • Hashtable keys are not sorted, it only maintains a mapping between key and value, so it’s good for fast lookup, but it’s not good for queries such to find how many employees that are younger than 25 years old in a table.
  • Good analogy of database index:
    • It’s like the index of a book: if you’d like to find the Chapter describing Python decorators, you could either flip through the pages or just go to the index page where that Chapter is listed, also a page number of that chapter is listed as well. Apparently, using the index page is a lot faster.
  • What’s the cost of having a database index?
    • It takes up space, the larger your table is, the larger your index is.
    • Another performance hit is that whenever you do CRUD to your table, the same operations will have to be done to your index
  • As a general rule, an index should only be created, if the data on the indexed column will be queried frequently.

Inner Join/Left Join/Right Join/Full Join

The INNER JOIN keyword selects all rows from both tables as long as there is a match between the columns in both tables.



The LEFT JOIN keyword returns ALL rows from the left table (table1), with the matching rows in the right table (table2). The result is NULL in the right side when there is no match.


The RIGHT JOIN keyword returns all rows from the right table (table2), with the matching rows in the left table (table1). The result is NULL in the left side when there is no match.


The FULL OUTER JOIN keyword returns all rows from the left table (table1) and from the right table (table2).

The FULL OUTER JOIN keyword combines the result of both LEFT and RIGHT joins.



The SQL UNION operator combines the result of two or more SELECT statements.
The UNION operator is used to combine the result-set of two or more SELECT statements.
Notice that each SELECT statement within the UNION must have the same number of columns.


SQL Aggregate Functions
SQL aggregate functions return a single value, calculated from values in a column.

Useful aggregate functions:

AVG() – Returns the average value
COUNT() – Returns the number of rows
FIRST() – Returns the first value
LAST() – Returns the last value
MAX() – Returns the largest value
MIN() – Returns the smallest value
SUM() – Returns the sum

Sharding/Partition/Horizontal scaling

  • RDBMS:
    • This model organizes data into one or more tables (or “relations“) of columns and rows, with a unique key identifying each row. Rows are also called records or tuples.
  • In computer science, ACID (Atomicity, Consistency, Isolation, Durability) is a set of properties that guarantee that database transactions are processed reliably.
  • Sharding is a type of database partitioning that separates very large databases the into smaller, faster, more easily managed parts called data shards. The word shard means a small part of a whole.
    • So each shard is just one small portion of the big database table, but each table has the same structure, it’s just that their rows/entries are different, this definitely improves search performance, since the amount in each shard is smaller.
    • The advantages are:
      • High availability. If one box goes down the others still operate.
      • Faster queries. Smaller amounts of data in each user group mean faster querying.
      • More write bandwidth. With no master database serializing writes you can write in parallel which increases your write throughput. Writing is major bottleneck for many websites.
      • You can do more work. A parallel backend means you can do more work simultaneously. You can handle higher user loads, especially when writing data
  • Difference between NoSQL and RDBMS?
    • SQL databases are table based databases whereas NoSQL databases are document based, key-value pairs, graph databases or wide-column stores.
      • This means that SQL databases represent data in form of tables which consists of n number of rows of data whereas NoSQL databases are the collection of key-value pair, documents, graph databases or wide-column stores which do not have standard schema definitions which it needs to adhered to.
    • SQL databases have predefined schema whereas NoSQL databases have dynamic schema for unstructured data.
    • SQL databases are vertically scalable whereas the NoSQL databases are horizontally scalable.
      • SQL databases are scaled by increasing the horse-power of the hardware.
      • NoSQL databases are scaled by increasing the databases servers in the pool of resources to reduce the load.
    • To scale horizontally (DynamoDB) (or scale out/in) means to add more nodes to (or remove nodes from) a system, such as adding a new computer to a distributed software application. An example might involve scaling out from one Web server system to three.
    • To scale vertically (MySQL) (or scale up/down) means to add resources to (or remove resources from) a single node in a system, typically involving the addition of CPUs or memory to a single computer.
    • In a database world horizontal-scaling is often based on partitioning of the data i.e. each node contains only part of the data , in vertical-scaling the data resides on a single node and scaling is done through multi-core i.e. spreading the load between the CPU and RAM resources of that machine.
    • With horizontal-scaling it is often easier to scale dynamically by adding more machines into the existing pool – Vertical-scaling is often limited to the capacity of a single machine, scaling beyond that capacity often involves downtime and comes with an upper limit.
    • A better illustration is here: https://medium.com/@jeeyoungk/how-sharding-works-b4dec46b3f6#.ocmqofqlx
      • vertical partitioning: store different tables & columns into separate databases;
      • horizontal partitioning: storing rows of a same table into multiple database nodes.
    • A good example for horizontal scaling is Cassandra , MongoDB .. and a good example for vertical scaling is MySQL – Amazon RDS (The cloud version of MySQL) provides an easy way to scale vertically by switching from small to bigger machines this process often involves downtime.
  • Then naturally, I had a question, how query works in a sharded database world? since the entry to be searched could be stored in any one of the shards?
    • This short slides: http://www.slideshare.net/mongodb/how-queries-work-with-sharding well illustrates it.
    • Different shards could share the same shard key, if the query specifies the shard key in the query, then we could directly go to that node and search.
    • If the query doesn’t contain the common shard key, then multiple/all nodes could process the query in parallel, so, still much faster than a single monolithic database server as the traditional ways.
  • What does “index mean and why do we need it?
    • from https://en.wikipedia.org/wiki/Database_index: A database index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional writes and storage space to maintain the index data structure. Indexes are used to quickly locate data without having to search every row in a database table every time a database table is accessed.
    • In my own words, it’s basically creating a mapping, like for each folder/document/entry, we create one index for it, so next time, when clients query something about this folder/document/entry, we can quickly find its index and thus quickly return this folder/document/entry.


After connected to your MySQL server via:

$mysql -u root -p

You need to select a database, note this is different from tables, each database could contain multiple tables.

$use TUTORIALS //this means to use TUTORIALS database

, then we can create/update/read/delete tables inside this database.

In MySQL prompt, it does NOT terminate until you give it a semicolon, this means you could keep hitting carriage return to make your commands more readable.


Followed this link:

created a table tutorials_tbl: http://www.tutorialspoint.com/mysql/mysql-create-tables.htm

inserted some entries: http://www.tutorialspoint.com/mysql/mysql-insert-query.htm

(if it’s a literal string that you want to insert into the table, you should use double quotes or single quotes.)

If SQL LIKE clause is used along with % characters, then it will work like a meta character (*) in UNIX while listing out all the files or directories at command prompt.


mysql> SELECT * FROM tutorials_tbl WHERE tutorial_author LIKE "%Yan";


Remember to have comma separate the items:

SELECT t.tutorial_title, e.employer_name FROM tutorials_tbl t, employment_tbl e WHERE t.tutorial_author = e.employee_name;

otherwise it’ll complain syntax error.

LEFT/RIGHT/FULL/INNER JOIN must use ON clause! Don’t use WHERE, it won’t work with WHERE.



  1. open one terminal:
  2. open another terminal:
    $mysql -u root -p
    1. the second terminal will be my working prompt to interact with SQL server.
  3. in the end when I’m done with work, do
    $ps -ef | grep mysqld 

    and then

    $kill -9 MYSQL_PROCESS_ID

    to stop the mysql daemon.




I used brew to have installed MySQL on my Mac 10.11.5 (15F34).

$mysql –version
$mysql Ver 14.14 Distrib 5.7.12, for osx10.11 (x86_64) using EditLine wrapper

$mysql.server start
$Starting MySQL

Also, I can log in via:

$mysql -u root -p
Enter password:
Welcome to the MySQL monitor. Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.7.12 Homebrew

Copyright (c) 2000, 2016, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective

Type ‘help;’ or ‘\h’ for help. Type ‘\c’ to clear the current input statement.


from the above prompt, I typed:

mysql> SELECT user, host FROM mysql.user;
| user | host |
| mysql.sys | localhost |
| root | localhost |
2 rows in set (0.00 sec)


mysql> show databases;
| Database |
| information_schema |
| mysql |
| performance_schema |
| sys |
4 rows in set (0.01 sec)


a useful command to check if my mysql server is running or not.

ps -ef | grep mysqld


Got a problem that didn’t resolve today:
When I tried to start mysql, it kept saying:

$ERROR 1045 (28000): Access denied for user ‘SteveSun’@’localhost’ (using password: NO)

also tried

$mysql -u root
$ERROR 1045 (28000): Access denied for user ‘root’@’localhost’ (using password: NO)

The closest post I can find is: http://stackoverflow.com/questions/10299148/mysql-error-1045-28000-access-denied-for-user-billlocalhost-using-passw
but no luck.
Since I could use mysql via $mysql -u root -p, I won’t spend more time here.

Connect MySQL with PhP, this post is helpful: http://jason.pureconcepts.net/2015/10/install-apache-php-mysql-mac-os-x-el-capitan/