Getting started with NoSQL for storing and retrieving data

How to
Share on FacebookShare on Google+Tweet about this on TwitterPin on PinterestShare on LinkedInEmail this to someone

As a data journalist, I have been working with increasingly large datasets as my confidence has grown in programming and creating visualizations. Through my learning process, I have realized the pitfalls of some programs.

Excel, for instance, is good for smaller datasets – which I’d define as under 10,000 rows or records. But if the dataset is larger than that, this trusty spreadsheet program can freeze up when trying to run queries. And if you have been making edits or designing queries, you run the risk of losing your work.

SQL is significantly more powerful and is a dominant database program used by back-end developers. But while SQL can process large datasets, if your dataset isn’t in perfect condition or has empty fields, you will have to spend time cleaning before importing your dataset. In recent years, NoSQL has become a viable – and attractive – alternative. This primer will explain why a NoSQL system might be right for your database needs.

What is a database?

A database is where data is stored, just like a refrigerator – with its different shelves and drawers – is where your food is stored.

Why not use a spreadsheet?

  1. It’s very convenient to use Excel for data analysis if you only need to handle thousands of rows and tables. But once you get to about 10,000 rows, you’ll need a database system to deal with gigantic datasets. (Excel can quickly freeze up.)
  2. If you are working for a company that has tens of thousands of Excel spreadsheets – on different computers, say – and the staff and customers need to see data in real-time, it’s difficult to harmonize your data, especially if many people are working on the same project.
  3. Spreadsheets are not ideal for working with multiple datasets in tandem.

Because of this, a database becomes more useful and efficient for running queries with breakneck speed. Furthermore, anyone with permission can directly retrieve or change the data.

What is SQL?

SQL is a language for querying relational databases. A relational database stores data in a table, just like an Excel table where the data is neatly separated into rows and columns.

Many data journalists and data scientists moved to relational database managers after finding spreadsheets too limited. One example where SQL might be right for you: having to join or query two separate spreadsheets. Querying is the act of asking questions of your data via a pivot table in Excel or a query in databases. Querying allows you to subtract the subset of data directly without interacting with other non-related data.

For example, imagine you have a database with two tables:

Fruit table

RecordNumber fruit count Rate of preference Reviewer Id
1 apple 4 5 3
2 Orange 10 2 2
3 banana 99 10 1

 

Reviewer table

Reviewer Id name Gender
1 Dan male
2 Megan Female
3 Andy male

 

We can write a simple SQL statement that directly returns all the fruit and quantities reviewed by males in this way:

SELECT the reviewer’s name, fruit name, the fruit count
FROM the Fruit table, Reviewer table
WHERE Fruit table. Reviewer’s Id = Reviewer table. Reviewer Id AND Gender = “male

 

= >

name fruit count
Dan banana 99
Andy apple 4

 

What’s so special about NoSQL?

With the rapid growth of data sizes and the need to use more sophisticated data models, there are some problems that cannot be solved with relational databases which, with more data, need more computing power.

NoSQL databases are a speedier alternative because, for one, you don’t need to join tables in NoSQL. Every piece of data is stored in a JSON format. And the limitation of relational databases is that each item can only contain one attribute. To illustrate this point, look at the above Fruit table and notice how each column is dedicated to just one measure or attribute. Thanks to javascript, NoSQL allows you to store data in a nested fashion.

NoSQL is also simple to use. You don’t have to deal with the “mismatch” between rows and columns. For example, storing all the reviewer’s information in one document as opposed to having to join countless tables. In that way, you write less code which, hopefully, means fewer errors.

Also, NoSQL can solve problems SQL cannot. Before putting data in SQL for data modeling, for example, a programmer has to consider the role that each table represents and also what the inside columns represent, as well as how to join each by determining a common element/column that will be used as a primary key. This is time-consuming because there are so many kinds of data sources. And after the data is put into the database, adding a new column into the table – for example, a column noting the kinds of fruit in the first table – or changing the characteristics of a column is very difficult. The NoSQL database reduces the burden of data modeling. For example, the first row in the Fruit table becomes the following JSON document:

{
Name of fruit: Banana,
Count:99,
Rate of preference:10,
Reviewer: {
  Name:Dan,
  Gender: male
  }
}

In this way, you can modify the appearance of the data model and add to the database quickly.

Using MongoDB to build a NoSQL database

Now, we are going to play around with some foundational NoSQL skills by using MongoDB, which can store data in JSON in a very flexible way, meaning you can change the data structure based on your needs. Also, MongoDB is free and open-source.

First, a note on operations. There are four operations in every database system: create, read, update, and delete. These four operations are referred to as CRUD.

The first step is to make sure you have installed MongoDB. If you haven’t yet, set aside some time, as it’s not a one-click operation. Visit the website and then follow the specific download instructions via the website.

Next, open your computer’s terminal program. Make sure to keep it running. When you get the message “Waiting for connections on port 27017,” you have successfully connected.

Next, open another terminal window, also known as a “bash,” and run Mongo by entering mongo. 

Read

In your second window, enter show dbs. That will show all the existing databases on your computer. I have five databases on my computer. In this example, I want to read the contents of my database called education-portal.

Type education-portal in the command line to use that database.

One key concept necessary for understanding NoSQL databases is the term “collections.” Collections are the same thing as tables in relational databases. Databases are made up of at least one collection – otherwise you’d have no data! – but usually have several collections.  

To see the collections in your database after you’ve navigated to it, type in show collections.

 

 

In this example, I have only one collection in my database. Next, let’s enter db.user.find( ) to show all of the data as objects. An object displays a property and its corresponding value with the following syntax: { “Name”;”Wei Tang” }

 

 

But the way the information is returned is quite unreadable. Let’s make this object more readable by entering db.user.find( ).pretty()

Create

If you want to insert a new object – like user information that’s stored in my example database – use this syntax:

My database has a collection named “user” instead of “users” like above, so I’m going to enter the following to add a user record using db.user.insertOne(). My new user is Lily Allen, who is 26 years old with the email address allenlily@gmail.com. Her username for the education portal is Lilyallen and her password is 1234564321.

 

Let’s check whether we inserted the new user successfully by entering the following command: db.user.find( )

And we made the new object successfully since it’s now listed in the terminal. High five!

Update

To update an existing user, use the following command: db.collection.updateOne(filter,update,options)

In this example, we want to update Lily’s last name from “Allen” to “Brown.”

Now, let’s see if we were successful.

 Indeed, Lily’s last name has been replaced by “Brown.” High five again!

 

Delete

You’re probably getting the hang of the syntax now, so let’s move on the method for deleting a user. Type your user details within the following syntax: db.collection.deleteOne( )

We are going to delete the person whose last name is Tang.

Again, we are going to use the command db.user.find( ) to find out if we were successful. By leaving the parentheses empty, the terminal will list all users entered into the user collection.

Using the find command, we see there is only one record in the collection.

And we did it. Another high five for completing this tutorial on using CRUD commands in MongoDB. Though the syntax may be different for other NoSQL databases, the differences from this example will be minor. Good luck on your NoSQL journey!

 

Wei Tang

Wei Tang is a graduate student in journalism at Northeastern University.

Leave a Reply