File and Data Structures

  • Key names are stored in the BSON doc, so make sure key names are short.
  • Data files are pre allocated, doubling in size each time.
  • Files are accessed using memory mapped files at the OS level.
  • fsync’d every 60 seconds.
  • Should use 64bit systems to support the memory mapped files.

Journalling in 1.8 default in 2.0+

  • Write-ahead log
  • Ops written to journal before memory mapped regions
  • Journal flushed every 100ms or 100 MB written
  • db.getLastError({j:true}) to force a journal flush
  • /journal sub directory
  • 1 GB files, rotated ( only really need stuff that has not been fsync’d )
  • some slowdown on high throughput systems, can be symlink’d to other drives.
  • on by default in 64 systems.

When to use

  • If Single node
  • Replica Set – at least 1 node
  • For large data sets.


  • files get fragmented over time if docs sizes change or deletes
  • collections that have a lot of resizes get padding factor to help reduce fragmentation.
  • need to improve free list
    • 2.0 reduced scanning to reasonable amount
    • 2.2 will change


  • 2.0+ compact command

    • only needs 2 GB extra space
    • off line operation (another good reason with replica sets.)
  • safemode: waits for a round trip from using getLastError, with this call you can specify how safe you want the data to be.

  • drop collection doesn’t free the data file, dropDatabase does. Sometimes it makes sense to create and drop databases.

Index and Query Evaluation


  • indexes are lists of values associated with documents
  • stored in a btree
  • required for geo queries and unique constraints
  • assending/descending really only matter on compound indexes.
  • null == null for unique indexes, you can drop duplicates on create.
  • create index is blocking, unless {background: true}, still should try to do off peak
  • when dropping an index, you need to use the same document when created.
  • the $where operator doesn’t use the indexes
  • Regexp’s starting with /^ will use an index
  • Indexes are used for updates and deletes
  • Compound indexes can be used to query for the first field and sort on the second
  • Only uses one index at a time
  • Limited index uses:
    • $ne uses the index, but doesn’t help performance much
    • $not
    • $where
    • $mod index only limits to numbers
    • Range queries only help some.


  • created using “2d”
  • $near

    • sorted nearest to farthest
  • $within

  • $within{$polygon}}
  • can be in compound queries

Sparse Indexes

  • only store values w/ the indexed field, results won’t have documents w/ null in that field.
  • can be sparse & unique

Covering Indexes

  • contains all fields in the query and the results, no db lookup

Limits and Trade offs

  • max of 64
  • can slow down inserts and updates
  • compound index can be more valuable and handle multiple queries
  • You can force an index or full scan
  • use sort if you really want sorted data
  • db.c.find(…).explain() => see whats going on.
  • db.setProfilingLevel() – record slow queries.
  • Indexes work best when they fit in RAM

Replica Set

  • One is always the primary others are secondary.
  • Chosen by election
  • Automatic fail-over and recover
  • Reads can be from primary or secondary
  • Writes will always go to primary
  • Replica Sets are 2+ nodes, at least 3 is better
  • When a failed nodes come back, they recover by getting the missed updates, then join as a secondary node
  • Setup

    mongod —replSet

    cfg = { _id:, members: [{_id:0, host:‘’}] }

    use admin


  • rs objects has replica set commands, needs to be issued on the current primary

  • rs.status()
  • Strong Consistency is only available when reading from primary
  • Reads on the secondary machines will be eventually consistent
  • Durability Options (set by driver)

    • fire and forget

      • won’t know about failures due to unique constraints, disk full, or anything else.
    • wait for error recommended

    • wait for journal sync
    • wait for fsync (slow)
    • wait for replication (really slow)
  • Can give nodes priorities, which will help ensure a specific machine is primary.

    • 0 priorities will never be primary
    • when a higher priority machine is back online, it will force an election.
  • Can have a slave delay

  • Tag replica sets with properties and can specify when waiting.
  • Arbiters Member

    • Don’t have data
    • vote in elections
    • used to break a tie
  • Hidden Member

    • not seen by the clients
  • Data is stored for replication in an oplog capped collection, all secondaries have an oplog too.


  • Vertical Scaling is limited
  • Horizontal scaling is cheaper, can scale wider then higher
  • Vertical can be a single point a failure, can be hard to backup/maintain.

  • Replica Sets are one type

    • Can scale reads, but now writes, eventual consistency is the biggest downside.
    • Replication can overwhelm the secondaries, reducing performance anyway
  • Why Shard?

    • Distribute the write load
    • Keep working set in RAM, by using multiple machines act like one big virtual machine
    • Consistent reads
    • Preserve functionality, by range based portioning most(all?) the query operators are available.
  • Sharding design goals

    • scale linearly
    • increase capacity with no downtime
    • transparent to the application / clients
    • low administration to add capacity
    • no joins or transactions
    • BigTable / PNUTS inspired read the PNUTS paper
  • Basics

    • Choose how you partition data
    • Convert from a single replica set to sharding with no downtime
    • Full feature set
    • Fully consistent by default
    • You pick a shard key, which is used to move ranges of data to a shard


  • Shard – each shard is it’s own replica set for automated fail over
  • Config Servers – store the meta data about where the partitions of data is, which shard

    • Not a replica set, writes to the config server is done with a transaction by the mongod / mognos
  • Mongos – uses the config servers to know what shard to use for the data/query

    • Client talks to the mongos servers
    • chunk = collection minkey, maxkey, shard
      • chunks are logical, not physical
      • chunk is 64MB, once you hit this point a new split happens, a new shard is created and data is moved.

shard keys

  • they are immutable.
  • Choose a key
    • _id? is incremental this results in all writes going to one query
    • hash? is random, this partitions well, but now great for queries
    • user_id? kinda random, useful for lookups, all data for user_id X will be on one shard

      • However you can’t split on this if one user is a really heavy user.
    • user_id + md5(x)? this is the best option.

Other notes

  • Want to add capacity way before it’s needed, at least before 70% operation capacity, this allows the data to migrate over time
  • Understand working set in RAM
  • Machine too small and admin overhead goes up
  • Machine too big and sharding doesn’t happen smoothly

MMS – MongoDB Monitoring Service