Market Data Basics

Market data can be loosely defined as all the information disseminated electronically by a financial institution: broker, exchange or counterparty.

In industry parlance, market data can be divided into two major groups with regard to its nature:

  1. Tick Data is primarily the information about orders in the matching engine of an exchange or broker. It accounts for the bulk of the data disseminated by an exchange and, in most cases, is current for only a very short moment because it is updated so frequently.

  2. Reference Data, also called metadata by some, is all the information about the underlying financial instrument to which tick data refers. In that sense, reference data complements tick data with the necessary economic meaning. Reference data tends to be mostly static or infrequently updated, most often daily. Examples are symbol, product codes, instrument nature, primary exchange, trading hours, dividends, closing prices and stock splits; a small illustrative record follows.
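
For illustration, a reference data record for a single instrument might look like the sketch below; the field names and values are assumptions, not a prescribed schema.

# Illustrative sketch of a reference data record for one instrument.
# Field names and values are assumptions, not an actual exchange schema.
reference_data = {
    "symbol": "ESZ4",                   # exchange ticker
    "product_code": "ES",               # product family
    "instrument_type": "future",        # nature of the instrument
    "primary_exchange": "CME",
    "trading_hours": "17:00-16:00 CT",
    "previous_close": 5210.00,
    "dividends": [],                    # populated for equities
    "splits": [],                       # stock splits, equities only
}
print(reference_data["symbol"], reference_data["primary_exchange"])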

Market data can also be classified with respect to its contemporaneity.

  1. Data collected in the past and stored for analysis is called historical market data.

  2. Data that has just been streamed and represents the latest, current state of the exchange’s matching engine is called realtime market data.

In that sense, you can have historical tick data, realtime tick data, historical reference data and realtime reference data.


Tick Data

The purpose of tick data is to present the client with sufficient information to reconstruct the state of the exchange’s matching engine.

Exchanges build up the set of securities to be traded and store this information in a database that can be queried through several methods, typically an FTP download; it is also disseminated in realtime through the market data gateway directly to listening clients.

Upon market open, exchanges receive orders from clients and forward them to the matching engine. If no trade happens immediately, the order is recorded and an update is forwarded to the market data gateway, which formats a message to be sent to all market participants with information about the newly added liquidity.

digraph {
    rankdir = LR;
    subgraph cluster_1 {
        MatchingEngine [shape=hexagon,label="Matching\nEngine"]
        MDGateway [label="Market Data\nGateway"]
        OrderGateway [label="Order\nGateway"]
        Database [shape=cylinder]
        Database -> MDGateway [label="Securities\nInformation"]
    }
    Client [label="Trading Firm"]
    MatchingEngineMirror [shape=hexagon,label="Matching\nEngine\n(mirror)",style=dashed]
    Client -> MatchingEngineMirror [arrowhead=dot,label=replicate]
    MatchingEngine -> MDGateway [ label=updates  ]
    OrderGateway -> MatchingEngine [ label=orders ]
    MDGateway -> Client [ label="tick data\nupdates" ]
    MDGateway -> Client [ label="reference data\nupdates" ]
    Client -> OrderGateway [ label="place\norders" ]
    OrderGateway -> Client [ label="ack orders\nand trades"]
}

Tick Data Purpose

Notice that only incremental updates are sent. It is therefore essential that the trading firm (client) runs software able to reconstruct the complete state of the matching engine inside the exchange from these small pieces of information.

Exchanges typically provide snapshot services that disseminate the entire state of the matching engine, but these are far too slow to be used continuously.

That’s where LightQR shines, keeping a perfect reproduction of the exchange’s matching engine on the client’s side. Our software updates this mirror image in a few nanoseconds even when a large number of symbols (hundreds of thousands) is maintained, which translates into lower IT costs and superior trading performance.

Tick data can be disseminated in several forms:

  1. Level 1. The data carries the best bid and ask (buy and sell) prices, sometimes accompanied by the respective quantity of contracts (or shares) and, infrequently, by the number of orders at those prices. The volume of data tends to be very low.

  2. Level 2. A summary of all orders at each price: the price level. Price levels contain price, quantity and sometimes the number of orders. Some exchanges (notably CME) limit the number of levels to a predefined depth to avoid excessive network traffic (a minimal reconstruction sketch follows this list).

  3. Level 3. In this mode the individual orders at every price level are disseminated. This is the most voluminous form of data.
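
To make the reconstruction idea concrete, here is a minimal sketch of a Level 2 book maintained from incremental price-level updates. It is not the LightQR implementation, and the update fields (side, price, quantity) are assumed purely for illustration.

# Minimal sketch of a Level 2 book rebuilt from incremental updates.
# The update format (side, price, quantity) is an assumption for illustration;
# real exchange feeds use their own message layouts.
class Level2Book:
    def __init__(self):
        self.bids = {}  # price -> aggregate quantity
        self.asks = {}  # price -> aggregate quantity

    def apply(self, side, price, quantity):
        book = self.bids if side == "buy" else self.asks
        if quantity == 0:
            book.pop(price, None)   # price level removed
        else:
            book[price] = quantity  # price level added or changed

    def best_bid(self):
        return max(self.bids) if self.bids else None

    def best_ask(self):
        return min(self.asks) if self.asks else None

book = Level2Book()
book.apply("buy", 100.25, 500)   # new bid level
book.apply("sell", 100.50, 200)  # new ask level
book.apply("buy", 100.25, 0)     # bid level cancelled
print(book.best_bid(), book.best_ask())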

Historical Tick Data

Our suite parses and transforms raw market data as provided by the exchange or captured by common appliances in pcap or ERF formats.

When provided with appropriate hardware, LightQR Capture will activate and record data with picosecond-resolution timestamps.

LightQR Synchronize will then filter, pair, realign, reformat and latency-adjust the data for archival.

digraph capture {
rankdir = LR;
Exchange [shape=hexagon]
TpCapture [shape=ellipse,label="Third Party\nCapture"]
Capture [shape=ellipse,label="LightQR\nCapture"]
Exchange -> TpCapture;
CapFormats [shape=cylinder,label="Pcap\nPcap-ng\nERF"]
Pcap [shape=cylinder]
TpCapture -> CapFormats;
Exchange -> Capture;
Capture -> Pcap;
Synch [label="LightQR\nSynchronize"]
CapFormats -> Synch;
Pcap -> Synch;
Cloud [shape=folder]
OnPremises [shape=folder,label="On Premises"]
Synch -> { Cloud, OnPremises }
}

Tick Data Capture Process
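
As a simplified illustration of the first parsing step, the sketch below iterates over a pcap capture and extracts the raw UDP payloads. It is a generic sketch, not LightQR Capture; it assumes the third-party dpkt library, a UDP-based feed and an illustrative file name.

# Generic sketch: walk a pcap capture of UDP market data and pull out payloads.
# Assumes the third-party dpkt library; the file name is illustrative.
import dpkt

with open("capture_session1.pcap", "rb") as f:
    for timestamp, frame in dpkt.pcap.Reader(f):
        eth = dpkt.ethernet.Ethernet(frame)
        ip = eth.data
        if not isinstance(ip, dpkt.ip.IP):
            continue
        udp = ip.data
        if not isinstance(udp, dpkt.udp.UDP):
            continue
        payload = bytes(udp.data)   # raw exchange message, decoded downstream
        print(timestamp, len(payload))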

From that point, the data can then be post-processed and pushed to various formats and databases.

digraph postprocess {
rankdir = LR;
Cloud [shape=folder]
OnPremises [shape=folder,label="On Premises"]
TickLoader [shape=ellipse,label="LightQR\nTickLoader"]
Cloud -> TickLoader
OnPremises -> TickLoader
KDB [shape=cylinder]
HDF5 [shape=cylinder]
MongoDB [shape=cylinder]
SQL [shape=cylinder]
OneTick [shape=cylinder]
CSV [shape=cylinder]
TickLoader -> KDB
TickLoader -> HDF5
TickLoader -> MongoDB
TickLoader -> SQL
TickLoader -> OneTick
TickLoader -> CSV
Backtesting [shape=ellipse,label="Backtesting\nResearch\nC++\nPython\nJava"]
KDB -> Backtesting
HDF5 -> Backtesting
MongoDB -> Backtesting
SQL -> Backtesting
OneTick -> Backtesting
CSV -> Backtesting
}

Tick Data Post-Process
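
As a simplified illustration of this step, the sketch below writes already-normalized ticks to CSV and HDF5 with pandas. It is illustrative only, not LightQR TickLoader; the column names, values and file paths are assumptions.

# Sketch: push normalized ticks to CSV and HDF5 using pandas.
# Column names, values and file paths are illustrative assumptions.
import pandas as pd

ticks = pd.DataFrame({
    "timestamp": pd.to_datetime(["2024-03-01 09:30:00.000001",
                                 "2024-03-01 09:30:00.000005"]),
    "symbol": ["ESZ4", "ESZ4"],
    "side": ["buy", "sell"],
    "price": [5210.25, 5210.50],
    "quantity": [3, 1],
})

ticks.to_csv("ticks_2024-03-01.csv", index=False)
# HDF5 output needs the optional PyTables dependency installed.
ticks.to_hdf("ticks_2024-03-01.h5", key="ticks", mode="w")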

For strategy backtesting, LightQR can also read from raw files and drive your backtesting simulation exactly as if it were running in the colo.

digraph backtesting {
rankdir = LR;
Cloud [shape=folder]
OnPremises [shape=folder,label="On Premises"]
TickReplay [shape=ellipse,label="LightQR\nTickReplay"]
Backtesting [shape=ellipse,label="Backtesting\nC++\nPython\nJava"]
Cloud -> TickReplay
OnPremises -> TickReplay
TickReplay -> Backtesting
}

Tick Data Historical Replay

Datasets can contain multiple days from multiple exchanges allowing for evaluation of cross-exchange arbitrage strategies.

Latency (constant or stochastic) can be artificially added during replay to study different placements of the trading application across different datacenters.

Data replay can happen in several modes:

  1. Backtest mode. Data is pushed into the client application as fast as it can be read from the source, regardless of timestamps. This is optimal for PnL research.

  2. Replay mode. Data is pushed at the speed it was captured from the exchange. This allows for proper evaluation of internal latencies in your application.

  3. Stress mode. Speed is multiplied by a configurable factor, which allows the evaluation of extreme, doomsday scenarios (e.g. five times peak volume); see the pacing sketch below.
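
The pacing behind these three modes can be sketched as follows. This is illustrative only: the tick source, speed factor and latency parameters are assumptions, not LightQR TickReplay options.

# Sketch of replay pacing: backtest (as fast as possible), replay (captured
# speed) and stress (captured speed multiplied by a factor). Constant or
# stochastic latency can be injected to model a different placement.
import random
import time

def replay(ticks, mode="replay", speed=1.0, extra_latency_s=0.0, jitter_s=0.0):
    prev_ts = None
    for ts, message in ticks:                    # ts in seconds since epoch
        if mode != "backtest" and prev_ts is not None:
            factor = speed if mode == "stress" else 1.0
            time.sleep(max(0.0, (ts - prev_ts) / factor))
        prev_ts = ts
        delay = extra_latency_s + random.uniform(0.0, jitter_s)
        yield ts + delay, message                # deliver with injected latency

for ts, msg in replay([(0.0, "add"), (0.5, "trade")], mode="stress", speed=5.0):
    print(ts, msg)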

Realtime Tick Data

Realtime data is the most current data as published by the exchange. As the number of updates tends to be very large, realtime data requires modern hardware and capable software to cope with the enormous volume.

Most commercial packages have at least one of these shortcomings:

  1. Provide only an API to translate exchange messages and nothing else. The user is then left with the task of writing everything else: building the application, publishing internally, filtering, conflating.

  2. Provide only an appliance and an API to read market data, limiting the options for consuming the data (in process, for example).

  3. Are too expensive.

  4. Have limited coverage and long turnaround times.

Our realtime solution delivers tick data to the user’s application on four tiers:

  1. Tier 0. In process. The user links against our API and uses LightQR as if it were their own code. Users can choose between running LightQR serially or independently, in a separate thread.

  2. Tier 1. Same machine. LightQR runs as a separate process on the same box, using fast inter-process communication mechanisms (shared memory) to deliver data.

  3. Tier 2. Same network. LightQR runs as a separate process, publishing data through TCP or UDP.

  4. Tier 3. Cross datacenters. Similar to the previous option but the data is conflated and funnelled prior to distribution.

Our motto is: good market data is delivered market data. Our pledge is that we will deliver market data where your application is, in the format you need.

digraph realtime_tier0 {
rankdir = LR;
node [shape=record];
Exchange [shape=hexagon]
Process [label="<cap>LightQR\nCapture|<api>LightQR\nAPI|Business\nLogic"]
Exchange -> Process:cap
}

Tier 0: In-Process Consumption
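
To give a feel for Tier 0 consumption, the sketch below shows the general in-process, callback-driven pattern. The class and callback names are hypothetical and do not represent the actual LightQR API.

# Hypothetical sketch of the Tier 0 (in-process) pattern: the feed handler and
# the business logic live in the same process. InProcessFeed is invented for
# illustration and is not the real LightQR API.
import threading

class InProcessFeed:
    def __init__(self, on_update):
        self.on_update = on_update

    def poll(self):
        # A real integration would decode the next exchange packet here;
        # this fabricated update just keeps the sketch runnable.
        self.on_update("ESZ4", side="buy", price=5210.25, quantity=3)

def strategy(symbol, side, price, quantity):
    print("book update:", symbol, side, price, quantity)

feed = InProcessFeed(on_update=strategy)
feed.poll()                                    # serial: user drives the loop
threading.Thread(target=feed.poll).start()     # or run the feed in its own thread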

digraph realtime_tier1 {
rankdir = LR;
node [shape=record];
Exchange [shape=hexagon]
subgraph cluster_1 {
    rankdir = LR;
    // style=filled
    // color=lightgrey
    label="Client Machine"
    Capture [label="<cap>LightQR\nCapture|<pub>LightQR\nPublisher"]
    Storage [label="Shared Memory\nUnix Sockets",shape=cylinder]
    Process1 [label="<api>LightQR\nAPI|Business\nLogic"]
    Process2 [label="<api>LightQR\nAPI|Business\nLogic"]
    Process3 [label="<api>LightQR\nAPI|Business\nLogic"]
}
Exchange -> Capture:cap
Capture:pub -> Storage
Storage -> Process1:api
Storage -> Process2:api
Storage -> Process3:api
}

Tier 1: Inter-Process Sourcing
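
The same-machine delivery of Tier 1 relies on fast inter-process communication. The sketch below illustrates the shared memory idea in general terms only; it is not LightQR's actual protocol, and the segment name and record layout are assumptions.

# Generic sketch of delivering one update through shared memory (Tier 1 idea).
# Segment name and record layout (double price, 64-bit quantity) are assumptions.
import struct
from multiprocessing import shared_memory

# Publisher side: write one (price, quantity) record into a named segment.
shm = shared_memory.SharedMemory(name="md_demo", create=True, size=16)
struct.pack_into("<dq", shm.buf, 0, 5210.25, 3)

# Consumer side (normally a separate process): attach and read the record.
view = shared_memory.SharedMemory(name="md_demo")
price, quantity = struct.unpack_from("<dq", view.buf, 0)
print(price, quantity)

view.close()
shm.close()
shm.unlink()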

digraph realtime_tier2 {
rankdir = LR;
splines=line;
node [shape=record];
Exchange [shape=hexagon]
subgraph cluster_1 {
    rankdir = LR;
    label="Capture Box"
    Capture [label="<cap>LightQR\nCapture|<pub>LightQR\nPublisher"]
}
subgraph cluster_2 {
    rankdir = LR;
    label="Client Machine"
    Process1 [label="<api>LightQR\nAPI|Business\nLogic"]
    Process2 [label="<api>LightQR\nAPI|Business\nLogic"]
    Process3 [label="<api>LightQR\nAPI|Business\nLogic"]
}
Exchange -> Capture:cap
Capture:pub -> Process1:api [label="TCP/UDP",style=dashed]
Capture:pub -> Process2:api [style=dashed]
Capture:pub -> Process3:api [style=dashed]
}

Tier 2: Same Network Publishing
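
Tier 2 delivery over the local network typically means subscribing to a TCP or UDP stream. The sketch below shows a generic UDP multicast subscriber; the group address, port and payload handling are assumptions, not LightQR specifics.

# Generic sketch of a Tier 2 style consumer: join a multicast group and
# receive published updates over UDP. Group address and port are illustrative.
import socket
import struct

GROUP, PORT = "239.1.1.1", 15000

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
sock.bind(("", PORT))
mreq = struct.pack("4sl", socket.inet_aton(GROUP), socket.INADDR_ANY)
sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)

while True:
    payload, sender = sock.recvfrom(65535)   # one exchange update per datagram
    print(sender, len(payload))              # decode and hand to business logic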

digraph realtime_tier3 {
rankdir = LR;
splines=line;
node [shape=record];
Exchange [shape=hexagon]
subgraph cluster_1 {
    rankdir = LR;
    label="Capture Box\nLocal Datacenter"
    Capture [label="<cap>LightQR\nCapture|LightQR\nConflation|<pub>LightQR\nPublisher"]
}
Exchange -> Capture:cap
subgraph cluster_2 {
    rankdir = LR;
    label="Client Box\nRemote Datacenter"
    Process1 [label="<api>LightQR\nAPI|Business\nLogic"]
}
Capture:pub -> Process1:api [label="Conflated Data\nOver TCP",style=dashed]
}

Tier 3: Cross Datacenter Publishing
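
Conflation, the key ingredient of Tier 3, can be illustrated with a small sketch: between sends over the WAN, only the most recent update per symbol is kept, so bursts are funnelled into a bounded amount of traffic. The update contents below are assumptions.

# Sketch of conflation for cross-datacenter delivery: between flushes, keep
# only the latest update per symbol so WAN traffic stays bounded.
pending = {}

def on_update(symbol, update):
    pending[symbol] = update           # a newer update replaces the older one

def flush():
    batch = list(pending.items())      # at most one message per symbol
    pending.clear()
    return batch

on_update("ESZ4", {"bid": 5210.25, "ask": 5210.50})
on_update("ESZ4", {"bid": 5210.50, "ask": 5210.75})   # supersedes the previous
on_update("NQZ4", {"bid": 18100.00, "ask": 18100.50})
print(flush())                         # two conflated updates, not three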

Reference Data

(coming soon)

Historical Reference Data

Realtime Reference Data


Proceed to: Tick Data or Reference Data