On Large-scale SIEM Architecture

By Anton Chuvakin | July 25, 2012 | 11 Comments

SIEM, security, logging

How would YOU architect a SIEM deployment for this FICTITIOUS (but real-world-inspired …) large corporate environment:

  • About 30,000 events/second ongoing rate (this is NOT a peak rate, but a rate measured and then averaged over the course of 24 hours)
  • 15 separate sites, most in US but some in Europe and Asia; a few datacenters and a few regional offices (large and small)
  • Log source mix is a diverse blend of firewalls, network devices, NIPS, Windows servers, Unix/Linux servers, web proxies, and also web servers and select database servers
  • Retention policy is 30 days for log data used for operational security analysis and 1 year for searchable full archives
  • The use case is a combination of near-real-time monitoring (via correlation rules and whatever other analytic features the SIEM has) AND incident investigations (via searches, reports and whatever stored-data analytics the SIEM has) + maybe some compliance reports to spice it up 🙂 The monitoring efforts will focus on outside attackers and malware as well as possible insider abuse.
  • A few analysts will be using the tool simultaneously most of the time.
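
To make the numbers above concrete, here is a rough back-of-the-envelope sizing sketch. The event rate and retention periods come from the scenario; the average stored event size and the archive compression ratio are assumptions made up for illustration, and a real deployment would measure both per log source.

```python
# Back-of-the-envelope storage sizing for the scenario above.
EPS = 30_000               # average events/second (from the scenario)
SECONDS_PER_DAY = 86_400
AVG_EVENT_BYTES = 500      # ASSUMPTION: average stored size of one event
COMPRESSION_RATIO = 10     # ASSUMPTION: 10:1 compression on the archive tier


def raw_bytes(days: int) -> int:
    """Uncompressed bytes accumulated over the given number of days."""
    return EPS * SECONDS_PER_DAY * AVG_EVENT_BYTES * days


def to_tb(n_bytes: float) -> float:
    """Convert bytes to decimal terabytes."""
    return n_bytes / 1e12


events_per_day = EPS * SECONDS_PER_DAY
hot_tb = to_tb(raw_bytes(30))                            # 30-day operational tier
archive_tb = to_tb(raw_bytes(365) / COMPRESSION_RATIO)   # 1-year searchable archive

print(f"events/day:       {events_per_day:,}")
print(f"30-day hot tier:  {hot_tb:.2f} TB (uncompressed)")
print(f"1-year archive:   {archive_tb:.2f} TB (compressed)")
```

With these assumed values, the raw volume is about 1.3 TB per day, the 30-day tier is about 39 TB, and the compressed one-year archive is about 47 TB. Everything scales linearly with the assumed event size, which is one reason the per-source log mix matters.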

So, here is a mental exercise for you:

  • How would you architect it?
  • Where would you place the collectors? How many?
  • How would you plan storage – single or distributed?
  • Where will the main correlation system (or systems) be deployed?
  • How will the system deal with outages in collection and maybe even storage?
  • How will its performance be tracked over time?
  • How will it scale with increasing volumes (logs tend to grow)?
  • What OTHER information will you need to architect it?

By the way, if you tell me that one appliance will handle the entire environment and no other software/hardware will be needed, a filter will be implemented to send further communication to /dev/null 🙂

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.

11 Comments

  • As with all of these types of requirements, the real information comes out with probing around the environment and the project team. The immediate things that spring to my mind include (not in any particular order):

    1. Why? What is the key driver – incident management, compliance, etc.?
    2. What are the event rates for the various key sites/data centers?
    3. Are events sourced from management platforms (and where are those located) or directly from devices?
    4. Is security operations centralised or regionalised?
    5. What’s the business impact of an outage in event feed, real-time display, or reporting (= what level of resilience do you want to create)?

    The list typically grows based on customer responses 🙂

  • Thanks a lot for the comment. I sort of left the “why” out of this blog post, but you were correct to bring it back in. A few times, however, the customer will simply redirect the ‘why?’ question back to the consultant, often with disastrous results 🙂

  • Martin says:

    I recommend splitting the duties into search/retention and report/correlation. You can use open-source (e.g. ELSA) for the former and commercial for the latter. This lets you buy smaller, cheaper SIEMs since you’re not relying on them for search/retention. It also gives you an avenue to leverage all of the collected data for operations use in addition to security.

    Architecting for resiliency depends on the risk tolerance of the company and regulatory requirements. If risk is largely tolerated, then put the boxes in the data centers and backhaul branch office logs there. Otherwise, you’ll need boxes for each branch office. The main collection points for log data should be virtual IP addresses on a load balancer to provide failover. Almost any Cisco device above a simple switch will do this. I recommend that the search/retention boxes be redundant and load-balanced to guarantee log reception. The SIEM should not need such resiliency because momentary outages will have less effect (especially when you can spot-check with the search/retention box).

  • Tamer Hassan says:

    Hello,

    To be able to architect this, could you please provide the following information:
    - How many devices are there at each site, and what are the models and vendors for each device or system at each site?
    - What are the WAN link speeds?
    - Can you provide a high-level design for the sites and WAN connections?
    -is the client has a open budget or limited ?
    - What is the time frame to implement the solution: more than 3, 6 or 9 months?
    - Will the client be able to hire system or database engineers to support the SIEM solution if needed?
    - Is it allowed to specify the SIEM model and vendor?

    Thanks

  • @Martin Thanks a lot for the comment. The duties are likely split 3 ways, not 2 ways: retention, report/search, correlation/monitoring.

    And, of course, you are correct about the resiliency. In many cases, their business systems are not resilient, so it is likely that their SIEM won’t be either.

  • @Tamer Thanks for the comment! I especially love the “is the client has a open budget or limited ?” 🙂

  • Sam says:

    What is your opinion, Mr Anton?

  • Clay Keller says:

    Without any additional info, here is what I would do. (I’ve never built, designed, or used a SIEM.)

    Distributed collectors flowing back to a centralized redundant analysis console, database & storage.

    Collectors in each datacenter, dedicated collectors for each site type (one or two for regional offices to report through, etc.)
    In general locate the collectors in a logical way to aggregate geolocated sites or to handle log volume.

    Single scalable redundant 1-tier storage for 30 days.
    Possibly second tier storage for 1 year searchable.

    ArcSight

    Collection Outages – Collectors should buffer if possible. Build local storage on collectors with this design in mind. We are talking 80-150GB, not TB.
    Certain amount of “buffer” should be available on collectors and correlation systems to withstand short outages.
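
A quick way to sanity-check a buffer size like the 80-150GB mentioned above is to convert it into outage hours. This is only a sketch: the per-collector event rate and the average event size below are assumed values for illustration, not figures from the scenario.

```python
# How long can a collector buffer events locally while upstream is down?
AVG_EVENT_BYTES = 500      # ASSUMPTION: average buffered size of one event
COLLECTOR_EPS = 2_000      # ASSUMPTION: events/second hitting one collector


def buffer_hours(buffer_gb: float, eps: float, avg_event_bytes: float) -> float:
    """Hours of outage a local buffer of buffer_gb (decimal GB) can absorb."""
    bytes_per_second = eps * avg_event_bytes
    return buffer_gb * 1e9 / bytes_per_second / 3600


for size_gb in (80, 150):
    hours = buffer_hours(size_gb, COLLECTOR_EPS, AVG_EVENT_BYTES)
    print(f"{size_gb} GB buffer at {COLLECTOR_EPS} EPS: {hours:.1f} hours")
```

Under these assumptions, 80 GB covers roughly 22 hours and 150 GB roughly 42 hours for a single collector. If one collector had to take the full 30,000 events/second, the same 80 GB would last only about an hour and a half, so the answer depends heavily on how the load is spread across collectors.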

    Performance? –
    Events handled/stored per second.
    Need something based on effectiveness, too.

    Scale – Scale out with collectors and redundant correlation backend.
    The backend correlation system should be built to scale out and up. The collectors simply scale out.

    Other information –
    Network bandwidth to remote locations.
    Estimated event volume per site or geolocation area.

  • Clay, thanks a lot for the comment. This is a great architecture description indeed.

    Of course, the devil (devilS) would be in the details, for example:

    “In general locate the collectors in a logical way to aggregate geolocated sites or to handle log volume.” may end up a future logistical pain if datacenter log volume changes (well, grows)

    “Possibly second tier storage for 1 year searchable.” <- likely a MUST, not just 'possible'

    “The backend correlation system should be built to scale out and up.” <- is a brilliant point that soooooo many architects fail to realize. As more rules get built, correlation needs to scale out and up, not just grow slightly

  • Damian says:

    Hi Anton,

    I would add consideration for regulatory restrictions. For example, trying to ship any PII out of Luxembourg or Germany – locations a global enterprise may well have offices in. In those cases, you may well require local log retention, with possible forwarding of partially-obfuscated data out of the state for centralised (and/or regional) correlation.
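
One possible reading of "partially-obfuscated" forwarding is to replace identifying fields with keyed hashes at the local site before events leave it, so that central correlation can still group events by the same user or address. The sketch below assumes events are already parsed into dictionaries; the field names and the key are hypothetical, and whether keyed hashing is sufficient in a given jurisdiction is a separate question.

```python
import hashlib
import hmac

# ASSUMPTION: a secret that never leaves the local site (hypothetical value).
SITE_KEY = b"replace-with-a-secret-kept-at-the-local-site"
# ASSUMPTION: hypothetical names of the fields that carry PII.
PII_FIELDS = ("user", "src_ip")


def pseudonymize(event: dict, key: bytes = SITE_KEY) -> dict:
    """Return a copy of the event with PII fields replaced by keyed hashes.

    The same input value always maps to the same token, so events can still
    be correlated centrally without exposing the original value.
    """
    out = dict(event)
    for field in PII_FIELDS:
        if field in out:
            value = str(out[field]).encode("utf-8")
            digest = hmac.new(key, value, hashlib.sha256).hexdigest()
            out[field] = digest[:16]  # truncated token for readability
    return out


local_event = {"user": "alice", "src_ip": "10.0.0.1", "action": "login"}
forwarded = pseudonymize(local_event)
print(forwarded)
```

The original event stays in local retention untouched, and only the pseudonymized copy is forwarded. Mapping a token back to a real identity requires the site key, which keeps that step under local control.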

    Another consideration is corporate politics; while the Global Security team may well be eager to centralise and analyse all logs on behalf of the corporation, the different operating companies, countries, and business units, may have other ideas. In such cases, it may be necessary to architect differently – for example, one country may expect local log retention and correlation capabilities; another may be happy to leave this to the centralised team. Being able to flexibly architect the SIEM and data model to fit, is an advantage here.

  • Damian, thanks for the comment. Privacy and PII is the area I’d intentionally stay away from at this stage. Too much of this is country-specific.

    Now, global vs local security will be covered indeed – these are common issues and can be discussed in depth without stepping into the pile of poo that is European privacy regs 🙂