Anytime someone takes an action on Twitter, from something as small as a click or a scroll to something as large as signing up or tweeting, Twitter logs it. To date, the company has relied mainly on a batch processing system to take a first pass at that data.

Before Twitter Sparrow, the log ingestion pipeline relied on a batching system that left data science engineers waiting several hours for fresh customer-event data.

Streaming data pipelines, by contrast, give data scientists access to fresh data in real time.

Twitter engineers developed a Streaming Event Aggregator that collected log events from services and passed them to a message queue such as Apache Kafka (https://thenewstack.io/apache-kafka-primer/) or Google Cloud Pub/Sub (https://cloud.google.com/pubsub/docs/overview).
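To make the hand-off concrete, here is a minimal sketch of that step: serializing a logged client event and publishing it to a Kafka topic via the open source kafka-python client. The broker address, topic name, and event fields are all illustrative assumptions; Twitter's actual Streaming Event Aggregator is internal and not described in detail here.

```python
# Sketch only: publish a logged client event to a message queue.
# Broker address, topic name, and event schema are hypothetical.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",  # hypothetical broker
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def publish_client_event(user_id: str, action: str) -> None:
    """Publish one client event (e.g., a click or scroll) to the queue."""
    event = {
        "user_id": user_id,
        "action": action,  # "click", "scroll", "tweet", ...
        "timestamp_ms": int(time.time() * 1000),
    }
    producer.send("client-events", value=event)  # hypothetical topic

publish_client_event("user-123", "click")
producer.flush()  # make sure buffered events are delivered before exit
```

Once events land on the queue, downstream stream-processing jobs can consume them within seconds rather than waiting for the next batch run.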
