A Fortune 500 commercial real estate services company was building a service-oriented web application platform, built with Django, for business market data analysis. The platform required large quantities of data from multiple internal data silos to be ingested, processed, and loaded into its API databases efficiently and programmatically.
The source data lived in less-than-desirable conditions, like:
- Very unkempt PeopleSoft databases
- Excel spreadsheets on an FTP site
- Tab delimited files in email inboxes
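A minimal sketch of ingesting the last of these sources, a tab-delimited export, using Python's standard library. The field names and sample rows are hypothetical, for illustration only:

```python
import csv
import io

# Hypothetical tab-delimited export, as might arrive in an email inbox.
raw = (
    "property_id\tcity\tsq_ft\n"
    "P-100\tDallas\t12000\n"
    "P-101\tAustin\t8500\n"
)

def parse_tab_delimited(text):
    # DictReader keys each row by the header line, so downstream code
    # works with named fields instead of positional columns.
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return list(reader)

rows = parse_tab_delimited(raw)
```

In the real pipeline each source format (PeopleSoft query results, Excel workbooks, delimited files) would feed the same normalized record shape.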
Further, while the datasets were related, they were maintained by entirely different business units within the organization and were therefore inconsistent. Records needed unit transformations and extensive field mappings, and relationships had to be built in the final dataset.
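The kind of reconciliation described above can be sketched as a small normalization step. The field names, mappings, and conversion factor below are hypothetical stand-ins for the per-business-unit rules the pipeline would actually carry:

```python
# Hypothetical mappings reconciling field names used by different
# business units into one canonical schema.
FIELD_MAP = {"SQ_FT": "area_sqft", "RENT_PSF": "rent_per_sqft"}

# Hypothetical unit transformation: square metres -> square feet.
UNIT_FACTORS = {"area_sqm": ("area_sqft", 10.7639)}

def normalize(record):
    out = {}
    for src_key, value in record.items():
        # Map known source fields; otherwise just lowercase the key.
        out[FIELD_MAP.get(src_key, src_key.lower())] = value
    # Convert any metric fields into the canonical units.
    for metric_key, (target_key, factor) in UNIT_FACTORS.items():
        if metric_key in out:
            out[target_key] = round(out.pop(metric_key) * factor, 2)
    return out
```

Each source silo would contribute its own `FIELD_MAP`-style rules, so every record enters the final dataset in the same shape regardless of which business unit maintains it.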
Finally, much of the data was sensitive and confidential to the organization, so bulk pushes to the cloud were a non-starter: relevant information had to be extracted explicitly.
Lofty Labs built a distributed data pipeline using Python and Celery.
Because the target application was built with Python and Django, writing the pipeline in Python let it reuse the application's existing validation logic and APIs, a massive boost to time-to-market.
The pipeline takes advantage of a host of optimizations, including smart batch processing, asynchronous task grouping, and post-processing jobs that move data into the cloud cleanly and idempotently.
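The batching and idempotent-load ideas can be sketched in plain Python. In the real pipeline each batch would be dispatched as a Celery task and the batches would run as an asynchronous group; here they run inline, and a dict stands in for the target API database (all names are illustrative):

```python
BATCH_SIZE = 3  # tiny for illustration; production batches are far larger

def batches(records, size=BATCH_SIZE):
    # Split the record stream into fixed-size chunks.
    for i in range(0, len(records), size):
        yield records[i:i + size]

STORE = {}  # stands in for the target API database

def load_batch(batch):
    # Keying on a stable natural ID makes the load idempotent:
    # re-running a failed or duplicated batch overwrites records
    # instead of inserting duplicates.
    for record in batch:
        STORE[record["id"]] = record
    return len(batch)

def run_pipeline(records):
    # In the Celery version this would be a group of load_batch tasks
    # followed by a post-processing job; here the batches run serially.
    return sum(load_batch(b) for b in batches(records))
```

Because the load is idempotent, retries and re-runs are safe, which is what lets the post-processing jobs move data into the cloud cleanly.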
The organization now has a pipeline capable of moving and manipulating tens of millions of records per day into their market statistic application APIs, powering multiple dashboards and analysis applications.
This pipeline is supported by an extensible API that can be augmented with new data sources and new target stores, making it a potential standard for pipelining live data across the entire organization.