

Data Quality Automation at Twitter


Thursday, 15 September 2022

Twitter ingests thousands of datasets daily through our automated framework. It runs on top of existing services such as GCP Dataflow and Apache Airflow, moving on-premise Hadoop data into BigQuery as described in this previous article. This framework enables Twitter employees to run over 10 million queries a month on almost an exabyte of data in BigQuery.

After we increased data availability, the next natural step was to ensure the quality of those datasets. They power Twitter's Core Ads product analytics, ML feature generation, and portions of the personalization models.

Why data quality

Data quality assesses the state of the data; freshness, completeness, accuracy, and consistency are some of the criteria used to measure it.

Some product teams were doing manual testing on their own, executing SQL commands via the BigQuery UI and/or Jupyter notebooks. There wasn't a single framework to run data quality checks in an automated and consistent way.
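For illustration, a manual check of this kind is typically a one-off query run from a notebook. Here is a minimal sketch, assuming a hypothetical `ads.impressions` table with `impression_id` and `event_date` columns:

    # A minimal sketch of a manual completeness check; the table and
    # column names are hypothetical, not Twitter's actual schema.
    from google.cloud import bigquery

    client = bigquery.Client()
    query = """
        SELECT COUNT(*) AS null_ids
        FROM `ads.impressions`
        WHERE impression_id IS NULL
          AND event_date = CURRENT_DATE()
    """
    row = next(iter(client.query(query).result()))
    if row.null_ids > 0:
        print(f"Completeness issue: {row.null_ids} rows with NULL impression_id")

Checks like this work, but they only run when someone remembers to run them, which is why a consistent automated framework matters.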

It is important to have automated data quality checks that identify anomalies and verify the accuracy and reliability of datasets at scale, in order to achieve:

  • Confidence: Better data quality gives customers confidence in the outputs they produce, lowering risk in the outcomes and increasing efficiency.
  • Better productivity: Customers can be more productive instead of spending time validating and fixing data errors; they can focus on their core solutions.
  • Avoided lost revenue: In the decision-making process, poor data can lead to lost revenue.

What we did

We created the Data Quality Platform (DQP), a managed, config-driven, workflow-based solution to build and collect standard and custom quality metrics, alert on data validations, and monitor those metrics and stats within GCP.
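To give a sense of what config-driven means here, a check definition might look like the sketch below. This is illustrative only: the field names, dataset, and schedule are assumptions, not DQP's real schema.

    # Hypothetical DQP check configuration (illustrative sketch)
    dataset: ads.core_served_impressions
    schedule: "0 6 * * *"   # run daily, after ingestion completes
    checks:
      - type: freshness
        max_delay_hours: 6
      - type: completeness
        column: impression_id
        expectation: not_null
      - type: row_count
        min_rows: 1000000
    alerting:
      channel: revenue-analytics-oncall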

These features enable us to identify anomalies and monitor the latency, accuracy, and reliability of these datasets.

Under the hood, DQP relies on the open-source Great Expectations library and our own Stats Collector Library as operators to generate the logic that queries the resources. It also depends on Airflow for workflows and state management, and on Google's Dataflow for transporting results into BigQuery.
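As a flavor of the kind of logic these operators generate, here is a minimal sketch using the classic Great Expectations pandas API; the file and column names are hypothetical, and DQP's actual operators are more involved:

    # A minimal Great Expectations sketch; the sample file and columns
    # are hypothetical placeholders.
    import great_expectations as ge

    df = ge.read_csv("served_impressions_sample.csv")

    # Each expectation returns a result object with a `success` flag.
    df.expect_column_values_to_not_be_null("impression_id")
    df.expect_column_values_to_be_between("revenue_usd", min_value=0)

    # Run all registered expectations against the batch.
    results = df.validate()
    print(results.success)

Expressing checks as declarative expectations is what lets DQP drive them from configuration rather than hand-written query code.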

The solution design

Data Quality Platform relies on a number of technologies along its stack. Using a CI/CD workflow, we upload YAML configurations to GCS. From there, the associated Airflow worker starts the configured test at the resource and cadence granularity. The test runs and sends its results to a PubSub queue. A Dataflow job then lands the results from the queue into the destination table in BigQuery, which backs Looker, enabling users to debug and identify trends in metrics.
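The Dataflow leg of that path can be sketched with Apache Beam. The subscription, table, and schema below are hypothetical placeholders, not the production pipeline:

    # A minimal sketch of the PubSub-to-BigQuery hop; subscription,
    # table, and schema names are hypothetical.
    import json
    import apache_beam as beam
    from apache_beam.options.pipeline_options import PipelineOptions

    options = PipelineOptions(streaming=True)
    with beam.Pipeline(options=options) as p:
        (
            p
            # PubSub messages arrive as bytes; json.loads accepts them directly.
            | "ReadResults" >> beam.io.ReadFromPubSub(
                subscription="projects/my-project/subscriptions/dqp-results")
            | "Parse" >> beam.Map(json.loads)
            | "WriteToBQ" >> beam.io.WriteToBigQuery(
                "my-project:dqp.check_results",
                schema="check_name:STRING,dataset:STRING,success:BOOLEAN,run_ts:TIMESTAMP",
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND)
        )

Landing every check result in a BigQuery table is what makes the Looker dashboards possible: trends and regressions become ordinary queries over that table.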


Impact

Here are some work streams that benefit from the solution:

The Revenue Analytics Platform team builds and maintains products to efficiently ingest, aggregate, and serve revenue analytics data to downstream products such as AdsManager. After implementing DQP:

  • There was a 20% reduction in the time to roll out new processing features, by leveraging DQP for automated validation of the output data.
  • Increased confidence in data being delivered to advertisers through continuous measurement.

Core Served Impressions is a core dataset for product analytics of direct revenue-generating products within Twitter; many downstream customers consume it to build their own specialized datasets for their product needs.

  • Prior to DQP, we had no automated visibility into deviations between the upstream and downstream datasets. DQP now provides alignment metrics between the Core Served Impressions dataset and downstream datasets for over 400 internal customers.

Conclusion

The Data Quality Platform allowed Twitter to leverage open-source libraries, Apache Airflow and Great Expectations, integrated with GCP services like GCS, PubSub, Dataflow, BigQuery, and Looker. This provides an end-to-end automated solution to ensure the accuracy and reliability of thousands of datasets ingested daily, increasing confidence in the data being delivered to advertisers.

Acknowledgments

We are grateful to the following contributors who helped us deliver the Data Quality Platform solution:

Josh Peng, Wini Tran, Tushar Arora, Joanna Douglas, Oguz Erdogmus, Katie Macias, Stacey Ko, Bhakti Narvekar, Kasie Okpala, Nathan Chang.
