wrangle_report
December 12, 2020
1 Wrangle Report
1.1 WeRateDogs Twitter Account Data Wrangling
1.2 Introduction
1.2.1 The purpose of this project is to wrangle data about the Twitter account
WeRateDogs from three different sources to create interesting and trustworthy
analyses and visualizations.
2 Project Details
2.0.1 1. Gathering
2.0.2 2. Assess
2.0.3 3. Clean
2.0.4 4. Store
2.1 Gathering
2.1.1 1. Twitter archive file: downloaded manually. It contains basic tweet data
for all 5000+ of their tweets, but not everything.
2.1.2 2. Image predictions file: downloaded programmatically. Every image in the
WeRateDogs Twitter archive was run through a neural network that can classify
breeds of dogs. The result is a table of image predictions (the top three only)
alongside each tweet ID, image URL, and the image number that corresponded to
the most confident prediction (numbered 1 to 4, since tweets can have up to
four images).
2.1.3 3. Additional data via the Twitter API: I created a Twitter Developer
account and collected more data using the tweet ID column from the
Twitter archive file.
3 Assess
3.0.1 Twitter archive
1. The archive has 2356 rows, but only 2278 of them are tweets.
2. Some ratings are too high, and the rating type should be float.
3. The rating numerator has very high and very low values; it should be 10, or a
multiple of ten for ratings that cover multiple dogs.
4. The dog name column contains non-name values.
5. The timestamp column is of type object.
6. doggo, floofer, pupper, and puppo are values, not column names; they should be
melted into one column.
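The checks behind these findings can be sketched on a small stand-in table. This is only an illustration: the column names `tweet_id`, `retweeted_status_id`, and `rating_numerator`, the sample rows, and the thresholds 5 and 20 are assumptions, not values taken from the project.

```python
import pandas as pd

# Toy stand-in for the archive; column names and rows are assumptions.
archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "retweeted_status_id": [None, 99.0, None, None],
    "timestamp": ["2017-08-01 16:23:56 +0000"] * 4,
    "rating_numerator": [13, 12, 1776, 0],
    "name": ["Phineas", "a", "None", "Tilly"],
})

# Rows that carry a retweeted_status_id are retweets, not original tweets.
n_retweets = archive_df["retweeted_status_id"].notna().sum()

# Timestamps read from a CSV arrive as plain strings, not datetimes.
timestamp_is_datetime = pd.api.types.is_datetime64_any_dtype(archive_df["timestamp"])

# Numerators far from 10 deserve a manual look (thresholds are arbitrary).
numerator = archive_df["rating_numerator"]
odd_ratings = archive_df[(numerator > 20) | (numerator < 5)]

# All-lowercase "names" such as "a" are usually ordinary words, not dog names.
suspect_names = archive_df[archive_df["name"].str.islower()]

print(n_retweets, timestamp_is_datetime, len(odd_ratings), list(suspect_names["name"]))
```

Each check only flags candidates; the flagged rows still need to be read by eye before deciding how to clean them.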
3.0.2 Image Predictions
1. Not all tweets have a valid picture of a dog; for those, the columns p1_dog,
p2_dog, and p3_dog are all False.
2. jpg_url has duplicates.
3. Tidiness issue: (p1, p2, p3), (p1_conf, p2_conf, p3_conf), and (p1_dog, p2_dog,
p3_dog) are each spread across three columns instead of one.
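One possible way to address the tidiness issue is to stack the three prediction groups into long form, one row per tweet and prediction rank. The sketch below uses a two-row toy table; the output column names `prediction`, `confidence`, `is_dog`, and `rank` are hypothetical, and the report itself ends up keeping only the first prediction rather than reshaping.

```python
import pandas as pd

# Toy stand-in for the image predictions table (values are invented).
image_df = pd.DataFrame({
    "tweet_id": [1, 2],
    "p1": ["pug", "seat_belt"], "p1_conf": [0.9, 0.5], "p1_dog": [True, False],
    "p2": ["chow", "pug"], "p2_conf": [0.05, 0.3], "p2_dog": [True, True],
    "p3": ["toaster", "sock"], "p3_conf": [0.01, 0.1], "p3_dog": [False, False],
})

frames = []
for rank in (1, 2, 3):
    # Take one (name, confidence, is-dog) group and give it shared column names.
    part = image_df[["tweet_id", f"p{rank}", f"p{rank}_conf", f"p{rank}_dog"]].copy()
    part.columns = ["tweet_id", "prediction", "confidence", "is_dog"]
    part["rank"] = rank
    frames.append(part)

# One row per (tweet, rank) instead of nine prediction columns per tweet.
tidy_predictions = pd.concat(frames, ignore_index=True)
print(tidy_predictions)
```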
3.0.3 Twitter API
1. Too much information about the tweets was retrieved from the Twitter API; I chose
to keep only the retweet count and favorite count.
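Keeping only those two counts can be sketched as follows, assuming the API responses were saved as one JSON object per line. The sample lines, the helper name `parse_counts`, and the count values are invented for illustration; `id`, `retweet_count`, and `favorite_count` are assumed to be the field names in the saved tweet objects.

```python
import json
import pandas as pd

# Two invented lines standing in for a file of tweet JSON, one object per line.
raw_lines = [
    '{"id": 1, "retweet_count": 10, "favorite_count": 100, "lang": "en"}',
    '{"id": 2, "retweet_count": 25, "favorite_count": 250, "lang": "en"}',
]

def parse_counts(lines):
    """Keep only the tweet ID and the two counts from each JSON line."""
    rows = []
    for line in lines:
        tweet = json.loads(line)
        rows.append({
            "tweet_id": tweet["id"],
            "retweet_count": tweet["retweet_count"],
            "favorite_count": tweet["favorite_count"],
        })
    return pd.DataFrame(rows)

api_df = parse_counts(raw_lines)
print(api_df)
```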
4 Clean
4.0.1 Twitter archive
1. First, make a copy of archive_df.
2. Remove all retweets and tweets without a photo.
3. Convert timestamp to datetime format.
4. Extract ratings from the tweet text and investigate them.
5. Clean the name column.
6. Create a column named dog_class and combine the four dog-stage columns into it.
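A minimal sketch of several of these steps on a toy table is shown below. The column names `retweeted_status_id` and `text`, the sample rows, the regular expression, and the helper `join_stages` are assumptions; filtering out tweets without a photo and cleaning the name column are left out.

```python
import pandas as pd

# Toy stand-in for archive_df; rows are invented.
archive_df = pd.DataFrame({
    "tweet_id": [1, 2, 3],
    "retweeted_status_id": [None, 42.0, None],
    "timestamp": ["2017-08-01 16:23:56 +0000",
                  "2017-07-31 00:18:03 +0000",
                  "2017-07-30 15:58:51 +0000"],
    "text": ["This is Bella. 13.5/10 would pet", "RT @x: 12/10", "Meet Tilly. 11/10 good girl"],
    "doggo": ["None", "None", "doggo"],
    "floofer": ["None"] * 3,
    "pupper": ["None", "None", "pupper"],
    "puppo": ["None"] * 3,
})

# Work on a copy so the raw data stays untouched.
clean_df = archive_df.copy()

# Keep only rows that are not retweets.
clean_df = clean_df[clean_df["retweeted_status_id"].isna()].copy()

# Parse the timestamp strings into timezone-aware datetimes.
clean_df["timestamp"] = pd.to_datetime(clean_df["timestamp"])

# Pull "numerator/denominator" out of the text, allowing decimal numerators.
ratings = clean_df["text"].str.extract(r"(\d+(?:\.\d+)?)/(\d+)")
clean_df["rating_numerator"] = ratings[0].astype(float)
clean_df["rating_denominator"] = ratings[1].astype(float)

# Collapse the four stage columns into a single dog_class column.
stage_cols = ["doggo", "floofer", "pupper", "puppo"]

def join_stages(row):
    found = [stage for stage in stage_cols if row[stage] == stage]
    return ", ".join(found) if found else None

clean_df["dog_class"] = clean_df[stage_cols].apply(join_stages, axis=1)
clean_df = clean_df.drop(columns=stage_cols)
print(clean_df[["tweet_id", "rating_numerator", "dog_class"]])
```

Joining with a comma keeps tweets that mention two stages visible instead of silently picking one.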
4.0.2 Image Predictions
1. Remove all tweets for which all three predictions failed to identify a dog breed.
2. Remove all duplicated photos.
3. I chose only the first prediction (p1) to continue the analysis.
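These three steps could look roughly like this in pandas. The toy rows are invented, and the sketch assumes that keeping the first prediction means keeping the `p1`, `p1_conf`, and `p1_dog` columns.

```python
import pandas as pd

# Toy stand-in for the image predictions table (values are invented).
image_df = pd.DataFrame({
    "tweet_id": [1, 2, 3, 4],
    "jpg_url": ["a.jpg", "b.jpg", "b.jpg", "c.jpg"],
    "p1": ["golden_retriever", "pug", "pug", "seat_belt"],
    "p1_conf": [0.9, 0.8, 0.8, 0.5],
    "p1_dog": [True, True, True, False],
    "p2_dog": [True, False, False, False],
    "p3_dog": [False, False, False, False],
})

# 1. Drop rows where none of the three predictions is a dog breed.
dog_flags = ["p1_dog", "p2_dog", "p3_dog"]
image_clean = image_df[image_df[dog_flags].any(axis=1)].copy()

# 2. Drop repeated photos, keeping the first tweet that used each image.
image_clean = image_clean.drop_duplicates(subset="jpg_url", keep="first")

# 3. Keep only the columns that describe the first prediction.
image_clean = image_clean[["tweet_id", "jpg_url", "p1", "p1_conf", "p1_dog"]]
print(image_clean)
```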
4.0.3 Twitter API: no cleaning needed
5 Store
5.0.1 I merged the three data frames into one master data frame and stored it as
‘twitter_archive_master.csv’. It contains only tweets with one or more photos,
along with the retweet count, the favorite count, the most confident prediction
of the dog breed as a name (if it exists), and the dog stage (if it exists).
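The merge and store step can be sketched as two inner joins on a shared `tweet_id` key, which is an assumed column name. The three toy frames are invented, and the sketch writes to an in-memory buffer; passing the file name to `to_csv` instead would write the file to disk.

```python
import io
import pandas as pd

# Toy stand-ins for the three cleaned data frames (values are invented).
archive_clean = pd.DataFrame({"tweet_id": [1, 2, 3], "dog_class": ["doggo", None, "pupper"]})
image_clean = pd.DataFrame({"tweet_id": [1, 2], "p1": ["pug", "chow"]})
api_df = pd.DataFrame({"tweet_id": [1, 2, 3],
                       "retweet_count": [10, 20, 30],
                       "favorite_count": [100, 200, 300]})

# Inner joins keep only tweets present in all three sources,
# which drops tweets that have no image prediction.
master_df = (
    archive_clean
    .merge(image_clean, on="tweet_id", how="inner")
    .merge(api_df, on="tweet_id", how="inner")
)

# Write CSV text to a buffer; index=False avoids an extra unnamed column.
buffer = io.StringIO()
master_df.to_csv(buffer, index=False)
print(buffer.getvalue())
```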
