How Three Tech Dudes Biked 500 Miles from SF to LA (Part 2)

Santa Cruz, California


When Raymond called me, we were right here: 

 
 

We were in the bum effin middle of nowhere. No Ubers. No taxis. This is what the road looked like: 

 
 

I offered to carry a portion of his luggage, but he didn't want to do that to us. He planned to call an Uber or taxi to meet us in Santa Cruz. 

My persuasion skills were good, but the firmness of his statement meant his decision was made. At that point, the pain was so unbearable he wanted to cry. 

After hopping off the phone, he took a swig of water, unloaded his backpack and sat on the curb. He opened the Uber app; a blank green canvas of a map engulfed his phone. He centered his location and pressed enter, attempting to get the nearest driver. Uber & Lyft service unavailable. Shit. 

Rather than give up, his bulldozer mind cycled through ideas on how he could get to Santa Cruz. We had just passed by the town of Pescadero, which had an arts festival going on. Maybe, just maybe, he could find a person in town to give him a lift to Santa Cruz. 

He exhaled and took a swig of water, allowing his body to recuperate. He threw his backpack back on, only to feel a sharp twinge in his lower back. He tried to pedal, but the sharp pain stabbed at his lower back. So he walked to the town of Pescadero. 

Pescadero was a small, quaint town with no large buildings. He found a gas station and purchased a bottle of water. He googled for a car service. None available. Since Pescadero had an arts festival that day, maybe he could ask a stranger for a lift. He talked to three tourists, only to find they were all headed toward San Francisco. 

As Raymond ticked off each idea, he was running out of options. He had one last option: phone a friend. Dun dun dun. *Insert Who Wants to Be a Millionaire music*

He FaceTimed one of his best friends, Ryan. In the background, he noticed that Ryan wasn't home. He was outside. Where was he? Ryan had planned to take his girlfriend skydiving that day. Unfortunately, when they got to the skydiving location, the weather conditions weren't suited for skydiving. But where was this skydiving location? 

Santa Cruz. 

In a miraculous turn of luck, Ryan, Raymond's last option, had been in the area. If the weather conditions had been good, he wouldn't have been free. Ryan found Raymond in Pescadero and drove him to Santa Cruz. 

Upon arrival, he called every bike shop in town asking: "Do you guys carry panniers?"

We do. 

Mission saved.  

Brian and I cruised 30 miles into Santa Cruz, arriving at 5pm. The day totaled 60 miles and 3,000 feet of elevation. We reconvened with Raymond at the hostel and spent the rest of the day devouring Thai food and exploring the boardwalk.

Santa Cruz is Northern California's premier surf town. A town of 90,000 people and home of UC Santa Cruz, the city is a blend of bushy forests & soothing beaches. Since Santa Cruz is a surf town, we thought we should make the most of our visit by going surfing. Internet surfing. We surfed mindless YouTube videos, refreshed Instagram and vegetated the rest of the night. Fun stuff. 

Our next ride was 45 miles to Marina, CA, a random city that sat adjacent to Monterey: 

 
 


We left at around 11am and enjoyed an easy, flat ride toward Marina. We pedaled through flat farmlands and the putrid smell of horse manure as the sun settled past the horizon. The crisp sounds of music tapped my eardrums, each note soothing my mind into the flow of the ride.   

Poke. 

A speck of pain poked the front of my left knee, barely noticeable. I treated this poke like an annoying baby. 

Ignore. 

Soon, the speck of pain evolved into an annoying little brother, banging on the door of my mind, begging to come in. 

Ignore. 

At some point I had to open the door. Not right now though. 

Ignore. 

I had no clue what it was. It kinda hurt, but we still had about 350 miles left, with our hardest day in two days. One thing I learned from Raymond's fiasco was that small discomforts multiply out. Raymond wearing a 30lb bag for 20 miles is do-able. Multiplied across 500 miles, it's nearly impossible. The pain in my knee was no different. 

When we biked, Brian wasn't concerned with speed; he was concerned with pedaling correctly. Combined with his hybrid tires, this meant he'd often be pedaling slower at the back. Throughout his life, he's had multiple injuries: breaking his foot, tearing his shoulder, injuring his back. In life, it's the painful experiences that shape our behavior. And these experiences made Brian a bit neurotic about getting injured on this ride. 

In preparation for this ride, Raymond & I had an attitude of youthful invincibility. Brian did the opposite. He researched, trained and took the most precautions, which meant he was the most prepared. 

When I brought up my knee pain, Brian immediately knew the answer: "Your bike seat might be too low. Might be good to see if you can get a bike fit."

A bike fit requires a fitter to measure your body's dimensions, pedal stroke and hand positions. Then he or she re-adjusts the saddle height and position to fit your body. If the seat is too low, you put too much pressure on the front of your knee:

 
 

You can see in this photo how the man's knee sits nearly in front of his toes (black line). Raising the seat would push the knee back, putting less pressure on it (red line). 

Although the pain was manageable, I began to panic. I had no idea what a small knee pain, multiplied over 300 miles would become. Would I be completely incapacitated in 100 miles? 200 miles? 

We stopped by REI, where I got my bike seat adjusted and bought a patellar tendon strap. However, the REI crew didn't have proper bike-fit measurement equipment, which meant my seat was still too low after they adjusted it. This would come back to bite me. 

The patellar tendon strap absorbed some of the pressure on my tendon. My knee hurt, but the pain was bearable. If it didn't get worse, I thought I could keep going. I self-diagnosed patellar tendonitis.  

On we went, the 71-mile monster lurking ahead of us. 

Big Sur 

The trek into Big Sur was an easier 45-mile ride. The ride through Big Sur was the tough one:

 
 

The roads in Big Sur slithered like a snake alongside the jagged cliffs. Ocean waves crumbled against the coastline while a cool breeze rustled through the lush green trees. The bike shoulder was only 1-2 feet wide, so cars would vroom by awfully close:

 
 

Because the roads wound, cars couldn't go too fast. We'd pedal up hills, cruise down, pedal up again, cruise down. At one stoplight, I pulled up next to two bikers dressed in blue and red bike jerseys. Both had steel-carved calves. Damn. They also looked to be retired and in their 70s. Damn. 

As the light flicks green, one of the men pedals right beside me:

Old Dude: How’s it going? Where are ya guys headed?

Me: We’re headed from SF to LA. You?

Old Dude: We’re headed from Santa Cruz to Orange County. 

We chat for a few seconds and he pedals past me. After a few minutes, he's pedaled into the distance. I can't see him. This guy is probably three times my age, but he flew right past me as if I were the 70-year-old. It's always refreshing to meet people who haven't allowed age to suppress their lust for adventure. I'd like to be that old dude when I'm 70. 

We arrived at Big Sur Lodge and rested up for the toughest ride of this trip: 71 miles, 5511 feet of elevation. We left Big Sur lodge at 9am and planned to end in the small town of Cambria. 

About 40 miles in, the three of us concluded that Big Sur wasn’t as hard as we thought. That is, until we hit “The Climb.” 

The Climb was a 1,000-foot, straight uphill climb. When cycling up a hill, you usually shift to a lower gear. 

Good cycling technique means maintaining a consistent pedaling cadence. To maintain cadence, you shift to lower gears as the incline rises and back to higher gears when it's flat. Climbing becomes exponentially harder when you've already hit your lowest gear but still can't maintain cadence. The only thing you can do is generate more power with your legs to push through the resistance. 

I could feel a thick wad of moisture underneath the straps of my backpack. My backpack started to stink of sweat. I don’t remember if my left knee hurt. My focus was on the burn in my thighs and conquering this 71 mile beast. 

The beauty behind endurance sports like cycling, running and swimming is the repetitive motion. Unlike football or basketball, you are repeating the same exact physical motion, in this case for hours. Some might see this as boring. But boredom is the best test for the equanimity of your mind. Nothing in life is boring. When we feel bored, it's our failure to squeeze the interesting juices from the amazing world around us. Endurance sports train this skill. 

I clicked my gears lower until I hit my lowest gear. I couldn’t click any lower, which meant I needed to exert more force through my legs. As rivers of sweat drifted down my forehead, the beautiful views of Big Sur faded in the background. I was alone in the crevices of my mind. 

Why was I doing this? Why did I feel this urge to bike 500 miles? Was I doing this to impress girls? What did I have to prove to others? Was I running away from something? 

A few years ago, I went on a 10-day vipassana meditation retreat. The retreat required us to meditate for 10 hours per day, with a 5-minute break every hour, no talking, no writing, no cell phone. There was one moment on the retreat when I had been meditating for an hour straight and felt like I was sitting on a burning stove. This time, I challenged myself to meditate through the break and go for two hours. 

There was an odd moment when the pain became so unbearable, I started giggling like a little schoolgirl. It was as if I were a third person in my own story, watching the pain sit in my legs. Yes! That was it. I had completely detached from the pain in my legs. What Buddha taught is that any sort of pain, physical or emotional, is the path to enlightenment. Peace comes from detaching from and overcoming pain. 

A woman I dated a few months ago went through a soul-crushing breakup, which triggered her to start reading, invest in her passion for cooking and start exercising. Colin O'Brady, while traveling in Thailand, suffered third-degree burns in a fire accident. The accident triggered him to win a triathlon, ultimately spurring an amazing career as an athlete. He became the fastest man to cross Antarctica solo, unassisted. And it was the pain of losing her mother that led Cheryl Strayed on an 1,100-mile hike along the Pacific Crest Trail, which became the book & movie: Wild. 

And that's why I was doing this ride. I knew that pushing through the pain and struggle would sculpt the best version of myself. 

Little did I know, the worst pain of this trip had yet to come. 

END OF PART 2

Click here for the final part!

How Three Tech Dudes Biked 500 Miles from SF to LA (Part 1)

 

A searing knife of pain sliced through my knee as I pressed down on the pedal. Burn. Each pedal stroke ignited a flame within my swollen red knee. Cars whizzed by at 80 mph. My friends were no longer visible as they pedaled into the distance. The pain was unbearable. My self-confidence was shot...

In Japan, misogi is a water purification ritual for reaching spiritual enlightenment. In the West, it has come to mean doing something that radically expands what you believe is possible. This bike trip was our misogi. San Francisco to Los Angeles. 506 miles. 10 flat tires. 21,436 feet of elevation. 65,880 calories burned. Patellar tendonitis. Stranded with no Ubers or taxis. 1,000-foot climbs. Steel-calved 70-year-olds. This is the story of our misogi. 

The Call to Adventure

Three years ago, I was unemployed and living at home. The combination of a non-existent dating life & lots of free time, gave room for interesting ideas to dance in my head. One idea caught my eye: biking from SF to LA.

Telling my dad about this idea was a mistake. He wasn't pleased. He begged, made threats, used passive aggression. Eventually, I agreed not to go on the trip if he helped me pay for part of my data science bootcamp. Negotiation FTW. 

So I buried the idea. At least until I moved out. 

After climbing the Month to Master mountain, I needed to find a new mountain. It’s the large goals that give our every days a sense of purpose. Biking from San Francisco to LA resurfaced and BAM, it had a WWE chokehold on my brain. 

Assembling the Team 

Finding people interested in this was hard. I sent texts, emails, FB messages, went into long diatribes at parties. Here was my pitch: 

If you do the math, 500 miles / 10 days = 50 miles per day. At 10 mph, that's about 5 hours per day; 6 if we count breaks. Even with 8 hours of sleep, that still leaves 10-11 hours. Very do-able.  


Despite my persistent & mathematical efforts at persuasion, I had no bites. 

Until Raymond. 

Raymond is my roommate who currently doubles as a robust bulldozer. This is a man who can party for seven days straight at yacht week and a man who loves triathlons. An adventurous fellow. Climb a mountain, take a salsa class, jump off a cliff, this is the dude you can count on to say “I’m down.” A man who lives life on the edge. I’ll let his dating profile pic speak for itself: 

 
 

As we sat at our dining table, I prepared to give Raymond the full pitch. Before I could dive in, Raymond blurted: "Oh, I've always wanted to do that. I'm down." That was easy. 

For Brian, not so much. 

A VR product designer and real-life "hype man." After a night out, this is a dude who can't sleep until the hype "settles." He's neurotic about injuries since he's had problems with his back, feet and shoulder (which ultimately benefited him on this trip). A dude you can always count on to bring up the energy level, but can't count on to change a tire. I'll let this Instagram story speak for itself:


Brian needed convincing. He didn't own a bike. But one night, as we were eating delicious Greek food, he said: "Fuck it, I'm in." He purchased a $900 hybrid Kona Rove. Team assembled.

As complete noobs to bike touring, we were in over our heads. We prepped by biking to San Jose, to Sacramento, and on multiple trips to Hawk Hill in Marin County. The San Francisco summer days chugged along and soon, the date was upon us. 

August 17, 2019 - San Francisco, California 

This was the full route not including detours:

 
 

On August 17, 2019 at 12 noon, we started our cycling adventure at the busy, tourist-packed San Francisco Ferry Building. The first leg ended at the lighthouse hostel near Half Moon Bay. Here’s a picture of us at the SF ferry building:

 
 

I led the initial legs of this trip along the coast of San Francisco, past the Sutro Baths and down the coastline along Ocean Beach. We cycled through Daly City, which required a solid amount of uphill climbing.

If you’re a dude who skips leg day, an easy way to eliminate chicken legs is to cycle uphill. Cycling uphill is like pouring gasoline on the flame of your thighs. Every stroke ignites the fire while beads of sweat drizzle down your forehead attempting to cool you off. 

In cycling, elevation trumps distance. Biking 100 miles with 0 feet of elevation is much easier than biking 50 miles with 5,000 feet of it. Our first few days were low on mileage but high on elevation, which meant the majority of our riding would consist of climbing. 

Just as we completed a 200-foot climb in Daly City, Google Maps robotically commanded me to make a right in 300 feet. To our dismay, it didn't tell us there was another damn hill.

That’s when Raymond, the bulldozer, started to crack. 

Raymond, a normally tame, proper, positive person, exclaimed: "Jeff, did you check the route we're going on?! No more hills, okay? We're not trying to kill ourselves." 

Although what he said wasn’t rude, it was unusual. Raymond never gets upset. Something was up.

We ended at the lighthouse hostel and ate scrumptious seafood in Half Moon Bay. The next day, we had a 60-mile ride into Santa Cruz. I asked Raymond how he was doing and he stoically replied: "Terrible." I was worried.

The First Hurdle

About 15 miles in, we hit a 500-foot climb. Raymond lagged further and further back. When we reached the top, we took a break to fuel up on water, energy gels and bars. Raymond's face had the look of death. He unclipped his pedals, sat on the curb, closed his eyes and buried his face in his arms. He didn't say a word for the next ten minutes. 

At this point, both Brian and I were tired but not at the point of dying. Raymond wasn’t out of shape. He’s run triathlons, plays basketball, tennis and lives an active lifestyle. It was his back. 

Imagine going into a gym, grabbing a 30 pound dumbbell and dropping it in your backpack. Then, wear this for literally the whole day. This was what Raymond was doing. Oh wait, he was also cycling uphill. Oh, and let’s throw in 500 miles to LA. 

I meekly asked him if he was okay. He replied: I’ll make it to Santa Cruz. We cycled on. 

Since I was leading these legs, I would periodically glance back to make sure everyone was in sight. After another 10 miles, I glanced back and saw Brian right behind me. But Raymond was nowhere to be found. 

A few seconds later, I got a call: 

“Jeff, I’m done. I can’t do this anymore.” 

END OF PART 1

Click here for Part 2!!!

How do I break into data science without a relevant degree, experience?


I've done dozens of interviews and ZERO people do this. If you don't have a traditional math or statistics background, this is the single best strategy to break into data science. 

If you’re looking for advice on learning technical skills, read my Quora post. 

In 2019, there are dozens of master's programs sprouting up at major universities like MIT, Berkeley, etc. Although the technical skills are important, you won't be able to compete directly with math PhDs. 

You’ll need to use the briefcase technique. And this is how it works: 

1. Informational Interview: Before applying to a company, you'll need to discover your target company's pain points. There are multiple ways to do this: 

  • Conduct an informational interview

  • Use the product and find areas of improvement 

  • Read forums, comments from users on areas of improvement

The most informative will be conducting an informational interview, which deserves a separate post. For now, I'd recommend this post. When conducting the interview, the most important question you can ask is: "What are your biggest challenges?" 

This will give us ammunition to WOW the hiring manager later on. 

2. Briefcase/Pre-Interview Project: Once you have the pain points, spend 3 to 5 hours researching the problem. Then package this research into a doc or slide deck to send to the hiring manager. The document should contain: 

  • Pain-Points

  • Project Ideas

  • Resource Requirements

  • Time estimated

  • Prioritization 

When you send this to the hiring manager, do this tactfully. DO NOT expect anything in return. Although you’ve spent time thinking about their problems, they do not owe you anything.


Hi guys, I used Machine Learning to build.... the Booki Monster!

One of the biggest problems I've noticed since entering the "real world" is that it's hard to find time for things. Going on adventures, finding a girlfriend, building new social groups, having interesting hobbies...... and reading books. Or maybe you're just too hungover....... 

But a book that might take 5-10 hours to read? Well, what if I told you that you could get the key points from the book without actually reading the book? I want to introduce you to the "Booki Monster." A machine-learning powered monster that reads your books and summarizes them for you. 

You can play with the application here. If it's slow, give it a sec. 

My goal here is for a non-technical person to understand, on a technical level, how I built my project.

1. Who is the Booki Monster?

You see that pile of books on your dresser that you're "too busy" to crack open? Well, that's the Booki Monster's food. Feed the Booki Monster your books and she'll spit out the golden nuggets in the form of summaries.

2. Feeding the Monster

You know that feeling when you're with your friends, you want to eat out, but can't decide where because there are wayyyy too many options? Feeding the Booki Monster was the same. I had too many options: science fiction, business, self-help, psychology, scientific research, etc. 

And for those who understand product marketing: when your product is for everyone, it's for no one. I'd prefer to have the Booki Monster generate high-quality summaries for a niche, targeted set of books than mediocre summaries for many books. 

So I settled on feeding the Booki Monster only business books. Plus, Blinkist.com, a company that produces human-written summaries, happily agreed to send theirs over so I could measure the quality of the Booki Monster's summaries against them. 

With this understanding, grab your nearest surgeon, and let's start dissecting the body of the Booki Monster. Mmmmm.... tasty........

3. The Booki Monster's Body (Technical)

Method

When creating the Booki Monster, I had a couple different options:

  1. Sentence Extraction: It's like DJ'ing vs. music production. Am I remixing songs already created, or am I creating new sounds? Sentence extraction is like DJ'ing: taking text already written and using it as the summary.
  2. Abstractive Methods: Abstractive Methods are kinda like creating the sounds yourself. In the context of summarizing, it means that the machine needs to understand the text on a much, much deeper level.
  3. Graph-Based: Graph-Based is more like DJ'ing than Music Production. Imagine all your Facebook friends as a fuzzy ball, where each person may have a relationship with another, with varying degrees of strength. The same model would be used for sentences, each sentence would have a relationship with each other, with varying degrees of strength.

And because I only had two weeks, DJ'ing (extraction) was the most feasible route for a one-man team.

Strategy

If you've read The Lean Startup, you'll know that Eric Ries advocates the "Minimum Viable Product" approach. In this context, my goal was to build a working model as fast as I could and then continuously iterate on it. So the way I modeled this was:

  1. Model one chapter
  2. Model one book
  3. Model five books
  4. Model 10 books

And on... and on.... you get the idea.

Rapid Automatic What.........?

As I'm typing these words on the keyboard, I'm wondering how I can explain this without boring Machine-Learning enthusiasts while making it understandable for normal people.

Sorry ML people, general audience wins here.

Imagine yourself as a puzzle-maker. Your boss gives you a beautiful sunset photo and wants you to hand-cut the pieces out. Each time you cut out a piece, you have a little snippet of that photo. In Natural Language Processing, taking a picture and cutting it into pieces is called tokenizing. To analyze text, we need to cut it up into pieces (usually each piece = a word), though it depends on the project you're working on.

In the context of this project, I wanted to tokenize on key phrases. Sometimes an author uses a phrase like "Moby Dick," which should be treated as one phrase, not two words. Extracting phrases like this is called Rapid Automatic Keyword Extraction (RAKE).
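The core trick behind RAKE is simple enough to sketch: split the text at stopwords and punctuation, and whatever runs of words survive become candidate key phrases. This is a toy illustration (real RAKE also scores each phrase by word frequency and degree), not the code from my project:

```python
import re

# A tiny stopword list for illustration; real RAKE uses a full one.
STOPWORDS = {"a", "an", "and", "the", "is", "of", "in", "to", "about", "was"}

def rake_candidates(text):
    """Split text at stopwords; the surviving runs of words are candidate phrases."""
    words = re.findall(r"[a-zA-Z']+", text.lower())
    phrases, current = [], []
    for word in words:
        if word in STOPWORDS:
            if current:
                phrases.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        phrases.append(" ".join(current))
    return phrases

print(rake_candidates("The book is about Moby Dick and a white whale."))
# ['book', 'moby dick', 'white whale']
```

Notice "moby dick" and "white whale" come out as single multi-word phrases, which is exactly the behavior I wanted.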

After passing my books & summaries through Rapid Automatic Keyword Extraction, it was time to engineer features:

Feature Engineering

To understand what's going on here, let me introduce you to this scenario:

Let's say you're waiting for your Uber and have a couple minutes to kill. So you flip out your phone and open up Facebook. You start scrolling through your newsfeed and see that Sally posted an article titled "Trump Suggests Bigger Role for U.S. in Syria's Conflict." You live in San Francisco, so you have a passionate hatred for Trump, and you click on the link. The article is kinda long, and since you're limited on time, you scan it, trying to decide if the entire thing is worth reading. You see that the article talks about "North Korea" and start thinking, "Ohh... this is interesting, I'll save this for later." When you saw the title of the news article, what keyword triggered you to click? Trump. If you're interested in foreign policy, it might've been Syria. It changes for each person, but the idea is that specific key words in the text gave you a good picture of what the article was about. And when you scan the article and see North Korea, then Trump, Syria and North Korea together already give you an idea that this article is about some problem or tension. 

This idea of a key word giving you some information about text is called a feature. Features are kinda like hints. It's saying "HEY MODEL! PAY ATTENTION TO THIS A LITTLE BIT MORE!"

In addition to key words, here are all the things I thought the model should notice:

  1. Term Frequency: The more often key words appear in a sentence, the better the sentence.

  2. Sentence Location: Sentences in the beginning are likely to be more important, since the author is often introducing the general concept of the entire book. Middle sections are usually diving into details, examples of an idea, which may not be the best sentences for summarizing.

  3. Presence of a Verb: I used a part-of-speech tagger to score the number of verbs a sentence contained. I guessed that sentences containing verbs likely had a subject-object action, which usually carries more information, and that this would filter out flowery, descriptive sentences (which aren't good for summaries).

  4. Sentence Length: I down-weighted short sentences, since a short, 4-word sentence that happens to contain a key word isn't that important.

And because I'm DJ'ing (extracting), the goal is for each sentence to get its own "score." It's like how Steph Curry averages 25.3 points, 6.6 assists and 4.5 rebounds per game. Each one of these stats is a "feature" for Steph. ESPN uses these numbers (plus many more) to create a PER score; for Steph it's 24.74. I'm trying to create the PER of sentences.
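To make the "PER for sentences" idea concrete, here's a stripped-down sketch of how a few of those features might combine into a single score per sentence. The features and weights here are illustrative, not my exact formula:

```python
import re

def score_sentence(sentence, index, total_sentences, keywords):
    """Toy 'PER for sentences': combine a few features into one number.
    The weights here are made up for illustration."""
    words = re.findall(r"[a-z']+", sentence.lower())
    if not words:
        return 0.0
    # Feature: term frequency -- how often key words appear in the sentence
    tf = sum(words.count(k) for k in keywords)
    # Feature: sentence location -- earlier sentences get a boost
    location = 1.0 - index / total_sentences
    # Feature: length -- down-weight very short sentences
    length_penalty = min(len(words) / 10.0, 1.0)
    return (tf + location) * length_penalty

keywords = {"summary", "model"}
sentences = ["The model writes a summary.", "Short one.", "It rained."]
scores = [score_sentence(s, i, len(sentences), keywords)
          for i, s in enumerate(sentences)]
best = max(range(len(sentences)), key=lambda i: scores[i])
print(sentences[best])  # the keyword-rich opening sentence wins
```

Extracting a summary is then just a matter of keeping the top-scoring sentences in their original order.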

And as you might be able to guess, there is an INFINITE number of additional things I could track. Here are a couple:

Sentence Structure: How many subjects, objects occur in the sentence? What combination of subject-object, verb, adverbs are most conducive to high-quality summary sentences?

Named Entity Tagging: If I'm reading an article about "San Francisco" and I see the words "San Francisco", "Oakland" and "San Jose", should I give more weight to these special "entities"?

Sentence position within paragraph: Topic sentences should be upweighted while the middle sentences should be down-weighted.

PageRank: Similar to how Google's Search algorithm worked, I could add a PageRank method to additionally weight scores.

Word Length: Does the number of characters in a word play a part in high-quality summaries?

Punctuation: How effective are rhetorical questions, questions, normal statements, exclamations in providing high-quality information to summaries?

And if the list keeps going, I'm either boring you or just trying to show you how smart I am (which, if you're an ML engineer, you probably don't think my modeling was smart. Well, you're wrong.).

Anyways, let's get to the sexy stuff in Data Science. We've got our data, we've got our "features", what do we do next.... drumroll please............

Modeling

You remember your first few years in college? You're excited to become independent from your parents, you get to your dorm room, your floormates become your bestest buddies, and you go on to inebriate yourself while riding the wave of independent life. And five years later, the wave crashes and you think back: "Man, I was an idiot. I would've totally made better use of my school resources, spent more time learning skills and put myself out there a bit more."

I actually don't believe in replays, because you wouldn't have known to do this if you hadn't done that; circular logic. And the same goes for modeling book summaries. The first time around, I was super excited to add to the hype fueling the buzzword "Machine Learning." But as a young Data Scientist, I am young Luke Skywalker and have many, many things to learn.

I'm going to show you what I did, and what I would do differently next time.

But to dive in, I used two different models:

  1. Latent Dirichlet Allocation

  2. Doc2Vec

And you'll probably have no idea what those mean. Let's start with what Latent Dirichlet Allocation is and why I used it:

Let's say we took the world's 8 billion people and threw 'em all in a pot. Mixed them all up together: Asians standing next to each other, Indians mixed with Arabs, English & Americans mixed. Confusing, right? And let's say you were some almighty god, and Zeus commanded you to re-organize this pot into all the original countries without going one by one. How the eff would you do that?

Well, you would use a Topic Model. If you imagine each word in a text as a person, a word likely corresponds to a specific topic. For example, in an article about food, the words "dumpling","fried rice","herbal tea","small eyes" would fall under one topic and "fat","burgers","french fries","obesity" might fall under another. Can you guess what topics they are? Yes, Chinese and American.

I chose Latent Dirichlet Allocation because it does this categorizing for me.

And since this was my first Data Science project, I wanted to make sure I had a model up and running first, so I ran out of time to try other topic models. Other ones I considered:

  • Non-Negative Matrix Factorization

  • Principal Component Analysis

  • Singular Value Decomposition

And I'm not going to exhaust you by explaining what each of those is. But I chose LDA because each word isn't bound to a specific topic; instead, each word gets a distribution over all the topics. (Sorry, non-tech folks, I don't have a good explanation for that yet.)

This model would give me the best key words to use for my scoring (explained earlier).

Here's what it looks like visually:

Here's a wordcloud of the chosen topic model for Chaos Monkey by Antonio Garcia Martinez:

 

And a wordcloud of the entire book:

 

And you might be wondering: how the heck does the model know how many categories to give the text? How does it know how many key words to choose? It doesn't. I have to decide that, and it depends on my knowledge of the text I'm modeling. I optimized my model for 10 topics & 50 key words. And I chose the topic based on my knowledge of the book (if I'd read it) or at random. 

(Eff... getting tired writing..... time for a coffee break!)

The second model I tried was Doc2Vec, which, yes, don't get too excited, is a "neural network." GASP GASP GASP

I'm being silly. You know, I need to have fun writing this.

Ok. Imagine you're standing on the surface of the earth, you've been single for way too long, and you want to find your significant other by pulling a Goku and shining a Kame-Kame-Ha light beam toward the sky. Your prospective girlfriend/boyfriend will shine their own light into the sky. The one most similar to your light is the winner.

Sorry, that's the best explanation I can do right now, and the metaphor doesn't fully represent Doc2Vec. However, the idea is that every sentence is like a beam of light shining into the sky (a vector), and we want to see how similar these sentence vectors are to the vector of the entire book. That similarity gives us the score.
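Stripped of the neural network, the scoring step is just cosine similarity: compare each sentence's vector to the whole book's vector. Here's a simplified stand-in that uses plain word-count vectors instead of learned Doc2Vec embeddings, just to show the comparison itself:

```python
import math
from collections import Counter

def vectorize(text):
    """Bag-of-words stand-in for a learned Doc2Vec embedding."""
    return Counter(text.lower().split())

def cosine(u, v):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(u[w] * v[w] for w in u)
    norm = (math.sqrt(sum(c * c for c in u.values()))
            * math.sqrt(sum(c * c for c in v.values())))
    return dot / norm if norm else 0.0

book = "startups grow by learning fast startups that learn win"
sentences = ["startups that learn fast win", "the weather was nice"]

book_vec = vectorize(book)
for s in sentences:
    # The sentence whose "beam of light" points closest to the book's wins
    print(f"{cosine(vectorize(s), book_vec):.2f}  {s}")
```

The first sentence shares the book's vocabulary, so its vector points in nearly the same direction; the second shares nothing and scores zero.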

And this is how I modeled. In the future, I would:

  1. Try a basic Logistic Regression: Can I classify a specific sentence as representative of a reference summary sentence?

  2. Try all the topic-modeling models listed: a wider variety of models would give me different insights on the text.

  3. Acquire more data and try a Convolutional Neural Network.

  4. Try a sole PageRank/Graph-Based Model.

  5. Combine all the models' scores, with weights, into a "final score" for each sentence based on the different techniques.
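The last idea in the list above can be sketched in a few lines. The scores and weights here are invented placeholders, not numbers from the project:

```python
# Hypothetical sketch of a weighted ensemble: each model scores a
# sentence, and a weighted sum gives the final score. All numbers
# below are made up for illustration.
model_scores = {"doc2vec": 0.8, "lda": 0.5, "pagerank": 0.6}
weights = {"doc2vec": 0.5, "lda": 0.2, "pagerank": 0.3}

final_score = sum(weights[m] * score for m, score in model_scores.items())
print(round(final_score, 2))
```

The weights themselves could be tuned against ROUGE-N on a held-out set of summaries.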

Scoring

Who is the better athlete, Kobe Bryant or Tom Brady? Who is the better writer, Tolstoy or Hemingway? Who is the better visionary, Steve Jobs or Bill Gates? What's better, Apple or Android? Better ad platform, Facebook or Google?

When you ask different people, you get different answers. And summaries are the same way. Is there a quantitatively sound way of saying, "Yes. This summary is dope"?

No, there isn't. But we can try. After doing some research, I found that researchers use something called the ROUGE-N score to measure the quality of summaries.

But what the heck does this score actually measure? It looks at the pairs of words (bigrams) in my Booki Monster summary, checks how many of those pairs also occur in the human-written summary, and then takes a ratio.
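Here's a simplified sketch of that computation. Note this is an illustration, not the official ROUGE implementation: real ROUGE-N also counts repeated n-grams, while this toy version uses sets of distinct bigrams.

```python
# Simplified ROUGE-2-style score: the fraction of word pairs (bigrams)
# in the reference summary that also appear in the generated summary.
def bigrams(text):
    words = text.lower().split()
    return {(a, b) for a, b in zip(words, words[1:])}

reference = "the model favors long sentences"
generated = "the model favors long detailed sentences"

overlap = bigrams(reference) & bigrams(generated)
score = len(overlap) / len(bigrams(reference))
print(round(score, 2))  # 3 of the 4 reference bigrams match: 0.75
```

A higher ratio means the generated summary shares more word pairs with the human one.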

Here are the scores:

Doc2Vec Split in 10: 0.241 (+0.126 over random)

LDA Split in 10: 0.176 (+0.062 over random)

Random Split in 10: 0.114 (baseline)

Note: "Random" means I built a model that randomly takes sentences from the book and calls the result "the summary." Because what the heck is the point if a monkey can write summaries just as good as the Booki Monster's? And these numbers don't mean anything unless we have a baseline.

As you can see, the Kame-Kame-Ha Method (Doc2Vec) scored 12.6 points better than random, and LDA scored 6.2 points better.

Conclusions

As a "Scientist," I've gotta extract some insights from all this "stuff." Let's take the cake out of the oven! <----- bad metaphor, but whatever.

  1. Better to be used for previews than summaries: Because I was DJ'ing/extracting, I knew that an author's writing style is going to be different from a summary's. Authors tend to write their books knowing they have many, many pages to articulate an idea. As a result, their sentences contain more detail, and authors are willing to dive into technicalities a bit more because they have enough space to explain a term they'll use for the rest of the book. Human summary writers, by contrast, condense the writing into fewer words, while diluting the arguments behind the concepts.

  2. Booki Monster loves long, meaty sentences (I thought about making a sexual joke here, but nah. This is professional): Look at the average sentence length:

 

In addition, I created a quick regression of word/sentence against ROUGE-N score to look at the relationship.

 

Notice that the average words/sentence for the Doc2Vec summaries is about 20 words/sentence longer than in the reference summaries, which supports my first point as well. This finding leads me to claim that the model biases a bit towards longer sentences, which makes sense given the scoring method. A longer sentence has a higher likelihood of containing pairs of words that match pairs in the reference summaries, which boosts the ROUGE-N score. On the upside, this method does eliminate low-information short sentences.
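The words-per-sentence stat above can be computed with a quick sketch like this (the sample text is made up; real summaries need smarter sentence splitting than a period split):

```python
# Naive words-per-sentence calculation: split on periods, then
# average the word counts. Good enough for a rough diagnostic.
def avg_words_per_sentence(text):
    sentences = [s for s in text.split(".") if s.strip()]
    return sum(len(s.split()) for s in sentences) / len(sentences)

sample = "One two three. Four five."
print(avg_words_per_sentence(sample))  # (3 + 2) / 2 = 2.5
```

Comparing this number between generated and reference summaries is what exposed the long-sentence bias.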

In the future, how can I avoid over-weighting long sentences while still keeping good short ones?

  3. Human summarizers emphasize different key points: Summaries, like most writing, are subjective. A human summarizer decides which key points they think the reader will find interesting. However, every reader asks different questions when reading a book. An older man may be wondering how he can find peace for the rest of his life, while a teenage girl may be trying to figure out what she should do with hers. Different questions, different answers, different summaries.

A future solution could be a query-based summarization method, where the user inputs a specific question and the model writes the summary based on that question.

Future

In the future, there are many things I may be able to try:

  1. Learn summarization framework: Similar to the grade-school five paragraph format, I can teach the Booki Monster a summarization-writing framework. This can improve the coherence and flow of ideas within the summary.

  2. Human Feedback: Scoring is hard. Like I said before, summaries are subjective. In the future, having the model create summaries and get user feedback can add a human element to summary creation.

  3. Query-Based Summary: Have users input questions, and the model creates summaries based on those questions.

All in all, I hope you enjoyed reading this as much as I enjoyed writing and building it. My journey into the world of Data Science is only beginning, and I'll be creating many more monsters to come!

Example

Collapse by Jared Diamond

As for the complications, of course it’s not true that all societies are doomed to collapse because of environmental damage in the past, some societies did while others didn’t; the real question is why only some societies proved fragile, and what distinguished those that collapsed from those that didn’t. Some societies that I shall discuss, such as the Icelanders and Tikopians, succeeded in solving extremely difficult environmental problems, have thereby been able to persist for a long time, and are still going strong today.

Some of my Montana friends now say in retrospect, when we compare the multi-billion dollar mine cleanup costs borne by us taxpayers with Montana’s own meager past earnings from its mines, most of whose profits went to shareholders in the eastern U.S. or in Europe, we realize that Montana would have been better off in the long run if it had never mined copper at all but had just imported it from Chile, leaving the resulting problems to the Chileans! After living for so many years elsewhere, I found that it took me several visits to Montana to get used to the panorama of the sky above, the mountain ring around, and the valley floor below to appreciate that I really could enjoy that panorama as a daily setting for part of my life and to discover that I could open myself up to it, pull myself away from it, and still know that I could return to it.

One person said that Balaguer might have been influenced by exposure to environmentalists during early years in his life that he spent in Europe; one noted that Balaguer was consistently anti Haitian, and that he may have sought to improve the Dominican Republic’s landscape in order to contrast it with Haiti’s devastation; another thought that he had been influenced by his sisters, to whom he was close, and who were said to have been horrified by the deforestation and river siltation that they saw resulting from the Trujillo years; and still another person commented that Balaguer was already 60 years old when he ascended to the post-Trujillo presidency and 90 years old when he stepped down from it, so that he might have been motivated by the changes that he saw around him in his country during his long life.

Sources

Automatic Extraction Based Summarizer - R. M. Aliguliyev

Latent Dirichlet Allocation Based Multi-Document Summarization - Rachit Arora, Balaraman Ravindran

Looking for a Few Good Metrics: ROUGE and its Evaluation - Chin-Yew Lin

Sentence Extraction Based Single Document Summarization - Jagadeesh J, Prasad Pingali, Vasudeva Varma

Distributed Representations of Sentences and Documents - Quoc Le, Tomas Mikolov

Latent Dirichlet Allocation - David Blei, Andrew Ng, Michael Jordan

LDA2Vec - Chris Moody