Python for Data Science

( Reblogging from – )

Setting Up Scientific Python

I’ve found that one of the most difficult parts of using the Scientific Python libraries is getting them installed and set up on my computer. pandas, scipy, numpy, and sklearn make heavy use of C/C++ extensions, which can be difficult to compile and configure on whatever flavor of OS you use. In this post I’ll go over the easiest way to install the libraries you need to get up and running with Scientific Python.

Getting Python

Step 1 is getting Python, of course! If you don’t already have Python 2.7 installed on your computer, select one of the available distributions. Double-check first that you don’t already have Python 2.7 installed; many UNIX distributions ship with it.

Installing pip

Next you need to install pip, the Python package manager.


$ curl
$ python


Download Christoph Gohlke’s installer


Debian, Ubuntu
$ apt-get install python-pip
CentOS, Fedora
$ yum -y install python-pip

Make sure pip is on your PATH. If it isn’t, add the python/scripts directory to your PATH.

Enthought Free Distribution

Enthought, which provides commercial support for Scientific Python, is nice enough to publish an installer that works on Windows, OS X, and Linux. This eliminates a lot of the headaches of having to compile libraries and ensures you get the most stable versions. There are different tiers of installers, including paid versions, but for most people the free version is all you’ll need. Their website is a little tricky to navigate (they sort of funnel you to the non-free versions), but here’s the page you want. Select the distribution for your OS and it’ll start the download. This part could take a while. Packed into the installer are the following libraries:

  • scipy
  • numpy
  • ipython
  • matplotlib
  • pandas
  • sympy
  • nose
  • traits
  • chaco

Once the download finishes, double-clicking the installer will get you set up with everything, including adding all the libraries to your Python path.

Installing sklearn, statsmodels, and patsy

Now that we’ve got the core libraries installed, it’s time to add some fun stats packages. The Enthought distribution took care of the compiled dependencies. pip makes installing these libraries a breeze:

$ pip install --upgrade scikit-learn
$ pip install --upgrade statsmodels
$ pip install --upgrade patsy

These libraries are going to start spitting out a lot of garbage into the terminal during the install. Don’t worry, this is normal! You might want to take this time to have someone non-technical come by your computer.

Development environment

For ease of use and interactive computing, use IPython, which also makes it easy to share your work as an IPython Notebook.

That’s it! You should be ready to go. If you run into any problems (this typically happens if you have previous versions of the libraries installed), check Stack Overflow or the numpy/pandas/sklearn docs.
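A quick way to confirm the install worked is to try importing each library and print its version. This is a minimal sketch, not part of any of the installers above: the `check_libraries` helper and the `LIBRARIES` list are names I made up for illustration, and the list simply mirrors the packages mentioned in this post.

```python
import importlib

# Import names of the libraries mentioned in this post (an illustrative list).
LIBRARIES = ["numpy", "scipy", "pandas", "matplotlib",
             "sklearn", "statsmodels", "patsy"]


def check_libraries(names=LIBRARIES):
    """Return {name: version string}, with None for anything that fails to import."""
    report = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # A missing package raises ImportError; a broken compiled
            # extension can raise other errors, so catch broadly here.
            report[name] = None
        else:
            report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in sorted(check_libraries().items()):
        print("{:<12} {}".format(name, version or "NOT INSTALLED"))
```

Anything reported as NOT INSTALLED is a good candidate for another `pip install --upgrade` attempt.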

Parsing my Inbox: use-cases and some code

Our inbox is a great snapshot of things that were important to us at some point in time (assuming you are an email hoarder and not an inbox-zero proponent!). So for some time I was obsessed with all the use-cases for parsing and understanding 10 years of my email (yes, it has been 10 years for Gmail and me!).


1. Sentiment analysis of email

2. Detecting groups or networks of users (work vs. family vs. room-mates)

3. Email fatigue detection

4. Analytics of firsts, seconds, and emails with large attachments, etc.

How do we parse emails? 

You could check out some code I put together for parsing a Thunderbird dump of my inbox, here on GitHub.
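To give a flavor of what the parsing looks like, here is a minimal sketch using only the standard library. It is not the code from my GitHub repository. It assumes the Thunderbird dump is a plain mbox file and that you are running Python 3 (for `parsedate_to_datetime`); the function names `summarize_mbox` and `count_by_sender` are made up for this example.

```python
import mailbox
from collections import Counter
from email.utils import parseaddr, parsedate_to_datetime


def summarize_mbox(path):
    """Return one dict per message with its sender, date, and subject."""
    summaries = []
    # create=False makes a wrong path fail loudly instead of creating a file.
    box = mailbox.mbox(path, create=False)
    try:
        for message in box:
            # parseaddr splits "Alice <alice@example.com>" into (name, address).
            _, sender = parseaddr(str(message.get("From", "")))
            raw_date = message.get("Date")
            try:
                sent = parsedate_to_datetime(str(raw_date)) if raw_date else None
            except (TypeError, ValueError):
                sent = None  # unparseable Date header
            summaries.append({
                "sender": sender.lower(),
                "date": sent,
                "subject": str(message.get("Subject", "")),
            })
    finally:
        box.close()
    return summaries


def count_by_sender(summaries):
    """Count messages per sender address, a first step toward group detection."""
    return Counter(row["sender"] for row in summaries)
```

From here, `count_by_sender(summarize_mbox("Inbox"))` (with whatever path your dump lives at) gives a rough picture of who writes to you most, and the `date` field is enough to feed a timeline visualization.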

What are some libraries for visualizing the analysis?

Email timeline visualization

– SIMILE:

– Highcharts JavaScript:

Visualizing groups and people – Immersion project at MIT

Enron email dataset

Weka: Out of memory

Being a fan of Java, when the time came to pick up machine learning skills, I was introduced to Weka and instantly loved it. It was the first machine learning library that I could use, and its rich support for visualizations meant that I could get a sneak peek under the hood. Weka also lets you do exploratory data analysis and run multiple algorithms on your datasets. It was a major part of my learning of machine learning concepts.

From time to time I hear users complain about memory problems in Weka and declare it useless for any serious machine learning problem. I would gladly agree with any researcher who means it when he or she says Weka is not scalable, but I tend to disagree with novice users of Weka when they diss it for its poor memory management.

Often enough, the cause of memory problems with Weka is people not trying hard enough.
In terms of memory requirements, most machine learning algorithms fall into one of four quadrants:

Training time vs. Test time

Here are two ways to get the most out of Weka, one harder than the other:

1. Increasing memory for the JVM environment

java -Xmx4g -Xms3g -jar weka.jar

2. API – Train using the GUI, but Test via the API

Most machine learning models are trained on relatively small datasets (due to the scarcity of labeled data). However, they need to be run on large datasets. If you use Weka to do both training and testing in a single command, you can end up with an out-of-memory error.

In such cases, where you know you can train your algorithm well and perform cross-validation etc. but are unable to test it on large datasets, consider using the API, where you deploy the model on one instance at a time.
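The one-instance-at-a-time idea is independent of Weka, so here is a minimal sketch of the pattern in Python rather than against the Weka Java API. Everything in it is a stand-in: `threshold_model` plays the role of a model you already trained, and `generate_rows` plays the role of reading a large test file row by row. The point is only that memory use stays flat because no more than one row is held at a time.

```python
def predict_stream(rows, predict_one):
    """Lazily yield one prediction per row without materializing the dataset."""
    for row in rows:
        yield predict_one(row)


def threshold_model(row):
    """Hypothetical stand-in for a trained model: label by the sum of features."""
    return "pos" if sum(row) > 1.0 else "neg"


def generate_rows(n):
    """Hypothetical stand-in for streaming n rows from a large test file."""
    for i in range(n):
        yield ((i % 3) * 0.5, 0.25)


def tally(labels):
    """Count predicted labels as they arrive."""
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return counts
```

Calling `tally(predict_stream(generate_rows(1000000), threshold_model))` scores a million rows while only ever keeping the running counts in memory. With Weka you would load the trained model once and make the equivalent per-instance call in a loop.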

The Data Science approach

Colleagues and friends often ask me the following questions, and very often no simple answer comes to mind. In this post, I will try to answer some of them.

1. Who is a data-scientist? 

To date, the most comprehensive answer, accompanied by a brilliant viz:

2. What is the data science approach?

Exploratory Analysis: Starts with Big Data

Big Data and Broad Data:  


Business Deliverable: 

3. How is it different from business analytics, statistics, or research science?

DS vs. Statisticians 

Inference vs. Prediction

Structured vs. Unstructured

DS vs. Business Analytics 

Analysis vs. Product/Service

Centralized data vs. Distributed Data

Structured vs. Unstructured

DS vs. Research Scientists

Applied Domain-specific vs. Generic

Algorithm vs. Product/Service

4. Building a data-science team

Finally, having worked on more than a few data-science team configurations, I think careful attention needs to be paid to putting together a team that aligns well with the business. There is no single working mantra for doing this, but my observation is that a successful data-science team is very often not the one with the largest number of data scientists on it. In general, a well-functioning data-science team should have a good mixture of the following roles:

– The Explorer (‘aka’ a data scientist)

– The Finisher (‘aka’ a software engineer)

– The Researcher (‘aka’ a research scientist)

– The Sanity Checker (‘aka’ the analyst)

– The Communicator (‘aka’ the product manager)

Sentiment Analysis (Social Media)


1. Lexicon-based  (L)

Using polarity lexicons, classify into one class or the other

2. Binary Classification: Bag-of-words

Build a classifier using labeled data, where the features are a simple bag of words

3. Binary Classification: Bag-of-words + Ngrams

Same as above with addition of bigram features

4. MultiClass Classification (Pos,Neg,Neut): Bag-of-words + Ngrams

Same as above with addition of bigram features and also modeling for Neutral Class along with Positive and Negative

5. DeepLearning (RAE) Classification (Pos,Neg,Neut): Bag-of-words

Using deep learning (recursive autoencoder) techniques to train classifiers for sentiment

6. Semi-supervised Learning based Classification (Pos,Neg,Neut): Bag-of-words + Ngrams

Can we use semi-supervised learning approaches to enhance either the LEXICON or the TRAINING data for the above classifiers?
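To make the first few approaches concrete, here is a minimal sketch of a lexicon-based labeler (approach 1) and of bag-of-words plus bigram feature extraction (approaches 2 to 4). The tiny `POSITIVE` and `NEGATIVE` word sets are invented for illustration and are not a real polarity lexicon; in practice the features would be fed to whatever classifier you train on labeled data.

```python
import re
from collections import Counter

# Toy polarity lexicons, made up for this example.
POSITIVE = {"good", "great", "love", "sweet"}
NEGATIVE = {"bad", "awful", "hate", "slow"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def lexicon_label(text):
    """Label text pos/neg/neut by counting lexicon hits (approach 1)."""
    tokens = tokenize(text)
    score = sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)
    if score > 0:
        return "pos"
    if score < 0:
        return "neg"
    return "neut"


def bow_bigram_features(text):
    """Count unigrams and adjacent-word bigrams (features for approaches 2 to 4)."""
    tokens = tokenize(text)
    bigrams = [" ".join(pair) for pair in zip(tokens, tokens[1:])]
    return Counter(tokens + bigrams)
```

Note that `lexicon_label("not good")` comes out positive, since the lexicon alone knows nothing about negation, whereas the bigram feature "not good" at least gives a trained classifier a chance to learn it.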

Other ideas

  • Look at Modeling Neutral class in a better way
  • Spelling correction (“swweeeettt” -> “sweet”)
  • New Features
    • Handle negation
    • Stemming the words for better match with lexicon
    • Emoticon and distance from keyword “google”
    • What were the social network dynamics of the tweet (who, how many times)?
    • Phrasal Lexicons vs. Unigram BOW models (RAE does a bit of that)
    • Detect marketing campaign tweets
  • Semi-supervised learning to get more labels
  • Identify polarity (subjectiveness) in tweets followed by detection of negative vs. positive
  • Target-dependent Twitter sentiment: is the “google” keyword the central focus of the tweet?
  • Joint Topic detection (Aspect) + Sentiment
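Two of the ideas above are easy to prototype: spelling correction of elongated words and negation handling. The sketch below is one possible take, under assumptions of my own. `VOCAB` is a toy vocabulary, the elongation fix just tries collapsing every run of repeated letters to one or two copies until it hits a known word, and the negation step uses the common trick of prefixing tokens with `NOT_` until the next punctuation mark.

```python
from itertools import groupby, product

# Toy vocabulary and negation cues, made up for this example.
VOCAB = {"sweet", "good", "cool", "so"}
NEGATIONS = {"not", "no", "never", "don't", "can't", "won't"}
PUNCTUATION = {".", ",", "!", "?", ";"}


def normalize_elongated(word, vocab=VOCAB):
    """Map e.g. 'swweeeettt' to 'sweet' by shrinking repeated-letter runs."""
    runs = [(ch, len(list(group))) for ch, group in groupby(word.lower())]
    # Each repeated run may stand for one or two copies of the letter.
    options = [[ch] if n == 1 else [ch, ch * 2] for ch, n in runs]
    for parts in product(*options):
        candidate = "".join(parts)
        if candidate in vocab:
            return candidate
    return word  # no known word found, leave it untouched


def mark_negation(tokens):
    """Prefix tokens after a negation cue with NOT_ until punctuation."""
    marked, negated = [], False
    for token in tokens:
        if token in PUNCTUATION:
            negated = False
            marked.append(token)
        elif token.lower() in NEGATIONS:
            negated = True
            marked.append(token)
        else:
            marked.append("NOT_" + token if negated else token)
    return marked
```

For example, `normalize_elongated("swweeeettt")` returns "sweet", and `mark_negation(["i", "do", "not", "like", "it", ".", "great"])` turns "like" and "it" into "NOT_like" and "NOT_it" while leaving "great" alone. Running the normalizer before the lexicon lookup should improve the match rate against the lexicon.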