Thursday, April 14, 2016

spark 1.6.1 python 3.5.1 building naive bayes classifier


My question is based upon this.

  1. Would it be possible to get more detailed comments on, or an explanation of, the code starting from the line tf = HashingTF().transform(training_raw.map(lambda doc: doc["text"], preservesPartitioning=True))?
  2. How could I print the confusion matrix?
  3. What does the error below mean? How can I fix it? The model still gets built and I get predictions.

    >>> # Train and check
    ... model = NaiveBayes.train(training)
    [Stage 2:=============================> (2 + 2) / 4]16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
    16/04/05 18:18:28 WARN BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS

  4. How could I print the result for a new observation? I tried and failed:

    >>> model.predict("love")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\classification.py", line 594, in predict
        x = _convert_to_vector(x)
      File "c:\spark-1.6.1-bin-hadoop2.6\spark-1.6.1-bin-hadoop2.6\python\pyspark\mllib\linalg\__init__.py", line 77, in _convert_to_vector
        raise TypeError("Cannot convert type %s into Vector" % type(l))
    TypeError: Cannot convert type <class 'str'> into Vector

1 Answer

1. HashingTF in Spark is similar to scikit-learn's HashingVectorizer: it hashes each term to a vector index and counts occurrences, so no vocabulary has to be built first. training_raw is an RDD of text. For a detailed explanation of the available vectorizers in PySpark, see Vectorizers. For a complete example, see this post.
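To make the hashing idea concrete, here is a minimal plain-Python sketch of what a hashing term-frequency transform does. It is not Spark's implementation: the function name `hashing_tf`, the CRC32 hash, the tiny `num_features=16`, and the sample records standing in for `training_raw` are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features=16):
    """Map a list of tokens to a sparse {index: count} term-frequency dict."""
    counts = Counter()
    for token in tokens:
        # Hash the token and fold it into the range [0, num_features).
        # CRC32 is used only because it is stable across runs (an assumption,
        # not the hash Spark uses).
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


# Hypothetical stand-in for training_raw: records with a "text" field.
training_raw = [
    {"label": 1.0, "text": "i love spark"},
    {"label": 0.0, "text": "i hate bugs i hate waiting"},
]

# Mirrors the shape of the questioned line: pull out the text, then hash it.
tf = [hashing_tf(doc["text"].split()) for doc in training_raw]
```

Because the index depends only on the token, the same word always lands in the same slot, which is why new text must be hashed the same way as the training data. Two different words can also collide in one slot; a larger `num_features` makes that less likely.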

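2. BLAS is the Basic Linear Algebra Subprograms library. The warnings mean that Spark could not load a native BLAS implementation on your machine; as far as I know it then falls back to a pure-Java implementation, which is why the model still gets built. You can check out this page on GitHub for a potential solution.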

3. You are trying to use model.predict on a string ("love"). You must first convert the string to a vector. A simple example that parses a line of the form "label,f1 f2 ..." into a LabeledPoint with a dense feature vector is:

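    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.regression import LabeledPoint

    def parseLine(line):
        # Expected input: "<label>,<f1> <f2> ...", e.g. "1.0,0.5 0.0 2.0"
        parts = line.split(',')
        label = float(parts[0])
        features = Vectors.dense([float(x) for x in parts[1].split(' ')])
        return LabeledPoint(label, features)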

For text, you are probably looking for a sparse vector, since most entries of a hashed term-frequency vector are zero. So try Vectors.sparse.
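As a rough illustration of the sparse form, the sketch below turns a new piece of text into the (size, indices, values) triple that Vectors.sparse accepts. It is plain Python and does not call Spark; the helper `to_sparse_parts`, the CRC32 hash, and the size of 16 are assumptions made for the example only.

```python
import zlib
from collections import Counter

NUM_FEATURES = 16  # assumption: tiny size chosen for readability


def to_sparse_parts(text, num_features=NUM_FEATURES):
    """Return (size, indices, values) for whitespace-separated text."""
    counts = Counter(
        zlib.crc32(token.encode("utf-8")) % num_features
        for token in text.split()
    )
    indices = sorted(counts)  # keep indices in increasing order
    values = [float(counts[i]) for i in indices]
    return num_features, indices, values


size, indices, values = to_sparse_parts("love")
# With PySpark available, the triple could be passed on as:
#   Vectors.sparse(size, indices, values)
```

In PySpark itself, the more direct route is probably to reuse the HashingTF instance from training, along the lines of `model.predict(htf.transform("love".split()))`, where `htf` is a hypothetical name for that instance. The important part is that the new vector is indexed the same way as the training vectors.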
