Articles posted by Ludwine Probst

Eléonore Koffi, the AmazoOn du Web

Last September Axelle Lemaire, French Secretary of State for Digital Affairs, travelled to Abidjan and met the local players of the digital sector.

Off to Côte d'Ivoire to meet one of them: Eléonore Koffi, president of the AmazoOn du Web.

Read more

Kaouthar Azzoune, "Code… why not a girls' thing?"

This year I met Kaouthar, a high-school student following the WI-Filles programme, an initiative led by Salwa Toko to teach coding and new technologies. Then last March Kaouthar took part in our video "Fière d'être développeuse" ("Proud to be a developer") to share her passion for code.

Read more

USI interview: Deep Learning with Yann Le Cun, researcher and director of AI research at Facebook

You may already have heard of machine learning, since it has been one of the buzzwords lately. Well, within artificial intelligence there is also the field of deep learning. There is even a meetup group on the topic in Paris.

« Wherever there is data, there is Machine Learning »

 


 

Yann Le Cun is an internationally renowned researcher in artificial intelligence and deep learning. His research has notably focused on convolutional neural networks, a technique used for instance in image and object recognition. Yann is also the director of AI research at Facebook.

 

We met and interviewed him at the USI conference on 2 & 3 July, where he gave a session on deep learning.

An interview with Yann Le Cun to talk about deep learning, and more!

 


Download the interview in ogg or mp3

Interview by Amira Lakhal and Ludwine Probst

Outline of the interview

00'00 – What is deep learning?

1'29 – Where do we find deep learning in our everyday lives?

2'31 – Why such enthusiasm for deep learning?

5'32 – Are there limits today to going even further, and if so, what are they?

6'50 – And what about video: is there research being done in that area?

8'58 – Can you tell us more about the Facebook AI research lab that has just opened in Paris? Are you hiring, and which profiles?

 

Thank you to Yann Le Cun for this interview. You can follow him on Twitter.

Getting started with Spark in practice

A few months ago, Sam Bessalah and I organized a workshop via Duchess France to introduce Apache Spark and its ecosystem.

This post aims to quickly recap the basics of the Apache Spark framework, and it describes the exercises provided in this workshop (see the Exercises part) to get started with Spark (1.4), Spark Streaming and DataFrames in practice.

If you want to get started with Spark and some of its components, the exercises of the workshop are available in both Java and Scala on this github account. You just have to clone the project and go! If you need help, take a look at the solution branch.


With MapReduce/Hadoop frameworks, the way to reuse data between computations is to write it to an external stable storage system like HDFS. This is not very efficient for iterative jobs because it suffers from I/O overhead.

If you want to reuse data, in particular if you want to use machine learning algorithms, you need to find a more efficient solution. This was one of the motivations behind the creation of the Spark framework: to develop a framework that works well with data reuse.

Other goals of Apache Spark were to design a programming model that supports more than the MapReduce patterns, while keeping automatic fault tolerance.

In a nutshell Apache Spark is a large-scale in-memory data processing framework, just like Hadoop, but faster and more flexible.

Furthermore Spark 1.4.0 includes standard components: Spark Streaming, Spark SQL & DataFrame, GraphX and MLlib (the machine learning library). These components can be combined seamlessly in the same application.

Here is an example of how to use Spark and MLlib on data coming from an accelerometer.

 

RDD

Spark has one main abstraction: Resilient Distributed Datasets or RDDs.

An RDD is an immutable collection partitioned across the nodes of the cluster which can be operated on in parallel.

You can control the persistence of an RDD:

  • RDDs can be cached in memory between queries if you need to reuse them (which improves performance)
  • you can also persist an RDD on disk

RDDs support two types of operations (transformations and actions):

  • a transformation creates another RDD and is a lazy operation (for example: map, flatMap, filter, groupBy…)
  • an action returns a value after running a computation (for example: count, first, take, collect…)

You can chain operations together, but keep in mind that the computation only runs when you call an action.
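
To make the laziness concrete, here is a minimal Scala sketch (the workshop provides both Java and Scala versions); it assumes an existing SparkContext named sc and a hypothetical input file path:

// assumes an existing SparkContext `sc` (as in the spark-shell) and a made-up file path
val lines = sc.textFile("data/tweets.txt")     // creates an RDD, nothing is computed yet
val longLines = lines.filter(_.length > 100)   // transformation: still lazy
longLines.cache()                              // ask Spark to keep this RDD in memory for reuse
val count = longLines.count()                  // action: this triggers the actual computation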

 

Operation on RDDs: a global view

Here is a general diagram to understand the data flow.

(diagram: data flow of operations on RDDs across the cluster)

  1. On the left, the input data comes from external storage. The data is loaded into Spark and an RDD is created.
  2. The big orange box represents an RDD with its partitions (small orange boxes). You can chain transformations on RDDs. Since the transformations are lazy, the partitions are only sent across the nodes of the cluster when you call an action on the RDD.
  3. Once a partition is located on a node, you can continue to operate on it.

N.B.: all operations applied to RDDs are recorded in a DAG (directed acyclic graph): this is the lineage principle. Thanks to this DAG, if a partition is lost, Spark can automatically rebuild it.

 

An example: the wordcount

Here is the wordcount example: it is the "hello world" of MapReduce.

The goal is to count how many times each word appears in a file, using the MapReduce pattern.

https://gist.github.com/nivdul/0b84c5184ae42278b02f#file-wordcount

First, the mapper step: we assign a count of 1 to each word using the map transformation.

Then, the reducer step: here the key is a word, and reduceByKey, which is a transformation, sums the counts for each word.

https://gist.github.com/nivdul/0b84c5184ae42278b02f
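
The gists above contain the workshop's Java and Scala versions; for reference, a minimal Scala sketch of the same wordcount could look like this (the file path is a placeholder):

// assumes an existing SparkContext `sc`; the input path is made up
val counts = sc.textFile("data/wordcount.txt")
  .flatMap(line => line.split(" "))   // split each line into words
  .map(word => (word, 1))             // mapper step: emit (word, 1)
  .reduceByKey(_ + _)                 // reducer step: sum the 1s for each word

counts.take(10).foreach(println)      // action: triggers the whole computation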

Exercises

In this workshop the exercises focus on using the Spark core and Spark Streaming APIs, as well as DataFrames for data processing. The workshop is available in Java (1.8) and Scala (2.10). To help you implement each class, unit tests are included and the code contains a lot of comments.


Prerequisites

To do the exercises below, you need to have Java 8 installed (better to use the lambda expressions). Spark 1.4.0 uses Scala 2.10, so you will need a compatible Scala version (2.10.x); here we use 2.10.4.
As build manager, this hands-on uses Maven for the Java part and sbt for the Scala one. As unit test libraries, we use JUnit (Java) and ScalaTest (Scala).

All exercises run in local mode as standalone programs.

To work on the hands-on, retrieve the code via the following command line:
#Scala
$ git clone https://github.com/nivdul/spark-in-practice-scala.git

#Java
$ git clone https://github.com/nivdul/spark-in-practice.git

Then you can import the project in IntelliJ or Eclipse (add the SBT and Scala plugins for Scala), or use sublime text for example.

If you want to use the spark-shell (Scala/Python only), you need to download a binary Spark distribution from the Spark download page.

# Go to the Spark directory
$ cd /spark-1.4.0

# First build the project
$ build/mvn -DskipTests clean package

# Launch the spark-shell
$ ./bin/spark-shell
scala >

 

Part 1: Spark core API

To be more familiar with the Spark API, you will start by implementing the wordcount example (Ex0).

After that you will use reduced tweets in a JSON format as the data for data mining (Ex1–Ex3). It will give you a good insight into the basic functions provided by the Spark API.

https://gist.github.com/nivdul/b41f54f02b983bc0bf05#file-reduced_tweets-json

You will have to:

  • Find all the tweets by user
  • Find how many tweets each user has
  • Find all the persons mentioned in tweets
  • Count how many times each person is mentioned
  • Find the 10 most mentioned persons
  • Find all the hashtags mentioned in a tweet
  • Count how many times each hashtag is mentioned
  • Find the 10 most popular hashtags

The last exercise (Ex4) is a bit more complicated: the goal is to build an inverted index, knowing that an inverted index is the data structure used to build search engines.

Assuming #spark is a hashtag that appears in tweet1, tweet3 and tweet39, the inverted index will be a Map that contains the (key, value) pair (#spark, List(tweet1, tweet3, tweet39)).
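
As an illustration only (the exercise in the repository defines its own tweet class and parses the JSON file), a Scala sketch of the inverted index idea might look like this:

// a hypothetical Tweet case class and a couple of made-up tweets
case class Tweet(id: Long, text: String)

val tweets = sc.parallelize(Seq(
  Tweet(1, "learning #spark at the workshop"),
  Tweet(3, "#spark and #cassandra play well together")
))

val invertedIndex = tweets
  .flatMap(t => t.text.split(" ").filter(_.startsWith("#")).map(tag => (tag, t.id)))
  .groupByKey()                 // (#spark, Iterable(1, 3)), (#cassandra, Iterable(3)), ...

invertedIndex.collect().foreach(println)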

Part 2: streaming analytics with Spark Streaming

Spark Streaming is a Spark component for processing live data streams in a scalable, high-throughput and fault-tolerant way.


In fact Spark Streaming receives live input data streams and divides the data into batches, which are then processed by the Spark engine to generate the final stream of results in batches.

The abstraction that represents a continuous stream of data is the DStream (discretized stream).

In the workshop, Spark Streaming is used to process a live stream of Tweets using twitter4j, a library for the Twitter API.

To be able to read the firehose, you will first need to create a Twitter application at http://apps.twitter.com, get your credentials, and add them in the StreamUtils class.

In this exercise you will:

  • Print the status text of some of the tweets
  • Find the 10 most popular hashtags over a 1-minute window (a sketch is shown below)
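
Here is a Scala sketch of the top-hashtags part, assuming the spark-streaming-twitter and twitter4j dependencies are on the classpath and that your Twitter credentials are already configured (in the workshop this is handled by the StreamUtils class):

import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.twitter.TwitterUtils

val ssc = new StreamingContext(sc, Seconds(2))     // 2-second micro-batches
val stream = TwitterUtils.createStream(ssc, None)  // None: credentials come from twitter4j properties

val hashtags = stream.flatMap(status => status.getText.split(" ").filter(_.startsWith("#")))

hashtags.map(tag => (tag, 1))
  .reduceByKeyAndWindow(_ + _, Seconds(60))        // counts over a 1-minute window
  .map { case (tag, count) => (count, tag) }
  .transform(_.sortByKey(ascending = false))       // most popular first
  .print()                                         // prints the first 10 elements of each batch

ssc.start()
ssc.awaitTermination()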

 

Part 3: structured data with the DataFrame

A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood.
In fact, the Spark SQL/DataFrame component provides a SQL-like interface, so you can apply SQL-like queries directly on your RDDs once they are turned into DataFrames.

DataFrames can be constructed from different sources such as: structured data files, tables in Hive, external databases, or existing RDDs.


In the exercise you will:

  • Print the DataFrame
  • Print the schema of the DataFrame
  • Find people who are located in Paris
  • Find the user who tweets the most (a sketch is shown below)
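
A Scala sketch with Spark 1.4's DataFrame reader; the column names ("user", "place") and the file path are assumptions, the workshop's JSON schema may differ:

import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.desc

val sqlContext = new SQLContext(sc)
val tweets = sqlContext.read.json("data/reduced-tweets.json")

tweets.show()                                      // print the DataFrame
tweets.printSchema()                               // print its schema
tweets.filter(tweets("place") === "Paris").show()  // people located in Paris
tweets.groupBy("user").count()                     // tweets per user
      .orderBy(desc("count"))
      .show(1)                                     // the user who tweets the most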

 

Conclusion

If you find a better way/implementation, do not hesitate to send a pull request or open an issue on github.

Here are some useful links around Spark and its ecosystem:

The USI 2015 edition returns on 2 & 3 July


On 2 & 3 July, USI 2015 will take place at the Carrousel du Louvre in Paris.

USI is a conference where opinion leaders, scientists, philosophers and entrepreneurs meet, exchange and share their experiences.

The talks of this new edition will focus on Big Data, Lean Management, digitalization and the Internet of Things, with prestigious speakers.

Several Duchess France members will be there to attend the sessions and do interviews!

A look at some of the USI speakers

Once again this year, USI looks promising, with a rich line-up of speakers including:

Hilary Mason, CEO of Fast Forward Labs, will explain how to become a true "Data Driven Company".

Peter Norvig, Director of Research at Google, will talk about the value of putting a machine learning system in place in any kind of company.

Yann LeCun, director of AI research at Facebook, will talk about deep learning.

Cédric Villani, mathematician and 2010 Fields Medalist, will speak on the theme "Pour faire naître une idée" (how an idea is born).

Practical information

Full list of speakers: http://www.usievents.com/fr/speakers

More information: http://www.usievents.com/fr/information

Registration: http://www.usievents.com/fr/registrations

Videos of previous editions: https://www.youtube.com/user/usievents/videos

 

Taking the plunge: submitting a talk to a conference

One of the things we care deeply about is seeing more women speak at technical conferences. So over the past few years we have launched actions to address this: Call for Papers preparation workshops, rehearsals, a self-confidence workshop…
And around ten Duchesses have taken the plunge, developed a taste for it, and some of them now speak regularly in France and abroad.

 

Several conferences have opened their Call for Papers, that is, their call for speakers. And there is something for every taste: Google technologies, web, Android, Java, agility, innovation…
Moreover, several talk formats are possible: quickies (15-20 min), full talks (40-50 min), workshops…

 

Afraid to take the plunge?

 


 

"Yes… but I am not an expert…"
"And I don't know what to talk about…"

 

In her interview, Pauline talks about her experience and her apprehensions, and gives some advice.
Several discussions on the topic have also taken place on our internal forum, with more advice.

 

Hmm… and what about impostor syndrome?

 


 

Some people are convinced that they do not deserve their success, despite the efforts they put in to succeed. As a result they live with a constant feeling of being a fraud and continually fear that someone will discover their supposed deception. This is what is called impostor syndrome.

Even though anyone can be affected by impostor syndrome, it seems that women are affected even more. And the lack of female role models does not necessarily help.

 

Exchange and share, above all!

Giving a talk at a conference is first and foremost about wanting to share knowledge or an experience about a technology or about ways of working. It is not about knowing absolutely every detail in depth: some talks are aimed at beginners, others at intermediate or experienced audiences. It is up to you to pick your audience and adapt your talk and the messages you want to get across.

The Q&A part: a good way to improve your talk!

As for the questions, this is a moment that often scares people: quite simply, you cannot know everything. Some people will suggest new avenues to explore and see things from a different angle, or ask questions you will not have the answers to.

In fact, the questions can be seen as a way to improve your talk: "Ah, that part was perhaps not detailed enough, I will add some slides next time", "well, I hadn't thought of mentioning that point, that's interesting!".

OK, I'm taking the plunge

We have several threads ([1], [2]) on our Duchess Google group where you can:
– share your proposals, fears, abstract and bio ideas
– get reviews, advice and suggestions for improvement
– find someone to give your talk with you!
All of this can be done remotely via Google Docs/Skype/Hangouts, or in person if several people are motivated to organize a small working group.
Do not hesitate to join us there!

 


 

Some ongoing Call for Papers

Devoxx Antwerp: CFP closes 30 June
Soft-Shake (Geneva): CFP closes 31 July
Codeurs en Seine (Rouen): CFP closes 31 July
JugSummerCamp (La Rochelle): CFP closes 3 July
bdx.io (Bordeaux): CFP closes 31 July

 

Leave us a message if you would like us to add a CFP.

Analyze accelerometer data with Apache Spark and MLlib

Over the past few months I have grown interested in Apache Spark, machine learning and time series, and I thought I would play around with them.

In this post I will explain how to predict a user's physical activity (such as walking, jogging or sitting) using Spark, the Spark-Cassandra connector and MLlib.

The entire code and data sets are available on my github account.

This post is inspired by the WISDM Lab's study, and the (uncleaned) data comes from here.

 


A FEW WORDS ABOUT APACHE SPARK & CASSANDRA

Apache Spark started as a research project at the University of California, Berkeley in 2009, and it is an open source project written mostly in Scala. In a nutshell, Apache Spark is a fast and general engine for large-scale data processing.
Spark's main property is in-memory processing, but you can also process data on disk, and it can be fully integrated with Hadoop to process data from HDFS. Spark provides three main APIs, in Java, Scala and Python. In this post I chose the Java API.
Spark offers an abstraction called resilient distributed datasets (RDDs), which are immutable and lazy data collections partitioned across the nodes of a cluster.

MLlib is a standard component of Spark providing machine learning primitives on top of Spark. It contains common algorithms (regression, classification, recommendation, optimization, clustering…), as well as basic statistics and feature extraction functions.

If you want to take a better look at Apache Spark and its ecosystem, just check out the Apache Spark web site and its documentation.

Finally, the Spark-Cassandra connector lets you expose Cassandra tables as Spark RDDs, persist Spark RDDs into Cassandra tables, and execute arbitrary CQL queries within your Spark applications.

AN EXAMPLE: USER’S PHYSICAL ACTIVITY RECOGNITION

The availability of acceleration sensors creates exciting new opportunities for data mining and predictive analytics applications. In this post, I will consider data from accelerometers to perform activity recognition.

The data in my github account is already cleaned.
The data comes from 37 different users. Each user recorded the activity they were performing. This is why some of the data is not relevant and needs to be cleaned: some rows in the original file are empty, and some others are misrecorded.

DATA DESCRIPTION

I used labeled accelerometer data from users, collected with a device in their pocket during different activities (walking, sitting, jogging, ascending stairs, descending stairs, and standing).

The accelerometer measures acceleration in all three spatial dimensions as follows:

  • Z-axis captures the forward movement of the leg
  • Y-axis captures the upward and downward movement of the leg
  • X-axis captures the horizontal movement of the leg

The plots below show the characteristics of each activity. Because of the periodicity of these activities, a window of a few seconds is sufficient to find characteristics specific to each activity.

(plots: acceleration over time for walking & jogging, ascending & descending stairs, and standing & sitting)

We observe repeating waves and peaks for the repetitive activities: walking, jogging, ascending stairs and descending stairs. The ascending stairs and descending stairs activities are very similar. There is no periodic behavior for the more static activities such as standing or sitting, but their amplitudes differ.

 

DATA INTO CASSANDRA

I pushed my data into Cassandra using the CQL shell.

https://gist.github.com/nivdul/88d1dbb944f75c8bf612

Because I need to group my data by (user_id, activity) and then sort it by timestamp, I decided to define ((user_id, activity), timestamp) as the primary key.

Just below is an example of what my data looks like.


And now, here is how to retrieve the data from Cassandra with the Spark-Cassandra connector:

 

https://gist.github.com/nivdul/b5a3654488886cd36dc5
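
The gist above holds the post's Java version; a rough Scala equivalent with the connector could look like the sketch below. The keyspace, table and column names are assumptions, and the SparkContext is assumed to be configured with spark.cassandra.connection.host:

import com.datastax.spark.connector._

// returns an RDD[CassandraRow]; keyspace/table/column names are made up for illustration
val accelerations = sc.cassandraTable("accelerometer", "acceleration")
  .select("user_id", "activity", "timestamp", "x", "y", "z")
  .where("user_id = ? and activity = ?", 1, "Walking")

accelerations.take(5).foreach(println)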

 

PREPARE MY DATA

As you can imagine, my data was not clean, and I needed to prepare it in order to extract my features from it. This is certainly the most time-consuming part of the work, but also the most exciting one for me.

My data is contained in a CSV file, and it was acquired over different, sequential days. So I needed to define the different recording intervals for each user and each activity. From these intervals, I extracted the windows on which I computed my features.

Here is a diagram to explain what I did and the code.

 

(diagram: defining the recording intervals and extracting windows for each (user, activity))

 

First, retrieve the data for each (user, activity), sorted by timestamp.

 

https://gist.github.com/nivdul/6424b9b21745d8718036

 

Then, search for the jumps between records in order to define the recording intervals and the number of windows per interval.

https://gist.github.com/nivdul/84b324f883dc86991332

DETERMINE AND COMPUTE FEATURES FOR THE MODEL

Each of these activities demonstrates characteristics that we will use to define the features of the model. For example, the plot for walking shows a series of high peaks on the y-axis, spaced at intervals of approximately 0.5 seconds, while for jogging the interval is closer to 0.25 seconds. We also notice that the range of the y-axis acceleration for jogging is greater than for walking, and so on. This analysis step is essential and (again) takes time, in order to determine the best features to use for our model.

After several tests with different feature combinations, the ones I chose are described below (they are basic statistics):

  • Average acceleration (for each axis)
  • Variance (for each axis)
  • Average absolute difference (for each axis)
  • Average resultant acceleration (1/n * sum [√(x² + y² + z²)])
  • Average time between peaks (max) (Y-axis)

 

FEATURES COMPUTATION USING SPARK AND MLLIB

Now let’s compute the features to build the predictive model!

AVERAGE ACCELERATION AND VARIANCE

https://gist.github.com/nivdul/0ff01e13ba05135df09d
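
The gist above holds the post's Java version; for reference, MLlib's column statistics can compute the per-axis mean and variance of a window, as in this Scala sketch (the sample values are made up):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.stat.Statistics

// one (x, y, z) vector per accelerometer sample of the window
val window = sc.parallelize(Seq(
  Vectors.dense(-0.1, 9.8, 0.3),
  Vectors.dense(0.2, 9.5, 0.1),
  Vectors.dense(0.0, 10.1, -0.2)
))

val summary = Statistics.colStats(window)
val meanAcc = summary.mean          // average acceleration for each axis
val varianceAcc = summary.variance  // variance for each axis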

AVERAGE ABSOLUTE DIFFERENCE

https://gist.github.com/nivdul/1ee82f923991fea93bc6

AVERAGE RESULTANT ACCELERATION

https://gist.github.com/nivdul/666310c767cb6ef97503
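
And a short sketch of the average resultant acceleration, 1/n * sum [√(x² + y² + z²)], computed over the same assumed RDD[Vector] named window as above:

val avgResultantAcc = window
  .map(v => math.sqrt(v(0) * v(0) + v(1) * v(1) + v(2) * v(2)))  // resultant per sample
  .mean()                                                        // average over the window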

AVERAGE TIME BETWEEN PEAKS

https://gist.github.com/nivdul/77225c0efee45a860d30

THE MODEL: DECISION TREES

Just to recap, we want to determine the user’s activity from data where the possible activities are: walking, jogging, sitting, standing, downstairs and upstairs. So it is a classification problem.

Here I chose the Decision Tree implementation from MLlib to create my model and then predict the activity performed by users.

You could also use other algorithms available in MLlib, such as Random Forest or Multinomial Logistic Regression (from Spark 1.3).
Remark: with the chosen features, the predictions for the "upstairs" and "downstairs" activities are pretty bad. One trick would be to define more relevant features to get a better prediction model.

Below is the code that shows how to load our dataset and split it into training and test datasets.

 

https://gist.github.com/nivdul/246dbe803a2345b7bf5b

 

Let's use DecisionTree.trainClassifier to fit our model. After that, the model is evaluated against the test dataset, and an error is calculated to measure the algorithm's accuracy.

https://gist.github.com/nivdul/f380586bfefc39b05f0c
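
The gist above holds the post's Java version; a minimal Scala sketch of the same steps, assuming data is an RDD[LabeledPoint] built from the features (the hyperparameter values are illustrative, not the post's exact settings):

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// `data: RDD[LabeledPoint]` is assumed to hold one labeled feature vector per window
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

val model = DecisionTree.trainClassifier(
  trainingData,
  numClasses = 6,                         // walking, jogging, sitting, standing, upstairs, downstairs
  categoricalFeaturesInfo = Map[Int, Int](),
  impurity = "gini",
  maxDepth = 9,
  maxBins = 32)

val testError = testData
  .map(point => (model.predict(point.features), point.label))
  .filter { case (predicted, actual) => predicted != actual }
  .count().toDouble / testData.count()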

RESULTS

number of classes    mean error (Random Forest)    mean error (Decision Tree)
4 (4902 samples)     1.3%                          1.5%
6 (6100 samples)     13.4%                         13.2%

 

CONCLUSION

In this post we demonstrated how to use Apache Spark and MLlib to predict a user's physical activity.

The feature extraction step is quite long, because you need to test and experiment to find the best possible features. The data preparation is long too, but exciting.

If you find a better way/implementation to prepare the data or compute the features, do not hesitate to send a pull request or open an issue on github.
